Repository: OpenGVLab/VideoChat-Flash
Branch: main
Commit: 2f8e2f578897
Files: 1928
Total size: 31.2 MB
Directory structure:
gitextract_mlwsex56/
├── .gitattributes
├── LICENSE
├── README.md
├── llava-train_videochat/
│ ├── .dockerignore
│ ├── .editorconfig
│ ├── .gitattributes
│ ├── .gitignore
│ ├── LICENSE
│ ├── README.md
│ ├── cog.yaml
│ ├── data/
│ │ ├── ablation_short-long_mix_sft.yaml
│ │ ├── stage1_init_connector_iv1m.yaml
│ │ ├── stage2_short_pretrain_iv6m.yaml
│ │ ├── stage3_short-long_mix_sft.yaml
│ │ └── stage4_highres_postsft.yaml
│ ├── llava/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── dist_utils.py
│ │ ├── mm_utils.py
│ │ ├── model/
│ │ │ ├── __init__.py
│ │ │ ├── apply_delta.py
│ │ │ ├── builder.py
│ │ │ ├── consolidate.py
│ │ │ ├── language_model/
│ │ │ │ ├── llava_qwen.py
│ │ │ │ ├── llava_qwen_flash.py
│ │ │ │ └── modeling_qwen2_flash.py
│ │ │ ├── llava_arch.py
│ │ │ ├── make_delta.py
│ │ │ ├── multimodal_encoder/
│ │ │ │ ├── builder.py
│ │ │ │ ├── clip_encoder.py
│ │ │ │ ├── internvideo2/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── flash_attention_class.py
│ │ │ │ │ ├── pos_embed.py
│ │ │ │ │ └── vit_scale_clean.py
│ │ │ │ ├── internvideo2_encoder.py
│ │ │ │ ├── siglip_encoder.py
│ │ │ │ ├── umt/
│ │ │ │ │ └── vit.py
│ │ │ │ └── umt_encoder.py
│ │ │ ├── multimodal_projector/
│ │ │ │ ├── builder.py
│ │ │ │ └── tome16_mlp_hd64.py
│ │ │ └── utils.py
│ │ ├── serialize_utils.py
│ │ ├── train/
│ │ │ ├── llava_trainer.py
│ │ │ ├── llava_trainer_eval.py
│ │ │ ├── train.py
│ │ │ └── train_mem.py
│ │ ├── utils.py
│ │ └── video_utils.py
│ ├── pyproject.toml
│ ├── requirements.txt
│ └── scripts/
│ ├── train/
│ │ ├── stage1-init_connector/
│ │ │ ├── stage1_internvideo2_tome16_res224_qwen7b.sh
│ │ │ ├── stage1_umt_tome16_res224_qwen7b.sh
│ │ │ └── stage1_umt_tome16_res448_qwen1_5b.sh
│ │ ├── stage2-visual_pretraining/
│ │ │ ├── stage2_internvideo2_tome16_res224_qwen_7b.sh
│ │ │ ├── stage2_umt_tome16_res224_qwen_7b.sh
│ │ │ └── stage2_umt_tome16_res448_qwen_1_5b.sh
│ │ ├── stage3-video_sft/
│ │ │ ├── stage3_internvideo2_tome16_res224_qwen_7b.sh
│ │ │ ├── stage3_umt_tome16_res224_qwen_7b.sh
│ │ │ └── stage3_umt_tome16_res448_qwen_1_5b.sh
│ │ └── stage4_highres_postft/
│ │ └── stage4_umt_tome16_res448_qwen_7b.sh
│ ├── zero1.json
│ ├── zero2.json
│ ├── zero2_fused_adamw.json
│ ├── zero2_offload.json
│ ├── zero3.json
│ ├── zero3_offload.json
│ └── zero3pp.json
├── lmms-eval_videochat/
│ ├── .gitignore
│ ├── .pre-commit-config.yaml
│ ├── LICENSE
│ ├── README.md
│ ├── docs/
│ │ ├── README.md
│ │ ├── commands.md
│ │ ├── current_tasks.md
│ │ ├── model_guide.md
│ │ ├── run_examples.md
│ │ └── task_guide.md
│ ├── eval_annotations/
│ │ ├── LVBench/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │ ├── lvbench_clean.json
│ │ │ ├── lvbench_clean_cartoon.json
│ │ │ ├── lvbench_clean_documentary.json
│ │ │ ├── lvbench_clean_live.json
│ │ │ ├── lvbench_clean_selfmedia.json
│ │ │ ├── lvbench_clean_sport.json
│ │ │ └── lvbench_clean_tv.json
│ │ ├── LongVideoBench/
│ │ │ ├── README.md
│ │ │ ├── lvb_test_wo_gt.json
│ │ │ ├── lvb_val.json
│ │ │ ├── test-00000-of-00001.parquet
│ │ │ └── validation-00000-of-00001.parquet
│ │ ├── MLVU_MC/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │ ├── 1_plotQA.json
│ │ │ ├── 2_needle.json
│ │ │ ├── 3_ego.json
│ │ │ ├── 4_count.json
│ │ │ ├── 5_order.json
│ │ │ ├── 6_anomaly_reco.json
│ │ │ └── 7_topic_reasoning.json
│ │ ├── MVBench/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │ ├── action_antonym.json
│ │ │ ├── action_count.json
│ │ │ ├── action_localization.json
│ │ │ ├── action_prediction.json
│ │ │ ├── action_sequence.json
│ │ │ ├── character_order.json
│ │ │ ├── counterfactual_inference.json
│ │ │ ├── egocentric_navigation.json
│ │ │ ├── episodic_reasoning.json
│ │ │ ├── fine_grained_action.json
│ │ │ ├── fine_grained_pose.json
│ │ │ ├── moving_attribute.json
│ │ │ ├── moving_count.json
│ │ │ ├── moving_direction.json
│ │ │ ├── object_existence.json
│ │ │ ├── object_interaction.json
│ │ │ ├── object_shuffle.json
│ │ │ ├── scene_transition.json
│ │ │ ├── state_change.json
│ │ │ └── unexpected_action.json
│ │ ├── PerceptionTest/
│ │ │ ├── .gitattributes
│ │ │ └── README.md
│ │ ├── Temporal_Grounding/
│ │ │ ├── README.md
│ │ │ └── json/
│ │ │ └── temporal_grounding_charades.json
│ │ └── Video-MME/
│ │ ├── README.md
│ │ └── videomme/
│ │ └── test-00000-of-00001.parquet
│ ├── lmms_eval/
│ │ ├── __init__.py
│ │ ├── __main__.py
│ │ ├── api/
│ │ │ ├── __init__.py
│ │ │ ├── filter.py
│ │ │ ├── instance.py
│ │ │ ├── metrics.py
│ │ │ ├── model.py
│ │ │ ├── registry.py
│ │ │ ├── samplers.py
│ │ │ └── task.py
│ │ ├── evaluator.py
│ │ ├── filters/
│ │ │ ├── __init__.py
│ │ │ ├── decontamination.py
│ │ │ ├── extraction.py
│ │ │ ├── selection.py
│ │ │ └── transformation.py
│ │ ├── logging_utils.py
│ │ ├── models/
│ │ │ ├── __init__.py
│ │ │ └── videochat_flash.py
│ │ ├── tasks/
│ │ │ ├── __init__.py
│ │ │ ├── _task_utils/
│ │ │ │ ├── file_utils.py
│ │ │ │ ├── gpt_eval_utils.py
│ │ │ │ ├── video_loader.py
│ │ │ │ └── vqa_eval_metric.py
│ │ │ ├── longvideobench/
│ │ │ │ ├── longvideobench_test_v.yaml
│ │ │ │ ├── longvideobench_val_i.yaml
│ │ │ │ ├── longvideobench_val_v.yaml
│ │ │ │ └── utils.py
│ │ │ ├── lvbench/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── lvbench.yaml
│ │ │ │ ├── lvbench_cartoon.yaml
│ │ │ │ ├── lvbench_documentary.yaml
│ │ │ │ ├── lvbench_live.yaml
│ │ │ │ ├── lvbench_selfmedia.yaml
│ │ │ │ ├── lvbench_sport.yaml
│ │ │ │ ├── lvbench_tv.yaml
│ │ │ │ └── utils.py
│ │ │ ├── mlvu_mc/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── mlvu_mc.yaml
│ │ │ │ ├── mlvu_mc_anomaly_reco.yaml
│ │ │ │ ├── mlvu_mc_count.yaml
│ │ │ │ ├── mlvu_mc_ego.yaml
│ │ │ │ ├── mlvu_mc_needle.yaml
│ │ │ │ ├── mlvu_mc_order.yaml
│ │ │ │ ├── mlvu_mc_plotqa.yaml
│ │ │ │ ├── mlvu_mc_topic_reasoning.yaml
│ │ │ │ └── utils.py
│ │ │ ├── mvbench/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── mvbench.yaml
│ │ │ │ ├── mvbench_action_antonym.yaml
│ │ │ │ ├── mvbench_action_count.yaml
│ │ │ │ ├── mvbench_action_localization.yaml
│ │ │ │ ├── mvbench_action_prediction.yaml
│ │ │ │ ├── mvbench_action_sequence.yaml
│ │ │ │ ├── mvbench_character_order.yaml
│ │ │ │ ├── mvbench_counterfactual_inference.yaml
│ │ │ │ ├── mvbench_egocentric_navigation.yaml
│ │ │ │ ├── mvbench_episodic_reasoning.yaml
│ │ │ │ ├── mvbench_fine_grained_action.yaml
│ │ │ │ ├── mvbench_fine_grained_pose.yaml
│ │ │ │ ├── mvbench_moving_attribute.yaml
│ │ │ │ ├── mvbench_moving_count.yaml
│ │ │ │ ├── mvbench_moving_direction.yaml
│ │ │ │ ├── mvbench_object_existence.yaml
│ │ │ │ ├── mvbench_object_interaction.yaml
│ │ │ │ ├── mvbench_object_shuffle.yaml
│ │ │ │ ├── mvbench_scene_transition.yaml
│ │ │ │ ├── mvbench_state_change.yaml
│ │ │ │ ├── mvbench_unexpected_action.yaml
│ │ │ │ └── utils.py
│ │ │ ├── perceptiontest/
│ │ │ │ └── val/
│ │ │ │ ├── _default_template_yaml
│ │ │ │ ├── perceptiontest_mc.yaml
│ │ │ │ └── utils.py
│ │ │ ├── temporal_grounding/
│ │ │ │ ├── _default_template.yaml
│ │ │ │ ├── charades.yaml
│ │ │ │ ├── eval_tvg.py
│ │ │ │ └── utils.py
│ │ │ └── videomme/
│ │ │ ├── utils.py
│ │ │ ├── videomme.yaml
│ │ │ └── videomme_w_subtitle.yaml
│ │ └── utils.py
│ ├── pyproject.toml
│ ├── scripts/
│ │ ├── eval_longvideobench.sh
│ │ ├── eval_lvbench.sh
│ │ ├── eval_mlvu.sh
│ │ ├── eval_mvbench.sh
│ │ ├── eval_perceptiontest_val_mc.sh
│ │ ├── eval_temporal_grounding_chardes.sh
│ │ └── eval_videomme.sh
│ ├── setup.py
│ └── videochat-flash-7B@448_eval_log_videomme.json
├── xtuner-eval_niah/
│ ├── README.md
│ ├── llava/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── dist_utils.py
│ │ ├── mm_utils.py
│ │ ├── model/
│ │ │ ├── __init__.py
│ │ │ ├── apply_delta.py
│ │ │ ├── builder.py
│ │ │ ├── consolidate.py
│ │ │ ├── language_model/
│ │ │ │ ├── llava_qwen.py
│ │ │ │ ├── llava_qwen_flash.py
│ │ │ │ └── modeling_qwen2_flash.py
│ │ │ ├── llava_arch.py
│ │ │ ├── make_delta.py
│ │ │ ├── multimodal_encoder/
│ │ │ │ ├── builder.py
│ │ │ │ ├── clip_encoder.py
│ │ │ │ ├── internvideo2/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── flash_attention_class.py
│ │ │ │ │ ├── pos_embed.py
│ │ │ │ │ └── vit_scale_clean.py
│ │ │ │ ├── internvideo2_encoder.py
│ │ │ │ ├── siglip_encoder.py
│ │ │ │ ├── umt/
│ │ │ │ │ └── vit.py
│ │ │ │ └── umt_encoder.py
│ │ │ ├── multimodal_projector/
│ │ │ │ ├── builder.py
│ │ │ │ └── tome16_mlp_hd64.py
│ │ │ └── utils.py
│ │ ├── serialize_utils.py
│ │ ├── train/
│ │ │ ├── llava_trainer.py
│ │ │ ├── llava_trainer_eval.py
│ │ │ ├── train.py
│ │ │ └── train_mem.py
│ │ ├── utils.py
│ │ └── video_utils.py
│ ├── longva/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── mm_utils.py
│ │ ├── model/
│ │ │ ├── __init__.py
│ │ │ ├── apply_delta.py
│ │ │ ├── builder.py
│ │ │ ├── consolidate.py
│ │ │ ├── language_model/
│ │ │ │ ├── llava_llama.py
│ │ │ │ ├── llava_mistral.py
│ │ │ │ ├── llava_mpt.py
│ │ │ │ ├── llava_qwen.py
│ │ │ │ └── modeling_llama.py
│ │ │ ├── llava_arch.py
│ │ │ ├── make_delta.py
│ │ │ ├── multimodal_encoder/
│ │ │ │ ├── builder.py
│ │ │ │ └── clip_encoder.py
│ │ │ ├── multimodal_projector/
│ │ │ │ ├── builder.py
│ │ │ │ └── pooler_projector.py
│ │ │ ├── multimodal_resampler/
│ │ │ │ ├── builder.py
│ │ │ │ ├── masked_drop.py
│ │ │ │ ├── perceiver.py
│ │ │ │ ├── qformer.py
│ │ │ │ └── spatial_pool.py
│ │ │ └── utils.py
│ │ ├── train/
│ │ │ ├── llama_flash_attn_monkey_patch.py
│ │ │ ├── llava_trainer.py
│ │ │ ├── train.py
│ │ │ ├── train_dpo.py
│ │ │ └── train_mem.py
│ │ └── utils.py
│ ├── niah_requirements.txt
│ ├── tmp/
│ │ └── git_placeholder
│ ├── vision_niah/
│ │ ├── data/
│ │ │ ├── haystack_embeddings/
│ │ │ │ └── git_placeholder
│ │ │ ├── haystack_videos/
│ │ │ │ └── git_placeholder
│ │ │ ├── needle_embeddings/
│ │ │ │ └── git_placeholder
│ │ │ └── source_data/
│ │ │ ├── git_placeholder
│ │ │ └── niah-coco-singlehop_20.json
│ │ ├── data_multi/
│ │ │ ├── needle_embeddings/
│ │ │ │ └── git_placeholder
│ │ │ └── source_data/
│ │ │ ├── git_placeholder
│ │ │ └── niah-coco-multihop-100.json
│ │ ├── flash_eval_xtuner_multi.sh
│ │ ├── flash_eval_xtuner_single.sh
│ │ ├── log/
│ │ │ ├── s1/
│ │ │ │ └── git_placeholder
│ │ │ ├── s2/
│ │ │ │ └── git_placeholder
│ │ │ └── s3/
│ │ │ └── git_placeholder
│ │ ├── longva_eval_xtuner_multi.sh
│ │ ├── longva_eval_xtuner_single.sh
│ │ ├── model_weights/
│ │ │ └── git_placeholder
│ │ ├── multi_eval_vision_niah.py
│ │ ├── multi_produce_needle_embedding.py
│ │ ├── niah_output_multi/
│ │ │ └── git_placeholder
│ │ ├── niah_output_single/
│ │ │ └── git_placeholder
│ │ ├── produce_haystack_embedding.py
│ │ ├── single_eval_vision_niah.py
│ │ └── single_produce_needle_embedding.py
│ └── xtuner/
│ ├── __init__.py
│ ├── _lite/
│ │ ├── __init__.py
│ │ ├── accelerate/
│ │ │ ├── __init__.py
│ │ │ ├── dispatches/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── _attention.py
│ │ │ │ ├── _fused/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── layer_norm.py
│ │ │ │ │ ├── rms_norm.py
│ │ │ │ │ └── rotary.py
│ │ │ │ ├── clip.py
│ │ │ │ ├── internlm2.py
│ │ │ │ ├── llama.py
│ │ │ │ └── qwen2.py
│ │ │ ├── generate.py
│ │ │ ├── lora.py
│ │ │ └── packed.py
│ │ ├── auto.py
│ │ ├── chat/
│ │ │ ├── __init__.py
│ │ │ ├── backends/
│ │ │ │ └── __init__.py
│ │ │ ├── messages/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── base.py
│ │ │ │ └── chat.py
│ │ │ └── templates/
│ │ │ ├── __init__.py
│ │ │ ├── chat.py
│ │ │ └── hybrid.py
│ │ ├── datasets/
│ │ │ ├── __init__.py
│ │ │ ├── cache.py
│ │ │ ├── format.py
│ │ │ ├── llava.py
│ │ │ ├── load.py
│ │ │ ├── pretrain.py
│ │ │ ├── text.py
│ │ │ └── tokenize.py
│ │ ├── modelings/
│ │ │ ├── __init__.py
│ │ │ ├── internlm2/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── configuration_internlm2.py
│ │ │ │ └── modeling_internlm2.py
│ │ │ └── llava/
│ │ │ ├── __init__.py
│ │ │ ├── configuration_internlm2.py
│ │ │ ├── configuration_llava.py
│ │ │ ├── modeling_internlm2.py
│ │ │ ├── modeling_llava.py
│ │ │ └── processing_llava.py
│ │ ├── parallel/
│ │ │ ├── __init__.py
│ │ │ ├── comm.py
│ │ │ ├── fsdp/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── checkpointing.py
│ │ │ │ ├── lazy.py
│ │ │ │ ├── precision.py
│ │ │ │ └── wrap.py
│ │ │ ├── logger.py
│ │ │ ├── plans/
│ │ │ │ └── internlm2.py
│ │ │ ├── sampler.py
│ │ │ ├── sequence/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── attention.py
│ │ │ │ ├── data_collate.py
│ │ │ │ ├── ops.py
│ │ │ │ └── reduce_loss.py
│ │ │ └── setup.py
│ │ └── yunchang/
│ │ ├── __init__.py
│ │ ├── comm/
│ │ │ ├── __init__.py
│ │ │ ├── all_to_all.py
│ │ │ └── extract_local.py
│ │ ├── globals.py
│ │ ├── hybrid/
│ │ │ ├── __init__.py
│ │ │ ├── async_attn_layer.py
│ │ │ ├── attn_layer.py
│ │ │ └── utils.py
│ │ ├── ring/
│ │ │ ├── __init__.py
│ │ │ ├── llama3_flash_attn_varlen.py
│ │ │ ├── ring_flash_attn.py
│ │ │ ├── ring_flash_attn_varlen.py
│ │ │ ├── stripe_flash_attn.py
│ │ │ ├── triton_utils.py
│ │ │ ├── utils.py
│ │ │ ├── zigzag_ring_flash_attn.py
│ │ │ └── zigzag_ring_flash_attn_varlen.py
│ │ └── ulysses/
│ │ ├── __init__.py
│ │ └── attn_layer.py
│ ├── apis/
│ │ ├── __init__.py
│ │ ├── datasets/
│ │ │ ├── __init__.py
│ │ │ ├── alpaca.py
│ │ │ ├── arxiv.py
│ │ │ ├── code_alpaca.py
│ │ │ ├── colorist.py
│ │ │ ├── lawyer.py
│ │ │ ├── medical.py
│ │ │ ├── moss_003_sft.py
│ │ │ ├── oasst1.py
│ │ │ ├── open_orca.py
│ │ │ ├── sql.py
│ │ │ ├── tiny_codes.py
│ │ │ └── wizardlm.py
│ │ ├── model.py
│ │ └── training_args.py
│ ├── configs/
│ │ ├── __init__.py
│ │ ├── baichuan/
│ │ │ ├── baichuan2_13b_base/
│ │ │ │ ├── baichuan2_13b_base_qlora_alpaca_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_alpaca_zh_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_code_alpaca_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_colorist_e5.py
│ │ │ │ ├── baichuan2_13b_base_qlora_lawyer_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_oasst1_512_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_oasst1_e3.py
│ │ │ │ ├── baichuan2_13b_base_qlora_open_platypus_e3.py
│ │ │ │ └── baichuan2_13b_base_qlora_sql_e3.py
│ │ │ ├── baichuan2_13b_chat/
│ │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_code_alpaca_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_lawyer_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_oasst1_512_e3.py
│ │ │ │ ├── baichuan2_13b_chat_qlora_oasst1_e3.py
│ │ │ │ └── baichuan2_13b_chat_qlora_open_platypus_e3.py
│ │ │ ├── baichuan2_7b_base/
│ │ │ │ ├── baichuan2_7b_base_qlora_alpaca_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_alpaca_zh_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_code_alpaca_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_colorist_e5.py
│ │ │ │ ├── baichuan2_7b_base_qlora_lawyer_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_oasst1_512_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_oasst1_e3.py
│ │ │ │ ├── baichuan2_7b_base_qlora_open_platypus_e3.py
│ │ │ │ └── baichuan2_7b_base_qlora_sql_e3.py
│ │ │ ├── baichuan2_7b_chat/
│ │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_code_alpaca_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_lawyer_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_oasst1_512_e3.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_oasst1_e3.py
│ │ │ │ └── baichuan2_7b_chat_qlora_open_platypus_e3.py
│ │ │ ├── baichuan_13b_base/
│ │ │ │ ├── baichuan_13b_base_qlora_alpaca_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_alpaca_zh_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_code_alpaca_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_colorist_e5.py
│ │ │ │ ├── baichuan_13b_base_qlora_lawyer_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_medical_e1.py
│ │ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e1.py
│ │ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e2_gpu8.py
│ │ │ │ ├── baichuan_13b_base_qlora_moss_sft_plugins_e1.py
│ │ │ │ ├── baichuan_13b_base_qlora_oasst1_512_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_oasst1_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_open_platypus_e3.py
│ │ │ │ ├── baichuan_13b_base_qlora_openorca_e1.py
│ │ │ │ ├── baichuan_13b_base_qlora_sql_e3.py
│ │ │ │ └── baichuan_13b_base_qlora_tiny_codes_e1.py
│ │ │ ├── baichuan_13b_chat/
│ │ │ │ ├── baichuan_13b_chat_qlora_alpaca_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_code_alpaca_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_colorist_e5.py
│ │ │ │ ├── baichuan_13b_chat_qlora_lawyer_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_medical_e1.py
│ │ │ │ ├── baichuan_13b_chat_qlora_oasst1_512_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_oasst1_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_open_platypus_e3.py
│ │ │ │ ├── baichuan_13b_chat_qlora_openorca_e1.py
│ │ │ │ ├── baichuan_13b_chat_qlora_sql_e3.py
│ │ │ │ └── baichuan_13b_chat_qlora_tiny_codes_e1.py
│ │ │ └── baichuan_7b/
│ │ │ ├── baichuan_7b_qlora_alpaca_e3.py
│ │ │ ├── baichuan_7b_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan_7b_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan_7b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── baichuan_7b_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan_7b_qlora_colorist_e5.py
│ │ │ ├── baichuan_7b_qlora_lawyer_e3.py
│ │ │ ├── baichuan_7b_qlora_medical_e1.py
│ │ │ ├── baichuan_7b_qlora_moss_sft_all_e1.py
│ │ │ ├── baichuan_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ ├── baichuan_7b_qlora_moss_sft_plugins_e1.py
│ │ │ ├── baichuan_7b_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan_7b_qlora_oasst1_e3.py
│ │ │ ├── baichuan_7b_qlora_open_platypus_e3.py
│ │ │ ├── baichuan_7b_qlora_openorca_e1.py
│ │ │ ├── baichuan_7b_qlora_sql_e3.py
│ │ │ └── baichuan_7b_qlora_tiny_codes_e1.py
│ │ ├── chatglm/
│ │ │ ├── chatglm2_6b/
│ │ │ │ ├── chatglm2_6b_qlora_alpaca_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_code_alpaca_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_colorist_e5.py
│ │ │ │ ├── chatglm2_6b_qlora_lawyer_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_medical_e1.py
│ │ │ │ ├── chatglm2_6b_qlora_oasst1_512_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_oasst1_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_open_platypus_e3.py
│ │ │ │ ├── chatglm2_6b_qlora_openorca_e1.py
│ │ │ │ ├── chatglm2_6b_qlora_sql_e3.py
│ │ │ │ └── chatglm2_6b_qlora_tiny_codes_e1.py
│ │ │ ├── chatglm3_6b/
│ │ │ │ ├── chatglm3_6b_qlora_alpaca_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_code_alpaca_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_colorist_e5.py
│ │ │ │ ├── chatglm3_6b_qlora_lawyer_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_medical_e1.py
│ │ │ │ ├── chatglm3_6b_qlora_oasst1_512_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_oasst1_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_open_platypus_e3.py
│ │ │ │ ├── chatglm3_6b_qlora_openorca_e1.py
│ │ │ │ ├── chatglm3_6b_qlora_sql_e3.py
│ │ │ │ └── chatglm3_6b_qlora_tiny_codes_e1.py
│ │ │ └── chatglm3_6b_base/
│ │ │ ├── chatglm3_6b_base_qlora_alpaca_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_alpaca_enzh_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_alpaca_zh_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_arxiv_gentitle_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_code_alpaca_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_colorist_e5.py
│ │ │ ├── chatglm3_6b_base_qlora_lawyer_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_medical_e1.py
│ │ │ ├── chatglm3_6b_base_qlora_oasst1_512_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_oasst1_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_open_platypus_e3.py
│ │ │ ├── chatglm3_6b_base_qlora_openorca_e1.py
│ │ │ ├── chatglm3_6b_base_qlora_sql_e3.py
│ │ │ └── chatglm3_6b_base_qlora_tiny_codes_e1.py
│ │ ├── cohere/
│ │ │ ├── README.md
│ │ │ └── cohere_104b/
│ │ │ └── cohere_100b_128k_sp32.py
│ │ ├── custom_dataset/
│ │ │ ├── pretrain/
│ │ │ │ ├── baichuan/
│ │ │ │ │ ├── baichuan2_13b_base_full_custom_pretrain_e1.py
│ │ │ │ │ └── baichuan2_7b_base_full_custom_pretrain_e1.py
│ │ │ │ ├── chatglm/
│ │ │ │ │ ├── chatglm2_6b_full_custom_pretrain_e1.py
│ │ │ │ │ └── chatglm3_6b_full_custom_pretrain_e1.py
│ │ │ │ ├── deepseek/
│ │ │ │ │ └── deepseek_moe_16b_base_full_custom_pretrain_e1.py
│ │ │ │ ├── gemma/
│ │ │ │ │ ├── gemma_2b_full_custom_pretrain_e1.py
│ │ │ │ │ └── gemma_7b_full_custom_pretrain_e1.py
│ │ │ │ ├── internlm/
│ │ │ │ │ ├── internlm2_1_8b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── internlm2_20b_full_custom_pretrain_e1.py
│ │ │ │ │ └── internlm2_7b_full_custom_pretrain_e1.py
│ │ │ │ ├── llama/
│ │ │ │ │ ├── llama2_70b_full_custom_pretrain_e1.py
│ │ │ │ │ └── llama2_7b_full_custom_pretrain_e1.py
│ │ │ │ ├── mistral/
│ │ │ │ │ └── mistral_7b_full_custom_pretrain_e1.py
│ │ │ │ ├── mixtral/
│ │ │ │ │ └── mixtral_8x7b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen/
│ │ │ │ │ ├── qwen1_5_0_5b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen1_5_14b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen1_5_1_8b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen1_5_4b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen1_5_72b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen1_5_7b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen_1_8b_full_custom_pretrain_e1.py
│ │ │ │ │ ├── qwen_72b_full_custom_pretrain_e1.py
│ │ │ │ │ └── qwen_7b_full_custom_pretrain_e1.py
│ │ │ │ ├── starcoder/
│ │ │ │ │ └── starcoder_full_custom_pretrain_e1.py
│ │ │ │ ├── yi/
│ │ │ │ │ ├── yi_34b_full_custom_pretrain_e1.py
│ │ │ │ │ └── yi_6b_full_custom_pretrain_e1.py
│ │ │ │ └── zephyr/
│ │ │ │ └── zephyr_7b_beta_full_custom_pretrain_e1.py
│ │ │ └── sft/
│ │ │ ├── baichuan/
│ │ │ │ ├── baichuan2_13b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── baichuan2_7b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── baichuan_13b_chat_qlora_custom_sft_e1.py
│ │ │ │ └── baichuan_7b_qlora_custom_sft_e1.py
│ │ │ ├── chatglm/
│ │ │ │ ├── chatglm2_6b_qlora_custom_sft_e1.py
│ │ │ │ └── chatglm3_6b_qlora_custom_sft_e1.py
│ │ │ ├── deepseek/
│ │ │ │ ├── deepseek_moe_16b_chat_qlora_custom_sft_e1.py
│ │ │ │ └── deepseekcoder_6_7b_instruct_qlora_custom_sft_e1.py
│ │ │ ├── gemma/
│ │ │ │ ├── gemma_2b_it_qlora_custom_sft_e1.py
│ │ │ │ ├── gemma_2b_qlora_custom_sft_e1.py
│ │ │ │ ├── gemma_7b_it_qlora_custom_sft_e1.py
│ │ │ │ └── gemma_7b_qlora_custom_sft_e1.py
│ │ │ ├── internlm/
│ │ │ │ ├── internlm2_chat_1_8b_qlora_custom_sft_e1.py
│ │ │ │ ├── internlm2_chat_20b_qlora_custom_sft_e1.py
│ │ │ │ └── internlm2_chat_7b_qlora_custom_sft_e1.py
│ │ │ ├── llama/
│ │ │ │ ├── llama2_70b_qlora_custom_sft_e1.py
│ │ │ │ └── llama2_7b_chat_qlora_custom_sft_e1.py
│ │ │ ├── mistral/
│ │ │ │ └── mistral_7b_full_finetune_custom_sft_e1.py
│ │ │ ├── mixtral/
│ │ │ │ └── mixtral_8x7b_instruct_qlora_custom_sft_e1.py
│ │ │ ├── qwen/
│ │ │ │ ├── qwen1_5_0_5b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen1_5_14b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen1_5_1_8b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen1_5_4b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen1_5_72b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen1_5_7b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen_1_8b_chat_qlora_custom_sft_e1.py
│ │ │ │ ├── qwen_72b_qlora_custom_sft_e1.py
│ │ │ │ └── qwen_7b_chat_qlora_custom_sft_e1.py
│ │ │ ├── starcoder/
│ │ │ │ └── starcoder_qlora_custom_sft_e1.py
│ │ │ ├── yi/
│ │ │ │ ├── yi_34b_qlora_custom_sft_e1.py
│ │ │ │ └── yi_6b_qlora_custom_sft_e1.py
│ │ │ └── zephyr/
│ │ │ └── zephyr_7b_beta_qlora_custom_sft_e1.py
│ │ ├── deepseek/
│ │ │ ├── README.md
│ │ │ ├── deepseek_coder_6_7b_base/
│ │ │ │ └── deepseek_coder_6_7b_base_qlora_code_alpaca_e3.py
│ │ │ ├── deepseek_coder_6_7b_instruct/
│ │ │ │ └── deepseekcoder_6_7b_instruct_qlora_code_alpaca_e3.py
│ │ │ ├── deepseek_moe_16b_base/
│ │ │ │ ├── deepseek_moe_16b_base_full_oasst1_e3.py
│ │ │ │ └── deepseek_moe_16b_base_qlora_oasst1_e3.py
│ │ │ ├── deepseek_moe_16b_chat/
│ │ │ │ ├── deepseek_moe_16b_chat_full_oasst1_e3.py
│ │ │ │ └── deepseek_moe_16b_chat_qlora_oasst1_e3.py
│ │ │ ├── deepseek_v2_chat/
│ │ │ │ └── deepseek_v2_chat_full_alpaca_e3.py
│ │ │ └── deepseek_v2_lite_chat/
│ │ │ ├── deepseek_v2_lite_chat_full_alpaca_e3.py
│ │ │ └── deepseek_v2_lite_chat_full_alpaca_e3_32k_varlen.py
│ │ ├── deepspeed/
│ │ │ ├── deepspeed_zero1.json
│ │ │ ├── deepspeed_zero2.json
│ │ │ ├── deepspeed_zero2_offload.json
│ │ │ ├── deepspeed_zero3.json
│ │ │ └── deepspeed_zero3_offload.json
│ │ ├── dpo/
│ │ │ ├── internlm/
│ │ │ │ ├── internlm2_chat_1_8b_dpo_full.py
│ │ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn.py
│ │ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn_jsonl_dataset.py
│ │ │ │ └── internlm2_chat_7b_dpo_qlora_varlenattn.py
│ │ │ └── llama/
│ │ │ └── llama3_8b_instruct_dpo_qlora_varlenattn.py
│ │ ├── gemma/
│ │ │ ├── gemma_2b/
│ │ │ │ ├── gemma_2b_full_alpaca_e3.py
│ │ │ │ └── gemma_2b_qlora_alpaca_e3.py
│ │ │ ├── gemma_2b_it/
│ │ │ │ ├── gemma_2b_it_full_alpaca_e3.py
│ │ │ │ └── gemma_2b_it_qlora_alpaca_e3.py
│ │ │ ├── gemma_7b/
│ │ │ │ ├── gemma_7b_full_alpaca_e3.py
│ │ │ │ └── gemma_7b_qlora_alpaca_e3.py
│ │ │ └── gemma_7b_it/
│ │ │ ├── gemma_7b_it_full_alpaca_e3.py
│ │ │ └── gemma_7b_it_qlora_alpaca_e3.py
│ │ ├── internlm/
│ │ │ ├── internlm2_1_8b/
│ │ │ │ ├── internlm2_1_8b_full_alpaca_e3.py
│ │ │ │ └── internlm2_1_8b_qlora_alpaca_e3.py
│ │ │ ├── internlm2_20b/
│ │ │ │ ├── internlm2_20b_full_finetune_custom_dataset_e1.py
│ │ │ │ ├── internlm2_20b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm2_20b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── internlm2_20b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm2_20b_qlora_colorist_e5.py
│ │ │ │ ├── internlm2_20b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm2_20b_qlora_msagent_react_e3_gpu8.py
│ │ │ │ ├── internlm2_20b_qlora_oasst1_512_e3.py
│ │ │ │ ├── internlm2_20b_qlora_oasst1_e3.py
│ │ │ │ └── internlm2_20b_qlora_sql_e3.py
│ │ │ ├── internlm2_7b/
│ │ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1.py
│ │ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1_sequence_parallel_4.py
│ │ │ │ ├── internlm2_7b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm2_7b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── internlm2_7b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm2_7b_qlora_colorist_e5.py
│ │ │ │ ├── internlm2_7b_qlora_json_e3.py
│ │ │ │ ├── internlm2_7b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm2_7b_qlora_msagent_react_e3_gpu8.py
│ │ │ │ ├── internlm2_7b_qlora_oasst1_512_e3.py
│ │ │ │ ├── internlm2_7b_qlora_oasst1_e3.py
│ │ │ │ ├── internlm2_7b_qlora_sql_e3.py
│ │ │ │ ├── internlm2_7b_w_internevo_dataset.py
│ │ │ │ ├── internlm2_7b_w_tokenized_dataset.py
│ │ │ │ └── internlm2_7b_w_untokenized_dataset.py
│ │ │ ├── internlm2_chat_1_8b/
│ │ │ │ ├── internlm2_chat_1_8b_full_alpaca_e3.py
│ │ │ │ └── internlm2_chat_1_8b_qlora_alpaca_e3.py
│ │ │ ├── internlm2_chat_20b/
│ │ │ │ ├── internlm2_chat_20b_full_finetune_custom_dataset_e1.py
│ │ │ │ ├── internlm2_chat_20b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm2_chat_20b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm2_chat_20b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm2_chat_20b_qlora_oasst1_512_e3.py
│ │ │ │ └── internlm2_chat_20b_qlora_oasst1_e3.py
│ │ │ ├── internlm2_chat_7b/
│ │ │ │ ├── internlm2_chat_7b_full_finetune_custom_dataset_e1.py
│ │ │ │ ├── internlm2_chat_7b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm2_chat_7b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm2_chat_7b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm2_chat_7b_qlora_oasst1_512_e3.py
│ │ │ │ └── internlm2_chat_7b_qlora_oasst1_e3.py
│ │ │ ├── internlm_20b/
│ │ │ │ ├── internlm_20b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm_20b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── internlm_20b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── internlm_20b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── internlm_20b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── internlm_20b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm_20b_qlora_colorist_e5.py
│ │ │ │ ├── internlm_20b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm_20b_qlora_msagent_react_e3_gpu8.py
│ │ │ │ ├── internlm_20b_qlora_oasst1_512_e3.py
│ │ │ │ ├── internlm_20b_qlora_oasst1_e3.py
│ │ │ │ ├── internlm_20b_qlora_open_platypus_e3.py
│ │ │ │ └── internlm_20b_qlora_sql_e3.py
│ │ │ ├── internlm_7b/
│ │ │ │ ├── internlm_7b_full_alpaca_e3.py
│ │ │ │ ├── internlm_7b_full_alpaca_enzh_e3.py
│ │ │ │ ├── internlm_7b_full_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── internlm_7b_full_alpaca_zh_e3.py
│ │ │ │ ├── internlm_7b_full_intern_repo_dataset_template.py
│ │ │ │ ├── internlm_7b_full_oasst1_e3.py
│ │ │ │ ├── internlm_7b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm_7b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── internlm_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── internlm_7b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── internlm_7b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── internlm_7b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm_7b_qlora_colorist_e5.py
│ │ │ │ ├── internlm_7b_qlora_json_e3.py
│ │ │ │ ├── internlm_7b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm_7b_qlora_medical_e1.py
│ │ │ │ ├── internlm_7b_qlora_moss_sft_all_e1.py
│ │ │ │ ├── internlm_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ │ ├── internlm_7b_qlora_moss_sft_plugins_e1.py
│ │ │ │ ├── internlm_7b_qlora_msagent_react_e3_gpu8.py
│ │ │ │ ├── internlm_7b_qlora_oasst1_512_e3.py
│ │ │ │ ├── internlm_7b_qlora_oasst1_e3.py
│ │ │ │ ├── internlm_7b_qlora_oasst1_e3_hf.py
│ │ │ │ ├── internlm_7b_qlora_oasst1_mmlu_e3.py
│ │ │ │ ├── internlm_7b_qlora_open_platypus_e3.py
│ │ │ │ ├── internlm_7b_qlora_openorca_e1.py
│ │ │ │ ├── internlm_7b_qlora_sql_e3.py
│ │ │ │ └── internlm_7b_qlora_tiny_codes_e1.py
│ │ │ ├── internlm_chat_20b/
│ │ │ │ ├── internlm_chat_20b_qlora_alpaca_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_code_alpaca_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_lawyer_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_oasst1_512_e3.py
│ │ │ │ ├── internlm_chat_20b_qlora_oasst1_e3.py
│ │ │ │ └── internlm_chat_20b_qlora_open_platypus_e3.py
│ │ │ └── internlm_chat_7b/
│ │ │ ├── internlm_chat_7b_qlora_alpaca_e3.py
│ │ │ ├── internlm_chat_7b_qlora_alpaca_enzh_e3.py
│ │ │ ├── internlm_chat_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── internlm_chat_7b_qlora_alpaca_zh_e3.py
│ │ │ ├── internlm_chat_7b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── internlm_chat_7b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm_chat_7b_qlora_colorist_e5.py
│ │ │ ├── internlm_chat_7b_qlora_lawyer_e3.py
│ │ │ ├── internlm_chat_7b_qlora_medical_e1.py
│ │ │ ├── internlm_chat_7b_qlora_oasst1_512_e3.py
│ │ │ ├── internlm_chat_7b_qlora_oasst1_e3.py
│ │ │ ├── internlm_chat_7b_qlora_open_platypus_e3.py
│ │ │ ├── internlm_chat_7b_qlora_openorca_e1.py
│ │ │ ├── internlm_chat_7b_qlora_sql_e3.py
│ │ │ └── internlm_chat_7b_qlora_tiny_codes_e1.py
│ │ ├── llama/
│ │ │ ├── llama2_70b/
│ │ │ │ ├── llama2_70b_full_wizardlm_e1.py
│ │ │ │ ├── llama2_70b_int8_lora_open_platypus_e1.py
│ │ │ │ ├── llama2_70b_int8_lora_open_platypus_e1_hf.py
│ │ │ │ ├── llama2_70b_qlora_open_platypus_e1.py
│ │ │ │ └── llama2_70b_qlora_open_platypus_e1_hf.py
│ │ │ ├── llama2_7b/
│ │ │ │ ├── llama2_7b_full_pgbooks_400iters_sp1.py
│ │ │ │ ├── llama2_7b_full_pgbooks_400iters_sp4.py
│ │ │ │ ├── llama2_7b_full_wizardlm_e1.py
│ │ │ │ ├── llama2_7b_qlora_alpaca_e3.py
│ │ │ │ ├── llama2_7b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── llama2_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── llama2_7b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── llama2_7b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── llama2_7b_qlora_code_alpaca_e3.py
│ │ │ │ ├── llama2_7b_qlora_colorist_e5.py
│ │ │ │ ├── llama2_7b_qlora_lawyer_e3.py
│ │ │ │ ├── llama2_7b_qlora_medical_e1.py
│ │ │ │ ├── llama2_7b_qlora_moss_sft_all_e1.py
│ │ │ │ ├── llama2_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ │ ├── llama2_7b_qlora_moss_sft_plugins_e1.py
│ │ │ │ ├── llama2_7b_qlora_msagent_react_e3_gpu8.py
│ │ │ │ ├── llama2_7b_qlora_oasst1_512_e3.py
│ │ │ │ ├── llama2_7b_qlora_oasst1_e3.py
│ │ │ │ ├── llama2_7b_qlora_open_platypus_e3.py
│ │ │ │ ├── llama2_7b_qlora_openorca_e1.py
│ │ │ │ ├── llama2_7b_qlora_sql_e3.py
│ │ │ │ └── llama2_7b_qlora_tiny_codes_e1.py
│ │ │ ├── llama2_7b_chat/
│ │ │ │ ├── llama2_7b_chat_qlora_alpaca_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_code_alpaca_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_colorist_e5.py
│ │ │ │ ├── llama2_7b_chat_qlora_lawyer_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_medical_e1.py
│ │ │ │ ├── llama2_7b_chat_qlora_oasst1_512_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_oasst1_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_open_platypus_e3.py
│ │ │ │ ├── llama2_7b_chat_qlora_openorca_e1.py
│ │ │ │ ├── llama2_7b_chat_qlora_sql_e3.py
│ │ │ │ └── llama2_7b_chat_qlora_tiny_codes_e1.py
│ │ │ ├── llama3_70b_instruct/
│ │ │ │ └── llama3_70b_instruct_qlora_alpaca_e3_2k_gpu8.py
│ │ │ ├── llama3_8b/
│ │ │ │ ├── README.md
│ │ │ │ └── llama3_8b_full_alpaca_e3.py
│ │ │ ├── llama3_8b_instruct/
│ │ │ │ ├── llama3_8b_instruct_full_alpaca_e3.py
│ │ │ │ └── llama3_8b_instruct_qlora_alpaca_e3.py
│ │ │ └── llama_7b/
│ │ │ ├── llama_7b_qlora_alpaca_e3.py
│ │ │ ├── llama_7b_qlora_alpaca_enzh_e3.py
│ │ │ ├── llama_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── llama_7b_qlora_alpaca_zh_e3.py
│ │ │ ├── llama_7b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── llama_7b_qlora_code_alpaca_e3.py
│ │ │ ├── llama_7b_qlora_colorist_e5.py
│ │ │ ├── llama_7b_qlora_lawyer_e3.py
│ │ │ ├── llama_7b_qlora_medical_e1.py
│ │ │ ├── llama_7b_qlora_moss_sft_all_e1.py
│ │ │ ├── llama_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ ├── llama_7b_qlora_moss_sft_plugins_e1.py
│ │ │ ├── llama_7b_qlora_oasst1_512_e3.py
│ │ │ ├── llama_7b_qlora_oasst1_e3.py
│ │ │ ├── llama_7b_qlora_open_platypus_e3.py
│ │ │ ├── llama_7b_qlora_openorca_e1.py
│ │ │ ├── llama_7b_qlora_sql_e3.py
│ │ │ └── llama_7b_qlora_tiny_codes_e1.py
│ │ ├── llama_speed_benchmark/
│ │ │ ├── llama2_70b/
│ │ │ │ ├── llama2_70b_full_alpaca_enzh_128k_sp8.py
│ │ │ │ ├── llama2_70b_full_alpaca_enzh_256k_sp16.py
│ │ │ │ ├── llama2_70b_full_alpaca_enzh_32k_sp4.py
│ │ │ │ └── llama2_70b_full_alpaca_enzh_8k_sp1.py
│ │ │ ├── llama2_7b/
│ │ │ │ ├── llama2_7b_full_alpaca_enzh_128k_sp8.py
│ │ │ │ ├── llama2_7b_full_alpaca_enzh_1M_sp16.py
│ │ │ │ ├── llama2_7b_full_alpaca_enzh_256k_sp8.py
│ │ │ │ ├── llama2_7b_full_alpaca_enzh_32k_sp1.py
│ │ │ │ └── llama2_7b_full_alpaca_enzh_8k_sp1.py
│ │ │ └── yi_34b/
│ │ │ ├── yi_34b_200k_full_alpaca_enzh_128k_sp8.py
│ │ │ ├── yi_34b_200k_full_alpaca_enzh_256k_sp8.py
│ │ │ ├── yi_34b_200k_full_alpaca_enzh_32k_sp2.py
│ │ │ └── yi_34b_200k_full_alpaca_enzh_8k_sp1.py
│ │ ├── llava/
│ │ │ ├── README.md
│ │ │ ├── README_zh-CN.md
│ │ │ ├── internlm2_chat_1_8b_clip_vit_large_p14_336/
│ │ │ │ ├── finetune/
│ │ │ │ │ └── llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ └── llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ ├── internlm2_chat_20b_clip_vit_large_p14_336/
│ │ │ │ ├── finetune/
│ │ │ │ │ ├── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ │ └── llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ └── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ ├── internlm2_chat_7b_clip_vit_large_p14_336/
│ │ │ │ ├── finetune/
│ │ │ │ │ ├── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ │ └── llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ └── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ ├── internlm_chat_7b_clip_vit_large_p14_336/
│ │ │ │ ├── finetune/
│ │ │ │ │ └── llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ └── llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ ├── llama3_70b_instruct_clip_vit_large_p14_336/
│ │ │ │ └── pretrain/
│ │ │ │ └── llava_llama3_70b_instruct_quant_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ ├── llama3_8b_instruct_clip_vit_large_p14_336/
│ │ │ │ ├── README.md
│ │ │ │ ├── convert_xtuner_weights_to_hf.py
│ │ │ │ ├── convert_xtuner_weights_to_llava.py
│ │ │ │ ├── finetune/
│ │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py
│ │ │ │ │ └── llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ │ ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
│ │ │ │ └── llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain.py
│ │ │ ├── official/
│ │ │ │ ├── llava_v15_13b/
│ │ │ │ │ ├── llava_v15_13b_finetune.py
│ │ │ │ │ ├── llava_v15_13b_finetune_lora.py
│ │ │ │ │ └── llava_v15_13b_pretrain.py
│ │ │ │ └── llava_v15_7b/
│ │ │ │ ├── llava_v15_7b_finetune.py
│ │ │ │ ├── llava_v15_7b_finetune_lora.py
│ │ │ │ └── llava_v15_7b_pretrain.py
│ │ │ ├── phi3_mini_4k_instruct_clip_vit_large_p14_336/
│ │ │ │ ├── README.md
│ │ │ │ ├── convert_phi_to_llama.py
│ │ │ │ ├── convert_xtuner_weights_to_hf.py
│ │ │ │ ├── convert_xtuner_weights_to_llava.py
│ │ │ │ ├── finetune/
│ │ │ │ │ ├── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ │ └── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ ├── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ │ └── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
│ │ │ ├── vicuna_13b_v15_clip_vit_large_p14_336/
│ │ │ │ ├── finetune/
│ │ │ │ │ └── llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ └── pretrain/
│ │ │ │ └── llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ └── vicuna_7b_v15_clip_vit_large_p14_336/
│ │ │ ├── finetune/
│ │ │ │ ├── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ └── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_refcoco.py
│ │ │ └── pretrain/
│ │ │ └── llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ ├── mistral/
│ │ │ ├── mistral_7b_full_finetune_custom_dataset_e1.py
│ │ │ ├── mistral_7b_qlora_skypile_pretrain_e1.py
│ │ │ ├── mistral_7b_w_tokenized_dataset.py
│ │ │ └── mistral_7b_w_untokenized_dataset.py
│ │ ├── mixtral/
│ │ │ ├── README.md
│ │ │ ├── mixtral_8x7b/
│ │ │ │ ├── mixtral_8x7b_full_oasst1_e3.py
│ │ │ │ └── mixtral_8x7b_qlora_oasst1_e3.py
│ │ │ └── mixtral_8x7b_instruct/
│ │ │ ├── mixtral_8x7b_instruct_full_oasst1_e3.py
│ │ │ └── mixtral_8x7b_instruct_qlora_oasst1_e3.py
│ │ ├── orpo/
│ │ │ ├── internlm/
│ │ │ │ ├── internlm2_chat_1_8b_orpo_full.py
│ │ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn.py
│ │ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn_jsonl_dataset.py
│ │ │ │ └── internlm2_chat_7b_orpo_qlora_varlenattn_ultrafeedback_e5.py
│ │ │ └── llama/
│ │ │ └── llama3_8b_instruct_orpo_qlora_varlenattn_ultrafeedback_e5.py
│ │ ├── phi/
│ │ │ └── phi3/
│ │ │ ├── phi3_mini_128k_instruct_full_alpaca_e3.py
│ │ │ ├── phi3_mini_128k_instruct_qlora_alpaca_e3.py
│ │ │ ├── phi3_mini_4k_instruct_full_alpaca_e3.py
│ │ │ └── phi3_mini_4k_instruct_qlora_alpaca_e3.py
│ │ ├── qwen/
│ │ │ ├── qwen1/
│ │ │ │ ├── qwen_1_8b/
│ │ │ │ │ ├── qwen_1_8b_qlora_alpaca_e3.py
│ │ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_e3.py
│ │ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ │ ├── qwen_1_8b_qlora_alpaca_zh_e3.py
│ │ │ │ │ └── qwen_1_8b_qlora_code_alpaca_e3.py
│ │ │ │ ├── qwen_1_8b_chat/
│ │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_e3.py
│ │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ │ └── qwen_1_8b_chat_qlora_code_alpaca_e3.py
│ │ │ │ ├── qwen_72b/
│ │ │ │ │ ├── qwen_72b_qlora_alpaca_e3.py
│ │ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_e3.py
│ │ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ │ ├── qwen_72b_qlora_alpaca_zh_e3.py
│ │ │ │ │ └── qwen_72b_qlora_code_alpaca_e3.py
│ │ │ │ ├── qwen_7b/
│ │ │ │ │ ├── qwen_7b_qlora_alpaca_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_alpaca_zh_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_arxiv_gentitle_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_code_alpaca_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_colorist_e5.py
│ │ │ │ │ ├── qwen_7b_qlora_lawyer_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_medical_e1.py
│ │ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e1.py
│ │ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ │ │ ├── qwen_7b_qlora_moss_sft_plugins_e1.py
│ │ │ │ │ ├── qwen_7b_qlora_oasst1_512_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_oasst1_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_open_platypus_e3.py
│ │ │ │ │ ├── qwen_7b_qlora_openorca_e1.py
│ │ │ │ │ ├── qwen_7b_qlora_sql_e3.py
│ │ │ │ │ └── qwen_7b_qlora_tiny_codes_e1.py
│ │ │ │ └── qwen_7b_chat/
│ │ │ │ ├── qwen_7b_chat_qlora_alpaca_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_code_alpaca_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_colorist_e5.py
│ │ │ │ ├── qwen_7b_chat_qlora_lawyer_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_medical_e1.py
│ │ │ │ ├── qwen_7b_chat_qlora_oasst1_512_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_oasst1_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_open_platypus_e3.py
│ │ │ │ ├── qwen_7b_chat_qlora_openorca_e1.py
│ │ │ │ ├── qwen_7b_chat_qlora_sql_e3.py
│ │ │ │ └── qwen_7b_chat_qlora_tiny_codes_e1.py
│ │ │ └── qwen1_5/
│ │ │ ├── qwen1_5_0_5b/
│ │ │ │ ├── qwen1_5_0_5b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_0_5b_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_0_5b_chat/
│ │ │ │ ├── qwen1_5_0_5b_chat_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_0_5b_chat_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_110b/
│ │ │ │ ├── qwen1_5_110b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_110b_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_110b_chat/
│ │ │ │ ├── README.md
│ │ │ │ ├── qwen1_5_110b_chat_full_alpaca_e3.py
│ │ │ │ ├── qwen1_5_110b_chat_qlora_alpaca_e3.py
│ │ │ │ └── qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py
│ │ │ ├── qwen1_5_14b/
│ │ │ │ ├── qwen1_5_14b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_14b_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_14b_chat/
│ │ │ │ ├── qwen1_5_14b_chat_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_14b_chat_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_1_8b/
│ │ │ │ ├── qwen1_5_1_8b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_1_8b_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_1_8b_chat/
│ │ │ │ ├── qwen1_5_1_8b_chat_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_1_8b_chat_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_4b/
│ │ │ │ ├── qwen1_5_4b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_4b_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_4b_chat/
│ │ │ │ ├── qwen1_5_4b_chat_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_4b_chat_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_72b/
│ │ │ │ ├── qwen1_5_72b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_72b_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_72b_chat/
│ │ │ │ ├── qwen1_5_72b_chat_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_72b_chat_qlora_alpaca_e3.py
│ │ │ ├── qwen1_5_7b/
│ │ │ │ ├── qwen1_5_7b_full_alpaca_e3.py
│ │ │ │ └── qwen1_5_7b_qlora_alpaca_e3.py
│ │ │ └── qwen1_5_7b_chat/
│ │ │ ├── qwen1_5_7b_chat_full_alpaca_e3.py
│ │ │ └── qwen1_5_7b_chat_qlora_alpaca_e3.py
│ │ ├── qwen_moe/
│ │ │ └── qwen1_5/
│ │ │ └── qwen1_5_moe_a2_7_b_chat/
│ │ │ └── qwen1_5_moe_a2_7_b_chat_full_alpaca_e3.py
│ │ ├── reward_model/
│ │ │ ├── internlm/
│ │ │ │ ├── internlm2_chat_1_8b_reward_full_ultrafeedback.py
│ │ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_jsonl_dataset.py
│ │ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_ultrafeedback.py
│ │ │ │ └── internlm2_chat_1_8b_reward_qlora_varlenattn_ultrafeedback.py
│ │ │ └── llama/
│ │ │ └── llama3_8b_instruct_reward_full_varlenattn_ultrafeedback.py
│ │ ├── starcoder/
│ │ │ └── starcoder_qlora_stack_exchange_example.py
│ │ ├── yi/
│ │ │ ├── yi_34b/
│ │ │ │ └── yi_34b_qlora_alpaca_enzh_e3.py
│ │ │ └── yi_6b/
│ │ │ └── yi_6b_qlora_alpaca_enzh_e3.py
│ │ └── zephyr/
│ │ └── zephyr_7b_beta_qlora_alpaca_e3.py
│ ├── dataset/
│ │ ├── __init__.py
│ │ ├── collate_fns/
│ │ │ ├── __init__.py
│ │ │ ├── default_collate_fn.py
│ │ │ ├── mmlu_collate_fn.py
│ │ │ └── preference_collate_fn.py
│ │ ├── concat_dataset.py
│ │ ├── huggingface.py
│ │ ├── intern_repo.py
│ │ ├── json_dataset.py
│ │ ├── llava.py
│ │ ├── map_fns/
│ │ │ ├── __init__.py
│ │ │ ├── dataset_map_fns/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── alpaca_map_fn.py
│ │ │ │ ├── alpaca_zh_map_fn.py
│ │ │ │ ├── arxiv_map_fn.py
│ │ │ │ ├── code_alpaca_map_fn.py
│ │ │ │ ├── colors_map_fn.py
│ │ │ │ ├── crime_kg_assitant_map_fn.py
│ │ │ │ ├── default_map_fn.py
│ │ │ │ ├── law_reference_map_fn.py
│ │ │ │ ├── llava_map_fn.py
│ │ │ │ ├── medical_map_fn.py
│ │ │ │ ├── msagent_map_fn.py
│ │ │ │ ├── oasst1_map_fn.py
│ │ │ │ ├── openai_map_fn.py
│ │ │ │ ├── openorca_map_fn.py
│ │ │ │ ├── pretrain_map_fn.py
│ │ │ │ ├── sql_map_fn.py
│ │ │ │ ├── stack_exchange_map_fn.py
│ │ │ │ ├── tiny_codes_map_fn.py
│ │ │ │ └── wizardlm_map_fn.py
│ │ │ └── template_map_fn.py
│ │ ├── modelscope.py
│ │ ├── moss_sft.py
│ │ ├── preference_dataset.py
│ │ ├── refcoco_json.py
│ │ ├── samplers/
│ │ │ ├── __init__.py
│ │ │ ├── intern_repo.py
│ │ │ └── length_grouped.py
│ │ └── utils.py
│ ├── engine/
│ │ ├── __init__.py
│ │ ├── _strategy/
│ │ │ ├── __init__.py
│ │ │ └── deepspeed.py
│ │ ├── hooks/
│ │ │ ├── __init__.py
│ │ │ ├── dataset_info_hook.py
│ │ │ ├── evaluate_chat_hook.py
│ │ │ ├── hf_checkpoint_hook.py
│ │ │ ├── throughput_hook.py
│ │ │ └── varlen_attn_args_to_messagehub_hook.py
│ │ └── runner/
│ │ ├── __init__.py
│ │ └── loops.py
│ ├── entry_point.py
│ ├── evaluation/
│ │ ├── __init__.py
│ │ └── metrics/
│ │ ├── __init__.py
│ │ ├── mmlu_metric.py
│ │ └── reward_metric.py
│ ├── model/
│ │ ├── __init__.py
│ │ ├── dpo.py
│ │ ├── llava.py
│ │ ├── modules/
│ │ │ ├── __init__.py
│ │ │ ├── dispatch/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── attention.py
│ │ │ │ ├── baichuan.py
│ │ │ │ ├── cohere.py
│ │ │ │ ├── deepseek_v2.py
│ │ │ │ ├── internlm.py
│ │ │ │ ├── internlm2.py
│ │ │ │ ├── llama.py
│ │ │ │ ├── mistral.py
│ │ │ │ ├── phi3.py
│ │ │ │ ├── qwen2.py
│ │ │ │ ├── triton_kernels/
│ │ │ │ │ ├── __init__.py
│ │ │ │ │ ├── layer_norm.py
│ │ │ │ │ ├── rms_norm.py
│ │ │ │ │ └── rotary.py
│ │ │ │ ├── utils.py
│ │ │ │ └── yi.py
│ │ │ └── projector/
│ │ │ ├── __init__.py
│ │ │ ├── configuration_projector.py
│ │ │ └── modeling_projector.py
│ │ ├── orpo.py
│ │ ├── reward.py
│ │ ├── sft.py
│ │ ├── transformers_models/
│ │ │ ├── __init__.py
│ │ │ ├── deepseek_v2/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── configuration_deepseek.py
│ │ │ │ ├── modeling_deepseek.py
│ │ │ │ └── tokenization_deepseek_fast.py
│ │ │ └── mixtral/
│ │ │ ├── __init__.py
│ │ │ ├── configuration_mixtral.py
│ │ │ └── modeling_mixtral.py
│ │ └── utils.py
│ ├── parallel/
│ │ ├── __init__.py
│ │ └── sequence/
│ │ ├── __init__.py
│ │ ├── attention.py
│ │ ├── comm.py
│ │ ├── data_collate.py
│ │ ├── reduce_loss.py
│ │ ├── sampler.py
│ │ └── setup_distributed.py
│ ├── registry.py
│ ├── tools/
│ │ ├── chat.py
│ │ ├── check_custom_dataset.py
│ │ ├── copy_cfg.py
│ │ ├── data_preprocess/
│ │ │ ├── arxiv.py
│ │ │ └── convert_refcoco.py
│ │ ├── eval_refcoco.py
│ │ ├── get_data_order.py
│ │ ├── list_cfg.py
│ │ ├── list_dataset_format.py
│ │ ├── log_dataset.py
│ │ ├── mmbench.py
│ │ ├── model_converters/
│ │ │ ├── merge.py
│ │ │ ├── modeling_internlm2_reward/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── configuration_internlm2.py
│ │ │ │ └── modeling_internlm2.py
│ │ │ ├── pth_to_hf.py
│ │ │ └── split.py
│ │ ├── plugins/
│ │ │ ├── __init__.py
│ │ │ ├── api.py
│ │ │ ├── calculate.py
│ │ │ ├── search.py
│ │ │ └── solve.py
│ │ ├── process_untokenized_datasets.py
│ │ ├── process_untokenized_datasets_legacy.py
│ │ ├── process_untokenized_llava_data.py
│ │ ├── test.py
│ │ ├── tokenize_ftdp_datasets.py
│ │ ├── train.py
│ │ └── utils.py
│ ├── utils/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── fileio.py
│ │ ├── handle_moe_load_and_save.py
│ │ ├── stop_criteria.py
│ │ ├── templates.py
│ │ └── zero_to_any_dtype.py
│ └── version.py
└── xtuner-train_internvideo2_5/
├── .gitignore
├── .owners.yml
├── .pre-commit-config-zh-cn.yaml
├── .pre-commit-config.yaml
├── LICENSE
├── MANIFEST.in
├── README.md
├── data/
│ ├── annotaions/
│ │ └── ft_data_example.jsonl
│ └── diy_ft_data.json
├── ft_internvideo_2_5.sh
├── ft_internvideo_2_5_datapacking.sh
├── requirements/
│ ├── deepspeed.txt
│ ├── docs.txt
│ ├── modelscope.txt
│ └── runtime.txt
├── requirements.txt
├── setup.cfg
├── setup.py
├── unify_internvl2_train_r16.py
└── xtuner/
├── __init__.py
├── _lite/
│ ├── __init__.py
│ ├── accelerate/
│ │ ├── __init__.py
│ │ ├── dispatches/
│ │ │ ├── __init__.py
│ │ │ ├── _attention.py
│ │ │ ├── _fused/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── layer_norm.py
│ │ │ │ ├── rms_norm.py
│ │ │ │ └── rotary.py
│ │ │ ├── clip.py
│ │ │ ├── internlm2.py
│ │ │ ├── internvl2.py
│ │ │ ├── llama3.py
│ │ │ ├── new.py
│ │ │ ├── phi3.py
│ │ │ ├── qwen2.py
│ │ │ └── qwen_vl2.py
│ │ ├── fsdp/
│ │ │ ├── __init__.py
│ │ │ ├── checkpointing.py
│ │ │ ├── clip_grad.py
│ │ │ ├── lazy.py
│ │ │ ├── precision.py
│ │ │ └── wrap.py
│ │ ├── generate.py
│ │ ├── lora.py
│ │ └── packed.py
│ ├── auto.py
│ ├── chat/
│ │ ├── __init__.py
│ │ ├── backends/
│ │ │ └── __init__.py
│ │ ├── messages/
│ │ │ ├── __init__.py
│ │ │ ├── base.py
│ │ │ └── chat.py
│ │ └── templates/
│ │ ├── __init__.py
│ │ ├── chat.py
│ │ └── hybrid.py
│ ├── checkpoint.py
│ ├── datasets/
│ │ ├── __init__.py
│ │ ├── dataset_fn.py
│ │ ├── format.py
│ │ ├── llava.py
│ │ ├── load.py
│ │ ├── load_new.py
│ │ ├── text.py
│ │ └── tokenize.py
│ ├── internvl/
│ │ ├── __init__.py
│ │ ├── constants.py
│ │ ├── conversation.py
│ │ ├── dataset.py
│ │ ├── new_dataset.py
│ │ ├── v1_5/
│ │ │ ├── configuration_intern_vit.py
│ │ │ ├── configuration_internvl_chat.py
│ │ │ ├── configuration_phi3.py
│ │ │ ├── conversation.py
│ │ │ ├── modeling_intern_vit.py
│ │ │ ├── modeling_internvl_chat.py
│ │ │ └── modeling_phi3.py
│ │ └── video_utils.py
│ ├── modelings/
│ │ ├── __init__.py
│ │ ├── internlm2/
│ │ │ ├── __init__.py
│ │ │ ├── configuration_internlm2.py
│ │ │ └── modeling_internlm2.py
│ │ └── model_fn.py
│ ├── parallel/
│ │ ├── __init__.py
│ │ ├── comm.py
│ │ ├── logger.py
│ │ ├── new_setup.py
│ │ ├── plans/
│ │ │ └── internlm2.py
│ │ ├── sampler.py
│ │ ├── sequence/
│ │ │ ├── __init__.py
│ │ │ ├── attention.py
│ │ │ ├── data_collate.py
│ │ │ ├── ops.py
│ │ │ └── reduce_loss.py
│ │ └── setup.py
│ └── yunchang/
│ ├── __init__.py
│ ├── comm/
│ │ ├── __init__.py
│ │ ├── all_to_all.py
│ │ └── extract_local.py
│ ├── globals.py
│ ├── hybrid/
│ │ ├── __init__.py
│ │ ├── async_attn_layer.py
│ │ ├── attn_layer.py
│ │ └── utils.py
│ ├── ring/
│ │ ├── __init__.py
│ │ ├── llama3_flash_attn_varlen.py
│ │ ├── ring_flash_attn.py
│ │ ├── ring_flash_attn_varlen.py
│ │ ├── stripe_flash_attn.py
│ │ ├── triton_utils.py
│ │ ├── utils.py
│ │ ├── zigzag_ring_flash_attn.py
│ │ └── zigzag_ring_flash_attn_varlen.py
│ └── ulysses/
│ ├── __init__.py
│ └── attn_layer.py
├── apis/
│ ├── __init__.py
│ ├── datasets/
│ │ ├── __init__.py
│ │ ├── alpaca.py
│ │ ├── arxiv.py
│ │ ├── code_alpaca.py
│ │ ├── colorist.py
│ │ ├── lawyer.py
│ │ ├── medical.py
│ │ ├── moss_003_sft.py
│ │ ├── oasst1.py
│ │ ├── open_orca.py
│ │ ├── sql.py
│ │ ├── tiny_codes.py
│ │ └── wizardlm.py
│ ├── model.py
│ └── training_args.py
├── configs/
│ ├── __init__.py
│ ├── baichuan/
│ │ ├── baichuan2_13b_base/
│ │ │ ├── baichuan2_13b_base_qlora_alpaca_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_arxiv_gentitle_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_colorist_e5.py
│ │ │ ├── baichuan2_13b_base_qlora_lawyer_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_oasst1_e3.py
│ │ │ ├── baichuan2_13b_base_qlora_open_platypus_e3.py
│ │ │ └── baichuan2_13b_base_qlora_sql_e3.py
│ │ ├── baichuan2_13b_chat/
│ │ │ ├── baichuan2_13b_chat_qlora_alpaca_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_lawyer_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan2_13b_chat_qlora_oasst1_e3.py
│ │ │ └── baichuan2_13b_chat_qlora_open_platypus_e3.py
│ │ ├── baichuan2_7b_base/
│ │ │ ├── baichuan2_7b_base_qlora_alpaca_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_arxiv_gentitle_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_colorist_e5.py
│ │ │ ├── baichuan2_7b_base_qlora_lawyer_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_oasst1_e3.py
│ │ │ ├── baichuan2_7b_base_qlora_open_platypus_e3.py
│ │ │ └── baichuan2_7b_base_qlora_sql_e3.py
│ │ ├── baichuan2_7b_chat/
│ │ │ ├── baichuan2_7b_chat_qlora_alpaca_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_lawyer_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan2_7b_chat_qlora_oasst1_e3.py
│ │ │ └── baichuan2_7b_chat_qlora_open_platypus_e3.py
│ │ ├── baichuan_13b_base/
│ │ │ ├── baichuan_13b_base_qlora_alpaca_e3.py
│ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan_13b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan_13b_base_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan_13b_base_qlora_arxiv_gentitle_e3.py
│ │ │ ├── baichuan_13b_base_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan_13b_base_qlora_colorist_e5.py
│ │ │ ├── baichuan_13b_base_qlora_lawyer_e3.py
│ │ │ ├── baichuan_13b_base_qlora_medical_e1.py
│ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e1.py
│ │ │ ├── baichuan_13b_base_qlora_moss_sft_all_e2_gpu8.py
│ │ │ ├── baichuan_13b_base_qlora_moss_sft_plugins_e1.py
│ │ │ ├── baichuan_13b_base_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan_13b_base_qlora_oasst1_e3.py
│ │ │ ├── baichuan_13b_base_qlora_open_platypus_e3.py
│ │ │ ├── baichuan_13b_base_qlora_openorca_e1.py
│ │ │ ├── baichuan_13b_base_qlora_sql_e3.py
│ │ │ └── baichuan_13b_base_qlora_tiny_codes_e1.py
│ │ ├── baichuan_13b_chat/
│ │ │ ├── baichuan_13b_chat_qlora_alpaca_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_alpaca_zh_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_arxiv_gentitle_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_code_alpaca_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_colorist_e5.py
│ │ │ ├── baichuan_13b_chat_qlora_lawyer_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_medical_e1.py
│ │ │ ├── baichuan_13b_chat_qlora_oasst1_512_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_oasst1_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_open_platypus_e3.py
│ │ │ ├── baichuan_13b_chat_qlora_openorca_e1.py
│ │ │ ├── baichuan_13b_chat_qlora_sql_e3.py
│ │ │ └── baichuan_13b_chat_qlora_tiny_codes_e1.py
│ │ └── baichuan_7b/
│ │ ├── baichuan_7b_qlora_alpaca_e3.py
│ │ ├── baichuan_7b_qlora_alpaca_enzh_e3.py
│ │ ├── baichuan_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ ├── baichuan_7b_qlora_alpaca_zh_e3.py
│ │ ├── baichuan_7b_qlora_arxiv_gentitle_e3.py
│ │ ├── baichuan_7b_qlora_code_alpaca_e3.py
│ │ ├── baichuan_7b_qlora_colorist_e5.py
│ │ ├── baichuan_7b_qlora_lawyer_e3.py
│ │ ├── baichuan_7b_qlora_medical_e1.py
│ │ ├── baichuan_7b_qlora_moss_sft_all_e1.py
│ │ ├── baichuan_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ ├── baichuan_7b_qlora_moss_sft_plugins_e1.py
│ │ ├── baichuan_7b_qlora_oasst1_512_e3.py
│ │ ├── baichuan_7b_qlora_oasst1_e3.py
│ │ ├── baichuan_7b_qlora_open_platypus_e3.py
│ │ ├── baichuan_7b_qlora_openorca_e1.py
│ │ ├── baichuan_7b_qlora_sql_e3.py
│ │ └── baichuan_7b_qlora_tiny_codes_e1.py
│ ├── chatglm/
│ │ ├── chatglm2_6b/
│ │ │ ├── chatglm2_6b_qlora_alpaca_e3.py
│ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_e3.py
│ │ │ ├── chatglm2_6b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── chatglm2_6b_qlora_alpaca_zh_e3.py
│ │ │ ├── chatglm2_6b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── chatglm2_6b_qlora_code_alpaca_e3.py
│ │ │ ├── chatglm2_6b_qlora_colorist_e5.py
│ │ │ ├── chatglm2_6b_qlora_lawyer_e3.py
│ │ │ ├── chatglm2_6b_qlora_medical_e1.py
│ │ │ ├── chatglm2_6b_qlora_oasst1_512_e3.py
│ │ │ ├── chatglm2_6b_qlora_oasst1_e3.py
│ │ │ ├── chatglm2_6b_qlora_open_platypus_e3.py
│ │ │ ├── chatglm2_6b_qlora_openorca_e1.py
│ │ │ ├── chatglm2_6b_qlora_sql_e3.py
│ │ │ └── chatglm2_6b_qlora_tiny_codes_e1.py
│ │ ├── chatglm3_6b/
│ │ │ ├── chatglm3_6b_qlora_alpaca_e3.py
│ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_e3.py
│ │ │ ├── chatglm3_6b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── chatglm3_6b_qlora_alpaca_zh_e3.py
│ │ │ ├── chatglm3_6b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── chatglm3_6b_qlora_code_alpaca_e3.py
│ │ │ ├── chatglm3_6b_qlora_colorist_e5.py
│ │ │ ├── chatglm3_6b_qlora_lawyer_e3.py
│ │ │ ├── chatglm3_6b_qlora_medical_e1.py
│ │ │ ├── chatglm3_6b_qlora_oasst1_512_e3.py
│ │ │ ├── chatglm3_6b_qlora_oasst1_e3.py
│ │ │ ├── chatglm3_6b_qlora_open_platypus_e3.py
│ │ │ ├── chatglm3_6b_qlora_openorca_e1.py
│ │ │ ├── chatglm3_6b_qlora_sql_e3.py
│ │ │ └── chatglm3_6b_qlora_tiny_codes_e1.py
│ │ └── chatglm3_6b_base/
│ │ ├── chatglm3_6b_base_qlora_alpaca_e3.py
│ │ ├── chatglm3_6b_base_qlora_alpaca_enzh_e3.py
│ │ ├── chatglm3_6b_base_qlora_alpaca_enzh_oasst1_e3.py
│ │ ├── chatglm3_6b_base_qlora_alpaca_zh_e3.py
│ │ ├── chatglm3_6b_base_qlora_arxiv_gentitle_e3.py
│ │ ├── chatglm3_6b_base_qlora_code_alpaca_e3.py
│ │ ├── chatglm3_6b_base_qlora_colorist_e5.py
│ │ ├── chatglm3_6b_base_qlora_lawyer_e3.py
│ │ ├── chatglm3_6b_base_qlora_medical_e1.py
│ │ ├── chatglm3_6b_base_qlora_oasst1_512_e3.py
│ │ ├── chatglm3_6b_base_qlora_oasst1_e3.py
│ │ ├── chatglm3_6b_base_qlora_open_platypus_e3.py
│ │ ├── chatglm3_6b_base_qlora_openorca_e1.py
│ │ ├── chatglm3_6b_base_qlora_sql_e3.py
│ │ └── chatglm3_6b_base_qlora_tiny_codes_e1.py
│ ├── cohere/
│ │ ├── README.md
│ │ └── cohere_104b/
│ │ └── cohere_100b_128k_sp32.py
│ ├── custom_dataset/
│ │ ├── pretrain/
│ │ │ ├── baichuan/
│ │ │ │ ├── baichuan2_13b_base_full_custom_pretrain_e1.py
│ │ │ │ └── baichuan2_7b_base_full_custom_pretrain_e1.py
│ │ │ ├── chatglm/
│ │ │ │ ├── chatglm2_6b_full_custom_pretrain_e1.py
│ │ │ │ └── chatglm3_6b_full_custom_pretrain_e1.py
│ │ │ ├── deepseek/
│ │ │ │ └── deepseek_moe_16b_base_full_custom_pretrain_e1.py
│ │ │ ├── gemma/
│ │ │ │ ├── gemma_2b_full_custom_pretrain_e1.py
│ │ │ │ └── gemma_7b_full_custom_pretrain_e1.py
│ │ │ ├── internlm/
│ │ │ │ ├── internlm2_1_8b_full_custom_pretrain_e1.py
│ │ │ │ ├── internlm2_20b_full_custom_pretrain_e1.py
│ │ │ │ └── internlm2_7b_full_custom_pretrain_e1.py
│ │ │ ├── llama/
│ │ │ │ ├── llama2_70b_full_custom_pretrain_e1.py
│ │ │ │ └── llama2_7b_full_custom_pretrain_e1.py
│ │ │ ├── mistral/
│ │ │ │ └── mistral_7b_full_custom_pretrain_e1.py
│ │ │ ├── mixtral/
│ │ │ │ └── mixtral_8x7b_full_custom_pretrain_e1.py
│ │ │ ├── qwen/
│ │ │ │ ├── qwen1_5_0_5b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen1_5_14b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen1_5_1_8b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen1_5_4b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen1_5_72b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen1_5_7b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen_1_8b_full_custom_pretrain_e1.py
│ │ │ │ ├── qwen_72b_full_custom_pretrain_e1.py
│ │ │ │ └── qwen_7b_full_custom_pretrain_e1.py
│ │ │ ├── starcoder/
│ │ │ │ └── starcoder_full_custom_pretrain_e1.py
│ │ │ ├── yi/
│ │ │ │ ├── yi_34b_full_custom_pretrain_e1.py
│ │ │ │ └── yi_6b_full_custom_pretrain_e1.py
│ │ │ └── zephyr/
│ │ │ └── zephyr_7b_beta_full_custom_pretrain_e1.py
│ │ └── sft/
│ │ ├── baichuan/
│ │ │ ├── baichuan2_13b_chat_qlora_custom_sft_e1.py
│ │ │ ├── baichuan2_7b_chat_qlora_custom_sft_e1.py
│ │ │ ├── baichuan_13b_chat_qlora_custom_sft_e1.py
│ │ │ └── baichuan_7b_qlora_custom_sft_e1.py
│ │ ├── chatglm/
│ │ │ ├── chatglm2_6b_qlora_custom_sft_e1.py
│ │ │ └── chatglm3_6b_qlora_custom_sft_e1.py
│ │ ├── deepseek/
│ │ │ ├── deepseek_moe_16b_chat_qlora_custom_sft_e1.py
│ │ │ └── deepseekcoder_6_7b_instruct_qlora_custom_sft_e1.py
│ │ ├── gemma/
│ │ │ ├── gemma_2b_it_qlora_custom_sft_e1.py
│ │ │ ├── gemma_2b_qlora_custom_sft_e1.py
│ │ │ ├── gemma_7b_it_qlora_custom_sft_e1.py
│ │ │ └── gemma_7b_qlora_custom_sft_e1.py
│ │ ├── internlm/
│ │ │ ├── internlm2_chat_1_8b_qlora_custom_sft_e1.py
│ │ │ ├── internlm2_chat_20b_qlora_custom_sft_e1.py
│ │ │ └── internlm2_chat_7b_qlora_custom_sft_e1.py
│ │ ├── llama/
│ │ │ ├── llama2_70b_qlora_custom_sft_e1.py
│ │ │ └── llama2_7b_chat_qlora_custom_sft_e1.py
│ │ ├── mistral/
│ │ │ └── mistral_7b_full_finetune_custom_sft_e1.py
│ │ ├── mixtral/
│ │ │ └── mixtral_8x7b_instruct_qlora_custom_sft_e1.py
│ │ ├── qwen/
│ │ │ ├── qwen1_5_0_5b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen1_5_14b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen1_5_1_8b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen1_5_4b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen1_5_72b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen1_5_7b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen_1_8b_chat_qlora_custom_sft_e1.py
│ │ │ ├── qwen_72b_qlora_custom_sft_e1.py
│ │ │ └── qwen_7b_chat_qlora_custom_sft_e1.py
│ │ ├── starcoder/
│ │ │ └── starcoder_qlora_custom_sft_e1.py
│ │ ├── yi/
│ │ │ ├── yi_34b_qlora_custom_sft_e1.py
│ │ │ └── yi_6b_qlora_custom_sft_e1.py
│ │ └── zephyr/
│ │ └── zephyr_7b_beta_qlora_custom_sft_e1.py
│ ├── deepseek/
│ │ ├── README.md
│ │ ├── deepseek_coder_6_7b_base/
│ │ │ └── deepseek_coder_6_7b_base_qlora_code_alpaca_e3.py
│ │ ├── deepseek_coder_6_7b_instruct/
│ │ │ └── deepseekcoder_6_7b_instruct_qlora_code_alpaca_e3.py
│ │ ├── deepseek_moe_16b_base/
│ │ │ ├── deepseek_moe_16b_base_full_oasst1_e3.py
│ │ │ └── deepseek_moe_16b_base_qlora_oasst1_e3.py
│ │ ├── deepseek_moe_16b_chat/
│ │ │ ├── deepseek_moe_16b_chat_full_oasst1_e3.py
│ │ │ └── deepseek_moe_16b_chat_qlora_oasst1_e3.py
│ │ ├── deepseek_v2_chat/
│ │ │ └── deepseek_v2_chat_full_alpaca_e3.py
│ │ └── deepseek_v2_lite_chat/
│ │ ├── deepseek_v2_lite_chat_full_alpaca_e3.py
│ │ └── deepseek_v2_lite_chat_full_alpaca_e3_32k_varlen.py
│ ├── deepspeed/
│ │ ├── deepspeed_zero1.json
│ │ ├── deepspeed_zero2.json
│ │ ├── deepspeed_zero2_offload.json
│ │ ├── deepspeed_zero3.json
│ │ └── deepspeed_zero3_offload.json
│ ├── dpo/
│ │ ├── internlm/
│ │ │ ├── internlm2_chat_1_8b_dpo_full.py
│ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn.py
│ │ │ ├── internlm2_chat_1_8b_dpo_full_varlenattn_jsonl_dataset.py
│ │ │ └── internlm2_chat_7b_dpo_qlora_varlenattn.py
│ │ └── llama/
│ │ └── llama3_8b_instruct_dpo_qlora_varlenattn.py
│ ├── gemma/
│ │ ├── gemma_2b/
│ │ │ ├── gemma_2b_full_alpaca_e3.py
│ │ │ └── gemma_2b_qlora_alpaca_e3.py
│ │ ├── gemma_2b_it/
│ │ │ ├── gemma_2b_it_full_alpaca_e3.py
│ │ │ └── gemma_2b_it_qlora_alpaca_e3.py
│ │ ├── gemma_7b/
│ │ │ ├── gemma_7b_full_alpaca_e3.py
│ │ │ └── gemma_7b_qlora_alpaca_e3.py
│ │ └── gemma_7b_it/
│ │ ├── gemma_7b_it_full_alpaca_e3.py
│ │ └── gemma_7b_it_qlora_alpaca_e3.py
│ ├── internlm/
│ │ ├── internlm2_1_8b/
│ │ │ ├── internlm2_1_8b_full_alpaca_e3.py
│ │ │ └── internlm2_1_8b_qlora_alpaca_e3.py
│ │ ├── internlm2_20b/
│ │ │ ├── internlm2_20b_full_finetune_custom_dataset_e1.py
│ │ │ ├── internlm2_20b_qlora_alpaca_e3.py
│ │ │ ├── internlm2_20b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── internlm2_20b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm2_20b_qlora_colorist_e5.py
│ │ │ ├── internlm2_20b_qlora_lawyer_e3.py
│ │ │ ├── internlm2_20b_qlora_msagent_react_e3_gpu8.py
│ │ │ ├── internlm2_20b_qlora_oasst1_512_e3.py
│ │ │ ├── internlm2_20b_qlora_oasst1_e3.py
│ │ │ └── internlm2_20b_qlora_sql_e3.py
│ │ ├── internlm2_7b/
│ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1.py
│ │ │ ├── internlm2_7b_full_finetune_custom_dataset_e1_sequence_parallel_4.py
│ │ │ ├── internlm2_7b_qlora_alpaca_e3.py
│ │ │ ├── internlm2_7b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── internlm2_7b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm2_7b_qlora_colorist_e5.py
│ │ │ ├── internlm2_7b_qlora_json_e3.py
│ │ │ ├── internlm2_7b_qlora_lawyer_e3.py
│ │ │ ├── internlm2_7b_qlora_msagent_react_e3_gpu8.py
│ │ │ ├── internlm2_7b_qlora_oasst1_512_e3.py
│ │ │ ├── internlm2_7b_qlora_oasst1_e3.py
│ │ │ ├── internlm2_7b_qlora_sql_e3.py
│ │ │ ├── internlm2_7b_w_internevo_dataset.py
│ │ │ ├── internlm2_7b_w_tokenized_dataset.py
│ │ │ └── internlm2_7b_w_untokenized_dataset.py
│ │ ├── internlm2_chat_1_8b/
│ │ │ ├── internlm2_chat_1_8b_full_alpaca_e3.py
│ │ │ └── internlm2_chat_1_8b_qlora_alpaca_e3.py
│ │ ├── internlm2_chat_20b/
│ │ │ ├── internlm2_chat_20b_full_finetune_custom_dataset_e1.py
│ │ │ ├── internlm2_chat_20b_qlora_alpaca_e3.py
│ │ │ ├── internlm2_chat_20b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm2_chat_20b_qlora_lawyer_e3.py
│ │ │ ├── internlm2_chat_20b_qlora_oasst1_512_e3.py
│ │ │ └── internlm2_chat_20b_qlora_oasst1_e3.py
│ │ ├── internlm2_chat_7b/
│ │ │ ├── internlm2_chat_7b_full_finetune_custom_dataset_e1.py
│ │ │ ├── internlm2_chat_7b_qlora_alpaca_e3.py
│ │ │ ├── internlm2_chat_7b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm2_chat_7b_qlora_lawyer_e3.py
│ │ │ ├── internlm2_chat_7b_qlora_oasst1_512_e3.py
│ │ │ └── internlm2_chat_7b_qlora_oasst1_e3.py
│ │ ├── internlm_20b/
│ │ │ ├── internlm_20b_qlora_alpaca_e3.py
│ │ │ ├── internlm_20b_qlora_alpaca_enzh_e3.py
│ │ │ ├── internlm_20b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── internlm_20b_qlora_alpaca_zh_e3.py
│ │ │ ├── internlm_20b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── internlm_20b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm_20b_qlora_colorist_e5.py
│ │ │ ├── internlm_20b_qlora_lawyer_e3.py
│ │ │ ├── internlm_20b_qlora_msagent_react_e3_gpu8.py
│ │ │ ├── internlm_20b_qlora_oasst1_512_e3.py
│ │ │ ├── internlm_20b_qlora_oasst1_e3.py
│ │ │ ├── internlm_20b_qlora_open_platypus_e3.py
│ │ │ └── internlm_20b_qlora_sql_e3.py
│ │ ├── internlm_7b/
│ │ │ ├── internlm_7b_full_alpaca_e3.py
│ │ │ ├── internlm_7b_full_alpaca_enzh_e3.py
│ │ │ ├── internlm_7b_full_alpaca_enzh_oasst1_e3.py
│ │ │ ├── internlm_7b_full_alpaca_zh_e3.py
│ │ │ ├── internlm_7b_full_intern_repo_dataset_template.py
│ │ │ ├── internlm_7b_full_oasst1_e3.py
│ │ │ ├── internlm_7b_qlora_alpaca_e3.py
│ │ │ ├── internlm_7b_qlora_alpaca_enzh_e3.py
│ │ │ ├── internlm_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── internlm_7b_qlora_alpaca_zh_e3.py
│ │ │ ├── internlm_7b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── internlm_7b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm_7b_qlora_colorist_e5.py
│ │ │ ├── internlm_7b_qlora_json_e3.py
│ │ │ ├── internlm_7b_qlora_lawyer_e3.py
│ │ │ ├── internlm_7b_qlora_medical_e1.py
│ │ │ ├── internlm_7b_qlora_moss_sft_all_e1.py
│ │ │ ├── internlm_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ ├── internlm_7b_qlora_moss_sft_plugins_e1.py
│ │ │ ├── internlm_7b_qlora_msagent_react_e3_gpu8.py
│ │ │ ├── internlm_7b_qlora_oasst1_512_e3.py
│ │ │ ├── internlm_7b_qlora_oasst1_e3.py
│ │ │ ├── internlm_7b_qlora_oasst1_e3_hf.py
│ │ │ ├── internlm_7b_qlora_oasst1_mmlu_e3.py
│ │ │ ├── internlm_7b_qlora_open_platypus_e3.py
│ │ │ ├── internlm_7b_qlora_openorca_e1.py
│ │ │ ├── internlm_7b_qlora_sql_e3.py
│ │ │ └── internlm_7b_qlora_tiny_codes_e1.py
│ │ ├── internlm_chat_20b/
│ │ │ ├── internlm_chat_20b_qlora_alpaca_e3.py
│ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_e3.py
│ │ │ ├── internlm_chat_20b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── internlm_chat_20b_qlora_alpaca_zh_e3.py
│ │ │ ├── internlm_chat_20b_qlora_code_alpaca_e3.py
│ │ │ ├── internlm_chat_20b_qlora_lawyer_e3.py
│ │ │ ├── internlm_chat_20b_qlora_oasst1_512_e3.py
│ │ │ ├── internlm_chat_20b_qlora_oasst1_e3.py
│ │ │ └── internlm_chat_20b_qlora_open_platypus_e3.py
│ │ └── internlm_chat_7b/
│ │ ├── internlm_chat_7b_qlora_alpaca_e3.py
│ │ ├── internlm_chat_7b_qlora_alpaca_enzh_e3.py
│ │ ├── internlm_chat_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ ├── internlm_chat_7b_qlora_alpaca_zh_e3.py
│ │ ├── internlm_chat_7b_qlora_arxiv_gentitle_e3.py
│ │ ├── internlm_chat_7b_qlora_code_alpaca_e3.py
│ │ ├── internlm_chat_7b_qlora_colorist_e5.py
│ │ ├── internlm_chat_7b_qlora_lawyer_e3.py
│ │ ├── internlm_chat_7b_qlora_medical_e1.py
│ │ ├── internlm_chat_7b_qlora_oasst1_512_e3.py
│ │ ├── internlm_chat_7b_qlora_oasst1_e3.py
│ │ ├── internlm_chat_7b_qlora_open_platypus_e3.py
│ │ ├── internlm_chat_7b_qlora_openorca_e1.py
│ │ ├── internlm_chat_7b_qlora_sql_e3.py
│ │ └── internlm_chat_7b_qlora_tiny_codes_e1.py
│ ├── llama/
│ │ ├── llama2_70b/
│ │ │ ├── llama2_70b_full_wizardlm_e1.py
│ │ │ ├── llama2_70b_int8_lora_open_platypus_e1.py
│ │ │ ├── llama2_70b_int8_lora_open_platypus_e1_hf.py
│ │ │ ├── llama2_70b_qlora_open_platypus_e1.py
│ │ │ └── llama2_70b_qlora_open_platypus_e1_hf.py
│ │ ├── llama2_7b/
│ │ │ ├── llama2_7b_full_pgbooks_400iters_sp1.py
│ │ │ ├── llama2_7b_full_pgbooks_400iters_sp4.py
│ │ │ ├── llama2_7b_full_wizardlm_e1.py
│ │ │ ├── llama2_7b_qlora_alpaca_e3.py
│ │ │ ├── llama2_7b_qlora_alpaca_enzh_e3.py
│ │ │ ├── llama2_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── llama2_7b_qlora_alpaca_zh_e3.py
│ │ │ ├── llama2_7b_qlora_arxiv_gentitle_e3.py
│ │ │ ├── llama2_7b_qlora_code_alpaca_e3.py
│ │ │ ├── llama2_7b_qlora_colorist_e5.py
│ │ │ ├── llama2_7b_qlora_lawyer_e3.py
│ │ │ ├── llama2_7b_qlora_medical_e1.py
│ │ │ ├── llama2_7b_qlora_moss_sft_all_e1.py
│ │ │ ├── llama2_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ ├── llama2_7b_qlora_moss_sft_plugins_e1.py
│ │ │ ├── llama2_7b_qlora_msagent_react_e3_gpu8.py
│ │ │ ├── llama2_7b_qlora_oasst1_512_e3.py
│ │ │ ├── llama2_7b_qlora_oasst1_e3.py
│ │ │ ├── llama2_7b_qlora_open_platypus_e3.py
│ │ │ ├── llama2_7b_qlora_openorca_e1.py
│ │ │ ├── llama2_7b_qlora_sql_e3.py
│ │ │ └── llama2_7b_qlora_tiny_codes_e1.py
│ │ ├── llama2_7b_chat/
│ │ │ ├── llama2_7b_chat_qlora_alpaca_e3.py
│ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_e3.py
│ │ │ ├── llama2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── llama2_7b_chat_qlora_alpaca_zh_e3.py
│ │ │ ├── llama2_7b_chat_qlora_arxiv_gentitle_e3.py
│ │ │ ├── llama2_7b_chat_qlora_code_alpaca_e3.py
│ │ │ ├── llama2_7b_chat_qlora_colorist_e5.py
│ │ │ ├── llama2_7b_chat_qlora_lawyer_e3.py
│ │ │ ├── llama2_7b_chat_qlora_medical_e1.py
│ │ │ ├── llama2_7b_chat_qlora_oasst1_512_e3.py
│ │ │ ├── llama2_7b_chat_qlora_oasst1_e3.py
│ │ │ ├── llama2_7b_chat_qlora_open_platypus_e3.py
│ │ │ ├── llama2_7b_chat_qlora_openorca_e1.py
│ │ │ ├── llama2_7b_chat_qlora_sql_e3.py
│ │ │ └── llama2_7b_chat_qlora_tiny_codes_e1.py
│ │ ├── llama3_70b_instruct/
│ │ │ └── llama3_70b_instruct_qlora_alpaca_e3_2k_gpu8.py
│ │ ├── llama3_8b/
│ │ │ ├── README.md
│ │ │ └── llama3_8b_full_alpaca_e3.py
│ │ ├── llama3_8b_instruct/
│ │ │ ├── llama3_8b_instruct_full_alpaca_e3.py
│ │ │ └── llama3_8b_instruct_qlora_alpaca_e3.py
│ │ └── llama_7b/
│ │ ├── llama_7b_qlora_alpaca_e3.py
│ │ ├── llama_7b_qlora_alpaca_enzh_e3.py
│ │ ├── llama_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ ├── llama_7b_qlora_alpaca_zh_e3.py
│ │ ├── llama_7b_qlora_arxiv_gentitle_e3.py
│ │ ├── llama_7b_qlora_code_alpaca_e3.py
│ │ ├── llama_7b_qlora_colorist_e5.py
│ │ ├── llama_7b_qlora_lawyer_e3.py
│ │ ├── llama_7b_qlora_medical_e1.py
│ │ ├── llama_7b_qlora_moss_sft_all_e1.py
│ │ ├── llama_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ ├── llama_7b_qlora_moss_sft_plugins_e1.py
│ │ ├── llama_7b_qlora_oasst1_512_e3.py
│ │ ├── llama_7b_qlora_oasst1_e3.py
│ │ ├── llama_7b_qlora_open_platypus_e3.py
│ │ ├── llama_7b_qlora_openorca_e1.py
│ │ ├── llama_7b_qlora_sql_e3.py
│ │ └── llama_7b_qlora_tiny_codes_e1.py
│ ├── llama_speed_benchmark/
│ │ ├── llama2_70b/
│ │ │ ├── llama2_70b_full_alpaca_enzh_128k_sp8.py
│ │ │ ├── llama2_70b_full_alpaca_enzh_256k_sp16.py
│ │ │ ├── llama2_70b_full_alpaca_enzh_32k_sp4.py
│ │ │ └── llama2_70b_full_alpaca_enzh_8k_sp1.py
│ │ ├── llama2_7b/
│ │ │ ├── llama2_7b_full_alpaca_enzh_128k_sp8.py
│ │ │ ├── llama2_7b_full_alpaca_enzh_1M_sp16.py
│ │ │ ├── llama2_7b_full_alpaca_enzh_256k_sp8.py
│ │ │ ├── llama2_7b_full_alpaca_enzh_32k_sp1.py
│ │ │ └── llama2_7b_full_alpaca_enzh_8k_sp1.py
│ │ └── yi_34b/
│ │ ├── yi_34b_200k_full_alpaca_enzh_128k_sp8.py
│ │ ├── yi_34b_200k_full_alpaca_enzh_256k_sp8.py
│ │ ├── yi_34b_200k_full_alpaca_enzh_32k_sp2.py
│ │ └── yi_34b_200k_full_alpaca_enzh_8k_sp1.py
│ ├── llava/
│ │ ├── README.md
│ │ ├── README_zh-CN.md
│ │ ├── internlm2_chat_1_8b_clip_vit_large_p14_336/
│ │ │ ├── finetune/
│ │ │ │ └── llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ └── pretrain/
│ │ │ └── llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ ├── internlm2_chat_20b_clip_vit_large_p14_336/
│ │ │ ├── finetune/
│ │ │ │ ├── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ └── llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ └── pretrain/
│ │ │ └── llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ ├── internlm2_chat_7b_clip_vit_large_p14_336/
│ │ │ ├── finetune/
│ │ │ │ ├── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ └── llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ └── pretrain/
│ │ │ └── llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ ├── internlm_chat_7b_clip_vit_large_p14_336/
│ │ │ ├── finetune/
│ │ │ │ └── llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ └── pretrain/
│ │ │ └── llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ ├── llama3_70b_instruct_clip_vit_large_p14_336/
│ │ │ └── pretrain/
│ │ │ └── llava_llama3_70b_instruct_quant_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ ├── llama3_8b_instruct_clip_vit_large_p14_336/
│ │ │ ├── README.md
│ │ │ ├── convert_xtuner_weights_to_hf.py
│ │ │ ├── convert_xtuner_weights_to_llava.py
│ │ │ ├── finetune/
│ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ │ ├── llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py
│ │ │ │ └── llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune.py
│ │ │ └── pretrain/
│ │ │ ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ ├── llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
│ │ │ └── llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain.py
│ │ ├── official/
│ │ │ ├── llava_v15_13b/
│ │ │ │ ├── llava_v15_13b_finetune.py
│ │ │ │ ├── llava_v15_13b_finetune_lora.py
│ │ │ │ └── llava_v15_13b_pretrain.py
│ │ │ └── llava_v15_7b/
│ │ │ ├── llava_v15_7b_finetune.py
│ │ │ ├── llava_v15_7b_finetune_lora.py
│ │ │ └── llava_v15_7b_pretrain.py
│ │ ├── phi3_mini_4k_instruct_clip_vit_large_p14_336/
│ │ │ ├── README.md
│ │ │ ├── convert_phi_to_llama.py
│ │ │ ├── convert_xtuner_weights_to_hf.py
│ │ │ ├── convert_xtuner_weights_to_llava.py
│ │ │ ├── finetune/
│ │ │ │ ├── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
│ │ │ │ └── llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune.py
│ │ │ └── pretrain/
│ │ │ ├── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ │ └── llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
│ │ ├── vicuna_13b_v15_clip_vit_large_p14_336/
│ │ │ ├── finetune/
│ │ │ │ └── llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ └── pretrain/
│ │ │ └── llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ │ └── vicuna_7b_v15_clip_vit_large_p14_336/
│ │ ├── finetune/
│ │ │ ├── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
│ │ │ └── llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_refcoco.py
│ │ └── pretrain/
│ │ └── llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
│ ├── mistral/
│ │ ├── mistral_7b_full_finetune_custom_dataset_e1.py
│ │ ├── mistral_7b_qlora_skypile_pretrain_e1.py
│ │ ├── mistral_7b_w_tokenized_dataset.py
│ │ └── mistral_7b_w_untokenized_dataset.py
│ ├── mixtral/
│ │ ├── README.md
│ │ ├── mixtral_8x7b/
│ │ │ ├── mixtral_8x7b_full_oasst1_e3.py
│ │ │ └── mixtral_8x7b_qlora_oasst1_e3.py
│ │ └── mixtral_8x7b_instruct/
│ │ ├── mixtral_8x7b_instruct_full_oasst1_e3.py
│ │ └── mixtral_8x7b_instruct_qlora_oasst1_e3.py
│ ├── orpo/
│ │ ├── internlm/
│ │ │ ├── internlm2_chat_1_8b_orpo_full.py
│ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn.py
│ │ │ ├── internlm2_chat_1_8b_orpo_full_varlenattn_jsonl_dataset.py
│ │ │ └── internlm2_chat_7b_orpo_qlora_varlenattn_ultrafeedback_e5.py
│ │ └── llama/
│ │ └── llama3_8b_instruct_orpo_qlora_varlenattn_ultrafeedback_e5.py
│ ├── phi/
│ │ └── phi3/
│ │ ├── phi3_mini_128k_instruct_full_alpaca_e3.py
│ │ ├── phi3_mini_128k_instruct_qlora_alpaca_e3.py
│ │ ├── phi3_mini_4k_instruct_full_alpaca_e3.py
│ │ └── phi3_mini_4k_instruct_qlora_alpaca_e3.py
│ ├── qwen/
│ │ ├── qwen1/
│ │ │ ├── qwen_1_8b/
│ │ │ │ ├── qwen_1_8b_qlora_alpaca_e3.py
│ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── qwen_1_8b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── qwen_1_8b_qlora_alpaca_zh_e3.py
│ │ │ │ └── qwen_1_8b_qlora_code_alpaca_e3.py
│ │ │ ├── qwen_1_8b_chat/
│ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_e3.py
│ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── qwen_1_8b_chat_qlora_alpaca_zh_e3.py
│ │ │ │ └── qwen_1_8b_chat_qlora_code_alpaca_e3.py
│ │ │ ├── qwen_72b/
│ │ │ │ ├── qwen_72b_qlora_alpaca_e3.py
│ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── qwen_72b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── qwen_72b_qlora_alpaca_zh_e3.py
│ │ │ │ └── qwen_72b_qlora_code_alpaca_e3.py
│ │ │ ├── qwen_7b/
│ │ │ │ ├── qwen_7b_qlora_alpaca_e3.py
│ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_e3.py
│ │ │ │ ├── qwen_7b_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ │ ├── qwen_7b_qlora_alpaca_zh_e3.py
│ │ │ │ ├── qwen_7b_qlora_arxiv_gentitle_e3.py
│ │ │ │ ├── qwen_7b_qlora_code_alpaca_e3.py
│ │ │ │ ├── qwen_7b_qlora_colorist_e5.py
│ │ │ │ ├── qwen_7b_qlora_lawyer_e3.py
│ │ │ │ ├── qwen_7b_qlora_medical_e1.py
│ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e1.py
│ │ │ │ ├── qwen_7b_qlora_moss_sft_all_e2_gpu8.py
│ │ │ │ ├── qwen_7b_qlora_moss_sft_plugins_e1.py
│ │ │ │ ├── qwen_7b_qlora_oasst1_512_e3.py
│ │ │ │ ├── qwen_7b_qlora_oasst1_e3.py
│ │ │ │ ├── qwen_7b_qlora_open_platypus_e3.py
│ │ │ │ ├── qwen_7b_qlora_openorca_e1.py
│ │ │ │ ├── qwen_7b_qlora_sql_e3.py
│ │ │ │ └── qwen_7b_qlora_tiny_codes_e1.py
│ │ │ └── qwen_7b_chat/
│ │ │ ├── qwen_7b_chat_qlora_alpaca_e3.py
│ │ │ ├── qwen_7b_chat_qlora_alpaca_enzh_e3.py
│ │ │ ├── qwen_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
│ │ │ ├── qwen_7b_chat_qlora_alpaca_zh_e3.py
│ │ │ ├── qwen_7b_chat_qlora_arxiv_gentitle_e3.py
│ │ │ ├── qwen_7b_chat_qlora_code_alpaca_e3.py
│ │ │ ├── qwen_7b_chat_qlora_colorist_e5.py
│ │ │ ├── qwen_7b_chat_qlora_lawyer_e3.py
│ │ │ ├── qwen_7b_chat_qlora_medical_e1.py
│ │ │ ├── qwen_7b_chat_qlora_oasst1_512_e3.py
│ │ │ ├── qwen_7b_chat_qlora_oasst1_e3.py
│ │ │ ├── qwen_7b_chat_qlora_open_platypus_e3.py
│ │ │ ├── qwen_7b_chat_qlora_openorca_e1.py
│ │ │ ├── qwen_7b_chat_qlora_sql_e3.py
│ │ │ └── qwen_7b_chat_qlora_tiny_codes_e1.py
│ │ └── qwen1_5/
│ │ ├── qwen1_5_0_5b/
│ │ │ ├── qwen1_5_0_5b_full_alpaca_e3.py
│ │ │ └── qwen1_5_0_5b_qlora_alpaca_e3.py
│ │ ├── qwen1_5_0_5b_chat/
│ │ │ ├── qwen1_5_0_5b_chat_full_alpaca_e3.py
│ │ │ └── qwen1_5_0_5b_chat_qlora_alpaca_e3.py
│ │ ├── qwen1_5_110b/
│ │ │ ├── qwen1_5_110b_full_alpaca_e3.py
│ │ │ └── qwen1_5_110b_qlora_alpaca_e3.py
│ │ ├── qwen1_5_110b_chat/
│ │ │ ├── README.md
│ │ │ ├── qwen1_5_110b_chat_full_alpaca_e3.py
│ │ │ ├── qwen1_5_110b_chat_qlora_alpaca_e3.py
│ │ │ └── qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py
│ │ ├── qwen1_5_14b/
│ │ │ ├── qwen1_5_14b_full_alpaca_e3.py
│ │ │ └── qwen1_5_14b_qlora_alpaca_e3.py
│ │ ├── qwen1_5_14b_chat/
│ │ │ ├── qwen1_5_14b_chat_full_alpaca_e3.py
│ │ │ └── qwen1_5_14b_chat_qlora_alpaca_e3.py
│ │ ├── qwen1_5_1_8b/
│ │ │ ├── qwen1_5_1_8b_full_alpaca_e3.py
│ │ │ └── qwen1_5_1_8b_qlora_alpaca_e3.py
│ │ ├── qwen1_5_1_8b_chat/
│ │ │ ├── qwen1_5_1_8b_chat_full_alpaca_e3.py
│ │ │ └── qwen1_5_1_8b_chat_qlora_alpaca_e3.py
│ │ ├── qwen1_5_4b/
│ │ │ ├── qwen1_5_4b_full_alpaca_e3.py
│ │ │ └── qwen1_5_4b_qlora_alpaca_e3.py
│ │ ├── qwen1_5_4b_chat/
│ │ │ ├── qwen1_5_4b_chat_full_alpaca_e3.py
│ │ │ └── qwen1_5_4b_chat_qlora_alpaca_e3.py
│ │ ├── qwen1_5_72b/
│ │ │ ├── qwen1_5_72b_full_alpaca_e3.py
│ │ │ └── qwen1_5_72b_qlora_alpaca_e3.py
│ │ ├── qwen1_5_72b_chat/
│ │ │ ├── qwen1_5_72b_chat_full_alpaca_e3.py
│ │ │ └── qwen1_5_72b_chat_qlora_alpaca_e3.py
│ │ ├── qwen1_5_7b/
│ │ │ ├── qwen1_5_7b_full_alpaca_e3.py
│ │ │ └── qwen1_5_7b_qlora_alpaca_e3.py
│ │ └── qwen1_5_7b_chat/
│ │ ├── qwen1_5_7b_chat_full_alpaca_e3.py
│ │ └── qwen1_5_7b_chat_qlora_alpaca_e3.py
│ ├── qwen_moe/
│ │ └── qwen1_5/
│ │ └── qwen1_5_moe_a2_7_b_chat/
│ │ └── qwen1_5_moe_a2_7_b_chat_full_alpaca_e3.py
│ ├── reward_model/
│ │ ├── internlm/
│ │ │ ├── internlm2_chat_1_8b_reward_full_ultrafeedback.py
│ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_jsonl_dataset.py
│ │ │ ├── internlm2_chat_1_8b_reward_full_varlenattn_ultrafeedback.py
│ │ │ └── internlm2_chat_1_8b_reward_qlora_varlenattn_ultrafeedback.py
│ │ └── llama/
│ │ └── llama3_8b_instruct_reward_full_varlenattn_ultrafeedback.py
│ ├── starcoder/
│ │ └── starcoder_qlora_stack_exchange_example.py
│ ├── yi/
│ │ ├── yi_34b/
│ │ │ └── yi_34b_qlora_alpaca_enzh_e3.py
│ │ └── yi_6b/
│ │ └── yi_6b_qlora_alpaca_enzh_e3.py
│ └── zephyr/
│ └── zephyr_7b_beta_qlora_alpaca_e3.py
├── dataset/
│ ├── __init__.py
│ ├── collate_fns/
│ │ ├── __init__.py
│ │ ├── default_collate_fn.py
│ │ ├── mmlu_collate_fn.py
│ │ └── preference_collate_fn.py
│ ├── concat_dataset.py
│ ├── huggingface.py
│ ├── intern_repo.py
│ ├── json_dataset.py
│ ├── llava.py
│ ├── map_fns/
│ │ ├── __init__.py
│ │ ├── dataset_map_fns/
│ │ │ ├── __init__.py
│ │ │ ├── alpaca_map_fn.py
│ │ │ ├── alpaca_zh_map_fn.py
│ │ │ ├── arxiv_map_fn.py
│ │ │ ├── code_alpaca_map_fn.py
│ │ │ ├── colors_map_fn.py
│ │ │ ├── crime_kg_assitant_map_fn.py
│ │ │ ├── default_map_fn.py
│ │ │ ├── law_reference_map_fn.py
│ │ │ ├── llava_map_fn.py
│ │ │ ├── medical_map_fn.py
│ │ │ ├── msagent_map_fn.py
│ │ │ ├── oasst1_map_fn.py
│ │ │ ├── openai_map_fn.py
│ │ │ ├── openorca_map_fn.py
│ │ │ ├── pretrain_map_fn.py
│ │ │ ├── sql_map_fn.py
│ │ │ ├── stack_exchange_map_fn.py
│ │ │ ├── tiny_codes_map_fn.py
│ │ │ └── wizardlm_map_fn.py
│ │ └── template_map_fn.py
│ ├── modelscope.py
│ ├── moss_sft.py
│ ├── preference_dataset.py
│ ├── refcoco_json.py
│ ├── samplers/
│ │ ├── __init__.py
│ │ ├── intern_repo.py
│ │ └── length_grouped.py
│ └── utils.py
├── engine/
│ ├── __init__.py
│ ├── _strategy/
│ │ ├── __init__.py
│ │ └── deepspeed.py
│ ├── hooks/
│ │ ├── __init__.py
│ │ ├── dataset_info_hook.py
│ │ ├── evaluate_chat_hook.py
│ │ ├── hf_checkpoint_hook.py
│ │ ├── throughput_hook.py
│ │ └── varlen_attn_args_to_messagehub_hook.py
│ └── runner/
│ ├── __init__.py
│ └── loops.py
├── entry_point.py
├── evaluation/
│ ├── __init__.py
│ └── metrics/
│ ├── __init__.py
│ ├── mmlu_metric.py
│ └── reward_metric.py
├── model/
│ ├── __init__.py
│ ├── dpo.py
│ ├── llava.py
│ ├── modules/
│ │ ├── __init__.py
│ │ ├── dispatch/
│ │ │ ├── __init__.py
│ │ │ ├── attention.py
│ │ │ ├── baichuan.py
│ │ │ ├── cohere.py
│ │ │ ├── deepseek_v2.py
│ │ │ ├── internlm.py
│ │ │ ├── internlm2.py
│ │ │ ├── llama.py
│ │ │ ├── mistral.py
│ │ │ ├── phi3.py
│ │ │ ├── qwen2.py
│ │ │ ├── triton_kernels/
│ │ │ │ ├── __init__.py
│ │ │ │ ├── layer_norm.py
│ │ │ │ ├── rms_norm.py
│ │ │ │ └── rotary.py
│ │ │ ├── utils.py
│ │ │ └── yi.py
│ │ └── projector/
│ │ ├── __init__.py
│ │ ├── configuration_projector.py
│ │ └── modeling_projector.py
│ ├── orpo.py
│ ├── reward.py
│ ├── sft.py
│ ├── transformers_models/
│ │ ├── __init__.py
│ │ ├── deepseek_v2/
│ │ │ ├── __init__.py
│ │ │ ├── configuration_deepseek.py
│ │ │ ├── modeling_deepseek.py
│ │ │ └── tokenization_deepseek_fast.py
│ │ └── mixtral/
│ │ ├── __init__.py
│ │ ├── configuration_mixtral.py
│ │ └── modeling_mixtral.py
│ └── utils.py
├── parallel/
│ ├── __init__.py
│ └── sequence/
│ ├── __init__.py
│ ├── attention.py
│ ├── comm.py
│ ├── data_collate.py
│ ├── reduce_loss.py
│ ├── sampler.py
│ └── setup_distributed.py
├── registry.py
├── tools/
│ ├── chat.py
│ ├── check_custom_dataset.py
│ ├── copy_cfg.py
│ ├── data_preprocess/
│ │ ├── arxiv.py
│ │ └── convert_refcoco.py
│ ├── eval_refcoco.py
│ ├── get_data_order.py
│ ├── list_cfg.py
│ ├── list_dataset_format.py
│ ├── log_dataset.py
│ ├── mmbench.py
│ ├── model_converters/
│ │ ├── merge.py
│ │ ├── modeling_internlm2_reward/
│ │ │ ├── __init__.py
│ │ │ ├── configuration_internlm2.py
│ │ │ └── modeling_internlm2.py
│ │ ├── pth_to_hf.py
│ │ └── split.py
│ ├── plugins/
│ │ ├── __init__.py
│ │ ├── api.py
│ │ ├── calculate.py
│ │ ├── search.py
│ │ └── solve.py
│ ├── process_untokenized_datasets.py
│ ├── process_untokenized_datasets_legacy.py
│ ├── process_untokenized_llava_data.py
│ ├── test.py
│ ├── tokenize_ftdp_datasets.py
│ ├── train.py
│ └── utils.py
├── utils/
│ ├── __init__.py
│ ├── constants.py
│ ├── fileio.py
│ ├── handle_moe_load_and_save.py
│ ├── stop_criteria.py
│ ├── templates.py
│ └── zero_to_any_dtype.py
└── version.py
================================================
FILE CONTENTS
================================================
================================================
FILE: .gitattributes
================================================
# Auto detect text files and perform LF normalization
logs
* text=auto
================================================
FILE: LICENSE
================================================
MIT License
Copyright (c) 2025 Yi Wang
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
================================================
FILE: README.md
================================================
[Xinhao Li](https://scholar.google.com/citations?user=evR3uR0AAAAJ), [Yi Wang](https://scholar.google.com.hk/citations?user=Xm2M8UwAAAAJ), [Jiashuo Yu](https://scholar.google.com.hk/citations?user=iH0Aq0YAAAAJ&oi=ao), [Xiangyu Zeng](https://scholar.google.com/citations?user=jS13DXkAAAAJ&hl=zh-CN), Yuhan Zhu, Haian Huang, Jianfei Gao, [Kunchang Li](https://scholar.google.com/citations?user=D4tLSbsAAAAJ), [Yinan He](https://dblp.org/pid/93/7763.html), Chenting Wang, [Yu Qiao](https://scholar.google.com/citations?user=gFtI-8QAAAAJ&hl), [Yali Wang](https://scholar.google.com/citations?user=hD948dkAAAAJ), and [Limin Wang](https://scholar.google.com/citations?user=HEuN8PcAAAAJ)
🤗 Model & Data    |   🖥️ Demo    |    📑 Paper    |    🌐 Blog
## :fire: Updates
- [x] **2025/06/13**: 🎉🎉🎉Our model achieves promising results on the [VideoEval-Pro](https://arxiv.org/abs/2505.14640) benchmark focused on long video understanding!
- [x] **2025/05/10**: 🔥🔥🔥 We have released most of the videos in our [training data](https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data). We hope you find them helpful!
- [x] **2025/03/27**: 🔥🔥 We have released our dataset and evaluation code for single-hop and multi-hop needle-in-a-haystack!
- [x] **2025/03/09**: 🔥🔥 We have released the weights from each training stage [here](https://github.com/OpenGVLab/VideoChat-Flash/blob/main/llava-train_videochat/README.); try building your own VideoChat-Flash on them!
- [x] **2025/02/25**: 🔥🔥 We have released our [training data](https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data), [training code based on LLaVA](llava-train_videochat) for VideoChat-Flash, and [training code based on XTuner](xtuner-train_internvideo2_5) for finetuning InternVideo2.5.
- [x] **2025/02/12**: 🎉🎉🎉Our VideoChat-Flash-7B@448 has achieved first place on the latest Video Detail Caption Benchmark, [AuroraCap](https://rese1f.github.io/aurora-web/).
- [x] **2025/01/15**: We provide [evaluation code](lmms-eval_videochat) for QA and grounding benchmarks.
- [x] **2025/01/12**: 🔥🔥🔥 Release of **VideoChat2-Flash**, a powerful MLLM built on a video encoder ([InternVideo](https://github.com/OpenGVLab/InternVideo)) and an LLM ([Qwen](https://github.com/QwenLM/Qwen)).
- We offer five models, [VideoChat2-Flash-2B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) (Small LLM), [VideoChat2-Flash-7B@224](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res224), [VideoChat2-Flash-7B@448](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448) (Overall best), [VideoChat-Flash-Qwen2_5-7B-1M](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224) (Super long video input) and [VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B) (Stronger short-term temporal understanding).
## 📑 Future Plan
- [ ] lmdeploy/vllm support for VideoChat-Flash and InternVideo2.5
- [ ] LoRA finetuning training code for VideoChat-Flash and InternVideo2.5
- [ ] Mixing image/video training code for InternVideo2.5
- [ ] Faster training code with XTuner for VideoChat-Flash
As I am currently very busy with work and find it difficult to complete the above plans quickly, I sincerely ask friends in the community to join in and **submit a PR**.
## :parrot: Introduction
**🚀State-of-the-art performance** in short and long video understanding, with temporal localization capabilities comparable to expert models.

**🔭Supports ultra-long video inputs**, achieving a groundbreaking needle-in-a-haystack evaluation accuracy of **99.1% on 10,000 frames**, capable of processing videos up to three hours long.

**⚡Highly efficient model architecture** with exceptional inference speed, encoding each video frame into just **16 tokens**, making it **5–10** times faster than the previous model.
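
The per-frame figure above translates directly into a visual token budget. The snippet below is only a back-of-the-envelope sketch: the function name `visual_token_count` is hypothetical and not part of this repository, and it counts only the 16 tokens per frame stated above, ignoring text tokens and any further compression the model may apply.

```python
# Rough visual token budget, assuming the 16 tokens per frame stated above.
# `visual_token_count` is an illustrative helper, not a function from this repo.
TOKENS_PER_FRAME = 16


def visual_token_count(num_frames: int, tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Return the number of visual tokens produced for `num_frames` frames."""
    if num_frames < 0 or tokens_per_frame < 0:
        raise ValueError("frame and token counts must be non-negative")
    return num_frames * tokens_per_frame


if __name__ == "__main__":
    # The 10,000-frame needle-in-a-haystack setting mentioned above.
    print(visual_token_count(10_000))  # 160000
```

Under this assumption, the 10,000-frame setting corresponds to 160,000 visual tokens before any text is added.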

## Demo & Inference
Refer to the [hf README](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448) to run inference with our model.
## Evaluation
See the [evaluation code](lmms-eval_videochat). [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval) also supports our model, so you can use it to evaluate the model on various benchmarks.
## Training
See the [training code based on LLaVA](llava-train_videochat) for VideoChat-Flash and the [training code based on XTuner](xtuner-train_internvideo2_5) for finetuning InternVideo2.5.
## :bar_chart: [NIAH](./BENCHMARK.md)

See [xtuner-eval_niah](xtuner-eval_niah) for evaluation of Single-Hop NIAH-Video and Multi-Hop NIAH-Video.
# :page_facing_up: Citation
If you find this project useful in your research, please consider citing:
```BibTeX
@article{li2024videochat,
title={VideoChat-Flash: Hierarchical Compression for Long-Context Video Modeling},
author={Li, Xinhao and Wang, Yi and Yu, Jiashuo and Zeng, Xiangyu and Zhu, Yuhan and Huang, Haian and Gao, Jianfei and Li, Kunchang and He, Yinan and Wang, Chenting and Qiao, Yu and Wang, Yali and Wang, Limin},
journal={arXiv preprint arXiv:2501.00574},
year={2024}
}
```
# :dizzy: Acknowledgement
Thanks to the following open-source projects: [InternVideo](https://github.com/OpenGVLab/InternVideo), [UMT](https://github.com/OpenGVLab/unmasked_teacher), [Qwen](https://github.com/QwenLM/Qwen), [LLaVA-VL](https://github.com/LLaVA-VL/LLaVA-NeXT), [lmms-eval](https://github.com/EvolvingLMMs-Lab/lmms-eval), [Ask-Anything](https://github.com/OpenGVLab/Ask-Anything), [ToMe](https://github.com/facebookresearch/ToMe), [LongVLM](https://github.com/ziplab/LongVLM), [FastV](https://github.com/pkunlp-icler/FastV), [LLaVolta](https://github.com/Beckschen/LLaVolta), [PyramidDrop](https://github.com/Cooperx521/PyramidDrop), and [LongVA](https://github.com/EvolvingLMMs-Lab/LongVA). Their implementations provided valuable references for our project.
================================================
FILE: llava-train_videochat/.dockerignore
================================================
# The .dockerignore file excludes files from the container build process.
#
# https://docs.docker.com/engine/reference/builder/#dockerignore-file
# Exclude Git files
.git
.github
.gitignore
# Exclude Python cache files
__pycache__
.mypy_cache
.pytest_cache
.ruff_cache
# Exclude Python virtual environment
/venv
# Exclude some weights
/openai
/liuhaotian
================================================
FILE: llava-train_videochat/.editorconfig
================================================
root = true
# Unix-style newlines with a newline ending every file
[*]
end_of_line = lf
insert_final_newline = true
trim_trailing_whitespace = true
charset = utf-8
# 4 space indentation
[*.{py,json}]
indent_style = space
indent_size = 4
# 2 space indentation
[*.{md,sh,yaml,yml}]
indent_style = space
indent_size = 2
================================================
FILE: llava-train_videochat/.gitattributes
================================================
# https://git-scm.com/docs/gitattributes
# Set the default behavior, in case people don't have core.autocrlf set.
# https://git-scm.com/docs/gitattributes#_end_of_line_conversion
* text=auto
# common python attributes, taken from https://github.com/alexkaratarakis/gitattributes/blob/710900479a2bedeec7003d381719521ffbb18bf8/Python.gitattributes
# Source files
# ============
*.pxd text diff=python
*.py text diff=python
*.py3 text diff=python
*.pyw text diff=python
*.pyx text diff=python
*.pyz text diff=python
*.pyi text diff=python
# Binary files
# ============
*.db binary
*.p binary
*.pkl binary
*.pickle binary
*.pyc binary export-ignore
*.pyo binary export-ignore
*.pyd binary
# Jupyter notebook
*.ipynb text eol=lf
================================================
FILE: llava-train_videochat/.gitignore
================================================
# Python
__pycache__
*.pyc
*.egg-info
dist
# Log
*.log
*.log.*
# *.json
# *.jsonl
# Data
!**/alpaca-data-conversation.json
# Editor
.idea
*.swp
.vscode
# Other
.DS_Store
wandb
output
llavavid
checkpoints
project_checkpoints
debug_checkpoints
playground/data
playground/cc3m_llava34b_cap
ckpts*
.ipynb_checkpoints
chunyl_scripts
*.ipynb
# DevContainer
!.devcontainer/*
# Demo
serve_images/
notebooks/
logs
scripts/dist_*
logs/
submissions/
cn_scripts/
internal_project_checkpoints/
work_dirs
scripts/i18n/*
playground/.nfs028b000000010add00000001
HIP
playground/.nfs028b0000017bff2c00000012
scripts/qwen
scripts/vicuna
scripts/mistral
scripts/baseline_rep
scripts/cn_boli01_hl
scripts/cn_boli01_lf
scripts/cn_lf
scripts/cn_lq
scripts/cn_yg
scripts/cn_yg_hao
scripts/eva_encoder
scripts/i18n
scripts/i18n_higher_res
scripts/multi-images
scratchpad
build/
playground/*.json
mlx_configs/
data_processing/
# demo/
================================================
FILE: llava-train_videochat/LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: llava-train_videochat/README.md
================================================
# 👀How to train and evaluate VideoChat-Flash?🦜
## 1. Prepare Training Data
Our training data was collected and used across different projects and by different people, so it is not hosted in a single place. For data that has already been uploaded, we point you to where it can be obtained; please collect the relevant pieces and assemble them in your own environment. We use a data format similar to [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/main/scripts/train). ***You can customize your own training data in this format***.
In [data](./data), we list the data used in each training stage, along with the corresponding annotation locations. We have made all the data annotations and some of the videos available on [OpenGVLab/VideoChat-Flash-Training-Data](https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data), and we have listed the source URL of every video in the annotation files.
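The stage configs under `data/` list one entry per dataset with a `json_path`, a `sampling_strategy` such as `"first:25%"`, and a `data_root`. The sketch below shows one plausible reading of those strategy strings. The function name, the `end`/`random` modes, and the rounding rule are assumptions modelled on LLaVA-NeXT-style loaders; they have not been checked against this repository's `llava/train/train.py`.

```python
import math
import random


def apply_sampling_strategy(records, strategy="all", seed=0):
    """Subsample annotation records using a 'mode:amount' string.

    Assumed semantics (not verified against this repository's loader):
      "all"        -> keep everything
      "first:25%"  -> keep the first 25% of the records
      "end:100"    -> keep the last 100 records
      "random:10%" -> keep a random 10% of the records
    """
    if strategy == "all":
        return list(records)
    mode, amount = strategy.split(":")
    if amount.endswith("%"):
        # Percentages are rounded up so small datasets never end up empty.
        count = math.ceil(float(amount[:-1]) * len(records) / 100)
    else:
        count = int(amount)
    if mode == "first":
        return list(records[:count])
    if mode == "end":
        return list(records[-count:]) if count > 0 else []
    if mode == "random":
        shuffled = list(records)
        random.Random(seed).shuffle(shuffled)
        return shuffled[:count]
    raise ValueError(f"unknown sampling strategy: {strategy!r}")


# Hypothetical entry mirroring the YAML layout used under data/.
entry = {
    "json_path": "annotations/image/textcaps.json",
    "sampling_strategy": "first:25%",
}
records = [{"id": i} for i in range(21942)]  # stand-in for the loaded JSON list
subset = apply_sampling_strategy(records, entry["sampling_strategy"])
print(f"{entry['json_path']}: kept {len(subset)} of {len(records)} records")
```

Taking a fixed prefix rather than a random sample keeps the ablation subset reproducible across runs, which may be why the ablation config uses `first:` throughout.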
## 2. Training
| Stage | Num. frames | ViT | Connector | LLM | CKPT |
|--------|:-------:|:------:|:------:|:------:|:------:|
| [stage1](scripts/train/stage1-init_connector) | 4 | :snowflake: | :fire: | :snowflake: | [all projector weights](https://huggingface.co/OpenGVLab/stage1-mm-projectors/tree/main) |
| [stage2](scripts/train/stage2-visual_pretraining) | 4-8 | :fire: | :fire: | :fire: | [UMT-Qwen2_7B](https://huggingface.co/OpenGVLab/stage2-UMT-Qwen2-7B-tome16_mlp), [UMT-Qwen2_5_1M_7B](https://huggingface.co/OpenGVLab/stage2-UMT-Qwen2_5_7B_1m-tome16_mlp), [UMT-HD-Qwen2_5_2B](https://huggingface.co/OpenGVLab/stage2-UMT-Qwen2_5_1.5B-tome16_mlp), [InternVideo2-Qwen2_5_7B](https://huggingface.co/OpenGVLab/stage2-InternVideo2-1B-Qwen2_5-7B-tome16_mlp) |
| [stage3](scripts/train/stage3-video_sft) | 64-512 | :fire: | :fire: | :fire: | [UMT-Qwen2_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448),[UMT-HD-Qwen2_5-2B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-2B_res448),[UMT-Qwen2_5_1M_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224), [InternVideo2-Qwen2_5_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B_InternVideo2-1B) |
| [stage4](scripts/train/stage4_highres_postft) | 64-512 | :fire: | :fire: | :snowflake: | [UMT-HD-Qwen2-7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2-7B_res448)|
Training time with 32 A100 GPUs:
- stage1: under one hour
- stage2: about 2 days
- stage3: about 2-3 days
- stage4: about 2-3 days
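To budget compute, the wall-clock figures above can be converted to GPU-hours. The snippet below is a back-of-the-envelope sketch: the hour ranges are read directly from the list, while the assumption that all 32 GPUs stay busy for the whole run is ours.

```python
NUM_GPUS = 32  # A100s, as stated in the list above

# (low, high) wall-clock estimates in hours, read from the list above.
# stage1 is "under one hour", so one hour is used as its upper bound.
stage_hours = {
    "stage1": (0, 1),
    "stage2": (48, 48),
    "stage3": (48, 72),
    "stage4": (48, 72),
}


def gpu_hours(wall_clock_hours, num_gpus=NUM_GPUS):
    """GPU-hours consumed if every GPU is busy for the whole run."""
    return wall_clock_hours * num_gpus


for stage, (low, high) in stage_hours.items():
    print(f"{stage}: {gpu_hours(low)}-{gpu_hours(high)} GPU-hours")
```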
### Tips
- ***We recommend starting from stage3 with our provided stage2 model to save training cost. You can use [1/4 of the stage3 data](data/ablation_short-long_mix_sft.yaml) for ablations (as we do), and you can skip stage4 if you don't need absolute SoTA performance!***
- We use Slurm to train the model on multiple machines. **If you only have one machine or you don't use Slurm**, please refer to [LLaVA-NeXT](https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/finetune_ov.sh) to modify the scripts.
- If you fine-tune [UMT-Qwen2_5_1M_7B](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224), change [`max_position_embeddings`](https://huggingface.co/OpenGVLab/VideoChat-Flash-Qwen2_5-7B-1M_res224/blob/main/config.json#L185) to a smaller value such as 32768 to avoid CUDA OOM!
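The `max_position_embeddings` change from the last tip can be scripted instead of edited by hand. This is a minimal sketch: the helper name is hypothetical, and the demo runs on a throwaway file whose starting value (`1010000`) is a placeholder rather than the value in the released config.

```python
import json
import tempfile
from pathlib import Path


def set_max_position_embeddings(config_path, new_value=32768):
    """Rewrite max_position_embeddings in a checkpoint's config.json.

    Returns the previous value (None if the key was absent).
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    old_value = config.get("max_position_embeddings")
    config["max_position_embeddings"] = new_value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return old_value


# Demo on a throwaway file so the sketch runs without a real checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(
        json.dumps({"max_position_embeddings": 1010000}), encoding="utf-8"
    )
    previous = set_max_position_embeddings(demo_path)
    updated = json.loads(demo_path.read_text(encoding="utf-8"))[
        "max_position_embeddings"
    ]
print(f"max_position_embeddings: {previous} -> {updated}")
```

For a real run, point `config_path` at the `config.json` inside your local copy of the checkpoint and keep a backup of the original file.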
### Install
```bash
git clone https://github.com/OpenGVLab/VideoChat-Flash
cd VideoChat-Flash/llava-train_videochat
pip install -e .
```
### Stage-1: Video-Language Alignment
Please download the pretrained video encoders from [Hugging Face](https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash) first. Then modify `ckpt_path` in `build_vit` of `llava/model/multimodal_encoder/umt_encoder.py` or `llava/model/multimodal_encoder/internvideo2_encoder.py`.
```bash
bash scripts/train/stage1-init_connector/stage1_umt_tome16_res224_qwen7b.sh
```
### Stage-2: Short Video Pre-training
```bash
bash scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res224_qwen_7b.sh
```
### Stage-3: Joint Short & Long Video Instruction Tuning
```bash
bash scripts/train/stage3-video_sft/stage3_umt_tome16_res224_qwen_7b.sh
```
### Stage-4: Efficient High-Resolution Post-finetuning
Please modify `vision_tower="umt-hd-large"` in `Your_stage3_checkpoint_path/config.json` first!
```bash
bash scripts/train/stage4_highres_postft/stage4_umt_tome16_res448_qwen_7b.sh
```
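The `config.json` change required before Stage-4 can also be applied with a small script. The sketch below refuses to create a key that is not already present, because the key name `vision_tower` is taken from the instruction above and the exact name in your checkpoint's `config.json` should be confirmed first. The helper name and the demo value `"umt-large"` are hypothetical.

```python
import json
import tempfile
from pathlib import Path


def set_existing_config_key(config_path, key, value):
    """Change `key` in a JSON config, refusing to add a key that is absent.

    Raising on a missing key guards against silently adding a field that
    the training code never reads.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    if key not in config:
        raise KeyError(f"{key!r} not found in {path}; keys present: {sorted(config)}")
    config[key] = value
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config


# Demo on a throwaway file; "umt-large" is a made-up stage-3 value.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "config.json"
    demo_path.write_text(json.dumps({"vision_tower": "umt-large"}), encoding="utf-8")
    patched = set_existing_config_key(demo_path, "vision_tower", "umt-hd-large")
print(patched["vision_tower"])
```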
## 3. Evaluation
Overwrite the configuration (JSON) and Python files in your checkpoint directory with those from OpenGVLab/VideoChat-Flash; you can then evaluate the checkpoint with the lmms-eval_videochat we provide.
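The overwrite step can be sketched as a small copy routine. Everything here is an assumption for illustration: the function name, the directory names, and in particular the choice to skip `*.index.json` files (weight-shard indexes) so that a checkpoint's own shard map is not replaced. The source does not say whether those files should be copied.

```python
import shutil
import tempfile
from pathlib import Path


def overlay_release_files(release_dir, checkpoint_dir, patterns=("*.json", "*.py")):
    """Copy top-level JSON and Python files from release_dir into checkpoint_dir.

    Same-named files in checkpoint_dir are overwritten. Files ending in
    ".index.json" are skipped (an assumption, see the note above).
    Returns the sorted list of copied file names.
    """
    release_dir, checkpoint_dir = Path(release_dir), Path(checkpoint_dir)
    copied = []
    for pattern in patterns:
        for src in sorted(release_dir.glob(pattern)):
            if not src.is_file() or src.name.endswith(".index.json"):
                continue
            shutil.copy2(src, checkpoint_dir / src.name)
            copied.append(src.name)
    return sorted(copied)


# Demo with throwaway directories standing in for a downloaded release
# and a locally trained checkpoint.
with tempfile.TemporaryDirectory() as tmp:
    release = Path(tmp) / "release"
    checkpoint = Path(tmp) / "checkpoint"
    release.mkdir()
    checkpoint.mkdir()
    (release / "config.json").write_text("{}", encoding="utf-8")
    (release / "modeling_demo.py").write_text("# placeholder\n", encoding="utf-8")
    (release / "model.safetensors.index.json").write_text("{}", encoding="utf-8")
    copied = overlay_release_files(release, checkpoint)
print(copied)
```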
================================================
FILE: llava-train_videochat/cog.yaml
================================================
# Configuration for Cog ⚙️
# Reference: https://github.com/replicate/cog/blob/main/docs/yaml.md
build:
gpu: true
python_version: "3.11"
python_packages:
- "torch==2.0.1"
- "accelerate==0.21.0"
- "bitsandbytes==0.41.0"
- "deepspeed==0.9.5"
- "einops-exts==0.0.4"
- "einops==0.6.1"
- "gradio==3.35.2"
- "gradio_client==0.2.9"
- "httpx==0.24.0"
- "markdown2==2.4.10"
- "numpy==1.26.0"
- "peft==0.4.0"
- "scikit-learn==1.2.2"
- "sentencepiece==0.1.99"
- "shortuuid==1.0.11"
- "timm==0.6.13"
- "tokenizers==0.13.3"
- "torchvision==0.15.2"
- "transformers==4.31.0"
- "wandb==0.15.12"
- "wavedrom==2.0.3.post3"
- "Pygments==2.16.1"
run:
- curl -o /usr/local/bin/pget -L "https://github.com/replicate/pget/releases/download/v0.0.3/pget" && chmod +x /usr/local/bin/pget
# predict.py defines how predictions are run on your model
predict: "predict.py:Predictor"
================================================
FILE: llava-train_videochat/data/ablation_short-long_mix_sft.yaml
================================================
datasets:
# image sft datasets
- json_path: annotations/image/textcaps.json # 21942
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textcaps
- json_path: annotations/image/textocr(gpt4v).json # 25104
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textocr(gpt4v)
- json_path: annotations/image/rendered_text(cauldron)_fix.json # 9995
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/rendered_text(cauldron)
- json_path: annotations/image/iam(cauldron)_fix.json # 5658
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/iam(cauldron)
- json_path: annotations/image/llavar_gpt4_20k.json # 19790
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/llavar_gpt4_20k
- json_path: annotations/image/allava_instruct_vflan4v.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_vflan4v
- json_path: annotations/image/allava_instruct_laion4v.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_laion4v
- json_path: annotations/image/sharegpt4o.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4o
- json_path: annotations/image/sharegpt4v(coco).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(coco)
- json_path: annotations/image/sharegpt4v(knowledge).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(knowledge)
- json_path: annotations/image/sharegpt4v(llava).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(llava)
- json_path: annotations/image/sharegpt4v(sam).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(sam)
- json_path: annotations/image/tallyqa(cauldron,llava_format)_fix.json # 98675
sampling_strategy: "first:10%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/tallyqa(cauldron,llava_format) # 98680
- json_path: annotations/image/st_vqa(cauldron,llava_format)_fix.json # 17242
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/st_vqa(cauldron,llava_format) # 17247
- json_path: annotations/image/llava_next_raw_format_processed_738k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
- json_path: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data/m4_instruct_annotations.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data
# video sft datasets
- json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json
sampling_strategy: "first:20%"
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/caption_sharegemini_k400_223k.json
sampling_strategy: "first:25%"
data_root: https://opendatalab.com/OpenMMLab/Kinetics-400
- json_path: annotations/video/caption_youcook2-youcook2-train_debug_9k.json
sampling_strategy: "first:25%"
data_root: http://youcook2.eecs.umich.edu/
- json_path: annotations/video/caption_textvr-textvr-train_40k.json
sampling_strategy: "first:25%"
data_root: https://github.com/callsys/TextVR
- json_path: annotations/video/moviechat1k_caption-MovieChat-train_caption_1k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
- json_path: annotations/video/caption_favd-favd-train_10k.json
sampling_strategy: "first:25%"
data_root: https://github.com/OpenNLPLab/FAVDBench
- json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
- json_path: annotations/video/caption_sharegpt4o-sharegpt4o_3k.json
sampling_strategy: "first:25%"
data_root: https://sharegpt4o.github.io/
- json_path: annotations/video/vqa_tvqa-tvqa_123k.jsonl
sampling_strategy: "first:25%"
data_root: https://nlp.cs.unc.edu/data/jielei/tvqa/tvqa_public_html/index.html
video_read_type: img
- json_path: annotations/video/reasoning_next_qa-next_qa-train_35k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/doc-doc/NExT-QA
- json_path: annotations/video/vqa_tgif_transition_qa-tgif_transition_qa-train_53k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/reasoning_clevrer_mc-clevrer_mc-train_43k_debug_43k.jsonl
sampling_strategy: "first:25%"
data_root: http://clevrer.csail.mit.edu/
- json_path: annotations/video/reasoning_clevrer_qa-clevrer_qa-train_mc_40k.jsonl
sampling_strategy: "first:25%"
data_root: http://clevrer.csail.mit.edu/
- json_path: annotations/video/classification_k710-k710-train_40k.jsonl
sampling_strategy: "first:25%"
- json_path: annotations/video/classification_ssv2-ssv2-train_40k.jsonl
sampling_strategy: "first:25%"
data_root: https://www.qualcomm.com/developer/software/something-something-v-2-dataset
- json_path: annotations/video/lsmdc-lsmdc_297k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/site/describingmovies/
- json_path: annotations/video/vqa_rgbd-nturgbd_clean_110k.json
sampling_strategy: "first:25%"
data_root: https://rose1.ntu.edu.sg/dataset/actionRecognition/
- json_path: annotations/video/vqa_perception_train-mc_question_train_forchoice_8k.json
sampling_strategy: "first:25%"
data_root: https://github.com/google-deepmind/perception_test
- json_path: annotations/video/vqa_ego_qa-ego_qa-train_8k.jsonl
sampling_strategy: "first:25%"
data_root: https://ego4d-data.org/
- json_path: annotations/video/vqa_tgif_transition_qa_openend-openend_qa_annos-tgif_transition_qa_train_openend_53k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_frame_qa-tgif_frame_qa-train_40k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_count-openend_qa_train_openend_26839.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_action-openend_qa_train_openend_20471.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/reasoning_next_qa_oe-openend_qa_annos-next_qa_train_openend_35k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/doc-doc/NExT-QA
- json_path: annotations/video/vqa_webvid_qa-webvid_qa-train_100k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/moviechat1k_global-MovieChat-train_global_1k.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
- json_path: annotations/video/grounding_didemo-didemo-train_66k.json
sampling_strategy: "first:25%"
data_root: https://github.com/LisaAnne/TemporalLanguageRelease
- json_path: annotations/video/vqa_sharegptvideo_240k-sharegptvideo-train_240k_240k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
- json_path: annotations/video/caption_vidln_kinetics-vidln-kinetics_train_28k.json
sampling_strategy: "first:25%"
data_root: https://opendatalab.com/OpenMMLab/Kinetics_700
- json_path: annotations/video/caption_vidln_oops-vidln-oops_train_11k.json
sampling_strategy: "first:25%"
data_root: https://oops.cs.columbia.edu/
- json_path: annotations/video/caption_vidln_ovis-vidln-ovis_train_1k.json
sampling_strategy: "first:25%"
data_root: https://songbai.site/ovis/
video_read_type: img
- json_path: annotations/video/caption_vidln_uvo_sparse-vidln-uvo_sparse_train_6k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/view/unidentified-video-object/dataset
- json_path: annotations/video/caption_vidln_uvo_dense-vidln-uvo_dense_train_1k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/view/unidentified-video-object/dataset
- json_path: annotations/video/reasoning_star-star-train_46k.json
sampling_strategy: "first:25%"
data_root: https://bobbywu.com/STAR/
- json_path: annotations/video/vcg-plus_112K_clean_97k.json
sampling_strategy: "first:10%"
data_root: http://activity-net.org/
- json_path: annotations/video/vript_long_videos_en_20240911_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Mutonix/Vript
- json_path: annotations/video/vript_short_videos_en_20240911_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Mutonix/Vript
- json_path: annotations/video/guiworld_en_20241029_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://gui-world.github.io/
## llava video
- json_path: annotations/video/llava-video_2_3_m_academic_mc_v0_1_qa_processed_6901_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_nextqa_oe_qa_processed_61_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_oe_v0_1_qa_processed_420200_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_oe_v0_1_qa_processed_26302_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_mc_v0_1_qa_processed_39710_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_nextqa_oe_qa_processed_6843_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_mc_v0_1_qa_processed_39967_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_academic_v0_1_cap_processed_3124_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_oe_v0_1_qa_processed_57924_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_v0_1_cap_processed_24685_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_mc_v0_1_qa_processed_39927_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_activitynetqa_oe_qa_processed_2950_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_nextqa_oe_qa_processed_4694_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_oe_v0_1_qa_processed_110624_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_mc_v0_1_qa_processed_4241_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_mc_v0_1_qa_processed_39353_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_activitynetqa_oe_qa_processed_4530_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_oe_v0_1_qa_processed_137645_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_mc_v0_1_qa_processed_20346_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_v0_1_cap_processed_19995_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_nextqa_mc_qa_processed_5496_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_mc_v0_1_qa_processed_5753_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_oe_v0_1_qa_processed_141495_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_nextqa_mc_qa_processed_4633_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_activitynetqa_oe_qa_processed_7460_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_nextqa_mc_qa_processed_52_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_activitynetqa_oe_qa_processed_8590_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_v0_1_cap_processed_4627_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_v0_1_cap_processed_10514_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_v0_1_cap_processed_24234_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_nextqa_mc_qa_processed_6843_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_nextqa_oe_qa_processed_5492_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_oe_v0_1_qa_processed_48468_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_v0_1_cap_processed_79346_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_academic_oe_v0_1_qa_processed_18134_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_perceptiontest_mc_qa_processed_1785_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_perceptiontest_mc_qa_processed_618_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_v0_1_cap_processed_11985_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/timeit_ANet-TimeIT-Activitynet_Captions_11k.json
sampling_strategy: "first:25%"
data_root: http://activity-net.org//train
- json_path: annotations/video/timeit_COIN-TimeIT-COIN_10k.json
sampling_strategy: "first:25%"
data_root: https://coin-dataset.github.io/
- json_path: annotations/video/timeit_DiDeMo-TimeIT-DiDeMo_33k.json
sampling_strategy: "first:25%"
data_root: https://github.com/LisaAnne/TemporalLanguageRelease
- json_path: annotations/video/timeit_HiREST-TimeIT-HiREST_1k.json
sampling_strategy: "first:25%"
data_root: https://hirest-cvpr2023.github.io/
- json_path: annotations/video/timeit_QuerYD-TimeIT-QuerYD_15k.json
sampling_strategy: "first:25%"
data_root: https://www.robots.ox.ac.uk/~vgg/data/queryd/
- json_path: annotations/video/timeit_ViTT-TimeIT-ViTT_6k.json
sampling_strategy: "first:25%"
data_root: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT
- json_path: annotations/video/grounding_ANetRTL-ActivityNet-RTL-ANet_RTL_34k.json
sampling_strategy: "first:25%"
data_root: http://activity-net.org//train
- json_path: annotations/video/grounding_ANetHL-ANet-HL-ANet_HL2_11k.json
sampling_strategy: "first:25%"
data_root: http://activity-net.org//train
- json_path: annotations/video/htstep_eventunderstanding-longvideo_annos-htstep_eventunderstanding_1k_1k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/htstep_eventcount-longvideo_annos-htstep_eventcount_2k_2k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/htstep_eventrelationship-longvideo_annos-htstep_eventrelationship_1k_1k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/ego4dhcap_eventunderstanding-longvideo_annos-ego4dhcap_eventunderstanding_2k_2k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
================================================
FILE: llava-train_videochat/data/stage1_init_connector_iv1m.yaml
================================================
datasets:
- json_path: OpenGVLab/VideoChat-Flash-Training-Data/annotations/video/smit_caption_481k.json
sampling_strategy: all
data_root: http://moments.csail.mit.edu/spoken.html
- json_path: OpenGVLab/VideoChat-Flash-Training-Data/annotations/image/blip_laion_cc_sbu_558k.json
sampling_strategy: all
data_root: https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain
================================================
FILE: llava-train_videochat/data/stage2_short_pretrain_iv6m.yaml
================================================
datasets:
- json_path: annotations/image/LLaVA-ReCap-118K.json
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-118K
- json_path: annotations/image/LLaVA-ReCap-CC3M.json
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-CC3M
- json_path: annotations/image/LLaVA-ReCap-558K.json
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-ReCap-558K
- json_path: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/evol_instruct/evol_instruct_processed.json
sampling_strategy: all
- json_path: annotations/video/webvid-fuse_caption_2m.json
sampling_strategy: all
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json
sampling_strategy: all
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/caption_sharegemini_k400_223k.json
sampling_strategy: all
data_root: https://opendatalab.com/OpenMMLab/Kinetics-400
- json_path: annotations/image/ureader_tr_processed.json
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/ureader_tr/
sampling_strategy: all
- json_path: annotations/image/synthdog_zh_processed.json
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/synthdog_zh/synthdog_zh_images/
sampling_strategy: all
- json_path: annotations/image/synthdog_en_processed.json
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/synthdog_en/synthdog_en_images/
sampling_strategy: all
- json_path: annotations/video/smit_caption_481k.json
sampling_strategy: all
data_root: http://moments.csail.mit.edu/spoken.html
- json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json
sampling_strategy: all
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
================================================
FILE: llava-train_videochat/data/stage3_short-long_mix_sft.yaml
================================================
datasets:
# image sft datasets
- json_path: annotations/image/textcaps.json # 21942
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textcaps
- json_path: annotations/image/textocr(gpt4v).json # 25104
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textocr(gpt4v)
- json_path: annotations/image/rendered_text(cauldron)_fix.json # 9995
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/rendered_text(cauldron)
- json_path: annotations/image/iam(cauldron)_fix.json # 5658
sampling_strategy: all
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/iam(cauldron)
- json_path: annotations/image/llavar_gpt4_20k.json # 19790
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/llavar_gpt4_20k
- json_path: annotations/image/allava_instruct_vflan4v.json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_vflan4v
- json_path: annotations/image/allava_instruct_laion4v.json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_laion4v
- json_path: annotations/image/sharegpt4o.json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4o
- json_path: annotations/image/sharegpt4v(coco).json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(coco)
- json_path: annotations/image/sharegpt4v(knowledge).json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(knowledge)
- json_path: annotations/image/sharegpt4v(llava).json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(llava)
- json_path: annotations/image/sharegpt4v(sam).json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(sam)
- json_path: annotations/image/tallyqa(cauldron,llava_format)_fix.json # 98675
sampling_strategy: "first:10%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/tallyqa(cauldron,llava_format) # 98680
- json_path: annotations/image/st_vqa(cauldron,llava_format)_fix.json # 17242
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/st_vqa(cauldron,llava_format) # 17247
- json_path: annotations/image/llava_next_raw_format_processed_738k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
- json_path: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data/m4_instruct_annotations.json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data
# video sft datasets
- json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json
sampling_strategy: "first:20%"
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/caption_sharegemini_k400_223k.json
sampling_strategy: "all"
data_root: https://opendatalab.com/OpenMMLab/Kinetics-400
- json_path: annotations/video/caption_youcook2-youcook2-train_debug_9k.json
sampling_strategy: "all"
data_root: http://youcook2.eecs.umich.edu/
- json_path: annotations/video/caption_textvr-textvr-train_40k.json
sampling_strategy: "all"
data_root: https://github.com/callsys/TextVR
- json_path: annotations/video/moviechat1k_caption-MovieChat-train_caption_1k.json
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
- json_path: annotations/video/caption_favd-favd-train_10k.json
sampling_strategy: "first:25%"
data_root: https://github.com/OpenNLPLab/FAVDBench
- json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
- json_path: annotations/video/caption_sharegpt4o-sharegpt4o_3k.json
sampling_strategy: all
data_root: https://sharegpt4o.github.io/
- json_path: annotations/video/vqa_tvqa-tvqa_123k.jsonl
sampling_strategy: "all"
data_root: https://nlp.cs.unc.edu/data/jielei/tvqa/tvqa_public_html/index.html
video_read_type: img
- json_path: annotations/video/reasoning_next_qa-next_qa-train_35k.jsonl
sampling_strategy: all
data_root: https://github.com/doc-doc/NExT-QA
- json_path: annotations/video/vqa_tgif_transition_qa-tgif_transition_qa-train_53k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/reasoning_clevrer_mc-clevrer_mc-train_43k_debug_43k.jsonl
sampling_strategy: all
data_root: http://clevrer.csail.mit.edu/
- json_path: annotations/video/reasoning_clevrer_qa-clevrer_qa-train_mc_40k.jsonl
sampling_strategy: all
data_root: http://clevrer.csail.mit.edu/
- json_path: annotations/video/classification_k710-k710-train_40k.jsonl
sampling_strategy: "first:25%"
- json_path: annotations/video/classification_ssv2-ssv2-train_40k.jsonl
sampling_strategy: "first:25%"
data_root: https://www.qualcomm.com/developer/software/something-something-v-2-dataset
- json_path: annotations/video/lsmdc-lsmdc_297k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/site/describingmovies/
- json_path: annotations/video/vqa_rgbd-nturgbd_clean_110k.json
sampling_strategy: "first:25%"
data_root: https://rose1.ntu.edu.sg/dataset/actionRecognition/
- json_path: annotations/video/vqa_perception_train-mc_question_train_forchoice_8k.json
sampling_strategy: all
data_root: https://github.com/google-deepmind/perception_test
- json_path: annotations/video/vqa_ego_qa-ego_qa-train_8k.jsonl
sampling_strategy: "all"
data_root: https://ego4d-data.org/
- json_path: annotations/video/vqa_tgif_transition_qa_openend-openend_qa_annos-tgif_transition_qa_train_openend_53k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_frame_qa-tgif_frame_qa-train_40k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_count-openend_qa_train_openend_26839.jsonl
sampling_strategy: "all"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_action-openend_qa_train_openend_20471.jsonl
sampling_strategy: "all"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/reasoning_next_qa_oe-openend_qa_annos-next_qa_train_openend_35k.jsonl
sampling_strategy: all
data_root: https://github.com/doc-doc/NExT-QA
- json_path: annotations/video/vqa_webvid_qa-webvid_qa-train_100k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/moviechat1k_global-MovieChat-train_global_1k.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
- json_path: annotations/video/grounding_didemo-didemo-train_66k.json
sampling_strategy: all
data_root: https://github.com/LisaAnne/TemporalLanguageRelease
- json_path: annotations/video/vqa_sharegptvideo_240k-sharegptvideo-train_240k_240k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
- json_path: annotations/video/caption_vidln_kinetics-vidln-kinetics_train_28k.json
sampling_strategy: all
data_root: https://opendatalab.com/OpenMMLab/Kinetics_700
- json_path: annotations/video/caption_vidln_oops-vidln-oops_train_11k.json
sampling_strategy: all
data_root: https://oops.cs.columbia.edu/
- json_path: annotations/video/caption_vidln_ovis-vidln-ovis_train_1k.json
sampling_strategy: all
data_root: https://songbai.site/ovis/
video_read_type: img
- json_path: annotations/video/caption_vidln_uvo_sparse-vidln-uvo_sparse_train_6k.json
sampling_strategy: all
data_root: https://sites.google.com/view/unidentified-video-object/dataset
- json_path: annotations/video/caption_vidln_uvo_dense-vidln-uvo_dense_train_1k.json
sampling_strategy: all
data_root: https://sites.google.com/view/unidentified-video-object/dataset
- json_path: annotations/video/reasoning_star-star-train_46k.json
sampling_strategy: all
data_root: https://bobbywu.com/STAR/
- json_path: annotations/video/vcg-plus_112K_clean_97k.json
sampling_strategy: "first:10%"
data_root: http://activity-net.org/
- json_path: annotations/video/vript_long_videos_en_20240911_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Mutonix/Vript
- json_path: annotations/video/vript_short_videos_en_20240911_fix.jsonl
sampling_strategy: all
data_root: https://huggingface.co/datasets/Mutonix/Vript
- json_path: annotations/video/guiworld_en_20241029_fix.jsonl
sampling_strategy: "all"
data_root: https://gui-world.github.io/
## llava video
- json_path: annotations/video/llava-video_2_3_m_academic_mc_v0_1_qa_processed_6901_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_nextqa_oe_qa_processed_61_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_oe_v0_1_qa_processed_420200_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_oe_v0_1_qa_processed_26302_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_mc_v0_1_qa_processed_39710_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_nextqa_oe_qa_processed_6843_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_mc_v0_1_qa_processed_39967_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_academic_v0_1_cap_processed_3124_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_oe_v0_1_qa_processed_57924_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_v0_1_cap_processed_24685_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_mc_v0_1_qa_processed_39927_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_activitynetqa_oe_qa_processed_2950_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_nextqa_oe_qa_processed_4694_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_oe_v0_1_qa_processed_110624_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_mc_v0_1_qa_processed_4241_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_mc_v0_1_qa_processed_39353_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_activitynetqa_oe_qa_processed_4530_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_oe_v0_1_qa_processed_137645_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_mc_v0_1_qa_processed_20346_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_v0_1_cap_processed_19995_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_nextqa_mc_qa_processed_5496_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_mc_v0_1_qa_processed_5753_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_oe_v0_1_qa_processed_141495_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_nextqa_mc_qa_processed_4633_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_activitynetqa_oe_qa_processed_7460_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_nextqa_mc_qa_processed_52_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_activitynetqa_oe_qa_processed_8590_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_v0_1_cap_processed_4627_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_v0_1_cap_processed_10514_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_v0_1_cap_processed_24234_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_nextqa_mc_qa_processed_6843_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_nextqa_oe_qa_processed_5492_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_oe_v0_1_qa_processed_48468_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_v0_1_cap_processed_79346_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_academic_oe_v0_1_qa_processed_18134_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_perceptiontest_mc_qa_processed_1785_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_perceptiontest_mc_qa_processed_618_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_v0_1_cap_processed_11985_with_duration.jsonl
sampling_strategy: "all"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/timeit_ANet-TimeIT-Activitynet_Captions_11k.json
sampling_strategy: all
data_root: http://activity-net.org//train
- json_path: annotations/video/timeit_COIN-TimeIT-COIN_10k.json
sampling_strategy: all
data_root: https://coin-dataset.github.io/
- json_path: annotations/video/timeit_DiDeMo-TimeIT-DiDeMo_33k.json
sampling_strategy: all
data_root: https://github.com/LisaAnne/TemporalLanguageRelease
- json_path: annotations/video/timeit_HiREST-TimeIT-HiREST_1k.json
sampling_strategy: all
data_root: https://hirest-cvpr2023.github.io/
- json_path: annotations/video/timeit_QuerYD-TimeIT-QuerYD_15k.json
sampling_strategy: all
data_root: https://www.robots.ox.ac.uk/~vgg/data/queryd/
- json_path: annotations/video/timeit_ViTT-TimeIT-ViTT_6k.json
sampling_strategy: all
data_root: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT
- json_path: annotations/video/grounding_ANetRTL-ActivityNet-RTL-ANet_RTL_34k.json
sampling_strategy: all
data_root: http://activity-net.org//train
- json_path: annotations/video/grounding_ANetHL-ANet-HL-ANet_HL2_11k.json
sampling_strategy: all
data_root: http://activity-net.org//train
- json_path: annotations/video/htstep_eventunderstanding-longvideo_annos-htstep_eventunderstanding_1k_1k.json
sampling_strategy: all
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/htstep_eventcount-longvideo_annos-htstep_eventcount_2k_2k.json
sampling_strategy: all
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/htstep_eventrelationship-longvideo_annos-htstep_eventrelationship_1k_1k.json
sampling_strategy: all
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/ego4dhcap_eventunderstanding-longvideo_annos-ego4dhcap_eventunderstanding_2k_2k.json
sampling_strategy: all
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
================================================
FILE: llava-train_videochat/data/stage4_highres_postsft.yaml
================================================
datasets:
# image sft datasets, 60k
- json_path: annotations/image/synthdog_zh_processed.json
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/synthdog_zh/synthdog_zh_images/
sampling_strategy: "first:10%"
- json_path: annotations/image/synthdog_en_processed.json
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Mid-Data/tree/main/synthdog_en/synthdog_en_images/
sampling_strategy: "first:10%"
- json_path: annotations/image/textcaps.json # 21942
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textcaps
- json_path: annotations/image/textocr(gpt4v).json # 25104
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/textocr(gpt4v)
- json_path: annotations/image/rendered_text(cauldron)_fix.json # 9995
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/rendered_text(cauldron)
- json_path: annotations/image/iam(cauldron)_fix.json # 5658
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/iam(cauldron)
- json_path: annotations/image/llavar_gpt4_20k.json # 19790
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/llavar_gpt4_20k
- json_path: annotations/image/allava_instruct_vflan4v.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_vflan4v
- json_path: annotations/image/allava_instruct_laion4v.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/allava_instruct_laion4v
- json_path: annotations/image/sharegpt4o.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4o
- json_path: annotations/image/sharegpt4v(coco).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(coco)
- json_path: annotations/image/sharegpt4v(knowledge).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(knowledge)
- json_path: annotations/image/sharegpt4v(llava).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(llava)
- json_path: annotations/image/sharegpt4v(sam).json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/sharegpt4v(sam)
- json_path: annotations/image/tallyqa(cauldron,llava_format)_fix.json # 98675
sampling_strategy: "first:10%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/tallyqa(cauldron,llava_format) # 98680
- json_path: annotations/image/st_vqa(cauldron,llava_format)_fix.json # 17242
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-OneVision-Data/image/st_vqa(cauldron,llava_format) # 17247
- json_path: annotations/image/llava_next_raw_format_processed_738k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data
- json_path: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data/m4_instruct_annotations.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/M4-Instruct-Data
# video sft datasets
- json_path: annotations/video/caption_sharegemini_webvid_core100k_clean.json
sampling_strategy: "first:20%"
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/caption_sharegemini_k400_223k.json
sampling_strategy: "first:25%"
data_root: https://opendatalab.com/OpenMMLab/Kinetics-400
- json_path: annotations/video/caption_youcook2-youcook2-train_debug_9k.json
sampling_strategy: "first:25%"
data_root: http://youcook2.eecs.umich.edu/
- json_path: annotations/video/caption_textvr-textvr-train_40k.json
sampling_strategy: "first:25%"
data_root: https://github.com/callsys/TextVR
- json_path: annotations/video/moviechat1k_caption-MovieChat-train_caption_1k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
- json_path: annotations/video/caption_favd-favd-train_10k.json
sampling_strategy: "first:25%"
data_root: https://github.com/OpenNLPLab/FAVDBench
- json_path: annotations/video/caption_sharegptvideo_300k-sharegptvideo-train_300k_302k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
- json_path: annotations/video/caption_sharegpt4o-sharegpt4o_3k.json
sampling_strategy: "first:25%"
data_root: https://sharegpt4o.github.io/
- json_path: annotations/video/vqa_tvqa-tvqa_123k.jsonl
sampling_strategy: "first:25%"
data_root: https://nlp.cs.unc.edu/data/jielei/tvqa/tvqa_public_html/index.html
video_read_type: img
- json_path: annotations/video/reasoning_next_qa-next_qa-train_35k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/doc-doc/NExT-QA
- json_path: annotations/video/vqa_tgif_transition_qa-tgif_transition_qa-train_53k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/reasoning_clevrer_mc-clevrer_mc-train_43k_debug_43k.jsonl
sampling_strategy: "first:25%"
data_root: http://clevrer.csail.mit.edu/
- json_path: annotations/video/reasoning_clevrer_qa-clevrer_qa-train_mc_40k.jsonl
sampling_strategy: "first:25%"
data_root: http://clevrer.csail.mit.edu/
- json_path: annotations/video/classification_k710-k710-train_40k.jsonl
sampling_strategy: "first:25%"
- json_path: annotations/video/classification_ssv2-ssv2-train_40k.jsonl
sampling_strategy: "first:25%"
data_root: https://www.qualcomm.com/developer/software/something-something-v-2-dataset
- json_path: annotations/video/lsmdc-lsmdc_297k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/site/describingmovies/
- json_path: annotations/video/vqa_rgbd-nturgbd_clean_110k.json
sampling_strategy: "first:25%"
data_root: https://rose1.ntu.edu.sg/dataset/actionRecognition/
- json_path: annotations/video/vqa_perception_train-mc_question_train_forchoice_8k.json
sampling_strategy: "first:25%"
data_root: https://github.com/google-deepmind/perception_test
- json_path: annotations/video/vqa_ego_qa-ego_qa-train_8k.jsonl
sampling_strategy: "first:25%"
data_root: https://ego4d-data.org/
- json_path: annotations/video/vqa_tgif_transition_qa_openend-openend_qa_annos-tgif_transition_qa_train_openend_53k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_frame_qa-tgif_frame_qa-train_40k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_count-openend_qa_train_openend_26839.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/vqa_tgif_action-openend_qa_train_openend_20471.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/YunseokJANG/tgif-qa
video_read_type: gif
- json_path: annotations/video/reasoning_next_qa_oe-openend_qa_annos-next_qa_train_openend_35k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/doc-doc/NExT-QA
- json_path: annotations/video/vqa_webvid_qa-webvid_qa-train_100k.jsonl
sampling_strategy: "first:25%"
data_root: https://github.com/m-bain/webvid
- json_path: annotations/video/moviechat1k_global-MovieChat-train_global_1k.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Enxin/MovieChat-1K_train
- json_path: annotations/video/grounding_didemo-didemo-train_66k.json
sampling_strategy: "first:25%"
data_root: https://github.com/LisaAnne/TemporalLanguageRelease
- json_path: annotations/video/vqa_sharegptvideo_240k-sharegptvideo-train_240k_240k.json
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/ShareGPTVideo/train_video_and_instruction/tree/main/train_300k
video_read_type: img
- json_path: annotations/video/caption_vidln_kinetics-vidln-kinetics_train_28k.json
sampling_strategy: "first:25%"
data_root: https://opendatalab.com/OpenMMLab/Kinetics_700
- json_path: annotations/video/caption_vidln_oops-vidln-oops_train_11k.json
sampling_strategy: "first:25%"
data_root: https://oops.cs.columbia.edu/
- json_path: annotations/video/caption_vidln_ovis-vidln-ovis_train_1k.json
sampling_strategy: "first:25%"
data_root: https://songbai.site/ovis/
video_read_type: img
- json_path: annotations/video/caption_vidln_uvo_sparse-vidln-uvo_sparse_train_6k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/view/unidentified-video-object/dataset
- json_path: annotations/video/caption_vidln_uvo_dense-vidln-uvo_dense_train_1k.json
sampling_strategy: "first:25%"
data_root: https://sites.google.com/view/unidentified-video-object/dataset
- json_path: annotations/video/reasoning_star-star-train_46k.json
sampling_strategy: "first:25%"
data_root: https://bobbywu.com/STAR/
- json_path: annotations/video/vcg-plus_112K_clean_97k.json
sampling_strategy: "first:10%"
data_root: http://activity-net.org/
- json_path: annotations/video/vript_long_videos_en_20240911_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Mutonix/Vript
- json_path: annotations/video/vript_short_videos_en_20240911_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/Mutonix/Vript
- json_path: annotations/video/guiworld_en_20241029_fix.jsonl
sampling_strategy: "first:25%"
data_root: https://gui-world.github.io/
## llava video
- json_path: annotations/video/llava-video_2_3_m_academic_mc_v0_1_qa_processed_6901_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_nextqa_oe_qa_processed_61_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_oe_v0_1_qa_processed_420200_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_oe_v0_1_qa_processed_26302_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_mc_v0_1_qa_processed_39710_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_nextqa_oe_qa_processed_6843_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_mc_v0_1_qa_processed_39967_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_academic_v0_1_cap_processed_3124_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_oe_v0_1_qa_processed_57924_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_v0_1_cap_processed_24685_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_mc_v0_1_qa_processed_39927_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_activitynetqa_oe_qa_processed_2950_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_nextqa_oe_qa_processed_4694_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_oe_v0_1_qa_processed_110624_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_mc_v0_1_qa_processed_4241_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_mc_v0_1_qa_processed_39353_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_activitynetqa_oe_qa_processed_4530_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_oe_v0_1_qa_processed_137645_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_mc_v0_1_qa_processed_20346_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_youtube_v0_1_cap_processed_19995_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_nextqa_mc_qa_processed_5496_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_mc_v0_1_qa_processed_5753_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_youtube_oe_v0_1_qa_processed_141495_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_nextqa_mc_qa_processed_4633_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_activitynetqa_oe_qa_processed_7460_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_nextqa_mc_qa_processed_52_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_activitynetqa_oe_qa_processed_8590_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_academic_v0_1_cap_processed_4627_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_academic_v0_1_cap_processed_10514_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_1_2_m_youtube_v0_1_cap_processed_24234_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_nextqa_mc_qa_processed_6843_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_nextqa_oe_qa_processed_5492_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_oe_v0_1_qa_processed_48468_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_youtube_v0_1_cap_processed_79346_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_2_3_m_academic_oe_v0_1_qa_processed_18134_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_perceptiontest_mc_qa_processed_1785_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_30_60_s_perceptiontest_mc_qa_processed_618_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/llava-video_0_30_s_academic_v0_1_cap_processed_11985_with_duration.jsonl
sampling_strategy: "first:25%"
data_root: https://huggingface.co/datasets/lmms-lab/LLaVA-Video-178K
- json_path: annotations/video/timeit_ANet-TimeIT-Activitynet_Captions_11k.json
sampling_strategy: "first:25%"
data_root: http://activity-net.org//train
- json_path: annotations/video/timeit_COIN-TimeIT-COIN_10k.json
sampling_strategy: "first:25%"
data_root: https://coin-dataset.github.io/
- json_path: annotations/video/timeit_DiDeMo-TimeIT-DiDeMo_33k.json
sampling_strategy: "first:25%"
data_root: https://github.com/LisaAnne/TemporalLanguageRelease
- json_path: annotations/video/timeit_HiREST-TimeIT-HiREST_1k.json
sampling_strategy: "first:25%"
data_root: https://hirest-cvpr2023.github.io/
- json_path: annotations/video/timeit_QuerYD-TimeIT-QuerYD_15k.json
sampling_strategy: "first:25%"
data_root: https://www.robots.ox.ac.uk/~vgg/data/queryd/
- json_path: annotations/video/timeit_ViTT-TimeIT-ViTT_6k.json
sampling_strategy: "first:25%"
data_root: https://github.com/google-research-datasets/Video-Timeline-Tags-ViTT
- json_path: annotations/video/grounding_ANetRTL-ActivityNet-RTL-ANet_RTL_34k.json
sampling_strategy: "first:25%"
data_root: http://activity-net.org//train
- json_path: annotations/video/grounding_ANetHL-ANet-HL-ANet_HL2_11k.json
sampling_strategy: "first:25%"
data_root: http://activity-net.org//train
- json_path: annotations/video/htstep_eventunderstanding-longvideo_annos-htstep_eventunderstanding_1k_1k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/htstep_eventcount-longvideo_annos-htstep_eventcount_2k_2k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/htstep_eventrelationship-longvideo_annos-htstep_eventrelationship_1k_1k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
- json_path: annotations/video/ego4dhcap_eventunderstanding-longvideo_annos-ego4dhcap_eventunderstanding_2k_2k.json
sampling_strategy: "first:25%"
video_read_type: img
data_root: https://huggingface.co/datasets/OpenGVLab/VideoChat-Flash-Training-Data/tree/main/longvid_subset
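The `sampling_strategy` values above (for example `"first:25%"`) read as "keep the first N percent of this annotation file". The sketch below illustrates one way such a string could be interpreted. The helper name `apply_sampling`, the `"all"` default, and the round-up behaviour are assumptions made for illustration; the repository's own loader in `llava/train/train.py` may differ.

```python
import math


def apply_sampling(records, sampling_strategy="all"):
    """Return the subset of `records` selected by a string such as "first:25%".

    Hypothetical helper: only "all", "first:N" and "first:N%" are handled here.
    """
    if sampling_strategy == "all":
        return list(records)
    name, _, amount = sampling_strategy.partition(":")
    if name != "first" or not amount:
        raise ValueError(f"unsupported sampling_strategy: {sampling_strategy!r}")
    if amount.endswith("%"):
        # Percentage of the annotation list, rounded up (rounding mode is an assumption).
        count = math.ceil(int(amount[:-1]) * len(records) / 100)
    else:
        # Plain integer: an absolute number of samples.
        count = int(amount)
    return list(records[:count])


# Toy annotation list standing in for the contents of one json_path file.
annotations = [{"video": f"clip_{i}.mp4"} for i in range(8)]
subset = apply_sampling(annotations, "first:25%")
print(len(subset))  # prints 2
```

Under this reading, an entry with `sampling_strategy: "first:10%"` contributes one tenth of its annotations to the training mix, while the `"first:25%"` entries contribute a quarter each.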
================================================
FILE: llava-train_videochat/llava/__init__.py
================================================
from .model import LlavaQwenForCausalLM
from .train.train import LazySupervisedDataset, DataCollatorForSupervisedDataset
================================================
FILE: llava-train_videochat/llava/constants.py
================================================
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
# Model Constants
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
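A short sketch of how these constants are typically used in LLaVA-style code: `IMAGE_TOKEN_INDEX` marks where visual features are spliced into the token sequence, and `IGNORE_INDEX` masks prompt positions out of the loss (-100 is also the default `ignore_index` of PyTorch's cross-entropy loss). The placeholder string `<image>`, the toy tokenizer, and both helper names are assumptions of this example, not code from the repository.

```python
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
IMAGE_PLACEHOLDER = "<image>"  # assumed placeholder string (usual LLaVA convention)


def toy_tokenize(text):
    """Hypothetical stand-in for a real tokenizer: one id per word (its length)."""
    return [len(word) for word in text.split()]


def insert_image_placeholders(prompt, tokenize=toy_tokenize):
    """Tokenize the text around each placeholder and join the chunks with IMAGE_TOKEN_INDEX."""
    input_ids = []
    chunks = prompt.split(IMAGE_PLACEHOLDER)
    for i, chunk in enumerate(chunks):
        if i > 0:
            # One sentinel id per placeholder; a model would later swap it for visual features.
            input_ids.append(IMAGE_TOKEN_INDEX)
        input_ids.extend(tokenize(chunk))
    return input_ids


def mask_prompt_labels(input_ids, prompt_len):
    """Copy input_ids into labels and hide the first prompt_len positions from the loss."""
    labels = list(input_ids)
    labels[:prompt_len] = [IGNORE_INDEX] * prompt_len
    return labels


prompt_ids = insert_image_placeholders("<image>\ndescribe the clip")
answer_ids = [7, 7]  # made-up ids for the assistant's answer
labels = mask_prompt_labels(prompt_ids + answer_ids, prompt_len=len(prompt_ids))
print(prompt_ids)  # prints [-200, 8, 3, 4]
print(labels)      # prints [-100, -100, -100, -100, 7, 7]
```

Negative sentinel values are convenient here because they can never collide with a real vocabulary id.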
================================================
FILE: llava-train_videochat/llava/conversation.py
================================================
import dataclasses
from enum import auto, Enum
from typing import List, Any, Dict, Union, Tuple
import re
import base64
from io import BytesIO
from PIL import Image
from transformers import AutoTokenizer
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
TWO = auto()
MPT = auto()
PLAIN = auto()
CHATML = auto()
LLAMA_2 = auto()
LLAMA_3 = auto()
QWEN = auto()
GEMMA = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
version: str = "Unknown"
tokenizer_id: str = ""
tokenizer: Any = None
# Stop criteria (the default one is EOS token)
stop_str: Union[str, List[str]] = None
# Stops generation if meeting any token in this list
stop_token_ids: List[int] = None
skip_next: bool = False
def get_prompt(self):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0]
if "mmtag" in self.version:
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
elif not init_msg.startswith("<image>"):
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, "<image>\n" + init_msg)
else:
messages[0] = (init_role, init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.CHATML:
ret = "" if self.system == "" else self.system + self.sep + "\n"
for role, message in messages:
if message:
if type(message) is tuple:
message, images, _ = message
message = "<image>" * len(images) + message
ret += role + "\n" + message + self.sep + "\n"
else:
ret += role + "\n"
return ret
elif self.sep_style == SeparatorStyle.LLAMA_3:
chat_template_messages = [{"role": "system", "content": self.system}]
for role, message in messages:
if message:
if type(message) is tuple:
message, images = message
message = "<image>" * len(images) + message
chat_template_messages.append({"role": role, "content": message})
# print(chat_template_messages)
return self.tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=True)
# ret = "" if self.system == "" else self.system + self.sep + "\n"
# for role, message in messages:
# if message:
# if type(message) is tuple:
# message, images = message
# message = "<image>" * len(images) + message
# ret += role + "\n" + message + self.sep + "\n"
# else:
# ret += role + "\n"
# return ret
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.GEMMA:
ret = ""
for i, (role, message) in enumerate(messages):
assert role == self.roles[i % 2], "Conversation should alternate user/assistant/user/assistant/..."
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.LLAMA_2:
wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg
wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
ret = ""
for i, (role, message) in enumerate(messages):
if i == 0:
assert message, "first message should not be none"
assert role == self.roles[0], "first message should come from user"
if message:
if type(message) is tuple:
message, _, _ = message
if i == 0:
message = wrap_sys(self.system) + message
if i % 2 == 0:
message = wrap_inst(message)
ret += self.sep + message
else:
ret += " " + message + " " + self.sep2
else:
ret += ""
ret = ret.lstrip(self.sep)
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def process_image(self, image, image_process_mode, return_pil=False, image_format="PNG"):
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if type(image) is not Image.Image:
image = Image.open(image).convert("RGB")
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 672, 448
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
if return_pil:
return image
else:
buffered = BytesIO()
image.save(buffered, format=image_format)
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
return img_b64_str
def get_images(self, return_pil=False, return_path=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
for img in image:
if not return_path and self.is_image_file(img):
img = self.process_image(img, image_process_mode, return_pil=return_pil)
else:
images.append(img)
return images
def is_image_file(self, filename):
image_extensions = [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"]
return any(filename.lower().endswith(ext) for ext in image_extensions)
def is_video_file(self, filename):
video_extensions = [".mp4", ".mov", ".avi", ".mkv", ".wmv", ".flv", ".mpeg", ".mpg"]
return any(filename.lower().endswith(ext) for ext in video_extensions)
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
if len(image) == 1:
msg = "<image>\n" + msg.replace("<image>", "").strip()
else:
msg = re.sub(r"(<image>)\n(?=<image>)", r"\1 ", msg)
img_str_list = []
for img in image:
if self.is_image_file(img):
img_b64_str = self.process_image(img, "Default", return_pil=False, image_format="JPEG")
img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}"/>'
img_str_list.append(img_str)
elif self.is_video_file(img):
ret.append(((img,), None))
msg = msg.strip()
img_place_holder = ""
for img_str in img_str_list:
img_place_holder += f"{img_str}\n\n"
if len(img_str_list) > 0:
msg = f"{img_place_holder}\n\n{msg}"
if len(msg) > 0:
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(system=self.system, roles=self.roles, messages=[[x, y] for x, y in self.messages], offset=self.offset, sep_style=self.sep_style, sep=self.sep, sep2=self.sep2, version=self.version)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
conv_vicuna_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[
["Human", "What are the key differences between renewable and non-renewable energy sources?"],
[
"Assistant",
"Renewable energy sources are those that can be replenished naturally in a relatively "
"short amount of time, such as solar, wind, hydro, geothermal, and biomass. "
"Non-renewable energy sources, on the other hand, are finite and will eventually be "
"depleted, such as coal, oil, and natural gas. Here are some key differences between "
"renewable and non-renewable energy sources:\n"
"1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable "
"energy sources are finite and will eventually run out.\n"
"2. Environmental impact: Renewable energy sources have a much lower environmental impact "
"than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, "
"and other negative effects.\n"
"3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically "
"have lower operational costs than non-renewable sources.\n"
"4. Reliability: Renewable energy sources are often more reliable and can be used in more remote "
"locations than non-renewable sources.\n"
"5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different "
"situations and needs, while non-renewable sources are more rigid and inflexible.\n"
"6. Sustainability: Renewable energy sources are more sustainable over the long term, while "
"non-renewable sources are not, and their depletion can lead to economic and social instability.\n",
],
],
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_vicuna_v1 = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llama_2 = Conversation(
system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_llava_llama_2 = Conversation(
system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
# conv_llava_llama_3 = Conversation(
# system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
# roles=("user", "assistant"),
# version="llama_v3",
# messages=[],
# offset=0,
# sep="<|eot_id|>",
# sep_style=SeparatorStyle.LLAMA_3,
# tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
# tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
# stop_token_ids=[128009],
# )
conv_mistral_instruct = Conversation(
system="",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="</s>",
)
conv_llava_llama_2_simple = Conversation(
system="Answer the questions about the visual content that the user provides.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_llava_llama_2_mmtag = Conversation(
system="Answer the questions about the visual content that the user provides." "The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
version="llama_v2_mmtag",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_mpt = Conversation(
system="""<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_qwen = Conversation(
system="""<|im_start|>system
You are a helpful assistant.""",
roles=("<|im_start|>user", "<|im_start|>assistant"),
version="qwen",
messages=[],
offset=0,
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
conv_internlm_2 = Conversation(
system="""<|im_start|>system
You are a helpful assistant.""",
roles=("<|im_start|>user", "<|im_start|>assistant"),
version="internlm_2",
messages=[],
offset=0,
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
conv_gemma_instruct = Conversation(system="", roles=("<start_of_turn>user\n", "<start_of_turn>model\n"), version="gemma", messages=[], offset=0, sep_style=SeparatorStyle.GEMMA, sep="<end_of_turn>\n")
conv_llava_plain = Conversation(
system="",
roles=("", ""),
messages=[],
offset=0,
sep_style=SeparatorStyle.PLAIN,
sep="\n",
)
conv_llava_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_llava_v0_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
version="v0_mmtag",
)
conv_llava_v1 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llava_v1_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
version="v1_mmtag",
)
conv_mistral_orca = Conversation(
system="""<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_mistral_zephyr = Conversation(
system="""<|system|>
You are a helpful AI assistant.""",
roles=("<|user|>\n", "<|assistant|>\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="</s>",
)
conv_mistral_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_chatml_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
default_conversation = conv_vicuna_v0
conv_templates = {
"default": conv_vicuna_v0,
"v0": conv_vicuna_v0,
"v1": conv_vicuna_v1,
"vicuna_v1": conv_vicuna_v1,
"llama_2": conv_llama_2,
"mistral_instruct": conv_mistral_instruct,
"mistral_orca": conv_mistral_orca,
"mistral_zephyr": conv_mistral_zephyr,
"mistral_direct": conv_mistral_direct,
"plain": conv_llava_plain,
"v0_plain": conv_llava_plain,
"chatml_direct": conv_chatml_direct,
"llava_v0": conv_llava_v0,
"llava_v0_mmtag": conv_llava_v0_mmtag,
"llava_v1": conv_llava_v1,
"llava_v1_mmtag": conv_llava_v1_mmtag,
"llava_llama_2": conv_llava_llama_2,
# "llava_llama_3": conv_llava_llama_3,
"llava_llama_2_simple": conv_llava_llama_2_simple,
"llava_llama_2_mmtag": conv_llava_llama_2_mmtag,
"llava_mistral_instruct": conv_mistral_instruct,
"mpt": conv_mpt,
"qwen_1_5": conv_qwen,
"qwen_2": conv_qwen,
"internlm_2": conv_internlm_2,
"gemma_instruct": conv_gemma_instruct,
}
if __name__ == "__main__":
print(default_conversation.get_prompt())
print(default_conversation)
================================================
FILE: llava-train_videochat/llava/dist_utils.py
================================================
import json
import os
import builtins
import datetime
import time
import subprocess
import torch
import torch.distributed as dist
def get_rank() -> int:
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
return dist.get_rank()
def get_world_size() -> int:
if not dist.is_available():
return 1
if not dist.is_initialized():
return 1
return dist.get_world_size()
def setup_for_distributed(is_master):
builtin_print = builtins.print
def print(*args, **kwargs):
force = kwargs.pop("force", False)
# force = force or (get_world_size() > 8)
if is_master or force:
now = datetime.datetime.now().time()
builtin_print("[{}] ".format(now), end="") # print with time stamp
builtin_print(*args, **kwargs)
builtins.print = print
def init_distributed_mode(use_dynamic_port: bool = True):
if "SLURM_PROCID" in os.environ:
rank = int(os.environ["SLURM_PROCID"])
local_rank = rank % torch.cuda.device_count()
world_size = int(os.environ["SLURM_NTASKS"])
try:
local_size = int(os.environ["SLURM_NTASKS_PER_NODE"])
except (KeyError, ValueError):
local_size = int(os.environ.get("LOCAL_SIZE", 1))
if "MASTER_PORT" not in os.environ:
port = 10023 # + random.randint(0, 20)
# if use_dynamic_port:
# for i in range(10042, 65535):
# cmd = f"netstat -aon|grep {i}"
# with os.popen(cmd, "r") as file:
# if file.read() == "":
# port = i
# break
print(f"MASTER_PORT = {port}")
os.environ["MASTER_PORT"] = str(port)
time.sleep(3)
node_list = os.environ["SLURM_STEP_NODELIST"]
addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
if "MASTER_ADDR" not in os.environ:
os.environ["MASTER_ADDR"] = addr
os.environ["RANK"] = str(rank)
os.environ["LOCAL_RANK"] = str(local_rank)
os.environ["LOCAL_WORLD_SIZE"] = str(local_size)
os.environ["WORLD_SIZE"] = str(world_size)
else:
rank = int(os.environ["RANK"])
setup_for_distributed(rank == 0)
print(
f"Rank {os.environ['RANK']} | Local Rank {os.environ['LOCAL_RANK']} | "
f"World Size {os.environ['WORLD_SIZE']} | Local World Size {os.environ['LOCAL_WORLD_SIZE']} |",
force=True
)
================================================
FILE: llava-train_videochat/llava/mm_utils.py
================================================
from PIL import Image
from io import BytesIO
import base64
import math
import ast
import re
import torch
from transformers import StoppingCriteria
from llava.constants import IMAGE_TOKEN_INDEX
def resize_and_center_crop(image, shortest_edge_length):
# Calculate new dimensions and resize
aspect_ratio = float(image.width) / float(image.height)
if aspect_ratio > 1:
new_width = int(shortest_edge_length * aspect_ratio)
new_height = shortest_edge_length
else:
new_width = shortest_edge_length
new_height = int(shortest_edge_length / aspect_ratio)
resized_image = image.resize((new_width, new_height), Image.LANCZOS) # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
# Calculate the position and perform the center crop
left = (new_width - shortest_edge_length) / 2
top = (new_height - shortest_edge_length) / 2
right = (new_width + shortest_edge_length) / 2
bottom = (new_height + shortest_edge_length) / 2
cropped_image = resized_image.crop((left, top, right, bottom))
return cropped_image
def auto_pad_images(image, grid_params):
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert len(grid_params) > 0, "Grid parameters should not be empty"
# Step 1: Calculate and find the closest aspect ratio
input_width, input_height = image.size
input_aspect_ratio = input_width / input_height
candidate_resolutions = [(w / h, w, h) for w in grid_params for h in grid_params]
closest_aspect_ratio = min(candidate_resolutions, key=lambda x: abs(input_aspect_ratio - x[0]))
candidate_resolutions = [(x[1], x[2]) for x in candidate_resolutions if abs(x[0] - closest_aspect_ratio[0]) < 1e-3]
target_resolution = min(candidate_resolutions, key=lambda res: abs(max(input_width, input_height) / max(res) - 1))
resize_width, resize_height = target_resolution
if input_width > input_height:
resize_height = int(resize_width / input_aspect_ratio)
else:
resize_width = int(resize_height * input_aspect_ratio)
resized_image = image.resize((resize_width, resize_height), Image.LANCZOS) # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
# Step 5: Pad the resized image if necessary to match the target resolution
pad_width = target_resolution[0] - resize_width
pad_height = target_resolution[1] - resize_height
padded_image = Image.new("RGB", target_resolution, color=(0, 0, 0))
padded_image.paste(resized_image, (pad_width // 2, pad_height // 2))
return padded_image
def extract_patches(image, patch_size, overlap_ratio):
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert patch_size > 0, "Patch size should be greater than 0"
assert 0 <= overlap_ratio < 1, "Overlap ratio should be between 0 and 1"
W, H = image.size
patches = []
stride = int(patch_size * (1 - overlap_ratio))
num_patches_y = (H - patch_size) // stride + 1
num_patches_x = (W - patch_size) // stride + 1
y_start = (H - (num_patches_y - 1) * stride - patch_size) // 2
x_start = (W - (num_patches_x - 1) * stride - patch_size) // 2
for y in range(y_start, y_start + num_patches_y * stride, stride):
for x in range(x_start, x_start + num_patches_x * stride, stride):
patch = image.crop((x, y, x + patch_size, y + patch_size))
patches.append(patch)
return patches
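# Illustrative sketch (not part of the original module): the stride and centering
# arithmetic of extract_patches, restated on plain integers so the crop boxes can be
# inspected without a Pillow image. The helper name `_sketch_patch_boxes` and the
# sample sizes are assumptions made for this example only.
def _sketch_patch_boxes(width, height, patch_size, overlap_ratio=0.0):
    stride = int(patch_size * (1 - overlap_ratio))
    num_y = (height - patch_size) // stride + 1
    num_x = (width - patch_size) // stride + 1
    # Center the patch grid so any leftover border is split evenly on both sides.
    y_start = (height - (num_y - 1) * stride - patch_size) // 2
    x_start = (width - (num_x - 1) * stride - patch_size) // 2
    return [
        (x, y, x + patch_size, y + patch_size)
        for y in range(y_start, y_start + num_y * stride, stride)
        for x in range(x_start, x_start + num_x * stride, stride)
    ]


# A 448x448 image with 224-pixel patches and no overlap yields a 2x2 grid of boxes.
_sketch_boxes = _sketch_patch_boxes(448, 448, 224)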
def process_highres_image_crop_split(image, data_args, processor=None):
crop_resolution = data_args.image_crop_resolution
split_resolution = data_args.image_split_resolution
if processor is None:
processor = data_args.image_processor
image_crop = resize_and_center_crop(image, crop_resolution)
image_patches = extract_patches(image_crop, patch_size=split_resolution, overlap_ratio=0)
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def process_highres_image(image, processor, grid_pinpoints):
grid_params = [int(x) for x in grid_pinpoints.split(",")]
width_height = max(image.size)
fit_grid_params = [x for x in grid_params if x >= width_height]
if len(fit_grid_params) == 0:
select_size = max(grid_params)
else:
select_size = min(fit_grid_params)
# FIXME: always select the 448
select_size = max(grid_params)
image_padded = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
# FIXME: this seems to be a bug that it always resizes instead of padding
image_original_resize = image.resize((processor.size["shortest_edge"], processor.size["shortest_edge"]))
image_padded = image_padded.resize((select_size, select_size))
image_patches = extract_patches(image_padded, patch_size=processor.size["shortest_edge"], overlap_ratio=0)
image_patches = [image_original_resize] + image_patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def select_best_resolution(original_size, possible_resolutions, max_resolutions, patch_size):
"""
Selects the best resolution from a list of possible resolutions based on the original size.
Args:
original_size (tuple): The original size of the image in the format (width, height).
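possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
max_resolutions (int or None): Optional pixel budget. Candidates whose area plus one patch_size x patch_size global view exceeds it are skipped.
patch_size (int): The side length of a single patch, in pixels.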
Returns:
tuple: The best fit resolution in the format (width, height).
"""
original_width, original_height = original_size
best_fit = None
max_effective_resolution = 0
min_wasted_resolution = float("inf")
for width, height in possible_resolutions:
if max_resolutions is not None and (width * height != patch_size * patch_size):
if (width * height + patch_size * patch_size) > max_resolutions: # NOTE: the budget must also cover one global view
continue
# Calculate the downscaled size to keep the aspect ratio
scale = min(width / original_width, height / original_height)
downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
# Calculate effective and wasted resolutions
effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
wasted_resolution = (width * height) - effective_resolution
if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
max_effective_resolution = effective_resolution
min_wasted_resolution = wasted_resolution
best_fit = (width, height)
# print(f"original_size={original_size}, possible_resolutions={possible_resolutions}, max_resolutions={max_resolutions}, best_fit={best_fit}")
assert best_fit is not None, f"Can't find suitable fit in {possible_resolutions} at max:{max_resolutions}"
return best_fit
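# Illustrative sketch (not part of the original module): the selection rule used by
# select_best_resolution, restated without the max_resolutions filter so it can be run
# on its own. The helper name `_sketch_pick_resolution` and the sample sizes below are
# assumptions made for this example only.
def _sketch_pick_resolution(original_size, candidates):
    original_width, original_height = original_size
    best_fit, best_effective, least_wasted = None, 0, float("inf")
    for width, height in candidates:
        # Largest aspect-preserving scale that still fits inside the candidate canvas.
        scale = min(width / original_width, height / original_height)
        scaled_w, scaled_h = int(original_width * scale), int(original_height * scale)
        # Upscaling adds no information, so cap at the original pixel count.
        effective = min(scaled_w * scaled_h, original_width * original_height)
        wasted = width * height - effective
        if effective > best_effective or (effective == best_effective and wasted < least_wasted):
            best_fit, best_effective, least_wasted = (width, height), effective, wasted
    return best_fit


# A 1024x512 (landscape) image ties on effective pixels between 896x448 and 896x896;
# the tie is broken in favor of the canvas that wastes fewer pixels.
_sketch_choice = _sketch_pick_resolution((1024, 512), [(448, 448), (896, 448), (448, 896), (896, 896)])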
def resize_and_pad_image(image, target_resolution):
"""
Resize and pad an image to a target resolution while maintaining aspect ratio.
Args:
image (PIL.Image.Image): The input image.
target_resolution (tuple): The target resolution (width, height) of the image.
Returns:
PIL.Image.Image: The resized and padded image.
"""
original_width, original_height = image.size
target_width, target_height = target_resolution
# Determine which dimension (width or height) to fill
scale_w = target_width / original_width
scale_h = target_height / original_height
if scale_w < scale_h:
# Width will be filled completely
new_width = target_width
new_height = min(math.ceil(original_height * scale_w), target_height)
else:
# Height will be filled completely
new_height = target_height
new_width = min(math.ceil(original_width * scale_h), target_width)
# Resize the image
resized_image = image.resize((new_width, new_height))
# Create a new image with the target size and paste the resized image onto it
new_image = Image.new("RGB", (target_width, target_height), (0, 0, 0))
paste_x = (target_width - new_width) // 2
paste_y = (target_height - new_height) // 2
new_image.paste(resized_image, (paste_x, paste_y))
return new_image
def divide_to_patches(image, patch_size):
"""
Divides an image into patches of a specified size.
Args:
image (PIL.Image.Image): The input image.
patch_size (int): The size of each patch.
Returns:
list: A list of PIL.Image.Image objects representing the patches.
"""
patches = []
width, height = image.size
for i in range(0, height, patch_size):
for j in range(0, width, patch_size):
box = (j, i, j + patch_size, i + patch_size)
patch = image.crop(box)
patches.append(patch)
return patches
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size, max_resolutions=None):
"""
Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
Args:
image_size (tuple): The size of the input image in the format (width, height).
grid_pinpoints (str): A string representation of a list of possible resolutions.
patch_size (int): The size of each image patch.
Returns:
tuple: The shape of the image patch grid in the format (width, height).
"""
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
width, height = select_best_resolution(image_size, possible_resolutions, max_resolutions=max_resolutions, patch_size=patch_size)
# print("get width/patch size", width, patch_size, flush=True)
return width // patch_size, height // patch_size
def process_anyres_image(image, processor, grid_pinpoints):
"""
Process an image with variable resolutions.
Args:
image (PIL.Image.Image): The input image to be processed.
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
Returns:
torch.Tensor: A tensor containing the processed image patches.
"""
raise NotImplementedError
# Convert grid_pinpoints from string to list
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(image.size, possible_resolutions)
image_padded = resize_and_pad_image(image, best_resolution)
patches = divide_to_patches(image_padded, processor.crop_size["height"])
# FIXME: this seems to be a bug that it resizes instead of pad.
# but to keep it consistent with previous, i will keep it as it is
# TODO: uncomment below to ablate with the padding
if isinstance(processor.size, dict):
shortest_edge = processor.size["shortest_edge"]
else:
shortest_edge = min(processor.size)
image_original_resize = image.resize((shortest_edge, shortest_edge))
# image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
# image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
image_patches = [image_original_resize] + patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
# print("image.size", image.size, "len(image_patches):", len(image_patches), "patch_size:", image_patches[0].shape)
return torch.stack(image_patches, dim=0)
def process_anyres_image_nopad(image, processor, grid_pinpoints):
"""
Process an image with variable resolutions.
Args:
image (PIL.Image.Image): The input image to be processed.
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
Returns:
torch.Tensor: A tensor containing the processed image patches.
"""
# Convert grid_pinpoints from string to list
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(image.size, possible_resolutions, max_resolutions=None, patch_size=patch_size) # images currently have no resolution limit
# image_padded = resize_and_pad_image(image, best_resolution)
patches = divide_to_patches(image.resize(best_resolution), patch_size)
# FIXME: this seems to be a bug that it resizes instead of pad.
# but to keep it consistent with previous, i will keep it as it is
# TODO: uncomment below to ablate with the padding
if isinstance(processor.size, dict):
shortest_edge = processor.size["shortest_edge"]
else:
shortest_edge = min(processor.size)
image_original_resize = image.resize((shortest_edge, shortest_edge))
# image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
# image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
image_patches = [image_original_resize] + patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
# raise ValueError(f"image.size: {image.size} len(image_patches): {len(image_patches)}, patch_size:, {image_patches[0].shape}, possible_resolutions:, {possible_resolutions}, best: {best_resolution}")
return torch.stack(image_patches, dim=0)
def process_anyres_video_nopad(video, processor, grid_pinpoints, max_resolutions):
"""
Process a video with variable resolutions.
Args:
video (numpy.ndarray): The input video frames, shaped (T, H, W, C).
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
max_resolutions (int or None): Optional pixel budget passed to select_best_resolution.
Returns:
torch.Tensor: The preprocessed video tensor (the processor's "pixel_values").
"""
# Convert grid_pinpoints from string to list
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(video[0].shape[0:2], possible_resolutions, max_resolutions=max_resolutions, patch_size=patch_size)
video = processor.preprocess(video, return_tensors="pt", target_size=best_resolution)["pixel_values"]
print("data: new_video.shape:", video.shape, "best_resolution:", best_resolution)
return video
def load_image_from_base64(image):
return Image.open(BytesIO(base64.b64decode(image)))
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
def process_images(images, image_processor, model_cfg):
image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None)
new_images = []
if image_aspect_ratio == "highres":
raise NotImplementedError
for image in images:
image = process_highres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio is not None and "anyres" in image_aspect_ratio: # guard: the config may not define image_aspect_ratio
for image in images:
if "nopad" in image_aspect_ratio:
image = process_anyres_image_nopad(image, image_processor, model_cfg.image_grid_pinpoints)
else:
image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio == "crop_split":
raise NotImplementedError
for image in images:
image = process_highres_image_crop_split(image, model_cfg, image_processor)
new_images.append(image)
elif image_aspect_ratio == "pad":
for image in images:
image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))
image = image_processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
new_images.append(image)
else:
return image_processor.preprocess(images, return_tensors="pt")["pixel_values"]
if all(x.shape == new_images[0].shape for x in new_images):
new_images = torch.stack(new_images, dim=0)
return new_images
def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == "pt":
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f"Unsupported tensor type: {return_tensors}")
return input_ids
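# Illustrative sketch (not part of the original module): how a prompt is split on the
# "<image>" placeholder and re-joined with a sentinel id, mirroring tokenizer_image_token
# for a tokenizer that adds no BOS token. `_ToyTokenizer`, its vocabulary and the sentinel
# value -200 are assumptions made for this example only.
class _ToyTokenizer:
    bos_token_id = None

    def __init__(self):
        self.vocab = {}

    def encode(self, text):
        # Assign ids to whitespace-separated words in first-seen order, starting at 1.
        return [self.vocab.setdefault(word, len(self.vocab) + 1) for word in text.split()]


def _sketch_interleave(prompt, tokenizer, image_token_index=-200):
    chunks = [tokenizer.encode(chunk) for chunk in prompt.split("<image>")]
    input_ids = []
    for position, chunk in enumerate(chunks):
        if position > 0:
            # One sentinel per "<image>" occurrence, standing in for the visual input.
            input_ids.append(image_token_index)
        input_ids.extend(chunk)
    return input_ids


_sketch_ids = _sketch_interleave("describe <image> in detail", _ToyTokenizer())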
def get_model_name_from_path(model_path):
model_path = model_path.strip("/")
model_paths = model_path.split("/")
if model_paths[-1].startswith("checkpoint-"):
return model_paths[-2] + "_" + model_paths[-1]
else:
return model_paths[-1]
class KeywordsStoppingCriteria(StoppingCriteria):
def __init__(self, keywords, tokenizer, input_ids):
self.keywords = keywords
self.keyword_ids = []
for keyword in keywords:
cur_keyword_ids = tokenizer(keyword).input_ids
if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
cur_keyword_ids = cur_keyword_ids[1:]
self.keyword_ids.append(torch.tensor(cur_keyword_ids))
self.tokenizer = tokenizer
self.start_len = input_ids.shape[1]
def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO
offset = min(output_ids.shape[1] - self.start_len, 3)
self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
for keyword_id in self.keyword_ids:
if torch.equal(output_ids[0, -keyword_id.shape[0] :], keyword_id): # element-wise == yields a tensor, which is ambiguous in a boolean context
return True
outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
for keyword in self.keywords:
if keyword in outputs:
return True
return False
================================================
FILE: llava-train_videochat/llava/model/__init__.py
================================================
import os
AVAILABLE_MODELS = {
"llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
"llava_qwen_flash": "LlavaQwenForCausalLM_Flash, LlavaQwenConfig_Flash"
}
for model_name, model_classes in AVAILABLE_MODELS.items():
try:
exec(f"from .language_model.{model_name} import {model_classes}")
except Exception as e:
print(f"Failed to import {model_name} from llava.model.language_model.{model_name}. Error: {e}")
================================================
FILE: llava-train_videochat/llava/model/apply_delta.py
================================================
"""
Usage:
python3 -m llava.model.apply_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/vicuna-7b --delta-path lmsys/vicuna-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava import LlavaLlamaForCausalLM
def apply_delta(base_model_path, target_model_path, delta_path):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading delta")
delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
print("Applying delta")
for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
if name not in base.state_dict():
assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
continue
if param.data.shape == base.state_dict()[name].shape:
param.data += base.state_dict()[name]
else:
assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
bparam = base.state_dict()[name]
param.data[: bparam.shape[0], : bparam.shape[1]] += bparam
print("Saving target model")
delta.save_pretrained(target_model_path)
delta_tokenizer.save_pretrained(target_model_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
args = parser.parse_args()
apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
================================================
FILE: llava-train_videochat/llava/model/builder.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import warnings
import shutil
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from llava.model import *
from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.utils import rank0_print
def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", attn_implementation="flash_attention_2", customized_config=None, overwrite_config=None, **kwargs):
kwargs["device_map"] = device_map
if load_8bit:
kwargs["load_in_8bit"] = True
elif load_4bit:
kwargs["load_in_4bit"] = True
kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
else:
kwargs["torch_dtype"] = torch.float16
if customized_config is not None:
kwargs["config"] = customized_config
if "multimodal" in kwargs:
if kwargs["multimodal"] is True:
is_multimodal = True
kwargs.pop("multimodal")
else:
is_multimodal = False
else:
is_multimodal = False
assert is_multimodal, "load_pretrained_model must be called with multimodal=True"
if "llava" in model_name.lower() or is_multimodal:
# Load LLaVA model
if "lora" in model_name.lower() and model_base is None:
raise NotImplementedError("I don't like lora.")
warnings.warn(
"There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged."
)
if "lora" in model_name.lower() and model_base is not None:
raise NotImplementedError("I don't like lora.")
lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
rank0_print("Loading LLaVA from base model...")
if "mixtral" in model_name.lower():
from llava.model.language_model.llava_mixtral import LlavaMixtralConfig
lora_cfg_pretrained = LlavaMixtralConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "mistral" in model_name.lower():
from llava.model.language_model.llava_mistral import LlavaMistralConfig
lora_cfg_pretrained = LlavaMistralConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
from llava.model.language_model.llava_gemma import LlavaGemmaConfig
lora_cfg_pretrained = LlavaGemmaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
else:
from llava.model.language_model.llava_llama import LlavaConfig
lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
if model.lm_head.weight.shape[0] != token_num:
model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
rank0_print("Loading additional LLaVA weights...")
if os.path.exists(os.path.join(model_path, "non_lora_trainables.bin")):
non_lora_trainables = torch.load(os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu")
else:
# this is probably from HF Hub
from huggingface_hub import hf_hub_download
def load_from_hf(repo_id, filename, subfolder=None):
cache_file = hf_hub_download(repo_id=repo_id, filename=filename, subfolder=subfolder)
return torch.load(cache_file, map_location="cpu")
non_lora_trainables = load_from_hf(model_path, "non_lora_trainables.bin")
non_lora_trainables = {(k[11:] if k.startswith("base_model.") else k): v for k, v in non_lora_trainables.items()}
if any(k.startswith("model.model.") for k in non_lora_trainables):
non_lora_trainables = {(k[6:] if k.startswith("model.") else k): v for k, v in non_lora_trainables.items()}
model.load_state_dict(non_lora_trainables, strict=False)
from peft import PeftModel
rank0_print("Loading LoRA weights...")
model = PeftModel.from_pretrained(model, model_path)
rank0_print("Merging LoRA weights...")
model = model.merge_and_unload()
rank0_print("Model is loaded...")
elif model_base is not None: # this may be mm projector only, loading projector with preset language model
rank0_print(f"Loading LLaVA from base model {model_base}...")
if "mixtral" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif (
"wizardlm-2" in model_name.lower()
and "vicuna" in model_name.lower()
or "llama" in model_name.lower()
or "yi" in model_name.lower()
or "nous-hermes" in model_name.lower()
or "llava-v1.6-34b" in model_name.lower()
or "llava-v1.5" in model_name.lower()
):
from llava.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_name.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
llava_cfg = LlavaConfig.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=llava_cfg, **kwargs)
else:
raise ValueError(f"Model {model_name} not supported")
mm_projector_weights = torch.load(os.path.join(model_path, "mm_projector.bin"), map_location="cpu")
mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
model.load_state_dict(mm_projector_weights, strict=False)
else:
rank0_print(f"Loading LLaVA model: {model_path}")
if "mixtral" in model_name.lower():
raise NotImplementedError("Mixtral-based LLaVA checkpoints are not supported by this loader.")
from llava.model.language_model.llava_mixtral import LlavaMixtralConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaMixtralConfig.from_pretrained(model_path)
else:
llava_cfg = customized_config
if overwrite_config is not None:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif (
"wizardlm-2" in model_name.lower()
and "vicuna" in model_name.lower()
or "llama" in model_name.lower()
# or "yi" in model_name.lower() # disabled: the substring "yi" collides with unrelated model names too easily
or "nous-hermes" in model_name.lower()
or "llava-v1.6-34b" in model_name.lower()
or "llava-v1.5" in model_name.lower()
):
raise NotImplementedError("LLaMA-family LLaVA checkpoints are not supported by this loader.")
from llava.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_name.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
if overwrite_config is not None:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
elif "qwen" in model_name.lower() or "quyen" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path)
if "moe" in model_name.lower() or "a14b" in model_name.lower():  # compare in lowercase; "A14B" could never match a lowercased name
from llava.model.language_model.llava_qwen_moe import LlavaQwenMoeConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenMoeConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "flash" in model_name.lower():
from llava.model.language_model.llava_qwen_flash import LlavaQwenConfig_Flash
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig_Flash.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
else:
from llava.model.language_model.llava_qwen import LlavaQwenConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "internlm2" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
from llava.model.language_model.llava_internlm2 import LlavaInternLM2Config
if overwrite_config is not None:
llava_cfg = LlavaInternLM2Config.from_pretrained(model_path, trust_remote_code=True)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaInternLM2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, trust_remote_code=True, **kwargs)
else:
model = LlavaInternLM2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
raise NotImplementedError("Gemma-based LLaVA checkpoints are not supported by this loader.")
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaGemmaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
else:
# Default to Qwen
try:
tokenizer = AutoTokenizer.from_pretrained(model_path)
if "moe" in model_name.lower() or "a14b" in model_name.lower():  # compare in lowercase; "A14B" could never match a lowercased name
from llava.model.language_model.llava_qwen_moe import LlavaQwenMoeConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenMoeConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "flash" in model_name.lower():
from llava.model.language_model.llava_qwen_flash import LlavaQwenConfig_Flash
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig_Flash.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "fastv" in model_name.lower():
from llava.model.language_model.llava_qwen_fastv import LlavaQwenConfig_FastV
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig_FastV.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM_FastV.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM_FastV.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
else:
from llava.model.language_model.llava_qwen import LlavaQwenConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
except Exception as e:  # avoid a bare except, which would also swallow KeyboardInterrupt/SystemExit
raise ValueError(f"Model {model_name} not supported") from e
# try:
# from llava.model.language_model.llava_llama import LlavaConfig
# tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
# if customized_config is None:
# llava_cfg = LlavaConfig.from_pretrained(model_path)
# if "v1.5" in model_path.lower():
# llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
# else:
# llava_cfg = customized_config
# if overwrite_config is not None:
# rank0_print(f"Overwriting config with {overwrite_config}")
# for k, v in overwrite_config.items():
# setattr(llava_cfg, k, v)
# model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
# except:
# raise ValueError(f"Model {model_name} not supported")
else:
raise NotImplementedError("I don't want language model only.")  # the exception was previously constructed but never raised
# Load language model
if model_base is not None:
# PEFT model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
print(f"Loading LoRA weights from {model_path}")
model = PeftModel.from_pretrained(model, model_path)
print("Merging weights")
model = model.merge_and_unload()
print("Convert to FP16...")
model.to(torch.float16)
else:
use_fast = False
if "mpt" in model_name.lower().replace("prompt", ""):
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
rank0_print(f"Model Class: {model.__class__.__name__}")
image_processor = None
if "llava" in model_name.lower() or is_multimodal:
mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
if mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
if mm_use_im_start_end:
tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
vision_tower.load_model(device_map=device_map)
if device_map != "auto":
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
if hasattr(model.config, "max_sequence_length"):
context_len = model.config.max_sequence_length
elif hasattr(model.config, "max_position_embeddings"):
context_len = model.config.max_position_embeddings
elif hasattr(model.config, "tokenizer_model_max_length"):
context_len = model.config.tokenizer_model_max_length
else:
context_len = 2048
return tokenizer, model, image_processor, context_len
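# Illustrative sketch (not part of builder.py): the LoRA branch above renames the
# keys of `non_lora_trainables.bin` before calling `load_state_dict`. The helper
# name `strip_wrapper_prefixes` and the example keys are hypothetical; the sketch
# only mirrors the two dict comprehensions, assuming the `base_model.` prefix
# comes from a PEFT-wrapped checkpoint.

```python
def strip_wrapper_prefixes(state_dict):
    """Mirror the key renaming applied to non-LoRA trainable weights."""
    # Step 1: drop a leading "base_model." (11 characters, hence k[11:] in the loader).
    out = {
        (k[len("base_model."):] if k.startswith("base_model.") else k): v
        for k, v in state_dict.items()
    }
    # Step 2: if any key is still doubly nested, drop one leading "model."
    # (6 characters, hence k[6:] in the loader) from every key that has it.
    if any(k.startswith("model.model.") for k in out):
        out = {
            (k[len("model."):] if k.startswith("model.") else k): v
            for k, v in out.items()
        }
    return out


# Hypothetical checkpoint keys; the values stand in for tensors.
example = {
    "base_model.model.model.mm_projector.0.weight": 1,
    "base_model.model.lm_head.weight": 2,
}
cleaned = strip_wrapper_prefixes(example)
print(sorted(cleaned))  # ['lm_head.weight', 'model.mm_projector.0.weight']
```

# Note that step 2 only runs when at least one key starts with "model.model.",
# so checkpoints saved without the extra nesting keep their keys unchanged.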
================================================
FILE: llava-train_videochat/llava/model/consolidate.py
================================================
"""
Usage:
python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate
"""
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava.model import *
from llava.model.utils import auto_upgrade
def consolidate_ckpt(src_path, dst_path):
print("Loading model")
auto_upgrade(src_path)
src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False)
src_model.save_pretrained(dst_path)
src_tokenizer.save_pretrained(dst_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--src", type=str, required=True)
parser.add_argument("--dst", type=str, required=True)
args = parser.parse_args()
consolidate_ckpt(args.src, args.dst)
================================================
FILE: llava-train_videochat/llava/model/language_model/llava_qwen.py
================================================
# Copyright 2024 Hao Zhang
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union, Dict
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
# from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
from transformers import Qwen2Config, Qwen2Model, Qwen2ForCausalLM
# from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel
# from .qwen.configuration_qwen import QWenConfig
class LlavaQwenConfig(Qwen2Config):
model_type = "llava_qwen"
class LlavaQwenModel(LlavaMetaModel, Qwen2Model):
config_class = LlavaQwenConfig
def __init__(self, config: Qwen2Config):
super(LlavaQwenModel, self).__init__(config)
class LlavaQwenForCausalLM(Qwen2ForCausalLM, LlavaMetaForCausalLM):
config_class = LlavaQwenConfig
def __init__(self, config):
# super(Qwen2ForCausalLM, self).__init__(config)
Qwen2ForCausalLM.__init__(self, config)
config.model_type = "llava_qwen"
config.rope_scaling = None
self.model = LlavaQwenModel(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
modalities: Optional[List[str]] = ["image"],
dpo_forward: Optional[bool] = False,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
# print("images[0].shape:", images[0].shape)
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
# print("inputs_embeds.shape:", inputs_embeds.shape)
if dpo_forward:
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
return logits, labels
else:
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
modalities: Optional[List[str]] = ["image"],
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes)
else:
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_qwen", LlavaQwenConfig)
AutoModelForCausalLM.register(LlavaQwenConfig, LlavaQwenForCausalLM)
================================================
FILE: llava-train_videochat/llava/model/language_model/llava_qwen_flash.py
================================================
# Copyright 2024 Hao Zhang
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union, Dict
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
# from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
from transformers import Qwen2Config
# from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel
# from .qwen.configuration_qwen import QWenConfig
from .modeling_qwen2_flash import Qwen2Model_Flash, Qwen2ForCausalLM_Flash
class LlavaQwenConfig_Flash(Qwen2Config):
model_type = "llava_qwen_flash"
class LlavaQwenModel_Flash(LlavaMetaModel, Qwen2Model_Flash):
config_class = LlavaQwenConfig_Flash
def __init__(self, config: Qwen2Config):
super(LlavaQwenModel_Flash, self).__init__(config)
class LlavaQwenForCausalLM_Flash(Qwen2ForCausalLM_Flash, LlavaMetaForCausalLM):
config_class = LlavaQwenConfig_Flash
def __init__(self, config):
# super(Qwen2ForCausalLM, self).__init__(config)
Qwen2ForCausalLM_Flash.__init__(self, config)
config.model_type = "llava_qwen_flash"
# config.rope_scaling = None
self.model = LlavaQwenModel_Flash(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
modalities: Optional[List[str]] = ["image"],
dpo_forward: Optional[bool] = False,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
# print("inputs_embeds.shape:", inputs_embeds.shape)
if dpo_forward:
outputs, labels = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
labels=labels
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
return logits, labels
else:
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
modalities: Optional[List[str]] = ["image"],
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes)
else:
self.model.image_token_posi = [-1]
self.model.prompt_len = None
self.model.image_tokens = [0]
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_qwen_flash", LlavaQwenConfig_Flash)
AutoModelForCausalLM.register(LlavaQwenConfig_Flash, LlavaQwenForCausalLM_Flash)
================================================
FILE: llava-train_videochat/llava/model/language_model/modeling_qwen2_flash.py
================================================
# coding=utf-8
# NOTE: written against transformers==4.39.2 or 4.40.1
# Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch Qwen2 model."""
import inspect
import math
import warnings
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask, _prepare_4d_causal_attention_mask_for_sdpa
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10,
logging,
replace_return_docstrings,
)
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
from llava.constants import IGNORE_INDEX
if is_flash_attn_2_available():
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
_flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "Qwen/Qwen2-7B-beta"
_CONFIG_FOR_DOC = "Qwen2Config"
QWEN2_PRETRAINED_MODEL_ARCHIVE_LIST = [
"Qwen/Qwen2-7B-beta",
# See all Qwen2 models at https://huggingface.co/models?filter=qwen2
]
# Copied from transformers.models.llama.modeling_llama._get_unpad_data
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
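# Illustrative sketch (not part of this module): what `_get_unpad_data` returns,
# reproduced with plain Python lists instead of tensors so the three outputs are
# easy to inspect. The function name `get_unpad_data_lists` and the toy mask are
# assumptions made for this example only.

```python
from itertools import accumulate


def get_unpad_data_lists(attention_mask):
    """List-based analogue of _get_unpad_data for a 0/1 padding mask."""
    # Number of real (non-padding) tokens in each sequence of the batch.
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    flat = [v for row in attention_mask for v in row]
    indices = [i for i, v in enumerate(flat) if v != 0]
    # Cumulative sequence lengths with a leading 0, i.e. F.pad(cumsum, (1, 0)).
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [
    [1, 1, 1, 0],  # 3 real tokens, 1 padding token
    [1, 1, 0, 0],  # 2 real tokens, 2 padding tokens
]
indices, cu_seqlens, max_len = get_unpad_data_lists(mask)
print(indices)     # [0, 1, 2, 4, 5]
print(cu_seqlens)  # [0, 3, 5]
print(max_len)     # 3
```

# Sequence i occupies the half-open range cu_seqlens[i]:cu_seqlens[i + 1] in the
# packed token buffer, which is the layout variable-length attention kernels expect.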
# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Qwen2
class Qwen2RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""
Qwen2RMSNorm is equivalent to T5LayerNorm
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
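# Illustrative sketch (not part of this module): the arithmetic of
# `Qwen2RMSNorm.forward` on a single vector, written with plain floats. The
# function name `rms_norm` and the sample values are assumptions for this example;
# the dtype round-trip through float32 in the real module is omitted.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then by weight."""
    variance = sum(v * v for v in x) / len(x)   # mean of squares over the last dim
    scale = 1.0 / math.sqrt(variance + eps)     # torch.rsqrt(variance + eps)
    return [w * v * scale for w, v in zip(weight, x)]


y = rms_norm([3.0, 4.0], weight=[1.0, 1.0])
mean_square = sum(v * v for v in y) / len(y)
print(y)
print(mean_square)  # close to 1.0, slightly below because of eps
```

# Unlike LayerNorm, no mean is subtracted and no bias is added: with unit weights
# the output simply has a mean square of approximately one.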
# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Qwen2
class Qwen2RotaryEmbedding(nn.Module):
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
freqs = torch.outer(t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
def forward(self, x, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if seq_len > self.max_seq_len_cached:
self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
return (
self.cos_cached[:seq_len].to(dtype=x.dtype),
self.sin_cached[:seq_len].to(dtype=x.dtype),
)
# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`):
The position indices of the tokens corresponding to the query and key tensors. For example, this can be
used to pass offsetted position ids when working with a KV-cache.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
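# Illustrative sketch (not part of this module): `rotate_half` and the rotary
# update `x * cos + rotate_half(x) * sin` for one head vector, in plain Python.
# The names `rotate_half_list`, `apply_rope` and `dot`, plus the sample vectors,
# are assumptions for this example. It checks two standard properties of rotary
# embeddings: the vector norm is preserved, and the query-key dot product depends
# only on the relative position.

```python
import math


def rotate_half_list(x):
    """Return (-x2, x1), where x1 and x2 are the first and second halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    dim = len(x)
    # One inverse frequency per pair, as in Qwen2RotaryEmbedding.__init__.
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    angles = [position * f for f in inv_freq]
    angles = angles + angles  # emb = cat(freqs, freqs)
    cos = [math.cos(a) for a in angles]
    sin = [math.sin(a) for a in angles]
    rotated = rotate_half_list(x)
    return [v * c + r * s for v, c, r, s in zip(x, cos, rotated, sin)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.0]

# Same relative offset (2) at different absolute positions gives the same score.
score_a = dot(apply_rope(q, 5), apply_rope(k, 3))
score_b = dot(apply_rope(q, 2), apply_rope(k, 0))
print(abs(score_a - score_b) < 1e-9)  # True
```

# Element i is paired with element i + dim/2 (not with its neighbour), which is the
# permutation the "Different from paper" comment in `_set_cos_sin_cache` refers to.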
# Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2
class Qwen2MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
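# Illustrative sketch (not part of the upstream file): checks that the expand + reshape used
# by `repeat_kv` above matches `torch.repeat_interleave` along the head axis, as its
# docstring states. The helper name `_demo_repeat_kv_equivalence` and the toy shapes are
# hypothetical.
import torch


def _demo_repeat_kv_equivalence(batch=2, kv_heads=2, n_rep=3, slen=4, head_dim=8):
    kv = torch.randn(batch, kv_heads, slen, head_dim)
    # Same steps as `repeat_kv`: add a repeat axis, expand it, then fold it into the head axis.
    expanded = kv[:, :, None, :, :].expand(batch, kv_heads, n_rep, slen, head_dim)
    repeated = expanded.reshape(batch, kv_heads * n_rep, slen, head_dim)
    reference = torch.repeat_interleave(kv, repeats=n_rep, dim=1)
    return tuple(repeated.shape), torch.equal(repeated, reference)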
class Qwen2Attention(nn.Module):
"""
Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
and "Generating Long Sequences with Sparse Transformers".
"""
def __init__(self, config: Qwen2Config, layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
"lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
"when creating this class."
)
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.is_causal = True
self.attention_dropout = config.attention_dropout
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(
f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
f" and `num_heads`: {self.num_heads})."
)
self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
self.rotary_emb = Qwen2RotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
)
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
"with a layer index."
)
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
raise ValueError(
f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
f" {attn_weights.size()}"
)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
)
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
f" {attn_output.size()}"
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
class Qwen2FlashAttention2(Qwen2Attention):
"""
Qwen2 flash attention module, following Qwen2 attention module. This module inherits from `Qwen2Attention`
as the weights of the module stay untouched. The only required change would be on the forward pass
where it needs to correctly call the public API of flash attention and deal with padding tokens
in case the input contains any of them. Additionally, for sliding window attention, we apply SWA only to the bottom
config.max_window_layers layers.
"""
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which became the default in flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
)
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop("padding_mask")
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
"with a layer index."
)
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
# Because the input can be padded, the absolute sequence length depends on the max position id.
rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, "sliding_window", None) is not None
and kv_seq_len > self.config.sliding_window
and self.config.use_sliding_window
)
if not _flash_supports_window_size:
logger.warning_once(
"The current flash attention version does not support sliding window attention, for a more memory efficient implementation"
" make sure to upgrade flash-attn library."
)
if past_key_value is not None:
# Activate slicing cache only if the config has a `sliding_window` attribute
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (
getattr(self.config, "sliding_window", None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents
):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
f" {past_key.shape}"
)
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, the layer norms are usually cast to float32 for training stability,
# so the input hidden states get silently cast to float32. Hence, we need to
# cast them back to the original lower-precision dtype so that everything works as expected.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
logger.warning_once(
f"The input hidden states seem to be silently cast to float32, this might be related to"
f" the fact you have upcast embedding or layer norm layers to float32. We will cast the input back to"
f" {target_dtype}."
)
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate,
use_sliding_windows=use_sliding_windows,
)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
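# Flash Attention never materializes the attention matrix, so there are no weights to return,
# even when `output_attentions=True` (the name would otherwise be unbound at the return below).
attn_weights = None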
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(
self,
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None,
use_sliding_windows=False,
):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention output and re-pads it to the original shape.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
query_length (`int`):
The length of the query sequence (`q_len`) of the current forward call.
dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
use_sliding_windows (`bool`, *optional*):
Whether to activate sliding window attention.
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Decide whether to use SWA or not by layer index.
if use_sliding_windows and self.layer_idx >= self.config.max_window_layers:
use_sliding_windows = False
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
query_states, key_states, value_states, attention_mask, query_length
)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
if not use_sliding_windows:
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
else:
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
window_size=(self.config.sliding_window, self.config.sliding_window),
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
else:
if not use_sliding_windows:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
)
else:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
window_size=(self.config.sliding_window, self.config.sliding_window),
)
return attn_output
# Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2._upad_input
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
# On the first iteration we need to properly re-create the padding mask
# by slicing it on the proper place
if kv_seq_len != attention_mask.shape[-1]:
attention_mask_num_tokens = attention_mask.shape[-1]
attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :]
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
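# Illustrative sketch (not part of the upstream file): the kind of metadata that the
# variable-length Flash Attention path above relies on, derived from a 2D padding mask.
# `_get_unpad_data` is defined outside this section, so this only mirrors the usual recipe
# (flat indices of real tokens, cumulative sequence lengths, maximum length) under that
# assumption. The helper name `_demo_unpad_metadata` and the toy mask are hypothetical.
import torch


def _demo_unpad_metadata():
    # Two left-padded sequences with 2 and 4 real tokens (1 = token, 0 = padding).
    attention_mask = torch.tensor([[0, 0, 1, 1], [1, 1, 1, 1]])

    seqlens = attention_mask.sum(dim=-1, dtype=torch.int32)
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
    # Cumulative lengths with a leading 0 mark each sequence's boundaries in the packed tensor.
    cu_seqlens = torch.nn.functional.pad(torch.cumsum(seqlens, dim=0, dtype=torch.int32), (1, 0))
    return indices.tolist(), cu_seqlens.tolist(), int(seqlens.max())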
# Copied from transformers.models.mistral.modeling_mistral.MistralSdpaAttention with Mistral->Qwen2
class Qwen2SdpaAttention(Qwen2Attention):
"""
Qwen2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
`Qwen2Attention` as the weights of the module stay untouched. The only changes are on the forward pass to adapt to
SDPA API.
"""
# Adapted from Qwen2Attention.forward
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if output_attentions:
# TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
logger.warning_once(
"Qwen2Model is using Qwen2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
return super().forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
)
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
# Reference: https://github.com/pytorch/pytorch/issues/112577.
if query_states.device.type == "cuda" and attention_mask is not None:
query_states = query_states.contiguous()
key_states = key_states.contiguous()
value_states = value_states.contiguous()
attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states,
key_states,
value_states,
attn_mask=attention_mask,
dropout_p=self.attention_dropout if self.training else 0.0,
# The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
is_causal=self.is_causal and attention_mask is None and q_len > 1,
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
QWEN2_ATTENTION_CLASSES = {
"eager": Qwen2Attention,
"flash_attention_2": Qwen2FlashAttention2,
"sdpa": Qwen2SdpaAttention,
}
class Qwen2DecoderLayer(nn.Module):
def __init__(self, config: Qwen2Config, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
if config.use_sliding_window and config._attn_implementation != "flash_attention_2":
logger.warning_once(
f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
"unexpected results may be encountered."
)
self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
self.mlp = Qwen2MLP(config)
self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
**kwargs,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
`(batch, sequence_length)` where padding elements are indicated by 0.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. "
"Please make sure to use `attention_mask` instead."
)
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights,)
if use_cache:
outputs += (present_key_value,)
return outputs
QWEN2_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`Qwen2Config`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
@add_start_docstrings(
"The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
QWEN2_START_DOCSTRING,
)
class Qwen2PreTrainedModel(PreTrainedModel):
config_class = Qwen2Config
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["Qwen2DecoderLayer"]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
QWEN2_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
Two formats are allowed:
- a [`~cache_utils.Cache`] instance;
- Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
legacy cache format will be returned.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
"The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
QWEN2_START_DOCSTRING,
)
class Qwen2Model_Flash(Qwen2PreTrainedModel):
"""
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]
Args:
config: Qwen2Config
"""
def __init__(self, config: Qwen2Config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.layers = nn.ModuleList(
[Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self._attn_implementation = config._attn_implementation
self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
@add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
labels: Optional[torch.Tensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# retrieve input_ids and inputs_embeds
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
batch_size, seq_length = input_ids.shape
elif inputs_embeds is not None:
batch_size, seq_length, _ = inputs_embeds.shape
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
if self.gradient_checkpointing and self.training:
if use_cache:
logger.warning_once(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
)
use_cache = False
past_key_values_length = 0
if use_cache:
use_legacy_cache = not isinstance(past_key_values, Cache)
if use_legacy_cache:
past_key_values = DynamicCache.from_legacy_cache(past_key_values)
past_key_values_length = past_key_values.get_usable_length(seq_length)
if position_ids is None:
device = input_ids.device if input_ids is not None else inputs_embeds.device
position_ids = torch.arange(
past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
)
position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
else:
position_ids = position_ids.view(-1, seq_length).long()
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
is_padding_right = attention_mask[:, -1].sum().item() != batch_size
if is_padding_right:
raise ValueError(
"You are attempting to perform batched generation with padding_side='right'; "
"this may lead to unexpected behaviour for the Flash Attention version of Qwen2. Make sure to "
"call `tokenizer.padding_side = 'left'` before tokenizing the input."
)
if self._attn_implementation == "flash_attention_2":
# 2d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == "sdpa" and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
)
else:
# 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
sliding_window=self.config.sliding_window,
)
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None
for layer_idx, decoder_layer in enumerate(self.layers):
if output_hidden_states:
all_hidden_states += (hidden_states,)
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
attention_mask,
position_ids,
past_key_values,
output_attentions,
use_cache,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1],)
###### copied from pdrop #########
# Rank and drop visual tokens after each layer listed in llm_compress_layer_list.
# At inference, tokens are dropped only during the prefill stage.
rank_layer = layer_idx+1
if rank_layer in self.llm_compress_layer_list:
if hidden_states.shape[1] != 1: # prefill stage or training
stage = self.llm_compress_layer_list.index(rank_layer) # determine current stage
(
position_ids,
attention_mask,
hidden_states,
labels # update labels and return
) = self.flash_rank_drop(
cur_num = stage,
rank_layer = rank_layer,
features = hidden_states,
position_ids=position_ids,
attention_mask=attention_mask,
labels = labels
)
# process attention_mask again after updating
if self._attn_implementation == "flash_attention_2":
# 2d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == "sdpa" and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
attention_mask,
(batch_size, hidden_states.shape[1]),
hidden_states,
past_key_values_length,
)
else:
# 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, hidden_states.shape[1]),
hidden_states,
past_key_values_length,
sliding_window=self.config.sliding_window,
)
else:
# Decoding stage at inference: shift position_ids by the number of visual tokens dropped at this stage
stage = self.llm_compress_layer_list.index(rank_layer) # determine current stage
cur_visual_length = [int(cur_image_token * self.llm_image_token_ratio_list[stage]) for cur_image_token in self.num_image_token_lens]
next_visual_length = [int(cur_image_token * self.llm_image_token_ratio_list[stage + 1]) for cur_image_token in self.num_image_token_lens]
new_position_ids = []
for idx, cur_position_ids in enumerate(position_ids):
cur_position_ids = cur_position_ids - (cur_visual_length[idx] - next_visual_length[idx])
new_position_ids.append(cur_position_ids)
assert idx == 0, idx
position_ids = torch.tensor(new_position_ids, dtype=torch.long).unsqueeze(0)
# raise ValueError(f"{type(position_ids)}, haha I'm going crazy")
#################
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states,)
next_cache = None
if use_cache:
next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
if not return_dict:
return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None), labels
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
), labels
# Rank-and-drop implementation: selects which visual tokens to keep at each compression stage
def flash_rank_drop(
self, cur_num, rank_layer, features ,
position_ids, attention_mask, labels
):
if self.llm_compress_type == 'uniform0_attention':
if cur_num == 0:
llm_compress_type = 'uniform'
else:
llm_compress_type = 'attention'
elif self.llm_compress_type == 'uniform1_attention':
if cur_num <= 1:
llm_compress_type = 'uniform'
else:
llm_compress_type = 'attention'
else:
llm_compress_type = self.llm_compress_type
_labels = labels
_position_ids = position_ids
_attention_mask = attention_mask
if position_ids is None:
position_ids = torch.arange(0, features.shape[1], dtype=torch.long, device=features.device).unsqueeze(0)
if getattr(self.config, 'tokenizer_padding_side', 'right') == "right":
batch_size = features.shape[0]
image_tokens = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num]) for cur_image_token in self.num_image_token_lens]
keep_length = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num + 1]) for cur_image_token in self.num_image_token_lens]
features_list = []
attention_mask_list = []
labels_list = []
if attention_mask is None:
attention_mask = torch.ones((batch_size,features.shape[1]), dtype=torch.bool, device=features.device)
else:
attention_mask = attention_mask.bool()
if labels is None:
labels = torch.full((batch_size,features.shape[1]), IGNORE_INDEX, device=features.device)
if 'attention' in llm_compress_type:
# obtain query_states and key_states to calculate attention map
hidden_states= features.clone().detach()
# print(f"hidden_states.shape: {hidden_states.shape}")
self_attn = self.layers[rank_layer].self_attn
hidden_states = self.layers[rank_layer].input_layernorm(hidden_states)
# print(f"new hidden_states.shape: {hidden_states.shape}")
num_heads = self_attn.num_heads
num_key_value_heads = self_attn.num_key_value_heads
head_dim = self_attn.head_dim
bsz, q_len, _ = hidden_states.size()
# print(self_attn.k_proj)
query_states = self_attn.q_proj(hidden_states)
key_states = self_attn.k_proj(hidden_states)
value_states = self_attn.v_proj(hidden_states)
# print("old key_states.shape:", key_states.shape)
query_states = query_states.view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)
# print("key_states.shape:", key_states.shape)
kv_seq_len = key_states.shape[-2]
cos, sin = self_attn.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
key_states = repeat_kv(key_states, self_attn.num_key_value_groups)
# attention_mask
eager_attention_mask = _prepare_4d_causal_attention_mask(
attention_mask, (batch_size, q_len), hidden_states, past_key_values_length=0
).to(device=query_states.device)
# take valid features
features = [cur_features[cur_attention_mask] for cur_features, cur_attention_mask in zip(features, attention_mask)]
labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
attention_mask = [cur_attention_mask[cur_attention_mask] for cur_attention_mask in attention_mask]
# rank & drop
for i in range(batch_size):
image_index = self.first_image_token_position[i]
if image_index == -1:
cur_input_embeds = features[i]
features_list.append(cur_input_embeds)
attention_mask_list.append(attention_mask[i])
labels_list.append(labels[i])
continue
if 'attention' in llm_compress_type:
# obtain current states
cur_key_states = key_states[i]
cur_query_states = query_states[i]
cur_eager_attention_mask = eager_attention_mask[i]
# choose last instruction token as query
if self.training:
answer_index = torch.where(labels[i] != -100)[0].tolist()
index_before_answer = []
for index in answer_index:
if labels[i][index-1] == -100:
index_before_answer.append(index-1)
if index_before_answer == []:
# print("========index_before_answer is []")
cur_input_embeds = features[i]
features_list.append(cur_input_embeds)
attention_mask_list.append(attention_mask[i])
labels_list.append(labels[i])
continue
index_before_answer=torch.tensor(index_before_answer,device=labels[0].device)
text_query_states = cur_query_states[:,index_before_answer,:]
text_eager_attention_mask = cur_eager_attention_mask[:,index_before_answer,:]
else:
prompt_total_len = self.text_prompt_lens[i] + image_tokens[i]
text_query_states = cur_query_states[:,prompt_total_len-1,:].unsqueeze(1)
text_eager_attention_mask = cur_eager_attention_mask[:,prompt_total_len-1,:].unsqueeze(1)
# print(f"text_query_states.shape: {text_query_states.shape}")
# print(f"cur_key_states.shape: {cur_key_states.shape}")
# calculate attention map
attn_weights = torch.matmul(text_query_states, cur_key_states.transpose(1, 2)) / math.sqrt(head_dim) #(num_head, text_token,seq_len)
attn_weights = attn_weights + text_eager_attention_mask
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) #(num_head, text_token,seq_len)
attention_avg_head = torch.mean(attn_weights, dim=0) # average across heads
attention_avg_head = attention_avg_head[:,image_index:image_index+image_tokens[i]] # select image token as keys
attention_avg_text = torch.mean(attention_avg_head, dim=0) # average across query tokens: one score per image token, shape (image_tokens[i],)
# print("attention_avg_text.shape:", attention_avg_text.shape)
if llm_compress_type == 'attention':
top_rank_index = attention_avg_text.topk(keep_length[i]).indices
else:
raise NotImplementedError(llm_compress_type)
elif llm_compress_type == 'uniform':
top_rank_index = torch.linspace(0, image_tokens[i]-1, keep_length[i], dtype=torch.long)
else:
raise NotImplementedError(llm_compress_type)
top_rank_index = top_rank_index + image_index
top_rank_index= top_rank_index.sort().values
start_index = image_index + image_tokens[i]
new_input_embeds = torch.cat([features[i][ :image_index, :] ,features[i][ top_rank_index, :], features[i][start_index:, :]], dim=0)
# print("origin labels:", len(labels))
# print(labels[i])
# print(top_rank_index)
new_labels = torch.cat([labels[i][ :image_index],labels[i][ top_rank_index], labels[i][start_index:]], dim=0)
new_attention_mask = torch.cat([attention_mask[i][:image_index], attention_mask[i][top_rank_index], attention_mask[i][start_index:]], dim=0)
features_list.append(new_input_embeds)
attention_mask_list.append(new_attention_mask)
labels_list.append(new_labels)
# Truncate sequences to max length as image embeddings can make the sequence longer
tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
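if tokenizer_model_max_length is not None:
new_input_embeds = [x[:tokenizer_model_max_length] for x in features_list]
new_attention_mask = [x[:tokenizer_model_max_length] for x in attention_mask_list]
new_labels = [x[:tokenizer_model_max_length] for x in labels_list]
else:
# No length limit configured: keep the full sequences so the names used below are always defined
new_input_embeds = features_list
new_attention_mask = attention_mask_list
new_labels = labels_list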
max_len = max(x.shape[0] for x in new_input_embeds)
# padding the sequences to form batch
embeds_padded=[]
labels_paded=[]
attention_mask_padded=[]
position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
cur_len_emb=cur_new_embed.shape[0]
dif=max_len - cur_len_emb # padding to longest seq
cur_new_embed = torch.cat([cur_new_embed,torch.zeros((dif, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)],dim=0)
cur_new_labels = torch.cat([cur_new_labels,torch.full((dif,),IGNORE_INDEX,dtype=cur_new_labels.dtype, device=cur_new_labels.device)],dim=0)
cur_attention_mask = new_attention_mask[i]
cur_attention_mask = torch.cat([cur_attention_mask,torch.full((dif,),False, dtype=cur_attention_mask.dtype, device=cur_attention_mask.device)],dim=0)
embeds_padded.append(cur_new_embed)
labels_paded.append(cur_new_labels)
attention_mask_padded.append(cur_attention_mask)
cur_len = new_attention_mask[i].sum().item()
position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
new_input_embeds = torch.stack(embeds_padded,dim=0)
new_input_embeds = new_input_embeds.to(features[0].dtype)
new_attention_mask = torch.stack(attention_mask_padded,dim=0)
new_labels = torch.stack(labels_paded,dim=0)
if _position_ids is None:
position_ids = None
if _labels is None:
new_labels = None
if _attention_mask is None:
new_attention_mask = None
else:
new_attention_mask = new_attention_mask.to(dtype=_attention_mask.dtype)
return position_ids, new_attention_mask, new_input_embeds, new_labels
else:
raise ValueError(f"Unexpected tokenizer_padding_side: {self.config.tokenizer_padding_side}")
class Qwen2ForCausalLM_Flash(Qwen2PreTrainedModel):
_tied_weights_keys = ["lm_head.weight"]
def __init__(self, config):
super().__init__(config)
self.model = Qwen2Model_Flash(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, Qwen2ForCausalLM
>>> model = Qwen2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs, labels = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
labels=labels
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
):
# Omit tokens covered by past_key_values
if past_key_values is not None:
if isinstance(past_key_values, Cache):
cache_length = past_key_values.get_seq_length()
past_length = past_key_values.seen_tokens
max_cache_length = past_key_values.get_max_length()
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (
max_cache_length is not None
and attention_mask is not None
and cache_length + input_ids.shape[1] > max_cache_length
):
attention_mask = attention_mask[:, -max_cache_length:]
position_ids = kwargs.get("position_ids", None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (
tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
)
return reordered_past
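# Illustrative sketch, not part of the original module: the rank-and-drop step in
# flash_rank_drop keeps a shrinking subset of visual tokens per stage, chosen either
# uniformly or by attention score. The pure-Python helpers below mirror that index
# arithmetic on plain lists. All names here (stage_lengths, uniform_keep_indices,
# attention_keep_indices) are hypothetical, and the uniform variant uses integer
# floor division, which may differ from torch.linspace in rounding edge cases.

```python
def stage_lengths(num_image_tokens, ratio_list):
    """Visual-token count at each stage, as int(num_tokens * ratio)."""
    return [int(num_image_tokens * ratio) for ratio in ratio_list]


def uniform_keep_indices(num_tokens, keep):
    """Evenly spaced indices in [0, num_tokens - 1] (assumed approximation of linspace)."""
    if keep <= 0:
        return []
    if keep == 1:
        return [0]
    return [(i * (num_tokens - 1)) // (keep - 1) for i in range(keep)]


def attention_keep_indices(scores, keep):
    """Indices of the `keep` highest scores, returned in their original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])


lengths = stage_lengths(64, [1.0, 0.5, 0.25])
# Tokens dropped at each stage; this is also the shift applied to position_ids
# during decoding when a compression layer is crossed.
shifts = [lengths[s] - lengths[s + 1] for s in range(len(lengths) - 1)]
uniform = uniform_keep_indices(9, 3)
by_attention = attention_keep_indices([0.1, 0.7, 0.05, 0.15], 2)
print(lengths, shifts, uniform, by_attention)
```

# Sorting the selected indices before gathering keeps the surviving visual tokens in
# their original temporal order, which is what the sort after top-k does above.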
================================================
FILE: llava-train_videochat/llava/model/llava_arch.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
import psutil
import math
import re
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
from .multimodal_encoder.builder import build_vision_tower
from .multimodal_projector.builder import build_vision_projector
from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import get_anyres_image_grid_shape
from llava.utils import rank0_print
import random
class LlavaMetaModel:
def __init__(self, config):
super(LlavaMetaModel, self).__init__(config)
if hasattr(config, "mm_vision_tower"):
delay_load = getattr(config, "delay_load", False)
self.vision_tower = build_vision_tower(config, delay_load=delay_load)
self.mm_projector = build_vision_projector(config, vision_cfg=self.vision_tower.config)
if "unpad" in getattr(config, "mm_patch_merge_type", ""):
self.image_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))
if "nopad" in getattr(config, "mm_patch_merge_type", "") and getattr(self.config, "mm_newline_position", "nothing") != "nothing":
self.frame_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))
def get_vision_tower(self):
vision_tower = getattr(self, "vision_tower", None)
if type(vision_tower) is list:
vision_tower = vision_tower[0]
return vision_tower
def initialize_vision_modules(self, model_args, fsdp=None):
vision_tower = model_args.vision_tower
mm_vision_select_layer = model_args.mm_vision_select_layer
mm_vision_select_feature = model_args.mm_vision_select_feature
pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
mm_patch_merge_type = model_args.mm_patch_merge_type
self.config.mm_vision_tower = vision_tower
self.config.vision_tower_pretrained = getattr(model_args, "vision_tower_pretrained", "")
if self.get_vision_tower() is None:
vision_tower = build_vision_tower(model_args)
if fsdp is not None and len(fsdp) > 0:
self.vision_tower = [vision_tower]
else:
self.vision_tower = vision_tower
else:
if fsdp is not None and len(fsdp) > 0:
vision_tower = self.vision_tower[0]
else:
vision_tower = self.vision_tower
vision_tower.load_model()
self.config.use_mm_proj = True
self.config.mm_projector_type = getattr(model_args, "mm_projector_type", "linear")
self.config.mm_hidden_size = vision_tower.hidden_size
self.config.mm_vision_select_layer = mm_vision_select_layer
self.config.mm_vision_select_feature = mm_vision_select_feature
self.config.mm_patch_merge_type = mm_patch_merge_type
if getattr(self, "mm_projector", None) is None:
self.mm_projector = build_vision_projector(self.config, vision_cfg=vision_tower.config)
if "unpad" in mm_patch_merge_type:
embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
self.image_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
if "nopad" in getattr(self.config, "mm_patch_merge_type", "") and getattr(self.config, "mm_newline_position", "nothing") != "nothing":
embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
self.frame_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
else:
# In case it is frozen by LoRA
for p in self.mm_projector.parameters():
p.requires_grad = True
if pretrain_mm_mlp_adapter is not None:
mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
def get_w(weights, keyword):
return {k.split(keyword + ".")[1]: v for k, v in weights.items() if keyword in k}
if self.config.mm_projector_type =='lxh_qformer':
incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"), strict=False)
else:
incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
rank0_print(f"Loaded mm projector weights from {pretrain_mm_mlp_adapter}. Incompatible keys: {incompatible_keys}")
def unpad_image(tensor, original_size, is_frame=False):
"""
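Unpads a PyTorch tensor of a padded and resized image.
Args:
tensor (torch.Tensor): The image tensor, in CxHxW format, or a 4-D tensor whose last two dimensions are height and width when `is_frame` is True.
original_size (tuple): The original size of the image as (width, height).
is_frame (bool): Whether `tensor` carries an extra leading dimension, so that height and width are read from dimensions 2 and 3.
Returns:
torch.Tensor: The unpadded image tensor.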
"""
original_width, original_height = original_size
if is_frame:
current_height, current_width = tensor.shape[2:]
else:
current_height, current_width = tensor.shape[1:]
# Compute aspect ratios
original_aspect_ratio = original_width / original_height
current_aspect_ratio = current_width / current_height
# Determine padding size and direction
if original_aspect_ratio > current_aspect_ratio:
# Padding was added to the height
scale_factor = current_width / original_width
new_height = int(original_height * scale_factor)
padding = (current_height - new_height) // 2
if is_frame:
unpadded_tensor = tensor[:, :, padding : current_height - padding, :]
else:
unpadded_tensor = tensor[:, padding : current_height - padding, :]
else:
# Padding was added to the width
scale_factor = current_height / original_height
new_width = int(original_width * scale_factor)
padding = (current_width - new_width) // 2
if is_frame:
unpadded_tensor = tensor[:, :, :, padding : current_width - padding]
else:
unpadded_tensor = tensor[:, :, padding : current_width - padding]
return unpadded_tensor
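# Illustrative sketch, not part of the original module: the crop arithmetic in
# unpad_image can be checked without tensors. The hypothetical helper unpad_bounds
# below returns the (top, bottom, left, right) slice bounds that the function above
# would apply, assuming original_size is given as (width, height) and the padding
# was added symmetrically.

```python
def unpad_bounds(original_size, current_hw):
    """Return (top, bottom, left, right) bounds that strip symmetric padding."""
    original_width, original_height = original_size
    current_height, current_width = current_hw
    original_aspect = original_width / original_height
    current_aspect = current_width / current_height
    if original_aspect > current_aspect:
        # Image is wider than the canvas: padding sits above and below.
        scale = current_width / original_width
        new_height = int(original_height * scale)
        pad = (current_height - new_height) // 2
        return pad, current_height - pad, 0, current_width
    # Otherwise the padding sits on the left and right.
    scale = current_height / original_height
    new_width = int(original_width * scale)
    pad = (current_width - new_width) // 2
    return 0, current_height, pad, current_width - pad


wide = unpad_bounds((1024, 512), (256, 256))
tall = unpad_bounds((512, 1024), (256, 256))
print(wide, tall)
```

# For a 1024x512 image on a 256x256 canvas the content occupies rows 64 to 192;
# for a 512x1024 image it occupies columns 64 to 192.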
class LlavaMetaForCausalLM(ABC):
@abstractmethod
def get_model(self):
pass
def get_vision_tower(self):
return self.get_model().get_vision_tower()
def get_4dPool(self, image_feature):
num_frames, num_tokens, num_dim = image_feature.shape
height = width = int(math.sqrt(num_tokens))
assert num_tokens == height * width, image_feature.shape
image_feature = image_feature.view(num_frames, height, width, -1)
image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
# image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
if self.config.mm_spatial_pool_mode == "average":
raise NotImplementedError
image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "max":
raise NotImplementedError
image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "bilinear":
height, width = image_feature.shape[2:]
scaled_shape = [math.ceil(height / 4), math.ceil(width / 4)]
image_feature = nn.functional.interpolate(image_feature, size=scaled_shape, mode='bilinear')
else:
raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
image_feature = image_feature.permute(0, 2, 3, 1)
image_feature = image_feature.view(num_frames, -1, num_dim)
return image_feature
def get_2dPool(self, image_feature):
num_frames, num_tokens, num_dim = image_feature.shape
height = width = int(math.sqrt(num_tokens))
assert num_tokens == height * width, image_feature.shape
image_feature = image_feature.view(num_frames, height, width, -1)
image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
# image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
if self.config.mm_spatial_pool_mode == "average":
raise NotImplementedError
image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "max":
raise NotImplementedError
image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "bilinear":
height, width = image_feature.shape[2:]
scaled_shape = [math.ceil(height / 2), math.ceil(width / 2)]
image_feature = nn.functional.interpolate(image_feature, size=scaled_shape, mode='bilinear')
else:
raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
image_feature = image_feature.permute(0, 2, 3, 1)
image_feature = image_feature.view(num_frames, -1, num_dim)
return image_feature
def encode_image(self, images_list):
concat_images = torch.cat([image for image in images_list], dim=0)
split_sizes = [image.shape[0] for image in images_list]
image_features = self.get_model().get_vision_tower()(concat_images)
image_features = self.get_model().mm_projector(image_features)
image_features = torch.split(image_features, split_sizes)
return image_features
def encode_image_video(self, images_list, video_idx_in_batch):
concat_images = torch.cat([image for image in images_list], dim=0)
split_sizes = [image.shape[0] for image in images_list]
videos_or_images_features = self.get_model().get_vision_tower()(concat_images)
per_videos_or_images_features = torch.split(videos_or_images_features, split_sizes, dim=0) # tuple, (dim_1, 576, 4096)
all_videos_or_images_features = []
for idx, feat in enumerate(per_videos_or_images_features):
if idx in video_idx_in_batch:
feat = self.get_model().mm_projector(feat, compress=True, local_num_frames=getattr(self.config, "mm_local_num_frames", -1))
else:
feat = self.get_model().mm_projector(feat, compress=False)
all_videos_or_images_features.append(feat)
return all_videos_or_images_features
def encode_video(self, images_list, video_idx_in_batch):
bs = len(images_list)
concat_images = []
concat_videos = []
for idx, image in enumerate(images_list):
if idx in video_idx_in_batch:
concat_videos.append(image)
else:
concat_images.append(image)
# print(concat_videos[0].shape)
has_image = len(concat_images) > 0
has_video = len(concat_videos) > 0
mm_local_num_frames = getattr(self.config, "mm_local_num_frames", -1)
assert mm_local_num_frames != -1
if has_image:
image_split_sizes = [image.shape[0] for image in concat_images]
concat_images = torch.cat([image.unsqueeze(1) for image in concat_images], dim=0)
images_features = self.get_model().get_vision_tower()(concat_images) # B_i, N, D
images_features = self.get_model().mm_projector(images_features, compress=False, local_num_frames=1)
images_features = torch.split(images_features, image_split_sizes)
if has_video:
video_split_sizes = [video.shape[0] // mm_local_num_frames for video in concat_videos]
concat_videos = torch.cat([video.reshape(video.shape[0] // mm_local_num_frames, mm_local_num_frames, video.shape[1], video.shape[2], video.shape[3]) for video in concat_videos], dim=0) # B T C H W
videos_features = self.get_model().get_vision_tower()(concat_videos) # B_v, N, D
videos_features = self.get_model().mm_projector(videos_features, compress=True, local_num_frames=mm_local_num_frames)
videos_features = [v.reshape(-1, v.shape[-2] // mm_local_num_frames, v.shape[-1]) for v in torch.split(videos_features, video_split_sizes)]
all_videos_or_images_features = []
img_idx = 0
vid_idx = 0
for idx in range(bs):
if idx in video_idx_in_batch:
feat =videos_features[vid_idx]
vid_idx += 1
else:
feat = images_features[img_idx]
img_idx += 1
all_videos_or_images_features.append(feat)
if has_video:
assert vid_idx == len(videos_features), f"vid: {vid_idx} != {len(videos_features)}"
if has_image:
assert img_idx == len(images_features), f"img: {img_idx} != {len(images_features)}"
return all_videos_or_images_features
def encode_video_image(self, images_list, video_idx_in_batch):
bs = len(images_list)
concat_images = []
concat_videos = []
for idx, image in enumerate(images_list):
if idx in video_idx_in_batch:
concat_videos.append(image)
else:
concat_images.append(image)
# print(concat_videos[0].shape)
has_image = len(concat_images) > 0
has_video = len(concat_videos) > 0
mm_local_num_frames = getattr(self.config, "mm_local_num_frames", -1)
assert mm_local_num_frames != -1
if has_image:
image_split_sizes = [image.shape[0] for image in concat_images]
concat_images = torch.cat([image.unsqueeze(1) for image in concat_images], dim=0)
# print("input vit image.shape:", concat_images.shape)
images_features = self.get_model().get_vision_tower()(concat_images) # B_i, N, D
images_features = torch.split(images_features, image_split_sizes)
if has_video:
video_split_sizes = [video.shape[0] // mm_local_num_frames for video in concat_videos]
concat_videos = torch.cat([video.reshape(video.shape[0] // mm_local_num_frames, mm_local_num_frames, video.shape[1], video.shape[2], video.shape[3]) for video in concat_videos], dim=0)
# print("input vit video.shape:", concat_videos.shape)
videos_features = self.get_model().get_vision_tower()(concat_videos) # B_v, N, D
videos_features = [v.reshape(-1, v.shape[-2] // mm_local_num_frames, v.shape[-1]) for v in torch.split(videos_features, video_split_sizes)]
all_videos_or_images_features = []
img_idx = 0
vid_idx = 0
for idx in range(bs):
if idx in video_idx_in_batch:
feat = self.get_model().mm_projector(videos_features[vid_idx], compress=True, local_num_frames=getattr(self.config, "mm_local_num_frames", -1))
vid_idx += 1
else:
feat = self.get_model().mm_projector(images_features[img_idx], compress=False)
img_idx += 1
all_videos_or_images_features.append(feat)
if has_video:
assert vid_idx == len(videos_features), f"vid: {vid_idx} != {len(videos_features)}"
if has_image:
assert img_idx == len(images_features), f"img: {img_idx} != {len(images_features)}"
return all_videos_or_images_features
def add_token_per_frame(self, image_feature):
image_feature = image_feature.permute(2, 0, 1).contiguous()
if hasattr(self.model, "frame_newline"):
image_feature = torch.cat((image_feature, self.model.frame_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)), dim=-1)
else:
image_feature = torch.cat((image_feature, self.model.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)), dim=-1)
image_feature = image_feature.permute(1, 2, 0).contiguous()
return image_feature
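# Illustration only: a minimal, torch-free sketch of what add_token_per_frame
# does to the token layout. Nested lists stand in for a (T, N, D) tensor, and
# the helper names (add_newline_per_frame, flatten_frames) are hypothetical,
# not part of this repository. Each frame gains one trailing "newline" vector,
# so T frames of N tokens become T frames of N + 1 tokens before flattening.

```python
def add_newline_per_frame(frames, newline):
    """Append a copy of `newline` (a length-D vector) to every frame.

    frames: list of T frames, each a list of N token vectors of length D.
    Returns a new list of T frames, each holding N + 1 token vectors.
    """
    return [frame + [list(newline)] for frame in frames]


def flatten_frames(frames):
    """Mimic `flatten(0, 1)`: merge the frame and token axes into one."""
    return [token for frame in frames for token in frame]


frames = [[[1, 1], [2, 2]], [[3, 3], [4, 4]]]  # T=2, N=2, D=2
with_newline = add_newline_per_frame(frames, newline=[0, 0])
sequence = flatten_frames(with_newline)
```

# In this toy case the flattened sequence holds T * (N + 1) = 6 token vectors,
# with the newline vector closing each frame.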
def add_different_token_per_frame(self, image_feature):
raise NotImplementedError("add_different_token_per_frame is not implemented")
def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities=["image"], image_sizes=None):
assert type(modalities) is list, modalities
vision_tower = self.get_vision_tower()
# rank_print(modalities)
if vision_tower is None or images is None or input_ids.shape[1] == 1:
return input_ids, position_ids, attention_mask, past_key_values, None, labels
if type(images) is list or images.ndim == 5:
if type(images) is list:
images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
video_idx_in_batch = []
for _ in range(len(modalities)):
if modalities[_] == "video":
video_idx_in_batch.append(_)
images_list = []
for image in images:
if image.ndim == 4:
images_list.append(image)
else:
images_list.append(image.unsqueeze(0))
vision_encode_type = getattr(self.config, "vision_encode_type", "image")
mm_patch_merge_type = getattr(self.config, "mm_patch_merge_type", "flat")
image_aspect_ratio = getattr(self.config, "image_aspect_ratio", "square")
frame_aspect_ratio = getattr(self.config, "frame_aspect_ratio", "square")
mm_newline_position = getattr(self.config, "mm_newline_position", "nothing")
if "anyres" in frame_aspect_ratio:
new_images_list = []
num_frames_list = []
for idx, image in enumerate(images_list):
if idx in video_idx_in_batch:
T, C, H, W = image.shape
num_frames_list.append(T)
# print("origin video.shape:", image.shape) # T C H W
patch_size = self.get_vision_tower().image_size
if H * W != patch_size * patch_size:
global_image = F.interpolate(
image.float(), size=(patch_size, patch_size), mode='bicubic', align_corners=False
).to(image.dtype).unsqueeze(0)
sub_image = image.reshape(
1, T, C, H//patch_size, patch_size, W//patch_size, patch_size
).permute(0, 3, 5, 1, 2, 4, 6).reshape(-1, T, C, patch_size, patch_size).contiguous()
new_image = torch.concat([global_image, sub_image], dim=0).flatten(0, 1)
else:
new_image = image
# print("new video shape:", new_image.shape)
new_images_list.append(new_image)
else:
num_frames_list.append(1)
new_images_list.append(image)
images_list = new_images_list
# rank0_print(self.config)
# TODO image: share vit&connector for image/video, image_video:, video
if vision_encode_type == "image": # image backbone, process video by frame
image_features = self.encode_image(images_list)
elif vision_encode_type == "video": # video backbone, process video with compress
image_features = self.encode_video(images_list, video_idx_in_batch=video_idx_in_batch)
elif vision_encode_type == "image_video": # image backbone, process video with compress
image_features = self.encode_image_video(images_list, video_idx_in_batch=video_idx_in_batch)
elif vision_encode_type == "image_video_new":
image_features = self.encode_image_video_new(images_list, video_idx_in_batch=video_idx_in_batch)
elif vision_encode_type == "video_image": # image backbone, process video with compress
image_features = self.encode_video_image(images_list, video_idx_in_batch=video_idx_in_batch)
else:
raise NotImplementedError(vision_encode_type)
if 'llava_ov' in getattr(self.config, "transformers_version", ""):
new_image_features = []
# print("I am llava ov!!!!!!!")
for idx, image_feat in enumerate(image_features):
if idx in video_idx_in_batch: # NOTE lxh: I don't want it.
new_image_features.append(self.get_2dPool(image_feat))
else:
new_image_features.append(image_feat)
image_features = new_image_features
if mm_patch_merge_type == "flat":
image_features = [x.flatten(0, 1) for x in image_features]
elif mm_patch_merge_type.startswith("spatial"):
new_image_features = []
for image_idx, image_feature in enumerate(image_features):
# FIXME: now assume the image is square, and split to 2x2 patches
# num_patches = h * w, where h = w = sqrt(num_patches)
# currently image_feature is a tensor of shape (4, num_patches, hidden_size)
# we want to first unflatten it to (2, 2, h, w, hidden_size)
# rank0_print("At least we are reaching here")
if image_idx in video_idx_in_batch: # video operations
# rank0_print("Video")
# rank0_print(f"video image_feature.shape: {image_feature.shape}")
if "anyres" in frame_aspect_ratio:
if "anyres_max" in frame_aspect_ratio:
matched_anyres_max_num_patches = re.match(r"anyres_max_(\d+)", frame_aspect_ratio)
if matched_anyres_max_num_patches:
max_num_patches = int(matched_anyres_max_num_patches.group(1))
num_frames = num_frames_list[image_idx]
if hasattr(self.get_vision_tower(), "image_size"):
vision_tower_image_size = self.get_vision_tower().image_size
else:
raise ValueError("vision_tower_image_size is not found in the vision tower.")
try:
num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.frame_grid_pinpoints, vision_tower_image_size, max_resolutions=self.config.max_num_pixels // num_frames) # TODO: num_frames needs to be passed in for this computation
except Exception as e:
rank0_print(f"Error: {e}, self.config:{self.config}")
raise e
height = width = self.get_model().mm_projector.num_frame_patches_per_side
if "maxpool2x2" in mm_patch_merge_type:
raise NotImplementedError
elif "unpad" in mm_patch_merge_type and "anyres_max" in frame_aspect_ratio and matched_anyres_max_num_patches:
raise NotImplementedError
elif "unpad" in mm_patch_merge_type and "anyres" in frame_aspect_ratio:
raise NotImplementedError
else:
# rank0_print(f"652 num_frames={num_frames}")
if num_patch_height * num_patch_width != 1: # has global
image_feature = image_feature.view(num_patch_height * num_patch_width + 1, -1, height, width, image_feature.shape[-1])
assert num_frames == image_feature.shape[1], f"{num_frames} != {image_feature.shape[1]}"
base_frame_feature = image_feature[0].view(num_frames, -1, image_feature[0].shape[-1]) # T 4*4 C
# rank0_print(f"base_frame_feature.shape: {base_frame_feature.shape}")
image_feature = image_feature[1:].permute(1, 0, 2, 3, 4) # T P 4 4 C
frame_feature = image_feature.view(num_frames, num_patch_height, num_patch_width, height, width, -1)
frame_feature = frame_feature.permute(0, 1, 3, 2, 4, 5).contiguous()
frame_feature = frame_feature.flatten(1, 4)
frame_feature = torch.cat((base_frame_feature, frame_feature), dim=1)
# rank0_print(f"two_frame_feature.shape: {frame_feature.shape}")
else: # no global
frame_feature = image_feature.view(num_frames, -1, image_feature[0].shape[-1]) # T 4*4 C
# rank0_print(f"only_frame_feature.shape: {frame_feature.shape}")
if "nobase" in mm_patch_merge_type:
raise NotImplementedError
else:
frame_feature = image_feature
if "pad" in mm_patch_merge_type: # matches both "unpad" and "nopad"
if mm_newline_position == 'one_token':
frame_feature = frame_feature.flatten(0, 1)
if "unpad" in mm_patch_merge_type:
frame_feature = torch.cat((frame_feature, self.model.image_newline[None].to(frame_feature.device)), dim=0)
else:
frame_feature = torch.cat((frame_feature, self.model.frame_newline[None].to(frame_feature.device)), dim=0)
elif mm_newline_position == 'frame':
# Frame-wise
frame_feature = self.add_token_per_frame(frame_feature)
frame_feature = frame_feature.flatten(0, 1)
elif mm_newline_position == 'frame2':
# Frame-wise
raise NotImplementedError
elif mm_newline_position == 'nothing':
frame_feature = frame_feature.flatten(0, 1)
else:
raise NotImplementedError("add pad please!!")
else:
frame_feature = frame_feature.flatten(0, 1)
# rank0_print(f"final video frame_feature.shape: {frame_feature.shape}")
image_feature = frame_feature
elif image_feature.shape[0] > 1: # multi patches and multi images operations
# rank0_print("Single-images") NOTE: multi-image inputs never actually reach this branch, because they are split into multiple padded single images
base_image_feature = image_feature[0]
image_feature = image_feature[1:]
origin_size = image_feature.shape
height = width = self.get_model().mm_projector.num_image_patches_per_side # NOTE: hard-coded, one image is 49 tokens
assert height * width == base_image_feature.shape[0], f"height:{height}, width: {width}, base_image_feature: {base_image_feature.shape}"
if "anyres_max" in image_aspect_ratio:
matched_anyres_max_num_patches = re.match(r"anyres_max_(\d+)", image_aspect_ratio)
if matched_anyres_max_num_patches:
max_num_patches = int(matched_anyres_max_num_patches.group(1))
if "anyres" in image_aspect_ratio:
if hasattr(self.get_vision_tower(), "image_size"):
vision_tower_image_size = self.get_vision_tower().image_size
else:
raise ValueError("vision_tower_image_size is not found in the vision tower.")
try:
num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.image_grid_pinpoints, vision_tower_image_size, max_resolutions=None)
except Exception as e:
rank0_print(f"Error: {e}")
raise e
# num_patch_width, num_patch_height = 2, 2
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
else:
raise NotImplementedError(image_aspect_ratio)
image_feature = image_feature.view(2, 2, height, width, -1)
if "maxpool2x2" in mm_patch_merge_type:
raise NotImplementedError
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
image_feature = image_feature.flatten(1, 2).flatten(2, 3)
image_feature = nn.functional.max_pool2d(image_feature, 2)
image_feature = image_feature.flatten(1, 2).transpose(0, 1)
elif "unpad" in mm_patch_merge_type and "anyres_max" in image_aspect_ratio and matched_anyres_max_num_patches:
raise NotImplementedError
elif "unpad" in mm_patch_merge_type:
raise NotImplementedError
else:
image_feature = image_feature.permute(0, 2, 1, 3, 4).contiguous()
image_feature = image_feature.flatten(0, 3)
if "nobase" in mm_patch_merge_type:
pass
else:
try:
image_feature = torch.cat((base_image_feature, image_feature), dim=0)
except Exception as e:
raise ValueError(f"{num_patch_width} {num_patch_height} now: base_image_feature: {base_image_feature.shape}, {image_feature.shape}, image_sizes[image_idx]: {image_sizes[image_idx]}, origin_size: {origin_size}, {image_sizes[image_idx]}, {self.config.image_grid_pinpoints}, {vision_tower_image_size}")
else: # single image operations
image_feature = image_feature[0]
if "unpad" in mm_patch_merge_type:
image_feature = torch.cat((image_feature, self.model.image_newline[None]), dim=0)
# rank0_print(f"image/video_feature.shape: {image_feature.shape}")
new_image_features.append(image_feature)
image_features = new_image_features
else:
raise ValueError(f"Unexpected mm_patch_merge_type: {self.config.mm_patch_merge_type}")
else:
# raise NotImplementedError(f"images.shape={images.shape}, modalities={modalities}")
image_features = self.encode_image(images)
# TODO: image start / end is not implemented here to support pretraining.
if getattr(self.config, "tune_mm_mlp_adapter", False) and getattr(self.config, "mm_use_im_start_end", False):
raise NotImplementedError
# rank0_print(f"Total images len(image_features: {len(image_features)}")
# Let's just add dummy tensors if they do not exist,
# it is a headache to deal with None all the time.
# But it is not ideal, and if you have a better idea,
# please open an issue / submit a PR, thanks.
_labels = labels
_position_ids = position_ids
_attention_mask = attention_mask
if attention_mask is None:
attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
else:
attention_mask = attention_mask.bool()
if position_ids is None:
position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
if labels is None:
labels = torch.full_like(input_ids, IGNORE_INDEX)
# remove the padding using attention_mask -- FIXME
_input_ids = input_ids
input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
new_input_embeds = []
new_labels = []
cur_image_idx = 0
mm_llm_compress = getattr(self.config, "mm_llm_compress", False)
if mm_llm_compress:
self.model.llm_compress_type = getattr(self.config, "llm_compress_type", "attention")
self.model.llm_compress_layer_list = getattr(self.config, "llm_compress_layer_list", [8, 16, 24])
self.model.llm_image_token_ratio_list = getattr(self.config, "llm_image_token_ratio_list", [1.0, 0.5, 0.25, 0.125])
first_image_token_position = []
text_prompt_lens = []
else:
self.model.llm_compress_type = "attention"
self.model.llm_compress_layer_list = []
self.model.llm_image_token_ratio_list = []
# rank_print("Inserting Images embedding")
for batch_idx, cur_input_ids in enumerate(input_ids):
num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
if mm_llm_compress:
####### copy from pdrop, only support single image/video NOTE ##################
# record image position for further dropping
image_index = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist()
assert len(image_index) == 1, f"Only support single image/video: {image_index}"
if image_index == []:
first_image_token_position.append(-1)
else:
first_image_token_position.append(image_index[0])
# record input instruction length in inference mode
if not self.training:
if image_index == []:
assert num_images == 0, num_images
else:
assert num_images == 1, f"num_images={num_images}, not support"
text_prompt_lens.append(cur_input_ids.shape[0] - num_images) # consider image place holder
###############################################
# rank0_print(f"num_images={num_images}")
if num_images == 0:
cur_image_features = image_features[cur_image_idx]
cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
new_input_embeds.append(cur_input_embeds)
new_labels.append(labels[batch_idx])
cur_image_idx += 1
continue
image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
cur_input_ids_noim = []
cur_labels = labels[batch_idx]
cur_labels_noim = []
for i in range(len(image_token_indices) - 1):
cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] + 1 : image_token_indices[i + 1]])
cur_labels_noim.append(cur_labels[image_token_indices[i] + 1 : image_token_indices[i + 1]])
split_sizes = [x.shape[0] for x in cur_labels_noim]
cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
cur_new_input_embeds = []
cur_new_labels = []
for i in range(num_images + 1):
cur_new_input_embeds.append(cur_input_embeds_no_im[i])
cur_new_labels.append(cur_labels_noim[i])
if i < num_images:
try:
cur_image_features = image_features[cur_image_idx]
except IndexError:
rank0_print(f"cur_image_idx={cur_image_idx} is not ok")
cur_image_features = image_features[cur_image_idx - 1]
cur_image_idx += 1
cur_new_input_embeds.append(cur_image_features)
cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))
cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
# import pdb; pdb.set_trace()
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
cur_new_labels = torch.cat(cur_new_labels)
new_input_embeds.append(cur_new_input_embeds)
new_labels.append(cur_new_labels)
if mm_llm_compress:
self.model.first_image_token_position = first_image_token_position
self.model.text_prompt_lens = text_prompt_lens
self.model.num_image_token_lens = [image_feature.shape[0] for image_feature in image_features]
# Truncate sequences to max length as image embeddings can make the sequence longer
tokenizer_model_max_length = getattr(self.config, "tokenizer_model_max_length", None)
# rank_print("Finishing Inserting")
new_input_embeds = [x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
new_labels = [x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]
# Combine them
max_len = max(x.shape[0] for x in new_input_embeds)
batch_size = len(new_input_embeds)
new_input_embeds_padded = []
new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
# rank0_print("Prepare pos id")
for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
cur_len = cur_new_embed.shape[0]
if getattr(self.config, "tokenizer_padding_side", "right") == "left":
new_input_embeds_padded.append(torch.cat((torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device), cur_new_embed), dim=0))
if cur_len > 0:
new_labels_padded[i, -cur_len:] = cur_new_labels
attention_mask[i, -cur_len:] = True
position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
else:
new_input_embeds_padded.append(torch.cat((cur_new_embed, torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0))
if cur_len > 0:
new_labels_padded[i, :cur_len] = cur_new_labels
attention_mask[i, :cur_len] = True
position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
# rank0_print("tokenizer padding")
if _labels is None:
new_labels = None
else:
new_labels = new_labels_padded
if _attention_mask is None:
attention_mask = None
else:
attention_mask = attention_mask.to(dtype=_attention_mask.dtype)
if _position_ids is None:
position_ids = None
if getattr(self.config, "use_pos_skipping", False) and self.training:
position_ids = torch.arange(new_input_embeds.size(1), device=new_input_embeds.device).unsqueeze(0).to(new_input_embeds.device)
split_position = random.randint(0, new_input_embeds.size(1))
left_add = random.randint(0, self.config.pos_skipping_range)
right_add = random.randint(left_add, self.config.pos_skipping_range)
position_ids[:, :split_position] += left_add
position_ids[:, split_position:] += right_add
# import pdb; pdb.set_trace()
# print("Finish preparing")
return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
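# Illustration only: a simplified, torch-free sketch of the placeholder
# splicing performed in prepare_inputs_labels_for_multimodal. The constant
# values below are assumptions (the real ones come from llava.constants), the
# function name is hypothetical, and the string "<vis>" stands in for one row
# of visual features. Text is split at each image placeholder, the visual
# tokens are inserted in its place, and their labels are set to IGNORE_INDEX
# so they do not contribute to the loss.

```python
IMAGE_TOKEN_INDEX = -200  # assumed placeholder id
IGNORE_INDEX = -100       # assumed "no loss" label


def splice_visual_tokens(input_ids, labels, visual_lens):
    """Expand every placeholder into `visual_lens[k]` visual tokens."""
    placeholder_pos = [i for i, t in enumerate(input_ids) if t == IMAGE_TOKEN_INDEX]
    boundaries = [-1] + placeholder_pos + [len(input_ids)]
    new_ids, new_labels = [], []
    for k in range(len(boundaries) - 1):
        start, end = boundaries[k] + 1, boundaries[k + 1]
        # Copy the text segment between two placeholders unchanged.
        new_ids.extend(input_ids[start:end])
        new_labels.extend(labels[start:end])
        # After every segment except the last, insert the visual tokens.
        if k < len(placeholder_pos):
            new_ids.extend(["<vis>"] * visual_lens[k])
            new_labels.extend([IGNORE_INDEX] * visual_lens[k])
    return new_ids, new_labels


ids, lbls = splice_visual_tokens(
    input_ids=[1, IMAGE_TOKEN_INDEX, 2, 3],
    labels=[1, IGNORE_INDEX, 2, 3],
    visual_lens=[2],
)
```

# The real method additionally truncates to tokenizer_model_max_length and
# pads every sequence in the batch to a common length.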
def initialize_vision_tokenizer(self, model_args, tokenizer):
if model_args.mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if model_args.mm_use_im_start_end:
num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = self.get_input_embeddings().weight.data
output_embeddings = self.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = True
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
if model_args.pretrain_mm_mlp_adapter:
mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location="cpu")
embed_tokens_weight = mm_projector_weights["model.embed_tokens.weight"]
assert num_new_tokens == 2
if input_embeddings.shape == embed_tokens_weight.shape:
input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
elif embed_tokens_weight.shape[0] == num_new_tokens:
input_embeddings[-num_new_tokens:] = embed_tokens_weight
else:
raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Number of new tokens: {num_new_tokens}.")
elif model_args.mm_use_im_patch_token:
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = False
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
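# Illustration only: a torch-free sketch of the mean initialisation that
# initialize_vision_tokenizer applies to newly added token rows. Lists of
# floats stand in for the embedding matrix, and mean_init_new_rows is a
# hypothetical helper name. The last num_new_tokens rows are overwritten with
# the average of all earlier rows.

```python
def mean_init_new_rows(embeddings, num_new_tokens):
    """Return a copy whose trailing rows equal the mean of the old rows."""
    old_rows = embeddings[:-num_new_tokens]
    dim = len(old_rows[0])
    average = [sum(row[d] for row in old_rows) / len(old_rows) for d in range(dim)]
    return [row[:] for row in old_rows] + [average[:] for _ in range(num_new_tokens)]


table = [[1.0, 3.0], [3.0, 5.0], [0.0, 0.0], [0.0, 0.0]]  # two old rows, two new rows
initialised = mean_init_new_rows(table, num_new_tokens=2)
```

# Starting new rows at the mean keeps them inside the distribution of the
# existing embeddings rather than at an arbitrary point.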
================================================
FILE: llava-train_videochat/llava/model/make_delta.py
================================================
"""
Usage:
python3 -m llava.model.make_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/llava-7b --delta-path ~/model_weights/llava-7b-delta --hub-repo-id liuhaotian/llava-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava.model.utils import auto_upgrade
def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading target model")
auto_upgrade(target_model_path)
target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Calculating delta")
for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"):
if name not in base.state_dict():
assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
continue
if param.data.shape == base.state_dict()[name].shape:
param.data -= base.state_dict()[name]
else:
assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
bparam = base.state_dict()[name]
param.data[: bparam.shape[0], : bparam.shape[1]] -= bparam
print("Saving delta")
if hub_repo_id:
kwargs = {"push_to_hub": True, "repo_id": hub_repo_id}
else:
kwargs = {}
target.save_pretrained(delta_path, **kwargs)
target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
target_tokenizer.save_pretrained(delta_path, **kwargs)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
parser.add_argument("--hub-repo-id", type=str, default=None)
args = parser.parse_args()
make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id)
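# Illustration only: a torch-free sketch of the delta arithmetic in make_delta
# for a weight whose target has more rows than the base (for example an
# enlarged vocabulary). Nested lists stand in for 2-D tensors and both helper
# names are hypothetical. Only the overlapping top-left block is subtracted;
# extra rows are stored unchanged. The second helper shows the inverse step
# that would recover the target, on the assumption that applying a delta is
# the mirror image of making one.

```python
def make_delta_rows(base, target):
    """delta = target, with base subtracted from the overlapping block."""
    delta = [row[:] for row in target]
    for i, base_row in enumerate(base):
        for j, value in enumerate(base_row):
            delta[i][j] -= value
    return delta


def apply_delta_rows(base, delta):
    """Inverse of make_delta_rows: add base back onto the overlapping block."""
    restored = [row[:] for row in delta]
    for i, base_row in enumerate(base):
        for j, value in enumerate(base_row):
            restored[i][j] += value
    return restored


base = [[1, 2], [3, 4]]
target = [[2, 4], [6, 8], [9, 9]]  # one extra row, e.g. a newly added token
delta = make_delta_rows(base, target)
restored = apply_delta_rows(base, delta)
```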
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/builder.py
================================================
import os
from .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2
from .siglip_encoder import SigLipVisionTower
from .umt_encoder import UMTVisionTower
from .internvideo2_encoder import InternVideo2VisionTower
# from .eva_clip.eva_clip_encoder import EvaClipVisionTower
# from .dev_eva_clip.eva_vit import EvaViTWrapper
def build_vision_tower(vision_tower_cfg, **kwargs):
vision_tower = getattr(vision_tower_cfg, "mm_vision_tower", getattr(vision_tower_cfg, "vision_tower", None))
# is_absolute_path_exists = os.path.exists(vision_tower) # NOTE sb code!
use_s2 = getattr(vision_tower_cfg, "s2", False)
if 'clip-vit' in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
if use_s2:
return CLIPVisionTowerS2(vision_tower, args=vision_tower_cfg, **kwargs)
else:
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
elif "siglip" in vision_tower:
return SigLipVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
elif "internvideo2" in vision_tower:
return InternVideo2VisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, image_size=224, **kwargs)
elif "umt-hd" in vision_tower:
return UMTVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, image_size=448, **kwargs)
elif "umt" in vision_tower:
return UMTVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
else:
raise ValueError(f"Unknown vision tower: {vision_tower}")
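# Illustration only: a sketch of the name-based dispatch in build_vision_tower
# that returns a label instead of constructing a model. The function name, the
# returned labels, and the checkpoint names in the usage lines are
# hypothetical. It shows why the "umt-hd" test has to come before the plain
# "umt" test: every "umt-hd" name also contains "umt".

```python
def classify_vision_tower(name):
    """Return a label for the encoder family that `name` would select."""
    if "clip-vit" in name or name.startswith(("openai", "laion")) or "ShareGPT4V" in name:
        return "clip"
    if "siglip" in name:
        return "siglip"
    if "internvideo2" in name:
        return "internvideo2 (image_size=224)"
    if "umt-hd" in name:  # must precede the generic "umt" check
        return "umt (image_size=448)"
    if "umt" in name:
        return "umt (default image_size)"
    raise ValueError(f"Unknown vision tower: {name}")


hd_label = classify_vision_tower("some-org/umt-hd-example")
plain_label = classify_vision_tower("some-org/umt-example")
```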
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/clip_encoder.py
================================================
import torch
import torch.nn as nn
from llava.utils import rank0_print
from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig
try:
from s2wrapper import forward as multiscale_forward
except ImportError:
pass
class CLIPVisionTower(nn.Module):
def __init__(self, vision_tower, args, delay_load=False):
super().__init__()
self.is_loaded = False
self.vision_tower_name = vision_tower
self.select_layer = args.mm_vision_select_layer
self.select_feature = getattr(args, "mm_vision_select_feature", "patch")
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(args, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(args, "mm_tunable_parts") and "mm_vision_tower" in args.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def feature_select(self, image_forward_outs):
select_feature_type = self.select_feature
if self.select_feature in ["slicefour_patch", "slicefour_cls_patch"]:
select_every_k_layer = len(image_forward_outs.hidden_states) // 4
image_features = torch.cat([image_forward_outs.hidden_states[i] for i in range(select_every_k_layer + self.select_layer, len(image_forward_outs.hidden_states), select_every_k_layer)], dim=-1)
select_feature_type = select_feature_type.replace("slicefour_", "")
elif self.select_feature in ["slice_m25811_f6_patch", "slice_m25811_f6_cls_patch"]:
select_layers = [-2, -5, -8, -11, 6]
image_features = torch.cat([image_forward_outs.hidden_states[i] for i in select_layers], dim=-1)
select_feature_type = select_feature_type.replace("slice_m25811_f6_", "")
else:
image_features = image_forward_outs.hidden_states[self.select_layer]
if select_feature_type == "patch":
image_features = image_features[:, 1:]
elif select_feature_type == "cls_patch":
image_features = image_features
else:
raise ValueError(f"Unexpected select feature: {select_feature_type}")
return image_features
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
image_feature = self.feature_select(image_forward_out).to(image.dtype)
image_features.append(image_feature)
else:
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
return self.vision_tower.dtype
@property
def device(self):
return self.vision_tower.device
@property
def config(self):
if self.is_loaded:
return self.vision_tower.config
else:
return self.cfg_only
@property
def hidden_size(self):
_hidden_size = self.config.hidden_size
if "slicefour" in self.select_feature:
_hidden_size *= 4
if "slice_m25811_f6" in self.select_feature:
_hidden_size *= 5
return _hidden_size
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
@property
def num_patches(self):
_num_patches = (self.config.image_size // self.config.patch_size) ** 2
if "cls_patch" in self.select_feature:
_num_patches += 1
return _num_patches
@property
def image_size(self):
return self.config.image_size
class CLIPVisionTowerS2(CLIPVisionTower):
def __init__(self, vision_tower, args, delay_load=False):
self.s2_scales = getattr(args, "s2_scales", "336,672,1008")
self.s2_scales = list(map(int, self.s2_scales.split(",")))
self.s2_scales.sort()
self.s2_split_size = self.s2_scales[0]
self.s2_image_size = self.s2_scales[-1]
super().__init__(vision_tower, args, delay_load)
# change resize/crop size in preprocessing to the largest image size in s2_scale
if not delay_load or getattr(args, "unfreeze_mm_vision_tower", False):
self.image_processor.size["shortest_edge"] = self.s2_image_size
self.image_processor.crop_size["height"] = self.image_processor.crop_size["width"] = self.s2_image_size
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
self.vision_tower.requires_grad_(False)
self.image_processor.size["shortest_edge"] = self.s2_image_size
self.image_processor.crop_size["height"] = self.image_processor.crop_size["width"] = self.s2_image_size
self.is_loaded = True
def forward_feature(self, images):
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_feature = multiscale_forward(self.forward_feature, image.unsqueeze(0), img_sizes=self.s2_scales, max_split_size=self.s2_split_size, split_forward=True)
image_features.append(image_feature)
else:
image_features = multiscale_forward(self.forward_feature, images, img_sizes=self.s2_scales, max_split_size=self.s2_split_size, split_forward=True)
return image_features
@property
def hidden_size(self):
return self.config.hidden_size * len(self.s2_scales)
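# Illustration only: a sketch of which hidden-state indices the "slicefour_*"
# branch of CLIPVisionTower.feature_select concatenates. The helper name is
# hypothetical, and the 25 hidden states used below are an assumption (an
# embedding output plus 24 transformer layers). The index arithmetic itself
# mirrors the range(...) expression in feature_select.

```python
def slicefour_layer_indices(num_hidden_states, select_layer):
    """Indices picked by the slicefour branch, one per quarter of the stack."""
    step = num_hidden_states // 4
    return list(range(step + select_layer, num_hidden_states, step))


indices = slicefour_layer_indices(num_hidden_states=25, select_layer=-2)
```

# With these assumed values four layers are selected, which matches the
# hidden_size property multiplying the channel width by 4 for "slicefour".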
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/__init__.py
================================================
from .vit_scale_clean import PretrainVisionTransformer_clean
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/flash_attention_class.py
================================================
import torch
import torch.nn as nn
from einops import rearrange
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func
from flash_attn.bert_padding import unpad_input, pad_input
class FlashAttention(nn.Module):
"""Implement the scaled dot product attention with softmax.
Arguments
---------
softmax_scale: The temperature to use for the softmax attention.
(default: 1/sqrt(d_keys) where d_keys is computed at
runtime)
attention_dropout: The dropout rate to apply to the attention
(default: 0.0)
"""
def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
super().__init__()
self.softmax_scale = softmax_scale
self.dropout_p = attention_dropout
def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
max_s=None, need_weights=False):
"""Implements the multihead softmax attention.
Arguments
---------
qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None
if unpadded: (nnz, 3, h, d)
key_padding_mask: a bool tensor of shape (B, S)
"""
assert not need_weights
assert qkv.dtype in [torch.float16, torch.bfloat16]
assert qkv.is_cuda
if cu_seqlens is None:
batch_size = qkv.shape[0]
seqlen = qkv.shape[1]
if key_padding_mask is None:
qkv = rearrange(qkv, 'b s ... -> (b s) ...')
max_s = seqlen
cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
device=qkv.device)
output = flash_attn_varlen_qkvpacked_func(
qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
output = rearrange(output, '(b s) ... -> b s ...', b=batch_size)
else:
nheads = qkv.shape[-2]
x = rearrange(qkv, 'b s three h d -> b s (three h d)')
x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
output_unpad = flash_attn_varlen_qkvpacked_func(
x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
indices, batch_size, seqlen),
'b s (h d) -> b s h d', h=nheads)
else:
assert max_s is not None
output = flash_attn_varlen_qkvpacked_func(
qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
return output, None
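# Illustrative sketch (not part of the upstream file): the varlen kernel above takes
# `cu_seqlens`, the cumulative token offsets of each sequence in the flattened batch.
# The helper names below are hypothetical and plain Python lists stand in for tensors.
# The mask convention (True marks a valid token) is an assumption based on how
# `unpad_input` is used in `forward`.

```python
def fixed_length_cu_seqlens(batch_size, seqlen):
    """Offsets [0, S, 2S, ..., B*S] for a batch where every sequence has length S.

    Mirrors torch.arange(0, (batch_size + 1) * seqlen, step=seqlen) in the
    unmasked branch of FlashAttention.forward.
    """
    return list(range(0, (batch_size + 1) * seqlen, seqlen))


def cu_seqlens_from_mask(key_padding_mask):
    """Offsets for variable-length sequences.

    Assumption: a truthy mask entry marks a token that is kept after unpadding.
    """
    offsets = [0]
    for row in key_padding_mask:
        valid_tokens = sum(1 for keep in row if keep)
        offsets.append(offsets[-1] + valid_tokens)
    return offsets


# Two sequences of three tokens each: boundaries at 0, 3 and 6.
fixed = fixed_length_cu_seqlens(batch_size=2, seqlen=3)

# Same batch shape, but the sequences hold only 2 and 1 valid tokens.
ragged = cu_seqlens_from_mask([[True, True, False], [True, False, False]])
```

# The last offset equals the total number of tokens passed to the kernel, and
# `max_s` would be the largest difference between neighbouring offsets.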
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/pos_embed.py
================================================
import numpy as np
import torch
import logging
logger = logging.getLogger(__name__)
# --------------------------------------------------------
# 3D sine-cosine position embedding
# References:
# MVD: https://github.com/ruiwang2021/mvd/blob/main/modeling_finetune.py
# --------------------------------------------------------
def get_3d_sincos_pos_embed(embed_dim, grid_size, t_size, cls_token=False):
"""
grid_size: int of the grid height and width
t_size: int of the temporal size
return:
pos_embed: [t_size*grid_size*grid_size, embed_dim] or [1+t_size*grid_size*grid_size, embed_dim] (w/o or w/ cls_token)
"""
assert embed_dim % 4 == 0
embed_dim_spatial = embed_dim // 4 * 3
embed_dim_temporal = embed_dim // 4
# spatial
grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h) # here w goes first
grid = np.stack(grid, axis=0)
grid = grid.reshape([2, 1, grid_size, grid_size])
pos_embed_spatial = get_2d_sincos_pos_embed_from_grid(
embed_dim_spatial, grid
)
# temporal
grid_t = np.arange(t_size, dtype=np.float32)
pos_embed_temporal = get_1d_sincos_pos_embed_from_grid(
embed_dim_temporal, grid_t
)
# concatenate: [T, H, W] order
pos_embed_temporal = pos_embed_temporal[:, np.newaxis, :]
pos_embed_temporal = np.repeat(
pos_embed_temporal, grid_size**2, axis=1
) # [T, H*W, D // 4]
pos_embed_spatial = pos_embed_spatial[np.newaxis, :, :]
pos_embed_spatial = np.repeat(
pos_embed_spatial, t_size, axis=0
) # [T, H*W, D // 4 * 3]
pos_embed = np.concatenate([pos_embed_temporal, pos_embed_spatial], axis=-1)
pos_embed = pos_embed.reshape([-1, embed_dim]) # [T*H*W, D]
if cls_token:
pos_embed = np.concatenate(
[np.zeros([1, embed_dim]), pos_embed], axis=0
)
return pos_embed
# --------------------------------------------------------
# 2D sine-cosine position embedding
# References:
# Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py
# MoCo v3: https://github.com/facebookresearch/moco-v3
# --------------------------------------------------------
def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
"""
grid_size: int of the grid height and width
return:
pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/o or w/ cls_token)
"""
grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h) # here w goes first
grid = np.stack(grid, axis=0)
grid = grid.reshape([2, 1, grid_size, grid_size])
pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
if cls_token:
pos_embed = np.concatenate(
[np.zeros([1, embed_dim]), pos_embed], axis=0
)
return pos_embed
def get_1d_sincos_pos_embed(embed_dim, t_size, cls_token=False):
"""
t_size: int of the temporal size
return:
pos_embed: [t_size, embed_dim] or [1+t_size, embed_dim] (w/o or w/ cls_token)
"""
grid_t = np.arange(t_size, dtype=np.float32)
pos_embed = get_1d_sincos_pos_embed_from_grid(embed_dim, grid_t)
if cls_token:
pos_embed = np.concatenate(
[np.zeros([1, embed_dim]), pos_embed], axis=0
)
return pos_embed
def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
assert embed_dim % 2 == 0
# use half of dimensions to encode grid_h
emb_h = get_1d_sincos_pos_embed_from_grid(
embed_dim // 2, grid[0]
) # (H*W, D/2)
emb_w = get_1d_sincos_pos_embed_from_grid(
embed_dim // 2, grid[1]
) # (H*W, D/2)
emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
return emb
def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
"""
embed_dim: output dimension for each position
pos: a list of positions to be encoded: size (M,)
out: (M, D)
"""
assert embed_dim % 2 == 0
omega = np.arange(embed_dim // 2, dtype=np.float32)
omega /= embed_dim / 2.0
omega = 1.0 / 10000**omega # (D/2,)
pos = pos.reshape(-1) # (M,)
out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product
emb_sin = np.sin(out) # (M, D/2)
emb_cos = np.cos(out) # (M, D/2)
emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
return emb
def interpolate_pos_embed_internvideo2(checkpoint_model, model, orig_t_size = 8):
# interpolate position embedding
for pos_name in ['pos_embed', 'clip_pos_embed']:
if pos_name in checkpoint_model:
pos_embed_checkpoint = checkpoint_model[pos_name]
embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
num_patches = model.patch_embed.num_patches #
num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1
# we use 8 frames for pretraining
# new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size
new_t_size = model.num_frames // model.tubelet_size
# height (== width) for the checkpoint position embedding
orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(orig_t_size)) ** 0.5)
# height (== width) for the new position embedding
new_size = int((num_patches // (new_t_size))** 0.5)
# class_token and dist_token are kept unchanged
if orig_t_size != new_t_size:
logger.info(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})")
print(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> B, T, HW, C -> BHW, C, T (B = 1)
pos_tokens = pos_tokens.view(1, orig_t_size, -1, embedding_size)
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, embedding_size, orig_t_size)
pos_tokens = torch.nn.functional.interpolate(pos_tokens, size=new_t_size, mode='linear')
pos_tokens = pos_tokens.view(1, -1, embedding_size, new_t_size)
pos_tokens = pos_tokens.permute(0, 3, 1, 2).reshape(1, -1, embedding_size)
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
pos_embed_checkpoint = new_pos_embed
# class_token and dist_token are kept unchanged
if orig_size != new_size:
logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
print(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> BT, H, W, C -> BT, C, H, W
pos_tokens = pos_tokens.reshape(-1, new_t_size, orig_size, orig_size, embedding_size)
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
pos_tokens = torch.nn.functional.interpolate(
pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, new_t_size, new_size, new_size, embedding_size)
pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
for pos_name in ['img_pos_embed']:
if pos_name in checkpoint_model:
pos_embed_checkpoint = checkpoint_model[pos_name]
embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
num_patches = model.patch_embed.num_img_patches #
num_extra_tokens = model.pos_embed.shape[-2] - model.patch_embed.num_patches # 0/1
# we use 8 frames for pretraining
# new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size
# height (== width) for the checkpoint position embedding
orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)) ** 0.5)
# height (== width) for the new position embedding
new_size = int((num_patches)** 0.5)
# class_token and dist_token are kept unchanged
if orig_size != new_size:
logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
print(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> B, H, W, C -> B, C, H, W
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
pos_tokens = torch.nn.functional.interpolate(
pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, 1, new_size, new_size, embedding_size)
pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
if 'pos_embed_spatial' in checkpoint_model or 'pos_embed_temporal' in checkpoint_model:
raise NotImplementedError
def interpolate_pos_embed_internvideo2_new(checkpoint_model, model, orig_t_size = 8):
pos_names = []
for k in checkpoint_model.keys():
if ('pos_embed' in k or 'clip_pos_embed' in k) and 'img' not in k:
pos_names.append(k)
assert len(pos_names) > 0, checkpoint_model.keys()
if 'pos_embed_spatial' in checkpoint_model.keys() or 'pos_embed_temporal' in checkpoint_model.keys():
raise NotImplementedError
# interpolate position embedding
for pos_name in pos_names:
pos_embed_checkpoint = checkpoint_model[pos_name]
embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
num_patches = model.patch_embed.num_patches #
num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1
# we use 8 frames for pretraining
# new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size
new_t_size = model.num_frames // model.tubelet_size
# height (== width) for the checkpoint position embedding
orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(orig_t_size)) ** 0.5)
# height (== width) for the new position embedding
new_size = int((num_patches // (new_t_size))** 0.5)
# class_token and dist_token are kept unchanged
if orig_t_size != new_t_size:
logger.info(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> B, T, HW, C -> BHW, C, T (B = 1)
pos_tokens = pos_tokens.view(1, orig_t_size, -1, embedding_size)
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, embedding_size, orig_t_size)
pos_tokens = torch.nn.functional.interpolate(pos_tokens, size=new_t_size, mode='linear')
pos_tokens = pos_tokens.view(1, -1, embedding_size, new_t_size)
pos_tokens = pos_tokens.permute(0, 3, 1, 2).reshape(1, -1, embedding_size)
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
pos_embed_checkpoint = new_pos_embed
# class_token and dist_token are kept unchanged
if orig_size != new_size:
logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> BT, H, W, C -> BT, C, H, W
pos_tokens = pos_tokens.reshape(-1, new_t_size, orig_size, orig_size, embedding_size)
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
pos_tokens = torch.nn.functional.interpolate(
pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, new_t_size, new_size, new_size, embedding_size)
pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
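# Illustrative sketch (not part of the upstream file): a dependency-free version of
# the 1D sine-cosine embedding and of the channel split used by the 3D variant.
# The function names `sincos_1d` and `split_3d_dims` are hypothetical; the formulas
# follow get_1d_sincos_pos_embed_from_grid and get_3d_sincos_pos_embed above.

```python
import math


def sincos_1d(embed_dim, positions):
    """Return one row per position: [sin(p * w_0..w_k), cos(p * w_0..w_k)]."""
    assert embed_dim % 2 == 0
    half = embed_dim // 2
    # Frequencies fall geometrically from 1 towards 1/10000.
    omega = [1.0 / 10000 ** (i / half) for i in range(half)]
    rows = []
    for pos in positions:
        angles = [pos * w for w in omega]
        rows.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return rows


def split_3d_dims(embed_dim):
    """Channel budget of the 3D embedding: (temporal, spatial) = (D/4, 3D/4)."""
    assert embed_dim % 4 == 0
    return embed_dim // 4, embed_dim // 4 * 3


emb = sincos_1d(4, [0, 1])             # 2 positions, 4 channels each
temporal, spatial = split_3d_dims(1408)  # 1408 is the giant model's embed_dim
```

# In the 3D case the temporal channels are concatenated first and the spatial
# channels second along the last axis, so the two parts always sum to embed_dim.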
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2/vit_scale_clean.py
================================================
import math
import logging
import torch
import torch.nn.functional as F
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from torch import nn
import torch.utils.checkpoint as checkpoint
from functools import partial
from einops import rearrange
from .pos_embed import get_3d_sincos_pos_embed, get_2d_sincos_pos_embed, get_1d_sincos_pos_embed, interpolate_pos_embed_internvideo2
from .flash_attention_class import FlashAttention
logger = logging.getLogger(__name__)
class CrossAttention(nn.Module):
def __init__(
self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
proj_drop=0., attn_head_dim=None, out_dim=None):
super().__init__()
if out_dim is None:
out_dim = dim
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
assert all_head_dim == dim
self.q = nn.Linear(dim, all_head_dim, bias=False)
self.k = nn.Linear(dim, all_head_dim, bias=False)
self.v = nn.Linear(dim, all_head_dim, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.k_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.k_bias = None
self.v_bias = None
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, out_dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x, k=None, v=None):
B, N, C = x.shape
N_k = k.shape[1]
N_v = v.shape[1]
q_bias, k_bias, v_bias = None, None, None
if self.q_bias is not None:
q_bias = self.q_bias
k_bias = self.k_bias
v_bias = self.v_bias
q = F.linear(input=x, weight=self.q.weight, bias=q_bias)
q = q.reshape(B, N, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0) # (B, N_head, N_q, dim)
k = F.linear(input=k, weight=self.k.weight, bias=k_bias)
k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0)
v = F.linear(input=v, weight=self.v.weight, bias=v_bias)
v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0)
q = q * self.scale
attn = (q @ k.transpose(-2, -1)) # (B, N_head, N_q, N_k)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class AttentiveBlock(nn.Module):
def __init__(self, dim, num_heads, qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
drop_path=0., norm_layer=nn.LayerNorm, attn_head_dim=None, out_dim=None):
super().__init__()
self.norm1_q = norm_layer(dim)
self.norm1_k = norm_layer(dim)
self.norm1_v = norm_layer(dim)
self.cross_attn = CrossAttention(
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop,
proj_drop=drop, attn_head_dim=attn_head_dim, out_dim=out_dim)
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
def forward(self, x_q, x_kv, pos_q, pos_k, bool_masked_pos, rel_pos_bias=None):
x_q = self.norm1_q(x_q + pos_q)
x_k = self.norm1_k(x_kv + pos_k)
x_v = self.norm1_v(x_kv)
x = self.cross_attn(x_q, k=x_k, v=x_v)
return x
class AttentionPoolingBlock(AttentiveBlock):
def forward(self, x):
x_q = x.mean(1, keepdim=True)
x_kv, pos_q, pos_k = x, 0, 0
x = super().forward(x_q, x_kv, pos_q, pos_k, bool_masked_pos=None, rel_pos_bias=None)
x = x.squeeze(1)
return x
class RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
class LayerScale(nn.Module):
def __init__(self, dim, init_values=1e-5, inplace=False, force_fp32=False):
super().__init__()
self.inplace = inplace
self.weight = nn.Parameter(init_values * torch.ones(dim))
self.force_fp32 = force_fp32
@torch.cuda.amp.autocast(enabled=False)
def forward(self, x):
if self.force_fp32:
output_type = x.dtype
out = x.float().mul_(self.weight.float()) if self.inplace else x.float() * self.weight.float()
return out.to(dtype=output_type)
else:
out = x.mul_(self.weight) if self.inplace else x * self.weight
return out
class Attention(nn.Module):
def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0., use_flash_attn=False,
causal=False, norm_layer=nn.LayerNorm, qk_normalization=False, use_fused_rmsnorm=False):
super().__init__()
assert dim % num_heads == 0, 'dim should be divisible by num_heads'
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
self.use_flash_attn = use_flash_attn
if use_flash_attn:
self.causal = causal
self.inner_attn = FlashAttention(attention_dropout=attn_drop)
self.qk_normalization = qk_normalization
self.q_norm = norm_layer(dim) if qk_normalization else nn.Identity()
self.k_norm = norm_layer(dim) if qk_normalization else nn.Identity()
self.use_fused_rmsnorm = use_fused_rmsnorm
def _naive_attn(self, x):
B, N, C = x.shape
# print(x.shape, torch.cuda.memory_allocated(), torch.cuda.memory_allocated())
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)
if self.qk_normalization:
B_, H_, N_, D_ = q.shape
q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
attn = ((q * self.scale) @ k.transpose(-2, -1))
# attn = attn - attn.max(-1)[0].unsqueeze(-1) # in case of overflow for fp16
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
# print(torch.cuda.memory_allocated(), torch.cuda.memory_allocated())
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
def _flash_attn(self, x, key_padding_mask=None, need_weights=False):
qkv = self.qkv(x)
qkv = rearrange(qkv, "b s (three h d) -> b s three h d", three=3, h=self.num_heads)
if self.qk_normalization:
q, k, v = qkv.unbind(2)
if self.use_fused_rmsnorm:
q = self.q_norm(q.flatten(-2, -1))[0].view(q.shape)
k = self.k_norm(k.flatten(-2, -1))[0].view(k.shape)
else:
q = self.q_norm(q.flatten(-2, -1)).view(q.shape)
k = self.k_norm(k.flatten(-2, -1)).view(k.shape)
qkv = torch.stack([q, k, v], dim=2)
context, _ = self.inner_attn(
qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=self.causal
)
outs = self.proj(rearrange(context, "b s h d -> b s (h d)"))
outs = self.proj_drop(outs)
return outs
def forward(self, x):
x = self._naive_attn(x) if not self.use_flash_attn else self._flash_attn(x)
return x
class Mlp(nn.Module):
""" MLP as used in Vision Transformer, MLP-Mixer and related networks
"""
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU,
bias=True, drop=0.):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
bias = to_2tuple(bias)
drop_probs = to_2tuple(drop)
self.fc1 = nn.Linear(in_features, hidden_features, bias=bias[0])
self.act = act_layer()
self.drop1 = nn.Dropout(drop_probs[0])
self.fc2 = nn.Linear(hidden_features, out_features, bias=bias[1])
self.drop2 = nn.Dropout(drop_probs[1])
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop1(x)
x = self.fc2(x)
x = self.drop2(x)
return x
class Block(nn.Module):
def __init__(
self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0., init_values=None,
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, use_flash_attn=False, use_fused_mlp=False,
fused_mlp_heuristic=1, with_cp=False, qk_normalization=False, layerscale_no_force_fp32=False,
use_fused_rmsnorm=False):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop,
use_flash_attn=use_flash_attn, causal=False, norm_layer=norm_layer,
qk_normalization=qk_normalization,
use_fused_rmsnorm=use_fused_rmsnorm)
self.ls1 = LayerScale(dim, init_values=init_values,
force_fp32=(not layerscale_no_force_fp32)) if init_values else nn.Identity()
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
if use_fused_mlp:
# FusedMLP is neither defined nor imported in this file, so the fused path is unsupported.
raise NotImplementedError("use_fused_mlp=True is not supported here")
else:
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
self.ls2 = LayerScale(dim, init_values=init_values,
force_fp32=(not layerscale_no_force_fp32)) if init_values else nn.Identity()
self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.with_cp = with_cp
self.use_fused_rmsnorm = use_fused_rmsnorm
def forward(self, x, residual=None):
def _inner_forward(x, residual=None):
if self.use_fused_rmsnorm:
x, residual = self.norm1(x, residual)
x = self.drop_path1(self.ls1(self.attn(x)))
x, residual = self.norm2(x, residual)
x = self.drop_path2(self.ls2(self.mlp(x)))
return x, residual
else:
assert residual is None
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
return x
if self.with_cp:
# print(f"\033[31m use_checkpoint \033[0m")
return checkpoint.checkpoint(_inner_forward, x, residual)
else:
return _inner_forward(x, residual=residual)
class PatchEmbed(nn.Module):
""" 3D Image to Patch Embedding
"""
def __init__(
self, img_size=224, patch_size=16, in_chans=3, embed_dim=768,
num_frames=8, tubelet_size=1, norm_layer=None
):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
self.img_size = img_size
self.patch_size = patch_size
self.grid_size = (
num_frames // tubelet_size,
img_size[0] // patch_size[0],
img_size[1] // patch_size[1]
) # (T, H, W)
self.num_patches = self.grid_size[0] * self.grid_size[1] * self.grid_size[2]
self.num_img_patches = self.grid_size[1] * self.grid_size[2]
self.proj = nn.Conv3d(
in_channels=in_chans, out_channels=embed_dim,
kernel_size=(tubelet_size, patch_size[0], patch_size[1]),
stride=(tubelet_size, patch_size[0], patch_size[1])
)
self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
def forward(self, x):
x = self.proj(x)
x = x.flatten(3).permute(0, 2, 3, 1) # B x C x T x HW => B x T x HW x C
x = self.norm(x)
return x
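# Illustrative sketch (not part of the upstream file): the token-count arithmetic
# behind PatchEmbed.grid_size, num_patches and num_img_patches. `patch_grid` is a
# hypothetical helper and assumes a square image and square patches.

```python
def patch_grid(img_size, patch_size, num_frames, tubelet_size):
    """Return ((T, H, W), video token count, single-image token count)."""
    t = num_frames // tubelet_size   # temporal positions after tubelet grouping
    h = img_size // patch_size       # patch rows
    w = img_size // patch_size       # patch columns
    return (t, h, w), t * h * w, h * w


# Defaults of PretrainVisionTransformer_clean: 224px input, 14px patches,
# 8 frames, tubelet_size 1.
grid, num_patches, num_img_patches = patch_grid(224, 14, 8, 1)
```

# With these defaults the video position embedding covers num_patches + 1 slots
# (the extra slot is the cls token), which is the shape allocated for `pos_embed`.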
class PretrainVisionTransformer_clean(nn.Module):
def __init__(
self,
in_chans: int = 3,
patch_size: int = 14,
img_size: int = 224,
qkv_bias: bool = False, # follow internvl_clip to set False
drop_path_rate: float = 0.25, # may need ablation
embed_dim: int = 1408,
num_heads: int = 16,
mlp_ratio: float = 48/11,
init_values: float = 1e-5, # may need ablation
qk_normalization: bool = True,
depth: int = 40,
use_flash_attn: bool = True,
use_fused_rmsnorm: bool = True,
use_fused_mlp: bool = True,
fused_mlp_heuristic: int = 1,
attn_pool_num_heads: int = 16,
clip_embed_dim: int = 768,
layerscale_no_force_fp32: bool = False, # whether True for training?
num_frames: int = 8,
tubelet_size: int = 1,
sep_pos_embed: bool = False,
sep_image_video_pos_embed: bool = False,
use_checkpoint: bool = False,
checkpoint_num: int = 0,
# for unmasked teacher
x_vis_return_idx=-1,
x_vis_only=False
):
super().__init__()
self.num_frames = num_frames
self.tubelet_size = tubelet_size
# assert use_flash_attn == use_fused_rmsnorm == use_fused_mlp, f'use_flash_attn:{use_flash_attn}, use_fused_rmsnorm{use_fused_rmsnorm} and use_fused_mlp{use_fused_mlp} should be consistent'
self.use_flash_attn = use_flash_attn
self.embed_dim = embed_dim
logger.info(f"Origin depth: {depth}")
depth = depth + x_vis_return_idx + 1
logger.info(f"New depth: {depth}")
self.depth = depth
self.x_vis_only = x_vis_only
if use_fused_rmsnorm:
# DropoutAddRMSNorm is neither defined nor imported in this file, so the fused path is unsupported.
raise NotImplementedError("use_fused_rmsnorm=True is not supported here")
else:
norm_layer_for_blocks = partial(RMSNorm, eps=1e-6)
self.norm_layer_for_blocks = norm_layer_for_blocks
self.patch_embed = PatchEmbed(
img_size, patch_size, in_chans, embed_dim,
num_frames=num_frames, tubelet_size=tubelet_size,
)
num_patches = self.patch_embed.num_patches
num_img_patches = self.patch_embed.num_img_patches
# print(f"num_patches: {num_patches}, num_img_patches: {num_img_patches}")
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
# stolen from https://github.com/facebookresearch/mae_st/blob/dc072aaaf640d06892e23a33b42223a994efe272/models_vit.py#L65-L73C17
self.sep_pos_embed = sep_pos_embed
self.sep_image_video_pos_embed = sep_image_video_pos_embed
if sep_pos_embed:
raise NotImplementedError
else:
if sep_image_video_pos_embed:
logger.info("Use separate position embedding, for image and video we use different pos_embed.")
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.img_pos_embed = nn.Parameter(torch.zeros(1, num_img_patches + 1, embed_dim))
else:
logger.info("Use joint position embedding, for image and video we use same pos_embed.")
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
# choose which layer to use checkpoint
with_cp_list = [False] * depth
if use_checkpoint:
for idx in range(depth):
if idx < checkpoint_num:
with_cp_list[idx] = True
logger.info(f"Droppath rate: {dpr}")
logger.info(f"Checkpoint list: {with_cp_list}")
self.blocks = nn.ModuleList([
Block(embed_dim, num_heads, mlp_ratio, qkv_bias=qkv_bias,
norm_layer=norm_layer_for_blocks,
drop_path=dpr[i], init_values=init_values, attn_drop=0.,
use_flash_attn=use_flash_attn, use_fused_mlp=use_fused_mlp,
fused_mlp_heuristic=fused_mlp_heuristic,
with_cp=with_cp_list[i],
qk_normalization=qk_normalization,
layerscale_no_force_fp32=layerscale_no_force_fp32,
use_fused_rmsnorm=use_fused_rmsnorm)
for i in range(depth)])
if not self.x_vis_only:
self.clip_projector = AttentionPoolingBlock(
dim=embed_dim, num_heads=attn_pool_num_heads, qkv_bias=True, qk_scale=None,
drop=0., attn_drop=0., norm_layer=partial(nn.LayerNorm, eps=1e-5), out_dim=clip_embed_dim)
self.init_pos_embed()
trunc_normal_(self.cls_token, std=.02) # NOTE: irrelevant for chat use, since pretrained weights are always loaded
self.apply(self._init_weights)
self.fix_init_weight()
def init_pos_embed(self):
logger.info("Init pos_embed from sincos pos_embed")
if self.sep_pos_embed:
raise NotImplementedError
else:
pos_embed = get_3d_sincos_pos_embed(
self.pos_embed.shape[-1],
self.patch_embed.grid_size[1], # height & width
self.patch_embed.grid_size[0], # t_size
cls_token=True
)
self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
if self.sep_image_video_pos_embed:
img_pos_embed = get_3d_sincos_pos_embed(
self.pos_embed.shape[-1],
self.patch_embed.grid_size[1], # height & width
1,
cls_token=True
)
self.img_pos_embed.data.copy_(torch.from_numpy(img_pos_embed).float().unsqueeze(0))
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
def fix_init_weight(self):
def rescale(param, layer_id):
param.div_(math.sqrt(2.0 * layer_id))
for layer_id, layer in enumerate(self.blocks):
rescale(layer.attn.proj.weight.data, layer_id + 1)
rescale(layer.mlp.fc2.weight.data, layer_id + 1)
@property
def dtype(self):
return self.patch_embed.proj.weight.dtype
def get_num_layers(self):
return len(self.blocks)
@torch.jit.ignore
def no_weight_decay(self):
return {
'pos_embed',
'pos_embed_spatial',
'pos_embed_temporal',
'pos_embed_cls',
'img_pos_embed',
'cls_token'
}
# @torch.cuda.amp.autocast(enabled=False)
def forward(self, x, mask=None, use_image=False):
x = self.patch_embed(x.type(self.dtype))
# print(f"x.shape: {x.shape} x.dtype: {x.dtype}, model.dtype: {self.dtype}")
B, T, L, C = x.shape # T: temporal; L: spatial
x = x.view([B, T * L, C])
# append cls token
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat((cls_tokens, x), dim=1)
# add pos_embed
if self.sep_pos_embed:
raise NotImplementedError
else:
if use_image:
if self.sep_image_video_pos_embed:
pos_embed = self.img_pos_embed
else:
# (1, num_img_patches + 1, embed_dim)
# print('origin pos_embed.shape:', self.pos_embed.shape)
cls_pos_embed = self.pos_embed[:, 0:1, :]
# print('cls_pos_embed.shape:', cls_pos_embed.shape)
img_pos_embed = self.pos_embed[:, 1:, :].view(1, self.num_frames, self.patch_embed.num_patches // self.num_frames, self.embed_dim).mean(dim=1)
# print('img_pos_embed.shape:', img_pos_embed.shape)
pos_embed = torch.cat([cls_pos_embed, img_pos_embed], dim=1)
# print('final img_pos_embed.shape:', pos_embed.shape)
else:
pos_embed = self.pos_embed
# print("pos_embed.shape:", pos_embed.shape)
x = x + pos_embed
# mask tokens, ~mask means visible
if mask is not None:
x = x[~mask].reshape(B, -1, C)
else:
x = x.reshape(B, -1, C)
residual = None
for idx, blk in enumerate(self.blocks):
if isinstance(x, tuple) and len(x) == 2:
x, residual = x
x = blk(x, residual=residual)
if isinstance(x, tuple) and len(x) == 2:
x, residual = x
if residual is not None:
x = x + residual
x_vis = x
if self.x_vis_only:
return x_vis
else:
x_pool_vis = self.clip_projector(x_vis)
return x_vis, x_pool_vis, None, None
def pretrain_internvideo2_giant_patch14_224_clean(config):
model = PretrainVisionTransformer_clean(
in_chans=3, img_size=224, patch_size=14,
embed_dim=1408, depth=40, num_heads=16, mlp_ratio=48/11,
clip_embed_dim=config.vision_encoder.clip_embed_dim,
attn_pool_num_heads=16, qkv_bias=False,
drop_path_rate=0.25,
init_values=0.00001,
qk_normalization=True,
use_flash_attn=config.vision_encoder.get('use_flash_attn', True),
use_fused_rmsnorm=config.vision_encoder.get('use_fused_rmsnorm', True),
use_fused_mlp=config.vision_encoder.get('use_fused_mlp', True),
fused_mlp_heuristic=1,
layerscale_no_force_fp32=False,
num_frames=config.vision_encoder.num_frames,
tubelet_size=config.vision_encoder.tubelet_size,
sep_pos_embed=False,
sep_image_video_pos_embed=config.vision_encoder.sep_image_video_pos_embed,
use_checkpoint=config.vision_encoder.use_checkpoint,
checkpoint_num=config.vision_encoder.checkpoint_num,
x_vis_return_idx=config.vision_encoder.x_vis_return_idx,
x_vis_only=config.vision_encoder.x_vis_only,
)
if config.vision_encoder.pretrained is not None:
logger.info(f"Loading pretrained weights from {config.vision_encoder.pretrained}")
state_dict = torch.load(config.vision_encoder.pretrained, map_location='cpu')
interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=8) # NOTE 8f for stage1
message = model.load_state_dict(state_dict, strict=False)
logger.info(message)
else:
logger.info("No pretrained weights!!!")
return model
def pretrain_internvideo2_6b_patch14_224_clean(config):
model = PretrainVisionTransformer_clean(
in_chans=3, img_size=224, patch_size=14,
embed_dim=3200, depth=48, num_heads=25, mlp_ratio=4,
clip_embed_dim=config.vision_encoder.clip_embed_dim,
attn_pool_num_heads=16, qkv_bias=False,
drop_path_rate=0.3,
init_values=0.00001,
qk_normalization=True,
use_flash_attn=config.vision_encoder.get('use_flash_attn', True),
use_fused_rmsnorm=config.vision_encoder.get('use_fused_rmsnorm', True),
use_fused_mlp=config.vision_encoder.get('use_fused_mlp', True),
fused_mlp_heuristic=1,
layerscale_no_force_fp32=False,
num_frames=config.vision_encoder.num_frames,
tubelet_size=config.vision_encoder.tubelet_size,
sep_pos_embed=False,
sep_image_video_pos_embed=config.vision_encoder.sep_image_video_pos_embed,
use_checkpoint=config.vision_encoder.use_checkpoint,
checkpoint_num=config.vision_encoder.checkpoint_num,
x_vis_return_idx=config.vision_encoder.x_vis_return_idx,
x_vis_only=config.vision_encoder.x_vis_only
)
if config.vision_encoder.pretrained is not None:
logger.info(f"Loading pretrained weights from {config.vision_encoder.pretrained}")
state_dict = torch.load(config.vision_encoder.pretrained, map_location='cpu')
interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=8) # NOTE 8f for stage1
msg = model.load_state_dict(state_dict, strict=False)
logger.info(msg)
else:
logger.info("No pretrained weights!!!")
return model
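# Illustrative sketch, not part of the original module: when `use_image` is set
# and no separate image table exists, `forward` above builds the image position
# embedding by averaging the video patch embeddings over the temporal axis and
# keeping the cls embedding unchanged. The helper name
# `image_pos_embed_from_video` and the plain-list representation are
# assumptions made so the example runs without torch.

```python
def image_pos_embed_from_video(pos_embed, num_frames):
    """Average patch position vectors over time, keeping the cls vector.

    pos_embed: list of vectors, cls first, then T*L patch vectors stored
    frame by frame (index = t * L + l), mirroring view(1, T, L, C).
    """
    cls_vec, patch_vecs = pos_embed[0], pos_embed[1:]
    if num_frames <= 0 or len(patch_vecs) % num_frames != 0:
        raise ValueError("patch count must be a multiple of num_frames")
    per_frame = len(patch_vecs) // num_frames
    dim = len(cls_vec)
    averaged = []
    for l in range(per_frame):
        # Collect the vector at spatial slot l from every frame.
        frames = [patch_vecs[t * per_frame + l] for t in range(num_frames)]
        # Element-wise mean over the temporal axis.
        averaged.append([sum(v[d] for v in frames) / num_frames for d in range(dim)])
    return [cls_vec] + averaged


# Two frames, two patches per frame, embedding dimension 2.
video_pos = [[0.0, 0.0], [1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
image_pos = image_pos_embed_from_video(video_pos, num_frames=2)
```

# The result has 1 + L entries, so it lines up with the cls token plus the
# patches of a single frame.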
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/internvideo2_encoder.py
================================================
"""
# Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py
"""
from typing import Optional, Tuple, Union, Dict
from dataclasses import dataclass
from functools import partial, reduce
from PIL import Image
import torch
import torch.utils.checkpoint
from torch import nn
import os
from transformers.image_processing_utils import BatchFeature, get_size_dict
from transformers.image_transforms import (
convert_to_rgb,
normalize,
rescale,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
ChannelDimension,
PILImageResampling,
to_numpy_array,
)
from llava.utils import rank0_print
from .internvideo2.vit_scale_clean import PretrainVisionTransformer_clean
from .internvideo2.vit_scale_clean import interpolate_pos_embed_internvideo2
class InternVideo2ImageProcessor:
def __init__(self, image_mean=(0.485, 0.456, 0.406), image_std=(0.229, 0.224, 0.225), size=(224, 224), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST):
crop_size = crop_size if crop_size is not None else {"height": size[0], "width": size[1]}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.image_mean = image_mean
self.image_std = image_std
self.size = size
self.resample = resample
self.rescale_factor = rescale_factor
self.data_format = data_format
self.crop_size = crop_size
def preprocess(self, images, return_tensors, target_size=None):
if isinstance(images, Image.Image):
images = [images]
else:
# to adapt video data
images = [to_numpy_array(image) for image in images]
assert isinstance(images, list)
if target_size is None:
target_size = self.size
transforms = [
convert_to_rgb,
to_numpy_array,
partial(resize, size=target_size, resample=self.resample, data_format=self.data_format),
partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
class InternVideo2VisionConfig:
model_type = "internvideo2_vision_model"
def __init__(
self,
num_frames=4,
hidden_size=1408,
num_hidden_layers=40,
num_attention_heads=16,
num_channels=3,
image_size=224,
patch_size=14,
x_vis_return_idx=-2,
sep_image_video_pos_embed=True,
use_checkpoint=True,
checkpoint_num=40,
# **kwargs,
):
# super().__init__(**kwargs)
self.num_frames = num_frames
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.x_vis_return_idx = x_vis_return_idx
self.sep_image_video_pos_embed = sep_image_video_pos_embed
self.use_checkpoint = use_checkpoint
self.checkpoint_num = checkpoint_num
def build_vit(config, pt_type='origin'):
model = PretrainVisionTransformer_clean(
in_chans=config.num_channels, img_size=config.image_size, patch_size=config.patch_size,
embed_dim=config.hidden_size, depth=config.num_hidden_layers, num_heads=config.num_attention_heads, mlp_ratio=48/11,
# clip_embed_dim=config.vision_encoder.clip_embed_dim,
attn_pool_num_heads=16, qkv_bias=False,
drop_path_rate=0.25,
init_values=0.00001,
qk_normalization=True,
use_flash_attn=True,
use_fused_rmsnorm=False,
use_fused_mlp=False,
fused_mlp_heuristic=1,
layerscale_no_force_fp32=False,
num_frames=config.num_frames,
tubelet_size=1,
sep_pos_embed=False,
sep_image_video_pos_embed=config.sep_image_video_pos_embed,
use_checkpoint=config.use_checkpoint,
checkpoint_num=config.checkpoint_num,
x_vis_return_idx=config.x_vis_return_idx,
x_vis_only=True
)
ckpt_path = "OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/InternVideo2-1B_f4_vision.pt"
if not os.path.isfile(ckpt_path):
raise FileNotFoundError(f"Checkpoint not found at {ckpt_path}. Please download https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/InternVideo2-1B_f4_vision.pt")
state_dict = torch.load(ckpt_path, map_location='cpu')
if config.num_frames != 4:
raise NotImplementedError(f"Only num_frames=4 is supported for this checkpoint, got {config.num_frames}")
# make deepspeed zero3 happy
if config.image_size != 224:
interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=4)
message = model.load_state_dict(state_dict, strict=False)
rank0_print(message)
return model
class InternVideo2VisionTower(nn.Module):
def __init__(self, vision_tower, vision_tower_cfg, delay_load=False, pt_type='origin', image_size=224):
super().__init__()
self.is_loaded = False
self.pt_type = pt_type
self.config = InternVideo2VisionConfig(num_frames=vision_tower_cfg.mm_local_num_frames, x_vis_return_idx=vision_tower_cfg.mm_vision_select_layer, image_size=image_size)
self.vision_tower_name = vision_tower
self.image_processor = InternVideo2ImageProcessor(size=(image_size, image_size))
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print("The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts:
rank0_print("The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
raise NotImplementedError
self.cfg_only = self.config
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.vision_tower = build_vit(self.config, pt_type=self.pt_type)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def forward(self, images):
if type(images) is list:
raise NotImplementedError
else:
# input: B T C H W
# output: B T*L C
T = images.shape[1]
images = images.permute(0, 2, 1, 3, 4)
image_embeds = self.vision_tower(images, use_image=(T == 1))
return image_embeds[:, 1:, :]
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
for p in self.vision_tower.parameters():
return p.dtype
@property
def device(self):
for p in self.vision_tower.parameters():
return p.device
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
# return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"]
@property
def image_size(self):
return self.config.image_size
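# Illustrative sketch, not part of the original module: `preprocess` above
# applies its transforms with `reduce(lambda x, f: [*map(f, x)], ...)`, which
# runs each transform over every item before moving on to the next transform.
# The scalar helpers `rescale_value` and `normalize_value` are hypothetical
# stand-ins for the transformers image functions, used only to show the
# chaining pattern.

```python
from functools import partial, reduce


def rescale_value(x, scale):
    # Stand-in for rescale: multiply by the scale factor.
    return x * scale


def normalize_value(x, mean, std):
    # Stand-in for normalize: subtract the mean, divide by the std.
    return (x - mean) / std


def apply_transforms(items, transforms):
    # Each step maps one transform over the whole list, in order.
    return reduce(lambda xs, f: [*map(f, xs)], transforms, items)


transforms = [
    partial(rescale_value, scale=1 / 255),
    partial(normalize_value, mean=0.5, std=0.5),
]
pixels = apply_transforms([0, 255], transforms)
```

# With these stand-ins, a raw value of 0 maps to -1 and 255 maps to roughly 1,
# the same ordering (rescale, then normalize) the processor uses.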
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/siglip_encoder.py
================================================
"""
# Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py
"""
from typing import Optional, Tuple, Union, Dict
from dataclasses import dataclass
from functools import partial, reduce
from PIL import Image
import torch
import torch.utils.checkpoint
from torch import nn
import os
from transformers.image_processing_utils import BatchFeature, get_size_dict
from transformers.image_transforms import (
convert_to_rgb,
normalize,
rescale,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
ChannelDimension,
PILImageResampling,
to_numpy_array,
)
from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
from transformers.modeling_utils import PreTrainedModel
from transformers import PretrainedConfig
from transformers.utils import ModelOutput
from llava.utils import rank0_print
class SigLipImageProcessor:
def __init__(self, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5), size=(384, 384), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST):
crop_size = crop_size if crop_size is not None else {"height": 384, "width": 384}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.image_mean = image_mean
self.image_std = image_std
self.size = size
self.resample = resample
self.rescale_factor = rescale_factor
self.data_format = data_format
self.crop_size = crop_size
def preprocess(self, images, return_tensors):
if isinstance(images, Image.Image):
images = [images]
else:
# to adapt video data
images = [to_numpy_array(image) for image in images]
assert isinstance(images, list)
transforms = [
convert_to_rgb,
to_numpy_array,
partial(resize, size=self.size, resample=self.resample, data_format=self.data_format),
partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
class SigLipVisionConfig(PretrainedConfig):
model_type = "siglip_vision_model"
def __init__(
self,
hidden_size=1152,
image_mean=(0.5, 0.5, 0.5),
intermediate_size=4304,
num_hidden_layers=27,
num_attention_heads=16,
num_channels=3,
image_size=384,
patch_size=14,
hidden_act="gelu_pytorch_tanh",
layer_norm_eps=1e-6,
attention_dropout=0.0,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.image_mean = image_mean
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
cls._set_token_in_kwargs(kwargs)
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the vision config dict if we are loading from SigLipConfig
if config_dict.get("model_type") == "siglip":
config_dict = config_dict["vision_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
print(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.")
return cls.from_dict(config_dict, **kwargs)
@dataclass
# Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->SigLip
class SigLipVisionModelOutput(ModelOutput):
"""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
Args:
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attention weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
"""
image_embeds: Optional[torch.FloatTensor] = None
last_hidden_state: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
class SigLipVisionEmbeddings(nn.Module):
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.config = config
self.embed_dim = config.hidden_size
self.image_size = config.image_size
self.patch_size = config.patch_size
self.patch_embedding = nn.Conv2d(
in_channels=config.num_channels,
out_channels=self.embed_dim,
kernel_size=self.patch_size,
stride=self.patch_size,
padding="valid",
)
self.num_patches = (self.image_size // self.patch_size) ** 2
self.num_positions = self.num_patches
self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)
def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
embeddings = patch_embeds.flatten(2).transpose(1, 2)
embeddings = embeddings + self.position_embedding(self.position_ids)
return embeddings
class SigLipAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
# Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__
def __init__(self, config):
super().__init__()
self.config = config
self.embed_dim = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.embed_dim // self.num_heads
if self.head_dim * self.num_heads != self.embed_dim:
raise ValueError(f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:" f" {self.num_heads}).")
self.scale = self.head_dim**-0.5
self.dropout = config.attention_dropout
self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor, torch.Tensor]:
"""Input shape: Batch x Time x Channel"""
batch_size, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
k_v_seq_len = key_states.shape[-2]
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len):
raise ValueError(f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is" f" {attn_weights.size()}")
if attention_mask is not None:
if attention_mask.size() != (batch_size, 1, q_len, k_v_seq_len):
raise ValueError(f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}")
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim):
raise ValueError(f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is" f" {attn_output.size()}")
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim)
attn_output = self.out_proj(attn_output)
return attn_output, attn_weights
# Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->SigLip
class SigLipMLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.activation_fn = ACT2FN[config.hidden_act]
self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.fc1(hidden_states)
hidden_states = self.activation_fn(hidden_states)
hidden_states = self.fc2(hidden_states)
return hidden_states
# Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->SigLip
class SigLipEncoderLayer(nn.Module):
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.embed_dim = config.hidden_size
self.self_attn = SigLipAttention(config)
self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
self.mlp = SigLipMLP(config)
self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
# Ignore copy
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.FloatTensor]:
"""
Args:
hidden_states (`torch.FloatTensor`):
Input to the layer of shape `(batch, seq_len, embed_dim)`.
attention_mask (`torch.FloatTensor`):
Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values.
output_attentions (`bool`, *optional*, defaults to `False`):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
hidden_states = self.layer_norm1(hidden_states)
hidden_states, attn_weights = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
output_attentions=output_attentions,
)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.layer_norm2(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
outputs += (attn_weights,)
return outputs
class SigLipPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = SigLipVisionConfig
base_model_prefix = "siglip"
supports_gradient_checkpointing = True
def _init_weights(self, module):
"""Initialize the weights"""
pass
# Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->SigLip
class SigLipEncoder(nn.Module):
"""
Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
[`SigLipEncoderLayer`].
Args:
config: SigLipVisionConfig
"""
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.config = config
self.layers = nn.ModuleList([SigLipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False
# Ignore copy
def forward(
self,
inputs_embeds,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutput]:
r"""
Args:
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
encoder_states = () if output_hidden_states else None
all_attentions = () if output_attentions else None
hidden_states = inputs_embeds
for encoder_layer in self.layers:
if output_hidden_states:
encoder_states = encoder_states + (hidden_states,)
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
encoder_layer.__call__,
hidden_states,
attention_mask,
output_attentions,
)
else:
layer_outputs = encoder_layer(
hidden_states,
attention_mask,
output_attentions=output_attentions,
)
hidden_states = layer_outputs[0]
if output_attentions:
all_attentions = all_attentions + (layer_outputs[1],)
if output_hidden_states:
encoder_states = encoder_states + (hidden_states,)
if not return_dict:
return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions)
class SigLipVisionTransformer(nn.Module):
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.config = config
embed_dim = config.hidden_size
self.embeddings = SigLipVisionEmbeddings(config)
self.encoder = SigLipEncoder(config)
self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
self.head = SigLipMultiheadAttentionPoolingHead(config)
def forward(
self,
pixel_values,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
r"""
Returns:
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
hidden_states = self.embeddings(pixel_values)
encoder_outputs = self.encoder(
inputs_embeds=hidden_states,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
last_hidden_state = encoder_outputs[0]
last_hidden_state = self.post_layernorm(last_hidden_state)
pooled_output = self.head(last_hidden_state)
if not return_dict:
return (last_hidden_state, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPooling(
last_hidden_state=last_hidden_state,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
class SigLipMultiheadAttentionPoolingHead(nn.Module):
"""Multihead Attention Pooling."""
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size))
self.attention = torch.nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.mlp = SigLipMLP(config)
def forward(self, hidden_state):
batch_size = hidden_state.shape[0]
probe = self.probe.repeat(batch_size, 1, 1)
hidden_state = self.attention(probe, hidden_state, hidden_state)[0]
residual = hidden_state
hidden_state = self.layernorm(hidden_state)
hidden_state = residual + self.mlp(hidden_state)
return hidden_state[:, 0]
class SigLipVisionModel(SigLipPreTrainedModel):
config_class = SigLipVisionConfig
main_input_name = "pixel_values"
_no_split_modules = ["SigLipEncoderLayer"]
def __init__(self, config: SigLipVisionConfig):
super().__init__(config)
self.vision_model = SigLipVisionTransformer(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self) -> nn.Module:
return self.vision_model.embeddings.patch_embedding
def forward(
self,
pixel_values,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
r"""
Returns:
Examples:
```python
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, SigLipVisionModel
>>> model = SigLipVisionModel.from_pretrained("google/siglip-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output # pooled features
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
return self.vision_model(
pixel_values=pixel_values,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
class SigLipVisionTower(nn.Module):
def __init__(self, vision_tower, vision_tower_cfg, delay_load=False):
super().__init__()
self.is_loaded = False
self.config = SigLipVisionConfig()
self.vision_tower_name = vision_tower
self.image_processor = SigLipImageProcessor()
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print("The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts:
rank0_print("The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = self.config
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.vision_tower = SigLipVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
del self.vision_tower.vision_model.encoder.layers[-1:]
self.vision_tower.vision_model.head = nn.Identity()
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
image_feature = image_forward_out.hidden_states[-1].to(image.dtype)
assert image_feature.shape[-2] == 729
image_features.append(image_feature)
else:
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = image_forward_outs.hidden_states[-1].to(images.dtype)
assert image_features.shape[-2] == 729
return image_features
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
for p in self.vision_tower.parameters():
return p.dtype
@property
def device(self):
for p in self.vision_tower.parameters():
return p.device
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
# return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"]
@property
def image_size(self):
return self.config.image_size
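# Illustrative sketch, not part of the original module: the `== 729` assertion
# in `SigLipVisionTower.forward` follows from the default config. A 384-pixel
# image with 14-pixel patches and "valid" padding gives 384 // 14 = 27 patches
# per side, hence 27 * 27 = 729 tokens. The helper name `patch_grid` is an
# assumption used only for this arithmetic.

```python
def patch_grid(image_size, patch_size):
    """Return (patches per side, total patch tokens) for a square image."""
    side = image_size // patch_size  # leftover pixels are dropped by "valid" padding
    return side, side * side


siglip_side, siglip_tokens = patch_grid(384, 14)
internvideo2_side, internvideo2_tokens = patch_grid(224, 14)
```

# The same formula backs the `num_patches` and `num_patches_per_side`
# properties of both vision towers in this directory.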
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/umt/vit.py
================================================
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
from functools import partial
try:
from flash_attn import flash_attn_qkvpacked_func
except ImportError:
print("You need to install flash_attn")
from timm.models.layers import drop_path, to_2tuple, trunc_normal_
# logger = logging.getLogger(__name__)
class DropPath(nn.Module):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
"""
def __init__(self, drop_prob=None):
super(DropPath, self).__init__()
self.drop_prob = drop_prob
def forward(self, x):
return drop_path(x, self.drop_prob, self.training)
def extra_repr(self) -> str:
return 'p={}'.format(self.drop_prob)
class Mlp(nn.Module):
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
self.fc1 = nn.Linear(in_features, hidden_features)
self.act = act_layer()
self.fc2 = nn.Linear(hidden_features, out_features)
self.drop = nn.Dropout(drop)
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
class Attention(nn.Module):
def __init__(
self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
proj_drop=0., attn_head_dim=None,
attn_type='flash_v2'):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.v_bias = None
if attn_type not in ['origin', 'flash_v2']:
raise NotImplementedError(f"Not support attn_type: {attn_type}")
print('umt:', f'attn_type: {attn_type}')
self.attn_type = attn_type
if attn_type == 'flash_v2':
self.attn_drop = attn_drop
else:
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
B, N, C = x.shape
qkv_bias = None
if self.q_bias is not None:
qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
# qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
if self.attn_type == 'flash_v2':
qkv = qkv.reshape(B, N, 3, self.num_heads, -1)
x = flash_attn_qkvpacked_func(qkv, dropout_p=self.attn_drop, softmax_scale=self.scale, causal=False).reshape(B, N, -1)
else:
qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[
2] # make torchscript happy (cannot use tensor as tuple)
# B num_heads N head_dim
q = q * self.scale
attn = (q @ k.transpose(-2, -1))
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class Block(nn.Module):
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
drop_path=0., init_values=None, act_layer=nn.GELU, norm_layer=nn.LayerNorm,
attn_head_dim=None):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim)
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
if init_values is not None and init_values > 0:
self.gamma_1 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True)
self.gamma_2 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True)
else:
self.gamma_1, self.gamma_2 = None, None
def forward(self, x):
if self.gamma_1 is None:
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
else:
x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x)))
x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
return x
class PatchEmbed(nn.Module):
""" Image to Patch Embedding
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, num_frames=16, tubelet_size=2):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
self.tubelet_size = int(tubelet_size)
num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) * (num_frames // self.tubelet_size)
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = num_patches
self.proj = nn.Conv3d(
in_channels=in_chans, out_channels=embed_dim,
kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
stride=(self.tubelet_size, patch_size[0], patch_size[1])
)
print('umt:', f'Num of patches: {num_patches}')
def forward(self, x, **kwargs):
B, C, T, H, W = x.shape
# FIXME look at relaxing size constraints
# assert H == self.img_size[0] and W == self.img_size[1], \
# f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
x = self.proj(x).flatten(2).transpose(1, 2)
return x
# sin-cos position encoding
# https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31
def get_sinusoid_encoding_table(n_position, d_hid, ckpt_num_frame=-1, cur_frame=12):
''' Sinusoid position encoding table '''
# TODO: make it with torch instead of numpy
def get_position_angle_vec(position):
return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]
if ckpt_num_frame != -1 and ckpt_num_frame != cur_frame:
print('umt:', f"Interpolate position embedding")
print('umt:', f"Testing frame: {cur_frame}")
print('umt:', f"Checkpoint frame: {ckpt_num_frame}")
T = ckpt_num_frame # checkpoint frame
new_T = cur_frame # testing frame
n_position = n_position // new_T * T # generate checkpoint position embedding
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
sinusoid_table = torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
# interpolate
P = int((n_position // T) ** 0.5)
C = d_hid
sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(-1, C, T) # BHW, C, T
sinusoid_table = torch.nn.functional.interpolate(sinusoid_table, size=new_T, mode='linear')
sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(0, 4, 1, 2, 3) # B, T, H, W, C
sinusoid_table = sinusoid_table.flatten(1, 3)
return sinusoid_table
else:
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
return torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
def get_sinusoid_encoding_table2(n_position=784, d_hid=1024, cur_frame=8, ckpt_num_frame=4, pre_n_position=784):
''' Sinusoid position encoding table '''
# TODO: make it with torch instead of numpy
def get_position_angle_vec(position):
return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]
# generate checkpoint position embedding
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(pre_n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
sinusoid_table = torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
print(f"n_position: {n_position}")
print(f"pre_n_position: {pre_n_position}")
if n_position != pre_n_position:
T = ckpt_num_frame # checkpoint frame
P = 14 # checkpoint size
C = d_hid
new_P = int((n_position // cur_frame) ** 0.5) # testing size
print(f'Pretraining uses 14x14, but current version is {new_P}x{new_P}')
print(f'Interpolate the position embedding')
sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
sinusoid_table = sinusoid_table.reshape(-1, P, P, C).permute(0, 3, 1, 2)
sinusoid_table = torch.nn.functional.interpolate(
sinusoid_table, size=(new_P, new_P), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
sinusoid_table = sinusoid_table.permute(0, 2, 3, 1).reshape(-1, T, new_P, new_P, C)
sinusoid_table = sinusoid_table.flatten(1, 3) # B, THW, C
if cur_frame != ckpt_num_frame:
print(f'Pretraining uses {ckpt_num_frame} frames, but current frame is {cur_frame}')
print(f'Interpolate the position embedding')
T = ckpt_num_frame # checkpoint frame
new_T = cur_frame # testing frame
# interpolate
P = int((n_position // cur_frame) ** 0.5) # testing size
C = d_hid
sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(-1, C, T) # BHW, C, T
sinusoid_table = torch.nn.functional.interpolate(sinusoid_table, size=new_T, mode='linear')
sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(0, 4, 1, 2, 3) # B, T, H, W, C
sinusoid_table = sinusoid_table.flatten(1, 3) # B, THW, C
return sinusoid_table
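# Illustrative sketch (not part of the original module): the base sin-cos table that
# `get_sinusoid_encoding_table` builds before any interpolation can be reproduced with
# the standard library alone. `sinusoid_table` is a hypothetical helper name; the
# frame/resolution interpolation branches above are intentionally left out.

```python
import math

def sinusoid_table(n_position, d_hid):
    """Return an n_position x d_hid nested list of sin-cos position encodings."""
    table = []
    for pos in range(n_position):
        row = []
        for j in range(d_hid):
            # Channels 2i and 2i+1 share the same frequency, hence (j // 2).
            angle = pos / (10000 ** (2 * (j // 2) / d_hid))
            # Even channels use sine, odd channels use cosine.
            row.append(math.sin(angle) if j % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

table = sinusoid_table(n_position=4, d_hid=6)
```

# Position 0 always encodes to alternating 0.0 / 1.0 because sin(0) = 0 and cos(0) = 1.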
class PretrainVisionTransformerEncoder(nn.Module):
""" Vision Transformer with support for patch or hybrid CNN input stage
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12,
num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
drop_path_rate=0., norm_layer=nn.LayerNorm, init_values=None, num_frames=8, tubelet_size=1,
use_learnable_pos_emb=False,
use_checkpoint=False, checkpoint_num=0,
ckpt_num_frame=-1, with_ln=True, return_index=-1
):
super().__init__()
self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
self.patch_embed = PatchEmbed(
img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim,
num_frames=num_frames, tubelet_size=tubelet_size
)
num_patches = self.patch_embed.num_patches
self.depth = depth + return_index + 1
self.use_checkpoint = use_checkpoint
self.checkpoint_num = checkpoint_num
print('umt:', f"Use checkpoint: {use_checkpoint}")
print('umt:', f"Checkpoint number: {checkpoint_num}")
print('umt:', f"Real running depth: {self.depth}")
# TODO: Add the cls token
if use_learnable_pos_emb:
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.img_pos_embed = nn.Parameter(torch.zeros(1, num_patches//(num_frames//tubelet_size) + 1, embed_dim))
else:
# sine-cosine positional embeddings
if img_size != 224:
self.pos_embed = get_sinusoid_encoding_table2(num_patches, embed_dim, ckpt_num_frame=ckpt_num_frame, cur_frame=num_frames//tubelet_size)
self.img_pos_embed = get_sinusoid_encoding_table2(num_patches//(num_frames//tubelet_size), embed_dim, cur_frame=1, ckpt_num_frame=1, pre_n_position=14*14)
else:
self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim, ckpt_num_frame=ckpt_num_frame, cur_frame=num_frames//tubelet_size)
self.img_pos_embed = get_sinusoid_encoding_table(num_patches//(num_frames//tubelet_size), embed_dim)
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
self.blocks = nn.ModuleList([
Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
init_values=init_values)
for i in range(self.depth)])
if with_ln:
self.vision_layernorm = nn.LayerNorm(embed_dim, eps=1e-12)
else:
self.vision_layernorm = nn.Identity()
if use_learnable_pos_emb:
trunc_normal_(self.pos_embed, std=.02)
@torch.jit.ignore
def no_weight_decay(self):
return {'pos_embed', 'cls_token'}
def forward_features(self, x, use_image=False):
x = self.patch_embed(x)
if use_image:
x = x + self.img_pos_embed.type_as(x).to(x.device).clone().detach()
else:
x = x + self.pos_embed.type_as(x).to(x.device).clone().detach()
B, _, C = x.shape
x_vis = x
for idx, blk in enumerate(self.blocks):
if self.use_checkpoint and idx < self.checkpoint_num:
x_vis = checkpoint.checkpoint(blk, x_vis)
else:
x_vis = blk(x_vis)
# with ln or not
x_vis = self.vision_layernorm(x_vis)
return x_vis
def forward(self, x, use_image=False):
x_vis = self.forward_features(x, use_image)
return x_vis
class PretrainVisionTransformer(nn.Module):
""" Vision Transformer with support for patch or hybrid CNN input stage
"""
def __init__(self,
img_size=224,
patch_size=16,
encoder_in_chans=3,
encoder_embed_dim=768,
encoder_depth=12,
encoder_num_heads=12,
mlp_ratio=4.,
qkv_bias=True,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.,
norm_layer=partial(nn.LayerNorm, eps=1e-6),
init_values=0.,
use_learnable_pos_emb=False,
num_frames=8,
tubelet_size=1,
use_checkpoint=False,
checkpoint_num=0,
ckpt_num_frame=4, # the pretrained model uses 4 frames
return_index=-1,
with_ln=False
):
super().__init__()
self.encoder = PretrainVisionTransformerEncoder(
img_size=img_size,
patch_size=patch_size,
in_chans=encoder_in_chans,
embed_dim=encoder_embed_dim,
depth=encoder_depth,
num_heads=encoder_num_heads,
mlp_ratio=mlp_ratio,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
drop_rate=drop_rate,
attn_drop_rate=attn_drop_rate,
drop_path_rate=drop_path_rate,
norm_layer=norm_layer,
init_values=init_values,
num_frames=num_frames,
tubelet_size=tubelet_size,
use_learnable_pos_emb=use_learnable_pos_emb,
use_checkpoint=use_checkpoint,
checkpoint_num=checkpoint_num,
ckpt_num_frame=ckpt_num_frame,
with_ln=with_ln,
return_index=return_index
)
print('umt:', f'With LN: {with_ln}')
print('umt:', f'Total {encoder_depth} layer')
print('umt:', f'Return {encoder_depth+return_index+1}-th layer')
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
nn.init.xavier_uniform_(m.weight)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
@torch.jit.ignore
def no_weight_decay(self):
return {'pos_embed', 'cls_token', 'clip_pos_embed'}
def forward(self, x, use_image=False):
T = x.shape[2]
x_vis = self.encoder(x, use_image) # [B, N_vis, C_e]
B, TL, C = x_vis.shape
x_vis = x_vis.view(B, T, TL // T, C)
return x_vis
================================================
FILE: llava-train_videochat/llava/model/multimodal_encoder/umt_encoder.py
================================================
"""
# Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py
"""
from typing import Optional, Tuple, Union, Dict
from dataclasses import dataclass
from functools import partial, reduce
from PIL import Image
import torch
import torch.utils.checkpoint
from torch import nn
import os
from transformers.image_processing_utils import BatchFeature, get_size_dict
from transformers.image_transforms import (
convert_to_rgb,
normalize,
rescale,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
ChannelDimension,
PILImageResampling,
to_numpy_array,
)
from llava.utils import rank0_print
from .umt.vit import PretrainVisionTransformer
class UMTImageProcessor:
def __init__(self, image_mean=(0.485, 0.456, 0.406), image_std=(0.229, 0.224, 0.225), size=(224, 224), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST):
crop_size = crop_size if crop_size is not None else {"height": size[0], "width": size[1]}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.image_mean = image_mean
self.image_std = image_std
self.size = size
self.resample = resample
self.rescale_factor = rescale_factor
self.data_format = data_format
self.crop_size = crop_size
def preprocess(self, images, return_tensors, target_size=None):
if isinstance(images, Image.Image):
images = [images]
else:
# to adapt video data
images = [to_numpy_array(image) for image in images]
assert isinstance(images, list)
if target_size is None:
target_size = self.size
transforms = [
convert_to_rgb,
to_numpy_array,
partial(resize, size=target_size, resample=self.resample, data_format=self.data_format),
partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
class UMTVisionConfig:
model_type = "umt_vision_model"
def __init__(
self,
num_frames=4,
hidden_size=1024,
num_hidden_layers=24,
num_attention_heads=16,
num_channels=3,
image_size=224,
patch_size=16,
return_idx=-2
# **kwargs,
):
# super().__init__(**kwargs)
self.num_frames = num_frames
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.return_idx = return_idx
def build_vit(config, pt_type='origin'):
model = PretrainVisionTransformer(
img_size=config.image_size,
patch_size=16,
encoder_embed_dim=1024,
encoder_depth=24,
encoder_num_heads=16,
drop_path_rate=0.,
num_frames=config.num_frames,
tubelet_size=1,
use_checkpoint=True,
checkpoint_num=24,
return_index=config.return_idx,
with_ln=True, # merge vision_layernorm in it
)
ckpt_path = "OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/UMT-L_f4_vision.pt"
if not os.path.isfile(ckpt_path):
raise NotImplementedError("Please download https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/UMT-L_f4_vision.pt")
old_state_dict = torch.load(ckpt_path, map_location='cpu')
state_dict = {}
for k in old_state_dict:
if k.startswith("encoder."):
if k.startswith("encoder.norm"):
state_dict[k.replace('encoder.norm', 'encoder.vision_layernorm')] = old_state_dict[k]
else:
state_dict[k] = old_state_dict[k]
del old_state_dict
msg = model.load_state_dict(state_dict, strict=False)
print('umt:', f"Loading pretrained weights from {ckpt_path}", msg)
return model
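# Illustrative sketch (not part of the original module): the key remapping done in
# `build_vit`, shown on a plain dict. It assumes the `else` branch above belongs to the
# inner `encoder.norm` check, so non-encoder keys are dropped. `remap_encoder_keys` and
# the demo keys are hypothetical; integers stand in for tensors.

```python
def remap_encoder_keys(old_state_dict):
    """Keep only `encoder.*` entries and rename the final norm layer."""
    new_state_dict = {}
    for key, value in old_state_dict.items():
        if not key.startswith("encoder."):
            continue  # entries outside the encoder are dropped
        if key.startswith("encoder.norm"):
            key = key.replace("encoder.norm", "encoder.vision_layernorm")
        new_state_dict[key] = value
    return new_state_dict

# Hypothetical checkpoint keys for demonstration only.
demo = {
    "encoder.norm.weight": 0,
    "encoder.blocks.0.attn.proj.weight": 1,
    "decoder.head.weight": 2,
}
remapped = remap_encoder_keys(demo)
```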
class UMTVisionTower(nn.Module):
def __init__(self, vision_tower, vision_tower_cfg, delay_load=False, pt_type='origin', image_size=224):
super().__init__()
self.is_loaded = False
self.pt_type = pt_type
self.config = UMTVisionConfig(num_frames=vision_tower_cfg.mm_local_num_frames, return_idx=vision_tower_cfg.mm_vision_select_layer, image_size=image_size)
self.vision_tower_name = vision_tower
self.image_processor = UMTImageProcessor(size=(image_size, image_size))
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = self.config
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.vision_tower = build_vit(self.config, pt_type=self.pt_type)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def forward(self, images):
if type(images) is list:
raise NotImplementedError
else:
# input: B T C H W
# output: B T*L C
T = images.shape[1]
images = images.permute(0, 2, 1, 3, 4)
image_embeds = self.vision_tower(images, use_image=(T == 1))
B, T, L, C = image_embeds.shape
image_embeds = image_embeds.reshape(B, -1, C)
return image_embeds
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
for p in self.vision_tower.parameters():
return p.dtype
@property
def device(self):
for p in self.vision_tower.parameters():
return p.device
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
# return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"]
@property
def image_size(self):
return self.config.image_size
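# Illustrative sketch (not part of the original module): shape bookkeeping for
# `UMTVisionTower.forward`, which maps a `B T C H W` input to `B T*L C` features.
# The defaults mirror the values hard-coded in `build_vit` (patch size 16, embedding
# size 1024, tubelet size 1); `tower_output_shape` is a hypothetical helper name.

```python
def tower_output_shape(batch, frames, image_size=224, patch_size=16, hidden_size=1024):
    """Return the (B, T*L, C) shape produced for a (B, T, C, H, W) input."""
    per_side = image_size // patch_size
    tokens_per_frame = per_side ** 2  # L, since each tubelet spans one frame
    return (batch, frames * tokens_per_frame, hidden_size)

shape = tower_output_shape(batch=2, frames=4)
```

# With 224x224 inputs each frame yields 14 * 14 = 196 tokens, so 4 frames give 784.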
================================================
FILE: llava-train_videochat/llava/model/multimodal_projector/builder.py
================================================
import torch
import torch.nn as nn
import re
from .tome16_mlp_hd64 import ToMe16_mlp_hd64
class IdentityMap(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x, *args, **kwargs):
return x
@property
def config(self):
return {"mm_projector_type": "identity"}
class SimpleResBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.pre_norm = nn.LayerNorm(channels)
self.proj = nn.Sequential(nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))
def forward(self, x):
x = self.pre_norm(x)
return x + self.proj(x)
def build_vision_projector(config, delay_load=False, **kwargs):
projector_type = getattr(config, "mm_projector_type", "linear")
if projector_type == 'tome16_mlp_hd64':
return ToMe16_mlp_hd64(config, kwargs["vision_cfg"])
if projector_type == "linear":
return nn.Linear(config.mm_hidden_size, config.hidden_size)
mlp_gelu_match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
if mlp_gelu_match:
mlp_depth = int(mlp_gelu_match.group(1))
modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
for _ in range(1, mlp_depth):
modules.append(nn.GELU())
modules.append(nn.Linear(config.hidden_size, config.hidden_size))
return nn.Sequential(*modules)
mlp_gelu_resnet_match = re.match(r"^mlp(\d+)x_res(\d+)x_gelu$", projector_type)
if mlp_gelu_resnet_match:
mlp_depth = int(mlp_gelu_resnet_match.group(1))
res_depth = int(mlp_gelu_resnet_match.group(2))
modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
for _ in range(1, mlp_depth):
modules.append(nn.GELU())
modules.append(nn.Linear(config.hidden_size, config.hidden_size))
for _ in range(res_depth):
modules.append(SimpleResBlock(config.hidden_size))
return nn.Sequential(*modules)
if projector_type == "identity":
return IdentityMap()
raise ValueError(f"Unknown projector type: {projector_type}")
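# Illustrative sketch (not part of the original module): how the `mm_projector_type`
# string is dispatched in `build_vision_projector`, using layer names instead of real
# `nn.Module` objects so it runs without torch. `describe_projector` is a hypothetical
# helper name.

```python
import re

def _mlp_layers(depth):
    """One Linear, then (GELU, Linear) repeated depth - 1 times."""
    layers = ["Linear"]
    for _ in range(1, depth):
        layers += ["GELU", "Linear"]
    return layers

def describe_projector(projector_type):
    """Return the sequence of layer names a projector string expands to."""
    if projector_type == "tome16_mlp_hd64":
        return ["ToMe16_mlp_hd64"]
    if projector_type == "linear":
        return ["Linear"]
    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        return _mlp_layers(int(match.group(1)))
    match = re.match(r"^mlp(\d+)x_res(\d+)x_gelu$", projector_type)
    if match:
        return _mlp_layers(int(match.group(1))) + ["SimpleResBlock"] * int(match.group(2))
    if projector_type == "identity":
        return ["IdentityMap"]
    raise ValueError(f"Unknown projector type: {projector_type}")

layers = describe_projector("mlp2x_gelu")
```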
================================================
FILE: llava-train_videochat/llava/model/multimodal_projector/tome16_mlp_hd64.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
# --------------------------------------------------------
import torch
import torch.nn as nn
from typing import Callable, Tuple
import torch.nn.functional as F
def bipartite_soft_matching(
metric: torch.Tensor,
r: int,
) -> Tuple[Callable, Callable]:
"""
Applies ToMe with a balanced matching set (50%, 50%).
Input size is [batch, tokens, channels].
r indicates the number of tokens to remove (max 50% of tokens).
"""
protected = 0
t = metric.shape[1]
r = min(r, (t - protected) // 2)
assert r > 0, r
with torch.no_grad():
metric = metric / metric.norm(dim=-1, keepdim=True)
a, b = metric[..., ::2, :], metric[..., 1::2, :]
scores = a @ b.transpose(-1, -2)
node_max, node_idx = scores.max(dim=-1)
edge_idx = node_max.argsort(dim=-1, descending=True)[..., None]
unm_idx = edge_idx[..., r:, :] # Unmerged Tokens
src_idx = edge_idx[..., :r, :] # Merged Tokens
dst_idx = node_idx[..., None].gather(dim=-2, index=src_idx)
def merge(x: torch.Tensor, mode="mean") -> torch.Tensor:
src, dst = x[..., ::2, :], x[..., 1::2, :]
n, t1, c = src.shape
unm = src.gather(dim=-2, index=unm_idx.expand(n, t1 - r, c))
src = src.gather(dim=-2, index=src_idx.expand(n, r, c))
dst = dst.scatter_add(-2, dst_idx.expand(n, r, c), src) # , reduce=mode)
return torch.cat([unm, dst], dim=1)
def unmerge(x: torch.Tensor) -> torch.Tensor:
unm_len = unm_idx.shape[1]
unm, dst = x[..., :unm_len, :], x[..., unm_len:, :]
n, _, c = unm.shape
src = dst.gather(dim=-2, index=dst_idx.expand(n, r, c))
out = torch.zeros(n, metric.shape[1], c, device=x.device, dtype=x.dtype)
out[..., 1::2, :] = dst
out.scatter_(dim=-2, index=(2 * unm_idx).expand(n, unm_len, c), src=unm)
out.scatter_(dim=-2, index=(2 * src_idx).expand(n, r, c), src=src)
return out
return merge, unmerge
def merge_wavg(
merge: Callable, x: torch.Tensor, size: torch.Tensor = None
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Applies the merge function by taking a weighted average based on token size.
Returns the merged tensor and the new token sizes.
"""
if size is None:
size = torch.ones_like(x[..., 0, None])
x = merge(x * size, mode="sum")
size = merge(size, mode="sum")
x = x / size
return x, size
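# Illustrative sketch (not part of the original module): the size-weighted average that
# `merge_wavg` applies per channel, reduced to two scalar "tokens". `merge_pair_wavg`
# is a hypothetical helper name.

```python
def merge_pair_wavg(x_a, size_a, x_b, size_b):
    """Merge two tokens, weighting each by how many original tokens it represents."""
    merged_size = size_a + size_b
    merged_x = (x_a * size_a + x_b * size_b) / merged_size
    return merged_x, merged_size

# Token A already stands for 2 original tokens (value 1.0); token B is a single token (value 4.0).
value, size = merge_pair_wavg(1.0, 2, 4.0, 1)
```

# Tracking `size` keeps repeated merges equal to the plain mean of the original tokens.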
class ToMe16_mlp_hd64(nn.Module):
def __init__(self, config, vision_cfg):
super().__init__()
self._config = config
self.mm_hidden_size = config.mm_hidden_size
self.hw = vision_cfg.image_size // vision_cfg.patch_size
self.num_attention_heads = vision_cfg.num_attention_heads
self.mlp = nn.Sequential(nn.Linear(config.mm_hidden_size, config.hidden_size),
nn.GELU(),
nn.Linear(config.hidden_size, config.hidden_size))
self.max_pos_hw = self.hw
self.max_pos_num_frames = config.mm_pos_num_frames
# self._set_3d_pos_cache(max_grid_size=self.max_pos_hw, max_t_size=self.max_pos_num_frames)
self.num_image_patches_per_side = 8
self.num_frame_patches_per_side = 4
def merge_tokens(self, x, target_num_token):
r"""
x = torch.randn(10, 2560, c)
x = self.merge_tokens(x, target_num_token=1280)
"""
size = None
b, p, c = x.shape
tmp_p = p
r_merge_list = []
assert tmp_p > target_num_token, f"{tmp_p} should be greater than {target_num_token}"
while tmp_p != target_num_token:
if tmp_p - target_num_token <= (tmp_p // 2):
r_merge_list.append(tmp_p - target_num_token)
break
else:
r_merge_list.append(tmp_p // 2)
tmp_p = tmp_p - (tmp_p // 2)
head = self.num_attention_heads
dim = c // head
for r in r_merge_list:
metric = x.reshape(b, p, head, dim).mean(2) # [b, p, c//head]
merge, _ = bipartite_soft_matching(
metric,
r
)
x, size = merge_wavg(merge, x, size)
_, p, _ = x.shape
# x = x.reshape(-1, c) # 300, 1024
return x
def forward(self, x, compress=False, local_num_frames=-1):
height = width = self.hw
assert height * width == x.shape[1]
dtype = x.dtype
device = x.device
if local_num_frames != -1 and local_num_frames != 1:
assert compress is True
if compress:
if local_num_frames != -1:
num_frames = local_num_frames
x = x.reshape(x.shape[0] // local_num_frames, -1, x.shape[-1])
else:
num_frames = x.shape[0]
x = x.reshape(1, -1, x.shape[-1])
num_tome_tokens = 16 * num_frames
else:
num_tome_tokens = 64
x = self.merge_tokens(x, target_num_token=num_tome_tokens)
x = self.mlp(x)
return x
@property
def config(self):
return {"mm_projector_type": "tome16_mlp_hd64"}
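# Illustrative sketch (not part of the original module): the merge schedule computed at
# the top of `ToMe16_mlp_hd64.merge_tokens`. Each bipartite matching step can remove at
# most half of the current tokens, so the reduction is split into several steps.
# `merge_schedule` is a hypothetical helper name.

```python
def merge_schedule(num_tokens, target_num_token):
    """Return the number of tokens to remove at each merging step."""
    assert num_tokens > target_num_token
    schedule = []
    remaining = num_tokens
    while remaining != target_num_token:
        if remaining - target_num_token <= remaining // 2:
            # The last step removes exactly what is left over.
            schedule.append(remaining - target_num_token)
            break
        # Otherwise remove the maximum allowed: half of the current tokens.
        schedule.append(remaining // 2)
        remaining -= remaining // 2
    return schedule

# 4 frames x 196 patches compressed to 16 tokens per frame (64 in total).
schedule = merge_schedule(4 * 196, 16 * 4)
```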
================================================
FILE: llava-train_videochat/llava/model/utils.py
================================================
from transformers import AutoConfig
def auto_upgrade(config):
cfg = AutoConfig.from_pretrained(config)
if "llava" in config and "llava" not in cfg.model_type:
assert cfg.model_type == "llama"
print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.")
print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
if confirm.lower() in ["y", "yes"]:
print("Upgrading checkpoint...")
assert len(cfg.architectures) == 1
setattr(cfg.__class__, "model_type", "llava")
cfg.architectures[0] = "LlavaLlamaForCausalLM"
cfg.save_pretrained(config)
print("Checkpoint upgraded.")
else:
print("Checkpoint upgrade aborted.")
exit(1)
================================================
FILE: llava-train_videochat/llava/serialize_utils.py
================================================
# Description: This file contains the code for serializing the dataset.
# From https://github.com/ppwwyyxx/RAM-multiprocess-dataloader/blob/795868a37446d61412b9a58dbb1b7c76e75d39c4/serialize.py
# Copyright (c) Facebook, Inc. and its affiliates.
"""
List serialization code adopted from
https://github.com/facebookresearch/detectron2/blob/main/detectron2/data/common.py
"""
import multiprocessing as mp
from typing import List, Any, Optional
import pickle
import numpy as np
import torch
import torch.distributed as dist
import functools
import os
from datetime import timedelta
def get_world_size() -> int:
if not dist.is_available():
return 1
if not dist.is_initialized():
return 1
return dist.get_world_size()
def get_rank() -> int:
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
return dist.get_rank()
def get_local_rank() -> int:
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
# this is not guaranteed to be set
if 'LOCAL_RANK' in os.environ:
return int(os.environ['LOCAL_RANK'])
elif 'SLURM_PROCID' in os.environ:
return int(os.environ['SLURM_LOCALID'])
else:
raise RuntimeError("Unable to get local rank")
def get_local_size() -> int:
return torch.cuda.device_count()
@functools.lru_cache()
def _get_global_gloo_group():
"""
Return a process group based on gloo backend, containing all the ranks
The result is cached.
"""
if dist.get_backend() == "nccl":
return dist.new_group(backend="gloo", timeout=timedelta(minutes=60))
else:
return dist.group.WORLD
def all_gather(data, group=None):
"""
Run all_gather on arbitrary picklable data (not necessarily tensors).
Args:
data: any picklable object
group: a torch process group. By default, will use a group which
contains all ranks on gloo backend.
Returns:
list[data]: list of data gathered from each rank
"""
if get_world_size() == 1:
return [data]
if group is None:
group = (
_get_global_gloo_group()
) # use CPU group by default, to reduce GPU RAM usage.
world_size = dist.get_world_size(group)
if world_size == 1:
return [data]
output = [None for _ in range(world_size)]
dist.all_gather_object(output, data, group=group)
return output
class NumpySerializedList:
def __init__(self, lst: list):
def _serialize(data):
buffer = pickle.dumps(data, protocol=-1)
return np.frombuffer(buffer, dtype=np.uint8)
print(
"Serializing {} elements to byte tensors and concatenating them all ...".format(
len(lst)
)
)
self._lst = [_serialize(x) for x in lst]
self._addr = np.asarray([len(x) for x in self._lst], dtype=np.int64)
self._addr = np.cumsum(self._addr)
self._lst = np.concatenate(self._lst)
print("Serialized dataset takes {:.2f} MiB".format(len(self._lst) / 1024**2))
def __len__(self):
return len(self._addr)
def __getitem__(self, idx):
start_addr = 0 if idx == 0 else self._addr[idx - 1].item()
end_addr = self._addr[idx].item()
bytes = memoryview(self._lst[start_addr:end_addr])
return pickle.loads(bytes)
class TorchSerializedList(NumpySerializedList):
def __init__(self, lst: list):
super().__init__(lst)
self._addr = torch.from_numpy(self._addr)
self._lst = torch.from_numpy(self._lst)
def __getitem__(self, idx):
start_addr = 0 if idx == 0 else self._addr[idx - 1].item()
end_addr = self._addr[idx].item()
bytes = memoryview(self._lst[start_addr:end_addr].numpy())
return pickle.loads(bytes)
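# Illustrative sketch (not part of the original module): the storage layout used by
# `NumpySerializedList`, rebuilt with the standard library only. Every item is pickled,
# the blobs are concatenated into one buffer, and cumulative end offsets locate each
# item. `BytesSerializedList` is a hypothetical class name.

```python
import pickle
from itertools import accumulate

class BytesSerializedList:
    """Stdlib analog of NumpySerializedList: one flat buffer plus end offsets."""

    def __init__(self, lst):
        blobs = [pickle.dumps(x, protocol=-1) for x in lst]
        # _addr[i] is the end offset of item i inside the flat buffer.
        self._addr = list(accumulate(len(b) for b in blobs))
        self._buf = b"".join(blobs)

    def __len__(self):
        return len(self._addr)

    def __getitem__(self, idx):
        start = 0 if idx == 0 else self._addr[idx - 1]
        end = self._addr[idx]
        # memoryview slicing avoids copying the buffer before unpickling.
        return pickle.loads(memoryview(self._buf)[start:end])

records = BytesSerializedList([{"id": 0}, "caption", [1, 2, 3]])
```

# Holding one flat buffer instead of many Python objects is what lets the dataset be
# shared across worker processes without per-object reference-count writes.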
def local_scatter(array: Optional[List[Any]]):
"""
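Scatter an array from the local leader to all local workers.
The i-th local worker gets array[i].
Args:
array: A list whose length equals the number of local workers.
Only the value passed by the local leader is used; other workers may pass None.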
"""
if get_local_size() <= 1:
# Just one worker. Do nothing.
return array[0]
if get_local_rank() == 0:
assert len(array) == get_local_size()
all_gather(array)
else:
all_data = all_gather(None)
array = all_data[get_rank() - get_local_rank()]
return array[get_local_rank()]
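# A small sketch of the index arithmetic in local_scatter: the local leader of a worker
# is found as `rank - local_rank`. `local_leader_rank` and the 2 x 4 layout are
# hypothetical; the sketch assumes global ranks are assigned contiguously, node by node.

```python
def local_leader_rank(rank: int, local_rank: int) -> int:
    """Global rank of the first worker on the same node (hypothetical helper)."""
    return rank - local_rank


# Assumed layout: 2 nodes x 4 workers, global ranks 0..7 assigned node by node.
LOCAL_SIZE = 4
leaders = [local_leader_rank(r, r % LOCAL_SIZE) for r in range(8)]
print(leaders)  # [0, 0, 0, 0, 4, 4, 4, 4]
```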
# NOTE: https://github.com/facebookresearch/mobile-vision/pull/120
# has another implementation that does not use tensors.
class TorchShmSerializedList(TorchSerializedList):
    def __init__(self, lst: list):
        if get_local_rank() == 0:
            super().__init__(lst)
            # Move data to shared memory, obtain a handle to send to each local worker.
            # This is cheap because a tensor will only be moved to shared memory once.
            handles = [None] + [
                bytes(mp.reduction.ForkingPickler.dumps((self._addr, self._lst)))
                for _ in range(get_local_size() - 1)
            ]
        else:
            handles = None
        # Each worker receives the handle from local leader.
        handle = local_scatter(handles)
        if get_local_rank() > 0:
            # Materialize the tensor from shared memory.
            self._addr, self._lst = mp.reduction.ForkingPickler.loads(handle)
            print(
                f"Worker {get_rank()} obtains a dataset of length="
                f"{len(self)} from its local leader."
            )
# From https://github.com/ppwwyyxx/RAM-multiprocess-dataloader/issues/5#issuecomment-1510676170
def local_broadcast_process_authkey():
if int(os.environ['LOCAL_WORLD_SIZE']) == 1:
return
local_rank = int(os.environ['LOCAL_RANK'])
authkey = bytes(mp.current_process().authkey)
all_keys = all_gather(authkey)
local_leader_key = all_keys[get_rank() - local_rank]
if authkey != local_leader_key:
print("Process authkey is different from the key of local leader. This might happen when "
"workers are launched independently.")
print("Overwriting local authkey ...")
mp.current_process().authkey = local_leader_key
================================================
FILE: llava-train_videochat/llava/train/llava_trainer.py
================================================
import os
import torch
import torch.nn as nn
import datetime
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs, GradientAccumulationPlugin
from torch.utils.data import Dataset, Sampler, DataLoader
from transformers import Trainer
from transformers.trainer import is_sagemaker_mp_enabled, get_parameter_names, has_length, ALL_LAYERNORM_LAYERS, logger, is_accelerate_available, is_datasets_available, GradientAccumulationPlugin
from transformers.trainer_utils import seed_worker
from transformers.trainer_pt_utils import get_length_grouped_indices as get_length_grouped_indices_hf
from transformers.trainer_pt_utils import AcceleratorConfig
from typing import List, Optional
from datetime import timedelta
if is_accelerate_available():
from accelerate import Accelerator, skip_first_batches, InitProcessGroupKwargs
if is_datasets_available():
import datasets
from llava.utils import rank0_print
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
print(name, "no ignore status")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()}
return to_return
def split_to_even_chunks(indices, lengths, num_chunks):
"""
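Split a list of indices into `num_chunks` chunks of roughly equal total length.
If the number of indices is not divisible by `num_chunks`, fall back to a round-robin split.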
"""
if len(indices) % num_chunks != 0:
return [indices[i::num_chunks] for i in range(num_chunks)]
num_indices_per_chunk = len(indices) // num_chunks
chunks = [[] for _ in range(num_chunks)]
chunks_lengths = [0 for _ in range(num_chunks)]
for index in indices:
shortest_chunk = chunks_lengths.index(min(chunks_lengths))
chunks[shortest_chunk].append(index)
chunks_lengths[shortest_chunk] += lengths[index]
if len(chunks[shortest_chunk]) == num_indices_per_chunk:
chunks_lengths[shortest_chunk] = float("inf")
return chunks
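# A worked example of the greedy balancing above. The function body is repeated with its
# indentation restored so the sketch runs on its own; the sample `lengths` are made up.
# Each index goes to the chunk with the smallest running total, and a chunk is closed
# (its total set to infinity) once it holds len(indices) // num_chunks items.

```python
def split_to_even_chunks(indices, lengths, num_chunks):
    if len(indices) % num_chunks != 0:
        # Not evenly divisible: plain round-robin split.
        return [indices[i::num_chunks] for i in range(num_chunks)]
    num_indices_per_chunk = len(indices) // num_chunks
    chunks = [[] for _ in range(num_chunks)]
    chunks_lengths = [0 for _ in range(num_chunks)]
    for index in indices:
        shortest_chunk = chunks_lengths.index(min(chunks_lengths))
        chunks[shortest_chunk].append(index)
        chunks_lengths[shortest_chunk] += lengths[index]
        if len(chunks[shortest_chunk]) == num_indices_per_chunk:
            chunks_lengths[shortest_chunk] = float("inf")
    return chunks


lengths = [10, 9, 8, 7, 3, 2, 2, 1]  # made-up token counts, already sorted descending
even = split_to_even_chunks(list(range(8)), lengths, num_chunks=2)
print(even)  # [[0, 3, 4, 7], [1, 2, 5, 6]]
totals = [sum(lengths[i] for i in chunk) for chunk in even]
print(totals)  # [21, 21]
uneven = split_to_even_chunks(list(range(5)), lengths, num_chunks=2)
print(uneven)  # [[0, 2, 4], [1, 3]]
```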
def get_variable_length_grouped_indices(lengths, batch_size, world_size, megabatch_mult=8, generator=None):
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
indices = torch.randperm(len(lengths), generator=generator)
sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
megabatch_size = world_size * batch_size * megabatch_mult
megabatches = [sorted_indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: indices[i], reverse=True) for megabatch in megabatches]
shuffled_indices = [i for megabatch in megabatches for i in megabatch]
world_batch_size = world_size * batch_size
batches = [shuffled_indices[i : i + world_batch_size] for i in range(0, len(lengths), world_batch_size)]
batch_indices = torch.randperm(len(batches), generator=generator)
batches = [batches[i] for i in batch_indices]
return [i for batch in batches for i in batch]
def get_modality_length_grouped_indices(lengths, batch_size, world_size, generator=None):
"""
Return a list of indices grouped by modality and, within each modality, by length.
The sign of each entry in `lengths` encodes the modality: positive for multimodal
samples, negative for text-only samples. To do this, the indices are:
- split by modality and ordered with `get_length_grouped_indices`
- cut into mega-batches of size `world_size * batch_size` per modality
- shuffled at the mega-batch level across both modalities
The last mega-batch of each modality, which may be incomplete, is merged into one final
mega-batch appended at the end; it is the only one that may mix modalities.
"""
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
assert all(l != 0 for l in lengths), "Should not have zero length."
if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
# all samples are in the same modality
return get_length_grouped_indices(lengths, batch_size, world_size, generator=generator)
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices(mm_lengths, batch_size, world_size, generator=None)]
lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices(lang_lengths, batch_size, world_size, generator=None)]
megabatch_size = world_size * batch_size
mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
additional_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
if len(additional_batch) > 0:
megabatches.append(sorted(additional_batch))
return [i for megabatch in megabatches for i in megabatch]
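# A short sketch of the sign convention that the modality-grouped sampler relies on:
# positive lengths mark multimodal samples, negative lengths mark text-only samples, and
# zero is rejected. `split_by_modality` is a hypothetical helper and the lengths are
# made up; it mirrors only the first step of the function above.

```python
def split_by_modality(lengths):
    """Return (index, length) pairs for multimodal and text-only samples."""
    assert all(l != 0 for l in lengths), "Should not have zero length."
    mm = [(i, l) for i, l in enumerate(lengths) if l > 0]
    lang = [(i, -l) for i, l in enumerate(lengths) if l < 0]
    return mm, lang


mm, lang = split_by_modality([120, -35, 300, -8])
print(mm)    # [(0, 120), (2, 300)]
print(lang)  # [(1, 35), (3, 8)]
```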
def get_length_grouped_indices(lengths, batch_size, world_size, generator=None, merge=True):
"""
Return a list of indices so that the per-rank batches of one step have similar total lengths.
To do this, the indices are:
- randomly permuted
- grouped in mega-batches of size `world_size * batch_size`
- sorted by length (longest first) within each mega-batch
- split into `world_size` chunks of roughly equal total length
The result is the concatenation of all chunks of all mega-batches.
The `merge` argument is currently unused.
"""
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
indices = torch.randperm(len(lengths), generator=generator)
megabatch_size = world_size * batch_size
megabatches = [indices[i : i + megabatch_size].tolist() for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
return [i for megabatch in megabatches for batch in megabatch for i in batch]
def get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=None):
indices = get_length_grouped_indices_hf(lengths, batch_size * world_size, generator=generator)
megabatch_size = world_size * batch_size
megabatches = [indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
batch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in batch_indices]
return [i for megabatch in megabatches for batch in megabatch for i in batch]
def get_modality_length_grouped_indices_auto(lengths, batch_size, world_size, generator=None):
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
assert all(l != 0 for l in lengths), "Should not have zero length."
if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
# all samples are in the same modality
return get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=generator)
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices_auto_single(mm_lengths, batch_size, world_size, generator=None)]
lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices_auto_single(lang_lengths, batch_size, world_size, generator=None)]
megabatch_size = world_size * batch_size
mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
additional_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
# FIXME: Hard code to avoid last batch mixed with different modalities
# if len(additional_batch) > 0:
# megabatches.append(sorted(additional_batch))
return [i for megabatch in megabatches for i in megabatch]
class LengthGroupedSampler(Sampler):
r"""
Sampler that samples indices in a way that groups together features of the dataset of roughly the same length while
keeping a bit of randomness.
"""
def __init__(
self,
batch_size: int,
world_size: int,
lengths: Optional[List[int]] = None,
generator=None,
variable_length: bool = False,
group_by_modality: bool = False,
group_by_modality_auto: bool = False,
):
if lengths is None:
raise ValueError("Lengths must be provided.")
self.batch_size = batch_size
self.world_size = world_size
self.lengths = lengths
self.generator = generator
self.variable_length = variable_length
self.group_by_modality = group_by_modality
self.group_by_modality_auto = group_by_modality_auto
def __len__(self):
return len(self.lengths)
def __iter__(self):
if self.variable_length:
assert not self.group_by_modality, "Variable length grouping is not supported with modality grouping."
indices = get_variable_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
else:
if self.group_by_modality:
indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
elif self.group_by_modality_auto:
indices = get_modality_length_grouped_indices_auto(self.lengths, self.batch_size, self.world_size, generator=self.generator)
else:
indices = get_length_grouped_indices_auto_single(self.lengths, self.batch_size, self.world_size, generator=self.generator)
return iter(indices)
class LLaVATrainer(Trainer):
def create_accelerator_and_postprocess(self):
grad_acc_kwargs = {"num_steps": self.args.gradient_accumulation_steps}
grad_acc_kwargs["sync_with_dataloader"] = False
gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52))
rank0_print("Setting NCCL timeout to INF to avoid running errors.")
# create accelerator object
self.accelerator = Accelerator(
dispatch_batches=self.args.dispatch_batches, split_batches=self.args.split_batches, deepspeed_plugin=self.args.deepspeed_plugin, gradient_accumulation_plugin=gradient_accumulation_plugin, kwargs_handlers=[accelerator_kwargs]
)
# some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
self.gather_function = self.accelerator.gather_for_metrics
# deepspeed and accelerate flags covering both trainer args and accelerate launcher
self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None
self.is_fsdp_enabled = getattr(self.accelerator.state, "fsdp_plugin", None) is not None
# post accelerator creation setup
if self.is_fsdp_enabled:
fsdp_plugin = self.accelerator.state.fsdp_plugin
fsdp_plugin.limit_all_gathers = self.args.fsdp_config.get("limit_all_gathers", fsdp_plugin.limit_all_gathers)
if is_accelerate_available("0.23.0"):
fsdp_plugin.activation_checkpointing = self.args.fsdp_config.get("activation_checkpointing", fsdp_plugin.activation_checkpointing)
if fsdp_plugin.activation_checkpointing and self.args.gradient_checkpointing:
raise ValueError("The activation_checkpointing in FSDP config and the gradient_checkpointing in training arg can't be set to True simultaneously. Please use FSDP's activation_checkpointing logic when using FSDP.")
if self.is_deepspeed_enabled and getattr(self.args, "hf_deepspeed_config", None) is None:
self.propagate_args_to_deepspeed()
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
if self.train_dataset is None or not has_length(self.train_dataset):
return None
if self.args.group_by_length:
lengths = self.train_dataset.lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
)
elif self.args.group_by_modality_length:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
group_by_modality=True,
)
elif self.args.group_by_modality_length_auto:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
group_by_modality_auto=True,
)
elif self.args.group_by_varlen:
lengths = self.train_dataset.lengths
return LengthGroupedSampler(
self.args.train_batch_size * self.args.gradient_accumulation_steps,
# self.args.train_batch_size, # TODO: seems that we should have gradient_accumulation_steps
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
variable_length=True,
)
else:
return super()._get_train_sampler()
def get_train_dataloader(self) -> DataLoader:
"""
Returns the training [`~torch.utils.data.DataLoader`].
Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed
training if necessary) otherwise.
Subclass and override this method if you want to inject some custom behavior.
"""
if self.train_dataset is None:
raise ValueError("Trainer: training requires a train_dataset.")
train_dataset = self.train_dataset
data_collator = self.data_collator
if is_datasets_available() and isinstance(train_dataset, datasets.Dataset):
train_dataset = self._remove_unused_columns(train_dataset, description="training")
else:
data_collator = self._get_collator_with_removed_columns(data_collator, description="training")
dataloader_params = {
"batch_size": self._train_batch_size,
"collate_fn": data_collator,
"num_workers": self.args.dataloader_num_workers,
"pin_memory": self.args.dataloader_pin_memory,
"persistent_workers": self.args.dataloader_persistent_workers,
}
if not isinstance(train_dataset, torch.utils.data.IterableDataset):
dataloader_params["sampler"] = self._get_train_sampler()
dataloader_params["drop_last"] = self.args.dataloader_drop_last
dataloader_params["worker_init_fn"] = seed_worker
dataloader_params["prefetch_factor"] = self.args.dataloader_num_workers * 2 if self.args.dataloader_num_workers != 0 else None
dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
return dataloader
def create_optimizer(self):
"""
Setup the optimizer.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
Trainer's init through `optimizers`, or subclass and override this method in a subclass.
"""
if is_sagemaker_mp_enabled():
return super().create_optimizer()
opt_model = self.model
if self.optimizer is None:
decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS)
decay_parameters = [name for name in decay_parameters if "bias" not in name]
lr_mapper = {}
if self.args.mm_projector_lr is not None:
lr_mapper["mm_projector"] = self.args.mm_projector_lr
if self.args.mm_vision_tower_lr is not None:
lr_mapper["vision_tower"] = self.args.mm_vision_tower_lr
if len(lr_mapper) > 0:
special_lr_parameters = [name for name, _ in opt_model.named_parameters() if any(module_keyword in name for module_keyword in lr_mapper)]
optimizer_grouped_parameters = [
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n not in special_lr_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n not in special_lr_parameters and p.requires_grad)],
"weight_decay": 0.0,
},
]
for module_keyword, lr in lr_mapper.items():
module_parameters = [name for name, _ in opt_model.named_parameters() if module_keyword in name]
optimizer_grouped_parameters.extend(
[
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n in module_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
"lr": lr,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n in module_parameters and p.requires_grad)],
"weight_decay": 0.0,
"lr": lr,
},
]
)
else:
optimizer_grouped_parameters = [
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
"weight_decay": 0.0,
},
]
optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
if optimizer_cls.__name__ == "Adam8bit":
import bitsandbytes
manager = bitsandbytes.optim.GlobalOptimManager.get_instance()
skipped = 0
for module in opt_model.modules():
if isinstance(module, nn.Embedding):
skipped += sum({p.data_ptr(): p.numel() for p in module.parameters()}.values())
logger.info(f"skipped {module}: {skipped/2**20}M params")
manager.register_module_override(module, "weight", {"optim_bits": 32})
logger.debug(f"bitsandbytes: will optimize {module} in fp32")
logger.info(f"skipped: {skipped/2**20}M params")
return self.optimizer
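# A name-only sketch of how create_optimizer partitions parameters when `mm_projector_lr`
# or `mm_vision_tower_lr` is set: two default groups (with and without weight decay), plus
# two more groups per keyword in `lr_mapper` that carry their own learning rate.
# `group_param_names` and the parameter names are hypothetical, and the sketch works on
# names instead of tensors so it runs without torch.

```python
def group_param_names(names, decay_names, lr_mapper, weight_decay):
    # Parameters whose name contains any keyword get a dedicated learning rate.
    special = {n for n in names if any(k in n for k in lr_mapper)}
    groups = [
        {"params": [n for n in names if n in decay_names and n not in special],
         "weight_decay": weight_decay},
        {"params": [n for n in names if n not in decay_names and n not in special],
         "weight_decay": 0.0},
    ]
    for keyword, lr in lr_mapper.items():
        module = [n for n in names if keyword in n]
        groups.append({"params": [n for n in module if n in decay_names],
                       "weight_decay": weight_decay, "lr": lr})
        groups.append({"params": [n for n in module if n not in decay_names],
                       "weight_decay": 0.0, "lr": lr})
    return groups


names = [
    "model.layers.0.mlp.weight",
    "model.layers.0.mlp.bias",
    "model.mm_projector.0.weight",
    "model.mm_projector.0.bias",
    "model.norm.weight",
]
# Biases and normalization weights are excluded from weight decay.
decay_names = {"model.layers.0.mlp.weight", "model.mm_projector.0.weight"}
groups = group_param_names(names, decay_names, {"mm_projector": 2e-5}, weight_decay=0.1)
```

# Groups without an "lr" key fall back to the optimizer's default learning rate.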
def _save_checkpoint(self, model, trial, metrics=None):
if getattr(self.args, "tune_mm_mlp_adapter", False) or (
hasattr(self.args, "mm_tunable_parts") and (len(self.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in self.args.mm_tunable_parts or "mm_vision_resampler" in self.args.mm_tunable_parts))
):
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(self.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
if self.args.local_rank == 0 or self.args.local_rank == -1:
self.model.config.save_pretrained(output_dir)
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
else:
super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)
    def _save(self, output_dir: Optional[str] = None, state_dict=None):
        # Skip the full model save when only the MLP adapter is being tuned.
        if not getattr(self.args, "tune_mm_mlp_adapter", False):
            super(LLaVATrainer, self)._save(output_dir, state_dict)
================================================
FILE: llava-train_videochat/llava/train/llava_trainer_eval.py
================================================
import json
import subprocess
from llava.train.llava_trainer import LLaVATrainer
class LLaVAEvalTrainer(LLaVATrainer):
    def evaluate(self, evaluate_args):
        # The results path is parsed from the "Saved samples to" message, so samples must be logged.
        if not evaluate_args.log_samples:
            raise ValueError("Please log samples so that the result can be parsed")
        cmd = (
            f"accelerate launch --num_processes {evaluate_args.eval_num_processes} -m lmms_eval"
            f" --model {evaluate_args.model}"
            f" --model_args {evaluate_args.model_args}"
            f" --tasks {evaluate_args.task_names}"
            f" --batch_size {evaluate_args.batch_size}"
            f" --log_samples_suffix {evaluate_args.log_samples_suffix}"
            f" --output_path {evaluate_args.output_path}"
        )
        if evaluate_args.limit:
            cmd += f" --limit {evaluate_args.limit}"
        if evaluate_args.num_fewshot:
            cmd += f" --num_fewshot {evaluate_args.num_fewshot}"
        if evaluate_args.gen_kwargs != "":
            cmd += f" --gen_kwargs {evaluate_args.gen_kwargs}"
        cmd += " --log_samples"
        results = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # Look for the saved-samples path in stdout first, then in stderr.
        marker = "Saved samples to "
        for stream in (results.stdout, results.stderr):
            start = stream.find(marker)
            # Search for ".json" only after the marker so an earlier match is not picked up.
            end = stream.find(".json", start + len(marker)) if start != -1 else -1
            if end != -1:
                file = stream[start + len(marker):end]
                break
        else:
            raise RuntimeError("Could not find the saved samples path in the lmms_eval output")
        # results.json lives in the same directory as the samples file.
        file = file.split("/")[:-1]
        file = "/".join(file) + "/results.json"
        with open(file, "r") as f:
            lmms_eval_results = json.load(f)
        result_dict = {}
        tasks_list = evaluate_args.task_names.split(",")
        for task in tasks_list:
            task_results = lmms_eval_results["results"][task]
            for k, v in task_results.items():
                if k != "alias" and "stderr" not in k:
                    metric = k.split(",")[0]
                    result_dict[f"{task}_{metric}"] = v
        return result_dict
"""def evaluate(self, evaluate_args):
initialize_tasks()
tasks_list = evaluate_args.task_names.split(",")
result_dict = {}
results = evaluator.simple_evaluate(
model=evaluate_args.model,
model_args=evaluate_args.model_args,
tasks=tasks_list,
num_fewshot=evaluate_args.num_fewshot,
batch_size=evaluate_args.batch_size,
device=evaluate_args.device,
limit=evaluate_args.limit,
check_integrity=evaluate_args.check_integrity,
show_task_to_terminal=evaluate_args.show_task_to_terminal,
log_samples=evaluate_args.log_samples,
gen_kwargs=evaluate_args.gen_kwargs,
cli_args=evaluate_args,
)
for task in tasks_list:
task_results = results["results"][task]
for k,v in task_results.items():
if k != "alias" and "stderr" not in k:
metric = k.split(",")[0]
result_dict[f"{task}_{metric}"] = v
return result_dict"""
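# A minimal sketch of the metric flattening done at the end of evaluate(): the "alias"
# entry and any key containing "stderr" are dropped, and the text before the first comma
# of each remaining key becomes the metric name. `flatten_task_metrics`, the task name
# and the "metric,filter" key layout in the sample dictionary are assumptions for
# illustration, not output copied from lmms_eval.

```python
def flatten_task_metrics(results, task_names):
    """Map {task: {"metric,filter": value}} to {"task_metric": value}."""
    flat = {}
    for task in task_names.split(","):
        for key, value in results["results"][task].items():
            if key != "alias" and "stderr" not in key:
                metric = key.split(",")[0]
                flat[f"{task}_{metric}"] = value
    return flat


sample = {
    "results": {
        "demo_task": {"alias": "demo_task", "acc,none": 0.5, "acc_stderr,none": 0.01},
    }
}
flat = flatten_task_metrics(sample, "demo_task")
print(flat)  # {'demo_task_acc': 0.5}
```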
================================================
FILE: llava-train_videochat/llava/train/train.py
================================================
# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
# Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
# Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import ast
import os
import copy
from dataclasses import dataclass, field
import json
import logging
import pathlib
from typing import Dict, Optional, Sequence, List
from PIL import Image, ImageFile
from packaging import version
import numpy as np
import gc
import io
import time
import random
import yaml
import math
import re
import torch
import transformers
import tokenizers
import deepspeed
from transformers import AutoConfig
from torch.utils.data import Dataset
from llava.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX
from llava.train.llava_trainer import LLaVATrainer
from llava import conversation as conversation_lib
from llava.model import *
from llava.mm_utils import process_highres_image, process_anyres_image, process_anyres_image_nopad, process_highres_image_crop_split, tokenizer_image_token, process_anyres_video_nopad
from llava.utils import rank0_print
from llava.video_utils import VIDEO_READER_FUNCS
# from llava.serialize_utils import TorchShmSerializedList, get_rank, get_local_rank, local_broadcast_process_authkey
# import wandb
torch.multiprocessing.set_sharing_strategy("file_system")
ImageFile.LOAD_TRUNCATED_IMAGES = True
local_rank = None
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse("0.14")
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
model_class_name: Optional[str] = field(default=None, metadata={"help": "Used to init model class, format is XXXXForCausalLM. e.g. currently XXXX is chosen from LlavaLlama, LlavaMixtral, LlavaMistral, Llama"})
mm_tunable_parts: Optional[str] = field(
default=None, metadata={"help": 'Could be "mm_mlp_adapter", "mm_vision_resampler", "mm_vision_tower,mm_mlp_adapter,mm_language_model", "mm_mlp_adapter,mm_language_model"'}
)
# deciding which part of the multimodal model to tune, will overwrite other previous settings
version: Optional[str] = field(default="v0")
freeze_backbone: bool = field(default=False)
tune_mm_mlp_adapter: bool = field(default=False)
tune_mm_vision_resampler: bool = field(default=False)
vision_tower: Optional[str] = field(default=None)
vision_tower_pretrained: Optional[str] = field(default=None)
vision_encode_type: Optional[str] = field(default="image")
unfreeze_mm_vision_tower: bool = field(default=False)
unfreeze_language_model: bool = field(default=False)
mm_vision_select_layer: Optional[int] = field(default=-1) # default to the last layer
pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
mm_projector_type: Optional[str] = field(default="linear")
mm_use_im_start_end: bool = field(default=False)
mm_use_im_patch_token: bool = field(default=True)
mm_patch_merge_type: Optional[str] = field(default="flat")
mm_vision_select_feature: Optional[str] = field(default="patch")
mm_resampler_type: Optional[str] = field(default=None)
mm_mask_drop_mode: str = field(default="fixed")
mm_mask_drop_skip_percentage: float = field(default=0.0)
mm_mask_drop_ratio: float = field(default=0.25)
mm_mask_drop_ratio_upper: Optional[float] = field(default=None)
mm_mask_drop_ratio_lower: Optional[float] = field(default=None)
mm_spatial_pool_stride: Optional[int] = field(default=None)
mm_spatial_pool_mode: str = field(default="bilinear")
mm_spatial_pool_out_channels: Optional[int] = field(default=None)
mm_num_compress_latents: Optional[int] = field(default=128)
mm_num_compress_query_type: Optional[str] = field(default='learnable')
mm_pos_num_frames: Optional[int] = field(default=8)
mm_close_init: Optional[bool] = field(default=False)
min_slow_num_frames: Optional[int] = field(default=4)
mm_perceiver_depth: Optional[int] = field(default=3)
mm_perceiver_latents: Optional[int] = field(default=32)
mm_perceiver_ff_mult: Optional[float] = field(default=4)
mm_perceiver_pretrained: Optional[str] = field(default=None)
mm_qformer_depth: Optional[int] = field(default=3)
mm_qformer_latents: Optional[int] = field(default=32)
mm_qformer_pretrained: Optional[str] = field(default=None)
rope_scaling_factor: Optional[float] = field(default=None)
rope_scaling_type: Optional[str] = field(default=None)
s2: Optional[bool] = field(default=False)
s2_scales: Optional[str] = field(default="336,672,1008")
use_pos_skipping: Optional[bool] = field(default=False)
pos_skipping_range: Optional[int] = field(default=4096)
mm_newline_position: Optional[str] = field(default="one_token") # for frame separation
mm_local_num_frames: Optional[int] = field(default=-1) # controls whether the video encoder and projector process the temporal sequence in segments
mm_llm_compress: Optional[bool] = field(default=False)
llm_compress_type: Optional[str] = field(default="attention")
llm_compress_layer_list: Optional[str] = field(default="8,16,24")
llm_image_token_ratio_list: Optional[str] = field(default="1.0,0.5,0.25,0.125")
# When adding new model arguments, remember to register them in overwrite_config below.
@dataclass
class DataArguments:
data_path: str = field(default=None, metadata={"help": "Path to the training data, in llava's instruction.json format. Supporting multiple json files via /path/to/{a,b,c}.json"})
lazy_preprocess: bool = False
is_multimodal: bool = False
early_mix_text: bool = False
# image_folder: Optional[str] = field(default=None)
image_aspect_ratio: str = "square"
image_grid_pinpoints: Optional[str] = field(default=None)
image_crop_resolution: Optional[int] = field(default=None) # seems to be unused
image_split_resolution: Optional[int] = field(default=None) # seems to be unused
frame_aspect_ratio: str = "square"
frame_grid_pinpoints: Optional[str] = field(default=None)
max_num_pixels: int = 14745600000 # 384*384*100000
# video_folder: Optional[str] = field(default=None)
# video_fps: Optional[int] = field(default=1)
frames_upbound: Optional[int] = field(default=8)
frames_lowbound: Optional[int] = field(default=1) # note: if the video does not have this many frames, the count can still fall below lowbound
time_msg: Optional[str] = field(default=None)
local_num_frames: Optional[int] = field(default=8)
sample_type: Optional[str] = field(default='middle')
@dataclass
class TrainingArguments(transformers.TrainingArguments):
cache_dir: Optional[str] = field(default=None)
optim: str = field(default="adamw_torch")
remove_unused_columns: bool = field(default=False)
freeze_mm_mlp_adapter: bool = field(default=False)
freeze_mm_vision_resampler: bool = field(default=False)
mpt_attn_impl: Optional[str] = field(default="triton")
model_max_length: int = field(
default=4096,
metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
)
double_quant: bool = field(default=True, metadata={"help": "Compress the quantization statistics through double quantization."})
quant_type: str = field(default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."})
bits: int = field(default=16, metadata={"help": "How many bits to use."})
lora_enable: bool = False
lora_r: int = 64
lora_alpha: int = 16
lora_dropout: float = 0.05
lora_weight_path: str = ""
lora_bias: str = "none"
mm_projector_lr: Optional[float] = None
mm_vision_tower_lr: Optional[float] = None
group_by_varlen: bool = field(default=False)
group_by_modality_length: bool = field(default=False)
group_by_modality_length_auto: bool = field(default=False)
auto_find_batch_size: bool = field(default=False)
gradient_checkpointing: bool = field(default=True)
verbose_logging: bool = field(default=True)
attn_implementation: str = field(default="flash_attention_2", metadata={"help": "Use transformers attention implementation."})
# @dataclass
# class EvaluationArguments:
# eval_num_processes: int = field(default=1)
# task_names: str = field(default=None)
# model: str = field(default="llava")
# model_args: Optional[str] = field(default=None)
# num_fewshot: Optional[int] = field(default=None)
# batch_size: int = field(default=1)
# device: Optional[str] = field(default=None)
# limit: Optional[int] = field(default=None)
# check_integrity: Optional[bool] = field(default=False)
# show_task_to_terminal: Optional[bool] = field(default=False)
# log_samples: Optional[bool] = field(default=True)
# gen_kwargs: Optional[str] = field(default="")
# log_samples_suffix: Optional[str] = field(default="")
# output_path: Optional[str] = field(default="./logs/")
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
logging.warning(f"{name}: param.ds_status != ZeroParamStatus.NOT_AVAILABLE: {param.ds_status}")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
if bias == "none":
to_return = {k: t for k, t in named_params if "lora_" in k}
elif bias == "all":
to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
elif bias == "lora_only":
to_return = {}
maybe_lora_bias = {}
lora_bias_names = set()
for k, t in named_params:
if "lora_" in k:
to_return[k] = t
bias_name = k.split("lora_")[0] + "bias"
lora_bias_names.add(bias_name)
elif "bias" in k:
maybe_lora_bias[k] = t
# Iterate over (name, tensor) pairs and keep only biases that belong to a LoRA-wrapped module
for k, t in maybe_lora_bias.items():
if k in lora_bias_names:
to_return[k] = t
else:
raise NotImplementedError
to_return = {k: maybe_zero_3(v, ignore_status=True) for k, v in to_return.items()}
return to_return
def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
to_return = {k: t for k, t in named_params if "lora_" not in k}
if require_grad_only:
to_return = {k: t for k, t in to_return.items() if t.requires_grad}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def find_all_linear_names(model):
cls = torch.nn.Linear
lora_module_names = set()
multimodal_keywords = ["mm_projector", "vision_tower", "vision_resampler"]
for name, module in model.named_modules():
if any(mm_keyword in name for mm_keyword in multimodal_keywords):
continue
if isinstance(module, cls):
names = name.split(".")
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if "lm_head" in lora_module_names: # needed for 16-bit
lora_module_names.remove("lm_head")
return list(lora_module_names)
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
"""Collects the state dict and dump to disk."""
if hasattr(trainer.args, "tune_mm_mlp_adapter") and trainer.args.tune_mm_mlp_adapter:
check_only_save_mm_adapter_tunnable = True
# only has mm_mlp_adapter and mm_vision_resampler in the tuneable parts
elif hasattr(trainer.args, "mm_tunable_parts") and (len(trainer.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in trainer.args.mm_tunable_parts or "mm_vision_resampler" in trainer.args.mm_tunable_parts)):
check_only_save_mm_adapter_tunnable = True
else:
check_only_save_mm_adapter_tunnable = False
trainer.accelerator.wait_for_everyone()
torch.cuda.synchronize()
rank0_print(f"Only save projectors: {check_only_save_mm_adapter_tunnable}")
if check_only_save_mm_adapter_tunnable:
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(trainer.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
trainer.model.config.save_pretrained(output_dir)
current_folder = output_dir.split("/")[-1]
parent_folder = os.path.dirname(output_dir)
if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
if current_folder.startswith("checkpoint-"):
mm_projector_folder = os.path.join(parent_folder, "mm_projector")
os.makedirs(mm_projector_folder, exist_ok=True)
torch.save(weight_to_save, os.path.join(mm_projector_folder, f"{current_folder}.bin"))
else:
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
return
if trainer.deepspeed:
trainer.save_model(output_dir)
return
state_dict = trainer.model.state_dict()
if trainer.args.should_save:
cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
del state_dict
trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
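# A minimal sketch of where the adapter-only branch above writes its weights.
# `projector_save_path` is a hypothetical helper introduced only for illustration;
# it mirrors the path logic of safe_save_model_for_hf_trainer and does no I/O.

```python
import os


def projector_save_path(output_dir: str) -> str:
    """Return the file the adapter-only branch would write (hypothetical helper)."""
    current_folder = output_dir.split("/")[-1]
    parent_folder = os.path.dirname(output_dir)
    if current_folder.startswith("checkpoint-"):
        # Intermediate checkpoints are collected in a sibling "mm_projector" folder.
        return os.path.join(parent_folder, "mm_projector", f"{current_folder}.bin")
    # A final save goes directly into the output directory.
    return os.path.join(output_dir, "mm_projector.bin")


print(projector_save_path("runs/exp/checkpoint-500"))
print(projector_save_path("runs/exp"))
```

# Note that an output_dir with a trailing slash yields an empty current_folder,
# so it is treated as a final save rather than a checkpoint.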
def smart_tokenizer_and_embedding_resize(
special_tokens_dict: Dict,
tokenizer: transformers.PreTrainedTokenizer,
model: transformers.PreTrainedModel,
):
"""Resize tokenizer and embedding.
Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
"""
num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
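# The mean-initialization step in smart_tokenizer_and_embedding_resize can be
# sketched without torch. This is an illustration on plain lists, not the
# training code: `mean_init_new_rows` is a hypothetical name, and each inner
# list stands in for one embedding row.

```python
def mean_init_new_rows(rows, num_new):
    """Overwrite the last `num_new` rows with the column-wise mean of the earlier rows."""
    if num_new <= 0:
        return rows
    old = rows[:-num_new]
    dim = len(old[0])
    # Column-wise average over the pre-existing rows only.
    avg = [sum(r[j] for r in old) / len(old) for j in range(dim)]
    for i in range(len(rows) - num_new, len(rows)):
        rows[i] = list(avg)
    return rows


table = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]  # last row is the newly added token
mean_init_new_rows(table, 1)
print(table[-1])
```

# Starting new tokens at the average of the existing vocabulary keeps them in
# the same region of embedding space as trained tokens instead of at a random
# or zero point.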
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
"""Tokenize a list of strings."""
tokenized_list = [
tokenizer(
text,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
)
for text in strings
]
input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
input_ids_lens = labels_lens = [tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list]
return dict(
input_ids=input_ids,
labels=labels,
input_ids_lens=input_ids_lens,
labels_lens=labels_lens,
)
def _mask_targets(target, tokenized_lens, speakers):
# cur_idx = 0
cur_idx = tokenized_lens[0]
tokenized_lens = tokenized_lens[1:]
target[:cur_idx] = IGNORE_INDEX
for tokenized_len, speaker in zip(tokenized_lens, speakers):
if speaker == "human":
target[cur_idx + 2 : cur_idx + tokenized_len] = IGNORE_INDEX
cur_idx += tokenized_len
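# A small worked example of the masking rule in _mask_targets, using plain lists
# instead of tensors. IGNORE_INDEX = -100 is assumed here (the real value is
# imported from llava.constants), and `mask_targets` is a stand-in name.

```python
IGNORE_INDEX = -100  # assumed value for illustration


def mask_targets(target, tokenized_lens, speakers):
    """Mask the header and every human turn, keeping the first two tokens of each human turn."""
    cur_idx = tokenized_lens[0]
    target[:cur_idx] = [IGNORE_INDEX] * cur_idx  # header is never a training target
    for length, speaker in zip(tokenized_lens[1:], speakers):
        if speaker == "human":
            # The first two tokens of the turn stay visible, the rest is masked.
            for i in range(cur_idx + 2, cur_idx + length):
                target[i] = IGNORE_INDEX
        cur_idx += length
    return target


# Header of 2 tokens, then a 4-token human turn and a 4-token gpt turn.
labels = mask_targets(list(range(10)), [2, 4, 4], ["human", "gpt"])
print(labels)  # [-100, -100, 2, 3, -100, -100, 6, 7, 8, 9]
```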
def _add_speaker_and_signal(header, source, get_conversation=True):
"""Add speaker and start/end signal on each round."""
BEGIN_SIGNAL = "### "
END_SIGNAL = "\n"
conversation = header
for sentence in source:
from_str = sentence["from"]
if from_str.lower() == "human":
from_str = conversation_lib.default_conversation.roles[0]
elif from_str.lower() == "gpt":
from_str = conversation_lib.default_conversation.roles[1]
else:
from_str = "unknown"
sentence["value"] = BEGIN_SIGNAL + from_str + ": " + sentence["value"] + END_SIGNAL
if get_conversation:
conversation += sentence["value"]
conversation += BEGIN_SIGNAL
return conversation
def preprocess_multimodal(sources: Sequence[str], data_args: DataArguments, msg="") -> Dict:
is_multimodal = data_args.is_multimodal
if not is_multimodal:
return sources
for source in sources:
for sentence in source:
# TODO maybe this should be changed for interleaved data?
# if DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
# only check for num_im=1
num_im = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
if num_im == 1 and DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "").strip()
sentence["value"] = DEFAULT_IMAGE_TOKEN + "\n" + sentence["value"]
sentence["value"] = sentence["value"].strip()
if "mmtag" in conversation_lib.default_conversation.version:
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "<Image>" + DEFAULT_IMAGE_TOKEN + "</Image>")
replace_token = DEFAULT_IMAGE_TOKEN
if data_args.mm_use_im_start_end:
replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
if msg.rstrip() != "":
replace_token = replace_token + msg.rstrip() + " " # NOTE for time msg of video
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
# For videoInstruct-100k noisy_data. TODO: Ask Yuanhan to clean the data instead of leaving the noise code here.
sentence["value"] = sentence["value"].replace("QA_GT_caption_based_noisy", "")
return sources
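# The single-image normalization in preprocess_multimodal can be sketched on a
# plain string. DEFAULT_IMAGE_TOKEN is assumed to be "<image>" here (the real
# constant lives in llava.constants), and `move_image_token_to_front` is a
# hypothetical helper covering only that one step.

```python
DEFAULT_IMAGE_TOKEN = "<image>"  # assumed value for illustration


def move_image_token_to_front(text: str) -> str:
    """If exactly one image token appears and it is not leading, move it to the front."""
    if text.count(DEFAULT_IMAGE_TOKEN) == 1 and not text.startswith(DEFAULT_IMAGE_TOKEN):
        text = text.replace(DEFAULT_IMAGE_TOKEN, "").strip()
        text = DEFAULT_IMAGE_TOKEN + "\n" + text
        text = text.strip()
    return text


print(repr(move_image_token_to_front("What is shown?\n<image>")))
print(repr(move_image_token_to_front("a <image> b <image>")))
```

# Sentences with several image tokens are left untouched, which matches the
# "only check for num_im=1" comment in the function above.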
def preprocess_llama_2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2
# Mask targets
sep = "[/INST] "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_gemma(sources: List[List[Dict[str, str]]], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv: conversation_lib.Conversation = conversation_lib.default_conversation.copy()
roles: Dict[str, str] = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations: List[str] = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source: List[Dict[str, str]] = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role: str = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids: torch.Tensor = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids: torch.Tensor = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets: torch.Tensor = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.GEMMA
# Mask target
sep: str = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len: int = int(target.ne(tokenizer.pad_token_id).sum())
rounds: List[str] = conversation.split(conv.sep)
re_rounds = []
for conv_idx in range(0, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2]))
cur_len = 1 # Ignore
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep # Re-append sep because split on this
# Now "".join(parts)==rou
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer)) - 1 # Ignore
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1 # Ignore
else:
round_len = len(tokenizer(rou).input_ids) - 1 # Ignore
instruction_len = len(tokenizer(parts[0]).input_ids) - 1 # Ignore
round_len += 2 # sep: \n takes 2 tokens
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"warning: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
# roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
# Use a deepcopy of the tokenizer so that the caller's tokenizer is not modified
tokenizer = copy.deepcopy(tokenizer)
# Only when there is actually an image do we add the image token as a special token
if has_image:
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
im_start, im_end = tokenizer.additional_special_tokens_ids[0:2] # for qwen2_5
# unmask_tokens = ["\n", "<|im_start|>", "<|im_end|>"]
unmask_tokens_idx = [198, im_start, im_end] # 198 is a hard-coded id, presumably "\n" in the Qwen vocabulary
nl_tokens = tokenizer("\n").input_ids
# Reset Qwen chat templates so that it won't include system message every time we apply
chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
tokenizer.chat_template = chat_template
# _system = tokenizer("system").input_ids + nl_tokens
# _user = tokenizer("user").input_ids + nl_tokens
# _assistant = tokenizer("assistant").input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
encode_id = tokenizer.apply_chat_template(conv)
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
del tokenizer
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
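# A minimal sketch of the label construction used by preprocess_qwen: user and
# system turns are masked, assistant turns are copied, and a fixed set of
# structural token ids is then unmasked everywhere. `build_labels` and
# `toy_encode` are hypothetical stand-ins; the toy encoder replaces
# tokenizer.apply_chat_template and IGNORE_INDEX = -100 is assumed.

```python
IGNORE_INDEX = -100  # assumed value for illustration


def toy_encode(role, content):
    """Toy encoder: 1 marks turn start, 2 marks turn end, characters map to their code points."""
    return [1] + [ord(c) for c in content] + [2]


def build_labels(turns, encode, unmask_ids):
    input_ids, labels = [], []
    for role, content in turns:
        ids = encode(role, content)
        input_ids += ids
        if role in ("user", "system"):
            labels += [IGNORE_INDEX] * len(ids)  # never train on prompts
        else:
            labels += ids  # train on assistant tokens
    # Structural tokens are supervised even inside masked turns.
    for i, tok in enumerate(input_ids):
        if tok in unmask_ids:
            labels[i] = tok
    return input_ids, labels


ids, labels = build_labels([("user", "hi"), ("assistant", "ok")], toy_encode, {1, 2})
print(ids)     # [1, 104, 105, 2, 1, 111, 107, 2]
print(labels)  # [1, -100, -100, 2, 1, 111, 107, 2]
```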
def preprocess_internlm2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
# roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
# Use a deepcopy of the tokenizer so that the caller's tokenizer is not modified
tokenizer = copy.deepcopy(tokenizer)
# Only when there is actually an image do we add the image token as a special token
if has_image:
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
unmask_tokens = ["<|im_start|>", "<|im_end|>", "<|action_start|>", "<|action_end|>", "<|interpreter|>", "<|plugin|>"]
unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]
# nl_tokens = tokenizer("\n").input_ids
# chat_template = "{{ bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
# tokenizer.chat_template = chat_template
# _system = tokenizer("system").input_ids + nl_tokens
# _user = tokenizer("user").input_ids + nl_tokens
# _assistant = tokenizer("assistant").input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
encode_id = tokenizer.apply_chat_template(conv)
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
del tokenizer
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
def preprocess_llama3(
sources,
tokenizer: transformers.PreTrainedTokenizer,
has_image: bool = False,
max_len=2048,
system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
# roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
# Use a deepcopy of the tokenizer so that the caller's tokenizer is not modified
tokenizer = copy.deepcopy(tokenizer)
# Only when there is actually an image do we add the image token as a special token
if has_image:
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>")
start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"]
unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]
# After update, calling tokenizer of llama3 will
# auto add bos id for the tokens. ヽ(`⌒´)ノ
def safe_tokenizer_llama3(text):
input_ids = tokenizer(text).input_ids
if input_ids[0] == bos_token_id:
input_ids = input_ids[1:]
return input_ids
nl_tokens = tokenizer.convert_tokens_to_ids("\n\n")
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
# First is bos token we don't need here
encode_id = tokenizer.apply_chat_template(conv)[1:]
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
def preprocess_v1(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.TWO
# Mask targets
sep = conv.sep + conv.roles[1] + ": "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
round_len -= 1
instruction_len -= 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_mpt(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.MPT
# Mask targets
sep = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep)
re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt
for conv_idx in range(3, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2])) # user + gpt
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 1
if i != 0 and getattr(tokenizer, "legacy", False) and IS_TOKENIZER_GREATER_THAN_0_14:
round_len += 1
instruction_len += 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f"(#turns={len(re_rounds)} ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_plain(
sources: Sequence[str],
tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
# add end signal and concatenate together
conversations = []
for source in sources:
assert len(source) == 2
assert DEFAULT_IMAGE_TOKEN in source[0]["value"]
source[0]["value"] = DEFAULT_IMAGE_TOKEN
conversation = source[0]["value"] + source[1]["value"] + conversation_lib.default_conversation.sep
conversations.append(conversation)
# tokenize conversations
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
tokenized_len = len(tokenizer_image_token(source[0]["value"], tokenizer))
target[:tokenized_len] = IGNORE_INDEX
return dict(input_ids=input_ids, labels=targets)
def preprocess(sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
"""
Given a list of sources, each of which is a conversation list, this transform:
1. Adds the signal '### ' at the beginning of each sentence and the end signal '\n';
2. Concatenates the conversations together;
3. Tokenizes the concatenated conversation;
4. Makes a deepcopy as the target and masks human words with IGNORE_INDEX.
"""
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
return preprocess_plain(sources, tokenizer)
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
return preprocess_llama_2(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version.startswith("v1"):
return preprocess_v1(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "mpt":
return preprocess_mpt(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "qwen":
return preprocess_qwen(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "internlm_2":
return preprocess_internlm2(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "gemma":
return preprocess_gemma(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "llama_v3":
return preprocess_llama3(sources, tokenizer, has_image=has_image)
# add end signal and concatenate together
conversations = []
for source in sources:
header = f"{conversation_lib.default_conversation.system}\n\n"
conversation = _add_speaker_and_signal(header, source)
conversations.append(conversation)
# tokenize conversations
def get_tokenize_len(prompts):
return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts]
if has_image:
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
else:
conversations_tokenized = _tokenize_fn(conversations, tokenizer)
input_ids = conversations_tokenized["input_ids"]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
if has_image:
tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source])
else:
tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"]
speakers = [sentence["from"] for sentence in source]
_mask_targets(target, tokenized_lens, speakers)
return dict(input_ids=input_ids, labels=targets)
class LazySupervisedDataset(Dataset):
def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer, data_args: DataArguments):
super(LazySupervisedDataset, self).__init__()
self.tokenizer = tokenizer
self.num_video_tokens = max(8, data_args.frames_upbound) * 128 // 8 # rough estimate of the number of video tokens
try:
from petrel_client.client import Client
has_client = True
except ImportError:
has_client = False
if has_client:
self.client = Client('~/petreloss.conf')
else:
self.client = None
# if get_local_rank() == 0:
list_data_dict = []
# Handle multiple JSON files specified in the data_path
if "{" in data_path and "}" in data_path:
raise NotImplementedError("Please use .yaml!!!")
base_path, file_pattern = re.match(r"^(.*)\{(.*)\}\.json$", data_path).groups()
file_names = file_pattern.split(",")
rank0_print(f"Loading {file_names} from {base_path}")
data_args.dataset_paths = []
for file_name in file_names:
data_args.dataset_paths.append(f"{base_path}{file_name}.json")
full_path = f"{base_path}{file_name}.json"
rank0_print(f"Loading {full_path}")
with open(full_path, "r") as file:
cur_data_dict = json.load(file)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {full_path}")
list_data_dict.extend(cur_data_dict)
elif data_path.endswith(".yaml"):
with open(data_path, "r") as file:
yaml_data = yaml.safe_load(file)
datasets = yaml_data.get("datasets")
# file should be in the format of:
# datasets:
# - json_path: xxxx1.json
# sampling_strategy: first:1000
# - json_path: xxxx2.json
# sampling_strategy: end:3000
# - json_path: xxxx3.json
# sampling_strategy: random:999
# data_args.dataset_paths = [dataset.get("json_path") for dataset in datasets] # NOTE
for dataset in datasets:
json_path = dataset.get("json_path")
sampling_strategy = dataset.get("sampling_strategy", "all")
sampling_number = None
rank0_print(f"Loading {json_path} with {sampling_strategy} sampling strategy")
if json_path.endswith(".jsonl"):
cur_data_dict = []
if "s3://" in json_path:
with io.BytesIO(self.client.get(json_path)) as json_file:
for line in json_file:
cur_data_dict.append(json.loads(line.strip()))
else:
with open(json_path, "r") as json_file:
for line in json_file:
cur_data_dict.append(json.loads(line.strip()))
elif json_path.endswith(".json"):
if "s3://" in json_path:
with io.BytesIO(self.client.get(json_path)) as json_file:
cur_data_dict = json.load(json_file)
else:
with open(json_path, "r") as json_file:
cur_data_dict = json.load(json_file)
else:
raise ValueError(f"Unsupported file type: {json_path}")
assert len(cur_data_dict) > 0, cur_data_dict
media_type = dataset.get("media_type", None)
if media_type is None:
if 'image' in cur_data_dict[0].keys(): # NOTE: only the first sample is inspected, so mixed-media data may be misclassified
media_type = 'image'
elif 'video' in cur_data_dict[0].keys():
media_type = 'video'
else:
media_type = 'text'
if ":" in sampling_strategy:
sampling_strategy, sampling_number = sampling_strategy.split(":")
if "%" in sampling_number:
sampling_number = math.ceil(int(sampling_number.split("%")[0]) * len(cur_data_dict) / 100)
else:
sampling_number = int(sampling_number)
# Apply the sampling strategy
if sampling_strategy == "first" and sampling_number is not None:
cur_data_dict = cur_data_dict[:sampling_number]
rank0_print(f"sampling_strategy={sampling_strategy}, {0}:{sampling_number}")
elif sampling_strategy == "first2" and sampling_number is not None:
cur_data_dict = cur_data_dict[sampling_number:sampling_number*2]
rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number}:{sampling_number*2}")
elif sampling_strategy == "first3" and sampling_number is not None:
cur_data_dict = cur_data_dict[sampling_number*2:sampling_number*3]
rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number*2}:{sampling_number*3}")
elif sampling_strategy == "first4" and sampling_number is not None:
cur_data_dict = cur_data_dict[sampling_number*3:sampling_number*4]
rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number*3}:{sampling_number*4}")
elif sampling_strategy == "end" and sampling_number is not None:
cur_data_dict = cur_data_dict[-sampling_number:]
rank0_print(f"sampling_strategy={sampling_strategy}, {-sampling_number}:-")
elif sampling_strategy == "random" and sampling_number is not None:
raise NotImplementedError("Don't use random")
random.shuffle(cur_data_dict)
cur_data_dict = cur_data_dict[:sampling_number]
video_read_type = dataset.get("video_read_type", None)
data_root = dataset.get("data_root", "")
# try:
# post-process meta info
if media_type not in ['text', 'mix']:
def check_pnorm2(ori_path): # TODO ugly code, remove it after clean anno file
if ori_path.startswith("pnorm2:s3://") or ori_path.startswith("p2:s3://") or ori_path.startswith("pssd:s3://"):
old_bucket_name = ori_path.split('://')[1].split('/')[0]
data_prefix = ori_path.split('://')[0]
data_path = '/'.join(ori_path.split('://')[1].split('/')[1:])
# new_bucket_name = old_bucket_name.replace('-', '_').lower()
new_bucket_name = old_bucket_name.lower()
return data_prefix + '://' + new_bucket_name + '/' + data_path
else:
return ori_path
for i in range(len(cur_data_dict)):
if video_read_type is not None:
cur_data_dict[i]['video_read_type'] = video_read_type
if type(cur_data_dict[i][media_type]) is list:
new_data_path = []
for old_data_path in cur_data_dict[i][media_type]:
new_data_path.append(os.path.join(data_root, old_data_path))
# new_data_path.append(check_pnorm2(os.path.join(data_root, old_data_path)))
cur_data_dict[i][media_type] = new_data_path
else:
cur_data_dict[i][media_type] = os.path.join(data_root, cur_data_dict[i][media_type])
# cur_data_dict[i][media_type] = check_pnorm2(os.path.join(data_root, cur_data_dict[i][media_type]))
rank0_print(f"Check samples from {json_path}, media_type={media_type}, video_read_type={video_read_type}, data_root={data_root}")
if media_type not in ['text', 'mix'] and video_read_type != 'fake':
ok = False
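for i in range(min(3, len(cur_data_dict))): # probe at most the first three samples; avoids IndexError on tiny datasets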
data_path = cur_data_dict[i][media_type]
if type(data_path) is list:
data_path = data_path[0]
rank0_print(f"Checking: {data_path}")
if 's3://' in data_path:
if media_type == 'video' and video_read_type in ['img', 'frame']:
for path in self.client.list(data_path):
ok = True
break
else:
tmp_data = self.client.get(data_path)
if tmp_data is not None and len(tmp_data) > 0:
ok = True
else:
if os.path.exists(data_path):
ok = True
if ok:
break
assert ok, f"Data in {data_path} can't be read!"
rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}, media_type={media_type}, video_read_type={video_read_type}, data_root={data_root}")
# except Exception as e:
# rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}, data_root={data_root}, something maybe wrong {e}!!!")
list_data_dict.extend(cur_data_dict)
else:
raise NotImplementedError("Please use .yaml!!!")
data_args.dataset_paths = [data_path]
rank0_print(f"Loading {data_path}")
with open(data_path, "r") as file:
cur_data_dict = json.load(file)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {data_path}")
list_data_dict.extend(cur_data_dict)
# else:
# list_data_dict = []
self.list_data_dict = list_data_dict
# self.list_data_dict = TorchShmSerializedList(list_data_dict)
rank0_print(f"Loaded {len(self.list_data_dict)} samples from {data_path}")
rank0_print("Formatting inputs...Skip in lazy mode")
self.tokenizer = tokenizer
self.data_args = data_args
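# Illustrative sketch (not part of the training pipeline): how a
# `sampling_strategy` string such as "first:1000" or "end:10%" from the YAML
# file selects a slice of one dataset. The helper name `_sketch_apply_sampling`
# is hypothetical; it mirrors only the "first" and "end" branches of the
# loader above and, as an assumption, returns the data unchanged for any
# other strategy name.
import math


def _sketch_apply_sampling(samples, sampling_strategy="all"):
    if ":" not in sampling_strategy:
        return samples  # e.g. "all": keep every sample
    name, number = sampling_strategy.split(":")
    if "%" in number:
        # Percentages are rounded up to a whole number of samples.
        count = math.ceil(int(number.split("%")[0]) * len(samples) / 100)
    else:
        count = int(number)
    if name == "first":
        return samples[:count]
    if name == "end":
        return samples[-count:]
    return samples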
def __len__(self):
return len(self.list_data_dict)
@property
def lengths(self):
length_list = []
for sample in self.list_data_dict:
if "image" in sample:
img_tokens = 128
elif "video" in sample:
img_tokens = self.num_video_tokens
else:
img_tokens = 0
length_list.append(sum(len(conv["value"].split()) for conv in sample["conversations"]) + img_tokens)
return length_list
@property
def modality_lengths(self):
length_list = []
for sample in self.list_data_dict:
cur_len = sum(len(conv["value"].split()) for conv in sample["conversations"])
assert cur_len > 0, f"Conversation length is 0 for {sample}"
if "image" in sample or "video" in sample or self.data_args.early_mix_text:
length_list.append(cur_len)
else:
length_list.append(-cur_len)
return length_list
def process_image(self, image_file, overwrite_image_aspect_ratio=None):
# image_folder = self.data_args.image_folder
# start_time = time.time()
processor = self.data_args.image_processor
# print(f"\n\nInspecting the image path, image_file={image_file}")
try:
if 's3://' in image_file:
value = self.client.Get(image_file)
img_bytes = np.frombuffer(value, dtype=np.uint8)
with io.BytesIO(img_bytes) as buff:
image = Image.open(buff).convert('RGB')
else:
image = Image.open(image_file).convert('RGB') # PIL Image
except Exception as exn:
print(f"Failed to open image {image_file}. Exception:", exn)
raise exn
image_size = image.size
image_aspect_ratio = self.data_args.image_aspect_ratio
if overwrite_image_aspect_ratio is not None:
image_aspect_ratio = overwrite_image_aspect_ratio
if image_aspect_ratio == "highres":
raise NotImplementedError
image = process_highres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
# elif image_aspect_ratio == "anyres" or "anyres_max" in image_aspect_ratio:
elif "anyres" in image_aspect_ratio:
if 'nopad' in image_aspect_ratio:
image = process_anyres_image_nopad(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
else:
raise NotImplementedError
image = process_anyres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
elif image_aspect_ratio == "crop_split":
raise NotImplementedError
image = process_highres_image_crop_split(image, self.data_args)
elif image_aspect_ratio == "pad":
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
else:
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
# end_time = time.time()
# print(image_file, end_time - start_time)
# print(f"OK, image_file={image_file}\n\n")
return image, image_size, "image"
def process_video(self, video_file, data_anno, data_args):
# print(f"\n\nInspecting the video path, video_file={video_file}\n\n", flush=True)
# logging.info(f"\n\nInspecting the video path, video_file={video_file}\n\n")
# start_time = time.time()
local_num_frames = data_args.local_num_frames
max_num_frames = data_args.frames_upbound
min_num_frames = data_args.frames_lowbound
sample_type = data_args.sample_type
video_reader_type = data_anno.get("video_read_type", "decord")
if "start" in data_anno and "end" in data_anno:
clip = [float(data_anno["start"]), float(data_anno["end"])]
else:
clip = None
if clip is None or video_reader_type == "img":
video_reader = VIDEO_READER_FUNCS[video_reader_type]
frames, frame_indices, fps, duration = video_reader(
video_file, max_num_frames, sample_type,
min_num_frames=min_num_frames,
max_num_frames=max_num_frames, client=self.client, clip=clip,
local_num_frames=local_num_frames
)
# if sample_type in ['rand', 'middle'] and len(frames) < local_num_frames and len(frames) != max_num_frames:
# raise ValueError(f"{video_file} only have {len(frames)} frames!!!")
# logger.info(f"{data_path} is OK!!!!")
else:
video_reader = VIDEO_READER_FUNCS['lazy']
start, end = clip
duration = end - start
if min_num_frames > duration:
min_num_frames = (duration // local_num_frames) * local_num_frames
if sample_type == 'dynamic_fps1':
num_segments = int(duration // local_num_frames)
if num_segments == 0:
num_frames = local_num_frames
else:
num_frames = local_num_frames * num_segments
num_frames = min(num_frames, max_num_frames)
num_frames = max(num_frames, min_num_frames)
else:
num_frames = max_num_frames
frames, frame_indices, fps = video_reader(video_file, num_frames=num_frames, video_start=start, video_end=end, client=self.client)
# logger.info(f"{data_path} is OK, duration={end-start} num_frames={num_frames}!!!!")
if sample_type == 'dynamic_fps1' and len(frames) % local_num_frames != 0:
raise ValueError(f"min_num_frames={min_num_frames}, max_num_frames={max_num_frames}, local_num_frames={local_num_frames}, len(frames)={len(frames)}, is wrong!!!")
sec = [str(round(f / fps, 1)) for f in frame_indices]
if data_args.time_msg is not None and sec is not None:
if data_args.time_msg == 'short':
msg = f"\nThe video lasts for {duration:.2f} seconds, and {len(sec)} frames are uniformly sampled from it. "
else:
# " " should be added at the start and end
msg = f"\nThe video lasts for {duration:.2f} seconds, and {len(sec)} frames are uniformly sampled at {', '.join(sec)} seconds. "
else:
msg = ""
# logging.info(f"OK, video_file={video_file}\n\n")
# print(f"OK, video_file={video_file}\n\n", flush=True)
# end_time = time.time()
# print(video_file, end_time - start_time)
return frames, msg
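# Illustrative sketch (not part of the training pipeline): the frame-count
# rule that process_video applies to clipped videos when
# sample_type == 'dynamic_fps1'. Roughly one frame per second is requested,
# rounded down to a multiple of `local_num_frames` and clamped to the
# configured bounds. The helper name is hypothetical.
def _sketch_dynamic_fps1_num_frames(duration, local_num_frames, min_num_frames, max_num_frames):
    if min_num_frames > duration:
        # Clips shorter than the lower bound relax that bound.
        min_num_frames = (duration // local_num_frames) * local_num_frames
    num_segments = int(duration // local_num_frames)
    # At least one full segment is always requested.
    num_frames = local_num_frames * max(num_segments, 1)
    num_frames = min(num_frames, max_num_frames)
    num_frames = max(num_frames, min_num_frames)
    return int(num_frames)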
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
# TODO: define number of retries somewhere else
num_base_retries = 2
num_final_retries = 300
# try the current sample first
for attempt_idx in range(num_base_retries):
try:
sample = self._get_item(i)
return sample
except Exception as e:
# sleep 1s in case it is a cloud disk issue
print(f"[Try #{attempt_idx}] Failed to fetch sample {i}. Exception:", e)
if attempt_idx != (num_base_retries -1):
time.sleep(1)
retry_step = 5
# try other samples, in case it is file corruption issue
for attempt_idx in range(num_base_retries+3):
try:
next_index = min(i + retry_step, len(self.list_data_dict) - 1)
# sample_idx = random.choice(range(len(self)))
sample = self._get_item(next_index)
return sample
except Exception as e:
# no need to sleep
print(f"[Try other #{attempt_idx}] Failed to fetch sample {next_index}. Exception:", e)
retry_step *= 2
try:
sample = self._get_item(i)
return sample
except Exception as e:
raise e
def _get_item(self, i) -> Dict[str, torch.Tensor]:
sources = self.list_data_dict[i]
if isinstance(i, int):
sources = [sources]
else:
raise NotImplementedError(i)
assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME
if "image" in sources[0]:
image_file = self.list_data_dict[i]["image"]
if type(image_file) is list:
# Handling multi images
# overwrite to process with simple pad
if len(image_file) > 1:
image = [self.process_image(f, "pad") for f in image_file]
image = [[im[0], im[1], "image"] for im in image]
else:
image = [self.process_image(f) for f in image_file]
else:
image = [self.process_image(image_file)]
# sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args)
sources = preprocess_multimodal([e["conversations"] for e in sources], self.data_args)
elif "video" in sources[0]:
video_file = self.list_data_dict[i]["video"]
try:
video, time_msg = self.process_video(video_file, data_anno=self.list_data_dict[i], data_args=self.data_args)
# print(video_file, time_msg)
processor = self.data_args.image_processor
frame_aspect_ratio = self.data_args.frame_aspect_ratio
# if frame_aspect_ratio == "anyres" or "anyres_max" in frame_aspect_ratio:
if "anyres" in frame_aspect_ratio:
if 'nopad' in frame_aspect_ratio:
image = process_anyres_video_nopad(video, self.data_args.image_processor, self.data_args.frame_grid_pinpoints, max_resolutions=self.data_args.max_num_pixels // len(video))
else:
raise NotImplementedError
# image = process_anyres_video(video, self.data_args.image_processor, self.data_args.frame_grid_pinpoints)
else:
image = processor.preprocess(video, return_tensors="pt")["pixel_values"]
image = [(image, video[0].shape[0:2], "video")]
# sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args, msg=time_msg)
sources = preprocess_multimodal([e["conversations"] for e in sources], self.data_args, msg=time_msg)
except Exception as e:
print(f"Error: {e}")
print(f"Failed to read video file: {video_file}")
raise e
else:
# sources = copy.deepcopy([e["conversations"] for e in sources]) # NOTE: causes problems when epoch > 1; better to preprocess the data in advance
sources = [e["conversations"] for e in sources]
has_image = ("image" in self.list_data_dict[i]) or ("video" in self.list_data_dict[i])
data_dict = preprocess(sources, self.tokenizer, has_image=has_image)
if "prompt" in data_dict:
prompt = data_dict["prompt"]
else:
prompt = None
if isinstance(i, int):
data_dict = dict(input_ids=data_dict["input_ids"][0], labels=data_dict["labels"][0])
# image exist in the data
if "image" in self.list_data_dict[i]:
data_dict["image"] = image
elif "video" in self.list_data_dict[i]:
data_dict["image"] = image
elif self.data_args.is_multimodal:
# image does not exist in the data, but the model is multimodal
crop_size = self.data_args.image_processor.crop_size
data_dict["image"] = [
(torch.zeros(1, 3, crop_size["height"], crop_size["width"]), (crop_size["width"], crop_size["height"]), "text"),
]
# prompt exist in the data
if prompt is not None:
data_dict["prompt"] = prompt
data_dict["id"] = self.list_data_dict[i].get("id", i)
# gc.collect() # NOTE
return data_dict
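# Illustrative sketch (not part of the training pipeline): the fallback
# indices that __getitem__ tries after the requested sample keeps failing.
# The step starts at 5 and doubles on every attempt, and each index is clamped
# to the last sample. The helper name and its default arguments are
# hypothetical; the defaults mirror num_base_retries + 3 = 5 attempts.
def _sketch_fallback_indices(i, dataset_len, num_attempts=5, first_step=5):
    step = first_step
    indices = []
    for _ in range(num_attempts):
        indices.append(min(i + step, dataset_len - 1))
        step *= 2
    return indices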
@dataclass
class DataCollatorForSupervisedDataset(object):
"""Collate examples for supervised fine-tuning."""
tokenizer: transformers.PreTrainedTokenizer
def pad_sequence(self, input_ids, batch_first, padding_value):
if self.tokenizer.padding_side == "left":
input_ids = [torch.flip(_input_ids, [0]) for _input_ids in input_ids]
input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value)
if self.tokenizer.padding_side == "left":
input_ids = torch.flip(input_ids, [1])
return input_ids
def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
# input_ids, labels, ids = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels", "id"))
input_ids = [_input_ids[: self.tokenizer.model_max_length] for _input_ids in input_ids]
labels = [_labels[: self.tokenizer.model_max_length] for _labels in labels]
if self.tokenizer.pad_token_id is None:
# self.tokenizer.pad_token_id = self.tokenizer.eos_token_id # FIXME: this could only be triggered for llama3 model.
self.tokenizer.pad_token_id = 0 # This gets the best result. Don't know why.
input_ids = self.pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
labels = self.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
batch = dict(input_ids=input_ids, labels=labels.long() if labels.dtype == torch.int32 else labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id))
# batch = dict(input_ids=input_ids, labels=labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id), ids=ids)
if "image" in instances[0]:
images = [instance["image"] for instance in instances]
# data_format: [image/video, spatial_size, media_type]
batch["image_sizes"] = [im[1] for im_list in images for im in im_list]
batch["modalities"] = [im[2] for im_list in images for im in im_list]
images = [im[0] for im_list in images for im in im_list] # flatten multi-images
# Flattening multi-image samples should be harmless as long as the order still matches downstream
# use list for input of different lengths
# if all(x is not None and x.shape == images[0].shape for x in images):
# Image: (N, P, C, H, W)
# Video: (N, F, C, H, W)
# batch["images"] = torch.stack(images)
# else:
batch["images"] = images
else:
# Pure-text samples are also given a placeholder image, so this branch should be unreachable
raise NotImplementedError(instances[0])
if "prompt" in instances[0]:
batch["prompts"] = [instance["prompt"] for instance in instances]
return batch
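# Illustrative sketch (not part of the training pipeline): the
# flip, right-pad, flip-back trick that DataCollatorForSupervisedDataset.pad_sequence
# uses to obtain left padding from a right-padding primitive. Plain Python
# lists stand in for tensors here so the sketch has no dependencies; the
# helper name is hypothetical.
def _sketch_left_pad(sequences, padding_value):
    max_len = max(len(seq) for seq in sequences)
    # 1) Reverse every sequence.
    flipped = [list(reversed(seq)) for seq in sequences]
    # 2) Pad on the right, which is all the padding primitive supports.
    right_padded = [seq + [padding_value] * (max_len - len(seq)) for seq in flipped]
    # 3) Reverse again so the padding ends up on the left.
    return [list(reversed(seq)) for seq in right_padded]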
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict:
"""Make dataset and collator for supervised fine-tuning."""
train_dataset = LazySupervisedDataset(tokenizer=tokenizer, data_path=data_args.data_path, data_args=data_args)
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)
def get_model(model_args, training_args, bnb_model_from_pretrained_args):
assert training_args.attn_implementation
if training_args.attn_implementation == "sdpa" and torch.__version__ < "2.1.2":
raise ValueError("The 'sdpa' attention implementation requires torch version 2.1.2 or higher.")
customized_kwargs = dict()
customized_kwargs.update(bnb_model_from_pretrained_args)
cfg_pretrained = None
if ',' in model_args.llm_compress_layer_list:
llm_compress_layer_list = [int(i) for i in model_args.llm_compress_layer_list.split(',')]
else:
llm_compress_layer_list = [int(model_args.llm_compress_layer_list)]
llm_image_token_ratio_list = [float(i) for i in model_args.llm_image_token_ratio_list.split(',')]
overwrite_config = {"vision_encode_type": model_args.vision_encode_type,
"mm_num_compress_latents": model_args.mm_num_compress_latents,
"mm_num_compress_query_type": model_args.mm_num_compress_query_type,
"mm_pos_num_frames": model_args.mm_pos_num_frames,
"mm_local_num_frames": model_args.mm_local_num_frames,
"mm_close_init": model_args.mm_close_init,
"min_slow_num_frames": model_args.min_slow_num_frames,
"mm_llm_compress": model_args.mm_llm_compress,
"llm_compress_layer_list": llm_compress_layer_list,
"llm_image_token_ratio_list": llm_image_token_ratio_list,
"llm_compress_type": model_args.llm_compress_type,
"mm_projector_type": model_args.mm_projector_type,
"mm_patch_merge_type": model_args.mm_patch_merge_type,
"mm_newline_position": model_args.mm_newline_position
}
if any(
[
model_args.rope_scaling_factor is not None,
model_args.rope_scaling_type is not None,
model_args.mm_spatial_pool_stride is not None,
model_args.mm_spatial_pool_out_channels is not None,
model_args.mm_spatial_pool_mode is not None,
model_args.mm_resampler_type is not None,
]
):
if "internlm2" in model_args.model_name_or_path.lower():
cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
else:
cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path)
else:
raise NotImplementedError(model_args)
if model_args.use_pos_skipping is not None and model_args.pos_skipping_range is not None:
overwrite_config["use_pos_skipping"] = model_args.use_pos_skipping
overwrite_config["pos_skipping_range"] = model_args.pos_skipping_range
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None:
overwrite_config["rope_scaling"] = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if training_args.model_max_length is None:
training_args.model_max_length = cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor
overwrite_config["max_sequence_length"] = training_args.model_max_length
# Pass the message as a string; wrapping it in print() would evaluate to None and leave the AssertionError without a message.
assert training_args.model_max_length == int(cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor), (
f"model_max_length: {training_args.model_max_length}, max_position_embeddings: {cfg_pretrained.max_position_embeddings}, rope_scaling_factor: {model_args.rope_scaling_factor}"
)
# overwrite_config["max_sequence_length"] = model_args.max_sequence_length
# overwrite_config["tokenizer_model_max_length"] = model_args.tokenizer_model_max_length
if model_args.mm_spatial_pool_stride is not None and model_args.mm_spatial_pool_out_channels is not None and model_args.mm_spatial_pool_mode is not None and model_args.mm_resampler_type is not None:
overwrite_config["mm_resampler_type"] = model_args.mm_resampler_type
overwrite_config["mm_spatial_pool_stride"] = model_args.mm_spatial_pool_stride
overwrite_config["mm_spatial_pool_out_channels"] = model_args.mm_spatial_pool_out_channels
overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode
if model_args.mm_spatial_pool_mode is not None:
overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode
if overwrite_config:
assert cfg_pretrained is not None, "cfg_pretrained is None"
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(cfg_pretrained, k, v)
customized_kwargs["config"] = cfg_pretrained
if model_args.model_class_name is not None:
raise NotImplementedError(model_args)
actual_model_class_name = f"{model_args.model_class_name}ForCausalLM"
model_class = getattr(transformers, actual_model_class_name)
rank0_print(f"Using model class {model_class} from {model_args.model_class_name}")
model = model_class.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif model_args.vision_tower is not None:
if "mixtral" in model_args.model_name_or_path.lower():
raise ValueError(f"I don't want model class {model_args}")
model = LlavaMixtralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
elif "mistral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
model = LlavaMistralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
# or "yi" in model_args.model_name_or_path.lower()
# or "nous-hermes" in model_args.model_name_or_path.lower()
# and "wizard-2" in model_args.model_name_or_path.lower()
):
raise ValueError(f"I don't want model class {model_args}")
model = LlavaLlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif "qwen" in model_args.model_name_or_path.lower():
if "moe" in model_args.model_name_or_path.lower() or "A14B" in model_args.model_name_or_path:
model = LlavaQwenMoeForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock])
elif overwrite_config['mm_llm_compress']:
model = LlavaQwenForCausalLM_Pdrop.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
else:
model = LlavaQwenForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif "internlm2" in model_args.model_name_or_path.lower():
if overwrite_config['mm_llm_compress']:
raise NotImplementedError
# model = LlavaInternLM2ForCausalLM_Pdrop.from_pretrained(
# model_args.model_name_or_path,
# cache_dir=training_args.cache_dir,
# attn_implementation=training_args.attn_implementation,
# torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
# low_cpu_mem_usage=False,
# **customized_kwargs,
# )
else:
model = LlavaInternLM2ForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif "gemma" in model_args.model_name_or_path.lower():
raise ValueError(f"I don't want model class {model_args}")
model = LlavaGemmaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
else:
raise ValueError(f"Unknown model class {model_args}")
else:
raise NotImplementedError
model = transformers.LlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
rank0_print(f"Model config: {model.config}.")
return model
def train(attn_implementation=None):
# global local_rank
parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
# wandb.init(project="mllm", entity="likunchang", name=os.path.basename(training_args.output_dir), reinit=True)
# local_broadcast_process_authkey() # NOTE
if training_args.verbose_logging:
rank0_print(f"Inspecting experiment hyperparameters:\n")
rank0_print(f"model_args = {vars(model_args)}\n\n")
rank0_print(f"data_args = {vars(data_args)}\n\n")
rank0_print(f"training_args = {vars(training_args)}\n\n")
# rank0_print(f"evaluation_args = {vars(evaluation_args)}\n\n")
# local_rank = training_args.local_rank
compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
bnb_model_from_pretrained_args = {}
if training_args.bits in [4, 8]:
from transformers import BitsAndBytesConfig
bnb_model_from_pretrained_args.update(
dict(
device_map={"": training_args.device},
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
quantization_config=BitsAndBytesConfig(
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=training_args.double_quant,
bnb_4bit_quant_type=training_args.quant_type, # {'fp4', 'nf4'}
),
)
)
model = get_model(model_args, training_args, bnb_model_from_pretrained_args)
model.config.use_cache = False
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None:
model.config.rope_scaling = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if model_args.freeze_backbone:
model.model.requires_grad_(False)
if training_args.bits in [4, 8]:
from peft import prepare_model_for_kbit_training
model.config.torch_dtype = torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
if training_args.gradient_checkpointing:
if hasattr(model, "enable_input_require_grads"):
model.enable_input_require_grads()
else:
def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
if training_args.lora_enable:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=training_args.lora_r,
lora_alpha=training_args.lora_alpha,
target_modules=find_all_linear_names(model),
lora_dropout=training_args.lora_dropout,
bias=training_args.lora_bias,
task_type="CAUSAL_LM",
)
if training_args.bits == 16:
if training_args.bf16:
model.to(torch.bfloat16)
if training_args.fp16:
model.to(torch.float16)
rank0_print("Adding LoRA adapters...")
model = get_peft_model(model, lora_config)
if "mistral" in model_args.model_name_or_path.lower() or "mixtral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="left")
elif "qwen" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right")
elif "internlm2" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right", trust_remote_code=True)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
or "yi" in model_args.model_name_or_path.lower()
or "nous-hermes" in model_args.model_name_or_path.lower()
and "wizard-2" in model_args.model_name_or_path.lower()
):
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
model_max_length=training_args.model_max_length,
padding_side="right",
use_fast=False,
)
rank0_print(f"Prompt version: {model_args.version}")
if model_args.version == "v0":
if tokenizer.pad_token is None:
smart_tokenizer_and_embedding_resize(
special_tokens_dict=dict(pad_token="[PAD]"),
tokenizer=tokenizer,
model=model,
)
elif model_args.version == "v0.5":
tokenizer.pad_token = tokenizer.unk_token
else:
if tokenizer.unk_token is not None:
tokenizer.pad_token = tokenizer.unk_token
if model_args.version in conversation_lib.conv_templates:
conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
else:
raise NotImplementedError(f"Can't find your conv_templates: {model_args.version}")
if model_args.vision_tower is not None:
model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
vision_tower = model.get_vision_tower()
vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
# NOTE hard code
data_args.image_processor = vision_tower.image_processor
data_args.is_multimodal = True
model.config.image_aspect_ratio = data_args.image_aspect_ratio
model.config.frame_aspect_ratio = data_args.frame_aspect_ratio
if data_args.image_grid_pinpoints is not None:
if isinstance(data_args.image_grid_pinpoints, str) and "x" in data_args.image_grid_pinpoints:
try:
patch_size = data_args.image_processor.size[0]
except Exception as e:
patch_size = data_args.image_processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", data_args.image_grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
data_args.image_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
elif isinstance(data_args.image_grid_pinpoints, str):
data_args.image_grid_pinpoints = ast.literal_eval(data_args.image_grid_pinpoints)
if data_args.frame_grid_pinpoints is not None:
if isinstance(data_args.frame_grid_pinpoints, str) and "x" in data_args.frame_grid_pinpoints:
try:
patch_size = data_args.image_processor.size[0]
except Exception as e:
patch_size = data_args.image_processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", data_args.frame_grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
data_args.frame_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
elif isinstance(data_args.frame_grid_pinpoints, str):
data_args.frame_grid_pinpoints = ast.literal_eval(data_args.frame_grid_pinpoints)
model.config.max_num_pixels = data_args.max_num_pixels
model.config.frame_grid_pinpoints = data_args.frame_grid_pinpoints
model.config.image_grid_pinpoints = data_args.image_grid_pinpoints
model.config.image_crop_resolution = data_args.image_crop_resolution
model.config.image_split_resolution = data_args.image_split_resolution
model.config.tokenizer_padding_side = tokenizer.padding_side
model.config.tokenizer_model_max_length = tokenizer.model_max_length
model.config.mm_newline_position = model_args.mm_newline_position
### Decide which parts of the model to train
if model_args.mm_tunable_parts is None: # traditional way of deciding which part to train
model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
model.config.tune_mm_vision_resampler = training_args.tune_mm_vision_resampler = model_args.tune_mm_vision_resampler
if model_args.tune_mm_mlp_adapter or model_args.tune_mm_vision_resampler:
model.requires_grad_(False)
if model_args.tune_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
if training_args.freeze_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = False
model.config.freeze_mm_vision_resampler = training_args.freeze_mm_vision_resampler
model.config.unfreeze_mm_vision_tower = model_args.unfreeze_mm_vision_tower
if model_args.unfreeze_mm_vision_tower:
vision_tower.requires_grad_(True)
else:
vision_tower.requires_grad_(False)
else:
rank0_print(f"Using mm_tunable_parts: {model_args.mm_tunable_parts}")
model.config.mm_tunable_parts = training_args.mm_tunable_parts = model_args.mm_tunable_parts
# Set the entire model to not require gradients by default
model.requires_grad_(False)
vision_tower.requires_grad_(False)
model.get_model().mm_projector.requires_grad_(False)
# Parse the mm_tunable_parts to decide which parts to unfreeze
tunable_parts = model_args.mm_tunable_parts.split(",")
if "mm_mlp_adapter" in tunable_parts:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
if "mm_vision_tower" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" in name:
param.requires_grad_(True)
if "mm_language_model" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" not in name and "mm_projector" not in name and "vision_resampler" not in name:
param.requires_grad_(True)
total_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters())
trainable_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters() if p.requires_grad)
rank0_print(f"Total parameters: ~{total_params/1e6:.2f}M")
rank0_print(f"Trainable parameters: ~{trainable_params/1e6:.2f}M")
if training_args.bits in [4, 8]:
model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device)
model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_projector_lr = training_args.mm_projector_lr
model.config.mm_vision_tower_lr = training_args.mm_vision_tower_lr
training_args.use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
if training_args.bits in [4, 8]:
from peft.tuners.lora import LoraLayer
for name, module in model.named_modules():
if isinstance(module, LoraLayer):
if training_args.bf16:
module = module.to(torch.bfloat16)
if "norm" in name:
module = module.to(torch.float32)
if "lm_head" in name or "embed_tokens" in name:
if hasattr(module, "weight"):
if training_args.bf16 and module.weight.dtype == torch.float32:
module = module.to(torch.bfloat16)
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = LLaVATrainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
rank0_print(f"model_config before train: {model.config}")
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
trainer.train(resume_from_checkpoint=True)
else:
trainer.train()
trainer.save_state()
model.config.use_cache = True
if training_args.lora_enable:
state_dict = get_peft_state_maybe_zero_3(model.named_parameters(), training_args.lora_bias)
non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters())
if training_args.local_rank == 0 or training_args.local_rank == -1:
if hasattr(model, "config"):
model.config.save_pretrained(training_args.output_dir)
if hasattr(model, "generation_config"):
model.generation_config.save_pretrained(training_args.output_dir)
model.save_pretrained(training_args.output_dir, state_dict=state_dict)
torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, "non_lora_trainables.bin"))
else:
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
rank0_print(f"Model saved to {training_args.output_dir}")
if __name__ == "__main__":
train()
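# Illustrative sketch only (not part of the original training script): the grid-pinpoint range
# expansion that train() performs for image_grid_pinpoints and frame_grid_pinpoints, pulled out
# into a standalone function. The name `expand_grid_pinpoints` and the example spec string are
# assumptions made for illustration; the regex and the expansion logic mirror the code above.

```python
import re


def expand_grid_pinpoints(spec, patch_size):
    """Expand a range spec such as "(1x1),...,(2x2)" into pixel resolutions."""
    matches = re.findall(r"\((\d+)x(\d+)\)", spec)
    if not matches:
        raise ValueError(f"No (AxB) pairs found in: {spec!r}")
    # The first and last pairs are the inclusive corners of the grid range.
    start = tuple(map(int, matches[0]))
    end = tuple(map(int, matches[-1]))
    grid = [(i, j) for i in range(start[0], end[0] + 1) for j in range(start[1], end[1] + 1)]
    # Scale grid cells to pixels by multiplying with the vision tower's patch size.
    return [[dim * patch_size for dim in pair] for pair in grid]


pinpoints = expand_grid_pinpoints("(1x1),...,(2x2)", 224)
print(pinpoints)
```

# With a 224-pixel patch size, the spec above covers every grid from 1x1 up to 2x2.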
================================================
FILE: llava-train_videochat/llava/train/train_mem.py
================================================
from llava.train.train import train
from llava.dist_utils import init_distributed_mode
if __name__ == "__main__":
init_distributed_mode()
train()
================================================
FILE: llava-train_videochat/llava/utils.py
================================================
import datetime
import logging
import logging.handlers
import os
import sys
import numpy as np
import requests
from llava.constants import LOGDIR
server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
moderation_msg = "I am sorry. Your input may violate our content moderation guidelines. Please avoid using harmful or offensive content."
handler = None
import torch.distributed as dist
try:
import av
from decord import VideoReader, cpu
except ImportError:
print("Please install pyav and decord to use video processing functions.")
def process_video_with_decord(video_file, data_args):
vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
total_frame_num = len(vr)
# Sampling stride in frames; clamp to 1 so range() never receives a zero step.
avg_fps = max(1, round(vr.get_avg_fps() / data_args.video_fps))
frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
if data_args.frames_upbound > 0:
if len(frame_idx) > data_args.frames_upbound:
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
video = vr.get_batch(frame_idx).asnumpy()
# https://github.com/dmlc/decord/issues/208
vr.seek(0)
return video
def process_video_with_pyav(video_file, data_args):
container = av.open(video_file)
# !!! This is the only difference. Using auto threading
container.streams.video[0].thread_type = "AUTO"
video_frames = []
for packet in container.demux():
if packet.stream.type == 'video':
for frame in packet.decode():
video_frames.append(frame)
total_frame_num = len(video_frames)
video_time = video_frames[-1].time
# Sampling stride in frames; clamp to 1 so range() never receives a zero step.
avg_fps = max(1, round(total_frame_num / video_time / data_args.video_fps))
frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
if data_args.frames_upbound > 0:
if len(frame_idx) > data_args.frames_upbound:
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = [video_frames[i] for i in frame_idx]
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
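# Illustrative sketch only: the index selection shared by process_video_with_decord and
# process_video_with_pyav, without any video decoding. The function name and arguments are
# hypothetical. The uniform fallback is a pure-Python stand-in for
# np.linspace(0, total - 1, upbound, dtype=int), so it is an approximation of the code above
# rather than a copy of it.

```python
def sample_frame_indices(total_frames, native_fps, target_fps, frames_upbound=0):
    """Pick frame indices at roughly `target_fps`, capped at `frames_upbound` frames."""
    # Stride in source frames; clamp to 1 so range() never receives a zero step.
    stride = max(1, round(native_fps / target_fps))
    indices = list(range(0, total_frames, stride))
    if frames_upbound > 0 and len(indices) > frames_upbound:
        if frames_upbound == 1:
            return [0]
        # Spread `frames_upbound` indices evenly from the first to the last frame.
        last = total_frames - 1
        indices = [i * last // (frames_upbound - 1) for i in range(frames_upbound)]
    return indices


# 10 seconds of 30 fps video sampled at 1 fps, then capped at 8 frames.
print(sample_frame_indices(300, 30.0, 1.0))
print(sample_frame_indices(300, 30.0, 1.0, frames_upbound=8))
```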
def rank0_print(*args):
if dist.is_initialized():
if dist.get_rank() == 0:
print(f"Rank {dist.get_rank()}: ", *args)
else:
print(*args)
def rank_print(*args):
if dist.is_initialized():
print(f"Rank {dist.get_rank()}: ", *args)
else:
print(*args)
def build_logger(logger_name, logger_filename):
global handler
formatter = logging.Formatter(
fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
# Set the format of root handlers
if not logging.getLogger().handlers:
logging.basicConfig(level=logging.INFO)
logging.getLogger().handlers[0].setFormatter(formatter)
# Redirect stdout and stderr to loggers
stdout_logger = logging.getLogger("stdout")
stdout_logger.setLevel(logging.INFO)
sl = StreamToLogger(stdout_logger, logging.INFO)
sys.stdout = sl
stderr_logger = logging.getLogger("stderr")
stderr_logger.setLevel(logging.ERROR)
sl = StreamToLogger(stderr_logger, logging.ERROR)
sys.stderr = sl
# Get logger
logger = logging.getLogger(logger_name)
logger.setLevel(logging.INFO)
# Add a file handler for all loggers
if handler is None:
os.makedirs(LOGDIR, exist_ok=True)
filename = os.path.join(LOGDIR, logger_filename)
handler = logging.handlers.TimedRotatingFileHandler(filename, when="D", utc=True)
handler.setFormatter(formatter)
for name, item in logging.root.manager.loggerDict.items():
if isinstance(item, logging.Logger):
item.addHandler(handler)
return logger
class StreamToLogger(object):
"""
Fake file-like stream object that redirects writes to a logger instance.
"""
def __init__(self, logger, log_level=logging.INFO):
self.terminal = sys.stdout
self.logger = logger
self.log_level = log_level
self.linebuf = ""
def __getattr__(self, attr):
return getattr(self.terminal, attr)
def write(self, buf):
temp_linebuf = self.linebuf + buf
self.linebuf = ""
for line in temp_linebuf.splitlines(True):
# From the io.TextIOWrapper docs:
# On output, if newline is None, any '\n' characters written
# are translated to the system default line separator.
# By default sys.stdout.write() expects '\n' newlines and then
# translates them so this is still cross platform.
if line[-1] == "\n":
self.logger.log(self.log_level, line.rstrip())
else:
self.linebuf += line
def flush(self):
if self.linebuf != "":
self.logger.log(self.log_level, self.linebuf.rstrip())
self.linebuf = ""
def disable_torch_init():
"""
Disable the redundant torch default initialization to accelerate model creation.
"""
import torch
setattr(torch.nn.Linear, "reset_parameters", lambda self: None)
setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None)
def violates_moderation(text):
"""
Check whether the text violates OpenAI moderation API.
"""
url = "https://api.openai.com/v1/moderations"
headers = {"Content-Type": "application/json", "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]}
text = text.replace("\n", "")
try:
# Let requests serialize the payload so quotes and backslashes in `text` are escaped correctly.
ret = requests.post(url, headers=headers, json={"input": text}, timeout=5)
flagged = ret.json()["results"][0]["flagged"]
except requests.exceptions.RequestException as e:
print(f"######################### Moderation Error: {e} #########################")
flagged = False
except KeyError as e:
print(f"######################### Moderation Error: {e} #########################")
flagged = False
return flagged
def pretty_print_semaphore(semaphore):
if semaphore is None:
return "None"
return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})"
================================================
FILE: llava-train_videochat/llava/video_utils.py
================================================
import random
import os
import io
import av
import cv2
import decord
import imageio
from decord import VideoReader
import torch
import numpy as np
import math
import gc
import torchaudio
from torchvision.transforms.functional import pil_to_tensor
import re
def get_index(num_frames, num_segments):
seg_size = float(num_frames - 1) / num_segments
start = int(seg_size / 2)
offsets = np.array([
start + int(np.round(seg_size * idx)) for idx in range(num_segments)
])
return offsets
def lazy_load_s3video(s3path_video, num_frames, video_start, video_end, client):
# load video from ceph
video_bytes_stream = client.get(s3path_video, enable_stream_lazyloding=True)
container = av.open(video_bytes_stream)
stream = container.streams.video[0]
# duration = stream.duration
real_fps = container.streams.video[0].average_rate
time_base = container.streams.video[0].time_base
start, end = video_start, video_end
# Convert time to pts
duration_frames = int(end - start) * real_fps
frames_index = get_index(duration_frames, num_frames)
pts_list = []
start_pts = int((start) / time_base)
end_pts = int((end) / time_base)
for frame_index in frames_index:
pts_list.append(int((frame_index / real_fps)) / time_base)
# Seek to nearest key frame from the start
container.seek(max(start_pts, 0), stream=stream)
frames = []
for frame in container.decode(**{"video":0}):
if frame.pts < start_pts:
continue
# if frame.pts <= end_pts:
if len(pts_list) >0:
if frame.pts >= pts_list[0]:
frames.append(frame)
pts_list.pop(0)
else:
break
container.close()
frames = [np.array(frames[idx].to_rgb().to_image()) for idx in range(len(frames))]
final_frames = np.stack(frames)
del frames
del video_bytes_stream # T C H W
gc.collect()
return final_frames, frames_index, float(real_fps)
def pts_to_secs(pts: int, time_base: float, start_pts: int) -> float:
"""
Converts a presentation timestamp (pts) with the given time base and start_pts offset to seconds.
Returns:
time_in_seconds (float): The corresponding time in seconds.
https://github.com/facebookresearch/pytorchvideo/blob/main/pytorchvideo/data/utils.py#L54-L64
"""
if pts == math.inf:
return math.inf
return int(pts - start_pts) * time_base
def get_pyav_video_duration(video_reader):
video_stream = video_reader.streams.video[0]
video_duration = pts_to_secs(
video_stream.duration,
video_stream.time_base,
video_stream.start_time
)
return float(video_duration)
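# Illustrative sketch only: a worked example of the pts-to-seconds conversion above. It assumes
# the stream's time base is a fractions.Fraction (for example 1/90000 seconds per tick); the
# function name `pts_to_seconds` and the sample numbers are hypothetical.

```python
import math
from fractions import Fraction


def pts_to_seconds(pts, time_base, start_pts=0):
    """Convert a presentation timestamp in ticks to seconds relative to start_pts."""
    if pts == math.inf:
        return math.inf
    return (pts - start_pts) * time_base


time_base = Fraction(1, 90000)  # one tick is 1/90000 of a second
# A stream that starts at tick 90000 and is at tick 270000 is 2 seconds in.
print(float(pts_to_seconds(270000, time_base, start_pts=90000)))
```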
def get_frame_indices(num_frames, vlen, sample='middle', fix_start=None, input_fps=1, min_num_frames=1, max_num_frames=-1, local_num_frames=8):
if min_num_frames > vlen:
if sample == 'dynamic_fps1':
min_num_frames = (vlen // local_num_frames) * local_num_frames
else:
min_num_frames = vlen
if sample == 'dynamic_fps1':
duration = float(vlen) / input_fps
num_segments = int(duration // local_num_frames)
if num_segments == 0:
num_frames = local_num_frames
else:
num_frames = local_num_frames * num_segments
if max_num_frames > 0:
num_frames = min(num_frames, max_num_frames)
sample = "middle" # NOTE
# logger.info(f"? is OK (img), duation={duration} frames={num_frames}!!!!")
num_frames = max(min_num_frames, num_frames)
# print(f"\033[0;31m vlen={vlen}, input_fps={input_fps} num_frames={num_frames} \033[0m")
if sample in ["rand", "middle"]: # uniform sampling
acc_samples = min(num_frames, vlen)
# split the video into `acc_samples` intervals, and sample from each interval.
intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
ranges = []
for idx, interv in enumerate(intervals[:-1]):
ranges.append((interv, intervals[idx + 1] - 1))
if sample == 'rand':
try:
frame_indices = [random.choice(range(x[0], x[1])) for x in ranges]
except IndexError: # an interval one frame wide gives an empty range; fall back to a random subset
frame_indices = np.random.permutation(vlen)[:acc_samples]
frame_indices.sort()
frame_indices = list(frame_indices)
elif fix_start is not None:
frame_indices = [x[0] + fix_start for x in ranges]
elif sample == 'middle':
frame_indices = [(x[0] + x[1]) // 2 for x in ranges]
else:
raise NotImplementedError
if len(frame_indices) < num_frames: # padded with last frame
padded_frame_indices = [frame_indices[-1]] * num_frames
padded_frame_indices[:len(frame_indices)] = frame_indices
frame_indices = padded_frame_indices
elif "fps" in sample: # fps0.5, sequentially sample frames at 0.5 fps
output_fps = float(sample[3:])
duration = float(vlen) / input_fps
delta = 1 / output_fps # gap between frames, this is also the clip length each frame represents
frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta)
frame_indices = np.around(frame_seconds * input_fps).astype(int)
frame_indices = [e for e in frame_indices if e < vlen]
if max_num_frames > 0 and len(frame_indices) > max_num_frames:
frame_indices = frame_indices[:max_num_frames]
# frame_indices = np.linspace(0 + delta / 2, duration + delta / 2, endpoint=False, num=max_num_frames)
else:
raise ValueError(f"Not support sample type: {sample}")
return frame_indices
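# Illustrative sketch only: the "middle" branch of get_frame_indices, reduced to plain Python.
# The name `middle_frame_indices` is hypothetical, and the integer arithmetic is a stand-in for
# np.linspace(0, vlen, n + 1).astype(int). It assumes vlen >= 1 and num_frames >= 1.

```python
def middle_frame_indices(vlen, num_frames):
    """Split [0, vlen) into equal intervals and take the middle frame of each one."""
    n = min(num_frames, vlen)
    # Interval boundaries: n + 1 evenly spaced cut points from 0 to vlen.
    bounds = [k * vlen // n for k in range(n + 1)]
    ranges = [(bounds[k], bounds[k + 1] - 1) for k in range(n)]
    indices = [(lo + hi) // 2 for lo, hi in ranges]
    # Short videos: pad with the last sampled frame up to num_frames.
    indices += [indices[-1]] * (num_frames - len(indices))
    return indices


print(middle_frame_indices(100, 4))  # one index per quarter of the video
print(middle_frame_indices(3, 5))    # fewer frames than requested, so the last index repeats
```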
def read_frames_av(video_path, num_frames, sample='rand', client=None, fix_start=None, min_num_frames=1, max_num_frames=-1, clip=None, local_num_frames=8):
if clip is not None:
raise NotImplementedError("The PyAV reader does not support clip.")
if 's3://' in video_path:
video_bytes = client.get(video_path)
byteio = io.BytesIO(video_bytes)
byteio.seek(0)
reader = av.open(byteio)
else:
byteio = None
reader = av.open(video_path)
frames = [f.to_rgb().to_ndarray() for f in reader.decode(video=0)]
vlen = len(frames)
duration = get_pyav_video_duration(reader)
fps = vlen / float(duration)
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
input_fps=fps, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
frames = np.stack([frames[idx] for idx in frame_indices]) # (T, H, W, C), np.uint8
# frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
if byteio is not None:
byteio.close()
reader.close()
return frames, frame_indices, float(fps), duration
def read_frames_gif(
video_path, num_frames, sample='rand', fix_start=None,
min_num_frames=1, max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
if clip is not None:
raise NotImplementedError("The GIF reader does not support clip.")
if 's3://' in video_path:
video_bytes = client.get(video_path)
byteio = io.BytesIO(video_bytes)
gif = imageio.get_reader(byteio)
else:
byteio = None
gif = imageio.get_reader(video_path)
vlen = len(gif)
fps = 1.
duration = vlen / fps
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
min_num_frames=min_num_frames,
max_num_frames=max_num_frames, local_num_frames=local_num_frames,
input_fps=fps # NOTE: hard-coded for now
)
frames = []
min_h = min_w = 100000
hw_set = set()
for index, frame in enumerate(gif):
# for index in frame_idxs:
if index in frame_indices:
frame = cv2.cvtColor(frame, cv2.COLOR_RGBA2RGB)
frame = frame.astype(np.uint8)
# # (H x W x C) to (C x H x W)
# frame = frame.permute(2, 0, 1)
frames.append(frame)
hw_set.add(frame.shape)
if frame.shape[0] < min_h:
min_h = frame.shape[0]
if frame.shape[1] < min_w:
min_w = frame.shape[1]
# print(hw_set, min_h, min_w)
if len(hw_set) > 1:
frames = [i[:min_h, :min_w] for i in frames]
frames = np.stack(frames) # .float() / 255
if byteio is not None:
byteio.close()
return frames, frame_indices, float(fps), duration # for tgif
def read_frames_decord(
video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1,
max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
if video_path.endswith('.avi'):
return read_frames_av(video_path=video_path, num_frames=num_frames, sample=sample,
fix_start=fix_start, min_num_frames=min_num_frames, max_num_frames=max_num_frames,
client=client, clip=clip, local_num_frames=local_num_frames)
if 's3://' in video_path:
video_bytes = client.get(video_path)
if video_bytes is None or len(video_bytes) == 0:
raise ValueError(f"Can't read byte from {video_path}!")
byteio = io.BytesIO(video_bytes)
video_reader = VideoReader(byteio, num_threads=1)
else:
byteio = None
video_reader = VideoReader(video_path, num_threads=1)
vlen = len(video_reader)
fps = video_reader.get_avg_fps()
duration = vlen / float(fps)
if clip:
start, end = clip
start = max(0, start)
end = min(duration - 0.1, end) # keep end from running past the end of the video
duration = end - start
vlen = int(duration * fps)
start_index = int(start * fps)
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
input_fps=fps, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
if clip:
frame_indices = [f + start_index for f in frame_indices]
# print(fps, frame_indices)
frames = video_reader.get_batch(frame_indices).asnumpy() # (T, H, W, C), np.uint8
# https://github.com/dmlc/decord/issues/208
video_reader.seek(0)
if byteio is not None:
byteio.close()
# frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
return frames, frame_indices, float(fps), duration
def read_frames_img(
video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1,
max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
def extract_frame_number(filename):
# Extract the numeric part from the filename using regular expressions
if filename.endswith('.jpg'):
match = re.search(r'_(\d+)\.jpg$', filename)
elif filename.endswith('.jpeg'):
match = re.search(r'_(\d+)\.jpeg$', filename)
elif filename.endswith('.png'):
match = re.search(r'_(\d+)\.png$', filename)
else:
raise NotImplementedError(f"Wrong filename: {filename}")
return int(match.group(1)) if match else -1
def sort_frames(frame_paths):
# Extract filenames from each path and sort by their numeric part
return sorted(frame_paths, key=lambda x: extract_frame_number(os.path.basename(x)))
# img_list=[]
if "s3://" in video_path:
img_list = sort_frames(client.list(video_path))
else:
img_list = sort_frames(list(os.listdir(video_path)))
if 'tvqa' in video_path.lower():
fps = 3.0 # TVQA frames are extracted at 3 fps
else:
fps = 1.0 # NOTE: data of unknown origin is treated as 1 fps
if clip is not None:
start = float(clip[0])
end = float(clip[1])
start = max(0, start)
end = min(len(img_list) / fps, end) # keep end from running past the end of the video
vlen = (end - start) * fps
else:
vlen = len(img_list)
duration = vlen / fps
if min_num_frames > vlen:
if sample == 'dynamic_fps1':
min_num_frames = (vlen // local_num_frames) * local_num_frames
else:
min_num_frames = vlen
if sample == 'dynamic_fps1':
num_segments = int(duration // local_num_frames)
if num_segments == 0:
num_frames = local_num_frames
else:
num_frames = local_num_frames * num_segments
num_frames = min(num_frames, max_num_frames) if max_num_frames > 0 else num_frames # -1 means no cap
num_frames = max(min_num_frames, num_frames)
num_frames = int(num_frames)
if clip is not None:
def _get_index_by_time(start_sec, end_sec, num_segments=8, fps=1., max_frame=9999):
start_idx = max(1, round(start_sec * fps))
end_idx = min(round(end_sec * fps), max_frame)
seg_size = float(end_idx - start_idx) / (num_segments - 1)
offsets = np.array([start_idx + int(np.round(seg_size * idx)) for idx in range(num_segments)])
return offsets
frame_indices = _get_index_by_time(float(clip[0]), float(clip[1]), num_segments=num_frames, fps=fps, max_frame=len(img_list)-1)
else:
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
min_num_frames=min_num_frames,
max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
imgs = []
for idx in frame_indices:
frame_fname = os.path.join(video_path, img_list[idx])
if "s3://" in video_path:
img_bytes = client.get(frame_fname)
else:
with open(frame_fname, 'rb') as f:
img_bytes = f.read()
img_np = np.frombuffer(img_bytes, np.uint8)
img = cv2.imdecode(img_np, cv2.IMREAD_COLOR)
cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img)
imgs.append(img)
# print(f"\033[0;31m img_list={len(img_list)} video_path={video_path}, len(imgs)={len(imgs)}, frame_indices={frame_indices} num_frames={num_frames} \033[0m")
frames = np.array(imgs, dtype=np.uint8)
# frames = torch.tensor(np.array(imgs), dtype=torch.uint8).permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
# logger.info(f"{video_path} is OK (img), duation={vlen}!!!!")
return frames, frame_indices, fps, duration # NOTE: image folders are treated as 1 fps (3 fps for TVQA)
def read_frames_fake(
video_path, num_frames, sample='rand', fix_start=None,
max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
print("I am fake!!!!!!")
frame_indices = get_frame_indices(
num_frames, 100, sample=sample, fix_start=fix_start,
input_fps=1, max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
frames = np.random.randint(0, 256, size=(len(frame_indices), 224, 224, 3), dtype=np.uint8) # (T, H, W, C), np.uint8
return frames, frame_indices, 1.0, 100
VIDEO_READER_FUNCS = {
'av': read_frames_av,
'decord': read_frames_decord,
'gif': read_frames_gif,
'img': read_frames_img,
'frame': read_frames_img,
'lazy': lazy_load_s3video,
'fake': read_frames_fake
}
================================================
FILE: llava-train_videochat/pyproject.toml
================================================
[tool.black]
line-length = 240
[build-system]
requires = ["setuptools>=61.0"]
build-backend = "setuptools.build_meta"
[project]
name = "llava"
version = "1.7.0.dev0"
description = "LLaVA OneVision: The Next Generation of LLaVA with Better Image and Video Understanding Capabilities"
readme = "README.md"
requires-python = ">=3.8"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: Apache Software License",
]
[project.optional-dependencies]
standalone = [
"shortuuid",
"httpx==0.24.0",
"einops",
"ftfy",
]
train = [
"llava[standalone]",
"numpy==1.26.1",
"open_clip_torch",
"fastapi",
"gradio==3.35.2",
"markdown2[all]",
"numpy",
"requests",
"sentencepiece",
"torch==2.1.2",
"torchvision==0.16.2",
"uvicorn",
"wandb",
"deepspeed==0.14.2",
"peft==0.4.0",
"accelerate>=0.29.1",
"tokenizers~=0.15.2",
"transformers@git+https://github.com/huggingface/transformers.git@1c39974a4c4036fd641bc1191cc32799f85715a4",
"bitsandbytes==0.41.0",
"scikit-learn==1.2.2",
"sentencepiece~=0.1.99",
"einops==0.6.1",
"einops-exts==0.0.4",
"gradio_client==0.2.9",
"urllib3<=2.0.0",
"datasets==2.16.1",
"pydantic==1.10.8",
"timm",
"hf_transfer",
"opencv-python",
"av",
"decord",
"tyro",
"scipy",
]
dev0 = [
"llava[standalone]",
"open_clip_torch",
"fastapi",
"markdown2[all]",
"uvicorn",
"bitsandbytes==0.41.0",
"scikit-learn==1.2.2",
"datasets==2.16.1",
"pydantic==1.10.8",
"timm",
"hf_transfer",
"opencv-python",
"av",
"decord",
"tyro",
"scipy",
]
[project.urls]
"Homepage" = "https://llava-vl.github.io"
"Bug Tracker" = "https://github.com/haotian-liu/LLaVA/issues"
[tool.setuptools.packages.find]
include = ["llava*", "trl*"]
exclude = [
"assets*",
"benchmark*",
"docs",
"dist*",
"playground*",
"scripts*",
"tests*",
"checkpoints*",
"project_checkpoints*",
"debug_checkpoints*",
"mlx_configs*",
"wandb*",
"notebooks*",
]
[tool.wheel]
exclude = [
"assets*",
"benchmark*",
"docs",
"dist*",
"playground*",
"scripts*",
"tests*",
"checkpoints*",
"project_checkpoints*",
"debug_checkpoints*",
"mlx_configs*",
"wandb*",
"notebooks*",
]
================================================
FILE: llava-train_videochat/requirements.txt
================================================
Babel==2.14.0
DataProperty==1.0.1
Deprecated==1.2.14
GitPython==3.1.43
Jinja2==3.1.3
Levenshtein==0.25.1
MarkupSafe==2.1.5
PyJWT==2.8.0
PyYAML==6.0.1
Pygments==2.17.2
QtPy==2.4.1
Send2Trash==1.8.3
absl-py==2.1.0
accelerate==0.33.0
addict==2.4.0
aiofiles==23.2.1
aiohttp==3.9.1
aiolimiter==1.2.1
aiosignal==1.3.1
alembic==1.13.0
altair==5.4.1
anls==0.0.2
annotated-types==0.7.0
anthropic==0.45.2
anyio==4.4.0
appdirs==1.4.4
async-timeout==4.0.3
attrs==23.1.0
audioread==3.0.1
av==14.0.1
bitsandbytes==0.41.0
black==24.1.0
blinker==1.7.0
blis==0.7.11
boto3==1.28.25
botocore==1.31.25
Brotli==1.1.0
capture-metric==0.1.13
catalogue==2.0.10
certifi==2023.11.17
cffi==1.16.0
cfgv==3.4.0
chardet==5.2.0
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.19.0
cloudpickle==3.0.0
cmake==3.25.0
colorama==0.4.6
coloredlogs==15.0.1
confection==0.1.5
contourpy==1.2.0
crcmod==1.7
cryptography==43.0.0
ctranslate2==4.4.0
cycler==0.12.1
cymem==2.0.8
databricks-cli==0.18.0
datasets==2.16.1
decorator==4.4.2
decord==0.6.0
deepspeed==0.14.2
dill==0.3.9
distlib==0.3.8
distro==1.9.0
docker==6.1.3
docker-pycreds==0.4.0
docstring_parser==0.16
easydict==1.13
einops==0.6.1
einops-exts==0.0.4
entrypoints==0.4
environs==9.5.0
et-xmlfile==1.1.0
evaluate==0.4.2
exceptiongroup==1.2.2
FactualSceneGraph==0.5.0
fastapi==0.115.6
faster-whisper==1.1.0
ffmpy==0.4.0
filelock==3.14.0
-e git+https://github.com/Dao-AILab/flash-attention.git@9a11f440d3a34f618b4ba814c825b109c6d7e8f5#egg=flash_attn
Flask==3.0.0
flatbuffers==24.12.23
fonttools==4.46.0
frozenlist==1.4.0
fsspec==2023.10.0
ftfy==6.1.1
func_timeout==4.3.5
gitdb==4.0.11
GitPython==3.1.40
gradio==5.9.1
gradio_client==1.5.2
greenlet==3.0.2
grpcio==1.66.1
gunicorn==21.2.0
h11==0.14.0
hf_transfer==0.1.8
hjson==3.1.0
httpcore==1.0.7
httpx==0.27.2
httpx-sse==0.4.0
huggingface-hub==0.27.0
humanfriendly==10.0
humanize==4.7.0
identify==2.6.0
idna==3.9
imageio==2.31.1
imageio-ffmpeg==0.5.1
importlib-metadata==7.0.0
itsdangerous==2.1.2
Jinja2==3.1.2
jiter==0.5.0
jmespath==0.10.0
joblib==1.3.2
jsonlines==4.0.0
jsonschema==4.23.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
langcodes==3.4.0
language_data==1.2.0
latex2mathml==3.77.0
lazy_loader==0.4
librosa==0.10.2.post1
liger-kernel==0.0.0
linkify-it-py==2.0.3
lit==15.0.7
llvmlite==0.43.0
loguru==0.7.2
lxml==5.3.0
Mako==1.3.0
marisa-trie==1.2.0
Markdown==3.5.1
markdown-it-py==2.2.0
markdown2==2.5.0
MarkupSafe==2.1.3
marshmallow==3.20.1
matplotlib==3.8.2
mbstrdecoder==1.1.3
mdit-py-plugins==0.3.3
mdurl==0.1.2
mlflow==2.9.1
model-index==0.1.11
moviepy==1.0.3
mpmath==1.3.0
msgpack==1.0.8
multidict==6.0.4
multiprocess==0.70.17
multiprocessing-logging==0.3.4
murmurhash==1.0.10
mutagen==1.47.0
mypy-extensions==1.0.0
narwhals==1.5.5
networkx==3.2.1
ninja==1.11.1.1
nltk==3.9.1
nodeenv==1.9.1
numba==0.60.0
numexpr==2.10.1
numpy==1.26.4
nvidia-cublas-cu12==12.1.3.1
nvidia-cuda-cupti-cu12==12.1.105
nvidia-cuda-nvrtc-cu12==12.1.105
nvidia-cuda-runtime-cu12==12.1.105
nvidia-cudnn-cu12==8.9.2.26
nvidia-cufft-cu12==11.0.2.54
nvidia-curand-cu12==10.3.2.106
nvidia-cusolver-cu12==11.4.5.107
nvidia-cusparse-cu12==12.1.0.106
nvidia-nccl-cu12==2.18.1
nvidia-nvjitlink-cu12==12.6.20
nvidia-nvtx-cu12==12.1.105
oauthlib==3.2.2
onnxruntime==1.16.3
open_clip_torch==2.26.1
openai==1.44.0
opencv-python==4.8.1.78
opencv-python-headless==4.10.0.84
opendatalab==0.0.10
openmim==0.3.9
openpyxl==3.1.5
openxlab==0.1.1
ordered-set==4.1.0
orjson==3.10.7
oss2==2.17.0
packaging==24.1
pandas==2.1.3
pathos==0.3.3
pathspec==0.12.1
pathtools==0.1.2
pathvalidate==3.2.1
peft==0.4.0
Pillow==10.1.0
platformdirs==4.1.0
pooch==1.8.2
portalocker==2.10.1
pox==0.3.5
ppft==1.7.6.9
pre-commit==3.8.0
preshed==3.0.9
prettytable==3.9.0
proglog==0.1.10
protobuf==3.20.0
psutil==5.9.5
py-cpuinfo==9.0.0
pyarrow==14.0.1
pyarrow-hotfix==0.6
pybind11==2.13.5
pycocoevalcap==1.2
pycocotools==2.0.8
pycparser==2.22
pycryptodome==3.20.0
pycryptodomex==3.20.0
pydantic==2.10.4
pydantic_core==2.27.2
pydub==0.25.1
Pygments==2.18.0
PyJWT==2.8.0
pynvml==11.5.3
pyparsing==3.1.1
pyre-extensions==0.0.29
pytablewriter==1.2.0
python-dateutil==2.8.2
python-dotenv==1.0.0
python-hostlist==1.23.0
python-multipart==0.0.20
pytz==2023.3.post1
pywsd==1.2.5
PyYAML==6.0.1
querystring-parser==1.2.4
rapidfuzz==3.9.7
redis==5.0.7
referencing==0.35.1
regex==2023.10.3
reka-api==3.0.8
requests==2.28.2
rfc3986==1.5.0
rich==13.4.2
rouge==1.0.1
rpds-py==0.20.0
ruff==0.8.5
s3transfer==0.6.1
sacrebleu==2.4.3
safehttpx==0.1.6
safetensors==0.4.1
scikit-learn==1.2.2
scipy==1.11.4
seaborn==0.13.2
semantic-version==2.10.0
sentence-transformers==3.0.1
sentencepiece==0.1.99
sentry-sdk==1.29.2
setproctitle==1.3.2
shellingham==1.5.4
shortuuid==1.0.13
shtab==1.7.1
six==1.16.0
smart-open==7.0.4
smmap==5.0.1
sniffio==1.3.1
soundfile==0.12.1
soxr==0.3.7
spaces==0.31.1
spacy==3.7.6
spacy-legacy==3.0.12
spacy-loggers==1.0.5
SQLAlchemy==2.0.23
sqlitedict==2.1.0
sqlparse==0.4.4
srsly==2.4.8
starlette==0.41.3
svgwrite==1.4.3
sympy==1.12
tabledata==1.3.3
tabulate==0.9.0
tcolorpy==0.1.6
tenacity==8.3.0
tensorboard==2.17.1
tensorboard-data-server==0.7.2
tensorboardX==2.6
termcolor==2.3.0
thinc==8.2.5
threadpoolctl==3.2.0
tiktoken==0.7.0
timm==0.4.12
tokenizers==0.19.1
tomli==2.0.1
tomlkit==0.13.2
torch==2.1.2
torchaudio==2.1.2
torchvision==0.16.2
tqdm==4.65.2
tqdm-multiprocess==0.0.11
transformers==4.40.1
transformers-stream-generator==0.0.5
triton==2.1.0
typepy==1.3.2
typer==0.12.5
typing-inspect==0.9.0
typing_extensions==4.12.2
tyro==0.8.10
tzdata==2023.3
uc-micro-py==1.0.3
urllib3==1.26.20
uvicorn==0.30.6
virtualenv==20.26.4
wandb==0.17.9
wasabi==1.1.3
wavedrom==2.0.3.post3
wcwidth==0.2.12
weasel==0.4.1
websocket-client==1.7.0
websockets==13.0
Werkzeug==3.0.1
wn==0.0.23
wrapt==1.16.0
xformers==0.0.20
xxhash==3.4.1
yapf==0.40.2
yarl==1.9.4
yt-dlp==2024.8.6
zipp==3.17.0
zss==1.2.0
zstandard==0.23.0
================================================
FILE: llava-train_videochat/scripts/train/stage1-init_connector/stage1_internvideo2_tome16_res224_qwen7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage1_init_connector_iv1m.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="internvideo2"
VISION_MODEL_VERSION_CLEAN="internvideo2"
LLM_VERSION="Qwen/Qwen2.5-7B-Instruct"
LLM_VERSION_CLEAN="Qwen2_5_7B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION=plain
BASE_RUN_NAME=stage1-${VISION_MODEL_VERSION}-${mm_projector_type}-${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_${PROMPT_VERSION}_$(date +"%Y%m%d_%H%M%S")
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=8
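# Optional sketch (not part of the original script): the srun command below appends
# its output to ./output_logs/stage1-init_connector/, and the shell cannot open the
# log file if that directory is missing, so create it first.
# LOG_DIR is a hypothetical helper variable introduced only for this example.
LOG_DIR="./output_logs/stage1-init_connector"
mkdir -p "${LOG_DIR}"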
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_mlp_adapter" \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/stage1-init_connector/${BASE_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "no" \
--save_steps 50000 \
--learning_rate 1e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 8192 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to tensorboard \
--run_name $BASE_RUN_NAME \
--attn_implementation sdpa \
--frames_upbound 4 \
--time_msg short \
--local_num_frames 4 \
--sample_type middle \
--vision_encode_type video_image \
--mm_pos_num_frames 4 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage1-init_connector/${BASE_RUN_NAME}.log
# You can delete the sdpa attn_implementation if you want to use flash attn
================================================
FILE: llava-train_videochat/scripts/train/stage1-init_connector/stage1_umt_tome16_res224_qwen7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage1_init_connector_iv1m.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-large"
VISION_MODEL_VERSION_CLEAN="umt-large"
LLM_VERSION="Qwen/Qwen2-7B-Instruct"
LLM_VERSION_CLEAN="Qwen2_7B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION=plain
BASE_RUN_NAME=stage1-${VISION_MODEL_VERSION}-${mm_projector_type}-${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_${PROMPT_VERSION}_$(date +"%Y%m%d_%H%M%S")
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=8
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_mlp_adapter" \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/stage1-init_connector/${BASE_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "no" \
--save_steps 50000 \
--learning_rate 1e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 8192 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to tensorboard \
--run_name $BASE_RUN_NAME \
--attn_implementation sdpa \
--frames_upbound 4 \
--time_msg short \
--local_num_frames 4 \
--sample_type middle \
--vision_encode_type video_image \
--mm_pos_num_frames 4 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage1-init_connector/${BASE_RUN_NAME}.log
# You can delete the sdpa attn_implementation if you want to use flash attn
================================================
FILE: llava-train_videochat/scripts/train/stage1-init_connector/stage1_umt_tome16_res448_qwen1_5b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage1_init_connector_iv1m.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-hd-large"
VISION_MODEL_VERSION_CLEAN="umt-hd-large"
LLM_VERSION="Qwen/Qwen2.5-1.5B-Instruct"
LLM_VERSION_CLEAN="Qwen2_5_1_5B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION=plain
BASE_RUN_NAME=stage1-${VISION_MODEL_VERSION}-${mm_projector_type}-${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_${PROMPT_VERSION}_$(date +"%Y%m%d_%H%M%S")
echo "BASE_RUN_NAME: ${BASE_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=8
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_mlp_adapter" \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--bf16 True \
--output_dir ./checkpoints/stage1-init_connector/${BASE_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 16 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 4 \
--evaluation_strategy "no" \
--save_strategy "no" \
--save_steps 50000 \
--learning_rate 1e-3 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 8192 \
--gradient_checkpointing True \
--dataloader_num_workers 16 \
--lazy_preprocess True \
--report_to tensorboard \
--run_name $BASE_RUN_NAME \
--attn_implementation sdpa \
--frames_upbound 4 \
--time_msg short \
--local_num_frames 4 \
--sample_type middle \
--vision_encode_type video_image \
--mm_pos_num_frames 4 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage1-init_connector/${BASE_RUN_NAME}.log
# You can delete the sdpa attn_implementation if you want to use flash attn
================================================
FILE: llava-train_videochat/scripts/train/stage2-visual_pretraining/stage2_internvideo2_tome16_res224_qwen_7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage2_short_pretrain_iv6m.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="internvideo2"
VISION_MODEL_VERSION_CLEAN="internvideo2"
LLM_VERSION="Qwen/Qwen2.5-7B-Instruct"
LLM_VERSION_CLEAN="Qwen2_5_7B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage2-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
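# Optional pre-flight check (a sketch, not part of the original script): stage 2 loads
# connector weights through --pretrain_mm_mlp_adapter, so it can help to confirm the
# file exists before requesting 32 GPUs.
# STAGE1_PROJECTOR is a hypothetical helper variable; its value repeats the placeholder
# used by the flag below and must be edited together with it.
STAGE1_PROJECTOR="Your_stage_checkpoint_path/mm_projector.bin"
if [ ! -f "${STAGE1_PROJECTOR}" ]; then
    echo "WARNING: ${STAGE1_PROJECTOR} not found; update --pretrain_mm_mlp_adapter" >&2
fi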
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--pretrain_mm_mlp_adapter="Your_stage_checkpoint_path/mm_projector.bin" \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage2-visual_pretraining/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 12000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--attn_implementation sdpa \
--frames_upbound 8 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_close_init True \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage2-visual_pretraining/${MID_RUN_NAME}.log
# You can delete the sdpa attn_implementation if you want to use flash attn
================================================
FILE: llava-train_videochat/scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res224_qwen_7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage2_short_pretrain_iv6m.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-large"
VISION_MODEL_VERSION_CLEAN="umt-large"
LLM_VERSION="Qwen/Qwen2-7B-Instruct"
LLM_VERSION_CLEAN="Qwen2_7B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage2-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
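# Illustrative arithmetic (a sketch, not part of the original script): the effective
# global batch size is NUM_GPU x per-device batch size x gradient accumulation steps.
# PER_DEVICE_BS and GRAD_ACCUM are hypothetical helper variables that mirror
# --per_device_train_batch_size 1 and --gradient_accumulation_steps 8 below.
PER_DEVICE_BS=1
GRAD_ACCUM=8
GLOBAL_BATCH_SIZE=$(( ${NUM_GPU:-32} * PER_DEVICE_BS * GRAD_ACCUM ))
echo "GLOBAL_BATCH_SIZE: ${GLOBAL_BATCH_SIZE}"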
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--pretrain_mm_mlp_adapter="Your_stage_checkpoint_path/mm_projector.bin" \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage2-visual_pretraining/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 12000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--attn_implementation sdpa \
--frames_upbound 8 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_close_init True \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage2-visual_pretraining/${MID_RUN_NAME}.log
# You can delete the sdpa attn_implementation if you want to use flash attn
================================================
FILE: llava-train_videochat/scripts/train/stage2-visual_pretraining/stage2_umt_tome16_res448_qwen_1_5b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage2_short_pretrain_iv6m.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-hd-large"
VISION_MODEL_VERSION_CLEAN="umt-hd-large"
LLM_VERSION="Qwen/Qwen2.5-1.5B-Instruct"
LLM_VERSION_CLEAN="Qwen2_5_1_5B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage2-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--pretrain_mm_mlp_adapter="Your_stage_checkpoint_path/mm_projector.bin" \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage2-visual_pretraining/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 12000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 8 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--attn_implementation sdpa \
--frames_upbound 8 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_close_init True \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage2-visual_pretraining/${MID_RUN_NAME}.log
# You can delete the sdpa attn_implementation if you want to use flash attn
================================================
FILE: llava-train_videochat/scripts/train/stage3-video_sft/stage3_internvideo2_tome16_res224_qwen_7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage3_short-long_mix_sft.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="internvideo2"
VISION_MODEL_VERSION_CLEAN="internvideo2"
LLM_VERSION="Your_stage2_checkpoint_path"
LLM_VERSION_CLEAN="Qwen2_5_7B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage3-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage3-video_sft/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 10000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 6 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--frames_upbound 512 \
--frames_lowbound 64 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage3-video_sft/${MID_RUN_NAME}.log
# No --attn_implementation sdpa flag is passed here, so flash attn is used; add the flag if flash attn is unavailable.
================================================
FILE: llava-train_videochat/scripts/train/stage3-video_sft/stage3_umt_tome16_res224_qwen_7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage3_short-long_mix_sft.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-large"
VISION_MODEL_VERSION_CLEAN="umt-large"
LLM_VERSION="Your_stage2_checkpoint_path"
LLM_VERSION_CLEAN="Qwen2_7B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage3-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage3-video_sft/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 10000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 6 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--frames_upbound 512 \
--frames_lowbound 64 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage3-video_sft/${MID_RUN_NAME}.log
# No --attn_implementation sdpa flag is passed here, so flash attn is used; add the flag if flash attn is unavailable.
================================================
FILE: llava-train_videochat/scripts/train/stage3-video_sft/stage3_umt_tome16_res448_qwen_1_5b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage3_short-long_mix_sft.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-hd-large"
VISION_MODEL_VERSION_CLEAN="umt-hd-large"
LLM_VERSION="Your_stage2_checkpoint_path"
LLM_VERSION_CLEAN="Qwen2_5_1_5B"
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage3-${VISION_MODEL_VERSION}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter,mm_language_model" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage3-video_sft/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 10000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 6 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--frames_upbound 512 \
--frames_lowbound 64 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage3-video_sft/${MID_RUN_NAME}.log
# No --attn_implementation sdpa flag is passed here, so flash attn is used; add the flag if flash attn is unavailable.
================================================
FILE: llava-train_videochat/scripts/train/stage4_highres_postft/stage4_umt_tome16_res448_qwen_7b.sh
================================================
export OMP_NUM_THREADS=1
export DISABLE_ADDMM_CUDA_LT=1
export TORCH_CUDNN_USE_HEURISTIC_MODE_B=1
DATA_VERSION="data/stage4_highres_postsft.yaml"
DATA_VERSION_CLEAN=$(basename "$DATA_VERSION")
VISION_MODEL_VERSION="umt-hd-large"
VISION_MODEL_VERSION_CLEAN="umt-hd-large"
# NOTE Please modify vision_tower="umt-hd-large" in Your_stage3_checkpoint_path/config.json first!
LLM_VERSION_CLEAN="Qwen2_7B"
LLM_VERSION="Your_stage3_checkpoint_path"
LLM_VERSION_CLEAN=$(basename "$LLM_VERSION")
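# Optional sketch (not part of the original script): print the vision-tower entries of
# the stage-3 config.json so the manual edit requested in the NOTE above can be checked
# before launching. This only reads the file and assumes nothing about the exact key name.
# CONFIG_JSON is a hypothetical helper variable introduced only for this example.
CONFIG_JSON="${LLM_VERSION:-Your_stage3_checkpoint_path}/config.json"
if [ -f "${CONFIG_JSON}" ]; then
    grep -n "vision_tower" "${CONFIG_JSON}" || echo "no vision_tower entry in ${CONFIG_JSON}"
else
    echo "config not found: ${CONFIG_JSON}" >&2
fi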
mm_projector_type=tome16_mlp_hd64
PROMPT_VERSION="qwen_2"
MID_RUN_NAME=stage4-${VISION_MODEL_VERSION_CLEAN}-${mm_projector_type}_${LLM_VERSION_CLEAN}_${DATA_VERSION_CLEAN}_$(date +"%Y%m%d_%H%M%S")
echo "MID_RUN_NAME: ${MID_RUN_NAME}"
PARTITION='video'
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPU=32
# NOTE: If you don't use Slurm, please refer to https://github.com/LLaVA-VL/LLaVA-NeXT/blob/main/scripts/train/pretrain_clip.sh for the training command.
srun -p ${PARTITION} \
--job-name=${JOB_NAME} \
--ntasks=${NUM_GPU} \
--gres=gpu:8 \
--ntasks-per-node=8 \
--cpus-per-task=16 \
--kill-on-bad-exit=1 \
python -u llava/train/train_mem.py \
--deepspeed scripts/zero1.json \
--model_name_or_path ${LLM_VERSION} \
--version ${PROMPT_VERSION} \
--data_path ${DATA_VERSION} \
--vision_tower ${VISION_MODEL_VERSION} \
--mm_tunable_parts="mm_vision_tower,mm_mlp_adapter" \
--mm_vision_tower_lr=2e-6 \
--mm_vision_select_layer -2 \
--mm_projector_type ${mm_projector_type} \
--mm_use_im_start_end False \
--mm_use_im_patch_token False \
--group_by_modality_length True \
--image_aspect_ratio anyres_nopad \
--image_grid_pinpoints "(1x1),...,(6x6)" \
--mm_patch_merge_type spatial_nopad \
--mm_newline_position nothing \
--bf16 True \
--run_name $MID_RUN_NAME \
--output_dir ./checkpoints/stage4-highres_postsft/${MID_RUN_NAME} \
--num_train_epochs 1 \
--per_device_train_batch_size 1 \
--per_device_eval_batch_size 4 \
--gradient_accumulation_steps 8 \
--evaluation_strategy "no" \
--save_strategy "steps" \
--save_steps 8000 \
--save_total_limit 1 \
--learning_rate 1e-5 \
--weight_decay 0. \
--warmup_ratio 0.03 \
--lr_scheduler_type "cosine" \
--logging_steps 1 \
--tf32 True \
--model_max_length 32768 \
--gradient_checkpointing True \
--dataloader_num_workers 4 \
--lazy_preprocess True \
--report_to tensorboard \
--torch_compile True \
--torch_compile_backend "inductor" \
--dataloader_drop_last True \
--frames_upbound 512 \
--frames_lowbound 64 \
--time_msg short \
--local_num_frames 4 \
--vision_encode_type video_image \
--sample_type dynamic_fps1 \
--mm_local_num_frames 4 \
--verbose_logging True >> ./output_logs/stage4-highres_postsft/${MID_RUN_NAME}.log
# No --attn_implementation sdpa flag is passed here, so flash attn is used; add the flag if flash attn is unavailable.
================================================
FILE: llava-train_videochat/scripts/zero1.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 1,
"reduce_bucket_size": 500000000.0
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: llava-train_videochat/scripts/zero2.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": false,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: llava-train_videochat/scripts/zero2_fused_adamw.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"allgather_partitions": true,
"allgather_bucket_size": 2e8,
"overlap_comm": true,
"reduce_scatter": true,
"reduce_bucket_size": 2e8,
"contiguous_gradients": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: llava-train_videochat/scripts/zero2_offload.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"train_micro_batch_size_per_gpu": "auto",
"train_batch_size": "auto",
"gradient_accumulation_steps": "auto",
"zero_optimization": {
"stage": 2,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto"
}
}
================================================
FILE: llava-train_videochat/scripts/zero3.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: llava-train_videochat/scripts/zero3_offload.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"steps_per_print": 1e5,
"wall_clock_breakdown": false
}
================================================
FILE: llava-train_videochat/scripts/zero3pp.json
================================================
{
"fp16": {
"enabled": "auto",
"loss_scale": 0,
"loss_scale_window": 1000,
"initial_scale_power": 16,
"hysteresis": 2,
"min_loss_scale": 1
},
"bf16": {
"enabled": "auto"
},
"optimizer": {
"type": "AdamW",
"params": {
"lr": "auto",
"betas": "auto",
"eps": "auto",
"weight_decay": "auto"
}
},
"zero_optimization": {
"stage": 3,
"offload_optimizer": {
"device": "none",
"pin_memory": true
},
"offload_param": {
"device": "none",
"pin_memory": true
},
"overlap_comm": true,
"contiguous_gradients": true,
"zero_quantized_weights": true,
"zero_hpz_partition_size": 16,
"zero_quantized_gradients": true,
"sub_group_size": 1e9,
"reduce_bucket_size": "auto",
"stage3_prefetch_bucket_size": "auto",
"stage3_param_persistence_threshold": "auto",
"stage3_max_live_parameters": 1e9,
"stage3_max_reuse_distance": 1e9,
"stage3_gather_16bit_weights_on_model_save": true
},
"gradient_accumulation_steps": "auto",
"gradient_clipping": "auto",
"steps_per_print": 100,
"train_batch_size": "auto",
"train_micro_batch_size_per_gpu": "auto",
"wall_clock_breakdown": false
}
================================================
FILE: lmms-eval_videochat/.gitignore
================================================
env
*.pyc
output/
data/
lm_cache
.idea
build
dist
*.egg-info
venv
.vscode/
temp
__pycache__
.ipynb_checkpoints
temp
.DS_STORE
# IPython
profile_default/
ipython_config.py
logs/
wandb/
SimSun.ttf
submissions/
lmms_eval/tasks/hallusion_bench/hallusion_output_vs_model.json
lmms_eval/tasks/hallusion_bench/hallusion_output_vd_model.json
zk.log
cache_dir
ckpt
pretrained/
LLaVA/
*logs
temp/
logs/
data/
================================================
FILE: lmms-eval_videochat/.pre-commit-config.yaml
================================================
repos:
- repo: https://github.com/psf/black
rev: 23.12.1
hooks:
- id: black
language_version: python3
================================================
FILE: lmms-eval_videochat/LICENSE
================================================
# For the main pipeline structure-related code, we maintain the original license provided with lm-evaluation-harness, which is the MIT License.
MIT License
Copyright (c) 2024 LMMs-Lab
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# For the multimodal models and datasets that we have added (defined as code in the lmms_eval/tasks and lmms_eval/models folders), we apply the Apache License.
Apache 2.0 License
Copyright (c) 2024 LMMs-Lab
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
When modifying the code, please include the following information about the original lmms-eval source:
# Adopted from lmms-eval from https://github.com/EvolvingLMMs-Lab/lmms-eval. Below is the original copyright:
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
================================================
FILE: lmms-eval_videochat/README.md
================================================
# How to use
We have modified how lmms-eval loads data: instead of loading from Hugging Face, the data is loaded locally. You therefore need to **specify the data path** in the YAML file of each task. The data can be downloaded from the [lmms-eval](https://huggingface.co/lmms-lab) Hugging Face organization or from the official repos of the corresponding tasks.
## Installation
You can install the package by cloning the repository and running the following command:
```bash
git clone https://github.com/OpenGVLab/VideoChat-Flash
cd VideoChat-Flash/lmms-eval_videochat
pip install -e .
```
We provide all evaluation [scripts](scripts) and [annotations](eval_annotations) here.
To evaluate a single task:
```bash
TASK=mvbench
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
```
To evaluate several tasks in one run:
```bash
TASK=videomme,videomme_w_subtitle
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
```
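The shell variables in the two scripts above are derived mechanically from `TASK`, `MODEL_NAME`, and `MAX_NUM_FRAMES`. The Python sketch below mirrors those derivations for illustration only. The function names `task_suffix`, `pick_master_port`, and `output_path` are hypothetical and are not part of `lmms_eval`.

```python
import random
from datetime import datetime


def task_suffix(task: str) -> str:
    """Mirror of the shell expansion ${TASK//,/_}: every comma becomes an underscore."""
    return task.replace(",", "_")


def pick_master_port(base: int = 18000, spread: int = 100) -> int:
    """Mirror of $((18000 + $RANDOM % 100)): a port in [base, base + spread)."""
    return base + random.randrange(spread)


def output_path(script_name: str, model_name: str, max_num_frames: int, now: datetime) -> str:
    """Mirror of ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}."""
    job_name = f"{script_name}_{now:%Y%m%d_%H%M%S}"
    return f"./logs/{job_name}_{model_name}_f{max_num_frames}"


print(task_suffix("videomme,videomme_w_subtitle"))
print(output_path("eval.sh", "videochat_flash", 512, datetime.now()))
```

The random port offset only reduces the chance that two jobs launched on the same machine pick the same port; it does not rule out a collision.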
For reproducibility, we provide our [evaluation log](https://github.com/OpenGVLab/VideoChat-Flash/blob/main/lmms-eval_videochat/videochat-flash-7B%40448_eval_log_videomme.json) for videomme.
================================================
FILE: lmms-eval_videochat/docs/README.md
================================================
# LMMs Eval Documentation
Welcome to the docs for `lmms-eval`!
The majority of this documentation is adapted from [lm-eval-harness](https://github.com/EleutherAI/lm-evaluation-harness/).
## Table of Contents
* To learn about the command line flags, see the [commands](commands.md).
* To learn how to add a new model, see the [Model Guide](model_guide.md).
* For a crash course on adding new tasks to the library, see our [Task Guide](task_guide.md).
* If you need to upload your datasets into correct HF format with viewer supported, please refer to [tools](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/pufanyi/hf_dataset_docs/tools)
================================================
FILE: lmms-eval_videochat/docs/commands.md
================================================
# User Guide
This document details the interface exposed by `lmms_eval` and provides details on what flags are available to users.
## Command-line Interface
The library is run via the `lmms_eval` entrypoint at the command line.
This mode supports a number of command-line arguments, the details of which can also be seen by running with `-h` or `--help`:
* `--model` : Selects which model type or provider is evaluated. Must be a model registered under `lmms_eval/models`. For example, `--model qwen_vl` or `--model llava`.
* `--model_args` : Controls parameters passed to the model constructor. Accepts a string of comma-separated keyword arguments for the model class in the format `"arg1=val1,arg2=val2,..."`, for example `--model_args pretrained=liuhaotian/llava-v1.5-7b,batch_size=1`. For the full list of supported keyword arguments, see the initialization of the corresponding model class in `lmms_eval/models/`.
* `--tasks` : Determines which tasks or task groups are evaluated. Accepts a comma-separated list of task names or task group names, all of which must be valid tasks or groups. Use `--tasks list` to see all available tasks. If a task you added does not appear in the list, try setting `--verbosity=DEBUG` to view the error message. You can also use `--tasks list_with_num` to list every task together with the number of questions it contains. However, `list_with_num` downloads all available datasets and may require a lot of memory and time.
* `--batch_size` : Sets the batch size used for evaluation. Can be a positive integer or `"auto"` to automatically select the largest batch size that will fit in memory, speeding up evaluation. One can pass `--batch_size auto:N` to re-select the maximum batch size `N` times during evaluation. This can help accelerate evaluation further, since `lm-eval` sorts documents in descending order of context length.
* `--output_path` : A string of the form `dir/file.jsonl` or `dir/`. Provides a path where high-level results will be saved, either into the file named or into the directory named. If `--log_samples` is passed as well, then per-document outputs and metrics will be saved into the directory as well.
* `--log_samples` : If this flag is passed, then the model's outputs, and the text fed into the model, will be saved at per-document granularity. Must be used with `--output_path`.
* `--limit` : Accepts an integer, or a float between 0.0 and 1.0. If passed, limits the number of documents evaluated per task to the first X documents (if an integer) or the first X% of documents (if a float). Useful for debugging, especially on costly API models.
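
To make the `--model_args` format and the `--limit` semantics above concrete, here is a minimal Python sketch. It is not the parser used by `lmms_eval`; `parse_model_args` and `docs_to_evaluate` are hypothetical helpers, and rounding a fractional limit up is an assumption of this sketch.

```python
import math


def parse_model_args(arg_string: str) -> dict:
    """Split "k1=v1,k2=v2" into {"k1": "v1", "k2": "v2"}; values stay strings."""
    result = {}
    for pair in arg_string.split(","):
        if not pair:
            continue  # tolerate empty input or a trailing comma
        key, _, value = pair.partition("=")
        result[key.strip()] = value.strip()
    return result


def docs_to_evaluate(limit, num_docs: int) -> int:
    """Number of documents kept per task for a given --limit value."""
    if limit is None:
        return num_docs
    if isinstance(limit, float) and 0.0 < limit < 1.0:
        # Fractional limit: rounding up is an assumption of this sketch.
        return min(num_docs, math.ceil(num_docs * limit))
    return min(num_docs, int(limit))


print(parse_model_args("pretrained=liuhaotian/llava-v1.5-7b,batch_size=1"))
print(docs_to_evaluate(10, 500), docs_to_evaluate(0.25, 10))
```

Because values are kept as strings in this sketch, a model constructor receiving them would still need to convert numeric arguments itself.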
## Usage with SRT API
> install sglang
```bash
git clone https://github.com/sgl-project/sglang.git
# Current version is tested on #1222
cd sglang;
pip install -e "python[srt]"
# Install FlashInfer CUDA kernels
pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.4/
```
> run sglang backend service with the following command
```bash
# After update, there is no need to use an extra command to setup backend server
# the server will be initialized in the init process
# launch lmms-eval srt_api model
CKPT_PATH=$1
TASK=$2
MODALITY=$3
TP_SIZE=$4
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
python3 -m lmms_eval \
--model srt_api \
--model_args modality=$MODALITY,model_version=$CKPT_PATH,tp=$TP_SIZE,host=127.0.0.1,port=30000,timeout=600 \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
If you encounter errors with the above command, you may need to install the following dependencies:
```bash
pip install httpx==0.23.3
pip install protobuf==3.20
```
================================================
FILE: lmms-eval_videochat/docs/current_tasks.md
================================================
# Current Tasks
> () indicates the task name in the lmms_eval. The task name is also used to specify the dataset in the configuration file.
> The following documentation is updated manually. You can use `lmms_eval --tasks list` to list all supported tasks and their task names.
- AI2D (ai2d)
- ChartQA (chartqa)
- CMMMU (cmmmu)
- CMMMU Validation (cmmmu_val)
- CMMMU Test (cmmmu_test)
- COCO Caption (coco_cap)
- COCO 2014 Caption (coco2014_cap)
- COCO 2014 Caption Validation (coco2014_cap_val)
- COCO 2014 Caption Test (coco2014_cap_test)
- COCO 2017 Caption (coco2017_cap)
- COCO 2017 Caption MiniVal (coco2017_cap_val)
- COCO 2017 Caption MiniTest (coco2017_cap_test)
- [ConBench](https://github.com/foundation-multimodal-models/ConBench) (conbench)
- DOCVQA (docvqa)
- DOCVQA Validation (docvqa_val)
- DOCVQA Test (docvqa_test)
- Ferret (ferret)
- Ferret Test (ferret_test)
- Flickr30K (flickr30k)
- GQA (gqa)
- HallusionBenchmark (hallusion_bench_image)
- Infographic VQA (info_vqa)
- Infographic VQA Validation (info_vqa_val)
- Infographic VQA Test (info_vqa_test)
- LLaVA-Bench (llava_in_the_wild)
- LLaVA-Bench-COCO (llava_bench_coco)
- MathVerse (mathverse)
- MathVerse Text Dominant (mathverse_testmini_text_dominant)
- MathVerse Text Only (mathverse_testmini_text_only)
- MathVerse Text Lite (mathverse_testmini_text_lite)
- MathVerse Vision Dominant (mathverse_testmini_vision_dominant)
- MathVerse Vision Intensive (mathverse_testmini_vision_intensive)
- MathVerse Vision Only (mathverse_testmini_vision_only)
- MathVista (mathvista)
- MathVista Validation (mathvista_testmini)
- MathVista Test (mathvista_test)
- MMBench (mmbench)
- MMBench English (mmbench_en)
- MMBench English Dev (mmbench_en_dev)
- MMBench English Test (mmbench_en_test)
- MMBench Chinese (mmbench_cn)
- MMBench Chinese Dev (mmbench_cn_dev)
- MMBench Chinese Test (mmbench_cn_test)
- MME (mme)
- MMMU (mmmu)
- MMMU Validation (mmmu_val)
- MMMU Test (mmmu_test)
- MMStar (mmstar)
- MMUPD (mmupd)
- MMUPD Base (mmupd_base)
- MMAAD Base (mmaad_base)
- MMIASD Base (mmiasd_base)
- MMIVQD Base (mmivqd_base)
- MMUPD Option (mmupd_option)
- MMAAD Option (mmaad_option)
- MMIASD Option (mmiasd_option)
- MMIVQD Option (mmivqd_option)
- MMUPD Instruction (mmupd_instruction)
- MMAAD Instruction (mmaad_instruction)
- MMIASD Instruction (mmiasd_instruction)
- MMIVQD Instruction (mmivqd_instruction)
- MMVet (mmvet)
- Multi-DocVQA (multidocvqa)
- Multi-DocVQA Validation (multidocvqa_val)
- Multi-DocVQA Test (multidocvqa_test)
- NoCaps (nocaps)
- NoCaps Validation (nocaps_val)
- NoCaps Test (nocaps_test)
- OKVQA (ok_vqa)
- OKVQA Validation 2014 (ok_vqa_val2014)
- POPE (pope)
- RefCOCO (refcoco)
- refcoco_seg_test
- refcoco_seg_val
- refcoco_seg_testA
- refcoco_seg_testB
- refcoco_bbox_test
- refcoco_bbox_val
- refcoco_bbox_testA
- refcoco_bbox_testB
- RefCOCO+ (refcoco+)
- refcoco+_seg
- refcoco+_seg_val
- refcoco+_seg_testA
- refcoco+_seg_testB
- refcoco+_bbox
- refcoco+_bbox_val
- refcoco+_bbox_testA
- refcoco+_bbox_testB
- RefCOCOg (refcocog)
- refcocog_seg_test
- refcocog_seg_val
- refcocog_bbox_test
- refcocog_bbox_val
- ScienceQA (scienceqa_full)
- ScienceQA Full (scienceqa)
- ScienceQA IMG (scienceqa_img)
- ScreenSpot (screenspot)
- ScreenSpot REC / Grounding (screenspot_rec)
- ScreenSpot REG / Instruction Generation (screenspot_reg)
- SeedBench (seedbench)
- SeedBench 2 (seedbench_2)
- SeedBench 2 Plus (seedbench_2_plus)
- ST-VQA (stvqa)
- TextCaps (textcaps)
- TextCaps Validation (textcaps_val)
- TextCaps Test (textcaps_test)
- TextVQA (textvqa)
- TextVQA Validation (textvqa_val)
- TextVQA Test (textvqa_test)
- VizWizVQA (vizwiz_vqa)
- VizWizVQA Validation (vizwiz_vqa_val)
- VizWizVQA Test (vizwiz_vqa_test)
- VQAv2 (vqav2)
- VQAv2 Validation (vqav2_val)
- VQAv2 Test (vqav2_test)
- WebSRC (websrc)
- WebSRC Validation (websrc_val)
- WebSRC Test (websrc_test)
================================================
FILE: lmms-eval_videochat/docs/model_guide.md
================================================
# New Model Guide
In order to properly evaluate a given LM, we require implementation of a wrapper class subclassing the `lmms_eval.api.model.lmms` class, that defines how the lmms_eval should interface with your model. This guide walks through how to write this `lmms` subclass via adding it to the library!
## Setup
To get started contributing, fork the main repo, clone it, create a branch named after your model, and install the project requirements in your environment:
```sh
# After forking...
git clone https://github.com/<YOUR-USERNAME>/lmms-eval.git
cd lmms-eval
git checkout -b <model-type>
pip install -e .
```
Now, we'll create a new file where we'll be adding our model:
```sh
touch lmms_eval/models/<my_model_filename>.py
```
**As a rule of thumb, we recommend you to use `lmms_eval/models/qwen_vl.py` and `lmms_eval/models/instructblip.py` as reference implementations for your model. You can copy and paste the contents of one of these files into your new file to get started.**
## Interface
All models must subclass the `lmms_eval.api.model.lmms` class.
The lmms class enforces a common interface via which we can extract responses from a model:
```python
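from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms


class MyCustomLM(lmms):
    # ...

    def loglikelihood(self, requests: list[Instance]) -> list[tuple[float, bool]]:
        ...  # score each target string given its context

    def generate_until(self, requests: list[Instance]) -> list[str]:
        ...  # generate text for each request

    # ...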
```
Where `Instance` is a dataclass defined in [`lmms_eval.api.instance`](https://github.com/EvolvingLMMs-Lab/lmms-eval/tree/main/lmms_eval/api/instance.py) with property `args` of request-dependent type signature described below.
We support the request types listed below, each a different interaction with or measurement of an autoregressive LM.
Every request type takes as input `requests` of type `list[Instance]` whose `Instance.request_type` matches the method name. You can check [construct_requests](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/api/task.py#L918) to see how the arguments are constructed for each output type.
- `generate_until`
- Each request contains `Instance.args : Tuple[str, dict]` containing 1. an input string to the LM and 2. a dictionary of keyword arguments used to control generation parameters.
- In each `Instance.args` there are 6 elements: `contexts, all_gen_kwargs, doc_to_visual, doc_id, task, split`. `contexts` is the formatted question and serves as the text input for the LMM. It may contain image tokens, which need to be handled differently for different models. `all_gen_kwargs` is the dict that contains all the generation configuration for the model. Use `doc_id`, `task`, and `split` to access the dataset, then apply `doc_to_visual`, a function reference, to process the image. When you implement your own model, use these to write your own `generate_until` function.
- Using this input and these generation parameters, text will be sampled from the language model (typically until a maximum output length or specific stopping string sequences--for example, `{"until": ["\n\n", "."], "max_gen_toks": 128}`).
- The generated input+output text from the model will then be returned.
- `loglikelihood`
- Each request contains `Instance.args : Tuple[str, str]` containing 1. an input string to the LM and 2. a target string on which the loglikelihood of the LM producing this target, conditioned on the input, will be returned.
- In each `Instance.args` there are 6 elements: `contexts, doc_to_target, doc_to_visual, doc_id, task, split`. `contexts` is the formatted question and serves as the text input for the LMM. It may contain image tokens, which need to be handled differently for different models. `doc_to_target` is a function reference that gets the answer from the doc. The answer is the continuation, and only the tokens belonging to it should count toward the loglikelihood.
- Each request returns `(ll, is_greedy): Tuple[float, int]`, where `ll` is a floating point number representing the log probability of generating the target string conditioned on the input, and `is_greedy` is either `0` or `1`: it is `1` if and only if the target string *would be generated by greedy sampling from the LM* (that is, if the target string is the *most likely* N-token string to be output by the LM given the input).
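
The two request types can be sketched with a toy model, shown below. This is an illustration under stated assumptions, not a reference implementation: `Instance` here is a local stand-in for `lmms_eval.api.instance.Instance`, `ToyModel` does not subclass `lmms`, the `task_dict` layout (task, then split, then a list of docs) is assumed, and the "model" is just a fixed token probability table.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    """Stand-in for lmms_eval.api.instance.Instance (only the fields used here)."""
    request_type: str
    args: tuple


class ToyModel:
    """Hypothetical model backed by a fixed token -> log-probability table."""

    def __init__(self, task_dict, token_logprobs):
        self.task_dict = task_dict            # assumed layout: task -> split -> list of docs
        self.token_logprobs = token_logprobs  # same distribution at every position

    def generate_until(self, requests):
        outputs = []
        for request in requests:
            contexts, gen_kwargs, doc_to_visual, doc_id, task, split = request.args
            doc = self.task_dict[task][split][doc_id]
            visuals = doc_to_visual(doc)
            # A real model would run generation here; this toy just echoes its inputs.
            text = f"{len(visuals)} visual(s) seen for: {contexts}"
            for stop in gen_kwargs.get("until", []):
                text = text.split(stop)[0]  # cut at the first stop sequence
            tokens = text.split()[: gen_kwargs.get("max_gen_toks", 128)]
            outputs.append(" ".join(tokens))
        return outputs

    def loglikelihood(self, requests):
        greedy_token = max(self.token_logprobs, key=self.token_logprobs.get)
        results = []
        for request in requests:
            contexts, doc_to_target, doc_to_visual, doc_id, task, split = request.args
            doc = self.task_dict[task][split][doc_id]
            target_tokens = doc_to_target(doc).split()
            # Only the target (continuation) tokens contribute to the score.
            ll = sum(self.token_logprobs.get(tok, float("-inf")) for tok in target_tokens)
            is_greedy = all(tok == greedy_token for tok in target_tokens)
            results.append((ll, is_greedy))
        return results


task_dict = {"toy_task": {"test": [{"image": "img0.png", "answer": "yes"}]}}
model = ToyModel(task_dict, {"yes": math.log(0.7), "no": math.log(0.3)})
gen_req = Instance("generate_until", ("Is there a cat?", {"until": ["\n\n"], "max_gen_toks": 16},
                                      lambda d: [d["image"]], 0, "toy_task", "test"))
ll_req = Instance("loglikelihood", ("Is there a cat?", lambda d: d["answer"],
                                    lambda d: [d["image"]], 0, "toy_task", "test"))
print(model.generate_until([gen_req]))
print(model.loglikelihood([ll_req]))
```

In this toy setting the target `"yes"` is the most likely token, so `is_greedy` is true; a target of `"no"` would yield a lower `ll` and a false `is_greedy`.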
## Registration
Congrats on implementing your model! Now it's time to test it out.
To make your model usable via the command line interface to `lmms_eval`, you'll need to tell `lmms_eval` what your model's name is.
This is done via a *decorator*, `lmms_eval.api.registry.register_model`. `register_model()` both tells the package which name(s) to use for the model when invoking it with `python -m lmms_eval --model <name>` and alerts `lmms_eval` to the model's existence.
```python
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model

@register_model("<name1>", "<name2>")
class MyCustomLM(lmms):
    ...
```
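The decorator pattern behind this kind of registration can be illustrated with a minimal stand-alone registry. This is a sketch of the general technique, not the code in `lmms_eval.api.registry`; `MODEL_REGISTRY` and `get_model` are hypothetical names, and the `register_model` defined here only imitates the real decorator's call shape.

```python
MODEL_REGISTRY = {}


def register_model(*names):
    """Return a class decorator that stores the class under each given name."""
    def decorate(cls):
        for name in names:
            if name in MODEL_REGISTRY:
                raise ValueError(f"model name {name!r} is already registered")
            MODEL_REGISTRY[name] = cls
        return cls  # the class itself is returned unchanged
    return decorate


def get_model(name):
    """Look up a registered class by name, with a readable error for unknown names."""
    try:
        return MODEL_REGISTRY[name]
    except KeyError:
        raise ValueError(f"unknown model {name!r}; known: {sorted(MODEL_REGISTRY)}") from None


@register_model("my_custom_lm", "my_custom_lm_alias")
class MyCustomLM:
    pass


print(get_model("my_custom_lm_alias") is MyCustomLM)
```

In a registry like this, registration happens only when the decorated class definition executes, so the module that defines the class has to be imported before its name can be resolved.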
The final step is to import your model in `lmms_eval/models/__init__.py`:
```python
from .my_model_filename import MyCustomLM
```
================================================
FILE: lmms-eval_videochat/docs/run_examples.md
================================================
# User Guide
This document provides running examples for different models in `lmms_eval`. It includes commands for preparing the environment for each model and commands for running it.
## Environmental Variables
Before running experiments and evaluations, we recommend exporting the following environment variables. Some are necessary for certain tasks to run.
```bash
export OPENAI_API_KEY=""
export HF_HOME=""
export HF_TOKEN=""
export HF_HUB_ENABLE_HF_TRANSFER="1"
export REKA_API_KEY=""
# Other possible environment variables include
# ANTHROPIC_API_KEY,DASHSCOPE_API_KEY etc.
```
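A quick way to confirm that the variables are visible to Python before launching a long evaluation is sketched below. `missing_env_vars` is a hypothetical helper, and which variables count as required depends on the tasks and models you run, so the list passed in is only an example.

```python
import os


def missing_env_vars(names, environ=None):
    """Return the names that are unset or empty in the given environment mapping."""
    environ = os.environ if environ is None else environ
    return [name for name in names if not environ.get(name)]


# Example selection only; adjust to the tasks and API models you actually use.
to_check = ["OPENAI_API_KEY", "HF_HOME", "HF_TOKEN"]
missing = missing_env_vars(to_check)
if missing:
    print("Unset or empty:", ", ".join(missing))
else:
    print("All checked variables are set.")
```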
## Some common environment issues
You may sometimes encounter common issues, for example errors related to `httpx` or `protobuf`. To solve them, first try:
```bash
python3 -m pip install httpx==0.23.3;
python3 -m pip install protobuf==3.20;
# numpy 2.x can sometimes cause errors; pin to 1.26 if so
python3 -m pip install numpy==1.26;
# sentencepiece is sometimes required for the tokenizer to work
python3 -m pip install sentencepiece;
```
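Before pinning anything, it can help to see which versions are currently installed. The sketch below uses the standard library's `importlib.metadata`; `installed_versions` is a hypothetical helper, and the package list simply repeats the packages named above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions(["httpx", "protobuf", "numpy", "sentencepiece"]).items():
    print(f"{name}: {version or 'not installed'}")
```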
# Image Model
### LLaVA
First, clone the `lmms_eval` repo and the [`llava`](https://github.com/LLaVA-VL/LLaVA-NeXT/tree/inference) repo:
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
cd /path/to/LLaVA-NeXT;
python3 -m pip install -e ".[train]";
TASK=$1
CKPT_PATH=$2
CONV_TEMPLATE=$3
MODEL_NAME=$4
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
#mmbench_en_dev,mathvista_testmini,llava_in_the_wild,mmvet
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model llava \
--model_args pretrained=$CKPT_PATH,conv_template=$CONV_TEMPLATE,model_name=$MODEL_NAME \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
If you are trying to use large LLaVA models such as LLaVA-NeXT-Qwen1.5-72B, you can try adding `device_map=auto` to `model_args` and changing `num_processes` to 1.
### IDEFICS2
You do not need to clone any other repo to run IDEFICS2. It is enough to make sure that your transformers version supports idefics2.
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
python3 -m pip install transformers --upgrade;
TASK=$1
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model idefics2 \
--model_args pretrained=HuggingFaceM4/idefics2-8b \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### InternVL2
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
python3 -m pip install flash-attn --no-build-isolation;
python3 -m pip install torchvision einops timm sentencepiece;
TASK=$1
CKPT_PATH=$2
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12380 -m lmms_eval \
--model internvl2 \
--model_args pretrained=$CKPT_PATH \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### InternVL-1.5
First you need to fork [`InternVL`](https://github.com/OpenGVLab/InternVL)
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
cd /path/to/InternVL/internvl_chat
python3 -m pip install -e .;
python3 -m pip install flash-attn==2.3.6 --no-build-isolation;
TASK=$1
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model internvl \
--model_args pretrained="OpenGVLab/InternVL-Chat-V1-5" \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### Xcomposer-4KHD and Xcomposer-2d5
Neither of these two models requires an external repo.
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
python3 -m pip install flash-attn --no-build-isolation;
python3 -m pip install torchvision einops timm sentencepiece;
TASK=$1
MODALITY=$2
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
# For Xcomposer2d5
accelerate launch --num_processes 8 --main_process_port 10000 -m lmms_eval \
--model xcomposer2d5 \
--model_args pretrained="internlm/internlm-xcomposer2d5-7b",device="cuda",modality=$MODALITY \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
# For Xcomposer-4kHD
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model xcomposer2_4khd \
--model_args pretrained="internlm/internlm-xcomposer2-4khd-7b" \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### InstructBLIP
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
python3 -m pip install transformers --upgrade;
CKPT_PATH=$1
TASK=$2
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model instructblip \
--model_args pretrained=$CKPT_PATH \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix instructblip \
--output_path ./logs/
```
### SRT API MODEL
To speed up testing of larger LLaVA models, you can use the `srt_api` model, which runs evaluation through sglang.
You will first need to clone sglang from "https://github.com/sgl-project/sglang". The current version is tested on commit #1222 of sglang.
The script below installs the dependencies and runs the evaluation in one go.
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
cd /path/to/sglang;
python3 -m pip install -e "python[all]";
python3 -m pip install flashinfer -i https://flashinfer.ai/whl/cu121/torch2.3/
CKPT_PATH=$1
TASK=$2
MODALITY=$3
TP_SIZE=$4
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
python3 -m lmms_eval \
--model srt_api \
--model_args modality=$MODALITY,model_version=$CKPT_PATH,tp=$TP_SIZE,host=127.0.0.1,port=30000,timeout=600 \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
You can use the script in the `test` folder of `sglang` to kill all sglang services.
# API Model
### GPT
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
export OPENAI_API_KEY=""
TASK=$1
MODEL_VERSION=$2
MODALITIES=$3
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 30000 -m lmms_eval \
--model gpt4v \
--model_args model_version=$MODEL_VERSION,modality=$MODALITIES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### Claude
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
export ANTHROPIC_API_KEY=""
TASK=$1
MODEL_VERSION=$2
MODALITIES=$3
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model claude \
--model_args model_version=$MODEL_VERSION \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
# Video Model
### LLaVA-VID
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
cd /path/to/LLaVA-NeXT;
python3 -m pip install -e ".[train]";
python3 -m pip install flash-attn --no-build-isolation;
python3 -m pip install av;
TASK=$1
CKPT_PATH=$2
CONV_TEMPLATE=$3
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model llavavid \
--model_args pretrained=$CKPT_PATH,conv_template=$CONV_TEMPLATE,video_decode_backend=decord,max_frames_num=32 \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### LLaMA-VID
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
# Note: stay inside the LLaMA-VID folder when calling lmms-eval,
# because the processor's config lives inside that repo
cd /path/to/LLaMA-VID;
python3 -m pip install -e .
python3 -m pip install av sentencepiece;
TASK=$1
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model llama_vid \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### Video-LLaVA
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
python3 -m pip install transformers --upgrade;
python3 -m pip install av sentencepiece;
TASK=$1
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model video_llava \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### MPlug-Owl
Note that this model takes a long time to load, so please be patient :)
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
# It has to use an old transformers version to run
python3 -m pip install av sentencepiece protobuf==3.20 transformers==4.28.1 einops;
TASK=$1
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model mplug_owl_video \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
### Video-ChatGPT
```bash
cd /path/to/lmms-eval
python3 -m pip install -e .;
python3 -m pip install sentencepiece av;
TASK=$1
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
accelerate launch --num_processes 8 --main_process_port 12345 -m lmms_eval \
--model video_chatgpt \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/
```
================================================
FILE: lmms-eval_videochat/docs/task_guide.md
================================================
# Task Configuration
`lmms_eval` is meant to be an extensible and flexible framework within which many different evaluation tasks can be defined. All tasks in the new version of the harness are built around a YAML configuration file format.
These YAML configuration files, along with the current codebase commit hash, are intended to be shareable, so that providing the YAML config lets another researcher precisely replicate your evaluation setup when the prompt or setup differs from the standard `lmms_eval` task implementations.
While adding a standard evaluation task on a new dataset can occasionally be as simple as swapping out a Hugging Face dataset path in an existing file, more specialized evaluation setups also exist. Here we provide a crash course on the more advanced logic that can be implemented in YAML form.
## Good Reference Tasks
Contributing a new task can be daunting! Luckily, much of the work has often been done for you in a different, similarly evaluated task. Good examples of task implementations to study include:
Generation-based tasks:
- MME (`lmms_eval/tasks/mme/mme.yaml`)
```yaml
dataset_path: lmms-lab/MME
dataset_kwargs:
token: True
task: "mme"
test_split: test
output_type: generate_until
doc_to_visual: !function utils.mme_doc_to_visual
doc_to_text: !function utils.mme_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
# The return value of process_results will be used by metrics
process_results: !function utils.mme_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: mme_percetion_score
aggregation: !function utils.mme_aggregate_results
higher_is_better: true
- metric: mme_cognition_score
aggregation: !function utils.mme_aggregate_results
higher_is_better: true
lmms_eval_specific_kwargs:
default:
pre_prompt: ""
post_prompt: "\nAnswer the question using a single word or phrase."
qwen_vl:
pre_prompt: ""
post_prompt: " Answer:"
metadata:
- version: 0.0
```
You can pay special attention to the `process_results` and `metric_list` fields, which are used to define how the model output is post-processed and scored.
Also, the `lmms_eval_specific_kwargs` field is used to define model-specific prompt configurations. The `default` entry follows the LLaVA prompt format.
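As a rough illustration of how `process_results` and `metric_list` fit together, the sketch below assumes that `process_results` receives one document plus the list of model outputs and returns a dict keyed by metric name, and that the values collected under each key are later passed to that metric's `aggregation` function. The names `toy_process_results`, `toy_aggregate_results`, `toy_score` and the `category` field are hypothetical; this is not the actual `utils.mme_*` code.

```python
# Hypothetical sketch; not the real utils.mme_process_results / mme_aggregate_results.

def toy_process_results(doc, results):
    """Score one sample; the returned keys must match names in metric_list."""
    prediction = results[0].strip().lower()
    target = doc["answer"].strip().lower()
    score = 1.0 if prediction == target else 0.0
    # The value under each key is collected across all documents and then
    # handed to that metric's aggregation function.
    return {"toy_score": {"category": doc["category"], "score": score}}


def toy_aggregate_results(items):
    """Compute accuracy (in percent) per category, then sum over categories."""
    per_category = {}
    for item in items:
        per_category.setdefault(item["category"], []).append(item["score"])
    return sum(100.0 * sum(s) / len(s) for s in per_category.values())


docs = [
    {"answer": "Yes", "category": "existence"},
    {"answer": "No", "category": "existence"},
    {"answer": "Yes", "category": "count"},
]
outputs = [["yes"], ["yes"], ["Yes "]]
collected = [toy_process_results(d, o)["toy_score"] for d, o in zip(docs, outputs)]
total = toy_aggregate_results(collected)
print(total)
```

In a task YAML, such functions would be referenced with the `!function utils.<name>` syntax shown above, with `toy_score` listed as the `metric`.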
PPL-based tasks:
- Seedbench (`lmms_eval/tasks/seedbench/seedbench_ppl.yaml`)
```yaml
dataset_path: lmms-lab/SEED-Bench
dataset_kwargs:
token: True
task: "seedbench_ppl"
test_split: test
output_type: multiple_choice
doc_to_visual: !function utils.seed_doc_to_visual
doc_to_text: !function utils.seed_doc_to_text_mc
doc_to_choice : !function utils.seed_doc_to_choice
doc_to_target: !function utils.seed_doc_to_mc_target
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: acc
metadata:
- version: 0.0
```
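For `output_type: multiple_choice`, each candidate answer is scored by the model and the best-scoring candidate is compared with the target index. A minimal sketch of that selection step follows, assuming the per-choice log-likelihoods have already been computed; the function names and the numbers are made up for illustration and do not come from the `lmms_eval` code base.

```python
def pick_choice(loglikelihoods):
    """Return the index of the choice with the highest log-likelihood."""
    return max(range(len(loglikelihoods)), key=lambda i: loglikelihoods[i])


def toy_accuracy(samples):
    """samples: iterable of (per-choice log-likelihoods, gold index) pairs."""
    samples = list(samples)
    correct = sum(pick_choice(lls) == gold for lls, gold in samples)
    return correct / len(samples)


samples = [
    ([-4.2, -1.3, -3.8, -5.0], 1),  # model prefers choice 1, gold is 1
    ([-2.0, -2.5, -0.7, -3.1], 0),  # model prefers choice 2, gold is 0
]
acc = toy_accuracy(samples)
print(acc)
```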
## Configurations
Tasks are configured via the `TaskConfig` object. Below, we describe all fields usable within the object, and their role in defining a task.
### Parameters
Task naming + registration:
- **task** (`str`, defaults to None) — name of the task.
- **group** (`str`, *optional*) — name of the task group(s) a task belongs to. Enables one to run all tasks with a specified tag or group name at once.
Dataset configuration options:
- **dataset_path** (`str`) — The name of the dataset as listed by HF in the datasets Hub.
- **dataset_name** (`str`, *optional*, defaults to None) — The name of what HF calls a “config” or sub-task of the benchmark. If your task does not contain any data instances, just leave this to default to None. (If you're familiar with the HF `datasets.load_dataset` function, these are just the first 2 arguments to it.)
- **dataset_kwargs** (`dict`, *optional*) — Auxiliary arguments that `datasets.load_dataset` accepts. This can be used to specify arguments such as `data_files` or `data_dir` if you want to use local datafiles such as json or csv.
- **training_split** (`str`, *optional*) — Split in the dataset to use as the training split.
- **validation_split** (`str`, *optional*) — Split in the dataset to use as the validation split.
- **test_split** (`str`, *optional*) — Split in the dataset to use as the test split.
- **fewshot_split** (`str`, *optional*) — Split in the dataset to draw few-shot exemplars from. Must not be None if `num_fewshot` > 0. **This feature is not well tested so far.**
- **process_docs** (`Callable`, *optional*) — Optionally define a function to apply to each HF dataset split, to preprocess all documents before they are fed into prompt template rendering or other evaluation steps. Can be used to rename dataset columns, or to process documents into a format closer to the one expected by a prompt template.
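The dataset options above can be pictured as arguments that end up in a `datasets.load_dataset` call. The sketch below only builds those arguments, so it runs without the `datasets` package; `build_load_dataset_args`, the `query` column and the file path are hypothetical. A real `process_docs` receives a Hugging Face dataset split rather than a list of dicts, so the list-based version here is only a stand-in for the idea.

```python
def build_load_dataset_args(config):
    """Collect the arguments a task config would pass to datasets.load_dataset."""
    extra = dict(config.get("dataset_kwargs") or {})
    return {"path": config["dataset_path"], "name": config.get("dataset_name"), **extra}


def process_docs(rows):
    """Stand-in preprocessing: rename the `query` column to `question`."""
    return [
        {("question" if key == "query" else key): value for key, value in row.items()}
        for row in rows
    ]


# A task that reads a local JSON file instead of a Hub dataset.
local_task = {
    "dataset_path": "json",
    "dataset_kwargs": {"data_files": "/path/to/annotations.json"},
}
args = build_load_dataset_args(local_task)
# With the `datasets` package installed: datasets.load_dataset(**args)
print(args)
print(process_docs([{"query": "What is shown?", "id": 1}]))
```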
Prompting / in-context formatting options:
- **doc_to_text** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into the appropriate input for the model
- **doc_to_visual** (`Union[Callable, str]`, *optional*) — Function to process a sample into the appropriate input images for the model.
- **doc_to_target** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into the appropriate target output for the model. For multiple choice tasks, this should return an index into the list of answer choices, identifying the correct answer.
- **doc_to_choice** (`Union[Callable, str]`, *optional*) — Column name or function to process a sample into a list of possible string choices for `multiple_choice` tasks. Left undefined for `generate_until` tasks.
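The prompting hooks above might look roughly like the following. This is a sketch under assumptions: the `toy_*` names and the document fields (`question`, `image`, `choices`, `answer`) are hypothetical, and the second parameter of the text function is assumed to receive the matching entry of `lmms_eval_specific_kwargs` from the YAML file.

```python
def toy_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    """Wrap the question with the model-specific pre/post prompts."""
    kwargs = lmms_eval_specific_kwargs or {}
    return f"{kwargs.get('pre_prompt', '')}{doc['question']}{kwargs.get('post_prompt', '')}"


def toy_doc_to_visual(doc):
    """Return the visual inputs as a list (a single image per sample here)."""
    return [doc["image"]]


def toy_doc_to_choice(doc):
    """Return the candidate answers as a list of strings."""
    return list(doc["choices"])


def toy_doc_to_target(doc):
    """For multiple_choice tasks: index of the correct answer among the choices."""
    return toy_doc_to_choice(doc).index(doc["answer"])


doc = {
    "question": "Is there a dog?",
    "image": "<image object>",  # placeholder for a real image
    "choices": ["yes", "no"],
    "answer": "no",
}
default_kwargs = {
    "pre_prompt": "",
    "post_prompt": "\nAnswer the question using a single word or phrase.",
}
prompt = toy_doc_to_text(doc, default_kwargs)
print(prompt)
print(toy_doc_to_target(doc))
```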
Runtime configuration options:
- **num_fewshot** (`int`, *optional*, defaults to 0) — Number of few-shot examples before the input. **This feature is not well tested so far.**
- **batch_size** (`int`, *optional*, defaults to 1) — Batch size.
**So far some models (such as qwen) may not support batch size > 1. Some models (such as llava) will generate different scores for different batch sizes. We recommend setting batch size to 1 for final benchmarking runs.**
Scoring details:
- **metric_list** (`str`, *optional*, defaults to None) — A list of metrics to use for evaluation.
- **output_type** (`str`, *optional*, defaults to "generate_until") — Selects the type of model output for the given task. Options are `generate_until`, `loglikelihood`, and `multiple_choice`.
- **generation_kwargs** (`dict`, *optional*) — Auxiliary arguments for the `generate` function from HF transformers library. Advanced keyword arguments may not be supported for non-HF LM classes.
================================================
FILE: lmms-eval_videochat/eval_annotations/LVBench/README.md
================================================
---
license: mit
extra_gated_prompt: >-
You agree to not use the dataset to conduct experiments that cause harm to
human subjects. Please note that the data in this dataset may be subject to
other agreements. Before using the data, be sure to read the relevant
agreements carefully to ensure compliant use. Video copyrights belong to the
original video creators or platforms and are for academic research use only.
task_categories:
- visual-question-answering
extra_gated_fields:
Name: text
Company/Organization: text
Country: text
E-Mail: text
modalities:
- Video
- Text
configs:
- config_name: lvbench
data_files: json/lvbench_clean.json
- config_name: lvbench_cartoon
data_files: json/lvbench_clean_cartoon.json
- config_name: lvbench_documentary
data_files: json/lvbench_clean_documentary.json
- config_name: lvbench_live
data_files: json/lvbench_clean_live.json
- config_name: lvbench_selfmedia
data_files: json/lvbench_clean_selfmedia.json
- config_name: lvbench_sport
data_files: json/lvbench_clean_sport.json
- config_name: lvbench_tv
data_files: json/lvbench_clean_tv.json
language:
- en
size_categories:
- 1K-
The LongVideoBench dataset contains links to web videos for data collection
purposes. LongVideoBench does not own the content linked within this dataset;
all rights and copyright belong to the respective channel owners. Ensuring
compliance with platform terms and conditions is the responsibility of these
source channels. By accessing this dataset, you acknowledge and agree to the
following terms:
extra_gated_fields:
I understand that LongVideoBench does not own the videos in this dataset: checkbox
I understand that LongVideoBench is not the creator of the videos in this dataset: checkbox
I understand that, LongVideoBench may modify/delete its contents subject to the requirements of the creators or source platforms: checkbox
I agree to use this dataset for non-commercial use ONLY: checkbox
I agree with the data license (CC-BY-NC-SA 4-0) for this dataset: checkbox
task_categories:
- multiple-choice
- visual-question-answering
language:
- en
tags:
- long video understanding
- long context
- multimodal
- neurips 2024
pretty_name: longvideobench
---

# Dataset Card for LongVideoBench
Large multimodal models (LMMs) are handling increasingly longer and more complex inputs. However, few public benchmarks are available to assess these advancements. To address this, we introduce LongVideoBench, a question-answering benchmark with video-language interleaved inputs up to an hour long. It comprises 3,763 web-collected videos with subtitles across diverse themes, designed to evaluate LMMs on long-term multimodal understanding.
The main challenge that LongVideoBench targets is to accurately retrieve and reason over detailed information from lengthy inputs. We present a novel task called referring reasoning, where questions contain a referring query that references related video contexts, requiring the model to reason over these details.
LongVideoBench includes 6,678 human-annotated multiple-choice questions across 17 categories, making it one of the most comprehensive benchmarks for long-form video understanding. Evaluations show significant challenges even for advanced proprietary models (e.g., GPT-4o, Gemini-1.5-Pro, GPT-4-Turbo), with open-source models performing worse. Performance improves only when models process more frames, establishing LongVideoBench as a valuable benchmark for future long-context LMMs.
## Dataset Details
### Dataset Description
- **Curated by:** LongVideoBench Team
- **Language(s) (NLP):** English
- **License:** CC-BY-NC-SA 4.0
### Dataset Sources
- **Repository:** [https://github.com/longvideobench/LongVideoBench](https://github.com/longvideobench/LongVideoBench)
- **Homepage:** [https://longvideobench.github.io](https://longvideobench.github.io)
- **Leaderboard:** [https://huggingface.co/spaces/longvideobench/LongVideoBench](https://huggingface.co/spaces/longvideobench/LongVideoBench)
## Leaderboard (until Oct. 14, 2024)
We rank models by Test Total Performance.
| Model | Test Total (5341) | Test 8s-15s | Test 15s-60s | Test 180s-600s | Test 900s-3600s | Val Total (1337) |
| --- | --- | --- | --- | --- | --- | --- |
| [GPT-4o (0513) (256)](https://platform.openai.com/docs/models/gpt-4o) | 66.7 | 71.6 | 76.8 | 66.7 | 61.6 | 66.7 |
| [Aria (256)](https://huggingface.co/rhymes-ai/Aria) | 65.0 | 69.4 | 76.6 | 64.6 | 60.1 | 64.2 |
| [LLaVA-Video-72B-Qwen2 (128)](https://huggingface.co/lmms-lab/LLaVA-Video-72B-Qwen2) | 64.9 | 72.4 | 77.4 | 63.9 | 59.3 | 63.9 |
| [Gemini-1.5-Pro (0514) (256)](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-1.5-pro-001) | 64.4 | 70.2 | 75.3 | 65.0 | 59.1 | 64.0 |
| [LLaVA-OneVision-QWen2-72B-OV (32)](https://huggingface.co/lmms-lab/llava-onevision-qwen2-72b-ov) | 63.2 | 74.3 | 77.4 | 61.6 | 56.5 | 61.3 |
| [LLaVA-Video-7B-Qwen2 (128)](https://huggingface.co/lmms-lab/LLaVA-Video-7B-Qwen2) | 62.7 | 69.7 | 76.5 | 62.1 | 56.6 | 61.1 |
| [Gemini-1.5-Flash (0514) (256)](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemini-1.5-flash-001) | 62.4 | 66.1 | 73.1 | 63.1 | 57.3 | 61.6 |
| [GPT-4-Turbo (0409) (256)](https://platform.openai.com/docs/models/gpt-4-turbo-and-gpt-4) | 60.7 | 66.4 | 71.1 | 61.7 | 54.5 | 59.1 |
| [InternVL2-40B (16)](https://huggingface.co/OpenGVLab/InternVL2-40B) | 60.6 | 71.4 | 76.6 | 57.5 | 54.4 | 59.3 |
| [GPT-4o-mini (250)](https://platform.openai.com/docs/models/gpt-4o-mini) | 58.8 | 66.6 | 73.4 | 56.9 | 53.4 | 56.5 |
| [MiniCPM-V-2.6 (64)](https://huggingface.co/openbmb/MiniCPM-V-2_6) | 57.7 | 62.5 | 69.1 | 54.9 | 49.8 | 54.9 |
| [Qwen2-VL-7B (256)](https://huggingface.co/Qwen/Qwen2-VL-7B-Instruct) | 56.8 | 60.1 | 67.6 | 56.7 | 52.5 | 55.6 |
| [Kangaroo (64)](https://huggingface.co/KangarooGroup/kangaroo) | 54.8 | 65.6 | 65.7 | 52.7 | 49.1 | 54.2 |
| [PLLaVA-34B (32)](https://github.com/magic-research/PLLaVA) | 53.5 | 60.1 | 66.8 | 50.8 | 49.1 | 53.2 |
| [InternVL-Chat-V1-5-26B (16)](https://huggingface.co/OpenGVLab/InternVL-Chat-V1-5) | 51.7 | 61.3 | 62.7 | 49.5 | 46.6 | 51.2 |
| [LLaVA-Next-Video-34B (32)](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) | 50.5 | 57.6 | 61.6 | 48.7 | 45.9 | 50.5 |
| [Phi-3-Vision-Instruct (16)](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct) | 49.9 | 58.3 | 59.6 | 48.4 | 45.1 | 49.6 |
| [Idefics2 (16)](https://huggingface.co/HuggingFaceM4/idefics2-8b) | 49.4 | 57.4 | 60.4 | 47.3 | 44.7 | 49.7 |
| [Mantis-Idefics2 (16)](https://huggingface.co/TIGER-Lab/Mantis-8B-Idefics2) | 47.6 | 56.1 | 61.4 | 44.6 | 42.5 | 47.0 |
| [LLaVA-Next-Mistral-7B (8)](https://huggingface.co/llava-hf/llava-v1.6-mistral-7b-hf) | 47.1 | 53.4 | 57.2 | 46.9 | 42.1 | 49.1 |
| [PLLaVA-13B (32)](https://github.com/magic-research/PLLaVA) | 45.1 | 52.9 | 54.3 | 42.9 | 41.2 | 45.6 |
| [InstructBLIP-T5-XXL (8)](https://github.com/salesforce/LAVIS/tree/main/projects/instructblip) | 43.8 | 48.1 | 50.1 | 44.5 | 40.0 | 43.3 |
| [Mantis-BakLLaVA (16)](https://huggingface.co/TIGER-Lab/Mantis-bakllava-7b) | 43.7 | 51.3 | 52.7 | 41.1 | 40.1 | 43.7 |
| [BLIP-2-T5-XXL (8)](https://github.com/salesforce/LAVIS/tree/main/projects/blip2) | 43.5 | 46.7 | 47.4 | 44.2 | 40.9 | 42.7 |
| [LLaVA-Next-Video-M7B (32)](https://llava-vl.github.io/blog/2024-04-30-llava-next-video/) | 43.5 | 50.9 | 53.1 | 42.6 | 38.9 | 43.5 |
| [LLaVA-1.5-13B (8)](https://huggingface.co/llava-hf/llava-1.5-13b-hf) | 43.1 | 49.0 | 51.1 | 41.8 | 39.6 | 43.4 |
| [ShareGPT4Video (16)](https://github.com/InternLM/InternLM-XComposer/tree/main/projects/ShareGPT4Video) | 41.8 | 46.9 | 50.1 | 40.0 | 38.7 | 39.7 |
| [VideoChat2 (Mistral-7B) (16)](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) | 41.2 | 49.3 | 49.3 | 39.0 | 37.5 | 39.3 |
| [LLaVA-1.5-7B (8)](https://huggingface.co/llava-hf/llava-1.5-7b-hf) | 40.4 | 45.0 | 47.4 | 40.1 | 37.0 | 40.3 |
| [mPLUG-Owl2 (8)](https://github.com/X-PLUG/mPLUG-Owl/tree/main/mPLUG-Owl2) | 39.4 | 49.4 | 47.3 | 38.7 | 34.3 | 39.1 |
| [PLLaVA-7B (32)](https://github.com/magic-research/PLLaVA) | 39.2 | 45.3 | 47.3 | 38.5 | 35.2 | 40.2 |
| [VideoLLaVA (8)](https://github.com/PKU-YuanGroup/Video-LLaVA/) | 37.6 | 43.1 | 44.6 | 36.4 | 34.4 | 39.1 |
| [VideoChat2 (Vicuna 7B) (16)](https://github.com/OpenGVLab/Ask-Anything/tree/main/video_chat2) | 35.1 | 38.1 | 40.5 | 33.5 | 33.6 | 36.0 |
## Uses
1. Download the dataset via Hugging Face Client:
```shell
huggingface-cli download longvideobench/LongVideoBench --repo-type dataset --local-dir LongVideoBench --local-dir-use-symlinks False
```
2. Extract from the `.tar` files:
```shell
cat videos.tar.part.* > videos.tar
tar -xvf videos.tar
tar -xvf subtitles.tar
```
3. Use the [LongVideoBench](https://github.com/longvideobench/LongVideoBench) dataloader to load the data from raw MP4 files and subtitles:
- (a) Install the dataloader:
```shell
git clone https://github.com/LongVideoBench/LongVideoBench.git
cd LongVideoBench
pip install -e .
```
- (b) Load the dataset in python scripts:
```python
from longvideobench import LongVideoBenchDataset
# validation
dataset = LongVideoBenchDataset(YOUR_DATA_PATH, "lvb_val.json", max_num_frames=64)
# test
dataset = LongVideoBenchDataset(YOUR_DATA_PATH, "lvb_test_wo_gt.json", max_num_frames=64)
print(dataset[0]["inputs"]) # A list consisting of PIL.Image and strings.
```
The "inputs" are interleaved video frames and text subtitles, followed by questions and option prompts. You can then convert them to the format that your LMMs can accept.
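One possible way to convert the interleaved `inputs` list is sketched below: strings are kept as text, and every other item is treated as a frame and replaced by a placeholder token. The `<image>` token and the function name `split_interleaved` are assumptions for illustration; the placeholder your model expects may differ.

```python
def split_interleaved(inputs, image_token="<image>"):
    """Split an interleaved list into (prompt text, list of frames)."""
    parts, frames = [], []
    for item in inputs:
        if isinstance(item, str):
            parts.append(item)
        else:
            # Anything that is not a string is treated as a frame (e.g. a PIL.Image).
            parts.append(image_token)
            frames.append(item)
    return "\n".join(parts), frames


frame_a, frame_b = object(), object()  # stand-ins for PIL.Image frames
inputs = [
    frame_a,
    "hello everyone",
    frame_b,
    "Question: What happens next?",
    "A. She waves",
    "B. She leaves",
]
prompt, frames = split_interleaved(inputs)
print(prompt)
```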
### Direct Use
This dataset is meant to evaluate LMMs on video understanding and long-context understanding abilities.
### Out-of-Scope Use
We advise against using this dataset for training.
## Dataset Structure
- `lvb_val.json`: Validation set annotations.
- `lvb_test_wo_gt.json`: Test set annotations. Correct choice is not provided.
- `videos.tar.*`: Links to Videos.
- `subtitles.tar`: Links to Subtitles.
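The annotation files are JSON lists of records with fields such as `question`, `candidates`, `video_path` and `subtitle_path`. The sketch below turns one record into a lettered multiple-choice prompt; the record is copied from `lvb_test_wo_gt.json` with some fields omitted, while `format_question` and the final instruction line are hypothetical choices, not part of the official dataloader.

```python
import json

# One record from lvb_test_wo_gt.json, with some fields omitted for brevity.
record_json = """
{"id": "QDomidqSs84_1",
 "question": "What did the white-haired man, wearing a black top and black-rimmed glasses, put down after mentioning 'filmography' in the subtitles?",
 "candidates": ["microphone", "water bottle", "cell phone", "landline phone"],
 "video_path": "QDomidqSs84.mp4",
 "subtitle_path": "QDomidqSs84_en.json"}
"""


def format_question(record):
    """Build a lettered multiple-choice prompt from one annotation record."""
    letters = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
    lines = [record["question"]]
    lines += [f"{letters[i]}. {choice}" for i, choice in enumerate(record["candidates"])]
    lines.append("Answer with the letter of the correct option.")
    return "\n".join(lines)


record = json.loads(record_json)
prompt = format_question(record)
print(prompt)
```

To process a whole file, the same function can be applied to each element of the list returned by `json.load` on the opened annotation file.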
## Dataset Card Contact
haoning001@e.ntu.edu.sg
```
@misc{wu2024longvideobenchbenchmarklongcontextinterleaved,
title={LongVideoBench: A Benchmark for Long-context Interleaved Video-Language Understanding},
author={Haoning Wu and Dongxu Li and Bei Chen and Junnan Li},
year={2024},
eprint={2407.15754},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2407.15754},
}
```
================================================
FILE: lmms-eval_videochat/eval_annotations/LongVideoBench/lvb_test_wo_gt.json
================================================
[{"video_id": "G1D9C7kRx10", "question": "On a desk with a needle-shaped green leaf, there is a picture. A person is drawing with a pen. After the subtitle 'The snow fairy was hurt, and the next time she was sent to the mountains of ...', what does this person do?", "question_wo_referring_query": "What does this person do?", "candidates": ["Picks up the drawing with both hands", "Picks up the rubber", "Puts down the drawing", "Pats the rabbit"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "G1D9C7kRx10_0", "video_path": "G1D9C7kRx10.mp4", "subtitle_path": "G1D9C7kRx10_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 398.07, "view_count": 219450}, {"video_id": "G1D9C7kRx10", "question": "In the snow-covered mountain pass, there's a row of tall trees among them. After the subtitle says, \"is difficult and there are those who still don't enjoy her visits, but she no longer hears their...\", what event happens?", "question_wo_referring_query": "What event happens?", "candidates": ["A woman in white strokes a black rabbit.", "A woman in white lifts the ears of a black rabbit.", "A woman in white strokes a white rabbit.", "A woman in white grabs the foot of a black rabbit."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "G1D9C7kRx10_1", "video_path": "G1D9C7kRx10.mp4", "subtitle_path": "G1D9C7kRx10_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 398.07, "view_count": 219450}, {"video_id": "QDomidqSs84", "question": "The man in the grey suit, the man in the black shirt and jeans, and the man in the black suit are talking. 
After the subtitle mentions \"[Applause]\", what does the man in the black suit take out?", "question_wo_referring_query": "What does the man in the black suit take out?", "candidates": ["White cellphone", "Tablet", "Black card", "White card"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "QDomidqSs84_0", "video_path": "QDomidqSs84.mp4", "subtitle_path": "QDomidqSs84_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 308.48, "view_count": 5111}, {"video_id": "QDomidqSs84", "question": "What did the white-haired man, wearing a black top and black-rimmed glasses, put down after mentioning 'filmography' in the subtitles?", "question_wo_referring_query": "What did the white-haired man, wearing a black top and black-rimmed glasses, put down?", "candidates": ["microphone", "water bottle", "cell phone", "landline phone"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "QDomidqSs84_1", "video_path": "QDomidqSs84.mp4", "subtitle_path": "QDomidqSs84_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 308.48, "view_count": 5111}, {"video_id": "vIZBTk-bhYY", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is the jade green seawater and the sand beach covered with green vegetation, followed by rows of palm trees and a white boat with a white triangular flag on the sea, and finally the jade green seawater lapping against the white waves and the distant misty land.", "First, there are rows of palm trees and a white boat with a white triangular flag on the sea, followed by the jade green seawater lapping against the white waves and the distant misty land, and finally the jade green seawater and the sand beach covered with green vegetation.", "First, there are rows of palm trees and 
a white boat with a white triangular flag on the sea, followed by the jade green seawater and the sand beach covered with green vegetation, and finally the jade green seawater lapping against the white waves and the distant misty land.", "First, the jade green seawater is lapping against the white waves and the distant misty land, followed by the jade green seawater and the sand beach covered with green vegetation, and finally rows of palm trees and a white boat with a white triangular flag on the sea."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "vIZBTk-bhYY_0", "video_path": "vIZBTk-bhYY.mp4", "subtitle_path": "vIZBTk-bhYY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.24, "view_count": 766}, {"video_id": "vIZBTk-bhYY", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, there is a circular water surface with a seam connecting to the outer water surface; then there is a blue water surface with a green-covered ground; finally, there are two people rowing a small boat with a gradient of red and yellow, and not far away there is a white boat.", "First, there are two people rowing a small boat with a gradient of red and yellow, and not far away there is a white boat; then there is a circular water surface with a seam connecting to the outer water surface; finally, there is a blue water surface with a green-covered ground.", "First, there is a blue water surface with a green-covered ground; then there is a circular water surface with a seam connecting to the outer water surface; finally, there are two people rowing a small boat with a gradient of red and yellow, and not far away there is a white boat.", "First, there is a blue water surface with a green-covered ground; then there are two people rowing a small boat with a gradient of red 
and yellow, and not far away there is a white boat; finally, there is a circular water surface with a seam connecting to the outer water surface."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "vIZBTk-bhYY_1", "video_path": "vIZBTk-bhYY.mp4", "subtitle_path": "vIZBTk-bhYY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.24, "view_count": 766}, {"video_id": "f44gpGR4uWU", "question": "In what scenes did the short-haired man in the gray shirt holding a connected phone appear in the video?", "question_wo_referring_query": ", in what scenes did he appear?", "candidates": ["In a hotel", "In a car, next to the standing trees", "In the water", "On top of a mountain"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "f44gpGR4uWU_0", "video_path": "f44gpGR4uWU.mp4", "subtitle_path": "f44gpGR4uWU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 552.53, "view_count": 5518}, {"video_id": "f44gpGR4uWU", "question": "In the opening of the video, there's a man wearing a black top and a gray hat in the car. In which of the following scenes does he appear later?", "question_wo_referring_query": "In which of the following scenes does he appear later?", "candidates": ["In the water", "In the car, on the sofa", "On the mountain", "In the bathroom"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "f44gpGR4uWU_1", "video_path": "f44gpGR4uWU.mp4", "subtitle_path": "f44gpGR4uWU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 552.53, "view_count": 5518}, {"video_id": "P2LWNdH3bi8", "question": "Under the text that reads 'Example results,' there is a man wearing a gray-blue coat and a red shirt explaining something. 
Where did the red dot move from the colorful place?", "question_wo_referring_query": "Under the text that reads 'Example results,' there is a man wearing a gray-blue coat and a red shirt explaining something. Where did the red dot move from the colorful place?", "candidates": ["red elephant", "black elephant", "blue elephant", "red car"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "P2LWNdH3bi8_0", "video_path": "P2LWNdH3bi8.mp4", "subtitle_path": "P2LWNdH3bi8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 446.83, "view_count": 17}, {"video_id": "jrEMWBswy_A", "question": "In the video, what change happened to the man in a black jacket and white undershirt, sitting in the car with a seatbelt on, when the subtitle says 'gonna go see Deadpool alright I'm'?", "question_wo_referring_query": "What change happened to the man in a black jacket and white undershirt?", "candidates": ["He put on a hat", "He changed into a white jacket", "He changed into a blue jacket", "He put on a mask"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "jrEMWBswy_A_0", "video_path": "jrEMWBswy_A.mp4", "subtitle_path": "jrEMWBswy_A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 292.63, "view_count": 3863}, {"video_id": "jrEMWBswy_A", "question": "In the video, what is the man wearing a black top and a white undershirt, sitting in the car with a seatbelt on, doing when the subtitles say 'probably break a bone after the gym got'?", "question_wo_referring_query": "What is the man wearing a black top and a white undershirt doing?", "candidates": ["Holding a white bag", "Holding a cellphone", "Holding white clothes", "Holding a plastic bag of food"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "jrEMWBswy_A_1", "video_path": "jrEMWBswy_A.mp4", 
"subtitle_path": "jrEMWBswy_A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 292.63, "view_count": 3863}, {"video_id": "BIcKLxV-8Gs", "question": "The woman sitting in front of the white sofa, wearing a white coat over a black top, holding a white round object, is wearing what on her left wrist?", "question_wo_referring_query": "What is the woman wearing a white coat over a black top wearing on her left wrist?", "candidates": ["black string", "silk ribbon", "red string", "watch"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "BIcKLxV-8Gs_0", "video_path": "BIcKLxV-8Gs.mp4", "subtitle_path": "BIcKLxV-8Gs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.2, "view_count": 14088}, {"video_id": "BIcKLxV-8Gs", "question": "In a room, there is a woman wearing a black and white polka dot short-sleeve shirt. She is applying skincare product with her left hand to her right shoulder. What object is to the right of this woman?", "question_wo_referring_query": "What object is to the right of this woman?", "candidates": ["curtain", "sofa", "beauty instrument", "skincare product"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "BIcKLxV-8Gs_1", "video_path": "BIcKLxV-8Gs.mp4", "subtitle_path": "BIcKLxV-8Gs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.2, "view_count": 14088}, {"video_id": "x44f6DED_7s", "question": "In front of a hazy orange and purple background, there is a woman wearing white clothes with short black hair. Below there is a red and black sidebar. 
When the subtitle mentions 'strategy, which meant heavy capital investment into China,' what is the woman holding?", "question_wo_referring_query": "What is the woman holding?", "candidates": ["mobile phone", "cup", "pen", "book"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "x44f6DED_7s_0", "video_path": "x44f6DED_7s.mp4", "subtitle_path": "x44f6DED_7s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 560.13, "view_count": 6377}, {"video_id": "x44f6DED_7s", "question": "In front of an orange-red blurry background, there is a man wearing a dark blue suit with a red and white striped tie. When the subtitles mention 'a, you know, a representative office. You have that money too. a, you know, a representative office,' what is he holding in his hand?", "question_wo_referring_query": "What is he holding in his hand?", "candidates": ["a white pen", "a card", "a red pen", "a tissue"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "x44f6DED_7s_1", "video_path": "x44f6DED_7s.mp4", "subtitle_path": "x44f6DED_7s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 560.13, "view_count": 6377}, {"video_id": "5yGxMLIxWhA", "question": "A person wearing a red and gray striped top and blue pants is using a white embroidery net. 
At this time, what color is the watch the person is wearing?", "question_wo_referring_query": "At this time, what color is the watch the person is wearing?", "candidates": ["White", "Black", "Blue", "Gray"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "5yGxMLIxWhA_0", "video_path": "5yGxMLIxWhA.mp4", "subtitle_path": "5yGxMLIxWhA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.26, "view_count": 6370}, {"video_id": "5yGxMLIxWhA", "question": "A man wearing a black coat, white innerwear, black pants, and black-framed glasses is sitting on a gray sofa with a woman dressed in a black top and olive pants. What color shoes is the woman wearing with her black top and olive pants?", "question_wo_referring_query": "What color shoes is the woman wearing with her black top and olive pants?", "candidates": ["red", "gray", "blue", "black"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "5yGxMLIxWhA_1", "video_path": "5yGxMLIxWhA.mp4", "subtitle_path": "5yGxMLIxWhA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.26, "view_count": 6370}, {"video_id": "y3JTUVNiFqY", "question": "On a brownish-red desk, there is a transparent bowl containing white powder. 
What is this person preparing to stir with?", "question_wo_referring_query": "What is this person preparing to stir with?", "candidates": ["Black spoon", "Red spoon", "Red spatula", "Red knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "y3JTUVNiFqY_0", "video_path": "y3JTUVNiFqY.mp4", "subtitle_path": "y3JTUVNiFqY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 236.11, "view_count": 221562}, {"video_id": "y3JTUVNiFqY", "question": "On a countertop with a few tomatoes on the right side and a bun placed on a floured surface, what is this person using to cut the bun?", "question_wo_referring_query": ", what is this person using to cut the bun?", "candidates": ["black-handled knife", "chopsticks", "red-handled knife", "bamboo board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "y3JTUVNiFqY_1", "video_path": "y3JTUVNiFqY.mp4", "subtitle_path": "y3JTUVNiFqY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 236.11, "view_count": 221562}, {"video_id": "lNV_cvdJlxs", "question": "A man is wearing a blue-grey coat and a wide-brimmed hat; another man is wearing a black top hat and brown clothes. There are a few trees in the background, and the ground is scattered with many orange objects. What was the man in the blue-grey coat doing the first time he appeared?", "question_wo_referring_query": "A man is wearing a blue-grey coat and a wide-brimmed hat; another man is wearing a black top hat and brown clothes. There are a few trees in the background, and the ground is scattered with many orange objects. 
What was the man in the blue-grey coat doing the first time he appeared?", "candidates": ["Running", "Giving a lecture", "Digging", "Taking photos"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "lNV_cvdJlxs_0", "video_path": "lNV_cvdJlxs.mp4", "subtitle_path": "lNV_cvdJlxs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 380.71, "view_count": 963275}, {"video_id": "lNV_cvdJlxs", "question": "A person wearing a blue and white striped hat, a red patterned top, and having gray eyebrows, what is this person doing when they appear?", "question_wo_referring_query": "What is this person doing when they appear?", "candidates": ["Eating", "Running", "Drinking", "Taking pictures"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "lNV_cvdJlxs_1", "video_path": "lNV_cvdJlxs.mp4", "subtitle_path": "lNV_cvdJlxs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 380.71, "view_count": 963275}, {"video_id": "vMR2JnjrVbM", "question": "In a messy room, a man wearing a purple short-sleeved shirt is sitting on a dark blue sofa. When mentioning \"just takes a lot of time and emails and,\" what is he doing?", "question_wo_referring_query": ", what is he doing?", "candidates": ["Swimming", "Running", "Sleeping", "Cutting things with scissors"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "vMR2JnjrVbM_0", "video_path": "vMR2JnjrVbM.mp4", "subtitle_path": "vMR2JnjrVbM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.56, "view_count": 964237}, {"video_id": "vMR2JnjrVbM", "question": "On a messy desk, there's a blue pen and a light blue scissors. 
When the phrase 'outsource the shirts and just have a' is mentioned, what is the person doing in the video?", "question_wo_referring_query": "On a messy desk, there's a blue pen and a light blue scissors. When the phrase 'outsource the shirts and just have a' is mentioned, what is the person doing in the video?", "candidates": ["Holding scissors and cutting something", "Sleeping", "Running", "Printing items"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "vMR2JnjrVbM_1", "video_path": "vMR2JnjrVbM.mp4", "subtitle_path": "vMR2JnjrVbM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.56, "view_count": 964237}, {"video_id": "CCwe1HHCISQ", "question": "In front of a yellow background, a pair of hands is drawing on a white sheet of paper with a red and yellow intersecting pen. What is this white sheet of paper used for after the drawing is finished?", "question_wo_referring_query": "What is this white sheet of paper used for after the drawing is finished?", "candidates": ["Torn up", "Thrown away", "Placed on a pumpkin", "Held in hand"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "CCwe1HHCISQ_0", "video_path": "CCwe1HHCISQ.mp4", "subtitle_path": "CCwe1HHCISQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.72, "view_count": 155485}, {"video_id": "CCwe1HHCISQ", "question": "In front of a yellow background, what happens to this potato after a pair of hands uses a brown tool to wrap half of it?", "question_wo_referring_query": ", what happens to this potato?", "candidates": ["coloring", "washing", "slicing", "cooking"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "CCwe1HHCISQ_1", "video_path": "CCwe1HHCISQ.mp4", "subtitle_path": "CCwe1HHCISQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.72, 
"view_count": 155485}, {"video_id": "ITgatyoMDjU", "question": "In the scene, a person is wearing a black outfit with yellow polka dots. This person is holding a knife with their right hand, and using three fingers of their left hand to press the food on the knife's back while cutting vegetables. Which object appears first in this scene?", "question_wo_referring_query": "Which object appears first in this scene?", "candidates": ["Glove", "Knife", "Cutting board", "Garlic"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "ITgatyoMDjU_0", "video_path": "ITgatyoMDjU.mp4", "subtitle_path": "ITgatyoMDjU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 362.36, "view_count": 21266}, {"video_id": "qT7r0Ois8bw", "question": "Inside the room, there's a man wearing a gray coat and a blue inner layer. After the subtitle mentions 'capital A not just yet so one of the', what did this man do?", "question_wo_referring_query": "What did this man do?", "candidates": ["Running", "Holding a white document with his left thumb and index finger", "Kneeling on the floor", "Drinking a beverage"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "qT7r0Ois8bw_0", "video_path": "qT7r0Ois8bw.mp4", "subtitle_path": "qT7r0Ois8bw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.36, "view_count": 15010}, {"video_id": "qT7r0Ois8bw", "question": "In the room, there is a man wearing a grey coat and a blue inner shirt. 
After the subtitle mentions \u201cis YouTube is the world campfire where\u201d, what action does this man perform?", "question_wo_referring_query": "What action does this man perform?", "candidates": ["Right hand raised", "Both hands clenched into fists", "Left hand raised", "Both hands raised"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "qT7r0Ois8bw_1", "video_path": "qT7r0Ois8bw.mp4", "subtitle_path": "qT7r0Ois8bw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.36, "view_count": 15010}, {"video_id": "ijFm6DxNVyI", "question": "There is an orange hammock on the screen, and on the hammock, there is a creature with a black body and an orange mouth and paws. After the subtitle mentions 'a fancy way of saying that it', what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["A yellow bicycle", "A small boat", "An upside-down tree", "A small dog drinking something"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "ijFm6DxNVyI_0", "video_path": "ijFm6DxNVyI.mp4", "subtitle_path": "ijFm6DxNVyI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 358.17, "view_count": 15954094}, {"video_id": "E35j15N_dys", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a person wearing white clothes standing in front of a yellow-brown building appears, then a black train in motion appears, and finally a car driving in the desert appears.", "First, a black train in motion appears, then a car driving in the desert appears, and finally a person wearing white clothes standing in front of a yellow-brown building appears.", "First, a car driving in the desert appears, then a black train in motion appears, and finally a person 
wearing white clothes standing in front of a yellow-brown building appears.", "First, a car driving in the desert appears, then a person wearing white clothes standing in front of a yellow-brown building appears, and finally a black train in motion appears."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "E35j15N_dys_0", "video_path": "E35j15N_dys.mp4", "subtitle_path": "E35j15N_dys_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 426.06, "view_count": 878406}, {"video_id": "YbX2-JFU7MY", "question": "In the bottom left corner of the video, under the green onion, there is white cream. Where else has this white cream appeared?", "question_wo_referring_query": "Where else has this white cream appeared?", "candidates": ["In the bowl", "In the blue bowl", "In the pot", "On the spoon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "YbX2-JFU7MY_0", "video_path": "YbX2-JFU7MY.mp4", "subtitle_path": "YbX2-JFU7MY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 198.41, "view_count": 136544}, {"video_id": "YbX2-JFU7MY", "question": "At the beginning of the video, in the upper right corner, there is a brown and yellow sauce on a white object kept in a pocket. 
Where else does this yellow sauce appear?", "question_wo_referring_query": "Where else does this yellow sauce appear?", "candidates": ["On a blue shirt", "In the bowl", "On chopsticks", "On a fork", "At noon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "YbX2-JFU7MY_1", "video_path": "YbX2-JFU7MY.mp4", "subtitle_path": "YbX2-JFU7MY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 198.41, "view_count": 136544}, {"video_id": "RUUHo6dbJwI", "question": "In the video, there's a woman wearing a black short-sleeve shirt, with her hair tied up, holding a cup with yellow and black decorations. The woman holding the beverage with the yellow star pattern on the cup appears alongside which subtitles?", "question_wo_referring_query": "The woman holding the beverage with the yellow star pattern on the cup appears alongside which subtitles?", "candidates": ["okay guys so first place and first like", "this is not bad at all it has like", "\u201cplease thank you\u201d ", "just broke the straw"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "RUUHo6dbJwI_0", "video_path": "RUUHo6dbJwI.mp4", "subtitle_path": "RUUHo6dbJwI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 378.28, "view_count": 90142}, {"video_id": "RUUHo6dbJwI", "question": "In the video, there is a woman sitting in a car, wearing a black short-sleeved shirt, holding a transparent cup with black patterned drink. 
Which subtitles have appeared with the woman with black patterned drink in the cup?", "question_wo_referring_query": "Which subtitles have appeared together with the woman with black patterned drink in the cup?", "candidates": ["this is not bad at all it has like", "okay guys so first place and first like", "just broke the straw", "this looks fun"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "RUUHo6dbJwI_1", "video_path": "RUUHo6dbJwI.mp4", "subtitle_path": "RUUHo6dbJwI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 378.28, "view_count": 90142}, {"video_id": "xqNCxPyAzzc", "question": "In front of a row of large trees and a house, there is a soldier. The soldier is wearing green clothes with green and gray face paint, holding a gun in his right hand. Beside him, there are white and orange letters. What changed on the soldier's face later?", "question_wo_referring_query": "What changed on the soldier's face later?", "candidates": ["The position of the gray and green paint on his face has increased.", "The position of the green paint on his face has decreased.", "The position of the face paint has not changed.", "The position of the green paint on his face has increased."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "xqNCxPyAzzc_0", "video_path": "xqNCxPyAzzc.mp4", "subtitle_path": "xqNCxPyAzzc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.0, "view_count": 4210746}, {"video_id": "xqNCxPyAzzc", "question": "In a red-colored field, there is a row of green trees. In front of them are three men holding rifles, wearing hats, and dressed in light khaki military uniforms. 
What changes occur behind the rifles being held in the scene?", "question_wo_referring_query": ", what changes occur behind the rifles being held in the scene?", "candidates": ["The rifles emit yellow flames.", "The rifles get broken.", "The rifles get wet from the rain.", "The rifles emit green flames."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "xqNCxPyAzzc_1", "video_path": "xqNCxPyAzzc.mp4", "subtitle_path": "xqNCxPyAzzc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.0, "view_count": 4210746}, {"video_id": "VLy8i0nv44U", "question": "A gray tank is parked on a green surface. Next to the tank is a row of trees, and beside the trees is a red surface. After the subtitle mentions 'ten point three meters long three metres', what change occurs to the color of the tank?", "question_wo_referring_query": "What change occurs to the color of the tank?", "candidates": ["Turns red", "Turns white", "Turns purple", "Turns green"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "VLy8i0nv44U_0", "video_path": "VLy8i0nv44U.mp4", "subtitle_path": "VLy8i0nv44U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 282.5, "view_count": 1912014}, {"video_id": "VLy8i0nv44U", "question": "On the green field, a gray tank is accompanied by two soldiers. One of the soldiers is holding a camera. 
After the subtitle mentions 'encouraging many people to look at them', what does the color of the tank change to?", "question_wo_referring_query": "What does the color of the tank change to?", "candidates": ["red", "purple", "black", "white"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "VLy8i0nv44U_1", "video_path": "VLy8i0nv44U.mp4", "subtitle_path": "VLy8i0nv44U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 282.5, "view_count": 1912014}, {"video_id": "wvQTmTPZix4", "question": "The video takes place inside a bus station, where there are many people. One of the individuals is a man wearing a blue top and purple pants. This man has several white plastic bags hanging from his belt and is carrying a large green bag on his shoulders. What action is this man performing?", "question_wo_referring_query": "What action is this man performing?", "candidates": ["Hand supporting the bus door", "Getting off the bus", "Hand pushing the bus door", "Right hand supporting the large green bag"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "wvQTmTPZix4_0", "video_path": "wvQTmTPZix4.mp4", "subtitle_path": "wvQTmTPZix4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 584.62, "view_count": 151877}, {"video_id": "wvQTmTPZix4", "question": "There is a green grassy field on the screen, with a white-haired sheep on it. Near the sheep appears the number 5 in yellow. 
What is this sheep doing on the grassy field?", "question_wo_referring_query": "What is this sheep doing on the grassy field?", "candidates": ["Eating grass", "Baaing", "Running", "Jumping"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "wvQTmTPZix4_1", "video_path": "wvQTmTPZix4.mp4", "subtitle_path": "wvQTmTPZix4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 584.62, "view_count": 151877}, {"video_id": "xCMjE1QmW74", "question": "In a picture with a white background, there are yellow words 'THESE DRUGS' and orange words 'COUNTERACT' and 'INFLAMMATION'. What is on the right side of these words?", "question_wo_referring_query": "What is on the right side of these words?", "candidates": ["yellow bottle", "light bulb", "mobile phone", "computer"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "xCMjE1QmW74_0", "video_path": "xCMjE1QmW74.mp4", "subtitle_path": "xCMjE1QmW74_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 409.2, "view_count": 454756}, {"video_id": "xCMjE1QmW74", "question": "The words RESEARCHERS, THINK, INTERNET, FORUMS, COULD, etc. are visible on what object in the screen?", "question_wo_referring_query": "On what object are these words visible?", "candidates": ["On a TV", "On a computer", "In a book", "On a mobile phone"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "xCMjE1QmW74_1", "video_path": "xCMjE1QmW74.mp4", "subtitle_path": "xCMjE1QmW74_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 409.2, "view_count": 454756}, {"video_id": "OajGa5Atg1c", "question": "In front of a yellow background, there are two characters standing by a table with a sword. One character has yellow hair and is wearing a red and yellow checkered outfit. 
What color is the clothing of the character next to the one with yellow hair?", "question_wo_referring_query": "What color is the clothing of the character next to the one with yellow hair?", "candidates": ["red", "purple", "blue", "black"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "OajGa5Atg1c_0", "video_path": "OajGa5Atg1c.mp4", "subtitle_path": "OajGa5Atg1c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 218.5, "view_count": 17965}, {"video_id": "OajGa5Atg1c", "question": "In front of two brown houses, three people are standing. One man with brown hair in a red and white checkered shirt is holding a badge, and two men with blonde hair in green and orange checkered shirts are also holding badges. What shape is the badge held by the man in the red and white checkered shirt?", "question_wo_referring_query": "What shape is the badge held by the man in the red and white checkered shirt?", "candidates": ["trapezoid", "rectangle", "circle", "oval"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "OajGa5Atg1c_1", "video_path": "OajGa5Atg1c.mp4", "subtitle_path": "OajGa5Atg1c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 218.5, "view_count": 17965}, {"video_id": "b43x5HFKvcY", "question": "There is a man wearing a black short sleeve shirt with a design on it, and he is holding a garment in his right hand. When the subtitle appears saying, 'Oh very quickly we have Geography Now! t-shirts. You can get these now. 
They're really cool,' what color is the garment in the man's right hand?", "question_wo_referring_query": "What color is the garment in the man's right hand?", "candidates": ["white", "black", "green", "red"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "b43x5HFKvcY_0", "video_path": "b43x5HFKvcY.mp4", "subtitle_path": "b43x5HFKvcY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 486.7, "view_count": 524799}, {"video_id": "b43x5HFKvcY", "question": "There is a man wearing a black short-sleeve shirt with designs on it. He is holding a cup in his left hand. When the subtitles say 'We also have Geography Now! coffee mugs you can get them look at that they come with the artwork and everything the logo for,' what material is the cup the man is holding in his left hand made of?", "question_wo_referring_query": "What material is the cup that the man is holding in his left hand made of?", "candidates": ["Stainless steel", "Ceramic", "Plastic", "Glass"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "b43x5HFKvcY_1", "video_path": "b43x5HFKvcY.mp4", "subtitle_path": "b43x5HFKvcY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 486.7, "view_count": 524799}, {"video_id": "O6aju9scF3k", "question": "In a blue background, there is a blue fish. When the subtitle 'of prey and point forward for feeding.' appears, what action does the blue fish take?", "question_wo_referring_query": "In a blue background, there is a blue fish. When the subtitle 'of prey and point forward for feeding.' 
appears, what action does the blue fish take?", "candidates": ["Blowing bubbles", "Opening its mouth wide", "Wagging its tail", "Eating something"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "O6aju9scF3k_0", "video_path": "O6aju9scF3k.mp4", "subtitle_path": "O6aju9scF3k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 189.94, "view_count": 987883}, {"video_id": "O6aju9scF3k", "question": "In a blue background, there is a pink creature. When the subtitle appears saying, 'predators can get startled, squirting ink as a diversion to escape,' what does the pink creature do?", "question_wo_referring_query": "What does the pink creature do?", "candidates": ["A creature is eating something", "A creature is squirting ink", "A creature is swimming with its mouth open", "A creature is wagging its tail"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "O6aju9scF3k_1", "video_path": "O6aju9scF3k.mp4", "subtitle_path": "O6aju9scF3k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 189.94, "view_count": 987883}, {"video_id": "zU_1ClPIOAg", "question": "A woman dressed in a white long-sleeve top and a black vest is gesturing with her hands. After the camera zooms in, what does this woman do next?", "question_wo_referring_query": "What does this woman do next?", "candidates": ["Plays with a mobile phone", "Dances", "Opens both hands", "Drinks water"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "zU_1ClPIOAg_0", "video_path": "zU_1ClPIOAg.mp4", "subtitle_path": "zU_1ClPIOAg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.77, "view_count": 250918}, {"video_id": "zU_1ClPIOAg", "question": "In a blue background, a man wearing a black top with white underneath appears holding a book. 
He is waving while walking towards another person dressed in a black jumpsuit holding a scroll. What action does the person in the black jumpsuit take next?", "question_wo_referring_query": "What action does the person in the black jumpsuit take next?", "candidates": ["Crouches down", "Throws the hat upwards", "Plays with the phone", "Sings"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "zU_1ClPIOAg_1", "video_path": "zU_1ClPIOAg.mp4", "subtitle_path": "zU_1ClPIOAg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.77, "view_count": 250918}, {"video_id": "Vjhxacb8r4I", "question": "In the video, four singers first appear each holding an instrument and performing, then an image of delicious food appears, and finally an image of a beach with many people appears. According to the description, which appears first?", "question_wo_referring_query": "According to the description, which appears first?", "candidates": ["Four singers each holding an instrument and performing", "Delicious food image", "Beach image", "Delicious food image and beach image appear at the same time\n\n"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "Vjhxacb8r4I_0", "video_path": "Vjhxacb8r4I.mp4", "subtitle_path": "Vjhxacb8r4I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 437.71, "view_count": 786}, {"video_id": "Vjhxacb8r4I", "question": "Which scene appears first according to the explanation: A scene with a statue next to a large tree with a person holding a guitar, pointing upwards; a scene with a white wall with two poplar trees on each side and a statue in the middle; or a scene in a room with various geometrical objects with paintings on them and many people viewing?", "question_wo_referring_query": "Which scene appears first according to the explanation?", "candidates": ["A scene with a white wall with 
two poplar trees on each side and a statue in the middle", "A scene with a statue next to a large tree with a person holding a guitar, pointing upwards", "A scene with a white wall with two poplar trees on each side, a statue in the middle, and a room with various geometrical objects with paintings on them and many people viewing both scenes together", "A scene in a room with various geometrical objects with paintings on them and many people viewing"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "Vjhxacb8r4I_1", "video_path": "Vjhxacb8r4I.mp4", "subtitle_path": "Vjhxacb8r4I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 437.71, "view_count": 786}, {"video_id": "t81G6eBthXI", "question": "In front of a black background with thick smoke rolling, what event occurred on the screen after the subtitle mentioned 'summit the formation of TAMU mif is a'?", "question_wo_referring_query": "What event occurred on the screen?", "candidates": ["Chemical explosion", "Forest fire", "House fire", "Underwater volcano eruption"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "t81G6eBthXI_0", "video_path": "t81G6eBthXI.mp4", "subtitle_path": "t81G6eBthXI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.64, "view_count": 7180}, {"video_id": "t81G6eBthXI", "question": "In a scene where a half-illuminated, orange-glowing planet appears, and after the subtitle 'video on the link to that is in the', what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["The color of the planet changes to red", "Volcanic eruption", "The planet explodes", "Lava flows down the volcano"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "t81G6eBthXI_1", "video_path": "t81G6eBthXI.mp4", "subtitle_path": "t81G6eBthXI_en.json", 
"duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.64, "view_count": 7180}, {"video_id": "csEgtSh7jV4", "question": "On a white background, there is a small bridge with large trees on both sides. After the subtitle 'from Meta AI that claims to finally provide a foundational model in computer vision, closing' appears, what object shows up?", "question_wo_referring_query": "What object shows up?", "candidates": ["An animated image of a black bridge with large trees on both sides", "An animated image of a yellow bridge with large trees on both sides", "An animated image of a white bridge with large trees on both sides", "An animated image of a red bridge with large trees on both sides"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "csEgtSh7jV4_0", "video_path": "csEgtSh7jV4.mp4", "subtitle_path": "csEgtSh7jV4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.93, "view_count": 11881}, {"video_id": "csEgtSh7jV4", "question": "On a white background, there is a large circle and a small circle, with a blue arrow pointing towards the small circle. 
When the caption 'For example in the original uncurated dataset we'll likely to find a lot of cat images' appears, which animal appears on the screen?", "question_wo_referring_query": "Which animal appears on the screen?", "candidates": ["A white rabbit", "A white dog", "A black cat", "A white cat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "csEgtSh7jV4_1", "video_path": "csEgtSh7jV4.mp4", "subtitle_path": "csEgtSh7jV4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.93, "view_count": 11881}, {"video_id": "-rke7RmxwfY", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, place down the fabric, then pick up the dark brown fabric with threads, and finally pick up a small square piece of fabric.", "First, place down the fabric, then pick up a small square piece of fabric, and finally pick up the dark brown fabric with threads.", "First, pick up a small square piece of fabric, then pick up the dark brown fabric with threads with your right hand, and finally place down the fabric.", "First, pick up the dark brown fabric with threads with your right hand, then pick up a small square piece of fabric, and finally place down the fabric."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "-rke7RmxwfY_0", "video_path": "-rke7RmxwfY.mp4", "subtitle_path": "-rke7RmxwfY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 421.42, "view_count": 195373}, {"video_id": "-rke7RmxwfY", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First use the surgical knife, then the grinding machine, and finally perform the threading", "First perform the threading, then use the grinding machine, and 
finally the surgical knife", "First use the grinding machine, then the surgical knife, and finally perform the threading", "First use the surgical knife, then perform the threading, and finally use the grinding machine"], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "-rke7RmxwfY_1", "video_path": "-rke7RmxwfY.mp4", "subtitle_path": "-rke7RmxwfY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 421.42, "view_count": 195373}, {"video_id": "BTPBuSvizV8", "question": "On a green-colored ground, next to an arched iron railing, three men are dancing continuously. One of them has black curly hair and a beard. Where else has he appeared?", "question_wo_referring_query": "Where else has he appeared?", "candidates": ["Under the waterfall", "In the desert", "On the beach", "In the boat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "BTPBuSvizV8_0", "video_path": "BTPBuSvizV8.mp4", "subtitle_path": "BTPBuSvizV8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 303.1, "view_count": 34841}, {"video_id": "BTPBuSvizV8", "question": "A man with a backpack is standing on the desert, and in front of him, there are 4 cars moving. The sky is half red and half blue, with a golden sun on his upper left. 
In which of the following places has the sun appeared?", "question_wo_referring_query": "In which of the following places has the sun appeared?", "candidates": ["Next to a man with a red shoulder, shorts, and bare feet", "Next to a man with a red shoulder and long pants", "Next to a man with a red shoulder and shorts", "Next to a woman wearing a waistcoat and shorts"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "BTPBuSvizV8_1", "video_path": "BTPBuSvizV8.mp4", "subtitle_path": "BTPBuSvizV8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 303.1, "view_count": 34841}, {"video_id": "pYwF3PEKwjo", "question": "A man wearing glasses, a black suit, and a white shirt with a gray tie is speaking. In which subtitles has this man appeared?", "question_wo_referring_query": "In which subtitles has this man appeared?", "candidates": ["In other words, the Kim Jong-un government", "The more sad I feel whenever I talk about Confucianism,", "Oh, the people I was with in the military at the time of the Korean War have all", "prepare for a \u2018real war\u2019"], "topic_category": "NPs", "question_category": "TOS", "level": "L2-Relation", "id": "pYwF3PEKwjo_0", "video_path": "pYwF3PEKwjo.mp4", "subtitle_path": "pYwF3PEKwjo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 216.76, "view_count": 15405}, {"video_id": "pYwF3PEKwjo", "question": "On a grassy slope, there are rows of tombstones, each with a pot of flowers in front of it. In which subtitles do these tombstones appear together?", "question_wo_referring_query": ", in which subtitles do these tombstones appear together?", "candidates": ["more sad I feel. 
And whenever I talk about Confucianism,", "In terms of", "continue", "Oh, the people I was with in the military at the time of the Korean War have all"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "pYwF3PEKwjo_1", "video_path": "pYwF3PEKwjo.mp4", "subtitle_path": "pYwF3PEKwjo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 216.76, "view_count": 15405}, {"video_id": "I1v31P7Zgak", "question": "On a busy street, a man wearing a black hat, glasses, a black coat, and holding a newspaper enters a room. What change occurs to his head after he enters the room?", "question_wo_referring_query": "On a busy street, a man wearing a black hat, glasses, a black coat, and holding a newspaper enters a room. What change occurs to his head after he enters the room?", "candidates": ["The man is wearing a military hat", "The man is not wearing a hat", "The man is wearing a peaked cap", "The man is wearing a headscarf"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "I1v31P7Zgak_0", "video_path": "I1v31P7Zgak.mp4", "subtitle_path": "I1v31P7Zgak_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 274.94, "view_count": 8969}, {"video_id": "KM2w6miZzME", "question": "On a grassy field, there is a pair of glowing feet wearing dark green boots. 
After the subtitle 'like styling them' appears, what changes occur to the feet?", "question_wo_referring_query": "What changes occur to the feet?", "candidates": ["The color of the glowing boots has changed to light green.", "The color of the glowing boots has changed to light purple, and the left foot has a bracelet.", "The feet now have a pair of socks and the boots have changed to light green.", "The feet now have a pair of socks and the boots have changed to light purple."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "KM2w6miZzME_0", "video_path": "KM2w6miZzME.mp4", "subtitle_path": "KM2w6miZzME_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 301.07, "view_count": 184855}, {"video_id": "KM2w6miZzME", "question": "On a grassy field, there is a shoebox containing a pair of green shoes. What change occurs inside the shoebox just before the caption 'adventure and then the choco sandal was' appears?", "question_wo_referring_query": "What change occurs inside the shoebox?", "candidates": ["The shoebox contains a pair of red shoes", "The shoebox contains a pair of black shoes", "The shoebox contains a pair of purple shoes", "The shoebox contains a pair of yellow shoes"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "KM2w6miZzME_1", "video_path": "KM2w6miZzME.mp4", "subtitle_path": "KM2w6miZzME_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 301.07, "view_count": 184855}, {"video_id": "uxBHgt9fR_Q", "question": "In the video, a man wearing a black and white checkered shirt and a black hat backwards on his head is using his hand to feed a woman with short hair who is wearing a red floral outfit on his right side. 
What other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["Shoe rack", "Fan", "Refrigerator", "Mannequin", "Television"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "uxBHgt9fR_Q_0", "video_path": "uxBHgt9fR_Q.mp4", "subtitle_path": "uxBHgt9fR_Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.82999999999998, "view_count": 2110}, {"video_id": "uxBHgt9fR_Q", "question": "In the scene, there are five people sitting in the first row of red chairs on a green floor. Which character appears in this scene?", "question_wo_referring_query": "Which character appears in this scene?", "candidates": ["The woman wearing a black top and red pants", "The woman wearing a black top and red pants", "The woman wearing a red top and red pants", "The woman wearing a red top and red skirt", "The woman wearing a white top and red pants"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "uxBHgt9fR_Q_1", "video_path": "uxBHgt9fR_Q.mp4", "subtitle_path": "uxBHgt9fR_Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.82999999999998, "view_count": 2110}, {"video_id": "sAYXsf_sHBU", "question": "In the red background, there is a book-shaped icon surrounded by a white circle in the upper left corner, and a white dashed exclamation mark in the upper right corner. 
What object appears when the subtitle reads 'from Morocco to Spain and that airlift'?", "question_wo_referring_query": "What object appears?", "candidates": ["A magnifying glass", "An arrow", "A white airplane", "Two black airplanes", "A black airplane"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "sAYXsf_sHBU_0", "video_path": "sAYXsf_sHBU.mp4", "subtitle_path": "sAYXsf_sHBU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.1, "view_count": 309007}, {"video_id": "sAYXsf_sHBU", "question": "What appeared in the white circle to the far left of the red background when the subtitle says 'the articles were most parts proved high'?", "question_wo_referring_query": "What appeared?", "candidates": ["An image of a skull", "An image of a hook", "An image of a notebook", "An image of a magnifying glass in a white circle", "An airplane"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "sAYXsf_sHBU_1", "video_path": "sAYXsf_sHBU.mp4", "subtitle_path": "sAYXsf_sHBU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.1, "view_count": 309007}, {"video_id": "HPcBspKilyU", "question": "What color is the fur of the dog that is panting with its pink tongue out while sitting on the green grass?", "question_wo_referring_query": "What color is the fur?", "candidates": ["white", "black", "brown", "gray", "yellow"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "HPcBspKilyU_0", "video_path": "HPcBspKilyU.mp4", "subtitle_path": "HPcBspKilyU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 540.0, "view_count": 521049}, {"video_id": "HPcBspKilyU", "question": "In front of a white dining table with several bottles of wine on it, what color are the sunglasses worn by a black-skinned man wearing a black hat and holding a 
triangular wine glass?", "question_wo_referring_query": "What color are the sunglasses?", "candidates": ["purple", "black", "blue", "pink", "yellow"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "HPcBspKilyU_1", "video_path": "HPcBspKilyU.mp4", "subtitle_path": "HPcBspKilyU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 540.0, "view_count": 521049}, {"video_id": "mLbDbmmV6Qc", "question": "In an image with a somewhat blurry background featuring blue sky, white clouds, and a river, there is a frog sitting facing the left side of the screen. When the caption says 'when this happens in lakes, native species can be suppressed and allow,' what color is the frog on the screen?", "question_wo_referring_query": "What color is the frog on the screen?", "candidates": ["blue", "purple", "white", "green", "red"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "mLbDbmmV6Qc_0", "video_path": "mLbDbmmV6Qc.mp4", "subtitle_path": "mLbDbmmV6Qc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.6, "view_count": 443174}, {"video_id": "mLbDbmmV6Qc", "question": "A white solid line is in the middle of the screen, with a few white clouds in the blue sky. The riverbanks are covered with tall green plants, and the nearby green grass looks exceptionally bright under the sunlight. 
When the subtitle mentions 'fertilizer or clearing a forest,' what is the shape of the blue icon to the left of the white solid line on the screen?", "question_wo_referring_query": "What is the shape of the blue icon to the left of the white solid line on the screen?", "candidates": ["shape of a bird", "shape of a duckling", "shape of a carp", "shape of a dolphin", "shape of a peacock"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "mLbDbmmV6Qc_1", "video_path": "mLbDbmmV6Qc.mp4", "subtitle_path": "mLbDbmmV6Qc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.6, "view_count": 443174}, {"video_id": "uC3kgpO4bOU", "question": "In a room with white walls, who is the person sitting in front of a round table using a finger to point to an open book?", "question_wo_referring_query": "Who is the person?", "candidates": ["A blonde woman wearing a white shirt", "A blonde woman wearing a red and white striped shirt", "A blonde woman wearing a yellow shirt", "A blonde woman wearing a black and white striped shirt", "A black-haired woman wearing a blue and white striped shirt"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "uC3kgpO4bOU_0", "video_path": "uC3kgpO4bOU.mp4", "subtitle_path": "uC3kgpO4bOU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 344.51, "view_count": 34067}, {"video_id": "uC3kgpO4bOU", "question": "In a room with white walls, on a circular desk full of books, whose hands are crossed and placed on an open book?", "question_wo_referring_query": "Whose hands are placed on the open book?", "candidates": ["A blonde woman wearing a black and white striped shirt", "A blonde woman wearing a red and white striped shirt", "A black-haired woman wearing a black and white striped shirt", "A blonde woman wearing a white shirt", "A blonde woman wearing a red shirt"], "topic_category": 
"KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "uC3kgpO4bOU_1", "video_path": "uC3kgpO4bOU.mp4", "subtitle_path": "uC3kgpO4bOU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 344.51, "view_count": 34067}, {"video_id": "FAXRsYxJb-A", "question": "In a room with photo frames, what is the pregnant woman with a big belly, dressed in a gray long dress and wearing a ring, doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Touching her belly", "Drinking water", "Sitting on the bed", "Lying on the bed", "Massaging her lower back"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "FAXRsYxJb-A_0", "video_path": "FAXRsYxJb-A.mp4", "subtitle_path": "FAXRsYxJb-A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 570.36, "view_count": 306560}, {"video_id": "FAXRsYxJb-A", "question": "On the green field, there is a cheetah running. 
What is the patterned giraffe doing the first time it appears in the camera?", "question_wo_referring_query": "What is the giraffe doing the first time it appears in the camera?", "candidates": ["Being bitten by the cheetah and losing limbs", "Drinking water", "Standing still", "Running", "Lowering its head"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "FAXRsYxJb-A_1", "video_path": "FAXRsYxJb-A.mp4", "subtitle_path": "FAXRsYxJb-A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 570.36, "view_count": 306560}, {"video_id": "y-pqgjsJ-g8", "question": "On a map of Europe with marked place names, what happens when the subtitle says 'But Britain and France - alarmed at Russia's southern expansion, and potential control'?", "question_wo_referring_query": "What happens?", "candidates": ["The purple area named Constantinople turns red", "A red arrow points to the purple area named Constantinople at the bottom right", "Three arrows point to the CRIMFA area", "An image appears in the top right corner", "An image appears in the top left corner"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "y-pqgjsJ-g8_0", "video_path": "y-pqgjsJ-g8.mp4", "subtitle_path": "y-pqgjsJ-g8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.04, "view_count": 536050}, {"video_id": "y-pqgjsJ-g8", "question": "In a scene from a map of Southeast China showing burning flames, what happens when the subtitles mention 'to the founding of Vladivostok, Russia's major Pacific port'?", "question_wo_referring_query": "What happened?", "candidates": ["Flames appear in the northernmost part of the China map", "The flames on the China map disappear", "The China map is enlarged", "The China map disappears", "The China map is reduced"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": 
"y-pqgjsJ-g8_1", "video_path": "y-pqgjsJ-g8.mp4", "subtitle_path": "y-pqgjsJ-g8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.04, "view_count": 536050}, {"video_id": "mhkBMfS1AEE", "question": "At the battle arena, after a man wearing light clothes and black pants wrapped his arms around the leg of another man dressed in a blue shirt, what did he do next?", "question_wo_referring_query": "What did he do next?", "candidates": ["He took off his pants", "He carried him out of the arena", "He put him on his shoulders", "He held him in his hands", "He threw him to the ground"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "mhkBMfS1AEE_0", "video_path": "mhkBMfS1AEE.mp4", "subtitle_path": "mhkBMfS1AEE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.93, "view_count": 105814}, {"video_id": "mhkBMfS1AEE", "question": "In front of a house, what did a man with a bare upper body do after standing on the three red bars surrounding the battleground?", "question_wo_referring_query": "What action did he take?", "candidates": ["Jumped down and landed on the man wearing green clothes", "Kneeled on the bars", "Danced on the bars", "Jumped down and landed on the man wearing blue clothes", "Jumped down and landed on the man wearing red clothes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "mhkBMfS1AEE_1", "video_path": "mhkBMfS1AEE.mp4", "subtitle_path": "mhkBMfS1AEE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.93, "view_count": 105814}, {"video_id": "bfP_eEfiMWI", "question": "On a table with a white tablecloth, after a green lettuce appears on a brown wooden board, which of the following objects appears first?", "question_wo_referring_query": "Which of the following objects appears first?", "candidates": ["several pieces of garlic", "egg 
yolk and egg white", "a red tomato", "tomato sauce", "a red chili pepper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "bfP_eEfiMWI_0", "video_path": "bfP_eEfiMWI.mp4", "subtitle_path": "bfP_eEfiMWI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.7, "view_count": 104985}, {"video_id": "bfP_eEfiMWI", "question": "On a table with a white tablecloth, after a green cabbage appears on a brown wooden board, which of the following objects appears last?", "question_wo_referring_query": "Which of the following objects appears last?", "candidates": ["red tomato sauce", "fried yellow garlic", "a bunch of cilantro", "a whole piece of butter", "a red chili pepper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "bfP_eEfiMWI_1", "video_path": "bfP_eEfiMWI.mp4", "subtitle_path": "bfP_eEfiMWI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.7, "view_count": 104985}, {"video_id": "7kdNoeuzn5I", "question": "Three people wearing orange life jackets are standing in front of the camera for an interview. After the subtitle says 'five depos that are available to collect,' what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["black shelf", "swimsuit", "white shelf", "kitchen", "swim ring"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "7kdNoeuzn5I_0", "video_path": "7kdNoeuzn5I.mp4", "subtitle_path": "7kdNoeuzn5I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 369.28, "view_count": 28005}, {"video_id": "7kdNoeuzn5I", "question": "During the interview, there is a woman on the far right of the screen wearing a blue swimming cap and a pink jacket. 
After the subtitle says 'cyclones before and you just know the,' what character appears on the screen?", "question_wo_referring_query": "What character appears on the screen?", "candidates": ["A man wearing a life jacket", "A short-haired woman in a pink shirt", "A woman wearing a black skirt", "A man in a yellow shirt holding a phone", "A woman wearing a colorful top"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "7kdNoeuzn5I_1", "video_path": "7kdNoeuzn5I.mp4", "subtitle_path": "7kdNoeuzn5I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 369.28, "view_count": 28005}, {"video_id": "y2aETzM4g6Y", "question": "On the green wooden board, where else has the heart-shaped dough been placed?", "question_wo_referring_query": "Where else has it been placed?", "candidates": ["On the blue dining mat", "In the oven", "In the rectangular baking tray", "In the square baking tray", "On the green and white checkered dining mat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "y2aETzM4g6Y_0", "video_path": "y2aETzM4g6Y.mp4", "subtitle_path": "y2aETzM4g6Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.04, "view_count": 326470}, {"video_id": "y2aETzM4g6Y", "question": "Where else has the triangular piece of meat being fried in the black pan appeared before?", "question_wo_referring_query": "Where else has it appeared before?", "candidates": ["In the refrigerator", "In the red pressure cooker", "In the oven", "On the white cutting board", "In the blue plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "y2aETzM4g6Y_1", "video_path": "y2aETzM4g6Y.mp4", "subtitle_path": "y2aETzM4g6Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.04, "view_count": 326470}, {"video_id": "N3aqvowVYdk", "question": 
"In the distant house's sky, there is a patch of pink-purple sky. With which subtitles does this pink-purple sky appear together?", "question_wo_referring_query": ", with which subtitles does this patch of pink-purple sky appear together?", "candidates": ["the supercontinent apart causing", "Relentless March of plate tectonics tore", "subsidence and crust or thinning as a", "significant transformations in", "Antarctica"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "N3aqvowVYdk_0", "video_path": "N3aqvowVYdk.mp4", "subtitle_path": "N3aqvowVYdk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 485.99, "view_count": 17148}, {"video_id": "N3aqvowVYdk", "question": "Under the blue sky, there is a yellow sandy beach next to a small house. By the shore of the yellow sandy beach, there is a deep green ocean like a gemstone. In which subtitles has this ocean appeared together with?", "question_wo_referring_query": ", in which subtitles has this ocean appeared together with?", "candidates": ["rift in needs to be clarified that", "minerals everywhere I'm beginning to", "in on where to explore though early I", "trying to ascertain the extent of the", "believe I have a rough idea of the"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "N3aqvowVYdk_1", "video_path": "N3aqvowVYdk.mp4", "subtitle_path": "N3aqvowVYdk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 485.99, "view_count": 17148}, {"video_id": "FX6R_TuYYJU", "question": "After a man wearing jeans appears with a pickaxe in the frame and creates a huge hole, what changes occur to the giant hole?", "question_wo_referring_query": "What changes occur to the giant hole?", "candidates": ["A cat appeared in the hole.", "The hole was filled with dirt.", "Some seeds were put into the hole.", "A dog appeared in the hole.", "The hole was filled."], 
"topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "FX6R_TuYYJU_0", "video_path": "FX6R_TuYYJU.mp4", "subtitle_path": "FX6R_TuYYJU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 506.94, "view_count": 8165}, {"video_id": "FX6R_TuYYJU", "question": "There is a black round bowl placed on the gray soil where green plants are grown, with a black digging tool next to it. What changes occurred in the black bowl after this scene?", "question_wo_referring_query": "What changes occurred in the black bowl after this scene?", "candidates": ["The black bowl is filled with a water bottle", "The black bowl is filled with soil", "The black bowl is filled with white liquid", "The black bowl is filled with some flowers", "The black bowl is filled with a stone"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "FX6R_TuYYJU_1", "video_path": "FX6R_TuYYJU.mp4", "subtitle_path": "FX6R_TuYYJU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 506.94, "view_count": 8165}, {"video_id": "SjZaykUfJqc", "question": "In a room with photos on the wall, a girl with black hair and bangs, wearing glasses, is sitting at a desk writing. What objects appear in this room?", "question_wo_referring_query": "In a room with photos on the wall, a girl with black hair and bangs, wearing glasses, is sitting at a desk writing. 
What objects appear in this room?", "candidates": ["refrigerator", "book", "air conditioner", "TV", "washing machine"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "SjZaykUfJqc_0", "video_path": "SjZaykUfJqc.mp4", "subtitle_path": "SjZaykUfJqc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.68, "view_count": 345316}, {"video_id": "SjZaykUfJqc", "question": "In the room, a girl with black hair draped over her shoulders is crouching on the floor watching a computer. There is a guitar placed at the white headboard of the bed in the room. What other objects can be found in this room?", "question_wo_referring_query": "What other objects can be found in this room?", "candidates": ["stuffed duck toy", "green plant", "bookshelf", "television", "water dispenser"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "SjZaykUfJqc_1", "video_path": "SjZaykUfJqc.mp4", "subtitle_path": "SjZaykUfJqc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.68, "view_count": 345316}, {"video_id": "O4Vb9ggAT38", "question": "On a dark yellow ground, a man wearing black pants is holding a camera and taking pictures of two soldiers in green uniforms in front. When the subtitle says 'anti war messages often caught the,' what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["ocean", "airplane", "rocket", "gun", "ship"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "O4Vb9ggAT38_0", "video_path": "O4Vb9ggAT38.mp4", "subtitle_path": "O4Vb9ggAT38_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 246.04, "view_count": 3958148}, {"video_id": "O4Vb9ggAT38", "question": "In a black background, a military green helmet with a visor appears. 
When the subtitle says 'was also utilized for individual,' what object is on the screen?", "question_wo_referring_query": "What object is on the screen?", "candidates": ["Three red peach heart buckle cards", "Two black peach heart buckle cards", "A blue peach heart buckle card", "A black buckle card with a skull on a black peach heart", "A red peach heart buckle card"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "O4Vb9ggAT38_1", "video_path": "O4Vb9ggAT38.mp4", "subtitle_path": "O4Vb9ggAT38_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 246.04, "view_count": 3958148}, {"video_id": "JAhaM-NXPZI", "question": "On a brown wooden table, what is the shape of the white pastry with meat filling placed on it?", "question_wo_referring_query": "What is the shape?", "candidates": ["Rectangle", "Ladder shape", "Circle", "Triangle", "Square"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "JAhaM-NXPZI_0", "video_path": "JAhaM-NXPZI.mp4", "subtitle_path": "JAhaM-NXPZI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 325.7, "view_count": 3741}, {"video_id": "JAhaM-NXPZI", "question": "On a brown wooden table, what shape is the white surface area pressed by a long cylindrical rolling pin?", "question_wo_referring_query": "What shape is it?", "candidates": ["Circle", "Trapezoid", "Triangle", "Square", "Rectangle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "JAhaM-NXPZI_1", "video_path": "JAhaM-NXPZI.mp4", "subtitle_path": "JAhaM-NXPZI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 325.7, "view_count": 3741}, {"video_id": "Op9gwDR1YlM", "question": "On the green grass, there is a small short house with an olive-colored wooden door. 
When the subtitle says 'tonight,' what color is the exterior wall of this house?", "question_wo_referring_query": "What color is the exterior wall of this house?", "candidates": ["black", "green", "blue", "white", "red"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Op9gwDR1YlM_0", "video_path": "Op9gwDR1YlM.mp4", "subtitle_path": "Op9gwDR1YlM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.44, "view_count": 8459}, {"video_id": "QD0X14mLJAE", "question": "Under the blue sky is a stretch of buildings. An empty lot appears on the right side of the road, with a house with blue walls to the right of the empty lot. On the left side of the road, many cars are parked. A yellow truck with a white cab is parked by the empty lot. Who is adding rocks from the empty lot to the truck?", "question_wo_referring_query": "Who is adding rocks from the empty lot to the truck?", "candidates": ["The black minivan", "The yellow truck with a white cab", "The orange excavator arm", "The white minivan", "The yellow truck with a black cab"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "QD0X14mLJAE_0", "video_path": "QD0X14mLJAE.mp4", "subtitle_path": "QD0X14mLJAE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 487.92, "view_count": 159082}, {"video_id": "QD0X14mLJAE", "question": "In front of a tall building, a white-haired man wearing a white shirt and dark suit is communicating with a woman wearing a white shirt and a blue suit. Both are wearing black glasses, and the man is also wearing a dark tie. 
Who is tightly holding a document?", "question_wo_referring_query": "Who is tightly holding a document?", "candidates": ["The blonde man with black-framed glasses", "The white-haired man wearing a black shirt and black-framed glasses", "The woman with black-framed glasses", "The white-haired man with black-framed glasses", "The woman wearing a black shirt and black-framed glasses"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "QD0X14mLJAE_1", "video_path": "QD0X14mLJAE.mp4", "subtitle_path": "QD0X14mLJAE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 487.92, "view_count": 159082}, {"video_id": "ceL9SeCezcI", "question": "In front of a white curtain, a woman with long black hair, wearing a gray hospital robe and a bracelet, is holding a circular tray-like object with silver foil wrapped on top of it. What happened the first time the object appeared?", "question_wo_referring_query": "What happened the first time the object appeared?", "candidates": ["The silver foil on the tray-like object was lifted", "The tray-like object was placed on the floor", "The tray-like object was placed on the table", "The tray-like object fell to the ground", "The tray-like object was continuously shaken"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "ceL9SeCezcI_0", "video_path": "ceL9SeCezcI.mp4", "subtitle_path": "ceL9SeCezcI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 569.44, "view_count": 31743}, {"video_id": "ceL9SeCezcI", "question": "In front of the white curtain, what happens the first time the woman with long black hair, wearing a gray coat and a ring, and holding green food in her hand, appears?", "question_wo_referring_query": "What happens the first time the green food appears?", "candidates": ["The green food is put into a paper bag.", "The green food is brought to the woman's 
mouth.", "The green food is flicked away by a whisk.", "The green food falls to the ground.", "The green food is placed on the table."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "ceL9SeCezcI_1", "video_path": "ceL9SeCezcI.mp4", "subtitle_path": "ceL9SeCezcI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 569.44, "view_count": 31743}, {"video_id": "DmnIKihcBHw", "question": "A crowd of people is standing in front of the gray building's glass doors, most of them holding umbrellas. There are green branches on the left side of the door. On the right, two pedestrians are wearing red and yellow clothing and carrying backpacks. When the subtitle 'you answer this right away when you' appears, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["A yellow car drives by on the road", "A black car drives by on the road", "A white car drives by on the road", "A red car drives by on the road", "A blue car drives by on the road"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "DmnIKihcBHw_0", "video_path": "DmnIKihcBHw.mp4", "subtitle_path": "DmnIKihcBHw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 379.91, "view_count": 14338}, {"video_id": "DmnIKihcBHw", "question": "A man in a shirt is standing in the center of the screen, with a projection screen behind him. On the right side of the projection screen is a yellow dog and a person wearing a hat. On the left side, someone is lying on the ground. 
When the subtitle 'because there's an enormous amount of' appears, what is the man wearing the hat doing?", "question_wo_referring_query": "What is the man wearing the hat doing?", "candidates": ["He is picking up something", "He stands up", "He is adjusting his hat", "He is searching through his pockets", "He is petting the yellow dog"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "DmnIKihcBHw_1", "video_path": "DmnIKihcBHw.mp4", "subtitle_path": "DmnIKihcBHw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 379.91, "view_count": 14338}, {"video_id": "9DqZYsckBwI", "question": "On a green background wall, there is a map hanging. A globe is positioned on the left side of the wall. Two men sit in front of a table with paper documents and a pen on it. There is a pair of glasses on the right side of the table. Which region do they discuss first?", "question_wo_referring_query": "Which region do they discuss first?", "candidates": ["Russia", "Switzerland", "Scotland", "England", "France"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "9DqZYsckBwI_0", "video_path": "9DqZYsckBwI.mp4", "subtitle_path": "9DqZYsckBwI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 296.2, "view_count": 3712156}, {"video_id": "9DqZYsckBwI", "question": "On an olive-colored wall, there is a map hanging, and a globe is positioned on the left side of the wall. A man in a red shirt is talking to the camera. 
When he talks about the difference between England and Scotland, which picture does he mention first?", "question_wo_referring_query": "Which picture does he mention first?", "candidates": ["Two types of snacks", "Two fresh flowers", "Two buildings", "Two glasses of water", "Two cows"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "9DqZYsckBwI_1", "video_path": "9DqZYsckBwI.mp4", "subtitle_path": "9DqZYsckBwI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 296.2, "view_count": 3712156}, {"video_id": "DnhsSbFiJ9s", "question": "Under the sky is a dense cityscape of buildings. A river flows from the upper right corner to the left, weaving through the city. Small green plants line the riverbanks. After the subtitle 'having 1.7 million people in the metro' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A wall sticker of a tiger appears on the wall.", "A boy wearing a blue lab coat appears.", "A few guitars appear.", "A stuffed animal face wearing a red-gray outfit and a baseball cap appears.", "A world map appears."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "DnhsSbFiJ9s_0", "video_path": "DnhsSbFiJ9s.mp4", "subtitle_path": "DnhsSbFiJ9s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 538.77, "view_count": 70611}, {"video_id": "DnhsSbFiJ9s", "question": "On a black and white patterned table, there is a white dinner plate with golden food ingredients and noodles inside. A silver utensil is placed under the food, and there is a red drink next to the plate. 
After the subtitle 'spaghetti' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A woman wearing a gray shirt with black vertical patterns on the chest is talking", "A man wearing a gray shirt with blue vertical patterns on the chest is talking", "A woman wearing a gray shirt with white vertical patterns on the chest is talking", "A man wearing a gray shirt with black vertical patterns on the chest is talking", "A man wearing a gray shirt with white vertical patterns on the chest is talking"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "DnhsSbFiJ9s_1", "video_path": "DnhsSbFiJ9s.mp4", "subtitle_path": "DnhsSbFiJ9s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 538.77, "view_count": 70611}, {"video_id": "4_LIyw04-hk", "question": "On a brown table, there is a black baking tray with a layer of paper on it. A yellow cake-like material has light yellow strips on top. A hand wearing a black glove is arranging the strips. What changes occurred to the yellow cake-like material when taken out of the oven?", "question_wo_referring_query": "What changes occurred to the yellow cake-like material when taken out of the oven?", "candidates": ["The food turned charred black", "The cake-like material became rectangular", "The food shrank", "The cake-like material became square", "The color of the food turned golden brown"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "4_LIyw04-hk_0", "video_path": "4_LIyw04-hk.mp4", "subtitle_path": "4_LIyw04-hk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 244.4, "view_count": 4343}, {"video_id": "4_LIyw04-hk", "question": "On the green table, a hand wearing a black glove is holding a golden spherical food item. 
What changes occur when the spherical food item appears on the baking tray?", "question_wo_referring_query": "What changes occur when the spherical food item appears on the baking tray?", "candidates": ["It changes from a round shape to a block shape", "It changes from a round shape to a flat shape", "It changes from a round shape to a strip shape", "It changes from golden yellow to white", "It changes from golden yellow to black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "4_LIyw04-hk_1", "video_path": "4_LIyw04-hk.mp4", "subtitle_path": "4_LIyw04-hk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 244.4, "view_count": 4343}, {"video_id": "-9vEf8uYKFc", "question": "There is a small potted plant beside the wall. A man with curly hair wearing a blue tank top is holding a delivery box. The side of the box has a red square pattern, and there is a flag in the top right corner. When the subtitle 'Wristband a little pouch do this so cool a bunch of Aussie coins' appears, what changes occur on the man's wrist?", "question_wo_referring_query": "What changes occur on the man's wrist?", "candidates": ["A mechanical watch appears on his wrist", "A blue wristband appears on his wrist", "A red string appears on his wrist", "An electronic watch appears on his wrist", "A tattoo appears on his wrist"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "-9vEf8uYKFc_0", "video_path": "-9vEf8uYKFc.mp4", "subtitle_path": "-9vEf8uYKFc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 411.40999999999997, "view_count": 137853}, {"video_id": "-9vEf8uYKFc", "question": "There are small potted plants beside the wall, a curly-haired man wearing a blue tank top is sitting next to them. There is a gray spherical object and silver metal in the upper right corner. 
Below the sphere is a light-colored sofa chair. When the subtitle 'I'm probably going to share a lot of the snacks with the other people on the geography now team like Ken and Brandon' appears, what change happens to the man's clothes?", "question_wo_referring_query": "What change happens to the man's clothes?", "candidates": ["He changes into a denim jacket.", "He puts on a green shawl.", "He puts on a black shawl.", "He changes into a blue short sleeve shirt.", "He puts on a red, blue, and white shawl."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "-9vEf8uYKFc_1", "video_path": "-9vEf8uYKFc.mp4", "subtitle_path": "-9vEf8uYKFc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 411.40999999999997, "view_count": 137853}, {"video_id": "wEr8K0Jq1_4", "question": "There are three apples on the left side of the table, and a transparent bowl containing water and fruits on the right side of the table. On the screen, one hand is holding an apple, and the other hand is holding a knife. 
What are these hands doing?", "question_wo_referring_query": "What are these hands doing?", "candidates": ["Washing the apple", "Cutting the apple", "Washing the knife", "Peeling the apple", "Cutting a lemon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "wEr8K0Jq1_4_0", "video_path": "wEr8K0Jq1_4.mp4", "subtitle_path": "wEr8K0Jq1_4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.94, "view_count": 82089}, {"video_id": "wEr8K0Jq1_4", "question": "There are eggs on the right side of the table, a blue appliance is placed in the middle, a black small pot is on the blue appliance, one hand is holding the black pot handle, another hand is holding a spatula, what are these hands doing?", "question_wo_referring_query": ", one hand is holding a spatula, what are these hands doing?", "candidates": ["Rinsing the spatula", "Stir-frying", "Adding ingredients to the pot", "Cleaning the pot", "Moving the pot's position"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "wEr8K0Jq1_4_1", "video_path": "wEr8K0Jq1_4.mp4", "subtitle_path": "wEr8K0Jq1_4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.94, "view_count": 82089}, {"video_id": "CzojUDP02Eg", "question": "A blond man wearing a blue shirt and black-rimmed glasses appears on the screen. Behind him is a red-tile house with white walls, surrounded by short green plants. Next to the house is an open space, with trees nearby. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["watch", "necklace", "a dog", "a painting", "car"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "CzojUDP02Eg_0", "video_path": "CzojUDP02Eg.mp4", "subtitle_path": "CzojUDP02Eg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 579.92, "view_count": 181112}, {"video_id": "CzojUDP02Eg", "question": "There is a colorful checkered blanket on the grass, a pair of hands is holding a phone, the phone screen shows a woman in white clothing and a large fish on the beach. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Watch", "Short sleeves", "Fruit", "Bracelet", "Necklace"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "CzojUDP02Eg_1", "video_path": "CzojUDP02Eg.mp4", "subtitle_path": "CzojUDP02Eg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 579.92, "view_count": 181112}, {"video_id": "blTVOmlfZ6E", "question": "A blonde girl with red nail polish is standing next to a brown door, with transparent glass and tiles behind her. To the left, there is a picture containing various vegetables. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["planter", "showerhead", "lamp", "glasses", "clothes rack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "blTVOmlfZ6E_0", "video_path": "blTVOmlfZ6E.mp4", "subtitle_path": "blTVOmlfZ6E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 472.17, "view_count": 18129}, {"video_id": "blTVOmlfZ6E", "question": "A woman wearing a black tank top and yoga pants is exercising on the black mat. 
She is holding a kettlebell in her hand and is also wearing white socks and shoes. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["child", "watch", "kettlebell", "planter", "washing machine"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "blTVOmlfZ6E_1", "video_path": "blTVOmlfZ6E.mp4", "subtitle_path": "blTVOmlfZ6E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 472.17, "view_count": 18129}, {"video_id": "fEd5uX0Lw9Y", "question": "The two sides of the red-walled house are green wooden fences, there's a big tree on the left, and a white-walled building at the back on the right. Four cartoon characters are lying underneath the red-walled house. Three people holding guns are visible through the second-floor window of the red-walled house. What is the shape of the second-floor window of the red-walled house?", "question_wo_referring_query": "What is the shape of the second-floor window of the red-walled house?", "candidates": ["arch window", "circular window", "triangular window", "semi-circular window", "square window"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "fEd5uX0Lw9Y_0", "video_path": "fEd5uX0Lw9Y.mp4", "subtitle_path": "fEd5uX0Lw9Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.92, "view_count": 1121436}, {"video_id": "fEd5uX0Lw9Y", "question": "In the distance is a blue sky, on the green ground, a tank is advancing under a rainy sky. Beside the tank are armed soldiers wearing helmets and uniforms, and there are trees behind the tank. 
What shape is the emblem on the tank?", "question_wo_referring_query": "What shape is the emblem on the tank?", "candidates": ["Dagger-shaped emblem", "Circle-shaped emblem", "Diamond-shaped emblem", "Star-shaped emblem", "Triangle-shaped emblem"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "fEd5uX0Lw9Y_1", "video_path": "fEd5uX0Lw9Y.mp4", "subtitle_path": "fEd5uX0Lw9Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.92, "view_count": 1121436}, {"video_id": "s-G74EQIwEQ", "question": "A lady in a grey dress and a man in a yellow coat are standing in front of a kitchen table, with kitchen utensils and fruits placed on the counter behind them. On the table in front of them are three jar-like items. When the subtitle 'welcome ryan for twinkie 1' appears, what shape are the tiles behind them?", "question_wo_referring_query": "What shape are the tiles behind them when the subtitle 'welcome ryan for twinkie 1' appears?", "candidates": ["Square", "Triangle", "Circle", "Pentagon", "Rectangle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "s-G74EQIwEQ_0", "video_path": "s-G74EQIwEQ.mp4", "subtitle_path": "s-G74EQIwEQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 475.56, "view_count": 3290097}, {"video_id": "s-G74EQIwEQ", "question": "A woman wearing a gray apron appears on screen holding two candy-shaped foods. Behind her on the table are a teal cup and silver utensils. 
What kind of light is in the kitchen when the subtitle 'thank you for watching if you make it' appears?", "question_wo_referring_query": "What kind of light is in the kitchen?", "candidates": ["yellow pendant light", "yellow wall light", "white table lamp", "white wall light", "white pendant light"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "s-G74EQIwEQ_1", "video_path": "s-G74EQIwEQ.mp4", "subtitle_path": "s-G74EQIwEQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 475.56, "view_count": 3290097}, {"video_id": "nxUUliRA0Co", "question": "On the gray floor, there is a small stool, and a large number of white stripes are hanging inside the room. On the left side, there is an intersecting column and a beige floor. A short-haired man wearing a blue jacket and black pants appears in the bottom left corner. What did this man do when he first appeared?", "question_wo_referring_query": "What did this man do when he first appeared?", "candidates": ["Taking photos of others", "Others taking photos of him", "Speaking on stage", "Clapping", "Sitting on the stool"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "nxUUliRA0Co_0", "video_path": "nxUUliRA0Co.mp4", "subtitle_path": "nxUUliRA0Co_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 226.14, "view_count": 15823}, {"video_id": "nxUUliRA0Co", "question": "There is a large gathering of people outdoors. On the gray wall, there are white railings, and outside the white railings is a red brick wall. A woman wearing a colorful floral dress stands in the middle. Next to her, a girl in black clothes is carrying a large satchel with big floral prints. 
What did the woman in the floral dress do when she first appeared?", "question_wo_referring_query": "What did the woman in the floral dress do when she first appeared?", "candidates": ["Applauded for others", "Drinking water", "Photographing others", "Giving a speech", "Sitting on the bench"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "nxUUliRA0Co_1", "video_path": "nxUUliRA0Co.mp4", "subtitle_path": "nxUUliRA0Co_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 226.14, "view_count": 15823}, {"video_id": "eS_WGNWdhkg", "question": "The forest has a lot of tall trees, and the ground is scattered with dead branches and yellow grass. A woman wearing brown pants and a black dog are on a rock. When the subtitle 'and peace at this time in the world' appears, what is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["The woman is riding on the dog", "The woman is standing and waving", "The woman is holding the dog", "The woman is dancing", "The woman is kneeling and waving"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "eS_WGNWdhkg_0", "video_path": "eS_WGNWdhkg.mp4", "subtitle_path": "eS_WGNWdhkg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.07, "view_count": 311416}, {"video_id": "eS_WGNWdhkg", "question": "The side of the wooden door is decorated in purple, the wooden door has two transparent windows, a flower ring is hanging between the two windows, through the windows you can see a wooden house inside, a lady wearing a gray and white coat is beside the door, when the subtitle 'Music' appears, what is the lady doing?", "question_wo_referring_query": "What is the lady doing?", "candidates": ["The lady is pushing the door with a bamboo basket", "The lady is pushing the door with a leather bag", "The lady is waving with a flower ring", "The 
lady is opening the door with a key", "The lady is shaking her head"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "eS_WGNWdhkg_1", "video_path": "eS_WGNWdhkg.mp4", "subtitle_path": "eS_WGNWdhkg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.07, "view_count": 311416}, {"video_id": "OVpDKoj6Okw", "question": "The man wearing the checkered shirt and the man in the dark short sleeves are having a conversation. Behind them is a white wall with potted plants placed on shelves, and a flag to the right of the potted plants. What did they do after talking about how tall they are?", "question_wo_referring_query": "What did they do after talking about how tall they are?", "candidates": ["They started arm wrestling.", "They shook hands.", "They took off their outerwear.", "They compared their heights.", "They both crouched down."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "OVpDKoj6Okw_0", "video_path": "OVpDKoj6Okw.mp4", "subtitle_path": "OVpDKoj6Okw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 435.85, "view_count": 233749}, {"video_id": "OVpDKoj6Okw", "question": "The man in the checkered shirt and the man in the dark short-sleeved shirt are talking to each other. Behind them is a white wall and potted plants. There is a flag to the right of the potted plants. 
What does the man in the checkered shirt do after standing up?", "question_wo_referring_query": "What does the man in the checkered shirt do after standing up?", "candidates": ["He picks up the flag behind him", "He turns on the light in the room", "He puts on a pair of glasses", "He adjusts his chair", "He picks up a painting"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "OVpDKoj6Okw_1", "video_path": "OVpDKoj6Okw.mp4", "subtitle_path": "OVpDKoj6Okw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 435.85, "view_count": 233749}, {"video_id": "_uLC-7gn6Gg", "question": "The long-haired lady wearing a black strap is sitting in the driver's seat, she is wearing a yellow bracelet on her hand, and there is no one in the black seat next to her. What is the first piece of clothing that the lady picks up?", "question_wo_referring_query": "What is the first piece of clothing that the lady picks up?", "candidates": ["A blue shirt", "A pair of blue jeans", "A pair of blue and white striped pants", "A white shirt", "A pair of olive green pants"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "_uLC-7gn6Gg_0", "video_path": "_uLC-7gn6Gg.mp4", "subtitle_path": "_uLC-7gn6Gg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.84, "view_count": 30853}, {"video_id": "_uLC-7gn6Gg", "question": "A long-haired woman wearing a black strap is sitting in the driver's seat, she has a yellow ring on her hand. The black seat next to her is empty. 
What is the first food item that appears in the woman's hand?", "question_wo_referring_query": "What is the first food item that appears in the woman's hand?", "candidates": ["Fried chicken", "A drink with a green straw", "Sausage", "Hamburger", "Noodles"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "_uLC-7gn6Gg_1", "video_path": "_uLC-7gn6Gg.mp4", "subtitle_path": "_uLC-7gn6Gg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.84, "view_count": 30853}, {"video_id": "u03OXCMGDqw", "question": "There is a black bucket on the right side of the gray floor. On the wooden brown cutting board, there is a green onion. One hand is pressing the green onion on the cutting board, and the other hand is holding a knife with a silver blade. What are these hands doing?", "question_wo_referring_query": "What are these hands doing?", "candidates": ["These hands are placing the green onion into a dish.", "These hands are cutting the green onion with the knife.", "These hands are peeling the green onion.", "These hands are using the knife to pat the green onion.", "These hands are putting away the knife."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "u03OXCMGDqw_0", "video_path": "u03OXCMGDqw.mp4", "subtitle_path": "u03OXCMGDqw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 276.13, "view_count": 6612}, {"video_id": "u03OXCMGDqw", "question": "A pair of hands holding yellow instant noodles appears above a light-colored plate placed on a wooden olive-colored cutting board on a gray desktop. 
What are these hands doing?", "question_wo_referring_query": "What are these hands doing?", "candidates": ["These hands are shaking the plate.", "These hands are lifting the instant noodles.", "These hands are rotating the instant noodles.", "These hands are holding the plate.", "These hands are hitting the cutting board with the instant noodles."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "u03OXCMGDqw_1", "video_path": "u03OXCMGDqw.mp4", "subtitle_path": "u03OXCMGDqw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 276.13, "view_count": 6612}, {"video_id": "AcEu8LEe_LY", "question": "The kitchen walls are white, the cabinets are white, and the cabinets are adjacent to large floor-to-ceiling windows. Light-colored curtains are pulled to the corners. A gray refrigerator stands to the left of the stove, which has a black countertop. A man wearing glasses and a shirt is leaning on the black countertop. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["a small potted plant", "a wristwatch", "a kitchen knife", "a frying pan", "a bracelet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "AcEu8LEe_LY_0", "video_path": "AcEu8LEe_LY.mp4", "subtitle_path": "AcEu8LEe_LY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 240.36, "view_count": 4885}, {"video_id": "ykdnVm-Bf6U", "question": "In the blue sky with few clouds, the ground is covered with rolling hills, which have dry plants on them. As the hills descend, irregular stone walls appear. The stones on the walls are long and sharp. 
When the subtitle 'creating a spectacular and rare sight to' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A dry creek", "A yellow hunting dog", "A small green plant", "A bird circling in the sky", "An old and worn-out car tire"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "ykdnVm-Bf6U_0", "video_path": "ykdnVm-Bf6U.mp4", "subtitle_path": "ykdnVm-Bf6U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 348.32, "view_count": 15048}, {"video_id": "JXSlV5_pwZw", "question": "A man wearing a shirt appears in a black-and-white photo. He's wearing a hat, and behind him is dense vegetation. The top button of his shirt is fastened tightly. In front of him is an old-fashioned camera. What kind of hat is this man wearing?", "question_wo_referring_query": "What kind of hat is this man wearing?", "candidates": ["Panama hat", "Wide-brimmed hat", "Beret", "Duckbill cap", "Military cap"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "JXSlV5_pwZw_0", "video_path": "JXSlV5_pwZw.mp4", "subtitle_path": "JXSlV5_pwZw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 520.48, "view_count": 256743}, {"video_id": "JXSlV5_pwZw", "question": "There are three car windows on one side of the car. The window near the front of the car shows a woman's head. The woman is holding a child's hand outside the car window. There is also a child under the round car light. 
What kind of clothes is the child, whose hand is held by the woman, wearing?", "question_wo_referring_query": "What kind of clothes is the child, whose hand is held by the woman, wearing?", "candidates": ["The child is wearing a robe", "The child is wearing a skirt", "The child is wearing overalls", "The child is wearing a long-sleeve shirt", "The child is wearing jeans"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "JXSlV5_pwZw_1", "video_path": "JXSlV5_pwZw.mp4", "subtitle_path": "JXSlV5_pwZw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 520.48, "view_count": 256743}, {"video_id": "ZPZdrxl4c2Q", "question": "The curtains in the room are drawn, and there is a child lying on the table in front of the window. Two women are positioned on either side of the child's head and feet, a man in the middle is holding a syringe, there is a bowl next to the child's feet, a corner of the room has a water pipe and a container with some liquid. What is the hairstyle of the woman next to the child's head?", "question_wo_referring_query": "What is the hairstyle of the woman next to the child's head?", "candidates": ["Long loose hair", "Tied into a high ponytail", "Hair is tucked into a hat", "Hair is tied up", "Short loose hair"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "ZPZdrxl4c2Q_0", "video_path": "ZPZdrxl4c2Q.mp4", "subtitle_path": "ZPZdrxl4c2Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 338.88, "view_count": 106900}, {"video_id": "ZPZdrxl4c2Q", "question": "There are four men standing beside an outdoor horse. There are farming tools placed outside the house on the horse's tail side. The buildings on the horse's head side are relatively short. The two men in the middle are wearing hats and leather shoes. 
How is the man on the far right dressed?", "question_wo_referring_query": "How is the man on the far right dressed?", "candidates": ["Hooded coat", "Denim jacket", "Long robe covering the knees", "A cotton-padded coat", "Short suit"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "ZPZdrxl4c2Q_1", "video_path": "ZPZdrxl4c2Q.mp4", "subtitle_path": "ZPZdrxl4c2Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 338.88, "view_count": 106900}, {"video_id": "IFYkJtSnUjo", "question": "In a tan-colored room with a white door and a silver door handle, there is a cord on the left side of the floor. A man wearing a gray short-sleeved shirt is sitting on a white chair. On the right side, there's a piece of furniture. Who is shaking their head in the video?", "question_wo_referring_query": "Who is shaking their head in the video?", "candidates": ["The man in the gray short-sleeved shirt sitting on the floor", "The man in the blue short-sleeved shirt sitting on the chair", "The man in the red short-sleeved shirt sitting on the floor", "The man in the gray short-sleeved shirt sitting on the chair", "The man in the red short-sleeved shirt sitting on the chair"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "IFYkJtSnUjo_0", "video_path": "IFYkJtSnUjo.mp4", "subtitle_path": "IFYkJtSnUjo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 188.11, "view_count": 490051}, {"video_id": "IFYkJtSnUjo", "question": "In a room with an olive-green floor, the door is white, the door handle is silver, there's a groove on the left side of the floor, a man in a gray short-sleeved shirt is sitting on a white chair, and in the black frame that pops up in the upper right corner, a man in a white short-sleeved shirt is talking. 
Who is holding their hands together?", "question_wo_referring_query": "Who is holding their hands together?", "candidates": ["The man in the blue short-sleeved shirt in the black frame", "The man in the white short-sleeved shirt in the black frame", "The man in the gray short-sleeved shirt sitting on the chair", "The man in the blue short-sleeved shirt sitting on the chair", "The man in the red short-sleeved shirt sitting on the chair"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "IFYkJtSnUjo_1", "video_path": "IFYkJtSnUjo.mp4", "subtitle_path": "IFYkJtSnUjo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 188.11, "view_count": 490051}, {"video_id": "nFBBaa9-nP4", "question": "The cartoon foot appears on the road, the shoes on the foot are blue, the edges of the shoes are white, there are green plants in the distance and blue sky with white clouds, what happens to the foot when the subtitle 'keeps your foot in place as you shift' appears?", "question_wo_referring_query": "what happens to the foot?", "candidates": ["The foot disappears", "The foot moves to the right", "The foot jumps up", "The foot moves to the left", "The foot stays in place"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "nFBBaa9-nP4_0", "video_path": "nFBBaa9-nP4.mp4", "subtitle_path": "nFBBaa9-nP4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.67000000000002, "view_count": 132400}, {"video_id": "nFBBaa9-nP4", "question": "There is a green grid on the wall, with sculptures, a cube, glasses, and other items placed inside the grid. A man in a plaid shirt with black-framed glasses is speaking. 
When the subtitle 'only 66% of people got it right it's' appears, what does the man do?", "question_wo_referring_query": "What does the man do?", "candidates": ["He stood up", "He straightened his collar with his hand", "He pushed his glasses with his hand", "He points to the camera with his hand", "He waved his hand"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "nFBBaa9-nP4_1", "video_path": "nFBBaa9-nP4.mp4", "subtitle_path": "nFBBaa9-nP4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.67000000000002, "view_count": 132400}, {"video_id": "kzsJdV9Uzzg", "question": "On a wooden cutting board, a hand wearing a ring is holding a knife and cutting a carrot. The knife has black letters and a red circular pattern on it. What did the person do after slicing the carrot?", "question_wo_referring_query": "What did the person do after slicing the carrot?", "candidates": ["Continued to cut some garlic", "Juiced the carrot slices", "Continued to cut some chili", "Threw the carrot slices into the pot", "Put the carrot slices into a bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "kzsJdV9Uzzg_0", "video_path": "kzsJdV9Uzzg.mp4", "subtitle_path": "kzsJdV9Uzzg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.2, "view_count": 186958}, {"video_id": "kzsJdV9Uzzg", "question": "On a wooden cutting board, a pair of hands is holding a knife to cut red chili peppers. The knife has black text and a red circular design. 
What did the person do after cutting the red chili peppers?", "question_wo_referring_query": ", what did the person do after cutting the red chili peppers?", "candidates": ["Threw the red chili peppers into the bowl", "Continued to slice some garlic", "Threw the red chili peppers into the plate", "Threw the red chili peppers into the pot", "Continued to slice some green onions"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "kzsJdV9Uzzg_1", "video_path": "kzsJdV9Uzzg.mp4", "subtitle_path": "kzsJdV9Uzzg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.2, "view_count": 186958}, {"video_id": "Gctcicm4UqU", "question": "A gray table has a wooden brown cutting board placed on it. In front of the cutting board are a pot and a plastic bottle. The middle part of the bottle is transparent, while the neck and base are white. What ingredient is placed on the cutting board for processing first?", "question_wo_referring_query": "What ingredient is placed on the cutting board for processing first?", "candidates": ["Carrot", "Eggplant", "Leek", "Green Pepper", "Cabbage"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "Gctcicm4UqU_0", "video_path": "Gctcicm4UqU.mp4", "subtitle_path": "Gctcicm4UqU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 403.13, "view_count": 7681}, {"video_id": "Gctcicm4UqU", "question": "On a grey desktop, there is a black pot. The pot handle is black, with a silver connection between the handle and the pot. A bottle of cooking oil with a yellow ring-shaped object is pouring oil into the pot. 
What ingredient is added to the pot first for cooking?", "question_wo_referring_query": "What ingredient is added to the pot first for cooking?", "candidates": ["Eggplant", "Carrot", "Green pepper", "Green onion", "Yellow mixed ingredients"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "Gctcicm4UqU_1", "video_path": "Gctcicm4UqU.mp4", "subtitle_path": "Gctcicm4UqU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 403.13, "view_count": 7681}, {"video_id": "YSppvwcvF4U", "question": "In the top left corner of the white background, there is a yellow sign and black characters, and in the center is a left-to-right arrow. On both sides of the arrow, there are equally divided squares. A red dot appears at the top right of the arrow. A man in a suit and glasses with a blue undershirt is speaking in the bottom right corner. He raises one hand to his chest. After the subtitle 'is you can actually learn this' appears, what action does this man take?", "question_wo_referring_query": "What action does this man take?", "candidates": ["He clasps his hands together", "He raises both his hands", "He adjusts his sleeves", "He taps his chest with his hand", "The pen in his hand disappears"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "YSppvwcvF4U_0", "video_path": "YSppvwcvF4U.mp4", "subtitle_path": "YSppvwcvF4U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.33, "view_count": 14}, {"video_id": "jGn95KDWZMU", "question": "On the white wall hangs a guitar. To the right of the wall, there is a black table with a red electronic device on it. Beside the table, there is a white power strip with black cables plugged in. A man in a white short-sleeved shirt and a baseball cap is talking in front of a microphone. 
After the subtitle 'world for example if a client says we' appears, what object shows up in the scene?", "question_wo_referring_query": "What object appears in the scene?", "candidates": ["A man in a suit wearing glasses", "A man in a suit with fists clenched", "A man in a suit with arms raised high", "A man in a suit with arms crossed in front of his chest", "A man in a suit wearing a hat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "jGn95KDWZMU_0", "video_path": "jGn95KDWZMU.mp4", "subtitle_path": "jGn95KDWZMU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 477.90999999999997, "view_count": 979}, {"video_id": "jGn95KDWZMU", "question": "A guitar hangs on a white wall, and on the right side of the wall, there is a black table with a red-cased electronic device. Beside the table, there is a white power strip with black cords plugged in. A man in a white short-sleeved shirt and baseball cap is talking in front of a microphone, with both his hands appearing on either side of the microphone, with pointing fingers up. After the subtitle 'is one I got from Master negotiator' appears, what object shows up in the scene?", "question_wo_referring_query": "A guitar hangs on a white wall, and on the right side of the wall, there is a black table with a red-cased electronic device. Beside the table, there is a white power strip with black cords plugged in. A man in a white short-sleeved shirt and baseball cap is talking in front of a microphone, with both his hands appearing on either side of the microphone, with pointing fingers up. 
After the subtitle 'is one I got from Master negotiator' appears, what object shows up in the scene?", "candidates": ["A man in a suit wearing a hat", "A man in a suit wearing glasses", "A man in a suit raising his hands up", "A man in a suit with a black shirt underneath", "A man in a suit with his hands crossed in front of his chest"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "jGn95KDWZMU_1", "video_path": "jGn95KDWZMU.mp4", "subtitle_path": "jGn95KDWZMU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 477.90999999999997, "view_count": 979}, {"video_id": "crALHjTiXbk", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, an animation explains the military actions taken by the U.S. against North Vietnam, then an animation introduces the background of the Northern Gulf Incident, followed by the animation describing the specific details of the incident including the timeline, casualties, and government attitudes.", "First, an animation introduces the specific details of the Northern Gulf Incident including the timeline, casualties, and government attitudes, then the animation introduces the background of the incident, followed by the animation explaining the military actions taken by the U.S. against North Vietnam.", "First, an animation introduces the background of the Northern Gulf Incident, then the animation explains the military actions taken by the U.S. against North Vietnam, followed by the animation describing the specific details of the incident including the timeline, casualties, and government attitudes.", "First, an animation explains the military actions taken by the U.S. 
against North Vietnam, then an animation introduces the specific details of the Northern Gulf Incident including the timeline, casualties, and government attitudes, followed by the animation describing the background of the incident.", "First, an animation introduces the background of the Northern Gulf Incident, then the animation describes the specific details of the incident including the timeline, casualties, and government attitudes, followed by the animation explaining the military actions taken by the U.S. against North Vietnam."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "crALHjTiXbk_0", "video_path": "crALHjTiXbk.mp4", "subtitle_path": "crALHjTiXbk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 192.0, "view_count": 1845031}, {"video_id": "crALHjTiXbk", "question": "What is the correct sequence of events described in the Northern Bay incident?", "question_wo_referring_query": "What is the correct sequence of events described in the Northern Bay incident?", "candidates": ["First, the animation mentions the US military claiming to be attacked, leading to a decision by the Congress to launch a counterattack against the South, then the animation describes the US military assisting the North in resisting enemy forces, and finally, the animation describes the South's military strike against the North.", "First, the animation describes the US military assisting the North in resisting enemy forces, then the animation describes the South's military strike against the North, and finally, the animation mentions the US military claiming to be attacked again, leading to a decision by the Congress to launch a counterattack.", "First, the animation describes the South's military strike against the North, then the animation describes the US military assisting the North in resisting enemy forces, and finally, the animation mentions the US military claiming to be attacked again, 
leading to a decision by the Congress to launch a counterattack.", "First, the animation describes the US military assisting the North in resisting enemy forces, then mentions the US military claiming to be attacked again, leading to a decision by the Congress to launch a counterattack, and finally, the animation describes the South's military strike against the North.", "First, the animation mentions the US military claiming to be attacked, leading to a decision by the Congress to launch a counterattack against the South, then the animation describes the South's military strike against the North, and finally, the animation describes the US military assisting the North in resisting enemy forces."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "crALHjTiXbk_1", "video_path": "crALHjTiXbk.mp4", "subtitle_path": "crALHjTiXbk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 192.0, "view_count": 1845031}, {"video_id": "G5GmFXLwHoc", "question": "There is an alarm clock with a silver frame placed on the table. Behind and to the right of the alarm clock is a black and white checkered decoration, and next to the alarm clock is a black phone. A man in a red short-sleeved shirt on the screen has his arms crossed in front of his chest. 
In what scene does this alarm clock appear?", "question_wo_referring_query": "In what scene does this alarm clock appear?", "candidates": ["A yellow wardrobe", "A sofa with white pillows", "A computer desk cluttered with items", "A glossy white floor", "A wooden table with potted plants and white plastic cups"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "G5GmFXLwHoc_0", "video_path": "G5GmFXLwHoc.mp4", "subtitle_path": "G5GmFXLwHoc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 592.84, "view_count": 21412}, {"video_id": "G5GmFXLwHoc", "question": "The right side of the olive-colored door is a white wall and a painting. The painting has a white background with small flowers on it. A woman wearing a blue denim jacket is in the center of the frame. The woman is wearing earrings, a necklace, and a pearl bracelet. In what other scene does this denim jacket appear?", "question_wo_referring_query": "In what other scene does this denim jacket appear?", "candidates": ["A clothes rack with a concentrated collection of clothes hanging on it", "A cluttered desk", "An old bicycle", "A roof covered with tiles", "A white ship"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "G5GmFXLwHoc_1", "video_path": "G5GmFXLwHoc.mp4", "subtitle_path": "G5GmFXLwHoc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 592.84, "view_count": 21412}, {"video_id": "pi9gYPbjcR8", "question": "On a white background with black dot decorations, there are black characters and two square frames at the center. Inside the frames, there are black arrows, and above the square frames, there are three arrows pointing right from the left. The bottom left corner features a green stripe with black letters. 
Which subtitle has appeared together with this image?", "question_wo_referring_query": "Which subtitle has appeared together with this image?", "candidates": ["ok", "cute puppy", "hello", "The sky and white clouds", "note all periodic tables have different"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "pi9gYPbjcR8_0", "video_path": "pi9gYPbjcR8.mp4", "subtitle_path": "pi9gYPbjcR8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.89, "view_count": 6787}, {"video_id": "pi9gYPbjcR8", "question": "On a white background with black dots for decoration, there are black characters. In the central part, there are three square boxes with three arrows pointing from left to right above them. In the bottom left corner, there is a green strip encasing black characters as a label. To the right of the label, there are characters with a small blue arrow pointing to the top right. With what subtitles do this small arrow and text appear together?", "question_wo_referring_query": ", with what subtitles do this small arrow and text appear together?", "candidates": ["The sky and white clouds", "the grams of sodium on top", "bottom to cancel these units out and ", "are three moles of na for one mole of", "next we will use the molar mass of"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "pi9gYPbjcR8_1", "video_path": "pi9gYPbjcR8.mp4", "subtitle_path": "pi9gYPbjcR8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.89, "view_count": 6787}, {"video_id": "FkTjFLzCvKU", "question": "Under the white sky, there are snow-covered trees with branches and leaves, snow piles on both sides of the stream, and a couple standing by the stream. The man is wearing a hat and gray pants, while the woman is wearing gloves and a scarf. A black dog is between them. 
What change does the woman undergo when she enters the house?", "question_wo_referring_query": "What change does the woman undergo when she enters the house?", "candidates": ["The woman puts on a fur coat", "The woman puts on jeans", "The woman puts on a suit", "The woman puts on a long skirt", "The woman puts on a bathrobe"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "FkTjFLzCvKU_0", "video_path": "FkTjFLzCvKU.mp4", "subtitle_path": "FkTjFLzCvKU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 329.4, "view_count": 289151}, {"video_id": "FkTjFLzCvKU", "question": "A white cloth with floral patterns is laid on a brown table. On the white cloth, there's a white bowl filled with white powder. Another bowl continuously adds water to the powder. What changes occur when the powder appears on the table?", "question_wo_referring_query": "What changes occur when the powder appears on the table?", "candidates": ["It turns into green powder", "It turns into a liquid form", "It turns into a green solid", "It gets compressed into a cake-like shape", "It turns into a green liquid"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "FkTjFLzCvKU_1", "video_path": "FkTjFLzCvKU.mp4", "subtitle_path": "FkTjFLzCvKU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 329.4, "view_count": 289151}, {"video_id": "mj86XmfOniY", "question": "The man wearing a black short-sleeve shirt in the top right corner is talking. In the bottom left corner of the screen, there's a mannequin with a blue tie and a black suit. The background of the mannequin is a blue data table. In the bottom right corner, there is a logo made up of blue, green, yellow, and red colors. 
When the subtitle 'indexed over time so a signal' appears, what change happens to the man?", "question_wo_referring_query": "What change happens to the man?", "candidates": ["The man puts on glasses", "The man moves from the top right corner to the top left corner", "The man stands up", "The man moves from the top right corner to the bottom right corner", "The man moves from the top right corner to the bottom left corner"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "mj86XmfOniY_0", "video_path": "mj86XmfOniY.mp4", "subtitle_path": "mj86XmfOniY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 482.0, "view_count": 5646}, {"video_id": "2dZ1WTctDU4", "question": "There is a notice board with black characters on a white wall. To the left of the wall is a tall white cabinet. A man in a white short-sleeved shirt is writing something on the notice board with a pen. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A watch", "A painting", "A yellow cartoon character", "A blue cartoon character", "A small potted plant"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "2dZ1WTctDU4_0", "video_path": "2dZ1WTctDU4.mp4", "subtitle_path": "2dZ1WTctDU4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 497.5, "view_count": 4563}, {"video_id": "2dZ1WTctDU4", "question": "In the reflection of the yellow building on the transparent window, a man in a black short-sleeved shirt is leaning outside the window. The man is wearing a white hat and holding a yellow drink in his hand. The man's arm has tattoos and he is wearing a ring. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Necklace", "Blue plate", "Silver ring", "Silver watch", "Laptop"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "2dZ1WTctDU4_1", "video_path": "2dZ1WTctDU4.mp4", "subtitle_path": "2dZ1WTctDU4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 497.5, "view_count": 4563}, {"video_id": "OD1gUgOIgpo", "question": "In front of a white wall with a painting hanging on it, there is a woman with medium-length hair wearing earrings. When she says 'Williamsburg Brooklyn looked at this,' what object is present in the picture?", "question_wo_referring_query": "What object is present in the picture?", "candidates": ["diamond necklace", "gold bracelet", "green floral scarf", "silver brooch", "white long-sleeved shirt"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "OD1gUgOIgpo_0", "video_path": "OD1gUgOIgpo.mp4", "subtitle_path": "OD1gUgOIgpo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 218.05, "view_count": 8099}, {"video_id": "OD1gUgOIgpo", "question": "In front of a white wall full of paintings, a woman with medium-length hair, wearing a purple top, is standing in front of a painting. 
When she says 'through the lens of 2023 I look at it,' what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["white casual pants", "blue skinny jeans", "black skirt", "black high heels", "golden high heels"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "OD1gUgOIgpo_1", "video_path": "OD1gUgOIgpo.mp4", "subtitle_path": "OD1gUgOIgpo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 218.05, "view_count": 8099}, {"video_id": "rauAWIcLIIw", "question": "Under the white cloud with gray spots, there is a patch of emerald green grassland and a blue lake shore. On the grassland, there is a cow eating grass. When the subtitle 'human populations today the site invites' appears, what does the cow look like?", "question_wo_referring_query": "What does the cow look like?", "candidates": ["It has yellow fur on its body and white fur on its legs", "Its entire body has white fur", "Its entire body has yellow fur", "It has black fur on its body and white fur on its legs", "Its entire body has black and white mixed fur"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "rauAWIcLIIw_0", "video_path": "rauAWIcLIIw.mp4", "subtitle_path": "rauAWIcLIIw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 363.73, "view_count": 5081}, {"video_id": "rauAWIcLIIw", "question": "A small stream flows forward with green plants on both sides. A bridge spans the stream. 
When the subtitle \"subscribing to the channel if you'd like\" appears, what color is the guardrail on the bridge?", "question_wo_referring_query": "What color is the guardrail on the bridge?", "candidates": ["black", "gray", "blue", "white", "purple"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "rauAWIcLIIw_1", "video_path": "rauAWIcLIIw.mp4", "subtitle_path": "rauAWIcLIIw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 363.73, "view_count": 5081}, {"video_id": "bmcy1-BDIYw", "question": "On a black and gray table, there is a piece of bread placed on a wooden cutting board. There is green sauce on the bread, and a black hand is holding an object to evenly spread the sauce on the bread. What is the black hand holding to spread the sauce on the bread?", "question_wo_referring_query": "What object is the black hand holding to evenly spread the sauce on the bread?", "candidates": ["A white silicone brush", "A yellow silicone brush", "A white plastic spatula", "A wooden brush", "A metal knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "bmcy1-BDIYw_0", "video_path": "bmcy1-BDIYw.mp4", "subtitle_path": "bmcy1-BDIYw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 233.6, "view_count": 257237}, {"video_id": "bmcy1-BDIYw", "question": "On a black and gray table, in the upper right corner, there are a few small red tomatoes. In the center, there is a wooden cutting board. The plastic wrap on the roll is being peeled off. 
Which pair of hands is peeling off the plastic wrap from the roll?", "question_wo_referring_query": "Which pair of hands is peeling off the plastic wrap from the roll?", "candidates": ["Black hands wearing a ring on the fingers", "Black hands wearing a watch on the wrist", "White hands wearing a ring on the fingers", "White hands wearing a watch on the wrist", "White hands wearing a bracelet on the wrist"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "bmcy1-BDIYw_1", "video_path": "bmcy1-BDIYw.mp4", "subtitle_path": "bmcy1-BDIYw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 233.6, "view_count": 257237}, {"video_id": "BY6Ld-8cZEM", "question": "On a patch of earth-colored ground, there is a soldier wearing yellow-green clothes and a row of soldiers wearing dark green clothes standing. Beside the soldiers, there are green trees and grass. When the subtitle \"soldiers and sprayed all the prison\" appears, what does the soldier wearing yellow-green clothes do?", "question_wo_referring_query": "What does the soldier wearing yellow-green clothes do?", "candidates": ["Fires a gun into the sky", "Uses a shield to block the attack from the soldiers wearing dark green clothes", "Shoots at the soldiers wearing dark green clothes with a gun", "Throws a grenade towards the soldiers wearing dark green clothes", "Throws a bomb towards the soldiers wearing dark green clothes"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "BY6Ld-8cZEM_0", "video_path": "BY6Ld-8cZEM.mp4", "subtitle_path": "BY6Ld-8cZEM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 457.75, "view_count": 5210387}, {"video_id": "BY6Ld-8cZEM", "question": "On a yellowish terrain, there is a patch of green grass in the distance, along with two trees. 
What did the person wearing deep green clothes and a deep green hat do when the subtitle 'lined up the captured Italians some of' appeared?", "question_wo_referring_query": "What did the person wearing deep green clothes do?", "candidates": ["Laid prone on the ground", "Fell down on the spot", "Ran forward", "Kneeled down on the spot", "Crawled forward"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "BY6Ld-8cZEM_1", "video_path": "BY6Ld-8cZEM.mp4", "subtitle_path": "BY6Ld-8cZEM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 457.75, "view_count": 5210387}, {"video_id": "S-7r0-oysaU", "question": "In front of a green screen, a man with short hair, wearing black sunglasses and a black shirt, is sitting on a black chair. He is speaking into a black microphone. After he raises his right hand, what does he do?", "question_wo_referring_query": "After raising his right hand, what does he do?", "candidates": ["Uses his right hand to hold the microphone", "Puts his right hand down", "Touches his collar with his right hand", "Touches the top of his head with his right hand", "Raises his left hand as well"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "S-7r0-oysaU_0", "video_path": "S-7r0-oysaU.mp4", "subtitle_path": "S-7r0-oysaU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 1207, "duration": 393.0, "view_count": 52446}, {"video_id": "S-7r0-oysaU", "question": "In front of a white background, with 'Share on Facebook' written on it, a man with short hair, wearing a black shirt, appears on the screen. 
What does the man in the bottom right of the screen do after pointing to the right with his right hand?", "question_wo_referring_query": "What action does he take?", "candidates": ["Points to the left with both hands", "Clenches his fists", "Touches his head with his right hand", "Puts his hands down", "Touches his collar with his right hand"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "S-7r0-oysaU_1", "video_path": "S-7r0-oysaU.mp4", "subtitle_path": "S-7r0-oysaU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 1207, "duration": 393.0, "view_count": 52446}, {"video_id": "mBQkRV9q5G0", "question": "In a zoomed-in map of the Earth, after enlarging the map of Australia, which place name is mentioned first on the screen?", "question_wo_referring_query": "Which place name is mentioned first on the screen?", "candidates": ["Mount Qomolangma", "Siberian grasslands", "Australia", "the Dead Sea", "Blue Mountain"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "mBQkRV9q5G0_0", "video_path": "mBQkRV9q5G0.mp4", "subtitle_path": "mBQkRV9q5G0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 272.54, "view_count": 2623}, {"video_id": "mBQkRV9q5G0", "question": "Which of the following scenes appears first in the video?", "question_wo_referring_query": "Which of the following scenes appears first in the video?", "candidates": ["Golden sunlight beams through the clouds, spreading over a green mountain face, with mist slowly rising between the mountains.", "A kola feeding on a eucalyptus tree.", "An endless green forest can be seen with no ground in sight.", "An iron bridge with white railings with a stream below, flanked by dense green grass on the left and scattered rocks on the right.", "A camera pans out gradually from three towering peaks."], "topic_category": "KG-Knowledge-Geography", 
"question_category": "O3O", "level": "L2-Relation", "id": "mBQkRV9q5G0_1", "video_path": "mBQkRV9q5G0.mp4", "subtitle_path": "mBQkRV9q5G0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 272.54, "view_count": 2623}, {"video_id": "sOtCsxsqP0M", "question": "On a PPT slide with a black background, there is a blue bar with the text 'ResNet to ConvNeXt'. A red light spot is located at the bottom right of the PPT slide. After the subtitle '82 percent' appears, what is the first action the red light spot takes?", "question_wo_referring_query": "What is the first action the red light spot takes?", "candidates": ["Slides to the right on the page", "Slides upward on the page", "Slides downward on the page", "Slides to the left on the page", "Circles in the middle of the page"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "sOtCsxsqP0M_0", "video_path": "sOtCsxsqP0M.mp4", "subtitle_path": "sOtCsxsqP0M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 253, "duration": 359.0, "view_count": 264}, {"video_id": "sOtCsxsqP0M", "question": "On a PPT slide, there is a large square composed of many small squares. The blue bar on the PPT slide says 'Depthwise convolutions'. Next to a gray square in the middle, there is a small red dot. 
After the subtitle 'the number of uh' appears, what action does the small red dot do first?", "question_wo_referring_query": "What action does the small red dot do first?", "candidates": ["Move up and down", "Move left and right", "Move up the page", "Move down the page", "Spin in circles on the page"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "sOtCsxsqP0M_1", "video_path": "sOtCsxsqP0M.mp4", "subtitle_path": "sOtCsxsqP0M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 253, "duration": 359.0, "view_count": 264}, {"video_id": "AMeKhbRk_Uw", "question": "On a rainy night, a person with short hair, wearing glasses, and a blue coat is speaking. After the subtitle \"remarkably Harvey is back over the water\" appears, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear after the subtitle \"remarkably Harvey is back over the water\"?", "candidates": ["A man wearing a green jacket with white patterns and a hat worn backward", "A man wearing a gray hat and a gray coat", "A man wearing a red shirt and a blue hat", "A man wearing a black-gray shirt with suspenders", "A man wearing a blue-black T-shirt with short blonde hair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "AMeKhbRk_Uw_0", "video_path": "AMeKhbRk_Uw.mp4", "subtitle_path": "AMeKhbRk_Uw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 398.07, "view_count": 144914}, {"video_id": "AMeKhbRk_Uw", "question": "In a room with a bedspread and a black TV, a man wearing a green top and gray pants is sitting. 
After the subtitle 'Catherine' appears, what is the first object to appear?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["white car", "big truck", "airplane", "drone", "red teddy bear"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "AMeKhbRk_Uw_1", "video_path": "AMeKhbRk_Uw.mp4", "subtitle_path": "AMeKhbRk_Uw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 398.07, "view_count": 144914}, {"video_id": "ABU9GvzDb7Q", "question": "Against a black background with a black microscope on the side and genetic sequence illustrations around it, a green SCI logo is in the center of the screen. In which of the following scenes has the green SCI logo appeared?", "question_wo_referring_query": "In which of the following scenes has the green SCI logo appeared?", "candidates": ["A man wearing a white chef's outfit cooking", "A sunny outdoor courtyard", "A boy wearing a striped sweater and glasses standing in front of a blue background", "A sophisticated dining room", "A woman in a floral dress dancing"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "ABU9GvzDb7Q_0", "video_path": "ABU9GvzDb7Q.mp4", "subtitle_path": "ABU9GvzDb7Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 391.81, "view_count": 167927}, {"video_id": "ABU9GvzDb7Q", "question": "In front of a blue background, there is a green SCI logo in the bottom right corner, and a man wearing a striped sweater and black-framed glasses. 
In which of the following scenes has he appeared?", "question_wo_referring_query": "In which of the following scenes has he appeared?", "candidates": ["On the blue background, there is a black frame in the top right corner with the words GENES & CANCER BEHAVIOR", "On the green background, there is a black frame in the top right corner with the words GENES & CANCER BEHAVIOR", "On the purple background, there is a black frame in the top right corner with the words GENES & CANCER BEHAVIOR", "On the red background, there is a black frame in the top right corner with the words GENES & CANCER BEHAVIOR", "On the white background, there is a black frame in the top right corner with the words GENES & CANCER BEHAVIOR"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "ABU9GvzDb7Q_1", "video_path": "ABU9GvzDb7Q.mp4", "subtitle_path": "ABU9GvzDb7Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 391.81, "view_count": 167927}, {"video_id": "UIdlqjqbT_s", "question": "In front of a blurry background, a pair of cans with black lids in hand are being closely filmed. The cans have English phrases and reflective golden labels on them. 
This can appeared together with which subtitles?", "question_wo_referring_query": "With which subtitles did this can appear together?", "candidates": ["\"usually on my tv stand i have two candles i decided to\ngo for hello autumn it smells of\"", "\"brown sugar and vanilla it's just beautiful it's our\nautumn bestseller everybody loves this candle\"", "\"I really like it\"", "\"It smells really good\"", "this tray will go where the tv stand is I think it looks very nice and"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "UIdlqjqbT_s_0", "video_path": "UIdlqjqbT_s.mp4", "subtitle_path": "UIdlqjqbT_s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 394.88, "view_count": 5254}, {"video_id": "UIdlqjqbT_s", "question": "On a black surface, there is a silver plate with a black bottle containing lilies on it. A woman wearing a white fur coat is pouring something into a cup with a fruit pattern on the inside wall. 
With which of the following subtitles has the cup with a fruit pattern on the inside wall appeared together?", "question_wo_referring_query": ", with which of the following subtitles has the cup with a fruit pattern on the inside wall appeared together?", "candidates": ["\"the second kind of busy being cozy this is my personal favorite i came up with this candle\"", "\"rustic i love that the wood is actually untreated or finished is that how we say it\"", "\"brown sugar and vanilla it's just beautiful it's our autumn bestseller everybody loves this candle\"", "\"usually on my tv stand i have two candles i decided to go for hello autumn it smells of\"", "\"all the way home\""], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "UIdlqjqbT_s_1", "video_path": "UIdlqjqbT_s.mp4", "subtitle_path": "UIdlqjqbT_s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 394.88, "view_count": 5254}, {"video_id": "ZtFtqA87jUM", "question": "When the SCI logo within the circle at the center of the blue grid background appears on a solid blue background, with two small black frames beside it, what change occurs to the SCI logo?", "question_wo_referring_query": "What change occurs to the SCI logo?", "candidates": ["White changes to green", "Green changes to blue", "White changes to blue", "Green changes to purple", "White changes to red"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "ZtFtqA87jUM_0", "video_path": "ZtFtqA87jUM.mp4", "subtitle_path": "ZtFtqA87jUM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 350.77, "view_count": 448401}, {"video_id": "ZtFtqA87jUM", "question": "In front of a yellow background, there is a man wearing dark red clothes, with a blue box to his right. Inside the box, there is a white dashed frame. 
When the text on his left changes from 'THERE ARE LOTS' to 'SOME', what changes occur in the white dashed frame?", "question_wo_referring_query": "What changes occur in the white dashed frame?", "candidates": ["It moves from the top part of the blue box to the middle part of the blue box", "It stretches from inside the blue box to outside the blue box", "It moves from the bottom part of the blue box to the top part of the blue box", "It moves from the top part of the blue box to the bottom part of the blue box", "It disappears from the blue box"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "ZtFtqA87jUM_1", "video_path": "ZtFtqA87jUM.mp4", "subtitle_path": "ZtFtqA87jUM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 350.77, "view_count": 448401}, {"video_id": "XjariDtbyFU", "question": "In a scene set against the city nightscape, a man with short hair and wearing a black suit appears. Below him, there is a red horizontal stripe and a white horizontal stripe. 
When the subtitle 'Andrew Tate's all-male Society the War' appears, what change happens to the red horizontal stripe?", "question_wo_referring_query": "What change happens to the red horizontal stripe?", "candidates": ["The red horizontal stripe changes to a blue horizontal stripe", "The red horizontal stripe moves from its position at the bottom to the top", "The red horizontal stripe becomes shorter", "The red horizontal stripe disappears", "The red horizontal stripe becomes wider"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "XjariDtbyFU_0", "video_path": "XjariDtbyFU.mp4", "subtitle_path": "XjariDtbyFU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 563.88, "view_count": 269271}, {"video_id": "XjariDtbyFU", "question": "In a room with a large black round table in the middle, a man wearing a green jacket with a black inner layer is sitting on a black chair. What changes occur to the man wearing the green jacket when the subtitle \"we'll talk off the back\" appears?", "question_wo_referring_query": "What changes occur to the man wearing the green jacket?", "candidates": ["The inner layer of clothes changed from black to white.", "The jacket changed from green to black.", "From sitting to standing.", "The clothes changed from a green jacket to a gray shirt.", "From having curly hair to having a crew cut."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "XjariDtbyFU_1", "video_path": "XjariDtbyFU.mp4", "subtitle_path": "XjariDtbyFU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 563.88, "view_count": 269271}, {"video_id": "ZLkP6S6mKsY", "question": "A man wearing a black long-sleeve shirt and black-framed glasses is standing in front of a blue background. 
To the left is a sticker with letters in blue, yellow, and red colors on a white background, and to the right is a small yellow dog wearing a red bow tie. What is the man doing at this moment?", "question_wo_referring_query": "What is the man doing at this moment?", "candidates": ["Raising his hand wearing a ring, smiling and talking to the camera", "Turning his back to the camera", "Raising his hand and waving to the camera", "Explaining knowledge facing a blackboard", "Bowing to the camera"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "ZLkP6S6mKsY_0", "video_path": "ZLkP6S6mKsY.mp4", "subtitle_path": "ZLkP6S6mKsY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 343.39, "view_count": 363304}, {"video_id": "ZLkP6S6mKsY", "question": "A man wearing a black long-sleeve shirt and black-rimmed glasses is standing in front of a blue background. On the left, there is a chestnut and a sticker of a triangular cloak, and on the right, there is a sticker of a triangular cloak. What is this man doing at this moment?", "question_wo_referring_query": "What is this man doing at this moment?", "candidates": ["Explaining knowledge while facing the blackboard.", "Waving his hand towards the camera.", "The man is slightly raising both hands and standing in front of the camera speaking.", "Turning his back to the camera.", "Bowing towards the camera."], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "ZLkP6S6mKsY_1", "video_path": "ZLkP6S6mKsY.mp4", "subtitle_path": "ZLkP6S6mKsY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 343.39, "view_count": 363304}, {"video_id": "uqKGREZs6-w", "question": "The background is a deep purple starry sky, and in the solar system in space, there are several different celestial bodies orbiting the Sun. 
Which objects are present in the frame at this moment?", "question_wo_referring_query": "Which objects are present in the frame at this moment?", "candidates": ["Ash-colored Moon", "White Moon", "Blue-green-white Earth", "Purple Mars", "Black celestial bodies"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "uqKGREZs6-w_0", "video_path": "uqKGREZs6-w.mp4", "subtitle_path": "uqKGREZs6-w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.83, "view_count": 20400082}, {"video_id": "uqKGREZs6-w", "question": "The background is a white sun and yellow sky, with an orange-red ground surface. A purple bird wearing a space suit is lying on a recliner, squinting due to bright light. At this moment, what objects are present on the screen?", "question_wo_referring_query": "At this moment, what objects are present on the screen?", "candidates": ["A building", "A bottle of water", "A grassy field", "A space suit", "A tree"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "uqKGREZs6-w_1", "video_path": "uqKGREZs6-w.mp4", "subtitle_path": "uqKGREZs6-w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.83, "view_count": 20400082}, {"video_id": "JIv2Cq_YEYM", "question": "In a room with moss-green wallpaper, there is a white rabbit with black eyes on a stool. The background features a bright window. 
When the subtitle 'Failure Without Fear of Myself how would I live and who would I be like to think' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Green plants in a flower pot", "Dishes", "Dining utensils", "Kitchen utensils", "Cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "JIv2Cq_YEYM_0", "video_path": "JIv2Cq_YEYM.mp4", "subtitle_path": "JIv2Cq_YEYM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 470.55, "view_count": 213654}, {"video_id": "JIv2Cq_YEYM", "question": "A person wearing a dark green dress and holding a cup with floral patterns in both hands is sitting in front of a mirror. Behind them are two windows with white frames. On the right side of the screen, there are white curtains. When the subtitle 'the world thank you so much' appears, what objects are present in the screen?", "question_wo_referring_query": "What objects are present in the screen?", "candidates": ["yellow wallpaper", "refrigerator", "camera", "white bowl", "green potted plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "JIv2Cq_YEYM_1", "video_path": "JIv2Cq_YEYM.mp4", "subtitle_path": "JIv2Cq_YEYM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 470.55, "view_count": 213654}, {"video_id": "0fVZOjQlRrE", "question": "The sky is filled with dark clouds. Outside a house with a full-length planter, a girl wearing a floral dress sits hugging her knees amidst a cluster of yellow flowers. 
What type of flowers are in this cluster?", "question_wo_referring_query": "What type of flowers are in the cluster of yellow flowers?", "candidates": ["Sunflower", "Chrysanthemum", "Rape flower", "Evening Primrose", "Rose"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "0fVZOjQlRrE_0", "video_path": "0fVZOjQlRrE.mp4", "subtitle_path": "0fVZOjQlRrE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 275.15999999999997, "view_count": 192017}, {"video_id": "0fVZOjQlRrE", "question": "Bright sunlight shines into the room through the window. A girl with curly hair is wearing a gray coat and an inner shirt with lake blue floral patterns. Behind her are a green plant and a sink, and on the left side are two wall paintings. What type of coat is this girl wearing?", "question_wo_referring_query": "What type of coat is this girl wearing?", "candidates": ["Linen jacket", "Denim jacket", "Sleeveless vest", "Blazer coat", "Wool coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "0fVZOjQlRrE_1", "video_path": "0fVZOjQlRrE.mp4", "subtitle_path": "0fVZOjQlRrE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 275.15999999999997, "view_count": 192017}, {"video_id": "5i8byIwk9sg", "question": "In a news studio with glass walls and silver pillars, there is a man on the right wearing a black suit with a dark red tie, and on the left is a female host wearing a light green top with a huge butterfly knot on her collar. 
When the subtitle 'begin with um all these kinds of Islamic' appears, what is the female host's hairstyle like?", "question_wo_referring_query": "What is the female host's hairstyle like?", "candidates": ["curly hair", "long hair", "blonde hair", "short hair", "red hair"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "5i8byIwk9sg_0", "video_path": "5i8byIwk9sg.mp4", "subtitle_path": "5i8byIwk9sg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 224.86, "view_count": 7863}, {"video_id": "5i8byIwk9sg", "question": "In a newsroom with glass walls and silver columns, there is a man on the right wearing a black suit with some gray in his hair, and a female host on the left in a light green top with a large butterfly bow on her collar. When the subtitle 'the west and also trying to find some' appears, what kind of tie is the man wearing?", "question_wo_referring_query": "What kind of tie is the man wearing?", "candidates": ["Checked tie", "Polka dot tie", "Floral tie", "Blue polka dot tie", "Dark red tie"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "5i8byIwk9sg_1", "video_path": "5i8byIwk9sg.mp4", "subtitle_path": "5i8byIwk9sg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 224.86, "view_count": 7863}, {"video_id": "8JCTp0oKR7w", "question": "On a cloudy day, there are two cars on a road lined with trees on both sides. On the right side of the road, there's a traffic sign indicating to yield to pedestrians. A white car on the left is crossing a zebra crossing, and on the right is a red car. This is the last appearance of this red car in the video. 
What is the red car doing at this moment?", "question_wo_referring_query": "What is the red car doing at this moment?", "candidates": ["Yielding to pedestrians", "Turning on the hazard lights", "Signaling a right turn", "Parked at the side", "Slowly crossing the zebra crossing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "8JCTp0oKR7w_0", "video_path": "8JCTp0oKR7w.mp4", "subtitle_path": "8JCTp0oKR7w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.88, "view_count": 7277}, {"video_id": "8JCTp0oKR7w", "question": "On the right white wall hangs an animation mural. When the man wearing a light khaki baseball cap and a blue coat, holding a steamed bun first appears, what is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Making a phone call", "Sweeping the floor", "Cooking", "Eating a steamed bun", "Talking with a friend"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "8JCTp0oKR7w_1", "video_path": "8JCTp0oKR7w.mp4", "subtitle_path": "8JCTp0oKR7w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.88, "view_count": 7277}, {"video_id": "tTJQfRYNPDM", "question": "On the wooden chopping board on the beige table, there are some mushrooms. A person in a blue and white floral dress is holding a knife. When the subtitle 'Mushrooms - 250 g.' 
appears, what is the person doing on the screen?", "question_wo_referring_query": ", what is the person doing on the screen?", "candidates": ["Cutting mushrooms", "Putting mushrooms into the pot", "Stir-frying mushrooms", "Washing mushrooms", "Putting mushrooms into water"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "tTJQfRYNPDM_0", "video_path": "tTJQfRYNPDM.mp4", "subtitle_path": "tTJQfRYNPDM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 374.1, "view_count": 45781}, {"video_id": "tTJQfRYNPDM", "question": "On the induction stove, there is a charcoal gray pot with mushrooms, yellow peppers, and shredded scallions inside. A hand appears above the pot. What is this hand doing when the subtitle 'Salt' appears?", "question_wo_referring_query": "What is this hand doing?", "candidates": ["Taking the vegetables out of the pot", "Covering the pot with a lid", "Sprinkling salt into the pot", "Sprinkling pepper into the pot", "Stirring the vegetables in the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "tTJQfRYNPDM_1", "video_path": "tTJQfRYNPDM.mp4", "subtitle_path": "tTJQfRYNPDM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 374.1, "view_count": 45781}, {"video_id": "4ha62n2nQ48", "question": "The scene shows a panoramic view of the city with dense vegetation and a mix of tall buildings and low houses. There is a long road. 
What happens first after this in the video?", "question_wo_referring_query": "What happens first after this in the video?", "candidates": ["A timeline appears with five black and white images of the city", "A man is talking to himself in front of a mirror indoors", "A child wearing a gray and white short-sleeved shirt is playing basketball with an adult on a basketball court", "A building collapses", "A child wearing a gray and white short-sleeved shirt is playing basketball with an adult on a basketball court"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "4ha62n2nQ48_0", "video_path": "4ha62n2nQ48.mp4", "subtitle_path": "4ha62n2nQ48_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.95, "view_count": 3606}, {"video_id": "4ha62n2nQ48", "question": "In the black and white screen, there are some people sitting on the grass with their backs to the camera, watching children play soccer on the field. After this, what happens first on the screen?", "question_wo_referring_query": "After this, what happens first on the screen?", "candidates": ["A panoramic view of a city appears.", "Three children are playing in the autumn courtyard.", "A man is talking to the camera indoors.", "A child in a gray and white short sleeve is playing basketball with an adult on the basketball court.", "A building collapses."], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "4ha62n2nQ48_1", "video_path": "4ha62n2nQ48.mp4", "subtitle_path": "4ha62n2nQ48_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.95, "view_count": 3606}, {"video_id": "_FKuq1oLZ4Q", "question": "The background is a white curtain decorated with some green leaves. On the right side, there is a stack of books with a calendar on it. 
A woman wearing a plaid shirt is holding a gray handbag with one hand, and her hand is on the zipper of the outer pocket of the bag. What is the first item she takes out of the bag?", "question_wo_referring_query": "What is the first item she takes out of the bag?", "candidates": ["perfume", "car keys", "sunglasses", "lipstick", "two small packets of medicine"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "_FKuq1oLZ4Q_0", "video_path": "_FKuq1oLZ4Q.mp4", "subtitle_path": "_FKuq1oLZ4Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.04, "view_count": 27758}, {"video_id": "qPB7CQxWMr4", "question": "On a wall covered in cracks, there is an iron bed. Next to the bed, there is a chair. A gun is also tied to the bed. After the white text on the side says \"warning when the bedroom door to the,\" which character appears first?", "question_wo_referring_query": "In a room with a wall covered in cracks, which character appears first?", "candidates": ["A man wearing glasses, with a headscarf, and an olive-green outfit", "A man wearing a blue long-sleeve shirt, blue pants, and with golden short hair", "A man wearing a beige striped short-sleeve shirt, with golden short hair, and a scar on his face", "A man wearing a coffee-colored hat, coffee-colored clothes, and glasses"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "qPB7CQxWMr4_0", "video_path": "qPB7CQxWMr4.mp4", "subtitle_path": "qPB7CQxWMr4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 514.5, "view_count": 668657}, {"video_id": "qPB7CQxWMr4", "question": "In front of a wooden house painted with ducks, on a lush grassy field, there is a tall tree next to the house. Under the tree, there's a shadow. 
After saying 'complete lack of assistance finally on,' which item appears for the first time?", "question_wo_referring_query": "Which item appears for the first time?", "candidates": ["A purple cell phone", "A wooden chair", "A black cane", "A red tractor"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "qPB7CQxWMr4_1", "video_path": "qPB7CQxWMr4.mp4", "subtitle_path": "qPB7CQxWMr4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 514.5, "view_count": 668657}, {"video_id": "pmrzxkh8jLM", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which sequence of events is correct?", "candidates": ["First, squeezing a yellow substance onto a white parchment paper, then pouring cream into a plastic bag placed on a glass, and finally a hand wearing a black glove holding a round cake-like object in front of a camera.", "First, pouring cream into a plastic bag placed on a glass, then squeezing a yellow substance onto a white parchment paper, and finally a hand wearing a black glove holding a round cake-like object in front of a camera.", "First, a hand wearing a black glove holding a round cake-like object in front of a camera, then pouring cream into a plastic bag placed on a glass, and finally squeezing a yellow substance onto a white parchment paper.", "First, pouring cream into a plastic bag placed on a glass, then a hand wearing a black glove holding a round cake-like object in front of a camera, and finally squeezing a yellow substance onto a white parchment paper.", "First, squeezing a yellow substance onto a white parchment paper, then a hand wearing a black glove holding a round cake-like object in front of a camera, and finally pouring cream into a plastic bag placed on a glass."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "pmrzxkh8jLM_0", "video_path": 
"pmrzxkh8jLM.mp4", "subtitle_path": "pmrzxkh8jLM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 413.33, "view_count": 2664}, {"video_id": "cCsvsggKTbU", "question": "A woman with a blurry background behind her, wearing a black long-sleeve top, meticulously done makeup, golden curly hair, and earrings \u2013 in which of the following scenes does she appear?", "question_wo_referring_query": "In which of the following scenes does she appear?", "candidates": ["A scene with two women appearing at the same time", "At the scene of a traffic accident", "In a scene with many drones", "In a scene with many pedestrians in the background"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "cCsvsggKTbU_0", "video_path": "cCsvsggKTbU.mp4", "subtitle_path": "cCsvsggKTbU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.96, "view_count": 129113}, {"video_id": "cCsvsggKTbU", "question": "There are many cars behind her. 
A woman wearing a black feather coat, a blue scarf, and long hair \u2014 in which of the following scenes does she appear?", "question_wo_referring_query": "In which of the following scenes does she appear?", "candidates": ["In the scene where a red horizontal stripe appears on the screen, with the text 'little earlier that they wanted to make'", "In the scene where a red horizontal stripe appears on the screen, with the text 'French farmers block motorways around Paris'", "In the scene where a red horizontal stripe appears on the screen, with the text 'Bethany um at the moment we understand'", "In the scene where a red horizontal stripe appears on the screen, with the text 'It's really hard to understand'"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "cCsvsggKTbU_1", "video_path": "cCsvsggKTbU.mp4", "subtitle_path": "cCsvsggKTbU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.96, "view_count": 129113}, {"video_id": "Jhrb9B0Hcf4", "question": "In front of a giant screen, a woman wearing a blue tight-fitting top, holding a black board, dressed in a black skirt and black stockings, stands on the stage. 
With which of the following subtitles has she appeared together?", "question_wo_referring_query": "With which of the following subtitles has she appeared together?", "candidates": ["\"people that he considered extraordinary \"", "\"Kissinger's strong connection with the\"", "\"published last year was a study of six\"", "\"leaders and Mr Lee was one of them\""], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "Jhrb9B0Hcf4_0", "video_path": "Jhrb9B0Hcf4.mp4", "subtitle_path": "Jhrb9B0Hcf4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 331.12, "view_count": 18712}, {"video_id": "Jhrb9B0Hcf4", "question": "In front of a bookshelf filled with books, there is a woman with graying hair, wearing black-framed glasses, with earrings, and dressed in a black top. With which of the following subtitles has she appeared together?", "question_wo_referring_query": ", with which of the following subtitles has she appeared together?", "candidates": ["\"awarded the Nobel Peace Prize in 1973\"", "\"moments on the plane back to Washington\"", "\"touched down in London because uh it had\"", "\"Kissinger against his wife wife's wishes\""], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "Jhrb9B0Hcf4_1", "video_path": "Jhrb9B0Hcf4.mp4", "subtitle_path": "Jhrb9B0Hcf4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 331.12, "view_count": 18712}, {"video_id": "Ewc8cfEH0EM", "question": "In a blurry background, a woman wearing a purple top, black-framed glasses, and with her hair tied up appears. 
When the subtitle 'are obsessed with the death because it's' shows up, what changes occur with her?", "question_wo_referring_query": "What changes occur with her?", "candidates": ["She changes from facing the camera to bending over looking at an exhibit", "She changes from wearing black-framed glasses to wearing white-framed glasses", "She changes from bending over looking at an exhibit to facing the camera", "She changes from wearing a purple top to wearing a white top"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "Ewc8cfEH0EM_0", "video_path": "Ewc8cfEH0EM.mp4", "subtitle_path": "Ewc8cfEH0EM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 206.87, "view_count": 2653}, {"video_id": "Ewc8cfEH0EM", "question": "Carved into a white board are figures of a man and a woman, with a green bead hanging below them. When the subtitle 'around the object there's a skeleton it' appears, what changes?", "question_wo_referring_query": "What changes?", "candidates": ["The full figures are replaced by only the green bead in the mirror", "The figures of the man and woman are replaced by just the man in the mirror", "The view changes from a front view to a side view of the mirror", "The figures of the man and woman are replaced by just the woman in the mirror"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "Ewc8cfEH0EM_1", "video_path": "Ewc8cfEH0EM.mp4", "subtitle_path": "Ewc8cfEH0EM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 206.87, "view_count": 2653}, {"video_id": "gAVqVQkCwzw", "question": "In a white room, there is a painting hanging on the wall. In the room, there is a man wearing a suit with a tie. 
What objects have appeared behind him?", "question_wo_referring_query": "What objects have appeared behind him?", "candidates": ["Golden wall lamp", "White chair", "Blue curtains", "Red cushions", "Green plants"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "gAVqVQkCwzw_0", "video_path": "gAVqVQkCwzw.mp4", "subtitle_path": "gAVqVQkCwzw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 495.84, "view_count": 1169}, {"video_id": "gAVqVQkCwzw", "question": "In an area surrounded by many cars, a group of people are holding a camera and filming a certain man. On the screen, there is also a small window with a man having curly hair giving an explanation. What objects are present in this scene?", "question_wo_referring_query": "In an area surrounded by many cars, a group of people are holding a camera and filming a certain man. On the screen, there is also a small window with a man having curly hair giving an explanation. What objects are present in this scene?", "candidates": ["green light", "white car", "red car", "silver watch", "red leaves"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "gAVqVQkCwzw_1", "video_path": "gAVqVQkCwzw.mp4", "subtitle_path": "gAVqVQkCwzw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 495.84, "view_count": 1169}, {"video_id": "qEowX-vOb4E", "question": "On a wooden table, there is a hand wearing a ring holding a bottle, spraying liquid into a glass bowl. When the subtitle 'when cooking and drying out your lasagna.' 
appears, what objects are on the wooden table?", "question_wo_referring_query": "what objects are on the wooden table?", "candidates": ["purple bowl", "red tomato", "green pepper", "purple eggplant", "yellow sauce"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "qEowX-vOb4E_0", "video_path": "qEowX-vOb4E.mp4", "subtitle_path": "qEowX-vOb4E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 447.16, "view_count": 1481955}, {"video_id": "qEowX-vOb4E", "question": "On a wooden table, in a square glass bowl containing prepared food with a lot of red sauce on top, what items are visible on the screen when the subtitle 'Finally add the grated Parmesan on top' appears?", "question_wo_referring_query": "What items are visible on the screen?", "candidates": ["Green chili", "Yellow tomato", "Green leaves", "Purple eggplant", "Red chili"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "qEowX-vOb4E_1", "video_path": "qEowX-vOb4E.mp4", "subtitle_path": "qEowX-vOb4E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 447.16, "view_count": 1481955}, {"video_id": "KuMVSjdtbDY", "question": "In a black-and-white video, there are bushes and trees on the right side of the tracks, and in the distance, there's a white building. 
How are the tracks arranged in the video?", "question_wo_referring_query": "How are the tracks arranged in the video?", "candidates": ["Parallel, extending forward", "Curving to the left", "Intersecting, extending forward", "Parallel, extending to the left", "Parallel, extending to the right"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "KuMVSjdtbDY_0", "video_path": "KuMVSjdtbDY.mp4", "subtitle_path": "KuMVSjdtbDY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 581.08, "view_count": 909}, {"video_id": "KuMVSjdtbDY", "question": "In a black-and-white video, there is a track extending forward in parallel. There are many slopes on the right side of the track, and on the left side of the track stands a sign. What is the shape of the sign?", "question_wo_referring_query": "What is the shape of the sign?", "candidates": ["Triangle", "Square", "Circle", "Pentagon", "Rectangle"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "KuMVSjdtbDY_1", "video_path": "KuMVSjdtbDY.mp4", "subtitle_path": "KuMVSjdtbDY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 581.08, "view_count": 909}, {"video_id": "5VyI-Gp5LFY", "question": "In front of a blue background, a woman wearing headphones, with nail polish, and dressed in a blue short-sleeve shirt is speaking. 
What is her hairstyle like when she says 'as far away from propellant tanks as possible'?", "question_wo_referring_query": "What is her hairstyle like?", "candidates": ["Golden, slightly curly long hair", "Black, short hair", "Black, slightly curly long hair", "Golden, short hair", "Brown, long hair"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "5VyI-Gp5LFY_0", "video_path": "5VyI-Gp5LFY.mp4", "subtitle_path": "5VyI-Gp5LFY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 432.39, "view_count": 143277}, {"video_id": "5VyI-Gp5LFY", "question": "On a black background, there is a simple diagram of solid fuel. Next to it, there is a box with the words 'Solid-fuel' written inside. When the subtitle 'In solid fuel rockets, you can mix them together and pack that mixture' appears, what color are the words inside the box?", "question_wo_referring_query": "What color are the words inside the box?", "candidates": ["purple", "white", "black", "green", "purple"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "5VyI-Gp5LFY_1", "video_path": "5VyI-Gp5LFY.mp4", "subtitle_path": "5VyI-Gp5LFY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 432.39, "view_count": 143277}, {"video_id": "I2jrA1FB3yI", "question": "In a car, there are a few men seated. One is wearing a red buttoned shirt with a black jacket, and another is wearing a gray long-sleeve coat with glasses. 
Which man is driving the car?", "question_wo_referring_query": "Which man is driving the car?", "candidates": ["The man wearing a red buttoned shirt", "The man wearing glasses", "The man wearing a black jacket", "The man wearing a black suit", "The man with a gold ring on his hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "I2jrA1FB3yI_0", "video_path": "I2jrA1FB3yI.mp4", "subtitle_path": "I2jrA1FB3yI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 256.12, "view_count": 1032016}, {"video_id": "I2jrA1FB3yI", "question": "On a bridge, there is a black railing on the side of the bridge. On the railing, there is a long pole covered with iron spikes. There are many trees on the side of the bridge. Who is saying 'Archway Road' here?", "question_wo_referring_query": "Who is saying 'Archway Road' here?", "candidates": ["A guy wearing a red button-down shirt", "A man wearing a black watch", "A guy wearing a black suit", "A man wearing jeans and green shoes", "A man wearing a black hat"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "I2jrA1FB3yI_1", "video_path": "I2jrA1FB3yI.mp4", "subtitle_path": "I2jrA1FB3yI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 256.12, "view_count": 1032016}, {"video_id": "SxatasGeC1s", "question": "On a day with blue skies and white clouds, there is a pile of car shells from scrapped vehicles behind two people wearing hats. One person is wearing a light green hat, and the other is wearing a dark blue hat. Both people are being interviewed. 
When the person wearing the dark blue hat says 'every step', what action does he take?", "question_wo_referring_query": "What action does the person wearing the dark blue hat take?", "candidates": ["Looked at the interview camera", "Adjusted his hat", "Took off the sunglasses he was wearing", "Looked at the person wearing the light green hat", "Looked down at the ground"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "SxatasGeC1s_0", "video_path": "SxatasGeC1s.mp4", "subtitle_path": "SxatasGeC1s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 515.84, "view_count": 112038}, {"video_id": "SxatasGeC1s", "question": "On a day with blue skies and white clouds, a woman with a ponytail, holding a loudspeaker, and wearing a blue shirt is standing on the road with barricades. When she says 'citizens just using any means they can,' what action does she perform?", "question_wo_referring_query": "What action does she perform?", "candidates": ["Her eyes look towards the two people wearing green clothes standing in the distance", "She holds the loudspeaker with both hands", "She walks towards the two people wearing green clothes standing in the distance", "She waves towards the camera", "She waves at the two people wearing green clothes standing in the distance"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "SxatasGeC1s_1", "video_path": "SxatasGeC1s.mp4", "subtitle_path": "SxatasGeC1s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 515.84, "view_count": 112038}, {"video_id": "E_spTuUV2jI", "question": "In a lush green valley, there are soldiers wearing red armor and green armor. 
After referring to the elite heavy infantry, what type of soldier did the announcer mention next?", "question_wo_referring_query": "What type of soldier did the announcer mention next?", "candidates": ["Lightly armed infantry", "Elite professional soldier", "Lightly armed militia", "Immortal army"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "E_spTuUV2jI_0", "video_path": "E_spTuUV2jI.mp4", "subtitle_path": "E_spTuUV2jI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 328.62, "view_count": 5085}, {"video_id": "E_spTuUV2jI", "question": "On a gray background, there is a shield on which 'Tower shield' is written. Next to the shield, there is an arrow with 'Tall Tower shield' written after it. After a side-note introducing the soldier's shield, what is introduced next?", "question_wo_referring_query": "What is introduced next after the side-note?", "candidates": ["Melee combat", "Halberd", "Armor", "Helmet"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "E_spTuUV2jI_1", "video_path": "E_spTuUV2jI.mp4", "subtitle_path": "E_spTuUV2jI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 328.62, "view_count": 5085}, {"video_id": "vceAEMKHx1w", "question": "In an auditorium that is overall blue, there are two hosts: one male and one female. The male is wearing a suit, and the female has short brown hair. 
Who is the first person they mention?", "question_wo_referring_query": "Who is the first person they mention?", "candidates": ["JOE BIDEN", "Trump", "Obama", "Greg Abbott", "Vladimir Putin"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "vceAEMKHx1w_0", "video_path": "vceAEMKHx1w.mp4", "subtitle_path": "vceAEMKHx1w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 413.11, "view_count": 10641}, {"video_id": "vceAEMKHx1w", "question": "In an auditorium that is entirely blue, there is a female host with short brown hair and a male host with short black hair. Which state do they mention first?", "question_wo_referring_query": "Which state do they mention first?", "candidates": ["Texas", "Alabama", "Nevada", "South Carolina", "Idaho"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "vceAEMKHx1w_1", "video_path": "vceAEMKHx1w.mp4", "subtitle_path": "vceAEMKHx1w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 413.11, "view_count": 10641}, {"video_id": "4d-hcSuSD4M", "question": "In a room, a bald man with black-framed glasses and a mustache, after he says 'history of the future actually', what happened?", "question_wo_referring_query": "what happened?", "candidates": ["An illustration appears in the middle of the screen showing a person with a black helmet and no facial features, who is mostly dressed in black.", "An illustration appears in the top right corner of the screen depicting a character with multicolored circles on their body.", "An illustration of a glass-walled room with chairs and a table appears in the middle of the screen.", "A glass chandelier decoration appears on the screen."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "4d-hcSuSD4M_0", "video_path": "4d-hcSuSD4M.mp4", "subtitle_path": "4d-hcSuSD4M_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 281.15, "view_count": 2678}, {"video_id": "4d-hcSuSD4M", "question": "In a hallway, across from a red floral chair in a room with many paintings, what happens after the subtitle 'of the fact of the museum being' appears?", "question_wo_referring_query": "What happens next?", "candidates": ["A white round small table appears in the screen.", "A clay wall appears in the screen.", "The camera continuously moves closer to the room with many paintings.", "The screen only shows a colorful chandelier."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "4d-hcSuSD4M_1", "video_path": "4d-hcSuSD4M.mp4", "subtitle_path": "4d-hcSuSD4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 281.15, "view_count": 2678}, {"video_id": "WMKu8WZ3FGc", "question": "In a scene with a messy background, a man wearing a gray T-shirt with curly hair is sitting on a white chair. After he says, \"Why is this? Well, long story short, the Korean war, things got ugly,\" what is the first image that appears?", "question_wo_referring_query": "What is the first image that appears?", "candidates": ["An image of a man with dyed blue hair wearing a black shirt", "An image of a woman carrying a child with a tank in the background", "An image of a group of people taking a photo together", "An image of Kim Jong-un and Kim Il-sung", "The South Korean flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "WMKu8WZ3FGc_0", "video_path": "WMKu8WZ3FGc.mp4", "subtitle_path": "WMKu8WZ3FGc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 322.41, "view_count": 526634}, {"video_id": "WMKu8WZ3FGc", "question": "On a webpage with many videos, there is a video thumbnail featuring a dinosaur fossil. 
Before the subtitle \"The Great Courses Plus has access to over 7000 video lectures and courses taught by highly accredited\" appears, what image first appears on the screen?", "question_wo_referring_query": "What image first appears on the screen?", "candidates": ["An image of a man in a full black suit with a red and blue ball on his chest", "A screenshot of a video webpage with a video thumbnail featuring a big dinosaur", "An image with a black background, a blue light line at the bottom, and the text 'THE GREAT COURSES' on it", "The national flags of North Korea and South Korea"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "WMKu8WZ3FGc_1", "video_path": "WMKu8WZ3FGc.mp4", "subtitle_path": "WMKu8WZ3FGc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 322.41, "view_count": 526634}, {"video_id": "wKUa8uCIN5M", "question": "On a red background, under the white text 'Armor & Shields', there is a gray helmet icon on the upper left side of a yellow stripe. In which of the following scenarios has it appeared?", "question_wo_referring_query": "In which of the following scenarios has it appeared?", "candidates": ["On a red background, with white text 'cassis' in the scene.", "On a red background, with white text 'Casco' in the scene.", "On a red background wall, with a central heart symbol.", "On a red background, with white text 'forte' in the scene."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "wKUa8uCIN5M_0", "video_path": "wKUa8uCIN5M.mp4", "subtitle_path": "wKUa8uCIN5M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.2, "view_count": 174197}, {"video_id": "wKUa8uCIN5M", "question": "On a red background, below two different styled helmets, there are four differently colored tiles. Among them, a gray tile appears on the white-text Iron. 
In which of the following scenes does this appear?", "question_wo_referring_query": "In which of the following scenes does this appear?", "candidates": ["On a red background, with two yellow icons, and the white-text bordo.", "On a red background, with two blue icons, and white text.", "On a red background, with a wooden head, leather pattern, and the white-text scutum.", "In a red background, on the left, a soldier holding a sword, and on the right, a soldier on horseback holding a spear."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "wKUa8uCIN5M_1", "video_path": "wKUa8uCIN5M.mp4", "subtitle_path": "wKUa8uCIN5M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.2, "view_count": 174197}, {"video_id": "c3_-fShe8Ks", "question": "A wooden plaque is hanging on a black wall, and in front of the wall there is a woman with red hair and wearing green clothes. When this woman appears outside the room eating food, standing beside a man in a white short-sleeved shirt, what change occurs to the woman's clothing?", "question_wo_referring_query": "What change occurs to the woman's clothing?", "candidates": ["Her clothing does not change", "Her clothing changes from green to white", "Her clothing changes from green to brown", "Her clothing changes from green to black", "Her clothing changes from green to wine red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "c3_-fShe8Ks_0", "video_path": "c3_-fShe8Ks.mp4", "subtitle_path": "c3_-fShe8Ks_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 507.98, "view_count": 118163}, {"video_id": "ePmpQpW2HzM", "question": "In a blue background, there is a woman with long hair wearing a checkered coat and a black inner shirt. In front of her, there is a line of black text that reads 'Melissa Maribel.' 
When this woman mentions 'you will find the percentage productivity is much higher,' what change occurs to the text in front of her?", "question_wo_referring_query": "What change occurs to the text in front of her?", "candidates": ["At this moment, there is no text in front of her", "The text in front of her changes from black to yellow", "The text in front of her changes from black to white", "The text in front of her changes from white to black", "The text in front of her changes from Melissa Maribel to you will find the percentage productivity is much higher"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "ePmpQpW2HzM_0", "video_path": "ePmpQpW2HzM.mp4", "subtitle_path": "ePmpQpW2HzM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.19, "view_count": 10043}, {"video_id": "ePmpQpW2HzM", "question": "In the upper left corner of the yellow surface, there is a bunch of pink flowers. In the upper right corner, there are three pieces of paper with writing on them. In the central part, there is a notebook with a round ring and a sticker that says 'If copper (II)' on it. What changes occur to the objects on the yellow surface when 'percent of a solution' is mentioned during the explanation?", "question_wo_referring_query": "In the upper left corner of the yellow surface, there is a bunch of pink flowers. In the upper right corner, there are three pieces of paper with writing on them. In the central part, there is a notebook with a round ring and a sticker that says 'If copper (II)' on it. 
What changes occur to the objects on the yellow surface when 'percent of a solution' is mentioned during the explanation?", "candidates": ["Nothing changed on the surface", "The pink flowers on the surface turned red", "The notebook turned yellow", "The surface only has a blank notebook", "The notebook on the surface turned into a piece of paper"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "ePmpQpW2HzM_1", "video_path": "ePmpQpW2HzM.mp4", "subtitle_path": "ePmpQpW2HzM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.19, "view_count": 10043}, {"video_id": "UtfyqgRfY0w", "question": "On a wooden board, there is a piece of gray paper. On the paper, there is a yellow food item. In the video, there is also a pair of hands wearing a watch and holding a knife. What are these hands doing in the video?", "question_wo_referring_query": "What are these hands doing in the video?", "candidates": ["The hands are cutting the yellow food item with a knife.", "The hands are cutting the watch with a knife.", "The hands are adjusting the watch.", "The hands are chopping the wooden board.", "The hands are throwing the yellow food item into a pot."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "UtfyqgRfY0w_0", "video_path": "UtfyqgRfY0w.mp4", "subtitle_path": "UtfyqgRfY0w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 318.61, "view_count": 333382}, {"video_id": "UtfyqgRfY0w", "question": "In a bright room, there is a table with a huge hamburger and a small hamburger on it. In front of the table, there is a man in white clothing and a man in black clothing. 
What is the man in black clothing doing?", "question_wo_referring_query": "What is the man in black clothing doing?", "candidates": ["The man is covering his mouth with his right hand and smiling", "The man is covering his mouth with his left hand and crying", "The man is standing up in surprise", "The man is covering his mouth with his left hand and smiling", "The man is covering his mouth with his right hand and smiling"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "UtfyqgRfY0w_1", "video_path": "UtfyqgRfY0w.mp4", "subtitle_path": "UtfyqgRfY0w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 318.61, "view_count": 333382}, {"video_id": "y5VdGiKFY7M", "question": "A woman wearing glasses and dressed in a black coat and grey shirt is sitting in front of a desk with a computer on it. On the desk behind her, there is a pot of purple flowers. The shelves further behind are filled with books. What is not present in this scene?", "question_wo_referring_query": "What is not present in this scene?", "candidates": ["Purple flowers", "Laptop", "Books", "Wristwatch", "Glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "y5VdGiKFY7M_0", "video_path": "y5VdGiKFY7M.mp4", "subtitle_path": "y5VdGiKFY7M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 189.77, "view_count": 1678}, {"video_id": "y5VdGiKFY7M", "question": "On a white wall, there are four frames hanging. The central part of the top two frames has black text, while the central part of the bottom two frames is black. A woman wearing a yellow coat with long curly hair is standing and looking up at the frames on the wall. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["Computer", "Glasses", "Black coat", "Flower", "Watch"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "y5VdGiKFY7M_1", "video_path": "y5VdGiKFY7M.mp4", "subtitle_path": "y5VdGiKFY7M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 189.77, "view_count": 1678}, {"video_id": "Jg_KbHUS8zs", "question": "In a brightly lit dining room, two people are sitting opposite each other, eating. One is wearing glasses and the other is not. At the bottom of the screen, the phrase 'THEY WERE EATING POUTINE!!' appears. What kind of outerwear is the person wearing glasses wearing?", "question_wo_referring_query": "What kind of outerwear is the person wearing glasses wearing?", "candidates": ["Plain white short-sleeve outerwear", "White and gray striped long-sleeve outerwear", "Orange long-sleeve outerwear", "Plain white long-sleeve outerwear", "Orange short-sleeve outerwear"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "Jg_KbHUS8zs_0", "video_path": "Jg_KbHUS8zs.mp4", "subtitle_path": "Jg_KbHUS8zs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.65, "view_count": 33750}, {"video_id": "Jg_KbHUS8zs", "question": "In a brightly lit indoor space, there are two women wearing hats. One of them has her hair tied up, while the other has her hair down. Nearby, a hand is pointing towards a wheelchair in front of the wall behind the two women. 
In this scene, what style of hat is the woman with the hair down wearing?", "question_wo_referring_query": "In this scene, what style of hat is the woman with the hair down wearing?", "candidates": ["green duckbill cap", "green round hat", "purple round hat", "purple duckbill cap", "blue duckbill cap"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "Jg_KbHUS8zs_1", "video_path": "Jg_KbHUS8zs.mp4", "subtitle_path": "Jg_KbHUS8zs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.65, "view_count": 33750}, {"video_id": "jF0GveGkjJ4", "question": "In a bright room, there is a gray-striped curtain hanging. A woman dressed in black clothing is placing her left index finger on her lips. When the subtitles mention 'is alcohol like a thing i don\u2019t know so,' what kind of clothing is the woman wearing?", "question_wo_referring_query": "What kind of clothing is the woman wearing?", "candidates": ["Black short sleeves with gray stripes", "Pure black short sleeves", "Black long sleeves with white stripes", "Black short sleeves with white stripes", "Black long sleeves with gray stripes"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "jF0GveGkjJ4_0", "video_path": "jF0GveGkjJ4.mp4", "subtitle_path": "jF0GveGkjJ4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 530.76, "view_count": 30036}, {"video_id": "jF0GveGkjJ4", "question": "In a brightly lit room, a woman wearing a black short-sleeved shirt stands in front of a window holding a pair of shoes, talking to the camera. 
When the subtitle 'I think the shoes from Brandy Melville are quite good' appears, what is the pattern on the surface of the shoes she is holding?", "question_wo_referring_query": ", what is the pattern on the surface of the shoes she is holding?", "candidates": ["pure white", "black and gray stripes", "black and gray checks", "black and white checks", "black and white stripes"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "jF0GveGkjJ4_1", "video_path": "jF0GveGkjJ4.mp4", "subtitle_path": "jF0GveGkjJ4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 530.76, "view_count": 30036}, {"video_id": "3JhPAQehsmU", "question": "In a brightly lit room, there is a white model with several white board-like models standing on it. Two people are inspecting these models. Which person knocked over one of the board-like models?", "question_wo_referring_query": "Which person knocked over one of the board-like models?", "candidates": ["The person wearing a blue coat", "The person wearing a white shirt", "The person wearing a gray coat", "The person wearing glasses", "The person wearing short sleeves"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "3JhPAQehsmU_0", "video_path": "3JhPAQehsmU.mp4", "subtitle_path": "3JhPAQehsmU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 431.85, "view_count": 42174}, {"video_id": "3JhPAQehsmU", "question": "In a spacious room with black walls, with a red painting leaned against the wall, there are four people wearing blue gloves engaging in activities beside a box. 
Who is the person holding a black plastic bag?", "question_wo_referring_query": "Who is the person holding a black plastic bag?", "candidates": ["The person wearing a white hat and glasses", "The person with yellow hair and wearing glasses", "The person wearing blue jeans", "The person wearing black clothes and almost no hair", "The person wearing a white hat"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "3JhPAQehsmU_1", "video_path": "3JhPAQehsmU.mp4", "subtitle_path": "3JhPAQehsmU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 431.85, "view_count": 42174}, {"video_id": "Hgg-iF1u5QA", "question": "In a room with beige walls, there is a man wearing a blue short-sleeved shirt and a woman wearing a necklace and a white outfit. When the woman picks up a cup containing green liquid for the first time, what does she do?", "question_wo_referring_query": "What does she do?", "candidates": ["She drinks the liquid in the cup", "She feeds the man water", "She hands the cup to the man in blue", "She throws the cup away", "She spills the liquid in the cup"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Hgg-iF1u5QA_0", "video_path": "Hgg-iF1u5QA.mp4", "subtitle_path": "Hgg-iF1u5QA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 394.7, "view_count": 10966}, {"video_id": "Hgg-iF1u5QA", "question": "When a woman with long blonde hair, dressed in black, first appeared holding a green object in front of the mirror in the bathroom, what did she do?", "question_wo_referring_query": "What did she do?", "candidates": ["She was applying mascara", "She was applying lipstick", "She was drawing her eyebrows", "She was washing her face", "She was dancing"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Hgg-iF1u5QA_1", "video_path": 
"Hgg-iF1u5QA.mp4", "subtitle_path": "Hgg-iF1u5QA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 394.7, "view_count": 10966}, {"video_id": "-tqCzG-6nJY", "question": "In the black and white screen, there is a spacious room. A man with short hair, wearing a patterned outfit, is standing by a pot, holding a pan in his right hand. When the subtitle 'actually this might take a while but at' appears, what does this man do with his left hand?", "question_wo_referring_query": "What does the man do with his left hand?", "candidates": ["He places his left hand on the counter.", "He picks up an egg beater with his left hand.", "He picks up the pot with his left hand.", "He puts a lid on the pot with his left hand.", "He is adding something to the pot with his left hand."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "-tqCzG-6nJY_0", "video_path": "-tqCzG-6nJY.mp4", "subtitle_path": "-tqCzG-6nJY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 440.44, "view_count": 190422}, {"video_id": "-tqCzG-6nJY", "question": "In a spacious room, there's a countertop with some food placed on it. Behind the countertop, there are a woman with long hair dressed in black and a man with short hair dressed in red and black plaid clothes. 
When the subtitle mentions 'take the flavor of Nashua thank you I,' what does the woman do?", "question_wo_referring_query": "What does the woman do?", "candidates": ["She places her left hand on the man's right shoulder", "She picks up a piece of food with chopsticks", "She places her right hand on the man's left shoulder", "She places her left hand on the man's left shoulder", "She places her right hand on the man's right shoulder"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "-tqCzG-6nJY_1", "video_path": "-tqCzG-6nJY.mp4", "subtitle_path": "-tqCzG-6nJY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 440.44, "view_count": 190422}, {"video_id": "Ynd85xJ7PYU", "question": "In a blue background with some yellow and red patterns, and the text 'Commissioned Officers,' a circular pattern appears in the lower half of the screen. The circle contains a hamburger. What is the pattern that appears immediately after this one?", "question_wo_referring_query": "What is the pattern that appears immediately after this one?", "candidates": ["A circular pattern appears with a hamburger inside.", "A circular pattern appears with a question mark inside.", "A circular pattern appears with a person inside.", "A circular pattern appears with a bell inside.", "A circular pattern appears with a group of people inside."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "Ynd85xJ7PYU_0", "video_path": "Ynd85xJ7PYU.mp4", "subtitle_path": "Ynd85xJ7PYU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 426.6, "view_count": 843559}, {"video_id": "Ynd85xJ7PYU", "question": "There is a boat in a blue background, with two black flags hanging on it. Three long masts stand on the boat, with some white lines holding onto the masts. 
After a circular icon with an exclamation mark appears on the boat, what is the icon that immediately follows?", "question_wo_referring_query": ", what is the icon that immediately follows?", "candidates": ["A circular icon appears, with a boat in the circle.", "A circular icon appears, with an exclamation mark in the circle.", "A circular icon appears, with a question mark in the circle.", "A circular icon appears, with an anchor in the circle.", "A circular icon appears, with a person wearing glasses and a magnifying glass in the circle."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "Ynd85xJ7PYU_1", "video_path": "Ynd85xJ7PYU.mp4", "subtitle_path": "Ynd85xJ7PYU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 426.6, "view_count": 843559}, {"video_id": "jEFX__i_0ps", "question": "In front of a backdrop of earth and rivers, what did a man in the video do after a woman with wavy hair and wearing a pink coat looked down while speaking and the subtitle mentioned 'International Affairs Commentator Doug'?", "question_wo_referring_query": "What did the man in the video do?", "candidates": ["He blinked", "He adjusted his collar", "He touched his head with his hand", "He raised his right hand", "He sat on a chair"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "jEFX__i_0ps_0", "video_path": "jEFX__i_0ps.mp4", "subtitle_path": "jEFX__i_0ps_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 340.8, "view_count": 33214}, {"video_id": "jEFX__i_0ps", "question": "A bald man wearing a white shirt is speaking with his hands clasped tightly. Behind him on the screen, there is a building with glass curtain walls, and parked in front of the building are red and yellow cars. 
After the subtitle mentions 'if you think back on these,' what action does the man holding a pen on the screen behind him perform?", "question_wo_referring_query": "What action does the man holding a pen on the screen behind him perform immediately after?", "candidates": ["He walks into the building in front.", "He takes off his coat.", "He waves to the host.", "He tears up the paper in his hand.", "He writes on the paper."], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "jEFX__i_0ps_1", "video_path": "jEFX__i_0ps.mp4", "subtitle_path": "jEFX__i_0ps_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 340.8, "view_count": 33214}, {"video_id": "NSZeExajXsM", "question": "In the video, which of the following sequences is correct?", "question_wo_referring_query": "In the video, which of the following sequences is correct?", "candidates": ["First, a woman with tied hair is eating pizza; next, a woman is eating while driving; finally, a woman in green clothes gives a thumbs up", "First, a woman in green clothes gives a thumbs up; next, a woman with tied hair is eating pizza; finally, a woman is eating while driving", "First, a woman is eating while driving; next, a woman in green clothes gives a thumbs up; finally, a woman with tied hair is eating pizza", "First, a woman in green clothes gives a thumbs up; next, a woman is eating while driving; finally, a woman with tied hair is eating pizza", "First, a woman is eating while driving; next, a woman with tied hair is eating pizza; finally, a woman in green clothes gives a thumbs up"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "NSZeExajXsM_0", "video_path": "NSZeExajXsM.mp4", "subtitle_path": "NSZeExajXsM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 445.45, "view_count": 13972}, {"video_id": "N6ohRUog0RY", "question": "At the beginning of the 
video, there is a woman sitting in the driver's seat of a car, holding the steering wheel and wearing gray clothing. In which scenarios has this woman appeared?", "question_wo_referring_query": "In which scenarios has this woman, who spread her hands, appeared?", "candidates": ["She appeared in front of a white curtain", "She appeared at the zoo", "She appeared in front of a red brick house", "She appeared at the swimming pool", "She appeared on the bus"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "N6ohRUog0RY_0", "video_path": "N6ohRUog0RY.mp4", "subtitle_path": "N6ohRUog0RY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 495.16, "view_count": 63803}, {"video_id": "N6ohRUog0RY", "question": "At the beginning of the video, a woman is seen sitting in a car with a black seat and a black steering wheel, reaching out with her right hand to adjust the mirror. In which scenes has this car appeared?", "question_wo_referring_query": "Reaching out with her right hand to adjust the mirror, in which scenes has this car appeared?", "candidates": ["Appeared in an underground parking lot", "Appeared in front of the White House", "Appeared in a McDonald's parking lot", "Appeared at the zoo", "Appeared in front of a \"STARBUCKS COFFEE\" sign"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "N6ohRUog0RY_1", "video_path": "N6ohRUog0RY.mp4", "subtitle_path": "N6ohRUog0RY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 495.16, "view_count": 63803}, {"video_id": "fxgKqE4zRIw", "question": "In the second half of the video, a monitor appears. On the monitor's screen, the words 'SIMPLE HISTORY' are displayed. 
With which subtitles does this monitor appear?", "question_wo_referring_query": "With which subtitles does this monitor appear?", "candidates": ["installation takes time and 'in World War two'", "This episode is sponsored by CuriosityStream and Battle of Berlin you may not have known", "I must remain hidden to avoid enemy disruption and 'in World War two'", "This episode is sponsored by CuriosityStream and I must remain hidden to avoid enemy disruption", "This episode is sponsored by CuriosityStream and installation takes time"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "fxgKqE4zRIw_0", "video_path": "fxgKqE4zRIw.mp4", "subtitle_path": "fxgKqE4zRIw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 291.88, "view_count": 4966547}, {"video_id": "fxgKqE4zRIw", "question": "In the video, besides the muzzle, the entire body of the gray tank is buried underground. In which subtitles has this tank appeared together?", "question_wo_referring_query": "In which subtitles has this tank appeared together?", "candidates": ["Transportation and placement on the ground and in World War II", "In World War II and ground", "Transportation and placement on the ground and Battle of Berlin: You May Not Know", "Transportation and placement on the ground and ground", "Battle of Berlin: You May Not Know and ground"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "fxgKqE4zRIw_1", "video_path": "fxgKqE4zRIw.mp4", "subtitle_path": "fxgKqE4zRIw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 291.88, "view_count": 4966547}, {"video_id": "FwZGu1t4WWY", "question": "In a silver mold, egg yolk and egg white are placed. 
When the egg white is taken out from the mold and held in hand, what change does the egg white undergo?", "question_wo_referring_query": "What change does the egg white undergo?", "candidates": ["The egg white changes from liquid to a yellow solid", "The egg white changes from liquid to a green solid", "The egg white changes to a yellow liquid", "The egg white changes from liquid to a white solid", "The egg white changes from liquid to a red solid"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "FwZGu1t4WWY_0", "video_path": "FwZGu1t4WWY.mp4", "subtitle_path": "FwZGu1t4WWY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.05, "view_count": 136908}, {"video_id": "FwZGu1t4WWY", "question": "On the grey countertop, there is a transparent bowl with half a lemon placed next to it, along with some yellow and white ingredients. When this bowl appears together with the words 'CREAM CHEESE', what change occurs to this bowl?", "question_wo_referring_query": ", what change occurs to this bowl?", "candidates": ["This bowl contains both green and yellow ingredients", "This bowl contains half a lemon", "This bowl is filled with water", "This bowl only contains yellow ingredients", "This bowl contains green onions"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "FwZGu1t4WWY_1", "video_path": "FwZGu1t4WWY.mp4", "subtitle_path": "FwZGu1t4WWY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.05, "view_count": 136908}, {"video_id": "GThjSF-83HI", "question": "On a stage with a green background, there is a flying disc in the center. In front of the flying disc stands a man in purple clothing. There is a group of spectators below the stage, some of whom are holding up their phones to take photos. 
When the subtitle mentions 'happened a few seconds ago, we will bring,' what change occurred in the flying disc?", "question_wo_referring_query": "On a stage with a green background, there is a flying disc in the center. In front of the flying disc stands a man in purple clothing. There is a group of spectators below the stage, some of whom are holding up their phones to take photos. When the subtitle mentions 'happened a few seconds ago, we will bring,' what change occurred in the flying disc?", "candidates": ["Two flying discs appeared on the flying disc", "A green area appeared on the flying disc", "A red area appeared on the flying disc", "A black area appeared on the flying disc", "A white area appeared on the flying disc"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "GThjSF-83HI_0", "video_path": "GThjSF-83HI.mp4", "subtitle_path": "GThjSF-83HI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 324.32, "view_count": 40318}, {"video_id": "GThjSF-83HI", "question": "A man stands in front of the PDC logo, clenching his fists and roaring. The scrolling ticker below displays the words 'AKING NEWS.' What change occurs to this man when the subtitle mentions 'on stage at the moment there and we're'?", "question_wo_referring_query": "What change occurs to this man?", "candidates": ["The man is holding a microphone stand.", "The man changes into a white jacket.", "The man is holding a microphone.", "The man changes into a red jacket.", "The man is holding a trophy."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "GThjSF-83HI_1", "video_path": "GThjSF-83HI.mp4", "subtitle_path": "GThjSF-83HI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 324.32, "view_count": 40318}, {"video_id": "g1KNpJfu_ZQ", "question": "When sunlight shines on the stone wall, a stone carving of a person appears on it. 
After the subtitle 'translate the early dynasties stories' appears, what change occurs?", "question_wo_referring_query": "When sunlight shines on the stone wall, a stone carving of a person appears on it. After the subtitle 'translate the early dynasties stories' appears, what change occurs?", "candidates": ["A young person appears", "An old person with white hair and wearing sunglasses appears", "A child appears", "A woman appears", "A dark-skinned person appears"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "g1KNpJfu_ZQ_0", "video_path": "g1KNpJfu_ZQ.mp4", "subtitle_path": "g1KNpJfu_ZQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 420.89, "view_count": 28356}, {"video_id": "g1KNpJfu_ZQ", "question": "The sunlight shines on the upper left of the stone block, and there are pictographic characters carved on the stone. A man wearing a hat is standing on the left side of the screen. What happened before the subtitle 'royal tomb and sealed in we placed' appeared?", "question_wo_referring_query": "The sunlight shines on the upper left of the stone block, and there are pictographic characters carved on the stone. A man wearing a hat is standing on the left side of the screen. 
What happened before the subtitle 'royal tomb and sealed in we placed' appeared?", "candidates": ["A woman in white clothes walked by the stone.", "A man wearing a hat and carrying a backpack was filming with a mobile phone.", "The man put on the hat.", "A woman walked by the river.", "A man in white clothes walked by the stone."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "g1KNpJfu_ZQ_1", "video_path": "g1KNpJfu_ZQ.mp4", "subtitle_path": "g1KNpJfu_ZQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 420.89, "view_count": 28356}, {"video_id": "sYySvVR5fk4", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which sequence of scenes is correct below?", "candidates": ["First, the red-haired man with a number '07' printed on the back of his clothes sings in the center of the stage, then the yellow-haired man in yellow sleeves sings and dances with a microphone, followed by the members taking turns singing in the center of the stage, and finally, the two men kneel on both sides of the stage and look intently at the camera.", "First, the red-haired man with a number '07' printed on the back of his clothes sings in the center of the stage, followed by the members taking turns singing in the center of the stage, then the two men kneel on both sides of the stage and look intently at the camera, and finally, the yellow-haired man in yellow sleeves sings and dances with a microphone.", "First, the yellow-haired man in yellow sleeves sings and dances with a microphone, followed by the members taking turns singing in the center of the stage, then the red-haired man with a number '07' printed on the back of his clothes sings in the center of the stage, and finally, the two men kneel on both sides of the stage and look intently at the camera.", "First, the two men kneel on both sides of the stage and look intently at the camera, then 
the red-haired man with a number '07' printed on the back of his clothes sings in the center of the stage, followed by the members taking turns singing in the center of the stage, and finally, the yellow-haired man in yellow sleeves sings and dances with a microphone.", "First, the two men kneel on both sides of the stage and look intently at the camera, then the yellow-haired man in yellow sleeves sings and dances with a microphone, followed by the members taking turns singing in the center of the stage, and finally, the red-haired man with a number '07' printed on the back of his clothes sings in the center of the stage."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "sYySvVR5fk4_0", "video_path": "sYySvVR5fk4.mp4", "subtitle_path": "sYySvVR5fk4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.66, "view_count": 15135}, {"video_id": "YKP_6-198x8", "question": "In the video, there are 5 soldiers dressed in green clothes, 4 of them are lying on the ground shooting at 9 soldiers, and one is standing. In which of the following scenes did the 4 soldiers appear?", "question_wo_referring_query": "In which of the following scenes did the 4 soldiers appear?", "candidates": ["On green grassland, with a train passing from left to right on the tracks", "By the rainy seaside", "On a truck", "The man in black with a blindfold standing by the red brick wall, with green grass and trees in the distance", "On a battlefield with 2 buildings"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "YKP_6-198x8_0", "video_path": "YKP_6-198x8.mp4", "subtitle_path": "YKP_6-198x8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.79, "view_count": 2321585}, {"video_id": "YKP_6-198x8", "question": "On the left, there's a man wearing a brown outfit, a black hat, and holding a gun. 
On the right, there's a woman with curly green hair holding a gun. Between them, there is a tower. In which of the following scenes has this curly-haired woman appeared?", "question_wo_referring_query": "In which of the following scenes has this curly-haired woman appeared?", "candidates": ["Three men holding guns are shooting towards the right", "Red ground with a building in the background", "Beach", "Two men carrying iron mallets are walking towards the left side of the screen", "Five people are standing facing the camera, and the man in the middle is holding a white document"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "YKP_6-198x8_1", "video_path": "YKP_6-198x8.mp4", "subtitle_path": "YKP_6-198x8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.79, "view_count": 2321585}, {"video_id": "hj1NPIU9SjM", "question": "How many rhinos are walking along the wide riverbank, one enjoying food and water, while yellow grass stretches into the distance, where the rhino and the subtitles appear together?", "question_wo_referring_query": ", did the rhino and those subtitles appear together?", "candidates": ["hoping to see a rhino today because the", "the main gate inland gutters is the", "a rhino and go to the National", "videoed them I've been to color of", "for non-residents if you're new to the"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "hj1NPIU9SjM_0", "video_path": "hj1NPIU9SjM.mp4", "subtitle_path": "hj1NPIU9SjM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 381.47, "view_count": 2666}, {"video_id": "hj1NPIU9SjM", "question": "In the middle of the video, there is a man wearing black hat and glasses, dressed in red clothes. Next to him, there are many weeds and a few trees. Behind him, there is a person in blue clothes. 
Which subtitles have appeared along with the person in blue clothes?", "question_wo_referring_query": "Which subtitles have appeared along with the person in blue clothes?", "candidates": ["you can get into the Nairobi National", "make sure to hire a guide and Oban the", "time you're here at the kws Nairobi", "journal it's about a 15 to 20 minute", "vocabulary so one thing to expect while"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "hj1NPIU9SjM_1", "video_path": "hj1NPIU9SjM.mp4", "subtitle_path": "hj1NPIU9SjM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 381.47, "view_count": 2666}, {"video_id": "YZZHejsF0ao", "question": "The sky is light blue, and below the sky, there is a wheel ship parked on the dark blue sea. When dense, buzzing model airplanes appear above the screen, and below are models of four wheel ships, what changes occurred to this wheel ship?", "question_wo_referring_query": "What changes occurred to this wheel ship?", "candidates": ["Hit by a soldier", "Hit by a fish mine", "Hit by a submarine", "The bottom of the ship turned red", "Hit by a tank"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "YZZHejsF0ao_0", "video_path": "YZZHejsF0ao.mp4", "subtitle_path": "YZZHejsF0ao_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 487.23, "view_count": 71255}, {"video_id": "0A5KzqN2LCA", "question": "On the marble slab, there are a few plates and a few bottles, some of the plates contain ice cubes, green vegetables, shrimp, potato chips, powdered seasoning, and red liquid seasoning. A pair of hands wearing black gloves is holding a bottle of seasoning. 
What change occurs to the ice cubes in the bowl when the subtitle mentions 'so again has Clamato freshly squeezed'?", "question_wo_referring_query": "When the subtitle mentions 'so again has Clamato freshly squeezed,' what change occurs to the ice cubes in the bowl?", "candidates": ["Black seasoning is added", "The ice cubes are dyed red", "Oil is added", "The ice cubes are dyed green", "Shrimp is added"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "0A5KzqN2LCA_0", "video_path": "0A5KzqN2LCA.mp4", "subtitle_path": "0A5KzqN2LCA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 254.84, "view_count": 68568}, {"video_id": "0A5KzqN2LCA", "question": "On the Dali stone, there are several plates, with the large plate in the middle filled with green cucumbers. In the middle of the cucumbers, there are 4 glass bottles filled with red condiments placed upside down. When the subtitle mentions 'do this I'm gonna call my friends and', what change happens to the cucumbers?", "question_wo_referring_query": "When the subtitle mentions 'do this I'm gonna call my friends and', what change happens to the cucumbers?", "candidates": ["The cucumbers were dyed red", "The cucumbers turned into juice", "The cucumbers were dyed black", "Covered in powder"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "0A5KzqN2LCA_1", "video_path": "0A5KzqN2LCA.mp4", "subtitle_path": "0A5KzqN2LCA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 254.84, "view_count": 68568}, {"video_id": "bv5sOzzA52A", "question": "In a room, there is a man lying on a bed with his upper body naked, and there is a woman wearing black clothes and white gloves. The man has a white towel placed below his belly. 
What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Drinking water", "Running", "Removing his pubic hair", "Clapping", "Kissing the man"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "bv5sOzzA52A_0", "video_path": "bv5sOzzA52A.mp4", "subtitle_path": "bv5sOzzA52A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 402.97, "view_count": 2741334}, {"video_id": "bv5sOzzA52A", "question": "In a room, three men are sitting. The man in blue is in the middle, with a man in white and a man in grey sitting on either side of him. In front of the man in blue is a notebook computer. All three men have a patterned blanket covering their laps. What are these men doing?", "question_wo_referring_query": "What are these men doing?", "candidates": ["watching a movie", "reading a newspaper", "looking at a notebook computer", "playing games", "playing the piano"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "bv5sOzzA52A_1", "video_path": "bv5sOzzA52A.mp4", "subtitle_path": "bv5sOzzA52A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 402.97, "view_count": 2741334}, {"video_id": "Mwg7kCRQfes", "question": "On a table, there is a plate with 6 raw shrimp. Which of the following fruits has appeared?", "question_wo_referring_query": "Which of the following fruits has appeared?", "candidates": ["Apple", "Banana", "Lemon", "Peach", "Watermelon"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "Mwg7kCRQfes_0", "video_path": "Mwg7kCRQfes.mp4", "subtitle_path": "Mwg7kCRQfes_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 539.41, "view_count": 334409}, {"video_id": "Mwg7kCRQfes", "question": "A woman is sitting on the steps, with a black bag beside her. 
She is reading a book in her hand, and behind her, there are two people wearing black coats. Which of the following objects has appeared?", "question_wo_referring_query": "Which of the following objects has appeared?", "candidates": ["Black train", "Green train", "Car", "Bicycle", "Motorcycle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "Mwg7kCRQfes_1", "video_path": "Mwg7kCRQfes.mp4", "subtitle_path": "Mwg7kCRQfes_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 539.41, "view_count": 334409}, {"video_id": "8BfaOc0ALlI", "question": "On the left side of the coffee-colored table, there is a glass bowl containing food, and to the right of the bowl is a plate with a concave section containing yellow solid food. When the caption 'layers and brownie batter tends to be' appears, which object is present in the scene?", "question_wo_referring_query": "Which object is present in the scene?", "candidates": ["yellow egg yolk", "coffee-colored plate", "sweet tart shell", "butter", "black cotton candy"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "8BfaOc0ALlI_0", "video_path": "8BfaOc0ALlI.mp4", "subtitle_path": "8BfaOc0ALlI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 199.03, "view_count": 130786}, {"video_id": "8BfaOc0ALlI", "question": "A woman is standing in front of a mirror, with a large screen and machinery behind her. When the subtitle shows 'just makes me really happy but what's,' which of the following items is present?", "question_wo_referring_query": "A woman is standing in front of a mirror, with a large screen and machinery behind her. 
Which of the following items is present?", "candidates": ["earring", "ring", "bracelet", "watch", "glove"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "8BfaOc0ALlI_1", "video_path": "8BfaOc0ALlI.mp4", "subtitle_path": "8BfaOc0ALlI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 199.03, "view_count": 130786}, {"video_id": "xdkJLn-1Ekk", "question": "The yellow dirt road is lined with tall weeds, there's a short brick wall on the left side of the screen, behind the brick wall there's a row of tall green trees, and on the right side of the brick wall there's a sign. What is the background color of the sign?", "question_wo_referring_query": "What is the background color of the sign?", "candidates": ["olive", "gray", "silver", "black", "white"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "xdkJLn-1Ekk_0", "video_path": "xdkJLn-1Ekk.mp4", "subtitle_path": "xdkJLn-1Ekk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 514.38, "view_count": 485196}, {"video_id": "xdkJLn-1Ekk", "question": "On the left side of the screen is a four-frame image sequence showing a person holding a glass of water with their hand, and on the right side is a string of English text. 
What is the shape of the flag mentioned in the text?", "question_wo_referring_query": "What is the shape of the flag mentioned in the text?", "candidates": ["Triangular", "Oval", "Circle", "Square", "Rectangle"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "xdkJLn-1Ekk_1", "video_path": "xdkJLn-1Ekk.mp4", "subtitle_path": "xdkJLn-1Ekk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 514.38, "view_count": 485196}, {"video_id": "Axhwtz844Qc", "question": "On the left side of the screen, there is a sketch of a human body with an English explanation. On the right side of the screen, there are three elderly people. What posture are the elderly people in when the subtitle 'it\u2019s from video so from video itself we' appears?", "question_wo_referring_query": "What posture are the elderly people in?", "candidates": ["Sitting", "Lying down", "Kneeling", "Standing upright", "Slightly bending at the waist"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "Axhwtz844Qc_0", "video_path": "Axhwtz844Qc.mp4", "subtitle_path": "Axhwtz844Qc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 145, "duration": 435.0, "view_count": 6}, {"video_id": "Axhwtz844Qc", "question": "In the bottom left corner of the screen, there are three elderly people. On the right side, there is some text in English and mathematical data reasoning. 
When the subtitle appears saying 'temporal Vector here and so that we can', what is the shape of the blue icon with the word 'ResNet 50' on it?", "question_wo_referring_query": "What is the shape of the blue icon with the word 'ResNet 50' on it?", "candidates": ["circle", "square", "parallelogram", "staircase", "rectangle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "Axhwtz844Qc_1", "video_path": "Axhwtz844Qc.mp4", "subtitle_path": "Axhwtz844Qc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 145, "duration": 435.0, "view_count": 6}, {"video_id": "IUa6knrTFo4", "question": "Under the azure blue sky stretches a series of rolling mountains. Among the mountains, which are filled with lush green grass, a small aircraft is flying in the sky. Who is controlling the small aircraft?", "question_wo_referring_query": "Who is controlling the small aircraft?", "candidates": ["A man", "A woman", "A nurse", "A child", "An elderly person"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "IUa6knrTFo4_0", "video_path": "IUa6knrTFo4.mp4", "subtitle_path": "IUa6knrTFo4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.62, "view_count": 37800}, {"video_id": "IUa6knrTFo4", "question": "Under the pitch-black environment, with the help of light, we observe a green frog crawling in the grass. 
What is the source of the light?", "question_wo_referring_query": "What is the source of the light?", "candidates": ["firefly", "phone", "moon", "flashlight", "torch"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "IUa6knrTFo4_1", "video_path": "IUa6knrTFo4.mp4", "subtitle_path": "IUa6knrTFo4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.62, "view_count": 37800}, {"video_id": "dPJCvVq-nVo", "question": "On a thick wooden board, there is a black bowl with some rice, four neatly arranged eggs, some greens and a cloth nearby, as well as a hand. When the subtitle 'Eggs 4 pcs' appears, what does the hand in the picture do?", "question_wo_referring_query": "What does the hand in the picture do?", "candidates": ["Clenches into a fist", "Grabs an egg", "Spreads out all five fingers", "Lifts the rice bowl", "Picks up the cloth"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "dPJCvVq-nVo_0", "video_path": "dPJCvVq-nVo.mp4", "subtitle_path": "dPJCvVq-nVo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.05, "view_count": 569}, {"video_id": "dPJCvVq-nVo", "question": "On a wooden table, there is a square piece of cloth, a glass bowl, a ceramic bowl, and some beaten eggs. 
What happens on the screen when the subtitle 'Pour into a heatproof bowl' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Spill the food on the table", "Remove the cloth from the table", "Pour water into the glass bowl", "Pour the food into the glass bowl", "Remove the glass bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "dPJCvVq-nVo_1", "video_path": "dPJCvVq-nVo.mp4", "subtitle_path": "dPJCvVq-nVo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.05, "view_count": 569}, {"video_id": "JT-GQ4DdvVo", "question": "On the right side, there is a green card with dollar signs on a laptop, and on the left side, there is a white paper with English printing and three green cards with dollar signs. What happens on the screen after the dollar sign cards start to flash?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The white paper starts falling down from where the cards are stacked.", "Words in English start to appear on the white paper below the dollar sign cards.", "The white paper below the dollar sign cards starts to shake.", "The white paper splits into two halves."], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "JT-GQ4DdvVo_0", "video_path": "JT-GQ4DdvVo.mp4", "subtitle_path": "JT-GQ4DdvVo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.24, "view_count": 30962}, {"video_id": "JT-GQ4DdvVo", "question": "At the top of a white background screen, there is an English title 'CHEMICAL'. Below that, there are four categories each beginning with a black dot. Beneath the categories, there is a blue icon with the words 'GET STARTED' inside it. To the right of the text group is a mouse pointer. 
What action occurred when the mouse pointer clicked on the blue icon?", "question_wo_referring_query": "What action occurred when the mouse pointer clicked on the blue icon?", "candidates": ["Scrolled down to the right", "Blue icon disappeared", "Blue icon gradually got larger", "Blue icon moved to the left", "Blue icon gradually got smaller"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "JT-GQ4DdvVo_1", "video_path": "JT-GQ4DdvVo.mp4", "subtitle_path": "JT-GQ4DdvVo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.24, "view_count": 30962}, {"video_id": "mlY2bkKJrOU", "question": "In a kitchen filled with utensils, a man in a white shirt wearing a duckbill cap and a man in a wine-red shirt looking downwards. When the subtitle 'baby Matt come with I'm up in the' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The man in the white shirt leaves the kitchen.", "The man in the wine-red shirt smiles and turns his head to the left.", "The man in the wine-red shirt looks down without speaking.", "The man in the wine-red shirt smiles and turns his head to the right."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "mlY2bkKJrOU_0", "video_path": "mlY2bkKJrOU.mp4", "subtitle_path": "mlY2bkKJrOU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 545.85, "view_count": 62377}, {"video_id": "mlY2bkKJrOU", "question": "Sunlight shines on a white building with glass windows arranged in a row; there are a few pigeons above the arched door, and a green plant at the entrance. 
What happens on the screen before the subtitle 'well we're lost we're lost like WWF' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The host kneels down to look at the map", "The host is playing a mobile game", "A person in white clothes is playing a small lute", "The pigeons are flapping their wings", "A group of people are dancing"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "mlY2bkKJrOU_1", "video_path": "mlY2bkKJrOU.mp4", "subtitle_path": "mlY2bkKJrOU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 545.85, "view_count": 62377}, {"video_id": "RAJtxFBmGzU", "question": "On a white background screen, there are black and blue English letters. Below the English text, there is a picture featuring a red X and a small blue circle. In the lower right corner of the screen, there is a portrait of a man. After the subtitle 'general patterns of particular,' what is the first shape that appears on the screen?", "question_wo_referring_query": "On a white background screen, there are black and blue English letters. Below the English text, there is a picture featuring a red X and a small blue circle. In the lower right corner of the screen, there is a portrait of a man. 
After the subtitle 'general patterns of particular,' what is the first shape that appears on the screen?", "candidates": ["A large circle with 3 black dots, the center contains a small circle which is purple and has 4 black dots", "Triangle", "Square", "Rectangle", "Ellipse"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "RAJtxFBmGzU_0", "video_path": "RAJtxFBmGzU.mp4", "subtitle_path": "RAJtxFBmGzU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 366, "duration": 206.0, "view_count": 37}, {"video_id": "6BpoGdhP974", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, there is a river with a person on the left side and mountains on the right. Then, there is a snow-covered mountain range with barren land in front of it. Finally, there is a river with mountains along its side.", "First, there is a river with mountains along its side. Then, there is a snow-covered mountain range with barren land in front of it. Finally, there is a river with a person on the left side and mountains on the right.", "First, there is a snow-covered mountain range with barren land in front of it. Then, there is a river with a person on the left side and mountains on the right. Finally, there is a river with mountains along its side.", "First, there is a river with mountains along its side. Then, there is a river with a person on the left side and mountains on the right. Finally, there is a snow-covered mountain range with barren land in front of it.", "First, there is a snow-covered mountain range with barren land in front of it. Then, there is a river with mountains along its side. 
Finally, there is a river with a person on the left side and mountains on the right."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "6BpoGdhP974_0", "video_path": "6BpoGdhP974.mp4", "subtitle_path": "6BpoGdhP974_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.06, "view_count": 426914}, {"video_id": "6BpoGdhP974", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a person in a black hat and military attire eats a peach. Then, this person uses an axe to hit a peach on a tree. Finally, the person in the black hat and military attire eats the peach.", "First, a person wearing a black hat and dressed in military attire uses an axe to hit a peach on a tree. Then, this person eats the peach. Finally, this person peels a peach beside the tree.", "First, a person is peeling a peach beside a tree. Then, a person wearing a black hat appears. Then, the person in the black hat and military attire eats the peach. Finally, the person dressed in military attire uses an axe to hit a peach on a tree.", "First, a person is peeling a peach beside a tree. Then, a person wearing a black hat appears. After that, a person dressed in military attire uses an axe to hit a peach on a tree. Finally, the person in the black hat and military attire eats the peach.", "First, a person wearing a black hat and dressed in military attire uses an axe to hit a peach on a tree. Then, this person peels the peach beside the tree. 
Finally, this person, still in the black hat and military attire, eats the peach."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "6BpoGdhP974_1", "video_path": "6BpoGdhP974.mp4", "subtitle_path": "6BpoGdhP974_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.06, "view_count": 426914}, {"video_id": "kqG64MyZ3lw", "question": "On an island with a black background, there is a long red lava casting a glow. In which of the following scenes does this lava appear?", "question_wo_referring_query": "In which of the following scenes does this lava appear?", "candidates": ["In a forest", "On a plain", "In a river", "On a grassland", "On a volcano"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "kqG64MyZ3lw_0", "video_path": "kqG64MyZ3lw.mp4", "subtitle_path": "kqG64MyZ3lw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.45, "view_count": 3043}, {"video_id": "kqG64MyZ3lw", "question": "Under a sky, beside a river there is a land covered in snow. Surrounding this snowy land are mountains of various shapes. 
In which of the following scenes has the river water appeared?", "question_wo_referring_query": ", in which of the following scenes has the river water appeared?", "candidates": ["In the forest", "On the grassland", "In the middle of the mountain", "On the mountain top", "At the foot of the mountain"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "kqG64MyZ3lw_1", "video_path": "kqG64MyZ3lw.mp4", "subtitle_path": "kqG64MyZ3lw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.45, "view_count": 3043}, {"video_id": "T_hfV7pF9D8", "question": "On a street, there is a man with black hair wearing white clothes with English letters on them and a woman with black hair wearing a pink dress and khaki short sleeves. Behind them, there is a wheel and a pole, and behind the pole, there is a white house. In front of the house, there is a red car. In the upper right corner of the screen, there is a house decorated with yellow and green ornaments. 
In what context did this woman with black hair wearing a pink dress and khaki short sleeves appear together with those scenes?", "question_wo_referring_query": "In what context did this woman with black hair wearing a pink dress and khaki short sleeves appear together with those scenes?", "candidates": ["receiving a payment and you don't get", "and work when you want so when your", "fernds set you up to randomly get ice", "so this is my friend Ariel she works out", "prepared to hustle 24/7 when you are"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "T_hfV7pF9D8_0", "video_path": "T_hfV7pF9D8.mp4", "subtitle_path": "T_hfV7pF9D8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 304.31, "view_count": 27719}, {"video_id": "T_hfV7pF9D8", "question": "Under the white sky, a boy with black hair wearing a white shirt and black pants, with English letters on the shirt, is playing with a skateboard on a slope. There are a few trees behind him, on his left there is a street lamp and a tree surrounded by a fence, and on his right stands a tall building in white and orange with a tree below. 
Has the skateboard under the boy's feet and those trees appeared together before?", "question_wo_referring_query": "Has the skateboard under the boy's feet and those trees appeared together before?", "candidates": ["or if you don't like to travel you can", "Music", "the skate park", "nights start paying off and results of", "I've been fortunate enough the past two"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "T_hfV7pF9D8_1", "video_path": "T_hfV7pF9D8.mp4", "subtitle_path": "T_hfV7pF9D8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 304.31, "view_count": 27719}, {"video_id": "M5kdF6SnbAo", "question": "There is an English sentence starting with 'Linear' printed at the top of the page, below which is a detailed plot chart. On a white background, there is a solid blue line running horizontally across the chart. When the number in the green box changes to T=-0.60, what changes occur to the blue line?", "question_wo_referring_query": "What changes occur to the blue line?", "candidates": ["The horizontal line oscillates left and right", "The horizontal line bulges into a small mountain shape", "The horizontal line sinks down", "The horizontal line disappears"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "M5kdF6SnbAo_0", "video_path": "M5kdF6SnbAo.mp4", "subtitle_path": "M5kdF6SnbAo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.97, "view_count": 5414}, {"video_id": "M5kdF6SnbAo", "question": "The vast expanse of the sea water, with waves rolling in one after another, and the water rippling continuously. 
When the screen switches to a sailboat traveling on the water, what change happens on the water surface?", "question_wo_referring_query": "What change happens on the water surface?", "candidates": ["A waterspout forms on the surface.", "The rippling water becomes calm.", "The rippling water turns into a stormy sea.", "It starts raining on the surface."], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "M5kdF6SnbAo_1", "video_path": "M5kdF6SnbAo.mp4", "subtitle_path": "M5kdF6SnbAo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.97, "view_count": 5414}, {"video_id": "a-pNEVeJBto", "question": "On a street, there is a house on the right side of the screen with English letters on it. Inside the house, there are 3 battery-powered tools, 2 of which are red and 1 is yellow. In front of the house, there is a purple car driving. Behind the purple car, there is a black car and some trees. What change occurs to the purple car when the subtitle mentions 'adequate'?", "question_wo_referring_query": "What change occurs to the purple car?", "candidates": ["The car's front facing right changes to the car's front facing forward.", "The car body turns purple.", "The car's front facing right changes to the car's front facing backward.", "The car's front facing right changes to the car's front facing left.", "The car's front facing left changes to the car's front facing right."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "a-pNEVeJBto_0", "video_path": "a-pNEVeJBto.mp4", "subtitle_path": "a-pNEVeJBto_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 342.8, "view_count": 1179}, {"video_id": "a-pNEVeJBto", "question": "In a room with a grayish background, there is an elderly person with white hair, wearing a dark blue coat and light blue clothing in the center. 
To the right of this person, there is a flag placed. On the gray-colored wall behind the elderly person, there is a painting hanging on the right and a golden ornament hanging on the left. When the subtitle mentions 'entire uh the program and we have given,' what change occurs in the elderly person's movement?", "question_wo_referring_query": "What change occurs in the elderly person's movement?", "candidates": ["The person extends from one right hand to extending both hands in front of the chest with a gesture.", "The person changes from extending one hand to clasping both hands together in front of the chest.", "The person changes from extending one hand to placing it behind the back.", "The person changes from extending one left hand to extending one right hand with a gesture in front of the chest."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "a-pNEVeJBto_1", "video_path": "a-pNEVeJBto.mp4", "subtitle_path": "a-pNEVeJBto_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 342.8, "view_count": 1179}, {"video_id": "VEReMqTalQE", "question": "Under a blue sky, there are trees and shrubs in the distance, with yellow soil nearby. On the left side is a man with a goatee and wearing a red hat. In the middle, there is a woman wearing a white inner shirt and a black and red plaid coat, along with sunglasses. What is this woman doing?", "question_wo_referring_query": "Under a blue sky, there are trees and shrubs in the distance, with yellow soil nearby. On the left side is a man with a goatee and wearing a red hat. In the middle, there is a woman wearing a white inner shirt and a black and red plaid coat, along with sunglasses. 
What is this woman doing?", "candidates": ["Waving at the mirror", "Took off her sunglasses", "Crouched down on the ground", "Eating a sandwich", "Shaking hands with the man"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "VEReMqTalQE_0", "video_path": "VEReMqTalQE.mp4", "subtitle_path": "VEReMqTalQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 254.12, "view_count": 743}, {"video_id": "VEReMqTalQE", "question": "Under the blue sky, on the yellow ground, there is a green pickup truck on the left in the distance, a wind turbine on the right, and in the middle, there is a man wearing a blue shirt and a red hat. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Lied down on the ground", "Took off his clothes", "Stood up", "Ran into the distance", "Moved his hands"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "VEReMqTalQE_1", "video_path": "VEReMqTalQE.mp4", "subtitle_path": "VEReMqTalQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 254.12, "view_count": 743}, {"video_id": "19Fu8ydbtaw", "question": "In a workshop with a wooden floor, a woman with golden hair is wearing a grayish-blue knitted shirt. To her left on a table, there is a blue cube, a large roll of paper, and to the right of the table stands a sculpture. In the distance, there is a cabinet with a monitor on top, and next to it, there is a chair and a storage cabinet. What objects appear in the workshop in the video?", "question_wo_referring_query": "In a workshop with a wooden floor, a woman with golden hair is wearing a grayish-blue knitted shirt. To her left on a table, there is a blue cube, a large roll of paper, and to the right of the table stands a sculpture. 
In the distance, there is a cabinet with a monitor on top, and next to it, there is a chair and a storage cabinet. What objects appear in the workshop in the video?", "candidates": ["A gaming console", "A pair of scissors", "A camera", "A Buddha statue", "A newspaper"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "19Fu8ydbtaw_0", "video_path": "19Fu8ydbtaw.mp4", "subtitle_path": "19Fu8ydbtaw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 399.82, "view_count": 216918}, {"video_id": "19Fu8ydbtaw", "question": "On a green workbench, there is a person holding a cup filled with sand in one hand and a small stone in the other hand. What objects appear on the green platform in the video?", "question_wo_referring_query": "On a green workbench, there is a person holding a cup filled with sand in one hand and a small stone in the other hand. What objects appear on the green platform in the video?", "candidates": ["A pair of scissors", "An eyebrow pencil", "A steel pen", "A magazine", "A bottle of glue"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "19Fu8ydbtaw_1", "video_path": "19Fu8ydbtaw.mp4", "subtitle_path": "19Fu8ydbtaw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 399.82, "view_count": 216918}, {"video_id": "dpDVnoVWviw", "question": "In front of the uneven rock wall, there is a green plant on the left side and a tool on the right side. 
What is the shape of this tool?", "question_wo_referring_query": "What is the shape of this tool?", "candidates": ["Perfectly straight line", "Square", "S-shaped", "Circular", "The shape of the number 7"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "dpDVnoVWviw_0", "video_path": "dpDVnoVWviw.mp4", "subtitle_path": "dpDVnoVWviw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 296.46, "view_count": 24430}, {"video_id": "dpDVnoVWviw", "question": "In the yellow background screen, on the right side there is black text, and in the middle, there is a pair of hands holding a cup. What color is this cup?", "question_wo_referring_query": "What color is this cup?", "candidates": ["red", "purple", "black", "yellow", "white"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "dpDVnoVWviw_1", "video_path": "dpDVnoVWviw.mp4", "subtitle_path": "dpDVnoVWviw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 296.46, "view_count": 24430}, {"video_id": "VWXjJCXJimg", "question": "In front of a single-story house, near purple plants, with green trees at the back right, and a rusty iron windmill in the front left, when it mentions 'And look forward to the autumn rains,' what color is the house on the far left?", "question_wo_referring_query": "What color is the house on the far left?", "candidates": ["Gray", "Blue", "Black", "Red", "White"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "VWXjJCXJimg_0", "video_path": "VWXjJCXJimg.mp4", "subtitle_path": "VWXjJCXJimg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.03, "view_count": 234912}, {"video_id": "VWXjJCXJimg", "question": "On the grass, in front of the olive-colored floral dress, there is a transparent glass cup placed to the left front. 
When 'with everything it's facing right now' is mentioned, what shape is the transparent glass cup in the lower left corner?", "question_wo_referring_query": "What shape is the transparent glass cup in the lower left corner?", "candidates": ["Circle", "Pentagon", "Rectangle", "Square", "Triangle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "VWXjJCXJimg_1", "video_path": "VWXjJCXJimg.mp4", "subtitle_path": "VWXjJCXJimg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.03, "view_count": 234912}, {"video_id": "Dp0zQZTSS-Y", "question": "On the white platform, the wall is on the left, and there is a white cat in the middle. What was the white cat doing the first time it appeared?", "question_wo_referring_query": "What was the white cat doing the first time it appeared?", "candidates": ["Standing", "Raising its head", "Eating something", "Lying on the platform", "Running"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "Dp0zQZTSS-Y_0", "video_path": "Dp0zQZTSS-Y.mp4", "subtitle_path": "Dp0zQZTSS-Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 254.42, "view_count": 167518}, {"video_id": "Dp0zQZTSS-Y", "question": "In a room with a green backdrop, there is a window ledge with green plants and a transparent glass window on the left. In the middle, there is a woman with blonde hair wearing a red dress. 
What is the woman doing when she first appears?", "question_wo_referring_query": "What is the woman doing when she first appears?", "candidates": ["Talking to the mirror by the window", "Cutting a mango", "Washing dishes", "Eating a hamburger", "Waving at the mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "Dp0zQZTSS-Y_1", "video_path": "Dp0zQZTSS-Y.mp4", "subtitle_path": "Dp0zQZTSS-Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 254.42, "view_count": 167518}, {"video_id": "cMEz5He5aG0", "question": "In a room with blue walls, there is a small green tree at the back right, an empty white chair at the front right, and a man sitting on the chair wearing a blue T-shirt in the middle. What action does the man do when he says 'and we smear it all over our faces'?", "question_wo_referring_query": "What action does he do?", "candidates": ["Stand up", "Make an OK gesture at the camera", "Wave at the camera", "Cover his face with both palms", "Shake his head left and right"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "cMEz5He5aG0_0", "video_path": "cMEz5He5aG0.mp4", "subtitle_path": "cMEz5He5aG0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.23, "view_count": 144927}, {"video_id": "cMEz5He5aG0", "question": "In the room with blue walls, there is a small green tree in the back-right, an empty white chair in the front-right, and a man wearing a blue T-shirt sitting in the middle of the chair. 
When the man mentions 'By the way, just for future reference,' what kind of sticker appears in the top right corner of the screen?", "question_wo_referring_query": "What kind of sticker appears in the top right corner of the screen?", "candidates": ["A basketball", "A bundle wrapped with an axe", "Blue pants", "A red wine bottle", "A square cake"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "cMEz5He5aG0_1", "video_path": "cMEz5He5aG0.mp4", "subtitle_path": "cMEz5He5aG0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.23, "view_count": 144927}, {"video_id": "T3ACX11E3CU", "question": "In a room with a beige background, a painting and a handwritten note are placed on the wall. On the table in front of the camera lens, there is a candle and a cup. A woman wearing a grey and white knit sweater and black tight pants is lying on the sofa reading a book. What did the woman do after lying on the sofa and reading a book?", "question_wo_referring_query": "In a room with a beige background, a painting and a handwritten note are placed on the wall. On the table in front of the camera lens, there is a candle and a cup. A woman wearing a grey and white knit sweater and black tight pants is lying on the sofa reading a book. 
What did the woman do after lying on the sofa and reading a book?", "candidates": ["Placed the cup with the candle on the table", "Fell asleep sideways", "Wrote with a pen", "Held a roast chicken", "Stood up"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "T3ACX11E3CU_0", "video_path": "T3ACX11E3CU.mp4", "subtitle_path": "T3ACX11E3CU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 584.92, "view_count": 35446}, {"video_id": "T3ACX11E3CU", "question": "In a kitchen with colorful ceramic tiles on the wall, on a white table there are many seasoning jars, a pumpkin, a can of biscuits, a plate of pancakes, two cups. A person is pouring water from an electric kettle into a teapot. What does this person do after pouring water into the teapot?", "question_wo_referring_query": "In a kitchen with colorful ceramic tiles on the wall, on a white table there are many seasoning jars, a pumpkin, a can of biscuits, a plate of pancakes, two cups. A person is pouring water from an electric kettle into a teapot. 
What does this person do after pouring water into the teapot?", "candidates": ["Pick up the teapot lid", "Continue pouring water", "Start eating pancakes", "Pick up a cup", "Pick up a seasoning jar"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "T3ACX11E3CU_1", "video_path": "T3ACX11E3CU.mp4", "subtitle_path": "T3ACX11E3CU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 584.92, "view_count": 35446}, {"video_id": "eUZNGDpbcmg", "question": "Which of the following concepts is mentioned first in the video?", "question_wo_referring_query": "Which of the following concepts is mentioned first in the video?", "candidates": ["MIMIC Chest X-ray dataset", "Author from Stanford University", "Neural networks", "Influence of Roman architecture"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "eUZNGDpbcmg_0", "video_path": "eUZNGDpbcmg.mp4", "subtitle_path": "eUZNGDpbcmg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 27, "duration": 527.0, "view_count": 60}, {"video_id": "eUZNGDpbcmg", "question": "Which concept is mentioned first in the video below?", "question_wo_referring_query": "Which concept is mentioned first in the video below?", "candidates": ["Diffusion model is an algorithm", "Evaluation in Romanek's paper", "Image from a paper", "Minor adjustment of chest X-ray", "Gallery database"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "eUZNGDpbcmg_1", "video_path": "eUZNGDpbcmg.mp4", "subtitle_path": "eUZNGDpbcmg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 27, "duration": 527.0, "view_count": 60}, {"video_id": "u07TGk01LPA", "question": "On a table laid with checkered fabric, there are various ingredients. 
In the middle of the screen, there is a rectangular white plate, with two rows of ingredients neatly arranged on it. In the bottom left corner, a black plate is partially visible. In the bottom right corner, there is a piece of red-yellow patterned fabric. What happens in the video after the word 'foreign' appears?", "question_wo_referring_query": "What happens in the video?", "candidates": ["A person pours a bowl of orange juice onto the ingredients on the white plate with one hand.", "A pair of hands lifts the white plate with two rows of ingredients.", "A hand picks up an ingredient from the white plate and eats it.", "One hand picks up the black plate, while the other hand places ingredients from the white plate onto the black plate."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "u07TGk01LPA_0", "video_path": "u07TGk01LPA.mp4", "subtitle_path": "u07TGk01LPA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.9, "view_count": 91016}, {"video_id": "u07TGk01LPA", "question": "On a light brown wooden table, there is a rectangular plate containing seasoned vegetable salad. To the upper right of the wooden table, there's a glass bowl with some seasoning. 
What happens on the screen after the subtitle 'so' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A hand uses a brush to dip into the seasoning in the glass bowl, and then applies it on the vegetable salad", "A hand brings over a soup ladle", "A hand brings over a small bowl of white vinegar", "The seasoning in the glass bowl is poured onto the plate", "A hand takes away the glass bowl containing the seasoning"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "u07TGk01LPA_1", "video_path": "u07TGk01LPA.mp4", "subtitle_path": "u07TGk01LPA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.9, "view_count": 91016}, {"video_id": "VDgzT_PGt7E", "question": "A green submarine emerges from green water. After the phrase 'mission in the south' is mentioned by the side, which weapon appears first?", "question_wo_referring_query": "Which weapon appears first?", "candidates": ["Cannon", "Tank", "Machine Gun", "Armored Vehicle"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "VDgzT_PGt7E_0", "video_path": "VDgzT_PGt7E.mp4", "subtitle_path": "VDgzT_PGt7E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.29, "view_count": 910746}, {"video_id": "VDgzT_PGt7E", "question": "Under the dim sky lies a row of tightly packed graves, each with a stone tombstone. On the right side of the screen, there are two men holding rifles. 
After '1995 issue of Life magazine' is mentioned in white text, who is the first person to appear on the screen?", "question_wo_referring_query": "Who is the first person to appear on the screen?", "candidates": ["A man wearing a hat and red-black colored clothes", "A man wearing blue pants and a white shirt", "A man with red gloves, a scarf, and carrying a horn", "A person wearing steel armor and holding a gun"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "VDgzT_PGt7E_1", "video_path": "VDgzT_PGt7E.mp4", "subtitle_path": "VDgzT_PGt7E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.29, "view_count": 910746}, {"video_id": "mm_RKoE6HTk", "question": "On a black table, there is a metal baking tray with 8 hollow round clay objects and hollow cylindrical clay objects neatly arranged on it. A hand with green nail polish is manipulating them. Which of the following scenes feature these hollow round clay objects?", "question_wo_referring_query": "On a black table, there is a metal baking tray with 8 hollow round clay objects and hollow cylindrical clay objects neatly arranged on it. A hand with green nail polish is manipulating them. 
Which of the following scenes feature these hollow round clay objects?", "candidates": ["Next to a glass window with white curtains, a woman dressed in green and wearing glasses stands in front of the dining table with her hands open, talking about tonight's dinner.", "A white plate on the dining table contains pumpkin and carrot among other ingredients.", "On a white table, there is a messy bouquet, and a hand holding a glass is watering the flowers in a white porcelain vase.", "On a white dining table, there are stacked plates, an empty cup with two candles and flowers beside it."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "mm_RKoE6HTk_0", "video_path": "mm_RKoE6HTk.mp4", "subtitle_path": "mm_RKoE6HTk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 577.16, "view_count": 102697}, {"video_id": "mm_RKoE6HTk", "question": "In a room with a brick wall background, there is a woman wearing a green short-sleeve shirt and blue gloves standing near a window with white curtains. Outside the window, there's a green plant, and in the distance, there are houses. Next to the woman, there are three metal lamps. She is holding a napkin stained with blue-green dye. In which of the following scenes does the napkin appear?", "question_wo_referring_query": "In a room with a brick wall background, there is a woman wearing a green short-sleeve shirt and blue gloves standing near a window with white curtains. Outside the window, there's a green plant, and in the distance, there are houses. Next to the woman, there are three metal lamps. She is holding a napkin stained with blue-green dye. 
In which of the following scenes does the napkin appear?", "candidates": ["Next to the glass window with white curtains, a woman in green clothes and glasses stands at a dining table arranging flower vases.", "On the dining table, a white plate contains pumpkin and carrot pieces, with a golden fork on the lower left.", "On the dining table, a white plate contains oxtail, accompanied by fresh herbs, and a hand is seen on the right side of the plate.", "Next to the glass window with white curtains, a woman in green clothes and glasses stands in front of a dining table with her hands outstretched. The table is filled with plates and wine glasses, and she says she likes having tableware on the dining table."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "mm_RKoE6HTk_1", "video_path": "mm_RKoE6HTk.mp4", "subtitle_path": "mm_RKoE6HTk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 577.16, "view_count": 102697}, {"video_id": "dA29R-nT1vY", "question": "Against a brownish-yellow background, some English text appears on the left side and a partial human skull appears on the right side. 
Could you tell me which subtitles appear alongside the partial human skull?", "question_wo_referring_query": "Could you tell me which subtitles appear alongside the partial human skull?", "candidates": ["But if you are not a special observer, then you\u2019re equally as likely", "to be at any point from the start to the end of the human", "to our place in time as well to arrive at a rough idea of when our species will die", "And under the Copernican principle, we should assume that now isn\u2019t a special period in", "between 2.5% and 97.5% of the entirety of human existence"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "dA29R-nT1vY_0", "video_path": "dA29R-nT1vY.mp4", "subtitle_path": "dA29R-nT1vY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 483.15, "view_count": 306816}, {"video_id": "dA29R-nT1vY", "question": "When discussing meteor impacts on Earth, which subtitles have appeared along with the mentioned meteor crater?", "question_wo_referring_query": "Which subtitles have appeared along with it?", "candidates": ["And how do we get there", "to assume you are in fact a special observer\u2014one of the last humans ever.", "including exclusive originals.", "The first time any of our ancestors used fire?", "The Doomsday Argument is maybe a little bit morbid, but thinking about it is also kind"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "dA29R-nT1vY_1", "video_path": "dA29R-nT1vY.mp4", "subtitle_path": "dA29R-nT1vY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 483.15, "view_count": 306816}, {"video_id": "-kMtTIoSvjs", "question": "When the video turns to the grill, two men are standing behind the grill. One is wearing a black shirt and a white hat, while the other is wearing a blue shirt and a green hat. 
At this moment, what is the man wearing the blue shirt and green hat doing?", "question_wo_referring_query": "When the video shows the grill, two men are standing behind it. One is wearing a black shirt and a white hat, while the other is wearing a blue shirt and a green hat. At that moment, what is the man wearing the blue shirt and green hat doing?", "candidates": ["Walking", "Washing the food", "Tasting the food", "Flipping the grilled food"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "-kMtTIoSvjs_0", "video_path": "-kMtTIoSvjs.mp4", "subtitle_path": "-kMtTIoSvjs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 442.61, "view_count": 1035}, {"video_id": "-kMtTIoSvjs", "question": "At the edge of the alley, a car with a unique paint job marked \"019\" mysteriously parked quietly. A particularly eye-catching notice pasted on the car window clearly marked \u201c3,60\u20ac\u201d. What is the man wearing a yellow coat, black pants, and carrying a small bag doing standing next to the car?", "question_wo_referring_query": "What is the man wearing a yellow coat, black pants, and carrying a small bag doing standing next to the car?", "candidates": ["Getting in the car", "Taking a photo", "Getting out of the car", "Pushing the car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "-kMtTIoSvjs_1", "video_path": "-kMtTIoSvjs.mp4", "subtitle_path": "-kMtTIoSvjs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 442.61, "view_count": 1035}, {"video_id": "nL7UScGb__w", "question": "There is a woman with a ponytail and a woman with curly hair holding hands on the screen. 
What are these two women holding in their hands?", "question_wo_referring_query": "What are these two women holding in their hands?", "candidates": ["Hairpin", "Scarf", "Bouquet", "Clothes"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "nL7UScGb__w_1", "video_path": "nL7UScGb__w.mp4", "subtitle_path": "nL7UScGb__w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 399.07, "view_count": 9669}, {"video_id": "x1Y-iC3AZik", "question": "A man wearing a black suit with a black shirt is standing in front of a white wall. There are two paintings on the wall. What is to the right of the man in the black suit when the subtitle mentions 'completely fascinated with Janet Sobel's'?", "question_wo_referring_query": "What is to the right of the man in the black suit?", "candidates": ["Books", "A circular painting and a sculpture on a white pedestal", "A circular painting", "A colorful painting and a sculpture on a white pedestal"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "x1Y-iC3AZik_0", "video_path": "x1Y-iC3AZik.mp4", "subtitle_path": "x1Y-iC3AZik_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 213.63, "view_count": 30296}, {"video_id": "x1Y-iC3AZik", "question": "A man in a black suit is elegantly explaining in the video. 
When the subtitle reaches 'we installed,' what is the image on the second frame to the left of the man in the black suit?", "question_wo_referring_query": "What is the image on the second frame to the left of the man in the black suit in the video?", "candidates": ["A round picture", "A square picture", "An oval picture", "A black picture"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "x1Y-iC3AZik_1", "video_path": "x1Y-iC3AZik.mp4", "subtitle_path": "x1Y-iC3AZik_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 213.63, "view_count": 30296}, {"video_id": "rb3VAlNy8f8", "question": "There is a woman in the video wearing glasses, earrings, and a necklace, dressed in a red and white striped top. She is holding a book with a yellow pen and two red pens on its pages. What color are the earrings when the subtitle mentions 'more practice'?", "question_wo_referring_query": "What color are the woman's earrings?", "candidates": ["blue", "white", "black", "red"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "rb3VAlNy8f8_0", "video_path": "rb3VAlNy8f8.mp4", "subtitle_path": "rb3VAlNy8f8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 448.42, "view_count": 2362}, {"video_id": "rb3VAlNy8f8", "question": "In the scene, the woman wearing a red and white striped jacket is holding a book. There are many colored crayons neatly arranged at the bottom of the book's pages. 
When the subtitle 'all of his work makes me happy' is mentioned, what color is the nail polish on this woman's hand?", "question_wo_referring_query": "What color is the nail polish on this woman's hand?", "candidates": ["Red", "Black", "Purple", "White"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "rb3VAlNy8f8_1", "video_path": "rb3VAlNy8f8.mp4", "subtitle_path": "rb3VAlNy8f8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 448.42, "view_count": 2362}, {"video_id": "apmiikhKML0", "question": "On a table, there are apples, colored pencils, and flowers. Sitting next to the table is a person with clenched fists who is talking. Who is this person that is talking?", "question_wo_referring_query": "Who is this person that is talking?", "candidates": ["A man with a black top", "A woman with a red top and long black hair", "A woman with a white top and long golden hair", "A woman with a white top and long black hair"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "apmiikhKML0_0", "video_path": "apmiikhKML0.mp4", "subtitle_path": "apmiikhKML0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 264.13, "view_count": 499147}, {"video_id": "apmiikhKML0", "question": "On the bottom right of a desk is a calculator. Above the calculator are 5 long tail clips. There is a hand with peach-colored nail polish wearing a ring. Four of the fingers are resting flat. 
What item is placed under this hand?", "question_wo_referring_query": "What item is placed under this hand?", "candidates": ["A white paper with red writing", "A white paper with black and green writing", "A white paper with black and red writing", "A pen"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "apmiikhKML0_1", "video_path": "apmiikhKML0.mp4", "subtitle_path": "apmiikhKML0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 264.13, "view_count": 499147}, {"video_id": "dEOVeXP4X1s", "question": "What are the people standing next to a large orange spherical building with leaf-shaped decorations at the top doing when they appear?", "question_wo_referring_query": "What are the people standing there doing when they appear?", "candidates": ["Dancing", "Painting", "Hugging", "Handstand"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "dEOVeXP4X1s_0", "video_path": "dEOVeXP4X1s.mp4", "subtitle_path": "dEOVeXP4X1s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.12, "view_count": 32859}, {"video_id": "dEOVeXP4X1s", "question": "There is a man in a brown short-sleeve shirt standing in front of a wall decorated with a mural and maps. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Dancing", "Eating something", "Doing a handstand", "Talking"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "dEOVeXP4X1s_1", "video_path": "dEOVeXP4X1s.mp4", "subtitle_path": "dEOVeXP4X1s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.12, "view_count": 32859}, {"video_id": "B5hxw3Jrs48", "question": "When the phrase \u201csupermarkets in France and the aisles of\u201d first appears in the subtitles, there is a woman with glasses and bangs on the screen. 
What action is this woman doing?", "question_wo_referring_query": "What action is this woman doing?", "candidates": ["Waving hello", "Clenching her fists", "Saluting", "Waving her arms up and down"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "B5hxw3Jrs48_0", "video_path": "B5hxw3Jrs48.mp4", "subtitle_path": "B5hxw3Jrs48_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 296.64, "view_count": 1877}, {"video_id": "B5hxw3Jrs48", "question": "When the phrase \"theory uh in the Middle Ages April 1 was\" appears in the subtitles, what is the woman wearing a blue floral blouse and glasses doing in the video?", "question_wo_referring_query": "When the phrase \"theory uh in the Middle Ages April 1 was\" appears in the subtitles, what is the woman wearing a blue floral blouse and glasses doing in the video?", "candidates": ["Crossing hands", "Clenching fists", "Holding a paper", "Raising a hand"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "B5hxw3Jrs48_1", "video_path": "B5hxw3Jrs48.mp4", "subtitle_path": "B5hxw3Jrs48_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 296.64, "view_count": 1877}, {"video_id": "1R_kGaSJZ4o", "question": "In a dimly lit room, a woman wearing a black hat, with earphones hanging around her neck, and dressed in a short-sleeved shirt with 'LOVE PINK' printed on it, what does this woman do after she finishes speaking?", "question_wo_referring_query": "What does this woman do after she finishes speaking?", "candidates": ["Queueing", "Moving house", "Medical examination", "Hitchhiking"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "1R_kGaSJZ4o_0", "video_path": "1R_kGaSJZ4o.mp4", "subtitle_path": "1R_kGaSJZ4o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 341.0, "view_count": 15605}, 
{"video_id": "1R_kGaSJZ4o", "question": "Inside the cafeteria, a woman wearing a white short-sleeved shirt and a black hat is sitting. She is holding a straw and inserting it into a cup. What does this woman do next?", "question_wo_referring_query": "What does this woman do next?", "candidates": ["Tears open a packet of ketchup", "Dances", "Gets into a car", "Drives herself"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "1R_kGaSJZ4o_1", "video_path": "1R_kGaSJZ4o.mp4", "subtitle_path": "1R_kGaSJZ4o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 341.0, "view_count": 15605}, {"video_id": "QzaFGCxoy5I", "question": "In a room with a blue background, there is a man wearing a white shirt and white earphones, and a man wearing a black shirt, a black hat, and red earphones. Who appears first between these two men?", "question_wo_referring_query": "Who appears first between these two men?", "candidates": ["The man wearing a blue shirt", "The man wearing a black hat and red earphones", "The man wearing a white shirt and white earphones", "The man wearing a red hat and black earphones"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "QzaFGCxoy5I_0", "video_path": "QzaFGCxoy5I.mp4", "subtitle_path": "QzaFGCxoy5I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 496.3, "view_count": 157370}, {"video_id": "QzaFGCxoy5I", "question": "In front of a black background, there is a man wearing a gray hoodie and a man wearing a black and white striped top with a hat that has letter patterns. 
Which of these two appears first?", "question_wo_referring_query": "Which of these two appears first?", "candidates": ["The man wearing a gray hoodie", "A man wearing a black and white striped top", "A man wearing a red and white striped top", "The man wearing a green hoodie"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "QzaFGCxoy5I_1", "video_path": "QzaFGCxoy5I.mp4", "subtitle_path": "QzaFGCxoy5I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 496.3, "view_count": 157370}, {"video_id": "1NIFbbRyOfM", "question": "In a brightly lit room, there is a man wearing a green short sleeve shirt with curly hair sitting in front of a white chair. After the caption 'woosh' appears, what action does this man perform?", "question_wo_referring_query": "What action does this man perform?", "candidates": ["Tilts his head", "Waves", "Bows", "Clasps his hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "1NIFbbRyOfM_0", "video_path": "1NIFbbRyOfM.mp4", "subtitle_path": "1NIFbbRyOfM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.62, "view_count": 168993}, {"video_id": "1NIFbbRyOfM", "question": "Against a background woven with deep green and light green, a flag with blue and white stripes is displayed. After the subtitle 'The only difference is that they made the blue darker,' what change occurs to the flag?", "question_wo_referring_query": "Against a background woven with deep green and light green, a flag with blue and white stripes is displayed. 
After the subtitle 'The only difference is that they made the blue darker,' what change occurs to the flag?", "candidates": ["A flag with a triangular design inside a circle gets enlarged", "shifts", "rotates", "shrinks"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "1NIFbbRyOfM_1", "video_path": "1NIFbbRyOfM.mp4", "subtitle_path": "1NIFbbRyOfM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.62, "view_count": 168993}, {"video_id": "8znHT-ESDSA", "question": "A man wearing a pink bathrobe takes his good friend on a tour. What is the order of the tour?", "question_wo_referring_query": "A man wearing a pink bathrobe takes his good friend on a tour. What is the order of the tour?", "candidates": ["First, they went to the staircase outside to drink water, then to the bedroom, and finally to the kitchen", "First, they went to the staircase outside to drink water, then to the kitchen, and finally to the bedroom", "First, they went to the kitchen, then to the bedroom, and finally to the staircase outside to drink water", "First, they went to the bedroom, then to the kitchen, and finally to the staircase outside to drink water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "8znHT-ESDSA_0", "video_path": "8znHT-ESDSA.mp4", "subtitle_path": "8znHT-ESDSA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 438.65, "view_count": 62984}, {"video_id": "8znHT-ESDSA", "question": "A man wearing a red suit visits a man in a white suit's home. What is the sequence of the red suit man's visit?", "question_wo_referring_query": "A man wearing a red suit visits a man in a white suit's home. 
What is the sequence of the red suit man's visit?", "candidates": ["First, he visited the living room area, then the dining room, followed by the lounge with a study room, and finally the bedroom.", "First, he visited the living room area, then the lounge with a study room, followed by the dining room, and finally the bedroom.", "First, he visited the dining room, then the living room area, followed by the lounge with a study room, and finally the bedroom.", "First, he visited the living room area, then the lounge with a study room, followed by the bedroom, and finally the dining room."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "8znHT-ESDSA_1", "video_path": "8znHT-ESDSA.mp4", "subtitle_path": "8znHT-ESDSA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 438.65, "view_count": 62984}, {"video_id": "AKGkbdILcjU", "question": "At the beginning of the video, there is a man with golden curly hair standing on a cliff, half-naked. Where else does this man appear?", "question_wo_referring_query": "Where else does this man appear?", "candidates": ["In a room", "In a car", "On a cliff, on a boat", "On a plane"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "AKGkbdILcjU_0", "video_path": "AKGkbdILcjU.mp4", "subtitle_path": "AKGkbdILcjU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 556.82, "view_count": 203898}, {"video_id": "AKGkbdILcjU", "question": "On the ocean, there's a tall man wearing a pure white shirt and white pants, standing on a boat with his arms open wide, pointing his finger and saying 'Yeah'. 
Where else has this man appeared?", "question_wo_referring_query": "Where else has this man appeared?", "candidates": ["On a field", "In a car", "In a room", "Sailing on a yacht, pointing his finger and saying 'Yeah'"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "AKGkbdILcjU_1", "video_path": "AKGkbdILcjU.mp4", "subtitle_path": "AKGkbdILcjU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 556.82, "view_count": 203898}, {"video_id": "fY-_ISuhgr8", "question": "A man wearing a red short-sleeved shirt, black backpack, and a large red hat is feeding a giraffe with his right hand. In the background, the man stands in front of some trees, and in the upper right corner of the frame, there is a girl with black suspenders. What color does the man's large red hat change to?", "question_wo_referring_query": "What color does the man's large red hat change to?", "candidates": ["Large red changes to scarlet", "Large red changes to pink", "No color change", "Large red changes to black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "fY-_ISuhgr8_0", "video_path": "fY-_ISuhgr8.mp4", "subtitle_path": "fY-_ISuhgr8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 354.32, "view_count": 2149}, {"video_id": "fY-_ISuhgr8", "question": "Next to a white car, there's a man wearing a red top, glasses, and a big red hat. His left index finger is pointing up, and in the background, he is standing in front of some trees. In the upper right corner of the screen, there are some wind turbines. In the upper left corner, there's a photo looking into the distance. 
How does the color of the shirt worn by the man in red change?", "question_wo_referring_query": "How does the color of the shirt worn by the man in red change?", "candidates": ["The man standing in front of the tree is wearing a dark green shirt, and in the upper left corner, the person looking into the distance is wearing black.", "The man standing in front of the tree is wearing a mainly black short-sleeved shirt with a pattern on the back, and in the upper left corner, the person looking into the distance is wearing a primarily white short-sleeved shirt with a black pocket.", "The man standing in front of the tree is wearing a primarily white short-sleeved shirt with a black pocket, and in the upper left corner, the person looking into the distance is wearing a mainly black short-sleeved shirt with a pattern on the back.", "The man standing in front of the tree is wearing a primarily white short-sleeved shirt with a black pocket, and in the upper left corner, the person looking into the distance is wearing a green shirt with prints."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "fY-_ISuhgr8_1", "video_path": "fY-_ISuhgr8.mp4", "subtitle_path": "fY-_ISuhgr8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 354.32, "view_count": 2149}, {"video_id": "ejXB0ZuDn9w", "question": "A map appears on the screen, showing a man sitting in a wheelchair in the center. 
After the subtitle 'disabilities in the EU where one and two' appears, what changes occur on the map?", "question_wo_referring_query": "What changes occur on the map?", "candidates": ["The map's color increases, one more man in wheelchair appears, and text is also added to the map.", "The map's color increases, two more men in wheelchairs appear, and text is also added to the map.", "The map's color increases, four more men in wheelchairs appear, and text is also added to the map.", "The map's color increases, three more men in wheelchairs appear, and text is also added to the map."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "ejXB0ZuDn9w_0", "video_path": "ejXB0ZuDn9w.mp4", "subtitle_path": "ejXB0ZuDn9w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 334.24, "view_count": 6780}, {"video_id": "ejXB0ZuDn9w", "question": "On the left side of the screen is a man wearing a black suit, white shirt, and deep red tie, and on the right side is a blonde woman in an orange suit. 
After the subtitle 'workplace welcome to DW where do those' appears, what changes occur on the screen next to the blonde woman in the orange suit?", "question_wo_referring_query": "What changes occur on the screen next to the blonde woman in the orange suit?", "candidates": ["A black-haired man in a black top and gray pants, sitting in a wheelchair, appears by the door.", "A blonde man in a white top and gray pants, sitting in a wheelchair, appears by the door.", "A long-haired woman in a white top and gray pants, sitting in a wheelchair, appears by the door.", "A short-haired woman in a white patterned top and gray pants, sitting in a wheelchair, appears by the door."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "ejXB0ZuDn9w_1", "video_path": "ejXB0ZuDn9w.mp4", "subtitle_path": "ejXB0ZuDn9w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 334.24, "view_count": 6780}, {"video_id": "dIMlS-foMXw", "question": "In the frame, there are 8 pieces of dough, and a pair of hands wearing transparent gloves. The right hand is holding a knife with a black handle. What is the right hand doing?", "question_wo_referring_query": "What is the right hand doing?", "candidates": ["Flattening the dough with the knife", "Stretching and pulling the dough", "Cutting the dough with the knife", "Cutting vegetables with the knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "dIMlS-foMXw_0", "video_path": "dIMlS-foMXw.mp4", "subtitle_path": "dIMlS-foMXw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.17, "view_count": 2036}, {"video_id": "dIMlS-foMXw", "question": "In the video, a left hand wearing a glove is seen. Underneath the hand, there's a sheet with many small holes. 
What is the left hand doing?", "question_wo_referring_query": "What is the left hand doing?", "candidates": ["The left hand is placing the sheet flat", "The left hand is flattening the sheet", "The left hand is holding the sheet", "The left hand is holding a knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "dIMlS-foMXw_1", "video_path": "dIMlS-foMXw.mp4", "subtitle_path": "dIMlS-foMXw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.17, "view_count": 2036}, {"video_id": "DFs2XhoRAEY", "question": "A man, dressed in a clean white shirt and paired with dark black pants, is walking steadily on the road with a black backpack. His left hand is in front, and at this moment, what is the man dressed in a white shirt and carrying a black backpack holding in his right hand?", "question_wo_referring_query": "At this time, what is the man dressed in a white shirt and carrying a black backpack holding in his right hand?", "candidates": ["yellow skates", "a yellow skateboard", "a black skateboard", "a yellow plank"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "DFs2XhoRAEY_0", "video_path": "DFs2XhoRAEY.mp4", "subtitle_path": "DFs2XhoRAEY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 278.18, "view_count": 374137}, {"video_id": "DFs2XhoRAEY", "question": "A man wearing a white top paired with shorts and a black backpack, a man in a gray top and gray striped pants, and a man dressed in black top and pants are having a conversation. 
At this moment, what is in the pocket of the man wearing black pants?", "question_wo_referring_query": "At this moment, what is in the pocket of the man wearing black pants?", "candidates": ["A blue wallet", "A yellow skateboard", "A blue mobile phone", "A bottle with a blue cap"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "DFs2XhoRAEY_1", "video_path": "DFs2XhoRAEY.mp4", "subtitle_path": "DFs2XhoRAEY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 278.18, "view_count": 374137}, {"video_id": "qgcLPha8Kao", "question": "The video shows a boy wearing shorts and bare feet sitting in a high place. When the subtitles mention 'how long they could remain positioned on top of a tall pole in a public setting', what is on the right side of the boy?", "question_wo_referring_query": "What is on the right side of the boy?", "candidates": ["Telephone", "Teacup", "Cube", "Flag"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "qgcLPha8Kao_0", "video_path": "qgcLPha8Kao.mp4", "subtitle_path": "qgcLPha8Kao_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 576.75, "view_count": 39398}, {"video_id": "qgcLPha8Kao", "question": "A woman wearing a red, black, and white checkered short-sleeve shirt is holding a white bottle in her left hand and putting something in her mouth with her right hand. 
When the subtitle 'Erom eating disgusting things to giving away gobs of money' appears, what other items are on the screen?", "question_wo_referring_query": "What other items are on the screen?", "candidates": ["a watch and a bracelet", "a Rubik's cube", "a hairband", "a pennant"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "qgcLPha8Kao_1", "video_path": "qgcLPha8Kao.mp4", "subtitle_path": "qgcLPha8Kao_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 576.75, "view_count": 39398}, {"video_id": "ibQfR1o_KvU", "question": "The screen shows a person standing outside a two-story building. This building is made up of white, gray, black, and orange colors. There are several huge money bags with dollar signs behind the person. What color clothes is this person wearing?", "question_wo_referring_query": "What color clothes is this person wearing?", "candidates": ["White with a mix of black", "White with a mix of green", "White with a mix of blue", "White with a mix of orange"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "ibQfR1o_KvU_0", "video_path": "ibQfR1o_KvU.mp4", "subtitle_path": "ibQfR1o_KvU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.57, "view_count": 26466}, {"video_id": "ibQfR1o_KvU", "question": "Described is a man with black hair and wearing a white jacket with an orange shirt standing in front of a lectern, with gray buildings nearby and a crowd of people below. 
What kind of object is this man holding?", "question_wo_referring_query": "What kind of object is this man holding?", "candidates": ["Square", "Rectangle", "Circle", "Hexagon"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "ibQfR1o_KvU_1", "video_path": "ibQfR1o_KvU.mp4", "subtitle_path": "ibQfR1o_KvU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.57, "view_count": 26466}, {"video_id": "WFf-ZCOAXCc", "question": "In front of a white background, a long-haired woman wearing a white dress is holding a gold necklace. When the subtitle mentions 'so I got this necklace and I kind of', what color nail polish is this woman wearing?", "question_wo_referring_query": "What color nail polish is this woman wearing?", "candidates": ["White", "Green", "Pink", "Black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "WFf-ZCOAXCc_0", "video_path": "WFf-ZCOAXCc.mp4", "subtitle_path": "WFf-ZCOAXCc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 522.2, "view_count": 27422}, {"video_id": "WFf-ZCOAXCc", "question": "In front of a white background, there is a long-haired woman holding a green skirt in her hand and wearing a necklace. She appears when the subtitle mentions 'you know so I got a skirt it's super'. 
What color is this woman\u2019s clothing?", "question_wo_referring_query": "What color is this woman\u2019s clothing?", "candidates": ["green", "white", "black", "blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "WFf-ZCOAXCc_1", "video_path": "WFf-ZCOAXCc.mp4", "subtitle_path": "WFf-ZCOAXCc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 522.2, "view_count": 27422}, {"video_id": "0AfgVm03fuM", "question": "On a table, there is a black pot with a silver pot on it, and white smoke is rising from the pot. The person has a tattoo on their right hand and is holding something. What did the person pour into the pot?", "question_wo_referring_query": "What did the person pour into the pot?", "candidates": ["Strawberry", "White flour", "Liquid", "Brown powder"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "0AfgVm03fuM_0", "video_path": "0AfgVm03fuM.mp4", "subtitle_path": "0AfgVm03fuM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 215.13, "view_count": 200574}, {"video_id": "0AfgVm03fuM", "question": "In a container of white milk, there is a red stirrer stirring. 
What color drips into the milk, causing it to change color?", "question_wo_referring_query": "What color drips into the milk, causing it to change color?", "candidates": ["Yellow", "Green", "Red", "Black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "0AfgVm03fuM_1", "video_path": "0AfgVm03fuM.mp4", "subtitle_path": "0AfgVm03fuM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 215.13, "view_count": 200574}, {"video_id": "nP4EcUqUktE", "question": "When 'amatcurs' first appears in the subtitles, what is the man wearing a brown hat sitting in the car doing?", "question_wo_referring_query": "What is the man sitting in the car doing?", "candidates": ["Eating a bun", "Playing a game", "Singing", "Drinking water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "nP4EcUqUktE_0", "video_path": "nP4EcUqUktE.mp4", "subtitle_path": "nP4EcUqUktE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 345.58, "view_count": 26895}, {"video_id": "nP4EcUqUktE", "question": "When the subtitle 'Oh a city better it's a new home okay' first appears, there is a man wearing a hat. What is this man doing with his hand?", "question_wo_referring_query": "What is the man doing with his hand?", "candidates": ["Holding it tightly next to his mouth", "Turning his palm upwards", "Opening it slowly", "Holding something"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "nP4EcUqUktE_1", "video_path": "nP4EcUqUktE.mp4", "subtitle_path": "nP4EcUqUktE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 345.58, "view_count": 26895}, {"video_id": "F22PStAs-6s", "question": "In a dimly lit environment, there are three men. One man is wearing a sleeveless top and a necklace, and he is holding a musical instrument. 
Another man is wearing accessories on his hand and holding a microphone. What action did the man wearing the necklace do afterward?", "question_wo_referring_query": "What action did the man wearing the necklace do afterward?", "candidates": ["Drinking something", "Giving a speech", "Eating", "Singing"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "F22PStAs-6s_0", "video_path": "F22PStAs-6s.mp4", "subtitle_path": "F22PStAs-6s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 399.4, "view_count": 265428}, {"video_id": "F22PStAs-6s", "question": "There is a man standing in front of a telephone booth, naked from the waist up and wearing a padlock-style necklace. What action does this man perform next?", "question_wo_referring_query": "What action does this man perform next?", "candidates": ["Running", "Swimming", "Playing a musical instrument", "Eating"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "F22PStAs-6s_1", "video_path": "F22PStAs-6s.mp4", "subtitle_path": "F22PStAs-6s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 399.4, "view_count": 265428}, {"video_id": "Vb6xitXKQDs", "question": "In a large forest, there is a woman wearing a red top and green pants, holding a pen and drawing. 
What color does the woman use first in the video?", "question_wo_referring_query": "What color does the woman use first in the video?", "candidates": ["Black", "Red", "White", "Blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "Vb6xitXKQDs_0", "video_path": "Vb6xitXKQDs.mp4", "subtitle_path": "Vb6xitXKQDs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.9, "view_count": 1064735}, {"video_id": "Vb6xitXKQDs", "question": "Next to a table hanging with tree leaves and filled with many empty glass bottles, after picking up two oranges with one hand, what did this person do first in the video?", "question_wo_referring_query": "What did this person do first in the video?", "candidates": ["Place a flower", "Cut a string", "Slice the orange with a knife", "Pick up a water cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "Vb6xitXKQDs_1", "video_path": "Vb6xitXKQDs.mp4", "subtitle_path": "Vb6xitXKQDs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.9, "view_count": 1064735}, {"video_id": "YXrA6df3dqU", "question": "In a snow-covered forest, after the subtitles appear saying 'how much I needed human connection but not in the way mainstream society has normalized it,' what event occurs in the video?", "question_wo_referring_query": "What event occurs in the video?", "candidates": ["A cat appears, wagging its tail in the snow.", "A rabbit appears, wagging its tail in the snow.", "A dog appears, wagging its tail in the snow.", "A fox appears, wagging its tail in the snow."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "YXrA6df3dqU_0", "video_path": "YXrA6df3dqU.mp4", "subtitle_path": "YXrA6df3dqU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 392.12, "view_count": 409313}, {"video_id": 
"YXrA6df3dqU", "question": "A woman with yellow curly hair wearing a gray wool coat is grinding coffee. After the subtitles appear saying, 'But, it isn't who I am, and I had to come to terms with that. People like myself must confront life', what action does this woman take?", "question_wo_referring_query": "A woman with yellow curly hair wearing a gray wool coat is grinding coffee. After the subtitles appear saying, 'But, it isn't who I am, and I had to come to terms with that. People like myself must confront life', what action does this woman take?", "candidates": ["Touch her phone", "Open the blinds", "Drink milk", "Drink water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "YXrA6df3dqU_1", "video_path": "YXrA6df3dqU.mp4", "subtitle_path": "YXrA6df3dqU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 392.12, "view_count": 409313}, {"video_id": "NZYW7EstuRI", "question": "A man wearing a black shirt is explaining something in a room. Before the subtitles show up with 'howdy it's kyle talking about my,' what appears in this video?", "question_wo_referring_query": "What appears in this video?", "candidates": ["television", "doctor", "flower", "a globe"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "NZYW7EstuRI_0", "video_path": "NZYW7EstuRI.mp4", "subtitle_path": "NZYW7EstuRI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 383.15, "view_count": 60777}, {"video_id": "NZYW7EstuRI", "question": "A man dressed in a black shirt is talking in a room. 
After the subtitle 'out' appears, what appears in this video?", "question_wo_referring_query": "What appears in this video?", "candidates": ["A map with red and blue regions", "Rabbit", "Dog", "Soldier"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "NZYW7EstuRI_1", "video_path": "NZYW7EstuRI.mp4", "subtitle_path": "NZYW7EstuRI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 383.15, "view_count": 60777}, {"video_id": "Kp_JRAlgcNY", "question": "During a live news broadcast, a man dressed in a green checkered pattern appeared. In which of the following places has this man in the green checkered pattern been seen?", "question_wo_referring_query": "Where has the man in the green checkered pattern been seen?", "candidates": ["1. There is an image of an arrow being fired nearby\n2. There is an image of someone in blue clothing piloting a plane nearby", "1. A woman wearing sunglasses\n2. A black background with a black circle", "1. A man with dark skin wearing sunglasses\n2. A woman wearing sunglasses", "1. A black background with a black circle\n2. There is an image of an arrow being fired nearby"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Kp_JRAlgcNY_0", "video_path": "Kp_JRAlgcNY.mp4", "subtitle_path": "Kp_JRAlgcNY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 324.66, "view_count": 6534}, {"video_id": "Kp_JRAlgcNY", "question": "A woman wearing a black top and draped in long hair appears in a news broadcast. Where has this woman with long hair appeared?", "question_wo_referring_query": ", where has this woman with long hair appeared?", "candidates": ["1. In a room with 'TOTAL ECLIPSE' written in sunlight behind her\n2. Next to a man wearing a green plaid shirt", "1. A woman wearing sunglasses\n2. A black background with a black circle", "1. A dark-skinned man wearing sunglasses\n2. 
A woman wearing sunglasses", "1. A dark-skinned man wearing sunglasses\n2. A woman wearing sunglasses"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Kp_JRAlgcNY_1", "video_path": "Kp_JRAlgcNY.mp4", "subtitle_path": "Kp_JRAlgcNY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 324.66, "view_count": 6534}, {"video_id": "bdjczPhrvYU", "question": "In a pot with many sausages, which subtitles appeared together with these sausages in the video?", "question_wo_referring_query": "In a pot with many sausages, which subtitles appeared together with these sausages in the video?", "candidates": ["Fry for 2 minutes on each side.", "Pour the egg mixture. And 4 slices of cheese.", "Sausages.", "Have a nice day!"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "bdjczPhrvYU_0", "video_path": "bdjczPhrvYU.mp4", "subtitle_path": "bdjczPhrvYU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.6, "view_count": 7815}, {"video_id": "bdjczPhrvYU", "question": "On a wooden board, there are a few slices of bread. In the video, with which subtitles do these slices of bread appear simultaneously?", "question_wo_referring_query": "In the video, with which subtitles do these slices of bread appear simultaneously?", "candidates": ["4 slices of bread and 'Have a nice day!'", "Pour the egg mixture.", "Sausages.", "Fry for 2 minutes on each side."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "bdjczPhrvYU_1", "video_path": "bdjczPhrvYU.mp4", "subtitle_path": "bdjczPhrvYU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.6, "view_count": 7815}, {"video_id": "2XY12HZ_-L4", "question": "The screen shows a man in a black short-sleeved shirt against a blue background. 
The man is wearing a black watch on his left hand, with both palms facing upward. What changes can be seen behind this man in the video?", "question_wo_referring_query": "What changes can be seen behind this man in the video?", "candidates": ["There is a woman's photo in the upper left corner of the man.", "There is a woman's photo in the upper right corner of the man.", "There is a man's photo in the upper right corner of the man.", "There is a man's photo in the upper left corner of the man."], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "2XY12HZ_-L4_0", "video_path": "2XY12HZ_-L4.mp4", "subtitle_path": "2XY12HZ_-L4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 315.69, "view_count": 90552}, {"video_id": "2XY12HZ_-L4", "question": "In a picture that is gray and contains some cars and horses with people coming and going, there are also white words like 'SHE'D SLIP INTO FACTORIES WITHOUT THE' on it. What changes did these words undergo?", "question_wo_referring_query": "What changes did these words undergo?", "candidates": ["The white words slide up", "The white words slide to the left", "The white words enlarge", "The white words shrink"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "2XY12HZ_-L4_1", "video_path": "2XY12HZ_-L4.mp4", "subtitle_path": "2XY12HZ_-L4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 315.69, "view_count": 90552}, {"video_id": "8DJ2a9I2-uo", "question": "Under a screen with a white background and black text, there is a man in the bottom right corner wearing a gray outfit with red inside. 
What motion did the man make after the subtitle 'human skeleton right so joint angles are' appeared?", "question_wo_referring_query": "What motion did the man make?", "candidates": ["Touched his chin with his right hand", "Watched TV", "Put his hands behind his head", "Played with his phone"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "8DJ2a9I2-uo_0", "video_path": "8DJ2a9I2-uo.mp4", "subtitle_path": "8DJ2a9I2-uo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 551.88, "view_count": 71}, {"video_id": "8DJ2a9I2-uo", "question": "In a screen with a white background and black 'UCF' letters, there is a man wearing a gray outfit with red accents in the lower right corner. After the word 'okay' appears on the screen, what change occurs on the screen?", "question_wo_referring_query": "What change occurs on the screen?", "candidates": ["The words at the top of the screen become larger and move to the middle of the screen, and the bottom row of text disappears.", "The words at the top of the screen do not change and move to the middle of the screen, and the bottom row of text disappears.", "The words at the top of the screen do not change and move to the bottom of the screen.", "A line of text is added to the middle of the screen."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "8DJ2a9I2-uo_1", "video_path": "8DJ2a9I2-uo.mp4", "subtitle_path": "8DJ2a9I2-uo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 551.88, "view_count": 71}, {"video_id": "5pgkQzIFxCQ", "question": "In a classroom, many people are standing, holding a book in their hands and singing. There is a conductor standing in front of them. 
What is the conductor doing?", "question_wo_referring_query": "What is the conductor doing?", "candidates": ["Playing with a phone", "Drinking water", "Playing the piano", "Lifting both hands in front of the body"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "5pgkQzIFxCQ_0", "video_path": "5pgkQzIFxCQ.mp4", "subtitle_path": "5pgkQzIFxCQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 426.39, "view_count": 14102}, {"video_id": "5pgkQzIFxCQ", "question": "In a classroom with a blackboard, what is a man with black hair, wearing a white shirt and yellow pants, doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Playing with a phone", "Teaching on the blackboard", "Drinking water", "Playing the piano"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "5pgkQzIFxCQ_1", "video_path": "5pgkQzIFxCQ.mp4", "subtitle_path": "5pgkQzIFxCQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 426.39, "view_count": 14102}, {"video_id": "sEqrzpoXdpM", "question": "In a black room, there is a man with black hair wearing a yellow short-sleeved shirt and holding a cup. 
When the subtitle 'basically' appears, what other objects are in the room?", "question_wo_referring_query": "What other objects are in the room?", "candidates": ["a piano", "flowers", "a huge pumpkin model", "a table"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "sEqrzpoXdpM_0", "video_path": "sEqrzpoXdpM.mp4", "subtitle_path": "sEqrzpoXdpM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.8, "view_count": 53316}, {"video_id": "sEqrzpoXdpM", "question": "When a woman wearing a white hat and a black tank top, and a man in a red short-sleeved shirt appear, and the subtitle 'the race with me but she's not' shows, what objects are behind them?", "question_wo_referring_query": "When a woman wearing a white hat and a black tank top, and a man in a red short-sleeved shirt appear, and the subtitle 'the race with me but she's not' shows, what objects are behind them?", "candidates": ["Rabbit", "Bear", "Statue", "Dog"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "sEqrzpoXdpM_1", "video_path": "sEqrzpoXdpM.mp4", "subtitle_path": "sEqrzpoXdpM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.8, "view_count": 53316}, {"video_id": "xzFtVPOn7Sk", "question": "A woman with long hair is introducing something while subtitles display 'can use a flax egg you can use'. 
What color is the clothing worn by this woman with long hair?", "question_wo_referring_query": "What color is the clothing worn?", "candidates": ["green", "white", "orange", "red"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "xzFtVPOn7Sk_0", "video_path": "xzFtVPOn7Sk.mp4", "subtitle_path": "xzFtVPOn7Sk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.27, "view_count": 272251}, {"video_id": "xzFtVPOn7Sk", "question": "On a table, there is a piece of white and red checkered cloth. When the subtitle 'because they're such a classic dessert' appears, what item is this person holding?", "question_wo_referring_query": "What item is this person holding?", "candidates": ["milk", "spatula", "mobile phone", "straw"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "xzFtVPOn7Sk_1", "video_path": "xzFtVPOn7Sk.mp4", "subtitle_path": "xzFtVPOn7Sk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.27, "view_count": 272251}, {"video_id": "1KVSmJhT-WI", "question": "In front of a yellow sculpture enclosed in a glass case, there is someone holding a yellow notebook and taking notes. 
Who is the person taking notes on the screen?", "question_wo_referring_query": "Who is the person taking notes on the screen?", "candidates": ["A woman wearing a yellow trench coat and glasses", "A woman wearing a dark green trench coat and hat", "A man wearing a yellow trench coat and glasses", "A man wearing a black trench coat and hat"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "1KVSmJhT-WI_0", "video_path": "1KVSmJhT-WI.mp4", "subtitle_path": "1KVSmJhT-WI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 266.27, "view_count": 6418}, {"video_id": "MnZzVt5_yb8", "question": "On a wooden table, there are scallions and a radish. Someone is using both hands to handle a green object at the bottom of the screen. What is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": [" Placing vegetables", " Rolling up rice", " Placing fruits", " Placing meat slices"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "MnZzVt5_yb8_0", "video_path": "MnZzVt5_yb8.mp4", "subtitle_path": "MnZzVt5_yb8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.92000000000002, "view_count": 721429}, {"video_id": "MnZzVt5_yb8", "question": "After a person wearing a bracelet puts a rolled sushi on the cutting board on the table, what is he preparing to do?", "question_wo_referring_query": "What is this person preparing to do?", "candidates": ["Arrange the sushi roll", "Separate the sushi by hand", "Cut the sushi with a knife", "Cut vegetables with a knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "MnZzVt5_yb8_1", "video_path": "MnZzVt5_yb8.mp4", "subtitle_path": "MnZzVt5_yb8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.92000000000002, "view_count": 721429}, 
{"video_id": "4HYZEWyUi3M", "question": "On a wooden cutting board, there are some green-skinned and orange fruits. A person with pink nail polish is pressing an orange fruit. At the bottom of the screen, there is also a knife. When the subtitle '150g of dried apricots' appears, what does the person do?", "question_wo_referring_query": "What does the person do?", "candidates": ["Remove the pit from the fruit with the knife", "Rinse the fruit flesh with water", "Cut the fruit into slices with the knife", "Tear the fruit flesh with their hand"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "4HYZEWyUi3M_0", "video_path": "4HYZEWyUi3M.mp4", "subtitle_path": "4HYZEWyUi3M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 273.6, "view_count": 2865}, {"video_id": "4HYZEWyUi3M", "question": "On a gray marble desktop, there is a glass bowl full of food. There are also white subtitles on the screen. What does the person do when 'do better with gloves' is mentioned?", "question_wo_referring_query": "What does the person do?", "candidates": ["Separate the ingredients in the bowl", "Liquefy the solids in the bowl", "Mix the ingredients in the bowl with hands", "Handle the food on the desk"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "4HYZEWyUi3M_1", "video_path": "4HYZEWyUi3M.mp4", "subtitle_path": "4HYZEWyUi3M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 273.6, "view_count": 2865}, {"video_id": "7mz4ikBUMTM", "question": "Standing by the gray sofa wearing a black short-sleeved shirt, the blond man wearing a watch and the man standing by the kitchen counter wearing a black short-sleeved shirt and blue pants holding an olive, who appears first in the video?", "question_wo_referring_query": "Who appears first in the video?", "candidates": ["The blond man standing by the gray sofa 
wearing a black short-sleeved shirt and a watch", "The blond man standing by the kitchen counter wearing a black short-sleeved shirt and a watch", "The man standing by the kitchen counter wearing a black short-sleeved shirt and blue pants holding an olive", "The man standing by the gray sofa wearing a black short-sleeved shirt and blue pants holding an olive"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "7mz4ikBUMTM_0", "video_path": "7mz4ikBUMTM.mp4", "subtitle_path": "7mz4ikBUMTM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 335.71, "view_count": 1415228}, {"video_id": "7mz4ikBUMTM", "question": "Among the dense foliage, who appears first in the video: the man wearing a blue jacket and a white hat or the man wearing a blue short-sleeve shirt, a black backpack, and a red headband?", "question_wo_referring_query": "Who appears first in the video?", "candidates": ["The man wearing a blue jacket and a white hat among the dense foliage", "The man wearing a blue short-sleeve shirt and a white hat among the dense foliage", "The man wearing a blue short-sleeve shirt, a black backpack, and a red headband standing among the trees", "The man wearing a blue jacket, a black backpack, and a red headband standing among the trees"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "7mz4ikBUMTM_1", "video_path": "7mz4ikBUMTM.mp4", "subtitle_path": "7mz4ikBUMTM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 335.71, "view_count": 1415228}, {"video_id": "hWMKaxjc4dA", "question": "In a room with white walls and green paintings hanging all over, a man wearing a hat and blue and white clothes is conversing with a woman who is wearing a hat and a green apron filled with various green materials at the waist. 
After the conversation reached 'all right so here I am at molly mats,' what did the woman do?", "question_wo_referring_query": "What did the woman do?", "candidates": ["The woman nodded her head", "The woman started dancing", "The woman put her hand into the apron", "The woman shook hands with the man", "The woman started fighting with the man"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "hWMKaxjc4dA_0", "video_path": "hWMKaxjc4dA.mp4", "subtitle_path": "hWMKaxjc4dA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 496.93, "view_count": 11186}, {"video_id": "hWMKaxjc4dA", "question": "In a room with a white background wall covered in green paintings, a man wearing a hat and a blue and white plaid shirt is conversing with a woman also wearing a hat and a green waist apron. The apron has many green decorations on it. When the conversation reaches the part where the man says 'really friendly um you know like you're,' what does the woman do afterwards?", "question_wo_referring_query": "What does the woman do?", "candidates": ["The woman shook hands with the man", "The woman nodded", "The woman started fighting with the man", "The woman clapped her hands together into a dome shape", "The woman started dancing"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "hWMKaxjc4dA_1", "video_path": "hWMKaxjc4dA.mp4", "subtitle_path": "hWMKaxjc4dA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 496.93, "view_count": 11186}, {"video_id": "005BeD0c2PA", "question": "An elderly woman with grey hair is standing in front of a display case at the museum. She is wearing black clothes. There are many items in the display case. 
After mentioning 'The Museum sets up the conditions for associative wandering thinking,' what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Displayed in the museum showcase, something with teeth that looks somewhat like a crab.", "A woman with a child", "Inside a display case, a red and white signboard.", "Displayed on the wall inside the museum's glass case, there is an object with a pair of long horns.", "Displayed on a platform, a human-shaped figurine, which is black."], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "005BeD0c2PA_0", "video_path": "005BeD0c2PA.mp4", "subtitle_path": "005BeD0c2PA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 198.8, "view_count": 2559}, {"video_id": "005BeD0c2PA", "question": "A museum display window exhibits several items, including a figure wearing a red plaid shirt, two black masks, and three black human-shaped objects of varying sizes. On the far left is a wall, and on the right is a piece with black and white patterns. When the phrase 'This is obviously an object that's made to be animated that's now mute.' 
is mentioned, what is the first object that appears?", "question_wo_referring_query": ", what is the first object that appears?", "candidates": ["A woman with a child", "A woman wearing a black suit with graying hair", "An item that looks somewhat like a crab, displayed in the museum's exhibit", "A man standing in front of the display case, wearing white shorts, short hair, a red T-shirt, and black shoes", "A black human-shaped object on the display table"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "005BeD0c2PA_1", "video_path": "005BeD0c2PA.mp4", "subtitle_path": "005BeD0c2PA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 198.8, "view_count": 2559}, {"video_id": "uUNRIzEFjGI", "question": "Against a white background, a man wearing glasses is giving an explanation. On the right side, some figures are displayed on a blackboard, while on the left side, there is a globe. He is wearing a blue short-sleeved shirt, with a laptop in front of him. In which scenario is the globe not present?", "question_wo_referring_query": ", on the left side, there is a globe. He is wearing a blue short-sleeved shirt, and there is a laptop in front of him. In which scenario does the globe not appear?", "candidates": ["The man is wearing a blue Polo shirt, and the background board is white. He is placing a decoration similar to a ribbon on his head.", "The man is wearing a blue checkered shirt, with a laptop in front of him, and there are some words on his laptop. On the blackboard to the right, there's a drawing of a figure without facial features.", "The man is wearing a blue Polo shirt, and the background board is white. 
There are some words on his laptop, and on the blackboard to the right, there's a drawing of a figure with facial features, drawn with a blue pen.", "The man is wearing a blue and white striped shirt, and on the blackboard to the right, the figures only have faces and hair, with no features, and the painted collar is red and blue.", "The man is wearing a white checkered shirt, with a laptop in front of him. On the blackboard to the right, there's a drawing of Obama, and on his desk, there's a bird toy."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "uUNRIzEFjGI_0", "video_path": "uUNRIzEFjGI.mp4", "subtitle_path": "uUNRIzEFjGI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 226.85, "view_count": 203473}, {"video_id": "uUNRIzEFjGI", "question": "In a white background, a man wearing a white checkered shirt and glasses is giving an explanation. On the left side is a globe, and on the right side, there is a blackboard displaying some figure images. In front of him, there is also a laptop. 
In which scene did Obama's image appear on the blackboard?", "question_wo_referring_query": "In which scene did Obama's image appear on the blackboard?", "candidates": ["When the man is wearing a blue striped shirt and giving an explanation", "When the man is wearing a white checkered shirt and giving an explanation", "When the man is wearing a black and white checkered shirt and giving an explanation", "When the man is wearing a blue Polo shirt and giving an explanation", "When the man is sitting on a yellow chair, with another office chair beside him"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "uUNRIzEFjGI_1", "video_path": "uUNRIzEFjGI.mp4", "subtitle_path": "uUNRIzEFjGI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 226.85, "view_count": 203473}, {"video_id": "zFBdLmOoj3E", "question": "When the news screen is divided into three sections, in the video, which subtitles does the man, who is wearing a gray suit with a white shirt and standing on the right bottom of the screen, also appear in?", "question_wo_referring_query": "Which subtitles does he also appear in?", "candidates": ["\u201cMonth Energy prices though did increase", "\u201cYour reaction here and is this in line\u201c", "\u201cWith what we thought what do we think\u201c", "\u201cInflation rate here is off that terrible\u201c", "\u201cJanuary was 3.1% this is 3.2% that means\u201d"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "zFBdLmOoj3E_0", "video_path": "zFBdLmOoj3E.mp4", "subtitle_path": "zFBdLmOoj3E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 315.48, "view_count": 12059}, {"video_id": "zFBdLmOoj3E", "question": "When the news screen is divided into three sections, in the upper right corner of the video, a person wearing a black suit with a white V-neck underneath and yellow short hair sits in front of a mirror. 
In which subtitles do they also appear?", "question_wo_referring_query": "In which subtitles do they also appear?", "candidates": ["\u201dbetter 3.2% here in the month is a lot", "\u201cmonth Energy prices though did increase", "show you this chart right here this blue", "\u201dabout affordability we've got the key\u201c", "\u201ca rate cut in June but not if you keep\u201d"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "zFBdLmOoj3E_1", "video_path": "zFBdLmOoj3E.mp4", "subtitle_path": "zFBdLmOoj3E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 315.48, "view_count": 12059}, {"video_id": "fEDwBpgPqyg", "question": "In the video, there are several sheets of white-backed paper, and one hand with two rings wearing a rose-pink outfit and another wearing a light pink outfit with a wristwatch. There are two character images at the top left and right sides, including a woman with long hair wearing a rose-pink coat over a white inner garment on the top right. 
What transformation occurs at the end of the video when she is sitting in front of a yellow background?", "question_wo_referring_query": "What transformation occurs?", "candidates": ["The outer coat changes to black and white stripes", "The outer coat changes to dark green", "The outer coat changes to purple", "The outer coat changes to blue", "The outer coat changes to yellow"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "fEDwBpgPqyg_0", "video_path": "fEDwBpgPqyg.mp4", "subtitle_path": "fEDwBpgPqyg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.4, "view_count": 1203}, {"video_id": "upWlxT-14f0", "question": "What change occurs on the airplane taxiing on the runway with many green trees in the background when the subtitle '[Music]' appears?", "question_wo_referring_query": "What change occurs?", "candidates": ["The airplane moves on the ground", "The airplane stops moving", "The airplane flies into the sky", "The airplane's wing falls off", "The airplane starts to reverse"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "upWlxT-14f0_0", "video_path": "upWlxT-14f0.mp4", "subtitle_path": "upWlxT-14f0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 209.72, "view_count": 45205}, {"video_id": "upWlxT-14f0", "question": "In a photo of a woman and a man inside a car, what change occurred when the smiling woman on the right, who is wearing earrings and has her hair up, appears in the subtitle 'He is heading towards the Titanic and waiting'?", "question_wo_referring_query": "What change occurred?", "candidates": ["The woman's hair was cut into a bob", "The woman's hair was tied into a bun", "The woman had neatly parted bangs", "The woman's clothes changed to black", "The woman's hair came loose"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": 
"upWlxT-14f0_1", "video_path": "upWlxT-14f0.mp4", "subtitle_path": "upWlxT-14f0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 209.72, "view_count": 45205}, {"video_id": "mVizOYsOklY", "question": "In a closed glass room, when a man with white hair, wearing a black item on his arm, dressed in a black suit and a pink plaid tie, raises his left hand, what is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He is squatting to tie his shoes", "He is talking to the mirror", "He is singing to the mirror", "He is touching the silver ring on his hand", "He is tapping on the table in front of the mirror"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "mVizOYsOklY_0", "video_path": "mVizOYsOklY.mp4", "subtitle_path": "mVizOYsOklY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 378.01, "view_count": 5637}, {"video_id": "mVizOYsOklY", "question": "In a closed glass room, there is a man with white hair, wearing black-rimmed glasses, a gray striped suit, and a tie. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Smiling slightly at the mirror", "Covering his face with both hands", "Pressing his temples with both hands", "Bending down to look at the floor", "Slapping the table forcefully"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "mVizOYsOklY_1", "video_path": "mVizOYsOklY.mp4", "subtitle_path": "mVizOYsOklY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 378.01, "view_count": 5637}, {"video_id": "elOwHlrQwuU", "question": "On a sunny beach, surrounded by greenery, there is a man wearing a black vest with sunglasses hanging on it. 
What objects are present in the scene when he says, 'actually super cheap I think one night'?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["red cushion with patterns", "blue tissue box", "green cushion with patterns", "blue cushion with patterns", "white chair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "elOwHlrQwuU_0", "video_path": "elOwHlrQwuU.mp4", "subtitle_path": "elOwHlrQwuU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.22, "view_count": 73366}, {"video_id": "elOwHlrQwuU", "question": "In a car with a lot of space, a woman wearing sunglasses with her hair tied up, holding a mobile phone, and dressed in a suspenders top is sitting by the window. When she says, 'we now have our own private ride Wow and,' what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["an orange backpack", "many black seats", "a bottle of white milk", "black curtains", "a cute toy"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "elOwHlrQwuU_1", "video_path": "elOwHlrQwuU.mp4", "subtitle_path": "elOwHlrQwuU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.22, "view_count": 73366}, {"video_id": "niSDzNd2u1U", "question": "On a white wall hangs a painting depicting a girl in a pink dress kneeling on the ground and looking forward. In front of the painting, there is another girl with short blond hair holding a cane, who is looking at the painting. 
What is this girl wearing?", "question_wo_referring_query": "What is she wearing?", "candidates": ["White dress with pink floral pattern", "Pink shirt", "Blue sleeveless dress", "White T-shirt", "Pink shirt with skirt"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "niSDzNd2u1U_0", "video_path": "niSDzNd2u1U.mp4", "subtitle_path": "niSDzNd2u1U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 252.5, "view_count": 53240}, {"video_id": "CGFPOp61t3A", "question": "In a dark mine, there are two men wearing black shirts and pants. They are wearing black-framed glasses and holding lamps. When the narrator says 'generation told for thousands of years', what color hats are the two men wearing?", "question_wo_referring_query": "What color hats are the two men wearing?", "candidates": ["black", "red", "blue", "white", "yellow"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "CGFPOp61t3A_0", "video_path": "CGFPOp61t3A.mp4", "subtitle_path": "CGFPOp61t3A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 421.02, "view_count": 83367}, {"video_id": "CGFPOp61t3A", "question": "On a body of water with murky yellow water, where waves and ripples surge, what type of bird flies by while the person next to the camera says \"least it is to me assuming it's correct\"?", "question_wo_referring_query": "What type of bird flies over the body of water?", "candidates": ["A bird with completely white wings and a fully black body", "A bird with completely black wings and a snow-white body", "A bird with a fully snow-white body", "A bird with wings that have a bit of black", "A bird that is entirely black"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "CGFPOp61t3A_1", "video_path": "CGFPOp61t3A.mp4", "subtitle_path": "CGFPOp61t3A_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 421.02, "view_count": 83367}, {"video_id": "PNwqSR5clWw", "question": "In a mountain cave, a pile of torches is burning on a stone platform, and a monkey is standing beside the torches looking out of the cave. Who extended a hand?", "question_wo_referring_query": "Who extended a hand?", "candidates": ["a monkey", "no one extended a hand", "a modern human", "a woman", "a man"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "PNwqSR5clWw_0", "video_path": "PNwqSR5clWw.mp4", "subtitle_path": "PNwqSR5clWw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.46, "view_count": 32034}, {"video_id": "PNwqSR5clWw", "question": "The sky is sapphire blue, with a few white clouds drifting in the distance. Below the blue sky is a stretch of shallows, with several mountain peaks resting on the shallows. A huge rock stands tall among the mountain peaks. Someone is standing upright facing the mountain peaks. Who is facing the huge rock and slowly raising their hands?", "question_wo_referring_query": "Who is facing the huge rock and slowly raising their hands?", "candidates": ["The person wearing a yellow coat and black pants", "The person wearing a yellow coat and brown pants", "The person wearing a blue coat and yellow pants", "The person wearing a yellow coat and blue pants", "The person wearing a yellow coat and white pants"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "PNwqSR5clWw_1", "video_path": "PNwqSR5clWw.mp4", "subtitle_path": "PNwqSR5clWw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.46, "view_count": 32034}, {"video_id": "3LeOues4Kp4", "question": "In an aerial view photo, there are many red symbols of varying sizes in the upper right corner of the screen. 
What happens on the screen when this photo first appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The red symbols in the upper right corner of the screen move to the lower left corner.", "The red symbols in the upper right corner of the screen move to the upper left corner.", "The red symbols in the upper right corner of the screen move to the lower right corner.", "The red symbols in the upper right corner of the screen disappear.", "The red symbols fill the entire screen."], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "3LeOues4Kp4_0", "video_path": "3LeOues4Kp4.mp4", "subtitle_path": "3LeOues4Kp4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.13, "view_count": 91650}, {"video_id": "3LeOues4Kp4", "question": "A panoramic view of the earth, the whole screen is composed of different shades of brown. What happens to the screen when it first appears?", "question_wo_referring_query": "What happens to the screen?", "candidates": ["The screen first zooms out then zooms in", "The screen does not change", "The screen continues to zoom in", "The screen first zooms in then zooms out", "The screen continues to zoom out"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "3LeOues4Kp4_1", "video_path": "3LeOues4Kp4.mp4", "subtitle_path": "3LeOues4Kp4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.13, "view_count": 91650}, {"video_id": "5bjNZFLvwvI", "question": "A person wearing a blue outer garment and a white inner garment, with their head not exposed, is holding a piece of sushi with a bite taken out of it in both hands. 
When the explanation mentions 'structurally sound because it\u2019s a,' what action does he take?", "question_wo_referring_query": "What action does he take?", "candidates": ["He turns the sushi so the bitten part faces the camera.", "He takes a bite of the sushi.", "He turns the sushi so the bitten part faces his mouth.", "He stands up.", "He throws away the sushi."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "5bjNZFLvwvI_0", "video_path": "5bjNZFLvwvI.mp4", "subtitle_path": "5bjNZFLvwvI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 440.78, "view_count": 300549}, {"video_id": "5bjNZFLvwvI", "question": "On a bustling square lined with tall buildings covered in advertisements, many people are playing around the square. There is a statue in the center of the square. A woman wearing a mask, a black coat, and a red backpack is mimicking the statue's pose. What happens when the subtitle 'why not pose father duffy oh' appears?", "question_wo_referring_query": "What happens to the masked woman?", "candidates": ["She turns to her right", "She turns to face the statue", "She runs towards the camera", "She squats down on the spot", "She turns to her left"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "5bjNZFLvwvI_1", "video_path": "5bjNZFLvwvI.mp4", "subtitle_path": "5bjNZFLvwvI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 440.78, "view_count": 300549}, {"video_id": "VSPX369A8jA", "question": "On a wooden board, a hand places a transparent bowl filled with minced meat. 
After placing the minced meat, what does this pair of hands do immediately?", "question_wo_referring_query": ", after placing the minced meat, what does this pair of hands do immediately?", "candidates": ["This hand takes away the wooden board", "This hand places a black brush on the wooden board", "This hand places a bowl of egg liquid on the wooden board", "This hand places a piece of dough on the wooden board", "This hand spreads out the minced meat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "VSPX369A8jA_0", "video_path": "VSPX369A8jA.mp4", "subtitle_path": "VSPX369A8jA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 397.11, "view_count": 322283}, {"video_id": "VSPX369A8jA", "question": "A transparent empty bowl is placed on a wooden board. Its four corners are arranged with seasonings. After a hand pours cabbage into the empty bowl, what does this hand do next?", "question_wo_referring_query": ", after this hand pours in cabbage, what does it do next?", "candidates": ["This hand immediately pours ginger into the large bowl", "This hand immediately pours green onion into the large bowl", "This hand immediately pours sesame oil into the large bowl", "This hand immediately pours garlic into the large bowl", "This hand immediately pours soy sauce into the large bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "VSPX369A8jA_1", "video_path": "VSPX369A8jA.mp4", "subtitle_path": "VSPX369A8jA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 397.11, "view_count": 322283}, {"video_id": "ZOkvFf8JbkA", "question": "A short-haired man wearing black clothes and sunglasses is explaining in front of a green screen. 
Based on the video, which scene appears first?", "question_wo_referring_query": "Based on the video, which scene appears first?", "candidates": ["A short-haired man wearing black clothes and sunglasses in front of a black screen", "A background composed of yellow and pink with the white text 'ML NEWS'", "A short-haired man wearing black clothes and sunglasses in front of a white screen", "A short-haired man wearing black clothes and sunglasses in front of a green screen", "A short-haired man wearing black clothes and sunglasses appears in the bottom right corner of the screen, while the rest of the screen shows a piece of text"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "ZOkvFf8JbkA_0", "video_path": "ZOkvFf8JbkA.mp4", "subtitle_path": "ZOkvFf8JbkA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 836, "duration": 399.0, "view_count": 23880}, {"video_id": "ZOkvFf8JbkA", "question": "In a background with pink and yellow alternating colors, there is white text in the center that says 'ML NEWS'. According to the video, which of the following colors of text appears first?", "question_wo_referring_query": "According to the video, which of the following colors of text appears first?", "candidates": ["Pink-yellow alternating 'ML NEWS' text", "White 'ML NEWS' text", "Pink-yellow alternating 'ML' text", "Blue 'ML NEWS' text", "Blue 'NEWS' text"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "ZOkvFf8JbkA_1", "video_path": "ZOkvFf8JbkA.mp4", "subtitle_path": "ZOkvFf8JbkA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 836, "duration": 399.0, "view_count": 23880}, {"video_id": "M5u7RKtuD1s", "question": "A webpage in English is shown on the screen, containing red and black English text. 
In the middle, there is an image of igneous rocks, and on the far right, there are five circular images arranged vertically. Before the subtitle mentions 'water to control and red diert lava', what happened?", "question_wo_referring_query": "What happened?", "candidates": ["Red lava kept rolling and flowing on the brown soil, and the surrounding area was scorched black.", "A row of short houses with white roofs were built on a green desolate grassland.", "A car appeared driving on the road.", "A satellite distribution map with red positioning arrows appeared.", "Puffs of white smoke emerged from the brown-black soil."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "M5u7RKtuD1s_0", "video_path": "M5u7RKtuD1s.mp4", "subtitle_path": "M5u7RKtuD1s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 183.98, "view_count": 12676}, {"video_id": "M5u7RKtuD1s", "question": "A bundle of glaring light hides behind a mountain range with snow patches that haven't melted yet, and the sky turns golden. A person stands on the mountain, and the subtitle mentions 'adaptability in the face of natural.' 
What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A white webpage appears.", "A globe appears.", "A satellite distribution map with red arrows appears.", "On a green meadow, red-black rocks emit white mist.", "A row of short houses with white roofs appear on a green grassy field."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "M5u7RKtuD1s_1", "video_path": "M5u7RKtuD1s.mp4", "subtitle_path": "M5u7RKtuD1s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 183.98, "view_count": 12676}, {"video_id": "gYzCt6fOnLI", "question": "There is an open door behind the corridor, the walls and windows on both sides of the corridor are white, with red curtains hanging on the windows on the right side. The corridor is lit with warm-toned lights. A curly-haired man in a leather jacket is walking. What subtitle appears together with this curly-haired man?", "question_wo_referring_query": "What subtitle appears together with this curly-haired man?", "candidates": ["spend the night there in Joshua Tree", "all right so it smells like feet really", "active question are we bringing our", "the junk all we wanted bananas and like", "teddy bear we just got here dude I'm"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "gYzCt6fOnLI_0", "video_path": "gYzCt6fOnLI.mp4", "subtitle_path": "gYzCt6fOnLI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 349.43, "view_count": 474488}, {"video_id": "gYzCt6fOnLI", "question": "In the warmly lit room, there is a world map hanging on the right side of the wall, a dark-colored sofa with snack bags and white cushions on it, and a rolled-up curtain on the left side of the wall. A man wearing a white lab coat and black pants is standing in front of a telescope. 
What subtitles did this man appear with?", "question_wo_referring_query": "With what subtitles did this man appear?", "candidates": ["active question are we bringing our", "teddy bear we just got here dude I'm", "half of this stuff well just why all you", "Tree okay we gotta find all the tents", "spend the night there in Joshua Tree"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "gYzCt6fOnLI_1", "video_path": "gYzCt6fOnLI.mp4", "subtitle_path": "gYzCt6fOnLI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 349.43, "view_count": 474488}, {"video_id": "upVlsNsam4M", "question": "On the whiteboard with English text, on the right side of the background is an image of a bedroom with a window, a red chair, and a computer desk. In the lower right corner, a man wearing a black suit and black-rimmed glasses is speaking. What change occurs to this man when the cartoon ship appears?", "question_wo_referring_query": "What change occurs to this man?", "candidates": ["His hands change from being clasped together to being apart", "One of his hands touches his forehead", "His hands remain apart", "His hands remain clasped together", "His hands change from being apart to being clasped together"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "upVlsNsam4M_0", "video_path": "upVlsNsam4M.mp4", "subtitle_path": "upVlsNsam4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.12, "view_count": 1980}, {"video_id": "upVlsNsam4M", "question": "A white background shows a carriage wheel appearing on a boat. The boat is white, with a red flag hanging on it. A character in a yellow coat with a hat and a red character are sitting on a small boat. The red character points towards a wooden box on the sea. In the bottom right corner, a man in a black suit with black-rimmed glasses is speaking. 
When black-and-white animation red lines appear, what change happens to this man?", "question_wo_referring_query": "What change happens to this man?", "candidates": ["His glasses disappear", "His laser pointer falls to the ground", "His suit turns blue", "He has nothing in his hand", "His laser pointer switches to his other hand"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "upVlsNsam4M_1", "video_path": "upVlsNsam4M.mp4", "subtitle_path": "upVlsNsam4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.12, "view_count": 1980}, {"video_id": "zIC4zbxR7oM", "question": "When sunlight penetrates through layers of clouds and shines on the ocean surface, creating golden lines on the sea surface, and the subtitle appears 'The quiet depths of this ocean were the site of continuous sediment deposition, laying', what kind of change happens to the sea surface?", "question_wo_referring_query": "What kind of change happens to the sea surface?", "candidates": ["The sea surface becomes very clear", "The coastline falls", "The sea surface becomes calm", "The coastline rises", "The sea surface forms wave patterns"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "zIC4zbxR7oM_0", "video_path": "zIC4zbxR7oM.mp4", "subtitle_path": "zIC4zbxR7oM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.63, "view_count": 4477}, {"video_id": "hZ59EgFFgkk", "question": "In a room filled with various objects, a man with short black hair and wearing a black top, when a screen frame pops up on his upper right, what does this man wearing a black top do?", "question_wo_referring_query": "What does the man wearing a black top do?", "candidates": ["Holding a white piece of paper facing the camera", "Holding a red card facing the camera", "Holding hands together while speaking to the camera", "Holding a $10 bill 
facing the camera", "Holding a bag of chips and eating"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "hZ59EgFFgkk_0", "video_path": "hZ59EgFFgkk.mp4", "subtitle_path": "hZ59EgFFgkk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 365.87, "view_count": 109977}, {"video_id": "hZ59EgFFgkk", "question": "In a room filled with various items, there is a man with short black hair wearing a black shirt. A label with 'Ahmed' appears in the top-right corner. What is this man with short black hair doing?", "question_wo_referring_query": "What is the man with short black hair doing?", "candidates": ["Holding a Bosnian coin facing the mirror", "Swinging in front of the mirror", "Holding a card with a beautiful landscape toward the mirror", "Holding a white piece of paper facing the mirror", "Holding a pack of cards facing the mirror"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "hZ59EgFFgkk_1", "video_path": "hZ59EgFFgkk.mp4", "subtitle_path": "hZ59EgFFgkk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 365.87, "view_count": 109977}, {"video_id": "MANyy9xKn_8", "question": "There is a person wearing a black floral shirt, holding a knife with a small red dot on it, and cutting green vegetables. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["silver ring", "red chili pepper", "green chili pepper", "gold ring", "transparent cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "MANyy9xKn_8_0", "video_path": "MANyy9xKn_8.mp4", "subtitle_path": "MANyy9xKn_8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 520.03, "view_count": 111452}, {"video_id": "MANyy9xKn_8", "question": "A person wearing a blue floral dress is using a knife with small red dots and characters on it to cut peeled white garlic. What objects are present in this scene?", "question_wo_referring_query": "A person wearing a blue floral dress is using a knife with small red dots and characters on it to cut peeled white garlic. What objects are present in this scene?", "candidates": ["Green apple", "Green leaves", "Red watermelon", "Red chili", "Purple flower"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "MANyy9xKn_8_1", "video_path": "MANyy9xKn_8.mp4", "subtitle_path": "MANyy9xKn_8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 520.03, "view_count": 111452}, {"video_id": "qLLKNsEi_oA", "question": "Next to a white desk, there is a man wearing a blue shirt, with both hands full of tattoos, reading a book. 
When the subtitle appears 'small Library I picked out a book that I,' what object is on the man's body?", "question_wo_referring_query": "What object is on the man's body?", "candidates": ["silver ring", "silver watch", "gold watch", "gold ring", "black watch"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "qLLKNsEi_oA_0", "video_path": "qLLKNsEi_oA.mp4", "subtitle_path": "qLLKNsEi_oA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 181.1, "view_count": 122442}, {"video_id": "qLLKNsEi_oA", "question": "In front of a bookshelf filled with books, an elderly person with completely white hair is holding glasses and looking down. When the subtitle 'controlled disorder to it' appears, what object is present in the scene?", "question_wo_referring_query": "In front of a bookshelf filled with books, an elderly person with completely white hair is holding glasses and looking down. When the subtitle 'controlled disorder to it' appears, what object is present in the scene?", "candidates": ["A pen clipped to the breast pocket", "A silver ring worn on the finger", "An open book", "A sharpened pencil", "A wristwatch worn on the wrist"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "qLLKNsEi_oA_1", "video_path": "qLLKNsEi_oA.mp4", "subtitle_path": "qLLKNsEi_oA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 181.1, "view_count": 122442}, {"video_id": "hg_0ajgm9KQ", "question": "Beside a wall made of bricks, there are two trees with lush foliage. Next to the trees, there's also a brick wall. A man in black pants is walking past the trees. 
What kind of clothes is this man wearing?", "question_wo_referring_query": "What kind of clothes is this man wearing?", "candidates": ["Black long-sleeved coat", "Blue long-sleeved shirt", "Blue long-sleeved coat", "White long-sleeved shirt", "Green long-sleeved coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "hg_0ajgm9KQ_0", "video_path": "hg_0ajgm9KQ.mp4", "subtitle_path": "hg_0ajgm9KQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 192.86, "view_count": 12070}, {"video_id": "hg_0ajgm9KQ", "question": "In a very tall chapel, there are colorful stained glass windows and many white pillars. There are also many candle holders with candles on the side. A man wearing long pants and a hat walks forward. What color is the hat that this man is wearing?", "question_wo_referring_query": "What color is the hat that this man is wearing?", "candidates": ["Red", "Black", "Sky blue", "White", "Purple"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "hg_0ajgm9KQ_1", "video_path": "hg_0ajgm9KQ.mp4", "subtitle_path": "hg_0ajgm9KQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 192.86, "view_count": 12070}, {"video_id": "iw_UnX1GWSo", "question": "In the bottom right corner of a white screen, a man with short black hair is explaining something. 
When he says 'will use some experience and make some,' what kind of outfit is he wearing?", "question_wo_referring_query": "What kind of outfit is he wearing?", "candidates": ["Gray round-neck shirt", "White button-up shirt", "Black button-up shirt", "White V-neck shirt", "Brown round-neck shirt"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "iw_UnX1GWSo_0", "video_path": "iw_UnX1GWSo.mp4", "subtitle_path": "iw_UnX1GWSo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.36, "view_count": 4}, {"video_id": "iw_UnX1GWSo", "question": "Below a red dot on a white screen, there is a man wearing glasses, with short hair, and holding something in his hand. When he says 'network to produce the desired', what kind of outerwear is he wearing?", "question_wo_referring_query": "What kind of outerwear is he wearing?", "candidates": ["White suit jacket", "Black parka with a drawstring", "Blue cotton jacket", "Grey plaid shirt jacket", "Black suit jacket"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "iw_UnX1GWSo_1", "video_path": "iw_UnX1GWSo.mp4", "subtitle_path": "iw_UnX1GWSo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.36, "view_count": 4}, {"video_id": "tasGHdFXa0s", "question": "In a pitch-black night sky, a blue object suddenly got hit, producing intense sparks and flares. 
What caused it?", "question_wo_referring_query": ", what caused it?", "candidates": ["a falling aircraft", "a fired rocket", "generated by the burgle crater impact", "an interstellar cannon", "a falling spacecraft"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "tasGHdFXa0s_0", "video_path": "tasGHdFXa0s.mp4", "subtitle_path": "tasGHdFXa0s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.44, "view_count": 17322}, {"video_id": "tasGHdFXa0s", "question": "On an aerial map, the left side is a sunlit sea area, and in the lower right corner, there is a small island in the sea. What is the object that horizontally crosses the small island on the map?", "question_wo_referring_query": "What is the item?", "candidates": ["A green arrow", "A white arrow", "A yellow arrow", "A blue arrow", "A red arrow"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "tasGHdFXa0s_1", "video_path": "tasGHdFXa0s.mp4", "subtitle_path": "tasGHdFXa0s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.44, "view_count": 17322}, {"video_id": "qGDj8Lbl7CI", "question": "By the side of an emerald green lake, there's a man wearing a black duckbill cap, with short golden hair, and carrying a backpack. 
What did he do when he first appeared?", "question_wo_referring_query": "What did he do when he first appeared?", "candidates": ["Happily ran forward.", "While walking forward, he took photos of the lake's beautiful scenery with his phone.", "While walking forward, he held the strap of the backpack.", "When walking forward, he pointed towards the emerald green lake with the finger holding his phone.", "While walking forward, he looked things up on his phone."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "qGDj8Lbl7CI_0", "video_path": "qGDj8Lbl7CI.mp4", "subtitle_path": "qGDj8Lbl7CI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 279.07, "view_count": 42005}, {"video_id": "qGDj8Lbl7CI", "question": "In front of a small wooden house, there is a white parasol, beside it is a jade green lake shore, a man wearing a black and white fur coat with short golden hair, what did he do the first time he appeared?", "question_wo_referring_query": "What did he do the first time he appeared?", "candidates": ["Sitting on a wooden chair with legs crossed", "Pointing towards the lake shore", "Taking off his fur coat", "Putting on a stylish hat", "Speaking softly to the mirror"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "qGDj8Lbl7CI_1", "video_path": "qGDj8Lbl7CI.mp4", "subtitle_path": "qGDj8Lbl7CI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 279.07, "view_count": 42005}, {"video_id": "mAXHFsDnBzE", "question": "In a white background, there is a picture with many people in it. Next to the picture, there is a man wearing a brown suit and glasses. 
What action does he make when he says 'about that's really'?", "question_wo_referring_query": "What action does he make?", "candidates": ["Uses the hand to rub his forehead", "Uses the hand holding something to wipe his nose", "Spreads both hands open", "Lowers the hand holding something", "Raises the hand holding something"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "mAXHFsDnBzE_0", "video_path": "mAXHFsDnBzE.mp4", "subtitle_path": "mAXHFsDnBzE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 270.08, "view_count": 20}, {"video_id": "mAXHFsDnBzE", "question": "In the bottom right corner of a white background, there is a man wearing a coffee-colored suit jacket with a maroon round-neck shirt underneath. What action does he perform when he says 'box okay so now you can understand like'?", "question_wo_referring_query": "What action does he perform?", "candidates": ["Right hand palm raised, palm facing up", "Left and right hands crossed", "Right hand touches forehead", "Both palms open outward", "Right hand adjusts glasses frame"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "mAXHFsDnBzE_1", "video_path": "mAXHFsDnBzE.mp4", "subtitle_path": "mAXHFsDnBzE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 270.08, "view_count": 20}, {"video_id": "-rWCR5oYOSM", "question": "In a theater with a blue sea and yellow rocks as the background, along with black stripes, which speaker appears first?", "question_wo_referring_query": "Which speaker appears first?", "candidates": ["A male with short black hair, wearing a white shirt", "A female with golden curly hair, wearing a brown dress", "A male with sparse hair, wearing a blue suit", "A male with curly black hair, wearing a white suit", "A female with short hair, wearing a black long dress"], "topic_category": "NP-News-Programs", 
"question_category": "O3O", "level": "L2-Relation", "id": "-rWCR5oYOSM_0", "video_path": "-rWCR5oYOSM.mp4", "subtitle_path": "-rWCR5oYOSM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 265.92, "view_count": 6224}, {"video_id": "-rWCR5oYOSM", "question": "In a lecture hall with a blue ocean-themed background that has black vertical stripes on it, someone is giving a presentation. Which person's name is mentioned first?", "question_wo_referring_query": ", which person's name is mentioned first?", "candidates": ["Sairo", "Lora", "Philip turl", "Pu Jing", "Obama"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "-rWCR5oYOSM_1", "video_path": "-rWCR5oYOSM.mp4", "subtitle_path": "-rWCR5oYOSM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 265.92, "view_count": 6224}, {"video_id": "-eaFD5Nr2nw", "question": "In front of a white wall with a bit of gray, after the subtitle 'response to' appears, which character makes their first appearance?", "question_wo_referring_query": "Which character makes their first appearance?", "candidates": ["A man in a blue shirt with short blond hair", "A man in a colorful outfit", "A man in a formal suit with a blue tie", "A woman wearing a necklace, with slightly curly hair, dressed in a black top", "A man in a black suit with a red tie"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "-eaFD5Nr2nw_0", "video_path": "-eaFD5Nr2nw.mp4", "subtitle_path": "-eaFD5Nr2nw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.56, "view_count": 68228}, {"video_id": "-eaFD5Nr2nw", "question": "In a place surrounded by weeds, and there's also a missile launcher, which character appears for the first time after the subtitle 'is their' appears?", "question_wo_referring_query": "Which character appears for the first time?", "candidates": ["A boy wearing a 
blue shirt with black short hair", "A man wearing a blue suit with a red tie", "A boy wearing a blue shirt with golden short hair", "A boy wearing a black shirt with golden short hair", "A boy wearing a white shirt with golden short hair"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "-eaFD5Nr2nw_1", "video_path": "-eaFD5Nr2nw.mp4", "subtitle_path": "-eaFD5Nr2nw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.56, "view_count": 68228}, {"video_id": "g-Ie90ejFjs", "question": "When the national flag behind the crowd appears in a radiating pattern background of blue-green stripes in the box in the upper right corner of the screen, what change occurred to the national flag?", "question_wo_referring_query": "What change occurred to the national flag?", "candidates": ["Two white arrows appeared", "Two yellow arrows appeared", "Two green arrows appeared", "Two blue arrows appeared", "One yellow arrow appeared"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "g-Ie90ejFjs_0", "video_path": "g-Ie90ejFjs.mp4", "subtitle_path": "g-Ie90ejFjs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 437.34, "view_count": 300685}, {"video_id": "g-Ie90ejFjs", "question": "Against a blue-green striped background, there is a blue graphic. 
Next to the graphic, there is an object that, when it appears in the top right corner of a man wearing a gray short-sleeved shirt, what transformation does it undergo?", "question_wo_referring_query": "What transformation does it undergo?", "candidates": ["The color changes from white to blue", "The color changes from silver to gold", "The color changes from gold to blue", "The color changes from blue to silver", "The color changes from gold to silver"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "g-Ie90ejFjs_1", "video_path": "g-Ie90ejFjs.mp4", "subtitle_path": "g-Ie90ejFjs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 437.34, "view_count": 300685}, {"video_id": "ENeCYwms-Cc", "question": "In the video, there are 9 people who look identical. They all have beards, wear black suits, and have scarves wrapped around their necks. They are showing varied expressions while talking. What items are not present in the screen?", "question_wo_referring_query": "What items are not present in the screen?", "candidates": ["Red and white scarf", "White shirt", "Blue and white scarf", "Teal shirt with some patterns", "White cane"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "ENeCYwms-Cc_0", "video_path": "ENeCYwms-Cc.mp4", "subtitle_path": "ENeCYwms-Cc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.04, "view_count": 4859424}, {"video_id": "ENeCYwms-Cc", "question": "The screen shows a room with two men sitting at a table talking. Behind them is a wall covered with various maps and a shelf with some items on it. On the right side, there is a drawing board, and on the left, there is a coat rack. One of the men is wearing an olive suit and a black shirt, while the other is wearing a red T-shirt. 
Which of the following items are not present in this room?", "question_wo_referring_query": "Which of the following items are not present in this room?", "candidates": ["A black globe", "A red and white striped ball", "A red desk lamp", "Glasses and a phone", "An olive hat hanging on the coat rack"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "ENeCYwms-Cc_1", "video_path": "ENeCYwms-Cc.mp4", "subtitle_path": "ENeCYwms-Cc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.04, "view_count": 4859424}, {"video_id": "UZdL5TmtIyw", "question": "A woman wearing a grey jacket is riding a horse, she is wearing jeans, has braided hair, there are some mountains in the distance, the sky is grey and cloudy, and the surroundings are a vast grassland. Some trees are around the grassland. When the phrase 'that's ok.' is mentioned, which objects are not present in the scene?", "question_wo_referring_query": "Which objects are not present in the scene?", "candidates": ["Sunglasses", "Grey jacket and white inner shirt", "Rod inserted in the grassland", "White horse with a green blanketed saddle", "Hill"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "UZdL5TmtIyw_0", "video_path": "UZdL5TmtIyw.mp4", "subtitle_path": "UZdL5TmtIyw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.75, "view_count": 287479}, {"video_id": "UZdL5TmtIyw", "question": "A woman is sitting at a table. On the table, there are two racks with some paintings on them. The woman is wearing a hat, and her body is almost blocked by the racks on the table. There is a car behind her. On the left, there are three nets and a long rod. On the right, there is a whiteboard, and a row of clothes are hanging in the nearby basket. What object is not present in the frame when 'And I hope you don\u2019t either.' 
is mentioned?", "question_wo_referring_query": "What object is not present in the frame?", "candidates": ["Painting with green plants", "Yellow whiteboard", "Black painting", "Blue painting", "White tablecloth"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "UZdL5TmtIyw_1", "video_path": "UZdL5TmtIyw.mp4", "subtitle_path": "UZdL5TmtIyw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.75, "view_count": 287479}, {"video_id": "1pLrNJFmz8c", "question": "A short-haired woman is talking to someone. She is sitting in front of a mirror, with short hair that has side-swept bangs, wearing black-framed glasses, and a denim top. She is crossing her arms over her knees. What kind of bottom is the black model behind her wearing?", "question_wo_referring_query": "What kind of bottom is the black model behind her wearing?", "candidates": ["Pink shorts", "Pink mini skirt", "Pink long skirt", "Pink pleated skirt", "Pink denim skirt"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "1pLrNJFmz8c_0", "video_path": "1pLrNJFmz8c.mp4", "subtitle_path": "1pLrNJFmz8c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 405.49, "view_count": 14754}, {"video_id": "1pLrNJFmz8c", "question": "In the video, there is a piece of white paper with many shapes listed on it. Many shapes also have annotations beside them highlighted in yellow watercolor. 
On the top right edge of the paper, above the yellow label, what is the shape?", "question_wo_referring_query": "What is the shape above the yellow label on the top right edge of the paper?", "candidates": ["Half rectangle", "Triangle", "Circle", "Heart", "Quarter circle"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "1pLrNJFmz8c_1", "video_path": "1pLrNJFmz8c.mp4", "subtitle_path": "1pLrNJFmz8c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 405.49, "view_count": 14754}, {"video_id": "jPXVANbTfEg", "question": "A blonde woman in a black coat is sitting in a car talking. She is wearing a seatbelt, her hair is tied up, her eyebrows are thin, her eyelashes are long, and she is wearing a mask. What color is her nail polish when she mentions 'kids i don't know but they're really'?", "question_wo_referring_query": "What color is her nail polish?", "candidates": ["White", "No nail polish", "Black", "Green", "Pink purple"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "jPXVANbTfEg_0", "video_path": "jPXVANbTfEg.mp4", "subtitle_path": "jPXVANbTfEg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.48, "view_count": 22845}, {"video_id": "jPXVANbTfEg", "question": "A girl with golden hair is giving a thumbs up with both hands in front of a mirror in a room. Next to her, the captions read 'shirt- target jacket- shein leggings- lululemon bag- princess polly necklace- shein.' 
When the caption 'shoes' appears, what type of jacket is she wearing?", "question_wo_referring_query": "What type of jacket is she wearing when the caption 'shoes' appears?", "candidates": ["Short denim jacket", "Short puffer jacket", "Short shiny jacket", "Short fuzzy jacket", "Short fleece jacket"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "jPXVANbTfEg_1", "video_path": "jPXVANbTfEg.mp4", "subtitle_path": "jPXVANbTfEg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.48, "view_count": 22845}, {"video_id": "S7boBDK5yMA", "question": "In a conference hall, four people dressed in business suits are seated at a conference table. Behind them is a beige curtain. In front of them are black notebooks, laptops, and microphones. Some are looking at their laptops while others are flipping through books. Who is the person drinking water from a cup?", "question_wo_referring_query": "Who is the person drinking water from a cup?", "candidates": ["The second man from the right wearing glasses and a purple tie", "The man on the far left wearing glasses with golden frames", "The man seated behind the four, playing with his phone", "The second man from the left looking at his laptop", "The man on the far right with a green tie"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "S7boBDK5yMA_0", "video_path": "S7boBDK5yMA.mp4", "subtitle_path": "S7boBDK5yMA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 386.0, "view_count": 20918}, {"video_id": "S7boBDK5yMA", "question": "In a hall, there are three flags in the hall, on the left side there are also some yellow flowers and red and blue fabric decorations, on the ground to the left are some green and white decorations, a man wearing glasses and a suit is preparing to sit down, behind him there is a person opening a window, next to him a person is tidying 
up the desk, on the desk there are tissue boxes and yellow flowers, behind him there is a woman, on the right side a man with glasses is walking past, who is holding a blue folder?", "question_wo_referring_query": "Who is holding a blue folder?", "candidates": ["The man with glasses preparing to sit down", "The woman in the purple clothes", "The tall man holding the window", "The man with glasses walking past on the right", "The man tidying up the desk"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "S7boBDK5yMA_1", "video_path": "S7boBDK5yMA.mp4", "subtitle_path": "S7boBDK5yMA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 386.0, "view_count": 20918}, {"video_id": "PrF0KCFFaew", "question": "A woman with short blonde hair and wearing earrings is explaining in front of a mirror. She is wearing black clothes, and behind her, there is a glass windbreak. Through the windbreak, several bookshelves are visible. Two women are talking in front of the bookshelves. What happened when 'United also told Boeing to stop building' was mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["The two women talking hugged each other.", "The woman explaining brushed her hair.", "The two women talking shook hands.", "A man with a hat walked past the bookshelf.", "The woman explaining stood up."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "PrF0KCFFaew_0", "video_path": "PrF0KCFFaew.mp4", "subtitle_path": "PrF0KCFFaew_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 380.98, "view_count": 123103}, {"video_id": "PrF0KCFFaew", "question": "A man with short hair is explaining in front of a screen. He is wearing a blue shirt. Behind him is a background divided by an orange translucent grid. On the left side, there are some red, yellow, and green interspersed data. 
On the right side, there are complex buildings. What happened when the phrase 'say what a price would be on that floor' was mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["The background changed from images to black.", "The man left the frame.", "The man stood up.", "The man picked up the receiver.", "The white subtitle scroll bar changed quickly."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "PrF0KCFFaew_1", "video_path": "PrF0KCFFaew.mp4", "subtitle_path": "PrF0KCFFaew_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 380.98, "view_count": 123103}, {"video_id": "3cAmGRX5XEg", "question": "In the video, a man wearing black clothes is holding a microphone and giving an explanation. Behind him is a room filled with various pictures, including a world map. To his right, different images will pop up. Which image is displayed first?", "question_wo_referring_query": "Which image is displayed first?", "candidates": ["A coastal scene with some rocks under a turquoise sky and many white buildings.", "A man and a woman wearing sunglasses with two children, also pushing a stroller with a baby. One child walks in front of the man, and the woman carries another smaller child.", "Under the turquoise sky, a lake with red water and green grass.", "Two horses standing beside rocks.", "A woman wearing earrings and wavy hair partly draped over her face. 
She is in a white tank top in front of a yellow background with white letters."], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "3cAmGRX5XEg_0", "video_path": "3cAmGRX5XEg.mp4", "subtitle_path": "3cAmGRX5XEg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.57, "view_count": 160306}, {"video_id": "3cAmGRX5XEg", "question": "Under a curtain that's half-black and half-green, three men and one woman are seated at a white table having a conversation. Each of them has some objects in their hands. There is also a white basket on the table, with a microphone placed in front of them. In the bottom right corner, there is a picture of the Brazilian flag. Which of the following images is shown last?", "question_wo_referring_query": "Under a curtain that's half-black and half-green, three men and one woman are seated at a white table having a conversation. Each of them has some objects in their hands. There is also a white basket on the table, with a microphone placed in front of them. In the bottom right corner, there is a picture of the Brazilian flag. 
Which of the following images is shown last?", "candidates": ["An image of the South Korean flag", "An image of the French flag", "An image of the Dutch flag", "An image of the Serbian flag", "An image of the German flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "3cAmGRX5XEg_1", "video_path": "3cAmGRX5XEg.mp4", "subtitle_path": "3cAmGRX5XEg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.57, "view_count": 160306}, {"video_id": "w9Z9TxKdWLw", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a clip of a woman in black clothes with long curly hair looking closely at a colorful painting on the table was shown, followed by a clip of the woman looking at a black rectangular painting, and finally a clip of the woman looking at a painting with blue, white, red, and yellow interspaces was shown.", "First, a clip of the woman looking at a black rectangular painting was shown, followed by a clip of the woman looking at a painting with blue, white, red, and yellow interspaces, and finally a clip of a woman in black clothes with long curly hair looking closely at a colorful painting on the table was shown.", "First, a clip of the woman looking at a painting with blue, white, red, and yellow interspaces was shown, followed by a clip of a woman in black clothes with long curly hair looking closely at a colorful painting on the table, and finally a clip of the woman looking at a black rectangular painting was shown.", "First, a clip of a woman in black clothes with long curly hair looking closely at a colorful painting on the table was shown, followed by a clip of the woman looking at a painting with blue, white, red, and yellow interspaces, and finally a clip of the woman looking at a black rectangular painting was shown.", "First, a clip of the 
woman looking at a black rectangular painting was shown, followed by a clip of a woman in black clothes with long curly hair looking closely at a colorful painting on the table, and finally a clip of the woman looking at a painting with blue, white, red, and yellow interspaces was shown."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "w9Z9TxKdWLw_0", "video_path": "w9Z9TxKdWLw.mp4", "subtitle_path": "w9Z9TxKdWLw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 214.18, "view_count": 5907}, {"video_id": "w9Z9TxKdWLw", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a black drawing under a white background was shown, with some patterns of original points and some white lines resembling arrows. Then, a woman with black clothing and curly hair holding a rectangular artwork was shown, with a bookshelf behind her. Finally, a drawing in red, blue, white, and yellow was shown.", "First, a woman with black clothing and curly hair holding a rectangular artwork was shown, with a bookshelf behind her. Then, a black drawing under a white background was shown, with some patterns of original points and some white lines resembling arrows. Finally, a drawing in red, blue, white, and yellow was shown.", "First, a black drawing under a white background was shown, with some patterns of original points and some white lines resembling arrows. Then, a drawing in red, blue, white, and yellow was shown, also with many white lines. Finally, a woman with black clothing and curly hair holding a rectangular artwork was shown, with a bookshelf behind her.", "First, a drawing in red, blue, white, and yellow was shown. Then, a woman with black clothing and curly hair holding a rectangular artwork was shown, with a bookshelf behind her. 
Finally, a black drawing under a white background was shown, with some patterns of original points and some white lines resembling arrows.", "First, a drawing in red, blue, white, and yellow was shown. Then, a black drawing under a white background was shown, with some patterns of original points and some white lines resembling arrows. Finally, a woman with black clothing and curly hair holding a rectangular artwork was shown, with a bookshelf behind her."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "w9Z9TxKdWLw_1", "video_path": "w9Z9TxKdWLw.mp4", "subtitle_path": "w9Z9TxKdWLw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 214.18, "view_count": 5907}, {"video_id": "A_i1AuOZonU", "question": "Against a white backdrop, there is an image of a pink brain, depicted from a side view, with the brain's texture very clear. In which other scene does this brain appear?", "question_wo_referring_query": "In which other scene does this brain appear?", "candidates": ["A blue scene with a man speaking", "A pink backdrop scene with some letters and numbers on the right side", "A black scene with a rectangular background composed of many colored stripes", "A blue scene marked with many squares", "A scene with a diagram of genetic maps"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "A_i1AuOZonU_0", "video_path": "A_i1AuOZonU.mp4", "subtitle_path": "A_i1AuOZonU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 357.02, "view_count": 101061}, {"video_id": "A_i1AuOZonU", "question": "Under a black background, there are two character photos. On the left side, there is a man with white hair wearing glasses and dressed in dark blue clothing. On the right side is a woman with black long hair dressed in black and white plaid, smiling slightly. 
In which of the following scenes did these two people appear?", "question_wo_referring_query": "In which of the following scenes did these two people appear?", "candidates": ["In a photo of a brightly lit square", "In a photo on an island inhabited by fishermen", "In a photo with mountains and grassland with very good sunlight", "A scene with a pink background, with some alphabet letters and numbers written on the right side", "In a photo with sea and beach with very good sunlight"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "A_i1AuOZonU_1", "video_path": "A_i1AuOZonU.mp4", "subtitle_path": "A_i1AuOZonU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 357.02, "view_count": 101061}, {"video_id": "FxKZeR2v6Q4", "question": "In what subtitles does the girl with pearl earrings appear together?", "question_wo_referring_query": "In which subtitles has it appeared together?", "candidates": ["or a head that depicts a character", "the girl's skin I'm no Beauty Guru but", "house to use for his Studio because of", "he created that shows his face and what", "in exotic attire the thing is we"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "FxKZeR2v6Q4_0", "video_path": "FxKZeR2v6Q4.mp4", "subtitle_path": "FxKZeR2v6Q4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.04, "view_count": 619108}, {"video_id": "FxKZeR2v6Q4", "question": "In which subtitles does a portrait of a person wearing a black hat, with curly hair, and dressed in a white-collared black outfit appear?", "question_wo_referring_query": "In which subtitles does it appear?", "candidates": ["him any less of an artistic Master would", "researching this painting I would catch", "it make The Girl With a Pearl Earring", "create this piece we don't know much of", "who this girl is all of these unanswered"], "topic_category": "KA-Knowledge-Art", 
"question_category": "TOS", "level": "L2-Relation", "id": "FxKZeR2v6Q4_1", "video_path": "FxKZeR2v6Q4.mp4", "subtitle_path": "FxKZeR2v6Q4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.04, "view_count": 619108}, {"video_id": "UUPmuzzSVJY", "question": "A piece of chicken breast was placed on the cutting board. When it was placed in the rectangular dish with the broccoli, what kind of changes did the chicken breast undergo?", "question_wo_referring_query": "What kind of changes did the chicken breast undergo?", "candidates": ["The chicken breast was fried.", "The chicken breast was made into a sandwich.", "The chicken breast was cut into pieces.", "The chicken breast was sliced.", "The chicken breast was coated with breadcrumbs."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "UUPmuzzSVJY_0", "video_path": "UUPmuzzSVJY.mp4", "subtitle_path": "UUPmuzzSVJY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.5, "view_count": 166392}, {"video_id": "UUPmuzzSVJY", "question": "A tray filled with long slices of cheese was placed into the oven. After they were cooked and cut, what kind of changes occurred?", "question_wo_referring_query": "A tray filled with long slices of cheese was placed into the oven. 
After they were cooked and cut, what kind of changes occurred?", "candidates": ["The cheese was cut into triangular pieces and sprinkled with green onions.", "The cheese turned golden brown and was cut into rectangular pieces with ham in between.", "The cheese turned golden brown and was cut into heart-shaped pieces with tomatoes in between.", "The cheese turned black.", "The cheese turned golden brown and was cut into heart-shaped pieces with ham in between."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "UUPmuzzSVJY_1", "video_path": "UUPmuzzSVJY.mp4", "subtitle_path": "UUPmuzzSVJY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.5, "view_count": 166392}, {"video_id": "kCYUdZdiyDA", "question": "In a grey-walled room decorated with drawings and bookshelves, where there are books, a globe, and a rocket, what change does a man with glasses and brown hair experience when mentioning \u201cYou have a special sense that tells you whether your body is moving and how your body is oriented\u201d?", "question_wo_referring_query": "What change does he experience?", "candidates": ["He wore an earring", "He wore a hat", "He put on an eye patch", "He carried a shoulder bag", "His clothes changed to black"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "kCYUdZdiyDA_0", "video_path": "kCYUdZdiyDA.mp4", "subtitle_path": "kCYUdZdiyDA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.31, "view_count": 589477}, {"video_id": "kCYUdZdiyDA", "question": "Inside the space station, surrounded by a lot of complex equipment, a man wearing a black suit and blue pants is tumbling in the spaceship. 
When the phrase 'So an astronaut's vestibular system adapts to this new normal after just a few days up there' is mentioned, what change occurs to this man?", "question_wo_referring_query": "what change occurs to this man?", "candidates": ["He puts on a hat", "He puts on a jacket", "He puts on glasses", "He puts on a spacesuit", "He changes into a short-sleeved T-shirt"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "kCYUdZdiyDA_1", "video_path": "kCYUdZdiyDA.mp4", "subtitle_path": "kCYUdZdiyDA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.31, "view_count": 589477}, {"video_id": "vIG7wrL5O2I", "question": "In an amusement park, where there is a Ferris wheel in the distance and some trees around, the sky is very blue. A girl wearing a white T-shirt and denim shorts is sitting on a ladder. She straightens her legs, carrying a red backpack. What does she do next?", "question_wo_referring_query": "What does she do next?", "candidates": ["She jumps up from the ladder", "She lies down on the ladder", "She raises one of her hands", "She jumps off the ladder", "She puts down the backpack"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "vIG7wrL5O2I_0", "video_path": "vIG7wrL5O2I.mp4", "subtitle_path": "vIG7wrL5O2I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 323.4, "view_count": 118515}, {"video_id": "vIG7wrL5O2I", "question": "A girl wearing a white T-shirt with tied-up hair set up a wooden shelf on the wall of her room. To her left, there is a white door, and behind the door, there's a rack piled with many shoes. To the right, there is a bookshelf with many books and a globe, and on the right side, there are also paper cranes hanging. 
What did she put on the shelf?", "question_wo_referring_query": "What did she put on the shelf?", "candidates": ["She painted the shelf", "She placed a black painting", "She placed a pink painting", "She placed a doll", "She placed a yellow painting"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "vIG7wrL5O2I_1", "video_path": "vIG7wrL5O2I.mp4", "subtitle_path": "vIG7wrL5O2I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 323.4, "view_count": 118515}, {"video_id": "V9jsq8qvYwg", "question": "Under the lighting on a red background wall, there are two stone statues and a painted picture. The person in the painting has black hair tied in a braid and is wearing white and turquoise clothes. There are also three pieces of gold jewelry at the bottom right of the display. Which item is not present here?", "question_wo_referring_query": "Under the lighting on a red background wall, there are two stone statues and a painted picture. The person in the painting has black hair tied in a braid and is wearing white and turquoise clothes. There are also three pieces of gold jewelry at the bottom right of the display. Which item is not present here?", "candidates": ["Black hair", "Yellow bracelet", "White stone statue", "Yellow necklace", "Green earrings"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "V9jsq8qvYwg_0", "video_path": "V9jsq8qvYwg.mp4", "subtitle_path": "V9jsq8qvYwg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 203.2, "view_count": 20233}, {"video_id": "V9jsq8qvYwg", "question": "In an exhibition hall, the surrounding background walls are white, and the display cabinet is red with some artifacts inside. On the right, there is also a white-framed painting featuring a portrait with a red background. 
A man is talking in front of a mirror; he is wearing dark blue clothes, a checkered shirt, and glasses. What item is not present in this scene?", "question_wo_referring_query": "What item is not present in this scene?", "candidates": ["Blue sweater", "Red and white display cabinet", "Black glasses", "White plaster bust", "Checkered shirt"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "V9jsq8qvYwg_1", "video_path": "V9jsq8qvYwg.mp4", "subtitle_path": "V9jsq8qvYwg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 203.2, "view_count": 20233}, {"video_id": "J6uULoBsTJo", "question": "A man wearing a black lab coat is speaking against a dark blue background. He has black hair with a yellow-dyed streak on the right side of his forehead. He is wearing an earring, and he has a watch on his right hand. When he mentions 'just a glaze, or it can grow to about 25 centimeters thick,' what nonexistent object is he referring to?", "question_wo_referring_query": "A man wearing a black lab coat is speaking against a dark blue background. He has black hair with a yellow-dyed streak on the right side of his forehead. He is wearing an earring, and he has a watch on his right hand. When he mentions 'just a glaze, or it can grow to about 25 centimeters thick,' what nonexistent object is he referring to?", "candidates": ["Black watch", "Black earring", "Black hair", "Black outerwear", "Black lab coat"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "J6uULoBsTJo_0", "video_path": "J6uULoBsTJo.mp4", "subtitle_path": "J6uULoBsTJo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 337.8, "view_count": 169582}, {"video_id": "J6uULoBsTJo", "question": "There is a webpage displayed against a white background. In the upper left corner of the webpage, there are black letters spelling 'BRILLIANT'. 
On the right side, there is a blue and white rectangular frame. Below, there are three images, featuring some red and blue objects, a picture of a scale, and an image of the globe. When 'You can learn how to put a number on uncertainty and how to minimize it.' is mentioned, what is the non-existent object?", "question_wo_referring_query": ", what is the non-existent object?", "candidates": ["Red circle on a blue background", "Red globe on a blue background", "White scale on a blue background", "Blue rectangle on a blue background", "Black scale on a blue background"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "J6uULoBsTJo_1", "video_path": "J6uULoBsTJo.mp4", "subtitle_path": "J6uULoBsTJo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 337.8, "view_count": 169582}, {"video_id": "nMHwatpXM0Q", "question": "In the car, there is a woman with blonde hair. She is wearing a gray coat and a black camisole underneath. She is holding a cup with a straw inserted, with a steering wheel in front of her. What material is the cup she is holding made of?", "question_wo_referring_query": "What material is the cup she is holding made of?", "candidates": ["a brown ceramic cup", "a brown paper cup", "a gray paper cup", "a brown glass cup", "a transparent plastic cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "nMHwatpXM0Q_0", "video_path": "nMHwatpXM0Q.mp4", "subtitle_path": "nMHwatpXM0Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 583.28, "view_count": 34898}, {"video_id": "vCEianlPs7A", "question": "On a wooden board, there is a rectangular bowl filled with beaten eggs on the left side and a transparent bowl filled with breadcrumbs on the right side. A pair of black hands is picking up food and rolling it in the breadcrumbs. 
What is the shape of the food when it reaches the step 'Step 2: Coated with breadcrumbs.'?", "question_wo_referring_query": "What is the shape of the food at the step 'Step 2: Coated with breadcrumbs.'?", "candidates": ["triangle", "heart", "rectangle", "hexagon", "circle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "vCEianlPs7A_0", "video_path": "vCEianlPs7A.mp4", "subtitle_path": "vCEianlPs7A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 516.6, "view_count": 4971253}, {"video_id": "jpcBgYYFS8o", "question": "On a tree in the sunlight of a sunny day, the tree branches and green leaves sway with the wind. What is produced from the pointed leaves on the tree?", "question_wo_referring_query": "What is produced?", "candidates": ["Avocado", "Watermelon", "Pomegranate", "Cherry", "Banana"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "jpcBgYYFS8o_0", "video_path": "jpcBgYYFS8o.mp4", "subtitle_path": "jpcBgYYFS8o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.06, "view_count": 2111887}, {"video_id": "jpcBgYYFS8o", "question": "On both sides there are red iron plates with some gray rollers in the middle. Behind, there is a slanted gray iron conveyor belt, surrounded by some green grass and two wheels. What is being conveyed on this conveyor belt?", "question_wo_referring_query": "What is being conveyed on this conveyor belt?", "candidates": ["Apples", "Bananas", "Corn", "Cabbage", "Avocados"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "jpcBgYYFS8o_1", "video_path": "jpcBgYYFS8o.mp4", "subtitle_path": "jpcBgYYFS8o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.06, "view_count": 2111887}, {"video_id": "JQVmkDUkZT4", "question": "On a blue background, there is a round white container. 
To the left, there is a syringe containing a red liquid. When the subtitle 'If you extract cells from your body, and put them in the right environment, they will continue to stay alive for a while' appears, what happens first?", "question_wo_referring_query": "What happens first?", "candidates": ["An animation showing six pink round cells.", "The right wing of a black cartoon crow is cut open, exposing its organs.", "Two birds are placed into a glass container for an experiment.", "A blue and yellow bird stands on a conveyor belt and is replaced by another organ.", "Different species of birds stand on a stone pillar in the sky."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "JQVmkDUkZT4_0", "video_path": "JQVmkDUkZT4.mp4", "subtitle_path": "JQVmkDUkZT4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 387.4, "view_count": 15236126}, {"video_id": "JQVmkDUkZT4", "question": "On the pink sky, there are five stone pillars standing. The middle stone pillar has the word 'YOU' written on it. Different birds are standing on each stone pillar. Below, there are some white clouds floating. After the subtitle 'Let's make this more complicated!' 
appears, what happens first?", "question_wo_referring_query": "What happens first?", "candidates": ["The stone pillars collapse", "A blue and yellow bird stands in the middle of the conveyor belt in the transmitter, being replaced by the internal structure", "On the right side of the screen, the black frame's crow has its wings cut open, revealing its internal structure", "An animation demonstration of spherical particles gradually disappearing appears on screen", "An animation demonstration featuring six pink spherical particles appears"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "JQVmkDUkZT4_1", "video_path": "JQVmkDUkZT4.mp4", "subtitle_path": "JQVmkDUkZT4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 387.4, "view_count": 15236126}, {"video_id": "iOJ8BjbiPdE", "question": "In the dense forest and vegetation of the Amazon rainforest, three thin white waterfalls flow down. When the subtitles \"organized with a clear hierarchical\" appear, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["High mountains surrounded by white clouds", "Forests shrouded in mist", "Rivers", "Round, needle-shaped buildings scattered around villages and farmland", "Prairie"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "iOJ8BjbiPdE_0", "video_path": "iOJ8BjbiPdE.mp4", "subtitle_path": "iOJ8BjbiPdE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.66, "view_count": 3889}, {"video_id": "iOJ8BjbiPdE", "question": "The background shows an English article on a white web page, with a list of light blue English links on the right side of the page. 
After the subtitle 'civilization began with lar technology a' appears, what appears first?", "question_wo_referring_query": "What appears first after the subtitle 'civilization began with lar technology a'?", "candidates": ["Two green maps in the middle of the webpage screen", "Grassland", "River", "Forest", "Waterfalls in the tropical rainforest"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "iOJ8BjbiPdE_1", "video_path": "iOJ8BjbiPdE.mp4", "subtitle_path": "iOJ8BjbiPdE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.66, "view_count": 3889}, {"video_id": "vkcwNAxODro", "question": "Which sequence of scenes in the video is correct?", "question_wo_referring_query": "Which sequence of scenes in the video is correct?", "candidates": ["A bright tree and a house at night outside the window; a man wearing a black hooded coat and headphones sitting in a car speaking into the camera; a beautiful starry sky above a lake surrounded by mountains; a man wearing a white short-sleeve shirt sitting in front of a bookshelf with a globe, speaking into the camera.", "A man wearing a white short-sleeve shirt sitting in front of a bookshelf with a globe, speaking into the camera; a bright tree and a house at night outside the window; a man wearing a black hooded coat and headphones sitting in a car speaking into the camera; a beautiful starry sky above a lake surrounded by mountains.", "A man wearing a white short-sleeve shirt sitting in front of a bookshelf with a globe, speaking into the camera; a man wearing a black hooded coat and headphones sitting in a car speaking into the camera; a beautiful starry sky above a lake surrounded by mountains; a bright tree and a house at night outside the window.", "A man wearing a white short-sleeve shirt sitting in front of a bookshelf with a globe, speaking into the camera; a beautiful starry sky above a lake surrounded by mountains; a bright tree 
and a house at night outside the window; a man wearing a black hooded coat and headphones sitting in a car speaking into the camera.", "A bright tree and a house at night outside the window; a man wearing a white short-sleeve shirt sitting in front of a bookshelf with a globe, speaking into the camera; a man wearing a black hooded coat and headphones sitting in a car speaking into the camera; a beautiful starry sky above a lake surrounded by mountains."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "vkcwNAxODro_0", "video_path": "vkcwNAxODro.mp4", "subtitle_path": "vkcwNAxODro_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.0, "view_count": 156714}, {"video_id": "PDbZI5NQ7oQ", "question": "The background shows a brightly lit high-rise building. A female host, with long straight hair, is seated in the newsroom wearing a dark purple suit with a necklace. In which of the following scenes does she also appear?", "question_wo_referring_query": "In which of the following scenes does she also appear?", "candidates": ["Connecting with an external line, where a male reporter appears on the right in a news segment", "In the audience seats", "In a sports hall on site", "On the racetrack", "On the judging panel"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "PDbZI5NQ7oQ_0", "video_path": "PDbZI5NQ7oQ.mp4", "subtitle_path": "PDbZI5NQ7oQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.48, "view_count": 29433}, {"video_id": "PDbZI5NQ7oQ", "question": "The background behind him is a red and white audience seating area. 
The man wearing a grey shirt and holding a microphone with a red logo, standing inside the gymnasium and speaking towards the camera, also appears in which of the following scenes?", "question_wo_referring_query": "Also appears in which of the following scenes?", "candidates": ["Appears on the screen to the left showing the news with red audience seats", "Appears in the audience seating area", "Appears beside the female host in the news studio", "Outside the gymnasium", "In the referee seat"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "PDbZI5NQ7oQ_1", "video_path": "PDbZI5NQ7oQ.mp4", "subtitle_path": "PDbZI5NQ7oQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.48, "view_count": 29433}, {"video_id": "aCWlZHXXtlk", "question": "The background is a white wall with potted plants on the table. A man wearing a gray suit, a blue and white striped tie, and black-framed glasses with sparse hair is present. Which subtitles appeared at the same time?", "question_wo_referring_query": "Which subtitles appeared at the same time?", "candidates": ["\u201clike Perth Brisbane Adelaide all still\u201d", "\u201cwhat they were through the middle of\u201d", "\u201clittle bit lower through the month as\u201d", "\u201cJanuary saw housing values Rise by\u201d", "\u201cdiversification we're seeing markers\u201d"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "aCWlZHXXtlk_0", "video_path": "aCWlZHXXtlk.mp4", "subtitle_path": "aCWlZHXXtlk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 330.92, "view_count": 15713}, {"video_id": "aCWlZHXXtlk", "question": "The background is a house with a red roof on a green floor, with a man wearing a black suit jacket, an olive floral tie, and black and olive hair. 
Which subtitles appeared simultaneously with him?", "question_wo_referring_query": "Which subtitles appeared simultaneously with him?", "candidates": ["\u201csee housing sentiment lifting providing\u201d", "\u201cexpect well that's looking more likely\u201d", "\u201cthat interest rates are highly\u201d", "\u201csentiment it's fair to say that as rates\u201d", "\u201cthere could be as many as two rate Cuts\u201d"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "aCWlZHXXtlk_1", "video_path": "aCWlZHXXtlk.mp4", "subtitle_path": "aCWlZHXXtlk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 330.92, "view_count": 15713}, {"video_id": "vakU1QhHs-w", "question": "Within the circular frame made of wood on the right side of the gray wall, there is an elderly person who is sitting on the left and a woman bending towards the right. There is a naked baby next to the elderly person\u2019s feet, and a group of people is flying in the air. When the subtitle 'fanciful I like the flatness I like the' appears, what change occurs in the painting?", "question_wo_referring_query": "What change occurs in the painting?", "candidates": ["The painting changes from a distant view to a close-up view.", "The position of the painting changes to someone else's home.", "The painting changes from a close-up view to a distant view.", "The painting changes from being intact to being damaged.", "The position of the painting changes to another wall."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "vakU1QhHs-w_0", "video_path": "vakU1QhHs-w.mp4", "subtitle_path": "vakU1QhHs-w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 205.51, "view_count": 6266}, {"video_id": "vakU1QhHs-w", "question": "There are two murals on the gray wall. On the right, there is a circular wooden frame painting of a sitting woman with a naked baby on her lap. 
On the left, there is a square golden frame painting depicting a group of people walking in front of white buildings under a blue sky. When the subtitle 'they opened up the drawer they would' appears, what changes occur to the mural on the left?", "question_wo_referring_query": "What changes occur to the mural on the left?", "candidates": ["The mural's position changes to the right side.", "The mural's position changes to another wall.", "The mural becomes damaged.", "It changes from a distant view to a close-up.", "It changes from a close-up to a distant view."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "vakU1QhHs-w_1", "video_path": "vakU1QhHs-w.mp4", "subtitle_path": "vakU1QhHs-w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 205.51, "view_count": 6266}, {"video_id": "KWAsz59F8gA", "question": "The background is a wooden wall with six rectangular windows. A man wearing a deep khaki suit and black-rimmed glasses is sitting in front of the desk. To the right is a light bulb shining inside a glass cover. On the left of the screen, there is a translucent blue-green sticker with a green dragon on a pink background. What is this man doing at this moment?", "question_wo_referring_query": "What is this man doing at this moment?", "candidates": ["Holding the desk with both hands", "Hanging both hands under the desk", "Sitting upright at the desk and speaking", "Resting his shoulders flat on the desk", "Supporting his glasses with one hand"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "KWAsz59F8gA_0", "video_path": "KWAsz59F8gA.mp4", "subtitle_path": "KWAsz59F8gA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 597.35, "view_count": 1935514}, {"video_id": "KWAsz59F8gA", "question": "The background is a wooden wall with a rectangular window. 
A man wearing a dark khaki suit and black-framed glasses is sitting in front of a desk. There is a computer on the left side of the desk. On the left side of the screen is a transparent blue sticker with a green-bodied cartoon character. What is the man in the screen doing at this moment?", "question_wo_referring_query": "What is the man in the screen doing at this moment?", "candidates": ["Talking to the camera", "Lying on the desk", "Making a video call", "Talking with a friend", "Using the computer for work"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "KWAsz59F8gA_1", "video_path": "KWAsz59F8gA.mp4", "subtitle_path": "KWAsz59F8gA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 597.35, "view_count": 1935514}, {"video_id": "mQpzWCC9rQ0", "question": "A bartender wearing a black apron over a black shirt with a buzz cut stands behind the bar. Around the round table in front of him, a curly-haired woman in a brown dress and a man in a white suit are sitting on the left. On the right, there's a straight-haired woman in a black coat and white inner garment. What object is not present in the scene at this moment?", "question_wo_referring_query": "At this moment, what object is not present in the scene?", "candidates": ["Red wine glass", "Chair", "White hat", "Red wine", "Computer"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "mQpzWCC9rQ0_0", "video_path": "mQpzWCC9rQ0.mp4", "subtitle_path": "mQpzWCC9rQ0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 556.58, "view_count": 249855}, {"video_id": "mQpzWCC9rQ0", "question": "The background is a grey wall and blue sky. A man, wearing a grey hat and covering half of his face with a large newspaper, is sitting behind a round table on the grass. There is an empty chair on each side of him. 
What object is present in this scene?", "question_wo_referring_query": "What object is present in this scene?", "candidates": ["Coffee cup", "Flying bird", "Glasses", "Necktie", "A handgun"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "mQpzWCC9rQ0_1", "video_path": "mQpzWCC9rQ0.mp4", "subtitle_path": "mQpzWCC9rQ0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 556.58, "view_count": 249855}, {"video_id": "0XYwJYIZizo", "question": "On a stage illuminated by red lights, there are six girls wearing black shoes dancing. What kind of bottom wear is the girl with long red hair wearing?", "question_wo_referring_query": "What kind of bottom wear is she wearing?", "candidates": ["Denim skirt", "Denim long skirt", "Black long pants", "Denim long pants", "Denim shorts"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "0XYwJYIZizo_0", "video_path": "0XYwJYIZizo.mp4", "subtitle_path": "0XYwJYIZizo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.3, "view_count": 61817}, {"video_id": "0XYwJYIZizo", "question": "On the stage with red lights, there are six girls wearing black leather shoes dancing. 
What is the hairstyle of the girl in the middle wearing a white tight-fitting off-shoulder top and denim capris?", "question_wo_referring_query": "What is the hairstyle?", "candidates": ["Long straight black hair with bangs", "Short black hair", "Long curly black hair with bangs", "Curly red hair", "Short curly black hair"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "0XYwJYIZizo_1", "video_path": "0XYwJYIZizo.mp4", "subtitle_path": "0XYwJYIZizo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.3, "view_count": 61817}, {"video_id": "8MQ59Ij5fkM", "question": "The background is a light-colored map, with two people on the left wearing green military jackets and green helmets. On the right side, there is a person in an olive long coat with a hat that has a red five-pointed star pattern and a light green helmet. When the subtitle 'have a military presence in Europe which' appears, what kind of guns are these four people holding?", "question_wo_referring_query": "What kind of guns are these four people holding?", "candidates": ["shotgun", "handgun", "machine gun", "hunting rifle", "rifle"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "8MQ59Ij5fkM_0", "video_path": "8MQ59Ij5fkM.mp4", "subtitle_path": "8MQ59Ij5fkM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 204.92000000000002, "view_count": 1523333}, {"video_id": "8MQ59Ij5fkM", "question": "The background is a blue sky and green mountains with a row of trees. On the grass, a group of soldiers in gray-blue uniforms with fierce expressions are pointing guns to the right. On the right side of the screen, three soldiers are lying down. 
When the subtitle 'Bloc China too was close to a communist' appears, what kind of gun does the black-haired woman with pigtails in the middle of the screen hold?", "question_wo_referring_query": "What kind of gun is she holding in her hand?", "candidates": ["shotgun", "machine gun", "handgun", "hunting rifle", "rifle"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "8MQ59Ij5fkM_1", "video_path": "8MQ59Ij5fkM.mp4", "subtitle_path": "8MQ59Ij5fkM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 204.92000000000002, "view_count": 1523333}, {"video_id": "rYgxbNV7UXA", "question": "There are two tofu pieces and a red chili in the upper left corner of the black countertop. After stir-frying scallions, mushrooms, and chili sauce in the black flat-bottomed pan on the blue electric stove, what was added to the pan using a glass container?", "question_wo_referring_query": "What was added to the pan using a glass container?", "candidates": ["coffee", "vegetable oil", "noodles", "tofu pieces", "beef"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "rYgxbNV7UXA_0", "video_path": "rYgxbNV7UXA.mp4", "subtitle_path": "rYgxbNV7UXA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.06, "view_count": 233035}, {"video_id": "rYgxbNV7UXA", "question": "In the top right corner of the black table, there is a pot containing soybeans. On the table, there is a wooden cutting board with patterns. 
In the middle of the screen, the right hand is holding soybeans, and what is the left hand using to peel the soybeans?", "question_wo_referring_query": "What is the left hand using to peel the soybeans?", "candidates": ["Fruit knife", "Peeler", "Spoon", "Fork", "Kitchen knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "rYgxbNV7UXA_1", "video_path": "rYgxbNV7UXA.mp4", "subtitle_path": "rYgxbNV7UXA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.06, "view_count": 233035}, {"video_id": "uC7PV_UTbDQ", "question": "On a wooden table, there is a glass container filled with pickled cucumbers and rosemary fragments. A hand wearing a black glove is holding a small glass bowl of olive oil. What is the hand doing?", "question_wo_referring_query": "What is being done?", "candidates": ["Adding cucumbers to the olive oil", "Pouring the olive oil into a pot", "Adding pickled cucumbers to the olive oil", "Pouring the olive oil into the glass container", "Adding rosemary to the olive oil"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "uC7PV_UTbDQ_0", "video_path": "uC7PV_UTbDQ.mp4", "subtitle_path": "uC7PV_UTbDQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 410.5, "view_count": 6226}, {"video_id": "uC7PV_UTbDQ", "question": "On the right side of the red wooden table, there are three dumplings. Some white flour is scattered in the middle of the table. 
What's the hand wearing black gloves doing with the dumpling?", "question_wo_referring_query": "What are they doing?", "candidates": ["Rolling the dough", "Flattening the dumpling into a pancake", "Making dumplings", "Cutting dumplings", "Kneading the dumpling"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "uC7PV_UTbDQ_1", "video_path": "uC7PV_UTbDQ.mp4", "subtitle_path": "uC7PV_UTbDQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 410.5, "view_count": 6226}, {"video_id": "Rwgy20LLtr0", "question": "An old man with gray hair and glasses, wearing a gray coat, is standing in front of a mural covered by glass, facing a mirror. When the caption 'paints the Madonna and Child' appears, what is the old man doing?", "question_wo_referring_query": "What is the old man doing?", "candidates": ["Introducing the mural", "Viewing the mural", "Reaching out to touch the mural", "Taking photos of the mural", "Chatting with a friend"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "Rwgy20LLtr0_0", "video_path": "Rwgy20LLtr0.mp4", "subtitle_path": "Rwgy20LLtr0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 199.8, "view_count": 4673}, {"video_id": "Rwgy20LLtr0", "question": "Inside a museum, there is a yellowed picture frame under a glass case. It contains an image of a woman dressed in a dark green robe, holding a child. 
What happens on the screen when the subtitle 'again but do CHEO knew how to compose I' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A group of people gathers around the painting to discuss it.", "The focus shifts to another painting.", "A group of people gathers around the painting to admire it.", "The screen turns white, and the painting gradually appears.", "A group of people gathers around the painting to take pictures."], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "Rwgy20LLtr0_1", "video_path": "Rwgy20LLtr0.mp4", "subtitle_path": "Rwgy20LLtr0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 199.8, "view_count": 4673}, {"video_id": "K6tcjjSo9_g", "question": "In the kitchen on a granite countertop, there is a wooden cutting board. When the host uses a knife to halve the chicken breast, what do they do first?", "question_wo_referring_query": "What\u2019s the first thing they do?", "candidates": ["Put the chicken breast into the pot", "Place an absorbent paper on top of the chicken breast", "Sprinkle seasoning on the chicken breast", "Pound the chicken breast with a mallet", "Cut the chicken breast into small pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "K6tcjjSo9_g_0", "video_path": "K6tcjjSo9_g.mp4", "subtitle_path": "K6tcjjSo9_g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 556.22, "view_count": 899115}, {"video_id": "K6tcjjSo9_g", "question": "On the kitchen counter, after putting the yellow Italian pasta into the boiling water in the pan, what does the person in the video do first?", "question_wo_referring_query": "What does the person in the video do first?", "candidates": ["Put the pasta into a black bowl", "Cover the pot and continue cooking the pasta", "Pour the seasoning over the pasta", "Place the pasta onto the 
chicken", "Prepare the sauce"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "K6tcjjSo9_g_1", "video_path": "K6tcjjSo9_g.mp4", "subtitle_path": "K6tcjjSo9_g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 556.22, "view_count": 899115}, {"video_id": "6s1ge2goE3M", "question": "The PPT in the video provides a detailed introduction to the knowledge related to blood. Which of the following screens appeared first?", "question_wo_referring_query": "Which of the following screens appeared first?", "candidates": ["A flat 2D formula structure table", "A layered structure diagram with 3D modules", "A person explaining their own self-introduction", "Two paragraphs of text explanation", "An animated demonstration"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "6s1ge2goE3M_0", "video_path": "6s1ge2goE3M.mp4", "subtitle_path": "6s1ge2goE3M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 367, "duration": 358.0, "view_count": 37}, {"video_id": "-zSz3Ck3X0w", "question": "In a grey room, in front of a black table, a man with short hair wearing a grey knit long-sleeve shirt, after saying 'certain tourism board now we'll get back', what did he do first?", "question_wo_referring_query": "What did he do first?", "candidates": ["Raised his arm and waved", "Right hand raised with right index finger up", "Both hands raised with palms open", "Kicked a ball with bare feet", "Left hand raised with left index finger up"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "-zSz3Ck3X0w_0", "video_path": "-zSz3Ck3X0w.mp4", "subtitle_path": "-zSz3Ck3X0w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 465.2, "view_count": 47723}, {"video_id": "-zSz3Ck3X0w", "question": "In a car, a man is wearing a gray hoodie and sitting 
next to him is a woman with long blond hair in a black down jacket. When the subtitle \"seen those videos make sure to check out\" appears, what does the woman in the black down jacket do first?", "question_wo_referring_query": "What does the woman in the black down jacket do first?", "candidates": ["Touches the head of the man in the gray hoodie", "Puts the phone she's holding into her bag", "Looks at the man in the gray hoodie", "Holds hands with the man in the gray hoodie", "Lifts her head and smiles at the camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "-zSz3Ck3X0w_1", "video_path": "-zSz3Ck3X0w.mp4", "subtitle_path": "-zSz3Ck3X0w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 465.2, "view_count": 47723}, {"video_id": "0lpJUdtjJ6A", "question": "Next to a red document being illuminated by light, there is an old-style rotary phone. In front of the phone is a tray with a cigarette emitting smoke. After the subtitle 'have drowned in their escape attempt' appears, what is the first object to appear?", "question_wo_referring_query": ", what is the first object to appear?", "candidates": ["a black and white electronic organ", "a yellow balloon", "an old-fashioned bucket", "a yellow magazine", "a red collar"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "0lpJUdtjJ6A_0", "video_path": "0lpJUdtjJ6A.mp4", "subtitle_path": "0lpJUdtjJ6A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 549.25, "view_count": 332424}, {"video_id": "riDh8pQz-WA", "question": "A woman with short black hair, wearing a purple top, is standing in front of a glass object against a blue background. She is next to a table with a black surface and gray legs. 
What does she change?", "question_wo_referring_query": "What does she change?", "candidates": ["She changes from facing the mirror head-on to facing the mirror from the side.", "She changes from wearing a purple top to wearing a black top.", "She changes from facing the mirror from the side to facing the mirror head-on.", "She changes from facing away from the mirror to facing the mirror.", "She changes from sitting to standing while talking."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "riDh8pQz-WA_0", "video_path": "riDh8pQz-WA.mp4", "subtitle_path": "riDh8pQz-WA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.68, "view_count": 43082}, {"video_id": "riDh8pQz-WA", "question": "In a room with many computers, on a desk that has a black tabletop and gray legs, there are two women sitting on either side. One of the women is wearing a black dress and has short blonde hair. When this blonde-haired woman appears in a small box on the screen, with another box showing another scene beside her, what changes occur to her?", "question_wo_referring_query": "What changes occur to her?", "candidates": ["From being visible from the chest up to being visible from the waist up.", "From standing while speaking to sitting while speaking.", "From sitting while speaking to standing while speaking.", "From holding a microphone to putting it down.", "From being visible from the waist up to being visible only from the chest up."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "riDh8pQz-WA_1", "video_path": "riDh8pQz-WA.mp4", "subtitle_path": "riDh8pQz-WA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.68, "view_count": 43082}, {"video_id": "5x_x8Ykw5IY", "question": "On a green surface, there is a yellow paper, a blue paper, a purple scissors, and a yellow pencil. 
When the subtitle \"Holding the accordion\" appears, what change happens to the yellow paper?", "question_wo_referring_query": "What change happens to the yellow paper?", "candidates": ["The large rectangular paper turns into folded origami crane", "The large rectangular paper turns into folded circular paper", "The large rectangular paper turns into folded paper rose", "The large rectangular paper turns into folded small rectangular paper", "The large rectangular paper turns into folded small triangular paper"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "5x_x8Ykw5IY_0", "video_path": "5x_x8Ykw5IY.mp4", "subtitle_path": "5x_x8Ykw5IY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 200.67000000000002, "view_count": 2478}, {"video_id": "5x_x8Ykw5IY", "question": "On a beige surface, there is a yellow piece of paper, along with some purple paper and a black stapler. When the subtitle \"crease the fold\" appears, what change occurs to the purple paper?", "question_wo_referring_query": "What change occurs to the purple paper?", "candidates": ["It changes from a square to a flower shape", "It changes from a square to a pentagon", "It changes from a square to a hexagon", "It changes from a square to a circle", "It changes from a square to a triangle"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "5x_x8Ykw5IY_1", "video_path": "5x_x8Ykw5IY.mp4", "subtitle_path": "5x_x8Ykw5IY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 200.67000000000002, "view_count": 2478}, {"video_id": "ZocnjqtzuG0", "question": "In front of a house made of stone, there are many small stones. The front of the house is covered with a patterned cloth. Three people are standing at the door watching two children, one dressed in yellow and one in blue. 
What is the child in blue doing?", "question_wo_referring_query": "What is the child in blue doing?", "candidates": ["Standing next to the child dressed in yellow", "Helping the child dressed in yellow adjust their pants", "Carrying the child dressed in yellow on their back", "Leading the child dressed in yellow forward", "Holding the child dressed in yellow"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "ZocnjqtzuG0_0", "video_path": "ZocnjqtzuG0.mp4", "subtitle_path": "ZocnjqtzuG0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 269.56, "view_count": 27420}, {"video_id": "ZocnjqtzuG0", "question": "On a road covered with stones and dirt, a child wearing a yellow top with short hair, and an adult wearing a white top, are walking under the sunlight. What is the child in the yellow top doing?", "question_wo_referring_query": "What is the child in the yellow top doing?", "candidates": ["Holding the hand of the adult wearing white clothing", "Running barefoot on the ground", "Riding a blue bicycle", "Rolling on the ground", "Riding a white bicycle"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "ZocnjqtzuG0_1", "video_path": "ZocnjqtzuG0.mp4", "subtitle_path": "ZocnjqtzuG0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 269.56, "view_count": 27420}, {"video_id": "IgiqlpDXtuU", "question": "In a dimly lit room, there is a wooden table in the middle. Next to the table, there are two green chairs and a bookshelf against the wall. On the table, there are many items. 
What items have appeared on the table in the middle of the room?", "question_wo_referring_query": "What items have appeared on the table in the middle of the room?", "candidates": ["Black telephone", "Yellow paper", "Purple folder", "Black folder", "Green pen"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "IgiqlpDXtuU_0", "video_path": "IgiqlpDXtuU.mp4", "subtitle_path": "IgiqlpDXtuU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 264.48, "view_count": 1891655}, {"video_id": "IgiqlpDXtuU", "question": "In a room with yellow walls, there is a desk with various books on it. In front of the desk, there is a man in a blue shirt talking. What items have appeared on the desk?", "question_wo_referring_query": "What items have appeared on the desk?", "candidates": ["A red Spider-Man figurine", "A white rabbit plush toy", "A black Spider-Man plush toy", "An olive-colored plush toy", "A blue rabbit plush toy"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "IgiqlpDXtuU_1", "video_path": "IgiqlpDXtuU.mp4", "subtitle_path": "IgiqlpDXtuU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 264.48, "view_count": 1891655}, {"video_id": "v_jNWXwxmCM", "question": "In a green room, there is a white shelf on the wall, which holds some white objects. There are 9 people in the room, 8 of whom are sitting and 1 is standing. 
When the subtitle \"told what they were volunteering for or\" appears, what objects are present in the room?", "question_wo_referring_query": "What objects are present in the room?", "candidates": ["A blue table", "A black table", "A white table", "An olive-colored table", "A green table"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "v_jNWXwxmCM_0", "video_path": "v_jNWXwxmCM.mp4", "subtitle_path": "v_jNWXwxmCM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 541.08, "view_count": 466459}, {"video_id": "v_jNWXwxmCM", "question": "On a green field, there is a large tent with many hospital beds inside. Many medical personnel are wearing gray scrub dresses with red crosses on them. When the subtitle \"of it all were volunteer Medics exposed\" appears, what items are present on the screen?", "question_wo_referring_query": "What items are present on the screen?", "candidates": ["Injured people wearing white tops", "White shoes", "Injured people wearing blue tops", "Injured people wearing red tops", "Blue curtains"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "v_jNWXwxmCM_1", "video_path": "v_jNWXwxmCM.mp4", "subtitle_path": "v_jNWXwxmCM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 541.08, "view_count": 466459}, {"video_id": "CgjCk9hB3Xs", "question": "In a warm and cozy room, a woman wearing a white top is sitting in front of a window. There's a wooden table in front of the window, and on the table, she is holding a small animal. 
When the subtitle 'Anyway, I hope you have a wonderful day and you enjoy the music' appears, what is the woman's hairstyle?", "question_wo_referring_query": "What is the woman's hairstyle?", "candidates": ["High ponytail", "Double pigtails", "Single pigtail", "Shoulder-length hair", "Bun"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "CgjCk9hB3Xs_0", "video_path": "CgjCk9hB3Xs.mp4", "subtitle_path": "CgjCk9hB3Xs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.48, "view_count": 259227}, {"video_id": "CgjCk9hB3Xs", "question": "On a wooden table, there is a vase with sunflowers in it. A woman wearing a white top and braided pigtails is holding a small animal. When the subtitle \"She had also been heavily pregnant and so she had left all these little eggs on the\" appears, what is the color of the small animal the woman is holding?", "question_wo_referring_query": "What is the color of the small animal the woman is holding?", "candidates": ["Black and White", "White", "Olive", "Gray", "Black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "CgjCk9hB3Xs_1", "video_path": "CgjCk9hB3Xs.mp4", "subtitle_path": "CgjCk9hB3Xs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.48, "view_count": 259227}, {"video_id": "6p5C789Zq7E", "question": "Against a blue background with black text on it, in front of the background, there is a white-blue horizontal strip. A person mentioned that in the case, this approach benefits affluent white students. 
Which person mentioned that this approach benefits affluent white students?", "question_wo_referring_query": "Which person mentioned that this approach benefits affluent white students?", "candidates": ["A man with short hair wearing a black suit", "A man with short hair wearing a blue suit", "A man with short hair wearing a white suit", "A woman with short hair wearing an olive green suit", "A woman with short hair wearing a white suit"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "6p5C789Zq7E_0", "video_path": "6p5C789Zq7E.mp4", "subtitle_path": "6p5C789Zq7E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.46, "view_count": 18178}, {"video_id": "6p5C789Zq7E", "question": "The scene is in a white room, sunlight is streaming in from outside, there is a blue and white horizontal strip, and there is no text in the blue strip area. In the video, many people are expressing their opinions, which person uses a reference to Jesus?", "question_wo_referring_query": "Which person uses a reference to Jesus?", "candidates": ["A man wearing earphones and a blue suit", "A woman with short hair wearing an olive suit", "A woman with long golden hair wearing a white skirt", "A man with short black hair wearing a black suit", "A man with a shaved head wearing a black suit"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "6p5C789Zq7E_1", "video_path": "6p5C789Zq7E.mp4", "subtitle_path": "6p5C789Zq7E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.46, "view_count": 18178}, {"video_id": "Hb513K4CH3E", "question": "On a red wooden table, a pair of hands wearing black gloves is holding a knife to cut an orange carrot. 
What did the hands wearing black gloves do before cutting the carrot?", "question_wo_referring_query": "What did the hands wearing black gloves do before cutting the carrot?", "candidates": ["Placed the carrot in a bowl of water", "Removed the leaves from the carrot", "Peeled the carrot", "Washed the carrot", "Washed green beans"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "Hb513K4CH3E_0", "video_path": "Hb513K4CH3E.mp4", "subtitle_path": "Hb513K4CH3E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 346.2, "view_count": 166500}, {"video_id": "Hb513K4CH3E", "question": "In a black pot, there are yellow soybeans, pink sausages, orange carrots, and green peas. Someone is adding seasonings to the pot. After adding salt, what did the person adding the seasonings do next?", "question_wo_referring_query": ", what did the person adding the seasonings do next?", "candidates": ["Added chicken essence to the pot", "Added soy sauce to the pot", "Added black pepper to the pot", "Added chili powder to the pot", "Added cornstarch slurry to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "Hb513K4CH3E_1", "video_path": "Hb513K4CH3E.mp4", "subtitle_path": "Hb513K4CH3E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 346.2, "view_count": 166500}, {"video_id": "UZ5hn-VOhN8", "question": "On a sunny day, different types of cars are parked on a flat open ground. Beside the cars, there are also some red houses. 
What kind of plant appears first in the video?", "question_wo_referring_query": "What kind of plant appears first in the video?", "candidates": ["Yellow Water Lily", "Red Rose", "White Lily", "Yellow Osmanthus", "Green Bamboo"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "UZ5hn-VOhN8_0", "video_path": "UZ5hn-VOhN8.mp4", "subtitle_path": "UZ5hn-VOhN8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 497.2, "view_count": 23064}, {"video_id": "UZ5hn-VOhN8", "question": "In a room with many items, there is a white shelf against the wall. There are two people in the room, one with white hair and one with black hair. Which type of curtain appears first in the video?", "question_wo_referring_query": "Which type of curtain appears first in the video?", "candidates": ["Green curtain", "Grey and green striped curtain", "Purple curtain", "Blue curtain", "White curtain"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "UZ5hn-VOhN8_1", "video_path": "UZ5hn-VOhN8.mp4", "subtitle_path": "UZ5hn-VOhN8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 497.2, "view_count": 23064}, {"video_id": "Kw63Vn8hqeM", "question": "On a white background with gray dots, the text 'Ranking Substrates for SN1' is written, with some blue text beside it. Below the blue text, there's a diagram composed of black lines. 
After the subtitle 'carbocation rearrangement so for each' appears, what happens first on the screen?", "question_wo_referring_query": "What happens first on the screen?", "candidates": ["Another row of black lines appears below the existing diagram", "Some red arrows appear above the diagram made of black lines", "Some red lines appear below the diagram made of black lines", "Some blue arrows appear above the diagram made of black lines", "The diagram made of black lines is circled by red lines"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "Kw63Vn8hqeM_0", "video_path": "Kw63Vn8hqeM.mp4", "subtitle_path": "Kw63Vn8hqeM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.7, "view_count": 5701}, {"video_id": "Kw63Vn8hqeM", "question": "Against a white background with gray dots, text appears at the top while two parallel rows of designs made up of black lines, each row containing 6 designs, are displayed at the bottom. 
After the subtitle 'same concept here this is tertiary plus' appears, what is the first thing that happens on the screen?", "question_wo_referring_query": "What is the first thing that happens on the screen?", "candidates": ["On the second row, the second design from the right, made of black lines, appears some red lines underneath.", "On the second row, the first design from the right, made of black lines, has '10' written underneath in blue.", "On the second row, the second design from the right, made of black lines, has '123' written on it in red.", "On the second row, the first design from the right, made of black lines, appears some red dots.", "On the second row, the first design from the right, made of black lines, appears a red line underneath."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "Kw63Vn8hqeM_1", "video_path": "Kw63Vn8hqeM.mp4", "subtitle_path": "Kw63Vn8hqeM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.7, "view_count": 5701}, {"video_id": "uT7K3Wh6Hgc", "question": "On a green grass field, there is a chubby little hand, next to it is a green plant, and the chubby little hand is holding a green object. After the subtitle \"and the datura representing the\" appears, what is the first object that appears on the screen?", "question_wo_referring_query": ", what is the first object that appears on the screen?", "candidates": ["a white dress", "a white cloth", "a black helmet", "a cream-colored conch", "a red apple"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "uT7K3Wh6Hgc_0", "video_path": "uT7K3Wh6Hgc.mp4", "subtitle_path": "uT7K3Wh6Hgc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.54, "view_count": 216598}, {"video_id": "uT7K3Wh6Hgc", "question": "The picture shows many people and there is a green plant behind them. 
In the picture, there is a child wearing a black helmet and another child lying on the green grass looking up. What is the first item to appear on the screen after the subtitles \"Mars asleep and why just why Buckle in\"?", "question_wo_referring_query": "What is the first item to appear on the screen?", "candidates": ["Red and white box of popcorn", "Rice white shell", "Red candied haws", "Blue bottle of Coke", "Red package of chili strips"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "uT7K3Wh6Hgc_1", "video_path": "uT7K3Wh6Hgc.mp4", "subtitle_path": "uT7K3Wh6Hgc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.54, "view_count": 216598}, {"video_id": "REKUP9FzB0E", "question": "On a road, there is a white utility pole, green trees, and distant high mountains. An elderly woman dressed in a red down jacket, wearing a floral scarf and a red hat, is walking. In which of the following scenes has she appeared?", "question_wo_referring_query": "In which of the following scenes has she appeared?", "candidates": ["A park with a child in a red outfit sliding down a slide", "A park with a turquoise colored lake", "A park with many flowering quince and hydrocharis plants", "A room with many wooden desks and a child in a red outfit reading books", "A very simple restaurant"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "REKUP9FzB0E_0", "video_path": "REKUP9FzB0E.mp4", "subtitle_path": "REKUP9FzB0E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.76, "view_count": 2271}, {"video_id": "REKUP9FzB0E", "question": "In front of a desk with many books and a black computer, a man wearing a black armor and a blue scarf, with a little girl in a blue feather dress standing next to him, in which of the following scenes does the man with the blue scarf appear?", "question_wo_referring_query": "In which of the 
following scenes does the man with the blue scarf appear?", "candidates": ["On a road with many flowerbeds around.", "In front of a lake with swans.", "On a path surrounded by dense trees.", "In a park with many wooden benches.", "In a dining hall with green chairs and many children eating."], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "REKUP9FzB0E_1", "video_path": "REKUP9FzB0E.mp4", "subtitle_path": "REKUP9FzB0E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.76, "view_count": 2271}, {"video_id": "QJ5fIVlh1Lk", "question": "In a jet-black background, there is only an illustration of a magnifying glass drawn with white lines. With which subtitles has this magnifying glass illustration appeared together before?", "question_wo_referring_query": "With which subtitles has this magnifying glass illustration appeared together before?", "candidates": ["what's going on here", "this is what embroidery looks like", "these are called mechanical waves", "you can start to see how it was made", "electron microscope"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "QJ5fIVlh1Lk_0", "video_path": "QJ5fIVlh1Lk.mp4", "subtitle_path": "QJ5fIVlh1Lk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 311.15, "view_count": 21749}, {"video_id": "QJ5fIVlh1Lk", "question": "In the distance, there is a mountain peak covered with white snow, the sea waves are rising and falling, and a woman is lying on a small boat reading a book. 
With which subtitles has this woman appeared together?", "question_wo_referring_query": "With which subtitles has this woman appeared together?", "candidates": ["from a source like the sun to every", "peak", "they rise and fall", "light also moves in waves", "But, if you need to see closer?"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "QJ5fIVlh1Lk_1", "video_path": "QJ5fIVlh1Lk.mp4", "subtitle_path": "QJ5fIVlh1Lk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 311.15, "view_count": 21749}, {"video_id": "QcDoslNiUt8", "question": "Seated near a city pond, there are two people outside the pond, one with yellow hair wearing blue clothes, and her eyes are one round and one semi-circular. The other person has black hair and wears white traditional attire. When the person with yellow hair and blue clothes and another person wearing blue clothes appear by the fire pile, what changes happened to her?", "question_wo_referring_query": "What changes happened to her?", "candidates": ["She got a knife in her hands", "She changed into a set of white clothes", "Both her eyes turned semi-circular", "She changed into a set of black clothes", "Both her eyes turned round"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "QcDoslNiUt8_0", "video_path": "QcDoslNiUt8.mp4", "subtitle_path": "QcDoslNiUt8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 233.65, "view_count": 16227}, {"video_id": "KvOfJchIqLs", "question": "In a room with gray walls and some models hanging, a woman wearing yellow clothing and glasses is sitting on a white chair with her hands spread open. 
When the subtitle mentions 'By principle, I don\u2019t use materials that are toxic', what change took place with this woman?", "question_wo_referring_query": "What change took place with this woman?", "candidates": ["She changed into a black piece of clothing", "She took off her glasses", "She used her hands to cover her face", "She opened her mouth and then closed it", "She stood up"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "KvOfJchIqLs_0", "video_path": "KvOfJchIqLs.mp4", "subtitle_path": "KvOfJchIqLs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 213.05, "view_count": 30053}, {"video_id": "KvOfJchIqLs", "question": "In a room with grey walls, what changes occur to a woman dressed in yellow, sitting cross-legged on a white chair, with her arms raised when the subtitle mentions 'So to counterbalance it, I had to do a lot of tricks'?", "question_wo_referring_query": ", what changes occur to this woman?", "candidates": ["She stops sitting cross-legged.", "She puts on a pair of gold-rimmed glasses.", "She raises her right arm higher.", "She changes into a white garment.", "The chair she is sitting on turns grey."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "KvOfJchIqLs_1", "video_path": "KvOfJchIqLs.mp4", "subtitle_path": "KvOfJchIqLs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 213.05, "view_count": 30053}, {"video_id": "YW9B9basTxA", "question": "In a brightly lit room, in a corner there is an old display placed, and at the center of the screen stands a woman with white and purple hair, wearing glasses and dressed in a blue shirt. 
What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is putting on glasses", "She is waving her hand", "She is putting on a coat", "She is wiping the table", "She is writing"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "YW9B9basTxA_0", "video_path": "YW9B9basTxA.mp4", "subtitle_path": "YW9B9basTxA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 237.28, "view_count": 10327}, {"video_id": "YW9B9basTxA", "question": "Inside a room with an old monitor in the corner, a woman with white and purple hair stands in front of a large glass. Her eyes are tightly closed, and her right hand is pressing on the desk. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She covers her face with her hand", "She removes her glasses", "She lowers her raised left hand", "She sits on a chair", "She lowers her raised left hand"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "YW9B9basTxA_1", "video_path": "YW9B9basTxA.mp4", "subtitle_path": "YW9B9basTxA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 237.28, "view_count": 10327}, {"video_id": "8MgExCkbaeo", "question": "In the central part of the white background, there are two pictures of the same woman, one clear and one blurry. A red dot is positioned over the eye in the picture on the right. A man in a suit and glasses is giving an explanation. 
What is the object present in this scene?", "question_wo_referring_query": "What is the object present in this scene?", "candidates": ["a black top hat", "duck tongue hat", "red suit", "gray suit", "a string of brick necklace"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "8MgExCkbaeo_0", "video_path": "8MgExCkbaeo.mp4", "subtitle_path": "8MgExCkbaeo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 449.16, "view_count": 26}, {"video_id": "8MgExCkbaeo", "question": "In the central part of the white background, there are three pictures of the same woman. A red dot is located on the woman's brow in the left picture. A man in a suit and wearing glasses is explaining something. What object is not present in this scene?", "question_wo_referring_query": ", what object is not present in this scene?", "candidates": ["red underwear", "glasses", "gray suit", "hat", "black suit"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "8MgExCkbaeo_1", "video_path": "8MgExCkbaeo.mp4", "subtitle_path": "8MgExCkbaeo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 449.16, "view_count": 26}, {"video_id": "xLTJiKfFDDc", "question": "In front of a deep blue background, a man wearing glasses and a shirt is explaining something. 
The text 'memory cells' is displayed at the bottom of the screen, and when the caption mentions 'And while they\u2019re fighting what ails you, B and T cells also create memory cells,' what items are present in the scene?", "question_wo_referring_query": "What items are present in the scene?", "candidates": ["gold-framed glasses", "black-framed glasses", "blue short-sleeve shirt", "silver-framed glasses", "grey long-sleeve shirt"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "xLTJiKfFDDc_0", "video_path": "xLTJiKfFDDc.mp4", "subtitle_path": "xLTJiKfFDDc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 323.45, "view_count": 101380}, {"video_id": "xLTJiKfFDDc", "question": "There is an infant in the video with something placed on their face. They are looking at the direction of the camera lens when the subtitle mentions 'And that protects people who can\u2018t be vaccinated and other vulnerable individuals, like babies.' What object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["a cup", "a black coat", "a water bottle", "a white knitted sweater", "a respiratory device"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "xLTJiKfFDDc_1", "video_path": "xLTJiKfFDDc.mp4", "subtitle_path": "xLTJiKfFDDc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 323.45, "view_count": 101380}, {"video_id": "mZ9ZtEjj9b0", "question": "On a wooden table, there is a cup filled with white granules. A hand uses a spoon to scoop out a portion. 
What does this spoon look like?", "question_wo_referring_query": "What does this spoon look like?", "candidates": ["A plastic spoon", "A black spoon", "A wooden spoon", "A green spoon", "An iron spoon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "mZ9ZtEjj9b0_0", "video_path": "mZ9ZtEjj9b0.mp4", "subtitle_path": "mZ9ZtEjj9b0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 464.83, "view_count": 2405}, {"video_id": "mZ9ZtEjj9b0", "question": "Some oil was poured into a black pot and spread evenly over the bottom with a brush. The phrase 'Grease the frying pan with oil' appears in the top left corner of the screen. What are the colors of the handle and bristles of this brush?", "question_wo_referring_query": "What are the colors of the handle and bristles of this brush?", "candidates": ["Silver handle and yellow bristles", "Silver handle and silver bristles", "Yellow handle and blue bristles", "Transparent handle and blue bristles", "Yellow handle and yellow bristles"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "mZ9ZtEjj9b0_1", "video_path": "mZ9ZtEjj9b0.mp4", "subtitle_path": "mZ9ZtEjj9b0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 464.83, "view_count": 2405}, {"video_id": "QhlfmVS9F9g", "question": "A man without a hat, dressed in black clothing, is sitting in a booth by the beach. There is another person behind him with their back facing him. 
When the subtitle 'yet thank you and we\u2018re now south on the' appears, what is the person behind him wearing?", "question_wo_referring_query": "What is the person behind him wearing?", "candidates": ["Gray hat and blue jacket", "Black hat and blue jacket", "Blue jacket and gray hat", "Gray hat and black jacket", "Black hat and green jacket"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "QhlfmVS9F9g_0", "video_path": "QhlfmVS9F9g.mp4", "subtitle_path": "QhlfmVS9F9g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 428.23, "view_count": 47137}, {"video_id": "QhlfmVS9F9g", "question": "There is a wooden pole standing on the beach. A man jumps onto the pole. When the subtitles mention 'Christian no Christian it\u2019s actually,' what is the man on the pole wearing?", "question_wo_referring_query": "What is the man on the pole wearing?", "candidates": ["A black vest and olive shorts", "A black vest and black shorts", "A black long-sleeve shirt and black shorts", "An olive vest and black shorts", "A black vest and black pants"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "QhlfmVS9F9g_1", "video_path": "QhlfmVS9F9g.mp4", "subtitle_path": "QhlfmVS9F9g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 428.23, "view_count": 47137}, {"video_id": "WkBVQI0dJZs", "question": "In the sunny outdoors, a few buildings with spires and red roofs are visible. In the open space in front of the buildings, a man and a woman dressed in formal attire are standing. 
What did the geese do when they appeared for the first time?", "question_wo_referring_query": "What did they do?", "candidates": ["The geese landed on the ground", "The geese landed on the spires", "The geese flew from the right side of the screen to the left", "The geese flew from the left side of the screen to the right", "The geese landed on the red roofs"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "WkBVQI0dJZs_0", "video_path": "WkBVQI0dJZs.mp4", "subtitle_path": "WkBVQI0dJZs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.04, "view_count": 946234}, {"video_id": "WkBVQI0dJZs", "question": "In front of a pitch-black background, there is a man with an alias wearing a blue suit and a hat with a connecting strap. What did this man do the first time he appeared?", "question_wo_referring_query": "What did he do?", "candidates": ["He raised his left hand and touched his forehead.", "He raised his left hand and touched his chin.", "He raised both hands and touched his chin.", "He raised his right hand and touched his forehead.", "He raised his right hand and touched his chin."], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "WkBVQI0dJZs_1", "video_path": "WkBVQI0dJZs.mp4", "subtitle_path": "WkBVQI0dJZs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.04, "view_count": 946234}, {"video_id": "1xQTlp0hscs", "question": "A man wearing glasses and dressed in black is painting on a white canvas with a brush. 
When the subtitle mentions 'across the canvas with a brush with this paint,' what is the man doing?", "question_wo_referring_query": "What is the man doing?", "candidates": ["He is painting from top to bottom on the canvas with a brush", "He painted a circle on the canvas with a brush", "He is painting from right to left on the canvas with a brush", "He is painting from bottom to top on the canvas with a brush", "He is painting from left to right on the canvas with a brush"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "1xQTlp0hscs_0", "video_path": "1xQTlp0hscs.mp4", "subtitle_path": "1xQTlp0hscs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 234.6, "view_count": 502442}, {"video_id": "1xQTlp0hscs", "question": "In an indoor space, a painting hangs on a white wall. A woman wearing blue clothes passes in front of a mirror, with a man with black hair behind her. When the subtitle \"Kline, at that time, was drawing these chairs, if you will\" appears, what is the man doing?", "question_wo_referring_query": "What is the man doing?", "candidates": ["He is taking the painting off the wall", "His left hand is in his pocket, slightly bending over to admire the painting on the wall", "He is looking at the mirror sideways", "He is pointing at the painting on the wall", "He is tripping the woman in blue clothes"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "1xQTlp0hscs_1", "video_path": "1xQTlp0hscs.mp4", "subtitle_path": "1xQTlp0hscs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 234.6, "view_count": 502442}, {"video_id": "IVsdmxGNoQM", "question": "On a blue background, there is only a red circle. Inside the circle are the upper bodies of two characters communicating in speech bubbles. 
What image appears right after this scene?", "question_wo_referring_query": "What image appears right after this scene?", "candidates": ["A red circle containing an image of a pen and a sword", "A red circle containing an image of crossed knives", "A red circle containing an image of a wheelboat", "A red circle containing an image of a bun", "A red circle containing an image of a person holding a sword"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "IVsdmxGNoQM_0", "video_path": "IVsdmxGNoQM.mp4", "subtitle_path": "IVsdmxGNoQM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 331.03, "view_count": 104191}, {"video_id": "IVsdmxGNoQM", "question": "On a blue background, there is only the white text 'Baguecket' in the middle. What appears immediately after this screen?", "question_wo_referring_query": "What appears immediately after this screen?", "candidates": ["A red circle appears with someone wearing a hat inside the tire.", "An image of a cannon appears.", "A plain blue background appears.", "A tank image appears.", "The text 'SUBSCRIBE HIER!' appears."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "IVsdmxGNoQM_1", "video_path": "IVsdmxGNoQM.mp4", "subtitle_path": "IVsdmxGNoQM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 331.03, "view_count": 104191}, {"video_id": "wo-52LASUu0", "question": "A webpage displays many pictures of the Earth. 
After this scene, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A man with short hair and a tattoo on his arm along with a man with an afro and glasses appearing simultaneously", "A man with an afro and glasses", "A man with short hair and no tattoos on his left arm", "A man with short hair and a tattoo on his right arm", "A man with short hair and a tattoo on his arm"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "wo-52LASUu0_0", "video_path": "wo-52LASUu0.mp4", "subtitle_path": "wo-52LASUu0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 477.93, "view_count": 216373}, {"video_id": "AqUw21GXN2I", "question": "In the bright and sunny outdoors, someone is filming a distant object with a phone. After the subtitle 'okay so take out the time-lapse feature. As I'm walking forward I want to keep the chair' appears, what does the person holding the phone do?", "question_wo_referring_query": "What does the person holding the phone do?", "candidates": ["He turns off the phone", "He switches the phone from vertical to horizontal position", "He throws the phone away", "He opens the map function on the phone", "He puts the phone in his bag"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "AqUw21GXN2I_0", "video_path": "AqUw21GXN2I.mp4", "subtitle_path": "AqUw21GXN2I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 282.48, "view_count": 226676}, {"video_id": "AqUw21GXN2I", "question": "In a spacious square, there is a gigantic brown chair. A man with blonde hair, wearing a black short-sleeve shirt, is sitting beside the chair. After the subtitle 'So yeah life in Switzerland is great. It's expensive here... 
Umm we got a cool little' appears, what does he do?", "question_wo_referring_query": "What action does he take?", "candidates": ["He picked up a backpack.", "He started playing the piano.", "He leaned over and pointed into the distance with his hand while moving the camera.", "He walked towards the gigantic chair.", "He lay down on the ground."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "AqUw21GXN2I_1", "video_path": "AqUw21GXN2I.mp4", "subtitle_path": "AqUw21GXN2I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 282.48, "view_count": 226676}, {"video_id": "jjuD288JlCs", "question": "Next to the railway, there's a row of buildings, and on the elevated road, a black bodywork train spewing white smoke is running. After the subtitles mention 'But the bit between Edgware and Mill Hill closed in 1964 under the orders of the infamous Doctor Beeching', who is the character that appears?", "question_wo_referring_query": "Who is the character that appears right after?", "candidates": ["A black and white photo of a man wearing a suit and black shirt, with a tie, holding a book", "A color photo of a man wearing a suit and white shirt, with a tie, holding a book", "A black and white photo of a man wearing a suit and white shirt, with a tie, holding a book", "A black and white photo of a woman wearing a suit and white shirt, with a tie, holding a book", "A color photo of a man wearing a suit and black shirt, with a tie, holding a book"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "jjuD288JlCs_0", "video_path": "jjuD288JlCs.mp4", "subtitle_path": "jjuD288JlCs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 543.76, "view_count": 3926911}, {"video_id": "jjuD288JlCs", "question": "In front of a red brick building with white windows, there is a man wearing a black short-sleeved shirt and 
jeans. After the subtitle mentions 'and this which is actually my grandma's house. To rebuild the line, we'd have to knock it down,' who appears next?", "question_wo_referring_query": "Who appears next?", "candidates": ["A woman with red hair wearing glasses, sitting on a tan chair", "A man with black hair wearing glasses, sitting on a tan chair", "A woman with black hair wearing glasses, sitting on a tan chair", "A man with blonde hair wearing glasses, sitting on a tan chair", "A woman with blonde hair wearing glasses, sitting on a tan chair"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "jjuD288JlCs_1", "video_path": "jjuD288JlCs.mp4", "subtitle_path": "jjuD288JlCs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 543.76, "view_count": 3926911}, {"video_id": "1aXeD_6smxc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a book opened to a page showing a person sitting with a cane, then an image of a little girl with wings and a five-pointed star on her head, and finally, someone holding a bird's nest.", "First, there is an image of a little girl with wings and a five-pointed star on her head, then someone holding a bird's nest, and finally, a book opened to a page showing a person sitting with a cane.", "First, someone is holding a bird's nest, then a book opened to a page showing a person sitting with a cane, and finally, an image of a little girl with wings and a five-pointed star on her head.", "First, someone is holding a bird's nest, then an image of a little girl with wings and a five-pointed star on her head, and finally, a book opened to a page showing a person sitting with a cane.", "First, there is an image of a little girl with wings and a five-pointed star on her head, then a book opened to a page showing a person sitting with a 
cane, and finally, someone holding a bird's nest."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "1aXeD_6smxc_0", "video_path": "1aXeD_6smxc.mp4", "subtitle_path": "1aXeD_6smxc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.43, "view_count": 1219149}, {"video_id": "shAaKh7dR9g", "question": "Between two standing statues, there is a lying statue. A man with black hair and wearing a black short-sleeved shirt is hugging a black statue with both hands from the right side. Where has this lying black statue appeared before?", "question_wo_referring_query": "Where has this lying black statue appeared before?", "candidates": ["On the right hand side of a person wearing a black coat facing a mirror", "Next to a woman in a checkered shirt with a bun hairstyle", "Next to a white arm statue", "Beside a white female statue with a broken arm", "In front of a man wearing a black long-sleeved coat kneeling"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "shAaKh7dR9g_0", "video_path": "shAaKh7dR9g.mp4", "subtitle_path": "shAaKh7dR9g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.22, "view_count": 2503}, {"video_id": "isK5_O3cQ8Y", "question": "In the PPT with a white background, there is a yellow logo in the upper left corner. Next to the logo are the bold black English letters 'UCF'. The PPT has three lines of English text, and the second line is a blue URL. Where else has this URL appeared?", "question_wo_referring_query": "In the PPT with a white background, there is a yellow logo in the upper left corner. Next to the logo are the bold black English letters 'UCF'. The PPT has three lines of English text, and the second line is a blue URL. 
Where else has this URL appeared?", "candidates": ["In a picture of a horse grazing on the grassland", "On a webpage printed with an orange label 'Neural Network'", "In the PPT screen with two circles", "In the PPT with ten circular diagrams connected by arrows", "In a picture of a person with a backdrop of snow-capped mountains, green trees, and two horses on the grass"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "isK5_O3cQ8Y_0", "video_path": "isK5_O3cQ8Y.mp4", "subtitle_path": "isK5_O3cQ8Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 1416, "duration": 387.0, "view_count": 632}, {"video_id": "isK5_O3cQ8Y", "question": "In the PPT with a white background, there is a yellow icon in the upper left corner, and next to the icon are the bold black English letters 'UCF.' In the middle of the PPT, there are two circles. The left circle is surrounded by three black arrows, and the right circle has three red arrows. There is a red dot on the screen. 
In which of the following scenarios has this red dot also appeared?", "question_wo_referring_query": "In which of the following scenarios has this red dot also appeared?", "candidates": ["In a picture of a horse grazing on a grassland", "In a picture with a person's head, with a background of snowy mountains, green trees, and two horses on a grassy field", "In the PPT with ten circles connected by arrows, with the right circle imprinted with the black letters 'ak'", "On a webpage with an orange tag"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "isK5_O3cQ8Y_1", "video_path": "isK5_O3cQ8Y.mp4", "subtitle_path": "isK5_O3cQ8Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 1416, "duration": 387.0, "view_count": 632}, {"video_id": "98I6r8ix2uc", "question": "With the backdrop of orange sky and tall buildings, a news anchor dressed in a grey suit and light blue shirt is sitting in the newsroom speaking to the camera. 
When a man in a black suit with a white shirt appears in front of the camera with a background of two ink paintings, what change occurs in the image of the news anchor?", "question_wo_referring_query": "What change occurs in the image of the news anchor?", "candidates": ["The image shrinks and appears on the right side of the news screen", "The image shrinks and appears on the left side of the news screen", "The image disappears", "The image enlarges and appears on the left side of the news screen", "The image enlarges and appears on the right side of the news screen"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "98I6r8ix2uc_0", "video_path": "98I6r8ix2uc.mp4", "subtitle_path": "98I6r8ix2uc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 199.07, "view_count": 25766}, {"video_id": "98I6r8ix2uc", "question": "In the scene, a black man wearing a black suit jacket over a white inner shirt is sitting in front of the camera. The background features two ink painting murals. On the left side, with an orange sky and tall buildings, there is a host, dressed in a gray suit and light blue shirt, speaking in a news studio. 
When only the black man remains on the screen, what change happens to the video scene?", "question_wo_referring_query": "What change happens to the video scene?", "candidates": ["The black man's image shrinks to the bottom right corner", "The host's image disappears, and the black man's image enlarges", "The black man's image enlarges to the right side of the video", "The black man's image enlarges to the left side of the video", "The black man's image shrinks to the top of the video"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "98I6r8ix2uc_1", "video_path": "98I6r8ix2uc.mp4", "subtitle_path": "98I6r8ix2uc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 199.07, "view_count": 25766}, {"video_id": "W_jVtlgl1GQ", "question": "What changed when the woman with short, straight hair, wearing a black jacket and white inner top, sitting in front of the camera, said in the subtitles 'in my case playing Liz Danvers um it's'?", "question_wo_referring_query": "What changed?", "candidates": ["She changed into a white short sleeve shirt", "She put on sunglasses", "She put on a face mask", "She put on a black hat", "She changed into a blue jacket"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "W_jVtlgl1GQ_0", "video_path": "W_jVtlgl1GQ.mp4", "subtitle_path": "W_jVtlgl1GQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 217.88, "view_count": 3869}, {"video_id": "W_jVtlgl1GQ", "question": "What change occurs when the short-haired woman, dressed in a black coat and white inner layer, sitting in front of the brown cabinet, says in the subtitles 'helm of this show and I'm like not'?", "question_wo_referring_query": "What change occurs?", "candidates": ["Switches to a white coat", "A collar is attached to her neck", "Puts on a hat", "Puts on a mask", "Ties up her hair"], "topic_category": "NP-News-Programs", 
"question_category": "TAA", "level": "L2-Relation", "id": "W_jVtlgl1GQ_1", "video_path": "W_jVtlgl1GQ.mp4", "subtitle_path": "W_jVtlgl1GQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 217.88, "view_count": 3869}, {"video_id": "_GWjNO2X3qw", "question": "Inside a red rectangular frame, there are many thin, long iron rods with many white paper strips tied to them. Next to them stands a man with a black arm band and a beard. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Placing one hand on the iron rod", "Dismantling the iron rods", "Tying a red butterfly knot on the iron rods", "Tying paper strips to the iron rods with both hands", "Painting the iron rods"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "_GWjNO2X3qw_0", "video_path": "_GWjNO2X3qw.mp4", "subtitle_path": "_GWjNO2X3qw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 431.93, "view_count": 233653}, {"video_id": "_GWjNO2X3qw", "question": "When the video shows a shirtless man with black hair lying in a pool, what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Both hands soaking in the water", "Wearing swimming goggles", "Placing one hand on the white tiles at the back", "Holding a mirror with both hands", "Submerging his head in the water"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "_GWjNO2X3qw_1", "video_path": "_GWjNO2X3qw.mp4", "subtitle_path": "_GWjNO2X3qw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 431.93, "view_count": 233653}, {"video_id": "MrDhPqrs6DY", "question": "There is a picture of a motorcycle parked on the road against a white background. 
What other objects appear in this scene?", "question_wo_referring_query": "What other objects appear in this scene?", "candidates": ["A woman wearing glasses and a black coat", "A man wearing glasses and a black coat", "A man wearing a hat and an olive green coat", "A man wearing glasses and a white coat", "A man without glasses and a black coat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "MrDhPqrs6DY_0", "video_path": "MrDhPqrs6DY.mp4", "subtitle_path": "MrDhPqrs6DY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 378.12, "view_count": 39}, {"video_id": "MrDhPqrs6DY", "question": "On a white background, there are four pictures, one of which contains mountains and the sea. What other objects appear in this screen?", "question_wo_referring_query": "What other objects appear in this screen?", "candidates": ["A line chart", "A man in a red jacket", "A motorcycle", "A man in a white jacket", "A bar chart"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "MrDhPqrs6DY_1", "video_path": "MrDhPqrs6DY.mp4", "subtitle_path": "MrDhPqrs6DY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 378.12, "view_count": 39}, {"video_id": "HH-jCJKaPUs", "question": "In a room with white cabinets and a white door in the background, there is a woman with long black hair wearing a blue jacket sitting. 
While the subtitles say 'sure you like this video if you learned something and subscribe if you want to,' what object appears in the room?", "question_wo_referring_query": "What object appears in the room?", "candidates": ["A computer", "A rectangular vase", "A television", "A white table lamp", "A green plant"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "HH-jCJKaPUs_0", "video_path": "HH-jCJKaPUs.mp4", "subtitle_path": "HH-jCJKaPUs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.17, "view_count": 295547}, {"video_id": "HH-jCJKaPUs", "question": "On a white desk with different sized circular patterns, two hands with nail polish and wearing rings appear. There are a few pens on the left side of the desk. What object appears when the subtitle says 'from nanometers then to meters then to centimeters'?", "question_wo_referring_query": "On a white desk with different sized circular patterns, two hands with nail polish and wearing rings appear. There are a few pens on the left side of the desk. What object appears when the subtitle says 'from nanometers then to meters then to centimeters'?", "candidates": ["a book", "a piece of black paper", "a cup", "a bookbinding machine", "a piece of white paper"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "HH-jCJKaPUs_1", "video_path": "HH-jCJKaPUs.mp4", "subtitle_path": "HH-jCJKaPUs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.17, "view_count": 295547}, {"video_id": "djZXltelqV8", "question": "In the video, a little boy is lying on a railing and holding his face with one hand. 
What hairstyle does this little boy have?", "question_wo_referring_query": "What hairstyle does this little boy have?", "candidates": ["shoulder-length curly hair", "spiky hair", "crew cut", "wavy curls", "short cropped hair"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "djZXltelqV8_0", "video_path": "djZXltelqV8.mp4", "subtitle_path": "djZXltelqV8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.03, "view_count": 5649}, {"video_id": "djZXltelqV8", "question": "In the video, a girl wearing a white headscarf and a top with four round buttons is standing in front of a black iron column, looking ahead. Next to her is a woman wearing glasses. What hairstyle does the girl with the white headscarf have?", "question_wo_referring_query": "What hairstyle does the girl with the white headscarf have?", "candidates": ["Hairstyle without bangs", "Middle part hairstyle", "Explosive hairstyle", "Hairstyle with bangs", "Exposed forehead hairstyle"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "djZXltelqV8_1", "video_path": "djZXltelqV8.mp4", "subtitle_path": "djZXltelqV8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.03, "view_count": 5649}, {"video_id": "nNg6rfwwr-s", "question": "Under a misty sky, what shape of hat is worn by the woman playing with a black dog on the white snow next to a river when the subtitle says 'stay its message is clear'?", "question_wo_referring_query": "What shape of hat is worn?", "candidates": ["Staircase", "Rectangle", "Circle", "Triangle", "Square"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "nNg6rfwwr-s_0", "video_path": "nNg6rfwwr-s.mp4", "subtitle_path": "nNg6rfwwr-s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 337.07, "view_count": 264582}, {"video_id": 
"nNg6rfwwr-s", "question": "In the scene, someone is holding a lit matchstick to burn an object tied with a string. When the subtitles say 'are ever privy to the magic,' what color change does the object tied with the string undergo?", "question_wo_referring_query": "In the scene, someone is holding a lit matchstick to burn an object tied with a string. When the subtitles say 'are ever privy to the magic,' what color change does the object tied with the string undergo?", "candidates": ["starts to turn green", "burns into ashes", "does not emit smoke", "starts to turn black", "starts to turn white"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "nNg6rfwwr-s_1", "video_path": "nNg6rfwwr-s.mp4", "subtitle_path": "nNg6rfwwr-s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 337.07, "view_count": 264582}, {"video_id": "CU_HzIGE6lU", "question": "In front of a background with mountain ranges and blue sky with white clouds, on the far left of the grassland stands a small figure dressed in brown, holding a shield and spear. 
Which character in the screen is holding a round shield with an animal spreading its feathered wings?", "question_wo_referring_query": "Which character in the screen is holding a round shield with an animal spreading its feathered wings?", "candidates": ["The character wearing a brown helmet and armor, holding a spear", "The character wearing a green helmet", "The character not wearing a brown helmet", "The character not holding a spear", "The character wearing a gray helmet"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "CU_HzIGE6lU_0", "video_path": "CU_HzIGE6lU.mp4", "subtitle_path": "CU_HzIGE6lU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.97, "view_count": 6025}, {"video_id": "6RLNX8MTXns", "question": "What happened for the first time when a little boy wearing a red short-sleeve shirt, blue pants, and a white hat, walking at the front of a crowd, appeared on the dark yellow flat ground with grass on both sides?", "question_wo_referring_query": "What happened for the first time when he appeared?", "candidates": ["Playing in water", "Riding a bicycle", "Climbing rocks", "Running", "Waving at the camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "6RLNX8MTXns_0", "video_path": "6RLNX8MTXns.mp4", "subtitle_path": "6RLNX8MTXns_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 530.79, "view_count": 6120}, {"video_id": "6RLNX8MTXns", "question": "What happened the first time in front of the mirror in a room with white curtains on the left side and a staircase with a handrail on the right side?", "question_wo_referring_query": "What happened the first time in front of the mirror?", "candidates": ["A woman wearing a hat and glasses appeared in the room", "A man wearing a pink hat appeared in the room", "A woman wearing a white dress appeared in the room", "A man wearing a 
hat appeared in the room", "A man wearing an orange top appeared in the room"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "6RLNX8MTXns_1", "video_path": "6RLNX8MTXns.mp4", "subtitle_path": "6RLNX8MTXns_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 530.79, "view_count": 6120}, {"video_id": "tsyxO_VZI0E", "question": "When standing outdoors with houses and mountains in the background, what change occurs in the video when a woman in a green striped top and earrings says 'now to respond to this attack we don't'?", "question_wo_referring_query": "What change occurs?", "candidates": ["A white rectangle pops up in the bottom right corner", "A white rectangle pops up in the top right corner", "A white rectangle pops up in the bottom left corner", "A white rectangle pops up in the top left corner", "A black rectangle pops up in the bottom left corner"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "tsyxO_VZI0E_0", "video_path": "tsyxO_VZI0E.mp4", "subtitle_path": "tsyxO_VZI0E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 462.0, "view_count": 28555}, {"video_id": "tsyxO_VZI0E", "question": "What happened when a black-skinned host wearing a suit and a checkered tie was sitting in the studio and the subtitle said 'Iran has directly attacked Israel through'?", "question_wo_referring_query": "What happened?", "candidates": ["The screen switched to a flying airplane.", "The screen switched to a woman wearing a green striped dress.", "The screen switched to a group of people holding phones with flashlights on.", "The screen switched to a burning ruin.", "Many golden flashes appeared on the display screen behind."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "tsyxO_VZI0E_1", "video_path": "tsyxO_VZI0E.mp4", "subtitle_path": 
"tsyxO_VZI0E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 462.0, "view_count": 28555}, {"video_id": "EefdICRARQE", "question": "What happened after a man wearing a white hat, white clothes, and white pants with a black backpack took the elevator up?", "question_wo_referring_query": "What happened?", "candidates": ["Appearing in a changing room", "Appearing in a bus", "Appearing in a room", "Appearing in a hospital", "Walking on a crowded street"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "EefdICRARQE_0", "video_path": "EefdICRARQE.mp4", "subtitle_path": "EefdICRARQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 327.73, "view_count": 429106}, {"video_id": "EefdICRARQE", "question": "What happened after a man in a black shirt stood in front of a cluttered merchandise display and played with the chain around his neck while looking in the mirror?", "question_wo_referring_query": "What happened?", "candidates": ["Took off the chain around his neck and stood in front of the elevator", "Put on a gold hat", "Changed into a white shirt", "Took the elevator", "Changed clothes in the dressing room"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "EefdICRARQE_1", "video_path": "EefdICRARQE.mp4", "subtitle_path": "EefdICRARQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 327.73, "view_count": 429106}, {"video_id": "rFQQFq1vV3k", "question": "After a person wearing a hat appears, observed from a distance through a telescope, which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["A woman hugging a police officer in front of three other police officers.", "A woman holding onto a locked iron railing with both hands.", "The man holding a pen and writing 
on an orange desk with a black telephone, while a cigarette is dangling from his mouth.", "A man hugging a police officer in front of three other police officers.", "The man with a tie holding onto a locked iron railing with both hands."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "rFQQFq1vV3k_0", "video_path": "rFQQFq1vV3k.mp4", "subtitle_path": "rFQQFq1vV3k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 518.25, "view_count": 528289}, {"video_id": "WN1yCigL3Hk", "question": "A man with short hair and glasses is talking inside a room. His room has a gray background wall with pentagon-shaped decorations, a black and white picture frame, and a white bookshelf below with many objects, including books, a globe, and a rocket. He is wearing long-sleeved clothing. After he mentions 'Let's review a little eye-natomy real quick,' what happens?", "question_wo_referring_query": "what happens?", "candidates": ["A green and blue rectangle divided into two with a cross appears.", "A diagram of the side view of an eyeball appears.", "An image of a computer appears in the top left corner.", "A diagram of the front view of an eyeball appears.", "The globe behind him falls over."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "WN1yCigL3Hk_0", "video_path": "WN1yCigL3Hk.mp4", "subtitle_path": "WN1yCigL3Hk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 599.43, "view_count": 705049}, {"video_id": "WN1yCigL3Hk", "question": "A man with short hair and glasses is explaining in a room. His room has a gray background wall with a pentagonal decoration and a black-and-white picture frame on it. Below, there is a white bookshelf with many items on it, including books, a globe, and a rocket. He is wearing long-sleeved clothes. 
What happens after he mentions 'a screen or store away.'?", "question_wo_referring_query": "What happens after he mentions 'a screen or store away.'?", "candidates": ["An anatomical diagram of the side view of an eye pops up.", "A picture with red, blue, and green curves pops up on the left side of the man.", "The globe behind him falls over.", "A green and blue rectangular split with a cross image in the middle pops up.", "A green and blue rectangular split with a cross image in the middle pops up."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "WN1yCigL3Hk_1", "video_path": "WN1yCigL3Hk.mp4", "subtitle_path": "WN1yCigL3Hk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 599.43, "view_count": 705049}, {"video_id": "gu19AWWVSrg", "question": "The scene shows a dark green tank, with a puff of smoke rising from it. The surroundings consist of destroyed buildings. The tank is driving on the road. What is the first object that appears after the mention of 'Russia's invasion of Ukraine brought the'?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["white car", "withered trees", "destroyed houses", "blue houses", "green tank"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "gu19AWWVSrg_0", "video_path": "gu19AWWVSrg.mp4", "subtitle_path": "gu19AWWVSrg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 595.88, "view_count": 455807}, {"video_id": "gu19AWWVSrg", "question": "On a battlefield, a soldier is holding a weapon and bombarding a green tank in front. Surrounding him are ruined houses, and in the distance, there are some gray high-rise buildings. The soldier is wearing camouflage, a helmet, and the Ukrainian flag is on the helmet. The tank has black smoke and is engulfed in flames. 
When the phrase \"the war in Ukraine has cast a light on\" is mentioned, what is the first object that appears?", "question_wo_referring_query": ", what is the first object that appears?", "candidates": ["golden field", "pile of sandbags", "fighter jet", "broken car", "semi-automatic rifle"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "gu19AWWVSrg_1", "video_path": "gu19AWWVSrg.mp4", "subtitle_path": "gu19AWWVSrg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 595.88, "view_count": 455807}, {"video_id": "ertqHu0Zew8", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the segment with frying chicken breast was played, followed by the segment with cutting chicken breast, and finally the segment with frying chicken breast was played.", "First, the segment with cutting chicken breast was played, followed by the segment with frying chicken breast, and finally the segment with cutting scallions was played.", "First, the segment with frying chicken breast was played, followed by the segment with cutting scallions, and finally the segment with cutting chicken breast was played.", "First, the segment with cutting scallions was played, followed by the segment with frying chicken breast, and finally the segment with cutting chicken breast was played.", "First, the segment with cutting chicken breast was played, followed by the segment with cutting scallions, and finally the segment with frying chicken breast was played."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "ertqHu0Zew8_0", "video_path": "ertqHu0Zew8.mp4", "subtitle_path": "ertqHu0Zew8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.53, "view_count": 542377}, {"video_id": "ertqHu0Zew8", 
"question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the segment with chili powder was played, followed by the segment with tomato sauce, and finally the segment with noodles.", "First, the segment with tomato sauce was played, followed by the segment with noodles, and finally the segment with chili powder.", "First, the segment with noodles was played, followed by the segment with tomato sauce, and finally the segment with chili powder.", "First, the segment with noodles was played, followed by the segment with chili powder, and finally the segment with tomato sauce.", "First, the segment with chili powder was played, followed by the segment with noodles, and finally the segment with tomato sauce."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "ertqHu0Zew8_1", "video_path": "ertqHu0Zew8.mp4", "subtitle_path": "ertqHu0Zew8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.53, "view_count": 542377}, {"video_id": "z-7BKDfaZpg", "question": "In a room filled with many glass windows, on a large wooden table there are some black and white instruments. On the left side of the large desk, there is a painting with a white backboard stuck to it; it is a sketch of a house. To the right of the desk, there is a gray rectangular object standing with a gray hollow circular base. 
In what other scenes on the table has this white painting appeared?", "question_wo_referring_query": ", in what other scenes has this white painting appeared on the table?", "candidates": ["In a drawer full of tools", "In a black picture frame", "On a gray display stand", "In a brown picture frame", "By the window sill"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "z-7BKDfaZpg_0", "video_path": "z-7BKDfaZpg.mp4", "subtitle_path": "z-7BKDfaZpg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 359.93, "view_count": 1950206}, {"video_id": "z-7BKDfaZpg", "question": "In a room filled with many glass windows, there is a large wooden table with some black and white instruments on it. On the large table to the left, there is a painting with a white backing that depicts a sketch of a house. To the right of the table, there is a grey rectangular object standing upright, with a hollow circular base. In which other scene does this grey object with a hollow circular base appear?", "question_wo_referring_query": "In which other scene does this grey object with a hollow circular base appear?", "candidates": ["On the head of a woman with golden hair, wearing a brown short-sleeved knitted shirt", "On a cabinet with many desks and beside a wall with many shelves", "On a brown writing desk", "On a brown chair", "On a white desk piled with paper"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "z-7BKDfaZpg_1", "video_path": "z-7BKDfaZpg.mp4", "subtitle_path": "z-7BKDfaZpg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 359.93, "view_count": 1950206}, {"video_id": "7_VggZfyfTA", "question": "On a sightseeing bus, a man wearing a black backpack is talking. Next to him sits a woman wearing a black choker with a golden floral pattern. The woman's hair is tied up and she is wearing a headband. 
The surroundings of the tour bus are green trees, and there is a blue sign by the road. Which subtitles appear with this woman with the headband?", "question_wo_referring_query": ", which subtitles appear with this woman with the headband?", "candidates": ["cool and we're off to", "moly", "seu", "wa that's so", "were ditched by our boat crew because we"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "7_VggZfyfTA_0", "video_path": "7_VggZfyfTA.mp4", "subtitle_path": "7_VggZfyfTA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 287.29, "view_count": 135246}, {"video_id": "xmaqrw6GJPM", "question": "A lady in a denim jacket and jeans with brown hair is talking while holding a pair of jeans on a table. Behind her is a whole wall of white shelves, which are filled with many boxes and some books. To the left, there are three randomly placed chairs. Behind the lady, there is also a low table with many items on it. When she puts down the jeans and stands in front of the table talking with her hands on the table, what kind of change happens to her?", "question_wo_referring_query": "What kind of change happens to this lady?", "candidates": ["Parts her bangs into a side part", "Changes from an orange inner shirt to a black one", "The hair pinned behind her ear falls down", "Changes into a pair of black gloves", "Wears a watch on her wrist"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "xmaqrw6GJPM_0", "video_path": "xmaqrw6GJPM.mp4", "subtitle_path": "xmaqrw6GJPM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 458.08, "view_count": 1136993}, {"video_id": "kODjdl5bTmw", "question": "A short-haired man wearing a white shirt is standing in a marketplace talking. To his left are some yellow tables and white shelves filled with items. 
Behind him to the right, there are also many shelves, and there are some people browsing around. When the phrase 'tsunami even now if that happens it can' is mentioned, what change occurred to this man?", "question_wo_referring_query": "A short-haired man wearing a white shirt is standing in a marketplace talking. To his left are some yellow tables and white shelves filled with items. Behind him to the right, there are also many shelves, and there are some people browsing around. When the phrase 'tsunami even now if that happens it can' is mentioned, what change occurred to this man?", "candidates": ["He put on a white armor.", "He put on a pair of glasses.", "He put on a red armor.", "He put on a reflective green armor.", "He put on an orange armor."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "kODjdl5bTmw_0", "video_path": "kODjdl5bTmw.mp4", "subtitle_path": "kODjdl5bTmw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 260.83, "view_count": 5089}, {"video_id": "kODjdl5bTmw", "question": "A short-haired man wearing a white shirt is standing in a marketplace talking. To his left is a white wall, below him is a yellow table. Both the wall and the table are covered with many flyers and items. To his right are some stalls, with many people browsing. 
What change occurred when he mentioned 'has developed a way for families to'?", "question_wo_referring_query": "What change occurred to the man?", "candidates": ["He put on a wristwatch", "A cardboard box appeared in front of him", "He put on sunglasses", "He put on a hat", "A cardboard box appeared behind him"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "kODjdl5bTmw_1", "video_path": "kODjdl5bTmw.mp4", "subtitle_path": "kODjdl5bTmw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 260.83, "view_count": 5089}, {"video_id": "ql8yp-9Csn4", "question": "A woman with long brown hair is explaining in front of a mirror. She is wearing a black V-neck suit, a gray inner layer, and behind her is a white building backdrop with a round column entrance on the left side and many rectangular black windows on the right side. There are also cars and a flag on the building's roof. The sky is white. Which of the following objects is not present in the scene?", "question_wo_referring_query": "Which of the following objects is not present in the scene?", "candidates": ["road sign", "golden earrings", "black car", "German flag", "white car"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "ql8yp-9Csn4_0", "video_path": "ql8yp-9Csn4.mp4", "subtitle_path": "ql8yp-9Csn4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 286.32, "view_count": 70576}, {"video_id": "ql8yp-9Csn4", "question": "On the left is a woman with short hair wearing a red coat and white shirt, with a grayish-white background behind her. On the right is a woman with long brown hair speaking in front of the camera, wearing a black V-neck suit and a gray inner layer, with a white building in the background. 
Which object in the scene is not present?", "question_wo_referring_query": "Which object in the scene is not present?", "candidates": ["White sky", "Black sweater", "White car", "Necklace", "Green sedan"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "ql8yp-9Csn4_1", "video_path": "ql8yp-9Csn4.mp4", "subtitle_path": "ql8yp-9Csn4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 286.32, "view_count": 70576}, {"video_id": "ecUzn1m24bQ", "question": "A boat is sailing in a deep blue sea. Not far from the boat, a whale is slowly surfacing. What does the boat look like in this scene?", "question_wo_referring_query": ", what does the boat look like in this scene?", "candidates": ["blue motorboat", "white sailboat", "blue ferry", "white boat", "white three-deck ship"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "ecUzn1m24bQ_0", "video_path": "ecUzn1m24bQ.mp4", "subtitle_path": "ecUzn1m24bQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 333.58, "view_count": 200074}, {"video_id": "ecUzn1m24bQ", "question": "A person stands in front of a blue background wearing a black short-sleeve shirt, with curly hair and glasses. At this moment, they are biting their lip and raising their left hand. 
What animal is depicted on their short sleeves?", "question_wo_referring_query": "What animal is depicted on their short sleeves?", "candidates": ["rhinoceros", "whale", "dinosaur", "lion", "elephant"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "ecUzn1m24bQ_1", "video_path": "ecUzn1m24bQ.mp4", "subtitle_path": "ecUzn1m24bQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 333.58, "view_count": 200074}, {"video_id": "jZB08H8ND8o", "question": "Sunlight streams into the room through the window, shining on a wooden surface near the window where a green potted plant and a bowl are placed. A person without a visible face is placing their hand into the bowl. When the subtitle 'There is no easy cure for sadness; instead, I find that to cultivate joy, you must treat it' appears, what is the person wearing?", "question_wo_referring_query": "What is the person wearing?", "candidates": ["Blue long-sleeve knit shirt", "Black short-sleeve knit shirt", "Blue long-sleeve blazer", "Blue short-sleeve knit shirt", "Black long-sleeve knit shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "jZB08H8ND8o_0", "video_path": "jZB08H8ND8o.mp4", "subtitle_path": "jZB08H8ND8o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 442.2, "view_count": 2948011}, {"video_id": "jZB08H8ND8o", "question": "This is a sketch, a hand holding a pen is drawing a collar on an animal in the picture. 
When the subtitle mentions 'For example, I take care of young children as my primary job and spending time with them;' what is the animal in the picture?", "question_wo_referring_query": "What is the animal in the picture?", "candidates": ["cat", "turtle", "rabbit", "mouse", "bird"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "jZB08H8ND8o_1", "video_path": "jZB08H8ND8o.mp4", "subtitle_path": "jZB08H8ND8o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 442.2, "view_count": 2948011}, {"video_id": "oSRQvyojOL8", "question": "Under a simple shed made of wooden beams and straw, a group of cows is eating fodder from a trough behind a fence. A person is standing nearby taking notes. Who is the person taking notes?", "question_wo_referring_query": "Who is the person taking notes?", "candidates": ["A man wearing a black duckbill cap and long sleeves rolled up to reveal his forearms", "A man wearing a blue duckbill cap and long sleeves rolled up to reveal his forearms", "A man wearing a black duckbill cap and a blue short sleeve shirt", "A man wearing a black duckbill cap and short sleeves", "A woman wearing a black duckbill cap and long sleeves rolled up to reveal her forearms"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "oSRQvyojOL8_0", "video_path": "oSRQvyojOL8.mp4", "subtitle_path": "oSRQvyojOL8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 327.45, "view_count": 147549}, {"video_id": "oSRQvyojOL8", "question": "In front of a blue background, a man wearing a checkered shirt and black-framed glasses is explaining something. His right hand is clenched into a fist with a thumbs-up, and his left hand is open. A picture of an animal appears on the screen. 
What is the animal in the picture?", "question_wo_referring_query": "What is the animal in the picture?", "candidates": ["a goldfish", "a green frog", "a bird", "a tree frog", "a puppy"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "oSRQvyojOL8_1", "video_path": "oSRQvyojOL8.mp4", "subtitle_path": "oSRQvyojOL8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 327.45, "view_count": 147549}, {"video_id": "FZb2uCgosdc", "question": "In the mountains, there is a small dirt road lined with lush green grass on both sides. In the distance, the mountain reveals patches of white rocks. When a person wearing red appeared for the first time on the small road, what did he do?", "question_wo_referring_query": "What did he do?", "candidates": ["He hugged a person with a blue backpack", "He put down his bag", "He is walking forward along the small road", "He walked along the small road towards the direction of the camera", "He jumped up"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "FZb2uCgosdc_0", "video_path": "FZb2uCgosdc.mp4", "subtitle_path": "FZb2uCgosdc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 503.55, "view_count": 504975}, {"video_id": "FZb2uCgosdc", "question": "In the distance, the sun is slowly setting into the sea, the bridge on the sea extends far away, and there are two seagulls in the picture. One of them is swimming in the sea, while the other is standing on the beach. 
When the seagull on the beach appears for the first time, what does it do?", "question_wo_referring_query": "what does it do?", "candidates": ["It lifts its right leg", "It lifts its right leg and takes a step", "It flaps its wings", "It lowers its head to search for food", "It flies up"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "FZb2uCgosdc_1", "video_path": "FZb2uCgosdc.mp4", "subtitle_path": "FZb2uCgosdc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 503.55, "view_count": 504975}, {"video_id": "tozTdkTuaIU", "question": "In a virtual kitchen, there are virtual characters wearing black and white clothes, and a virtual wooden table with a white bowl on it. One of the white bowls is floating in the air. When the subtitle appears 'back in the oven okay so we're gonna top,' what does the woman in black do?", "question_wo_referring_query": "What does the woman in black do?", "candidates": ["She adds ingredients to the floating white bowl.", "She adds ingredients to the floating red empty bowl.", "She adds ingredients to the floating white empty bowl.", "She starts a fight with the woman in white.", "She adds ingredients to the non-floating white bowl."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "tozTdkTuaIU_0", "video_path": "tozTdkTuaIU.mp4", "subtitle_path": "tozTdkTuaIU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 303.24, "view_count": 52661}, {"video_id": "tozTdkTuaIU", "question": "In the scene, there's a virtual room with a large floor-to-ceiling window. Outside the window, there's a snowman wearing a black hat and a red scarf. Inside the room, a virtual woman dressed in black is holding up her left hand. 
When the subtitle mentions 'that oh did you say Echo me how dare you,' what is this woman in black doing?", "question_wo_referring_query": "What is this woman in black doing?", "candidates": ["She throws a yellow ball", "She raises her right hand", "She catches a green ball", "She picks up an object from the table", "She throws a green ball"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "tozTdkTuaIU_1", "video_path": "tozTdkTuaIU.mp4", "subtitle_path": "tozTdkTuaIU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 303.24, "view_count": 52661}, {"video_id": "XWy8S_P-nYE", "question": "In the gray screen, a black mushroom cloud formed from an explosion is rising. What did the person without a helmet do on the plane before the explosion?", "question_wo_referring_query": "What did the person without a helmet do?", "candidates": ["He was climbing along the plane's exterior", "He was shooting at a distant target", "Jumped out of the plane", "He put down the gun in his hand", "He waved his hand"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "XWy8S_P-nYE_0", "video_path": "XWy8S_P-nYE.mp4", "subtitle_path": "XWy8S_P-nYE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 274.38, "view_count": 26468}, {"video_id": "XWy8S_P-nYE", "question": "In front of a backdrop of a digital globe, a man wearing a red short-sleeve shirt is giving an explanation. The word and logo of 'Microsoft' appear in the top left corner of the screen. 
What text appears immediately after this in the top right corner of the screen?", "question_wo_referring_query": "What text appears immediately after this in the top right corner of the screen?", "candidates": ["Firefox", "DARPA", "Google", "Apple", "internet"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "XWy8S_P-nYE_1", "video_path": "XWy8S_P-nYE.mp4", "subtitle_path": "XWy8S_P-nYE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 274.38, "view_count": 26468}, {"video_id": "2N7NnyyGkO8", "question": "Two women wearing denim jackets are watching a picture-in-picture screen at the upper right, which shows two women cooking. The woman in the black striped short sleeve is preparing the food, while the woman in the pink short sleeve has left the counter. After this scene, what is the first image to appear in the picture-in-picture at the upper right?", "question_wo_referring_query": "Two women wearing denim jackets are watching a picture-in-picture screen at the upper right, which shows two women cooking. The woman in the black striped short sleeve is preparing the food, while the woman in the pink short sleeve has left the counter. 
After this scene, what is the first image to appear in the picture-in-picture at the upper right?", "candidates": ["A close-up of the woman in the black striped short sleeve", "A close-up of the woman in the pink short sleeve", "A close-up of someone using flour to dry the food", "A close-up of two young boys wearing glasses", "A close-up of someone chopping food"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "2N7NnyyGkO8_0", "video_path": "2N7NnyyGkO8.mp4", "subtitle_path": "2N7NnyyGkO8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 543.04, "view_count": 274439}, {"video_id": "2N7NnyyGkO8", "question": "In a room, there are two women wearing denim jackets. They are watching a picture-in-picture display in the bottom right corner, which shows a woman with braided hair wearing a black striped short-sleeved shirt, picking up a piece of food. What is the first scene to appear in the picture-in-picture after this?", "question_wo_referring_query": "What is the first scene to appear in the picture-in-picture after this?", "candidates": ["A dish made of chocolate", "The woman in a black striped short-sleeved shirt is eating food", "Two children with dark skin wearing glasses", "The woman in a black striped short-sleeved shirt is cutting food with a knife", "A woman in a pink short-sleeved shirt is dancing"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "2N7NnyyGkO8_1", "video_path": "2N7NnyyGkO8.mp4", "subtitle_path": "2N7NnyyGkO8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 543.04, "view_count": 274439}, {"video_id": "kKsDdWJSNGU", "question": "In a car compartment, there is a man with blond hair wearing a white coat, revealing the strap of a gray backpack, what does this man do with his left hand after the subtitle 'allows you use gondolas for free and' appears?", 
"question_wo_referring_query": "What does this man do with his left hand?", "candidates": ["He touches his hair with his left hand", "He puts down the backpack with his left hand", "He stretches his left hand out of the window", "He raises a card with his left hand", "He puts down the item in his left hand"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "kKsDdWJSNGU_0", "video_path": "kKsDdWJSNGU.mp4", "subtitle_path": "kKsDdWJSNGU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 313.36, "view_count": 111336}, {"video_id": "kKsDdWJSNGU", "question": "In the scene, there is a gray building with green windowpanes shrouded in mist. Several people are standing under a flagpole in front of the door, talking. After the caption 'friend a need is we actually found a' appears, what is the next scene?", "question_wo_referring_query": "What is the scene that appears immediately after?", "candidates": ["A man lifting up a card inside a train compartment", "A vehicle speeding along a mountain road", "A man extending half of his body out of a train window", "Two trains passing each other", "Several gray buildings situated on a mountaintop, surrounded by greenery"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "kKsDdWJSNGU_1", "video_path": "kKsDdWJSNGU.mp4", "subtitle_path": "kKsDdWJSNGU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 313.36, "view_count": 111336}, {"video_id": "pKSmcmueTbA", "question": "On the wall of a room hangs a film model, and on the shelf in front of the wall are items such as international chess pieces and a water cup. A man wearing glasses and a gray coat is making an explanation. 
When the subtitle mentions 'To do this, Dickson took one of those long rolls of celluloid film and cut holes along the edges.', what is the object that appears in front of the green background?", "question_wo_referring_query": ", what is the object that appears in front of the green background?", "candidates": ["virtual film", "water cup", "film model", "international chess", "real film"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "pKSmcmueTbA_0", "video_path": "pKSmcmueTbA.mp4", "subtitle_path": "pKSmcmueTbA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 565.07, "view_count": 550841}, {"video_id": "pKSmcmueTbA", "question": "In front of a beige background, a man wearing a gray overcoat and a fedora is looking at a wooden box. Behind him, three men and two women are waiting in line. After the subtitle mentions 'as music played from a phonograph and refreshments were served,' what appears in this scene?", "question_wo_referring_query": "What appears in this scene after the subtitle mentions 'as music played from a phonograph and refreshments were served'?", "candidates": ["a simulated barrel", "a photo booth", "a simulated film roll", "a simulated popcorn", "a simulated movie projector"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "pKSmcmueTbA_1", "video_path": "pKSmcmueTbA.mp4", "subtitle_path": "pKSmcmueTbA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 565.07, "view_count": 550841}, {"video_id": "yzkNBX345EY", "question": "Among two men wearing diving suits, there is one man wearing a hat, a white uniform with black shoulder patches, and using binoculars. 
In which of the following locations has this man appeared?", "question_wo_referring_query": "In which of the following locations has this man appeared?", "candidates": ["On a motorcycle", "Next to a lion", "On an airplane", "In a zoo", "Next to a giant black spotlight emitting yellow light"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "yzkNBX345EY_0", "video_path": "yzkNBX345EY.mp4", "subtitle_path": "yzkNBX345EY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.96, "view_count": 478236}, {"video_id": "yzkNBX345EY", "question": "In a classroom, some students are seated below the podium. A man wearing a black coat and tie is holding a wooden stick, pointing at a world map displayed beside him. Where has this man appeared before?", "question_wo_referring_query": "Where has this man appeared before?", "candidates": ["Next to an American flag", "Next to a man wearing a yellow coat", "Next to a tank", "Next to a woman wearing a yellow coat", "Next to a woman wearing a green coat"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "yzkNBX345EY_1", "video_path": "yzkNBX345EY.mp4", "subtitle_path": "yzkNBX345EY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.96, "view_count": 478236}, {"video_id": "nqL266Ijq64", "question": "In a white background, there are two pictures of women. A red dot is resting on the hat of the woman in the picture on the right. In the lower right of the screen, a man wearing a brown coat is raising his left hand to control the red dot. 
These two pictures appeared together with which subtitles?", "question_wo_referring_query": "These two pictures appeared together with which subtitles?", "candidates": ["at your location", "image and this is the gradient", "use it and anything which is above it", "if the value is greater than a higher threshold", "edges in this image let's"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "nqL266Ijq64_0", "video_path": "nqL266Ijq64.mp4", "subtitle_path": "nqL266Ijq64_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 364.12, "view_count": 120}, {"video_id": "nqL266Ijq64", "question": "In the bottom right corner of a white background, there is a man with his fists raised and eyes closed tightly. The left lens of his glasses reflects light. In the middle of the background, there are two squares. The one on the left appears as a cross dyed blue, and the one on the right is completely dyed blue. Which captions have appeared together with these two squares?", "question_wo_referring_query": "Which captions have appeared together with these two squares?", "candidates": ["therefore, the two steps are already very clear", "segment this concept, rather than", "is when we\u2019re trying to find like the", "which is above it", "edges in this image"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "nqL266Ijq64_1", "video_path": "nqL266Ijq64.mp4", "subtitle_path": "nqL266Ijq64_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 364.12, "view_count": 120}, {"video_id": "2zW4k1QrM3I", "question": "In front of a sofa with a red pillow, there is a person wearing glasses and blue shoe covers. This person is bending over to adjust the sofa. 
When he appears next to a black car, what changed about him?", "question_wo_referring_query": "What changed about him?", "candidates": ["He took off the blue shoe covers.", "He had a red bag in his hand.", "His hair changed to black.", "He changed into a black coat.", "He removed his glasses."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "2zW4k1QrM3I_0", "video_path": "2zW4k1QrM3I.mp4", "subtitle_path": "2zW4k1QrM3I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 570.93, "view_count": 3001}, {"video_id": "2zW4k1QrM3I", "question": "Inside a spacious and brightly lit room, there is a man wearing a black shirt and a gray knitted coat. What change did he undergo when he appeared in a car at night fastening his seatbelt?", "question_wo_referring_query": "What change did he undergo?", "candidates": ["He changed into a black sweater.", "He changed into a white coat.", "He changed into a black knit shirt.", "He changed into a black suit.", "He changed into a black coat."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "2zW4k1QrM3I_1", "video_path": "2zW4k1QrM3I.mp4", "subtitle_path": "2zW4k1QrM3I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 570.93, "view_count": 3001}, {"video_id": "H3Bhlan0mE0", "question": "On a 'Google' search page, the screen shows various tablets along with their prices. 
After the mention of 'pen and touch screen so you can', what event occurred?", "question_wo_referring_query": "What event occurred?", "candidates": ["The page switched to a black bold font, and a black search bar appeared below.", "The page switched to a red bold font, and a blue search bar appeared below.", "The page switched to a black bold font, and a blue search bar appeared below.", "The page switched to a red bold font, and a black search bar appeared below."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "H3Bhlan0mE0_0", "video_path": "H3Bhlan0mE0.mp4", "subtitle_path": "H3Bhlan0mE0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 451.33, "view_count": 3637}, {"video_id": "H3Bhlan0mE0", "question": "In a section titled 'piazza', below which there is a photo of a woman and three men along with their quotes, in the mention of 'university are using that right now so', what action did the person controlling the computer take afterward?", "question_wo_referring_query": "In a section titled 'piazza', below which there is a photo of a woman and three men along with their quotes, in the mention of 'university are using that right now so', what action did the person controlling the computer take afterward?", "candidates": ["He scrolled the page up to a section with an image of a computer on the left and seven lines of introduction on the right", "He clicked on the first woman's picture to introduce her", "He scrolled the page down to a section with an image of a computer on the left and seven lines of introduction on the right", "He deleted the page and searched again"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "H3Bhlan0mE0_1", "video_path": "H3Bhlan0mE0.mp4", "subtitle_path": "H3Bhlan0mE0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 451.33, "view_count": 3637}, 
{"video_id": "Dpf9gTqP7xA", "question": "In a scene with a black background, there is a standing man in the top left corner and a white image on the right side with a blue and red bar graph. What appears on the screen after mentioning 'behave a generic agent invariant so also'?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["A line graph represented only in red appears", "A bar graph represented by blue and red appears at the bottom of the image", "A line graph represented by blue and red appears", "A bar graph represented only in red appears"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "Dpf9gTqP7xA_0", "video_path": "Dpf9gTqP7xA.mp4", "subtitle_path": "Dpf9gTqP7xA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 1049, "duration": 362.0, "view_count": 932}, {"video_id": "Dpf9gTqP7xA", "question": "In a scene with a black background, there is a standing man in the top left corner and a white image on the right side. The left side of the image contains an irregular pattern of red, orange, blue, and black shapes. 
After mentioning 'clustered same space also when they,' what appears on the screen?", "question_wo_referring_query": ", what appears on the screen?", "candidates": ["A pattern of large yellow, green, and purple blocks appears", "A pattern of large red, orange, and blue blocks appears", "Small yellow, green, and purple granules appear", "Small red, orange, and blue granules appear on the right side of the image"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "Dpf9gTqP7xA_1", "video_path": "Dpf9gTqP7xA.mp4", "subtitle_path": "Dpf9gTqP7xA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 1049, "duration": 362.0, "view_count": 932}, {"video_id": "uS7W917ty3s", "question": "In a scene with a light yellow background, there is a man in a white short-sleeve shirt and olive green suspenders kneeling down. Where has he appeared before?", "question_wo_referring_query": "Where has he appeared before?", "candidates": ["In a scene with a blurry street and black, red, and yellow flags in the background", "In a scene with a blurry street and green, blue, and red flags in the background", "By an empty lakeside", "In a busy street"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "uS7W917ty3s_0", "video_path": "uS7W917ty3s.mp4", "subtitle_path": "uS7W917ty3s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 380.65, "view_count": 584527}, {"video_id": "uS7W917ty3s", "question": "A yellow fuzzy object appears on a zoomed-in view of Earth. 
Where has this Earth view appeared below?", "question_wo_referring_query": ", where has this Earth view appeared below?", "candidates": ["It has appeared in a scene with three cars of the same color.", "It has appeared in a scene with three ships of the same color.", "It has appeared in a scene with three differently colored ships.", "It has appeared in a scene with three differently colored cars."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "uS7W917ty3s_1", "video_path": "uS7W917ty3s.mp4", "subtitle_path": "uS7W917ty3s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 380.65, "view_count": 584527}, {"video_id": "OKkK8Yo7JlU", "question": "Under the blue sky and white clouds, there is a hillside with green trees and barren grassland. A woman with long, curly brown hair tightly closes her eyes, facing the screen with her side to the left. Below, she and those subtitles have appeared together?", "question_wo_referring_query": ", below, she and those subtitles have appeared together?", "candidates": ["and recognize the beauty in the big", "[Music]", "picture", "than connecting with"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "OKkK8Yo7JlU_0", "video_path": "OKkK8Yo7JlU.mp4", "subtitle_path": "OKkK8Yo7JlU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 403.67, "view_count": 239984}, {"video_id": "OKkK8Yo7JlU", "question": "In front of a room with teal walls, a woman in floral clothing is looking sideways out the window. Next to her is a pot of green radishes. 
Which of the following captions appeared together with her?", "question_wo_referring_query": ", which of the following captions appeared together with her?", "candidates": ["it is a dry hot windy day", "picture", "than connecting with", "and recognize the beauty in the big"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "OKkK8Yo7JlU_1", "video_path": "OKkK8Yo7JlU.mp4", "subtitle_path": "OKkK8Yo7JlU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 403.67, "view_count": 239984}, {"video_id": "GvqopHd-U20", "question": "On a screen with a purple background, there is a black-haired man wearing a black and purple gradient robe. He faces the screen with his hands spread out. On his left side, there are also subtitles in white and yellow bold characters. What change happens to him afterward?", "question_wo_referring_query": "What change happens to him afterward?", "candidates": ["He turns to the left side.", "He clasps his hands together and looks at the camera.", "He fixes his hair.", "He clenches his fists and places them in front of him."], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "GvqopHd-U20_0", "video_path": "GvqopHd-U20.mp4", "subtitle_path": "GvqopHd-U20_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.57, "view_count": 127513}, {"video_id": "GvqopHd-U20", "question": "In a scene with a purple background, there is a man with black hair. He looks at the screen with his fists clenched in front of his chest. In the upper right corner, there is also a picture of an underwater robot. 
What change happened to him afterward?", "question_wo_referring_query": "What change happened to him afterward?", "candidates": ["His left hand grabbed his right hand in front of his body", "His right hand grabbed his left hand in front of his body", "He touched his beard", "He adjusted his collar"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "GvqopHd-U20_1", "video_path": "GvqopHd-U20.mp4", "subtitle_path": "GvqopHd-U20_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.57, "view_count": 127513}, {"video_id": "eJb0Y1oWc4M", "question": "On a wooden cutting board, there is a large square pancake. A woman, wearing a black dress, a ring on her right hand, and light pink nail polish, is holding one end of the pancake. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is cutting the pancake with a knife", "She is rolling up the pancake", "She is spreading sauce on the pancake", "She is tearing apart the pancake"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "eJb0Y1oWc4M_0", "video_path": "eJb0Y1oWc4M.mp4", "subtitle_path": "eJb0Y1oWc4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 301.83, "view_count": 21955}, {"video_id": "eJb0Y1oWc4M", "question": "On a wooden cutting board, there's a plate with pizza on it. On the right side of the screen, a hand with pink nail polish is holding a golden knife and fork. 
What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is cutting the pizza.", "She is holding the plate up to show it to the camera.", "She is lifting a small tomato from the pizza with the fork.", "She is picking out the green vegetables from the pizza."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "eJb0Y1oWc4M_1", "video_path": "eJb0Y1oWc4M.mp4", "subtitle_path": "eJb0Y1oWc4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 301.83, "view_count": 21955}, {"video_id": "TES491F2e6c", "question": "In a scene with a vast background of a rocket, there is a man with glasses and brown hair wearing a gray coat, with his hands crossed in front of him. What color clothes is he wearing?", "question_wo_referring_query": "What color clothes is he wearing?", "candidates": ["Wearing a gray inner layer with black patterns and letters", "Wearing a black inner layer with white patterns and letters", "Wearing a black inner layer with gray patterns and letters", "Wearing a white inner layer with gray patterns and letters"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "TES491F2e6c_0", "video_path": "TES491F2e6c.mp4", "subtitle_path": "TES491F2e6c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.69, "view_count": 150033}, {"video_id": "TES491F2e6c", "question": "In a scene where numerous scientists are looking up at the sky through telescopes, what color clothes are they wearing?", "question_wo_referring_query": "What color clothes are they wearing?", "candidates": ["Wearing uniform white shirts and uniform neckties", "Wearing uniform black shirts and uniform neckties", "Wearing uniform white shirts and uniform ties", "Wearing uniform black shirts and uniform ties"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", 
"id": "TES491F2e6c_1", "video_path": "TES491F2e6c.mp4", "subtitle_path": "TES491F2e6c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.69, "view_count": 150033}, {"video_id": "SAlbLkIe948", "question": "Under a blue sky with white clouds, a man wearing glasses and a black headscarf, with a goatee, is standing on the ground. Behind him is a house made of earth bricks. When the sentence 'I just got my boy died for me and taken' is mentioned, what color clothes is he wearing?", "question_wo_referring_query": "What color clothes is he wearing?", "candidates": ["Wearing a red-black striped jacket", "Wearing a yellow-green striped jacket", "Wearing a red-black mixed with white checkered shirt", "Wearing a yellow-green mixed with black checkered shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "SAlbLkIe948_0", "video_path": "SAlbLkIe948.mp4", "subtitle_path": "SAlbLkIe948_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.96, "view_count": 319}, {"video_id": "SAlbLkIe948", "question": "Inside a large vehicle, there are green seats everywhere in sight. There is a muscular Black man sitting in the aisle seat. 
When mentioning 'there will be the end of the trip what,' what is he wearing?", "question_wo_referring_query": "What is he wearing?", "candidates": ["He is wearing a black hat, a black short-sleeved shirt, and blue jeans", "He is wearing a black hat, a pink long-sleeved shirt, and blue jeans", "He is wearing a black hat, a black long-sleeved shirt, and black jeans", "He is wearing a black hat, a pink short-sleeved shirt, and blue jeans"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "SAlbLkIe948_1", "video_path": "SAlbLkIe948.mp4", "subtitle_path": "SAlbLkIe948_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.96, "view_count": 319}, {"video_id": "esRZOALB5Bc", "question": "In the video, there is a woman wearing a pink dress and a gray coat. She is sitting on a sofa reading a book. What book is she reading?", "question_wo_referring_query": "What book is she reading?", "candidates": ["Thunderstorm", "Harry Potter", "A green book with the word 'HERBS' in large letters", "Dream of the Red Chamber"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "esRZOALB5Bc_0", "video_path": "esRZOALB5Bc.mp4", "subtitle_path": "esRZOALB5Bc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 528.8, "view_count": 11867}, {"video_id": "esRZOALB5Bc", "question": "In the video, the long-haired woman wearing a gray long-sleeve shirt is putting the pancake she just made into a container. 
What container is she using?", "question_wo_referring_query": "What container is she using?", "candidates": ["transparent glass jar", "box", "plate", "cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "esRZOALB5Bc_1", "video_path": "esRZOALB5Bc.mp4", "subtitle_path": "esRZOALB5Bc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 528.8, "view_count": 11867}, {"video_id": "seFzWOKPmGE", "question": "In the video, symptoms when sick are explained. The screen shows a man wearing an orange outfit with some floral patterns and a hat, demonstrating the symptoms when he is sick. What action did this man first take with his hands when he appeared for the first time?", "question_wo_referring_query": "In the video, symptoms when sick are explained. The screen shows a man wearing an orange outfit with some floral patterns and a hat, demonstrating the symptoms when he is sick. What action did this man first take with his hands when he appeared for the first time?", "candidates": ["Holding his forehead", "Rubbing his stomach", "Clenching his fist", "Clapping"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "seFzWOKPmGE_0", "video_path": "seFzWOKPmGE.mp4", "subtitle_path": "seFzWOKPmGE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 195.28, "view_count": 132379}, {"video_id": "seFzWOKPmGE", "question": "In the scene, a female student was practicing for an exam scenario at three in the morning. She is leaning on the table, holding a yellow cup in her left hand. 
What was she doing when she first appeared?", "question_wo_referring_query": "What was this girl doing when she first appeared?", "candidates": ["drinking water", "eating", "playing on her phone", "writing"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "seFzWOKPmGE_1", "video_path": "seFzWOKPmGE.mp4", "subtitle_path": "seFzWOKPmGE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 195.28, "view_count": 132379}, {"video_id": "TuOIW144_BY", "question": "The video explains the scenery encountered during a self-driving trip. In the screen, there is a winding road with a 60 speed limit sign. What happened to the white car when the subtitle mentions 'this 8.3 kilometer road runs across an'?", "question_wo_referring_query": "The video explains the scenery encountered during a self-driving trip. In the screen, there is a winding road with a 60 speed limit sign. What happened to the white car when the subtitle mentions 'this 8.3 kilometer road runs across an'?", "candidates": ["turned", "moved forward", "reversed", "stayed in place"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "TuOIW144_BY_0", "video_path": "TuOIW144_BY.mp4", "subtitle_path": "TuOIW144_BY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.43, "view_count": 1194905}, {"video_id": "TuOIW144_BY", "question": "The video introduces several very famous scenic spots. In the scene, there are three people with backpacks on a mountain. 
When the subtitles mention 'hardest hikes you'll ever do,' what action do the three people with backpacks in the scene take?", "question_wo_referring_query": "What action do the three people with backpacks in the scene take?", "candidates": ["Walking downhill", "Playing on their phones", "Lying down", "Eating"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "TuOIW144_BY_1", "video_path": "TuOIW144_BY.mp4", "subtitle_path": "TuOIW144_BY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.43, "view_count": 1194905}, {"video_id": "bSVfItpvG5Q", "question": "The video explains collision theory, with four spheres and a block drawn on the screen. The block has the letters 'D' written on it. What happens after the drawing is completed?", "question_wo_referring_query": "What happens after the drawing is completed?", "candidates": ["They remain stationary", "Four spheres collide", "The spheres float upwards", "The block falls down"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "bSVfItpvG5Q_0", "video_path": "bSVfItpvG5Q.mp4", "subtitle_path": "bSVfItpvG5Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 223.71, "view_count": 484207}, {"video_id": "bSVfItpvG5Q", "question": "In the video, the concept of collision is explained. There is a small person standing on a rocket, and a drawn satellite is nearby, simulating a collision scenario. 
What happens when these two objects move?", "question_wo_referring_query": "What happens when these two objects move?", "candidates": ["Collide and separate", "Satellite separates", "No change", "Rocket separates"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "bSVfItpvG5Q_1", "video_path": "bSVfItpvG5Q.mp4", "subtitle_path": "bSVfItpvG5Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 223.71, "view_count": 484207}, {"video_id": "0cuO5OSDMbw", "question": "In a scene with a dark green height chart in the background, there is a man with dreadlocks and a bare upper body. Above him, there is a segment of bold white text. In another scene with a gray background, there is a man in a black suit looking towards the right side of the screen. Which of these two characters appears first?", "question_wo_referring_query": "Which of these two characters appears first?", "candidates": ["the man with dreadlocks and a bare upper body", "both characters appear at the same time", "neither of these characters appears", "the man wearing a black suit and looking sideways"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "0cuO5OSDMbw_0", "video_path": "0cuO5OSDMbw.mp4", "subtitle_path": "0cuO5OSDMbw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.37, "view_count": 4259970}, {"video_id": "0cuO5OSDMbw", "question": "In a laboratory, there is a woman with curly hair wearing a white coat standing. Behind her stands a man also wearing a white coat. 
Which of these two characters appears first?", "question_wo_referring_query": "Which of these two characters appears first?", "candidates": ["The person with brown curly hair", "Neither of them appears", "The person with black curly hair", "Both appear at the same time"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "0cuO5OSDMbw_1", "video_path": "0cuO5OSDMbw.mp4", "subtitle_path": "0cuO5OSDMbw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.37, "view_count": 4259970}, {"video_id": "BYSE5ZUsrRg", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a long-haired woman and an elderly man wearing a black coat appear chatting inside a room, then the screen shows a lady and an elderly man with white hair, along with a man with black hair on the right, having a conversation; finally, it shifts to two pictures hanging on the wall.", "First, a long-haired woman and an elderly man wearing a black coat appear chatting inside a room, then the screen shows two pictures hanging on the wall; finally, it shifts to a conversation between a lady and an elderly man with white hair, along with a man with black hair on the right.", "First, two pictures hanging on the wall appear, then the screen shows a long-haired woman and an elderly man wearing a black coat chatting inside a room; finally, it shifts to a conversation between a lady and an elderly man with white hair, along with a man with black hair on the right.", "First, a lady and an elderly man with white hair, along with a man with black hair on the right, appear having a conversation, then the screen shows two pictures hanging on the wall; finally, it shifts to a long-haired woman and an elderly man wearing a black coat chatting inside a room."], "topic_category": "KA-Knowledge-Art", "question_category": 
"SSS", "level": "L2-Relation", "id": "BYSE5ZUsrRg_0", "video_path": "BYSE5ZUsrRg.mp4", "subtitle_path": "BYSE5ZUsrRg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 236.86, "view_count": 46593}, {"video_id": "Vy4SspEKuwk", "question": "In the video, a beverage is poured from a stainless steel cup into a transparent cup on the table. Where else has this beverage appeared in the footage?", "question_wo_referring_query": "Where else has the beverage appeared in the footage?", "candidates": ["At a fried chicken shop", "In the trash can", "On the hand of a man wearing a long-sleeved shirt", "Has not appeared elsewhere"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Vy4SspEKuwk_0", "video_path": "Vy4SspEKuwk.mp4", "subtitle_path": "Vy4SspEKuwk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.0, "view_count": 3697}, {"video_id": "Vy4SspEKuwk", "question": "In the video, a woman wearing an orange short-sleeved dress is standing on a busy street, holding a small beverage. Where else does this woman appear?", "question_wo_referring_query": "Where else does this woman appear?", "candidates": ["In the gym", "In the shop", "In the classroom", "In the broadcast room"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Vy4SspEKuwk_1", "video_path": "Vy4SspEKuwk.mp4", "subtitle_path": "Vy4SspEKuwk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.0, "view_count": 3697}, {"video_id": "3X6HwS8eVq0", "question": "There are two men in white long sleeves running forward on the screen, behind them is a column with a beam of light scanning on it. 
In the video, with which subtitle did these two men in white long sleeves appear together?", "question_wo_referring_query": "With which subtitle did these two men in white long sleeves appear together in the video?", "candidates": ["searchlight these watchtowers prived to", "europe", "cannons", "a couple of lightly armed men"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "3X6HwS8eVq0_0", "video_path": "3X6HwS8eVq0.mp4", "subtitle_path": "3X6HwS8eVq0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 213.17000000000002, "view_count": 287222}, {"video_id": "3X6HwS8eVq0", "question": "On the right side of the screen, there is a sailboat with a white sail, and on the left side, there are several soldiers in red uniforms near the shore. In the video, with which caption does the white sailboat appear together?", "question_wo_referring_query": "In the video, with which caption does the white sailboat appear together?", "candidates": ["could keep guard over an even wider area", "used by defenders like pots of boiling", "cannons", "a couple of lightly armed men"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "3X6HwS8eVq0_1", "video_path": "3X6HwS8eVq0.mp4", "subtitle_path": "3X6HwS8eVq0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 213.17000000000002, "view_count": 287222}, {"video_id": "eDaUvO8W65I", "question": "On the road next to the street, there's a white car parked. Beside it, a gray-haired man wearing glasses and a man in a black suit are shaking hands. 
What did the gray-haired man change into while drinking at the back?", "question_wo_referring_query": "While drinking at the back, what did the gray-haired man change into?", "candidates": ["Changed from an orange trench coat to a dark gray shirt and coat", "Changed from a red trench coat to a dark gray shirt and coat", "Changed from a red trench coat to a dark gray suit coat", "Changed from an orange trench coat to a dark gray suit coat"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "eDaUvO8W65I_0", "video_path": "eDaUvO8W65I.mp4", "subtitle_path": "eDaUvO8W65I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.6, "view_count": 2336}, {"video_id": "eDaUvO8W65I", "question": "A man wearing a black coat is carrying a plastic cloth on a stainless steel bucket, and beside him, there is an elderly man with white hair and glasses bending over. What outfit does this elderly man with white hair change into at the end of the video?", "question_wo_referring_query": "What outfit does this elderly man with white hair change into at the end of the video?", "candidates": ["From a black coat to a dark blue suit", "From an orange outfit to a black coat", "From an orange outfit to a dark blue suit coat", "From a dark blue outfit to an orange coat"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "eDaUvO8W65I_1", "video_path": "eDaUvO8W65I.mp4", "subtitle_path": "eDaUvO8W65I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.6, "view_count": 2336}, {"video_id": "nrmbplVPBaE", "question": "In the video, there are several soldiers lying on the ground holding guns on the right side, with grass behind them. 
When the subtitles mention 'anti-tank rifle which was capable of penetrating the armor of German Panzers ones and twos', what change occurs to the gun?", "question_wo_referring_query": "What change occurs to the gun?", "candidates": ["Thrown", "Kicked", "No change", "Remains still and changes to a firing state"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "nrmbplVPBaE_0", "video_path": "nrmbplVPBaE.mp4", "subtitle_path": "nrmbplVPBaE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.32999999999998, "view_count": 946785}, {"video_id": "nrmbplVPBaE", "question": "In the video, there is a soldier wearing a green hat and dark green clothes riding a horse. When the subtitle mentions \"During this time Polish cavalry which made up 10% of its army who would use their horses for?\", what change occurred to the horse?", "question_wo_referring_query": "What change occurred to the horse?", "candidates": ["Lay down", "Moved to a stop", "Sat", "Ate something"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "nrmbplVPBaE_1", "video_path": "nrmbplVPBaE.mp4", "subtitle_path": "nrmbplVPBaE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.32999999999998, "view_count": 946785}, {"video_id": "tWq9bP8KU_s", "question": "There are five green plants on the screen, a broom leaning against the wall, and a cat on the right side. 
What is the cat in the picture doing?", "question_wo_referring_query": "What is the cat in the picture doing?", "candidates": ["Eating something", "Lying down", "Nodding", "Drinking water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "tWq9bP8KU_s_0", "video_path": "tWq9bP8KU_s.mp4", "subtitle_path": "tWq9bP8KU_s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 317.22, "view_count": 219040}, {"video_id": "tWq9bP8KU_s", "question": "There are several clothes hanging on the left side of the screen. In the middle, there's a girl wearing a striped dress with her hair tied up, and there's a green plant next to her. What is the girl in the screen doing?", "question_wo_referring_query": "What is the girl in the screen doing?", "candidates": ["Sitting", "Squatting", "Lying down", "Standing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "tWq9bP8KU_s_1", "video_path": "tWq9bP8KU_s.mp4", "subtitle_path": "tWq9bP8KU_s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 317.22, "view_count": 219040}, {"video_id": "eYwBCvwD6y8", "question": "In the video, there's a light green brush spreading oil on the dough in the pan. Which of the following objects does not appear in the video?", "question_wo_referring_query": "Which of the following objects does not appear in the video?", "candidates": ["dough", "coffee", "pan", "brush"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "eYwBCvwD6y8_0", "video_path": "eYwBCvwD6y8.mp4", "subtitle_path": "eYwBCvwD6y8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 257.3, "view_count": 8994}, {"video_id": "CLSq1h7AvkE", "question": "A man in a grey short-sleeve shirt is sitting beside a desk. On the desk, there is an open laptop, a yellow cup, and a pen. 
On the opposite side of the desk, there is also a bookshelf filled with books. When the phrase 'Have you ever watched Prime Minister's Questions? It's brilliant' is mentioned, which of the following items does not appear in the scene?", "question_wo_referring_query": "Which of the following items does not appear in the scene?", "candidates": ["Desk lamp", "Headset", "Bookshelf", "Chair"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "CLSq1h7AvkE_0", "video_path": "CLSq1h7AvkE.mp4", "subtitle_path": "CLSq1h7AvkE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 276.28, "view_count": 3239488}, {"video_id": "CLSq1h7AvkE", "question": "A man is wearing a black coat, he raises his right hand, there is a building behind him, a white car is on the road, and there are several pedestrians beside the car. What objects are present in the frame when the phrase 'That man sleeps at night' is mentioned?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["fire truck", "ambulance", "red pole", "police car"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "CLSq1h7AvkE_1", "video_path": "CLSq1h7AvkE.mp4", "subtitle_path": "CLSq1h7AvkE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 276.28, "view_count": 3239488}, {"video_id": "ByHLYSXZMCc", "question": "In the video, a man wearing a plaid long-sleeve shirt with a blue undershirt is speaking in the center. 
What color is the background while the man is speaking?", "question_wo_referring_query": "What color is the background while the man in the video is speaking?", "candidates": ["yellow", "green", "purple", "black"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "ByHLYSXZMCc_0", "video_path": "ByHLYSXZMCc.mp4", "subtitle_path": "ByHLYSXZMCc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 236.15, "view_count": 341960}, {"video_id": "ByHLYSXZMCc", "question": "There is a yellow picture on the screen, and the picture shows a man with a beard wearing a hat and a shirt. What is the man in the picture holding in his hand?", "question_wo_referring_query": "What is the man in the picture holding in his hand?", "candidates": ["hat", "pipe", "cup", "bag"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "ByHLYSXZMCc_1", "video_path": "ByHLYSXZMCc.mp4", "subtitle_path": "ByHLYSXZMCc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 236.15, "view_count": 341960}, {"video_id": "fBuyxWozUd0", "question": "A small chunk of butter is placed on a black flat-bottom pan using tongs. When 'Creamed butter 0.7 oz' is mentioned, what shape is this butter?", "question_wo_referring_query": "What shape is this butter?", "candidates": ["Round", "Square", "Cylindrical", "Cone-shaped"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "fBuyxWozUd0_0", "video_path": "fBuyxWozUd0.mp4", "subtitle_path": "fBuyxWozUd0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 363.5, "view_count": 45955}, {"video_id": "fBuyxWozUd0", "question": "In the black pot, there are many mushrooms, with a wooden spatula placed on top of the mushrooms. 
When 'Fry~2-3 minutes' is mentioned, what shape are these mushrooms in?", "question_wo_referring_query": "What shape are these mushrooms in?", "candidates": ["Square", "Sliced", "Cubed", "Round"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "fBuyxWozUd0_1", "video_path": "fBuyxWozUd0.mp4", "subtitle_path": "fBuyxWozUd0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 363.5, "view_count": 45955}, {"video_id": "aL9-brGa7J0", "question": "There is a man in a black shirt and glasses speaking to the camera. He is extending his hands with palms facing each other. In the top left corner, there is a blue background image. What is the shape of the DNA fragment being described during the explanation?", "question_wo_referring_query": "What is the shape of the DNA fragment being described during the explanation?", "candidates": ["Square", "White circle drawn with a dashed line", "Oval", "Triangle"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "aL9-brGa7J0_0", "video_path": "aL9-brGa7J0.mp4", "subtitle_path": "aL9-brGa7J0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 466.72, "view_count": 155208}, {"video_id": "aL9-brGa7J0", "question": "On a green screen, there is a segment of bold white and purple letters on the left, and a yellow picture on the right. An object with a numerical pattern appears on the screen. 
What is this object?", "question_wo_referring_query": "What is this object?", "candidates": ["An item with a 2.0 font style and a purple background pattern", "An item with a 2.0 font style and a red background pattern", "An item with a 1.0 font style and a red background pattern", "An item with a 1.0 font style and a purple background pattern"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "aL9-brGa7J0_1", "video_path": "aL9-brGa7J0.mp4", "subtitle_path": "aL9-brGa7J0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 466.72, "view_count": 155208}, {"video_id": "jrDZ98TMQbQ", "question": "In the evening, under the yellow sky, there is a large volcano, with clouds scattered in the sky. What did this volcano first do when it appeared?", "question_wo_referring_query": "What did this volcano first do when it appeared?", "candidates": ["Volcanic lava flowing onto the rocks", "Erupting smoke", "Explosion of lava", "No change"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "jrDZ98TMQbQ_0", "video_path": "jrDZ98TMQbQ.mp4", "subtitle_path": "jrDZ98TMQbQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 268.2, "view_count": 3213}, {"video_id": "jrDZ98TMQbQ", "question": "In the scene, there are many red rocks and some black substances. This magma contains a relatively high silica content. 
What was happening when the magma first appeared?", "question_wo_referring_query": "What was happening when the magma first appeared?", "candidates": ["No change occurred", "The magma was cooling on the rock", "The magma turned into rock", "The magma was boiling"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "jrDZ98TMQbQ_1", "video_path": "jrDZ98TMQbQ.mp4", "subtitle_path": "jrDZ98TMQbQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 268.2, "view_count": 3213}, {"video_id": "44V6RFHnM1E", "question": "In a room, there are 2 men wearing white shirts and gray suits, and 2 women wearing white shirts and black suits. After the phrase '300 feet to his death. He was described by the legal firm as one of the best and brightest' is mentioned, what are these 4 people doing?", "question_wo_referring_query": "What are these 4 people doing?", "candidates": ["These four people are standing in an elevator.", "These four people are eating at a table.", "These four people are running.", "These four people are standing still with expressions of shock on their faces."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "44V6RFHnM1E_0", "video_path": "44V6RFHnM1E.mp4", "subtitle_path": "44V6RFHnM1E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 219.07999999999998, "view_count": 2084868}, {"video_id": "44V6RFHnM1E", "question": "In the yellow desert, there are a few small saguaro cacti and one large saguaro cactus. In front of this large saguaro cactus, there is a man wearing a dark blue top and light blue pants. This man is holding a gun. 
After mentioning 'then targeted a 26 foot tall Sagara shooting it several times in the trunk a', what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["This man put the gun beside them", "This man is shooting a bird with a gun", "This man is shooting the large saguaro cactus with a gun", "This man put the gun on the ground"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "44V6RFHnM1E_1", "video_path": "44V6RFHnM1E.mp4", "subtitle_path": "44V6RFHnM1E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 219.07999999999998, "view_count": 2084868}, {"video_id": "eAzlO5akPOk", "question": "There is a pot in the picture with a yellow slurry inside, being stirred with a spatula. Which of the following liquids is added first?", "question_wo_referring_query": "Which of the following liquids is added first?", "candidates": ["Milk", "Coffee", "Coke", "Juice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "eAzlO5akPOk_0", "video_path": "eAzlO5akPOk.mp4", "subtitle_path": "eAzlO5akPOk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 342.17, "view_count": 8118}, {"video_id": "eAzlO5akPOk", "question": "There are some crackers on the screen, a yellow paste-like substance in the bottom right corner, a cracker with the paste-like substance on hand, and a framed mold on the left. 
Which of the following items is likely to be put into the mold first?", "question_wo_referring_query": "Which of the following items is likely to be put into the mold first?", "candidates": ["The cracker with the paste-like substance", "None", "Both", "The yellow paste-like substance"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "eAzlO5akPOk_1", "video_path": "eAzlO5akPOk.mp4", "subtitle_path": "eAzlO5akPOk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 342.17, "view_count": 8118}, {"video_id": "kcEnWFPFBzM", "question": "On a grassland, a man wearing a black hat, glasses, and predominantly gray clothing with black accents, is mentioned. What event happens after the phrase 'oh man' is mentioned?", "question_wo_referring_query": "What event happens?", "candidates": ["He is chatting with someone", "He drinks water", "He puts on a jacket", "He skates up to the other half"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "kcEnWFPFBzM_0", "video_path": "kcEnWFPFBzM.mp4", "subtitle_path": "kcEnWFPFBzM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.64, "view_count": 596}, {"video_id": "kcEnWFPFBzM", "question": "Outside, there is a white packing box, a green building, a man wearing a black hat, with a mustache, and dressed in a striped shirt. 
After mentioning 'alright I'm hungry Ben's hungry we're,' what event occurred?", "question_wo_referring_query": "what event occurred?", "candidates": ["Kept walking continuously", "None", "Found food", "Drank water", "Teased a yellow dog"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "kcEnWFPFBzM_1", "video_path": "kcEnWFPFBzM.mp4", "subtitle_path": "kcEnWFPFBzM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.64, "view_count": 596}, {"video_id": "zDZ0cHTBX38", "question": "In the blue background, there are two airplanes with white as the main color and gray for details. After mentioning 'been the rival to the US space shuttle', which person appears?", "question_wo_referring_query": "Which person appears?", "candidates": ["A man wearing a white coat, black-framed glasses, and gray hair, and another man wearing a white coat with brown hair appear.", "A man wearing a white coat with black hair, and another man wearing a white coat with brown hair appear.", "A man with brown hair wearing black-framed glasses and another man with black hair wearing a white coat appear.", "A man with white hair wearing black coat and a man with black hair wearing black coat appear."], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "zDZ0cHTBX38_0", "video_path": "zDZ0cHTBX38.mp4", "subtitle_path": "zDZ0cHTBX38_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 544.5, "view_count": 3810197}, {"video_id": "zDZ0cHTBX38", "question": "In the blue sky, there is an airplane with a primary color of white and accents of blue and yellow, flying above yellow fields. 
Which person appears after the phrase 'meters long with a wingspan of 290 feet' is mentioned?", "question_wo_referring_query": "Which person appears?", "candidates": ["A person wearing white clothes appears", "A person wearing black clothes appears", "Two people wearing black clothes appear", "Two people wearing white uniforms operating the airplane appear"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "zDZ0cHTBX38_1", "video_path": "zDZ0cHTBX38.mp4", "subtitle_path": "zDZ0cHTBX38_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 544.5, "view_count": 3810197}, {"video_id": "N2EmdPWOFHo", "question": "In a blue background, a man wearing a black and purple patched jacket, with a green logo in the lower right corner. With which captions does this man appear together?", "question_wo_referring_query": "With which captions does this man appear together?", "candidates": ["Locusts are an agricultural menace and But notice we said almost every continent", "and cover an area of 330000 square kilometers", "carving out solitary lives in the brush", "East Africa, the Middle East, and Southeast Asia"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "N2EmdPWOFHo_0", "video_path": "N2EmdPWOFHo.mp4", "subtitle_path": "N2EmdPWOFHo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 350.98, "view_count": 1255358}, {"video_id": "N2EmdPWOFHo", "question": "On the green background image, there are many white English letters, and there are also many green insects. 
With which subtitles does this image appear together?", "question_wo_referring_query": "With which subtitles does this image appear together?", "candidates": ["to capture energy from our most renewable resource", "The more fragmented the network becomes, the more perilous the situation", "Their hands on, interactive courses will help you", "the colonization of the North American West"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "N2EmdPWOFHo_1", "video_path": "N2EmdPWOFHo.mp4", "subtitle_path": "N2EmdPWOFHo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 350.98, "view_count": 1255358}, {"video_id": "IjGsJ1awG94", "question": "In the video, a short-haired woman wearing a peach-colored long-sleeve is walking on the street. To the left is a house under renovation. What change happens to the woman wearing the peach-colored long-sleeve in the video?", "question_wo_referring_query": "What change happens to the woman wearing the peach-colored long-sleeve in the video?", "candidates": ["Turns back", "Turns right", "Stays still", "Continues straight"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "IjGsJ1awG94_0", "video_path": "IjGsJ1awG94.mp4", "subtitle_path": "IjGsJ1awG94_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 263.04, "view_count": 1445967}, {"video_id": "IjGsJ1awG94", "question": "In the video, a hand wearing a blue glove is gently brushing an artifact with a small brush. 
What transformation did the artifact undergo?", "question_wo_referring_query": "What transformation did the artifact undergo in the video?", "candidates": ["Thrown into a trash bin", "Still in the ground", "Was excavated and fully packaged", "Placed into water"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "IjGsJ1awG94_1", "video_path": "IjGsJ1awG94.mp4", "subtitle_path": "IjGsJ1awG94_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 263.04, "view_count": 1445967}, {"video_id": "wgvgRY_MH6I", "question": "In the deep blue sea, there are several white sea turtle eggs on a rock. After mentioning 'that the percent of female hatchlings', what changes occurred to these turtle eggs?", "question_wo_referring_query": "What changes occurred to these turtle eggs?", "candidates": ["These white sea turtle eggs disappeared", "These white sea turtle eggs were split open", "Black baby sea turtles hatched", "No changes occurred"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "wgvgRY_MH6I_0", "video_path": "wgvgRY_MH6I.mp4", "subtitle_path": "wgvgRY_MH6I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 308.78, "view_count": 1108}, {"video_id": "wgvgRY_MH6I", "question": "In the center of the screen, there is a white light ring in the deep blue sea. 
After the phrase 'goes a long way also' is mentioned, what change occurs to the three turtles on the left side of the light ring?", "question_wo_referring_query": "What change occurs to the three turtles?", "candidates": ["Reached the land", "In the center of the light ring", "Swam from the left side of the light ring to the right", "One more turtle appeared"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "wgvgRY_MH6I_1", "video_path": "wgvgRY_MH6I.mp4", "subtitle_path": "wgvgRY_MH6I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 308.78, "view_count": 1108}, {"video_id": "VqPmrYFvKf8", "question": "When red English text 'Few Shot Learning' and a green box with 'Model' in the middle appears on the screen, and the subtitle is 'oh all right this is Sam,' what animal appears in the video?", "question_wo_referring_query": "What animal appears in the video?", "candidates": ["A large olive-colored horse", "A black and white dog wearing colorful glasses", "A small green frog", "An orange kitten"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "VqPmrYFvKf8_0", "video_path": "VqPmrYFvKf8.mp4", "subtitle_path": "VqPmrYFvKf8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.43, "view_count": 30579}, {"video_id": "VqPmrYFvKf8", "question": "When the subtitle 'we transform the problem that the model' appears, along with the black underlined text 'Disadvantage of Traditional Models' and a green box with 'Model' in the middle, who appears on the screen?", "question_wo_referring_query": "When the green box appears, who is the person on the screen?", "candidates": ["Sam", "Jack", "Lizabeth", "Ajay"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "VqPmrYFvKf8_1", "video_path": "VqPmrYFvKf8.mp4", "subtitle_path": "VqPmrYFvKf8_en.json", 
"duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.43, "view_count": 30579}, {"video_id": "oG45EoRh3Fo", "question": "When the subtitle mentions \"You have this sense of an artist breaking out\", a woman with blonde hair and earrings is sitting in an empty exhibition hall with two paintings hanging on the wall behind her. What style of clothing is this woman wearing?", "question_wo_referring_query": "What style of clothing is this woman wearing?", "candidates": ["Pink short dress", "Black and white striped long-sleeve dress", "White sleeveless long dress", "Silver short-sleeve long dress"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "oG45EoRh3Fo_0", "video_path": "oG45EoRh3Fo.mp4", "subtitle_path": "oG45EoRh3Fo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 313.38, "view_count": 138696}, {"video_id": "oG45EoRh3Fo", "question": "When the subtitle mentions 'the human being,' a closed, gloomy room appears on the screen. The room's main door and some windows are blocked by wooden planks, there is a white wooden board on the floor, and a man is painting. What style of shoes is this man wearing?", "question_wo_referring_query": "What style of shoes is this man wearing?", "candidates": ["High-top sneakers", "Laced leather shoes", "Slippers", "Low-top boots"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "oG45EoRh3Fo_1", "video_path": "oG45EoRh3Fo.mp4", "subtitle_path": "oG45EoRh3Fo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 313.38, "view_count": 138696}, {"video_id": "iCc9xkcKnks", "question": "In the video, under the blue sky and white clouds, there are vast green grasslands and trees, sometimes showing some houses and small roads. 
What movement caused the Rift Valley event that led to the sinking of the land in the Philippines?", "question_wo_referring_query": "What movement caused the Rift Valley event that led to the sinking of the land in the Philippines?", "candidates": ["Crustal movement", "Oceanic movement", "Wind movement", "Intrusion effect"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "iCc9xkcKnks_0", "video_path": "iCc9xkcKnks.mp4", "subtitle_path": "iCc9xkcKnks_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.23, "view_count": 20740}, {"video_id": "iCc9xkcKnks", "question": "In the video, there is a primarily blue satellite map with small green dots and location markers. Which two continents experienced a collision?", "question_wo_referring_query": "Which two continents experienced a collision?", "candidates": ["Tasmania and Antarctica", "Victoria and Tasmania", "Victoria and Antarctica", "Antarctica and Australia"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "iCc9xkcKnks_1", "video_path": "iCc9xkcKnks.mp4", "subtitle_path": "iCc9xkcKnks_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.23, "view_count": 20740}, {"video_id": "-O0UndD3Bls", "question": "When the Korean male actor Hong Sang-soo, wearing a black and white houndstooth coat and black trousers, first appeared in the room after pushing open the green door in the video, what action did he perform?", "question_wo_referring_query": "What action did this male actor perform?", "candidates": ["Sat on the bench", "Sat on the sofa", "Stood beside the chair", "Knelt down"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "-O0UndD3Bls_0", "video_path": "-O0UndD3Bls.mp4", "subtitle_path": "-O0UndD3Bls_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 347.85, 
"view_count": 2694}, {"video_id": "-O0UndD3Bls", "question": "When the four men in black suits and the woman in a black dress first appeared on the red carpet in the video, what did they do?", "question_wo_referring_query": "What did they do on the red carpet?", "candidates": ["Were photographed by the media", "Did nothing", "Signed autographs for fans", "Shook hands"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "-O0UndD3Bls_1", "video_path": "-O0UndD3Bls.mp4", "subtitle_path": "-O0UndD3Bls_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 347.85, "view_count": 2694}, {"video_id": "mwXFAKTf040", "question": "A small window was opened on a wall of yellowish soil that had broken down, and bars were installed in the middle. When the subtitles mention 'OK, let's see who was behind the pain, anguish, and suffering', what actions does the person on the screen take?", "question_wo_referring_query": "What actions does the person on the screen take?", "candidates": ["The person's hands grasp the iron bars", "The person's hands wave", "The person kneels on the ground", "The person stands upright"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "mwXFAKTf040_0", "video_path": "mwXFAKTf040.mp4", "subtitle_path": "mwXFAKTf040_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 568.9, "view_count": 284195}, {"video_id": "mwXFAKTf040", "question": "When the subtitles mention \"during certain activities, but that's\", a woman with long brown hair in a grey top appears on the screen. 
What action does this woman perform?", "question_wo_referring_query": "What action does this woman perform?", "candidates": ["Gives a thumbs-up", "She puts her finger on her lips and makes a shushing gesture", "Shakes her head", "Waves her hand"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "mwXFAKTf040_1", "video_path": "mwXFAKTf040.mp4", "subtitle_path": "mwXFAKTf040_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 568.9, "view_count": 284195}, {"video_id": "PvH9CFI0ZD8", "question": "In this video with a blue background, a bald man wearing a blue and white plaid shirt is giving an explanation. At the same time, one of his hands is raised to his chest, and then he clenches his fist and raises it above his waist. What is the first concept that this man introduces?", "question_wo_referring_query": ", what is the first concept that this man introduces?", "candidates": ["Electric cars", "Biodegradable fishing nets", "Plug-in hybrid vehicles", "Principle of battery usage"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "PvH9CFI0ZD8_0", "video_path": "PvH9CFI0ZD8.mp4", "subtitle_path": "PvH9CFI0ZD8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 500.46, "view_count": 87452}, {"video_id": "3Rua-sqqPKQ", "question": "Under a green background, there is a man wearing a gray robe, a hat, and black glasses. 
After the subtitle mentions 'called the Accord an historic one but he', what action does this man perform?", "question_wo_referring_query": "What action does this man perform?", "candidates": ["He stood up and clapped", "He did nothing", "He waved his hand", "He shook his head"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "3Rua-sqqPKQ_0", "video_path": "3Rua-sqqPKQ.mp4", "subtitle_path": "3Rua-sqqPKQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.2, "view_count": 1745}, {"video_id": "3Rua-sqqPKQ", "question": "In a spacious hall, a woman with long brown hair dressed in a red top is giving a presentation. After the subtitle mentions 'nearly 200 closet and inside it's got,' what action does this woman perform?", "question_wo_referring_query": "What action does this woman perform?", "candidates": ["She claps her hands", "She shakes her head", "She lowers her head", "She lowers her hands from her chest"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "3Rua-sqqPKQ_1", "video_path": "3Rua-sqqPKQ.mp4", "subtitle_path": "3Rua-sqqPKQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.2, "view_count": 1745}, {"video_id": "OYoRWHSEh3c", "question": "In a room where the wall is green and white alternately, there is a green plant on an olive-colored table. A woman dressed mainly in green, with a blue and pink polka-dot coat, has her right hand on her chest while explaining. 
After the subtitle mentions 'full time job and also illustration,' what animal appears on the screen?", "question_wo_referring_query": "What animal appears on the screen?", "candidates": ["A black and white kitten", "A small turtle", "A black puppy", "A white bunny"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "OYoRWHSEh3c_0", "video_path": "OYoRWHSEh3c.mp4", "subtitle_path": "OYoRWHSEh3c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 507.88, "view_count": 221883}, {"video_id": "OYoRWHSEh3c", "question": "In a room where the walls are green and white, on an olive-colored table sits a pot of green plants. When the subtitle mentions 'google domains to create a portfolio and,' what appears under the green flower background?", "question_wo_referring_query": "What appears under the green flower background?", "candidates": ["A golden beach.", "A yellow puppy.", "A green peacock.", "There are a total of four illustrations, the one on the far right has a green background with several mushrooms and the words 'it's okay to not be ok' in it."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "OYoRWHSEh3c_1", "video_path": "OYoRWHSEh3c.mp4", "subtitle_path": "OYoRWHSEh3c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 507.88, "view_count": 221883}, {"video_id": "ctNruaKxVQE", "question": "At the beginning of the video, a woman is wearing earrings, black clothes, a white hairband, and blue nail polish, and she is holding a drink. 
In which of the following scenes is this woman not present?", "question_wo_referring_query": "In which of the following scenes is this woman not present?", "candidates": ["Inside a car with black seats", "On a bench with green plants behind", "Inside a shopping mall with clothes racks and a grey pillar", "Entrance of a shopping mall"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "ctNruaKxVQE_0", "video_path": "ctNruaKxVQE.mp4", "subtitle_path": "ctNruaKxVQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 569.32, "view_count": 23394}, {"video_id": "ctNruaKxVQE", "question": "In the opening of the video, a woman wearing black clothes, a white hairband, and blue nail polish is holding a green drink. In which of the following scenes does this green drink not appear?", "question_wo_referring_query": "In which of the following scenes does this green drink not appear?", "candidates": ["Inside a car with black seats", "In a mall with clothing racks and a gray bar in the background", "Entrance of the mall", "Inside a store selling drinks"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "ctNruaKxVQE_1", "video_path": "ctNruaKxVQE.mp4", "subtitle_path": "ctNruaKxVQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 569.32, "view_count": 23394}, {"video_id": "PHlDFT2dyxI", "question": "At the beginning of the video, a man with short light brown hair wearing a short-sleeved shirt primarily in light purple and featuring red and green patterns appears with his hands spread open in front of his chest. 
Which of the following subtitles did not appear along with this man?", "question_wo_referring_query": "Which of the following subtitles did not appear along with this man?", "candidates": ["Darrin mentally did your dishes last", "Thursday videos because we cannot handle", "things first we are not stopping", "the workload or because we've run out of"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "PHlDFT2dyxI_0", "video_path": "PHlDFT2dyxI.mp4", "subtitle_path": "PHlDFT2dyxI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.58, "view_count": 136669}, {"video_id": "PHlDFT2dyxI", "question": "In the video, inside a room with yellow lighting, there is a bed with a blue sheet and a guitar on it. A man wearing a short-sleeved shirt is leaning and sitting on the edge of the bed. In the video, which of the following subtitles did not appear together with this man?", "question_wo_referring_query": ", which of the following subtitles did not appear together with this man?", "candidates": ["things first we are not stopping", "You should do a UH an exposed mad green.", "and that he didn't do you didn't do the", "the Mars thing he did you switch days."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "PHlDFT2dyxI_1", "video_path": "PHlDFT2dyxI.mp4", "subtitle_path": "PHlDFT2dyxI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 366.58, "view_count": 136669}, {"video_id": "gKO7okTWwUo", "question": "In the video opening, a light olive-colored area is outlined in the middle of a South America map, with the label BOLIVIA in black font in the center. 
When \"Bolivia is a landlocked country in\" is mentioned, what color change occurs in this light olive-colored area?", "question_wo_referring_query": ", what color change occurs in this light olive-colored area?", "candidates": ["changed from light olive to red", "changed from light olive to black", "changed from light olive to orange", "changed from light olive to gray"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "gKO7okTWwUo_0", "video_path": "gKO7okTWwUo.mp4", "subtitle_path": "gKO7okTWwUo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.75, "view_count": 144041}, {"video_id": "gKO7okTWwUo", "question": "In the video, there is a box filled with yellow and blue pearls placed on a rock at the bottom of the sea, with two pieces of seaweed on the screen. When the phrase 'expedition deep into a remote part of' is mentioned, what changes occur to the state of these pearls?", "question_wo_referring_query": "What changes occur to the state of these pearls?", "candidates": ["A hand appears and lifts up six blue pearls and six yellow pearls.", "A hand appears and lifts up seven blue pearls and seven yellow pearls.", "A hand appears and lifts up eight blue pearls and four yellow pearls.", "A hand appears and lifts up eight blue pearls and six yellow pearls."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "gKO7okTWwUo_1", "video_path": "gKO7okTWwUo.mp4", "subtitle_path": "gKO7okTWwUo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.75, "view_count": 144041}, {"video_id": "pPSoE5j3Enk", "question": "When the video mentions 'In a clear bowl, pour 350g flour' during the reverse flour action, which of the following items does not appear in this scene?", "question_wo_referring_query": "Which of the following items is not present in this scene?", "candidates": ["A bowl with 350g of flour", "A 
wooden spatula", "Black gloves", "A water-filled cup"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "pPSoE5j3Enk_0", "video_path": "pPSoE5j3Enk.mp4", "subtitle_path": "pPSoE5j3Enk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 350.12, "view_count": 19603}, {"video_id": "pPSoE5j3Enk", "question": "When the person wearing black gloves in the video mentions \u201cCook this side for 3-5 mins until it appears golden\u201d while making pancakes, which of the following items does not appear on the screen?", "question_wo_referring_query": "Which of the following items does not appear on the screen?", "candidates": ["Oil", "Black non-stick pan", "Microwave", "Wooden spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "pPSoE5j3Enk_1", "video_path": "pPSoE5j3Enk.mp4", "subtitle_path": "pPSoE5j3Enk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 350.12, "view_count": 19603}, {"video_id": "DVlvA4lyInM", "question": "In the video, a blue and white Greek flag appears in the upper right corner. 
When the male protagonist, who is wearing gray clothes, is reading the letter beside him, what color is the letter?", "question_wo_referring_query": "What color is the letter?", "candidates": ["A black letter with black writing and some stamps affixed", "A red letter with black writing and some stamps affixed", "A white letter with black writing and some stamps affixed", "A yellow letter with black writing and some stamps affixed"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "DVlvA4lyInM_0", "video_path": "DVlvA4lyInM.mp4", "subtitle_path": "DVlvA4lyInM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.69, "view_count": 130529}, {"video_id": "DVlvA4lyInM", "question": "In the video, the male protagonist wearing a gray coat picks up a note from a man with curled hair and a small beard on the left side. What is the value of the number on this note?", "question_wo_referring_query": ", what is the value of the number on this note?", "candidates": ["2000", "100", "200", "1000"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "DVlvA4lyInM_1", "video_path": "DVlvA4lyInM.mp4", "subtitle_path": "DVlvA4lyInM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.69, "view_count": 130529}, {"video_id": "1BWWvhKysMo", "question": "When the video transitions to a room with many items placed on the floor and bed, and it mentions 'much as 7 hours back into the room,' what color are the sheets in this room?", "question_wo_referring_query": "What color are the sheets in this room?", "candidates": ["white", "beige", "gray", "red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "1BWWvhKysMo_0", "video_path": "1BWWvhKysMo.mp4", "subtitle_path": "1BWWvhKysMo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 
397.6, "view_count": 76912}, {"video_id": "1BWWvhKysMo", "question": "In the video, a girl wearing a white top and black shorts is walking alone on a dark road while holding a cellphone. When she mentions 'there exactly what I wanted,' what color is the bag she's carrying?", "question_wo_referring_query": "What color is the bag this girl is carrying?", "candidates": ["white", "black", "blue", "red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "1BWWvhKysMo_1", "video_path": "1BWWvhKysMo.mp4", "subtitle_path": "1BWWvhKysMo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 397.6, "view_count": 76912}, {"video_id": "OpNWJNOBWZs", "question": "Who is the person shown with their head resting in the snow when a photograph of a snowy landscape and a house is displayed, wearing an orange inner layer, a gray outer coat, and black-rimmed glasses, with both hands placed on their jacket?", "question_wo_referring_query": "Who is the person with their head resting in the snow?", "candidates": ["A person wearing an orange outer coat and white jeans", "A person wearing a white outer coat and blue jeans", "A person wearing an orange outer coat, blue jeans, and black shoes", "A child wearing a hat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "OpNWJNOBWZs_0", "video_path": "OpNWJNOBWZs.mp4", "subtitle_path": "OpNWJNOBWZs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.84, "view_count": 16}, {"video_id": "OpNWJNOBWZs", "question": "In the video, a person wearing an orange inner layer and a gray outer jacket, along with black-frame glasses, brings the tips of their thumbs and index fingers together. 
When the label 'filter masked' appears on the display board, what is the object moving on the display board in the background?", "question_wo_referring_query": "What is the object moving on the display board in the background?", "candidates": ["Dark blue dot", "Black dot", "Red dot", "White dot"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "OpNWJNOBWZs_1", "video_path": "OpNWJNOBWZs.mp4", "subtitle_path": "OpNWJNOBWZs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.84, "view_count": 16}, {"video_id": "shZuYhJFZ5k", "question": "When the woman in the light blue dress returns to the room in the video, the chair back is green, the seat is brown, and the legs are green. What is the woman in the light blue dress doing when the chair first appears on screen?", "question_wo_referring_query": "What is the woman in the light blue dress doing?", "candidates": ["She took off her hat", "She is watering flowers", "She moved some books to the floor near the chair", "She is flipping through book pages"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "shZuYhJFZ5k_0", "video_path": "shZuYhJFZ5k.mp4", "subtitle_path": "shZuYhJFZ5k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.31, "view_count": 383665}, {"video_id": "shZuYhJFZ5k", "question": "When the small black dog with a yellow collar appears alone for the first time in the video, what action is the dog doing?", "question_wo_referring_query": "What action is the small dog doing?", "candidates": ["Running by the lake", "Swimming in the water", "Sniffing the flowers by the lake with its nose", "Using its paws to flip the wooden board and sniffing the board with its nose"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "shZuYhJFZ5k_1", "video_path": "shZuYhJFZ5k.mp4", 
"subtitle_path": "shZuYhJFZ5k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.31, "view_count": 383665}, {"video_id": "w8QpWQWKDuo", "question": "When the man with black curly hair, wearing a black sweatshirt and hat in the video, mentions 'so today I am gonna be eating some mr.', what action is the man performing?", "question_wo_referring_query": "What action is the man performing?", "candidates": ["He walks to the counter to buy food.", "He sits together with another woman wearing a black sweatshirt.", "He lifts a bucket of crabs with one hand.", "He stands up."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "w8QpWQWKDuo_0", "video_path": "w8QpWQWKDuo.mp4", "subtitle_path": "w8QpWQWKDuo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.63, "view_count": 41645}, {"video_id": "w8QpWQWKDuo", "question": "When the video shows the residential hotel, there is a khaki backpack and a black backpack next to the bunk bed with white sheets. 
When the phrase 'everything so we got two free drinks' is mentioned, what is the woman wearing a black hoodie doing in the video?", "question_wo_referring_query": "What is the woman wearing a black hoodie doing in the video?", "candidates": ["She picks up the khaki backpack", "She picks up the black backpack", "She puts the khaki backpack on the bed", "She puts the black backpack on the bed"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "w8QpWQWKDuo_1", "video_path": "w8QpWQWKDuo.mp4", "subtitle_path": "w8QpWQWKDuo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.63, "view_count": 41645}, {"video_id": "5wY0yecj8Qk", "question": "What image does the video show immediately after an elderly man, with white hair and wearing a striped shirt, puts one hand in front of a photo?", "question_wo_referring_query": "What image does the video show immediately after?", "candidates": ["A photo of a wooden chair.", "A photo of an elderly man wearing a suit jacket standing in front of a grassy field.", "A photo of an elderly man with white hair, sitting at a table with his hands crossed on the table.", "A house built in a wasteland."], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "5wY0yecj8Qk_0", "video_path": "5wY0yecj8Qk.mp4", "subtitle_path": "5wY0yecj8Qk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.13, "view_count": 16014}, {"video_id": "5wY0yecj8Qk", "question": "After showing a scene of five people lying on the ground by David Dorfman Dance, what artwork is displayed next in the video?", "question_wo_referring_query": "What artwork is displayed next in the video?", "candidates": ["Diane Arbus: in the beginning.", "The Sky is a Great Space", "Oliver Beer: Vessel Orchestra.", "Butterflies"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": 
"5wY0yecj8Qk_1", "video_path": "5wY0yecj8Qk.mp4", "subtitle_path": "5wY0yecj8Qk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.13, "view_count": 16014}, {"video_id": "ILT0CIGWS88", "question": "In the video, there is a black-skinned man with black-rimmed glasses wearing a suit, and a woman in a blue coat holding a blue microphone. Which of these two NP-News-Programers appears first on the screen?", "question_wo_referring_query": "Which of these two NP-News-Programers appears first on the screen?", "candidates": ["The black-skinned man with brown-rimmed glasses wearing a suit", "The black-skinned man with black-rimmed glasses wearing a suit and the woman in a blue coat holding a blue microphone appear simultaneously", "The woman in a blue coat holding a blue microphone", "The black-skinned man with black-rimmed glasses wearing a suit"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "ILT0CIGWS88_0", "video_path": "ILT0CIGWS88.mp4", "subtitle_path": "ILT0CIGWS88_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.24, "view_count": 33748}, {"video_id": "ILT0CIGWS88", "question": "In the video, related NP-News-Programs feature Farmers clash with police outside EU summit, European Parliament President, and DW Correspondent. 
Which concept appears first in the video?", "question_wo_referring_query": "Which concept appears first in the video?", "candidates": ["European Parliament President and DW Correspondent", "DW Correspondent", "Farmers clash with police outside EU summit", "European Parliament President"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "ILT0CIGWS88_1", "video_path": "ILT0CIGWS88.mp4", "subtitle_path": "ILT0CIGWS88_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.24, "view_count": 33748}, {"video_id": "BeEHRDQaYXY", "question": "What did the male protagonist wearing black do after saying 'that I have so I had the idea of' in the video?", "question_wo_referring_query": "What did the male protagonist do?", "candidates": ["Opened the laptop and typed on the keyboard", "Walked into a room and closed the door", "Climbed stairs", "Started a car with a key"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "BeEHRDQaYXY_0", "video_path": "BeEHRDQaYXY.mp4", "subtitle_path": "BeEHRDQaYXY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.75, "view_count": 42257}, {"video_id": "BeEHRDQaYXY", "question": "In the video, after the phrase 'you're interested in so we can try to' is mentioned on the screen, what word appears on the screen as the answer to the question?", "question_wo_referring_query": ", what word appears on the screen as the answer to the question?", "candidates": ["running", "guitar", "Max", "School work"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "BeEHRDQaYXY_1", "video_path": "BeEHRDQaYXY.mp4", "subtitle_path": "BeEHRDQaYXY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.75, "view_count": 42257}, {"video_id": "nu5vPm264kI", "question": "In the video, after the woman wearing a dark 
short-sleeved shirt and holding a manuscript says 'sufficient evidence in animals so let's', what animal appears on the screen?", "question_wo_referring_query": "What animal appears on the screen?", "candidates": ["black rabbit", "green frog", "gray rabbit", "black mouse"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "nu5vPm264kI_0", "video_path": "nu5vPm264kI.mp4", "subtitle_path": "nu5vPm264kI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.84, "view_count": 63228}, {"video_id": "nu5vPm264kI", "question": "What are the large characters displayed on the big screen behind the host before the woman wearing a dark short-sleeved shirt and holding a paper mentions 'a recommendation back in the 80s that'?", "question_wo_referring_query": "What are the large characters displayed on the big screen behind the host?", "candidates": ["14 cans per day", "1.15x", "30mg per kg per day", "40mg per kg per day"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "nu5vPm264kI_1", "video_path": "nu5vPm264kI.mp4", "subtitle_path": "nu5vPm264kI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 202.84, "view_count": 63228}, {"video_id": "gWjJ4NM1KFA", "question": "On the vast ocean, a ship is sailing, with mountain peaks stretching out in the distance and some buildings and houses on the shore. 
Which subtitle does not appear at the same time as this ship?", "question_wo_referring_query": "Which subtitle does not appear at the same time as this ship?", "candidates": ["captives were often packed into the ship", "weather permitting but it didn't make", "tightly that they had no more than a few", "they were usually severely overwhelmed"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "gWjJ4NM1KFA_0", "video_path": "gWjJ4NM1KFA.mp4", "subtitle_path": "gWjJ4NM1KFA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 599.87, "view_count": 923240}, {"video_id": "gWjJ4NM1KFA", "question": "Towards the end of this video, an illustration of John Newton appears, wearing a white shirt and a black coat. Which subtitle does not appear simultaneously with this illustration of John Newton?", "question_wo_referring_query": "Which subtitle does not appear simultaneously with this illustration of John Newton?", "candidates": ["the flux in his accounts from his time", "sweeping over the ship", "slave merchant john newton describes", "at sea"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "gWjJ4NM1KFA_1", "video_path": "gWjJ4NM1KFA.mp4", "subtitle_path": "gWjJ4NM1KFA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 599.87, "view_count": 923240}, {"video_id": "xqu5ID_PgQI", "question": "In the video, when a group of gray small figures on the white background board at the bottom right corner is marked with a red cross, what changes occur on the white background board?", "question_wo_referring_query": ", what changes occur on the white background board in the video?", "candidates": ["A yellow emoji with a drooping mouth, some red text, and a small figure wearing a yellow top and blue pants appear on the white background board.", "A yellow emoji with an upturned mouth, some black text, and a small figure 
wearing a yellow top and blue pants appear on the white background board.", "A yellow emoji with a drooping mouth, some black text, and a small figure wearing a yellow top and blue pants appear on the white background board.", "A yellow emoji with a drooping mouth, some black text, and a small figure wearing a yellow top and white pants appear on the white background board."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "xqu5ID_PgQI_0", "video_path": "xqu5ID_PgQI.mp4", "subtitle_path": "xqu5ID_PgQI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 277.1, "view_count": 1754}, {"video_id": "xqu5ID_PgQI", "question": "Regarding the white background board that only has one blue gear and one black gear, when two purple pillars, two blue gears, and two black gears appear on the white background board in the video, what other changes occur on the white background board?", "question_wo_referring_query": "What other changes occur on the white background board?", "candidates": ["Two blue arrows and two gray clouds with 'New Instructions' appeared on the white background board", "A face with drooping corners of the mouth appeared on the white background board", "Three blue arrows and a gray cloud with 'New Instructions' appeared on the white background board", "Two blue arrows and a gray cloud with 'New Instructions' appeared on the white background board"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "xqu5ID_PgQI_1", "video_path": "xqu5ID_PgQI.mp4", "subtitle_path": "xqu5ID_PgQI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 277.1, "view_count": 1754}, {"video_id": "F-SiTYsr6K4", "question": "When the man wearing a grey shirt and a black suit with black-framed glasses mentioned 'you know like the normal neural network', what gesture did he make?", "question_wo_referring_query": 
"What gesture did the man make?", "candidates": ["Both hands naturally dropped down", "One hand raised from a down position to his chest level", "Thumbs up", "Both hands waved"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "F-SiTYsr6K4_0", "video_path": "F-SiTYsr6K4.mp4", "subtitle_path": "F-SiTYsr6K4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 317.0, "view_count": 6}, {"video_id": "F-SiTYsr6K4", "question": "When the man wearing a gray shirt and a black suit jacket with black-framed glasses mentions 'is the weights and the biases so this is,' what changes occur in his hand gestures?", "question_wo_referring_query": "What changes occur in this man's hand gestures?", "candidates": ["Both hands rotate in front of the chest", "One hand touches his hair", "One hand naturally drops down", "Both hands make a thumbs-up gesture"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "F-SiTYsr6K4_1", "video_path": "F-SiTYsr6K4.mp4", "subtitle_path": "F-SiTYsr6K4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 317.0, "view_count": 6}, {"video_id": "Z7GMgRDIujQ", "question": "In the video, the male host wearing a pink shirt with a white tie and a black coat is standing in front of a big screen reporting news. 
When the big screen shows a blue truck and a group of refugees, what is the woman in the black coat with a gray headscarf doing in the scene?", "question_wo_referring_query": "What is the woman in the black coat with a gray headscarf doing in the scene?", "candidates": ["Walking forward with a black and white bag", "Greeting someone beside her", "Walking forward with a red and white bag", "Walking forward with a white bag"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "Z7GMgRDIujQ_0", "video_path": "Z7GMgRDIujQ.mp4", "subtitle_path": "Z7GMgRDIujQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 233.8, "view_count": 7914}, {"video_id": "Z7GMgRDIujQ", "question": "The male host wearing a pink shirt with a white tie and a black jacket in the video is standing in front of a large screen reporting, when on the large screen there is an image of a man with blonde hair wearing a dark blue tie and a man with white hair wearing a bright blue tie standing between two national flags. What are these two men doing?", "question_wo_referring_query": "What are these two men doing?", "candidates": ["The two men are shaking hands", "The two men are bowing to each other", "The two men are hugging", "The two men are waving"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "Z7GMgRDIujQ_1", "video_path": "Z7GMgRDIujQ.mp4", "subtitle_path": "Z7GMgRDIujQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 233.8, "view_count": 7914}, {"video_id": "hJomNg0wZRo", "question": "In the video, below a blue sky with white clouds, there is a large area of yellow soil on the left and a blue river on the right. The screen displays 'Miocene'. 
Which of the following appears on the screen?", "question_wo_referring_query": "Which of the following appears on the screen?", "candidates": ["A continuous mountain range", "A colorful school of fish", "A white bridge", "A brick-red building"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "hJomNg0wZRo_0", "video_path": "hJomNg0wZRo.mp4", "subtitle_path": "hJomNg0wZRo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 257.88, "view_count": 3575}, {"video_id": "hJomNg0wZRo", "question": "In the video, which element appears on the screen during clear weather, with a small river flowing in the middle and green plants on both sides, and many stones on the right side?", "question_wo_referring_query": "Which element appears on the screen below?", "candidates": ["White bridge", "Connected mountain range", "Brick red building", "Colorful school of fish"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "hJomNg0wZRo_1", "video_path": "hJomNg0wZRo.mp4", "subtitle_path": "hJomNg0wZRo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 257.88, "view_count": 3575}, {"video_id": "_zn0kNjLwSs", "question": "In the video, there is a group of police officers gathered on one side of a street, holding shields and batons. There is also a black police car in the background. 
What are the uniforms of these police officers like?", "question_wo_referring_query": "What are the uniforms of these police officers like?", "candidates": ["They are wearing full blue police uniforms and black bulletproof vests, with black helmets.", "They are wearing full black police uniforms and black bulletproof vests, with black helmets.", "They are wearing full black police uniforms and blue bulletproof vests, with black helmets.", "They are wearing full black police uniforms and black bulletproof vests, with white helmets."], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "_zn0kNjLwSs_0", "video_path": "_zn0kNjLwSs.mp4", "subtitle_path": "_zn0kNjLwSs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.56, "view_count": 609238}, {"video_id": "_zn0kNjLwSs", "question": "In the video, there is a room with many different national flags hanging on the wall and a bookshelf full of books beside them. When the man in a grey short-sleeved shirt looks up, a map appears in the upper right corner of the screen. What color is this map?", "question_wo_referring_query": "What color is this map?", "candidates": ["white", "blue", "red", "green"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "_zn0kNjLwSs_1", "video_path": "_zn0kNjLwSs.mp4", "subtitle_path": "_zn0kNjLwSs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.56, "view_count": 609238}, {"video_id": "yhSsfBBBt-4", "question": "When the subtitle mentioned 'vegetable oil', a container was placed on the marble countertop, and a person was pouring vegetable oil into it. 
What kind of container is it?", "question_wo_referring_query": "What kind of container is it?", "candidates": ["Transparent rectangular container", "Round ceramic bowl", "Blue rectangular container", "Blue ceramic bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "yhSsfBBBt-4_0", "video_path": "yhSsfBBBt-4.mp4", "subtitle_path": "yhSsfBBBt-4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 289.83, "view_count": 46557}, {"video_id": "yhSsfBBBt-4", "question": "When the subtitle mentions '100g cheese', a wooden cutting board appears on the granite countertop, and a person is cutting cheese with a knife. What material is this knife made of?", "question_wo_referring_query": "What material is this knife made of?", "candidates": ["Plastic", "Stainless Steel", "Ceramic", "Silicone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "yhSsfBBBt-4_1", "video_path": "yhSsfBBBt-4.mp4", "subtitle_path": "yhSsfBBBt-4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 289.83, "view_count": 46557}, {"video_id": "RRExCW_27Fk", "question": "In the video with the white background, there is a man in the bottom right corner who is wearing a red shirt and a grey jacket. 
What did the man do when he appeared for the first time?", "question_wo_referring_query": "What did the man do when he appeared for the first time?", "candidates": ["Nodded his head", "Crossed his hands in front of his stomach", "Waved his hand", "Shook his head"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "RRExCW_27Fk_0", "video_path": "RRExCW_27Fk.mp4", "subtitle_path": "RRExCW_27Fk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 274.32, "view_count": 24}, {"video_id": "RRExCW_27Fk", "question": "In the video, under a white background, there is a man in the lower right corner wearing a red inner shirt and a gray outer jacket explaining something. When three small square tables appear for the first time under the white background, what does this man do?", "question_wo_referring_query": "What does this man do?", "candidates": ["He waves his hand.", "Nothing changes.", "He lowers his originally crossed hands from in front of his chest.", "He nods."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "RRExCW_27Fk_1", "video_path": "RRExCW_27Fk.mp4", "subtitle_path": "RRExCW_27Fk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 274.32, "view_count": 24}, {"video_id": "A8iHcyhUoC0", "question": "In a dimly lit room with olive-colored walls, a boy wearing a white shirt is sitting on the floor. Next to him, a woman wearing a yellow headscarf mentions 'reports relief after 14 days.' What does the boy do at this moment?", "question_wo_referring_query": "In a dimly lit room with olive-colored walls, a boy wearing white. 
What does the boy do?", "candidates": ["shakes his head", "gives a thumbs up", "squints and smiles slightly", "waves his hand"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "A8iHcyhUoC0_0", "video_path": "A8iHcyhUoC0.mp4", "subtitle_path": "A8iHcyhUoC0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.36, "view_count": 65045}, {"video_id": "A8iHcyhUoC0", "question": "On a beach, a group of people are walking forward in formation. There are white warning lines on both sides of the team, and beside them is a man wearing a yellow vest and light blue jeans. When the subtitle mentions 'government of Indonesia doesn't,' what action does this man take?", "question_wo_referring_query": "What action does this man take?", "candidates": ["Made a thumbs-up gesture", "Waved his hand", "Shook his head", "Patted this group of people on the shoulder as they walked towards the camera"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "A8iHcyhUoC0_1", "video_path": "A8iHcyhUoC0.mp4", "subtitle_path": "A8iHcyhUoC0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.36, "view_count": 65045}, {"video_id": "53p2v1IjTAE", "question": "A large black 'YOU' appeared on the screen with a green checkmark above it. 
What did the woman wearing a blue top and black pants do?", "question_wo_referring_query": "What did the woman wearing a blue top and black pants do?", "candidates": ["Nodded", "Hands behind her back", "Gave a thumbs-up", "Waved"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "53p2v1IjTAE_0", "video_path": "53p2v1IjTAE.mp4", "subtitle_path": "53p2v1IjTAE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 222.68, "view_count": 15810}, {"video_id": "53p2v1IjTAE", "question": "When the woman wearing a blue top and black pants walks from the left side to the right side of the screen on a sidewalk, what action does she perform afterward?", "question_wo_referring_query": "What action does this woman perform afterward?", "candidates": ["Looked up at the wall", "Nodded her head", "Waved her hand", "Thumbed up"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "53p2v1IjTAE_1", "video_path": "53p2v1IjTAE.mp4", "subtitle_path": "53p2v1IjTAE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 222.68, "view_count": 15810}, {"video_id": "QeyM1Yu-ITI", "question": "In the video, which person appears first: the woman in a pink top and gray pants, the woman wearing a green hat and white pearl earrings and necklace, or the person in a white fur coat and blue jeans?", "question_wo_referring_query": "Which person appears first?", "candidates": ["The woman wearing a green hat and white pearl earrings and necklace", "The woman in a pink top and gray pants", "The woman in a pink top and gray pants and the woman wearing a green hat and white pearl earrings and necklace both appear simultaneously", "The person in a white fur coat and blue jeans"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "QeyM1Yu-ITI_0", "video_path": "QeyM1Yu-ITI.mp4", "subtitle_path": 
"QeyM1Yu-ITI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 531.66, "view_count": 2064}, {"video_id": "QeyM1Yu-ITI", "question": "In the video, there appears a small dog with a light yellow back and a white belly, a small dog with a purple flower headband and a round face sticking out its tongue, a small dog with a brown and white head in between with long ears, and a small dog wearing gold-framed glasses and deep brown fur. Which one of these small dogs appears first?", "question_wo_referring_query": "Which one of these small dogs appears first?", "candidates": ["A small dog wearing gold-framed glasses and deep brown fur", "A small dog with a purple flower headband and a round face sticking out its tongue", "A small dog with a brown and white head in between with long ears", "A small dog with a light yellow back and a white belly"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "QeyM1Yu-ITI_1", "video_path": "QeyM1Yu-ITI.mp4", "subtitle_path": "QeyM1Yu-ITI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 531.66, "view_count": 2064}, {"video_id": "RnlPrdMoQ1Y", "question": "In a snowy field, distant high skeletal snow-capped mountains can be seen. There are clouds in the sky. The mountain peak is quite close. Some surrounding mountains are also covered in snow. A man in a red mountaineering suit is lying on the snow-covered ground. He is wearing sunglasses, black pants, shoes, and black gloves. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Lying on the ground moving his body playfully", "Rolling on the snow-covered ground", "Leaping from a fish-like prone position in the snow", "Sitting up from the snow-covered ground", "Throwing snow to higher places"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "RnlPrdMoQ1Y_0", "video_path": "RnlPrdMoQ1Y.mp4", "subtitle_path": "RnlPrdMoQ1Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.28, "view_count": 455160}, {"video_id": "RnlPrdMoQ1Y", "question": "In front of a blue and white striped background board, a man with short hair and wearing glasses is holding a piece of candy with both hands. He is wearing a denim-colored shirt and a wristwatch on his hand. The candy he is holding is chocolate-colored. What is this man doing?", "question_wo_referring_query": ". What is this man doing?", "candidates": ["Took a bite of the candy", "Gave the candy to someone else", "Bent the candy", "Put the candy in his pocket", "Put the candy on his head"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "RnlPrdMoQ1Y_1", "video_path": "RnlPrdMoQ1Y.mp4", "subtitle_path": "RnlPrdMoQ1Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.28, "view_count": 455160}, {"video_id": "RJy_4elf2XI", "question": "In the scene, there is an animation under a stone mountain. A group of soldiers are wearing red helmets, holding rifles and red-yellow tags. There is green grass beneath their feet and stones. 
Which of the following does not exist?", "question_wo_referring_query": "Which of the following does not exist?", "candidates": ["Black circle", "White circle", "Cloud", "Blue sky", "Black helmet"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "RJy_4elf2XI_0", "video_path": "RJy_4elf2XI.mp4", "subtitle_path": "RJy_4elf2XI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 260.68, "view_count": 21555}, {"video_id": "RJy_4elf2XI", "question": "In the picture, there are three drawn sailboats. The three sailboats are sailing on the sea, with a blue sky and white clouds in the distance. The sailboats are brown, one is very large, and there are two very small ones. Which object does not exist?", "question_wo_referring_query": ", which object does not exist?", "candidates": ["Deep blue sea water", "Brown sailboat mast", "Rectangular white sail", "Triangular white sail", "Brown mast of the boat"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "RJy_4elf2XI_1", "video_path": "RJy_4elf2XI.mp4", "subtitle_path": "RJy_4elf2XI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 260.68, "view_count": 21555}, {"video_id": "jKd06kNcAC8", "question": "A man is walking on the street, his hair is blonde, and he is wearing a yellow T-shirt. The sky is very blue. In the frame, there is a white building on the left side with many posters on its wall. There is a traffic light next to him, and on the right side of the frame, there are two green trees. When the phrase 'Vincent: It's an ocean of [struggles to pronounce] familiarity' is mentioned, what is the non-existent object?", "question_wo_referring_query": "A man is walking on the street, his hair is blonde, and he is wearing a yellow T-shirt. The sky is very blue. In the frame, there is a white building on the left side with many posters on its wall. 
There is a traffic light next to him, and on the right side of the frame, there are two green trees. When the phrase 'Vincent: It's an ocean of [struggles to pronounce] familiarity' is mentioned, what is the non-existent object?", "candidates": ["Glass window", "White wall", "Green lawn", "Sunglasses", "White cloud"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "jKd06kNcAC8_0", "video_path": "jKd06kNcAC8.mp4", "subtitle_path": "jKd06kNcAC8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 369.58, "view_count": 132523}, {"video_id": "2v7vlsY2RM8", "question": "In a PPT slide, there is a white background board and a black letter title. Below the title, there are some stair shapes, rectangles, and arrowheads made of triangles. On the right side, there are some three-dimensional cubes and rectangular figures. On the far right, there are also some transparent rectangles in red, green, and yellow. In the upper right corner, there is a portrait of a man wearing black clothes, and there are also blue letters. When the phrase 'trimmed version of al' is mentioned, what color is the rectangle on the far right?", "question_wo_referring_query": ", what color is the rectangle on the far right?", "candidates": ["Black", "Transparent yellow", "Transparent red", "Transparent green", "White"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "2v7vlsY2RM8_0", "video_path": "2v7vlsY2RM8.mp4", "subtitle_path": "2v7vlsY2RM8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 505.8, "view_count": 486}, {"video_id": "iihQ4bQ_5zA", "question": "In front of a khaki building with rectangular tiles, there are several soldiers, with two trees on both the left and right sides. The soldiers are standing on the road, holding shields and long guns. On the shields, there is an eagle emblem. 
The person at the front is wearing a blue outfit with a belt. Who is the person wearing a red headband?", "question_wo_referring_query": "Who is the person wearing a red headband?", "candidates": ["The person standing on the road without holding a shield", "The person standing on the road holding a shield", "The person wearing a blue jacket", "The person without a belt", "The person behind the green bush"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "iihQ4bQ_5zA_0", "video_path": "iihQ4bQ_5zA.mp4", "subtitle_path": "iihQ4bQ_5zA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 298.33, "view_count": 20232}, {"video_id": "Ly0h6qYyDfY", "question": "Beside a rocky cliff, there's a green lake. In the distance, there are tree-covered mountains and a blue sky. On the lake, there's a sightseeing boat. On one of the three rowboats, there are three men. What action did the man wearing a green shirt with a black hat and holding a black oar do the first time he appeared in the video?", "question_wo_referring_query": "What action did he do?", "candidates": ["Paddle the oar", "Chat with the lady next to him", "Splash the water with the oar", "Hand the oar to the lady next to him", "Shake hands with the lady next to him"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Ly0h6qYyDfY_0", "video_path": "Ly0h6qYyDfY.mp4", "subtitle_path": "Ly0h6qYyDfY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 444.98, "view_count": 114210}, {"video_id": "Ly0h6qYyDfY", "question": "In a building supported by a green pole, with a white building on the right side, a man in a blue shirt on the left, a man in an orange apron on the right, a man in a gray T-shirt, and a woman in a yellow T-shirt. They are all gathered around a table which has a bucket of food, various seasonings, and stainless steel bowls on it. 
What is the man in the orange apron doing when he first appears on screen?", "question_wo_referring_query": "In a building supported by a green pole, with a white building on the right side, a man in a blue shirt on the left, a man in an orange apron on the right, a man in a gray T-shirt, and a woman in a yellow T-shirt. They are all gathered around a table which has a bucket of food, various seasonings, and stainless steel bowls on it. What is the man in the orange apron doing when he first appears on screen?", "candidates": ["Taking a paper towel", "Vegetables on the cutting board", "Taking a seasoning bottle", "Talking to the woman in the yellow T-shirt", "Food in the wooden bucket"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Ly0h6qYyDfY_1", "video_path": "Ly0h6qYyDfY.mp4", "subtitle_path": "Ly0h6qYyDfY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 444.98, "view_count": 114210}, {"video_id": "-x7MtrgBkEc", "question": "On a wooden table covered with a gray tablecloth, there is a round bowl containing some yellow, green, and white foods. What happens when 'Mix' is mentioned?", "question_wo_referring_query": "What happens when 'Mix' is mentioned?", "candidates": ["Stir the food in the bowl", "Take the plate away", "Add cheese", "Add chili powder", "Add coriander"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "-x7MtrgBkEc_0", "video_path": "-x7MtrgBkEc.mp4", "subtitle_path": "-x7MtrgBkEc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 231.93, "view_count": 214}, {"video_id": "-x7MtrgBkEc", "question": "On a wooden table, there is a grey tablecloth with a white bowl on it containing some ham slices and some yellow cheese. 
What happened when 'Mozzarella 70 gr' was mentioned?", "question_wo_referring_query": "What happened when 'Mozzarella 70 gr' was mentioned?", "candidates": ["Putting egg yolk sauce into the bowl", "Stirring the food in the bowl", "Putting cheese into the bowl", "Pouring the food in the bowl into the pot", "Taking cheese out of the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "-x7MtrgBkEc_1", "video_path": "-x7MtrgBkEc.mp4", "subtitle_path": "-x7MtrgBkEc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 231.93, "view_count": 214}, {"video_id": "B9LyP0C2rsQ", "question": "In the scene, there is a grey background with snowflakes drifting, and a bright sunlight on the left side. After the phrase 'consist of ice crystals instead of water' is mentioned, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["The cloud turns into a snowflake", "The cloud drifts towards the lower right side of the screen", "The cloud drifts towards the right side of the screen", "The cloud drifts towards the lower left side of the screen", "The cloud drifts towards the upper left side of the screen"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "B9LyP0C2rsQ_0", "video_path": "B9LyP0C2rsQ.mp4", "subtitle_path": "B9LyP0C2rsQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 588.09, "view_count": 608932}, {"video_id": "B9LyP0C2rsQ", "question": "The screen shows an endless sky. The sky is somewhat gray and hazy, with no clouds. There is a large beam of sunlight on the left side. On the ground, there are some small buildings. Yellow text appears in the lower right corner. 
What happens after the words 'imperceivable high-altitude sheet of ice' appear?", "question_wo_referring_query": "What happens?", "candidates": ["A scene of a cloud appears", "A scene of a map appears", "A scene of dazzling sunlight appears", "A scene of a forest appears", "A scene of a snowflake appears"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "B9LyP0C2rsQ_1", "video_path": "B9LyP0C2rsQ.mp4", "subtitle_path": "B9LyP0C2rsQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 588.09, "view_count": 608932}, {"video_id": "yaDIVTyUg34", "question": "On a wooden board, there is a transparent round bowl. What is the first food that is placed inside the bowl?", "question_wo_referring_query": "On a wooden board, there is a transparent round bowl. What is the first food that is placed inside the bowl?", "candidates": ["Minced garlic", "Ham", "Egg", "Onion", "Blueberry"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "yaDIVTyUg34_0", "video_path": "yaDIVTyUg34.mp4", "subtitle_path": "yaDIVTyUg34_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 208.4, "view_count": 1875}, {"video_id": "yaDIVTyUg34", "question": "On a wooden table, there is a small stove with a black iron pot on it. 
After adding oil and minced garlic, what is the first food item placed into the pot?", "question_wo_referring_query": "What is the first food item placed into the pot?", "candidates": ["Broccoli", "Ham", "Green onions", "Cucumber strips", "Egg liquid"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "yaDIVTyUg34_1", "video_path": "yaDIVTyUg34.mp4", "subtitle_path": "yaDIVTyUg34_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 208.4, "view_count": 1875}, {"video_id": "DkxjR_I3Zhc", "question": "In a dimly lit room with a light beam, a man with curly hair, wearing a red-gray striped long-sleeve shirt, is cutting paper in his hands. His room contains an iron rack, a few wooden tables, a colorful spinning disk, and a small window. The table in front of him is piled with many assorted items. What happens before the phrase 'in my youth' is mentioned?", "question_wo_referring_query": "", "candidates": ["The man picks up scissors from the table", "The man picks up a piece of wood from the table", "The man runs from a black door to the table", "The man turns on a desk lamp", "The man walks from a black door to the table"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "DkxjR_I3Zhc_0", "video_path": "DkxjR_I3Zhc.mp4", "subtitle_path": "DkxjR_I3Zhc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 195.78, "view_count": 4112}, {"video_id": "DkxjR_I3Zhc", "question": "A man wearing a red and gray striped shirt is talking. He has black curly hair, and behind him is a silver bookshelf filled with books. What happened after he mentioned 'so it provides a space of possibility to'?", "question_wo_referring_query": "A man wearing a red and gray striped shirt is talking. He has black curly hair, and behind him is a silver bookshelf filled with books. 
What happened after he mentioned 'so it provides a space of possibility to'?", "candidates": ["The man took a sip of water", "Two people are standing on a whiteboard covered with labels", "Several people are drawing on the board", "Several people are gathered around a whiteboard covered with labels", "One person is standing on a whiteboard covered with labels"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "DkxjR_I3Zhc_1", "video_path": "DkxjR_I3Zhc.mp4", "subtitle_path": "DkxjR_I3Zhc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 195.78, "view_count": 4112}, {"video_id": "0Eu8LiBkDeo", "question": "In the animated scene, there are three people dressed in white clothes. One is wearing a black tie, another one is in a blue shirt, and another person is in an olive-colored outfit with glasses. There are two rifles on the table in front of them. When the phrase 'the aging bolt-action rifles that had' is mentioned, who is the first person to appear with their face revealed?", "question_wo_referring_query": "In the animated scene, there are three people dressed in white clothes. One is wearing a black tie, another one is in a blue shirt, and another person is in an olive-colored outfit with glasses. There are two rifles on the table in front of them. 
When the phrase 'the aging bolt-action rifles that had' is mentioned, who is the first person to appear with their face revealed?", "candidates": ["The person in an olive-colored outfit with glasses", "The person in a blue shirt with a white jacket", "The soldier with a helmet carrying a rifle", "The person wearing a black tie", "The person in a white shirt with a black tie"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "0Eu8LiBkDeo_0", "video_path": "0Eu8LiBkDeo.mp4", "subtitle_path": "0Eu8LiBkDeo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 510.79, "view_count": 1916563}, {"video_id": "0Eu8LiBkDeo", "question": "With the background of a fallen large tree, three soldiers are seen aiming their rifles at the camera. They are wearing dark green helmets, light brown uniforms, and carrying many bullets. When the phrase 'into battle all across the world and the' is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Bullet", "Soldier with a helmet", "Fallen large tree", "American flag", "Rifle"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "0Eu8LiBkDeo_1", "video_path": "0Eu8LiBkDeo.mp4", "subtitle_path": "0Eu8LiBkDeo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 510.79, "view_count": 1916563}, {"video_id": "5pxbnuhzBME", "question": "On a green field, a man with curly hair wearing a gray coat and pants is squatting there. A wooden railing separates him from a small sheep. He is looking at the small sheep. 
What object is not present in this scene?", "question_wo_referring_query": "What object is not present in this scene?", "candidates": ["Sunlight", "Gray shoes", "Gray shoulder bag", "Blue sky", "Black and white sheep"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "5pxbnuhzBME_0", "video_path": "5pxbnuhzBME.mp4", "subtitle_path": "5pxbnuhzBME_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 415.68, "view_count": 118013}, {"video_id": "5pxbnuhzBME", "question": "A man wearing a gray coat is standing on a beach with many footprints. He is gazing at distant hills and the sky. There are also some flags in the distance. What object does not exist in this scene?", "question_wo_referring_query": ", what object does not exist in this scene?", "candidates": ["Blue sky", "Black socks", "Light blue jeans", "Gray beach", "Red and yellow flags"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "5pxbnuhzBME_1", "video_path": "5pxbnuhzBME.mp4", "subtitle_path": "5pxbnuhzBME_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 415.68, "view_count": 118013}, {"video_id": "Flebt2AiYco", "question": "In front of a brown building with four round pillars, a triangular roof, and surrounded by two trees, some soldiers holding shields and spears are standing on the road in front. To the left, there are two wooden boxes. On the far right, a small blue figure is explaining something. 
When 'actually the eldest stumped by this the' is mentioned, which objects do not exist on the screen?", "question_wo_referring_query": "Which objects do not appear on the screen?", "candidates": ["white stripes", "blue shield", "blue sky", "blue hat", "blue flower"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "Flebt2AiYco_0", "video_path": "Flebt2AiYco.mp4", "subtitle_path": "Flebt2AiYco_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 398.43, "view_count": 2249}, {"video_id": "Flebt2AiYco", "question": "In front of a brown building with four round pillars, there is a triangular roof and two trees around it. Some soldiers holding shields and spears are standing on the road in front of it. To the left, there are two wooden boxes, and on the far right, a small blue person is explaining something. Behind him, there are two people, one large and one small, without helmets. When the phrase 'person to be too power hungry and cause' is mentioned, what object is not present in the scene?", "question_wo_referring_query": "What object is not present in the scene?", "candidates": ["Blue shield", "Blue sky", "Brown hat", "Green grass", "Red helmet"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "Flebt2AiYco_1", "video_path": "Flebt2AiYco.mp4", "subtitle_path": "Flebt2AiYco_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 398.43, "view_count": 2249}, {"video_id": "V16qfhdIK6M", "question": "A lady wearing a blue outfit with a pink underlay is sitting in a room talking. The background in the room is relatively blurry, showing some blue and white objects. There is also a round table and a chair behind her. The lady has straight hair and is wearing earrings. 
What style is the outfit she is wearing?", "question_wo_referring_query": "A lady wearing a blue outfit with a pink underlay is sitting in a room talking. The background is relatively blurry, showing some blue and white objects. There is also a round table and a chair behind her. The lady has straight hair and is wearing earrings. What style is the outfit she is wearing?", "candidates": ["A blue lab coat", "A blue wool sweater", "A blue knitted sweater", "A blue shirt jacket", "A blue suit"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "V16qfhdIK6M_0", "video_path": "V16qfhdIK6M.mp4", "subtitle_path": "V16qfhdIK6M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.13, "view_count": 46868}, {"video_id": "V16qfhdIK6M", "question": "The scene shows a person in a red T-shirt running towards the finish line on a track. Next to the yellow track is green ground with three flags inserted into it. The finish line is checkered in black and white. What shape are the flags on the grass?", "question_wo_referring_query": "What shape are the flags on the grass?", "candidates": ["Green rectangle", "Ladder shape", "Purple triangle", "Green triangle", "Purple square"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "V16qfhdIK6M_1", "video_path": "V16qfhdIK6M.mp4", "subtitle_path": "V16qfhdIK6M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.13, "view_count": 46868}, {"video_id": "ywMO0enFiFo", "question": "A girl wearing a straw hat and a green wool sweater is sitting on a pile of dead grass under a withered tree by the river, holding binoculars and looking towards the other side of the river. 
What did she do first after that?", "question_wo_referring_query": "What did she do first after that?", "candidates": ["Took out a notebook to read the words on it", "Walked into the brook", "Sat in the pile of grass and played a small harp", "Walked into the forest", "Sat on the fallen dead tree and played a small harp"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "ywMO0enFiFo_0", "video_path": "ywMO0enFiFo.mp4", "subtitle_path": "ywMO0enFiFo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 404.2, "view_count": 291973}, {"video_id": "ywMO0enFiFo", "question": "On the wooden table, there is a white water jar, five pigment containers of different colors, and brushes. After wiping the glass window on the table with a rag, what did the person in the video do?", "question_wo_referring_query": "What did the person do?", "candidates": ["Open the glass box", "Lift the glass window to admire", "Paint on the glass window", "Read a book", "Tend to a small potted plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "ywMO0enFiFo_1", "video_path": "ywMO0enFiFo.mp4", "subtitle_path": "ywMO0enFiFo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 404.2, "view_count": 291973}, {"video_id": "sknNQ2ry3QM", "question": "What happened after the subtitle 'alive NHK will still Arena del mundo' appeared, when there was a man in a white sunhat with deep-set eyes looking up next to a female host in a red suit?", "question_wo_referring_query": "What happened?", "candidates": ["The man in the hat visited a history museum", "Performers were being interviewed", "A group of people were rehearsing for a performance", "There was a video call with a veteran", "A veteran was being interviewed"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "sknNQ2ry3QM_0", 
"video_path": "sknNQ2ry3QM.mp4", "subtitle_path": "sknNQ2ry3QM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 246.11, "view_count": 6304}, {"video_id": "sknNQ2ry3QM", "question": "An elderly person with white hair is having a video call on a computer with a man wearing a white short-sleeve shirt. What happens after the subtitle 'Cornelius also talked online with one of' appears?", "question_wo_referring_query": "What happened?", "candidates": ["Performers are doing a street performance", "A group of people are rehearsing for a performance", "The elderly person is showing a photo album", "A man wearing a hat visits a history museum", "Has a video call with a veteran"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "sknNQ2ry3QM_1", "video_path": "sknNQ2ry3QM.mp4", "subtitle_path": "sknNQ2ry3QM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 246.11, "view_count": 6304}, {"video_id": "5skIqoO3ku0", "question": "On a white background is an English webpage, in the lower right corner of the screen is a man with thinning hair wearing glasses and a goatee, speaking into a microphone. 
After the subtitle 'and babbage and add another factor of 8' appears, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["A chart published on a social networking site", "A green screen background", "A green dialogue box", "A social networking site with a black background", "An article posted on a social networking site with a white background"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "5skIqoO3ku0_0", "video_path": "5skIqoO3ku0.mp4", "subtitle_path": "5skIqoO3ku0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 83, "duration": 583.02, "view_count": 33567}, {"video_id": "5skIqoO3ku0", "question": "On a black background appears an English social media site, in the bottom right corner of the screen there is a man with sparse hair wearing glasses and sporting a goatee, speaking into a microphone. What is the first object that appears after the subtitle 'models can be at I million times more' is displayed?", "question_wo_referring_query": "On a black background appears an English social media site, in the bottom right corner of the screen there is a man with sparse hair wearing glasses and sporting a goatee, speaking into a microphone. 
What is the first object that appears after the subtitle 'models can be at I million times more' is displayed?", "candidates": ["A man with sparse hair", "A chart published on the social media site", "An article published by the social media site on a white background", "A green screen background", "A green dialogue box"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "5skIqoO3ku0_1", "video_path": "5skIqoO3ku0.mp4", "subtitle_path": "5skIqoO3ku0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 83, "duration": 583.02, "view_count": 33567}, {"video_id": "6bJIkfi8H-E", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["Explained the 'I-JEPA: Architecture and Training' screen, explained the 'I-JEPA Self-Supervised Learning Approach' PPT screen, explained the 'I-JEPA' PPT screen, explained the 'Self-Supervised Learning in Computer Vision' PPT screen, drew a rectangular frame", "Drew a rectangular frame, explained the 'I-JEPA Self-Supervised Learning Approach' PPT screen, explained the 'I-JEPA' PPT screen, explained the 'Self-Supervised Learning in Computer Vision' PPT screen, explained the 'I-JEPA: Architecture and Training' screen", "Drew a rectangular frame, explained the 'I-JEPA' PPT screen, explained the 'Self-Supervised Learning in Computer Vision' PPT screen, explained the 'I-JEPA Self-Supervised Learning Approach' PPT screen, explained the 'I-JEPA: Architecture and Training' screen", "Explained the 'I-JEPA Self-Supervised Learning Approach' PPT screen, explained the 'I-JEPA: Architecture and Training' screen, explained the 'I-JEPA' PPT screen, explained the 'Self-Supervised Learning in Computer Vision' PPT screen, drew a rectangular frame", "Explained the 'I-JEPA Self-Supervised Learning Approach' PPT screen, explained the 'I-JEPA' PPT screen, explained the 
'Self-Supervised Learning in Computer Vision' PPT screen, explained the 'I-JEPA: Architecture and Training' screen, drew a rectangular frame"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "6bJIkfi8H-E_0", "video_path": "6bJIkfi8H-E.mp4", "subtitle_path": "6bJIkfi8H-E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 496.1, "view_count": 7811}, {"video_id": "6bJIkfi8H-E", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["A screen with a crawling tabby cat picture with a green box on top, a screen on the left side with two crawling tabby cat pictures divided by different colored blocks, a screen with 24 pictures covered by different black translucent panels, a screen with a picture of a man in the lower left corner wearing a blue long-sleeved shirt, a screen with four yellow flower patterned kitten pictures", "A screen with a picture of a man in the lower left corner wearing a blue long-sleeved shirt, a screen with four yellow flower patterned kitten pictures, a screen with a crawling tabby cat picture with a green box on top, a screen on the left side with two crawling tabby cat pictures divided by different colored blocks, a screen with 24 pictures covered by different black translucent panels", "A screen with four yellow flower patterned kitten pictures, a screen with a crawling tabby cat picture with a green box on top, a screen on the left side with two crawling tabby cat pictures divided by different colored blocks, a screen with a picture of a man in the lower left corner wearing a blue long-sleeved shirt, a screen with 24 pictures covered by different black translucent panels", "A screen with 24 pictures covered by different black translucent panels, a screen on the left side with two crawling tabby cat pictures divided by different 
colored blocks, a screen with a crawling tabby cat picture with a green box on top, a screen with a picture of a man in the lower left corner wearing a blue long-sleeved shirt, a screen with four yellow flower patterned kitten pictures", "A screen with 24 pictures covered by different black translucent panels, a screen with a crawling tabby cat picture with a green box on top, a screen on the left side with two crawling tabby cat pictures divided by different colored blocks, a screen with a picture of a man in the lower left corner wearing a blue long-sleeved shirt, a screen with four yellow flower patterned kitten pictures"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "6bJIkfi8H-E_1", "video_path": "6bJIkfi8H-E.mp4", "subtitle_path": "6bJIkfi8H-E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 496.1, "view_count": 7811}, {"video_id": "h_2WCFAYgaU", "question": "In a room with many paintings hanging on the wall, on a black tablecloth, there are some paintings framed with acrylic. In which of the following scenes does the man in a black shirt, who is wearing jeans, also appear?", "question_wo_referring_query": "In which of the following scenes does the man also appear?", "candidates": ["In front of a table displaying paintings framed with acrylic", "Next to a cup made with red and silver wires", "On a black cloth with a white paper and a steel pen", "On a soccer field with cigarettes and pens in round pearls", "Next to a white tea cup on a black tablecloth"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "h_2WCFAYgaU_0", "video_path": "h_2WCFAYgaU.mp4", "subtitle_path": "h_2WCFAYgaU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 232.87, "view_count": 3080}, {"video_id": "h_2WCFAYgaU", "question": "There are many wall paintings hanging on the wall. 
On a long table covered with a black tablecloth, there is an abstract painting framed in an acrylic frame. In which of the following scenes does this abstract painting also appear?", "question_wo_referring_query": "In which of the following scenes does this abstract painting also appear?", "candidates": ["Next to a cup made of red and silver wire.", "Next to a white teacup on a black tablecloth.", "In the hand of a man with apron.", "On the wall of an exhibition hall.", "On a black cloth with a steel pen on a sheet of white paper."], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "h_2WCFAYgaU_1", "video_path": "h_2WCFAYgaU.mp4", "subtitle_path": "h_2WCFAYgaU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 232.87, "view_count": 3080}, {"video_id": "DykZIEdHBVk", "question": "On the wooden table, there is a round yellow dessert in a metal plate, with a piece of chocolate placed on top of the dessert. With which subtitles has this dessert appeared simultaneously?", "question_wo_referring_query": ", with which subtitles has this dessert appeared simultaneously?", "candidates": ["\u201cteacher\u201d", "\u201cso\u201d", "\u201cmother\u201d", "\u201cfather\u201d", "\u201cstudents\u201d"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "DykZIEdHBVk_0", "video_path": "DykZIEdHBVk.mp4", "subtitle_path": "DykZIEdHBVk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 307.75, "view_count": 321863}, {"video_id": "DykZIEdHBVk", "question": "On the wooden table, there is a glass container filled with chocolate sauce and a mixture of white and green foods. In the upper right corner of the table is a piece of yellow cloth. 
With which subtitles have these ingredients appeared simultaneously in this container?", "question_wo_referring_query": "With which subtitles have these ingredients appeared simultaneously in this container?", "candidates": ["\u201csister\u201d", "\u201cbrother\u201d", "\u201cso\u201d", "\u201cmother\u201d", "\u201cteacher\u201d"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "DykZIEdHBVk_1", "video_path": "DykZIEdHBVk.mp4", "subtitle_path": "DykZIEdHBVk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 307.75, "view_count": 321863}, {"video_id": "RzweUJjzcWA", "question": "In the underground parking lot, a man wearing a white helmet and a gray short-sleeved shirt appeared. When the subtitle 'Nescafe iced coffee all right I hear' appeared, what change happened to this man?", "question_wo_referring_query": "What change happened to this man?", "candidates": ["Not wearing glasses turned into wearing glasses.", "Wearing a helmet turned into wearing a mask.", "Wearing a helmet turned into wearing a hat.", "The white helmet turned into a black helmet.", "The helmet was removed."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "RzweUJjzcWA_0", "video_path": "RzweUJjzcWA.mp4", "subtitle_path": "RzweUJjzcWA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.26, "view_count": 46068}, {"video_id": "RzweUJjzcWA", "question": "What change happens to the man in the gray short-sleeved shirt sitting in Hanbao King's restaurant when the subtitle 'get around here it sounds like I'm in a' appears?", "question_wo_referring_query": "What change happens to the man?", "candidates": ["His white shirt changes to a black shirt", "His gray short-sleeved shirt changes to a white short-sleeved shirt", "His gray short-sleeved shirt changes to a black shirt", "His gray short-sleeved shirt changes to a 
black short-sleeved shirt", "His gray short-sleeved shirt changes to a white shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "RzweUJjzcWA_1", "video_path": "RzweUJjzcWA.mp4", "subtitle_path": "RzweUJjzcWA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.26, "view_count": 46068}, {"video_id": "3JO8RKuHPl4", "question": "On a battlefield filled with the smoke of war, numerous soldiers wearing iron armor are holding spears and flags, riding on armored horses. What are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Practicing drills", "Participating in a horse race", "Engaging in a fierce battle", "Running away", "Celebrating a victory"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "3JO8RKuHPl4_0", "video_path": "3JO8RKuHPl4.mp4", "subtitle_path": "3JO8RKuHPl4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.42, "view_count": 590427}, {"video_id": "3JO8RKuHPl4", "question": "The background consists of a lake and rocks. A group of people dressed in black and white armor are doing various activities; some are riding horses and wielding long swords, others are holding bows and arrows, and in the distance, there is someone on horseback blowing a horn. What are they doing?", "question_wo_referring_query": "The background consists of a lake and rocks. A group of people dressed in black and white armor are doing various activities; some are riding horses and wielding long swords, others are holding bows and arrows, and in the distance, there is someone on horseback blowing a horn. 
What are they doing?", "candidates": ["Conversing", "Competing", "Fighting", "Performing", "Surrendering"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "3JO8RKuHPl4_1", "video_path": "3JO8RKuHPl4.mp4", "subtitle_path": "3JO8RKuHPl4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.42, "view_count": 590427}, {"video_id": "Ol0i9-qvkjU", "question": "The background has two red pillows on a sofa, a man wearing a gray hoodie jacket is in front of the camera, and there is white English text on the screen. When the subtitles 'that I'm about to do, contact me' appear, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["computer", "book", "television", "clothes rack", "planter"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "Ol0i9-qvkjU_0", "video_path": "Ol0i9-qvkjU.mp4", "subtitle_path": "Ol0i9-qvkjU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 379.88, "view_count": 111262}, {"video_id": "Ol0i9-qvkjU", "question": "The background features two couches with red pillows. A man in a gray hoodie turns to look at a dog behind him. What object is present on the screen when the subtitle 'Hudson!' 
appears?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["kitten", "water tank", "curtain", "hat", "air conditioner"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "Ol0i9-qvkjU_1", "video_path": "Ol0i9-qvkjU.mp4", "subtitle_path": "Ol0i9-qvkjU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 379.88, "view_count": 111262}, {"video_id": "gxniiY6VvDg", "question": "There is a painting framed in gold on the flesh-colored wall, depicting three haystacks under scattered clouds, with a flock of sheep grazing in the wild grassland. When the subtitle 'admirable about me as painting is that' appears, what breed are the sheep in the painting?", "question_wo_referring_query": "What breed are the sheep in the painting?", "candidates": ["Mountain sheep", "Merino sheep", "Black mountain sheep", "Romanov sheep", "Bighorn sheep"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "gxniiY6VvDg_0", "video_path": "gxniiY6VvDg.mp4", "subtitle_path": "gxniiY6VvDg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 189.76, "view_count": 7479}, {"video_id": "gxniiY6VvDg", "question": "There is a painting on a flesh-colored wall depicting three haystacks under the setting sun, with a flock of sheep grazing on the grassland. In the lower left of the screen, there's a shadow of a person wearing dark green clothing walking by. 
When the subtitle 'connection between man and the natural' appears, what kind of frame does the painting have?", "question_wo_referring_query": "What kind of frame does this painting have?", "candidates": ["A gilt frame without a pattern", "A plain wooden frame", "A pale-yellow, old wooden frame", "A silver frame", "A gold frame with a floral pattern"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "gxniiY6VvDg_1", "video_path": "gxniiY6VvDg.mp4", "subtitle_path": "gxniiY6VvDg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 189.76, "view_count": 7479}, {"video_id": "fD1D9CfgAZI", "question": "The background shows buildings of different colors. On the road in front of the buildings, there are two military green armored vehicles driving over fallen stone slabs. In the lower left corner, there is a man dressed in olive military gear holding a gun and hiding behind a stone. When the subtitle 'risk reduction halts after it passed' appears, what are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Performing a show", "Military exercise", "Playing hide-and-seek", "Fighting", "Fixing a tank"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "fD1D9CfgAZI_0", "video_path": "fD1D9CfgAZI.mp4", "subtitle_path": "fD1D9CfgAZI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.5, "view_count": 1049006}, {"video_id": "fD1D9CfgAZI", "question": "In the background, there are rugged, sunlit hills with many caves. In front of the hills, there are three men wearing different headgear with braided beards, each riding a horse and holding a gun. 
When the subtitle 'sitting on single sears behind the crew' appears, what are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["arguing", "riding horses on a journey", "horseracing", "practicing shooting", "fighting with enemies"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "fD1D9CfgAZI_1", "video_path": "fD1D9CfgAZI.mp4", "subtitle_path": "fD1D9CfgAZI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.5, "view_count": 1049006}, {"video_id": "7OdhtAiPfWY", "question": "In a 3D game screen, the lower part contains a grid of numbers 64, and objects in the scenario are arranged in sequence. Which concept is mentioned first below?", "question_wo_referring_query": "Which concept is mentioned first below?", "candidates": ["Memory Unit", "Forward Propagation Mode", "Input Layer", "Memory Cell", "Input Neuron"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "7OdhtAiPfWY_0", "video_path": "7OdhtAiPfWY.mp4", "subtitle_path": "7OdhtAiPfWY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 771, "duration": 420.02, "view_count": 34122}, {"video_id": "7OdhtAiPfWY", "question": "The lower part of the screen in a 3D game contains a grid with 64 cells, and objects in the scene are arranged in an orderly manner. 
Which concept is mentioned first below?", "question_wo_referring_query": "Which concept is mentioned first below?", "candidates": ["Constraint mechanism", "Pulse", "Add mechanism", "Sine wave", "Two active blocks"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "7OdhtAiPfWY_1", "video_path": "7OdhtAiPfWY.mp4", "subtitle_path": "7OdhtAiPfWY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 771, "duration": 420.02, "view_count": 34122}, {"video_id": "ShexTHpgM_E", "question": "Amidst a patch of green, several purple flowers are blooming. A hand gently strokes the purple flower in the center of the screen. What happens to the butterfly resting on the flower after the narrator says 'He seems to completely understand'?", "question_wo_referring_query": "What happens to the butterfly resting on the flower?", "candidates": ["The butterfly flutters its wings on the flower", "The butterfly is caught by the hand", "The butterfly flutters its wings and flies away from the flower", "The butterfly stays on the flower and does nothing", "The butterfly falls off the flower"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ShexTHpgM_E_0", "video_path": "ShexTHpgM_E.mp4", "subtitle_path": "ShexTHpgM_E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 458.79, "view_count": 338395}, {"video_id": "ShexTHpgM_E", "question": "Before a person dressed in white pants placed a black rabbit on their lap and fed the rabbit while explaining about 'wishing you a wonderful day or night filled with dreams and stories as well as other marvelous thoughts,' what did the woman do at the door?", "question_wo_referring_query": ", what did the woman do at the door?", "candidates": ["Tore a thread with hands", "Washed her hands with water", "Cut a thread with scissors", "Picked a bunch of flowers", "Hung a bunch of flowers in front of 
the door"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ShexTHpgM_E_1", "video_path": "ShexTHpgM_E.mp4", "subtitle_path": "ShexTHpgM_E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 458.79, "view_count": 338395}, {"video_id": "HU0EYA2d4_0", "question": "In an art gallery, a woman dressed in black is giving an explanation with a red painting behind her. After the subtitle mentions 'This exhibition charts the full obstreperous, unruly arc of Francis Picabia\u2019s career,' who appears behind the explaining woman?", "question_wo_referring_query": "Who appears behind the explaining woman?", "candidates": ["A man wearing a pink coat and black pants", "A man wearing a white coat and black pants", "A man wearing a black coat and pink pants", "A man wearing a black coat and black pants", "A man with long hair wearing a black coat"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "HU0EYA2d4_0_0", "video_path": "HU0EYA2d4_0.mp4", "subtitle_path": "HU0EYA2d4_0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.29, "view_count": 261308}, {"video_id": "HU0EYA2d4_0", "question": "In a bright room, four paintings are hanging on the wall. In one of the paintings, there is a skeleton in the corner. 
After the subtitle appears, saying 'Okay, I know, I skipped all the really scary parts,' what is the immediately following painting?", "question_wo_referring_query": "What is the immediately following painting?", "candidates": ["A portrait of a dog", "A portrait of a small red-nosed demon", "A half-length portrait of a woman wearing red", "A painting of a man in jet-black holding a whip", "A painting of a woman standing between two skeletons"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "HU0EYA2d4_0_1", "video_path": "HU0EYA2d4_0.mp4", "subtitle_path": "HU0EYA2d4_0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 326.29, "view_count": 261308}, {"video_id": "zak0M2Wfz9U", "question": "Behind the soldier holding a long rifle, wearing a red military uniform, with a red ornament on the black hat, there is a blue pavilion. In which scenes does this style of pavilion appear?", "question_wo_referring_query": "In which scenes does this style of pavilion appear?", "candidates": ["In front of a tall white building", "On the main street, in front of a window on a rainy day", "In front of a tall white building, in front of a soldier wearing a green military uniform", "In front of a tall red building, in front of a window on a rainy day", "In front of a tall yellow building, in front of a window on a rainy day"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "zak0M2Wfz9U_0", "video_path": "zak0M2Wfz9U.mp4", "subtitle_path": "zak0M2Wfz9U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 503.67, "view_count": 2177884}, {"video_id": "7ExDRK_oc14", "question": "In front of a brown cupboard, there is a girl wearing glasses and a white short-sleeved shirt with yellow flowers embroidered on it. 
Which subtitles has this girl appeared in?", "question_wo_referring_query": "In front of a brown cupboard, there is a girl wearing glasses and a white short-sleeved shirt with yellow flowers embroidered on it. Which subtitles has this girl appeared in?", "candidates": ["Hello, welcome to my kitchen / My office lunch box is so pathetic", "Hello, welcome to my kitchen / Seeing something incorrect", "Hello, please leave my kitchen / My office lunch box is so pathetic", "Hello, welcome to my office / My kitchen is so pathetic", "Hello, welcome to my office / My office lunch box is so pathetic"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "7ExDRK_oc14_0", "video_path": "7ExDRK_oc14.mp4", "subtitle_path": "7ExDRK_oc14_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 387.8, "view_count": 231282}, {"video_id": "7ExDRK_oc14", "question": "In front of a brown shelf, a girl wearing glasses and a white short-sleeved shirt with yellow flowers is holding a rice cooker. 
What subtitles appeared with this rice cooker?", "question_wo_referring_query": "What subtitles appeared with this rice cooker?", "candidates": ["Hello, welcome to my apartment / The tree leaves are getting better and better", "My apartment lunch box is so pathetic / The tree leaves are getting better and better", "Now we need to cook some rice / My apartment lunch box is so pathetic", "Now we need to cook some rice / The tree leaves are getting better and better", "Now we need to cook some rice / Hello, welcome to my apartment"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "7ExDRK_oc14_1", "video_path": "7ExDRK_oc14.mp4", "subtitle_path": "7ExDRK_oc14_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 387.8, "view_count": 231282}, {"video_id": "iTo2n4ayiJU", "question": "On the pink background board, there is a pink flower in the upper right corner. A pair of hands holding a pen with a purple cap places a blank notebook in the center of the screen. After the hands leave the screen, what change occurs to the notebook?", "question_wo_referring_query": "What change occurs to the notebook?", "candidates": ["Pink patterns appear on the notebook", "Some purple rectangles appear on the notebook", "A smiley face appears on the notebook", "No change occurs to the notebook", "Purple patterns appear on the notebook"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "iTo2n4ayiJU_0", "video_path": "iTo2n4ayiJU.mp4", "subtitle_path": "iTo2n4ayiJU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 234.61, "view_count": 112057}, {"video_id": "iTo2n4ayiJU", "question": "In the upper right corner of the pink background board, there's a pink flower, an empty white notebook is placed in the center of the screen, and the upper left side of the notebook has the text 'Example 1'. 
After the notebook is replaced with a piece of paper containing text, what change occurs to the text 'Example 1'?", "question_wo_referring_query": "What change occurs to the text 'Example 1'?", "candidates": ["'Example 1' changes to 'Example 3'", "'Example 1' changes to 'Step 3'", "'Example 1' has not changed", "'Example 1' changes to 'Example 2'", "'Example 1' changes to 'Step 2'"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "iTo2n4ayiJU_1", "video_path": "iTo2n4ayiJU.mp4", "subtitle_path": "iTo2n4ayiJU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 234.61, "view_count": 112057}, {"video_id": "dqwpQarrDwk", "question": "The screen is divided into four rectangular smaller screens. In the top right corner, there is a white cloud over a few ships floating on the blue ocean. Which object does not appear in the entire screen?", "question_wo_referring_query": ", which object does not appear in the entire screen?", "candidates": ["The arrow", "The train", "The green forest", "The bridge", "The purple house"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "dqwpQarrDwk_0", "video_path": "dqwpQarrDwk.mp4", "subtitle_path": "dqwpQarrDwk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 494.5, "view_count": 12034017}, {"video_id": "dqwpQarrDwk", "question": "At the top of the screen, on the right side, there is a small star with blue, green, and white colors. In the middle, there are eight small calendars. On the left side, there is a small red star. 
What object is at the bottom of the screen?", "question_wo_referring_query": "What object is at the bottom of the screen?", "candidates": ["A purple cartoon bird", "A flying rocket", "A train", "Several astronauts", "A spaceship"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "dqwpQarrDwk_1", "video_path": "dqwpQarrDwk.mp4", "subtitle_path": "dqwpQarrDwk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 494.5, "view_count": 12034017}, {"video_id": "4MI-j2bTgds", "question": "In a sunlit room, what object appears on the screen when a person with long golden hair, wearing a blue sweatshirt, stretches on the bed and the subtitle 'I'm a certified holistic nutritionist' appears?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["white pillow", "red lamp", "a cat", "red pillow", "white bedside table"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "4MI-j2bTgds_0", "video_path": "4MI-j2bTgds.mp4", "subtitle_path": "4MI-j2bTgds_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 420.3, "view_count": 31088}, {"video_id": "4MI-j2bTgds", "question": "On a brown wooden floorboard, a black and white cat is lazily lying on its side on a chair with a backrest. 
What object is present on the screen when the subtitle says 'what motivated me to just see exercise'?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["A black plate containing yellow and green food", "A bell", "A colorful rug", "A door", "A white plate containing yellow and green food"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "4MI-j2bTgds_1", "video_path": "4MI-j2bTgds.mp4", "subtitle_path": "4MI-j2bTgds_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 420.3, "view_count": 31088}, {"video_id": "wUkXTHnfM6E", "question": "On a gray baking tray lined with white parchment paper, there are four rows of small round white dough balls. One person is holding a round transparent bowl in one hand, and brushing the dough balls with a brush in the other hand. What is the color and state of the substance inside the round transparent bowl?", "question_wo_referring_query": ", what is the color and state of the substance inside the round transparent bowl?", "candidates": ["White liquid", "Green solid", "Green paste", "Green liquid", "Black liquid"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wUkXTHnfM6E_0", "video_path": "wUkXTHnfM6E.mp4", "subtitle_path": "wUkXTHnfM6E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 249.75, "view_count": 111829}, {"video_id": "wUkXTHnfM6E", "question": "On a green grassy background, a hand wearing a ring is pressing on three cylindrical white objects, while another hand is holding a knife ready to slice. 
What material is the white cylindrical object made of?", "question_wo_referring_query": "What material is the white cylindrical object made of?", "candidates": ["Wood material", "Metal material", "Glass material", "Aluminum alloy material", "Foam material"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wUkXTHnfM6E_1", "video_path": "wUkXTHnfM6E.mp4", "subtitle_path": "wUkXTHnfM6E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 249.75, "view_count": 111829}, {"video_id": "liQH1O9WW7g", "question": "Who is controlling the red dot against the white background on a screen, moving it inside a square with 49 smaller squares, where the top right 9 squares are green, but the rest are white?", "question_wo_referring_query": "Who is moving the red dot against the white background on a screen?", "candidates": ["A woman wearing a blue suit and no glasses", "A man wearing a gray suit and glasses", "A man wearing a blue short-sleeve shirt and no glasses", "A man wearing a white suit and glasses", "A man wearing a blue suit and no glasses"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "liQH1O9WW7g_0", "video_path": "liQH1O9WW7g.mp4", "subtitle_path": "liQH1O9WW7g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.2, "view_count": 15}, {"video_id": "liQH1O9WW7g", "question": "Among the 7x7 grid on the left side of the white screen, which small squares turned blue first?", "question_wo_referring_query": "Which small squares turned blue first?", "candidates": ["The four small squares at the top left corner", "The nine small squares at the bottom left corner", "The nine small squares at the bottom right corner", "The nine small squares at the top right corner", "The nine small squares in the center"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", 
"level": "IntraMoment", "id": "liQH1O9WW7g_1", "video_path": "liQH1O9WW7g.mp4", "subtitle_path": "liQH1O9WW7g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.2, "view_count": 15}, {"video_id": "U0kII5dUXKg", "question": "In the red background with a solar shape, what happens the first time a red duck-shaped icon appears in the bottom right corner?", "question_wo_referring_query": "What happens?", "candidates": ["A white rectangle pops up on the screen", "A black rectangle pops up on the screen", "A white circle pops up on the screen", "A white square pops up on the screen", "A red square pops up on the screen"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "U0kII5dUXKg_0", "video_path": "U0kII5dUXKg.mp4", "subtitle_path": "U0kII5dUXKg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 259.8, "view_count": 416192}, {"video_id": "U0kII5dUXKg", "question": "What happened the first time a red circle with check marks and a horizontal line pattern appeared on the very left side in the background with red and blue solid and dashed arrows?", "question_wo_referring_query": "What happened?", "candidates": ["A black rectangle appeared on the very left side.", "A small red icon of a flag appeared on the right side.", "A small red icon of a person appeared on the right side.", "A white rectangle appeared at the bottom.", "A white rectangle appeared at the top."], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "U0kII5dUXKg_1", "video_path": "U0kII5dUXKg.mp4", "subtitle_path": "U0kII5dUXKg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 259.8, "view_count": 416192}, {"video_id": "ui4P0z1U13U", "question": "On the brown wooden dining table, there is a striped placemat and a square shredder. Inside a glass bowl, there are yellow beans and water. 
What happens when the subtitle first shows 'Applause'?", "question_wo_referring_query": "On the brown wooden dining table, there is a striped placemat and a square shredder. Inside a glass bowl, there are yellow beans and water. What happens when the subtitle first shows 'Applause'?", "candidates": ["Pouring green chili into the shredder", "A hand is holding a glass bowl and pouring brown powder into the shredder", "Covering it with a black lid", "Pouring red chili into the shredder", "Pouring brown liquid inside"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ui4P0z1U13U_0", "video_path": "ui4P0z1U13U.mp4", "subtitle_path": "ui4P0z1U13U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 200.53, "view_count": 201304}, {"video_id": "ui4P0z1U13U", "question": "In the video, there is a blender containing yellow beans and red chili peppers. What happens when the word 'Music' first appears in the subtitles?", "question_wo_referring_query": "In the video, there is a blender containing yellow beans and red chili peppers. 
What happens when the word 'Music' first appears in the subtitles?", "candidates": ["The blender turns the yellow beans and red chili peppers into a liquid.", "Water is added to the blender.", "Yellow beans are poured out from the blender.", "Red chili peppers are poured out from the blender.", "Meatballs are added into oil for frying."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ui4P0z1U13U_1", "video_path": "ui4P0z1U13U.mp4", "subtitle_path": "ui4P0z1U13U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 200.53, "view_count": 201304}, {"video_id": "eRQJH094ukg", "question": "What happens after a bronze statue, characterized by a fit build, two rope-like designs on its shoulders, a long neck, and a rounded head, appears in the center of the screen with its back facing the camera?", "question_wo_referring_query": "What happens?", "candidates": ["The screen changes to the front view of the statue.", "The screen zooms out to show only the legs.", "The screen zooms in to show only the head of the statue.", "The screen zooms out to show only the buttocks.", "The camera shifts down to the statue's lower body and buttocks."], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "eRQJH094ukg_0", "video_path": "eRQJH094ukg.mp4", "subtitle_path": "eRQJH094ukg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 192.89, "view_count": 6037}, {"video_id": "eRQJH094ukg", "question": "In the exhibition hall with paintings hanging in the background, what happened on screen before the elderly man dressed in a black coat and holding a black cane appeared?", "question_wo_referring_query": "In the exhibition hall with paintings hanging in the background, what happened on screen before the elderly man dressed in a black coat and holding a black cane appeared?", "candidates": ["Only an enlarged face of a human 
figure sculpture appeared on the screen", "A black human figure sculpture covered densely with prickly iron needles appeared on the far right of the screen", "A black human figure sculpture covered densely with prickly iron needles appeared in the middle of the screen", "A black human figure sculpture covered densely with prickly iron needles appeared on the far left of the screen", "A human figure sculpture appeared with its back facing the camera"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "eRQJH094ukg_1", "video_path": "eRQJH094ukg.mp4", "subtitle_path": "eRQJH094ukg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 192.89, "view_count": 6037}, {"video_id": "Jj2wA5FING4", "question": "What happened after the scene with a white rabbit with black eyes sitting on a sofa, with a red floral cushion featuring a bunny cartoon character on the sofa as well, and the subtitle mentioning 'long time, I found myself on my own. Well, I did enjoy the company of my cat, as my dog was..'?", "question_wo_referring_query": "What happened?", "candidates": ["A small black dog wagging its tail", "A man picking fruit from a tree", "A bird-shaped pendant swinging in the wind", "A woman in a red dress walking beside dry reeds", "A woman picking fruit from a tree"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "Jj2wA5FING4_0", "video_path": "Jj2wA5FING4.mp4", "subtitle_path": "Jj2wA5FING4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 508.05, "view_count": 248423}, {"video_id": "Jj2wA5FING4", "question": "In the scene, one delicate hand is holding many small green beads, while the other hand is holding a thin needle already threaded with string. After the subtitle 'of pixie dust and magic and wonder. 
I have a feeling the new year will provide just that' appears, what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A knot is tied in the thin string", "The thin string is removed from the needle", "Each small bead is threaded onto the thin string one by one", "The beads are placed into a box", "The beads are thrown away"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "Jj2wA5FING4_1", "video_path": "Jj2wA5FING4.mp4", "subtitle_path": "Jj2wA5FING4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 508.05, "view_count": 248423}, {"video_id": "oESjdMne55w", "question": "What object appears first after the subtitle says 'productive but realistically so let's go' with a woman lying on white sheets and white pillows and holding an eyeball?", "question_wo_referring_query": "What object appears first?", "candidates": ["A brown door", "A banana tree", "A cat", "A stone bench", "A transparent glass of water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "oESjdMne55w_0", "video_path": "oESjdMne55w.mp4", "subtitle_path": "oESjdMne55w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 452.85, "view_count": 38673}, {"video_id": "oESjdMne55w", "question": "A woman with long hair wearing a grey wool coat stands to the right of the window and pulls open a curtain with floral patterns. Outside the window, greenery gradually comes into view. 
In the subtitles, after the phrase 'the tip I mentioned of making a habit,' what is the first item that appears?", "question_wo_referring_query": "What is the first item that appears?", "candidates": ["a kitchen", "a juicer", "a cow cat", "a computer", "a mirror"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "oESjdMne55w_1", "video_path": "oESjdMne55w.mp4", "subtitle_path": "oESjdMne55w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 452.85, "view_count": 38673}, {"video_id": "vpujwSIfWSc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a voting result with a white background containing percentages appears, then an image with a blue background featuring a tank in the middle, and finally, two English sentences with a white background appear on the blue background image.", "First, an image with a blue background featuring a tank in the middle appears, then two English sentences with a white background appear on the blue background image, and finally, a voting result with a black background containing percentages appears.", "First, two English sentences with a white background appear on the blue background image, then an image with a blue background featuring a tank in the middle appears, and finally, a voting result with a black background containing percentages appears.", "First, two English sentences with a white background appear on the blue background image, then a voting result with a black background containing percentages appears, and finally, an image with a blue background featuring a tank in the middle appears.", "First, an image with a blue background featuring a tank in the middle appears, then a voting result with a black background containing percentages appears, and finally, two English sentences with a white background 
appear on the blue background image."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "vpujwSIfWSc_0", "video_path": "vpujwSIfWSc.mp4", "subtitle_path": "vpujwSIfWSc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 424.93, "view_count": 27695}, {"video_id": "urmNd2sOVHc", "question": "On a brown chopping board, there are many white cauliflower florets cut into small pieces. Where else do these white cauliflower florets appear?", "question_wo_referring_query": "On a brown chopping board, there are many white cauliflower florets cut into small pieces. Where else do these white cauliflower florets appear?", "candidates": ["In a rectangular white dish", "In a rectangular glass bowl", "In a round glass bowl", "In a round white bowl", "In a rectangular blue dish"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "urmNd2sOVHc_0", "video_path": "urmNd2sOVHc.mp4", "subtitle_path": "urmNd2sOVHc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 390.0, "view_count": 1497317}, {"video_id": "urmNd2sOVHc", "question": "On a rectangular glass bowl with a mixed white and green pattern at the bottom, some sliced meat has been placed. 
Where else have these sliced pieces of meat appeared?", "question_wo_referring_query": "Where else have these slices of meat appeared?", "candidates": ["In the refrigerator", "On the olive-colored cutting board", "On the blue cutting board", "In the oven", "On the white cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "urmNd2sOVHc_1", "video_path": "urmNd2sOVHc.mp4", "subtitle_path": "urmNd2sOVHc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 390.0, "view_count": 1497317}, {"video_id": "KXmpdJO9UOc", "question": "Which subtitles have appeared together with a man wearing a gray short-sleeve shirt and black pants standing in front of a satellite map with terrain distribution?", "question_wo_referring_query": "Which subtitles have appeared together?", "candidates": ["but they can still clearly be seen", "JET ENGINE NOISE", "from Hong Kong to Honduras", "KA-CHING", " BALLOON BURST"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "KXmpdJO9UOc_0", "video_path": "KXmpdJO9UOc.mp4", "subtitle_path": "KXmpdJO9UOc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.56, "view_count": 4071535}, {"video_id": "KXmpdJO9UOc", "question": "In the cloudy and dark sky of Wu Yunmi, an airplane is flying through the clouds, appearing and disappearing in the sky. 
Which subtitle has this airplane appeared with?", "question_wo_referring_query": ", which subtitles has this airplane appeared with in the sky?", "candidates": ["while the politicians bicker about whether or not Boris Island will ever actually be built", "millions of people", "London", "KA-CHING", "JET ENGINE NOISE"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "KXmpdJO9UOc_1", "video_path": "KXmpdJO9UOc.mp4", "subtitle_path": "KXmpdJO9UOc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.56, "view_count": 4071535}, {"video_id": "6_dBQFUnwQI", "question": "What change occurred to the color of the pink-colored region labeled with the names of European countries at the top?", "question_wo_referring_query": "What change occurred to the color?", "candidates": ["The European region changed to blue", "The European region changed to purple", "The European region changed to yellow", "The European region changed to green", "The European region changed to black"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "6_dBQFUnwQI_0", "video_path": "6_dBQFUnwQI.mp4", "subtitle_path": "6_dBQFUnwQI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.23, "view_count": 190794}, {"video_id": "6_dBQFUnwQI", "question": "In the satellite imagery, there is a large landmass covered in purple. 
What change happened to the purple-covered area?", "question_wo_referring_query": "What change occurred?", "candidates": ["The purple area turned blue", "The Chinese translation of 'Turkey' appeared in the purple area", "The purple color disappeared", "The purple area turned red", "The word 'Turkey' in English appeared in the purple area"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "6_dBQFUnwQI_1", "video_path": "6_dBQFUnwQI.mp4", "subtitle_path": "6_dBQFUnwQI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.23, "view_count": 190794}, {"video_id": "o0CHyygAQUY", "question": "When the girl, who is initially walking in the green meadow wearing a round hat and sporting a single ponytail, says in the subtitles 'believe that an abundant and peaceful home life can only be created, not found,' what changes occur?", "question_wo_referring_query": "What changes occur?", "candidates": ["She takes off her hat", "She ties her hair into a bun", "She ties her hair into two ponytails", "She ties her hair into a high ponytail", "She lets her hair down from the ponytail"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "o0CHyygAQUY_0", "video_path": "o0CHyygAQUY.mp4", "subtitle_path": "o0CHyygAQUY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 415.79, "view_count": 745879}, {"video_id": "o0CHyygAQUY", "question": "At the beginning of the scene, a girl wearing a gray coat, a straw hat, and a single-sided hemp flower scarf is seen walking through the grass. 
When she says, 'time in town and then be able to go home to a more peaceful haven when I feel like it,' what change occurs?", "question_wo_referring_query": "What change occurs?", "candidates": ["Took off the hat", "Switched to a blue coat", "Switched to a red dress", "Took off the gray coat", "Took off the white coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "o0CHyygAQUY_1", "video_path": "o0CHyygAQUY.mp4", "subtitle_path": "o0CHyygAQUY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 415.79, "view_count": 745879}, {"video_id": "9mpPyy_257I", "question": "In a corner with green plants, there's a kitchen with an olive-colored stove with a red small pot on the right side and a dark-colored pot lid on the left. A black cooking pot is on the stove, and a pair of hands with flower sleeves are holding a silver bowl and a spatula. The bowl contains yellow ingredients. What are these hands doing?", "question_wo_referring_query": "What are these hands doing?", "candidates": ["Stirring the yellow ingredients", "Placing the spatula into the pot", "Scooping up the yellow ingredients", "Pouring the yellow ingredients into the pot", "Adding sesame to the ingredients"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "9mpPyy_257I_0", "video_path": "9mpPyy_257I.mp4", "subtitle_path": "9mpPyy_257I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 513.64, "view_count": 419424}, {"video_id": "9mpPyy_257I", "question": "On the olive-colored wooden patterned board, there is a yellow knife holder on the left side of the board, with a white small bowl in the center of the board. In front of the bowl are some food ingredients. A hand is holding a blue-handled silver kitchen tool. 
What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Washing large garlic bulbs", "Using a kitchen tool to crush garlic", "Using a knife to slice garlic", "Using a knife to smash garlic", "Putting the whole garlic into the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "9mpPyy_257I_1", "video_path": "9mpPyy_257I.mp4", "subtitle_path": "9mpPyy_257I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 513.64, "view_count": 419424}, {"video_id": "N-SLWh1pVz8", "question": "In the blue background's bottom left corner, there is a small clip, with a hand holding a white ball that is connected through a rod. The white ball on the top has a paper strip with the letter 'S', and the white ball on the left has a paper strip with the letter 'V'. What kind of rod is connecting the white balls?", "question_wo_referring_query": "What kind of rod is connecting the white balls?", "candidates": ["Silver metallic fine rod", "Green plastic fine rod", "Thick metallic fine rod", "Wooden pencil-shaped fine rod", "Thick green plastic fine rod"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "N-SLWh1pVz8_0", "video_path": "N-SLWh1pVz8.mp4", "subtitle_path": "N-SLWh1pVz8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 282.65999999999997, "view_count": 40196}, {"video_id": "1xsw7U47UQo", "question": "The interface is similar to a mobile phone's screen, with signal, battery, and time icons in the top right corner, a blue calendar box in the middle, as well as a yellow sun and white clouds. At the bottom, there are four square icons, and the subtitle shows 'Music'. 
What do the four square icons at the bottom look like?", "question_wo_referring_query": "What do the four square icons at the bottom look like?", "candidates": ["An animal icon, a sun icon, an information icon, and a globe icon", "An animal icon, a musical note icon, an information icon, and a globe icon", "An animal icon, a sun icon, an information icon, and a star icon", "A fruit icon, a musical note icon, an information icon, and a globe icon", "A gear icon, a musical note icon, an information icon, and a globe icon"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "1xsw7U47UQo_0", "video_path": "1xsw7U47UQo.mp4", "subtitle_path": "1xsw7U47UQo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.87, "view_count": 236211}, {"video_id": "1xsw7U47UQo", "question": "With a green background with small flowers, on the left is a white puppy and a person in front of a TV screen wearing yellow pants, on the right is a woman wearing glasses and holding a mobile phone dressed in green. The subtitle appears 'here's a video of my friend's dog gus um'. What type of glasses frame is the woman wearing?", "question_wo_referring_query": "What type of glasses frame is the woman wearing?", "candidates": ["Pentagram frame", "Fashionable inverted triangle frame", "Formal rectangular frame", "Square frame", "Beautiful round frame"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "1xsw7U47UQo_1", "video_path": "1xsw7U47UQo.mp4", "subtitle_path": "1xsw7U47UQo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.87, "view_count": 236211}, {"video_id": "1tyWqK9z8DQ", "question": "On a white background with a formula written on it, the right side of the formula is framed, and below the frame is an arrow symbol. 
In the upper right corner of the screen, there is a man in a purple shirt wearing black-framed glasses, looking down and talking. When the subtitle 'eventually end up with this function so' appears, what happens to the red dot on the screen?", "question_wo_referring_query": "What happens to the red dot on the screen?", "candidates": ["The red dot on the screen turns into a green dot", "The red dot on the screen turns into a black dot", "The red dot on the screen moves downward", "The red dot on the screen moves upward", "The red dot on the screen temporarily disappears"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "1tyWqK9z8DQ_0", "video_path": "1tyWqK9z8DQ.mp4", "subtitle_path": "1tyWqK9z8DQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 311, "duration": 252.0, "view_count": 1826}, {"video_id": "1tyWqK9z8DQ", "question": "On a white background with a formula written on it, the formula is framed on the right side, with an arrow symbol below, and in the upper right corner of the screen, a man wearing purple with black-framed glasses is looking down and speaking. 
When the subtitle shows 'term is just a measure of entropy over', what happened to the man in the upper right corner?", "question_wo_referring_query": "What happened to the man in the upper right corner?", "candidates": ["The man in purple tilted his face backward.", "The man in purple adjusted his collar.", "The man in purple pushed his glasses.", "The man in purple raised his face.", "The man in purple lowered his face."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "1tyWqK9z8DQ_1", "video_path": "1tyWqK9z8DQ.mp4", "subtitle_path": "1tyWqK9z8DQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 311, "duration": 252.0, "view_count": 1826}, {"video_id": "TcQYR8LhCSI", "question": "On the yellow table by the window, there is a bag and a spool of thread. In the upper right corner of the table, there are rolled papers, and on the left side, there are some books. A black, yellow, and white cat jumped onto the table. What did the cat do after jumping onto the table?", "question_wo_referring_query": "What did the cat do after jumping onto the table?", "candidates": ["The cat went to eat cat food", "The cat climbed to the place where the owner sleeps", "The cat jumped onto the refrigerator", "The cat went to drink water", "The cat pushed the paper off the table"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "TcQYR8LhCSI_0", "video_path": "TcQYR8LhCSI.mp4", "subtitle_path": "TcQYR8LhCSI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 461.23, "view_count": 215077}, {"video_id": "TcQYR8LhCSI", "question": "On the yellow floor, there is a black, grey, and white cat eating cat food in front of a white cat feeder. Beside the feeder, there is a black cord and blue light. An orange table is to the side of the cat. 
After this cat finishes eating, what does it do next?", "question_wo_referring_query": "After the cat finishes eating, what does it do next?", "candidates": ["The cat goes to drink water", "The cat goes to sleep", "The cat goes to scratch the scratching post", "The cat starts grooming itself", "The cat sleeps on the shoe"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "TcQYR8LhCSI_1", "video_path": "TcQYR8LhCSI.mp4", "subtitle_path": "TcQYR8LhCSI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 461.23, "view_count": 215077}, {"video_id": "Xgec36qdOJs", "question": "On the streets and alleys with tangled wires overhead, there is a house wrapped in green leather behind the street. On the left side, there's a house with a silver-white wall and green-red exterior. On the right side, there are traces of water. A man in a blue short sleeve with a red backpack and another man in a black coat are walking together. Which car passes by them first?", "question_wo_referring_query": "Which car passes by them first?", "candidates": ["A white car with a white license plate", "A white car with a red license plate", "A blue car", "A red car", "A black car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "Xgec36qdOJs_0", "video_path": "Xgec36qdOJs.mp4", "subtitle_path": "Xgec36qdOJs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.24, "view_count": 15515}, {"video_id": "Xgec36qdOJs", "question": "Under the blue sky, a ferry is traveling on the green sea. You can see the land in the distance at the horizon. Waves splash near the white railing. 
Which character's silhouette appears first on the ferry?", "question_wo_referring_query": "Which character's silhouette appears first on the ferry?", "candidates": ["A woman in a red jacket", "A boy in a blue uniform", "A boy in a black short-sleeve shirt with a character print", "A woman wearing sunglasses and a white short-sleeve shirt", "A woman in a light-colored dress with exposed shoulders"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "Xgec36qdOJs_1", "video_path": "Xgec36qdOJs.mp4", "subtitle_path": "Xgec36qdOJs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.24, "view_count": 15515}, {"video_id": "OmNVB3ff98s", "question": "The room door is white, and there is a black table on which a laptop computer and an electronic device with colorful buttons are placed. A black stand is on the table. A man in a light-colored short sleeve shirt is talking. After the subtitle 'productivity journey which is' appears, what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["Only one black word appears on the screen", "A series of black characters appears on the screen", "Only one white word appears on the screen", "A series of red characters appears on the screen", "A series of white characters appears on the screen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "OmNVB3ff98s_0", "video_path": "OmNVB3ff98s.mp4", "subtitle_path": "OmNVB3ff98s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 533.48, "view_count": 536}, {"video_id": "dWoRqVhFmk4", "question": "On a wooden bamboo-colored tray, there is a ceramic bowl containing unskinned chicken meat. On the floor, there are green plants and a plastic box. 
What subtitle appears together with the chicken meat in the bowl?", "question_wo_referring_query": "What subtitle appears together with the chicken meat in the bowl?", "candidates": ["Fry~1minute", "Milk 150 ml", "Mushrooms 0.5 lb", "Tomatoes 0.5 lb", "Paprika"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "dWoRqVhFmk4_0", "video_path": "dWoRqVhFmk4.mp4", "subtitle_path": "dWoRqVhFmk4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 325.87, "view_count": 33099}, {"video_id": "dWoRqVhFmk4", "question": "On a wooden brown cutting board, a hand wearing a plain ring is using a knife to handle food ingredients. The knife has black text and a red circular floral pattern on its blade. What subtitle appears together with this knife?", "question_wo_referring_query": "What subtitle appears together with this knife?", "candidates": ["Salt", "Mushrooms 0.5 lb", "Paprika", "Black pepper", "Hello everyone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "dWoRqVhFmk4_1", "video_path": "dWoRqVhFmk4.mp4", "subtitle_path": "dWoRqVhFmk4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 325.87, "view_count": 33099}, {"video_id": "RDg0FHqFqxg", "question": "On the white background, there is blue English text at the top, a rectangular white frame composed of black lines on the left, and a blank white area on the right. In the top right corner, there is a water droplet icon. 
When the white frame on the left side changes to a black frame with white characters, what happens to the white area on the right?", "question_wo_referring_query": "What happens to the white area on the right?", "candidates": ["A blue triangle and black text appear in the white area.", "A red triangle and black text appear in the white area.", "A red square and characters appear in the white area.", "A blue square and black text appear in the white area.", "A blue pentagon and black text appear in the white area."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "RDg0FHqFqxg_0", "video_path": "RDg0FHqFqxg.mp4", "subtitle_path": "RDg0FHqFqxg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.03, "view_count": 81}, {"video_id": "N--VoSx0vT8", "question": "On the red-stained table, there is a transparent bowl containing yellow soybeans. When the white steam rises from the bowl and the caption reads 'Add 5g of salt (1 tsp)', what changes occur to the soybeans?", "question_wo_referring_query": "When the caption reads 'Add 5g of salt (1 tsp)', what changes occur to the soybeans?", "candidates": ["The soybeans turn into little chunks.", "The soybeans turn into golden roasted soybeans.", "The soybeans turn into soybean paste.", "The soybeans are cut into soybean threads.", "The soybeans turn into soybean granules."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "N--VoSx0vT8_0", "video_path": "N--VoSx0vT8.mp4", "subtitle_path": "N--VoSx0vT8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 348.97, "view_count": 5947}, {"video_id": "N--VoSx0vT8", "question": "A half-cut red chili is placed on a red-brown table, with the cross-section of the chili facing upwards. In the bottom right corner is the other half-cut red chili. 
When the subtitle 'Then cut it into dices' appears, what change happens to the chili?", "question_wo_referring_query": "What change happens to the chili?", "candidates": ["The chili is cut into dices", "The chili is carved into floral patterns", "The chili is cut into strips", "The chili is cut into chili shreds", "The chili is juiced"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "N--VoSx0vT8_1", "video_path": "N--VoSx0vT8.mp4", "subtitle_path": "N--VoSx0vT8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 348.97, "view_count": 5947}, {"video_id": "HRz-jH4CAy8", "question": "In the screen with an orange background, on the left there are two 'C's and a black line, in the middle there are two 'C's and two black lines, on the right there are three black lines, and there's a hand with manicured nails. What is the hand with manicured nails doing in the video?", "question_wo_referring_query": "What is the hand with manicured nails doing in the video?", "candidates": ["The finger is pressing on a 'C' shaped paper strip", "The palm is pressing on a horizontal paper strip", "Making a fist", "The finger is pressing on a horizontal paper strip"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "HRz-jH4CAy8_0", "video_path": "HRz-jH4CAy8.mp4", "subtitle_path": "HRz-jH4CAy8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.42, "view_count": 458301}, {"video_id": "HRz-jH4CAy8", "question": "In the middle of the screen, there's a hand with beautifully manicured nails, and below the hand is a piece of paper with letters on it. There are three clips in the top right corner and bottom left corner. 
What is the hand in the video doing?", "question_wo_referring_query": "In the middle of the screen, there's a hand with beautifully manicured nails, and below the hand is a piece of paper with letters on it. There are three clips in the top right corner and bottom left corner. What is the hand in the video doing?", "candidates": ["The hand is resting flat on the paper.", "The hand is picking up the paper.", "Pointing to the paper with an extended finger.", "Holding a pen and writing on the paper."], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "HRz-jH4CAy8_1", "video_path": "HRz-jH4CAy8.mp4", "subtitle_path": "HRz-jH4CAy8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.42, "view_count": 458301}, {"video_id": "3Wd70GBa62c", "question": "In front of a white door, there is a woman wearing deep blue with blonde hair. When the subtitle mentions 'my hobbies, and experiences,' what color is the potted plant next to the blonde woman?", "question_wo_referring_query": "What color is the potted plant next to the blonde woman?", "candidates": ["White pot with red flowers", "White pot with colorful plant", "White pot with green plant", "Green pot with green plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "3Wd70GBa62c_0", "video_path": "3Wd70GBa62c.mp4", "subtitle_path": "3Wd70GBa62c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.53, "view_count": 867742}, {"video_id": "3Wd70GBa62c", "question": "A blonde woman is standing in front of a door, her right hand resting on the door frame. When the subtitles say 'my internal spirit. 
I continue to save and invest in my dreams ensuring that my future will be that,' what color is the door in front of which the woman is standing?", "question_wo_referring_query": "A blonde woman is standing in front of a door, her right hand resting on the door frame. When the subtitles say 'my internal spirit. I continue to save and invest in my dreams ensuring that my future will be that,' what color is the door in front of which the woman is standing?", "candidates": ["yellow", "green", "white", "blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "3Wd70GBa62c_1", "video_path": "3Wd70GBa62c.mp4", "subtitle_path": "3Wd70GBa62c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.53, "view_count": 867742}, {"video_id": "s5UxN5ipLQE", "question": "On the right side of the screen, there's a black rabbit, and on the left side are some green vegetables growing on the ground. What does the rabbit do the first time it appears in the video?", "question_wo_referring_query": "What does the rabbit do the first time it appears in the video?", "candidates": ["Sits", "Eats carrot", "Eats vegetables", "Sleeps"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "s5UxN5ipLQE_1", "video_path": "s5UxN5ipLQE.mp4", "subtitle_path": "s5UxN5ipLQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 434.87, "view_count": 299963}, {"video_id": "iBx1qyEILLk", "question": "In the video, there is a blue and white circular icon and a black fish-like object. 
When the subtitle mentions 'surface periodically while performing an,' what state is the black fish-like object in?", "question_wo_referring_query": "What state is the black fish-like object in?", "candidates": ["Floating above the blue and white icon", "Floating below the blue and white icon", "Floating behind the blue and white icon", "Floating in the same position as the blue and white icon"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "iBx1qyEILLk_0", "video_path": "iBx1qyEILLk.mp4", "subtitle_path": "iBx1qyEILLk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 289.47, "view_count": 70229}, {"video_id": "iBx1qyEILLk", "question": "In the video, under a black background, there is a line of white text 'USS Holland John Hollan'. When the subtitle mentions 'Holland named after her designer John', what changes appear on the screen?", "question_wo_referring_query": "In the video, under a black background, there is a line of white text 'USS Holland John Hollan'. When the subtitle mentions 'Holland named after her designer John', what changes appear on the screen?", "candidates": ["The white text 'USS Holland' is lighter in color than 'John Hollan'", "The text 'John Holland' changes from dark to light color", "The text 'John Holland' turns red", "The text 'John Holland' turns black"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "iBx1qyEILLk_1", "video_path": "iBx1qyEILLk.mp4", "subtitle_path": "iBx1qyEILLk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 289.47, "view_count": 70229}, {"video_id": "7mvRNTe4mzA", "question": "In the bottom right corner of the screen, there is a man wearing a suit and glasses. To his right, there is a number 49 and a circle in the middle. At the top of the screen, there is a row of the largest letters that spell 'Neural Networks.' 
The man is holding a pen with his left hand and his right palm is facing up. What did this man do next?", "question_wo_referring_query": "What did this man do next?", "candidates": ["Right hand holding pen", "Left hand holding pen, right hand making a fist", "Left hand holding pen, right hand lowered", "Right hand placed behind the body"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "7mvRNTe4mzA_0", "video_path": "7mvRNTe4mzA.mp4", "subtitle_path": "7mvRNTe4mzA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 601, "duration": 306.0, "view_count": 271}, {"video_id": "7mvRNTe4mzA", "question": "In the bottom left corner of the screen, there is a man wearing a black coat and glasses. To his right is the number 50. At the top of the screen, there is a line of text \"Binary classifying an image\" written in a large font. Below that, there is another line of text \"x=1\u00d7784\". What event occurs when the man touches his face with his right hand in the video?", "question_wo_referring_query": "In the bottom left corner of the screen, there is a man wearing a black coat and glasses. To his right is the number 50. At the top of the screen, there is a line of text \"Binary classifying an image\" written in a large font. Below that, there is another line of text \"x=1\u00d7784\". What event occurs when the man touches his face with his right hand in the video?", "candidates": ["1 line of text disappears", "2 more lines of text", "1 more line of text", "3 more lines of text"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "7mvRNTe4mzA_1", "video_path": "7mvRNTe4mzA.mp4", "subtitle_path": "7mvRNTe4mzA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 601, "duration": 306.0, "view_count": 271}, {"video_id": "STNLkN68rw4", "question": "In front of a broken green object, two rows of people are standing in order. 
They are dressed in black clothing, wearing black hats, and are shaded under a tall red wall. There is a group of people dressed in green military uniforms. Which of these two groups of people appears first?", "question_wo_referring_query": "Which of these two groups of people appears first?", "candidates": ["They appear at the same time", "People dressed in green military uniforms", "People dressed in black clothing and black hats", "None of them appear"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "STNLkN68rw4_0", "video_path": "STNLkN68rw4.mp4", "subtitle_path": "STNLkN68rw4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.46, "view_count": 1687048}, {"video_id": "STNLkN68rw4", "question": "In a scene with a brownish tone as the base color and green at the bottom, there is a gray tank. And on a gray ground area with trees in the background, there is a damaged army green tank. Which of these two objects is introduced first?", "question_wo_referring_query": "Which of these two objects is introduced first?", "candidates": ["gray tank", "army green tank", "neither is introduced", "both introduced at the same time"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "STNLkN68rw4_1", "video_path": "STNLkN68rw4.mp4", "subtitle_path": "STNLkN68rw4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 197.46, "view_count": 1687048}, {"video_id": "O-1JmMST1xs", "question": "There is a woman in the video wearing a long-sleeve black top. She has her left hand behind her head, and there is a grayish-white curtain behind her. 
After the subtitle 'I have to share this with you guys' appears, what does she do?", "question_wo_referring_query": "What does she do?", "candidates": ["Places her left hand down", "Crosses her left hand with her right hand", "Clenches her left fist", "Touches her hair with her left hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "O-1JmMST1xs_0", "video_path": "O-1JmMST1xs.mp4", "subtitle_path": "O-1JmMST1xs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 463.12, "view_count": 15684}, {"video_id": "O-1JmMST1xs", "question": "A woman in a black top with braided hair is holding a white box in her right hand. After the subtitle 'up so it's like this guy and it comes' appears, what does this woman do next?", "question_wo_referring_query": "What does this woman do next?", "candidates": ["Points to the right with her right index finger", "Raises both hands above her head", "Points to the right with her left index finger", "Touches her hair with her left hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "O-1JmMST1xs_1", "video_path": "O-1JmMST1xs.mp4", "subtitle_path": "O-1JmMST1xs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 463.12, "view_count": 15684}, {"video_id": "Oh1zhJOzh4Y", "question": "In the video, there is a white sculpture's head. 
What appears after the phrase 'where people are creating things that' is mentioned?", "question_wo_referring_query": "What appears after the phrase 'where people are creating things that' is mentioned?", "candidates": ["A white sculpture with a broken left arm appears", "A white sculpture of a person with both arms broken appears", "A white sculpture with a broken right arm appears", "A sculpture of just a head appears"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Oh1zhJOzh4Y_0", "video_path": "Oh1zhJOzh4Y.mp4", "subtitle_path": "Oh1zhJOzh4Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.29, "view_count": 5723}, {"video_id": "Oh1zhJOzh4Y", "question": "In the exhibition hall, there is a white nude sculpture with a broken arm. Behind it, there is another white nude male sculpture. What appears after the phrase 'very important within art to make the' is mentioned?", "question_wo_referring_query": "What appears?", "candidates": ["A white sculpture with a broken left arm appears.", "Three headless, armless, white female nude sculptures appear.", "A white sculpture with a broken right arm appears.", "A white legless sculpture appears."], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Oh1zhJOzh4Y_1", "video_path": "Oh1zhJOzh4Y.mp4", "subtitle_path": "Oh1zhJOzh4Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.29, "view_count": 5723}, {"video_id": "Pnn51ypf11c", "question": "There is a dish with orange and white mixed food with some green bits on top placed on a tray in the video. 
Where else in the video does this food appear?", "question_wo_referring_query": "Where else in the video does this food appear?", "candidates": ["In the dining room", "In the cup", "In the trash can", "In a picture hanging on the wall"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Pnn51ypf11c_0", "video_path": "Pnn51ypf11c.mp4", "subtitle_path": "Pnn51ypf11c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.52, "view_count": 40730}, {"video_id": "Pnn51ypf11c", "question": "There is a man wearing a black coat and a blue inner layer on the screen. In front of the man, there are two rows of white letters. Where else does the man on the screen appear?", "question_wo_referring_query": "Where else does the man on the screen appear?", "candidates": ["In front of a wall with many photos", "Cafeteria", "Hamburger shop", "Fried chicken shop"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Pnn51ypf11c_1", "video_path": "Pnn51ypf11c.mp4", "subtitle_path": "Pnn51ypf11c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.52, "view_count": 40730}, {"video_id": "nxNqBg7GfLc", "question": "In the video, there is a woman with blonde hair wearing a black short-sleeve shirt sitting in the driver's seat of a car. Her fingers are spread apart, and there is a balloon on the back seat that is yellow on the top and pink on the bottom. In which subtitle does this balloon also appear?", "question_wo_referring_query": "In the video, there is a woman with blonde hair wearing a black short-sleeve shirt sitting in the driver's seat of a car. Her fingers are spread apart, and there is a balloon on the back seat that is yellow on the top and pink on the bottom. 
In which subtitle does this balloon also appear?", "candidates": ["I ended up ordering but I was ordering", "and we're gonna get started with this", "if you're near this area 10 out of 10", "meals that I'm gonna have cereal's gonna"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "nxNqBg7GfLc_0", "video_path": "nxNqBg7GfLc.mp4", "subtitle_path": "nxNqBg7GfLc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 541.2, "view_count": 21110}, {"video_id": "nxNqBg7GfLc", "question": "A long-haired woman wearing a black short-sleeved shirt is sitting in the driver's seat, and she has put on a black mask. In the video, with which subtitle does the woman wearing the black mask appear together?", "question_wo_referring_query": "In the video, with which subtitle does the woman wearing the black mask appear together?", "candidates": ["i ended up ordering but i was ordering", "meals that i'm gonna have cereal's gonna", "and we're gonna get started with this", "okay so we're in the bathroom of the"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "nxNqBg7GfLc_1", "video_path": "nxNqBg7GfLc.mp4", "subtitle_path": "nxNqBg7GfLc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 541.2, "view_count": 21110}, {"video_id": "QoxP2X-iWrs", "question": "There is a pair of hands in the frame, with a cutting board. On the cutting board, there is a vegetable, and the left hand is pressing on the vegetable. In the upper right corner of the frame, there is a blue basket. 
What is the right hand doing in the video?", "question_wo_referring_query": "What is the right hand doing in the video?", "candidates": ["Holding a knife to cut the vegetable", "No movement", "Putting down the knife", "Cutting meat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "QoxP2X-iWrs_0", "video_path": "QoxP2X-iWrs.mp4", "subtitle_path": "QoxP2X-iWrs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 424.26, "view_count": 542717}, {"video_id": "QoxP2X-iWrs", "question": "There is a red bowl with a mixture of various vegetables in the scene. There are also two wooden chopsticks in the bowl. On the left side, there are a few pieces of pancake. A pair of hands appears above the bowl. What are the hands doing in the video?", "question_wo_referring_query": "What are the hands doing in the video?", "candidates": ["Holding chopsticks", "Holding the bowl", "No action", "Tearing the pancake"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "QoxP2X-iWrs_1", "video_path": "QoxP2X-iWrs.mp4", "subtitle_path": "QoxP2X-iWrs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 424.26, "view_count": 542717}, {"video_id": "bMVLxR68gII", "question": "A man in a suit is standing in front of a white background, with the number 47 shown on his right side. In the top left corner of this white background, the word 'Convolution' is displayed on the third line. 
What other object is present on this screen?", "question_wo_referring_query": "What other object is present on this screen?", "candidates": ["computer", "pen", "teacup", "glasses"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "bMVLxR68gII_0", "video_path": "bMVLxR68gII.mp4", "subtitle_path": "bMVLxR68gII_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.64, "view_count": 25}, {"video_id": "bMVLxR68gII", "question": "A man in a suit is standing in front of a white background, with '50' displayed on his right side. The word 'Averages' is shown in the third row on the upper left corner of this white background. What other object appears on this screen?", "question_wo_referring_query": "A man in a suit is standing in front of a white background, with '50' displayed on his right side. The word 'Averages' is shown in the third row on the upper left corner of this white background. What other object appears on this screen?", "candidates": ["speaker", "teacup", "computer", "pen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "bMVLxR68gII_1", "video_path": "bMVLxR68gII.mp4", "subtitle_path": "bMVLxR68gII_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.64, "view_count": 25}, {"video_id": "STNxSEXZuQE", "question": "In a library, the walls are covered with wallpaper, there is a bookshelf beside it, and a man wearing short sleeves, when mentioning 'We covered the Ivorian flag in The Coat of Arms so Without Further Ado?' 
What action did this man do?", "question_wo_referring_query": "What action did this man do?", "candidates": ["He placed his left hand on his forehead", "He made a 'V' sign with both of his hands", "He extended two fingers of his right hand towards his left side", "He clenched both of his hands into fists"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "STNxSEXZuQE_0", "video_path": "STNxSEXZuQE.mp4", "subtitle_path": "STNxSEXZuQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 443.28, "view_count": 136823}, {"video_id": "STNxSEXZuQE", "question": "In a study, the walls are covered in wallpaper, there is a bookshelf on the side, and a man wearing short sleeves is holding a piece of paper. When he mentions 'Ran out of it like A long Time Ago now I have some more I got this letter from Thailand,Hi Paul or', what color shirt is the man wearing?", "question_wo_referring_query": "What color shirt is the man wearing?", "candidates": ["Black long-sleeve T-shirt", "Black short-sleeve T-shirt", "White short-sleeve", "Blue short-sleeve"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "STNxSEXZuQE_1", "video_path": "STNxSEXZuQE.mp4", "subtitle_path": "STNxSEXZuQE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 443.28, "view_count": 136823}, {"video_id": "WGI3XfWgW4I", "question": "In the frame, in front of a concrete wall, what did the woman wearing a black shirt do after mentioning 'it's like little dots and you need to'?", "question_wo_referring_query": "What did she do?", "candidates": ["She made a 'V' sign with both hands", "She placed her right hand beside her mouth", "She clenched her fists and spread them out to the sides", "She placed her left hand on her forehead"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "WGI3XfWgW4I_0", 
"video_path": "WGI3XfWgW4I.mp4", "subtitle_path": "WGI3XfWgW4I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 377.59, "view_count": 3426}, {"video_id": "WGI3XfWgW4I", "question": "A woman wearing clothes mainly in blue with black polka dots, having thick black hair, with many flowers behind her, and posters on the wall. After mentioning 'or that history is only sorry,' what is this woman doing?", "question_wo_referring_query": "what is this woman doing?", "candidates": ["She is making a 'V' sign with both hands", "She is holding flowers", "She is holding a piece of fruit", "She is putting her left hand on her forehead"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "WGI3XfWgW4I_1", "video_path": "WGI3XfWgW4I.mp4", "subtitle_path": "WGI3XfWgW4I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 377.59, "view_count": 3426}, {"video_id": "1cAK01jmTvo", "question": "In the scene, on the rocks by the seaside, there is a blond man wearing beige shorts. He stands on the rock. What did he do afterwards?", "question_wo_referring_query": "What did he do afterwards?", "candidates": ["frog jump", "high jump", "long jump", "jump into the water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "1cAK01jmTvo_0", "video_path": "1cAK01jmTvo.mp4", "subtitle_path": "1cAK01jmTvo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 568.69, "view_count": 1156380}, {"video_id": "1cAK01jmTvo", "question": "In the middle of a white building, there is a path. A woman wearing a gray and white striped top, a sun hat, and with an orange piece of clothing tied around her waist has her left foot forward. 
What does she do next?", "question_wo_referring_query": "What does she do next?", "candidates": ["She stands with both feet together", "She puts both hands behind her back", "She raises both hands", "She moves her right foot forward"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "1cAK01jmTvo_1", "video_path": "1cAK01jmTvo.mp4", "subtitle_path": "1cAK01jmTvo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 568.69, "view_count": 1156380}, {"video_id": "FYJLmjo1wJI", "question": "In the scene, who appears first among the man in the red coat, wearing a black hat and glasses, the man in the purple shirt and a red hat, or the woman in the gray short-sleeve shirt waving with her right hand?", "question_wo_referring_query": "Among these two scenes, who appears first?", "candidates": ["They all appear together.", "Among the six people, the man in the purple shirt and red hat, and the woman in the gray short-sleeve shirt waving with her right hand, appear first.", "The man in the red coat, wearing a black hat and glasses, appears first.", "The man in the white short-sleeve shirt, wearing a black hat, appears first."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "FYJLmjo1wJI_0", "video_path": "FYJLmjo1wJI.mp4", "subtitle_path": "FYJLmjo1wJI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 372.28, "view_count": 2589}, {"video_id": "FYJLmjo1wJI", "question": "In the scene, a man wearing a grey shirt, a red backpack, a black hat, and glasses, is shopping with a woman wearing a pink floral top, who is holding a black tag on a red coat in her left hand. 
Which person appears first?", "question_wo_referring_query": "Which person appears first?", "candidates": ["The man wearing a white short-sleeve shirt and a black hat appears first.", "They appear at the same time.", "The man wearing a grey shirt, a red backpack, a black hat, and glasses appears first.", "The woman wearing a pink floral top, holding a black tag on a red coat in her left hand, appears first."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "FYJLmjo1wJI_1", "video_path": "FYJLmjo1wJI.mp4", "subtitle_path": "FYJLmjo1wJI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 372.28, "view_count": 2589}, {"video_id": "19s55hpW52Y", "question": "In the video, there is a yellow cutting board placed on the table, with a white onion on it. A woman is holding the onion with her left hand and a knife in her right hand, placing it on the onion. After saying 'Hello everyone', what does this woman do?", "question_wo_referring_query": "What does this woman do after saying 'Hello everyone'?", "candidates": ["Put a lid on the pot", "Cut the potato with a knife", "Beat an egg soup with a whisk", "Beat an egg"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "19s55hpW52Y_0", "video_path": "19s55hpW52Y.mp4", "subtitle_path": "19s55hpW52Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 329.9, "view_count": 3038123}, {"video_id": "19s55hpW52Y", "question": "In the scene, there is a yellow cutting board on the table with 2 zucchinis on it. The woman is holding a zucchini in her left hand and a knife in her right hand, placing the knife on the zucchini. After the subtitle mentions 'Zucchini-2 pieces,' what does the woman do next?", "question_wo_referring_query": "In the scene, there is a yellow cutting board on the table with 2 zucchinis on it. 
The woman is holding a zucchini in her left hand and a knife in her right hand, placing the knife on the zucchini. After the subtitle mentions 'Zucchini-2 pieces,' what does the woman do next?", "candidates": ["The woman cuts the zucchini into chunks", "The woman puts the zucchini into a white bowl", "The woman cuts the zucchini into thin slices", "The woman puts the zucchini into a water bath"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "19s55hpW52Y_1", "video_path": "19s55hpW52Y.mp4", "subtitle_path": "19s55hpW52Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 329.9, "view_count": 3038123}, {"video_id": "2zzNTl7tcC8", "question": "Against a blue background, a woman wearing a black short-sleeve shirt and glasses mentions 'universally supported or strongly backed by science.' After this phrase, which item appears on the screen for the first time?", "question_wo_referring_query": "Which item appears on the screen for the first time after this phrase?", "candidates": ["a desk lamp", "a smartphone", "a table and a display", "a swivel chair"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "2zzNTl7tcC8_0", "video_path": "2zzNTl7tcC8.mp4", "subtitle_path": "2zzNTl7tcC8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 302.01, "view_count": 279539}, {"video_id": "2zzNTl7tcC8", "question": "Against a blue background, there is a woman wearing a black short-sleeve shirt and glasses. She is holding her hands in mid-air beside her body. 
What object appears after she mentions 'they've actually led to a new type of bad posture'?", "question_wo_referring_query": "What object appears?", "candidates": ["Wheelchair", "Desk", "Lamp", "Cell phone"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "2zzNTl7tcC8_1", "video_path": "2zzNTl7tcC8.mp4", "subtitle_path": "2zzNTl7tcC8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 302.01, "view_count": 279539}, {"video_id": "HQn1QKQYXVg", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, some code appears with green letters on top. Then, a flowchart on a white background appears. Finally, code with mathematical formulas appears.", "First, some code with green letters appears. Then, code with mathematical formulas appears. Finally, a flowchart on a white background appears.", "First, code with mathematical formulas appears. Then, a flowchart on a white background appears. Finally, some code with green letters on top appears.", "First, a flowchart on a white background appears. Then, some code with green letters on top appears. Finally, a flowchart on a white background appears."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "HQn1QKQYXVg_0", "video_path": "HQn1QKQYXVg.mp4", "subtitle_path": "HQn1QKQYXVg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 431, "duration": 445.0, "view_count": 37005}, {"video_id": "WTSKxSFtieg", "question": "In the video, there is a gray square on a yellow desktop, with a piece of dough on top of it. 
What changes happened to this piece of dough?", "question_wo_referring_query": "What changes happened to this piece of dough?", "candidates": ["The piece of dough was cut into 4 small pieces", "The piece of dough was divided into two halves", "The piece of dough was rolled into a ball", "The piece of dough was divided into 6 small pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "WTSKxSFtieg_0", "video_path": "WTSKxSFtieg.mp4", "subtitle_path": "WTSKxSFtieg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.67, "view_count": 179533}, {"video_id": "WTSKxSFtieg", "question": "In the kitchen, there's a man wearing a black short-sleeved shirt and a necklace. His left hand is resting on the shoulder of a woman who is also wearing a black top and a necklace. What changes occurred to the man's hand gesture?", "question_wo_referring_query": "What changes occurred to the man's hand gesture?", "candidates": ["He put his left hand on his forehead.", "He raised his left hand and extended three fingers.", "He made a victory sign with both hands and held them in front of his chest.", "He clenched both hands into fists and held them in front of his chest."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "WTSKxSFtieg_1", "video_path": "WTSKxSFtieg.mp4", "subtitle_path": "WTSKxSFtieg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 332.67, "view_count": 179533}, {"video_id": "bSaExGobdgA", "question": "A man wearing a floral-patterned shirt and black pants is sitting in the corner of a room, speaking to the camera. 
What action is the man performing at this moment?", "question_wo_referring_query": "What action is the man performing at this moment?", "candidates": ["Dancing", "Eating something", "The fingers of his left hand are pointing forward", "Drinking water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "bSaExGobdgA_0", "video_path": "bSaExGobdgA.mp4", "subtitle_path": "bSaExGobdgA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 295.79, "view_count": 3259}, {"video_id": "bSaExGobdgA", "question": "A man wearing a patterned shirt and a bracelet is leaning against a wall. What does the man do at this moment?", "question_wo_referring_query": "What does the man do at this moment?", "candidates": ["Eats something", "Turns the screen of the mirror to one side with his right hand", "Dances", "Drinks water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "bSaExGobdgA_1", "video_path": "bSaExGobdgA.mp4", "subtitle_path": "bSaExGobdgA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 295.79, "view_count": 3259}, {"video_id": "NlPbdl5ahwg", "question": "A woman with loose hair, wearing a black short-sleeve shirt and gray shorts, when the caption appears 'old me was terrified of putting milk or', what is the object that the woman is holding in her hand?", "question_wo_referring_query": "What is the object that the woman is holding in her hand?", "candidates": ["milk", "fries", "hamburger", "snow jade"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "NlPbdl5ahwg_0", "video_path": "NlPbdl5ahwg.mp4", "subtitle_path": "NlPbdl5ahwg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 557.17, "view_count": 36808}, {"video_id": "NlPbdl5ahwg", "question": "On a table, there are two small hamburgers and a drink beside 
them. When the subtitle 'really cute ambiance the food is pretty' appears, what other item is on the table?", "question_wo_referring_query": "What other item is on the table?", "candidates": ["chili pepper", "sunglasses", "piano", "television"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "NlPbdl5ahwg_1", "video_path": "NlPbdl5ahwg.mp4", "subtitle_path": "NlPbdl5ahwg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 557.17, "view_count": 36808}, {"video_id": "Ia2qfSVxFV8", "question": "In a rundown room with an iron door, there is a man standing outside the room holding a torch, wearing a black hat and brown boots, along with a woman wearing a grey headscarf. When the subtitle 'execution of 14 innocent women and six' appears, what color clothes is the woman with the grey headscarf wearing?", "question_wo_referring_query": "In a rundown room with an iron door, there is a man standing outside the room holding a torch, wearing a black hat and brown boots, along with a woman wearing a grey headscarf. When the subtitle 'execution of 14 innocent women and six' appears, what color clothes is the woman with the grey headscarf wearing?", "candidates": ["red", "yellow", "black", "white"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "Ia2qfSVxFV8_0", "video_path": "Ia2qfSVxFV8.mp4", "subtitle_path": "Ia2qfSVxFV8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.38, "view_count": 223581}, {"video_id": "Ia2qfSVxFV8", "question": "On a table, there is a gray cup with a pink lid, and next to it, a blue mask. A long-haired woman is standing with both hands on the table reading a script. 
When the subtitle 'professed the victims in a sense' appears, what color is the coat the long-haired woman is wearing?", "question_wo_referring_query": "On a table, there is a gray cup with a pink lid, and next to it, a blue mask. A long-haired woman is standing with both hands on the table reading a script. When the subtitle 'professed the victims in a sense' appears, what color is the coat the long-haired woman is wearing?", "candidates": ["Green", "White", "Yellow", "Black"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "Ia2qfSVxFV8_1", "video_path": "Ia2qfSVxFV8.mp4", "subtitle_path": "Ia2qfSVxFV8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.38, "view_count": 223581}, {"video_id": "CP9zZ3sHwHQ", "question": "On the right side of a piece of brown land, there are large black rocks. There is a wriggling object constantly moving among the black rocks. What is this object?", "question_wo_referring_query": ", what is this object?", "candidates": ["Swimming fish", "Flowing lava", "Clean river water", "Polluted river water"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "CP9zZ3sHwHQ_0", "video_path": "CP9zZ3sHwHQ.mp4", "subtitle_path": "CP9zZ3sHwHQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 560.8, "view_count": 9875}, {"video_id": "CP9zZ3sHwHQ", "question": "Amidst a stretch of continuous black mountains, there are some orange-red objects scattered on the mountain peak in the middle. In the upper right corner of the screen, thick smoke keeps billowing. 
What is the substance that is causing the continuous billowing of thick smoke?", "question_wo_referring_query": "What is the substance that is causing the continuous billowing of thick smoke?", "candidates": ["ash", "lava", "wood", "large fire"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "CP9zZ3sHwHQ_1", "video_path": "CP9zZ3sHwHQ.mp4", "subtitle_path": "CP9zZ3sHwHQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 560.8, "view_count": 9875}, {"video_id": "zg6_10VCH4g", "question": "In a room with mirrors, a man with yellow hair wearing a gray shirt is holding a long thin object. What is this man in the gray shirt doing at this moment?", "question_wo_referring_query": "What is this man in the gray shirt doing at this moment?", "candidates": ["Putting the long thin object into his hair", "Putting the long thin object into his mouth", "Putting the long thin object into his nose", "Putting the long thin object into his ear"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "zg6_10VCH4g_0", "video_path": "zg6_10VCH4g.mp4", "subtitle_path": "zg6_10VCH4g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 343.11, "view_count": 27095}, {"video_id": "zg6_10VCH4g", "question": "In a room with mirrors, there is a man with yellow hair wearing a gray shirt. He is holding a piece of string. 
What is this man in the gray shirt doing at this moment?", "question_wo_referring_query": "What is this man in the gray shirt doing at this moment?", "candidates": ["Eating", "Singing", "Playing the piano", "Rubbing his neck with the string"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "zg6_10VCH4g_1", "video_path": "zg6_10VCH4g.mp4", "subtitle_path": "zg6_10VCH4g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 343.11, "view_count": 27095}, {"video_id": "Vm8qcGpp6q4", "question": "On a white background, there are nine lines of text. The text at the top reads 'Related works:(continued).' What happens on the screen when the subtitle says 'is still time and resource consuming ang'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The top 6 lines of text are black, and the bottom 3 lines of text are red.", "The top 5 lines of text are black, and the bottom 4 lines of text are red.", "The top 8 lines of text are black, and the bottom 1 line of text is red.", "The top 7 lines of text are black, and the bottom 2 lines of text are red."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "Vm8qcGpp6q4_0", "video_path": "Vm8qcGpp6q4.mp4", "subtitle_path": "Vm8qcGpp6q4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 161, "duration": 564.0, "view_count": 16}, {"video_id": "Vm8qcGpp6q4", "question": "In the white background, at the top there is the word 'Experiments', along with 3 lines of text and a table. 
When the subtitle says 'classic machine learning method so we', what event happens on the screen?", "question_wo_referring_query": "What event happens on the screen?", "candidates": ["Some numbers are highlighted in green", "Some numbers are highlighted in blue", "Some numbers are highlighted in red", "Some numbers are highlighted in purple"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "Vm8qcGpp6q4_1", "video_path": "Vm8qcGpp6q4.mp4", "subtitle_path": "Vm8qcGpp6q4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 161, "duration": 564.0, "view_count": 16}, {"video_id": "hTrEKEKAxfM", "question": "What is being done when a hand holding a red and silver clip places the egg-liquid-battered chicken meat into a black pot?", "question_wo_referring_query": "What is being done?", "candidates": ["Pan-frying chicken", "Stir-frying chicken", "Boiling chicken", "Saut\u00e9ing chicken", "Deep-frying chicken"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "hTrEKEKAxfM_0", "video_path": "hTrEKEKAxfM.mp4", "subtitle_path": "hTrEKEKAxfM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 307.02, "view_count": 1084550}, {"video_id": "hTrEKEKAxfM", "question": "After placing a cut piece of meat into a plastic bag with one hand, and holding a glass container with marinade with the other hand, what are you preparing to do?", "question_wo_referring_query": "What are you preparing to do?", "candidates": ["Marinate the meat in the bag with the marinade", "Put the meat and marinade into a pot", "Set the bag of meat aside", "Throw away the bag of meat", "Put the bag of meat into the container"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "hTrEKEKAxfM_1", "video_path": "hTrEKEKAxfM.mp4", "subtitle_path": "hTrEKEKAxfM_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 307.02, "view_count": 1084550}, {"video_id": "C9PfcC1r52Y", "question": "On a gray tile passage with glass on both sides, when two women dressed in black dresses face each other and pick up a piece of red fabric, which object appears on the screen?", "question_wo_referring_query": "Which object appears on the screen?", "candidates": ["A small wooden board", "Two chairs", "A pair of scissors", "A piece of yellow fabric laid on the tile floor", "An escalator"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "C9PfcC1r52Y_0", "video_path": "C9PfcC1r52Y.mp4", "subtitle_path": "C9PfcC1r52Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 273.9, "view_count": 39341}, {"video_id": "C9PfcC1r52Y", "question": "An elderly person wearing a brown and white striped shirt with black sleeves, and with gray hair, is sitting on a wooden chair and speaking to the mirror in a clean and bright room. What object appeared on the screen?", "question_wo_referring_query": "What object appeared on the screen?", "candidates": ["A refrigerator", "A kitchen", "A white cabinet", "Some tableware", "A fallen mirror"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "C9PfcC1r52Y_1", "video_path": "C9PfcC1r52Y.mp4", "subtitle_path": "C9PfcC1r52Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 273.9, "view_count": 39341}, {"video_id": "tYDS3DHjSUE", "question": "On a beige floral patterned wall, there are three paintings. The middle painting is a conference room with two flags standing on either side. Two leaders are sitting on yellow chairs, surrounded by some people holding microphones. 
When the subtitle 'should it become necessary we've become' appears, which object is not shown on the screen?", "question_wo_referring_query": "Which object is not shown on the screen?", "candidates": ["A mirror", "Red tie", "Two chandeliers", "An airplane model", "Beige sofa"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "tYDS3DHjSUE_0", "video_path": "tYDS3DHjSUE.mp4", "subtitle_path": "tYDS3DHjSUE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 546.48, "view_count": 84038}, {"video_id": "tYDS3DHjSUE", "question": "In the airplane, when the man sitting by the window wearing glasses and a gray top is writing with a pen on a piece of paper and the subtitle 'one thing that does seem certain is that' appears, which object does not appear in the frame?", "question_wo_referring_query": "Which object does not appear in the frame?", "candidates": ["white cup", "olive green reflective desktop", "silver watch", "airplane outside the window", "black backpack"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "tYDS3DHjSUE_1", "video_path": "tYDS3DHjSUE.mp4", "subtitle_path": "tYDS3DHjSUE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 546.48, "view_count": 84038}, {"video_id": "8uvzimVM7n0", "question": "Under the intense sunlight, the blue sky is densely covered with large patches of white clouds. On the left is a lush green plant, and on the right is a yellowing grassland. 
What kind of car is speeding down the empty road in the middle?", "question_wo_referring_query": "What kind of car is speeding down?", "candidates": ["Black sports car", "Black off-road vehicle", "White police car", "White ambulance", "Red fire truck"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "8uvzimVM7n0_0", "video_path": "8uvzimVM7n0.mp4", "subtitle_path": "8uvzimVM7n0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 396.19, "view_count": 2868293}, {"video_id": "8uvzimVM7n0", "question": "A man wearing a black short-sleeve shirt with white English alphabet prints is sitting on a white chair in the middle of lush green plants. What is the weather like at this time?", "question_wo_referring_query": "What is the weather like at this time?", "candidates": ["Overcast", "Continuous rain", "Sunny and bright", "Dusk", "Stormy"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "8uvzimVM7n0_1", "video_path": "8uvzimVM7n0.mp4", "subtitle_path": "8uvzimVM7n0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 396.19, "view_count": 2868293}, {"video_id": "hdZQQtr_KRw", "question": "In front of a black background, a man holding a white cup is wearing black-rimmed glasses, and there's a small microphone pinned on his shirt collar. In the top right corner, there's a small island sticker. 
When the subtitle 'frighteningly kind of deep story that' appears, what kind of clothes is this man wearing?", "question_wo_referring_query": "What kind of clothes is this man wearing?", "candidates": ["white short sleeves", "red short sleeves", "red denim jacket", "white long sleeves", "red long sleeves"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "hdZQQtr_KRw_0", "video_path": "hdZQQtr_KRw.mp4", "subtitle_path": "hdZQQtr_KRw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 410.90999999999997, "view_count": 184011}, {"video_id": "hdZQQtr_KRw", "question": "In front of the black background, a man holding a white cup, wearing a red short-sleeve collar with a small checkered pattern, appears. In the upper right corner, there is a boat sailing on the blue sea. When the subtitle 'minister of marine Affairs agreed to' appears, what kind of glasses is the man wearing?", "question_wo_referring_query": "What kind of glasses is the man wearing?", "candidates": ["frameless glasses", "gold-rimmed glasses", "white frame glasses", "silver-rimmed glasses", "black frame glasses"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "hdZQQtr_KRw_1", "video_path": "hdZQQtr_KRw.mp4", "subtitle_path": "hdZQQtr_KRw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 410.90999999999997, "view_count": 184011}, {"video_id": "1TTPq38LyPU", "question": "In a sunlit forest filled with trees, a man on the left in a blue shirt and dark blue pants has a speech bubble with a milk carton image next to him. On the right, a man in a fitted shirt with floral letters and blue pants is standing with a thumbs up. They both have different national flags on them. 
What are they doing?", "question_wo_referring_query": "In a sunlit forest filled with trees, a man on the left in a blue shirt and dark blue pants has a speech bubble with a milk carton image next to him. On the right, a man in a fitted shirt with floral letters and blue pants is standing with a thumbs up. They both have different national flags on them. What are they doing?", "candidates": ["They are arguing", "They are fishing", "They are exploring", "They are having a conversation", "They are fighting"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "1TTPq38LyPU_0", "video_path": "1TTPq38LyPU.mp4", "subtitle_path": "1TTPq38LyPU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 566.69, "view_count": 2141400}, {"video_id": "1TTPq38LyPU", "question": "In a green meadow, there's a gray sidewalk on the left surrounded by green trees. On the meadow, there's a tall blue swing. To the left and behind the swing, there is a red protective net. What is the person on the swing doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Fixing the swing", "Swinging on the swing", "Sitting and resting", "Talking while leaning against the swing", "Pushing the swing for someone else"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "1TTPq38LyPU_1", "video_path": "1TTPq38LyPU.mp4", "subtitle_path": "1TTPq38LyPU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 566.69, "view_count": 2141400}, {"video_id": "JJn1D76j5pI", "question": "In the video, the person wearing a tight black short-sleeve shirt is holding a plastic-wrapped item and standing in front of the supermarket shelf looking at goods. There is another person behind him wearing a red long-sleeve top and jeans. 
What is he doing?", "question_wo_referring_query": "In the video, the person wearing a tight black short-sleeve shirt is holding a plastic-wrapped item and standing in front of the supermarket shelf looking at goods. There is another person behind him wearing a red long-sleeve top and jeans. What is he doing?", "candidates": ["Talking to a friend", "Cleaning the store", "Shopping", "Working at the checkout", "Stocking shelves"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "JJn1D76j5pI_0", "video_path": "JJn1D76j5pI.mp4", "subtitle_path": "JJn1D76j5pI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.56, "view_count": 980189}, {"video_id": "JJn1D76j5pI", "question": "What are the three men in the picture doing when the woman wearing a black short-sleeve shirt with white English letter patterns appears for the first time on the left side of a bedroom with a wall covered in paintings?", "question_wo_referring_query": "What are the three men in the picture doing?", "candidates": ["Fighting", "Hugging enthusiastically", "Arguing", "Chatting", "Playing games"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "JJn1D76j5pI_1", "video_path": "JJn1D76j5pI.mp4", "subtitle_path": "JJn1D76j5pI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.56, "view_count": 980189}, {"video_id": "7H3D-6nj_dY", "question": "A person wearing blue shorts with suspenders and a white short-sleeved shirt is crouching on the grass, holding a purple flower in their right hand. 
What happened on the screen when the subtitle 'But to preface my thoughts I'll explain that I grew up without social media, I did' appeared?", "question_wo_referring_query": "What happened on the screen?", "candidates": ["The woman is picking wildflowers", "The woman is catching butterflies", "The woman is catching insects", "The woman is digging soil", "The woman is planting flowers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "7H3D-6nj_dY_0", "video_path": "7H3D-6nj_dY.mp4", "subtitle_path": "7H3D-6nj_dY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 505.67, "view_count": 3980800}, {"video_id": "7H3D-6nj_dY", "question": "On a white board, there are rose pink petals laid out, and the background is a blurred wooden kitchen. What happens on the screen when the caption 'my dreams.' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The petals are blown away by the wind", "The petals are sprinkled with white powder", "A woman is washing petals", "The petals are placed into a container", "A woman is picking petals"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "7H3D-6nj_dY_1", "video_path": "7H3D-6nj_dY.mp4", "subtitle_path": "7H3D-6nj_dY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 505.67, "view_count": 3980800}, {"video_id": "Yy2OZZJwBYU", "question": "Four individuals with different skin tones are standing in front of a black background. The person on the far left is a woman wearing a tight black off-shoulder top, being held by the waist by a man in a blue short-sleeve shirt next to her. Behind them, a person wearing a checkered shirt and red scarf, holding a guitar, passes by with their head turned sideways. 
What happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Five individuals of different skin tones start singing and dancing.", "The other four people leave the screen together with the person holding the guitar.", "The person holding the guitar gets beaten by others.", "The camera shifts to a man in a dark green shirt with a black collar standing in front of the black background.", "The person holding the guitar starts waving at the camera."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "Yy2OZZJwBYU_0", "video_path": "Yy2OZZJwBYU.mp4", "subtitle_path": "Yy2OZZJwBYU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.44, "view_count": 1970379}, {"video_id": "Yy2OZZJwBYU", "question": "After the man wearing an olive green short-sleeved shirt standing in front of the black background in the video finishes introducing the geographical location of East Timor, and an image labeled 'Dili' appears in the top left corner of the screen, what does the man introduce first?", "question_wo_referring_query": "After an image labeled 'Dili' appears in the top left corner of the screen, what does the man introduce first?", "candidates": ["Introduces that Dili is a commercial hub", "Introduces historical development", "Introduces environmental protection", "Introduces food culture", "Introduces religious beliefs"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "Yy2OZZJwBYU_1", "video_path": "Yy2OZZJwBYU.mp4", "subtitle_path": "Yy2OZZJwBYU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.44, "view_count": 1970379}, {"video_id": "vvy-mtSYBs0", "question": "When the man, wearing a gray short-sleeved shirt and black-framed glasses, stands in front of a blue background and explains knowledge about the origin of life, what concept does he mention 
first?", "question_wo_referring_query": ", what concept does he mention first?", "candidates": ["First mentions the concept of the strong vitality of multicellular organisms", "First mentions the concept of changes in Earth's carbon cycle", "First mentions the concept of how ancient biologists explore the origins of life", "First mentions the concept of marine organisms", "First mentions the concept of small single-celled organisms"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "vvy-mtSYBs0_0", "video_path": "vvy-mtSYBs0.mp4", "subtitle_path": "vvy-mtSYBs0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.28, "view_count": 139417}, {"video_id": "vvy-mtSYBs0", "question": "There is a black bookshelf in the background. Four people of different ethnicities are sitting at a table. Before an Apple computer appears on the table for the first time, which person appears first?", "question_wo_referring_query": "Which person appears first?", "candidates": ["The man wearing gray short sleeves and black-rimmed glasses", "The man wearing black short sleeves and frameless glasses", "The archaeologist wearing red work clothes and protective goggles, working in front of a petrified wood", "The woman sitting at the desk, typing code with a desk lamp on", "The woman with a side profile, wearing earrings and olive-framed glasses, looking at the blue-green columnar data"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "vvy-mtSYBs0_1", "video_path": "vvy-mtSYBs0.mp4", "subtitle_path": "vvy-mtSYBs0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.28, "view_count": 139417}, {"video_id": "aAITmw3qohQ", "question": "In a kitchen, there's a bald man with a mustache holding a piece of paper, and behind him stands a man with a middle part hairstyle and a white shirt, leaning forward and looking at the paper. 
The background is a kitchen with utensils and cabinets. After the subtitle 'exploration for the formations related' appears, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["A large plume of smoke from a volcanic eruption", "A huge crater in the city", "A black and white photo of the bald man with a mustache", "Lava from a volcanic eruption", "A transparent crystal"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "aAITmw3qohQ_0", "video_path": "aAITmw3qohQ.mp4", "subtitle_path": "aAITmw3qohQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 364.17, "view_count": 5421}, {"video_id": "aAITmw3qohQ", "question": "Surrounded by some green plants, in a yellow muddy stream, a bald man wearing blue shorts is washing something white in the water with both hands. After the subtitle 'apart and extract the valuable diamonds' appears, what is the first object to appear?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["A diamond held with tweezers", "A diamond embedded in a black rock", "A blue-green rock", "A giant mining pit", "A group of miners"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "aAITmw3qohQ_1", "video_path": "aAITmw3qohQ.mp4", "subtitle_path": "aAITmw3qohQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 364.17, "view_count": 5421}, {"video_id": "2sziIUZgdgk", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["A man wearing a white-polka-dotted black long-sleeve shirt with glasses does the opening introduction, introduces the plants used to make spices, uses pictures to show mold growth on meat, uses pictures to introduce spices.", "Introduces 
the plants used to make spices, uses pictures to introduce spices, uses pictures to show mold growth on meat, a man wearing a white-polka-dotted black long-sleeve shirt with glasses does the opening introduction.", "Uses pictures to show mold growth on meat, a man wearing a white-polka-dotted black long-sleeve shirt with glasses does the opening introduction, introduces the plants used to make spices, uses pictures to introduce spices.", "Uses pictures to introduce spices, a man wearing a white-polka-dotted black long-sleeve shirt with glasses does the opening introduction, introduces the plants used to make spices, uses pictures to show mold growth on meat.", "A man wearing a white-polka-dotted black long-sleeve shirt with glasses does the opening introduction, uses pictures to introduce spices, introduces the plants used to make spices, uses pictures to show mold growth on meat."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "2sziIUZgdgk_0", "video_path": "2sziIUZgdgk.mp4", "subtitle_path": "2sziIUZgdgk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 403.74, "view_count": 1300867}, {"video_id": "2sziIUZgdgk", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["A man wearing a white polka-dot black long-sleeve shirt and glasses is talking to the camera; the background of the man wearing the white polka-dot black long-sleeve shirt changes to orange; then a black man wearing a pink short-sleeve shirt stands outside, with buildings and utility poles on both sides; finally, there's a website forum with a black background.", "First, there's a website forum with a black background; then a black man wearing a pink short-sleeve shirt stands outside, with buildings and utility poles on both sides; a man wearing a white polka-dot black long-sleeve shirt and glasses 
is talking to the camera; finally, the background of the man wearing the white polka-dot black long-sleeve shirt changes to orange.", "The background of a man wearing a white polka-dot black long-sleeve shirt and glasses changes to orange; then there's a website forum with a black background; a black man wearing a pink short-sleeve shirt stands outside, with buildings and utility poles on both sides; finally, a man wearing a white polka-dot black long-sleeve shirt and glasses is talking to the camera.", "A man wearing a white polka-dot black long-sleeve shirt and glasses is talking to the camera; then there's a website forum with a black background; the background of the man wearing the white polka-dot black long-sleeve shirt changes to orange; then a black man wearing a pink short-sleeve shirt stands outside, with buildings and utility poles on both sides.", "A black man wearing a pink short-sleeve shirt stands outside, with buildings and utility poles on both sides; then the background of a man wearing a white polka-dot black long-sleeve shirt and glasses changes to orange; then there's a website forum with a black background; finally, a man wearing a white polka-dot black long-sleeve shirt and glasses is talking to the camera."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "2sziIUZgdgk_1", "video_path": "2sziIUZgdgk.mp4", "subtitle_path": "2sziIUZgdgk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 403.74, "view_count": 1300867}, {"video_id": "owWCm7V8TpM", "question": "In the water pool encircled by trees and plants in the ancient forest, a green patterned giant spider is wrapped around a tree. 
In which of the following scenes does it also appear?", "question_wo_referring_query": ", In which of the following scenes does it also appear?", "candidates": ["In a zoo", "In the middle of a road", "On a dry yellow grassland with bugs crawling, with two white English letters in the center of the screen", "In the kitchen of a house", "Inside a glass cover in a museum"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "owWCm7V8TpM_0", "video_path": "owWCm7V8TpM.mp4", "subtitle_path": "owWCm7V8TpM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 263.6, "view_count": 6245}, {"video_id": "owWCm7V8TpM", "question": "A snake with a cylindrical shape and olive-colored pattern, hiding among green plants with its tongue sticking out, has also appeared in which of the following scenarios?", "question_wo_referring_query": "Which of the following scenarios has it also appeared in?", "candidates": ["In a muddy pond", "In a zoo", "Beside a kangaroo", "On grassland full of dry fallen leaves", "Inside a household garden"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "owWCm7V8TpM_1", "video_path": "owWCm7V8TpM.mp4", "subtitle_path": "owWCm7V8TpM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 263.6, "view_count": 6245}, {"video_id": "diHPg8aJLws", "question": "The background is a kitchen with red English letter wall decorations and a sink. A man wearing a light purple long-sleeved shirt with an apron is in the kitchen. Which subtitles have appeared at the same time?", "question_wo_referring_query": "The background is a kitchen with red English letter wall decorations and a sink. A man wearing a light purple long-sleeved shirt with an apron is in the kitchen. 
Which subtitles have appeared at the same time?", "candidates": ["\"you have sang Yoon from Father's Office\"", "\"cornmeal I can tell you right now this\"", "\"nectarine with raspberries and could be\"", "\"you have Roy Choi from Kogi and you have\"", "\"it is some kind of breakfast bread\""], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "diHPg8aJLws_0", "video_path": "diHPg8aJLws.mp4", "subtitle_path": "diHPg8aJLws_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 320.28, "view_count": 309231}, {"video_id": "diHPg8aJLws", "question": "In the kitchen, in front of the man wearing a light purple long-sleeved shirt with rolled-up sleeves, what captions have appeared at the same time as the red and yellow food in the white bowl on the metal plate?", "question_wo_referring_query": "In front of the man wearing a light purple long-sleeved shirt with rolled-up sleeves in the kitchen, what captions have appeared at the same time as the red and yellow food in the white bowl on the metal plate?", "candidates": ["\"nectarine with raspberries and could be\"", "\"perfect score\"", "\"this being sort of fall late fall\"", "\"cornmeal I can tell you right now this\"", "\"here and we\u2018ll see what happens\""], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "diHPg8aJLws_1", "video_path": "diHPg8aJLws.mp4", "subtitle_path": "diHPg8aJLws_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 320.28, "view_count": 309231}, {"video_id": "Itmgnh2zuMU", "question": "There is a glass container on a wooden table, with a few eggs placed in the top left corner. There is some flour in a blue mesh on top of the container. 
What change occurred to the flour after a hand wearing a blue watchstrap tapped the mesh?", "question_wo_referring_query": "What change occurred to the flour?", "candidates": ["The flour sprinkled onto the hand", "The flour got soaked", "The flour sifted into the glass container", "The flour scattered onto the floor", "The flour scattered onto the table"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "Itmgnh2zuMU_0", "video_path": "Itmgnh2zuMU.mp4", "subtitle_path": "Itmgnh2zuMU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 185.81, "view_count": 209509}, {"video_id": "Itmgnh2zuMU", "question": "On the upper right side of the red tablecloth table, there are two cups of orange juice and two plates of delicacies served on white plates, with a fork next to each. After the food on the lower right plate is cut with a knife, what changes occurred?", "question_wo_referring_query": ", what changes occurred?", "candidates": ["One whole piece of food became two small pieces", "One whole piece of food became four small pieces", "One whole piece of food became three small pieces", "One whole piece of food became five small pieces", "One whole piece of food became crumbs"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "Itmgnh2zuMU_1", "video_path": "Itmgnh2zuMU.mp4", "subtitle_path": "Itmgnh2zuMU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 185.81, "view_count": 209509}, {"video_id": "Urp5sK3tJ7c", "question": "What change occurred to the woman, who was wearing a black tank top, had long curly hair, and was wearing a gold pendant necklace, when the subtitle 'that's why it's a fear there's literally' appeared?", "question_wo_referring_query": "What change occurred?", "candidates": ["Her long curly hair turned into long straight hair", "Her black tank top turned into a black 
short-sleeve shirt", "The curly hair she had tied up came undone", "Her long straight hair turned into long curly hair", "Her long hair turned short"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "Urp5sK3tJ7c_0", "video_path": "Urp5sK3tJ7c.mp4", "subtitle_path": "Urp5sK3tJ7c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 498.6, "view_count": 16271}, {"video_id": "Urp5sK3tJ7c", "question": "What change occurs when a bare-faced woman, wearing a black backless outfit, long curly hair, and a gold pendant necklace, appears on screen as the subtitle 'and I'll leave all of my like everything' is displayed?", "question_wo_referring_query": "What change occurs?", "candidates": ["Her backless outfit changes into a meticulously tailored outfit", "Her bare face becomes meticulously made up", "Her face turns pale", "The necklace on her neck disappears", "The gold necklace on her neck changes to a silver one"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "Urp5sK3tJ7c_1", "video_path": "Urp5sK3tJ7c.mp4", "subtitle_path": "Urp5sK3tJ7c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 498.6, "view_count": 16271}, {"video_id": "cakVNqU0xOA", "question": "The screen shows a serene blue sky, with some trees, sea water, and rocks around. A woman wearing a yellow bikini with her hair up places a neatly folded white towel on the sandy ground by the sea. After adjusting the towel next to her, when '30 sec olank hoid' appears in the top left corner, what action does the woman take?", "question_wo_referring_query": "The screen shows a serene blue sky, with some trees, sea water, and rocks around. A woman wearing a yellow bikini with her hair up places a neatly folded white towel on the sandy ground by the sea. 
After adjusting the towel next to her, when '30 sec olank hoid' appears in the top left corner, what action does the woman take?", "candidates": ["The woman jumps into the sea", "Supports herself with one hand on the ground and starts a side plank", "Throws the towel into the sea", "Supports herself on the ground with both hands and wrists parallel, doing a plank", "Supports herself with one hand on the ground, with the other hand on her waist"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "cakVNqU0xOA_0", "video_path": "cakVNqU0xOA.mp4", "subtitle_path": "cakVNqU0xOA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 553.49, "view_count": 14891}, {"video_id": "cakVNqU0xOA", "question": "The screen shows a sky with a tint of blue, surrounded by some trees, sea water, and rocks. A woman wearing a yellow bikini with her hair tied up is doing various fitness exercises on the sandy beach. She sits on one side facing the camera on a white towel, with her hands on her hair. What does the woman do after she takes off her hair tie?", "question_wo_referring_query": "What does the woman do after she takes off her hair tie?", "candidates": ["She throws the towel into the sea", "The woman flips her hair", "She supports herself with one hand on the ground and starts doing side planks", "She supports herself with one hand on the ground and the other hand on her waist", "The woman jumps into the sea"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "cakVNqU0xOA_1", "video_path": "cakVNqU0xOA.mp4", "subtitle_path": "cakVNqU0xOA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 553.49, "view_count": 14891}, {"video_id": "m6HHyl_s6qA", "question": "A man and a woman are walking on the sandy shore by the sea, surrounded by some palm trees. 
The clouds in the sky are very large, the water by the sea is blue-green, and there should be reefs in the distance. The man is wearing a lettered T-shirt, and the woman is wearing a black camisole with her hair tied up. What objects are not present in this scene?", "question_wo_referring_query": "A man and a woman are walking on the sandy shore by the sea, surrounded by some palm trees. The clouds in the sky are very large, the water by the sea is blue-green, and there should be reefs in the distance. The man is wearing a lettered T-shirt, and the woman is wearing a black camisole with her hair tied up. What objects are not present in this scene?", "candidates": ["Tripod", "Green camisole", "Black camisole", "Sunglasses", "Bottled mineral water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "m6HHyl_s6qA_0", "video_path": "m6HHyl_s6qA.mp4", "subtitle_path": "m6HHyl_s6qA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 312.95, "view_count": 36340}, {"video_id": "m6HHyl_s6qA", "question": "On a small hillside, with a lake below and mountains in the distance, the sky is filled with white clouds. A lady is standing on a grassy path on the hillside, smiling at the camera. She is wearing a pair of white slippers. 
What objects did not appear in this scene?", "question_wo_referring_query": ", what objects did not appear in this scene?", "candidates": ["white slippers", "black shorts", "floral shorts", "sunglasses", "a black suspender dress"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "m6HHyl_s6qA_1", "video_path": "m6HHyl_s6qA.mp4", "subtitle_path": "m6HHyl_s6qA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 312.95, "view_count": 36340}, {"video_id": "vvD4AoOX8GA", "question": "There are two men sitting in a room with farming tools and a bucket on the right side, and some miscellaneous items and a white rectangular object on the left. Behind them is a hanging cloth, and in front of them is a table with a hunting rifle and a long green military box. They are speaking to the camera. What kind of cloth is hanging behind them?", "question_wo_referring_query": "What kind of cloth is hanging behind them?", "candidates": ["Transparent colorless cloth", "White opaque cloth", "Black opaque cloth", "Blue transparent cloth", "Yellow opaque cloth"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "vvD4AoOX8GA_0", "video_path": "vvD4AoOX8GA.mp4", "subtitle_path": "vvD4AoOX8GA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 251.56, "view_count": 64119}, {"video_id": "vvD4AoOX8GA", "question": "There are two men sitting in a room with farming tools and buckets stacked on the right side and some miscellaneous items and a white rectangular object on the left side. Behind them is a hanging cloth, and in front of them is a cloth with a hunting rifle and a military green long box placed on it. They are speaking to the camera. 
What type of pants is the bald man wearing?", "question_wo_referring_query": "What type of pants is the bald man wearing?", "candidates": ["blue shorts", "blue jeans", "olive long pants", "white long pants", "gray long pants"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "vvD4AoOX8GA_1", "video_path": "vvD4AoOX8GA.mp4", "subtitle_path": "vvD4AoOX8GA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 251.56, "view_count": 64119}, {"video_id": "C5sWbYwzKyg", "question": "In a room with white walls as the background, there is a man wearing black clothes and black headphones. What does this man do when he appears alone on camera?", "question_wo_referring_query": "What does this man do when he appears alone on camera?", "candidates": ["Happily dances", "Stands up and talks", "Cries in front of the camera", "Smiles and talks to the camera", "Speaks to the camera from a side view"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "C5sWbYwzKyg_0", "video_path": "C5sWbYwzKyg.mp4", "subtitle_path": "C5sWbYwzKyg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 346, "duration": 240.01, "view_count": 12732}, {"video_id": "C5sWbYwzKyg", "question": "In the dimly lit room, there is an air conditioner in the upper left corner, and below it, a black door. A man dressed in white, wearing glasses and earphones, is in the middle of the room. 
What did the man in white do the first time he appeared alone on screen?", "question_wo_referring_query": "What did the man in white do the first time he appeared alone on screen?", "candidates": ["Turned around to speak", "Looked directly at the camera, then slightly lowered his head before raising it again to speak", "Took off his earphones and put them on the table", "Stood up and pulled on his earphones", "Held his chin and looked at the camera"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "C5sWbYwzKyg_1", "video_path": "C5sWbYwzKyg.mp4", "subtitle_path": "C5sWbYwzKyg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 346, "duration": 240.01, "view_count": 12732}, {"video_id": "tz_f3jxGrF0", "question": "On a wooden dining table, there is a rectangular black plate. On the plate, there are pieces of carrot, sweet potato, apple, and four pieces of meat. In the top right corner, there is a label 'PEPPER'. When 'so' is mentioned, what are the hands in the video doing?", "question_wo_referring_query": "What are the hands in the video doing?", "candidates": ["Adding other vegetables to the plate", "One hand holds a seasoning bottle while the other sprinkles the seasoning", "Cutting meat", "Picked up the plate", "Took out a transparent bottle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "tz_f3jxGrF0_0", "video_path": "tz_f3jxGrF0.mp4", "subtitle_path": "tz_f3jxGrF0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.03, "view_count": 234321}, {"video_id": "tz_f3jxGrF0", "question": "In a grey background scene, there is a chunk of green onion and a red pepper in the upper right corner, along with a bok choy. A wok is placed on a small stove, and a blue lid is on the right side. 
What happened when 'do' was mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["Picked up the green onion", "Added vegetables to the wok", "Used tongs to pick up the meat from the wok", "Moved the wok away", "Started cutting the vegetables with the red pepper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "tz_f3jxGrF0_1", "video_path": "tz_f3jxGrF0.mp4", "subtitle_path": "tz_f3jxGrF0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.03, "view_count": 234321}, {"video_id": "KKpTevFJpJw", "question": "Against a white background, there are blue-titled letters with blue arrows in the text. In the top left corner, a woman using a computer is depicted. There's also a yellow exclamation mark emoji. The bottom left features a cogwheel graphic, while on the right side, a man in yellow clothing is lying down, with one hand reaching out. Which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["A short-haired woman in black and white using a computer", "A man in yellow clothing", "A grey cogwheel", "A blue arrow", "The yellow exclamation mark emoji"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "KKpTevFJpJw_0", "video_path": "KKpTevFJpJw.mp4", "subtitle_path": "KKpTevFJpJw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 387.0, "view_count": 2737}, {"video_id": "KKpTevFJpJw", "question": "Under a white background board, there are blue text headings, black text contents, and blue arrow insertions in the content. On the left side, there is a red and white circular icon, a yellow light bulb icon, a purple cylindrical icon on the right, and a gray gear. 
What is the last graphic to appear?", "question_wo_referring_query": "Under a white background board, there are blue text headings, black text contents, and blue arrow insertions in the content. On the left side, there is a red and white circular icon, a yellow light bulb icon, a purple cylindrical icon on the right, and a gray gear. What is the last graphic to appear?", "candidates": ["Blue graduation cap", "Red and white circular icon", "Purple cylindrical icon", "Blue and white circular icon", "Red graduation cap"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "KKpTevFJpJw_1", "video_path": "KKpTevFJpJw.mp4", "subtitle_path": "KKpTevFJpJw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 387.0, "view_count": 2737}, {"video_id": "_90F2F4ykMI", "question": "In a tray of candies, there are yellow, red, blue, green, and purple colored candies. After mentioning 'longest time is the infinity stone', what happened?", "question_wo_referring_query": "What happened?", "candidates": ["Picked up the green candy", "Rearranged the tray of candies", "Flipped the tray", "Poured milk into the tray", "Picked up the blue candy"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "_90F2F4ykMI_0", "video_path": "_90F2F4ykMI.mp4", "subtitle_path": "_90F2F4ykMI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 225.31, "view_count": 151252}, {"video_id": "_90F2F4ykMI", "question": "In a tray filled with purple and yellow candies, what action does the hand in the video perform when 'of time to prepare what takes the' is mentioned?", "question_wo_referring_query": "What action does the hand in the video perform?", "candidates": ["Hand presses on the candies", "Takes out a yellow candy from the tray", "Takes away the tray", "Places a purple candy on the tray", "Takes out a purple candy from the tray"], 
"topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "_90F2F4ykMI_1", "video_path": "_90F2F4ykMI.mp4", "subtitle_path": "_90F2F4ykMI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 225.31, "view_count": 151252}, {"video_id": "pp2FLgoTvCQ", "question": "In the news, there are two frames, the left one is a dark-skinned man, and the right one is a man in a study room filled with bookshelves, where a man with grey hair is speaking. He is wearing a black pinstriped suit and a red and white striped shirt. When he mentions 'thought that actually Northern Ireland', what object appears for the first time?", "question_wo_referring_query": "What object appears for the first time?", "candidates": ["blue globe", "blue flag", "green globe", "green flag", "white globe"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "pp2FLgoTvCQ_0", "video_path": "pp2FLgoTvCQ.mp4", "subtitle_path": "pp2FLgoTvCQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 425.8, "view_count": 33600}, {"video_id": "pp2FLgoTvCQ", "question": "A man with white hair is wearing a black V-neck sweater and a red shirt. Behind him is a huge bookshelf filled with many books and two flags. To his left is a globe, and to his right is a lamp. 
Before he mentions 'is essential to us as well as them and', which objects have appeared?", "question_wo_referring_query": "Which objects have appeared?", "candidates": ["rectangular iron box", "blue flag", "white curtain", "black glasses", "yellow curtain"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "pp2FLgoTvCQ_1", "video_path": "pp2FLgoTvCQ.mp4", "subtitle_path": "pp2FLgoTvCQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 425.8, "view_count": 33600}, {"video_id": "5As9xro9940", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a black background with a turquoise tank and an army green tank opposed to each other and a red 'NOT' letter on the bottom is shown, followed by a screen containing the word 'BUT(2)', and finally a black background scene with the words 'HOW TO AVOID SUCH AN ERROR?'.", "First, a black background scene with the words 'HOW TO AVOID SUCH AN ERROR?' is shown, followed by a screen containing the word 'BUT(2)', and ending with a black background with a turquoise tank and an army green tank opposed to each other and a red 'NOT' letter on the bottom.", "First, a black background scene with the words 'HOW TO AVOID SUCH AN ERROR?' 
is shown, followed by a black background with a turquoise tank and an army green tank opposed to each other and a red 'NOT' letter on the bottom, and ending with a screen containing the word 'BUT(2)'.", "First, a black background with a turquoise tank and an army green tank opposed to each other and a red 'NOT' letter on the bottom is shown, followed by a black background scene with the words 'HOW TO AVOID SUCH AN ERROR?', and ending with a screen containing the word 'BUT(2)'.", "First, a screen containing the word 'BUT(2)' is shown, followed by a black background with a turquoise tank and an army green tank opposed to each other and a red 'NOT' letter on the bottom, and finally a black background scene with the words 'HOW TO AVOID SUCH AN ERROR?'."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "5As9xro9940_0", "video_path": "5As9xro9940.mp4", "subtitle_path": "5As9xro9940_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 487.93, "view_count": 119012}, {"video_id": "GF9eMau-cvs", "question": "A girl with golden curls is sitting on a giant tree root. She's wearing a blue skirt and a pair of boots, holding a long tree branch in her hand. Surrounding her are slopes, a few trees, and some stones on the grassy ground. 
In which scene does the golden-haired girl appear?", "question_wo_referring_query": ", in which scene does the golden-haired girl appear?", "candidates": ["Inside a treehouse", "By a small stream in the forest", "On a branch of a big tree", "On a stone step in the forest", "In a grand environment with a sky background, where the trees block the sky"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "GF9eMau-cvs_0", "video_path": "GF9eMau-cvs.mp4", "subtitle_path": "GF9eMau-cvs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 449.74, "view_count": 244061}, {"video_id": "GF9eMau-cvs", "question": "In the dense forest, a girl with golden curly hair sits on the grass surrounded by towering trees. The sky is very bright, she is wearing a blue skirt, and next to her is a black puppy. In which scenes does the black puppy appear in the video?", "question_wo_referring_query": "In which scenes does the black puppy appear in the video?", "candidates": ["On the branch of a big tree", "On the tree stump where the golden-haired girl in the blue skirt is sitting on the hillside", "In the treehouse", "On the grass lit by sunlight with white flowers blooming", "On the stone bench in the forest"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "GF9eMau-cvs_1", "video_path": "GF9eMau-cvs.mp4", "subtitle_path": "GF9eMau-cvs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 449.74, "view_count": 244061}, {"video_id": "seh6n1TQEWo", "question": "In front of a blue background, a man with short hair, wearing glasses and a long-sleeved shirt, is speaking. There are some light spots in the blue background, and the man's shirt has three buttons. 
With which subtitle does this man appear simultaneously?", "question_wo_referring_query": "With which subtitle does this man appear simultaneously?", "candidates": ["in France, as well as the Neanderthals, first found in", "did first evolve in Europe.", "about 500,000 old.", "At the time, Charles Darwin's theory about evolution by natural", "understood where we came from."], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "seh6n1TQEWo_0", "video_path": "seh6n1TQEWo.mp4", "subtitle_path": "seh6n1TQEWo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 543.09, "view_count": 312782}, {"video_id": "seh6n1TQEWo", "question": "Against a black background, there is a fossil skull placed on a desk, with white text on the left displaying 'TAUNG CHILD'. The main part of the skull is dark brown, while the front part, including the facial features, is white, with many cracks. When have this skull and these subtitles appeared at the same time?", "question_wo_referring_query": "When have this skull and these subtitles appeared at the same time?", "candidates": ["At the time, Charles Darwin's theory about evolution by natural", "understood where we came from.", "where Piltdown fit in the human tree, growing more", "was discovered in South Africa", "in France, as well as the Neanderthals, first found in"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "seh6n1TQEWo_1", "video_path": "seh6n1TQEWo.mp4", "subtitle_path": "seh6n1TQEWo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 543.09, "view_count": 312782}, {"video_id": "rFsjZlAfl4Y", "question": "A man wearing a dark T-shirt and glasses is sitting in a car. He has a red scarf on his head, and there's a bag on his shoulder with an accessory on it. It's very sunny, and all the car windows behind him are shining. 
When he walks in front of a building, with a tree on the right side and a green trash can on the left side, what change does this man undergo?", "question_wo_referring_query": "What change does this man undergo?", "candidates": ["He takes the backpack off his shoulder", "His scarf turns red", "The accessory on his backpack disappears", "He removes his glasses", "He takes off his glasses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "rFsjZlAfl4Y_0", "video_path": "rFsjZlAfl4Y.mp4", "subtitle_path": "rFsjZlAfl4Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.74, "view_count": 2004}, {"video_id": "rFsjZlAfl4Y", "question": "A woman wearing glasses stands in front of a glass enclosure, observing. Inside the enclosure, there are many bird labels and bird nests. The woman is wearing earrings, white clothes, and has decorative items on her collar. When she sits on a white stool surrounded by trees with a man wearing a red headscarf, what change does this woman undergo?", "question_wo_referring_query": "What change does this woman undergo?", "candidates": ["She is holding a cup", "Her glasses are missing", "Her clothes changed to red", "She is carrying a backpack", "She took off her earrings"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "rFsjZlAfl4Y_1", "video_path": "rFsjZlAfl4Y.mp4", "subtitle_path": "rFsjZlAfl4Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.74, "view_count": 2004}, {"video_id": "1bXlxDGVERA", "question": "Under a starry sky background, there's the text 'Roche' in white on the right side. On the left side, there is a star with a white transparent ring around it. To the right of the star, there's a much smaller satellite. 
What happens to the small satellite when 'The moon's orbit would have spiraled inwards' is mentioned?", "question_wo_referring_query": "What happens to the small satellite?", "candidates": ["Gradually moves upward and enlarges", "Gradually moves to the left and enlarges", "Gradually moves backward and enlarges", "Gradually moves to the right and enlarges", "Starts jumping continuously"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "1bXlxDGVERA_0", "video_path": "1bXlxDGVERA.mp4", "subtitle_path": "1bXlxDGVERA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 222.85, "view_count": 65053}, {"video_id": "1bXlxDGVERA", "question": "Under a black starry sky background, Earth is in the middle of the screen, surrounded by some white fine lines and two loops of glowing curves. What changes occur to Earth when 'will be drained within the next 300 million years.' is mentioned?", "question_wo_referring_query": "What changes occur to Earth?", "candidates": ["An image of a satellite appears", "The color of the ends of the Earth gradually deepens to golden yellow", "The orbits of Earth's satellites disappear", "The planet's rings gradually disappear", "Earth gradually fades away"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "1bXlxDGVERA_1", "video_path": "1bXlxDGVERA.mp4", "subtitle_path": "1bXlxDGVERA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 222.85, "view_count": 65053}, {"video_id": "Y89s89kEsJY", "question": "In a room with green-colored walls, there is a wooden plank hanging on one wall and a white cabinet placed in front of the opposite wall. On the cabinet, there are two pots of green plants. A woman with long black hair wearing a white dress is sitting in front of a green cup. 
What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["She closed her left eye", "She is writing on the table", "She picked up the water cup", "She closed her right eye", "She closed both eyes"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "Y89s89kEsJY_0", "video_path": "Y89s89kEsJY.mp4", "subtitle_path": "Y89s89kEsJY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.8, "view_count": 111319}, {"video_id": "Y89s89kEsJY", "question": "In the upper right corner of a green screen background, there is a circular picture-in-picture. A woman with long black hair dressed in white is speaking on the screen. What is this woman in white doing?", "question_wo_referring_query": "What is this woman in white doing?", "candidates": ["She takes out her glasses", "She raises the thumb of her left hand", "She raises the thumb of her right hand", "She is wiping the table", "She is brushing her hair with her hand"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "Y89s89kEsJY_1", "video_path": "Y89s89kEsJY.mp4", "subtitle_path": "Y89s89kEsJY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 336.8, "view_count": 111319}, {"video_id": "7bhFvGgKQp4", "question": "In a field of yellow flowers, a little girl with long hair tied in pigtails uses her right hand to support her chin and holds a flower near her nose with her left hand. What objects are present in this scene?", "question_wo_referring_query": "In a field of yellow flowers, a little girl with long hair tied in pigtails uses her right hand to support her chin and holds a flower near her nose with her left hand. 
What objects are present in this scene?", "candidates": ["Handkerchief", "Earrings", "Earphones", "Purple flower", "Red flower"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "7bhFvGgKQp4_0", "video_path": "7bhFvGgKQp4.mp4", "subtitle_path": "7bhFvGgKQp4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 276.64, "view_count": 48494}, {"video_id": "7bhFvGgKQp4", "question": "How many trees are there under the red brick wall, and in front of the tree, there is a stone pillar inscribed with 'FOUNDED 1821.' What is the object present in this scene?", "question_wo_referring_query": "What is the object present in this scene?", "candidates": ["A full-body sculpture of a person", "A full-body painting of a person", "A half-body sculpture of a person", "A green road sign", "A half-body painting of a person"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "7bhFvGgKQp4_1", "video_path": "7bhFvGgKQp4.mp4", "subtitle_path": "7bhFvGgKQp4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 276.64, "view_count": 48494}, {"video_id": "jBEEYRm93Zs", "question": "On a white backdrop, there is a model placed. The model's edges have some trees, and in the middle, there is a blue lake. In the center of the lake, there is a person in a small boat. 
When the caption mentions 'rivers and lakeshores often change or even disappear, leading to disputes', what is present in the scene?", "question_wo_referring_query": "What is present in the scene?", "candidates": ["Blue human model", "Green tree model", "Green real tree", "Blue real tree", "Blue tree model"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "jBEEYRm93Zs_0", "video_path": "jBEEYRm93Zs.mp4", "subtitle_path": "jBEEYRm93Zs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.78, "view_count": 588461}, {"video_id": "jBEEYRm93Zs", "question": "On a white background board, there are three number models '1/2/3' of different colors standing. A cart model/truck model/shed model with a windmill is placed in front of the 1/2/3 models respectively. When the subtitle mentions 'there are three ways to create land', which item is not present in this scene?", "question_wo_referring_query": "Which item is not present in this scene?", "candidates": ["Green model number 2", "Blue model number 1", "Cart model with an empty cargo box", "Pink model number 3", "Cart model with cargo in the cargo box"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "jBEEYRm93Zs_1", "video_path": "jBEEYRm93Zs.mp4", "subtitle_path": "jBEEYRm93Zs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.78, "view_count": 588461}, {"video_id": "bMhUMyoQVQ8", "question": "On a cutting board, a hand with a ring is holding a knife with engraved characters and cutting ingredients. 
What are the characters on the knife?", "question_wo_referring_query": "What are the characters on the knife?", "candidates": ["\u52fe", "\u6bcf", "\u7578", "\u53e5", "\u65ec"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "bMhUMyoQVQ8_0", "video_path": "bMhUMyoQVQ8.mp4", "subtitle_path": "bMhUMyoQVQ8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.32, "view_count": 57196}, {"video_id": "bMhUMyoQVQ8", "question": "While frying minced meat mixed with orange granules in a non-stick pot, a spatula is added to the pot. What type of spatula is used in this scenario?", "question_wo_referring_query": "What type of spatula is used in this scenario?", "candidates": ["Gold-colored metal spatula", "Silver-colored metal spatula", "Large wooden spatula", "Red plastic spatula", "White ceramic spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "bMhUMyoQVQ8_1", "video_path": "bMhUMyoQVQ8.mp4", "subtitle_path": "bMhUMyoQVQ8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.32, "view_count": 57196}, {"video_id": "SDD0r15AprU", "question": "In a striped background, there is a large rectangular stripe made up of two smaller rectangular stripes of different colors. When the subtitle mentions 'Hindu Javanese Kingdom cra. 
Others might say the white part is derived from the natural color of woven cloths', what are the colors of the two small rectangular stripes on the screen?", "question_wo_referring_query": "What are the colors of the two small rectangular stripes on the screen?", "candidates": ["red and white", "red and black", "white and green", "red and green", "white and black"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "SDD0r15AprU_0", "video_path": "SDD0r15AprU.mp4", "subtitle_path": "SDD0r15AprU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 461.33, "view_count": 1053089}, {"video_id": "SDD0r15AprU", "question": "A man wearing a T-shirt is sitting in a room filled with national flags. He raised his right hand, and behind him is a piece of broken glass reflecting a nearby bookshelf. When the subtitle mentions 'But you also visited Belarus last year and obtained this old (item)', what style of T-shirt was he wearing?", "question_wo_referring_query": "What style of T-shirt was he wearing?", "candidates": ["Pure black short-sleeve T-shirt", "Black and white striped short-sleeve T-shirt", "Pure black long-sleeve T-shirt", "Pure white short-sleeve T-shirt", "Black and white striped long-sleeve T-shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "SDD0r15AprU_1", "video_path": "SDD0r15AprU.mp4", "subtitle_path": "SDD0r15AprU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 461.33, "view_count": 1053089}, {"video_id": "Hqse-27XWEQ", "question": "In an open space surrounded by buildings, there are two tall black-and-white striped stone pillars and some shorter black-and-white striped stone pillars. Two men and one woman are staying in the open space. 
Who jumped up to hug one of the tall black-and-white striped stone pillars?", "question_wo_referring_query": "Who jumped up to hug one of the tall black-and-white striped stone pillars?", "candidates": ["The person wearing a red short-sleeve shirt without a hat", "The person wearing long pants with a hat", "The person wearing a red short-sleeve shirt with a hat", "The person wearing long pants without a hat", "The person wearing a red long-sleeve shirt without a hat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "Hqse-27XWEQ_0", "video_path": "Hqse-27XWEQ.mp4", "subtitle_path": "Hqse-27XWEQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 596.47, "view_count": 73452}, {"video_id": "Hqse-27XWEQ", "question": "In a wooden cabin, a window was opened on one of the walls. A blue sign was hung, a small stool was placed in the corner, and a white blanket was placed on the clean white bed. Who lifted the blanket on the bed?", "question_wo_referring_query": "Who lifted the blanket on the bed?", "candidates": ["Person wearing black long-sleeve shirt and blue pants", "Person wearing black shirt and white shorts", "Person wearing black shirt and blue shorts", "Person wearing white shirt and blue shorts", "Person wearing black long-sleeve shirt and blue shorts"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "Hqse-27XWEQ_1", "video_path": "Hqse-27XWEQ.mp4", "subtitle_path": "Hqse-27XWEQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 596.47, "view_count": 73452}, {"video_id": "74ZqtCfkNkQ", "question": "When the woman with long hair and a black dress first appears with a magnifying glass in front of a painting, what is she doing?", "question_wo_referring_query": "When the woman with long hair and a black dress first appears with a magnifying glass in front of a painting, what is she doing?", 
"candidates": ["She is examining the painting with the magnifying glass.", "She puts down the magnifying glass.", "She throws away the magnifying glass.", "She uses the magnifying glass to smash the glass.", "She damages the magnifying glass."], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "74ZqtCfkNkQ_0", "video_path": "74ZqtCfkNkQ.mp4", "subtitle_path": "74ZqtCfkNkQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 187.56, "view_count": 18759}, {"video_id": "74ZqtCfkNkQ", "question": "When the woman with long hair wearing a black short-sleeved dress appears in front of the display case for the first time, what is she doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["She is walking and admiring the items in the display case", "She is taking off her coat", "She is putting on glasses", "She is running her hand through her hair", "She is breaking the glass"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "74ZqtCfkNkQ_1", "video_path": "74ZqtCfkNkQ.mp4", "subtitle_path": "74ZqtCfkNkQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 187.56, "view_count": 18759}, {"video_id": "RrQBTvRj0s8", "question": "In a room with a red brick wall oven, there is a green plant placed next to the oven. On the wooden table in front of the green plant, there is a laptop and a plate of paper coins. A long-haired woman is sitting in front of the laptop. 
What did the woman do when the subtitles mentioned 'you get \u00a3300 expenses per day, you can fly.'?", "question_wo_referring_query": "What did the woman do?", "candidates": ["The woman stood up", "The woman picked up the paper coins with chopsticks and put them in her mouth", "The woman opened the laptop", "The woman picked up the paper coins with chopsticks and put them on the table", "The woman closed the laptop"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "RrQBTvRj0s8_0", "video_path": "RrQBTvRj0s8.mp4", "subtitle_path": "RrQBTvRj0s8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 269.04, "view_count": 2573654}, {"video_id": "RrQBTvRj0s8", "question": "Outside in the bright sunlight, a man wearing a green short-sleeve shirt walks past a white wall, separated by a row of green plants. When the subtitle mentions 'These \u201cPeople Peers\u201d are chosen for their skills and experience', what does the man in the green short-sleeve shirt do?", "question_wo_referring_query": "What action does the man in the green short-sleeve shirt do?", "candidates": ["Raises both hands, extends and bends three fingers", "Raises both hands, extends and bends two fingers", "Raises both hands, extends and bends one finger", "Raises left hand, extends and bends two fingers", "Raises right hand, extends and bends two fingers"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "RrQBTvRj0s8_1", "video_path": "RrQBTvRj0s8.mp4", "subtitle_path": "RrQBTvRj0s8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 269.04, "view_count": 2573654}, {"video_id": "62FAnfkm6vU", "question": "In a building rubble site, a group of people is searching for things. 
After a man wearing a white T-shirt and blue pants picks up some clothes, what does he do next?", "question_wo_referring_query": "What does the man do next after picking up the clothes?", "candidates": ["He puts on the clothes", "He lifts a frame", "He leaves the rubble site", "He throws the clothes away", "He hands the clothes to a man in black next to him"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "62FAnfkm6vU_0", "video_path": "62FAnfkm6vU.mp4", "subtitle_path": "62FAnfkm6vU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 587.6, "view_count": 45272}, {"video_id": "62FAnfkm6vU", "question": "A female news anchor, wearing glasses and dressed in a black V-neck outfit, is reporting in front of a cityscape. After the text on the red ticker below her changes to 'Bide says deal struck to allow aid into Gaza,' what happens behind the anchor?", "question_wo_referring_query": "What happens behind the female anchor?", "candidates": ["A group of people appears behind her", "A flock of birds flies past behind her", "The building behind her collapses", "A bird flies past behind her", "There is a fire behind her"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "62FAnfkm6vU_1", "video_path": "62FAnfkm6vU.mp4", "subtitle_path": "62FAnfkm6vU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 587.6, "view_count": 45272}, {"video_id": "DBfOvsttefg", "question": "Based on the video, which person appears first?", "question_wo_referring_query": "Based on the video, which person appears first?", "candidates": ["The man in a brown suit without a beard", "The man in a checkered shirt", "The man in black clothing with a beard", "The man in black clothing without a beard", "The man in a blue suit"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "DBfOvsttefg_0", 
"video_path": "DBfOvsttefg.mp4", "subtitle_path": "DBfOvsttefg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 252.76, "view_count": 3153}, {"video_id": "DBfOvsttefg", "question": "Based on the video, which of the following scenes appears first?", "question_wo_referring_query": "Based on the video, which of the following scenes appears first?", "candidates": ["A man in a blue suit video calling with a man in a black shirt", "A happy image appearing on the phone screen", "A happy image with a mouth appearing on the phone screen", "A happy image with eyes appearing on the phone screen", "A man in a blue suit video calling with a man in a white shirt"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "DBfOvsttefg_1", "video_path": "DBfOvsttefg.mp4", "subtitle_path": "DBfOvsttefg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 252.76, "view_count": 3153}, {"video_id": "YorHLIDcTaM", "question": "A girl with long brown hair is sitting inside a car, she is wearing white clothes and holding a mobile phone. In front of her is a steering wheel. On the left side of the screen, there is a picture showing a hand displaying white nail polish. 
What happens when the phrase 'right now the reviews on this place are' is mentioned?", "question_wo_referring_query": "What happens when the phrase 'right now the reviews on this place are' is mentioned?", "candidates": ["The girl flicked her hair.", "A picture displaying pink nail polish popped up in the top right corner.", "The girl put down the phone.", "A picture displaying white nail polish popped up in the top right corner.", "A picture displaying green nail polish popped up in the top right corner."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "YorHLIDcTaM_0", "video_path": "YorHLIDcTaM.mp4", "subtitle_path": "YorHLIDcTaM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 365.67, "view_count": 526434}, {"video_id": "YorHLIDcTaM", "question": "Inside the car, there is a hand displayed showing pink nail polish. There is a button in front of the car door, and the rearview mirror can be seen outside the door. What happened before the mention of 'posted cuz that would have sucked so'?", "question_wo_referring_query": "What happened before?", "candidates": ["The girl showed someone else's nails", "The girl put on sunglasses", "A picture showing nude-colored nail polish appeared on the left side", "The girl got out of the car", "The girl opened the car door"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "YorHLIDcTaM_1", "video_path": "YorHLIDcTaM.mp4", "subtitle_path": "YorHLIDcTaM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 365.67, "view_count": 526434}, {"video_id": "8JWGjBLwpeg", "question": "A woman is organizing her earrings in her room. Her room has a shelf cluttered with various items and a window with white curtains. Her hair is blonde, styled in a bun. She is wearing round earrings. 
What is the first item to appear after the phrase 'However, full autumn isn\u2019t here quite yet.'?", "question_wo_referring_query": "What is the first item to appear?", "candidates": ["A plant in a white flowerpot", "Yellow, purple, and white flowers", "Yellow, purple, and red flowers", "A mirror hanging on the wall", "Yellow, purple, and gold flowers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "8JWGjBLwpeg_0", "video_path": "8JWGjBLwpeg.mp4", "subtitle_path": "8JWGjBLwpeg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 542.59, "view_count": 302195}, {"video_id": "8JWGjBLwpeg", "question": "On a hillside surrounded by lush trees, with a high mountain visible in the distance and the sky covered by thick white clouds, a lady stands on the hill looking at the sky. She is wearing a straw hat, has blonde hair, and is dressed in grey clothing. Before the phrase 'I appreciate all of you who have been so supporting during this time.', what is the first object to appear?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["A mirror hanging on the wall", "Yellow, purple, and red flowers", "Yellow, purple, and white flowers", "A white dress", "A fawn"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "8JWGjBLwpeg_1", "video_path": "8JWGjBLwpeg.mp4", "subtitle_path": "8JWGjBLwpeg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 542.59, "view_count": 302195}, {"video_id": "MKRTALmZRhU", "question": "On a black table, there is a cup of red drink and a black phone. Someone is holding a white plate with a hamburger. Next to it are some fries, a yellow long bag. 
In which subtitle do these food items appear at the same time?", "question_wo_referring_query": ", in which subtitle do these food items appear at the same time?", "candidates": ["in this guy alright guys so we are", "staying tonight it took us about three", "they got a sausage and cheddar breakfast", "cruisers Cafe 66 look at this bathroom", "like 30 years the backbone of America"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "MKRTALmZRhU_0", "video_path": "MKRTALmZRhU.mp4", "subtitle_path": "MKRTALmZRhU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 499.03, "view_count": 9335}, {"video_id": "MKRTALmZRhU", "question": "On a muddy ground, surrounded by some trees and some grassy areas, the sun shines on the muddy ground, casting the shadows of the trees. A moose walks on the road. Which subtitle appeared at the same time as this moose?", "question_wo_referring_query": "Which subtitle appeared at the same time as this moose?", "candidates": ["see 1 million cameras its hornless", "in this guy alright guys so we are", "these flats it goes on forever", "definition of society now one thing to", "speck here and the speck right here so"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "MKRTALmZRhU_1", "video_path": "MKRTALmZRhU.mp4", "subtitle_path": "MKRTALmZRhU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 499.03, "view_count": 9335}, {"video_id": "bZVI9RxjYJQ", "question": "On a yellow background mural, there are intricate carvings. To the left, there are many people and objects, and on the right, there are four archways. One person is walking up the stairs, another person is sitting on a chair, and the stairs are surrounded by draped curtains. When the phrase 'When he arrived, Nero sat on his official chair.' 
is mentioned, what changes occur on the mural?", "question_wo_referring_query": "What changes occur on the mural?", "candidates": ["The mural disappears", "The mural gradually shrinks", "The mural gets damaged", "The mural turns black", "The mural gradually enlarges"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "bZVI9RxjYJQ_0", "video_path": "bZVI9RxjYJQ.mp4", "subtitle_path": "bZVI9RxjYJQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 245.28, "view_count": 1945}, {"video_id": "bZVI9RxjYJQ", "question": "In the fresco with a yellow background, which features a number of intricate buildings, including a tall arched fortress and sharp-angled houses below, and a figure on the cylindrical column on the left side, what change occurs to the mural when Nero's statement 'in the beauty of the fire' is mentioned?", "question_wo_referring_query": "What change occurs to the mural?", "candidates": ["The mural turns black", "The mural gradually shrinks to the left", "The mural gradually enlarges to the left", "The mural disappears", "The mural gets scratched"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "bZVI9RxjYJQ_1", "video_path": "bZVI9RxjYJQ.mp4", "subtitle_path": "bZVI9RxjYJQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 245.28, "view_count": 1945}, {"video_id": "WPuxOLS2-nU", "question": "In a crowd, there is a girl in a purple cloak holding an Argentina flag. There are many people around her, some holding small flags. On the left, someone is holding an Argentina blue and white striped shirt, and another person is holding a little girl in a pink top. They are slowly moving forward. 
What action does the girl in the purple cloak take?", "question_wo_referring_query": "What action does the girl in the purple cloak take?", "candidates": ["The girl drapes the flag over herself.", "The girl is waving the flag.", "The girl gives the flag to someone else.", "The girl puts the flag on the ground.", "The girl folds the flag."], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "WPuxOLS2-nU_0", "video_path": "WPuxOLS2-nU.mp4", "subtitle_path": "WPuxOLS2-nU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.72, "view_count": 50129}, {"video_id": "WPuxOLS2-nU", "question": "In a grey background, there are two blocks of mirror. A blond woman in a black outfit is watching a video explanation on the right side. On the right, there are three women and a man holding a microphone. On the left, a woman wearing white clothes is holding a blue cloth. What did the woman in white clothes do?", "question_wo_referring_query": "What did the woman in white clothes do?", "candidates": ["Draped the cloth over herself", "Placed the cloth on the ground", "Waved the cloth", "Gave the cloth to someone else", "Threw the cloth away"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "WPuxOLS2-nU_1", "video_path": "WPuxOLS2-nU.mp4", "subtitle_path": "WPuxOLS2-nU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.72, "view_count": 50129}, {"video_id": "vj9nJURH0Gw", "question": "In a black and white tone scene, there is an inkstone on the table with three paper tubes on it. In one of the tubes, a brush pen is inserted. A pair of hands is holding a pen and writing on crinkled paper. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["Pencil", "Steel pen", "Mobile phone", "Ink", "Ring"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "vj9nJURH0Gw_0", "video_path": "vj9nJURH0Gw.mp4", "subtitle_path": "vj9nJURH0Gw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 533.73, "view_count": 15220}, {"video_id": "vj9nJURH0Gw", "question": "In a painting, six people are conversing on horseback with a castle and green grassland in the background. There is a red cross in the bottom right corner of the screen. What object is present on the screen at this moment?", "question_wo_referring_query": "What object is present on the screen at this moment?", "candidates": ["green tree", "handgun", "red hat", "hunting rifle", "watch"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "vj9nJURH0Gw_1", "video_path": "vj9nJURH0Gw.mp4", "subtitle_path": "vj9nJURH0Gw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 533.73, "view_count": 15220}, {"video_id": "CFQsjxaf5Nk", "question": "On a white-background PPT, there are three pictures. One picture shows a man wearing a striped short sleeve and a white short sleeve, with a white curtain in the background. 
When the caption 'action features and the predicted action' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["blackboard", "refrigerator", "television", "green bucket", "electric fan"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "CFQsjxaf5Nk_0", "video_path": "CFQsjxaf5Nk.mp4", "subtitle_path": "CFQsjxaf5Nk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.33, "view_count": 660}, {"video_id": "CFQsjxaf5Nk", "question": "On a white background PPT, there are twelve pictures with different screens and angles. The man in the pictures on the left column is wearing a black short sleeve, and the man in the pictures on the right column is wearing a blue short sleeve. When the caption 'from our network on the ntu rgbd dataset' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["green curtain", "green clothing", "TV screen", "green desk", "blue chair"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "CFQsjxaf5Nk_1", "video_path": "CFQsjxaf5Nk.mp4", "subtitle_path": "CFQsjxaf5Nk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.33, "view_count": 660}, {"video_id": "Au2vce6wMb8", "question": "The background is a white wall decorated with some green leaves, and there are orange-red and white flowers on the right side of the screen. A woman with long curly hair, wearing a necklace, is sitting in front of a mirror. 
What is the type of concealer she is holding in her hand to apply on her face?", "question_wo_referring_query": "What type is it?", "candidates": ["Flesh-colored powder", "Flesh-colored liquid", "Flesh-colored solid", "White liquid", "Red liquid"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "Au2vce6wMb8_0", "video_path": "Au2vce6wMb8.mp4", "subtitle_path": "Au2vce6wMb8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.52, "view_count": 20214}, {"video_id": "Au2vce6wMb8", "question": "The background is a white wall decorated with green leaves, with orange and white flowers on the right side of the screen. A woman with long curly hair wearing a necklace is sitting in front of a mirror. While she is using an eyebrow brush, what do her nails look like?", "question_wo_referring_query": "What do her nails look like?", "candidates": ["Nude T-shaped long nails", "Pink T-shaped long nails", "Nude pointed long nails", "Pink round short nails", "Pink pointed long nails"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "Au2vce6wMb8_1", "video_path": "Au2vce6wMb8.mp4", "subtitle_path": "Au2vce6wMb8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.52, "view_count": 20214}, {"video_id": "hrsxRJdwfM0", "question": "In the kitchen, a man is wearing a floral V-neck short-sleeve shirt, with chest hair and stubble. On the table beside him, there is a bowl of fruit and neatly arranged books. In front of him on a plate, there are croissants, ham slices, and cheese slices. When the subtitle appears: 'And if so, what do we call that breakfast other than disappointing?', what kind of drink is in the cup the man is holding?", "question_wo_referring_query": "In the kitchen, a man is wearing a floral V-neck short-sleeve shirt, with chest hair and stubble. 
On the table beside him, there is a bowl of fruit and neatly arranged books. In front of him on a plate, there are croissants, ham slices, and cheese slices. When the subtitle appears: 'And if so, what do we call that breakfast other than disappointing?', what kind of drink is in the cup the man is holding?", "candidates": ["milk", "yellow fruit juice", "clear pure water", "red wine", "purple fruit juice"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "hrsxRJdwfM0_0", "video_path": "hrsxRJdwfM0.mp4", "subtitle_path": "hrsxRJdwfM0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 519.84, "view_count": 3128951}, {"video_id": "hrsxRJdwfM0", "question": "The background is the outer wall of a beige house. Two men are standing in front of the door. The man on the right is wearing a white shirt and khaki pants. The man on the left is wearing a beige hat, a green frame glasses pinned to his collar. When the subtitle appears saying 'So there's never been a better time to brush up on a foreign language', what kind of clothes is the man on the left wearing?", "question_wo_referring_query": "What kind of clothes is the man on the left wearing?", "candidates": ["A green short-sleeve shirt", "A white denim jacket", "A white lab coat", "A green and white checkered denim jacket", "A green and white striped short-sleeve shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "hrsxRJdwfM0_1", "video_path": "hrsxRJdwfM0.mp4", "subtitle_path": "hrsxRJdwfM0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 519.84, "view_count": 3128951}, {"video_id": "Ae7jkOIGMrw", "question": "On a purple desktop, in the upper left corner, there is a piece of paper with numbers and letters. A pair of hands takes out a black calculator. 
When the subtitle 'sure to press 2nd log and then put the exponent of negative ten point six one' appears, what is the person in the video doing?", "question_wo_referring_query": "What is the person in the video doing?", "candidates": ["Using a calculator to calculate a formula", "Replacing the calculator batteries", "Putting the calculator on the desktop", "Fixing the calculator", "Placing the calculator on the paper"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "Ae7jkOIGMrw_0", "video_path": "Ae7jkOIGMrw.mp4", "subtitle_path": "Ae7jkOIGMrw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 244.25, "view_count": 354937}, {"video_id": "Ae7jkOIGMrw", "question": "On a purple desktop, there are five scattered folders in the bottom left corner, a pot of succulents in the top right corner, some paper notes with chemical formulas in the center, and a pink card with a pink ribbon on the right. When the caption 'the number of decimal places for the pH tells you how many sig figs to round to' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen when the caption appears?", "candidates": ["The blue arrow at the top of the wooden stick is pointing at the chemical formulas for explanation", "The folders on the desk are taken away", "The paper notes with chemical formulas are scattered", "The blue arrow points towards the succulent pot nearby", "The succulent pot is moved"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "Ae7jkOIGMrw_1", "video_path": "Ae7jkOIGMrw.mp4", "subtitle_path": "Ae7jkOIGMrw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 244.25, "view_count": 354937}, {"video_id": "a_TwEAo9RuU", "question": "On a wooden table, there is a plate of chicken legs in a white dish. 
After the person in the video sprinkles salt and chili powder on the chicken, what does this person do next?", "question_wo_referring_query": "After sprinkling salt and chili powder on the chicken, what does this person do next?", "candidates": ["Cut the chicken into small pieces", "Put the chicken into a food bag", "Place the chicken into a pot", "Apply sauce to the chicken", "Put the chicken into a glass bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "a_TwEAo9RuU_0", "video_path": "a_TwEAo9RuU.mp4", "subtitle_path": "a_TwEAo9RuU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 406.53, "view_count": 186234}, {"video_id": "a_TwEAo9RuU", "question": "There is a large iron shovel on the left side of the wooden table. On the table, there is a glass container. The person on the screen is holding the edge of the container with one hand and using a spatula to mix with the other hand. After the dough mixture in the container is evenly mixed, what does this person do next?", "question_wo_referring_query": "What does this person do next after the dough mixture is evenly mixed?", "candidates": ["Add a single ingredient to the dough mixture", "Add milk to the dough mixture", "Put the mixed dough into a pot", "Beat an egg into the glass container", "Add water to the flour"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "a_TwEAo9RuU_1", "video_path": "a_TwEAo9RuU.mp4", "subtitle_path": "a_TwEAo9RuU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 406.53, "view_count": 186234}, {"video_id": "NmvsDXxWrzM", "question": "In front of a wall covered with pictures, a woman with golden long hair, wearing a black long-sleeved top and a floral skirt, is sticking things. After the subtitle 'So I think today our goal is to really figure out these\\nimage selections.' 
appears, which individuals appear on the scene for the first time?", "question_wo_referring_query": "Which individuals appear on the scene for the first time?", "candidates": ["A man wearing a black suit with short golden hair", "A man wearing a black suit with short black hair", "A man wearing a white T-shirt with short golden hair", "A man wearing a white T-shirt with short black hair", "A man wearing a pink shirt with short golden hair"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "NmvsDXxWrzM_0", "video_path": "NmvsDXxWrzM.mp4", "subtitle_path": "NmvsDXxWrzM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 196.91, "view_count": 12301}, {"video_id": "NmvsDXxWrzM", "question": "On a white table, a book with many pictures is spread open. A hand wearing a ring is pointing at a picture in the book. Before the subtitle 'What\u2019s different between these two?' appears, which characters have appeared?", "question_wo_referring_query": "Which characters have appeared?", "candidates": ["A woman with short black hair, wearing a black jacket and black innerwear", "A woman with short black hair, wearing a white jacket and black innerwear", "A woman with curly blonde hair, wearing a blue long dress", "A woman with short blonde hair, wearing a white business dress", "A woman with long black hair, wearing a white long dress"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "NmvsDXxWrzM_1", "video_path": "NmvsDXxWrzM.mp4", "subtitle_path": "NmvsDXxWrzM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 196.91, "view_count": 12301}, {"video_id": "70Dmd0dUXw0", "question": "Before an operation, a soldier dressed in green uniform, wearing a camouflage helmet, and holding an automatic rifle appears on screen. 
When the subtitle \"Although it was a primary weapon.\" appears, what changes occur to the item he is holding?", "question_wo_referring_query": "What changes occur to the item he is holding?", "candidates": ["The automatic rifle changes to a flamethrower", "The automatic rifle changes to a hand grenade", "The automatic rifle changes to a handheld rocket launcher", "The automatic rifle changes to a handgun", "The automatic rifle changes to a sniper rifle"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "70Dmd0dUXw0_0", "video_path": "70Dmd0dUXw0.mp4", "subtitle_path": "70Dmd0dUXw0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 238.46, "view_count": 3816345}, {"video_id": "70Dmd0dUXw0", "question": "Before an encampment, a soldier wearing a green uniform, a camouflage helmet, and holding a grenade launcher appears. When the subtitle \"The Stevens M77E was the most manufactured and used shotgun during the conflict used by the Army and Marines.\" shows up, what changes occur to the item in his hand?", "question_wo_referring_query": "What changes occur to the item in his hand?", "candidates": ["The grenade launcher turns into a flamethrower", "The grenade launcher turns into a bayonet", "The grenade launcher turns into a pistol", "The grenade launcher turns into a sniper rifle", "The grenade launcher turns into a shotgun"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "70Dmd0dUXw0_1", "video_path": "70Dmd0dUXw0.mp4", "subtitle_path": "70Dmd0dUXw0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 238.46, "view_count": 3816345}, {"video_id": "zs-Q4HrKc-U", "question": "In a black night sky, there is a long-tailed golden light beam with a sphere beside it. 
What did the golden light beam do?", "question_wo_referring_query": "What did the golden light beam do?", "candidates": ["Circled around the sphere", "Circled by itself", "Crashed into the sphere", "Brushed past the sphere", "Moved away from the sphere"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "zs-Q4HrKc-U_0", "video_path": "zs-Q4HrKc-U.mp4", "subtitle_path": "zs-Q4HrKc-U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.46, "view_count": 146458}, {"video_id": "zs-Q4HrKc-U", "question": "In a foggy background, there is thick black smoke swirling everywhere, and among the black smoke, there are many light spots with long tails. What did these light spots do?", "question_wo_referring_query": "In a foggy background, there is thick black smoke swirling everywhere, and among the black smoke, there are many light spots with long tails. What did these light spots do?", "candidates": ["merged together", "circled around the black smoke", "fell down", "flew upwards", "extinguished in the air"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "zs-Q4HrKc-U_1", "video_path": "zs-Q4HrKc-U.mp4", "subtitle_path": "zs-Q4HrKc-U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.46, "view_count": 146458}, {"video_id": "TaWDXwmjf_Q", "question": "On a stone platform lies a man wrapped in a white cloth, with his limbs held down by four men. Next to him stand three men wearing crowns. One of these men plunges a sword into the bound man's chest. 
When the subtitles mention 'public often prisoners of war with great,' what type of sword is used to stab the man in the chest?", "question_wo_referring_query": "What type of sword is used to stab the man in the chest?", "candidates": ["a red short sword", "a black long sword", "a green long sword", "a green short sword", "a black short sword"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "TaWDXwmjf_Q_0", "video_path": "TaWDXwmjf_Q.mp4", "subtitle_path": "TaWDXwmjf_Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.75, "view_count": 4443022}, {"video_id": "TaWDXwmjf_Q", "question": "In the crowded central square, a hand lifted a beating heart. When the subtitle mentions 'the beating heart will be lifted high,' what state is the heart in?", "question_wo_referring_query": ", what state is the heart in?", "candidates": ["A heart that has stopped beating and is bleeding black blood", "A red heart that has stopped beating and is bleeding fresh blood", "A black beating heart", "A beating heart that is bleeding black blood", "A red beating heart"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "TaWDXwmjf_Q_1", "video_path": "TaWDXwmjf_Q.mp4", "subtitle_path": "TaWDXwmjf_Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.75, "view_count": 4443022}, {"video_id": "jr1iZpDH0Q0", "question": "There is a blue surface. A pair of hands is holding a knife, cutting food. On the right side of the surface, there are some cut ingredients. 
What is the ingredient that has been sliced and placed aside?", "question_wo_referring_query": ", what is the ingredient that has been sliced and placed aside?", "candidates": ["Butter", "Shrimp", "Cream", "Lard", "Rabbit meat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "jr1iZpDH0Q0_0", "video_path": "jr1iZpDH0Q0.mp4", "subtitle_path": "jr1iZpDH0Q0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 433.1, "view_count": 518659}, {"video_id": "jr1iZpDH0Q0", "question": "On a wooden table, there is a whisk and a bowl with flour. A hand holding a cup of water appears, and where did the water in the cup go?", "question_wo_referring_query": "On a wooden table, there is a whisk and a bowl with flour. A hand holding a cup of water appears, and where did the water in the cup go?", "candidates": ["The water in the cup was poured onto the wooden surface of the table.", "The water in the cup was poured into the olive-colored bowl.", "The water in the cup was poured into the transparent bowl containing yellow oil.", "The water in the cup was poured into the transparent bowl containing white flour.", "The water in the cup was poured onto the whisk."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "jr1iZpDH0Q0_1", "video_path": "jr1iZpDH0Q0.mp4", "subtitle_path": "jr1iZpDH0Q0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 433.1, "view_count": 518659}, {"video_id": "kDoncOWp36g", "question": "In an indoor space, a woman in a white dress is giving a lecture. Some people are listening to the lecture. A projector is displaying on one wall of the classroom. 
When a man in a blue coat appears for the first time in a green frame in the projection, what does this man do?", "question_wo_referring_query": "What does this man do?", "candidates": ["He opens the car door", "He drives away the car", "He closes the car door", "He takes off the coat", "He puts on the coat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "kDoncOWp36g_0", "video_path": "kDoncOWp36g.mp4", "subtitle_path": "kDoncOWp36g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 116, "duration": 530.0, "view_count": 538}, {"video_id": "kDoncOWp36g", "question": "In an indoor space, a woman dressed in white is giving a lecture. Some people are listening to the lecture. A projection is displayed on one wall of the classroom. In the projection, a green box selects a man lying on the ground with white letters on his chest. What did the man do when the green box disappeared from the projection?", "question_wo_referring_query": "What did the man do when the green box disappeared?", "candidates": ["The man touched his stomach with his hand.", "The man stood up.", "The man pulled up the blanket.", "The man turned over.", "The man touched his face with his hand."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "kDoncOWp36g_1", "video_path": "kDoncOWp36g.mp4", "subtitle_path": "kDoncOWp36g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 116, "duration": 530.0, "view_count": 538}, {"video_id": "sry-vM0Ebnw", "question": "On a white table, there are three transparent bowls containing food. The bowl in the middle has ingredients of various colors, while the food in the bowls on both sides is red. Next to the bowls, there are three small toy dinosaurs in red, orange, and green, respectively. There is also a pair of hands holding a transparent container filled with food. 
When the subtitle mentions [Music], what happens to the transparent container filled with food?", "question_wo_referring_query": "What happens to the transparent container filled with food?", "candidates": ["The container is dropped and broken", "The container is placed on the table", "The food in the container is taken away", "The container is lifted off the table", "A small dinosaur is put into the container"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "sry-vM0Ebnw_0", "video_path": "sry-vM0Ebnw.mp4", "subtitle_path": "sry-vM0Ebnw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 401.49, "view_count": 183563}, {"video_id": "sry-vM0Ebnw", "question": "On a wooden surface, there's a round plate. A pair of hands have drawn eyes and a nose on the object on the plate. When the subtitle [Music] appears, what happens to the object on the plate?", "question_wo_referring_query": "What happens to the object on the plate?", "candidates": ["The eyes on the object disappear.", "The white part of the object draws a mouth.", "The white part of the object draws a necklace.", "The pink part of the object draws a necklace.", "The pink part of the object draws a mouth."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "sry-vM0Ebnw_1", "video_path": "sry-vM0Ebnw.mp4", "subtitle_path": "sry-vM0Ebnw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 401.49, "view_count": 183563}, {"video_id": "S6RsSZw49E4", "question": "In front of a window, a woman in white is standing with her back to the camera, using a curling iron on her hair. 
Before she used the curling iron, what did she do with her left hand?", "question_wo_referring_query": "What did she do with her left hand?", "candidates": ["She hung a string of decorations on the wall with her left hand", "She closed the window with her left hand", "She opened the window with her left hand", "She was combing her hair with her left hand", "She picked up an item hanging on the wall with her left hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "S6RsSZw49E4_0", "video_path": "S6RsSZw49E4.mp4", "subtitle_path": "S6RsSZw49E4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 372.15, "view_count": 329132}, {"video_id": "S6RsSZw49E4", "question": "In the snowy and icy environment, after a blonde woman in a fur coat placed a red bow on a snow-covered tree, what did she do next?", "question_wo_referring_query": "What did she do next?", "candidates": ["She placed a green ornament on the tree.", "She placed a red ornament on the tree.", "She took a red ornament off the tree.", "She placed a red ornament on the ground.", "She placed a white ornament on the tree."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "S6RsSZw49E4_1", "video_path": "S6RsSZw49E4.mp4", "subtitle_path": "S6RsSZw49E4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 372.15, "view_count": 329132}, {"video_id": "FDmucVCh9X4", "question": "In a purple background, a woman wearing black clothes, glasses, and with black curly hair is explaining with her hands held flat in front of her chest. 
What is the first text that appears on this screen?", "question_wo_referring_query": ", what is the first text that appears on this screen?", "candidates": ["PLUS,WE EXPECT TO GET NOTIFICATIONS", "IN 2010,RESEARCHERS FOUND THAT AROUND 68% OF PARTICIPANTS EXPERIENCED SOME KIND OF PHANTOM BUZZ", "RESEARCHERS ARE PRETTY SURE PHANTOM VIBRATIONS AFFECT A LOT OF PEOPLE", "Phantom Vibration Syndrome", "PHANTOM PHONE VIBES ARE LIKELY A FALSE ALARM IN OUR SIGNAL DETECTION SYSTEM"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "FDmucVCh9X4_0", "video_path": "FDmucVCh9X4.mp4", "subtitle_path": "FDmucVCh9X4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 230.86, "view_count": 179718}, {"video_id": "FDmucVCh9X4", "question": "Against a purple background, a woman dressed in black clothes, wearing glasses, and with black curly hair is gesturing with her hands while explaining. What is the last text appearing in this scene?", "question_wo_referring_query": "Against a purple background, a woman dressed in black clothes, wearing glasses, and with black curly hair is gesturing with her hands while explaining. 
What is the last text appearing in this scene?", "candidates": ["A DIFFERENT STUDY IN 2014 LOOKED SPECIFICALLY AT TECH EMPLOYEES", "PHANTOM PHONE VIBES ARE LIKELY A FALSE ALARM IN OUR SIGNAL DETECTION SYSTEM", "PLUS, WE EXPECT TO GET NOTIFICATIONS", "SCISHOW VIEWERS GET AN ADDITIONAL THREE MONTHS FOR FREE WHEN THEY SIGN UP FOR A THREE MONTH SUBSCRIPTION", "Phantom Vibration Syndrome"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "FDmucVCh9X4_1", "video_path": "FDmucVCh9X4.mp4", "subtitle_path": "FDmucVCh9X4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 230.86, "view_count": 179718}, {"video_id": "Vd7z3GignNU", "question": "In front of a black background, a man wearing a purple short-sleeved shirt stands on the right side of the screen with his hands resting flat on his chest, explaining something. After the subtitle mentions 'speaking of which, the Lion is the official animal of the country,' what animal image appears on the left side of the screen?", "question_wo_referring_query": "What animal image appears on the left side of the screen after the subtitle mentions 'speaking of which, the Lion is the official animal of the country'?", "candidates": ["Lioness", "Lion", "Tiger", "Cat", "Leopard"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Vd7z3GignNU_0", "video_path": "Vd7z3GignNU.mp4", "subtitle_path": "Vd7z3GignNU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 598.97, "view_count": 1412661}, {"video_id": "Vd7z3GignNU", "question": "In front of a black background, a man wearing a purple short-sleeved shirt makes a funny face while standing on the right side of the screen explaining something. 
What happens to this man after the subtitles appear saying 'Eh heh heh heh heh heh'?", "question_wo_referring_query": "What happens to this man?", "candidates": ["A man wearing a purple shirt and black-frame glasses punches him.", "A man wearing black clothes and sunglasses punches him.", "A man wearing black clothes and black-frame glasses punches him.", "He punches a man wearing black clothes and black-frame glasses.", "A man wearing black clothes and gold-frame glasses punches him."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Vd7z3GignNU_1", "video_path": "Vd7z3GignNU.mp4", "subtitle_path": "Vd7z3GignNU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 598.97, "view_count": 1412661}, {"video_id": "SfeOs5vLf9s", "question": "The green branches in the picture are full of red fruits. Before the subtitle says 'we love spending time in the forest and I have always my eyes wide open for any kind of signs,' what first appeared?", "question_wo_referring_query": "What first appeared?", "candidates": ["Olive green leaves", "An apple with a smiling face", "A penguin doll", "A pumpkin with a smiling face", "A book"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "SfeOs5vLf9s_0", "video_path": "SfeOs5vLf9s.mp4", "subtitle_path": "SfeOs5vLf9s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 246.2, "view_count": 3734}, {"video_id": "SfeOs5vLf9s", "question": "A long-haired woman sits in a room with a Jack-o'-lantern that has a smiling face carved in it. The subtitles say, 'post it on Facebook or Instagram you can even plan a little photo shoot maybe get your kids to dress.' 
What object appears first after this?", "question_wo_referring_query": "What object appears first?", "candidates": ["A lit white candle", "A matchstick", "A tree", "A book with leaves pressed in it", "A bunch of flames"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "SfeOs5vLf9s_1", "video_path": "SfeOs5vLf9s.mp4", "subtitle_path": "SfeOs5vLf9s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 246.2, "view_count": 3734}, {"video_id": "xvh05AKJqJk", "question": "Under the pure white snowy mountain, on a frozen piece of snow-covered ground, a woman in thick clothing is playing with a black dog. Where else has this black dog appeared?", "question_wo_referring_query": "Where else has this black dog appeared?", "candidates": ["In a basketball court", "In a park", "On the bed inside a house", "In a pet shop", "On green grass"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "xvh05AKJqJk_0", "video_path": "xvh05AKJqJk.mp4", "subtitle_path": "xvh05AKJqJk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 443.03, "view_count": 564417}, {"video_id": "xvh05AKJqJk", "question": "On a yellow wooden table, some green leaves are scattered. A pair of hands is tying a red bow on a bunny doll. 
Where else has this bunny doll appeared?", "question_wo_referring_query": ", where else has this bunny doll appeared?", "candidates": ["On the sofa", "On the snow", "On the bed", "In a green box", "On a shelf with picture frames"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "xvh05AKJqJk_1", "video_path": "xvh05AKJqJk.mp4", "subtitle_path": "xvh05AKJqJk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 443.03, "view_count": 564417}, {"video_id": "hFEUohXoyvQ", "question": "In the white bowl, there are colorful jelly beans in blue, red, yellow, and other colors. Which subtitles have appeared together with them?", "question_wo_referring_query": "Which subtitles have appeared together with them?", "candidates": ["Music", "paper", "cookie", "and", "plane"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "hFEUohXoyvQ_0", "video_path": "hFEUohXoyvQ.mp4", "subtitle_path": "hFEUohXoyvQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 232.07, "view_count": 435060}, {"video_id": "hFEUohXoyvQ", "question": "Which subtitles appear together with the woman wearing black clothes and earrings, with short black hair in the picture?", "question_wo_referring_query": "Which subtitles appear together?", "candidates": ["not sure if you get it", "cute", "tasty", "walmart", "great"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "hFEUohXoyvQ_1", "video_path": "hFEUohXoyvQ.mp4", "subtitle_path": "hFEUohXoyvQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 232.07, "view_count": 435060}, {"video_id": "SKuhvTGiXz4", "question": "What change occurred to the clothes of the man wearing red shorts and a blue tank top the first time he appeared on the screen of a white phone at the beginning of the video?", 
"question_wo_referring_query": "What change occurred to the clothes?", "candidates": ["Changed into a white T-shirt with a cartoon pattern on the chest", "Changed into a blue T-shirt with a cartoon pattern on the chest", "Changed into a red top", "Changed into a white top", "Changed into a black short-sleeve shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "SKuhvTGiXz4_0", "video_path": "SKuhvTGiXz4.mp4", "subtitle_path": "SKuhvTGiXz4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 299.21, "view_count": 1201161}, {"video_id": "SKuhvTGiXz4", "question": "What changed when the man, who was wearing a blue backpack and red shorts and running at the beginning of the video, first appeared on the bed in the room?", "question_wo_referring_query": "What changed?", "candidates": ["Took off his running shoes", "Shaved his head", "Changed into blue pants", "Changed into white pants", "Changed into a white shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "SKuhvTGiXz4_1", "video_path": "SKuhvTGiXz4.mp4", "subtitle_path": "SKuhvTGiXz4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 299.21, "view_count": 1201161}, {"video_id": "ojBqr4Vl8Zs", "question": "What change occurred when the woman sitting outside the house drinking water from a brown cup said in the subtitles 'the inevitable time of Summer where wildfires start and so it is quite unhealthy to be'?", "question_wo_referring_query": "What change occurred?", "candidates": ["Put on feather earrings", "Put on flower-shaped earrings", "Put on round earrings", "Put on tassel earrings", "Put on pearl earrings"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "ojBqr4Vl8Zs_0", "video_path": "ojBqr4Vl8Zs.mp4", "subtitle_path": "ojBqr4Vl8Zs_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 580.25, "view_count": 172172}, {"video_id": "ojBqr4Vl8Zs", "question": "What changed when the girl wearing a straw hat and picking fruit at the beginning of the video mentioned in the subtitle, 'I'll admit I'm not in a chatty mood today'?", "question_wo_referring_query": "What changed?", "candidates": ["Tied a hairband", "Took off the hat", "Changed to a blue hat", "Changed to a pink hat", "Changed to a green hat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "ojBqr4Vl8Zs_1", "video_path": "ojBqr4Vl8Zs.mp4", "subtitle_path": "ojBqr4Vl8Zs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 580.25, "view_count": 172172}, {"video_id": "sy9ztJuZu2c", "question": "On a page, there is an orange interface on the left side and a white rectangle frame with a marking of a compass needle inside it. On the right side of the page, there are three brown beans. After the rightmost bean moves to the right, what happens to the moving bean?", "question_wo_referring_query": ", what happens to the moving bean?", "candidates": ["One brown bean becomes two red beans", "One brown bean becomes three green beans", "One brown bean becomes three brown beans", "One brown bean becomes two green beans", "One brown bean becomes two brown beans"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "sy9ztJuZu2c_0", "video_path": "sy9ztJuZu2c.mp4", "subtitle_path": "sy9ztJuZu2c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 457.17, "view_count": 447781}, {"video_id": "sy9ztJuZu2c", "question": "What did the woman with long hair, wearing black sunglasses and a black short-sleeved shirt, do when the white English text appeared for the first time in the top-left corner of the blue background screen behind her?", "question_wo_referring_query": "What did this woman do?", 
"candidates": ["Clenched both hands into fists", "Crossed her fingers", "Pressed her fingertips together with palms facing herself", "Opened both hands with palms facing herself", "Opened both hands with palms not facing herself"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "sy9ztJuZu2c_1", "video_path": "sy9ztJuZu2c.mp4", "subtitle_path": "sy9ztJuZu2c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 457.17, "view_count": 447781}, {"video_id": "kbOsDFtvYZk", "question": "In a bright office, five people are sitting at white desks, and a long-haired woman in a black skirt walks between the office desks on both sides. What has never appeared in this room?", "question_wo_referring_query": "What has never appeared in this room?", "candidates": ["a computer", "a cup", "a whiteboard", "a red sofa", "a white chair"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "kbOsDFtvYZk_0", "video_path": "kbOsDFtvYZk.mp4", "subtitle_path": "kbOsDFtvYZk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.11, "view_count": 801317}, {"video_id": "kbOsDFtvYZk", "question": "The screen is divided into six sections. In the top left section, there is a standing broom, and in the bottom right section, there is a sleeping dog. Which object appears on the screen?", "question_wo_referring_query": "Which object appears on the screen?", "candidates": ["water bottle", "bread", "starfish", "broom", "house"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "kbOsDFtvYZk_1", "video_path": "kbOsDFtvYZk.mp4", "subtitle_path": "kbOsDFtvYZk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.11, "view_count": 801317}, {"video_id": "bGJuBwB5iwE", "question": "In front of a gray screen, three women are sitting on white high chairs. 
When the subtitle says \"interesting question the drawings are,\" which character appears?", "question_wo_referring_query": "Which character appears?", "candidates": ["The woman wearing glasses, a blue shirt, and white pants", "The woman not wearing glasses, a blue shirt, and black pants", "The woman wearing glasses, a white shirt, and black pants", "The woman wearing glasses, a blue shirt, and black pants", "The woman wearing glasses, a blue shirt, and black shorts"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "bGJuBwB5iwE_0", "video_path": "bGJuBwB5iwE.mp4", "subtitle_path": "bGJuBwB5iwE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.27, "view_count": 796}, {"video_id": "bGJuBwB5iwE", "question": "When the video begins, there is a person with a ring sitting in front of a gray background screen looking to the right and talking. What character appears when the subtitle says 'mechanical imagery artists are really'?", "question_wo_referring_query": "What character appears?", "candidates": ["A woman wearing a blue coat", "A woman wearing a black coat", "A woman wearing a hat", "A woman wearing glasses", "A woman wearing a white coat"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "bGJuBwB5iwE_1", "video_path": "bGJuBwB5iwE.mp4", "subtitle_path": "bGJuBwB5iwE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.27, "view_count": 796}, {"video_id": "VZ0Lb-E6bXo", "question": "A woman wearing a blue long-sleeved coat is putting a yellow triangular scarf on the neck of a dog sitting on the ground. 
What is the pattern on this yellow scarf?", "question_wo_referring_query": "What is the pattern on this yellow scarf?", "candidates": ["Triangle pattern", "Dot pattern", "Flower pattern", "Round pattern", "Star pattern"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "VZ0Lb-E6bXo_0", "video_path": "VZ0Lb-E6bXo.mp4", "subtitle_path": "VZ0Lb-E6bXo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.26, "view_count": 8062}, {"video_id": "VZ0Lb-E6bXo", "question": "On a black bar counter next to a glass window, there are four round glass wine glasses with pink flower decorations. What is the state of the substance inside the wine glasses?", "question_wo_referring_query": "What is the state of the substance inside the wine glasses?", "candidates": ["Yellow liquid", "Blue liquid", "White liquid", "White paste", "White solid"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "VZ0Lb-E6bXo_1", "video_path": "VZ0Lb-E6bXo.mp4", "subtitle_path": "VZ0Lb-E6bXo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.26, "view_count": 8062}, {"video_id": "xpJGBMzb5ao", "question": "In front of the yellow striped background, the black-haired girl sitting down mentions 'you position bet i want you to realize.' 
What color are the two pillows next to her?", "question_wo_referring_query": "What color are the two pillows next to her?", "candidates": ["Purple", "White", "Yellow", "Red", "Blue"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "xpJGBMzb5ao_0", "video_path": "xpJGBMzb5ao.mp4", "subtitle_path": "xpJGBMzb5ao_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 473.57, "view_count": 10267}, {"video_id": "xpJGBMzb5ao", "question": "In a white background filled with many chemical formulas, what changes occur on the screen when the subtitle says, 'note when the same substance is found in'?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["A red rectangle appears at the bottom", "A black rectangle appears at the bottom", "A red rectangle appears at the top", "A green rectangle appears at the bottom", "A red circle appears at the bottom"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "xpJGBMzb5ao_1", "video_path": "xpJGBMzb5ao.mp4", "subtitle_path": "xpJGBMzb5ao_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 473.57, "view_count": 10267}, {"video_id": "WboGuFHD3to", "question": "On the blue background, there are three expressions written on the far left, five expressions written on the right, and a black-and-white striped pen placed in the upper left corner of the background. 
What happened when this blue background appeared for the first time?", "question_wo_referring_query": ", what happened when this blue background appeared for the first time?", "candidates": ["Place a black strip of paper.", "Point to an expression with the middle finger.", "Two hands with painted nails twirl from the middle to the sides.", "Point to an expression with the index finger.", "Place a white sheet of paper."], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "WboGuFHD3to_0", "video_path": "WboGuFHD3to.mp4", "subtitle_path": "WboGuFHD3to_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 435.44, "view_count": 21653}, {"video_id": "WboGuFHD3to", "question": "In the center of a blue background, there is only one piece of white paper with pink text on top and black text below. What happened the first time this piece of white paper appeared in the center?", "question_wo_referring_query": ", what happened the first time this piece of white paper appeared in the center?", "candidates": ["One hand stuck paper on top", "Drew a circle on top", "Folded the white paper", "Wrote on top", "Two hands opened the white paper"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "WboGuFHD3to_1", "video_path": "WboGuFHD3to.mp4", "subtitle_path": "WboGuFHD3to_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 435.44, "view_count": 21653}, {"video_id": "kmE_Cb5hRZI", "question": "Under the blue sky, on a gray road flanked by old houses with olive-colored outer walls made of tiles, with poles hanging wires, what happened the first time Applause was mentioned in the subtitles?", "question_wo_referring_query": "What happened?", "candidates": ["A plane flew overhead", "A group of people protested on the street", "A house collapsed", "An explosion occurred on the street", "Several cars drove down the road"], 
"topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "kmE_Cb5hRZI_0", "video_path": "kmE_Cb5hRZI.mp4", "subtitle_path": "kmE_Cb5hRZI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 252.55, "view_count": 4704}, {"video_id": "kmE_Cb5hRZI", "question": "Under the blue sky, there is a river flowing beside the olive-colored mud ground. A person is standing on the stone path above the river. What is the person doing when 'Music' is first mentioned in the subtitles?", "question_wo_referring_query": "What is the person doing?", "candidates": ["Standing still on the stone", "Jumping and stepping across the river on the stone path", "Throwing stones into the water", "Playing with water on the stone", "Kneeling down on the stone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "kmE_Cb5hRZI_1", "video_path": "kmE_Cb5hRZI.mp4", "subtitle_path": "kmE_Cb5hRZI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 252.55, "view_count": 4704}, {"video_id": "UHiUYj5EA0w", "question": "After a man wearing a blue shirt and black pants and a blonde woman first appear behind the white screen, which country is mentioned first?", "question_wo_referring_query": "Which country is mentioned first?", "candidates": ["China", "India", "Mongolia", "Zimbabwe", "Switzerland"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "UHiUYj5EA0w_0", "video_path": "UHiUYj5EA0w.mp4", "subtitle_path": "UHiUYj5EA0w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 530.07, "view_count": 1777048}, {"video_id": "UHiUYj5EA0w", "question": "After a man wearing a blue shirt and black pants and a woman with blonde hair first appear on a white screen, which concept is mentioned first?", "question_wo_referring_query": "Which concept is mentioned first below?", 
"candidates": ["Factors of production", "Productivity", "Resources", "Per capita GDP", "GDP"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "UHiUYj5EA0w_1", "video_path": "UHiUYj5EA0w.mp4", "subtitle_path": "UHiUYj5EA0w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 530.07, "view_count": 1777048}, {"video_id": "x20AGOtheik", "question": "An elderly man wearing black glasses and sporting a mustache is seated in a room with a globe and books placed behind him. What happens after this elderly man says, 'outer space so what happens if then this'?", "question_wo_referring_query": "What happens next?", "candidates": ["The elderly man picks up a book.", "The elderly man picks up a ball.", "The shot transitions from one screen to three screens.", "The shot transitions to a person wearing a golden tie.", "The shot transitions from one screen to two screens."], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "x20AGOtheik_0", "video_path": "x20AGOtheik.mp4", "subtitle_path": "x20AGOtheik_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 243.16, "view_count": 43060}, {"video_id": "x20AGOtheik", "question": "Under the pitch-black night sky, a rocket was successfully launched from the ground, with thick smoke billowing on the ground. 
Before the caption says \u2018launched aboard a rocket from Cape,\u2019 what happened?", "question_wo_referring_query": "What happened?", "candidates": ["The host wearing a blue tie speaks in a broadcast room", "The host wearing a golden tie speaks in a broadcast room", "The camera cuts to two screens", "An elderly person with glasses speaks to the camera", "An elderly person with glasses picks up a ball"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "x20AGOtheik_1", "video_path": "x20AGOtheik.mp4", "subtitle_path": "x20AGOtheik_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 243.16, "view_count": 43060}, {"video_id": "Shrc6dwjI5I", "question": "In a room cluttered with miscellaneous items in the background, a bald man wearing glasses is sitting on the left side of the screen, while a man in a gray suit and white shirt is sitting on the right side. After the subtitle says 'ceremonial full poem for us to start off,' what appears first?", "question_wo_referring_query": "What appears first?", "candidates": ["A man holding a flute", "Three little girls", "Two people holding red accordions", "Four people holding red accordions", "Three people playing pianos"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Shrc6dwjI5I_0", "video_path": "Shrc6dwjI5I.mp4", "subtitle_path": "Shrc6dwjI5I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.33, "view_count": 4436}, {"video_id": "Shrc6dwjI5I", "question": "Before the subtitle says 'certain kinds of errors are what give,' what first appears when a girl in a deep red dress kneeling on the left and a girl in a black dress standing on the right are seen together?", "question_wo_referring_query": "", "candidates": ["a curly-haired little boy", "a small yellow drill", "a grand piano", "a small red drill", "a painting"], "topic_category": "KA-Knowledge-Art", 
"question_category": "T3O", "level": "L2-Relation", "id": "Shrc6dwjI5I_1", "video_path": "Shrc6dwjI5I.mp4", "subtitle_path": "Shrc6dwjI5I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 360.33, "view_count": 4436}, {"video_id": "6dEvFjEUYHQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a man in a white short-sleeved shirt and a black backpack rides a bicycle on the street, then a man dressed in a black top skateboards under a bridge, and lastly a woman dressed entirely in white taking pictures with a black camera.", "First, a woman in a white short-sleeved shirt and a black backpack rides a bicycle on the street, then a woman dressed in a black top skateboards under a bridge, and lastly a man dressed entirely in white taking pictures with a black camera.", "First, a man in a white short-sleeved shirt and a black backpack rides a bicycle on the street, then a man dressed in a black top skateboards under a bridge, and lastly a woman in black and white checkered clothing taking pictures with a black camera.", "First, a man wearing a white top skateboards under a bridge, then a woman dressed in black and white checkered clothing taking pictures with a black camera, and lastly a man in a white short-sleeved shirt and a black backpack riding a bicycle on the street.", "First, a man in a white short-sleeved shirt and a red backpack rides a bicycle on the street, then a man dressed in a blue top skateboards under a bridge, and lastly a woman dressed entirely in white taking pictures with a black camera."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "6dEvFjEUYHQ_0", "video_path": "6dEvFjEUYHQ.mp4", "subtitle_path": "6dEvFjEUYHQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.28, "view_count": 15320}, 
{"video_id": "6dEvFjEUYHQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a black man wearing a hat kneels down with a camera to take pictures of a man in a white top playing with a skateboard, then a woman in a red coat sits on a red chair flexing her fist in front of a mirror, and finally, a man in a black short sleeve shirt rides a bicycle on the street.", "First, a black man wearing a hat kneels down with a camera to take pictures of a man in a white top playing with a skateboard, then a woman in a blue coat sits on a red chair flexing her fist in front of a mirror, and finally, a man in a white short sleeve shirt rides a bicycle on the street.", "First, a woman in a blue coat sits on a red chair flexing her fist in front of a mirror, then a man wearing a white short sleeve shirt with a black backpack rides a bicycle on the street, and finally, a black man wearing a hat kneels down with a camera to photograph a man in a white top playing with a skateboard.", "First, a white man wearing a hat kneels down with a camera to take pictures of a man in a black top playing with a skateboard, then a woman in a blue coat sits on a red chair flexing her fist in front of a mirror, and finally, a man in a black short sleeve shirt rides a bicycle on the street.", "First, a man wearing a white short sleeve shirt is riding a bicycle on the street, then a black man wearing a hat kneels down with a camera to take pictures of a man in a white top playing with a skateboard, and finally a woman in a blue coat sits on a red chair flexing her fist in front of a mirror."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "6dEvFjEUYHQ_1", "video_path": "6dEvFjEUYHQ.mp4", "subtitle_path": "6dEvFjEUYHQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.28, 
"view_count": 15320}, {"video_id": "tGIB0-dFOOY", "question": "On the brown cutting board, there is a piece of white dough that has already been wrapped and rolled into a round pie shape. Where else has this round pie-shaped white dough appeared?", "question_wo_referring_query": "Where else has this round pie-shaped white dough appeared?", "candidates": ["On the black cutting board", "In the white pot", "In the oven", "In the refrigerator", "In the black pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "tGIB0-dFOOY_0", "video_path": "tGIB0-dFOOY.mp4", "subtitle_path": "tGIB0-dFOOY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.9, "view_count": 5123}, {"video_id": "tGIB0-dFOOY", "question": "On a brown colored wooden board, a pair of hands wearing black gloves is holding a rolling pin to roll a previously white spherical dough ball. Where has this white spherical dough ball appeared before?", "question_wo_referring_query": ", where has this white spherical dough ball appeared before?", "candidates": ["In a white glass bowl", "In a white dish", "In the refrigerator", "In the oven", "On a white paper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "tGIB0-dFOOY_1", "video_path": "tGIB0-dFOOY.mp4", "subtitle_path": "tGIB0-dFOOY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 267.9, "view_count": 5123}, {"video_id": "H6S-fOMD67Y", "question": "Which subtitles have appeared together with the dog that lies on the ground and extends its head into the rectangular window at the end of the scene?", "question_wo_referring_query": "Which subtitles have appeared together?", "candidates": ["rabbit", "Music", "cat", "dog", "videos"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "H6S-fOMD67Y_0", "video_path": 
"H6S-fOMD67Y.mp4", "subtitle_path": "H6S-fOMD67Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 299.28, "view_count": 1778}, {"video_id": "H6S-fOMD67Y", "question": "A partially torn red spring couplet is pasted on the outer wall of a building. Which subtitles have appeared together with this red spring couplet?", "question_wo_referring_query": ", which subtitles have appeared together with this red spring couplet?", "candidates": ["Music", "dance", "luck", "good", "me"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "H6S-fOMD67Y_1", "video_path": "H6S-fOMD67Y.mp4", "subtitle_path": "H6S-fOMD67Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 299.28, "view_count": 1778}, {"video_id": "CURLWPCxe-c", "question": "What change occurs when the pale-faced, elaborately adorned white queen dressed in a white gown with a rounded collar remarks, 'as she entered her autumn years, which in this case' in the subtitles?", "question_wo_referring_query": "What change occurs?", "candidates": ["Dyed her hair black", "Switched to an outfit with a red square collar", "Switched to an outfit with peacock feathers", "Developed freckles on her skin", "Switched to an outfit with a snow leopard design"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "CURLWPCxe-c_0", "video_path": "CURLWPCxe-c.mp4", "subtitle_path": "CURLWPCxe-c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 506.09, "view_count": 15941761}, {"video_id": "CURLWPCxe-c", "question": "What changes occurred when the elegant Queen Elsa, dressed in a white gown with a round-shaped collar and with her pale face and decorated hair, said in the subtitles 'Unfortunately, the disease left her with permanent scars'?", "question_wo_referring_query": "What changes occurred when the elegant Queen Elsa, dressed in a white 
gown with a round-shaped collar and with her pale face and decorated hair, said 'Unfortunately, the disease left her with permanent scars'?", "candidates": ["Hair became shorter", "Became bald", "Braided one braid", "Braided two braids", "Dyed green hair"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "CURLWPCxe-c_1", "video_path": "CURLWPCxe-c.mp4", "subtitle_path": "CURLWPCxe-c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 506.09, "view_count": 15941761}, {"video_id": "A8B765SJwHk", "question": "On the right side of the screen, there is a white bouquet of flowers. A woman wearing a white coat is pouring a red liquid from a bottle into a gray bowl. What objects appear in this scene?", "question_wo_referring_query": "What objects appear in this scene?", "candidates": ["A green potted plant", "A dog", "A lamp", "A book", "A white candle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "A8B765SJwHk_0", "video_path": "A8B765SJwHk.mp4", "subtitle_path": "A8B765SJwHk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.0, "view_count": 349311}, {"video_id": "UBp26Q4gmYA", "question": "In a dark forest, a person is holding a sword with both hands and is about to kneel down in front of a seated woman. 
When the subtitle says 'codpiece,' what objects appear on the screen?", "question_wo_referring_query": "What objects appear on the screen?", "candidates": ["Two white horses", "A mule", "A white horse", "Three white horses", "A black horse"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "UBp26Q4gmYA_0", "video_path": "UBp26Q4gmYA.mp4", "subtitle_path": "UBp26Q4gmYA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 576.58, "view_count": 2086763}, {"video_id": "UBp26Q4gmYA", "question": "In the middle of the screen, an old lady wearing a white short-sleeved shirt appears in the middle of the background. What object is present when the subtitle says 'allows you to live over triple the life expectancy of people'?", "question_wo_referring_query": "What object is present?", "candidates": ["A woman with long hair wearing a blue outfit", "A pink heart", "A red heart with a heartbeat line", "A white heart", "A black question mark"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "UBp26Q4gmYA_1", "video_path": "UBp26Q4gmYA.mp4", "subtitle_path": "UBp26Q4gmYA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 576.58, "view_count": 2086763}, {"video_id": "KNvY59927SI", "question": "In the broadcast studio, a host wearing glasses and a tie is sitting in front of a background screen with a yellow car. 
What color is the host's tie?", "question_wo_referring_query": ", what color is the host's tie?", "candidates": ["black and white stripes", "black", "red", "gold", "blue"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "KNvY59927SI_0", "video_path": "KNvY59927SI.mp4", "subtitle_path": "KNvY59927SI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 237.6, "view_count": 1226}, {"video_id": "KNvY59927SI", "question": "At the end of the video, what is the hairstyle of the woman wearing glasses and a black suit sitting in front of a background with a red building?", "question_wo_referring_query": "What is the hairstyle?", "candidates": ["Shoulder-length wavy bob", "Black mohawk", "Long straight black hair", "Short haircut above the ears", "Waist-length straight black hair"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "KNvY59927SI_1", "video_path": "KNvY59927SI.mp4", "subtitle_path": "KNvY59927SI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 237.6, "view_count": 1226}, {"video_id": "uYNzqgU7na4", "question": "There is a large red car behind the street, and on the right is a supermarket with transparent windows. Outside the windows, there are palm-colored wooden furniture and advertising signs. Two men are on the sidewalk on the right side of the road. The man with blond hair on the left is wearing a yellow printed T-shirt and jeans, while the man on the right is wearing a yellow shirt and a blue lanyard ID card. 
Who is writing with a pen?", "question_wo_referring_query": "Who is writing with a pen?", "candidates": ["The blond man in a yellow printed T-shirt", "The man wearing a yellow shirt and a blue lanyard ID card", "The blond man in a yellow printed T-shirt", "The woman holding a microphone", "The man in the white shirt with black-framed glasses"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "uYNzqgU7na4_0", "video_path": "uYNzqgU7na4.mp4", "subtitle_path": "uYNzqgU7na4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 493.68, "view_count": 5136535}, {"video_id": "uYNzqgU7na4", "question": "Behind the grassy area is a green tree with drooping leaves. To the right is a road, and in the grassy area, there is a yellow wooden signpost. To the left and right of the signpost stand two gentlemen. The man on the right is wearing a golden long wig and boots. The man on the left is dressed in sleepwear with a white rope belt. Which individual is bending over?", "question_wo_referring_query": "Which individual is bending over?", "candidates": ["The man wearing a silver long wig", "The man in blue sleepwear", "The man with glasses and a golden wig", "The man in red sleepwear", "The man wearing a golden long wig"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "uYNzqgU7na4_1", "video_path": "uYNzqgU7na4.mp4", "subtitle_path": "uYNzqgU7na4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 493.68, "view_count": 5136535}, {"video_id": "V5tUiy_LSbg", "question": "On the olive-colored floor, there is a white TV stand, and the TV screen displays a flame image. To the right of the room, there is a white staircase with silver railings. A girl wearing a white short-sleeved shirt and black pants comes out from beside the staircase. 
What is the girl doing the first time she comes out?", "question_wo_referring_query": "On the olive-colored floor, there is a white TV stand, and the TV screen displays a flame image. To the right of the room, there is a white staircase with silver railings. A girl wearing a white short-sleeved shirt and black pants comes out from beside the staircase. What is the girl doing the first time she comes out?", "candidates": ["Kneeling on the floor using a matchstick to burn a letter.", "Kneeling on the floor lighting a matchstick to ignite fireworks.", "Kneeling on the floor watching TV.", "Kneeling on the floor touching the floor.", "Standing by the stairs lighting a matchstick."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "V5tUiy_LSbg_0", "video_path": "V5tUiy_LSbg.mp4", "subtitle_path": "V5tUiy_LSbg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 291.64, "view_count": 23258}, {"video_id": "V5tUiy_LSbg", "question": "On the black table, there is a white plate with decorations, a pumpkin-shaped cup filled with drinkable liquid, and to the right of the pumpkin cup, there is a cup filled with grains held by a hand. 
What happened when cotton candy first appeared on the screen?", "question_wo_referring_query": "What happened when cotton candy first appeared on the screen?", "candidates": ["A hand was holding cotton candy and eating it", "A hand was pinching cotton candy", "A hand shook the pumpkin cup", "A hand sprinkled cotton candy into the pumpkin cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "V5tUiy_LSbg_1", "video_path": "V5tUiy_LSbg.mp4", "subtitle_path": "V5tUiy_LSbg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 291.64, "view_count": 23258}, {"video_id": "liGCU9gaLcM", "question": "There is a strip of paper with formulas written on it at the top of the blue background, and there are pink diagonal stripes in the two corners of the background. When a pair of hands appears on the white paper and the subtitle shows 'reciprocal of the second fraction.1M cancels and we are left with 1 s and 1 M', what are the hands doing?", "question_wo_referring_query": "What are the hands doing?", "candidates": ["Holding a piece of paper", "Forming a heart shape", "Writing with a pen with a blue casing", "Making a fist", "Writing with a pen with a red casing"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "liGCU9gaLcM_0", "video_path": "liGCU9gaLcM.mp4", "subtitle_path": "liGCU9gaLcM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.57999999999998, "view_count": 547871}, {"video_id": "liGCU9gaLcM", "question": "On a white paper with black dots as decoration, numbers are written. In the center bottom of the paper, 'y=2' is written. A pair of hands is positioned above the paper. 
When the subtitle 'this same exact process but this time we will use trials 3 and 2 since' appears, what are these hands doing?", "question_wo_referring_query": "What are these hands doing?", "candidates": ["The hands are holding a pen to write", "The hands are holding the black edge of the paper with formulas written on it", "The hands are holding the pink edge of the paper with formulas written on it", "The hands are holding the white edge of the paper with formulas written on it", "The hands are holding the blue edge of the paper with formulas written on it"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "liGCU9gaLcM_1", "video_path": "liGCU9gaLcM.mp4", "subtitle_path": "liGCU9gaLcM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.57999999999998, "view_count": 547871}, {"video_id": "Q-FY5jRM8qE", "question": "On the road, a woman in a white short-sleeved shirt and a man in a black cap with a backpack are riding bicycles. There is a white wall behind them, and greenery on both sides of the road. What did the woman do after finishing her ride?", "question_wo_referring_query": "What did the woman do after finishing her ride?", "candidates": ["took a walk by the sea", "sat on a bamboo chair", "rode a motorcycle", "lay on a beach chair", "played ping pong"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "Q-FY5jRM8qE_0", "video_path": "Q-FY5jRM8qE.mp4", "subtitle_path": "Q-FY5jRM8qE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.04, "view_count": 32216}, {"video_id": "RHzxZO3a0CE", "question": "In a room with suitable lighting, a tall brown bookshelf is filled with books of various colors. A girl wearing a light-colored blouse appears in front of the bookshelf. 
Who is the first to introduce themselves in front of these books?", "question_wo_referring_query": "Who is the first to introduce themselves in front of these books?", "candidates": ["The boy wearing a red short-sleeved shirt", "The little girl wearing a red sweater", "The little girl wearing a light-colored blouse", "The little girl wearing a blue coat", "The woman wearing a red checkered suit"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "RHzxZO3a0CE_0", "video_path": "RHzxZO3a0CE.mp4", "subtitle_path": "RHzxZO3a0CE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.29, "view_count": 3854}, {"video_id": "RHzxZO3a0CE", "question": "A tall, dark green bookshelf is filled with books in various colors. A woman wearing a pink suit and a pearl necklace, along with black-framed glasses, is conversing with a girl dressed in a light-colored blouse. What is the first item that the little girl picks up?", "question_wo_referring_query": "What is the first item that the little girl picks up?", "candidates": ["a fan", "a very thick book", "a white pillow", "a pen", "a magnifying glass"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "RHzxZO3a0CE_1", "video_path": "RHzxZO3a0CE.mp4", "subtitle_path": "RHzxZO3a0CE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.29, "view_count": 3854}, {"video_id": "ZaZIkctuUvs", "question": "Under the blue sky with white clouds drifting, distant mountain ranges and various buildings can be seen. A ship is sailing on the sea, creating white waves. The ship's hull is red, and the structures on the hull are light-colored with designs of robotic cats. 
After the subtitle 'different spots around the island by' appears, what happens on the screen?", "question_wo_referring_query": "After the subtitle appears, what happens on the screen?", "candidates": ["A man in white-patterned red shorts jumps from the ship into the sea", "A woman in a blue top jumps into the sea", "A woman in a purple top jumps into the sea", "A man in black shorts jumps into the sea", "A man in blue shorts jumps into the sea"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "ZaZIkctuUvs_0", "video_path": "ZaZIkctuUvs.mp4", "subtitle_path": "ZaZIkctuUvs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 481.76, "view_count": 107346}, {"video_id": "LO7mi70nqyg", "question": "On the right side of the road, there is a white building. The barrier in the middle of the road is white. On the left side of the road, on a black pavilion, there is a pattern with a dog imprinted on a sea banner. The street lamp has a round white light bulb. A man with black hair tied up is facing a mirror. After the subtitle appeared \u201cairport and I get time to get things\u201d, what object appeared next?", "question_wo_referring_query": "What object appeared next?", "candidates": ["a red short sleeve shirt", "a skateboard", "a yellow parasol", "a box of incense", "a restaurant"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "LO7mi70nqyg_0", "video_path": "LO7mi70nqyg.mp4", "subtitle_path": "LO7mi70nqyg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.44, "view_count": 98800}, {"video_id": "LO7mi70nqyg", "question": "Under the illumination of the lamp, the ceiling panel on the left side of the dining room appears golden. The dining room wall has a black and white checkered pattern and nautical decorations. 
On the left, a man in a white shirt is seated on a black chair having a meal, while next to him is a man in a light-colored shirt with a black belt. After the subtitle 'something that laughs at dark Chicago I' appears, what item shows up?", "question_wo_referring_query": "What item shows up?", "candidates": ["Food with green sauce", "Food with red sauce", "A staircase", "A trash can", "A slide board"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "LO7mi70nqyg_1", "video_path": "LO7mi70nqyg.mp4", "subtitle_path": "LO7mi70nqyg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 370.44, "view_count": 98800}, {"video_id": "IdScFcPHTKY", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there's a clip of the girl driving, then a pink screen appears, followed by the girl driving alone, then a green screen, followed by the girl coming to a room with her friend to drive together, and finally a black screen with the girl driving alone again.", "First, there's a video of the girl and her friend driving together, followed by a pink screen with the girl driving alone, then a black screen, and then the girl comes to a room with her friend to drive together, and finally a video of the girl driving alone.", "First, there's a clip of the girl driving, then a green screen appears, followed by the girl and her friend driving together, then a black screen with the girl driving alone, and finally a pink screen with the girl driving alone again.", "First, there's a clip of the girl driving, then a black screen appears, followed by the girl driving alone, then a green screen appears, followed by the girl coming to a room with her friend to drive together, and finally a video of the girl and her friend driving together.", "First, there's a clip of the girl driving, 
then a black screen appears, followed by a video of the girl driving alone, then a green screen appears, followed by the girl coming to a room with her friend to drive together, and finally a video of the girl and her friend driving together."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "IdScFcPHTKY_0", "video_path": "IdScFcPHTKY.mp4", "subtitle_path": "IdScFcPHTKY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 409.88, "view_count": 25453}, {"video_id": "rb02ZnkcW4Y", "question": "Against a white background, a hand holding a pen draws a black line. The pen body is light-colored with black characters. Where else has this pen appeared?", "question_wo_referring_query": "Where else has this pen appeared?", "candidates": ["On a green background with blue 'M' characters", "On a green background with red 'M' characters", "On a black background with red 'M' characters", "On a white background with blue 'M' characters", "On a black background with blue 'M' characters"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "rb02ZnkcW4Y_0", "video_path": "rb02ZnkcW4Y.mp4", "subtitle_path": "rb02ZnkcW4Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.1, "view_count": 1339}, {"video_id": "rb02ZnkcW4Y", "question": "There are blue characters on the white background above, a red logo in the upper right corner, a yellow cat face in the upper left corner, a blue arrow below the cat face, and a chart in the center of the screen. The chart is mainly in blue and red tones, containing statistical data and bar charts. 
In what other scene has this chart appeared?", "question_wo_referring_query": "In what other scene has this chart appeared?", "candidates": ["Displayed on a black background with three thumbnails", "Displayed on a white background with five thumbnails", "Displayed on a white background with four thumbnails", "Displayed on a white background with three thumbnails", "Displayed on a white background with four thumbnails"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "rb02ZnkcW4Y_1", "video_path": "rb02ZnkcW4Y.mp4", "subtitle_path": "rb02ZnkcW4Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 361.1, "view_count": 1339}, {"video_id": "xD_J-844NZ8", "question": "In a well-lit room with white walls all around, there is a glass door and window below the emergency exit sign. On the left side, there is a black treadmill and electronic equipment. A man in a blue short-sleeved shirt is conversing with another man wearing a gray-white patterned sweater. Later, when a man appears next to the treadmill, what change occurs to the treadmill?", "question_wo_referring_query": "What change occurs to the treadmill?", "candidates": ["The treadmill starts running", "The treadmill does not change", "The treadmill is moved outside the room", "The treadmill changes from black to white", "The treadmill changes from black to red"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "xD_J-844NZ8_0", "video_path": "xD_J-844NZ8.mp4", "subtitle_path": "xD_J-844NZ8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.12, "view_count": 4182}, {"video_id": "xD_J-844NZ8", "question": "The background of the broadcast room is a brightly lit city high-rise, with a gray-white logo in the upper right corner. A woman wearing a button-down shirt and silver earrings is in front of a gray notebook. 
There is a scrolling information bar at the bottom. When a girl in blue clothes and glasses appears in front of the microphone to speak, what changes occur in the information bar?", "question_wo_referring_query": ", what changes occur in the information bar?", "candidates": ["The information bar changes from white, white, and black to white, black, and black.", "The information bar changes from red, white, and black to red, red, and black.", "The information bar changes from red, white, and black to black, black, and black.", "The information bar changes from red, white, and black to white, black, and black.", "The information bar changes from red, white, and black to white, white, and black."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "xD_J-844NZ8_1", "video_path": "xD_J-844NZ8.mp4", "subtitle_path": "xD_J-844NZ8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.12, "view_count": 4182}, {"video_id": "ECfZQA2hDEY", "question": "The girl wearing a blue camisole and jeans is looking in the mirror. The walls and door of the room reflected in the mirror are white, and the items on the bed are also white. Above the headboard is a decoration embroidered with floral patterns. There is a cabinet next to the headboard. 
After the subtitle 'want you snug want me i need you i want' appears, what change happens to the girl's top?", "question_wo_referring_query": "What change happens to the girl's top?", "candidates": ["The blue camisole turns into a purple camisole", "The blue camisole turns into a skirt", "The blue camisole turns into a white camisole", "The blue camisole turns into a denim jacket", "The blue camisole turns into a yellow camisole"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "ECfZQA2hDEY_0", "video_path": "ECfZQA2hDEY.mp4", "subtitle_path": "ECfZQA2hDEY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 522.49, "view_count": 9407}, {"video_id": "ECfZQA2hDEY", "question": "The interior walls and door are white, the pillows on the bed are white, and there are green floral decorations above the headboard. The bed frame is metallic. A woman wearing a blue tank top and a necklace is holding a light yellow skirt with small floral patterns. 
After the subtitle 'which is amazing because usually skirts,' what change occurs in the woman's outfit?", "question_wo_referring_query": ", what change occurs in the woman's outfit?", "candidates": ["The top is a sweater, and the bottom is jeans.", "The top changes to a jacket, and the bottom is a light yellow skirt with small floral patterns.", "The top is a black denim shirt, and the bottom is a short skirt.", "The top changes to a red tank top, and the bottom is blue jeans.", "The top changes to a white tank top, and the bottom is a light yellow skirt with small floral patterns."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "ECfZQA2hDEY_1", "video_path": "ECfZQA2hDEY.mp4", "subtitle_path": "ECfZQA2hDEY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 522.49, "view_count": 9407}, {"video_id": "9mc_uFLw67c", "question": "In the top left corner, there is a black-and-white photo of a cat against a white background. In the top right corner, there is a cartoon character scattering confetti. In the center, there is an image of a cartoon water pipe, with two blue arrows on the left side and one blue arrow on the right side of the pipe. In the bottom right corner, there is a cartoon character lying on a chair. 
What happened to the cartoon character?", "question_wo_referring_query": "What happened to the cartoon character?", "candidates": ["The cartoon character kneeled down", "The cartoon character stood up", "The cartoon character nodded", "The cartoon character waved", "The cartoon character got colored"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "9mc_uFLw67c_0", "video_path": "9mc_uFLw67c.mp4", "subtitle_path": "9mc_uFLw67c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 322.0, "view_count": 1257}, {"video_id": "9mc_uFLw67c", "question": "At the top of the white background is a blue character, the left side is blank, and the right side has a blue arrow pointing to a purple text box. Below the blue character is a black and white floral cat image. What appears below the cat image?", "question_wo_referring_query": "What appears below the cat image?", "candidates": ["A drawn square", "A drawn arrow", "A drawn circle", "A drawn star", "A drawn rectangle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "9mc_uFLw67c_1", "video_path": "9mc_uFLw67c.mp4", "subtitle_path": "9mc_uFLw67c_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 322.0, "view_count": 1257}, {"video_id": "rvIX0vzx3NM", "question": "A gray floor connects to a yellow wall. There are multiple square screens on the wall. Cool color bulbs are installed on the ceiling. A chair with miscellaneous objects is located to the right of a person. A man in short sleeves and a woman in a light-colored dress are having a conversation. 
When the subtitle 'for a combustion reaction you need some' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["scarf", "glasses", "umbrella", "hat", "headscarf"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "rvIX0vzx3NM_0", "video_path": "rvIX0vzx3NM.mp4", "subtitle_path": "rvIX0vzx3NM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 423.13, "view_count": 136160}, {"video_id": "rvIX0vzx3NM", "question": "A grey floor connects to a yellow wall with multiple rectangular screen panels on it. The ceiling is fitted with cool-colored light bulbs. A white table is placed in front of a woman in a light-colored coat and a man in short sleeves. There is a plastic bottle on the table. What objects are present in the scene when the subtitle 'reaction so it makes something go faster' appears?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["plush toy", "tall potted plant", "hat", "scarf", "yellow cabinet"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "rvIX0vzx3NM_1", "video_path": "rvIX0vzx3NM.mp4", "subtitle_path": "rvIX0vzx3NM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 423.13, "view_count": 136160}, {"video_id": "kjraelDMrFQ", "question": "A high-pressure cooker is placed on the weeds, with one hand holding the black handle of the high-pressure cooker, and next to the high-pressure cooker lies a red bottle. There is a blue pattern on the body of the cooker with some spots. 
What is the approximate shape of the blue pattern on the high-pressure cooker?", "question_wo_referring_query": "What is the approximate shape of the blue pattern on the high-pressure cooker?", "candidates": ["It is a floral pattern", "It is a circle", "It is a triangle", "It is an approximate quadrilateral pattern", "It is a star"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "kjraelDMrFQ_0", "video_path": "kjraelDMrFQ.mp4", "subtitle_path": "kjraelDMrFQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 371.71, "view_count": 1246786}, {"video_id": "kjraelDMrFQ", "question": "What kind of handle does the axe being balanced by the man in green short sleeves and black shorts against a tree have?", "question_wo_referring_query": "What kind of handle does the axe have?", "candidates": ["Handle painted blue", "Black iron axe handle", "Light yellow axe handle", "Silver metal handle", "Handle painted red"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "kjraelDMrFQ_1", "video_path": "kjraelDMrFQ.mp4", "subtitle_path": "kjraelDMrFQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 371.71, "view_count": 1246786}, {"video_id": "v_15HsWidKY", "question": "There is a white wooden board mounted on the wall, with a blue bowl and a transparent cup on it. Below the wooden board, the table holds a cutting board and decorations. On the right side of the wooden board, there are red letters stuck on the wall. A woman in a shirt is holding two glasses of milk while talking with a girl in a pink short-sleeve. On the table in front of them, there are silver kitchen utensils. 
When the word 'milk' appears, what kind of cup is the woman using for the milk?", "question_wo_referring_query": "What kind of cup is the woman using for the milk?", "candidates": ["Light-colored thermos cup", "Transparent glass cup", "Patterned ceramic cup", "Disposable paper cup", "Stainless steel cup"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "v_15HsWidKY_0", "video_path": "v_15HsWidKY.mp4", "subtitle_path": "v_15HsWidKY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.1, "view_count": 300216}, {"video_id": "v_15HsWidKY", "question": "On the yellow table, there is a transparent cup and bowl. The bowl on the top right contains an egg, and the largest bowl on the right contains a white powdery ingredient. In the top left corner, there are various spoons and colored ingredients. When the subtitle shows 'how's it looking does it look combined', what is the smallest spoon in the top left corner like?", "question_wo_referring_query": "What is the smallest spoon in the top left corner like?", "candidates": ["Yellow wooden spoon", "Blue plastic spoon", "Silver-colored metallic spoon", "Green plastic spoon", "White ceramic spoon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "v_15HsWidKY_1", "video_path": "v_15HsWidKY.mp4", "subtitle_path": "v_15HsWidKY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 600.1, "view_count": 300216}, {"video_id": "K9H4jcbcq7s", "question": "In a game map with green and yellow as the main colors, there is a data panel at the bottom. A river flows through the middle of the map from top to bottom. On both sides of the river, there are green trees and yellow patches of land. On the bottom left corner of the river on the map, there are two units marked with green flags. 
Which will connect the unit with the green flag to the yellow patch of land near the river?", "question_wo_referring_query": "Which will connect the unit with the green flag to the yellow patch of land near the river?", "candidates": ["a red line", "a black line", "a red dot", "a purple dot", "a purple line"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "K9H4jcbcq7s_0", "video_path": "K9H4jcbcq7s.mp4", "subtitle_path": "K9H4jcbcq7s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 420.68, "view_count": 9711}, {"video_id": "K9H4jcbcq7s", "question": "In the game map dominated by green and yellow colors, there is a data panel at the bottom. A river flows on the left side of the map, with a bridge over the river. On the right side of the bridge, there is a popping-up data box. A red unit is parked on the right side of the bridge. On the right side of the map, there is a red slope, and below the slope, a small road connects two densely dotted soil areas. What is moving near the bridge?", "question_wo_referring_query": ", What is moving near the bridge?", "candidates": ["A small black unit with a white marker", "A small black unit with a black marker", "Small red dots", "A small black unit with a red marker", "A small black unit with a green marker"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "K9H4jcbcq7s_1", "video_path": "K9H4jcbcq7s.mp4", "subtitle_path": "K9H4jcbcq7s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 420.68, "view_count": 9711}, {"video_id": "qV4lR9EWGlY", "question": "In front of a grey background stands a long-haired woman wearing a multi-colored striped shirt, with a necklace around her neck. 
There is an elderly man in a white tank top in front of the woman, a lamp is situated above the woman to the right, and to the left of the woman, there is a black object hanging by a string. What is the elderly man doing the first time he appears?", "question_wo_referring_query": "What is the elderly man doing the first time he appears?", "candidates": ["Pulling a small cart", "Smoking a cigarette", "Drinking tea", "Pushing a purple machine forward", "Chasing a little dog"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "qV4lR9EWGlY_0", "video_path": "qV4lR9EWGlY.mp4", "subtitle_path": "qV4lR9EWGlY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 578.58, "view_count": 1621762}, {"video_id": "qV4lR9EWGlY", "question": "In front of the gray background is a long-haired woman wearing a colorful striped shirt. Behind the woman, there are two lanterns hanging. On the left side of the woman, there is a rack with a doll placed on it. On the left side of the woman's table, there are books, and on the right side, there is a Rubik's cube. In a purple frame to the right of the woman, there is a cartoon boy. What is the boy doing when he first appears?", "question_wo_referring_query": "What is the boy doing when he first appears?", "candidates": ["Bending down", "Jumping on the spot", "Chasing a puppy", "Dancing to music", "Wearing earphones"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "qV4lR9EWGlY_1", "video_path": "qV4lR9EWGlY.mp4", "subtitle_path": "qV4lR9EWGlY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 578.58, "view_count": 1621762}, {"video_id": "TQkisvKkuSw", "question": "Outside the city walls is blue water, a boat with a blue and white banner is docked on the water surface, a blue cartoon character with a duckbill cap appears in the lower right corner. 
When the subtitles show 'only Phoenician trading Outpost in the,' what happens?", "question_wo_referring_query": "What is happening?", "candidates": ["A person wearing armor appears on the city wall", "A swordsman appears on the city wall", "The boat with a black banner just sails from left to right", "The boat with a yellow banner just sails from left to right", "The boat with a blue and white banner just sails from left to right"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "TQkisvKkuSw_0", "video_path": "TQkisvKkuSw.mp4", "subtitle_path": "TQkisvKkuSw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 415.73, "view_count": 2122}, {"video_id": "TQkisvKkuSw", "question": "In the background of the grassland, there are large trees and buildings. A cartoon character holding a sword and shield is standing in the center. On the left side is a soldier wearing a helmet and holding a spear and shield, and on the right side is a cartoon character holding a knife. What happens when the subtitle 'Music' appears?", "question_wo_referring_query": "What happens when?", "candidates": ["The cartoon character swings the sword in their hand", "The tree falls down", "A soldier with a bow and arrow passes by", "A soldier riding a war horse passes by", "A group of soldiers riding elephants passes by"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "TQkisvKkuSw_1", "video_path": "TQkisvKkuSw.mp4", "subtitle_path": "TQkisvKkuSw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 415.73, "view_count": 2122}, {"video_id": "U979-KD6tfM", "question": "The room is illuminated with warm light, the stool, bookshelf, and chair are all palm yellow in color. The bookshelf against the wall has books and photos on it, and in front of the chair is a table. To the right of the table is a flag. 
The man wearing a suit and white shirt stretches his back. What did the man do after stretching his back?", "question_wo_referring_query": "What did the man do after stretching his back?", "candidates": ["Faced the mirror and gave a thumbs-up", "Faced the mirror and pushed up his glasses", "Faced the mirror and adjusted his collar", "Faced the mirror and nodded with a smile", "Faced the mirror and waved"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "U979-KD6tfM_0", "video_path": "U979-KD6tfM.mp4", "subtitle_path": "U979-KD6tfM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 381.88, "view_count": 3173}, {"video_id": "U979-KD6tfM", "question": "The background of the studio is a large circular building. A man wearing a suit and a red tie is sitting on the left chair, and a woman dressed in red and wearing trousers is sitting on the right chair. The man faces the woman and spreads his arms. What did the man do after spreading his arms?", "question_wo_referring_query": ", what did the man do after spreading his arms?", "candidates": ["Crossed his arms", "Placed his hands on his lap with palms facing down", "Made a fist with one hand", "Placed his hands on his lap with palms facing up", "Extended his hands towards the woman"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "U979-KD6tfM_1", "video_path": "U979-KD6tfM.mp4", "subtitle_path": "U979-KD6tfM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 381.88, "view_count": 3173}, {"video_id": "mKCu9EPTFM8", "question": "A blonde woman wearing a white top, with a ring on her left middle finger and red nail polish; what action is the woman in white doing?", "question_wo_referring_query": "What action is the woman in white doing?", "candidates": ["Spreading both hands", "Clenching right fist", "Clenching fists", "Clapping hands"], "topic_category": 
"LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "mKCu9EPTFM8_0", "video_path": "mKCu9EPTFM8.mp4", "subtitle_path": "mKCu9EPTFM8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 98, "duration": 22.99, "view_count": 37278}, {"video_id": "Qi-3lGh2gas", "question": "In the video, there is a man wearing a colorful striped polo shirt and black frame glasses. On the left side of the video, there is a blonde woman wearing a blue top, and on the right side, there is a man in a suit with a blue shirt. These three people are chatting. What items appear in this video?", "question_wo_referring_query": "What items appear in this video?", "candidates": ["Blue vase", "Pink picture frame", "Black bookshelf", "Black sofa"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "Qi-3lGh2gas_0", "video_path": "Qi-3lGh2gas.mp4", "subtitle_path": "Qi-3lGh2gas_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 328, "duration": 33.0, "view_count": 19090}, {"video_id": "sKNCEon7ddk", "question": "In front of a wall with graffiti, there is a man wearing a plaid long-sleeved shirt. When the subtitle mentions 'written all of my painting and then I', what item is placed behind him to the left?", "question_wo_referring_query": "In front of a wall with graffiti, there is a man wearing a plaid long-sleeved shirt. 
When the subtitle mentions 'written all of my painting and then I', what item is placed behind him to the left?", "candidates": ["Hat", "A photo of two men leaning on each other", "CD", "Plant pot"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "sKNCEon7ddk_0", "video_path": "sKNCEon7ddk.mp4", "subtitle_path": "sKNCEon7ddk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 89, "duration": 59.02, "view_count": 3780}, {"video_id": "-FXUKR7wfro", "question": "In the video, in front of a white shelf with various notebooks, there is a woman standing who is wearing glasses, an olive-colored mask, and a blue off-shoulder dress. What is the color of the shopping basket she is holding?", "question_wo_referring_query": "In the video, in front of a white shelf with various notebooks, there is a woman standing who is wearing glasses, an olive-colored mask, and a blue off-shoulder dress. What is the color of the shopping basket she is holding?", "candidates": ["Red", "Blue", "White", "Olive"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "-FXUKR7wfro_0", "video_path": "-FXUKR7wfro.mp4", "subtitle_path": "-FXUKR7wfro_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 201, "duration": 43.01, "view_count": 227121}, {"video_id": "ccSUCa4UoyA", "question": "In the video, in front of the bookshelf, there's a woman with curly hair wearing a black top. She raises her right hand and holds something when she mentions 'right they want to they want to create'. 
What is the color of the item in her hand in front of the bookshelf?", "question_wo_referring_query": "What color is the item in the woman's hand in front of the bookshelf?", "candidates": ["green", "black", "blue", "yellow"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "ccSUCa4UoyA_0", "video_path": "ccSUCa4UoyA.mp4", "subtitle_path": "ccSUCa4UoyA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 132, "duration": 24.0, "view_count": 59772}, {"video_id": "pArQIGFFpDw", "question": "In a room lined with various musical instruments, these instruments are placed in glass display cases. There is a person sitting in the middle of the room playing an instrument. Who is playing the instrument?", "question_wo_referring_query": "Who is playing the instrument?", "candidates": ["A woman wearing a black suspender dress", "A boy wearing a hat", "A woman wearing a black suspender pants", "A man in a black shirt"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "pArQIGFFpDw_0", "video_path": "pArQIGFFpDw.mp4", "subtitle_path": "pArQIGFFpDw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 35, "duration": 22.02, "view_count": 25895}, {"video_id": "Usyw2Vlu2AY", "question": "On a wooden table, there is a green pen on the left side, and a ruler on the right side. 
When the subtitle mentions 'um the reason i start there is because', what does the right hand do?", "question_wo_referring_query": "What does the right hand do?", "candidates": ["Shake", "Hold the pen", "Tap the table", "Pick up the ruler"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "Usyw2Vlu2AY_0", "video_path": "Usyw2Vlu2AY.mp4", "subtitle_path": "Usyw2Vlu2AY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1024, "duration": 26.99, "view_count": 9831}, {"video_id": "ey4HBZMf50E", "question": "In the video, there is a man speaking by the window. This man is wearing a white cap with a green brim and is dressed in a white undershirt and a military-green jacket. What action did this man in the military-green jacket take after speaking?", "question_wo_referring_query": "What action did this man in the military-green jacket take after speaking?", "candidates": ["Eating something", "Chatting with someone nearby", "Stirring ingredients in the pot", "Playing a game"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "ey4HBZMf50E_0", "video_path": "ey4HBZMf50E.mp4", "subtitle_path": "ey4HBZMf50E_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 80, "duration": 38.01, "view_count": 6125}, {"video_id": "u7eLwLeNLt8", "question": "In one frame, there is a black woman wearing light pink clothes and a light pink headscarf. In another frame, there is a blonde man in a white short-sleeve shirt and a black man in a blue shirt chatting. In front of them is a black square object. 
Who appears first in the video?", "question_wo_referring_query": "Who appears first in the video?", "candidates": ["The black woman wearing light pink clothes and a light pink headscarf", "The blonde man wearing a white short-sleeve shirt", "They appear together", "The black man wearing a blue shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "u7eLwLeNLt8_0", "video_path": "u7eLwLeNLt8.mp4", "subtitle_path": "u7eLwLeNLt8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 341, "duration": 22.0, "view_count": 5425718}, {"video_id": "O8JASd1jdZ0", "question": "A pair of hands is holding a black cloth bag with a transparent lunch box inside. On top of the lunch box, there is a piece of red cloth. On the red cloth, there are several forks. After the subtitle 'Music' appears, what object appears?", "question_wo_referring_query": "What object appears?", "candidates": ["An eggbeater, a frying pan, and a soy milk maker", "An eggbeater, a frying pan, and a mobile phone", "A frying pan, a mobile phone, and a pair of chopsticks", "An eggbeater, a frying pan, and an oven"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "O8JASd1jdZ0_0", "video_path": "O8JASd1jdZ0.mp4", "subtitle_path": "O8JASd1jdZ0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 481, "duration": 51.0, "view_count": 458603}, {"video_id": "Zet1oCkrqs4", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a woman is explaining the old man's work to a group of children. Then, there is a silhouette of an old man in the dark. Finally, an old man with white hair is standing alone observing an object.", "First, an old man with white hair is standing alone observing an object. Then, there is a silhouette of an old man in the dark. 
Finally, a woman is explaining the old man's work to a group of children.", "First, a woman is explaining the old man's work to a group of children. Then, an old man with white hair is standing alone observing an object. Finally, there is a silhouette of an old man in the dark.", "First, there is an old man with white hair standing alone observing an object. Then, a woman in a black top is explaining the old man's work to a group of children. Finally, there is a silhouette of an old man in the dark.", "First, there is a silhouette of an old man in the dark. Then, a woman is explaining the old man's work to a group of children. Finally, an old man with white hair is standing alone observing an object."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "Zet1oCkrqs4_0", "video_path": "Zet1oCkrqs4.mp4", "subtitle_path": "Zet1oCkrqs4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 148, "duration": 48.01, "view_count": 11623}, {"video_id": "JmoOIBK0r_U", "question": "On a white desk, there is a green plant and a glass vase with three gardenias. Next to the desk, there is someone holding a black book with four printed flowers on the cover. 
In which of the following scenes does this black book with four printed flowers on the cover appear later?", "question_wo_referring_query": "In which of the following scenes does this black book with four printed flowers on the cover appear later?", "candidates": ["When opening a purple backpack", "In a scene with an olive-colored desk and curtains", "In a scene with curtains", "In a scene with a white desk"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "JmoOIBK0r_U_0", "video_path": "JmoOIBK0r_U.mp4", "subtitle_path": "JmoOIBK0r_U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 183, "duration": 38.0, "view_count": 161484}, {"video_id": "bushHvw__Mo", "question": "In front of a dense wooden hut, an old man wearing a headscarf and a man holding a black mummy in his arms appear together with which subtitles?", "question_wo_referring_query": ", which subtitles appear together with the man holding a black mummy in his arms?", "candidates": ["the people of the Trobriand islands have", "their dead done through a smoke curing", "outside on display for the village", "process and then the bodies are left"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "bushHvw__Mo_0", "video_path": "bushHvw__Mo.mp4", "subtitle_path": "bushHvw__Mo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 782, "duration": 28.99, "view_count": 1284651}, {"video_id": "lMNRQix_QzQ", "question": "In front of three rounded doors, there is a person holding a tool in the right hand, a lamp in the left hand, and carrying a bag. How does their face change when they rest at the back compared to when they are in the front?", "question_wo_referring_query": "In front of three rounded doors, there is a person holding a tool in the right hand, a lamp in the left hand, and carrying a bag. 
How does their face change when they rest at the back compared to when they are in the front?", "candidates": ["The face turns black, and the face is covered in mud", "The face turns red, and the face is covered in mud", "The face turns pale, and the mud disappears", "The face turns blue, and the mud disappears"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "lMNRQix_QzQ_0", "video_path": "lMNRQix_QzQ.mp4", "subtitle_path": "lMNRQix_QzQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 36, "duration": 35.0, "view_count": 800632}, {"video_id": "UKqdewK2c0A", "question": "When the video screen turns to a blue background and the news headline 'Reform would harden border procedures for irregular arrivals EU VOTES ON STRICTER MIGRATION RULES' appears at the bottom of the screen, what is the hand of the woman with brown curly hair in a black professional outfit doing?", "question_wo_referring_query": "What is the hand of the woman with brown curly hair in a black professional outfit doing?", "candidates": ["Holding a phone", "Not moving", "Her fingers are interlocked", "Gesticulating"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "UKqdewK2c0A_0", "video_path": "UKqdewK2c0A.mp4", "subtitle_path": "UKqdewK2c0A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 15, "duration": 58.0, "view_count": 3380}, {"video_id": "4i42SqHv2dY", "question": "A long-haired woman in a white top is sitting in a parked car. She raises her right hand to block the sunlight. 
What else is on her right arm?", "question_wo_referring_query": "What else is on her right arm of the long-haired woman?", "candidates": ["Wrist guard", "Bracelet", "Watch", "Hair tie"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "4i42SqHv2dY_0", "video_path": "4i42SqHv2dY.mp4", "subtitle_path": "4i42SqHv2dY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 228, "duration": 29.0, "view_count": 39792}, {"video_id": "aXGsieBZ7YY", "question": "In the middle of the square, there are four people chatting face-to-face. The houses on both sides of the first row in the background are white, and the short house in the middle is earth yellow. The second row is a whole building, where a woman is carrying a purple backpack. When the subtitle 'we would love to kidnap you and give you the best day ever' is shown, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["sofa", "fountain", "bicycle", "carriage"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "aXGsieBZ7YY_0", "video_path": "aXGsieBZ7YY.mp4", "subtitle_path": "aXGsieBZ7YY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 152, "duration": 25.99, "view_count": 6791191}, {"video_id": "Xq8VstbyI_0", "question": "In a room with a backdrop wall on the left side and a window on the right side, behind which there is a large white building, there is a blonde man speaking. 
What color shirt is the blonde man wearing?", "question_wo_referring_query": "What color shirt is the blonde man wearing?", "candidates": ["Light blue", "Red", "Black", "Purple", "Yellow"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "Xq8VstbyI_0_0", "video_path": "Xq8VstbyI_0.mp4", "subtitle_path": "Xq8VstbyI_0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 37, "duration": 55.02, "view_count": 1101}, {"video_id": "6wBxSeUCRTw", "question": "In the video, there is a long metallic object with a glossy surface on the muddy ground. When the subtitle shows 'meling of the roads and that's what,' what is the color of the long metallic object?", "question_wo_referring_query": "What is the color of the long metallic object?", "candidates": ["black", "white", "purple", "green"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "6wBxSeUCRTw_0", "video_path": "6wBxSeUCRTw.mp4", "subtitle_path": "6wBxSeUCRTw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 33, "duration": 28.0, "view_count": 32942}, {"video_id": "ILJzcyC9lyw", "question": "When a rectangular container completely wrapped with tin foil is placed on a tan wooden table, what material is used to insulate both ends of the rectangular container?", "question_wo_referring_query": "What material is used to insulate both ends of the rectangular container?", "candidates": ["Floral insulation cloth", "Red plastic insulation mat", "Red insulation cloth", "Red insulated gloves", "Floral insulated gloves"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "ILJzcyC9lyw_0", "video_path": "ILJzcyC9lyw.mp4", "subtitle_path": "ILJzcyC9lyw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 145, "duration": 43.0, "view_count": 75585}, {"video_id": "D6ozMRBzOpY", "question": "In the video, four dogs are pulling a 
small cart. Seated on the small cart is a man wearing a black shirt and black pants. Beside him is another man wearing a black and red jacket and a hat. What is this man with the hat doing?", "question_wo_referring_query": "What is this man with the hat doing?", "candidates": ["Jumping", "Squatting", "Pushing the small cart", "Singing"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "D6ozMRBzOpY_0", "video_path": "D6ozMRBzOpY.mp4", "subtitle_path": "D6ozMRBzOpY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 545, "duration": 56.99, "view_count": 4191}, {"video_id": "7QIEU9KkY5g", "question": "In front of a large oil painting depicting many figures, there is a man wearing a white hat backward and a gray jacket, with sunglasses hanging at the front. When the subtitles 'wow, it is a painting' appear, what does he do with his right hand?", "question_wo_referring_query": "What does he do with his right hand?", "candidates": ["Extends his hand to touch the oil painting", "Waves his hand", "Extends his index finger towards the oil painting", "Extends his thumb"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "7QIEU9KkY5g_0", "video_path": "7QIEU9KkY5g.mp4", "subtitle_path": "7QIEU9KkY5g_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 266, "duration": 17.02, "view_count": 8571765}, {"video_id": "VFWVA7ihCSs", "question": "On a wooden table with a piece of cloth, there's a pot with some ingredients. Someone, holding a black bowl, is adding seasonings into the pot. 
What did they do after this?", "question_wo_referring_query": "What did they do after this?", "candidates": ["Chop vegetables", "Stir with a wooden spatula", "Cut garlic", "Add green vegetables to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "VFWVA7ihCSs_0", "video_path": "VFWVA7ihCSs.mp4", "subtitle_path": "VFWVA7ihCSs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 374, "duration": 56.02, "view_count": 213}, {"video_id": "rem9INLWPow", "question": "When the man in the olive-red shirt with gold-rimmed glasses is explaining a picture of a bull head, after a white bull head statue appears in the picture, which part inside the white bull head statue does the man in the olive-red shirt introduce first?", "question_wo_referring_query": "Which part inside the white bull head statue does the man in the olive-red shirt introduce first?", "candidates": ["The nose", "The horns", "The mouth", "The eyes"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "rem9INLWPow_0", "video_path": "rem9INLWPow.mp4", "subtitle_path": "rem9INLWPow_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 143, "duration": 46.01, "view_count": 4552}, {"video_id": "nLG5PFgeWNw", "question": "In the video, there is a large white bowl with green dumplings and a metal spatula. A hand is holding a bowl with handles on both sides, adding ingredients to the large bowl. 
Where does the metal spatula appear at the beginning of the video?", "question_wo_referring_query": "Where does the metal spatula appear at the beginning of the video?", "candidates": ["Inside a green bowl", "Inside a teacup", "Inside a red pot", "Holding the metal spatula to arrange ingredients on a baking tray"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "nLG5PFgeWNw_0", "video_path": "nLG5PFgeWNw.mp4", "subtitle_path": "nLG5PFgeWNw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 201, "duration": 26.0, "view_count": 17600}, {"video_id": "2wJY5sTfQmg", "question": "Shrimp are placed in a transparent bowl. The right hand holds a small transparent bowl and adds red seasoning. Which of the following subtitles and these shrimp appeared at the same time?", "question_wo_referring_query": "Which of the following subtitles appeared at the same time as these shrimp?", "candidates": ["mm-hmm nice", "[Music]", "nice", "mm-hmm and woof"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "2wJY5sTfQmg_0", "video_path": "2wJY5sTfQmg.mp4", "subtitle_path": "2wJY5sTfQmg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 251, "duration": 44.0, "view_count": 210951}, {"video_id": "MUSRgLDN0Sc", "question": "On a table, there is a smartphone and a cup of coffee. A person with green hair wearing black clothes is working on an iPad. 
Later, while eating at the table, what tool does she switch to working with?", "question_wo_referring_query": "What tool does she switch to working with?", "candidates": ["Laptop", "Smartphone", "Desktop computer", "Printer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "MUSRgLDN0Sc_0", "video_path": "MUSRgLDN0Sc.mp4", "subtitle_path": "MUSRgLDN0Sc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 165, "duration": 58.99, "view_count": 313757}, {"video_id": "O45fPBY5fG0", "question": "In front of a white double-layer house, a man wearing a white tank top with tattoos and a man with combed hair wearing a white T-shirt are facing each other. What are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["They are riding horses and having a conversation in front of the house.", "They are sitting on the grass in front of the house.", "They are standing on the grass in front of the house and having a conversation.", "They are fighting on the grass in front of the house."], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "O45fPBY5fG0_0", "video_path": "O45fPBY5fG0.mp4", "subtitle_path": "O45fPBY5fG0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 85, "duration": 25.03, "view_count": 25173}, {"video_id": "O7FyFMwqDmQ", "question": "A bald man wearing black armor and a white shirt, with a black box behind him on the right side, has a hat placed directly above the gongzi (a traditional Chinese ceremonial platform). 
What item is directly above the gongzi?", "question_wo_referring_query": "What item is directly above the gongzi?", "candidates": ["newspaper", "gongzi", "box", "hat"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "O7FyFMwqDmQ_0", "video_path": "O7FyFMwqDmQ.mp4", "subtitle_path": "O7FyFMwqDmQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 233, "duration": 34.0, "view_count": 382134}, {"video_id": "IXzODWlTWgk", "question": "When two soldiers wearing hats standing guard with guns appear on the screen, and the subtitle shows 'by zip line on a tightrope in a cat', what other objects are also on the screen?", "question_wo_referring_query": "When two soldiers wearing hats standing guard with guns appear on the screen, and the subtitle shows 'by zip line on a tightrope in a cat', what other objects are also on the screen?", "candidates": ["a black car", "a red car", "a soldier riding a motorcycle", "a bicycle"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "IXzODWlTWgk_0", "video_path": "IXzODWlTWgk.mp4", "subtitle_path": "IXzODWlTWgk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 237, "duration": 20.0, "view_count": 1586688}, {"video_id": "IwuPlwBP-no", "question": "In a sunny place with a building and a tree in the background, a man in a blue shirt is holding a drink in his left hand and feeding a woman with his right hand. 
What color is the shirt of the woman eating?", "question_wo_referring_query": "What color is the shirt of the woman who is eating?", "candidates": ["pink", "white", "black", "blue", "red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "IwuPlwBP-no_0", "video_path": "IwuPlwBP-no.mp4", "subtitle_path": "IwuPlwBP-no_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 100, "duration": 18.0, "view_count": 94965}, {"video_id": "dCtFkzHymvo", "question": "What did the person in the video do with the food placed on the white plate on the dining table?", "question_wo_referring_query": "What did the person in the video do?", "candidates": ["Added seasoning to the food", "Tore open the food's cling film", "Pressed down on the food", "Picked up the plate with the food", "Cut the food into pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "dCtFkzHymvo_0", "video_path": "dCtFkzHymvo.mp4", "subtitle_path": "dCtFkzHymvo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 364, "duration": 15.0, "view_count": 331}, {"video_id": "Nmzt2yHnLEI", "question": "What is absent from the screen when the phrase 'event that possibly caused the' is mentioned?", "question_wo_referring_query": ", what is absent from the screen?", "candidates": ["white light streak", "black space background", "falling, blue-glowing debris", "clouds on the blue Earth", "falling, orange-glowing debris"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "Nmzt2yHnLEI_0", "video_path": "Nmzt2yHnLEI.mp4", "subtitle_path": "Nmzt2yHnLEI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 9, "duration": 32.0, "view_count": 90796}, {"video_id": "cNkaFgFG4UM", "question": "When a book with a red background and '1824' text appears, along with a treasure chest image on the right, what shape is the 
sail of the ship sailing on the blue ocean behind?", "question_wo_referring_query": "What shape is the sail of the ship sailing on the blue ocean behind?", "candidates": ["Rectangle", "Ellipse", "Circle", "Triangle", "Square"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "cNkaFgFG4UM_0", "video_path": "cNkaFgFG4UM.mp4", "subtitle_path": "cNkaFgFG4UM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 123, "duration": 37.0, "view_count": 306501}, {"video_id": "Per-1kB4OVA", "question": "In a sunny and lush field, half covered with cement and half with grass, when 'couldn't find a bit big enough so I had' is mentioned, the person in jeans is pouring water from a bucket into a basin in the video. What is the shape of this basin?", "question_wo_referring_query": "In the video, the person in jeans is pouring water from a bucket into a basin. What is the shape of this basin?", "candidates": ["Black round basin", "Blue square basin", "White round basin", "White square basin", "Green square basin"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Per-1kB4OVA_0", "video_path": "Per-1kB4OVA.mp4", "subtitle_path": "Per-1kB4OVA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1395, "duration": 44.98, "view_count": 352095}, {"video_id": "NqJjXTeXvWY", "question": "On the wooden cutting board, what item is being chopped with a kitchen knife?", "question_wo_referring_query": "What item is being chopped with a kitchen knife?", "candidates": ["Spinach", "White leek", "Chicken breast", "Yellow fruit", "Green leafy vegetable"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "NqJjXTeXvWY_0", "video_path": "NqJjXTeXvWY.mp4", "subtitle_path": "NqJjXTeXvWY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 53, "duration": 59.0, "view_count": 31549}, {"video_id": 
"_zPg9DfW5Sw", "question": "There is a transparent rectangular box with a tilted rim on the table, filled with food. The upper left corner shows the word 'CHICKEN.' What happens when a transparent measuring cup filled with liquid appears?", "question_wo_referring_query": "What happens when a transparent measuring cup filled with liquid appears?", "candidates": ["Poured the liquid from the measuring cup into the food", "Put the food into the measuring cup", "Walked away with the food", "Spilled the liquid from the measuring cup onto the table", "Placed the measuring cup next to the food"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "_zPg9DfW5Sw_0", "video_path": "_zPg9DfW5Sw.mp4", "subtitle_path": "_zPg9DfW5Sw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 216, "duration": 46.01, "view_count": 30694}, {"video_id": "RmwFXoegL8A", "question": "Against a gray background, a pair of hands is making a unicorn with white clay. When 'colors candy is very' is mentioned, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["Knead the clay by hand", "Stretch the clay", "Use a small cutter to trim the clay in their hands", "Shape the unicorn's legs", "Shape the unicorn's horn"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "RmwFXoegL8A_0", "video_path": "RmwFXoegL8A.mp4", "subtitle_path": "RmwFXoegL8A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 12, "duration": 57.02, "view_count": 131279}, {"video_id": "Lwmd9E2Wovc", "question": "In a scene at a news television station, there is a short-haired female host with a large screen behind her. There are blue and white pillars inside the studio. The female host is wearing a long dress, and there is a slanted object next to her. She is dressed in a dark outfit with a pair of black pointed shoes. 
A ticker is scrolling at the bottom of the screen. When the phrase 'New study by IMDA and BCG to Spotlight women' appears on the ticker, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["A man appeared", "The female host shook hands with a man", "The female host jumped up", "The female host invited someone to speak", "A black-and-white building appeared"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "Lwmd9E2Wovc_0", "video_path": "Lwmd9E2Wovc.mp4", "subtitle_path": "Lwmd9E2Wovc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 13, "duration": 23.0, "view_count": 4807}, {"video_id": "dOV8y6uHp9U", "question": "According to the explanation, which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first according to the explanation?", "candidates": ["The man wearing a short-sleeved T-shirt", "The man dancing around the glowing bucket", "The woman wearing a white camisole", "The woman with black long hair wearing a black and white floral dress", "The man wearing a white shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "dOV8y6uHp9U_0", "video_path": "dOV8y6uHp9U.mp4", "subtitle_path": "dOV8y6uHp9U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 407, "duration": 18.02, "view_count": 12638}, {"video_id": "-t5uzY77zwA", "question": "Before the image with the text 'Elotes locos' appears, which type of edible plant can be seen on the screen?", "question_wo_referring_query": "Before the image with the text 'Elotes locos' appears, which type of edible plant can be seen on the screen?", "candidates": ["Food made from corn with the text 'Pupusas'", "Corn", "Food made from corn with the text 'Tamales'", "Image of a plant with the text 'lsote'", "Plant with the text 'Loroco', which has white flowers"], "topic_category": 
"KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "-t5uzY77zwA_0", "video_path": "-t5uzY77zwA.mp4", "subtitle_path": "-t5uzY77zwA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 251, "duration": 49.97, "view_count": 996337}, {"video_id": "JoZLSuB7LLI", "question": "The opening scene is in a room full of green plants, and features a woman with golden curly hair, wearing a white shirt. Which subtitles shared screen time with her?", "question_wo_referring_query": "Which subtitles shared screen time with her?", "candidates": ["bread and put it in the oven and i will see you soon mavi oh my gosh what have you done", "this was the trouble she was getting into", "this was the trouble", "the funnest part of foccacia", "she was getting into"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "JoZLSuB7LLI_0", "video_path": "JoZLSuB7LLI.mp4", "subtitle_path": "JoZLSuB7LLI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 528, "duration": 37.0, "view_count": 420755}, {"video_id": "XAPcxibqb7s", "question": "The person in the video is using a mobile phone. At the beginning, there is an image of a Halloween picture on the phone. After playing for about 3 seconds, the finger is about to swipe on the phone screen. 
What kind of changes happened to the phone screen?", "question_wo_referring_query": ", what kind of changes happened to the phone screen?", "candidates": ["The phone entered a music app.", "The phone displayed many contacts.", "There is an image of a cup of coffee on the phone screen, surrounded by many books and some dried fruits.", "The phone screen turned off.", "The phone displayed four pictures, and the finger is swiping."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "XAPcxibqb7s_0", "video_path": "XAPcxibqb7s.mp4", "subtitle_path": "XAPcxibqb7s_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 396, "duration": 16.0, "view_count": 44217}, {"video_id": "vyMUsWMVIu4", "question": "Sunlight shines into the room through the window, and there's a round mirror hanging on the wall. On the right side of the screen, there's a black wardrobe. A woman wearing a black suit is shaking a plastic bag in front of a coffee-colored desk. Which of the following items does not appear?", "question_wo_referring_query": "Which of the following items does not appear?", "candidates": ["A green potted plant", "An open paper box", "A power strip", "A transparent glass cup", "A black phone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "vyMUsWMVIu4_0", "video_path": "vyMUsWMVIu4.mp4", "subtitle_path": "vyMUsWMVIu4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 95, "duration": 37.0, "view_count": 22803}, {"video_id": "KFNc2MGXLHg", "question": "In the top right, there's a carved picture. The right hand is pressing on the bottom right corner of white drawing paper. The left hand, wearing a ring, is drawing on the white paper with a pen. 
What kind of pen is in the left hand?", "question_wo_referring_query": ", what kind of pen is in the left hand?", "candidates": ["pencil", "ballpoint pen", "chalk", "brush pen", "steel pen"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "KFNc2MGXLHg_0", "video_path": "KFNc2MGXLHg.mp4", "subtitle_path": "KFNc2MGXLHg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1259, "duration": 55.99, "view_count": 2587}, {"video_id": "P2_qUbzigjE", "question": "In a room with a sofa and a map hanging on the wall, with a lamp placed beside it, a man wearing a black short-sleeved shirt has a laptop in front of him. He is in a conversation with the camera. When he mentions 'Here with Yes Theory to show you a little magic today,' what material is the sofa behind him?", "question_wo_referring_query": "What material is the sofa behind the man?", "candidates": ["Red leather sofa", "Yellow furry sofa", "Yellow fabric sofa", "Red bamboo sofa", "Black leather sofa"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "P2_qUbzigjE_0", "video_path": "P2_qUbzigjE.mp4", "subtitle_path": "P2_qUbzigjE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1, "duration": 24.03, "view_count": 295946}, {"video_id": "BfwSlyuMy8M", "question": "Inside a store, many teddy bear toys are placed on the round shelves around. There is a short-haired boy and girl smiling in front of the mirror, and there are people shopping in the store in the background. 
Who is hugging the teddy bear and swaying?", "question_wo_referring_query": "Who is hugging the teddy bear and swaying?", "candidates": ["The boy with yellow hair", "The short-haired girl", "The short-haired boy", "The person holding a shopping bag", "The girl with yellow hair"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "BfwSlyuMy8M_0", "video_path": "BfwSlyuMy8M.mp4", "subtitle_path": "BfwSlyuMy8M_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 141, "duration": 29.0, "view_count": 7412}, {"video_id": "eyIH8GvkKmc", "question": "A beautiful white porcelain object appears on the screen. It is said that this special item was made by Emperor Yongle. What events followed the story?", "question_wo_referring_query": "What events followed the story?", "candidates": ["He inherited the throne by poisoning his parents.", "He abandoned the throne.", "He inherited the throne by paying homage to his ancestors.", "He inherited the throne by killing his brothers.", "He ascended to the throne through a sea of blood."], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "eyIH8GvkKmc_0", "video_path": "eyIH8GvkKmc.mp4", "subtitle_path": "eyIH8GvkKmc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 93, "duration": 50.99, "view_count": 4583}, {"video_id": "ZStKwUDd5C0", "question": "After the caption 'sounds like a good one to me' appears, what happens on the computer screen of the woman sitting at the round wooden table using the computer?", "question_wo_referring_query": "What happens?", "candidates": ["A man wearing a black shirt and jeans with a chain necklace appears on the computer screen against a pink background.", "The computer screen turns black.", "Nothing changes on the computer screen.", "A video plays on the computer screen.", "A cartoon image appears on the computer screen."], "topic_category": "LV-Lifestyle-Life-Vlogs", 
"question_category": "T2E", "level": "IntraMoment", "id": "ZStKwUDd5C0_0", "video_path": "ZStKwUDd5C0.mp4", "subtitle_path": "ZStKwUDd5C0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 469, "duration": 21.02, "view_count": 610236}, {"video_id": "cseSHZUjXXA", "question": "Which caption appears simultaneously with a character dressed in white clothes, wearing a yellow belt, and with black and green knee pads?", "question_wo_referring_query": "Which caption appears simultaneously?", "candidates": ["\"marries the daughter of the famous\"", "\"he is hidden by the Tarentum king from\"", "\"doctor has just gone so after this demo\"", "\"city not having the foggiest where their\"", "\"whilst getting supplies Demma see slowly\""], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "cseSHZUjXXA_0", "video_path": "cseSHZUjXXA.mp4", "subtitle_path": "cseSHZUjXXA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 268, "duration": 44.0, "view_count": 3162}, {"video_id": "avVvRY6HbMw", "question": "What change occurred in the expression of the curly-haired woman in a blue-gray denim shirt after she tasted a dish from a table full of delicious dishes?", "question_wo_referring_query": "What change occurred?", "candidates": ["Changed from expressionless to smiling", "Changed from expressionless to angry", "Changed from smiling to surprised", "Changed from smiling to crying", "Changed from smiling to angry"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "avVvRY6HbMw_0", "video_path": "avVvRY6HbMw.mp4", "subtitle_path": "avVvRY6HbMw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1015, "duration": 16.98, "view_count": 838857}, {"video_id": "p5ZU-HpGves", "question": "In a dimly lit space, what is the black man in a blue short-sleeved shirt on the left side in front of the white brick wall doing?", 
"question_wo_referring_query": "What is he doing?", "candidates": ["Chatting with a companion", "Burning paper with fire", "Using fire to ignite a substance", "Inhaling drugs", "Using fire to light a cigarette"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "p5ZU-HpGves_0", "video_path": "p5ZU-HpGves.mp4", "subtitle_path": "p5ZU-HpGves_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 44, "duration": 55.0, "view_count": 43643}, {"video_id": "jbaNuEDmEdY", "question": "When a glass measuring cup filled with milk is placed on a red wooden table, what appeared?", "question_wo_referring_query": "What appeared?", "candidates": ["A block of yellow butter", "A white pot", "A hand wearing a black glove", "A cutting board", "A kitchen knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "jbaNuEDmEdY_0", "video_path": "jbaNuEDmEdY.mp4", "subtitle_path": "jbaNuEDmEdY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 143, "duration": 25.0, "view_count": 30283}, {"video_id": "S-NW45rg81Y", "question": "When the subtitle appears, 'Every explosion is accompanied by many other extremely dangerous things.' 
What is the state of the water surface between the two rocks?", "question_wo_referring_query": "What does it look like?", "candidates": ["Rapidly flowing", "Covered with flowers of waves", "Surging in layers", "Dark and turbulent", "Calm and still"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "S-NW45rg81Y_0", "video_path": "S-NW45rg81Y.mp4", "subtitle_path": "S-NW45rg81Y_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 227, "duration": 21.99, "view_count": 26907}, {"video_id": "v3Y5uIlWV5E", "question": "What was the Asian woman with curly hair and a pink V-neck shirt doing when she first appeared?", "question_wo_referring_query": "What was she doing?", "candidates": ["Waving to the woman on the right who had straight golden hair and was wearing a white shirt", "Looking down at news drafts", "Organizing news drafts", "Waving to the camera", "Speaking to the woman on the right who had straight golden hair and was wearing a white shirt"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "v3Y5uIlWV5E_0", "video_path": "v3Y5uIlWV5E.mp4", "subtitle_path": "v3Y5uIlWV5E_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 196, "duration": 36.0, "view_count": 4313}, {"video_id": "_uhLhtOIdwI", "question": "When the subtitle 'arrested without rule of law and now our' appears, what is the man dressed in a fluorescent green police uniform doing in front of the police car parked on the side of the road?", "question_wo_referring_query": "What is he doing?", "candidates": ["Opening the car door for a colleague.", "Smoking a cigarette beside the police car.", "Escorting the woman in black clothing who is being arrested into the police car.", "Providing assistance to the crowd.", "Helping an elderly person by the roadside."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": 
"_uhLhtOIdwI_0", "video_path": "_uhLhtOIdwI.mp4", "subtitle_path": "_uhLhtOIdwI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1929, "duration": 43.0, "view_count": 93550}, {"video_id": "VcIO_3rKMRw", "question": "After using a wooden spoon to stir in an iron pot filled with clear water and cabbage, what did the person in the video do?", "question_wo_referring_query": "What did the person in the video do?", "candidates": ["Poured the water out of the pot", "Cover the pot and cook the cabbage", "Took out a few slices of bread", "Cut the bread with a knife", "Scooped out the cabbage"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "VcIO_3rKMRw_0", "video_path": "VcIO_3rKMRw.mp4", "subtitle_path": "VcIO_3rKMRw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 94, "duration": 34.0, "view_count": 845621}, {"video_id": "ppAx8Iq011M", "question": "When adding other ingredients to the prepared container, what is the first ingredient to appear?", "question_wo_referring_query": "What is it?", "candidates": ["Sauce", "Milk", "Egg", "Flour", "Apple"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "ppAx8Iq011M_0", "video_path": "ppAx8Iq011M.mp4", "subtitle_path": "ppAx8Iq011M_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 51, "duration": 16.02, "view_count": 282928}, {"video_id": "IaKw3xz4qHk", "question": "When the subtitle 'appoint she appointed after she' appears, what happens to the screen featuring the news anchor in a purple jacket?", "question_wo_referring_query": "When the subtitle 'appoint she appointed after she' appears, what happens to the screen featuring the news anchor in a purple jacket?", "candidates": ["The anchor screen enlarges and a picture of a woman in black and a man in a khaki jacket appears on the left side", "The anchor screen disappears", "The anchor screen shrinks and 
a picture of a woman in black and a man in a khaki jacket appears on the left side", "The anchor screen shrinks and a picture of a woman in black and a man in a khaki jacket appears on the right side", "The anchor screen enlarges and a picture of a woman in black and a man in a khaki jacket appears on the right side"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "IaKw3xz4qHk_0", "video_path": "IaKw3xz4qHk.mp4", "subtitle_path": "IaKw3xz4qHk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 156, "duration": 49.98, "view_count": 181252}, {"video_id": "q_EDxmXQ9Vo", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, one person is talking to two others. Then, an elderly person in red clothing is speaking with a soldier standing to their right, and finally, two people are sitting at a table talking with three people standing behind them.", "First, an elderly person in red clothing is speaking with a soldier standing to their right. Then, one person is talking to two others, and finally, two people are sitting at a table talking with three people standing behind them.", "First, one person is talking to two others. Then, two people are sitting at a table talking with three people standing behind them, and finally, an elderly person in red clothing is speaking with a soldier standing to their right.", "First, two people are sitting at a table talking with three people standing behind them. Then, an elderly person in red clothing is speaking with a soldier standing to their right, and finally, one person is talking to two others.", "First, two people are sitting at a table talking with three people standing behind them. 
Then, one person is talking to two others, and finally, there is an elderly person in red clothing speaking with a soldier standing to their right."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "q_EDxmXQ9Vo_0", "video_path": "q_EDxmXQ9Vo.mp4", "subtitle_path": "q_EDxmXQ9Vo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 132, "duration": 21.0, "view_count": 361021}, {"video_id": "GsMT53pI-Ew", "question": "In a white corridor with walls covered in paintings, there's a woman wearing a black thin top, black tight leggings, and black-framed glasses, with short hair. With which of the following subtitles has she appeared together?", "question_wo_referring_query": "With which of the following subtitles has she appeared together?", "candidates": ["Mrs. Malborough", "blurred name", "kahun", "Music", "the war years so it's a picture that"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "GsMT53pI-Ew_0", "video_path": "GsMT53pI-Ew.mp4", "subtitle_path": "GsMT53pI-Ew_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 307, "duration": 57.02, "view_count": 25895}, {"video_id": "Ni9JcHLea90", "question": "In a room filled with various photos, a short-haired man wearing a black T-shirt and holding a black microphone makes a gesture while saying 'informal ties today eight countries recognize the Somaliland Passport three countries have'. 
How does his gesture change?", "question_wo_referring_query": "How does his gesture change?", "candidates": ["Arms change from raised to lowered", "Index finger and middle finger change from straight to bent", "Hand changes from placed on the waist to raised", "Index finger and middle finger change from bent to straight", "Hand changes from raised to placed on the waist"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "Ni9JcHLea90_0", "video_path": "Ni9JcHLea90.mp4", "subtitle_path": "Ni9JcHLea90_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 220, "duration": 27.03, "view_count": 339755}, {"video_id": "tOh-Um-y9do", "question": "In a Q-version animation screen, there is a computer with an animated Q-version character wearing a white hat on either side. Beside the computer, there is a drink, and a pair of Q-version hands are operating the computer. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Listening to music on the computer", "Streaming on the computer", "Watching TV on the computer", "Playing games on the computer", "Working on the computer"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "tOh-Um-y9do_0", "video_path": "tOh-Um-y9do.mp4", "subtitle_path": "tOh-Um-y9do_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 28, "duration": 33.0, "view_count": 2226693}, {"video_id": "WRcaryjCYCs", "question": "In a hallway under a warm light lamp, a woman wearing white stockings is tidying up. 
When the subtitles \"laura is frantically trying to get ready\" appear, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["black gloves", "beverage with a straw", "grey clothing", "white hat", "blue bag"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "WRcaryjCYCs_0", "video_path": "WRcaryjCYCs.mp4", "subtitle_path": "WRcaryjCYCs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 46, "duration": 17.0, "view_count": 51666}, {"video_id": "2mGTh6K_AOw", "question": "On a white wall, there are green leaves hanging, in front of the wall there is a black sofa, next to the sofa there's a black backpack hanging, a long-haired girl is sitting on the sofa, holding her head with both hands. What color pants is she wearing?", "question_wo_referring_query": "What color pants is she wearing?", "candidates": ["Purple", "Black", "Blue", "White", "Yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "2mGTh6K_AOw_0", "video_path": "2mGTh6K_AOw.mp4", "subtitle_path": "2mGTh6K_AOw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 427, "duration": 41.0, "view_count": 91277}, {"video_id": "We64bVsxdLg", "question": "In a kitchen, when a woman wearing a yellow apron and a black sleeveless shirt and a man wearing a blue apron and a floral short-sleeve shirt, and the man says 'the fire stops this prize that', what material is the chopstick held by the woman in the yellow apron made of?", "question_wo_referring_query": "what material is the chopstick held by the woman in the yellow apron made of?", "candidates": ["Stainless Steel", "Plastic", "Iron", "Wood", "Paper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "We64bVsxdLg_0", "video_path": "We64bVsxdLg.mp4", "subtitle_path": 
"We64bVsxdLg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 325, "duration": 50.97, "view_count": 1864673}, {"video_id": "ztrpSb911rg", "question": "In a room, there is a bed. Sitting on the white sofa next to the bed is a long-haired woman wearing a beige T-shirt and a necklace, introducing a book. What book is she introducing?", "question_wo_referring_query": "What book is she introducing?", "candidates": ["The white-covered 'MALIBU RISING'", "The blue-covered 'Harry Potter'", "The purple-covered 'MALIBU RISING'", "The blue-covered 'MALIBU RISING'", "The white-covered 'Harry Potter'"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "ztrpSb911rg_0", "video_path": "ztrpSb911rg.mp4", "subtitle_path": "ztrpSb911rg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 482, "duration": 21.0, "view_count": 39874}, {"video_id": "CuQshOUfz24", "question": "On a street covered with yellow maple leaves, next to a green tree, there is a black bag. 
When a woman with dyed blue hair, wearing black leather pants and a mask, finally appears, what is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Kicking forward with left leg", "Standing with legs crossed", "Hands on hips", "Sitting on the ground", "Standing on one leg"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "CuQshOUfz24_0", "video_path": "CuQshOUfz24.mp4", "subtitle_path": "CuQshOUfz24_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 236, "duration": 21.02, "view_count": 204449}, {"video_id": "syCWb_ySJRo", "question": "In the blurry screen, what happens when the subtitle 'generated absolutely inundated and' appears with the yellow and black intermingled ocean touching the Earth's surface?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The scene changes to a full view of the Earth", "A review ceremony takes place", "A height comparison takes place", "The scene changes to the coastline", "A body comparison takes place"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "syCWb_ySJRo_0", "video_path": "syCWb_ySJRo.mp4", "subtitle_path": "syCWb_ySJRo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 8, "duration": 15.02, "view_count": 168991}, {"video_id": "4HxIam3tdQk", "question": "A hand holding a plate of delicious food, after the subtitle 'Music' appears, what did he do?", "question_wo_referring_query": "What did he do?", "candidates": ["Pressed down on the food in the plate with his hand", "Picked up the plate from the table", "Picked up the food from the plate with his hand", "Picked up a fork from the side", "Put the plate on the table"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4HxIam3tdQk_0", "video_path": "4HxIam3tdQk.mp4", "subtitle_path": "4HxIam3tdQk_en.json", 
"duration_group": 60, "starting_timestamp_for_subtitles": 251, "duration": 23.02, "view_count": 611965}, {"video_id": "_Gzx5UgmlZ8", "question": "Who is the person that appears after the subtitles 'Sometimes, the WITSEC agents are the only person'?", "question_wo_referring_query": "Who is he?", "candidates": ["A man wearing a silver watch and a short-sleeved shirt", "A woman wearing a green suit with short golden hair, holding a beige phone", "A man wearing a black suit with a colorful silk handkerchief in the suit pocket, holding a black phone", "A man wearing a hat, pressing down the brim with his hand, with a sharp look in his eyes", "A woman wearing a work uniform and sunglasses"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "_Gzx5UgmlZ8_0", "video_path": "_Gzx5UgmlZ8.mp4", "subtitle_path": "_Gzx5UgmlZ8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 294, "duration": 39.01, "view_count": 2152616}, {"video_id": "0DTf2NPkQ1E", "question": "In a corridor filled with many glass frames, each containing guitars, a man wearing a black long-sleeve shirt, brown pants, and olive shoes, with short hair, appears in which of the following places?", "question_wo_referring_query": "Where does he appear in the following places?", "candidates": ["On a corridor with ancient zithers", "In a dining hall with food", "In a quiet park", "In a recorded studio during an interview", "In front of a glass case with paintings"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "0DTf2NPkQ1E_0", "video_path": "0DTf2NPkQ1E.mp4", "subtitle_path": "0DTf2NPkQ1E_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 7, "duration": 46.01, "view_count": 11171}, {"video_id": "ml3N2CdM4CE", "question": "The vast blue sea, golden sand beach, many green trees by the shore, inside is a splendid city, which of the following subtitles has this scene appeared with?", 
"question_wo_referring_query": "Which of the following subtitles has this scene appeared with?", "candidates": ["clearly, research is still ongoing", "gravity scan", "what an absolute mindu so Melbourne is", "using them to map", "I'm attempting to utilize magnetic and"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "ml3N2CdM4CE_0", "video_path": "ml3N2CdM4CE.mp4", "subtitle_path": "ml3N2CdM4CE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 372, "duration": 34.0, "view_count": 24015}, {"video_id": "rS3YfhAiXHU", "question": "A man wearing a helmet, holding a sword and a red and white shield, with a wooden cart filled with straw beside him, and in front of him there\u2019s a man with a bald head wearing brown clothes. What did the man holding the sword do?", "question_wo_referring_query": "What did the man holding the sword do?", "candidates": ["Thrust the sword towards the man wearing brown clothes", "Kicked the man wearing brown clothes", "Tapped the shoulder of the man wearing brown clothes", "Swung the sword towards the man wearing brown clothes", "Smashed the shield towards the man wearing brown clothes"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "rS3YfhAiXHU_0", "video_path": "rS3YfhAiXHU.mp4", "subtitle_path": "rS3YfhAiXHU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 175, "duration": 29.0, "view_count": 461449}, {"video_id": "kMaO_AZYZMA", "question": "A woman wearing a pink outer garment and a black inner shirt, her background is a well-lit building, and the word 'BUSINESS' appears completely as rolling text at the bottom of the screen. 
What items are present in this scene?", "question_wo_referring_query": "What items are present in this scene?", "candidates": ["cup", "ring", "watch", "glasses", "keyboard"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "kMaO_AZYZMA_0", "video_path": "kMaO_AZYZMA.mp4", "subtitle_path": "kMaO_AZYZMA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 293, "duration": 57.0, "view_count": 873}, {"video_id": "Il8o5md-q1o", "question": "A pair of legs wearing torn pants, standing on the ground with shoes on, with some sparse grass growing nearby. When the explanation mentions 'the Iron Silk Network solved the problems of the farmers', what objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["Hunting rifle", "Hoe", "Green grass", "Grass hat", "Scarf"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "Il8o5md-q1o_0", "video_path": "Il8o5md-q1o.mp4", "subtitle_path": "Il8o5md-q1o_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 96, "duration": 23.0, "view_count": 2378428}, {"video_id": "9HOgcP5UBUU", "question": "In a room with red walls and silver wall frames, there is a woman wearing a headset and covering her lips. When she mentions 'the met', what color is her coat?", "question_wo_referring_query": "In a room with red walls and silver wall frames, there is a woman wearing a headset and covering her lips. 
When she mentions 'the met', what color is her coat?", "candidates": ["Pure white", "Pure yellow", "White and black", "Red and blue", "Orange and white"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "9HOgcP5UBUU_0", "video_path": "9HOgcP5UBUU.mp4", "subtitle_path": "9HOgcP5UBUU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 14, "duration": 31.0, "view_count": 8990}, {"video_id": "vZjli8WLuc4", "question": "On a flat wooden surface, a wooden board is placed, and on the board, there are six cups. A hand holding a bottle of vegetable oil pours it into the cups. When the explanation mentions 'Coat aups with vegetable oil,' what action does the hand holding the bottle of vegetable oil perform?", "question_wo_referring_query": "What action does the hand holding the bottle of vegetable oil perform?", "candidates": ["Added vegetable oil to another cup", "Threw the bottle of vegetable oil", "Lifted the wooden board", "Picked up a cup", "Lifted a stone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "vZjli8WLuc4_0", "video_path": "vZjli8WLuc4.mp4", "subtitle_path": "vZjli8WLuc4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 111, "duration": 34.0, "view_count": 461}, {"video_id": "U228_sx7Yz0", "question": "Which of the following scenarios appears first?", "question_wo_referring_query": "Which of the following scenarios appears first?", "candidates": ["An all black small dog", "A black and white small dog", "An all white small dog", "A woman wearing a black dress with curly hair", "A woman wearing a red dress with curly hair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "U228_sx7Yz0_0", "video_path": "U228_sx7Yz0.mp4", "subtitle_path": "U228_sx7Yz0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 186, "duration": 23.98, "view_count": 290144}, 
{"video_id": "w8SLYpexB7U", "question": "In a room with an open door, a world map hanging on the wall, and photos covering the wall, a man wearing black clothes reads aloud from a piece of paper in his hand. After he says 'that\u2019s pretty clever so will is from', what action does he immediately take?", "question_wo_referring_query": "What action does he immediately take after saying 'that\u2019s pretty clever so will is from'?", "candidates": ["He raised his left hand", "He threw away an object in his right hand", "He raised his right hand", "He threw away an object in his left hand", "He put down his left hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "w8SLYpexB7U_0", "video_path": "w8SLYpexB7U.mp4", "subtitle_path": "w8SLYpexB7U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 301, "duration": 52.01, "view_count": 97677}, {"video_id": "Nc9e16ekPkI", "question": "In a bright room with white walls and a window with a black frame, a man wearing glasses, with his hair tied back, and dressed in red, is giving an explanation. After he mentions, 'I'm praying for good weather,' what object does the camera then show outside the window?", "question_wo_referring_query": "What object does the camera then show outside the window?", "candidates": ["Glasses", "Tree", "Window", "Goldfish", "Chair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "Nc9e16ekPkI_0", "video_path": "Nc9e16ekPkI.mp4", "subtitle_path": "Nc9e16ekPkI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 46, "duration": 57.0, "view_count": 707}, {"video_id": "Kt7bitgahQ8", "question": "At the start of the video, a female wearing a white T-shirt with her hair in a bun is sitting in the driver's seat of a car. 
What subtitles have appeared together with her?", "question_wo_referring_query": "What subtitles have appeared together with her?", "candidates": ["cry he was so bad so I didn\u2019t know", "cry you was so nice so I didn\u2019t know", "cry you was so nice so I know", "cry he was so nice so I didn\u2019t know", "cry you was so bad so I didn\u2019t know"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "Kt7bitgahQ8_0", "video_path": "Kt7bitgahQ8.mp4", "subtitle_path": "Kt7bitgahQ8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 444, "duration": 46.0, "view_count": 40241}, {"video_id": "BArbrfmsxeQ", "question": "On a white screen, when the explanation mentions 'there is absolutely no soil here,' what change occurs?", "question_wo_referring_query": "what change occurs?", "candidates": ["The entire screen turns red", "A scene with sparse green plants growing in yellow soil appears, taking up one-fourth of the screen", "The entire screen turns black", "The entire screen turns yellow", "The entire screen turns green"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "BArbrfmsxeQ_0", "video_path": "BArbrfmsxeQ.mp4", "subtitle_path": "BArbrfmsxeQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 181, "duration": 55.99, "view_count": 390532}, {"video_id": "3FzzixVU_48", "question": "When the scene is focused on the female host wearing glasses and earphones, what happened in the live video on the right?", "question_wo_referring_query": "What happened in the live video on the right?", "candidates": ["Black smoke is emerging in the live video.", "White smoke is emerging in the live video.", "A building is collapsing in the live video.", "People are evacuating in the live video.", "A fight is occurring in the live video."], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": 
"3FzzixVU_48_0", "video_path": "3FzzixVU_48.mp4", "subtitle_path": "3FzzixVU_48_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 161, "duration": 57.0, "view_count": 64507}, {"video_id": "t86J_zZjjq4", "question": "In a yellow dusk, with the ocean surging with waves, a sailboat rises and falls in the waves, while a few seabirds soar in the sky. What objects are present in the scene at this moment?", "question_wo_referring_query": "What objects are present in the scene at this moment?", "candidates": ["A speedboat", "A sailboat", "Two sailboats", "An old eagle", "A yacht"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "t86J_zZjjq4_0", "video_path": "t86J_zZjjq4.mp4", "subtitle_path": "t86J_zZjjq4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 364, "duration": 45.0, "view_count": 1943751}, {"video_id": "etAjM5tMnic", "question": "Viewed from space, the Earth's entire atmosphere emits a faint, clear blue light. When the subtitle 'we saw many places during the Burklc' appears, what object is present on the whole screen?", "question_wo_referring_query": "What object is present on the whole screen?", "candidates": ["Shrinking mountain ranges", "Airplane", "Soaring birds", "Space station", "Satellite"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "etAjM5tMnic_0", "video_path": "etAjM5tMnic.mp4", "subtitle_path": "etAjM5tMnic_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1136, "duration": 15.02, "view_count": 213898}, {"video_id": "Swh6pYdeR_8", "question": "In an indoor space, there are three paintings hanging on the wall and a sculpture of a woman placed in front of the paintings. 
When the subtitle mentions 'Since it is not a full-scale sculpture, if it\u2019s too stiff it will look very fake,' what is the woman\u2019s sculpture wearing on the lower part of her body?", "question_wo_referring_query": "When the subtitle mentions 'Since it is not a full-scale sculpture, if it\u2019s too stiff it will look very fake,' what is the woman\u2019s sculpture wearing on the lower part of her body?", "candidates": ["gray jeans", "gray skirt", "blue skirt", "blue jeans", "black sports pants"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "Swh6pYdeR_8_0", "video_path": "Swh6pYdeR_8.mp4", "subtitle_path": "Swh6pYdeR_8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 179, "duration": 21.02, "view_count": 498531}, {"video_id": "jQzOQMEhb2c", "question": "Outside in the bright sunlight, trees are neatly lined up on the flat ground, and some green chairs are randomly placed on the ground. Who reaches out to pull the green bucket at this moment?", "question_wo_referring_query": "Who reaches out to pull the green bucket at this moment?", "candidates": ["A man wearing sunglasses and a red short-sleeved shirt", "A man wearing sunglasses and a blue short-sleeved shirt", "A man wearing sunglasses and a red long-sleeved shirt", "A woman wearing sunglasses and a red short-sleeved shirt", "A woman wearing sunglasses and a blue short-sleeved shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "jQzOQMEhb2c_0", "video_path": "jQzOQMEhb2c.mp4", "subtitle_path": "jQzOQMEhb2c_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 220, "duration": 34.99, "view_count": 290284}, {"video_id": "p7410Ba59Gk", "question": "In an open area in front of a classic white-roofed red-pillared Greek-style building, there are five people. 
When the armored person appears for the first time, what does he do?", "question_wo_referring_query": "When the armored person appears for the first time, what does he do?", "candidates": ["Walks back and forth in the open area", "Fights with the person wearing a robe", "Talks to the person wearing a robe", "Captures the remaining people in the open area", "Hits the red pillar of the building"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "p7410Ba59Gk_0", "video_path": "p7410Ba59Gk.mp4", "subtitle_path": "p7410Ba59Gk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 59, "duration": 36.0, "view_count": 3920}, {"video_id": "iBlh3T-BOec", "question": "In a warm kitchen with books and green plants on a white shelf, there is a man standing who is wearing a black short-sleeved shirt with tattoos on his arm. When the subtitle says 'second so that we don't stew the meat', what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Adding sauce to the pot", "Adding water to the pot", "Pouring a white liquid into the pot", "Holding a spatula and stirring the food inside a black pot", "Chopping vegetables on a cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "iBlh3T-BOec_0", "video_path": "iBlh3T-BOec.mp4", "subtitle_path": "iBlh3T-BOec_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 710, "duration": 50.0, "view_count": 498621}, {"video_id": "qhsz6Vi8PGU", "question": "After the screen transitions from a white background subtitle to a scene with a blue sky, a road, and a village, which country is mentioned last?", "question_wo_referring_query": "Which country is mentioned last?", "candidates": ["Victoria State", "Paraguay", "Venezuela", "Laos", "Brazil"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": 
"qhsz6Vi8PGU_0", "video_path": "qhsz6Vi8PGU.mp4", "subtitle_path": "qhsz6Vi8PGU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 176, "duration": 25.99, "view_count": 3985}, {"video_id": "q99tzGgkd4k", "question": "In the scene, there is a group of black people wearing colorful clothes. Someone is walking with a child on the street. There are shadows of people on the ground and the subtitles read, 'They have similar Creole cultures, languages, and backgrounds.' What happened next?", "question_wo_referring_query": "What happened next?", "candidates": ["Four people are smiling while looking directly at the camera.", "A group of people with black skin, naked from the back and carrying drums, appear with their backs to the camera.", "A group of people holding flags are walking on the street.", "A man with yellow skin wearing a gray shirt is speaking.", "A plane is flying over the map."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "q99tzGgkd4k_0", "video_path": "q99tzGgkd4k.mp4", "subtitle_path": "q99tzGgkd4k_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 509, "duration": 41.0, "view_count": 1049796}, {"video_id": "5d86McDJZFg", "question": "In the video, an elderly person wearing a snow-white wool hat and a mustache smiles at the camera. 
After the subtitle 'first person to reach' appears, what is the first thing to appear?", "question_wo_referring_query": "What appears first?", "candidates": ["A patch of fresh flowers", "A steamboat", "A blue ocean", "A green grassland", "Snow-covered mountains gleaming in the sunlight"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "5d86McDJZFg_0", "video_path": "5d86McDJZFg.mp4", "subtitle_path": "5d86McDJZFg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 33, "duration": 58.98, "view_count": 127064}, {"video_id": "5KTSd2jGYHo", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the shimmering water is splashing against the rocks on the shore, then two people are walking along the riverbank, and finally, there is a car driving next to a muddy land with a triangular roof structure.", "First, several black tires appear on the wooden bridge at the shore, then the shimmering water is splashing against the rocks on the shore, and finally, there is a car driving next to a muddy land with a triangular roof structure.", "First, two people are walking along the riverbank, then the shimmering water is splashing against the rocks on the shore, and finally, there is a car driving next to a muddy land with a triangular roof structure.", "First, there is a car driving next to a muddy land with a triangular roof structure, then the shimmering water is splashing against the rocks on the shore, and finally, several black tires appear on the wooden bridge at the shore.", "First, several black tires appear on the wooden bridge at the shore, then there is a car driving next to a muddy land with a triangular roof structure, and finally, the shimmering water is splashing against the rocks on the shore."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", 
"level": "L2-Relation", "id": "5KTSd2jGYHo_0", "video_path": "5KTSd2jGYHo.mp4", "subtitle_path": "5KTSd2jGYHo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 442, "duration": 48.0, "view_count": 2229167}, {"video_id": "8jPyg2pK11M", "question": "In a passageway with many pedestrians, standing on a black semicircular floor with colorful patterns, where has a man playing the guitar and wearing a blue short-sleeved shirt with a mustache appeared?", "question_wo_referring_query": "Where has he appeared?", "candidates": ["Park", "Restaurant", "On a crowded bus with yellow bars", "Swimming pool", "Supermarket"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "8jPyg2pK11M_0", "video_path": "8jPyg2pK11M.mp4", "subtitle_path": "8jPyg2pK11M_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 105, "duration": 21.0, "view_count": 2100318}, {"video_id": "uj8reAARLeA", "question": "What changes occur after the appearance of English letters and three butterfly-shaped images on a white background screen?", "question_wo_referring_query": "What changes occur?", "candidates": ["The butterfly-shaped image on the left side of the white background undergoes a rotational transformation.", "The largest butterfly-shaped image on the right side of the white background gradually becomes smaller.", "No changes occur.", "The smallest butterfly-shaped image in the lower middle of the white background undergoes a rotational transformation.", "The largest butterfly-shaped image on the right side of the white background undergoes a rotational transformation."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "uj8reAARLeA_0", "video_path": "uj8reAARLeA.mp4", "subtitle_path": "uj8reAARLeA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 397, "duration": 55.99, "view_count": 379}, {"video_id": "pfdb6u4HDoQ", "question": "Above the green canyon, 
in mid-air, a person wearing a white helmet is standing on a narrow path with handrails on both sides. What is this person doing?", "question_wo_referring_query": ", what is this person doing?", "candidates": ["Sitting down", "Walking", "Jumping up", "Preparing to jump down", "Standing on one leg"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "pfdb6u4HDoQ_0", "video_path": "pfdb6u4HDoQ.mp4", "subtitle_path": "pfdb6u4HDoQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 902, "duration": 46.01, "view_count": 777225}, {"video_id": "2tvmg1rBJww", "question": "After a train with a flashing light, red front, and orange body runs on the track, which objects appear first?", "question_wo_referring_query": "Which objects appear first?", "candidates": ["Two white cars", "A white car", "A rabbit ceramic figurine", "Several keychains", "A vending machine"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "2tvmg1rBJww_0", "video_path": "2tvmg1rBJww.mp4", "subtitle_path": "2tvmg1rBJww_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 554, "duration": 59.0, "view_count": 149952}, {"video_id": "o8t83KVcurA", "question": "Two women and a man are recording a show in front of a round table. 
When the subtitle 'these three players that you named' appears, which item appears first?", "question_wo_referring_query": "Which item appears first?", "candidates": ["a basketball", "a hat", "three microphones", "a necklace", "two microphones"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "o8t83KVcurA_0", "video_path": "o8t83KVcurA.mp4", "subtitle_path": "o8t83KVcurA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 194, "duration": 28.0, "view_count": 37960}, {"video_id": "1d5GprImnh8", "question": "In front of a pink cherry blossom tree, when the subtitles mention \"said I'm ready 80 years,\" what style of clothing is the elderly woman wearing, who is wearing a red round hat and a mask?", "question_wo_referring_query": "What style of clothing is she wearing?", "candidates": ["Only a floral suspender dress", "Pure white turtleneck long-sleeved shirt and suspenders skirt", "Red and white checkered V-neck shirt", "Black low-neck shirt", "Red and white striped turtleneck long-sleeved shirt and suspenders skirt"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "1d5GprImnh8_0", "video_path": "1d5GprImnh8.mp4", "subtitle_path": "1d5GprImnh8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 133, "duration": 58.99, "view_count": 5717}, {"video_id": "1pvxmqAKIDU", "question": "A painting depicting strange and bizarre figures, as well as abstract human organs, mentions a world. 
Who created that world in the story?", "question_wo_referring_query": "Who created the world?", "candidates": ["Strange creatures", "Instruments", "Souls", "Demons", "Humans"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "1pvxmqAKIDU_0", "video_path": "1pvxmqAKIDU.mp4", "subtitle_path": "1pvxmqAKIDU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 26, "duration": 45.98, "view_count": 170621}, {"video_id": "cjQrqA_Y0DA", "question": "What happened before a girl in a pink dress holding a pink cake with candles appears?", "question_wo_referring_query": "What happened?", "candidates": ["An explosion occurred around a dilapidated house.", "A boy is making a peace sign in front of the camera.", "A boy with a buzz cut wearing a blue shirt and holding a microphone is being interviewed.", "A tank is driving on the road.", "A girl with curly hair wearing golden earrings is being interviewed in front of a camera."], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "cjQrqA_Y0DA_0", "video_path": "cjQrqA_Y0DA.mp4", "subtitle_path": "cjQrqA_Y0DA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 26, "duration": 58.0, "view_count": 53387}, {"video_id": "E5akec1-tgQ", "question": "A hand is holding a knife on the green cutting board preparing to cut a carrot into long strips. 
What object appeared right before the carrot was mentioned in the subtitle?", "question_wo_referring_query": "What object appeared?", "candidates": ["chicken leg", "chicken wing", "green onion", "cucumber", "eggplant"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "E5akec1-tgQ_0", "video_path": "E5akec1-tgQ.mp4", "subtitle_path": "E5akec1-tgQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 105, "duration": 33.0, "view_count": 35848}, {"video_id": "IOZQC8dczac", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a person crawls into a bright square cabin with a backpack inside, then a hand is pushing a black door, and finally, a person appears next to a transparent glass door.", "First, a hand is pushing a black door, then a person crawls into a bright square cabin with a backpack inside, and finally, a person appears next to a transparent glass door.", "First, a hand is pushing a white door, then a person appears next to a transparent glass door, and finally, two people crawl into a bright square cabin with a backpack inside.", "First, a person appears next to a transparent glass door, then a hand is pushing a black door, and finally, a person crawls into a bright square cabin with a backpack inside.", "First, a hand is pushing a black door, then a person appears next to a transparent glass door, and finally, a person crawls into a bright square cabin with a backpack inside."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "IOZQC8dczac_0", "video_path": "IOZQC8dczac.mp4", "subtitle_path": "IOZQC8dczac_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 283, "duration": 16.98, "view_count": 29526}, {"video_id": "7JEET4BZzsI", "question": "On a yellow wooden table, a silver baking 
dish is filled with wine-red ingredients. A hand wearing a red, blue, and green tricolor bracelet is stirring with a wooden spatula. What change occurs when the ingredients are finally put into the woman's mouth?", "question_wo_referring_query": "What change occurs when the ingredients are finally put into the woman's mouth?", "candidates": ["They turned into a golden-yellow liquid.", "They turned into a black solid.", "They turned into a golden-yellow solid.", "They turned into a wine-red solid.", "They turned into a wine-red liquid."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "7JEET4BZzsI_0", "video_path": "7JEET4BZzsI.mp4", "subtitle_path": "7JEET4BZzsI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 221, "duration": 50.01, "view_count": 447306}, {"video_id": "8ZO9M1_CJD0", "question": "A clear sky, the sea is surging, a large number of white buildings mixed with lights along the coastline, a white arrow points to a cluster of pollutants. The tallest building is an industrial chimney. What is happening at the chimney?", "question_wo_referring_query": "What is happening at the chimney?", "candidates": ["There is a flock of birds flying over the chimney", "The chimney is inactive", "The industrial chimney is emitting pollutants", "There is a flashing light at the base of the chimney", "There is a ship maneuvering in the sea below the chimney"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "8ZO9M1_CJD0_0", "video_path": "8ZO9M1_CJD0.mp4", "subtitle_path": "8ZO9M1_CJD0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 276, "duration": 50.02, "view_count": 1021553}, {"video_id": "Z1hch9COSOU", "question": "In a room with a green palette drifting with white cloth, a woman wearing a white headscarf and a green coat is holding a needle and thread. 
There is a white spherical object placed on the table, and a red and yellow camera box is located next to a pillar in the room. What items appear in this room?", "question_wo_referring_query": "What items appear in this room?", "candidates": ["Bracelet", "Tin can", "Lantern", "Hat", "Green plant"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "Z1hch9COSOU_0", "video_path": "Z1hch9COSOU.mp4", "subtitle_path": "Z1hch9COSOU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 196, "duration": 27.0, "view_count": 156017}, {"video_id": "fzh8qQhnhWs", "question": "On a yellow wooden cutting board, there are two yellow round food items. On the gray ground next to the cutting board, there is a potted plant placed, and the container of the plant is black. A hand is constantly near the round food items and the subtitle reads 'then roll in breadcrumbs'. What items are present in the scene?", "question_wo_referring_query": "What items are present in the scene?", "candidates": ["silver fork", "ring", "wooden knife", "pink seasoning jar", "yellow plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "fzh8qQhnhWs_0", "video_path": "fzh8qQhnhWs.mp4", "subtitle_path": "fzh8qQhnhWs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 316, "duration": 52.0, "view_count": 2117573}, {"video_id": "iqHZOW-ASJU", "question": "In the room with picture frames and star designs on the wall, a bookshelf filled with books is connected to the wall. A girl wearing a white lab coat and blue patterned pants with a toy between her legs is squatting. The girl, who is wearing a round-framed hat and glasses, mentions 'anyways I just want to talk about a few'. 
What kind of toy is the girl holding between her legs?", "question_wo_referring_query": "What kind of toy is the girl holding between her legs?", "candidates": ["Green duck toy", "Green puppy toy", "Yellow duck toy", "Yellow puppy toy", "Yellow kitten toy"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "iqHZOW-ASJU_0", "video_path": "iqHZOW-ASJU.mp4", "subtitle_path": "iqHZOW-ASJU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 987, "duration": 31.0, "view_count": 761849}, {"video_id": "vI-qMsVonXY", "question": "The tiles in the middle of the road are arranged in a cross pattern, with white and red-blue tiles. There are trees, grass, and animals beside the road. On the road, there are pedestrians and animals in different colors. In the distance, there is a parked big bus. Who ate the pedestrian's yellow circular food?", "question_wo_referring_query": "Who ate the pedestrian's yellow circular food?", "candidates": ["A deer", "A little boy in green clothes", "A little boy in red clothes", "A small dog", "A little girl in red clothes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "vI-qMsVonXY_0", "video_path": "vI-qMsVonXY.mp4", "subtitle_path": "vI-qMsVonXY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 191, "duration": 44.0, "view_count": 13150}, {"video_id": "FiTr7x9QsxI", "question": "When the subtitle 'other interesting thing about the early' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Many people gathered around baking fire in the stove", "A nuclear submarine on the water's surface", "Sparks from a tall chimney set the forest on fire", "A three-tier red-bottomed wheel ship on the water", "A red-bottomed sail ship on the water"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": 
"FiTr7x9QsxI_0", "video_path": "FiTr7x9QsxI.mp4", "subtitle_path": "FiTr7x9QsxI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 80, "duration": 19.02, "view_count": 1835}, {"video_id": "XKvoqLAs3Lg", "question": "There is a calligraphy painting hanging on the back of the room. Beside it, there are blue floral decorations and assorted items. A man wearing glasses, a black coat, and an olive green sweater is holding a painting with a mouse and an ox in an ink style. What does he do next?", "question_wo_referring_query": "What does he do next?", "candidates": ["The man picks up a pen.", "The man stands up.", "The man adjusts his glasses.", "The man uses the painting in his hand to cover his face.", "The man unfolds the painting in his hand and sets it aside."], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "XKvoqLAs3Lg_0", "video_path": "XKvoqLAs3Lg.mp4", "subtitle_path": "XKvoqLAs3Lg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 482, "duration": 25.0, "view_count": 4158}, {"video_id": "tC2gP2kLsKc", "question": "The green background screen shows a tank image. 
After the subtitle mentions 'experience reports are very rare, one report from November 1943 in Italy noted that it was\u2026,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A white circular symbol wrapping a wine glass icon disappears", "The tank image is dyed green", "Both a white circular symbol wrapping a wine glass icon and a white circular symbol wrapping a white diamond icon appear simultaneously", "A white circular symbol wrapping a white diamond icon disappears", "A white circular symbol wrapping a white diamond icon appears"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "tC2gP2kLsKc_0", "video_path": "tC2gP2kLsKc.mp4", "subtitle_path": "tC2gP2kLsKc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 578, "duration": 24.0, "view_count": 37927}, {"video_id": "qi9YysWXP3s", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First the scene of horizontally cutting the chocolate ingredient, then the scene of kneading the chocolate ingredient into a ball, and finally the scene of vertically cutting the chocolate ingredient.", "First the scene of vertically cutting the chocolate ingredient, then the scene of horizontally cutting the chocolate ingredient, and finally the scene of kneading the chocolate ingredient into a ball.", "First the scene of kneading the chocolate ingredient into a ball, then the scene of vertically cutting the chocolate ingredient, and finally the scene of horizontally cutting the chocolate ingredient.", "First the scene of vertically cutting the chocolate ingredient, then the scene of kneading the chocolate ingredient into a ball, and finally the scene of horizontally cutting the chocolate ingredient.", "First the scene of horizontally cutting the chocolate ingredient, then the scene 
of vertically cutting the chocolate ingredient, and finally the scene of kneading the chocolate ingredient into a ball."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "qi9YysWXP3s_0", "video_path": "qi9YysWXP3s.mp4", "subtitle_path": "qi9YysWXP3s_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 176, "duration": 29.0, "view_count": 2614}, {"video_id": "TCUpBC54WO8", "question": "On a green wooden table, there is a green bowl containing liquid ingredients. To the left is a yellow wooden board with a piece of bread on it. In what other contexts has the bread appeared?", "question_wo_referring_query": ", In what other contexts has the bread appeared?", "candidates": ["In the oil pot", "In the paper box", "In the blue plate", "In the plastic bag", "In the red plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "TCUpBC54WO8_0", "video_path": "TCUpBC54WO8.mp4", "subtitle_path": "TCUpBC54WO8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 214, "duration": 55.0, "view_count": 318498}, {"video_id": "CCfZbxSFFuE", "question": "A computer room with a black background. There are colorful images on the screens of two computers. A keyboard and cables are on the wooden desk. A man with black curly hair wearing a blue and white floral shirt is speaking. 
What other subtitles have appeared together with him?", "question_wo_referring_query": "What other subtitles have appeared together with him?", "candidates": ["Collect funds from them, as well as these", "Granting them the rights to work for you", "The personal data they are going to delete", "The agent will represent you in contacting data", "Handle any disagreements and keep you updated"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "CCfZbxSFFuE_0", "video_path": "CCfZbxSFFuE.mp4", "subtitle_path": "CCfZbxSFFuE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 617, "duration": 27.0, "view_count": 160552}, {"video_id": "GP2iOMKNpvg", "question": "In a crowded alleyway with various colored products being sold on both sides, a person in a blue short-sleeved shirt is to the right, a person in a red short-sleeved shirt is peeking from behind, and in the middle of the screen, a person in a black short-sleeved shirt is gesturing. Meanwhile, a person in a gray short-sleeved shirt is looking forward. What change occurred to the person in the gray short-sleeved shirt while eating?", "question_wo_referring_query": "What change occurred to the person in the gray short-sleeved shirt while eating?", "candidates": ["His short-sleeved shirt changed from gray to white", "His short-sleeved shirt turned green", "His short-sleeved shirt turned into long sleeves", "The black shoulder strap on his shoulder disappeared", "His short-sleeved shirt turned into an outer jacket"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "GP2iOMKNpvg_0", "video_path": "GP2iOMKNpvg.mp4", "subtitle_path": "GP2iOMKNpvg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 412, "duration": 52.02, "view_count": 349978}, {"video_id": "hPoUZeF-haI", "question": "On a wooden-colored desktop, there are five different dishes placed. 
There's a black plate with tortilla chips, a white plate with food wrapped in tortillas, red salsa, and two small dishes of sides. Which of the following items appears?", "question_wo_referring_query": ", which of the following items appears?", "candidates": ["MEAT", "RICE & BEANS, SALSA, GUACAMOLE, VEGGIE BURRITO, and TORTILLA CHIPS", "BROCCOLI", "BEEF"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "hPoUZeF-haI_0", "video_path": "hPoUZeF-haI.mp4", "subtitle_path": "hPoUZeF-haI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 129, "duration": 42.0, "view_count": 128348}, {"video_id": "RPLcweuuNmo", "question": "In front of a large screen with a long-haired woman in the background, there is a long-haired woman wearing a dark red top and white skirt standing on the right side of the screen. Below, there are also news rolling subtitles. When 'with air conditioning the prolong' is mentioned, what objects have appeared?", "question_wo_referring_query": "What objects have appeared?", "candidates": ["A woman singing", "A man speaking with a microphone in hand", "A man raising his hand in the air", "A woman wiping her forehead with a tissue in one hand and raising the other hand in the air"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "RPLcweuuNmo_0", "video_path": "RPLcweuuNmo.mp4", "subtitle_path": "RPLcweuuNmo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 163, "duration": 57.0, "view_count": 5503}, {"video_id": "nI-Hj0y7yKc", "question": "On a red-brown wooden table, there are some red and yellow express packages. A man with curly hair is holding a microphone in his right hand, looking sideways at the camera. 
What kind of clothes is he wearing?", "question_wo_referring_query": "What kind of clothes is he wearing?", "candidates": ["Red short sleeves", "Black and white striped short sleeves", "Green short sleeves", "Purple jacket"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "nI-Hj0y7yKc_0", "video_path": "nI-Hj0y7yKc.mp4", "subtitle_path": "nI-Hj0y7yKc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 82, "duration": 38.01, "view_count": 103805}, {"video_id": "s-MEy4Oy7Pg", "question": "Beneath the tall mountain, there is a small green house with a red roof. Beside the small house, there is a dog that is sniffing the ground. When mentioning 'I indulge my love of crafting and design and purchase as needed. What has changed is the', what color is the dog?", "question_wo_referring_query": "What color is the dog?", "candidates": ["Black and white spotted dog", "Golden retriever", "Big white dog", "Big black dog"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "s-MEy4Oy7Pg_0", "video_path": "s-MEy4Oy7Pg.mp4", "subtitle_path": "s-MEy4Oy7Pg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 103, "duration": 48.0, "view_count": 654399}, {"video_id": "BXXEDmRjSyc", "question": "On a marble table, there is a baking tray containing a handprint arrangement made of tricolor carrots and noodles. At the bottom of the screen, there is a hand holding a ladle, adding food onto the noodles. 
What is the substance being poured into the dish in the video?", "question_wo_referring_query": ", what is the substance being poured into the dish in the video?", "candidates": ["purple fruit sauce", "carrot juice", "melon juice", "vegetable juice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "BXXEDmRjSyc_0", "video_path": "BXXEDmRjSyc.mp4", "subtitle_path": "BXXEDmRjSyc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 16, "duration": 17.02, "view_count": 79861}, {"video_id": "ujpG3LNFGvU", "question": "In a black-and-white drawing, there is a man with a beard wearing a hat, next to a person with glasses and a long nose. When 'Vivid as your own Peter bugal the Elder' is mentioned, what does he do?", "question_wo_referring_query": "What does he do?", "candidates": ["Holding a cigarette in his mouth", "Reading a book", "Shakes hands with the person next to him", "Walking"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "ujpG3LNFGvU_0", "video_path": "ujpG3LNFGvU.mp4", "subtitle_path": "ujpG3LNFGvU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 298, "duration": 42.0, "view_count": 350337}, {"video_id": "EwUtDBAEWWU", "question": "Under the blue sky and white clouds, a group of small people in yellow clothes are holding fish and long rods, standing at the foot of a gray mountain. 
After resolving the food issue, what did they do?", "question_wo_referring_query": "What did they do next?", "candidates": ["Started chiseling the ground with tools", "Searched for seafood in the river", "Poured yellow liquid from a black funnel into a black container", "Started fishing and hunting"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "EwUtDBAEWWU_0", "video_path": "EwUtDBAEWWU.mp4", "subtitle_path": "EwUtDBAEWWU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 86, "duration": 39.04, "view_count": 4104}, {"video_id": "qNuY7BHdFC0", "question": "In a damp open space, there are trees and a long railing in front. There is a camera placed on the ground, with a right hand pointing into the distance. There are also bold white letters at the bottom of the screen. What was introduced first before this?", "question_wo_referring_query": "In a damp open space, there are trees and a long railing in front. There is a camera placed on the ground, with a right hand pointing into the distance. There are also bold white letters at the bottom of the screen. What was introduced first before this?", "candidates": ["An overcast sky with no visible sun due to dense clouds", "A sunrise with orange early morning mist", "Bright noon sunlight", "Soft moonlight"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "qNuY7BHdFC0_0", "video_path": "qNuY7BHdFC0.mp4", "subtitle_path": "qNuY7BHdFC0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 22, "duration": 37.0, "view_count": 71186}, {"video_id": "Rsq8fVhicAk", "question": "In the frame, there is a sand area with many stones and weeds. 
Before the narrator mentions 'this exploration we'll trace the Journey,' what event occurs?", "question_wo_referring_query": "What event occurs?", "candidates": ["No event occurs.", "The camera zooms out.", "The scene changes.", "The camera zooms in and rotates around a group of stones."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Rsq8fVhicAk_0", "video_path": "Rsq8fVhicAk.mp4", "subtitle_path": "Rsq8fVhicAk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 4, "duration": 38.01, "view_count": 92003}, {"video_id": "cwJ4gdcArmg", "question": "A man in a blue short-sleeve shirt is sitting in a room filled with many potted plants and an additional white chair. There is a flag displayed on the screen, which consists of a white cross in the top left corner and four white stripes with five blue stripes. What change occurs to the flag when the man raises his hand?", "question_wo_referring_query": ", what change occurs to the flag?", "candidates": ["It turns into a blue background with a white cross in the center", "One white stripe disappears", "No change occurs", "One blue stripe disappears"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "cwJ4gdcArmg_0", "video_path": "cwJ4gdcArmg.mp4", "subtitle_path": "cwJ4gdcArmg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 63, "duration": 17.99, "view_count": 311673}, {"video_id": "uvDlArbplEY", "question": "In the kitchen, there is a woman wearing a red and blue plaid shirt. On the table, there are two plates of pizza, some scattered spice bottles, bowls, an orange, and two pots of green plants. One of her hands is on the table, while the other hand is raised to her chest. 
Which of these items is not shown in the frame?", "question_wo_referring_query": "Which of these items is not shown in the frame?", "candidates": ["Orange", "Cup", "Cutting board", "Refrigerator"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "uvDlArbplEY_0", "video_path": "uvDlArbplEY.mp4", "subtitle_path": "uvDlArbplEY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 403, "duration": 19.98, "view_count": 440602}, {"video_id": "aaJAo7hGWdk", "question": "In the scene, there is a character wearing a blue hat and blue clothes in the bottom right corner, with a yellow building behind him. In front of the building, there are four soldiers wearing gray helmets, black armor, and red badges. Additionally, there is a man without a helmet, dressed in orange clothing, and another person wearing a silver hat with orange clothes. When the phrase 'in addition to Old Persians the Persians' is mentioned, what type of hat is the person wearing the blue clothes wearing?", "question_wo_referring_query": "What type of hat is the person wearing blue clothes wearing?", "candidates": ["Duckbill cap", "Beret", "Cowboy hat", "Sheriff's hat"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "aaJAo7hGWdk_0", "video_path": "aaJAo7hGWdk.mp4", "subtitle_path": "aaJAo7hGWdk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 103, "duration": 50.0, "view_count": 2635}, {"video_id": "gCLRKbgZalU", "question": "In a room, there is a woman wearing a white short-sleeved shirt, two rings on her left hand, and earrings. She is holding blue, pink, and purple books. 
What did she do when she first appeared?", "question_wo_referring_query": "What did she do when she first appeared?", "candidates": ["She was arranging the books", "She faced the camera and introduced the blue, pink, and purple books", "She was picking out books", "She was taking selfies in front of a mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "gCLRKbgZalU_0", "video_path": "gCLRKbgZalU.mp4", "subtitle_path": "gCLRKbgZalU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 223, "duration": 32.0, "view_count": 64450}, {"video_id": "IxqPsNSdVho", "question": "In the scene where a man wearing a black top and jeans is holding two items in his hands inside a factory, a truck with two people inside is moving items, while two more people stand watching nearby. When the phrase 'to silence critics and to silence the media' is mentioned, what is the man wearing a black top and blue jeans doing?", "question_wo_referring_query": "What is the man wearing a black top and blue jeans doing?", "candidates": ["He is driving.", "He is giving a lecture.", "He is operating a machine.", "He is holding items in both hands."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "IxqPsNSdVho_0", "video_path": "IxqPsNSdVho.mp4", "subtitle_path": "IxqPsNSdVho_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1289, "duration": 40.01, "view_count": 26517}, {"video_id": "7iDn8iYa9fA", "question": "In the video, two people are having a conversation. The man on the left, who has white hair, is wearing a suit and tie, while the woman on the right is dressed in black. After they finish discussing, a woman appears. 
What is the woman doing with her head when she appears?", "question_wo_referring_query": "What is the woman doing with her head when she appears?", "candidates": ["Shaking her head", "Raising her head", "Lowering her head", "No movement"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "7iDn8iYa9fA_0", "video_path": "7iDn8iYa9fA.mp4", "subtitle_path": "7iDn8iYa9fA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 105, "duration": 18.98, "view_count": 28231}, {"video_id": "3hHZ0S-bInw", "question": "In the scene, there is a black pot on the table with 5 pieces of meat inside. After mentioning 'even some shrimp you can swap out the', what happened?", "question_wo_referring_query": "What happened?", "candidates": ["A person used red tongs to take the meat out of the black pot", "Water was added to the pot", "Chicken broth was added to the pot", "Garlic was added to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "3hHZ0S-bInw_0", "video_path": "3hHZ0S-bInw.mp4", "subtitle_path": "3hHZ0S-bInw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 45, "duration": 27.03, "view_count": 245460}, {"video_id": "L1c7-Y7ZEqY", "question": "In the screen, on the left side of a photo, there are two men wearing blue shirts and black pants. One of them is wearing a yellow tag. There's a white car parked next to them, and many people are standing nearby. On the right side of the photo, there is an elderly man with white hair, wearing a white shirt with black polka dots and glasses. Behind him is a bookshelf. 
After mentioning 'he is the president and there are much,' which characters appear?", "question_wo_referring_query": ", which characters appear?", "candidates": ["A man wearing yellow clothes appears", "A woman wearing a pink dress and pearl earrings, and a man wearing a white shirt with a black suit and tie appear", "A woman wearing a black dress and pearl earrings appears", "A man wearing a white shirt with a blue suit jacket appears"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "L1c7-Y7ZEqY_0", "video_path": "L1c7-Y7ZEqY.mp4", "subtitle_path": "L1c7-Y7ZEqY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 527, "duration": 31.0, "view_count": 11869}, {"video_id": "tl3r89bggqI", "question": "Which sequence of scenes is correct?", "question_wo_referring_query": "Which sequence of scenes is correct?", "candidates": ["First, in front of a building, there are 4 people, three of whom are wearing red inner garments with white outer coats, and another person has a hat on their head, wearing a grey inner garment with a dark blue outer coat. Then, at night, many people appear riding horses, holding a spear in their left hand and a shield in their right hand. Finally, in front of a yellow house, there are 7 people, six of whom are wearing yellow clothes, and another person is wearing a golden helmet, yellow inner garment, and blue outer coat.", "First, at night, many people appear riding horses, holding a spear in their left hand and a shield in their right hand. Then, in front of a building, there are 4 people, three of whom are wearing red inner garments with white outer coats, and another person has a hat on their head, wearing a grey inner garment with a dark blue outer coat. 
Finally, in front of a yellow house, there are 7 people, six of whom are wearing yellow clothes, and another person is wearing a golden helmet, yellow inner garment, and blue outer coat.", "First, in front of a building, there are 4 people, three of whom are wearing red inner garments with white outer coats, and another person has a hat on their head, wearing a grey inner garment with a dark blue outer coat. Then, there are 7 people in front of a yellow house, six of whom are wearing yellow clothes, and another person is wearing a golden helmet, yellow inner garment, and blue outer coat. Finally, at night, many people appear riding horses, holding a spear in their left hand and a shield in their right hand.", "First, in front of a yellow house, there are 7 people, six of whom are wearing yellow clothes, and another person is wearing a golden helmet, yellow inner garment, and blue outer coat. Then, at night, many people appear riding horses, holding a spear in their left hand and a shield in their right hand. Finally, in front of a building, there are 4 people, three of whom are wearing red inner garments with white outer coats, and another person has a hat on their head, wearing a grey inner garment with a dark blue outer coat."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "tl3r89bggqI_0", "video_path": "tl3r89bggqI.mp4", "subtitle_path": "tl3r89bggqI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 227, "duration": 57.0, "view_count": 1484}, {"video_id": "iQVwdMBfCjw", "question": "Under the gray sky, there is a green grass field, and a green tank appears in the center of the screen. 
In which of the following scenes has it appeared before?", "question_wo_referring_query": "In which of the following scenes has it appeared before?", "candidates": ["On an empty street in front of an abandoned building", "In a research lab with many scientists", "In a desolate desert", "On a crowded street"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "iQVwdMBfCjw_0", "video_path": "iQVwdMBfCjw.mp4", "subtitle_path": "iQVwdMBfCjw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 310, "duration": 51.0, "view_count": 404912}, {"video_id": "KKLY3oO8WdQ", "question": "In front of a black curtain, a man wearing a gray lab coat is standing with his hands clasped together in front of him. Above his right side are several small photos of rare animals arranged into a larger picture. What change did this man undergo?", "question_wo_referring_query": "What change did this man undergo?", "candidates": ["He turned to his left side", "His right hand points towards the camera", "He touched his own hair", "He looked at the camera with both fists loosely clenched in front of him"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "KKLY3oO8WdQ_0", "video_path": "KKLY3oO8WdQ.mp4", "subtitle_path": "KKLY3oO8WdQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 942, "duration": 48.01, "view_count": 273164}, {"video_id": "59745N6_Kc8", "question": "In a large gymnasium with a blue floor mat, there is a row of people wearing short sleeves and shorts standing. On the right side of the screen, there is a woman wearing a grey checkered short-sleeved shirt. 
After '[Music]' is mentioned, what changes occur to her?", "question_wo_referring_query": "What changes occur to her?", "candidates": ["She changes into a black short-sleeved shirt", "She changes into a white short-sleeved shirt", "She changes into a black long-sleeved shirt", "She changes into a white long-sleeved shirt"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "59745N6_Kc8_0", "video_path": "59745N6_Kc8.mp4", "subtitle_path": "59745N6_Kc8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 127, "duration": 28.0, "view_count": 6778}, {"video_id": "LC5Xgk0cCgA", "question": "In a scene with red seats all around, there is a gray-haired man wearing a blue shirt sitting on one of the seats. His right hand is clenched in a fist, while his left hand is open with the palm facing the camera, showing the entire hand. What objects have appeared below?", "question_wo_referring_query": "In a scene with red seats all around, there is a gray-haired man wearing a blue shirt sitting on one of the seats. His right hand is clenched in a fist, while his left hand is open with the palm facing the camera, showing his entire hand. What objects have appeared below?", "candidates": ["The watch on the man's wrist", "Two rings on the man's fingers", "The man's black-rimmed glasses", "The man's black backpack"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "LC5Xgk0cCgA_0", "video_path": "LC5Xgk0cCgA.mp4", "subtitle_path": "LC5Xgk0cCgA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 126, "duration": 28.03, "view_count": 9840}, {"video_id": "buNygLYP89o", "question": "Next to an outdoor swimming pool, there is a man wearing a blue short-sleeve shirt and red and white floral shorts. One hand is supporting the platform, and the other hand is clenched in a fist. 
When mentioning 'outside the best thing to find is', which object is not present in the scene?", "question_wo_referring_query": "Which object is not present in the scene?", "candidates": ["Glasses", "A bushy tree", "A house with a white roof and red walls", "A blue headscarf"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "buNygLYP89o_0", "video_path": "buNygLYP89o.mp4", "subtitle_path": "buNygLYP89o_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 239, "duration": 42.0, "view_count": 911}, {"video_id": "i3I3FiK3_r8", "question": "On a wooden table, there is a pot containing chutney-colored sauce and chopped white green onions. In the frame, a right hand is holding a bowl of crushed sweet chili, ready to pour it into the pot. What type of pot is used in the scene?", "question_wo_referring_query": "What type of pot is used in the scene?", "candidates": ["A round pot with one handle", "A round pot with two handles", "A square pot with two handles", "A square pot with one handle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "i3I3FiK3_r8_0", "video_path": "i3I3FiK3_r8.mp4", "subtitle_path": "i3I3FiK3_r8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 229, "duration": 53.01, "view_count": 377747}, {"video_id": "eQzeHYI3Q7Q", "question": "In a scene dominated by a stretch of green, there are four seated green giants. Above them in the middle, there is a smaller standing statue embedded in a square space. 
When the phrase \"was split between two courts, the Kenbet and the Great\" is mentioned, which of the following objects appears?", "question_wo_referring_query": "Which of the following objects appears?", "candidates": ["A complete standing giant statue", "Upper half of a seated damaged statue", "Lower half of a seated damaged statue", "A complete giant female statue seated"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "eQzeHYI3Q7Q_0", "video_path": "eQzeHYI3Q7Q.mp4", "subtitle_path": "eQzeHYI3Q7Q_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 187, "duration": 21.0, "view_count": 3059014}, {"video_id": "auceDl-XMmk", "question": "Against a red background, there is a black cannonball design at the top, and two different objects appear floating upwards at the bottom. What are these two objects?", "question_wo_referring_query": "Against a red background, there is a black cannonball design at the top, and two different objects appear floating upwards at the bottom. What are these two objects?", "candidates": ["Two photos of military green cannons", "Two photos of military blue tanks", "A photo of two young girls together", "Two photos of flags"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "auceDl-XMmk_0", "video_path": "auceDl-XMmk.mp4", "subtitle_path": "auceDl-XMmk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 94, "duration": 32.0, "view_count": 204658}, {"video_id": "hBvcMMuBsHQ", "question": "On a grey and white stage with a flag that has black, red, and yellow colors, there are two men in grey military uniforms holding long guns. 
What are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["They are raising their guns and shooting at the sky", "They are preparing to aim and defeat the enemy", "They are holding guns and patrolling around the flag", "They are preparing to lower the flag"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "hBvcMMuBsHQ_0", "video_path": "hBvcMMuBsHQ.mp4", "subtitle_path": "hBvcMMuBsHQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 145, "duration": 45.0, "view_count": 1171792}, {"video_id": "dwfCB9zbvZ8", "question": "In the scene, there is a man wearing a black suit on the left, and a woman wearing a black suit with a headset on the right. There is also a scrolling bar with white text on a red background at the bottom. When the phrase 'pardon the P I found it all very' is mentioned, what action does the man take?", "question_wo_referring_query": "What action does the man take?", "candidates": ["He touched his hair.", "He raised his eyebrows and leaned forward.", "He raised his eyebrows and leaned back.", "He adjusted his tie."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "dwfCB9zbvZ8_0", "video_path": "dwfCB9zbvZ8.mp4", "subtitle_path": "dwfCB9zbvZ8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 78, "duration": 31.0, "view_count": 123158}, {"video_id": "ny-ujkmMniI", "question": "In a scene without intense colors, there is a Da Vinci with a long beard and wearing a black hat. 
What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A pencil sketch of a person riding a horse emerges from the center of the screen in the shape of a heart.", "A pencil sketch of a person riding a horse pops out from the right side.", "A pencil sketch of a person riding a horse pops out from the left side.", "A pencil sketch of a person riding a horse emerges from the center of the screen in the shape of a star."], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "ny-ujkmMniI_0", "video_path": "ny-ujkmMniI.mp4", "subtitle_path": "ny-ujkmMniI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 405, "duration": 48.0, "view_count": 411943}, {"video_id": "_GDbf9CeRvs", "question": "During a live performance with a vibrant crowd, there is a man wearing a light blue jacket and a tie standing in the center of the screen. On his right is a man holding a red guitar, and in the audience, there are also spectators raising their hands. Which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["The crowd in the audience.", "The man holding a red guitar.", "The man with black hair wearing a light blue jacket and a tie.", "The man playing the double bass."], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "_GDbf9CeRvs_0", "video_path": "_GDbf9CeRvs.mp4", "subtitle_path": "_GDbf9CeRvs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 119, "duration": 46.01, "view_count": 1530}, {"video_id": "LTPuD_I578Y", "question": "In a bedroom, a man wearing a black short-sleeve shirt is sitting in front of his computer looking at the screen. His hands are slightly spread apart in front of his chest. 
After he mentions 'times but I will say that I would not be', who else appears?", "question_wo_referring_query": "Who else appears?", "candidates": ["A person wearing a black short-sleeve shirt and a headset", "A person wearing a white short-sleeve shirt and protective glasses", "A person wearing a black short-sleeve shirt and protective glasses", "A person wearing a white short-sleeve shirt and a headset"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "LTPuD_I578Y_0", "video_path": "LTPuD_I578Y.mp4", "subtitle_path": "LTPuD_I578Y_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1558, "duration": 43.0, "view_count": 321936}, {"video_id": "6c5So8fDGy4", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a woman with long hair in black clothes is sitting in the driver's seat of a car. Next, there is a person wearing black and white checkered shoes walking on a cement road with blue stripes. Then, the woman with long hair in black clothes is eating fried noodles under a tree. Finally, a dish of fried noodles with blue flowers and cilantro appears on the screen.", "First, there is a person wearing black and white checkered shoes walking on a cement road with blue stripes. Next, a woman with long hair in black clothes is sitting in the driver's seat of a car. Then, the woman with long hair in black clothes is eating fried noodles under a tree. Finally, a dish of fried noodles with blue flowers and cilantro appears on the screen.", "First, a woman with long hair in black clothes is sitting in the driver's seat of a car. Next, there is a person wearing black and white checkered shoes walking on a cement road with blue stripes. Then, a dish of fried noodles with blue flowers and cilantro appears on the screen. 
Finally, the woman with long hair in black clothes is eating fried noodles under a tree.", "First, there is a person wearing black and white checkered shoes walking on a cement road with blue stripes. Next, a woman with long hair in black clothes is sitting in the driver's seat of a car. Then, a dish of fried noodles with blue flowers and cilantro appears on the screen. Finally, the woman with long hair in black clothes is eating fried noodles under a tree."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "6c5So8fDGy4_0", "video_path": "6c5So8fDGy4.mp4", "subtitle_path": "6c5So8fDGy4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 165, "duration": 33.0, "view_count": 35103}, {"video_id": "BYLnK3h6iDk", "question": "On a stone surface with green moss spots, there is a stainless steel cup filled with liquid. In the upper left part of the screen, a hand is about to pick it up. Which of the following subtitles has appeared along with this cup?", "question_wo_referring_query": ", this cup has appeared along with the following subtitles?", "candidates": ["Mono-sodium amalgamate", "Pepper", "Sugar", "Salt"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "BYLnK3h6iDk_0", "video_path": "BYLnK3h6iDk.mp4", "subtitle_path": "BYLnK3h6iDk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 163, "duration": 36.0, "view_count": 1286383}, {"video_id": "EE_n1oMszaQ", "question": "In an oil painting set on a green tree-covered grassy field, some people are playing, some are lying down resting, and some are sitting enjoying the scenery. 
What changes occur in the oil painting?", "question_wo_referring_query": ", what changes occur in the oil painting?", "candidates": ["A black rabbit appears", "A gray duck appears", "A character appears holding a gun chasing a rabbit", "A character in blue holding a gun appears"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "EE_n1oMszaQ_0", "video_path": "EE_n1oMszaQ.mp4", "subtitle_path": "EE_n1oMszaQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 239, "duration": 57.0, "view_count": 5853}, {"video_id": "Qs0tGAcDkco", "question": "In the bottom right of the screen, there is a valley. Looking beyond it, there is a mountain peak standing tall against the blue sky and white clouds. The top of the mountain peak is also covered in patches of white snow. After the subtitle mentions 'every Rock Mountain and Valley narrates,' what changes occur on the screen?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["The clouds at the horizon change into mist.", "The clouds at the horizon drift to the left side of the screen.", "The clouds at the horizon drift to the right side of the screen.", "The clouds at the horizon are tinted by the evening mist."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "Qs0tGAcDkco_0", "video_path": "Qs0tGAcDkco.mp4", "subtitle_path": "Qs0tGAcDkco_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 262, "duration": 16.98, "view_count": 28256}, {"video_id": "yNDMHdSO9AM", "question": "In the black-and-white photo, there is a large white building, with lush trees and three flagpoles in front of it. Further ahead is a street with several parked cars. 
When the phrase 'women as well as gay rights this is just' is mentioned, which object does not appear on the screen?", "question_wo_referring_query": "Which object does not appear on the screen?", "candidates": ["Fountain with a square base", "White landmark arrow", "Fountain with a round base", "Pedestrian who is moving"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "yNDMHdSO9AM_0", "video_path": "yNDMHdSO9AM.mp4", "subtitle_path": "yNDMHdSO9AM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 296, "duration": 18.0, "view_count": 1063347}, {"video_id": "KtBU-HnxiCo", "question": "In a scene with a black background, there are two men with curly hair wearing short sleeves standing shoulder to shoulder. What type of clothes are they wearing?", "question_wo_referring_query": "What type of clothes are they wearing?", "candidates": ["The man on the right is wearing a black short-sleeve shirt with yellow-green stripes, and the man on the left is wearing a plain gray short-sleeve shirt.", "The man on the right is wearing a black short-sleeve shirt with yellow-green stripes, and the man on the left is wearing a white short-sleeve shirt with gray patterns.", "The man on the left is wearing a black short-sleeve shirt with yellow-green stripes, and the man on the right is wearing a white short-sleeve shirt with gray patterns.", "The man on the left is wearing a black short-sleeve shirt with yellow-green stripes, and the man on the right is wearing a plain gray short-sleeve shirt."], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "KtBU-HnxiCo_0", "video_path": "KtBU-HnxiCo.mp4", "subtitle_path": "KtBU-HnxiCo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1060, "duration": 46.98, "view_count": 382161}, {"video_id": "bzsKnKcb1-A", "question": "In a red background with two tanker designs, under a design with an intersecting white 
pattern, there is an orange wheelhouse, a black-framed box with continuously scrolling text inside. What is the style of the scrolling text?", "question_wo_referring_query": "What is the style of the scrolling text?", "candidates": ["Bold red text with marked blue vocabulary", "Pure bold red text", "Bold white text with marked red vocabulary", "Pure bold white text"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "bzsKnKcb1-A_0", "video_path": "bzsKnKcb1-A.mp4", "subtitle_path": "bzsKnKcb1-A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 211, "duration": 36.0, "view_count": 943859}, {"video_id": "hZ4uLoWiA3I", "question": "In a dimly lit scene, there are two mannequins dressed in white standing at the front, looking inward. In the middle, there are mannequins dressed in different colored clothes and posing in various postures. What happened to them?", "question_wo_referring_query": "What happened to them?", "candidates": ["They were displayed in the middle.", "They were photographed by people.", "They were tried on by customers.", "They were moved by people."], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "hZ4uLoWiA3I_0", "video_path": "hZ4uLoWiA3I.mp4", "subtitle_path": "hZ4uLoWiA3I_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 10, "duration": 47.01, "view_count": 40088}, {"video_id": "Y5kcGYcDE0o", "question": "In a scene where a short-haired girl wearing a headscarf is leaning against a misshapen intersecting tree, what change occurs when 'the ground like girls on hands and knees' is mentioned?", "question_wo_referring_query": "What change occurs?", "candidates": ["The screen slowly moves from right to left", "The screen slowly moves from left to right", "The screen slowly moves from bottom to top", "The screen slowly moves from top to bottom"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", 
"level": "IntraMoment", "id": "Y5kcGYcDE0o_0", "video_path": "Y5kcGYcDE0o.mp4", "subtitle_path": "Y5kcGYcDE0o_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 97, "duration": 45.98, "view_count": 3958}, {"video_id": "XL2WRhD29hw", "question": "In a video call screen, there\u2019s a smiling girl waving her hand, and beside her is a woman with her mouth open. What does this woman do?", "question_wo_referring_query": "What does this woman do?", "candidates": ["She hugged the little girl in front of her", "She turned off the camera", "She waved her right hand to say hello", "She waved her left hand to say hello"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "XL2WRhD29hw_0", "video_path": "XL2WRhD29hw.mp4", "subtitle_path": "XL2WRhD29hw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 210, "duration": 22.0, "view_count": 4368}, {"video_id": "0V8o5_NKv34", "question": "In a scene with a background of tall buildings, three men in suits with different colored ties are sitting on a crimson-colored chair on a round platform. Who appears first on the screen?", "question_wo_referring_query": "Who appears first on the screen?", "candidates": ["The white-haired man in a black suit with a blue necktie", "The white-haired man in a black suit with a blue tie", "The man in a black suit with a green tie", "The man in a black suit with a purple tie"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "0V8o5_NKv34_0", "video_path": "0V8o5_NKv34.mp4", "subtitle_path": "0V8o5_NKv34_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 301, "duration": 53.02, "view_count": 5638}, {"video_id": "UwUxooXWBOk", "question": "On a wooden board covered with white flour, a pair of hands with rose-red nails is pressing on two square patches. 
After mentioning [Music], what object appeared?", "question_wo_referring_query": ", what object appeared?", "candidates": ["a bowl of water in a transparent glass bowl", "a water-filled teapot", "a bottle of mineral water", "a can of olive oil"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "UwUxooXWBOk_0", "video_path": "UwUxooXWBOk.mp4", "subtitle_path": "UwUxooXWBOk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 73, "duration": 45.0, "view_count": 469145}, {"video_id": "KTXyIZhRtgk", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the screen shows a transparent bowl on a white ceramic plate containing yellow food, followed by a man in a blue short sleeve shirt wearing glasses speaking to the camera, next a man in a blue shirt holding golden food facing the camera, then a man in a white shirt is eating at a table filled with various foods, and finally a white ceramic plate with four pieces of golden food.", "First, the screen shows a transparent bowl on a white ceramic plate containing yellow food, followed by a man in a blue short sleeve shirt wearing glasses speaking to the camera, next a man in a white shirt is eating at a table filled with various foods, then a man in a blue shirt holding golden food facing the camera, and finally a white ceramic plate with four pieces of golden food.", "First, a man in a blue short sleeve shirt wearing glasses is speaking to the camera, followed by a transparent bowl on a white ceramic plate containing yellow food, next a man in a white shirt is eating at a table filled with various foods, then a man in a blue shirt holding golden food facing the camera, and finally a white ceramic plate with four pieces of golden food.", "First, the screen shows a white ceramic plate with four pieces of golden food, followed 
by a man in a blue short sleeve shirt wearing glasses speaking to the camera, next a man in a white shirt is eating at a table filled with various foods, then a man in a blue shirt holding golden food facing the camera, and finally a transparent bowl on a white ceramic plate containing yellow food."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "KTXyIZhRtgk_0", "video_path": "KTXyIZhRtgk.mp4", "subtitle_path": "KTXyIZhRtgk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 453, "duration": 16.0, "view_count": 122041}, {"video_id": "LBvOZO1gmOE", "question": "On a row of pointed-roof houses, a big fire is burning. Which of the following subtitles have co-occurred with it?", "question_wo_referring_query": ", which of the following subtitles have co-occurred with it?", "candidates": ["celebrate", "Subscribe for more history videos", "But it may also have helped kill off some of the rats and fleas carrying the plague", "100000 people died in London from the Great Plague"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "LBvOZO1gmOE_0", "video_path": "LBvOZO1gmOE.mp4", "subtitle_path": "LBvOZO1gmOE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 154, "duration": 21.0, "view_count": 1473072}, {"video_id": "08TBIcGlAio", "question": "On a red wooden table, there is a transparent glass bowl containing a yellowish-white food item. 
In a screen with white subtitles in the lower left corner, what changes are happening to this food item?", "question_wo_referring_query": "What changes are happening to this food item?", "candidates": ["Stir the mixture after adding the ingredients to the bowl", "Throw away the mixture after adding the ingredients to the bowl", "Put the food from the bowl into a grinder", "Put the food from the bowl into the oven"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "08TBIcGlAio_0", "video_path": "08TBIcGlAio.mp4", "subtitle_path": "08TBIcGlAio_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 119, "duration": 55.0, "view_count": 2191}, {"video_id": "EpxkAQdtBi0", "question": "In front of the yellow table with the pumpkin, there are three men. One of the men, who is wearing a black short-sleeved shirt, is talking to another man wearing a hat. After mentioning 'Melon', what changes occurred to the fruit?", "question_wo_referring_query": "What changes occurred to the fruit?", "candidates": ["The pumpkin was hit by a car and rolled on the ground", "The pumpkin was broken, and a lot of juice flowed out", "The pumpkin was put into a red plastic bag and taken away by the man in the black shirt", "The pumpkin was put into a transparent plastic bag and taken away by the man in the black shirt"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "EpxkAQdtBi0_0", "video_path": "EpxkAQdtBi0.mp4", "subtitle_path": "EpxkAQdtBi0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 10, "duration": 55.0, "view_count": 3385567}, {"video_id": "3yruvXRA_mw", "question": "In the kitchen, a lady wearing a red polka dot short-sleeve shirt and glasses, holding a blue spatula with her left hand over a black pot, with a pool, refrigerator, fruits, various condiment bottles, and bowls behind her, during the mention of 'layer it's going to be the 
most moist', which items did not appear on the screen?", "question_wo_referring_query": "Which items did not appear on the screen?", "candidates": ["oven", "refrigerator", "pool", "orange vase"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "3yruvXRA_mw_0", "video_path": "3yruvXRA_mw.mp4", "subtitle_path": "3yruvXRA_mw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 128, "duration": 18.02, "view_count": 132788}, {"video_id": "n7KpS1yM9e0", "question": "Outdoors, a man wearing a black short-sleeved shirt, carrying a backpack, with a watch on his right hand, and then raising his right hand. There are some trees and a fenced area around him. What color is the man's watch?", "question_wo_referring_query": "What color is the man's watch?", "candidates": ["black", "white", "red", "blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "n7KpS1yM9e0_0", "video_path": "n7KpS1yM9e0.mp4", "subtitle_path": "n7KpS1yM9e0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 197, "duration": 30.99, "view_count": 64282}, {"video_id": "HCRqjtVmibg", "question": "In the video, the person on the right wearing a hat who is holding a book is a black person sitting on a sofa. When the subtitle mentions 'it's a sound coming from somewhere,' what color is the hat she is wearing?", "question_wo_referring_query": "What color is the hat the woman is wearing?", "candidates": ["gray", "yellow", "green", "black"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "HCRqjtVmibg_0", "video_path": "HCRqjtVmibg.mp4", "subtitle_path": "HCRqjtVmibg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 271, "duration": 25.0, "view_count": 686}, {"video_id": "rmxKF_MHxmM", "question": "In a scene with a black background, there is a curly-haired man wearing blue short sleeves and glasses. 
What is he doing at the beginning of the video?", "question_wo_referring_query": "What is he doing at the beginning of the video?", "candidates": ["He touches his chin", "He spreads his hands apart, with palms facing each other", "He clenches his left fist", "He spreads his left hand, with the palm facing down"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "rmxKF_MHxmM_0", "video_path": "rmxKF_MHxmM.mp4", "subtitle_path": "rmxKF_MHxmM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 11, "duration": 25.99, "view_count": 284931}, {"video_id": "vYEpd6w3hrE", "question": "On the left side of the screen, there is a long-haired woman wearing a blue long-sleeve shirt. On the right side, there are a few trees and a house in the distance. When the subtitle mentions 'Prime Minister what kind of rhetoric,' what happens to the house on the right side?", "question_wo_referring_query": "What happens to the house on the right side?", "candidates": ["Being doused with water", "No change", "Black smoke emerges", "Being sprayed with paint"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "vYEpd6w3hrE_0", "video_path": "vYEpd6w3hrE.mp4", "subtitle_path": "vYEpd6w3hrE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 263, "duration": 25.0, "view_count": 1151203}, {"video_id": "B6sc_YQc864", "question": "In the video, there are eight raw chicken legs neatly placed on a dark green board. 
What is done after placing them neatly?", "question_wo_referring_query": "What is done after placing them neatly?", "candidates": ["Throw away", "Put into a pot", "Boil", "Sprinkle salt"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "B6sc_YQc864_0", "video_path": "B6sc_YQc864.mp4", "subtitle_path": "B6sc_YQc864_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 386, "duration": 37.0, "view_count": 238635}, {"video_id": "ldLaH1770Hk", "question": "In a messy room, there is an old man in blue clothes kneeling next to a giant gray pillar. Surrounding him are two men in white clothes and black pants. Which person appears first on the screen?", "question_wo_referring_query": "Which person appears first on the screen?", "candidates": ["The man in a blue shirt", "The man in a white shirt with a tie", "The old man in blue clothes", "The man in a white short sleeve shirt with glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "ldLaH1770Hk_0", "video_path": "ldLaH1770Hk.mp4", "subtitle_path": "ldLaH1770Hk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 84, "duration": 50.99, "view_count": 230180}, {"video_id": "QfpzNw231mQ", "question": "In the scene, there are two men standing at the door, one wearing a blue suit and the other wearing a black suit. 
After the caption 'it's a dog barking dude no wait it's hi' appears, what does the man in the blue suit do?", "question_wo_referring_query": "What does the man in the blue suit do?", "candidates": ["Turn around", "Kneel down", "Hug", "Adjust his tie"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "QfpzNw231mQ_0", "video_path": "QfpzNw231mQ.mp4", "subtitle_path": "QfpzNw231mQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 152, "duration": 16.98, "view_count": 5487391}, {"video_id": "ZCjz7Rf9Oag", "question": "In the video, a woman wearing a gray long-sleeve shirt and a ring is sitting in the driver's seat. She reaches her hand out and opens it up. What items appear in her hand before she says, 'chug it but anyways my video is uploaded'?", "question_wo_referring_query": "What items appear in her hand?", "candidates": ["computer", "milk tea", "apple", "mobile phone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "ZCjz7Rf9Oag_0", "video_path": "ZCjz7Rf9Oag.mp4", "subtitle_path": "ZCjz7Rf9Oag_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 364, "duration": 29.0, "view_count": 28494}, {"video_id": "8G299uE7EMA", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there are two men, one wearing a dark gray coat and the other wearing white clothes, standing in front of a mirror on a wooden pier, with yellow bold subtitles at the bottom. Next, there is a black man wearing a blue short-sleeved shirt standing in a ferry on the sea. Then, the scene shows a red-colored bus in front, with a man wearing dark purple clothes outside, and someone about to get off the bus. 
Finally, the scene shows a narrow path with green columns, where there is a shirtless man.", "First, there is a black man wearing a blue short-sleeved shirt standing in a ferry on the sea. Next, there are two men, one wearing a dark gray coat and the other wearing white clothes, standing in front of a mirror on a wooden pier, with yellow bold subtitles at the bottom. Then, the scene shows a red-colored bus in front, with a man wearing dark purple clothes outside, and someone about to get off the bus. Finally, the scene shows a narrow path with green columns, where there is a shirtless man.", "First, there is a black man wearing a blue short-sleeved shirt standing in a ferry on the sea. Next, there are two men, one wearing a dark gray coat and the other wearing white clothes, standing in front of a mirror on a wooden pier, with yellow bold subtitles at the bottom. Then, the scene shows a narrow path with green columns, where there is a shirtless man. Finally, the scene shows a red-colored bus in front, with a man wearing dark purple clothes outside, and someone about to get off the bus.", "First, there are two men, one wearing a dark gray coat and the other wearing white clothes, standing in front of a mirror on a wooden pier, with yellow bold subtitles at the bottom. Next, there is a black man wearing a blue short-sleeved shirt standing in a ferry on the sea. Then, the scene shows a narrow path with green columns, where there is a shirtless man. 
Finally, the scene shows a red-colored bus in front, with a man wearing dark purple clothes outside, and someone about to get off the bus."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "8G299uE7EMA_0", "video_path": "8G299uE7EMA.mp4", "subtitle_path": "8G299uE7EMA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 813, "duration": 26.0, "view_count": 115126}, {"video_id": "dhmFV44TN-I", "question": "In the video, a man in black clothing is speaking facing the camera, with another man wearing a white beanie and black short-sleeved shirt standing by his side. Where else does this man appear?", "question_wo_referring_query": "Where else does this man appear?", "candidates": ["On a bed", "Outside on a skateboard", "On a bench", "On a sofa"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "dhmFV44TN-I_0", "video_path": "dhmFV44TN-I.mp4", "subtitle_path": "dhmFV44TN-I_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 99, "duration": 46.98, "view_count": 7164}, {"video_id": "GpfZY5tfNdA", "question": "In the video, there is a man wearing a white shirt and a black suit speaking about his views on a wide street. Behind him, there are a few trees and some buildings in the background. 
In which subtitle does this man also appear?", "question_wo_referring_query": "In which subtitle does this man also appear?", "candidates": ["financial capital", "trade with all this business going on", "the city", "did you catch all that"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "GpfZY5tfNdA_0", "video_path": "GpfZY5tfNdA.mp4", "subtitle_path": "GpfZY5tfNdA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 281, "duration": 40.0, "view_count": 144691}, {"video_id": "JV3R1MfkQ_g", "question": "In the video, a black-haired woman wearing black clothes and another woman wearing an olive green dress with a skirt are cooking in the kitchen. There is a pot on the table. Compared to the beginning of the video, what changes can be observed on the table?", "question_wo_referring_query": "Compared to the beginning of the video, what changes can be observed on the table?", "candidates": ["There is an additional banana", "There is an additional watermelon", "No changes", "There is an additional cutting board, knife, and a leek"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "JV3R1MfkQ_g_0", "video_path": "JV3R1MfkQ_g.mp4", "subtitle_path": "JV3R1MfkQ_g_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 200, "duration": 47.97, "view_count": 653487}, {"video_id": "dthjauMsbRw", "question": "In the video, a long-haired woman wearing a short-sleeved shirt and a yellow apron is in the kitchen. Behind her, there is a pot of green plants, and on the counter, there is a cutting board and some jalape\u00f1os. 
What change occurred when she said 'our appetizer and i think jalapeno'?", "question_wo_referring_query": "What change occurred when she said 'our appetizer and i think jalapeno'?", "candidates": ["Started frying vegetables", "Picked up the jalape\u00f1os", "Screamed", "Dropped the jalape\u00f1os"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "dthjauMsbRw_0", "video_path": "dthjauMsbRw.mp4", "subtitle_path": "dthjauMsbRw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 26, "duration": 28.99, "view_count": 484422}, {"video_id": "01RpoSiW3Wo", "question": "In the scene, there are three men dressed in blue long sleeves and gray pants on the rooftop, and there is a light beam shining. What are these three men doing?", "question_wo_referring_query": "What are these three men doing?", "candidates": ["Turning around", "Swimming", "Dancing", "Walking forward"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "01RpoSiW3Wo_0", "video_path": "01RpoSiW3Wo.mp4", "subtitle_path": "01RpoSiW3Wo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 3, "duration": 16.0, "view_count": 193500}, {"video_id": "EfV5DCYliTA", "question": "The long-haired girl wearing purple and red clothes in the video is sitting on a brown sofa explaining her point of view. There is a window behind her. 
What objects have appeared in the room in the video?", "question_wo_referring_query": "What objects have appeared in the room in the video?", "candidates": ["Computer", "Phone", "Cup", "Books"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "EfV5DCYliTA_0", "video_path": "EfV5DCYliTA.mp4", "subtitle_path": "EfV5DCYliTA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 163, "duration": 17.0, "view_count": 8100}, {"video_id": "LtZoH_22Ukg", "question": "In the video, a long-haired woman wearing a long-sleeved shirt and black pants is sitting on a white bed. When the subtitle mentions 'foreign[Music],' what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Quilt", "Refrigerator", "Sofa", "Computer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "LtZoH_22Ukg_0", "video_path": "LtZoH_22Ukg.mp4", "subtitle_path": "LtZoH_22Ukg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 526, "duration": 16.98, "view_count": 291102}, {"video_id": "aIbU7YO6NsY", "question": "There is a man in glasses on a full screen, talking about his views in a room. What color clothes is he wearing in the video?", "question_wo_referring_query": "What color clothes is he wearing in the video?", "candidates": ["Blue", "Yellow", "Black", "Green"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "aIbU7YO6NsY_0", "video_path": "aIbU7YO6NsY.mp4", "subtitle_path": "aIbU7YO6NsY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 245, "duration": 30.0, "view_count": 86904}, {"video_id": "Ga6N2X-iNvg", "question": "In the video, there are two men discussing their respective views. The wall in the room has many pictures. 
The man on the left is wearing short sleeves and holding wheat, while the man on the right is wearing a lab coat with one hand resting behind the other. When the subtitle mentions 'without further ado you owe the,' what color is the lab coat the man on the right is wearing?", "question_wo_referring_query": "What color is the lab coat of the man on the right?", "candidates": ["white", "pink", "black", "green"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "Ga6N2X-iNvg_0", "video_path": "Ga6N2X-iNvg.mp4", "subtitle_path": "Ga6N2X-iNvg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 132, "duration": 43.0, "view_count": 514266}, {"video_id": "Q95__gXJ9l0", "question": "There are two icons on the screen. The left one shows 26th May 1940 and the right one shows 3 days 8 hours. Who lost three days on the screen?", "question_wo_referring_query": "Who lost three days on the screen?", "candidates": ["Archer", "Mobile Phone", "Tank Commander", "Mage"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "Q95__gXJ9l0_0", "video_path": "Q95__gXJ9l0.mp4", "subtitle_path": "Q95__gXJ9l0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 199, "duration": 27.0, "view_count": 202379}, {"video_id": "_GCfXvHqjdg", "question": "In the screen, there are a few fish in the pond, and there are also a few rocks in the background. 
When the subtitle mentions 'structure,' what happens to the fish in the pond?", "question_wo_referring_query": "What happens to the fish in the pond?", "candidates": ["walk", "remain still", "swim", "float"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "_GCfXvHqjdg_0", "video_path": "_GCfXvHqjdg.mp4", "subtitle_path": "_GCfXvHqjdg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 209, "duration": 23.98, "view_count": 14321}, {"video_id": "mp-iuKXG4es", "question": "Under the blue sky and white clouds, a vast expanse of green is visible, with green trees stretching as far as the eye can see. After the mention of 'this one one please consider subscribing,' what object appears?", "question_wo_referring_query": "What object appears?", "candidates": ["A bridge with white railings and a beige base appears.", "A bridge with beige railings and a white base appears.", "A bridge with black railings and a white base appears.", "A bridge with white railings and a black base appears."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "mp-iuKXG4es_0", "video_path": "mp-iuKXG4es.mp4", "subtitle_path": "mp-iuKXG4es_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 46.01, "view_count": 9541}, {"video_id": "LCNw2e-Zehw", "question": "On the dark green background, there are words in the style of 'Strict Interpretation', and immediately after that, there is a paragraph of white English text on a black background. 
What changes occurred to the background after this?", "question_wo_referring_query": "What changes occurred to the background after this?", "candidates": ["Three white circles appeared on the right side of it", "Three white circles appeared at the top of it", "None", "Three white circles appeared on the left side of it", "Three white circles appeared at the bottom of it"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "LCNw2e-Zehw_0", "video_path": "LCNw2e-Zehw.mp4", "subtitle_path": "LCNw2e-Zehw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 293, "duration": 41.0, "view_count": 807371}, {"video_id": "YzRVG2TOIIc", "question": "There is a bowl on the screen with a fried egg in it, and there is a black pot in the background. When the subtitle mentions 'no no,' what changes happen to the egg?", "question_wo_referring_query": "What changes happen to the egg?", "candidates": ["No change", "Thrown away", "Doused in sauce and smashed with chopsticks", "Eaten"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "YzRVG2TOIIc_0", "video_path": "YzRVG2TOIIc.mp4", "subtitle_path": "YzRVG2TOIIc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 255, "duration": 32.0, "view_count": 737927}, {"video_id": "T1zSZFiqEgg", "question": "In the video, a long-haired woman wearing a grey long-sleeve shirt is sitting in the driver\u2019s seat, with a ring on her right hand. 
When she looks out the window, what does she do?", "question_wo_referring_query": ", what does she do?", "candidates": ["Plays with her phone", "Drinks a beverage", "Uses a computer", "Eats something"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "T1zSZFiqEgg_0", "video_path": "T1zSZFiqEgg.mp4", "subtitle_path": "T1zSZFiqEgg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 386, "duration": 39.0, "view_count": 117617}, {"video_id": "9mj7biOwRks", "question": "The video explains different icons. There are seven different icons on the screen. Which of the following icons does NOT appear in the video?", "question_wo_referring_query": "Which of the following icons does NOT appear in the video?", "candidates": ["One with the letter 'S'", "One with a cup", "One with a small horse", "One with a clock"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "9mj7biOwRks_0", "video_path": "9mj7biOwRks.mp4", "subtitle_path": "9mj7biOwRks_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 306, "duration": 52.0, "view_count": 209155}, {"video_id": "1CFYuKX73sw", "question": "A long-haired woman is broadcasting alone outdoors in the video, and there are several lines of subtitles on the screen. What color is the armor she is wearing during the broadcast?", "question_wo_referring_query": "What color is the armor she is wearing?", "candidates": ["green", "yellow", "purple", "red"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "1CFYuKX73sw_0", "video_path": "1CFYuKX73sw.mp4", "subtitle_path": "1CFYuKX73sw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 152, "duration": 43.0, "view_count": 4248}, {"video_id": "P9vwbbTgkzc", "question": "There are two boxes on the screen, each containing many tools, and in front there is an electrically charged object. 
When the subtitle mentions 'electrically charged objects', what color does the electrically charged object emit?", "question_wo_referring_query": "What color does the electrically charged object emit?", "candidates": ["Purple", "Green", "Red", "Yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "P9vwbbTgkzc_0", "video_path": "P9vwbbTgkzc.mp4", "subtitle_path": "P9vwbbTgkzc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 945, "duration": 15.98, "view_count": 1563584}, {"video_id": "7yXvXpSzX9A", "question": "In the evening, there is a bottle of mineral water and a bottle of wine on a table, with a tree nearby. Who is sitting on the right side of the table?", "question_wo_referring_query": "Who is sitting on the right side of the table?", "candidates": ["A woman with blonde hair wearing a skirt, holding a mobile phone", "A man wearing a black hoodie", "A woman wearing a black top", "A man wearing a black short-sleeved shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "7yXvXpSzX9A_0", "video_path": "7yXvXpSzX9A.mp4", "subtitle_path": "7yXvXpSzX9A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 164, "duration": 31.0, "view_count": 23320}, {"video_id": "vGgRKaQQvuk", "question": "A wall is covered with posters, a man wearing a black short-sleeve shirt and black-framed glasses is in front of the camera, and there is a black and white picture of a fish-headed person in the upper right corner of the screen. 
When 'skipjack tuna fishing their mascot is' is mentioned, what is this man in the black short-sleeve shirt doing?", "question_wo_referring_query": "What is this man in the black short-sleeve shirt doing?", "candidates": ["He is twirling a marker in his right hand and pointing his left fist to the right.", "He is holding his fists in front of his chest.", "He is making a 'V' sign with both hands above his head.", "He is crossing his arms in front of his chest."], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "vGgRKaQQvuk_0", "video_path": "vGgRKaQQvuk.mp4", "subtitle_path": "vGgRKaQQvuk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1050, "duration": 38.0, "view_count": 639621}, {"video_id": "1IAVy2J6jfs", "question": "In the video, a black-haired woman is talking in a room. She is wearing black and white floral clothing and a black and white floral tie. After she finishes speaking, what is she doing when the scene changes?", "question_wo_referring_query": ", what is she doing when the scene changes?", "candidates": ["Shopping at the supermarket", "Standing", "Playing with her phone", "Watching TV"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "1IAVy2J6jfs_0", "video_path": "1IAVy2J6jfs.mp4", "subtitle_path": "1IAVy2J6jfs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 161, "duration": 29.0, "view_count": 134253}, {"video_id": "li6celmZCJs", "question": "On the grassy area with green tree dots, there are two soldiers dressed in khaki uniforms holding banners, leading two rows of soldiers carrying guns, also dressed in military uniforms and wearing hats. 
Which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["A pedestrian in a white short-sleeved shirt", "Two soldiers holding banners", "A row of soldiers carrying rifles", "A soldier in a blue uniform wearing a black hat"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "li6celmZCJs_0", "video_path": "li6celmZCJs.mp4", "subtitle_path": "li6celmZCJs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 37, "duration": 50.0, "view_count": 23114}, {"video_id": "hhTzKXVTTtQ", "question": "There is a woman wearing glasses and a black coat explaining something indoors. What event occurs after the subtitles appear saying 'not being provided, and the fact that African Americans had to provide their own monies'?", "question_wo_referring_query": "There is a woman wearing glasses and a black coat explaining something indoors. What event occurs after the subtitles appear saying 'not being provided, and the fact that African Americans had to provide their own monies'?", "candidates": ["Drinking water", "Eating", "Leaving", "Introducing a painting"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "hhTzKXVTTtQ_0", "video_path": "hhTzKXVTTtQ.mp4", "subtitle_path": "hhTzKXVTTtQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 116, "duration": 57.98, "view_count": 6262}, {"video_id": "vZm-237vOYc", "question": "On the screen, many people are coming and going on an escalator at a train station. Many people with backpacks are walking up. 
After the subtitles say '[Music]', what is the object that appears on the screen?", "question_wo_referring_query": "What is the object that appears on the screen?", "candidates": ["A TV with 'MR' letters", "Computer", "Milk tea", "Phone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "vZm-237vOYc_0", "video_path": "vZm-237vOYc.mp4", "subtitle_path": "vZm-237vOYc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 407, "duration": 27.99, "view_count": 37999}, {"video_id": "D_ncku6A7iA", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, some broken glass appears on the ground, followed by an empty room, then a person walks into the room. The camera pans to the desk, then moves to the wall, and finally, a man in a black coat appears in the room speaking. The camera then moves to the paper he is holding, followed by focusing back on the man, then a man in a yellow coat is seen standing at the staircase looking outside. He climbs the stairs and reaches outside, where it's all snowy. Finally, the snow scene appears.", "First, a man in a black coat appears in the room speaking. The camera then moves to the paper he is holding, followed by focusing back on the man. Next, some broken glass appears on the ground, followed by an empty room, then a person walks into the room. The camera pans to the desk, then moves to the wall. Finally, a man in a yellow coat is seen standing at the staircase looking outside. He climbs the stairs and reaches outside, where it's all snowy. Finally, the snow scene appears.", "First, a man in a yellow coat is seen standing at the staircase looking outside. He climbs the stairs and reaches outside, where it's all snowy. Then, some broken glass appears on the ground, followed by an empty room. Then a person walks into the room. 
The camera pans to the desk, then moves to the wall. Finally, a man in a black coat appears in the room speaking. The camera then moves to the paper he is holding, followed by focusing back on the man. Finally, the snow scene appears.", "First, the snow scene appears. Then, a man in a black coat appears in the room speaking. The camera then moves to the paper he is holding, followed by focusing back on the man. Next, a man in a yellow coat is seen standing at the staircase looking outside. He climbs the stairs and reaches outside, where it's all snowy. Finally, some broken glass appears on the ground, followed by an empty room. Then a person walks into the room. The camera pans to the desk, then moves to the wall. Finally, a man in a black coat appears in the room speaking. The camera moves to the paper he is holding. Finally, the snow scene appears."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "D_ncku6A7iA_0", "video_path": "D_ncku6A7iA.mp4", "subtitle_path": "D_ncku6A7iA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 641, "duration": 50.01, "view_count": 3320202}, {"video_id": "JSIgldbu5QU", "question": "There are two red background icons and three lines of subtitles in the screen. The three lines of subtitles are \"June 1944\" \"Tinidur\" \"90% 15%\". 
In which other scene do these two icons appear?", "question_wo_referring_query": "In which other scene do these two icons appear?", "candidates": ["In the scene with the number \"66\"", "In the scene with the word \"yes\"", "In the scene with the word \"ok\"", "In the scene with the words \"25-30Hours\""], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "JSIgldbu5QU_0", "video_path": "JSIgldbu5QU.mp4", "subtitle_path": "JSIgldbu5QU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 340, "duration": 17.0, "view_count": 131042}, {"video_id": "Ir5V3p86GMI", "question": "This woman with short hair, wearing glasses and a black suit appeared on screen with the text 'ELAINE KOH, Partner, Indirect Tax, KPMG Singapore'. Which of the following subtitles appeared together with this scene?", "question_wo_referring_query": "Which of the following subtitles appeared together with this scene?", "candidates": ["ettor was discovered in November last", "that are regulatory the 18 types of fees", "again there are some fees charge which time does not really help as new staff", "year during an internal review it says"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "Ir5V3p86GMI_0", "video_path": "Ir5V3p86GMI.mp4", "subtitle_path": "Ir5V3p86GMI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 97, "duration": 52.0, "view_count": 35203}, {"video_id": "NaAq7QxnBiM", "question": "On a road, a man wearing a black hoodie, carrying a backpack, and donning a black cap stands next to a woman dressed in a black suspender and wearing sunglasses. Beside them, there's a driver in a yellow outfit driving a white car. 
When they are next to a wooden house, what change occurs between the two people?", "question_wo_referring_query": ", what change occurs between the two people?", "candidates": ["The man slings the backpack over one shoulder and the woman places the sunglasses on her head.", "The woman changes into long sleeves.", "The man changes into a long-sleeved shirt.", "The man puts the sunglasses on his head."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "NaAq7QxnBiM_0", "video_path": "NaAq7QxnBiM.mp4", "subtitle_path": "NaAq7QxnBiM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 105, "duration": 36.0, "view_count": 17949}, {"video_id": "P-trSZfdPUw", "question": "On a dark brown desktop, there are a few objects inside a white pot. In the top left corner of the screen, there's a label that says 'BUTTER 6 TBSP'. When 'Music' is mentioned, what changes occur to these objects?", "question_wo_referring_query": "What changes occur to these objects?", "candidates": ["These objects melt", "These objects are coated with color", "A few more objects are added", "These objects are thrown away"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "P-trSZfdPUw_0", "video_path": "P-trSZfdPUw.mp4", "subtitle_path": "P-trSZfdPUw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 92, "duration": 36.0, "view_count": 1395552}, {"video_id": "q11c29qdmJ0", "question": "How many houses are there on the screen? On the left side, there is a gray house, on the right side, there is a blue house, and in front are columns. 
What happened in the middle of the screen?", "question_wo_referring_query": "What happened in the middle of the screen?", "candidates": ["Heavy Rain", "Avalanche", "Explosion", "Earthquake"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "q11c29qdmJ0_0", "video_path": "q11c29qdmJ0.mp4", "subtitle_path": "q11c29qdmJ0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 215, "duration": 56.0, "view_count": 1318056}, {"video_id": "P1Si_nR48fo", "question": "There is a man wearing a yellow short-sleeved shirt, black pants, and white shoes sitting on a white bench, and a man wearing a blue shirt and gray pants crouching on the railing. What objects appear on the screen?", "question_wo_referring_query": "There is a man wearing a yellow short-sleeved shirt, black pants, and white shoes sitting on a white bench, and a man wearing a blue shirt and gray pants crouching on the railing. What objects appear on the screen?", "candidates": ["Computer", "Headset", "Orange life buoy", "Sun hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "P1Si_nR48fo_0", "video_path": "P1Si_nR48fo.mp4", "subtitle_path": "P1Si_nR48fo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 220, "duration": 58.02, "view_count": 860152}, {"video_id": "vbXHxRx2nWE", "question": "In the kitchen, there is a man wearing a black long-sleeve shirt and a gray apron, and glasses. His right hand is moving, and in front of him, there is a blue small pot. Behind him, there are also appliances like a microwave and an oven. On the table, there are various seasoning bottles and bowls. 
When 'sugar and cardamom' are mentioned, which item does not appear on the screen?", "question_wo_referring_query": "Which item does not appear on the screen?", "candidates": ["Oven", "Microwave", "Refrigerator", "Oil bottle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "vbXHxRx2nWE_0", "video_path": "vbXHxRx2nWE.mp4", "subtitle_path": "vbXHxRx2nWE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 69, "duration": 30.99, "view_count": 142245}, {"video_id": "w1wQJt_YUnM", "question": "In front of a green background, there are two ladies standing. One lady, with long blonde hair, is wearing a long-sleeved top. The other lady, with black curly hair, is also wearing a long-sleeved top, smiling and speaking to the camera. The blonde lady is watching the lady with black hair. What color clothes are these two ladies wearing?", "question_wo_referring_query": "What color clothes are these two ladies wearing?", "candidates": ["The blonde lady is wearing a white top, and the lady with black curly hair is wearing a blue top.", "The blonde lady is wearing a blue top, and the lady with black curly hair is wearing a pink top.", "The blonde lady is wearing a white top, and the lady with black curly hair is wearing a black top.", "The blonde lady is wearing a black long-sleeved top, and the lady with black curly hair is wearing a white long-sleeved top."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "w1wQJt_YUnM_0", "video_path": "w1wQJt_YUnM.mp4", "subtitle_path": "w1wQJt_YUnM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 3, "duration": 51.01, "view_count": 48270}, {"video_id": "lZj-TZrpF9Q", "question": "In the video, a black puppy is lying on a white floor, and the puppy is wearing a collar. There are two gray chairs and a white cabinet surrounding it. 
When 'and so we took him off the streets and' is mentioned, what color is the puppy's collar?", "question_wo_referring_query": "What color is the puppy's collar?", "candidates": ["White", "Blue", "Red", "Black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "lZj-TZrpF9Q_0", "video_path": "lZj-TZrpF9Q.mp4", "subtitle_path": "lZj-TZrpF9Q_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 324, "duration": 26.0, "view_count": 4476}, {"video_id": "pbvonhOHTZg", "question": "In the scene, there are a few cars parked in front of the building, along with some trees, in a raised concrete skateboard park. Who completed the skateboard jump?", "question_wo_referring_query": "Who completed the skateboard jump?", "candidates": ["A man wearing a blue jacket and black pants", "A woman wearing a gray short-sleeved shirt, black pants, blue shoes, and a red helmet", "A man wearing a white short-sleeved shirt and black pants", "A man wearing a white jacket and black pants"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "pbvonhOHTZg_0", "video_path": "pbvonhOHTZg.mp4", "subtitle_path": "pbvonhOHTZg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 197, "duration": 34.05, "view_count": 20529}, {"video_id": "zfv8EEZI7B0", "question": "On the road at night, a man is taking photos with a mobile phone. He is wearing a white short-sleeve shirt and a white and red hat. 
What action does he make while sitting in the car?", "question_wo_referring_query": "What action does he make while sitting in the car?", "candidates": ["He makes a 'yeah' gesture with both hands on his head.", "He puts his hands together in a prayer position.", "He clenches his fists and places them on his chest.", "He places his left hand next to his face, making a 'six' gesture, and sticks out his tongue."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "zfv8EEZI7B0_0", "video_path": "zfv8EEZI7B0.mp4", "subtitle_path": "zfv8EEZI7B0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 277, "duration": 23.02, "view_count": 3085}, {"video_id": "t7NE7apn-PA", "question": "The video explains four different concepts: P\u2192Q, not P\u2192not Q, Q\u2192P, and not Q\u2192not P. Which concept is mentioned first?", "question_wo_referring_query": "Which concept is mentioned first?", "candidates": ["Q\u2192P", "not P\u2192not Q", "P\u2192Q", "not Q\u2192not P"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "t7NE7apn-PA_0", "video_path": "t7NE7apn-PA.mp4", "subtitle_path": "t7NE7apn-PA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 273, "duration": 47.0, "view_count": 2642920}, {"video_id": "Oo-H110UEr4", "question": "After mentioning 'alternative to the US and I think that's', what did the man in the black suit with black hair and the man in the black suit with white hair, who are shaking hands, do? 
Behind them stand three other people, and there is a red and white striped flag in the background.", "question_wo_referring_query": "What did the two men do?", "candidates": ["They stood on a red carpet, waving at the screen.", "They stood on a red carpet shaking hands.", "They placed their clenched fists together on their chests.", "They clasped their hands together in a prayer gesture."], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "Oo-H110UEr4_0", "video_path": "Oo-H110UEr4.mp4", "subtitle_path": "Oo-H110UEr4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 296, "duration": 46.0, "view_count": 34244}, {"video_id": "96OMuA65goo", "question": "In the video, there are three people trapped in yellow beer. One of them is a man wearing a brown hat. After the subtitle 'safety out of their flooded basement' appears, which characters are shown?", "question_wo_referring_query": "Which characters are shown after the subtitle 'safety out of their flooded basement' appears?", "candidates": ["A man wearing a white shirt and a light brown jacket", "A man wearing a brown coat and dark brown pants, and a man wearing a white shirt and a light brown jacket", "A man wearing a brown coat and dark brown pants", "A man wearing a white shirt and a black jacket"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "96OMuA65goo_0", "video_path": "96OMuA65goo.mp4", "subtitle_path": "96OMuA65goo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 186, "duration": 56.0, "view_count": 692334}, {"video_id": "7FJYeRkAO4U", "question": "When the man dressed in black and blue and the person dressed in light gray suit have a conversation in the video, what do these two men do?", "question_wo_referring_query": "What do these two men do?", "candidates": ["hug", "fist bump", "kiss", "shake hands"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": 
"S2E", "level": "IntraMoment", "id": "7FJYeRkAO4U_0", "video_path": "7FJYeRkAO4U.mp4", "subtitle_path": "7FJYeRkAO4U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 97, "duration": 25.99, "view_count": 1147895}, {"video_id": "rVElU3QJrZk", "question": "When a woman wearing a hat, with long hair and earrings, is speaking in the video, which object appears together with the woman?", "question_wo_referring_query": "Which object appears together with the woman?", "candidates": ["A potted plant with flowers", "A lakeside", "A rectangular mirror", "A child's bicycle"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "rVElU3QJrZk_0", "video_path": "rVElU3QJrZk.mp4", "subtitle_path": "rVElU3QJrZk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 16, "duration": 50.0, "view_count": 44152}, {"video_id": "RPmoB0qUppM", "question": "In the video, when the man with the hat and a woman mention 'yeah and where's the bus we're probably', what color is the woman's hair that appears in the video?", "question_wo_referring_query": "In the video, when the man with the hat and a woman mention 'yeah and where's the bus we're probably', what color is the woman's hair that appears in the video?", "candidates": ["blonde", "brown", "purple", "black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "RPmoB0qUppM_0", "video_path": "RPmoB0qUppM.mp4", "subtitle_path": "RPmoB0qUppM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 651, "duration": 44.0, "view_count": 258378}, {"video_id": "9DAaYp91oVY", "question": "In the video, there is a room with a door, a computer playing a landscape on the screen, and a white keyboard. 
What kind of clothes is the male protagonist wearing?", "question_wo_referring_query": "What kind of clothes is the male protagonist wearing in the video?", "candidates": ["Plain white short-sleeve", "Plain black short-sleeve", "Olive-colored short-sleeve", "Floral-patterned short-sleeve"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "9DAaYp91oVY_0", "video_path": "9DAaYp91oVY.mp4", "subtitle_path": "9DAaYp91oVY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 385, "duration": 41.0, "view_count": 39118}, {"video_id": "q_ynbdK__LA", "question": "When the male protagonist in the video, dressed in a mascot costume, mentions 'okay, here you once took one picture,' what color is the building in the background?", "question_wo_referring_query": "What color is the building in the background?", "candidates": ["Brick red roof", "Overall white with black windows", "Overall white with olive windows", "Overall black with white windows"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "q_ynbdK__LA_0", "video_path": "q_ynbdK__LA.mp4", "subtitle_path": "q_ynbdK__LA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 321, "duration": 50.97, "view_count": 849549}, {"video_id": "cGHK-I6OlG0", "question": "In front of a glass wall, there is a host wearing a white shirt and a blue vest sitting. 
When he says, 'The GS63 model offers two different screen size options,' how many people walk past behind the glass wall?", "question_wo_referring_query": "How many people walk past behind the glass wall?", "candidates": ["2", "4", "1", "3"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "cGHK-I6OlG0_0", "video_path": "cGHK-I6OlG0.mp4", "subtitle_path": "cGHK-I6OlG0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 39, "duration": 43.01, "view_count": 15729}, {"video_id": "h4sWFJFge54", "question": "A green building was shown in the video, with red brick buildings on either side. After showing a picture of the Benin City National Museum, which location\u2019s picture is shown next in the video?", "question_wo_referring_query": "Which location\u2019s picture is shown next in the video?", "candidates": ["Badagry slave museum", "Nat. War museum", "Apapa", "Millennium park, Abuja"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "h4sWFJFge54_0", "video_path": "h4sWFJFge54.mp4", "subtitle_path": "h4sWFJFge54_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 213, "duration": 26.03, "view_count": 1658141}, {"video_id": "nVbLAlUAKKE", "question": "In a dark room, a man wearing a black T-shirt is leaning against a computer desk. What is the order of countries he mentioned in the narration?", "question_wo_referring_query": "In a dark room, a man wearing a black T-shirt is leaning against a computer desk. 
What is the order of countries he mentioned in the narration?", "candidates": ["Hawaii, Vietnam, Cambodia", "Bali, Cambodia, Vietnam", "Cambodia, Bali, Vietnam", "Vietnam, Cambodia, Bali"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "nVbLAlUAKKE_0", "video_path": "nVbLAlUAKKE.mp4", "subtitle_path": "nVbLAlUAKKE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 230, "duration": 49.97, "view_count": 141402}, {"video_id": "efwUrDNOHTU", "question": "When the video screen no longer shows the female protagonist and the subtitle 'oh my gosh' appears, which character appeared before?", "question_wo_referring_query": "Which character appeared before?", "candidates": ["A boy without glasses wearing a black shirt appeared", "A girl holding a coffee cup appeared", "A boy wearing black-framed glasses, a black mask, and a gray shirt appeared", "A boy in a white shirt appeared"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "efwUrDNOHTU_0", "video_path": "efwUrDNOHTU.mp4", "subtitle_path": "efwUrDNOHTU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 119, "duration": 22.0, "view_count": 280795}, {"video_id": "GtHtKSvUl9A", "question": "After the explanation mentions 'Naturally, the jumper wearing the Reebok Pumps survives', two people appear standing on the bridge facing each other. What are the people who appear in the scene wearing?", "question_wo_referring_query": "Two people appear standing on the bridge facing each other. 
What are the people in the scene wearing?", "candidates": ["Two people, one wearing a white jacket and white pants, and the other wearing a yellow jacket and yellow pants", "Two people, one wearing a black jacket and white pants, and the other wearing a gray jacket and purple pants", "A person wearing a black coat and yellow pants", "Two people, one wearing a red jacket and blue pants, and the other wearing a purple jacket and red pants"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "GtHtKSvUl9A_0", "video_path": "GtHtKSvUl9A.mp4", "subtitle_path": "GtHtKSvUl9A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 445, "duration": 55.99, "view_count": 118025}, {"video_id": "5ODJjP8bC9s", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["First, a part of the Earth's surface was shown, then a gray map, followed by the whole Earth, and finally a meteorite.", "First, a part of the Earth's surface was shown, then a gray map, followed by a meteorite, and finally the whole Earth.", "First, a gray map was shown, then a part of the Earth's surface, followed by a picture of a meteorite hitting the Earth.", "First, a part of the Earth's surface was shown, then a gray map, and lastly the whole Earth."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "5ODJjP8bC9s_0", "video_path": "5ODJjP8bC9s.mp4", "subtitle_path": "5ODJjP8bC9s_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 17, "duration": 47.01, "view_count": 2253}, {"video_id": "2b_sktnTsdA", "question": "When the male protagonist of the video opens a box containing a blue and a red box, and takes out a drawer with white paper, what is the male protagonist doing?", "question_wo_referring_query": "What is the male protagonist doing?", "candidates": 
["Taking out the blue box", "Taking out the red box", "Putting something inside", "Taking out the white paper"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "2b_sktnTsdA_0", "video_path": "2b_sktnTsdA.mp4", "subtitle_path": "2b_sktnTsdA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 477, "duration": 34.99, "view_count": 531003}, {"video_id": "UXCDZXSysYY", "question": "In the video, in front of the black door and tan wall, there is a scene with a boy wearing a black short sleeve shirt sitting. What object appeared on the screen?", "question_wo_referring_query": "Which object appeared on the screen?", "candidates": ["cell phone", "school bag", "book", "computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "UXCDZXSysYY_0", "video_path": "UXCDZXSysYY.mp4", "subtitle_path": "UXCDZXSysYY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 701, "duration": 20.02, "view_count": 93726}, {"video_id": "utwG6qa5s3Q", "question": "When the video references \u201calso I got myself a giant bowl of grapes! If there's one thing I always\u201d, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["A large bowl of grapes", "A large bowl of apples", "A large bowl of cherries", "A bouquet of flowers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "utwG6qa5s3Q_0", "video_path": "utwG6qa5s3Q.mp4", "subtitle_path": "utwG6qa5s3Q_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 366, "duration": 32.0, "view_count": 537944}, {"video_id": "mocm-sXBJSg", "question": "In the video frame, there is only a shirtless man, two women wearing white hats off-stage, a person in black clothing, and hands. 
What color are the scars on the shirtless man's body?", "question_wo_referring_query": "What color are the scars on the shirtless man's body?", "candidates": ["Two red and one black", "One red and two black", "Three red", "Two black"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "mocm-sXBJSg_0", "video_path": "mocm-sXBJSg.mp4", "subtitle_path": "mocm-sXBJSg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 84, "duration": 32.0, "view_count": 931393}, {"video_id": "ZdD_e7y6eOU", "question": "In the video, a woman with some gray hair who is wearing a brown hairband is applying something to her eyelids. What is she applying to her eyes?", "question_wo_referring_query": "What is this woman applying to her eyes?", "candidates": ["Using a small pointed stick to scoop out a white paste from the bottle", "Using a small pointed stick to scoop out a black paste from the bottle", "Using a small pointed stick to scoop out a red paste from the bottle", "Using her fingers to scoop out a black paste from the bottle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "ZdD_e7y6eOU_0", "video_path": "ZdD_e7y6eOU.mp4", "subtitle_path": "ZdD_e7y6eOU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 184, "duration": 57.99, "view_count": 788515}, {"video_id": "MkoHNJtdtnk", "question": "When the man with black, curly hair and a small black goatee first appears in the video, what is he doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Saluting", "Holding a sword in one hand", "Pointing forward", "Holding a sword"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "MkoHNJtdtnk_0", "video_path": "MkoHNJtdtnk.mp4", "subtitle_path": "MkoHNJtdtnk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 453, "duration": 25.0, "view_count": 465240}, 
{"video_id": "h3TXAKG5eFQ", "question": "In the video, on a table with a square cloth and a bowl for serving food, what happens when the subtitle 'it' appears?", "question_wo_referring_query": "What happens in the video?", "candidates": ["Wipe the bowl", "Prepare a new bowl", "Pour green liquid into a prepared bowl", "Prepare food decorations for the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "h3TXAKG5eFQ_0", "video_path": "h3TXAKG5eFQ.mp4", "subtitle_path": "h3TXAKG5eFQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 193, "duration": 20.98, "view_count": 63291}, {"video_id": "TVPNo6hUerI", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the news broadcast format with picture-in-picture video, then switch to full-screen video recording of life.", "Full-screen video recording of life.", "First show photos, then switch to full-screen video recording of life.", "First switch to full-screen video recording of life, then news broadcast format with picture-in-picture video."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "TVPNo6hUerI_0", "video_path": "TVPNo6hUerI.mp4", "subtitle_path": "TVPNo6hUerI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 3, "duration": 56.02, "view_count": 3462}, {"video_id": "fAGKicTlXgg", "question": "In the video, there are two finely carved pillars standing under the golden light. In front of the pillars, there are two statues of a man and a woman. The woman's hand is raised diagonally upwards. 
Which subtitle appears at the same time as this statue?", "question_wo_referring_query": "Which subtitle appears at the same time as this statue?", "candidates": ["realistic that you can imagine it", "up to notifications at future projects", "art please leave a comment below", "and if you go to my website you can sign"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "fAGKicTlXgg_0", "video_path": "fAGKicTlXgg.mp4", "subtitle_path": "fAGKicTlXgg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 157, "duration": 25.99, "view_count": 167059}, {"video_id": "xdtgn6p7-6g", "question": "At the beginning of the video, two eggs were placed into a large bowl. What change occurred in this bowl when 'Milk 4 tsp' was mentioned?", "question_wo_referring_query": "What change occurred in the bowl?", "candidates": ["Milk was added to the bowl", "Another egg was added to the bowl", "Crushed biscuits were added to the bowl", "Water was added to the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "xdtgn6p7-6g_0", "video_path": "xdtgn6p7-6g.mp4", "subtitle_path": "xdtgn6p7-6g_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 26, "duration": 47.98, "view_count": 222}, {"video_id": "yM6QQvkpJtY", "question": "In the video, a man wearing a black short-sleeved shirt, with a small beard, holding a picture of an island appears. 
What objects are shown in the video?", "question_wo_referring_query": ", what objects are shown in the video?", "candidates": ["Two flags: one with a blue background, a white star in the top left corner, and four red stars on the right; the other with a blue background, a white star in the top left corner, and a circle formed by small white stars on the right", "Two flags: one with a blue background, a white star in the top left corner, and five red stars on the right; the other with a blue background, a white star in the top left corner, and a circle formed by small white stars on the right", "Two flags: one with a blue background, a white star in the top left corner, and four white stars on the right; the other with a blue background, a white star in the top left corner, and a circle formed by small white stars on the right", "Two flags: one with a blue background, a white star in the top left corner, and four red stars on the right; the other with a blue background, a white star in the top left corner, and five red stars forming a circle on the right"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "yM6QQvkpJtY_0", "video_path": "yM6QQvkpJtY.mp4", "subtitle_path": "yM6QQvkpJtY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 488, "duration": 31.99, "view_count": 333043}, {"video_id": "2LGcPof0-QM", "question": "In the video, when 'that I continued on with it' is mentioned in an orange-toned room, which object appears in that room?", "question_wo_referring_query": "Which object appears in that room?", "candidates": ["A window with olive curtains", "A sculpture", "A window with white curtains", "A girl with a red scarf"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "2LGcPof0-QM_0", "video_path": "2LGcPof0-QM.mp4", "subtitle_path": "2LGcPof0-QM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 274, "duration": 51.0, 
"view_count": 48839}, {"video_id": "ipTPgOBNDWg", "question": "In the video, with a blue sky background, there is a bird perched on a feeding device placed on a tree-covered hillside in the distance. What color is this bird?", "question_wo_referring_query": "What color is this bird?", "candidates": ["The head is black, wings are white, and the belly is white", "The head is white, wings are black, and the belly is white", "The head is black, wings are black, and the belly is white", "The head is black, wings are black, and the belly is also black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "ipTPgOBNDWg_0", "video_path": "ipTPgOBNDWg.mp4", "subtitle_path": "ipTPgOBNDWg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 34, "duration": 48.02, "view_count": 373533}, {"video_id": "zvBiIvKUiOQ", "question": "When the video mentions 'this push for realism was necessary because they were to be observed through scopes with magnifying lenses', what kind of model are the figures in the video?", "question_wo_referring_query": "What kind of model are the figures in the video?", "candidates": ["Wearing a green helmet, with a small mustache", "Skin color black, wearing a green hat", "Wearing a white helmet, with glasses", "Wearing a green hat"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "zvBiIvKUiOQ_0", "video_path": "zvBiIvKUiOQ.mp4", "subtitle_path": "zvBiIvKUiOQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 140, "duration": 43.0, "view_count": 23483217}, {"video_id": "JFUqnTxfhRI", "question": "When the woman with black hair, wearing a black short-sleeve shirt and a colorful bracelet on her hand, first appears in the video, what is she doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Both hands resting on the table", "One hand raised, talking while smiling", "Holding a skateboard while 
walking on the street", "Both hands raised, talking with the main character in the video"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "JFUqnTxfhRI_0", "video_path": "JFUqnTxfhRI.mp4", "subtitle_path": "JFUqnTxfhRI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 42, "duration": 48.02, "view_count": 4518}, {"video_id": "wrINepmJC7c", "question": "What did the man wearing a white short-sleeved shirt with a pattern on the front and black-rimmed glasses do when he mentioned 'to incentivize citizens to get off' in the video?", "question_wo_referring_query": "What did he do?", "candidates": ["One hand holding a cup, the other hand half-raised", "One hand on the table, the other hand holding a cup", "Both hands holding a cup", "Both hands spread open"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "wrINepmJC7c_0", "video_path": "wrINepmJC7c.mp4", "subtitle_path": "wrINepmJC7c_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 283, "duration": 33.99, "view_count": 132481}, {"video_id": "GVP8qLK_8Ig", "question": "In the video, there is a woman with blue hair wearing a necklace. 
After she lifts her hand off her shoulder and spreads it out, what happens next in the video?", "question_wo_referring_query": ", what happens next in the video?", "candidates": ["The woman with pink hair waves her hand and smiles for a greeting.", "The woman with blue hair inquires the little girl.", "A woman with pink hair shakes hands with the woman with blue hair.", "The little girl grabs the hand of the woman with blue hair."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "GVP8qLK_8Ig_0", "video_path": "GVP8qLK_8Ig.mp4", "subtitle_path": "GVP8qLK_8Ig_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 10, "duration": 24.03, "view_count": 13984159}, {"video_id": "1quhjc6PI70", "question": "Originally, there was a tank on the screen in the video. After mentioning 'tank German tanks by the later years of,' what changes occurred on the screen?", "question_wo_referring_query": "What changes occurred on the screen?", "candidates": ["Five tanks appeared on the right side", "One tank appeared on the left side", "Four tanks appeared on the right side", "Five tanks appeared on the left side"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "1quhjc6PI70_0", "video_path": "1quhjc6PI70.mp4", "subtitle_path": "1quhjc6PI70_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 129, "duration": 57.0, "view_count": 1517640}, {"video_id": "2zmEYSjpVrs", "question": "After the mention of 'mobility hubs for drones where they have' in the video with blue sky and dark place as the background, what appeared in the video?", "question_wo_referring_query": "What appeared in the video?", "candidates": ["Drones and two people sitting in front of a control panel", "Control panel", "Two people, drones, control panel, and tank", "Drones"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": 
"2zmEYSjpVrs_0", "video_path": "2zmEYSjpVrs.mp4", "subtitle_path": "2zmEYSjpVrs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 123, "duration": 37.0, "view_count": 224455}, {"video_id": "LVBm0NTPpMo", "question": "At the beginning of the video, the female lead is sitting on the bed in a room, wearing a light green short sleeve shirt, and holding a cup with both hands. After watching the Halloween puppet at the end of the video, what change happened to the female lead?", "question_wo_referring_query": "What change happened to the female lead?", "candidates": ["Stood up", "Changed into a red shirt", "No longer holding a cup", "Tied her hair up"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "LVBm0NTPpMo_0", "video_path": "LVBm0NTPpMo.mp4", "subtitle_path": "LVBm0NTPpMo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 190, "duration": 44.0, "view_count": 6326}, {"video_id": "ioIWgvvfcAU", "question": "In the video, when all the ingredients for making tempeh tacos are prepared and placed on the plate, which ingredient is missing?", "question_wo_referring_query": "Which ingredient is missing?", "candidates": ["green onion", "salsa", "cilantro", "carrot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "ioIWgvvfcAU_0", "video_path": "ioIWgvvfcAU.mp4", "subtitle_path": "ioIWgvvfcAU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 154, "duration": 33.99, "view_count": 92339}, {"video_id": "BlqLakYXf-o", "question": "In the video, when there is only one man with white hair wearing a black coat over a white shirt, is this man wearing glasses?", "question_wo_referring_query": "Is this man wearing glasses?", "candidates": ["Not wearing glasses", "Wearing a pair of black-framed glasses", "Wearing a pair of brick-red framed glasses", "Wearing a pair of brown glasses"], "topic_category": 
"NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "BlqLakYXf-o_0", "video_path": "BlqLakYXf-o.mp4", "subtitle_path": "BlqLakYXf-o_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 264, "duration": 24.0, "view_count": 2601}, {"video_id": "SWooyCShVn8", "question": "What color is the picture when mentioned 'check the link below down for either it' with an overall green background at the beginning of the video?", "question_wo_referring_query": "What color is the picture?", "candidates": ["Overall dark purple, with an island in the middle", "Overall green", "Overall orange, with a sun in the upper right corner", "Overall sky blue"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "SWooyCShVn8_0", "video_path": "SWooyCShVn8.mp4", "subtitle_path": "SWooyCShVn8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 174, "duration": 38.01, "view_count": 2028}, {"video_id": "St7PDhuaiA0", "question": "When the video mentions 'I mean I don't consider myself to be' against the tan curtain background, who in the video said this line?", "question_wo_referring_query": "Who in the video said this line?", "candidates": ["The woman with both hands crossed under her chin", "The woman with one hand raised and the other hand on the table", "The woman with one hand touching her ear and the other hand under her chin", "The woman with both hands extended outward, with a slight pout"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "St7PDhuaiA0_0", "video_path": "St7PDhuaiA0.mp4", "subtitle_path": "St7PDhuaiA0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 93, "duration": 35.0, "view_count": 23616}, {"video_id": "8sHeQIH_xd4", "question": "When the phrase 'fully digital business model has helped' is mentioned on the screen of the laptop being viewed in the video, what change occurs on the webpage in 
the video?", "question_wo_referring_query": "What change occurs on the webpage in the video?", "candidates": ["Scroll the webpage up from the 'Factory to Home' Zip Blinds E-Store page", "Close the webpage with the 'Factory to Home' Zip Blinds E-Store", "Open a new webpage with the 'Factory to Home' Zip Blinds E-Store", "Scroll the webpage down from the 'Factory to Home' Zip Blinds E-Store page"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "8sHeQIH_xd4_0", "video_path": "8sHeQIH_xd4.mp4", "subtitle_path": "8sHeQIH_xd4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 211, "duration": 41.0, "view_count": 18133}, {"video_id": "zS4AP0Q8L8g", "question": "In the video, a total of nine boats are shown on the water, with three of them speeding along. Afterward, what changes occurred with these boats?", "question_wo_referring_query": ", what changes occurred with these boats?", "candidates": ["On the left side of the video, two boats remain stationary, while two boats are moving on the right side of the screen.", "On the left side of the video, three boats remain stationary, while two boats are moving on the right side of the screen.", "On the left side of the video, one boat remains stationary, while two boats are moving on the right side of the screen.", "Nine boats turned into six boats."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "zS4AP0Q8L8g_0", "video_path": "zS4AP0Q8L8g.mp4", "subtitle_path": "zS4AP0Q8L8g_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 256, "duration": 44.0, "view_count": 2821021}, {"video_id": "n5bpdO3knHY", "question": "After the screen shows the bustling streets and mentions 'If you\u2019re going with somebody getting', what changes occur on the screen?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["The screen transitions to a store with a 'PAT'S STEAKS' 
sign, providing white seating occupied by customers", "The screen transitions to a store with a 'PAT'S STEAKS' sign, providing red seating occupied by customers", "The screen transitions to a store with a 'PAT'S STEAKS' sign, providing blue seating occupied by customers", "The screen transitions to a store with a 'PAT'S STEAKS' sign, providing black seating occupied by customers"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "n5bpdO3knHY_0", "video_path": "n5bpdO3knHY.mp4", "subtitle_path": "n5bpdO3knHY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 187, "duration": 20.99, "view_count": 7862}, {"video_id": "DbnHLQtjN4g", "question": "After the phrase 'there has been incredibly beneficial to who I am' is mentioned in the video, what animal appears on the screen?", "question_wo_referring_query": ", what animal appears on the screen?", "candidates": ["A dark gray rabbit with long whiskers", "A gray cat", "A white rabbit", "A colorful rooster"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "DbnHLQtjN4g_0", "video_path": "DbnHLQtjN4g.mp4", "subtitle_path": "DbnHLQtjN4g_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 261, "duration": 58.0, "view_count": 333002}, {"video_id": "IQlQzUmthww", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["In the first scene, there are 8 soldiers holding green shields, 5 soldiers on the left holding long guns, and the soldier on the far right holding a paper with 'Let's stay'. The second scene has a soldier on the left wearing armor and a helmet, holding a blue shield in the left hand and a paper with 'You can stay' in the right hand. On the right, there is another soldier holding a green shield and wearing a helmet. 
The third scene is set in a room with a burning furnace, and on the right side of the screen stands a modern cartoon character wearing a blue duckbill hat and a blue shirt. The fourth scene features in front of a castle, where there are 7 soldiers holding green shields with the left hand, one holding a sword, and another holding a long gun. All wear helmets.", "In the first scene, there are 2 soldiers holding green shields, 5 soldiers on the left holding long guns, and the soldier on the far right holding a paper with 'Let's stay'. The second scene has a soldier on the left wearing armor and a helmet, holding a blue shield in the left hand and a paper with 'You can stay' in the right hand. On the right, there is another soldier holding a green shield and wearing a helmet. The third scene is set in a room with a burning furnace, and on the right side of the screen stands a modern cartoon character wearing a blue duckbill hat and a blue shirt. The fourth scene features in front of a castle, where there are 4 soldiers holding green shields with the left hand, one holding a sword, and another holding a long gun. All wear helmets.", "In the first scene, there are 3 soldiers holding green shields, 2 soldiers on the left holding long guns, and the soldier on the far right holding a paper with 'Let's stay'. The second scene has a soldier on the left wearing armor and a helmet, holding a blue shield in the left hand and a paper with 'You can stay' in the right hand. On the right, there is another soldier holding a green shield and wearing a helmet. The third scene is set in a room with a burning furnace, and on the right side of the screen stands a modern cartoon character wearing a blue duckbill hat and a blue shirt. The fourth scene features in front of a castle, where there are 2 soldiers holding green shields with the left hand, one holding a sword, and another holding a long gun. 
All wear helmets.", "In the first scene, there are 8 soldiers holding green shields, 9 soldiers on the left holding long guns, and the soldier on the far right holding a paper with 'Let's stay'. The second scene has a soldier on the left wearing armor and a helmet, holding a blue shield in the left hand and a paper with 'You can stay' in the right hand. On the right, there is another soldier holding a green shield and wearing a helmet. The third scene is set in a room with a burning furnace, and on the right side of the screen stands a modern cartoon character wearing a blue duckbill hat and a blue shirt. The fourth scene features in front of a castle, where there are 6 soldiers holding green shields with the left hand, one holding a sword, and another holding a long gun. All wear helmets."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "IQlQzUmthww_0", "video_path": "IQlQzUmthww.mp4", "subtitle_path": "IQlQzUmthww_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 79, "duration": 37.0, "view_count": 2747}, {"video_id": "y6W_XUUV78w", "question": "In the video, where has the woman, who is following behind the male lead, wearing a black hat, a white top, and having her hair tied up, appeared?", "question_wo_referring_query": "Where has this woman appeared?", "candidates": ["In a photo with red and white vehicles and a person wearing a pink top in the foreground", "Walking in front of a shop with bright lights and multicolored fabric decorations; in a photo with red and white vehicles and a person wearing a pink top in the foreground", "Walking in front of a shop with bright lights and multicolored fabric decorations", "Walking in front of a shop with bright lights and multicolored fabric decorations; in a photo with red and white vehicles and a person wearing a pink top in the foreground; looking at a menu"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": 
"L2-Relation", "id": "y6W_XUUV78w_0", "video_path": "y6W_XUUV78w.mp4", "subtitle_path": "y6W_XUUV78w_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 536, "duration": 30.0, "view_count": 752748}, {"video_id": "1_GeYVwWkPo", "question": "The male main character who appears in the video wearing a black top and having black hair does not appear together with which of the following subtitles?", "question_wo_referring_query": "The male main character does not appear together with which of the following subtitles?", "candidates": ["Excited I visited Asaka a neighborhood", "China, i'm at a popular sushi restaurant.", "With the Vibes of Tokyo's downtown.", "To talk to some visitors"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "1_GeYVwWkPo_0", "video_path": "1_GeYVwWkPo.mp4", "subtitle_path": "1_GeYVwWkPo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 85, "duration": 22.99, "view_count": 24105}, {"video_id": "lqK3bO35hUQ", "question": "At the beginning of the video, there is a woman walking through the woods wearing a white top, carrying a single-shoulder bag and holding an orange coat. When this woman sits on a bench by the roadside looking at the distant meadow, what change occurs?", "question_wo_referring_query": "At the beginning of the video, there is a woman walking through the woods wearing a white top, carrying a single-shoulder bag and holding an orange coat. 
When this woman sits on a bench by the roadside looking at the distant meadow, what change occurs?", "candidates": ["The woman puts down the single-shoulder bag", "The woman puts on the orange coat", "The woman lifts the single-shoulder bag", "The woman changes into an orange short-sleeved shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "lqK3bO35hUQ_0", "video_path": "lqK3bO35hUQ.mp4", "subtitle_path": "lqK3bO35hUQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1152, "duration": 57.99, "view_count": 246255}, {"video_id": "O9-prB2riJ8", "question": "A man in a black suit with a microphone in front of him is conversing with a smiling man in a white shirt and glasses. What musical instrument is behind the smiling man?", "question_wo_referring_query": "What musical instrument is behind the smiling man?", "candidates": ["Guitar", "Piano", "Violin", "Cello"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "O9-prB2riJ8_0", "video_path": "O9-prB2riJ8.mp4", "subtitle_path": "O9-prB2riJ8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 222, "duration": 23.99, "view_count": 2274}, {"video_id": "a13cXTeCius", "question": "This scene shows a variety of ice creams including red, black, and white ones, as well as an ice cream scooper. 
When the subtitle says 'you've got to get some gelato,' what is on top of the ice cream?", "question_wo_referring_query": "When the subtitle says 'you've got to get some gelato,' what is on top of the ice cream?", "candidates": ["Bowl", "Plate", "Scoop", "Green plant"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "a13cXTeCius_0", "video_path": "a13cXTeCius.mp4", "subtitle_path": "a13cXTeCius_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 894, "duration": 43.0, "view_count": 842137}, {"video_id": "p5qQkpyRrYg", "question": "The woman with long yellow hair, wearing a white top and necklace, after the subtitle 'i'll put a graphic that explains it' appears, what object shows up in the video?", "question_wo_referring_query": "What object appears next in the video?", "candidates": ["Two food pictures", "Three food pictures", "Four food pictures", "One food picture"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "p5qQkpyRrYg_0", "video_path": "p5qQkpyRrYg.mp4", "subtitle_path": "p5qQkpyRrYg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 404, "duration": 48.0, "view_count": 91352}, {"video_id": "VpgphVbvpUE", "question": "A man wearing a dark blue uniform and a black cap had a picture next to him, which showed a group photo in front of a van with five people standing and two kneeling. 
When this picture appeared, what action did the man in the uniform make?", "question_wo_referring_query": ", what action did the man in the uniform make?", "candidates": ["Turned his head", "Made a 'Yeah' gesture with his hand", "Nodded", "Waved his hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "VpgphVbvpUE_0", "video_path": "VpgphVbvpUE.mp4", "subtitle_path": "VpgphVbvpUE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 308, "duration": 46.01, "view_count": 191998}, {"video_id": "S0GH-DEjAMQ", "question": "On a dark night, there are many people standing in the square, each person holding a flame in their hand. Beside them is a large tree adorned with lights. The steps are also crowded with people. What are these people doing?", "question_wo_referring_query": "What are these people doing?", "candidates": ["Singing", "Dancing", "Sitting", "Holding the national flags of various countries"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "S0GH-DEjAMQ_0", "video_path": "S0GH-DEjAMQ.mp4", "subtitle_path": "S0GH-DEjAMQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 268, "duration": 17.02, "view_count": 395}, {"video_id": "LCEGrsVa4cU", "question": "In a blue room, there are four mannequins. A man and a woman with long hair wearing a long-sleeved dress and carrying a red backpack are observing them. 
When the subtitle 'Camp Is a Second Childhood' appears, what does the woman with long hair do?", "question_wo_referring_query": "What does the woman with long hair do?", "candidates": ["Takes a photo", "Hugs the man", "Makes a phone call", "Points at the mannequins"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "LCEGrsVa4cU_0", "video_path": "LCEGrsVa4cU.mp4", "subtitle_path": "LCEGrsVa4cU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 281, "duration": 21.98, "view_count": 279907}, {"video_id": "SUEFjE0QuVs", "question": "A man wearing glasses, dressed in a black suit with a white shirt underneath, is standing on stage giving a presentation. To his left and slightly behind him, there is a woman wearing a black V-neck top with white leaf patterns. She is touching her eyebrows. After she finishes touching her eyebrows, what does she do next?", "question_wo_referring_query": "After she finishes touching her eyebrows, what does she do next?", "candidates": ["Look at her phone", "Blink", "Drink water", "Adjust her clothing"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "SUEFjE0QuVs_0", "video_path": "SUEFjE0QuVs.mp4", "subtitle_path": "SUEFjE0QuVs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 142, "duration": 23.0, "view_count": 6827}, {"video_id": "hcDjsE02noo", "question": "In a car, there is a woman wearing a black coat sitting in the driver's seat, and a woman wearing a green long-sleeved trench coat with a ponytail sitting in the passenger seat. 
Which woman is zoomed in on the screen?", "question_wo_referring_query": "Which woman is zoomed in on the screen?", "candidates": ["The woman in the yellow trench coat", "The woman in the green trench coat", "The woman in the black trench coat", "The woman in the white trench coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "hcDjsE02noo_0", "video_path": "hcDjsE02noo.mp4", "subtitle_path": "hcDjsE02noo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 288, "duration": 25.0, "view_count": 11735}, {"video_id": "7Q_eEAhmda8", "question": "After the subtitle 'It's American painting.' passed, a white room appeared with two paintings hanging, and a man wearing a black top was standing inside. What was the man in the black top doing at that moment?", "question_wo_referring_query": "What was the man in the black top doing at that moment?", "candidates": ["Standing with his hands crossed, looking at the painting", "Touching his ear", "Making a phone call", "Touching his eyes"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "7Q_eEAhmda8_0", "video_path": "7Q_eEAhmda8.mp4", "subtitle_path": "7Q_eEAhmda8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 161, "duration": 34.0, "view_count": 9555}, {"video_id": "KiPC4jFp85U", "question": "A cartoon character facing the left side and wearing blue clothes and a blue hat appears in the frame. 
After the subtitle 'approach proved effective in one-on-one' appears, what image shows up behind this cartoon character?", "question_wo_referring_query": "What image appears behind this cartoon character?", "candidates": ["An image of three soldiers holding shields appears.", "An image of six soldiers holding shields appears.", "An image of one soldier holding a shield appears.", "An image of four soldiers holding shields appears."], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "KiPC4jFp85U_0", "video_path": "KiPC4jFp85U.mp4", "subtitle_path": "KiPC4jFp85U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 180, "duration": 26.0, "view_count": 2261}, {"video_id": "GNrE6MQNeZQ", "question": "First, you see many houses in the distance and many boats on the river. Then, a house with a pink roof appears, and next to it, there is a small bridge. The houses on both sides are lined with trees. Finally, people coming and going and boats being rowed appear. What is the first thing that appears?", "question_wo_referring_query": "First, you see many houses in the distance and many boats on the river. Then, a house with a pink roof appears, and next to it, there is a small bridge. The houses on both sides are lined with trees. Finally, people coming and going and boats being rowed appear. 
What is the first thing that appears?", "candidates": ["First, you see many houses in the distance and many boats on the river", "Palace", "Airplane", "Spectators"], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "GNrE6MQNeZQ_0", "video_path": "GNrE6MQNeZQ.mp4", "subtitle_path": "GNrE6MQNeZQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 372, "duration": 50.02, "view_count": 2242645}, {"video_id": "7t0EssFcho0", "question": "When a red box with white letters CONTINENTAL VOLCANIC ARC appears in the top left of the blue and white colored planet, what changes occur behind this planet?", "question_wo_referring_query": "What changes occur behind this planet?", "candidates": ["The bottom right of the planet becomes whiter", "The planet turns black", "The planet turns purple", "The planet turns red"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "7t0EssFcho0_0", "video_path": "7t0EssFcho0.mp4", "subtitle_path": "7t0EssFcho0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 259, "duration": 45.98, "view_count": 58805}, {"video_id": "bBwEnPBGeeM", "question": "On a vast golden dune, there is a woman dressed in a red top and jeans riding a skateboard. How did the sand under her feet change earlier?", "question_wo_referring_query": "How did the sand under the woman's feet change earlier?", "candidates": ["The sand was blue", "The sand was golden", "The sand had a pearl color", "The sand did not change"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "bBwEnPBGeeM_0", "video_path": "bBwEnPBGeeM.mp4", "subtitle_path": "bBwEnPBGeeM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 330, "duration": 37.0, "view_count": 45990}, {"video_id": "Tq-6zb6spFI", "question": "A woman with a ponytail, wearing Tibetan blue, is driving a car. 
What change happens to the woman when the subtitles appear: 'hi guys so so right now it's actually'?", "question_wo_referring_query": "What change happens to the woman?", "candidates": ["Covering her hair with curls", "Having a yellow hair clip", "Having a red hair clip", "Covering her hair with a scarf"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "Tq-6zb6spFI_0", "video_path": "Tq-6zb6spFI.mp4", "subtitle_path": "Tq-6zb6spFI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 208, "duration": 31.0, "view_count": 35651}, {"video_id": "7cp4GrifzCU", "question": "On a wooden board, there are six pieces of meat and two cans of black and white seasoning. A hand appears in the video; what action does the hand take at this moment?", "question_wo_referring_query": ", what action does the hand take at this moment?", "candidates": ["apply a bit of vinegar on the meat", "sprinkle seasoning on the meat", "cut the meat", "put the meat into the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "7cp4GrifzCU_0", "video_path": "7cp4GrifzCU.mp4", "subtitle_path": "7cp4GrifzCU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 88, "duration": 40.0, "view_count": 61536}, {"video_id": "7sMOtQ9B-9E", "question": "In a room with a window and a white door, a long-haired woman in a pink short-sleeve shirt is sitting on a yellow sofa narrating an event. What other objects are in this room?", "question_wo_referring_query": "In a room with a window and a white door, a long-haired woman in a pink short-sleeve shirt is sitting on a yellow sofa narrating an event. 
What other objects are in this room?", "candidates": ["Table", "Fire extinguisher", "Piano", "Television"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "7sMOtQ9B-9E_0", "video_path": "7sMOtQ9B-9E.mp4", "subtitle_path": "7sMOtQ9B-9E_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 193, "duration": 45.0, "view_count": 345765}, {"video_id": "Swyargqy7I8", "question": "In a row wearing iron armor. Among the soldiers holding military knives and iron shields, and wearing yellow clothes, with several golden pagodas behind them, there is a soldier on the side wearing gray clothes and holding a helmet. What is the color of the additional piece of clothing on this soldier?", "question_wo_referring_query": "In a row wearing iron armor. Among the soldiers holding military knives and iron shields, and wearing yellow clothes, with several golden pagodas behind them, there is a soldier on the side wearing gray clothes and holding a helmet. What is the color of the additional piece of clothing on this soldier?", "candidates": ["yellow", "blue", "green", "red"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "Swyargqy7I8_0", "video_path": "Swyargqy7I8.mp4", "subtitle_path": "Swyargqy7I8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1071, "duration": 46.0, "view_count": 74378}, {"video_id": "pm1x9SaxPcM", "question": "The man standing behind the young Huzi is in front of a window. Outside the window is a tall building. 
When this man mentions 'series across the world to different,' what color clothes is he wearing?", "question_wo_referring_query": "What color clothes is he wearing?", "candidates": ["Purple", "White", "Green", "Yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "pm1x9SaxPcM_0", "video_path": "pm1x9SaxPcM.mp4", "subtitle_path": "pm1x9SaxPcM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1510, "duration": 28.0, "view_count": 510956}, {"video_id": "XKp6ahy2eBk", "question": "What happened the first time the yellow car appeared on the road in the video?", "question_wo_referring_query": "What happened?", "candidates": ["Parking", "Carwash", "Driving on the road", "Painted yellow to purple"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "XKp6ahy2eBk_0", "video_path": "XKp6ahy2eBk.mp4", "subtitle_path": "XKp6ahy2eBk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 454, "duration": 54.99, "view_count": 110929}, {"video_id": "f8Kit1DdtLc", "question": "In the video set in a room with a man sitting in front of a bookshelf filled with books, he mentions, 'then it wouldn't be much of a loss to begin with'. 
What action does the man in the video do?", "question_wo_referring_query": "What action does the man in the video do?", "candidates": ["In the video, the man points towards a golden wheat field.", "A man with curly black hair appears in the video, and he places both hands flat on his knees.", "A man with curly black hair appears in the video, and he crosses his hands together.", "A man with curly black hair appears in the video, wearing a white short-sleeved shirt, and he raises one hand open."], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "f8Kit1DdtLc_0", "video_path": "f8Kit1DdtLc.mp4", "subtitle_path": "f8Kit1DdtLc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 185, "duration": 49.98, "view_count": 211013}, {"video_id": "QY6P_Wn1J5M", "question": "In the video, after the man wearing glasses touched his index finger and thumb together while talking, what did he do next?", "question_wo_referring_query": "What did the man do next?", "candidates": ["The man clasped his hands together", "The man put both of his hands together", "The man raised both of his hands", "The man put both of his hands down"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "QY6P_Wn1J5M_0", "video_path": "QY6P_Wn1J5M.mp4", "subtitle_path": "QY6P_Wn1J5M_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 320, "duration": 48.0, "view_count": 22385}, {"video_id": "O-iHW1dxsyo", "question": "In the animation explaining the sound source weapon, a small person wearing tactical police clothing and a helmet sitting in the driver's seat and another small person wearing ordinary clothes appear. 
Which of these two characters appears first in the animation?", "question_wo_referring_query": "Which of these two characters appears first in the animation?", "candidates": ["The small person sitting in the driver's seat wearing tactical police clothing", "The small person wearing green clothing", "The small person wearing a red top and black pants", "The small person wearing a gray outfit with a black hat"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "O-iHW1dxsyo_0", "video_path": "O-iHW1dxsyo.mp4", "subtitle_path": "O-iHW1dxsyo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 457, "duration": 41.0, "view_count": 1971743}, {"video_id": "HHsmRoHE7K0", "question": "Which characters appear after the explanation that mentions 'saut\u00e9ing onions then we add ground beef'?", "question_wo_referring_query": "Which characters appear?", "candidates": ["A woman making a drink", "A man wearing a black shirt holding a bowl and a woman wearing a red shirt and a hat", "A boy wearing a purple shirt", "An old person walking on the street"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "HHsmRoHE7K0_0", "video_path": "HHsmRoHE7K0.mp4", "subtitle_path": "HHsmRoHE7K0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 25, "duration": 58.98, "view_count": 900970}, {"video_id": "PJFFYjFhPy8", "question": "A bald man wearing a blue jacket, beige pants, and black shoes is in a room with two picture frames and a window. 
In which other scenes does this man appear?", "question_wo_referring_query": "In which other scenes does this man appear?", "candidates": ["In a library", "On the road", "In a hallway with two chairs and a staircase", "In a bookstore"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "PJFFYjFhPy8_0", "video_path": "PJFFYjFhPy8.mp4", "subtitle_path": "PJFFYjFhPy8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 841, "duration": 40.01, "view_count": 115186}, {"video_id": "Hwrj4jn9Tk0", "question": "What changes occurred to the pastries made with flour and placed in a black pot after adding the seasoning and processing further in the video?", "question_wo_referring_query": "What changes occurred after adding the seasoning and processing further in the video?", "candidates": ["Put in the freezer", "Changed color to black", "Baked in the black pot", "Added fruits on top"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "Hwrj4jn9Tk0_0", "video_path": "Hwrj4jn9Tk0.mp4", "subtitle_path": "Hwrj4jn9Tk0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 12, "duration": 49.01, "view_count": 101105}, {"video_id": "qodFOVfwCRE", "question": "What change occurred after the woman wearing a white T-shirt, glasses, and a backpack mentioned 'pei 5 af' at the whiteboard at the beginning of the video?", "question_wo_referring_query": "What change occurred after the woman wearing a white T-shirt, glasses, and a backpack mentioned 'pei 5 af' at the whiteboard at the beginning of the video?", "candidates": ["Raised one hand and smiled slightly", "Tied her hair up", "Hugged someone", "Raised both hands in front of her chest"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "qodFOVfwCRE_0", "video_path": "qodFOVfwCRE.mp4", "subtitle_path": "qodFOVfwCRE_en.json", "duration_group": 
60, "starting_timestamp_for_subtitles": 168, "duration": 38.01, "view_count": 7834}, {"video_id": "QstTmb2OMzo", "question": "At a seaside, there is a group of people, some standing, some sitting, and one person is standing holding a cellphone and filming the distance. What is this person filming at this moment?", "question_wo_referring_query": "What is the person holding the cellphone filming at this moment?", "candidates": ["A volcano erupting in the distance", "The sun", "A ship", "A seagull"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "QstTmb2OMzo_0", "video_path": "QstTmb2OMzo.mp4", "subtitle_path": "QstTmb2OMzo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 915, "duration": 24.99, "view_count": 162277}, {"video_id": "TQjA7SpsnzQ", "question": "In a cozy room, there is a computer and a phone on the table. At this moment, a blonde woman with her hair tied up and a man with black hair wearing a yellow sweater are watching a video in front of the computer screen. What other objects are in this room?", "question_wo_referring_query": "What other objects are in this room?", "candidates": ["a piano", "a drum set", "a pot of flowers", "a violin"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "TQjA7SpsnzQ_0", "video_path": "TQjA7SpsnzQ.mp4", "subtitle_path": "TQjA7SpsnzQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 107, "duration": 31.0, "view_count": 1737}, {"video_id": "PNmb4C_DMfw", "question": "A person wearing white clothes, holding a gray broom with a wooden handle, is sweeping something. What is this person sweeping when the subtitle 'Have a nice day and enjoy watching!' 
appears?", "question_wo_referring_query": "What is this person sweeping?", "candidates": ["yogurt", "chicken meat", "vegetables", "cheese"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "PNmb4C_DMfw_0", "video_path": "PNmb4C_DMfw.mp4", "subtitle_path": "PNmb4C_DMfw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 59, "duration": 41.0, "view_count": 167451}, {"video_id": "9QMdZ_CWx08", "question": "A woman is driving a car, wearing earrings, a headband on her hand, and touching her hair. When the subtitles show 'okay I'm gonna go home okay I'm back in,' what color top is the woman wearing?", "question_wo_referring_query": "What color top is the woman wearing?", "candidates": ["black", "white", "yellow", "red"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "9QMdZ_CWx08_0", "video_path": "9QMdZ_CWx08.mp4", "subtitle_path": "9QMdZ_CWx08_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 367, "duration": 42.0, "view_count": 34407}, {"video_id": "k2FkVDa7cDE", "question": "On a table with a hamburger, there's a woman with long hair wearing a gray long-sleeve sweatshirt with green letters on it, wearing earrings, and with pink nail polish. 
What is she holding in her hand?", "question_wo_referring_query": "What is she holding in her hand?", "candidates": ["fries", "hamburger", "phone", "coffee"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "k2FkVDa7cDE_0", "video_path": "k2FkVDa7cDE.mp4", "subtitle_path": "k2FkVDa7cDE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 512, "duration": 23.99, "view_count": 49987}, {"video_id": "ft-dYaZKxwU", "question": "In a picture featuring the flags of Finland and the Soviet Union, with the text 'The Winter War 1939-1940' in white letters on the side, along with many soldiers and tanks, which country's tanks were the most numerous in this war?", "question_wo_referring_query": "Which country's tanks were the most numerous in this war?", "candidates": ["Finland", "United Kingdom", "Soviet Union", "United States"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "ft-dYaZKxwU_0", "video_path": "ft-dYaZKxwU.mp4", "subtitle_path": "ft-dYaZKxwU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 300, "duration": 24.99, "view_count": 930034}, {"video_id": "uX6e-BC2rWw", "question": "Two men are broadcasting live news. 
When the subtitle shows 'potentially US president if he were to', what action did the man, wearing a black suit with a golden tie and headphones, do?", "question_wo_referring_query": "What action did he do?", "candidates": ["talked", "looked at his phone", "drank water", "blinked"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "uX6e-BC2rWw_0", "video_path": "uX6e-BC2rWw.mp4", "subtitle_path": "uX6e-BC2rWw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 335, "duration": 23.0, "view_count": 129829}, {"video_id": "PLpt9emMkjc", "question": "In a triangle with yellow and red colors and many country maps, a cursor moves to the far left side, containing four black-bordered boxes with red, white, and orange English letters inside. After explaining, what did this person do next?", "question_wo_referring_query": ", what did this person do next?", "candidates": ["Deleted the maps", "Shrunk the screen", "Enlarged the screen", "Typed on the screen"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "PLpt9emMkjc_0", "video_path": "PLpt9emMkjc.mp4", "subtitle_path": "PLpt9emMkjc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 273, "duration": 19.0, "view_count": 19076}, {"video_id": "zfEEgPHxUXk", "question": "In a car, sitting in the driver's seat is a woman with curly hair, wearing sunglasses and a leather jacket. Then, a man wearing a white fur coat with black inside appears. 
Who appears first in this video?", "question_wo_referring_query": "Who appears first in this video?", "candidates": ["The woman in the driver's seat, wearing sunglasses and a leather jacket, who is driving the car", "The child with the ponytail", "The man wearing the white fur coat", "The doctor in the uniform"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "zfEEgPHxUXk_0", "video_path": "zfEEgPHxUXk.mp4", "subtitle_path": "zfEEgPHxUXk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 141, "duration": 17.0, "view_count": 7441}, {"video_id": "8HMBRB6kTB8", "question": "A green helicopter is parked on a grass field. After three men with handheld guns and wearing camouflage uniforms say 'crews and special forces as they were,' what actions do these three men take?", "question_wo_referring_query": "What actions do these three men take?", "candidates": ["Firing their guns", "Shaking hands with each other", "Getting into the helicopter", "Advancing stealthily"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "8HMBRB6kTB8_0", "video_path": "8HMBRB6kTB8.mp4", "subtitle_path": "8HMBRB6kTB8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 208, "duration": 36.0, "view_count": 2287560}, {"video_id": "7vDObIhe48s", "question": "Throughout the entire video, which sequence of scenes is correct?", "question_wo_referring_query": "Which sequence of scenes is correct?", "candidates": ["First a photo of a map is shown, with a woman in a black coat talking to a white-haired man in front, then a woman in a black coat appears, and lastly, the white-haired man in a black suit and tie is shown.", "First appearing is the white-haired man in a black suit and tie with his hands open, followed by a woman with long hair, and lastly a photo of a map.", "First appearing is a woman in a purple coat, followed by a photo of a map, and lastly a shot of 
two people sitting behind a desk.", "First appears a white-haired man wearing a black suit and tie, then a photo of a map is shown, followed by a woman with long hair in a black coat talking to the white-haired man, and lastly, the white-haired man in a black suit and tie appears with his hands open."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "7vDObIhe48s_0", "video_path": "7vDObIhe48s.mp4", "subtitle_path": "7vDObIhe48s_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 121, "duration": 37.0, "view_count": 272701}, {"video_id": "ejGERwLV2Kg", "question": "An oil painting depicting three wine bottles on a yellow board and a transparent bowl with fruits. With which subtitles has this painting appeared together?", "question_wo_referring_query": ", with which subtitles has this oil painting appeared together?", "candidates": ["in the bedroom", "setting for her portrait and still lives", "so her home was always the stage", "the second section of the exhibition is"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "ejGERwLV2Kg_0", "video_path": "ejGERwLV2Kg.mp4", "subtitle_path": "ejGERwLV2Kg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 425, "duration": 37.0, "view_count": 102324}, {"video_id": "zNW4iq0pd2c", "question": "At the beginning of the video, a long-haired woman who is wearing a black coat and a necklace appears. 
What changes occur to her hand later in the video?", "question_wo_referring_query": "What changes occur to her hand later in the video?", "candidates": ["One hand wearing a ring is placed under her chin", "She points to the computer with her hand", "Her hand is placed on her hair", "Her hand is inserted into her pocket"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "zNW4iq0pd2c_0", "video_path": "zNW4iq0pd2c.mp4", "subtitle_path": "zNW4iq0pd2c_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 275, "duration": 33.0, "view_count": 13217}, {"video_id": "ed2HfpyDyh4", "question": "After a squad of soldiers wearing hats fired forward with guns placed on their shoulders, and following the subtitle 'the tactic proved to be very successful on many occasions the most notable moments were the Battle of yad reg and the Tet offensive,' what change occurred among the soldiers?", "question_wo_referring_query": "What change occurred among the soldiers?", "candidates": ["Some soldiers lay on the ground, while others continued to advance with their guns", "All of them lay on the ground", "All of them crouched down", "They continued to advance"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "ed2HfpyDyh4_0", "video_path": "ed2HfpyDyh4.mp4", "subtitle_path": "ed2HfpyDyh4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 196, "duration": 41.0, "view_count": 1026088}, {"video_id": "JUNzoaR43jM", "question": "Outside the room with things scattered around, there is only one man wearing a gray-blue hat, with a beard, wearing glasses, and a gray-blue long-sleeve shirt. 
What is to the left rear of this man when the subtitles mention 'unfortunately it's almost the end of the'?", "question_wo_referring_query": "What is to the left rear of this man?", "candidates": ["animal", "disorderly arranged table and chair", "computer", "mobile phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "JUNzoaR43jM_0", "video_path": "JUNzoaR43jM.mp4", "subtitle_path": "JUNzoaR43jM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 433, "duration": 32.0, "view_count": 685}, {"video_id": "NP2DEh5mPmE", "question": "In the gloomy environment with a dark sky, the winding mountain range appears intermittently. When the subtitle mentions 'caused the composition of the magma to,' what color is this mountain?", "question_wo_referring_query": "What color is this mountain?", "candidates": ["blue", "gray", "red", "purple"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "NP2DEh5mPmE_0", "video_path": "NP2DEh5mPmE.mp4", "subtitle_path": "NP2DEh5mPmE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 160, "duration": 18.99, "view_count": 8140}, {"video_id": "Wdcy6gdsVuU", "question": "After the round beauty food filled with cheeses and the mention of 'Sprinkle with herbs if desired' on the side, what change occurred to the dish?", "question_wo_referring_query": "What change occurred to the dish?", "candidates": ["Added with fruits", "Taken away and put into the refrigerator", "Sprinkled with herbs", "Sprinkled with spices"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "Wdcy6gdsVuU_0", "video_path": "Wdcy6gdsVuU.mp4", "subtitle_path": "Wdcy6gdsVuU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 182, "duration": 18.99, "view_count": 226}, {"video_id": "ttla4Dcrsas", "question": "Which characters appear after the mention of 'And so 
what I did was I went down to the end of the hall'?", "question_wo_referring_query": "Who are the characters that appear?", "candidates": ["A man in a black suit standing in a hall filled with framed pictures", "A host holding a microphone", "An old man in a black and white striped shirt", "A little girl in a purple dress"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "ttla4Dcrsas_0", "video_path": "ttla4Dcrsas.mp4", "subtitle_path": "ttla4Dcrsas_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 59, "duration": 52.99, "view_count": 25771}, {"video_id": "W5jTBxDDzTw", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a pair of hands with blue nail polish stirring food in a glass bowl appears, then a plate with eight chicken legs covered in spices appears, and finally a blue board with four chicken legs on it appears.", "First, a plate with eight chicken legs covered in spices appears, then a blue board with four chicken legs on it, and finally a pair of hands with blue nail polish stirring food in a glass bowl appears.", "First, a pair of hands with blue nail polish stirring food in a glass bowl appears, then a blue board with four chicken legs on it, and finally a plate with eight chicken legs covered in spices appears.", "First, a blue board with four chicken legs on it appears, then a pair of hands with blue nail polish stirring food in a glass bowl, and finally a plate with eight chicken legs covered in spices appears."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "W5jTBxDDzTw_0", "video_path": "W5jTBxDDzTw.mp4", "subtitle_path": "W5jTBxDDzTw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 14, "duration": 30.99, "view_count": 76329}, {"video_id": "sL4JK_bDo0A", "question": "In the video 
opening, three people wearing yellow hats and holding hoes are planting watermelons and carrots. Where else have these three characters appeared?", "question_wo_referring_query": "Where else have these three characters appeared?", "candidates": ["In the picture to the upper left behind the man in the orange T-shirt", "In a book", "On the television", "In the computer"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "sL4JK_bDo0A_0", "video_path": "sL4JK_bDo0A.mp4", "subtitle_path": "sL4JK_bDo0A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 193, "duration": 51.01, "view_count": 2109022}, {"video_id": "AUxJ24PqDws", "question": "At the beginning of the video, a man dressed in a black suit and tie is speaking. Which subtitles does this man appear with?", "question_wo_referring_query": "Which subtitles does this man appear with?", "candidates": ["our Focus since uh since October 7th uh", "Focus today but as", "How wide will it get", "show stopping the"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "AUxJ24PqDws_0", "video_path": "AUxJ24PqDws.mp4", "subtitle_path": "AUxJ24PqDws_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 39, "duration": 18.0, "view_count": 257071}, {"video_id": "tjaldSWX7No", "question": "When the black sand is evenly spread on the wooden board, what change occurs to the properties of the sand in the later part of the video?", "question_wo_referring_query": "What change occurs to the properties of the sand in the later part of the video?", "candidates": ["The sand forms a cylindrical shape.", "The color of the sand changes to green.", "The sand forms a rectangular block with holes.", "The color of the sand changes to yellow."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "tjaldSWX7No_0", "video_path": "tjaldSWX7No.mp4", "subtitle_path": 
"tjaldSWX7No_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 214, "duration": 34.04, "view_count": 14255}, {"video_id": "KBGYkBPMIoM", "question": "What happened to the man wearing a floral-patterned shirt, with a small moustache, and blue contact lenses after he mentioned 'monkey cave espresso since right down'?", "question_wo_referring_query": "What happened to the man?", "candidates": ["He started driving", "He raised his hand with two rings and smiled slightly", "He picked up the coffee and drank", "He stood up and walked"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "KBGYkBPMIoM_0", "video_path": "KBGYkBPMIoM.mp4", "subtitle_path": "KBGYkBPMIoM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 796, "duration": 34.04, "view_count": 253387}, {"video_id": "KOK1TMSyKcM", "question": "Four soldiers are crouching on the grayish-white ground; in the distance, there are two cannons. Behind the cannons, there are green plants and hills. What are the two standing soldiers on the right doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Pushed away the cannon", "Lying down and shooting forward", "Climbed up from the lying down position", "Holding a gun and walking towards the right side of the screen", "Picked up a yellow wooden stick"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "KOK1TMSyKcM_0", "video_path": "KOK1TMSyKcM.mp4", "subtitle_path": "KOK1TMSyKcM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 10, "duration": 10.0, "view_count": 3444159}, {"video_id": "cPi0UyHB95U", "question": "In a room filled with post-it notes and newspapers, a woman with silver-white hair wearing a black floral-patterned kimono and glasses is being interviewed. 
When she talks about 'audience and then ask people to read and', what item is not present in the frame?", "question_wo_referring_query": "What item is not present in the frame?", "candidates": ["The white picture frame on the wall", "The newspapers on the table", "The black bag on the table", "The yellow bag on the table", "The orange post-it notes on the wall"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "cPi0UyHB95U_0", "video_path": "cPi0UyHB95U.mp4", "subtitle_path": "cPi0UyHB95U_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 180, "duration": 12.0, "view_count": 23292}, {"video_id": "1WheVxmQlo0", "question": "When three images appear containing the text 'Susanthika Jayasinghe 200m Dash Silver (2000 Sydney)', a man in a military green short-sleeve shirt is explaining. In the leftmost image with a white background, what type of clothing is the person wearing?", "question_wo_referring_query": "A man in a military green short-sleeve shirt is explaining. In the leftmost image with a white background, what type of clothing is the person wearing?", "candidates": ["Black jacket", "Short-sleeve T-shirt", "Running vest", "Short-sleeve sports jacket", "Black shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "1WheVxmQlo0_0", "video_path": "1WheVxmQlo0.mp4", "subtitle_path": "1WheVxmQlo0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1394, "duration": 11.01, "view_count": 812513}, {"video_id": "-YoRV1_aNoU", "question": "In the room decorated with half white tiles and half glass tiles, with vases and plants placed inside, a woman in a white coat places her hands on a marble table. 
When she mentions 'Tracy is making hi I'm Tracy today', what is her hairstyle like?", "question_wo_referring_query": "What is her hairstyle like?", "candidates": ["Straight hair resting on shoulders", "Bubble ponytail gathered up", "Gathered up bun", "Gathered up explosion head", "Straight ponytail gathered up"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "-YoRV1_aNoU_0", "video_path": "-YoRV1_aNoU.mp4", "subtitle_path": "-YoRV1_aNoU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 53, "duration": 13.97, "view_count": 100376}, {"video_id": "lZHISpCGHS0", "question": "What is the object placed in water and constantly kneaded by hands in the video?", "question_wo_referring_query": "What is the object?", "candidates": ["Noodles", "Clothes", "Rice grains", "Fat", "Dough ball"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "lZHISpCGHS0_0", "video_path": "lZHISpCGHS0.mp4", "subtitle_path": "lZHISpCGHS0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 318, "duration": 12.01, "view_count": 756666}, {"video_id": "f_F-EYUd4lk", "question": "On the map block with a combination of green and blue colors, the red line extends horizontally from the block towards the ocean. 
What changes occur on the El Pilar Fault block above at this time?", "question_wo_referring_query": "What changes occur?", "candidates": ["A red dashed line extends from left to right, and the text 'Northern Range' appears.", "A white dashed line extends from left to right, and the text 'Northern Range' appears.", "There are yellow capital letters passing through from the bottom of the block.", "The text 'Northern Range' appears first, followed by a white dashed line extending from left to right.", "The block is surrounded by a red line."], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "f_F-EYUd4lk_0", "video_path": "f_F-EYUd4lk.mp4", "subtitle_path": "f_F-EYUd4lk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 423, "duration": 11.98, "view_count": 493068}, {"video_id": "r-iYbSqdcmI", "question": "On a dining table covered with a white tablecloth, there are delicious foods, a mobile phone, and a cup filled with water. 
What happened when it was said 'for dinner we had some kind of mashed'?", "question_wo_referring_query": "What happened?", "candidates": ["A hand", "A hand tried to pick up the food from the plate", "A hand tried to pick up the cup of water from the table", "Two people ran past the dining table", "A person wearing black shorts was standing there without moving"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "r-iYbSqdcmI_0", "video_path": "r-iYbSqdcmI.mp4", "subtitle_path": "r-iYbSqdcmI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1144, "duration": 12.0, "view_count": 825209}, {"video_id": "uCvISLxqoqA", "question": "Which of the following scenes appears first?", "question_wo_referring_query": "Which of the following scenes appears first?", "candidates": ["First, two primitive people making tools, then three primitive people standing outside near a building, and finally a gray background with the words 'The Bronze Age'.", "First, a gray background with the words 'The Bronze Age', then three primitive people standing outside near a building, and finally two primitive people making tools.", "First, three primitive people standing outside near a building, then two primitive people making tools, and finally a gray background with the words 'The Bronze Age'.", "First, two primitive people making tools, then a gray background with the words 'The Bronze Age', and finally three primitive people standing outside near a building.", "First, three primitive people standing outside near a building, then a gray background with the words 'The Bronze Age', and finally two primitive people making tools."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "uCvISLxqoqA_0", "video_path": "uCvISLxqoqA.mp4", "subtitle_path": "uCvISLxqoqA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 93, "duration": 10.97, "view_count": 7355}, 
{"video_id": "AmxvdpQ4_S8", "question": "A man wearing a blue T-shirt is walking on a brightly lit street at night. After he says 'everything's closed,' what happens?", "question_wo_referring_query": "What happens next?", "candidates": ["The man makes a phone call.", "The man starts running.", "A woman wearing a black hoodie walks up to the man.", "An old man wearing a black hoodie walks up to the man.", "The man touches his face."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "AmxvdpQ4_S8_0", "video_path": "AmxvdpQ4_S8.mp4", "subtitle_path": "AmxvdpQ4_S8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 201, "duration": 11.0, "view_count": 25799}, {"video_id": "ApfKE7a8Pic", "question": "A long-haired woman is combing her hair by a window with thin curtains, mentions 'last week has been slow slower than I' before, what animal appears?", "question_wo_referring_query": "What animal appears?", "candidates": ["white rabbit", "gray kitten", "black mouse", "white kitten", "gray rabbit"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "ApfKE7a8Pic_0", "video_path": "ApfKE7a8Pic.mp4", "subtitle_path": "ApfKE7a8Pic_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 63, "duration": 13.01, "view_count": 373556}, {"video_id": "uUTVJ1Nt2Z8", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, three photographs of men were shown, followed by two photographs of men, and finally there was a mention of 'the Nobel jury says that they deeply'", "First, there was a mention of 'the Nobel jury says that they deeply', followed by three photographs of men, and finally two photographs of men were shown", "First, three photographs of men were shown, then a mention of 'the Nobel jury says that they deeply', and 
finally two photographs of men were shown", "First, two photographs of men were shown, then a mention of 'the Nobel jury says that they deeply', and finally three photographs of men were shown", "First, two photographs of men were shown, followed by three photographs of men, and finally there was a mention of 'the Nobel jury says that they deeply'"], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "uUTVJ1Nt2Z8_0", "video_path": "uUTVJ1Nt2Z8.mp4", "subtitle_path": "uUTVJ1Nt2Z8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 20, "duration": 8.0, "view_count": 29529}, {"video_id": "spkQnL6TpXk", "question": "There are various cars parked by the roadside, with lush trees beside the road. Below which of the following scenes does the red motorcycle in front of the wooden fence appear?", "question_wo_referring_query": "In which of the following scenes does it appear?", "candidates": ["In the flowing, lush forest", "In front of the fallen glass window of the white building", "Beside the flowing river water", "Beside the rugged road", "On the mountain path walked by tourists"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "spkQnL6TpXk_0", "video_path": "spkQnL6TpXk.mp4", "subtitle_path": "spkQnL6TpXk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 632, "duration": 10.98, "view_count": 20426}, {"video_id": "dO-dA52NziA", "question": "What change occurs after the stone carving at the beginning of the video talks about 'Workshop practice but also the role of'?", "question_wo_referring_query": "What changes?", "candidates": ["The stone carving that originally only showed a human figure changes to display the entire appearance of the stone carving.", "The stone carving that originally only showed a rectangle changes to display the entire appearance of the stone carving.", "The stone carving that originally only showed a rectangle 
changes to display the entire appearance of the stone carving.", "The stone carving that originally only showed a human figure changes to display another rectangular stone carving.", "The orderly rectangular stone carving changes into a chaotic human figure stone carving."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "dO-dA52NziA_0", "video_path": "dO-dA52NziA.mp4", "subtitle_path": "dO-dA52NziA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1098, "duration": 10.01, "view_count": 32417}, {"video_id": "o7MCS4Z5EOc", "question": "In the video, after the sunlight shines on the white glacier and the black mountain and the mention of \"report showing exceedingly high melt\" appears, what changes have occurred to the glacier?", "question_wo_referring_query": "What changes have occurred to the glacier?", "candidates": ["The glacier started to melt slowly and formed an ocean", "The glacier started to melt slowly and formed a waterfall", "The glacier cracked a seam", "The glacier started to melt slowly and formed a river", "The glacier lost a corner"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "o7MCS4Z5EOc_0", "video_path": "o7MCS4Z5EOc.mp4", "subtitle_path": "o7MCS4Z5EOc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 91, "duration": 10.98, "view_count": 1126837}, {"video_id": "50-CZdMFbIc", "question": "How many types of seasonings were added to the pizza when the person in the video served the cooked pizza?", "question_wo_referring_query": "How many types of seasonings were added to the pizza?", "candidates": ["Two types", "Four types", "Three types", "One type", "No additional seasonings added"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "50-CZdMFbIc_0", "video_path": "50-CZdMFbIc.mp4", "subtitle_path": "50-CZdMFbIc_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 324, "duration": 11.97, "view_count": 176736}, {"video_id": "j3rxEqH1WbA", "question": "In the BBC newsroom, on the left side of the screen, a male host dressed in a suit is sitting there listening attentively. On the right side, the bookshelf is neatly arranged with books. What is the bald man doing?", "question_wo_referring_query": "What is the bald man doing?", "candidates": ["Speaking", "Reading a book", "Having a meeting", "Drinking water", "Making a phone call"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "j3rxEqH1WbA_0", "video_path": "j3rxEqH1WbA.mp4", "subtitle_path": "j3rxEqH1WbA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 350, "duration": 8.0, "view_count": 37305}, {"video_id": "cdj9FWiGzaE", "question": "Behind a cartoon character wearing a blue baseball cap and blue clothes, when two soldiers holding red shields and long spears walk into a building, which people do not appear?", "question_wo_referring_query": "Which people do not appear?", "candidates": ["A crowd wearing red clothes and golden-yellow hair", "A person wearing dark black clothes", "An officer holding a yellow paper with black lettering", "A person wearing white clothes", "A crowd wearing green clothes and black hair"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "cdj9FWiGzaE_0", "video_path": "cdj9FWiGzaE.mp4", "subtitle_path": "cdj9FWiGzaE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 215, "duration": 11.0, "view_count": 1367}, {"video_id": "nbb0GmommvM", "question": "Under the sunlight, in front of the mountain peak are tightly arranged bungalows and green plants, with interspersed red and white high-rise buildings standing by the roadside. 
What is the appearance of the mountains behind the city?", "question_wo_referring_query": "Under the sunlight, in front of the mountain peak are tightly arranged bungalows and green plants, with interspersed red and white high-rise buildings standing by the roadside. What is the appearance of the mountains behind the city?", "candidates": ["The mountains are covered with lush green trees", "Snowy white mountains", "Peaks of varying heights", "Several scattered hills", "Golden high mountains"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "nbb0GmommvM_0", "video_path": "nbb0GmommvM.mp4", "subtitle_path": "nbb0GmommvM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 586, "duration": 13.0, "view_count": 1766335}, {"video_id": "V8LmBoTywFM", "question": "What is slowly floating in the water with reflections, near the city wall, behind the cartoon character wearing a blue hat and blue clothes?", "question_wo_referring_query": "What is it?", "candidates": ["Aircraft carrier", "Sailboat", "Raft", "Submarine", "Wheelboat"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "V8LmBoTywFM_0", "video_path": "V8LmBoTywFM.mp4", "subtitle_path": "V8LmBoTywFM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 275, "duration": 10.0, "view_count": 1463}, {"video_id": "WpbPWsuu4IQ", "question": "In a room with green wallpaper, what happens in the video when the subtitles 'want to go to law school and I'm like' appear on the screen while a girl wearing a red and white striped sweater and glasses is sitting on a white bed?", "question_wo_referring_query": "In a room with green wallpaper, what happens in the video while a girl wearing a red and white striped sweater and glasses is sitting on a white bed?", "candidates": ["The video screen suddenly disappears", "The video screen gradually zooms out", "The video screen suddenly zooms in", "The 
video screen zooms in then out", "The video screen zooms out then in"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "WpbPWsuu4IQ_0", "video_path": "WpbPWsuu4IQ.mp4", "subtitle_path": "WpbPWsuu4IQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 892, "duration": 13.98, "view_count": 114188}, {"video_id": "_CQWvEkj1A8", "question": "What appears after a close-up of a woman with black hair, wearing golden earrings and a necklace, holding red wine, preparing to drink?", "question_wo_referring_query": "What appears after a close-up of a woman with black hair, wearing golden earrings and a necklace, holding red wine, preparing to drink?", "candidates": ["A woman with a knife and a white headpiece, wearing a red necklace, smiling.", "Three men in gray suits having a conversation.", "Three people dining together, looking towards the camera.", "A man with black hair, wearing white clothes and having a beard, with a pensive look.", "A tabby cat sitting on a cushion."], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "_CQWvEkj1A8_0", "video_path": "_CQWvEkj1A8.mp4", "subtitle_path": "_CQWvEkj1A8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 214, "duration": 12.0, "view_count": 667352}, {"video_id": "zHhabL3pUjg", "question": "There are some polygonal shapes with different colored patterns and English letters on a black background. 
What change occurs in the video screen after the subtitle 'day including eight presidents and most' appears?", "question_wo_referring_query": "What change occurs in the video screen?", "candidates": ["The video screen gradually shrinks", "The video screen gradually clears up", "The video screen gradually enlarges", "The video screen switches to calligraphy", "The video screen gradually becomes blurred"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "zHhabL3pUjg_0", "video_path": "zHhabL3pUjg.mp4", "subtitle_path": "zHhabL3pUjg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 115, "duration": 9.01, "view_count": 148826}, {"video_id": "eKXdufK1rdc", "question": "At a military base, there are three rows of yellow tents in front of which stand some soldiers dressed in black camouflage uniforms and two armored vehicles. In which scene does the airplane-shaped military balloon appear?", "question_wo_referring_query": "In which scene does the airplane-shaped military balloon appear?", "candidates": ["In a scene with blue skies and white clouds, green trees in the distance, and a cannon vehicle parked on a yellow grassy field on the right side, it appears in the lower left corner of the screen.", "In a scene with blue skies and white clouds, green trees in the distance, and a cannon vehicle parked on a yellow grassy field on the right side, it appears in the upper left corner of the screen.", "In a scene with blue skies and white clouds, green trees in the distance, and a cannon vehicle parked on a green grassy field on the right side, it appears in the upper left corner of the screen.", "In a scene with blue skies and white clouds, green trees in the distance, and a cannon vehicle parked on a yellow grassy field on the right side, it appears in the upper right corner of the screen.", "In a scene with blue skies and white clouds, green trees in the distance, and a cannon vehicle parked on a yellow grassy 
field on the right side, it appears in the lower right corner of the screen."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "eKXdufK1rdc_0", "video_path": "eKXdufK1rdc.mp4", "subtitle_path": "eKXdufK1rdc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 483, "duration": 12.0, "view_count": 233492}, {"video_id": "SNhmSdLDmto", "question": "In an artwork, there are two naked women and a naked child. Among them, the child, who is sleeping soundly in the arms of the woman on the right, appeared simultaneously with which subtitles?", "question_wo_referring_query": "In an artwork, there are two naked women and a naked child. Among them, the child, who is sleeping soundly in the arms of the woman on the right, appeared simultaneously with which subtitles?", "candidates": ["\"sexuality\u00a0\u00a0\"", "\"And like Freud he placed\"", "\"Sigmund Freud.\"", "\"of mental illness, Almost a classic case study for\u00a0\"", "\"left behind 14 children by sa many women, He lived\u00a0\""], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "SNhmSdLDmto_0", "video_path": "SNhmSdLDmto.mp4", "subtitle_path": "SNhmSdLDmto_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 130, "duration": 13.98, "view_count": 1236953}, {"video_id": "JhvG8XpvR0g", "question": "After inserting a video segment of people holding flags and parading, what change occurs on the screen of the woman sitting indoors, wearing a brown long-sleeve top, and talking with black headphones on?", "question_wo_referring_query": "What change occurs on the screen of this woman?", "candidates": ["The woman's screen enlarges and appears on the left side of the video", "The woman's screen shrinks and appears on the right side of the video", "The woman's screen shrinks and appears on the left side of the video", "The woman's screen enlarges and appears on the right side of the video", "The 
woman's screen disappears"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "JhvG8XpvR0g_0", "video_path": "JhvG8XpvR0g.mp4", "subtitle_path": "JhvG8XpvR0g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 272, "duration": 14.0, "view_count": 47265}, {"video_id": "tXVEUj0pUTQ", "question": "After the subtitle 'well the US is promising what it's' appears, what changes happen to the host's hands, who is wearing a black suit, glasses, and a tie while sitting in the news studio?", "question_wo_referring_query": "What changes happen to the host's hands?", "candidates": ["The host stands up while pressing the table.", "The news host leaves the studio.", "The host's hands slowly open up.", "The news host moves to the left side.", "The news host touches their own head."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "tXVEUj0pUTQ_0", "video_path": "tXVEUj0pUTQ.mp4", "subtitle_path": "tXVEUj0pUTQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 3, "duration": 9.0, "view_count": 84904}, {"video_id": "mazCrox3rjA", "question": "In an antique room, a long-haired woman wearing a dress with floral patterns mentioned that 'my family has no memory of her, but she knows who my family is.' 
What kind of hair accessory does she have on her head?", "question_wo_referring_query": "What kind of hair accessory does she have on her head?", "candidates": ["A white hairpin with pearls", "A yellow floral hairband", "A green floral hairband", "A purple floral hairband", "A white floral hairband"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "mazCrox3rjA_0", "video_path": "mazCrox3rjA.mp4", "subtitle_path": "mazCrox3rjA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 240, "duration": 11.97, "view_count": 228837}, {"video_id": "2PXU5auiuJM", "question": "In a courtyard with a swimming pool with blue water, surrounded by green trees, and a white house beside the pool, there is a man touching his chin. What type of clothes is he wearing?", "question_wo_referring_query": "What type of clothes is he wearing?", "candidates": ["Black long-sleeve robe with red and white floral pattern without a hat", "Black short-sleeve T-shirt", "White short-sleeve robe", "Black long-sleeve robe with a red hat", "Blue long-sleeve jacket"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "2PXU5auiuJM_0", "video_path": "2PXU5auiuJM.mp4", "subtitle_path": "2PXU5auiuJM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 167, "duration": 7.97, "view_count": 60571}, {"video_id": "chhyDJnhIew", "question": "When the subtitle 'These ideas make me act a certain way while drawing' appears, a person is seen holding a large brush and painting on a surface with various colors in the air. 
What style of pants is this person wearing?", "question_wo_referring_query": "What style of pants is this person wearing?", "candidates": ["black shorts", "black capri pants", "white shorts", "black long pants", "blue shorts"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "chhyDJnhIew_0", "video_path": "chhyDJnhIew.mp4", "subtitle_path": "chhyDJnhIew_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 261, "duration": 10.01, "view_count": 38146}, {"video_id": "juPGTMQu3Vs", "question": "In front of a door with a green lock, there are three animated characters with different skin colors. One character places their hand on the head of another character who is wearing a blue hat. Which character places their hand on the character with the blue hat?", "question_wo_referring_query": "Which character places their hand on the character with the blue hat?", "candidates": ["The character with olive skin", "The character with blue skin", "The character wearing a purple robe", "The character wearing yellow shorts", "The character with gray skin"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "juPGTMQu3Vs_0", "video_path": "juPGTMQu3Vs.mp4", "subtitle_path": "juPGTMQu3Vs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 26, "duration": 8.0, "view_count": 2419}, {"video_id": "hPDvtzZLhGk", "question": "In a room filled with various books, there are three women wearing different clothes having a conversation. 
When one woman, who is wearing a gray and white checkered coat, says \"we had a place on the ice and learning,\" what action does she make?", "question_wo_referring_query": "What action does she make?", "candidates": ["One hand pointing forward", "One hand raised with fingers slightly spread", "Hands crossed", "Both hands on her legs", "Both hands in front of her chest"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "hPDvtzZLhGk_0", "video_path": "hPDvtzZLhGk.mp4", "subtitle_path": "hPDvtzZLhGk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 61, "duration": 13.98, "view_count": 1462}, {"video_id": "KQQRfNX47w8", "question": "In a video with various calligraphy images as its background, what style of calligraphy is introduced before the video explains the nearly engraved calligraphy?", "question_wo_referring_query": "What style of calligraphy is introduced?", "candidates": ["Vigorous and Forceful Script", "Fine and Delicate Insect Script", "Penmanship that Resembles Dragons and Snakes", "Exquisite and Detailed Insect Script", "Dragon Flying and Phoenix Dancing Style"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "KQQRfNX47w8_0", "video_path": "KQQRfNX47w8.mp4", "subtitle_path": "KQQRfNX47w8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 68, "duration": 13.01, "view_count": 4401}, {"video_id": "bxjHWzHuqyo", "question": "What is the first food that appears on the screen?", "question_wo_referring_query": "What is the first food that appears on the screen?", "candidates": ["A delicious strawberry mousse", "Two cooked eggs", "A yellow banana", "Triangular pieces of cheese", "A plate of fried eggs sprinkled with chili powder"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "bxjHWzHuqyo_0", "video_path": "bxjHWzHuqyo.mp4", "subtitle_path": "bxjHWzHuqyo_en.json", 
"duration_group": 15, "starting_timestamp_for_subtitles": 191, "duration": 13.0, "view_count": 1224}, {"video_id": "NndMSB_99r8", "question": "In front of a red and white background wall, a man wearing a yellow open-collar jacket and a woman wearing a pink top and a gray headscarf are standing side by side, staring at the camera. What happens after the white subtitle mentions 'a cross-time frozen encounter'?", "question_wo_referring_query": "What happens?", "candidates": ["The screen changes to show only the man", "Both the man and the woman disappear from the screen", "The woman disappears from the screen", "The screen changes to show only the woman", "Three people appear on the screen"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "NndMSB_99r8_0", "video_path": "NndMSB_99r8.mp4", "subtitle_path": "NndMSB_99r8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1481, "duration": 11.01, "view_count": 10110}, {"video_id": "mANsedYvsBs", "question": "In a room full of red chairs, a man with messy hair wearing a black suit says 'It's necessary to complete the work in these small places.' 
What is the first painting that appears?", "question_wo_referring_query": "What is the first painting that appears?", "candidates": ["A landscape painting", "A city night scene painting", "A simple sketch", "A portrait painting", "A self-portrait painting"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "mANsedYvsBs_0", "video_path": "mANsedYvsBs.mp4", "subtitle_path": "mANsedYvsBs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 100, "duration": 13.01, "view_count": 307039}, {"video_id": "YPM0uF0-hEM", "question": "Inside a car with black leather seats, after a woman wearing a black hooded coat, sunglasses, and nail polish says, 'hope to look better in the next video,' what change happens to her?", "question_wo_referring_query": "What change happens to her?", "candidates": ["Her hand goes from hanging down to holding her hair.", "Her hand goes from consistently holding the steering wheel to holding the steering wheel with both hands.", "Both hands go from naturally hanging down to bending at the elbows.", "Her hand goes from hanging down to raising both hands with fingers spread open.", "Her expression changes from neutral to broadly smiling."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "YPM0uF0-hEM_0", "video_path": "YPM0uF0-hEM.mp4", "subtitle_path": "YPM0uF0-hEM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 105, "duration": 12.0, "view_count": 30101}, {"video_id": "h6NkWt9ThMk", "question": "In a bedroom filled with stationery and miscellaneous items, a boy with short brown hair, wearing a red T-shirt and a wristwatch is sitting at a desk. 
What is he doing?", "question_wo_referring_query": "What is the boy sitting at the desk doing?", "candidates": ["He is drawing", "He is doing homework at the desk", "He is listening to music", "He is dancing", "He is talking to a camera"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "h6NkWt9ThMk_0", "video_path": "h6NkWt9ThMk.mp4", "subtitle_path": "h6NkWt9ThMk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 198, "duration": 10.97, "view_count": 123180}, {"video_id": "OlLHmZbnocY", "question": "Next to a three-story tall building, there are two men with the same hairstyle. One is wearing a green coat and a black hooded jacket, while the other is wearing a black T-shirt and a blue-black checkered coat. Which of the following items is present here?", "question_wo_referring_query": "Which of the following items is present here?", "candidates": ["A flower bed filled with fresh flowers", "A green backpack", "A drone flying in the sky", "A circular flower bed with green grass inside", "A small tree with blue fairy lights hanging on it"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "OlLHmZbnocY_0", "video_path": "OlLHmZbnocY.mp4", "subtitle_path": "OlLHmZbnocY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 218, "duration": 11.01, "view_count": 67929}, {"video_id": "TsJan9FHjKc", "question": "In a space with a purple pillar, there is a woman with black skin and black hair wearing black clothes. Behind her, there is a screen. On the screen, there are two people: one facing forward and one not. 
What kind of clothes is the person not facing forward wearing?", "question_wo_referring_query": "What kind of clothes is the person not facing forward on the screen wearing?", "candidates": ["Black short sleeves", "Black long sleeves", "Purple long sleeves", "White short sleeves", "White long sleeves"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "TsJan9FHjKc_0", "video_path": "TsJan9FHjKc.mp4", "subtitle_path": "TsJan9FHjKc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 53, "duration": 8.01, "view_count": 19917}, {"video_id": "e6QbEzXKSw0", "question": "On a street lined with buildings painted olive green on both sides, a man wearing a baseball cap and a T-shirt walks. When he raises his right hand and says 'Never be shy to ask any questions,' what color is missing from his T-shirt?", "question_wo_referring_query": ", what color is missing from his T-shirt?", "candidates": ["Blue", "Black", "White", "Olive green", "Yellow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "e6QbEzXKSw0_0", "video_path": "e6QbEzXKSw0.mp4", "subtitle_path": "e6QbEzXKSw0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 354, "duration": 12.01, "view_count": 148549}, {"video_id": "pSxkvynyElQ", "question": "In a room with a half-open door, a paper box and a sofa on the floor, and a pair of black and white patterned gloves and a wooden board hanging on the wall, there are two people. 
Which person is handling food on a cutting board?", "question_wo_referring_query": "Which person is handling food on a cutting board?", "candidates": ["Man in black short sleeves", "Woman in black short sleeves", "Man in black long sleeves", "Man in white short sleeves", "Woman in white short sleeves"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "pSxkvynyElQ_0", "video_path": "pSxkvynyElQ.mp4", "subtitle_path": "pSxkvynyElQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 481, "duration": 10.97, "view_count": 207314}, {"video_id": "yy-z26jDcoM", "question": "In the center of the screen, there is a woman wearing a purple jacket with her hair combed like Liu Hai. On her chest, there's a line of text that reads 'OUTLOOK FOR TRAVEL IN APAC.' When she mentions 'So, happiness and joy are the current themes,' what happens to the 'OUTLOOK FOR TRAVEL IN APAC' text?", "question_wo_referring_query": "What happens to the 'OUTLOOK FOR TRAVEL IN APAC' text?", "candidates": ["The text moves left", "The text moves downward", "The text moves upward", "The text moves right", "The text scatters and flies out"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "yy-z26jDcoM_0", "video_path": "yy-z26jDcoM.mp4", "subtitle_path": "yy-z26jDcoM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 297, "duration": 9.98, "view_count": 4077}, {"video_id": "JWpu9UvwPUg", "question": "This is a picture-in-picture video. The larger screen shows a room with two men inside, while the smaller screen shows a workspace filled with containers of various seasonings. A hand wearing a blue glove picks up a container of black seasoning. 
What does this hand do immediately after picking up the black seasoning?", "question_wo_referring_query": "What does this hand do immediately after picking up the black seasoning?", "candidates": ["The hand picks up the red seasoning.", "The hand sprinkles the black seasoning in a white container.", "The hand sprinkles the black seasoning on a pizza.", "The hand sprinkles the red seasoning in a white container.", "The hand sprinkles the red seasoning on a pizza."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "JWpu9UvwPUg_0", "video_path": "JWpu9UvwPUg.mp4", "subtitle_path": "JWpu9UvwPUg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 62, "duration": 12.01, "view_count": 57002}, {"video_id": "0NQf648Ul24", "question": "Outside in the bright sunlight, there is a wooden bridge. One end of the bridge is stacked with giant stones, while the rest of it is surrounded by water. On the bridge, there are two women, one wearing a red bikini and the other wearing a floral bikini. After the narration mentions 'and half and it is so beautiful we\u2018re', what action does the woman in the red bikini do?", "question_wo_referring_query": "What action does the woman in the red bikini do?", "candidates": ["She puts her right hand on the back of her neck", "She puts her right hand on her chest", "She raises her phone with both hands to take a photo", "She puts her left hand on her chest", "She puts her left hand on the back of her neck"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "0NQf648Ul24_0", "video_path": "0NQf648Ul24.mp4", "subtitle_path": "0NQf648Ul24_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 248, "duration": 13.01, "view_count": 93120}, {"video_id": "5O2Yjn3OXRk", "question": "The screen shows a blue ocean and green land. 
In the center of the screen, there is a section highlighted by a blue strip, presenting a flag design. What text appears in the lower half of the screen before the explanation mentions 'Iain, Ewan, and Eoin'?", "question_wo_referring_query": "What text appears in the lower half of the screen?", "candidates": ["EOIN", "IAIN", "IAN", "EVAN", "EIAN"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "5O2Yjn3OXRk_0", "video_path": "5O2Yjn3OXRk.mp4", "subtitle_path": "5O2Yjn3OXRk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 182, "duration": 10.01, "view_count": 559472}, {"video_id": "qMokEEl_WaA", "question": "According to the explanation, which of the following sequences is correct?", "question_wo_referring_query": "According to the explanation, which of the following sequences is correct?", "candidates": ["First, mozzarella cheese is added to the ingredients, then the accompaniments in the soup ladle container are added to the ingredients, next a silver soup ladle container is lifted, and finally, a bowl of ingredients is displayed.", "First, a silver soup ladle container is lifted, then the accompaniments in the soup ladle container are added to the ingredients, afterwards mozzarella cheese is added to the ingredients, and finally, a bowl of ingredients is displayed.", "First, a bowl of ingredients is displayed, then a silver soup ladle container is lifted, next the accompaniments in the soup ladle container are added to the ingredients, and finally, mozzarella cheese is added to the ingredients.", "First, a silver soup ladle container is lifted, then a bowl of ingredients is displayed, afterwards mozzarella cheese is added to the ingredients, and finally, the accompaniments in the soup ladle container are added to the ingredients.", "First, a bowl of ingredients is displayed, then mozzarella cheese is added to the ingredients, next a silver soup ladle container is lifted, and finally, the 
accompaniments in the soup ladle container are added to the ingredients."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "qMokEEl_WaA_0", "video_path": "qMokEEl_WaA.mp4", "subtitle_path": "qMokEEl_WaA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 105, "duration": 9.0, "view_count": 4160}, {"video_id": "WoSdQtsS6kg", "question": "What changes occur to the facial lighting of the man wearing a blue suit and sporting a stubble beard, who appears at the beginning and end of the video?", "question_wo_referring_query": "What changes occur to the facial lighting of the man wearing a blue suit and sporting a stubble beard, who appears at the beginning and end of the video?", "candidates": ["From dark to bright", "From dark to bright and then to dark", "No change", "From bright to dark and then to bright", "From bright to dark"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "WoSdQtsS6kg_0", "video_path": "WoSdQtsS6kg.mp4", "subtitle_path": "WoSdQtsS6kg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 140, "duration": 13.01, "view_count": 9761}, {"video_id": "m2FEtP5J8LM", "question": "In the video, how many different children are sitting around the long table having a meal?", "question_wo_referring_query": "In the video, how many different children are sitting around the long table having a meal?", "candidates": ["6", "5", "4", "7", "3"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "m2FEtP5J8LM_0", "video_path": "m2FEtP5J8LM.mp4", "subtitle_path": "m2FEtP5J8LM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1049, "duration": 9.0, "view_count": 1397069}, {"video_id": "sV1Z2LXtHqc", "question": "In a dense jungle, a man wearing short sleeves is pulling a bowstring. 
Which object is not present in this scene?", "question_wo_referring_query": "Which object is not present in this scene?", "candidates": ["stone", "short sleeves", "bow", "arrow", "watch"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "sV1Z2LXtHqc_0", "video_path": "sV1Z2LXtHqc.mp4", "subtitle_path": "sV1Z2LXtHqc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1318, "duration": 8.0, "view_count": 239571}, {"video_id": "obrEORMMZ_4", "question": "A solid white object is submerged in water. When explaining, it mentions 'If you have a favorite name origin or a different theory, please tell me in the comments.' What object is present in the scene?", "question_wo_referring_query": ", what object is present in the scene?", "candidates": ["paper", "knife", "stove", "water", "bowl"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "obrEORMMZ_4_0", "video_path": "obrEORMMZ_4.mp4", "subtitle_path": "obrEORMMZ_4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 280, "duration": 10.01, "view_count": 110931}, {"video_id": "KIBoDlePjsA", "question": "In a well-lit indoor space, there is an artwork displayed in the center. Paintings are hanging on the white walls, and some people are observing them. Among them, an elderly man is standing and admiring the painting on the wall. 
What kind of jacket is this man wearing?", "question_wo_referring_query": "What kind of jacket is this man wearing?", "candidates": ["A pure black suit", "A black wool coat", "A white sports shirt", "A black sports shirt", "A pure white suit"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "KIBoDlePjsA_0", "video_path": "KIBoDlePjsA.mp4", "subtitle_path": "KIBoDlePjsA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 15, "duration": 14.02, "view_count": 10220}, {"video_id": "1Si7L6nZoSg", "question": "In a black background, there's a white template in the middle with a drawing of a branch with six green leaves. What English letter is removed?", "question_wo_referring_query": "What English letter is removed?", "candidates": ["APP", "ABUH", "SARA", "SMITH", "TEXT"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "1Si7L6nZoSg_0", "video_path": "1Si7L6nZoSg.mp4", "subtitle_path": "1Si7L6nZoSg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 445, "duration": 9.01, "view_count": 24580}, {"video_id": "iwjPVL81AqI", "question": "Before a woman wearing a white long-sleeve coat with an English logo and a necklace, and having long hair, appears on the screen with just a half-face photo of an eye filled with tears, what action does she take?", "question_wo_referring_query": "What action does the woman take?", "candidates": ["Touches her hair with her hand", "Puts her hand on the steering wheel and talks", "Stretches her hand out of the window", "Opens the cap of a mineral water bottle", "Takes her hand off the steering wheel"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "iwjPVL81AqI_0", "video_path": "iwjPVL81AqI.mp4", "subtitle_path": "iwjPVL81AqI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 129, "duration": 8.0, "view_count": 27519}, {"video_id": 
"Rcev1dr6cWg", "question": "In an environment with simple woven baskets, and three sofas with floral pillows on them, what did the man dressed in a deep blue polka-dotted short-sleeved shirt and wearing a watch and glasses do with his hands the first time he opened them, palms facing each other?", "question_wo_referring_query": "What did he do with his hands?", "candidates": ["He lowered the hand without a watch.", "He clenched both hands.", "He lowered the hand with a watch.", "He lowered both hands.", "He intertwined his fingers."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "Rcev1dr6cWg_0", "video_path": "Rcev1dr6cWg.mp4", "subtitle_path": "Rcev1dr6cWg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 352, "duration": 13.0, "view_count": 38367}, {"video_id": "bhQe2cjr5XQ", "question": "On the map with red, blue, and white areas, many small icons are moving into the blue region. Which area to the right of the River Rhine on the map turns blue first?", "question_wo_referring_query": "Which area to the right of the River Rhine on the map turns blue first?", "candidates": ["BAVARIA", "SWITZ", "PRUSSI", "HANOVER", "FRANCE"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "bhQe2cjr5XQ_0", "video_path": "bhQe2cjr5XQ.mp4", "subtitle_path": "bhQe2cjr5XQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 202, "duration": 11.0, "view_count": 5309351}, {"video_id": "xLTCivIB4kU", "question": "A woman wearing a red long sleeve shirt and with long blonde hair places her hands on the two car doors. 
After the subtitle says 'driving into antarctica,' which person appears?", "question_wo_referring_query": "Who appears?", "candidates": ["A man wearing a black helmet", "A man wearing a white helmet", "A man wearing a red helmet", "A man not wearing a helmet", "A man wearing a blue helmet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "xLTCivIB4kU_0", "video_path": "xLTCivIB4kU.mp4", "subtitle_path": "xLTCivIB4kU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 889, "duration": 9.01, "view_count": 5316158}, {"video_id": "xnLEAqyxXZc", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a woman wearing a red hat is being interviewed by a reporter, followed by a woman wearing a pink hat and a feathered outfit being interviewed by a reporter.", "First, a woman wearing a pink hat and a feathered outfit is being interviewed by a reporter, followed by a firefighter extinguishing a fire in a misty white smoke environment.", "First, a firefighter is extinguishing a fire in a misty white smoke environment, followed by a woman wearing a pink hat and a feathered outfit being interviewed by a reporter.", "First, a woman wearing a pink hat and short sleeves is being interviewed by a reporter, followed by a firefighter extinguishing a fire in a misty white smoke environment.", "First, a woman wearing a white hat is being interviewed by a reporter, followed by a woman wearing a pink hat and a feathered outfit being interviewed by a reporter."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "xnLEAqyxXZc_0", "video_path": "xnLEAqyxXZc.mp4", "subtitle_path": "xnLEAqyxXZc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 85, "duration": 9.0, "view_count": 136167}, {"video_id": "K_b5BNSa4Dw", "question": "When 
the woman wearing glasses and a blue striped shirt was initially sitting in front of the computer editing the video, and later leaned back on the orange sofa with a goose plush toy appearing, what change happened to this woman?", "question_wo_referring_query": "What change happened to this woman?", "candidates": ["Two cats sat on her belly", "A cat sat on her belly", "She took off her glasses", "A cat climbed on her shoulder", "A small dog climbed on her belly"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "K_b5BNSa4Dw_0", "video_path": "K_b5BNSa4Dw.mp4", "subtitle_path": "K_b5BNSa4Dw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 562, "duration": 11.98, "view_count": 352741}, {"video_id": "5us98Ens_hg", "question": "A woman with gold curly hair and red nail polish is scanning a QR code with her phone in front of a mirror. After she mentions 'the pandemic? Yes' in the subtitles, what change occurs?", "question_wo_referring_query": "What change occurs?", "candidates": ["The mirror image of the woman with gold hair scanning the QR code with her phone enlarges.", "The mirror image of the woman with gold hair scanning the QR code with her phone shrinks.", "The woman puts on blue gloves.", "A crowd of people appears, lining up outside the tent.", "The mirror image of the woman with gold hair scanning the QR code with her phone neither enlarges nor shrinks."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "5us98Ens_hg_0", "video_path": "5us98Ens_hg.mp4", "subtitle_path": "5us98Ens_hg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 74, "duration": 9.01, "view_count": 5690}, {"video_id": "Fq-zzwmPubk", "question": "With a transparent glass window behind him adorned with green curtains, two flags on the sides, and seated at a brown office desk with books and documents, what is the man wearing a black tie doing?", 
"question_wo_referring_query": "What is he doing?", "candidates": ["Reading a book", "Waving the American flag", "Raising his hand and shouting", "Dancing", "Making a phone call"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "Fq-zzwmPubk_0", "video_path": "Fq-zzwmPubk.mp4", "subtitle_path": "Fq-zzwmPubk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 457, "duration": 12.0, "view_count": 1188020}, {"video_id": "D4DtkPPCBSE", "question": "On a street scattered with small pieces of trash, with cars parked on the right side, a trash bin placed, and the left side surrounded by columns, which individual is present at the scene?", "question_wo_referring_query": "Which individual is present at the scene?", "candidates": ["A man with curly hair wearing a white T-shirt", "A man wearing a white backpack", "A man wearing a white long-sleeve shirt", "A woman wearing a yellow top", "A woman wearing sunglasses and carrying a blue bag"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "D4DtkPPCBSE_0", "video_path": "D4DtkPPCBSE.mp4", "subtitle_path": "D4DtkPPCBSE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 311, "duration": 14.03, "view_count": 12799}, {"video_id": "DcT4SjFyZ4s", "question": "On a brown table, there is a round melon with some orange flesh in the middle that has been hollowed out, containing various fruits inside. 
When 'Music' is mentioned in the subtitles, which of the following objects is NOT present on the screen?", "question_wo_referring_query": "Which object is NOT present on the screen?", "candidates": ["A few green mint leaves", "Strawberry", "Blueberry", "Raspberry", "Banana"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "DcT4SjFyZ4s_0", "video_path": "DcT4SjFyZ4s.mp4", "subtitle_path": "DcT4SjFyZ4s_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 45, "duration": 12.01, "view_count": 150497}, {"video_id": "KUJi1CKASiY", "question": "In the video, a person is holding a black-handled kitchen utensil in their left hand, spreading a sticky substance to cover the entire bottom of a glass bowl. After the word \"REST\" appears on the screen, what happens to the substance in the glass bowl?", "question_wo_referring_query": "What happens to the substance in the glass bowl?", "candidates": ["Its color turns red", "It completely turns into a liquid", "Its color turns black", "It completely turns into a solid", "It completely turns into a paste"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "KUJi1CKASiY_0", "video_path": "KUJi1CKASiY.mp4", "subtitle_path": "KUJi1CKASiY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 330, "duration": 10.0, "view_count": 444564}, {"video_id": "SpifyNWKnfA", "question": "In the center of the room, on a brown rectangular plank with a concave surface, there are two statues of knights riding horses holding wooden sticks. On both sides of the room, there are other suits of armor lined up. On the second floor of the room, people are leaning against the railing talking, and on the white walls on both sides, there are multi-colored flags hanging. 
Who waves at the camera in the video?", "question_wo_referring_query": "Who waves at the camera in the video?", "candidates": ["A child wearing glasses", "A curly-haired boy in short sleeves", "An adult in a suit", "A curly-haired boy in long sleeves", "A curly-haired adult"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "SpifyNWKnfA_0", "video_path": "SpifyNWKnfA.mp4", "subtitle_path": "SpifyNWKnfA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 185, "duration": 13.01, "view_count": 6527}, {"video_id": "cDJ-aFxXZmk", "question": "In a sun-soaked field, there are green chili peppers growing. One hand is holding a basket made of bamboo, and the other hand is holding two green chili peppers. What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["The green chili peppers in the hand fall to the ground.", "Only one of the green chili peppers in the hand is placed into the bamboo basket.", "The two green chili peppers in the hand are placed into the bamboo basket.", "The bamboo basket is knocked over.", "Only one of the green chili peppers in the hand falls to the ground."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "cDJ-aFxXZmk_0", "video_path": "cDJ-aFxXZmk.mp4", "subtitle_path": "cDJ-aFxXZmk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 573, "duration": 13.0, "view_count": 3448152}, {"video_id": "Hp0-YP3lADk", "question": "Against a black background, in the top right corner, there is a well-lit photo of an island. A curly-haired man wearing a beige short-sleeve shirt is at the center of the black background. 
When the subtitle 'the little Faroe islands' appears, what did the man do?", "question_wo_referring_query": "What did the man do?", "candidates": ["The curly-haired man patted his shoulder.", "The curly-haired man's hands opened.", "The curly-haired man clasped his hands together.", "The curly-haired man waved to the camera.", "The curly-haired man nodded towards the camera."], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "Hp0-YP3lADk_0", "video_path": "Hp0-YP3lADk.mp4", "subtitle_path": "Hp0-YP3lADk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 100, "duration": 11.01, "view_count": 4875968}, {"video_id": "rmTvw1iFzpw", "question": "Regarding the background of the spacecraft, a seated man wearing a blue shirt with a black and white striped tie and curly hair had his left hand open and his right hand clenched after gesturing. What did he do next?", "question_wo_referring_query": "What did he do next?", "candidates": ["Both hands clenched", "One hand open, one hand clenched", "Both hands open, palms down", "Both hands open, palms up", "Both hands open, palms facing each other"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "rmTvw1iFzpw_0", "video_path": "rmTvw1iFzpw.mp4", "subtitle_path": "rmTvw1iFzpw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 300, "duration": 12.98, "view_count": 2556}, {"video_id": "oSPsPa64KtU", "question": "Under the blue sky, on a street with many people gathered, there are red and yellow buildings and some green trees alongside the street. A white car is parked on the right side of the screen. 
Who is the first person to appear on the street?", "question_wo_referring_query": "Who is the first person to appear on the street?", "candidates": ["The woman wearing a skirt", "The woman wearing sunglasses", "The person wearing red short sleeves and glasses", "The person wearing a white tank top, black shorts, and carrying a backpack", "The child wearing green short sleeves"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "oSPsPa64KtU_0", "video_path": "oSPsPa64KtU.mp4", "subtitle_path": "oSPsPa64KtU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 40, "duration": 12.01, "view_count": 43460}, {"video_id": "AZo_Bj2xGYs", "question": "By the river, there is green grass and trees. A man wearing a helmet, donned in armor, and draped in a red battle robe is leading a line of armored swordsmen. After the subtitle 'Britannia unsettling the freshly arrived' appears, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["The shot transitions to two black and white-themed cartoon characters", "The shot transitions to a flag", "The shot transitions to the trees", "The shot transitions to the creek", "The shot transitions to the soldiers"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "AZo_Bj2xGYs_0", "video_path": "AZo_Bj2xGYs.mp4", "subtitle_path": "AZo_Bj2xGYs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 5, "duration": 8.0, "view_count": 1042}, {"video_id": "VpjuItsmZ-Q", "question": "Underneath the sky lies a stretch of connected green hills, with two mountain peaks standing in the foreground. In the distance, cliffs can be seen. 
After the caption 'or science in general I upload new' appears, what shows up on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["A small house with a white roof", "A small house with a black roof", "A small house with a blue roof", "A small house with a green roof", "A small house with a red roof"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "VpjuItsmZ-Q_0", "video_path": "VpjuItsmZ-Q.mp4", "subtitle_path": "VpjuItsmZ-Q_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.01, "view_count": 11787}, {"video_id": "1ZwbgIeA6gE", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a black and white photo of a house, followed by a black and white photo of a man wearing a hat and a suit.", "First, there is a black and white photo of a man wearing a hat and a suit, followed by a black and white photo of a house.", "First, there is a color photo of a man wearing a hat and a suit, followed by a black and white photo of a house.", "First, there is a black and white photo of a man in a suit without a hat, followed by a black and white photo of a house.", "First, there is a black and white photo of a house, followed by a black and white photo of a man in a suit without a hat."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "1ZwbgIeA6gE_0", "video_path": "1ZwbgIeA6gE.mp4", "subtitle_path": "1ZwbgIeA6gE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 182, "duration": 13.01, "view_count": 2869389}, {"video_id": "Kz5nuMjNsH0", "question": "On the black soil, a few plants are growing. Sunlight passes through the leaves and shines on the hands. A pair of hands is turning over the roots of the plants. 
In which other scene do these pair of hands appear?", "question_wo_referring_query": "In which other scene do these pair of hands appear?", "candidates": ["When picking white fruits", "When washing hands", "When picking yellow fruits", "When brushing off dirt from the body", "When picking blue fruits"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "Kz5nuMjNsH0_0", "video_path": "Kz5nuMjNsH0.mp4", "subtitle_path": "Kz5nuMjNsH0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 42, "duration": 10.01, "view_count": 177091}, {"video_id": "F5QhrEPMJzE", "question": "What change occurred to the yellow crusted pastry, which contains meat slices and green and white vegetables, when it was served on a white table with blue and white checkered cloth?", "question_wo_referring_query": "What change occurred to the pastry?", "candidates": ["The crust turned golden yellow and was sprinkled with tomato sauce.", "The crust turned white and was sprinkled with sesame seeds.", "The crust turned golden yellow and was sprinkled with pepper.", "The crust turned golden yellow and had olive dots.", "The crust turned white and was sprinkled with tomato sauce."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "F5QhrEPMJzE_0", "video_path": "F5QhrEPMJzE.mp4", "subtitle_path": "F5QhrEPMJzE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 84, "duration": 10.97, "view_count": 98978}, {"video_id": "DWnRtK15hDo", "question": "Against a green background, a crown is placed on two crossed knives. A black knife is horizontally positioned in the center. After the subtitle 'More famous than their Terra hats were their Cooper II lives' appears, what changes occur to the black knife?", "question_wo_referring_query": "What changes occur to the black knife?", "candidates": ["The blade turns silver and features a blood groove. 
The hilt turns brownish-green with studs.", "The blade turns silver and features a blood groove. The hilt turns red with studs.", "The blade turns gold and features a blood groove. The hilt turns brownish-green with studs.", "The blade turns gold and features a blood groove. The hilt turns black with studs.", "The blade turns silver and features a blood groove. The hilt turns green with studs."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "DWnRtK15hDo_0", "video_path": "DWnRtK15hDo.mp4", "subtitle_path": "DWnRtK15hDo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 188, "duration": 9.0, "view_count": 2039538}, {"video_id": "0qtC1pSzHD4", "question": "A white phone with a pink casing is placed on a black car seat. A hand with rings on the thumb, index finger, and middle finger, and wearing nail polish, is hovering above the phone. What is this hand doing?", "question_wo_referring_query": "What is this hand doing?", "candidates": ["Picking up the phone", "Searching for car keys", "Tapping the phone screen", "Taking off the phone case", "Searching for the charging cable"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "0qtC1pSzHD4_0", "video_path": "0qtC1pSzHD4.mp4", "subtitle_path": "0qtC1pSzHD4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 80, "duration": 14.0, "view_count": 38466}, {"video_id": "w-I92mT_O7Q", "question": "In the bright sunlight outside, several cars are parked in front of an ochre-colored building, and pedestrians are strolling outside the building. A man with black curly hair, wearing summer clothes, is talking to the camera. 
What item is the man wearing that is shown in the video?", "question_wo_referring_query": "What item is the man shown talking in the video wearing?", "candidates": ["sunglasses", "necklace", "face mask", "headscarf", "hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "w-I92mT_O7Q_0", "video_path": "w-I92mT_O7Q.mp4", "subtitle_path": "w-I92mT_O7Q_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 700, "duration": 12.97, "view_count": 1347654}, {"video_id": "dbmrOy65cNw", "question": "Inside the room, a man in a white coat is sitting on a suitcase located near the stairs. The man is wearing black pants, white socks, and white shoes. The stair railing is black, and there are floral prints on the suitcase. When the subtitle 'you finish watching this video you go to' appears, what item does not appear on the screen?", "question_wo_referring_query": "What item does not appear on the screen?", "candidates": ["A black chair", "A portrait painting", "A multicolored pillow", "A red sofa", "A green sofa"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "dbmrOy65cNw_0", "video_path": "dbmrOy65cNw.mp4", "subtitle_path": "dbmrOy65cNw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 23, "duration": 10.01, "view_count": 2681769}, {"video_id": "rRSEBpuLKmQ", "question": "In a room with warm lighting, the curtains are drawn, and there's a simple bookshelf filled with books on the wall. Inside the room, a woman with golden hair dressed in black is having an online conversation with a man wearing black-framed glasses. What is the man in the video wearing during this online conversation?", "question_wo_referring_query": "In a room with warm lighting, the curtains are drawn, and there's a simple bookshelf filled with books on the wall. 
Inside the room, a woman with golden hair dressed in black is having an online conversation with a man wearing black-framed glasses. What is the man in the video wearing during this online conversation?", "candidates": ["He is wearing a light blue suit", "He is wearing a blue shirt", "He is wearing a black suit", "He is wearing a black lab coat", "He is wearing a white lab coat"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "rRSEBpuLKmQ_0", "video_path": "rRSEBpuLKmQ.mp4", "subtitle_path": "rRSEBpuLKmQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 264, "duration": 13.0, "view_count": 53227}, {"video_id": "0mAGCXxnEP8", "question": "A lady with black hair and a black bracelet is using a silver fork to pick yellow food from a plate. Behind her is a refrigerator and a cabinet. Next to the food is a pink bag. When the subtitle 'good' appears, what kind of outerwear is the lady wearing?", "question_wo_referring_query": "What kind of outerwear is the lady wearing?", "candidates": ["Cotton lab coat", "Green fur coat", "Cotton long sleeve", "Black leather outerwear", "Fabric long sleeve"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "0mAGCXxnEP8_0", "video_path": "0mAGCXxnEP8.mp4", "subtitle_path": "0mAGCXxnEP8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 411, "duration": 13.01, "view_count": 16121}, {"video_id": "DeXN7CD1KrU", "question": "In the video, on a yellow cutting board, there is a round stainless steel tool placed. A person with a gold ring on his right hand is holding a whole piece of Parmesan cheese. 
What is this person doing with the cheese?", "question_wo_referring_query": "What is this person doing with the cheese?", "candidates": ["Shredding the whole piece of Parmesan cheese", "Throwing away the piece of cheese", "Putting the cheese directly into a pot", "Eating the piece of cheese directly"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "DeXN7CD1KrU_0", "video_path": "DeXN7CD1KrU.mp4", "subtitle_path": "DeXN7CD1KrU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 192, "duration": 9.0, "view_count": 419131}, {"video_id": "Uh2VgDw7odg", "question": "In a room with a bookshelf filled with items, white walls covered with sticky notes, a white desk with a black laptop on it, and a man wearing a black hooded sweatshirt and wristwatch holding a black mobile phone, which of the following items are not present?", "question_wo_referring_query": "Which of the following items are not present?", "candidates": ["Laptop", "Headphones", "Colorful sticky notes", "Mobile phone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "Uh2VgDw7odg_0", "video_path": "Uh2VgDw7odg.mp4", "subtitle_path": "Uh2VgDw7odg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 92, "duration": 10.01, "view_count": 172045}, {"video_id": "QG0apoCUFUE", "question": "When mentioning 'our finger we're gonna make it deeper', on the gray table, there is a yellow cutting board with flour on it. 
What else is on the screen?", "question_wo_referring_query": "What else is on the screen?", "candidates": ["a wooden cutting board, a bowl with white ingredients", "3 small bowls with seasonings, a fork", "a wooden cutting board, a small knife, a beige bowl filled with water", "3 small bowls with seasonings, a wooden cutting board with flour on it, a fork, a red bowl filled with water, a bowl with white ingredients"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "QG0apoCUFUE_0", "video_path": "QG0apoCUFUE.mp4", "subtitle_path": "QG0apoCUFUE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 199, "duration": 8.01, "view_count": 882976}, {"video_id": "B-_0acPO5DU", "question": "In a European-style vintage room mentioned 'of Syria and Iran the historical', there is a vintage bookshelf with a fire pit underneath. A small figurine is wearing a blue hat and blue clothes. What type of hat is he wearing?", "question_wo_referring_query": "What type of hat is he wearing?", "candidates": ["Beret", "Duckbill cap", "Ascot cap", "Fur hat"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "B-_0acPO5DU_0", "video_path": "B-_0acPO5DU.mp4", "subtitle_path": "B-_0acPO5DU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 8, "duration": 13.0, "view_count": 3080}, {"video_id": "V42WMMpBYTo", "question": "In the video, there is a woman wearing skin-colored clothing in a white room. 
What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is sitting", "She is standing and dancing", "She is lying on the white floor doing strange movements", "She is doing yoga"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "V42WMMpBYTo_0", "video_path": "V42WMMpBYTo.mp4", "subtitle_path": "V42WMMpBYTo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 429, "duration": 11.01, "view_count": 2003}, {"video_id": "J4itcJ8Ur2c", "question": "On a yellow background board, there are 7 white circles. On another yellow background board, it says 'By the late Napoleonic Wars, skirmish tactics had'. Which of these two screens appears first?", "question_wo_referring_query": "Which of these two screens appears first?", "candidates": ["The screen with 7 white circles on a yellow background appears first; the last screen is one with 'Scouting Terrain' written on a yellow background.", "The screen with 7 white circles on a yellow background appears first; the screen with 'By the late Napoleonic Wars, skirmish tactics had' on a yellow background appears later.", "They are the same screen.", "The screen with 'By the late Napoleonic Wars, skirmish tactics had' on a yellow background appears first; the screen with 7 white circles on a yellow background appears later."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "J4itcJ8Ur2c_0", "video_path": "J4itcJ8Ur2c.mp4", "subtitle_path": "J4itcJ8Ur2c_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 659, "duration": 14.0, "view_count": 721379}, {"video_id": "8tYwRWdKR-U", "question": "After mentioning 'him, he was one of his primary assistants,' on a gray wall hanging four paintings. A frame is pulled up in front of the four paintings. 
Which painting is enlarged?", "question_wo_referring_query": "Which painting is enlarged?", "candidates": ["No painting is enlarged.", "A portrait of a man with no beard, wearing a black outfit with white polka dots.", "A portrait of a man wearing a white outfit with black polka dots.", "A portrait of a man with dense black curly hair, wearing a black outfit with white polka dots, and sporting a beard."], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "8tYwRWdKR-U_0", "video_path": "8tYwRWdKR-U.mp4", "subtitle_path": "8tYwRWdKR-U_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 26, "duration": 13.98, "view_count": 36813}, {"video_id": "CHuwcm0P0TA", "question": "In the entire video, which sequence of scenes is correct?", "question_wo_referring_query": ", which sequence of scenes is correct?", "candidates": ["First appears a statue of poet Ferdowsi, then an illustration from 'The Book of Kings', and lastly an open copy of 'The Book of Kings'.", "First appears an open copy of 'The Book of Kings', then a statue of poet Ferdowsi, and lastly an illustration from 'The Book of Kings'.", "First appears an illustration from 'The Book of Kings', then an open copy of 'The Book of Kings', and lastly a statue of poet Ferdowsi.", "First appears an illustration from 'The Book of Kings', then a statue of poet Ferdowsi, and lastly an open copy of 'The Book of Kings'."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "CHuwcm0P0TA_0", "video_path": "CHuwcm0P0TA.mp4", "subtitle_path": "CHuwcm0P0TA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 741, "duration": 9.01, "view_count": 115062}, {"video_id": "isHeTpMCMo4", "question": "In a thematic map of a geological area, different regions are shown in different colors\u2014red, gray, blue, and yellow as the main divisions. On one side, there is also a screen with white text descriptions. 
Which of the following subtitles did not simultaneously appear with the screen?", "question_wo_referring_query": "Which of the following subtitles did not appear simultaneously with the screen?", "candidates": ["snapping apart when this land was", "the remains of the land being folded and", "compressed to the point of breaking and", "now this is a ring Dyke notice just how"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "isHeTpMCMo4_0", "video_path": "isHeTpMCMo4.mp4", "subtitle_path": "isHeTpMCMo4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 366, "duration": 10.98, "view_count": 127680}, {"video_id": "1MANyt7PGyE", "question": "The video opens in a dim room, with a man who has black hair, wearing a simple white shirt with a black vest over it, and a black tie. What does this man change into at the end?", "question_wo_referring_query": "What clothes does this man change into at the end?", "candidates": ["The black vest changes into a white suit, and the black tie changes into a white tie.", "The simple white shirt changes into a complex white shirt, the black vest changes into a black long-sleeve suit, and the black tie changes into a white tie.", "The white shirt, the black vest changes into a black long-sleeve suit, and the black tie.", "The white simple shirt changes into a black shirt, the black vest, and the black tie changes into a white tie."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "1MANyt7PGyE_0", "video_path": "1MANyt7PGyE.mp4", "subtitle_path": "1MANyt7PGyE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 155, "duration": 13.97, "view_count": 11692}, {"video_id": "d08SixLsUZY", "question": "In the video, after the man wearing a white shirt and a grey suit jacket mentions 'a part of? 
I think that we're very bullish on,' how do his hand gestures change?", "question_wo_referring_query": "How do his hand gestures change?", "candidates": ["Crosses his hands in front of his chest", "Raises both hands above his head", "One hand is clenched into a fist, the other is flat", "Clenches his hands into fists in front of his chest"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "d08SixLsUZY_0", "video_path": "d08SixLsUZY.mp4", "subtitle_path": "d08SixLsUZY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 819, "duration": 14.02, "view_count": 8008}, {"video_id": "RrxblPkCi_Y", "question": "There are three scenes in the video. In one scene, there is a woman with short blonde hair wearing a white suit with a black shirt. In another scene, there is a woman in a study room wearing a black short-sleeve shirt, black headphones, and glasses. In the last scene, there is a man wearing a blue shirt and glasses. What is the woman in the black short-sleeve shirt doing?", "question_wo_referring_query": "What is the woman in the black short-sleeve shirt doing?", "candidates": ["She is running", "She is drinking water", "She is eating a hamburger", "She is exercising"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "RrxblPkCi_Y_0", "video_path": "RrxblPkCi_Y.mp4", "subtitle_path": "RrxblPkCi_Y_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 2962, "duration": 10.98, "view_count": 2254}, {"video_id": "cnNgrsraUI4", "question": "In the video, a person in a blue shirt is using a nail stuck in a rock to suspend the rock in mid-air on a muddy ground. 
What objects are around him?", "question_wo_referring_query": "What objects are around him?", "candidates": ["Overpass", "Airplane", "Rocks, nails, grass", "Car"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "cnNgrsraUI4_0", "video_path": "cnNgrsraUI4.mp4", "subtitle_path": "cnNgrsraUI4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1372, "duration": 8.01, "view_count": 73374}, {"video_id": "mt6RIfVIPCI", "question": "In the video, a short-haired woman wearing a yellow long-sleeved lab coat and black pants raises her right hand while speaking into a receiver with her left hand. Next to her sits a man wearing a blue jacket. When the subtitle mentions 'Nare figuring out how to go from this,' what item appears on the screen?", "question_wo_referring_query": "What item appears on the screen?", "candidates": ["a table", "2 bottles of Coke, 2 receivers", "a computer", "2 bottles of mineral water, a receiver, a chair"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "mt6RIfVIPCI_0", "video_path": "mt6RIfVIPCI.mp4", "subtitle_path": "mt6RIfVIPCI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1338, "duration": 12.01, "view_count": 1749}, {"video_id": "NIBzjuhawbs", "question": "On the yellow-striped table, there is also a yellow rolling pin and a pancake. 
What shape is this pancake?", "question_wo_referring_query": "What shape is this pancake?", "candidates": ["square", "circle", "oval", "rectangle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "NIBzjuhawbs_0", "video_path": "NIBzjuhawbs.mp4", "subtitle_path": "NIBzjuhawbs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 300, "duration": 8.98, "view_count": 1350544}, {"video_id": "A8tXYqm-JRY", "question": "On a grey desktop, a person is holding a glass bowl in his right hand, which contains yellow pieces. He is putting them into a juicer with his left hand. What is he holding?", "question_wo_referring_query": "What is he holding?", "candidates": ["Bananas", "Carrots", "Peaches", "Apples"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "A8tXYqm-JRY_0", "video_path": "A8tXYqm-JRY.mp4", "subtitle_path": "A8tXYqm-JRY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 22, "duration": 14.02, "view_count": 255334}, {"video_id": "4Wc6g113LG0", "question": "In the video, on the gray countertop, there's a yellow cutting board, on which there's a transparent glass bowl, the bowl contains flour, and 2 seasoning bottles. 
When the kettle first appears, what is the kettle used for?", "question_wo_referring_query": "What is the kettle used for?", "candidates": ["Pour the hot water from the kettle into the sink", "Boil water in the kettle", "Pour the hot water from the kettle into the transparent bowl", "Place the kettle on the table"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "4Wc6g113LG0_0", "video_path": "4Wc6g113LG0.mp4", "subtitle_path": "4Wc6g113LG0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 231, "duration": 13.0, "view_count": 4121}, {"video_id": "3a3ND86DTCk", "question": "When the phrase 'distinct look with deerskin breeches' appears, a soldier is wearing a coat with a short front and long back, primarily deep blue with white and yellow decorations, along with white pants. The soldier is holding a rifle. What action does this soldier take?", "question_wo_referring_query": "When the phrase 'distinct look with deerskin breeches' appears, a soldier is wearing a coat with a short front and long back, primarily deep blue with white and yellow decorations, along with white pants. The soldier is holding a rifle. What action does this soldier take?", "candidates": ["Holds the rifle with both hands in a standard pose", "Hands the rifle to someone else", "Puts the rifle on the ground", "Places the rifle beside his body"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "3a3ND86DTCk_0", "video_path": "3a3ND86DTCk.mp4", "subtitle_path": "3a3ND86DTCk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 146, "duration": 8.0, "view_count": 1440113}, {"video_id": "bDc_9M9mtds", "question": "On a grey table, there is a yellow cutting board with a knife on it, an iron plate beside it, and a piece of meat in the iron plate. 
In this video, which of these items appear first?", "question_wo_referring_query": "In this video, which of these items appear first?", "candidates": ["Table", "Knife", "Iron plate and meat", "Cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "bDc_9M9mtds_0", "video_path": "bDc_9M9mtds.mp4", "subtitle_path": "bDc_9M9mtds_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1007, "duration": 14.02, "view_count": 918795}, {"video_id": "muJx8OkFLDU", "question": "On a grassy hill, there's a statue of a person riding a horse. Before mentioning 'capture and execution', what event took place?", "question_wo_referring_query": "What event took place?", "candidates": ["Two people are dancing", "Two people are wrestling", "Two people are eating", "Two people are arm wrestling in a room"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "muJx8OkFLDU_0", "video_path": "muJx8OkFLDU.mp4", "subtitle_path": "muJx8OkFLDU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 209, "duration": 14.02, "view_count": 59862}, {"video_id": "GXDJMvTbu34", "question": "At the beginning of the video, an airplane appears in the blue sky over a green field. 
In what locations has it appeared?", "question_wo_referring_query": "In what locations has it appeared?", "candidates": ["In the blue and pink sky", "On the green field", "Inside the base", "Over the ocean"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "GXDJMvTbu34_0", "video_path": "GXDJMvTbu34.mp4", "subtitle_path": "GXDJMvTbu34_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 25, "duration": 10.0, "view_count": 2516916}, {"video_id": "zWvud_Ns0mA", "question": "Which subtitles appear along with the man at the beginning of the video wearing a white shirt, black suit jacket, and black tie?", "question_wo_referring_query": "Which subtitles appear along with?", "candidates": ["Let's head out together", "I can solve this", "I can figure this ladies just give", "Leave it to me"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "zWvud_Ns0mA_0", "video_path": "zWvud_Ns0mA.mp4", "subtitle_path": "zWvud_Ns0mA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 136, "duration": 13.01, "view_count": 235435}, {"video_id": "0UAuGXkS7TU", "question": "At the beginning of the video, with a white building in the background, there's a girl with loose hair wearing a blue top and carrying a school bag, holding a stuffed toy. 
What does the object in her hand change to?", "question_wo_referring_query": "What does the object in her hand change to?", "candidates": ["It changes to an apple", "It doesn't change, it's still a stuffed toy", "It changes to a toy", "It changes to a sandwich"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "0UAuGXkS7TU_0", "video_path": "0UAuGXkS7TU.mp4", "subtitle_path": "0UAuGXkS7TU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 40, "duration": 9.0, "view_count": 11828}, {"video_id": "IUsiS3NX4qM", "question": "When it is mentioned 'comes to the counter-offensive there was,' a woman wearing a pink top, a necklace, large metal earrings, and glasses with short hair appears. At this moment, what changes on the screen?", "question_wo_referring_query": "At this moment, what changes on the screen?", "candidates": ["A woman wearing a blue skirt appears", "No changes", "A new screen appears on the left side, showing a woman in a black skirt with wavy blonde hair", "Two other screens appear"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "IUsiS3NX4qM_0", "video_path": "IUsiS3NX4qM.mp4", "subtitle_path": "IUsiS3NX4qM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 329, "duration": 11.0, "view_count": 238384}, {"video_id": "XjBNilsGR7s", "question": "In the scene, on a sparsely populated street, there is a clothing store. There is a woman wearing a red top, jeans, and carrying a bag. What is this woman doing?", "question_wo_referring_query": "In the scene, on a sparsely populated street, there is a clothing store. There is a woman wearing a red top, jeans, and carrying a bag. 
What is this woman doing?", "candidates": ["Exercising", "Sitting and resting by the roadside", "Eating in a restaurant", "Walking normally on the street"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "XjBNilsGR7s_0", "video_path": "XjBNilsGR7s.mp4", "subtitle_path": "XjBNilsGR7s_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 370, "duration": 8.01, "view_count": 11093}, {"video_id": "o9SI0B7G7u8", "question": "When mentioning 'jaylan hernandez eli maya nuku jonahan' in a neat and tidy room with white walls, a woman with short curly hair wearing light-colored clothes, a necklace, glasses, and a ring appears. What objects are behind her?", "question_wo_referring_query": ", what objects are behind her?", "candidates": ["computer", "television", "treadmill", "2 yellow tables, a grey sofa"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "o9SI0B7G7u8_0", "video_path": "o9SI0B7G7u8.mp4", "subtitle_path": "o9SI0B7G7u8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 159, "duration": 13.98, "view_count": 1133}, {"video_id": "0FnwZ-tFXic", "question": "In a car, there is a woman wearing a white top and a headscarf. There is also a white car next to her car. What is she wearing?", "question_wo_referring_query": "What is she wearing?", "candidates": ["coat", "bathrobe", "sweater", "short sleeves"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "0FnwZ-tFXic_0", "video_path": "0FnwZ-tFXic.mp4", "subtitle_path": "0FnwZ-tFXic_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 160, "duration": 9.0, "view_count": 29622}, {"video_id": "c3OXeVA1Mg4", "question": "In a dining room, there are two tables of customers eating and the silhouette of a man in black clothes playing a violin. 
On the left side of the screen, there is another man wearing a white coat, black pants, with Mediterranean-style hair. What is he doing at this moment?", "question_wo_referring_query": "What is he doing at this moment?", "candidates": ["He is holding a menu introducing dishes to the customers", "He is sitting at the dining table eating", "He is carrying a tray serving dishes", "He is holding a violin and performing"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "c3OXeVA1Mg4_0", "video_path": "c3OXeVA1Mg4.mp4", "subtitle_path": "c3OXeVA1Mg4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 265, "duration": 8.0, "view_count": 488543}, {"video_id": "qW8O_sisf4I", "question": "In front of a small piece of woodland on yellow soil, there is a gray ship stranded in the blue sea. What event occurs afterwards?", "question_wo_referring_query": ", what event occurs afterwards?", "candidates": ["A group of people wearing black shoes and brown leg bands are running.", "A person wearing red shoes is standing still.", "A person wearing black shoes is standing still.", "A person wearing red shoes is running."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "qW8O_sisf4I_0", "video_path": "qW8O_sisf4I.mp4", "subtitle_path": "qW8O_sisf4I_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 98, "duration": 9.0, "view_count": 2067488}, {"video_id": "VMy90xfyrnc", "question": "In the screen, there is a black-haired woman wearing black clothes raising her hand and pointing to a rectangular wooden frame. Beside her, there is another woman with golden curls in black clothes standing. 
Among the following objects, which one appears first in the video?", "question_wo_referring_query": "Which of the following objects appears first in the video?", "candidates": ["All three appear simultaneously", "The rectangular wooden frame", "The woman with golden curls and short stature wearing black clothes", "The woman with black hair and tall stature wearing black clothes"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "VMy90xfyrnc_0", "video_path": "VMy90xfyrnc.mp4", "subtitle_path": "VMy90xfyrnc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 209, "duration": 12.01, "view_count": 208759}, {"video_id": "iYkCUPu-JuY", "question": "After mentioning 'Now the flag actually kinda tells a little bit of a story', a man wearing a grey short-sleeve shirt is sitting on a white chair. What does he do next?", "question_wo_referring_query": "What does he do next?", "candidates": ["He sits on the white chair in a daze.", "He introduces a cartoon.", "He leaves the room.", "He passionately explains the meaning of the colors on the flag, gesturing with both hands."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "iYkCUPu-JuY_0", "video_path": "iYkCUPu-JuY.mp4", "subtitle_path": "iYkCUPu-JuY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 45, "duration": 13.01, "view_count": 210248}, {"video_id": "xY77dfvTVpA", "question": "After the phrase \u201cincluding the longest river the thine\u201d appears, what shows up on the map of the meandering river?", "question_wo_referring_query": "What shows up on the map of the meandering river?", "candidates": ["Two pictures of multiple rivers and two English words.", "Two pictures of multiple rivers.", "A picture of a river with trees on both sides and white houses, along with an English word.", "Two English words."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", 
"level": "L2-Relation", "id": "xY77dfvTVpA_0", "video_path": "xY77dfvTVpA.mp4", "subtitle_path": "xY77dfvTVpA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 593, "duration": 10.98, "view_count": 1035457}, {"video_id": "dh1n54EStAE", "question": "Based on the video content, what is the sequence of events in the video?", "question_wo_referring_query": "What is the sequence of events in the video?", "candidates": ["Only a hand-drawn picture of a man appears.", "First appears a hand-drawn picture of a man, then appears a photo with a dense mix of colors.", "First appears a photo with a dense mix of colors, then appears a hand-drawn picture of a man.", "There is no sequence of events."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "dh1n54EStAE_0", "video_path": "dh1n54EStAE.mp4", "subtitle_path": "dh1n54EStAE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 151, "duration": 12.0, "view_count": 4447}, {"video_id": "XwEXO3jTh_g", "question": "At the beginning of the video, there is a beautiful horse standing inside a carriage. Part of its body is extending out of the window, and it's wearing horse gear, including a piece of cloth that covers its head, only leaving the eyes visible. Where else has this horse appeared?", "question_wo_referring_query": "Where else has this horse appeared?", "candidates": ["By the river", "In a snowy area", "On a grassy field surrounded by trees", "On a racetrack"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "XwEXO3jTh_g_0", "video_path": "XwEXO3jTh_g.mp4", "subtitle_path": "XwEXO3jTh_g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 254, "duration": 10.0, "view_count": 203285}, {"video_id": "RO3FMFJqz9o", "question": "A group of people are gathered in the desert, some are sitting, some are lying down, and one person is riding a horse. 
They are conversing, with account books scattered around them. The sky is a vast expanse of azure, and mountains are faintly visible in the distance. Which subtitles appear with this scene?", "question_wo_referring_query": "Which subtitles appear with this scene?", "candidates": ["Jewish cultures met together in medieval", "the region the Christian and Muslim and", "medieval feast that features the food of", "the region the Christian and Muslim and Jewish cultures met together in medieval"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "RO3FMFJqz9o_0", "video_path": "RO3FMFJqz9o.mp4", "subtitle_path": "RO3FMFJqz9o_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 33, "duration": 7.97, "view_count": 9912}, {"video_id": "quSCXTL6trw", "question": "On a brown table, there is a transparent pot. Initially, water is poured into the pot. When the lid is placed on the pot, what change occurs in its state?", "question_wo_referring_query": "What change occurs in its state?", "candidates": ["Boiling, with very few bubbles", "Boiling, with a lot of bubbles", "No change", "Color changed"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "quSCXTL6trw_0", "video_path": "quSCXTL6trw.mp4", "subtitle_path": "quSCXTL6trw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 125, "duration": 14.0, "view_count": 5606390}, {"video_id": "ntjn7uwZlEw", "question": "In the video, one screen displays a man wearing a light blue shirt with a dark blue suit and a black tie, while another screen shows a conference room with many people sitting on green sofas. 
What happens on the screen when the phrase 'you'll see that there are various' is mentioned?", "question_wo_referring_query": "What change happens on the screen?", "candidates": ["The screen goes blank", "One screen changes to show three men in suits holding a press conference", "Only the screen with the man in the blue suit remains", "No change happens"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "ntjn7uwZlEw_0", "video_path": "ntjn7uwZlEw.mp4", "subtitle_path": "ntjn7uwZlEw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 183, "duration": 10.0, "view_count": 15136}, {"video_id": "rGhrv03dsnc", "question": "Which animals have appeared on the yellow rock?", "question_wo_referring_query": "Which animals have appeared?", "candidates": ["Horses", "Frogs", "Scorpions", "Fish"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "rGhrv03dsnc_0", "video_path": "rGhrv03dsnc.mp4", "subtitle_path": "rGhrv03dsnc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 693, "duration": 13.98, "view_count": 321908}, {"video_id": "aBaWlz2vA5k", "question": "Mentioning 'And we are removing the mouth tip now. As you see there is a hole;' surrounded by a desk with white as the main color and red dotted accents, filled with wood. 
A man with short hair, wearing a white short-sleeved shirt and a wristwatch, what items did not appear in the scene?", "question_wo_referring_query": "What items did not appear in the scene?", "candidates": ["cigarette", "wristwatch", "computer", "metal bucket"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "aBaWlz2vA5k_0", "video_path": "aBaWlz2vA5k.mp4", "subtitle_path": "aBaWlz2vA5k_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 283, "duration": 13.01, "view_count": 395}, {"video_id": "z-D1DPYwn-w", "question": "In a room with grey walls and a painting hanging on it, there is a man with dark skin wearing a white shirt, a jacket over it, and a tie. What color is his jacket?", "question_wo_referring_query": "What color is his jacket?", "candidates": ["Red", "Black", "Rose", "Blue"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "z-D1DPYwn-w_0", "video_path": "z-D1DPYwn-w.mp4", "subtitle_path": "z-D1DPYwn-w_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 138, "duration": 9.01, "view_count": 23226}, {"video_id": "Tji64S7FT9A", "question": "When the phrase 'where we fall in love I don't know what' is mentioned, inside a clothing store, a lady wearing a light-colored shirt with a black dress over it, and a white necklace, is holding what kind of clothes in her hand?", "question_wo_referring_query": "what kind of clothes is she holding in her hand?", "candidates": ["red mesh dress", "gray wool dress", "red wool dress", "blue mesh dress"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Tji64S7FT9A_0", "video_path": "Tji64S7FT9A.mp4", "subtitle_path": "Tji64S7FT9A_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 640, "duration": 8.0, "view_count": 124526}, {"video_id": "g9nbRdLlaa0", "question": "In an oil painting where the entire frame is the 
face of a woman, what is slowly moving downward?", "question_wo_referring_query": ", what is slowly moving downward?", "candidates": ["A painting of the 'Madonna' with wide eyes, slightly open mouth, and droplets of diamond tears adorning the face.", "A painting of the 'Madonna' with wide eyes, slightly open mouth, and droplets of pearl tears adorning the face.", "A painting of the 'Madonna' with wide eyes, slightly open mouth, and droplets of tree oil tears adorning the face.", "A painting of the 'Madonna' with wide eyes, slightly open mouth, and droplets of golden tears adorning the face."], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "g9nbRdLlaa0_0", "video_path": "g9nbRdLlaa0.mp4", "subtitle_path": "g9nbRdLlaa0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 86, "duration": 8.01, "view_count": 12989}, {"video_id": "sCUgKS-O2T4", "question": "On the screen, there is a man wearing fitting swim trunks. He is taking pictures with his phone, surrounded by a beautiful blue-green water area and waterfalls. There is also a woman wearing a bikini waving her hand. There are texts like 'DON\u2019T FORGET TO SUBSCRIBE!', 'FOR MY LAST VIDEO CLICK HERE', 'TO LEARN HOW TO TRAVEL THE WORLD LIKE US CLICK HERE'. 
When 'Oh' is mentioned, what are the man and woman doing on the screen?", "question_wo_referring_query": "What are the man and woman doing on the screen?", "candidates": ["The man is holding up his phone to take pictures, and the woman is waving at the phone.", "They are sitting by the poolside chatting", "Swimming", "They are playing with a water ball"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "sCUgKS-O2T4_0", "video_path": "sCUgKS-O2T4.mp4", "subtitle_path": "sCUgKS-O2T4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 337, "duration": 11.0, "view_count": 55169}, {"video_id": "w4VoEKB6agI", "question": "On a pathway paved with several stone tiles, there are a few pots of flowers. A woman wearing a beige coat, black pants, and white sneakers is holding a black watering can and is watering the flowers in a green pot. After she finishes watering the flowers in the green pot, what does this woman do next?", "question_wo_referring_query": "What does this woman do next?", "candidates": ["She leaves the house.", "She goes to water the flowers in the white pot.", "She waters another pot of green flowers.", "She places the watering can on the ground."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "w4VoEKB6agI_0", "video_path": "w4VoEKB6agI.mp4", "subtitle_path": "w4VoEKB6agI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 153, "duration": 9.0, "view_count": 17581}, {"video_id": "QATKvMv2vds", "question": "After the word 'okay' appears, there is a white wall with five paintings on it, then there are four bookshelves full of books, and also a piano, a table with various objects on it. 
What is happening in this scene?", "question_wo_referring_query": "What is happening in this scene?", "candidates": ["A woman is watering flowers.", "A woman is running outside.", "A woman is playing with a cat.", "A woman is sitting on a chair eating."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "QATKvMv2vds_0", "video_path": "QATKvMv2vds.mp4", "subtitle_path": "QATKvMv2vds_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 56, "duration": 13.98, "view_count": 179951}, {"video_id": "YCy7QYzGTj4", "question": "A man wearing a gray short-sleeve shirt is sitting on the floor, holding a piece of paper. There are also two boxes on the floor. After the text 'this is really cool joe from penguin tasmania australia' appears, what appears in the top right corner of the screen?", "question_wo_referring_query": "What appears in the top right corner of the screen?", "candidates": ["The flag of the UK appears", "The flag of Australia appears on the screen", "The flag of the USA appears", "The man stands up"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "YCy7QYzGTj4_0", "video_path": "YCy7QYzGTj4.mp4", "subtitle_path": "YCy7QYzGTj4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 368, "duration": 8.98, "view_count": 151870}, {"video_id": "7WP0u6TtjXY", "question": "On a piece of paper with a yellow background, there is a white dragon with outstretched white wings. 
Where else has it appeared?", "question_wo_referring_query": "Where else has it appeared?", "candidates": ["By the river", "On an open book", "On the grass", "In the sky"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "7WP0u6TtjXY_0", "video_path": "7WP0u6TtjXY.mp4", "subtitle_path": "7WP0u6TtjXY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 172, "duration": 11.01, "view_count": 14728}, {"video_id": "CkvxbbtIyrA", "question": "In the video, after the 'Cherry tomatoes-250g' text appears on the yellow cutting board, what item appears above the cherry tomatoes?", "question_wo_referring_query": "What item appears above the cherry tomatoes?", "candidates": ["Knife", "Chopsticks", "Fork", "Tongs"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "CkvxbbtIyrA_0", "video_path": "CkvxbbtIyrA.mp4", "subtitle_path": "CkvxbbtIyrA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 51, "duration": 13.0, "view_count": 35817}, {"video_id": "1OEsz3ByhMc", "question": "When the man dressed in black and wearing sunglasses first appears on screen, and then appears for the second time in front of the green screen, how does his posture change?", "question_wo_referring_query": "How does the man's posture change?", "candidates": ["The first time, the man's posture is standing straight, and the second time, his legs are bent, his upper body is leaning back, and he strikes a dancing pose.", "The first time, the man\u2019s legs are bent, his upper body is leaning back, and he strikes a dancing pose. 
The second time, he has a normal upright posture.", "The first time, the man has a normal upright posture, and the second time, he is bent forward.", "The first time, the man is kneeling, and the second time, his legs are bent, his upper body is leaning back, and he strikes a dancing pose."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "1OEsz3ByhMc_0", "video_path": "1OEsz3ByhMc.mp4", "subtitle_path": "1OEsz3ByhMc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 360, "duration": 10.01, "view_count": 11382}, {"video_id": "_bqzwsM6eoQ", "question": "In a room with yellow walls covered with several maps, after a man wearing a yellow shirt and a green jacket says, 'In fact, they perfected the art of copying old maps to an impressive T,' what changes occur in the room?", "question_wo_referring_query": "What changes occur in the room?", "candidates": ["The man wearing the yellow shirt and green jacket turns into a man with slightly shorter hair wearing a white long-sleeve shirt.", "The jacket turns black.", "The shirt changes into a blue shirt.", "The jacket turns red."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "_bqzwsM6eoQ_0", "video_path": "_bqzwsM6eoQ.mp4", "subtitle_path": "_bqzwsM6eoQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 173, "duration": 14.0, "view_count": 3303108}, {"video_id": "-OpQ5RHnb_o", "question": "On a brown table, there is a glass bowl containing flour and other ingredients. A person wearing black gloves is holding the bowl with one hand and a whisk with the other. 
What did this person do with the ingredients in the glass bowl?", "question_wo_referring_query": ", what did this person do with the ingredients in the glass bowl?", "candidates": ["added water to mix", "put TM into the oven", "added other things to it", "mixed them evenly"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "-OpQ5RHnb_o_0", "video_path": "-OpQ5RHnb_o.mp4", "subtitle_path": "-OpQ5RHnb_o_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 110, "duration": 13.0, "view_count": 5581}, {"video_id": "Nut-J9bOkE0", "question": "In a European-style vintage bedroom, a naked woman is lying on a white sofa, wearing golden shoes on her feet. Beside her, there is another woman with a pale face, wearing white clothes and holding flowers. What objects appeared in the scene?", "question_wo_referring_query": "What objects appeared in the scene?", "candidates": ["Flowers, black cat, bracelet", "Mobile phone", "A full set of tea utensils", "Bookshelf"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "Nut-J9bOkE0_0", "video_path": "Nut-J9bOkE0.mp4", "subtitle_path": "Nut-J9bOkE0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 423, "duration": 14.0, "view_count": 510144}, {"video_id": "EQSYGHcIVaA", "question": "A mountain range stands across the entire view, with a lake in front of it. The sky displays different colors. 
At this moment, which colors make up the sky?", "question_wo_referring_query": "At this moment, which colors make up the sky?", "candidates": ["Yellow and black.", "Black and blue.", "Blue and pink.", "The left side of the sky is orange-colored clouds, the right side is black smoke, and the middle is blue."], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "EQSYGHcIVaA_0", "video_path": "EQSYGHcIVaA.mp4", "subtitle_path": "EQSYGHcIVaA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 732, "duration": 8.98, "view_count": 22250}, {"video_id": "femvIHkVQG8", "question": "In an office with white walls and an olive window frame door, there is a red wooden desk with a lamp and a desk phone on it. By the desk, there are three red wooden chairs. Which character in the scene is on the phone?", "question_wo_referring_query": "Which character in the scene is on the phone?", "candidates": ["A man wearing a khaki military uniform and a watch on his hand", "A gray-haired man wearing a red olive striped tie and a dark blue suit jacket", "A man wearing a military cap and a khaki military uniform", "A man without a hat, with black hair, wearing a khaki military uniform"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "femvIHkVQG8_0", "video_path": "femvIHkVQG8.mp4", "subtitle_path": "femvIHkVQG8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 296, "duration": 12.0, "view_count": 683377}, {"video_id": "3Y4USFYRLUE", "question": "A middle-aged man wearing a khaki coat, glasses, and sporting a mustache is standing in front of a wall with a framed tiger picture. There is a red sofa below the picture. 
What changes occur on the screen after he looks into the mirror?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["A group of leopards is resting on the grassland", "A group of leopards is hunting on the grassland", "A group of tigers is hunting on the grassland", "Two tigers are resting and playing on the grassland"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "3Y4USFYRLUE_0", "video_path": "3Y4USFYRLUE.mp4", "subtitle_path": "3Y4USFYRLUE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 517, "duration": 13.0, "view_count": 6382}, {"video_id": "jcynicsFo8A", "question": "In a green photo booth with people coming and going, after mentioning 'I didn't understand one thing they were,' what did the curly-haired man in a green short sleeve shirt do?", "question_wo_referring_query": "What did he do?", "candidates": ["He slung a blue backpack over both shoulders", "He slung a red backpack over one shoulder", "He slung a blue backpack over one shoulder", "He slung a red backpack over both shoulders"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "jcynicsFo8A_0", "video_path": "jcynicsFo8A.mp4", "subtitle_path": "jcynicsFo8A_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 323, "duration": 9.01, "view_count": 25182}, {"video_id": "wgSBtXNJjhs", "question": "In the video, two shirtless men are in front of a large building, holding a white banner with black letters. Between them, there are two women embracing each other. What happens on the screen after the phrase 'and she had a strong political conscience. In 1967 she began organising anti-war protests' is mentioned?", "question_wo_referring_query": "In the video, two shirtless men are in front of a large building, holding a white banner with black letters. Between them, there are two women embracing each other. 
What happens on the screen after the phrase 'and she had a strong political conscience. In 1967 she began organising anti-war protests' is mentioned?", "candidates": ["Two men embrace and cry", "Two men tear the banner apart", "A man with an American flag on his head holds a pen towards another person", "Two women tear the banner apart"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "wgSBtXNJjhs_0", "video_path": "wgSBtXNJjhs.mp4", "subtitle_path": "wgSBtXNJjhs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 778, "duration": 14.0, "view_count": 893489}, {"video_id": "EH1to-5hNWA", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there are two rows of children sitting in front of the large white structure, with a colored group photo of the teacher on the far right. Then, there's the large white structure with the surrounding white ruins collapsed and the ground in the middle eroded. Finally, there's a black-and-white photo of a row of children walking through a narrow path in front of the large white structure.", "First, there is a black-and-white photo of a row of children walking through a narrow path in front of a large white structure. Then, there's the large white structure with the surrounding white ruins collapsed and the ground in the middle eroded. Finally, there are two rows of children sitting in front of the large white structure, with a colored group photo of the teacher on the far right.", "First, there is a black-and-white photo of a row of children walking through a narrow path in front of a large white structure. Then, there are two rows of children sitting in front of the large white structure, with a colored group photo of the teacher on the far right. 
Finally, there's the large white structure with the surrounding white ruins collapsed and the ground in the middle eroded.", "First, there are two rows of children sitting in front of the large white structure, with a colored group photo of the teacher on the far right. Then, there's a black-and-white photo of a row of children walking through a narrow path in front of the large white structure. Finally, there's the large white structure with the surrounding white ruins collapsed and the ground in the middle eroded."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "EH1to-5hNWA_0", "video_path": "EH1to-5hNWA.mp4", "subtitle_path": "EH1to-5hNWA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 283, "duration": 10.01, "view_count": 6182}, {"video_id": "yJ7-30I0I4o", "question": "In a blue-gray background, a white emblem with a white crane holding grass in its beak, surrounded by flower petals, has been shown. Which subtitles have coexisted with this emblem?", "question_wo_referring_query": "In a blue-gray background, a white emblem with a white crane holding grass in its beak, surrounded by flower petals, has been shown. Which subtitles have coexisted with this emblem?", "candidates": ["here 1983 floats above the shield", "over the year 1960, here 1983 floats above the shield", "Hannah", "over the year 1960"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "yJ7-30I0I4o_0", "video_path": "yJ7-30I0I4o.mp4", "subtitle_path": "yJ7-30I0I4o_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 135, "duration": 10.01, "view_count": 217921}, {"video_id": "v6hmlzM0muE", "question": "Outdoors. One man is wearing a short sleeve shirt, jeans, and sunglasses. He is holding a piece of whiteboard with some English writing on it. 
There is a person in a green short-sleeved shirt with their back to the screen, and another person with a light green backpack is walking forward with their head down. Additionally, there is someone wearing white trousers pulling a child in a white top. What is not present in the scene?", "question_wo_referring_query": "What is not present in the scene?", "candidates": ["chair", "sunglasses", "computer", "backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "v6hmlzM0muE_0", "video_path": "v6hmlzM0muE.mp4", "subtitle_path": "v6hmlzM0muE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 141, "duration": 10.97, "view_count": 611762}, {"video_id": "MaAn2qeLucQ", "question": "A person is using black tongs to take garlic out of a hot oil pan, and there is a line of text on the screen saying 'Den Knoblauch entfernen'. What is the shape of the garlic in the video?", "question_wo_referring_query": "What is the shape of the garlic in the video?", "candidates": ["Round", "Rectangular", "At this time, the garlic is crushed and charred", "Square"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "MaAn2qeLucQ_0", "video_path": "MaAn2qeLucQ.mp4", "subtitle_path": "MaAn2qeLucQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 86, "duration": 14.0, "view_count": 23612}, {"video_id": "ZV1l0e1-9co", "question": "A woman wearing a blue shirt and a black hat is standing near the baggage claim area, with a suitcase beside her. When the phrase 'we are the first baggage work work work' is mentioned, what type of hat is the woman wearing?", "question_wo_referring_query": "A woman wearing a blue shirt and a black hat is standing near the baggage claim area, with a suitcase beside her. 
When the phrase 'we are the first baggage work work work' is mentioned, what type of hat is the woman wearing?", "candidates": ["Cowboy hat", "Fisherman's cap", "Woolen hat", "Baseball cap"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "ZV1l0e1-9co_0", "video_path": "ZV1l0e1-9co.mp4", "subtitle_path": "ZV1l0e1-9co_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 405, "duration": 14.02, "view_count": 56326}, {"video_id": "v76h3Mi8o1o", "question": "In front of her is a withered branch, behind her there are swarms of mosquitoes over the river, on her left side there is a woman standing half-submerged in water, wearing a red ribbon. What action did she take?", "question_wo_referring_query": "What action did she take?", "candidates": ["She took off her clothes", "She dove into the water", "She touched the top of her head with her left hand", "She touched the top of her head with her right hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "v76h3Mi8o1o_0", "video_path": "v76h3Mi8o1o.mp4", "subtitle_path": "v76h3Mi8o1o_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 320, "duration": 13.0, "view_count": 453337}, {"video_id": "KpiaMIhCmtQ", "question": "In a scene where there is a picture with an explosion and thick smoke on the left, next to it is an image of a man wearing black clothes with a stubble, and there's a red and white subtitle strip at the bottom, what is he doing at this moment?", "question_wo_referring_query": ", what is he doing at this moment?", "candidates": ["He is facing the camera introducing the news.", "He turns his head to introduce the news.", "He is raising his left hand to indicate looking at the picture on the left.", "He is adjusting his microphone."], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "KpiaMIhCmtQ_0", "video_path": 
"KpiaMIhCmtQ.mp4", "subtitle_path": "KpiaMIhCmtQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 273, "duration": 14.0, "view_count": 102897}, {"video_id": "-PuFm80u-wo", "question": "In a studio with a whiteboard, with a woman in blue short sleeves sitting on the left, and on the right is a table with a man and a woman in a gray and white studio, what did the woman in blue short sleeves do when 'Israel but has been met with resistance' was mentioned?", "question_wo_referring_query": "What did the woman in blue short sleeves do?", "candidates": ["She took off her blue coat", "She is holding a report in her left hand", "She took the report handed over by the person opposite her", "She stood up in front of the whiteboard"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "-PuFm80u-wo_0", "video_path": "-PuFm80u-wo.mp4", "subtitle_path": "-PuFm80u-wo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 546, "duration": 14.0, "view_count": 10282}, {"video_id": "zgyfXq2p0eM", "question": "On a small path lined with lush green leaves, a curly-haired woman wearing a red jacket and a gray long skirt is standing with her back to the camera. 
What was she doing before this moment?", "question_wo_referring_query": "What was she doing before this moment?", "candidates": ["She was looking down at her gloves", "She was looking down at a book", "She was looking at the mirror from the right side of the frame", "She was looking at the mirror from the left side of the frame"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "zgyfXq2p0eM_0", "video_path": "zgyfXq2p0eM.mp4", "subtitle_path": "zgyfXq2p0eM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 272, "duration": 10.0, "view_count": 5482}, {"video_id": "lW81aUNRKyA", "question": "On a blue table, there are two white objects placed on a metal plate, and there is also an incense stick. A person is holding a matchstick above the incense stick. What happens after mentioning 'back to each day'?", "question_wo_referring_query": "What happens after mentioning 'back to each day'?", "candidates": ["The incense stick is extinguished.", "The incense stick is moved to another location.", "The incense stick is lit.", "The incense stick is given to someone else."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "lW81aUNRKyA_0", "video_path": "lW81aUNRKyA.mp4", "subtitle_path": "lW81aUNRKyA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 64, "duration": 11.0, "view_count": 4198}, {"video_id": "koF4WYOYg6o", "question": "Many people wearing matching outfits walk by on the road, and the screen shows the text 'THE WAR WAS FAR FROM OVER'. 
After the phrase 'government had been asserting' is mentioned, who appears?", "question_wo_referring_query": "Who appears after the phrase 'government had been asserting' is mentioned?", "candidates": ["A man wearing a black shirt and white suit jacket", "A man wearing a black shirt and black suit", "A man wearing a blue shirt and white suit jacket", "A man wearing a white shirt, black suit jacket, and a tie"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "koF4WYOYg6o_0", "video_path": "koF4WYOYg6o.mp4", "subtitle_path": "koF4WYOYg6o_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 604, "duration": 10.0, "view_count": 15020}, {"video_id": "VsZ42fQlxqs", "question": "A sculpture with a bronze-colored body and gold hair sitting on a white platform \u2013 in which of the following places has this sculpture NOT appeared?", "question_wo_referring_query": "A sculpture with a bronze-colored body and gold hair sitting on a white platform \u2013 in which of the following places has this sculpture NOT appeared?", "candidates": ["On a white rectangular platform", "Inside a transparent glass cover", "On a white square platform", "On a transparent rectangular platform"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "VsZ42fQlxqs_0", "video_path": "VsZ42fQlxqs.mp4", "subtitle_path": "VsZ42fQlxqs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 105, "duration": 12.01, "view_count": 8427}, {"video_id": "G0uowo1FfE4", "question": "In the scene, a man in military uniform is holding a long gun horizontally placed on a small pile of many sandbags on the right side. 
This man and which of the following subtitles have appeared together?", "question_wo_referring_query": ", this man and which of the following subtitles have appeared together?", "candidates": ["which resulted in better accuracy to ", "fire the soldier pulled the auxiliary", "own ", "made of steel with a buttstock of its"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "G0uowo1FfE4_0", "video_path": "G0uowo1FfE4.mp4", "subtitle_path": "G0uowo1FfE4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 254, "duration": 9.0, "view_count": 579384}, {"video_id": "JPoQeuWybIo", "question": "In a white room, there is a yellow-orange painting and two black-and-white paintings hanging on the walls. On the right side of the screen, there is also a woman with curly hair wearing a gray short-sleeve shirt. What was the final change that happened to the yellow-orange painting?", "question_wo_referring_query": "What was the final change that happened to the yellow-orange painting?", "candidates": ["It was placed in a storage room.", "It was placed on a wall for a special exhibit on its own.", "It was placed on a wall along with the black-and-white paintings for a special exhibit.", "It was taken down and admired by the woman with curly hair."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "JPoQeuWybIo_0", "video_path": "JPoQeuWybIo.mp4", "subtitle_path": "JPoQeuWybIo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 135, "duration": 11.01, "view_count": 153906}, {"video_id": "7ZxC4ilhhSU", "question": "On a red-brown table, a person wearing black gloves holds a scallion in their left hand and a knife in their right hand. 
After mentioning 'Chop an onion finely,' what change happens to the scallion?", "question_wo_referring_query": "what change happens to the scallion?", "candidates": ["The scallion is cut into pieces with the knife", "The scallion is chopped finely with the knife", "The scallion is put into the sink to wash", "The scallion is put into the pot to cook"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "7ZxC4ilhhSU_0", "video_path": "7ZxC4ilhhSU.mp4", "subtitle_path": "7ZxC4ilhhSU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 54, "duration": 11.0, "view_count": 4314}, {"video_id": "s1p5cDRA9F0", "question": "In the sky filled with dense clouds, there is a green airplane. Below it are many buildings. What is this airplane doing?", "question_wo_referring_query": "What is this airplane doing?", "candidates": ["Continuing to fly forward in the sky", "Attacking the buildings below", "Dropping supplies downward", "Preparing to land at a certain airfield"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "s1p5cDRA9F0_0", "video_path": "s1p5cDRA9F0.mp4", "subtitle_path": "s1p5cDRA9F0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 289, "duration": 10.0, "view_count": 1885872}, {"video_id": "aYiHuNqlhkg", "question": "In a background of yellow and grey, there are two security guards wearing the same clothes, another man wearing black clothes with a long scarf around his neck. 
What objects appear in the scene?", "question_wo_referring_query": "What objects appear in the scene?", "candidates": ["Scarf, helmet", "Headset", "Watch", "Necklace"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "aYiHuNqlhkg_0", "video_path": "aYiHuNqlhkg.mp4", "subtitle_path": "aYiHuNqlhkg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 214, "duration": 13.0, "view_count": 2730571}, {"video_id": "NPirOqWk0Zo", "question": "On a wooden table, there is a plate containing bean paste, mushrooms, meat, and other foods. A person is using chopsticks to pick up which food from the plate?", "question_wo_referring_query": ", using chopsticks, which food is being picked up from the plate?", "candidates": ["Bean paste", "Carrots", "Mushrooms", "Meat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "NPirOqWk0Zo_0", "video_path": "NPirOqWk0Zo.mp4", "subtitle_path": "NPirOqWk0Zo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 376, "duration": 13.0, "view_count": 676274}, {"video_id": "CrXJTCQckFg", "question": "After mentioning 'the game tried robbers National Bank', in front of the red building, there are two masked people riding horses with money bags on their backs. 
What are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["They are having a horse race", "They are shooting backward while holding guns", "They are washing the horses", "They are feeding the horses"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "CrXJTCQckFg_0", "video_path": "CrXJTCQckFg.mp4", "subtitle_path": "CrXJTCQckFg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 195, "duration": 14.0, "view_count": 2869885}, {"video_id": "9jGX11CNxzw", "question": "After mentioning 'fight our Comer some of the victims are,' what action did the white-haired man wearing a white shirt with a dark blue suit do?", "question_wo_referring_query": "What action did he do?", "candidates": ["He touched his eye", "He readjusted his collar", "He licked his lips", "He crossed his hands in front of his chest"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "9jGX11CNxzw_0", "video_path": "9jGX11CNxzw.mp4", "subtitle_path": "9jGX11CNxzw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 260, "duration": 11.0, "view_count": 10294}, {"video_id": "e3R9jYQhexM", "question": "In the video, a man wearing a gray short-sleeve shirt, a necklace, and sunglasses is seen smiling. The text 'So right now we're in this tuk tuk' appears on the screen. 
What happens after he mentions 'cafes to e ar to get some work done so'?", "question_wo_referring_query": "What happens after that?", "candidates": ["A driver wearing a red shirt appears.", "The man gets out of the vehicle.", "The man goes to a restaurant.", "The man starts running after getting off the vehicle."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "e3R9jYQhexM_0", "video_path": "e3R9jYQhexM.mp4", "subtitle_path": "e3R9jYQhexM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 883, "duration": 9.01, "view_count": 90711}, {"video_id": "PoeLQs_h4hI", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which sequence of events below is correct?", "candidates": ["First, there is a big lake with many land masses on it. Then, the rock lava erupts and falls on the surface of the black rocks. Finally, the state of the red lava during the eruption is shown.", "First, the rock lava erupts and falls on the surface of the black rocks. Then, the state of the red lava during the eruption is shown. Finally, there is a big lake with many land masses on it.", "First, there is a big lake with many land masses on it. Finally, the rock lava erupts and falls on the surface of the black rocks.", "First, there is a big lake with many land masses on it. Then, the state of the red lava during the eruption is shown. 
Finally, the rock lava erupts and falls on the surface of the black rocks."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "PoeLQs_h4hI_0", "video_path": "PoeLQs_h4hI.mp4", "subtitle_path": "PoeLQs_h4hI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 131, "duration": 9.98, "view_count": 9127}, {"video_id": "oeoaPxXq0E4", "question": "A soldier wearing a yellow helmet, gray armor, holding a sword in one hand and a red shield in the other, appears alongside which subtitles?", "question_wo_referring_query": "and appears alongside which subtitles?", "candidates": ["conflicts between the Romans and the", "Germans", "Romans", "conflicts between the Romans and the Germans"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "oeoaPxXq0E4_0", "video_path": "oeoaPxXq0E4.mp4", "subtitle_path": "oeoaPxXq0E4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 155, "duration": 8.01, "view_count": 4618}, {"video_id": "H2Lgs6xmCUg", "question": "In an auditorium, there is a curly-haired woman wearing a white inner shirt and a black outer jacket, walking sideways to the left of the screen. Before she mentioned the 'homophobic bigoted anti-semitic chatbot,' what did she change?", "question_wo_referring_query": "In an auditorium, there is a curly-haired woman wearing a white inner shirt and a black outer jacket, walking sideways to the left of the screen. 
Before she mentioned the 'homophobic bigoted anti-semitic chatbot,' what did she change?", "candidates": ["Her right hand was raised higher than her left hand as a fist", "Her left hand was raised higher than her right hand as a fist", "She raised both of her hands", "She adjusted her glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "H2Lgs6xmCUg_0", "video_path": "H2Lgs6xmCUg.mp4", "subtitle_path": "H2Lgs6xmCUg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 792, "duration": 9.98, "view_count": 225}, {"video_id": "Wtt4w4c97s4", "question": "In an area with a forest behind and a ruined site with collapsed red bricks in front, an elderly man wearing a dark blue coat and a hat is standing on one side. Which of the following objects appeared?", "question_wo_referring_query": "In an area with a forest behind and a ruined site with collapsed red bricks in front, an elderly man wearing a dark blue coat and a hat is standing on one side. Which of the following objects appeared?", "candidates": ["Abandoned TV", "Discarded computer", "White hat", "Dark blue hat"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "Wtt4w4c97s4_0", "video_path": "Wtt4w4c97s4.mp4", "subtitle_path": "Wtt4w4c97s4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 37, "duration": 11.0, "view_count": 2989}, {"video_id": "2No9bWWXgG4", "question": "In front of two green trees, there is a man with short black hair pointing at the camera. 
What is he wearing when he mentions 'it was shot in Japan a few months ago'?", "question_wo_referring_query": "What is he wearing?", "candidates": ["He is wearing a black long-sleeve shirt.", "He is wearing a black short-sleeve shirt.", "He is wearing a yellow short-sleeve shirt.", "He is wearing a yellow long-sleeve shirt."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "2No9bWWXgG4_0", "video_path": "2No9bWWXgG4.mp4", "subtitle_path": "2No9bWWXgG4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 7, "duration": 12.01, "view_count": 788920}, {"video_id": "7R8JhXf1b8g", "question": "In a small, narrow bedroom, there is a desk cluttered with scattered items, and a white wall with a flag of blue, white, and red stripes hanging on it. The man on the right is looking at a man lying on the chair laughing. Who is the man lying on the chair?", "question_wo_referring_query": "Who is the man lying on the chair?", "candidates": ["A man wearing an olive green short sleeve shirt with red and black pattern", "A man wearing a black and red tie-dye short sleeve shirt with short hair", "A man wearing a gray-black long sleeve shirt and black pants, slightly messy hair", "A man wearing a gray short sleeve shirt with red and black pattern"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "7R8JhXf1b8g_0", "video_path": "7R8JhXf1b8g.mp4", "subtitle_path": "7R8JhXf1b8g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 96, "duration": 13.98, "view_count": 215349}, {"video_id": "ojn5Z7GnjJo", "question": "In a small car, the person in the driver's seat, a woman wearing a white shirt, is holding a pair of chopsticks to her lips. Next to her in the passenger seat is another woman with curly hair wearing a blue coat, who is holding her index finger to her lips. 
What does the woman in the driver's seat do at this moment?", "question_wo_referring_query": "What does the woman in the driver's seat do at this moment?", "candidates": ["Is stirring food", "Is about to drop food", "Is putting food into her mouth", "Is feeding her friend"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "ojn5Z7GnjJo_0", "video_path": "ojn5Z7GnjJo.mp4", "subtitle_path": "ojn5Z7GnjJo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 457, "duration": 10.0, "view_count": 35795}, {"video_id": "OSccGTuEi5o", "question": "Under a wisteria flower, a man dressed in a dark green coat is making an upward movement with his right hand. What does he do next after this action?", "question_wo_referring_query": "What does he do after the screen changes?", "candidates": ["He waves at the camera.", "He turns his head to the right side of the screen and raises his right arm.", "He turns his head to the left side of the screen and raises his right arm.", "He turns his head to the left side of the screen and raises his left arm."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "OSccGTuEi5o_0", "video_path": "OSccGTuEi5o.mp4", "subtitle_path": "OSccGTuEi5o_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 90, "duration": 12.98, "view_count": 88016}, {"video_id": "-yCYJYLPh2g", "question": "In a scene with red title subtitles at the bottom and white subtitles with black text content above them, there are images of three people in sequence: a woman in a purple suit, a man in a blue shirt, and a woman in light-brown clothing. 
Who appears first?", "question_wo_referring_query": "Who appears first among the following people?", "candidates": ["The man wearing a blue shirt", "The woman with short black hair wearing a purple suit jacket", "The man wearing a purple shirt", "The woman with short blond hair wearing a light-brown jacket"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "-yCYJYLPh2g_0", "video_path": "-yCYJYLPh2g.mp4", "subtitle_path": "-yCYJYLPh2g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 150, "duration": 11.0, "view_count": 236485}, {"video_id": "DR0txTCd_Y4", "question": "In a red tomato pot, in the frame with green side dishes, what object appears on the screen before the phrase 'sways in' is mentioned?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["A hand wearing a rubber glove holding a silver knife appears", "A hand wearing a black glove holding a silver knife appears", "A hand wearing a transparent glove holding a silver knife appears", "A hand wearing a white glove holding a silver knife appears"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "DR0txTCd_Y4_0", "video_path": "DR0txTCd_Y4.mp4", "subtitle_path": "DR0txTCd_Y4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 325, "duration": 8.0, "view_count": 1892}, {"video_id": "Mvher-mKt1Y", "question": "In the vast open space, there is a man dressed entirely in blue, holding a helmet in his hand and carrying a parachute on his back. 
He has never existed together with which subtitles?", "question_wo_referring_query": "He has never existed together with which subtitles?", "candidates": ["wanted to experience just", "of course", "yeah buddy what'd you think of that", "it's like everything that i've ever"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "Mvher-mKt1Y_0", "video_path": "Mvher-mKt1Y.mp4", "subtitle_path": "Mvher-mKt1Y_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 215, "duration": 13.01, "view_count": 4647424}, {"video_id": "8053iFzH05w", "question": "In the Milky Way filled with countless twinkling stars, there is a meteor emitting firelight on the right side of the screen and a star shining blue light on the left side. What is the final transformation of the meteor emitting firelight?", "question_wo_referring_query": "In the Milky Way filled with countless twinkling stars, there is a meteor emitting firelight on the right side of the screen and a star shining blue light on the left side. What is the final transformation of the meteor emitting firelight?", "candidates": ["The meteor emitting firelight stops moving", "The meteor emitting firelight moves towards the blue star", "Both celestial bodies move and eventually collide", "Neither of the two celestial bodies move"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "8053iFzH05w_0", "video_path": "8053iFzH05w.mp4", "subtitle_path": "8053iFzH05w_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 80, "duration": 11.01, "view_count": 7440}, {"video_id": "ernuAAsYuVA", "question": "A woman wearing a pearl bracelet feeds a black pig carrots on screen. 
After the subtitle mentions '[Music]', what change occurs to the pig?", "question_wo_referring_query": "What change occurs to the pig?", "candidates": ["The pig lies near a wooden pillar", "The pig lies near an iron pillar", "The pig walks near a wooden pillar", "The pig walks near an iron pillar"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "ernuAAsYuVA_0", "video_path": "ernuAAsYuVA.mp4", "subtitle_path": "ernuAAsYuVA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 221, "duration": 14.02, "view_count": 9063}, {"video_id": "cnKU_osbg3s", "question": "On the left is an oil painting of a man wearing red clothes and a blue cloak, and on the right is a scene of a man in a red and blue checkered coat. What action is the man in the red and blue checkered coat doing?", "question_wo_referring_query": "What action is the man in the red and blue checkered coat doing?", "candidates": ["His arms are crossed.", "His fists are clenched.", "His hands are naturally spread in front of him.", "His arms are naturally hanging down without any action."], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "cnKU_osbg3s_0", "video_path": "cnKU_osbg3s.mp4", "subtitle_path": "cnKU_osbg3s_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1242, "duration": 13.97, "view_count": 678600}, {"video_id": "dsDQVZXvTnM", "question": "In front of a blue wall, a man wearing a 'starry sky' jacket and a white hat is sitting at a table with many stacked wooden blocks. 
Which of the following objects is present in the scene?", "question_wo_referring_query": "Which of the following objects is present in the scene?", "candidates": ["red wooden blocks", "silver watch", "white wooden blocks", "black watch"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "dsDQVZXvTnM_0", "video_path": "dsDQVZXvTnM.mp4", "subtitle_path": "dsDQVZXvTnM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 146, "duration": 8.97, "view_count": 32845}, {"video_id": "puUcXsBJT-w", "question": "In a scene on a light-colored sofa, there are three adults sitting, with a child sandwiched between a man in a white shirt and a woman in a pink dress. Which item does not appear?", "question_wo_referring_query": "Which item does not appear?", "candidates": ["a green watch", "glasses with black frames", "an open white book", "a purple hair tie"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "puUcXsBJT-w_0", "video_path": "puUcXsBJT-w.mp4", "subtitle_path": "puUcXsBJT-w_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 642, "duration": 9.0, "view_count": 446775}, {"video_id": "9o_PpyMW1kU", "question": "In front of a large building, a long-haired woman wearing a lake blue coat is being interviewed with a microphone. 
What color is the microphone she is holding?", "question_wo_referring_query": "What color is the microphone she is holding?", "candidates": ["Holding a white microphone with 'cna' letters", "Holding a white microphone with 'nac' letters", "Holding a black microphone with 'nac' letters", "Holding a black microphone with 'cna' letters"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "9o_PpyMW1kU_0", "video_path": "9o_PpyMW1kU.mp4", "subtitle_path": "9o_PpyMW1kU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 54, "duration": 12.0, "view_count": 4550}, {"video_id": "2G_NEYOk2Gc", "question": "In the video, three people holding rifles are standing in a dense forest, and two more people holding rifles are crouched on the ground. Who are these people?", "question_wo_referring_query": "Who are these people in the video?", "candidates": ["Soldiers wearing blue uniforms and no hats", "Soldiers wearing blue uniforms and blue hats", "Soldiers wearing green uniforms and no hats", "Soldiers wearing green uniforms and green hats"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "2G_NEYOk2Gc_0", "video_path": "2G_NEYOk2Gc.mp4", "subtitle_path": "2G_NEYOk2Gc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 48, "duration": 11.0, "view_count": 2186500}, {"video_id": "FATjQmXxBm4", "question": "On a small boat, a man wearing a red knitted hat and a dark blue sailor's uniform is sitting down. 
What did he do when he appeared for the first time?", "question_wo_referring_query": "What did he do?", "candidates": ["He turned his head and communicated with his companions behind him", "He pointed behind him with his left hand", "He pointed behind him with his right hand", "He lowered his head and adjusted his clothing"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "FATjQmXxBm4_0", "video_path": "FATjQmXxBm4.mp4", "subtitle_path": "FATjQmXxBm4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 176, "duration": 9.0, "view_count": 55140}, {"video_id": "9GxGuisMcqQ", "question": "Under dim lighting, on a table with a glass bottle, a man is holding a book. When mentioning 'bringing more of an awareness on how things are changing,' what is he doing?", "question_wo_referring_query": "Under dim lighting, on a table with a glass bottle, a man is holding a book. When mentioning 'bringing more of an awareness on how things are changing,' what is he doing?", "candidates": ["He is flipping through the book", "He is about to close the book", "He is about to grab another book", "He is taking notes in the book with a pen"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "9GxGuisMcqQ_0", "video_path": "9GxGuisMcqQ.mp4", "subtitle_path": "9GxGuisMcqQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 214, "duration": 8.97, "view_count": 280797}, {"video_id": "_w9isEzYZGc", "question": "After a woman in a polka dot dress pours the side dish from an iron pan onto a plate using a wooden spatula, what does she do?", "question_wo_referring_query": "What does she do next?", "candidates": ["She uses a hand wearing a gold ring to press the dish with an iron spatula", "She uses a hand wearing a gold ring to press the dish with a wooden spatula", "She uses a hand wearing a silver ring to press the dish with a wooden spatula", "She uses a 
hand wearing a silver ring to press the dish with an iron spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "_w9isEzYZGc_0", "video_path": "_w9isEzYZGc.mp4", "subtitle_path": "_w9isEzYZGc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 215, "duration": 10.0, "view_count": 23654}, {"video_id": "UAusbJmRB0c", "question": "On a white map filled with the city, there are six blue drone icons, and different colors are used for labels on top of them. After mentioning 'And then there's the six airports, only two of which are in Greater London. Is it about time that changed?', what appears on this map?", "question_wo_referring_query": "What appears on this map?", "candidates": ["A piece of the map in the middle is marked green", "A piece of the map on the left is marked red", "A piece of the map on the left is marked green", "A piece of the map in the middle is marked red"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "UAusbJmRB0c_0", "video_path": "UAusbJmRB0c.mp4", "subtitle_path": "UAusbJmRB0c_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 379, "duration": 12.0, "view_count": 4272587}, {"video_id": "HpIPtsWH4KU", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there's a close-up of a man holding a blue box with a black pattern on it. Then, in a small room, a man wearing an orange short-sleeved shirt is holding a blue box with a black pattern on it and introducing it. Next, the man is holding a small black piece of paper in his right hand and a document in his left hand. Finally, the man looks into the camera and tidies his hair.", "First, there's a close-up of a man holding a blue box with a black pattern on it. 
Then, in a small room, a man wearing an orange short-sleeved shirt is holding a blue box with a black pattern on it and introducing it. Next, the man looks into the camera and tidies his hair. Finally, the man is holding a small black piece of paper in his right hand and a document in his left hand.", "First, in a small room, a man wearing an orange short-sleeved shirt is holding a blue box with a black pattern on it and introducing it. Then, there's a close-up of a man holding a blue box with a black pattern on it. Next, the man looks into the camera and tidies his hair. Finally, the man is holding a small black piece of paper in his right hand and a document in his left hand.", "First, in a small room, a man wearing an orange short-sleeved shirt is holding a blue box with a black pattern on it and introducing it. Then, there's a close-up of a man holding a blue box with a black pattern on it. Next, the man is holding a small black piece of paper in his right hand and a document in his left hand. Finally, the man looks into the camera and tidies his hair."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "HpIPtsWH4KU_0", "video_path": "HpIPtsWH4KU.mp4", "subtitle_path": "HpIPtsWH4KU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 173, "duration": 9.01, "view_count": 202268}, {"video_id": "NgRua4a3k2c", "question": "In a deep blue background, with a yellow title, and blue English subtitles at the bottom, there is also an image of a man with dreadlocks wearing an orange short sleeve on the right side. 
Which object has appeared in the scene?", "question_wo_referring_query": "Which object has appeared in the scene?", "candidates": ["Green object edit", "Orange object edit", "Red object edit", "Green short sleeve"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "NgRua4a3k2c_0", "video_path": "NgRua4a3k2c.mp4", "subtitle_path": "NgRua4a3k2c_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1383, "duration": 8.01, "view_count": 17490}, {"video_id": "0inb9q0kLmA", "question": "In a scene with a blue background as the main theme, with a design that has green in the corners, three overlapping photos pop up one after another in the center. Which of the following objects is involved?", "question_wo_referring_query": "Which of the following objects is involved?", "candidates": ["A photo of a small island terrain", "A black and white photo at the bottom left of the screen with a wooden house supported by pillars", "A photo showing the Earth's surface terrain", "A photo at the bottom left of the screen with a wooden house supported by pillars"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "0inb9q0kLmA_0", "video_path": "0inb9q0kLmA.mp4", "subtitle_path": "0inb9q0kLmA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 92, "duration": 13.01, "view_count": 514745}, {"video_id": "VYnKWt3y1oE", "question": "There is a large building made of red bricks in the scene. At the bottom, there is a small path surrounded by lush green leaves. What happens to this building at the end of the video?", "question_wo_referring_query": "There is a large building made of red bricks in the scene. At the bottom, there is a small path surrounded by lush green leaves. 
What happens to this building at the end of the video?", "candidates": ["The camera shows a close-up of detailed features", "Many people are inside admiring the view", "It gets a wash from a heavy rain", "The camera shows a wide shot of the entire building"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "VYnKWt3y1oE_0", "video_path": "VYnKWt3y1oE.mp4", "subtitle_path": "VYnKWt3y1oE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 80, "duration": 14.02, "view_count": 42400}, {"video_id": "-7467khTEQ4", "question": "In front of a dimly lit room, a woman with short olive hair, wearing a grey coat and a white scarf, was mentioned. What action did she take when referring to 'didn't have a place in more traditional'?", "question_wo_referring_query": "What action did this woman take?", "candidates": ["She tightly clenched both hands", "She clenched her left fist in front of her body and spread her right hand", "She clenched her right fist in front of her body and spread her left hand", "She placed both hands indicating she didn't understand"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "-7467khTEQ4_0", "video_path": "-7467khTEQ4.mp4", "subtitle_path": "-7467khTEQ4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 106, "duration": 10.01, "view_count": 1665}, {"video_id": "uG4bEC9A5u4", "question": "In front of a large street with various types of large shopping malls, a man with short brown hair, wearing a purple short-sleeve shirt and carrying a black backpack, after speaking to the camera, what did he do?", "question_wo_referring_query": "What did he do?", "candidates": ["He pointed the camera at the sign and introduced it.", "He picked up his phone and asked a girl across the wall for directions.", "He was hailing a taxi by the roadside to leave.", "He turned around and crossed the sidewalk."], "topic_category": 
"LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "uG4bEC9A5u4_0", "video_path": "uG4bEC9A5u4.mp4", "subtitle_path": "uG4bEC9A5u4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 452, "duration": 10.0, "view_count": 35634}, {"video_id": "rxKR2cHmlPY", "question": "In the scene, a man in a blue suit jacket and a woman wearing a red shawl are holding hands tightly. What is the final scene?", "question_wo_referring_query": "What is it?", "candidates": ["A woman in white and a woman wearing a blue-striped yellow dress are holding hands tightly.", "A woman in white and a woman wearing a yellow-striped blue dress are holding hands tightly.", "A woman in yellow and a woman wearing a blue-striped yellow dress are holding hands tightly.", "A woman in yellow and a woman wearing a yellow-striped blue dress are holding hands tightly."], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "rxKR2cHmlPY_0", "video_path": "rxKR2cHmlPY.mp4", "subtitle_path": "rxKR2cHmlPY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 771, "duration": 12.01, "view_count": 1616826}, {"video_id": "Q6QpSl6-CQw", "question": "In a small, narrow room, there are two layers of curtains in front of the window and a long table filled with many flower pots. 
After mentioning 'I asked all of you what made a person beautiful; and without fail, I got the words of compassion,' what event occurs on the screen?", "question_wo_referring_query": "What event occurs on the screen?", "candidates": ["A woman wearing a pink dress and a blue apron walks in holding a flowerpot with green plants.", "A woman wearing a blue dress and a pink apron walks in holding a flowerpot with red plants.", "A woman wearing a blue dress and a pink apron walks in holding a flowerpot with green plants.", "A woman wearing a pink dress and a blue apron walks in holding a flowerpot with red plants."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "Q6QpSl6-CQw_0", "video_path": "Q6QpSl6-CQw.mp4", "subtitle_path": "Q6QpSl6-CQw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 167, "duration": 11.0, "view_count": 250683}, {"video_id": "qol8-TRnqro", "question": "On a green field covered with many small flowers, different green trees grow in the background and on the right side. In the middle of the screen, there is a well cover on top of a dirt mound. 
After mentioning 'Japanese developed what they call,' what change occurs to the object?", "question_wo_referring_query": "What change occurs to the object?", "candidates": ["The well cover is opened by a Japanese soldier with a long gun from the inside.", "The well cover is opened by a Japanese soldier with a long gun from the outside.", "The well cover is opened by a child from the inside.", "The well cover is opened by a child from the outside."], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "qol8-TRnqro_0", "video_path": "qol8-TRnqro.mp4", "subtitle_path": "qol8-TRnqro_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 136, "duration": 9.0, "view_count": 2916940}, {"video_id": "-8iteLmTxwI", "question": "In a scene where the surface of a simulated moon, predominantly green with blue, yellow, and red dots, is mentioned, what changes occur on the screen after the phrase 'the moon it would appear that while'?", "question_wo_referring_query": "In a scene with a predominantly green surface of a simulated moon, decorated with blue, yellow, and red dots, what changes occur on the screen after mentioning 'the moon it would appear that while'?", "candidates": ["There are no changes on the predominantly green simulated moon surface with blue, yellow, and red dots.", "On the blue-background simulated moon surface with green dots, a yellow moon image appears on the left, and on the right, a systematically arranged lunar topography with black grid lines appears.", "On the blue-background simulated moon surface with green dots, a black moon image appears on the left, and on the right, a systematically arranged lunar topography with yellow grid lines appears.", "On the blue-background simulated moon surface with green dots, a yellow moon image appears on the right, and on the left, a systematically arranged lunar topography with black grid lines appears."], "topic_category": "KG-Knowledge-Geography", 
"question_category": "SAA", "level": "L2-Relation", "id": "-8iteLmTxwI_0", "video_path": "-8iteLmTxwI.mp4", "subtitle_path": "-8iteLmTxwI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1078, "duration": 8.01, "view_count": 446859}, {"video_id": "nuJ6QTZdvxM", "question": "As the screen transitions to a yellow coastline, the water exhibits a beautiful shimmering wave-like scene. What change does the water present in the screen?", "question_wo_referring_query": "What change does the water present in the screen?", "candidates": ["Liquefying", "Solidifying", "Evaporating", "Flowing"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "nuJ6QTZdvxM_0", "video_path": "nuJ6QTZdvxM.mp4", "subtitle_path": "nuJ6QTZdvxM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 172, "duration": 14.02, "view_count": 2349675}, {"video_id": "y6DP9oWHg9c", "question": "In the video, a woman wearing a yellow skirt and a red floral short sleeve top is tidying up in a room. What is the shape of the object she is holding while doing so?", "question_wo_referring_query": "What is the shape of the object she is holding while tidying up?", "candidates": ["Triangular", "Circular", "Square", "Round and pointy"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "y6DP9oWHg9c_0", "video_path": "y6DP9oWHg9c.mp4", "subtitle_path": "y6DP9oWHg9c_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 251, "duration": 7.97, "view_count": 536103}, {"video_id": "Ise9FpSolJE", "question": "When 'and now NeLie's turn it's white it's' is mentioned in the video, a man in a black long sleeve standing next to a blonde woman appears on the street. 
What object appears on the woman's head?", "question_wo_referring_query": "What object appears on this woman's head?", "candidates": ["jacket", "scarf", "sunglasses", "suit"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "Ise9FpSolJE_0", "video_path": "Ise9FpSolJE.mp4", "subtitle_path": "Ise9FpSolJE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 137, "duration": 12.01, "view_count": 461030}, {"video_id": "BhAl5mG0fNw", "question": "In a dimly lit room, there's a girl wearing headphones sitting at a round table. On the round table, there is a sandwich and a teacup. What color is the teacup on the round table?", "question_wo_referring_query": "What color is the teacup on the round table?", "candidates": ["Green", "Blue", "Yellow", "Black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "BhAl5mG0fNw_0", "video_path": "BhAl5mG0fNw.mp4", "subtitle_path": "BhAl5mG0fNw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 58, "duration": 9.98, "view_count": 232421}, {"video_id": "d92gSj5_R9I", "question": "When the subtitles mention 'work you pay income tax and you pay', there are two women on the screen, one wearing glasses and one not wearing glasses. What color is the clothing of the woman who is speaking?", "question_wo_referring_query": "What color is the clothing of the woman who is speaking?", "candidates": ["black", "purple", "white", "yellow"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "d92gSj5_R9I_0", "video_path": "d92gSj5_R9I.mp4", "subtitle_path": "d92gSj5_R9I_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 438, "duration": 12.0, "view_count": 39434}, {"video_id": "JOvLXKU6mB4", "question": "In the video, the man wearing a black short-sleeved shirt is adding various ingredients into a measuring cup while cooking. 
What is the last ingredient he adds to the measuring cup?", "question_wo_referring_query": "What is the last ingredient added to the measuring cup?", "candidates": ["Potato", "Tree leaves", "Golden needle mushroom", "Green vegetable"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "JOvLXKU6mB4_0", "video_path": "JOvLXKU6mB4.mp4", "subtitle_path": "JOvLXKU6mB4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 67, "duration": 11.01, "view_count": 1244401}, {"video_id": "b1a2njql5g8", "question": "In the video, there is a boy wearing red gloves and a black and white striped long-sleeve shirt looking through binoculars. Which store does he see through the binoculars?", "question_wo_referring_query": "Which store does he see through the binoculars?", "candidates": ["Hot Pot Restaurant", "Fast Food Restaurant", "McDonald's", "KFC"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "b1a2njql5g8_0", "video_path": "b1a2njql5g8.mp4", "subtitle_path": "b1a2njql5g8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 163, "duration": 12.98, "view_count": 54499}, {"video_id": "i46vUEjbAH0", "question": "In the video explanation, a man wearing a suit and tie is holding something. 
In the explanation, what is the first piece of clothing described that the man in the picture is wearing?", "question_wo_referring_query": "In the explanation, what is the first piece of clothing described that the man in the picture is wearing?", "candidates": ["a lab coat", "a suit", "short sleeves", "a shirt with unbuttoned sleeves"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "i46vUEjbAH0_0", "video_path": "i46vUEjbAH0.mp4", "subtitle_path": "i46vUEjbAH0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 33, "duration": 7.98, "view_count": 3444}, {"video_id": "Om1jvUzVAtE", "question": "Several pillars and a statue of a lion appear on the screen, and after the subtitle says 'miraculous period came about or how long,' five people appear. What symbols are those five people holding?", "question_wo_referring_query": "What symbols are those five people holding?", "candidates": ["Question mark", "Period", "Exclamation mark", "Comma"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "Om1jvUzVAtE_0", "video_path": "Om1jvUzVAtE.mp4", "subtitle_path": "Om1jvUzVAtE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 42, "duration": 14.02, "view_count": 561676}, {"video_id": "_Ji_SHMfvN4", "question": "In the video, a black-haired woman wearing earrings has red nail polish on her hand. 
When the subtitle mentions 'and then the hair SW lady she looks like,' what does she point at with her hand?", "question_wo_referring_query": "What does she point at with her hand?", "candidates": ["Door", "Cup", "Mobile phone", "Computer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "_Ji_SHMfvN4_0", "video_path": "_Ji_SHMfvN4.mp4", "subtitle_path": "_Ji_SHMfvN4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 265, "duration": 14.0, "view_count": 107169}, {"video_id": "VglM0-sgZbM", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there is a scene where a machine harvests wheat, then two people have a conversation, and finally, a photo appears.", "First, a photo appears, then a scene where a machine harvests wheat, and finally, it ends with two people having a conversation.", "First, there is a scene where a machine harvests wheat, then someone holds a photo, and finally, two people have a conversation.", "First, two people are shown in a conversation video, then a scene where a machine harvests wheat appears, and finally, the two people continue their conversation."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "VglM0-sgZbM_0", "video_path": "VglM0-sgZbM.mp4", "subtitle_path": "VglM0-sgZbM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 993, "duration": 12.98, "view_count": 86794}, {"video_id": "Y4R3pl44qU4", "question": "In the beginning of the video, there's a red-haired girl wearing a blue duckbill cap, blue armor, and a white short-sleeve shirt. 
Which subtitle does she appear with?", "question_wo_referring_query": "With which subtitle does she appear?", "candidates": ["have a good day", "nice day", "happy baby", "see you"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "Y4R3pl44qU4_0", "video_path": "Y4R3pl44qU4.mp4", "subtitle_path": "Y4R3pl44qU4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 52, "duration": 11.01, "view_count": 203102}, {"video_id": "eUKvCgExpFY", "question": "In the video, a blonde woman wearing a black coat and a gold necklace is speaking in a studio. What change happens to her hair at the beginning of the video?", "question_wo_referring_query": "What change happens to her hair at the beginning of the video?", "candidates": ["clenches fist", "shakes hands", "puts it down", "raises it"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "eUKvCgExpFY_0", "video_path": "eUKvCgExpFY.mp4", "subtitle_path": "eUKvCgExpFY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 434, "duration": 9.0, "view_count": 8026}, {"video_id": "NB8yZ_Ymzpg", "question": "In the video, outside the room, there is a man wearing a blue short-sleeve shirt and a pair of shorts. What is he doing outside the room?", "question_wo_referring_query": "What is he doing outside the room?", "candidates": ["Sleeping", "Jumping rope", "Painting", "Eating"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "NB8yZ_Ymzpg_0", "video_path": "NB8yZ_Ymzpg.mp4", "subtitle_path": "NB8yZ_Ymzpg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 632, "duration": 14.02, "view_count": 1041241}, {"video_id": "VLg4EFD9n40", "question": "In the video, there is a man wearing a khaki short-sleeve shirt with a bit of a beard talking in a room. 
How many doors are in the room he is talking in?", "question_wo_referring_query": "How many doors are in the room where he is talking?", "candidates": ["2", "1", "3", "4"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "VLg4EFD9n40_0", "video_path": "VLg4EFD9n40.mp4", "subtitle_path": "VLg4EFD9n40_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 65, "duration": 14.02, "view_count": 150603}, {"video_id": "_6yeRSlT5eU", "question": "In the scene, a blonde woman with a ponytail, wearing a dark blue long sleeve shirt, is eating at the table. What is the color of the pickup truck that appears in the background?", "question_wo_referring_query": "What is the color of the pickup truck that appears in the background?", "candidates": ["Black", "White", "Purple", "Green"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "_6yeRSlT5eU_0", "video_path": "_6yeRSlT5eU.mp4", "subtitle_path": "_6yeRSlT5eU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 116, "duration": 11.01, "view_count": 5566}, {"video_id": "QPFnhDY52EU", "question": "Two men are conversing on the screen. When the subtitle mentions 'well it's a huge country it's on the Red', what is the hairstyle of the man speaking on the right?", "question_wo_referring_query": "What is the hairstyle of the man speaking on the right?", "candidates": ["Long hair", "Shaggy", "Crew cut", "Buzz cut"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "QPFnhDY52EU_0", "video_path": "QPFnhDY52EU.mp4", "subtitle_path": "QPFnhDY52EU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 207, "duration": 14.0, "view_count": 6333}, {"video_id": "XbYnLqNzaJo", "question": "In the video, a man wearing a backpack is in a room, and his eyes are covered by something red. 
When the flag of Indonesia and the number 176 appear in the upper right corner, which country does he mention in his speech?", "question_wo_referring_query": "Which country does he mention in his speech?", "candidates": ["USA", "China", "UK", "Singapore"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "XbYnLqNzaJo_0", "video_path": "XbYnLqNzaJo.mp4", "subtitle_path": "XbYnLqNzaJo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 360, "duration": 9.98, "view_count": 1572739}, {"video_id": "nrXqGG1Pwwc", "question": "A painting appears on the screen with a man in front of the painting. When the subtitle mentions 'to imagine it just leaves that space for', what is the man in front of the painting doing?", "question_wo_referring_query": "What is the man in front of the painting doing?", "candidates": ["Sitting", "Crouching", "Lying down", "Standing and admiring"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "nrXqGG1Pwwc_0", "video_path": "nrXqGG1Pwwc.mp4", "subtitle_path": "nrXqGG1Pwwc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 190, "duration": 12.98, "view_count": 2714}, {"video_id": "EhMJV0wfHIs", "question": "In the room, there are many pictures on the wall, and a man wearing a dark blue short-sleeved shirt is speaking. While he is speaking, a national flag and another man wearing a medal appear in the top left corner. 
What does he do with his hand after finishing this matter?", "question_wo_referring_query": "What does he do with his hand after finishing this matter?", "candidates": ["Stretches outward", "Raises hand", "Clenches fist", "Shakes hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "EhMJV0wfHIs_0", "video_path": "EhMJV0wfHIs.mp4", "subtitle_path": "EhMJV0wfHIs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 401, "duration": 13.01, "view_count": 132074}, {"video_id": "oKAFW1Rt8eM", "question": "In the video, a long-haired woman with nail polish, wearing a striped long-sleeve shirt, is introducing clothes that keep her warm. Among the following materials, which one does she introduce first?", "question_wo_referring_query": "Among the following materials, which one does she introduce first?", "candidates": ["Silk", "Cotton", "HT", "Linen"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "oKAFW1Rt8eM_0", "video_path": "oKAFW1Rt8eM.mp4", "subtitle_path": "oKAFW1Rt8eM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 229, "duration": 11.0, "view_count": 44355}, {"video_id": "WFoWdI9x5oQ", "question": "On a day with blue skies and white clouds, a blond man wearing a black and white striped shirt is speaking. After the subtitle 'all speak really well and how do' appears, what action does the woman carrying a blue backpack in the scene take?", "question_wo_referring_query": "On a day with blue skies and white clouds, a blond man wearing a black and white striped shirt is speaking. 
After the subtitle 'all speak really well and how do' appears, what action does the woman carrying a blue backpack in the scene take?", "candidates": ["walks down", "lies down", "sits down", "runs"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "WFoWdI9x5oQ_0", "video_path": "WFoWdI9x5oQ.mp4", "subtitle_path": "WFoWdI9x5oQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 226, "duration": 9.98, "view_count": 1285661}, {"video_id": "xSd9TBJnaN0", "question": "The forest on the screen is filled with tall, lush trees. What appears after the explanation mentions \"world\"?", "question_wo_referring_query": "What appears then?", "candidates": ["hotel", "dining hall", "horse road", "library"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "xSd9TBJnaN0_0", "video_path": "xSd9TBJnaN0.mp4", "subtitle_path": "xSd9TBJnaN0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 882, "duration": 11.01, "view_count": 76305}, {"video_id": "s1uKMZH68Eg", "question": "Red substances are constantly boiling in the pot in the video. Spinach is added to the pot. Which subtitle appeared together with the spinach added at the beginning of the video?", "question_wo_referring_query": "Which subtitle appeared together with the spinach added at the beginning of the video?", "candidates": ["well", "a little", "in the room", "Spinach-100g"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "s1uKMZH68Eg_0", "video_path": "s1uKMZH68Eg.mp4", "subtitle_path": "s1uKMZH68Eg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 407, "duration": 8.0, "view_count": 39490}, {"video_id": "G5InIGJkYA4", "question": "In the video, there is an analysis being conducted inside a hall, followed by a few lines of subtitles. 
When the narration is explaining, what action is the statue inside the hall doing?", "question_wo_referring_query": "What action is the statue inside the hall doing?", "candidates": ["crouching", "sitting", "standing", "lying"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "G5InIGJkYA4_0", "video_path": "G5InIGJkYA4.mp4", "subtitle_path": "G5InIGJkYA4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 166, "duration": 12.98, "view_count": 7833}, {"video_id": "lDwGg2WSsAE", "question": "In the video, a female reporter is interviewing a male reporter who is outside from inside the broadcasting studio. What color clothes is the male reporter on the left side wearing?", "question_wo_referring_query": "What color clothes is the male reporter on the left side wearing?", "candidates": ["green", "black", "light purple", "yellow"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "lDwGg2WSsAE_0", "video_path": "lDwGg2WSsAE.mp4", "subtitle_path": "lDwGg2WSsAE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 709, "duration": 9.0, "view_count": 12796}, {"video_id": "vPKuOy7yO0s", "question": "In the room on the screen, a man wearing a blue long-sleeve shirt and jeans is sitting in the middle of a sofa. There is also a \"Seek Discomfort\" neon light in the background. 
When the subtitle says \"I guess I can just,\" how many lightning bolt cushions are there in the scene?", "question_wo_referring_query": "When the subtitle says \"I guess I can just,\" how many lightning bolt cushions are there in the scene?", "candidates": ["2", "4", "3", "5"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "vPKuOy7yO0s_0", "video_path": "vPKuOy7yO0s.mp4", "subtitle_path": "vPKuOy7yO0s_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 13, "duration": 14.02, "view_count": 1976923}, {"video_id": "rYTXXUN-uXY", "question": "When the screen switches to image two, there are several people riding on horses, holding swords in their hands, many people have fallen on the ground, and some are trying to escape. Which army is severely defeated in the scene?", "question_wo_referring_query": "Which army is severely defeated in the scene?", "candidates": ["Rome", "Arab Empire", "Sparta", "Crusaders"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "rYTXXUN-uXY_0", "video_path": "rYTXXUN-uXY.mp4", "subtitle_path": "rYTXXUN-uXY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 369, "duration": 9.01, "view_count": 11751}, {"video_id": "j6Uw6KwA9sA", "question": "In the evening, a short-haired man wearing a red short-sleeve shirt is standing on the main street, with a few cars and people behind him. 
What happens the first time a hand reaches out from off-screen toward this man in red?", "question_wo_referring_query": "What happens the first time a hand reaches out from off-screen toward the man in red?", "candidates": ["handshake", "fist bump", "bow", "high five"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "j6Uw6KwA9sA_0", "video_path": "j6Uw6KwA9sA.mp4", "subtitle_path": "j6Uw6KwA9sA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 312, "duration": 9.01, "view_count": 225727}, {"video_id": "46wlRYIDPwQ", "question": "In the video, it explains why earthquakes occur, and there are two screens displayed. When the subtitle mentions 'again for watching I'll see you all real', there is a white car on the road. What is the status of this car?", "question_wo_referring_query": "What is the status of this car?", "candidates": ["Stationary", "Turning", "Moving forward", "Reversing"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "46wlRYIDPwQ_0", "video_path": "46wlRYIDPwQ.mp4", "subtitle_path": "46wlRYIDPwQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 186, "duration": 8.01, "view_count": 11745}, {"video_id": "B7gvt7vNPh0", "question": "Many airplanes are dropping bombs and conducting bombardments on the screen. 
According to the explanation, what is the first disaster that appears in the video?", "question_wo_referring_query": "According to the explanation, what is the first disaster that appears in the video?", "candidates": ["Earthquake", "Typhoon", "Drought", "Fire"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "B7gvt7vNPh0_0", "video_path": "B7gvt7vNPh0.mp4", "subtitle_path": "B7gvt7vNPh0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 320, "duration": 8.0, "view_count": 401723}, {"video_id": "lRe6VJ2djDA", "question": "Which of the following sequence of events is correct?", "question_wo_referring_query": "Which of the following sequence of events is correct?", "candidates": ["First, a map appears, then a man stands up amidst applause, and finally, a stream appears.", "First, a stream appears, then a man stands up amidst applause, and finally, a map appears.", "First, a man stands up amidst applause, then a map appears, and finally, a stream appears at the end.", "First, a man stands up amidst applause, then a stream appears, and finally, a map appears."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "lRe6VJ2djDA_0", "video_path": "lRe6VJ2djDA.mp4", "subtitle_path": "lRe6VJ2djDA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 316, "duration": 13.0, "view_count": 122259}, {"video_id": "TpcoCn09e2k", "question": "In the video, a blonde woman in a pink long sleeve is talking inside a room with many plants in the background. 
Where else did the hand from the beginning of the video appear?", "question_wo_referring_query": "Where else did the hand from the beginning of the video appear?", "candidates": ["Inside the room", "Burger shop", "Cafe", "Milk tea shop"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "TpcoCn09e2k_0", "video_path": "TpcoCn09e2k.mp4", "subtitle_path": "TpcoCn09e2k_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 309, "duration": 9.01, "view_count": 188987}, {"video_id": "aZX8jPwbhEA", "question": "At the beginning of the video, there is a boy with glasses and messy hair. After returning to his room, he changes his hat. What changes about his hat?", "question_wo_referring_query": "What changes about his hat?", "candidates": ["Black to red", "Green to purple", "Blue to green", "Yellow to black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "aZX8jPwbhEA_0", "video_path": "aZX8jPwbhEA.mp4", "subtitle_path": "aZX8jPwbhEA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 312, "duration": 8.0, "view_count": 319}, {"video_id": "BtLh-HcFA8Y", "question": "In the video, there is a man wearing a blue suit interviewing a bald man with glasses in the studio. 
After the man on the left side of the screen finishes speaking, what change occurs on the screen?", "question_wo_referring_query": ", after the man on the left side of the screen finishes speaking, what change occurs on the screen?", "candidates": ["The two screens become three", "The two screens become one", "No change", "The two screens become four"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "BtLh-HcFA8Y_0", "video_path": "BtLh-HcFA8Y.mp4", "subtitle_path": "BtLh-HcFA8Y_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 367, "duration": 10.0, "view_count": 6088}, {"video_id": "k2NZVQzfAbo", "question": "The video explains the activities of infiltrating Japan by Orinpic and the escort ship, with a background of red and blue on the screen, including 'Suicide Attacks', 'Running up to Tanks with Mines', 'Suicide Boats', 'Manned Torpedoes Kaiten', and 'Kamikaze Planes'. How many times does the skull appear during the explanation?", "question_wo_referring_query": "How many times does the skull appear during the explanation?", "candidates": ["2", "4", "3", "1"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "k2NZVQzfAbo_0", "video_path": "k2NZVQzfAbo.mp4", "subtitle_path": "k2NZVQzfAbo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 614, "duration": 12.0, "view_count": 552860}, {"video_id": "vEt3jg2XTO4", "question": "Several pictures appear on the screen, and then a man starts talking in a room. 
When the subtitles mention 'world from evil and here we are today,' what kind of clothes is the man wearing?", "question_wo_referring_query": "What kind of clothes is the man wearing when he talks?", "candidates": ["A white short-sleeved shirt with blue and green markings in the middle", "A black suit", "A long-sleeved shirt", "A green jacket"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "vEt3jg2XTO4_0", "video_path": "vEt3jg2XTO4.mp4", "subtitle_path": "vEt3jg2XTO4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1159, "duration": 10.97, "view_count": 852521}, {"video_id": "kdAyiH6on7g", "question": "The video explains weapons in Egyptian history and shows a person wearing a blue jacket and a black hat standing in front of a furnace. What is the state of the furnace at this time?", "question_wo_referring_query": "What is the state of the furnace at this time?", "candidates": ["Burning flames", "Flames extinguished", "No change", "Billowing black smoke"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "kdAyiH6on7g_0", "video_path": "kdAyiH6on7g.mp4", "subtitle_path": "kdAyiH6on7g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 133, "duration": 14.0, "view_count": 2362}, {"video_id": "THuMM74JIZQ", "question": "There is a kitten on the screen, with some small rocks and a large stone behind it. 
When the subtitle mentions 'by gum pica,' what is the kitten doing?", "question_wo_referring_query": "What is the kitten doing?", "candidates": ["Sitting on the ground", "Lying on the ground", "Running", "Biting someone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "THuMM74JIZQ_0", "video_path": "THuMM74JIZQ.mp4", "subtitle_path": "THuMM74JIZQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 340, "duration": 14.02, "view_count": 58943}, {"video_id": "hX4s1ZLW_PI", "question": "In the video, there is a man in long sleeves sitting on the ground. In the background, there is a sofa and a bookshelf with neatly arranged books. On the table, there are items like a tablet. After the subtitle mentions 'What?', what is the man doing in the next scene?", "question_wo_referring_query": "What is the man doing in the next scene?", "candidates": ["Standing and cheering", "Hugging someone", "Lying down and sleeping", "Sitting and talking on the phone"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "hX4s1ZLW_PI_0", "video_path": "hX4s1ZLW_PI.mp4", "subtitle_path": "hX4s1ZLW_PI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 458, "duration": 8.0, "view_count": 2947521}, {"video_id": "ekWVpieoOGo", "question": "In the video, a man in a black suit and tie is giving a speech in a broadcast room, with several national flags behind him. 
Which word appears on the screen at the same time as his hand?", "question_wo_referring_query": "Which word appears on the screen at the same time as his hand?", "candidates": ["France welcomes", "and", "ok", "tonight and the Grateful Nation of"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "ekWVpieoOGo_0", "video_path": "ekWVpieoOGo.mp4", "subtitle_path": "ekWVpieoOGo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1792, "duration": 12.0, "view_count": 2345}, {"video_id": "nrUqKdtQGdU", "question": "The screen is explaining the rent of guest houses. There are pictures of the accommodation shown on the screen, a total of five, depicting a clean and tidy living room and bedroom area. The price of the accommodation is displayed in the bottom right corner. When the subtitles mention 'guest houses that are kind of like B&Bs', how many pictures does the screen change to?", "question_wo_referring_query": "The screen is explaining the rent of guest houses. There are pictures of the accommodation shown on the screen, a total of five, depicting a clean and tidy living room and bedroom area. The price of the accommodation is displayed in the bottom right corner. When the subtitles mention 'guest houses that are kind of like B&Bs', how many pictures does the screen change to?", "candidates": ["4", "5", "3", "2"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "nrUqKdtQGdU_0", "video_path": "nrUqKdtQGdU.mp4", "subtitle_path": "nrUqKdtQGdU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 806, "duration": 13.01, "view_count": 23781}, {"video_id": "h6QyxeNRXVs", "question": "In the video, a man is handling processed meats outside in the bright daylight. 
What does he do while handling the processed meats?", "question_wo_referring_query": "What does he do while handling the processed meats?", "candidates": ["Threw it away", "Cooked it", "Sprinkled some powder", "Ate it"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "h6QyxeNRXVs_0", "video_path": "h6QyxeNRXVs.mp4", "subtitle_path": "h6QyxeNRXVs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 411, "duration": 14.02, "view_count": 11618874}, {"video_id": "LCj64JdmU0M", "question": "In the scene, there is a woman with short blonde hair visiting a museum. What is she wearing when she is looking at the exhibits?", "question_wo_referring_query": "What is she wearing when she is looking at the exhibits?", "candidates": ["Suit", "Short-sleeve", "Tank top", "Black long-sleeve"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "LCj64JdmU0M_0", "video_path": "LCj64JdmU0M.mp4", "subtitle_path": "LCj64JdmU0M_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 117, "duration": 11.01, "view_count": 3439}, {"video_id": "DGyg7JoVX50", "question": "In the video, during the explanation of volcanoes in planets and the understanding of volcanology, what is the color of the planet being discussed?", "question_wo_referring_query": "What is the color of the planet being discussed in the explanation?", "candidates": ["White", "Red-orange", "Yellow", "Black"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "DGyg7JoVX50_0", "video_path": "DGyg7JoVX50.mp4", "subtitle_path": "DGyg7JoVX50_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 195, "duration": 12.98, "view_count": 2649}, {"video_id": "Eo3OMl_GcgQ", "question": "How many screen clips were played in the video, and when 'future videos' was mentioned in the subtitles, what planet appeared on the screen after 'DAY 1'?", 
"question_wo_referring_query": "What planet appeared on the screen after 'DAY 1'?", "candidates": ["Mars", "Jupiter", "Earth", "Moon"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "Eo3OMl_GcgQ_0", "video_path": "Eo3OMl_GcgQ.mp4", "subtitle_path": "Eo3OMl_GcgQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 41, "duration": 9.01, "view_count": 1168}, {"video_id": "e63UQXYMH4g", "question": "During the award ceremony on screen, when the subtitle mentioned 'Native American actress to win Golden,' what action did the actress and a man with a beard take?", "question_wo_referring_query": "What action did the actress and a man with a beard take?", "candidates": ["applause", "hug", "handshake", "elbow bump"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "e63UQXYMH4g_0", "video_path": "e63UQXYMH4g.mp4", "subtitle_path": "e63UQXYMH4g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 113, "duration": 14.0, "view_count": 47895}, {"video_id": "0hiTs2pDB14", "question": "There is a boy in a blue and white long-sleeved shirt and white pants wearing a hat in the screen, and another boy in a jacket and jeans singing. 
What did the boy in the jacket and jeans do before singing?", "question_wo_referring_query": "What did the boy in the jacket and jeans do before singing?", "candidates": ["Crawled", "Jumped", "Turned from behind the boy in the blue and white long-sleeved shirt and white pants wearing a hat to his side", "Stood still", "Squatted"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "0hiTs2pDB14_0", "video_path": "0hiTs2pDB14.mp4", "subtitle_path": "0hiTs2pDB14_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 152, "duration": 14.02, "view_count": 2406}, {"video_id": "segm-kfNMqg", "question": "In the video, there are two men wearing hats in a room playing drums. The man on the left is playing a big drum, and the man on the right is playing a long drum. Which man starts playing the drum first?", "question_wo_referring_query": ", in the video, which man starts playing the drum first?", "candidates": ["Simultaneously", "Neither plays the drum", "Left side first", "Right side first"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "segm-kfNMqg_0", "video_path": "segm-kfNMqg.mp4", "subtitle_path": "segm-kfNMqg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 353, "duration": 9.01, "view_count": 261881}, {"video_id": "4w_Vc_7irhM", "question": "The video talks about the geographical issue of tectonic plate movements. 
After the subtitle 'Cyprus, an island nation, is located in the eastern Mediterranean sea', what event occurs on the Earth?", "question_wo_referring_query": ", what event occurs on the Earth?", "candidates": ["Marked with a black circle in the middle", "Marked with a red circle in the middle", "No event occurs", "Marked with an arrow"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "4w_Vc_7irhM_0", "video_path": "4w_Vc_7irhM.mp4", "subtitle_path": "4w_Vc_7irhM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 338, "duration": 13.97, "view_count": 1803933}, {"video_id": "DAIyRVxrjP4", "question": "In the video, there is a man wearing a blue coat and a hat with a beard, sitting on a couch outdoors, explaining the national flag. Where else has the national flag appeared, besides the beginning of the video?", "question_wo_referring_query": "Where else has the national flag appeared, besides the beginning of the video?", "candidates": ["Hamburger shop", "Outdoors", "Cafe", "Milk tea shop"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "DAIyRVxrjP4_0", "video_path": "DAIyRVxrjP4.mp4", "subtitle_path": "DAIyRVxrjP4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 63, "duration": 13.98, "view_count": 209091}, {"video_id": "cFZKEYMp13Q", "question": "In front of a black background, there is a man wearing a white shirt, with his collar styled in a 'Mac style'. This man gestures forward with his right hand while speaking. 
What does this man wearing a white shirt do next?", "question_wo_referring_query": "What does this man wearing a white shirt do next?", "candidates": ["Moves left hand forward", "Claps his hands", "Spreads both hands wide open", "Clenches both hands into fists"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "cFZKEYMp13Q_0", "video_path": "cFZKEYMp13Q.mp4", "subtitle_path": "cFZKEYMp13Q_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 26, "duration": 9.0, "view_count": 25479}, {"video_id": "aNcuIqpq11c", "question": "When the video displays the four-row red and white interleaved text \u201c\u30103\u3011a group of organisms with similar traits that can reproduce\u201d and the subtitle shows \u201csuggcst that a spccics group of\u201d, what change occurs in the video?", "question_wo_referring_query": "What change occurs in the video?", "candidates": ["The text 'a group of' becomes brighter.", "The text 'a group of organisms with similar traits that can reproduce' turns black.", "The entire text 'a group of organisms with similar traits that can reproduce' becomes darker.", "The text 'a group of organisms' becomes brighter."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "aNcuIqpq11c_0", "video_path": "aNcuIqpq11c.mp4", "subtitle_path": "aNcuIqpq11c_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 73, "duration": 9.01, "view_count": 205737}, {"video_id": "KtEdmtYILVY", "question": "In the video, the man wearing a long-sleeve coffee-colored polo shirt is holding a large white ceramic bowl in his right hand. 
What does he do with his left hand inside the bowl?", "question_wo_referring_query": "What does the man do with his left hand inside the bowl?", "candidates": ["Takes out a piece of cloth", "Taps the bowl", "Takes out a piece of paper with the number 7 written on it", "Holds the bowl while eating"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "KtEdmtYILVY_0", "video_path": "KtEdmtYILVY.mp4", "subtitle_path": "KtEdmtYILVY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 444, "duration": 12.01, "view_count": 22699}, {"video_id": "NuB9uOjyBv4", "question": "In a room, a long-haired woman wearing circular earrings points to her face with her left hand. What is placed behind her to the right?", "question_wo_referring_query": "What is placed behind the long-haired woman to the right?", "candidates": ["Basket", "Plant", "Bookshelf", "Photo frame"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "NuB9uOjyBv4_0", "video_path": "NuB9uOjyBv4.mp4", "subtitle_path": "NuB9uOjyBv4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 502, "duration": 11.01, "view_count": 959261}, {"video_id": "lfwKKcDWxmA", "question": "A female host wearing a purple short-sleeved T-shirt is chatting with a man in a crimson T-shirt and glasses. When the female host opens her left hand outward, she says 'could drive the whole semiconductor'. 
What is placed in front of the female host?", "question_wo_referring_query": "What is placed in front of the female host?", "candidates": ["Laptop", "Teacup", "File folder", "Green plant"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "lfwKKcDWxmA_0", "video_path": "lfwKKcDWxmA.mp4", "subtitle_path": "lfwKKcDWxmA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 229, "duration": 8.98, "view_count": 5531}, {"video_id": "0Efs7_LRasM", "question": "In a room with two walls filled with books, there is a table and a box placed in the middle of the room. Next to the box stands a man with his hands on his hips. What kind of material are the pants worn by the man with his hands on his hips?", "question_wo_referring_query": "What kind of material are the pants worn by the man with his hands on his hips?", "candidates": ["Leather", "Wool", "Denim", "Linen"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "0Efs7_LRasM_0", "video_path": "0Efs7_LRasM.mp4", "subtitle_path": "0Efs7_LRasM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.01, "view_count": 6437}, {"video_id": "Y0R54DYs6MI", "question": "A beautiful woman wearing a white top and glasses is sitting next to a piano. 
When she spreads her hands and says 'printing cutting to do I have a new,' what style is her necklace?", "question_wo_referring_query": "What style is her necklace?", "candidates": ["A simple chain with a cross pendant", "A pearl necklace", "A gold chain with a pendant", "A simple chain with no other ornaments"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Y0R54DYs6MI_0", "video_path": "Y0R54DYs6MI.mp4", "subtitle_path": "Y0R54DYs6MI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 38, "duration": 9.01, "view_count": 1035492}, {"video_id": "KlLTsVNaw2Q", "question": "On the road paved with square bricks, there is a dark blue Volkswagen car with a white roof parked. To the right of the car, there is a blue trash can and several advertisement boards. Who opened the front left door of the car and got in?", "question_wo_referring_query": "Who opened the front left door of the car and got in?", "candidates": ["A woman wearing a blue vest", "A man wearing a light green short-sleeve shirt", "A man wearing slippers and a hat", "A blonde woman holding a phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "KlLTsVNaw2Q_0", "video_path": "KlLTsVNaw2Q.mp4", "subtitle_path": "KlLTsVNaw2Q_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 40, "duration": 8.0, "view_count": 105389}, {"video_id": "cdwVmbhCvs8", "question": "A man dressed in a red outfit with a golden cape is surrounded by four similarly dressed individuals. To his left stands a person holding a fruit platter with both arms extended forward. 
What action did the man take when the phrase \"with a good understanding of mathematics\" was mentioned beside him?", "question_wo_referring_query": "What action did the man take?", "candidates": ["Grabbed a wine glass", "Grabbed the fruit with his left hand and started eating", "Grabbed a book", "Finished eating the fruit and put his hand down"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "cdwVmbhCvs8_0", "video_path": "cdwVmbhCvs8.mp4", "subtitle_path": "cdwVmbhCvs8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 157, "duration": 14.0, "view_count": 699131}, {"video_id": "n7IOwv5baLI", "question": "A short-haired man in long sleeves sits beside a window in a brightly lit room, explaining something about his emails. After he finishes speaking, what is shown on the screen?", "question_wo_referring_query": "After he finishes speaking, what is shown on the screen?", "candidates": ["Scrolling computer screen", "Lying down", "Standing up", "Shaking hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "n7IOwv5baLI_0", "video_path": "n7IOwv5baLI.mp4", "subtitle_path": "n7IOwv5baLI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 179, "duration": 11.97, "view_count": 212140}, {"video_id": "6jKnqBa15JA", "question": "The video explains the creation of objects and their reactions. 
Which object is introduced first in the video?", "question_wo_referring_query": "Which object is introduced first in the video?", "candidates": ["An object wearing a white shirt and red pants with a small red horn on its head", "A mobile phone", "A glass of water", "A duck"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "6jKnqBa15JA_0", "video_path": "6jKnqBa15JA.mp4", "subtitle_path": "6jKnqBa15JA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 459, "duration": 10.97, "view_count": 167338}, {"video_id": "JDBqX2fv6uQ", "question": "In the video, a man wearing glasses and a suit is explaining his views on inflation in front of a screen. After he says, 'If you look at it, you know my view, which I share my view on inflation,' what action does he take?", "question_wo_referring_query": "What action did he take?", "candidates": ["Shake hands", "Hug", "Put down his hand", "No action"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "JDBqX2fv6uQ_0", "video_path": "JDBqX2fv6uQ.mp4", "subtitle_path": "JDBqX2fv6uQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 149, "duration": 10.01, "view_count": 5159}, {"video_id": "2LD-WXWWX-U", "question": "Several photos were shown in the video. In the second photo, a few people are walking forward with a child. 
After the explanation mentions 'That's a bit self-serving of him, but maybe he has a point,' what appears on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["Massachusetts Institute of Technology", "Tsinghua University", "University of Michigan", "Dining Hall"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "2LD-WXWWX-U_0", "video_path": "2LD-WXWWX-U.mp4", "subtitle_path": "2LD-WXWWX-U_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 397, "duration": 9.0, "view_count": 170766}, {"video_id": "r-aIzkvPwFo", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, several men wearing white shorts, white socks, and red jackets appear. Afterward, an image related to war is shown. Then, a man in a red short-sleeved shirt appears and analyzes something. Finally, a man in a suit with a mustache gives a lecture as the conclusion.", "First, several men wearing white shorts, white socks, and red jackets appear. Afterward, an image related to war is shown. Then, a man in a suit with a mustache gives a lecture. Finally, a man in a red short-sleeved shirt appears and analyzes something as the conclusion.", "First, several men wearing white shorts, white socks, and red jackets appear. Afterward, a man in a red short-sleeved shirt appears and analyzes something. Then, a man in a suit with a mustache gives a lecture. Finally, an image related to war is shown as the conclusion.", "First, several men wearing white shorts, white socks, and red jackets appear. Afterward, a man in a red short-sleeved shirt appears and analyzes something. Then, an image related to war is shown. 
Lastly, a man in a suit with a mustache gives a lecture as the conclusion."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "r-aIzkvPwFo_0", "video_path": "r-aIzkvPwFo.mp4", "subtitle_path": "r-aIzkvPwFo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 99, "duration": 11.0, "view_count": 4562787}, {"video_id": "5HoXesguAM8", "question": "A man wearing an orange top, a blue and orange striped scarf, and a straw hat, where has he been seen?", "question_wo_referring_query": "Where has he been seen?", "candidates": ["At the entrance of a fast-food restaurant.", "Seen alongside a man wearing an orange top and a black scarf, both wearing straw hats.", "Seen alongside a man wearing an orange top and a black scarf, both wearing straw hats; by the roadside, next to a black car, with a construction site in the background.", "By the roadside, next to a black car, with a construction site in the background."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "5HoXesguAM8_0", "video_path": "5HoXesguAM8.mp4", "subtitle_path": "5HoXesguAM8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1132, "duration": 10.01, "view_count": 1084918}, {"video_id": "H3W2wMS15SY", "question": "The video explains the coastline paradox. Regarding how to draw the coastline, there are two atoms with two red spheres and one green sphere vibrating on the screen. 
At the beginning of the video, which subtitle appears together with the atoms?", "question_wo_referring_query": "At the beginning of the video, which subtitle appears together with the atoms?", "candidates": ["yes", "add", "distance between two atoms isn't as", "why"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "H3W2wMS15SY_0", "video_path": "H3W2wMS15SY.mp4", "subtitle_path": "H3W2wMS15SY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 81, "duration": 11.01, "view_count": 155775}, {"video_id": "dDP59BFjiIs", "question": "The video explains historical weapons. On the screen, there is a man wearing a helmet facing another person. There are also several small figures of men and women in the background. What is the action of the man wearing a helmet in the video?", "question_wo_referring_query": "What is the action of the man wearing a helmet in the video?", "candidates": ["Shaking hands", "Raising a hand", "Holding up a weapon", "Clenching a fist"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "dDP59BFjiIs_0", "video_path": "dDP59BFjiIs.mp4", "subtitle_path": "dDP59BFjiIs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 131, "duration": 10.01, "view_count": 2991}, {"video_id": "o9Bs6Pl2W_U", "question": "In the video, there is a man happily on vacation, posing with a smile for a photo at the swimming pool. 
How many men are there in the swimming pool?", "question_wo_referring_query": "How many men are there in the swimming pool?", "candidates": ["5", "2", "4", "3"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "o9Bs6Pl2W_U_0", "video_path": "o9Bs6Pl2W_U.mp4", "subtitle_path": "o9Bs6Pl2W_U_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 254, "duration": 9.01, "view_count": 3195460}, {"video_id": "MUEklBhRsOo", "question": "In the video, a Black man wearing a suit is conversing in a broadcast room with a man on the right who is holding a script. When the word 'haven't' appears in the subtitles, what object can be seen in the right man's frame?", "question_wo_referring_query": "What object can be seen in the right man's frame?", "candidates": ["Grey striped scarf", "Milk tea", "Fried chicken", "National flag"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "MUEklBhRsOo_0", "video_path": "MUEklBhRsOo.mp4", "subtitle_path": "MUEklBhRsOo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 301, "duration": 14.0, "view_count": 2724}, {"video_id": "yckVqZIniZk", "question": "In the middle of a street in front of several houses, where four cars are parked, a man is standing next to a black car, bending over and looking for something. 
What is he wearing when the word 'ugh' is mentioned?", "question_wo_referring_query": "What is he wearing when the word 'ugh' is mentioned?", "candidates": ["He is wearing a gray T-shirt and green trousers.", "He is wearing a black T-shirt and blue shorts.", "He is wearing a black coat and blue jeans.", "He is wearing a gray T-shirt and black jeans."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "yckVqZIniZk_0", "video_path": "yckVqZIniZk.mp4", "subtitle_path": "yckVqZIniZk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 260, "duration": 9.01, "view_count": 51984}, {"video_id": "xGyw3_D7fZA", "question": "On a vacant area paved with grey tiles, there are several large containers with a few boxes of goods stacked outside. There are two people walking side by side. What are they observing?", "question_wo_referring_query": "What are they observing?", "candidates": ["Blue frame", "White frame", "White relief supplies", "Blue relief supplies"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "xGyw3_D7fZA_0", "video_path": "xGyw3_D7fZA.mp4", "subtitle_path": "xGyw3_D7fZA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 36, "duration": 14.0, "view_count": 19442}, {"video_id": "eA679BcvhxY", "question": "On the left side, there are four lines of white characters, and following them closely on the right side of the green data is a man wearing a gray suit. After this, what action does the data take?", "question_wo_referring_query": "On the left side, there are four lines of white characters, and following them closely on the right side of the green data is a man wearing a gray suit. 
After this, what action does the data take?", "candidates": ["Transforms with a sequential popping-out motion from the left", "Transforms with an upward motion like a hundred-leaved window", "Transforms with a sequential popping-out motion from the right", "Transforms with a downward motion like a hundred-leaved window"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "eA679BcvhxY_0", "video_path": "eA679BcvhxY.mp4", "subtitle_path": "eA679BcvhxY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 53, "duration": 12.01, "view_count": 4085}, {"video_id": "uILqYCYfceQ", "question": "There is a man in white standing in the middle, and there is a lively concert scene and a performance by a group of women led by a woman with golden curly hair. Which scene appears first?", "question_wo_referring_query": "There is a man in white standing in the middle, and there is a lively concert scene and a performance by a group of women led by a woman with golden curly hair. Which scene appears first?", "candidates": ["Neither scene appears", "A performance by a group of women led by a woman with golden curly hair", "Both appear at the same time", "A man in white standing in the middle at a lively concert scene"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "uILqYCYfceQ_0", "video_path": "uILqYCYfceQ.mp4", "subtitle_path": "uILqYCYfceQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 253, "duration": 11.0, "view_count": 3768}, {"video_id": "Hq1HOcOogmU", "question": "In the scene with multiple stacked wooden boxes, there is a large box with 'THIS WAY UP' inscribed on it. 
Before the mention of 'the dangers of drug addiction,' what event occurred?", "question_wo_referring_query": "What event occurred?", "candidates": ["An elderly man in a black suit stands holding a hammer.", "An elderly man in a brown suit stands holding a hammer.", "An elderly man in a brown suit stands holding a gun.", "An elderly man in a black suit stands holding a gun."], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "Hq1HOcOogmU_0", "video_path": "Hq1HOcOogmU.mp4", "subtitle_path": "Hq1HOcOogmU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 391, "duration": 11.0, "view_count": 871215}, {"video_id": "8C0E5Ym9zvc", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, the YAMAHA logo appears in the center of the screen, surrounded by green trees; then, the sky becomes overcast with clouds, a white van appears next to a building, and there is a tall tower nearby with the text 'JAPANESE STUDIES A BOOST FOR BRAZILIAN SCHOOLKIDS'; finally, a sign appears with the text 'Panasonic Do Brasil Limitada'.", "First, the sky becomes overcast with clouds, a white van appears next to a building, and there is a tall tower nearby with the text 'JAPANESE STUDIES A BOOST FOR BRAZILIAN SCHOOLKIDS'; then, the YAMAHA logo appears in the center of the screen, surrounded by green trees; finally, a sign appears with the text 'Panasonic Do Brasil Limitada'.", "First, a sign appears with the text 'Panasonic Do Brasil Limitada'; then, the sky becomes overcast with clouds, a white van appears next to a building, and there is a tall tower nearby with the text 'JAPANESE STUDIES A BOOST FOR BRAZILIAN SCHOOLKIDS'; finally, the YAMAHA logo appears in the center of the screen, surrounded by green trees.", "First, the YAMAHA logo appears in the center of the screen, surrounded by green trees; then, a sign appears 
with the text 'Panasonic Do Brasil Limitada'; finally, the sky becomes overcast with clouds, a white van appears next to a building, and there is a tall tower nearby with the text 'JAPANESE STUDIES A BOOST FOR BRAZILIAN SCHOOLKIDS'."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "8C0E5Ym9zvc_0", "video_path": "8C0E5Ym9zvc.mp4", "subtitle_path": "8C0E5Ym9zvc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 241, "duration": 8.01, "view_count": 63599}, {"video_id": "lsQO3gMPslc", "question": "A woman wearing glasses and dressed in long sleeves is sitting in front of a chair and talking. What changes occurred in her hand movements before and after she spoke?", "question_wo_referring_query": "What changes occurred in the hand movements of the woman in the room before and after she spoke?", "candidates": ["Embraces", "Maintains being open", "Remains tightly grasped", "Opens to grasping tightly"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "lsQO3gMPslc_0", "video_path": "lsQO3gMPslc.mp4", "subtitle_path": "lsQO3gMPslc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 75, "duration": 10.97, "view_count": 21509}, {"video_id": "gmUeNnuFrl4", "question": "In the video analyzing the meaning of each picture, what is the woman in the fourth picture doing with her hands? 
She is wearing a hat with several tufts of hair on it.", "question_wo_referring_query": "What is the woman in the fourth picture doing with her hands?", "candidates": ["Clenching her fist", "Holding her waist", "Shaking hands", "Hugging"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "gmUeNnuFrl4_0", "video_path": "gmUeNnuFrl4.mp4", "subtitle_path": "gmUeNnuFrl4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 287, "duration": 11.0, "view_count": 983627}, {"video_id": "T8QOE-IWo3I", "question": "In the video, there are two men analyzing and explaining in a studio. Then, a man wearing a long-sleeve shirt appears giving a lecture on a spacious street. At the end of the video, what is the color of the long-sleeve shirt worn by the man speaking on the street?", "question_wo_referring_query": "What is the color of the long-sleeve shirt worn by the man speaking on the street at the end of the video?", "candidates": ["Yellow", "Black", "White", "Green"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "T8QOE-IWo3I_0", "video_path": "T8QOE-IWo3I.mp4", "subtitle_path": "T8QOE-IWo3I_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 172, "duration": 12.0, "view_count": 2054102}, {"video_id": "39zhwX2S8ew", "question": "The video explains animal migration routes, showing a group of elephants with tusks marching in a line. 
When the subtitle mentions 'migration route of large game animals', what color are the elephants that appear?", "question_wo_referring_query": "What color are the elephants that appear?", "candidates": ["Brown", "Purple", "None", "Yellow", "Green"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "39zhwX2S8ew_0", "video_path": "39zhwX2S8ew.mp4", "subtitle_path": "39zhwX2S8ew_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 68, "duration": 9.01, "view_count": 6950}, {"video_id": "ECW_qB2CId0", "question": "The video depicts a scene of a long-haired woman sitting in a car. She is seated in the driver's seat, and the background shows streets and buildings outside the car window. What does she do with her hands when the subtitle mentions 'to represent that but anyways I hope you'?", "question_wo_referring_query": "What does she do with her hands?", "candidates": ["Clapping", "Fisting", "Clasping hands", "No change"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "ECW_qB2CId0_0", "video_path": "ECW_qB2CId0.mp4", "subtitle_path": "ECW_qB2CId0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 129, "duration": 9.98, "view_count": 37633}, {"video_id": "fd-Me3EBGYY", "question": "The image is an artwork depicting a landscape of the sun, with many small boats floating on the water. There are paintings of a sunrise on the eastern side of the frame and a sunset on the western side. 
What appears after this scene?", "question_wo_referring_query": "What appears after this scene?", "candidates": ["A gymnasium", "A part of an art museum", "A classroom", "A playground"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "fd-Me3EBGYY_0", "video_path": "fd-Me3EBGYY.mp4", "subtitle_path": "fd-Me3EBGYY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 176, "duration": 8.01, "view_count": 1501457}, {"video_id": "k8BpEyE1kLk", "question": "In the video, a man wearing a black jacket is giving a lecture in a room filled with books, and there are several pictures on the wall. After he says 'that and that comes into the constellation states through Coinbase is', what event occurs?", "question_wo_referring_query": "What event occurs?", "candidates": ["The top left corner changes to 5:19pm London time", "No change", "The man leaves the room", "The man picks up a book"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "k8BpEyE1kLk_0", "video_path": "k8BpEyE1kLk.mp4", "subtitle_path": "k8BpEyE1kLk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 61, "duration": 8.98, "view_count": 2309}, {"video_id": "UBkNDTzLIVY", "question": "In the video, a white-haired man wearing military uniform and glasses on the right is being interviewed, and on the left, there's a city image. 
Just before the subtitle mentions 'Island Republic whether on our soil or,' what appears on the screen?", "question_wo_referring_query": "What appears on the screen just before the subtitle 'Island Republic whether on our soil or'?", "candidates": ["No change", "An additional dialogue box appears", "A line of text about 'IRAN ATTACK'", "The bottom row of subtitles disappears"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "UBkNDTzLIVY_0", "video_path": "UBkNDTzLIVY.mp4", "subtitle_path": "UBkNDTzLIVY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 152, "duration": 14.0, "view_count": 38072}, {"video_id": "rHcvXgr18jo", "question": "In a blue-themed background, with gray text on a blue bottom and black text on a white background as subtitles, there's a blonde woman dressed in a black suit. In which of the following scenes does she appear?", "question_wo_referring_query": "In which of the following scenes does she appear?", "candidates": ["In a scene with a woman with short blonde hair wearing a blue suit beside her", "In a scene with a woman with long blonde hair wearing a black suit beside her", "In a scene with a woman with short blonde hair wearing a black suit beside her", "In a scene with a woman with long blonde hair wearing a blue suit beside her"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "rHcvXgr18jo_0", "video_path": "rHcvXgr18jo.mp4", "subtitle_path": "rHcvXgr18jo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 285, "duration": 9.0, "view_count": 14451}, {"video_id": "EX4OZ-lhdY0", "question": "Two people are in a confined space. One of them, a man wearing a headset, is holding a hammer and using it to hammer something. The other person is holding an object with both hands. 
Which subtitles appear along with this scene?", "question_wo_referring_query": "Which subtitles appear along with this scene?", "candidates": ["like socialism and communism", "better working conditions all this led", "in various degrees these related", "demanding better pay shorter hours and better working conditions all this led to the arrival of new political theories like socialism and communism"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "EX4OZ-lhdY0_0", "video_path": "EX4OZ-lhdY0.mp4", "subtitle_path": "EX4OZ-lhdY0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 728, "duration": 11.0, "view_count": 59094}, {"video_id": "rGwo2wND9Cw", "question": "In the video, two women are standing together. One of them, a blonde woman wearing a suit, is holding a piece of food in her left hand and making some kind of gesture with her right hand in front of her chest. The other woman is wearing a gray top and is moving her hands up and down. 
After mentioning 'cutting down on waste that's it figure,' what change occurs between the two women?", "question_wo_referring_query": "What change occurs between the two women after that?", "candidates": ["The woman in the gray top wipes her forehead with her right hand, while the woman in the black shirt crosses her hands in front of her chest.", "The woman in the gray top crosses her hands in front of her chest, while the woman in the black shirt leaves the frame.", "The woman in the gray top wipes her forehead with her right hand, while the woman in the black shirt bends slightly at the waist with a smile on her face.", "The woman in the gray top wipes her forehead with her right hand, while the woman in the black shirt makes a 'yeah' gesture."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "rGwo2wND9Cw_0", "video_path": "rGwo2wND9Cw.mp4", "subtitle_path": "rGwo2wND9Cw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 304, "duration": 9.01, "view_count": 128254}, {"video_id": "HU-BAZe2Gl8", "question": "Inside a car, a woman wearing a black strap is looking outside the car. One person is sitting in the driver's seat and another person is sitting in the front passenger seat. The person in the driver's seat is wearing a white short-sleeved shirt. What is the man in the white short-sleeved shirt doing?", "question_wo_referring_query": "What is the man in the white short-sleeved shirt doing?", "candidates": ["Chatting", "Playing with a phone", "Drinking water", "Driving"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "HU-BAZe2Gl8_0", "video_path": "HU-BAZe2Gl8.mp4", "subtitle_path": "HU-BAZe2Gl8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 378, "duration": 9.01, "view_count": 20462}, {"video_id": "ondqW6XkYzE", "question": "A man in a pink short-sleeved shirt is sitting on a sofa. 
On the table in front of him, there is a laptop and a mobile phone. On the wall beside him, two layers of clothing are hanging. What items are not present?", "question_wo_referring_query": "What items are not present?", "candidates": ["Laptop", "Headphones", "Mobile phone", "Sofa"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "ondqW6XkYzE_0", "video_path": "ondqW6XkYzE.mp4", "subtitle_path": "ondqW6XkYzE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 199, "duration": 8.97, "view_count": 96551}, {"video_id": "R3geXHgW7K0", "question": "In a quiet forest, a lion is walking among the grass and flowers. When the phrase 'thing close in on a spider it's just' is mentioned, which objects appear?", "question_wo_referring_query": "which objects appear?", "candidates": ["rabbit", "duck", "lion, grass, and flowers", "elephant"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "R3geXHgW7K0_0", "video_path": "R3geXHgW7K0.mp4", "subtitle_path": "R3geXHgW7K0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 113, "duration": 10.01, "view_count": 1300}, {"video_id": "XlVo6RbsIxE", "question": "A curly-haired news anchor wearing a black top and a blue jacket is on screen with the text 'Foreign ministers mark anniversary, mull Ukraine aid NATO MARKS 75 YEARS OF EXISTENCE.' 
When mentioning 'Ukraine given the situation with the,' what type of jacket is this female anchor wearing?", "question_wo_referring_query": "What type of jacket is this female anchor wearing?", "candidates": ["Cotton jacket", "Suit jacket", "Denim jacket", "Blazer"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "XlVo6RbsIxE_0", "video_path": "XlVo6RbsIxE.mp4", "subtitle_path": "XlVo6RbsIxE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 18, "duration": 9.0, "view_count": 1426}, {"video_id": "TRQyVyoO4UA", "question": "A man in a green military coat is speaking on a crowded street with cars on both sides, raising his left hand high. What is the man holding in his left hand?", "question_wo_referring_query": "What is the man holding in his left hand?", "candidates": ["A camera", "A backpack", "A fluffy toy", "A drink"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "TRQyVyoO4UA_0", "video_path": "TRQyVyoO4UA.mp4", "subtitle_path": "TRQyVyoO4UA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 6, "duration": 8.01, "view_count": 1975}, {"video_id": "K-kc0ivpEFY", "question": "A man wearing a dark gray suit paired with a white shirt, with a microphone clipped to his suit, raises both hands and explains, 'this an iconic car one of the most'. 
What object appears on the screen before his explanation?", "question_wo_referring_query": "What object appears on the screen before the explanation?", "candidates": ["The tires of a Volkswagen Beetle", "The badge of a Porsche", "The badge of a Volkswagen Beetle", "The steering wheel of a Volkswagen Beetle"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "K-kc0ivpEFY_0", "video_path": "K-kc0ivpEFY.mp4", "subtitle_path": "K-kc0ivpEFY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 30, "duration": 13.97, "view_count": 12561}, {"video_id": "mCeAiUQXkkg", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, add and mix in melted yellow oil, then sear the steak, and finally roast at 190 degrees for 27-30 minutes.", "First, sear the steak, then add and mix in melted yellow oil, and finally roast at 190 degrees for 27-30 minutes.", "First, sear the steak, then roast at 190 degrees for 10 minutes, and finally add yellow oil.", "First, sear the steak, then roast at 190 degrees for 27-30 minutes, and finally add and mix in melted yellow oil."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "mCeAiUQXkkg_0", "video_path": "mCeAiUQXkkg.mp4", "subtitle_path": "mCeAiUQXkkg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 158, "duration": 11.97, "view_count": 129790}, {"video_id": "0TQEfdweQ14", "question": "By the side of a table, there is a woman wearing a white T-shirt and a khaki skirt holding a tray on the table. Next to her, there is a man wearing a white T-shirt and a black skirt lifting up a tray. 
Which subtitles have appeared simultaneously with this woman in the khaki skirt?", "question_wo_referring_query": "Which subtitles have appeared simultaneously with this woman in the khaki skirt?", "candidates": ["ancaher marshmallow yeah", "pilow so how about I hold it", "how much you waned to use it as a giant", "yeah I got the culinary"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "0TQEfdweQ14_0", "video_path": "0TQEfdweQ14.mp4", "subtitle_path": "0TQEfdweQ14_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 812, "duration": 14.02, "view_count": 9349417}, {"video_id": "WvrkG5ZNEYA", "question": "In the video, there appears a soldier wearing gray armor and a gray protective suit. Next to the soldier, there are two men dressed in yellow clothes with black hair. When the subtitle mentions 'declare themselves kings over their', what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["Sword", "The spear and shield held by the soldier", "War flag", "War chariot"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "WvrkG5ZNEYA_0", "video_path": "WvrkG5ZNEYA.mp4", "subtitle_path": "WvrkG5ZNEYA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 366, "duration": 9.97, "view_count": 16316}, {"video_id": "Ir35TL8r7Ks", "question": "On a taupe stone desktop, there is text 'store-bought donuts 8' in the top right corner of the screen. On the desktop, a pair of hands holding a donut with the thumb and index finger. 
What shape is this donut?", "question_wo_referring_query": "What shape is this donut?", "candidates": ["Hexagon with a circular hole in the middle", "Circle", "Rectangle", "Square"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "Ir35TL8r7Ks_0", "video_path": "Ir35TL8r7Ks.mp4", "subtitle_path": "Ir35TL8r7Ks_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 253, "duration": 13.01, "view_count": 106045}, {"video_id": "uxeklxrpucM", "question": "A young woman with green hair wearing a gold necklace and colorful shirt, with red and gold earrings, is raising three fingers on her right hand. What other actions does this green-haired woman perform?", "question_wo_referring_query": "What other actions does this green-haired woman perform?", "candidates": ["Both hands raised high", "Holding a white striped object in her left hand", "Holding a white striped object in her right hand", "Holding a comb in her right hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "uxeklxrpucM_0", "video_path": "uxeklxrpucM.mp4", "subtitle_path": "uxeklxrpucM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 338, "duration": 12.01, "view_count": 251658}, {"video_id": "yjEuvd43JJ8", "question": "In a room filled with photos, there is a black-haired man wearing a black T-shirt with front print. This man is holding a microphone in his right hand. 
When the subtitle \"espanol en Tagalog anyway last rule how\" appears, what else does the black-haired man do?", "question_wo_referring_query": "What else does the black-haired man do?", "candidates": ["The man looks backward", "The man holds a book with his left hand while explaining", "The man moves his left hand forward", "The man explains with the microphone"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "yjEuvd43JJ8_0", "video_path": "yjEuvd43JJ8.mp4", "subtitle_path": "yjEuvd43JJ8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.97, "view_count": 152926}, {"video_id": "v448sJLNKI8", "question": "A man wearing a light gray suit paired with a dark red tie is talking with a man wearing glasses and a blue top. The man in glasses and the blue top looks downward. After this, what action does the man in the light gray suit do?", "question_wo_referring_query": "After this, what action does the man in the light gray suit do?", "candidates": ["Holding a cigarette in his left hand", "Holding a white pen in his right hand", "Clapping hands", "Raising his head"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "v448sJLNKI8_0", "video_path": "v448sJLNKI8.mp4", "subtitle_path": "v448sJLNKI8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 268, "duration": 9.0, "view_count": 43393}, {"video_id": "LhC6mTLh0ME", "question": "A boy wearing a gray shirt paired with jeans and a girl wearing a white striped sweater paired with red pants run towards a woman with long black hair and a red top. 
Among these three, who appears first in the later scene?", "question_wo_referring_query": "Among these three, who appears first in the later scene?", "candidates": ["The woman with long black hair and a black coat", "The girl wearing a white striped sweater paired with red pants", "The boy wearing a gray shirt paired with jeans", "The woman with long black hair and a red top"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "LhC6mTLh0ME_0", "video_path": "LhC6mTLh0ME.mp4", "subtitle_path": "LhC6mTLh0ME_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 212, "duration": 10.01, "view_count": 19792}, {"video_id": "f_xmzEEiLIk", "question": "In front of the white cabinet, there is a man wearing a grey-blue knit sweater paired with a shirt, he spreads his hands and says 'have already seen this recipe on the'. Before this, what action did this man do?", "question_wo_referring_query": "Before this, what action did this man do?", "candidates": ["Clapped", "Crossed his hands", "Squatted down", "Made a fist"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "f_xmzEEiLIk_0", "video_path": "f_xmzEEiLIk.mp4", "subtitle_path": "f_xmzEEiLIk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 10, "duration": 14.02, "view_count": 245223}, {"video_id": "7SDF0ZzDHzk", "question": "The water is extremely clear, and you can see many stones at the bottom, including one large yellow rock. The sunlight enhances the sparkling waves. 
After the subtitle shows 'hy the beaury of this place the warer is', who is the first person to jump into the water?", "question_wo_referring_query": "Who is the first person to jump into the water?", "candidates": ["Woman in black swimsuit", "Girl in red swimsuit", "Man in black swimming trunks", "Man in blue swimming trunks"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "7SDF0ZzDHzk_0", "video_path": "7SDF0ZzDHzk.mp4", "subtitle_path": "7SDF0ZzDHzk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 145, "duration": 11.01, "view_count": 263425}, {"video_id": "SfdKGigq2xc", "question": "Which of the following sequences of actions is correct?", "question_wo_referring_query": "Which of the following sequences of actions is correct?", "candidates": ["First, add oil to the pan, then add the bell peppers, and finally add the green onions", "First, add oil to the pan, then add the bell peppers, and finally stir-fry everything", "First, add the green onions to the pan, then add the bell peppers to the pan, and finally plate everything", "First, add the green onions to the pan, then take the plate with bell peppers, and finally add the bell peppers to the pan"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "SfdKGigq2xc_0", "video_path": "SfdKGigq2xc.mp4", "subtitle_path": "SfdKGigq2xc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 297, "duration": 10.98, "view_count": 634467}, {"video_id": "cd9or9hVpAw", "question": "A man with black hair and a mustache, with a smile, wearing a black short T-shirt and a backpack, is dancing near the handrail and waving his left hand greeting people. 
In which scene does this man appear in black clothes and meet his friend?", "question_wo_referring_query": "In which scene does the man in black clothes meet his friend?", "candidates": ["A bar", "A yacht by the seaside", "A dining table inside the store", "An open square"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "cd9or9hVpAw_0", "video_path": "cd9or9hVpAw.mp4", "subtitle_path": "cd9or9hVpAw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 567, "duration": 8.01, "view_count": 1160318}, {"video_id": "SlreHfVkGA4", "question": "The shelf is filled with photography-related equipment. A man is crouching on the right side of the screen, pointing his index finger at a white piece of equipment on the shelf. After the subtitle 'we're still chasing you polaroid' appears, what happens?", "question_wo_referring_query": "What happens next?", "candidates": ["A man in a black lab coat and a woman in a blue jacket hug.", "A man in a yellow down jacket puts his book bag on the corner of a table.", "A man wearing a black lab coat with a pattern sits on a sofa facing the left side of the screen, holding a purple teacup in his right hand.", "A person in a purple jacket and a person in a green jacket hug.", "A man in a white long-sleeve shirt sits facing the right side of the screen reading a book."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "SlreHfVkGA4_0", "video_path": "SlreHfVkGA4.mp4", "subtitle_path": "SlreHfVkGA4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1443.69, "view_count": 114272}, {"video_id": "SlreHfVkGA4", "question": "In a white room, there is a table on the left with a laptop on it. A TV is hanging on the wall, and there's a standing lamp beside it. On the right, there's a sofa. 
In the distance, a man in black clothes is standing, and in front of him are two men, one in blue and the other in green. What happens on the screen after the subtitle 'always hi thank you so much' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A man wearing a black hat is sitting on the ground with his legs crossed, gesturing with his hands while talking.", "A man in blue pajamas and green short sleeves stands at the door holding a box.", "A man in blue scrubs is standing next to the sofa, gesturing with his hands while talking.", "A woman in blue clothing makes a heart shape with her hands towards the camera.", "A man in blue clothing is sliding on a tablet screen."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "SlreHfVkGA4_1", "video_path": "SlreHfVkGA4.mp4", "subtitle_path": "SlreHfVkGA4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1443.69, "view_count": 114272}, {"video_id": "SlreHfVkGA4", "question": "Inside the room, a man in a dark blue bathrobe is talking to a man in light blue pajamas. At this moment, a man in a green long-sleeve shirt and black shorts is sitting by the sofa, listening. After the subtitle 'imagine that you're reading all' appears, what happens on the screen?", "question_wo_referring_query": "Inside the room, a man in a dark blue bathrobe is talking to a man in light blue pajamas. At this moment, a man in a green long-sleeve shirt and black shorts is sitting by the sofa, listening. 
After the subtitle 'imagine that you're reading all' appears, what happens on the screen?", "candidates": ["A man in a black bathrobe hugs a woman in a blue coat.", "A man in a white long-sleeve shirt ties a cloth around someone and throws a pillow onto the sofa.", "A man in a white long-sleeve shirt is sitting on the right side of the screen reading a book.", "The man in the blue bathrobe gestures with both hands while talking by the sofa.", "A man squats on the right side of the screen, pointing his index finger at a white object on the shelf."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "SlreHfVkGA4_2", "video_path": "SlreHfVkGA4.mp4", "subtitle_path": "SlreHfVkGA4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1443.69, "view_count": 114272}, {"video_id": "YpTW1PSiL-U", "question": "In a room with a green wall, there is a wooden board on the wall with clothes hanging from it and books placed above them. On the left side, there is a window covered with curtains, and on the desk below the window, there are plants. In the middle, there is a ceramic bowl. A woman is holding a water container and pouring water into it. Which subtitles have appeared together with the ceramic bowl?", "question_wo_referring_query": "Which subtitles have appeared together with the ceramic bowl?", "candidates": ["instead just listen to my heart,What feels right for me may not be for you, and that's the beauty", "I wanted to experiment with potential flavors for my wedding. Today is raspberries and beets", "life alone than be with wrong person. 
While I did eventually find someone I was very ready", "I told that to Luke early on and he said that it was my choice, his love wasn't conditional", "my life again, As someone who innately wants to please people, it can be easy for me to get"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "YpTW1PSiL-U_0", "video_path": "YpTW1PSiL-U.mp4", "subtitle_path": "YpTW1PSiL-U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1075.83, "view_count": 2307141}, {"video_id": "YpTW1PSiL-U", "question": "On a wooden table, there is a bunch of basil on the right, and a blue-patterned cup with red powder inside on the left. Which subtitles mention the blue-patterned cup?", "question_wo_referring_query": "Which subtitles mention the blue-patterned cup?", "candidates": ["since I started this channel. It was brought up in my last video, I got so many messages", "quiet rhythm - something I realized I wanted to bring to all my future relationships", "I wanted to experiment with potential flavors for my wedding. Today is raspberries and beets", "life alone than be with wrong person. While I did eventually find someone I was very ready", "instead just listen to my heart, What feels right for me may not be for you, and that's the beauty"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "YpTW1PSiL-U_1", "video_path": "YpTW1PSiL-U.mp4", "subtitle_path": "YpTW1PSiL-U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1075.83, "view_count": 2307141}, {"video_id": "YpTW1PSiL-U", "question": "In a room, there are books, rabbit figurines, vases, and other items on the shelf at the back. There is a ceramic pot with a green plant on the windowsill on the right side. A blonde woman is sitting in the middle, and there are four containers below her. One of the glass containers holds a pair of scissors. 
Which subtitles do the scissors appear with?", "question_wo_referring_query": "Which subtitles do the scissors appear with?", "candidates": ["I wanted to experiment with potential flavors for my wedding. Today is raspberries and beets", "instead just listen to my heart, What feels right for me may not be for you, and that's the beauty", "for my mother's culture and that's something that I have always seen as normal even though", "and complexity of human relationships. Each is different and so interesting", "quiet rhythm - something I realized I wanted to bring to all my future relationships"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "YpTW1PSiL-U_2", "video_path": "YpTW1PSiL-U.mp4", "subtitle_path": "YpTW1PSiL-U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1075.83, "view_count": 2307141}, {"video_id": "ZNK4nfgNQpM", "question": "In the bottom right corner of the screen, there is a man wearing sunglasses and a gray short-sleeved shirt. His hands are in front of his chest, and he is holding a white pen in one hand. There are two red arrows above and to the left of him. 
What is the man in the bottom right corner doing at this time?", "question_wo_referring_query": "What is the man in the bottom right corner doing at this time?", "candidates": ["Turning his head to the left side of the screen", "Holding a pen in his left hand, making a few gestures in the air", "Standing up", "Turning his head to the right side of the screen", "Lying down sleeping"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "ZNK4nfgNQpM_0", "video_path": "ZNK4nfgNQpM.mp4", "subtitle_path": "ZNK4nfgNQpM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2126.37, "view_count": 32394}, {"video_id": "ZNK4nfgNQpM", "question": "In the lower-right corner of the screen, there is a man wearing sunglasses and a gray short-sleeved shirt. Above him, there is a pie chart composed of yellow and blue sections. To the left of the pie chart is a shape made up of many vertical lines. What is the man in the lower-right corner doing at this time?", "question_wo_referring_query": "What is the man in the lower-right corner doing at this time?", "candidates": ["Drew a boat", "Drew a car", "Drew an airplane", "Drew a banana", "Drew a red arrow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "ZNK4nfgNQpM_1", "video_path": "ZNK4nfgNQpM.mp4", "subtitle_path": "ZNK4nfgNQpM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2126.37, "view_count": 32394}, {"video_id": "ZNK4nfgNQpM", "question": "In the lower right corner of the screen, there's a man wearing sunglasses and a grey short-sleeved shirt. In the middle of the screen, there's a bar chart composed of blue and grey rectangles, with the text 'Number of solved problems In IMO-AG-3D' above it. 
At this moment, what is the man in the lower right corner doing?", "question_wo_referring_query": "At this moment, what is the man in the lower right corner doing?", "candidates": ["Drawing an airplane", "Drawing a banana", "Drawing a red circle", "Drawing a car", "Drawing a boat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "ZNK4nfgNQpM_2", "video_path": "ZNK4nfgNQpM.mp4", "subtitle_path": "ZNK4nfgNQpM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2126.37, "view_count": 32394}, {"video_id": "gEhBP0j86mo", "question": "A woman with yellow hair, dressed in a black coat and wearing glasses, has her hands placed on white paper documents on a lectern. Behind her is a gray background wall. Which of the following items has appeared?", "question_wo_referring_query": "Which of the following items has appeared?", "candidates": ["a bottle of Coke", "a glass of water", "a bottle of water", "a bottle of red wine", "3 glasses of water"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "gEhBP0j86mo_0", "video_path": "gEhBP0j86mo.mp4", "subtitle_path": "gEhBP0j86mo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1893.8, "view_count": 5145}, {"video_id": "gEhBP0j86mo", "question": "A woman with black hair and wearing a black coat is sitting, next to her is a man wearing a black coat, a tie, and revealing his mouth. 
Which of the following items has appeared?", "question_wo_referring_query": "Which of the following items has appeared?", "candidates": ["Earphone", "Ring", "Fresh flower", "Earring", "Crown"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "gEhBP0j86mo_1", "video_path": "gEhBP0j86mo.mp4", "subtitle_path": "gEhBP0j86mo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1893.8, "view_count": 5145}, {"video_id": "gEhBP0j86mo", "question": "A woman with blonde hair, wearing a black coat, is standing and clapping. Behind her is a person in a black and white striped shirt. To her left is a blue seat. Which of the following items appears?", "question_wo_referring_query": "Which of the following items appears?", "candidates": ["Crown", "Ring", "Earphone", "Earring", "Necklace"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "gEhBP0j86mo_2", "video_path": "gEhBP0j86mo.mp4", "subtitle_path": "gEhBP0j86mo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1893.8, "view_count": 5145}, {"video_id": "IP8984ExN2A", "question": "There are three hands on a white desktop, one of which has blue nail polish on the thumb, and '1-bromo-2-chlorocyclohexane' is stamped on the upper side of the hand. 
When the subtitle 'a ch2 this is a carbon connect to' appears, which object is present on the screen?", "question_wo_referring_query": "Which object is present on the screen?", "candidates": ["mobile phone", "apple", "crown", "necklace", "ring"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "IP8984ExN2A_0", "video_path": "IP8984ExN2A.mp4", "subtitle_path": "IP8984ExN2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.1, "view_count": 11479}, {"video_id": "IP8984ExN2A", "question": "A pair of hands on a white desktop, the left hand extends the index finger painted with blue nail polish next to a black triangle, the other hand is clenched into a fist, when the subtitle 'I have dimethyls so I have two methyl' appears, which object is present in the screen?", "question_wo_referring_query": "which object is present in the screen?", "candidates": ["digital camera", "keyboard", "mouse", "watch", "mobile phone"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "IP8984ExN2A_1", "video_path": "IP8984ExN2A.mp4", "subtitle_path": "IP8984ExN2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.1, "view_count": 11479}, {"video_id": "IP8984ExN2A", "question": "A hand wearing a watch is near a black triangle. The left hand is pointing with a blue-painted nail on the index finger, and below the finger, there is a sign with the English words 'Meso Compounds.' 
When the subtitle reads 'line up okay but in this case because if,' which object is present on the screen?", "question_wo_referring_query": "Which object is present on the screen?", "candidates": ["keyboard", "hat", "mouse", "computer", "digital camera"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "IP8984ExN2A_2", "video_path": "IP8984ExN2A.mp4", "subtitle_path": "IP8984ExN2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.1, "view_count": 11479}, {"video_id": "uzEmUYmRCuY", "question": "A person wearing a black coat is standing in the green grass, holding a lot of fruits in the left hand and holding one fruit in the right hand. What is the color of the fruit in the right hand?", "question_wo_referring_query": "A person wearing a black coat is standing in the green grass, holding a lot of fruits in the left hand and holding one fruit in the right hand. What is the color of the fruit in the right hand?", "candidates": ["red", "yellow", "white", "olive", "coffee"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "uzEmUYmRCuY_0", "video_path": "uzEmUYmRCuY.mp4", "subtitle_path": "uzEmUYmRCuY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.02, "view_count": 2237313}, {"video_id": "uzEmUYmRCuY", "question": "A man wearing a black short-sleeve shirt is cutting a carrot on a cutting board with a knife. There are some withered wood and green plants behind the man. 
What shape is the cutting board?", "question_wo_referring_query": "What shape is the cutting board?", "candidates": ["Square", "Pentagon", "Hexagon", "Circle", "Triangle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "uzEmUYmRCuY_1", "video_path": "uzEmUYmRCuY.mp4", "subtitle_path": "uzEmUYmRCuY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.02, "view_count": 2237313}, {"video_id": "uzEmUYmRCuY", "question": "Three children are walking on a meadow, with two of them walking one after the other. One of the children is wearing a hat and is spreading their arms, ready to perform a cartwheel. What is the color of the hat worn by the child who is ready to cartwheel?", "question_wo_referring_query": "What is the color of the hat worn by the child who is ready to cartwheel?", "candidates": ["Yellow", "Blue", "Pink", "Black", "White"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "uzEmUYmRCuY_2", "video_path": "uzEmUYmRCuY.mp4", "subtitle_path": "uzEmUYmRCuY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.02, "view_count": 2237313}, {"video_id": "yV8QeblH2gA", "question": "A person wearing a black short-sleeve shirt is holding a camera in their hand. To the far right is a wardrobe in natural wood color. Next to the wardrobe, there are many closed cans. 
What is the man wearing a black short-sleeve shirt and touching his collar holding in his hand?", "question_wo_referring_query": "What is the man wearing a black short-sleeve shirt and touching his collar holding in his hand?", "candidates": ["cellphone", "remote control", "camera", "microphone", "binoculars"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "yV8QeblH2gA_0", "video_path": "yV8QeblH2gA.mp4", "subtitle_path": "yV8QeblH2gA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1749.25, "view_count": 50459}, {"video_id": "yV8QeblH2gA", "question": "A person wearing a white short sleeve shirt is sitting at a white computer desk. His left elbow is resting on the table, leaning diagonally while facing the right side of the screen. What is the man playing with in his hand?", "question_wo_referring_query": "What is the man playing with in his hand?", "candidates": ["Wine glass", "Wristwatch", "VR headset", "Knife", "Cellphone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "yV8QeblH2gA_1", "video_path": "yV8QeblH2gA.mp4", "subtitle_path": "yV8QeblH2gA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1749.25, "view_count": 50459}, {"video_id": "yV8QeblH2gA", "question": "In front of the white door, a man is wearing a blue bracelet on his wrist and holding a keypad device in his right hand. 
What is the object he is operating?", "question_wo_referring_query": "What is the object the man is operating?", "candidates": ["alarm", "door key", "car key", "laser pointer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "yV8QeblH2gA_2", "video_path": "yV8QeblH2gA.mp4", "subtitle_path": "yV8QeblH2gA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1749.25, "view_count": 50459}, {"video_id": "lPIeD2MFs8k", "question": "A man wearing a floral face mask and a black coat is standing in front of many trees. To his left is a bald man wearing a black mask, and behind the bald man is a red brick building. What happens when the man wearing the floral face mask appears for the first time?", "question_wo_referring_query": "What happens?", "candidates": ["An airplane flies past the man wearing the floral face mask", "A bald man walks past the man wearing the floral face mask", "A ship sails past the man wearing the floral face mask", "A group of soldiers walks past the man wearing the floral face mask", "A car drives past the man wearing the floral face mask"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "lPIeD2MFs8k_0", "video_path": "lPIeD2MFs8k.mp4", "subtitle_path": "lPIeD2MFs8k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2284.29, "view_count": 6748440}, {"video_id": "lPIeD2MFs8k", "question": "Under the pitch-black night sky, there are two people standing. On the left is a man wearing glasses and a gray coat, and on the right is a woman with curly hair wearing a hat. 
When a flashlight first appeared and changed hands, what did the man do?", "question_wo_referring_query": "What did the man do?", "candidates": ["Peeled a banana", "Clapped", "Peeled an apple", "Held his left arm with his right hand", "Cut a watermelon"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "lPIeD2MFs8k_1", "video_path": "lPIeD2MFs8k.mp4", "subtitle_path": "lPIeD2MFs8k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2284.29, "view_count": 6748440}, {"video_id": "lPIeD2MFs8k", "question": "A man in black clothes is holding his hands together next to his face. On his right, there's a bald man holding a cup of water. What did the bald man do when the man in black separated his hands for the first time?", "question_wo_referring_query": "What did the bald man do?", "candidates": ["Cut an apple", "Drink water", "Eat food", "Peel a banana", "Slice a watermelon"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "lPIeD2MFs8k_2", "video_path": "lPIeD2MFs8k.mp4", "subtitle_path": "lPIeD2MFs8k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2284.29, "view_count": 6748440}, {"video_id": "pHk8kJ8sOc0", "question": "The screen shows two pictures: the one on the left features two people, one of whom is wearing a wreath and bowing their head, with \"Case making\" written in the lower left corner of the image. The picture on the right shows a person in a side view looking into a mirror, with \"Cover making\" written in the lower left corner. 
When the story mentions that printing made binding possible, what did it suggest?", "question_wo_referring_query": "When the story mentions that printing made binding possible, what did it suggest?", "candidates": ["Manufacturers adding patterns to fabric", "Browse selected American trade bindings", "Large-scale production can be conducted", "Cover design and special features", "The application of decorative techniques to religious books"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "pHk8kJ8sOc0_0", "video_path": "pHk8kJ8sOc0.mp4", "subtitle_path": "pHk8kJ8sOc0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1312.61, "view_count": 2625}, {"video_id": "pHk8kJ8sOc0", "question": "The screen shows five split images with the title 'Welcome to Watson Library'. After the story mentions Watson's expansion plan, what is mentioned immediately after?", "question_wo_referring_query": "What is mentioned immediately after?", "candidates": ["Works of Henry van Dyke", "Large-scale production can be carried out", "Cover design and special features", "Manufacturers adding patterns to fabrics", "Offering specialized travel projects and lectures"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "pHk8kJ8sOc0_1", "video_path": "pHk8kJ8sOc0.mp4", "subtitle_path": "pHk8kJ8sOc0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1312.61, "view_count": 2625}, {"video_id": "pHk8kJ8sOc0", "question": "In the black and white photo, there is a man wearing a suit standing between bookshelves with his arms crossed. The title 'Watson Library today' is printed at the top of the photo. 
After mentioning that those who are over 18 can enter without an appointment, what event is described immediately after?", "question_wo_referring_query": ", what event is described immediately after?", "candidates": ["Browsing American trade journals", "Kenneth Soner is the eighth director", "Promotional printing for subscription development", "Mass production can take place", "Ornamental technology applications and religious books"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "pHk8kJ8sOc0_2", "video_path": "pHk8kJ8sOc0.mp4", "subtitle_path": "pHk8kJ8sOc0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1312.61, "view_count": 2625}, {"video_id": "fwZ0ElnMn44", "question": "A woman wearing a striped shirt is standing with a white wall behind her. To her side, there is a shelf. At the top of the shelf, there is a light green bucket with a toy beside it. Which item appears first below?", "question_wo_referring_query": "Which item appears first below?", "candidates": ["Orange electric fan", "Water faucet", "Cat", "Green ladder", "Glasses"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "fwZ0ElnMn44_0", "video_path": "fwZ0ElnMn44.mp4", "subtitle_path": "fwZ0ElnMn44_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1956.8, "view_count": 2330324}, {"video_id": "fwZ0ElnMn44", "question": "Which of the following objects appears first in the video?", "question_wo_referring_query": "Which of the following objects appears first in the video?", "candidates": ["Elevator", "Blue Paint", "Laptop", "Hat", "Refrigerator"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "fwZ0ElnMn44_1", "video_path": "fwZ0ElnMn44.mp4", "subtitle_path": "fwZ0ElnMn44_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1956.8, 
"view_count": 2330324}, {"video_id": "fwZ0ElnMn44", "question": "Which of the following objects appears first in the video?", "question_wo_referring_query": "Which of the following objects appears first in the video?", "candidates": ["Shopping cart", "Piano", "American flag", "Strawberry cake", "White headband"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "fwZ0ElnMn44_2", "video_path": "fwZ0ElnMn44.mp4", "subtitle_path": "fwZ0ElnMn44_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1956.8, "view_count": 2330324}, {"video_id": "1NzygnKWG_g", "question": "In a field, there is an elephant. Beside the elephant, there are two people. Behind the two people, there is a wooden structure held up by two pillars. A few tires are hanging on the wooden structure by ropes. When the subtitle says 'run run Kibbles,' what is the elephant doing?", "question_wo_referring_query": "What is the elephant doing?", "candidates": ["Eating food", "Walking", "Jumping", "Sleeping", "Lying on the ground"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "1NzygnKWG_g_0", "video_path": "1NzygnKWG_g.mp4", "subtitle_path": "1NzygnKWG_g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.46, "view_count": 607177}, {"video_id": "1NzygnKWG_g", "question": "Three big elephants and one small elephant are standing on the grass. A man wearing white clothes with a belt is standing between two big elephants. The left big elephant's trunk is placed on the small elephant's leg. 
After the subtitles 'trust you can put' appear, what is the man wearing white clothes doing?", "question_wo_referring_query": "What is the man wearing white clothes doing?", "candidates": ["touching a big elephant's trunk", "touching a whale", "peeling an apple", "touching a giraffe", "touching a monkey"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "1NzygnKWG_g_1", "video_path": "1NzygnKWG_g.mp4", "subtitle_path": "1NzygnKWG_g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.46, "view_count": 607177}, {"video_id": "1NzygnKWG_g", "question": "On the grass, there is a man wearing black clothes. Behind him, there are three small brown baskets, and in the distance, there are two hills covered with green plants. When the subtitle 'shooting the dock while vlogging it too' appears, what does this person do?", "question_wo_referring_query": "What does this person do?", "candidates": ["Pet a dog", "Touch a giraffe", "Touch an elephant", "Play on a cellphone", "Peel an apple"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "1NzygnKWG_g_2", "video_path": "1NzygnKWG_g.mp4", "subtitle_path": "1NzygnKWG_g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.46, "view_count": 607177}, {"video_id": "q7QP_lfqnQM", "question": "On a white background, there is green text 'Savah she' with 10 red circles. After the subtitle 'if the information that there is a word' appears, what is the first symbol that appears?", "question_wo_referring_query": "On a white background, there is green text 'Savah she' with 10 red circles. 
After the subtitle 'if the information that there is a word' appears, what is the first symbol that appears?", "candidates": ["Painted red English letter", "Red solid line", "Red circle", "Dark blue arrow", "Red arrow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "q7QP_lfqnQM_0", "video_path": "q7QP_lfqnQM.mp4", "subtitle_path": "q7QP_lfqnQM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2900.77, "view_count": 16540}, {"video_id": "q7QP_lfqnQM", "question": "The white background is filled with English sentences, and there is a heading 4.1. After the subtitle \"if you choose a good Kall right now\" appears, what is the first character that appears?", "question_wo_referring_query": ", what is the first character that appears?", "candidates": ["The formula Y=Softmax(B)G(X) shaded in yellow", "Xi framed in a red circle", "G(x) with wavy lines", "Machine Translation shaded in yellow", "Machine Translation"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "q7QP_lfqnQM_1", "video_path": "q7QP_lfqnQM.mp4", "subtitle_path": "q7QP_lfqnQM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2900.77, "view_count": 16540}, {"video_id": "q7QP_lfqnQM", "question": "There is a symbol (4) at the top of the page, and there are two red text equations under the second line formula below the symbol, with a green solid line below them. When the subtitle 'therefore you can't you can't represent' appears, what is the first symbol that appears?", "question_wo_referring_query": "There is a symbol (4) at the top of the page, and there are two red text equations under the second line formula below the symbol, with a green solid line below them. 
When the subtitle 'therefore you can't you can't represent' appears, what is the first symbol that appears?", "candidates": ["Green arrow", "Red solid line", "Dark blue arrow", "Red circle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "q7QP_lfqnQM_2", "video_path": "q7QP_lfqnQM.mp4", "subtitle_path": "q7QP_lfqnQM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2900.77, "view_count": 16540}, {"video_id": "syFi-Q2jKu8", "question": "Which of the following sequences of scenarios is correct?", "question_wo_referring_query": "Which of the following sequences of scenarios is correct?", "candidates": ["First, on a white background, there is a red 'Application: Adversarial robustness' title, with 3 images below it. Beside two of the images are pictures of pandas, and in the middle is a picture of a motorcycle. Then, on a white background, there is a folded line diagram composed of green and purple lines, with the word 'Parameter' written below it. Finally, on a white background, there is a red 'Overview of the approach' title, with 3 images below it. The images contain many black circles connected by black lines.", "First, on a white background, there is a red 'Overview of the approach' title, with 3 images below it. The images contain many black circles connected by black lines. Then, on a white background, there is a folded line diagram composed of green and purple lines, with the word 'Parameter' written below it. Finally, on a white background, there is a red 'Application: Adversarial robustness' title, with 3 images below it. Beside two of the images are pictures of pandas, and in the middle is a picture of a motorcycle.", "First, on a white background, there is a red 'Overview of the approach' title, with 3 images below it. The images contain many black circles connected by black lines. 
Then, on a white background, there is a red 'Application: Adversarial robustness' title, with 3 images below it. Beside two of the images are pictures of pandas, and in the middle is a picture of a motorcycle. Finally, on a white background, there is a folded line diagram composed of green and purple lines, with the word 'Parameter' written below it.", "First, on a white background, there is a folded line diagram composed of green and purple lines, with the word 'Parameter' written below it. Then, on a white background, there is a red 'Application: Adversarial robustness' title, with 3 images below it. Beside two of the images are pictures of pandas, and in the middle is a picture of a motorcycle. Finally, on a white background, there is a red 'Overview of the approach' title, with 3 images below it. The images contain many black circles connected by black lines.", "First, on a white background, there is a folded line diagram composed of green and purple lines, with the word 'Parameter' written below it. Then, on a white background, there is a red 'Overview of the approach' title, with 3 images below it. The images contain many black circles connected by black lines. Finally, on a white background, there is a red 'Application: Adversarial robustness' title, with 3 images below it. Beside two of the images are pictures of pandas, and in the middle is a picture of a motorcycle."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "syFi-Q2jKu8_0", "video_path": "syFi-Q2jKu8.mp4", "subtitle_path": "syFi-Q2jKu8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2325.32, "view_count": 96}, {"video_id": "syFi-Q2jKu8", "question": "Which of the following sequences of scenarios is correct?", "question_wo_referring_query": "Which of the following sequences of scenarios is correct?", "candidates": ["First, on a white background, there is a red 'Result' title text. 
Below the title is a table with 4 rows. Then, on a white background, there is a red 'Overview of the approach' title text. Below the title are 3 images with many black circles connected by black lines. Lastly, on a white background, there is a red 'Application: Adversarial robustness' title text. Below the title are 3 images, with 2 pictures of pandas on the sides and a picture of Ma Sicong in the middle.", "First, on a white background, there is a red 'Overview of the approach' title text. Below the title are 3 images with many black circles connected by black lines. Then, on a white background, there is a red 'Application: Adversarial robustness' title text. Below the title are 3 images, with 2 pictures of pandas on the sides and a picture of Ma Sicong in the middle. Lastly, on a white background, there is a red 'Result' title text. Below the title is a table with 4 rows.", "First, on a white background, there is a red 'Application: Adversarial robustness' title text. Below the title are 3 images, with 2 pictures of pandas on the sides and a picture of Ma Sicong in the middle. Then, on a white background, there is a red 'Overview of the approach' title text. Below the title are 3 images with many black circles connected by black lines. Lastly, on a white background, there is a red 'Result' title text. Below the title is a table with 4 rows.", "First, on a white background, there is a red 'Overview of the approach' title text. Below the title are 3 images with many black circles connected by black lines. Then, on a white background, there is a red 'Result' title text. Below the title is a table with 4 rows. Lastly, on a white background, there is a red 'Application: Adversarial robustness' title text. Below the title are 3 images, with 2 pictures of pandas on the sides and a picture of Ma Sicong in the middle.", "First, on a white background, there is a red 'Result' title text. Below the title is a table with 4 rows. 
Then, on a white background, there is a red 'Application: Adversarial robustness' title text. Below the title are 3 images, with 2 pictures of pandas on the sides and a picture of Ma Sicong in the middle. Lastly, on a white background, there is a red 'Overview of the approach' title text. Below the title are 3 images with many black circles connected by black lines."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "syFi-Q2jKu8_1", "video_path": "syFi-Q2jKu8.mp4", "subtitle_path": "syFi-Q2jKu8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2325.32, "view_count": 96}, {"video_id": "syFi-Q2jKu8", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a red 'Cross-session decoding' title. Below the title, there is a bar chart composed of five colors: red, purple, blue, orange, and green. Then, there is a red 'Robust training of brain-computer interfaces' title. Below this title, there is a monkey playing a computer with a mechanical arm. Finally, there is a blue 'Conclusions' title, with two sections of black text and one section of blue text below.", "First, there is a red 'Cross-session decoding' title. Below the title, there is a bar chart composed of five colors: red, purple, blue, orange, and green. Then, there is a blue 'Conclusions' title, with two sections of black text and one section of blue text below. Finally, there is a red 'Robust training of brain-computer interfaces' title. Below this title, there is a monkey playing a computer with a mechanical arm.", "First, there is a red 'Robust training of brain-computer interfaces' title. Below this title, there is a monkey playing a computer with a mechanical arm. Then, there is a blue 'Conclusions' title. 
Below the title, there are two sections of black text and one section of blue text. Finally, there is a red 'Cross-session decoding' title. Below the title, there is a bar chart composed of five colors: red, purple, blue, orange, and green.", "First, there is a red 'Robust training of brain-computer interfaces' title. Below this title, there is a monkey playing a computer with a mechanical arm. Then, there is a red 'Cross-session decoding' title. Below the title, there is a bar chart composed of five colors: red, purple, blue, orange, and green. Finally, there is a blue 'Conclusions' title, with two sections of black text and one section of blue text below.", "First, there is a blue 'Conclusions' title. Below the title, there are two sections of black text and one section of blue text. Then, there is a red 'Robust training of brain-computer interfaces' title. Below this title, there is a monkey playing a computer with a mechanical arm. Finally, there is a red 'Cross-session decoding' title. Below the title, there is a bar chart composed of five colors: red, purple, blue, orange, and green."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "syFi-Q2jKu8_2", "video_path": "syFi-Q2jKu8.mp4", "subtitle_path": "syFi-Q2jKu8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2325.32, "view_count": 96}, {"video_id": "F-dQBJcK0D0", "question": "The page title is Converting Units, with the formula for Temperature on the lower left and five formulas related to 1atm for Pressure on the lower right. 
In which scenes does a pink arrow icon with 'check' appear on the lower left of the screen?", "question_wo_referring_query": "In which scenes does a pink arrow icon with 'check' appear on the lower left of the screen?", "candidates": ["The upper part of the page contains the title, starting with the phrase 'a gas balloon', with no marks after Given and Find.", "The upper part of the page contains the title, starting with the phrase 'What pressure (mm Hg)', with no marks after Given, and a line of marks after Find.", "The page title is Converting Units, with the formula k=\u00b0c+273 on the lower left and five formulas related to 1atm on the right. These concepts are all highlighted in pink shading.", "The upper part of the page contains the title, starting with the phrase 'calculate the volume'."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "F-dQBJcK0D0_0", "video_path": "F-dQBJcK0D0.mp4", "subtitle_path": "F-dQBJcK0D0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1593.26, "view_count": 464096}, {"video_id": "F-dQBJcK0D0", "question": "The title at the top of the page starts with the phrase 'A 550.0 ml...' and provides three steps with 'given.' In the phrase group 'if the' below, there is the term 'Find: v2.' In which scenes does the blue dotted box around 'Find: v2' appear on the screen?", "question_wo_referring_query": "In which scenes does the blue dotted box around 'Find: v2' appear on the screen?", "candidates": ["The page title is 'Converting Units.' Below the title on the left is the equation for 'Temperature,' and below the title on the right is the five equations related to 'Pressure' about 1atm. 
A pink arrow marked 'check' is printed on the bottom left of the screen.", "The top of the page contains a title, starting with the phrase group 'a gas balloon...,' and there are no traces of erasure after 'Given' and 'Find.'", "The top left corner of the page has the formula '\u00b0C + 273 = 359K,' and below it, there is a string of numbers 1.571428571. Above this string of numbers in the formula, 350K and 550.0ml are highlighted in pink.", "The top of the page is titled 'Gas Laws' in purple. Below the title on the left is a group of six equations. To the right of the equations is an 'IMPORTANT' note with a pink highlight behind the vocabulary."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "F-dQBJcK0D0_1", "video_path": "F-dQBJcK0D0.mp4", "subtitle_path": "F-dQBJcK0D0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1593.26, "view_count": 464096}, {"video_id": "F-dQBJcK0D0", "question": "In the upper right corner of the frame, there's a woman wearing a white coat holding a white pen. Below her, there's another woman wearing black clothes and glasses. She is touching her neck with a hand that has a watch. In which scenes does the watch on the woman in black appear?", "question_wo_referring_query": "In which scenes does the watch on the woman in black appear?", "candidates": ["At the top of the page, there is a heading. It starts with the word group 'a gas balloon.' Given below are three lines of equations. There are no handwritten notes next to 'Find.' In the lower right corner, the woman in black puts her left hand near her mouth.", "At the top of the page, there are the symbols 300 K and 1.25atm. Below that, there is a solution path given with handwritten notes. The entire equation segment for 1.25atm is enclosed within a blue dashed line.", "At the top of the page, there is a heading. It starts with 'What pressure(m.' 
A square color palette appears in the middle of the screen.", "A green curtain background is swaying. In the upper right corner, inside a circular frame, a woman in white clothes is talking while making notes."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "F-dQBJcK0D0_2", "video_path": "F-dQBJcK0D0.mp4", "subtitle_path": "F-dQBJcK0D0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1593.26, "view_count": 464096}, {"video_id": "xHvvMGHmwuU", "question": "On a white snowy ground, a woman in white clothes and a black animal are in the middle. The woman is slightly squatting. There are trees on their left side and a red and yellow house on the right. In front of the house is a small tree covered in snow. Has this black dog appeared together with those subtitles?", "question_wo_referring_query": "Has this black dog appeared together with those subtitles?", "candidates": ["from the painting I create to the glitter of the snow under the light of a full moon", "that I will show you some images, it is so beautiful. I've shown her work before...", "to paraphrase a quote from an admired philosopher, C.S. Lewis, if we find ourselves with a desire that...", "military bases and she has always dreamed of having a little house in a secluded area", "for example a certain smell or sound of a creaking door"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "xHvvMGHmwuU_0", "video_path": "xHvvMGHmwuU.mp4", "subtitle_path": "xHvvMGHmwuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1020.73, "view_count": 509684}, {"video_id": "xHvvMGHmwuU", "question": "In a dimly lit room, there are pots on the windowsill, someone holding a lit candle on the right side, things hanging on the wall, and another candle on the table. 
Which subtitles have appeared with this candle?", "question_wo_referring_query": ", which subtitles have appeared with this candle?", "candidates": ["to paraphrase a quote from an admired philosopher ec.s lewis if we find ourselves with a desire this", "that i will show you some images it", "from the painnning i create to the glitter of the snow under they light a full moon", "military bases and she has always", "for example a certain smell or sound"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "xHvvMGHmwuU_1", "video_path": "xHvvMGHmwuU.mp4", "subtitle_path": "xHvvMGHmwuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1020.73, "view_count": 509684}, {"video_id": "xHvvMGHmwuU", "question": "In a room, there is a person wearing white clothes holding a round object, with a brown floor underneath their hand and a brown table to the left of their hand. With which subtitles has this round object coexisted?", "question_wo_referring_query": "With which subtitles has this round object coexisted?", "candidates": ["to paraphrase a quote from an admired philosopher ec.s lweis if we find ourselves with a desire this", "that i will show you some images it is so beautiful ilve shown her work before", "from the painnning i create to the glitter of the snow under they light a full moon", "for example a certain smell or sound of acreaking door", "military bases and she has always "], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "xHvvMGHmwuU_2", "video_path": "xHvvMGHmwuU.mp4", "subtitle_path": "xHvvMGHmwuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1020.73, "view_count": 509684}, {"video_id": "mGaMznGunXM", "question": "On the black wooden table, there are a kitchen knife and various dishes and bowls. 
The green onions and garlic are placed in the small bamboo basket. When the scene switches to a hand pressing on garlic on a yellow cutting board, what change occurs?", "question_wo_referring_query": "When the scene switches to a hand pressing on garlic on a yellow cutting board, what change occurs?", "candidates": ["The garlic is separated into cloves", "The garlic is burned", "The garlic is coated with oil", "The garlic is painted red"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "mGaMznGunXM_0", "video_path": "mGaMznGunXM.mp4", "subtitle_path": "mGaMznGunXM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.5, "view_count": 3536714}, {"video_id": "mGaMznGunXM", "question": "On a green grassy field, there are three dumplings placed on a plate on a stone on the grass. In front of the dumplings, there is a black pot, and to the right of the dumplings, there is a person wearing black pants and brown shoes squatting. When the scene changes, the short-haired man takes a bite of a dumpling; what happens to the dumpling?", "question_wo_referring_query": "On a green grassy field, there are three dumplings placed on a plate on a stone on the grass. In front of the dumplings, there is a black pot, and to the right of the dumplings, there is a person wearing black pants and brown shoes squatting. 
When the scene changes, the short-haired man takes a bite of a dumpling; what happens to the dumpling?", "candidates": ["Taken by a woman", "One section is missing", "Taken by a child", "Thrown into the air", "Falls to the ground"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "mGaMznGunXM_1", "video_path": "mGaMznGunXM.mp4", "subtitle_path": "mGaMznGunXM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.5, "view_count": 3536714}, {"video_id": "mGaMznGunXM", "question": "Under a white sky, a man dressed in black is holding a kebab skewer in his right hand and a small knife in his left. In front of him, there is a grill made of stacked stones with a pot placed on the stones. Behind the man, there is a mountain full of trees. When the scene changes to a skewer placed on a black stone platform, what happens to the meat on the skewer?", "question_wo_referring_query": "What happens to the meat on the skewer?", "candidates": ["The meat on the skewer turns into scattered pieces", "The meat on the skewer gets sprinkled with seasoning", "The meat on the skewer turns black", "Taken by a woman"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "mGaMznGunXM_2", "video_path": "mGaMznGunXM.mp4", "subtitle_path": "mGaMznGunXM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.5, "view_count": 3536714}, {"video_id": "o7GyFa90klg", "question": "On a wooden table, there are various types of leaves underneath a wooden board. On top of the board, there is a yellow orange that has been cut in half. Above the orange, there is a small silver knife. Beside the small knife, there is a finger. 
When the subtitle mentions 'i have as possible you see that i wear certain fast fashion items that i have owned:', what change happens to the orange?", "question_wo_referring_query": "On a wooden table, there are various types of leaves underneath a wooden board. On top of the board, there is a yellow orange that has been cut in half. Above the orange, there is a small silver knife. Beside the small knife, there is a finger. When the subtitle mentions 'i have as possible you see that i wear certain fast fashion items that i have owned:', what change happens to the orange?", "candidates": ["squeezed into juice", "cut into pieces", "cut into slices", "disappears", "peeled"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "o7GyFa90klg_0", "video_path": "o7GyFa90klg.mp4", "subtitle_path": "o7GyFa90klg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1189.02, "view_count": 1048952}, {"video_id": "o7GyFa90klg", "question": "On a brown table, there is a white plate full of indentations. A white jar is pouring a kind of transparent liquid into the plate. Next to the plate, there is a green potted plant. 
When the subtitle mentions 'do try to just focus on using what I already have and just wearing it until it's all worn out;', what changes occur with the white plate?", "question_wo_referring_query": "What changes occur with the white plate?", "candidates": ["A red fruit appears on the plate.", "A yellow fruit and a green plant appear on the plate.", "A green fruit appears on the plate.", "A pink fruit appears on the plate.", "A blue fruit and a green plant appear on the plate."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "o7GyFa90klg_1", "video_path": "o7GyFa90klg.mp4", "subtitle_path": "o7GyFa90klg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1189.02, "view_count": 1048952}, {"video_id": "o7GyFa90klg", "question": "In a room with wooden walls, there is a woman wearing a green short-sleeved shirt. Behind her, there are some green potted plants, and to her left, there is a white door. When the subtitle mentions 'breathing in all this gas that was just flowing into the room and had nowhere to go it was just ', what changes occurred in this woman?", "question_wo_referring_query": "What changes occurred in this woman?", "candidates": ["The woman stood up", "The woman's clothes changed to red", "A cup appeared in the woman's hand", "The woman squatted down", "The woman raised both hands"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "o7GyFa90klg_2", "video_path": "o7GyFa90klg.mp4", "subtitle_path": "o7GyFa90klg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1189.02, "view_count": 1048952}, {"video_id": "-dJFHks8-z0", "question": "In a room with a white wall, there are some paintings hanging on the wall. On a brown table, there is a white vase and other various tools. Next to the table, there is a woman with tied hair, wearing a white short-sleeved shirt, bending over. 
In front of the woman, there are some wooden combs. At the back of the table, there is a bed with a blanket and red clothes on it. What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["jumping rope", "getting dressed", "washing face", "running", "drinking water"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "-dJFHks8-z0_0", "video_path": "-dJFHks8-z0.mp4", "subtitle_path": "-dJFHks8-z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2402.9, "view_count": 1380614}, {"video_id": "-dJFHks8-z0", "question": "In a room with white walls, various decorations are hanging on the walls. On the right side, there is a bed with a blanket and a man with a bald head wearing long-sleeved clothes sitting on it. On the left side, there is a brown cabinet, and next to the cabinet, there is a chair. A woman with tied hair and wearing a white short-sleeved shirt is sitting on the chair. What is this woman doing?", "question_wo_referring_query": "In a room with white walls, various decorations are hanging on the walls. On the right side, there is a bed with a blanket and a man with a bald head wearing long-sleeved clothes sitting on it. On the left side, there is a brown cabinet, and next to the cabinet, there is a chair. A woman with tied hair and wearing a white short-sleeved shirt is sitting on the chair. 
What is this woman doing?", "candidates": ["Putting on socks", "Folding the blanket", "Combing hair", "Singing", "Putting on trousers"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "-dJFHks8-z0_1", "video_path": "-dJFHks8-z0.mp4", "subtitle_path": "-dJFHks8-z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2402.9, "view_count": 1380614}, {"video_id": "-dJFHks8-z0", "question": "Inside a room with white walls, there are various decorations hanging on the walls. On the right side, there is a bed with a quilt on it. On the left side, there is a brown cabinet with different tools on it. Next to the cabinet, there is a chair. Sitting on the chair is a man with short hair, wearing a blue shirt and white long sleeves. In front of the man, there is a woman with her hair tied up, wearing red clothes. What is this woman in red doing?", "question_wo_referring_query": "What is this woman in red doing?", "candidates": ["Dancing", "Eating something", "Wearing a black skirt", "Drinking water", "Running"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "-dJFHks8-z0_2", "video_path": "-dJFHks8-z0.mp4", "subtitle_path": "-dJFHks8-z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2402.9, "view_count": 1380614}, {"video_id": "NKcb7XoSenk", "question": "Under a white sky, there are various different houses along the street. The porch colors of the houses are different. The tallest house is white, with trees planted beside it. In the far distance of the screen, there is a mountain. 
Which of the following objects have appeared?", "question_wo_referring_query": "Which of the following objects have appeared?", "candidates": ["Ship", "Train", "Balloon", "Car", "Airplane"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "NKcb7XoSenk_0", "video_path": "NKcb7XoSenk.mp4", "subtitle_path": "NKcb7XoSenk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.03, "view_count": 9205}, {"video_id": "NKcb7XoSenk", "question": "A woman wearing a gray long-sleeve shirt is sitting on a bed with white sheets and pillows. She has her hands on her lap. Next to her, there is a red-cased item. Behind the woman, there is a window, and in front of the window, there is a short-haired man lying down. To the woman's right, there is a wall with white photo frames. Which of the following objects appear in this scene?", "question_wo_referring_query": "Which of the following objects appear in this scene?", "candidates": ["Books", "Cup", "Red blanket", "Snacks", "Computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "NKcb7XoSenk_1", "video_path": "NKcb7XoSenk.mp4", "subtitle_path": "NKcb7XoSenk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.03, "view_count": 9205}, {"video_id": "NKcb7XoSenk", "question": "In a room with white walls, there is a man with short hair wearing a purple-gray suit standing on the right, and a long-haired woman showing her teeth standing on the left. The woman has a wooden object on her left side. 
Which of the following objects has appeared?", "question_wo_referring_query": "Which of the following objects has appeared?", "candidates": ["White lamp", "Green lamp", "Pink lamp", "Blue lamp", "Purple lamp"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "NKcb7XoSenk_2", "video_path": "NKcb7XoSenk.mp4", "subtitle_path": "NKcb7XoSenk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.03, "view_count": 9205}, {"video_id": "oi_yuxxEkwA", "question": "In a room with white walls, a man wearing a gray hat and a backpack is facing the mirror. He is holding the same tool in both hands. What style of clothing is the man wearing?", "question_wo_referring_query": "What style of clothing is the man wearing?", "candidates": ["green long sleeves", "green zippered hoodie", "green short sleeves", "green down jacket with zipper", "green buttoned down jacket"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "oi_yuxxEkwA_0", "video_path": "oi_yuxxEkwA.mp4", "subtitle_path": "oi_yuxxEkwA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1138.67, "view_count": 396325}, {"video_id": "oi_yuxxEkwA", "question": "In a snow-covered terrain, a man wearing a gray hat and green clothes is facing a camera. Behind him, there is a gray house with a streetlight next to it and a car behind him. 
What is the color of the car behind the man?", "question_wo_referring_query": "What is the color of the car behind the man?", "candidates": ["green", "black", "gray", "white", "pink"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "oi_yuxxEkwA_1", "video_path": "oi_yuxxEkwA.mp4", "subtitle_path": "oi_yuxxEkwA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1138.67, "view_count": 396325}, {"video_id": "oi_yuxxEkwA", "question": "In a room with white walls, there is a map hanging on the wall. Next to the map is a bookshelf full of books. In front of the bookshelf is a wooden table with books on it. On both sides sit two men with short hair and wearing grey clothes. They have their hands on the table. The man on the left is looking at the man on the right, while the man on the right is looking at the object in his hand. What is the shape of the object that the man on the right is holding?", "question_wo_referring_query": "What is the shape of the object that the man on the right is holding?", "candidates": ["Staircase", "Circle", "Ellipsoid", "Rectangle", "Cone"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "oi_yuxxEkwA_2", "video_path": "oi_yuxxEkwA.mp4", "subtitle_path": "oi_yuxxEkwA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1138.67, "view_count": 396325}, {"video_id": "UDLqu_HjlDA", "question": "On a night, a building with English letters on the roof emitted a firelight. There were several cars parked in front of this building. In the middle of the screen, there are English letters, the McDonald's logo, and a red rectangle with a white background under the McDonald's logo. 
What color is the McDonald's logo when the subtitle mentions 'closed after the early moments to'?", "question_wo_referring_query": "What color is the McDonald's logo?", "candidates": ["black", "yellow", "pink", "white", "red"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "UDLqu_HjlDA_0", "video_path": "UDLqu_HjlDA.mp4", "subtitle_path": "UDLqu_HjlDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.76, "view_count": 3429}, {"video_id": "UDLqu_HjlDA", "question": "On a certain night, a house with English letters on its roof emitted sparks. In front of this house, a few cars were parked. In the middle of the screen, there were English letters, the McDonald's logo, and a red square. Among them, the background under the McDonald's logo was white. When the subtitle mentions 'uh and I was listening to Mark,' what shape is the red logo with 'news'?", "question_wo_referring_query": ", What shape is the red logo with 'news'?", "candidates": ["rectangle", "trapezoid", "circle", "square", "ladder shape"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "UDLqu_HjlDA_1", "video_path": "UDLqu_HjlDA.mp4", "subtitle_path": "UDLqu_HjlDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.76, "view_count": 3429}, {"video_id": "UDLqu_HjlDA", "question": "On a night, there is a house with an English letter on the roof emitting firelight. In front of this house, a few cars are parked. In the middle of the screen, there is an English letter, a McDonald's logo, and a red square with a white background and the McDonald's logo below it. 
When the subtitle mentions 'confusion there's a lot of different,' what is the color of the car on the right side of the screen?", "question_wo_referring_query": "What is the color of the car on the right side of the screen?", "candidates": ["black", "yellow", "blue", "green", "olive"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "UDLqu_HjlDA_2", "video_path": "UDLqu_HjlDA.mp4", "subtitle_path": "UDLqu_HjlDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.76, "view_count": 3429}, {"video_id": "6knperYEcVY", "question": "In the exhibition, a man in a purple suit is facing a mirror, with a blue dress on a mannequin to his left and a pink dress on a mannequin to his right. What words appear in the black frame that pops up at the bottom left of the screen?", "question_wo_referring_query": "What words appear in the black frame that pops up at the bottom left of the screen?", "candidates": ["MAX HOLLEIN", "american fashion", "embroidered", "government"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "6knperYEcVY_0", "video_path": "6knperYEcVY.mp4", "subtitle_path": "6knperYEcVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 918.59, "view_count": 35923}, {"video_id": "6knperYEcVY", "question": "In the exhibition room, a man wearing a purple suit and a white mask stands in front of the camera. Behind him, there are several clothing exhibits. To the man's left, there's a maroon skirt with stripes. 
Which words are included in the black box that pops up in the lower left corner of the screen?", "question_wo_referring_query": "Which words are included in the black box that pops up in the lower left corner of the screen?", "candidates": ["creativity", "domestic", "ANDREW BOLTON", "simplicity", "cocial justice"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "6knperYEcVY_1", "video_path": "6knperYEcVY.mp4", "subtitle_path": "6knperYEcVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 918.59, "view_count": 35923}, {"video_id": "6knperYEcVY", "question": "A woman with short hair, wearing yellow clothes, is on the stairs with both hands tightly pressed against her abdomen facing the camera. Which words are included in the black box that pops up in the lower left part of the screen?", "question_wo_referring_query": "Which words are included in the black box that pops up in the lower left part of the screen?", "candidates": ["embroidered", "american fashion", "fashion community", "Eva Chen", "costume"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "6knperYEcVY_2", "video_path": "6knperYEcVY.mp4", "subtitle_path": "6knperYEcVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 918.59, "view_count": 35923}, {"video_id": "HuNinFOk2nU", "question": "On a brown wooden board, there is a person holding a knife in one hand and pressing down on a scallion flower with the other hand, with a white cabinet in the background. 
When this knife appears for the first time, what happens?", "question_wo_referring_query": "When this knife appears for the first time, what happens?", "candidates": ["The bok choy is chopped", "The fish is sliced", "Placed on the table", "Disappears", "The scallion flower is chopped"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "HuNinFOk2nU_0", "video_path": "HuNinFOk2nU.mp4", "subtitle_path": "HuNinFOk2nU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1328.12, "view_count": 269015}, {"video_id": "HuNinFOk2nU", "question": "On a white table, in the middle of the table, there's a white bowl with a golden floral rim, and next to the table stands a person holding a condiment. What happened when this bowl appeared for the first time?", "question_wo_referring_query": ", what happened when this bowl appeared for the first time?", "candidates": ["Put the fruit into the bowl", "Pour the sauce into the bowl", "Pour the vinegar into the bowl", "The bowl was smashed", "Pour the salt into the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "HuNinFOk2nU_1", "video_path": "HuNinFOk2nU.mp4", "subtitle_path": "HuNinFOk2nU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1328.12, "view_count": 269015}, {"video_id": "HuNinFOk2nU", "question": "In a dish with a white and blue pattern, there are various cooked ingredients of different colors, and there is a hand holding a ladle. 
What happened when the ladle first appeared?", "question_wo_referring_query": ", what happened when the ladle first appeared?", "candidates": ["Turned blue", "Lifted the ingredients in the bowl", "Disappeared", "Turned red", "Placed on the dish"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "HuNinFOk2nU_2", "video_path": "HuNinFOk2nU.mp4", "subtitle_path": "HuNinFOk2nU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1328.12, "view_count": 269015}, {"video_id": "2W2hpfKrtu4", "question": "In a dark-colored scene with a somewhat dark background, a character wearing white hair ornaments and white clothing is standing. In front of them is a podium. What did this woman do when the subtitle 'the totality of recent history is' appeared?", "question_wo_referring_query": "What did this woman do?", "candidates": ["writing", "singing", "playing video games", "dancing", "playing the piano"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "2W2hpfKrtu4_0", "video_path": "2W2hpfKrtu4.mp4", "subtitle_path": "2W2hpfKrtu4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3251.12, "view_count": 12839}, {"video_id": "2W2hpfKrtu4", "question": "In a dark environment, there is a screen in the middle with varying shades of blue. Below the screen, there is an object, and to the right of the screen, there is a standing person. 
What event occurs on the screen when the subtitle 'from their British half your brimming' appears?", "question_wo_referring_query": "What event occurs on the screen?", "candidates": ["The person on the screen shrinks.", "The man on the right side of the screen walks to the center of the screen.", "The man on the screen disappears.", "Another person appears on the screen.", "The person on the screen enlarges."], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "2W2hpfKrtu4_1", "video_path": "2W2hpfKrtu4.mp4", "subtitle_path": "2W2hpfKrtu4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3251.12, "view_count": 12839}, {"video_id": "2W2hpfKrtu4", "question": "In a dimly lit room, there is an illuminated screen with a person standing on it. There is a mountain behind the person. Below the screen, two other people are slightly visible. What happened to the screen when the words 'Korean displayed searches for traces of' appeared?", "question_wo_referring_query": "What happened to the screen?", "candidates": ["The person on the screen became larger", "Another person appeared on the screen", "The person on the screen became smaller", "The person on the screen jumped", "The person on the screen swayed"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "2W2hpfKrtu4_2", "video_path": "2W2hpfKrtu4.mp4", "subtitle_path": "2W2hpfKrtu4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3251.12, "view_count": 12839}, {"video_id": "kOf4lGar0do", "question": "In a room, a short-haired man wearing a checkered shirt is singing a song. In front of him is a microphone stand, and he is surrounded by a group of onlookers. Closest to the man is a woman wearing a white coat. Behind them is a blue background board with text on it. 
What did the man in the checkered shirt do after he finished singing?", "question_wo_referring_query": "What did the man in the checkered shirt do after he finished singing?", "candidates": ["Wrote something", "Danced", "Jumped up", "Played with a phone", "Left the place"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "kOf4lGar0do_0", "video_path": "kOf4lGar0do.mp4", "subtitle_path": "kOf4lGar0do_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.99, "view_count": 1352}, {"video_id": "kOf4lGar0do", "question": "In a beige room, a bald man wearing a wreath and dressed in white clothes is standing at the front, talking, with a group of people behind him. After this man finishes singing, what does he do?", "question_wo_referring_query": ", after this man finishes singing, what does he do?", "candidates": ["He drinks water", "He dances", "He jumps", "He eats something", "He steps back and leaves the place where he was standing"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "kOf4lGar0do_1", "video_path": "kOf4lGar0do.mp4", "subtitle_path": "kOf4lGar0do_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.99, "view_count": 1352}, {"video_id": "kOf4lGar0do", "question": "In a room, a man wearing blue clothes with a hair accessory is holding something and filming towards the front, where various exhibits are displayed. In front of the exhibits, there is a person singing. The man in blue is surrounded by people. 
What event occurred after the person in front of the exhibits finished singing?", "question_wo_referring_query": "What event occurred after the person in front of the exhibits finished singing?", "candidates": ["The crowd around the man in blue dispersed and moved towards the exhibits", "Someone started dancing", "Applause", "Someone picked up an exhibit", "Nothing happened"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "kOf4lGar0do_2", "video_path": "kOf4lGar0do.mp4", "subtitle_path": "kOf4lGar0do_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.99, "view_count": 1352}, {"video_id": "O_naf7-fGZE", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["White rope", "Yellow box", "Basket with green plants", "Burning torch", "Flowerpot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "O_naf7-fGZE_0", "video_path": "O_naf7-fGZE.mp4", "subtitle_path": "O_naf7-fGZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1262.33, "view_count": 89234}, {"video_id": "O_naf7-fGZE", "question": "In a room with white walls, various tools are hanging on the walls. There is a wooden table in the room, which has a quilt and a potted plant on it. On the left side of the table, there is a man wearing brown clothes sitting on a chair. On the right side of the table, there is a white bag hanging on the wall. 
Who was the first person to enter this room?", "question_wo_referring_query": "Who was the first person to enter this room?", "candidates": ["The man in brown clothes", "The woman wearing a green skirt", "The woman wearing a blue skirt", "The woman wearing a red skirt", "The woman wearing a pink skirt"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "O_naf7-fGZE_1", "video_path": "O_naf7-fGZE.mp4", "subtitle_path": "O_naf7-fGZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1262.33, "view_count": 89234}, {"video_id": "O_naf7-fGZE", "question": "In a forest, there is a box made of wood placed within the forest. On the box, there is a white object. Next to the box stands a man wearing a black coat and a hat, and a woman wearing green clothes. The woman is holding an unlit torch. Who is the first person to hold a lit torch?", "question_wo_referring_query": "Who is the first person to hold a lit torch?", "candidates": ["A child", "The man wearing the black coat and hat", "The woman wearing green clothes", "The man wearing brown clothes", "No one is holding it"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "O_naf7-fGZE_2", "video_path": "O_naf7-fGZE.mp4", "subtitle_path": "O_naf7-fGZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1262.33, "view_count": 89234}, {"video_id": "fXRqX6uQ4ts", "question": "In a room with white walls, a woman with long hair wearing a gray cloak takes a selfie with a selfie stick. After the woman says 'bathroom is pretty and this is me,' what does she do next?", "question_wo_referring_query": "In a room with white walls, a woman with long hair wearing a gray cloak takes a selfie with a selfie stick. 
After the woman says 'bathroom is pretty and this is me,' what does she do next?", "candidates": ["Uses chopsticks to eat yellow food", "Goes to visit the 7-ELEVEN supermarket", "Takes a box with 'Shu Li Xiao' written on it and faces the mirror", "Drinks lemon water", "Pushes a black suitcase from right to left"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "fXRqX6uQ4ts_0", "video_path": "fXRqX6uQ4ts.mp4", "subtitle_path": "fXRqX6uQ4ts_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 973.27, "view_count": 111457}, {"video_id": "fXRqX6uQ4ts", "question": "In a room with yellow walls, there is a woman in front of the mirror wearing a yellow floral dress. Behind her, there are several green tables and chairs arranged in a row. After the woman says 'hurt guys I\u2019m at a Cat\u2026', what does she do?", "question_wo_referring_query": "What does she do?", "candidates": ["Displaying three bottles of shower gel from a box in front of the mirror.", "Eating instant noodles.", "Sitting by the bookshelf in front of the mirror, wearing a floral dress and reading a book with her head down.", "Drinking a beverage by the shelves.", "Pointing with her left hand at a sign that says 'Sky Garden'."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "fXRqX6uQ4ts_1", "video_path": "fXRqX6uQ4ts.mp4", "subtitle_path": "fXRqX6uQ4ts_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 973.27, "view_count": 111457}, {"video_id": "fXRqX6uQ4ts", "question": "In a bathroom with white tiled walls, there is a transparent glass partition on the left side. In front of the mirror, there is a woman holding a toothbrush and wearing a dress with a rice-colored floral pattern. 
After the woman says 'plan cuz that kind of stresses me it,' what action does she take?", "question_wo_referring_query": "What action does she take?", "candidates": ["Sitting in front of the mirror in a floral dress, looking down at a book next to a bookshelf", "Wrapping herself in a white bath towel and drinking mineral water", "Goes to the 7-ELEVEN supermarket for a stroll", "Holding a box labeled 'Scholl Effect' in front of the mirror", "Points to the sign that says 'Sky Garden' with her left hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "fXRqX6uQ4ts_2", "video_path": "fXRqX6uQ4ts.mp4", "subtitle_path": "fXRqX6uQ4ts_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 973.27, "view_count": 111457}, {"video_id": "ZedmncHt3OQ", "question": "In the dark historical architecture background, there is a picture of a pyramid and a sphinx. In the upper left corner of the background, there is the English word 'EGYPT.' After the sentence 'costa finished 6th bye bye 6 2 6 2 6 1' appears, what is the first object that appears?", "question_wo_referring_query": ", what is the first object that appears?", "candidates": ["Bronze statue of Hammurabi", "Bronze statue of Utnapishtim", "Statue of four goddesses from Greece", "A book stamped with SUMERIANS on the cover", "Chest filled with gold and silver treasures"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "ZedmncHt3OQ_0", "video_path": "ZedmncHt3OQ.mp4", "subtitle_path": "ZedmncHt3OQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1386.02, "view_count": 406286}, {"video_id": "ZedmncHt3OQ", "question": "In a dim yellow background, a shadow holds a sword in one hand and scales in the other hand. To the upper left of the shadow, the words 'CODE OF HAMMURABI' appear. 
After the subtitle mentions 'well hung,' what is the first object that appears?", "question_wo_referring_query": ", what is the first object that appears?", "candidates": ["A book with 'GILGAMESH' on the cover", "A book with 'ASSYRIAN HISTORY' on the cover", "A book with 'SUMERIANS' on the cover", "An Assyrian bird-like carving below", "A camel walking in the desert"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "ZedmncHt3OQ_1", "video_path": "ZedmncHt3OQ.mp4", "subtitle_path": "ZedmncHt3OQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1386.02, "view_count": 406286}, {"video_id": "ZedmncHt3OQ", "question": "On a black and red background page, at the top it says 'LAY OUT THE ORDER OF EVENTS'. Below the title, there are images of a circular artifact, an inscribed stone tablet, a pair of swords, and a statue of a mythical creature. When the subtitle 'queenstown' appears, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears when the subtitle 'queenstown' is shown?", "candidates": ["A mirror showing a cityscape, with the word 'HIPPUR' underneath", "A magnifying glass", "Four people riding horses in the desert", "Desert dunes", "A sun symbol"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "ZedmncHt3OQ_2", "video_path": "ZedmncHt3OQ.mp4", "subtitle_path": "ZedmncHt3OQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1386.02, "view_count": 406286}, {"video_id": "zuBgQEGdhx4", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First is the scene where 'trees in a green field are in the distance, with a plate of chicken legs and two hands in the foreground.' 
Next is the scene where 'two hands are pulling apart a pair of chicken legs.' Finally, the scene shows 'a round pancake is placed on a wooden board, with a hand sprinkling flour on top of the pancake.'", "First is the scene where 'a round pancake is placed on a wooden board, with a hand sprinkling flour on top of the pancake.' Next is the scene where 'two hands are pulling apart a pair of chicken legs.' Finally, the scene shows 'trees in a green field in the distance, with a plate of chicken legs and two hands in the foreground.'", "First is the scene where 'a round pancake is placed on a wooden board, with a hand sprinkling flour on top of the pancake.' Next is the scene where 'trees in a green field are in the distance, with a plate of chicken legs and two hands in the foreground.' Finally, the scene shows 'two hands pulling apart a pair of chicken legs.'", "First is the scene where 'two hands are pulling apart a pair of chicken legs.' Next is the scene where 'trees in a green field are in the distance, with a plate of chicken legs and two hands in the foreground.' Finally, the scene shows 'a round pancake is placed on a wooden board, with a hand sprinkling flour on top of the pancake.'", "First is the scene where 'two hands are pulling apart a pair of chicken legs.' Next is the scene where 'a round pancake is placed on a wooden board, with a hand sprinkling flour on top of the pancake.' 
Finally, the scene shows 'trees in a green field in the distance, with a plate of chicken legs and two hands in the foreground.'"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "zuBgQEGdhx4_0", "video_path": "zuBgQEGdhx4.mp4", "subtitle_path": "zuBgQEGdhx4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1072.71, "view_count": 2220147}, {"video_id": "zuBgQEGdhx4", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, and a pot with liquid on the right nearby'. Then there's a scene with 'an orange-yellow desktop with an iron plate on the right and a short wall made of stacked wood in the distance'. Finally, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, and a pot with liquid on the right nearby.'", "First, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, and a pot with liquid on the right nearby'. Then there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, an iron rake picking up a red intestine in the middle nearby, and a pot with liquid on the right nearby'. Finally, there's a scene with 'an orange-yellow desktop with an iron plate on the right and a short wall made of stacked wood in the distance.'", "First, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, and a pot with liquid on the right nearby'. Then there's a scene with 'an orange-yellow desktop with an iron plate on the right and a short wall made of stacked wood in the distance'. 
Finally, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, an iron rake picking up a red intestine in the middle nearby, and a pot with liquid on the right nearby.'", "First, there's a scene with 'an orange-yellow desktop with an iron plate on the right and a short wall made of stacked wood in the distance'. Then there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, an iron rake picking up a red intestine in the middle nearby, and a pot with liquid on the right nearby'. Finally, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, and a pot with liquid on the right nearby.'", "First, there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, and a pot with liquid on the right nearby'. Then there's a scene with 'a yellow wooden house in the distance on the screen, a plate with meat and bones on the left nearby, an iron rake picking up a red intestine in the middle nearby, and a pot with liquid on the right nearby'. 
Finally, there's a scene with 'an orange-yellow desktop with an iron plate on the right and a short wall made of stacked wood in the distance.'"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "zuBgQEGdhx4_1", "video_path": "zuBgQEGdhx4.mp4", "subtitle_path": "zuBgQEGdhx4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1072.71, "view_count": 2220147}, {"video_id": "zuBgQEGdhx4", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, the scene where 'a wooden table stacked with lots of logs, three wooden bowls on the left, a plant pot in the middle, and three garlic sprouts on the right' appears, then the scene where 'a pot filled with green soup, with purple and white onion rings floating in it' appears, and finally the scene where 'a green field in the distance, with a steaming pot nearby' appears.", "First, the scene where 'a green field in the distance, with a steaming pot nearby' appears, then the scene where 'a pot filled with green soup, with purple and white onion rings floating in it' appears, and finally the scene where 'a wooden table stacked with lots of logs, three wooden bowls on the left, a plant pot in the middle, and three garlic sprouts on the right' appears.", "First, the scene where 'a wooden table stacked with lots of logs, three wooden bowls on the left, a plant pot in the middle, and three garlic sprouts on the right' appears, then the scene where 'a green field in the distance, with a steaming pot nearby' appears, and finally the scene where 'a pot filled with green soup, with purple and white onion rings floating in it' appears.", "First, the scene where 'a pot filled with green soup, with purple and white onion rings floating in it' appears, then the scene where 'a wooden table stacked with lots of logs, three wooden 
bowls on the left, a plant pot in the middle, and three garlic sprouts on the right' appears, and finally the scene where 'a green field in the distance, with a steaming pot nearby' appears.", "First, the scene where 'a pot filled with green soup, with purple and white onion rings floating in it' appears, then the scene where 'a green field in the distance, with a steaming pot nearby' appears, and finally the scene where 'a wooden table stacked with lots of logs, three wooden bowls on the left, a plant pot in the middle, and three garlic sprouts on the right' appears."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "zuBgQEGdhx4_2", "video_path": "zuBgQEGdhx4.mp4", "subtitle_path": "zuBgQEGdhx4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1072.71, "view_count": 2220147}, {"video_id": "wgUIB4tD0cM", "question": "In the blurry black-and-white scene, there is a picture of Einstein with his hands clasped together in the middle. In which of the following scenes does this picture of Einstein appear?", "question_wo_referring_query": "In which of the following scenes does this picture of Einstein appear?", "candidates": ["In the hands of a man wearing a yellow T-shirt", "At a crowded exhibition", "Among the six pictures arranged in two rows with a blue-green background", "On a computer screen", "In one of the three murals on the white wall"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "wgUIB4tD0cM_0", "video_path": "wgUIB4tD0cM.mp4", "subtitle_path": "wgUIB4tD0cM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 923.49, "view_count": 284207}, {"video_id": "wgUIB4tD0cM", "question": "In the scattered yellow-green starry scene, surrounded by stars with white dots, there is a red arrow in the middle of the screen. 
In which of the following scenes does this red arrow appear?", "question_wo_referring_query": ", in which of the following scenes does this red arrow appear?", "candidates": ["In the screen composed of a dispersing circular shape from yellow to purple, red, and blue", "In the PPT screen with white background and black text", "In a beach scene densely covered with dark clouds", "In three paintings on a white wall", "In the six pictures arranged in two rows with a blue-green background"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "wgUIB4tD0cM_1", "video_path": "wgUIB4tD0cM.mp4", "subtitle_path": "wgUIB4tD0cM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 923.49, "view_count": 284207}, {"video_id": "wgUIB4tD0cM", "question": "In the screen with a black background, there is a stack of red and green spheres in the middle. Below the stack of spheres, there is a white 'Oxygen' text. In which of the following scenarios does this white 'Oxygen' text appear?", "question_wo_referring_query": ", in which of the following scenarios does this white 'Oxygen' text appear?", "candidates": ["A chemical element table recorded in a grid under a yellow background", "Green and red element table with Cf circled in green", "A cover brochure printed with SUN AND MAN", "Four chemical molecules in the middle of the screen under a black background, including Carbon-13 and Helium", "On a blue outer cover"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "wgUIB4tD0cM_2", "video_path": "wgUIB4tD0cM.mp4", "subtitle_path": "wgUIB4tD0cM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 923.49, "view_count": 284207}, {"video_id": "vLkKSLgjLMY", "question": "Under the blue sky, there is a white building with a clock hanging on the right side of the screen, and on the left side, there is a red and white building that is 
collapsing. Which of the following subtitles have also appeared along with the red and white building in the middle of the screen?", "question_wo_referring_query": "Which of the following subtitles have also appeared along with the red and white building in the middle of the screen?", "candidates": ["Love brings us warmth in the fearful coldness", "But they aim", "Too often we take it as granted", "such as the toilet", "Good luck, everybody!"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "vLkKSLgjLMY_0", "video_path": "vLkKSLgjLMY.mp4", "subtitle_path": "vLkKSLgjLMY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2860.83, "view_count": 966}, {"video_id": "vLkKSLgjLMY", "question": "On the left side is the Mo Tian Lun screen background, in front of the camera is a man wearing a white shirt and black suit jacket with glasses. Which of the following subtitles appeared along with the man in the black coat in the screen?", "question_wo_referring_query": "Which of the following subtitles appeared along with the man in the black coat in the screen?", "candidates": ["Just a matter of time. Mark, thank you.", "after the Spring Festival", "My parents were free", "But they aim", "I like reading books"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "vLkKSLgjLMY_1", "video_path": "vLkKSLgjLMY.mp4", "subtitle_path": "vLkKSLgjLMY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2860.83, "view_count": 966}, {"video_id": "vLkKSLgjLMY", "question": "On the left side of the screen with the Ferris wheel in the background, there is a man in front of the camera wearing a white shirt and a black suit jacket with a red tie. 
Which of the following subtitles has also appeared with the man in the red tie in the scene?", "question_wo_referring_query": "Which of the following subtitles has appeared with the man in the red tie in the scene?", "candidates": ["man should not depend on luck which", "penetration of it by 2030", "my senior high school English teacher", "Maybe I'm your good choice", "The walls are white and blue"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "vLkKSLgjLMY_2", "video_path": "vLkKSLgjLMY.mp4", "subtitle_path": "vLkKSLgjLMY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2860.83, "view_count": 966}, {"video_id": "tlgKSes6-qk", "question": "On a wooden-colored desk, there is a stand with colorful blocks printed on it, on the stand there is a book, and a hand is holding a black pen. What objects are on the desk?", "question_wo_referring_query": "What objects are on the desk?", "candidates": ["yellow pen", "green pen", "red eraser", "black notebook computer", "white mobile phone"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "tlgKSes6-qk_0", "video_path": "tlgKSes6-qk.mp4", "subtitle_path": "tlgKSes6-qk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1426.49, "view_count": 4294}, {"video_id": "tlgKSes6-qk", "question": "In a white room, there is a hook shaped like a tree branch on the wall. There is a table next to the wall, and on the table, there is a painting. There is a woman with short hair speaking. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["White hook", "Black glasses", "Olive flowerpot", "Purple top", "Silver power strip"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "tlgKSes6-qk_1", "video_path": "tlgKSes6-qk.mp4", "subtitle_path": "tlgKSes6-qk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1426.49, "view_count": 4294}, {"video_id": "tlgKSes6-qk", "question": "On a table covered with a colorful plaid tablecloth, there is a pair of hands drawing on a white notebook with a pencil. The notebook has a pencil drawing of a man, and there is a photograph next to the notebook. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["a black pen", "a black computer", "a gold-colored cup", "a pink pig-style eraser", "a white phone"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "tlgKSes6-qk_2", "video_path": "tlgKSes6-qk.mp4", "subtitle_path": "tlgKSes6-qk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1426.49, "view_count": 4294}, {"video_id": "R0clnHE37Xc", "question": "Against a black background, there are many frames of different colors, each with words inside. On top of the frames, there is a large yellow-green frame with the word 'higgs' inside. 
When the caption 'if you start out with the higgs particle' appears, what object is present on the screen?", "question_wo_referring_query": "what object is present on the screen?", "candidates": ["blue circle", "pink circle", "red circle", "green circle", "white circle"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "R0clnHE37Xc_0", "video_path": "R0clnHE37Xc.mp4", "subtitle_path": "R0clnHE37Xc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 930.97, "view_count": 308339}, {"video_id": "R0clnHE37Xc", "question": "In a dinosaur fossil exhibit, there is a huge dinosaur skeleton in the middle, surrounded by a barrier with many people observing. When the subtitle \"looking at its Decay products the idea\" appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["sky blue backpack", "olive backpack", "yellow backpack", "white backpack", "red backpack"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "R0clnHE37Xc_1", "video_path": "R0clnHE37Xc.mp4", "subtitle_path": "R0clnHE37Xc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 930.97, "view_count": 308339}, {"video_id": "R0clnHE37Xc", "question": "In a room with four lights in the background, a man with short hair wearing a black shirt is talking. There are some white letters on his chest. 
When he says 'will get 20% off their subscription I', what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["White square frame", "Red rectangular frame", "Blue square frame", "Green square frame", "Purple rectangular frame"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "R0clnHE37Xc_2", "video_path": "R0clnHE37Xc.mp4", "subtitle_path": "R0clnHE37Xc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 930.97, "view_count": 308339}, {"video_id": "cOaODWZ48IE", "question": "In a green grassland, there is a low wall made of piled stones. Next to the wall, there are a few green trees. A man stands in front of a wooden table, on which there are many wooden plates and a large slab of ribs. In the middle of the table, there is also a yellow lemon placed in one of the plates. What kind of clothes is the man standing in front of the table wearing?", "question_wo_referring_query": "In a green grassland, there is a low wall made of piled stones. Next to the wall, there are a few green trees. A man stands in front of a wooden table, on which there are many wooden plates and a large slab of ribs. In the middle of the table, there is also a yellow lemon placed in one of the plates. 
What kind of clothes is the man standing in front of the table wearing?", "candidates": ["Black long-sleeved shirt", "Gray short-sleeved shirt", "Black hooded jacket", "Black short-sleeved shirt", "Gray long-sleeved shirt"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "cOaODWZ48IE_0", "video_path": "cOaODWZ48IE.mp4", "subtitle_path": "cOaODWZ48IE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1443.51, "view_count": 25876334}, {"video_id": "cOaODWZ48IE", "question": "On a green lawn, a man with short hair, wearing a black short-sleeved shirt, is crouching in front of a pot. To the right of the man is a pile of wood, and to the left is a wooden rack with various spices on it. What color is the glove the man is holding?", "question_wo_referring_query": ", what color is the glove the man is holding?", "candidates": ["green", "white", "black", "purple", "blue"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "cOaODWZ48IE_1", "video_path": "cOaODWZ48IE.mp4", "subtitle_path": "cOaODWZ48IE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1443.51, "view_count": 25876334}, {"video_id": "cOaODWZ48IE", "question": "On a green lawn, there are two yellow long tables, with many children sitting beside them. There are three plates of food on the tables, along with spoons. 
What style are the plates on the tables?", "question_wo_referring_query": "What style are the plates on the tables?", "candidates": ["White plates with blue floral edges", "Plain blue plates", "Plain white plates", "White plates with purple floral edges", "Plates with a rabbit in the middle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "cOaODWZ48IE_2", "video_path": "cOaODWZ48IE.mp4", "subtitle_path": "cOaODWZ48IE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1443.51, "view_count": 25876334}, {"video_id": "WmPgbkSBNhs", "question": "Under the blue sky and white clouds, a man with copper-colored skin is wearing black sunglasses and is shirtless, sitting on a white limousine. When the word 'YouTube' appears, what style of bracelet is the man wearing on his wrist?", "question_wo_referring_query": "What style of bracelet is the man wearing on his wrist?", "candidates": ["Silver carved bracelet", "Blue woven bracelet", "Gold hollow bracelet", "Black woven bracelet", "Gold bracelet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "WmPgbkSBNhs_0", "video_path": "WmPgbkSBNhs.mp4", "subtitle_path": "WmPgbkSBNhs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1291.57, "view_count": 388107}, {"video_id": "WmPgbkSBNhs", "question": "On a sunny day, standing in the shade in front of a yellowish hill, there's a man with tanned skin and white earphones. 
What type of clothing is he wearing when the subtitle 'late' appears?", "question_wo_referring_query": "What type of clothing is the man wearing?", "candidates": ["white short sleeve", "gray hoodie", "gray sleeveless vest", "black sleeveless vest", "black short sleeve"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "WmPgbkSBNhs_1", "video_path": "WmPgbkSBNhs.mp4", "subtitle_path": "WmPgbkSBNhs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1291.57, "view_count": 388107}, {"video_id": "WmPgbkSBNhs", "question": "On a white walkway with yellow wall lights on both sides, a man with short hair and stubble is walking with a backpack. When the subtitle 'one of the ways I'm able to stay in' appears, what color clothes is the man wearing?", "question_wo_referring_query": "What color clothes is the man wearing?", "candidates": ["yellow", "white", "black", "blue", "gray"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "WmPgbkSBNhs_2", "video_path": "WmPgbkSBNhs.mp4", "subtitle_path": "WmPgbkSBNhs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1291.57, "view_count": 388107}, {"video_id": "WVPE62Gk3EM", "question": "A white document page with a lot of content written on it, numbered 40, appears. 
When the subtitle \"in principle but still um\" shows up, what happens to the page?", "question_wo_referring_query": ", what happens to the page?", "candidates": ["The page continuously zooms out", "The page scrolls right", "The page scrolls downwards", "The page scrolls upwards", "The page continuously zooms in"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "WVPE62Gk3EM_0", "video_path": "WVPE62Gk3EM.mp4", "subtitle_path": "WVPE62Gk3EM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2069.37, "view_count": 23802}, {"video_id": "WVPE62Gk3EM", "question": "On a page where one half is white and the other half contains text, the white part has some drawings. When the subtitle \"attention though they use the random\" appears, what happens on the page?", "question_wo_referring_query": "What happens on the page?", "candidates": ["The page keeps zooming in", "The page scrolls from top to bottom", "The page scrolls from bottom to top", "The page scrolls from left to right", "The page keeps zooming out"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "WVPE62Gk3EM_1", "video_path": "WVPE62Gk3EM.mp4", "subtitle_path": "WVPE62Gk3EM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2069.37, "view_count": 23802}, {"video_id": "WVPE62Gk3EM", "question": "On a document page, there are several large squares made up of small squares of different colors. Below the large squares, there are options labeled abcd. 
When the subtitle 'memory than the original' appears, what happens on the document page?", "question_wo_referring_query": ", what happens on the document page?", "candidates": ["The page scrolls from top to bottom.", "The page scrolls from left to right.", "The page keeps zooming in.", "The page keeps zooming out.", "The page scrolls from bottom to top."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "WVPE62Gk3EM_2", "video_path": "WVPE62Gk3EM.mp4", "subtitle_path": "WVPE62Gk3EM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2069.37, "view_count": 23802}, {"video_id": "j78jWgPuIgc", "question": "In front of a background with various colors, a dark-skinned man wearing black-framed glasses and a black suit lowers his head and sticks out his tongue. What action does he do next?", "question_wo_referring_query": "What action does he do next?", "candidates": ["Licked a finger", "Closed his mouth", "Talked to the mirror", "Retracted his tongue", "Licked his lower lip"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "j78jWgPuIgc_0", "video_path": "j78jWgPuIgc.mp4", "subtitle_path": "j78jWgPuIgc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1212.68, "view_count": 21627}, {"video_id": "j78jWgPuIgc", "question": "In a split-screen view, there is a dark-skinned man wearing a black suit on the left side, and a fair-skinned man wearing a blue tie on the right side. 
After the man with the blue tie raises his right palm facing up, what does he do next?", "question_wo_referring_query": "What does he do next?", "candidates": ["Supports his glasses with his left hand", "Raises both hands, with palms facing each other", "Puts his right hand down and raises his left hand", "Supports his glasses with his right hand", "Turns his right palm towards the mirror"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "j78jWgPuIgc_1", "video_path": "j78jWgPuIgc.mp4", "subtitle_path": "j78jWgPuIgc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1212.68, "view_count": 21627}, {"video_id": "j78jWgPuIgc", "question": "In a split-screen image, there is a dark-skinned man on the left screen, and a fair-skinned man on the right screen. After the dark-skinned man props up his chin with a hand wearing a ring, what did he do next?", "question_wo_referring_query": "What did he do next?", "candidates": ["Rubbed his eye with his right hand", "Arms crossed in front of his chest", "Adjusted his glasses with his left hand", "Hands on hips", "Both hands supporting his chin"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "j78jWgPuIgc_2", "video_path": "j78jWgPuIgc.mp4", "subtitle_path": "j78jWgPuIgc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1212.68, "view_count": 21627}, {"video_id": "WPz8YqaOR1s", "question": "In a grayscale scene, a woman wearing a skirt and a headpiece is lying on a white bed. After the subtitle 'She depicts herself with raw honesty, and courageously lets us in without hesitation.' 
appears, who is the first person to appear on the screen?", "question_wo_referring_query": "Who is the first person to appear on the screen?", "candidates": ["A man with short hair", "A woman wearing a floral skirt", "A woman wearing a white top", "A woman with short blonde hair", "A woman wearing a pink long skirt"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "WPz8YqaOR1s_0", "video_path": "WPz8YqaOR1s.mp4", "subtitle_path": "WPz8YqaOR1s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1207.67, "view_count": 7909}, {"video_id": "WPz8YqaOR1s", "question": "In a yellow space with a black starry sky outside, there are many people with black cloths on their heads inside. On a bed in the middle, there are two people, one blue and one red. After the subtitle \"Her 'Self Portrait' from 1937 is a significant piece that offers us a glimpse into the artist's inner world\" appears, what is the first animal to appear on-screen?", "question_wo_referring_query": "What is the first animal to appear on-screen?", "candidates": ["Olive Bird", "Green Bird", "Human-like Cat-headed Eagle", "Colorful Pig", "Bulldog"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "WPz8YqaOR1s_1", "video_path": "WPz8YqaOR1s.mp4", "subtitle_path": "WPz8YqaOR1s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1207.67, "view_count": 7909}, {"video_id": "WPz8YqaOR1s", "question": "In a screen that has four different pictures, the leftmost picture depicts a woman with an exposed chest, covered with a black cloth from the waist down. After the subtitle 'revealing the hidden layers of existence.' 
appears, what is the first plant that appears on the screen?", "question_wo_referring_query": "What is the first plant that appears on the screen?", "candidates": ["A giant sunflower", "A white flower", "A green palm tree", "A purple hibiscus", "A red flower"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "WPz8YqaOR1s_2", "video_path": "WPz8YqaOR1s.mp4", "subtitle_path": "WPz8YqaOR1s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1207.67, "view_count": 7909}, {"video_id": "jAhjPd4uNFY", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a few blue boxes appear, and inside the boxes are mice with red tails; then there is a scene with a yellow piece of land where a mouse with a blue body and a pink tail is standing; finally, there is a green grassy area with a gray mouse fountain statue.", "First, there is a scene with a yellow piece of land where a mouse with a blue body and a pink tail is standing; then there is a green grassy area with a gray mouse fountain statue; finally, a few blue boxes appear, and inside the boxes are mice with red tails.", "First, a few blue boxes appear, and inside the boxes are mice with red tails; then there is a green grassy area with a gray mouse fountain statue; finally, there is a yellow piece of land where a mouse with a blue body and a pink tail is standing.", "First, there is a scene with a yellow piece of land where a mouse with a blue body and a pink tail is standing; then a few blue boxes appear, and inside the boxes are mice with red tails; finally, there is a green grassy area with a gray mouse fountain statue.", "First, there is a green grassy area with a gray mouse fountain statue; then a few blue boxes appear, and inside the boxes are mice with red tails; finally, there is a yellow piece of land where a mouse with a 
blue body and a pink tail is standing."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "jAhjPd4uNFY_0", "video_path": "jAhjPd4uNFY.mp4", "subtitle_path": "jAhjPd4uNFY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 963.67, "view_count": 29525389}, {"video_id": "jAhjPd4uNFY", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a scene where a green car passes under a yellow billboard. Then, a scene with a blue background featuring a bag with a red shield and a yellow puzzle piece on it. Finally, there is a scene with a blue background where a small black figure is standing surrounded by light blue bubbles.", "First, there is a scene with a blue background featuring a bag with a red shield and a yellow puzzle piece on it. Then, a scene where a green car passes under a yellow billboard. Finally, a scene with a blue background where a small black figure is standing surrounded by light blue bubbles.", "First, there is a scene where a green car passes under a yellow billboard. Then, a scene with a blue background where a small black figure is standing surrounded by light blue bubbles. Finally, a scene with a blue background featuring a bag with a red shield and a yellow puzzle piece on it.", "First, there is a scene with a blue background where a small black figure is standing surrounded by light blue bubbles. Then, a scene with a blue background featuring a bag with a red shield and a yellow puzzle piece on it. Finally, there is a scene where a green car passes under a yellow billboard.", "First, there is a scene with a blue background featuring a bag with a red shield and a yellow puzzle piece on it. Then, a scene with a blue background where a small black figure is standing surrounded by light blue bubbles. 
Finally, there is a scene where a green car passes under a yellow billboard."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "jAhjPd4uNFY_1", "video_path": "jAhjPd4uNFY.mp4", "subtitle_path": "jAhjPd4uNFY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 963.67, "view_count": 29525389}, {"video_id": "jAhjPd4uNFY", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a scene of a pitch-black night sky under the illumination of a full moon, with a forked road stretching ahead; then a scene with a blue-black screen showing a blue baby; lastly, a scene in a blue and white sky with two blue and white birds.", "First, a scene in a blue and white sky with two blue and white birds; then a scene with a blue-black screen showing a blue baby; lastly, a scene of a pitch-black night sky under the illumination of a full moon, with a forked road stretching ahead.", "First, a scene of a pitch-black night sky under the illumination of a full moon, with a forked road stretching ahead; then a scene in a blue and white sky with two blue and white birds; lastly, a scene with a blue-black screen showing a blue baby.", "First, a scene with a blue-black screen showing a blue baby; then a scene of a pitch-black night sky under the illumination of a full moon, with a forked road stretching ahead; lastly, a scene in a blue and white sky with two blue and white birds.", "First, a scene with a blue-black screen showing a blue baby; then a scene in a blue and white sky with two blue and white birds; lastly, a scene of a pitch-black night sky under the illumination of a full moon, with a forked road stretching ahead."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "jAhjPd4uNFY_2", "video_path": "jAhjPd4uNFY.mp4", "subtitle_path": 
"jAhjPd4uNFY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 963.67, "view_count": 29525389}, {"video_id": "vZUwAL4CulE", "question": "In a room with colorful walls, where some wooden sofas are placed against the wall, and a giant green 'YES' sign is hanging on the wall. A man in a white lab coat is standing in front of the 'YES' sign. Which of the following scenes has he appeared in?", "question_wo_referring_query": "Which of the following scenes has he appeared in?", "candidates": ["In a photo studio with many clothes", "In a trampoline hall with black trampolines", "In a dense forest with lots of trees", "In a blue swimming pool", "In an observatory with many beautiful stars"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "vZUwAL4CulE_0", "video_path": "vZUwAL4CulE.mp4", "subtitle_path": "vZUwAL4CulE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.04, "view_count": 1325043}, {"video_id": "vZUwAL4CulE", "question": "On a training ground, two men dressed in black and one man dressed in white are watching a boy in yellow shorts. The boy is raising both hands on green turf. In which of the following scenes does this boy appear?", "question_wo_referring_query": "In which of the following scenes does this boy appear?", "candidates": ["On a sandy beach with houses in the distance, a man dressed in black does a backflip, while another man in light green clothing and a hat performs protective actions.", "At the three-point line of a basketball hoop, a short-haired man in black prepares to do a backflip, while two men get ready to protect him from injury.", "To the left is a green corner wall with a partition wall and two picture frames and a white network socket on the wall. 
A person looking like a mirror head runs in.", "A man wearing sunglasses and dotted leather jacket looks forward while driving, talking as he drives."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "vZUwAL4CulE_1", "video_path": "vZUwAL4CulE.mp4", "subtitle_path": "vZUwAL4CulE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.04, "view_count": 1325043}, {"video_id": "vZUwAL4CulE", "question": "Inside a room with white walls, there is a mirror on the wall, a photo next to the mirror, and a door on the left side of the mirror. Two men wearing white lab coats stand one in front of the other; the man in front of the mirror is facing the right side of the screen, while the man behind him is smiling and looking towards the left side of the screen. In which of the following scenes does the man with a big nose and stubble appear?", "question_wo_referring_query": ", in which of the following scenes does the man with a big nose and stubble appear?", "candidates": ["On the left side, there's a green wall corner with a partition wall. The wall has two picture frames and a white network socket, with a person running towards the mirror.", "Sunlight shines on a meadow, with a parking lot in the distance. A child wearing blue shorts is doing a cartwheel on the meadow.", "Sunlight illuminates a sandy beach, with dense high-rise buildings in the distance. A man is standing on a horizontal bar, doing a backflip.", "In the distance, there's a child wearing yellow shorts sitting on something, surrounded by game cards. 
The mirror distinctly shows a man in a grey lab coat."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "vZUwAL4CulE_2", "video_path": "vZUwAL4CulE.mp4", "subtitle_path": "vZUwAL4CulE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.04, "view_count": 1325043}, {"video_id": "Pg1oTBALFUk", "question": "On a green grassy field, there are a few big trees and a parked car. In front of the car, there are two people standing. One is a man with short hair, and the other is a woman with long blonde hair wearing a green short-sleeved shirt. With which of the following subtitles has the woman in the green short-sleeved shirt appeared together?", "question_wo_referring_query": "With which of the following subtitles has the woman in the green short-sleeved shirt appeared together?", "candidates": ["\"and see the world\"", "\"this is the start of my dream bucket\"", "\"but today we get to see it for ourselves\"", "\"is the one that inspired me to get up\"", "\"trip right here\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "Pg1oTBALFUk_0", "video_path": "Pg1oTBALFUk.mp4", "subtitle_path": "Pg1oTBALFUk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1179.43, "view_count": 687315}, {"video_id": "Pg1oTBALFUk", "question": "Next to a blue pond, there is a green tree. In the distance of the pond, there is a yellow land. A man wearing a grey long-sleeve, holding a black camera, and wearing a hat is standing in front of the pond. 
Which of the following subtitles have appeared together with him?", "question_wo_referring_query": "Which of the following subtitles have appeared together with him?", "candidates": ["\"some coffee and now i'm ready to go\"", "\"all right good morning mr anderson good\"", "\"feeling\"", "\"guten morgen hello hello there\"", "\"morning how are you today how are you\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "Pg1oTBALFUk_1", "video_path": "Pg1oTBALFUk.mp4", "subtitle_path": "Pg1oTBALFUk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1179.43, "view_count": 687315}, {"video_id": "Pg1oTBALFUk", "question": "On a piece of yellow land, some withered yellow grass and wooden bushes grow. Next to a leafless tree, there is a group of zebras wandering around. Which of the following subtitles appeared together with the zebras?", "question_wo_referring_query": "Which of the following subtitles appeared together with the zebras?", "candidates": ["\"in tanzania which leads to beautiful\"", "\"and zebras have come down in the\"", "\"it's the beginning of the rainy season\"", "\"thousands from kenya\"", "\"finding the water holes and right now\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "Pg1oTBALFUk_2", "video_path": "Pg1oTBALFUk.mp4", "subtitle_path": "Pg1oTBALFUk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1179.43, "view_count": 687315}, {"video_id": "ky9xet0lmI8", "question": "On a national region plane map, there are many small flags. There is a white circle on a dark red map area. 
What change has occurred to the position of the white circle when it appears on the map of a national region with a man with gold rolled hair as the background?", "question_wo_referring_query": "What change has occurred to the position of the white circle?", "candidates": ["It changed from appearing on the dark red area to appearing on the black area.", "It changed from appearing on the dark red area to appearing on the purple area.", "It changed from appearing on the dark red area to appearing on the blue area.", "It changed from appearing on the dark red area to appearing on the white area.", "It changed from appearing on the dark red area to appearing on the green area."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "ky9xet0lmI8_0", "video_path": "ky9xet0lmI8.mp4", "subtitle_path": "ky9xet0lmI8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1577.8, "view_count": 1976198}, {"video_id": "ky9xet0lmI8", "question": "On a colorful map of national regions, there are many small flags, among which there is a blue region block with 'Munich' written on it. 
When this region block appears on a more comprehensive map of national regions, and there are also black lines emitting white light surrounding the region block, what change occurs in the color of the region block that has 'Munich' written on it?", "question_wo_referring_query": "What change occurs in the color of the region block that has 'Munich' written on it?", "candidates": ["from blue to beige", "from blue to black", "from blue to green", "from blue to red", "from blue to purple"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "ky9xet0lmI8_1", "video_path": "ky9xet0lmI8.mp4", "subtitle_path": "ky9xet0lmI8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1577.8, "view_count": 1976198}, {"video_id": "ky9xet0lmI8", "question": "On a map of a region with many colors, there are some white arrows. A section labeled AUSTRIAN EMPIRE is pointed at by two white arrows. When this section appears without any white arrows on a map dotted with various small flags, what color change occurs to the section labeled AUSTRIAN EMPIRE?", "question_wo_referring_query": "What color change occurs to the section labeled AUSTRIAN EMPIRE?", "candidates": ["Changed from beige to dark red", "Changed from beige to white", "Changed from beige to black", "Changed from beige to green", "Changed from beige to blue"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "ky9xet0lmI8_2", "video_path": "ky9xet0lmI8.mp4", "subtitle_path": "ky9xet0lmI8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1577.8, "view_count": 1976198}, {"video_id": "5ltlb1Li5K4", "question": "In a scene with an airplane in the background, a woman with short hair wearing an orange suit stands in front of a DW logo. 
When the subtitle 'now the US and UK have carried out new' appears, what change occurs to the DW logo?", "question_wo_referring_query": "What change occurs to the DW logo?", "candidates": ["The DW logo changes from a small DW logo to a large DW logo and moves from the center of the screen to the upper left corner.", "The DW logo changes from white to red.", "The DW logo changes from a large DW logo to a small DW logo and moves from the center of the screen to the upper left corner.", "The DW logo changes from blue to red.", "The DW logo changes from blue to white."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "5ltlb1Li5K4_0", "video_path": "5ltlb1Li5K4.mp4", "subtitle_path": "5ltlb1Li5K4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1222.96, "view_count": 130834}, {"video_id": "5ltlb1Li5K4", "question": "In a split-screen display, there is a woman with short black hair wearing a red suit on the left, and a woman wearing a floral top with black headphones on the right. When the subtitle 'trying to align um them and their' appears, what change occurs on the screen of the woman wearing black headphones?", "question_wo_referring_query": "In a split-screen display, there is a woman with short black hair wearing a red suit on the left, and a woman wearing a floral top with black headphones on the right. 
When the subtitle 'trying to align um them and their' appears, what change occurs on the screen of the woman wearing black headphones?", "candidates": ["The screen becomes clearer", "The screen becomes smaller", "The screen becomes larger", "The screen moves from right to left"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "5ltlb1Li5K4_1", "video_path": "5ltlb1Li5K4.mp4", "subtitle_path": "5ltlb1Li5K4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1222.96, "view_count": 130834}, {"video_id": "5ltlb1Li5K4", "question": "In a scene with an airplane in the background, a short-haired woman in an orange suit is holding a blue pen. When the subtitle 'think the Islamic Republic will take' appears, what changes occur in the close-up shot of the woman in the orange suit?", "question_wo_referring_query": "What changes occur in the close-up shot of the woman in the orange suit?", "candidates": ["The close-up shot moved from the center to the top left corner", "The close-up shot moved from the center to the top right corner", "The close-up shot became larger", "The close-up shot became smaller"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "5ltlb1Li5K4_2", "video_path": "5ltlb1Li5K4.mp4", "subtitle_path": "5ltlb1Li5K4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1222.96, "view_count": 130834}, {"video_id": "o_NFpW1KoZg", "question": "There is a long table with a white surface placed in front of a gray wall, with many people sitting at the table. Among them is a woman wearing green floral clothing. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["black chair", "purple incense burner", "green bottle", "red flower", "blue hanging lamp"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "o_NFpW1KoZg_0", "video_path": "o_NFpW1KoZg.mp4", "subtitle_path": "o_NFpW1KoZg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2888.99, "view_count": 4100}, {"video_id": "o_NFpW1KoZg", "question": "In a split screen, the left side shows some data, and the right side shows a man wearing a black suit and black glasses. Below the man, there is the word 'Garfield' in black text. What objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["white plaid collar", "red plaid collar", "green plaid collar", "blue plaid collar", "orange plaid collar"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "o_NFpW1KoZg_1", "video_path": "o_NFpW1KoZg.mp4", "subtitle_path": "o_NFpW1KoZg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2888.99, "view_count": 4100}, {"video_id": "o_NFpW1KoZg", "question": "In the scene, which features a Ferris wheel in the background, a man with gray hair, wearing a black suit and white shirt, is speaking. Below him, the text 'COMFORTDELGRO FY NET INCOME BEATS ESTIMATES' is displayed. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["black glasses", "green bottle", "purple frankincense", "black chair", "white checkered tie"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "o_NFpW1KoZg_2", "video_path": "o_NFpW1KoZg.mp4", "subtitle_path": "o_NFpW1KoZg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2888.99, "view_count": 4100}, {"video_id": "y6c6jz4NRWY", "question": "A woman with brown hair is sitting inside a room. Behind her on the right side, there is a white door with a wreath hanging on it. On her left, there is a wall with several paintings, and further back there is a shelf with books. In the bottom right corner, there is a white table with a yellow ball on top of it. The woman is wearing an off-white sweater. What is the shape of the orange object on the shelf behind the woman?", "question_wo_referring_query": "What is the shape of the orange object on the shelf behind the woman?", "candidates": ["Circle", "Square", "Triangle", "Hexagon", "Heart"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "y6c6jz4NRWY_0", "video_path": "y6c6jz4NRWY.mp4", "subtitle_path": "y6c6jz4NRWY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1034.88, "view_count": 31461}, {"video_id": "y6c6jz4NRWY", "question": "There is a broken two-story brown building with a clock on the roof, both the first and second floors have rectangular windows with iron bars behind the glass. The sunlight in the scene is good, and there are gray clouds in the upper right corner of the roof, with the remaining part being blue sky. There are also some green shrubs on the ground. 
What is the shape of the two chimneys at the top of the roof?", "question_wo_referring_query": "What is the shape of the two chimneys at the top of the roof?", "candidates": ["Triangular shape", "Stair shape", "Rectangular prism", "Cylindrical", "Pentagon shape"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "y6c6jz4NRWY_1", "video_path": "y6c6jz4NRWY.mp4", "subtitle_path": "y6c6jz4NRWY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1034.88, "view_count": 31461}, {"video_id": "y6c6jz4NRWY", "question": "On a black table, there is a white plate with three cans on it. Also on the table are a pumpkin, a cucumber, a picture frame, a wooden board, a desk lamp, a ceramic jar, and other items. The transparent jar on the plate contains some stuff. What material is the lid of this transparent jar made of?", "question_wo_referring_query": "What material is the lid of this transparent jar made of?", "candidates": ["Ceramic lid", "Glass lid", "Rubber lid", "Wooden lid", "Plastic lid"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "y6c6jz4NRWY_2", "video_path": "y6c6jz4NRWY.mp4", "subtitle_path": "y6c6jz4NRWY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1034.88, "view_count": 31461}, {"video_id": "UrVCVVv08Qo", "question": "On a snowy ground, there is a windmill. Behind the windmill is a house, and the snowy area is surrounded by black forests. There are also some leafless trees with gray clouds surrounding them. The sunlight casts shadows of the trees and the windmill, and in the distance, there is still blue sky. 
When mentioning 'is actually known about what makes this,' what is the shape of the frame that supports the windmill in front of the windmill?", "question_wo_referring_query": "What is the shape of the frame that supports the windmill in front?", "candidates": ["Square", "Fan-shaped", "Rectangular", "Triangular", "Ladder-shaped"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "UrVCVVv08Qo_0", "video_path": "UrVCVVv08Qo.mp4", "subtitle_path": "UrVCVVv08Qo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1531.78, "view_count": 663139}, {"video_id": "UrVCVVv08Qo", "question": "In a snowy area, two gentlemen are talking. One gentleman has short hair and is wearing an olive-green parka, while the other gentleman is wearing a dark blue down jacket and a grey hat. The snowy area is surrounded by green pine trees, and there is also a blue sky and white clouds. When the phrase 'one road will take longer time than the' is mentioned, what material is the grey hat of the man in the blue down jacket made of?", "question_wo_referring_query": "What material is the grey hat of the man in the blue down jacket made of?", "candidates": ["Straw hat", "Denim hat", "Cloth hat", "Rabbit fur hat", "Wool hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "UrVCVVv08Qo_1", "video_path": "UrVCVVv08Qo.mp4", "subtitle_path": "UrVCVVv08Qo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1531.78, "view_count": 663139}, {"video_id": "UrVCVVv08Qo", "question": "In a snowy area surrounded by forests, there is white snow and dense trees all around. Three people are standing and talking there. To the left is a woman wearing a swimsuit, swim cap, and a shawl. In the middle is a man in a pink bathrobe, and to the right is a woman in a red and black coat with a blue wool hat. 
When it mentions \"anticipated off camera we learned that\", what type of coat is this woman wearing?", "question_wo_referring_query": "What type of coat is this woman wearing?", "candidates": ["Woolen coat", "Windbreaker", "Cashmere coat", "Wool coat", "Down coat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "UrVCVVv08Qo_2", "video_path": "UrVCVVv08Qo.mp4", "subtitle_path": "UrVCVVv08Qo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1531.78, "view_count": 663139}, {"video_id": "vBG621XEegk", "question": "In a painting, there is a green grassy field, with a small lake in the bottom left corner. On the grass, there are some sparse small trees, and a few whimsical animals, including a dragon-like long-necked deer, a mouse with long ears, a horse with long horns, as well as monkeys and rabbits. There's a pink fountain in the middle. Which animal in the painting is drinking water?", "question_wo_referring_query": "Which animal in the painting is drinking water?", "candidates": ["Dragon-like long-necked deer", "Horse with long horns", "Gray rabbit", "Black bear", "Gray elephant"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "vBG621XEegk_0", "video_path": "vBG621XEegk.mp4", "subtitle_path": "vBG621XEegk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3051.95, "view_count": 2978750}, {"video_id": "vBG621XEegk", "question": "On a yellow building, there are three arched windows with small pointed tops on the second floor. The window frames on the building share the same shape and are adorned with complex decorations. There are also grey portraits hanging in the blank spaces and around the borders between the windows. The first floor has similar windows on both sides, with a large door in the middle. 
Which window has a flag hanging on it?", "question_wo_referring_query": "Which window has a flag hanging on it?", "candidates": ["The first window on the left side of the first floor", "The middle window on the first floor", "The first window on the left side of the second floor", "The middle window on the second floor", "The right window on the second floor"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "vBG621XEegk_1", "video_path": "vBG621XEegk.mp4", "subtitle_path": "vBG621XEegk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3051.95, "view_count": 2978750}, {"video_id": "vBG621XEegk", "question": "In a painting, there are 8 people; some are rejoicing on the grass, some are looking in the same direction, and one person is carrying a deep blue longan. Above the grass is a water surface with a pink object floating on it. Among the 8 people, who is wearing a red fruit on their head?", "question_wo_referring_query": "Among the 8 people, who is wearing a red fruit on their head?", "candidates": ["The person dressed in black looking in the same direction", "The person rejoicing with a red fruit on their head", "The person rejoicing", "The person dressed in white looking in the same direction", "The person carrying the large longan"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "vBG621XEegk_2", "video_path": "vBG621XEegk.mp4", "subtitle_path": "vBG621XEegk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3051.95, "view_count": 2978750}, {"video_id": "hIoCn_9QTVU", "question": "In one scene, on the left, there is a pair of hands holding an avocado. On the right, there are two men, one wearing a hat and the other wearing sunglasses. They are both wearing short-sleeve T-shirts, and they are sitting at a table full of food, each holding an avocado. 
What happened when the avocado appeared for the first time?", "question_wo_referring_query": "What happened when the avocado appeared for the first time?", "candidates": ["The avocado was split into two halves.", "A pair of hands demonstrated holding an avocado while the two men observed it.", "The avocado was placed on the cutting board and sliced.", "The avocado was peeled.", "The avocado fell onto the table."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "hIoCn_9QTVU_0", "video_path": "hIoCn_9QTVU.mp4", "subtitle_path": "hIoCn_9QTVU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.43, "view_count": 12375}, {"video_id": "hIoCn_9QTVU", "question": "In a kitchen, a person wearing gray clothes is serving a pot of soup. Behind him is a small table, and in front of him is a stove with white cabinets above it. What happened the first time the iron pot appeared?", "question_wo_referring_query": "What happened the first time the iron pot appeared?", "candidates": ["The iron pot was placed on the black stove.", "The iron pot lost a handle.", "The iron pot was covered with a lid.", "The iron pot was placed on the small table.", "The iron pot was placed on the white stove."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "hIoCn_9QTVU_1", "video_path": "hIoCn_9QTVU.mp4", "subtitle_path": "hIoCn_9QTVU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.43, "view_count": 12375}, {"video_id": "hIoCn_9QTVU", "question": "In a kitchen setting, there is a split-screen with white frames dividing it. In the split-screen, a man is standing next to a stove. Behind him is an electronic screen, and the cabinets are white. The stove has some small knives and jars, along with some miscellaneous items. There is also a blender. The man is wearing grey clothes. 
What happened when the blender appeared for the first time?", "question_wo_referring_query": "What happened when the blender appeared for the first time?", "candidates": ["The man picked up a jar", "The man picked up the blender", "The man started cutting vegetables", "The man raised two fingers and waved his hand", "The man made a victorious gesture towards the mirror"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "hIoCn_9QTVU_2", "video_path": "hIoCn_9QTVU.mp4", "subtitle_path": "hIoCn_9QTVU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.43, "view_count": 12375}, {"video_id": "s85Esql2DPk", "question": "Against a white backdrop, there are two women with two smaller frames in the upper right corner. One woman is wearing a black and white striped shirt with straight hair, and the other woman is wearing a black T-shirt with brown hair. On the white screen, there is a periodic table of elements. When 'increases as you move down a column and' is mentioned, what action does the woman in the black and white striped shirt perform?", "question_wo_referring_query": "What action does the woman in the black and white striped shirt perform?", "candidates": ["The woman in the black T-shirt picks up a book.", "A circle is drawn on the white backdrop.", "The woman in the black and white striped shirt is placing her arm down on the table.", "An arrow is drawn on the white backdrop.", "The woman in the black T-shirt raises her glasses."], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "s85Esql2DPk_0", "video_path": "s85Esql2DPk.mp4", "subtitle_path": "s85Esql2DPk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.19, "view_count": 26539}, {"video_id": "s85Esql2DPk", "question": "Against a white background board, there are two women in the upper right corner with two split screens. 
One woman is wearing a black and white striped shirt and has straight hair. The other woman is wearing a black T-shirt and has curly hair. A periodic table with atomic images is displayed on the white screen. What does the woman in the black and white striped shirt do when referring to 'which remember was talking about the the'?", "question_wo_referring_query": "What does the woman in the black and white striped shirt do?", "candidates": ["A white background board shows a drawing of an arrow.", "The woman in the black T-shirt picks up a book.", "The woman in the black and white striped shirt puts down her hand that was tidying her hair.", "The woman in the black T-shirt raises her glasses.", "The woman in the black and white striped shirt is putting down her arm that was resting on the table."], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "s85Esql2DPk_1", "video_path": "s85Esql2DPk.mp4", "subtitle_path": "s85Esql2DPk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.19, "view_count": 26539}, {"video_id": "s85Esql2DPk", "question": "Against a white background board, there are two ladies in the upper right corner with two separate frames. One is wearing a black and white striped shirt with straight hair, and the other is wearing a black T-shirt with brown hair. The white screen displays the Periodic Table of Elements and an atomic image. 
What happens when the phrase 'proportional so as we decrease the' is mentioned?", "question_wo_referring_query": "what happens?", "candidates": ["Under the white background board, beside a downward arrow drawn with a pen, some letters are being written.", "A circle is drawn on the white background board.", "Beside the atomic image under the white background board, some letters are being written.", "Under the Periodic Table of Elements on the white background board, some letters are being written.", "A long arrow is drawn on the white background board."], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "s85Esql2DPk_2", "video_path": "s85Esql2DPk.mp4", "subtitle_path": "s85Esql2DPk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.19, "view_count": 26539}, {"video_id": "qw9Y-kvoXoM", "question": "In the scene, there is a man in gray clothes and a woman in an orange-red shirt. The wall and the table are divided with pink and green colors. There are many pictures on the wall and some small appliances on the table. In front of them, there are kitchen utensils, seasonings, and food on the table. The woman is holding a belt in her hand. 
What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["The woman and the man hug.", "The woman hands the belt to the man.", "The woman and the man shake hands.", "The woman helps the man put on the belt.", "The woman and the man hold the belt together."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "qw9Y-kvoXoM_0", "video_path": "qw9Y-kvoXoM.mp4", "subtitle_path": "qw9Y-kvoXoM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1165.71, "view_count": 2109871}, {"video_id": "qw9Y-kvoXoM", "question": "In a kitchen divided into pink and green, with many cartoons on the walls and a piece of white paper, a man and a woman wearing aprons are standing in front of a table covered with lots of food ingredients and a pink and green tablecloth. On the counter behind them, there are small appliances, shelves, and dishes. The man is wearing a grey T-shirt and the woman is wearing a red shirt. They are pouring oil into a pot. What happened after the man picked up the tofu from the table?", "question_wo_referring_query": "What happened?", "candidates": ["The man picked up the tofu from the table", "The man picked up the pot", "The man picked up a kitchen knife", "The man took off his apron", "The man picked up a white bowl from the table"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "qw9Y-kvoXoM_1", "video_path": "qw9Y-kvoXoM.mp4", "subtitle_path": "qw9Y-kvoXoM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1165.71, "view_count": 2109871}, {"video_id": "qw9Y-kvoXoM", "question": "In a kitchen with pink and green partitions, the walls are covered with many cartoon pictures and there is a white paper. A man and a woman are wearing aprons, standing at a table with many cooking ingredients on it. The tablecloth is also pink and green. 
Behind them on the counter are appliances, shelves, and trays. The man is wearing a gray T-shirt, and the woman is wearing a red shirt. The woman is raising her hand, pointing, and talking to the man. What happens after she faces the man?", "question_wo_referring_query": "What happens?", "candidates": ["The woman picks up a bowl", "The woman picks up a pot", "The woman picks up an avocado", "The woman picks up a milk pot", "The woman picks up a knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "qw9Y-kvoXoM_2", "video_path": "qw9Y-kvoXoM.mp4", "subtitle_path": "qw9Y-kvoXoM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1165.71, "view_count": 2109871}, {"video_id": "WuhnbtCrXqE", "question": "The host of BBC is currently reporting the news. He has short hair, is wearing a black suit, a white shirt, and a black and white patterned tie. To his right is a partition, and below is a red and white ticker. Which of the following characters appears when switching to the third different type of environment scene?", "question_wo_referring_query": "Which of the following characters appears when switching to the third different type of environment scene?", "candidates": ["A man on grass wearing jeans and a red T-shirt with a cowboy hat", "A railing on a concrete wall", "A signpost with complex decorations on a concrete wall", "A woman in a conference hall wearing white clothes", "A woman in a conference hall wearing black clothes"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "WuhnbtCrXqE_0", "video_path": "WuhnbtCrXqE.mp4", "subtitle_path": "WuhnbtCrXqE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.6, "view_count": 115821}, {"video_id": "WuhnbtCrXqE", "question": "A man wearing a white shirt with a crew cut is speaking. 
Behind him, there is a black background and some blurry buildings. He is wearing a wristwatch, and there is a red and white subtitle bar at the bottom. Starting with this man, which character appears for the first time?", "question_wo_referring_query": "Starting with this man, which character appears for the first time?", "candidates": ["A man in a black suit with a black and white striped tie", "A man in a white shirt", "A man wearing headphones and glasses in black clothing", "A man in green clothes standing at the podium", "A man holding a cellphone"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "WuhnbtCrXqE_1", "video_path": "WuhnbtCrXqE.mp4", "subtitle_path": "WuhnbtCrXqE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.6, "view_count": 115821}, {"video_id": "WuhnbtCrXqE", "question": "The BBC presenter is currently reporting the news. He has short hair, is wearing a black suit and a white shirt, and his tie is black and white patterned. His background is a night view of buildings, with a dark blue sky. Which concept is mentioned first below?", "question_wo_referring_query": "Which concept is mentioned first below?", "candidates": ["Ukrainian submarine destruction", "Temporary channel at the Suez Canal", "American auto workers strike", "Headquarters of the Black Sea Fleet", "Russian plan for two combat platforms"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "WuhnbtCrXqE_2", "video_path": "WuhnbtCrXqE.mp4", "subtitle_path": "WuhnbtCrXqE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.6, "view_count": 115821}, {"video_id": "i0H1WgsMrfU", "question": "In a background composed of dark blue and light blue colors, there is a grey and white person drawn. The person has short hair and is wearing a uniform, which is also composed of grey and white colors. 
What happens after 'him The God of War' is mentioned?", "question_wo_referring_query": "What happens after 'The God of War' is mentioned?", "candidates": ["Four question marks appear on the grey and white person's body", "Four question marks appear above the grey and white person's head", "The person on the screen gradually disappears", "Four black question marks appear on the screen", "Four white question marks appear on the screen"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "i0H1WgsMrfU_0", "video_path": "i0H1WgsMrfU.mp4", "subtitle_path": "i0H1WgsMrfU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1283.2, "view_count": 982226}, {"video_id": "i0H1WgsMrfU", "question": "A map was drawn on a gray background, with roads composed of lines. In the upper left corner is an irregular block, in the middle there are some black rectangles, and some gray rectangles. On the right side, there are two lines connected to a shape. Below the shape, another line is connected. What happened after the phrase 'in order to get a more balanced understanding of the whole' was mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["A grayscale character appeared in the screen", "The lines in the screen disappeared", "The screen disappeared", "Four question marks appeared in the screen", "The screen turned colorful"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "i0H1WgsMrfU_1", "video_path": "i0H1WgsMrfU.mp4", "subtitle_path": "i0H1WgsMrfU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1283.2, "view_count": 982226}, {"video_id": "i0H1WgsMrfU", "question": "In a background composed of deep brown and light brown colors, there is a gray-white character drawn. The character has short hair and is wearing a uniform, which is made of gray and white colors. 
In the lower left corner, there are two white line-formed circles. In one circle, a character is having a conversation, while in the other circle, there are three standing characters. What happened after the phrase 'where literally quite many heads rolled' was mentioned?", "question_wo_referring_query": "What happened after the phrase 'where literally quite many heads rolled' was mentioned?", "candidates": ["A series of black letters appeared in the upper right corner", "The character in the screen disappeared", "The screen turned colorful", "The character in the screen gradually shrunk", "The character in the screen gradually enlarged"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "i0H1WgsMrfU_2", "video_path": "i0H1WgsMrfU.mp4", "subtitle_path": "i0H1WgsMrfU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1283.2, "view_count": 982226}, {"video_id": "WncUlZYpdq4", "question": "Against a white background, there is a title composed of black letters, a long dividing line, and some content made up of letters below it. In the lower right corner, a man in a black T-shirt with glasses is explaining something. 
What was the first object to appear after mentioning 'ping and Tom Goldstein of the University'?", "question_wo_referring_query": "What was the first object to appear?", "candidates": ["a black thick line", "olive-colored sunglasses", "a white arrow", "a yellow strip", "a black microphone"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "WncUlZYpdq4_0", "video_path": "WncUlZYpdq4.mp4", "subtitle_path": "WncUlZYpdq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2144.03, "view_count": 27172}, {"video_id": "WncUlZYpdq4", "question": "Against a white background with a title composed of black letters and a long dividing line, there is text content below with highlighted sections in yellow and some blue text. The middle section contains some formulas, and in the bottom right corner, a man wearing a black T-shirt and glasses is explaining something. When he mentions 'process so we Define the forward process,' what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Green rectangle-like shape", "Blue rectangle", "Yellow line", "Blue long rectangle", "Olive small bear"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "WncUlZYpdq4_1", "video_path": "WncUlZYpdq4.mp4", "subtitle_path": "WncUlZYpdq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2144.03, "view_count": 27172}, {"video_id": "WncUlZYpdq4", "question": "Against a white background, there is a title composed of black letters, as well as content composed of black and blue letters. In the bottom right corner, there is a man wearing sunglasses and a black T-shirt explaining. There is also a black microphone. 
After 'this input right here but it doesn't' is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["white arrow", "yellow line", "Bear in Four Grid", "green rectangle", "blue rectangle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "WncUlZYpdq4_2", "video_path": "WncUlZYpdq4.mp4", "subtitle_path": "WncUlZYpdq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2144.03, "view_count": 27172}, {"video_id": "QRnIzIXQmc4", "question": "In a painting, surrounded by yellow mountains and soil, the center is some ruined houses engulfed in flames. There are people around, some are fighting the fire, some are holding each other, and some are lying on the ground. Thick smoke is billowing above the flames. In what scene does the fire appear?", "question_wo_referring_query": "In what scene does the fire appear?", "candidates": ["In an old yellow photo of buildings and roads, with the word '1997' in the upper right corner", "Beneath a red building, there are two carriages with a few people on them, in a scene where sunlight shines on the road", "In a black and white background with a cow", "In a gray building with smoke, with a dense crowd in the lower left corner", "In a painting with gray and white mountains, gray low buildings, and a crowd"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "QRnIzIXQmc4_0", "video_path": "QRnIzIXQmc4.mp4", "subtitle_path": "QRnIzIXQmc4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.77, "view_count": 329707}, {"video_id": "QRnIzIXQmc4", "question": "In a yellow field, there is a wooden scaffold structure of a house, the sky is white, and there are some plants and hills around it. 
Where else does this wooden scaffold structure of a house appear?", "question_wo_referring_query": "Where else does this wooden scaffold structure of a house appear?", "candidates": ["In the middle of a construction site, surrounded by many concrete slabs with a woman walking by.", "In a field with a very blue sky and no one around.", "In front of a wasteland.", "In front of a building where a carriage passed by.", "In a field with a very blue sky, a few people sitting under a plastic sheet, and some green grass around."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "QRnIzIXQmc4_1", "video_path": "QRnIzIXQmc4.mp4", "subtitle_path": "QRnIzIXQmc4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.77, "view_count": 329707}, {"video_id": "QRnIzIXQmc4", "question": "Against a yellowish-olive background, there is an image of a sepia paper with the words 'WEIRD HISTORY' written on it, and below the paper are white letters. 
Where else has this sepia paper with the words 'WEIRD HISTORY' appeared?", "question_wo_referring_query": "Where else has this sepia paper with the words 'WEIRD HISTORY' appeared?", "candidates": ["Against a yellowish-olive background, white letters in the middle, sepia paper in the bottom right corner", "Against a yellowish-olive background, white letters in the middle, sepia paper in the middle", "Against a yellowish-olive background, white letters in the middle, sepia paper on the right side", "Against a yellowish-olive background, white letters in the middle, sepia paper in the top left corner", "Against a yellowish-olive background, white letters in the middle, sepia paper in the bottom left corner"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "QRnIzIXQmc4_2", "video_path": "QRnIzIXQmc4.mp4", "subtitle_path": "QRnIzIXQmc4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.77, "view_count": 329707}, {"video_id": "VZf7KzH40B8", "question": "In a purple room, with a white ceiling tile, there's a man wearing a black T-shirt. One of his hands is putting on a hat, and the other hand is holding a phone. The phone has a speech bubble icon on it. He is taking a selfie in front of a mirror. 
What subtitles appeared at the same time as this man?", "question_wo_referring_query": "What subtitles appeared at the same time as this man?", "candidates": ["what we got going on", "yeah", "first day of school the day has come all", "energy power motivation", "how's your boba tea"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "VZf7KzH40B8_0", "video_path": "VZf7KzH40B8.mp4", "subtitle_path": "VZf7KzH40B8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.95, "view_count": 5334522}, {"video_id": "VZf7KzH40B8", "question": "On the screen, there is a red building on the left, a yellow building in the middle, and some grey buildings near the right. Additionally, there is a white corridor. In front of the red building on the left, there is a big tree. To the right of the big tree, there is a slanted path. A woman is walking towards the camera. She is wearing white shorts and a long-sleeved shirt with red stripes. In which subtitles did this woman appear at the same time?", "question_wo_referring_query": "In which subtitles did this woman appear at the same time?", "candidates": ["um first period we're going to um where", "authenticating", "we gotta call it something other than", "hey guys first day of school", "is it we're going to our fourth first so"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "VZf7KzH40B8_1", "video_path": "VZf7KzH40B8.mp4", "subtitle_path": "VZf7KzH40B8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.95, "view_count": 5334522}, {"video_id": "VZf7KzH40B8", "question": "In the mirror, there is a man wearing a black T-shirt, and behind him are two women wearing light green T-shirts. One of the women on the left is wearing a dark green jacket and carrying a backpack. Behind the two women is a red building. 
The women's light green T-shirts appeared at the same time as which subtitle?", "question_wo_referring_query": "The women's light green T-shirts appeared at the same time as which subtitle?", "candidates": ["in america high school lunches suck", "need to go and all that yeah i like", "french", "first grade here for psychology so i", "i'm actually heading to class now"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "VZf7KzH40B8_2", "video_path": "VZf7KzH40B8.mp4", "subtitle_path": "VZf7KzH40B8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.95, "view_count": 5334522}, {"video_id": "hAooAOFRsYc", "question": "In a PPT with a white background, there are many formulas written along with a yellow rectangular annotation. Before mentioning 'space right and also the keys will', what changes occurred to the formula Q=xWQ on the PPT slide?", "question_wo_referring_query": "What changes occurred to the formula Q=xWQ on the PPT slide?", "candidates": ["The formula Q=xWQ had a red curly brace added to it.", "The formula was circled by a line.", "A yellow rectangle was added.", "The formula disappeared.", "Parentheses were added to the formula."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "hAooAOFRsYc_0", "video_path": "hAooAOFRsYc.mp4", "subtitle_path": "hAooAOFRsYc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2885.3, "view_count": 25480}, {"video_id": "hAooAOFRsYc", "question": "In the screen, there is a white PPT slide with English text and formulas. There is also a gray dividing line, and the side notes with handwritten 'k (a, b)'. 
When the phrase 'just the beginning and this is just a' is mentioned, what changes occur to the formula 'k (a, b)' on the right side of the PPT slide?", "question_wo_referring_query": "What changes occur to the handwritten formula 'k (a, b)' on the right side of the PPT slide?", "candidates": ["The formula is enlarged", "The PPT slide turns to the next page", "The PPT slide moves up", "The formula is shrunk", "The PPT slide moves down"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "hAooAOFRsYc_1", "video_path": "hAooAOFRsYc.mp4", "subtitle_path": "hAooAOFRsYc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2885.3, "view_count": 25480}, {"video_id": "hAooAOFRsYc", "question": "In a white PPT background with a lot of related content written with formulas, after the phrase 'can make this pretty explicit namely you' is mentioned, what change occurred to the formula si= si-1+o(ziWk)(zWv)T on the PPT slide?", "question_wo_referring_query": "What change occurred to the formula si= si-1+o(ziWk)(zWv)T on the PPT slide?", "candidates": ["Brackets were added to the formula.", "The formula was circled with a red circle.", "A yellow rectangle was added.", "The formula was underlined.", "The formula disappeared."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "hAooAOFRsYc_2", "video_path": "hAooAOFRsYc.mp4", "subtitle_path": "hAooAOFRsYc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2885.3, "view_count": 25480}, {"video_id": "U53VgzzXcEg", "question": "On a wooden grain table, there is a small electric stove. On top of the stove, there is a flat-bottomed pot. Nearby, there are four transparent bowls containing various seasonings. A hand is holding tongs and is grasping noodles inside the pot. 
What is this hand doing?", "question_wo_referring_query": "What is this hand doing?", "candidates": ["Picking up the noodles", "Picking up the green onions in the bowl", "Stirring the noodles with the tongs", "Putting down the tongs", "Picking up the meat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "U53VgzzXcEg_0", "video_path": "U53VgzzXcEg.mp4", "subtitle_path": "U53VgzzXcEg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 991.45, "view_count": 710580}, {"video_id": "U53VgzzXcEg", "question": "There is a white ceramic bowl in the video. A hand is holding a pair of chopsticks over the bowl, which contains noodles, green onions, an egg, shredded meat, and a piece of seaweed. The background behind the bowl is very blurry. What is this hand doing?", "question_wo_referring_query": "What is this hand doing?", "candidates": ["Picking up the shredded meat", "Picking up the seaweed", "Picking up the green onions", "Picking up the egg", "Picking up the noodles"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "U53VgzzXcEg_1", "video_path": "U53VgzzXcEg.mp4", "subtitle_path": "U53VgzzXcEg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 991.45, "view_count": 710580}, {"video_id": "U53VgzzXcEg", "question": "On a wooden table, there is a transparent bowl. A hand is holding a cup filled with a brownish liquid. There are four Japanese characters above a string of white English letters with '50ml' written after them. 
What is the hand doing with the cup?", "question_wo_referring_query": "What is the hand doing with the cup?", "candidates": ["Taking the cup out of the frame", "Adding water to the cup", "Pouring the liquid from the cup into the bowl", "Putting the cup into the bowl", "Placing the cup on the table"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "U53VgzzXcEg_2", "video_path": "U53VgzzXcEg.mp4", "subtitle_path": "U53VgzzXcEg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 991.45, "view_count": 710580}, {"video_id": "7hcHdnA-BHw", "question": "In a snow-covered field, a man wearing black clothes and a hat is bending over holding a round-legged iron stand. Under the iron stand, there is a smoking campfire. The ground is covered with snow. To the left of the frame, there is an iron shovel. Behind the man, there is a wooden house with a wooden pillar, a glass door, and a snow-covered roof. To the right of the frame, there is a stone heap. What object is NOT present in the scene?", "question_wo_referring_query": "What object is NOT present in the scene?", "candidates": ["Yellow dog", "Black dog", "Black and white dog", "White snow", "Orange gloves"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "7hcHdnA-BHw_0", "video_path": "7hcHdnA-BHw.mp4", "subtitle_path": "7hcHdnA-BHw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1313.88, "view_count": 10122885}, {"video_id": "7hcHdnA-BHw", "question": "In a snowy landscape, a man wearing a black coat and a hat is washing vegetables from a basket by a stone pile next to a pond. Behind him is a wooden hut built with poles, filled with wood logs inside. The roof is covered with white snow. On the left side of the screen, there's a big pot emitting smoke. Behind the wooden hut is a snow-covered slope with some tree branches. 
There are also two small dogs beside the man. What object is not present in the scene?", "question_wo_referring_query": "What object is not present in the scene?", "candidates": ["White snow", "Black shoes", "An olive-colored hat", "A black dog", "A black and white dog"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "7hcHdnA-BHw_1", "video_path": "7hcHdnA-BHw.mp4", "subtitle_path": "7hcHdnA-BHw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1313.88, "view_count": 10122885}, {"video_id": "7hcHdnA-BHw", "question": "A man wearing a colorful coat is standing beside a wooden table. On the table, there are two rectangular wooden trays with rectangular pieces of meat inside, a black round tray with vegetables, an empty wooden bowl, and seasoning bottles. The man is holding a knife and cutting cabbage rolls on a cutting board. Behind him, on one side, there are many cylindrical logs stacked, while on the other side, there is a vast white snowfield with some withered branches. The sunlight is shining on the snowy ground. Which objects are not present in the scene?", "question_wo_referring_query": "Which objects are not present in the scene?", "candidates": ["Carrots", "White puppies", "Earthen jars", "Red round radishes", "Pumpkins"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "7hcHdnA-BHw_2", "video_path": "7hcHdnA-BHw.mp4", "subtitle_path": "7hcHdnA-BHw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1313.88, "view_count": 10122885}, {"video_id": "Q7cwC00Gs-Q", "question": "The screen shows split scenes of two women. The woman in the left split screen is wearing a green and black dress, has auburn short hair, and a small microphone on her collar. She is talking. 
The woman on the right has a bookshelf and a picture frame behind her, has auburn curly hair, is wearing glasses, and an orange outfit. What object mentioned in 'HERE TO DISCUSS IS OUR GUEST FROM A DEMOCRATIC PULLING FORM' is not in the frame?", "question_wo_referring_query": "What object mentioned is not in the frame?", "candidates": ["book", "green painting", "earrings", "blue painting", "amber glasses"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Q7cwC00Gs-Q_0", "video_path": "Q7cwC00Gs-Q.mp4", "subtitle_path": "Q7cwC00Gs-Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2691.03, "view_count": 3336}, {"video_id": "Q7cwC00Gs-Q", "question": "A white-haired woman is reporting the news. The background behind her consists of images of buildings and the sky. She is wearing black clothes, and there are white background subtitles in front of and next to her. Which items are not present on the screen when the subtitles state: 'NTENTIONS WILL CHANGE. HIS BEHAVIOR WOULD CHANGE ONLY IF'?", "question_wo_referring_query": "Which items are not present on the screen?", "candidates": ["tree", "suspension bridge", "necklace", "glove", "buttons"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Q7cwC00Gs-Q_1", "video_path": "Q7cwC00Gs-Q.mp4", "subtitle_path": "Q7cwC00Gs-Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2691.03, "view_count": 3336}, {"video_id": "Q7cwC00Gs-Q", "question": "There are three men standing in a hall. They are wearing suits and there is a black curtain behind them. In front of the curtain, there are three American flags. The man standing in the front, who is wearing glasses, is giving a speech. There is a podium in front of him. The two men behind him, one tall and one short, both have some gray hair. 
When mentioning 'BIGGEST TEST OF HIS POLITICAL LIFE, HOW TO MOVE THIS FORWARD', what objects are not present in the video?", "question_wo_referring_query": "What objects are not present in the video?", "candidates": ["Yellow accessory", "Red tie with white dots", "Red and blue striped tie", "Black glasses", "Blue tie"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Q7cwC00Gs-Q_2", "video_path": "Q7cwC00Gs-Q.mp4", "subtitle_path": "Q7cwC00Gs-Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2691.03, "view_count": 3336}, {"video_id": "TrLrBL1U8z0", "question": "A man with short hair, wearing sunglasses and gray clothes, is speaking. There is a green curtain behind him. He has some beard. What type of clothes is he wearing?", "question_wo_referring_query": "What type of clothes is he wearing?", "candidates": ["Gray round-neck T-shirt", "Gray round-neck sweatshirt", "Gray round-neck jacket", "Gray round-neck shirt", "Gray round-neck sweater"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "TrLrBL1U8z0_0", "video_path": "TrLrBL1U8z0.mp4", "subtitle_path": "TrLrBL1U8z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1620.79, "view_count": 19393}, {"video_id": "TrLrBL1U8z0", "question": "In a black background, a man with short hair and wearing sunglasses is speaking. He is wearing a gray round-neck shirt, with a 3D image of a green-yellow virus in front of him. He has some short beard on his face. 
What material are the frames of his sunglasses?", "question_wo_referring_query": "What material are the frames of this man's sunglasses?", "candidates": ["Sunglasses with glass frames", "Sunglasses with TR frames", "Sunglasses with metal frames", "Sunglasses with plastic frames", "Sunglasses with wooden frames"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "TrLrBL1U8z0_1", "video_path": "TrLrBL1U8z0.mp4", "subtitle_path": "TrLrBL1U8z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1620.79, "view_count": 19393}, {"video_id": "TrLrBL1U8z0", "question": "In front of a green screen background, a man with short hair wearing sunglasses is speaking. He is wearing a black piece of clothing and has some facial hair. What type of clothing is he wearing?", "question_wo_referring_query": "What type of clothing is he wearing?", "candidates": ["Black V-neck T-shirt", "Black round-neck T-shirt", "Black blazer", "Black knitted shirt", "Black dress shirt"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "TrLrBL1U8z0_2", "video_path": "TrLrBL1U8z0.mp4", "subtitle_path": "TrLrBL1U8z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1620.79, "view_count": 19393}, {"video_id": "iz4QBKoebK4", "question": "In the scene, there is a dark blue background with a huge amount of white smoke continuously emerging. 
What happened when the thick smoke appeared for the first time?", "question_wo_referring_query": "What happened?", "candidates": ["The thick smoke was dispersed by the wind", "A burst of flame emerged", "A person walked out of the thick smoke", "It started to rain", "An explosion occurred within the thick smoke"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "iz4QBKoebK4_0", "video_path": "iz4QBKoebK4.mp4", "subtitle_path": "iz4QBKoebK4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1209.34, "view_count": 20195}, {"video_id": "iz4QBKoebK4", "question": "Under a grey background with black thick smoke and firelight, what happened the first time the volcano appeared?", "question_wo_referring_query": "What happened?", "candidates": ["It was covered by heavy snow", "Fireballs and black thick smoke were ejected", "It started to rain in the sky", "Many gems appeared on the volcano", "The trees around the volcano started to burn"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "iz4QBKoebK4_1", "video_path": "iz4QBKoebK4.mp4", "subtitle_path": "iz4QBKoebK4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1209.34, "view_count": 20195}, {"video_id": "iz4QBKoebK4", "question": "The screen shows a black outer space with a magnified Mars on the right side. The details of Mars' surface can be clearly seen. What happened the first time Mars appeared?", "question_wo_referring_query": "The screen shows a black outer space with a magnified Mars on the right side. The details of Mars' surface can be clearly seen. 
What happened the first time Mars appeared?", "candidates": ["Mars is gradually shrinking", "Mars is rotating slowly", "A meteor passed by Mars", "Mars was hit by a meteor", "A spacecraft appeared near Mars"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "iz4QBKoebK4_2", "video_path": "iz4QBKoebK4.mp4", "subtitle_path": "iz4QBKoebK4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1209.34, "view_count": 20195}, {"video_id": "ZYLkjljplDg", "question": "In the video, there is a Texas state flag. When 'Anglo and Hispanic' is mentioned, what event takes place on the screen?", "question_wo_referring_query": "What event takes place on the screen?", "candidates": ["It changes to a picture of a bunny", "A coin appears", "It changes to an old photograph", "It changes to a landscape painting", "It changes to an American flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "ZYLkjljplDg_0", "video_path": "ZYLkjljplDg.mp4", "subtitle_path": "ZYLkjljplDg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2217.35, "view_count": 133730}, {"video_id": "ZYLkjljplDg", "question": "The screen shows a bridge submerged in floodwater. 
When the phrase 'storm that affected the Houston area was' is mentioned, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["People are submerged in the floodwater", "Cars driving on the bridge", "A group of people standing on the bridge, with cars submerged in water below", "Cars are submerged in the floodwater", "A group of people standing on a car, with the car submerged in water"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "ZYLkjljplDg_1", "video_path": "ZYLkjljplDg.mp4", "subtitle_path": "ZYLkjljplDg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2217.35, "view_count": 133730}, {"video_id": "ZYLkjljplDg", "question": "The screen shows a harvester working in a golden rice field, the sky is very blue with some white clouds, and the harvester is red. What happens on the screen when 'so very diversified economy including' is mentioned?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Becomes a Ferris wheel", "Becomes a steak", "Becomes a roundabout road", "Vegetables appear on the screen", "Becomes a purple map block"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "ZYLkjljplDg_2", "video_path": "ZYLkjljplDg.mp4", "subtitle_path": "ZYLkjljplDg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2217.35, "view_count": 133730}, {"video_id": "GIgGw0OEbiI", "question": "Against a dark blue background, there are several blue, dark blue, and white blocks, varying in size, clustered together. At the top of the screen, there's a white block and in the top left corner, there's a shape with letters inscribed on it. The blocks in the center of the screen are gradually shrinking against the blue background. 
What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["The blocks form a complete puzzle", "The blocks disappear", "The blocks start to glow", "The blocks turn black", "Lines appear on the blocks"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "GIgGw0OEbiI_0", "video_path": "GIgGw0OEbiI.mp4", "subtitle_path": "GIgGw0OEbiI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.46, "view_count": 693589}, {"video_id": "GIgGw0OEbiI", "question": "In the gray and blurry background screen, there is a painting, which is a very intricate stone carving frame with some figures inside. The whole painting is not very clear and quite ambiguous, and the painting itself is also gray. After this screen appears, what happens next?", "question_wo_referring_query": ", what happens next?", "candidates": ["The painting disappears", "The painting gradually enlarges", "The painting turns into color", "The painting is split into two halves", "The painting gradually shrinks"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "GIgGw0OEbiI_1", "video_path": "GIgGw0OEbiI.mp4", "subtitle_path": "GIgGw0OEbiI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.46, "view_count": 693589}, {"video_id": "GIgGw0OEbiI", "question": "The scene is a painting featuring a group of people surrounded by yellow trees. The people are wearing clothes in colors such as red, orange, and green. There is a wooden viewing stand in the painting with some people on it. Sunlight is shining on the trees and the ground. The painting is yellow in tone. 
What happened after this painting appeared?", "question_wo_referring_query": "What happened?", "candidates": ["The painting moved to the upper right corner", "The painting moved to the lower right corner", "The painting moved to the right", "The painting moved to the lower left corner", "The painting moved to the left"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "GIgGw0OEbiI_2", "video_path": "GIgGw0OEbiI.mp4", "subtitle_path": "GIgGw0OEbiI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.46, "view_count": 693589}, {"video_id": "KiEHlDWhdwk", "question": "In a PPT slide with a military green background, there is a title in the middle composed of white letters, and below it are images of four characters. One character is wearing a hat, one character is wearing glasses and surrounded by pens and erasers, another character is also wearing glasses, and another character is surrounded by erasers and magnifiers. Below the images of the characters, there are some letters. In the PPT, which character appears last?", "question_wo_referring_query": "In the PPT, which character appears last?", "candidates": ["The character wearing glasses", "The character wearing a hat", "The character wearing glasses and surrounded by pens and erasers", "The character surrounded by glasses", "The character surrounded by erasers and magnifiers"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "KiEHlDWhdwk_0", "video_path": "KiEHlDWhdwk.mp4", "subtitle_path": "KiEHlDWhdwk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 961.48, "view_count": 465219}, {"video_id": "KiEHlDWhdwk", "question": "In a scene with a background of a withered tree branch, there are five circular icons. These icons are composed of white lines and are evenly distributed on the screen. 
There are three icons on the top and two on the bottom. The center of these icons features a white design. Which of these icons appeared first?", "question_wo_referring_query": "Which of these icons appeared first?", "candidates": ["A circle with a clipboard recording orders", "A circle resembling the sun", "A circle depicting a spring-loaded boxing glove from a box", "A circle representing a data matrix", "A circle with a sword and shield"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "KiEHlDWhdwk_1", "video_path": "KiEHlDWhdwk.mp4", "subtitle_path": "KiEHlDWhdwk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 961.48, "view_count": 465219}, {"video_id": "KiEHlDWhdwk", "question": "In a PPT slide with a military green background, there is a title composed of white letters and four white circular outlined icons. The icons include a gear, a brush, a shield, and an axe. There are white letters below the icons explaining them. Which of these four icons appears first?", "question_wo_referring_query": "In a PPT slide with a military green background, there is a title composed of white letters and four white circular outlined icons. The icons include a gear, a brush, a shield, and an axe. There are white letters below the icons explaining them. 
Which of these four icons appears first?", "candidates": ["The circle with a shield", "The circle with an axe", "The circle with a brush", "The circle with a gear", "The circle with glasses"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "KiEHlDWhdwk_2", "video_path": "KiEHlDWhdwk.mp4", "subtitle_path": "KiEHlDWhdwk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 961.48, "view_count": 465219}, {"video_id": "fF72vHvAYrg", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["A man in a white striped short-sleeve shirt and a man in a black long-sleeve shirt are sitting indoors enjoying gourmet food, two men are outdoors in sunlight hanging a ram's weight on a tree, one man in a black long-sleeve shirt is cutting vegetables on a sunny grassy field", "Two men are outdoors in sunlight hanging a ram's weight on a tree, a man in a white striped short-sleeve shirt and a man in a black long-sleeve shirt are sitting indoors enjoying gourmet food, one man in a black long-sleeve shirt is cutting vegetables on a sunny grassy field", "A man in a white striped short-sleeve shirt and a man in a black long-sleeve shirt are sitting indoors enjoying gourmet food, one man in a black long-sleeve shirt is cutting vegetables on a sunny grassy field, two men are outdoors in sunlight hanging a ram's weight on a tree", "One man in a black long-sleeve shirt is cutting vegetables on a sunny grassy field, two men are outdoors in sunlight hanging a ram's weight on a tree, a man in a white striped short-sleeve shirt and a man in a black long-sleeve shirt are sitting indoors enjoying gourmet food", "Two men are outdoors in sunlight hanging a ram's weight on a tree, one man in a black long-sleeve shirt is cutting vegetables on a sunny grassy field, a man in a white striped short-sleeve 
shirt and a man in a black long-sleeve shirt are sitting indoors enjoying gourmet food"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "fF72vHvAYrg_0", "video_path": "fF72vHvAYrg.mp4", "subtitle_path": "fF72vHvAYrg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1298.3, "view_count": 2065560}, {"video_id": "fF72vHvAYrg", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["The man in black clothes stir-fries chopped green onions in a large iron pot placed on a wooden stove, the man wearing black clothes squats on a grassy field washing the iron pot with spring water using both hands, the person wearing black clothes chops carrots on a wooden cutting board.", "The man in black clothes stir-fries chopped green onions in a large iron pot placed on a wooden stove, the person wearing black clothes chops carrots on a wooden cutting board, the man wearing black clothes squats on a grassy field washing the iron pot with spring water using both hands.", "The person wearing black clothes chops carrots on a wooden cutting board, the man wearing black clothes squats on a grassy field washing the iron pot with spring water using both hands, the man in black clothes stir-fries chopped green onions in a large iron pot placed on a wooden stove.", "The man wearing black clothes squats on a grassy field washing the iron pot with spring water using both hands, the person wearing black clothes chops carrots on a wooden cutting board, the man in black clothes stir-fries chopped green onions in a large iron pot placed on a wooden stove.", "The person wearing black clothes chops carrots on a wooden cutting board, the man in black clothes stir-fries chopped green onions in a large iron pot placed on a wooden stove, the man wearing black clothes squats on a grassy field 
washing the iron pot with spring water using both hands."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "fF72vHvAYrg_1", "video_path": "fF72vHvAYrg.mp4", "subtitle_path": "fF72vHvAYrg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1298.3, "view_count": 2065560}, {"video_id": "fF72vHvAYrg", "question": "Which of the following sequences of actions is correct?", "question_wo_referring_query": "Which of the following sequences of actions is correct?", "candidates": ["Apply seasoning to the lamb, add hot water to the transparent container with fruits and leaves, serve the prepared meat and rice onto the plate", "Add hot water to the transparent container with fruits and leaves, serve the prepared meat and rice onto the plate, apply seasoning to the lamb", "Serve the prepared meat and rice onto the plate, add hot water to the transparent container with fruits and leaves, apply seasoning to the lamb", "Apply seasoning to the lamb, serve the prepared meat and rice onto the plate, add hot water to the transparent container with fruits and leaves", "Add hot water to the transparent container with fruits and leaves, apply seasoning to the lamb, serve the prepared meat and rice onto the plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "fF72vHvAYrg_2", "video_path": "fF72vHvAYrg.mp4", "subtitle_path": "fF72vHvAYrg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1298.3, "view_count": 2065560}, {"video_id": "2yHntvD5r8I", "question": "In a completely black background, there are labels at 25\u00b0N and 25\u00b0S, with orange arrows pointing towards a globe diagram. The globe diagram has three circles on it, located in the northern hemisphere, the equator, and the southern hemisphere. 
In which scene does this globe diagram appear?", "question_wo_referring_query": "In which scene does this globe diagram appear?", "candidates": ["Two spherical cross-sectional diagrams are respectively on the left and right sides of the screen, with dense latitude and longitude lines on them.", "In the distance, there is a black shadow, with the remaining sunlight shining on the water surface. The water surface is lightly rippling, and on the right side of the screen, there is a hand resting on a staircase extending into the water.", "On the left side, there is a burning sun, and on the right side, there is a gray moon. In between, there is a huge sphere, with the phrase 'The sun and moon' in English printed at the bottom.", "In the black background, there is a burning sphere with an iron wire-like long line arching above it."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "2yHntvD5r8I_0", "video_path": "2yHntvD5r8I.mp4", "subtitle_path": "2yHntvD5r8I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.32, "view_count": 238540}, {"video_id": "2yHntvD5r8I", "question": "Amidst the deep blue sky, white clouds float gently. Below the sky are scattered buildings, and nearby flows a meandering river stretching into the distance. Riverbanks are lined with green vegetation. 
In which of the following scenes do crisscross white specks resembling hemp flowers appear in the sky?", "question_wo_referring_query": "In which of the following scenes do crisscross white specks resembling hemp flowers appear in the sky?", "candidates": ["In the dark background, there is a glowing sphere with a long wire-like line stretched above it in a curved path.", "In the blue sky with no clouds, some light white appears at the bottom of the screen, and bright objects flash across the sky in an orderly manner.", "In the distance, there is a dark shadow, with the remaining sunlight reflecting on the water surface, which lightly ripples. On the right-hand side of the screen, a handrail extends into the water.", "Two spherical cutaway diagrams are on the left and right sides of the screen, densely packed with mesh lines."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "2yHntvD5r8I_1", "video_path": "2yHntvD5r8I.mp4", "subtitle_path": "2yHntvD5r8I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.32, "view_count": 238540}, {"video_id": "2yHntvD5r8I", "question": "In a mirror glass, a pink map, a hand holding a water bottle is pointing to a red and blue circle each occupying half of the circumference on the map. On the right side of the circle, there is an English phrase 'LUDINGTON PUMPED STORAGE PLANT.' 
Has the white cap of the water bottle shown up in that scene?", "question_wo_referring_query": "Has the white cap of the water bottle shown up in that scene?", "candidates": ["In the distance is a black shadow, the residual light of the sun shines on the water, the water gently ripples, and to the right of the screen, a hand supports a staircase extending into the water.", "Two spherical section diagrams are on the left and right sides of the screen, with dense meridians and parallels on them.", "A giant Earth glows blue in the starry sky, a man in a spacesuit is clenching his fingers tight in front of the mirror.", "The pale blue sky is dotted with white clouds, the deep blue sea and the sky meet at a line, the bottom of the mirror shows a protective net filled with iron filaments."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "2yHntvD5r8I_2", "video_path": "2yHntvD5r8I.mp4", "subtitle_path": "2yHntvD5r8I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.32, "view_count": 238540}, {"video_id": "xqaA_gIx1jU", "question": "Which subtitles appear simultaneously with the scenes where the man, wearing a black long-sleeve shirt, holds a kitchen knife in one hand and a beef bone in the other while standing in front of a wooden table and cutting the beef bone?", "question_wo_referring_query": "Which subtitles appear simultaneously?", "candidates": ["\"Sugar\"", "\"Apple vinegar\"", "\"Cabbage\"", "\"Tomato paste\"", "\"Beetroot\""], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "xqaA_gIx1jU_0", "video_path": "xqaA_gIx1jU.mp4", "subtitle_path": "xqaA_gIx1jU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.0, "view_count": 683739}, {"video_id": "xqaA_gIx1jU", "question": "Surrounded by grass, which subtitles did the black pot containing rose-colored beetroot strips sprinkled with 
sugar appear with simultaneously?", "question_wo_referring_query": "Surrounded by grass, which subtitles did the black pot containing rose-colored beetroot strips sprinkled with sugar appear with simultaneously?", "candidates": ["\"Onion\"", "\"Beef stock\"", "\"Tomato paste\"", "\"Beetroot\"", "\"Cabbage\""], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "xqaA_gIx1jU_1", "video_path": "xqaA_gIx1jU.mp4", "subtitle_path": "xqaA_gIx1jU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.0, "view_count": 683739}, {"video_id": "xqaA_gIx1jU", "question": "Surrounded by grasslands, which subtitles appeared at the same time as the beetroot in the black cast iron pot placed on the stove?", "question_wo_referring_query": "Which subtitles appeared at the same time?", "candidates": ["\"Carrot\"", "\"Onion\"", "\"Cabbage\"", "\"Sugar\"", "\"Corn oil\""], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "xqaA_gIx1jU_2", "video_path": "xqaA_gIx1jU.mp4", "subtitle_path": "xqaA_gIx1jU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.0, "view_count": 683739}, {"video_id": "QpUBtBEJiOs", "question": "A person wearing a black and white floral dress is cutting an eggplant on a wooden cutting board with a kitchen knife. What change occurs to the eggplant when it is soaked in a glass container filled with clear water?", "question_wo_referring_query": "A person wearing a black and white floral dress is cutting an eggplant on a wooden cutting board with a kitchen knife. 
What change occurs to the eggplant when it is soaked in a glass container filled with clear water?", "candidates": ["The eggplant turns black", "The eggplant changes into cubes", "The eggplant changes into slices", "The eggplant turns red", "The raw eggplant turns into cooked eggplant"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "QpUBtBEJiOs_0", "video_path": "QpUBtBEJiOs.mp4", "subtitle_path": "QpUBtBEJiOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 953.3, "view_count": 1292576}, {"video_id": "QpUBtBEJiOs", "question": "Six slices of eggplant placed on a black baking tray were brushed with oil on the surfaces, appearing white with a hint of yellow. What changes occurred to these eggplants after being taken out of the oven when they were baked?", "question_wo_referring_query": "What changes occurred to these eggplants?", "candidates": ["The white eggplants turned into red eggplants", "The eggplants got burnt and became black", "The raw white eggplants turned into golden ripe eggplants", "The eggplants turned black", "The eggplants shrank"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "QpUBtBEJiOs_1", "video_path": "QpUBtBEJiOs.mp4", "subtitle_path": "QpUBtBEJiOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 953.3, "view_count": 1292576}, {"video_id": "QpUBtBEJiOs", "question": "A person wearing a black and white floral dress is cutting white onions on a wooden cutting board with a kitchen knife. 
When the cut onions are added to a ceramic container filled with raw meat pieces on a khaki-colored tablecloth, what change occurs to the onions?", "question_wo_referring_query": ", what change occurs to the onions?", "candidates": ["The whole onions change into onion segments", "The white onions change into red onions", "The whole onions change into onion pieces", "The whole onions change into onion sauce", "The whole onions change into onion shreds"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "QpUBtBEJiOs_2", "video_path": "QpUBtBEJiOs.mp4", "subtitle_path": "QpUBtBEJiOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 953.3, "view_count": 1292576}, {"video_id": "rHQPBqMULXo", "question": "In the diagram with a black background and white lines, there is a man in the foreground wearing sunglasses and having sparse dreadlocks; his body is outlined with glowing green lines. What change occurs to this man when the subtitle \u2018and you can actually become an expert\u2019 appears?", "question_wo_referring_query": "What change occurs to this man?", "candidates": ["He starts wearing a hat", "His short hair turns long", "His dreadlocks turn into a shaved head", "He stops wearing sunglasses", "The glowing green lines outlining his body turn blue"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "rHQPBqMULXo_0", "video_path": "rHQPBqMULXo.mp4", "subtitle_path": "rHQPBqMULXo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.39, "view_count": 78336}, {"video_id": "rHQPBqMULXo", "question": "A man wearing a gray linen long-sleeve shirt and sunglasses is sitting with his arms crossed in front of a wooden table on the lawn outside of a house. Behind him is a red train. 
What changes occur to this man when the subtitle 'might give you the impression that' appears?", "question_wo_referring_query": "What changes occur to this man?", "candidates": ["He starts wearing a hat", "His clothes turn black", "He stops wearing sunglasses", "He no longer has a beard", "His gray linen shirt becomes a gray coat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "rHQPBqMULXo_1", "video_path": "rHQPBqMULXo.mp4", "subtitle_path": "rHQPBqMULXo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.39, "view_count": 78336}, {"video_id": "rHQPBqMULXo", "question": "The background shows many cars speeding on an elevated bridge illuminated by street lights, with tall buildings lit up on either side. As the subtitle 'productive use of your time' appears, what change occurs to the man wearing sunglasses and sporting a ponytail?", "question_wo_referring_query": "What change occurs to the man as the subtitle 'productive use of your time' appears?", "candidates": ["The man's image shrinks from the center of the video to the bottom right corner.", "The man's image shrinks from the center of the video to the bottom left corner.", "The man's image enlarges from the center of the video to the top.", "The man's image shrinks from the center of the video to the top right corner.", "The man's image shrinks from the center of the video to the top left corner."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "rHQPBqMULXo_2", "video_path": "rHQPBqMULXo.mp4", "subtitle_path": "rHQPBqMULXo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.39, "view_count": 78336}, {"video_id": "CcOFmuLyPqQ", "question": "On a grey-blue wall, there are two pieces of paper with black and white pictures. 
In front of the wall, there is a man with curly hair, wearing a black short-sleeve shirt, blue jeans, and a wristwatch. What is this man doing at this moment?", "question_wo_referring_query": "What is the man doing at this moment?", "candidates": ["Touching his head with both hands", "Running his fingers through his hair with one hand", "Making a phone call", "Waving at the camera", "Raising his watch-wearing hand and pointing at the picture on the right side of the paper"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "CcOFmuLyPqQ_0", "video_path": "CcOFmuLyPqQ.mp4", "subtitle_path": "CcOFmuLyPqQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1494.64, "view_count": 188140}, {"video_id": "CcOFmuLyPqQ", "question": "At the lower section of a brick-red and khaki-colored building with a tiled wall, there are two shops with dark gray doors. A light gray garbage truck is parked by the roadside, and two workers wearing red suits with fluorescent green vests are standing behind it. What are they actually doing at this moment?", "question_wo_referring_query": "What are they actually doing at this moment?", "candidates": ["Sitting by the roadside resting", "Sitting by the roadside having lunch", "Emptying the garbage bin", "Having a conversation", "Smoking"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "CcOFmuLyPqQ_1", "video_path": "CcOFmuLyPqQ.mp4", "subtitle_path": "CcOFmuLyPqQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1494.64, "view_count": 188140}, {"video_id": "CcOFmuLyPqQ", "question": "The background is a shopping mall building with a bronze statue in front, and a group of people wearing fluorescent green and fluorescent red vests are standing by the traffic light at the side of the road, holding signs with words written on them. 
What are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Protesting", "Shopping at the mall", "Fighting in a crowd", "Crossing the road", "Waiting for the traffic light"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "CcOFmuLyPqQ_2", "video_path": "CcOFmuLyPqQ.mp4", "subtitle_path": "CcOFmuLyPqQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1494.64, "view_count": 188140}, {"video_id": "d563NI7_T9M", "question": "There is a bookshelf filled with books in the background. A woman with thick black eyebrows, wearing a pink and white striped short-sleeve shirt, is sitting in front of a mirror. In the lower left corner of the screen, there is a white sticker with black English text. What objects are present in the screen at this moment?", "question_wo_referring_query": "There is a bookshelf filled with books in the background. A woman with thick black eyebrows, wearing a pink and white striped short-sleeve shirt, is sitting in front of a mirror. In the lower left corner of the screen, there is a white sticker with black English text. What objects are present in the screen at this moment?", "candidates": ["a white door", "a clothes rack", "a computer", "a lamp", "a potted plant"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "d563NI7_T9M_0", "video_path": "d563NI7_T9M.mp4", "subtitle_path": "d563NI7_T9M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1160.28, "view_count": 8093}, {"video_id": "d563NI7_T9M", "question": "The right side of the video shows a shrunken image of a woman with black hair speaking. The video is filmed inside a room with wooden floorboards. Various ceramic crafts are displayed on the wooden shelves and glass covers around and above the room. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["Cashier", "Mural", "Refrigerator", "Television screen", "Glass cover"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "d563NI7_T9M_1", "video_path": "d563NI7_T9M.mp4", "subtitle_path": "d563NI7_T9M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1160.28, "view_count": 8093}, {"video_id": "d563NI7_T9M", "question": "The screen shows four framed paintings in different positions, and a small video image on the right side of the screen. What objects are present on the screen at this time?", "question_wo_referring_query": "What objects are present on the screen at this time?", "candidates": ["black-haired woman", "curtain", "bookshelf", "window", "cell phone"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "d563NI7_T9M_2", "video_path": "d563NI7_T9M.mp4", "subtitle_path": "d563NI7_T9M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1160.28, "view_count": 8093}, {"video_id": "xIDAlQAHc3s", "question": "Four people are sitting side by side on a grey sofa. The man on the left, who is wearing a grey hooded top, is holding a black mobile phone with both hands. The three people on the right are holding white bowls, looking at a computer covered with sticky notes, and eating something. 
When the subtitles 'hello everyone I'm back home now it was' appear, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["transparent glass", "refrigerator", "window", "blue jeans", "black clothing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "xIDAlQAHc3s_0", "video_path": "xIDAlQAHc3s.mp4", "subtitle_path": "xIDAlQAHc3s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1839.84, "view_count": 1165964}, {"video_id": "xIDAlQAHc3s", "question": "A girl wearing a sleeveless top is sitting beside a shelf filled with books and small ornaments. She is using a computer while facing sideways to the camera. When the subtitle 'am so not me refreshing this page copy' appears, what object is not present on the screen?", "question_wo_referring_query": "What object is not present on the screen?", "candidates": ["Green plant", "Camera", "Mirror", "Television", "Electric fan"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "xIDAlQAHc3s_1", "video_path": "xIDAlQAHc3s.mp4", "subtitle_path": "xIDAlQAHc3s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1839.84, "view_count": 1165964}, {"video_id": "xIDAlQAHc3s", "question": "The background features a shelving unit with books and a clothing rack with hanging clothes. A woman wearing glasses and a white long-sleeve shirt with colorful floral prints is sitting on a wooden floor leaning against a bed, holding red scissors in her hand. 
When the subtitle 'family dynamics and first gen Vietnamese' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Washing machine", "Refrigerator", "Potted plant", "Puppy", "Television"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "xIDAlQAHc3s_2", "video_path": "xIDAlQAHc3s.mp4", "subtitle_path": "xIDAlQAHc3s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1839.84, "view_count": 1165964}, {"video_id": "T8XiBTZRwSU", "question": "There is a red sofa behind the person. On the left side of the screen, there is a television, and on the floor, there is a green plant in a pot. A short-haired woman wearing an olive-colored coat with a black inner top and glasses is sitting in front of the mirror. There are yellow English words on the screen. At this moment, what material is the woman's coat made of?", "question_wo_referring_query": "At this moment, what material is the woman's coat made of?", "candidates": ["Hooded jacket", "Denim jacket", "Cotton coat", "Linen coat", "Leather coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "T8XiBTZRwSU_0", "video_path": "T8XiBTZRwSU.mp4", "subtitle_path": "T8XiBTZRwSU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1035.9, "view_count": 197387}, {"video_id": "T8XiBTZRwSU", "question": "The background is a table lamp with warm light, a white wall painted green, and on the right is a bookshelf filled with books. In the frame, a hand is holding a white cup with cartoon animals printed on it. 
What material is this cup made of?", "question_wo_referring_query": "What material is this cup made of?", "candidates": ["Plastic cup", "Insulated cup", "Glass", "Ceramic", "Paper cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "T8XiBTZRwSU_1", "video_path": "T8XiBTZRwSU.mp4", "subtitle_path": "T8XiBTZRwSU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1035.9, "view_count": 197387}, {"video_id": "T8XiBTZRwSU", "question": "Behind her is a red sofa, to the left of the screen is a television, and there is a green potted plant on the ground. A short-haired woman wearing a brown leather jacket over black innerwear and glasses is in front of the mirror, wearing a ring and holding a dark blue coat. What material is the coat she is holding?", "question_wo_referring_query": "What material is the coat she is holding?", "candidates": ["Denim", "Uniform", "Cotton", "Leather", "Suit"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "T8XiBTZRwSU_2", "video_path": "T8XiBTZRwSU.mp4", "subtitle_path": "T8XiBTZRwSU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1035.9, "view_count": 197387}, {"video_id": "g2UtGDMKyxA", "question": "Two women wearing black coats are standing in front of a painting on a wall, discussing. The painting depicts green leaves and orange fruits. 
When the subtitle 'exhibition we really want to look' appears, what material is the painting's frame made of?", "question_wo_referring_query": "What material is the painting's frame made of?", "candidates": ["Wood", "Iron", "Plastic", "Silver", "Gold"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "g2UtGDMKyxA_0", "video_path": "g2UtGDMKyxA.mp4", "subtitle_path": "g2UtGDMKyxA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1486.74, "view_count": 62557}, {"video_id": "g2UtGDMKyxA", "question": "A painting depicts a blue sky with white clouds. The content shows two men in blue seawater; the man on the right is leaning out of a wooden boat, and the man on the left is in the water. When the subtitle 'to the bahamas painted in 1898 into 1899' appears, what is the man on the left holding with both hands?", "question_wo_referring_query": "When the subtitle 'to the bahamas painted in 1898 into 1899' appears, what is the man on the left holding with both hands?", "candidates": ["puppy", "turtle", "kitten", "bird", "fish"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "g2UtGDMKyxA_1", "video_path": "g2UtGDMKyxA.mp4", "subtitle_path": "g2UtGDMKyxA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1486.74, "view_count": 62557}, {"video_id": "g2UtGDMKyxA", "question": "Two women in black clothes are standing on the right side of the screen, admiring a painting framed in carved wood hanging on the wall on the left side. The painting depicts waves crashing against rocks, with a person sitting on a log. 
What is the hairstyle of the woman on the right when the subtitle 'different conditions of weather and' appears?", "question_wo_referring_query": "What is the hairstyle of the woman on the right?", "candidates": ["shoulder-length wavy hair", "short straight hair", "long curly hair", "long straight hair", "short curly hair"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "g2UtGDMKyxA_2", "video_path": "g2UtGDMKyxA.mp4", "subtitle_path": "g2UtGDMKyxA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1486.74, "view_count": 62557}, {"video_id": "apKqB3SiSyw", "question": "The background is a yellow low cabinet with a mural on the wall above it. Through the large floor-to-ceiling window on the left, you can see tall buildings outside. A man wearing a dark green hooded parka is indoors holding something to the camera to present the weather forecast. What is he holding?", "question_wo_referring_query": "What is the man holding to the camera to present the weather forecast indoors?", "candidates": ["book", "laptop", "watch", "mobile phone", "tablet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "apKqB3SiSyw_0", "video_path": "apKqB3SiSyw.mp4", "subtitle_path": "apKqB3SiSyw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1252.8, "view_count": 188913}, {"video_id": "apKqB3SiSyw", "question": "Two men are standing back-to-back in front of a shelf with slippers on it. The man on the left is wearing a khaki coat, and the man on the right is wearing an army green coat. In the bottom left corner of the screen, there is a green logo with white numbers 556. 
What is the man on the right holding in his right hand while choosing a product?", "question_wo_referring_query": "What is the man on the right holding in his right hand while choosing a product?", "candidates": ["shopping bag", "cell phone", "canvas bag", "slippers", "blue shopping cart"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "apKqB3SiSyw_1", "video_path": "apKqB3SiSyw.mp4", "subtitle_path": "apKqB3SiSyw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1252.8, "view_count": 188913}, {"video_id": "apKqB3SiSyw", "question": "Outside in the bright sunlight, rays of sunshine piercing through the tree leaves, a man wearing a black top and grey pants sits on a bench in front of a green tree. In the lower-left corner of the screen, a green background logo with the white text '89.00' is visible. What is sitting on this man's lap?", "question_wo_referring_query": "What is sitting on this man's lap?", "candidates": ["little girl", "puppy", "little boy", "rabbit", "kitten"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "apKqB3SiSyw_2", "video_path": "apKqB3SiSyw.mp4", "subtitle_path": "apKqB3SiSyw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1252.8, "view_count": 188913}, {"video_id": "NmFUFnMj5Jw", "question": "Under the blue sky and white clouds, there is clear azure seawater. The coastline is lined with lush trees, and some rocks stand in the sea. 
What object appears in the video before the subtitle 'beyond the water to the geography below' appears?", "question_wo_referring_query": "What object appears in the video?", "candidates": ["A slowly rotating gray 3D globe", "A crescent-shaped island on the sea", "A vertical cross-sectional illustration of sea level", "Rolling dense smoke from a volcanic eruption", "The rocky surface of a volcano"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "NmFUFnMj5Jw_0", "video_path": "NmFUFnMj5Jw.mp4", "subtitle_path": "NmFUFnMj5Jw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1187.09, "view_count": 3551279}, {"video_id": "NmFUFnMj5Jw", "question": "The gold-red molten rock erupts like starlight touching the ground in the evening. What is the object appearing on the screen before the subtitle 'when a certain part of the mantle' appears?", "question_wo_referring_query": "What is the object appearing on the screen?", "candidates": ["Golden grassland under the blue sky and white clouds", "White clouds floating in the azure sky", "A ship floating on the sea", "An island shrouded in mist", "A river flowing between snow-capped mountains"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "NmFUFnMj5Jw_1", "video_path": "NmFUFnMj5Jw.mp4", "subtitle_path": "NmFUFnMj5Jw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1187.09, "view_count": 3551279}, {"video_id": "NmFUFnMj5Jw", "question": "In the clear shallow water, three black children are on a small boat. 
After the subtitles 'really all of these there's more than' appear, what object is shown on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["An island shrouded in mist", "A densely populated city surrounded by green trees on the shore", "A crescent-shaped island on the sea", "The rocky surface of a volcano", "A river flowing between snow-capped mountains"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "NmFUFnMj5Jw_2", "video_path": "NmFUFnMj5Jw.mp4", "subtitle_path": "NmFUFnMj5Jw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1187.09, "view_count": 3551279}, {"video_id": "gkHmXhhAF2Y", "question": "In a white background PPT with purple and green framed characters, where else do the characters with a purple border appear?", "question_wo_referring_query": "Where else do they appear?", "candidates": ["In a colorful spiral-shaped curve screen", "On a black background PPT", "In a PPT with lines and curves", "In a screen with a space starry background", "In a scene where a man and a woman are on small boats on the sea playing catch"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "gkHmXhhAF2Y_0", "video_path": "gkHmXhhAF2Y.mp4", "subtitle_path": "gkHmXhhAF2Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.12, "view_count": 352596}, {"video_id": "gkHmXhhAF2Y", "question": "On a black background with four white arrows, a red spiral wave is in the middle. 
Where else has this wave appeared?", "question_wo_referring_query": "Where else has this wave appeared?", "candidates": ["On a white background, on the lower side of two circles connected by three lines", "In a scene with a man in a black short-sleeved shirt talking in front of a mirror", "On a webpage with black and white logos", "In a scene with a realistic robot", "In a scene where a man and a woman on two boats in the sea are passing a ball to each other"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "gkHmXhhAF2Y_1", "video_path": "gkHmXhhAF2Y.mp4", "subtitle_path": "gkHmXhhAF2Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.12, "view_count": 352596}, {"video_id": "gkHmXhhAF2Y", "question": "In the PPT with a white background, there are four icons framed in purple, green, and orange on the left side, and four red arrows on the right side. In which other scenarios do these red arrows appear?", "question_wo_referring_query": "In which other scenarios do these red arrows appear?", "candidates": ["In a screen of a webpage with black and white icons", "In a screen with a black background featuring two ball-shaped objects in pink and magenta", "In a screen with a black-and-white photo of a man within a circular design on the left side", "In a screen featuring a realistic robot", "In a screen with a man wearing a black short-sleeve shirt speaking in front of a camera"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "gkHmXhhAF2Y_2", "video_path": "gkHmXhhAF2Y.mp4", "subtitle_path": "gkHmXhhAF2Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.12, "view_count": 352596}, {"video_id": "tjcpZoyOlzg", "question": "The screen displays a painting depicting four people walking on a yellow road, with a blue lake, green mountains in the distance, and white clouds floating in the blue sky. 
In the bottom left corner, there are four figures all wearing uniforms, walking towards the right. As they walk to the center, what kind of change occurs?", "question_wo_referring_query": "What kind of change occurs as they reach the center?", "candidates": ["The person wearing a dark green uniform with a black hat took out paper and a pen", "The four people started shaking hands with each other", "The person wearing a light green uniform with a matching hat jumped up", "The four people walked towards the lake", "The four people started talking together"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "tjcpZoyOlzg_0", "video_path": "tjcpZoyOlzg.mp4", "subtitle_path": "tjcpZoyOlzg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1215.17, "view_count": 2004627}, {"video_id": "tjcpZoyOlzg", "question": "In the scene, there is a painting depicting a soldier dressed in a red uniform with a white hat, leading other soldiers on horseback into battle. They are all holding swords. In the distance, there are hills, blue skies, and clouds. The people he leads are dressed in various outfits, and they are all riding horses at a gallop. In another scene, the surroundings are composed of various shades of brown and green landscapes and mountain ranges, with the sky in the distance. The soldier in the red uniform is leading more people on horseback, showing the full view of the horses. At this moment, what changes occur to the horse that the leading soldier is riding?", "question_wo_referring_query": "In the scene, there is a painting depicting a soldier dressed in a red uniform with a white hat, leading other soldiers on horseback into battle. They are all holding swords. In the distance, there are hills, blue skies, and clouds. The people he leads are dressed in various outfits, and they are all riding horses at a gallop. 
In another scene, the surroundings are composed of various shades of brown and green landscapes and mountain ranges, with the sky in the distance. The soldier in the red uniform is leading more people on horseback, showing the full view of the horses. At this moment, what changes occur to the horse that the leading soldier is riding?", "candidates": ["The soldier's brown horse turned into a red-brown horse.", "The soldier's hat turned black.", "The soldier's beard is gone.", "The soldier's trousers' red stripes disappeared.", "The leading soldier changed his clothes."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "tjcpZoyOlzg_1", "video_path": "tjcpZoyOlzg.mp4", "subtitle_path": "tjcpZoyOlzg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1215.17, "view_count": 2004627}, {"video_id": "tjcpZoyOlzg", "question": "The screen shows a drawing with a yellow background. There are some tan-colored buildings and wooden boats in the drawing. The wooden boat is docked at the shore. On the shore, there is a black soldier and another soldier wearing a red cape riding a white horse. A red flag is placed on the wooden boat. When the boat is in a black background, the boat catches fire and is sinking into the water. 
What change happens to the boat?", "question_wo_referring_query": "What change happens to the boat?", "candidates": ["The flag on the boat turned black.", "The boat turned white.", "The boat was hit with many bullets.", "The flag on the boat turned tan.", "The sail of the boat disappeared."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "tjcpZoyOlzg_2", "video_path": "tjcpZoyOlzg.mp4", "subtitle_path": "tjcpZoyOlzg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1215.17, "view_count": 2004627}, {"video_id": "6QrxqHiLPf8", "question": "The screen shows a black background with a man speaking. He has black hair, is wearing a red short-sleeve shirt, and has black-framed glasses. To the left of him on the screen, a Russian flag is displayed. What change occurs to this man when 'the culture of polynesia and' is mentioned?", "question_wo_referring_query": "What change occurs to this man?", "candidates": ["His clothes changed to long sleeves", "He changed to wearing glasses with black frames", "He is holding a cup", "He switched to glasses with olive frames", "He is holding a pen"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "6QrxqHiLPf8_0", "video_path": "6QrxqHiLPf8.mp4", "subtitle_path": "6QrxqHiLPf8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1703.41, "view_count": 418859}, {"video_id": "6QrxqHiLPf8", "question": "A woman is standing in front of a black mirror. She has brown hair, is wearing a light green short-sleeved shirt, and is holding a red stick in her hand. She is wearing earrings. 
What change occurs when the phrase 'that goes and the eldest daughter is' is mentioned?", "question_wo_referring_query": "What change occurs to this woman?", "candidates": ["No longer holding the stick", "Changed to pink clothes", "Removed her earrings", "Applied lipstick", "Put on a wristwatch"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "6QrxqHiLPf8_1", "video_path": "6QrxqHiLPf8.mp4", "subtitle_path": "6QrxqHiLPf8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1703.41, "view_count": 418859}, {"video_id": "6QrxqHiLPf8", "question": "Against a black background, a man wearing a round-neck T-shirt and orange wig is speaking. What change occurs when he mentions 'i do not sound like that most of the'?", "question_wo_referring_query": "What change occurs?", "candidates": ["He picked up a hammer", "He took off the wig", "He changed into a black shirt", "He put on a coat", "He put on a mask"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "6QrxqHiLPf8_2", "video_path": "6QrxqHiLPf8.mp4", "subtitle_path": "6QrxqHiLPf8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1703.41, "view_count": 418859}, {"video_id": "AI4PXxI3qcM", "question": "In a field of yellow flowers, a woman dressed in a skirt and wearing a straw hat is standing among the flowers. She is holding a basket with her back to the camera, with small hills in the distance and a white sky. 
What is this woman doing?", "question_wo_referring_query": ", what is this woman doing?", "candidates": ["Running towards the hills in the distance", "Throwing the basket in her hand", "Removing her hat", "Picking flowers", "Adjusting her skirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "AI4PXxI3qcM_0", "video_path": "AI4PXxI3qcM.mp4", "subtitle_path": "AI4PXxI3qcM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 949.12, "view_count": 511359}, {"video_id": "AI4PXxI3qcM", "question": "A woman wearing a long dress is standing in the room holding a sun hat and facing the window with her back to the mirror. There is a small lamp on the upper right side of the yellow wooden window frame. Below the lamp are two square decorations. Sunlight is shining on the yellow curtains, casting a warm tone in the room. What is this woman with over-the-shoulder long hair doing?", "question_wo_referring_query": "What is this woman with over-the-shoulder long hair and wearing a dress doing?", "candidates": ["putting on makeup", "combing her hair repeatedly", "putting on a hat in front of the mirror", "taking off her coat", "tying a ribbon on her hair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "AI4PXxI3qcM_1", "video_path": "AI4PXxI3qcM.mp4", "subtitle_path": "AI4PXxI3qcM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 949.12, "view_count": 511359}, {"video_id": "AI4PXxI3qcM", "question": "In a room with a blurry background, there is a blonde woman in front of the camera wearing a red sweater and earrings. To her right, there's a transparent rectangular glass with a metal frame, hanging by a metal chain, containing flowers. 
What is this woman doing?", "question_wo_referring_query": ", what is this woman doing?", "candidates": ["putting flowers into the ornament", "pointing to the flowers in the ornament and introducing them", "opening the glass of the ornament", "showing this transparent ornament with flowers", "putting down the ornament"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "AI4PXxI3qcM_2", "video_path": "AI4PXxI3qcM.mp4", "subtitle_path": "AI4PXxI3qcM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 949.12, "view_count": 511359}, {"video_id": "9bP23aHNkYo", "question": "In a room with white walls, a woman with curly hair wearing a fur coat is introducing a book in front of a mirror. She is holding a thin, light green book with white text on it. Behind her is a white bed, on which many books are piled. Which object is missing in the scene?", "question_wo_referring_query": "In a room with white walls, a woman with curly hair wearing a fur coat is introducing a book in front of a mirror. She is holding a thin, light green book with white text on it. Behind her is a white bed, on which many books are piled. Which object is missing in the scene?", "candidates": ["black bedframe", "white pillow", "necklace", "earrings", "blue pillow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "9bP23aHNkYo_0", "video_path": "9bP23aHNkYo.mp4", "subtitle_path": "9bP23aHNkYo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1037.71, "view_count": 38140}, {"video_id": "9bP23aHNkYo", "question": "In a room with white walls, a woman with braids wearing a fur coat is introducing a book in front of a mirror. She is wearing accessories and holding a pink book with white text. Behind her is a white bed with many books piled on it. 
Which object is not present in the scene?", "question_wo_referring_query": "Which object is not present in the scene?", "candidates": ["earrings", "black bed frame", "olive", "gray fur coat", "shelf"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "9bP23aHNkYo_1", "video_path": "9bP23aHNkYo.mp4", "subtitle_path": "9bP23aHNkYo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1037.71, "view_count": 38140}, {"video_id": "9bP23aHNkYo", "question": "In a room with a white wall, a woman with long hair wearing a fur coat is looking at a blue book in her hand in front of a mirror. She is wearing accessories, and behind her is a white bed stacked with many books and a pillow. Which of the following objects is not present in the scene?", "question_wo_referring_query": "", "candidates": ["Pink book", "Bracelet", "Black bedframe", "Shelf", "Grey fur coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "9bP23aHNkYo_2", "video_path": "9bP23aHNkYo.mp4", "subtitle_path": "9bP23aHNkYo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1037.71, "view_count": 38140}, {"video_id": "a6oCCrwYdQY", "question": "In the video, a woman with black hair is smiling. Her hair is tied up, and the background is a blue and white backdrop. Below the camera, there is a white subtitle strip. She is wearing a red V-neck shirt and a red and white jacket. 
When the phrase 'INVESTMENT STRATEGIST AT BLACKROCK' is mentioned, what object is not present on the screen?", "question_wo_referring_query": "What object is not present on the screen?", "candidates": ["Yellow skin", "A black pupil", "A yellow ellipse", "A necklace", "A yellow rectangular strip"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "a6oCCrwYdQY_0", "video_path": "a6oCCrwYdQY.mp4", "subtitle_path": "a6oCCrwYdQY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2583.82, "view_count": 5559}, {"video_id": "a6oCCrwYdQY", "question": "In the video, there are two men standing behind a white lectern, with an orange neon light backdrop. There is also a black board with letters on it. Both men are wearing suits and talking. The man in a dark blue suit has black hair, while the man in a black suit has white hair. Below the frame, there is a white subtitle strip, displaying the text: 'GENERATING GOOD TOP LINE GROWTH BECAUSE WE CONNECT CHINA TO THE'. What is the object that does not exist in the scene?", "question_wo_referring_query": "What is the object that does not exist in the scene?", "candidates": ["A yellow circle", "A large bridge built on water", "A white shirt", "A red flower-patterned tie", "Black leather shoes"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "a6oCCrwYdQY_1", "video_path": "a6oCCrwYdQY.mp4", "subtitle_path": "a6oCCrwYdQY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2583.82, "view_count": 5559}, {"video_id": "a6oCCrwYdQY", "question": "A blonde woman is sitting in front of a mirror talking. She is wearing a navy blue outfit, red lipstick, and a pair of black-rimmed glasses. Behind her is a blue and yellow interwoven blue background. When the phrase 'T\u2019S SO LOCKED UP IN THE US. 
IT\u2019S A VERY FEEB RILE CLIMATE' is mentioned, what object is not present in the frame?", "question_wo_referring_query": "What object is not present in the frame?", "candidates": ["red letters", "yellow rectangular shape", "navy blue T-shirt", "black letters", "yellow triangle"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "a6oCCrwYdQY_2", "video_path": "a6oCCrwYdQY.mp4", "subtitle_path": "a6oCCrwYdQY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2583.82, "view_count": 5559}, {"video_id": "wGYXpRj3-Es", "question": "A man in a black coat is assembling a triangular frame on a grassy field. In the distance, there's a hillside with green trees. At the bottom of the frame, there's a pile of round stones. Near the man's feet, there are several logs. What is the material of the frame he is assembling?", "question_wo_referring_query": "What is the material of the frame he is assembling?", "candidates": ["stainless steel frame", "steel frame", "iron frame", "wooden frame", "copper frame"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wGYXpRj3-Es_0", "video_path": "wGYXpRj3-Es.mp4", "subtitle_path": "wGYXpRj3-Es_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1150.37, "view_count": 2100048}, {"video_id": "wGYXpRj3-Es", "question": "On a large pot, a pair of hands is holding a plate and pouring leeks into the pot. To the left of the frame is the lower half of a person wearing jeans, while to the right is a workstation and a wooden door. 
What material is the plate holding the leeks made of?", "question_wo_referring_query": "What material is the plate holding the leeks made of?", "candidates": ["Glass plate", "Stainless steel plate", "Iron plate", "Ceramic plate", "Wooden plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wGYXpRj3-Es_1", "video_path": "wGYXpRj3-Es.mp4", "subtitle_path": "wGYXpRj3-Es_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1150.37, "view_count": 2100048}, {"video_id": "wGYXpRj3-Es", "question": "In a grassy field, there are two tables covered with white tablecloths. In the distance, there is a yellow house, and there is also a string of lanterns surrounded by green trees. A group of children is sitting in front of the tables, and there are foods like watermelon and cakes placed on the tables. Each child has a ceramic plate and a set of knives and forks in front of them. What material is the bowl holding the watermelon in the video made of?", "question_wo_referring_query": "What material is the bowl holding the watermelon in the video made of?", "candidates": ["Ceramic bowl", "Glass bowl", "Iron bowl", "Wooden bowl", "Stainless steel bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wGYXpRj3-Es_2", "video_path": "wGYXpRj3-Es.mp4", "subtitle_path": "wGYXpRj3-Es_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1150.37, "view_count": 2100048}, {"video_id": "OAzSxWukMzE", "question": "In a black room, a man with short hair wearing gray clothes is talking. In front of him, there's a round microphone. Behind him, there's a black bookshelf with books and a lamp emitting light. To the right of the frame, there is a stand. 
When the term 'balance equation' is mentioned, what material is the stand on the right made of?", "question_wo_referring_query": "What material is the stand on the right made of?", "candidates": ["steel", "glass", "plastic", "wood", "iron"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "OAzSxWukMzE_0", "video_path": "OAzSxWukMzE.mp4", "subtitle_path": "OAzSxWukMzE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1026.68, "view_count": 29459}, {"video_id": "OAzSxWukMzE", "question": "In a black room, there is a short-haired man wearing a grey outfit speaking. In front of him, there is a round microphone. Behind him, there is a black bookshelf with books on it, and a lamp emitting light. On the right side of the frame, there is a stand. When the phrase 'Claude is designed to be incredibly' is mentioned, what type of grey outfit is this man wearing?", "question_wo_referring_query": "What type of grey outfit is this man wearing?", "candidates": ["grey shirt", "grey jacket", "grey polo shirt", "grey coat", "grey T-shirt"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "OAzSxWukMzE_1", "video_path": "OAzSxWukMzE.mp4", "subtitle_path": "OAzSxWukMzE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1026.68, "view_count": 29459}, {"video_id": "OAzSxWukMzE", "question": "In a blue and white interleaved background, there are two screens, one large and one small. One screen displays two interleaved red and blue images, and the other screen has programming code. In the bottom right corner, a man with grey short hair in a grey coat is explaining in a study room. 
When he mentions 'this video,' what shape are the two red and blue interleaved images on the screen?", "question_wo_referring_query": "What shape are the two interleaved red and blue images on the screen?", "candidates": ["Trapezoid", "Rectangle", "Heart-shaped", "Square", "Triangle"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "OAzSxWukMzE_2", "video_path": "OAzSxWukMzE.mp4", "subtitle_path": "OAzSxWukMzE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1026.68, "view_count": 29459}, {"video_id": "MdB9hsk3z8Y", "question": "A man wearing a colorful plaid shirt appears on the screen. He has short hair and stubble. Behind him is a light green background wall with a painting and a blue-glowing insect decoration hanging on it. On the other side, there is a decoration divided into six squares hanging on the wall. What happened the first time this man appeared?", "question_wo_referring_query": "What happened the first time this man appeared?", "candidates": ["Spoke", "Pointed at a map", "Removed the painting from the wall", "Adjusted his glasses", "Showed the items behind him"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "MdB9hsk3z8Y_0", "video_path": "MdB9hsk3z8Y.mp4", "subtitle_path": "MdB9hsk3z8Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1008.48, "view_count": 72768}, {"video_id": "MdB9hsk3z8Y", "question": "Under a clear blue sky with no clouds in sight, a distinct horizon where the sea meets the sky is visible. Two horses appear on the screen at the boundless seaside. 
What happens when the black horse and the brown horse first appear together?", "question_wo_referring_query": "What happens when the black horse and the brown horse first appear together?", "candidates": ["The two horses walk from right to left by the sea.", "The two horses gallop by the sea.", "The two horses drink water by the sea.", "The two horses frolic and graze by the sea.", "The two horses walk from left to right by the sea."], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "MdB9hsk3z8Y_1", "video_path": "MdB9hsk3z8Y.mp4", "subtitle_path": "MdB9hsk3z8Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1008.48, "view_count": 72768}, {"video_id": "MdB9hsk3z8Y", "question": "In the distance, there are misty snow-capped mountains. Nearby, there are green plants and weeds. Among the weeds, there's a black panda. What was this panda doing when it first appeared?", "question_wo_referring_query": "What was this panda doing when it first appeared?", "candidates": ["Lying on its back in the grass sleeping", "Crouching in the grass staring at the camera", "Sitting in the grass staring at the camera", "Sitting with its back to the camera", "Lying in the grass eating"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "MdB9hsk3z8Y_2", "video_path": "MdB9hsk3z8Y.mp4", "subtitle_path": "MdB9hsk3z8Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1008.48, "view_count": 72768}, {"video_id": "991RezkolJ4", "question": "The screen shows a PPT with a pink floral border. The white background has several formulas and a blue arrow. In the bottom right corner, a woman with long straight hair wearing a blue outfit is sitting at a pink desk explaining. On the desk, there is also a pink lamp. 
What happened after mentioning 'cathode and subtracting it by the cell'?", "question_wo_referring_query": "What happened after that?", "candidates": ["The woman tied her hair", "The blue arrow disappeared", "The woman turned on the lamp", "The formula on the screen disappeared", "The background turned black"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "991RezkolJ4_0", "video_path": "991RezkolJ4.mp4", "subtitle_path": "991RezkolJ4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.97, "view_count": 13503}, {"video_id": "991RezkolJ4", "question": "The screen shows a PPT with a pink floral border, a white background with several formulas, a blue arrow, and a light pink rectangular annotation. In the bottom right corner, there is a woman with long straight hair sitting at a pink desk explaining. She is wearing blue clothes and there is a pink desk lamp on the desk. After mentioning: 'value of Q and then it decreases the a,' what happened?", "question_wo_referring_query": "what happened?", "candidates": ["The phone behind the woman slowly rotated to the right.", "The blue arrow disappeared.", "The woman turned on the desk lamp.", "The woman adjusted her hair.", "The background turned black."], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "991RezkolJ4_1", "video_path": "991RezkolJ4.mp4", "subtitle_path": "991RezkolJ4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.97, "view_count": 13503}, {"video_id": "991RezkolJ4", "question": "The screen shows a PowerPoint slide with a pink floral border, a white background containing several formulas, and a blue arrow. In the lower right corner, a woman with long straight hair is seated at a pink desk, explaining something. She is wearing blue clothes, and there is a pink table lamp on the desk. 
What happened when she mentioned: 'reactants and then on top is going to be'?", "question_wo_referring_query": "What happened?", "candidates": ["The background turned black", "The woman raised her hand", "The woman brushed her hair", "The woman turned on the lamp", "The blue arrow disappeared"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "991RezkolJ4_2", "video_path": "991RezkolJ4.mp4", "subtitle_path": "991RezkolJ4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.97, "view_count": 13503}, {"video_id": "Um7wmlfwgWA", "question": "In a purple background, there are three armored individuals. Their respective characteristics include one with yellow skin, one holding an axe, and one wearing a red mask. They have different weapons, and their grey armor looks heavy. Which one of these characters appears first before the mention of 'This video is sponsored by Raid: Shadow Legends'?", "question_wo_referring_query": "Which character appears first?", "candidates": ["The character with green skin, a green mohawk, and shoulder armor", "The character with green skin wearing a blue mask", "The character with green skin wearing a red mask", "The character with yellow skin and black hair holding two swords", "The character with green skin, a red mohawk, and shoulder armor"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Um7wmlfwgWA_0", "video_path": "Um7wmlfwgWA.mp4", "subtitle_path": "Um7wmlfwgWA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.12, "view_count": 1152436}, {"video_id": "Um7wmlfwgWA", "question": "In a green map, there are white annotations marking various locations, and some places are marked with flags. On the left side of the map, some mountain ranges can be seen, in the middle, there are some road networks, and at the top, there is a transparent white circle. 
When the statement 'The day before, French patrols reported enemy troop movement to the north.' is mentioned, who is the first person to enter?", "question_wo_referring_query": ", who is the first person to enter?", "candidates": ["White hair, holding a staff, wearing a white uniform, holding a sword and with a red ribbon in front, Marshal Jourdan", "Short hair, yellow uniform with a white ribbon in front, General Sir Rowland Hill", "Short hair, red uniform with a white ribbon in front, General Sir Rowland Hill", "White hair, holding a staff, wearing a green uniform, holding a sword and with a red ribbon in front, Marshal Jourdan", "White hair, holding a staff, wearing a black uniform, holding a sword and with a red ribbon in front, Marshal Jourdan"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Um7wmlfwgWA_1", "video_path": "Um7wmlfwgWA.mp4", "subtitle_path": "Um7wmlfwgWA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.12, "view_count": 1152436}, {"video_id": "Um7wmlfwgWA", "question": "On a blue, red, and gray mixed map with some place names and pictures annotated, there is also a white arrow. 
After mentioning 'Counterattacks to relieve the French garrisons at Pamplona,' who is the first character to appear?", "question_wo_referring_query": "Who is the first character to appear?", "candidates": ["The soldier in red and white uniform lying on the ground", "The character in blue clothing riding a horse and pointing forward", "The soldier in yellow and white uniform lying on the ground", "The character in black uniform, white pants, holding a saber and wearing white gloves", "The soldier in green and white uniform lying on the ground"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Um7wmlfwgWA_2", "video_path": "Um7wmlfwgWA.mp4", "subtitle_path": "Um7wmlfwgWA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.12, "view_count": 1152436}, {"video_id": "oTCh0eWhZ6o", "question": "In a PPT with a white background and a black lettered title, there is a diagram with a circle inside containing the number 12345. On the right side of the PPT, there are two frames showing two women: one with black straight long hair, wearing a grey coat; the other with black-rimmed glasses, wearing a black uniform. What is the first object that appears when mentioning: 'different types so the first type is'?", "question_wo_referring_query": ", what is the first object that appears?", "candidates": ["Spider", "Green plant", "Yellow circle", "Number annotation", "Yellow frame"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "oTCh0eWhZ6o_0", "video_path": "oTCh0eWhZ6o.mp4", "subtitle_path": "oTCh0eWhZ6o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.47, "view_count": 86532}, {"video_id": "oTCh0eWhZ6o", "question": "In the PPT with a white background, there is some text, two arrows, and two boxes with red and blue lines. On the right side of the PPT, there are two split screens showing two women. 
One woman has long straight black hair, wearing a gray coat, and the other woman is wearing black-rimmed glasses and a black coat. After the phrase 'consists of one metal and one nonmetal' is mentioned, which object appears for the first time?", "question_wo_referring_query": ", which object appears for the first time?", "candidates": ["Sulfur", "Yellow border", "Number annotation", "Green plant", "Colorful square"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "oTCh0eWhZ6o_1", "video_path": "oTCh0eWhZ6o.mp4", "subtitle_path": "oTCh0eWhZ6o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.47, "view_count": 86532}, {"video_id": "oTCh0eWhZ6o", "question": "In a PowerPoint slide with a white background, there is an annotation with the numbers 123456, followed by some chemical formulas. In the upper left and lower right corners of a light olive-green border, there are two Snapchat images. When the phrase 'out what type of reaction each one is' is mentioned, which object appears first?", "question_wo_referring_query": "In a PowerPoint slide with a white background, there is an annotation with the numbers 123456, followed by some chemical formulas. In the upper left and lower right corners of a light olive-green border, there are two Snapchat images. 
When the phrase 'out what type of reaction each one is' is mentioned, which object appears first?", "candidates": ["White shelf", "Blackboard", "An annotated shape with the number 6 and the text 'Acid-Base Neutralization'", "Black pen", "Pink border"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "oTCh0eWhZ6o_2", "video_path": "oTCh0eWhZ6o.mp4", "subtitle_path": "oTCh0eWhZ6o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.47, "view_count": 86532}, {"video_id": "KWVSm6XHm7Q", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene of the two taking their luggage is shown, then a scene of washing a quilt in the washing machine is shown, and finally a girl with purple hair wearing a yellow suspenders and glasses, and a boy with black hair wearing a white T-shirt and glasses are talking in the car.", "First, a scene of the two taking their luggage is shown, then a girl with purple hair wearing a yellow suspenders and glasses, and a boy with black hair wearing a white T-shirt and glasses are talking in the car, and finally a scene of washing a quilt in the washing machine is shown.", "First, a scene of washing a quilt in the washing machine is shown, then a scene of the two taking their luggage is shown, and finally a girl with purple hair wearing a yellow suspenders and glasses, and a boy with black hair wearing a white T-shirt and glasses are talking in the car.", "First, a girl with purple hair wearing a yellow suspenders and glasses, and a boy with black hair wearing a white T-shirt and glasses are talking in the car, then a scene of the two taking their luggage is shown, and finally a scene of washing a quilt in the washing machine is shown.", "First, a girl with purple hair wearing a yellow suspenders and glasses, and a boy with black hair 
wearing a white T-shirt and glasses are talking in the car, then a scene of washing a quilt in the washing machine is shown, and finally a scene of the two taking their luggage is shown."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "KWVSm6XHm7Q_0", "video_path": "KWVSm6XHm7Q.mp4", "subtitle_path": "KWVSm6XHm7Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1297.2, "view_count": 715643}, {"video_id": "KWVSm6XHm7Q", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["A scene of a white kitchen cabinet is shown, followed by a woman with black and yellow hair wearing glasses and a blue fluffy sweater, and then a scene of the woman in the blue fluffy sweater standing in front of a kitchen sink.", "A scene of the woman in the blue fluffy sweater standing in front of a kitchen sink, followed by a scene of a white kitchen cabinet, and then a woman with black and yellow hair wearing glasses and a blue fluffy sweater is shown.", "A scene of a white kitchen cabinet is shown, followed by a scene of the woman in the blue fluffy sweater standing in front of a kitchen sink, and then a woman with black and yellow hair wearing glasses and a blue fluffy sweater is shown.", "First, a woman with black and yellow hair wearing glasses and a blue fluffy sweater is shown, followed by a scene of a white kitchen cabinet, and finally a scene of the woman in the blue fluffy sweater standing in front of a kitchen sink.", "First, a woman with black and yellow hair wearing glasses and a blue fluffy sweater is shown, followed by a scene of the woman in the blue fluffy sweater standing in front of a kitchen sink, and finally a scene of a white kitchen cabinet."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": 
"KWVSm6XHm7Q_1", "video_path": "KWVSm6XHm7Q.mp4", "subtitle_path": "KWVSm6XHm7Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1297.2, "view_count": 715643}, {"video_id": "KWVSm6XHm7Q", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene of a woman in front of a computer is shown. Then, a scene of a woman in a blue fur coat standing in front of a wardrobe is shown. Finally, a scene of a mystery box next to a green plant is shown.", "First, a scene of a woman in front of a computer is shown. Then, a scene of a mystery box next to a green plant is shown. Finally, a scene of a woman in a blue fur coat standing in front of a wardrobe is shown.", "First, a scene of a woman in a blue fur coat standing in front of a wardrobe is shown. Then, a scene of a mystery box next to a green plant is shown. Finally, a scene of a woman in front of a computer is shown.", "First, a scene of a woman in a blue fur coat standing in front of a wardrobe is shown. Then, a scene of a woman in front of a computer is shown. Finally, a scene of a mystery box next to a green plant is shown.", "First, a scene of a mystery box next to a green plant is shown. Then, a scene of a woman in a blue fur coat standing in front of a wardrobe is shown. Finally, a scene of a woman in front of a computer is shown."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "KWVSm6XHm7Q_2", "video_path": "KWVSm6XHm7Q.mp4", "subtitle_path": "KWVSm6XHm7Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1297.2, "view_count": 715643}, {"video_id": "Wvlnr0qjHg4", "question": "A woman is wearing a bright red down jacket. She has dark skin and is holding snowballs in her black-gloved hands. 
Behind her is a small cart piled with snow on the top and in the cargo area. There are orange and white buildings in the background. In which other scenes has this woman appeared?", "question_wo_referring_query": "In which other scenes has this woman appeared?", "candidates": ["On a snow-covered road where police cars are driving", "On a news broadcast with a green screen background", "In a snowy area with people wearing orange down jackets and blue hats", "On a snow-covered ground surrounded by white buildings and many tree branches", "In front of a building with a lot of people and thick smoke billowing"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Wvlnr0qjHg4_0", "video_path": "Wvlnr0qjHg4.mp4", "subtitle_path": "Wvlnr0qjHg4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3084.22, "view_count": 5208}, {"video_id": "Wvlnr0qjHg4", "question": "In a split-screen image, the left side shows a picture from the international version of TikTok, and on the right side is a man wearing glasses and a suit. He has black hair and is wearing a black tie. There is a subtitle bar below with a white background. 
In which other scene does this man appear?", "question_wo_referring_query": "In which other scene does this man appear?", "candidates": ["In an airport at night.", "In a news broadcast with a red and blue background.", "In a scene with a tree branch and a blonde woman.", "In a place with a European-style yellow building and blue sky with white clouds.", "In a hall where a man with white hair wearing a suit is speaking."], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Wvlnr0qjHg4_1", "video_path": "Wvlnr0qjHg4.mp4", "subtitle_path": "Wvlnr0qjHg4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3084.22, "view_count": 5208}, {"video_id": "Wvlnr0qjHg4", "question": "A woman with a yellow headscarf and a plaid jacket over a white high-neck shirt is sitting in front of a mirror talking. The background behind her is very blurry, and there is a white background subtitle strip below the mirror. In which scene does this woman also appear?", "question_wo_referring_query": "In which scene does this woman also appear?", "candidates": ["In a room with white strip lights, surrounded by instruments and people wearing white coats", "In a white room with a lamp and books", "On a red postage stamp", "In a place with a red and white building and a big tree", "In a painting of flowers and birds"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "Wvlnr0qjHg4_2", "video_path": "Wvlnr0qjHg4.mp4", "subtitle_path": "Wvlnr0qjHg4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3084.22, "view_count": 5208}, {"video_id": "zWz6ne-FgQk", "question": "A man with very short hair is standing next to a brown tank. He is wearing a brown jacket, a gray shirt, and jeans. Around him, there is a white wall and a red bar. 
With which subtitles does this man appear at the same time?", "question_wo_referring_query": "With which subtitles does this man appear at the same time?", "candidates": ["weapon. It had good tactical mobility", "can be immediately brought into action and is", "although not the perfect weapon to engage", "Hence, the artillery design office noted in September", "but it is a bit more complicated. Because Jentz and Doyle cite a report that seems"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "zWz6ne-FgQk_0", "video_path": "zWz6ne-FgQk.mp4", "subtitle_path": "zWz6ne-FgQk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.64, "view_count": 294255}, {"video_id": "zWz6ne-FgQk", "question": "In a PowerPoint with an army green background, there is a tan-colored tank graphic in the upper right corner. With which subtitles did this tank graphic appear simultaneously?", "question_wo_referring_query": "With which subtitles did this tank graphic appear simultaneously?", "candidates": ["I am not entirely sure. I have seen another", "the sounded great on paper, their feasibility and effectiveness in real life was limited.", "It could also be that this was the amount of", "Let us take a look at the firepower and protection", "shots in 7 bins. But in the technical data overview"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "zWz6ne-FgQk_1", "video_path": "zWz6ne-FgQk.mp4", "subtitle_path": "zWz6ne-FgQk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.64, "view_count": 294255}, {"video_id": "zWz6ne-FgQk", "question": "In the PPT with an army green background, there are six circles, each containing various white icons. These include a lever, an exclamation mark, gears, and a magnifying glass, an X, a QR code, as well as some human figures. 
Which captions appeared together with the circles containing the gears and magnifying glass icons?", "question_wo_referring_query": ", which captions appeared together with the circles containing the gears and magnifying glass icons?", "candidates": ["Yet, the poor protection for the crews resulted in", "But, couldn\u2019t the same objective have been met simply by placing the 7.5 cm Pak 40 with is", "In March 1944, the report", "In November it was then noted that design changes were necessary:", "It should be possible to achieve all this with"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "zWz6ne-FgQk_2", "video_path": "zWz6ne-FgQk.mp4", "subtitle_path": "zWz6ne-FgQk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.64, "view_count": 294255}, {"video_id": "LejFaCncuc4", "question": "The image shows an oil painting of a man in a red robe being assassinated. A woman in a yellow dress is holding a sword and cutting his throat. The man is lying on a bed with bloodstains underneath him. The painting has a black background and is placed beside another painting. When the scene is marked with GENTILESCHI's signature, what change occurs?", "question_wo_referring_query": "What change occurs?", "candidates": ["The woman's yellow dress turns blue.", "The image shrinks.", "The image enlarges.", "The target of the woman's assassination changes.", "The background changes to white."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "LejFaCncuc4_0", "video_path": "LejFaCncuc4.mp4", "subtitle_path": "LejFaCncuc4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 908.38, "view_count": 312752}, {"video_id": "LejFaCncuc4", "question": "The screen shows an oil painting. On the left side of the painting, there is a man floating while holding a woman. In the middle, a naked red-haired woman stands on a shell. 
On the right, a woman in a floral dress holding a piece of cloth seems to be about to cover the naked woman. What changed in the painting when it appeared in Mona Lisa's drawing?", "question_wo_referring_query": "What changed?", "candidates": ["The background changed to yellow", "The screen disappeared", "The screen shrank", "The background changed to blue", "The screen enlarged"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "LejFaCncuc4_1", "video_path": "LejFaCncuc4.mp4", "subtitle_path": "LejFaCncuc4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 908.38, "view_count": 312752}, {"video_id": "LejFaCncuc4", "question": "The screen shows an oil painting where a man wearing a red cap is being assassinated by a woman in a yellow dress. The woman is holding a sword and cutting the man's throat. The man is lying on the bed, with bloodstains underneath him. There is also a white arrow pointing towards the man on the screen. When this painting appears with a black background and subtitles that read ARTEMISIA GENTILESCHI, what changes occur?", "question_wo_referring_query": "What changes occur?", "candidates": ["The screen zooms in.", "The background changes to blue.", "The woman's dress changes from yellow to blue.", "The background changes to white.", "The screen tilts."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "LejFaCncuc4_2", "video_path": "LejFaCncuc4.mp4", "subtitle_path": "LejFaCncuc4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 908.38, "view_count": 312752}, {"video_id": "lA3m7fzxrJo", "question": "There is a small airplane parked inside a room. In front of the airplane are two men, one wearing a black suit and the other wearing a burgundy suit with gray pants and glasses. The glasses have blue background stripes on both sides of the frames. 
What changes occur to the people in the scene when 'influenced everything let's say the the' is mentioned?", "question_wo_referring_query": "What changes occur to the people in the scene?", "candidates": ["The two men hug each other.", "The man in the burgundy shirt raises both hands.", "The two men shake hands.", "The man in the suit picks up the receiver.", "The man in the burgundy shirt raises one hand."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "lA3m7fzxrJo_0", "video_path": "lA3m7fzxrJo.mp4", "subtitle_path": "lA3m7fzxrJo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.24, "view_count": 63972}, {"video_id": "lA3m7fzxrJo", "question": "In the scene, there is a small aircraft parked inside a room. Two men are standing in front of the aircraft, one dressed in a black suit and the other in a crimson suit with gray pants and wearing glasses. The two men are engaged in a face-to-face conversation. When the phrase 'the U.S. which hadn't been occupied by' is mentioned, what change occurs to the characters in the scene?", "question_wo_referring_query": "In the scene, there is a small aircraft parked inside a room. Two men are standing in front of the aircraft, one dressed in a black suit and the other in a crimson suit with gray pants and wearing glasses. The two men are engaged in a face-to-face conversation. When the phrase 'the U.S. 
which hadn't been occupied by' is mentioned, what change occurs to the characters in the scene?", "candidates": ["The man in the suit is holding a paper cup", "The man in the suit puts on a pair of sunglasses", "The man in the crimson suit is holding a water cup", "The man in the crimson suit raises one hand", "The man in the crimson suit raises both hands"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "lA3m7fzxrJo_1", "video_path": "lA3m7fzxrJo.mp4", "subtitle_path": "lA3m7fzxrJo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.24, "view_count": 63972}, {"video_id": "lA3m7fzxrJo", "question": "In the scene, there is a small airplane parked inside a room. Two men are standing in front of the airplane; one is wearing a black suit, and the other is wearing a maroon suit with gray pants and glasses. The two men are having a conversation. When the phrase 'officers I'm going to report it like I' is mentioned, what change occurs?", "question_wo_referring_query": "what change occurs?", "candidates": ["The man in the suit picks up the receiver", "The two men hug", "The man in the maroon shirt raises one hand", "The man in the maroon shirt raises both hands", "The man in the suit raises both hands"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "lA3m7fzxrJo_2", "video_path": "lA3m7fzxrJo.mp4", "subtitle_path": "lA3m7fzxrJo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.24, "view_count": 63972}, {"video_id": "had5PfhuVpc", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, on a street, with pink tables on the left displaying various designs, a group of people inside the tables eating, in front of them stands a person with long black hair wearing black clothes, a 
house on the right side with a white car in front of it, then in a wooden shed with transparent light above, a pink table with different designs, a long-haired woman in white long sleeves sitting on a black stool, beside a yellow stool, in front of her is a short-haired man with glasses in black clothes, finally under a red canopy stands a short-haired man in black long sleeves holding a black square object with yellow and black designs surrounded by people, with a hand in a palm-colored sleeve stretched out to him.", "First, in a wooden shed with transparent light above, a pink table with different designs, a long-haired woman in white long sleeves sitting on a black stool, beside a yellow stool, in front of her is a short-haired man with glasses in black clothes, then under a red canopy stands a short-haired man in black long sleeves holding a black square object with yellow and black designs surrounded by people, with a hand in a palm-colored sleeve stretched out to him, finally on a street, with pink tables on the left displaying various designs, a group of people inside the tables eating, in front of them stands a person with long black hair wearing black clothes, a house on the right side with a white car in front of it.", "First, on a street, with pink tables on the left displaying various designs, a group of people inside the tables eating, in front of them stands a person with long black hair wearing black clothes, a house on the right side with a white car in front of it, then under a red canopy stands a short-haired man in black long sleeves holding a black square object with yellow and black designs surrounded by people, with a hand in a palm-colored sleeve stretched out to him, finally in a wooden shed with transparent light above, a pink table with different designs, a long-haired woman in white long sleeves sitting on a black stool, beside a yellow stool, in front of her is a short-haired man with glasses in black clothes.", "First under a red canopy 
stands a short-haired man in black long sleeves holding a black square object with yellow and black designs surrounded by people, with a hand in a palm-colored sleeve stretched out to him, then on a street, with pink tables on the left displaying various designs, a group of people inside the tables eating, in front of them stands a person with long black hair wearing black clothes, a house on the right side with a white car in front of it, finally in a wooden shed with transparent light above, a pink table with different designs, a long-haired woman in white long sleeves sitting on a black stool, beside a yellow stool, in front of her is a short-haired man with glasses in black clothes.", "First, in a wooden shed with transparent light above, a pink table with different designs, a long-haired woman in white long sleeves sitting on a black stool, beside a yellow stool, in front of her is a short-haired man with glasses in black clothes, then on a street, with pink tables on the left displaying various designs, a group of people inside the tables eating, in front of them stands a person with long black hair wearing black clothes, a house on the right side with a white car in front of it, finally under a red canopy stands a short-haired man in black long sleeves holding a black square object with yellow and black designs surrounded by people, with a hand in a palm-colored sleeve stretched out to him."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "had5PfhuVpc_0", "video_path": "had5PfhuVpc.mp4", "subtitle_path": "had5PfhuVpc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1123.46, "view_count": 51326}, {"video_id": "had5PfhuVpc", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["First, in a room with white walls, 4 people are sitting around a 
table, including 3 men and 1 woman. Various items are placed behind them. Next, under a white sky, there are 4 women standing in front of the screen. One person is wearing yellow clothes, a blue mask, and holding a piece of paper. The others are all wearing black clothes. One woman in black with glasses passes the paper to another woman in black. Behind them are different houses. Finally, in another room with white walls, on the left side there is a yellow table with 3 people sitting beside it. One person has short hair and is wearing an olive long-sleeve shirt and black pants. Behind him, there is a long-haired woman in a blue shirt and a short-haired man in a black coat. Various items are placed behind them.", "First, under a white sky, there are 4 women standing in front of the screen. One person is wearing yellow clothes, a blue mask, and holding a piece of paper. The others are all wearing black clothes. One woman in black with glasses passes the paper to another woman in black. Behind them are different houses. Then, in a room with white walls, on the left side is a yellow table with 3 people sitting beside it. One person has short hair and is wearing an olive long-sleeve shirt and black pants. Behind him, there is a long-haired woman in a blue shirt and a short-haired man in a black coat. There are various items behind them. Finally, in another white-walled room, 4 people are sitting around a table, including 3 men and 1 woman. Various items are placed behind them.", "First, in a room with white walls, on the left side there is a yellow table with 3 people sitting beside it. One person has short hair and is wearing an olive long-sleeve shirt and black pants. Behind him, there is a long-haired woman in a blue shirt and a short-haired man in a black coat. Various items are placed behind them. Then, in another white-walled room, 4 people are sitting around a table, including 3 men and 1 woman. Various items are placed behind them. 
Lastly, under a white sky, there are 4 women standing in front of the screen. One person is wearing yellow clothes, a blue mask, and holding a piece of paper. The others are all wearing black clothes. One woman in black with glasses passes the paper to another woman in black. Different houses are behind them.", "First, under a white sky, there are 4 women standing in front of the screen. One person is wearing yellow clothes, a blue mask, and holding a piece of paper. The others are all wearing black clothes. One woman in black with glasses passes the paper to another woman in black. Behind them are different houses. Next, in a room with white walls, 4 people are sitting around a table, including 3 men and 1 woman. Various items are placed behind them. Finally, in another room with white walls, on the left side there is a yellow table with 3 people sitting beside it. One person has short hair and is wearing an olive long-sleeve shirt and black pants. Behind him, there is a long-haired woman in a blue shirt and a short-haired man in a black coat. Various items are placed behind them.", "First, in a room with white walls, on the left side there is a yellow table with 3 people sitting beside it. One person has short hair and is wearing an olive long-sleeve shirt and black pants. Behind him, there is a long-haired woman in a blue shirt and a short-haired man in a black coat. Various items are placed behind them. Next, under a white sky, there are 4 women standing in front of the screen. One person is wearing yellow clothes, a blue mask, and holding a piece of paper. The others are all wearing black clothes. One woman in black with glasses passes the paper to another woman in black. Different houses are behind them. Then finally, in another white-walled room, 4 people are sitting around a table, including 3 men and 1 woman. 
Various items are placed behind them."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "had5PfhuVpc_1", "video_path": "had5PfhuVpc.mp4", "subtitle_path": "had5PfhuVpc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1123.46, "view_count": 51326}, {"video_id": "had5PfhuVpc", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a man and a woman are standing in the kitchen; the woman on the right has long hair and is wearing a gray long-sleeve shirt and an olive skirt, while the man on the left has short hair and is wearing a black long-sleeve shirt. Behind them, there are different food ingredients. Then on a white table, a person in a black long-sleeve shirt holds a white bowl containing different food ingredients. Behind the bowl, there is a silver table with different kitchen utensils on it. Finally, in a room with white walls, a woman wearing a yellow hat, a gray long-sleeve shirt, and an olive skirt is standing in the middle. Behind her, there are various kinds of furniture.", "First in a room with white walls, there is a woman wearing a yellow hat, a gray long-sleeve shirt, and an olive skirt standing in the middle. Behind her, there are various kinds of furniture. Then on a white table, a person in a black long-sleeve shirt holds a white bowl containing different food ingredients. Behind the bowl, there is a silver table with different kitchen utensils on it. Finally, a man and a woman are standing in the kitchen; the woman on the right has long hair and is wearing a gray long-sleeve shirt and an olive skirt, while the man on the left has short hair and is wearing a black long-sleeve shirt. 
Behind them, there are different food ingredients.", "First, in a room with white walls, there is a woman wearing a yellow hat, a gray long-sleeve shirt, and an olive skirt standing in the middle. Behind her, there are various kinds of furniture. Then a man and a woman are standing in the kitchen; the woman on the right has long hair and is wearing a gray long-sleeve shirt and an olive skirt, while the man on the left has short hair and is wearing a black long-sleeve shirt. Behind them, there are different food ingredients. Finally, on a white table, a person in a black long-sleeve shirt holds a white bowl containing different food ingredients. Behind the bowl, there is a silver table with different kitchen utensils on it.", "First, a man and a woman are standing in the kitchen; the woman on the right has long hair and is wearing a gray long-sleeve shirt and an olive skirt, while the man on the left has short hair and is wearing a black long-sleeve shirt. Behind them, there are different food ingredients. Then, in a room with white walls, a woman wearing a yellow hat, a gray long-sleeve shirt, and an olive skirt is standing in the middle. Behind her, there are various kinds of furniture. Finally, on a white table, a person in a black long-sleeve shirt holds a white bowl containing different food ingredients. Behind the bowl, there is a silver table with different kitchen utensils on it.", "First, on a white table, a person in a black long-sleeve shirt holds a white bowl containing different food ingredients. Behind the bowl, there is a silver table with different kitchen utensils on it. Then, in a room with white walls, there is a woman wearing a yellow hat, a gray long-sleeve shirt, and an olive skirt standing in the middle. Behind her, there are various kinds of furniture. 
Finally, a man and a woman are standing in the kitchen; the woman on the right has long hair and is wearing a gray long-sleeve shirt and an olive skirt, and the man on the left has short hair and is wearing a black long-sleeve shirt. Behind them, there are different food ingredients."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "had5PfhuVpc_2", "video_path": "had5PfhuVpc.mp4", "subtitle_path": "had5PfhuVpc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1123.46, "view_count": 51326}, {"video_id": "rvr143crpuU", "question": "In a screen with a white background, there are rectangular shapes of varying sizes and lengths, all of the same color, in the middle of the screen. There are numbers to the left of the longest rectangle, and a string of English letters under each rectangle. There are also English letters on the rectangles in the middle and on the right. The central rectangle has an irregular circle outlined by blue and red stripes, with 3 blue arrows pointing to it. In which scenes does this irregular circle outlined by red stripes appear?", "question_wo_referring_query": "In which scenes does this irregular circle outlined by red stripes appear?", "candidates": ["On a yellow-background English letter loss", "On a white-background English letter Losers", "On a white-background English letter Focusing", "On a blue-background English letter loss", "On a white-background English letter Deep"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "rvr143crpuU_0", "video_path": "rvr143crpuU.mp4", "subtitle_path": "rvr143crpuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1509.38, "view_count": 2666}, {"video_id": "rvr143crpuU", "question": "In a scene with a white background, there is a dark blue text made up of English letters and numbers at the top left corner. 
Below it, there is English text in varying degrees of thickness, and on the right side, there is also English text in different shades of thickness. Some of the English letters are highlighted in yellow. In which scenarios does this yellow color appear?", "question_wo_referring_query": "In which scenarios does this yellow color appear?", "candidates": ["On the blue lines of the statistical chart", "On the number 20 in the lower part of the statistical chart", "On the number 1 in the lower part of the statistical chart", "On the number 10 in the lower part of the statistical chart", "On the number 33 in the statistical chart"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "rvr143crpuU_1", "video_path": "rvr143crpuU.mp4", "subtitle_path": "rvr143crpuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1509.38, "view_count": 2666}, {"video_id": "rvr143crpuU", "question": "In a white background screen, there are 8 different images arranged in two rows in the middle of the screen, with a string of English letters below each image. There are also strings of black English letters in the top-left and bottom-left corners of the screen, with some letters highlighted in yellow. In one of the images, there is a green circle formed by lines, and another image is pointed at by a green arrow. 
In which scenarios has this green appeared?", "question_wo_referring_query": "In which scenarios has this green appeared?", "candidates": ["On the lines in the rectangle statistics chart", "Above the symbol '%' in the rectangle statistics chart", "Above the English letters 'Err' in the rectangle statistics chart", "Above the number 100 in the rectangle statistics chart", "Above the English letters 'max' in the rectangle statistics chart"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "rvr143crpuU_2", "video_path": "rvr143crpuU.mp4", "subtitle_path": "rvr143crpuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1509.38, "view_count": 2666}, {"video_id": "Ng79KM6F0cA", "question": "In a background with white paper and text on it, there are two different drawings on each side of the screen. Among these drawings, there are black and white English letters, with some white letters smudged with green and red. In which scenes do these different drawings and subtitles appear together?", "question_wo_referring_query": "In which scenes do these different drawings and subtitles appear together?", "candidates": ["towns they were in quite remote rural", "bear-headed what's that about there's so", "the infinity appears that the tournament", "England by any means and myself and", "it's just such a pleasure to be here"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "Ng79KM6F0cA_0", "video_path": "Ng79KM6F0cA.mp4", "subtitle_path": "Ng79KM6F0cA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1749.96, "view_count": 8373}, {"video_id": "Ng79KM6F0cA", "question": "A black woman with short black hair is standing, she has a DELL laptop and a black phone. 
Which captions have this phone appeared with?", "question_wo_referring_query": "Which captions have this phone appeared with?", "candidates": ["um so handily the Leicester records also", "Africans do appear in the Royal", "says that in in November 1554 there was", "known that because of this record from a", "and the English because the English"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "Ng79KM6F0cA_1", "video_path": "Ng79KM6F0cA.mp4", "subtitle_path": "Ng79KM6F0cA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1749.96, "view_count": 8373}, {"video_id": "Ng79KM6F0cA", "question": "In the screen, there is a man with short hair, wearing a purple suit. In front of him, there is a glass lectern and a black microphone. What subtitles have co-existed with the glasses this man is wearing?", "question_wo_referring_query": "What subtitles have co-existed with the glasses this man is wearing?", "candidates": ["imagine the black Tudor and I see it as", "and this um Sharon Foster this is uh", "He said passed away 50 it's 40", "imagine JOHN blank as", "so they've given it some thoughts I"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "Ng79KM6F0cA_2", "video_path": "Ng79KM6F0cA.mp4", "subtitle_path": "Ng79KM6F0cA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1749.96, "view_count": 8373}, {"video_id": "1StypUbR_YA", "question": "Under the dark sky, there's a slope covered with plants on the left side of the screen. The slope is blanketed in white snow. In the snowy area, a man is wearing a rainbow-colored feather coat and yellow-green gloves and is handling a snow-covered log. 
When the scene switches to the man fishing by the lake, what changes occur to his hands?", "question_wo_referring_query": "What changes occur to his hands?", "candidates": ["The yellow-green gloves on his hands are no longer there.", "The yellow-green gloves on his hands turned white.", "The yellow-green gloves on his hands turned black.", "The yellow-green gloves on his hands turned brown."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "1StypUbR_YA_0", "video_path": "1StypUbR_YA.mp4", "subtitle_path": "1StypUbR_YA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.13, "view_count": 1171232}, {"video_id": "1StypUbR_YA", "question": "On a white snowy ground, there is a log cabin in the distance. A man, facing the camera, wearing yellow-green gloves and holding an axe, is chopping wood. When the scene cuts to the axe chopping into the log, what happens to the standing log in the video?", "question_wo_referring_query": ", what happens to the standing log in the video?", "candidates": ["Set on fire", "Chopped into two halves", "Placed on the snowy ground", "Thrown into the air", "Picked up by someone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "1StypUbR_YA_1", "video_path": "1StypUbR_YA.mp4", "subtitle_path": "1StypUbR_YA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.13, "view_count": 1171232}, {"video_id": "1StypUbR_YA", "question": "A person wearing black leather boots is crouching on the white snow, a live fish is moving on the snow, then the scene changes to a cutting board with spices, what happened to the fish when someone is holding the fish and a pair of hands? 
How did the fish change?", "question_wo_referring_query": "How did the fish change?", "candidates": ["The fish's body was coated with olive oil", "The fish's body was cut open in several places", "The fish's body was coated with a layer of yellow oil", "The fish's body was coated with a layer of chili oil", "The live fish was grilled"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "1StypUbR_YA_2", "video_path": "1StypUbR_YA.mp4", "subtitle_path": "1StypUbR_YA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.13, "view_count": 1171232}, {"video_id": "eP45XoKpNGI", "question": "In the living room, three people are sitting on the sofa, a woman wearing green clothes is standing, holding a blue bag. Among the following items, which one appears in the video?", "question_wo_referring_query": "Among the following items, which one appears?", "candidates": ["Laptop", "Guitar", "Apple", "Mobile Phone", "Frame Drum"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "eP45XoKpNGI_0", "video_path": "eP45XoKpNGI.mp4", "subtitle_path": "eP45XoKpNGI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.47, "view_count": 92439}, {"video_id": "eP45XoKpNGI", "question": "By the urgent riverbank, there is an animal standing. Behind it, there is a building with 3 pillars at its base. 
Which of the following animals has appeared?", "question_wo_referring_query": "Which of the following animals has appeared?", "candidates": ["Monkey", "Catfish", "Bird", "Turtle", "Hippo"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "eP45XoKpNGI_1", "video_path": "eP45XoKpNGI.mp4", "subtitle_path": "eP45XoKpNGI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.47, "view_count": 92439}, {"video_id": "eP45XoKpNGI", "question": "In the living room, an elderly person with white hair is holding a painting with both hands, while another elderly person dressed in pink is lying on the sofa supporting their head. Which of the following items has appeared in the living room?", "question_wo_referring_query": "Which of the following items has appeared in the living room?", "candidates": ["Guitar", "Notebook computer", "Black doll", "White doll", "Rack drum"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "eP45XoKpNGI_2", "video_path": "eP45XoKpNGI.mp4", "subtitle_path": "eP45XoKpNGI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.47, "view_count": 92439}, {"video_id": "pcgVe9Oo-N0", "question": "A man wearing white clothes is sitting in front of two black speakers. Behind him is a wooden wall with two paintings. 
When the subtitle 'prevent the Japanese from dumping' appears, which of the following objects is present?", "question_wo_referring_query": "Which of the following objects is present?", "candidates": ["guitar", "hat", "laptop", "apple", "mobile phone"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "pcgVe9Oo-N0_0", "video_path": "pcgVe9Oo-N0.mp4", "subtitle_path": "pcgVe9Oo-N0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.64, "view_count": 153865}, {"video_id": "pcgVe9Oo-N0", "question": "Four people are sitting on white chairs. One of them, a man wearing a floral shirt, is holding a phone in one hand and has his other hand on his face. When the subtitle 'funds and even the central bank to give' appears, which of the following objects appears?", "question_wo_referring_query": "Which of the following objects appears?", "candidates": ["Laptop", "Frame drum", "Potato chips", "Pen", "A painting"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "pcgVe9Oo-N0_1", "video_path": "pcgVe9Oo-N0.mp4", "subtitle_path": "pcgVe9Oo-N0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.64, "view_count": 153865}, {"video_id": "pcgVe9Oo-N0", "question": "A person wearing a grey short-sleeved shirt, with one arm wrapped with several black bands, is holding a white cane. 
When the subtitle 'future growth potential' appears, which of the following objects can be seen?", "question_wo_referring_query": "Which of the following objects can be seen?", "candidates": ["a drum kit", "a framed picture", "a bottle of water", "a hat", "an apple"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "pcgVe9Oo-N0_2", "video_path": "pcgVe9Oo-N0.mp4", "subtitle_path": "pcgVe9Oo-N0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.64, "view_count": 153865}, {"video_id": "6iFTwmZGKIw", "question": "A person wearing a red floral dress, holding an egg in one hand, placing the other hand on a glass container, what did the person holding the egg do when they first appeared?", "question_wo_referring_query": "What did the person holding the egg do when they first appeared?", "candidates": ["Peel the banana", "Peel the apple", "Eat the egg", "Cut the watermelon", "Knock the egg"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "6iFTwmZGKIw_0", "video_path": "6iFTwmZGKIw.mp4", "subtitle_path": "6iFTwmZGKIw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.32, "view_count": 462387}, {"video_id": "6iFTwmZGKIw", "question": "A person is holding a tool in one hand and a potato in the other hand, with a cutting board beneath both hands. Next to the cutting board are two peeled potatoes. 
What did the person do the first time the left hand held the potato and the right hand held the peeler?", "question_wo_referring_query": "What did the person do the first time the left hand held the potato and the right hand held the peeler?", "candidates": ["Peel a banana", "Shred a carrot", "Cut a pumpkin", "Shred a potato", "Peel an apple"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "6iFTwmZGKIw_1", "video_path": "6iFTwmZGKIw.mp4", "subtitle_path": "6iFTwmZGKIw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.32, "view_count": 462387}, {"video_id": "6iFTwmZGKIw", "question": "On the cutting board, there is a scallion. The scallion is pressed down by one hand on the cutting board, and another hand is holding a knife over the scallion. What did the person with the knife and scallion do first?", "question_wo_referring_query": "What did the person with the knife and scallion do first?", "candidates": ["Peel the apple", "Peel the banana", "Cut the watermelon", "Cut the scallion", "Cut the apple"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "6iFTwmZGKIw_2", "video_path": "6iFTwmZGKIw.mp4", "subtitle_path": "6iFTwmZGKIw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.32, "view_count": 462387}, {"video_id": "zFwPRaujr0o", "question": "A man wearing white clothes stands in front of a door, with both hands clasped together, holding a black hat. A woman wearing a dress has one hand on the door. 
When the subtitle mentions 'good afternoon ron hi justine did you,' what happened?", "question_wo_referring_query": "What happened?", "candidates": ["A man wearing black boots opens the door", "A woman in a blue dress turns on the light", "A woman in a blue dress lights a candle", "A man wearing black boots turns on the light", "A woman in a blue dress opens the door"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "zFwPRaujr0o_0", "video_path": "zFwPRaujr0o.mp4", "subtitle_path": "zFwPRaujr0o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.68, "view_count": 874686}, {"video_id": "zFwPRaujr0o", "question": "On the red brick wall with some tools leaning against it, a man and a woman are standing around a countertop arranging ingredients. On the countertop, there are chunks of carrots and halved onions. There are also some parsley leaves on the edge of the countertop. When the subtitle 'this is a lot better than' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen when the subtitle 'this is a lot better than' appears?", "candidates": ["The man is cutting meat", "The man picks up a carrot with his left hand", "The man picks up some parsley with his left hand", "The man picks up a piece of meat with his left hand", "The man picks up an onion with his left hand"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "zFwPRaujr0o_1", "video_path": "zFwPRaujr0o.mp4", "subtitle_path": "zFwPRaujr0o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.68, "view_count": 874686}, {"video_id": "zFwPRaujr0o", "question": "A pot is being hung, and a woman wearing a dress is holding a cup with red liquid inside. 
When the subtitle mentions 'and port wine is there any of that wine,' what happens?", "question_wo_referring_query": "what happens?", "candidates": ["Yellow liquid is added to the pot", "Red liquid is added to the pot", "Apple is added to the pot", "Banana is added to the pot", "Green liquid is added to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "zFwPRaujr0o_2", "video_path": "zFwPRaujr0o.mp4", "subtitle_path": "zFwPRaujr0o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.68, "view_count": 874686}, {"video_id": "igS2Wy8ur5U", "question": "In front of a green background, a man wearing sunglasses and a blue jacket is sitting down. In front of him is a black microphone. What does this man do after clasping one hand with the other?", "question_wo_referring_query": "What does this man do after clasping one hand with the other?", "candidates": ["Claps", "Spreads both hands and makes a few gestures", "Eats an apple", "Peels a banana", "Cuts a watermelon"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "igS2Wy8ur5U_0", "video_path": "igS2Wy8ur5U.mp4", "subtitle_path": "igS2Wy8ur5U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2344.4, "view_count": 29589}, {"video_id": "igS2Wy8ur5U", "question": "In front of a green background, a man wearing sunglasses and a blue jacket is sitting. In front of him is a black microphone. 
What happens to the man after he crosses his arms over his chest?", "question_wo_referring_query": "What happens to the man?", "candidates": ["The man's glasses shrink", "The man eats an apple", "The man stands up", "The man peels a banana", "The man's glasses enlarge"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "igS2Wy8ur5U_1", "video_path": "igS2Wy8ur5U.mp4", "subtitle_path": "igS2Wy8ur5U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2344.4, "view_count": 29589}, {"video_id": "igS2Wy8ur5U", "question": "In the bottom right corner of the video, there is a man wearing sunglasses and a blue overcoat. His left hand has four fingers bent, forming a fist. In the middle of the video, the text 'Future History' in black appears. What happens after the camera angle switches?", "question_wo_referring_query": "What happens after the camera angle switches?", "candidates": ["The man peels a banana", "The man changes clothes", "The man is sleeping", "The man stands up", "The man drinks water"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "igS2Wy8ur5U_2", "video_path": "igS2Wy8ur5U.mp4", "subtitle_path": "igS2Wy8ur5U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2344.4, "view_count": 29589}, {"video_id": "iT_1Gj5VPP0", "question": "On a map with yellow background and black 'KYOTO' text, what appears first when the map is shown?", "question_wo_referring_query": "What appears first?", "candidates": ["A child wearing red clothes", "A man wearing black clothes", "A woman wearing red clothes", "A man wearing red clothes", "An elderly person with white hair wearing red clothes"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "iT_1Gj5VPP0_0", "video_path": "iT_1Gj5VPP0.mp4", "subtitle_path": "iT_1Gj5VPP0_en.json", 
"duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2087.92, "view_count": 135178}, {"video_id": "iT_1Gj5VPP0", "question": "On a piece of paper, there is a sketch of a bearded man. On the left side of the sketch, there is a black inscription 'PHILIPPE BURTY'. After the image appears in the video, what is the first thing that appears?", "question_wo_referring_query": ", what is the first thing that appears?", "candidates": ["a painting of a weapon", "a painting of a woman holding a child", "a painting of a woman", "a painting of a wheel", "a painting of a child"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "iT_1Gj5VPP0_1", "video_path": "iT_1Gj5VPP0.mp4", "subtitle_path": "iT_1Gj5VPP0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2087.92, "view_count": 135178}, {"video_id": "iT_1Gj5VPP0", "question": "On a white surface, there is a white rectangular object. On top of the rectangular object, there are 5 neatly placed cylindrical objects. After these 5 cylindrical objects appear in the video, what is the first thing that appears?", "question_wo_referring_query": "What is the first thing that appears after the cylindrical objects?", "candidates": ["a book", "a person", "a painting", "a horse", "a pen"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "iT_1Gj5VPP0_2", "video_path": "iT_1Gj5VPP0.mp4", "subtitle_path": "iT_1Gj5VPP0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2087.92, "view_count": 135178}, {"video_id": "Im7juzWJr4M", "question": "There are many cars parked along the street, among them, a red car is parked next to a yellow warning sign. There are 2 black figures on the yellow warning sign. 
After the subtitle mentions 'on I'll try to make a separate video on,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Another person appears", "A new line of text appears", "An airplane appears", "Two more people appear", "Another tank appears"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Im7juzWJr4M_0", "video_path": "Im7juzWJr4M.mp4", "subtitle_path": "Im7juzWJr4M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1155.69, "view_count": 16875}, {"video_id": "Im7juzWJr4M", "question": "In the middle of the uneven stone, there is a depression, within which there is a white object. What happens on the screen before the subtitle mentions 'million years ago and the sedimentation'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A map is shown on the screen.", "It turns into an airplane.", "It turns into a car.", "It turns into a ship.", "It turns into a tank."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Im7juzWJr4M_1", "video_path": "Im7juzWJr4M.mp4", "subtitle_path": "Im7juzWJr4M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1155.69, "view_count": 16875}, {"video_id": "Im7juzWJr4M", "question": "A small stream is in the middle of the screen, with a bridge featuring white railings over it. On the left side of the stream, there are many green plants, and on the right side, there are many stones. 
What happens after the subtitle mentions 'your friends and family as that is'?", "question_wo_referring_query": "What happens after that?", "candidates": ["The screen zooms in", "The screen goes black", "The screen becomes coarse", "The screen zooms out", "The screen enlarges"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Im7juzWJr4M_2", "video_path": "Im7juzWJr4M.mp4", "subtitle_path": "Im7juzWJr4M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1155.69, "view_count": 16875}, {"video_id": "Rh8BUBBUUG8", "question": "In the top left corner, there is a black person wearing a white T-shirt, and in the middle of the screen, there is a black title with the text 'Cross Camera Track Association'. Below this title, there are 12 circles. When the subtitle 'colors represent the same person and' appears, what is the first line that shows up?", "question_wo_referring_query": "What is the first line that shows up?", "candidates": ["yellow solid line", "pink dotted line", "yellow dotted line", "red double solid line", "pink solid line"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "Rh8BUBBUUG8_0", "video_path": "Rh8BUBBUUG8.mp4", "subtitle_path": "Rh8BUBBUUG8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1136.87, "view_count": 7972}, {"video_id": "Rh8BUBBUUG8", "question": "In the top left corner, there is a black person wearing a white T-shirt. In the middle of the screen, there is a black title that says 'Fist layer (Tracklet Generation)'. Below the title, there are six camera views. 
After the subtitle 'chocolates in each segment and then' appears, what is the first line that appears?", "question_wo_referring_query": "What is the first line that appears?", "candidates": ["blue solid line", "yellow solid line", "pink solid line", "pink dashed line", "blue dashed line"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "Rh8BUBBUUG8_1", "video_path": "Rh8BUBBUUG8.mp4", "subtitle_path": "Rh8BUBBUUG8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1136.87, "view_count": 7972}, {"video_id": "Rh8BUBBUUG8", "question": "In the top left corner, there is a black man wearing a white T-shirt. To his right are 4 mirrors, and the mirrors reflect many people. After the subtitle \"times but we plot them together for the\" appears, what is the first symbol that appears?", "question_wo_referring_query": "In the top left corner, there is a black man wearing a white T-shirt. To his right are 4 mirrors, and the mirrors reflect many people. After the subtitle \"times but we plot them together for the\" appears, what is the first symbol that appears?", "candidates": ["red arrow", "pink double arrow", "pink arrow", "yellow double arrow", "red double arrow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "Rh8BUBBUUG8_2", "video_path": "Rh8BUBBUUG8.mp4", "subtitle_path": "Rh8BUBBUUG8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1136.87, "view_count": 7972}, {"video_id": "GZW6OjARMGU", "question": "In the room, there is a bald man wearing a gray coat, with a white table to his right and another white table behind him, on which a painting is placed. 
In which of the following scenes did this man appear?", "question_wo_referring_query": "In which of the following scenes did this man appear?", "candidates": ["In front of a painting where two people are hugging", "In a white room with several paintings hanging, facing the wall", "Standing in the middle of a white room with 2 paintings hanging", "In a green room with several paintings hanging", "In a room with 4 paintings placed on the tables"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "GZW6OjARMGU_0", "video_path": "GZW6OjARMGU.mp4", "subtitle_path": "GZW6OjARMGU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.29, "view_count": 66169}, {"video_id": "GZW6OjARMGU", "question": "In a white room, a bald man wearing glasses and a gray jacket is looking at a painting on the wall. The painting depicts two people hugging each other. In which of the following scenes does this man appear?", "question_wo_referring_query": "In which of the following scenes does this man appear?", "candidates": ["Standing in front of a painting with many people and a cow", "Standing on a bus", "In a room with a table holding four paintings", "In a green room with five paintings hanging, four of which are neatly arranged", "Standing in a white room with two paintings hanging in the middle"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "GZW6OjARMGU_1", "video_path": "GZW6OjARMGU.mp4", "subtitle_path": "GZW6OjARMGU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.29, "view_count": 66169}, {"video_id": "GZW6OjARMGU", "question": "In a room, there is a picture placed on a table. The person in the picture is kneeling. In front of the table, there is a bald man wearing a black coat and glasses, looking at the picture. 
In which of the following scenes has this picture appeared?", "question_wo_referring_query": "In which of the following scenes has this picture appeared?", "candidates": ["On a white table in an empty room", "On a white table in a room with many people", "Hanging on a bus", "Hanging on a ship", "Hanging on a pink wall"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "GZW6OjARMGU_2", "video_path": "GZW6OjARMGU.mp4", "subtitle_path": "GZW6OjARMGU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.29, "view_count": 66169}, {"video_id": "o75ybZ-6Uu8", "question": "There are 4 rows of neatly arranged squares, with 4 of them being black. Below the 4 rows of squares, there are the characters '32 classes each'. In which subtitles does this '32 classes each' appear together with?", "question_wo_referring_query": "In which subtitles does '32 classes each' appear together with?", "candidates": ["and discuss what's new and how it all", "so it's 32 categorical variables", "is in the blog post", "through the individual steps", "the image predictor oh yeah so xt"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "o75ybZ-6Uu8_0", "video_path": "o75ybZ-6Uu8.mp4", "subtitle_path": "o75ybZ-6Uu8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3298.03, "view_count": 24706}, {"video_id": "o75ybZ-6Uu8", "question": "Inside the red rectangular box, there is red, green, blue, and yellow. Outside the red rectangular box, there is a blue arrow and a green arrow. 
Which subtitles have appeared together with the red rectangular box?", "question_wo_referring_query": ", which subtitles have appeared together with the red rectangular box?", "candidates": ["those things so this whole", "outperform humans so much what you", "sample from them", "through the individual steps", "if you move a bit you're likely to get"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "o75ybZ-6Uu8_1", "video_path": "o75ybZ-6Uu8.mp4", "subtitle_path": "o75ybZ-6Uu8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3298.03, "view_count": 24706}, {"video_id": "o75ybZ-6Uu8", "question": "In the scene, a man wearing a cap is smiling. There is white text that seems to say 'Es scheint unm\u00f6glich' on his shoulder. With which subtitles does this smiling man appear together?", "question_wo_referring_query": "With which subtitles does this smiling man appear together?", "candidates": ["because this video pinball thing", "so it's 32 categorical variables", "and discuss what's new and how it all", "going to be described", "and then there is a sampling step so"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "o75ybZ-6Uu8_2", "video_path": "o75ybZ-6Uu8.mp4", "subtitle_path": "o75ybZ-6Uu8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3298.03, "view_count": 24706}, {"video_id": "cu5Z7j7FIhU", "question": "On a sunlit street, a man wearing a baseball cap and a wristwatch is walking with a black backpack along a yellow building wall. 
When the scene changes to the glass doors and windows of a red brick building, what changes occur in the man's clothing?", "question_wo_referring_query": "What changes occur in the clothing of the man wearing a baseball cap?", "candidates": ["The red jacket changes to a green T-shirt", "The floral shirt changes to a blue jacket", "The blue T-shirt changes to a black vest", "The red and blue striped T-shirt changes to a blue T-shirt", "The blue long-sleeve T-shirt changes to a black jacket"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "cu5Z7j7FIhU_0", "video_path": "cu5Z7j7FIhU.mp4", "subtitle_path": "cu5Z7j7FIhU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.76, "view_count": 65832}, {"video_id": "cu5Z7j7FIhU", "question": "On a bright and sunny day, there are small trees in the distance. In front of the camera, there is a man wearing a blue short-sleeved T-shirt and carrying a black shoulder bag. The scene then changes to the left, showing a glass door and windows, while on the right, there is a tight red brick wall. At the bottom right, there are various items like paper shells, desks, and chairs. 
What changes are there in the man's clothing in front of the camera?", "question_wo_referring_query": "What changes are there in the man's clothing in front of the camera?", "candidates": ["The red jacket changes to a green T-shirt", "The floral shirt changes to a blue jacket", "The white duckbill cap changes to a black duckbill cap", "The blue T-shirt changes to a black backpack", "The red and blue striped T-shirt changes to a blue T-shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "cu5Z7j7FIhU_1", "video_path": "cu5Z7j7FIhU.mp4", "subtitle_path": "cu5Z7j7FIhU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.76, "view_count": 65832}, {"video_id": "cu5Z7j7FIhU", "question": "In the distance, there are misty mountains and a silent body of water behind the man. The man is wearing light blue clothes and a black-strap backpack, with a baseball cap facing the camera. The scene then shifts to a staircase entrance of a red brick building, cluttered with various objects. 
What change happened to the man's arm in front of the clutter?", "question_wo_referring_query": "What changed on the arm of the man in front of the clutter?", "candidates": ["A red string appeared on his arm", "A long black sleeve appeared on his arm", "A watch appeared on his arm", "A black wristband appeared on his arm", "A tattoo appeared on his arm"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "cu5Z7j7FIhU_2", "video_path": "cu5Z7j7FIhU.mp4", "subtitle_path": "cu5Z7j7FIhU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.76, "view_count": 65832}, {"video_id": "IUHYj96sNeI", "question": "On the surface of a white porcelain, various seasonings and an egg are placed on the left side, a hand on the right side is holding a bowl, and in the middle, on a cutting board, there is a piece of pork chop wrapped in bread crumbs. When the subtitle mentions 'you heard her,' what changes occurred to the pork chop?", "question_wo_referring_query": "When the subtitle mentions 'you heard her,' what changes occurred to the pork chop?", "candidates": ["The pork chop was coated with a dark seasoning", "The pork chop wrapped in bread crumbs was cut into pieces", "The raw pork chop wrapped in bread crumbs turned cooked, with its color darkening", "The breadcrumbs on the pork chop fell off", "The pork chop wrapped in bread crumbs was cut into two pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "IUHYj96sNeI_0", "video_path": "IUHYj96sNeI.mp4", "subtitle_path": "IUHYj96sNeI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.56, "view_count": 1451820}, {"video_id": "IUHYj96sNeI", "question": "On a white dining table, there are various condiments, eggs, and pork placed on the left side, while rice is cooking inside an electric rice cooker on the right side. 
When the subtitle mentions 'oh the rice yes open wow', what change does the rice inside the electric rice cooker undergo?", "question_wo_referring_query": "When the subtitle mentions 'oh the rice yes open wow', what change does the rice inside the electric rice cooker undergo?", "candidates": ["The rice turned into rice cakes", "The rice turned into pork chops", "The rice disappeared from the electric cooker", "The rice turned into popcorn", "The rice turned into cooked rice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "IUHYj96sNeI_1", "video_path": "IUHYj96sNeI.mp4", "subtitle_path": "IUHYj96sNeI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.56, "view_count": 1451820}, {"video_id": "IUHYj96sNeI", "question": "On a white table, there is a electric stove with hot oil burning on the right side, and a person on the left holding a piece of pork chop. What happens to the pork chop when the subtitle 'add the fried pork' appears?", "question_wo_referring_query": "What happens to the pork chop when the subtitle 'add the fried pork' appears?", "candidates": ["The pork chop is pulled apart into strips", "The pork chop turns into charcoal", "The pork chop is cut into small cubes", "The pork chop is cut into ball shapes", "The whole piece of pork chop is cut into pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "IUHYj96sNeI_2", "video_path": "IUHYj96sNeI.mp4", "subtitle_path": "IUHYj96sNeI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.56, "view_count": 1451820}, {"video_id": "YWhMCXo4lkQ", "question": "Against a gray background, there is the American flag on the left side, and in the center of the frame, there are three men wearing blue outfits with NASA logos. 
What are the three men doing at this time?", "question_wo_referring_query": "What are the three men doing at this time?", "candidates": ["closing their eyes tightly", "standing up", "smiling", "crying", "lying down"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "YWhMCXo4lkQ_0", "video_path": "YWhMCXo4lkQ.mp4", "subtitle_path": "YWhMCXo4lkQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1071.72, "view_count": 138772}, {"video_id": "YWhMCXo4lkQ", "question": "In an aircraft cabin, there is a blue background outside the window. Inside the cabin, on the left, there is a floating blue instrument. In the center, there is a man with short hair wearing a white astronaut suit, glasses, and devices on his ears and nose. What is this man doing at this moment?", "question_wo_referring_query": "What is this man doing at this moment?", "candidates": ["Reading a book", "Holding a phone and listening to music", "Looking out the window", "Holding a pen and writing", "Sleeping"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "YWhMCXo4lkQ_1", "video_path": "YWhMCXo4lkQ.mp4", "subtitle_path": "YWhMCXo4lkQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1071.72, "view_count": 138772}, {"video_id": "YWhMCXo4lkQ", "question": "On the deck of a white boat, there are three men wearing white work uniforms. The man on the right is standing with his waist slightly bent and smiling. The man in the back left is holding the white ladder beside him with one hand and grasping the upper part of the deck with the other hand. 
What is the man in the front left doing at this moment?", "question_wo_referring_query": "What is the man in the front left doing at this moment?", "candidates": ["Lifting his leg", "Waving his right hand forward", "Raising both hands high", "Jumping up", "Starting to tighten his right hand"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "YWhMCXo4lkQ_2", "video_path": "YWhMCXo4lkQ.mp4", "subtitle_path": "YWhMCXo4lkQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1071.72, "view_count": 138772}, {"video_id": "l12GXD0t_RE", "question": "In a white background, there are four lines of English sentences with numerical titles and corresponding formulas. Among them, the first line starting with 'differentiate', the second line starting with 'evaluate', and the third line starting with 'calculate' are marked in yellow. There are three red drawn circles on the screen. What word group is present on the screen at this time?", "question_wo_referring_query": "In a white background, there are four lines of English sentences with numerical titles and corresponding formulas. Among them, the first line starting with 'differentiate', the second line starting with 'evaluate', and the third line starting with 'calculate' are marked in yellow. There are three red drawn circles on the screen. 
What word group is present on the screen at this time?", "candidates": ["Local stability", "Datasets and models", "FUNCTIONS", "expriment", "Control theory"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "l12GXD0t_RE_0", "video_path": "l12GXD0t_RE.mp4", "subtitle_path": "l12GXD0t_RE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2208.7, "view_count": 6913}, {"video_id": "l12GXD0t_RE", "question": "In the white background, on the right side there is English sentence content starting with the words 'hematics from'. The upper part of the text content is marked in yellow, and on the right side of the screen there are colorful handwritten characters. Which shape below is present in the middle of the screen?", "question_wo_referring_query": "Which shape below is present in the middle of the screen?", "candidates": ["blue arrow", "purple thick line", "red circle", "green arrow", "yellow square"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "l12GXD0t_RE_1", "video_path": "l12GXD0t_RE.mp4", "subtitle_path": "l12GXD0t_RE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2208.7, "view_count": 6913}, {"video_id": "l12GXD0t_RE", "question": "The video is divided into three sections, labeled with the English word groups Francois, Amaury, and Guillaume at the top. Below, there is a title labeled Abstract. Under the title is the text content, with the first line highlighted in yellow. 
Which of the following word groups does not appear on the screen?", "question_wo_referring_query": "Which of the following word groups does not appear on the screen?", "candidates": ["Facebook AI Research", "Introduction", "chat gpt", "Rutgers University", "Mathematicians"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "l12GXD0t_RE_2", "video_path": "l12GXD0t_RE.mp4", "subtitle_path": "l12GXD0t_RE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2208.7, "view_count": 6913}, {"video_id": "0L1ANxFGzC0", "question": "In front of a white background, while a man wearing a dark red shirt is saying 'environment with the street view of,' what appears on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["A horse", "A handgun", "An ambulance", "A traffic police officer", "A red bus"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "0L1ANxFGzC0_0", "video_path": "0L1ANxFGzC0.mp4", "subtitle_path": "0L1ANxFGzC0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1819.83, "view_count": 94}, {"video_id": "0L1ANxFGzC0", "question": "In front of a white background with many formulas written on it, when a short-haired man dressed in a deep red shirt at the top left corner is saying 'previous slide if you apply this,' what is shown on the screen?", "question_wo_referring_query": "What is shown on the screen?", "candidates": ["black staircase shape", "purple circle", "green rectangle", "blue square", "red hexagon"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "0L1ANxFGzC0_1", "video_path": "0L1ANxFGzC0.mp4", "subtitle_path": "0L1ANxFGzC0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1819.83, "view_count": 94}, {"video_id": "0L1ANxFGzC0", "question": 
"Against a white background, a street area is displayed on the left. Various tall buildings stand, and below there are many pedestrians and vehicles. In the upper left corner, a short-haired man in a dark red outfit is saying 'so this is the starter state of Beijing'. What is present in the scene?", "question_wo_referring_query": "What is present in the scene?", "candidates": ["A dark blue car", "A red bus", "A white horse", "A police officer", "A prancing horse"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "0L1ANxFGzC0_2", "video_path": "0L1ANxFGzC0.mp4", "subtitle_path": "0L1ANxFGzC0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1819.83, "view_count": 94}, {"video_id": "2u3ujFhxEug", "question": "In a desert, many people are sitting holding flags. In front of them, some people are standing. When the subtitle mentions '1975 that the Spanish withdrew from,' what is the color of the flag?", "question_wo_referring_query": "What is the color of the flag?", "candidates": ["green", "blue", "red", "white", "purple"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "2u3ujFhxEug_0", "video_path": "2u3ujFhxEug.mp4", "subtitle_path": "2u3ujFhxEug_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1379.78, "view_count": 164732}, {"video_id": "2u3ujFhxEug", "question": "Five people are seated beside a table, and there are items placed on the table. Among these people, three are wearing hats. Behind them, there is a flag that is mostly red. 
When the subtitle mentions 'and lasted until 1946 before it became', what is the shape of the table?", "question_wo_referring_query": "What is the shape of the table?", "candidates": ["spindle-shaped", "ladder-shaped", "rectangular", "triangular", "round"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "2u3ujFhxEug_1", "video_path": "2u3ujFhxEug.mp4", "subtitle_path": "2u3ujFhxEug_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1379.78, "view_count": 164732}, {"video_id": "2u3ujFhxEug", "question": "In the screen, there is an object with maps of different countries printed on it. The colors of the maps differ. A finger extends from the left side pointing to the map of one of the countries. When the subtitle mentions 'intentionally chosen precisely for its,' this finger presses down on the map which has the text 'ABIA' in English. What is the color of this map?", "question_wo_referring_query": "What is the color of the map with 'ABIA' in English that the finger is pressing down on?", "candidates": ["Red", "Yellow", "White", "Black", "Pink"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "2u3ujFhxEug_2", "video_path": "2u3ujFhxEug.mp4", "subtitle_path": "2u3ujFhxEug_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1379.78, "view_count": 164732}, {"video_id": "D0uHpd4136M", "question": "In a car, a person has one hand raised, while the other hand is on the steering wheel. There are black seats behind and beside the person. A white piece of clothing is placed on the seat behind the person. 
Who is the person with their hand on the steering wheel?", "question_wo_referring_query": "Who is the person with their hand on the steering wheel?", "candidates": ["A woman in a grey dress", "A long-haired man", "A woman in a red dress", "A short-haired woman", "A man in a blue shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "D0uHpd4136M_0", "video_path": "D0uHpd4136M.mp4", "subtitle_path": "D0uHpd4136M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1210.04, "view_count": 80467}, {"video_id": "D0uHpd4136M", "question": "In a room with white walls, a person is holding a jar of Vaseline. Behind them to the left is a white door, and on the right inside the wardrobe are clothes of different colors hanging. Who is the person holding the jar of Vaseline?", "question_wo_referring_query": "Who is the person holding the jar of Vaseline?", "candidates": ["A child", "A woman in black", "A short-haired woman in black", "A woman in red", "A man in black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "D0uHpd4136M_1", "video_path": "D0uHpd4136M.mp4", "subtitle_path": "D0uHpd4136M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1210.04, "view_count": 80467}, {"video_id": "D0uHpd4136M", "question": "Inside a car, someone is holding a phone with a green case. To the right of this person is a rectangular object with a black and white base, which has a portrait and some English letters on it. In front of this person is a steering wheel, and behind them is a black seat. 
Who is the person holding the phone?", "question_wo_referring_query": "Who is the person holding the phone?", "candidates": ["A woman wearing red", "A woman wearing white", "A child", "A man wearing blue", "A woman wearing blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "D0uHpd4136M_2", "video_path": "D0uHpd4136M.mp4", "subtitle_path": "D0uHpd4136M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1210.04, "view_count": 80467}, {"video_id": "BxBqNd-2_1A", "question": "There is a man wearing a suit and glasses sitting on a beige sofa in front of a window with white curtains. When the yellow and white spinning lantern marked 'THE WORLD' first appears at the bottom of the screen, what happens in the scene?", "question_wo_referring_query": "What happens in the scene?", "candidates": ["The man takes off his glasses.", "The man adjusts his suit.", "The man stands up.", "The man slightly raises his head."], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "BxBqNd-2_1A_0", "video_path": "BxBqNd-2_1A.mp4", "subtitle_path": "BxBqNd-2_1A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1354.56, "view_count": 199697}, {"video_id": "BxBqNd-2_1A", "question": "In a room with green plaid decor, a long-haired woman sits on a green sofa with her legs crossed, facing the mirror. 
What happens in the screen the first time the man in a suit has his back to the mirror?", "question_wo_referring_query": "What happens in the screen?", "candidates": ["The woman claps her hands repeatedly", "The woman stands up", "The woman crosses her arms over her chest", "The woman and the man hug politely"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "BxBqNd-2_1A_1", "video_path": "BxBqNd-2_1A.mp4", "subtitle_path": "BxBqNd-2_1A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1354.56, "view_count": 199697}, {"video_id": "BxBqNd-2_1A", "question": "In a room with white curtains, a man with a ponytail wearing black clothes is sitting on a brown chair. He raises his hand beside his head. To his right, there is a brown cabinet with books on it. At the bottom of the screen, there is a white and blue rectangular shape with English letters on it. What did the man do when this rectangle first appeared?", "question_wo_referring_query": "What did the man do when this rectangle first appeared?", "candidates": ["Changed into a red shirt", "Took off his glasses", "", "Took off his outerwear", "Talked while gesturing with his hands"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "BxBqNd-2_1A_2", "video_path": "BxBqNd-2_1A.mp4", "subtitle_path": "BxBqNd-2_1A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1354.56, "view_count": 199697}, {"video_id": "69APgWPlFIE", "question": "In a room with beige walls, a woman wearing a checkered shirt is kneeling on a white bed. On the bed, there are white pillows and a doll placed beside her. There are photos pasted on the wall in front of the woman. 
When the subtitle 'finish up and procreate I go straight' appears, what is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["Drinking water", "Hugging a doll and shaking her leg", "Using a laptop", "Combing her hair", "Reading a book"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "69APgWPlFIE_0", "video_path": "69APgWPlFIE.mp4", "subtitle_path": "69APgWPlFIE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.56, "view_count": 4790237}, {"video_id": "69APgWPlFIE", "question": "A person wearing a white shirt and black-and-white checkered pants is sitting on a white floor. She is holding scissors and a green piece of cardboard, with colored paper boards of various shapes around her. When the text 'many cardboard boxes left over I decided' appears, what is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": ["Cutting the green paper board", "Eating something", "Playing with a phone", "Standing up", "Reading a book"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "69APgWPlFIE_1", "video_path": "69APgWPlFIE.mp4", "subtitle_path": "69APgWPlFIE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.56, "view_count": 4790237}, {"video_id": "69APgWPlFIE", "question": "In the video, there are various kinds of stickers on the screen, with a brown-colored paper next to them. In the bottom left corner, there is a hand with a brown sleeve. 
What does this hand do when the phrase 'imperfect snckers and I'm using it to' appears?", "question_wo_referring_query": "What does this hand do?", "candidates": ["Pick up a water cup", "Pick up a phone", "Pick up a laptop", "Take out the brown paper", "Stick the stickers onto the brown paper"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "69APgWPlFIE_2", "video_path": "69APgWPlFIE.mp4", "subtitle_path": "69APgWPlFIE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.56, "view_count": 4790237}, {"video_id": "BrZBXnpDfSA", "question": "On the stage with a purple background, there are different patterns, and five men are standing on the stage. Among them, a man dressed in a bright blue jacket, a white long-sleeved shirt, and black pants is dancing in the center of the five. There are fireworks spraying on both sides of them, and stage props are placed behind. After the man in the blue jacket and black tie finished dancing in the center, what did he do?", "question_wo_referring_query": "After the man in the blue jacket and black tie finished dancing in the center, what did he do?", "candidates": ["Jumped up", "Kneeled down", "Switched places with the man to his right", "Changed into a red outfit", "Switched places with the man to his left"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "BrZBXnpDfSA_0", "video_path": "BrZBXnpDfSA.mp4", "subtitle_path": "BrZBXnpDfSA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.64, "view_count": 574427}, {"video_id": "BrZBXnpDfSA", "question": "In front of a red background with the word 'concert' printed on it, five men are standing in the middle of a stage facing the audience. The man in the center is wearing black pants with a jacket that has a white letter 'R' on it. 
After raising his right hand and placing it in front of his chest, what action does this man perform?", "question_wo_referring_query": "After raising his right hand and placing it in front of his chest, what action does this man perform?", "candidates": ["Bow towards the audience", "Turn around and stand", "Kneel down towards the audience", "Raise both hands high and cheer", "Clap hands"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "BrZBXnpDfSA_1", "video_path": "BrZBXnpDfSA.mp4", "subtitle_path": "BrZBXnpDfSA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.64, "view_count": 574427}, {"video_id": "BrZBXnpDfSA", "question": "There is a large screen with 'ONE' written on it at the top of the blue background stage. Five boys are standing in the center of the stage. The boy in the middle is wearing black pants and a jacket with letters on it. What did this boy do after clapping his hands?", "question_wo_referring_query": "What did this boy do after clapping his hands?", "candidates": ["He waved to the audience", "He used his right hand to clap the person on his left", "He changed clothes", "He jumped up", "He used his right hand to clap the person on his right"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "BrZBXnpDfSA_2", "video_path": "BrZBXnpDfSA.mp4", "subtitle_path": "BrZBXnpDfSA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.64, "view_count": 574427}, {"video_id": "y9Laq-mmia8", "question": "Which label appears first in the video?", "question_wo_referring_query": "Which label appears first in the video?", "candidates": ["The white label with two hexagons and a double-headed arrow in between", "The label printed with 'Resonance Trick'", "The label with the title 'Typical Charged' at the beginning", "The white label at the beginning with 'Electrons Move' connected to a magnifying 
glass", "The white label printed with two 'CH3's"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "y9Laq-mmia8_0", "video_path": "y9Laq-mmia8.mp4", "subtitle_path": "y9Laq-mmia8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1824.93, "view_count": 21721}, {"video_id": "y9Laq-mmia8", "question": "Which of the following concepts is mentioned first in the video?", "question_wo_referring_query": "Which of the following concepts is mentioned first in the video?", "candidates": ["typical charged bonding patterns", "draw all resonance structures for the following structure", "resonance trick", "electrons move, atoms do not move", "ACS exam question"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "y9Laq-mmia8_1", "video_path": "y9Laq-mmia8.mp4", "subtitle_path": "y9Laq-mmia8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1824.93, "view_count": 21721}, {"video_id": "y9Laq-mmia8", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["A black-edged hexagonal sticky note with no writing", "A green-edged sticky note with the English words 'bond with'", "A sticky note with two black dots", "A sticky note with 'YES NO' printed"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "y9Laq-mmia8_2", "video_path": "y9Laq-mmia8.mp4", "subtitle_path": "y9Laq-mmia8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1824.93, "view_count": 21721}, {"video_id": "_qMcEMjLYhw", "question": "In a mountain cave full of rocks, a short-haired man with a beard is standing in the middle. 
There's water below him, and after the subtitle mentions 'there's a few reasons why I fell in love,' what does this man do?", "question_wo_referring_query": "What action does this man take?", "candidates": ["Eats something", "Drinks water", "Raises one hand pointing to a person in white clothes", "Squats down", "Puts on clothes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "_qMcEMjLYhw_0", "video_path": "_qMcEMjLYhw.mp4", "subtitle_path": "_qMcEMjLYhw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1358.88, "view_count": 581195}, {"video_id": "_qMcEMjLYhw", "question": "Under a blue sky, a man wearing a sun hat and sunglasses is standing shirtless in the middle of the screen. Behind him is blue seawater and a yellowish-green island. After the subtitle mentions 'just really take in milos,' what does he do?", "question_wo_referring_query": "What does he do?", "candidates": ["Touches his head", "Eats something", "Puts his hand on the waist of a long-haired woman", "Jumps into the water", "Puts his hand on the waist of a short-haired woman"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "_qMcEMjLYhw_1", "video_path": "_qMcEMjLYhw.mp4", "subtitle_path": "_qMcEMjLYhw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1358.88, "view_count": 581195}, {"video_id": "_qMcEMjLYhw", "question": "Under a blue sky, a long-haired woman wearing a headband and dressed in white is sitting on a white chair. Behind her is a sea and rocks. 
Holding a fork in her hand, after the subtitle mentions 'you're making me gay just watching you', what did she do?", "question_wo_referring_query": "What did she do?", "candidates": ["Jumped up", "Touched her head", "Drank water", "Kneeled down", "Closed her mouth and touched her nose with her right hand"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "_qMcEMjLYhw_2", "video_path": "_qMcEMjLYhw.mp4", "subtitle_path": "_qMcEMjLYhw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1358.88, "view_count": 581195}, {"video_id": "IUgNeCynVUU", "question": "In the black and white video, there's a white sheep sculpture in the middle of an exhibition room. On the left side in the distance and on the right side closer to the camera, there are a few people observing the exhibition. Before the subtitle mentions 'elitist and restrictive world of the Galleries', which person appears?", "question_wo_referring_query": "Which person appears?", "candidates": ["Child playing soccer", "Old man wearing a hat", "Person bending over on the ground painting a crow", "Man lying on the bed", "Woman lying on the grass"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "IUgNeCynVUU_0", "video_path": "IUgNeCynVUU.mp4", "subtitle_path": "IUgNeCynVUU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.89, "view_count": 312713}, {"video_id": "IUgNeCynVUU", "question": "In front of a white wall with posters and two T-shirts, two people with exposed shoulders and heads are looking at something on the wall. 
After the subtitle mentions 'was a consistent in Haring\u2019s work,' which person appears?", "question_wo_referring_query": "Which person appears?", "candidates": ["The elderly person wearing a scarf", "The person bending down to paint on the ground", "The child lying on the grass", "The person wearing a black T-shirt and facing away from the camera", "The man driving a car"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "IUgNeCynVUU_1", "video_path": "IUgNeCynVUU.mp4", "subtitle_path": "IUgNeCynVUU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.89, "view_count": 312713}, {"video_id": "IUgNeCynVUU", "question": "From the perspective view, the ground is a circular area filled with various black lines and patterns on a white background. The room's ceiling is lined with four searchlights. Who is the person that appears before the subtitle mentions 'not because it is uninterrupted'?", "question_wo_referring_query": "Who is the person that appears?", "candidates": ["The man wearing a red coat and a hat", "The man wearing a black coat and glasses", "The person bending down and painting on the ground", "The man lying on the grass", "The man wearing a red coat and glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "IUgNeCynVUU_2", "video_path": "IUgNeCynVUU.mp4", "subtitle_path": "IUgNeCynVUU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.89, "view_count": 312713}, {"video_id": "gusNxEOdo-o", "question": "In the video image, on the left, there is a picture of a large hall with a yellow ceiling and two wall paintings hanging on a white wall. In the bottom right corner, there is a bald man in a gray suit who is giving a speech. 
Where has the man in the video who is giving a speech appeared before?", "question_wo_referring_query": "Where has the man in the video who is giving a speech appeared before?", "candidates": ["Next to a pure blue picture", "In a crowd on the street", "On a sunny beach", "On a green field", "Next to a picture with multiple head portraits"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "gusNxEOdo-o_0", "video_path": "gusNxEOdo-o.mp4", "subtitle_path": "gusNxEOdo-o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2871.97, "view_count": 4929}, {"video_id": "gusNxEOdo-o", "question": "In the video, on the left is an image with an orange background showing a man with wings who is flying, and at the bottom right, there is a bald man wearing a gray suit who is giving a presentation. Where has the man giving the presentation appeared before?", "question_wo_referring_query": "Where has the man giving the presentation appeared before?", "candidates": ["In a car that is being driven", "In a store full of merchandise", "On a crowded street", "In a dense forest", "Next to an image with a black background showing a winged man surrounded by three women wearing headscarves"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "gusNxEOdo-o_1", "video_path": "gusNxEOdo-o.mp4", "subtitle_path": "gusNxEOdo-o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2871.97, "view_count": 4929}, {"video_id": "gusNxEOdo-o", "question": "In the video, on the left there is a picture against a black background showing a man with wings surrounded by three women wearing headscarves, and in the bottom right corner, there is a bald man in a gray suit giving a speech. 
Where has the man giving a speech in the video appeared before?", "question_wo_referring_query": "Where has the man giving a speech in the video appeared before?", "candidates": ["In a busy office", "On the rooftop of a high-rise building", "On a yellow field", "On a sunny and beautiful beach", "Next to two pictures, one with a round tray and the other with black pottery"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "gusNxEOdo-o_2", "video_path": "gusNxEOdo-o.mp4", "subtitle_path": "gusNxEOdo-o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2871.97, "view_count": 4929}, {"video_id": "ETshSZvnzRg", "question": "In the topographic map of the mountainous areas of Huanglvxiang, the left section is surrounded by white lines. In the upper left corner, there is a blue sticker with the words 'Syr Darya'. In which of the following subtitles did this sticker appear together?", "question_wo_referring_query": ", in which of the following subtitles did this sticker appear together?", "candidates": ["I practiced every day.", "I must thank her for making a confident girl", "Wang noticed me", "helped me a lot in my middle school life", "Valley is like the Hidden Gem nestled"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "ETshSZvnzRg_0", "video_path": "ETshSZvnzRg.mp4", "subtitle_path": "ETshSZvnzRg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2272.65, "view_count": 313395}, {"video_id": "ETshSZvnzRg", "question": "In a scene with a black background, there is a golden cross at the top left. In the middle, a man in a blue T-shirt and a man in a blue jacket are talking to the camera. 
Which subtitles appeared together with the cross at the top left?", "question_wo_referring_query": "Which subtitles appeared together with the cross at the top left?", "candidates": ["Many people had no work", "Today is my birthday", "so some of my classmates sent me presents", "My home town is a beautiful place", "and Korean and Assyrian communities and"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "ETshSZvnzRg_1", "video_path": "ETshSZvnzRg.mp4", "subtitle_path": "ETshSZvnzRg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2272.65, "view_count": 313395}, {"video_id": "ETshSZvnzRg", "question": "In a room with a bookshelf, where there are books and decorative items on the shelf in the back, a man with wavy long hair wearing black clothes is speaking in front of the camera. Which of the following subtitles has appeared along with this speaking man?", "question_wo_referring_query": "Which of the following subtitles has appeared along with this speaking man?", "candidates": ["I like Chinese best because I have a good Chinese teacher", "Ms Sun works very hard", "My new teacher is science teacher", "She tries to make her classes lively and interesting", "CH and the Z the most common style of"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "ETshSZvnzRg_2", "video_path": "ETshSZvnzRg.mp4", "subtitle_path": "ETshSZvnzRg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2272.65, "view_count": 313395}, {"video_id": "h0IPcqF2q9U", "question": "In a room with green plants, there is a red table in front of the mirror. On the right side of the screen, a man wearing a red and white striped shirt and a man wearing a black shirt are sitting and talking. 
When the man in the black shirt says 'giant egg I gotta make homemade,' what change occurs?", "question_wo_referring_query": "What change occurs?", "candidates": ["Moved to a kitchen with mainly white tones", "Put on a red coat", "Moved to the street", "Put on a white hat", "Put on a black hat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "h0IPcqF2q9U_0", "video_path": "h0IPcqF2q9U.mp4", "subtitle_path": "h0IPcqF2q9U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1351.23, "view_count": 33917127}, {"video_id": "h0IPcqF2q9U", "question": "On a white marble tabletop, there is a white cutting board on the left side. A pair of hands is pressing a piece of pork shaped like a loaf on the cutting board. On the right side, there is a rectangular iron pan. When the video subtitles mention 'that's a spiral of joy right there all,' what change occurs to the pork?", "question_wo_referring_query": "What change occurs to the pork?", "candidates": ["It turned white", "It turned black", "It turned yellow", "A part of it was cut off", "It became larger"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "h0IPcqF2q9U_1", "video_path": "h0IPcqF2q9U.mp4", "subtitle_path": "h0IPcqF2q9U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1351.23, "view_count": 33917127}, {"video_id": "h0IPcqF2q9U", "question": "In the kitchen dominated by white tones, the countertop is made of white marble, and in front of the countertop stands a man wearing black clothes. A black apron is spread on the countertop. 
What change happens to the man when the subtitle mentions 'other kinds of noodles like Tony's'?", "question_wo_referring_query": "What change happens to the man?", "candidates": ["Put on a red apron", "Put on a white hat", "Put on a black apron", "Put on a white apron", "Changed into white clothes"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "h0IPcqF2q9U_2", "video_path": "h0IPcqF2q9U.mp4", "subtitle_path": "h0IPcqF2q9U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1351.23, "view_count": 33917127}, {"video_id": "VQMQIq9weDs", "question": "A man wearing a white shirt and blue suit is standing in front of the screen. The man is holding a paper in his hand and is wearing a red tie. The screen behind the man shows a data chart with a trend line. The left side of the chart has a yellow background while the right side has green characters. After the subtitle 'Eric, thank you, and joining us now to talk about this' appears, what happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["A man in a suit starts speaking", "A girl in a black coat starts speaking", "A man in a white suit starts speaking", "A woman in a suit starts speaking", "A woman wearing a blue top starts speaking"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "VQMQIq9weDs_0", "video_path": "VQMQIq9weDs.mp4", "subtitle_path": "VQMQIq9weDs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1381.42, "view_count": 4413}, {"video_id": "VQMQIq9weDs", "question": "A woman wearing a dark green coat is sitting in front of a white wall. Behind her are potted plants and an item labeled IPO. Below the woman is a white horizontal banner about IPO. 
After the subtitle 'So we'd have read it, eyeing a valuation of up to six and a half billion dollars' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A girl gets up and walks to a bedroom", "A woman wearing a blue top talks", "A man in a white suit interacts with a girl online", "A woman wearing a blue top interacts with a girl online", "A woman wearing a white shirt and black jacket interacts with a girl online"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "VQMQIq9weDs_1", "video_path": "VQMQIq9weDs.mp4", "subtitle_path": "VQMQIq9weDs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1381.42, "view_count": 4413}, {"video_id": "VQMQIq9weDs", "question": "On a black background, there is a data table displaying the variations of each product. The table headers and product codes are in yellow, the fluctuations are shown in red and green, and there is a purple rectangular box in the bottom-right corner of the screen. 
After the subtitle 'happen they don't and then they get out, And so I've seen this over and over that' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A woman wearing a green lab coat explains the table", "A man wearing a white suit explains the table", "A woman wearing a blue top explains the table", "A woman wearing a blue suit explains the table", "A man wearing a suit and red tie explains the table"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "VQMQIq9weDs_2", "video_path": "VQMQIq9weDs.mp4", "subtitle_path": "VQMQIq9weDs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1381.42, "view_count": 4413}, {"video_id": "Dz0DfKnG5UE", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, there is a trailer, then a discussion on US sports news, followed by a discussion on music news, then some fashion news, followed by weather news and other news, and finally a segment introduction.", "First, there is a segment introduction, followed by a discussion on US sports news, then a discussion on music news, after that some fashion news, followed by weather news and other news, and finally a trailer.", "First, there is a trailer, then a discussion on US fashion news, followed by music news, some sports news, a segment introduction, and finally weather news and other news.", "First, there is a trailer, then a discussion on US sports news, followed by a discussion on music news, and then a segment introduction, followed by some fashion news, and then weather news and other news.", "First, there is a trailer, then a discussion on US music news, followed by sports news, some fashion news, a segment introduction, and finally weather news and other news."], "topic_category": "KH-Knowledge-History", 
"question_category": "SSS", "level": "L2-Relation", "id": "Dz0DfKnG5UE_0", "video_path": "Dz0DfKnG5UE.mp4", "subtitle_path": "Dz0DfKnG5UE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.57, "view_count": 932388}, {"video_id": "Dz0DfKnG5UE", "question": "Which of the following sequences is correct for the scenes?", "question_wo_referring_query": "Which of the following sequences is correct for the scenes?", "candidates": ["First, there is a trailer for the news event in 1984, then there is a program introduction, and finally it talks about the news event that happened in 1983.", "First, there is a program introduction, then there is a trailer for the news event in 1984, and lastly, it officially starts talking about the news event that happened in 1983.", "First, there is a trailer for the news event in 1984, then it officially starts talking about the news event that happened in 1983, and lastly, there is a program introduction.", "First, there is a program introduction, then it officially starts talking about the news event that happened in 1983, and lastly, there is a trailer for the news event in 1984.", "First, it officially starts talking about the news event that happened in 1983, then there is a trailer for the news event in 1984, and finally there is a program introduction."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "Dz0DfKnG5UE_1", "video_path": "Dz0DfKnG5UE.mp4", "subtitle_path": "Dz0DfKnG5UE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.57, "view_count": 932388}, {"video_id": "Dz0DfKnG5UE", "question": "In the news from 1983, which of the following sequences is correct?", "question_wo_referring_query": "In the news from 1983, which of the following sequences is correct?", "candidates": ["First, the weather news and other news were explained, then a sports news was narrated, followed by music news, and finally 
fashion news.", "First, the weather news and other news were explained, then a sports news was narrated, followed by fashion news, and finally music news.", "First, the music news was explained, then the weather news and other news were narrated, followed by fashion news, and finally sports news.", "First, the sports news was explained, then a music news was narrated, followed by fashion news, and finally weather news and other news.", "First, the weather news and other news were explained, then a music news was narrated, followed by fashion news, and finally sports news."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "Dz0DfKnG5UE_2", "video_path": "Dz0DfKnG5UE.mp4", "subtitle_path": "Dz0DfKnG5UE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.57, "view_count": 932388}, {"video_id": "T38zTlAQm3E", "question": "A woman in a white top is standing next to the kitchen counter in a trailer. The woman is wearing black pants. The counter next to the woman is white and has a sink and kitchen utensils on it. Behind the woman is a person wearing a short-sleeved shirt and white socks. In which other scenes does the woman appear?", "question_wo_referring_query": "In which other scenes does the woman appear?", "candidates": ["The back of a white horse", "A white bed", "A pavilion by the lake", "The driver's seat of a car", "The back of a black horse"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "T38zTlAQm3E_0", "video_path": "T38zTlAQm3E.mp4", "subtitle_path": "T38zTlAQm3E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1165.25, "view_count": 684710}, {"video_id": "T38zTlAQm3E", "question": "A man wearing a black shirt with a light-colored hat appears in front of a store window. There's a red stripe on the window, and a red object beneath it. A black car is parked next to the man. 
Where else has the man been seen wearing this hat?", "question_wo_referring_query": "Where else has the man been seen wearing this hat?", "candidates": ["Inside a car with light-colored interior", "On a horse's back", "Inside a lakeside pavilion", "On a tall tree", "On a white bed"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "T38zTlAQm3E_1", "video_path": "T38zTlAQm3E.mp4", "subtitle_path": "T38zTlAQm3E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1165.25, "view_count": 684710}, {"video_id": "T38zTlAQm3E", "question": "A man wearing a gray short sleeve is standing next to a car, the man is wearing a black hat, he has a ring on his hand and a black string around his neck, he is holding a pair of binoculars, the car is white, behind the man are green plants and the sky. Where else have the binoculars appeared?", "question_wo_referring_query": "Where else have the binoculars appeared?", "candidates": ["In a tall tree", "On top of a white bed", "On the back of a horse", "Inside a lakeside pavilion", "In a light-colored car interior"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "T38zTlAQm3E_2", "video_path": "T38zTlAQm3E.mp4", "subtitle_path": "T38zTlAQm3E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1165.25, "view_count": 684710}, {"video_id": "lUfh2ebjLOg", "question": "A man wearing a checkered shirt appears in front of a white background. The man is wearing glasses and a ring. He has a green bracelet on his left wrist. On the man's right side is a framed picture of Obama, with a yellow frame. 
What subtitle did this picture appear with?", "question_wo_referring_query": "What subtitle did this picture appear with?", "candidates": ["Now there are even more reasons why these securities were terrible ideas, but the important", "The island which the fishing boat hit", "I'm not very fond of this place", "For everywhere we look, there is work to be done", "have sat in the front of a bus in Alabama"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "lUfh2ebjLOg_0", "video_path": "lUfh2ebjLOg.mp4", "subtitle_path": "lUfh2ebjLOg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.27, "view_count": 3676649}, {"video_id": "lUfh2ebjLOg", "question": "A man with a prosthetic leg is sitting on a golden chair against a white background. The man is wearing a shirt and a wristwatch. To the left and right of the man are a cabinet and a chair. Behind the man is a cartoon character. On the right side of the screen, there is a picture of Obama, a laptop, and a globe. What subtitle appears together with the cabinet on the left side of the man?", "question_wo_referring_query": "What subtitle appears together with the cabinet on the left side of the man?", "candidates": ["All this we will do", "The island which the fishing boat hit", "new age", "I'm not very fond of this place", "Hosni Mubarak in Egypt"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "lUfh2ebjLOg_1", "video_path": "lUfh2ebjLOg.mp4", "subtitle_path": "lUfh2ebjLOg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.27, "view_count": 3676649}, {"video_id": "lUfh2ebjLOg", "question": "On the brown wooden desk, Obama, wearing a white shirt and a black suit, is writing something with a pen. Next to Obama, there is a man wearing glasses and a colorful tie. Behind them, there is a group of people. 
On Obama's left side, there is a child wearing a white shirt. In what subtitles did Obama appear together?", "question_wo_referring_query": "In what subtitles did Obama appear together?", "candidates": ["The island which the fishing boat hit", "a beautiful garden", "I'm not very fond of this place", "To be fair, he did end the squabbling it became full blow yelling", "global warming"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "lUfh2ebjLOg_2", "video_path": "lUfh2ebjLOg.mp4", "subtitle_path": "lUfh2ebjLOg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.27, "view_count": 3676649}, {"video_id": "ao56rH0gnVY", "question": "In a white room, a woman wearing a black top is organizing a white bedspread. There are stacks of books on the floor, with a potted plant on top of one stack. Next to the book stacks is a black counter. When the woman exercises on the mat, what does she change into?", "question_wo_referring_query": "What clothing does the woman change into?", "candidates": ["The woman's black clothing changes to green", "The woman's black clothing changes to gray", "The woman's blue clothing changes to gray", "The woman's green clothing changes to gray", "The woman's black clothing changes to blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "ao56rH0gnVY_0", "video_path": "ao56rH0gnVY.mp4", "subtitle_path": "ao56rH0gnVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1080.95, "view_count": 81808}, {"video_id": "ao56rH0gnVY", "question": "A woman wearing a dark-colored bathrobe sits in front of a white bed. The woman is wearing black rings and a bracelet, and she is holding an eyebrow pencil. There is a black shelf at the head of the bed, and behind the bed there are potted plants and books. 
When the woman sits in the driver's seat of a car, holding two bottles of cosmetics, what does she change into?", "question_wo_referring_query": "What clothes does the woman change into?", "candidates": ["The woman's clothes change to orange", "The woman's clothes change to white", "The woman's clothes change to short sleeves", "The woman's clothes change to blue", "The woman's clothes change to a sweater"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "ao56rH0gnVY_1", "video_path": "ao56rH0gnVY.mp4", "subtitle_path": "ao56rH0gnVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1080.95, "view_count": 81808}, {"video_id": "ao56rH0gnVY", "question": "The woman wearing a dark-colored robe is sitting in front of a white bed. She is wearing a black bracelet and a ring, and she is using a pen to apply makeup to her lips. There is a black shelf at the head of the bed. What change occurs to the hair clip on the woman's head when she is eating in the car?", "question_wo_referring_query": "What change occurs to the hair clip on the woman's head?", "candidates": ["The hair clip on the girl's head is gone.", "The girl's hair clip changed to a dog style.", "The hair clip on the girl's head turned red.", "The hair clip on the girl's head turned purple.", "The hair clip on the girl's head turned blue."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "ao56rH0gnVY_2", "video_path": "ao56rH0gnVY.mp4", "subtitle_path": "ao56rH0gnVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1080.95, "view_count": 81808}, {"video_id": "BwevkqzyK8A", "question": "The man wearing a brown jacket appears in the center of the screen. Behind the man is a grey wall. On the right side of the wall are six joined maps, and on the left side is a standing rectangular map. The man's jacket collar has a zipper. 
What change occurred when the subtitle displayed 'university there wouldn't be a major'?", "question_wo_referring_query": "What change occurred with the man wearing the brown jacket?", "candidates": ["The man's hands, previously down, changed to being crossed in front of his chest.", "The man's mouth changed from open to closed.", "The man's jacket changed to a short-sleeved one.", "The man's hands, previously down, changed to being on his waist.", "The man's hands, previously down, changed to being raised."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "BwevkqzyK8A_0", "video_path": "BwevkqzyK8A.mp4", "subtitle_path": "BwevkqzyK8A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1231.27, "view_count": 32522}, {"video_id": "BwevkqzyK8A", "question": "A man wearing a brown jacket appears in the center of the screen. Behind the man is a gray wall. On the right side of the wall, there is a six-panel map, and on the left side, there is a standing rectangular map. The man's jacket has a collar with a chain. What changes occur when the subtitle 'so where did it end and so the people of' appears?", "question_wo_referring_query": "What changes occur to the man wearing the brown jacket with a chain?", "candidates": ["The man's clothing turns green", "The map behind the man changes to a person's poster", "A necklace appears on the man's neck", "The man's eyes go from open to closed", "A gecko appears on the wall behind the man"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "BwevkqzyK8A_1", "video_path": "BwevkqzyK8A.mp4", "subtitle_path": "BwevkqzyK8A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1231.27, "view_count": 32522}, {"video_id": "BwevkqzyK8A", "question": "The man wearing the brown chain top is speaking. 
One of the man's hands appears on the screen, with five fingers spread out. Behind the man, there is a map on the left side and a puzzle map on the right side. When the subtitle 'just pick and choose the counties they' appears, what change occurs with the man wearing the brown top?", "question_wo_referring_query": "What change occurs with the man wearing the brown top?", "candidates": ["The man raises the other hand", "The man's hands are crossed in front of his chest", "The man stands with his hands on his hips", "The man raises both hands high", "The man lowers the hand that appears on the screen"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "BwevkqzyK8A_2", "video_path": "BwevkqzyK8A.mp4", "subtitle_path": "BwevkqzyK8A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1231.27, "view_count": 32522}, {"video_id": "esX6PG3RMok", "question": "There are three lines of chalk writing on the blackboard. A long-haired woman in a white outfit with circular patterns appears on the screen, holding a chalk in her hand. What is the woman doing in the video?", "question_wo_referring_query": "What is the woman doing in the video?", "candidates": ["The woman is writing on the blackboard", "The woman is cleaning the blackboard", "The woman is shaking her head", "The woman is waving", "The woman has her hands on her hips"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "esX6PG3RMok_0", "video_path": "esX6PG3RMok.mp4", "subtitle_path": "esX6PG3RMok_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.96, "view_count": 31938}, {"video_id": "esX6PG3RMok", "question": "The lady wearing a white lab coat is standing in front of a red curtain. Her pants are gray, and she has long hair and is holding a microphone with both hands. 
There is a yellow and blue rectangular frame in the upper left corner of the screen. What is she doing in the middle of the screen?", "question_wo_referring_query": "What is the lady doing in the middle of the screen?", "candidates": ["The lady is waving", "The lady is bowing to express gratitude", "The lady is raising her hands and dancing", "The lady is making a heart shape with her hands", "The lady is giving a speech"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "esX6PG3RMok_1", "video_path": "esX6PG3RMok.mp4", "subtitle_path": "esX6PG3RMok_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.96, "view_count": 31938}, {"video_id": "esX6PG3RMok", "question": "The woman in a white top is standing by the wall. She is wearing a watch and holding a handbag, with a mobile phone in her hand. The wall has posters on it, among which an image of a girl holding a trophy is most eye-catching. What is the woman in the white top doing?", "question_wo_referring_query": "What is the woman in the white top doing?", "candidates": ["Taking a selfie with one hand", "The girl is spinning around", "Taking a photo with one hand", "Holding the mobile phone up to her face with both hands", "The girl is making a heart shape with her hands"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "esX6PG3RMok_2", "video_path": "esX6PG3RMok.mp4", "subtitle_path": "esX6PG3RMok_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.96, "view_count": 31938}, {"video_id": "Pm_a-iGRM5Q", "question": "The girl wearing a white top is holding a drink in her hand. She is wearing a necklace and a bracelet, has long hair, and the drink has a black straw and black sticker. There is a white wall and a red fire safety sign behind the girl. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["watch", "hat", "earrings", "hair clip", "glasses"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "Pm_a-iGRM5Q_0", "video_path": "Pm_a-iGRM5Q.mp4", "subtitle_path": "Pm_a-iGRM5Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.72, "view_count": 180351}, {"video_id": "Pm_a-iGRM5Q", "question": "A lady wearing an olive-colored top is reading a book. She has long hair, and the book's cover is gray with silhouettes of two black figures. Next to her is a bookshelf filled with books. What objects are present in the scene?", "question_wo_referring_query": ", What objects are present in the scene?", "candidates": ["Bracelet", "Hat", "Glasses", "Plant pot", "Watch"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "Pm_a-iGRM5Q_1", "video_path": "Pm_a-iGRM5Q.mp4", "subtitle_path": "Pm_a-iGRM5Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.72, "view_count": 180351}, {"video_id": "Pm_a-iGRM5Q", "question": "The lady wearing a gray top is sitting by the window, holding a transparent cup in her hand. The lady is stirring the liquid in the cup with a wooden stick. Outside the window, the buildings feature red and white colors. There is a painting on the wall behind the lady. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["potted plant", "table lamp", "watch", "necklace", "glasses"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "Pm_a-iGRM5Q_2", "video_path": "Pm_a-iGRM5Q.mp4", "subtitle_path": "Pm_a-iGRM5Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.72, "view_count": 180351}, {"video_id": "WhWbgwrrxfQ", "question": "A man is speaking in the upper left against a white background. In the center right of the screen, there is a picture of a mannequin being aimed at by a scope. Next to the mannequin, there is a black cabinet. There are blue English letters at the top of the screen. When the subtitle 'consecutive images so for 10 to 30' appears, what objects can be seen in the scene?", "question_wo_referring_query": "What objects can be seen in the scene?", "candidates": ["Table", "Painting", "Potted plant", "Cat", "Dog"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "WhWbgwrrxfQ_0", "video_path": "WhWbgwrrxfQ.mp4", "subtitle_path": "WhWbgwrrxfQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1830.9, "view_count": 3720}, {"video_id": "WhWbgwrrxfQ", "question": "In the upper left section of the white background, there is a man with a black mole explaining something. On the right side of the man, there are bold blue characters and black characters. On the right side of the screen, there are twelve images, including slides and cars. 
When the subtitle 'been recorded in low light but because' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Large Potted Plant", "Blue Bicycle", "Woman in Yellow Dress", "Toy Figure", "A White Horse"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "WhWbgwrrxfQ_1", "video_path": "WhWbgwrrxfQ.mp4", "subtitle_path": "WhWbgwrrxfQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1830.9, "view_count": 3720}, {"video_id": "WhWbgwrrxfQ", "question": "In a white background, there is a man wearing a polo shirt in the upper left corner explaining something. To the right of the man, there are blue characters, and below the blue characters, there are four lines of black characters. Below the black characters, there is a three-line chart. When the subtitle 'results so X 300 represents brighter' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["umbrella", "pot", "lamp", "glasses", "button"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "WhWbgwrrxfQ_2", "video_path": "WhWbgwrrxfQ.mp4", "subtitle_path": "WhWbgwrrxfQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1830.9, "view_count": 3720}, {"video_id": "UQA1xzIADsQ", "question": "In the kitchen, a woman wearing a brown apron is standing by the table. Her hair is tied up. On the table in front of her are two plates filled with food, cups, and cutlery. To the left behind the woman, the wall is pink with a range hood on it, while the right wall is white with some storage area. There is a plant next to the table. 
What is the shape of the storage area on the white wall?", "question_wo_referring_query": "What is the shape of the storage area on the white wall?", "candidates": ["Semi-circular", "Square", "Rectangular", "Ladder-shaped", "Round"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "UQA1xzIADsQ_0", "video_path": "UQA1xzIADsQ.mp4", "subtitle_path": "UQA1xzIADsQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2027.36, "view_count": 289926}, {"video_id": "UQA1xzIADsQ", "question": "In the kitchen, a woman wearing a black short-sleeve shirt and a brown skirt is standing by the table. The woman's hair is tied back. On the table, there is a bread maker, a cutting board, a knife, and a glass bowl. Behind the woman to the left, the wall is pink with a range hood above it. The wall on the right is white with some storage areas. There is a planter next to the table. What is the shape and color of the cutting board on the table?", "question_wo_referring_query": "What is the shape and color of the cutting board on the table?", "candidates": ["brown circle", "pink rectangle", "green rectangle", "brown rectangle", "pink circle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "UQA1xzIADsQ_1", "video_path": "UQA1xzIADsQ.mp4", "subtitle_path": "UQA1xzIADsQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2027.36, "view_count": 289926}, {"video_id": "UQA1xzIADsQ", "question": "A wooden cutting board has three plates on it. The plate on the left contains yellow ingredients. The plate in the middle has skewered ingredients, with green, red, and yellow colors. The ingredients on the plate on the right are sliced. One hand is positioned above the middle plate. 
What does the plate on the left look like?", "question_wo_referring_query": "What does the plate on the left look like?", "candidates": ["Pink and oval", "Green and round", "White and round", "Pink and round", "White and oval"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "UQA1xzIADsQ_2", "video_path": "UQA1xzIADsQ.mp4", "subtitle_path": "UQA1xzIADsQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2027.36, "view_count": 289926}, {"video_id": "OpGzSjdTSho", "question": "Two cartoon characters appear on the screen: one character is wearing a uniform and a blue hat, while the other is bald and wearing a black top. They are standing in front of a wall, and to the side, there is a basket-like object. Which cartoon character is being choked?", "question_wo_referring_query": "Which cartoon character is being choked?", "candidates": ["The bald cartoon character", "The cartoon character wearing a blue hat", "The cartoon character wearing a suit", "The cartoon character wearing a red hat", "The cartoon character wearing a white coat"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "OpGzSjdTSho_0", "video_path": "OpGzSjdTSho.mp4", "subtitle_path": "OpGzSjdTSho_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1384.54, "view_count": 1119782}, {"video_id": "OpGzSjdTSho", "question": "On the battlefield, there are five cartoon characters. The cartoon character on the right is wearing a helmet and a uniform. On the left side, the cartoon characters consist of two men and two women, with both women wearing skirts. On the battlefield, there are burning houses and building ruins. 
Which cartoon character is holding a gun and shooting?", "question_wo_referring_query": "Which cartoon character is holding a gun and shooting?", "candidates": ["The cartoon character on the right wearing a helmet", "The cartoon character on the left wearing a coat", "The cartoon character on the left wearing a suit", "The cartoon character on the left wearing a black hat", "The cartoon character on the left wearing glasses"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "OpGzSjdTSho_1", "video_path": "OpGzSjdTSho.mp4", "subtitle_path": "OpGzSjdTSho_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1384.54, "view_count": 1119782}, {"video_id": "OpGzSjdTSho", "question": "On the road, there is a group of cartoon characters. The cartoon characters on both sides are wearing uniforms and helmets, and they are all carrying guns. Among them, in the middle, there are three men. They are respectively wearing yellow, green, and black shirts. There are houses on the side of the road. Which cartoon character is kneeling?", "question_wo_referring_query": "Which cartoon character is kneeling?", "candidates": ["The cartoon character in the middle wearing a black shirt", "The cartoon character in the middle wearing a green shirt", "The cartoon character in the middle wearing a yellow shirt", "The cartoon character on the far right holding a gun", "The cartoon character on the far left holding a gun"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "OpGzSjdTSho_2", "video_path": "OpGzSjdTSho.mp4", "subtitle_path": "OpGzSjdTSho_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1384.54, "view_count": 1119782}, {"video_id": "me0JeKA39WA", "question": "Next to the broken wall, Li Xiaolong stands wearing a white top and black pants. Opposite him is a man in a grey shirt and jeans, holding a metal object. 
What did Li Xiaolong do first when he entered the scene?", "question_wo_referring_query": "What did Li Xiaolong do first when he entered the scene?", "candidates": ["Kick with one leg", "Attack with nunchucks", "Attack with both legs", "Attack with wrist", "Headbutt"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "me0JeKA39WA_0", "video_path": "me0JeKA39WA.mp4", "subtitle_path": "me0JeKA39WA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1380.76, "view_count": 489068}, {"video_id": "me0JeKA39WA", "question": "Three toy cars appeared on the scene. They were placed in a black room, with switches held behind the toy cars. The toy cars on both sides are yellow and white, while the middle car has the number 5 on it. What happened after the toy cars first appeared on the scene?", "question_wo_referring_query": "What happened after the toy cars first appeared on the scene?", "candidates": ["The toy car was disassembled", "The toy car developed cracks", "The toy car's wheels fell off", "The toy car was ejected", "The toy car was set on fire"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "me0JeKA39WA_1", "video_path": "me0JeKA39WA.mp4", "subtitle_path": "me0JeKA39WA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1380.76, "view_count": 489068}, {"video_id": "me0JeKA39WA", "question": "The girl wearing a white top and a blue plaid skirt appears on screen. Her light brown hair is tied into two sections hanging over her chest. Behind the girl are dense mannequins. She is holding a flower in one hand. 
What did the girl do the first time she appeared onstage?", "question_wo_referring_query": "What did the girl do the first time she appeared onstage?", "candidates": ["The girl put her hands on her hips", "The girl hugged a dog", "The girl bent at the waist to express thanks", "The girl waved her hand in greeting", "The girl hugged a cat"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "me0JeKA39WA_2", "video_path": "me0JeKA39WA.mp4", "subtitle_path": "me0JeKA39WA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1380.76, "view_count": 489068}, {"video_id": "ju1LslAMND8", "question": "Which complete magazine cover appears first in the video?", "question_wo_referring_query": "Which complete magazine cover appears first in the video?", "candidates": ["NOWHERE TO HIDE", "THE CRISIS", "OPPORTUNITY", "THE GARDEN ON EARTH", "THEIR EYES WERE WATCHING GOD"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "ju1LslAMND8_0", "video_path": "ju1LslAMND8.mp4", "subtitle_path": "ju1LslAMND8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1987.7, "view_count": 2172}, {"video_id": "ju1LslAMND8", "question": "Which publication does the video 'Harlin Is Everywhere' mention first in the program?", "question_wo_referring_query": "Which publication does 'Harlin Is Everywhere' mention first in the program?", "candidates": ["Famous", "Chance", "Praise", "Flame", "Emergency"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "ju1LslAMND8_1", "video_path": "ju1LslAMND8.mp4", "subtitle_path": "ju1LslAMND8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1987.7, "view_count": 2172}, {"video_id": "ju1LslAMND8", "question": "When LYNNE discusses the promotion of Harlem Renaissance literature in the video, which artist is mentioned first?", 
"question_wo_referring_query": "When LYNNE discusses the promotion of Harlem Renaissance literature in the video, which artist is mentioned first?", "candidates": ["michel moatti", "Jessie Redmond", "Charles Johnson", "Alain", "mari anne"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "ju1LslAMND8_2", "video_path": "ju1LslAMND8.mp4", "subtitle_path": "ju1LslAMND8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1987.7, "view_count": 2172}, {"video_id": "6vebbDZxoKE", "question": "In the kitchen, there is a man wearing a blue shirt and a black apron. Behind the man are white cupboards and a black countertop. On the table in front of the man, there is a blue pot being heated on an electric device and a transparent bowl containing ingredients. After the subtitle 'this pasta the traditional way to plate' appears, what did the man wearing a black apron do?", "question_wo_referring_query": "What did the man wearing a black apron do?", "candidates": ["The man picked up a bowl containing yellow ingredients.", "The man picked up a kitchen knife.", "The man lifted a black pot.", "The man lifted a white pot.", "The man picked up a fork."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "6vebbDZxoKE_0", "video_path": "6vebbDZxoKE.mp4", "subtitle_path": "6vebbDZxoKE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.93, "view_count": 12646575}, {"video_id": "6vebbDZxoKE", "question": "There is a silver pot placed on a wooden table. The pot is placed on a blue heating device. A person is stirring white liquid inside the pot with a red ladle. To the left of the pot, there are tongs and some ingredients in a transparent small bowl. 
After the subtitle 'been stirring this pretty constantly and' appears, what does the person holding the red ladle do?", "question_wo_referring_query": "What does the person holding the red ladle do?", "candidates": ["Picks up a plate", "Pours the white liquid from a cup into the pot", "Pours the liquid from the pot into a bowl", "Stirs with tongs", "Lifts the pot with both hands"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "6vebbDZxoKE_1", "video_path": "6vebbDZxoKE.mp4", "subtitle_path": "6vebbDZxoKE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.93, "view_count": 12646575}, {"video_id": "6vebbDZxoKE", "question": "There are three pieces of food placed on a wooden table. The person wearing black clothes holds a flat object and is cutting more food. The person wearing black has tattoos on their arm and is wearing a wristwatch. After the subtitle appears: 'to cut it into more manageable pieces to', what did this person in black clothes do while cutting the food?", "question_wo_referring_query": "What did the person in black clothes do while cutting the food?", "candidates": ["Used chopsticks to pick up food", "Cut the food with a knife", "Pinched a piece of food", "Picked up a silver pot", "Picked up a knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "6vebbDZxoKE_2", "video_path": "6vebbDZxoKE.mp4", "subtitle_path": "6vebbDZxoKE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.93, "view_count": 12646575}, {"video_id": "KFyu8LRDKOI", "question": "The center of the screen shows a highway. On the left side of the road, there are short rails, a sandy beach, and the sea. On the right side of the road, there are houses and greenery. In the distance, there are green forests and a cloudless sky. 
After the subtitle 'proposal Americans in Point Roberts are' appears, what appears in the blue sky?", "question_wo_referring_query": "What appears in the blue sky?", "candidates": ["A car", "An eagle", "An American flag", "A black dog", "A kite"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "KFyu8LRDKOI_0", "video_path": "KFyu8LRDKOI.mp4", "subtitle_path": "KFyu8LRDKOI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 927.52, "view_count": 118812}, {"video_id": "KFyu8LRDKOI", "question": "Under the blue sky, a Canadian flag is waving in the wind, with a silver flagpole. Behind the flag are a blue lake and a green forest. After the subtitle 'of this and more would change if Canada', what appears in the indigo sky?", "question_wo_referring_query": "What appeared in the indigo sky?", "candidates": ["A black cat", "An eagle", "A car", "A kite", "An American flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "KFyu8LRDKOI_1", "video_path": "KFyu8LRDKOI.mp4", "subtitle_path": "KFyu8LRDKOI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 927.52, "view_count": 118812}, {"video_id": "KFyu8LRDKOI", "question": "The man in the shirt sitting on the table and the woman wearing a black vest are conversing. They are surrounded by densely packed shelves filled with paper boxes. The man is wearing a watch and light-colored pants. The woman is wearing grey clothes under her black vest. 
After the subtitle 'there's some days we don't have a single,' what appeared on the land by the sea?", "question_wo_referring_query": "What appeared on the land by the sea?", "candidates": ["A helicopter", "An eagle", "A house with a blue roof", "A sailboat", "A ship"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "KFyu8LRDKOI_2", "video_path": "KFyu8LRDKOI.mp4", "subtitle_path": "KFyu8LRDKOI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 927.52, "view_count": 118812}, {"video_id": "SNc303cwOHk", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the man in the gray short-sleeve shirt starts hosting the show. Then, a man in a suit and a man in a short-sleeve shirt talk on the phone. Finally, the man in the gray short-sleeve shirt concludes and gives thanks.", "First, the man in the gray short-sleeve shirt concludes and gives thanks. Then, a man in a suit and a man in a short-sleeve shirt talk on the phone. Finally, the man in the gray short-sleeve shirt starts hosting the show.", "First, the man in the gray short-sleeve shirt concludes and gives thanks. Then, the man in the gray short-sleeve shirt starts hosting the show. Finally, a man in a suit and a man in a short-sleeve shirt talk on the phone.", "First, a man in a suit and a man in a short-sleeve shirt talk on the phone. Then, the man in the gray short-sleeve shirt starts hosting the show. Finally, the man in the gray short-sleeve shirt concludes and gives thanks.", "First, a man in a suit and a man in a short-sleeve shirt talk on the phone. Then, the man in the gray short-sleeve shirt concludes and gives thanks. 
Finally, the man in the gray short-sleeve shirt starts hosting the show."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "SNc303cwOHk_0", "video_path": "SNc303cwOHk.mp4", "subtitle_path": "SNc303cwOHk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.0, "view_count": 336742}, {"video_id": "SNc303cwOHk", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a man in a suit has a phone conversation with a man in a grey short-sleeve shirt, bringing up the topic. Then, the man in a grey short-sleeve shirt hosts the program. After that, the man in the green shirt shares his insights and engages in a discussion with the man in the grey short-sleeve shirt. Finally, the man in the grey short-sleeve shirt concludes and gives thanks.", "First, the man in the grey short-sleeve shirt concludes and gives thanks. Next, the man in the grey short-sleeve shirt hosts the program. After that, the man in the green shirt shares his insights and engages in a discussion with the man in the grey short-sleeve shirt. Finally, a man in a suit has a phone conversation with the man in the grey short-sleeve shirt.", "First, a man in a suit has a phone conversation with a man in a short-sleeve shirt, bringing up the topic. Then, the man in a grey short-sleeve shirt concludes and gives thanks. After that, the man in the grey short-sleeve shirt hosts the program. Finally, the man in the green shirt shares his insights and engages in a discussion with the man in the short-sleeve shirt.", "First, a man in a suit has a phone conversation with a man in a short-sleeve shirt, bringing up the topic. Next, the man in the green shirt shares his insights and engages in a discussion with the man in the grey short-sleeve shirt. Then, the man in a grey short-sleeve shirt hosts the program. 
Finally, the man in the grey short-sleeve shirt concludes and gives thanks.", "First, a man in a suit has a phone conversation with a man in a short-sleeve shirt, bringing up the topic. Then, the man in a grey short-sleeve shirt hosts the program. After that, the man in the grey short-sleeve shirt concludes and gives thanks. Finally, the man in the green shirt shares his insights and engages in a discussion with the man in the short-sleeve shirt."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "SNc303cwOHk_1", "video_path": "SNc303cwOHk.mp4", "subtitle_path": "SNc303cwOHk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.0, "view_count": 336742}, {"video_id": "SNc303cwOHk", "question": "After the episode officially starts, what is the correct sequence of the following scenes?", "question_wo_referring_query": "After the episode officially starts, what is the correct sequence of the following scenes?", "candidates": ["First, the man in the gray short-sleeve shirt speaks, then the man in the green shirt responds to the man in the gray short-sleeve shirt, then both parties engage in a discussion, and finally, the man in the gray short-sleeve shirt concludes with thanks.", "First, the man in the gray short-sleeve shirt speaks, then the man in the green shirt responds to the man in the gray short-sleeve shirt, then the man in the gray short-sleeve shirt concludes with thanks, and finally, both parties engage in a discussion.", "First, the man in the gray short-sleeve shirt speaks, then both parties engage in a discussion, then the man in the green shirt responds to the man in the gray short-sleeve shirt, and finally, the man in the gray short-sleeve shirt concludes with thanks.", "First, the man in the gray short-sleeve shirt speaks, then the man in the green shirt responds to the man in the gray short-sleeve shirt, then both parties engage in a discussion, and finally, the man in 
the green shirt concludes with thanks.", "First, the man in the green shirt responds to the man in the gray short-sleeve shirt, then the man in the gray short-sleeve shirt starts speaking, then both parties engage in a discussion, and finally, the man in the gray short-sleeve shirt concludes with thanks."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "SNc303cwOHk_2", "video_path": "SNc303cwOHk.mp4", "subtitle_path": "SNc303cwOHk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.0, "view_count": 336742}, {"video_id": "IALDlg_0j5M", "question": "A man is standing next to a handrail inside a train station. The man is wearing a white top and black pants. He is also wearing a white face mask and a black hat. He is standing on a platform with trains on both sides. There is a white vehicle and buildings within the right side wall. Where has this man appeared before?", "question_wo_referring_query": "Where has this man appeared before?", "candidates": ["A busy restaurant kitchen", "A white bridge vehicle", "A crowded beach", "A train compartment full of passengers", "A white bus"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "IALDlg_0j5M_0", "video_path": "IALDlg_0j5M.mp4", "subtitle_path": "IALDlg_0j5M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 24214}, {"video_id": "IALDlg_0j5M", "question": "A man wearing a yellow floral shirt appears in the room. The man is wearing a white mask and glasses. He is carrying a bag in one hand and holding a drink in the other. To his side is a row of blue chairs with a green bulletin board above them. Behind him are three men and a window. 
Where has this man in the yellow shirt appeared before?", "question_wo_referring_query": "Where has this man in the yellow shirt appeared before?", "candidates": ["At a crowded lecture", "In a white car", "In the busy kitchen of a restaurant", "In a white public bus", "In front of a white bed in a hotel"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "IALDlg_0j5M_1", "video_path": "IALDlg_0j5M.mp4", "subtitle_path": "IALDlg_0j5M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 24214}, {"video_id": "IALDlg_0j5M", "question": "A train is running on the tracks, with white-walled houses, utility poles, and an empty road on either side. Overhead wires extend above the train. A white van is parked outside the white-walled houses. The train is gray, and the area near the locomotive's driving compartment is black. Where has this train appeared?", "question_wo_referring_query": "Where has this train appeared?", "candidates": ["In front of an abandoned mine", "Beside a green pasture", "Next to a pavilion in front of a building labeled Art Museum Library", "Dense green forest", "Beside a seaside beach"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "IALDlg_0j5M_2", "video_path": "IALDlg_0j5M.mp4", "subtitle_path": "IALDlg_0j5M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 24214}, {"video_id": "igf6vfDLR00", "question": "In the center of the screen, there is a model of the Earth, and a green aircraft graphic is flying around the Earth. At the bottom of the Earth, the time '1:52 AM' is displayed. 
Where have this Earth and these subtitles appeared together before?", "question_wo_referring_query": "Where have this Earth and these subtitles appeared together before?", "candidates": ["You know", "Deploy green light", "and not do that i say men we do have", "If this person runs over here", "things okay whatever um"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "igf6vfDLR00_0", "video_path": "igf6vfDLR00.mp4", "subtitle_path": "igf6vfDLR00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1217.69, "view_count": 4718}, {"video_id": "igf6vfDLR00", "question": "On a street, there is a house without a roof. To the right of the house, there is a red car parked on the road. Between the house and the car, red letters display the text 'ALIEN ACTIVITY'. In which subtitles does this red car also appear?", "question_wo_referring_query": "In which subtitles does this red car also appear?", "candidates": ["one has the green light for deployment", "if this guy runs over here he could", "we are conducting some research", "lady hypatia is just", "so for those of you that aren\u2019t familiar"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "igf6vfDLR00_1", "video_path": "igf6vfDLR00.mp4", "subtitle_path": "igf6vfDLR00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1217.69, "view_count": 4718}, {"video_id": "igf6vfDLR00", "question": "In a calculation interface, there are four national flags. At the very bottom is a Chinese flag. In which cell with a dragon head is there a logo with a five-pointed star that is shrinking? 
With which subtitles has this Chinese flag appeared together?", "question_wo_referring_query": "With which subtitles has this Chinese flag appeared together?", "candidates": ["wiping the floor with everybody", "you know", "invaded", "Our people were killed, so we plan to try", "This is our team"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "igf6vfDLR00_2", "video_path": "igf6vfDLR00.mp4", "subtitle_path": "igf6vfDLR00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1217.69, "view_count": 4718}, {"video_id": "oGMyaES1kDI", "question": "On the wooden table in front of the patterned wall, there is a small white TV. On the upper part of the TV, there is a picture of a man wearing glasses. Inside the orange frame below the picture, the words 'Larry King Show Premieres' are displayed. When this TV and the subtitle 'A week later, the first micro on a chip' appear together, what changes occur inside the orange frame?", "question_wo_referring_query": "What changes occur inside the orange frame?", "candidates": ["The text inside the frame changes to black", "The text inside the frame changes to 'ON AIR'", "The text inside the frame changes to 'First micro on a Chip'", "The text inside the frame changes to yellow", "The text inside the frame changes to 'Feb 14th, 1978'"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "oGMyaES1kDI_0", "video_path": "oGMyaES1kDI.mp4", "subtitle_path": "oGMyaES1kDI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1207.46, "view_count": 430350}, {"video_id": "oGMyaES1kDI", "question": "This is a portrait of a character. Against a blue background, there is a man wearing white clothing and a red cloak. 
What change occurred to him when he appeared together with the subtitle 'serving Pope in modern history, wearing the big pointy'?", "question_wo_referring_query": "What change occurred to him?", "candidates": ["He put on a black hat", "He shaved his head", "He put on a big, pointy hat", "He did not wear a hat", "He put on a mitre"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "oGMyaES1kDI_1", "video_path": "oGMyaES1kDI.mp4", "subtitle_path": "oGMyaES1kDI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1207.46, "view_count": 430350}, {"video_id": "oGMyaES1kDI", "question": "In a pure black background, there is a blue and red circle. When the blue circle and the text 'We'll be right back.' appear together, what change occurs?", "question_wo_referring_query": "What change occurs?", "candidates": ["Red text appears within the circle.", "Black text appears within the circle.", "Yellow text appears within the circle.", "Green text appears within the circle.", "A dancing person appears within the circle."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "oGMyaES1kDI_2", "video_path": "oGMyaES1kDI.mp4", "subtitle_path": "oGMyaES1kDI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1207.46, "view_count": 430350}, {"video_id": "VBObGVuUHwc", "question": "In a room, there is a group of people standing. Among them, there is a shirtless man with a smiling face revealing his white teeth. To his left, there is a man wearing a black tank top and holding a fruit in his hand. 
What kind of hair does the shirtless man have?", "question_wo_referring_query": "What kind of hair does the shirtless man have?", "candidates": ["He has short orange hair", "He has long orange hair", "He has long black hair", "He has short black hair", "He is bald"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "VBObGVuUHwc_0", "video_path": "VBObGVuUHwc.mp4", "subtitle_path": "VBObGVuUHwc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 982.27, "view_count": 59741}, {"video_id": "VBObGVuUHwc", "question": "In the distance, there is a multi-story tower standing tall, with a tree beside it. Nearby, there are three men wearing pointed round hats, looking at the camera. The man on the far right is wearing sunglasses and a hat. The man on the left is wearing a colorful coat. What are the colors of the frames and lenses of the glasses of the man in the middle?", "question_wo_referring_query": "In the distance, there is a multi-story tower standing tall, with a tree beside it. Nearby, what are the colors of the frames and lenses of the glasses of the man in the middle?", "candidates": ["White frames and blue lenses", "White frames and black lenses", "Red frames and black lenses", "Black frames and black lenses", "Gold frames and black lenses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "VBObGVuUHwc_1", "video_path": "VBObGVuUHwc.mp4", "subtitle_path": "VBObGVuUHwc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 982.27, "view_count": 59741}, {"video_id": "VBObGVuUHwc", "question": "In the kitchen, there is a person looking at the stove fire, and another person beside a shelf. Outside the kitchen, there are two men: the one on the left is wearing a white hat and a white short-sleeve shirt, raising his right hand and smiling happily, showing his clean white teeth. 
The man on the right is wearing a hat and a black coat. What kind of hat is the man wearing who is dressed in black?", "question_wo_referring_query": "In the kitchen, there is a person looking at the stove fire, and another person beside a shelf. Outside the kitchen, there are two men: the one on the left is wearing a white hat and a white short-sleeve shirt, raising his right hand and smiling happily, showing his clean white teeth. The man on the right is wearing a hat and a black coat. What kind of hat is the man wearing who is dressed in black?", "candidates": ["Black duckbill hat", "White round hat", "White duckbill hat", "Large pointed black hat", "Black top hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "VBObGVuUHwc_2", "video_path": "VBObGVuUHwc.mp4", "subtitle_path": "VBObGVuUHwc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 982.27, "view_count": 59741}, {"video_id": "6vLG6Eo1qTI", "question": "The street in the video is flooded, and in front of a mobile house, a woman wearing a purple coat is taking a picture with a smartphone in both hands. 
What kind of hat is the woman wearing when she appears with the subtitle 'is a boring hazard it's just slowly'?", "question_wo_referring_query": "What kind of hat is the woman wearing when she appears with the subtitle 'is a boring hazard it's just slowly'?", "candidates": ["a cream-colored knitted hat", "a cream-colored leather cap", "a purple knitted hat", "a black leather cap", "a black knitted hat"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "6vLG6Eo1qTI_0", "video_path": "6vLG6Eo1qTI.mp4", "subtitle_path": "6vLG6Eo1qTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1126.76, "view_count": 26132}, {"video_id": "6vLG6Eo1qTI", "question": "In a room with a map hanging on the wall, a man with short black hair is holding a poster that says 'TSUNAMI EVACUATION ROUTE'. When this man appears with the subtitle 'got to put these things up all around', what kind of outerwear is he wearing?", "question_wo_referring_query": "In a room with a map hanging on the wall, a man with short black hair is holding a poster that says 'TSUNAMI EVACUATION ROUTE'. When this man appears with the subtitle 'got to put these things up all around', what kind of outerwear is he wearing?", "candidates": ["gray short-sleeve outerwear", "black short-sleeve outerwear", "gray long-sleeve leather jacket", "gray long-sleeve outerwear", "gray long-sleeve knit sweater"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "6vLG6Eo1qTI_1", "video_path": "6vLG6Eo1qTI.mp4", "subtitle_path": "6vLG6Eo1qTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1126.76, "view_count": 26132}, {"video_id": "6vLG6Eo1qTI", "question": "In a wasteland, there are two people facing each other through binoculars. One person is wearing a green helmet and a red coat, while the other person is wearing a green coat. 
Not far from them, a helicopter is parked. When the helicopter and the subtitle 'are pretty far from the eruption itself' appear together, what two colors are on the helicopter?", "question_wo_referring_query": "What two colors are on the helicopter?", "candidates": ["White and green", "Gray and green", "White and gray", "Red and green", "White and red"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "6vLG6Eo1qTI_2", "video_path": "6vLG6Eo1qTI.mp4", "subtitle_path": "6vLG6Eo1qTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1126.76, "view_count": 26132}, {"video_id": "7hIJDOohas8", "question": "On a brown surface lies a wooden board, on which there is an empty bowl. A person wearing a long-sleeved black shirt with tattoos exposed is cutting food on the wooden board. What is the food being cut in the video?", "question_wo_referring_query": "What is the food being cut in the video?", "candidates": ["Mushroom", "Yellow squash", "Butternut squash", "Tomato", "Spaghetti squash"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "7hIJDOohas8_0", "video_path": "7hIJDOohas8.mp4", "subtitle_path": "7hIJDOohas8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 914.25, "view_count": 101993}, {"video_id": "7hIJDOohas8", "question": "On an olive-colored surface, a person is placing a piece of chicken into an empty dish with white stripes while holding a knife in their right hand. 
Which part of the chicken is being placed into the dish?", "question_wo_referring_query": "Which part of the chicken is being placed into the dish?", "candidates": ["chicken thigh", "chicken wing", "chicken leg", "chicken breast", "chicken head"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "7hIJDOohas8_1", "video_path": "7hIJDOohas8.mp4", "subtitle_path": "7hIJDOohas8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 914.25, "view_count": 101993}, {"video_id": "7hIJDOohas8", "question": "In a room of a mining site, there is a wooden table. On the table, there are four white bowls and two other colored bowls containing food. A person is using black chopsticks to put rice into one of the bowls. Who is this person?", "question_wo_referring_query": ", who is this person?", "candidates": ["A man wearing a black long-sleeve coat with patterns on the sleeves", "A man wearing a black short-sleeve coat with patterns on the sleeves", "A man with long golden hair", "A man wearing a black long-sleeve coat without patterns on the sleeves", "A woman wearing a black long-sleeve coat with patterns on the sleeves"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "7hIJDOohas8_2", "video_path": "7hIJDOohas8.mp4", "subtitle_path": "7hIJDOohas8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 914.25, "view_count": 101993}, {"video_id": "JwCFhKP0OY8", "question": "On a gray countertop, there is a green plant in a pot. Next to the green plant, there is a wooden board, and on the wooden board, there is an onion. In the bottom left corner of the screen, the text 'Zwiebel - 1 Stck' appears. 
What happens when the subtitle 'Onion - 1 pc' appears?", "question_wo_referring_query": "What happens?", "candidates": ["The onion is thrown away.", "Someone takes the green plant away.", "A pair of hands takes the onion away.", "Someone takes the wooden board away.", "A pair of hands takes a knife and cuts the onion."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "JwCFhKP0OY8_0", "video_path": "JwCFhKP0OY8.mp4", "subtitle_path": "JwCFhKP0OY8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1198.47, "view_count": 2968440}, {"video_id": "JwCFhKP0OY8", "question": "There is an iron pot on the screen with chopped green onions and carrots inside. In the bottom left corner of the screen, there are white letters that say '1-2 Minuten braten'. When the subtitle 'Fry for 1-2 minutes' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The ingredients in the pot are poured out", "An iron spatula stirs back and forth in the pot", "A wooden spatula stirs back and forth in the pot", "Someone takes away the iron pot", "Someone adds water to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "JwCFhKP0OY8_1", "video_path": "JwCFhKP0OY8.mp4", "subtitle_path": "JwCFhKP0OY8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1198.47, "view_count": 2968440}, {"video_id": "JwCFhKP0OY8", "question": "Placed the sliced meat on a wooden board, a hand appears on the screen, when the subtitles say 'season on both sides.' 
What happens on the screen at that moment?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The meat on the wooden board is taken away.", "The hand flips the meat on the wooden board.", "Someone cuts the meat into pieces with a knife.", "The meat is put into a cast iron pot.", "Someone sprinkles pepper on the meat."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "JwCFhKP0OY8_2", "video_path": "JwCFhKP0OY8.mp4", "subtitle_path": "JwCFhKP0OY8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1198.47, "view_count": 2968440}, {"video_id": "8RhRokdQtoo", "question": "In a room filled with photos, a man wearing a white short-sleeved shirt opens his palm towards the camera. What did he do after this action?", "question_wo_referring_query": ", what did he do after this action?", "candidates": ["He put his left hand into his pocket.", "He extended two fingers.", "He held his forehead with his right hand.", "He sat on the ground.", "He clasped his own hands together."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "8RhRokdQtoo_0", "video_path": "8RhRokdQtoo.mp4", "subtitle_path": "8RhRokdQtoo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 968.2, "view_count": 763020}, {"video_id": "8RhRokdQtoo", "question": "In a room filled with photos, a man wearing a white short-sleeve shirt has his left hand in his pocket and his right hand pointing with two fingers towards some text to his left. 
What does the man do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["He raises his left hand", "He crosses his arms in front of his chest", "He spreads his palm open", "He retracts the two fingers he was pointing with", "He extends a single finger"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "8RhRokdQtoo_1", "video_path": "8RhRokdQtoo.mp4", "subtitle_path": "8RhRokdQtoo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 968.2, "view_count": 763020}, {"video_id": "8RhRokdQtoo", "question": "In a room full of photos, a man wearing a white short-sleeved shirt raises both hands and makes a V sign. In a photo in the upper right corner of the screen, a woman in a green dress is dancing. What does the man do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["He puts his left hand into his pocket.", "He retracts the two extended fingers.", "He crosses his arms in front of his chest.", "He spreads his remaining fingers.", "He waves both hands from left to right."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "8RhRokdQtoo_2", "video_path": "8RhRokdQtoo.mp4", "subtitle_path": "8RhRokdQtoo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 968.2, "view_count": 763020}, {"video_id": "k6WIN7Ysygc", "question": "In the gray background image, there is a white circle with a person wearing a hat with star patterns on their shoulders inside the circle. 
After this scene appears, which design appears first?", "question_wo_referring_query": "Which design appears first?", "candidates": ["A white circle with the American flag design", "A white circle with a person holding a gun", "A white circle with a star design", "A white circle with a design of two people talking", "A white circle with a sword and a spear"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "k6WIN7Ysygc_0", "video_path": "k6WIN7Ysygc.mp4", "subtitle_path": "k6WIN7Ysygc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 946.3, "view_count": 881599}, {"video_id": "k6WIN7Ysygc", "question": "In the gray background image, there is only one white circle. Inside the circle, there is a red shield-shaped emblem with a raised line. At the center of the emblem, there is an exclamation mark. After this screen, what is the first emblem that appears?", "question_wo_referring_query": ", what is the first emblem that appears?", "candidates": ["A white circle with a depiction of a submarine", "A white circle with a depiction of the Star-Spangled Banner and the flag of the United States", "A white circle with a depiction of a sword and shield", "A white circle with a depiction of a missile", "A white circle with a depiction of a mailbox"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "k6WIN7Ysygc_1", "video_path": "k6WIN7Ysygc.mp4", "subtitle_path": "k6WIN7Ysygc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 946.3, "view_count": 881599}, {"video_id": "k6WIN7Ysygc", "question": "On a grey background image, there is only one white circle. Inside the circle, there is a design with crossed swords and a globe. 
What is the first design to appear after this screen?", "question_wo_referring_query": "What is the first design to appear after this screen?", "candidates": ["A white circle with a star design", "A white circle with a hat design", "A white circle with a design of crossed pencils and a ruler", "A white circle with a gear design", "A white circle with an airplane design"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "k6WIN7Ysygc_2", "video_path": "k6WIN7Ysygc.mp4", "subtitle_path": "k6WIN7Ysygc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 946.3, "view_count": 881599}, {"video_id": "VVh0wxi5woI", "question": "Two men are seated in front of a racing game machine, the man on the left is wearing gray clothes, and the man on the right is wearing a white hat and white clothes. After the subtitle mentions 'I let those guys win fun yeah at least I', what does the man with the hat do?", "question_wo_referring_query": "What action does the man with the hat make?", "candidates": ["He stands up", "He touches his nose with his hand", "He slumps over the steering wheel", "He is playing the racing game", "He leaves the seat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "VVh0wxi5woI_0", "video_path": "VVh0wxi5woI.mp4", "subtitle_path": "VVh0wxi5woI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1113.11, "view_count": 328547}, {"video_id": "VVh0wxi5woI", "question": "On a sunny and bright street, the person wearing a gray short sleeve and holding a camera is walking side by side with the person wearing a yellow hat and white short sleeve. 
After the subtitle \u3010music\u3011 appears, what action does the man in white immediately perform?", "question_wo_referring_query": ", what action does the man in white immediately perform?", "candidates": ["He hugged the person next to him", "He raised his left hand", "He squatted down on the ground", "He passed the camera", "He made a V-sign with his hand"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "VVh0wxi5woI_1", "video_path": "VVh0wxi5woI.mp4", "subtitle_path": "VVh0wxi5woI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1113.11, "view_count": 328547}, {"video_id": "VVh0wxi5woI", "question": "The wall is covered with menus written in Japanese. A man in a gray short-sleeved shirt is drinking. After the subtitle \u3010Music\u3011 appears, what does he do?", "question_wo_referring_query": "What does he do?", "candidates": ["He is playing a car game.", "He climbs the shelf.", "He eats a piece of sushi.", "He raises his finger.", "He puts the glass on the table."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "VVh0wxi5woI_2", "video_path": "VVh0wxi5woI.mp4", "subtitle_path": "VVh0wxi5woI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1113.11, "view_count": 328547}, {"video_id": "L5mPQ4rfoF8", "question": "There are three men standing in front of the plants, wearing crowns and holding scepters. The man in the middle is pointing with two fingers on his right hand. 
Before the subtitle mentions 'history show it's about art history this,' who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A man with long camel-colored hair wearing a black hat", "A man wearing a gray suit with a ponytail", "A man wearing a camel-colored coat with curly hair", "A woman wearing a camel-colored coat with curly hair", "A child wearing a red hat"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "L5mPQ4rfoF8_0", "video_path": "L5mPQ4rfoF8.mp4", "subtitle_path": "L5mPQ4rfoF8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1659.62, "view_count": 114445}, {"video_id": "L5mPQ4rfoF8", "question": "On the grey surface, there is a silver plate and a silver jug. Behind the silver jug, there is a much larger golden jug. After the subtitle mentions 'objects that would have defined Tudor', what is the object that appears on the screen?", "question_wo_referring_query": "What is the object that appears on the screen?", "candidates": ["Gilded porcelain bowl", "Silver candlestick", "Silver plate", "Silver jug", "A suit of armor with floral patterns"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "L5mPQ4rfoF8_1", "video_path": "L5mPQ4rfoF8.mp4", "subtitle_path": "L5mPQ4rfoF8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1659.62, "view_count": 114445}, {"video_id": "L5mPQ4rfoF8", "question": "A man and a woman are standing back-to-back in front of a mirror, describing a portrait hanging on a gray wall. 
After the subtitle mentions 'become perceived on a l\u2018d say Global', who appears in the video?", "question_wo_referring_query": "Who appears in the video?", "candidates": ["A man wearing a red cape and holding a scepter", "A man with long black hair wearing a green coat", "A man wearing a gray suit and without a mustache", "A woman with long straight black hair wearing a gray overcoat", "A man with short black hair, wearing a green suit, and without a mustache"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "L5mPQ4rfoF8_2", "video_path": "L5mPQ4rfoF8.mp4", "subtitle_path": "L5mPQ4rfoF8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1659.62, "view_count": 114445}, {"video_id": "QKFHHJqA9DE", "question": "Amid the distant, withered yellow mountain peaks, small patches of snow are intermittently visible. A man wearing a black hat and a colorful coat is sprinkling salt on the meat in front of him. In which scenes does this man appear?", "question_wo_referring_query": "In which scenes does this man appear?", "candidates": ["Next to a fire pit and a windbreak made of stacked stones", "Inside a small hut made of logs", "Next to a black beehive", "Next to a turkey feeding on food", "Beside a waterfall"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "QKFHHJqA9DE_0", "video_path": "QKFHHJqA9DE.mp4", "subtitle_path": "QKFHHJqA9DE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 999.17, "view_count": 818525}, {"video_id": "QKFHHJqA9DE", "question": "On a flat ground surrounded by mountains, a man with short black hair and wearing a black coat is skewering kebabs on an iron skewer. 
Where have these kebabs appeared?", "question_wo_referring_query": "Where have these kebabs appeared?", "candidates": ["In a black iron pot", "In a black iron kettle", "In a transparent glass", "On grey and white charcoal", "On a wooden tray"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "QKFHHJqA9DE_1", "video_path": "QKFHHJqA9DE.mp4", "subtitle_path": "QKFHHJqA9DE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 999.17, "view_count": 818525}, {"video_id": "QKFHHJqA9DE", "question": "In an open area surrounded by mountain peaks, a sheep is being roasted on a wooden rack. A man dressed in black clothes is lifting a black iron pot using an iron fork with a hook. In which scenes does this black iron pot appear?", "question_wo_referring_query": "In which scenes does this black iron pot appear?", "candidates": ["On a platform made of round tree branches", "In the hands of a little girl wearing a black coat", "On a polished wooden plate", "In the hands of a little boy wearing a blue hat", "In the hands of a little girl wearing a red hat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "QKFHHJqA9DE_2", "video_path": "QKFHHJqA9DE.mp4", "subtitle_path": "QKFHHJqA9DE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 999.17, "view_count": 818525}, {"video_id": "oy_Rj8qLcyo", "question": "Against a black background, there's an image of a sun emitting rays in the upper left corner. Below the image, the text 'Imperial Japanese Forces' appears. 
With which subtitles does this image simultaneously appear?", "question_wo_referring_query": ", with which subtitles does this image simultaneously appear?", "candidates": ["Various Views on 'How to Win'", "This is quite an extensive", "High Command", "Socialist racial hierarchy", "for treating prisoners of war properly, to be very diplomatic"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "oy_Rj8qLcyo_0", "video_path": "oy_Rj8qLcyo.mp4", "subtitle_path": "oy_Rj8qLcyo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.5, "view_count": 160634}, {"video_id": "oy_Rj8qLcyo", "question": "Against a black background, there is only one circle in the upper left corner. Inside the circle is an eye graphic, and below the graphic are the words 'Various Views on \u201cHow to Win\u201d'. Which subtitles did this graphic appear with?", "question_wo_referring_query": "Which subtitles did this graphic appear with?", "candidates": ["were the so-called \u2018criminal orders\u2019", "branch of the Wehrmacht.", "Some of the German atrocities were integral part of the nazi ideology.", "Imperial Japanese Forces", "This is very important to keep in mind"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "oy_Rj8qLcyo_1", "video_path": "oy_Rj8qLcyo.mp4", "subtitle_path": "oy_Rj8qLcyo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.5, "view_count": 160634}, {"video_id": "oy_Rj8qLcyo", "question": "In a black background, there is only one circle in the upper left corner. Inside the circle is a figure wearing a hat and decorated with shoulder stars. Below the figure, there are the words 'High Command'. 
With which subtitles did this image appear together?", "question_wo_referring_query": "With which subtitles did this image appear together?", "candidates": ["immoral also by the standards of the time", "with a pseudo-legal and disciplinary framework", "formations under civil administration", "war on the Eastern Front", "Imperial Japanese Forces"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "oy_Rj8qLcyo_2", "video_path": "oy_Rj8qLcyo.mp4", "subtitle_path": "oy_Rj8qLcyo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.5, "view_count": 160634}, {"video_id": "iI_IMuFf9rw", "question": "In the white background, there are the words 'Autoencoder Lecture 10' and a red dot. In the bottom right corner of the screen, there is a person wearing a blue jacket and glasses. The left lens of this man is reflecting light. When this man and the text 'Autoencoder Encoder-decoder' appear together, what change occurs in him?", "question_wo_referring_query": "What change occurs in him?", "candidates": ["The item he was holding in his right hand turns into a microphone", "The item he was holding in his right hand turns into a laser pointer", "The item he was holding in his right hand turns into a mobile phone", "The item he was holding in his right hand disappears", "The item he was holding in his right hand turns into glasses"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "iI_IMuFf9rw_0", "video_path": "iI_IMuFf9rw.mp4", "subtitle_path": "iI_IMuFf9rw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.68, "view_count": 258}, {"video_id": "iI_IMuFf9rw", "question": "In the center of a white background, there is a red dot, and on the left side of the screen, there are words 'Autoencoder Reconstruction Latent vector of size 2 Compression from 28\u00d728'. 
In the lower right corner of the screen, there is a man wearing a blue jacket. What change occurs to the man in the lower right corner when the red dot moves to the lower part of the letter 'r' in the first word of the third row on the left side?", "question_wo_referring_query": "What change occurs to the man in the lower right corner?", "candidates": ["He raises both hands above his head", "His glasses do not reflect light", "He raises his right hand", "He closes his eyes", "He lowers his left hand"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "iI_IMuFf9rw_1", "video_path": "iI_IMuFf9rw.mp4", "subtitle_path": "iI_IMuFf9rw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.68, "view_count": 258}, {"video_id": "iI_IMuFf9rw", "question": "In the upper left corner of the white background, there is the text 'Feature Learning' below UCF, while there is a man wearing glasses and a blue jacket in the lower right corner of the screen. After the text on the screen changes to 'Properties,' what change occurs to this man?", "question_wo_referring_query": ", what change occurs to this man?", "candidates": ["The left lens of his glasses does not reflect light", "He places his hand on his chin", "His lenses do not reflect light", "He changes into a black jacket", "Both of his lenses reflect light"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "iI_IMuFf9rw_2", "video_path": "iI_IMuFf9rw.mp4", "subtitle_path": "iI_IMuFf9rw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.68, "view_count": 258}, {"video_id": "6OXt-Yywz7s", "question": "In the room, there is a large mirror reflecting a green wall. A black stuffed animal is hanging from the ceiling. A woman wearing glasses and dressed in black clothes is standing in front of the mirror, closing her eyes and waving her hands. 
When the subtitles mention 'skating today, maybe go to a museum, treat myself to a new book at the bookstore; it\u2019s just me', what does the woman do immediately after?", "question_wo_referring_query": "What does the woman do immediately after?", "candidates": ["She turned around with her back facing the mirror", "She pointed her finger at the mirror", "She picked up a cat", "She turned off the light", "She was eating from a green bowl in front of the mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "6OXt-Yywz7s_0", "video_path": "6OXt-Yywz7s.mp4", "subtitle_path": "6OXt-Yywz7s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1142.58, "view_count": 219034}, {"video_id": "6OXt-Yywz7s", "question": "In a room with green walls and a white door, there is a woman with black hair and glasses. She is wearing a white outer garment and khaki pants, with her hands in her pockets. After the subtitle 'what do i think? this is really cute! I like all the different textures! it doesn\u2019t really\u2026', what does this woman do?", "question_wo_referring_query": "What does this woman do?", "candidates": ["She puts on a floral coat", "She fastens her belt", "She puts on khaki pants", "She puts on a white top", "She applies eyeshadow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "6OXt-Yywz7s_1", "video_path": "6OXt-Yywz7s.mp4", "subtitle_path": "6OXt-Yywz7s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1142.58, "view_count": 219034}, {"video_id": "6OXt-Yywz7s", "question": "In the wooden bookshelf in the room, there are some books and decorations placed. A woman wearing a white top is sitting next to the dining table in front of the bookshelf. She is holding a bottled drink and looking at the floor in shock. When the subtitle mentions 'why did I do that?!?!?! 
I didn\u2019t think it\u2019d actually splashed out, what the hell?!?!?', what did this woman do?", "question_wo_referring_query": "What did this woman do?", "candidates": ["She took off her olive-colored jacket", "She bought a book", "She took off her mask", "She removed her earphones", "She poured the drink from the green bottle into a cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "6OXt-Yywz7s_2", "video_path": "6OXt-Yywz7s.mp4", "subtitle_path": "6OXt-Yywz7s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1142.58, "view_count": 219034}, {"video_id": "SW8N4FcfmT4", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a person wearing a purple coat and another wearing a black coat are playing football on a green field. Then, a woman with long hair wearing glasses and a black coat is being interviewed. Finally, a boy with dark skin wearing gray clothes and a girl with fair skin wearing white clothes are watching TV.", "First, a woman with long hair wearing glasses and a black coat is being interviewed. Then, a person wearing a purple coat and another wearing a black coat are playing football on a green field. Finally, a boy with dark skin wearing gray clothes and a girl with fair skin wearing white clothes are watching TV.", "First, a boy with dark skin wearing gray clothes and a girl with fair skin wearing white clothes are watching TV. Then, a person wearing a purple coat and another wearing a black coat are playing football on a green field. Finally, a woman with long hair wearing glasses and a black coat closes her eyes.", "First, a person wearing a purple coat and another wearing a black coat are playing football on a green field. Then, a boy with dark skin wearing gray clothes and a girl with fair skin wearing white clothes are watching TV. 
Finally, a woman with long hair wearing glasses and a black coat is being interviewed.", "First, a woman with long hair wearing glasses and a black coat closes her eyes. Then, a boy with dark skin wearing gray clothes and a girl with fair skin wearing white clothes are watching TV. Finally, a person wearing a purple coat and another wearing a black coat are playing football on a green field."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "SW8N4FcfmT4_0", "video_path": "SW8N4FcfmT4.mp4", "subtitle_path": "SW8N4FcfmT4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.0, "view_count": 509}, {"video_id": "SW8N4FcfmT4", "question": "Which of the following sequence of events is correct?", "question_wo_referring_query": "Which sequence of events is correct from the list below?", "candidates": ["First, a small lamb lowers its head to graze, followed by two big white geese in front of the fence, one stretching its neck, the other lowering its head to eat, and finally an axe deeply embedded in a piece of wood.", "First, an axe is deeply embedded in a piece of wood, followed by a small lamb lowering its head to graze, and finally two big white geese in front of the fence, one stretching its neck, the other lowering its head to eat.", "First, there are two big white geese in front of the fence, one stretching its neck, the other lowering its head to eat, followed by an axe deeply embedded in a piece of wood, and finally a small lamb lowering its head to graze.", "First, there are two big white geese in front of the fence, one stretching its neck, the other lowering its head to eat, followed by a small lamb lowering its head to graze, and finally an axe deeply embedded in a piece of wood.", "First, an axe is deeply embedded in a piece of wood, followed by two big white geese in front of the fence, one stretching its neck, the other lowering its head to eat, and finally a small lamb 
lowering its head to graze."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "SW8N4FcfmT4_1", "video_path": "SW8N4FcfmT4.mp4", "subtitle_path": "SW8N4FcfmT4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.0, "view_count": 509}, {"video_id": "SW8N4FcfmT4", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a boy with blond hair on a football field is holding a microphone with his right hand as the host. Next, a man and two women are sitting in a row in front of a bookshelf, with the man in the middle resting his hands behind the chair. Finally, a little girl with long blonde hair and glasses is being interviewed on the street, with the word 'PRIYA' on her left.", "First, a boy with blond hair on a football field is holding a microphone with his right hand as the host. Next, a little girl with long blonde hair and glasses is being interviewed on the street, with the word 'PRIYA' on her left. Finally, a man and two women are sitting in a row in front of a bookshelf, with the man in the middle resting his hands behind the chair.", "First, a man and two women are sitting in a row in front of a bookshelf, with the man in the middle resting his hands behind the chair. Next, a little girl with long blonde hair and glasses is being interviewed on the street, with the word 'PRIYA' on her left. Finally, a boy with blond hair on a football field is holding a microphone with his right hand as the host.", "First, a man and two women are sitting in a row in front of a bookshelf, with the man in the middle resting his hands behind the chair. Next, a boy with blond hair on a football field is holding a microphone with his right hand as the host. 
Finally, a little girl with long blonde hair and glasses is being interviewed on the street, with the word 'PRIYA' on her left.", "First, a little girl with long blonde hair and glasses is being interviewed on the street, with the word 'PRIYA' on her left. Next, a man and two women are sitting in a row in front of a bookshelf, with the man in the middle resting his hands behind the chair. Finally, a boy with blond hair on a football field is holding a microphone with his right hand as the host."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "SW8N4FcfmT4_2", "video_path": "SW8N4FcfmT4.mp4", "subtitle_path": "SW8N4FcfmT4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.0, "view_count": 509}, {"video_id": "3W35LZiceTA", "question": "On a wooden table in a simple shed, a man wearing a black short-sleeved shirt is holding a knife and cutting fat on the wooden plank. When he appears among a group of children, what change happens to him?", "question_wo_referring_query": ", what change happens to him?", "candidates": ["The knife in his hand turns into a watermelon", "The knife in his hand turns into a hat", "The knife in his hand turns into a teapot", "The knife in his hand turns into a green pepper", "The knife in his hand turns into food"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "3W35LZiceTA_0", "video_path": "3W35LZiceTA.mp4", "subtitle_path": "3W35LZiceTA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1071.77, "view_count": 4072693}, {"video_id": "3W35LZiceTA", "question": "A man extends his left hand to pick up a piece of meat from a bowl. In front of him, there is a silver round plate with a stick inserted into it, placed on a wooden table. 
What change occurs to the silver round plate when it appears next to a brick oven?", "question_wo_referring_query": "What change occurs to it?", "candidates": ["The stick on the silver plate has meat on it.", "The silver plate turned black.", "The stick on the silver plate has an eggplant on it.", "The stick on the silver plate has a green pepper on it.", "The stick on the silver plate broke."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "3W35LZiceTA_1", "video_path": "3W35LZiceTA.mp4", "subtitle_path": "3W35LZiceTA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1071.77, "view_count": 4072693}, {"video_id": "3W35LZiceTA", "question": "In the bright sunlight outside, there are a few small trees on the grassy slope. A sheep is happily running on the grass. When the sheep appears next to the man in a black short-sleeve shirt, what change occurs?", "question_wo_referring_query": "What change occurs?", "candidates": ["Its mouth is full of leaves", "It grew two horns", "It lies down on the grass", "Its mouth is full of food", "Its fur got shaved off"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "3W35LZiceTA_2", "video_path": "3W35LZiceTA.mp4", "subtitle_path": "3W35LZiceTA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1071.77, "view_count": 4072693}, {"video_id": "v7uyRH4TbVU", "question": "There is a wagon hitched to an old roadside wall lantern. Next to the wagon, there is a man wearing a blue long-sleeved shirt with his hands in his pockets. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["A pitchfork", "A black hat", "A sign with the number 19", "A wooden ladder", "A yellow bucket"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "v7uyRH4TbVU_0", "video_path": "v7uyRH4TbVU.mp4", "subtitle_path": "v7uyRH4TbVU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1275.54, "view_count": 270}, {"video_id": "v7uyRH4TbVU", "question": "On the concrete ground by the sea, a man bends over and watches a black puppy running towards him, while another person in blue jeans is also walking towards him. What items are present in this scene?", "question_wo_referring_query": "What items are present in this scene?", "candidates": ["earphones", "bracelet", "car", "whale", "black backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "v7uyRH4TbVU_1", "video_path": "v7uyRH4TbVU.mp4", "subtitle_path": "v7uyRH4TbVU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1275.54, "view_count": 270}, {"video_id": "v7uyRH4TbVU", "question": "In front of a mud-colored castle, there is a flowing river. A man wearing a blue short-sleeved shirt is standing by the river. Next to him, there is the text 'Leaving Mallorca'. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["Sunglasses", "Mobile phone", "Boutonniere", "Gold-framed glasses", "Medal"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "v7uyRH4TbVU_2", "video_path": "v7uyRH4TbVU.mp4", "subtitle_path": "v7uyRH4TbVU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1275.54, "view_count": 270}, {"video_id": "KthoNFLscLU", "question": "Against a white background, the first line of text displays the phrase 'Adversarial Attack Table 1', and the white cursor stops at the third row of the last column in the table. What is the color of the box where the cursor stops?", "question_wo_referring_query": "What is the color of the box where the cursor stops?", "candidates": ["Black", "Red", "Green", "White", "Yellow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "KthoNFLscLU_0", "video_path": "KthoNFLscLU.mp4", "subtitle_path": "KthoNFLscLU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1764.03, "view_count": 891}, {"video_id": "KthoNFLscLU", "question": "In a white background, the first line of text shows the words 'Adversarial Attack Table 3' in some font style, below is a phrase in English that reads 'Trained on different dataset', followed by a table. 
In the table, what is the background color of the cell labeled 'trained on P2'?", "question_wo_referring_query": ", in the table, what is the background color of the cell labeled 'trained on P2'?", "candidates": ["blue", "white", "olive", "orange", "gray"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "KthoNFLscLU_1", "video_path": "KthoNFLscLU.mp4", "subtitle_path": "KthoNFLscLU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1764.03, "view_count": 891}, {"video_id": "KthoNFLscLU", "question": "There are words and drawings on a white background, and there are five pictures in the lower right corner of the screen. The picture includes four flowers and a dog. The white cursor is hovering over the black area of the first picture. In this scene, what is the color of the flower in the third picture?", "question_wo_referring_query": "In this scene, what is the color of the flower in the third picture?", "candidates": ["olive", "white", "purple", "yellow", "pink"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "KthoNFLscLU_2", "video_path": "KthoNFLscLU.mp4", "subtitle_path": "KthoNFLscLU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1764.03, "view_count": 891}, {"video_id": "-mt8aNy1M00", "question": "A man wearing sunglasses, a backward cap, and a black vest is walking along a forest path. 
What are the colors of the frame and lenses of the sunglasses hanging on his chest when the subtitles mention \"dance right now but we\u2019ve been told it\u2019s\"?", "question_wo_referring_query": "What are the colors of the frame and lenses of the sunglasses hanging on this man's chest?", "candidates": ["Orange-red frame and black lenses", "Black frame and black lenses", "White frame and black lenses", "White frame and orange-red lenses", "Black frame and orange-red lenses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "-mt8aNy1M00_0", "video_path": "-mt8aNy1M00.mp4", "subtitle_path": "-mt8aNy1M00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3258.68, "view_count": 10480705}, {"video_id": "-mt8aNy1M00", "question": "In a room with white walls that have 'YES HOUSE' written on them, a man with short hair, wearing blue jeans, is slightly raising his arm while sitting in front of a mirror. When the subtitles mention 'dreams everything the yes theory is', what is he wearing on his upper body?", "question_wo_referring_query": "What is he wearing on his upper body?", "candidates": ["Gray long-sleeved T-shirt", "Blue long-sleeved T-shirt", "Gray striped T-shirt", "Blue short-sleeved T-shirt", "Gray short-sleeved T-shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "-mt8aNy1M00_1", "video_path": "-mt8aNy1M00.mp4", "subtitle_path": "-mt8aNy1M00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3258.68, "view_count": 10480705}, {"video_id": "-mt8aNy1M00", "question": "In the forest, a man wearing a blue short-sleeve shirt and a person with a hat and a red ribbon tied around their arm are sitting together in a row. 
When the subtitle mentions 'long gone thousands of years ago and,' what kind of hat is the person with the red ribbon wearing?", "question_wo_referring_query": "What kind of hat is the person with the red ribbon wearing?", "candidates": ["Blue hat", "Red knitted hat", "Black ceremonial hat", "Gray hat", "Gray duckbill hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "-mt8aNy1M00_2", "video_path": "-mt8aNy1M00.mp4", "subtitle_path": "-mt8aNy1M00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3258.68, "view_count": 10480705}, {"video_id": "715uLCHt4jE", "question": "In the black background, there are two screens. The screen on the left shows a man wearing a blue shirt with some dreads on his head. The screen on the right shows a short-haired man with a numerical chart above his head. In the screens, who is holding a very small book and reading?", "question_wo_referring_query": "In the screens, who is holding a very small book and reading?", "candidates": ["A man with dreads", "A man raising his left hand", "A man with long black hair", "A man with short hair wearing a blue shirt", "A man with short hair wearing a patterned shirt"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "715uLCHt4jE_0", "video_path": "715uLCHt4jE.mp4", "subtitle_path": "715uLCHt4jE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3261.27, "view_count": 366444}, {"video_id": "715uLCHt4jE", "question": "In the video, there are two men. One is wearing a blue shirt, and the other is wearing a black shirt. One is tilting his head to look to the side, while the other has his eyes closed and is facing the camera. 
Who is the person with eyes closed?", "question_wo_referring_query": "Who is the person with eyes closed?", "candidates": ["The man with short hair wearing a black short-sleeved shirt", "The man with short hair wearing a blue shirt", "The man with short hair and a clip on his shirt collar", "The man with long hair wearing a black short-sleeved shirt", "The man with long hair wearing a blue shirt"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "715uLCHt4jE_1", "video_path": "715uLCHt4jE.mp4", "subtitle_path": "715uLCHt4jE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3261.27, "view_count": 366444}, {"video_id": "715uLCHt4jE", "question": "There are three people on the screen. One person is wearing blue clothes, one person is wearing pink clothes, and one person is wearing white clothes. One person is touching a child's head. Who is this person?", "question_wo_referring_query": "Who is this person?", "candidates": ["A woman with brown hair wearing blue clothes", "A man with brown hair wearing pink clothes", "A woman with blonde hair wearing white clothes", "A woman with brown hair wearing pink clothes", "A man with short hair wearing blue clothes"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "715uLCHt4jE_2", "video_path": "715uLCHt4jE.mp4", "subtitle_path": "715uLCHt4jE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3261.27, "view_count": 366444}, {"video_id": "BB8po8ydXpE", "question": "In front of a white wall, there is a bookshelf holding a green plant and a cup. The bookshelf contains many books. A man in a gray shirt is sitting in front of the bookshelf. 
What is the man doing the first time he appears?", "question_wo_referring_query": "What is the man doing the first time he appears?", "candidates": ["He clasped his hands", "He spread his arms", "He stood up", "He placed his left hand on his head", "He placed his right hand on his head"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "BB8po8ydXpE_0", "video_path": "BB8po8ydXpE.mp4", "subtitle_path": "BB8po8ydXpE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.08, "view_count": 82438}, {"video_id": "BB8po8ydXpE", "question": "There are several airplanes parked indoors. In front of an airplane with a three-digit number on its nose, there is a man with a shaved head wearing a black suit and tie. When this man appears for the first time, what is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He is dancing.", "He raises his hands above his head.", "He takes off his overcoat.", "He jumps off the ground.", "He is shaking his hands."], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "BB8po8ydXpE_1", "video_path": "BB8po8ydXpE.mp4", "subtitle_path": "BB8po8ydXpE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.08, "view_count": 82438}, {"video_id": "BB8po8ydXpE", "question": "There are several planes parked indoors, and in front of a plane with a 3-digit number, there is a bald man wearing a black suit. To his left, standing on a platform, is another man in black. 
When this man in black appears for the first time, what is he doing?", "question_wo_referring_query": "When this man in black appears for the first time, what is he doing?", "candidates": ["He attacks the man in front", "He jumps down from the platform", "He leaves from the doorway", "He is walking on the platform", "He falls to the ground"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "BB8po8ydXpE_2", "video_path": "BB8po8ydXpE.mp4", "subtitle_path": "BB8po8ydXpE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.08, "view_count": 82438}, {"video_id": "-4OlXmGkPfY", "question": "There is a chair in front of a wall with a painting hanging on it. In front of the chair, there is a man wearing a white shirt and a green military uniform toy. When the subtitle mentions 'well,' what is the man in the white shirt doing?", "question_wo_referring_query": "What is the man in the white shirt doing?", "candidates": ["The man in the white shirt broke the wine glass.", "The man in the white shirt is drinking alcohol.", "The green toy soldier is drinking alcohol.", "The green toy soldier broke the wine glass.", "The green toy soldier put down the wine glass."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "-4OlXmGkPfY_0", "video_path": "-4OlXmGkPfY.mp4", "subtitle_path": "-4OlXmGkPfY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1457.09, "view_count": 21557}, {"video_id": "-4OlXmGkPfY", "question": "There is a chair in front of the wall with a painting hanging on it. In front of the chair, there is a man wearing a white shirt and a green doll in a military uniform. 
When the subtitles mention 'that again that\u2019s what people told me', what did the green doll do?", "question_wo_referring_query": "What did the green doll do?", "candidates": ["The green doll slid under the chair.", "The green doll turned to face the man in the white shirt.", "The green doll turned its back to the man in the white shirt.", "The green doll took off its hat.", "The green doll flew up."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "-4OlXmGkPfY_1", "video_path": "-4OlXmGkPfY.mp4", "subtitle_path": "-4OlXmGkPfY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1457.09, "view_count": 21557}, {"video_id": "-4OlXmGkPfY", "question": "In front of the wall with the painting hanging on it, there's a chair. In front of the chair, there's a man wearing a white shirt with his left hand raised and a green toy soldier. They are looking at each other. When the subtitle mentions 'something they like right yeah well then', what is the man in the shirt doing?", "question_wo_referring_query": "What is the man in the shirt doing?", "candidates": ["He throws the toy soldier away", "He takes a sip of alcohol", "He is raising his right hand", "He touches the toy soldier", "He is putting down his left hand"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "-4OlXmGkPfY_2", "video_path": "-4OlXmGkPfY.mp4", "subtitle_path": "-4OlXmGkPfY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1457.09, "view_count": 21557}, {"video_id": "xW8cvel5G1w", "question": "In a room, there is a tank with the number 3 on it. Next to the tank, there are two men: one wearing an olive green coat with his hands in his pockets, and another man in black pointing sideways with his index finger. 
Before the man in black extends his hand to point, what happens to the tank?", "question_wo_referring_query": "What happens to the tank?", "candidates": ["The man in the olive green coat climbs onto the tank", "A bald man wearing a mask enters the tank", "A bald man wearing a mask paints graffiti on the tank", "A bald man wearing a mask opens the hatch of the tank", "The barrel of the tank rotates"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "xW8cvel5G1w_0", "video_path": "xW8cvel5G1w.mp4", "subtitle_path": "xW8cvel5G1w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1415.12, "view_count": 702829}, {"video_id": "xW8cvel5G1w", "question": "From the perspective of a tank view, the green tank has two holes at the top through which the inner wall of the tank can be seen. What did the crew member inside the hole on the left side of the screen do after extending his right hand?", "question_wo_referring_query": "What did the crew member inside the hole on the left side of the screen do after extending his right hand?", "candidates": ["Crawled out of the hole", "Took something into the hole from the outside", "Extended his head out of the hole", "Moved the hatch cover from the left side hole to the top of the hole"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "xW8cvel5G1w_1", "video_path": "xW8cvel5G1w.mp4", "subtitle_path": "xW8cvel5G1w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1415.12, "view_count": 702829}, {"video_id": "xW8cvel5G1w", "question": "When the light shines inside the tank, you can see some internal components and circuits. There are two fire extinguishers on the left side of a man in yellow clothing. 
What does this man do after slowly bending down to place his leg inside?", "question_wo_referring_query": "What does this man do after slowly bending down to place his leg inside?", "candidates": ["Crouches on the seat", "Leans against the tank wall", "Raises both his legs", "Stands up and extends his right hand towards the mirror", "Lies down flat"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "xW8cvel5G1w_2", "video_path": "xW8cvel5G1w.mp4", "subtitle_path": "xW8cvel5G1w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1415.12, "view_count": 702829}, {"video_id": "hxVlUxr4g3s", "question": "On a wooden surface, a piece of meat is placed, and a man wearing a black short-sleeved shirt raises his right hand to sprinkle seasoning on the meat. After this scene, what is the first insect to appear on the green leaf?", "question_wo_referring_query": "After this scene, what is the first insect to appear on the green leaf?", "candidates": ["spider", "cockroach", "green vegetable bug", "caterpillar", "ant"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "hxVlUxr4g3s_0", "video_path": "hxVlUxr4g3s.mp4", "subtitle_path": "hxVlUxr4g3s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1477.77, "view_count": 6275922}, {"video_id": "hxVlUxr4g3s", "question": "In a sunny yard, a man wearing a black short-sleeved shirt and red gloves is quickly walking while carrying a black pot. 
After this scene, what is the first animal to appear?", "question_wo_referring_query": ", after this scene, what is the first animal to appear?", "candidates": ["dog", "chicken", "sheep", "pig", "bird"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "hxVlUxr4g3s_1", "video_path": "hxVlUxr4g3s.mp4", "subtitle_path": "hxVlUxr4g3s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1477.77, "view_count": 6275922}, {"video_id": "hxVlUxr4g3s", "question": "Several children are sitting in front of a dining table with a floral tablecloth. A child in green clothing is biting into a watermelon with his head down, while a boy wearing glasses beside him is cutting the watermelon with a fork. What is the last animal to appear in this scene?", "question_wo_referring_query": "What is the last animal to appear?", "candidates": ["Goat", "Chicken", "Bird", "Ant", "Dog"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "hxVlUxr4g3s_2", "video_path": "hxVlUxr4g3s.mp4", "subtitle_path": "hxVlUxr4g3s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1477.77, "view_count": 6275922}, {"video_id": "bnjBpumrKkk", "question": "On the green cutting board, there is a pile of meat and three slices of already cut meat. 
After the subtitles say 'here on it which we can go through and,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A man wearing red clothes continues to cut meat slices", "A man wearing white clothes continues to cut meat slices", "A man wearing black clothes puts the meat slices into the oven", "A woman wearing black clothes puts the meat slices into a pot", "A man wearing black clothes continues to cut meat slices"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "bnjBpumrKkk_0", "video_path": "bnjBpumrKkk.mp4", "subtitle_path": "bnjBpumrKkk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 958.4, "view_count": 587700}, {"video_id": "bnjBpumrKkk", "question": "On a brown wooden board, a white bowl contains some white rectangular powder. After the subtitle mentions 'microwave quickly and pulled apart time,' what happens on the screen?", "question_wo_referring_query": "what happens on the screen?", "candidates": ["A man in a black short-sleeved shirt presses the button on a rice cooker.", "A man in a black short-sleeved shirt places a pot on the stove.", "A man in a black short-sleeved shirt is moving a flowerpot.", "A man in a black short-sleeved shirt opens the oven.", "A man in a black short-sleeved shirt prepares to open the fridge."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "bnjBpumrKkk_1", "video_path": "bnjBpumrKkk.mp4", "subtitle_path": "bnjBpumrKkk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 958.4, "view_count": 587700}, {"video_id": "bnjBpumrKkk", "question": "In front of a white cabinet with green plants, a man wearing a black short-sleeved shirt is holding a green bowl, standing in front of a green wooden board that has a pot. 
What happened on the screen before the subtitle says 'seeds and then we've got some GOI'?", "question_wo_referring_query": "What happened on the screen?", "candidates": ["Pouring condiments into an octagonal pot", "Pouring water into an octagonal pot", "Pouring herbs into an octagonal pot", "Pouring herbs into an octagonal pot", "Pouring wide flour strips into an octagonal pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "bnjBpumrKkk_2", "video_path": "bnjBpumrKkk.mp4", "subtitle_path": "bnjBpumrKkk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 958.4, "view_count": 587700}, {"video_id": "JIVUfS-YXeY", "question": "In the scene, a woman wearing a necklace and a black leather jacket is sitting in the car looking at the mirror. After she says 'my God' in the subtitles, what is the first object that appears on the screen?", "question_wo_referring_query": "What is the first object that appears on the screen?", "candidates": ["white towel", "black towel", "a cutting board", "a tablet", "a pair of scissors"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "JIVUfS-YXeY_0", "video_path": "JIVUfS-YXeY.mp4", "subtitle_path": "JIVUfS-YXeY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.49, "view_count": 54363}, {"video_id": "JIVUfS-YXeY", "question": "In a dimly lit room with a curtain hanging in the background, a woman wearing earrings and a black top is sitting in front of a mirror. 
After she says in the subtitles, 'okay I am back and I want to do a quick,' what object appears on the screen first?", "question_wo_referring_query": "What object appears on the screen first?", "candidates": ["Pink packaged snack", "Purple packaged snack", "Blue packaged snack", "Pink bagged snack", "Black packaged snack"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "JIVUfS-YXeY_1", "video_path": "JIVUfS-YXeY.mp4", "subtitle_path": "JIVUfS-YXeY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.49, "view_count": 54363}, {"video_id": "JIVUfS-YXeY", "question": "In a dark room, a woman wearing a necklace and a black floral dress is making a phone call. After the subtitle says 'oh okay yeah can you please do that okay', what object appears on the screen first?", "question_wo_referring_query": "What object appears on the screen first?", "candidates": ["Thermos", "Rag", "White plate", "Bamboo slip", "Water bottle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "JIVUfS-YXeY_2", "video_path": "JIVUfS-YXeY.mp4", "subtitle_path": "JIVUfS-YXeY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.49, "view_count": 54363}, {"video_id": "YQw3ZOq6Zfs", "question": "When a woman wearing a black suit over a white blouse with golden hair appears on the screen, there is an image of a newspaper filled with English letters and pictures of three characters next to her. What changes occur to the newspaper image when the screen switches to a woman with black clothes and flaxen hair?", "question_wo_referring_query": "When a woman wearing a black suit over a white blouse with golden hair appears on the screen, there is an image of a newspaper filled with English letters and pictures of three characters next to her. 
What changes occur to the newspaper image when the screen switches to a woman with black clothes and flaxen hair?", "candidates": ["The newspaper image moves from the right side of the screen to the top right", "The content on the newspaper image changes", "The newspaper image moves from the right side of the screen to the top left", "The newspaper image moves from the right side of the screen to the left side", "The newspaper image changes from small to large"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "YQw3ZOq6Zfs_0", "video_path": "YQw3ZOq6Zfs.mp4", "subtitle_path": "YQw3ZOq6Zfs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1115.76, "view_count": 12049}, {"video_id": "YQw3ZOq6Zfs", "question": "A man wearing a black suit with a red and white polka dot tie is on the right side of the screen. There is a newspaper picture of a woman wearing black-framed glasses next to him. When the newspaper picture on the right changes to a yellow background with two pictures of women at the top and a watch at the bottom right, what changes occur in the news footage?", "question_wo_referring_query": "What changes occur in the news footage?", "candidates": ["The man in the black suit changes from not wearing glasses to wearing glasses.", "The man's black suit changes to a brown suit.", "The single frame of the man wearing a black suit changes to three people conversing and discussing.", "The man wearing a black suit changes to a blonde woman wearing a black suit.", "The man wearing a black suit changes to a woman with colored hair wearing black clothes."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "YQw3ZOq6Zfs_1", "video_path": "YQw3ZOq6Zfs.mp4", "subtitle_path": "YQw3ZOq6Zfs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1115.76, "view_count": 12049}, {"video_id": "YQw3ZOq6Zfs", "question": "In 
the news footage, two women, both wearing black suits with black outerwear, are conversing. On the right side of the screen, there is a newspaper image featuring a woman wearing black-rimmed glasses. When three people appear in the footage, and a green-themed promotional cover appears on the large screen from the newspaper image, what changes occur to the newspaper image on the right?", "question_wo_referring_query": "What changes occur to the newspaper image on the right?", "candidates": ["The newspaper image on the right shifts to the lower left side", "The newspaper image on the right enlarges", "The newspaper image on the right disappears", "The newspaper image on the right shifts to the upper left corner", "The newspaper image on the right moves to the left"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "YQw3ZOq6Zfs_2", "video_path": "YQw3ZOq6Zfs.mp4", "subtitle_path": "YQw3ZOq6Zfs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1115.76, "view_count": 12049}, {"video_id": "AJr2T1ko3Is", "question": "In the scene, there are five yellow sketches with a background of green trees and rivers. After the subtitle 'harmonious composition than he was in the subject\u00a0matter. A striking contrast to the Impressionists.' appears, what changes occur to these sketches?", "question_wo_referring_query": "In the scene, there are five yellow sketches with a background of green trees and rivers. After the subtitle 'harmonious composition than he was in the subject\u00a0matter. A striking contrast to the Impressionists.' 
appears, what changes occur to these sketches?", "candidates": ["The sketches enlarge in size.", "The sketches gradually disappear and merge with the background.", "The sketches gradually disappear.", "The sketches move from left to right.", "The sketches shrink in size."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "AJr2T1ko3Is_0", "video_path": "AJr2T1ko3Is.mp4", "subtitle_path": "AJr2T1ko3Is_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.38, "view_count": 301766}, {"video_id": "AJr2T1ko3Is", "question": "An oil painting depicts a group of people resting on a grassy field by the lake, some standing with umbrellas while others are sitting under the shade of a tree. After the subtitle 'This is reiterated by the almost complete lack of feet, as if the characters are floating-or on pedestals.' appears, what changes occur in the painting?", "question_wo_referring_query": "An oil painting depicts a group of people resting on a grassy field by the lake, some standing with umbrellas while others are sitting under the shade of a tree. After the subtitle 'This is reiterated by the almost complete lack of feet, as if the characters are floating-or on pedestals.' appears, what changes occur in the painting?", "candidates": ["The painting moves from left to right.", "The painting becomes blurry.", "The whole scene in the painting changes to a close-up depiction.", "The painting moves from right to left.", "The painting changes from color to black and white."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "AJr2T1ko3Is_1", "video_path": "AJr2T1ko3Is.mp4", "subtitle_path": "AJr2T1ko3Is_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.38, "view_count": 301766}, {"video_id": "AJr2T1ko3Is", "question": "In an oil painting, a group of people are resting on a grassy field by a lakeside. 
Some are standing with umbrellas, while others are sitting under the trees. When the subtitle 'In fact, in earlier studies, she is shown alone. Her pet monkey has been interpreted' appears, what change occurs to the woman on the right wearing a black top, a purple skirt, holding a parasol, and wearing a hat?", "question_wo_referring_query": "In an oil painting, a group of people are resting on a grassy field by a lakeside. Some are standing with umbrellas, while others are sitting under the trees. When the subtitle 'In fact, in earlier studies, she is shown alone. Her pet monkey has been interpreted' appears, what change occurs to the woman on the right wearing a black top, a purple skirt, holding a parasol, and wearing a hat?", "candidates": ["Changes from a purple skirt to a black skirt", "Moves from the right side of the painting to the left side", "Goes from wearing a hat to not wearing a hat", "Goes from holding a parasol to not holding a parasol", "Changes into three different materials paintings"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "AJr2T1ko3Is_2", "video_path": "AJr2T1ko3Is.mp4", "subtitle_path": "AJr2T1ko3Is_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.38, "view_count": 301766}, {"video_id": "xBd1oDYrx7o", "question": "A man in a dark blue denim suit is holding a tablet with both hands, standing next to a stone sculpture on the right. What is this man doing?", "question_wo_referring_query": "A man in a dark blue denim suit is holding a tablet with both hands, standing next to a stone sculpture on the right. 
What is this man doing?", "candidates": ["Admiring the stone sculpture", "Introducing himself", "Greeting the camera", "Introducing the stone sculpture to the camera", "Taking a photo of the stone sculpture"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "xBd1oDYrx7o_0", "video_path": "xBd1oDYrx7o.mp4", "subtitle_path": "xBd1oDYrx7o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1491.2, "view_count": 7254}, {"video_id": "xBd1oDYrx7o", "question": "The background is composed of three pieces of fabric with red and green floral patterns, framed with black rectangular lines, and pasted on the wall. What is the man, dressed in a dark blue denim jacket and holding a tablet with both hands, doing in front of the wall?", "question_wo_referring_query": "What is he doing in front of the wall?", "candidates": ["Taking photos of the fabric with the tablet", "Sitting in front of the wall introducing the fabric to the camera", "Waving to the camera", "Standing in front of the wall talking to the camera", "Admiring the fabric on the wall with his back to the camera"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "xBd1oDYrx7o_1", "video_path": "xBd1oDYrx7o.mp4", "subtitle_path": "xBd1oDYrx7o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1491.2, "view_count": 7254}, {"video_id": "xBd1oDYrx7o", "question": "A man in a denim suit is standing on a honey-colored wooden floor, with many wooden-framed carvings neatly arranged on the white wall behind him. 
What action is this man performing?", "question_wo_referring_query": "What action is this man performing?", "candidates": ["Holding a flat board with both hands", "Arms crossed in front of chest", "Holding a camera with both hands", "One hand supporting his head", "Arms naturally hanging down"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "xBd1oDYrx7o_2", "video_path": "xBd1oDYrx7o.mp4", "subtitle_path": "xBd1oDYrx7o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1491.2, "view_count": 7254}, {"video_id": "VzpG7di07E4", "question": "Three women are standing on the stage. The woman on the left is wearing a brown dress with white and gold hair. The woman in the middle is wearing a black top and standing behind a white podium. The woman on the right with short golden hair is wearing a red top and standing behind a white podium. What object is present on the screen at this time?", "question_wo_referring_query": "What object is present on the screen at this time?", "candidates": ["A hat", "A mobile phone", "Green clothing", "A scarf", "A white display screen with English text"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "VzpG7di07E4_0", "video_path": "VzpG7di07E4.mp4", "subtitle_path": "VzpG7di07E4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.76, "view_count": 103694}, {"video_id": "VzpG7di07E4", "question": "Under the blue sky and white clouds, there is a snow-covered, uneven valley. In the distance, there are snowy mountains. 
What object is present in the scene at this time?", "question_wo_referring_query": "What object is present in the scene at this time?", "candidates": ["Dead tree", "River", "Green tree", "Polar bear", "Sun"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "VzpG7di07E4_1", "video_path": "VzpG7di07E4.mp4", "subtitle_path": "VzpG7di07E4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.76, "view_count": 103694}, {"video_id": "VzpG7di07E4", "question": "On the right side of the screen, there are two black cars and one red car waiting in line to pass. On the left side, there is a white pavilion and two workers. The male worker is wearing a high-visibility green vest, a black baseball cap, and holding a piece of white paper with writing. What objects are present on the screen at this moment?", "question_wo_referring_query": "What objects are present on the screen at this moment?", "candidates": ["white van", "a cellphone", "a cat", "a bus", "orange cone-shaped roadblock"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "VzpG7di07E4_2", "video_path": "VzpG7di07E4.mp4", "subtitle_path": "VzpG7di07E4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.76, "view_count": 103694}, {"video_id": "kBNZ8ZRrtF0", "question": "There is a black semi-transparent cross icon in the top right corner of the yellow background. At the top center of the screen, there is a two-line black English title, and below the title, there are four white circular icons. What object is present in the frame when the subtitle 'To a certain degree this was also influenced by Napoleon's Legion of Honor.' 
appears?", "question_wo_referring_query": "What object is present in the frame?", "candidates": ["An airplane", "A cartoon character wearing a black top hat and a blue uniform", "A tank", "A hand grenade", "A handgun"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "kBNZ8ZRrtF0_0", "video_path": "kBNZ8ZRrtF0.mp4", "subtitle_path": "kBNZ8ZRrtF0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.67, "view_count": 888685}, {"video_id": "kBNZ8ZRrtF0", "question": "In the screen, there are two large black crosses hanging from the top, connected by round iron rings. The number at the bottom left is 1939 and the numbers at the bottom right are 1870/1914. When the subtitle 'Note that in previous establishments the backside also included the oak leaves, the initials of the king, and the crown' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["White English text", "Red ribbon", "Airplane", "Soldier in uniform", "Circular emblem"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "kBNZ8ZRrtF0_1", "video_path": "kBNZ8ZRrtF0.mp4", "subtitle_path": "kBNZ8ZRrtF0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.67, "view_count": 888685}, {"video_id": "kBNZ8ZRrtF0", "question": "The top left of the screen has six lines of black English text, the top right has a black cross badge with a red, white, and black ribbon, and the bottom of the screen has black and white icons. When the subtitle 'Yet only the Luftwaffe ace and Stuka pilot Hans Ulrich Rudel received it before the war ended.' 
appears, what object is present in the screen?", "question_wo_referring_query": "What object is present in the screen?", "candidates": ["Mobile phone", "Handgun", "Airplane", "Rocket", "Tank"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "kBNZ8ZRrtF0_2", "video_path": "kBNZ8ZRrtF0.mp4", "subtitle_path": "kBNZ8ZRrtF0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.67, "view_count": 888685}, {"video_id": "vHzPyax20qg", "question": "On the wooden shelf, there are many neatly arranged brown jars of various shapes. The shelf has a white label on which '5' is written. What material are these jars made of?", "question_wo_referring_query": "What material are these jars made of?", "candidates": ["Clay", "Wood", "Iron", "Glass", "Ceramic"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "vHzPyax20qg_0", "video_path": "vHzPyax20qg.mp4", "subtitle_path": "vHzPyax20qg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1334.84, "view_count": 2703043}, {"video_id": "vHzPyax20qg", "question": "A man is standing in front of a wooden table wearing a black jacket over a green military uniform and a pea green hat. There is a row of prepared ingredients on the table. His hand reaches into a brown ceramic jar. 
What material is the hat that the man is wearing made of at this time?", "question_wo_referring_query": "What material is the hat that the man is wearing made of at this time?", "candidates": ["Duckbill hat", "Straw hat", "Denim hat", "Beret", "Wool hat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "vHzPyax20qg_1", "video_path": "vHzPyax20qg.mp4", "subtitle_path": "vHzPyax20qg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1334.84, "view_count": 2703043}, {"video_id": "vHzPyax20qg", "question": "A man dressed in an army green shirt is sitting in front of a wooden table, holding a glass filled with a red liquid in one hand. On the left side of the screen, there is a silver drink dispenser with a transparent tea pot containing steeping tea on top. What drink is the man holding in his hand at this moment?", "question_wo_referring_query": "What drink is the man holding in his hand at this moment?", "candidates": ["Grape Wine", "Red Wine", "Tea", "Beer", "Juice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "vHzPyax20qg_2", "video_path": "vHzPyax20qg.mp4", "subtitle_path": "vHzPyax20qg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1334.84, "view_count": 2703043}, {"video_id": "aG0nqRXLDOg", "question": "There are buildings with red and white walls on both sides of the screen. In the middle of the buildings are three railroad tracks. A shirtless man wearing black pants is standing in the middle of the left track. 
What material is the railroad track made of when the subtitle 'a town called Logan and you just see so' appears?", "question_wo_referring_query": "What material is the railroad track made of?", "candidates": ["Silver", "Glass", "Metal", "Wood", "Iron"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "aG0nqRXLDOg_0", "video_path": "aG0nqRXLDOg.mp4", "subtitle_path": "aG0nqRXLDOg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.23, "view_count": 3122203}, {"video_id": "aG0nqRXLDOg", "question": "A person wearing a red short-sleeved top and blue pants is lying on the muddy ground. Next to them is a yellow wooden door painted with dark green duck designs and a window surrounded by iron wire mesh. When the subtitle 'and it's like is this a movie set is' appears, what material are the woman's pants made of?", "question_wo_referring_query": "What material are the woman's pants made of?", "candidates": ["Jeans", "Linen pants", "Cotton pants", "Leather pants", "Suit pants"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "aG0nqRXLDOg_1", "video_path": "aG0nqRXLDOg.mp4", "subtitle_path": "aG0nqRXLDOg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.23, "view_count": 3122203}, {"video_id": "aG0nqRXLDOg", "question": "On the left side of the screen, there is a white wall with three white spherical lamps. A white car with black and white English graffiti is driving on a concrete road curve, surrounded by barren land without vegetation. 
When the subtitle 'members and killed them on mass and if' appears, what type of car is this?", "question_wo_referring_query": "What type of car is this?", "candidates": ["Truck", "SUV", "Van", "Sedan", "Pickup truck"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "aG0nqRXLDOg_2", "video_path": "aG0nqRXLDOg.mp4", "subtitle_path": "aG0nqRXLDOg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.23, "view_count": 3122203}, {"video_id": "izAxEIj2k6M", "question": "In the video, there is a bronze statue under the sunlight. On the right, there is a woman wearing black clothes, sunglasses, and black gloves, and another woman wearing a green scarf and carrying a backpack. Who is touching the belly of the bronze statue with one hand?", "question_wo_referring_query": "Who is touching the belly of the bronze statue with one hand?", "candidates": ["The man wearing sunglasses", "The man wearing black clothes", "The woman wearing black clothes and sunglasses", "The man wearing the green scarf", "The woman wearing the green scarf"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "izAxEIj2k6M_0", "video_path": "izAxEIj2k6M.mp4", "subtitle_path": "izAxEIj2k6M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.98, "view_count": 155617}, {"video_id": "izAxEIj2k6M", "question": "There is a gray curtain behind the window, a black frame photo album and two red cakes on the table. Each cake has a candle stuck in it. 
Who is the person celebrating their birthday sitting in front of the cakes?", "question_wo_referring_query": "Who is the person celebrating their birthday?", "candidates": ["The woman wearing glasses and a white top", "The short-haired woman wearing a black and white checkered top", "The man wearing a black short-sleeve shirt", "The long-haired man wearing a white top", "The woman with long black hair wearing a white top"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "izAxEIj2k6M_1", "video_path": "izAxEIj2k6M.mp4", "subtitle_path": "izAxEIj2k6M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.98, "view_count": 155617}, {"video_id": "izAxEIj2k6M", "question": "The background features an ink-green wall and a coat rack with a green baseball uniform hanging on it. On the left side, there's a partition made of an ink-green wooden frame and glass. Who is the person holding a steamer and steaming a white garment on the ironing board in the scene?", "question_wo_referring_query": "Who is it?", "candidates": ["A boy wearing a khaki shirt", "A long-haired girl wearing a white shirt", "A short-haired girl wearing a brown inner shirt with black pants and glasses", "A short-haired girl wearing a white shirt", "A girl wearing a green baseball uniform"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "izAxEIj2k6M_2", "video_path": "izAxEIj2k6M.mp4", "subtitle_path": "izAxEIj2k6M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.98, "view_count": 155617}, {"video_id": "knfr0SUqJvI", "question": "The background is a white flat-roofed house and a small wooden door, on a sunny grassy field outside. 
When a man wearing a white short-sleeved shirt with black English print appears for the first time, what is he doing?", "question_wo_referring_query": "What is he doing at that time?", "candidates": ["Waving at a mirror", "Lying on the grass sunbathing", "Sitting outside eating", "Typing on a computer", "Smiling and looking at a computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "knfr0SUqJvI_0", "video_path": "knfr0SUqJvI.mp4", "subtitle_path": "knfr0SUqJvI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.21, "view_count": 3978349}, {"video_id": "knfr0SUqJvI", "question": "In a room with a red leather sofa, a map is on the wall, and on the greyish-brown sofa, there is a man wearing a red short-sleeve shirt and a black watch in front of a laptop. What is he doing the first time he appears?", "question_wo_referring_query": "What is he doing?", "candidates": ["Lying on the sofa looking at the laptop", "Typing on the laptop", "Talking to a friend", "Having a video call with someone", "Waving at the camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "knfr0SUqJvI_1", "video_path": "knfr0SUqJvI.mp4", "subtitle_path": "knfr0SUqJvI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.21, "view_count": 3978349}, {"video_id": "knfr0SUqJvI", "question": "Indoors, two men are standing next to an iron railing. On the white wall in the back left, there's a picture of a woman. The man on the left is wearing a red short-sleeved shirt, while the man on the right has messy hair, is wearing glasses, and is dressed in a denim jacket over a black and white printed shirt. 
When this man on the right first appears, what is he doing?", "question_wo_referring_query": "What is he doing when he first appears?", "candidates": ["Smoking", "Having a conversation with the man next to him", "Drinking alcohol", "Talking on the phone", "Introducing himself in front of a mirror"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "knfr0SUqJvI_2", "video_path": "knfr0SUqJvI.mp4", "subtitle_path": "knfr0SUqJvI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.21, "view_count": 3978349}, {"video_id": "ksOLsvsZVps", "question": "A woman wearing a yellow apron is standing in front of a white marble table that has a metal tray with some black balls and a transparent container with black food ingredients. What is she doing when the subtitles 'balls so this will be like the bodies of' appear on the screen?", "question_wo_referring_query": "What is the woman in the video doing?", "candidates": ["Adding water to the container", "Adding milk to the container", "Pressing the balls to form cakes", "Mixing the food ingredients in the container", "Shaping the food ingredients into small balls with her hands"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ksOLsvsZVps_0", "video_path": "ksOLsvsZVps.mp4", "subtitle_path": "ksOLsvsZVps_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.75, "view_count": 7120207}, {"video_id": "ksOLsvsZVps", "question": "A white cutting board is placed on a white marble table, with a strip of dough and a small knife on it. 
When the subtitle 'gonna flatten one out' appears, what is the pair of hands wearing blue gloves doing on the cutting board?", "question_wo_referring_query": "What are they doing?", "candidates": ["Rolling out dough", "Mixing dough", "Cutting the dough ball", "Flattening a dough ball into a pancake", "Kneading the dough ball"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ksOLsvsZVps_1", "video_path": "ksOLsvsZVps.mp4", "subtitle_path": "ksOLsvsZVps_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.75, "view_count": 7120207}, {"video_id": "ksOLsvsZVps", "question": "The background depicts some handicrafts placed on a table. Two adults and a little girl with long black hair, wearing a purple short-sleeve shirt, are seated in front of a white round table with colored pencils and sketchbooks. When the subtitle 'mummies like to eat sticks yes' appears, what are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Two women and the little girl are drawing", "Two women and one little girl are playing games", "Two women are teaching the little girl to do homework", "Two women are talking with the little girl", "Two women are watching the little girl draw"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ksOLsvsZVps_2", "video_path": "ksOLsvsZVps.mp4", "subtitle_path": "ksOLsvsZVps_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.75, "view_count": 7120207}, {"video_id": "QQ2RaGIYqqE", "question": "In the video, which person appears first?", "question_wo_referring_query": "In the video, which person appears first?", "candidates": ["The man in a blue short-sleeved shirt under the pavilion", "The woman in a red bikini walking towards the camera", "The woman in a gray wrap skirt and black and white striped short-sleeve top standing behind the 
wooden bar", "The man wearing a white T-shirt and dark yellow shorts with a small brown and white dog by his side", "The woman in a white dress holding a silver laptop under the pavilion"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "QQ2RaGIYqqE_0", "video_path": "QQ2RaGIYqqE.mp4", "subtitle_path": "QQ2RaGIYqqE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1411.33, "view_count": 346787}, {"video_id": "QQ2RaGIYqqE", "question": "After the scene with the woman in a gray skirt and black-and-white striped short sleeves greeting the camera from behind the wooden bar counter, who is the first person to appear next in the video?", "question_wo_referring_query": "Who is it?", "candidates": ["A woman in a red bikini.", "A man in white short sleeves and deep yellow shorts with a small palm white dog by his side.", "A woman in a white dress under the wooden shed, holding a silver notebook computer.", "A shirtless man talking in a room facing the camera.", "A man wearing a white baseball cap."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "QQ2RaGIYqqE_1", "video_path": "QQ2RaGIYqqE.mp4", "subtitle_path": "QQ2RaGIYqqE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1411.33, "view_count": 346787}, {"video_id": "QQ2RaGIYqqE", "question": "Who is the first person to appear in the video after the man wearing a white shirt and a black gas mask, standing in the yellow-white smog holding a camera, shows up?", "question_wo_referring_query": "Who is the person?", "candidates": ["The woman in a red bikini", "The man wearing a gray lab coat and jeans, standing with arms crossed on a grassy slope", "The man wearing a white baseball cap", "The woman in a floral swimsuit on a yacht", "The man in a black and white striped shirt talking into a camera in a forest"], "topic_category": 
"LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "QQ2RaGIYqqE_2", "video_path": "QQ2RaGIYqqE.mp4", "subtitle_path": "QQ2RaGIYqqE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1411.33, "view_count": 346787}, {"video_id": "dFLNOBKWlds", "question": "In the video, two men are sitting in front of a gray curtain. The man on the left is wearing a gray suit, dark blue jeans, and a tie, and he is holding a microphone while speaking. The man on the right, who is dressed in a suit, has his hands crossed on his lap and is listening to the man speaking. After the subtitle 'ultimately it is about relationships but' appears, what happens first on the screen?", "question_wo_referring_query": "What happens first on the screen after the subtitle 'ultimately it is about relationships but' appears?", "candidates": ["A video is played on the curtain.", "A black man in a gray sleeveless shirt starts speaking.", "The video cuts from the two men to five people.", "The man sitting on the right starts speaking.", "A person in a white shirt begins to speak."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "dFLNOBKWlds_0", "video_path": "dFLNOBKWlds.mp4", "subtitle_path": "dFLNOBKWlds_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1830.23, "view_count": 31601}, {"video_id": "dFLNOBKWlds", "question": "In the scene, two men are sitting in front of a gray curtain. The man on the left is wearing a gray suit, deep blue jeans, and a tie, while holding a microphone and speaking. The man on the right, dressed in a suit, is sitting with his legs crossed. 
After the subtitle 'then they didn't really want me to shoot' appears, what happens first on the screen?", "question_wo_referring_query": "What happens first on the screen?", "candidates": ["Five men on the stage shake hands and hug in front of the curtain", "A black man wearing a gray sleeveless shirt starts speaking", "The man on the far right, wearing a black suit jacket over a white shirt, sitting with his legs crossed, speaks", "The man speaking on the left picks up a bottle of mineral water from the ground and drinks", "The person wearing a white shirt starts speaking"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "dFLNOBKWlds_1", "video_path": "dFLNOBKWlds.mp4", "subtitle_path": "dFLNOBKWlds_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1830.23, "view_count": 31601}, {"video_id": "dFLNOBKWlds", "question": "In the video, two men are sitting in front of a grey curtain. The man on the left is wearing a grey suit, dark blue jeans, and a tie, holding a microphone and gesturing animatedly while speaking. The man on the right is dressed in a suit jacket and is also holding a microphone and gesturing. 
After the subtitle 'the course of three days so ten eight' appears, what happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["The man on the left picks up a bottle of mineral water from the ground and drinks it.", "A video is played on the curtain in the background.", "Five men on stage shake hands and hug in front of the curtain.", "The camera view switches from the two men to a side view of five men.", "A person in a white shirt starts speaking."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "dFLNOBKWlds_2", "video_path": "dFLNOBKWlds.mp4", "subtitle_path": "dFLNOBKWlds_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1830.23, "view_count": 31601}, {"video_id": "5XHF-kZftDA", "question": "The screen shows a man wearing a black uniform. His hair is light brown, and he is sitting in a place surrounded by glass. Next to him, there's a colorful striped cushion, and behind him is a green plant. In which other scenes has this man appeared?", "question_wo_referring_query": "In which other scenes has this man appeared?", "candidates": ["In front of a wall covered with colorful square tiles.", "A woman in a denim skirt sitting on a green sofa, smiling at a man in front of her.", "On a beach with palm trees.", "In front of a yellow wall with a green sofa below it. The sofa has a yellow pillow, and there is a wooden stick forming a frame on top of the sofa.", "In the sea where a diver is touching the dorsal fin of a shark."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "5XHF-kZftDA_0", "video_path": "5XHF-kZftDA.mp4", "subtitle_path": "5XHF-kZftDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 996.83, "view_count": 3417954}, {"video_id": "5XHF-kZftDA", "question": "The screen shows a man sitting in the back seat of a car. 
He is wearing a pink gown and has black earphones around his neck. Outside the car window, there are green trees and white buildings. In which other scene does this man appear?", "question_wo_referring_query": ", in which other scene does this man appear?", "candidates": ["On a beach with palm trees", "In a green corridor surrounded by three green doors", "In front of a wall decorated with colorful square tiles", "In the sea, where a diver is touching the back fin of a dolphin", "On a road with green trees, surrounded by long grass, and there are utility poles and street lights along the road"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "5XHF-kZftDA_1", "video_path": "5XHF-kZftDA.mp4", "subtitle_path": "5XHF-kZftDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 996.83, "view_count": 3417954}, {"video_id": "5XHF-kZftDA", "question": "In the scene, there is a white dining hall with paintings on the walls, surrounded by some rest benches. A man and a woman are in front of a mirror; the man is wearing a patterned shirt, has a backpack on his shoulders, a camera hanging around his neck, and is holding a bottle in his hand. To his right is a woman wearing a white and pink bikini with a patterned shirt and her hair done up. 
In which other scene does this woman appear?", "question_wo_referring_query": "In which other scene does this woman appear?", "candidates": ["Inside a sightseeing bus with purple benches arranged alongside the bunk beds, with a view of green grassland and blue sky outside the bus window.", "In a plaza by the river.", "In front of a white wall with colorful lightboxes.", "In a wooded area full of fallen leaves.", "On a road with palm trees and buildings, with cars still visible along the roadside."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "5XHF-kZftDA_2", "video_path": "5XHF-kZftDA.mp4", "subtitle_path": "5XHF-kZftDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 996.83, "view_count": 3417954}, {"video_id": "E8t1OAyHeWA", "question": "On a night with a starry sky, there is a tree and a telescope, and a man wearing a gray suit. After the subtitle reads 'to the globe side so without further ado,' what happens on the screen?", "question_wo_referring_query": ", what happens on the screen?", "candidates": ["Four men are having a conversation", "A man and a woman are having a conversation", "Two men are having a conversation", "Two people are fighting", "Two people are driving"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "E8t1OAyHeWA_0", "video_path": "E8t1OAyHeWA.mp4", "subtitle_path": "E8t1OAyHeWA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1836.24, "view_count": 410526}, {"video_id": "E8t1OAyHeWA", "question": "Under a starry night, there is a telescope on the grass. A man in gray clothes and a man in black clothes are having a conversation. 
After the subtitle mentions 'of a curve but with the compression', what is the bald man at the bottom right corner of the screen doing?", "question_wo_referring_query": "What is the bald man at the bottom right corner of the screen doing?", "candidates": ["Touching his ear", "Touching his hair", "Touching his beard", "Clapping", "Touching his butt"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "E8t1OAyHeWA_1", "video_path": "E8t1OAyHeWA.mp4", "subtitle_path": "E8t1OAyHeWA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1836.24, "view_count": 410526}, {"video_id": "E8t1OAyHeWA", "question": "On a night with a starry sky, there is a telescope on the grass. A man in grey clothes and a man in black clothes are having a conversation. After the subtitle mentions 'wrong and then going back to the truth,' what is the man in grey clothes in the top left of the screen doing?", "question_wo_referring_query": "What is the man in grey clothes in the top left of the screen doing?", "candidates": ["Touching his nose", "Touching his brain", "Touching his butt", "Touching his ear", "Touching his hair"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "E8t1OAyHeWA_2", "video_path": "E8t1OAyHeWA.mp4", "subtitle_path": "E8t1OAyHeWA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1836.24, "view_count": 410526}, {"video_id": "mcaP_6IVvKI", "question": "A painting is hanging on the red wall, with several rods framing it. There is a white tag on one of the rods. On the left side of the red wall, there is an arched passage. Before the subtitle mentions 'describes her struggle against her rapist, and her attempt to attack him with a knife,' what appears in the painting for the first time?", "question_wo_referring_query": "A painting is hanging on the red wall, with several rods framing it. 
There is a white tag on one of the rods. On the left side of the red wall, there is an arched passage. Before the subtitle mentions 'describes her struggle against her rapist, and her attempt to attack him with a knife,' what appears in the painting for the first time?", "candidates": ["A man", "A man and a woman", "Two men", "A man and two women", "Two women"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "mcaP_6IVvKI_0", "video_path": "mcaP_6IVvKI.mp4", "subtitle_path": "mcaP_6IVvKI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1612.66, "view_count": 12380}, {"video_id": "mcaP_6IVvKI", "question": "In an oil painting, there is a man holding a young animal by a leash, and under the animal's feet, there is another man with his arms spread out. Just before the subtitle mentions 'and mythological tales with startling levels of realism. Among his most haunting works are two,' what is the first animal to appear in the painting?", "question_wo_referring_query": "In an oil painting, there is a man holding a young animal by a leash, and under the animal's feet, there is another man with his arms spread out. Just before the subtitle mentions 'and mythological tales with startling levels of realism. Among his most haunting works are two,' what is the first animal to appear in the painting?", "candidates": ["monkey", "fish", "horse", "pig", "dog"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "mcaP_6IVvKI_1", "video_path": "mcaP_6IVvKI.mp4", "subtitle_path": "mcaP_6IVvKI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1612.66, "view_count": 12380}, {"video_id": "mcaP_6IVvKI", "question": "In a painting, there is a castle in the distance, with a few houses in front of the castle. Black smoke is rising from the houses. 
On the open ground in front of the houses, many people are holding weapons, and there are also two cannons. After the subtitle mentions, 'In 1789, at the height of her career, she left Paris to escape the revolutionary uprisings,' what is the first image that appears?", "question_wo_referring_query": "After the subtitle mentions, 'In 1789, at the height of her career, she left Paris to escape the revolutionary uprisings,' what is the first image that appears?", "candidates": ["oil painting", "map", "watercolor painting", "comic", "ink painting"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "mcaP_6IVvKI_2", "video_path": "mcaP_6IVvKI.mp4", "subtitle_path": "mcaP_6IVvKI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1612.66, "view_count": 12380}, {"video_id": "mMstMmhgZmg", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a bedroom appears with a bed covered in a blue quilt and two white pillows; then, a man in a blue jacket stands on the roof speaking; finally, a man is seen standing in the living room with a black table and white sofa beside him, and two orange chairs under the table.", "First, a man is seen standing in the living room with a black table and white sofa beside him, and two orange chairs under the table; then a bedroom appears with a bed covered in a blue quilt and two white pillows; finally, a man in a blue jacket stands on the roof speaking.", "First, a man in a blue jacket stands on the roof speaking, then a man is seen in the living room with a black table and white sofa beside him, and two orange chairs under the table; finally, a bedroom appears with a bed covered in a blue quilt and two white pillows.", "First, a man in a blue jacket stands on the roof speaking; then, a bedroom appears with a bed covered in a blue quilt and 
two white pillows; finally, a man is seen standing in the living room with a black table and white sofa beside him, and two orange chairs under the table.", "First, a man is seen standing in the living room with a black table and white sofa beside him, and two orange chairs under the table; then, a man in a blue jacket stands on the roof speaking, and lastly, a bedroom appears with a bed covered in a blue quilt and two white pillows."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "mMstMmhgZmg_0", "video_path": "mMstMmhgZmg.mp4", "subtitle_path": "mMstMmhgZmg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 952.08, "view_count": 500024}, {"video_id": "mMstMmhgZmg", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a man wearing a blue coat and a woman wearing an olive coat and a black hat are talking on a balcony. Then, a man in a gray coat is walking up the stairs. Lastly, a yellow taxi drives by on the street.", "First, a yellow taxi drives by on the street. Then, a man in a gray coat is walking up the stairs. Lastly, a man wearing a blue coat and a woman wearing an olive coat and a black hat are talking on a balcony.", "First, a man wearing a blue coat and a woman wearing an olive coat and a black hat are talking on a balcony. Then, a yellow taxi drives by on the street. Lastly, a man in a gray coat is walking up the stairs.", "First, a yellow taxi drives by on the street. Then, a man wearing a blue coat and a woman wearing an olive coat and a black hat are talking on a balcony. Lastly, a man in a gray coat is walking up the stairs.", "First, a man in a gray coat is walking up the stairs. Then, a man wearing a blue coat and a woman wearing an olive coat and a black hat are talking on a balcony. 
Lastly, a yellow taxi drives by on the street."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "mMstMmhgZmg_1", "video_path": "mMstMmhgZmg.mp4", "subtitle_path": "mMstMmhgZmg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 952.08, "view_count": 500024}, {"video_id": "mMstMmhgZmg", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a man in a white jacket is singing outside, then a man wearing military green clothing is talking inside a room, and finally, a man in a denim jacket and a woman in a blue short-sleeve shirt with a ponytail are hugging in a crowd.", "First, a man in a denim jacket and a woman in a blue short-sleeve shirt with a ponytail are hugging in a crowd, then a man in a white jacket is singing outside, and lastly, a man wearing military green clothing is talking inside a room.", "First, a man in a white jacket is singing outside, then a man in a denim jacket and a woman in a blue short-sleeve shirt with a ponytail are hugging in a crowd, and finally, a man wearing military green clothing is talking inside a room.", "First, a man wearing military green clothing is talking inside a room, then a man in a denim jacket and a woman in a blue short-sleeve shirt with a ponytail are hugging in a crowd, and finally, a man in a white jacket is singing outside.", "First, a man wearing military green clothing is talking inside a room, then a man in a white jacket is singing outside, and lastly, a man in a denim jacket and a woman in a blue short-sleeve shirt with a ponytail are hugging in a crowd."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "mMstMmhgZmg_2", "video_path": "mMstMmhgZmg.mp4", "subtitle_path": "mMstMmhgZmg_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 952.08, "view_count": 500024}, {"video_id": "eyQmhrVKsc0", "question": "Against a tan background, there are many quadrilateral shapes. Inside the topmost quadrilateral, there are 3 'x' shapes. In the top left corner, there's text that reads 'German Infantry Division 1940 17000 Men'. When this text in the top left corner changes to 'Soviet Infantry Division 1941', what change occurs to the topmost quadrilateral?", "question_wo_referring_query": "What change occurs to the topmost quadrilateral?", "candidates": ["Changes to red", "Changes to green", "Changes to pink", "Changes to purple", "Changes to black"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "eyQmhrVKsc0_0", "video_path": "eyQmhrVKsc0.mp4", "subtitle_path": "eyQmhrVKsc0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1117.37, "view_count": 257920}, {"video_id": "eyQmhrVKsc0", "question": "Against a tan background, there are many quadrilateral objects, with the topmost quadrilateral containing three 'x' marks\u2014two of which are red and one is black. In the upper-right corner, there is white text that reads 'British Infantry Division 3939 [BEF] 14 000 Men'. 
When the white text in the upper-left corner changes to 'Japanese Infantry Division 1940 Standard B', what change occurs to the topmost quadrilateral object?", "question_wo_referring_query": "What change occurs to the topmost quadrilateral object?", "candidates": ["The surroundings turn purple", "The surroundings turn white", "The surroundings turn pink", "The surroundings turn black", "The surroundings turn green"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "eyQmhrVKsc0_1", "video_path": "eyQmhrVKsc0.mp4", "subtitle_path": "eyQmhrVKsc0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1117.37, "view_count": 257920}, {"video_id": "eyQmhrVKsc0", "question": "Against a yellowish-brown background, there is a red quadrilateral object with the white text 'Polish Infantry Division 1939' on its right. When the text changes to 'Romanian Infantry Division 1941,' what change occurs to the top quadrilateral object?", "question_wo_referring_query": "What change occurs to the top quadrilateral object?", "candidates": ["Changes to yellow", "Changes to blue", "Changes to pink", "Changes to black", "Changes to green"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "eyQmhrVKsc0_2", "video_path": "eyQmhrVKsc0.mp4", "subtitle_path": "eyQmhrVKsc0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1117.37, "view_count": 257920}, {"video_id": "v1PEgoH67TM", "question": "The screen shows three people engaged in a video call. In the upper left is a woman wearing orange clothes and headphones in a bedroom. In the upper right is a short-haired woman in a blue and white striped shirt against a museum background. At the bottom is a short-haired man wearing glasses and a suit. 
When the conversation mentions 'I mean he was an artist right and I I', what change occurs on this man's screen?", "question_wo_referring_query": "What change occurs on this man's screen?", "candidates": ["The frame enlarges and becomes a standalone frame", "The frame moves to the upper right", "The frame moves back and forth", "The frame shrinks", "The frame moves to the upper left"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "v1PEgoH67TM_0", "video_path": "v1PEgoH67TM.mp4", "subtitle_path": "v1PEgoH67TM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3107.72, "view_count": 3988}, {"video_id": "v1PEgoH67TM", "question": "In the room, to the left of the white wardrobe, there is a white door. A woman in front of the wardrobe, wearing an orange outfit, glasses, and earphones, is speaking. When she says, 'you know as well i mean jill maybe it,' what changes occur on the screen of this woman?", "question_wo_referring_query": "What changes occur on the screen of this woman?", "candidates": ["The screen zooms in", "The screen zooms out", "The screen shifts to the bottom-right", "The screen flickers back and forth"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "v1PEgoH67TM_1", "video_path": "v1PEgoH67TM.mp4", "subtitle_path": "v1PEgoH67TM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3107.72, "view_count": 3988}, {"video_id": "v1PEgoH67TM", "question": "Two people are on a video call in the video. On the left side, there is a woman wearing orange clothes and headphones, and on the right side, there is a man with short hair wearing glasses and a suit in a study room. 
What change occurred to the woman on the left when the topic of 'immigrant groups' was mentioned during the conversation?", "question_wo_referring_query": "What change occurred to the woman on the left when the topic of 'immigrant groups' was mentioned?", "candidates": ["She changed into a black coat", "The button on her orange clothes popped open", "She took off her orange clothes", "She changed into an orange zip-up jacket"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "v1PEgoH67TM_2", "video_path": "v1PEgoH67TM.mp4", "subtitle_path": "v1PEgoH67TM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3107.72, "view_count": 3988}, {"video_id": "d3g3pCHqgo8", "question": "In the upper left corner, there is a woman's face in a frame. To the right of the frame is an English title containing 'Qualitative Result' with a blue background. Below the title is an image with 3D blocks. What is the woman in the upper left corner doing at this moment?", "question_wo_referring_query": "What is the woman in the upper left corner doing at this moment?", "candidates": ["Lowering her head", "Lying on the computer desk", "Turning her head to the right side of the screen", "Turning her head to the left side of the screen", "Standing up"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "d3g3pCHqgo8_0", "video_path": "d3g3pCHqgo8.mp4", "subtitle_path": "d3g3pCHqgo8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1607.0, "view_count": 1603}, {"video_id": "d3g3pCHqgo8", "question": "In the top-left corner, there is a woman's face. To the right of her face is an English title containing 'Qualitative Parsing' against a blue background, and below the title is a white background. 
What is the woman in the top-left corner doing at this moment?", "question_wo_referring_query": "What is the woman in the top-left corner doing at this moment?", "candidates": ["Looking forward", "Turning her head to the right side of the screen", "Crawling on a desk", "Clapping", "Turning her head to the left side of the screen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "d3g3pCHqgo8_1", "video_path": "d3g3pCHqgo8.mp4", "subtitle_path": "d3g3pCHqgo8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1607.0, "view_count": 1603}, {"video_id": "d3g3pCHqgo8", "question": "In the upper-left corner is a close-up of a woman. To the right of the close-up is an English title that reads 'New Scenes: Minecraft' on a blue background. Below the title is a white background. At this moment, what is the woman in the upper-left corner's close-up doing?", "question_wo_referring_query": "At this moment, what is the woman in the upper-left corner's close-up doing?", "candidates": ["Standing up", "Turning her head to the left of the screen", "Turning her head to the right of the screen", "Looking forward", "Clapping"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "d3g3pCHqgo8_2", "video_path": "d3g3pCHqgo8.mp4", "subtitle_path": "d3g3pCHqgo8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1607.0, "view_count": 1603}, {"video_id": "KRQtKzJLC2U", "question": "In the room, there is a man in the front wearing a suit and holding a wine glass. In the background, there is a bar filled with red wine. 
Which of the following objects did not appear in the scene?", "question_wo_referring_query": "Which of the following objects did not appear in the scene?", "candidates": ["red wine", "wine glass", "necklace", "handbag", "man wearing a suit"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "KRQtKzJLC2U_0", "video_path": "KRQtKzJLC2U.mp4", "subtitle_path": "KRQtKzJLC2U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 17359}, {"video_id": "KRQtKzJLC2U", "question": "On a street lined with parked cars, there is a row of tall buildings in the distance. In the lower left corner, there is a man in a black coat. What object appears in the video?", "question_wo_referring_query": "What object appears in the video?", "candidates": ["computer", "blue bucket", "electric rice cooker", "chair", "ambulance"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "KRQtKzJLC2U_1", "video_path": "KRQtKzJLC2U.mp4", "subtitle_path": "KRQtKzJLC2U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 17359}, {"video_id": "KRQtKzJLC2U", "question": "On the desk, there is a transparent glass, red wine, a plate, etc. In the distance, in front of the wall, there are two paintings. In the front left, there are a few hands. 
Which of the following objects appeared in the video?", "question_wo_referring_query": ", which of the following objects appeared in the video?", "candidates": ["A white cup", "A woman with long hair", "A washing machine", "Milk", "A refrigerator"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "KRQtKzJLC2U_2", "video_path": "KRQtKzJLC2U.mp4", "subtitle_path": "KRQtKzJLC2U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 17359}, {"video_id": "q8JpkUeFR-4", "question": "In a room with white walls, there is a man wearing a green sweater and glasses directly in front. Behind the man is a bed, with a bedside table on each side of the bed in the distance. When the man mentions 'really exist and the worst part is not a', which item is present on the screen?", "question_wo_referring_query": "Which item is present on the screen?", "candidates": ["lamp", "bottle of water", "baby", "San Mien Zhi", "mask"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "q8JpkUeFR-4_0", "video_path": "q8JpkUeFR-4.mp4", "subtitle_path": "q8JpkUeFR-4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.44, "view_count": 5528}, {"video_id": "q8JpkUeFR-4", "question": "In a room with white walls, there is a bed in the middle covered with a green blanket. A man wearing a white T-shirt and glasses is sitting at the end of the bed. 
When the man mentions 'reduces the sleep hormone in your body', which object is not shown in the frame?", "question_wo_referring_query": "Which object is not shown in the frame?", "candidates": ["green blanket", "man", "laptop", "bed", "mobile phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "q8JpkUeFR-4_1", "video_path": "q8JpkUeFR-4.mp4", "subtitle_path": "q8JpkUeFR-4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.44, "view_count": 5528}, {"video_id": "q8JpkUeFR-4", "question": "On the olive-colored sofa, a person wearing a white undershirt and an orange jacket is using a laptop; to the left is a wooden table with small items on it, and in front is a wooden floor. When the subtitle says \u201cediting left but majority of the work\u201d, which item is present on the screen?", "question_wo_referring_query": "Which item is present on the screen?", "candidates": ["toothbrush", "olive-colored pillow", "pressure cooker", "refrigerator", "towel"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "q8JpkUeFR-4_2", "video_path": "q8JpkUeFR-4.mp4", "subtitle_path": "q8JpkUeFR-4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.44, "view_count": 5528}, {"video_id": "Gnki429Ef-g", "question": "In the room, there is a man wearing a shirt in front. On the wall behind the man, from left to right, there are several posters, a wall hanging, and a world map. 
When the man mentions 'play just feel What the weather today is', what style is the shirt in the room?", "question_wo_referring_query": "What style is the shirt in the room?", "candidates": ["Turn-down collar with buttons", "Stand-up collar with chain", "Round collar with buttons", "High collar with chain", "Turn-down collar with chain"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "Gnki429Ef-g_0", "video_path": "Gnki429Ef-g.mp4", "subtitle_path": "Gnki429Ef-g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.4, "view_count": 135729}, {"video_id": "Gnki429Ef-g", "question": "In a scene with a black background, there is a blue license plate with yellow characters and a yellow border. The middle of the license plate has a code made up of numbers, and there are words on the top and bottom of the plate. What is the shape of the license plate when the subtitle mentions 'of treatment of the Law on live Remind'?", "question_wo_referring_query": "What is the shape of the license plate in the scene?", "candidates": ["triangle", "rectangle", "trapezoid", "hexagon", "circle"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "Gnki429Ef-g_1", "video_path": "Gnki429Ef-g.mp4", "subtitle_path": "Gnki429Ef-g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.4, "view_count": 135729}, {"video_id": "Gnki429Ef-g", "question": "In the scene with a black background and a license plate with colorful patterns, the background shows rolling green hills and a nearby blue lake. The license plate has textual and numerical information. 
When the caption mentions \"of ethylene glycol is Where I\", what color are the numbers on the license plate?", "question_wo_referring_query": "What color are the numbers on the license plate in the scene?", "candidates": ["purple", "yellow", "white", "black", "gray"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "Gnki429Ef-g_2", "video_path": "Gnki429Ef-g.mp4", "subtitle_path": "Gnki429Ef-g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.4, "view_count": 135729}, {"video_id": "LtZs4FfwiSs", "question": "The room has white curtains hanging, with a black chair and a stool in the middle of the distance. There is oil on the floor in the foreground. A woman dressed in black is kneeling on the ground. What did the woman touch with her hand in the video?", "question_wo_referring_query": "What did the woman touch with her hand in the video?", "candidates": ["chair", "girl", "stool", "woman dressed in black", "oil"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "LtZs4FfwiSs_0", "video_path": "LtZs4FfwiSs.mp4", "subtitle_path": "LtZs4FfwiSs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.63, "view_count": 5532}, {"video_id": "LtZs4FfwiSs", "question": "In the variety show clip, the background is a board with a white bottom and yellow top. In the front, there is a round table with a woman wearing pink sitting on the left and a woman wearing purple sitting on the right. 
Who reaches out to touch the woman in pink?", "question_wo_referring_query": "Who reaches out to touch the woman in pink in the clip?", "candidates": ["Car", "Cat", "Mural", "The woman wearing purple", "Chair"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "LtZs4FfwiSs_1", "video_path": "LtZs4FfwiSs.mp4", "subtitle_path": "LtZs4FfwiSs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.63, "view_count": 5532}, {"video_id": "LtZs4FfwiSs", "question": "On the middle of the table, there is a transparent glass container filled with chili sauce. To the left, there is a hand and a wooden spoon, and to the right, there is a rectangular electric device. What is the item used to scoop up the chili sauce?", "question_wo_referring_query": "What is the item used to scoop up the chili sauce?", "candidates": ["Spatula", "Spoon", "Ladle", "Cup", "Pot lid"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "LtZs4FfwiSs_2", "video_path": "LtZs4FfwiSs.mp4", "subtitle_path": "LtZs4FfwiSs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.63, "view_count": 5532}, {"video_id": "ORrwrzn3okU", "question": "In a wood-finished Japanese room, there is a rectangular container at the front, a passageway pressing against the back left of the house, and on the right, a woman dressed in a deep blue kimono is sitting on the floor leaning against a low table. 
What is the woman doing when she appears?", "question_wo_referring_query": "What is the woman doing when she appears?", "candidates": ["playing mahjong", "stirring inside the rectangular container with chopsticks", "lying on the ground sleeping", "drinking soda", "dancing"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "ORrwrzn3okU_0", "video_path": "ORrwrzn3okU.mp4", "subtitle_path": "ORrwrzn3okU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 22869}, {"video_id": "ORrwrzn3okU", "question": "In the forest, there are dense trees behind to the right, a green stone path to the left, and in the front, there is a woman wearing black clothes holding a cane. What is the woman doing when the cane first appears?", "question_wo_referring_query": "What is the woman doing when the cane first appears?", "candidates": ["Walking and talking to a mirror", "Squatting on the ground", "Washing dishes", "Sweeping the house", "Tasting coffee"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "ORrwrzn3okU_1", "video_path": "ORrwrzn3okU.mp4", "subtitle_path": "ORrwrzn3okU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 22869}, {"video_id": "ORrwrzn3okU", "question": "In an orchard full of apple trees, a man dressed in a white shirt and black pants, wearing a white hat, is standing among the apple trees. 
What was the man doing the first time he appeared?", "question_wo_referring_query": "What was the man doing the first time he appeared?", "candidates": ["Drinking water", "Picking apples", "Hugging a woman", "Taking off his jacket", "Crouching to weed"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "ORrwrzn3okU_2", "video_path": "ORrwrzn3okU.mp4", "subtitle_path": "ORrwrzn3okU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 22869}, {"video_id": "ZR046eA9kEE", "question": "When the sunlight shines on the upper right corner of the table, there is a vegetable knife on the right side of the wooden table in the camera frame, and a corner of a plate of ingredients is exposed on the left side. In the middle, a dark brown wooden bowl contains rice. What happens in the scene when the subtitle mentions 'Rice'?", "question_wo_referring_query": "What happens in the scene?", "candidates": ["A hand reaches into the frame and picks up the plate of ingredients on the upper left corner.", "A hand reaches into the frame and picks up the rice bowl from the table.", "A hand reaches into the frame and picks up the vegetable knife.", "A hand reaches into the frame and spills the rice bowl onto the table."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ZR046eA9kEE_0", "video_path": "ZR046eA9kEE.mp4", "subtitle_path": "ZR046eA9kEE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1107.16, "view_count": 1387387}, {"video_id": "ZR046eA9kEE", "question": "Sunlight is shining on the countertop, on which there is a piece of pork. In the upper right corner of the screen, there is a plate of ingredients, and in the lower left corner, there are two small wooden bowls containing seasonings. A person is standing at the edge of the countertop. 
What happens on the screen when the subtitle mentions 'Hot pepper'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A hand picks up the pork and slaps it on the countertop", "A hand grabs some hot pepper powder and sprinkles it on the pork", "The pork is cut into two pieces", "The pork is cut into chunks", "The pork is flipped over"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ZR046eA9kEE_1", "video_path": "ZR046eA9kEE.mp4", "subtitle_path": "ZR046eA9kEE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1107.16, "view_count": 1387387}, {"video_id": "ZR046eA9kEE", "question": "There is a silver knife with some sliced green bell peppers on the handle, and there is a pale olive ceramic jar below the knife blade. What happens on the screen when the subtitle mentions 'Bell pepper'?", "question_wo_referring_query": "There is a silver knife with some sliced green bell peppers on the handle, and there is a pale olive ceramic jar below the knife blade. What happens on the screen when the subtitle mentions 'Bell pepper'?", "candidates": ["Take the chili pepper out of the ceramic jar", "Put the knife into the ceramic jar", "Pour water into the ceramic jar", "Pour the knife and green bell peppers together into the ceramic jar", "Pour the chili pepper into the ceramic jar"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "ZR046eA9kEE_2", "video_path": "ZR046eA9kEE.mp4", "subtitle_path": "ZR046eA9kEE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1107.16, "view_count": 1387387}, {"video_id": "89XQ65yEjGg", "question": "In a white-background kitchen, with a blue-framed window on the back left, there is a man wearing black clothes and a woman wearing purple clothes in front. 
What does the man do after the subtitle mentions 'dishes I feel like you can open the'?", "question_wo_referring_query": "What does the man do?", "candidates": ["Pats the woman", "Jumps", "Takes an orange", "Claps hands", "Takes out a knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "89XQ65yEjGg_0", "video_path": "89XQ65yEjGg.mp4", "subtitle_path": "89XQ65yEjGg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1005.34, "view_count": 165678}, {"video_id": "89XQ65yEjGg", "question": "On a yellow wooden table, there is a dessert in a transparent glass container placed in the middle. In the top left corner, there's a bowl of whipped cream and a hand. What happens when the speaker mentions 'and I think now is about a good time to'?", "question_wo_referring_query": "What happens?", "candidates": ["Spilled the whipped cream", "Placed a strawberry on top", "Ate the dessert", "Took a piece of the dessert", "Added two scoops of whipped cream to the dessert"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "89XQ65yEjGg_1", "video_path": "89XQ65yEjGg.mp4", "subtitle_path": "89XQ65yEjGg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1005.34, "view_count": 165678}, {"video_id": "89XQ65yEjGg", "question": "In a kitchen with a white background, there is a window with a blue frame on the back left. In front, a woman wearing a purple dress stands before a table. After the subtitle mentions 'Sage coconut cream sauce I'm going to', what does the woman do first?", "question_wo_referring_query": "In a kitchen with a white background, there is a window with a blue frame on the back left. In front, a woman wearing a purple dress stands before a table. 
After the subtitle mentions 'Sage coconut cream sauce I'm going to', what does the woman do first?", "candidates": ["Takes away the flat-bottomed pan", "Kneads dough", "Brings a cup of water", "Puts on a jacket", "Puts the olive oil into the flat-bottomed pan"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "89XQ65yEjGg_2", "video_path": "89XQ65yEjGg.mp4", "subtitle_path": "89XQ65yEjGg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1005.34, "view_count": 165678}, {"video_id": "zK8mqwjhqrs", "question": "The screen is almost completely filled by a flat-bottomed pan. After the subtitle 'Hello my loves!' appears, what is the first item to appear?", "question_wo_referring_query": "What is the first item to appear?", "candidates": ["Pumpkin", "Milk tea", "Vegetable oil", "Ground beef", "Paper towel"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "zK8mqwjhqrs_0", "video_path": "zK8mqwjhqrs.mp4", "subtitle_path": "zK8mqwjhqrs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1485.08, "view_count": 629863}, {"video_id": "zK8mqwjhqrs", "question": "In the scene, a hand is placing a food-filled plate into an oven. 
After the subtitle 'Bake for another 10 minutes at 200 C' appears, what is the first item that appears?", "question_wo_referring_query": "What is the first item that appears?", "candidates": ["Wok", "Rolling pin", "Rice cooker", "Pork", "Green vegetables"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "zK8mqwjhqrs_1", "video_path": "zK8mqwjhqrs.mp4", "subtitle_path": "zK8mqwjhqrs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1485.08, "view_count": 629863}, {"video_id": "zK8mqwjhqrs", "question": "After frying minced beef and chopped onions in the pan for about 1 minute, what is the first object that appears?", "question_wo_referring_query": ", what is the first object that appears?", "candidates": ["Mobile phone", "A child", "Carrot", "Glass cup", "Oven"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "zK8mqwjhqrs_2", "video_path": "zK8mqwjhqrs.mp4", "subtitle_path": "zK8mqwjhqrs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1485.08, "view_count": 629863}, {"video_id": "ww9wYPmkRGA", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["First, a person wearing green clothes pours a purple flower from a silver cup into a transparent teapot on a gray table. Next to the teapot, there are transparent plates and cups. Behind them, there is a red lid. Then, a pair of hands takes nuts from a pile of wooden bowls. Behind the wooden bowls, there is a gray kitchen utensil. Behind the gray kitchen utensil, there is a gray table holding a pot. Finally, a pair of hands uses a small knife to cut the nuts on the gray table, placing the purple flower and a red-handled teapot on it.", "First, a pair of hands takes nuts from a pile of wooden bowls. 
Behind the wooden bowls, there is a gray kitchen utensil. Behind the gray kitchen utensil, there is a gray table holding a pot. Then, a person wearing green clothes pours a purple flower from a silver cup into a transparent teapot on a gray table. Next to the teapot, there are transparent plates and cups. Behind them, there is a red lid. Finally, a pair of hands uses a small knife to cut the nuts on the gray table, placing the purple flower and a red-handled teapot on it.", "First, a pair of hands takes nuts from a pile of wooden bowls. Behind the wooden bowls, there is a gray kitchen utensil. Behind the gray kitchen utensil, there is a gray table holding a pot. Then, a pair of hands uses a small knife to cut the nuts on the gray table, placing the purple flower and a red-handled teapot on it. Finally, a person wearing green clothes pours a purple flower from a silver cup into a transparent teapot on a gray table. Next to the teapot, there are transparent plates and cups. Behind them, there is a red lid.", "First, a pair of hands uses a small knife to cut the nuts on the gray table, placing the purple flower and a red-handled teapot on it. Then, a person wearing green clothes pours a purple flower from a silver cup into a transparent teapot on a gray table. Next to the teapot, there are transparent plates and cups. Behind them, there is a red lid. Finally, a pair of hands takes nuts from a pile of wooden bowls. Behind the wooden bowls, there is a gray kitchen utensil. Behind the gray kitchen utensil, there is a gray table holding a pot.", "First, a person wearing green clothes pours a purple flower from a silver cup into a transparent teapot on a gray table. Next to the teapot, there are transparent plates and cups. Behind them, there is a red lid. Then, a pair of hands uses a small knife to cut the nuts on the gray table, placing the purple flower and a red-handled teapot on it. Finally, a pair of hands takes nuts from a pile of wooden bowls. 
Behind the wooden bowls, there is a gray kitchen utensil. Behind the gray kitchen utensil, there is a gray table holding a pot."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "ww9wYPmkRGA_0", "video_path": "ww9wYPmkRGA.mp4", "subtitle_path": "ww9wYPmkRGA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1150.85, "view_count": 6603017}, {"video_id": "ww9wYPmkRGA", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, there is a black pot with a white lid on a brick at the front of the scene. To its right, there is a wooden shelf, and behind it, a stone wall and mountains. Then, a tree with white petals and pink flowers appears, with rocks and green grass beneath it. Finally, a person with black hair and green hair drinks from a cup filled with orange-colored liquid. To his right, there is a different tree. In front of him, there are a white pot lid, a wooden head, rocks, and an animal lying on the grass.", "First, a person with black hair and green hair drinks from a cup filled with orange-colored liquid. To his right, there is a different tree. In front of him, there are a white pot lid, a wooden head, rocks, and an animal lying on the grass. Then, there is a black pot with a white lid on a brick at the front of the scene. To its right, there is a wooden shelf, and behind it, a stone wall and mountains. Lastly, a tree with white petals and pink flowers appears, with rocks and green grass beneath it.", "First, a person with black hair and green hair drinks from a cup filled with orange-colored liquid. To his right, there is a different tree. In front of him, there are a white pot lid, a wooden head, rocks, and an animal lying on the grass. Then, a tree with white petals and pink flowers appears, with rocks and green grass beneath it. 
Finally, there is a black pot with a white lid on a brick at the front of the scene. To its right, there is a wooden shelf, and behind it, a stone wall and mountains.", "First, there is a black pot with a white lid on a brick at the front of the scene. To its right, there is a wooden shelf, and behind it, a stone wall and mountains. Then, a person with black hair and green hair drinks from a cup filled with orange-colored liquid. To his right, there is a different tree. In front of him, there are a white pot lid, a wooden head, rocks, and an animal lying on the grass. Lastly, there appears a tree with white petals and pink flowers, with rocks and green grass beneath it.", "First, there appears a tree with white petals and pink flowers, with rocks and green grass beneath it. Then, a person with black hair and green hair is drinking from a cup filled with orange-colored liquid. To his right, there is a different tree. In front of him, there are a white pot lid, a wooden head, rocks, and an animal lying on the grass. Finally, there is a black pot with a white lid on a brick, with a wooden shelf to its right, and a stone wall and mountains in the background."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "ww9wYPmkRGA_1", "video_path": "ww9wYPmkRGA.mp4", "subtitle_path": "ww9wYPmkRGA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1150.85, "view_count": 6603017}, {"video_id": "ww9wYPmkRGA", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a hand holding a kitchen knife pours the white ingredients on the wooden board into the black pot on the stone tiles. Then, a person with black hair wearing a green outfit and pants stands by a taupe table that has two wooden bowls and a wooden board full of white ingredients. 
To the left of this person, there is a wooden bucket. Behind them, there is a stone wall and a range of mountains. Finally, the person in green pants takes a silver ladle and stirs the ingredients in the black pot on the stone tiles. Beneath the pot, there is a pile of firewood.", "First, the person in green pants takes a silver ladle and stirs the ingredients in the black pot on the stone tiles. Beneath the pot, there is a pile of firewood. Then, a person with black hair wearing a green outfit and pants stands by a taupe table that has two wooden bowls and a wooden board full of white ingredients. To the left of this person, there is a wooden bucket. Behind them, there is a stone wall and a range of mountains. Finally, a hand holding a kitchen knife pours the white ingredients on the wooden board into the black pot on the stone tiles.", "First, there is a person with black hair wearing a green outfit and pants, standing by a taupe table that has two wooden bowls and a wooden board full of white ingredients. To the left of this person, there is a wooden bucket. Behind them, there is a stone wall and a range of mountains. Then, a hand holding a kitchen knife pours the white ingredients on the wooden board into the black pot on the stone tiles. Finally, the person in green pants takes a silver ladle and stirs the ingredients in the black pot on the stone tiles. Beneath the pot, there is a pile of firewood.", "First, there is a person with black hair wearing a green outfit and pants, standing by a taupe table that has two wooden bowls and a wooden board full of white ingredients. To the left of this person, there is a wooden bucket. Behind them, there is a stone wall and a range of mountains. Then, the person in green pants takes a silver ladle and stirs the ingredients in a black pot on the stone tiles. Beneath the pot, there is a pile of firewood. 
Finally, they hold a kitchen knife and pour the white ingredients on the wooden board into the black pot on the stone tiles.", "First, a hand holding a kitchen knife pours the white ingredients on the wooden board into the black pot on the stone tiles. Then, the person in green pants takes a silver ladle and stirs the ingredients in the black pot on the stone tiles. Beneath the pot, there is a pile of firewood. Finally, a person with black hair wearing a green outfit and pants stands by a taupe table that has two wooden bowls and a wooden board full of white ingredients. To the left of this person, there is a wooden bucket. Behind them, there is a stone wall and a range of mountains."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "ww9wYPmkRGA_2", "video_path": "ww9wYPmkRGA.mp4", "subtitle_path": "ww9wYPmkRGA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1150.85, "view_count": 6603017}, {"video_id": "uEIQo3UP2iw", "question": "On a bus, there are black seats, and in the middle of the black seats, there is a long-haired person wearing a hat and dressed in a white short-sleeved shirt. This man covers his face with a hand wearing a bracelet. Which subtitles have appeared together with this man?", "question_wo_referring_query": "Which subtitles have appeared together with this man?", "candidates": ["while you're flying or", "uh pre skydiving how you feeling my", "all right", "means you can't stop peeing yourself", "yeah a little less than 10. we're about"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "uEIQo3UP2iw_0", "video_path": "uEIQo3UP2iw.mp4", "subtitle_path": "uEIQo3UP2iw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1051.26, "view_count": 97313}, {"video_id": "uEIQo3UP2iw", "question": "In the sky, there is a plane in motion. 
People wearing black hats and glasses are standing on the open door of the plane. Has this plane and those subtitles appeared together before?", "question_wo_referring_query": ", has this plane and those subtitles appeared together before?", "candidates": ["junkies who aren't afraid of", "jumping out of a plane with all the emotion", "malfunctioning of the threshing system", "fully", "if again like tomorrow you know"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "uEIQo3UP2iw_1", "video_path": "uEIQo3UP2iw.mp4", "subtitle_path": "uEIQo3UP2iw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1051.26, "view_count": 97313}, {"video_id": "uEIQo3UP2iw", "question": "In a white-background room, there is a long-haired woman wearing a long dress and holding a white paper, and a short-haired man wearing short sleeves and blue pants holding a black piece of clothing standing in the middle. To their left is a cabinet filled with miscellaneous items, and to their right is a bed with a blanket on it. Different pictures are pasted on the wall beside the bed. In which subtitles did this woman also appear?", "question_wo_referring_query": "In which subtitles did this woman also appear?", "candidates": ["go through all this so it's sort of", "think that's", "couple weeks we graduated high school", "airport", "graduation"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "uEIQo3UP2iw_2", "video_path": "uEIQo3UP2iw.mp4", "subtitle_path": "uEIQo3UP2iw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1051.26, "view_count": 97313}, {"video_id": "q5dy-NtVeF0", "question": "Against a black background, there is a grey circle in the middle. Inside the circle, there is a yellow line with small white circular dots at both ends. 
What change occurs to this yellow line?", "question_wo_referring_query": "What change occurs to this yellow line?", "candidates": ["Changed from one line to two lines", "The yellow line bends downwards", "The yellow line lengthens", "The yellow line shortens", "The yellow line breaks"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "q5dy-NtVeF0_0", "video_path": "q5dy-NtVeF0.mp4", "subtitle_path": "q5dy-NtVeF0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1042.88, "view_count": 118541}, {"video_id": "q5dy-NtVeF0", "question": "In a black background, there is a table in the middle. On the center of the table, there is an upright yellow balloon. What change happened to the yellow balloon?", "question_wo_referring_query": "What change happened to the yellow balloon?", "candidates": ["The balloon changed from yellow to purple.", "The air in the balloon disappeared, making it deflated.", "The balloon became larger.", "It changed from being on the table to floating in the air.", "The balloon changed from yellow to red."], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "q5dy-NtVeF0_1", "video_path": "q5dy-NtVeF0.mp4", "subtitle_path": "q5dy-NtVeF0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1042.88, "view_count": 118541}, {"video_id": "q5dy-NtVeF0", "question": "In a black background, there is a white rectangle. On the left side of the rectangle, there are numbers with a string of English next to them. Below the rectangle, there are numbers on the bottom line. Inside the rectangle, there's a twisted yellow line. 
What changes occurred to the yellow line?", "question_wo_referring_query": "...what changes occurred to the yellow line?", "candidates": ["The yellow line changed to a purple line and moved downward", "The yellow line changed to a red line and moved downward", "The yellow line disappeared", "The yellow line changed to a red line and moved upward", "The yellow line changed to a purple line and moved downward"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "q5dy-NtVeF0_2", "video_path": "q5dy-NtVeF0.mp4", "subtitle_path": "q5dy-NtVeF0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1042.88, "view_count": 118541}, {"video_id": "vly1Go5zdPE", "question": "In a room with a white background, there is a white table in the middle. On the table, there is a bouquet of flowers and a pen. Behind the table, there is a white cabinet. In front of the cabinet, there are two different flags. On both sides of the table, there are two men. The man on the left raises one hand, while the man on the right places his hand on his lap. What are these two men doing?", "question_wo_referring_query": "What are these two men doing?", "candidates": ["Writing", "Using a computer", "Having a conversation", "Playing a game", "Fighting"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "vly1Go5zdPE_0", "video_path": "vly1Go5zdPE.mp4", "subtitle_path": "vly1Go5zdPE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1195.87, "view_count": 68127}, {"video_id": "vly1Go5zdPE", "question": "In a room with a white background, three men in suits are sitting next to a brown table. The man on the left has dense black hair, the man in the middle is almost bald and is wearing glasses, and the man on the right has sparse white hair. 
The man on the left is wearing a blue tie, while the men in the middle and on the right are wearing red ties. There is a group of people standing behind them. A woman with long black hair and a black dress is standing in the middle, flanked by men. Behind the woman, there are three different flags. What is the man sitting in the middle by the table doing?", "question_wo_referring_query": "What is the man sitting in the middle by the table doing?", "candidates": ["Playing with a computer", "Reading", "Talking", "Talking with a woman", "Writing"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "vly1Go5zdPE_1", "video_path": "vly1Go5zdPE.mp4", "subtitle_path": "vly1Go5zdPE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1195.87, "view_count": 68127}, {"video_id": "vly1Go5zdPE", "question": "Inside a room, there are 6 people sitting around a purple table. On the far left is a short-haired woman wearing a beige coat. To her right are 5 men sitting. Behind them, there are 3 people standing; one man in the middle, two women on either side of him. There are different flags on either side of the 3 standing people, a white pillar, transparent glass, and wood behind them. What are the 3 standing people doing?", "question_wo_referring_query": "What are the 3 standing people doing?", "candidates": ["Fighting", "Dancing", "Clapping", "Writing", "Talking to each other"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "vly1Go5zdPE_2", "video_path": "vly1Go5zdPE.mp4", "subtitle_path": "vly1Go5zdPE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1195.87, "view_count": 68127}, {"video_id": "K7lb6KWBanI", "question": "Under a blue sky, a group of people riding animals are fighting with a group of people in white clothes who are not riding animals. 
Which animal is present in the scene?", "question_wo_referring_query": "Which animal is present in the scene?", "candidates": ["Horse", "Sheep", "Elephant", "Mule", "Cow"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "K7lb6KWBanI_0", "video_path": "K7lb6KWBanI.mp4", "subtitle_path": "K7lb6KWBanI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3192.96, "view_count": 20809125}, {"video_id": "K7lb6KWBanI", "question": "Under a blue sky, a group of people riding animals are running forward on yellow soil. Which of the following objects do not exist in the scene?", "question_wo_referring_query": "Which of the following objects do not exist in the scene?", "candidates": ["Green grass", "Horses", "Flowers", "Mountains", "Rocks"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "K7lb6KWBanI_1", "video_path": "K7lb6KWBanI.mp4", "subtitle_path": "K7lb6KWBanI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3192.96, "view_count": 20809125}, {"video_id": "K7lb6KWBanI", "question": "Under a blue sky, a group of people are riding horses of different colors to the left. They are surrounded by trees, with houses in the background. Someone is standing on a house. Which weapon is present in the scene?", "question_wo_referring_query": "Which weapon is present in the scene?", "candidates": ["Hammer", "Axe", "Sword", "Shield", "Knife"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "K7lb6KWBanI_2", "video_path": "K7lb6KWBanI.mp4", "subtitle_path": "K7lb6KWBanI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3192.96, "view_count": 20809125}, {"video_id": "OtDytqkhjjM", "question": "Under a blue sky, there is a person with gray hair wearing a blue checkered shirt and white pants standing on yellow ground. 
In front of them is a flowing river. When the subtitle mentions 'fish a lot in this game if you want to,' what object is present in the painting?", "question_wo_referring_query": "Under a blue sky, there is a person with gray hair wearing a blue checkered shirt and white pants standing on yellow ground. In front of them is a flowing river. When the subtitle mentions 'fish a lot in this game if you want to,' what object is present in the painting?", "candidates": ["yellow grass", "yellow tree", "stone head", "green tree", "red fish"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "OtDytqkhjjM_0", "video_path": "OtDytqkhjjM.mp4", "subtitle_path": "OtDytqkhjjM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.56, "view_count": 456364}, {"video_id": "OtDytqkhjjM", "question": "In a room with green and white walls, a woman with gray hair, wearing a hat and dressed in green clothes, is standing in the middle of the room. Behind her is a corridor. On the left wall is a window with furniture in front of it. In the front, there is brown furniture, and on the right side, there is a red and white fish tank with goldfish and other furniture. The floor is brown. When the subtitle mentions 'of the game plus some of the features,' what object is present in the picture?", "question_wo_referring_query": "In a room with green and white walls, a woman with gray hair, wearing a hat and dressed in green clothes, is standing in the middle of the room. Behind her is a corridor. On the left wall is a window with furniture in front of it. In the front, there is brown furniture, and on the right side, there is a red and white fish tank with goldfish and other furniture. The floor is brown. 
When the subtitle mentions 'of the game plus some of the features,' what object is present in the picture?", "candidates": ["Yellow chair", "Yellow fish tank", "Red chair", "Red table", "White chair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "OtDytqkhjjM_1", "video_path": "OtDytqkhjjM.mp4", "subtitle_path": "OtDytqkhjjM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.56, "view_count": 456364}, {"video_id": "OtDytqkhjjM", "question": "Under a blue sky, a woman with gray hair, wearing a yellow hair accessory and a long skirt, is running. In front of her is a lakeside, with trees planted behind, and a house among the trees. The ground under her feet is green grass. When the subtitle mentions 'they are nor all grown they will be one', what appears in the picture?", "question_wo_referring_query": "What appears in the picture?", "candidates": ["purple flower", "blue butterfly", "red flower", "yellow butterfly", "blue flower"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "OtDytqkhjjM_2", "video_path": "OtDytqkhjjM.mp4", "subtitle_path": "OtDytqkhjjM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.56, "view_count": 456364}, {"video_id": "HIZjn3r4C_0", "question": "There is a wall in the screen covered with various photos and a map. Beside it, there is a white door. A man is holding a cup. 
When the subtitle 'can get a geography now mug or geography' appears, what is the color scheme of the cup?", "question_wo_referring_query": "What is the color scheme of the cup?", "candidates": ["Purple", "Blue", "Green", "Red", "White"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "HIZjn3r4C_0_0", "video_path": "HIZjn3r4C_0.mp4", "subtitle_path": "HIZjn3r4C_0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 945.86, "view_count": 367255}, {"video_id": "HIZjn3r4C_0", "question": "A man wearing a light blue shirt is facing a camera. In the upper right corner of the screen, there is a picture with three women. When the subtitle 'it's still a region a home of the harare' appears, what is the facial expression of the women in the picture?", "question_wo_referring_query": "What is the facial expression of the women in the picture?", "candidates": ["Melancholic", "Crying", "Anxious", "Angry", "Smiling"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "HIZjn3r4C_0_1", "video_path": "HIZjn3r4C_0.mp4", "subtitle_path": "HIZjn3r4C_0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 945.86, "view_count": 367255}, {"video_id": "HIZjn3r4C_0", "question": "There is a wall in the scene covered with various photos, a map, and a flag. Next to it is a white door. 
When the subtitle 'of Ethiopia capital' appears, what color is the star on the flag?", "question_wo_referring_query": "What color is the star on the flag?", "candidates": ["red", "purple", "black", "white", "blue"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "HIZjn3r4C_0_2", "video_path": "HIZjn3r4C_0.mp4", "subtitle_path": "HIZjn3r4C_0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 945.86, "view_count": 367255}, {"video_id": "eC6Hd1hFvos", "question": "On the page, there is some code with sentences starting with Example Code. Where can the sequence classification mentioned by the host be obtained from?", "question_wo_referring_query": "Where can the sequence classification mentioned by the host be obtained from?", "candidates": ["GPT-3", "OpenAI", "logits", "Transformers library"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "eC6Hd1hFvos_0", "video_path": "eC6Hd1hFvos.mp4", "subtitle_path": "eC6Hd1hFvos_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1697.47, "view_count": 197816}, {"video_id": "eC6Hd1hFvos", "question": "At the top of the page is an English sentence about Low-Rank Adaptation and below it are some formula parameters. Who forms an R-row by K-column matrix?", "question_wo_referring_query": "Who forms an R-row by K-column matrix?", "candidates": ["C", "E", "B", "D", "A"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "eC6Hd1hFvos_1", "video_path": "eC6Hd1hFvos.mp4", "subtitle_path": "eC6Hd1hFvos_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1697.47, "view_count": 197816}, {"video_id": "eC6Hd1hFvos", "question": "The page contains some code, with a heading that starts with 'Example Code'. Below it, there is a secondary heading titled 'Evaluation Metrics'. 
What did the course mention as greater than ARG Max?", "question_wo_referring_query": "What did the course mention as greater than ARG Max?", "candidates": ["logits", "GPT", "zero elements", "low", "OpenAI"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "eC6Hd1hFvos_2", "video_path": "eC6Hd1hFvos.mp4", "subtitle_path": "eC6Hd1hFvos_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1697.47, "view_count": 197816}, {"video_id": "xu3KAe-S6cc", "question": "There is a picture of an eagle with a red beak, red claws, and a black body posted on an iron railing. Two men face the camera. When the man on the right appears for the first time, what action does he perform?", "question_wo_referring_query": "When the man on the right appears for the first time, what action does he perform?", "candidates": ["Shakes hands with the man beside him", "Stands up and gives a thumbs up", "Looks at the camera and laughs heartily", "Hugs the man beside him", "Clenches his fists"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "xu3KAe-S6cc_0", "video_path": "xu3KAe-S6cc.mp4", "subtitle_path": "xu3KAe-S6cc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.13, "view_count": 1238460}, {"video_id": "xu3KAe-S6cc", "question": "There is a white and dark blue bordered box at the top of the screen, within which English sentences start to appear. 
When the plane first appears in the center, what happens on the screen?", "question_wo_referring_query": "what happens on the screen?", "candidates": ["A small plane appears above the plane", "A blue bird icon appears below the plane", "The blue bird icon below the plane gradually disappears", "The plane flies from right to left until it disappears"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "xu3KAe-S6cc_1", "video_path": "xu3KAe-S6cc.mp4", "subtitle_path": "xu3KAe-S6cc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.13, "view_count": 1238460}, {"video_id": "xu3KAe-S6cc", "question": "Two rectangular boxes appear at the upper part of the screen moving from left to right. The upper box is white, and the lower box is dark blue. What happens when these two boxes appear?", "question_wo_referring_query": "What happens when these two boxes appear?", "candidates": ["English sentences start appearing from right to left within the white and blue rectangular boxes", "An airplane icon flies from the left side of the screen to the right side", "English sentences start appearing from left to right within the white and blue rectangular boxes", "An airplane icon appears and stops in the middle of the screen", "An image appears horizontally across the white and blue rectangular boxes"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "xu3KAe-S6cc_2", "video_path": "xu3KAe-S6cc.mp4", "subtitle_path": "xu3KAe-S6cc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.13, "view_count": 1238460}, {"video_id": "XXE_Q0K95CI", "question": "There is a man with a red tie on the wall, one hand behind his back and the other holding a white tray illustration. An animation is playing on the screen. A wooden chair is facing the projector. 
What happened when the subtitle 'family that was divorced and my parents' appeared?", "question_wo_referring_query": "What happened?", "candidates": ["A short-haired woman is talking to a man with tattoos on his arms.", "A short-haired woman is waving at a man with tattoos on his arms.", "A short-haired woman is smiling at a man with tattoos on his arms.", "A short-haired woman is shaking hands with a man with tattoos on his arms.", "A short-haired woman is hugging a man with tattoos on his arms."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "XXE_Q0K95CI_0", "video_path": "XXE_Q0K95CI.mp4", "subtitle_path": "XXE_Q0K95CI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1378.16, "view_count": 26023}, {"video_id": "XXE_Q0K95CI", "question": "There is a man on the wall wearing a red tie, with one hand behind his back and the other holding a white plate in the image. An animation is playing on the screen, with a wooden chair facing the projector. What happens when the subtitle 'he's been with me ever since even though' appears?", "question_wo_referring_query": "What happens?", "candidates": ["The man with a tattoo on his arm smiles.", "The man with a tattoo on his arm shakes hands with a woman.", "The man with a tattoo on his arm has no expression.", "The man with a tattoo on his arm stands up.", "The man with a tattoo on his arm drinks water."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "XXE_Q0K95CI_1", "video_path": "XXE_Q0K95CI.mp4", "subtitle_path": "XXE_Q0K95CI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1378.16, "view_count": 26023}, {"video_id": "XXE_Q0K95CI", "question": "Various wooden furniture is arranged indoors, a woman is wearing a red short-sleeved shirt with white English letters on it. 
When the subtitle 'stomping well she's you know it's in the' appears, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["A woman is eating in front of the mirror", "A woman gestures with both hands while giving a slight smile to the mirror", "A woman combs her hair in front of the mirror", "A woman waves at the mirror", "A woman gestures with both hands while talking to the mirror"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "XXE_Q0K95CI_2", "video_path": "XXE_Q0K95CI.mp4", "subtitle_path": "XXE_Q0K95CI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1378.16, "view_count": 26023}, {"video_id": "4L7yLQUusgA", "question": "In the video, there is a wooden cabin with various spices and kitchen utensils displayed on the white walls. A man wearing a black short-sleeve shirt is handling an oxtail with his hands. After the subtitle 'Beef oxtail' appears, what happens?", "question_wo_referring_query": "what happens?", "candidates": ["Marinate the oxtail", "Cut ginger and garlic with a knife", "Wash the oxtail in water", "Take a large golden bowl", "Put the oxtail into a large pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4L7yLQUusgA_0", "video_path": "4L7yLQUusgA.mp4", "subtitle_path": "4L7yLQUusgA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1220.73, "view_count": 1237873}, {"video_id": "4L7yLQUusgA", "question": "A large pot in the screen is set with oxtail; a pair of hands is adding yellow peas into the pot. 
What happened after the subtitle 'Peas' appears?", "question_wo_referring_query": "What happened?", "candidates": ["chopped green onions", "added water", "turned on the heat", "covered the pot", "prepared the sauce"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4L7yLQUusgA_1", "video_path": "4L7yLQUusgA.mp4", "subtitle_path": "4L7yLQUusgA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1220.73, "view_count": 1237873}, {"video_id": "4L7yLQUusgA", "question": "In the scene, there is a dragon head, a glass cup, and a small straw ornament. A pair of hands is arranging tea flowers in a basket. What happens after the subtitle 'Herbal tea' appears?", "question_wo_referring_query": "What happens after?", "candidates": ["Place the tea flowers under the sun to dry", "Carefully select high-quality tea flowers", "Prepare a big basket", "Put the tea flowers into a glass jar", "Put the tea flowers into the water for cleaning"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4L7yLQUusgA_2", "video_path": "4L7yLQUusgA.mp4", "subtitle_path": "4L7yLQUusgA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1220.73, "view_count": 1237873}, {"video_id": "V9NZWMFydH4", "question": "On the white wall in the video, there are many different photos attached, a grey shelf holding neatly arranged books and some small items, and a short-haired woman wearing a green top. 
After the subtitle 'illusion that there is more there than' appears, what is the first outfit that appears?", "question_wo_referring_query": "What is the first outfit to appear after the subtitle 'illusion that there is more there than'?", "candidates": ["A pair of jeans", "An orange polka-dotted dress", "A checkered shirt", "A pair of overalls", "A green top"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "V9NZWMFydH4_0", "video_path": "V9NZWMFydH4.mp4", "subtitle_path": "V9NZWMFydH4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.8, "view_count": 424677}, {"video_id": "V9NZWMFydH4", "question": "On the white wall in the frame, there are many different kinds of photos posted. On a gray shelf, there are neatly arranged books and some small items. There is also a short-haired woman wearing a green top. After the subtitle \"goes to eating out and buying coffee\" appears, what is the first household item that appears?", "question_wo_referring_query": "What is the first household item that appears?", "candidates": ["Bath sponge", "Bowl", "Toilet paper", "Laundry detergent", "Cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "V9NZWMFydH4_1", "video_path": "V9NZWMFydH4.mp4", "subtitle_path": "V9NZWMFydH4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.8, "view_count": 424677}, {"video_id": "V9NZWMFydH4", "question": "In the video, on the white wall, there are many different kinds of photos posted, a brown shelf with neatly placed books and some small items, and a short-haired woman wearing a green top. 
When the subtitle 'two songs off of this album is how hide' appears, what is the first decorative item shown?", "question_wo_referring_query": "What is the first decorative item shown?", "candidates": ["a couplet", "a bouquet", "a calligraphy piece", "a vase", "a painting"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "V9NZWMFydH4_2", "video_path": "V9NZWMFydH4.mp4", "subtitle_path": "V9NZWMFydH4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.8, "view_count": 424677}, {"video_id": "_4amKzHzcZE", "question": "Which of the following sequences of items is correct?", "question_wo_referring_query": "Which of the following sequences of items is correct?", "candidates": ["First appears a piece of cloth, followed by an eraser, then a sheet of white paper, followed closely by a bowl of water, and finally colored pencils, crayons, or markers.", "First appears a sheet of white paper, followed by a pen, then an eraser, followed closely by a ruler, and finally colored pencils, crayons, or markers.", "First appears a pen, followed by a sheet of white paper, then an eraser, followed closely by a ruler, and finally colored pencils, crayons, or markers.", "First appears a ruler, followed by an eraser, then a sheet of white paper, followed closely by a bowl of water, and finally colored pencils, crayons, or markers.", "First appears a pen, followed by an eraser, then a sheet of white paper, followed closely by a ruler, and finally colored pencils, crayons, or markers."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "_4amKzHzcZE_0", "video_path": "_4amKzHzcZE.mp4", "subtitle_path": "_4amKzHzcZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1461.8, "view_count": 4372}, {"video_id": "_4amKzHzcZE", "question": "Which of the following sequences of steps is correct?", "question_wo_referring_query": 
"Which of the following sequences of steps is correct?", "candidates": ["First add green on the paper, then add light black, followed by orange, then light purple, and finally blue.", "First add green on the paper, then add orange, followed by light black, then light blue, and finally purple.", "First add orange on the paper, then add green, followed by light black, then light blue, and finally purple.", "First add green on the paper, then add light black, followed by orange, then light blue, and finally purple.", "First add blue on the paper, then add light black, followed by orange, then light purple, and finally green."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "_4amKzHzcZE_1", "video_path": "_4amKzHzcZE.mp4", "subtitle_path": "_4amKzHzcZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1461.8, "view_count": 4372}, {"video_id": "_4amKzHzcZE", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a square is drawn on white paper, then an image with an orange rectangular border appears on a black background, next a woman is talking to the camera, and then the drawing is colored.", "First, a woman is talking to the camera, then an image with an orange rectangular border appears on a black background, next a square is drawn on white paper, and then the drawing is colored.", "First, a square is drawn on white paper, then an image with an orange rectangular border appears on a black background, then the drawing is colored, and finally a woman is talking to the camera.", "First, an image with an orange rectangular border appears on a black background, then a woman is talking to the camera, next a square is drawn on white paper, and then the drawing is colored.", "First, a circle is drawn on white paper, then an image with an orange rectangular border 
appears on a black background, next the drawing is colored, and finally a woman is talking to the camera."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "_4amKzHzcZE_2", "video_path": "_4amKzHzcZE.mp4", "subtitle_path": "_4amKzHzcZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1461.8, "view_count": 4372}, {"video_id": "RU4auIoVQKE", "question": "On a sunny and breezy day, a man in black clothing is carrying a black backpack and dragging a blue suitcase, with a blue bag on top. To his right is a silver BMW car, to his left is a black car and a wooden bench with a red flower. In front of him are many trees and a house. In which of the following scenes did the man's blue bag appear?", "question_wo_referring_query": "In which of the following scenes did the man's blue bag appear?", "candidates": ["In the arcade", "On the wooden furniture in the man's new home", "In the bar", "On the boat in the rain", "Inside the church"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "RU4auIoVQKE_0", "video_path": "RU4auIoVQKE.mp4", "subtitle_path": "RU4auIoVQKE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.25, "view_count": 379353}, {"video_id": "RU4auIoVQKE", "question": "Inside a room with a light yellow background, a person wearing a black short-sleeved shirt is looking out the window. Next to him, there is a bed with a suitcase and other luggage on it. The room has wooden furniture on both sides, a suitcase on the right, and a short-haired woman in a green dress sitting on the left, looking at her phone. In which of the following scenes has this short-haired woman in a green dress appeared?", "question_wo_referring_query": "Inside a room with a light yellow background, a person wearing a black short-sleeved shirt is looking out the window. 
Next to him, there is a bed with a suitcase and other luggage on it. The room has wooden furniture on both sides, a suitcase on the right, and a short-haired woman in a green dress sitting on the left, looking at her phone. In which of the following scenes has this short-haired woman in a green dress appeared?", "candidates": ["In a supermarket", "Inside a hotel", "On a small road", "In a game arcade", "On a beach at sunrise"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "RU4auIoVQKE_1", "video_path": "RU4auIoVQKE.mp4", "subtitle_path": "RU4auIoVQKE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.25, "view_count": 379353}, {"video_id": "RU4auIoVQKE", "question": "In a beige room, there's a laptop on the left table, a wall-mounted TV above, a standing lamp beside it, and on the right, a short-haired man in a blue and black coat is sitting on the sofa playing a guitar. Another man in black clothes is also sitting on the sofa watching TV. Can you tell in which of the following scenes the man in the blue and black coat with the guitar has appeared?", "question_wo_referring_query": "In a beige room, there's a laptop on the left table, a wall-mounted TV above, a standing lamp beside it, and on the right, a short-haired man in a blue and black coat is sitting on the sofa playing a guitar. Another man in black clothes is also sitting on the sofa watching TV. 
Can you tell in which of the following scenes the man in the blue and black coat with the guitar has appeared?", "candidates": ["In a karaoke room", "In the new home of the man in black clothes", "On a beach in the rain", "In a gym", "In a game room"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "RU4auIoVQKE_2", "video_path": "RU4auIoVQKE.mp4", "subtitle_path": "RU4auIoVQKE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.25, "view_count": 379353}, {"video_id": "njxAmeLwj4c", "question": "In front of a green and white background wall with an English 'tion' background, there is a man wearing a black jacket with short hair. Which subtitle line did this man appear with?", "question_wo_referring_query": "Which subtitle line did this man appear with?", "candidates": ["season with repeated hurricanes for long", "thick and may contain multiple laminae", "of the planet and our question is", "catastrophes as being potentially more", "have laminate as well though the process"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "njxAmeLwj4c_0", "video_path": "njxAmeLwj4c.mp4", "subtitle_path": "njxAmeLwj4c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1546.96, "view_count": 125211}, {"video_id": "njxAmeLwj4c", "question": "Against a black background, there is a purple halo in the middle. At the center of the halo is a blue polyhedron, and next to it stands a gray rhinoceros wearing a tie. 
Which subtitle does this gray rhinoceros appear with?", "question_wo_referring_query": "Which subtitle does this gray rhinoceros appear with?", "candidates": ["progress over the last couple centuries", "who wrote a book in the early 1800s", "he argued that the rates of deposition", "against it became more apparent", "of the planet and our question is"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "njxAmeLwj4c_1", "video_path": "njxAmeLwj4c.mp4", "subtitle_path": "njxAmeLwj4c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1546.96, "view_count": 125211}, {"video_id": "njxAmeLwj4c", "question": "In a black background, a gray cow with a red collar is speaking 'during this that the alvarez hypothesis', and next to it, the Earth is shown being hit and ejecting molten rock. Which subtitle appears together with the image of the Earth being hit and ejecting molten rock?", "question_wo_referring_query": "Which subtitle appears together with the image of the Earth being hit and ejecting molten rock?", "candidates": ["geologists understand that infrequent", "but creationists don\u2019t like that they", "thinking in geology was in the process", "with a global iridium layer", "has led me to an unqualified acceptance"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "njxAmeLwj4c_2", "video_path": "njxAmeLwj4c.mp4", "subtitle_path": "njxAmeLwj4c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1546.96, "view_count": 125211}, {"video_id": "YN4wicCk2b0", "question": "Inside the train station, there is information about the trains displayed on the left side, and on the right side, there is a woman with long hair wearing an olive green coat, carrying a brown bag. 
When this woman appears in front of a bookshelf filled with books and looks into the mirror, what change occurs to her?", "question_wo_referring_query": "What change occurs to this woman?", "candidates": ["Her shoulder-length hair changes to twin ponytails.", "Her black hair is dyed yellow.", "Her shoulder-length hair changes to a bob.", "Her shoulder-length hair is tied up.", "Her shoulder-length hair turns into big waves."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "YN4wicCk2b0_0", "video_path": "YN4wicCk2b0.mp4", "subtitle_path": "YN4wicCk2b0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1397.77, "view_count": 89230}, {"video_id": "YN4wicCk2b0", "question": "Inside a bookstore with white walls, where the shelves are filled with books, at the front there is a glass entrance door, and on the ceiling there are red fire sprinkler pipes, a woman wearing a scarf, carrying a brown bag, and reading a green leather book with a ponytail appears. When this woman appears in the room with the white walls and a mirror behind her, she faces the mirror. What change happens to this woman?", "question_wo_referring_query": "Inside a bookstore with white walls, where the shelves are filled with books, at the front there is a glass entrance door, and on the ceiling there are red fire sprinkler pipes, a woman wearing a scarf, carrying a brown bag, and reading a green leather book with a ponytail appears. When this woman appears in the room with the white walls and a mirror behind her, she faces the mirror. 
What change happens to this woman?", "candidates": ["Dark green jacket changes to a white sweater", "Black jacket changes to a beige long sleeve wool coat", "Black long sleeve changes to a yellow short sleeve", "Dark green coat changes to a white knit shirt", "White coat changes to a black dress"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "YN4wicCk2b0_1", "video_path": "YN4wicCk2b0.mp4", "subtitle_path": "YN4wicCk2b0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1397.77, "view_count": 89230}, {"video_id": "YN4wicCk2b0", "question": "Outside, where the lights are bright, there is a gray closed door in the distance, and on the road, there is a girl wearing a white down jacket and a black skirt walking. In front of the camera is a long-haired girl wearing earrings and drinking mineral water. When she appears on the left side, there is a white wall, and a television is hanging on the wall. Inside a room with a black door in the distance, what kind of changes does this girl undergo?", "question_wo_referring_query": "How does this girl change?", "candidates": ["The white coat is changed to a gray long-sleeve woolen sweater", "The yellow coat is changed to a black knit shirt", "The white knit shirt is changed to a gray long-sleeve T-shirt", "The dark green coat is changed to a gray long-sleeve T-shirt", "The beige sweater is changed to a black jacket"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "YN4wicCk2b0_2", "video_path": "YN4wicCk2b0.mp4", "subtitle_path": "YN4wicCk2b0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1397.77, "view_count": 89230}, {"video_id": "hyP-9Ipy_eQ", "question": "In the distance, there are green fields. On the right, there is a wooden house with glass windows, and a tree stands on the right side with a bamboo basket hanging from it. 
Below, there are various dining tools arranged, with a wooden table in the middle. On the table, there are sliced fish and a bowl of wheat flour. A middle-aged man wearing a black short-sleeve shirt is currently stirring wheat starch. When the subtitle mentions 'Wheat flour', what change happens to the sliced fish?", "question_wo_referring_query": "When the subtitle mentions 'Wheat flour', what change happens to the sliced fish?", "candidates": ["The sliced fish is cut into pieces.", "The sliced fish is mixed with wheat starch.", "The sliced fish is steamed to a white color.", "The sliced fish turns black after being deep-fried.", "The sliced fish is coated with honey."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "hyP-9Ipy_eQ_0", "video_path": "hyP-9Ipy_eQ.mp4", "subtitle_path": "hyP-9Ipy_eQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.17, "view_count": 3947258}, {"video_id": "hyP-9Ipy_eQ", "question": "On a wooden table, there are three wooden bowls. The bowl on the left contains scallions. The bowl on the right contains a white seasoning. The bowl in the middle contains a lemon and a green plant. A pair of hands is holding the green plant. When the subtitle mentions 'Dill', what change happens to the green plant?", "question_wo_referring_query": "On a wooden table, there are three wooden bowls. The bowl on the left contains scallions. The bowl on the right contains a white seasoning. The bowl in the middle contains a lemon and a green plant. A pair of hands is holding the green plant. 
When the subtitle mentions 'Dill', what change happens to the green plant?", "candidates": ["The green plant is coated with honey", "The green plant is dyed yellow", "The green plant is chopped", "The green plant is mixed with the white seasoning", "The green plant is burned to charcoal"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "hyP-9Ipy_eQ_1", "video_path": "hyP-9Ipy_eQ.mp4", "subtitle_path": "hyP-9Ipy_eQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.17, "view_count": 3947258}, {"video_id": "hyP-9Ipy_eQ", "question": "On a wooden table, there is a bowl containing yellow oil blocks on the left side, and in the middle, there is an irregular wooden jar containing lemon peels. When the subtitle mentions 'Grated Lemon', what change occurs to the lemon pieces?", "question_wo_referring_query": "On a wooden table, there is a bowl containing yellow oil blocks on the left side, and in the middle, there is an irregular wooden jar containing lemon peels. When the subtitle mentions 'Grated Lemon', what change occurs to the lemon pieces?", "candidates": ["Mixed into a pot of food", "Coated with red tomato juice", "Thrown into the trash bin", "Coated with a white condiment", "Burned into charcoal"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "hyP-9Ipy_eQ_2", "video_path": "hyP-9Ipy_eQ.mp4", "subtitle_path": "hyP-9Ipy_eQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1183.17, "view_count": 3947258}, {"video_id": "yx1Tq4g6Ias", "question": "In a room, there is a white sofa. Next to the white sofa, there is a floor-to-ceiling window. Outside the window, there are green plants and a pond. There is a man in a pink short-sleeve shirt sitting on the sofa. Next to him, there is also a black camera. 
What is the man in the pink short-sleeve shirt doing while sitting on the sofa?", "question_wo_referring_query": "What is the man in the pink short-sleeve shirt doing while sitting on the sofa?", "candidates": ["Looking down and operating a computer", "Watching TV", "Holding his chin with both hands and dazing", "Playing with a phone", "Drawing"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "yx1Tq4g6Ias_0", "video_path": "yx1Tq4g6Ias.mp4", "subtitle_path": "yx1Tq4g6Ias_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2104.77, "view_count": 111353}, {"video_id": "yx1Tq4g6Ias", "question": "In a room with a white sofa, some paintings hanging on the wall, and grey pillows on either side, a man wearing a pink short-sleeved shirt is sitting on the sofa. The screen also shows yellow text that reads 'NOT SURE WHY I'M HUNCHED OVER LIKE QUASI MODO BUT W.E'. What is the man in the pink short-sleeved shirt doing on the sofa?", "question_wo_referring_query": "What is the man in the pink short-sleeved shirt doing on the sofa?", "candidates": ["Taking a selfie with his phone", "Raising both hands, palms facing each other, talking towards the camera", "Hugging a grey pillow, talking towards the camera", "Picking up a black remote to turn on the TV", "Putting on a pair of black glasses for himself"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "yx1Tq4g6Ias_1", "video_path": "yx1Tq4g6Ias.mp4", "subtitle_path": "yx1Tq4g6Ias_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2104.77, "view_count": 111353}, {"video_id": "yx1Tq4g6Ias", "question": "In an Adobe Photoshop operation page, there is a photo with green plants and a bubble-filled bathtub. A woman dressed in black is walking towards the bathtub, and beside her is a man wearing blue shorts. 
What is the man wearing blue shorts doing?", "question_wo_referring_query": "What is the man wearing blue shorts doing?", "candidates": ["Taking a selfie with his phone", "Holding purple flowers and scattering them in the bathtub", "Looking up at the clouds in the sky", "Sitting above the bathtub with his head down, operating a black notebook computer", "Playing in the bubble-filled bathtub"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "yx1Tq4g6Ias_2", "video_path": "yx1Tq4g6Ias.mp4", "subtitle_path": "yx1Tq4g6Ias_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2104.77, "view_count": 111353}, {"video_id": "2b-Bh-Oz_c4", "question": "In front of a red background, there are four people standing wearing black clothes, and in front of them is a row of people also wearing black clothes, with the text SUPREME COURT RULES TRUMP CAN REMAIN ON COLORADO BALLOT appearing at the bottom of the screen. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Black hats", "Green plants", "White file folders", "Yellow lilies", "Purple flowers"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "2b-Bh-Oz_c4_0", "video_path": "2b-Bh-Oz_c4.mp4", "subtitle_path": "2b-Bh-Oz_c4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2987.56, "view_count": 131924}, {"video_id": "2b-Bh-Oz_c4", "question": "On a street, there are many cars parked. A man wearing a light cyan shirt and sporting a stubble is standing on the street talking to a camera. At the bottom of the screen, there is black text 'TRUMP & HALEY CAMPAIGN IN NORTH CAROLINA AHEAD OF SUPER TUESDAY'. 
What object is present in the screen?", "question_wo_referring_query": "What object is present in the screen?", "candidates": ["Yellow perfume", "Blue pickup truck", "Red flower", "Red trash can", "White daisies"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "2b-Bh-Oz_c4_1", "video_path": "2b-Bh-Oz_c4.mp4", "subtitle_path": "2b-Bh-Oz_c4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2987.56, "view_count": 131924}, {"video_id": "2b-Bh-Oz_c4", "question": "In a room with a gray round table, four people dressed in suits are seated. At the bottom of the screen, there is black text that reads 'BIDEN TO DELIVER STATE OF THE UNION ADDRESS ON THURSDAY.' What object is present in the screen?", "question_wo_referring_query": "What object is present in the screen?", "candidates": ["Black tie", "Red flower", "Green plant", "Olive-colored chair", "Yellow cup"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "2b-Bh-Oz_c4_2", "video_path": "2b-Bh-Oz_c4.mp4", "subtitle_path": "2b-Bh-Oz_c4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2987.56, "view_count": 131924}, {"video_id": "35TXiDFSe0s", "question": "In a white room with white curtains and a white table, a man in a black short-sleeved shirt is talking to a mirror. 
What object is present on the screen when he says 'would become but'?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["yellow sunflower", "white daisy", "purple feather duster", "yellow jasmine", "green plant"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "35TXiDFSe0s_0", "video_path": "35TXiDFSe0s.mp4", "subtitle_path": "35TXiDFSe0s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1299.8, "view_count": 273733}, {"video_id": "35TXiDFSe0s", "question": "On a sunny and clear day, there are two shirtless men standing on the beach, with green trees behind them. When the subtitle 'explode um everybody's doing' appears, what object is present in the scene?", "question_wo_referring_query": "On a sunny and clear day, what object is present in the scene?", "candidates": ["red scarf", "black sunglasses", "green chair", "white shell", "gold necklace"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "35TXiDFSe0s_1", "video_path": "35TXiDFSe0s.mp4", "subtitle_path": "35TXiDFSe0s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1299.8, "view_count": 273733}, {"video_id": "35TXiDFSe0s", "question": "On a bustling evening, with many people eating, the screen shows white umbrellas and long tables. On the tables, there are white plates and cooked steaks. 
When the subtitle \"bon appetit so kyle explained to me how\" appears, what object is present in the screen?", "question_wo_referring_query": "What object is present in the screen?", "candidates": ["green chairs", "red tomato sauce", "red roses", "yellow sunflowers", "yellow daisy"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "35TXiDFSe0s_2", "video_path": "35TXiDFSe0s.mp4", "subtitle_path": "35TXiDFSe0s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1299.8, "view_count": 273733}, {"video_id": "X2vPEzWpduU", "question": "In front of a white wall, there is a black pot. Next to the pot, there is a white flower pot and a green plant. Inside the pot, there are some pieces of meat being stir-fried continuously. What material is the spatula used to stir-fry the meat made of?", "question_wo_referring_query": "What material is the spatula used to stir-fry the meat made of?", "candidates": ["Iron", "Silicone", "Wood", "Aluminum Alloy", "Stainless Steel"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "X2vPEzWpduU_0", "video_path": "X2vPEzWpduU.mp4", "subtitle_path": "X2vPEzWpduU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.33, "view_count": 1698173}, {"video_id": "X2vPEzWpduU", "question": "There is a black pot next to a white flowerpot. The pot contains many ingredients, including meat, noodle pieces, and various seasonings. 
What is the shape of the noodle pieces in the pot?", "question_wo_referring_query": "What is the shape of the noodle pieces in the pot?", "candidates": ["Triangle-shaped", "Butterfly-shaped", "Hexagon-shaped", "Round", "Caterpillar-shaped"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "X2vPEzWpduU_1", "video_path": "X2vPEzWpduU.mp4", "subtitle_path": "X2vPEzWpduU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.33, "view_count": 1698173}, {"video_id": "X2vPEzWpduU", "question": "On a wooden cutting board, there is a pair of hands holding a knife. Next to the cutting board, there are some green vegetables. The hands with the knife are cutting some food. What color is the food being cut?", "question_wo_referring_query": "What color is the food being cut?", "candidates": ["blue", "green", "yellow", "red", "purple"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "X2vPEzWpduU_2", "video_path": "X2vPEzWpduU.mp4", "subtitle_path": "X2vPEzWpduU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.33, "view_count": 1698173}, {"video_id": "Rslv43vHNcA", "question": "In a dark room, there is a man with short hair and fair skin standing. Behind him is a white table with some green plants on it. 
When the subtitle 'we made it in we've settled the place' appears, what style of clothes is the fair-skinned man wearing?", "question_wo_referring_query": "What style of clothes is the fair-skinned man wearing?", "candidates": ["blue sweater", "red T-shirt", "black hoodie", "white hoodie", "white shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Rslv43vHNcA_0", "video_path": "Rslv43vHNcA.mp4", "subtitle_path": "Rslv43vHNcA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1402.65, "view_count": 97661}, {"video_id": "Rslv43vHNcA", "question": "In a brightly lit room, there is a man wearing an orange short-sleeved shirt sitting down. Behind him, there is a brown shelf filled with many books. When the subtitle 'common denominator it's it's just a' appears, what hairstyle does the man with the orange short-sleeved shirt have?", "question_wo_referring_query": "What hairstyle does the man with the orange short-sleeved shirt have?", "candidates": ["Black dreadlocks", "Colorful dreadlocks", "Blue short hair", "Golden short hair", "Black short hair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Rslv43vHNcA_1", "video_path": "Rslv43vHNcA.mp4", "subtitle_path": "Rslv43vHNcA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1402.65, "view_count": 97661}, {"video_id": "Rslv43vHNcA", "question": "On a snowy ground, a man wearing a black hooded coat is walking. He is surrounded by tall trees. 
When the subtitle 'stumble upon something you love' appears, what color hat is the man wearing?", "question_wo_referring_query": "What color hat is the man wearing when he is walking in a black hooded coat?", "candidates": ["red", "purple", "blue", "white", "black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Rslv43vHNcA_2", "video_path": "Rslv43vHNcA.mp4", "subtitle_path": "Rslv43vHNcA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1402.65, "view_count": 97661}, {"video_id": "1Jj-sJ78O6M", "question": "In a room with gray walls, there are some hexagonal shelves on the walls, and some white cabinets are placed against the wall. In front of the cabinets, someone says, 'so how many rabbits will they produce.' Who said this sentence?", "question_wo_referring_query": "Who said this sentence?", "candidates": ["A man wearing a black hoodie", "A man wearing a black T-shirt", "A man with short black hair wearing a black suit", "A man wearing a white T-shirt", "A man wearing a black long-sleeve shirt and glasses"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "1Jj-sJ78O6M_0", "video_path": "1Jj-sJ78O6M.mp4", "subtitle_path": "1Jj-sJ78O6M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1374.38, "view_count": 3660467}, {"video_id": "1Jj-sJ78O6M", "question": "In a room with grey walls and white cabinets, a man wearing a dark blue long-sleeve shirt is seated at a black table. He is also wearing gloves and is holding an object with his gloved hand. 
What object is the man holding with his gloved hand?", "question_wo_referring_query": "What object is the man holding with his gloved hand?", "candidates": ["monkey puzzle tree", "pineapple", "tomato", "potato", "baobab"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "1Jj-sJ78O6M_1", "video_path": "1Jj-sJ78O6M.mp4", "subtitle_path": "1Jj-sJ78O6M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1374.38, "view_count": 3660467}, {"video_id": "1Jj-sJ78O6M", "question": "In a grey room, there are some white shelves filled with many books and ornaments. In front of the white shelves, there is a man wearing a black long-sleeve shirt talking. He mentions a person who occasionally uses the golden ratio. Who is this person?", "question_wo_referring_query": "Who is the person he is talking about?", "candidates": ["Galileo Galilei", "Steve Jobs", "architect le corbusier", "Mark Zuckerberg", "Brad Pitt"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "1Jj-sJ78O6M_2", "video_path": "1Jj-sJ78O6M.mp4", "subtitle_path": "1Jj-sJ78O6M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1374.38, "view_count": 3660467}, {"video_id": "uIki912a3mM", "question": "In a dimly lit scene, a few people in suits are sitting on black chairs. In the center of the screen is a man with short black hair. 
What is the man with short black hair doing when the subtitle 'I'm praying in forgiveness' appears?", "question_wo_referring_query": "What is the man with short black hair doing on the screen?", "candidates": ["Standing up with a microphone singing a song", "Putting the microphone on his leg, raising both hands to clap", "Sitting with crossed legs, holding a microphone, left hand clenched in a fist on his left leg", "Bending down to tie shoelaces", "Sitting with crossed legs, holding a microphone, left hand clenched in a fist on his right leg"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "uIki912a3mM_0", "video_path": "uIki912a3mM.mp4", "subtitle_path": "uIki912a3mM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1618.29, "view_count": 42427}, {"video_id": "uIki912a3mM", "question": "In a dimly lit scene, there are two people: a woman with her hair tied up and a man with short hair. When the subtitle 'should be universal and so we tried to' appears, what is the woman with her hair tied up doing?", "question_wo_referring_query": "What is the woman with her hair tied up doing?", "candidates": ["Tidying her hair with her right hand", "Crossing her arms in front of her chest", "Putting on a black suit jacket", "Bending over to tie her shoelaces", "Holding a phone with her right hand, raising her left hand with the palm facing up"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "uIki912a3mM_1", "video_path": "uIki912a3mM.mp4", "subtitle_path": "uIki912a3mM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1618.29, "view_count": 42427}, {"video_id": "uIki912a3mM", "question": "In a dark scene, there are several people sitting on black chairs. In the middle of the screen, there is a man with a buzz cut wearing a suit. 
When the caption 'benefited from the RV's preset soccer' appears, what is the man with the buzz cut doing?", "question_wo_referring_query": "What is the man with a buzz cut doing on screen?", "candidates": ["Pressing acupuncture points with both hands", "Resting his right hand on the shoulder of the woman on his right", "Bending down to tie his shoelace", "Clapping happily", "Resting his left hand on the shoulder of the man on his left"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "uIki912a3mM_2", "video_path": "uIki912a3mM.mp4", "subtitle_path": "uIki912a3mM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1618.29, "view_count": 42427}, {"video_id": "DJpjPVksc1A", "question": "On a white PPT slide with black text 'Noise Scheduling', there are some formulas. Below the formulas, there is a gray box with a small red dot inside. After the small red dot slides to the upper right corner of the gray box, what happens next?", "question_wo_referring_query": "After the small red dot slides to the upper right corner of the gray box, what happens next?", "candidates": ["Draw a triangle in the gray box", "Draw a heart in the gray box", "Slide to the lower left corner of the gray box", "Write text in the gray box"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "DJpjPVksc1A_0", "video_path": "DJpjPVksc1A.mp4", "subtitle_path": "DJpjPVksc1A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1857.53, "view_count": 3027}, {"video_id": "DJpjPVksc1A", "question": "On a white PPT slide, there are four photos assembled together. One of the photos has a black-and-white panda, and next to the photo are black letters 'Cascaded Image Synthesis'. Below the black letters, there is a small red dot. 
After the small red dot slides to the left on the PPT slide, what happens next?", "question_wo_referring_query": ", after the small red dot slides to the left on the PPT slide, what happens next?", "candidates": ["Draws a heart on the PPT slide", "Slides to the right on the PPT slide", "Draws a triangle on the PPT slide", "Writes text on the PPT slide"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "DJpjPVksc1A_1", "video_path": "DJpjPVksc1A.mp4", "subtitle_path": "DJpjPVksc1A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1857.53, "view_count": 3027}, {"video_id": "DJpjPVksc1A", "question": "On a white PPT slide, there are eight stitched-together pictures of a leopard. Next to the leopard pictures, there is a dashed frame. A small red dot is on the leopard pictures. What did the small red dot do after it slid to the right?", "question_wo_referring_query": ", What did the small red dot do after it slid to the right?", "candidates": ["Write words on the leopard pictures", "Draw a heart on the leopard pictures", "Draw a triangle on the leopard pictures", "Slide to the left on the leopard pictures"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "DJpjPVksc1A_2", "video_path": "DJpjPVksc1A.mp4", "subtitle_path": "DJpjPVksc1A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1857.53, "view_count": 3027}, {"video_id": "DwTUtOGNRSk", "question": "In the screen, there is a whiteboard with an English title surrounded by a yellow frame and it has two drawings on it. In the upper left corner, there is a woman wearing a black and white top with an orange baseball cap. In the upper right corner, there is a woman with long straight black hair wearing a sand-colored suit jacket. 
After the subtitle 'which makes it CIS exactly okay and then' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The woman in the upper right corner stretches her arms out over the whiteboard.", "A person with blue nail polish points to the star-shaped images on the left side of the whiteboard.", "The woman in the upper left corner points to a purple label with English writing on the whiteboard.", "The woman in the upper right corner removes the labels with purple and green frames.", "The woman in the upper right corner takes out a paper with a design of eight star shapes and places it on the whiteboard."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "DwTUtOGNRSk_0", "video_path": "DwTUtOGNRSk.mp4", "subtitle_path": "DwTUtOGNRSk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.67, "view_count": 5316}, {"video_id": "DwTUtOGNRSk", "question": "In the upper left corner of the screen, there is a woman wearing a black and white shirt and an orange baseball cap. In the upper right corner, there is a woman with long straight black hair wearing a beige business suit. The woman in the upper right corner reaches out to the whiteboard, pointing at a picture with a green-bordered sticky note that says 'trans.' 
After the subtitle 'two methyl groups one at one and another' appears, what happens in the screen?", "question_wo_referring_query": "What happens in the screen?", "candidates": ["The woman in the upper right corner places the green-bordered label onto the triangle diagram in the upper left corner.", "The woman in the upper right corner opens her hands towards the whiteboard.", "The woman in the upper left corner marks a triangle diagram in the lower left corner.", "The woman in the upper right corner pulls out a paper with eight triangle diagrams and places it on the whiteboard.", "The woman in the upper left corner places a pink sticky note from the whiteboard onto the picture with a triangle diagram."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "DwTUtOGNRSk_1", "video_path": "DwTUtOGNRSk.mp4", "subtitle_path": "DwTUtOGNRSk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.67, "view_count": 5316}, {"video_id": "DwTUtOGNRSk", "question": "In the upper left corner of the screen, there is a woman wearing a black and white top and an orange baseball cap. In the upper right corner, there is a woman with long straight black hair wearing a khaki blazer. On the whiteboard, there are two pushpins with some blue and purple paper strips attached. 
What happens on the screen after the subtitle 'that's a good thing okay' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A right hand with purple nail polish takes out an 'Equally Stable' sticker", "A right hand with blue nail polish takes out a pink circle sticker", "A right hand with purple nail polish takes out a 'trans' sticker", "A right hand with blue nail polish takes out a sticker with the number 1", "A right hand with blue nail polish extends four fingers."], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "DwTUtOGNRSk_2", "video_path": "DwTUtOGNRSk.mp4", "subtitle_path": "DwTUtOGNRSk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.67, "view_count": 5316}, {"video_id": "aGLQR6TLlcI", "question": "In a dark scene, there is a sun emitting rays of light and a sculpture beside it. After the narration says 'the days get shorter the weather gets,' who is the first character to appear?", "question_wo_referring_query": ", who is the first character to appear?", "candidates": ["A man wearing gray shorts", "A man wearing a white short-sleeve shirt", "A man wearing a blue short-sleeve shirt", "A man wearing a red short-sleeve shirt", "A woman wearing a black hat and a pink coat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "aGLQR6TLlcI_0", "video_path": "aGLQR6TLlcI.mp4", "subtitle_path": "aGLQR6TLlcI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 963.59, "view_count": 199383}, {"video_id": "aGLQR6TLlcI", "question": "On a flat road, there is a man standing wearing a yellow hat and a tie-dye hoodie, with vehicles parked behind him. 
When the subtitle 'squad right here' appears, who is the first character to appear?", "question_wo_referring_query": "Who is the first character to appear?", "candidates": ["The man wearing a red short-sleeve shirt", "The man wearing a yellow jacket", "The man wearing a white short-sleeve shirt", "The man wearing a blue short-sleeve shirt", "The man wearing an olive-green parka"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "aGLQR6TLlcI_1", "video_path": "aGLQR6TLlcI.mp4", "subtitle_path": "aGLQR6TLlcI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 963.59, "view_count": 199383}, {"video_id": "aGLQR6TLlcI", "question": "On a zebra crossing, there are several adults leading a group of children across the road. The leading adult is carrying an orange backpack. After the subtitle 'a half later they're not real the same' appears, who is the first person to make an appearance?", "question_wo_referring_query": "Who is the first person to make an appearance?", "candidates": ["A man wearing a black coat, with short black hair, and glasses", "A man wearing an olive green down jacket", "A man wearing a red short-sleeved shirt", "A man wearing black and grey shorts", "A man wearing a white short-sleeved shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "aGLQR6TLlcI_2", "video_path": "aGLQR6TLlcI.mp4", "subtitle_path": "aGLQR6TLlcI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 963.59, "view_count": 199383}, {"video_id": "kLaveaWaQy4", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there is a man carrying a tape recorder and a man wearing a white mask facing each other; then, there is a man wearing a dark blue top with short hair carrying a tape recorder 
on his shoulder; and lastly, on a street with many shops, there is a woman wearing a blue top and black pants looking at merchandise.", "First, on a street with many shops, there is a woman wearing a blue top and black pants looking at merchandise; then, there is a man carrying a tape recorder and a man wearing a white mask facing each other; and lastly, there is a man wearing a dark blue top with short hair carrying a tape recorder on his shoulder.", "First, there is a man wearing a dark blue top with short hair carrying a tape recorder on his shoulder; then, on a street with many shops, there is a woman wearing a blue top and black pants looking at merchandise; and lastly, there is a man carrying a tape recorder and a man wearing a white mask facing each other.", "First, there is a man wearing a dark blue top with short hair carrying a tape recorder on his shoulder; then, there is a man carrying a tape recorder and a man wearing a white mask facing each other; and lastly, on a street with many shops, there is a woman wearing a blue top and black pants looking at merchandise.", "First, on a street with many shops, there is a woman wearing a blue top and black pants looking at merchandise; then, there is a man wearing a dark blue top with short hair carrying a tape recorder on his shoulder; and lastly, there is a man carrying a tape recorder and a man wearing a white mask facing each other."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "kLaveaWaQy4_0", "video_path": "kLaveaWaQy4.mp4", "subtitle_path": "kLaveaWaQy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.97, "view_count": 83996}, {"video_id": "kLaveaWaQy4", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["First, a scene where various models of refrigerators are being sold; then, a man 
in a brown suit carrying a paper box with a TV on a crowded street; finally, a scene where various models of washing machines are being sold.", "First, a man in a brown suit carrying a paper box with a TV on a crowded street; then, a scene where various models of refrigerators are being sold; finally, a scene where various models of washing machines are being sold.", "First, a scene where various models of washing machines are being sold; then, a scene where various models of refrigerators are being sold; finally, a man in a brown suit carrying a paper box with a TV on a crowded street.", "First, a scene where various models of refrigerators are being sold; then, a scene where various models of washing machines are being sold; finally, a man in a brown suit carrying a paper box with a TV on a crowded street.", "First, a man in a brown suit carrying a paper box with a TV on a crowded street; then, a scene where various models of washing machines are being sold; finally, a scene where various models of refrigerators are being sold."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "kLaveaWaQy4_1", "video_path": "kLaveaWaQy4.mp4", "subtitle_path": "kLaveaWaQy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.97, "view_count": 83996}, {"video_id": "kLaveaWaQy4", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a woman wearing a maid outfit and glasses is carrying a tray of dishes; then, the woman in the maid outfit is facing a man in a red and white checkered shirt, with both of her hands clenched into fists by her cheeks in a cutesy manner; finally, the woman in the maid outfit and the man in the red and white checkered shirt are both making cutesy fist gestures by their cheeks together.", "First, a woman wearing a maid outfit is facing a man in a 
red and white checkered shirt, with her hands clenched into fists by her cheeks in a cutesy manner; then, a woman wearing a maid outfit and glasses is carrying a tray of dishes; finally, the woman in the maid outfit and the man in the red and white checkered shirt are both making cutesy fist gestures by their cheeks together.", "First, a woman wearing a maid outfit is facing a man in a red and white checkered shirt, with her hands clenched into fists by her cheeks in a cutesy manner; then, the woman in the maid outfit and the man in the red and white checkered shirt are both making cutesy fist gestures by their cheeks together; finally, a woman wearing a maid outfit and glasses is carrying a tray of dishes.", "First, a woman wearing a maid outfit and glasses is carrying a tray of dishes; then, the woman in the maid outfit and the man in the red and white checkered shirt are both making cutesy fist gestures by their cheeks together; finally, the woman in the maid outfit is facing the man in the red and white checkered shirt, with both of her hands clenched into fists by her cheeks in a cutesy manner.", "First, the woman in the maid outfit and the man in the red and white checkered shirt are both making cutesy fist gestures by their cheeks together; then, the woman in the maid outfit is facing the man in the red and white checkered shirt, with both of her hands clenched into fists by her cheeks in a cutesy manner; finally, a woman wearing a maid outfit and glasses is carrying a tray of dishes."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "kLaveaWaQy4_2", "video_path": "kLaveaWaQy4.mp4", "subtitle_path": "kLaveaWaQy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.97, "view_count": 83996}, {"video_id": "q1cBLxdWNVw", "question": "On an emerald green body of water, there is a bridge where two men are standing. 
One man is wearing a dark gray short-sleeve shirt and the other is wearing a black top. They are fist bumping. In which of the following scenes has the man in the dark gray short-sleeve shirt appeared before?", "question_wo_referring_query": "In which of the following scenes has the man in the dark gray short-sleeve shirt appeared before?", "candidates": ["In an empty bar", "In a park during rain", "In an endless desert", "On a beach during rain", "In a room with a red sofa"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "q1cBLxdWNVw_0", "video_path": "q1cBLxdWNVw.mp4", "subtitle_path": "q1cBLxdWNVw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1049.13, "view_count": 3898532}, {"video_id": "q1cBLxdWNVw", "question": "In front of a ticket vending machine, there stands a man wearing a grey short-sleeved shirt and carrying a backpack. Next to him is a trash can with a bottle of water on it. In which of the following scenes has the man in the grey short-sleeved shirt appeared?", "question_wo_referring_query": "In which of the following scenes has the man in the grey short-sleeved shirt appeared?", "candidates": ["On a boat cruising through emerald waters", "On a mountain top during snowfall", "In an unfrequented forest", "On a beach during the rain", "In a lively KTV"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "q1cBLxdWNVw_1", "video_path": "q1cBLxdWNVw.mp4", "subtitle_path": "q1cBLxdWNVw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1049.13, "view_count": 3898532}, {"video_id": "q1cBLxdWNVw", "question": "A house with yellow exterior walls has a large dark brown door. In front of the door, there is a man wearing a gray short-sleeved shirt walking. 
Where else has this gray short-sleeved shirt appeared?", "question_wo_referring_query": "Where else has the gray short-sleeved shirt worn by the man appeared?", "candidates": ["In an empty indoor basketball court", "In a deserted forest", "In a park during the rain", "In a clothing store filled with clothes", "In a forest during the rain"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "q1cBLxdWNVw_2", "video_path": "q1cBLxdWNVw.mp4", "subtitle_path": "q1cBLxdWNVw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1049.13, "view_count": 3898532}, {"video_id": "gRszqN6rHE8", "question": "On a street with many shops, a man wearing a dark blue short-sleeved shirt and a purple headband is walking. Which of the following subtitles appeared along with the man in the dark blue short-sleeved shirt?", "question_wo_referring_query": ", which of the following subtitles appeared along with the man in the dark blue short-sleeved shirt?", "candidates": ["\"his life he was absolutely blown away by\"", "\"a portion of this video is sponsored by\"", "\"to regret this it's a marathon but i\"", "\"home city of paris for the first time in\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "gRszqN6rHE8_0", "video_path": "gRszqN6rHE8.mp4", "subtitle_path": "gRszqN6rHE8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1087.76, "view_count": 994527}, {"video_id": "gRszqN6rHE8", "question": "In a dark space, there are two people sitting: one man wearing an orange long-sleeved shirt and another man wearing purple shorts. Next to the man in purple shorts, there are some food items. 
Which of the following subtitles appears along with the man in the orange long-sleeved shirt?", "question_wo_referring_query": "Which of the following subtitles appears along with the man in the orange long-sleeved shirt?", "candidates": ["\"a portion of this video is sponsored by\"", "\"two sandwiches here three pasta here\"", "\"his life he was absolutely blown away by\"", "\"home city of Paris for the first time in\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "gRszqN6rHE8_1", "video_path": "gRszqN6rHE8.mp4", "subtitle_path": "gRszqN6rHE8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1087.76, "view_count": 994527}, {"video_id": "gRszqN6rHE8", "question": "On a street where a car is parked, there is a group of people standing. At the front stands a man wearing a black helmet, a green jacket, and black-rimmed glasses. Behind them is a red car. In which subtitles has the man with the black helmet appeared together with the following text?", "question_wo_referring_query": "In which subtitles did the man wearing a black helmet appeared together with the following text?", "candidates": ["\"his life he was absolutely blown away by\"", "\"home city of paris for the first time in\"", "\"you to every single one of you guys so\"", "\"a portion of this video is sponsored by\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "gRszqN6rHE8_2", "video_path": "gRszqN6rHE8.mp4", "subtitle_path": "gRszqN6rHE8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1087.76, "view_count": 994527}, {"video_id": "D7pScyrGc2o", "question": "In a scene with many people, there is a woman wearing a white long-sleeve top. She is wearing black sunglasses. 
When this woman wearing the white long-sleeve top appears on a magazine cover, what change occurs to her glasses?", "question_wo_referring_query": ", what change occurs to her glasses?", "candidates": ["The black sunglasses turn into green decorative glasses", "The black sunglasses turn into purple decorative glasses", "The black sunglasses turn into blue decorative glasses", "The black sunglasses turn into red decorative glasses", "The black sunglasses turn into transparent glasses"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "D7pScyrGc2o_0", "video_path": "D7pScyrGc2o.mp4", "subtitle_path": "D7pScyrGc2o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1515.79, "view_count": 427285}, {"video_id": "D7pScyrGc2o", "question": "Against a black background stands a woman with a mole on her face. She is wearing a white scarf around her neck. What change happens to the white scarf around the woman's neck when she appears in a scene with a blurry background?", "question_wo_referring_query": "What change happens to the white scarf around her neck?", "candidates": ["It changes from white to blue", "It changes from white to purple", "It changes from white to red", "It changes from white to orange", "It changes from white to green"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "D7pScyrGc2o_1", "video_path": "D7pScyrGc2o.mp4", "subtitle_path": "D7pScyrGc2o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1515.79, "view_count": 427285}, {"video_id": "D7pScyrGc2o", "question": "In a room with a blue chair and a dark green table, a man wearing a white shirt is sitting. 
When this man wearing a white shirt appears on a black and white screen, what change happens to the clothes he is wearing?", "question_wo_referring_query": "What change happens to the clothes he is wearing?", "candidates": ["His white shirt changes to a black T-shirt", "His white shirt changes to a white suit", "His white shirt changes to a black suit", "His white shirt changes to a dark shirt", "His white shirt changes to a black hooded jacket"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "D7pScyrGc2o_2", "video_path": "D7pScyrGc2o.mp4", "subtitle_path": "D7pScyrGc2o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1515.79, "view_count": 427285}, {"video_id": "BgCdxJce5Yk", "question": "In a black room, there are two people sitting. One is a woman wearing a blue floral dress and the other is a man wearing a dark blue coat. Between them, there's a large screen. On the screen, there is a man wearing a white coat. 
When the subtitle 'it I will say that if it's' appears, what change occurs on the screen between them?", "question_wo_referring_query": "What change occurs on the screen between them?", "candidates": ["The man wearing a white coat on the screen turns into a man wearing a black coat.", "The man wearing a white coat on the screen turns into a man wearing a green shirt.", "The man wearing a white coat on the screen turns into a man wearing a black hoodie.", "The man wearing a white coat on the screen turns into a man wearing a blue suit.", "The man wearing a white coat on the screen turns into a man wearing a white T-shirt."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "BgCdxJce5Yk_0", "video_path": "BgCdxJce5Yk.mp4", "subtitle_path": "BgCdxJce5Yk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.64, "view_count": 6953}, {"video_id": "BgCdxJce5Yk", "question": "In a room with two blue chairs, there are two people sitting: a woman in a blue floral dress and a man in black pants. When the caption 'like float tank isolation Chambers and I' appears, what changes occur in the man's framed image?", "question_wo_referring_query": "what changes occur in the framed image of the man in black pants?", "candidates": ["The framed image became more blurred", "The framed image got larger", "The framed image changed from his front view to his back view", "The framed image got smaller"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "BgCdxJce5Yk_1", "video_path": "BgCdxJce5Yk.mp4", "subtitle_path": "BgCdxJce5Yk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.64, "view_count": 6953}, {"video_id": "BgCdxJce5Yk", "question": "In a room with a round table, two people are sitting. One is a woman wearing a blue floral dress, and the other is a man wearing black pants. 
When the subtitle 'Kombat like you know this latest edition' appears, what change happens to the woman in the blue floral dress?", "question_wo_referring_query": "What change happens to the woman in the blue floral dress?", "candidates": ["She changes from wearing a blue floral dress to wearing a purple suit.", "Her crossed legs change to both legs down.", "She changes from facing the side of the mirror to facing the mirror directly.", "She changes from sitting on a chair to standing on the ground.", "She changes from having her hair down to having her hair tied."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "BgCdxJce5Yk_2", "video_path": "BgCdxJce5Yk.mp4", "subtitle_path": "BgCdxJce5Yk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.64, "view_count": 6953}, {"video_id": "m8pOnJxOcqY", "question": "On a gray page, there is a document titled 'Gradient-Based Learning Applied to Document Recognition'. What happens to the document on the gray page?", "question_wo_referring_query": "What happens to the document on the gray page?", "candidates": ["Scrolls from bottom to top", "Scrolls from left to right", "Scrolls from top to bottom", "Scrolls from right to left", "Spins in circles on the gray page"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "m8pOnJxOcqY_0", "video_path": "m8pOnJxOcqY.mp4", "subtitle_path": "m8pOnJxOcqY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1160.33, "view_count": 136996}, {"video_id": "m8pOnJxOcqY", "question": "On a white PPT slide with a red title 'Convolutional Neural Networks', there are three squares. Inside the large green square in the middle, there is a small yellow square. 
What is the small yellow square doing inside the large green square?", "question_wo_referring_query": "What is the small yellow square doing inside the large green square?", "candidates": ["Moving to the right inside the large green square", "Moving up and down inside the large green square", "Circling around inside the large green square", "Gradually fading inside the large green square", "Moving to the left inside the large green square"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "m8pOnJxOcqY_1", "video_path": "m8pOnJxOcqY.mp4", "subtitle_path": "m8pOnJxOcqY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1160.33, "view_count": 136996}, {"video_id": "m8pOnJxOcqY", "question": "On a white PPT slide with a red title 'Convolutional Neural Networks', there is a rectangle containing many people's photos. What changes occur to the photos within the rectangle?", "question_wo_referring_query": "What changes occur to the photos within the rectangle?", "candidates": ["The photos keep disappearing within the rectangle", "The photos gradually shrink within the rectangle", "The photos gradually enlarge within the rectangle", "The photos rotate within the rectangle", "The photos slide from top to bottom within the rectangle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "m8pOnJxOcqY_2", "video_path": "m8pOnJxOcqY.mp4", "subtitle_path": "m8pOnJxOcqY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1160.33, "view_count": 136996}, {"video_id": "ckCoFSQgeP4", "question": "On a black background, there is a regional map. Above the regional map, it says 'Latino Map of Chicago 2000'. In the regional map, there is a yellow circle with an arrow inside it. 
What color is the arrow inside the yellow circle?", "question_wo_referring_query": "What color is the arrow inside the yellow circle?", "candidates": ["white", "red", "purple", "black", "blue"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "ckCoFSQgeP4_0", "video_path": "ckCoFSQgeP4.mp4", "subtitle_path": "ckCoFSQgeP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1041.28, "view_count": 76222}, {"video_id": "ckCoFSQgeP4", "question": "In a blurry screen, in front of a green cloth, there is a man wearing a black shirt sitting on a black chair. What hairstyle does the man with the black shirt have?", "question_wo_referring_query": "What hairstyle does the man with the black shirt have?", "candidates": ["Short hair", "Long hair", "Crew cut", "Buzz cut", "Shoulder-length curls"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "ckCoFSQgeP4_1", "video_path": "ckCoFSQgeP4.mp4", "subtitle_path": "ckCoFSQgeP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1041.28, "view_count": 76222}, {"video_id": "ckCoFSQgeP4", "question": "In a bright room with green walls and a blue wall clock, there is a man wearing a gray short-sleeved shirt. 
What hairstyle does the man wearing the gray short-sleeved shirt have?", "question_wo_referring_query": "What hairstyle does the man wearing the gray short-sleeved shirt have?", "candidates": ["Red short hair", "Black shoulder-length curls", "Blue short hair", "Black short hair", "Hair bun"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "ckCoFSQgeP4_2", "video_path": "ckCoFSQgeP4.mp4", "subtitle_path": "ckCoFSQgeP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1041.28, "view_count": 76222}, {"video_id": "RfN1WU3tLq4", "question": "In a blurry screen, there is a black frame drum. In front of the black frame drum, there is a man with a guitar hanging on him. When the subtitle \"after a series of dates throughout the South.\" appears, what color clothes is the man with a guitar hanging on him wearing?", "question_wo_referring_query": "What color clothes is the man with a guitar hanging on him wearing?", "candidates": ["white", "purple", "blue", "red", "black"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "RfN1WU3tLq4_0", "video_path": "RfN1WU3tLq4.mp4", "subtitle_path": "RfN1WU3tLq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1419.0, "view_count": 460223}, {"video_id": "RfN1WU3tLq4", "question": "On a red background, there is a man with short black hair in the middle of the screen. Beside the man, there is a red sign with the word 'SORRY!' on it. 
When the subtitle 'from the unprecedented energy crisis of the 1970s' appears, what is the shape of the green border surrounding the man?", "question_wo_referring_query": "What is the shape of the green border surrounding the man?", "candidates": ["Staircase", "Rectangle", "Circle", "Triangle", "Hexagon"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "RfN1WU3tLq4_1", "video_path": "RfN1WU3tLq4.mp4", "subtitle_path": "RfN1WU3tLq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1419.0, "view_count": 460223}, {"video_id": "RfN1WU3tLq4", "question": "In a scene where the text 'Saturday Night Fever\nParamount Pictures (1977)' is displayed in white font, there is a beautiful woman. When the subtitle 'Because I don't date guys like you anymore for one thing.' appears, what hairstyle does the woman in the scene have?", "question_wo_referring_query": "What hairstyle does the woman in the scene have?", "candidates": ["Shoulder-length black straight hair", "Waist-length black hair", "Gold waist-length short hair", "Black ear-length short hair", "Shoulder-length brown curls"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "RfN1WU3tLq4_2", "video_path": "RfN1WU3tLq4.mp4", "subtitle_path": "RfN1WU3tLq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1419.0, "view_count": 460223}, {"video_id": "Fp4eovJQ7d8", "question": "On the black stage, there are three men sitting. The man on the left has short hair and is wearing a black top and pants, the man in the middle has long hair and is dressed in light-colored clothes, and the man on the right has his legs crossed and is wearing a gray suit. Below the stage, there is a crowd of people, among which there is a white pillar. Above the crowd are glaring lights. 
Who is holding a bottle of water?", "question_wo_referring_query": "...above the crowd are glaring lights. Who is holding a bottle of water?", "candidates": ["The woman in a black long dress below the stage", "The man in black short sleeves below the stage", "The long-haired woman wearing black glasses below the stage", "The long-haired man in light-colored clothes on the stage", "The short-haired man in black clothes on the stage"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "Fp4eovJQ7d8_0", "video_path": "Fp4eovJQ7d8.mp4", "subtitle_path": "Fp4eovJQ7d8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3054.12, "view_count": 2651}, {"video_id": "Fp4eovJQ7d8", "question": "Two gentlemen are sitting on chairs. The man on the left is wearing a grey suit and trousers, while the man on the right is wearing a dark-colored jacket and black trousers. The long-haired man on the left is speaking into a microphone, and the man on the right has his leg raised. Behind them is a grey curtain and silver poles. Who is drinking water?", "question_wo_referring_query": "Who is drinking water?", "candidates": ["The long-haired man in the dark jacket", "The blonde woman in the light-colored suit", "The long-haired man in the light-colored suit", "The curly-haired man in the light-colored suit", "The short-haired man in the dark jacket"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "Fp4eovJQ7d8_1", "video_path": "Fp4eovJQ7d8.mp4", "subtitle_path": "Fp4eovJQ7d8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3054.12, "view_count": 2651}, {"video_id": "Fp4eovJQ7d8", "question": "On the black stage sit three men. The man with short hair on the left is wearing a black top and pants, the man with long hair in the middle is dressed in a light-colored outfit, and the man on the right is wearing a gray suit. 
The long-haired man in the middle is talking on a phone. There is a dense crowd below the stage. Who is filming?", "question_wo_referring_query": "Who is filming?", "candidates": ["The long-haired man on the stage", "The man in gray clothing on the stage", "The man in black clothing crouching below the stage", "The long-haired woman on the stage", "The man in black clothing on the stage"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "Fp4eovJQ7d8_2", "video_path": "Fp4eovJQ7d8.mp4", "subtitle_path": "Fp4eovJQ7d8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3054.12, "view_count": 2651}, {"video_id": "LLRx2anvn1E", "question": "In the white background, there is a yellow logo and black English text in the top left corner; in the top right corner, there is a man speaking; in the center background, a video is playing showing a man in a green outer coat standing in front of a gate, carrying a backpack and wearing a hat. What did the man wearing the hat do the first time he appeared?", "question_wo_referring_query": "What did the man wearing the hat do the first time he appeared?", "candidates": ["Scanned the code on the gate with his phone screen", "Took off his backpack", "Stepped over the gate", "Picked up an apple", "Took off his hat"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "LLRx2anvn1E_0", "video_path": "LLRx2anvn1E.mp4", "subtitle_path": "LLRx2anvn1E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2965.1, "view_count": 8909}, {"video_id": "LLRx2anvn1E", "question": "In the white background's upper left corner, there is a yellow logo and black English text. In the upper right corner of the background, a man is talking. The center of the background shows a picture. In the picture, two men wearing glasses are sitting in the front row of a driverless car. 
The man on the left is wearing a black short-sleeve shirt and a silver watch, with yellow lettering on his chest. The man on the right is wearing a gray short-sleeve shirt and a silver watch, with a blue pattern on his chest. What did the man on the left do the first time he appeared on the scene?", "question_wo_referring_query": "What did the man on the left do the first time he appeared on the scene?", "candidates": ["Raised both hands", "Waved his hands left and right", "Kept shaking his head", "Nodded to indicate approval", "Put both hands on his hips"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "LLRx2anvn1E_1", "video_path": "LLRx2anvn1E.mp4", "subtitle_path": "LLRx2anvn1E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2965.1, "view_count": 8909}, {"video_id": "LLRx2anvn1E", "question": "In the background, on the left side there's a yellow icon and black English text. On the upper right, a man is speaking. A video is playing in the center of the background, showing a white car and a black car driving on the road. On the upper right of the video, there is a red banner. What did the white car do the first time it appeared?", "question_wo_referring_query": "What did the white car do the first time it appeared?", "candidates": ["The white car made a turn", "The white car overtook the black car", "The sunroof of the white car opened", "The window of the white car opened", "The white car parked by the roadside"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O2E", "level": "IntraMoment", "id": "LLRx2anvn1E_2", "video_path": "LLRx2anvn1E.mp4", "subtitle_path": "LLRx2anvn1E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2965.1, "view_count": 8909}, {"video_id": "R5zK2gznGoE", "question": "Four men are walking together. 
The first two men include one wearing a blue suit and glasses, and one with gray hair wearing a black suit. The man at the end is wearing a black suit and trousers, and glasses. There is green vegetation on the ground and a yellow wall on the side. At the wall entrance, there are red pillars with three men standing next to them. When the subtitle 'protocol was met with a lot of criticism' appears, what is the man wearing glasses at the front doing?", "question_wo_referring_query": "What is the man wearing glasses at the front doing?", "candidates": ["Holding a notebook around his waist with his hand", "Holding a bottle of beverage with his hand", "Carrying a bag", "Holding a document case", "Waving a hand to say hello"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "R5zK2gznGoE_0", "video_path": "R5zK2gznGoE.mp4", "subtitle_path": "R5zK2gznGoE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1111.28, "view_count": 249527}, {"video_id": "R5zK2gznGoE", "question": "A woman in a white top is sitting on a black chair. In front of her is a white laptop. Beside her is a window, partially covered by a yellow blind. When the subtitle 'addresses in the country in June of 2021' appears, what is the woman in a white top doing?", "question_wo_referring_query": "What is the woman in a white top doing?", "candidates": ["Shaking her head", "Drinking water", "Arching her back", "Brushing her hair", "Waving her hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "R5zK2gznGoE_1", "video_path": "R5zK2gznGoE.mp4", "subtitle_path": "R5zK2gznGoE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1111.28, "view_count": 249527}, {"video_id": "R5zK2gznGoE", "question": "A man wearing a white short-sleeve shirt appears in the bathroom. The man has a black watch on his wrist and is holding a mobile phone. 
Next to the man is a sink, and behind him are a window and a curtain. When the subtitle 'up my bathroom I how know how the' appears, what is the man in the white short-sleeve shirt doing?", "question_wo_referring_query": "What is the man in the white short-sleeve shirt doing?", "candidates": ["The man is crossing his arms", "The man is looking at the bedroom", "The man is staring at his phone", "The man is placing both hands on his thighs", "The man is clasping his hands together"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "R5zK2gznGoE_2", "video_path": "R5zK2gznGoE.mp4", "subtitle_path": "R5zK2gznGoE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1111.28, "view_count": 249527}, {"video_id": "-7wwfGJXEZg", "question": "In the center of the screen, there is a red circle with a white letter 'E' inside. The surrounding scene is dim, with white lock icons scattered around. After finishing the introduction of special offers, what else is mentioned?", "question_wo_referring_query": "After finishing the introduction of special offers, what else is mentioned?", "candidates": ["Introduced Roman", "Mentioned a king", "Mentioned a team of soldiers", "Express gratitude again", "Mentioned the Huns"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "-7wwfGJXEZg_0", "video_path": "-7wwfGJXEZg.mp4", "subtitle_path": "-7wwfGJXEZg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1515.2, "view_count": 668881}, {"video_id": "-7wwfGJXEZg", "question": "In the scene, a man wearing a crown and a long robe is standing. The man's long robe covers one of his arms. Surrounding the man are soldiers with guns and shields and three other characters dressed in green, white, and yellow garments. 
Immediately after the new king ascended to the throne following a war, what happened?", "question_wo_referring_query": "What happened?", "candidates": ["A civil war erupted in Rome.", "Rome suffered a defeat at the frontline.", "The queen was assassinated.", "The new king gained the support of the military establishment.", "The king was assassinated."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "-7wwfGJXEZg_1", "video_path": "-7wwfGJXEZg.mp4", "subtitle_path": "-7wwfGJXEZg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1515.2, "view_count": 668881}, {"video_id": "-7wwfGJXEZg", "question": "A group of soldiers with long guns and shields are advancing on a road. Among them, one soldier is holding up a red flag. There are weeds and houses on both sides of the road. What happened after his term ended?", "question_wo_referring_query": "What happened?", "candidates": ["Gothic army was defeated", "Rome and the Southern army reached a ceasefire agreement", "Romans experienced setbacks on the frontline", "Roman mercenaries had an internal conflict", "An incident where Gaius appeared to delay military pay"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "-7wwfGJXEZg_2", "video_path": "-7wwfGJXEZg.mp4", "subtitle_path": "-7wwfGJXEZg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1515.2, "view_count": 668881}, {"video_id": "dcTbD2Ju7VE", "question": "A male host is standing in a studio. The man is wearing a black suit and a red tie, holding a script in his hand. Behind him on the screen, there is a picture of an airplane. 
After the subtitle 'off' appears, what event is mentioned next?", "question_wo_referring_query": ", what event is mentioned next?", "candidates": ["Criminal investigation into the airplane accident", "Presidential election", "Tornado disaster", "Border conflict", "Animal injury incident in Texas"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "dcTbD2Ju7VE_0", "video_path": "dcTbD2Ju7VE.mp4", "subtitle_path": "dcTbD2Ju7VE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1099.03, "view_count": 586255}, {"video_id": "dcTbD2Ju7VE", "question": "A man is sitting in a studio, wearing a black suit and white shirt, with a red tie. The screen behind him shows a forest and cars. After the caption 'the Priscilla Thompson has the very latest on the investigation tonight' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A helicopter crashes in the wilderness", "A tornado disaster", "A panda is moving", "A squad of soldiers skydiving", "Trump gives a speech"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "dcTbD2Ju7VE_1", "video_path": "dcTbD2Ju7VE.mp4", "subtitle_path": "dcTbD2Ju7VE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1099.03, "view_count": 586255}, {"video_id": "dcTbD2Ju7VE", "question": "A silver-haired man in a dark upper garment appears on the screen. The man is wearing white earphones, and the wall behind him is covered with photos of various sizes. There is a fan on the ceiling. 
The man raises one hand, and after the subtitle 'started swirling or circle like spinning' appears, what happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["The man connects with a woman in a checkered coat.", "A helicopter crashes in the wilderness.", "Trump gives a speech.", "A squad of soldiers starts jumping.", "Biden gives a speech."], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "dcTbD2Ju7VE_2", "video_path": "dcTbD2Ju7VE.mp4", "subtitle_path": "dcTbD2Ju7VE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1099.03, "view_count": 586255}, {"video_id": "93ZjtqvxnnQ", "question": "A man with long beard is talking to a man in black uniform. The man with long beard is wearing a light-colored shirt. The man in black uniform has short hair and stubble. There are white letters on the chest of his black uniform, and he is wearing a work badge around his neck. After the subtitle 'go space is nonsense' appears, what object appears on the screen?", "question_wo_referring_query": ", what object appears on the screen?", "candidates": ["A globe", "A pot of plant", "A photo frame", "A pair of glasses", "A white cup"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "93ZjtqvxnnQ_0", "video_path": "93ZjtqvxnnQ.mp4", "subtitle_path": "93ZjtqvxnnQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.56, "view_count": 121848}, {"video_id": "93ZjtqvxnnQ", "question": "A man wearing a white shirt and black suit is standing at the podium. The man's hair is blonde, and he is wearing an earpiece. A work badge is hanging around his neck, and he is wearing glasses. There is a black curtain behind him. On the podium in front of the man, some equipment is placed. 
After the subtitles display 'say hey man I want to come I want to', what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["a globe", "a picture frame", "a planter", "a dark bucket", "a pair of glasses"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "93ZjtqvxnnQ_1", "video_path": "93ZjtqvxnnQ.mp4", "subtitle_path": "93ZjtqvxnnQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.56, "view_count": 121848}, {"video_id": "93ZjtqvxnnQ", "question": "The man in the blue lab coat shakes hands with the man in the black suit. The man in the blue lab coat is carrying a backpack, and the man in the black suit is wearing glasses and has one hand in his coat pocket. Next to them, there are potted plants and a painting on the wall. After the subtitle 'international conference this many' appears, what object appears on the screen?", "question_wo_referring_query": "what object appears on the screen?", "candidates": ["a globe", "a bookshelf", "a television", "a potted plant", "a photo frame"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "93ZjtqvxnnQ_2", "video_path": "93ZjtqvxnnQ.mp4", "subtitle_path": "93ZjtqvxnnQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.56, "view_count": 121848}, {"video_id": "7Q8V2exDnDg", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the woman starts telling the story, then the woman introduces the program, and finally, the woman concludes with thanks.", "First, the woman in black introduces the program, then the woman starts telling the story, and finally, the woman concludes with thanks.", "First, the woman concludes with thanks, then the woman starts telling the story, and 
finally, the woman introduces the program.", "First, the woman starts telling the story, then the woman concludes with thanks, and finally, the woman introduces the program.", "First, the woman in black introduces the program, then the woman concludes with thanks, and finally, the woman starts telling the story."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "7Q8V2exDnDg_0", "video_path": "7Q8V2exDnDg.mp4", "subtitle_path": "7Q8V2exDnDg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.8, "view_count": 4337}, {"video_id": "7Q8V2exDnDg", "question": "After the lady officially starts telling the story, which of the following sequences of events is correct?", "question_wo_referring_query": "After the lady officially starts telling the story, which of the following sequences of events is correct?", "candidates": ["First, the lady leads everyone in singing a song, then she starts reading the book to tell the story, and after finishing the story, the lady discusses an item with everyone.", "First, the lady starts reading the book to tell the story, then the lady leads everyone in singing a song, and afterward, the lady discusses an item with everyone.", "First, the lady discusses an item with everyone, then she starts reading the book to tell the story, and afterward, the lady leads everyone in singing a song.", "First, the lady discusses an item with everyone, then the lady leads everyone in singing a song, and afterward, the lady starts reading the book to tell the story.", "First, the lady leads everyone in singing a song, then after finishing the story, the lady discusses an item with everyone, and then she starts reading the book to tell the story."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "7Q8V2exDnDg_1", "video_path": "7Q8V2exDnDg.mp4", "subtitle_path": "7Q8V2exDnDg_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 950.8, "view_count": 4337}, {"video_id": "7Q8V2exDnDg", "question": "In the story about names told by the woman in black, which of the following sequences of events is correct?", "question_wo_referring_query": "In the story about names told by the woman in black, which of the following sequences of events is correct?", "candidates": ["First, she tells the story of a lonely little girl who doesn't dare to tell others her name, then she talks about the girl's name and making fun of others' names, followed by the story's end, and then she tells how the girl's mother successfully encourages her.", "First, she talks about how the girl's mother successfully encourages the girl, then she talks about the girl's name and making fun of others' names, then she tells the story of a lonely little girl who doesn't dare to tell others her name, and finally, the story ends.", "First, she talks about how the girl's mother successfully encourages the girl, then she tells the story of a lonely little girl who doesn't dare to tell others her name, then she talks about the girl's name and making fun of others' names, and finally, the story ends.", "First, she talks about the girl's name and making fun of others' names, then she tells how the girl's mother successfully encourages the girl, then she tells the story of a lonely little girl who doesn't dare to tell others her name, and finally, the story ends.", "First, she tells the story of a lonely little girl who doesn't dare to tell others her name, then she talks about the girl's name and making fun of others' names, followed by how the girl's mother successfully encourages her, and finally, the story ends."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "7Q8V2exDnDg_2", "video_path": "7Q8V2exDnDg.mp4", "subtitle_path": "7Q8V2exDnDg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.8, 
"view_count": 4337}, {"video_id": "D_z7fNvblfY", "question": "The woman in the white top is holding a woven basket full of pumpkins. To her side, there is a white door with a flower wreath hanging on it. Behind her, there is a white staircase and a wooden rack with clothes and pictures on it. Where else do the pumpkins in the woman's hand appear?", "question_wo_referring_query": "Where else do the pumpkins in the woman's hand appear?", "candidates": ["The passenger seat of a car", "A white sofa", "Under a tree in the yard", "A white cabinet with glass bottles on it", "The steps of the staircase"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "D_z7fNvblfY_0", "video_path": "D_z7fNvblfY.mp4", "subtitle_path": "D_z7fNvblfY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.21, "view_count": 37026}, {"video_id": "D_z7fNvblfY", "question": "A long-haired woman is holding a blue shirt with a tag on it. Behind her is a wall covered with photos and a wooden shelf. The shelf holds a pumpkin and some books. There is a wreath hanging on the white door beside her. Where has the blue shirt in the woman's hand appeared before?", "question_wo_referring_query": "Where has the blue shirt in the woman's hand appeared before?", "candidates": ["On the window pressing against the door", "On the bathroom glass door", "On the white round bed", "On the kitchen counter", "On the woman in the mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "D_z7fNvblfY_1", "video_path": "D_z7fNvblfY.mp4", "subtitle_path": "D_z7fNvblfY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.21, "view_count": 37026}, {"video_id": "D_z7fNvblfY", "question": "The long-haired lady is wearing a light-colored fur coat, with her hands clasped together. 
Behind her is a wall covered with photos and a wooden shelf holding a pumpkin and books. On the white door to her side hangs a wreath. In which other place has this lady appeared?", "question_wo_referring_query": "In which other place has this lady appeared?", "candidates": ["On the sandy beach", "In a pavilion by the lake", "Beside a green tree", "At a noisy amusement park", "On a bench in the park"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "D_z7fNvblfY_2", "video_path": "D_z7fNvblfY.mp4", "subtitle_path": "D_z7fNvblfY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.21, "view_count": 37026}, {"video_id": "14XjLAscoFo", "question": "Three girls are together. The girl on the left is standing, holding a white bag, and wearing a blue top. The middle girl is wearing an orange mask and sitting on a bench. The girl on the right is wearing a green coat and sitting on a bench. The ground is grassy, and there's a tall tree behind the bench. What subtitles have appeared with the middle girl?", "question_wo_referring_query": "What subtitles have appeared with the middle girl?", "candidates": ["It feels pretty good", "That looks a little bad", "friends i love y'all", "Music", "I love that forest very much"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "14XjLAscoFo_0", "video_path": "14XjLAscoFo.mp4", "subtitle_path": "14XjLAscoFo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.43, "view_count": 198724}, {"video_id": "14XjLAscoFo", "question": "Two girls wearing glasses are hugging each other. The girl on the left is wearing a dark-colored top and holding a phone, while the girl on the right is wearing a light-colored top and checkered pants. There is a yellow backpack to their left. Behind them are a white roof, a large tree, and a blue sky. 
In what caption does this backpack appear together with?", "question_wo_referring_query": ", in what caption does this backpack appear together with?", "candidates": ["romantic time of the year i will lie", "That's bad news", "fruit trees berries squash and herbs at", "It feels pretty good", "Tall trees"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "14XjLAscoFo_1", "video_path": "14XjLAscoFo.mp4", "subtitle_path": "14XjLAscoFo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.43, "view_count": 198724}, {"video_id": "14XjLAscoFo", "question": "A girl with red hair is standing on the sidewalk. The girl is wearing light-colored clothes and pants, carrying a bag that matches her clothing color. There are lush green plants on both sides of the sidewalk. In what caption does this bag appear together?", "question_wo_referring_query": "In what caption does this bag appear together?", "candidates": ["Music", "That's bad news", "Yay", "romantic time of the year i will lie", "It feels pretty good"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "14XjLAscoFo_2", "video_path": "14XjLAscoFo.mp4", "subtitle_path": "14XjLAscoFo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.43, "view_count": 198724}, {"video_id": "FNjSsHCh74A", "question": "A man wearing a white short-sleeved shirt is squatting on the ground. There is a black object clipped to his collar. Behind the man, there is a black backpack leaning against a white wall. 
When a white picture is laid on the floor, what change occurs to the man?", "question_wo_referring_query": "What change occurs to the man?", "candidates": ["The man has an extra pen in his hand", "The man goes from squatting to standing", "The man has an extra piece of paper in his hand", "The man goes from squatting to kneeling", "The man goes from squatting to lying down"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "FNjSsHCh74A_0", "video_path": "FNjSsHCh74A.mp4", "subtitle_path": "FNjSsHCh74A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3123.96, "view_count": 96718}, {"video_id": "FNjSsHCh74A", "question": "A man in a white short-sleeve shirt is crouching on the floor. In front of the man, there is a painting. One corner of the painting is pressed down by a cup. In the lower right corner of the painting, there is a transparent glass product. In the upper left corner of the painting, there are rolled-up paintings and a glass bottle. When the man holds a card in his hand, what changes occur on the painting?", "question_wo_referring_query": "When the man holds a card in his hand, what changes occur on the painting?", "candidates": ["A stack of cards appears in the corner of the painting", "A pen appears in the center of the painting", "A stack of cards appears in the center of the painting", "A pen appears in the corner of the painting", "A pack of tissues appears in the corner of the painting"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "FNjSsHCh74A_1", "video_path": "FNjSsHCh74A.mp4", "subtitle_path": "FNjSsHCh74A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3123.96, "view_count": 96718}, {"video_id": "FNjSsHCh74A", "question": "A man in a white short-sleeve shirt is crouching on the floor with his hands crossed. 
He is facing a mirror, and there is a painting in front of him. A corner of the painting is being pressed down by a cup, and in the bottom-right corner of the painting, there is a transparent glass product. In the top-left of the painting, there is a rolled-up painting and a glass bottle. What changes occur when the man picks up an unopened pack of cards?", "question_wo_referring_query": "What changes occur when the man picks up an unopened pack of cards?", "candidates": ["The man's crossed hands change to supporting his chin with both hands.", "The man's crossed hands change to pressing the ground with his palms.", "The man's crossed hands change to supporting his chin with one hand.", "The man's crossed hands change to being spread out.", "The man's hands remain crossed."], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "FNjSsHCh74A_2", "video_path": "FNjSsHCh74A.mp4", "subtitle_path": "FNjSsHCh74A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3123.96, "view_count": 96718}, {"video_id": "R_ZGwbK0aXE", "question": "A man in a black top is seated in the back of a scooter. A man in red is driving. The man in black is carrying a black bag, wearing a necklace. There's a stone wall and greenery along the roadside. 
When the subtitle 'it multiple times across multiple people' appears, what changes happened to the man in the black top?", "question_wo_referring_query": "What changes happened to the man in the black top?", "candidates": ["The black top changed to a blue denim jacket", "The black top changed to a blue shirt", "The black top changed to a white top", "The man put on a hat", "The man put on sunglasses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "R_ZGwbK0aXE_0", "video_path": "R_ZGwbK0aXE.mp4", "subtitle_path": "R_ZGwbK0aXE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1230.63, "view_count": 1589261}, {"video_id": "R_ZGwbK0aXE", "question": "A man wearing a grey shirt is sitting in the car dealership. The man is holding a paper cup in his hand, has earphones around his neck, and is wearing a wristwatch and a ring. The dealership is decorated with yellow and blue. When the subtitle 'some sort that's water repellent do you' appears, what change happens to the man?", "question_wo_referring_query": "What change happens to the man?", "candidates": ["The earphones around the man's neck disappear", "The man puts on glasses", "The man puts on a hat", "The man's grey shirt changes to a blue shirt", "The man's grey shirt changes to a red shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "R_ZGwbK0aXE_1", "video_path": "R_ZGwbK0aXE.mp4", "subtitle_path": "R_ZGwbK0aXE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1230.63, "view_count": 1589261}, {"video_id": "R_ZGwbK0aXE", "question": "The man in the denim jacket and the man in the gray coat are sitting together. The man in the gray coat has earphones around his neck. They are in an enclosed space covered by purple curtains, with dazzling lights above. 
When the subtitle 'morning to everyone' appears, what change occurred to the man in the denim jacket?", "question_wo_referring_query": "What change occurred to the man in the denim jacket?", "candidates": ["The man put on a hat.", "The denim jacket changed to a blue short-sleeve shirt.", "The denim jacket changed to a red short-sleeve shirt.", "The denim jacket changed to a black short-sleeve shirt.", "The man put on a pair of glasses."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "R_ZGwbK0aXE_2", "video_path": "R_ZGwbK0aXE.mp4", "subtitle_path": "R_ZGwbK0aXE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1230.63, "view_count": 1589261}, {"video_id": "DiNzQP7kK-s", "question": "The screen is filled with various chaotic curves, with five vertical black lines moving from left to right. There are numbers, symbols, and triangles on these lines. What happened on the screen?", "question_wo_referring_query": "What happened on the screen?", "candidates": ["A blue vertical line appeared", "A red vertical line appeared", "A white vertical line appeared", "A blue horizontal line appeared", "A yellow horizontal line appeared"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "DiNzQP7kK-s_0", "video_path": "DiNzQP7kK-s.mp4", "subtitle_path": "DiNzQP7kK-s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2458.97, "view_count": 13815}, {"video_id": "DiNzQP7kK-s", "question": "On a white background, there is a section of text. At the top, the title is bold and written in uppercase English letters. Below, there is a section of black characters. 
What is happening on the screen?", "question_wo_referring_query": "What is happening on the screen?", "candidates": ["A black pattern is moving to the right", "A yellow pattern is moving to the right", "A red pattern is moving to the left", "A red pattern is moving to the right", "A yellow pattern is moving to the left"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "DiNzQP7kK-s_1", "video_path": "DiNzQP7kK-s.mp4", "subtitle_path": "DiNzQP7kK-s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2458.97, "view_count": 13815}, {"video_id": "DiNzQP7kK-s", "question": "On the upper portion of a white background, there is a segment of black English text. In the middle section, there is a three-line table. The lower section contains another segment of English text. Two columns of the three-line table have numbers. What happened on the screen?", "question_wo_referring_query": "What happened on the screen?", "candidates": ["A red circle was drawn on the three-line table", "A blue circle was drawn on the three-line table", "A blue triangle was drawn on the three-line table", "A red triangle was drawn on the three-line table", "A red star was drawn on the three-line table"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "DiNzQP7kK-s_2", "video_path": "DiNzQP7kK-s.mp4", "subtitle_path": "DiNzQP7kK-s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2458.97, "view_count": 13815}, {"video_id": "YBlNQK0Ao6g", "question": "On a white background, six rows of photos are arranged. The leftmost photo is incomplete, with a black area at the bottom of each photo. The rest of the photos are complete. The photos include animals, buildings, and entertainment items. 
What object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["hound", "penguin", "elephant", "leopard", "lion"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "YBlNQK0Ao6g_0", "video_path": "YBlNQK0Ao6g.mp4", "subtitle_path": "YBlNQK0Ao6g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1906.87, "view_count": 29708}, {"video_id": "YBlNQK0Ao6g", "question": "At the top of the white background is English text, while below there are two similarly structured areas, each containing elements such as quadrilaterals, circles, and arrows. The rectangular area at the bottom left is surrounded by a red curve. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Green square", "Black circle", "Green teardrop shape", "Green curve", "Green star"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "YBlNQK0Ao6g_1", "video_path": "YBlNQK0Ao6g.mp4", "subtitle_path": "YBlNQK0Ao6g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1906.87, "view_count": 29708}, {"video_id": "YBlNQK0Ao6g", "question": "In the middle of the screen, there's a line graph. The vertical axis increases from bottom to top, and the horizontal axis decreases from left to right. The graph features lines in three colors: green, yellow, and blue. Below the line graph, there are English characters in black. 
What objects are present in the screen?", "question_wo_referring_query": "What objects are present in the screen?", "candidates": ["Blue arrow", "Yellow arrow", "Green arrow", "Black arrow", "Red arrow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "YBlNQK0Ao6g_2", "video_path": "YBlNQK0Ao6g.mp4", "subtitle_path": "YBlNQK0Ao6g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1906.87, "view_count": 29708}, {"video_id": "7YlJhPLocLA", "question": "In the blue rectangular box at the center of the screen, there are black characters written. On the right side of the screen, there are two women in a video call. The woman at the top is wearing a blue top and has long black hair, with a tablet computer on the desk in front of her. The woman at the bottom is wearing a short-sleeved black shirt and glasses, with an accessory on her wrist and a decoration behind her. What objects are present on the screen when the subtitle 'feel better yeah cool not have the' appears?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["desk lamp", "earphones", "black cat", "black hat", "fan"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "7YlJhPLocLA_0", "video_path": "7YlJhPLocLA.mp4", "subtitle_path": "7YlJhPLocLA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1709.38, "view_count": 184327}, {"video_id": "7YlJhPLocLA", "question": "A woman wearing a denim jacket is sitting at a white desk. The woman has long hair and there is a tablet computer on the desk. Behind the woman is a cabinet with various items on it. On the wall to the side of the woman, there are some posters. 
When the subtitle shows 'now that you've learned the steps to,' what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Fan", "Doll", "Desk lamp", "Water cup", "Potted plant"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "7YlJhPLocLA_1", "video_path": "7YlJhPLocLA.mp4", "subtitle_path": "7YlJhPLocLA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1709.38, "view_count": 184327}, {"video_id": "7YlJhPLocLA", "question": "Two chemical equations appear on a white background. The first question's blank square is empty, while the second question has already been written in. On the right side of the screen, two women are interacting via video call. The woman at the top, wearing a denim jacket, is holding a pen and gesturing. The woman at the bottom, dressed in black short sleeves, has her fingers intertwined in front of her chest. This woman is wearing glasses and a black wristwatch. Behind her, a shelf is holding various items. When the caption 'those formal charges because every' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["a banana", "a hat", "a desk lamp", "a white chair", "an apple"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "7YlJhPLocLA_2", "video_path": "7YlJhPLocLA.mp4", "subtitle_path": "7YlJhPLocLA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1709.38, "view_count": 184327}, {"video_id": "TaHnD6MDHwE", "question": "Two women are having a conversation. The woman on the left is wearing a green top, black-framed glasses, and earrings, while the woman on the right is wearing a black top. There is a banner with English text at the bottom of the screen. 
What is the shape of the earrings worn by the woman on the left?", "question_wo_referring_query": "What is the shape of the earrings worn by the woman on the left?", "candidates": ["teardrop", "star", "triangle", "heart", "circle"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "TaHnD6MDHwE_0", "video_path": "TaHnD6MDHwE.mp4", "subtitle_path": "TaHnD6MDHwE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1053.62, "view_count": 17416}, {"video_id": "TaHnD6MDHwE", "question": "A man wearing a checkered shirt and a black suit is speaking. Behind the man, there is a sofa and a shelf full of miscellaneous items. There are paintings hanging on the wall in the room. Below the screen, there is a horizontal banner with English letters. What is the shape of the frame of the painting on the wall?", "question_wo_referring_query": "What is the shape of the frame of the painting on the wall?", "candidates": ["Oval", "Round", "Square", "Fan-shaped", "Irregular"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "TaHnD6MDHwE_1", "video_path": "TaHnD6MDHwE.mp4", "subtitle_path": "TaHnD6MDHwE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1053.62, "view_count": 17416}, {"video_id": "TaHnD6MDHwE", "question": "Two women are communicating online. The woman on the right is wearing a green top, black-framed glasses, and earrings. The woman on the left is wearing a blue top, and the screen behind her shows a weather report. 
How is the woman on the left styling her hair?", "question_wo_referring_query": "How is the woman on the left styling her hair?", "candidates": ["Long black curly hair", "Long, shoulder-length, blonde hair", "Long blonde hair with thick bangs", "Long, shoulder-length, silver hair", "Long black hair with thick bangs"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "TaHnD6MDHwE_2", "video_path": "TaHnD6MDHwE.mp4", "subtitle_path": "TaHnD6MDHwE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1053.62, "view_count": 17416}, {"video_id": "YjjVatilGpA", "question": "In the upper left corner of a white background with the black English text 'Qualitative Results-Multiple Objects', a man in a blue shirt is standing at a podium. There are three dynamic videos on the right side of the screen. In the middle video, on a white wooden base, there's a cat. What happens in the middle video when the subtitle says 'network is still a little segment'?", "question_wo_referring_query": "What happens in the middle video?", "candidates": ["The cat is eating something", "The cat is rolling on the ground", "The cat plays with something on the white base using its paw", "The cat is crouching on the ground", "The cat is licking its fur"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "YjjVatilGpA_0", "video_path": "YjjVatilGpA.mp4", "subtitle_path": "YjjVatilGpA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2122.83, "view_count": 775}, {"video_id": "YjjVatilGpA", "question": "In the upper left corner of a white background with the black English text 'Qualitative Results', a man in a blue shirt is standing in front of a podium. On the right side of the screen, there are three dynamic videos. In the video under the white English text 'Ground Truth', there is a car in front of a house with a red roof. 
When the subtitle says 'bouncing and a person filming so even,' what happens in the video under the white English text 'Ground Truth'?", "question_wo_referring_query": "What happens in the video under the white English text 'Ground Truth'?", "candidates": ["The car is flipped over.", "The rear of the car bounces up from the ground and is higher than the front.", "The front of the car bounces up from the ground and is higher than the rear.", "The car rotates in place.", "The car rolls over."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "YjjVatilGpA_1", "video_path": "YjjVatilGpA.mp4", "subtitle_path": "YjjVatilGpA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2122.83, "view_count": 775}, {"video_id": "YjjVatilGpA", "question": "In the upper left corner of a white background with black English text 'Qualitative Results', there is a man in a blue shirt standing in front of a podium. On the right side of the screen, there are three dynamic videos. In the video under the white English text 'Ground Truth,' an infant is sitting on the ground looking at a pink ball. 
When the subtitle says 'so here's an example with two actors,' what happens in the video under the white English text 'Ground Truth'?", "question_wo_referring_query": "What happens in the video under the white English text 'Ground Truth'?", "candidates": ["The infant is crawling on the ground", "The infant is sitting on the pink ball", "The blue ball rolls on the ground", "The infant lies on the ground", "The pink ball rolls on the ground"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "YjjVatilGpA_2", "video_path": "YjjVatilGpA.mp4", "subtitle_path": "YjjVatilGpA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2122.83, "view_count": 775}, {"video_id": "4bgHtDoy9Qk", "question": "In the upper left corner of the screen, there is white English text 'Operations' on a white background. In the lower right corner, there is a man wearing black glasses and a black suit. Amidst the white background, there is a downward blue arrow and a blue error sign. 
After the first blue rectangle is drawn below the blue arrow, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A red letter C is written inside the blue rectangle.", "A green letter C is written inside the blue rectangle.", "A blue letter A is written inside the blue rectangle.", "A blue letter C is written inside the blue rectangle.", "A red letter B is written inside the blue rectangle."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "4bgHtDoy9Qk_0", "video_path": "4bgHtDoy9Qk.mp4", "subtitle_path": "4bgHtDoy9Qk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1491.44, "view_count": 265}, {"video_id": "4bgHtDoy9Qk", "question": "In a white background with many blue numbers and English letters, there is a white English word 'Operations' written in the upper left corner of the screen, and a man wearing black glasses and a black suit in the lower right corner. 
After the first red arrow is drawn, what happens on the screen?", "question_wo_referring_query": "What happens on the screen after the first red arrow is drawn?", "candidates": ["On the left side of the red arrow, a green number 6 is written", "On the left side of the red arrow, a blue number 10 is written", "On the right side of the red arrow, a red number 10 is written", "On the left side of the red arrow, a green number 10 is written", "On the right side of the red arrow, a red number 9 is written"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "4bgHtDoy9Qk_1", "video_path": "4bgHtDoy9Qk.mp4", "subtitle_path": "4bgHtDoy9Qk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1491.44, "view_count": 265}, {"video_id": "4bgHtDoy9Qk", "question": "In a white background with the white English word 'Operations' written in the top-left corner of the screen, and a man wearing black glasses and a black suit in the bottom-right corner, what happens on the screen after 'C1' is written inside a blue rectangle below a blue downward arrow and a blue error symbol?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A red triangle was drawn", "A green arrow was drawn", "A red arrow was drawn", "A green rectangle was drawn", "A blue arrow was drawn"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "4bgHtDoy9Qk_2", "video_path": "4bgHtDoy9Qk.mp4", "subtitle_path": "4bgHtDoy9Qk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1491.44, "view_count": 265}, {"video_id": "TA5Ilzauebo", "question": "After a man wearing sunglasses and a green shirt appears on a grassy field, which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["A woman wearing sunglasses and a black 
swimsuit, sitting on a red and white striped chair and taking a selfie with her phone", "A woman kneeling in front of a fire pit, wearing a white hat and holding a guitar", "A woman with black hair wearing a white short-sleeve shirt, sitting in front of a fire pit", "A woman walking through a forest, wearing a pink cotton-padded jacket and blue jeans", "A man wearing a gray short-sleeve shirt and white wired earphones, sitting in front of a computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "TA5Ilzauebo_0", "video_path": "TA5Ilzauebo.mp4", "subtitle_path": "TA5Ilzauebo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1119.58, "view_count": 2307786}, {"video_id": "TA5Ilzauebo", "question": "After a man wearing black sunglasses and a red shirt with a backpack appears in the frame, which of the following characters shows up first?", "question_wo_referring_query": "Which of the following characters shows up first?", "candidates": ["A man sitting in front of a white door wearing an olive leather jacket and pink scrubs", "A man sitting next to a red chair wearing a white shirt and holding a purple ring and black wristbands", "A man riding a bicycle on the road by the river, wearing a green helmet and blue shorts", "A man riding a bicycle on the road by the river, wearing a black helmet", "A bald man with his arms crossed in front of his chest, wearing a black short-sleeved shirt and having tattoos on his arms"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "TA5Ilzauebo_1", "video_path": "TA5Ilzauebo.mp4", "subtitle_path": "TA5Ilzauebo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1119.58, "view_count": 2307786}, {"video_id": "TA5Ilzauebo", "question": "After a man wearing a black swimsuit is standing in blue seawater with wave patterns, which of the following characters 
appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["A man standing to the right of a bookshelf, looking down at a book.", "A man dressed in a black shirt and blue jeans sitting with his legs crossed in front of a white window on a brown sofa.", "A woman wearing a yellow woolen top and blue jeans sitting on a gray sofa, using her hand to cover her mouth while looking at her phone.", "A man wearing a hat and a white striped shirt seated in a room with bicycles.", "A man wearing a checkered shirt with a camera standing by the river with a backpack."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "TA5Ilzauebo_2", "video_path": "TA5Ilzauebo.mp4", "subtitle_path": "TA5Ilzauebo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1119.58, "view_count": 2307786}, {"video_id": "HrAi1R1F9Pw", "question": "On a wooden cutting board, there are a few green onions. A woman stands in front of the cutting board, holding a knife in one hand and pressing the green onions with the other. What is the woman doing in front of the cutting board in the video?", "question_wo_referring_query": "What is the woman doing in front of the cutting board in the video?", "candidates": ["Cutting the green onions with a knife", "Took away the cutting board", "Drank a sip of coffee", "Put the knife on the cutting board", "Threw away the green onions"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "HrAi1R1F9Pw_0", "video_path": "HrAi1R1F9Pw.mp4", "subtitle_path": "HrAi1R1F9Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.3, "view_count": 3895315}, {"video_id": "HrAi1R1F9Pw", "question": "The pot is filled with a mostly white and some darker small pieces of a porridge-like food. There is a wooden spatula on the top of the pot. 
In the video, what was the spatula used for?", "question_wo_referring_query": "In the video, what was the spatula used for?", "candidates": ["Tapping the edge of the pot", "Taken out of the frame", "Stirring in the pot", "Dropped into the pot", "Broken"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "HrAi1R1F9Pw_1", "video_path": "HrAi1R1F9Pw.mp4", "subtitle_path": "HrAi1R1F9Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.3, "view_count": 3895315}, {"video_id": "HrAi1R1F9Pw", "question": "A piece of ham is placed on the wooden cutting board, and a woman in front of the board is holding a knife with the blade facing the ham. What does the woman do next with her hand in the video?", "question_wo_referring_query": "A piece of ham is placed on the wooden cutting board, and a woman in front of the board is holding a knife with the blade facing the ham. What does the woman do next with her hand in the video?", "candidates": ["Takes away the parsley", "Brings a cup of water", "Places part of the ham near the edge of the board", "Brings a bunch of cilantro", "Takes the cutting board away"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "HrAi1R1F9Pw_2", "video_path": "HrAi1R1F9Pw.mp4", "subtitle_path": "HrAi1R1F9Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.3, "view_count": 3895315}, {"video_id": "lk4VG-VqN2s", "question": "In the gradually changing purple background, a man wearing a blue plaid shirt and glasses is talking into the camera. 
What objects can be seen in the frame?", "question_wo_referring_query": "What objects can be seen in the frame?", "candidates": ["sanjee", "gloves", "ring", "scarf", "hat"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "lk4VG-VqN2s_0", "video_path": "lk4VG-VqN2s.mp4", "subtitle_path": "lk4VG-VqN2s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1722.89, "view_count": 439458}, {"video_id": "lk4VG-VqN2s", "question": "In the background filled with blurred flock of geese, some birds in the front are looking around. At the bottom of the screen, there is information in various colors. Which of the following words has not appeared on the screen?", "question_wo_referring_query": "Which of the following words has not appeared on the screen?", "candidates": ["IPHONE", "KEY", "PECK", "PIECE", "BACH"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "lk4VG-VqN2s_1", "video_path": "lk4VG-VqN2s.mp4", "subtitle_path": "lk4VG-VqN2s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1722.89, "view_count": 439458}, {"video_id": "lk4VG-VqN2s", "question": "During the evening concert, three men are passionately performing on a lit-up stage, with the audience seated below. Which item was not present in the scene?", "question_wo_referring_query": "Which item was not present in the scene?", "candidates": ["Sound", "Microphone", "Apple", "Electric Guitar", "Men"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "lk4VG-VqN2s_2", "video_path": "lk4VG-VqN2s.mp4", "subtitle_path": "lk4VG-VqN2s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1722.89, "view_count": 439458}, {"video_id": "Pc96BmhEpP4", "question": "In an interview room with a round table, there is a black window curtain in the upper left corner. 
Two men in suits are sitting in chairs beside the table, conversing with each other. When the conversation mentions 'attacking the NATO still is a very big', which object appears on the screen?", "question_wo_referring_query": "Which object appears on the screen?", "candidates": ["baguette", "transparent glass cup", "white backpack", "laptop", "carton of milk"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Pc96BmhEpP4_0", "video_path": "Pc96BmhEpP4.mp4", "subtitle_path": "Pc96BmhEpP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1564.2, "view_count": 126171}, {"video_id": "Pc96BmhEpP4", "question": "In a trench dug by hand, there is a flat yellow ground in the distance. In the middle, there are three soldiers in camouflage uniforms crouching in the trench. When the commentary mentions 'a more rapid rotation of soldiers but', what objects appear on the screen?", "question_wo_referring_query": "What objects appear on the screen?", "candidates": ["Keyboard", "Tank", "Headset", "Watch", "Tire"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Pc96BmhEpP4_1", "video_path": "Pc96BmhEpP4.mp4", "subtitle_path": "Pc96BmhEpP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1564.2, "view_count": 126171}, {"video_id": "Pc96BmhEpP4", "question": "In an interview room with a round table in the middle, flanked by black curtains on both sides, two men in black suits and one woman in a blue suit sit around the table discussing. 
When they mention 'uh it was a dream already in the Soviet', what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Chair", "Comic Book", "World Map", "Akan\u00e9 Hill", "Mobile Phone"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Pc96BmhEpP4_2", "video_path": "Pc96BmhEpP4.mp4", "subtitle_path": "Pc96BmhEpP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1564.2, "view_count": 126171}, {"video_id": "9iqBeJnsq8U", "question": "In the scene with a black background, on the right side, there is a person wearing a shirt and a watch, speaking to the camera. In the upper right corner, there are two people working busily in a green field. What style of clothing is the black person in the video wearing?", "question_wo_referring_query": "What style of clothing is the black person in the video wearing?", "candidates": ["A red and black checkered short sleeve round neck shirt", "A black and red checkered cotton coat", "A black and red checkered windbreaker", "A chain shirt with black and red checkered pattern", "A shirt with black and red checkered pattern and buttons"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "9iqBeJnsq8U_0", "video_path": "9iqBeJnsq8U.mp4", "subtitle_path": "9iqBeJnsq8U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 917.52, "view_count": 2796769}, {"video_id": "9iqBeJnsq8U", "question": "Under the blue sky, there are various buildings in the distance. Nearby, there are lush trees on both sides. A man wearing black clothing and a hat is speaking to the camera. 
What style of hat is the man wearing?", "question_wo_referring_query": "What style of hat is the man wearing?", "candidates": ["Black ceremonial hat", "Red sanitation hat", "Red duck tongue hat", "Red round hat", "Red headscarf"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "9iqBeJnsq8U_1", "video_path": "9iqBeJnsq8U.mp4", "subtitle_path": "9iqBeJnsq8U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 917.52, "view_count": 2796769}, {"video_id": "9iqBeJnsq8U", "question": "On the track field, there is an athlete draped in a red flag with sports shoes hanging around his neck, raising his hands high. Spectators fill the stands around the field. What color are the sports shoes hanging around the athlete's neck?", "question_wo_referring_query": "What color are the sports shoes hanging around the athlete's neck on the track?", "candidates": ["Black", "Red and green mix", "Blue", "White", "Red"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "9iqBeJnsq8U_2", "video_path": "9iqBeJnsq8U.mp4", "subtitle_path": "9iqBeJnsq8U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 917.52, "view_count": 2796769}, {"video_id": "xCwaHCrNDjE", "question": "In the gray and white screen, there is a globe composed of blue and green at the top, small icons in both the top left and top right corners, and some text at the bottom. 
When the video subtitle says 'difference but it's going to take a lot,' what is the shape of the red part of the icon in the top left corner?", "question_wo_referring_query": "What is the shape of the red part of the icon in the top left corner?", "candidates": ["Circular", "Square", "Triangle", "Ladder shape", "Pentagram shape"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "xCwaHCrNDjE_0", "video_path": "xCwaHCrNDjE.mp4", "subtitle_path": "xCwaHCrNDjE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1173.08, "view_count": 807}, {"video_id": "xCwaHCrNDjE", "question": "In the scene with a gray-white background, there is an image of the globe composed of blue and green at the top. There are two small logos, one in the top left corner and one in the top right corner. Below are some texts. When the video caption mentions 'in a tiger Reserve there are around 3,' what color is the logo in the top right corner?", "question_wo_referring_query": "What color is the logo in the top right corner?", "candidates": ["yellow", "black", "green", "white", "blue"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "xCwaHCrNDjE_1", "video_path": "xCwaHCrNDjE.mp4", "subtitle_path": "xCwaHCrNDjE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1173.08, "view_count": 807}, {"video_id": "xCwaHCrNDjE", "question": "In the screen with a gray-white background, there is a globe icon made up of blue and green colors at the top. There are two small logos in the top left and top right corners. At the bottom, there are some words. 
When the video subtitle mentions 'literally not have this tick off list,' what color is the word 'CLIMATE' at the bottom?", "question_wo_referring_query": "What color is the word 'CLIMATE' at the bottom?", "candidates": ["white", "black", "orange", "blue", "green"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "xCwaHCrNDjE_2", "video_path": "xCwaHCrNDjE.mp4", "subtitle_path": "xCwaHCrNDjE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1173.08, "view_count": 807}, {"video_id": "5qi6AS-5zKk", "question": "In the supermarket, the shelves on the left are filled with various types and colors of products. On the right, there's a person with a visible hand injury. What product does this person touch with their hand?", "question_wo_referring_query": "In the supermarket, the shelves on the left are filled with various types and colors of products. On the right, there's a person with a visible hand injury. What product does this person touch with their hand?", "candidates": ["candy", "pencil", "calendar", "milk", "notebook"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "5qi6AS-5zKk_0", "video_path": "5qi6AS-5zKk.mp4", "subtitle_path": "5qi6AS-5zKk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1398.23, "view_count": 181607}, {"video_id": "5qi6AS-5zKk", "question": "In the dark room, there is a row of clothes hanging on a clothes rack on the right side. In front of the mirror, a woman with long black hair and wearing earrings is speaking. 
What is the object touched by the woman's hand?", "question_wo_referring_query": "What is the object touched by the woman's hand?", "candidates": ["cola", "mobile phone", "chocolate", "card", "computer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "5qi6AS-5zKk_1", "video_path": "5qi6AS-5zKk.mp4", "subtitle_path": "5qi6AS-5zKk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1398.23, "view_count": 181607}, {"video_id": "5qi6AS-5zKk", "question": "In a dining room with white and yellow hues, there is a transparent glass window on the right side. In front of the mirror, there is a woman with long black hair wearing white clothing. The woman is eating food. What is the food that is entering the woman's mouth?", "question_wo_referring_query": "What is the food that is entering the woman's mouth?", "candidates": ["Steamed bun", "Hamburger", "Ham", "Candy", "Chocolate"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "5qi6AS-5zKk_2", "video_path": "5qi6AS-5zKk.mp4", "subtitle_path": "5qi6AS-5zKk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1398.23, "view_count": 181607}, {"video_id": "9l3fHjMHuhE", "question": "In an animated scene, on a distant viewing platform full of spectators in a gladiator arena, there is a person wearing red and holding a red shield on the left side, and a person wearing blue and holding a blue shield on the right side. 
What did the person with the red shield do when he first appeared?", "question_wo_referring_query": "What did the person with the red shield do when he first appeared?", "candidates": ["Dropped his shield", "Fell to the ground", "Waved a sword at the person in blue", "Picked up a branch", "Took off his helmet"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "9l3fHjMHuhE_0", "video_path": "9l3fHjMHuhE.mp4", "subtitle_path": "9l3fHjMHuhE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1537.83, "view_count": 1081307}, {"video_id": "9l3fHjMHuhE", "question": "In the animated scene, on a training ground where the distant area has a light green field and the nearby area has a dark green field, there is a rectangular platform in the middle. Behind the platform stands a soldier wearing a hat, and on the platform, there is a soldier with short hair holding two sticks. What did the short-haired soldier holding two sticks do the first time he appeared?", "question_wo_referring_query": "What did the short-haired soldier holding two sticks do the first time he appeared?", "candidates": ["Picked up a hammer", "Took a sip of water", "Knelt on the platform", "Waved his hand to the sky", "Jumped off the platform"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "9l3fHjMHuhE_1", "video_path": "9l3fHjMHuhE.mp4", "subtitle_path": "9l3fHjMHuhE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1537.83, "view_count": 1081307}, {"video_id": "9l3fHjMHuhE", "question": "In the animated scene, in a dense green forest, there are trees and grass in the distance, and two large rocks nearby. In front of the rocks stands a soldier wearing green clothes and carrying a quiver of arrows on his shoulder. 
What did the soldier carrying a quiver of arrows do the first time he appeared?", "question_wo_referring_query": "What did the soldier carrying a quiver of arrows do the first time he appeared?", "candidates": ["Took off his helmet", "Kneeled on the rock", "Fired an arrow to the right", "Picked up a gun", "Dropped the quiver"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "9l3fHjMHuhE_2", "video_path": "9l3fHjMHuhE.mp4", "subtitle_path": "9l3fHjMHuhE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1537.83, "view_count": 1081307}, {"video_id": "4Nf1x9S1P4A", "question": "Under the sunlight, various vegetables are placed on a wooden table, with neatly stacked wooden materials behind the table corner. In the video, what appears first after the tomato?", "question_wo_referring_query": "In the video, what appears first after the tomato?", "candidates": ["television", "napkin", "mango", "cilantro", "dragon fruit"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "4Nf1x9S1P4A_0", "video_path": "4Nf1x9S1P4A.mp4", "subtitle_path": "4Nf1x9S1P4A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1134.0, "view_count": 669336}, {"video_id": "4Nf1x9S1P4A", "question": "Sunlight is shining on a coffee-colored wooden table, on which there is a wooden bowl filled with shredded onions. 
After the wooden bowl appears in the video, what appears for the first time?", "question_wo_referring_query": "What appears for the first time?", "candidates": ["Child", "Car", "Cow", "Dog", "Horse"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "4Nf1x9S1P4A_1", "video_path": "4Nf1x9S1P4A.mp4", "subtitle_path": "4Nf1x9S1P4A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1134.0, "view_count": 669336}, {"video_id": "4Nf1x9S1P4A", "question": "Under the sunlight, in front of the mirror is a man wearing a military green coat holding a kebab in his hand. Behind the man is flat land, and in the distance, there are steep slopes. In the video, what appears for the first time after the kebab appears?", "question_wo_referring_query": "In the video, what appears for the first time after the kebab appears?", "candidates": ["dog", "woman with long hair", "snow cake", "yellow football", "kite"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "4Nf1x9S1P4A_2", "video_path": "4Nf1x9S1P4A.mp4", "subtitle_path": "4Nf1x9S1P4A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1134.0, "view_count": 669336}, {"video_id": "9OWw7cWC1ts", "question": "In a multi-person lecture, five people are seated on the stage from left to right: a woman in yellow, a man in white, a man in black, a woman with short hair, and a woman in gray. The audience is listening intently. 
After the man in black mentioned 'into the Academy would be the ability,' what did he do next?", "question_wo_referring_query": "What did he do next?", "candidates": ["Stood up", "Touched the top of his head", "Took off his shoes", "Waved his hands", "Removed his glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "9OWw7cWC1ts_0", "video_path": "9OWw7cWC1ts.mp4", "subtitle_path": "9OWw7cWC1ts_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2543.54, "view_count": 161}, {"video_id": "9OWw7cWC1ts", "question": "During a multi-person panel discussion, there are five individuals seated on stage from left to right: a woman in yellow, a man in white, a man in black, a woman with short hair, and a woman in gray. The audience is listening attentively. What action did the woman in yellow on the far left perform after the conversation mentioned 'i'm robert m i'm from the CIA and when'?", "question_wo_referring_query": "What action did the woman in yellow on the far left perform?", "candidates": ["Stood up to speak", "Pulled out a notebook", "Took a sip of water", "Ate a piece of chocolate", "Raised both hands"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "9OWw7cWC1ts_1", "video_path": "9OWw7cWC1ts.mp4", "subtitle_path": "9OWw7cWC1ts_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2543.54, "view_count": 161}, {"video_id": "9OWw7cWC1ts", "question": "During a multi-speaker seminar, there are five people seated on stage in the following order from left to right: a woman in yellow, a man in white, a man in black, a woman with short hair, and a woman in gray. The audience is listening attentively. 
When the phrase 'any forward am I dreaming I remember' is mentioned, what did the woman in yellow on the far left do?", "question_wo_referring_query": "What did the woman in yellow on the far left do?", "candidates": ["Stood up", "Put on glasses", "Took off her jacket", "Took a sip of water", "Pointed towards the audience with her left hand"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "9OWw7cWC1ts_2", "video_path": "9OWw7cWC1ts.mp4", "subtitle_path": "9OWw7cWC1ts_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2543.54, "view_count": 161}, {"video_id": "lWdCrvwQcWo", "question": "In the study room, there is a bookshelf filled with various types of books in the background. In front of the camera, there is a man wearing gray clothes and holding a book in his hand. What object appears after the man mentions 'it's the cartography is just not as good'?", "question_wo_referring_query": "What object appears?", "candidates": ["Map", "Poker card", "Chess piece", "Scarf", "Cup"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "lWdCrvwQcWo_0", "video_path": "lWdCrvwQcWo.mp4", "subtitle_path": "lWdCrvwQcWo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.82, "view_count": 20271}, {"video_id": "lWdCrvwQcWo", "question": "In the study room, there is a bookshelf filled with various books in the background. In front of the camera is a man wearing a gray jacket and holding a green book. 
What appears after the man mentions 'these two are'?", "question_wo_referring_query": "What appears?", "candidates": ["blue book", "mobile phone", "red book", "cake", "remote control car"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "lWdCrvwQcWo_1", "video_path": "lWdCrvwQcWo.mp4", "subtitle_path": "lWdCrvwQcWo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.82, "view_count": 20271}, {"video_id": "lWdCrvwQcWo", "question": "In the study, there's a bookshelf full of various books in the background. In front of the camera stands a man wearing gray clothes and holding a book. After the man mentions 'they're the ones with the green', what animal appears in the picture?", "question_wo_referring_query": "What animal appears in the picture?", "candidates": ["Spider", "Cat", "Snake", "Tiger", "Bird"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "lWdCrvwQcWo_2", "video_path": "lWdCrvwQcWo.mp4", "subtitle_path": "lWdCrvwQcWo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.82, "view_count": 20271}, {"video_id": "SBW2PHzPf3M", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the scene appears with a 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman touches her ear with her left hand', then the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman places her left hand on her chest', then finally the scene appears with the 'red wall background, a woman with black 
clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman extends her left hand towards the man in the middle'", "First, the scene appears with a 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman touches her ear with her left hand', then the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman extends her left hand towards the man in the middle', then finally the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman places her left hand on her chest'", "First, the scene appears with a 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman extends her left hand towards the man in the middle', then the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman places her left hand on her chest', then finally the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman touches her ear with her left hand'", "First, the scene appears with a 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman places her 
left hand on her chest', then the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman touches her ear with her left hand', then finally the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman extends her left hand towards the man in the middle'", "First, the scene appears with a 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman places her left hand on her chest', then the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman extends her left hand towards the man in the middle', then finally the scene appears with the 'red wall background, a woman with black clothes and golden long hair sits on the left, a man in a black suit sits in the middle, and a man in red clothes sits on the right, the woman touches her ear with her left hand'"], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "SBW2PHzPf3M_0", "video_path": "SBW2PHzPf3M.mp4", "subtitle_path": "SBW2PHzPf3M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2035.14, "view_count": 780}, {"video_id": "SBW2PHzPf3M", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a scene appears with a 'wine-red background wall. On the left, a woman wearing black clothes and sporting long golden hair is seated. 
In the middle, a man wearing a black suit is seated. On the right, a man wearing wine-red clothes is seated. They are talking while seated.' Then, a scene appears in a 'dimly lit room with five individuals holding microphones, having a conversation, and a small black podium on the far right.' Finally, a scene appears with a 'wine-red background wall. On the left side, a man wearing a white shirt and a black suit jacket is seated, and on the right side, a woman wearing a black and white plaid shirt is seated. Both the man and the woman are looking to the right.'", "First, a scene appears with a 'wine-red background wall. On the left side, a man wearing a white shirt and a black suit jacket is seated, and on the right side, a woman wearing a black and white plaid shirt is seated. Both the man and the woman are looking to the right.' Then, a scene appears in a 'dimly lit room with five individuals holding microphones, having a conversation, and a small black podium on the far right.' Finally, a scene appears with a 'wine-red background wall. On the left, a woman wearing black clothes and sporting long golden hair is seated. In the middle, a man wearing a black suit is seated. On the right, a man wearing wine-red clothes is seated. They are talking while seated.'", "First, a scene appears in a 'dimly lit room with five individuals holding microphones, having a conversation, and a small black podium on the far right.' Then, a scene appears with a 'wine-red background wall. On the left side, a man wearing a white shirt and a black suit jacket is seated, and on the right side, a woman wearing a black and white plaid shirt is seated. Both the man and the woman are looking to the right.' Finally, a scene appears with a 'wine-red background wall. On the left, a woman wearing black clothes and sporting long golden hair is seated. In the middle, a man wearing a black suit is seated. On the right, a man wearing wine-red clothes is seated. 
They are talking while seated.'", "First, a scene appears in a 'dimly lit room with five individuals holding microphones, having a conversation, and a small black podium on the far right.' Then, a scene appears with a 'wine-red background wall. On the left, a woman wearing black clothes and sporting long golden hair is seated. In the middle, a man wearing a black suit is seated. On the right, a man wearing wine-red clothes is seated. They are talking while seated.' Finally, a scene appears with a 'wine-red background wall. On the left side, a man wearing a white shirt and a black suit jacket is seated, and on the right side, a woman wearing a black and white plaid shirt is seated. Both the man and the woman are looking to the right.'", "First, a scene appears with a 'wine-red background wall. On the left side, a man wearing a white shirt and a black suit jacket is seated, and on the right side, a woman wearing a black and white plaid shirt is seated. Both the man and the woman are looking to the right.' Then, a scene appears with a 'wine-red background wall. On the left, a woman wearing black clothes and sporting long golden hair is seated. In the middle, a man wearing a black suit is seated. On the right, a man wearing wine-red clothes is seated. They are talking while seated.' 
Finally, a scene appears in a 'dimly lit room with five individuals holding microphones, having a conversation, and a small black podium on the far right.'"], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "SBW2PHzPf3M_1", "video_path": "SBW2PHzPf3M.mp4", "subtitle_path": "SBW2PHzPf3M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2035.14, "view_count": 780}, {"video_id": "SBW2PHzPf3M", "question": "Which of the sequences below is correct?", "question_wo_referring_query": "Which of the sequences below is correct?", "candidates": ["First, a scene appears in front of a wine-red background wall, with a man in a white shirt and a black suit jacket sitting on the left, and a woman in a black and white checkered shirt sitting on the right, both looking to the right. Then, a scene appears in front of a wine-red background wall, with a man in a wine-red outfit sitting on the left, and a man wearing a white shirt and a black suit jacket sitting on the right, having a conversation. Finally, a scene appears in front of a wine-red background wall, with a man wearing glasses sitting on the left, a man in a wine-red outfit sitting in the middle, and a man wearing a white shirt and a black suit jacket sitting on the right, and the three of them are having a conversation.", "First, a scene appears in front of a wine-red background wall, with a man wearing glasses sitting on the left, a man in a wine-red outfit sitting in the middle, and a man wearing a white shirt and a black suit jacket sitting on the right, and the three of them are having a conversation. Then, a scene appears in front of a wine-red background wall, with a man in a wine-red outfit sitting on the left, and a man wearing a white shirt and a black suit jacket sitting on the right, having a conversation. 
Finally, a scene appears in front of a wine-red background wall, with a man in a white shirt and a black suit jacket sitting on the left, and a woman in a black and white checkered shirt sitting on the right, both looking to the right.", "First, a scene appears in front of a wine-red background wall, with a man in a wine-red outfit sitting on the left, and a man wearing a white shirt and a black suit jacket sitting on the right, having a conversation. Then, a scene appears in front of a wine-red background wall, with a man in a white shirt and a black suit jacket sitting on the left, and a woman in a black and white checkered shirt sitting on the right, both looking to the right. Finally, a scene appears in front of a wine-red background wall, with a man wearing glasses sitting on the left, a man in a wine-red outfit sitting in the middle, and a man wearing a white shirt and a black suit jacket sitting on the right, and the three of them are having a conversation.", "First, a scene appears in front of a wine-red background wall, with a man wearing glasses sitting on the left, a man in a wine-red outfit sitting in the middle, and a man wearing a white shirt and a black suit jacket sitting on the right, and the three of them are having a conversation. Then, a scene appears in front of a wine-red background wall, with a man in a white shirt and a black suit jacket sitting on the left, and a woman in a black and white checkered shirt sitting on the right, both looking to the right. Finally, a scene appears in front of a wine-red background wall, with a man in a wine-red outfit sitting on the left, and a man wearing a white shirt and a black suit jacket sitting on the right, having a conversation.", "First, a scene appears in front of a wine-red background wall, with a man in a white shirt and a black suit jacket sitting on the left, and a woman in a black and white checkered shirt sitting on the right, both looking to the right. 
Then, a scene appears in front of a wine-red background wall, with a man wearing glasses sitting on the left, a man in a wine-red outfit sitting in the middle, and a man wearing a white shirt and a black suit jacket sitting on the right, and the three of them are having a conversation. Finally, a scene appears in front of a wine-red background wall, with a man in a wine-red outfit sitting on the left, and a man wearing a white shirt and a black suit jacket sitting on the right, having a conversation."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "SBW2PHzPf3M_2", "video_path": "SBW2PHzPf3M.mp4", "subtitle_path": "SBW2PHzPf3M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2035.14, "view_count": 780}, {"video_id": "8f5xIMStqF4", "question": "In the live video screen, on the right is a picture of a small stream full of pebbles with trees on both sides. In the bottom left corner, there is a man wearing sunglasses. In which of the following scenes has the man in the bottom left corner of the video appeared before?", "question_wo_referring_query": "In which of the following scenes has the man in the bottom left corner of the video appeared before?", "candidates": ["On a sunlit mountain top", "Beside a picture with white background and black lines", "Inside a dimly lit wine glass", "By a rainy street", "On a verdant grassland"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "8f5xIMStqF4_0", "video_path": "8f5xIMStqF4.mp4", "subtitle_path": "8f5xIMStqF4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1748.78, "view_count": 37427}, {"video_id": "8f5xIMStqF4", "question": "In front of the camera is a man wearing sunglasses and sporting a crew cut. Behind him is a pure green backdrop. 
In which of the following scenes has this man appeared before?", "question_wo_referring_query": "In which of the following scenes has this man appeared?", "candidates": ["In a white background with black text", "In a pure blue screen", "On a fully loaded airplane", "In a quiet classroom", "On a big tree"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "8f5xIMStqF4_1", "video_path": "8f5xIMStqF4.mp4", "subtitle_path": "8f5xIMStqF4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1748.78, "view_count": 37427}, {"video_id": "8f5xIMStqF4", "question": "In the live stream video, on the right side, there is a man wearing a checkered shirt with a blue thumbs-up sticker on his palm, and in the lower left corner, there is a man wearing sunglasses. In which of the following scenarios does the man wearing sunglasses appear?", "question_wo_referring_query": "In which of the following scenarios does the man wearing sunglasses appear?", "candidates": ["On a luxury cruise ship", "On a crowded beach", "On the crowded Great Wall", "In a white background scene with a cat sticker", "In a white background scene with a yellow thumbs-up sticker in the upper left corner"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "8f5xIMStqF4_2", "video_path": "8f5xIMStqF4.mp4", "subtitle_path": "8f5xIMStqF4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1748.78, "view_count": 37427}, {"video_id": "iDulhoQ2pro", "question": "On the PPT slide with a white background and black text, there's an image on the left side with a red arrow point that spreads out in four different directions. 
When the background changes to pure white, what color does the red arrow point become?", "question_wo_referring_query": "What color does it change to?", "candidates": ["Black", "Green", "Yellow", "Purple", "Orange"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "iDulhoQ2pro_0", "video_path": "iDulhoQ2pro.mp4", "subtitle_path": "iDulhoQ2pro_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1626.21, "view_count": 603928}, {"video_id": "iDulhoQ2pro", "question": "In the middle of a white background is a colorful thought guide. On both the left and right sides, there is a mass of red lines. In the thought guide, on the left side is a blue rectangle with the words 'Feed Forward' inside. What changes occur when the right side of the screen turns pure white?", "question_wo_referring_query": "What changes occur?", "candidates": ["Turned red", "Turned into a square", "Turned grey-blue", "Turned into a circle", "Covered by an orange stroke"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "iDulhoQ2pro_1", "video_path": "iDulhoQ2pro.mp4", "subtitle_path": "iDulhoQ2pro_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1626.21, "view_count": 603928}, {"video_id": "iDulhoQ2pro", "question": "In a PPT slide with white background and black text, there is a red rectangle at the center with three 'V1' texts inside. 
What color does this red rectangle change to when the background turns pure white?", "question_wo_referring_query": "What color does it change to?", "candidates": ["Black", "Green", "Gray", "Orange", "Yellow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "iDulhoQ2pro_2", "video_path": "iDulhoQ2pro.mp4", "subtitle_path": "iDulhoQ2pro_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1626.21, "view_count": 603928}, {"video_id": "VzS8hrOSSAs", "question": "In the live broadcast video, the bottom right corner shows a man wearing a white shirt and having a beard, explaining something. The rest of the screen displays a PPT with white background and black text. In the top left corner, the word 'Samanantar' is covered by a blue layer. When the man mentions 'neural network so the source of this', what change occurs to the word 'Samanantar' covered by the blue layer?", "question_wo_referring_query": "In the live broadcast video, the bottom right corner shows a man wearing a white shirt and having a beard, explaining something. The rest of the screen displays a PPT with white background and black text. In the top left corner, the word 'Samanantar' is covered by a blue layer. 
When the man mentions 'neural network so the source of this', what change occurs to the word 'Samanantar' covered by the blue layer?", "candidates": ["The word 'Samanantar' changed to 'OPEN'", "All covered by black", "The word 'Samanantar' disappeared", "The blue layer turned red", "The blue layer disappeared"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "VzS8hrOSSAs_0", "video_path": "VzS8hrOSSAs.mp4", "subtitle_path": "VzS8hrOSSAs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.23, "view_count": 9388}, {"video_id": "VzS8hrOSSAs", "question": "In the live broadcast video screen at the bottom right, there is a man wearing a white shirt and sporting a mustache who is explaining, and the rest of the screen is a black background with white-coded PPT screen where in the middle it is overlaid with blue graphics and the text '639,722'. What changed when the man mentioned 'sentence it has 639 characters while the' and the blue layer covering '639,722'?", "question_wo_referring_query": "What changed in the '639,722' text overlaid with the blue graphic?", "candidates": ["The blue overlay on '722' disappeared", "The blue overlay changed to yellow", "'639,722' changed to 'null'", "The blue overlay completely disappeared", "'639,722' text disappeared"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "VzS8hrOSSAs_1", "video_path": "VzS8hrOSSAs.mp4", "subtitle_path": "VzS8hrOSSAs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.23, "view_count": 9388}, {"video_id": "VzS8hrOSSAs", "question": "In the livestream video screen, the bottom right corner shows a man wearing a white shirt and sporting a mustache, who is currently giving an explanation. The rest of the screen displays a black background with white code on a PPT slide. 
In the top left corner of the PPT slide, there's a small yellow dot. When the man mentions 'get an encoder self-attention mask as,' what change occurs to the small yellow dot?", "question_wo_referring_query": "What change occurs to the small yellow dot?", "candidates": ["The area became larger", "Moved to the top right corner of the screen", "Changed to red", "Changed into a square", "Moved to the bottom center of the screen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "VzS8hrOSSAs_2", "video_path": "VzS8hrOSSAs.mp4", "subtitle_path": "VzS8hrOSSAs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1151.23, "view_count": 9388}, {"video_id": "aa0Y6fryUSk", "question": "There is a woman on the screen wearing black clothes, and there is also an infant wearing red clothes. What is the woman in the scene doing?", "question_wo_referring_query": "What is the woman in the scene doing?", "candidates": ["Putting the infant into a stroller", "Holding the infant", "Pulling the infant", "Feeding the infant milk", "Carrying the infant on her back"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "aa0Y6fryUSk_0", "video_path": "aa0Y6fryUSk.mp4", "subtitle_path": "aa0Y6fryUSk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2114.2, "view_count": 29690}, {"video_id": "aa0Y6fryUSk", "question": "There are many brightly colored paintings on the wall. In a square container, there are drawing materials. There is also a woman. 
What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Selecting paintbrushes", "Mixing colors", "Varnishing", "Painting on a large drawing board stuck to the wall", "Washing paintbrushes"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "aa0Y6fryUSk_1", "video_path": "aa0Y6fryUSk.mp4", "subtitle_path": "aa0Y6fryUSk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2114.2, "view_count": 29690}, {"video_id": "aa0Y6fryUSk", "question": "How many men and women are sitting in the scene, with some indistinct castles in the background? What is the man standing among the group of seated people in the center doing?", "question_wo_referring_query": "What is the man standing among the group of seated people doing?", "candidates": ["Smiling at the mirror", "Standing with hands on hips facing a mirror", "Kneeling and chatting with the people around", "Turning around to look behind", "Passing a cigarette to the people around"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "aa0Y6fryUSk_2", "video_path": "aa0Y6fryUSk.mp4", "subtitle_path": "aa0Y6fryUSk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2114.2, "view_count": 29690}, {"video_id": "D8IwYmhS9Qw", "question": "In a spacious area with many lights, tables, and whiteboards, there are two people facing computers on a desk that holds many items. 
Which item is not present on the desk?", "question_wo_referring_query": "Which item is not present on the desk?", "candidates": ["Water cup", "Earphones", "Small electric fan", "Backpack", "Small note"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "D8IwYmhS9Qw_0", "video_path": "D8IwYmhS9Qw.mp4", "subtitle_path": "D8IwYmhS9Qw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 985.49, "view_count": 125492}, {"video_id": "D8IwYmhS9Qw", "question": "Sunlight filters through the dense leaves, casting shadows. Two people are sitting on a bench. What objects appear in this scene?", "question_wo_referring_query": "What objects appear in this scene?", "candidates": ["Puppy", "Kitten", "Book bag", "Toy car", "Bird"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "D8IwYmhS9Qw_1", "video_path": "D8IwYmhS9Qw.mp4", "subtitle_path": "D8IwYmhS9Qw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 985.49, "view_count": 125492}, {"video_id": "D8IwYmhS9Qw", "question": "There's a boy with short brown hair in the video, his mouth wide open facing the camera. Behind him, there's a mirror and a shelf. Which object has appeared on the shelf?", "question_wo_referring_query": "Which object has appeared on the shelf?", "candidates": ["toy gun", "book", "basketball", "mobile phone", "teddy bear"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "D8IwYmhS9Qw_2", "video_path": "D8IwYmhS9Qw.mp4", "subtitle_path": "D8IwYmhS9Qw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 985.49, "view_count": 125492}, {"video_id": "PlYjN5kO5OQ", "question": "A person is holding a paper bowl standing next to a wooden table. On the table, there's a metal tool placed, and in the background, there are stacked wooden sticks. 
When the caption 'Soy sauce' appears, which of the objects on the screen is present?", "question_wo_referring_query": "Which object is present on the screen?", "candidates": ["Kitchen knife", "Scissors", "Vegetables", "Flower pot", "Sharpening stone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "PlYjN5kO5OQ_0", "video_path": "PlYjN5kO5OQ.mp4", "subtitle_path": "PlYjN5kO5OQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.68, "view_count": 1078350}, {"video_id": "PlYjN5kO5OQ", "question": "Under a refined wooden house, there are many neatly arranged round wooden pieces, with a flower pot containing blooming purple-red flowers beside them. A man is holding coffee in his left hand, standing in front of a stove, getting ready to pick up a small glass jar. When the caption 'Tea break' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Nail", "Bucket", "Books", "Chair", "Puppy"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "PlYjN5kO5OQ_1", "video_path": "PlYjN5kO5OQ.mp4", "subtitle_path": "PlYjN5kO5OQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.68, "view_count": 1078350}, {"video_id": "PlYjN5kO5OQ", "question": "There is a pair of wrinkled hands in the picture; one hand is holding a knife and the other a lemon. 
When the subtitle 'Lemon' appears, what objects are present in the picture?", "question_wo_referring_query": "What objects are present in the picture?", "candidates": ["Wooden cutting board", "Scissors", "Glass bowl", "Plastic bag", "Chopsticks"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "PlYjN5kO5OQ_2", "video_path": "PlYjN5kO5OQ.mp4", "subtitle_path": "PlYjN5kO5OQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1081.68, "view_count": 1078350}, {"video_id": "E_EUG6w156s", "question": "There is a wooden chopping board in the scene, with one hand holding a sausage and the other hand holding a knife ready to cut it. What is the shape of the sausage?", "question_wo_referring_query": "What is the shape of the sausage?", "candidates": ["Triangle", "V-shape", "Square", "Round", "U-shape"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "E_EUG6w156s_0", "video_path": "E_EUG6w156s.mp4", "subtitle_path": "E_EUG6w156s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 931.77, "view_count": 685838}, {"video_id": "E_EUG6w156s", "question": "A person in the video is holding a metal kitchen knife and cutting a chili pepper on a wooden cutting board. What is the color of the chili pepper's stem?", "question_wo_referring_query": "What is the color of the chili pepper's stem?", "candidates": ["Gray", "White", "Green", "Red", "Black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "E_EUG6w156s_1", "video_path": "E_EUG6w156s.mp4", "subtitle_path": "E_EUG6w156s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 931.77, "view_count": 685838}, {"video_id": "E_EUG6w156s", "question": "There is a bowl with a metal rim filled with pale yellow liquid on the screen. A small spoon is adding salt to it. 
What material is the spoon made of?", "question_wo_referring_query": "What material is the spoon made of?", "candidates": ["Plastic", "Silver", "Wood", "Ceramic", "Aluminum"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "E_EUG6w156s_2", "video_path": "E_EUG6w156s.mp4", "subtitle_path": "E_EUG6w156s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 931.77, "view_count": 685838}, {"video_id": "lqhpXf1PIhg", "question": "There is a white door on the screen, and inside the room are green vine decorations, ambient lighting, and items such as bags. Who is wearing khaki pants and a grass-green suit jacket?", "question_wo_referring_query": "Who is wearing khaki pants and a grass-green suit jacket?", "candidates": ["A woman", "A little boy", "A little girl", "A man", "A foreign woman"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "lqhpXf1PIhg_0", "video_path": "lqhpXf1PIhg.mp4", "subtitle_path": "lqhpXf1PIhg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1209.61, "view_count": 293965}, {"video_id": "lqhpXf1PIhg", "question": "There are many colorful clothes neatly hanging on the wall in the video. There is a mobile phone on the phone holder, and the back of the phone case is green with a round pattern decoration in the middle. 
What is pointing towards the green phone case?", "question_wo_referring_query": "What is pointing towards the green phone case?", "candidates": ["A selfie stick with a red head", "A hand with red nail polish", "A pen with a red head", "A paintbrush with a red tip"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "lqhpXf1PIhg_1", "video_path": "lqhpXf1PIhg.mp4", "subtitle_path": "lqhpXf1PIhg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1209.61, "view_count": 293965}, {"video_id": "lqhpXf1PIhg", "question": "In a room decorated with colorful lights, some kitchen utensils, and a blue wall with a mirror, photo albums, and orange stickers, what is hanging on the white wall behind a woman who is making a 'yeah' gesture in front of the mirror?", "question_wo_referring_query": "What is hanging on the white wall behind the woman?", "candidates": ["A green fan with flower patterns", "A red fan with crane patterns", "A red fan with happiness symbols", "A red fan with flower patterns", "A blue fan with flower patterns"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "lqhpXf1PIhg_2", "video_path": "lqhpXf1PIhg.mp4", "subtitle_path": "lqhpXf1PIhg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1209.61, "view_count": 293965}, {"video_id": "9PKfF4bBErc", "question": "The screen shows a gray background, with a large red leaf in the center. 
When this red leaf first appears, what happens?", "question_wo_referring_query": "When this red leaf first appears, what happens?", "candidates": ["The man in the bottom right corner of the screen is looking at the computer and appears worried.", "The woman in the bottom right corner of the screen is smiling at the computer.", "The man in the bottom right corner of the screen is staring blankly at the computer.", "The woman in the bottom right corner of the screen is typing furiously on the computer.", "The man in the bottom right corner of the screen is looking at the computer and smiling happily."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "9PKfF4bBErc_0", "video_path": "9PKfF4bBErc.mp4", "subtitle_path": "9PKfF4bBErc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1098.5, "view_count": 111911}, {"video_id": "9PKfF4bBErc", "question": "On the couch in the screen, there is a white cushion and a guitar. On the table, there are three small green plants and a phone. What happened when the small crab first appeared?", "question_wo_referring_query": "What happened?", "candidates": ["It crawled on the table", "A man placed it in the palm of his hand", "A little girl watched it", "It crawled in the water", "A little boy watched it"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "9PKfF4bBErc_1", "video_path": "9PKfF4bBErc.mp4", "subtitle_path": "9PKfF4bBErc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1098.5, "view_count": 111911}, {"video_id": "9PKfF4bBErc", "question": "The screen shows many demonstrations and a swimming pool with a blue-green color. In the lower right corner of the screen, there is a square frame containing a man sitting in front of a computer. 
What happens when this swimming pool first appears?", "question_wo_referring_query": "What happens?", "candidates": ["A man lies in the pool with his arms outstretched", "A man is kicking his legs in the water", "A man is about to enter the pool", "Two men are about to enter the pool", "Two men are chatting in the water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "9PKfF4bBErc_2", "video_path": "9PKfF4bBErc.mp4", "subtitle_path": "9PKfF4bBErc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1098.5, "view_count": 111911}, {"video_id": "ue9HHOhandU", "question": "There are many sea-themed posters on the walls of the room. The wardrobe contains colorful clothes, and there are also two cute dolls. A girl is facing the mirror. When the subtitle 'certain age group to read especially as' appears, what action does the girl take?", "question_wo_referring_query": "What action does the girl take?", "candidates": ["Turns around", "Holds a book in her hand", "Touches her head", "Turns her head", "Laughs towards the mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "ue9HHOhandU_0", "video_path": "ue9HHOhandU.mp4", "subtitle_path": "ue9HHOhandU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1449.56, "view_count": 235792}, {"video_id": "ue9HHOhandU", "question": "The wall of the room is covered with many photos, the wardrobe is full of clothes and miscellaneous items. A short-haired woman, wearing glasses, is holding a pile of books and standing in front of the mirror. 
When the subtitle 'this I don't taste anything wow this is' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["An orange border appears on the screen", "The woman puts down the books she was holding", "The woman takes off her glasses", "The woman turns around", "The woman lightly nods her head while facing the mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "ue9HHOhandU_1", "video_path": "ue9HHOhandU.mp4", "subtitle_path": "ue9HHOhandU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1449.56, "view_count": 235792}, {"video_id": "ue9HHOhandU", "question": "There are many pictures stuck on the walls of the room, and the wardrobe is full of clothes and miscellaneous items. A short-haired woman is talking to the mirror. What happens when the subtitles 'with your identity it shows that the' appear?", "question_wo_referring_query": "What happens?", "candidates": ["The woman is writing notes", "The woman is flipping through a book", "The woman is putting on makeup", "The woman picks up a book from the desk", "The woman lifts three books with her hands"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "ue9HHOhandU_2", "video_path": "ue9HHOhandU.mp4", "subtitle_path": "ue9HHOhandU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1449.56, "view_count": 235792}, {"video_id": "73Mh88wMp3g", "question": "Who is the first person to appear in the video?", "question_wo_referring_query": "Who is the first person to appear in the video?", "candidates": ["The man wearing a black shirt and a black and white striped hat", "The middle-aged woman wearing a green dress", "The girl wearing a scarf", "The man driving the rickshaw", "The little girl wearing suspenders"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", 
"level": "L2-Relation", "id": "73Mh88wMp3g_0", "video_path": "73Mh88wMp3g.mp4", "subtitle_path": "73Mh88wMp3g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 975.18, "view_count": 249174}, {"video_id": "73Mh88wMp3g", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object is the first to appear in the video?", "candidates": ["The Moon", "A baby carriage", "A detector emitting red light and having a cube-shaped head", "A rescue vehicle with a flashing red light on the roof", "A telephone booth"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "73Mh88wMp3g_1", "video_path": "73Mh88wMp3g.mp4", "subtitle_path": "73Mh88wMp3g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 975.18, "view_count": 249174}, {"video_id": "73Mh88wMp3g", "question": "Which small flying device appears first in the video?", "question_wo_referring_query": "Which small flying device appears first in the video?", "candidates": ["The detector emitting green light with a cuboid-shaped body", "The detector emitting red light with a cuboid-shaped body", "The detector emitting blue light with a cuboid-shaped body", "The detector emitting violet light with a cuboid-shaped body", "The detector emitting yellow light with a cuboid-shaped body"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "73Mh88wMp3g_2", "video_path": "73Mh88wMp3g.mp4", "subtitle_path": "73Mh88wMp3g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 975.18, "view_count": 249174}, {"video_id": "uOFRKh-MnJw", "question": "A painting is hanging on the wall showing a red ladybug. To the left of the woman's head in the painting, there's a green leaf. In the bottom left corner of the video, there's a small wooden table with a green potted plant on it. 
After the caption 'find an object that you like but you can' appears, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["metal bracelet", "gold statue", "silver statue", "metal necklace", "pearl necklace"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "uOFRKh-MnJw_0", "video_path": "uOFRKh-MnJw.mp4", "subtitle_path": "uOFRKh-MnJw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.12, "view_count": 5030}, {"video_id": "uOFRKh-MnJw", "question": "On a wooden desk, various drawing pens and paints are arranged, along with a white bowl filled with water. A pair of hands is drawing on a piece of paper. What happens after the subtitle 'can see that what I painted is' appears?", "question_wo_referring_query": "On a wooden desk, various drawing pens and paints are arranged, along with a white bowl filled with water. A pair of hands is drawing on a piece of paper. What happens after the subtitle 'can see that what I painted is' appears?", "candidates": ["A small dog is drawn on the paper", "A green snake is drawn on the paper", "A cat is drawn on the paper", "A flower is drawn on the paper", "A mouse is drawn on the paper"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "uOFRKh-MnJw_1", "video_path": "uOFRKh-MnJw.mp4", "subtitle_path": "uOFRKh-MnJw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.12, "view_count": 5030}, {"video_id": "uOFRKh-MnJw", "question": "In the scene, a person is painting with various brushes, water, rags, and a metal tray nearby. 
What happens after the subtitle 'I want to avoid having all the shapes' appears?", "question_wo_referring_query": "What happens next?", "candidates": ["The brush is dipped into black paint", "A flower is painted on the paper", "The person switches to another brush", "The brush is cleaned with water", "A small dog is painted on the paper"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "uOFRKh-MnJw_2", "video_path": "uOFRKh-MnJw.mp4", "subtitle_path": "uOFRKh-MnJw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.12, "view_count": 5030}, {"video_id": "1RFP_PZo2QU", "question": "The screen shows a green plant with a man standing next to it, talking to the camera. After the subtitle 'like all good friendships they take time' appears, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["An elephant", "A puppy", "A chick", "A kitten", "A piglet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "1RFP_PZo2QU_0", "video_path": "1RFP_PZo2QU.mp4", "subtitle_path": "1RFP_PZo2QU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1154.1, "view_count": 49268}, {"video_id": "1RFP_PZo2QU", "question": "There is a piglet on the screen, sunlight shines on the green plants casting a shadow. A surface wrapper appears in the lower left corner of the screen. 
After the subtitle 'your nap' appears, what is the first object that appears?", "question_wo_referring_query": "what is the first object that appears?", "candidates": ["a black and white bowl", "a blue and white bowl", "a red and white bowl", "a pure black bowl", "a pure white bowl"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "1RFP_PZo2QU_1", "video_path": "1RFP_PZo2QU.mp4", "subtitle_path": "1RFP_PZo2QU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1154.1, "view_count": 49268}, {"video_id": "1RFP_PZo2QU", "question": "In the video, there are lights hanging on a white building, dimly illuminating it, and a man smiling at the camera, revealing his clean white teeth. After the subtitles 'his very spicy and she freaked out she's' appear, what is the first object to appear?", "question_wo_referring_query": ", what is the first object to appear?", "candidates": ["sky blue fan", "white scissors", "black puppy", "red chili", "green chili"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "1RFP_PZo2QU_2", "video_path": "1RFP_PZo2QU.mp4", "subtitle_path": "1RFP_PZo2QU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1154.1, "view_count": 49268}, {"video_id": "A_1QbBQ5pag", "question": "In the video, there is a blue pattern with two handled round ceramic vases. 
In which of the following places have they appeared?", "question_wo_referring_query": "In which of the following places have they appeared?", "candidates": ["On a store counter", "In the hands of a little girl", "Near a windowsill illuminated by sunlight", "In a rectangular window", "Next to a flower vase"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "A_1QbBQ5pag_0", "video_path": "A_1QbBQ5pag.mp4", "subtitle_path": "A_1QbBQ5pag_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1461.5, "view_count": 35511}, {"video_id": "A_1QbBQ5pag", "question": "In which of the following scenes does the lady dressed in a white skirt and wearing a white hat appear in the video?", "question_wo_referring_query": "In which of the following scenes does she appear?", "candidates": ["On a shelf filled with numerous photo frames", "On a display platform at a richly decorated exhibition", "On the nightstand of a pink house", "In a meticulously arranged photo gallery", "On a wooden small table"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "A_1QbBQ5pag_1", "video_path": "A_1QbBQ5pag.mp4", "subtitle_path": "A_1QbBQ5pag_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1461.5, "view_count": 35511}, {"video_id": "A_1QbBQ5pag", "question": "How many fan-shaped wooden sticks, red and white buttons, and a colorful recorder appear in the following scenes in the video?", "question_wo_referring_query": "In which of the following scenes do they appear?", "candidates": ["In an old man's hand", "On a rectangular tea table", "On a wooden cabinet", "On a park bench", "On a round wooden table"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "A_1QbBQ5pag_2", "video_path": "A_1QbBQ5pag.mp4", "subtitle_path": "A_1QbBQ5pag_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 1461.5, "view_count": 35511}, {"video_id": "g-FH4-kKJbE", "question": "Sunlight shines on a distant fenced area, in front of the lens is a woman leaning against a stone wall wearing black clothing. With which subtitles has this woman appeared together?", "question_wo_referring_query": "With which subtitles has this woman appeared together?", "candidates": ["you guys this seems like a good time to", "must crash a lot of timing you see these", "which is like the center Midtown of", "hey what the heck I love the best yeas", "about 325 to rent a bike for 30 minutes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "g-FH4-kKJbE_0", "video_path": "g-FH4-kKJbE.mp4", "subtitle_path": "g-FH4-kKJbE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.59, "view_count": 1020817}, {"video_id": "g-FH4-kKJbE", "question": "In the video, there is a tall building with flags flying in the wind. Which subtitles have appeared along with the American flag?", "question_wo_referring_query": "Which subtitles have appeared along with the American flag?", "candidates": ["one time it's so good", "if ever you want to immerse yourself in", "and I want to thank Best Western hotels", "Rewards you not only get incredible", "nuclear war it's kind of a creepy"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "g-FH4-kKJbE_1", "video_path": "g-FH4-kKJbE.mp4", "subtitle_path": "g-FH4-kKJbE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.59, "view_count": 1020817}, {"video_id": "g-FH4-kKJbE", "question": "On the sidewalk, people are coming and going. A woman wearing white sneakers rides a blue shared bike across the zebra crossing. 
In the video, which subtitles appear simultaneously with the blue shared bike?", "question_wo_referring_query": "In the video, which subtitles appear simultaneously with the blue shared bike?", "candidates": ["sophisticated", "stunning it's too dark", "Manhattan yellow caps as far as you can", "feeling down here I know it's kind of", "things in just three days"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "g-FH4-kKJbE_2", "video_path": "g-FH4-kKJbE.mp4", "subtitle_path": "g-FH4-kKJbE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.59, "view_count": 1020817}, {"video_id": "9F47SdaGOwg", "question": "Under the blue sky lies a vast expanse of mountains and fields. In the distance, there is a stretch of blue river with boats. On the undulating hills, there are two women smiling and showing their clean white teeth. One woman stands on a hilltop with her hair tied up in a bun and is dressed in a black sports bra, while indoors she stands beside a blonde man. How did her top change?", "question_wo_referring_query": "Under the blue sky lies a vast expanse of mountains and fields. In the distance, there is a stretch of blue river with boats. On the undulating hills, there are two women smiling and showing their clean white teeth. One woman stands on a hilltop with her hair tied up in a bun and is dressed in a black sports bra, while indoors she stands beside a blonde man. 
How did her top change?", "candidates": ["She changed into a red and black plaid shirt", "She changed into a purple dress", "She changed into a white short sleeve shirt", "She changed into a gray sports bra", "She changed into a black shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "9F47SdaGOwg_0", "video_path": "9F47SdaGOwg.mp4", "subtitle_path": "9F47SdaGOwg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.4, "view_count": 281686}, {"video_id": "9F47SdaGOwg", "question": "On the indoor glass window, there is a deer head, and below it, there is a row of green bottom patterns and black English text. There are also some books and bags nearby. A man in a gray short-sleeve shirt is next to a woman. This woman, wearing a polka-dotted black and white top and pink pants, boards a ship and feels the sea breeze. How does her outfit change while facing the mirror?", "question_wo_referring_query": "How does her outfit change?", "candidates": ["Changed into blue shorts", "Changed into a pink skirt", "Changed into a black swimsuit with small patterns", "Changed into a green swimsuit with small patterns", "Changed into gray shorts"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "9F47SdaGOwg_1", "video_path": "9F47SdaGOwg.mp4", "subtitle_path": "9F47SdaGOwg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.4, "view_count": 281686}, {"video_id": "9F47SdaGOwg", "question": "In the distance, there are some ships and a faint twilight. Beside a woman wearing a blue dress stands a man smiling at the camera. 
What changes occur to the man's clothes when he boards the ship with his backpack and faces the camera?", "question_wo_referring_query": "What changes occur to the clothes?", "candidates": ["The gray-green top changes to a white shirt with green patterns.", "The gray-green top changes to a white short-sleeve shirt with green patterns.", "The gray-green top changes to a gray shirt with green patterns.", "The gray-green top changes to a blue short-sleeve shirt with green patterns.", "The white shirt changes to a blue short-sleeve shirt with green patterns."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "9F47SdaGOwg_2", "video_path": "9F47SdaGOwg.mp4", "subtitle_path": "9F47SdaGOwg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.4, "view_count": 281686}, {"video_id": "x9i49P5AwYk", "question": "In a village flatland covered with greenery, there's a wooden house with glass windows on the right side. To the left, there is a tree with wooden and metallic objects underneath it. In the middle, there's a wooden table with chicken, eggplants, tomatoes, and green peppers on it. In the middle, there's a middle-aged man wearing a black short-sleeved shirt. 
What is this middle-aged man doing?", "question_wo_referring_query": "What is this middle-aged man doing?", "candidates": ["Grilling eggplants", "Grilling potatoes", "Grilling tomatoes", "Grilling chicken", "Grilling green peppers"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "x9i49P5AwYk_0", "video_path": "x9i49P5AwYk.mp4", "subtitle_path": "x9i49P5AwYk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.94, "view_count": 2997298}, {"video_id": "x9i49P5AwYk", "question": "On a plain entirely covered with greenery, small trees can be seen in the distance, the sky is densely covered with clouds, and there is a metal bucket surrounded by skewers of chicken and vegetables. There is a middle-aged man wearing a black short-sleeved shirt, green pants, and gloves. What is this middle-aged man doing?", "question_wo_referring_query": "What is this middle-aged man doing?", "candidates": ["Eating eggplant", "Holding a skewer with eggplant", "Eating chicken skewers heartily", "Throwing tree branches into the metal bucket", "Holding a skewer with chicken"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "x9i49P5AwYk_1", "video_path": "x9i49P5AwYk.mp4", "subtitle_path": "x9i49P5AwYk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.94, "view_count": 2997298}, {"video_id": "x9i49P5AwYk", "question": "On a flat land covered in green plants, there are green mountains in the distance, and the weather is overcast. In the middle, there is a wooden table full of peaches, roasted chicken, and roasted beans. Seated beside the table is a man with short hair wearing a black short-sleeved shirt. 
What is this short-haired man doing?", "question_wo_referring_query": "What is the short-haired man doing?", "candidates": ["Eating roasted chicken", "Eating an eggplant", "Sleeping with eyes closed", "Eating roasted beans", "Eating peaches"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "x9i49P5AwYk_2", "video_path": "x9i49P5AwYk.mp4", "subtitle_path": "x9i49P5AwYk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.94, "view_count": 2997298}, {"video_id": "pKfCTlO8pHU", "question": "When the subtitle mentions 'better that it is the' against a black background displaying 'sky news DAILY,' which letter is present on the screen?", "question_wo_referring_query": "Which letter is present on the screen?", "candidates": ["D", "B", "F", "A", "M"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "pKfCTlO8pHU_0", "video_path": "pKfCTlO8pHU.mp4", "subtitle_path": "pKfCTlO8pHU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.68, "view_count": 1857}, {"video_id": "pKfCTlO8pHU", "question": "Against a black background, with the text 'sky news DAILY,' when the subtitle mentions 'Corona graphs that can block out the', which object is present on the screen?", "question_wo_referring_query": "Which object is present on the screen?", "candidates": ["Phone", "Pencil", "Globe", "Earphone", "Moon"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "pKfCTlO8pHU_1", "video_path": "pKfCTlO8pHU.mp4", "subtitle_path": "pKfCTlO8pHU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.68, "view_count": 1857}, {"video_id": "pKfCTlO8pHU", "question": "With a black background, displaying 'sky news DAILY', when the subtitle mentions 'really want to set the right precedent I', which word is present on the screen?", 
"question_wo_referring_query": "Which word is present on the screen?", "candidates": ["daisy", "date", "fly", "news", "skd"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "pKfCTlO8pHU_2", "video_path": "pKfCTlO8pHU.mp4", "subtitle_path": "pKfCTlO8pHU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.68, "view_count": 1857}, {"video_id": "m0vIzYjLw5Q", "question": "In a wooden room, a painting is hanging on the wall, there is a white lampshade on the ceiling, beside the dining table, there is a man in red clothes, a man in a hooded lab coat wearing a hat, a man in a hooded lab coat wearing glasses, and a man in a beige fur coat. They are sitting down eating and chatting. When the subtitle 'grateful that our community exists' appears, what objects are present on the dining table?", "question_wo_referring_query": "What objects are present on the dining table?", "candidates": ["A highball glass", "Chopsticks", "A pair of scissors", "A metal bowl containing noodles", "Fried chicken pieces"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "m0vIzYjLw5Q_0", "video_path": "m0vIzYjLw5Q.mp4", "subtitle_path": "m0vIzYjLw5Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1246.96, "view_count": 2343898}, {"video_id": "m0vIzYjLw5Q", "question": "On a snow-covered mountain, there is a person wearing a dark blue jacket and pants. In front of him is a man in an olive green jacket, draped with a blue flag, wearing a black cap, and sporting a beard, pointing at the camera. 
When the subtitle 'message and' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["A snow leopard", "A poster", "A pair of black sunglasses", "A black head-mounted earphone", "An orange notebook"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "m0vIzYjLw5Q_1", "video_path": "m0vIzYjLw5Q.mp4", "subtitle_path": "m0vIzYjLw5Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1246.96, "view_count": 2343898}, {"video_id": "m0vIzYjLw5Q", "question": "In a room, there is a light switch on the front wall, with a sofa below it. On the left white wall, there is a whiteboard, and a person is writing on it. In the middle, there is a table surrounded by chairs. A person is sitting on a chair looking at the whiteboard. When the caption 'felt for us endless setbacks because of' is mentioned, what object appears on the table?", "question_wo_referring_query": "What object appears on the table?", "candidates": ["A high-heeled wine glass", "A laptop", "A pair of scissors", "A bottle of red wine", "A painting"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "m0vIzYjLw5Q_2", "video_path": "m0vIzYjLw5Q.mp4", "subtitle_path": "m0vIzYjLw5Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1246.96, "view_count": 2343898}, {"video_id": "vblbs6-qEyI", "question": "In a scene where a man wearing a short-sleeved T-shirt with a badge on the collar is talking about a PPT on person detection, what is the shape of the object labeled original video?", "question_wo_referring_query": "What is the shape of the object labeled original video?", "candidates": ["Cylinder", "Rectangular prism", "Cone", "Cube", "Sphere"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": 
"vblbs6-qEyI_0", "video_path": "vblbs6-qEyI.mp4", "subtitle_path": "vblbs6-qEyI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2171.37, "view_count": 150}, {"video_id": "vblbs6-qEyI", "question": "In a scene where a man wearing a short-sleeve T-shirt with a microphone pinned to the collar is discussing a Query-by-Committee (QBC) PPT, what is the color of the rectangle enclosing the equation?", "question_wo_referring_query": "What is the color of the rectangle enclosing the equation?", "candidates": ["blue", "black", "white", "red", "green"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "vblbs6-qEyI_1", "video_path": "vblbs6-qEyI.mp4", "subtitle_path": "vblbs6-qEyI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2171.37, "view_count": 150}, {"video_id": "vblbs6-qEyI", "question": "A man wearing a short-sleeved T-shirt with a badge on the collar is talking about Temporal Action Localization: CDC in a scene. What color is the model front-end on the top of the segment?", "question_wo_referring_query": "What color is the model front-end on the top of the segment?", "candidates": ["purple", "blue", "white", "black", "green"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "vblbs6-qEyI_2", "video_path": "vblbs6-qEyI.mp4", "subtitle_path": "vblbs6-qEyI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2171.37, "view_count": 150}, {"video_id": "PS9f6LnSJO0", "question": "On a wooden tabletop, there is a glass vase in the distance with a few white flowers inserted in it. Next to the table, there is a person holding a whisk in one hand and a bowl of flour in the other. On the table, there is a ceramic bowl containing a liquid. 
What is the color of the liquid in the ceramic bowl when 'Flour - 200g' appears?", "question_wo_referring_query": "What is the color of the liquid in the ceramic bowl when 'Flour - 200g' appears?", "candidates": ["light purple", "beige", "light blue", "pale blue", "black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "PS9f6LnSJO0_0", "video_path": "PS9f6LnSJO0.mp4", "subtitle_path": "PS9f6LnSJO0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1384.88, "view_count": 3548389}, {"video_id": "PS9f6LnSJO0", "question": "On a wooden table, a person wearing a white dress is holding a metal object in one hand and rubbing it with a piece of cheese in the other hand. When the subtitle \u201cCheese - 200g\u201d appears, what is the shape of the metal object?", "question_wo_referring_query": "When the subtitle \u201cCheese - 200g\u201d appears, what is the shape of the metal object?", "candidates": ["Cylindrical", "Rectangular", "Cube", "Round disc", "Cone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "PS9f6LnSJO0_1", "video_path": "PS9f6LnSJO0.mp4", "subtitle_path": "PS9f6LnSJO0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1384.88, "view_count": 3548389}, {"video_id": "PS9f6LnSJO0", "question": "In a room with white walls, there are several metal cabinets at the back with a painting and a glass vase with a plant on top. At the front, there is a person cutting a chili pepper on a wooden table. When the subtitle 'Have a nice day and enjoy watching!' appears, what color is the chili pepper?", "question_wo_referring_query": "When the subtitle 'Have a nice day and enjoy watching!' 
appears, what color is the chili pepper?", "candidates": ["purple", "red", "green", "blue", "yellow"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "PS9f6LnSJO0_2", "video_path": "PS9f6LnSJO0.mp4", "subtitle_path": "PS9f6LnSJO0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1384.88, "view_count": 3548389}, {"video_id": "Fu7mLCydfDI", "question": "When a woman wearing a black and white striped long sleeve and a woman wearing a black short sleeve and glasses are discussing the electrons of Ga, Br, I, and In, who is holding a white pen?", "question_wo_referring_query": "When a woman wearing a black and white striped long sleeve and a woman wearing a black short sleeve and glasses are discussing the electrons of Ga, Br, I, and In, who is holding a white pen?", "candidates": ["A woman wearing a black and white striped long sleeve", "A woman wearing a black and white striped short sleeve", "A woman wearing a black short sleeve and glasses", "A woman wearing a headset", "A woman wearing a black long sleeve"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "Fu7mLCydfDI_0", "video_path": "Fu7mLCydfDI.mp4", "subtitle_path": "Fu7mLCydfDI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 969.07, "view_count": 20817}, {"video_id": "Fu7mLCydfDI", "question": "When a woman wearing a black and white striped long-sleeve shirt and a woman wearing a black short-sleeve shirt with glasses are discussing the electrons of Se, Sr, Rb, and Br, who is holding their head with both hands?", "question_wo_referring_query": "Who is holding their head with both hands?", "candidates": ["A woman wearing a black and white striped long-sleeve shirt", "A man wearing a black short-sleeve shirt", "A woman wearing a silver coat", "A woman wearing a black short-sleeve shirt with glasses", "A woman wearing a black 
long-sleeve shirt with earphones"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "Fu7mLCydfDI_1", "video_path": "Fu7mLCydfDI.mp4", "subtitle_path": "Fu7mLCydfDI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 969.07, "view_count": 20817}, {"video_id": "Fu7mLCydfDI", "question": "When a woman wearing a black and white striped long-sleeve shirt and a woman wearing a black short-sleeve shirt and glasses are discussing F, Cl, Br, and I atoms, who is writing?", "question_wo_referring_query": "Who is writing?", "candidates": ["A woman wearing a black long-sleeve shirt and earphones", "A woman wearing a black long-sleeve shirt", "A woman wearing a black and white striped long-sleeve shirt", "A man wearing a black short-sleeve shirt", "A woman wearing a black short-sleeve shirt and glasses"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "Fu7mLCydfDI_2", "video_path": "Fu7mLCydfDI.mp4", "subtitle_path": "Fu7mLCydfDI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 969.07, "view_count": 20817}, {"video_id": "1lpJn0gn1nc", "question": "Under the blue sky and white clouds, there is a white building nearby with a sculpture on top. In front of the building, there is a green space and trees. 
What happened when Oregon first appeared?", "question_wo_referring_query": "What happened when Oregon first appeared?", "candidates": ["The host remarked that this building is very good and kept praising it", "The host remarked that this building is very stupid", "The host described the exterior of the building in detail", "The host gave a detailed introduction to the history of the building"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "1lpJn0gn1nc_0", "video_path": "1lpJn0gn1nc.mp4", "subtitle_path": "1lpJn0gn1nc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1197.57, "view_count": 764600}, {"video_id": "1lpJn0gn1nc", "question": "Under the blue sky and white clouds, there is a white building standing in the middle with a flag on top, surrounded by trees. What happened when Alabama first appeared?", "question_wo_referring_query": ", what happened?", "candidates": ["The host commented that the building style feels traditional", "The host compared Oregon and Alabama", "The host gave a detailed introduction about the building materials", "The host kept complaining about the old design", "The host mentioned the style is very trendy"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "1lpJn0gn1nc_1", "video_path": "1lpJn0gn1nc.mp4", "subtitle_path": "1lpJn0gn1nc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1197.57, "view_count": 764600}, {"video_id": "1lpJn0gn1nc", "question": "Under the blue sky and white clouds, there is a round-topped building. In front of the building, there is a circular area with green spaces and vegetation. Around the outer edge of the circular area, the sidewalks are lined with green plants. 
What happened the first time Texas appeared?", "question_wo_referring_query": "What happened?", "candidates": ["The host is commenting on the Capitol building's antiquated air.", "The host is commenting on the grandeur of the Capitol building.", "The host is describing the overall structure of the Capitol building.", "The host is describing the history of the Capitol building.", "The host is commenting on the rich history of the Capitol building."], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "1lpJn0gn1nc_2", "video_path": "1lpJn0gn1nc.mp4", "subtitle_path": "1lpJn0gn1nc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1197.57, "view_count": 764600}, {"video_id": "m47qsfSZoTI", "question": "Against a white background, when a man wearing a white shirt and glasses mentions \"gans can do we can start to mimic that,\" what action does he perform?", "question_wo_referring_query": "What action does he perform?", "candidates": ["Raises both hands", "Joins hands together", "Waves both hands forward", "Squats down", "Puts both hands on his head"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "m47qsfSZoTI_0", "video_path": "m47qsfSZoTI.mp4", "subtitle_path": "m47qsfSZoTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1637.7, "view_count": 7102}, {"video_id": "m47qsfSZoTI", "question": "Against a white background, when a man wearing a white shirt and glasses mentions 'zebra detector and he will also be', what action is he performing?", "question_wo_referring_query": "What action is he performing?", "candidates": ["Both hands behind his back", "Starting to draw circles with his left hand", "Pointing upward with his left hand", "Raising both hands high", "Waving both hands forward"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", 
"id": "m47qsfSZoTI_1", "video_path": "m47qsfSZoTI.mp4", "subtitle_path": "m47qsfSZoTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1637.7, "view_count": 7102}, {"video_id": "m47qsfSZoTI", "question": "Against a white background, when a man wearing a white shirt and glasses mentions 'with it and even if you look closely if,' what action is he performing?", "question_wo_referring_query": "What action is he performing?", "candidates": ["Left hand pinky pointing upward", "Waving both hands upward", "Left hand index finger pointing forward", "Squatting down", "Turning head backward"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "m47qsfSZoTI_2", "video_path": "m47qsfSZoTI.mp4", "subtitle_path": "m47qsfSZoTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1637.7, "view_count": 7102}, {"video_id": "g1cj8GdWOm4", "question": "Against a white background, a man wearing a black jacket and glasses is clasping his hands while explaining the contents of a PPT titled 'Problem of fitting' in the lower right corner. What does he do with his hands after clasping them?", "question_wo_referring_query": "What does he do with his hands after clasping them?", "candidates": ["Raises his hands up", "Clenches his hands tightly", "Spreads his hands outward", "Squats down", "Jumps up"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "g1cj8GdWOm4_0", "video_path": "g1cj8GdWOm4.mp4", "subtitle_path": "g1cj8GdWOm4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2333.68, "view_count": 205}, {"video_id": "g1cj8GdWOm4", "question": "Against a white background with English text and formulas, a man wearing a black jacket and glasses is talking about the content of a PPT on Rgeulariation; Ensemble in the lower right corner while holding his hands out. 
What did he do with his hands afterward?", "question_wo_referring_query": "What did he do with his hands afterward?", "candidates": ["He moved his right hand to the right", "He moved his left hand to the left", "He continued holding his hands out and moved them to the right", "He clenched both hands tightly", "He squatted down"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "g1cj8GdWOm4_1", "video_path": "g1cj8GdWOm4.mp4", "subtitle_path": "g1cj8GdWOm4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2333.68, "view_count": 205}, {"video_id": "g1cj8GdWOm4", "question": "In a white background, with English text, a man wearing a black jacket and glasses, while discussing the content of a PPT slide titled 'Training steps' in the lower right corner, is holding a laser pointer in his right hand. What action did he take afterwards?", "question_wo_referring_query": "What action did he take afterwards?", "candidates": ["Clenched his fists", "Touched his glasses", "Shook his head", "Clasped his hands together", "Switched the laser pointer to his left hand"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "g1cj8GdWOm4_2", "video_path": "g1cj8GdWOm4.mp4", "subtitle_path": "g1cj8GdWOm4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2333.68, "view_count": 205}, {"video_id": "fR9dhkJyNNo", "question": "At the top of the screen is dense green foliage, below the foliage is withered yellow grass, and on the iron rack surrounded by rocks, there is a pot. The food in the pot is steaming. A person is using a spatula to scoop the food onto a silver round plate on the table. 
When the screen plays to this point, which of the following items appears first?", "question_wo_referring_query": "Which of the following items appears first?", "candidates": ["Knife", "Axe", "Yellow jar", "Pot lid", "Mobile phone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "fR9dhkJyNNo_0", "video_path": "fR9dhkJyNNo.mp4", "subtitle_path": "fR9dhkJyNNo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1082.72, "view_count": 1075457}, {"video_id": "fR9dhkJyNNo", "question": "Which of the following objects appears first in the video?", "question_wo_referring_query": "Which of the following objects appears first in the video?", "candidates": ["Frame of a pot stand", "Small water wagon", "Autonomous short smoke cylinder", "Beehive", "Purple flower"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "fR9dhkJyNNo_1", "video_path": "fR9dhkJyNNo.mp4", "subtitle_path": "fR9dhkJyNNo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1082.72, "view_count": 1075457}, {"video_id": "fR9dhkJyNNo", "question": "In the video, which of the following ingredients appears first?", "question_wo_referring_query": "Which of the following ingredients appears first in the video?", "candidates": ["Beef tongue", "Garlic", "Bell pepper", "Eggplant", "Tomato"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "fR9dhkJyNNo_2", "video_path": "fR9dhkJyNNo.mp4", "subtitle_path": "fR9dhkJyNNo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1082.72, "view_count": 1075457}, {"video_id": "tP68QwVvAZk", "question": "There is a map hanging on the wall above the right side of the door, and the rest of the wall is covered with various photos. 
The white door is open, and a man wearing a black coat appears in front of the wall. The man's face is mostly obscured by the black hood he is wearing. To the right of the man, there is a red and white flag. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Adjusting his bracelet", "Adjusting his watch", "Adjusting his black gloves", "Adjusting his black hat", "Adjusting his shoes"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "tP68QwVvAZk_0", "video_path": "tP68QwVvAZk.mp4", "subtitle_path": "tP68QwVvAZk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1741.99, "view_count": 2169363}, {"video_id": "tP68QwVvAZk", "question": "On the wall above the right side of the door is a map, with various photos in other positions. The white door is open. Two men wearing black short-sleeve shirts are standing in front of the wall. The man on the left is wearing a white hat backward and has a floral pattern on the chest of his short-sleeve shirt. The man on the right has one hand on his waist and the other raised high, with a Mickey Mouse pattern on his short-sleeve shirt. 
What is the man on the left doing?", "question_wo_referring_query": "What is the man on the left doing?", "candidates": ["He is pressing the shoulder of the person next to him", "He is raising both hands high", "He is pressing the chest of the person next to him", "He is pressing the arms of the person next to him", "He is pressing the head of the person next to him"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "tP68QwVvAZk_1", "video_path": "tP68QwVvAZk.mp4", "subtitle_path": "tP68QwVvAZk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1741.99, "view_count": 2169363}, {"video_id": "tP68QwVvAZk", "question": "On the wall above the right side of the door hangs a map, and in other positions, there are various photographs. The white door is open, and a man wearing an olive-colored outfit with a pink checkered headscarf is standing in front of the wall. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Standing with his back to the mirror, raising both hands", "Standing with his back to the mirror, waving", "Facing the mirror, hands on hips", "Standing with his back to the mirror, hands on hips", "Standing with his back to the mirror, bending over"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "tP68QwVvAZk_2", "video_path": "tP68QwVvAZk.mp4", "subtitle_path": "tP68QwVvAZk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1741.99, "view_count": 2169363}, {"video_id": "oAY_sn1v3Kw", "question": "In the broadcasting room, at the table in the center, there is a silver-haired lady sitting on the left, a bald, middle-aged gentleman in a suit sitting on the right, and a long-haired lady in a suit sitting on the right as well. 
On the table in front of them, there are paper documents, and on the screen in the broadcasting room, there are enlarged materials being displayed. What is present in the scene?", "question_wo_referring_query": "What is present in the scene?", "candidates": ["painting", "laptop", "glasses", "vase", "hat"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "oAY_sn1v3Kw_0", "video_path": "oAY_sn1v3Kw.mp4", "subtitle_path": "oAY_sn1v3Kw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1116.16, "view_count": 18379}, {"video_id": "oAY_sn1v3Kw", "question": "On the left side of the screen, there is a bald man in a suit and white shirt, and a woman with long hair in a suit. There are paper materials on the table in front of them. Behind them, there is a handrail and steps. On the right side of the screen, there is a cover with an elderly person and a woman in blue clothes. At the bottom of the screen, there is an information bar. What is present in the scene?", "question_wo_referring_query": "What is present in the scene?", "candidates": ["glasses", "ring", "hat", "potted plant", "laptop"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "oAY_sn1v3Kw_1", "video_path": "oAY_sn1v3Kw.mp4", "subtitle_path": "oAY_sn1v3Kw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1116.16, "view_count": 18379}, {"video_id": "oAY_sn1v3Kw", "question": "On the left side of the screen, there is a long-haired woman dressed in a black innerwear and a black jacket. The woman is wearing a silver accessory on her wrist and has a ring on her finger. Behind the woman are a railing and a lamppost. On the right side of the screen, there is a cover of a document. At the bottom of the screen, there is an information strip. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Glasses", "Hat", "Planter", "Necklace", "Earrings"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "oAY_sn1v3Kw_2", "video_path": "oAY_sn1v3Kw.mp4", "subtitle_path": "oAY_sn1v3Kw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1116.16, "view_count": 18379}, {"video_id": "g66-6uFbsf4", "question": "A man wearing a yellow lab coat appears in the center of the screen, with a background of a world map behind him. There are electronic devices and books on the left side of the globe, and on the right side of the globe, there are papers and pictures. The man is wearing a black wristwatch, and when the subtitle 'episode of flat earth Friday with me' appears, what objects are present in the scene?", "question_wo_referring_query": "what objects are present in the scene?", "candidates": ["glasses", "planter", "lamp", "ring", "necklace"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "g66-6uFbsf4_0", "video_path": "g66-6uFbsf4.mp4", "subtitle_path": "g66-6uFbsf4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1070.72, "view_count": 199508}, {"video_id": "g66-6uFbsf4", "question": "The walls inside the cabin are filled with dense instruments and wires. There is a yellow square on the left wall and a blue square on the right wall. A man wearing a black shirt and black shorts appears in the cabin lying down, and the subtitle 'show us how it's done' appears. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Black socks", "Basketball", "White socks", "Volleyball", "Soccer ball"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "g66-6uFbsf4_1", "video_path": "g66-6uFbsf4.mp4", "subtitle_path": "g66-6uFbsf4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1070.72, "view_count": 199508}, {"video_id": "g66-6uFbsf4", "question": "The screen shows a perspective view. A chain appears at the top corner, and the ground is reddish sand. In the upper right corner of the ground, there is a white rectangular box and a red bucket-shaped object. Near the center of the screen, there is a parked work vehicle. When the subtitle 'now as I'm sat up here eating me soup' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A black dog", "Shadow of the work vehicle", "Green trees", "A dried-up stream", "A horse"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "g66-6uFbsf4_2", "video_path": "g66-6uFbsf4.mp4", "subtitle_path": "g66-6uFbsf4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1070.72, "view_count": 199508}, {"video_id": "wcJfWoLwe6g", "question": "White clouds are floating in the blue sky, distant connected mountain ranges can be seen. A man wearing a black short-sleeved shirt is sitting next to a wooden table feeding sheep some snacks. The table is set on a flat grassy surface. There are trees behind the man. On the table, there are yellow foods and red drinks. 
What kind of tray is under the red drink?", "question_wo_referring_query": "What kind of tray is under the red drink?", "candidates": ["blue plastic", "transparent glass material", "brown wood", "black metal material", "red plastic"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wcJfWoLwe6g_0", "video_path": "wcJfWoLwe6g.mp4", "subtitle_path": "wcJfWoLwe6g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1007.41, "view_count": 2678304}, {"video_id": "wcJfWoLwe6g", "question": "There are yellow ingredients and soup in a large silver pot, a silver fork rests on the right side of the pot, a wooden bowl with white ingredients appears on the right side of the pot, a hand is adding the white ingredients into the pot with a ladle. What is the form of the white ingredients?", "question_wo_referring_query": "What is the form of the white ingredients?", "candidates": ["Solid chunks", "Powdery", "Liquid", "Viscous substance between solid and liquid", "Long solid strips"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wcJfWoLwe6g_1", "video_path": "wcJfWoLwe6g.mp4", "subtitle_path": "wcJfWoLwe6g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1007.41, "view_count": 2678304}, {"video_id": "wcJfWoLwe6g", "question": "Wood is burning on a grey and black stove, and there is a black pot on the right side of the stove. In the upper left corner of the floor, there's a piece of wood standing upright, with a silver pot beside it. Someone is holding the pot's handle. 
What does the wall in the room look like?", "question_wo_referring_query": "What does the wall in the room look like?", "candidates": ["It is a white-tiled wall", "It is a wooden wall", "It is a blue-tiled wall", "It is a stone-tiled wall", "It is a red brick wall"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "wcJfWoLwe6g_2", "video_path": "wcJfWoLwe6g.mp4", "subtitle_path": "wcJfWoLwe6g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1007.41, "view_count": 2678304}, {"video_id": "r9jUTwVx2Pw", "question": "The vast white sky shows no clouds. A house with red walls and a black tile roof appears in a grass field. The exterior of the house is surrounded by a white fence. The building on the far left has a cylindrical structure. There is a tree on the right side of the house. When the subtitle 'part of the state you're nowhere near' appears, what is the shape of the window of this red-walled house?", "question_wo_referring_query": "What is the shape of the window of this red-walled house?", "candidates": ["Fan-shaped window", "Arch-shaped window", "Rectangular window", "Triangular window", "Circular window"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "r9jUTwVx2Pw_0", "video_path": "r9jUTwVx2Pw.mp4", "subtitle_path": "r9jUTwVx2Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.29, "view_count": 142747}, {"video_id": "r9jUTwVx2Pw", "question": "There are two road signs standing on the yellow ground, with sparse yellow grass on the ground. In the distance, there are continuous mountain ranges and clouds. On the right road sign, there is a horse-riding symbol. The left sign has arrows and characters. 
When the subtitle appears 'towns and in fact the road on this map', what does the left road sign look like?", "question_wo_referring_query": "What does the left road sign look like?", "candidates": ["Stair-shaped road sign", "Rectangular road sign", "Round road sign", "Square-shaped road sign", "Triangular road sign"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "r9jUTwVx2Pw_1", "video_path": "r9jUTwVx2Pw.mp4", "subtitle_path": "r9jUTwVx2Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.29, "view_count": 142747}, {"video_id": "r9jUTwVx2Pw", "question": "Under the blue sky, white clouds float, a road leads to the distance. On the right side of the road is a green lawn, with a row of poles and a signpost on the lawn. The signpost consists of three parts: the top is white, the bottom is a green rectangle, and the left side is the smallest area. On the left side of the road are houses and trees. When the subtitle 'the northwestern portion of the state' appears, what kind of car is next to the red house on the left?", "question_wo_referring_query": "What kind of car is next to the red house on the left?", "candidates": ["white pickup truck", "white sedan", "black pickup truck", "black sedan", "white motorcycle"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "r9jUTwVx2Pw_2", "video_path": "r9jUTwVx2Pw.mp4", "subtitle_path": "r9jUTwVx2Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.29, "view_count": 142747}, {"video_id": "LI1daKe06OU", "question": "There is a painting hanging on the white wall to the left of the house, and the curtains in the room are drawn. Two men are sitting on the black sofa. 
The man on the left, who is wearing a hat, is dressed in a red short-sleeve shirt and white shorts, while the man on the right, who is wearing a watch, is dressed in a light-colored short-sleeve shirt and ripped jeans. A man wearing blue jeans appears from the right. What did the man in blue jeans do first when he appeared?", "question_wo_referring_query": "What did the man in blue jeans do first when he appeared?", "candidates": ["Drank water", "Opened his phone", "Sat on the sofa", "Shook hands with the person next to him", "Changed clothes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "LI1daKe06OU_0", "video_path": "LI1daKe06OU.mp4", "subtitle_path": "LI1daKe06OU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1161.03, "view_count": 83658}, {"video_id": "LI1daKe06OU", "question": "There is a painting hanging on the white wall to the left of the house, the curtains in the room are drawn up, four men are sitting on a black sofa, the man in the middle wearing a hat is dressed in a red short sleeve shirt and white shorts, the man in the middle wearing a watch is dressed in a light short sleeve shirt and ripped jeans, the man on the right is wearing jeans, and the man on the left is wearing a black vest. 
What did the man in the black vest do when he first appeared?", "question_wo_referring_query": "What did the man in the black vest do when he first appeared?", "candidates": ["Threw the towel", "Drank water", "Put down the cell phone", "Wiped his head with a towel", "Opened a cell phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "LI1daKe06OU_1", "video_path": "LI1daKe06OU.mp4", "subtitle_path": "LI1daKe06OU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1161.03, "view_count": 83658}, {"video_id": "LI1daKe06OU", "question": "There is a patch of brown on the white wall of the room, covered with colorful sticky notes. A man wearing an olive green shirt appears in the frame, with a black cord around his neck. What did this man do when he first appeared?", "question_wo_referring_query": "What did this man do when he first appeared?", "candidates": ["Smiled and started talking to the camera", "Drank water", "Changed clothes", "Picked up the phone", "Waved at the camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "LI1daKe06OU_2", "video_path": "LI1daKe06OU.mp4", "subtitle_path": "LI1daKe06OU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1161.03, "view_count": 83658}, {"video_id": "zOVd5bSLBRQ", "question": "A girl wearing suspenders and glasses appears in front of a transparent window framed in black. To the left of the window, there are stickers and a yellow ball-shaped decoration. Outside the window, there is a blue sky and a city view. 
When the subtitle 'for future things coming so' appears, what is the girl doing?", "question_wo_referring_query": "What is the girl doing?", "candidates": ["The girl is making a heart sign.", "The girl is playing with a decoration.", "The girl is adjusting her glasses.", "The girl is making a circular motion with her hand.", "The girl is making a peace sign."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "zOVd5bSLBRQ_0", "video_path": "zOVd5bSLBRQ.mp4", "subtitle_path": "zOVd5bSLBRQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1020.44, "view_count": 519099}, {"video_id": "zOVd5bSLBRQ", "question": "A boy wearing a purple shirt and a girl wearing a black suspender with glasses appear on the screen. The girl has green hair tied into pigtails and is holding a bun. Behind them, there are trees and a corner of a tall Gothic building. When the subtitle 'cute' appears, what is the girl doing?", "question_wo_referring_query": "What is the girl doing?", "candidates": ["The girl is chewing the bun in her mouth", "The girl is adjusting her hair with her hands", "The girl is tearing the bun", "The girl is adjusting her glasses with her hand", "The girl is drinking water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "zOVd5bSLBRQ_1", "video_path": "zOVd5bSLBRQ.mp4", "subtitle_path": "zOVd5bSLBRQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1020.44, "view_count": 519099}, {"video_id": "zOVd5bSLBRQ", "question": "A girl in a green hospital gown wearing glasses is sitting on a rock. The girl is wearing blue checkered shorts. The chest area of the girl's hospital gown has cartoon embroidery. Behind the girl is a green lakeside. On the other side of the lakeside is a rocky wall and green trees. 
When the subtitle 'so' appears, what is the girl doing?", "question_wo_referring_query": ", what is the girl doing?", "candidates": ["The girl is waving her hand towards the camera", "The girl is drinking water", "The girl has her hands on her waist", "The girl holds her chest with two hands", "The girl is using her hands to adjust her clothes"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "zOVd5bSLBRQ_2", "video_path": "zOVd5bSLBRQ.mp4", "subtitle_path": "zOVd5bSLBRQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1020.44, "view_count": 519099}, {"video_id": "4tMSsFKzkow", "question": "A man riding a unicycle is moving to the right on the street. Near him is an elderly woman with silver hair wearing white clothes. There are many parked motorcycles on both sides of the street. The street has white stripes, and there are neatly arranged houses along the street. What did the man do after moving to the right?", "question_wo_referring_query": "What did the man do after moving to the right?", "candidates": ["He collided with a motorcycle.", "He entered a house.", "He turned back and waved towards the camera.", "He went to drink water.", "He shook hands with the elderly woman next to him."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "4tMSsFKzkow_0", "video_path": "4tMSsFKzkow.mp4", "subtitle_path": "4tMSsFKzkow_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.1, "view_count": 787636}, {"video_id": "4tMSsFKzkow", "question": "In a warmly-lit restaurant, a man in short sleeves is sitting on a red sofa, with his black backpack placed on his right. On the table in front of him, there is a transparent glass bottle, food, and a mobile phone. Behind him, there is an empty dining seat and a transparent lattice window filled with boards. 
What did this man do after finishing his meal?", "question_wo_referring_query": "What did this man do after finishing his meal?", "candidates": ["Walked across the road", "He rode a motorcycle at the restaurant entrance", "Called a car at the restaurant entrance", "A woman drove a car to pick him up at the restaurant entrance", "A man drove a car to pick him up at the restaurant entrance"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "4tMSsFKzkow_1", "video_path": "4tMSsFKzkow.mp4", "subtitle_path": "4tMSsFKzkow_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.1, "view_count": 787636}, {"video_id": "4tMSsFKzkow", "question": "A man dressed in a white shirt and a black coat is standing in front of a blue door, talking. The man is holding a black-lidded white beverage cup in his hand. On his right side, there's a golden door handle. What does the man do after he finishes talking?", "question_wo_referring_query": "What does the man do after he finishes talking?", "candidates": ["He enters a tunnel.", "He gets into a taxi.", "He drives away in a car.", "He rides a motorcycle.", "He enters a cafeteria."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "4tMSsFKzkow_2", "video_path": "4tMSsFKzkow.mp4", "subtitle_path": "4tMSsFKzkow_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.1, "view_count": 787636}, {"video_id": "1bk0rYvZXaI", "question": "What does the blonde woman wearing a gray outfit take out first from the bag on the right side of the table? The woman has red nail polish and is wearing a bracelet. Behind her is a white wall and a cabinet. 
In front of her is an open paper box, and on the right side of the table, there is a blue bag with turnip patterns.", "question_wo_referring_query": "What does the woman take out first from the bag?", "candidates": ["A green packaged box with vegetable patterns", "Rye noodles", "A yellow plastic box with the number 365 printed on it", "Mineral water", "A canned tomato sauce"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "1bk0rYvZXaI_0", "video_path": "1bk0rYvZXaI.mp4", "subtitle_path": "1bk0rYvZXaI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1260.06, "view_count": 15910}, {"video_id": "1bk0rYvZXaI", "question": "A blonde woman wearing a white camisole is standing in a bathroom. The woman has red manicured nails and is wearing a heart-shaped pendant necklace. To her right is a white towel, and behind her are white tiles and a showerhead. What does the woman pick up first?", "question_wo_referring_query": "What does the woman pick up first?", "candidates": ["A mirror", "A pair of black shorts", "A comb", "A gold-packaged lipstick", "An eyebrow pencil"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "1bk0rYvZXaI_1", "video_path": "1bk0rYvZXaI.mp4", "subtitle_path": "1bk0rYvZXaI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1260.06, "view_count": 15910}, {"video_id": "1bk0rYvZXaI", "question": "A blonde woman, dressed in a white halter top and black shorts, is standing in front of an olive-colored wardrobe. The lights inside the wardrobe are warm-colored. The clothes on the sides of the wardrobe are hanging up, while those in the middle are folded and stacked. The walls of the room are white. 
What does this woman put on first?", "question_wo_referring_query": "What does this woman put on first?", "candidates": ["A black short vest", "A blue short vest", "A black long dress", "A tight-fitting black long-sleeve top", "A white robe"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "1bk0rYvZXaI_2", "video_path": "1bk0rYvZXaI.mp4", "subtitle_path": "1bk0rYvZXaI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1260.06, "view_count": 15910}, {"video_id": "Y4_pvVrGfB0", "question": "A man wearing a grey coat is in the center of the screen. On the wall behind him, there's a map and a poster. A model of a moth with yellow legs and a blue back is pasted next to the poster. The lower part of the wall is brown. In the upper right of the man, there's a map with green dots. After the subtitles 'which is a bad sign because uh if that's,' what object appears in the man's hand?", "question_wo_referring_query": "What object appears in the man's hand?", "candidates": ["A map", "A piece of paper with the number 6 written on it", "A poster", "A piece of paper with the number 9 written on it", "A white bowl"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "Y4_pvVrGfB0_0", "video_path": "Y4_pvVrGfB0.mp4", "subtitle_path": "Y4_pvVrGfB0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2293.89, "view_count": 35904}, {"video_id": "Y4_pvVrGfB0", "question": "A man wearing a gray coat is in the center of the screen, holding a white bowl. On the wall behind him, there is a map and a nautical chart. A model of a spider with yellow limbs and a blue back is attached to the side of the nautical chart. The lower part of the wall is brown. There is also a map picture in the upper-right corner, which contains green blocks. 
After the subtitle 'easy win for Ontario in this one' appears, what does the man take out of the bowl?", "question_wo_referring_query": "What does the man take out of the bowl?", "candidates": ["A slip of paper with the number 8 written on it", "A slip of paper with the number 6 written on it", "A slip of paper with the number 9 written on it", "A slip of paper with the number 13 written on it", "A slip of paper with the number 31 written on it"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "Y4_pvVrGfB0_1", "video_path": "Y4_pvVrGfB0.mp4", "subtitle_path": "Y4_pvVrGfB0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2293.89, "view_count": 35904}, {"video_id": "Y4_pvVrGfB0", "question": "A man in a gray coat is at the center of the screen. There is a map and a sea chart on the wall behind him. A model of a yellow and blue beetle is attached next to the sea chart. The lower part of the wall is khaki-colored. In the upper right corner of the man, there is a map with green patches in it. 
After the subtitle 'at these' appears, what object appears in front of the man's chest?", "question_wo_referring_query": ", what object appears in front of the man's chest?", "candidates": ["A map appears", "Two sea charts appear", "Two silver round coins appear", "Two beetle models appear", "A white bowl appears"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "Y4_pvVrGfB0_2", "video_path": "Y4_pvVrGfB0.mp4", "subtitle_path": "Y4_pvVrGfB0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2293.89, "view_count": 35904}, {"video_id": "My2BQI4Vqqs", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the learning skills of chemistry are summarized, then the female teacher and students collaboratively guide the chemistry lesson, and finally, the female teacher gives an introduction to the lesson.", "First, the female teacher gives an introduction to the lesson, then the learning skills of chemistry are summarized, and finally, the female teacher and students collaboratively guide the chemistry lesson.", "First, the female teacher and the students collaboratively guide the chemistry lesson, then the female teacher gives an introduction to the lesson, and finally, the female teacher summarizes the learning skills of chemistry.", "First, the female teacher gives an introduction to the lesson, then the female teacher and the students collaboratively guide the chemistry lesson, and finally, the female teacher summarizes the learning skills of chemistry.", "First, the female teacher summarizes the learning skills of chemistry, then the female teacher gives an introduction to the lesson, and finally, the female teacher and students collaboratively guide the chemistry lesson."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": 
"L2-Relation", "id": "My2BQI4Vqqs_0", "video_path": "My2BQI4Vqqs.mp4", "subtitle_path": "My2BQI4Vqqs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.37, "view_count": 39093}, {"video_id": "My2BQI4Vqqs", "question": "In the chemistry class taught by the female teacher, which of the following sequences of scenes is correct?", "question_wo_referring_query": "In the chemistry class taught by the female teacher, which of the following sequences of scenes is correct?", "candidates": ["First, a scene explaining chemical formulas appears, followed by another chemical formula explanation scene, then it goes back to the periodic table scene, followed by another chemical formula explanation scene, and lastly returns to the periodic table explanation scene.", "First, a scene explaining the periodic table of elements appears, then a scene explaining chemical formulas, then it goes back to the periodic table scene, followed by another chemical formula explanation scene, and lastly returns to the periodic table explanation scene.", "First, a scene explaining chemical formulas appears, followed by another chemical formula explanation scene, then it goes back to the periodic table scene, followed by a periodic table explanation scene, and lastly returns to the periodic table explanation scene.", "First, a scene explaining chemical formulas appears, then a periodic table explanation scene, then it goes back to the periodic table scene, followed by another chemical formula explanation scene, and lastly returns to the periodic table explanation scene.", "First, a scene explaining chemical formulas appears, followed by another chemical formula explanation scene, then it goes back to the periodic table scene, followed by another chemical formula explanation scene, and lastly returns to the chemical formula explanation scene."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "My2BQI4Vqqs_1", 
"video_path": "My2BQI4Vqqs.mp4", "subtitle_path": "My2BQI4Vqqs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.37, "view_count": 39093}, {"video_id": "My2BQI4Vqqs", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the teacher gives an introduction to the lesson, then the periodic table of elements appears. The teacher and students discuss it together, then a chemical formula appears. The teacher continues the explanation and finally, the teacher summarizes the learning techniques.", "First, the teacher summarizes the learning techniques, then a chemical formula appears. The teacher and students discuss it together, then the periodic table of elements appears. The teacher continues the explanation and finally, the teacher gives an introduction to the lesson.", "First, the teacher summarizes the learning techniques, then the periodic table of elements appears. The teacher and students discuss it together, then a formula screen appears. The teacher continues the explanation and finally, the teacher gives an introduction to the lesson.", "First, the teacher gives an introduction to the lesson, then the periodic table of elements appears. The teacher and students discuss it together, then a formula screen appears. The teacher continues the explanation and finally, the teacher summarizes the learning techniques.", "First, the teacher gives an introduction to the lesson, then a chemical formula appears. The teacher and students discuss it together, then the periodic table of elements appears. 
The teacher continues the explanation and finally, the teacher summarizes the learning techniques."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "My2BQI4Vqqs_2", "video_path": "My2BQI4Vqqs.mp4", "subtitle_path": "My2BQI4Vqqs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.37, "view_count": 39093}, {"video_id": "H8z8glfqsm0", "question": "A man in a white lab coat is standing in front of a black background. The man has his hand showing a ring, and he is wearing glasses and a bow tie. With what subtitle does this man appear?", "question_wo_referring_query": "With what subtitle does this man appear?", "candidates": ["Friendship of and you can actually find", "prevent Coastal erosion while", "is a king tide you ask a king tide is ", "Beautiful flowers", "check out maybe our most popular place"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "H8z8glfqsm0_0", "video_path": "H8z8glfqsm0.mp4", "subtitle_path": "H8z8glfqsm0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1987.25, "view_count": 384730}, {"video_id": "H8z8glfqsm0", "question": "A man wearing a white T-shirt and denim jacket is standing in front of a black screen. The man's white T-shirt has a blue and green logo on it. In the top right corner of the man, there is an image of the UK flag and some stars. 
When has this UK flag and what subtitle appeared together before?", "question_wo_referring_query": "When has this UK flag and what subtitle appeared together before?", "candidates": ["right", "prevent Coastal erosion while", "check out maybe our most popular place", "so I really wanted to go to Tuvalu for", "Beautiful flowers"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "H8z8glfqsm0_1", "video_path": "H8z8glfqsm0.mp4", "subtitle_path": "H8z8glfqsm0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1987.25, "view_count": 384730}, {"video_id": "H8z8glfqsm0", "question": "A man in a black inner outfit and green checkered shirt is standing in front of a black background. The man's shirt has a small pocket and buttons on the chest, and a lamp is hanging on his right side. The man has a short beard. Along with what subtitles has this lamp appeared?", "question_wo_referring_query": "Along with what subtitles has this lamp appeared?", "candidates": ["right", "on the island electricity is usually run", "Beautiful flowers", "check out maybe our most popular place", "that is right you know we are"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "H8z8glfqsm0_2", "video_path": "H8z8glfqsm0.mp4", "subtitle_path": "H8z8glfqsm0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1987.25, "view_count": 384730}, {"video_id": "g5rnQEt2snI", "question": "A woman wearing a black short-sleeve shirt and skirt is sitting on a purple sofa, and a man with glasses wearing a checkered shirt is sitting on a green sofa. Between them, there is a coffee table with a photo frame on it, and the wall behind them has various wooden frames with models. 
When a hand cursor appears in the lower-left corner of the screen pointing to the right side of the scene and the woman raises both hands, what changes happen to the purple sofa?", "question_wo_referring_query": "What changes happen to the purple sofa?", "candidates": ["A pair of scissors appears on the sofa armrest", "A piece of paper appears on the sofa armrest", "A potted plant appears on the sofa armrest", "An apple appears on the sofa armrest", "A toy pig appears on the sofa"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "g5rnQEt2snI_0", "video_path": "g5rnQEt2snI.mp4", "subtitle_path": "g5rnQEt2snI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2155.2, "view_count": 69828}, {"video_id": "g5rnQEt2snI", "question": "A woman wearing a black short-sleeved shirt and skirt is sitting on a purple sofa, and a man wearing a checkered shirt and glasses is sitting on a green sofa. The man has one hand resting on his leg and the other hand supporting his chin. There is a potted plant behind the man, and various framed models are placed on the wall behind them. When the lower-left corner of the screen shows a prompt to the right, the woman's palms face up, her hands spread out to both sides. 
What did the man do at this moment?", "question_wo_referring_query": "What did this man do at this moment?", "candidates": ["The man kneels on the ground", "The hand that was supporting his chin is now placed on his leg", "The man places both hands on his waist", "The man stands up", "The man raises both hands high"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "g5rnQEt2snI_1", "video_path": "g5rnQEt2snI.mp4", "subtitle_path": "g5rnQEt2snI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2155.2, "view_count": 69828}, {"video_id": "g5rnQEt2snI", "question": "A lady wearing a black short-sleeve shirt is sitting on a purple sofa. The lady is wearing earrings, and on the wall behind her, there is a wooden frame with a model inside. In the bottom left corner from the lady, there's a clock diagram with a red-tipped hand pointing to the upper right. When a man, with his legs crossed and one hand leaning on his shoe, appears on the screen, what changes occur to the clock diagram?", "question_wo_referring_query": "What changes occur to the clock diagram?", "candidates": ["The hand changes from pointing upper right to pointing left", "The hand changes from pointing upper right to pointing straight down", "The hand changes from pointing upper right to pointing straight up", "The hand changes from pointing upper right to pointing upper left", "The hand changes from pointing upper right to pointing right"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "g5rnQEt2snI_2", "video_path": "g5rnQEt2snI.mp4", "subtitle_path": "g5rnQEt2snI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2155.2, "view_count": 69828}, {"video_id": "S4J-WVevSEs", "question": "The bottom section of the white background is a yellow horizontal strip with an icon, and the upper section contains bold English text. 
In the center-right of the screen, there is a photo of three lions. What change occurs to the three lions in the photo when the subtitle 'passed to stable diffusion for imp' appears?", "question_wo_referring_query": "What change occurs to the three lions in the photo?", "candidates": ["The three lions turned into three dogs", "The three lions turned into one border collie and two lions", "The three lions turned into one keji and two lions", "The three lions turned into three wolves", "The three lions turned into one wolf and two lions"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "S4J-WVevSEs_0", "video_path": "S4J-WVevSEs.mp4", "subtitle_path": "S4J-WVevSEs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1879.04, "view_count": 141}, {"video_id": "S4J-WVevSEs", "question": "At the bottom of the white background, there is a yellow strip with a logo, above it are bold English characters, and below the English characters is a table with three columns and five rows. The first and second columns on the left have green characters. 
What changes occur in the data table when subtitles show 'uh and actually um'?", "question_wo_referring_query": "What changes occur in the data table?", "candidates": ["The third column, fourth row contains an additional red box.", "The second column, fourth row contains an additional red box.", "The second column, fourth row contains an additional blue box.", "The third column, fourth row contains an additional green box.", "The second column, fourth row contains an additional green box."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "S4J-WVevSEs_1", "video_path": "S4J-WVevSEs.mp4", "subtitle_path": "S4J-WVevSEs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1879.04, "view_count": 141}, {"video_id": "S4J-WVevSEs", "question": "Below the white background is a yellow horizontal bar with an icon. Under the English title is bold English text, and beneath the bold text are four pillar-shaped icons with green tops. Under the icons are orange and blue characters as well as black characters. 
When the subtitle 'drawbacks in the limitation in this' appears, what changes occur to the yellow horizontal bar below?", "question_wo_referring_query": "What changes occur to the yellow horizontal bar below?", "candidates": ["The yellow horizontal bar turns red.", "The yellow horizontal bar becomes shorter.", "The yellow horizontal bar disappears.", "The yellow horizontal bar is partially covered by the status bar.", "The yellow horizontal bar becomes longer."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "S4J-WVevSEs_2", "video_path": "S4J-WVevSEs.mp4", "subtitle_path": "S4J-WVevSEs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1879.04, "view_count": 141}, {"video_id": "ibBZoRmF_QQ", "question": "In a sunny outdoors, a woman wearing a red and white striped dress and a man in a dark coat are standing beside a camera. The woman is wearing a pearl necklace and glasses. Behind them, there are green tree branches. The camera's tripod is silver. What kind of glasses is the woman wearing?", "question_wo_referring_query": "What kind of glasses is the woman wearing?", "candidates": ["Transparent square glasses with black frames", "Transparent square glasses with gold frames", "Transparent square glasses with silver frames", "Black sunglasses with blue frames", "Black sunglasses with red frames"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "ibBZoRmF_QQ_0", "video_path": "ibBZoRmF_QQ.mp4", "subtitle_path": "ibBZoRmF_QQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.63, "view_count": 25497}, {"video_id": "ibBZoRmF_QQ", "question": "A man wearing a white shirt and black suit is standing in front of a white wall. On the wall facing the man, there are five black and white paintings hanging. The man's hand is pointing to these paintings. 
There are also black and white paintings hanging on the wall behind the man. What does the man's hair look like?", "question_wo_referring_query": "What does the man's hair look like?", "candidates": ["Short and straight black hair", "Curly hair with a bare forehead", "Long and straight blond hair", "Curly hair with long bangs", "Long and curly blond hair"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "ibBZoRmF_QQ_1", "video_path": "ibBZoRmF_QQ.mp4", "subtitle_path": "ibBZoRmF_QQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.63, "view_count": 25497}, {"video_id": "ibBZoRmF_QQ", "question": "On the white wall, there are three black and white paintings. The building in the left painting has round pillars with a vertical structure; the building in the middle painting has a spherical top; the building in the right painting has round pillars connected to an arch. The vertical structure is surrounded by thin long pillars connected to the arch. How are the frames of these paintings designed?", "question_wo_referring_query": "How are the frames of these paintings designed?", "candidates": ["Square black frames", "Round black frames", "Square brown frames", "Square white frames", "Round white frames"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "ibBZoRmF_QQ_2", "video_path": "ibBZoRmF_QQ.mp4", "subtitle_path": "ibBZoRmF_QQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.63, "view_count": 25497}, {"video_id": "dPsXxLyqpfs", "question": "In a document, there is a line of bold text on the right side that reads '2.4.Putting V, M, and C Together.' On the left side of the document, there is a flowchart, and a virtual pen is drawing a red circle around it. 
Which word beginning with which letter is circled in red?", "question_wo_referring_query": "Which word starting with which letter is circled in red?", "candidates": ["Word starting with letter h", "Word starting with letter r", "Word starting with letter z", "Word starting with letter a", "Word starting with letter m"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "dPsXxLyqpfs_0", "video_path": "dPsXxLyqpfs.mp4", "subtitle_path": "dPsXxLyqpfs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.38, "view_count": 12975}, {"video_id": "dPsXxLyqpfs", "question": "On the right side of a document, there is a line of bold text that says '3.4. Car Racing Dreams'. There are also images on both sides of the document. On the left side, there is an image of a red car turning a corner, and on the right side, there is a blurry image. A virtual pen is drawing a red arrow between these two images. Which image has the arrowhead drawn on it?", "question_wo_referring_query": "Which image has the arrowhead drawn on it?", "candidates": ["The blurry red car image on the right side", "The clear road image on the right side", "The blurry road image on the right side", "The image on the left side", "The blurry road image on the left side"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "dPsXxLyqpfs_1", "video_path": "dPsXxLyqpfs.mp4", "subtitle_path": "dPsXxLyqpfs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.38, "view_count": 12975}, {"video_id": "dPsXxLyqpfs", "question": "In a document, under the numbers \u201c868, 1092, 820\u201d in the third column of a data table, a red line was drawn. 
What tool was used to draw the line on the document?", "question_wo_referring_query": ", What tool was used to draw the line on the document?", "candidates": ["Black virtual pen", "Black real pen", "White virtual pen", "Red virtual pen", "White real pen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "dPsXxLyqpfs_2", "video_path": "dPsXxLyqpfs.mp4", "subtitle_path": "dPsXxLyqpfs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1122.38, "view_count": 12975}, {"video_id": "cgTGO4osPuo", "question": "The scene shows the interior of a car, with a gas pump visible outside the car window. When a woman with golden hair and wearing earrings appears in the car for the first time, what does she do?", "question_wo_referring_query": "What does the woman do?", "candidates": ["She fastens her seatbelt", "She removes her earrings", "She wipes the window", "She opens the car door", "She unbuckles her seatbelt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "cgTGO4osPuo_0", "video_path": "cgTGO4osPuo.mp4", "subtitle_path": "cgTGO4osPuo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 912.15, "view_count": 55786}, {"video_id": "cgTGO4osPuo", "question": "This is the interior space of a car. On the wall in the distance outside the car window, there is the word 'Mic'. Inside the car, there is a woman with brown hair who has picked up a brown notebook. 
What did she do when she first took out the brown notebook in the car?", "question_wo_referring_query": "What did the woman do when she first took out the brown notebook in the car?", "candidates": ["She was using a camera to film the notebook.", "She threw the notebook out of the window.", "She opened the notebook.", "She placed the notebook on the passenger seat.", "She wrote in the notebook."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "cgTGO4osPuo_1", "video_path": "cgTGO4osPuo.mp4", "subtitle_path": "cgTGO4osPuo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 912.15, "view_count": 55786}, {"video_id": "cgTGO4osPuo", "question": "In a room where a lamp emits a bright light, there is a bed with green plants on a rack. When a woman wearing a white long-sleeved shirt with round buttons and wearing an earpiece appears in the room for the first time, what does she do?", "question_wo_referring_query": "What does she do?", "candidates": ["She takes off the earpiece.", "She turns off the lamp.", "She lies down on the bed.", "She puts a sheet on the bed.", "She unties her hair."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "cgTGO4osPuo_2", "video_path": "cgTGO4osPuo.mp4", "subtitle_path": "cgTGO4osPuo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 912.15, "view_count": 55786}, {"video_id": "9BHG6i9oH2A", "question": "In a train station with a blue roof, a train is moving on the tracks. The train has a white body with blue stripes, and the lights on top emit a yellow glow. At the bottom of the screen, there are white words indicating 'Diesel car'. 
When the subtitle mentions 'has around 440 diesel cars the CO2', what happens?", "question_wo_referring_query": "What happens?", "candidates": ["A person jumps off the train", "The train moves towards the camera", "The train stops on the tracks", "A person is climbing on the train", "The train moves away from the camera"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "9BHG6i9oH2A_0", "video_path": "9BHG6i9oH2A.mp4", "subtitle_path": "9BHG6i9oH2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.0, "view_count": 45422}, {"video_id": "9BHG6i9oH2A", "question": "In a room, there is a display cabinet behind with model trains, and a woman wearing a gray coat with long olive hair is sitting in front of the display cabinet. The purple text on the display cabinet in front of her shows 'Cathy Cat'. When the subtitle mentions 'J. R East hydrogen hybrid train does this', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The woman on the screen stands up.", "The woman on the screen is brushing her hair.", "The woman on the screen picks up a model train.", "The woman on the screen is clapping.", "The woman on the screen covers her face with her hand."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "9BHG6i9oH2A_1", "video_path": "9BHG6i9oH2A.mp4", "subtitle_path": "9BHG6i9oH2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.0, "view_count": 45422}, {"video_id": "9BHG6i9oH2A", "question": "In a gorge filled with trees, rapid river waters flow at the bottom of the valley, and a red steel bridge spans across it. A train is running on the bridge. 
When the subtitle 'entire line opened on May 26th' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A landslide occurs", "The train heads towards the tunnel entrance", "The train stops at the center of the bridge", "The train is moving away from the tunnel entrance", "The bridge collapses"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "9BHG6i9oH2A_2", "video_path": "9BHG6i9oH2A.mp4", "subtitle_path": "9BHG6i9oH2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.0, "view_count": 45422}, {"video_id": "gduGjU0hx24", "question": "In the video, there is an iron pot, and someone is adding sweet chili sauce to it. In the top left corner of the screen, there is white text that says 'Sweet Chili Sauce-100 ml'. After adding the sweet chili sauce to the pot, what needs to be done?", "question_wo_referring_query": "After adding the sweet chili sauce to the pot, what needs to be done?", "candidates": ["Need to add 30 ml of soy sauce to the pot", "Need to add 30 ml of vegetable oil to the pot", "Need to add 100 ml of soy sauce to the pot", "Need to add 30 ml of vinegar to the pot", "Need to add 30 ml of oil to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "gduGjU0hx24_0", "video_path": "gduGjU0hx24.mp4", "subtitle_path": "gduGjU0hx24_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1339.3, "view_count": 951265}, {"video_id": "gduGjU0hx24", "question": "On a wooden cutting board, there is a bowl containing sliced meat. A person is pouring vegetable oil over the meat. In the upper left corner of the screen, there is a white text displaying 'Vegetable oil - 5 tbsp.' 
After adding the vegetable oil to the bowl, what needs to be done next?", "question_wo_referring_query": "After adding the vegetable oil to the bowl, what needs to be done next?", "candidates": ["Need to add chicken essence to the bowl", "Need to add MSG to the bowl", "Need to add starch to the bowl", "Need to add salt to the bowl", "Need to add chili powder to the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "gduGjU0hx24_1", "video_path": "gduGjU0hx24.mp4", "subtitle_path": "gduGjU0hx24_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1339.3, "view_count": 951265}, {"video_id": "gduGjU0hx24", "question": "On a wooden cutting board, there is a bowl containing pieces of chicken. A person uses chopsticks to sprinkle salt into the bowl. In the top left corner of the screen, the word 'Salt' appears in white. After adding salt to the bowl, what needs to be done?", "question_wo_referring_query": "What needs to be done after adding salt to the bowl?", "candidates": ["Need to add black pepper to the bowl", "Need to add green peppers to the bowl", "Need to add chili peppers to the bowl", "Need to add chocolate to the bowl", "Need to add vinegar to the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "gduGjU0hx24_2", "video_path": "gduGjU0hx24.mp4", "subtitle_path": "gduGjU0hx24_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1339.3, "view_count": 951265}, {"video_id": "QG6-B_EFmxI", "question": "A person is rinsing wheat in a bowl under running water. 
What happened before 'Wheat' is mentioned in the subtitles?", "question_wo_referring_query": "What happened before that?", "candidates": ["A person was chopping garlic", "A group of children were gathered at the dining table", "A goat was grazing on grass", "A person was chopping a green pepper", "A person picked a pumpkin"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "QG6-B_EFmxI_0", "video_path": "QG6-B_EFmxI.mp4", "subtitle_path": "QG6-B_EFmxI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1162.87, "view_count": 2363748}, {"video_id": "QG6-B_EFmxI", "question": "On a wooden board, a person is cutting garlic with a small knife. After 'Garlic' is mentioned in the subtitles, what happens next?", "question_wo_referring_query": ", what happens next?", "candidates": ["A person rinses wheat with running water", "A person cuts open a chicken's belly with a knife", "Picking fruit from a tree", "Rinsing a pumpkin with running water", "Rinsing wheat with running water"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "QG6-B_EFmxI_1", "video_path": "QG6-B_EFmxI.mp4", "subtitle_path": "QG6-B_EFmxI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1162.87, "view_count": 2363748}, {"video_id": "QG6-B_EFmxI", "question": "On a wooden table with ingredients like tomatoes, a person wearing gray pants is cleaning green peppers. 
What event occurs after the subtitle 'Pepper' appears?", "question_wo_referring_query": "What event occurs?", "candidates": ["Placing squash into a bowl", "Removing feathers from a chicken", "Picking fruit from the tree", "A group of children gather around the dining table to eat", "A person cuts chicken into small pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "QG6-B_EFmxI_2", "video_path": "QG6-B_EFmxI.mp4", "subtitle_path": "QG6-B_EFmxI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1162.87, "view_count": 2363748}, {"video_id": "SPOqoI0zOPQ", "question": "A man wearing sunglasses and sporting a beard is walking on the street. A red 'x' appears on the right side of his body. After the subtitle mentions 'so what we're gonna do is we're gonna', who is the person that appears?", "question_wo_referring_query": "Who is the person that appears?", "candidates": ["A woman wearing a white coat", "A clean-shaven man without glasses, wearing a black suit, enclosed in a green-lined frame", "A man wearing black-rimmed glasses and a white shirt, standing in front of a yellow background", "A man with a beard and glasses, wearing a black suit, enclosed in a green-lined frame", "A man with a beard and glasses, wearing a white suit, enclosed in a green-lined frame"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "SPOqoI0zOPQ_0", "video_path": "SPOqoI0zOPQ.mp4", "subtitle_path": "SPOqoI0zOPQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1650.76, "view_count": 16591}, {"video_id": "SPOqoI0zOPQ", "question": "In the video, a man wearing sunglasses with a beard and dressed in black is shown on the screen. The center of the screen displays a game scene where a purple ball is resting on a green grid. Below, in a red background, the white text 'TEST TASK Hide and Seek' appears. 
Who appears after the subtitle mentions 'and they have to fulfill various goals'?", "question_wo_referring_query": "Who appears after the subtitle?", "candidates": ["A woman in front of a yellow background, wearing black-framed glasses and a white top", "A man with yellow curly hair and a black beard, wearing a black short-sleeve outer garment", "A man wearing a black suit and white shirt", "A woman wearing a black suit and white shirt", "A man with black curly hair and a black beard, wearing a black short-sleeve outer garment"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "SPOqoI0zOPQ_1", "video_path": "SPOqoI0zOPQ.mp4", "subtitle_path": "SPOqoI0zOPQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1650.76, "view_count": 16591}, {"video_id": "SPOqoI0zOPQ", "question": "In a green background, there is a man wearing sunglasses and black clothes. A red aiming lens moves back and forth on the screen. 
After the subtitle mentions 'so there\u2019s apparently this video here', who appears?", "question_wo_referring_query": "Who appears?", "candidates": ["A person wearing red headphones, holding a teacup in the left hand, and waving towards the computer with the right hand", "A woman in a black suit and white shirt", "A man with black curly hair wearing gray-blue pants", "A person wearing red headphones, holding a teacup in the right hand, and waving towards the computer with the left hand", "A man in a black suit and white shirt"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "SPOqoI0zOPQ_2", "video_path": "SPOqoI0zOPQ.mp4", "subtitle_path": "SPOqoI0zOPQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1650.76, "view_count": 16591}, {"video_id": "sywv1X9Kqwo", "question": "In a kitchen, a man wearing a white short-sleeve shirt with tattoos on his arms places a steel pan on a wooden counter. On the wooden counter, there are also two cans and a bowl of greens. Which of the following subtitles has appeared together with this man?", "question_wo_referring_query": "Which of the following subtitles has appeared together with this man?", "candidates": ["what you\u2018re looking for you know crunch", "in order to see the color more closely from one side", "confidently quickly withdrawing from the oven", "yeast dissolves in a small amount of water", "Once you start you tend to have more time to bake slowly"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "sywv1X9Kqwo_0", "video_path": "sywv1X9Kqwo.mp4", "subtitle_path": "sywv1X9Kqwo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1142.72, "view_count": 242746}, {"video_id": "sywv1X9Kqwo", "question": "In a bright room, there is a man wearing a white shirt with tattoos on his arms. 
Next to him is a green oven with flames flickering inside. In what subtitles did this oven and the man appear together?", "question_wo_referring_query": "In what subtitles did this oven and the man appear together?", "candidates": ["okay, let\u2019s give it a try", "before you start turning it and you see", "it really melted", "your friend, the oven version is also great", "here is mine"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "sywv1X9Kqwo_1", "video_path": "sywv1X9Kqwo.mp4", "subtitle_path": "sywv1X9Kqwo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1142.72, "view_count": 242746}, {"video_id": "sywv1X9Kqwo", "question": "On a wooden countertop in a kitchen, there is a steel pot, a can, and a bowl of vegetables. A tattooed man is opening the can. With which subtitles did this can appear together?", "question_wo_referring_query": "With which subtitles did this can appear together?", "candidates": ["that needs a longer time", "nice and hot and i turn the flame down", "prepare other ingredients", "we wanted this to come up to temperature", "we have three types of classic custard here"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "sywv1X9Kqwo_2", "video_path": "sywv1X9Kqwo.mp4", "subtitle_path": "sywv1X9Kqwo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1142.72, "view_count": 242746}, {"video_id": "hDQefy5WADM", "question": "In a black background, there is a gray sphere in the center. A white object is revolving around it along a white line. Around the sphere is represented a gravitational field using lines. 
What changes occur when this gray sphere and the word 'MASS CONCENTRATIONS [MASCONS]' appear simultaneously?", "question_wo_referring_query": "What changes occur when this gray sphere and the word 'MASS CONCENTRATIONS [MASCONS]' appear simultaneously?", "candidates": ["A part of the gray sphere turns cyan", "A part of the gray sphere turns red", "A part of the gray sphere turns purple", "A part of the gray sphere turns blue", "A part of the gray sphere turns green"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "hDQefy5WADM_0", "video_path": "hDQefy5WADM.mp4", "subtitle_path": "hDQefy5WADM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1076.0, "view_count": 90753}, {"video_id": "hDQefy5WADM", "question": "On the street in front of a white building, three men in suits and ties are walking. When the man in the middle appears in a frame while giving a speech and looks forward, what change occurs to him?", "question_wo_referring_query": ", what change occurs to this man?", "candidates": ["The man has some speech notes in his hand and is giving a speech.", "The man has changed into a striped suit.", "The man has changed into a braided shirt.", "The man is wearing a pair of glasses.", "The man has changed into a white coat."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "hDQefy5WADM_1", "video_path": "hDQefy5WADM.mp4", "subtitle_path": "hDQefy5WADM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1076.0, "view_count": 90753}, {"video_id": "hDQefy5WADM", "question": "There is a picture of a globe against a black background, with three people in front of the globe wearing American spacesuits. The person on the far left has the name 'NEIL ARMSTRO' written on the chest. 
When this man appears in a box in the black background, does he experience a voice change?", "question_wo_referring_query": "Does this man experience a voice change?", "candidates": ["He covered his eyes with his hand", "He put on black eyeglasses", "He closed his eyes", "He put on glasses", "He put his hand on his face"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "hDQefy5WADM_2", "video_path": "hDQefy5WADM.mp4", "subtitle_path": "hDQefy5WADM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1076.0, "view_count": 90753}, {"video_id": "GzjT7OhdvOo", "question": "The screen shows a woman wearing a red hairpin and glasses. She has face masks on both cheeks. On the left side of the screen, there is the text 'TEETH WHITENING'. When the subtitle mentions 'the only white washing you will ever see', what change occurs to her?", "question_wo_referring_query": "What change occurs to her?", "candidates": ["Her glasses disappear.", "She removes the face mask from her left cheek.", "Her eyes change from open to closed.", "She removes the face mask from her right cheek.", "She removes both face masks."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "GzjT7OhdvOo_0", "video_path": "GzjT7OhdvOo.mp4", "subtitle_path": "GzjT7OhdvOo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1196.46, "view_count": 304284}, {"video_id": "GzjT7OhdvOo", "question": "In a room with green walls, there are some cloth dolls on the shelf in the corner of the wall. In front of the shelf, a woman wearing a pentagram-shaped headband is pressing her palms together. 
When the subtitle mentions 'button down with a tank top underneath,' what change occurs with her?", "question_wo_referring_query": "What change occurs with her?", "candidates": ["She carries a black bag", "She puts on a hat", "She changes into a black tank top", "She puts on white shorts", "She puts on a blue coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "GzjT7OhdvOo_1", "video_path": "GzjT7OhdvOo.mp4", "subtitle_path": "GzjT7OhdvOo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1196.46, "view_count": 304284}, {"video_id": "GzjT7OhdvOo", "question": "In a room with a bookshelf full of books, a woman is sitting in front of the bookshelf. She is wearing a fishnet cardigan and a nose ring. She is playing with her hair with her hand. When the subtitle mentions 'like I\u2019m accomplishing something but', what change occurs to this woman?", "question_wo_referring_query": "What change occurs to this woman?", "candidates": ["She has put on a red hair clip.", "There is a smartphone in her left hand.", "The woman has a smartphone with a dark screen in her hand.", "The woman has a smartphone with a bright screen in her hand.", "She has changed into a black coat."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "GzjT7OhdvOo_2", "video_path": "GzjT7OhdvOo.mp4", "subtitle_path": "GzjT7OhdvOo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1196.46, "view_count": 304284}, {"video_id": "WF7F4GQ1J0w", "question": "In a forest, there is a man wearing a blue coat and a hooded cape. He is clasping his hands together. What items are present in this scene?", "question_wo_referring_query": "In a forest, there is a man wearing a blue coat and a hooded cape. He is clasping his hands together. 
What items are present in this scene?", "candidates": ["earphones", "necklace", "watch", "blue feathered garment", "duck-tongue hat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "WF7F4GQ1J0w_0", "video_path": "WF7F4GQ1J0w.mp4", "subtitle_path": "WF7F4GQ1J0w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 957.17, "view_count": 63420}, {"video_id": "WF7F4GQ1J0w", "question": "A man wearing a white long sleeve shirt and earphones is sitting on a black sofa in front of the window. His palms are pressed together, and there is a green plant on his left side. What items are present in this scene?", "question_wo_referring_query": "What items are present in this scene?", "candidates": ["water glass", "in-ear earphones", "table lamp", "MacBook", "lighter"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "WF7F4GQ1J0w_1", "video_path": "WF7F4GQ1J0w.mp4", "subtitle_path": "WF7F4GQ1J0w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 957.17, "view_count": 63420}, {"video_id": "WF7F4GQ1J0w", "question": "In the sunny outdoor, a person with white hair and wearing a white gown is talking to a man who has his hands in his pockets and his upper body exposed. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["watch", "earphones", "car", "necklace", "hair clip"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "WF7F4GQ1J0w_2", "video_path": "WF7F4GQ1J0w.mp4", "subtitle_path": "WF7F4GQ1J0w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 957.17, "view_count": 63420}, {"video_id": "jw_MlQ9VGyY", "question": "The screen has a purple background, and there is a man wearing glasses and a shirt explaining in front of it. 
In the upper right corner of the screen, there\u2019s an image of meat with three green spots on it. When the subtitle mentions \u201cmight have bacteria on the surface,\u201d what item is present in the scene?", "question_wo_referring_query": "What item is present in the scene?", "candidates": ["Iron pan", "Watch", "Microphone on the shirt", "Earring", "Necklace"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "jw_MlQ9VGyY_0", "video_path": "jw_MlQ9VGyY.mp4", "subtitle_path": "jw_MlQ9VGyY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1365.62, "view_count": 110005}, {"video_id": "jw_MlQ9VGyY", "question": "In a black background with the white text 'DOES MICROWAVING FOOD DESTROY ITS VITAMINS?' displayed in the center, what object is present on the screen when the subtitle mentions 'might have heard the microwave is a'?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Iron pot", "Microscope", "Glasses", "Magnifying glass", "Microwave"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "jw_MlQ9VGyY_1", "video_path": "jw_MlQ9VGyY.mp4", "subtitle_path": "jw_MlQ9VGyY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1365.62, "view_count": 110005}, {"video_id": "jw_MlQ9VGyY", "question": "In a purple background, there is a woman with long hair wearing a short-sleeved black shirt at the center. In the top left corner of the screen, there are the words 'BOILING LED TO THE LOSS OF 33% OF ITS VITAMIN C'. 
When the subtitle mentions '2009 boiling the broccoli led to the', what item is present in the scene?", "question_wo_referring_query": "What item is present in the scene?", "candidates": ["shirt", "ring", "hair clip", "glasses", "hat"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "jw_MlQ9VGyY_2", "video_path": "jw_MlQ9VGyY.mp4", "subtitle_path": "jw_MlQ9VGyY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1365.62, "view_count": 110005}, {"video_id": "7AeMhVN-TFA", "question": "On a dry grassy hillside, a woman with golden hair, wearing a gray coat and a scarf, stands in the center of the screen with her hands on her hips. What kind of scarf is this woman wearing?", "question_wo_referring_query": "What kind of scarf is this woman wearing?", "candidates": ["Blue knitted scarf", "Green silk scarf", "Black knitted scarf", "Green knitted scarf", "Olive knitted scarf"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "7AeMhVN-TFA_0", "video_path": "7AeMhVN-TFA.mp4", "subtitle_path": "7AeMhVN-TFA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.59, "view_count": 4371804}, {"video_id": "7AeMhVN-TFA", "question": "On the desk in front of the window, there is a computer. A woman is writing with a pen in her left hand. 
What is the style of this woman's hair?", "question_wo_referring_query": "What is the style of this woman's hair?", "candidates": ["Long curly black hair", "Short curly black hair", "Long straight blonde hair", "Short curly blonde hair", "Long curly blonde hair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "7AeMhVN-TFA_1", "video_path": "7AeMhVN-TFA.mp4", "subtitle_path": "7AeMhVN-TFA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.59, "view_count": 4371804}, {"video_id": "7AeMhVN-TFA", "question": "On a wooden table, there's a green bowl. A hand is using a spoon to add something into the green bowl. What kind of spoon is this?", "question_wo_referring_query": "What kind of spoon is this?", "candidates": ["golden metal spoon", "wooden spoon", "golden plastic spoon", "green plastic spoon", "white metal spoon"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "7AeMhVN-TFA_2", "video_path": "7AeMhVN-TFA.mp4", "subtitle_path": "7AeMhVN-TFA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.59, "view_count": 4371804}, {"video_id": "BRlZ4CgytAc", "question": "Against an urban backdrop, a man wearing a blue coat and a hat is explaining something. His right hand is raised to his chest level. 
When the subtitles mention 'that Bernie Moreno is the best opponent,' what kind of hat is this man wearing?", "question_wo_referring_query": "What kind of hat is this man wearing?", "candidates": ["black cowboy hat", "black peaked cap", "black hat", "blue hat", "black graduation cap"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "BRlZ4CgytAc_0", "video_path": "BRlZ4CgytAc.mp4", "subtitle_path": "BRlZ4CgytAc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3104.77, "view_count": 368486}, {"video_id": "BRlZ4CgytAc", "question": "In the frame, a yellow line crosses the screen. At the center of the screen is a photo of a man, who is wearing a brown coat and a white inner garment. When the subtitle 'he was paroled police say this man' appears, what kind of hair does the man have?", "question_wo_referring_query": "What kind of hair does the man have?", "candidates": ["black long straight hair", "black afro", "blonde mohawk", "blonde short hair", "black mohawk"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "BRlZ4CgytAc_1", "video_path": "BRlZ4CgytAc.mp4", "subtitle_path": "BRlZ4CgytAc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3104.77, "view_count": 368486}, {"video_id": "BRlZ4CgytAc", "question": "In the studio, there is a brunette female host with her eyes tightly closed. The screen next to her displays a building. 
When the subtitle mentions 'tech companies in China and Taiwan,' what is the woman wearing?", "question_wo_referring_query": "What is the woman wearing?", "candidates": ["Black suit", "Black polka dot blouse", "White polka dot dress", "White polka dot blouse", "White knit shirt"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "BRlZ4CgytAc_2", "video_path": "BRlZ4CgytAc.mp4", "subtitle_path": "BRlZ4CgytAc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3104.77, "view_count": 368486}, {"video_id": "HPxh7kk_hE4", "question": "Beside a wall made of wooden logs, there is an oven made of clay and bricks. The oven is burning with a fierce flame, and someone is adding wood to the oven. Who is adding the wood to the oven?", "question_wo_referring_query": "Who is adding the wood to the oven?", "candidates": ["A man wearing a gray short-sleeve shirt", "A man wearing a black long robe with rolled-up sleeves", "A woman wearing a black short-sleeve shirt", "A woman wearing a black long robe with rolled-up sleeves", "A man wearing a black short-sleeve shirt"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "HPxh7kk_hE4_0", "video_path": "HPxh7kk_hE4.mp4", "subtitle_path": "HPxh7kk_hE4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.93, "view_count": 1642432}, {"video_id": "HPxh7kk_hE4", "question": "In the grassland surrounded by mountains, there is a small stone-formed fountain. 
Who is the person flipping the ingredients inside the pot in the shed next to the fountain?", "question_wo_referring_query": "Who is the person flipping the ingredients?", "candidates": ["A man with short hair wearing a white long-sleeved coat", "A woman with short hair wearing a black long-sleeved coat", "A woman with long hair wearing a black long-sleeved coat", "A man with long hair wearing a black long-sleeved coat", "A man with short hair wearing a black long-sleeved coat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "HPxh7kk_hE4_1", "video_path": "HPxh7kk_hE4.mp4", "subtitle_path": "HPxh7kk_hE4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.93, "view_count": 1642432}, {"video_id": "HPxh7kk_hE4", "question": "On a piece of wood board, someone uses a knife to cut through a material in the middle. Behind the board, a small red flower is blooming. What is the food material being cut on the board?", "question_wo_referring_query": "What is the food material being cut on the board?", "candidates": ["Yellow melon", "Zucchini", "Pumpkin", "Green pepper", "Eggplant"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "HPxh7kk_hE4_2", "video_path": "HPxh7kk_hE4.mp4", "subtitle_path": "HPxh7kk_hE4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.93, "view_count": 1642432}, {"video_id": "474PYGhSuis", "question": "In a small, narrow space, there is a refrigerator. At the bottom of the screen, there is also the text 'and I love the updated appliances despite the entire theme and layout being.' 
When the refrigerator appears for the first time, what happens to it?", "question_wo_referring_query": "What happened to it?", "candidates": ["Someone took it away", "Someone stuck a sticker on it", "It was opened", "Someone smashed it", "Its handle fell off"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "474PYGhSuis_0", "video_path": "474PYGhSuis.mp4", "subtitle_path": "474PYGhSuis_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1832.83, "view_count": 392805}, {"video_id": "474PYGhSuis", "question": "There is a red sofa placed against a white wall, with two people sitting on it: one is wearing a pink coat and playing with a mobile phone, and the other is wearing a colorful coat. When the person in the colorful coat appears on the sofa for the first time, what did she do?", "question_wo_referring_query": "what did she do?", "candidates": ["She fell to the ground", "She rolled on the sofa", "She lay down on the sofa", "She leaned back against the sofa backrest", "She left the sofa"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "474PYGhSuis_1", "video_path": "474PYGhSuis.mp4", "subtitle_path": "474PYGhSuis_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1832.83, "view_count": 392805}, {"video_id": "474PYGhSuis", "question": "A poster of a cartoon character is hanging on a wall. A woman wearing glasses and a pink coat stands in front of the poster looking distressed. 
What did the woman do the first time she appeared in front of the poster?", "question_wo_referring_query": "What did she do?", "candidates": ["She put up a new poster on the wall", "She painted graffiti on the wall", "She took off her coat", "She tore the poster", "She took the poster down from the wall"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "474PYGhSuis_2", "video_path": "474PYGhSuis.mp4", "subtitle_path": "474PYGhSuis_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1832.83, "view_count": 392805}, {"video_id": "dPTCxiZ6h00", "question": "On the wall of a room, there are some pictures hanging. On the bookshelf beside the pictures, there are three layers of books. A person with long brown hair, wearing a black coat, is standing with their back to the camera in front of the bookshelf. What did this woman do when the subtitle 'foreign \u3010music\u3011' appeared?", "question_wo_referring_query": "What did this woman do?", "candidates": ["She took a book from the bookshelf.", "This woman was wiping the bookshelf.", "She placed a new book on the bookshelf.", "She pushed the bookshelf.", "She put on a coat."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "dPTCxiZ6h00_0", "video_path": "dPTCxiZ6h00.mp4", "subtitle_path": "dPTCxiZ6h00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.21, "view_count": 47954}, {"video_id": "dPTCxiZ6h00", "question": "In a bright room, a woman with long hair wearing black clothing is bent over facing a mirror. The door behind her is open. 
When the subtitle mentions 'going to take her for two nights we\u2018re', what action does she take?", "question_wo_referring_query": "What action does she take?", "candidates": ["She taps on the table with her hand", "She stands up straight", "She covers her face with her hand", "She combs her hair with both hands", "She wipes the table"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "dPTCxiZ6h00_1", "video_path": "dPTCxiZ6h00.mp4", "subtitle_path": "dPTCxiZ6h00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.21, "view_count": 47954}, {"video_id": "dPTCxiZ6h00", "question": "Under a golden picture frame, there is a gray sofa. A woman with long straight brown hair, dressed in black, is sitting on the sofa with a cup in her left hand. When the subtitle says 'peaceful and calm Vibe I am going to,' what action does she take?", "question_wo_referring_query": "What action does she take?", "candidates": ["She places the cup on the coffee table.", "She lies down on the sofa.", "She stands up from the sofa.", "She places her right hand on the cup.", "She adds water to the cup."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "dPTCxiZ6h00_2", "video_path": "dPTCxiZ6h00.mp4", "subtitle_path": "dPTCxiZ6h00_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.21, "view_count": 47954}, {"video_id": "-PgbM2MhcYQ", "question": "The entire screen is divided into three different sections. The two sections on the right each show one person, while the section on the left shows two people. In the left section, a man wearing a black suit and a pink coat is looking up at the camera. 
After he looks at the camera, what does this man do next?", "question_wo_referring_query": "What does this man do next?", "candidates": ["He bent over the desk", "He stood up", "He looked at his companion beside him", "He held his forehead with his hand", "He took off his glasses"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "-PgbM2MhcYQ_0", "video_path": "-PgbM2MhcYQ.mp4", "subtitle_path": "-PgbM2MhcYQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2987.08, "view_count": 28042}, {"video_id": "-PgbM2MhcYQ", "question": "A man with glasses, black hair, and wearing a white striped shirt is sitting in front of a plain white wall. He mentions India's space program. What event is mentioned after discussing the space program?", "question_wo_referring_query": "What event is mentioned after discussing the space program?", "candidates": ["China's technology", "The U.S. Artemis program", "Modi visits the United States", "The African Space Agency", "Chandrayaan costs $75 million"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "-PgbM2MhcYQ_1", "video_path": "-PgbM2MhcYQ.mp4", "subtitle_path": "-PgbM2MhcYQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2987.08, "view_count": 28042}, {"video_id": "-PgbM2MhcYQ", "question": "The screen is divided into three different sections. There are two people in the left section looking at each other. The right sections each show an individual. On the right side, there's a woman wearing glasses and a yellow shirt. 
After she raises her hands from the table, what action does she take?", "question_wo_referring_query": "What action does she take after?", "candidates": ["She turned her head to look at the mirror.", "She turned off the camera.", "She touched her hair.", "She picked up a glass of water.", "She put on her glasses."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "-PgbM2MhcYQ_2", "video_path": "-PgbM2MhcYQ.mp4", "subtitle_path": "-PgbM2MhcYQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2987.08, "view_count": 28042}, {"video_id": "k2A68mGPurA", "question": "The screen shows a part of a map, and in the red area, there is the word 'Neotropic.' After mentioning the Neotropical region, which area is mentioned first?", "question_wo_referring_query": "Which area is mentioned first?", "candidates": ["India", "Australia", "Mexico", "Nearctic", "Southeast Asia"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "k2A68mGPurA_0", "video_path": "k2A68mGPurA.mp4", "subtitle_path": "k2A68mGPurA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1339.88, "view_count": 366064}, {"video_id": "k2A68mGPurA", "question": "On a snowy background, there are two animals. On the right-side animal, the red word 'Mammals' is written. 
What is the last red word to appear before 'Mammals' is mentioned?", "question_wo_referring_query": "What is the last red word to appear before 'Mammals' is mentioned?", "candidates": ["Nearctic", "Therapsids", "Neotropic", "Cynodonts", "Eucynodonts"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "k2A68mGPurA_1", "video_path": "k2A68mGPurA.mp4", "subtitle_path": "k2A68mGPurA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1339.88, "view_count": 366064}, {"video_id": "k2A68mGPurA", "question": "On a piece of sand, a kangaroo rat is shaking its brain bag facing the camera. What is the first red text that appears after this scene?", "question_wo_referring_query": "What is the first red text that appears after this scene?", "candidates": ["Ostrich", "Emu", "Rhea", "Human", "Cynodonts"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "k2A68mGPurA_2", "video_path": "k2A68mGPurA.mp4", "subtitle_path": "k2A68mGPurA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1339.88, "view_count": 366064}, {"video_id": "8gs867LUtBw", "question": "There is a wooden board on the screen, on which there is a knife, and a pair of hands are handling some food. What is the food that is being prepared?", "question_wo_referring_query": "What is the food that is being prepared?", "candidates": ["Corn", "Onion", "Garlic", "Carrot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "8gs867LUtBw_0", "video_path": "8gs867LUtBw.mp4", "subtitle_path": "8gs867LUtBw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1353.29, "view_count": 8911016}, {"video_id": "8gs867LUtBw", "question": "There is a table in the screen. On the table, there is a wooden board. Various kinds of ingredients are placed next to the wooden board. 
In the front, there is a rod tied up with a string. There is a pair of hands cutting vegetables on the wooden board. What is the vegetable being cut?", "question_wo_referring_query": "What is the vegetable being cut?", "candidates": ["Garlic", "Onion", "Mushroom", "Tomato"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "8gs867LUtBw_1", "video_path": "8gs867LUtBw.mp4", "subtitle_path": "8gs867LUtBw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1353.29, "view_count": 8911016}, {"video_id": "8gs867LUtBw", "question": "On the screen, there is a wooden table with a wooden board on top of it. In front of the wooden board, there is a wooden round plate with a chopping board on it. A pair of hands is holding a kitchen knife, processing some food. What is the food being processed?", "question_wo_referring_query": "What is the food being processed?", "candidates": ["Tomato", "Mushroom", "Meat", "Garlic", "Leek"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "8gs867LUtBw_2", "video_path": "8gs867LUtBw.mp4", "subtitle_path": "8gs867LUtBw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1353.29, "view_count": 8911016}, {"video_id": "f4TmQEZzsec", "question": "In the scene, there is a man in an orange shirt against a black background. To the upper left of the man in the orange shirt, there is another man wearing red shorts and a dark shirt. 
What is the man in red shorts and a dark shirt doing when he first appears?", "question_wo_referring_query": "What is the man in red shorts and a dark shirt doing when he first appears?", "candidates": ["Walking while holding the skateboard", "Standing next to the skateboard", "Fell off the skateboard", "Sitting and resting on the skateboard"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "f4TmQEZzsec_0", "video_path": "f4TmQEZzsec.mp4", "subtitle_path": "f4TmQEZzsec_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1106.08, "view_count": 3817702}, {"video_id": "f4TmQEZzsec", "question": "In the scene with a black background, when the black man in a short-sleeved black and red checkered shirt appears, what does he do with his hands?", "question_wo_referring_query": "What does he do with his hands?", "candidates": ["Both hands clasped together", "Both hands open wide", "Both arms extend to the left, hands clasped together with fingers spread open", "Both hands crossed over chest"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "f4TmQEZzsec_1", "video_path": "f4TmQEZzsec.mp4", "subtitle_path": "f4TmQEZzsec_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1106.08, "view_count": 3817702}, {"video_id": "f4TmQEZzsec", "question": "After the man with the black short sleeves and curly hair appears, what gesture did the woman with long, olive-colored hair and wearing a dark green top make with her hands?", "question_wo_referring_query": "What gesture did she make with her hands?", "candidates": ["Hands on waist", "Did not make any gesture", "Raised hands", "Clapped hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "f4TmQEZzsec_2", "video_path": "f4TmQEZzsec.mp4", "subtitle_path": "f4TmQEZzsec_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 1106.08, "view_count": 3817702}, {"video_id": "TFiZYA_JfJs", "question": "In the scene, there is a man wearing black clothes with a watch on his left hand in a room. There is a bed behind him. What does the man do when the subtitles mention 'it is very successful even without a'?", "question_wo_referring_query": "What does the man do?", "candidates": ["Walks", "Eats food", "Speaks while holding a microphone in his right hand", "Drinks water"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "TFiZYA_JfJs_0", "video_path": "TFiZYA_JfJs.mp4", "subtitle_path": "TFiZYA_JfJs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1004.37, "view_count": 4721}, {"video_id": "TFiZYA_JfJs", "question": "In the video, there is a man wearing black clothes, with a watch on his left hand, in a room. There is a bed behind him. What action does the man take when the subtitle mentions 'this will so what it will return is like'?", "question_wo_referring_query": "What action does the man take?", "candidates": ["Spreading his hands apart", "Holding a microphone with his left hand and speaking", "Clenching his fists", "Running on a treadmill"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "TFiZYA_JfJs_1", "video_path": "TFiZYA_JfJs.mp4", "subtitle_path": "TFiZYA_JfJs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1004.37, "view_count": 4721}, {"video_id": "TFiZYA_JfJs", "question": "On the screen, there is a person on a road with a few trees planted on the right side of the road, and a few cars parked beside the trees. 
When the subtitle mentions 'those good things incorporate them into,' what action does the person wearing a blue outfit on the right side of the screen take?", "question_wo_referring_query": "What action does the person wearing a blue outfit on the right side take?", "candidates": ["Walking", "Riding a bicycle", "Running", "Holding a phone and talking"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "TFiZYA_JfJs_2", "video_path": "TFiZYA_JfJs.mp4", "subtitle_path": "TFiZYA_JfJs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1004.37, "view_count": 4721}, {"video_id": "swiiVM64g6Y", "question": "In a purely black background on the screen, there is a man wearing red clothes. His name is Barbs, and he is introducing himself. What did Barbs do before he started his self-introduction?", "question_wo_referring_query": "What did Barbs do before he started his self-introduction?", "candidates": ["Raised his hands", "Crossed his arms in front of his chest", "Clasped his hands together in front of him", "Held a white piece of clothing with both hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "swiiVM64g6Y_0", "video_path": "swiiVM64g6Y.mp4", "subtitle_path": "swiiVM64g6Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1997.47, "view_count": 1525453}, {"video_id": "swiiVM64g6Y", "question": "In the scene, there are three people. On the left is a man wearing blue clothes looking to the left. In the middle is a man wearing red clothes with his arms crossed in front of his chest. On the right is a blonde woman wearing a floral outfit with black clothes underneath. Her arms are also crossed in front of her chest. She is wearing a bracelet on her left hand. 
After the three of them discuss what Spanish culture means to them, what action does the blonde woman take?", "question_wo_referring_query": "What action does the blonde woman take?", "candidates": ["Crosses her arms in front of her chest", "Raises a tray", "Raises her hands halfway", "Clenches her fists"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "swiiVM64g6Y_1", "video_path": "swiiVM64g6Y.mp4", "subtitle_path": "swiiVM64g6Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1997.47, "view_count": 1525453}, {"video_id": "swiiVM64g6Y", "question": "In the scene, there is a man with long hair wearing black clothes and holding a red guitar. After introducing the modern classical guitar, what does the man do?", "question_wo_referring_query": "What does the man do after introducing the modern classical guitar?", "candidates": ["He holds a plate with both hands.", "He raises both hands up.", "He holds a white piece of clothing with both hands.", "He raises his left hand to touch his nose."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "swiiVM64g6Y_2", "video_path": "swiiVM64g6Y.mp4", "subtitle_path": "swiiVM64g6Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1997.47, "view_count": 1525453}, {"video_id": "JJjPIS7Ko-U", "question": "Among the characters seated in the middle of the stage in the video, who appears first: the woman wearing a beige long-sleeve jacket with khaki pants and a golden shawl, or the woman with hair in a bun wearing a black short-sleeve shirt with red floral prints and black shorts?", "question_wo_referring_query": "Which character appears first?", "candidates": ["The woman wearing a black long-sleeve jacket with white pants and a golden shawl", "The woman wearing a beige long-sleeve jacket with khaki pants and a golden shawl", "The woman with hair in a bun 
wearing a red short-sleeve shirt with black long pants", "The woman with hair in a bun wearing a black short-sleeve shirt with red floral prints and black shorts"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "JJjPIS7Ko-U_0", "video_path": "JJjPIS7Ko-U.mp4", "subtitle_path": "JJjPIS7Ko-U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1368.08, "view_count": 125055}, {"video_id": "JJjPIS7Ko-U", "question": "On the screen, the woman with blonde hair wearing a beige coat is holding a picture of someone in a black jumpsuit with white leggings and also a picture of someone in a red dance outfit with a black background. Which picture appears first?", "question_wo_referring_query": "Which picture appears first?", "candidates": ["The picture of someone in a black jumpsuit with white leggings", "The picture of someone in a black dance outfit with a black background", "The picture of someone in a red dance outfit with a black background", "Both pictures appear at the same time"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "JJjPIS7Ko-U_1", "video_path": "JJjPIS7Ko-U.mp4", "subtitle_path": "JJjPIS7Ko-U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1368.08, "view_count": 125055}, {"video_id": "JJjPIS7Ko-U", "question": "There are two chairs placed in the center of the stage, with triangular fill lights positioned behind and on the sides of the chairs. Along the edges of the stage, there is a large circular fill light. 
Among these fill lights, which one appears first?", "question_wo_referring_query": "Which one appears first?", "candidates": ["The circular fill light", "The triangular fill light in the middle", "Appear simultaneously", "The triangular fill light at the left side entrance of the stage"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "JJjPIS7Ko-U_2", "video_path": "JJjPIS7Ko-U.mp4", "subtitle_path": "JJjPIS7Ko-U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1368.08, "view_count": 125055}, {"video_id": "j3XpfBChLyk", "question": "In the black background, there is a man wearing a white short-sleeved shirt. In the bottom left corner of the screen, there are yellow letters saying 'Paul B. (aka Barby) host'. After the statement 'Hey everybody, I'm your host Barbs. We have reached the Land of the Rising Sun', who appears?", "question_wo_referring_query": "Who appears after the statement?", "candidates": ["A man with long hair wearing a blue short-sleeved shirt appears, along with a yellow KIX book.", "A man wearing a red jacket appears.", "A man wearing a pink shirt appears.", "A man wearing a purple short-sleeved shirt appears."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "j3XpfBChLyk_0", "video_path": "j3XpfBChLyk.mp4", "subtitle_path": "j3XpfBChLyk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.94, "view_count": 7173758}, {"video_id": "j3XpfBChLyk", "question": "Against a black background, a man wearing a white short-sleeve shirt stands with his right hand in a fist placed on his chest, and his left hand open with the palm facing up stretched out in front of the camera. In the top left corner, there is a green map of Japan. 
After the phrase 'beautiful on the outside, but potentially dangerous on the inside' is mentioned, what objects appear?", "question_wo_referring_query": "What objects appear?", "candidates": ["A night bar appears", "A commercial street during the day appears", "A white building appears", "Many snacks placed on a table appear"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "j3XpfBChLyk_1", "video_path": "j3XpfBChLyk.mp4", "subtitle_path": "j3XpfBChLyk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.94, "view_count": 7173758}, {"video_id": "j3XpfBChLyk", "question": "In a black background, a man wearing a white short-sleeved shirt has his hands clasped together. In the upper left corner, there is a person in a light purple shirt bending over in front of a red building. After mentioning 'Today, about 80% of Japanese people practice Shinto to some extent,' what object appears?", "question_wo_referring_query": "What object appears?", "candidates": ["One map piece appears", "Five map pieces appear", "Two map pieces appear", "Four green-eyed maps of different shapes appear"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "j3XpfBChLyk_2", "video_path": "j3XpfBChLyk.mp4", "subtitle_path": "j3XpfBChLyk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.94, "view_count": 7173758}, {"video_id": "9JqS6HfsYIM", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First appears a portrait composed of many people. Then, a photo of Mona Lisa's smile appears. Finally, a woman with black hair, wearing plain clothes and a pink accessory on her head, appears.", "First appears a photo of Mona Lisa's smile. 
Then, a woman with black hair, wearing plain clothes and a pink accessory on her head, appears. Finally, a portrait composed of many people appears.", "First appears a photo of Mona Lisa's smile. Then, a portrait composed of many people appears. Finally, a woman with black hair, wearing plain clothes and a pink accessory on her head, appears.", "First appears a portrait composed of many people. Then, a woman with black hair, wearing plain clothes and a pink accessory on her head, appears. Finally, a photo of Mona Lisa's smile appears."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "9JqS6HfsYIM_0", "video_path": "9JqS6HfsYIM.mp4", "subtitle_path": "9JqS6HfsYIM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.45, "view_count": 279137}, {"video_id": "9JqS6HfsYIM", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a person with white curly hair and wearing a costume consisting of blue and gold colors appears, holding a book. Then, a woman wearing a blue upper garment, holding a fan in her right hand, appears beside a person in a yellow upper garment holding an umbrella in their right hand. Finally, a naked man with blond hair and a belt around his waist appears.", "First, a naked man with blond hair and a belt around his waist appears. Then, a woman wearing a blue upper garment, holding a fan in her right hand, appears beside a person in a yellow upper garment holding an umbrella in their right hand. Finally, a person with white curly hair and wearing a costume consisting of blue and gold colors appears, holding a book.", "First, a naked man with blond hair and a belt around his waist appears. Then, a person with white curly hair and wearing a costume consisting of blue and gold colors appears, holding a book. 
Finally, a woman wearing a blue upper garment, holding a fan in her right hand, appears beside a person in a yellow upper garment holding an umbrella in their right hand.", "First, a person with white curly hair and wearing a costume consisting of blue and gold colors appears, holding a book. Then, a naked man with blond hair and a belt around his waist appears. Finally, a woman wearing a blue upper garment, holding a fan in her right hand, appears beside a person in a yellow upper garment holding an umbrella in their right hand."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "9JqS6HfsYIM_1", "video_path": "9JqS6HfsYIM.mp4", "subtitle_path": "9JqS6HfsYIM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.45, "view_count": 279137}, {"video_id": "9JqS6HfsYIM", "question": "Which sequence of scenes shown below is correct?", "question_wo_referring_query": "Which sequence of scenes shown below is correct?", "candidates": ["First, there is a scene with two paintings hanging on the wall of the exhibition hall, with four people standing in front of the paintings. Then, there is a scene in an exhibition hall with many paintings, and a yellow machine in the middle of the hall. Lastly, an elderly man with white hair, wearing a white shirt and a gray jacket, appears.", "First, there is a scene with two paintings hanging on the wall of the exhibition hall, with four people standing in front of the paintings. Then, an elderly man with white hair, wearing a white shirt and a gray jacket, appears. Lastly, there is a scene in an exhibition hall with many paintings, and a yellow machine in the middle of the hall.", "First, there is a scene in an exhibition hall with many paintings, and a yellow machine in the middle of the hall. Then, an elderly man with white hair, wearing a white shirt and a gray jacket, appears. 
Lastly, there is a scene with two paintings hanging on the wall of the exhibition hall, with four people standing in front of the paintings.", "First, there is a scene in an exhibition hall with many paintings, and a yellow machine in the middle of the hall. Then, there is a scene with two paintings hanging on the wall of the exhibition hall, with four people standing in front of the paintings. Lastly, an elderly man with white hair, wearing a white shirt and a gray jacket, appears."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "9JqS6HfsYIM_2", "video_path": "9JqS6HfsYIM.mp4", "subtitle_path": "9JqS6HfsYIM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.45, "view_count": 279137}, {"video_id": "ErJ6qaRhW-c", "question": "In a black background, a man wearing a black short-sleeve shirt is holding a cup with his right hand and pointing at this cup with his left hand. The screen also has the white text 'GeographyNow.com'. Where else has this man appeared?", "question_wo_referring_query": "Where else has this man appeared?", "candidates": ["This man appears in front of a black background with a woman wearing a purple top.", "This man appears in front of a black background with another man wearing a red short-sleeve shirt.", "In a black background, this man is standing in front of a Dutch flag with another man wearing a black short-sleeve shirt next to him. 
The other man has an English flag design on his chest.", "This man appears in front of a white window."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "ErJ6qaRhW-c_0", "video_path": "ErJ6qaRhW-c.mp4", "subtitle_path": "ErJ6qaRhW-c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1674.64, "view_count": 945110}, {"video_id": "ErJ6qaRhW-c", "question": "In front of a black background, a long-haired man wearing a black short sleeve shirt and a dark green hat has his hands outstretched and mouth open. Where else has this man appeared?", "question_wo_referring_query": "Where else has this man appeared?", "candidates": ["This man has appeared in front of a black background along with a man wearing a red short-sleeve shirt.", "This man has appeared in front of a white window.", "This man has appeared in front of a black background along with a woman wearing a purple top.", "This man has appeared in front of a black background along with a man wearing a black short-sleeve shirt with prints on its front."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "ErJ6qaRhW-c_1", "video_path": "ErJ6qaRhW-c.mp4", "subtitle_path": "ErJ6qaRhW-c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1674.64, "view_count": 945110}, {"video_id": "ErJ6qaRhW-c", "question": "In front of a black background, there is a woman wearing a gray short-sleeved shirt, with her right hand pointing a finger forward. In the upper left corner, there is an image of the Surinamese flag. 
Where else has this woman appeared?", "question_wo_referring_query": "Where else has this woman appeared?", "candidates": ["This woman appears in front of a white window.", "In a black background with an image of a white shop in the upper left corner, this woman spreads her hands on both sides.", "This woman and another woman wearing a purple top appear in front of a black background.", "This woman and a man wearing a red short-sleeved shirt appear in front of a black background."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "ErJ6qaRhW-c_2", "video_path": "ErJ6qaRhW-c.mp4", "subtitle_path": "ErJ6qaRhW-c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1674.64, "view_count": 945110}, {"video_id": "ni6NxEBCDag", "question": "In a room, a woman with black hair is wearing a black inner outfit and a black coat, along with a necklace. On the table, there is a tablet and a blue cup. With which subtitles has this woman appeared together?", "question_wo_referring_query": "With which subtitles has this woman appeared together?", "candidates": ["just yet just to show you where", "okay first things first you always want", "and then the reason why you wouldn't is", "study plan using the link in the and then on the bottom or mass of an"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "ni6NxEBCDag_0", "video_path": "ni6NxEBCDag.mp4", "subtitle_path": "ni6NxEBCDag_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2002.2, "view_count": 21677}, {"video_id": "ni6NxEBCDag", "question": "In the frame, there is a woman wearing glasses and a gray short-sleeved shirt sitting on a white chair in a predominantly white room. 
Which subtitles have appeared together with this woman?", "question_wo_referring_query": "Which subtitles have appeared together with this woman?", "candidates": ["wait we're not finding we're finding a", "yes and then I realize I didn't put it", "to think of what the formula is so what and have to do some converting right and do", "blank oh okay I think it's this one go"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "ni6NxEBCDag_1", "video_path": "ni6NxEBCDag.mp4", "subtitle_path": "ni6NxEBCDag_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2002.2, "view_count": 21677}, {"video_id": "ni6NxEBCDag", "question": "In the scene, a black-haired woman dressed in a black inner top and a dark green outer jacket is holding a calculator, while another woman wearing a gray short-sleeved shirt and glasses is holding a calculator and a white pen. Which subtitles are appearing on the screen along with these two women?", "question_wo_referring_query": "Which subtitles are appearing on the screen along with these two women?", "candidates": ["multiply at the bottom first and I'd even say I mean doesn't really", "but we redo it because now now I'm", "to put that it's an actual exponent you", "yeah so I would put nine point one one"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "ni6NxEBCDag_2", "video_path": "ni6NxEBCDag.mp4", "subtitle_path": "ni6NxEBCDag_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2002.2, "view_count": 21677}, {"video_id": "3yGI4JwZwP4", "question": "In the scene, inside a room, there are two colorful sofas, and a man in a purple short sleeve shirt is sitting behind the sofas. 
In front of the red table, what color clothes did he change into?", "question_wo_referring_query": ", what color clothes did he change into?", "candidates": ["He changed into a black bathrobe", "He changed into a white short sleeve", "He changed into a black short sleeve", "He changed into a white bathrobe"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "3yGI4JwZwP4_0", "video_path": "3yGI4JwZwP4.mp4", "subtitle_path": "3yGI4JwZwP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1467.8, "view_count": 284937}, {"video_id": "3yGI4JwZwP4", "question": "In the video, there is a man wearing a black hoodie sitting in the car. There is also another person sitting next to him. The man in the car is wearing a seatbelt. What change occurred when the man in the car took off his seatbelt?", "question_wo_referring_query": "What change occurred when the man in the car took off his seatbelt?", "candidates": ["He put on a black hat.", "He tied up his hair.", "He wore a purple wool hat.", "He put on a yellow hat."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "3yGI4JwZwP4_1", "video_path": "3yGI4JwZwP4.mp4", "subtitle_path": "3yGI4JwZwP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1467.8, "view_count": 284937}, {"video_id": "3yGI4JwZwP4", "question": "In the video, in a room with white walls, there is a man wearing a black hoodie and black pants. He is wearing a black watch on his right hand, and there is white foam on his hand. 
After the man moves near the white wall facing the camera, what changes occur to his hand?", "question_wo_referring_query": "What changes occur to his hand?", "candidates": ["It becomes more.", "It turns pink.", "The foam on his hand is washed away, and it turns into holding yellow paper.", "It turns green."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "3yGI4JwZwP4_2", "video_path": "3yGI4JwZwP4.mp4", "subtitle_path": "3yGI4JwZwP4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1467.8, "view_count": 284937}, {"video_id": "zSGul-P9P3w", "question": "In the video, there is a bell pepper, a tomato, a carrot, and a potato placed by a pool, and a person is holding a plate in their left hand and pouring the potato into the water. When the subtitle 'Potato' is mentioned, what changes occur to the form of the potato?", "question_wo_referring_query": "What changes occur to the form of the potato?", "candidates": ["It turns into thin slices", "It turns into chunks", "It turns into mashed potato", "It turns into strips"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "zSGul-P9P3w_0", "video_path": "zSGul-P9P3w.mp4", "subtitle_path": "zSGul-P9P3w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 974.48, "view_count": 5482944}, {"video_id": "zSGul-P9P3w", "question": "In the video, there are also beans, bell peppers, and red pepper placed next to the pool. Someone is holding a plate; the carrots fall into the water. 
When the subtitle 'Carrot' is mentioned, what changes occur to the carrots?", "question_wo_referring_query": "What changes happen to the carrots?", "candidates": ["They become strip-shaped", "They become round-shaped", "They become carrot juice", "They become carrot threads"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "zSGul-P9P3w_1", "video_path": "zSGul-P9P3w.mp4", "subtitle_path": "zSGul-P9P3w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 974.48, "view_count": 5482944}, {"video_id": "zSGul-P9P3w", "question": "On the screen, there are still beans placed by the pool, and there are bell peppers and hot peppers in the pool. When the subtitles \u2018Bell pepper and Hot pepper\u2019 are mentioned, what change occurs to the bell pepper?", "question_wo_referring_query": "What change occurs to the bell pepper?", "candidates": ["Turns into hot pepper sauce", "Turns into a round shape", "Turns into a thread shape", "Turns into a strip shape"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "zSGul-P9P3w_2", "video_path": "zSGul-P9P3w.mp4", "subtitle_path": "zSGul-P9P3w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 974.48, "view_count": 5482944}, {"video_id": "AJoOlCRgm84", "question": "In a dark room, a skinny person is sitting on a green chair. He is wearing a white hat and there are some objects on the table in front of him. 
When 'Music' is mentioned, what objects are present in the frame?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["A tape", "A wooden head and strings", "A hammer", "A pair of scissors"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "AJoOlCRgm84_0", "video_path": "AJoOlCRgm84.mp4", "subtitle_path": "AJoOlCRgm84_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1667.5, "view_count": 29111}, {"video_id": "AJoOlCRgm84", "question": "On the screen, there are some carvings on the stone wall, a little boy is crouching on the ground, and there is a tree in the background. When 'enlightenment from the Buddha's 550' is mentioned, what objects are present on the screen?", "question_wo_referring_query": "On the screen, there are some carvings on the stone wall, a little boy is crouching on the ground, and there is a tree in the background. When 'enlightenment from the Buddha's 550' is mentioned, what objects are present on the screen?", "candidates": ["Stone carving of a giraffe", "Stone carving of a tiger", "Stone carving of an elephant", "Stone carving of a lion"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "AJoOlCRgm84_1", "video_path": "AJoOlCRgm84.mp4", "subtitle_path": "AJoOlCRgm84_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1667.5, "view_count": 29111}, {"video_id": "AJoOlCRgm84", "question": "In a dark room, there is an old man. On the table in front of the old man, there is a box. Inside the box, there are some objects. 
When 'Music' is mentioned, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["bowl", "glasses", "scroll", "mirror"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "AJoOlCRgm84_2", "video_path": "AJoOlCRgm84.mp4", "subtitle_path": "AJoOlCRgm84_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1667.5, "view_count": 29111}, {"video_id": "nWhOvPsrT7k", "question": "In a small, dark room, a man with a beard and wearing an overcoat is sitting on a wooden chair beside a white bed. What kind of clothes is the man wearing in the scene?", "question_wo_referring_query": "What kind of clothes is the man wearing in the scene?", "candidates": ["Light-colored short sleeves, dark pants", "Light-colored long sleeves, dark pants", "Light-colored long sleeves, white pants", "Light-colored short sleeves, white pants"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "nWhOvPsrT7k_0", "video_path": "nWhOvPsrT7k.mp4", "subtitle_path": "nWhOvPsrT7k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3222.29, "view_count": 104871}, {"video_id": "nWhOvPsrT7k", "question": "In front of a long table with a row of desktop computers, a group of people are standing and interacting. A young woman with olive-colored long hair is writing with a pen in front of the computer. 
What is this woman wearing?", "question_wo_referring_query": "What is this woman wearing?", "candidates": ["Wearing a blue tie-dyed long-sleeve shirt", "Wearing a white tie-dyed short-sleeve shirt", "Wearing a white tie-dyed long-sleeve shirt", "Wearing a blue tie-dyed short-sleeve shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "nWhOvPsrT7k_1", "video_path": "nWhOvPsrT7k.mp4", "subtitle_path": "nWhOvPsrT7k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3222.29, "view_count": 104871}, {"video_id": "nWhOvPsrT7k", "question": "In front of pine trees covered in white snow and on a thick layer of snow, there is a man crouched down setting a trap. What kind of clothes is this man wearing?", "question_wo_referring_query": "What kind of clothes is this man wearing?", "candidates": ["Wearing a dark green down jacket and gray pants", "Wearing a dark blue padded jacket and black pants", "Wearing a dark green down jacket and black pants", "Wearing a dark blue padded jacket and gray pants"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "nWhOvPsrT7k_2", "video_path": "nWhOvPsrT7k.mp4", "subtitle_path": "nWhOvPsrT7k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3222.29, "view_count": 104871}, {"video_id": "J456aSUrN8o", "question": "In a piece of land with grass and soil, there are several giraffes. A man wearing a dark green outfit is standing outside the bars feeding a giraffe. 
At the moment he mentions 'you're so much less scarier than the,' what kind of bracelet is he wearing on his left hand?", "question_wo_referring_query": "What kind of bracelet is the man wearing on his left hand?", "candidates": ["Wearing only a yellow beaded bracelet", "Wearing a rice yellow and olive green braided bracelet", "Wearing a rice yellow and olive green beaded bracelet", "Wearing only a braided bracelet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "J456aSUrN8o_0", "video_path": "J456aSUrN8o.mp4", "subtitle_path": "J456aSUrN8o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1453.24, "view_count": 8218}, {"video_id": "J456aSUrN8o", "question": "On a grassy field with trees, a man in light red clothing is leaning on the ground in front of a slope. When mentioning 'that's been put into my heart', what kind of glasses is the man wearing?", "question_wo_referring_query": "What kind of glasses is the man wearing?", "candidates": ["Black frame sunglasses", "Transparent frame glasses", "Black frame glasses", "Red frame sunglasses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "J456aSUrN8o_1", "video_path": "J456aSUrN8o.mp4", "subtitle_path": "J456aSUrN8o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1453.24, "view_count": 8218}, {"video_id": "J456aSUrN8o", "question": "Outside a building with many cars parked, there is a woman wearing a pink short-sleeved top holding a red bag. 
When mentioning 'happy with your expensive Indian outfit', what color is the bag on her shoulder?", "question_wo_referring_query": "What color is the bag on her shoulder?", "candidates": ["A rose-colored bag", "A black handbag", "A black bookbag", "A white bookbag"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "J456aSUrN8o_2", "video_path": "J456aSUrN8o.mp4", "subtitle_path": "J456aSUrN8o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1453.24, "view_count": 8218}, {"video_id": "YLZbqrWoO7o", "question": "In a white background screen, there are many black letters, with the title 'Why edge detection?' Below the title, there is a pencil sketch of an eyebrow, an eye, and a nose. There is a red dot on the screen pointing to something. What is the object being pointed at?", "question_wo_referring_query": "What is the object being pointed at?", "candidates": ["It is the pencil sketch of the eyebrow", "It is the pencil sketch of the eye", "It is the letters of the title", "It is the pencil sketch of the nose"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "YLZbqrWoO7o_0", "video_path": "YLZbqrWoO7o.mp4", "subtitle_path": "YLZbqrWoO7o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1914.5, "view_count": 1165}, {"video_id": "YLZbqrWoO7o", "question": "In a white background screen, there are many black English letters. At the bottom of the screen, there are also four different colored rectangular shapes formed by long rectangles. There is a red dot on one of them. 
What is the object that is being pointed to?", "question_wo_referring_query": "What is the object that is being pointed to?", "candidates": ["Black rectangular shape", "Blue rectangular shape", "Red rectangular shape", "Green rectangular shape"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "YLZbqrWoO7o_1", "video_path": "YLZbqrWoO7o.mp4", "subtitle_path": "YLZbqrWoO7o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1914.5, "view_count": 1165}, {"video_id": "YLZbqrWoO7o", "question": "In a screen with a white background, there is a black title 'Closeup of edges.' Below it is a picture of a building, which also features an object highlighted by a red frame. What is this object?", "question_wo_referring_query": "What is this object?", "candidates": ["The door under the roof", "A transparent window of the house", "The yellow wall outside the house", "The red roof of the house"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "YLZbqrWoO7o_2", "video_path": "YLZbqrWoO7o.mp4", "subtitle_path": "YLZbqrWoO7o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1914.5, "view_count": 1165}, {"video_id": "7WbW2LCM5Rc", "question": "In an open space with a black car parked, there are two men wearing glasses. The one on the left is wearing gray clothes, and the one on the right is wearing black clothes. 
What is the man in black clothes doing at this moment?", "question_wo_referring_query": "What is the man in black clothes doing at this moment?", "candidates": ["He extends his left hand, with a clenched fist", "He extends his left hand, with fingers spread open", "He extends his right hand, with a clenched fist", "He raises his right hand, with fingers spread open"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "7WbW2LCM5Rc_0", "video_path": "7WbW2LCM5Rc.mp4", "subtitle_path": "7WbW2LCM5Rc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1056.68, "view_count": 216015}, {"video_id": "7WbW2LCM5Rc", "question": "By a wall with a green coat rack holding black clothes, a woman in blue clothes is propping herself up on a table with her left arm. Nearby, a man in black clothes with a beard and wearing a hat is sitting. What is this man doing at this time?", "question_wo_referring_query": "What is this man doing at this time?", "candidates": ["He is looking into the mirror", "He gave the woman a cup", "He took off his hat", "He is drinking water"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "7WbW2LCM5Rc_1", "video_path": "7WbW2LCM5Rc.mp4", "subtitle_path": "7WbW2LCM5Rc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1056.68, "view_count": 216015}, {"video_id": "7WbW2LCM5Rc", "question": "In a white room with a black piece of clothing hanging on the side wall, a man in black is reaching out with his right hand, sitting sideways at a white table. Not far from him, a woman in blue is also seated. 
What is this woman doing at this moment?", "question_wo_referring_query": "What is this woman doing at this moment?", "candidates": ["She is taking off her glasses", "She is adjusting her collar", "She is pouring water for herself", "She is tying her hair"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "7WbW2LCM5Rc_2", "video_path": "7WbW2LCM5Rc.mp4", "subtitle_path": "7WbW2LCM5Rc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1056.68, "view_count": 216015}, {"video_id": "518NTdzZfhM", "question": "In the video, there's a man in a suit and glasses. Behind him, there are six colorful pictures, with a picture of a car in the middle of the top row. What action does this man perform when the subtitle reads 'foreground background or binary'?", "question_wo_referring_query": "What action does this man perform?", "candidates": ["He raises both hands.", "He holds a pen in his left hand and clenches his right hand into a fist, while his left hand is naturally down.", "He holds a pen in his left hand and clenches his right hand into a fist, while raising his right hand.", "He holds a pen in his right hand and clenches his left hand into a fist, while his right hand is naturally down."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "518NTdzZfhM_0", "video_path": "518NTdzZfhM.mp4", "subtitle_path": "518NTdzZfhM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2364.67, "view_count": 519}, {"video_id": "518NTdzZfhM", "question": "A man in a suit and glasses appears on the screen. Behind him, there are two tables containing numbers: 1, 2, 3, and 4. 
What does the man do when the subtitle says 'another complex way you can do this is'?", "question_wo_referring_query": "What action does the man perform?", "candidates": ["Left hand holding a pen, left hand gripping the right hand", "Right hand holding a pen and making a fist, left hand naturally down", "Left hand holding a pen, right hand gripping the left hand", "Left hand holding a pen and making a fist, right hand raised"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "518NTdzZfhM_1", "video_path": "518NTdzZfhM.mp4", "subtitle_path": "518NTdzZfhM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2364.67, "view_count": 519}, {"video_id": "518NTdzZfhM", "question": "In the scene, a man wearing a suit and glasses appears. To his right is the number 27. The background behind the man features several blue rectangular patterns, with the largest word being 'U-Net.' When the subtitle mentions 'one variation as we discussed,' what action does the man perform?", "question_wo_referring_query": "What action does this man perform?", "candidates": ["Right hand holds a pen, left hand clenched in a fist", "Left hand holds a pen, right hand raises, fingers spread towards the left", "Left hand holds a pen, right hand grasps the left hand", "Right hand holds a pen and clenches a fist, left hand naturally down"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "518NTdzZfhM_2", "video_path": "518NTdzZfhM.mp4", "subtitle_path": "518NTdzZfhM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2364.67, "view_count": 519}, {"video_id": "CcYhr1_Zil0", "question": "In the video, there is a woman with long hair wearing a white dress on the left, and a woman with long hair wearing glasses and a black dress on the right. 
After the lecture begins, what does the woman in white do?", "question_wo_referring_query": "What does the woman in white do?", "candidates": ["Claps her hands together.", "Raises both hands.", "Raises her right hand holding a white pen.", "Raises her left hand."], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "CcYhr1_Zil0_0", "video_path": "CcYhr1_Zil0.mp4", "subtitle_path": "CcYhr1_Zil0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1639.41, "view_count": 22629}, {"video_id": "CcYhr1_Zil0", "question": "In the video, there is a woman with long hair wearing a white outfit on the left side, with a white cabinet behind her. On the right side, there is a woman with long hair wearing a black outfit, with a watch on her right hand. After the woman in black pushes her glasses up, what does her right hand do?", "question_wo_referring_query": "What does the right hand of the woman with long hair wearing a black outfit do?", "candidates": ["Her right hand is clenched into a fist on the table.", "Her right hand is raised near her mouth and she is holding a white pen.", "Her right hand is extended forward.", "Her right hand is raised near her mouth and she is holding a ruler."], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "CcYhr1_Zil0_1", "video_path": "CcYhr1_Zil0.mp4", "subtitle_path": "CcYhr1_Zil0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1639.41, "view_count": 22629}, {"video_id": "CcYhr1_Zil0", "question": "In the video, there is a woman with long hair wearing a white dress on the left, and a woman wearing black clothes with glasses on the right. The woman on the right is wearing a bracelet on her right hand and a watch on her left hand. 
After tying her hair with her right hand, what action does the woman with glasses perform?", "question_wo_referring_query": "What action does the woman with glasses perform?", "candidates": ["She raises both hands.", "She puts her hands together in opposite directions and places them in front of her chest.", "She makes a fist.", "She holds a white pen."], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "CcYhr1_Zil0_2", "video_path": "CcYhr1_Zil0.mp4", "subtitle_path": "CcYhr1_Zil0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1639.41, "view_count": 22629}, {"video_id": "_qgKQGsuKeQ", "question": "In the top left corner of the screen, there is a man with sparse hair wearing white clothes. In the middle, there is a title 'Interest Point Detection'. After the subtitle mentions 'detection so we are looking at the edges,' what action does the man perform?", "question_wo_referring_query": "What action does the man perform?", "candidates": ["He raises both hands.", "He touches his head with his hand.", "He touches his nose with his hand.", "He points at something with his hand."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "_qgKQGsuKeQ_0", "video_path": "_qgKQGsuKeQ.mp4", "subtitle_path": "_qgKQGsuKeQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2829.87, "view_count": 107956}, {"video_id": "_qgKQGsuKeQ", "question": "In the upper left corner of the screen, there's a man wearing white clothes with sparse hair. The man has his head slightly lowered. In the middle, there is a title 'Mathematics of Harris Detector'. 
After the subtitle mentions 'and this is our vector which is here no', what action does the man take?", "question_wo_referring_query": "What action did the man take?", "candidates": ["He clenched his fist tightly", "He pointed at something with his hand", "He touched his nose with his hand", "He raised his left hand"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "_qgKQGsuKeQ_1", "video_path": "_qgKQGsuKeQ.mp4", "subtitle_path": "_qgKQGsuKeQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2829.87, "view_count": 107956}, {"video_id": "_qgKQGsuKeQ", "question": "In the upper left corner of the screen, there's a man with sparse hair wearing a white shirt. In the middle, there's a title 'Compute corner response.' What did the man do after the subtitle 'two images of this animal and so if you' appeared?", "question_wo_referring_query": "In the upper left corner of the screen, there's a man with sparse hair wearing a white shirt. In the middle, there's a title 'Compute corner response.' 
What did the man do after the subtitle 'two images of this animal and so if you' appeared?", "candidates": ["He pointed at something with his hand.", "He touched his head with his hand.", "He clenched his fist tightly and raised it.", "He touched his nose with his hand."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "_qgKQGsuKeQ_2", "video_path": "_qgKQGsuKeQ.mp4", "subtitle_path": "_qgKQGsuKeQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2829.87, "view_count": 107956}, {"video_id": "g533JQqzNrs", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["First, a weather simulation appears, then a large wave hitting the rocks, and finally a shaking room due to an earthquake.", "First, a shaking room due to an earthquake, then a large wave hitting the rocks, and finally a weather simulation.", "First, a large wave hitting the rocks, then a weather simulation appears, and finally a shaking room due to an earthquake.", "First, a weather simulation appears, then a shaking room due to an earthquake, and finally a large wave hitting the rocks."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "g533JQqzNrs_0", "video_path": "g533JQqzNrs.mp4", "subtitle_path": "g533JQqzNrs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.5, "view_count": 10841}, {"video_id": "g533JQqzNrs", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a man in a white short sleeve shirt and a woman in a yellow coat are seen talking, then the house destroyed by the earthquake appears, and finally, a woman in a pink coat and a woman with short hair in a black coat 
are seen talking.", "First, a woman in a pink coat and a woman with short hair in a black coat are seen talking, then a man in a white short sleeve shirt and a woman in a yellow coat are seen talking, and finally, the house destroyed by the earthquake appears.", "First, the house destroyed by the earthquake appears, then a man in a white short sleeve shirt and a woman in a yellow coat are seen talking, and finally, a woman in a pink coat and a woman with short hair in a black coat are seen talking.", "First, the house destroyed by the earthquake appears, then a woman in a pink coat and a woman with short hair in a black coat are seen talking, and finally, a man in a white short sleeve shirt and a woman in a yellow coat are seen talking."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "g533JQqzNrs_1", "video_path": "g533JQqzNrs.mp4", "subtitle_path": "g533JQqzNrs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.5, "view_count": 10841}, {"video_id": "g533JQqzNrs", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which sequence of scenes below is correct?", "candidates": ["First, an ocean with fish appears, then cracked ground appears, and finally different colored drawing blocks appear as the ending.", "First, different colored drawing blocks appear, then cracked ground appears, and finally an underwater scene with fish swimming is the ending.", "First, cracked ground appears, then different colored drawing blocks appear, and finally an ocean with fish appears as the ending.", "First, different colored drawing blocks appear, then an ocean with fish appears, and finally cracked ground appears as the ending."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "g533JQqzNrs_2", "video_path": "g533JQqzNrs.mp4", "subtitle_path": "g533JQqzNrs_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 905.5, "view_count": 10841}, {"video_id": "50ev0a1J9uQ", "question": "In front of a sports field, there is a man wearing a blue and white short-sleeve shirt, with a watch on his left hand, and both hands open with palms facing up. In which of the following scenes does this man appear?", "question_wo_referring_query": "In front of a sports field, there is a man wearing a blue and white short-sleeve shirt, with a watch on his left hand, and both hands open with palms facing up. In which of the following scenes does this man appear?", "candidates": ["Standing by a large mountain", "In front of a chain-link fence", "Standing by a small river", "Standing at the entrance of a staircase"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "50ev0a1J9uQ_0", "video_path": "50ev0a1J9uQ.mp4", "subtitle_path": "50ev0a1J9uQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 927.47, "view_count": 229041}, {"video_id": "50ev0a1J9uQ", "question": "There is a man in an orange shirt and a white hat standing beside a car. Behind the man to the left is a large tree, and to the right are a big tree and a house with gray walls and a red roof. In which of the following scenes did this man in the orange shirt appear?", "question_wo_referring_query": "In which scene did this man in the orange shirt appear?", "candidates": ["Beside a train", "In front of a wire fence", "Beside a big tree in a park", "By a small river"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "50ev0a1J9uQ_1", "video_path": "50ev0a1J9uQ.mp4", "subtitle_path": "50ev0a1J9uQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 927.47, "view_count": 229041}, {"video_id": "50ev0a1J9uQ", "question": "In front of a playground, there is a man wearing a black backpack. 
This man is dressed in a white and blue top, and he is wearing a black watch on his left arm. In which of the following scenes does this black watch appear?", "question_wo_referring_query": "In which of the following scenes does this black watch appear?", "candidates": ["On a bus", "Next to a train", "Next to a gray column", "On a mountain"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "50ev0a1J9uQ_2", "video_path": "50ev0a1J9uQ.mp4", "subtitle_path": "50ev0a1J9uQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 927.47, "view_count": 229041}, {"video_id": "sm6uO5FkX5I", "question": "On the table in the video, there is a cutting board with three potatoes and a peeler on it. Next to the table, there is a person wearing light grey clothes, holding a potato and a peeler. In the video, alongside which subtitle do the potatoes appear?", "question_wo_referring_query": "Alongside which subtitle do the potatoes appear in the video?", "candidates": ["Pour the mixture.", "Eggs-3 pcs.", "Did you like my recipe", "Potatoes-2-3 pieces."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "sm6uO5FkX5I_0", "video_path": "sm6uO5FkX5I.mp4", "subtitle_path": "sm6uO5FkX5I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1389.53, "view_count": 20912}, {"video_id": "sm6uO5FkX5I", "question": "In the video, the person wearing blue and white clothes places a slicer on the cutting board, takes a cucumber, and slices it using the slicer. 
Which subtitle appears along with cucumber in the video?", "question_wo_referring_query": "Which subtitle appears along with cucumber in the video?", "candidates": ["Onion-1 pc.", "Did you like my recipe", "Pour the mixture.", "Eggs-3 pcs."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "sm6uO5FkX5I_1", "video_path": "sm6uO5FkX5I.mp4", "subtitle_path": "sm6uO5FkX5I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1389.53, "view_count": 20912}, {"video_id": "sm6uO5FkX5I", "question": "There is a cutting board on the table in the video, with a scallion on it. Someone is pressing the scallion with their hand and using a knife to cut the scallion. During which subtitle does the scallion appear together?", "question_wo_referring_query": "During which subtitle does the scallion appear together in the video?", "candidates": ["Did you like my recipe", "Pour the mixture.", "Onion-1 pc.", "Eggs-3 pcs."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "sm6uO5FkX5I_2", "video_path": "sm6uO5FkX5I.mp4", "subtitle_path": "sm6uO5FkX5I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1389.53, "view_count": 20912}, {"video_id": "DNBCoof4dNM", "question": "There is a man in the picture wearing a blue short-sleeve shirt with floral patterns and glasses. In the top right corner, there are two pictures with a French flag in between. 
What changes after the top right corner picture disappears involving the man wearing the blue short-sleeve shirt with floral patterns and glasses?", "question_wo_referring_query": "What changes involving the man wearing a blue short-sleeve shirt with floral patterns and glasses?", "candidates": ["He picks up a white cup with his right hand", "He joins his hands together", "He clenches his fist", "He raises his left hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "DNBCoof4dNM_0", "video_path": "DNBCoof4dNM.mp4", "subtitle_path": "DNBCoof4dNM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1910.66, "view_count": 207666}, {"video_id": "DNBCoof4dNM", "question": "In the video, there is a man wearing a gray coat with a beard, holding his hands open. In the top-right corner of the video, there is a picture with three books on it. What changes occur to the man in the video before the picture in the top-right corner disappears?", "question_wo_referring_query": "What changes occur to the man in the video before the picture in the top-right corner disappears?", "candidates": ["Raising both hands", "Holding both blue and black items in his hands", "Making a fist", "No change"], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "DNBCoof4dNM_1", "video_path": "DNBCoof4dNM.mp4", "subtitle_path": "DNBCoof4dNM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1910.66, "view_count": 207666}, {"video_id": "DNBCoof4dNM", "question": "There is a man wearing a red short-sleeved shirt with black patterns in the middle of the screen. He is holding a yellow dog in his right hand, and there are blue subtitles on the right side of the screen. 
What changes occur to the man in the red shirt when the blue subtitles disappear?", "question_wo_referring_query": "What changes occur to the man in the red shirt with black patterns in the middle of the screen when the blue subtitles disappear?", "candidates": ["No change.", "Both hands raise simultaneously.", "His left hand points to the right.", "The dog in his right hand disappears, and he holds on with both hands."], "topic_category": "KG-Knowledge-Geography", "question_category": "SAA", "level": "L2-Relation", "id": "DNBCoof4dNM_2", "video_path": "DNBCoof4dNM.mp4", "subtitle_path": "DNBCoof4dNM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1910.66, "view_count": 207666}, {"video_id": "lEX8kMHrj1I", "question": "There are three green trees on the screen. On the tree in the middle, there is a small bird standing, with mountains in the distance. What event is happening in the distance?", "question_wo_referring_query": "What event is happening in the distance?", "candidates": ["Rain", "Hailstorm", "Explosion", "Many birds flying in the sky"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "lEX8kMHrj1I_0", "video_path": "lEX8kMHrj1I.mp4", "subtitle_path": "lEX8kMHrj1I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1062.5, "view_count": 61554}, {"video_id": "lEX8kMHrj1I", "question": "In a dark scene, there are many armored creatures, and on the right side, there is a large gray creature. 
What did this gray thing do?", "question_wo_referring_query": "What did this gray thing do?", "candidates": ["flew", "crawled", "jumped", "ate something"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "lEX8kMHrj1I_1", "video_path": "lEX8kMHrj1I.mp4", "subtitle_path": "lEX8kMHrj1I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1062.5, "view_count": 61554}, {"video_id": "lEX8kMHrj1I", "question": "The sky in the scene is fiery red, and below the black area, there's a mass of red flames. What event is happening in the scene?", "question_wo_referring_query": "What event is happening in the scene?", "candidates": ["Lava is rising from the crater", "Lava is descending into the crater", "A forest fire is occurring", "A house fire is occurring"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "lEX8kMHrj1I_2", "video_path": "lEX8kMHrj1I.mp4", "subtitle_path": "lEX8kMHrj1I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1062.5, "view_count": 61554}, {"video_id": "jjzssiwpq3M", "question": "In the video, there is a black-haired woman and a brunette woman leaning together on a piece of flower-patterned fabric. When the subtitle mentions 'her so nothing would happen to her,' which object appears on the screen?", "question_wo_referring_query": "Which object appears on the screen?", "candidates": ["Black Collar", "White Mouse Pad", "White Duckbill Cap", "Red Sunglasses"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "jjzssiwpq3M_0", "video_path": "jjzssiwpq3M.mp4", "subtitle_path": "jjzssiwpq3M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2457.39, "view_count": 219082}, {"video_id": "jjzssiwpq3M", "question": "A white-haired man in a black suit is giving a lecture in the broadcast room. 
What objects are present on the screen when the subtitle mentions 'the early morning areas people trying to'?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["scarf", "white feathered hat", "blue and red floral patterned tie", "hat"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "jjzssiwpq3M_1", "video_path": "jjzssiwpq3M.mp4", "subtitle_path": "jjzssiwpq3M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2457.39, "view_count": 219082}, {"video_id": "jjzssiwpq3M", "question": "On the right side of the screen, there is a man dressed in a black suit and wearing a black tie sitting in the studio giving a lecture. When the subtitle mentions 'crisis and outrage in Montreal after an', what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["sunglasses", "hat", "white mouse cursor", "blue, red, and white floral patterned tie"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "jjzssiwpq3M_2", "video_path": "jjzssiwpq3M.mp4", "subtitle_path": "jjzssiwpq3M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2457.39, "view_count": 219082}, {"video_id": "t8nLkge34mE", "question": "In the video, there's a man wearing a coat and glasses talking in a room. There are several round decorations on the wall. 
What is the color of the coat the man in the video is wearing?", "question_wo_referring_query": "What is the color of the coat the man in the video is wearing?", "candidates": ["Green", "White", "Yellow", "Black"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "t8nLkge34mE_0", "video_path": "t8nLkge34mE.mp4", "subtitle_path": "t8nLkge34mE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1332.46, "view_count": 57494}, {"video_id": "t8nLkge34mE", "question": "In the video, a woman wearing a black coat and white inner wear is sitting in a room with several cups on the table. How is the length of the woman's hair portrayed on the screen?", "question_wo_referring_query": "How is the length of the woman's hair portrayed on the screen?", "candidates": ["Bald", "Crew cut", "Short hair", "Long hair"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "t8nLkge34mE_1", "video_path": "t8nLkge34mE.mp4", "subtitle_path": "t8nLkge34mE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1332.46, "view_count": 57494}, {"video_id": "t8nLkge34mE", "question": "In the video, a man dressed in black clothes with a beard is sitting in front of a shelf with many books, speaking. What is the man in the video wearing?", "question_wo_referring_query": "What is the man in the video wearing?", "candidates": ["tank top", "suit", "short sleeves", "long sleeves"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "t8nLkge34mE_2", "video_path": "t8nLkge34mE.mp4", "subtitle_path": "t8nLkge34mE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1332.46, "view_count": 57494}, {"video_id": "kYiC8lzpTyM", "question": "In the room on the screen, there is a person holding a smartphone in their right hand. 
On the smartphone, there is an image of a woman, and their left hand is giving a thumbs-up. There is a gray object in the room. Who is holding the smartphone in the video?", "question_wo_referring_query": "Who is holding the smartphone in the video?", "candidates": ["A man in a white lab coat", "A woman in a skirt", "A man in a black short-sleeve shirt with a beard", "A man in a blue short-sleeve shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "kYiC8lzpTyM_0", "video_path": "kYiC8lzpTyM.mp4", "subtitle_path": "kYiC8lzpTyM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1803.8, "view_count": 4741461}, {"video_id": "kYiC8lzpTyM", "question": "There are two girls wearing skirts holding white ribbons in the picture, and there is a woman in a white wedding dress in the middle. What is the woman in the wedding dress holding in the video?", "question_wo_referring_query": "What is the woman in the wedding dress holding in the video?", "candidates": ["Food", "Cord", "Flowers decorated with green leaves", "Mobile phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "kYiC8lzpTyM_1", "video_path": "kYiC8lzpTyM.mp4", "subtitle_path": "kYiC8lzpTyM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1803.8, "view_count": 4741461}, {"video_id": "kYiC8lzpTyM", "question": "In the video, there is a man wearing a green tie and a white shirt with a mustache, another man in the middle wearing a pink tie, and a woman on the right holding a cell phone. 
What is the man with the green tie holding in his right hand?", "question_wo_referring_query": "What is the man with the green tie holding in his right hand?", "candidates": ["cell phone", "flower", "food", "microphone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "kYiC8lzpTyM_2", "video_path": "kYiC8lzpTyM.mp4", "subtitle_path": "kYiC8lzpTyM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1803.8, "view_count": 4741461}, {"video_id": "bUVHrX_WmcQ", "question": "In the video, there is a black person on the left side wearing glasses and a white coat with polka dots, in the middle, there is a black person wearing a short-sleeved shirt, and on the right side, there is a man wearing a short-sleeved shirt with dark blue edges and polka dots. In the upper right corner of the video, there is a transparent cup with coffee. What is the man on the right doing when the subtitle says 'bananas speaking of coffee I usually'?", "question_wo_referring_query": "What is the man on the right doing?", "candidates": ["Holding a cup with his left hand", "Hugging the person in the middle", "Doing nothing", "Raising both hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "bUVHrX_WmcQ_0", "video_path": "bUVHrX_WmcQ.mp4", "subtitle_path": "bUVHrX_WmcQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1920.82, "view_count": 316340}, {"video_id": "bUVHrX_WmcQ", "question": "On the left side of the screen, there is a black person wearing glasses and a white jacket with floral patterns. In the middle, there is a black person wearing short sleeves. On the right side, there is a man wearing a dark blue short-sleeved shirt with floral patterns. In the upper left corner, there is a picture with fireworks. 
When the subtitle mentions 'made out of iron that is rolled around', what is the woman on the left doing?", "question_wo_referring_query": "What is the woman on the left doing?", "candidates": ["Left hand pointing down with index finger", "Right hand pointing down with index finger", "Left hand pointing down with middle finger", "Right hand pointing down with middle finger"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "bUVHrX_WmcQ_1", "video_path": "bUVHrX_WmcQ.mp4", "subtitle_path": "bUVHrX_WmcQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1920.82, "view_count": 316340}, {"video_id": "bUVHrX_WmcQ", "question": "In the video, there is a man wearing a black long-sleeve shirt with a beard. To the right, there is a flag with yellow, red, black, and yellow horizontal stripes, along with some lines of text. When the subtitle mentions 'otherwise contemporary Ugandan artists,' what is the man in the video doing?", "question_wo_referring_query": "In the video, there is a man wearing a black long-sleeve shirt with a beard. To the right, there is a flag with yellow, red, black, and yellow horizontal stripes, along with some lines of text. When the subtitle mentions 'otherwise contemporary Ugandan artists,' what is the man in the video doing?", "candidates": ["raising his right hand", "holding a cup in his right hand", "clenching his right hand into a fist", "holding a cup in his left hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "bUVHrX_WmcQ_2", "video_path": "bUVHrX_WmcQ.mp4", "subtitle_path": "bUVHrX_WmcQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1920.82, "view_count": 316340}, {"video_id": "nEL0lVLOeOs", "question": "There is a woman in the video wearing a green outfit, glasses, and a floral hat. She has a spoon in her right hand, and there is rice on the spoon. 
She is putting the rice into her mouth. What does the woman in the video do after putting the rice into her mouth?", "question_wo_referring_query": "What does the woman in the video do after putting the rice into her mouth?", "candidates": ["Covers her mouth with her right hand", "Covers her mouth with her left hand", "Raises both hands", "Clenches her fists"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "nEL0lVLOeOs_0", "video_path": "nEL0lVLOeOs.mp4", "subtitle_path": "nEL0lVLOeOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1035.1, "view_count": 509847}, {"video_id": "nEL0lVLOeOs", "question": "In the video, there is a bag on the right side and a glass jar on the left side. The hand in the middle of the screen pours dark brown powder into the glass jar. What does the hand do after this action?", "question_wo_referring_query": "What does the hand in the video do after this action?", "candidates": ["Pours water into the glass jar", "Makes no further action", "Throws away the bag on the right", "Throws away the glass jar"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "nEL0lVLOeOs_1", "video_path": "nEL0lVLOeOs.mp4", "subtitle_path": "nEL0lVLOeOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1035.1, "view_count": 509847}, {"video_id": "nEL0lVLOeOs", "question": "In the video, a person wearing a pink outfit with red stripes places a dumpling into a blue pot. 
What happens in the video after the person places the dumpling into the pot?", "question_wo_referring_query": "What happens in the video after the person places the dumpling into the pot?", "candidates": ["The person puts the dumpling on a yellow plate", "The person puts the dumpling on a purple plate", "The person throws the dumpling away", "The person puts the dumpling on a green plate and displays it"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "nEL0lVLOeOs_2", "video_path": "nEL0lVLOeOs.mp4", "subtitle_path": "nEL0lVLOeOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1035.1, "view_count": 509847}, {"video_id": "3cv_m5RKQVQ", "question": "In the video, there are several black and white goats with long horns on flat ground, four camels walking on the desert with a temple in the background, and many fish swimming in the ocean. Which screen is displayed first?", "question_wo_referring_query": ", which screen is displayed first?", "candidates": ["Many fish swimming in the ocean", "Four camels walking on the desert with a temple in the background", "None appeared", "Several black and white goats with long horns on flat ground"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "3cv_m5RKQVQ_0", "video_path": "3cv_m5RKQVQ.mp4", "subtitle_path": "3cv_m5RKQVQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 987.9, "view_count": 765457}, {"video_id": "3cv_m5RKQVQ", "question": "In the video, there is a spiral staircase with many people on it, a beach with several houses nearby, and a section of the Earth's surface. 
Which scene is shown first?", "question_wo_referring_query": "Which scene is shown first from the choices below?", "candidates": ["None of the above", "A beach with several houses nearby", "A spiral staircase with many people on it", "A section of the Earth's surface"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "3cv_m5RKQVQ_1", "video_path": "3cv_m5RKQVQ.mp4", "subtitle_path": "3cv_m5RKQVQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 987.9, "view_count": 765457}, {"video_id": "3cv_m5RKQVQ", "question": "In the video, there is a picture of ancient Babylon and many geese walking on the ground, as well as many people holding smartphones and taking pictures of Mona Lisa paintings on the wall. Which scene is shown first?", "question_wo_referring_query": "In the video, there is a picture of ancient Babylon and many geese walking on the ground, as well as many people holding smartphones and taking pictures of Mona Lisa paintings on the wall. Which scene is shown first?", "candidates": ["Many people holding smartphones and taking pictures of Mona Lisa paintings on the wall", "None of the above", "A picture of ancient Babylon", "Many geese walking on the ground"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "3cv_m5RKQVQ_2", "video_path": "3cv_m5RKQVQ.mp4", "subtitle_path": "3cv_m5RKQVQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 987.9, "view_count": 765457}, {"video_id": "QNxIiTStOaQ", "question": "In a dimly-lit room, there is a white bed. Next to the white bed is a woman with short black hair wearing headphones. 
After she says \"sleeve mascara this has the best mascara,\" what is the first thing the woman with short black hair does?", "question_wo_referring_query": "What is the first thing the woman with short black hair does?", "candidates": ["Holds the phone", "Takes a photo with her phone", "Uses an eyelash curler", "Raises her right hand to make a heart shape", "Walks forward"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "QNxIiTStOaQ_0", "video_path": "QNxIiTStOaQ.mp4", "subtitle_path": "QNxIiTStOaQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.51, "view_count": 125735}, {"video_id": "QNxIiTStOaQ", "question": "In a white room with orange lighting, a short-haired woman wearing a white T-shirt is speaking. After she says 'cold,' what does she do first?", "question_wo_referring_query": "What does she do first after she says 'cold'?", "candidates": ["Laughs heartily", "Holding a phone towards a mirror", "Looks at the phone", "Raises the phone with her hand", "Left hand picks up a brown cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "QNxIiTStOaQ_1", "video_path": "QNxIiTStOaQ.mp4", "subtitle_path": "QNxIiTStOaQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.51, "view_count": 125735}, {"video_id": "QNxIiTStOaQ", "question": "In a dark elevator, a girl wearing a skirt with a backpack is holding a phone. 
After she says 'look of the entire festival and let's go,' what does the girl holding the phone do first?", "question_wo_referring_query": "What does the girl holding the phone do first?", "candidates": ["Holding up a single hand against the inner wall of the elevator", "Standing with both arms raised", "Holding a cup of coffee", "Eating something", "Leaning on the table with both hands"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "QNxIiTStOaQ_2", "video_path": "QNxIiTStOaQ.mp4", "subtitle_path": "QNxIiTStOaQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1009.51, "view_count": 125735}, {"video_id": "8hqkwCuJ0GU", "question": "In a space surrounded by white walls, who is the first character to appear after a short-haired male with a black backpack on his shoulder says 'excited for today today is going to be a'?", "question_wo_referring_query": "Who is the first character to appear?", "candidates": ["A man wearing a green top", "A man with short blonde hair", "A man with a backpack", "A woman wearing a gray top and sitting in a car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "8hqkwCuJ0GU_0", "video_path": "8hqkwCuJ0GU.mp4", "subtitle_path": "8hqkwCuJ0GU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1240.98, "view_count": 63057}, {"video_id": "8hqkwCuJ0GU", "question": "In a blurry scene, there are some wooden bridges, and in the distance, a blue-green bucket. 
When the subtitle 'there hi hello how are you what's your' appears, who is the first character to appear?", "question_wo_referring_query": "Who is the first character to appear?", "candidates": ["A man wearing a green top", "A man with short hair, wearing a bright green top, walking on the road", "A woman wearing a gray top, sitting in a car", "A woman wearing a white T-shirt, walking on the road", "A little girl with black hair, wearing a gray long-sleeve shirt, standing on the wooden bridge"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "8hqkwCuJ0GU_1", "video_path": "8hqkwCuJ0GU.mp4", "subtitle_path": "8hqkwCuJ0GU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1240.98, "view_count": 63057}, {"video_id": "8hqkwCuJ0GU", "question": "On a pitch-black night, there is a white car parked. In front of the white car stands a woman wearing a black short-sleeve. When the subtitle 'met her in a coffee shop' appears, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["floral scarf", "green wall", "black backpack", "blue short-sleeve", "green short-sleeve"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "8hqkwCuJ0GU_2", "video_path": "8hqkwCuJ0GU.mp4", "subtitle_path": "8hqkwCuJ0GU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1240.98, "view_count": 63057}, {"video_id": "W0JaMeHYxzQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a PPT slide titled 'SVM -background'; then a PPT slide titled 'Maximum Margin'; finally, a PPT slide titled 'Classifier Design'", "First, a PPT slide titled 'Classifier Design'; then a PPT slide titled 'Maximum Margin'; finally, a PPT slide titled 'SVM 
-background'", "First, a PPT slide titled 'Maximum Margin'; then a PPT slide titled 'SVM -background'; finally, a PPT slide titled 'Classifier Design'", "First, a PPT slide titled 'Classifier Design'; then a PPT slide titled 'SVM -background'; finally, a PPT slide titled 'Maximum Margin'", "First, a PPT slide titled 'Maximum Margin'; then a PPT slide titled 'Classifier Design'; finally, a PPT slide titled 'SVM -background'"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "W0JaMeHYxzQ_0", "video_path": "W0JaMeHYxzQ.mp4", "subtitle_path": "W0JaMeHYxzQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2790.9, "view_count": 487}, {"video_id": "W0JaMeHYxzQ", "question": "Which of the following sequences of slides is correct?", "question_wo_referring_query": "Which of the following sequences of slides is correct?", "candidates": ["First, a slide titled 'Decision boundary for NN Classifier'; then a slide titled 'The machine learning framework'; and finally a slide titled 'Algorithm'.", "First, a slide titled 'Algorithm'; then a slide titled 'Decision boundary for NN Classifier'; and finally a slide titled 'The machine learning framework'.", "First, a slide titled 'The machine learning framework'; then a slide titled 'Algorithm'; and finally a slide titled 'Decision boundary for NN Classifier'.", "First, a slide titled 'Decision boundary for NN Classifier'; then a slide titled 'Algorithm'; and finally a slide titled 'The machine learning framework'.", "First, a slide titled 'The machine learning framework'; then a slide titled 'Decision boundary for NN Classifier'; and finally a slide titled 'Algorithm'."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "W0JaMeHYxzQ_1", "video_path": "W0JaMeHYxzQ.mp4", "subtitle_path": "W0JaMeHYxzQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 
2790.9, "view_count": 487}, {"video_id": "W0JaMeHYxzQ", "question": "Which of the following sequences of slides is correct?", "question_wo_referring_query": "Which of the following sequences of slides is correct?", "candidates": ["First is a slide with 'Questions?'; next is a slide with 'Image classification - ImageNet'; finally, a slide with 'Dataset split'", "First is a slide with 'Image classification - ImageNet'; next is a slide with 'Questions?'; finally, a slide with 'Dataset split'", "First is a slide with 'Image classification - ImageNet'; next is a slide with 'Dataset split'; finally, a slide with 'Questions?'", "First is a slide with 'Dataset split'; next is a slide with 'Image classification - ImageNet'; finally, a slide with 'Questions?'", "First is a slide with 'Questions?'; next is a slide with 'Dataset split'; finally, a slide with 'Image classification - ImageNet'"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "W0JaMeHYxzQ_2", "video_path": "W0JaMeHYxzQ.mp4", "subtitle_path": "W0JaMeHYxzQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2790.9, "view_count": 487}, {"video_id": "mfY1732NpzM", "question": "On a PPT slide titled 'LiRPA on General Computational Graph,' there is a picture of a panda. Next to the panda picture, there is a yellow circle. 
Where has the panda picture appeared before?", "question_wo_referring_query": "In which of the following places has the panda picture appeared before?", "candidates": ["In a screen with many purple flowers", "In a screen with a vibrant blue ocean", "In a screen showing the PPT slide", "In a screen with a red sunset", "In a screen with many green bamboos"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "mfY1732NpzM_0", "video_path": "mfY1732NpzM.mp4", "subtitle_path": "mfY1732NpzM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3354.56, "view_count": 279}, {"video_id": "mfY1732NpzM", "question": "In a PPT slide with 'CROWN Bound Propagation' written in red at the top, there is a flowchart connected by black arrows. Below the flowchart, there is also a red horizontal bar that says 'Northeastern University'. Where else has this red horizontal bar with 'Northeastern University' appeared?", "question_wo_referring_query": "Where else has this red horizontal bar with 'Northeastern University' appeared?", "candidates": ["In a PPT slide with a blue ocean", "In a PPT slide with pomegranates", "In a PPT slide with 'Concrete Lower and Upper Bounds' written in red", "In a PPT slide with 'Affected by macroeconomic factors' written in red", "In a PPT slide with blue water lilies"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "mfY1732NpzM_1", "video_path": "mfY1732NpzM.mp4", "subtitle_path": "mfY1732NpzM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3354.56, "view_count": 279}, {"video_id": "mfY1732NpzM", "question": "On a PPT slide, there is a square with a green block. The square has 'non-overlapping groups' written on it. Below the square, there is a red horizontal bar. 
Where else has the square with the green block appeared?", "question_wo_referring_query": "Where else has the square with the green block appeared?", "candidates": ["On the PPT slide with three pandas", "On the PPT slide with a yellow peony", "On the PPT slide with a large green watermelon", "On the PPT slide with 'Affected by macroeconomic factors' written in black text", "On the PPT slide with 'Optimization problem' written in black text"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "mfY1732NpzM_2", "video_path": "mfY1732NpzM.mp4", "subtitle_path": "mfY1732NpzM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3354.56, "view_count": 279}, {"video_id": "uXPv_Bnt44M", "question": "In the blue-green water, there are 4 people holding guns progressing forward. Two are wearing green long-sleeved military uniforms, while the other two are wearing white vests. Which of the following subtitles appeared together with the two people in white vests?", "question_wo_referring_query": "Which of the following subtitles appeared together with the two people in white vests?", "candidates": ["\"Life is really fragile\"", "\"Many people died in war\"", "\"the Japanese Commander made the painful\"", "\"The number of people lying in the pool of blood is in the tens of thousands\"", "\"We must persevere\""], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "uXPv_Bnt44M_0", "video_path": "uXPv_Bnt44M.mp4", "subtitle_path": "uXPv_Bnt44M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2278.79, "view_count": 545958}, {"video_id": "uXPv_Bnt44M", "question": "In a dark room, there are three people. Two of them have red blood holes on their faces, while one is wearing a white dress with a coffee-colored apron, holding a chicken. 
Which of the following subtitles have appeared with the person wearing the coffee-colored apron and holding the chicken?", "question_wo_referring_query": ", which of the following subtitles have appeared with the person wearing the coffee-colored apron and holding the chicken?", "candidates": ["\"First, let the chicken bleed\"", "\"This is a rumor\"", "\"Substances in chicken blood can effectively prevent and treat viruses\"", "\"out the poison in the bubos therefore\"", "\"Treating diseases with chicken blood\""], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "uXPv_Bnt44M_1", "video_path": "uXPv_Bnt44M.mp4", "subtitle_path": "uXPv_Bnt44M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2278.79, "view_count": 545958}, {"video_id": "uXPv_Bnt44M", "question": "In a room with a red L-shaped long table, there are several people wearing black and white clothes. Next to the person wearing white clothes, there is a colorful big pig and several small pigs of different colors. Which of the following subtitles appear together with the colorful big pig?", "question_wo_referring_query": "With which of the following subtitles does the colorful big pig appear together?", "candidates": ["\"The whole body is covered in bacteria\"", "\"It looks delicious\"", "\"out exactly how it would have in a human\"", "\"Carrying viruses\"", "\"This type of pig was carefully developed by humans\""], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "uXPv_Bnt44M_2", "video_path": "uXPv_Bnt44M.mp4", "subtitle_path": "uXPv_Bnt44M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2278.79, "view_count": 545958}, {"video_id": "Sw-HaD6IkWU", "question": "In a bright room with a large mirror and a black ceiling panel, a woman wearing a black sports top and black shorts appears in front of the mirror. 
There is an orange yoga mat in front of her. What is the woman in the black sports top doing?", "question_wo_referring_query": "What is the woman in the black sports top doing?", "candidates": ["High knee running in place", "Crossing arms in front of her chest and moving her neck", "Doing stretching exercises", "Kneeling on both knees and wiping the orange yoga mat with a paper towel", "Kneeling on one knee and wiping the orange yoga mat with a paper towel"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "Sw-HaD6IkWU_0", "video_path": "Sw-HaD6IkWU.mp4", "subtitle_path": "Sw-HaD6IkWU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.37, "view_count": 12089}, {"video_id": "Sw-HaD6IkWU", "question": "Inside a car with black seats, with some glaring sunlight outside the window, there is a woman with long blonde hair and red nail polish. What is she doing inside the car?", "question_wo_referring_query": "What is she doing inside the car?", "candidates": ["Putting on black sunglasses", "Taking off her coat", "Putting on a gold necklace", "Tying up her hair", "Putting on a gold bracelet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "Sw-HaD6IkWU_1", "video_path": "Sw-HaD6IkWU.mp4", "subtitle_path": "Sw-HaD6IkWU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.37, "view_count": 12089}, {"video_id": "Sw-HaD6IkWU", "question": "In a dimly lit yoga room, many people are practicing yoga. There is a mobile phone leaning against a red pillar. A woman dressed in a black sports bra is kneeling on a black yoga mat with a red border. 
What is she doing on the yoga mat?", "question_wo_referring_query": "What is she doing on the yoga mat?", "candidates": ["Bending forward with her head resting on the mat", "Putting on a black outer garment", "Kneeling on the yoga mat playing with her phone", "Putting on a green outer garment", "Putting on a white outer garment"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "Sw-HaD6IkWU_2", "video_path": "Sw-HaD6IkWU.mp4", "subtitle_path": "Sw-HaD6IkWU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.37, "view_count": 12089}, {"video_id": "wSgqm1EqxNA", "question": "On a table covered with a white tablecloth, there are many foods and many people sitting on both sides. Behind a man wearing a white shirt, there is a lake. What objects are present on the table covered with a white cloth?", "question_wo_referring_query": "What objects are present on the table covered with a white cloth?", "candidates": ["Red wine", "Black phone", "A giant chicken drumstick", "Blue plate", "White phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "wSgqm1EqxNA_0", "video_path": "wSgqm1EqxNA.mp4", "subtitle_path": "wSgqm1EqxNA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1261.4, "view_count": 188596}, {"video_id": "wSgqm1EqxNA", "question": "On a grassy area near a lake, there is a long table covered with white fabric, and beneath it, there's off white fabric. Next to the table, there's also a boat near the grassy area. 
What objects are present in the frame?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["Fried chicken", "Green cucumber", "Roasted lamb leg", "Tomato", "Red bottle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "wSgqm1EqxNA_1", "video_path": "wSgqm1EqxNA.mp4", "subtitle_path": "wSgqm1EqxNA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1261.4, "view_count": 188596}, {"video_id": "wSgqm1EqxNA", "question": "Inside a room with a blue arch door, there is a bedside arch-shaped wall niche containing two decorative vases. Above the bed's head on the wall, there is a white air conditioner. What objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["yellow pillow", "blue pillow", "purple pillow", "white pillow", "red pillow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "wSgqm1EqxNA_2", "video_path": "wSgqm1EqxNA.mp4", "subtitle_path": "wSgqm1EqxNA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1261.4, "view_count": 188596}, {"video_id": "mY4AmqIRTZI", "question": "In a kitchen with white walls, there is a man standing who is wearing a black short-sleeve shirt and a black apron. He is holding a piece of dough in his hand. 
When he says \"like shoot out pasta everywhere are we\", what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Transparent glass jar", "Red chili pepper", "Blue plate", "Red tomato", "Green chili pepper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "mY4AmqIRTZI_0", "video_path": "mY4AmqIRTZI.mp4", "subtitle_path": "mY4AmqIRTZI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1106.94, "view_count": 10996657}, {"video_id": "mY4AmqIRTZI", "question": "On a countertop made of marble, there is a pot cooking noodles. On both sides of the pot, there are plates and bowls that are the same color as the pot. When the subtitle 'cook them for about four to five minutes' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Stainless steel knife", "Green peppers", "Blue pot lid", "Red tomatoes", "Blue plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "mY4AmqIRTZI_1", "video_path": "mY4AmqIRTZI.mp4", "subtitle_path": "mY4AmqIRTZI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1106.94, "view_count": 10996657}, {"video_id": "mY4AmqIRTZI", "question": "In a kitchen with a white marble countertop, there is a man wearing a gray long-sleeve shirt and another man wearing a black short-sleeve shirt standing. There is a plate of food in front of them. 
When the subtitle 'yeah that's there that's heavy okay' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["yellow liquid", "yellow carrots", "purple sweet potatoes", "red chili peppers", "red tomatoes"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "mY4AmqIRTZI_2", "video_path": "mY4AmqIRTZI.mp4", "subtitle_path": "mY4AmqIRTZI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1106.94, "view_count": 10996657}, {"video_id": "PjiF8XmVDso", "question": "In a kitchen that's half pink and half green, there is a man and a woman. The woman is wearing a pink apron, and the man is wearing a green apron. In front of the man with the green apron, there is also a small blue pot. What is the hairstyle of the woman wearing the pink apron in the video?", "question_wo_referring_query": "What is the hairstyle of the woman wearing the pink apron in the video?", "candidates": ["Black short hair", "Gold short hair", "Black long hair", "Gold long hair", "Silver long hair"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "PjiF8XmVDso_0", "video_path": "PjiF8XmVDso.mp4", "subtitle_path": "PjiF8XmVDso_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.33, "view_count": 5787133}, {"video_id": "PjiF8XmVDso", "question": "In a room with blue lights, there are many toys and wooden items. A man with short black hair appears in the middle of the screen. To his right, there is a green clover and a white toy. 
What color clothing is the man with short black hair on the screen wearing?", "question_wo_referring_query": "What color clothing is the man with short black hair on the screen wearing?", "candidates": ["Purple", "Black", "White", "Yellow", "Blue"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "PjiF8XmVDso_1", "video_path": "PjiF8XmVDso.mp4", "subtitle_path": "PjiF8XmVDso_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.33, "view_count": 5787133}, {"video_id": "PjiF8XmVDso", "question": "In a brightly lit room, there is a white table. Seated in front of the white table is a young boy wearing a blue long-sleeved shirt with short black hair. To the right of the young boy is a painting, and below the painting is a frame. What material is the frame under the painting made of?", "question_wo_referring_query": "What material is the frame under the painting in the scene made of?", "candidates": ["iron", "cotton", "wood", "stainless steel", "plastic"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "PjiF8XmVDso_2", "video_path": "PjiF8XmVDso.mp4", "subtitle_path": "PjiF8XmVDso_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.33, "view_count": 5787133}, {"video_id": "gTRNL_tJ8Go", "question": "On the blue-green terrain map, there is a box outlined with yellow lines, and inside the box is written 'CENTRAL UPLIFT DOME.' 
What shape does the box outlined with yellow lines take when the subtitle 'this ring tells a story of a time when' appears?", "question_wo_referring_query": "What shape does the box outlined with yellow lines take?", "candidates": ["Oval", "Triangle", "Pentagon", "Hexagon", "Square"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "gTRNL_tJ8Go_0", "video_path": "gTRNL_tJ8Go.mp4", "subtitle_path": "gTRNL_tJ8Go_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1074.28, "view_count": 13131}, {"video_id": "gTRNL_tJ8Go", "question": "In a gray screen, there is a white circle at the center. Next to the circle, there is an arrow with the text 'UPLIFT DOME' beside it. When the subtitle 'the creation of new metamorphic rock' appears, what is the color of the arrow next to the white circle?", "question_wo_referring_query": "What is the color of the arrow next to the white circle?", "candidates": ["green", "red", "white", "purple", "blue"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "gTRNL_tJ8Go_1", "video_path": "gTRNL_tJ8Go.mp4", "subtitle_path": "gTRNL_tJ8Go_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1074.28, "view_count": 13131}, {"video_id": "gTRNL_tJ8Go", "question": "On a topographic map depicting blue, green, and white regions, an area labeled Chesapeake is circled. 
When the subtitle 'of North America the Collision was' appears, what color is the line that circles the Chesapeake area?", "question_wo_referring_query": "What color is the line that circles the area labeled Chesapeake?", "candidates": ["purple", "red", "yellow", "blue", "black"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "gTRNL_tJ8Go_2", "video_path": "gTRNL_tJ8Go.mp4", "subtitle_path": "gTRNL_tJ8Go_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1074.28, "view_count": 13131}, {"video_id": "rr65vnslEZ0", "question": "In front of a black background, two people are standing. At the bottom of the screen, there is white text that reads '@Curlykidlife (He has cool Special FX videos)'. One person is introducing a man named Dunca. Who is introducing the man named Dunca?", "question_wo_referring_query": "Who is introducing the man named Dunca?", "candidates": ["The man wearing a black suit", "The man wearing a purple hat and a black short-sleeved shirt", "The man wearing a blue hoodie", "The man wearing a black long-sleeved shirt with short blonde hair", "The man wearing a black hat and a white T-shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "rr65vnslEZ0_0", "video_path": "rr65vnslEZ0.mp4", "subtitle_path": "rr65vnslEZ0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 945.32, "view_count": 779803}, {"video_id": "rr65vnslEZ0", "question": "In a blurry screen, there is a white arrow with a black border. Above the white arrow, it says Geograpeep Gers\u00e1n. 
Someone is saying \"All we do not pronounce the letter S at the end of words.\" Who is this person?", "question_wo_referring_query": "Who is the person saying this?", "candidates": ["A man wearing a blue baseball cap and a white jersey", "A man wearing a black suit with a red tie", "A man with black framed glasses and short black hair", "A man wearing a white shirt", "A man wearing a black hoodie"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "rr65vnslEZ0_1", "video_path": "rr65vnslEZ0.mp4", "subtitle_path": "rr65vnslEZ0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 945.32, "view_count": 779803}, {"video_id": "rr65vnslEZ0", "question": "A man wearing a purple hat is standing in front of a black background. The man is dressed in a black short-sleeved shirt. To the man's right, there is a picture of a national flag. The man is saying \"Nicaragua's story is very much like its land, There's a touch of volcanic precaudon mixed in with beautiful tradition.\" Which country's national flag appears to the man's right?", "question_wo_referring_query": "Which country's national flag appears to the man's right?", "candidates": ["Pakistan", "Croatia", "Serbia", "Bulgaria", "Nicaragua"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "rr65vnslEZ0_2", "video_path": "rr65vnslEZ0.mp4", "subtitle_path": "rr65vnslEZ0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 945.32, "view_count": 779803}, {"video_id": "u4g38uWvYGk", "question": "In front of a blue background, a man wearing a black short-sleeved shirt with partially dyed yellow short black hair is standing. 
What is he doing when he first appears on the scene?", "question_wo_referring_query": "What is he doing when he first appears on the scene?", "candidates": ["Performing a Thomas spin in front of the camera", "Hugging a briefcase, spinning in circles in front of the camera", "Raising his arms, spreading his palms, talking to the camera", "Drawing in front of the camera", "Holding a black microphone, singing to the camera"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "u4g38uWvYGk_0", "video_path": "u4g38uWvYGk.mp4", "subtitle_path": "u4g38uWvYGk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1738.36, "view_count": 209012}, {"video_id": "u4g38uWvYGk", "question": "In a garden with purple flowers and green leaves, there is a bee with black stripes. What is the bee doing the first time it appears among the purple flowers?", "question_wo_referring_query": "What is the bee doing the first time it appears among the purple flowers?", "candidates": ["Resting on the green leaf", "Flying above the purple flower", "Flying from the leaf towards the purple flower", "Attached to the purple flower collecting nectar and pollinating", "Flying in circles around the purple flower"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "u4g38uWvYGk_1", "video_path": "u4g38uWvYGk.mp4", "subtitle_path": "u4g38uWvYGk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1738.36, "view_count": 209012}, {"video_id": "u4g38uWvYGk", "question": "In front of a purple background, there is a man standing wearing a black long-sleeve shirt, with short black hair, and a watch on his wrist. 
What is he doing the first time he appears?", "question_wo_referring_query": "What is he doing the first time he appears?", "candidates": ["Running in front of the camera with his arms extended", "Hugging another man wearing a white short-sleeve shirt", "Pointing at the camera and singing", "Raising his arm, with his palms facing each other, talking to the camera", "Crouching down in front of the camera"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "u4g38uWvYGk_2", "video_path": "u4g38uWvYGk.mp4", "subtitle_path": "u4g38uWvYGk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1738.36, "view_count": 209012}, {"video_id": "6gxA9veS_3I", "question": "In front of a grayish brown mountain, a man with gray hair wearing a blue top is standing. To the right of the man, there's some green grass. When the subtitle 'the tests conducted in the late 1960s' appears, what is the man with gray hair doing?", "question_wo_referring_query": "What is the man with gray hair doing?", "candidates": ["Holding a knife and chopping forward", "Bending down to pick something up", "Holding a shield and blocking", "Kneeling on the ground with his head down", "Holding a gun and shooting"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "6gxA9veS_3I_0", "video_path": "6gxA9veS_3I.mp4", "subtitle_path": "6gxA9veS_3I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1095.33, "view_count": 4661286}, {"video_id": "6gxA9veS_3I", "question": "In a blocky house with many colors, there is a man wearing a black hat and a black top. 
What is the man wearing a black hat doing when the subtitle 'company' appears?", "question_wo_referring_query": "What is the man wearing a black hat doing?", "candidates": ["Putting a backpack on his back", "Aiming forward with a gun ready to shoot", "Crawling on the ground", "Running forward with a gun", "Taking off his black hat"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "6gxA9veS_3I_1", "video_path": "6gxA9veS_3I.mp4", "subtitle_path": "6gxA9veS_3I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1095.33, "view_count": 4661286}, {"video_id": "6gxA9veS_3I", "question": "In the sunny weather, there is a yellow plain beside a yellowish hill, and on the plain, there are some silvery objects and a large yellow rock. There is a man wearing a green hat next to the large rock. When the subtitle 'mauser cartridge that the mg42 used the' appears, what is the man wearing the green hat doing?", "question_wo_referring_query": "What is the man wearing a green hat doing?", "candidates": ["Looking forward with a telescope", "Banging his head against the large rock", "Leaning against the large rock and sleeping", "Setting up a gun on the large rock and shooting", "Sitting on the ground drinking water"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "6gxA9veS_3I_2", "video_path": "6gxA9veS_3I.mp4", "subtitle_path": "6gxA9veS_3I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1095.33, "view_count": 4661286}, {"video_id": "l8b5fkHD6R0", "question": "On a white PPT slide, there is black text that reads 'Contributions', and the first point listed on the slide is 'Change Pre-training dataset'. What appears on the PPT slide after this first point?", "question_wo_referring_query": "What appears on the PPT slide after the first point?", "candidates": ["2. Region-Level triplet loss", "2. 
Data Clearing Protocol", "2. How to completely clear data", "2. Cellular data", "2. Clearing of Datasets"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "l8b5fkHD6R0_0", "video_path": "l8b5fkHD6R0.mp4", "subtitle_path": "l8b5fkHD6R0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1029.6, "view_count": 25}, {"video_id": "l8b5fkHD6R0", "question": "On a white PPT slide, there is black text 'Results - TNBC'. Below the black text is a table with a red box around it. Inside the red box, there is a small red dot. After the small red dot moves to the lower left within the red box, what does it do next?", "question_wo_referring_query": "After the small red dot within the red box moves to the lower left, what does it do next?", "candidates": ["Slides left and right within a blue box", "Slides to the upper right within the red box", "Draws a star within the red box", "Draws a heart within the red box", "Circles around within a blue box"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "l8b5fkHD6R0_1", "video_path": "l8b5fkHD6R0.mp4", "subtitle_path": "l8b5fkHD6R0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1029.6, "view_count": 25}, {"video_id": "l8b5fkHD6R0", "question": "On a white PPT slide with numerous black texts that has 'Strengths & weaknesses' written on it, a small red dot appears. 
After sliding left, what does the red dot do?", "question_wo_referring_query": "After sliding left, what does the small red dot do?", "candidates": ["Circles around on the PPT slide", "Draws a heart on the PPT slide", "Circles around the red text 'Strengths & weaknesses' on the PPT slide", "Writes text on the PPT slide", "Circles around the black text 'Strengths & weaknesses' on the PPT slide"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "l8b5fkHD6R0_2", "video_path": "l8b5fkHD6R0.mp4", "subtitle_path": "l8b5fkHD6R0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1029.6, "view_count": 25}, {"video_id": "J74SdkwX2l4", "question": "According to the explanation, which of the following names is mentioned first in the video?", "question_wo_referring_query": "According to the explanation, which of the following names is mentioned first in the video?", "candidates": ["Pete", "R\u00e9my", "P\u00e9rignon", "Napoleon", "Brune"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "J74SdkwX2l4_0", "video_path": "J74SdkwX2l4.mp4", "subtitle_path": "J74SdkwX2l4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1891.08, "view_count": 545282}, {"video_id": "J74SdkwX2l4", "question": "In a pure black background, there is a golden eagle. Below the eagle, there is a golden letter N. Next to the golden eagle, there are several people's pictures. 
Which person's picture appears first beside the golden eagle?", "question_wo_referring_query": "Which person's picture appears first beside the golden eagle?", "candidates": ["P\u00c9RIGNON", "S\u00c9RURIER", "BRUNE", "GROUCHY", "MONCEY"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "J74SdkwX2l4_1", "video_path": "J74SdkwX2l4.mp4", "subtitle_path": "J74SdkwX2l4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1891.08, "view_count": 545282}, {"video_id": "J74SdkwX2l4", "question": "On a black background, there is a marshal's name and a quote from Napoleon reviewing that marshal. In the video, which marshal's review appears first on the black background?", "question_wo_referring_query": "In the video, which marshal's review appears first on the black background?", "candidates": ["GROUCHY", "Marshal Augereau", "Brune", "Marshal Bernadotte", "S\u00c9RURIER"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "J74SdkwX2l4_2", "video_path": "J74SdkwX2l4.mp4", "subtitle_path": "J74SdkwX2l4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1891.08, "view_count": 545282}, {"video_id": "4tHk8jJic3k", "question": "In a kitchen filled with various items, a woman with short black hair, wearing a purple top and a black apron, is standing. 
After she says \"you have to do gloom gelatin and wait\", what does she do?", "question_wo_referring_query": "What does she do after saying \"you have to do gloom gelatin and wait\"?", "candidates": ["Roll the balls between her index finger and thumb", "Hold some balls and display them in front of the camera", "Hold a pack of pearl powder in her hand", "Pour white milk into a blue pot", "Tear open a bag containing substances"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4tHk8jJic3k_0", "video_path": "4tHk8jJic3k.mp4", "subtitle_path": "4tHk8jJic3k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 956.62, "view_count": 2787495}, {"video_id": "4tHk8jJic3k", "question": "In a kitchen with many items, a woman wearing a purple top and a black apron raises a glass bowl with her right hand. After she says 'one egg two tablespoons of coconut flour,' what does she do next?", "question_wo_referring_query": "In a kitchen with many items, a woman wearing a purple top and a black apron raises a glass bowl with her right hand. After she says 'one egg two tablespoons of coconut flour,' what does she do next?", "candidates": ["Hold a packet of boba balls with her left hand", "Pour white milk into a blue pot", "Hold some balls in front of a mirror", "Pour white milk into a mixing machine", "Roll balls with her index fingers"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4tHk8jJic3k_1", "video_path": "4tHk8jJic3k.mp4", "subtitle_path": "4tHk8jJic3k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 956.62, "view_count": 2787495}, {"video_id": "4tHk8jJic3k", "question": "On a white marble countertop, there is a blue pot and a blue pair of scissors. Someone is holding a bowl of melted chocolate. 
After the subtitle says 'chocolate has melted', what does the person holding the bowl of melted chocolate do?", "question_wo_referring_query": "What does the person holding a bowl of melted chocolate do?", "candidates": ["Holds a packet of pearl powder", "Rolls round candies between the thumb and forefinger", "Holds a few round candies in front of the camera", "Pours the melted white chocolate into a piping bag", "Pours white milk into the blue pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "4tHk8jJic3k_2", "video_path": "4tHk8jJic3k.mp4", "subtitle_path": "4tHk8jJic3k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 956.62, "view_count": 2787495}, {"video_id": "baUufWD5G8w", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there is a scene with many pieces of wood burning in a round pit; then, a scene with continuous peaks, covered in white snow; finally, a green meadow with a black and white dog lying beside a wall under a window.", "First, there is a scene with many pieces of wood burning in a round pit; then, a green meadow with a black and white dog lying beside a wall under a window; finally, a scene with continuous peaks, covered in white snow.", "First, there is a green meadow with a black and white dog lying beside a wall under a window; then, continuous peaks, covered in white snow; finally, a scene with many pieces of wood burning in a round pit.", "First, there are continuous peaks, covered in white snow; then, a scene with many pieces of wood burning in a round pit; finally, a green meadow with a black and white dog lying beside a wall under a window.", "First, there are continuous peaks, covered in white snow; then, a green meadow with a black and white dog lying beside a wall under a window; finally, a scene with many pieces 
of wood burning in a round pit."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "baUufWD5G8w_0", "video_path": "baUufWD5G8w.mp4", "subtitle_path": "baUufWD5G8w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1041.9, "view_count": 9208986}, {"video_id": "baUufWD5G8w", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a hand sifting rice in a wooden bowl; then, a person in a green coat chopping garlic on a cutting board with green onions; finally, a man in a green coat standing in front of a wooden table sharpening a knife.", "First, a person in a green coat chopping garlic on a cutting board with green onions; then, a hand sifting rice in a wooden bowl; finally, a man in a green coat standing in front of a wooden table sharpening a knife.", "First, a hand sifting rice in a wooden bowl; then, a man in a green coat standing in front of a wooden table sharpening a knife; finally, a person in a green coat chopping garlic on a cutting board with green onions.", "First, a man in a green coat standing in front of a wooden table sharpening a knife; then, a hand sifting rice in a wooden bowl; finally, a person in a green coat chopping garlic on a cutting board with green onions.", "First, a man in a green coat standing in front of a wooden table sharpening a knife; then, a person in a green coat chopping garlic on a cutting board with green onions; finally, a hand sifting rice in a wooden bowl."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "baUufWD5G8w_1", "video_path": "baUufWD5G8w.mp4", "subtitle_path": "baUufWD5G8w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1041.9, "view_count": 9208986}, {"video_id": "baUufWD5G8w", "question": "Which of 
the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene where a pair of chopsticks are used to pick up food from an iron pot; then, a scene where two sheep of different colors are eating plants from a white bucket; finally, a scene where a man in a green coat is using chopsticks to put food into his mouth.", "First, a scene where a man in a green coat is using chopsticks to put food into his mouth; then, a scene where two sheep of different colors are eating plants from a white bucket; finally, a scene where a pair of chopsticks are used to pick up food from an iron pot.", "First, a scene where two sheep of different colors are eating plants from a white bucket; then, a scene where a man in a green coat is using chopsticks to put food into his mouth; finally, a scene where a pair of chopsticks are used to pick up food from an iron pot.", "First, a scene where two sheep of different colors are eating plants from a white bucket; then, a scene where a pair of chopsticks are used to pick up food from an iron pot; finally, a scene where a man in a green coat is using chopsticks to put food into his mouth.", "First, a scene where a pair of chopsticks are used to pick up food from an iron pot; then, a scene where a man in a green coat is using chopsticks to put food into his mouth; finally, a scene where two sheep of different colors are eating plants from a white bucket."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "baUufWD5G8w_2", "video_path": "baUufWD5G8w.mp4", "subtitle_path": "baUufWD5G8w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1041.9, "view_count": 9208986}, {"video_id": "CW8l_VPgEgI", "question": "In a black space, there is a black microscope with a green SCI logo next to it. 
In which of the following scenes does the green SCI logo appear?", "question_wo_referring_query": "In which of the following scenes does the green SCI logo appear?", "candidates": ["In a scene beside beautiful flowers, a woman in a white dress is standing.", "In a checkered dining room, a man in a black suit is eating a steak.", "In a beautiful park, a man in a white hooded sweatshirt is running forward.", "By the side of a jade-green lake, a man wearing a black short-sleeve shirt is standing.", "In a scene with a green background, a man wearing a shirt with a small fish pattern is standing."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "CW8l_VPgEgI_0", "video_path": "CW8l_VPgEgI.mp4", "subtitle_path": "CW8l_VPgEgI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2062.06, "view_count": 1075568}, {"video_id": "CW8l_VPgEgI", "question": "In front of a green background, a man wearing a striped shirt and black-framed glasses appears with white text on both sides. 
In which of the following scenes does the man in the striped shirt appear?", "question_wo_referring_query": "In which of the following scenes does the man in the striped shirt appear?", "candidates": ["In a scene with a red background and many white chairs on the screen", "In a scene with a blue background with many illustrations", "In a scene with a green background and only a green SCI logo on the screen", "In a quiet park with yellow maple leaves on both sides", "In a scene with a white background and many green leaves on the screen"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "CW8l_VPgEgI_1", "video_path": "CW8l_VPgEgI.mp4", "subtitle_path": "CW8l_VPgEgI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2062.06, "view_count": 1075568}, {"video_id": "CW8l_VPgEgI", "question": "In a scene with a green background, there is a man wearing a short-sleeved shirt with a fish pattern. The man's chest has the white text 'partial melting'. 
In which of the following scenes does the man appear wearing the shirt with the fish pattern?", "question_wo_referring_query": "In which of the following scenes does the man appear wearing the shirt with the fish pattern?", "candidates": ["In a scene with a white background, where the only thing on screen is a green SCI logo", "In a scene with a blue background, where the only thing on screen is a green SCI logo", "In a scene with a yellow background, where the only thing on screen is a green SCI logo", "In a scene with a purple background, where the only thing on screen is a green SCI logo", "In a scene with a green background, where the only thing on screen is a green SCI logo"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "CW8l_VPgEgI_2", "video_path": "CW8l_VPgEgI.mp4", "subtitle_path": "CW8l_VPgEgI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2062.06, "view_count": 1075568}, {"video_id": "DS6i0U5ycKA", "question": "Which subtitles were displayed simultaneously with the man standing in front of a black background, wearing a black short-sleeve shirt, and holding a puppy in one hand?", "question_wo_referring_query": "Which subtitles were displayed simultaneously?", "candidates": ["\"the national sport is gush takiri a form\"", "\"and but i'm feeling home it always had\"", "\"habits and ways of life\"", "\"are also more tajiks in dispora than in\"", "\"been settled by the persian speaking\""], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "DS6i0U5ycKA_0", "video_path": "DS6i0U5ycKA.mp4", "subtitle_path": "DS6i0U5ycKA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1745.45, "view_count": 807829}, {"video_id": "DS6i0U5ycKA", "question": "Standing in front of a black background, a woman with long curly hair wearing a white tight short-sleeved shirt. 
In the upper left corner, there is an image formed by a red, white, and green national flag. What subtitles have appeared on the screen at the same time as this woman?", "question_wo_referring_query": "Which subtitles have appeared on the screen at the same time as this woman?", "candidates": ["\u201cstuff right basically what you get\u201d", "\u201call right music of tajikistan all right\u201d", "\u201cby numerous instruments native to\u201d", "\u201cyou'll see a wide contrast of\u201d", "\u201csea access they gain in sky access huh\u201d"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "DS6i0U5ycKA_1", "video_path": "DS6i0U5ycKA.mp4", "subtitle_path": "DS6i0U5ycKA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1745.45, "view_count": 807829}, {"video_id": "DS6i0U5ycKA", "question": "A man standing in front of a black background, wearing a black short-sleeved shirt with floral prints, with white staff notation featuring black music notes in the upper right corner. Which subtitles appeared simultaneously with this man?", "question_wo_referring_query": "Which subtitles appeared simultaneously with this man?", "candidates": ["\"not getting lumped into all the drama\"", "\"also got to give a shout out to the\"", "\"sometimes clash with what many might\"", "\"hijabs let alone burqas or nay other\"", "\"here things are a little different from\""], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "DS6i0U5ycKA_2", "video_path": "DS6i0U5ycKA.mp4", "subtitle_path": "DS6i0U5ycKA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1745.45, "view_count": 807829}, {"video_id": "MQtSDLOg05c", "question": "A woman wearing a dark top with her hair tied back is drawing in front of a whiteboard, with her back facing a mirror. 
When she appears in a room with her drawings placed around and a white platform behind her, what change does she undergo?", "question_wo_referring_query": "A woman wearing a dark top with her hair tied back is drawing in front of a whiteboard, with her back facing a mirror. When she appears in a room with her drawings placed around and a white platform behind her, what change does she undergo?", "candidates": ["The woman changes from facing the mirror to having her back to the mirror", "The woman changes from not wearing glasses to wearing glasses", "The woman changes from standing to sitting", "The woman changes from dark clothes to light clothes", "The woman changes from not wearing a hat to wearing a hat"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "MQtSDLOg05c_0", "video_path": "MQtSDLOg05c.mp4", "subtitle_path": "MQtSDLOg05c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1194.92, "view_count": 162267}, {"video_id": "MQtSDLOg05c", "question": "What changes occurred to the woman who was wearing a dark top with rolled-up sleeves, with her hair tied up, painting in front of a whiteboard, when she sat on a wooden chair by the window and drew a picture of a woman holding a baby?", "question_wo_referring_query": "What changes occurred to this woman?", "candidates": ["From dark clothing to light clothing", "From white hair to red hair", "From rolled-up sleeves to sleeves down", "From standing to lying down", "From not wearing a hat to wearing a hat"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "MQtSDLOg05c_1", "video_path": "MQtSDLOg05c.mp4", "subtitle_path": "MQtSDLOg05c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1194.92, "view_count": 162267}, {"video_id": "MQtSDLOg05c", "question": "A woman wearing a dark-colored top with rolled-up sleeves, with her hair tied back, is drawing in 
front of a whiteboard, seen from behind in the mirror. This woman is wearing glasses. When she is sitting on a wooden chair by the window, with a painting of a woman holding a baby placed to her right, what change happens to this woman?", "question_wo_referring_query": "What change happens to this woman?", "candidates": ["Changes from dark-colored clothing to light-colored clothing", "Changes from wearing glasses to not wearing glasses", "Changes from light-colored hair to dark-colored hair", "Changes from standing to lying down", "Changes from being indoors to being outdoors"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "MQtSDLOg05c_2", "video_path": "MQtSDLOg05c.mp4", "subtitle_path": "MQtSDLOg05c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1194.92, "view_count": 162267}, {"video_id": "ZFzEk5BaiXo", "question": "On the wooden table, there are several pencils and a blank drawing book. On the left side of the screen, there's a picture of a potted plant with purple petals. When the subtitle 'petals on the left-hand side' appears, what change happens to the drawing book?", "question_wo_referring_query": "What change happens to the drawing book?", "candidates": ["The drawing book changes from white to yellow", "The drawing book changes from white to black", "The drawing book changes from white to colorful", "The drawing book moves from the table to the hand", "The drawing book changes from blank to having drawings"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "ZFzEk5BaiXo_0", "video_path": "ZFzEk5BaiXo.mp4", "subtitle_path": "ZFzEk5BaiXo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1758.3, "view_count": 31542}, {"video_id": "ZFzEk5BaiXo", "question": "On the wooden table, there are several pencils and a sketchbook with three square sketches. 
On the left side of the screen, there is a photo of a pot with purple flowers. What change occurs in the sketchbook when the subtitle 'right hand side of our drawing' appears?", "question_wo_referring_query": "What change occurs in the sketchbook?", "candidates": ["The drawing changes to a blank space", "The drawing changes from colorless to colored", "The pencil drawing changes to a pen drawing", "The three square sketches change to two square sketches", "The three square sketches change to a sketch of potted flowers"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "ZFzEk5BaiXo_1", "video_path": "ZFzEk5BaiXo.mp4", "subtitle_path": "ZFzEk5BaiXo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1758.3, "view_count": 31542}, {"video_id": "ZFzEk5BaiXo", "question": "On the wooden desk, there are several pencils and a sketchbook with drawings of potted flowers. On the left side of the screen, there's a photo of a potted plant with purple petals. When the subtitle 'our simplified shapes again' appears, what changes occur to the sketch on the sketchbook?", "question_wo_referring_query": "What changes occur to the sketch on the sketchbook?", "candidates": ["The sketch changes to a colored drawing", "The sketch of the potted plant changes to a small animal", "The sketch changes to a blank space", "The sketch of the potted plant changes to a tree", "The sketch changes to a shaded drawing"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "ZFzEk5BaiXo_2", "video_path": "ZFzEk5BaiXo.mp4", "subtitle_path": "ZFzEk5BaiXo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1758.3, "view_count": 31542}, {"video_id": "yWZiO7YNoPQ", "question": "Two men are standing in front of a black background. 
The man on the left is wearing a tight gray short-sleeve shirt and has his hair tied up, while the man on the right is wearing a green short-sleeve shirt. When a picture appears between their heads with a big red X on it, what action is the man on the right doing?", "question_wo_referring_query": ", what action is the man on the right doing?", "candidates": ["Raising one hand next to his neck in a gesture", "Hands hanging down", "Touching his head", "Waving", "Crossing his hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "yWZiO7YNoPQ_0", "video_path": "yWZiO7YNoPQ.mp4", "subtitle_path": "yWZiO7YNoPQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.45, "view_count": 3687431}, {"video_id": "yWZiO7YNoPQ", "question": "There are two men standing in front of a black background. The man on the left is wearing a tight grey short-sleeve shirt with dreadlocks, and the man on the right is wearing a green short-sleeve shirt. 
When a pair of hands holding a gun extends from the left side of the screen towards the man in grey, what action is the man in the grey shirt performing?", "question_wo_referring_query": "What action is the man in the grey shirt performing?", "candidates": ["Both hands are hanging down", "Both arms are crossed in front of his chest", "Both hands are on his head", "Both hands are raised above his head", "Both hands are making OK gestures with fingers constantly shaking"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "yWZiO7YNoPQ_1", "video_path": "yWZiO7YNoPQ.mp4", "subtitle_path": "yWZiO7YNoPQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.45, "view_count": 3687431}, {"video_id": "yWZiO7YNoPQ", "question": "A man is standing in front of a black background, wearing a gray tight-fitting short-sleeve shirt, with curly hair on the side, facing the camera, holding a cell phone. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Waving at the camera", "Chatting with a man in green clothes", "Browsing the phone", "Sending a text message", "Talking on the phone"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "yWZiO7YNoPQ_2", "video_path": "yWZiO7YNoPQ.mp4", "subtitle_path": "yWZiO7YNoPQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.45, "view_count": 3687431}, {"video_id": "FjbmyOGve9M", "question": "A man in a dark suit is standing on stage holding a microphone. On both sides of the silver screen are dark red curtains, and there are four chairs on the stage. 
What objects are present in the scene?", "question_wo_referring_query": ", what objects are present in the scene?", "candidates": ["television", "washing machine", "audience", "refrigerator", "mobile phone"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "FjbmyOGve9M_0", "video_path": "FjbmyOGve9M.mp4", "subtitle_path": "FjbmyOGve9M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1356.06, "view_count": 31726}, {"video_id": "FjbmyOGve9M", "question": "There are four people sitting on chairs on stage. The man on the far right is wearing a khaki coat and white shoes, talking with a woman next to him. The two men on the left are holding microphones and conversing. What objects are present on the screen at this time?", "question_wo_referring_query": "There are four people sitting on chairs on stage. The man on the far right is wearing a khaki coat and white shoes, talking with a woman next to him. The two men on the left are holding microphones and conversing. What objects are present on the screen at this time?", "candidates": ["yellow shoes", "white coat", "TV screen", "mineral water", "hat"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "FjbmyOGve9M_1", "video_path": "FjbmyOGve9M.mp4", "subtitle_path": "FjbmyOGve9M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1356.06, "view_count": 31726}, {"video_id": "FjbmyOGve9M", "question": "The man on the right side of the screen is wearing a khaki coat and holding a receiver, while the woman on the left is wearing glasses, a black checkered coat, and is holding a receiver, talking. 
What items are present on the screen at this time?", "question_wo_referring_query": ", what items are present on the screen at this time?", "candidates": ["Pen", "Computer", "Necklace", "Hat", "Mobile Phone"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "FjbmyOGve9M_2", "video_path": "FjbmyOGve9M.mp4", "subtitle_path": "FjbmyOGve9M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1356.06, "view_count": 31726}, {"video_id": "VRX_1pk4aMc", "question": "The screen shows a mound of dirt with a yellow top and some scattered stones. There is a silver protective fence on the mound, and a red downward-pointing arrow appears on the screen. When the subtitle 'River once flowed through here' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["rose", "car", "bird", "rabbit", "grass field"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "VRX_1pk4aMc_0", "video_path": "VRX_1pk4aMc.mp4", "subtitle_path": "VRX_1pk4aMc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.53, "view_count": 23272}, {"video_id": "VRX_1pk4aMc", "question": "In front of the green trees under the blue sky, on the golden glowing stone wall under the sunlight, there is a protective railing. In front of the stone wall, there is a road. 
When the subtitle 'beautiful outcrub that was created like' appears, what object is present in the screen?", "question_wo_referring_query": "When the subtitle 'beautiful outcrub that was created like' appears, what object is present in the screen?", "candidates": ["Car", "High-rise building", "Utility pole", "Zebra crossing", "White cloud"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "VRX_1pk4aMc_1", "video_path": "VRX_1pk4aMc.mp4", "subtitle_path": "VRX_1pk4aMc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.53, "view_count": 23272}, {"video_id": "VRX_1pk4aMc", "question": "Under the blue sky, there is a small town with lush vegetation. In the town, there is a straight road, and when the subtitle 'here we have more anticlines visible' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Car", "River", "Mountain", "Sun", "White clouds"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "VRX_1pk4aMc_2", "video_path": "VRX_1pk4aMc.mp4", "subtitle_path": "VRX_1pk4aMc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1163.53, "view_count": 23272}, {"video_id": "vfVjU6YKsc8", "question": "The background shows a display table with a decorative lamp hanging from it. On the right side, there's a dark khaki-colored curtain. A man is standing in front of a mirror holding a thick book. 
What is he wearing when he opens the book?", "question_wo_referring_query": "What is he wearing?", "candidates": ["leather jacket", "long sleeves", "hooded coat", "short sleeves", "denim jacket"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "vfVjU6YKsc8_0", "video_path": "vfVjU6YKsc8.mp4", "subtitle_path": "vfVjU6YKsc8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 24316}, {"video_id": "vfVjU6YKsc8", "question": "The background is a shelf with decorative lights hanging on it, a dark khaki curtain on the right side, and a man wearing a black short-sleeved shirt facing a mirror with one hand holding an orange-covered book. What type of hairstyle does he have?", "question_wo_referring_query": "What type of hairstyle does he have?", "candidates": ["Black short hair", "Brown long hair", "Blonde short hair", "Blonde curly hair", "White short hair"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "vfVjU6YKsc8_1", "video_path": "vfVjU6YKsc8.mp4", "subtitle_path": "vfVjU6YKsc8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 24316}, {"video_id": "vfVjU6YKsc8", "question": "The room is dimly lit with a red light. A man is standing with a group photo behind him. What is the man's expression at this moment?", "question_wo_referring_query": "What is the man's expression at this moment?", "candidates": ["Angry", "Heartbroken", "Joyful", "Happy", "Horror"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "vfVjU6YKsc8_2", "video_path": "vfVjU6YKsc8.mp4", "subtitle_path": "vfVjU6YKsc8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 24316}, {"video_id": "mhyv-nRXGmc", "question": "There's a wooden shovel stir-frying orange and red vegetables in a black iron pot. 
When the subtitle 'We live about 2 minutes' appears, what is the shape of these vegetables?", "question_wo_referring_query": ", what is the shape of these vegetables?", "candidates": ["liquid state", "block-shaped", "stir-fried into black mush", "strip-shaped", "fragmented"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "mhyv-nRXGmc_0", "video_path": "mhyv-nRXGmc.mp4", "subtitle_path": "mhyv-nRXGmc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 911.44, "view_count": 213066}, {"video_id": "mhyv-nRXGmc", "question": "When a pair of hands is chopping tofu on a wooden cutting board and the subtitle 'Today I'm cooking a very interesting and simple dish!' appears, what shape is the tofu being cut into?", "question_wo_referring_query": "What shape is the tofu being cut into?", "candidates": ["tofu strips", "tofu cubes", "mashed tofu", "shredded tofu", "tofu slices"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "mhyv-nRXGmc_1", "video_path": "mhyv-nRXGmc.mp4", "subtitle_path": "mhyv-nRXGmc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 911.44, "view_count": 213066}, {"video_id": "mhyv-nRXGmc", "question": "There are cut potato pieces on the cutting board. A woman in a black dress is standing in front of the cutting board. On the left side, there is a pot with a floral pattern and a white interior. 
When the subtitle 'I am very happy to welcome you to my channel' appears, what is the material of this pot?", "question_wo_referring_query": "What is the material of this pot?", "candidates": ["stone", "ceramic", "iron", "steel", "glass"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "mhyv-nRXGmc_2", "video_path": "mhyv-nRXGmc.mp4", "subtitle_path": "mhyv-nRXGmc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 911.44, "view_count": 213066}, {"video_id": "RrcRZSh9z60", "question": "In the scene, there is a man wearing a gray shirt, behind him is a man wearing police uniform and glasses, and another man in a black suit with a tie. Who is the person executed in the Florida State Prison?", "question_wo_referring_query": "Who is the person?", "candidates": ["Bob", "Jack", "Tom", "Ted Bundy", "Alice"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "RrcRZSh9z60_0", "video_path": "RrcRZSh9z60.mp4", "subtitle_path": "RrcRZSh9z60_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1810.02, "view_count": 1099998}, {"video_id": "RrcRZSh9z60", "question": "Who is the woman with short white curly hair, wearing a black top, and holding a microphone, singing 'Express Yourself' on stage?", "question_wo_referring_query": "Who is it?", "candidates": ["Bai Li", "Michael Jackson", "Tom", "Madonna", "Yao Beina"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "RrcRZSh9z60_1", "video_path": "RrcRZSh9z60.mp4", "subtitle_path": "RrcRZSh9z60_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1810.02, "view_count": 1099998}, {"video_id": "RrcRZSh9z60", "question": "On a black background, there is a rectangular sepia-toned photo of two elderly men. Below the photo, there are blue illuminated letters. 
Which country was the first in the world to legally recognize same-sex marriage?", "question_wo_referring_query": "Which country?", "candidates": ["Turkey", "Denmark", "Switzerland", "United States", "Iceland"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "RrcRZSh9z60_2", "video_path": "RrcRZSh9z60.mp4", "subtitle_path": "RrcRZSh9z60_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1810.02, "view_count": 1099998}, {"video_id": "gI1Y5dlNfF8", "question": "The background is a bookshelf filled with books. A woman with brown hair, dressed in black, appears for the first time in front of the bookshelf. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is eating", "She is cooking", "She is looking at her phone", "She is talking in front of the camera", "She is chatting with a friend"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "gI1Y5dlNfF8_0", "video_path": "gI1Y5dlNfF8.mp4", "subtitle_path": "gI1Y5dlNfF8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2422.87, "view_count": 2598}, {"video_id": "gI1Y5dlNfF8", "question": "In the scene, three women are sitting on a sofa. There are handicrafts and scissors placed on a tablecloth with a red floral pattern in front of them. 
When the white-haired woman wearing glasses and dressed in red on the far right first appears, what is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is raising her hand to wave", "She is making handicrafts", "She is shaking her head", "She is on a video call with the woman in front of the bookshelf", "She is talking to the woman beside her"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "gI1Y5dlNfF8_1", "video_path": "gI1Y5dlNfF8.mp4", "subtitle_path": "gI1Y5dlNfF8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2422.87, "view_count": 2598}, {"video_id": "gI1Y5dlNfF8", "question": "In front of a white wall, a woman with short dark hair tied with a maple leaf-patterned scarf appears sitting on a black chair for the first time. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Eating", "Making a phone call", "Talking to the camera", "Cooking", "Watching a video of a woman in front of a bookshelf"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "gI1Y5dlNfF8_2", "video_path": "gI1Y5dlNfF8.mp4", "subtitle_path": "gI1Y5dlNfF8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2422.87, "view_count": 2598}, {"video_id": "A_3R-Rkn_98", "question": "In the PPT slide with a dark blue background, there is a red circular icon with a white car on the left side. 
What happens when the red circular icon with a white clock appears in the middle?", "question_wo_referring_query": "What happens?", "candidates": ["A red circular icon with a white head appears.", "A yellow circular icon with a white head appears.", "A red triangle appears.", "A red circular icon with a white airplane appears.", "A black square appears."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "A_3R-Rkn_98_0", "video_path": "A_3R-Rkn_98.mp4", "subtitle_path": "A_3R-Rkn_98_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1184.23, "view_count": 964597}, {"video_id": "A_3R-Rkn_98", "question": "In the PPT slide with a dark blue background, there is a red circle on the left with a white globe icon inside. After a white circle with a red star icon appears in the center, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["A red circle with a white lighthouse icon inside appears", "A white circle with a black hammer icon inside appears", "A white triangle appears", "A red circle with a white fist icon inside appears", "A black hook appears"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "A_3R-Rkn_98_1", "video_path": "A_3R-Rkn_98.mp4", "subtitle_path": "A_3R-Rkn_98_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1184.23, "view_count": 964597}, {"video_id": "A_3R-Rkn_98", "question": "In the PPT slide with a deep blue background, there are a bunch of small white human icons at the top. 
What happens after these white human icons turn black?", "question_wo_referring_query": "What happens?", "candidates": ["A bunch of red small human icons appear at the bottom", "A red tiger icon appears", "A yellow circle with a white human head icon inside appears", "A red circle with a white fist icon inside appears", "A white triangle appears"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "A_3R-Rkn_98_2", "video_path": "A_3R-Rkn_98.mp4", "subtitle_path": "A_3R-Rkn_98_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1184.23, "view_count": 964597}, {"video_id": "jI8aXKR9h-I", "question": "In the news segment, the bottom part shows subtitles for explanations, the right side displays a newspaper image, and on the left side, there is a woman with blonde hair wearing a blue suit. After the blonde woman appears on screen, which character shows up first?", "question_wo_referring_query": "Which character shows up first?", "candidates": ["A woman wearing a red suit with short hair", "A man driving a car", "A little girl wearing a T-shirt", "An elderly person lying in bed", "A woman wearing a black suit with long brown hair and a ring"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "jI8aXKR9h-I_0", "video_path": "jI8aXKR9h-I.mp4", "subtitle_path": "jI8aXKR9h-I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1125.4, "view_count": 10846}, {"video_id": "jI8aXKR9h-I", "question": "In the news footage, there are subtitles at the bottom explaining the content, a newspaper image and two people in black clothes on the right, and a woman in a blue suit with blonde hair on the left. 
After the blonde woman appears, which of the following objects appears first?", "question_wo_referring_query": "Which of the following objects appears first?", "candidates": ["red bus", "green tank", "black car", "white car", "child kicking a soccer ball"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "jI8aXKR9h-I_1", "video_path": "jI8aXKR9h-I.mp4", "subtitle_path": "jI8aXKR9h-I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1125.4, "view_count": 10846}, {"video_id": "jI8aXKR9h-I", "question": "Which newspaper character appears first in the video?", "question_wo_referring_query": "Which newspaper character appears first in the video?", "candidates": ["A couple facing the camera holding a white pole, with 'HENK'S HEROES' printed on the cover, and a short-haired man in a black round-neck shirt on the bottom right", "A newspaper headline printed with 'FINANCIAL TIMES' with a man facing away from the camera on the bottom right", "A cover printed with a portrait of Trump", "A man wearing a blue short-sleeved shirt with short hair, with a silver trophy beside him"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "jI8aXKR9h-I_2", "video_path": "jI8aXKR9h-I.mp4", "subtitle_path": "jI8aXKR9h-I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1125.4, "view_count": 10846}, {"video_id": "1ITtHAUDEbg", "question": "On a stage illuminated by orange-red lights, there is a large screen displaying some Korean text. Nine people are standing in a row, with the person on the far left wearing a white top and yellow skirt, holding a script and a microphone. What happened on the screen before the caption 'Where should I stand?' 
appeared?", "question_wo_referring_query": "What happened on the screen?", "candidates": ["Nine people gave individual performances one by one", "Nine people danced on the brightly lit stage", "Nine people were interviewed one by one", "Nine people introduced themselves one by one", "Nine people sat on the stage for an interview"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "1ITtHAUDEbg_0", "video_path": "1ITtHAUDEbg.mp4", "subtitle_path": "1ITtHAUDEbg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1181.95, "view_count": 132879}, {"video_id": "1ITtHAUDEbg", "question": "A woman with long straight black hair, wearing a black suit jacket, stands in the middle. On either side of her are two men wearing headsets and microphones. What happens first on screen after the subtitle 'I'm on cloud nine' appears?", "question_wo_referring_query": "What happens first on screen?", "candidates": ["The performer on stage introduces themselves.", "The performer dances on a brilliantly lit stage.", "Nine individuals are interviewed one after another.", "The host interviews the performer.", "Nine individuals showcase their personal talents."], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "1ITtHAUDEbg_1", "video_path": "1ITtHAUDEbg.mp4", "subtitle_path": "1ITtHAUDEbg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1181.95, "view_count": 132879}, {"video_id": "1ITtHAUDEbg", "question": "The female host, wearing a white top and yellow inner layer, is sitting on a chair with a mic and a script in her hands. There is a male with orange hair sitting in front of a computer behind her. When the subtitle 'Let's hear ENHYPEN's version of the song.' 
appears, what happens first on the screen?", "question_wo_referring_query": "What happens first on the screen?", "candidates": ["Nine people dance on a brightly lit stage", "A male wearing a black leather jacket and black pants performs a personal talent show", "Seven male group members stand on the stage and sing", "The performer on stage introduces themselves", "A girl in the audience wearing pink clothes covers her mouth and laughs"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "1ITtHAUDEbg_2", "video_path": "1ITtHAUDEbg.mp4", "subtitle_path": "1ITtHAUDEbg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1181.95, "view_count": 132879}, {"video_id": "SC4nwtBAYPA", "question": "A man wearing a bean green shirt and keeping a beard is sitting in front of a background decorated with green plants and lights. To the left and rear side, there is a man wearing earphones and a dark purple shirt. What object first appears on the screen after the subtitle 'see this we will have some of them live' appears?", "question_wo_referring_query": "What object first appears on the screen?", "candidates": ["A black dog with a collar standing on the beach", "A large piece of green oily farmland at dusk", "A motorcycle", "A woman wearing a white tank top and white pants sitting on a sofa", "A small black Insta360 X3 camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "SC4nwtBAYPA_0", "video_path": "SC4nwtBAYPA.mp4", "subtitle_path": "SC4nwtBAYPA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1432.8, "view_count": 201581}, {"video_id": "SC4nwtBAYPA", "question": "A small yellow dog with fluffy fur is digging a water hole on the beach. 
After the subtitle 'her face through the mud it was an' appears, what object appears first on the screen?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["A black dog lying on the dark blue sand", "A woman wearing a white strap top and white shorts sitting on the sand", "White tiles on the ground arranged in many small squares", "A large expanse of green farmland under the twilight", "Green banana leaves in front of a house"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "SC4nwtBAYPA_1", "video_path": "SC4nwtBAYPA.mp4", "subtitle_path": "SC4nwtBAYPA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1432.8, "view_count": 201581}, {"video_id": "SC4nwtBAYPA", "question": "A man with some gray hair wearing glasses is facing two computers. The screens show a 3D room image. After the text 'potentially making a darker floor i' appears, what is the first object to appear on the screen?", "question_wo_referring_query": "What is the first object to appear on the screen?", "candidates": ["A woman wearing a white tank top and white pants sitting on a sofa", "A green banana leaf on the ground", "A black dog lying on a dark blue sofa", "A white floor tile composed of many small squares", "A white tile with speckles"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "SC4nwtBAYPA_2", "video_path": "SC4nwtBAYPA.mp4", "subtitle_path": "SC4nwtBAYPA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1432.8, "view_count": 201581}, {"video_id": "RUIV-cWBCIc", "question": "In front of the glass door inside the mall, someone is holding two white boxes containing AirTags with their hand. There's a black mat on the ground and a gray car outside the glass door. 
In which other scenes do these two boxed AirTags appear?", "question_wo_referring_query": "In which other scenes do these two boxed AirTags appear?", "candidates": ["On the wooden table in the home of a man wearing a black t-shirt and curly hair", "On the floor with a white polka-dot mat", "Inside a black backpack", "On a trash can outside the house", "On the wooden table with three rolls of tape and a water bottle"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "RUIV-cWBCIc_0", "video_path": "RUIV-cWBCIc.mp4", "subtitle_path": "RUIV-cWBCIc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1551.76, "view_count": 14958}, {"video_id": "RUIV-cWBCIc", "question": "In front of a blue background wall, a man wearing a black shirt with a small microphone clipped to his collar and a silver necklace has appeared in which other scenes?", "question_wo_referring_query": "In which other scenes has he appeared?", "candidates": ["On a postal boat on the blue ocean", "In front of a large map on a wall", "Next to a white pickup truck outdoors", "Next to a factory with a river flowing nearby", "Next to a blue garbage truck that is tilting trash"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "RUIV-cWBCIc_1", "video_path": "RUIV-cWBCIc.mp4", "subtitle_path": "RUIV-cWBCIc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1551.76, "view_count": 14958}, {"video_id": "RUIV-cWBCIc", "question": "In a parking lot between two parking zones, there is a solid white line painted down the middle of the road. On the screen, someone is holding some food packages of different colors. 
In which of the following scenes have these packages also appeared?", "question_wo_referring_query": "In which of the following scenes have these packages also appeared?", "candidates": ["Inside a black backpack\n", "On an outdoor trash can", "Next to the white plastic recycling bin outside the mall entrance", "On a mail ship at sea", "On the ground covered with white polka dots"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "RUIV-cWBCIc_2", "video_path": "RUIV-cWBCIc.mp4", "subtitle_path": "RUIV-cWBCIc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1551.76, "view_count": 14958}, {"video_id": "ZbGz-69fyjk", "question": "In the live broadcast's PPT screen, on the left side there is a photo on a white background with the image of a black statue in the center, and on the right side there is a picture of a standing lion. Which subtitles have appeared together with the lion picture on the right side of the screen?", "question_wo_referring_query": "Which subtitles have appeared together with the lion picture on the right side of the screen?", "candidates": ["Here is how you can find us", "clear from her lion hat", "Love deserves all the admiring words", "Do come on time", "and love is even beyond the life and death"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "ZbGz-69fyjk_0", "video_path": "ZbGz-69fyjk.mp4", "subtitle_path": "ZbGz-69fyjk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.8, "view_count": 34295}, {"video_id": "ZbGz-69fyjk", "question": "In the PPT screen of the live broadcast, the sides are gray, and in the middle is a picture of a stone tablet with engraved text and human figures. In the top right corner, there is a screen showing a woman dressed in black speaking. 
With which subtitles did the stone tablet image appear together on the screen?", "question_wo_referring_query": "With which subtitles did the stone tablet image appear together on the screen?", "candidates": ["people go there to buy things", "I did my homework on first day", "my senior high school English teacher", "several of the figures of the story in", "took out the food we prepared"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "ZbGz-69fyjk_1", "video_path": "ZbGz-69fyjk.mp4", "subtitle_path": "ZbGz-69fyjk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.8, "view_count": 34295}, {"video_id": "ZbGz-69fyjk", "question": "In the PPT slide during the livestream, there\u2019s an image on the left side showing a white background with a black sculpture in the middle, while on the right side, there\u2019s an image of four seated statues in front of a gray brick wall. With which subtitles has the image on the right side appeared together?", "question_wo_referring_query": "With which subtitles has the image on the right side appeared together?", "candidates": ["man should not depend on lucky whic", "meant as an eternal ritual calendar in", "are the most valuable things in the world", "will get ready to face the future", "is good and what is evil"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "ZbGz-69fyjk_2", "video_path": "ZbGz-69fyjk.mp4", "subtitle_path": "ZbGz-69fyjk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.8, "view_count": 34295}, {"video_id": "HILPnXLVhCU", "question": "In the video conversation between two people, the left side shows an image with a white background with mathematical formulas written in the upper half. In the middle of the image, there is a long horizontal line. 
In the center, a woman is wearing a dark green coat, and on the right side is a woman wearing a grey outfit. When the subtitle 'still do it that I would ultimately end' appears, what change occurred to the long horizontal line in the image on the left?", "question_wo_referring_query": "In the video conversation between two people, the left side shows an image with a white background with mathematical formulas written in the upper half. In the middle of the image, there is a long horizontal line. In the center, a woman is wearing a dark green coat, and on the right side is a woman wearing a grey outfit. When the subtitle 'still do it that I would ultimately end' appears, what change occurred to the long horizontal line in the image on the left?", "candidates": ["Disappeared", "Turned red", "Shrunk", "Shrunk and moved to the upper left", "Turned into a vertical line"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "HILPnXLVhCU_0", "video_path": "HILPnXLVhCU.mp4", "subtitle_path": "HILPnXLVhCU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1495.33, "view_count": 28366}, {"video_id": "HILPnXLVhCU", "question": "In the conversation video between two people, the left side is a picture with a white background and black text filled with formulas, the middle shows a woman wearing a dark green coat with a black calculator under her hand, and the right side is a woman wearing a gray coat. 
When the subtitle 'pretty much just plugging it into' appears, what change occurs to the calculator on the screen?", "question_wo_referring_query": "What change occurs to the calculator on the screen?", "candidates": ["Disappears", "Turns white", "Becomes a circle", "Shrinks and moves to the lower left", "Turns red"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "HILPnXLVhCU_1", "video_path": "HILPnXLVhCU.mp4", "subtitle_path": "HILPnXLVhCU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1495.33, "view_count": 28366}, {"video_id": "HILPnXLVhCU", "question": "In the video dialogue scene, there is a woman wearing a dark green coat in the middle, a woman in gray clothes on the right, and a formula encircled by a dotted line on a white background on the left. When the subtitle 'this is like the whole purpose of me' appears, what change happens to the formula enclosed by the dotted line in the picture on the left?", "question_wo_referring_query": "What change happens to the formula enclosed by the dotted line in the picture on the left?", "candidates": ["Shrunk", "Disappeared", "Shrunk", "Moved to the right", "Moved up"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "HILPnXLVhCU_2", "video_path": "HILPnXLVhCU.mp4", "subtitle_path": "HILPnXLVhCU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1495.33, "view_count": 28366}, {"video_id": "94pmvObiLSg", "question": "On a white screen, there is a square made up of blocks in the lower-left corner. Inside the square, there is a smaller green square. 
What is the green square doing?", "question_wo_referring_query": "What is the green square doing?", "candidates": ["Moving left inside the larger square", "Moving down inside the larger square", "Moving randomly inside the larger square", "Moving up inside the larger square", "Moving right inside the larger square"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "94pmvObiLSg_0", "video_path": "94pmvObiLSg.mp4", "subtitle_path": "94pmvObiLSg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2143.73, "view_count": 596}, {"video_id": "94pmvObiLSg", "question": "On a white screen with the word 'Demo', there are three squares: one large square and two small squares. One of the small squares is made up of blue blocks. What are the blue blocks inside the small square doing?", "question_wo_referring_query": "What are the blue blocks inside the small square doing?", "candidates": ["Moving left", "Moving down", "Moving right", "Moving up", "Constantly disappearing"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "94pmvObiLSg_1", "video_path": "94pmvObiLSg.mp4", "subtitle_path": "94pmvObiLSg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2143.73, "view_count": 596}, {"video_id": "94pmvObiLSg", "question": "On a white screen that says 'Visualizing Convolution', there are three rectangles made up of small squares of different colors arranged in parallel. In the middle rectangle, there is a small red dot. 
What is the small red dot doing?", "question_wo_referring_query": "What is the small red dot doing?", "candidates": ["Coloring inside the middle rectangle", "Spinning around inside the middle rectangle", "Drawing inside the middle rectangle", "Writing inside the middle rectangle", "Drawing an X inside the middle rectangle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "94pmvObiLSg_2", "video_path": "94pmvObiLSg.mp4", "subtitle_path": "94pmvObiLSg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2143.73, "view_count": 596}, {"video_id": "5HKhe4T_Qzc", "question": "A white shelf holding various books, with a mannequin dressed in a blue dress beside it. In front of the shelf sits a woman with long hair wearing an off-shoulder top. Which objects are present in the scene?", "question_wo_referring_query": "Which objects are present in the scene?", "candidates": ["Red flower", "Golden necklace", "Blue hair ornament", "White porcelain doll", "Green plant"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "5HKhe4T_Qzc_0", "video_path": "5HKhe4T_Qzc.mp4", "subtitle_path": "5HKhe4T_Qzc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2930.56, "view_count": 1201}, {"video_id": "5HKhe4T_Qzc", "question": "A wooden-colored door, and beside it, there is a wooden-colored shelf holding some file folders. In front of the shelf sits a woman with long hair wearing a black top. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Green office chair", "White in-ear headphones", "White over-ear headphones", "Black in-ear headphones", "Black over-ear headphones"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "5HKhe4T_Qzc_1", "video_path": "5HKhe4T_Qzc.mp4", "subtitle_path": "5HKhe4T_Qzc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2930.56, "view_count": 1201}, {"video_id": "5HKhe4T_Qzc", "question": "In a four-section screen, there are two women and two men. The upper half shows women with long black hair and long blonde hair, while the lower half shows men with short hair and men with curly hair. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Gold hair clip", "Blue cabinet", "White shawl", "Blue dress", "Green chair"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "5HKhe4T_Qzc_2", "video_path": "5HKhe4T_Qzc.mp4", "subtitle_path": "5HKhe4T_Qzc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2930.56, "view_count": 1201}, {"video_id": "qbDgjwB93i8", "question": "In a dimly lit room, a man wearing a sleeveless black shirt, with short hair and a necklace, is speaking. 
When he says 'and you've got a lot of missing pieces,' what objects are present in the frame?", "question_wo_referring_query": "what objects are present in the frame?", "candidates": ["golden necklace", "green hat", "yellow incense burner", "black backpack", "green pillow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "qbDgjwB93i8_0", "video_path": "qbDgjwB93i8.mp4", "subtitle_path": "qbDgjwB93i8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1025.73, "view_count": 604885}, {"video_id": "qbDgjwB93i8", "question": "In a valley, there is a white waterfall. There are green plants on both sides of the waterfall, and in front of the waterfall, there is a man wearing a black sleeveless shirt and a green hat. When the subtitle 'Now that we're shooting in front of these people' appears, which objects are present on the screen?", "question_wo_referring_query": "Which objects are present on the screen?", "candidates": ["White camera", "Black backpack", "Red flower", "Gold necklace", "Black camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "qbDgjwB93i8_1", "video_path": "qbDgjwB93i8.mp4", "subtitle_path": "qbDgjwB93i8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1025.73, "view_count": 604885}, {"video_id": "qbDgjwB93i8", "question": "In a warmly furnished room, there are some natural wood-colored tables and chairs. Above, two blackboards are hanging with some content written in colored chalk on them. A woman wearing a yellow top places her hand on the natural wood-colored table. 
When the subtitle 'You guys are amazing' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Blue denim shorts", "Black long skirt", "Notebook computer", "Green watermelon", "Book with a blue cover"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "qbDgjwB93i8_2", "video_path": "qbDgjwB93i8.mp4", "subtitle_path": "qbDgjwB93i8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1025.73, "view_count": 604885}, {"video_id": "mRxbGYSHP2w", "question": "Against a green background, there are two circles. In one circle, there is a woman wearing a blue top, and in the other circle, there is a man wearing a hat. Beside the circles, there are some lines of white text. What type of earbuds is the woman in the blue top wearing?", "question_wo_referring_query": "What type of earbuds is the woman in the blue top wearing?", "candidates": ["White Bluetooth earbuds", "Black in-ear earbuds", "White head-mounted earbuds", "White in-ear earbuds", "Black head-mounted earbuds"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "mRxbGYSHP2w_0", "video_path": "mRxbGYSHP2w.mp4", "subtitle_path": "mRxbGYSHP2w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1325.1, "view_count": 2224}, {"video_id": "mRxbGYSHP2w", "question": "In the top left corner, there is a 3D globe icon. In the top right corner, there is a red triangle. Below the triangle, there are two circles: one with a woman and one with a man. 
What color clothes is the woman wearing in the circle in the middle of the screen?", "question_wo_referring_query": "What color clothes is the woman wearing in the circle in the middle of the screen?", "candidates": ["blue", "white", "black", "red", "purple"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "mRxbGYSHP2w_1", "video_path": "mRxbGYSHP2w.mp4", "subtitle_path": "mRxbGYSHP2w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1325.1, "view_count": 2224}, {"video_id": "mRxbGYSHP2w", "question": "There is a red horizontal stripe in the middle, with a globe icon above the red stripe. Next to the red stripe are two circles, each containing a person. Above the circles is a red icon. What shape is the red icon above the circles?", "question_wo_referring_query": "What shape is the red icon above the circles?", "candidates": ["Square", "Triangle", "Star", "Rhombus", "Circle"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "mRxbGYSHP2w_2", "video_path": "mRxbGYSHP2w.mp4", "subtitle_path": "mRxbGYSHP2w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1325.1, "view_count": 2224}, {"video_id": "beTEeD9GqGk", "question": "On a blue table, there is a brown slide, and below the slide are colorful balls. In front of the slide, a man with short hair and glasses is speaking. 
When he says 'week's Spotlight we're focusing on,' what style of jacket is he wearing?", "question_wo_referring_query": "What style of jacket is he wearing?", "candidates": ["Black blazer", "White hooded sweatshirt", "Black hooded sweatshirt", "White long-sleeve jacket with chains", "Black long-sleeve jacket with chains"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "beTEeD9GqGk_0", "video_path": "beTEeD9GqGk.mp4", "subtitle_path": "beTEeD9GqGk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1586.89, "view_count": 9090}, {"video_id": "beTEeD9GqGk", "question": "A wooden scaffold with some objects placed on it is shown. Next to the scaffold, there is a table. In front of the table, a woman with long hair is standing and talking. When she says \"is not not just a big number it's a big,\" what color clothes is she wearing?", "question_wo_referring_query": "What color clothes is she wearing?", "candidates": ["Red", "Green", "Black", "White", "Blue"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "beTEeD9GqGk_1", "video_path": "beTEeD9GqGk.mp4", "subtitle_path": "beTEeD9GqGk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1586.89, "view_count": 9090}, {"video_id": "beTEeD9GqGk", "question": "A huge window with white floral curtains on either side, and in front of the window is a red table with some boxes on it. A girl wearing a white top and a tie is speaking. 
What is the color of her hair when she says \"warning system my project is called\"?", "question_wo_referring_query": "What is the color of her hair?", "candidates": ["black", "silver", "gold", "green", "blue"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "beTEeD9GqGk_2", "video_path": "beTEeD9GqGk.mp4", "subtitle_path": "beTEeD9GqGk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1586.89, "view_count": 9090}, {"video_id": "Ztx4THuO-KU", "question": "In a white room, there is a grey sofa with a pillow on it. Next to the sofa, there is a staircase. Someone placed a black laptop on the sofa. Who placed the black laptop on the sofa?", "question_wo_referring_query": "Who placed the black laptop on the sofa?", "candidates": ["The man in the white hooded jacket", "The man in the white T-shirt", "The man in the checkered shirt", "The man in the black hooded jacket", "The man in the black sleeveless tank top"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "Ztx4THuO-KU_0", "video_path": "Ztx4THuO-KU.mp4", "subtitle_path": "Ztx4THuO-KU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1786.48, "view_count": 22306}, {"video_id": "Ztx4THuO-KU", "question": "A man wearing a black and red plaid shirt is sitting on a gray sofa, holding a black laptop. There is a pillow beside him. He is talking about his story, mentioning one of his courses. 
Which course did he mention?", "question_wo_referring_query": "Which course did he mention?", "candidates": ["Chemistry Course", "Advanced Mathematics Course", "BSC Course", "International Business Course", "Physics Course"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "Ztx4THuO-KU_1", "video_path": "Ztx4THuO-KU.mp4", "subtitle_path": "Ztx4THuO-KU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1786.48, "view_count": 22306}, {"video_id": "Ztx4THuO-KU", "question": "A man in a black and red checkered shirt is sitting on a gray sofa with a black laptop in front of him. Someone asks him why he doesn't feature his girlfriend in his vlog. Who is the person asking the question?", "question_wo_referring_query": "Who is the person asking the question?", "candidates": ["TTS", "AOS", "HHY", "TTT", "AAG"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "Ztx4THuO-KU_2", "video_path": "Ztx4THuO-KU.mp4", "subtitle_path": "Ztx4THuO-KU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1786.48, "view_count": 22306}, {"video_id": "ARd4jsThc2c", "question": "In a black and white scene, in front of a wooden house, there are several people wearing hats and black pants. 
What are they doing the first time they appear?", "question_wo_referring_query": ", what are they doing the first time they appear?", "candidates": ["Rolling on the ground", "Running into the wooden house", "Holding black guns and shooting", "Crouching on the ground", "Hugging each other"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "ARd4jsThc2c_0", "video_path": "ARd4jsThc2c.mp4", "subtitle_path": "ARd4jsThc2c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 977.5, "view_count": 4302084}, {"video_id": "ARd4jsThc2c", "question": "Within a square circle, a man wearing a grey suit and a red and white striped tie is standing in front of a wall decorated with many paintings. What was he doing when he first appeared?", "question_wo_referring_query": "What was he doing when he first appeared?", "candidates": ["Raising his hands and cheering", "Squatting down to tie his shoelaces", "Playing cards", "Spinning happily on the spot", "Holding a few cats and a glass"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "ARd4jsThc2c_1", "video_path": "ARd4jsThc2c.mp4", "subtitle_path": "ARd4jsThc2c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 977.5, "view_count": 4302084}, {"video_id": "ARd4jsThc2c", "question": "On a vast yellow desert, a few white clouds float in the distant sky. A cowboy, wearing an olive hat, a blue shirt, and olive chaps, stands on the desert. 
What is he doing when he first appears?", "question_wo_referring_query": "What is he doing when he first appears?", "candidates": ["Riding a horse forward", "Shooting at game", "Throwing his hat", "Standing with a big finger and slightly smiling", "Running into a wooden house"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "ARd4jsThc2c_2", "video_path": "ARd4jsThc2c.mp4", "subtitle_path": "ARd4jsThc2c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 977.5, "view_count": 4302084}, {"video_id": "Oo3A5QQj5U0", "question": "On a sunny bright day, three women are discussing some topics. In the distance, there is a green plant. When the woman wearing a black watch on her wrist, and a belt around her waist says 'they complied with all those new requirements that needed to get done,' what action does the woman with the black watch on her wrist perform?", "question_wo_referring_query": "What action does the woman with the black watch on her wrist perform?", "candidates": ["Lifting her pants", "Leaning on the railing, moving her hands around", "Brushing her hair with her hand", "Massaging her neck", "Playing with her earring"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "Oo3A5QQj5U0_0", "video_path": "Oo3A5QQj5U0.mp4", "subtitle_path": "Oo3A5QQj5U0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2190.98, "view_count": 822854}, {"video_id": "Oo3A5QQj5U0", "question": "In a room with an airplane, the airplane is surrounded by gray railing. Two women are sitting on a blue chair talking. 
When the woman in a suit with a green blouse says, \"ourselves as a planet, or starting to imagine and understand who we are all on,\" what action does the woman in the suit make?", "question_wo_referring_query": "What action does the woman in the suit make?", "candidates": ["Touches her nose with her left hand", "Tucks her hair behind her ear with her right hand", "Rubbed her eyes", "Leans back on the chair, gesturing with both hands", "Leans sideways on the chair, gesturing with both hands"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "Oo3A5QQj5U0_1", "video_path": "Oo3A5QQj5U0.mp4", "subtitle_path": "Oo3A5QQj5U0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2190.98, "view_count": 822854}, {"video_id": "Oo3A5QQj5U0", "question": "In front of a starry sky with a ball-shaped object in the background, what did the man with short golden hair, wearing a gray suit and a silver ring on his finger, do when he said 'go to a place that we had been staring at for Millenia.'?", "question_wo_referring_query": "What did the man wearing a gray suit do?", "candidates": ["Raised both hands with palms facing each other", "Grabbed his right hand with his left hand", "Held his head with both hands", "Touched the ring on his left hand with his right hand", "Grabbed his left hand with his right hand"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2E", "level": "IntraMoment", "id": "Oo3A5QQj5U0_2", "video_path": "Oo3A5QQj5U0.mp4", "subtitle_path": "Oo3A5QQj5U0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2190.98, "view_count": 822854}, {"video_id": "sgeTrVfx2rc", "question": "In a game screen, there is a QR code in the upper left corner. In the middle of the screen, there are four dragons, with some game characters in front of the dragons. 
After blue wings appeared above the heads of the game characters in the middle of the screen, what appeared next?", "question_wo_referring_query": "After blue wings appeared above the heads of the game characters in the middle of the screen, what appeared next?", "candidates": ["Black Circle", "Red Knife", "Blue Sword", "Green Shield", "Purple Flame"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "sgeTrVfx2rc_0", "video_path": "sgeTrVfx2rc.mp4", "subtitle_path": "sgeTrVfx2rc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1729.84, "view_count": 394423}, {"video_id": "sgeTrVfx2rc", "question": "Under the sunset, there are short houses behind a majestic city wall. In the distance, a peach-shaped object is situated in the middle of the city. A black flag with a yellow circle appears on the screen. What happens after the black flag with a yellow circle appears on the screen?", "question_wo_referring_query": ", what happens after a black flag with a yellow circle appears on the screen?", "candidates": ["A black flag with a purple circle appears", "A red flag with a yellow circle appears", "A red flag with a white circle appears", "Another black flag with a yellow circle appears", "A yellow flag with a blue circle appears"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "sgeTrVfx2rc_1", "video_path": "sgeTrVfx2rc.mp4", "subtitle_path": "sgeTrVfx2rc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1729.84, "view_count": 394423}, {"video_id": "sgeTrVfx2rc", "question": "A room with white columns and many bookshelves caught fire. The Library of Baghdad was burned down. 
What happened after the Library of Baghdad was burned down?", "question_wo_referring_query": ", what happened after the Library of Baghdad was burned down?", "candidates": ["The books and manuscripts from the Library of Baghdad were thrown into the Tigris River", "The books and manuscripts from the Library of Baghdad were thrown into the Rhine River", "The books and manuscripts from the Library of Baghdad were thrown into the Volga River", "The books and manuscripts from the Library of Baghdad were thrown into the Danube River", "The books and manuscripts from the Library of Baghdad were thrown into the Amazon River"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "sgeTrVfx2rc_2", "video_path": "sgeTrVfx2rc.mp4", "subtitle_path": "sgeTrVfx2rc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1729.84, "view_count": 394423}, {"video_id": "8o3jN8vHIMM", "question": "In the video frame, which character appears first?", "question_wo_referring_query": "In the video frame, which character appears first?", "candidates": ["The man wearing a blue shirt with the ocean in the background", "The man wearing a blue suit with a coffee shop in the background", "The man wearing a black short sleeve shirt with a map in the background", "The man wearing a white lab coat with a hood", "The man wearing a black hat"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "8o3jN8vHIMM_0", "video_path": "8o3jN8vHIMM.mp4", "subtitle_path": "8o3jN8vHIMM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1177.72, "view_count": 194478}, {"video_id": "8o3jN8vHIMM", "question": "Whose name appears first in the video?", "question_wo_referring_query": "Whose name appears first in the video?", "candidates": ["Mary", "Brad Pitt", "simon dan", "Bruce", "Anthony Hopkins"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", 
"level": "L2-Relation", "id": "8o3jN8vHIMM_1", "video_path": "8o3jN8vHIMM.mp4", "subtitle_path": "8o3jN8vHIMM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1177.72, "view_count": 194478}, {"video_id": "8o3jN8vHIMM", "question": "In the video explanation, which lens type is mentioned first?", "question_wo_referring_query": "Which lens type is mentioned first in the video explanation?", "candidates": ["Zoom lens", "Millimeter lens", "fisheye lens", "Prime lens", "macro"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "8o3jN8vHIMM_2", "video_path": "8o3jN8vHIMM.mp4", "subtitle_path": "8o3jN8vHIMM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1177.72, "view_count": 194478}, {"video_id": "MUShhvNwlN0", "question": "On a flat cement road, there is a man standing who is wearing a black and white striped t-shirt, carrying a black backpack, and wearing a black helmet. Next to the man, there is also a bicycle. After the man says, 'it normally so this evening I thought I', what does he do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["Holds a black and white battery in his hand", "Uses his right hand to pat the bicycle's black seat", "Holds a battery in one hand, with the other hand resting on his leg", "Lowers his head facing the green grass", "Kneels on both knees on the green grass"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "MUShhvNwlN0_0", "video_path": "MUShhvNwlN0.mp4", "subtitle_path": "MUShhvNwlN0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 976.2, "view_count": 671771}, {"video_id": "MUShhvNwlN0", "question": "On a flat road, there is a house with a red and white exterior wall on one side and many cars parked on the other side. Beside the cars, there are also green trees. 
A man wearing a black short-sleeved shirt and a black helmet is speaking. After saying \"however I've got to get back home I've,\" what did he do next?", "question_wo_referring_query": "What did he do next?", "candidates": ["Touched a black battery while on the green grass", "Removed the battery of the electric bicycle", "Kneeled on the green grass and spoke to the camera", "Turned his head toward an electric bicycle with a silver-white frame that was facing him", "Kneeled on the green grass wearing gray pants"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "MUShhvNwlN0_1", "video_path": "MUShhvNwlN0.mp4", "subtitle_path": "MUShhvNwlN0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 976.2, "view_count": 671771}, {"video_id": "MUShhvNwlN0", "question": "On a flat road, with green plants on the left side and some newly built houses on the right side, a man riding a bicycle with a silver-gray frame is moving forward. 
After the subtitle 'thank you so much for watching the video' appears, what does the man riding the bicycle do next?", "question_wo_referring_query": "After the man riding the bicycle, what does he do next?", "candidates": ["Holds a pile of black and white batteries in his hand", "Continues to ride the bicycle forward", "Stops the bike and talks facing the camera", "Stands under a tree to cool off", "Kneels on the green grass with both knees"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "MUShhvNwlN0_2", "video_path": "MUShhvNwlN0.mp4", "subtitle_path": "MUShhvNwlN0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 976.2, "view_count": 671771}, {"video_id": "q2ujkKrvMOs", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a man wearing glasses and a checkered shirt is seen sitting in front of a wall with many wooden frames; then, a woman wearing a black hat, twin braids, and a coffee-colored top is sitting on a black sofa; finally, a woman standing in front of a green background wearing a grey coat over black innerwear.", "First, there is a scene with a woman standing in front of a green background wearing a grey coat over black innerwear; then, a man wearing glasses and a checkered shirt is seen sitting in front of a wall with many wooden frames; finally, a woman wearing a black hat, twin braids, and a coffee-colored top is sitting on a black sofa.", "First, a man wearing glasses and a checkered shirt is seen sitting in front of a wall with many wooden frames; then, there is a scene with a woman standing in front of a green background wearing a grey coat over black innerwear; finally, a woman wearing a black hat, twin braids, and a coffee-colored top is sitting on a black sofa.", "First, a woman wearing a black hat, twin braids, and a coffee-colored 
top is sitting on a black sofa; then, a man wearing glasses and a checkered shirt is seen sitting in front of a wall with many wooden frames; finally, there is a scene with a woman standing in front of a green background wearing a grey coat over black innerwear.", "First, there is a scene with a woman standing in front of a green background wearing a grey coat over black innerwear; then, a woman wearing a black hat, twin braids, and a coffee-colored top is sitting on a black sofa; finally, a man wearing glasses and a checkered shirt is seen sitting in front of a wall with many wooden frames."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "q2ujkKrvMOs_0", "video_path": "q2ujkKrvMOs.mp4", "subtitle_path": "q2ujkKrvMOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1336.75, "view_count": 49886}, {"video_id": "q2ujkKrvMOs", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, the parrot is standing on the woman's finger and drops the food from its mouth; then, the parrot stands on one leg on the woman's finger and grabs the food with its other leg; finally, a woman in a black short-sleeve shirt is feeding the green parrot with some food.", "First, a woman in a black short-sleeve shirt is feeding a green parrot with some food; then, the parrot stands on one leg on the woman's finger and grabs the food with its other leg; finally, the parrot is standing on the woman's finger and drops the food from its mouth.", "First, the parrot stands on one leg on the woman's finger and grabs the food with its other leg; then, the parrot is standing on the woman's finger and drops the food from its mouth; finally, a woman in a black short-sleeve shirt is feeding the green parrot with some food.", "First, the parrot stands on one leg on the woman's finger and grabs the food with its other 
leg; then, a woman in a black short-sleeve shirt is feeding the green parrot with some food; finally, the parrot is standing on the woman's finger and drops the food from its mouth.", "First, a woman in a black short-sleeve shirt is feeding a green parrot with some food; then, the parrot is standing on the woman's finger and drops the food from its mouth; finally, the parrot stands on one leg on the woman's finger and grabs the food with its other leg."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "q2ujkKrvMOs_1", "video_path": "q2ujkKrvMOs.mp4", "subtitle_path": "q2ujkKrvMOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1336.75, "view_count": 49886}, {"video_id": "q2ujkKrvMOs", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there's a table with a wooden block with a mouse standing on it, and a woman in a coffee-colored top is holding the wooden block. Then, there's a room with many wooden shelves, where a woman in a coffee-colored top is holding a mouse while talking to a man in a plaid shirt. Finally, there's the same room with many wooden shelves where two women and one man are sitting.", "First, there's a room with many wooden shelves where two women and one man are sitting. Then, there's a table with a wooden block with a mouse standing on it, and a woman in a coffee-colored top is holding the wooden block. Finally, there's a room with many wooden shelves, where a woman in a coffee-colored top is holding a mouse while talking to a man in a plaid shirt.", "First, there's a room with many wooden shelves, where a woman in a coffee-colored top is holding a mouse while talking to a man in a plaid shirt. Then, the same room with many wooden shelves where two women and one man are sitting. 
Finally, there's a table with a wooden block with a mouse standing on it, and a woman in a coffee-colored top is holding the wooden block.", "First, there's a room with many wooden shelves, where a woman in a coffee-colored top is holding a mouse while talking to a man in a plaid shirt. Then, there's a scenario where a wooden block with a mouse standing on it is placed on a table, and a woman in a coffee-colored top is holding the wooden block. Finally, there's a scene in the same room with two women and one man sitting.", "First, there's a table with a wooden block with a mouse standing on it, and a woman in a coffee-colored top is holding the wooden block. Then, there's a room with many wooden shelves where two women and one man are sitting. Finally, there's a room with many wooden shelves, where a woman in a coffee-colored top is holding a mouse while talking to a man in a plaid shirt."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "q2ujkKrvMOs_2", "video_path": "q2ujkKrvMOs.mp4", "subtitle_path": "q2ujkKrvMOs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1336.75, "view_count": 49886}, {"video_id": "kssbas7z9Gc", "question": "At the top of the screen, there are three circular icons. The icon on the left contains a mask of a human face, the middle one is a white tank model, and the one on the right is a white diamond. In which scenes does the white sketch tank in the background appear?", "question_wo_referring_query": "In which scenes does the white sketch tank in the background appear?", "candidates": ["From a top-down view of the front end of the tank, you can see clear iron patterns.", "At the top is an English sentence starting with M.G., with two circular icons on both sides. 
The icons contain question marks, and below are white and red text blocks.", "Five white tank icons are at the top of the screen, below is a tank icon with a turret pointing to the right.", "There are a few green plants growing in yellow grass, a tank crouching under a green plant with smoke."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "kssbas7z9Gc_0", "video_path": "kssbas7z9Gc.mp4", "subtitle_path": "kssbas7z9Gc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.07, "view_count": 181877}, {"video_id": "kssbas7z9Gc", "question": "At the top of the screen, there is an English title containing 'Total Number'. Below the title, there is '3472*', and below '3472*' there are all tank models, all of which start with 'Panzer'. In which of the following scenes do the white dashed boxes enclosing 'Panzer I' appear?", "question_wo_referring_query": "In which of the following scenes do the white dashed boxes enclosing 'Panzer I' appear?", "candidates": ["On the dry desert ground, a tank is moving forward with two dense smoke areas in front, and there is a tree branch on the left side of the tank.", "Below the title there is '3332*', and below that, there are all Panzer tank models. 
There are three arrows pointing upwards, and in the lower left, there is a white '42%' figure.", "There are all white tank models at the bottom of the screen, with two dashed boxes each enclosing five tank models.", "The title contains 'Panzer II' at the beginning, and there is a small white figure holding a cross in the middle of the screen."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "kssbas7z9Gc_1", "video_path": "kssbas7z9Gc.mp4", "subtitle_path": "kssbas7z9Gc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.07, "view_count": 181877}, {"video_id": "kssbas7z9Gc", "question": "In a black background, a gray tank is in the center of the screen, with a row of white boxes below it. Inside the boxes are enticing English phrases, starting with the red word 'Disclaimer'. In which of the following scenes does this gray tank appear?", "question_wo_referring_query": "In which of the following scenes does this gray tank appear?", "candidates": ["In the upper left corner of the screen, there is a circular icon with a white figure inside. In the upper right corner, there is a circular icon with a white figure wearing glasses. The background screen is tinted red, with red occupying the left side and black occupying the right side.", "At the top, there's an English sentence starting with 'M.G.', flanked by two circular icons, each with a question mark inside. 
Below, there are text blocks in white and red respectively.", "A tank is moving forward on dry sand, with two puffs of smoke in front, and a tree branch to the left of the tank.", "A tank is moving forward on dry sand, with two puffs of smoke in front and a tree branch to the left of the tank."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "kssbas7z9Gc_2", "video_path": "kssbas7z9Gc.mp4", "subtitle_path": "kssbas7z9Gc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.07, "view_count": 181877}, {"video_id": "r2SBKcYt6FA", "question": "A man is sitting on a sofa in front of a window, wearing a white T-shirt and carrying a black backpack. He has short hair, and behind him is a green curtain. With which of the following subtitles has this man, carrying a black backpack, appeared together?", "question_wo_referring_query": "With which of the following subtitles has this man, carrying a black backpack, appeared together?", "candidates": ["\"This place is too beautiful\"", "\"That's great\"", "\"The scenery here is particularly beautiful\"", "\"have not left our room like at all today\"", "\"I like it here\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "r2SBKcYt6FA_0", "video_path": "r2SBKcYt6FA.mp4", "subtitle_path": "r2SBKcYt6FA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.98, "view_count": 43523}, {"video_id": "r2SBKcYt6FA", "question": "On a night with a lot of people, a guy wearing a white T-shirt with black sunglasses hanging on it is sitting on a chair. There are plastic bucket lights on both sides of the guy. 
What subtitles have appeared along with the sunglasses hanging on the guy's white T-shirt?", "question_wo_referring_query": "What subtitles have appeared along with the sunglasses hanging on the guy's white T-shirt?", "candidates": ["\"photography and then I have my GoPro my\"", "\"The scenery here is really beautiful\"", "\"My photographer took photos for me\"", "\"I am attracted to the beautiful scenery here\"", "\"Extremely powerful\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "r2SBKcYt6FA_1", "video_path": "r2SBKcYt6FA.mp4", "subtitle_path": "r2SBKcYt6FA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.98, "view_count": 43523}, {"video_id": "r2SBKcYt6FA", "question": "On a beach, there are some red and green houses. In front of these houses, a pedestrian wearing green pants walks by. A man with short hair wearing a gray short-sleeved shirt appears in the shot. Which of the following subtitles has appeared alongside the man in the gray short-sleeved shirt?", "question_wo_referring_query": "Which of the following subtitles has appeared alongside the man in the gray short-sleeved shirt?", "candidates": ["\"Damn it, there are planes flying by in the sky\"", "\"The sunshine here is particularly good\"", "\"I like to play with children\"", "\"I think you will like this place\"", "\"Ronnie I think yeah it'll be beautiful\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "r2SBKcYt6FA_2", "video_path": "r2SBKcYt6FA.mp4", "subtitle_path": "r2SBKcYt6FA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.98, "view_count": 43523}, {"video_id": "HehaHkve4zU", "question": "On a white screen, there are different colored frames on both sides, each frame containing a picture of a man giving a lecture. In the middle of the screen, there is a circular figure. 
When the circular figure appears on a white screen with different colored horizontal stripes on both sides, what change occurs in the circular figure?", "question_wo_referring_query": "What change occurs in the circular figure?", "candidates": ["The circular figure changes from one ring to four rings", "The circular figure changes from three rings to one ring", "The circular figure changes from two rings to three rings", "The circular figure changes from two rings to four rings", "The circular figure changes from three rings to four rings"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "HehaHkve4zU_0", "video_path": "HehaHkve4zU.mp4", "subtitle_path": "HehaHkve4zU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2325.04, "view_count": 88}, {"video_id": "HehaHkve4zU", "question": "On a white screen that says 'Overall Results-Video', there is a blue box containing some numbers and colored balls. Next to the blue box, there's an empty space with no text. 
When the blue box appears on the white screen that says 'Models better on shorter videos', what change occurs in the blue box?", "question_wo_referring_query": "What change occurs in the blue box?", "candidates": ["Two yellow dashed lines appeared inside the blue box", "Two pink dashed lines appeared inside the blue box", "Two olive green dashed lines appeared inside the blue box", "Two white dashed lines appeared inside the blue box", "Two red dashed lines appeared inside the blue box"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "HehaHkve4zU_1", "video_path": "HehaHkve4zU.mp4", "subtitle_path": "HehaHkve4zU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2325.04, "view_count": 88}, {"video_id": "HehaHkve4zU", "question": "On a white screen that reads 'Overall Results-Text,' there is a beige box containing some colorful small balls. When the beige box appears on a white screen that reads 'Overall-Multimodal,' what changes occur to the beige box?", "question_wo_referring_query": "What changes occur to the beige box?", "candidates": ["The box changes from beige to blue.", "The box changes from beige to green.", "The box changes from beige to black.", "The box changes from beige to purple.", "The box changes from beige to yellow."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "HehaHkve4zU_2", "video_path": "HehaHkve4zU.mp4", "subtitle_path": "HehaHkve4zU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2325.04, "view_count": 88}, {"video_id": "g0QrBphsioM", "question": "In front of a black background with the word 'Collectivities,' a man in a blue short-sleeve shirt is speaking. When the subtitle 'And as of 2007, the scattered islands in the Indian Ocean, remember the Comoros episode.' 
appears, what change happens to the man's shirt?", "question_wo_referring_query": "What change happens to the shirt of the man wearing a blue short-sleeve shirt?", "candidates": ["Changed from blue short-sleeve to green short-sleeve", "Changed from blue short-sleeve to gray short-sleeve", "Changed from blue short-sleeve to red short-sleeve", "Changed from blue short-sleeve to white short-sleeve", "Changed from blue short-sleeve to black short-sleeve"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "g0QrBphsioM_0", "video_path": "g0QrBphsioM.mp4", "subtitle_path": "g0QrBphsioM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1010.26, "view_count": 3634814}, {"video_id": "g0QrBphsioM", "question": "Against a black background, there's a green terrain map with the label 'The \"Hexagon\"' below it. A man in a blue short-sleeve shirt is standing next to the map. When the subtitle 'since if you tilt your head a little bit, it kinda looks like it has six sides.' 
appears, what changes occur around the green terrain map?", "question_wo_referring_query": "What changes occur around the green terrain map?", "candidates": ["The green terrain map is surrounded by red lines, and the white text below disappears.", "The green terrain map is surrounded by purple lines, and the white text below disappears.", "The green terrain map is surrounded by white lines, and the white text below disappears.", "The green terrain map is surrounded by blue lines, and the white text below disappears.", "The green terrain map is surrounded by yellow lines, and the white text below disappears."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "g0QrBphsioM_1", "video_path": "g0QrBphsioM.mp4", "subtitle_path": "g0QrBphsioM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1010.26, "view_count": 3634814}, {"video_id": "g0QrBphsioM", "question": "On a screen with the Earth's surface as the background, there are three green landform images, each having a pair of large round eyes. When the subtitle \"For France, Japan is seen as like the epitome of exoticism. 
Similar to themselves, the Japanese\" appears, what changes occur in the eyes of the leftmost landform image?", "question_wo_referring_query": "What changes occur in its eyes?", "candidates": ["A light bulb appears on the eyes", "Tears of blue color appear below the eyes", "A yellow five-pointed star appears on the eyes", "The eyes turn into red hearts", "Some small flowers appear in the eyes"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "g0QrBphsioM_2", "video_path": "g0QrBphsioM.mp4", "subtitle_path": "g0QrBphsioM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1010.26, "view_count": 3634814}, {"video_id": "uFig609YWjU", "question": "In front of a red and white background, there is a man with slightly yellowish skin, short hair, and wearing a black shirt. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Licking his lower lip", "Rubbing his lower lip", "Grabbing his own tongue", "Holding a microphone", "Pointing forward with his finger"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "uFig609YWjU_0", "video_path": "uFig609YWjU.mp4", "subtitle_path": "uFig609YWjU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1288.16, "view_count": 517886}, {"video_id": "uFig609YWjU", "question": "In a Disney amusement park, there is a person wearing a Mickey Mouse costume. Next to him, there is a short-haired woman wearing floral sleeveless dress and sunglasses. 
What is the short-haired woman wearing sunglasses doing?", "question_wo_referring_query": "What is the short-haired woman wearing sunglasses doing?", "candidates": ["Bending down to pick something up", "Hugging the person in the Mickey Mouse costume", "Shaking hands with the person in the Mickey Mouse costume", "Blowing a kiss to the person in the Mickey Mouse costume", "Adjusting her sunglasses with her hand"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "uFig609YWjU_1", "video_path": "uFig609YWjU.mp4", "subtitle_path": "uFig609YWjU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1288.16, "view_count": 517886}, {"video_id": "uFig609YWjU", "question": "In front of a slightly greenish wall, there is a red sofa chair. Sitting on the chair is a man dressed in a green top and blue jeans. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Shaking his legs", "Holding a black cat", "Wearing a hat", "Playing with his phone", "Fixing his hair"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "uFig609YWjU_2", "video_path": "uFig609YWjU.mp4", "subtitle_path": "uFig609YWjU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1288.16, "view_count": 517886}, {"video_id": "Y9avVPEa_KA", "question": "In a room with a full view map of the Earth on the wall, there are shelves hanging on the wall with some books on them, and a man wearing a white short-sleeved shirt and a black bracelet is in the room. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["green shelf", "purple shelf", "blue shelf", "white shelf", "black shelf"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "Y9avVPEa_KA_0", "video_path": "Y9avVPEa_KA.mp4", "subtitle_path": "Y9avVPEa_KA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 991.48, "view_count": 200150}, {"video_id": "Y9avVPEa_KA", "question": "A silver-grey curtain, next to the curtain is a wooden-colored shelf, in front of the curtain there is a man wearing a black coat and black glasses. What objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Purple chair", "White chair", "Green chair", "Black chair", "Blue chair"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "Y9avVPEa_KA_1", "video_path": "Y9avVPEa_KA.mp4", "subtitle_path": "Y9avVPEa_KA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 991.48, "view_count": 200150}, {"video_id": "Y9avVPEa_KA", "question": "In the azure blue water, there are many machines. In the water, there is a silver-white object that looks like a ladder. There is a person in the water wearing a full white protective suit, and surrounding him are some people in black diving suits. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Uniform blue oxygen tube", "Blue strip with white oxygen tube", "Yellow strip with silver oxygen tube", "Uniform red oxygen tube", "Uniform yellow oxygen tube"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "Y9avVPEa_KA_2", "video_path": "Y9avVPEa_KA.mp4", "subtitle_path": "Y9avVPEa_KA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 991.48, "view_count": 200150}, {"video_id": "AcgzB9hqxqo", "question": "In a gray background, there are four white circle icons, each with a different design. One of the icons features a four-pointed star. When the subtitle '300 000 men to arms.' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["A white pentagram design", "A white broken sword design", "Icons of four white dwarves", "A gray shield design", "A white shield design"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "AcgzB9hqxqo_0", "video_path": "AcgzB9hqxqo.mp4", "subtitle_path": "AcgzB9hqxqo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1091.07, "view_count": 693368}, {"video_id": "AcgzB9hqxqo", "question": "On a grey background, there are six white circles, each containing different designs. Four of the circles depict white stick figures. 
When the subtitle 'Dobrev wrote several years ago, and we discussed a few weeks ago' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Black shoes", "A black folder", "A white shirt", "A black tank", "A blue hat"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "AcgzB9hqxqo_1", "video_path": "AcgzB9hqxqo.mp4", "subtitle_path": "AcgzB9hqxqo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1091.07, "view_count": 693368}, {"video_id": "AcgzB9hqxqo", "question": "Against a gray background with the word 'Morale,' there are 8 white circles, each containing a different pattern. In the top right corner of the screen, there is also a picture. When the subtitle 'weapons like NLAWs, Javelins, HIMARs and the Panzerhaubitze 2000' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Rifle", "A sniper rifle with scope", "A rocket launcher firing", "A tank firing", "Handgun"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "AcgzB9hqxqo_2", "video_path": "AcgzB9hqxqo.mp4", "subtitle_path": "AcgzB9hqxqo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1091.07, "view_count": 693368}, {"video_id": "cuHgxPe3J7I", "question": "Against a black-and-white backdrop, a man wearing a hat and having a goatee stands behind two women. Both the man and the woman on the left are wearing accessories around their necks. The two women have their hair tied up, exposing their facial features. Dense foliage is seen behind them. 
Who is holding a bag with their hand?", "question_wo_referring_query": "Who is holding a bag with their hand?", "candidates": ["The woman wearing a hat", "The woman wearing a blouse", "The man wearing a hat", "The woman wearing jeans", "The woman wearing a skirt on the right"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "cuHgxPe3J7I_0", "video_path": "cuHgxPe3J7I.mp4", "subtitle_path": "cuHgxPe3J7I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1951.05, "view_count": 518983}, {"video_id": "cuHgxPe3J7I", "question": "A Western-style building stands on the ground. The middle part of the building is an arch shape. There are many windows on both sides of the building. In front of the large iron door of the building, a crowd of people is gathered. The iron door is decorated with exquisite patterns. A woman wearing a hat and a black cloak, along with a man in a black hat, walk inside. Who is holding the hand of the man in the black hat?", "question_wo_referring_query": "Who is holding the hand of the man in the black hat?", "candidates": ["The woman in a suit and black cloak", "The long-haired woman carrying a basket", "The woman wearing a black hat and holding an umbrella", "The woman in a dress and white cloak", "The woman in a dress and black cloak"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "cuHgxPe3J7I_1", "video_path": "cuHgxPe3J7I.mp4", "subtitle_path": "cuHgxPe3J7I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1951.05, "view_count": 518983}, {"video_id": "cuHgxPe3J7I", "question": "Four workers are installing a painting on a yellow wall. The two workers on the left are wearing red clothes, the crouching worker is wearing a plaid shirt, and the other worker is wearing a gray top. All the workers are wearing blue gloves. The painting depicts a person in a red robe. 
Who is standing on the ladder?", "question_wo_referring_query": ", who is standing on the ladder?", "candidates": ["The worker on the right wearing a red top with a hood and black pants", "The man on the right wearing a plaid shirt and black pants", "The person on the left wearing a blue plaid shirt and pants", "The worker on the right wearing a gray top and black pants", "The worker on the left wearing a red top with a hood and black pants"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "cuHgxPe3J7I_2", "video_path": "cuHgxPe3J7I.mp4", "subtitle_path": "cuHgxPe3J7I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1951.05, "view_count": 518983}, {"video_id": "CAHN5DmExzU", "question": "A man wearing a white headscarf appears in a narrow alley. The man is dressed in shorts and a short-sleeved shirt, and he is wearing white gloves. At the end of the alley, there are green plants and a red box. Along the alley, there are windows and light-colored cans. What did the man with the headscarf do when he first appeared?", "question_wo_referring_query": "What did the man with the headscarf do when he first appeared?", "candidates": ["Walked with a red box", "Walked with a pink can", "Walked with a green box", "Walked with a black box", "Walked with a red can"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "CAHN5DmExzU_0", "video_path": "CAHN5DmExzU.mp4", "subtitle_path": "CAHN5DmExzU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 71605}, {"video_id": "CAHN5DmExzU", "question": "The screen has a black border, and a blurry gray sky shows an airplane. In the upper left corner of the screen, there is a black character filling a white horizontal bar and a white character filling a red horizontal bar. 
What did the airplane do upon its first appearance?", "question_wo_referring_query": "What did the airplane do upon its first appearance?", "candidates": ["The airplane flew at high speed", "The airplane crashed", "The airplane entered the clouds and disappeared", "The airplane nosedived", "The airplane attacked the ground"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "CAHN5DmExzU_1", "video_path": "CAHN5DmExzU.mp4", "subtitle_path": "CAHN5DmExzU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 71605}, {"video_id": "CAHN5DmExzU", "question": "A man wearing an olive-colored hat and a man with a white headscarf appear in the room. The man with the hat is wearing a gray outfit, and the man with the headscarf is in short sleeves. The glasses are pushed onto the headscarf. Next to the man with the hat are transparent containers and a gray wall. What did the man with the hat do the first time he appeared?", "question_wo_referring_query": "What did the man with the hat do the first time he appeared?", "candidates": ["Took off his own watch", "Explained something to the man with the headscarf beside him", "Took off his own hat", "Picked up a glass bowl", "Picked up a glass jar"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "CAHN5DmExzU_2", "video_path": "CAHN5DmExzU.mp4", "subtitle_path": "CAHN5DmExzU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 71605}, {"video_id": "Zw4phGHmzsI", "question": "On a stage lit with yellow lights, there are seven men wearing suits. The man on the far right has dyed blond hair. Next to him is a man with dyed burgundy hair. Behind this group of men, there is a gear-shaped object illuminated by four yellow lights. On the left side of the stage, there is a black and white striped column. 
When the subtitle 'stop' appears, what is the blond-haired man doing?", "question_wo_referring_query": "What is the blond-haired man doing?", "candidates": ["Jumping with both legs", "Squatting", "Making a heart shape with hands", "Raising both hands", "One hand on chest, one hand on leg"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "Zw4phGHmzsI_0", "video_path": "Zw4phGHmzsI.mp4", "subtitle_path": "Zw4phGHmzsI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1720.65, "view_count": 4118}, {"video_id": "Zw4phGHmzsI", "question": "The screen has black horizontal stripes with white characters, and in the middle of the screen are five women. The woman on the far left with silver hair is wearing a shoulder-baring outfit, the woman in the center is wearing a mesh top, and the second woman from the right has a blonde ponytail. All women are dressed mainly in black and white, with red, yellow, and silver decorations behind them. When the subtitle 'Applause' appears, what is the woman in the mesh top in the center doing?", "question_wo_referring_query": "What is the woman in the center wearing a mesh top doing?", "candidates": ["Touching her lips with her thumb", "Touching her lips with her ring finger", "Touching her lips with her middle finger", "Touching her lips with her little finger", "Touching her lips with her index finger"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "Zw4phGHmzsI_1", "video_path": "Zw4phGHmzsI.mp4", "subtitle_path": "Zw4phGHmzsI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1720.65, "view_count": 4118}, {"video_id": "Zw4phGHmzsI", "question": "There are white letters on the black bars at the top and bottom of the screen. On a checkerboard floor illuminated by predominantly purple neon lights, five women wearing black tops and black shorts are on stage. 
The two women on the right are wearing black high boots; the woman in the middle has dyed hair. All five women have white microphones near their mouths. Behind the stage, there is a screen and some equipment. When the subtitle 'you're' appears, what is the woman on the far right doing?", "question_wo_referring_query": "What is the woman on the far right doing?", "candidates": ["The woman has her hands on her waist", "The woman is standing", "The woman raises both hands", "The woman is sitting on the floor", "The woman has her fingers on her lips"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "Zw4phGHmzsI_2", "video_path": "Zw4phGHmzsI.mp4", "subtitle_path": "Zw4phGHmzsI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1720.65, "view_count": 4118}, {"video_id": "3tX2qCmV8lQ", "question": "A man wearing a shirt and black suit stands in front of a lectern. The lectern's frame is transparent, and a microphone is fixed to the top of the lectern. Behind the man, there are tan wood items and a red curtain. After the caption 'Welcome, thank you' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Two women walk towards a stage with a desk and chair.", "One man walks towards a stage with a desk and chair.", "Three women walk towards a stage with a desk and chair.", "Two men walk towards a stage with a desk and chair.", "Three men walk towards a stage with a desk and chair."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "3tX2qCmV8lQ_0", "video_path": "3tX2qCmV8lQ.mp4", "subtitle_path": "3tX2qCmV8lQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2726.19, "view_count": 3418}, {"video_id": "3tX2qCmV8lQ", "question": "A red plastic object is standing on the stage floor. The stage is decorated with green plants around it. 
The red plastic object is hanging with a yellow arc-shaped object, and there is a blue toy sitting on the arc-shaped object. In the distance, there are plants and dense buildings. After the subtitle 'which was a remake of an Allan Kaprow performance' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The man on the right side of the stage picks up a glass of water and drinks", "The woman on the right side of the stage picks up a glass of water and drinks", "The man on the left side of the stage picks up a glass of water and drinks", "The men on both sides of the stage pick up their glasses of water and drink simultaneously", "The woman on the left side of the stage picks up a glass of water and drinks"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "3tX2qCmV8lQ_1", "video_path": "3tX2qCmV8lQ.mp4", "subtitle_path": "3tX2qCmV8lQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2726.19, "view_count": 3418}, {"video_id": "3tX2qCmV8lQ", "question": "What did the man wearing a green cap and a checkered shirt, who is sitting on a stool on the red carpet in the middle of the stage, do after the subtitle 'It's a way of moving backward in time, for me, I think' appeared?", "question_wo_referring_query": "What did the man on the right do?", "candidates": ["The man on the right stood up and gave thanks.", "The man on the right picked up a glass of water.", "The man on the right picked up a microphone.", "The man on the right bowed and apologized.", "The man on the right stood up and nodded."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "3tX2qCmV8lQ_2", "video_path": "3tX2qCmV8lQ.mp4", "subtitle_path": "3tX2qCmV8lQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2726.19, "view_count": 3418}, {"video_id": "TanLRBKlGeE", "question": "A woman 
wearing striped back rests is sitting on a yellow sofa. The woman is wearing glasses and holding a light-colored bag. Behind the woman, there is a TV and a newspaper. Next to the TV, there is a potted plant. After the subtitle 'dollars' appears, what object does the woman hold in her hand?", "question_wo_referring_query": "What object does the woman hold in her hand?", "candidates": ["A mobile phone", "One dark-colored bottle with a pump and white label", "Two dark-colored bottles with pumps and white labels", "A paper box", "Two paper boxes"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "TanLRBKlGeE_0", "video_path": "TanLRBKlGeE.mp4", "subtitle_path": "TanLRBKlGeE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1538.01, "view_count": 292370}, {"video_id": "TanLRBKlGeE", "question": "In the corner of a white wall, there is a litter box with some cat litter in it. A girl wearing striped clothes is squatting next to the litter box. She is wearing earrings and glasses, and she is looking down. There is a potted plant near the cat litter. After the subtitle 'Applause' appears, what object appears inside the litter box?", "question_wo_referring_query": "What object appears inside the litter box?", "candidates": ["A small yellow dog", "A small white dog", "An entirely yellow cat", "An entirely snow-white cat", "A cat with black patterns"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "TanLRBKlGeE_1", "video_path": "TanLRBKlGeE.mp4", "subtitle_path": "TanLRBKlGeE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1538.01, "view_count": 292370}, {"video_id": "TanLRBKlGeE", "question": "The girl wearing a gray and white outfit is holding a cup and drinking water. She is also wearing a black hat and a black mask. Behind her is a wooden desk, chair, and lamp in a brown color. 
After the subtitle 'Music' appears, what appears on the girl's hand?", "question_wo_referring_query": "What appears on the girl's hand?", "candidates": ["a black laptop", "blue headphones", "a face mask", "gray headphones", "a white laptop"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "TanLRBKlGeE_2", "video_path": "TanLRBKlGeE.mp4", "subtitle_path": "TanLRBKlGeE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1538.01, "view_count": 292370}, {"video_id": "F3CEbwhpC1I", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the woman and the woman with glasses join a guided learning session, then the woman in the white outerwear summarizes the learning experience, and finally the long-haired woman in white clothes gives an introduction.", "First, the woman in the white outerwear summarizes the learning experience, then the woman and the woman with glasses join a guided learning session, and finally the long-haired woman in white clothes gives an introduction.", "First, the long-haired woman in white clothes gives an introduction, then she and the woman with glasses join a guided learning session, and finally the woman in the white outerwear summarizes the learning experience.", "First, the long-haired woman in white clothes gives an introduction, then the woman in the white outerwear summarizes the learning experience, and finally the woman and the woman with glasses join a guided learning session.", "First, the woman and the woman with glasses join a guided learning session, then the long-haired woman in white clothes gives an introduction, and finally the woman in the white outerwear summarizes the learning experience."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "F3CEbwhpC1I_0", 
"video_path": "F3CEbwhpC1I.mp4", "subtitle_path": "F3CEbwhpC1I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1750.02, "view_count": 52052}, {"video_id": "F3CEbwhpC1I", "question": "During the guided learning process with the lady in white, which of the following sequences is correct?", "question_wo_referring_query": "During the guided learning process with the lady in white, which of the following sequences is correct?", "candidates": ["First guide to solve the problem of pressure, then solve the problem of temperature, and finally solve the problem of volume.", "First guide to solve the problem of volume, then solve the problem of pressure, and finally solve the problem of temperature.", "First guide to solve the problem of pressure, then solve the problem of volume, and finally solve the problem of temperature.", "First guide to solve the problem of volume, then solve the problem of temperature, and finally solve the problem of pressure.", "First guide to solve the problem of temperature, then solve the problem of pressure, and finally solve the problem of volume."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "F3CEbwhpC1I_1", "video_path": "F3CEbwhpC1I.mp4", "subtitle_path": "F3CEbwhpC1I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1750.02, "view_count": 52052}, {"video_id": "F3CEbwhpC1I", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the woman summarizes the learning experience, then assists the girl in solving the temperature problem, then assists the girl in solving the stress problem, then the woman makes an introduction, and finally assists in solving the volume problem.", "First, the woman gives an introduction, then guides the girl to solve the temperature problem, then guides the girl to 
solve the stress problem, then guides the girl to solve the volume problem, and finally the woman summarizes the learning experience.", "First, the woman gives an introduction, then assists the girl in solving the temperature problem, then assists the girl in solving the stress problem, then the woman summarizes the learning experience, and finally assists in solving the volume problem.", "First, the woman gives an introduction, then the woman assists the girl in solving the stress problem, then assists the girl in solving the temperature problem, then assists the girl in solving the volume problem, and finally the woman summarizes the experience.", "First, the woman with long hair in white clothes does an introduction, then the woman in the white coat summarizes the learning experience, and finally the woman and the girl wearing glasses complete the temperature problem, then assists the girl in solving the stress problem, and finally assists the girl in solving the volume problem."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "F3CEbwhpC1I_2", "video_path": "F3CEbwhpC1I.mp4", "subtitle_path": "F3CEbwhpC1I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1750.02, "view_count": 52052}, {"video_id": "wCQC8ZIIc1o", "question": "A man wearing an overcoat with a white feather pattern is sitting in a car. Next to him, there is a woman in a green dress with her hair tied up. Outside the car window are green fields and a white sky. 
In what other scenes does this man in the white feather pattern overcoat appear?", "question_wo_referring_query": "In what other scenes does this man in the white feather pattern overcoat appear?", "candidates": ["On an old bicycle", "Inside a pavilion by the lake", "A country lane flanked by green fields", "In front of a bed in a luxurious hotel", "On a tall tree"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "wCQC8ZIIc1o_0", "video_path": "wCQC8ZIIc1o.mp4", "subtitle_path": "wCQC8ZIIc1o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1027.07, "view_count": 43873}, {"video_id": "wCQC8ZIIc1o", "question": "A man wearing a jacket with a white feather pattern is sitting in a car. Next to him is a woman in a green dress with her hair tied up. The woman is wearing sunglasses, and there is a bottle of water in front of her. Outside the car window, there is a signpost and a building with green trees. Where have the sunglasses worn by the woman in the green dress appeared before?", "question_wo_referring_query": "Where have the sunglasses worn by the woman in the green dress appeared before?", "candidates": ["On a tall tree", "In front of a luxury hotel bed", "On a small path between fields", "On a stone staircase", "In a pavilion by the lake"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "wCQC8ZIIc1o_1", "video_path": "wCQC8ZIIc1o.mp4", "subtitle_path": "wCQC8ZIIc1o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1027.07, "view_count": 43873}, {"video_id": "wCQC8ZIIc1o", "question": "A man wearing a white feathered jacket is sitting in a car, with a woman in green clothes and shorts resting her head on his lap. The woman is wearing silver ornaments on her hand. The car interior is gray, and outside the car window there is greenery. 
In what scene has this woman in green clothes resting on the man's lap appeared before?", "question_wo_referring_query": "In what scene has this woman in green clothes resting on the man's lap appeared before?", "candidates": ["A crowded pavilion in the middle of a lake", "A country road with green grass on both sides", "Inside a restaurant", "On a tall tree", "A red public bus"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "wCQC8ZIIc1o_2", "video_path": "wCQC8ZIIc1o.mp4", "subtitle_path": "wCQC8ZIIc1o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1027.07, "view_count": 43873}, {"video_id": "KBpOmH-TeRE", "question": "A dog walks on a yellow dirt ground. The dog's body color is primarily white. There are iron railings and stone slabs beside the dog. Inside the iron railings, there is a parked motorcycle. In the lower right corner of the screen, there is a man wearing a dark-colored coat. What subtitles have appeared along with this dog?", "question_wo_referring_query": "What subtitles have appeared along with this dog?", "candidates": ["She can get better results", "It tastes good", "that's so cool Alibi hey", "you don't understand how incredibly", "shut"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "KBpOmH-TeRE_0", "video_path": "KBpOmH-TeRE.mp4", "subtitle_path": "KBpOmH-TeRE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1742.37, "view_count": 117100}, {"video_id": "KBpOmH-TeRE", "question": "The refrigerator contains various cakes. In the upper left position of the refrigerator, there is a cake with a chocolate-style packaging. Next to it is a cone of ice cream. A hand is holding the cone of ice cream, and there is a label on top of the ice cream. 
In which caption did this ice cream cone appear?", "question_wo_referring_query": "In which caption did this ice cream cone appear?", "candidates": ["Applause", "everything interesting so I guess uh", "that's so cool Alibi hey", "This decision needs to be discussed", "Music"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "KBpOmH-TeRE_1", "video_path": "KBpOmH-TeRE.mp4", "subtitle_path": "KBpOmH-TeRE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1742.37, "view_count": 117100}, {"video_id": "KBpOmH-TeRE", "question": "A curly-haired man wearing a blue jacket is standing next to a huge pit. The man has a red and black scarf around his neck. Inside the pit there are rocks and sand, and there are objects burning along the edge of the pit. In which subtitle does this scarf appear together with the described scene?", "question_wo_referring_query": "Which subtitle did this scarf appear with together?", "candidates": ["man one thing I absolutely love about", "I need some food", "Turk menistan and the whole thing was", "that's so cool Alibi hey", "This decision needs to be discussed"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "KBpOmH-TeRE_2", "video_path": "KBpOmH-TeRE.mp4", "subtitle_path": "KBpOmH-TeRE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1742.37, "view_count": 117100}, {"video_id": "Rx2G6jyxMKo", "question": "A man wearing a gray short-sleeve shirt appears on the video cover. The man has tattoos on his arms, and behind him, there is a map and a black-and-white illustration of planets. The map has red and blue curves distributed on it. 
When the man appears on the gray cover with three spheres, what changes occur to him?", "question_wo_referring_query": "When the man appears on the gray cover with three spheres, what changes occur to him?", "candidates": ["The man put on a pair of glasses", "The man's short sleeves turned into a blue suit", "The man's short sleeves turned into a denim jacket", "The man's short sleeves turned into a lab coat", "The man put on a hat"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "Rx2G6jyxMKo_0", "video_path": "Rx2G6jyxMKo.mp4", "subtitle_path": "Rx2G6jyxMKo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.16, "view_count": 354633}, {"video_id": "Rx2G6jyxMKo", "question": "Under a clear sky, a man wearing a black vest is talking on the grass. The man has a white print on the chest area of his black vest, and there is a large tree behind him. To the left of the tree, there is a man wearing shorts. When the man in the black vest appears on the cover of a video about the flat earth theory, what changes occur?", "question_wo_referring_query": "When the man in the black vest appears on the cover of a video about the flat earth theory, what changes occur?", "candidates": ["The man's black vest changes to a pink vest.", "The man's black vest changes to a blue short-sleeved shirt.", "The man's black vest changes to a green vest.", "The man's black vest changes to a white vest.", "The man's black vest changes to a denim jacket."], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "Rx2G6jyxMKo_1", "video_path": "Rx2G6jyxMKo.mp4", "subtitle_path": "Rx2G6jyxMKo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.16, "view_count": 354633}, {"video_id": "Rx2G6jyxMKo", "question": "A man in a dark short-sleeved shirt is sitting and talking. Behind him are a wall and a bookshelf. 
On the bookshelf, there are books, a globe, and a plastic box. Posters are stuck on the wall. When this man appears on the cover of a video mentioning a space agency, what change occurs to him?", "question_wo_referring_query": "When this man appears on the cover of a video mentioning a space agency, what change occurs to him?", "candidates": ["The man's short-sleeved shirt changes to a denim jacket", "The tattoo on the man's arm changes to black", "The man's dark short-sleeved shirt changes to a gray short-sleeved shirt", "The man's dark short-sleeved shirt changes to a spacesuit", "The man's dark short-sleeved shirt changes to a suit"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "Rx2G6jyxMKo_2", "video_path": "Rx2G6jyxMKo.mp4", "subtitle_path": "Rx2G6jyxMKo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.16, "view_count": 354633}, {"video_id": "bu9Bm8aw_lI", "question": "On a piece of white paper, there is a drawing of a high-heeled shoe. Two hands appear on the paper; one is pressing down the paper and the other is holding a black pen, sketching. Next to the paper, there is a box of ink. 
When the subtitle 'onto absorbent paper, making a copy' appears, what change happens to the high-heeled shoe?", "question_wo_referring_query": "What change happens to the high-heeled shoe?", "candidates": ["The high-heeled shoe gets colored.", "The high-heeled shoe shrinks.", "The high-heeled shoe turns into a sneaker.", "The high-heeled shoe enlarges.", "The high-heeled shoe turns into a fabric shoe."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "bu9Bm8aw_lI_0", "video_path": "bu9Bm8aw_lI.mp4", "subtitle_path": "bu9Bm8aw_lI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.04, "view_count": 597917}, {"video_id": "bu9Bm8aw_lI", "question": "In the black background, there is a black-and-white photo of Marilyn Monroe in the center. Marilyn is wearing a black off-the-shoulder dress, her hair is curly, and her lips are slightly open, showing her teeth. When the subtitle 'and her youth - and the peak of her fame. Marilyn is forever frozen in perfect cinematic beauty' appears, what change occurs to Marilyn's black-and-white photo?", "question_wo_referring_query": "In the black background, there is a black-and-white photo of Marilyn Monroe in the center. Marilyn is wearing a black off-the-shoulder dress, her hair is curly, and her lips are slightly open, showing her teeth. When the subtitle 'and her youth - and the peak of her fame. 
Marilyn is forever frozen in perfect cinematic beauty' appears, what change occurs to Marilyn's black-and-white photo?", "candidates": ["The black-and-white photo tilts to the left", "The black-and-white photo tilts upside down", "The black-and-white photo tilts to the right", "The black-and-white photo flips", "Color appears on the black-and-white photo"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "bu9Bm8aw_lI_1", "video_path": "bu9Bm8aw_lI.mp4", "subtitle_path": "bu9Bm8aw_lI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.04, "view_count": 597917}, {"video_id": "bu9Bm8aw_lI", "question": "On the gray paper, there is a man wearing glasses. The man's glasses have black frames, and his short hair is slanted to the side. The man's collar accessory is white. When the subtitle 'by premature baldness, pockmarked skin' appears, what changes are seen in the man in the image?", "question_wo_referring_query": "What changes are seen in the man in the image?", "candidates": ["The man's head and nose are covered with messy black lines", "The man's mouth and eyes are covered with messy black lines", "The man's eyes and nose are covered with messy black lines", "The man's head and mouth are covered with messy black lines", "The man's head and eyes are covered with messy black lines"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "bu9Bm8aw_lI_2", "video_path": "bu9Bm8aw_lI.mp4", "subtitle_path": "bu9Bm8aw_lI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.04, "view_count": 597917}, {"video_id": "Z2cWR9t2_E4", "question": "On the wooden table, there is a transparent glass bowl containing gelatin-based food in various colors like red, green, and yellow. A hand holding a knife appears on the screen. 
What is this hand doing?", "question_wo_referring_query": "What is this hand doing?", "candidates": ["Stirring the food with the knife", "Tapping the glass bowl with the knife", "Using the knife to skewer the food", "Cutting the food with the knife", "Scraping the glass bowl with the knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "Z2cWR9t2_E4_0", "video_path": "Z2cWR9t2_E4.mp4", "subtitle_path": "Z2cWR9t2_E4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1239.76, "view_count": 58445}, {"video_id": "Z2cWR9t2_E4", "question": "On a green wood cutting board, there are white scallions. A hand holding a knife appears on the screen, the ring finger of one hand is wearing a ring, and the knife has black text and symbols on it. What are these hands doing?", "question_wo_referring_query": "What are these hands doing?", "candidates": ["Pounding scallions with a knife", "Chopping scallions with a knife", "Cutting scallions with a knife", "Crushing scallions with a knife", "Cleaning the knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "Z2cWR9t2_E4_1", "video_path": "Z2cWR9t2_E4.mp4", "subtitle_path": "Z2cWR9t2_E4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1239.76, "view_count": 58445}, {"video_id": "Z2cWR9t2_E4", "question": "On the woven mat lies a light-colored bowl. The rim of the bowl has gone through a process making it darker, and inside the bowl, two lamps are reflected. On the screen, there are yellow smiley faces displayed with one holding an egg. 
What is the hand doing?", "question_wo_referring_query": "What is the hand doing?", "candidates": ["Rubbing the rim of the bowl with the egg", "Placing the egg into the bowl", "Rotating the egg", "Using the egg to tap the rim of the bowl", "Peeling the egg"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "Z2cWR9t2_E4_2", "video_path": "Z2cWR9t2_E4.mp4", "subtitle_path": "Z2cWR9t2_E4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1239.76, "view_count": 58445}, {"video_id": "6ibZezGV0Kc", "question": "A woman in a black turtleneck coat is video calling a long-haired woman in a white coat. The woman in black is wearing round earrings and a black inner garment, while the woman in white is in a room with glaring light. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Hat", "Necklace", "Puppy", "Television", "Watch"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "6ibZezGV0Kc_0", "video_path": "6ibZezGV0Kc.mp4", "subtitle_path": "6ibZezGV0Kc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 922.88, "view_count": 279537}, {"video_id": "6ibZezGV0Kc", "question": "A woman wearing a black inner layer and a black high-collar outer coat is standing in front of a white curtain. Beside the woman is a white wall. One of the woman's hands is near her shoulder, and she is wearing round earrings. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["buttons", "hat", "glasses", "planter", "book"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "6ibZezGV0Kc_1", "video_path": "6ibZezGV0Kc.mp4", "subtitle_path": "6ibZezGV0Kc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 922.88, "view_count": 279537}, {"video_id": "6ibZezGV0Kc", "question": "On the right side of the screen is a woman wearing a black inner shirt and a black turtleneck outer garment, with her hand in a claw-like gesture. On the left side of the screen is a woman in a white outer garment handling blue food items, surrounded by various kitchen utensils. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A dog", "A red knife", "A wristwatch", "A green knife", "A cat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "6ibZezGV0Kc_2", "video_path": "6ibZezGV0Kc.mp4", "subtitle_path": "6ibZezGV0Kc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 922.88, "view_count": 279537}, {"video_id": "-sNJd7bANTI", "question": "A man wearing a black short-sleeved shirt is standing in front of a brown wooden cabinet. On the cabinet, there is a black box and a white paper box. There is a staircase in the distance. 
When the subtitle 'do not throw away' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["watch", "sofa", "green plant", "hat", "desk lamp"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "-sNJd7bANTI_0", "video_path": "-sNJd7bANTI.mp4", "subtitle_path": "-sNJd7bANTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1539.2, "view_count": 17038}, {"video_id": "-sNJd7bANTI", "question": "In the bottom right corner of the screen, there is a man wearing black clothes and glasses. In the center of the screen, a woman wearing a yellow checkered dress is standing in front of a green tree. Below the woman, there is a red horizontal strip with white text. When the subtitle 'think rosie is like a 3d model that they' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["purple flowers", "a dog", "a cat", "blue flowers", "red flowers"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "-sNJd7bANTI_1", "video_path": "-sNJd7bANTI.mp4", "subtitle_path": "-sNJd7bANTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1539.2, "view_count": 17038}, {"video_id": "-sNJd7bANTI", "question": "In the bottom right corner of the screen, there is a man in black clothes wearing glasses. In the middle of the screen, a girl in a blue short-sleeved shirt is blowing out candles on a cake. The girl is sitting on a dark-colored sofa. There is a poster hanging on the wall. 
When the subtitle 'sure how it works but given that this' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["soccer ball", "sofa cushion", "pot plant", "hat", "necklace"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "-sNJd7bANTI_2", "video_path": "-sNJd7bANTI.mp4", "subtitle_path": "-sNJd7bANTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1539.2, "view_count": 17038}, {"video_id": "ftymKoG78Ng", "question": "A man dressed in a dark short-sleeved shirt is standing in front of Wu Gexiao Scenic Area. The man is wearing a black backpack. To his right is a stone wall, and to his left is a row of houses. Behind the man is a squatting woman. A short-haired woman, who is carrying a pink bag, walks past to the left rear side of the man. What style of clothing is the short-haired woman carrying the pink bag wearing?", "question_wo_referring_query": "What style of clothing is the short-haired woman carrying the pink bag wearing?", "candidates": ["Blue short sleeves", "Gray jacket", "Gray short sleeves", "Gray sweater", "Blue denim jacket"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "ftymKoG78Ng_0", "video_path": "ftymKoG78Ng.mp4", "subtitle_path": "ftymKoG78Ng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.07, "view_count": 98405}, {"video_id": "ftymKoG78Ng", "question": "A man wearing light-colored clothing is in a room. The man is wearing a bracelet on his wrist. There is a TV and a mirror behind the man. The curtains in the room are light-colored. 
What is the man's hairstyle like?", "question_wo_referring_query": "What is the man's hairstyle like?", "candidates": ["Short and trim hair", "Long black hair", "Long hair with a ponytail", "Long golden hair", "Long silver hair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "ftymKoG78Ng_1", "video_path": "ftymKoG78Ng.mp4", "subtitle_path": "ftymKoG78Ng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.07, "view_count": 98405}, {"video_id": "ftymKoG78Ng", "question": "A man drapes a piece of clothing over his neck. Behind the man, there's a white wall and a dark-colored cabinet filled with items. Beside the cabinet, there is a rotating fan and a light. To the left of the man, a staircase is visible. What kind of light is in the room?", "question_wo_referring_query": "What kind of light is in the room?", "candidates": ["Square wall light", "Round wall light", "Round pendant light", "Flower-shaped pendant light", "Waterdrop-shaped pendant light"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "ftymKoG78Ng_2", "video_path": "ftymKoG78Ng.mp4", "subtitle_path": "ftymKoG78Ng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 924.07, "view_count": 98405}, {"video_id": "Z-LhBjlpwr0", "question": "On the right side of the screen is a man wearing a white shirt and suit, with green plants and a white building in the background. On the left side of the screen is Biden, dressed in a dark top and wearing a black hat, standing in front of a black podium. Behind Biden stand two men. 
When the subtitle 'Out referring uh to Russian President' appears, what is the shape on the blue part of the flag behind Biden?", "question_wo_referring_query": "What is the shape on the blue part of the flag behind Biden?", "candidates": ["Pentagon shape", "Fan shape", "Circle", "Animal shape", "Rectangle"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "Z-LhBjlpwr0_0", "video_path": "Z-LhBjlpwr0.mp4", "subtitle_path": "Z-LhBjlpwr0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2989.62, "view_count": 196851}, {"video_id": "Z-LhBjlpwr0", "question": "The woman wearing a brown coat is holding a microphone and speaking. The handle of the microphone is black. Behind her are three white long poles and the sky. When the subtitle 'Supreme Court that Frozen embryos are' appears at the bottom of the screen, what style is the woman's clothing under the coat?", "question_wo_referring_query": "What style is the woman's clothing under the coat?", "candidates": ["High-necked white sweater", "V-necked white sweater", "Round-necked blue sweater", "High-necked blue sweater", "Round-necked white sweater"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "Z-LhBjlpwr0_1", "video_path": "Z-LhBjlpwr0.mp4", "subtitle_path": "Z-LhBjlpwr0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2989.62, "view_count": 196851}, {"video_id": "Z-LhBjlpwr0", "question": "In the scene, a woman in a red dress and a man in a blue suit are posing together. The man is wearing a blue shirt, and his suit has a patterned decoration on the lapel. The woman is wearing a gold necklace around her neck. There is a tree behind them. At the bottom of the screen, there is a white banner with black text. 
When the subtitle 'before the two met Sam Brown spoke about' appears, what is the man's hairstyle like?", "question_wo_referring_query": "What is the man's hairstyle like?", "candidates": ["A fringe covering the eyebrows", "Long golden hair", "Long black hair", "Short hair with a visible forehead", "Silver hair parted in the middle"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "Z-LhBjlpwr0_2", "video_path": "Z-LhBjlpwr0.mp4", "subtitle_path": "Z-LhBjlpwr0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2989.62, "view_count": 196851}, {"video_id": "77_Tqv-F8_M", "question": "There are two men in the scene. One is wearing a black hat, a life jacket, and blue shorts, while the other is wearing a red shirt and orange pants. The scene features a lake with a grassy shore. Who is rowing the boat on the water?", "question_wo_referring_query": "There are two men in the scene. One is wearing a black hat, a life jacket, and blue shorts, while the other is wearing a red shirt and orange pants. The scene features a lake with a grassy shore. Who is rowing the boat on the water?", "candidates": ["The man in red pants", "The woman in red pants", "The man in blue shorts", "The man in white shorts", "The woman in blue shorts"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "77_Tqv-F8_M_0", "video_path": "77_Tqv-F8_M.mp4", "subtitle_path": "77_Tqv-F8_M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1306.36, "view_count": 188924}, {"video_id": "77_Tqv-F8_M", "question": "On the snowy road, a team of soldiers in uniforms is advancing. They are wearing green military coats with black stripes, black pants, and hats. There is a signpost and a forest on the right side of the road. 
Who is carrying the flag and advancing?", "question_wo_referring_query": "Who is carrying the flag and advancing?", "candidates": ["The third man in the middle row wearing a black hat", "The first man in the right row wearing a black hat", "The first man in the left row wearing a gray hat", "The second man in the left row wearing a gray hat"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "77_Tqv-F8_M_1", "video_path": "77_Tqv-F8_M.mp4", "subtitle_path": "77_Tqv-F8_M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1306.36, "view_count": 188924}, {"video_id": "77_Tqv-F8_M", "question": "The buildings on both sides of the road are tall, with red-colored buildings on the left and white-colored buildings on the right. Black street lamps stand in the greenbelt. Who is the person walking with a red bag in the video?", "question_wo_referring_query": "Who is the person walking with a red bag in the video?", "candidates": ["Woman wearing a purple dress", "Woman wearing a sleeveless white dress", "Man wearing a black vest over a white long-sleeve shirt", "Man wearing a red shirt", "Man wearing a purple shirt and black shorts"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "77_Tqv-F8_M_2", "video_path": "77_Tqv-F8_M.mp4", "subtitle_path": "77_Tqv-F8_M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1306.36, "view_count": 188924}, {"video_id": "POLDmEcddCg", "question": "A light-colored Beetle car is parked in the garage. Next to the Beetle car is a black car. The garage walls are white, with posters and a clothesline hanging on them. There are pipes and two windows on the back wall. 
What does the bald man in the black suit do when he appears for the first time?", "question_wo_referring_query": "What does the bald man in the black suit do when he appears for the first time?", "candidates": ["He gets into the driver's seat of the black car", "He picks up a cigarette", "He picks up a car key", "He picks up a poster", "He gets into the driver's seat of the Beetle car"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "POLDmEcddCg_0", "video_path": "POLDmEcddCg.mp4", "subtitle_path": "POLDmEcddCg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 922.17, "view_count": 404552}, {"video_id": "POLDmEcddCg", "question": "Two men wearing dark coats stand side by side. The man on the right is wearing glasses and has a mustache, while the man on the left has his sleeves rolled up halfway. Behind them, there are red cabinets and paper boxes. What did the man on the left do the first time he appeared?", "question_wo_referring_query": "What did the man on the left do the first time he appeared?", "candidates": ["Hands on hips", "Patted the man on the right's shoulder", "Shook hands with the man on the right", "Arms crossed", "Raised both hands"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "POLDmEcddCg_1", "video_path": "POLDmEcddCg.mp4", "subtitle_path": "POLDmEcddCg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 922.17, "view_count": 404552}, {"video_id": "POLDmEcddCg", "question": "A light green car is parked indoors, and two men wearing dark outerwear appear on both sides of the car. Both men are wearing jeans and pink face masks. There is a round poster on the room's wall, and on one side of the wall, there are two windows. 
Upon his first appearance, what did the man on the left in blue jeans do?", "question_wo_referring_query": "What did the man on the left in blue jeans do upon his first appearance?", "candidates": ["Checked the car windows", "Checked the car tires", "Checked the car engine", "Checked the car headlights", "Performed maintenance on the car"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "POLDmEcddCg_2", "video_path": "POLDmEcddCg.mp4", "subtitle_path": "POLDmEcddCg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 922.17, "view_count": 404552}, {"video_id": "efPrtcLdcdM", "question": "The man wearing a black top is standing in front of a green backdrop, wearing dark sunglasses, with a tattoo on his chest. In front of him, there is a microphone. After talking about training a language model, what happens next?", "question_wo_referring_query": ", what happens next?", "candidates": ["The man took off his dark sunglasses.", "The model generated thousands of posts.", "The man picked up a black phone.", "The man picked up a black hat.", "The man went to drink water."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "efPrtcLdcdM_0", "video_path": "efPrtcLdcdM.mp4", "subtitle_path": "efPrtcLdcdM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.28, "view_count": 985849}, {"video_id": "efPrtcLdcdM", "question": "On the white paper, a flag podium and wheelhouse is being drawn. There is a spherical pattern on the flagpole. One hand is holding a pen and the other hand is pressing the paper to continue drawing the pattern. 
After finishing the drawing of the flag podium, what do the hands do next?", "question_wo_referring_query": "What do the hands do next?", "candidates": ["Cut the white paper with scissors", "One hand is crumpling the paper", "Clench fists with both hands", "Color the flag podium", "One hand is drawing the sun"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "efPrtcLdcdM_1", "video_path": "efPrtcLdcdM.mp4", "subtitle_path": "efPrtcLdcdM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.28, "view_count": 985849}, {"video_id": "efPrtcLdcdM", "question": "The man wearing a white shirt and blue suit is standing in front of a yellow background. The man has a long beard and is holding paper notes in one hand while pushing them out one by one with the other hand. What does the man do after pushing out a circle of paper notes?", "question_wo_referring_query": "What action does the man take after pushing out a circle of paper notes?", "candidates": ["The man waves goodbye.", "The man places his hands on his hips.", "The man crosses his hands in front of his chest.", "The man spreads his hands and scatters all the paper notes.", "The man raises both hands high."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "efPrtcLdcdM_2", "video_path": "efPrtcLdcdM.mp4", "subtitle_path": "efPrtcLdcdM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1159.28, "view_count": 985849}, {"video_id": "SIYWxZZ9igc", "question": "In the gray background, the center has white English text characters. Above the white characters, there are black characters with red edges. At the bottom of the screen, there is a blue icon. 
In this video, which person's real-life image appears first?", "question_wo_referring_query": "Which person\u2019s real-life image appears first?", "candidates": ["MICHEL MOATTI", "CLIVE EMSLEY", "MARI ANNE", "JAMES CAMERON", "POLY NICOLS"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "SIYWxZZ9igc_0", "video_path": "SIYWxZZ9igc.mp4", "subtitle_path": "SIYWxZZ9igc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1305.17, "view_count": 752604}, {"video_id": "SIYWxZZ9igc", "question": "In the room, there is a woman wearing a green and white dress. The woman has a belt around her waist and is holding a dark-colored bottle. There are many bottles on the floor and table as well. Behind the woman, there is a wardrobe and a table lamp. Who does the woman talk to first?", "question_wo_referring_query": "Who does the woman talk to first?", "candidates": ["A man wearing a black hat and black trench coat", "A police officer in uniform", "A waiter wearing a white shirt", "A blonde man who looks like a scavenger", "An elderly lady with gray hair wearing simple clothes"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "SIYWxZZ9igc_1", "video_path": "SIYWxZZ9igc.mp4", "subtitle_path": "SIYWxZZ9igc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1305.17, "view_count": 752604}, {"video_id": "SIYWxZZ9igc", "question": "Who is the first female cartoon character to appear in the entire video?", "question_wo_referring_query": "Who is the first female cartoon character to appear in the entire video?", "candidates": ["The woman wearing a green top", "The woman wearing a white top", "The woman wearing a blue top", "The woman wearing a red outfit with a necklace around her neck", "The woman wearing a headscarf"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": 
"SIYWxZZ9igc_2", "video_path": "SIYWxZZ9igc.mp4", "subtitle_path": "SIYWxZZ9igc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1305.17, "view_count": 752604}, {"video_id": "2RBolFsLElk", "question": "Against a white background, the top left corner of the screen displays the black English text 'Contrastive Learning', and there are three black arrows pointing in different directions in the middle of the screen. What other objects are present in this screen?", "question_wo_referring_query": "What other objects are present in this screen?", "candidates": ["Three pictures of dogs in different color tones", "Three pictures of monkeys in different color tones", "Three pictures of horses in different color tones", "Three pictures of fish in different color tones", "Three pictures of cats in different color tones"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "2RBolFsLElk_0", "video_path": "2RBolFsLElk.mp4", "subtitle_path": "2RBolFsLElk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2643.43, "view_count": 246}, {"video_id": "2RBolFsLElk", "question": "In the screen with a green background that has a white line and the white English text 'Ablation Study' at the top left corner, what other object is present in the screen?", "question_wo_referring_query": "What other object is present in the screen?", "candidates": ["orange bright spot", "small white triangle arrow", "green bright spot", "red bright spot", "small black triangle arrow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "2RBolFsLElk_1", "video_path": "2RBolFsLElk.mp4", "subtitle_path": "2RBolFsLElk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2643.43, "view_count": 246}, {"video_id": "2RBolFsLElk", "question": "In a green background screen with a white line and the white 
English word 'Questions' in the top left corner, what else is present in the screen?", "question_wo_referring_query": "what else is present on the screen?", "candidates": ["red question mark", "black parentheses", "white question mark", "white period", "white comma"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "2RBolFsLElk_2", "video_path": "2RBolFsLElk.mp4", "subtitle_path": "2RBolFsLElk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2643.43, "view_count": 246}, {"video_id": "F_DS6UvZXPo", "question": "By the side of a lake, a tank is firing shells on a grassy field, with some black smoke lingering in the air. When the subtitle says 'one of the main strengths of the Panzerhaubitze 2000 is the integration of the fire control system,' what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["boat", "flowers", "aircraft", "ducks", "trees"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "F_DS6UvZXPo_0", "video_path": "F_DS6UvZXPo.mp4", "subtitle_path": "F_DS6UvZXPo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.0, "view_count": 563440}, {"video_id": "F_DS6UvZXPo", "question": "In the green background with the white English word 'Symbolism' at the top, it shows a white hammer icon in the middle surrounded by a white circle. 
When the subtitle says \u201cthat due to the visual similarity if the Elefant tank destroyer in combination with it being far more\u201d, what other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["Bicycle", "Airplane", "National flag", "Tank", "Ship"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "F_DS6UvZXPo_1", "video_path": "F_DS6UvZXPo.mp4", "subtitle_path": "F_DS6UvZXPo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.0, "view_count": 563440}, {"video_id": "F_DS6UvZXPo", "question": "In the green background, there is a circular icon at the bottom left that contains both a question mark and an exclamation mark within a white circle, and on the far right side, there is another circular icon with just a white question mark inside a white circle. When the subtitles mention 'although less likely since Hungry is located in Europe,' what else is present in this screen?", "question_wo_referring_query": "What else is present in this screen?", "candidates": ["American flag", "Hungarian flag", "Norwegian flag", "French flag", "Chinese flag"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "F_DS6UvZXPo_2", "video_path": "F_DS6UvZXPo.mp4", "subtitle_path": "F_DS6UvZXPo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1058.0, "view_count": 563440}, {"video_id": "kxS7IgTZebs", "question": "There is a large grey semi-arched bridge in the background of the scene. In front of the bridge, a soldier is standing on green grass, wearing a helmet and holding a gun. 
What style of upper garment is this soldier wearing?", "question_wo_referring_query": "What style of upper garment is this soldier wearing?", "candidates": ["Firefighter uniform", "Suit", "Nurse uniform", "Camouflage uniform", "Backpack"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "kxS7IgTZebs_0", "video_path": "kxS7IgTZebs.mp4", "subtitle_path": "kxS7IgTZebs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2230.46, "view_count": 1739597}, {"video_id": "kxS7IgTZebs", "question": "On the screen, a person wearing a grey diving suit and carrying a backpack is standing on the green seafloor. There is a small fish next to this person. There is also some green seaweed on the seafloor. What color is the person's backpack?", "question_wo_referring_query": ", what color is the person's backpack?", "candidates": ["red", "green", "purple", "yellow", "black"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "kxS7IgTZebs_1", "video_path": "kxS7IgTZebs.mp4", "subtitle_path": "kxS7IgTZebs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2230.46, "view_count": 1739597}, {"video_id": "kxS7IgTZebs", "question": "There is a forest of green trees in the background of the scene. A person dressed in white clothes and wearing a military hat is holding a rifle, standing on the white snow. 
What shape is the red pattern on his hat?", "question_wo_referring_query": "What shape is the red pattern on his hat?", "candidates": ["Rectangle", "Triangle", "Pentagon", "Hexagon", "Circle"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "kxS7IgTZebs_2", "video_path": "kxS7IgTZebs.mp4", "subtitle_path": "kxS7IgTZebs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2230.46, "view_count": 1739597}, {"video_id": "OLEsDqBE2fQ", "question": "In a room with a green picture frame, sitting in front of a white wall and a white door, holding a green maple leaf, who is the person?", "question_wo_referring_query": "Who is it?", "candidates": ["A man wearing a hat and blue clothes", "A woman wearing glasses and green clothes", "A woman wearing glasses and red clothes", "A man wearing glasses and blue clothes", "A woman wearing glasses and blue clothes"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "OLEsDqBE2fQ_0", "video_path": "OLEsDqBE2fQ.mp4", "subtitle_path": "OLEsDqBE2fQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.48, "view_count": 1260}, {"video_id": "OLEsDqBE2fQ", "question": "In a room with a hanging picture frame, a person is holding a book. On the right page of the book, there is a child in green clothes standing on white snow, looking at two people playing on a frozen lake. 
Who is the person holding the book?", "question_wo_referring_query": "Who is the person holding the book?", "candidates": ["A woman wearing earphones", "A woman wearing a mask", "A woman wearing glasses and dressed in blue", "A woman wearing black sunglasses", "A woman wearing glasses and dressed in blue"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "OLEsDqBE2fQ_1", "video_path": "OLEsDqBE2fQ.mp4", "subtitle_path": "OLEsDqBE2fQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.48, "view_count": 1260}, {"video_id": "OLEsDqBE2fQ", "question": "In a room with an olive-colored picture frame, sitting in front of a white wall and a white door, who is the person holding their glasses frame with their hand?", "question_wo_referring_query": "Who is it?", "candidates": ["The white-haired woman in blue clothing", "The long straight-haired woman in blue clothing", "The curly-haired woman in blue clothing", "The curly-haired man in black clothing", "The woman with bangs in blue clothing"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "OLEsDqBE2fQ_2", "video_path": "OLEsDqBE2fQ.mp4", "subtitle_path": "OLEsDqBE2fQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.48, "view_count": 1260}, {"video_id": "FSl92F_mEek", "question": "On the left side of the screen is a short-haired woman wearing a black dress with red polka dots. On the right side of the screen is a newspaper with 'METR' written on it. 
What changed when the black text 'facebook.com' first appeared in the white long box at the bottom?", "question_wo_referring_query": "What changed?", "candidates": ["The text moved up", "The text moved to the left", "The text moved to the right", "The text moved down", "The text turned red"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "FSl92F_mEek_0", "video_path": "FSl92F_mEek.mp4", "subtitle_path": "FSl92F_mEek_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1048.16, "view_count": 8982}, {"video_id": "FSl92F_mEek", "question": "On the right side of the screen is a blonde-haired female host wearing pink clothing, and on the left side is a page with an image of a notebook and a mobile phone. What happened the first time this page appeared?", "question_wo_referring_query": "What happened the first time this page appeared?", "candidates": ["The page moved up", "The page moved down", "The page turned red", "The page moved to the right", "The page moved to the left"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "FSl92F_mEek_1", "video_path": "FSl92F_mEek.mp4", "subtitle_path": "FSl92F_mEek_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1048.16, "view_count": 8982}, {"video_id": "FSl92F_mEek", "question": "On a white background, there is a red rectangle with the white letter 'M' on the far left side. 
What happens when a blue rectangle with the white letter 'm' first appears in the center of the screen, towards the far left?", "question_wo_referring_query": "What change occurs?", "candidates": ["Turns green", "Turns purple", "Moves right", "Moves left", "Turns red"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "FSl92F_mEek_2", "video_path": "FSl92F_mEek.mp4", "subtitle_path": "FSl92F_mEek_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1048.16, "view_count": 8982}, {"video_id": "H5vpBCLo74U", "question": "There is a black number '6' at the top of the white page, and below the number '6', the text 'New York is a city' is written in green font. Below the letters 'a' and the word 'city', there are blue arrows drawn. When the subtitle says 'regressive style or a city it's a better', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The arrow turns red", "The dashed line turns black", "The dashed line disappears", "The dashed line turns red", "The dashed line turns purple"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "H5vpBCLo74U_0", "video_path": "H5vpBCLo74U.mp4", "subtitle_path": "H5vpBCLo74U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1805.13, "view_count": 23329}, {"video_id": "H5vpBCLo74U", "question": "In the middle of a white page, there is green text saying 'New York is a city.' Below it, the black text saying 'is a city' is circled in green. 
What happens on the screen when the subtitles say 'classically right just one two three'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The red numbers 123 appear in the middle of the screen", "The blue numbers 56 appear in the middle of the screen", "The purple numbers 123 appear in the middle of the screen", "The blue numbers 75 appear in the middle of the screen", "The blue numbers 123 appear in the middle of the screen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "H5vpBCLo74U_1", "video_path": "H5vpBCLo74U.mp4", "subtitle_path": "H5vpBCLo74U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1805.13, "view_count": 23329}, {"video_id": "H5vpBCLo74U", "question": "At the top of a white page, there's a data table. On the bottom left, there is black text that says '3.6 ClueWeb09-B Dataset'. When the subtitle reads 'appreciate these kind of analyses called', what happens on the screen?", "question_wo_referring_query": "What happens on the screen when the subtitle reads 'appreciate these kind of analyses called'?", "candidates": ["The page scrolls down", "The page scrolls up", "The table is circled in purple", "The table is circled in yellow", "The table is circled in blue"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "H5vpBCLo74U_2", "video_path": "H5vpBCLo74U.mp4", "subtitle_path": "H5vpBCLo74U_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1805.13, "view_count": 23329}, {"video_id": "nEU-8AeBBbE", "question": "After a man wearing black sunglasses and a short-sleeve shirt with a blue-green floral pattern appears behind a black screen, which of the following scenic spots is introduced first?", "question_wo_referring_query": "Which of the following scenic spots is introduced first?", "candidates": ["11 Memorial Monuments", "Xiaofu 
Island", "Missoni", "Banjac Waterfall", "Xigong River"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "nEU-8AeBBbE_0", "video_path": "nEU-8AeBBbE.mp4", "subtitle_path": "nEU-8AeBBbE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.86, "view_count": 63768}, {"video_id": "nEU-8AeBBbE", "question": "After a man wearing black glasses and holding a small doll with one hand, carrying a blue globe on his shoulder appears, which country is mentioned first below?", "question_wo_referring_query": "Which country is mentioned first below?", "candidates": ["China", "France", "Thailand", "Vietnam", "Laos"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "nEU-8AeBBbE_1", "video_path": "nEU-8AeBBbE.mp4", "subtitle_path": "nEU-8AeBBbE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.86, "view_count": 63768}, {"video_id": "nEU-8AeBBbE", "question": "After a man wearing black glasses and a blue-green patterned short-sleeve shirt appears behind a black screen, which character appears first?", "question_wo_referring_query": ", which character appears first?", "candidates": ["A woman wearing a black hat and black clothes who is picking red fruits from a tree appears in the upper right corner of the screen.", "A man wearing a black shirt and black shoes doing push-ups appears on the right side of the screen.", "A woman wearing a pink shirt and a straw hat who is rowing a boat appears in the upper right corner of the screen.", "A bald man wearing a white shirt appears on the right side of the screen.", "A person wearing a red shirt who is picking tea on a mountain, wearing a straw hat and a backpack, appears in the upper left corner of the screen."], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "nEU-8AeBBbE_2", "video_path": "nEU-8AeBBbE.mp4", 
"subtitle_path": "nEU-8AeBBbE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.86, "view_count": 63768}, {"video_id": "WZznMfEiIEc", "question": "In front of a locker with a red rectangular paper strip labeled 'G-15' in the upper left corner, stands a man with a mustache wearing a black shirt. After the subtitles say 'at most of the metro stations and so for', what action does this man take?", "question_wo_referring_query": "What action does this man take?", "candidates": ["Puts his hand on his head", "Presses his thumb on the locker", "Touches his ear with his hand", "Presses his index finger on the locker", "Presses his middle finger on the locker"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "WZznMfEiIEc_0", "video_path": "WZznMfEiIEc.mp4", "subtitle_path": "WZznMfEiIEc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1249.63, "view_count": 304895}, {"video_id": "WZznMfEiIEc", "question": "On a piece of grass, the man in black clothes with a blue backpack on the far left is using his hand to tease a child on the back in the middle. On the far right stands a man in a black feather coat. 
After the subtitle says 'same thing that I'm doing but look', what action does the man with the backpack on the far left do?", "question_wo_referring_query": "What action does the man with the backpack on the far left do?", "candidates": ["Waves at the camera", "Puts the backpack down", "Punches towards the camera", "Gives a thumbs-up to the camera", "Hugs the child"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "WZznMfEiIEc_1", "video_path": "WZznMfEiIEc.mp4", "subtitle_path": "WZznMfEiIEc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1249.63, "view_count": 304895}, {"video_id": "WZznMfEiIEc", "question": "On the left side of the screen, against a black background with yellow text \"BOWL OF RAMEN,\" a man in a yellow shirt is looking down. After the subtitle says, \"meal here around ten bucks u.s. really,\" what does the man do?", "question_wo_referring_query": "What does the man do?", "candidates": ["eats noodles with chopsticks", "drinks water", "eats noodles with a fork", "eats a bread roll", "buys coffee"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "WZznMfEiIEc_2", "video_path": "WZznMfEiIEc.mp4", "subtitle_path": "WZznMfEiIEc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1249.63, "view_count": 304895}, {"video_id": "GTIjylkB-TI", "question": "In a room equipped with a white air conditioner and a TV, a man in an olive-green shirt is holding a little girl dressed in blue. Next to them stands a little boy. 
After the subtitle says 'That's the next video that we're doing is hitchhiking,' which person appears first?", "question_wo_referring_query": ", which person appears first?", "candidates": ["A man wearing a white cap holding a green sign", "A man wearing a green uniform holding a blue sign", "A man wearing a white cap holding a pink sign", "A man wearing a white cap holding a red sign", "A man wearing a red uniform holding a blue sign"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "GTIjylkB-TI_0", "video_path": "GTIjylkB-TI.mp4", "subtitle_path": "GTIjylkB-TI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1497.87, "view_count": 1242249}, {"video_id": "GTIjylkB-TI", "question": "In a room with green plants, a man is sitting wearing a white cap, a gray T-shirt, and orange pants. After the subtitle says 'The book is right here,' what object appears on the screen first?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["A piano", "A book", "A guitar", "A painting", "A jade pendant"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "GTIjylkB-TI_1", "video_path": "GTIjylkB-TI.mp4", "subtitle_path": "GTIjylkB-TI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1497.87, "view_count": 1242249}, {"video_id": "GTIjylkB-TI", "question": "On a main street with a white house, a man in a black short-sleeved shirt puts his hand on the shoulder of a man wearing a white T-shirt and holding a book. 
After the subtitle says 'I hope you will celebrate Matt's great alongside us and show up as a community behind him', what object appears first on the screen?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["Car", "Ship", "Tricycle", "Train", "Airplane"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "GTIjylkB-TI_2", "video_path": "GTIjylkB-TI.mp4", "subtitle_path": "GTIjylkB-TI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1497.87, "view_count": 1242249}, {"video_id": "SYZfvi0Ab78", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a bar chart with percentages in different colors (red, green, white, etc.) appears on the screen. Then, on the top middle of a green background, there is a white circular icon with a thumbs-up gesture. Finally, on the top left of the green background, there is a white circular icon with an eye and a thermometer.", "First, on the top left of a green background, there is a white circular icon with an eye and a thermometer. Then, on the top middle of the green background, there is a white circular icon with a thumbs-up gesture. Finally, a bar chart with percentages in different colors (red, green, white, etc.) appears on the screen.", "First, a bar chart with percentages in different colors (red, green, white, etc.) appears on the screen. Then, on the top left of a green background, there is a white circular icon with an eye and a thermometer. Finally, on the top middle of the green background, there is a white circular icon with a thumbs-up gesture.", "First, on the top left of a green background, there is a white circular icon with an eye and a thermometer. Then, a bar chart with percentages in different colors (red, green, white, etc.) appears on the screen. 
Finally, on the top middle of the green background, there is a white circular icon with a thumbs-up gesture.", "First, on the top middle of a green background, there is a white circular icon with a thumbs-up gesture. Then, a bar chart with percentages in different colors (red, green, white, etc.) appears on the screen. Finally, on the top left of the green background, there is a white circular icon with an eye and a thermometer."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "SYZfvi0Ab78_0", "video_path": "SYZfvi0Ab78.mp4", "subtitle_path": "SYZfvi0Ab78_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.8, "view_count": 1587396}, {"video_id": "SYZfvi0Ab78", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a white tank appears at the top of a green background with the word 'Firepower' in the top-left corner. Then, three circular icons with white question marks appear in the middle of the green background. Finally, on the lower-left of the green background, a German flag appears, and on the right, an American flag shows up.", "First, on the lower-left of the green background, a German flag appears, and on the right, an American flag shows up. Then, three circular icons with white question marks appear in the middle of the green background. Finally, a white tank appears at the top of a green background with the word 'Firepower' in the top-left corner.", "First, three circular icons with white question marks appear in the middle of the green background. Then, on the lower-left of the green background, a German flag appears, and on the right, an American flag shows up. 
Finally, a white tank appears at the top of a green background with the word 'Firepower' in the top-left corner.", "First, a white tank appears at the top of a green background with the word 'Firepower' in the top-left corner. Then, on the lower-left of the green background, a German flag appears, and on the right, an American flag shows up. Finally, three circular icons with white question marks appear in the middle of the green background.", "First, on the lower-left of the green background, a German flag appears, and on the right, an American flag shows up. Then, a white tank appears at the top of a green background with the word 'Firepower' in the top-left corner. Finally, three circular icons with white question marks appear in the middle of the green background."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "SYZfvi0Ab78_1", "video_path": "SYZfvi0Ab78.mp4", "subtitle_path": "SYZfvi0Ab78_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.8, "view_count": 1587396}, {"video_id": "SYZfvi0Ab78", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a yellow tank drives by and puffs out billowing white smoke, then a circular design with a white four-leaf clover appears in the middle of the top of a green background, and finally white text 'Thanks to Andrew' appears at the top of the green background.", "First, a yellow tank drives by and puffs out billowing white smoke, then white text 'Thanks to Andrew' appears at the top of the green background, and finally a circular design with a white four-leaf clover appears in the middle of the top of the green background.", "First, white text 'Thanks to Andrew' appears at the top of the green background, then a yellow tank drives by and puffs out billowing white smoke, and finally a circular design with a 
white four-leaf clover appears in the middle of the top of the green background.", "First, a circular design with a white four-leaf clover appears in the middle of the top of a green background, then white text 'Thanks to Andrew' appears at the top of the green background, and finally a yellow tank drives by and puffs out billowing white smoke.", "First, a circular design with a white four-leaf clover appears in the middle of the top of a green background, then a yellow tank drives by and puffs out billowing white smoke, and finally white text 'Thanks to Andrew' appears at the top of the green background."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "SYZfvi0Ab78_2", "video_path": "SYZfvi0Ab78.mp4", "subtitle_path": "SYZfvi0Ab78_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1013.8, "view_count": 1587396}, {"video_id": "v7T9ObriZ6k", "question": "In the middle of a white background, there's a large black circle with a black dot in the center. Below this large circle, the English word 'sun' is written. On both sides, there are two parallel black arrows pointing in opposite directions. Where have this large circle and these captions appeared together?", "question_wo_referring_query": "Where have this large circle and these captions appeared together?", "candidates": ["free", "lines up with the sun's geometric center", "grass", "goodbye", "hello"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "v7T9ObriZ6k_0", "video_path": "v7T9ObriZ6k.mp4", "subtitle_path": "v7T9ObriZ6k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 904.36, "view_count": 176724}, {"video_id": "v7T9ObriZ6k", "question": "In the lower-left corner against the black background, there is a curly-haired woman wearing a black hat. 
In the middle of the screen, there is a golden sun with three Earths around it on red orbit lines. With what subtitles does this golden sun appear together?", "question_wo_referring_query": "With what subtitles does this golden sun appear together?", "candidates": ["really prove or disprove that that is", "hello", "Jack", "wow", "Kalen"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "v7T9ObriZ6k_1", "video_path": "v7T9ObriZ6k.mp4", "subtitle_path": "v7T9ObriZ6k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 904.36, "view_count": 176724}, {"video_id": "v7T9ObriZ6k", "question": "At the top of the screen, there is a black arrow and a yellow arrow, as well as a circular globe map with black text that says, 'Black Arrow Cannot see sun'. In which subtitles has this appeared together?", "question_wo_referring_query": "With which subtitles has this appeared together?", "candidates": ["click", "music", "come on", "code", "that it must throw its light different"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "v7T9ObriZ6k_2", "video_path": "v7T9ObriZ6k.mp4", "subtitle_path": "v7T9ObriZ6k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 904.36, "view_count": 176724}, {"video_id": "0YYdo3vJnzU", "question": "In the topmost right part of the screen with the white background and black text 'CapsNet Architecture', there is a black square in the middle with the white number 4. 
When it appears on the white background with the black text 'Conv1' on the screen, what change occurs to the shape of this black square?", "question_wo_referring_query": "What change occurs to the shape of this black square?", "candidates": ["Becomes circular", "Grows smaller", "Grows larger", "Turns purple", "Turns white"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "0YYdo3vJnzU_0", "video_path": "0YYdo3vJnzU.mp4", "subtitle_path": "0YYdo3vJnzU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1882.33, "view_count": 3438}, {"video_id": "0YYdo3vJnzU", "question": "In the white background with black text 'CapsNet Architecture' at the top right side of the screen, there is a light purple rectangle labeled 'ReLU Conv1'. What change occurs to the shape of this light purple rectangle when it appears on a white background with black text 'Conv1'?", "question_wo_referring_query": "What change occurs to the shape of this light purple rectangle?", "candidates": ["Turns black", "Decreases in size", "Turns white", "Becomes circular", "Increases in size"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "0YYdo3vJnzU_1", "video_path": "0YYdo3vJnzU.mp4", "subtitle_path": "0YYdo3vJnzU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1882.33, "view_count": 3438}, {"video_id": "0YYdo3vJnzU", "question": "In the top right section of the screen, within the white background text \u201cCapsNet Architecture\u201d in black font, there is a light purple rectangle with lines inside labeled 'DigitCaps.' 
When this light purple rectangle appears in the screen with the white background text 'CapsNet Architecture-Reconstruction,' what change occurs to the shape of this light purple rectangle?", "question_wo_referring_query": "What change occurs to the shape of this light purple rectangle?", "candidates": ["turns green", "turns red", "gets larger", "gets smaller", "turns white"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "0YYdo3vJnzU_2", "video_path": "0YYdo3vJnzU.mp4", "subtitle_path": "0YYdo3vJnzU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1882.33, "view_count": 3438}, {"video_id": "VrigMmXt9A0", "question": "When the map of the African plate block with a yellow-green color appears for the first time in the lower right corner of the screen, and the subtitle says 'term in a bilateral fashion, these can be NATO member states, but it is not coming from NATO,' what change occurs to the shape of the African plate block map?", "question_wo_referring_query": "What change occurs to the shape of the African plate block map?", "candidates": ["Turns yellow", "Becomes smaller", "Turns green", "Turns black", "Becomes larger"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "VrigMmXt9A0_0", "video_path": "VrigMmXt9A0.mp4", "subtitle_path": "VrigMmXt9A0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1484.27, "view_count": 59818}, {"video_id": "VrigMmXt9A0", "question": "When the map of Ukraine, labeled with the white English word 'Ukraine', appears for the first time in the middle of the screen, it is marked with many red arrows. 
What change occurs in the shape of the map of Ukraine when the subtitle says 'After the Russian Invasion of Ukraine in February 2022, the United States, member states of the ...'?", "question_wo_referring_query": "What change occurs in the shape of the map of Ukraine?", "candidates": ["Becomes round", "Becomes white", "Becomes smaller", "Becomes larger", "Becomes green"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "VrigMmXt9A0_1", "video_path": "VrigMmXt9A0.mp4", "subtitle_path": "VrigMmXt9A0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1484.27, "view_count": 59818}, {"video_id": "VrigMmXt9A0", "question": "On the map of Europe, when the subtitle says 'on the Russian Federation in order to undermine its ability to sustain the war in the long run,' what change occurs to the shape of Spain on the map?", "question_wo_referring_query": "What change occurs to the shape of Spain on the map?", "candidates": ["turns purple", "becomes smaller", "turns black", "becomes larger", "turns round"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "VrigMmXt9A0_2", "video_path": "VrigMmXt9A0.mp4", "subtitle_path": "VrigMmXt9A0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1484.27, "view_count": 59818}, {"video_id": "KjIRSbMJ8zY", "question": "In a room filled with toys, standing in front of a pink square paper box with stickers, what is the child wearing a blue and white striped shirt and pants doing with the stickers in their hand?", "question_wo_referring_query": "What are they doing?", "candidates": ["Putting the stickers on the pink paper box", "Drawing on the ground", "Putting the stickers on their face", "Putting the stickers on the table", "Putting the stickers on the ground"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "KjIRSbMJ8zY_0", 
"video_path": "KjIRSbMJ8zY.mp4", "subtitle_path": "KjIRSbMJ8zY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1358.57, "view_count": 520844}, {"video_id": "KjIRSbMJ8zY", "question": "In a room, sitting in front of a brown wood cabinet with a black TV on it, a long-haired woman in a blue dress is covering her ears with both hands. What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Sweeping the floor", "Cooking", "Screaming loudly", "Drawing on the ground", "Mopping the floor"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "KjIRSbMJ8zY_1", "video_path": "KjIRSbMJ8zY.mp4", "subtitle_path": "KjIRSbMJ8zY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1358.57, "view_count": 520844}, {"video_id": "KjIRSbMJ8zY", "question": "In front of the gray sofa, on the white floor sits a man wearing a black top, watching a child holding a pink bag and dressed in blue and white striped clothes. The child places one hand on the lid of a white bucket. What is this child doing?", "question_wo_referring_query": "What is this child doing?", "candidates": ["Throwing something into the bucket", "Closing the bucket lid", "Sitting on the bucket", "Pushing the bucket away", "Standing on the bucket"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "KjIRSbMJ8zY_2", "video_path": "KjIRSbMJ8zY.mp4", "subtitle_path": "KjIRSbMJ8zY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1358.57, "view_count": 520844}, {"video_id": "RhXW1nKzVVk", "question": "In a kitchen with white dishes placed in the background, there is a woman in a gray apron standing on the far left holding a plate with both hands. On the far right, there is another woman wearing a black short-sleeved shirt covering her eyes with both hands. 
When the subtitle says 'I'm excited,' what other objects are present in the kitchen?", "question_wo_referring_query": "What other objects are present in the kitchen?", "candidates": ["oven", "dishwasher", "refrigerator", "roast chicken", "green plant"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "RhXW1nKzVVk_0", "video_path": "RhXW1nKzVVk.mp4", "subtitle_path": "RhXW1nKzVVk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 951.58, "view_count": 865564}, {"video_id": "RhXW1nKzVVk", "question": "On a white floral-patterned table, there is a red round pot on the far left, and a blue cutting board on the far right. A hand is holding a knife cutting green snow peas on the blue cutting board. When the subtitle says 'Snow peas,' what else is visible on the screen?", "question_wo_referring_query": "What else is visible on the screen?", "candidates": ["Zucchini", "Green pepper", "Peeled shrimp", "Tomato", "Leek"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "RhXW1nKzVVk_1", "video_path": "RhXW1nKzVVk.mp4", "subtitle_path": "RhXW1nKzVVk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 951.58, "view_count": 865564}, {"video_id": "RhXW1nKzVVk", "question": "On a white floral patterned table, there is a glass bowl containing meatballs and another glass bowl with a pair of chopsticks inside, filled with a green liquid. Someone is cutting mushrooms on a red cutting board. 
When the subtitle says 'Regular thin, like this?', what else is visible on the screen?", "question_wo_referring_query": "What else is visible on the screen?", "candidates": ["Blender", "Blue cutting board", "Tomato", "Seasoning bottle", "Egg"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "RhXW1nKzVVk_2", "video_path": "RhXW1nKzVVk.mp4", "subtitle_path": "RhXW1nKzVVk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 951.58, "view_count": 865564}, {"video_id": "7yWOtvBsmZo", "question": "Against a black background, on the right side of the screen, a man wearing black glasses and a blue suit is sitting in front of a mirror. When the subtitle says 'and so she would have been only about', what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Refrigerator", "Globe", "Flower pot", "Rocket", "Ship"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "7yWOtvBsmZo_0", "video_path": "7yWOtvBsmZo.mp4", "subtitle_path": "7yWOtvBsmZo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1897.63, "view_count": 4769}, {"video_id": "7yWOtvBsmZo", "question": "In a black background, on the right side of the screen, there is a man wearing black glasses and a blue suit sitting in front of the camera. 
When the subtitle says 'underwear knowing this personal trait,' what characters are present on the screen?", "question_wo_referring_query": "What characters are present on the screen?", "candidates": ["A man wearing a blue short-sleeve shirt and white shorts", "A woman wearing a blue dress", "A man wearing a blue short-sleeve shirt and yellow shorts", "A woman wearing a blue long-sleeve shirt and white shorts", "A man wearing a red short-sleeve shirt and white shorts"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "7yWOtvBsmZo_1", "video_path": "7yWOtvBsmZo.mp4", "subtitle_path": "7yWOtvBsmZo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1897.63, "view_count": 4769}, {"video_id": "7yWOtvBsmZo", "question": "On the left side of the screen, within a white background, there is a painting depicting a person wearing a red top and blue pants, sitting cross-legged on a chair. When the subtitle says \"was capturing the cultural and,\" what characters are present in the screen?", "question_wo_referring_query": "What characters are present in the screen?", "candidates": ["A man wearing sunglasses and a pink suit", "A woman wearing glasses and a blue suit", "A man wearing glasses and a blue suit", "A man wearing sunglasses and a white short-sleeve shirt", "A man wearing sunglasses and a white suit"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "7yWOtvBsmZo_2", "video_path": "7yWOtvBsmZo.mp4", "subtitle_path": "7yWOtvBsmZo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1897.63, "view_count": 4769}, {"video_id": "83ggxS21mFM", "question": "Three characters appear on the screen: on the far left is a side-facing elderly person with white hair wearing glasses, in the middle is a woman wearing a hat, and on the far right is a man looking at a mirror. 
When the subtitle says 'openly began an affair, and she had been hospitalized for psychoneurosis, O'Keeffe broke away,' what color hat is the woman in the middle wearing?", "question_wo_referring_query": "What color hat is the woman in the middle wearing?", "candidates": ["Brown", "Red", "Black", "Green", "White"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "83ggxS21mFM_0", "video_path": "83ggxS21mFM.mp4", "subtitle_path": "83ggxS21mFM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1038.07, "view_count": 248903}, {"video_id": "83ggxS21mFM", "question": "On the white wall, there is an abstract painting in grey and black tones hanging on the left side. To the right of the painting, an older man wearing a hat and a black coat is seen looking at it sideways. When the subtitle says, 'influence, O'Keeffe focused on creating images that have become synonymous with the Great American,' what color is the hat the old man is wearing?", "question_wo_referring_query": "What color is the hat the old man is wearing?", "candidates": ["white", "black", "red", "purple", "blue"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "83ggxS21mFM_1", "video_path": "83ggxS21mFM.mp4", "subtitle_path": "83ggxS21mFM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1038.07, "view_count": 248903}, {"video_id": "83ggxS21mFM", "question": "On the left side of the screen, a person wearing a black hat and black clothes is walking in front of a yellow mountain, accompanied by a man in blue jeans. When the subtitle says 'and controversial relationship. 
O'Keeffe was an ambitious artist who created a unique visual,' what style of top is the man in blue jeans wearing?", "question_wo_referring_query": "What style of top is the man in blue jeans wearing?", "candidates": ["Striped short sleeves", "Suit", "Plaid shirt", "Floral print shirt", "Cotton short sleeves"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "83ggxS21mFM_2", "video_path": "83ggxS21mFM.mp4", "subtitle_path": "83ggxS21mFM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1038.07, "view_count": 248903}, {"video_id": "LZMm36G0mWc", "question": "The scene shows a street where various fruits are placed in boxes at the entrance. A group of people, dressed in traditional costumes and holding fans, are dancing near the entrance of a fruit shop. There is also someone playing a flute, and a crowd of people watching them. Who is the person playing the flute?", "question_wo_referring_query": "Who is the person playing the flute?", "candidates": ["The person holding a yellow fan", "The person wearing a red baseball cap", "The person wearing a gray floral kimono jacket", "The person wearing a white sun hat", "The person wearing a green floral kimono jacket"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "LZMm36G0mWc_0", "video_path": "LZMm36G0mWc.mp4", "subtitle_path": "LZMm36G0mWc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 35734}, {"video_id": "LZMm36G0mWc", "question": "The screen shows a red stage table with three wooden trays and three red tables, all displaying exquisite food such as sushi, dim sum, tofu, meat pies, etc. 
Which table or wooden tray holding bowls and plates has three types of food on it?", "question_wo_referring_query": "Which table or wooden tray holding bowls and plates has three types of food on it?", "candidates": ["The black tray on the upper right", "The wooden table on the right", "The wooden table on the lower left", "The wooden tray on the upper left", "The wooden table in the middle"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "LZMm36G0mWc_1", "video_path": "LZMm36G0mWc.mp4", "subtitle_path": "LZMm36G0mWc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 35734}, {"video_id": "LZMm36G0mWc", "question": "In the video, there's a tray placed on a red table, with two bowls and a plate inside the tray. The plate is white and contains deep-fried food. The bowls are red on the inside and black with gold on the outside. One bowl has rice and soup. Which ingredient is placed in the same bowl as the meat pie?", "question_wo_referring_query": "Which ingredient is placed in the same bowl as the meat pie?", "candidates": ["Deep-fried meat roll", "Seaweed slices", "Tofu", "Spinach", "Rice"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "LZMm36G0mWc_2", "video_path": "LZMm36G0mWc.mp4", "subtitle_path": "LZMm36G0mWc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.5, "view_count": 35734}, {"video_id": "Ys_ENxeNctk", "question": "In the scene, there is a man wearing a black shirt, and beside him is a woman in a black and white shirt. The man is wearing a black and green checkered shirt. The woman has curly hair. They are in a white room with some clutter around. In front of them is a table covered with a black tablecloth. On the table, there are red and blue sticks. The woman is wearing blue gloves. 
What does the woman in the black and white shirt do the first time she appears on screen?", "question_wo_referring_query": "What does the woman in the black and white shirt do the first time she appears on screen?", "candidates": ["The woman is conducting an experiment", "The woman is shaking hands with the man", "The woman picks up a stick from the table", "The woman is talking to the man", "The woman removes her gloves"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "Ys_ENxeNctk_0", "video_path": "Ys_ENxeNctk.mp4", "subtitle_path": "Ys_ENxeNctk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1247.16, "view_count": 13184}, {"video_id": "Ys_ENxeNctk", "question": "In the screen, there is a bookshelf with many books on it. In front of the bookshelf is a wooden door with lattice, and a man in black and green clothing is standing in front of the bookshelf talking. He has brown hair, is wearing jeans, and beside him is a whiteboard with a string of black letters. When the black letters 'PHOTONS' first appear, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["The man crosses his waist", "The man points at the whiteboard", "The man clenches his hands", "The man is erasing the letters", "The man is nodding"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "Ys_ENxeNctk_1", "video_path": "Ys_ENxeNctk.mp4", "subtitle_path": "Ys_ENxeNctk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1247.16, "view_count": 13184}, {"video_id": "Ys_ENxeNctk", "question": "The screen shows a man wearing a black and green checkered shirt. He is wearing jeans, has short brown hair, and in front of him is a yellow table with three bottles and a bucket of ice on it. Behind him is a blackboard and to the right side of the screen is a wooden shelf. 
What happened when the three plastic bottles first appeared?", "question_wo_referring_query": "What happened?", "candidates": ["The man lifted the bucket of ice.", "The man touched his head.", "The man picked up a plastic bottle.", "The man spread his arms.", "The man turned to face the blackboard."], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "Ys_ENxeNctk_2", "video_path": "Ys_ENxeNctk.mp4", "subtitle_path": "Ys_ENxeNctk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1247.16, "view_count": 13184}, {"video_id": "Et2ysviyoBo", "question": "In a news segment, there are two scenes. On the left is a man wearing a dark blue shirt. He has relatively short hair, a watch on his wrist, and a black background behind him. On the right is a man wearing a black coat. His beard is long and white, and he is wearing a black hat and glasses, with a brown background behind him. When 'regional conflict is to stop the war in' is mentioned, what actions do the characters in the scene perform?", "question_wo_referring_query": "What actions do the characters in the scene perform?", "candidates": ["The man on the right takes a sip of water", "The man on the right waves his hand", "The man on the left is waving his hand", "The man on the left shakes hands with someone", "The man on the right removes his glasses"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "Et2ysviyoBo_0", "video_path": "Et2ysviyoBo.mp4", "subtitle_path": "Et2ysviyoBo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 910.72, "view_count": 335553}, {"video_id": "Et2ysviyoBo", "question": "In a news screen, there are two segments. On the left side, a woman with braids wearing a cerulean blue shirt is standing against a city nightscape backdrop. 
On the right side, a man in a suit is in a hall; he is wearing a black and white tie and has a red ornament on his suit. What action do the characters in the screen take when 'we are the puppets of Iran which of' is mentioned?", "question_wo_referring_query": "What action do the characters in the screen take?", "candidates": ["The man on the right adjusted his suit", "The man on the right picked up the handset", "The man on the right drank water", "The woman raised both wrists", "The woman raised one wrist"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "Et2ysviyoBo_1", "video_path": "Et2ysviyoBo.mp4", "subtitle_path": "Et2ysviyoBo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 910.72, "view_count": 335553}, {"video_id": "Et2ysviyoBo", "question": "Under the blue sky, distant mountains covered with green vegetation can be seen. A pole stands erect by the dirt road. A man wearing a hat and holding a gun is facing away from the camera. A tank's cannon is pointing in the direction of the camera. 
When the subtitle 'that the Hamas attacks of October the' appears, what action do the characters on the screen take?", "question_wo_referring_query": "What action do the characters on the screen take?", "candidates": ["The man holding the gun crouches down with his head in his hands.", "The man holding the gun walks to the left side of the screen.", "The tank fires at the man.", "The man holding the gun walks towards the tank.", "The man holding the gun prepares to shoot at the tank."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "Et2ysviyoBo_2", "video_path": "Et2ysviyoBo.mp4", "subtitle_path": "Et2ysviyoBo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 910.72, "view_count": 335553}, {"video_id": "TRoxAWpi4MI", "question": "In the video, two hands are holding a small blue ceramic piece, which resembles a cup or a small bowl, with white floral patterns on it. The sleeves near the hands are green, and in the background, there is a shelf holding two similar small ceramic pieces. 
What happens after 'I think I can' is mentioned?", "question_wo_referring_query": "What happens after 'I think I can' is mentioned?", "candidates": ["A blonde woman in a green outfit is holding the small ceramic piece and conversing with a woman in a black top with short white hair", "A blonde woman in a green outfit is holding the small ceramic piece and conversing with a woman in a black top with short black hair", "A blonde woman in a green outfit is holding the small ceramic piece and conversing with a woman in a yellow top with short black hair", "A blonde woman in a green outfit is holding the small ceramic piece and conversing with a woman in a green top with short black hair", "A blonde woman in a green outfit is holding the small ceramic piece and conversing with a woman in a pink top with short black hair"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "TRoxAWpi4MI_0", "video_path": "TRoxAWpi4MI.mp4", "subtitle_path": "TRoxAWpi4MI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.5, "view_count": 25762}, {"video_id": "TRoxAWpi4MI", "question": "The screen shows a woman wearing a green dress; she is standing in front of a display table with some pottery and fragments. There are two sheets of paper on the display board behind the table, one of which reads '1682.' A man wearing black clothing is standing beside the fragmented display table, wearing a mask. 
What happened after the phrase 'So from here, let's take a closer look at that Tsuboya' was mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["The woman picked up a fragment.", "The camera zoomed out, and the man started explaining to the woman.", "The camera zoomed in, and the man started explaining to the woman.", "The woman and the man shook hands.", "The camera focused on the man, and he started explaining."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "TRoxAWpi4MI_1", "video_path": "TRoxAWpi4MI.mp4", "subtitle_path": "TRoxAWpi4MI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.5, "view_count": 25762}, {"video_id": "TRoxAWpi4MI", "question": "A woman is standing in front of a wooden rack filled with ceramics, with many bowls and plates on it. There is a man dressed in short sleeves in front of her, holding a small tool. To the left of the camera is a glass window, outside of which is a long corridor. The woman is wearing a green coat and a dark blue dress. She is having a conversation with the man. What happened after their conversation?", "question_wo_referring_query": ", what happened?", "candidates": ["A blond woman in green clothes holding a small ceramic was conversing with another woman in a yellow shirt and black short hair.", "The woman picked up a piece of fragment.", "The camera zoomed in, and the man was explaining something to the woman.", "The man was molding clay on a board.", "The woman and the man shook hands."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "TRoxAWpi4MI_2", "video_path": "TRoxAWpi4MI.mp4", "subtitle_path": "TRoxAWpi4MI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.5, "view_count": 25762}, {"video_id": "-jUVOXOw9T0", "question": "In a white room, there is a woman wearing a black vest speaking. 
She has short hair, is wearing red lipstick, earrings, and a necklace. She has tattoos on her arms. There is a green plant beside the window behind her, and small buildings can be seen outside the window with the sky in the distance. What happened when 'paeosod' was mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["A painting popped up on the left featuring an orange tree and yellow mountains.", "A painting popped up on the left featuring a pear tree and snow-capped mountains.", "A painting popped up on the left featuring an apple tree and snow-capped mountains.", "A painting popped up on the left featuring an orange tree and snow-capped mountains.", "A painting popped up on the left featuring a peach tree and snow-capped mountains."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "-jUVOXOw9T0_0", "video_path": "-jUVOXOw9T0.mp4", "subtitle_path": "-jUVOXOw9T0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1717.42, "view_count": 6013}, {"video_id": "-jUVOXOw9T0", "question": "On a wooden-patterned desk, there is a piece of white paper. Beside it are a few pencils, and a hand with its finger placed on one corner of the white paper. 
What happens after mentioning 'piece of paper'?", "question_wo_referring_query": ", what happens?", "candidates": ["Two hands press on the paper, one hand holds a blue pencil", "Two hands press on the paper, one hand holds a gray pencil", "Two hands press on the paper, one hand holds a green pencil", "Two hands press on the paper, one hand holds a yellow pencil", "Two hands press on the paper, one hand holds a white pencil"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "-jUVOXOw9T0_1", "video_path": "-jUVOXOw9T0.mp4", "subtitle_path": "-jUVOXOw9T0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1717.42, "view_count": 6013}, {"video_id": "-jUVOXOw9T0", "question": "On a wooden-grain table, there is a pair of hands, and underneath the hands is a completed colored drawing. The drawing features an orange tree, green woods, and snow-capped mountains. The upper left corner shows the draft of the painting, along with several pencils. 
What happened after mentioning 'very bright and I got muted as we got'?", "question_wo_referring_query": "After mentioning 'very bright and I got muted as we got,' what happened?", "candidates": ["The hand on the table is holding another sketch.", "The hand on the table is holding another colored circular tie-dye painting.", "The hand on the table is holding another oil painting.", "The hand on the table is holding another square-shaped colored tie-dye painting.", "The hand on the table is holding another triangular-shaped colored tie-dye painting."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "-jUVOXOw9T0_2", "video_path": "-jUVOXOw9T0.mp4", "subtitle_path": "-jUVOXOw9T0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1717.42, "view_count": 6013}, {"video_id": "rf02a8LzmUo", "question": "In a PPT slide, there are yellow and gray rectangles with some alphabetic annotations inside. In the bottom right, there is a 3x3 grid of images, with numbers labeled on it. On the right side, there is a gray rectangle with a colored rectangle inside. Before mentioning: 'right if you think about if we're going', which objects have not appeared before?", "question_wo_referring_query": "which objects have not appeared before?", "candidates": ["green rectangle labeled Multi-Head", "yellow rectangle labeled Norm", "red rectangle labeled Linear Projection of Flattened Patches", "round shape labeled Class Bird Ball Car", "blue rectangle labeled MLP"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "rf02a8LzmUo_0", "video_path": "rf02a8LzmUo.mp4", "subtitle_path": "rf02a8LzmUo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.63, "view_count": 29}, {"video_id": "rf02a8LzmUo", "question": "In a white PPT slide, there is a title composed of black letters, containing a black dot inside. 
Behind it, there is content with black letters. In the lower middle part, there is a purple rectangular box. At the bottom of the PPT page are some black text contents and there is also a yellow logo. When the phrase 'okay and that's for U image' is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["A square with purple outlined border", "A circle labeled Class Bird Ball Car", "A yellow rectangle labeled Norm", "A red rectangle labeled Linear Projection of Flattened Patches", "A blue rectangle labeled MLP"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "rf02a8LzmUo_1", "video_path": "rf02a8LzmUo.mp4", "subtitle_path": "rf02a8LzmUo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.63, "view_count": 29}, {"video_id": "rf02a8LzmUo", "question": "In a white PPT slide, there is a title formed by black letters, along with some grey and purple rectangular shapes. To the side, there is a purple border containing various colored rectangles. At the bottom of the PPT slide, there is some content in black text and a yellow mark. 
When the phrase 'and we want to predict the corresponding' is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["A blue rectangle labeled MLP", "A green rectangle labeled Multi-Head", "A yellow rectangle labeled Norm", "A circle labeled Class Bird Ball Car", "An image of a person surfing"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "rf02a8LzmUo_2", "video_path": "rf02a8LzmUo.mp4", "subtitle_path": "rf02a8LzmUo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1002.63, "view_count": 29}, {"video_id": "aYRI-NiucuA", "question": "Which sequence of scenes below is correct?", "question_wo_referring_query": "Which sequence of scenes below is correct?", "candidates": ["Firstly, a scene of four men lifting a painting is shown, followed by a black and white photo of a man with glasses wearing black clothes, and finally, a man with dark skin wearing a gray medical gown drawing at a table.", "Firstly, a man with dark skin wearing a gray medical gown drawing at a table is shown, followed by a black and white photo of a man with glasses wearing black clothes, and finally, a scene of four men lifting a painting.", "Firstly, a black and white photo of a man with glasses wearing black clothes is shown, followed by a scene of four men lifting a painting, and finally, a man with dark skin wearing a gray medical gown drawing at a table.", "Firstly, a man with dark skin wearing a gray medical gown drawing at a table is shown, followed by a scene of four men lifting a painting, and finally, a black and white photo of a man with glasses wearing black clothes is shown.", "Firstly, a black and white photo of a man with glasses wearing black clothes is shown, followed by a man with dark skin wearing a gray medical gown drawing at a table, and finally, a scene of four men lifting a painting."], 
"topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "aYRI-NiucuA_0", "video_path": "aYRI-NiucuA.mp4", "subtitle_path": "aYRI-NiucuA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1417.92, "view_count": 29382}, {"video_id": "aYRI-NiucuA", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a painting where the overall color is olive with blue and black depicting a person, followed by a screen showing a woman in a black turtleneck sitting and talking in front of a bookshelf, and finally a painting with three black figures and a background composed of blue, orange, and red.", "First, a painting with three black figures and a background composed of blue, orange, and red, followed by a painting where the overall color is olive with blue and black depicting a person, and finally a screen showing a woman in a black turtleneck sitting and talking in front of a bookshelf.", "First, a screen shows a woman in a black turtleneck sitting and talking in front of a bookshelf, followed by a painting where the overall color is olive with blue and black depicting a person, and finally a painting with three black figures and a background composed of blue, orange, and red.", "First, a screen shows a woman in a black turtleneck sitting and talking in front of a bookshelf, followed by a painting with three black figures and a background composed of blue, orange, and red, and finally a painting where the overall color is olive with blue and black depicting a person.", "First, a painting where the overall color is olive with blue and black depicting a person, followed by a painting with three black figures and a background composed of blue, orange, and red, and finally a screen showing a woman in a black turtleneck sitting and talking in front of a bookshelf."], "topic_category": 
"KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "aYRI-NiucuA_1", "video_path": "aYRI-NiucuA.mp4", "subtitle_path": "aYRI-NiucuA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1417.92, "view_count": 29382}, {"video_id": "aYRI-NiucuA", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a scene where a curly-haired man is speaking is shown, followed by a scene where a woman with dark skin and high-collared clothing is speaking, and finally a scene where a man with dark skin, slightly graying hair, and wearing a blue shirt is speaking.", "First, a scene where a man with dark skin, slightly graying hair, and wearing a blue shirt is speaking is shown, followed by a scene where a curly-haired man is speaking, and finally a scene where a woman with dark skin and high-collared clothing is speaking.", "First, a scene where a curly-haired man is speaking is shown, followed by a scene where a man with dark skin, slightly graying hair, and wearing a blue shirt is speaking, and finally a scene where a woman with dark skin and high-collared clothing is speaking.", "First, a scene where a woman with dark skin and high-collared clothing is speaking is shown, followed by a scene where a curly-haired man is speaking, and finally a scene where a man with dark skin, slightly graying hair, and wearing a blue shirt is speaking.", "First, a scene where a woman with dark skin and high-collared clothing is speaking is shown, followed by a scene where a man with dark skin, slightly graying hair, and wearing a blue shirt is speaking, and finally a scene where a curly-haired man is speaking."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "aYRI-NiucuA_2", "video_path": "aYRI-NiucuA.mp4", "subtitle_path": "aYRI-NiucuA_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 1417.92, "view_count": 29382}, {"video_id": "IibCOlrL9gc", "question": "Against a blue background, a blonde woman in a black dress is reporting the news. On the blue background on the left side of the screen, there are tall buildings. In front of the woman, at the bottom, there is a white background with subtitles. What subtitle appeared simultaneously with this woman?", "question_wo_referring_query": "What subtitle appeared simultaneously with this woman?", "candidates": ["HITTING 497%.THE BENCHMARK 10 YEAR NOTE", "CERTAINLY VERY MUCH AN UNTHINKABLE TURN OF EVENTS", "GIVEN WHEN WE CAME INTO THIS YEAR15 BASIS POINTS OF CUTS", "LET'S GET MORE ON THE US.INFLATION PRINT NOW FROM", "TOPPING 4.5% FOR THE FIRST TIME SINCE NOVEMBER"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "IibCOlrL9gc_0", "video_path": "IibCOlrL9gc.mp4", "subtitle_path": "IibCOlrL9gc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2832.97, "view_count": 2082}, {"video_id": "IibCOlrL9gc", "question": "A gentleman in the video is wearing a navy blue suit with a purple tie and black-rimmed glasses. He is broadcasting the news with a glass window behind him, showing high-rise buildings outside. There is a subtitle bar with a white background in front of him. In which subtitle does this gentleman appear simultaneously?", "question_wo_referring_query": "In which subtitle does this gentleman appear simultaneously?", "candidates": ["THE POWER SECTOR VERY COAL AND GAS HEAVY", "CERTAINLY VERY MUCH AN UNTHINKABLE TURN OF EVENTS", "HITTING 497%. 
THE BENCHMARK 10 YEAR NOTE", "TOPPING 4.5% FOR THE FIRST TIME SINCE NOVEMBER", "GIVEN WHEN WE CAME INTO THIS YEAR 15 BASIS POINTS OF CUTS"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "IibCOlrL9gc_1", "video_path": "IibCOlrL9gc.mp4", "subtitle_path": "IibCOlrL9gc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2832.97, "view_count": 2082}, {"video_id": "IibCOlrL9gc", "question": "The screen shows a man with gray hair, wearing a dark blue suit and a light blue and dark blue striped tie. There is also a white dialog box in the lower right corner of the screen. Behind the man is a glass window, outside of which there are white buildings. With which subtitle did this man appear simultaneously?", "question_wo_referring_query": "With which subtitle did this man appear simultaneously?", "candidates": ["GIVEN WHEN WE CAME INTO THIS YEAR15 BASIS POINTS OF CUTS", "REVERBERATIONS, IT SEEMS. MICHAEL PH:", "Of course, this is a very unimaginable turn of events", "First time since last November to break 4.5%", "MARKET FOR YOU NOW, KATE? OBVIOUSLY YOU HAVE BEEN TRYING"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "IibCOlrL9gc_2", "video_path": "IibCOlrL9gc.mp4", "subtitle_path": "IibCOlrL9gc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2832.97, "view_count": 2082}, {"video_id": "m7ubevtp7Zg", "question": "In the video, there's a person wearing black clothing with tattoos on their arm who is cutting meat. Below the red cutting board where the meat is placed is a wooden table. There is also a stainless steel bowl nearby. 
After placing all the meat into the stainless steel bowl, which contains seasonings and green onions, what changes occur to the meat during the marination process?", "question_wo_referring_query": "What changes occur to the meat during the marination process?", "candidates": ["The meat is cut into chunks", "The meat is cut into strips", "The meat is cut into thin shreds", "The meat is cooked", "The meat is cut into minced pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "m7ubevtp7Zg_0", "video_path": "m7ubevtp7Zg.mp4", "subtitle_path": "m7ubevtp7Zg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1285.0, "view_count": 575016}, {"video_id": "m7ubevtp7Zg", "question": "In the video, there is a pair of pattern-decorated hands holding a stainless steel bowl on top of a wooden table. Inside the bowl, there is a dough ball, covered with a plastic wrap, placed on a wooden board. When the dough ball is placed in a black pan and sprinkled with green vegetables, what transformation does the dough undergo?", "question_wo_referring_query": "What transformation does the dough undergo?", "candidates": ["The dough ball is made into a flower roll", "The dough ball is made into a thin pancake", "The dough ball is made into a naan", "The dough ball is made into a dumpling", "The dough ball is made into a bun"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "m7ubevtp7Zg_1", "video_path": "m7ubevtp7Zg.mp4", "subtitle_path": "m7ubevtp7Zg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1285.0, "view_count": 575016}, {"video_id": "m7ubevtp7Zg", "question": "A hand is holding a wooden spatula, stir-frying meat slices in an iron pan with a wooden handle. The stove under the iron pan is black, and the person is wearing black clothes. There is also a red bowl on the right side of the screen. 
What changes occur when the meat slices, rice, and eggs are placed in a gray and white plate?", "question_wo_referring_query": "What changes occur?", "candidates": ["The meat slices turn brown, mixed with red chili peppers and white vegetables.", "The meat slices turn brown, mixed with red chili peppers and black vegetables.", "The meat slices turn brown, mixed with red chili peppers and yellow vegetables.", "The meat slices turn brown, mixed with red chili peppers and green vegetables.", "The meat slices turn brown, mixed with green chili peppers and green vegetables."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "m7ubevtp7Zg_2", "video_path": "m7ubevtp7Zg.mp4", "subtitle_path": "m7ubevtp7Zg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1285.0, "view_count": 575016}, {"video_id": "gOK7nxcUECU", "question": "In a white background, there are three pens in the top right corner. In the center of the screen, there is a pair of hands, with red nail polish on the fingers and a ring. In the lower left and lower right corners, there are round mirrors showing two women respectively. Both women have long black hair. The woman in the lower left corner is wearing blue clothing, while the woman in the lower right is wearing crimson clothing. Their backgrounds are green. 
What change occurs to the hand when mentioning 'yeah we don't see all the hydrogens'?", "question_wo_referring_query": "What change occurs to the hand?", "candidates": ["One hand forms a fist", "Both hands form fists", "One hand points to the top right corner", "One hand points to text and symbols on the screen", "One hand is open and the other forms a fist"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "gOK7nxcUECU_0", "video_path": "gOK7nxcUECU.mp4", "subtitle_path": "gOK7nxcUECU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1667.87, "view_count": 13442}, {"video_id": "gOK7nxcUECU", "question": "In a white background, there are three pens in the upper right corner. A pair of hands is placed in the center of the screen. The fingernails are painted red and the hands are wearing rings. The upper part of the white background has a purple frame with some chemical formulas written inside. The hands in the screen are placing a black strip. In the lower left corner and lower right corner are two round mirrors showing women. Both women have long straight black hair. The woman in the lower left corner is wearing blue clothes, and the woman in the lower right corner is wearing pinkish-red clothes. Their background is green. 
When 'over two different types of structures' is mentioned, what changes occur to the hands in the screen?", "question_wo_referring_query": "What changes occur to the hands in the screen?", "candidates": ["One finger extends", "Two fingers extend", "A fist is made", "A fist is made", "Three fingers extend"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "gOK7nxcUECU_1", "video_path": "gOK7nxcUECU.mp4", "subtitle_path": "gOK7nxcUECU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1667.87, "view_count": 13442}, {"video_id": "gOK7nxcUECU", "question": "In a white background, there are three pens in the upper right corner. In the middle of the screen, there are some black stripes and content made up of letters. A hand finger is pointing at the content in the center of the screen made up of black stripes and subtitles. In the lower left and lower right corners, there are respectively circular frames of two women, both with straight black hair. The woman in the lower left corner is wearing blue clothes, and the woman in the lower right corner is wearing crimson clothes. Their backgrounds are both green. 
When 'h' is mentioned, what kind of change does the hand undergo in the screen?", "question_wo_referring_query": ", what kind of change does the hand undergo in the screen?", "candidates": ["The hand points to the content in the screen with a pen", "Writes content on the background with a pen", "Takes out a piece of paper", "Points to the content with three fingers", "Opened the cap of a pen"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "gOK7nxcUECU_2", "video_path": "gOK7nxcUECU.mp4", "subtitle_path": "gOK7nxcUECU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1667.87, "view_count": 13442}, {"video_id": "wsRgz3S0-S4", "question": "The screen shows a light green table with some handmade soaps on it. A hand is picking up one of the soaps. Some soaps are round, some are square, and some have a wavy pattern. The hand has a ring on one finger and part of a sleeve of an orange-red sweater is visible. Which of the following items is not present in the screen?", "question_wo_referring_query": "Which of the following items is not present in the screen?", "candidates": ["Round green carved soap", "Purple flower", "Yellow soap", "Blue carved soap", "Yellow flower pedal"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "wsRgz3S0-S4_0", "video_path": "wsRgz3S0-S4.mp4", "subtitle_path": "wsRgz3S0-S4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1344.22, "view_count": 344162}, {"video_id": "wsRgz3S0-S4", "question": "The scene shows a woman with braided hair, wearing an orange sweater. She is sitting in front of a table, carefully preparing some food. The light green wall behind her has two windows with yellow frames, and there are some plants hanging from the top of the screen. There are also some plants placed on her table. 
Which objects are not present in the scene?", "question_wo_referring_query": "Which objects are not present in the scene?", "candidates": ["A round white table", "A chair with corners on both sides", "White Venetian blinds", "A glass vase", "A wind chime with white flowers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "wsRgz3S0-S4_1", "video_path": "wsRgz3S0-S4.mp4", "subtitle_path": "wsRgz3S0-S4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1344.22, "view_count": 344162}, {"video_id": "wsRgz3S0-S4", "question": "A woman is sitting in front of a mirror talking. Her hair is blonde, and she is wearing an orange-red outfit. Behind her is a light green wall. On the left side of the screen, there is also a window with a wooden frame. What is the object that doesn't exist in the scene?", "question_wo_referring_query": "What is the object that doesn't exist in the scene?", "candidates": ["Green plant", "A rose-red patterned scarf", "An orange-red coat", "Picture frame", "A red patterned chair cover"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "wsRgz3S0-S4_2", "video_path": "wsRgz3S0-S4.mp4", "subtitle_path": "wsRgz3S0-S4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1344.22, "view_count": 344162}, {"video_id": "t6PifI-7QBg", "question": "Vegetables and meat are mixed together, and a pair of chopsticks is holding them. 
When the subtitle 'hey' appears, what items are present on the screen?", "question_wo_referring_query": "What items are present on the screen?", "candidates": ["Tomatoes", "Dumplings", "Noodles", "Earphones", "Sausages"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "t6PifI-7QBg_0", "video_path": "t6PifI-7QBg.mp4", "subtitle_path": "t6PifI-7QBg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2472.52, "view_count": 1501618}, {"video_id": "t6PifI-7QBg", "question": "On a wooden table, there is a blue electric stove in the middle with a pot on it. Someone is holding a spoon and stirring a glass bowl. When the subtitle 'do\n[Music]' appears, what can be seen on the screen?", "question_wo_referring_query": "What can be seen on the screen?", "candidates": ["bun", "scissors", "sliced sausage", "blueberry", "a whole tomato"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "t6PifI-7QBg_1", "video_path": "t6PifI-7QBg.mp4", "subtitle_path": "t6PifI-7QBg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2472.52, "view_count": 1501618}, {"video_id": "t6PifI-7QBg", "question": "On a wooden table, there is a pot of vegetables in the front left, a white flat surface in the middle with meat on it, and a pair of tongs holding yellow-green vegetables on top of the meat. 
When the subtitle 'do [Music]' appears, what is present on the screen at that time?", "question_wo_referring_query": "", "candidates": ["Radish", "Carrots", "Fried Chicken Nuggets", "Green Chili Peppers", "Tomatoes"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "t6PifI-7QBg_2", "video_path": "t6PifI-7QBg.mp4", "subtitle_path": "t6PifI-7QBg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2472.52, "view_count": 1501618}, {"video_id": "RXWyuO3v4vM", "question": "Under a sky obscured by dark clouds, there is an ocean in the distance and mountains covered with green grass nearby. An old man with white hair and a hat is looking towards the camera. What color is the old man's clothing?", "question_wo_referring_query": "What color is the old man's clothing?", "candidates": ["Blue", "Red", "Green", "Purple", "Black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "RXWyuO3v4vM_0", "video_path": "RXWyuO3v4vM.mp4", "subtitle_path": "RXWyuO3v4vM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1253.21, "view_count": 8791163}, {"video_id": "RXWyuO3v4vM", "question": "In a room cluttered with various items, there is a piece of blue leather laid horizontally at the back. On the left, there is a frame painted in red, white, and blue. Next to the frame is a man in black clothes with long beard, standing on a stone. In the middle, there is a shirtless man. 
What is the shape of the stone that the man in black clothes is standing on?", "question_wo_referring_query": "What is the shape of the stone that the man in black clothes is standing on?", "candidates": ["Cylindrical", "Disk-shaped", "Conical", "Cubic", "Spherical"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "RXWyuO3v4vM_1", "video_path": "RXWyuO3v4vM.mp4", "subtitle_path": "RXWyuO3v4vM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1253.21, "view_count": 8791163}, {"video_id": "RXWyuO3v4vM", "question": "Against a white background, with a pattern in green, yellow, blue, and red above, a shirtless man with short hair and a beard is lying down and smiling at the camera. What shape is the pattern in green, yellow, blue, and red?", "question_wo_referring_query": "What shape is the pattern in green, yellow, blue, and red?", "candidates": ["Circle", "Rectangle", "Square", "Parallelogram", "Trapezoid"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "RXWyuO3v4vM_2", "video_path": "RXWyuO3v4vM.mp4", "subtitle_path": "RXWyuO3v4vM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1253.21, "view_count": 8791163}, {"video_id": "gA0RAPh2ZgU", "question": "On a flat ground, in the distance on the left stands a red pillar, on the right there is a tower with a small window, nearby on the left there is a white building, in the middle there is a statue of a person, under the sunlight, the shadow of the statue stretches to the left. 
When the subtitle 'To me, it's almost certain that his use of space was inspired by earlier images' appears, what color is the head of the statue?", "question_wo_referring_query": "What color is the head of the statue when the subtitle 'To me, it's almost certain that his use of space was inspired by earlier images' appears?", "candidates": ["red", "blue", "green", "black", "purple"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "gA0RAPh2ZgU_0", "video_path": "gA0RAPh2ZgU.mp4", "subtitle_path": "gA0RAPh2ZgU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.0, "view_count": 1379970}, {"video_id": "gA0RAPh2ZgU", "question": "On a river, there is a meadow in the distance with a rabbit on it. Nearby on the right side of the bank are many animals resting. In the middle of the river, there's a pink plant with a cat-headed eagle standing in a hole within it. When the subtitles show 'influenced by his painting and technique. It is rarely, if ever, pointed out that Dali's portrait', what is the shape of the hole where the cat-headed eagle is standing?", "question_wo_referring_query": "What is the shape of the hole where the cat-headed eagle is standing?", "candidates": ["Rectangle", "Square", "Round", "Parallelogram", "Ladder-shaped"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "gA0RAPh2ZgU_1", "video_path": "gA0RAPh2ZgU.mp4", "subtitle_path": "gA0RAPh2ZgU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.0, "view_count": 1379970}, {"video_id": "gA0RAPh2ZgU", "question": "In a room with a blue and yellow background, there are three people having a conversation. One is a woman wearing a hat and dressed in blue, another is a short-haired man in a suit, and the third is a man in a burgundy coat with a shaven head. 
When the subtitle 'but the work came first, celebrity second, Dali, the artist had become a prisoner of Dali the celebrity' appears, what color is the woman's hat?", "question_wo_referring_query": "What color is the woman's hat?", "candidates": ["green", "blue", "dark purple", "red", "black"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "gA0RAPh2ZgU_2", "video_path": "gA0RAPh2ZgU.mp4", "subtitle_path": "gA0RAPh2ZgU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.0, "view_count": 1379970}, {"video_id": "bFFrmBDFeI4", "question": "On a grassy field with buildings in the distance, there is a black column nearby with two lamps in the middle. To the left, there is a white balloon and a character painting. In the top right corner, there are some tree branches. The middle is filled with people. Can you tell who is taking a photo with a mobile phone at this moment?", "question_wo_referring_query": "Can you tell who is taking a photo with a mobile phone at this moment?", "candidates": ["A man wearing a black hat", "A woman wearing a white coat", "A man wearing white pants", "A woman wearing a blue jacket", "A woman wearing a black hoodie and black pants"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "bFFrmBDFeI4_0", "video_path": "bFFrmBDFeI4.mp4", "subtitle_path": "bFFrmBDFeI4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.68, "view_count": 181743}, {"video_id": "bFFrmBDFeI4", "question": "In a room with a light on the ceiling, a street view outside the glass window, and a photo hanging on the white wall on the left side, who is pointing a thumb at the camera among a short-haired man in black clothes and a man wearing headphones?", "question_wo_referring_query": "Who is pointing a thumb at the camera?", "candidates": ["A woman wearing a mask", "A man in black clothes wearing white 
headphones", "A man in a blue outfit", "A man wearing headphones", "A man wearing a white hat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "bFFrmBDFeI4_1", "video_path": "bFFrmBDFeI4.mp4", "subtitle_path": "bFFrmBDFeI4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.68, "view_count": 181743}, {"video_id": "bFFrmBDFeI4", "question": "In a chat software, there are four young people chatting. The top left corner shows a short-haired woman wearing glasses, the top right corner shows a man with fluffy hair, the bottom left corner shows a short-haired man wearing white earphones, and the bottom right corner shows a man wearing a hooded jacket. Who is touching their neck at this moment?", "question_wo_referring_query": "Who is touching their neck at this moment?", "candidates": ["The short-haired man wearing white earphones", "The man wearing a hooded jacket", "The man with fluffy hair", "The short-haired woman wearing glasses"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "bFFrmBDFeI4_2", "video_path": "bFFrmBDFeI4.mp4", "subtitle_path": "bFFrmBDFeI4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.68, "view_count": 181743}, {"video_id": "j3voSPJL8sU", "question": "On the distant hillside full of greenery, a man wearing a short sleeve shirt and green pants is crouching on the ground. In front of him is a pile of wooden blocks. 
When the axe head first appears on the screen, what happens?", "question_wo_referring_query": "What happens when the axe head appears on the screen?", "candidates": ["The man replaces the handle of the axe.", "The man chops at the wooden block with the axe head.", "The man places the axe head on the ground.", "The man is sharpening the axe blade with a stone."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "j3voSPJL8sU_0", "video_path": "j3voSPJL8sU.mp4", "subtitle_path": "j3voSPJL8sU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.72, "view_count": 686169}, {"video_id": "j3voSPJL8sU", "question": "On the poor land, there is a brown wooden table with several wooden bowls and yellow metal utensils on it. Behind the large brown wooden bowl, there is a piece of lamb. On the right side of the screen, there is a mobile phone screen displaying a page with wooden bowls and cutlery. What happened when corn oil first appeared on the screen?", "question_wo_referring_query": "What happened when corn oil first appeared on the screen?", "candidates": ["The corn oil coated the red chili.", "The corn oil spread across the wooden table surface.", "The corn oil was poured into the large wooden bowl.", "The corn oil spilled onto the lamb."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "j3voSPJL8sU_1", "video_path": "j3voSPJL8sU.mp4", "subtitle_path": "j3voSPJL8sU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.72, "view_count": 686169}, {"video_id": "j3voSPJL8sU", "question": "On the right side of the screen, there are charred wooden logs and a fresh wooden stick. An individual wearing yellow gloves is holding an axe. A piece of wood is inserted into the ground in front of the tip of a foot, with an iron chain placed on top of the wood. 
What happens in the video when the iron chain first appears?", "question_wo_referring_query": "What happens in the video?", "candidates": ["The iron chain is thrown into the wooden logs.", "A nail is slowly hammered into the wood.", "A nail is pulled out of the wood.", "The iron chain is taken apart."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "j3voSPJL8sU_2", "video_path": "j3voSPJL8sU.mp4", "subtitle_path": "j3voSPJL8sU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.72, "view_count": 686169}, {"video_id": "i3EntVc8dUI", "question": "In a room with a white background and a ceiling light on, there is a black table in front, with a book on it and a pen beside it. A man in a black T-shirt is sitting in front of the table, flipping through the book. What does the man do after finishing the book?", "question_wo_referring_query": "What does the man do after finishing the book?", "candidates": ["Flips the book", "Writes a note", "Stands up", "Twirls the pen", "Tears a page from the book"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "i3EntVc8dUI_0", "video_path": "i3EntVc8dUI.mp4", "subtitle_path": "i3EntVc8dUI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1332.0, "view_count": 86198}, {"video_id": "i3EntVc8dUI", "question": "In a room with white walls, there is a window on the right side. Below the window, there is a table with a vase, a laptop, and a potted plant on the wooden laptop stand. A man in a short-sleeve T-shirt is sitting in front of the laptop typing on it. 
What did the man do after typing on the laptop?", "question_wo_referring_query": "What did the man do after typing on the laptop?", "candidates": ["Held the laptop", "Sat on the chair", "Picked up the potted plant", "Picked up the vase", "Slept on the bed"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "i3EntVc8dUI_1", "video_path": "i3EntVc8dUI.mp4", "subtitle_path": "i3EntVc8dUI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1332.0, "view_count": 86198}, {"video_id": "i3EntVc8dUI", "question": "In a room with white walls, there is a floor-to-ceiling window with white curtains on the left. There is a black object in the lower right corner. In front, there is a white desk with three microphones and two water bottles on it. Two men are sitting beside the desk. The man in the middle is reaching out with both hands and speaking into a microphone. What does the man do after reaching out with both hands and speaking?", "question_wo_referring_query": "What does the man do after reaching out with both hands and speaking?", "candidates": ["Stood up", "Raised the microphone", "Made a 'YES' gesture with both hands", "Patted the person beside him", "Clenched his fists"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "i3EntVc8dUI_2", "video_path": "i3EntVc8dUI.mp4", "subtitle_path": "i3EntVc8dUI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1332.0, "view_count": 86198}, {"video_id": "_y1L5tPqDvA", "question": "Which of the following characters appears first in the video?", "question_wo_referring_query": "Which of the following characters appears first in the video?", "candidates": ["An elderly person wearing a blue cap, an orange coat, and glasses", "A woman wearing a white duckbill cap and a black T-shirt", "A middle-aged man in a blue coat and glasses", "A short-haired woman in a purple 
T-shirt", "A woman wearing a white scarf and glasses"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "_y1L5tPqDvA_0", "video_path": "_y1L5tPqDvA.mp4", "subtitle_path": "_y1L5tPqDvA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2945.5, "view_count": 68529}, {"video_id": "_y1L5tPqDvA", "question": "Which concept is mentioned first below?", "question_wo_referring_query": "Which concept is mentioned first below?", "candidates": ["Public", "Scope", "Height", "Mountain Peak", "Mountain"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "_y1L5tPqDvA_1", "video_path": "_y1L5tPqDvA.mp4", "subtitle_path": "_y1L5tPqDvA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2945.5, "view_count": 68529}, {"video_id": "_y1L5tPqDvA", "question": "Which type of hat appeared first among the following?", "question_wo_referring_query": "Which type of hat appeared first among the following?", "candidates": ["White duckbill cap", "Blue duckbill cap", "Blue hiking hat", "Gray hiking hat", "Olive beret"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "_y1L5tPqDvA_2", "video_path": "_y1L5tPqDvA.mp4", "subtitle_path": "_y1L5tPqDvA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2945.5, "view_count": 68529}, {"video_id": "WzWtxOufgbM", "question": "On a high mountain, surrounded by white mist, a man wearing a black floral coat with a blue shirt underneath and green pants is standing. His left hand is in his pocket, his right foot is forward, and his left foot is back. 
After the subtitle mentions 'You cannot truly feel Kalbajar's beauty unless you see it with your own eyes', what does the man do?", "question_wo_referring_query": "What action does the man take?", "candidates": ["The man is running", "The man is drinking water with his right hand", "The man is cutting vegetables", "The man is stir-frying vegetables"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "WzWtxOufgbM_0", "video_path": "WzWtxOufgbM.mp4", "subtitle_path": "WzWtxOufgbM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1105.14, "view_count": 3476509}, {"video_id": "WzWtxOufgbM", "question": "In the scene, there is a stretch of wilderness where a man is looking into the distance. He is wearing a black top with floral patterns and green pants. The man has his hands in his pockets. Behind him, there is a wooden table. To the left, there is a black pot with a silver lid. In front of the pot, there is a round wooden stump with a tool on it. The direction the man is looking towards is an upward slope covered in green plants. After the subtitle mentions 'Karabakh is Azerbaijan,' what action does the man take?", "question_wo_referring_query": "What action does the man take?", "candidates": ["The man is eating.", "The man lifted the pot lid.", "The man is looking into the distance.", "The man is cooking."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "WzWtxOufgbM_1", "video_path": "WzWtxOufgbM.mp4", "subtitle_path": "WzWtxOufgbM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1105.14, "view_count": 3476509}, {"video_id": "WzWtxOufgbM", "question": "On a desolate mountain, there's a man wearing a black floral coat with a blue shirt underneath. He is situated under a blue canopy. In front of him on the right side, there is a red object on the ground. 
Next to it is a silver and black triangular object. In the distance, there are overlapping mountain peaks and mist. Before the subtitle 'Kalbajar, Karabakh, Azerbaijan' appears, what does the man do?", "question_wo_referring_query": "What does the man do?", "candidates": ["The man is chopping vegetables", "The man is taking photos", "The man is eating something in his hand", "The man is introducing the scenery to a companion"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "WzWtxOufgbM_2", "video_path": "WzWtxOufgbM.mp4", "subtitle_path": "WzWtxOufgbM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1105.14, "view_count": 3476509}, {"video_id": "Z6d9wMYwbME", "question": "In a room with white cabinets and colorful decorative paintings on the wall, there is a golden-haired woman wearing a duckbill cap holding a banana. After the 'Music' subtitle disappears, what animal appears next to the woman?", "question_wo_referring_query": "What animal appears next to the woman?", "candidates": ["a bird", "a dog", "a cat", "a panda"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "Z6d9wMYwbME_0", "video_path": "Z6d9wMYwbME.mp4", "subtitle_path": "Z6d9wMYwbME_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1270.07, "view_count": 266421}, {"video_id": "Z6d9wMYwbME", "question": "On the road paved with red bricks, a woman wearing a duck-tongue hat and a red cotton coat appears before the subtitle 'walking because lynn told chris to take,' how many men appear in front of the woman?", "question_wo_referring_query": "How many men appear in front of the woman?", "candidates": ["Two men", "One person", "A couple", "Three people"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "Z6d9wMYwbME_1", "video_path": "Z6d9wMYwbME.mp4", 
"subtitle_path": "Z6d9wMYwbME_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1270.07, "view_count": 266421}, {"video_id": "Z6d9wMYwbME", "question": "In a room with many decorative paintings on the wall, a strip of lights decorated around the ceiling edge, a pot of snake plant on the table, and warm-toned furniture, a woman wearing a beige hat, a brown coat, and a black inner layer appears. After the subtitle 'one so without further ado let's jump' disappears, what small household appliance appears on the screen?", "question_wo_referring_query": "What small household appliance appears on the screen?", "candidates": ["pressure cooker", "frying pan", "kettle", "rice cooker"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "Z6d9wMYwbME_2", "video_path": "Z6d9wMYwbME.mp4", "subtitle_path": "Z6d9wMYwbME_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1270.07, "view_count": 266421}, {"video_id": "BdTWh6dfye4", "question": "In which of the following sequences are the scenes correct?", "question_wo_referring_query": "In which of the following sequences are the scenes correct?", "candidates": ["First, a man dressed in black is lying on the grass, resting his head on a blue object, with many trees in the foreground, a wall on the right, and a utility pole next to the wall. Then, a person wearing a red shirt, carrying a backpack, and spreading both arms is skateboarding on a road, with mountains stretching out in front. Finally, a building is on fire, and on the right, a fire truck is parked with a group of firefighters extinguishing the fire.", "First, a person wearing a red shirt, carrying a backpack, and spreading both arms is skateboarding on a road, with mountains stretching out in front. 
Then, a man dressed in black is lying on the grass, resting his head on a blue object, with many trees in the foreground, a wall on the right, and a utility pole next to the wall. Finally, a building is on fire, and on the right, a fire truck is parked with a group of firefighters extinguishing the fire.", "First, a building is on fire, and on the right, a fire truck is parked with a group of firefighters extinguishing the fire. Then, a person wearing a red shirt, carrying a backpack, and spreading both arms is skateboarding on a road, with mountains stretching out in front. Finally, a man dressed in black is lying on the grass, resting his head on a blue object, with many trees in the foreground, a wall on the right, and a utility pole next to the wall.", "First, a building is on fire, and on the right, a fire truck is parked with a group of firefighters extinguishing the fire. Then, a man dressed in black is lying on the grass, resting his head on a blue object, with many trees in the foreground, a wall on the right, and a utility pole next to the wall. Finally, a person wearing a red shirt, carrying a backpack, and spreading both arms is skateboarding on a road, with mountains stretching out in front."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "BdTWh6dfye4_0", "video_path": "BdTWh6dfye4.mp4", "subtitle_path": "BdTWh6dfye4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.24, "view_count": 3107132}, {"video_id": "BdTWh6dfye4", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a person with blonde hair wearing a black top, olive green pants, and shoes, sitting on the ground against a tree. In front of them is a handcart filled with various items. 
Then, there is a red and white vehicle with the number 9308 on the roof passing an intersection, with a white house on the right side. Lastly, a hand is rummaging through a trash can with a black trash bag in a stone shell.", "First, there is a red and white vehicle with the number 9308 on the roof passing an intersection, with a white house on the right side. Then, there is a person with blonde hair wearing a black top, olive green pants, and shoes, sitting on the ground against a tree. In front of them is a handcart filled with various items. Lastly, a hand is rummaging through a trash can with a black trash bag in a stone shell.", "First, a hand is rummaging through a trash can with a black trash bag in a stone shell. Then, there is a red and white vehicle with the number 9308 on the roof passing an intersection, with a white house on the right side. Lastly, there is a person with blonde hair wearing a black top, olive green pants, and shoes, sitting on the ground against a tree. In front of them is a handcart filled with various items.", "First, a hand is rummaging through a trash can with a black trash bag in a stone shell. Then, there is a person with blonde hair wearing a black top, olive green pants, and shoes, sitting on the ground against a tree. In front of them is a handcart filled with various items. 
Lastly, there is a red and white vehicle with the number 9308 on the roof passing an intersection, with a white house on the right side."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "BdTWh6dfye4_1", "video_path": "BdTWh6dfye4.mp4", "subtitle_path": "BdTWh6dfye4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.24, "view_count": 3107132}, {"video_id": "BdTWh6dfye4", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a man in a dark coat over a blue shirt is eating a chicken leg, with a bottle of mineral water on his left and a sign on his right that says THURSDAYS 11. Then, a person dressed in all black is walking on the right side of the road. There is a blue bus in front on the left, a cow and a row of green plants on the right, and an overhead sign that reads Marina Del Rey Mindanao Way. Finally, a person wearing a black-gray coat is sitting on the ground, holding a newspaper, with a backpack next to them. Behind the backpack, there is a red sign that says FIRE LANE.", "First, a person wearing a black-gray coat is sitting on the ground, holding a newspaper, with a backpack next to them. Behind the backpack, there is a red sign that says FIRE LANE. Then, a man in a dark coat over a blue shirt is eating a chicken leg, with a bottle of mineral water on his left and a sign on his right that says THURSDAYS 11. Finally, a person dressed in all black is walking on the right side of the road. There is a blue bus in front on the left, a cow and a row of green plants on the right, and an overhead sign that reads Marina Del Rey Mindanao Way.", "First, a man in a dark coat over a blue shirt is eating a chicken leg, with a bottle of mineral water on his left and a sign on his right that says THURSDAYS 11. 
Then, a person wearing a black-gray coat is sitting on the ground, holding a newspaper, with a backpack next to them. Behind the backpack, there is a red sign that says FIRE LANE. Finally, a person dressed in all black is walking on the right side of the road. There is a blue bus in front on the left, a cow and a row of green plants on the right, and an overhead sign that reads Marina Del Rey Mindanao Way.", "First, a person wearing a black-gray coat is sitting on the ground, holding a newspaper, with a backpack next to them. Behind the backpack, there is a red sign that says FIRE LANE. Then, a person dressed in all black is walking on the right side of the road. There is a blue bus in front on the left, a cow and a row of green plants on the right, and an overhead sign that reads Marina Del Rey Mindanao Way. Finally, a man in a dark coat over a blue shirt is eating a chicken leg, with a bottle of mineral water on his left and a sign on his right that says THURSDAYS 11."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "BdTWh6dfye4_2", "video_path": "BdTWh6dfye4.mp4", "subtitle_path": "BdTWh6dfye4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.24, "view_count": 3107132}, {"video_id": "4zHp29IgfuU", "question": "On the white table next to the dark grey sofa, there is a dark tan Dachshund in a white box. 
In which scene did this dog appear?", "question_wo_referring_query": "In which scene did this dog appear?", "candidates": ["By the sea", "By the pool", "On a green lawn", "Under a pavilion"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "4zHp29IgfuU_0", "video_path": "4zHp29IgfuU.mp4", "subtitle_path": "4zHp29IgfuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.34, "view_count": 199304}, {"video_id": "4zHp29IgfuU", "question": "In the video, in a dimly lit room, a man wearing a bathrobe is holding a camera and taking pictures with a woman wearing a bathrobe and a small flower in her hair. In which scene does the woman with the flower in her hair appear later?", "question_wo_referring_query": "In which scene does the woman with the flower in her hair appear later?", "candidates": ["On the beach at sunset", "On the road at the foot of the mountain", "By the rocky seashore", "Sitting on a chair beside the stone-paved outer wall"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "4zHp29IgfuU_1", "video_path": "4zHp29IgfuU.mp4", "subtitle_path": "4zHp29IgfuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.34, "view_count": 199304}, {"video_id": "4zHp29IgfuU", "question": "In the video, there is a woman standing next to the tour bus, wearing a dark gray uniform, with her hair tied back and smiling. 
In which later scene does this woman appear?", "question_wo_referring_query": "In which later scene does this woman appear?", "candidates": ["On the grass field", "By the swimming pool", "Next to the tour bus", "At the staircase inside"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "4zHp29IgfuU_2", "video_path": "4zHp29IgfuU.mp4", "subtitle_path": "4zHp29IgfuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.34, "view_count": 199304}, {"video_id": "C7dbCrsf0l4", "question": "There is a man with curly hair on the screen, and behind him to the right, there is a man wearing an orange outfit with a silver backpack. In which subtitles does this curly-haired man appear?", "question_wo_referring_query": "In which subtitles does this curly-haired man appear?", "candidates": ["well Istanbul here we are", "what you're describing exactly basically", "like we are inside a space shuttle", "wow that's where you control this whole"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "C7dbCrsf0l4_0", "video_path": "C7dbCrsf0l4.mp4", "subtitle_path": "C7dbCrsf0l4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1373.0, "view_count": 6659400}, {"video_id": "C7dbCrsf0l4", "question": "In the video, there is a white and blue cargo ship in the ocean with 'DFDS' written in white letters on its body. 
Which subtitles does this cargo ship appear with?", "question_wo_referring_query": "Which subtitles does this white and blue cargo ship appear with?", "candidates": ["we're in the middle of the ocean right", "want to discourage you from starting a", "there was a lot for us to do on this", "ship and our day tomorrow was about to"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "C7dbCrsf0l4_1", "video_path": "C7dbCrsf0l4.mp4", "subtitle_path": "C7dbCrsf0l4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1373.0, "view_count": 6659400}, {"video_id": "C7dbCrsf0l4", "question": "In the video, a blond man with a tongue piercing wearing a black top and a backpack, with an orange and black camera behind him. In which subtitles does this man appear?", "question_wo_referring_query": "In which subtitles does this man appear?", "candidates": ["as we stepped into their office our", "though yeah that's what I'm worried", "and then there's a pool oh", "should we like try and talk to those"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "C7dbCrsf0l4_2", "video_path": "C7dbCrsf0l4.mp4", "subtitle_path": "C7dbCrsf0l4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1373.0, "view_count": 6659400}, {"video_id": "qpAhpLPg2r4", "question": "In a scene with a yellow background, a person with black nail polish is holding a phone. On the phone screen, a curly-haired woman wearing a brown coat is shown. 
What is this curly-haired woman doing?", "question_wo_referring_query": "What is the curly-haired woman doing?", "candidates": ["Playing games", "Holding a sparkler and waving", "Watching a video", "Drinking coffee"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "qpAhpLPg2r4_0", "video_path": "qpAhpLPg2r4.mp4", "subtitle_path": "qpAhpLPg2r4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1588.51, "view_count": 230873}, {"video_id": "qpAhpLPg2r4", "question": "On a small road where a dark grey car is parked, a man with short hair is standing. He is wearing a black shirt with colorful patterns, blue jeans, and a black backpack. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Carrying a luggage case", "Reaching for the car door handle", "Driving the car", "Opening the trunk"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "qpAhpLPg2r4_1", "video_path": "qpAhpLPg2r4.mp4", "subtitle_path": "qpAhpLPg2r4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1588.51, "view_count": 230873}, {"video_id": "qpAhpLPg2r4", "question": "What is the woman wearing a white mask and a navy blue short sleeve shirt doing next to a wooden table with an olive-colored pattern, which has a green glass bottle, a red-green can, and a white porcelain plate with a stack of dishes?", "question_wo_referring_query": "What is the woman wearing a white mask and a navy blue short sleeve shirt doing?", "candidates": ["Cleaning up the table early in the morning", "Mopping the floor", "Wiping the table", "Holding a stack of dishes on the table"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "qpAhpLPg2r4_2", "video_path": "qpAhpLPg2r4.mp4", "subtitle_path": "qpAhpLPg2r4_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 1588.51, "view_count": 230873}, {"video_id": "Zgk2OaSIYOU", "question": "In a desolate place, there is a man wearing black clothes. To the right rear side of the man, there is a tree with some round wooden logs underneath it. To the left of the logs, there is a wooden table with a large jar on it. In front of the man, there is a wooden table, under which are stacked round wooden logs. On the table, there is a wooden board, several round bowls, and some green onions. To the left of the table, there is a colander. What is the man holding in his hand?", "question_wo_referring_query": "In a desolate place, there is a man wearing black clothes. To the right rear side of the man, there is a tree with some round wooden logs underneath it. To the left of the logs, there is a wooden table with a large jar on it. In front of the man, there is a wooden table, under which are stacked round wooden logs. On the table, there is a wooden board, several round bowls, and some green onions. To the left of the table, there is a colander. What is the man holding in his hand?", "candidates": ["kitchen knife", "round wooden logs", "green onions", "bowls"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "Zgk2OaSIYOU_0", "video_path": "Zgk2OaSIYOU.mp4", "subtitle_path": "Zgk2OaSIYOU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 935.88, "view_count": 847870}, {"video_id": "Zgk2OaSIYOU", "question": "On a large black plate, there are only three cooked birds. In the upper right corner, there is a cracked wooden handle. 
What object can be seen on the screen?", "question_wo_referring_query": "What object can be seen on the screen?", "candidates": ["scallion", "axe head", "small plate", "spoon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "Zgk2OaSIYOU_1", "video_path": "Zgk2OaSIYOU.mp4", "subtitle_path": "Zgk2OaSIYOU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 935.88, "view_count": 847870}, {"video_id": "Zgk2OaSIYOU", "question": "In a desolate place, a man dressed in a black coat with a blue shirt underneath sits next to a wooden table. Behind him, there is a tree with a circular piece of wood placed underneath it. In front of the circular wood, there is a can, and on the table, there are two plates. One plate contains rice and the other contains white vegetables. What is the man holding in his hand?", "question_wo_referring_query": "What is the man holding in his hand?", "candidates": ["White vegetables", "A water bottle", "Chopsticks", "A circular bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "Zgk2OaSIYOU_2", "video_path": "Zgk2OaSIYOU.mp4", "subtitle_path": "Zgk2OaSIYOU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 935.88, "view_count": 847870}, {"video_id": "N4aXsz8JqkY", "question": "Under a canopy, a man wearing a dark blue shirt and blue pants is adding salt to a small wooden barrel. Behind the man is a wooden shelf with various items on it, and some things hanging below it, including two cans. In front of the man, there is a wooden table with a few round bowls on the left side and a large piece of meat on the right side. To the left of the meat, there is a wooden board with a knife on it. 
When the subtitle mentions 'Salt', what long, thin red item is near the man on the wooden board?", "question_wo_referring_query": "What long, thin red item is near the man on the wooden board?", "candidates": ["Chili pepper", "Hammer", "Radish", "Vegetable knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "N4aXsz8JqkY_0", "video_path": "N4aXsz8JqkY.mp4", "subtitle_path": "N4aXsz8JqkY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1520.33, "view_count": 3728435}, {"video_id": "N4aXsz8JqkY", "question": "In the video, there is a person wearing a dark blue top who is using a tool to shred cheese. To the right, there is some green foliage, in the background, there is a can, and in the foreground, there is a wooden table with a round wooden platter and a round bowl on it. When the subtitle mentions 'Mozzarella cheese,' what red elongated object is on the right side of the table?", "question_wo_referring_query": "What red elongated object is on the right side of the table?", "candidates": ["chili pepper", "meat", "leek", "carrot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "N4aXsz8JqkY_1", "video_path": "N4aXsz8JqkY.mp4", "subtitle_path": "N4aXsz8JqkY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1520.33, "view_count": 3728435}, {"video_id": "N4aXsz8JqkY", "question": "Under a shelf, there is a man wearing a green coat with a dark blue shirt inside, raising his right hand. Behind him, there's a platform built with wooden planks on the wall, holding various items. Below the platform, there are tools, cans, a tree on the right side, and a white furnace on the left. In front of the man, there's a wooden table with a plank on top. 
When the subtitle says 'Thanks everyone!', what thin, long object with blue color can be found on the wooden plank on the table?", "question_wo_referring_query": "What thin, long object with blue color can be found on the wooden plank on the table?", "candidates": ["plate", "can", "cake", "knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "N4aXsz8JqkY_2", "video_path": "N4aXsz8JqkY.mp4", "subtitle_path": "N4aXsz8JqkY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1520.33, "view_count": 3728435}, {"video_id": "a1WM5fb_-iE", "question": "In a purely black background, there is a man wearing black-framed glasses and a gray short-sleeved shirt holding a white mug with blue and green patterns. In the top left corner of the man, there is a small icon of a white handbag. What is the shape of the handbag in the small icon?", "question_wo_referring_query": "What is the shape of the handbag in the small icon?", "candidates": ["Rectangular", "Bucket-shaped", "Square", "Cylindrical"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "a1WM5fb_-iE_0", "video_path": "a1WM5fb_-iE.mp4", "subtitle_path": "a1WM5fb_-iE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 903.86, "view_count": 342423}, {"video_id": "a1WM5fb_-iE", "question": "In the scene with a man holding a white mug against a pure black background, there appears a flag with a moon and a star in the top right corner. What is the background color of the flag?", "question_wo_referring_query": "In the scene with a man holding a white mug against a pure black background, there appears a flag with a moon and a star in the top right corner. 
What is the background color of the flag?", "candidates": ["red", "black", "yellow", "white"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "a1WM5fb_-iE_1", "video_path": "a1WM5fb_-iE.mp4", "subtitle_path": "a1WM5fb_-iE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 903.86, "view_count": 342423}, {"video_id": "a1WM5fb_-iE", "question": "In a screen with a map as the background, next to the text \u201cChuvash Republic/Chuvashia (Russia)\u201d at the center of the screen, there is a white arrow pointing to a certain place. What shape frame is around the place the arrow is pointing to?", "question_wo_referring_query": "What shape frame is around the place the arrow is pointing to?", "candidates": ["white circle", "red rectangle", "red square", "red circle"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "a1WM5fb_-iE_2", "video_path": "a1WM5fb_-iE.mp4", "subtitle_path": "a1WM5fb_-iE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 903.86, "view_count": 342423}, {"video_id": "YrdLMFiygUk", "question": "There is a platform on the left side of the room on the screen with a potted plant on it. On the right side, various pictures are pasted on the wall. In the middle, there's a girl wearing striped clothes and dyeing her hair. 
What item is she taking off and putting on?", "question_wo_referring_query": "What item is she taking off and putting on?", "candidates": ["Earrings", "Glasses", "Ring", "Bracelet"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "YrdLMFiygUk_0", "video_path": "YrdLMFiygUk.mp4", "subtitle_path": "YrdLMFiygUk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 982.28, "view_count": 417033}, {"video_id": "YrdLMFiygUk", "question": "There is a table in the screen with two computers, an iPad, a mobile phone, and a person wearing white clothes holding a pen in their right hand. What is the object being held in the person's hand in the video?", "question_wo_referring_query": "What is the object being held in the person's hand in the video?", "candidates": ["Orange", "White pen", "Ruler", "Pear"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "YrdLMFiygUk_1", "video_path": "YrdLMFiygUk.mp4", "subtitle_path": "YrdLMFiygUk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 982.28, "view_count": 417033}, {"video_id": "YrdLMFiygUk", "question": "There is a wooden table in the scene. A hand is holding a white pen and writing. 
In the video, on what object is the writing being done?", "question_wo_referring_query": "In the video, on what object is the writing being done?", "candidates": ["Draft Paper", "Book", "Notebook", "iPad"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "YrdLMFiygUk_2", "video_path": "YrdLMFiygUk.mp4", "subtitle_path": "YrdLMFiygUk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 982.28, "view_count": 417033}, {"video_id": "4zdbjIjKchI", "question": "In the video, what gesture is the man in gray upper clothing and dark blue jeans, standing in front of a big cannon, making during his initial appearance?", "question_wo_referring_query": "What gesture is he making during his initial appearance?", "candidates": ["Consulting information", "Hands spread open", "Resting on the side", "Reading a book"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "4zdbjIjKchI_0", "video_path": "4zdbjIjKchI.mp4", "subtitle_path": "4zdbjIjKchI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1033.56, "view_count": 244824}, {"video_id": "4zdbjIjKchI", "question": "On a road lined with lush green trees, when a dark blue convertible passes by, what gesture does the man sitting in the back seat, wearing a green jacket and a duckbill hat, make?", "question_wo_referring_query": "What gesture does this man make?", "candidates": ["Both hands raised in a '2' gesture", "No movement", "Holding both hands tight", "One hand raised"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "4zdbjIjKchI_1", "video_path": "4zdbjIjKchI.mp4", "subtitle_path": "4zdbjIjKchI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1033.56, "view_count": 244824}, {"video_id": "4zdbjIjKchI", "question": "On the left side of the screen, there is a path lined with 
various trees, and people of all kinds are walking on the path. On the right side of the screen, there is a yellow tank, and in front of the barrel of the tank stands a short-haired man wearing an army green jacket and black pants. What was this man doing when he first appeared?", "question_wo_referring_query": "What was this man doing when he first appeared?", "candidates": ["Taking pictures with a mobile phone", "Observing the cannon", "Walking away", "Chatting with someone"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "4zdbjIjKchI_2", "video_path": "4zdbjIjKchI.mp4", "subtitle_path": "4zdbjIjKchI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1033.56, "view_count": 244824}, {"video_id": "429Pu5342I4", "question": "On a map with yellow and blue regions, there are several white texts. Among them, the largest letters are 'First Island Chain' in white. When the subtitles say \u201cwhereas the Second Island Chain runs from the middle of Japan,\u201d what event occurs on the screen?", "question_wo_referring_query": "What event occurs on the screen?", "candidates": ["A second red dotted line appears in the blue region.", "A red dotted line appears in the yellow region.", "A black dotted line appears in the blue region.", "A solid red line appears in the yellow region."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "429Pu5342I4_0", "video_path": "429Pu5342I4.mp4", "subtitle_path": "429Pu5342I4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.17, "view_count": 588330}, {"video_id": "429Pu5342I4", "question": "The screen shows a blue background with 5 white circle patterns, 3 of the circle patterns are arranged in a row. 
When the subtitle 'replacement of Los Angeles-class attack submarines with Virginia-class' appears, what happens on the screen?", "question_wo_referring_query": "What event occurs on the screen?", "candidates": ["A fourth pattern appears to the right of the row of 3 circles", "A fourth pattern appears above the row of 3 circles", "A fourth pattern appears to the left of the row of 3 circles", "A fourth pattern appears below the row of 3 circles"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "429Pu5342I4_1", "video_path": "429Pu5342I4.mp4", "subtitle_path": "429Pu5342I4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.17, "view_count": 588330}, {"video_id": "429Pu5342I4", "question": "The screen shows a blue background with three white circular designs at the top. Below the circles, there are two lines of white text, and a black box. When the subtitle says, 'It is better suited to fleet air defense missions,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The amount of white text inside the black box decreased.", "The amount of green text inside the black box increased.", "The amount of white text inside the black box increased.", "The amount of red text inside the black box increased."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "429Pu5342I4_2", "video_path": "429Pu5342I4.mp4", "subtitle_path": "429Pu5342I4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.17, "view_count": 588330}, {"video_id": "q-TJvNBO1fw", "question": "In a black and white screen, an elderly man wearing black-rimmed glasses is looking at another elderly person with glasses and curly hair. 
Based on the content provided in the video clip, which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["An elderly man with silver hair wearing glasses.", "An elderly man with silver hair wearing glasses, a black hat, and a black coat.", "A young man wearing glasses.", "A woman with black hair wearing sunglasses."], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "q-TJvNBO1fw_0", "video_path": "q-TJvNBO1fw.mp4", "subtitle_path": "q-TJvNBO1fw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1014.42, "view_count": 65402}, {"video_id": "q-TJvNBO1fw", "question": "In the video, a blue frog stands behind a purple shrimp and tries to steal the ball that the shrimp is holding with its pincers. Which animal appears first?", "question_wo_referring_query": "Which animal appears first?", "candidates": ["small fish", "blue frog", "octopus", "small shrimp"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "q-TJvNBO1fw_1", "video_path": "q-TJvNBO1fw.mp4", "subtitle_path": "q-TJvNBO1fw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1014.42, "view_count": 65402}, {"video_id": "q-TJvNBO1fw", "question": "In the black-and-white animated scene, a man carrying a birdcage and holding a net is followed by a small animal. 
Which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["The bird perched on the tree", "The man with the birdcage", "A crow", "The fox by the bush"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "q-TJvNBO1fw_2", "video_path": "q-TJvNBO1fw.mp4", "subtitle_path": "q-TJvNBO1fw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1014.42, "view_count": 65402}, {"video_id": "OOmuF8gxvyo", "question": "In the video, there's a woman wearing black clothes in a room with green plants in the background and books and vegetables placed on the right. After she says 'feel free to be playful in the selection,' what does this woman do?", "question_wo_referring_query": "what happens to the woman after she says 'feel free to be playful in the selection'?", "candidates": ["picks up a book", "picks up a light bulb", "lies down", "stands up"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "OOmuF8gxvyo_0", "video_path": "OOmuF8gxvyo.mp4", "subtitle_path": "OOmuF8gxvyo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1424.36, "view_count": 6200}, {"video_id": "OOmuF8gxvyo", "question": "There is a piece of paper placed on a wooden table in the video, with a pencil next to it. A hand in the video is holding a green pencil. 
After the subtitle mentions 'that the graphite in the pencil is', what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["The pencil is thrown away", "The green pencil draws a line on the paper", "The paper is thrown away", "Nothing happens"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "OOmuF8gxvyo_1", "video_path": "OOmuF8gxvyo.mp4", "subtitle_path": "OOmuF8gxvyo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1424.36, "view_count": 6200}, {"video_id": "OOmuF8gxvyo", "question": "In the top left corner of the screen, there is a picture and a piece of paper lying on a desk. The paper has drawings made with blue and maroon colors as well as pencil. After the subtitle mentions 'slowly put in a little pop of color here,' what happens?", "question_wo_referring_query": "What happens after?", "candidates": ["The pencil part is drawn with a purple marker", "The pencil part is drawn with a black marker", "The pencil part is drawn with a white marker", "The pencil part is drawn with a green marker"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "OOmuF8gxvyo_2", "video_path": "OOmuF8gxvyo.mp4", "subtitle_path": "OOmuF8gxvyo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1424.36, "view_count": 6200}, {"video_id": "jmPTbUcNMnY", "question": "Which scene appears first?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["The first scene shows a room with several paintings hanging on white walls, a bookshelf on the side, and a woman in a black dress with her hands crossed. The next scene shows a green field with a section of wall made of stones, a small sign hanging on the side. 
Finally, the last scene shows a road behind a light palm-colored house, with a flowerbed behind the house.", "The first scene shows a room with several paintings hanging on white walls, a bookshelf on the side, and a woman in a black dress with her hands crossed. The next scene shows a road behind a light palm-colored house, with a flowerbed behind the house. Finally, the last scene shows a green field with a section of wall made of stones, a small sign hanging on the side.", "The first scene shows a green field with a section of stone-wall, a small sign hanging on the side. The next scene shows a room with several paintings hanging on white walls, a bookshelf on the side, and a woman in a black dress with her hands crossed. Finally, the last scene shows a road behind a light palm-colored house, with a flowerbed behind the house.", "The first scene shows a green field with a section of stone-wall, a small sign hanging on the side. The next scene shows a road behind a light palm-colored house, with a flowerbed behind the house. Finally, the last scene shows a room with several paintings hanging on white walls, a bookshelf on the side, and a woman in a black dress with her hands crossed."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "jmPTbUcNMnY_0", "video_path": "jmPTbUcNMnY.mp4", "subtitle_path": "jmPTbUcNMnY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.42, "view_count": 37575}, {"video_id": "jmPTbUcNMnY", "question": "Which scene appears first?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["The first scene shows a beautiful bouquet of flowers blooming in front of a yellow building, with a bench in front of the building. Next is a scene with a woman wearing a black short-sleeved shirt and brown pants walking on a path between a light yellow building and green plants. 
The last scene is a person holding a white flower by a light yellow wall.", "The first scene shows a person holding a white flower by a light yellow wall. Next is a scene with a woman wearing a black short-sleeved shirt and brown pants walking on a path between a light yellow building and green plants. The last scene is a beautiful bouquet of flowers blooming in front of a yellow building with a bench in front of it.", "The first scene shows a beautiful bouquet of flowers blooming in front of a yellow building, with a bench in front of the building. Next, there is a person holding a white flower by a light yellow wall. The last scene is a woman wearing a black short-sleeved shirt and brown pants walking on a path between a light yellow building and green plants.", "The first scene shows a person holding a white flower by a light yellow wall. Next, there is a beautiful bouquet of flowers blooming in front of a yellow building with a bench in front of it. The last scene is a woman wearing a black short-sleeved shirt and brown pants walking on a path between a light yellow building and green plants."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "jmPTbUcNMnY_1", "video_path": "jmPTbUcNMnY.mp4", "subtitle_path": "jmPTbUcNMnY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.42, "view_count": 37575}, {"video_id": "jmPTbUcNMnY", "question": "Which scene appears first in chronological order?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["The first scene is set in a room with a few paintings hanging on the white wall, a bookshelf on the side, and a woman wearing a black suspender dress with her right hand splayed open in front of her chest. The next scene takes place in a noodle shop, where a person in a black suspender dress and brown pants is holding a clipboard in their right hand and a box containing noodles in their left. 
The final scene features a small black triangular house surrounded by green vegetation.", "The first scene takes place in a noodle shop, where a person in a black suspender dress and brown pants is holding a clipboard in their right hand and a box containing noodles in their left. The next scene is set in a room with a few paintings hanging on the white wall, a bookshelf on the side, and a woman wearing a black suspender dress with her right hand splayed open in front of her chest. The final scene also takes place in a room with a few paintings hanging on the white wall, a bookshelf on the side, and a woman wearing a black suspender dress with her right hand splayed open in front of her chest.", "The first scene takes place in a noodle shop, where a person in a black suspender dress and brown pants is holding a clipboard in their right hand and a box containing noodles in their left. The next scene features a small black triangular house surrounded by green vegetation. The last scene is set in a room with a few paintings hanging on the white wall, a bookshelf on the side, and a woman wearing a black suspender dress with her right hand splayed open in front of her chest.", "The first scene takes place in a room with a few paintings hanging on the white wall, a bookshelf on the side, and a woman wearing a black suspender dress with her right hand splayed open in front of her chest. The next scene features a small black triangular house surrounded by green vegetation. 
The final scene takes place in a noodle shop, where a person in a black suspender dress and brown pants is holding a clipboard in their right hand and a box containing noodles in their left."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "jmPTbUcNMnY_2", "video_path": "jmPTbUcNMnY.mp4", "subtitle_path": "jmPTbUcNMnY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.42, "view_count": 37575}, {"video_id": "-uvMrMcN0eA", "question": "In the video, this woman with black hair, who is wearing white clothes and earrings, is smiling. In which other scene does this woman appear?", "question_wo_referring_query": "In which other scene does this woman appear?", "candidates": ["She is in the gym.", "She is in a karaoke room.", "In a shop, she is drinking coffee.", "She is in the dance studio."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "-uvMrMcN0eA_0", "video_path": "-uvMrMcN0eA.mp4", "subtitle_path": "-uvMrMcN0eA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1241.71, "view_count": 276754}, {"video_id": "-uvMrMcN0eA", "question": "A woman wearing a white top, black jacket, and black earphones is standing in front of a graffiti-covered wall. 
Where else has this woman appeared?", "question_wo_referring_query": "Where else has this woman appeared?", "candidates": ["In a karaoke room", "In a gym", "In a dance studio", "In a flower shop"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "-uvMrMcN0eA_1", "video_path": "-uvMrMcN0eA.mp4", "subtitle_path": "-uvMrMcN0eA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1241.71, "view_count": 276754}, {"video_id": "-uvMrMcN0eA", "question": "A woman wearing a white top with black details and a black jacket, holding a bouquet of flowers, is standing in a library. Where else does this woman appear?", "question_wo_referring_query": "Where else does this woman appear?", "candidates": ["She is in the dance studio", "She is in the gym", "In the fitting room of a clothing store", "She is in the karaoke room"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "-uvMrMcN0eA_2", "video_path": "-uvMrMcN0eA.mp4", "subtitle_path": "-uvMrMcN0eA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1241.71, "view_count": 276754}, {"video_id": "l_3zj6HeWUE", "question": "On the PPT, there is an English passage on the left and a folding diagram consisting of two fold lines on the right. 
What changes have occurred in the area surrounding this folding diagram?", "question_wo_referring_query": "What changes have occurred in the area surrounding this folding diagram?", "candidates": ["There are additional blue lines and red circles on the folding diagram", "There is an additional segment of English text with a red background", "No changes have occurred", "There is an additional segment of English text with a yellow background"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "l_3zj6HeWUE_0", "video_path": "l_3zj6HeWUE.mp4", "subtitle_path": "l_3zj6HeWUE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1745.97, "view_count": 29024}, {"video_id": "l_3zj6HeWUE", "question": "In the PPT, there are four different blocks, and below them, there are many English paragraphs. What event took place on the screen with multiple drafts?", "question_wo_referring_query": "What event took place on the screen with multiple drafts?", "candidates": ["Yellow patterned English texts appeared on the four blocks, with red and green drafts above.", "There were many additional purple drafts.", "There was an additional paragraph of red patterned English text.", "There were many additional blue drafts."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "l_3zj6HeWUE_1", "video_path": "l_3zj6HeWUE.mp4", "subtitle_path": "l_3zj6HeWUE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1745.97, "view_count": 29024}, {"video_id": "l_3zj6HeWUE", "question": "In the PPT, there are many English paragraphs on the top, as well as some mathematical formulas. There are also some green markings drawn as notes, and below the green markings is the black text '4. Experiments'. 
What happened on the slide with colored backgrounds where there are Small batch sizes?", "question_wo_referring_query": "What happened on the slide with colored backgrounds where there are Small batch sizes?", "candidates": ["There is an additional English paragraph with a red background.", "Nothing changed.", "There are many additional purple drafts.", "There are 2 charts and 2 data tables."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "l_3zj6HeWUE_2", "video_path": "l_3zj6HeWUE.mp4", "subtitle_path": "l_3zj6HeWUE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1745.97, "view_count": 29024}, {"video_id": "wS_g0meQgyc", "question": "In the bottom right corner of the screen, there is a man wearing glasses. At the top, 'Learning by Rotating' is written in bold letters. Below, there are 5 colorful images. In the middle image, there are two men. What is the bird doing in the leftmost image?", "question_wo_referring_query": "What is the bird doing in the leftmost image?", "candidates": ["Standing on a tree branch", "Running", "Flying in the sky", "Eating something"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "wS_g0meQgyc_1", "video_path": "wS_g0meQgyc.mp4", "subtitle_path": "wS_g0meQgyc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2725.37, "view_count": 29}, {"video_id": "wS_g0meQgyc", "question": "In a room, a black-haired woman wearing red and white clothes was seated in front of a device next to a glass wall. 
What action did this woman perform?", "question_wo_referring_query": "What action did this woman perform?", "candidates": ["Her left hand was placed on the keyboard", "She was standing while operating the device", "Her right hand was placed behind her", "Her right hand was placed on the keyboard"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "wS_g0meQgyc_2", "video_path": "wS_g0meQgyc.mp4", "subtitle_path": "wS_g0meQgyc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2725.37, "view_count": 29}, {"video_id": "73Qivs9TyPc", "question": "A person dressed in a red outfit with white floral patterns is shown in the video. This person is wearing a ring on the right hand, holding a vegetable knife, and pressing down on a vegetable with the left hand while cutting it. What is the vegetable being cut in the video?", "question_wo_referring_query": ", what is the vegetable being cut in the video?", "candidates": ["Spinach", "Bok Choy", "Carrot", "Cucumber"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "73Qivs9TyPc_0", "video_path": "73Qivs9TyPc.mp4", "subtitle_path": "73Qivs9TyPc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1639.96, "view_count": 28709}, {"video_id": "73Qivs9TyPc", "question": "In the frame, there's a person wearing a white shirt placing their hand on some chopped vegetables. There's a white cabinet in the background. 
What vegetable is being touched in the video?", "question_wo_referring_query": "What vegetable is being touched in the video?", "candidates": ["Cabbage", "Carrot", "Lettuce", "Choy sum"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "73Qivs9TyPc_1", "video_path": "73Qivs9TyPc.mp4", "subtitle_path": "73Qivs9TyPc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1639.96, "view_count": 28709}, {"video_id": "73Qivs9TyPc", "question": "There is a person in white clothing who is pouring vegetables from a white bowl into a transparent bowl on the table. What are the vegetables in the white bowl in the video?", "question_wo_referring_query": "What are the vegetables in the white bowl in the video?", "candidates": ["Spinach", "Carrot", "Lettuce", "Broccoli"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "73Qivs9TyPc_2", "video_path": "73Qivs9TyPc.mp4", "subtitle_path": "73Qivs9TyPc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1639.96, "view_count": 28709}, {"video_id": "Y3OM1PfDgMY", "question": "In front of a wooden-paneled wall with a white door, there is a light olive green sofa. To the left of the screen, there is a woman wearing a black and white checkered coat, and next to her is a man wearing a floral shirt and red pants. 
What is this man doing?", "question_wo_referring_query": "What is this man doing at this moment?", "candidates": ["He is playing the guitar", "He is dancing with the woman", "He is holding a microphone singing", "He is playing the accordion"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Y3OM1PfDgMY_0", "video_path": "Y3OM1PfDgMY.mp4", "subtitle_path": "Y3OM1PfDgMY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.43, "view_count": 1882716}, {"video_id": "Y3OM1PfDgMY", "question": "In a room with a television, there is a man wearing a floral shirt and red pants. To his right is a woman wearing a green top and jeans. What did the man do at this moment?", "question_wo_referring_query": "What did the man do at this moment?", "candidates": ["He scratched his head", "He hugged the woman next to him", "He shook hands with the woman next to him", "He inserted his right hand into his pocket"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Y3OM1PfDgMY_1", "video_path": "Y3OM1PfDgMY.mp4", "subtitle_path": "Y3OM1PfDgMY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.43, "view_count": 1882716}, {"video_id": "Y3OM1PfDgMY", "question": "On a brownish-green ground, there is a purple sofa with a man in black clothing sitting on the ground. In front of him, there are three puppies. 
What action did the man take at this moment?", "question_wo_referring_query": "What action did the man take at this moment?", "candidates": ["Trimming the puppies' nails", "Lifting the puppies", "Putting the puppies in a cage", "Petting the puppies"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "Y3OM1PfDgMY_2", "video_path": "Y3OM1PfDgMY.mp4", "subtitle_path": "Y3OM1PfDgMY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1265.43, "view_count": 1882716}, {"video_id": "mNEU6jhk0zU", "question": "In the exhibition hall, there are several paintings and explanations hung on the gray walls. When a woman in a black dress mentions 'heart and we stand with the protagonists,' what is she doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["The woman spreads her hands open with her palms facing up", "The woman makes a V sign with both hands", "The woman places her left hand on her forehead", "The woman clenches her fists in front of her chest"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "mNEU6jhk0zU_0", "video_path": "mNEU6jhk0zU.mp4", "subtitle_path": "mNEU6jhk0zU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2159.95, "view_count": 2898}, {"video_id": "mNEU6jhk0zU", "question": "In the video, a woman dressed in a black hoodie and black pants is lying on a white cloth. 
When the phrase 'verticality and verticality to me' is mentioned, what action is this person doing?", "question_wo_referring_query": "What action is this person doing?", "candidates": ["The woman has her left hand on her forehead.", "The woman is clenching her fists in front of her chest.", "This person is pulling the hoodie outward with both hands.", "The woman is making a 'thumbs up' gesture."], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "mNEU6jhk0zU_1", "video_path": "mNEU6jhk0zU.mp4", "subtitle_path": "mNEU6jhk0zU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2159.95, "view_count": 2898}, {"video_id": "mNEU6jhk0zU", "question": "In front of a white background, there is a person with black hair wearing a black top. With their eyes closed, what action does this person perform when the word 'abandon' is mentioned?", "question_wo_referring_query": "What action does this person perform?", "candidates": ["This person places both hands beside their nose.", "The woman clenches her fists and holds them in front of her chest.", "The woman places her left hand on her forehead.", "The woman makes a 'Y' shape with both hands."], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "mNEU6jhk0zU_2", "video_path": "mNEU6jhk0zU.mp4", "subtitle_path": "mNEU6jhk0zU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2159.95, "view_count": 2898}, {"video_id": "P23BI71mHeI", "question": "In the bottom right corner of the screen, there is a man holding a pen in his left hand. Behind the man, there are two colorful images. Below the images, there are four lines of black text. 
What changes can be seen below the colorful images in the background?", "question_wo_referring_query": "What changes can be seen below the colorful images in the background?", "candidates": ["The text has decreased and shapes have been added", "The text has increased and nothing else has changed", "The text has increased and shapes have also been added", "No changes"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "P23BI71mHeI_0", "video_path": "P23BI71mHeI.mp4", "subtitle_path": "P23BI71mHeI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1532.44, "view_count": 527}, {"video_id": "P23BI71mHeI", "question": "There is a man in a suit on the screen, and the largest text on the screen is 'Alternative kernel.' Behind the man, there is a curve graph with red, green, and blue colors. There is a red dot on this graph. How does the red dot move on the curve graph?", "question_wo_referring_query": "How does the red dot move on the curve graph?", "candidates": ["The position of the red dot does not change.", "The red dot moves around on the curve graph.", "The red dot moves around outside the curve graph.", "The red dot moves around on the text 'Alternative kernel.'"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "P23BI71mHeI_1", "video_path": "P23BI71mHeI.mp4", "subtitle_path": "P23BI71mHeI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1532.44, "view_count": 527}, {"video_id": "P23BI71mHeI", "question": "There is a man in a suit in the scene, with a colorful picture of a man in a white shirt behind him. On the right side of the picture are black letters, and above the picture is a line of large text 'Scale Invariant Feature Transform (SIFT)'. 
What change occurs on the screen below 'Scale Invariant Feature Transform (SIFT)'?", "question_wo_referring_query": "What change occurs on the screen below 'Scale Invariant Feature Transform (SIFT)'?", "candidates": ["1 person picture changes to 2 car pictures, text increases", "1 person picture changes to 2 car pictures, text decreases", "1 person picture changes to 2 person pictures, text decreases", "No change"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "P23BI71mHeI_2", "video_path": "P23BI71mHeI.mp4", "subtitle_path": "P23BI71mHeI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1532.44, "view_count": 527}, {"video_id": "anWN4rVwov4", "question": "There are four men on the screen. One is wearing a yellow suit, holding a gun with his right hand and supporting it with his left hand. Another man is wearing a dark green suit, holding a gun naturally at his side with his right hand. There's also a man in an orange suit, and another in a green suit. At the bottom of the screen, there is a flag with red and white stripes and a five-pointed star in dark green and white at the top left corner. 
Which person or object is introduced first on the screen?", "question_wo_referring_query": "Which person or object is introduced first on the screen?", "candidates": ["The man in a dark green suit, holding a gun naturally at his side with his right hand", "The man in a yellow suit, holding a gun with his right hand and supporting it with his left hand", "The flag with red and white stripes and a dark green and white five-pointed star at the top left corner", "The man in a green suit"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "anWN4rVwov4_0", "video_path": "anWN4rVwov4.mp4", "subtitle_path": "anWN4rVwov4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.04, "view_count": 11482}, {"video_id": "anWN4rVwov4", "question": "In one scene, there are many coins laid flat, with a man with curly hair in front of the coins. In another scene, there are stacked coins with a $DOLLAR sign in front, and another scene shows five stacks of coins with 3 coins on the far left side. Which of these coins appears first?", "question_wo_referring_query": "Which of these coins appears first?", "candidates": ["Stacked coins with a $DOLLAR sign in front", "Stacked coins with a $DOLLAR sign in front and five stacks of coins together", "Many coins laid flat, with a man with curly hair in front of the coins", "Five stacks of coins, with 3 coins on the far left side"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "anWN4rVwov4_1", "video_path": "anWN4rVwov4.mp4", "subtitle_path": "anWN4rVwov4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.04, "view_count": 11482}, {"video_id": "anWN4rVwov4", "question": "One screen shows a bust of a man with curly hair and his right hand placed on his chest. In the bottom left corner, there are subtitles reading 'JULY 6, 1774'. 
Another screen shows a statue of the Goddess of Liberty, with a flag in the background. Another screen shows a full-body statue of a man in front of a building, with the subtitle 'COLLECTING TAXES' in front of the statue. Which of these statues appears first?", "question_wo_referring_query": "Which of these statues appears first?", "candidates": ["A bust of a man", "A full-body statue of a man", "A statue of the Goddess of Liberty", "A bust of a man and a statue of the Goddess of Liberty appearing together"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "anWN4rVwov4_2", "video_path": "anWN4rVwov4.mp4", "subtitle_path": "anWN4rVwov4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.04, "view_count": 11482}, {"video_id": "cygqKMAzNig", "question": "In the scene, there is a long-haired man wearing a hat and dressed in black, with a beard. In the top left corner, there is an image with green, black, yellow, and blue hues. What action does the man take after the subtitle mentions '960 species of birds found here very few'?", "question_wo_referring_query": "What action does the man take?", "candidates": ["Both hands join together.", "Left hand holds the clothing while the right hand lifts the arm, and meanwhile, there's black smoke in the top left corner image.", "Right hand holds the clothing while the left hand lifts the arm.", "Clenched fist."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "cygqKMAzNig_0", "video_path": "cygqKMAzNig.mp4", "subtitle_path": "cygqKMAzNig_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1934.5, "view_count": 461013}, {"video_id": "cygqKMAzNig", "question": "There are two men wearing glasses in the video. 
The man on the left is wearing a black jacket with floral patterns on a short-sleeve shirt, while the man on the right is a black man wearing a plain black short-sleeve shirt. In the top left corner of the video, there is a picture with green, black, yellow, and blue inserts. What did the man do after the subtitle mentions 'sour yeah very good well just like the'?", "question_wo_referring_query": "What did the man in the video do?", "candidates": ["The man on the left held three paper bills, the man on the right pointed to the paper bills, and simultaneously, the picture in the top left corner changed.", "The two men hugged.", "No action.", "The man on the right held three paper bills, the man on the left pointed to the paper bills, and simultaneously, the picture in the top left corner changed."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "cygqKMAzNig_1", "video_path": "cygqKMAzNig.mp4", "subtitle_path": "cygqKMAzNig_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1934.5, "view_count": 461013}, {"video_id": "cygqKMAzNig", "question": "On the right side of the screen, there is a long-haired woman wearing a light green short-sleeved shirt. In the upper left corner of the video, there is a picture, and below it, there are three lines of white text. 
Before the subtitle mentions '1978.tribal or marrying,' what action does the woman take?", "question_wo_referring_query": "What action does this woman take?", "candidates": ["Brings hands together", "Right arm extends, a flame appears on the screen", "Clenches fist", "Raises hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "cygqKMAzNig_2", "video_path": "cygqKMAzNig.mp4", "subtitle_path": "cygqKMAzNig_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1934.5, "view_count": 461013}, {"video_id": "Jv5LDIEJsCQ", "question": "In the video, there is a person wearing a black suit in the center giving a speech. Before the subtitles mention 'and gas and finally he signal he would,' who is the person appearing?", "question_wo_referring_query": "Who is the person appearing?", "candidates": ["A black person wearing a white shirt", "A host wearing a purple coat", "A black person under a tree wearing a skirt", "A black person wearing a white uniform and glasses"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "Jv5LDIEJsCQ_0", "video_path": "Jv5LDIEJsCQ.mp4", "subtitle_path": "Jv5LDIEJsCQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1337.61, "view_count": 947}, {"video_id": "Jv5LDIEJsCQ", "question": "In the video, there is a long-haired woman wearing a brown coat speaking in a broadcast room. 
After the subtitle mentions 'error defaulter in 2020 so let's bring,' the person who appears is?", "question_wo_referring_query": "the person who appears is?", "candidates": ["A black person wearing a black suit", "A black person wearing a black military uniform", "The host wearing a purple coat with a pink inner layer", "A black person wearing a white military uniform and glasses"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "Jv5LDIEJsCQ_1", "video_path": "Jv5LDIEJsCQ.mp4", "subtitle_path": "Jv5LDIEJsCQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1337.61, "view_count": 947}, {"video_id": "Jv5LDIEJsCQ", "question": "There is a long-haired woman wearing a brown coat speaking in a studio. After the subtitle mentions \u201cfour Decades of work I ask him where,\u201d who appears?", "question_wo_referring_query": "Who appears?", "candidates": ["A black man wearing a black suit", "An elderly man with white hair wearing a red long sleeve and earphones", "A black man wearing a white uniform and glasses", "A host wearing a purple coat over a pink outfit"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "Jv5LDIEJsCQ_2", "video_path": "Jv5LDIEJsCQ.mp4", "subtitle_path": "Jv5LDIEJsCQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1337.61, "view_count": 947}, {"video_id": "MqDehUoMk-E", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, an image annotated with blue, red, and yellow markers is shown, then the screen changes to a man in the bottom right corner with lots of code on the screen, and finally, an image with colored blocks connected by arrows is shown.", "First, an image with colored blocks connected by arrows is shown, then the screen changes to an image 
annotated with blue, red, and yellow markers, and finally, a man appears in the bottom right corner with lots of code on the screen.", "First, a man appears in the bottom right corner with lots of code on the screen, then the screen changes to an image with colored blocks connected by arrows, and finally, an image annotated with blue, red, and yellow markers is shown.", "First, a man appears in the bottom right corner with lots of code on the screen, then the screen changes to an image annotated with blue, red, and yellow markers, and finally, an image with colored blocks connected by arrows is shown."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "MqDehUoMk-E_0", "video_path": "MqDehUoMk-E.mp4", "subtitle_path": "MqDehUoMk-E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2393.37, "view_count": 8680}, {"video_id": "MqDehUoMk-E", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which sequence of events is correct?", "candidates": ["First, a picture marked with blue, red, and yellow pens appears. Then the screen changes to a man in the bottom right corner, with a lot of code on the screen. Finally, it ends with images of blocks in different colors connected by arrows.", "First, a man appears in the bottom right corner, with a lot of code on the screen. Then the screen changes to a picture marked with blue, red, and yellow pens, and finally, it ends with images of blocks in different colors connected by arrows.", "First, a man appears in the bottom right corner, with a lot of code on the screen. Then the screen changes to images of blocks in different colors connected by arrows. Finally, it ends with a picture marked with blue, red, and yellow pens.", "First, images of blocks in different colors connected by arrows appear. Then the screen changes to a picture marked with blue, red, and yellow pens. 
Finally, it ends with a man in the bottom right corner, with a lot of code on the screen."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "MqDehUoMk-E_1", "video_path": "MqDehUoMk-E.mp4", "subtitle_path": "MqDehUoMk-E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2393.37, "view_count": 8680}, {"video_id": "MqDehUoMk-E", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the man appears in the top right corner. The screen shows a lot of code, then the screen changes to images marked with white, green, red, and yellow pens. It then goes back to the image of the code and finally ends with images marked with white, red, and yellow pens.", "First, the man appears in the top left corner. The screen shows a lot of code, then the screen changes to images marked with white, green, red, and yellow pens. It then goes back to the image of the code and finally ends with images marked with white, red, and yellow pens.", "First, the man appears in the bottom left corner. The screen shows a lot of code, then the screen changes to images marked with white, green, red, and yellow pens. It then goes back to the image of the code and finally ends with images marked with white, red, and yellow pens.", "First, images marked with white, red, and yellow pens appear. Then, the screen changes to images marked with white, green, red, and yellow pens. 
It then goes back to the image of the code and finally ends with the man appearing in the top right corner with a lot of code on the screen."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "MqDehUoMk-E_2", "video_path": "MqDehUoMk-E.mp4", "subtitle_path": "MqDehUoMk-E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2393.37, "view_count": 8680}, {"video_id": "-rboN6F2g-k", "question": "On the right side of the screen, there's a man wearing a blue short-sleeve shirt and glasses. On the left side, there's a globe with a red square on it. In the video, when the man looks upward, what change occurs to the red square on the left?", "question_wo_referring_query": "What change occurs to the red square on the left?", "candidates": ["A green balloon with an @ symbol appears", "A black balloon with an @ symbol appears", "A yellow balloon with an @ symbol appears", "A green balloon with a # symbol appears"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "-rboN6F2g-k_0", "video_path": "-rboN6F2g-k.mp4", "subtitle_path": "-rboN6F2g-k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.19, "view_count": 2582710}, {"video_id": "-rboN6F2g-k", "question": "The man wearing a blue short-sleeved shirt sitting on the right side of the screen has a yellow chair to his left. 
When a cabinet appears on the right side of the screen, what changes happen to the man on the right?", "question_wo_referring_query": "What changes happen to the man on the right?", "candidates": ["Sits on the floor", "Kneels", "No change", "Moves from the right chair to the left chair"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "-rboN6F2g-k_1", "video_path": "-rboN6F2g-k.mp4", "subtitle_path": "-rboN6F2g-k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.19, "view_count": 2582710}, {"video_id": "-rboN6F2g-k", "question": "There are three small characters on the screen; the one on the left has white hair, the one in the middle is wearing a hat, and the one on the right is blue. When the character in the middle on the screen touches the blue character on the right, what changes happen to the blue character on the right in the video?", "question_wo_referring_query": "What changes happen to the blue character on the right in the video?", "candidates": ["turns into a puff of white smoke", "moves to the left", "gets bigger", "no change"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "-rboN6F2g-k_2", "video_path": "-rboN6F2g-k.mp4", "subtitle_path": "-rboN6F2g-k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.19, "view_count": 2582710}, {"video_id": "kOvfBZ1IcEM", "question": "In the video, a man wearing a dark blue suit is sitting in a broadcast room. When the subtitles mention 'Okay. 
Stephen Engle Bloomberg's chief North,' what changes occur on the screen?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["The man disappears", "A woman appears", "No changes", "The screen changes from one man to two men"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "kOvfBZ1IcEM_0", "video_path": "kOvfBZ1IcEM.mp4", "subtitle_path": "kOvfBZ1IcEM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2828.56, "view_count": 2015}, {"video_id": "kOvfBZ1IcEM", "question": "In the middle of the screen, there is a white-haired man wearing a black coat holding a rectangular object. What happens on the screen when the subtitle mentions 'CEO Jensen Huang's highly anticipated speech'?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["The screen zooms in on the rectangular object in his hand", "The screen zooms in to show only the man", "The screen switches", "No change"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "kOvfBZ1IcEM_1", "video_path": "kOvfBZ1IcEM.mp4", "subtitle_path": "kOvfBZ1IcEM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2828.56, "view_count": 2015}, {"video_id": "kOvfBZ1IcEM", "question": "On the left side of the screen is a man wearing a dark blue suit speaking in a broadcast room, and on the right side is a man wearing a black suit with a red tie. 
What changes occur on the screen when the subtitle mentions 'They've gone they've done it since 2007,the first central bank,the last central'?", "question_wo_referring_query": "What changes occur on the screen?", "candidates": ["The screen changes to show only a man in a black suit with a red tie.", "The screen changes to show only a man in a dark blue suit speaking in a broadcast room.", "No changes", "Screen switches"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "kOvfBZ1IcEM_2", "video_path": "kOvfBZ1IcEM.mp4", "subtitle_path": "kOvfBZ1IcEM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2828.56, "view_count": 2015}, {"video_id": "jltgNGt8Lpg", "question": "In the white background with black text 'Introduction' in the middle, there is a line of yellow-marked black English text 'numerical precision for speed' at the top of the page. What is the green line to the left of this yellow-marked English text doing?", "question_wo_referring_query": "What is the green line to the left of this yellow-marked English text doing?", "candidates": ["Drawing a star on the white background", "Drawing a triangle on the white background", "Drawing a trace on the white background", "Drawing a dot on the white background", "Drawing a circle on the white background"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "jltgNGt8Lpg_0", "video_path": "jltgNGt8Lpg.mp4", "subtitle_path": "jltgNGt8Lpg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1338.79, "view_count": 49639}, {"video_id": "jltgNGt8Lpg", "question": "At the very top of the white page, there is black text that says 'Abstract'. 
In the bottom right corner, against a background with two images labeled 'Residual Network' and 'ODE Network', what is the yellow highlighter doing on the English phrase 'introduce a new family'?", "question_wo_referring_query": "What is it doing?", "candidates": ["Highlighting the word 'Abstract'", "Highlighting the two images", "On the blank space to the right of the highlight", "Continuing to highlight the sentence after 'introduce a new family'", "On the blank space to the left of the highlight"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "jltgNGt8Lpg_1", "video_path": "jltgNGt8Lpg.mp4", "subtitle_path": "jltgNGt8Lpg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1338.79, "view_count": 49639}, {"video_id": "jltgNGt8Lpg", "question": "In the middle of the top of a white page, there is a number \"2\", and at the bottom, a table with the black text \"Algorithm\" is circled by green lines. What happened to this white page?", "question_wo_referring_query": "What happened to this white page?", "candidates": ["The white page scrolled up", "The white page turned black", "The white page scrolled down", "The white page enlarged", "The white page shrunk"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "jltgNGt8Lpg_2", "video_path": "jltgNGt8Lpg.mp4", "subtitle_path": "jltgNGt8Lpg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1338.79, "view_count": 49639}, {"video_id": "LuRtWt5xMiA", "question": "In a room, a man dressed in a white long-sleeved top is sitting at a table, holding his stomach and laughing heartily. 
What other objects are present in this room?", "question_wo_referring_query": "What other objects are present in this room?", "candidates": ["Television", "Power Strip", "Hot Water Bottle", "Computer", "Refrigerator"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "LuRtWt5xMiA_0", "video_path": "LuRtWt5xMiA.mp4", "subtitle_path": "LuRtWt5xMiA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1438.83, "view_count": 3472245}, {"video_id": "LuRtWt5xMiA", "question": "On the white desk, there is a cellphone in the middle with a white background display, and there are two black book clips above the cellphone. What other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["white glasses", "white paper", "black glasses", "tissues", "colored pens"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "LuRtWt5xMiA_1", "video_path": "LuRtWt5xMiA.mp4", "subtitle_path": "LuRtWt5xMiA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1438.83, "view_count": 3472245}, {"video_id": "LuRtWt5xMiA", "question": "In a bright room, a man in a green top holding a camera is shaking hands with an elderly man with white hair wearing a black coat. 
What other objects are present in this room?", "question_wo_referring_query": "What other objects are present in this room?", "candidates": ["gray curtains", "white curtains", "white bed", "white sofa", "black curtains"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "LuRtWt5xMiA_2", "video_path": "LuRtWt5xMiA.mp4", "subtitle_path": "LuRtWt5xMiA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1438.83, "view_count": 3472245}, {"video_id": "fUtBuPSRkLs", "question": "In the blue background, there are many red graphics, among which the white text 'croydon' is circled by a red graphic. What is the shape of this graphic when the subtitle says 'all to see because these yes these these'?", "question_wo_referring_query": "What is the shape?", "candidates": ["staircase shape", "square", "triangle", "rectangle", "circle"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "fUtBuPSRkLs_0", "video_path": "fUtBuPSRkLs.mp4", "subtitle_path": "fUtBuPSRkLs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1550.89, "view_count": 9797}, {"video_id": "fUtBuPSRkLs", "question": "Under the dark sky, there is a tall lighthouse on a green meadow. 
When the subtitle says 'understand the mineralogy of a place,' what color is the lighthouse?", "question_wo_referring_query": "What color is it?", "candidates": ["black", "red", "green", "white", "yellow"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "fUtBuPSRkLs_1", "video_path": "fUtBuPSRkLs.mp4", "subtitle_path": "fUtBuPSRkLs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1550.89, "view_count": 9797}, {"video_id": "fUtBuPSRkLs", "question": "When the waves continuously crash onto a stretch of brown sand on the screen, and the sea waves continuously roll up huge waves, what color is the subtitle when it says 'remain curious and hungry for knowledge'?", "question_wo_referring_query": "What color is it?", "candidates": ["gold", "yellow", "white", "blue", "black"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "fUtBuPSRkLs_2", "video_path": "fUtBuPSRkLs.mp4", "subtitle_path": "fUtBuPSRkLs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1550.89, "view_count": 9797}, {"video_id": "F34eVxzS6Ng", "question": "In the top left corner of the screen, there is a woman wearing a blue top standing in front of a yellow background giving a presentation. On the top right side of the screen, there is a PPT page with blue English text saying 'Graphical model and Sparse Precision Matrix.' 
What is the first black English text to appear?", "question_wo_referring_query": "What is the first black English text to appear?", "candidates": ["local classifiers", "A sample graphical model", "learnt structure from data", "precision matrix of the ground truth", "segmentation"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "F34eVxzS6Ng_0", "video_path": "F34eVxzS6Ng.mp4", "subtitle_path": "F34eVxzS6Ng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1053.43, "view_count": 1739}, {"video_id": "F34eVxzS6Ng", "question": "In the top left corner of the screen, a woman in a blue top is giving a presentation against a yellow background. In the top right side of the screen, there is a PPT slide with the blue English title 'Experiments and Results'. What is the first black English text that appears on the slide?", "question_wo_referring_query": ", what is the first black English text that appears?", "candidates": ["Using Empirical Precision matrix", "Using the sparse partial correlation matrix", "Label Graph for SIFTFLOW data set", "covariance", "background"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "F34eVxzS6Ng_1", "video_path": "F34eVxzS6Ng.mp4", "subtitle_path": "F34eVxzS6Ng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1053.43, "view_count": 1739}, {"video_id": "F34eVxzS6Ng", "question": "In the upper left corner of the screen, there is a woman wearing a blue top standing in front of a yellow background giving a presentation. On the top right side of the screen, there is a PPT slide with blue English text 'Long distance connections'. 
What is the first black English text that appears?", "question_wo_referring_query": "What is the first black English text that appears?", "candidates": ["Classifier output", "Our results", "Image", "Spatial Smoothing", "Ground truth"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "F34eVxzS6Ng_2", "video_path": "F34eVxzS6Ng.mp4", "subtitle_path": "F34eVxzS6Ng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1053.43, "view_count": 1739}, {"video_id": "lHePKNTRmdI", "question": "When a man in a black suit with a black tie first appears standing in front of a brown table, holding a small hammer with both hands, what is he doing?", "question_wo_referring_query": "What is he doing when he first appears?", "candidates": ["Tapping his head with the small hammer", "Throwing the small hammer", "Putting the small hammer down", "Tapping his shoulder with the small hammer", "Tapping the table with the small hammer"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "lHePKNTRmdI_0", "video_path": "lHePKNTRmdI.mp4", "subtitle_path": "lHePKNTRmdI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.24, "view_count": 1560699}, {"video_id": "lHePKNTRmdI", "question": "Standing by the window, what did the man wearing a white t-shirt with a cartoon design featuring three eyes and a mouth do when he first appeared?", "question_wo_referring_query": "What did he do when he first appeared?", "candidates": ["Touched the frame of his glasses with one hand", "Clenched his fists with both hands", "Put one hand in his pocket", "Crossed his arms in front of his chest", "Held his head with both hands"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "lHePKNTRmdI_1", "video_path": "lHePKNTRmdI.mp4", "subtitle_path": "lHePKNTRmdI_en.json", "duration_group": 3600, 
"starting_timestamp_for_subtitles": 0, "duration": 994.24, "view_count": 1560699}, {"video_id": "lHePKNTRmdI", "question": "When the man, who is standing in front of a white wall with graffiti and wearing a gray helmet and a red jacket with the Adidas logo underneath his outerwear, appears for the first time, what does he do?", "question_wo_referring_query": "What does he do?", "candidates": ["Puts both hands on the helmet", "Puts one hand on the helmet", "Clenches both fists", "Puts his hand on his hip", "Covers his mouth with his hand"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "lHePKNTRmdI_2", "video_path": "lHePKNTRmdI.mp4", "subtitle_path": "lHePKNTRmdI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.24, "view_count": 1560699}, {"video_id": "xpNf_M-VQTk", "question": "In a house with green floorboards, what action does a woman, who is wearing a black fur hat and a black feathered outfit, take when the subtitle says 'Better than the polar swim'?", "question_wo_referring_query": "What action does she take?", "candidates": ["Takes off the shoes", "Takes off the black feathered outfit", "Takes off the hat", "Takes off the green coat", "Ties up the hair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "xpNf_M-VQTk_0", "video_path": "xpNf_M-VQTk.mp4", "subtitle_path": "xpNf_M-VQTk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.7, "view_count": 1382378}, {"video_id": "xpNf_M-VQTk", "question": "On the green grass, what action did a person wearing a grey shirt and black pants take when the caption says 'chuckles'?", "question_wo_referring_query": "What action was taken?", "candidates": ["Stood on the rail", "Crawled under the bottom of the rail", "Threw the book bag on the ground", "Flipped over a rail with one foot", "Cut the rail"], "topic_category": 
"LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "xpNf_M-VQTk_1", "video_path": "xpNf_M-VQTk.mp4", "subtitle_path": "xpNf_M-VQTk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.7, "view_count": 1382378}, {"video_id": "xpNf_M-VQTk", "question": "What action did the woman wearing a black hat and earrings perform beside a car when the subtitle said 'I didn't actually jump down from the cliff'?", "question_wo_referring_query": "What action did she perform?", "candidates": ["Faced away from the camera", "Placed her hand on the head of the man with the grey hat", "Made a 'peace' gesture towards the camera", "Kissed the man with the grey hat on the cheek", "Opened the car door"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "xpNf_M-VQTk_2", "video_path": "xpNf_M-VQTk.mp4", "subtitle_path": "xpNf_M-VQTk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.7, "view_count": 1382378}, {"video_id": "KbFziURyfhY", "question": "Sitting in front of a tan-colored desk with colored pencils, what did the person wearing white clothes do after flipping through the calendar in their hand?", "question_wo_referring_query": "What did they do?", "candidates": ["Hung the calendar on the wall", "Hung the picture on the wall", "Put a green plant on the desk", "Stuck a piece of adhesive paper on the back of the calendar", "Put a computer on the desk"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "KbFziURyfhY_0", "video_path": "KbFziURyfhY.mp4", "subtitle_path": "KbFziURyfhY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.63, "view_count": 261454}, {"video_id": "KbFziURyfhY", "question": "On the green floor with floral patterns, what did the person in pink pants do after painting a piece of wood green?", 
"question_wo_referring_query": "What did they do?", "candidates": ["Stood the green wooden board upright", "Cut the board into two pieces", "Drew a picture on the board", "Used a hammer to nail nails", "Drilled a hole in the green wooden board"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "KbFziURyfhY_1", "video_path": "KbFziURyfhY.mp4", "subtitle_path": "KbFziURyfhY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.63, "view_count": 261454}, {"video_id": "KbFziURyfhY", "question": "In the room, after a person wearing a red and black striped shirt placed a computer on the desk, what did they do?", "question_wo_referring_query": "What did they do?", "candidates": ["Stuck a picture on the wall", "Placed two white speakers in front of the computer", "Placed a small keyboard in front of the computer", "Applied glue to a picture", "Flipped the calendar"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "KbFziURyfhY_2", "video_path": "KbFziURyfhY.mp4", "subtitle_path": "KbFziURyfhY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.63, "view_count": 261454}, {"video_id": "Yc-fDlnGbU4", "question": "After a black tank appears at the beginning of the video, which of the following concepts is mentioned first?", "question_wo_referring_query": "After a black tank appears at the beginning of the video, which of the following concepts is mentioned first?", "candidates": ["Contemporary art overview", "National security", "Russian precision strike theory", "Defense budget", "Government revenue"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Yc-fDlnGbU4_0", "video_path": "Yc-fDlnGbU4.mp4", "subtitle_path": "Yc-fDlnGbU4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.87, "view_count": 358630}, 
{"video_id": "Yc-fDlnGbU4", "question": "After a black tank appears at the beginning of the video, which of the following areas is mentioned first?", "question_wo_referring_query": "After a black tank appears at the beginning of the video, which of the following areas is mentioned first?", "candidates": ["Crimea", "Moscow", "Syria", "Arab", "Georgia"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Yc-fDlnGbU4_1", "video_path": "Yc-fDlnGbU4.mp4", "subtitle_path": "Yc-fDlnGbU4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.87, "view_count": 358630}, {"video_id": "Yc-fDlnGbU4", "question": "After the appearance of a black tank at the beginning of the video, which region is mentioned first?", "question_wo_referring_query": "After the appearance of a black tank at the beginning of the video, which region is mentioned first?", "candidates": ["China", "Russia", "Thailand", "Paris", "Azad"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Yc-fDlnGbU4_2", "video_path": "Yc-fDlnGbU4.mp4", "subtitle_path": "Yc-fDlnGbU4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.87, "view_count": 358630}, {"video_id": "5Xm9P7bigo8", "question": "A white arrow is placed on a launch pad on the screen, and after the subtitles read 'efforts to deescalate border tensions 3', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["It starts to rain", "A car appears nearby", "Several people appear next to the arrow", "The arrow is launched into the sky", "A plane appears in the sky"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "5Xm9P7bigo8_0", "video_path": "5Xm9P7bigo8.mp4", "subtitle_path": "5Xm9P7bigo8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.88, 
"view_count": 130275}, {"video_id": "5Xm9P7bigo8", "question": "What happened on the screen before the subtitles mentioned \u2018North Sea to delay in the ban on diesel\u2019 and the oil prices were displayed on the gas pump screen?", "question_wo_referring_query": "What happened?", "candidates": ["A man in a black and white striped short-sleeve shirt was washing a car", "A man in a black and white striped short-sleeve shirt was refueling a car", "A man in a white short-sleeve shirt was refueling a car", "A woman in a black and white striped short-sleeve shirt was refueling a car", "A man in a black short-sleeve shirt was refueling a car"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "5Xm9P7bigo8_1", "video_path": "5Xm9P7bigo8.mp4", "subtitle_path": "5Xm9P7bigo8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.88, "view_count": 130275}, {"video_id": "5Xm9P7bigo8", "question": "In the video, five people stand in a row holding hands. After the subtitle says 'as the bricks block continues to expand,' what action do these five people take?", "question_wo_referring_query": "What action do these five people take?", "candidates": ["They all raise their hands above their heads", "They form a circle", "They walk forward while holding hands", "They let go of each other's hands", "They all jump up together"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "5Xm9P7bigo8_2", "video_path": "5Xm9P7bigo8.mp4", "subtitle_path": "5Xm9P7bigo8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.88, "view_count": 130275}, {"video_id": "V7Q0HVGaD68", "question": "A rectangular white paper appears in the middle of the screen. 
After the subtitle says 'meantime you'll', what appears on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["A black pair of scissors", "A pen", "An eraser", "An orange pair of scissors", "An orange piece of paper"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "V7Q0HVGaD68_0", "video_path": "V7Q0HVGaD68.mp4", "subtitle_path": "V7Q0HVGaD68_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1402.53, "view_count": 6769}, {"video_id": "V7Q0HVGaD68", "question": "A square piece of white paper appeared in the middle of the screen, and after the subtitle said 'covering up this backing,' what appeared on the screen?", "question_wo_referring_query": "What appeared on the screen?", "candidates": ["A square piece of green paper", "A rectangular piece of white paper", "A square piece of red paper", "A circular piece of purple paper", "A square piece of blue paper"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "V7Q0HVGaD68_1", "video_path": "V7Q0HVGaD68.mp4", "subtitle_path": "V7Q0HVGaD68_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1402.53, "view_count": 6769}, {"video_id": "V7Q0HVGaD68", "question": "After the subtitle mentions 'areas you could take your marker', what object appears on screen with various shapes of colorful cut paper pasted on a white sheet?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["A glue stick", "A piece of sticky paper", "A book", "A white sheet of paper", "A pen"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "V7Q0HVGaD68_2", "video_path": "V7Q0HVGaD68.mp4", "subtitle_path": "V7Q0HVGaD68_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1402.53, "view_count": 6769}, {"video_id": "4RAvJt3fWoI", "question": 
"Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a blue-purple circular icon with the letter C appears on the top of a black screen, then a picture of a man wearing black glasses and a white shirt appears on the far left of a white interface, and finally a picture of a man wearing black glasses and a white shirt appears on the far left of a black interface.", "First, a blue-purple circular icon with the letter C appears on the top of a black screen, then a man wearing a black hat is sitting in a room with an electric guitar hanging, and finally a picture of a man wearing black glasses and a white shirt appears on the far left of a black interface.", "First, a blue-purple circular icon with the letter C appears on the top of a black screen, then a picture of a man wearing black glasses and a white shirt appears on the far left of a black interface, and finally a picture of a man wearing black glasses and a white shirt appears on the far left of a white interface.", "First, a man wearing a black hat is sitting in a room with an electric guitar hanging, then a picture of a man wearing black glasses and a white shirt appears on the far left of a black interface, and finally a blue-purple circular icon with the letter C appears on the top of a black screen.", "First, a picture of a man wearing black glasses and a white shirt appears on the far left of a black interface, then a blue-purple circular icon with the letter C appears on the top of a black screen, and finally a picture of a man wearing black glasses and a white shirt appears on the far left of a black interface."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "4RAvJt3fWoI_0", "video_path": "4RAvJt3fWoI.mp4", "subtitle_path": "4RAvJt3fWoI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2864.1, 
"view_count": 12677}, {"video_id": "4RAvJt3fWoI", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a scene with a black background and white text 'Thanks for watching' appears at the bottom, then a purple circular icon with the letter S appears in the middle of a black screen, and finally black text 'create assistant' appears at the top of a white page.", "First, black text 'create assistant' appears at the top of a white page, then a scene with a black background and white text 'Thanks for watching' appears at the bottom, and finally a purple circular icon with the letter S appears in the middle of a black screen.", "First, black text 'create assistant' appears at the top of a white page, then a purple circular icon with the letter S appears in the middle of a black screen, and finally a scene with a black background and white text 'Thanks for watching' appears at the bottom.", "First, a purple circular icon with the letter S appears in the middle of a black screen, then black text 'create assistant' appears at the top of a white page, and finally a scene with a black background and white text 'Thanks for watching' appears at the bottom.", "First, a purple circular icon with the letter S appears in the middle of a black screen, then a scene with a black background and white text 'Thanks for watching' appears at the bottom, and finally black text 'create assistant' appears at the top of a white page."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "4RAvJt3fWoI_1", "video_path": "4RAvJt3fWoI.mp4", "subtitle_path": "4RAvJt3fWoI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2864.1, "view_count": 12677}, {"video_id": "4RAvJt3fWoI", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which 
of the following sequences of scenes is correct?", "candidates": ["First, a white box with a red rectangular option in the bottom right corner pops up in the middle of the screen, then a white page with the black text 'Playground' in the top left corner appears on the screen, and finally a white interface with the black text 'Fine-tuning' at the top appears.", "First, a white interface with the black text 'Fine-tuning' at the top appears, then a white page with the black text 'Playground' in the top left corner appears on the screen, and finally a white box with a red rectangular option in the bottom right corner pops up in the middle of the screen.", "First, a white page with the black text 'Playground' in the top left corner appears on the screen, then a white interface with the black text 'Fine-tuning' at the top appears, and finally a white box with a red rectangular option in the bottom right corner pops up in the middle of the screen.", "First, a white page with the black text 'Playground' in the top left corner appears on the screen, then a white box with a red rectangular option in the bottom right corner pops up in the middle of the screen, and finally a white interface with the black text 'Fine-tuning' at the top appears.", "First, a white box with a red rectangular option in the bottom right corner pops up in the middle of the screen, then a white interface with the black text 'Fine-tuning' at the top appears, and finally a white page with the black text 'Playground' in the top left corner appears on the screen."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "4RAvJt3fWoI_2", "video_path": "4RAvJt3fWoI.mp4", "subtitle_path": "4RAvJt3fWoI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2864.1, "view_count": 12677}, {"video_id": "ur6aK1XH_kk", "question": "In which of the following scenes does the woman wearing black glasses and a white top from the bottom 
of the screen at the beginning also appear?", "question_wo_referring_query": "In which of the following scenes does she also appear?", "candidates": ["Bottom right corner of a scene with blue ocean waves crashing against the shore", "Bottom left corner of a scene with golden ocean waves crashing against the shore", "Top right corner of a scene with blue ocean waves crashing against the shore", "Bottom left corner of a scene with blue ocean waves crashing against the shore", "Top left corner of a scene with blue ocean waves crashing against the shore"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "ur6aK1XH_kk_0", "video_path": "ur6aK1XH_kk.mp4", "subtitle_path": "ur6aK1XH_kk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1464.28, "view_count": 21018}, {"video_id": "ur6aK1XH_kk", "question": "In the opening shot, where else does the woman wearing glasses and a white top in the lower part of the screen appear?", "question_wo_referring_query": "Where else does she appear?", "candidates": ["Top left corner of the screen with an image of a fox walking on white snow", "Bottom left corner of the screen with an image of a fox walking on white snow", "Top right corner of the screen with an image of a fox walking on white snow", "Bottom right corner of the screen with an image of a fox walking on green grass", "Bottom right corner of the screen with an image of a fox walking on white snow"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "ur6aK1XH_kk_1", "video_path": "ur6aK1XH_kk.mp4", "subtitle_path": "ur6aK1XH_kk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1464.28, "view_count": 21018}, {"video_id": "ur6aK1XH_kk", "question": "In the top left corner of the screen at the beginning, a short-haired woman wearing floral clothing is sitting in a room with wall paintings. 
Where else does she appear?", "question_wo_referring_query": "Where else does she appear?", "candidates": ["Bottom left corner of a picture with a man standing in the blue sea with his shoulders exposed", "Bottom right corner of a picture with a man standing in the blue sea with his shoulders exposed", "Top right corner of a picture with a man standing on grass with his shoulders exposed", "Top left corner of a picture with a man standing in the blue sea with his shoulders exposed", "Top right corner of a picture with a man standing in the blue sea with his shoulders exposed"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "ur6aK1XH_kk_2", "video_path": "ur6aK1XH_kk.mp4", "subtitle_path": "ur6aK1XH_kk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1464.28, "view_count": 21018}, {"video_id": "RRc09z_D8ds", "question": "There is a black volcano in the screen, spewing fiery red lava and billowing with rolling black smoke. The black smoke rises straight into the sky, where it spreads out into a gray haze. 
In which subtitles did this volcano appear together?", "question_wo_referring_query": "In which subtitles did this volcano appear together?", "candidates": ["nature but beyond its Scenic Beauty and", "and wide but back to the big one the", "wildlife and nature enthusiasts it's a", "of death and", "the chamber contain its Rage with an"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "RRc09z_D8ds_0", "video_path": "RRc09z_D8ds.mp4", "subtitle_path": "RRc09z_D8ds_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1030.1, "view_count": 14145}, {"video_id": "RRc09z_D8ds", "question": "In the fiery red sky, which subtitles appeared alongside the gradually rising sun hidden behind the black clouds?", "question_wo_referring_query": "Which subtitles have appeared alongside?", "candidates": ["desolation this super volcanic eruption", "Mass death and starvation occurs and you", "of death and", "the land so as you carve your way down", "alongside the baffle that fueled it"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "RRc09z_D8ds_1", "video_path": "RRc09z_D8ds.mp4", "subtitle_path": "RRc09z_D8ds_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1030.1, "view_count": 14145}, {"video_id": "RRc09z_D8ds", "question": "Which subtitles have appeared together with a golden circular object made of gold with an engraved design on it in a black background?", "question_wo_referring_query": "Which subtitles have appeared together?", "candidates": ["nature but beyond its Scenic Beauty and", "played below for in the dance of ice and", "the land so as you carve your way down", "Rich Empire of the time making its"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "RRc09z_D8ds_2", "video_path": "RRc09z_D8ds.mp4", "subtitle_path": "RRc09z_D8ds_en.json", 
"duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1030.1, "view_count": 14145}, {"video_id": "lNX-vl_xAFc", "question": "In a room with olive-green floorboards, a man wearing a black hooded jacket and blue trousers sitting on a white chair reading a book mentions in the subtitles \"tongue twister which is fiction\". What style of clothing is he wearing?", "question_wo_referring_query": "What style of clothing is he wearing?", "candidates": ["Shirt", "T-shirt", "Sweater", "Overcoat", "Suit"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "lNX-vl_xAFc_0", "video_path": "lNX-vl_xAFc.mp4", "subtitle_path": "lNX-vl_xAFc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 949.28, "view_count": 77732}, {"video_id": "lNX-vl_xAFc", "question": "In a room with a square window, what style of clothes is worn by a man lying on a white bed reading a book, dressed in a black robe and blue pants, when the subtitles say 'than merely a distraction'?", "question_wo_referring_query": "What style of clothing is worn?", "candidates": ["linen garment", "feather coat", "police uniform", "firefighter uniform", "T-shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "lNX-vl_xAFc_1", "video_path": "lNX-vl_xAFc.mp4", "subtitle_path": "lNX-vl_xAFc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 949.28, "view_count": 77732}, {"video_id": "lNX-vl_xAFc", "question": "Sitting in a room with a white desk lamp and a globe, what style of clothing is the man wearing when he says in the caption 'that with my experience', who is dressed in a black floral T-shirt and wearing a watch?", "question_wo_referring_query": "What style of clothing is he wearing?", "candidates": ["Down jacket", "Long-sleeve shirt", "Swimwear", "Tank top", "Suit"], "topic_category": "LV-Lifestyle-Life-Vlogs", 
"question_category": "TAA", "level": "L2-Relation", "id": "lNX-vl_xAFc_2", "video_path": "lNX-vl_xAFc.mp4", "subtitle_path": "lNX-vl_xAFc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 949.28, "view_count": 77732}, {"video_id": "KPM0NZIeVl4", "question": "A person wearing black pants and black leather shoes is holding a yellow broom and a gray dustpan. What is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": ["Mopping the floor", "Dropping the dustpan", "Sweeping the floor", "Putting the broom on the rack", "Putting the broom on the floor"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "KPM0NZIeVl4_0", "video_path": "KPM0NZIeVl4.mp4", "subtitle_path": "KPM0NZIeVl4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 964.5, "view_count": 102061}, {"video_id": "KPM0NZIeVl4", "question": "On the screen, the far right shows a round wooden steamer basket with delicious food, while on the far left, a woman with black hair wearing blue clothes, gold earrings, and a gold necklace is seen holding a pair of black chopsticks inside a mirror. What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Tapping her head", "Touching her earrings", "Spitting food", "Chewing food", "Smoking"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "KPM0NZIeVl4_1", "video_path": "KPM0NZIeVl4.mp4", "subtitle_path": "KPM0NZIeVl4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 964.5, "view_count": 102061}, {"video_id": "KPM0NZIeVl4", "question": "In the kitchen, on the far right side of a rectangular metal table that holds four red spice jars and an iron pot with food, there is a person in black clothing putting a metal spoon into a pot. 
What is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": ["Adding water", "Washing the pot", "Pouring the food into the iron pot on the table", "Cutting vegetables", "Stir-frying"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "KPM0NZIeVl4_2", "video_path": "KPM0NZIeVl4.mp4", "subtitle_path": "KPM0NZIeVl4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 964.5, "view_count": 102061}, {"video_id": "j6tk5iegRME", "question": "In a hair salon, a woman wearing a black knit cap is holding a curling iron and curling the hair of another woman who is sitting and taking photos with a green-cased phone. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["mineral water", "mirror", "oil painting", "doll", "flowerpot"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "j6tk5iegRME_0", "video_path": "j6tk5iegRME.mp4", "subtitle_path": "j6tk5iegRME_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.65, "view_count": 40361}, {"video_id": "j6tk5iegRME", "question": "In front of a window, on an orange desk, there is a white square table lamp. A woman wearing a gray robe is holding a cloth and wiping the desk. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["television", "fan", "oil painting", "black plug", "green plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "j6tk5iegRME_1", "video_path": "j6tk5iegRME.mp4", "subtitle_path": "j6tk5iegRME_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.65, "view_count": 40361}, {"video_id": "j6tk5iegRME", "question": "In a room, a woman with tied hair, wearing a gray robe, is arranging a white quilt on the bed. What objects are present in this room?", "question_wo_referring_query": "What objects are present in this room?", "candidates": ["blue water cup", "air conditioner", "fan", "slippers", "books"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "j6tk5iegRME_2", "video_path": "j6tk5iegRME.mp4", "subtitle_path": "j6tk5iegRME_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.65, "view_count": 40361}, {"video_id": "wnElMGO5Eh4", "question": "Inside a room with a white nightstand, a woman with black hair, wearing a black floral dress and a white short-sleeved jacket, is facing away from the mirror, tidying up the white sheets on the bed. 
When the subtitle mentions 'session', what object is present in the room?", "question_wo_referring_query": "What object is present in the room?", "candidates": ["green plant", "television", "flower pot", "black rectangular mirror", "computer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "wnElMGO5Eh4_0", "video_path": "wnElMGO5Eh4.mp4", "subtitle_path": "wnElMGO5Eh4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 978.77, "view_count": 81568}, {"video_id": "wnElMGO5Eh4", "question": "In a room with windows, a woman with black hair, wearing a black floral skirt and a white jacket, is holding several transparent bags containing clothes. What objects are present in the room when the subtitle says 'because like when I travel storing like'?", "question_wo_referring_query": "What objects are present in the room?", "candidates": ["a black door", "a suitcase", "a white door", "an olive-colored door", "a doorknob"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "wnElMGO5Eh4_1", "video_path": "wnElMGO5Eh4.mp4", "subtitle_path": "wnElMGO5Eh4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 978.77, "view_count": 81568}, {"video_id": "wnElMGO5Eh4", "question": "In a room with cluttered clothes, a woman wearing a white top is organizing a black suitcase on a brown wooden floor. 
What item is present in the room when the subtitle says 'now'?", "question_wo_referring_query": "What item is present in the room?", "candidates": ["computer", "red pillow", "washing machine", "sofa", "chair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "wnElMGO5Eh4_2", "video_path": "wnElMGO5Eh4.mp4", "subtitle_path": "wnElMGO5Eh4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 978.77, "view_count": 81568}, {"video_id": "U__Us5KPRrA", "question": "On the far right of the screen, there is a green plant. Next to the green plant is a man with a thick beard, wearing blue pants and a black coat. What material is this man's coat made of?", "question_wo_referring_query": "What material is this man's coat made of?", "candidates": ["Silk coat", "Leather coat", "Wool coat", "Rabbit fur coat", "Cotton coat"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "U__Us5KPRrA_0", "video_path": "U__Us5KPRrA.mp4", "subtitle_path": "U__Us5KPRrA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1352.08, "view_count": 20689}, {"video_id": "U__Us5KPRrA", "question": "In a dimly lit room with a square carpet on the floor, four people are sitting and chatting. The man on the far left, wearing black pants, is sitting with his legs crossed. 
What color is the jacket worn by the man on the far right, who is sitting on a gray sofa and wearing white shoes?", "question_wo_referring_query": "What color is the jacket worn by the man on the far right, who is sitting on a gray sofa and wearing white shoes?", "candidates": ["olive", "blue", "yellow", "white", "black"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "U__Us5KPRrA_1", "video_path": "U__Us5KPRrA.mp4", "subtitle_path": "U__Us5KPRrA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1352.08, "view_count": 20689}, {"video_id": "U__Us5KPRrA", "question": "In the video, there are two people sitting cross-legged and talking. The man on the far left is wearing a short-sleeved shirt and black pants. What kind of shoes is the man on the far right, who is wearing a black leather jacket and blue pants, wearing?", "question_wo_referring_query": "What kind of shoes is the man on the far right, who is wearing a black leather jacket and blue pants, wearing?", "candidates": ["Skate shoes", "Sneakers", "Canvas shoes", "Sandals", "Leather shoes"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "U__Us5KPRrA_2", "video_path": "U__Us5KPRrA.mp4", "subtitle_path": "U__Us5KPRrA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1352.08, "view_count": 20689}, {"video_id": "xtkq4jRVkgM", "question": "Outside on a clear sunny day, distant mountain peaks are covered in white snow. Nearby, there is a house with white walls. A flag flutters in the wind atop a flagpole outside the house. 
In this scene, who is raising their left hand and pointing into the distance?", "question_wo_referring_query": ", in this scene, who is raising their left hand and pointing into the distance?", "candidates": ["A man with short hair wearing a black coat", "A woman with short hair wearing a black coat", "A man with short hair wearing a black short-sleeved shirt", "A woman with short hair wearing a gray coat", "A man with short hair wearing a gray coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "xtkq4jRVkgM_0", "video_path": "xtkq4jRVkgM.mp4", "subtitle_path": "xtkq4jRVkgM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 995.06, "view_count": 98905}, {"video_id": "xtkq4jRVkgM", "question": "In the snowy field in front of the snow-capped mountain, there are three people: one person wearing white clothes, one person with a headscarf, and one person wearing an olive green coat. Who is the person in this scene facing the camera and extending both hands making a 'V' sign?", "question_wo_referring_query": ", who is the person in this scene facing the camera and extending both hands making a 'V' sign?", "candidates": ["The person with a headscarf", "The person wearing a white coat", "The person wearing a white coat and blue pants", "The person wearing an olive green coat and a headscarf", "The person wearing an olive green coat and blue pants"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "xtkq4jRVkgM_1", "video_path": "xtkq4jRVkgM.mp4", "subtitle_path": "xtkq4jRVkgM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 995.06, "view_count": 98905}, {"video_id": "xtkq4jRVkgM", "question": "There is a black motorcycle parked on the shore of the blue lake. Its license plate number is 'BE6258'. 
Who went to the motorcycle and sat by the shore?", "question_wo_referring_query": "Who went to the motorcycle and sat by the shore?", "candidates": ["The person holding a black bag and wearing blue pants", "The person wearing white clothes and leaning against a railing", "The person wearing black clothes and black pants by the shore", "The person wearing blue clothes and black pants by the shore", "The person wearing black clothes and hugging their arms"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "xtkq4jRVkgM_2", "video_path": "xtkq4jRVkgM.mp4", "subtitle_path": "xtkq4jRVkgM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 995.06, "view_count": 98905}, {"video_id": "zETNKkECocg", "question": "In a room, there is a group of people. A woman in a red cloak picks up a bowl. Beside her, a man in blue clothing is holding his head with his right hand. In front, a man in black clothing has his left hand in his pocket. When the short-haired person in a pink coat appears in the room for the first time, what is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He is lying on the floor", "He is holding a wine glass", "He is kneeling on the floor", "He hits the man in black clothing with a fist", "He jumps up from the floor"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "zETNKkECocg_0", "video_path": "zETNKkECocg.mp4", "subtitle_path": "zETNKkECocg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1523.24, "view_count": 404685}, {"video_id": "zETNKkECocg", "question": "In a background image of a crowd riding horses on a main street, there is a man in the foreground. He is wearing a pink coat with olive fur and has a gold ribbon tied around his neck. At the bottom of the screen, there are the white letters 'COLONEL PRINCE SERGEI TRUBETSKOY'. 
When this man appears for the first time, what does he do?", "question_wo_referring_query": "When this man appears for the first time, what does he do?", "candidates": ["He mounts a horse", "He takes off his coat", "He sits on the ground", "He raises his right hand", "He opens his eyes"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "zETNKkECocg_1", "video_path": "zETNKkECocg.mp4", "subtitle_path": "zETNKkECocg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1523.24, "view_count": 404685}, {"video_id": "zETNKkECocg", "question": "Beside a tree, there is a white horse and a man wearing a wide hat, a black shirt, and white pants. What is this man doing when he first appears on screen?", "question_wo_referring_query": "What is this man doing when he first appears on screen?", "candidates": ["He is shooting at the horse", "He takes out a handgun", "He falls to the ground", "He drops his hat", "He is riding the horse"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "zETNKkECocg_2", "video_path": "zETNKkECocg.mp4", "subtitle_path": "zETNKkECocg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1523.24, "view_count": 404685}, {"video_id": "5_sCTe61wdY", "question": "In a room, there is a display. In front of the display sits a man wearing glasses and a checkered shirt. The light bulb above the man's head emits white light. 
When the subtitle 'here an artisan has been making dolls' appears, what is the man doing?", "question_wo_referring_query": "What is the man doing?", "candidates": ["He is wiping the table", "He is arranging flowers", "He is writing", "He is drawing", "He is making dolls"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "5_sCTe61wdY_0", "video_path": "5_sCTe61wdY.mp4", "subtitle_path": "5_sCTe61wdY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 12347}, {"video_id": "5_sCTe61wdY", "question": "A man and a woman are kneeling in a room, the woman on the right is dressing up a doll, and the man with glasses in front of him has two doll heads inserted. When the subtitle 'his wife redo assists in repairing' appears, what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is carving a doll", "He picks up the doll", "He moves the paper placed on the paper box with his left hand", "He assists the woman in dressing up the doll", "He stands up"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "5_sCTe61wdY_1", "video_path": "5_sCTe61wdY.mp4", "subtitle_path": "5_sCTe61wdY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 12347}, {"video_id": "5_sCTe61wdY", "question": "On a white table in a room, there are two dolls. The doll on the left is wearing a red dress, while the doll on the right is facing a mirror. In the screen, there's also a man wearing glasses and leaning on the table, and a woman wearing a gray hat. 
When the subtitle mentions '[Music]', what does the woman do?", "question_wo_referring_query": "What does the woman do when the subtitle mentions '[Music]'?", "candidates": ["She damages the doll.", "She takes off her hat.", "She picks up the doll on the right side of the screen.", "She picks up the doll wearing red clothes.", "She helps the doll take off its clothes."], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "5_sCTe61wdY_2", "video_path": "5_sCTe61wdY.mp4", "subtitle_path": "5_sCTe61wdY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.0, "view_count": 12347}, {"video_id": "lEhMz1Hyzw8", "question": "In the screen, there is a photo with the year 1922 written below it. In the photo, a man wearing a hat and dressed in black stands in front of a row of people. He has his left hand on his hip and is raising his right fist high in the air. What action does this man take next?", "question_wo_referring_query": "What action does this man take next?", "candidates": ["He places both hands on the railing of the balcony", "He shakes hands with someone", "He crosses his right hand over his waist", "He sits down on a chair", "He takes off his hat"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "lEhMz1Hyzw8_0", "video_path": "lEhMz1Hyzw8.mp4", "subtitle_path": "lEhMz1Hyzw8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1067.7, "view_count": 4015}, {"video_id": "lEhMz1Hyzw8", "question": "In the video, a man wearing a grey coat and a white shirt with a tie raises his right hand emotionally. There is a white 'Adolf Hitler' label at the bottom left of the screen. 
After this scene, which historical figure's name is mentioned first in the video?", "question_wo_referring_query": "After this scene, which historical figure's name is mentioned first in the video?", "candidates": ["Chiang Kai-shek", "Stalin", "Roosevelt", "Benito Mussolini", "Adolf Hitler"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "lEhMz1Hyzw8_1", "video_path": "lEhMz1Hyzw8.mp4", "subtitle_path": "lEhMz1Hyzw8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1067.7, "view_count": 4015}, {"video_id": "lEhMz1Hyzw8", "question": "There is a picture on the screen with four people planting the American flag into the ground. What were people doing in the video before this?", "question_wo_referring_query": "What were people doing in the video before this?", "candidates": ["A group of people digging a trench", "A man standing on a platform giving a speech", "A group of people standing next to the Japanese flag cheering", "A pilot crashing an aircraft into a building", "A person repairing a tank"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "lEhMz1Hyzw8_2", "video_path": "lEhMz1Hyzw8.mp4", "subtitle_path": "lEhMz1Hyzw8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1067.7, "view_count": 4015}, {"video_id": "IiBFqnNu7A8", "question": "The screen shows a blank white background with nothing but a gray line on the left side. 
After this screen, what appears first?", "question_wo_referring_query": "What appears first?", "candidates": ["A square", "Four black lines", "A text-filled document", "A circle", "A black line"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "IiBFqnNu7A8_0", "video_path": "IiBFqnNu7A8.mp4", "subtitle_path": "IiBFqnNu7A8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2121.3, "view_count": 6230}, {"video_id": "IiBFqnNu7A8", "question": "In a document, there are two model Earth diagrams. The text under the diagram on the left is highlighted in yellow and selected with a red line box. What is the first type of chart that appears after this screen?", "question_wo_referring_query": "What is the first type of chart that appears after this screen?", "candidates": ["Flowchart", "Pie Chart", "Mind Mapping", "Bar Chart", "Line Chart"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "IiBFqnNu7A8_1", "video_path": "IiBFqnNu7A8.mp4", "subtitle_path": "IiBFqnNu7A8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2121.3, "view_count": 6230}, {"video_id": "IiBFqnNu7A8", "question": "In the document shown on the screen, there is a segment of text highlighted in yellow, within which there are two letters 'Th' that are not highlighted. 
What is the first concept that appears in the green box after this segment?", "question_wo_referring_query": "What is the first concept that appears in the green box after this segment?", "candidates": ["NPL", "Model Learning", "latent disagreement", "convolutional neural", "Planning in Latent Space"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "IiBFqnNu7A8_2", "video_path": "IiBFqnNu7A8.mp4", "subtitle_path": "IiBFqnNu7A8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2121.3, "view_count": 6230}, {"video_id": "j7MiI9pZaxk", "question": "In a room with walls covered in photos, there are two men: one wearing a white T-shirt and the other wearing a black T-shirt. Between them, there is a photo showing a city nestled at the foot of a mountain, with the city's lights reflecting in the lake water. After the subtitle 'It's kind of like a travel back in time like that people's memory of the cities' appears, what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["German flag", "Jordanian flag", "Lebanese flag", "Amanian flag", "Brazilian flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "j7MiI9pZaxk_0", "video_path": "j7MiI9pZaxk.mp4", "subtitle_path": "j7MiI9pZaxk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.19, "view_count": 4099579}, {"video_id": "j7MiI9pZaxk", "question": "The walls of the room are covered with pictures. Two men are standing inside the room. The man in the white T-shirt has his arms crossed, while the person in black raises his right hand. After the subtitle 'It's just like they love art when you mentioned Tunisia what comes to mind again as an Arab?' 
appears, what object shows up in the scene?", "question_wo_referring_query": "What object appears in the scene?", "candidates": ["A man wearing a black vest", "A model of the Earth", "The flag of Kenya", "The flag of Algeria", "The flag of Somalia"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "j7MiI9pZaxk_1", "video_path": "j7MiI9pZaxk.mp4", "subtitle_path": "j7MiI9pZaxk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.19, "view_count": 4099579}, {"video_id": "j7MiI9pZaxk", "question": "In a room with walls covered in photos, there are two men. One man, wearing a white short-sleeve shirt, is placing his right hand on his chin, while the other man, dressed in black, has his arms crossed in front of his chest. After the caption mentions 'They are definitely known for Socotra island, which is all the plants are only unique to that one Island,' what is the first object that appears on the screen?", "question_wo_referring_query": "What is the first object that appears on the screen?", "candidates": ["Jordanian flag", "Dragon Blood Tree", "A man wearing a black vest", "Kenyan flag", "Globe model"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "j7MiI9pZaxk_2", "video_path": "j7MiI9pZaxk.mp4", "subtitle_path": "j7MiI9pZaxk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1019.19, "view_count": 4099579}, {"video_id": "zuS_GQfe7J0", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a red liquid is poured from a measuring cup into a transparent bowl. Next, a woman with braided hair is standing in front of the kitchen counter watching the microwave emitting white smoke. 
Finally, a pair of hands wearing a ring is peeling a boiled egg on a white surface.", "First, a woman with braided hair is standing in front of the kitchen counter watching the microwave emitting white smoke. Next, a red liquid is poured from a measuring cup into a transparent bowl. Finally, a pair of hands wearing a ring is peeling a boiled egg on a white surface.", "First, a red liquid is poured from a measuring cup into a transparent bowl. Next, a pair of hands wearing a ring is peeling a boiled egg on a white surface. Finally, a woman with braided hair is standing in front of the kitchen counter watching the microwave emitting white smoke.", "First, a pair of hands wearing a ring is peeling a boiled egg on a white surface. Next, a woman with braided hair is standing in front of the kitchen counter watching the microwave emitting white smoke. Finally, a red liquid is poured from a measuring cup into a transparent bowl.", "First, a woman with braided hair is standing in front of the kitchen counter watching the microwave emitting white smoke. Next, a pair of hands wearing a ring is peeling a boiled egg on a white surface. 
Finally, a red liquid is poured from a measuring cup into a transparent bowl."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "zuS_GQfe7J0_0", "video_path": "zuS_GQfe7J0.mp4", "subtitle_path": "zuS_GQfe7J0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.99, "view_count": 6154128}, {"video_id": "zuS_GQfe7J0", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a pair of hands wearing rings peels a lemon using a peeler on a white surface, then a woman wearing glasses and a black striped coat closes the microwave door, and finally, an egg beater is used to whisk egg yolks.", "First, an egg beater is used to whisk egg yolks, then a woman wearing glasses and a black striped coat closes the microwave door, and finally, a pair of hands wearing rings peels a lemon using a peeler on a white surface.", "First, a pair of hands wearing rings peels a lemon using a peeler on a white surface, then an egg beater is used to whisk egg yolks, and finally, a woman wearing glasses and a black striped coat closes the microwave door.", "First, a woman wearing glasses and a black striped coat closes the microwave door, then a pair of hands wearing rings peels a lemon using a peeler on a white surface, and finally, an egg beater is used to whisk egg yolks.", "First, an egg beater is used to whisk egg yolks, then a pair of hands wearing rings peels a lemon using a peeler on a white surface, and finally, a woman wearing glasses and a black striped coat closes the microwave door."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "zuS_GQfe7J0_1", "video_path": "zuS_GQfe7J0.mp4", "subtitle_path": "zuS_GQfe7J0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.99, 
"view_count": 6154128}, {"video_id": "zuS_GQfe7J0", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is accurate?", "candidates": ["First, a woman in a dark green outfit is seated with feet and hands pressed against a cabinet near a wall in the kitchen. Next, the kitchen counter with a white surface has a knife with a green handle on it, and the woman in a dark green outfit stands by the counter, using her hand to put food into her mouth. Finally, there is a green cutting board on a white surface, with someone slicing yellow food material on the board.", "First, the kitchen counter with a white surface has a knife with a green handle on it, and the woman in a dark green outfit stands by the counter, using her hand to put food into her mouth. Next, there is a green cutting board on a white surface, with someone slicing yellow food material on the board. Finally, a woman in a dark green outfit is seated with feet and hands pressed against a cabinet near a wall in the kitchen.", "First, a woman in a dark green outfit is seated with feet and hands pressed against a cabinet near a wall in the kitchen. Next, there is a green cutting board on a white surface, with someone slicing yellow food material on the board. Finally, the kitchen counter with a white surface has a knife with a green handle on it, and the woman in the dark green outfit stands by the counter, using her hand to put food into her mouth.", "First, there is a green cutting board on a white surface, with someone slicing yellow food material on the board. Next, a woman in a dark green outfit is seated with feet and hands pressed against a cabinet near a wall in the kitchen. 
Finally, the kitchen counter with a white surface has a knife with a green handle on it, and the woman in the dark green outfit stands by the counter, using her hand to put food into her mouth.", "First, there is a green cutting board on a white surface, with someone slicing yellow food material on the board. Next, the kitchen counter with a white surface has a knife with a green handle on it, and the woman in a dark green outfit stands by the counter, using her hand to put food into her mouth. Finally, a woman in a dark green outfit is seated with feet and hands pressed against a cabinet near a wall in the kitchen."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "zuS_GQfe7J0_2", "video_path": "zuS_GQfe7J0.mp4", "subtitle_path": "zuS_GQfe7J0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.99, "view_count": 6154128}, {"video_id": "dHWhntWbx5w", "question": "A man is standing in front of a blue door. He is wearing a black coat, has a string of ornaments hanging around his neck, and is wearing a baseball cap backwards. Where have we seen this man before?", "question_wo_referring_query": "Where have we seen this man before?", "candidates": ["On a motorboat", "At a swimming pool", "In a banana grove", "Next to a man wearing a security uniform", "Next to a woman wearing a security uniform"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "dHWhntWbx5w_0", "video_path": "dHWhntWbx5w.mp4", "subtitle_path": "dHWhntWbx5w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 917.69, "view_count": 143545}, {"video_id": "dHWhntWbx5w", "question": "There are two men on the screen. The one in front is wearing a hat and a gold necklace. The man behind him has his eyes closed, is wearing an olive-green jacket, and has a black backpack on his shoulders. 
In which scenes does the man in the olive-green clothing appear?", "question_wo_referring_query": "In which scenes does the man in the olive-green clothing appear?", "candidates": ["At a cockfighting ring", "In a banana grove", "In a motorcycle parking lot", "On a yacht", "Next to a person wearing blue short sleeves and carrying a backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "dHWhntWbx5w_1", "video_path": "dHWhntWbx5w.mp4", "subtitle_path": "dHWhntWbx5w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 917.69, "view_count": 143545}, {"video_id": "dHWhntWbx5w", "question": "A group of people are playing in the sea, near the shadow of a plant on the beach. A white drone is taking off. In which scenarios has this drone appeared before?", "question_wo_referring_query": "A group of people are playing in the sea, near the shadow of a plant on the beach. A white drone is taking off. In which scenarios has this drone appeared before?", "candidates": ["In the hand of a man wearing a brown garment", "On a red scooter", "At the bottom of the sea", "In a gray box", "Next to a man in a blue shirt holding a rooster"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "dHWhntWbx5w_2", "video_path": "dHWhntWbx5w.mp4", "subtitle_path": "dHWhntWbx5w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 917.69, "view_count": 143545}, {"video_id": "7qBBRCUJC2A", "question": "Someone is diving underwater, a giant fish with blue skin and yellow fins swims across the screen. 
Which subtitles appear simultaneously with this fish?", "question_wo_referring_query": ", which subtitles appear simultaneously with this fish?", "candidates": ["most beautiful creatures I've ever seen", "I've ever seen", "your discretion it's such a humbling", "some very loose gravel some huge huge", "because everywhere you go is serious dud"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "7qBBRCUJC2A_0", "video_path": "7qBBRCUJC2A.mp4", "subtitle_path": "7qBBRCUJC2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.16, "view_count": 2243690}, {"video_id": "7qBBRCUJC2A", "question": "A boat with the word 'victoria' etched on it is floating on the water's surface. Three people are standing on the boat; the woman in the middle is wearing a green long-sleeved outfit. In which subtitles does this woman appear?", "question_wo_referring_query": "In which subtitles does this woman appear?", "candidates": ["the four island Tour is incredible", "up going make sure you do the four", "this place", "because everywhere you go is serious dud", "there's a couple sets of waterfalls"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "7qBBRCUJC2A_1", "video_path": "7qBBRCUJC2A.mp4", "subtitle_path": "7qBBRCUJC2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.16, "view_count": 2243690}, {"video_id": "7qBBRCUJC2A", "question": "A man wearing a white short-sleeve shirt is riding a motorcycle on a dirt road, with a shimmering lake and green mountain peaks in the distance. 
Where has this man appeared together with which subtitles?", "question_wo_referring_query": "Where has this man appeared together with which subtitles?", "candidates": ["this place before but there's a great", "I liked Anda", "most beautiful creatures I've ever seen", "outgrown the infrastructure at least 10", "some very loose gravel some huge huge"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "7qBBRCUJC2A_2", "video_path": "7qBBRCUJC2A.mp4", "subtitle_path": "7qBBRCUJC2A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.16, "view_count": 2243690}, {"video_id": "nIj5N0Tkqz0", "question": "A man is speaking with a microphone. He has black hair and is wearing a black suit and a white shirt. When the subtitle mentions 'proposed discontinuing the railway,' what items can be seen in the scene?", "question_wo_referring_query": "What items can be seen in the scene?", "candidates": ["watch", "water cup", "lunch box", "glasses", "earphones"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "nIj5N0Tkqz0_0", "video_path": "nIj5N0Tkqz0.mp4", "subtitle_path": "nIj5N0Tkqz0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.0, "view_count": 104228}, {"video_id": "nIj5N0Tkqz0", "question": "Inside the room's display case, various train models are arranged. A woman with long auburn hair, her lips tightly closed and her hands slightly raised, is sitting in front of the display case. 
When the subtitle mentions 'right so i think some of the locals will', what object is present in this scene?", "question_wo_referring_query": "What object is present in this scene?", "candidates": ["car", "glasses", "drone", "ping pong ball", "microphone"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "nIj5N0Tkqz0_1", "video_path": "nIj5N0Tkqz0.mp4", "subtitle_path": "nIj5N0Tkqz0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.0, "view_count": 104228}, {"video_id": "nIj5N0Tkqz0", "question": "A man is being interviewed. He is wearing a black hat, a black suit with a white shirt, and a striped tie. He has his eyes closed and is slightly nodding his head. When the subtitle mentions '[Music]', what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["A pair of black-framed glasses", "A wristwatch", "Over-ear headphones", "A pair of gold-framed glasses", "A black car"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "nIj5N0Tkqz0_2", "video_path": "nIj5N0Tkqz0.mp4", "subtitle_path": "nIj5N0Tkqz0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1685.0, "view_count": 104228}, {"video_id": "46avtFTjTV8", "question": "On a sofa, there is an orange doll. Next to the doll, a woman with long auburn hair and glasses is sitting with her eyes closed. 
What are the two colors of the woman's outer garment?", "question_wo_referring_query": "What are the two colors of the woman's outer garment?", "candidates": ["Yellow and green", "White and yellow", "White and green", "Black and green", "White and orange"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "46avtFTjTV8_0", "video_path": "46avtFTjTV8.mp4", "subtitle_path": "46avtFTjTV8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1143.84, "view_count": 162352}, {"video_id": "46avtFTjTV8", "question": "In a dimly lit room, a woman wearing glasses kneels on the floor and takes a bundle of branches out of a wooden box. What are the two colors of the woman's outer garment?", "question_wo_referring_query": "What are the two colors of the woman's outer garment?", "candidates": ["Green and dark yellow", "Green and olive", "Black and olive", "Gray and green", "Gray and red"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "46avtFTjTV8_1", "video_path": "46avtFTjTV8.mp4", "subtitle_path": "46avtFTjTV8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1143.84, "view_count": 162352}, {"video_id": "46avtFTjTV8", "question": "In a small window, there is a pointy hat hanging on the clean white wall. A woman wearing glasses has a quilt draped over her. 
What does the pattern on this quilt look like?", "question_wo_referring_query": "What does the pattern on this quilt look like?", "candidates": ["Purple background with small white flowers", "Purple background with small white checkered pattern", "Red background with small white flowers", "White background with small purple flowers", "Pink background with small white flowers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "46avtFTjTV8_2", "video_path": "46avtFTjTV8.mp4", "subtitle_path": "46avtFTjTV8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1143.84, "view_count": 162352}, {"video_id": "MYGOm48DVgo", "question": "In the bottom right corner of the white background, there is a man wearing glasses. On the left side of the screen, there is a 6*6 grid of numbers containing only 0s and 1s. When the subtitle mentions 'input right to to process uh the input', what color is the number 1 in the grid on the screen?", "question_wo_referring_query": "What color is the number 1 in the grid on the screen?", "candidates": ["Blue", "Red", "Black", "White", "Pink"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "MYGOm48DVgo_0", "video_path": "MYGOm48DVgo.mp4", "subtitle_path": "MYGOm48DVgo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1088.23, "view_count": 27}, {"video_id": "MYGOm48DVgo", "question": "In the bottom right corner of the white background, there's a man wearing glasses, and on the left side of the screen, there's a 6\u00d76 grid with numbers. A black mouse pointer is hovering in the upper-left cell of this grid. Additionally, there's a 3\u00d73 grid on top of the larger grid. The numbers within the cells of the smaller grid are circled in various colors. 
When the subtitle mentions \u201cconnected with these nine values right,\u201d what are the colors of the circles in the bottom-left and top-right corners of the 3\u00d73 grid?", "question_wo_referring_query": "What are the colors of the circles in the bottom-left and top-right corners of the 3\u00d73 grid?", "candidates": ["Blue and yellow", "Blue and white", "Blue and black", "Blue and red", "Blue and purple"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "MYGOm48DVgo_1", "video_path": "MYGOm48DVgo.mp4", "subtitle_path": "MYGOm48DVgo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1088.23, "view_count": 27}, {"video_id": "MYGOm48DVgo", "question": "Under the title with the words 'Variants of Convolution Operation' on a white background, the secondary heading below is 'Dilated/Atrous Convolution.' There are two figures formed by rectangles at the bottom of the screen. What is the color of the nine-square grid rectangle in the upper part of the figure on the right?", "question_wo_referring_query": "What is the color of the nine-square grid rectangle in the upper part of the figure on the right?", "candidates": ["blue", "white", "green", "gray"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "MYGOm48DVgo_2", "video_path": "MYGOm48DVgo.mp4", "subtitle_path": "MYGOm48DVgo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1088.23, "view_count": 27}, {"video_id": "05f1WZFgmCY", "question": "In the background, there is a snow mountain and scattered cottages, with a small table placed on the snow in front of the camera. The table is covered with food and wine glasses. Two men are standing in front of the camera, smiling at it. 
What is the man wearing red-framed glasses holding in his hand?", "question_wo_referring_query": "What is the man wearing red-framed glasses holding in his hand?", "candidates": ["a teapot", "a transparent bottle containing liquid", "a tall wine glass containing wine", "a wine glass containing liquid"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "05f1WZFgmCY_0", "video_path": "05f1WZFgmCY.mp4", "subtitle_path": "05f1WZFgmCY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.63, "view_count": 525423}, {"video_id": "05f1WZFgmCY", "question": "There is a building with a gray exterior wall and green windows on the screen. Some vehicles are parked next to the building, and there are two people walking in a line on the road. Who is the person holding a white cup near their mouth on the screen?", "question_wo_referring_query": "Who is the person holding a white cup near their mouth on the screen?", "candidates": ["The person wearing a white fur coat", "The person without a hat", "The woman wearing a black coat", "The person wearing a black coat and an olive hat", "The woman wearing a black coat and an olive hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "05f1WZFgmCY_1", "video_path": "05f1WZFgmCY.mp4", "subtitle_path": "05f1WZFgmCY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.63, "view_count": 525423}, {"video_id": "05f1WZFgmCY", "question": "On the white countertop in the kitchen, there is a silver pot. The pot has brown soup around its edge, and someone is stirring the pot with a ladle. 
Who is stirring the pot with the ladle?", "question_wo_referring_query": "Who is stirring the pot with the ladle?", "candidates": ["A woman with long hair wearing a white sweater", "A man with short hair wearing a black jacket", "A man wearing glasses and a gray long-sleeved shirt", "A woman wearing glasses and a gray short-sleeved shirt", "A man wearing glasses and a white short-sleeved shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "05f1WZFgmCY_2", "video_path": "05f1WZFgmCY.mp4", "subtitle_path": "05f1WZFgmCY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.63, "view_count": 525423}, {"video_id": "nr_pFDbvz6A", "question": "In the scene, there is an opened book. The left page of the book is blank, while the right page has an illustration of a person and a summary of his life. At the top of the right page, it reads 'LOUIS-GABRIEL SUCHET'. When the subtitle first appears with the text 'Michel Ney was a cooper's son from Lorraine, a German-speaking region of France on the', what happens on the screen?", "question_wo_referring_query": "what happened on the screen?", "candidates": ["The book turns a page", "The entire book disappears", "Content appears on the left page", "Content on the right page disappears", "The right page gets torn"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "nr_pFDbvz6A_0", "video_path": "nr_pFDbvz6A.mp4", "subtitle_path": "nr_pFDbvz6A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2312.6, "view_count": 670429}, {"video_id": "nr_pFDbvz6A", "question": "The screen shows a simulated map, with a section labeled 'GREAT' painted in red on the left side and a section labeled 'FRANCE' painted in blue. There are many ships at anchor in the strait between these two regions, and some troops are positioned on the land near the ships. 
When the subtitle 'He was accompanied by Colonel Henri Jomini, a Swiss officer and military theorist' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The troops positioned in the blue region advance towards 'River Rhine'.", "The troops positioned in the blue region disappear.", "The gray areas of the map are painted red.", "The gray areas of the map are painted blue.", "The troops positioned in the blue region advance towards 'GREAT'."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "nr_pFDbvz6A_1", "video_path": "nr_pFDbvz6A.mp4", "subtitle_path": "nr_pFDbvz6A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2312.6, "view_count": 670429}, {"video_id": "nr_pFDbvz6A", "question": "On a white background, there is a white mug with a character image on it, and below it is the text 'MARSHAL BESSIERES'. When the caption 'We have 10 of the best available as mugs and stickers in the merch store now' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen when the caption appears?", "candidates": ["A new mug appears directly below this mug.", "A character poster appears to the right of this mug.", "A new mug appears to the left of this mug.", "A new mug appears to the right of this mug.", "A character poster appears to the left of this mug."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "nr_pFDbvz6A_2", "video_path": "nr_pFDbvz6A.mp4", "subtitle_path": "nr_pFDbvz6A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2312.6, "view_count": 670429}, {"video_id": "-f8psVWljQQ", "question": "On a wooden board, a pair of hands is holding a knife and cutting mushrooms. In the bottom left corner of the screen, there is the text 'Pilze - 300 g'. 
What needs to be done in the next step?", "question_wo_referring_query": "On a wooden board, a pair of hands is holding a knife and cutting mushrooms. In the bottom left corner of the screen, there is the text 'Pilze - 300 g'. What needs to be done in the next step?", "candidates": ["Need to add mushrooms to the pot", "Need to add onions to the pot", "Need to add olive oil to the pot", "Need to add salt to the pot", "Need to add blue flowers to the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "-f8psVWljQQ_0", "video_path": "-f8psVWljQQ.mp4", "subtitle_path": "-f8psVWljQQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.2, "view_count": 15844}, {"video_id": "-f8psVWljQQ", "question": "On a wooden board, a pair of hands is holding a knife and cutting a red onion. In the top left corner of the screen, the text 'Red onion - 1 piece' is displayed. What needs to be done in the next step?", "question_wo_referring_query": "What needs to be done in the next step?", "candidates": ["Pour the carrots into the pot", "Pour olive oil into the pot", "Pour the salt into the pot", "Pour the onion into the pot", "Cut the carrots"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "-f8psVWljQQ_1", "video_path": "-f8psVWljQQ.mp4", "subtitle_path": "-f8psVWljQQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.2, "view_count": 15844}, {"video_id": "-f8psVWljQQ", "question": "In a green ceramic bowl, there is a whisk. Heavy cream is being added to the bowl, and in the upper left corner of the screen, the text 'Heavy cream ~ 0.75 cup' is displayed. 
What needs to be done in the next step?", "question_wo_referring_query": "What needs to be done in the next step?", "candidates": ["Add salt to the bowl", "Chop a carrot", "Add tomatoes to the bowl", "Add blueberries to the bowl", "Add eggs to the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "-f8psVWljQQ_2", "video_path": "-f8psVWljQQ.mp4", "subtitle_path": "-f8psVWljQQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.2, "view_count": 15844}, {"video_id": "glYMjtt0ayQ", "question": "A woman wearing a black coat over a white top, with her head covered, is looking down at something. What appeared on the screen just before the subtitle said 'uh suggesting then sure so this is the'?", "question_wo_referring_query": "What appeared on the screen?", "candidates": ["A poster with a picture of a blonde woman in a black dress on the right side", "A poster with a picture of a woman in a bikini on the right side", "A poster with a picture of a red-haired woman in a black bikini on the right side", "A poster with a picture of a blonde woman in a blue T-shirt on the right side", "A poster with a picture of a blonde woman in a white dress on the right side"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "glYMjtt0ayQ_0", "video_path": "glYMjtt0ayQ.mp4", "subtitle_path": "glYMjtt0ayQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1046.12, "view_count": 11825}, {"video_id": "glYMjtt0ayQ", "question": "In the room, an elderly man with white hair, wearing a black suit and a blue shirt, is speaking. 
After the subtitle says 'do it it's rather like you don't want,' which character appears on the screen first?", "question_wo_referring_query": "Which character appears on the screen first?", "candidates": ["A blonde woman wearing a blue coat", "A blonde woman wearing a purple coat", "A blonde woman wearing a yellow coat", "A blonde woman wearing a white coat", "A blonde woman wearing a black coat"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "glYMjtt0ayQ_1", "video_path": "glYMjtt0ayQ.mp4", "subtitle_path": "glYMjtt0ayQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1046.12, "view_count": 11825}, {"video_id": "glYMjtt0ayQ", "question": "Inside the auditorium, to the left of a rectangular table sits a blonde woman, and to the right sit a gray-haired man and a brown-haired woman. What object appeared on the screen before the subtitle said 'suggesting it was not about me but of'?", "question_wo_referring_query": "What object appeared on the screen?", "candidates": ["A poster in the top left corner with a white letter 'i'", "A poster in the top left corner with a red letter 'A'", "A poster in the top right corner with a white letter 'J'", "A poster in the top left corner with a red letter 'i'", "A poster in the top right corner with a white letter 'i'"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "glYMjtt0ayQ_2", "video_path": "glYMjtt0ayQ.mp4", "subtitle_path": "glYMjtt0ayQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1046.12, "view_count": 11825}, {"video_id": "PVvR_ZDjfhQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a pair of hands cuts a purple onion on a green chopping board. 
Then, at the end, the woman wearing a red feathered dress and a red hat kneels on the white snow. Finally, the woman in black clothes gives two round cakes to a man in white clothes.", "First, the woman wearing a red feathered dress and a red hat kneels on the white snow. Then, a pair of hands cuts a purple onion on a green chopping board. Finally, a woman in black clothes gives two round cakes to a man in white clothes.", "First, a pair of hands cuts a purple onion on a green chopping board. Then, the woman in black clothes gives two round cakes to a man in white clothes. After that, the woman wearing a red feathered dress and a red hat kneels on the white snow, and at the end, she kneels on the white snow once again.", "First, the woman in black clothes gives two round cakes to a man in white clothes. Then, a pair of hands cuts a purple onion on a green chopping board. Finally, the woman wearing a red feathered dress and a red hat kneels on the white snow.", "First, the woman wearing a red feathered dress and a red hat kneels on the white snow. Then, the woman in black clothes gives two round cakes to a man in white clothes. Finally, the woman wearing a red feathered dress and a red hat kneels on the white snow again. 
At the end, a pair of hands cuts a purple onion on a green chopping board."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "PVvR_ZDjfhQ_0", "video_path": "PVvR_ZDjfhQ.mp4", "subtitle_path": "PVvR_ZDjfhQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.78, "view_count": 4090357}, {"video_id": "PVvR_ZDjfhQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, someone removes the covering paper from a black pot on top of the stove, then a man in a red cotton jacket sits in a gray account book, and finally, a person uses a fire stick to ignite a pile of wood.", "First, a person uses a fire stick to ignite a pile of wood, then a person in a red cotton jacket sits in a gray account book, and finally, they remove the covering paper from a black pot on top of the stove.", "First, a person in a red cotton jacket sits in a gray account book, then someone removes the covering paper from a black pot on top of the stove, and finally, a person uses a fire stick to ignite a pile of wood.", "First, a person uses a fire stick to ignite a pile of wood, then someone removes the covering paper from a black pot on top of the stove, and finally, a person in a red cotton jacket sits in a gray account book.", "First, a person in a red cotton jacket sits in a gray account book, then a person uses a fire stick to ignite a pile of wood, and finally, they remove the covering paper from a black pot on top of the stove."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "PVvR_ZDjfhQ_1", "video_path": "PVvR_ZDjfhQ.mp4", "subtitle_path": "PVvR_ZDjfhQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.78, "view_count": 4090357}, {"video_id": "PVvR_ZDjfhQ", "question": 
"Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, two women sit in front of a dining table covered with a red and white checkered tablecloth and taste the gourmet food in the black pot. Then, a person puts skewers on red skewers' few skewers of meat. Finally, a woman wearing a black short-sleeved shirt and pants stands in front of a red canopy and puts food wrapped in tin foil on the fire to roast.", "First, a person puts skewers on red skewers' few skewers of meat. Then, two women sit in front of a dining table covered with a red and white checkered tablecloth and taste the gourmet food in the black pot. Finally, a woman wearing a black short-sleeved shirt and pants stands in front of a red canopy and puts food wrapped in tin foil on the fire to roast.", "First, a woman wearing a black short-sleeved shirt and pants stands in front of a red canopy and puts food wrapped in tin foil on the fire to roast. Then, two women sit in front of a dining table covered with a red and white checkered tablecloth and taste the gourmet food in the black pot. Finally, a person puts skewers on red skewers' few skewers of meat.", "First, a woman wearing a black short-sleeved shirt and pants stands in front of a red canopy and puts food wrapped in tin foil on the fire to roast. Then, a person puts skewers on red skewers' few skewers of meat. Finally, two women sit in front of a dining table covered with a red and white checkered tablecloth and taste the gourmet food in the black pot.", "First, a person puts skewers on red skewers' few skewers of meat. Then, a woman wearing a black short-sleeved shirt and pants stands in front of a red canopy and puts food wrapped in tin foil on the fire to roast. 
Finally, two women sit in front of a dining table covered with a red and white checkered tablecloth and taste the gourmet food in the black pot."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "PVvR_ZDjfhQ_2", "video_path": "PVvR_ZDjfhQ.mp4", "subtitle_path": "PVvR_ZDjfhQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.78, "view_count": 4090357}, {"video_id": "l4is4uHvKlU", "question": "In the top-left corner, on a black background with the yellow English word 'what', there is a picture of a white house surrounded by white columns under yellow tree leaves. Where else has this picture appeared?", "question_wo_referring_query": "Where else has this picture appeared?", "candidates": ["On a white background with the red English word 'why' in the top-left corner", "On a black background with the yellow English word 'why' in the top-right corner", "On a black background with the yellow English word 'why' in the top-left corner", "On a white background with the yellow English word 'why' in the top-left corner", "On a black background with the yellow English word 'ME' in the top-left corner"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "l4is4uHvKlU_0", "video_path": "l4is4uHvKlU.mp4", "subtitle_path": "l4is4uHvKlU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.9, "view_count": 32592}, {"video_id": "l4is4uHvKlU", "question": "In the top left corner of the black background, there's yellow text saying 'Why'. A gray desktop computer is shown in the middle of the screen. 
Where else has this computer appeared?", "question_wo_referring_query": "Where else has this computer appeared?", "candidates": ["In the black background with white English text 'How' in the top right corner", "In the black background with yellow English text 'How' in the top left corner", "In the black background with yellow English text 'How' in the bottom left corner", "In the black background with yellow English text 'How' in the middle", "In the black background with yellow English text 'How' in the bottom right corner"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "l4is4uHvKlU_1", "video_path": "l4is4uHvKlU.mp4", "subtitle_path": "l4is4uHvKlU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.9, "view_count": 32592}, {"video_id": "l4is4uHvKlU", "question": "In the black background with the white text 'Positional Encoding' written at the top, there is a white word 'horse' written. 
Where else does this word appear?", "question_wo_referring_query": "Where else does this word appear?", "candidates": ["In the blue background with the red text 'Multi Head Attention' written in the middle", "In the black background with the white text 'Multi Head Attention' written at the top", "In the white background with the white text 'Multi Head Attention' written at the top", "In the blue background with the white text 'Multi Head Attention' written at the bottom", "In the blue background with the yellow text 'Multi Head Attention' written in the middle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "l4is4uHvKlU_2", "video_path": "l4is4uHvKlU.mp4", "subtitle_path": "l4is4uHvKlU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 942.9, "view_count": 32592}, {"video_id": "q8aczw5Ce6E", "question": "Sitting behind a bookshelf, the woman dressed in a white dress with a side ponytail is on the far left, and the man dressed in a gray shirt is on the far right. With which subtitles has this man appeared together?", "question_wo_referring_query": "With which subtitles has this man appeared together?", "candidates": ["nice", "john", "hello", "goodbye", "fruit full part of marriage as for relationship advice I got quite a few questions of that I by "], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "q8aczw5Ce6E_0", "video_path": "q8aczw5Ce6E.mp4", "subtitle_path": "q8aczw5Ce6E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.75, "view_count": 214012}, {"video_id": "q8aczw5Ce6E", "question": "In the room painted with green paint, next to the windowsill with green potted plants, there is a white glass table lamp. 
With which subtitles has this lamp appeared together?", "question_wo_referring_query": "With which subtitles has this lamp appeared together?", "candidates": ["goodbye", "hello", "army", "jack", "embrace my art style and uh really made me proud of the work I do even though it might"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "q8aczw5Ce6E_1", "video_path": "q8aczw5Ce6E.mp4", "subtitle_path": "q8aczw5Ce6E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.75, "view_count": 214012}, {"video_id": "q8aczw5Ce6E", "question": "In a room filled with green plants, a cat appears on the leg of a woman wearing a single-sided brocade pattern dress and a hairband. Which subtitles did the cat and the woman appear together with?", "question_wo_referring_query": "Which subtitles did the cat and the woman appear together with?", "candidates": ["plane", "feeling them try really hard to suppress them or run away from them instead just be there", "goodbye", "Mike", "Mary"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "q8aczw5Ce6E_2", "video_path": "q8aczw5Ce6E.mp4", "subtitle_path": "q8aczw5Ce6E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.75, "view_count": 214012}, {"video_id": "h39NdL7cqL4", "question": "Has the boiling red lava, erupting constantly in the video, changed when viewed from a distance from the volcano?", "question_wo_referring_query": ", has the lava changed?", "candidates": ["The lava turned green", "The lava turned purple", "The lava turned blue", "The lava turned black", "The lava turned white"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "h39NdL7cqL4_0", "video_path": "h39NdL7cqL4.mp4", "subtitle_path": "h39NdL7cqL4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.29, 
"view_count": 11227}, {"video_id": "h39NdL7cqL4", "question": "What change occurs in the sky, which is tinged with pink twilight, above a dark forest in the video?", "question_wo_referring_query": "What change occurs after this?", "candidates": ["The sky turns blue", "The sky turns purple", "The sky turns black", "The sky turns green", "The sky turns yellow"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "h39NdL7cqL4_1", "video_path": "h39NdL7cqL4.mp4", "subtitle_path": "h39NdL7cqL4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.29, "view_count": 11227}, {"video_id": "h39NdL7cqL4", "question": "In the black night sky, a bright white moon is hanging. What changes occurred to this moon afterwards?", "question_wo_referring_query": ", what changes occurred to this moon afterwards?", "candidates": ["The moon became round", "The moon became yellow", "The moon became red", "The moon became larger", "The moon became smaller"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "h39NdL7cqL4_2", "video_path": "h39NdL7cqL4.mp4", "subtitle_path": "h39NdL7cqL4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.29, "view_count": 11227}, {"video_id": "7l1VPibnTgM", "question": "Next to a table with a newspaper, a woman wearing black glasses is sitting on the left, and a woman with her arms crossed and wearing a black coat is sitting on the right. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["tissue paper", "flower vase", "pink cup", "glass cup", "blue cup"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "7l1VPibnTgM_0", "video_path": "7l1VPibnTgM.mp4", "subtitle_path": "7l1VPibnTgM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.92, "view_count": 9452}, {"video_id": "7l1VPibnTgM", "question": "A group of police officers wearing reflective vests with 'POLICE' logos appear behind several flags. What other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["bridge", "street lamp", "lifebuoy", "trash bin", "car"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "7l1VPibnTgM_1", "video_path": "7l1VPibnTgM.mp4", "subtitle_path": "7l1VPibnTgM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.92, "view_count": 9452}, {"video_id": "7l1VPibnTgM", "question": "In the scene, a person dressed in an orange jacket with white stripes and black pants is walking towards the camera. 
What other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["red fire extinguisher", "lifebuoy", "bicycle", "motorbike", "refrigerator"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "7l1VPibnTgM_2", "video_path": "7l1VPibnTgM.mp4", "subtitle_path": "7l1VPibnTgM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 986.92, "view_count": 9452}, {"video_id": "HGEXluz5nFA", "question": "In the green background featuring ducks and a magnifying glass design, what is the color of the question mark design in the top left corner?", "question_wo_referring_query": "What color is it?", "candidates": ["White", "Yellow", "Red", "Black", "Purple"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "HGEXluz5nFA_0", "video_path": "HGEXluz5nFA.mp4", "subtitle_path": "HGEXluz5nFA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.8, "view_count": 173560}, {"video_id": "HGEXluz5nFA", "question": "In the video, two military green tanks are parked on a piece of grass. What color is the grass?", "question_wo_referring_query": ", What color is the grass?", "candidates": ["Black", "Yellow", "Gray", "White", "Green"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "HGEXluz5nFA_1", "video_path": "HGEXluz5nFA.mp4", "subtitle_path": "HGEXluz5nFA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.8, "view_count": 173560}, {"video_id": "HGEXluz5nFA", "question": "In a game screen, a small aircraft is flying above a forest with a few narrow paths. 
What shape is the red pattern on the body and tail of this aircraft?", "question_wo_referring_query": "What shape is the red pattern on the body and tail of this aircraft?", "candidates": ["Square", "Circle", "Triangle", "Rectangle", "Pentagon"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "HGEXluz5nFA_2", "video_path": "HGEXluz5nFA.mp4", "subtitle_path": "HGEXluz5nFA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1000.8, "view_count": 173560}, {"video_id": "x-733sBOlB4", "question": "In the exhibition hall, there is a woman wearing a black suit and a necklace standing in front of a gray wall with picture frames. What hairstyle does the woman with golden hair have when the subtitle says 'that stems back to success and parasias'?", "question_wo_referring_query": "What hairstyle does this woman have?", "candidates": ["Crew cut", "Short hair with bangs", "Shoulder-length hair", "Bob cut", "Waist-length hair"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "x-733sBOlB4_0", "video_path": "x-733sBOlB4.mp4", "subtitle_path": "x-733sBOlB4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1670.46, "view_count": 74022}, {"video_id": "x-733sBOlB4", "question": "In front of a gray wall with picture frames, a man wearing a blue shirt and a gray suit is standing. 
When the subtitle says 'organizing this exhibition is a sign of,' what color and style is this man's tie?", "question_wo_referring_query": "What color and style is this man's tie?", "candidates": ["Red and white striped style", "Pure black style", "Black and white checkered style", "Black and white striped style", "Blue and white floral style"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "x-733sBOlB4_1", "video_path": "x-733sBOlB4.mp4", "subtitle_path": "x-733sBOlB4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1670.46, "view_count": 74022}, {"video_id": "x-733sBOlB4", "question": "On the gray background wall, there is a black picture frame. Inside the frame, there are multicolored flowers and a blue curtain. When the subtitle says 'performance isn't it,' what is the shape of this picture frame?", "question_wo_referring_query": "What is the shape of this picture frame?", "candidates": ["Triangle", "Circle", "Square", "Trapezoid", "Rectangle"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "x-733sBOlB4_2", "video_path": "x-733sBOlB4.mp4", "subtitle_path": "x-733sBOlB4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1670.46, "view_count": 74022}, {"video_id": "DB3bprN1yM8", "question": "In a room with books and picture frames, someone is sitting on the ground being interviewed. Behind them is a sliding wooden door with white square patterns. 
Who is the person being interviewed?", "question_wo_referring_query": "Who is the person being interviewed?", "candidates": ["A black-haired man wearing a blue jacket and black pants", "A black-haired woman wearing a red skirt", "A blond-haired man wearing a black jacket and white pants", "A white-haired man wearing a red jacket and black pants", "A white-haired man wearing a black jacket and black pants"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "DB3bprN1yM8_0", "video_path": "DB3bprN1yM8.mp4", "subtitle_path": "DB3bprN1yM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2834.14, "view_count": 1708}, {"video_id": "DB3bprN1yM8", "question": "Who is the person pushing the baby stroller in front of a white building with black letters \"lululemon\" on a glass door?", "question_wo_referring_query": "Who?", "candidates": ["A man wearing a gray short-sleeve shirt, black shorts, and loafers", "A woman wearing a white short-sleeve shirt, black shorts, and loafers", "A man wearing a black short-sleeve shirt, black shorts, and sneakers", "A man wearing a gray short-sleeve shirt, black shorts, and leather shoes", "A woman wearing a white short-sleeve shirt, black skirt, and loafers"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "DB3bprN1yM8_1", "video_path": "DB3bprN1yM8.mp4", "subtitle_path": "DB3bprN1yM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2834.14, "view_count": 1708}, {"video_id": "DB3bprN1yM8", "question": "In an auditorium equipped with white flashing lights, there is a circular stage in the center. 
Who is the person sitting on a white chair on the stage, facing forward with their back to the mirror?", "question_wo_referring_query": "Who is it?", "candidates": ["The woman with short yellow hair wearing a black top and black pants", "The man with black hair wearing a black top and black pants", "The man wearing a blue top and black pants", "The man with black hair wearing a blue top and white pants", "The woman with long yellow hair wearing a blue top and black pants"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "DB3bprN1yM8_2", "video_path": "DB3bprN1yM8.mp4", "subtitle_path": "DB3bprN1yM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2834.14, "view_count": 1708}, {"video_id": "J4D5RydtJJo", "question": "When a woman wearing a grey jacket and green overalls with short boots appears on the screen for the first time, what action does she perform?", "question_wo_referring_query": "What action does she perform when she first appears on the screen?", "candidates": ["She puts one hand on her waist and spreads her legs apart.", "She spins around on the spot.", "She crosses one leg in front of the other.", "She waves at the camera.", "She takes off her grey jacket."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "J4D5RydtJJo_0", "video_path": "J4D5RydtJJo.mp4", "subtitle_path": "J4D5RydtJJo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.79, "view_count": 261886}, {"video_id": "J4D5RydtJJo", "question": "In a room with a white wardrobe that is full of hanging clothes, what did a cat do when it appeared for the first time next to an orange potted plant in the bottom left corner of the screen?", "question_wo_referring_query": "What did it do?", "candidates": ["It was biting the green leaves", "It put its head into the red cat bowl", "It put its head into the blue cat bowl", "It was 
rolling on the ground", "It knocked over the orange potted plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "J4D5RydtJJo_1", "video_path": "J4D5RydtJJo.mp4", "subtitle_path": "J4D5RydtJJo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.79, "view_count": 261886}, {"video_id": "J4D5RydtJJo", "question": "What action did a woman wearing a colorful plaid coat and a beige belt skirt take the first time she appeared on the screen?", "question_wo_referring_query": "What action did she take?", "candidates": ["Picked up a cat", "Took off her coat", "Sat on the bed", "Touched her hair", "Took off her shoes"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "J4D5RydtJJo_2", "video_path": "J4D5RydtJJo.mp4", "subtitle_path": "J4D5RydtJJo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.79, "view_count": 261886}, {"video_id": "HD3uP_LNQ5g", "question": "At the top center of the green screen, there is a white exclamation mark icon with a white circle around it. 
What happens on the screen when the subtitle says 'Bradbury both wrong and misleading the'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A segment of red English text appears in the center of the green screen.", "A segment of black English text appears in the center of the green screen.", "A pillar-shaped figure appears in the center of the green screen.", "A segment of white English text appears in the center of the green screen.", "A segment of white Chinese text appears in the center of the green screen."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "HD3uP_LNQ5g_0", "video_path": "HD3uP_LNQ5g.mp4", "subtitle_path": "HD3uP_LNQ5g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1067.67, "view_count": 259077}, {"video_id": "HD3uP_LNQ5g", "question": "On the green screen, there is a black arrow and a red double-headed arrow with white text. What event occurs on the screen when the subtitle says 'nine meters now there were various'?", "question_wo_referring_query": "What event occurs on the screen?", "candidates": ["The black arrow and the red double-headed arrow move upwards", "The black arrow and the red double-headed arrow move to the left", "The black arrow and the red double-headed arrow move downwards", "The black arrow and the red double-headed arrow rotate", "The black arrow and the red double-headed arrow move to the right"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "HD3uP_LNQ5g_1", "video_path": "HD3uP_LNQ5g.mp4", "subtitle_path": "HD3uP_LNQ5g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1067.67, "view_count": 259077}, {"video_id": "HD3uP_LNQ5g", "question": "In the top-left corner of the green screen, there is a white skull icon with two white arrows inserted into it. 
When the subtitle says 'the offensive and defensive in open', what event occurred on the screen?", "question_wo_referring_query": "What event occurred on the screen?", "candidates": ["A black shield icon appeared in the middle of the green screen", "A white shield icon appeared in the top-left corner of the green screen", "A white shield icon appeared in the top-right corner of the green screen", "A white shield icon appeared in the bottom-left corner of the green screen", "A white shield icon appeared in the bottom-right corner of the green screen"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "HD3uP_LNQ5g_2", "video_path": "HD3uP_LNQ5g.mp4", "subtitle_path": "HD3uP_LNQ5g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1067.67, "view_count": 259077}, {"video_id": "URowb7ZEipI", "question": "After a black wheelboat appears and travels on the blue sea on the screen, which of the following characters shows up first?", "question_wo_referring_query": "Which of the following characters shows up first?", "candidates": ["A man with black hair and a black tie making a phone call in a car", "A man sitting in front of a bookshelf with a globe, wearing a grey top with tattoos on his arm", "A man sitting in front of a bookshelf with a globe, wearing a white top", "A bald man with a blue and white striped tie making a phone call in a car", "A bald man sitting in a dim room with purple lights, wearing a black top"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "URowb7ZEipI_0", "video_path": "URowb7ZEipI.mp4", "subtitle_path": "URowb7ZEipI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1103.67, "view_count": 794856}, {"video_id": "URowb7ZEipI", "question": "In front of the bookshelf with the globe, after a man wearing a gray shirt with tattoos on his arms appears, which of the following objects 
appears first?", "question_wo_referring_query": "Which of the following objects appears first?", "candidates": ["black drone", "parachute", "ball with green, white, and blue colors", "character wearing green shorts", "character wearing blue shorts"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "URowb7ZEipI_1", "video_path": "URowb7ZEipI.mp4", "subtitle_path": "URowb7ZEipI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1103.67, "view_count": 794856}, {"video_id": "URowb7ZEipI", "question": "In front of the bookshelf with the globe, after the man wearing a gray shirt and with tattoos on his arm appears, which of the following objects appears first?", "question_wo_referring_query": "Which of the following objects appears first?", "candidates": ["White door", "Green mountain range", "Newspaper", "White sailboat", "Map of Earth satellites"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "URowb7ZEipI_2", "video_path": "URowb7ZEipI.mp4", "subtitle_path": "URowb7ZEipI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1103.67, "view_count": 794856}, {"video_id": "ojsaQcmtlj0", "question": "In the video, a pair of hands is holding an orange square box. 
After the subtitle says 'original assortments of premium Japanese,' what happens on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["The orange square box is taken away.", "A green paper box appears on the screen.", "A black paper box appears on the screen.", "The lid of the orange square box is opened.", "The orange square box is flipped over."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ojsaQcmtlj0_0", "video_path": "ojsaQcmtlj0.mp4", "subtitle_path": "ojsaQcmtlj0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1610.83, "view_count": 288921}, {"video_id": "ojsaQcmtlj0", "question": "A girl wearing a white headband is sitting on the bed, holding a pair of green cloth shoes in her hands. After the subtitles say 'wanted to buy these,' what happened to this girl?", "question_wo_referring_query": "What happened to the girl?", "candidates": ["Started clapping with the hands holding the cloth shoes", "Put the cloth shoes on the bed", "Put the cloth shoes on the table", "Put the cloth shoes on the ground", "Put the cloth shoes on her feet"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ojsaQcmtlj0_1", "video_path": "ojsaQcmtlj0.mp4", "subtitle_path": "ojsaQcmtlj0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1610.83, "view_count": 288921}, {"video_id": "ojsaQcmtlj0", "question": "A woman wearing a white hairband, sitting on a bed and holding an orange plush toy with black eyes, says in the subtitles 'because we don't want to spread'. 
What does this woman do next?", "question_wo_referring_query": "What does this woman do next?", "candidates": ["Puts the orange plush toy on the ground", "Holds a green dinosaur toy in front of her chest", "Places the orange plush toy on the bed", "Puts the green dinosaur toy on her head", "Puts the orange plush toy on her head"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ojsaQcmtlj0_2", "video_path": "ojsaQcmtlj0.mp4", "subtitle_path": "ojsaQcmtlj0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1610.83, "view_count": 288921}, {"video_id": "4ATeeiuA5yc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a gray and white styled table with the right side blurred, then a man wearing a gray lab coat sits in front of a blue background with a globe, and finally there is an Excel sheet with a table in the top left corner and a red arrow in the middle.", "First, a gray and white styled table with the right side blurred, then there is an Excel sheet with a table in the top left corner and a blue arrow in the middle, and finally a man wearing a gray lab coat sits in front of a blue background with a globe.", "First, a man wearing a gray lab coat sits in front of a blue background with a globe, then there is a gray and white styled table with the right side blurred, and finally there is an Excel sheet with a table in the top left corner and a red arrow in the middle.", "First, a man wearing a gray lab coat sits in front of a blue background with a globe, then there is an Excel sheet with a table in the top left corner and a red arrow in the middle, and finally there is a gray and white styled table with the right side blurred.", "First, there is an Excel sheet with a table in the top left corner and a red arrow in the middle, then there is a gray 
and white styled table with the right side blurred, and finally a man wearing a white lab coat sits in front of a blue background with a globe."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "4ATeeiuA5yc_0", "video_path": "4ATeeiuA5yc.mp4", "subtitle_path": "4ATeeiuA5yc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.88, "view_count": 439949}, {"video_id": "4ATeeiuA5yc", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, an image with three red arrows and a black line marked with a height of 311m appears in the screen. Then, a satellite map with a Google logo appears in the bottom-right corner. Lastly, on the far left side of the screen from top to bottom, three characters appear in three mirrors.", "First, on the far left side of the screen from top to bottom, three characters appear in three mirrors. Then, a satellite map with a Google logo appears in the bottom-right corner. Lastly, an image with three red arrows and a black line marked with a height of 311m appears in the screen.", "First, on the far left side of the screen from top to bottom, three characters appear in three mirrors. Then, an image with three red arrows and a black line marked with a height of 311m appears in the screen. Finally, a satellite map with a Google logo appears in the bottom-right corner.", "First, on the far left side of the screen from top to bottom, three characters appear in three mirrors. Then, an image with three red arrows and a black line marked with a height of 311m appears in the screen. Finally, a satellite map with a Google logo appears in the bottom-right corner.", "First, an image with three red arrows and a black line marked with a height of 311m appears in the screen. 
Then, on the far left side of the screen from top to bottom, three characters appear in three mirrors. Finally, a satellite map with a Google logo appears in the bottom-right corner."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "4ATeeiuA5yc_1", "video_path": "4ATeeiuA5yc.mp4", "subtitle_path": "4ATeeiuA5yc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.88, "view_count": 439949}, {"video_id": "4ATeeiuA5yc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a black-framed picture with a blue triangle in the top left corner of the screen, then a white snowy mountain with a pointed tower labeled 'Dow Crag' on the right side in the front, and finally, a white background with a long rectangle with a black border containing 12 small squares at the very bottom.", "First, there is a white snowy mountain with a pointed tower labeled 'Dow Crag' on the right side in the front, then a black-framed picture with a blue triangle in the top left corner of the screen, and finally, a white background with a long rectangle with a black border containing 12 small squares at the very bottom.", "First, there is a white snowy mountain with a pointed tower labeled 'Dow Crag' on the right side in the front, then a white background with a long rectangle with a black border containing 12 small squares at the very bottom, and finally, a black-framed picture with a blue triangle in the top left corner of the screen.", "First, there is a black-framed picture with a blue triangle in the top left corner of the screen, then a white snowy mountain with a pointed tower labeled 'Dow Crag' on the right side in the front, and finally, a white background with a long rectangle with a black border containing 12 small squares at the very bottom.", "First, there is a 
black-framed picture with a blue triangle in the top left corner of the screen, next a white background with a long rectangle with a black border containing 12 small squares at the very bottom, and finally, a white snowy mountain with a pointed tower labeled 'Dow Crag' on the right side in the front."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "4ATeeiuA5yc_2", "video_path": "4ATeeiuA5yc.mp4", "subtitle_path": "4ATeeiuA5yc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 950.88, "view_count": 439949}, {"video_id": "xU1s14ebD1M", "question": "In the scene where the man in green pants is holding a yellow plastic bag with fruits at the market, where else has he appeared?", "question_wo_referring_query": "Where else has he appeared?", "candidates": ["On an airplane", "On a bus", "On a yacht", "In a cabin", "On a boat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "xU1s14ebD1M_0", "video_path": "xU1s14ebD1M.mp4", "subtitle_path": "xU1s14ebD1M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.6, "view_count": 1392364}, {"video_id": "xU1s14ebD1M", "question": "In the kitchen of a wooden house shown on screen, where else has the sheep covered in red sauce on a wooden board appeared?", "question_wo_referring_query": "Where else has it appeared?", "candidates": ["At the bottom of the steel rack", "On the soccer field", "At the top of the steel rack", "Inside the oven", "Inside the refrigerator"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "xU1s14ebD1M_1", "video_path": "xU1s14ebD1M.mp4", "subtitle_path": "xU1s14ebD1M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.6, "view_count": 1392364}, {"video_id": "xU1s14ebD1M", "question": "Where else did a man kicking a soccer ball on the 
grass, wearing a black hat and black glasses, and dressed in a red shirt and blue jeans, appear?", "question_wo_referring_query": "Where else did he appear?", "candidates": ["On the bus", "Inside the sheep pen", "At the market", "On the dining table inside a wooden house", "On the boat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "xU1s14ebD1M_2", "video_path": "xU1s14ebD1M.mp4", "subtitle_path": "xU1s14ebD1M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.6, "view_count": 1392364}, {"video_id": "KFweVYFztbk", "question": "A man with a bare upper body, wearing deep blue and white striped pants with a red cross symbol on them, is doing a handstand on a purple mat. Next to him, there's a man wearing a red short-sleeved shirt, black shorts, and white socks assisting him. Which subtitles did these deep blue shorts appear with at the same time?", "question_wo_referring_query": "Which subtitles did they appear with at the same time?", "candidates": ["\u201cthis morning we were having a\u201d", "\u201cso is this table set up good morning\u201d", "\u201cand so i've seen what has worked and\u201d", "\u201cinspiration that i really follow as well\u201d", "\u201cfor six full seconds\u201d"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "KFweVYFztbk_0", "video_path": "KFweVYFztbk.mp4", "subtitle_path": "KFweVYFztbk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1115.72, "view_count": 15486}, {"video_id": "KFweVYFztbk", "question": "Seated in a dimly lit dining room, wearing a black floral suspender dress, with a golden-haired woman in front of a glass of red wine, and a man in a white short-sleeved shirt sitting beside her, in what captions did this woman also simultaneously appear?", "question_wo_referring_query": "In what captions did she also simultaneously appear?", 
"candidates": ["\"gone down a little bit but you can see\"", "\"this morning we were having a\"", "\"and then we can have amazing italian\"", "\"how in different parts of the world the\"", "\"embrace that fear to push our\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "KFweVYFztbk_1", "video_path": "KFweVYFztbk.mp4", "subtitle_path": "KFweVYFztbk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1115.72, "view_count": 15486}, {"video_id": "KFweVYFztbk", "question": "On the left side of the screen, there is a huge mural. There are two men training on a narrow staircase made of terracotta and cream-colored tiles. One man is squatting, wearing a gray short-sleeved shirt and black shorts with knee pads. Next to him, another man is wearing a red short-sleeved shirt. With which subtitles does this scene appear simultaneously?", "question_wo_referring_query": "With which subtitles does this scene appear simultaneously?", "candidates": ["\"well when it's high these are\"", "\"outer part of it and so he cuts it open\"", "\"right over there we\u2019re heading to\"", "\"food\"", "\"and fall and you have to engage\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "KFweVYFztbk_2", "video_path": "KFweVYFztbk.mp4", "subtitle_path": "KFweVYFztbk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1115.72, "view_count": 15486}, {"video_id": "wjHQu0Ve96A", "question": "On a dimly lit stage, there are five girls dancing. Among them, the girl with red hair, wearing a white tank top and blue jeans, is sitting on a transparent chair, facing a host dressed in a white top and yellow skirt. What changed about her appearance?", "question_wo_referring_query": "On a dimly lit stage, there are five girls dancing. 
Among them, the girl with red hair, wearing a white tank top and blue jeans, is sitting on a transparent chair, facing a host dressed in a white top and yellow skirt. What changed about her appearance?", "candidates": ["She went from not having bangs to having bangs", "She went from not having a red plaid fabric on her body to having a red plaid fabric", "Her long red hair turned into short red hair", "She changed from wearing a white tank top to wearing a coat", "Her red hair turned into black hair"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "wjHQu0Ve96A_0", "video_path": "wjHQu0Ve96A.mp4", "subtitle_path": "wjHQu0Ve96A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.03, "view_count": 53682}, {"video_id": "wjHQu0Ve96A", "question": "With an orange and red stage lighting background, there are five women with different hair colors sitting on transparent chairs in the center of the stage. Among them, the woman in the middle is wearing a white off-shoulder top and has a red plaid cloth draped over her legs while singing into a microphone. When she is dancing at the far right at the end of the video, what changes occur to her?", "question_wo_referring_query": "What changes occur to her at the end of the video?", "candidates": ["She changes from denim shorts to denim pants.", "Her white shoes change to black shoes.", "Her golden hair changes to red hair.", "Her white off-shoulder top changes to a white short-sleeve shirt.", "The red plaid cloth on her legs disappears."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "wjHQu0Ve96A_1", "video_path": "wjHQu0Ve96A.mp4", "subtitle_path": "wjHQu0Ve96A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.03, "view_count": 53682}, {"video_id": "wjHQu0Ve96A", "question": "There are five girls in different colored outfits dancing on stage. 
The stage lights are composed of vibrant blue, green, and purple hues. When the scene shifts to them sitting on transparent chairs on the right side of the stage facing a host dressed in a white top and a yellow skirt, what changes occur in the background of the stage behind them?", "question_wo_referring_query": "What changes occur in the background of the stage behind them?", "candidates": ["The blue, green, and purple backdrop changes to white.", "The blue, green, and purple backdrop changes to a video background.", "The vibrant blue, green, and purple backdrop changes to white Korean text.", "The black screen changes to multicolor.", "The vibrant backdrop changes to a black screen."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "wjHQu0Ve96A_2", "video_path": "wjHQu0Ve96A.mp4", "subtitle_path": "wjHQu0Ve96A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.03, "view_count": 53682}, {"video_id": "7xstflp3s-4", "question": "In front of a black curtain backdrop, a short-haired woman wearing a white suit is kneeling on one knee on a platform, holding a microphone, with a large audience in front of her. 
When the subtitle 'talk about your use of symbolism in the' appears, what change occurs to this woman?", "question_wo_referring_query": ", what change occurs to this woman?", "candidates": ["She changes from a white sleeveless top to a white short-sleeved top.", "She changes from kneeling on one knee to sitting on a chair.", "She changes from kneeling on one knee to standing on the stage.", "She changes from white clothes to yellow clothes.", "She changes from short hair to long hair."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "7xstflp3s-4_0", "video_path": "7xstflp3s-4.mp4", "subtitle_path": "7xstflp3s-4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2580.45, "view_count": 1893}, {"video_id": "7xstflp3s-4", "question": "A short-haired woman, dressed in a sleeveless white top and white pants, is sitting on a bench holding a microphone and conversing with a woman on her left, who is wearing a black top and gray pants. What change occurs with this woman when the subtitle 'up just a few of the creative keys so' appears?", "question_wo_referring_query": "What change occurs with this woman?", "candidates": ["From short hair to long hair", "From sitting to standing", "From sleeveless top to long-sleeved top", "From long pants to shorts", "From white clothes to red clothes"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "7xstflp3s-4_1", "video_path": "7xstflp3s-4.mp4", "subtitle_path": "7xstflp3s-4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2580.45, "view_count": 1893}, {"video_id": "7xstflp3s-4", "question": "What changes occur to the woman sitting on the platform, wearing a white fur long-sleeved top, a black and white polka dot skirt with red stockings underneath, holding a microphone, when the subtitle 'significant for the instead of water' appears on the screen? 
She is flanked by two men in suits.", "question_wo_referring_query": "What changes occur to the woman when the subtitle appears?", "candidates": ["She changes from sitting to standing", "Her long-sleeved top changes to a short-sleeved top", "Her red stockings change to black stockings", "She gains a fringe", "The microphone in her hand disappears"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "7xstflp3s-4_2", "video_path": "7xstflp3s-4.mp4", "subtitle_path": "7xstflp3s-4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2580.45, "view_count": 1893}, {"video_id": "ApCn7zSwgTw", "question": "The background is a large screen displaying numbers and English words. A news anchor with shoulder-length short hair, wearing a rose-pink V-neck top and a small microphone pinned to her collar, is sitting in front of a camera. What is she doing?", "question_wo_referring_query": "What is the person sitting in front of the camera doing?", "candidates": ["Introducing herself", "Talking with a guest", "Reporting the news", "Connecting with an outside reporter", "Conducting an interview"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "ApCn7zSwgTw_0", "video_path": "ApCn7zSwgTw.mp4", "subtitle_path": "ApCn7zSwgTw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2630.1, "view_count": 2835}, {"video_id": "ApCn7zSwgTw", "question": "The background is a large screen with numbers and English words. 
What is a woman wearing a black V-neck shirt and silver hoop earrings doing in front of the mirror?", "question_wo_referring_query": "What is she doing?", "candidates": ["Making a phone call", "Introducing herself", "Chatting with a guest", "Talking in front of a mirror", "Connecting with a reporter"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "ApCn7zSwgTw_1", "video_path": "ApCn7zSwgTw.mp4", "subtitle_path": "ApCn7zSwgTw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2630.1, "view_count": 2835}, {"video_id": "ApCn7zSwgTw", "question": "On the right side of the news screen, there is a man sitting in front of a blue sky and red bridge background, wearing a dark blue suit over a light blue shirt and black-rimmed glasses. On the left side of the news screen, there are some white and red numbers. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Waving", "Being interviewed", "Making a phone call", "Chatting with a friend", "Talking to the camera"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "ApCn7zSwgTw_2", "video_path": "ApCn7zSwgTw.mp4", "subtitle_path": "ApCn7zSwgTw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2630.1, "view_count": 2835}, {"video_id": "JCfeHqRRThY", "question": "In the view of the goddess statue holding a mirror under the blue sky, what objects are present on the screen?", "question_wo_referring_query": ", what objects are present on the screen at this moment?", "candidates": ["Rainwater", "Workers", "Clouds", "Birds", "Lakeside"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "JCfeHqRRThY_0", "video_path": "JCfeHqRRThY.mp4", "subtitle_path": "JCfeHqRRThY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1136.37, "view_count": 639313}, 
{"video_id": "JCfeHqRRThY", "question": "In a sunny forest full of lush green plants, a woman dressed in a white long-sleeved shirt and wearing a white headscarf stands with her back to the camera. She is raising her left hand to touch the plants. What objects are present in the scene at this moment?", "question_wo_referring_query": "What objects are present in the scene at this moment?", "candidates": ["lake shore", "river stream", "green leaves", "flowers", "white clouds"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "JCfeHqRRThY_1", "video_path": "JCfeHqRRThY.mp4", "subtitle_path": "JCfeHqRRThY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1136.37, "view_count": 639313}, {"video_id": "JCfeHqRRThY", "question": "On the four-layer shelf filled with vegetables in the supermarket, the bottom layer is filled with upside-down lettuce. What object is present on the screen at this time?", "question_wo_referring_query": "What object is present on the screen at this time?", "candidates": ["potatoes", "broccoli", "leeks", "lettuce", "coriander"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "JCfeHqRRThY_2", "video_path": "JCfeHqRRThY.mp4", "subtitle_path": "JCfeHqRRThY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1136.37, "view_count": 639313}, {"video_id": "zdKX7Xo3Cb8", "question": "In the upper left corner of the screen, there is a bald man wearing a white shirt, sitting in front of an orange-yellow wall, explaining to the camera. On the right side of the screen, there is a PPT slide with 12 different pictures and animations. 
What is missing from the screen when the subtitle 'thesekindof videos and it's called RGB' appears?", "question_wo_referring_query": "What is missing from the screen?", "candidates": ["blue wetsuit", "grassland", "hand", "checkered shirt", "television"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "zdKX7Xo3Cb8_0", "video_path": "zdKX7Xo3Cb8.mp4", "subtitle_path": "zdKX7Xo3Cb8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3263.67, "view_count": 43177}, {"video_id": "zdKX7Xo3Cb8", "question": "In the upper left corner of the screen, there is a bald man wearing a white shirt, sitting in front of an ochre wall, explaining to the camera. On the right side, there's a PPT slide with 'Moving Light Display' written in blue English text at the top. The slide contains an image with a black background. When the subtitle '3d and so this is example of that which' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["white dot", "television", "curtain", "black short sleeve", "window"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "zdKX7Xo3Cb8_1", "video_path": "zdKX7Xo3Cb8.mp4", "subtitle_path": "zdKX7Xo3Cb8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3263.67, "view_count": 43177}, {"video_id": "zdKX7Xo3Cb8", "question": "In the upper left corner of the screen, there is a man wearing a white shirt, sitting in front of a beige wall with his head down, and on the right side, there is a PPT slide with three gray sheets of paper and irregular circular images. 
When the subtitle 'coordinate it of the image our camera so' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Cell phone", "Wall painting", "Red and white striped wall decoration", "Desk lamp", "Camera"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "zdKX7Xo3Cb8_2", "video_path": "zdKX7Xo3Cb8.mp4", "subtitle_path": "zdKX7Xo3Cb8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3263.67, "view_count": 43177}, {"video_id": "AuGlltrffKY", "question": "In an iron pot with an orange soup base, there are many pieces of meat in the soup. A person is using one hand to grab the meat pieces and the other hand holding a ladle stirring in the pot. What material is the ladle the person is holding?", "question_wo_referring_query": "What material is the ladle that this person is holding?", "candidates": ["ceramic ladle", "wooden ladle", "iron ladle", "steel ladle", "plastic ladle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "AuGlltrffKY_0", "video_path": "AuGlltrffKY.mp4", "subtitle_path": "AuGlltrffKY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.96, "view_count": 11513}, {"video_id": "AuGlltrffKY", "question": "There is a blue and yellow church building with a white wall and many shuttered windows on the right side. In front of the church, a man and a woman are holding a white dog. 
What material is this church building made of?", "question_wo_referring_query": "What material is this church made of?", "candidates": ["clay", "straw", "wood", "stone", "concrete"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "AuGlltrffKY_1", "video_path": "AuGlltrffKY.mp4", "subtitle_path": "AuGlltrffKY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.96, "view_count": 11513}, {"video_id": "AuGlltrffKY", "question": "Under the blue sky, sitting inside the boat, floating on the water surrounded by green water and blue mountains, there are six cups placed inside the boat. What material are these cups made of?", "question_wo_referring_query": "What material are these cups made of?", "candidates": ["Plastic cups", "Ceramic cups", "Glass cups", "Paper cups", "Thermal cups"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "AuGlltrffKY_2", "video_path": "AuGlltrffKY.mp4", "subtitle_path": "AuGlltrffKY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.96, "view_count": 11513}, {"video_id": "VG-cRvq0yvg", "question": "The screen displays an overview of a chapel under a blue sky. When the subtitle 'invention of gothic architects held back' appears, what is the chapel made of?", "question_wo_referring_query": "What is the chapel made of?", "candidates": ["Stone", "Cement", "Wood", "Clay", "Grass"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "VG-cRvq0yvg_0", "video_path": "VG-cRvq0yvg.mp4", "subtitle_path": "VG-cRvq0yvg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.63, "view_count": 15020}, {"video_id": "VG-cRvq0yvg", "question": "In the video, there are three statues of men outside the church, and the one on the right is holding a long spear. 
When the subtitle 'who foretold of the coming of Christ' appears, what material are these statues made of?", "question_wo_referring_query": "What material are these statues made of?", "candidates": ["Gold", "Wood", "Clay", "Jade", "Stone"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "VG-cRvq0yvg_1", "video_path": "VG-cRvq0yvg.mp4", "subtitle_path": "VG-cRvq0yvg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.63, "view_count": 15020}, {"video_id": "VG-cRvq0yvg", "question": "In the video, there is a door on a building. Above the door, there is a row of human figure carvings. On the left side of the door, there are three human figure sculptures, and on the right side, there are four human figure sculptures. What material is this door made of when the subtitle 'they pass before the attending kings and' appears?", "question_wo_referring_query": "What material is this door made of?", "candidates": ["Iron door", "Stone door", "Paper door", "Wooden door", "Concrete door"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "VG-cRvq0yvg_2", "video_path": "VG-cRvq0yvg.mp4", "subtitle_path": "VG-cRvq0yvg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 928.63, "view_count": 15020}, {"video_id": "xJOry2On5CA", "question": "On the yellow wooden desk, there is a pair of men's hands working on a math test. The left hand is wearing a black watch and a ring. On the right side of the desk, a black mouse pad with a white mouse is placed. 
What is the man holding in his right hand while doing the test?", "question_wo_referring_query": "What is the man holding in his right hand while doing the test?", "candidates": ["Ballpoint pen", "Marker", "Red pen", "Pencil", "Fountain pen"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "xJOry2On5CA_0", "video_path": "xJOry2On5CA.mp4", "subtitle_path": "xJOry2On5CA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1573.32, "view_count": 27477}, {"video_id": "xJOry2On5CA", "question": "There is a pair of men's hands on a yellow wooden desk. The left hand is wearing a black watch and a ring. A white mouse is placed on a black mouse pad on the right side of the desk. What is this man writing on with a pen in his right hand?", "question_wo_referring_query": "What is this man writing on with a pen in his right hand?", "candidates": ["book", "newspaper", "white paper", "notebook", "test paper"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "xJOry2On5CA_1", "video_path": "xJOry2On5CA.mp4", "subtitle_path": "xJOry2On5CA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1573.32, "view_count": 27477}, {"video_id": "xJOry2On5CA", "question": "On the yellow wooden desk, there are two test papers. On the black mouse pad on the right side of the desk, there is a white mouse. 
What is the man wearing on his left wrist while answering the test questions?", "question_wo_referring_query": "What is the man wearing on his left wrist while answering the test questions?", "candidates": ["watch", "ring", "bangle", "rope", "bracelet"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "xJOry2On5CA_2", "video_path": "xJOry2On5CA.mp4", "subtitle_path": "xJOry2On5CA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1573.32, "view_count": 27477}, {"video_id": "7b_1ZCV4Pwc", "question": "On both sides of the road there are forests, in the middle there is an asphalt road with a yellow double solid line. On this road, there are cars and a herd of cows. What are these cows doing on the road?", "question_wo_referring_query": "What are these cows doing on the road?", "candidates": ["Playing with tourists", "Lying on the road", "Eating food on the road", "Rushing head-on", "Walking on the road"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "7b_1ZCV4Pwc_0", "video_path": "7b_1ZCV4Pwc.mp4", "subtitle_path": "7b_1ZCV4Pwc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1036.1, "view_count": 47141}, {"video_id": "7b_1ZCV4Pwc", "question": "In the middle of two stone walls, there is a clear and shallow stream. Before the stream, there are many stones, and in the stream, there are two people with a red bag and green shorts and a person with a blue bag. 
What are they doing in the stream?", "question_wo_referring_query": "What are they doing in the stream?", "candidates": ["Fishing in the stream", "Swimming in the stream", "Walking in the stream", "Catching fish in the stream", "Chatting in the stream"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "7b_1ZCV4Pwc_1", "video_path": "7b_1ZCV4Pwc.mp4", "subtitle_path": "7b_1ZCV4Pwc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1036.1, "view_count": 47141}, {"video_id": "7b_1ZCV4Pwc", "question": "Under the gloomy clouds, a clear lake is surrounded by green trees. A woman wearing a gray hooded coat, with a ponytail and black-framed glasses, is sitting in a small boat holding a paddle. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Eating something", "Drinking water", "Having a conversation", "Rowing the boat", "Taking photos"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "7b_1ZCV4Pwc_2", "video_path": "7b_1ZCV4Pwc.mp4", "subtitle_path": "7b_1ZCV4Pwc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1036.1, "view_count": 47141}, {"video_id": "z2hgtFBBbxk", "question": "A woman, wearing a white headscarf, a yellow top, a blue apron, and a red skirt, is standing in front of a simple kitchen window. On the wall next to the window, there is a woven basket hanging. 
When the caption \"whose quiet settings, serene lighting, cropping, & ambiguousness made his scenes more universal\" appears, what is this woman doing with the red ceramic jar in her hands?", "question_wo_referring_query": "What is this woman doing with the red ceramic jar in her hands when the caption \"whose quiet settings, serene lighting, cropping, & ambiguousness made his scenes more universal\" appears?", "candidates": ["Pouring flower", "Pouring water", "Pouring milk", "Washing clothes", "Cooking"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "z2hgtFBBbxk_0", "video_path": "z2hgtFBBbxk.mp4", "subtitle_path": "z2hgtFBBbxk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1108.71, "view_count": 1001980}, {"video_id": "z2hgtFBBbxk", "question": "The background is a bookshelf filled with many books. There is an elderly person with white hair, wearing a white shirt and white suspenders, and glasses. In front of them, there is a computer and keyboard. When the subtitle '\u201cTim's Vermeer\u201d they claimed to reproduce a Vermeer. 
You can find links to these and counter theories' appears, what is this elderly person doing in front of the computer?", "question_wo_referring_query": "What is this elderly person doing in front of the computer?", "candidates": ["Watching a video", "Browsing web pages", "Video chatting", "Drawing on the computer", "Typing"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "z2hgtFBBbxk_1", "video_path": "z2hgtFBBbxk.mp4", "subtitle_path": "z2hgtFBBbxk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1108.71, "view_count": 1001980}, {"video_id": "z2hgtFBBbxk", "question": "When a person wearing a blue robe and a blue headdress, holding a book with a yellow cover in their left hand and a small trumpet in their right hand, is sitting sideways, with a painter wearing a black hat sitting in front of a canvas with his back to them, and the words 'in the Dutch republic was official, and converting to Catholicism didn't bring Vermeer any personal...' appear, what is the painter doing?", "question_wo_referring_query": "What is the painter doing?", "candidates": ["Drinking water", "Painting the woman", "Eating something", "Talking to the woman", "Guiding the woman's pose"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "z2hgtFBBbxk_2", "video_path": "z2hgtFBBbxk.mp4", "subtitle_path": "z2hgtFBBbxk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1108.71, "view_count": 1001980}, {"video_id": "nSu6xIUmymw", "question": "There is a red car on the map, with 'DAY 1' in the bottom left corner. 
During the introduction part of the video about the first day in Vienna, what is the first point of interest recommended?", "question_wo_referring_query": "What is the first point of interest recommended?", "candidates": ["MICHAELERPLATZ", "Central Caf\u00e9", "Austrian National Library", "Austrian Supreme Court", "Hofburg Palace"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "nSu6xIUmymw_0", "video_path": "nSu6xIUmymw.mp4", "subtitle_path": "nSu6xIUmymw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1287.36, "view_count": 19054}, {"video_id": "nSu6xIUmymw", "question": "There is a red car on the map, and 'DAY 2' is shown in the bottom left corner. During the introduction of the video on the second day of the Vienna trip, what attractions were first recommended?", "question_wo_referring_query": "", "candidates": ["Magnificent Sha Brun Palace", "Beethoven Museum", "Vienna State Opera House", "Brun Palman House", "Church"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "nSu6xIUmymw_1", "video_path": "nSu6xIUmymw.mp4", "subtitle_path": "nSu6xIUmymw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1287.36, "view_count": 19054}, {"video_id": "nSu6xIUmymw", "question": "After introducing the Vienna State Opera, which attractions were recommended in the video?", "question_wo_referring_query": "Which attractions were recommended?", "candidates": ["Brun Palman House", "Belvedere Museum", "Central Cafe", "Austrian Supreme Court", "North Tower"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "nSu6xIUmymw_2", "video_path": "nSu6xIUmymw.mp4", "subtitle_path": "nSu6xIUmymw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1287.36, "view_count": 19054}, {"video_id": "pv5x6wZ-lD0", "question": 
"On the wooden table, there is a white sketchbook surrounded by colored pencils of different hues. A pair of hands is seen holding a green colored pencil and filling in the stem part of a flower drawn in the sketchbook. What happens on the screen when the subtitle 'areas of the painting' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Using an eraser", "Drawing the lines for the flower bouquet in the sketchbook with a pencil", "Displaying a book placed on the table that has a green leaf and yellow flower drawn with a sepia cover", "Using a yellow colored pencil to color the petals in the sketchbook", "Placing the sketchbook, colored pencils, and other drawing tools on the table"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "pv5x6wZ-lD0_0", "video_path": "pv5x6wZ-lD0.mp4", "subtitle_path": "pv5x6wZ-lD0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1259.96, "view_count": 4196}, {"video_id": "pv5x6wZ-lD0", "question": "There is a white sketchbook on a wooden desk, surrounded by crayons of different colors. A pair of hands is holding a blue-purple crayon, coloring the fur of a caterpillar coiled around a flower stem. 
After the subtitle 'to protect my art from the oils in my' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Coloring the body of the caterpillar with a yellow crayon", "Drawing flower outlines on the sketchbook with a pencil", "A person wearing a brown shirt and a colorful rough-chain necklace is talking", "Coloring the flower stem with a green crayon", "Placing drawing tools like sketchbook crayons on the desk"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "pv5x6wZ-lD0_1", "video_path": "pv5x6wZ-lD0.mp4", "subtitle_path": "pv5x6wZ-lD0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1259.96, "view_count": 4196}, {"video_id": "pv5x6wZ-lD0", "question": "There is a blank white sketchbook on the wooden table surrounded by colored pencils. What happens on the screen before the subtitle 'you can even use copy paper' appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Sketching a bouquet on the sketchbook with a pencil", "Erasing with a rubber", "A person wearing a brown shirt and a colorful necklace is talking", "Coloring the stem of the flower with a green colored pencil", "Displaying a yellowed book placed on the table with drawings of green leaves and yellow flowers"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "pv5x6wZ-lD0_2", "video_path": "pv5x6wZ-lD0.mp4", "subtitle_path": "pv5x6wZ-lD0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1259.96, "view_count": 4196}, {"video_id": "bDuW49Lxbf4", "question": "Three people are standing in front of a black background. The man on the left is wearing a black short-sleeve shirt. On the right are two African women, with the woman in the middle wearing a red dress with straps. 
After the caption 'well regardless south sudan definitely' appears, what object appears on the screen?", "question_wo_referring_query": "Three people are standing in front of a black background. The man on the left is wearing a black short-sleeve shirt. On the right are two African women, with the woman in the middle wearing a red dress with straps. After the caption 'well regardless south sudan definitely' appears, what object appears on the screen?", "candidates": ["Picture of an African wild dog", "Picture of a rhinoceros", "Image of a man in a dark blue shirt holding a giant fish in the upper left corner of the screen", "African cuisine", "Picture of an African eagle"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "bDuW49Lxbf4_0", "video_path": "bDuW49Lxbf4.mp4", "subtitle_path": "bDuW49Lxbf4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1843.51, "view_count": 568159}, {"video_id": "bDuW49Lxbf4", "question": "A man wearing a white short-sleeve shirt is standing in front of a black background. On the left side of the screen is a picture of a man in a black suit and a pregnant woman in a light blue dress. 
Before the subtitle 'his wife is incredibly pregnant she' appears, what object appears on the screen?", "question_wo_referring_query": "what object appears on the screen?", "candidates": ["A man wearing a denim cap and a white long-sleeve V-neck shirt", "A village with some thatched houses", "A native African with face paint", "Some African cuisine", "A tall African basketball player"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "bDuW49Lxbf4_1", "video_path": "bDuW49Lxbf4.mp4", "subtitle_path": "bDuW49Lxbf4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1843.51, "view_count": 568159}, {"video_id": "bDuW49Lxbf4", "question": "A man wearing a white short-sleeved shirt is standing in front of a black background, holding a small boy wearing a blue-green short-sleeved shirt. After the subtitle 'legacy' appears, what object appears on the screen?", "question_wo_referring_query": "After the subtitle 'legacy' appears, what object appears on the screen?", "candidates": ["A picture of an African wild dog", "African gourmet food", "A long-armed, long-legged African basketball player", "An African native person with facial tattoos", "A man wearing a black short-sleeved shirt, holding a beer, with muscles and body shape very strong"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "bDuW49Lxbf4_2", "video_path": "bDuW49Lxbf4.mp4", "subtitle_path": "bDuW49Lxbf4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1843.51, "view_count": 568159}, {"video_id": "z1fshY-CwSY", "question": "Where else does the short-haired woman, who is wearing a white short-sleeve shirt and glasses, and standing in front of a bookshelf full of books at the beginning of the video, appear?", "question_wo_referring_query": "Where else does she appear?", "candidates": ["In a bathroom", "On an outdoor grass field", "Next to a 
wooden shelf with walls full of paintings and beside a television", "On the rooftop", "On the balcony"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "z1fshY-CwSY_0", "video_path": "z1fshY-CwSY.mp4", "subtitle_path": "z1fshY-CwSY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.77, "view_count": 372390}, {"video_id": "z1fshY-CwSY", "question": "The short-haired woman wearing glasses, dressed in a white short-sleeve shirt and sitting beside a wooden table with scissors and cards on it, also holding cards in her hands, with a gray kitchen cabinet behind her, appeared in which of the following scenes?", "question_wo_referring_query": "In which of the following scenes did she appear?", "candidates": ["In the living room with an orange sofa on the right, a green carpet on the wooden floor, and a black screen TV hanging on the white wall opposite the sofa", "In the library", "In the coffee shop", "On the rooftop terrace", "In the laundry room"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "z1fshY-CwSY_1", "video_path": "z1fshY-CwSY.mp4", "subtitle_path": "z1fshY-CwSY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.77, "view_count": 372390}, {"video_id": "z1fshY-CwSY", "question": "The background shows an animated scene on a television screen. There are many paintings hanging on the wall above the TV, and to the left of the TV is a wooden shelf filled with various items. A short-haired woman, wearing a short overalls and black jeans, is standing on the carpet in front of a red sofa. 
In which of the following scenes has she appeared before?", "question_wo_referring_query": "In which of the following scenes has she appeared before?", "candidates": ["In the supermarket.", "In the internet caf\u00e9.", "In the kitchen with a refrigerator full of magnets on the right side of the screen, a gray kitchen cabinet to the left of the refrigerator, and a white semi-transparent dish cabinet nested in the upper cabinet.", "In the game room.", "In the shopping mall."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "z1fshY-CwSY_2", "video_path": "z1fshY-CwSY.mp4", "subtitle_path": "z1fshY-CwSY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.77, "view_count": 372390}, {"video_id": "x0XCXGXGZMs", "question": "In front of a black background, there's a picture of a white goat in the upper left corner of the screen. Wearing a red tight short-sleeve shirt, a muscular black man with white and yellow English words printed on his shirt also appears simultaneously with which subtitles?", "question_wo_referring_query": "Along with which subtitles does he appear?", "candidates": ["\"stop fights the festivals are of course\"", "\"mated to the rather intense conditions\"", "\"another they even had a system of\"", "\"either way the largest groups of this\"", "\"which brings us to thank you Noah this\""], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "x0XCXGXGZMs_0", "video_path": "x0XCXGXGZMs.mp4", "subtitle_path": "x0XCXGXGZMs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1108.73, "view_count": 1158097}, {"video_id": "x0XCXGXGZMs", "question": "In front of a black background, there is an image of a transparent glass cup filled with coffee in the top left of the screen. 
What words appear at the same time as the man wearing a red short-sleeve shirt with a black pattern on it?", "question_wo_referring_query": "What words appear at the same time?", "candidates": ["\"you might be wondering what's the second\"", "\"a slower heart rate a third larger lung\"", "\"still live on in the indigenous\"", "\"the Incan Empire which had quite a wide\"", "\"just a few fishing vessels might have a\""], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "x0XCXGXGZMs_1", "video_path": "x0XCXGXGZMs.mp4", "subtitle_path": "x0XCXGXGZMs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1108.73, "view_count": 1158097}, {"video_id": "x0XCXGXGZMs", "question": "Standing in front of a black background, holding a dark red wooden pepper grinder, what captions appeared simultaneously with the long straight-haired girl wearing a white short suspender top and black pants?", "question_wo_referring_query": "What captions appeared simultaneously?", "candidates": ["\"gonna bring out in oh alright we're back\"", "\"Maundy the world's tallest flowering\"", "\"time for my triple shot espresso break\"", "\"that how much is it well once a limited\"", "\"national animal the llama and about 300\""], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "x0XCXGXGZMs_2", "video_path": "x0XCXGXGZMs.mp4", "subtitle_path": "x0XCXGXGZMs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1108.73, "view_count": 1158097}, {"video_id": "pZAsU54_fJ8", "question": "When the screen changes to a PPT with a white background and a woman in black clothing and glasses appears in the bottom right corner, what change occurs to the woman in white clothing sitting in front of the computer desk with black long straight hair, with a light green cup on the right side of the screen and two potted plants on the white cabinet behind 
her?", "question_wo_referring_query": "What change occurs to the woman in white clothing?", "candidates": ["The screen shrinks to the center of the video", "The screen shrinks to the top right corner of the video", "The screen shrinks to the bottom left corner of the video", "The screen shrinks to directly above the video", "The screen shrinks to the top left corner of the video"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "pZAsU54_fJ8_0", "video_path": "pZAsU54_fJ8.mp4", "subtitle_path": "pZAsU54_fJ8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1844.25, "view_count": 127400}, {"video_id": "pZAsU54_fJ8", "question": "At the beginning of the video, there is a female with long black straight hair, wearing a white top, sitting in front of a computer desk. To the right of the frame, there is a light green cup, and behind her is a white cabinet with two potted plants on it. When the video background changes to a light green pattern at the end, what changes occur to this female?", "question_wo_referring_query": "What changes occur to this female?", "candidates": ["The screen changes to a circular shape and shrinks to the upper right corner of the video", "The screen changes to an elliptical shape", "The screen changes to a triangular shape", "The screen changes to a square shape", "The screen changes to a heart shape"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "pZAsU54_fJ8_1", "video_path": "pZAsU54_fJ8.mp4", "subtitle_path": "pZAsU54_fJ8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1844.25, "view_count": 127400}, {"video_id": "pZAsU54_fJ8", "question": "On the white background PPT, there is a periodic table of chemical elements. In the upper right is a woman with straight black hair wearing a white top, and in the lower right is a woman wearing black long sleeves and glasses. 
Towards the end of the video, when the background changes to a light green, what change occurs to the woman in white?", "question_wo_referring_query": "What change occurs to the woman in white?", "candidates": ["The screen changes from a square to a circle", "The screen changes from a rectangle to a heart shape", "The screen changes from a rectangle to a circle", "The screen changes from a rectangle to a square", "The screen changes from a rectangle to a triangle"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "pZAsU54_fJ8_2", "video_path": "pZAsU54_fJ8.mp4", "subtitle_path": "pZAsU54_fJ8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1844.25, "view_count": 127400}, {"video_id": "X2sKt-WEIeU", "question": "A person with a tattooed arm places steaming hot green beans into a pot full of ice cubes. When the subtitles 'we've got this big uh bowl of ice cold' appear, what changes occur to the green beans?", "question_wo_referring_query": "What changes occur to the green beans?", "candidates": ["The green beans change from green to black", "The green beans change from vibrant green to yellow-green", "The green beans change from green to yellow", "The green beans change from yellow-green to vibrant green", "The green beans change from steaming hot to instantly cold"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "X2sKt-WEIeU_0", "video_path": "X2sKt-WEIeU.mp4", "subtitle_path": "X2sKt-WEIeU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.64, "view_count": 325516}, {"video_id": "X2sKt-WEIeU", "question": "On the wooden table, there is a transparent glass container holding green vegetable leaves. In front of the table stands a man wearing a black short-sleeved tattooed shirt. He is holding a yellow bottle in one hand and a spatula in the other. 
When the subtitle 'definitely got enough salt I should do' appears, what changes occur to the vegetable leaves in the container?", "question_wo_referring_query": "What changes occur to the vegetable leaves in the container?", "candidates": ["Changed from raw to cooked", "Changed from solid to liquid", "Changed from green to yellow", "Changed from yellow to green", "Changed from green to black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "X2sKt-WEIeU_1", "video_path": "X2sKt-WEIeU.mp4", "subtitle_path": "X2sKt-WEIeU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.64, "view_count": 325516}, {"video_id": "X2sKt-WEIeU", "question": "On a wooden table, there is a silver iron pot containing green salad leaves. A man with tattoos on his arms, wearing a black short-sleeved shirt, is tossing the green salad leaves in the pot with his hands. When the subtitle 'salads I got a few extra of these choots' appears, what change occurs to the green salad leaves?", "question_wo_referring_query": "What change occurs to the green salad leaves?", "candidates": ["The green salad leaves change from being in the iron pot to being on a white plate.", "The green salad leaves change from being in the iron pot to being in a flat-bottomed pan.", "The green salad leaves change from a solid to a liquid.", "The green salad leaves turn into yellow salad leaves.", "The green salad leaves change from being in the iron pot to being in a glass bowl."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "X2sKt-WEIeU_2", "video_path": "X2sKt-WEIeU.mp4", "subtitle_path": "X2sKt-WEIeU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.64, "view_count": 325516}, {"video_id": "g6R9KYOSqik", "question": "In the forest, beside a few trees, there is a man wearing a black short-sleeved shirt holding an axe. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Planting trees", "Cutting beef bones", "Chopping wood", "Placing the axe on the ground"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "g6R9KYOSqik_0", "video_path": "g6R9KYOSqik.mp4", "subtitle_path": "g6R9KYOSqik_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.86, "view_count": 772060}, {"video_id": "g6R9KYOSqik", "question": "In the scene, a man wearing a black short-sleeved shirt and dark green pants is standing next to a pile of rocks. There is a sheep standing in front of him. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Boiling water", "Feeding the sheep", "Watering flowers", "Chopping wood"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "g6R9KYOSqik_1", "video_path": "g6R9KYOSqik.mp4", "subtitle_path": "g6R9KYOSqik_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.86, "view_count": 772060}, {"video_id": "g6R9KYOSqik", "question": "In the scene, there are purple and white scallions on a yellow cutting board, and there are also bell peppers on the table. A person wearing a black short-sleeved shirt is holding a knife in their right hand. What is this person doing?", "question_wo_referring_query": "In the scene, there are purple and white scallions on a yellow cutting board, and there are also bell peppers on the table. A person wearing a black short-sleeved shirt is holding a knife in their right hand. 
What is this person doing?", "candidates": ["Cutting scallions", "Chopping wood", "Cutting meat", "Cutting bell peppers"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "g6R9KYOSqik_2", "video_path": "g6R9KYOSqik.mp4", "subtitle_path": "g6R9KYOSqik_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.86, "view_count": 772060}, {"video_id": "Y0qLv4q-Ma4", "question": "In the screen, inside a studio, there is a black man with long hair wearing a gray coat on the left side, and a man in a blue suit and a woman with short hair sitting on the right side. When the subtitles mention 'I mean you know the Ukrainians have', what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Transparent cup", "Purple scarf", "Stairs", "White cup"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Y0qLv4q-Ma4_0", "video_path": "Y0qLv4q-Ma4.mp4", "subtitle_path": "Y0qLv4q-Ma4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.4, "view_count": 12743}, {"video_id": "Y0qLv4q-Ma4", "question": "On the left side of the screen, there is a man wearing a dark blue suit and glasses. On the right side, there is a colorful picture. Below, there are two lines of subtitles. 
When the subtitle mentions 'signatures for it but I think that as,' what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["purple tie", "necklace", "red tie", "gray tie"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Y0qLv4q-Ma4_1", "video_path": "Y0qLv4q-Ma4.mp4", "subtitle_path": "Y0qLv4q-Ma4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.4, "view_count": 12743}, {"video_id": "Y0qLv4q-Ma4", "question": "In the middle left of the screen, there is a man with a beard wearing a dark blue suit. To his right, there is a short-haired woman wearing black clothes. On the right side of the screen, there is a colorful picture showing a man and a woman hugging each other. At the moment when the subtitle mentions 'feel I feel very sad that the roast is,' what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["red tie", "white cup", "dark gray coat", "glasses"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "Y0qLv4q-Ma4_2", "video_path": "Y0qLv4q-Ma4.mp4", "subtitle_path": "Y0qLv4q-Ma4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.4, "view_count": 12743}, {"video_id": "PimpJUhpzic", "question": "The left and right sides of the screen are surrounded by trees, and there is a person walking on a bridge in the middle. There is a small river underneath the bridge. 
What is the color of the bridge in the screen?", "question_wo_referring_query": "What is the color of the bridge in the screen?", "candidates": ["purple", "blue", "red", "green"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "PimpJUhpzic_0", "video_path": "PimpJUhpzic.mp4", "subtitle_path": "PimpJUhpzic_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.48, "view_count": 382016}, {"video_id": "PimpJUhpzic", "question": "There is a bed with a white and green bedspread in the picture. On the bed, there is a doll. On the bedside table, there is a green fan, books, a green plant, and a transparent storage box. What shape is the doll in the picture?", "question_wo_referring_query": "What shape is the doll in the picture?", "candidates": ["Rectangular", "Spherical", "Cylindrical", "Triangular"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "PimpJUhpzic_1", "video_path": "PimpJUhpzic.mp4", "subtitle_path": "PimpJUhpzic_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.48, "view_count": 382016}, {"video_id": "PimpJUhpzic", "question": "In the scene, a woman wearing a blue polka dot short-sleeve shirt and off-white pants is sitting on the ground. There are some orange flowers and a few trees in the background. 
What is the woman in the scene wearing on her head?", "question_wo_referring_query": "What is the woman in the scene wearing on her head?", "candidates": ["earphones", "headband", "hairclip", "hat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "PimpJUhpzic_2", "video_path": "PimpJUhpzic.mp4", "subtitle_path": "PimpJUhpzic_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.48, "view_count": 382016}, {"video_id": "NT8Qkc7mzX4", "question": "In the video, the top left corner shows a man wearing a black coat. On the right, there are two pictures. The left picture shows a woman wearing glasses and an old person wearing glasses. The right picture is a line chart. When the subtitle mentions 'download very well it shows thousands of,' what is the younger woman in the image wearing?", "question_wo_referring_query": "What is the younger woman in the image wearing?", "candidates": ["Light purple spaghetti strap", "Light purple short sleeve", "Light purple long sleeve", "White spaghetti strap"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "NT8Qkc7mzX4_0", "video_path": "NT8Qkc7mzX4.mp4", "subtitle_path": "NT8Qkc7mzX4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3291.4, "view_count": 560}, {"video_id": "NT8Qkc7mzX4", "question": "In the top left corner of the video, there is a man wearing a black jacket. On the right side, there is a picture of a man holding a child dressed in pink. 
What is the man in the picture on the right side of the screen wearing when the subtitle mentions 'yeah yeah actually a lot of that has to'?", "question_wo_referring_query": "What is the man in the picture on the right side of the screen wearing?", "candidates": ["suit", "black jacket", "black and gray striped short-sleeve shirt", "denim jacket"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "NT8Qkc7mzX4_1", "video_path": "NT8Qkc7mzX4.mp4", "subtitle_path": "NT8Qkc7mzX4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3291.4, "view_count": 560}, {"video_id": "NT8Qkc7mzX4", "question": "In the top left corner of the video, there is a man wearing a black jacket, and on the right side, there is a picture. In the picture, there is a man riding a horse. What pants is the man riding the horse wearing when the subtitle mentions 'for a lot of examples I show you today'?", "question_wo_referring_query": "What pants is the man riding the horse wearing?", "candidates": ["Gray pants", "Black shorts", "Gray shorts", "White shorts"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "NT8Qkc7mzX4_2", "video_path": "NT8Qkc7mzX4.mp4", "subtitle_path": "NT8Qkc7mzX4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3291.4, "view_count": 560}, {"video_id": "sTOotMuDQ8k", "question": "At the top left corner of the screen, it shows 14:55, a suit is hanging on the coat rack, a person is sitting on a grey sofa and is writing something with their head down. There is a black backpack on the sofa opposite. 
Who is the person writing?", "question_wo_referring_query": "Who is the person writing?", "candidates": ["A woman wearing a white short-sleeve shirt, black shorts, and a black cap.", "A woman wearing a black short-sleeve shirt and a cap.", "A man wearing a purple shirt and jeans.", "A man wearing a red shirt and black pants."], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "sTOotMuDQ8k_0", "video_path": "sTOotMuDQ8k.mp4", "subtitle_path": "sTOotMuDQ8k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1388.89, "view_count": 5302}, {"video_id": "sTOotMuDQ8k", "question": "In the scene, there is an open notebook on a white table. On the notebook, there's a C and an O connected by a black horizontal line. A hand holding a pen is placed above the notebook. Who is the person outlined in a pink frame?", "question_wo_referring_query": "Who is the person outlined in a pink frame?", "candidates": ["The man wearing a white short-sleeve shirt", "The woman wearing a black short-sleeve shirt", "The woman wearing a red top", "The black-haired woman wearing a black top"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "sTOotMuDQ8k_1", "video_path": "sTOotMuDQ8k.mp4", "subtitle_path": "sTOotMuDQ8k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1388.89, "view_count": 5302}, {"video_id": "sTOotMuDQ8k", "question": "In the scene, there is a book on the table with a chemical structure on it, and a hand holding a piece of paper placing it on top. 
Who is the person holding the piece of paper?", "question_wo_referring_query": "Who is the person holding the piece of paper?", "candidates": ["The woman in the blue suit", "The woman in the white short sleeves", "The woman in the black suit", "The woman in the red short sleeves"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "sTOotMuDQ8k_2", "video_path": "sTOotMuDQ8k.mp4", "subtitle_path": "sTOotMuDQ8k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1388.89, "view_count": 5302}, {"video_id": "-voCYf0SzKk", "question": "In the video, there is a white-haired woman wearing a black top staring at the screen. There is also a man wearing a grey top, a watch on his left wrist, and a ring. What is this man doing when he appears?", "question_wo_referring_query": "What is this man doing when he appears?", "candidates": ["He holds his fists in front of his chest", "He puts his left hand on his mouth", "He makes a peace sign with both hands", "He places his right hand on his forehead"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "-voCYf0SzKk_0", "video_path": "-voCYf0SzKk.mp4", "subtitle_path": "-voCYf0SzKk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1915.29, "view_count": 6926}, {"video_id": "-voCYf0SzKk", "question": "In the scene, a white-haired woman in a black top is looking at the screen with her head tilted. There is also a white-haired man in a gray top. 
What is this man doing when he appears?", "question_wo_referring_query": "What is this man doing when he appears?", "candidates": ["His right hand is placed on his forehead", "He is making a 'Yay' gesture with both hands", "He is holding a cup of water in his right hand and drinking", "He is clenching his fists in front of his chest"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "-voCYf0SzKk_1", "video_path": "-voCYf0SzKk.mp4", "subtitle_path": "-voCYf0SzKk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1915.29, "view_count": 6926}, {"video_id": "-voCYf0SzKk", "question": "In the scene, there is a room with a table. On one side of the table sits a man in a suit, and on the other side is an elderly woman wearing white clothes and a necklace. What is the man doing the first time he appears?", "question_wo_referring_query": "What is the man doing the first time he appears?", "candidates": ["The man has his hands crossed on the table", "His right hand is on his forehead", "He has his fists clenched in front of his chest", "He is making a 'V' sign with both hands"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "-voCYf0SzKk_2", "video_path": "-voCYf0SzKk.mp4", "subtitle_path": "-voCYf0SzKk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1915.29, "view_count": 6926}, {"video_id": "rl4nUngiR2k", "question": "In an English PPT, when mentioning 'okay Ashley large and then they somehow', what happened on this PPT?", "question_wo_referring_query": "What happened on this PPT?", "candidates": ["A red conical object appeared", "A dark green cylindrical object appeared", "A red cylindrical object appeared", "A dark green conical object appeared"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "rl4nUngiR2k_0", "video_path": "rl4nUngiR2k.mp4", 
"subtitle_path": "rl4nUngiR2k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1894.37, "view_count": 6970}, {"video_id": "rl4nUngiR2k", "question": "On the English PPT, when referring to 'method to do this blurt as they', what event occurred on this PPT?", "question_wo_referring_query": "What event occurred on this PPT?", "candidates": ["A segment of English text was highlighted with a red background", "A segment of English text was highlighted with a blue background", "A segment of English text was highlighted with a purple background", "A segment of English text was highlighted with a yellow background"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "rl4nUngiR2k_1", "video_path": "rl4nUngiR2k.mp4", "subtitle_path": "rl4nUngiR2k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1894.37, "view_count": 6970}, {"video_id": "rl4nUngiR2k", "question": "In the video, there are 2 line charts in the PPT. Below the charts, the text 'Test Set skew' is displayed. When 'for all of these it remains relatively' is mentioned, what event occurs in the PPT?", "question_wo_referring_query": "What event occurs in the PPT?", "candidates": ["Green circles appear on the charts.", "Many red circles and arrows appear on the charts.", "Purple circles appear on the charts.", "Blue circles appear on the charts."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "rl4nUngiR2k_2", "video_path": "rl4nUngiR2k.mp4", "subtitle_path": "rl4nUngiR2k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1894.37, "view_count": 6970}, {"video_id": "pHSr3zyl8YM", "question": "On the screen, in the top left corner, a man wearing a white shirt and an orange tie is standing next to an English PPT. The PPT features the word 'Motivations' in black font and includes two images. 
After the man extends his left arm, what happens to the PPT?", "question_wo_referring_query": "What happens to the PPT?", "candidates": ["The content on the PPT disappears", "A section of yellow text in English is added to the PPT", "A section of blue text in English is added to the PPT", "30 images appear on the PPT"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "pHSr3zyl8YM_0", "video_path": "pHSr3zyl8YM.mp4", "subtitle_path": "pHSr3zyl8YM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2664.7, "view_count": 1019}, {"video_id": "pHSr3zyl8YM", "question": "In the video, a man wearing a white shirt and an orange tie appears in the upper left corner. After he raises both his hands, what happens to the PPT?", "question_wo_referring_query": "What happens to the PPT?", "candidates": ["Two images are added to the PPT", "A page of advanced mathematical formulas is added to the PPT", "A section of English text in blue appears on the PPT", "A section of English text in yellow appears on the PPT"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "pHSr3zyl8YM_1", "video_path": "pHSr3zyl8YM.mp4", "subtitle_path": "pHSr3zyl8YM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2664.7, "view_count": 1019}, {"video_id": "pHSr3zyl8YM", "question": "In the video, there are 2 tables with blue backgrounds on the PPT. A man wearing a white shirt and an orange tie, with one hand clenched in a fist and the other hand raised and open, appears. 
After that, what happens to the PPT?", "question_wo_referring_query": "What happens to the PPT?", "candidates": ["Two more pictures appear on the PPT", "Five more pictures appear on the PPT", "A section of yellow-colored English text appears on the PPT", "A section of blue-colored English text appears on the PPT"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "pHSr3zyl8YM_2", "video_path": "pHSr3zyl8YM.mp4", "subtitle_path": "pHSr3zyl8YM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2664.7, "view_count": 1019}, {"video_id": "tq5oIIlop5s", "question": "In the scene, outdoors, a person is holding a mulberry. And a man wearing a dark blue top, holding an axe in his right hand and a tree mushroom in his left hand. Which of these two scenes appears first?", "question_wo_referring_query": "Which of these two scenes appears first?", "candidates": ["The scene with the person holding the tree mushroom appears first", "The scene with the person holding the mulberry appears first", "The scene with the person holding the black tree mushroom appears first", "The scene with the person holding the red mulberry appears first"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "tq5oIIlop5s_0", "video_path": "tq5oIIlop5s.mp4", "subtitle_path": "tq5oIIlop5s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.73, "view_count": 3755467}, {"video_id": "tq5oIIlop5s", "question": "In the video, there is a man climbing a tree in the outdoors, wearing a dark blue short-sleeved shirt and a hat. Another man, also wearing a hat and dressed in dark blue, is fixing three wooden sticks onto an upright branch. 
Which of these two scenes appears first?", "question_wo_referring_query": "Which of these two scenes appears first?", "candidates": ["The man climbing the tree appears first", "The scene of the man fixing the wooden sticks appears first", "Neither scene appears", "Both scenes appear simultaneously"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "tq5oIIlop5s_1", "video_path": "tq5oIIlop5s.mp4", "subtitle_path": "tq5oIIlop5s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.73, "view_count": 3755467}, {"video_id": "tq5oIIlop5s", "question": "In the wilderness, there is a small stream on a grassy patch. Nearby, there are two horses, one bay-colored and one gray. On a gray rock, there's a bay-colored plate with four chickens inside. A bowl with red sauce is placed next to it. Which of these two scenes appears first?", "question_wo_referring_query": "Which of these two scenes appears first?", "candidates": ["The scene with the plate containing chicken appears first", "Both scenes appear simultaneously", "Neither of the scenes appears", "The scene with the two horses appears first"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "tq5oIIlop5s_2", "video_path": "tq5oIIlop5s.mp4", "subtitle_path": "tq5oIIlop5s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.73, "view_count": 3755467}, {"video_id": "ElMiXjg0zXM", "question": "On the screen, a man in a dark green top is standing in front of a wallpaper-covered wall. 
After mentioning 'let's check for sure,' what action does this man take?", "question_wo_referring_query": "What action does this man take?", "candidates": ["He makes a 'Y' gesture with both hands", "He places his left hand on his forehead", "He clasps his hands together in front of his chest", "He holds a white bowl in his right hand, reaches into the bowl with his left hand, and closes his eyes"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "ElMiXjg0zXM_0", "video_path": "ElMiXjg0zXM.mp4", "subtitle_path": "ElMiXjg0zXM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2173.31, "view_count": 29364}, {"video_id": "ElMiXjg0zXM", "question": "In the scene, there is wallpaper on the wall, and a man wearing a dark green shirt. At the bottom of the picture in the upper right corner of the screen, there is a green-stained area. After mentioning 'one,' what action does this man take?", "question_wo_referring_query": "What action does this man take?", "candidates": ["He uses his left hand to hold a white bowl in his lap, and holds a piece of paper in front of the camera with his right hand, with '100' written on it.", "He uses his right hand to hold a white bowl in his lap, and holds a piece of paper in front of the camera with his left hand, with '10' written on it.", "He uses his left hand to hold a white bowl in his lap, and holds a piece of paper in front of the camera with his right hand, with '10' written on it.", "He uses his right hand to hold a white bowl in his lap, and holds a piece of paper in front of the camera with his left hand, with '100' written on it."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "ElMiXjg0zXM_1", "video_path": "ElMiXjg0zXM.mp4", "subtitle_path": "ElMiXjg0zXM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2173.31, "view_count": 29364}, {"video_id": 
"ElMiXjg0zXM", "question": "On the screen, there's wallpaper on the wall. After a man wearing a dark green top mentions 'Arkansas has some Mississippi River', what action does the man take?", "question_wo_referring_query": "What action does the man take?", "candidates": ["He holds a white bowl in his right arm, takes a piece of paper with '4500' written on it in his left hand, covering his face with his left shoulder.", "He holds a white bowl in his left arm, takes a piece of paper with '4500' written on it in his right hand, covering his face with his right shoulder.", "He holds a white bowl in his right arm, takes a piece of paper with '45' written on it in his left hand, covering his face with his left shoulder.", "He holds a white bowl in his left arm, takes a piece of paper with '45' written on it in his right hand, covering his face with his right shoulder."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "ElMiXjg0zXM_2", "video_path": "ElMiXjg0zXM.mp4", "subtitle_path": "ElMiXjg0zXM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2173.31, "view_count": 29364}, {"video_id": "l0zRU35sovQ", "question": "On a green olive map with blue vascular inserts, there are pictures of different people in the lower right corner of the screen. From this location, a white arrow runs horizontally to the upper left corner of the screen. 
After the mention of 'Believing Napoleon would now retreat towards Paris, the Allies decided to advance along,' what object appears in the middle of the screen?", "question_wo_referring_query": "What object appears in the middle of the screen?", "candidates": ["Two white dashed-line circles appear", "A single red dashed-line circle appears", "A single white dashed-line circle appears", "Two red circles appear"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "l0zRU35sovQ_0", "video_path": "l0zRU35sovQ.mp4", "subtitle_path": "l0zRU35sovQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1984.84, "view_count": 3379038}, {"video_id": "l0zRU35sovQ", "question": "On a greenish map with blue veins, in the middle of the screen there are photos of different people. In the upper right corner, there is a white date. After the mention of 'Any talk of Napoleon's defeat in late February was premature,' what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["Two white arrows appear", "One white arrow appears", "One yellow arrow appears", "Two yellow arrows appear"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "l0zRU35sovQ_1", "video_path": "l0zRU35sovQ.mp4", "subtitle_path": "l0zRU35sovQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1984.84, "view_count": 3379038}, {"video_id": "l0zRU35sovQ", "question": "In a scene with a green olive background, there is an image of a group of people riding horses and wearing armor in battle. 
After the phrase 'In desperate fight, Napoleon personally rallied fleeing troops, and exposed himself' is mentioned, what object appears on the screen?", "question_wo_referring_query": "what object appears on the screen?", "candidates": ["A print of a group of men in military uniforms burning a flag", "A pencil sketch of a group of men in military uniforms burning a flag", "A watercolor painting of a group of men in military uniforms burning a flag", "An oil painting of a group of men in military uniforms burning a flag"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "l0zRU35sovQ_2", "video_path": "l0zRU35sovQ.mp4", "subtitle_path": "l0zRU35sovQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1984.84, "view_count": 3379038}, {"video_id": "elRXytQJQdU", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a screen with a black background appears, with the text 'Related Work-VAE problem' at the top. On the left is a diagram with concentric circles of the same color on a coordinate axis, and on the right is a complete red circle. Then a screen with a blue background appears, with white text at the top. In the middle of the screen are six rows of different banana photos arranged within a large white background photo. Finally, a screen with the text 'Sample Generation' at the top appears; on the right is a large white background photo composed of 16 small photos.", "First, a screen with a black background appears, with the text 'Related Work-VAE problem' at the top. On the left is a diagram with concentric circles of the same color on a coordinate axis, and on the right is a complete red circle. Then a screen with the text 'Sample Generation' at the top appears; on the right is a large white background photo composed of 16 small photos. 
Finally, a screen with a blue background appears, with white text at the top. In the middle of the screen are six rows of different banana photos arranged within a large white background photo.", "First, a screen with the text 'Sample Generation' at the top appears; on the right is a large white background photo composed of 16 small photos. Then a screen with a black background appears, with the text 'Related Work-VAE problem' at the top. On the left is a diagram with concentric circles of the same color on a coordinate axis, and on the right is a complete red circle. Finally, a screen with a blue background appears, with white text at the top. In the middle of the screen are six rows of different banana photos arranged within a large white background photo.", "First, a screen with the text 'Sample Generation' at the top appears; on the right is a large white background photo composed of 16 small photos. Then a screen with a blue background appears, with white text at the top. In the middle of the screen are six rows of different banana photos arranged within a large white background photo. Finally, a screen with a black background appears, with the text 'Related Work-VAE problem' at the top. On the left is a diagram with concentric circles of the same color on a coordinate axis, and on the right is a complete red circle."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "elRXytQJQdU_0", "video_path": "elRXytQJQdU.mp4", "subtitle_path": "elRXytQJQdU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2815.03, "view_count": 3040}, {"video_id": "elRXytQJQdU", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["The first scene shows a black background with the words 'Related Work-VAE problem' at the top. 
On the left side, there is an image with concentric circles of the same color under a coordinate axis. On the right side, there is a complete yellow rectangle. Next, an image with different-sized black letters on a white background appears. At the bottom, some words are highlighted in blue. Finally, a deep blue background appears with two sections titled 'Training' and 'Architecture' in white letters.", "The first scene shows a deep blue background with two sections titled 'Training' and 'Architecture' in white letters. Next, a black background appears with the words 'Related Work-VAE problem' at the top. On the left side, there is an image with concentric circles of the same color under a coordinate axis. On the right side, there is a complete yellow rectangle. Finally, an image with different-sized black letters on a white background appears. At the bottom, some words are highlighted in blue.", "The first scene shows an image with different-sized black letters on a white background. At the bottom, some words are highlighted in blue. Next, a black background appears with the words 'Related Work-VAE problem' at the top. On the left side, there is an image with concentric circles of the same color under a coordinate axis. On the right side, there is a complete yellow rectangle. Finally, a deep blue background appears with two sections titled 'Training' and 'Architecture' in white letters.", "The first scene shows a black background with the words 'Related Work-VAE problem' at the top. On the left side, there is an image with concentric circles of the same color under a coordinate axis. On the right side, there is a complete yellow rectangle. Next, a deep blue background appears with two sections titled 'Training' and 'Architecture' in white letters. Finally, an image with different-sized black letters on a white background appears. 
At the bottom, some words are highlighted in blue."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "elRXytQJQdU_1", "video_path": "elRXytQJQdU.mp4", "subtitle_path": "elRXytQJQdU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2815.03, "view_count": 3040}, {"video_id": "elRXytQJQdU", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a screen with a black background appears with the words 'Related Work-VAE problem' at the top. On the left side is a plot with coordinate axes and circular markers of the same color, and on the right side is a blue rectangular plot. Then, a screen with a white background in a PPT format appears. The left side of this screen contains a small photo of a dog. Finally, another screen with a black background appears. On the left side of this screen is a photo of an airplane, and on the right side is a white-colored formula.", "First, a screen with a black background appears with the words 'Related Work-VAE problem' at the top. On the left side is a plot with coordinate axes and circular markers of the same color, and on the right side is a blue rectangular plot. Then, another screen with a black background appears. On the left side of this screen is a photo of an airplane, and on the right side is a white-colored formula. Finally, a screen with a white background in a PPT format appears. The left side of this screen contains a small photo of a dog.", "First, a screen with a white background in a PPT format appears. The left side contains a small photo of a dog. Then, a screen with a black background appears. On the left side is a photo of an airplane, and on the right side is a white-colored formula. Finally, another screen with a black background appears with the words 'Related Work-VAE problem' at the top. 
On the left side is a plot with coordinate axes and circular markers of the same color, and on the right side is a blue rectangular plot.", "First, a screen with a white background in a PPT format appears. The left side contains a small photo of a dog. Then, a screen with a black background appears with the words 'Related Work-VAE problem' at the top. On the left side is a plot with coordinate axes and circular markers of the same color, and on the right side is a blue rectangular plot. Finally, another screen with a black background appears. The left side contains a photo of an airplane, and on the right side is a white-colored formula."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "elRXytQJQdU_2", "video_path": "elRXytQJQdU.mp4", "subtitle_path": "elRXytQJQdU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2815.03, "view_count": 3040}, {"video_id": "5gOAKubzwf8", "question": "There are two pictures on the screen. On the left is a short-haired woman wearing a white top with a yellow accessory around her neck. On the right is a man wearing a blue shirt and glasses. In which of the following scenes does this short-haired woman appear?", "question_wo_referring_query": "In which of the following scenes does this short-haired woman appear?", "candidates": ["Appears on the school field", "Appears in the kitchen", "Appears in front of a refrigerator", "Appears in a broadcast room with three women together"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "5gOAKubzwf8_0", "video_path": "5gOAKubzwf8.mp4", "subtitle_path": "5gOAKubzwf8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.88, "view_count": 7548}, {"video_id": "5gOAKubzwf8", "question": "In the video, there is a long-haired woman wearing a pink outer jacket, with a white inner garment. 
The display in the upper left corner of the screen shows 'ADDRESS'. In which of the following scenes does 'ADDRESS' appear on the screen?", "question_wo_referring_query": "In which of the following scenes does 'ADDRESS' appear on the screen?", "candidates": ["Behind a man wearing a white short-sleeved shirt and glasses", "Behind a man wearing a black short-sleeved shirt and glasses", "Behind a woman wearing a black short-sleeved shirt and glasses", "Behind a woman wearing a white short-sleeved shirt and glasses"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "5gOAKubzwf8_1", "video_path": "5gOAKubzwf8.mp4", "subtitle_path": "5gOAKubzwf8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.88, "view_count": 7548}, {"video_id": "5gOAKubzwf8", "question": "In the video, there is a long-haired girl wearing a pink coat with a white inner garment. On the front, there is white text outlined in black, and below the white frame, there is deep blue text outlined in yellow. In which of the following scenes does this long-haired girl appear?", "question_wo_referring_query": "In which of the following scenes does this long-haired girl appear?", "candidates": ["Appears in a broadcast room with three women together", "Appears in the park", "Appears on the school field", "Appears in the kitchen"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "5gOAKubzwf8_2", "video_path": "5gOAKubzwf8.mp4", "subtitle_path": "5gOAKubzwf8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.88, "view_count": 7548}, {"video_id": "56UeS4HZN64", "question": "In a room, a woman is sitting on a bed, outside the window is a lush forest, and a man with tied hair is standing in front of a pool. 
Which subtitles appear along with this woman sitting on the bed?", "question_wo_referring_query": "Which subtitles appear along with the woman sitting on the bed?", "candidates": ["Sussex", "excited to be here and and explore", "PR stay and for having us I am so\n", "Countryside I am very lucky to call"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "56UeS4HZN64_1", "video_path": "56UeS4HZN64.mp4", "subtitle_path": "56UeS4HZN64_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1114.42, "view_count": 25770}, {"video_id": "56UeS4HZN64", "question": "Surrounded by green trees, there is a wooden table, a black table, and a black chair. A woman in a white top is sitting on the black chair. Which subtitles appear together with this woman?", "question_wo_referring_query": "Which subtitles appear together with this woman?", "candidates": ["Foods", "excited to be here and and explore", "Countryside I am very lucky to call", "PR stay and for having us I am so"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "56UeS4HZN64_2", "video_path": "56UeS4HZN64.mp4", "subtitle_path": "56UeS4HZN64_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1114.42, "view_count": 25770}, {"video_id": "H4jA_SN4hgg", "question": "In the video, a woman with long hair, wearing a black coat and a hat, is standing behind a wolf statue. She is holding a silver cup. There is also a man in a checkered shirt next to the wolf statue. Behind them, there is a red house and a car. 
Towards the end of the video, what changes occur to the woman's outfit in the kitchen?", "question_wo_referring_query": "What changes occur to the woman's outfit in the kitchen towards the end of the video?", "candidates": ["The black coat changes to a black coat with a white tiger pattern on the front.", "The black coat changes to a black coat with a white dragon pattern on the front.", "The woman takes off the outer armor.", "The black coat changes to a black coat with a white cat pattern on the front."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "H4jA_SN4hgg_0", "video_path": "H4jA_SN4hgg.mp4", "subtitle_path": "H4jA_SN4hgg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1375.17, "view_count": 6406242}, {"video_id": "H4jA_SN4hgg", "question": "A man in a white long-sleeved lab coat is sitting in front of a wire mesh. He is wearing blue jeans and a silver necklace. His left hand is open with the palm facing down. In the later part of the video, what change occurs in the man's attire as he sits on the sofa?", "question_wo_referring_query": "In the later part of the video, what change occurs in the man's attire as he sits on the sofa?", "candidates": ["Wears an olive-colored coat over the lab coat", "Wears a red coat over the lab coat", "Wears a blue coat over the lab coat", "Wears a black coat over the lab coat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "H4jA_SN4hgg_1", "video_path": "H4jA_SN4hgg.mp4", "subtitle_path": "H4jA_SN4hgg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1375.17, "view_count": 6406242}, {"video_id": "H4jA_SN4hgg", "question": "In a kitchen, there is a man wearing a black short-sleeve shirt. He raises his right index finger, and next to him stands a woman wearing a hat and holding a white bottle. 
Later in the video, what changes occur in the attire of the man, the woman, and the wolf walking together?", "question_wo_referring_query": "Later in the video, what changes occur in the attire of the man, the woman, and the wolf walking together?", "candidates": ["He has a red shirt tied around his waist.", "The blue jeans have changed to black jeans.", "He has a black shirt tied around his waist.", "The man has a white shirt tied around his waist."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "H4jA_SN4hgg_2", "video_path": "H4jA_SN4hgg.mp4", "subtitle_path": "H4jA_SN4hgg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1375.17, "view_count": 6406242}, {"video_id": "xkGyETwGbmI", "question": "On a map with intersecting vertical and horizontal roads, there is a red car driving on a deep blue road. When '[Music]' is mentioned, what change happens to the car?", "question_wo_referring_query": "What change happens to the car?", "candidates": ["It rotated counterclockwise by 180\u00b0", "It rotated counterclockwise by 270\u00b0", "It rotated clockwise by 180\u00b0", "It rotated clockwise by 270\u00b0"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "xkGyETwGbmI_0", "video_path": "xkGyETwGbmI.mp4", "subtitle_path": "xkGyETwGbmI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1043.81, "view_count": 6921}, {"video_id": "xkGyETwGbmI", "question": "At the bottom of a screen with many houses, there is a white box. Inside the box, on the left side, there is a circular image, and on the right side, there is a blue thumbs-up image. There is also a red subscribe button and a bell icon. 
When 'forget to hit that button share' is mentioned, what changes on the screen?", "question_wo_referring_query": "What changes on the screen?", "candidates": ["A cursor appears clicking on the red subscribe button", "A cursor appears clicking on the bell icon", "A cursor appears clicking on the blue thumbs-up image", "The screen does not change"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "xkGyETwGbmI_1", "video_path": "xkGyETwGbmI.mp4", "subtitle_path": "xkGyETwGbmI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1043.81, "view_count": 6921}, {"video_id": "xkGyETwGbmI", "question": "On the right side of a screen divided by a long river, there is a prominent round-topped building, and in the lower right corner of the screen, there are bold letters 'EXOTIC VACATION.' What change occurs on the screen when 'explore Florence is from April to June' is mentioned?", "question_wo_referring_query": "What change occurs on the screen?", "candidates": ["A yellow box with white text appears", "A red box with black text appears", "A red box with white text appears", "A yellow box with black text appears"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "xkGyETwGbmI_2", "video_path": "xkGyETwGbmI.mp4", "subtitle_path": "xkGyETwGbmI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1043.81, "view_count": 6921}, {"video_id": "pf7QtUngK_E", "question": "In the video, there is a man wearing a blue lab coat in a room with the Earth as the background. His left palm is facing up and open, his mouth is open. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is putting the book on top of his head", "He is placing the book on the bookshelf", "He is placing the book on the table", "He is raising the book in his hand"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "pf7QtUngK_E_0", "video_path": "pf7QtUngK_E.mp4", "subtitle_path": "pf7QtUngK_E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 940.4, "view_count": 148565}, {"video_id": "pf7QtUngK_E", "question": "In the scene, there are many photos pieced together, four of which are about astronauts. There is also a photo of a man in a suit standing at a podium. Many of these photos have English words on them. What is the man in the black suit doing?", "question_wo_referring_query": "What is the man in the black suit doing?", "candidates": ["He is making a V sign with both hands.", "He is clenching his fists.", "The man is laying his hands flat on the podium.", "He is holding his forehead with his left hand."], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "pf7QtUngK_E_1", "video_path": "pf7QtUngK_E.mp4", "subtitle_path": "pf7QtUngK_E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 940.4, "view_count": 148565}, {"video_id": "pf7QtUngK_E", "question": "In the scene, in a room with a backdrop of Earth, there is a man wearing a green uniform. Next to him, there is an image similar to a smartphone screenshot. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is making a 'Yay' gesture with both hands", "Both of his hands are pointing forward, and he is wearing a ring on his left hand", "He is holding his forehead with his right hand", "He is holding his fists in front of his chest"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "pf7QtUngK_E_2", "video_path": "pf7QtUngK_E.mp4", "subtitle_path": "pf7QtUngK_E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 940.4, "view_count": 148565}, {"video_id": "IZysBXLTLqQ", "question": "In the scene, there is a patch of trees under the sky, a lake in the distance, and a yellow arrow pointing diagonally downward with the words 'BLACK HILL' in bold black above it. What other objects appear in the scene?", "question_wo_referring_query": "What other objects appear in the scene?", "candidates": ["telephone pole", "sun", "house", "airplane"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "IZysBXLTLqQ_0", "video_path": "IZysBXLTLqQ.mp4", "subtitle_path": "IZysBXLTLqQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1542.81, "view_count": 143682}, {"video_id": "IZysBXLTLqQ", "question": "The screen shows a blue sky with a thick yellow line diagonally positioned at the bottom. Above the yellow line are the letters TEMPERANCEREEF in yellow. 
What other objects are visible on the screen?", "question_wo_referring_query": "What other objects are visible on the screen?", "candidates": ["Windmill", "Power Pole", "Large Tree", "Airplane"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "IZysBXLTLqQ_1", "video_path": "IZysBXLTLqQ.mp4", "subtitle_path": "IZysBXLTLqQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1542.81, "view_count": 143682}, {"video_id": "IZysBXLTLqQ", "question": "There are many big trees on the screen, with a small path among the trees. There is a yellow triangle pointing towards this small path. Next to the triangle is a black frame with yellow text inside. What other objects are on the screen?", "question_wo_referring_query": "What other objects are on the screen?", "candidates": ["Airplane", "Grass", "Electric poles", "House"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "IZysBXLTLqQ_2", "video_path": "IZysBXLTLqQ.mp4", "subtitle_path": "IZysBXLTLqQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1542.81, "view_count": 143682}, {"video_id": "lkUcONzD-b8", "question": "In a room, there is a man sitting on a gray sofa. He is wearing black clothes and gray shorts. There is also a blue pillow on the sofa. In the top left corner, there is the word 'self-care'. 
What clothes is this man wearing?", "question_wo_referring_query": "What clothes is this man wearing?", "candidates": ["Sleeveless top", "Down coat", "Robe", "Raincoat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "lkUcONzD-b8_0", "video_path": "lkUcONzD-b8.mp4", "subtitle_path": "lkUcONzD-b8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1365.33, "view_count": 543985}, {"video_id": "lkUcONzD-b8", "question": "Next to the red building, there is a person wearing a floral top, carrying a backpack, with glasses, purple-red hair, and white-blue earphones. What is the color of the backpack?", "question_wo_referring_query": "What is the color of the backpack?", "candidates": ["red", "orange", "black", "purple"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "lkUcONzD-b8_1", "video_path": "lkUcONzD-b8.mp4", "subtitle_path": "lkUcONzD-b8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1365.33, "view_count": 543985}, {"video_id": "lkUcONzD-b8", "question": "On the screen, in a white room, there is a woman wearing a blue top, glasses, and has purple hair. The text on the screen says 'linh: hello everyone, I am in my bathroom.' What hairstyle is the woman wearing?", "question_wo_referring_query": "What hairstyle is the woman wearing?", "candidates": ["Bun", "High ponytail", "Loose waves", "Low ponytail"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "lkUcONzD-b8_2", "video_path": "lkUcONzD-b8.mp4", "subtitle_path": "lkUcONzD-b8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1365.33, "view_count": 543985}, {"video_id": "_t7qE-5OIxg", "question": "In the scene, there is a sofa in a room, and in front of the sofa, there is a man wearing a white patterned shirt and white pants. 
When mentioning 'Just give me a chance', what is the material of the pants?", "question_wo_referring_query": "What is the material of the pants?", "candidates": ["Cotton", "Leather", "Silk", "Linen"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "_t7qE-5OIxg_0", "video_path": "_t7qE-5OIxg.mp4", "subtitle_path": "_t7qE-5OIxg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1347.89, "view_count": 726449}, {"video_id": "_t7qE-5OIxg", "question": "In the video, a man wearing a white outfit and a white hat is shown wearing a silver necklace. There is a microphone in front of him. When mentioning 'The Pope's comments came weeks after the European parliament,' what is the shape of his necklace?", "question_wo_referring_query": "What is the shape of his necklace?", "candidates": ["Ball-shaped", "Rectangular", "Cross-shaped", "Round"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "_t7qE-5OIxg_1", "video_path": "_t7qE-5OIxg.mp4", "subtitle_path": "_t7qE-5OIxg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1347.89, "view_count": 726449}, {"video_id": "_t7qE-5OIxg", "question": "In the scene where there is a man wearing a blue shirt with a black suit jacket and a hat, and a blond child wearing white clothes in a room with red walls, what color is the hat mentioned when referring to 'Age lecturer Luc Jurer and another man named Joseph di'?", "question_wo_referring_query": "What color is the hat?", "candidates": ["black", "red", "purple", "gold"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "_t7qE-5OIxg_2", "video_path": "_t7qE-5OIxg.mp4", "subtitle_path": "_t7qE-5OIxg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1347.89, "view_count": 726449}, {"video_id": "AWKmazrRIwA", "question": "In the top left 
corner where a man in blue shirt is speaking, there is a picture of three men in blue short-sleeved shirts looking at a man holding a black object. Who is the man holding the black object?", "question_wo_referring_query": "Who is the man holding the black object?", "candidates": ["A man in a white short-sleeved shirt and black pants", "A man in a blue short-sleeved shirt and black pants", "A man in a white short-sleeved shirt and blue jeans", "A man in a blue short-sleeved shirt and blue jeans"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "AWKmazrRIwA_0", "video_path": "AWKmazrRIwA.mp4", "subtitle_path": "AWKmazrRIwA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1236.95, "view_count": 6217526}, {"video_id": "AWKmazrRIwA", "question": "In the upper left corner where a man in a blue shirt is talking, there's a picture of a family sitting beside a white table, with three elderly people playing chess in front of them. Who is the person wearing a wristwatch while playing chess?", "question_wo_referring_query": "Who is the person wearing a wristwatch while playing chess?", "candidates": ["A man wearing a light green short-sleeved shirt", "A woman wearing black and white stripes", "A man wearing a white shirt with black stripes", "A man wearing a gray shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "AWKmazrRIwA_1", "video_path": "AWKmazrRIwA.mp4", "subtitle_path": "AWKmazrRIwA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1236.95, "view_count": 6217526}, {"video_id": "AWKmazrRIwA", "question": "In the marble patterned background, there is a black-haired woman with a slight smile, crossing her arms over her chest. 
Who is this person?", "question_wo_referring_query": "Who is the person doing this action?", "candidates": ["A woman wearing a black, sleeveless dress", "A woman wearing a white, short-sleeved dress", "A woman wearing a white, long-sleeved dress", "A woman wearing a white, sleeveless dress"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "AWKmazrRIwA_2", "video_path": "AWKmazrRIwA.mp4", "subtitle_path": "AWKmazrRIwA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1236.95, "view_count": 6217526}, {"video_id": "VmqZvlj07-w", "question": "In the video, there's a man wearing a white shirt, khaki pants, and glasses standing in front of a building. What is he doing the first time he appears?", "question_wo_referring_query": "What is he doing the first time he appears?", "candidates": ["He is running", "He is cooking", "He has his hands on his hips", "He is dancing"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "VmqZvlj07-w_0", "video_path": "VmqZvlj07-w.mp4", "subtitle_path": "VmqZvlj07-w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 903.2, "view_count": 142614}, {"video_id": "VmqZvlj07-w", "question": "In the scene, there is a red curtain. One person is wearing a brown shirt with a white jacket over it and a white hat. Another person is wearing a gray coat and gray hat. At the desk, a woman sits wearing a pink and white polka-dot dress with a pink and white hat. 
What is this woman doing when she appears for the first time?", "question_wo_referring_query": "What is the woman doing when she appears for the first time?", "candidates": ["She is eating", "She is running", "She is writing something", "She is dancing"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "VmqZvlj07-w_1", "video_path": "VmqZvlj07-w.mp4", "subtitle_path": "VmqZvlj07-w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 903.2, "view_count": 142614}, {"video_id": "VmqZvlj07-w", "question": "In the video, on a rainy day, there's a person in a black coat and white shorts pushing a bicycle. Next to this person is someone else dressed in black. Ahead of them, there's a man wearing a short-sleeved shirt and a black hat, holding a white object in his right hand. What is the man holding the white object doing the first time he appears?", "question_wo_referring_query": "What is the man holding the white object doing the first time he appears?", "candidates": ["Eating", "Writing", "Walking", "Dancing"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "VmqZvlj07-w_2", "video_path": "VmqZvlj07-w.mp4", "subtitle_path": "VmqZvlj07-w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 903.2, "view_count": 142614}, {"video_id": "UkpN0S5D1q8", "question": "There is a picture of a bird on the screen, and on the right side is a woman in white with her left index finger on her mouth. 
When the subtitle says 'because as the pheasant leans forward to', what is the bird doing on the screen?", "question_wo_referring_query": "What is the bird doing on the screen?", "candidates": ["extending its neck to drink water", "extending its neck to eat a fish", "spreading its wings to fly", "spreading its wings to run"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "UkpN0S5D1q8_0", "video_path": "UkpN0S5D1q8.mp4", "subtitle_path": "UkpN0S5D1q8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1837.84, "view_count": 22566}, {"video_id": "UkpN0S5D1q8", "question": "There are 3 images on the screen; the one on the left is a colored picture, the middle one is a dark-bordered one, and the one on the right features a white-haired woman. What action does this white-haired woman perform when the subtitle says 'the will leave the impression'?", "question_wo_referring_query": "What action does this white-haired woman perform?", "candidates": ["Both hands spread downwards", "Both hands clenched into fists and pressed together", "Fingers spread and interlaced", "Both hands spread upwards"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "UkpN0S5D1q8_1", "video_path": "UkpN0S5D1q8.mp4", "subtitle_path": "UkpN0S5D1q8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1837.84, "view_count": 22566}, {"video_id": "UkpN0S5D1q8", "question": "On the left side of the screen, there is a colorful image with two birds. One of the birds is stretching its neck as if to drink water. On the right side, there is a picture of a woman in white. 
When the subtitle reads \u2018do you see that as the water falls down,\u2019 what action does the woman with gray hair perform?", "question_wo_referring_query": "What action does the gray-haired woman perform?", "candidates": ["Points forward with right index finger", "Points upward with left middle finger", "Points forward with right middle finger", "Raises left hand"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "UkpN0S5D1q8_2", "video_path": "UkpN0S5D1q8.mp4", "subtitle_path": "UkpN0S5D1q8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1837.84, "view_count": 22566}, {"video_id": "m2uPFNdMiqM", "question": "There is a man wearing a black suit and glasses in the middle of the screen, inside a blue and white frame. Below are several red vertical lines. After the man in the screen looks to the right, what does he do?", "question_wo_referring_query": "After the man in the screen looks to the right, what does he do?", "candidates": ["Eating", "Dancing", "Standing", "Wearing a black suit, sitting in front of a podium, giving a speech"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "m2uPFNdMiqM_0", "video_path": "m2uPFNdMiqM.mp4", "subtitle_path": "m2uPFNdMiqM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1618.6, "view_count": 999891}, {"video_id": "m2uPFNdMiqM", "question": "There is a man in the video wearing black clothes and a hat. His hands are open, with a piece of jewelry on his left hand. 
What did he do after this action?", "question_wo_referring_query": "What did he do after this action?", "candidates": ["Crouching", "Speaking while holding a microphone on stage", "Sleeping", "Dancing"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "m2uPFNdMiqM_1", "video_path": "m2uPFNdMiqM.mp4", "subtitle_path": "m2uPFNdMiqM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1618.6, "view_count": 999891}, {"video_id": "m2uPFNdMiqM", "question": "In the scene, there's a man wearing only jeans sitting by the window. On the bed, there's a woman sitting. What did the man in the video do after facing the woman?", "question_wo_referring_query": "What did the man in the video do after facing the woman?", "candidates": ["Kiss", "Shake hands", "Eat", "Hug"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "m2uPFNdMiqM_2", "video_path": "m2uPFNdMiqM.mp4", "subtitle_path": "m2uPFNdMiqM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1618.6, "view_count": 999891}, {"video_id": "pr8-agDG8sI", "question": "In the screen, there is a monkey inside the red basket. There are also 2 white puppies on the yellow floor. 
Which scene appears first?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["The monkey inside the red basket appears first", "The 2 white puppies appear first", "A black puppy appears first", "The tiger inside the basket appears first"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "pr8-agDG8sI_0", "video_path": "pr8-agDG8sI.mp4", "subtitle_path": "pr8-agDG8sI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1293.88, "view_count": 317587}, {"video_id": "pr8-agDG8sI", "question": "In the video, who appears first among a man standing in front of a mirror wearing a white short-sleeved shirt and holding a black camera and a woman wearing a white tank top and holding an object, and a man outside wearing a white top, a black backpack, a military green hat, and sunglasses, and a woman wearing an olive top and sunglasses?", "question_wo_referring_query": "Who appears first among these two scenes?", "candidates": ["The man standing in front of a mirror wearing a white short-sleeved shirt and holding a black camera, and the woman wearing a white tank top and holding an object, appear first", "The man standing in front of a mirror wearing a black short-sleeved shirt and holding a white camera, and the woman wearing a blue tank top and holding an object, appear first", "The man outside wearing a black top, a white backpack, a black hat, and sunglasses, and the woman wearing an olive top and sunglasses, appear first", "The man outside wearing a white top, a black backpack, a military green hat, and sunglasses, and the woman wearing an olive top and sunglasses, appear first"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "pr8-agDG8sI_1", "video_path": "pr8-agDG8sI.mp4", "subtitle_path": "pr8-agDG8sI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1293.88, 
"view_count": 317587}, {"video_id": "pr8-agDG8sI", "question": "In the video, there is a wooden dining table with 2 cups of water on it, and sitting at the table are a woman with long hair wearing a white tank top and a man wearing a white short-sleeved shirt. On the road, there is a man wearing a white top and a woman wearing a black helmet. Which of these two scenes appears first?", "question_wo_referring_query": "Which of these two scenes appears first?", "candidates": ["The scene with two cups of water on a wooden dining table, with a woman wearing a white tank top and a man wearing a white short-sleeved shirt sitting at the table, appears first.", "The scene with two cups of water on a wooden dining table, with a woman wearing a black tank top and a man wearing a black short-sleeved shirt sitting at the table, appears first.", "The scene with a man wearing a black top and a woman wearing a pink helmet on the road appears first.", "The scene with a man wearing a white top and a woman wearing a black helmet on the road appears first."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "pr8-agDG8sI_2", "video_path": "pr8-agDG8sI.mp4", "subtitle_path": "pr8-agDG8sI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1293.88, "view_count": 317587}, {"video_id": "ZBD5ToU5HBI", "question": "In a room filled with items, there is a woman wearing a sleeveless red checkered top, gray headphones, a gray backpack, and blue jeans. 
After mentioning the 'Applause' subtitle, what did this woman do?", "question_wo_referring_query": "What did this woman do?", "candidates": ["She fished outside", "She washed things in the kitchen", "She taught in the classroom", "She slept in the living room"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ZBD5ToU5HBI_0", "video_path": "ZBD5ToU5HBI.mp4", "subtitle_path": "ZBD5ToU5HBI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1887.15, "view_count": 269903}, {"video_id": "ZBD5ToU5HBI", "question": "In the frame, in a room, what did the woman wearing a light yellow short sleeve, glasses, and grey headband-style earphones, holding a mobile phone in her right hand, do after speaking the subtitle 'okay y'all you know how Google photos or'?", "question_wo_referring_query": "What did this woman do?", "candidates": ["Laying on the bed", "Cooking", "Running", "Dancing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ZBD5ToU5HBI_1", "video_path": "ZBD5ToU5HBI.mp4", "subtitle_path": "ZBD5ToU5HBI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1887.15, "view_count": 269903}, {"video_id": "ZBD5ToU5HBI", "question": "In the scene, on the grass, there are 5 people standing. Among them, there is a woman wearing short sleeves and shorts with a black cross-body bag. She gives a thumbs up with her left hand pointing towards the camera. 
After the white English subtitles mentioning 'Laughter' appear at the top of the screen, what did this woman do?", "question_wo_referring_query": "What did this woman do?", "candidates": ["She was cooking", "She sat on a wooden peg", "She was running", "She was dancing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "ZBD5ToU5HBI_2", "video_path": "ZBD5ToU5HBI.mp4", "subtitle_path": "ZBD5ToU5HBI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1887.15, "view_count": 269903}, {"video_id": "s4zV6vtR3SU", "question": "On the right side of a clip showing black trees and grasslands, there's a majestic stone lion. Following the mention of 'country around by kidnapping young women', what appears on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["A clip of a woman dancing appears", "A black clip of a dragon appears", "A black clip of a leopard appears", "A black clip of a soldier wearing armor and holding a spear appears"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "s4zV6vtR3SU_0", "video_path": "s4zV6vtR3SU.mp4", "subtitle_path": "s4zV6vtR3SU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2112.2, "view_count": 16214}, {"video_id": "s4zV6vtR3SU", "question": "On the right side of an orange blurry screen, there is a statue of two naked men fighting. 
After mentioning 'first slow arushian in orthus,' what appears on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["Five clips of different types of black animals appear", "A clip of a horse and three clips of the same type of black animal appear", "Four clips of the same type of black animal appear", "Five clips of the same type of black animal appear"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "s4zV6vtR3SU_1", "video_path": "s4zV6vtR3SU.mp4", "subtitle_path": "s4zV6vtR3SU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2112.2, "view_count": 16214}, {"video_id": "s4zV6vtR3SU", "question": "On the left side of an orange blurry screen, there is a tree with many fruits. After mentioning 'from the garden of hesperides the tree', what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["A black silhouette of a dog appears", "A black silhouette of a bull appears", "A black silhouette of a cat appears", "A black silhouette of a dragon appears"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "s4zV6vtR3SU_2", "video_path": "s4zV6vtR3SU.mp4", "subtitle_path": "s4zV6vtR3SU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2112.2, "view_count": 16214}, {"video_id": "1ZGfUJML4Os", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, in the middle of a yellow building, there is a blonde woman wearing a blue shirt and black pants walking on a road. Next, on the wall of the yellow building, there are three sculptures. 
Finally, there appears a man wearing a dark blue hoodie and a white cap in front of a row of buildings.", "First, there appears a man wearing a dark blue hoodie and a white cap in front of a row of buildings. Next, a blonde woman appears. Finally, on the wall of the yellow building, there are three sculptures.", "First, in the middle of a yellow building, there is a blonde woman wearing a blue shirt and black pants walking on a road. Next, the same woman appears again. Finally, on the wall of the yellow building, there are three sculptures.", "First, there appears a man wearing a dark blue hoodie and a white cap in front of a row of buildings. Next, in the middle of a yellow building, there is a blonde woman wearing a blue shirt and black pants walking on a road. Finally, on the wall of the yellow building, there are three sculptures."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "1ZGfUJML4Os_0", "video_path": "1ZGfUJML4Os.mp4", "subtitle_path": "1ZGfUJML4Os_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1096.63, "view_count": 16625}, {"video_id": "1ZGfUJML4Os", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a round green pond appears. Then, inside a room, there is a man wearing a denim jacket, a white hat, and holding a black camera. Finally, next to this pond, there is a golden retriever.", "First, next to this pond, there is a golden retriever. Then, inside a room, there is a man wearing a denim jacket, a white hat, and holding a black camera. Finally, a round green pond appears.", "First, a round green pond appears. Then, next to this pond, there is a golden retriever. Finally, inside a room, there is a man wearing a denim jacket, a white hat, and holding a black camera.", "First, next to this pond, there is a golden retriever. 
Then, a round green pond appears. Finally, inside a room, there is a man wearing a denim jacket, a white hat, and holding a black camera."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "1ZGfUJML4Os_1", "video_path": "1ZGfUJML4Os.mp4", "subtitle_path": "1ZGfUJML4Os_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1096.63, "view_count": 16625}, {"video_id": "1ZGfUJML4Os", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, on a table, there is a green bottle, a glass cup, and a white cup filled with coffee, into which milk is being added. Next, a man wearing a blue denim jacket and a hat appears in the kitchen, with milk placed in front of him. Finally, outside the room, a man in a denim jacket and a white hat is seen drinking from a white cup.", "First, a man wearing a blue denim jacket and a hat appears in the kitchen, with milk placed in front of him. Next, outside the room, a man in a denim jacket and a white hat is seen drinking from a white cup. Finally, on a table, there is a green bottle, a glass cup, and a white cup filled with coffee, into which milk is being added.", "First, a man wearing a blue denim jacket and a hat appears in the kitchen, with milk placed in front of him. Next, on a table, there is a green bottle, a glass cup, and a white cup filled with coffee, into which he is adding milk. Finally, outside the room, a man in a denim jacket and a white hat is seen drinking from a white cup.", "First, on a table, there is a green bottle, a glass cup, and a white cup filled with coffee, into which milk is being added. Next, outside the room, a man in a denim jacket and a white hat is seen drinking from a white cup. 
Finally, a man wearing a blue denim jacket and a hat appears in the kitchen, with milk placed in front of him."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "1ZGfUJML4Os_2", "video_path": "1ZGfUJML4Os.mp4", "subtitle_path": "1ZGfUJML4Os_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1096.63, "view_count": 16625}, {"video_id": "2V7eniiI6HQ", "question": "In a scene with the Earth in the background, there is a man wearing blue clothes with a microphone clamped to his collar. Which subtitles have appeared along with him?", "question_wo_referring_query": "Which subtitles have appeared along with him?", "candidates": ["gravity exists no", "of the Earth and you give us all that", "move on to Dave's second Point number", "when you place the cube next to it"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "2V7eniiI6HQ_0", "video_path": "2V7eniiI6HQ.mp4", "subtitle_path": "2V7eniiI6HQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 911.12, "view_count": 273958}, {"video_id": "2V7eniiI6HQ", "question": "In a large thermal imaging video window, a white object appears on the sea surface. In the thermal imaging video window at the bottom left corner of the screen, there is a man wearing gray clothes with a ponytail. 
Which subtitles have appeared with him?", "question_wo_referring_query": "Which subtitles have appeared with him?", "candidates": ["year I'm using u gov to fund a Christmas", "friends those people are doing incorrect", "of the Earth and you give us all that", "the seasonal among st whether or not"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "2V7eniiI6HQ_1", "video_path": "2V7eniiI6HQ.mp4", "subtitle_path": "2V7eniiI6HQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 911.12, "view_count": 273958}, {"video_id": "2V7eniiI6HQ", "question": "In a large red-green colored video window, there is a pure blue background with some bold letters of different colors at the top and bottom. In the middle of the screen, there is a baseball. With which subtitles does this baseball simultaneously appear?", "question_wo_referring_query": "With which subtitles does this baseball simultaneously appear?", "candidates": ["of the Earth and you give us all that", "relative density to equilibrium", "year I'm using u gov to fund a Christmas", "the seasonal among st whether or not"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "2V7eniiI6HQ_2", "video_path": "2V7eniiI6HQ.mp4", "subtitle_path": "2V7eniiI6HQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 911.12, "view_count": 273958}, {"video_id": "KgmV2GaN-J0", "question": "In a spacious room, a man wearing a dark gray coat and a gray hat, with curly hair, sits in front of a tan desk. 
What clothes did he change into when he finally sat on the sofa?", "question_wo_referring_query": "What clothes did he change into?", "candidates": ["He changed into a dark blue coat and a ginger yellow inner garment.", "He changed into a striped black and white coat and a white inner garment with blue patterns on it.", "He changed into a striped black and white coat and a ginger yellow inner garment.", "He changed into a dark blue coat and a white inner garment with blue patterns on it."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "KgmV2GaN-J0_0", "video_path": "KgmV2GaN-J0.mp4", "subtitle_path": "KgmV2GaN-J0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.58, "view_count": 3186519}, {"video_id": "KgmV2GaN-J0", "question": "In a spacious room, there is a man wearing earphones and a watch, dressed in a white coat. In the video, when he places his hands on the table, what clothing did he change into?", "question_wo_referring_query": "What clothing did he change into?", "candidates": ["Changed into a gray long-sleeve shirt with white stripes", "Changed into a green long-sleeve shirt with white stripes", "Changed into a green short-sleeve shirt with white stripes", "Changed into a gray short-sleeve shirt with white stripes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "KgmV2GaN-J0_1", "video_path": "KgmV2GaN-J0.mp4", "subtitle_path": "KgmV2GaN-J0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.58, "view_count": 3186519}, {"video_id": "KgmV2GaN-J0", "question": "In a room with a few shelves of books, white-washed walls, there is a woman in a black short-sleeved shirt with white patterns. 
What change happens when she points at the camera with her right hand in the video?", "question_wo_referring_query": "What change happens to her?", "candidates": ["She changes into a pair of sneakers", "She changes into light blue pajamas", "She lets her long hair down", "She puts on a silver necklace"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "KgmV2GaN-J0_2", "video_path": "KgmV2GaN-J0.mp4", "subtitle_path": "KgmV2GaN-J0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 955.58, "view_count": 3186519}, {"video_id": "WLtb_iSIHd8", "question": "A short-haired woman wearing a black floral dress, with her hands crossed and clasped, stands in front of a wall with a picture of a man in red clothes. What changes occur when she mentions 'viewers is the fact that it is'?", "question_wo_referring_query": "What changes occur?", "candidates": ["She looks at the camera, left hand clenched in a fist", "She looks at the camera, right hand clenched in a fist", "She crosses her hands behind her back", "She spreads both hands apart, palms facing each other at chest level"], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "WLtb_iSIHd8_0", "video_path": "WLtb_iSIHd8.mp4", "subtitle_path": "WLtb_iSIHd8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1213.59, "view_count": 21186}, {"video_id": "WLtb_iSIHd8", "question": "In a massive exhibition hall, a woman with short black hair, wearing a grey suit, is standing with her arms folded, looking at a sculpture of a white short-sleeved shirt and blue overalls. 
When she mentioned 'to experience these sculptures in three,' what changes did she undergo?", "question_wo_referring_query": "What changes did she undergo?", "candidates": ["She pointed with her right hand towards the mirror, with her left hand clenched in a fist.", "She touched her own hair.", "She pointed with her left hand towards the mirror, with her right hand clenched in a fist.", "She naturally lowered her arms and clasped her hands together."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "WLtb_iSIHd8_1", "video_path": "WLtb_iSIHd8.mp4", "subtitle_path": "WLtb_iSIHd8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1213.59, "view_count": 21186}, {"video_id": "WLtb_iSIHd8", "question": "In front of a large exhibition hall, there is a man wearing a dark blue suit and a ring. He is holding both fists in front of him. What changes occur when he mentions \"the end of Warhol foundation for the\"?", "question_wo_referring_query": "What changes occur to him?", "candidates": ["He clenches his right fist and his left arm naturally drops.", "He clenches his left fist and his right arm naturally drops.", "He crosses his arms.", "He stands up his collar."], "topic_category": "KA-Knowledge-Art", "question_category": "TAA", "level": "L2-Relation", "id": "WLtb_iSIHd8_2", "video_path": "WLtb_iSIHd8.mp4", "subtitle_path": "WLtb_iSIHd8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1213.59, "view_count": 21186}, {"video_id": "ukoSPO0cycE", "question": "In front of a teal wall covered with posters, there is a man holding a black broom, wearing a blue denim jacket and black pants. There is also an orange bucket next to him. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is tearing down posters from the wall", "He is sweeping the floor", "He is using a broom dipped in white glue to paint on the wall", "He is using the broom to dust"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "ukoSPO0cycE_0", "video_path": "ukoSPO0cycE.mp4", "subtitle_path": "ukoSPO0cycE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1413.33, "view_count": 2931695}, {"video_id": "ukoSPO0cycE", "question": "On a concrete wall beside a street, there are two large white pieces of paper with many words written on them. A man wearing a denim jacket is standing in front of them. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He is using a pen to write on the paper.", "He is using a broom to clean up the trash on the ground.", "He is using a pair of pliers to remove the paper.", "He is using a knife to cut the words off the paper."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "ukoSPO0cycE_1", "video_path": "ukoSPO0cycE.mp4", "subtitle_path": "ukoSPO0cycE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1413.33, "view_count": 2931695}, {"video_id": "ukoSPO0cycE", "question": "In a white room corner, a man wearing a pink lab coat is sitting on a red chair. In front of him, there is also a table with white paper on it. 
What is he doing at this moment?", "question_wo_referring_query": "What is he doing at this moment?", "candidates": ["He is conversing with the man opposite him.", "He is painting ducks on his clothes.", "He is preparing to type on a computer.", "He is holding a pen and preparing to write on the white paper."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "ukoSPO0cycE_2", "video_path": "ukoSPO0cycE.mp4", "subtitle_path": "ukoSPO0cycE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1413.33, "view_count": 2931695}, {"video_id": "m4w1A-JU_hk", "question": "In a spacious dining hall, there is a woman wearing a green suspender dress and a man wearing a light olive short-sleeve shirt. When 'read your future your fortune' is mentioned, what objects appear on the screen?", "question_wo_referring_query": "What objects appear on the screen?", "candidates": ["A light olive hat", "A white hat", "A red skirt", "A pair of red sunglasses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "m4w1A-JU_hk_0", "video_path": "m4w1A-JU_hk.mp4", "subtitle_path": "m4w1A-JU_hk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.7, "view_count": 424614}, {"video_id": "m4w1A-JU_hk", "question": "In a tidy bedroom with three windows, there is a man standing in it, wearing a light brown short-sleeved shirt. 
When mentioning 'personally I want the private bathroom,' what objects are present in the frame?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["Black ring", "Pink pillows", "Light gray curtains", "Green windows"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "m4w1A-JU_hk_1", "video_path": "m4w1A-JU_hk.mp4", "subtitle_path": "m4w1A-JU_hk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.7, "view_count": 424614}, {"video_id": "m4w1A-JU_hk", "question": "In front of a white column surrounded by a vast expanse of ocean water, a woman wearing a green strap stands. When the phrase \u201clike this right this is the only city in\u201d is mentioned, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Blue mask", "Green camera", "White pants", "Light olive bag"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "m4w1A-JU_hk_2", "video_path": "m4w1A-JU_hk.mp4", "subtitle_path": "m4w1A-JU_hk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.7, "view_count": 424614}, {"video_id": "YAWT7IAWtTw", "question": "By the lakeside teeming with green grass, a woman in a white dress and a man in a red jacket are sitting on the grass. 
What kind of hat is the man in the red jacket wearing?", "question_wo_referring_query": "What kind of hat is the man in the red jacket wearing?", "candidates": ["Wearing a black wide-brimmed hat", "Not wearing a hat", "Wearing a white wide-brimmed hat", "Wearing a black baseball cap"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "YAWT7IAWtTw_0", "video_path": "YAWT7IAWtTw.mp4", "subtitle_path": "YAWT7IAWtTw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1353.83, "view_count": 205529}, {"video_id": "YAWT7IAWtTw", "question": "In a room with a red sofa, a woman wearing a blue polka dot dress and a man holding a hat are having a conversation. What is the man wearing?", "question_wo_referring_query": "What is the man wearing?", "candidates": ["A dark purple coat and white pants", "A dark purple coat and blue pants", "A dark green coat and white pants", "A dark red coat and white pants"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "YAWT7IAWtTw_1", "video_path": "YAWT7IAWtTw.mp4", "subtitle_path": "YAWT7IAWtTw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1353.83, "view_count": 205529}, {"video_id": "YAWT7IAWtTw", "question": "In a room with a light green door, a woman in a blue polka dot dress is looking at a woman in a yellow dress. 
What color hat is the woman in the yellow dress wearing?", "question_wo_referring_query": "What color hat is the woman in the yellow dress wearing?", "candidates": ["white veil hat", "white beret", "white wide-brimmed hat", "white duckbill cap"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "YAWT7IAWtTw_2", "video_path": "YAWT7IAWtTw.mp4", "subtitle_path": "YAWT7IAWtTw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1353.83, "view_count": 205529}, {"video_id": "R90A7Ca8zpQ", "question": "In a kitchen with many food ingredients on the counter, there are two men, one wearing a black short-sleeve shirt and the other wearing a pink shirt. When mentioning 'um I can't say that word gotta buzz so', what kind of hat is the man wearing the pink shirt wearing?", "question_wo_referring_query": "What kind of hat is the man wearing the pink shirt wearing?", "candidates": ["pink baseball cap", "pink sun hat", "yellow baseball cap", "blue sun hat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "R90A7Ca8zpQ_0", "video_path": "R90A7Ca8zpQ.mp4", "subtitle_path": "R90A7Ca8zpQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 918.96, "view_count": 159956}, {"video_id": "R90A7Ca8zpQ", "question": "On a countertop filled with many kitchen tools, there is a red pot. A man in a black short-sleeve shirt is standing in front of the pot. 
When mentioning 'about we try steps one through five,' what kind of ladle is he holding?", "question_wo_referring_query": "What kind of ladle is he holding?", "candidates": ["a gray silicone ladle", "a white plastic ladle", "a blue long-handled ladle", "a wooden ladle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "R90A7Ca8zpQ_1", "video_path": "R90A7Ca8zpQ.mp4", "subtitle_path": "R90A7Ca8zpQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 918.96, "view_count": 159956}, {"video_id": "R90A7Ca8zpQ", "question": "In a kitchen with a table full of various ingredients, there is a man in a black short-sleeved shirt and another man in a pink shirt. When the phrase 'okay are you sure so one of them is a' is mentioned, what kind of scoop is the man in the pink shirt holding?", "question_wo_referring_query": "What kind of scoop is the man in the pink shirt holding?", "candidates": ["a white plastic scoop", "a silver metal scoop", "a natural wood scoop", "a red silicone scoop"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "R90A7Ca8zpQ_2", "video_path": "R90A7Ca8zpQ.mp4", "subtitle_path": "R90A7Ca8zpQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 918.96, "view_count": 159956}, {"video_id": "xKT26PuX4c0", "question": "On a bustling street, with an iron-fenced river on the right and various buildings on the left, a gray-haired man wearing a gray jacket and gloves is holding a woman dressed in black. 
What does he do next?", "question_wo_referring_query": "What does he do next?", "candidates": ["He is walking with a cat on a leash.", "He is hugging the woman beside him.", "He is asking the woman beside him for directions.", "He is walking with a dog on a leash."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "xKT26PuX4c0_0", "video_path": "xKT26PuX4c0.mp4", "subtitle_path": "xKT26PuX4c0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.6, "view_count": 4513}, {"video_id": "xKT26PuX4c0", "question": "In a room with a blue floor, there is an oil painting with a golden frame on the wall. A woman dressed in black is standing in front. What is she doing at this moment?", "question_wo_referring_query": "What is she doing at this moment?", "candidates": ["She is taking the painting down.", "She has her hands in her pockets and is admiring the painting.", "She is wiping the painting.", "She is introducing the painting to a visitor."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "xKT26PuX4c0_1", "video_path": "xKT26PuX4c0.mp4", "subtitle_path": "xKT26PuX4c0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.6, "view_count": 4513}, {"video_id": "xKT26PuX4c0", "question": "In a crowded place, there is a man wearing a beige shirt and a ring sitting in front of a wooden table. 
What is he doing at this moment?", "question_wo_referring_query": "What is he doing at this moment?", "candidates": ["His hands are placed on the table, and he is talking while facing the mirror.", "He took off his own ring.", "He took off his clothes.", "He is lowering his head and eating."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "xKT26PuX4c0_2", "video_path": "xKT26PuX4c0.mp4", "subtitle_path": "xKT26PuX4c0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.6, "view_count": 4513}, {"video_id": "aT2sCd_8oEk", "question": "On a tree full of white blossoms, a woman wearing a white long dress and a pink scarf is sitting. After this, what does she do?", "question_wo_referring_query": "After this, what does she do?", "candidates": ["She picks up a potted plant", "She is sleeping", "She is brushing her hair", "She puts two illustrated papers into a white envelope"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "aT2sCd_8oEk_0", "video_path": "aT2sCd_8oEk.mp4", "subtitle_path": "aT2sCd_8oEk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.81, "view_count": 1589025}, {"video_id": "aT2sCd_8oEk", "question": "In front of a wooden table with some small potted plants, a woman wearing white clothes is standing with her left hand holding some food. 
What does she do next?", "question_wo_referring_query": "What does she do next?", "candidates": ["She puts her hands together in a prayer gesture", "She lets her hair down", "She picks up a small potted plant from the table with her left hand", "She picks up a small potted plant from the table with her right hand"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "aT2sCd_8oEk_1", "video_path": "aT2sCd_8oEk.mp4", "subtitle_path": "aT2sCd_8oEk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.81, "view_count": 1589025}, {"video_id": "aT2sCd_8oEk", "question": "In a room with a white door, a woman with curled hair wearing a blue checkered skirt is standing sideways facing a mirror and adjusting her skirt. What does she do next?", "question_wo_referring_query": "What does she do next?", "candidates": ["She is doing the laundry", "She is painting on the wall", "She is taking a shower", "She is drawing on a canvas"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "aT2sCd_8oEk_2", "video_path": "aT2sCd_8oEk.mp4", "subtitle_path": "aT2sCd_8oEk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.81, "view_count": 1589025}, {"video_id": "ErITc1Nx5mU", "question": "With a dark blue background, there are 2 yellow blocks along with some white text in English. In another scene with a blue background, there are 8 yellow blocks, each accompanied by white English text. Which of these two scenes appears first?", "question_wo_referring_query": "Which of these two scenes appears first?", "candidates": ["In a blue background, there are 8 yellow blocks, each accompanied by white English text. Appears first", "4 blocks appear first", "Appear together", "2 yellow blocks along with some white English text. 
Appears first"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "ErITc1Nx5mU_0", "video_path": "ErITc1Nx5mU.mp4", "subtitle_path": "ErITc1Nx5mU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1180.1, "view_count": 13100}, {"video_id": "ErITc1Nx5mU", "question": "In the black background, there are 4 lines of white text in English. In the blue background, there are 8 yellow squares, each with white text in English beside them. Which of these two scenes appears first?", "question_wo_referring_query": "Which of these two scenes appears first?", "candidates": ["3 yellow squares appear first.", "Both scenes appear at the same time.", "In the blue background, there are 8 yellow squares, each with white text in English beside them. This scene appears first.", "In the black background, there are 4 lines of white text in English. This scene appears first."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "ErITc1Nx5mU_1", "video_path": "ErITc1Nx5mU.mp4", "subtitle_path": "ErITc1Nx5mU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1180.1, "view_count": 13100}, {"video_id": "ErITc1Nx5mU", "question": "In a blue background, there are 8 yellow squares, and they are surrounded by white text in English. In a black background, there are two lines of white text in English. Which of these scenes appears first?", "question_wo_referring_query": "Which of these scenes appears first?", "candidates": ["In a black background, there are two lines of white text in English. Appears first", "Appear together", "In a blue background, there are 8 yellow squares, and they are surrounded by white text in English. 
Appears first", "Three yellow squares appear first"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "ErITc1Nx5mU_2", "video_path": "ErITc1Nx5mU.mp4", "subtitle_path": "ErITc1Nx5mU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1180.1, "view_count": 13100}, {"video_id": "P2LWNdH3bi8", "question": "In the video, there is a man wearing a gray-blue jacket and a red shirt at the corner, next to the text 'Mask R-CNN'. His right arm hangs naturally down, and his left hand is placed in front of his chest. After this, what action does the man perform?", "question_wo_referring_query": ", after this, what action does the man perform?", "candidates": ["Making a fist with both hands", "Putting hands together", "Spreading palms", "Letting both hands hang down"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "P2LWNdH3bi8_1", "video_path": "P2LWNdH3bi8.mp4", "subtitle_path": "P2LWNdH3bi8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 446.83, "view_count": 17}, {"video_id": "ITgatyoMDjU", "question": "The screen shows some red puree ingredients with some strips of chicken added on top. There are lines of white text at the top and bottom of the screen. Which ingredient appears first on the screen?", "question_wo_referring_query": "Which ingredient appears first on the screen?", "candidates": ["Carrot", "Tomato", "Leek", "Chicken"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "ITgatyoMDjU_1", "video_path": "ITgatyoMDjU.mp4", "subtitle_path": "ITgatyoMDjU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 362.36, "view_count": 21266}, {"video_id": "ijFm6DxNVyI", "question": "In a background with a deep blue tone, there are multicolored small spheres surrounding a purple object. 
After the subtitle mentions 'If our current understanding of physics is correct, then the universe,' what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["A dog wearing glasses", "A dog wearing a headscarf", "A chicken wearing glasses", "A chicken wearing a headscarf"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3O", "level": "L2-Relation", "id": "ijFm6DxNVyI_1", "video_path": "ijFm6DxNVyI.mp4", "subtitle_path": "ijFm6DxNVyI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 358.17, "view_count": 15954094}, {"video_id": "E35j15N_dys", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, people dressed in red and riding white horses appear on the street. Then, a group of people waving swords and knives appear in front of the house. Lastly, many camels carrying goods walk on the road.", "First, many camels carrying goods walk on the road. Then, people dressed in red and riding white horses appear on the street. Lastly, a group of people waving swords and knives appear in front of the house.", "First, many camels carrying goods walk on the road. Then, a group of people waving swords and knives appear in front of the house. Finally, people dressed in red and riding white horses appear on the street.", "First, people dressed in red and riding white horses appear on the street. Then, many camels carrying goods walk on the road. 
Finally, a group of people waving swords and knives appear in front of the house."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "E35j15N_dys_1", "video_path": "E35j15N_dys.mp4", "subtitle_path": "E35j15N_dys_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 426.06, "view_count": 878406}, {"video_id": "I1v31P7Zgak", "question": "At a newsstand, there is a man wearing a black coat and a black hat, holding a newspaper in his gloved right hand. What happened to the man's hands while he was drinking tea and reading the newspaper?", "question_wo_referring_query": "What happened to the man's hands?", "candidates": ["There was no change.", "The man was not wearing gloves on either hand.", "The man was not wearing a glove on his right hand.", "The man was not wearing a glove on his left hand."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "I1v31P7Zgak_1", "video_path": "I1v31P7Zgak.mp4", "subtitle_path": "I1v31P7Zgak_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 274.94, "view_count": 8969}, {"video_id": "dPVqWOLNkoQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, an elderly man wearing glasses and an olive-green coat raises his hand and looks at the leaking roof. Then, cars are driving on the road under a pink-purple sky. Finally, a long-haired woman wearing a white hat and a feather dress is pushing a yellow stroller.", "First, cars are driving on the road under a pink-purple sky. Then, an elderly man wearing glasses and an olive-green coat raises his hand and looks at the leaking roof. Finally, a long-haired woman wearing a white hat and a feather dress is pushing a yellow stroller.", "First, cars are driving on the road under a pink-purple sky. 
Then, a long-haired woman wearing a white hat and a feather dress is pushing a yellow stroller. Finally, an elderly man wearing glasses and an olive-green coat raises his hand and looks at the leaking roof.", "First, a long-haired woman wearing a white hat and a feather dress is pushing a yellow stroller. Next, an elderly man wearing glasses and an olive-green coat raises his hand and looks at the leaking roof. Finally, cars are driving on the road under a pink-purple sky.", "First, an elderly man wearing glasses and an olive-green coat raises his hand and looks at the leaking roof. Then, a long-haired woman wearing a white hat and a feather dress is pushing a yellow stroller. Finally, cars are driving on the road under a pink-purple sky."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "dPVqWOLNkoQ_0", "video_path": "dPVqWOLNkoQ.mp4", "subtitle_path": "dPVqWOLNkoQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 575.92, "view_count": 37605}, {"video_id": "dPVqWOLNkoQ", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, in daylight, a man wearing black-rimmed glasses and a black down jacket stands outside the window, then a person wearing a firefighter's uniform and a white helmet stands beside the roof, and finally, a woman in a blue coat gives a speech.", "First, a woman in a blue coat gives a speech, then a person wearing a firefighter's uniform and a white helmet stands on a rescue ladder beside the roof, and finally, in daylight, a man wearing black-rimmed glasses and a black down jacket stands outside the window.", "First, a woman in a blue coat gives a speech, then in daylight, a man wearing black-rimmed glasses and a black down jacket stands outside the window, and finally, a person wearing a firefighter's uniform and a white helmet stands on a rescue 
ladder beside the roof.", "First, a person wearing a firefighter's uniform and a white helmet stands on a rescue ladder beside the roof, then, in daylight, a man wearing black-rimmed glasses and a black down jacket stands outside the window, and finally, a woman in a blue coat gives a speech.", "First, in daylight, a man wearing black-rimmed glasses and a black down jacket stands outside the window, then a woman in a blue coat gives a speech, and finally, a person wearing a firefighter's uniform and a white helmet stands on a rescue ladder beside the roof."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "dPVqWOLNkoQ_1", "video_path": "dPVqWOLNkoQ.mp4", "subtitle_path": "dPVqWOLNkoQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 575.92, "view_count": 37605}, {"video_id": "Op9gwDR1YlM", "question": "A man wearing a hat and an overcoat with red and white striped sleeves is walking on a path with grass on the sides. At the moment when the subtitle says 'this way yesterday we went up there and', what type of hat is the man wearing?", "question_wo_referring_query": "What type of hat is the man wearing?", "candidates": ["fisherman's hat", "baseball cap", "beret", "wool hat", "knit cap"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "Op9gwDR1YlM_1", "video_path": "Op9gwDR1YlM.mp4", "subtitle_path": "Op9gwDR1YlM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.44, "view_count": 8459}, {"video_id": "AcEu8LEe_LY", "question": "In a dimly lit room, a white wall connects to a floor-to-ceiling window with transparent glass. Outside the window, there is a beach and sea view. The curtains on the window are pulled to the sides. There are various colored cushions on a dark-colored sofa. 
In front of the sofa, there is a white table, and further ahead, there is a small white desk lamp emitting a faint warm light. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A pendant lamp", "A desktop computer", "An oil painting", "A tall potted plant", "A notebook computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "AcEu8LEe_LY_1", "video_path": "AcEu8LEe_LY.mp4", "subtitle_path": "AcEu8LEe_LY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 240.36, "view_count": 4885}, {"video_id": "ykdnVm-Bf6U", "question": "In the scene where there is a light yellow background with neatly arranged black characters on the right side, and on the left side is a white cup with globe design motifs held with one hand gripping the cup and the other lifting it, below the cup are three pictures showing the cup from different angles. When the subtitle 'description below we also sell merch on' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["There is a red shirt", "There is a flag with only blue color", "There is a flag with red and blue colors", "There is a flag with only red color", "There is a blue shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "ykdnVm-Bf6U_1", "video_path": "ykdnVm-Bf6U.mp4", "subtitle_path": "ykdnVm-Bf6U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 348.32, "view_count": 15048}, {"video_id": "YSppvwcvF4U", "question": "In the top left corner of a white background, there is a yellow sign and black characters. At the center, there is an arrow pointing from left to right. The squares to the left and right of the arrow are equally spaced squares. A red dot appears in the first square on the left. 
A man in a blue undershirt with a suit and glasses is speaking in the bottom right corner. After the subtitle 'do if you have to upsample from 2 cross' appears, what happens to the small red dot?", "question_wo_referring_query": "what happens to the small red dot?", "candidates": ["The small red dot moves from the first square on the left to the third square on the right.", "The small red dot moves from the first square on the left to the first square on the right.", "The small red dot moves from the first square on the left to the third square on the left.", "The small red dot moves from the first square on the left to the second square on the left.", "The small red dot moves from the first square on the left to the second square on the right."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "YSppvwcvF4U_1", "video_path": "YSppvwcvF4U.mp4", "subtitle_path": "YSppvwcvF4U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.33, "view_count": 14}, {"video_id": "mj86XmfOniY", "question": "The man in the black short sleeves in the lower left corner is talking. His hands are spread out. Next to him is a black table. On the lower left side of the white background, there is a green body image and a thermometer. Above the image, there are some gray square objects and a blue musical note. 
When the subtitle 'signal so you have like this' appears, what change happens to the man?", "question_wo_referring_query": "When the subtitle 'signal so you have like this' appears, what change happens to the man?", "candidates": ["A hat appears on the man's head", "One hand is lowered, the palm of the other hand is facing forward with fingers slightly curved", "The man stands up", "The man puts on glasses", "The watch on the man's arm disappears"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "mj86XmfOniY_1", "video_path": "mj86XmfOniY.mp4", "subtitle_path": "mj86XmfOniY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 482.0, "view_count": 5646}, {"video_id": "_FKuq1oLZ4Q", "question": "The backdrop features some green leaves adorning a white curtain. To the right, there is a pile of books with a calendar on top. A woman in a plaid shirt takes out a pack of blue-packaged mints from her bag. What is the first item she retrieves from the bag after this?", "question_wo_referring_query": "What is the first item she retrieves from the bag after this?", "candidates": ["pen", "lipstick", "wallet", "hand sanitizer", "mask"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "_FKuq1oLZ4Q_1", "video_path": "_FKuq1oLZ4Q.mp4", "subtitle_path": "_FKuq1oLZ4Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 356.04, "view_count": 27758}, {"video_id": "pmrzxkh8jLM", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a hand wearing a black glove picks up a yellow sponge. Then, the gloved hand places a white parchment paper on the table. 
Finally, the gloved hand pours yellow oil into a glass bowl.", "First, a hand wearing a black glove places a white parchment paper on the table. Then, the gloved hand pours yellow oil into a glass bowl. Finally, the gloved hand picks up a yellow sponge.", "First, a hand wearing a black glove picks up a yellow sponge. Then, the gloved hand pours yellow oil into a glass bowl. Finally, the gloved hand places a white parchment paper on the table.", "First, a hand wearing a black glove pours yellow oil into a glass bowl. Then, the gloved hand picks up a yellow sponge. Finally, the gloved hand places a white parchment paper on the table.", "First, a hand wearing a black glove pours yellow oil into a glass bowl. Then, the gloved hand places a white parchment paper on the table. Finally, the gloved hand picks up a yellow sponge."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "pmrzxkh8jLM_1", "video_path": "pmrzxkh8jLM.mp4", "subtitle_path": "pmrzxkh8jLM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 413.33, "view_count": 2664}, {"video_id": "WFGTeaA7qsY", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["First is the scene inside the Tank, where two men in military uniforms, one holding a gun, are firing, followed by the scene where Tank is stuck in the mud firing, and finally a man in a green uniform with short hair holding the control lever with both hands.", "First is the scene where Tank is stuck in the mud firing, followed by the scene inside the Tank, where two men in military uniforms, one holding a gun, are firing, and finally a man in a green uniform with short hair holding the control lever with both hands.", "First is the scene where Tank is trapped in the mud being attacked, followed by a man in a green uniform with short hair 
holding the control lever with both hands, and finally inside the Tank, where two men in military uniforms, one holding a gun, are firing.", "First is a man in a green uniform with short hair holding the control lever with both hands, followed by the scene where Tank is stuck in the mud firing, and finally the scene inside the Tank, where two men in military uniforms, one holding a gun, are firing.", "First is the scene inside the Tank, where two men in military uniforms, one holding a gun, are firing, followed by a man in a green uniform with short hair holding the control lever with both hands, and finally the scene where Tank is stuck in the mud firing."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "WFGTeaA7qsY_0", "video_path": "WFGTeaA7qsY.mp4", "subtitle_path": "WFGTeaA7qsY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 500.08, "view_count": 236953}, {"video_id": "WFGTeaA7qsY", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a scene in the trenches with many soldiers wearing green helmets and green backpacks. Then, there is a scene inside the tank with many soldiers operating it. Finally, there is a scene on a green flat ground outside the tank where a row of soldiers are shaking hands with the officer.", "First, there is a scene on a green flat ground outside the tank where a row of soldiers are shaking hands with the officer. Then, there is a scene inside the tank with many soldiers operating it. Finally, there is a scene in the trenches with many soldiers wearing green helmets and green backpacks.", "First, there is a scene in the trenches with many soldiers wearing green helmets and green backpacks. Then, there is a scene on a green flat ground outside the tank where a row of soldiers are shaking hands with the officer. 
Finally, there is a scene inside the tank with many soldiers operating it.", "First, there is a scene inside the tank with many soldiers operating it. Then, there is a scene in the trenches with many soldiers wearing green helmets and green backpacks. Finally, there is a scene on a green flat ground outside the tank where a row of soldiers are shaking hands with the officer.", "First, there is a scene inside the tank with many soldiers operating it. Then, there is a scene on a green flat ground outside the tank where a row of soldiers are shaking hands with the officer. Finally, there is a scene in the trenches with many soldiers wearing green helmets and green backpacks."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "WFGTeaA7qsY_1", "video_path": "WFGTeaA7qsY.mp4", "subtitle_path": "WFGTeaA7qsY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 500.08, "view_count": 236953}, {"video_id": "c3_-fShe8Ks", "question": "In the open area in front of the olive-colored building, there is a black car and a silver car parked. A short-haired man wearing black clothes and carrying a red and black backpack is walking forward. 
When this man appears at a subway station with a glass roof and a subway that emits yellow lights, what change happens to his backpack?", "question_wo_referring_query": "What change happens to his backpack?", "candidates": ["Only the olive backpack remains on him", "His backpack changes to red and olive", "His backpack changes to black and olive", "Only the black backpack remains on him", "Only the red backpack remains on him"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "c3_-fShe8Ks_1", "video_path": "c3_-fShe8Ks.mp4", "subtitle_path": "c3_-fShe8Ks_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 507.98, "view_count": 118163}, {"video_id": "NSZeExajXsM", "question": "Which of the following scene sequences is correct in the video?", "question_wo_referring_query": "Which of the following scene sequences is correct in the video?", "candidates": ["First, a hand is seen holding a brown bag with the text 'bag secured' in the center of the screen; then the blonde woman is in a bright and spacious room, wearing a white knitted shirt, placing a hamburger into her mouth using both hands; finally, she is sitting in a car with a picture labeled 'LUNCH' beside her.", "First, a hand is seen holding a brown bag with the text 'bag secured' in the center of the screen; then a blonde woman is sitting in a car with a picture labeled 'LUNCH' beside her; finally, she is in a bright and spacious room, wearing a white knitted shirt, placing a hamburger into her mouth using both hands.", "First, a blonde woman is sitting in a car with a picture labeled 'LUNCH' beside her; then she is in a bright and spacious room, wearing a white knitted shirt, placing a hamburger into her mouth using both hands; finally, a hand is seen holding a brown bag with the text 'bag secured' in the center of the screen.", "First, a blonde woman is sitting in a car with a picture labeled 'LUNCH' beside her; then a 
hand is seen holding a brown bag with the text 'bag secured' in the center of the screen; finally, the blonde woman, now wearing a white knitted shirt, is in a bright and spacious room, placing a hamburger into her mouth using both hands.", "First, the blonde woman is in a bright and spacious room, wearing a white knitted shirt, placing a hamburger into her mouth using both hands; then she is sitting in a car with a picture labeled 'LUNCH' beside her; finally, a hand is seen holding a brown bag with the text 'bag secured' in the center of the screen."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "NSZeExajXsM_1", "video_path": "NSZeExajXsM.mp4", "subtitle_path": "NSZeExajXsM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 445.45, "view_count": 13972}, {"video_id": "sYySvVR5fk4", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which scene sequence is correct?", "candidates": ["First, the group of men sing and dance in the middle of the stage, with a few beams of light shining above them and a flower displayed on the screen behind them. Their movements are almost identical. Then, several men appear on stage under red flashing lights, with a green box at the bottom of the screen and faint text appearing. Finally, a man dressed in white appears alone staring at the camera, with someone in a white top, black and yellow outerwear, and black pants standing behind him.", "First, a man dressed in white appears alone staring at the camera, with someone in a white top, black and yellow outerwear, and black pants standing behind him. Then, a group of men sing and dance in the middle of the stage, with a few beams of light shining down and a flower displayed on the screen behind them. Their movements are almost identical. 
Finally, several men appear on stage under red flashing lights, with a green box at the bottom of the screen and faint text appearing.", "First, several men appear on stage under red flashing lights, with a green box at the bottom of the screen and faint text appearing. Then, a group of men sing and dance in the middle of the stage, with a few beams of light shining down on the stage and a flower displayed on the screen behind them. Their movements are almost identical. Finally, a man dressed in white appears alone staring at the camera, with someone in a white top, black and yellow outerwear, and black pants standing behind him.", "First, several men appear on stage under red flashing lights, with a green box at the bottom of the screen and faint text appearing. Then, a man dressed in white appears alone staring at the camera, with someone in a white top, black and yellow outerwear, and black pants standing behind him. Finally, a group of men sing and dance in the middle of the stage, with a few beams of light shining above them and a flower displayed on the screen behind them. Their movements are almost identical.", "First, a man dressed in white appears alone staring at the camera, with someone in a white top, black and yellow outerwear, and black pants standing behind him. Then, several men appear on stage under red flashing lights, with a green box at the bottom of the screen and faint text appearing. At the end, a group of men sing and dance in the middle of the stage, with a few beams of light shining above them and a flower displayed on the screen behind them. 
Their movements are almost identical."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "sYySvVR5fk4_1", "video_path": "sYySvVR5fk4.mp4", "subtitle_path": "sYySvVR5fk4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.66, "view_count": 15135}, {"video_id": "YZZHejsF0ao", "question": "There is a brown ship sailing on the dark blue sea, with 5 white planes flying in the sky. What changes occur to the brown ship when it reaches the point on the screen where 'Torpedo Bomber Tactics' appears in the upper left corner?", "question_wo_referring_query": "There is a brown ship sailing on the dark blue sea, with 5 white planes flying in the sky. What changes occur to the brown ship when it reaches the point on the screen where 'Torpedo Bomber Tactics' appears in the upper left corner?", "candidates": ["It becomes smaller", "It becomes green", "It becomes white", "It becomes black", "It becomes bigger"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "YZZHejsF0ao_1", "video_path": "YZZHejsF0ao.mp4", "subtitle_path": "YZZHejsF0ao_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 487.23, "view_count": 71255}, {"video_id": "RAJtxFBmGzU", "question": "On a white background screen, with black and blue English letters written on it, below the English text there is a square image with two objects inside a circle. At the bottom of the screen, there are four identical pictures, each showing three different objects. 
After the subtitle 'uh the initialized model weights will be,' what is the first object that appears in the middle of the screen?", "question_wo_referring_query": "What is the first object that appears in the middle of the screen?", "candidates": ["computer", "red rectangle frame", "power bank", "data line", "blue rectangle frame"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "RAJtxFBmGzU_1", "video_path": "RAJtxFBmGzU.mp4", "subtitle_path": "RAJtxFBmGzU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 366, "duration": 206.0, "view_count": 37}, {"video_id": "1KVSmJhT-WI", "question": "In the exhibition hall bustling with people, two paintings woven in white and off-white are hung. In front of them, a person is pushing a small cart filled with dark gray objects. Who is the person pushing the cart in the scene?", "question_wo_referring_query": "Who is the person pushing the cart in the scene?", "candidates": ["A man wearing a blue short-sleeve shirt, white innerwear, and black pants", "A woman wearing a black suit and khaki pants", "A woman wearing a blue short-sleeve shirt, white innerwear, and black pants", "A man wearing a black suit and khaki pants"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "1KVSmJhT-WI_1", "video_path": "1KVSmJhT-WI.mp4", "subtitle_path": "1KVSmJhT-WI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 266.27, "view_count": 6418}, {"video_id": "fEDwBpgPqyg", "question": "In the screen, there are a few sheets of paper with a white background next to a hand wearing a pink dress and two rings. On both the left and right upper sides, there are two images of people. 
When the woman on the right upper side, who is wearing a pink coat over a white shirt and has long hair, lifts her hair behind her ear, what changes occur to the sheets of paper on the screen?", "question_wo_referring_query": "In the screen, there are a few sheets of paper with a white background next to a hand wearing a pink dress and two rings. On both the left and right upper sides, there are two images of people. When the woman on the right upper side, who is wearing a pink coat over a white shirt and has long hair, lifts her hair behind her ear, what changes occur to the sheets of paper on the screen?", "candidates": ["The paper was entirely cleared", "The piece of paper with text on the lower left moved to the upper right side", "The piece of paper with text on the lower left moved to the upper left side", "The piece of paper with text on the lower right moved to the upper right side", "The piece of paper with text on the lower right moved to the upper left image"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "fEDwBpgPqyg_1", "video_path": "fEDwBpgPqyg.mp4", "subtitle_path": "fEDwBpgPqyg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.4, "view_count": 1203}, {"video_id": "niSDzNd2u1U", "question": "In front of a white wall with many paintings hanging on it, there stands a woman wearing a pink blouse and with short blond hair. She is standing beside a painting depicting a woman in a pink dress kneeling on the ground, looking at the front. 
What color glasses is the woman with short blond hair wearing?", "question_wo_referring_query": "What color glasses is the woman with short blond hair wearing?", "candidates": ["Silver", "Gold", "Red", "White", "Pink"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "niSDzNd2u1U_1", "video_path": "niSDzNd2u1U.mp4", "subtitle_path": "niSDzNd2u1U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 252.5, "view_count": 53240}, {"video_id": "zIC4zbxR7oM", "question": "The sky is a dusky blue, with mountain peaks in the distance. A winding road connects to the mountain range, with sparse trees along the roadside. On both sides, there are cliff-like surroundings. When the subtitle 'and evolution' appears, what change occurs on the road?", "question_wo_referring_query": "What change occurs on the road?", "candidates": ["A green continent appears beside the road", "The road becomes straight", "The road wheels become clear", "The road disappears", "The greenery on both sides of the road is gone"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "zIC4zbxR7oM_1", "video_path": "zIC4zbxR7oM.mp4", "subtitle_path": "zIC4zbxR7oM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.63, "view_count": 4477}, {"video_id": "06pw74YgX6k", "question": "Which of the following scenarios is in the correct sequence?", "question_wo_referring_query": "Which of the following scenarios is in the correct sequence?", "candidates": ["First, a man in a black coat is looking at ceramics inside a glass case, then the scene changes to a glass case with many hand-held items, and finally, the glass case has many sculptures.", "First, the scene is a glass case displaying many sculptures, then it changes to a glass case with many hand-held items, and lastly, a man in a black coat is looking at ceramics inside the glass case.", "First, a 
man in a black coat is looking at ceramics inside a glass case, then the scene changes to a glass case displaying many sculptures, and finally, the glass case has many hand-held items.", "First, the scene is a glass case displaying many sculptures, then a man in a black coat is looking at ceramics inside the glass case, and lastly, the scene shows a glass case with many hand-held items.", "First, the scene is a glass case with many hand-held items, then a man in a black coat is looking at ceramics inside the glass case, and lastly, the scene shows many sculptures in the glass case."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "06pw74YgX6k_0", "video_path": "06pw74YgX6k.mp4", "subtitle_path": "06pw74YgX6k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 194.36, "view_count": 2112}, {"video_id": "06pw74YgX6k", "question": "Which sequence of scenes below is correct?", "question_wo_referring_query": "Which sequence of scenes below is correct?", "candidates": ["First, there is the scene with the glass display case full of ceramics. Then, there are two wooden clocks inside the glass display case, with one clock face featuring floral patterns. Finally, there is a man wearing a black coat, looking sideways in front of a wall full of pictures.", "First, there is a man wearing a black coat, looking sideways in front of a wall full of pictures. Then, there are two wooden clocks inside a glass display case, with one clock face featuring floral patterns. Finally, there is the scene with the glass display case full of ceramics.", "First, there is the scene with the glass display case full of ceramics. Then, there is a man wearing a black coat, looking sideways in front of a wall full of pictures. 
Finally, there are two wooden clocks inside the glass display case, with one clock face featuring floral patterns.", "First, there are two wooden clocks inside a glass display case, with one clock face featuring floral patterns. Then, there is a man wearing a black coat, looking sideways in front of a wall full of pictures. Finally, there is the scene with the glass display case full of ceramics.", "First, there is a man wearing a black coat, looking sideways in front of a wall full of pictures. Then, there is a scene with glass display cases full of ceramics. Finally, there are two wooden clocks inside the glass display case, with one clock face featuring floral patterns."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "06pw74YgX6k_1", "video_path": "06pw74YgX6k.mp4", "subtitle_path": "06pw74YgX6k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 194.36, "view_count": 2112}, {"video_id": "nMHwatpXM0Q", "question": "In the car, there is a woman with blonde hair. She is wearing a gray coat and a black tank top inside. She has painted her nails pink and is wearing a ring on her finger. In front of her is the steering wheel, and she is holding a net-like packaged food. What type of culinary process is this food made from?", "question_wo_referring_query": "In the car, there is a woman with blonde hair. She is wearing a gray coat and a black tank top inside. She has painted her nails pink and is wearing a ring on her finger. In front of her is the steering wheel, and she is holding a net-like packaged food. 
What type of culinary process is this food made from?", "candidates": ["Fried food", "Stir-fried food", "Cold mixed food", "Steam-cooked food", "Boiled food"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "nMHwatpXM0Q_1", "video_path": "nMHwatpXM0Q.mp4", "subtitle_path": "nMHwatpXM0Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 583.28, "view_count": 34898}, {"video_id": "vCEianlPs7A", "question": "On a wooden cutting board, there is a transparent bowl filled with red and yellow food, along with green vegetables. When mentioning 'And let's not forget the mysterious allure of black pepper!' what is the state of the green vegetables?", "question_wo_referring_query": "What is the state of the green vegetables?", "candidates": ["Long strips state", "Blended into juice state", "Chopped state", "Uncut state", "Mixed with egg yolk sauce state"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "vCEianlPs7A_1", "video_path": "vCEianlPs7A.mp4", "subtitle_path": "vCEianlPs7A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 516.6, "view_count": 4971253}, {"video_id": "vkcwNAxODro", "question": "Which sequence of scenes in the video is correct?", "question_wo_referring_query": "Which sequence of scenes in the video is correct?", "candidates": ["Three screws rotating on a wooden table, a man in a white short-sleeved shirt sitting in front of a bookshelf with a globe, talking to the camera, a star shining brightly in the night sky, a galaxy-like Milky Way in outer space", "A galaxy-like Milky Way in outer space, a man in a white short-sleeved shirt sitting in front of a bookshelf with a globe, talking to the camera, a star shining brightly in the night sky, three screws rotating on a wooden table", "Three screws rotating on a wooden table, a star shining brightly in the night sky, a 
galaxy-like Milky Way in outer space, a man in a white short-sleeved shirt sitting in front of a bookshelf with a globe, talking to the camera", "A star shining brightly in the night sky, a galaxy-like Milky Way in outer space, three screws rotating on a wooden table, a man in a white short-sleeved shirt sitting in front of a bookshelf with a globe, talking to the camera", "A galaxy-like Milky Way in outer space, three screws rotating on a wooden table, a man in a white short-sleeved shirt sitting in front of a bookshelf with a globe, talking to the camera, a star shining brightly in the night sky"], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "vkcwNAxODro_1", "video_path": "vkcwNAxODro.mp4", "subtitle_path": "vkcwNAxODro_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 400.0, "view_count": 156714}, {"video_id": "0lpJUdtjJ6A", "question": "On a green wall, there is a window surrounded by bars. Inside the window, there is a man with a cap, wearing blue clothes. 
After the subtitle \"original Public Enemy Number One the\" appears, what is the first animal that appears?", "question_wo_referring_query": "What is the first animal that appears?", "candidates": ["A shark swimming in the sea", "A completely white bird", "A bird with black feathers on its wings", "A completely black cat", "A completely gray mouse"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "0lpJUdtjJ6A_1", "video_path": "0lpJUdtjJ6A.mp4", "subtitle_path": "0lpJUdtjJ6A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 549.25, "view_count": 332424}, {"video_id": "WD3SC0mLpfI", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there is a scene with a woman wearing a white flower crown with green leaves, dressed in blue-green clothes, lying on the ground covered with pink petals. Then, there's a scene with a man wearing round glasses, a beard, and a black top in front of a black background. Lastly, there's a scene with a woman wearing a ring on her hand and a white flower crown on her head, covered with pink petals.", "First, there is a scene with a man wearing round glasses, a beard, and a black top in front of a black background. Then, there's a scene with a woman wearing a white flower crown with green leaves, dressed in blue-green clothes, lying on the ground covered with pink petals. Lastly, there's a scene with a woman wearing a ring on her hand and a white flower crown on her head, covered with pink petals.", "First, there is a scene with a woman wearing a ring on her hand and a white flower crown on her head, covered with pink petals. Then, there is a scene with a man wearing round glasses, a beard, and a black top in front of a black background. 
Lastly, there's a scene with a woman wearing a white flower crown with green leaves, dressed in blue-green clothes, lying on the ground covered with pink petals.", "First, there is a scene with a man wearing round glasses, a beard, and a black top in front of a black background. Then, there's a scene with a woman wearing a ring on her hand and a white flower crown on her head, covered with pink petals. Lastly, there's a scene with a woman wearing a white flower crown with green leaves, dressed in blue-green clothes, lying on the ground covered with pink petals.", "First, there is a scene with a woman wearing a ring on her hand and a white flower crown on her head, covered with pink petals. Then, there's a scene with a woman wearing a white flower crown with green leaves, dressed in blue-green clothes, lying on the ground covered with pink petals. Lastly, there's a scene with a man wearing round glasses, a beard, and a black top in front of a black background."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "WD3SC0mLpfI_0", "video_path": "WD3SC0mLpfI.mp4", "subtitle_path": "WD3SC0mLpfI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.13, "view_count": 800157}, {"video_id": "WD3SC0mLpfI", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First is a scene where a person wearing a yellow straw hat, white mustache, and gray suit holding a stick, then, a pencil sketch on paper of a man with a mustache, and lastly, a scene where a person wearing a golden outfit with a red cloak is leaning on a black table.", "First is a scene where a person wearing a golden outfit with a red cloak is leaning on a black table, then, a scene where a person wearing a yellow straw hat, white mustache, and gray suit holding a stick, and lastly, a pencil sketch on paper of a man with a 
mustache.", "First is a scene where a person wearing a yellow straw hat, white mustache, and gray suit holding a stick, then, a scene where a person wearing a golden outfit with a red cloak is leaning on a black table, and lastly, a pencil sketch on paper of a man with a mustache.", "First is a pencil sketch on paper of a man with a mustache, then, a scene where a person wearing a yellow straw hat, white mustache, and gray suit holding a stick, and lastly, a scene where a person wearing a golden outfit with a red cloak is leaning on a black table.", "First is a scene where a person wearing a golden outfit with a red cloak is leaning on a black table, then, a pencil sketch on paper of a man with a mustache, and lastly, a scene where a person wearing a yellow straw hat, white mustache, and gray suit holding a stick."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "WD3SC0mLpfI_1", "video_path": "WD3SC0mLpfI.mp4", "subtitle_path": "WD3SC0mLpfI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.13, "view_count": 800157}, {"video_id": "sl2awq9l9hc", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, there is an article titled 'Underwater' with a screenshot of the article showing a Twitter icon under the title, then a blue sphere with green appears, and finally an aerial map view.", "First, an aerial map view, then a blue sphere with green appears, and finally an article titled 'Underwater' with a screenshot of the article showing a Twitter icon under the title.", "First, a blue sphere with green appears, then an aerial map view, and finally an article titled 'Underwater' with a screenshot of the article showing a Twitter icon under the title.", "First, a blue sphere with green appears, then an article titled 'Underwater' with a screenshot of the article showing a Twitter icon 
under the title, and finally an aerial map view.", "First, there is an article titled 'Underwater' with a screenshot of the article showing a Twitter icon under the title, then an aerial map view, and finally a blue sphere with green appears."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "sl2awq9l9hc_0", "video_path": "sl2awq9l9hc.mp4", "subtitle_path": "sl2awq9l9hc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 205.04, "view_count": 5625}, {"video_id": "sl2awq9l9hc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First is a scene of red fireballs continuously falling, followed by a scene where white mist rises from the water, with red firelight visible in the mist, and finally a scene where red and yellow flames are spinning continuously.", "First is a scene of red fireballs continuously falling, followed by a scene where red and yellow flames are spinning continuously, and finally a scene where white mist rises from the water, with red firelight visible in the mist.", "First is a scene where red and yellow flames are spinning continuously, followed by a scene of red fireballs continuously falling, and finally a scene where white mist rises from the water, with red firelight visible in the mist.", "First is a scene where white mist rises from the water, with red firelight visible in the mist, followed by a scene of red fireballs continuously falling, and finally a scene where red and yellow flames are spinning continuously.", "First is a scene where red and yellow flames are spinning continuously, followed by a scene where white mist rises from the water, with red firelight visible in the mist, and finally a scene of red fireballs continuously falling."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": 
"L2-Relation", "id": "sl2awq9l9hc_1", "video_path": "sl2awq9l9hc.mp4", "subtitle_path": "sl2awq9l9hc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 205.04, "view_count": 5625}, {"video_id": "QcDoslNiUt8", "question": "At the beginning of the video, there is a man wearing a chainmail hood, holding a sword high in his right hand and a shield in his left hand. To his right, there are three soldiers holding spears, and to his left, there are four men wearing gauntlets. When this man holding the sword appears outside the castle, what change happens to him?", "question_wo_referring_query": "What change happens to him?", "candidates": ["He no longer has the sword and shield in his hands", "His shield has changed to another sword", "A feather appeared on his head", "His hair has turned golden", "His sword has changed to a spear"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "QcDoslNiUt8_1", "video_path": "QcDoslNiUt8.mp4", "subtitle_path": "QcDoslNiUt8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 233.65, "view_count": 16227}, {"video_id": "wo-52LASUu0", "question": "In a room, on a wooden bookshelf, there is a globe and some file folders. A man with short hair and tattoos on his arms is sitting in front of the bookshelf. 
After he finishes introducing himself, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A man with white hair and glasses", "A man with an afro and glasses", "A man with an afro and no glasses", "A man with red hair and glasses", "A man with a flat top and glasses"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "wo-52LASUu0_1", "video_path": "wo-52LASUu0.mp4", "subtitle_path": "wo-52LASUu0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 477.93, "view_count": 216373}, {"video_id": "1aXeD_6smxc", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a blonde woman in front of a window takes off a blue coat, followed by fully bloomed white flowers with yellow pistils, and finally a blue pot with a bunch of green berries and a single purple berry.", "First, a blue pot with a bunch of green berries and a single purple berry, followed by fully bloomed white flowers with yellow pistils, and finally a blonde woman in front of a window taking off a blue coat.", "First, a blonde woman in front of a window takes off a blue coat, followed by a blue pot with a bunch of green berries and a single purple berry, and finally fully bloomed white flowers with yellow pistils.", "First, fully bloomed white flowers with yellow pistils, followed by a blue pot with a bunch of green berries and a single purple berry, and finally a blonde woman in front of a window taking off a blue coat.", "First, fully bloomed white flowers with yellow pistils, followed by a blonde woman in front of a window taking off a blue coat, and finally a blue pot with a bunch of green berries and a single purple berry."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": 
"1aXeD_6smxc_1", "video_path": "1aXeD_6smxc.mp4", "subtitle_path": "1aXeD_6smxc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.43, "view_count": 1219149}, {"video_id": "shAaKh7dR9g", "question": "At the beginning of the video, there is a man with short black hair and stubble wearing a black short-sleeve shirt. Next to him in a white frame is the green text 'WILFREDO PRIETO'. In which scenes does this man appear?", "question_wo_referring_query": "In which scenes does this man appear?", "candidates": ["In a large hall next to a male sculpture touching his head with his right hand and a man wearing a blue shirt who is bald on top but has hair around the sides.", "Next to a man dressed in a security uniform.", "Next to a woman wearing a blue plaid shirt with her hair styled in a bun.", "In front of the main entrance of an exhibition hall.", "In a glass display window."], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "shAaKh7dR9g_1", "video_path": "shAaKh7dR9g.mp4", "subtitle_path": "shAaKh7dR9g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 186.22, "view_count": 2503}, {"video_id": "H9qCwQGSfwc", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["Slice the red cabbage on the cutting board, put the red cabbage into the pot with clean water, heat the water and add salt, cut the scallions into thin strips, and stir-fry the scallions in hot oil along with other ingredients in the pot.", "Cut the scallions into thin strips, slice the red cabbage on the cutting board, put the red cabbage into the pot with clean water, heat the water and add salt, and stir-fry the scallions in hot oil along with other ingredients in the pot.", "Put the red cabbage into the pot with clean water, heat the water and add salt, cut the scallions into 
thin strips, slice the red cabbage on the cutting board, and stir-fry the scallions in hot oil along with other ingredients in the pot.", "Heat the water and add salt, stir-fry the scallions in hot oil along with other ingredients in the pot, slice the red cabbage on the cutting board, put the red cabbage into the pot with clean water, and cut the scallions into thin strips.", "Put the red cabbage into the pot with clean water, slice the red cabbage on the cutting board, heat the water and add salt, cut the scallions into thin strips, and stir-fry the scallions in hot oil along with other ingredients in the pot."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "H9qCwQGSfwc_0", "video_path": "H9qCwQGSfwc.mp4", "subtitle_path": "H9qCwQGSfwc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.27, "view_count": 43910}, {"video_id": "H9qCwQGSfwc", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which is the correct sequence of events?", "candidates": ["Cut the mackerel into pieces, stir-fry the mackerel with onions in the pot, add cornstarch into boiling water and cook, add a small cube of butter to the mackerel in the pot, add western blue flowers (broccoli) to the pot and continue stir-frying with the mackerel, pour the cooked cornstarch into the pot and mix it with the stir-fried mackerel and western blue flowers.", "Add cornstarch into boiling water and cook, pour the cooked cornstarch into the pot and mix it with the stir-fried mackerel and western blue flowers, add a small cube of butter to the mackerel in the pot, add western blue flowers (broccoli) to the pot and continue stir-frying with the mackerel, cut the mackerel into pieces, stir-fry the mackerel with onions in the pot.", "Add a small cube of butter to the mackerel in the pot, add cornstarch into boiling water and cook, pour the cooked cornstarch into the 
pot and mix it with the stir-fried mackerel and western blue flowers, add western blue flowers (broccoli) to the pot and continue stir-frying with the mackerel, cut the mackerel into pieces, stir-fry the mackerel with onions in the pot.", "Pour the cooked cornstarch into the pot and mix it with the stir-fried mackerel and western blue flowers, add a small cube of butter to the mackerel in the pot, add cornstarch into boiling water and cook, add western blue flowers (broccoli) to the pot and continue stir-frying with the mackerel, cut the mackerel into pieces, stir-fry the mackerel with onions in the pot.", "Cut the mackerel into pieces, stir-fry the mackerel with onions in the pot, add a small cube of butter to the mackerel in the pot, add western blue flowers (broccoli) to the pot and continue stir-frying with the mackerel, add cornstarch into boiling water and cook, pour the cooked cornstarch into the pot and mix it with the stir-fried mackerel and western blue flowers."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "H9qCwQGSfwc_1", "video_path": "H9qCwQGSfwc.mp4", "subtitle_path": "H9qCwQGSfwc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.27, "view_count": 43910}, {"video_id": "CU_HzIGE6lU", "question": "In the final white background scene on the far right, there are three characters standing. The leftmost character is a white-haired individual with a sword pointing upwards. 
Which character is holding a round shield with a cat head and eagle design?", "question_wo_referring_query": "Which character is holding a round shield with a cat head and eagle design?", "candidates": ["The character with three black feathers on their helmet", "The rightmost character among the three, wearing a golden helmet and holding a spear", "The character wearing a gray helmet", "The character with two feathers on their head", "The middle character among the three, holding a sword"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "CU_HzIGE6lU_1", "video_path": "CU_HzIGE6lU.mp4", "subtitle_path": "CU_HzIGE6lU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 258.97, "view_count": 6025}, {"video_id": "rFQQFq1vV3k", "question": "After the appearance of a person wearing a hat and holding a telescope while observing from afar, which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["A group of people wearing overalls hammering at the iron tracks on the ground beside a stone.", "A woman dressed in black clothing with a white polka-dot bow tie around her neck, standing in front of a black door making a prayer gesture.", "A man dressed in black clothing with a white polka-dot bow tie around his neck, standing in front of a black door making a prayer gesture.", "A man sitting on a bed in a dark room, covering his head with his hands.", "A woman sitting on a bed in a dark room, covering her head with her hands."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "rFQQFq1vV3k_1", "video_path": "rFQQFq1vV3k.mp4", "subtitle_path": "rFQQFq1vV3k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 518.25, "view_count": 528289}, {"video_id": "7_VggZfyfTA", "question": "In a pavilion, a man wearing a black vest is 
talking. There are three people around him: on the left, a person is wearing a white short-sleeved shirt; on the right, there is a man wearing a hat, and further right, there is a woman resting while sitting. In the distance, some people are walking back and forth. Outside the pavilion, there are Osmanthus trees, and the sunlight is very bright. With which subtitles has the man in the black vest appeared?", "question_wo_referring_query": "With which subtitles has the man in the black vest appeared?", "candidates": ["holy", "moly", "crap", "fish after spawning about 7 to 8 whale", "seu"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "7_VggZfyfTA_1", "video_path": "7_VggZfyfTA.mp4", "subtitle_path": "7_VggZfyfTA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 287.29, "view_count": 135246}, {"video_id": "xmaqrw6GJPM", "question": "A long-haired woman wearing a denim jacket and jeans, with an orange inner shirt, is standing in a room. Behind her on the wall, there is a row of white cabinets filled with various items. On the left side of the room, there are two office desks, some chairs, and wooden shelves. On the right side, there is a transparent office desk. Later, she is next to a blue cabinet covered with white fabric, organizing the white fabric. 
What kind of change occurs while she is doing this?", "question_wo_referring_query": ", what kind of change occurs?", "candidates": ["Changed into black pants", "Put on white gloves", "Changed into a different jacket", "Tied up her hair", "Put on gray gloves"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "xmaqrw6GJPM_1", "video_path": "xmaqrw6GJPM.mp4", "subtitle_path": "xmaqrw6GJPM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 458.08, "view_count": 1136993}, {"video_id": "LFuxCYaD76E", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a small dog is seen lying on the floor with its tongue out. Then, a woman in a room wearing a white short-sleeve shirt with a ring on her left hand is holding a book with the word 'COLLEEN' on the cover. Finally, there's a green basket filled with items, including a maple leaf with the word 'FALL' on it.", "First, a small dog is seen lying on the floor with its tongue out. Then, there's a green basket filled with items, including a maple leaf with the word 'FALL' on it. Finally, a woman in a room wearing a white short-sleeve shirt with a ring on her left hand is holding a book with the word 'COLLEEN' on the cover.", "First, a woman in a room wearing a white short-sleeve shirt with a ring on her left hand is holding a book with the word 'COLLEEN' on the cover. Then, there's a green basket filled with items, including a maple leaf with the word 'FALL' on it. Finally, a small dog is seen lying on the floor with its tongue out.", "First, there's a green basket filled with items, including a maple leaf with the word 'FALL' on it. Afterwards, a woman in a room wearing a white short-sleeve shirt with a ring on her left hand is holding a book with the word 'COLLEEN' on the cover. 
Finally, a small dog is seen lying on the floor with its tongue out.", "First, there's a green basket filled with items, including a maple leaf with the word 'FALL' on it. Afterwards, a small dog is seen lying on the floor with its tongue out. Finally, a woman in a room wearing a white short-sleeve shirt with a ring on her left hand is holding a book with the word 'COLLEEN' on the cover."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "LFuxCYaD76E_0", "video_path": "LFuxCYaD76E.mp4", "subtitle_path": "LFuxCYaD76E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.24, "view_count": 18339}, {"video_id": "LFuxCYaD76E", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, it's a fortress with lights shining through the windows on a rainy night. Then, a woman wearing black clothes and glasses is reading a book on a dark chair. Finally, through the mouth of a protective barrier, you can see two trees on the road with a man in black clothes leaning against a silver car.", "First, a woman wearing black clothes and glasses is reading a book on a dark chair. Then, through the mouth of a protective barrier, you can see two trees on the road with a man in black clothes leaning against a silver car. Finally, it's a fortress with lights shining through the windows on a rainy night.", "First, it's a fortress with lights shining through the windows on a rainy night. Then, through the mouth of a protective barrier, you can see two trees on the road with a man in black clothes leaning against a silver car. Finally, a woman wearing black clothes and glasses is reading a book on a dark chair.", "First, through the mouth of a protective barrier, you can see two trees on the road with a man in black clothes leaning against a silver car. 
Then, a woman wearing black clothes and glasses is reading a book on a dark chair. Finally, it's a fortress with lights shining through the windows on a rainy night.", "First, a woman wearing black clothes and glasses is reading a book on a dark chair. Then, it's a fortress with lights shining through the windows on a rainy night. Finally, through the mouth of a protective barrier, you can see two trees on the road with a man in black clothes leaning against a silver car."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "LFuxCYaD76E_1", "video_path": "LFuxCYaD76E.mp4", "subtitle_path": "LFuxCYaD76E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.24, "view_count": 18339}, {"video_id": "ZVg7TuIGSlA", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a blue water surface with snowy mountains in the distance appears, followed by yellow soil. Next is mountains with red rock formations, then a view with tall buildings in the background and green grass in the foreground. The final scene shows a road with olive green slopes on the side.", "First, mountains with red rock formations appear, followed by a blue water surface with snowy mountains in the distance, then yellow soil. Next is a view with tall buildings in the background and green grass in the foreground. The final scene shows a road with olive green slopes on the side.", "First, a blue water surface with snowy mountains in the distance appears, followed by yellow soil. The next scene shows mountains with red rock formations, then a view with tall buildings in the background and green grass in the foreground. The final scene shows a road with olive green slopes on the side.", "First, a blue water surface with snowy mountains in the distance appears, followed by yellow soil. 
The next scene shows mountains with red rock formations, then a road with olive green slopes on the side. The final scene shows a view with tall buildings in the background and green grass in the foreground."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "ZVg7TuIGSlA_0", "video_path": "ZVg7TuIGSlA.mp4", "subtitle_path": "ZVg7TuIGSlA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 347.08, "view_count": 9016}, {"video_id": "ZVg7TuIGSlA", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a scene with misty mountains and green trees along a jade-colored lakeshore, followed by a partial view of the Earth's surface with bold white text on the side, then an image of the entire Earth, and finally, a scene with rolling green hills, a winding river, and numerous residences on the opposite bank.", "First, there is a partial view of the Earth's surface with bold white text on the side, followed by a scene with misty mountains and green trees along a jade-colored lakeshore, then an image of the entire Earth, and finally, a scene with rolling green hills, a winding river, and numerous residences on the opposite bank.", "First, there is a partial view of the Earth's surface with bold white text on the side, followed by a scene with misty mountains and green trees along a jade-colored lakeshore, then a scene with rolling green hills, a winding river, and numerous residences on the opposite bank, and finally, an image of the entire Earth.", "First, there is a scene with misty mountains and green trees along a jade-colored lakeshore, followed by a partial view of the Earth's surface with bold white text on the side, then a scene with rolling green hills, a winding river, and numerous residences on the opposite bank, and finally, an image of the entire 
Earth."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "ZVg7TuIGSlA_1", "video_path": "ZVg7TuIGSlA.mp4", "subtitle_path": "ZVg7TuIGSlA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 347.08, "view_count": 9016}, {"video_id": "BYSE5ZUsrRg", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a black-haired man in black clothing speaking appears, then it shifts to a black and white artwork hanging on the wall, followed by a blue background artwork where a woman in the picture is looking up and eating something. Finally, it shifts to a woman and an old man with white hair, as well as a black-haired man on the right side having a conversation.", "First, a woman and an old man with white hair, as well as a black-haired man on the right side having a conversation appears, then it shifts to a black-haired man in black clothing speaking, followed by a blue background artwork where a woman in the picture is looking up and eating something. Finally, it shifts to a black and white artwork hanging on the wall.", "First, a black and white artwork hanging on the wall appears, then it shifts to a black-haired man in black clothing speaking, followed by a blue background artwork where a woman in the picture is looking up and eating something. Finally, it shifts to a woman and an old man with white hair, as well as a black-haired man on the right side having a conversation.", "First, a blue background artwork where a woman in the picture is looking up and eating something appears, then it shifts to a black-haired man in black clothing speaking, followed by a black and white artwork hanging on the wall. 
Finally, it shifts to a woman and an old man with white hair, as well as a black-haired man on the right side having a conversation."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "BYSE5ZUsrRg_1", "video_path": "BYSE5ZUsrRg.mp4", "subtitle_path": "BYSE5ZUsrRg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 236.86, "view_count": 46593}, {"video_id": "eYwBCvwD6y8", "question": "On the red wooden table in the screen, a pair of black-gloved hands is folding a pancake. The pancake contains white and green fragments. What is inside the pancake in the video?", "question_wo_referring_query": "What is inside the pancake in the video?", "candidates": ["Radish", "Green onion", "Cilantro", "Pumpkin"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "eYwBCvwD6y8_1", "video_path": "eYwBCvwD6y8.mp4", "subtitle_path": "eYwBCvwD6y8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 257.3, "view_count": 8994}, {"video_id": "8jnfsldTS9o", "question": "Which sequence of events is correct?", "question_wo_referring_query": "Which sequence of events is correct?", "candidates": ["First, in the black and white scene, there are 4 people sitting in a car. One boy is holding his chin with his left hand, and the girl next to him is looking out the window. Next, a little boy is lying on a book, and a little girl with her left hand on her chin is reading. Finally, a little girl in a pink top is holding a book and reading.", "First, in the black and white scene, there are 4 people sitting in a car. One boy is holding his chin with his left hand, and the girl next to him is looking out the window. Next, a little girl in a pink top is holding a book and reading. 
Finally, a little boy is lying on a book, and a little girl with her left hand on her chin is reading.", "First, a little boy is lying on a book, and a little girl with her left hand on her chin is reading. Next, in the black and white scene, there are 4 people sitting in a car. One boy is holding his chin with his left hand, and the girl next to him is looking out the window. Finally, a little girl in a pink top is holding a book and reading.", "First, a little boy is lying on a book, and a little girl with her left hand on her chin is reading. Next, a little girl in a pink top is holding a book and reading. Finally, in the black and white scene, there are 4 people sitting in a car. One boy is holding his chin with his left hand, and the girl next to him is looking out the window."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "8jnfsldTS9o_0", "video_path": "8jnfsldTS9o.mp4", "subtitle_path": "8jnfsldTS9o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 279.05, "view_count": 17114}, {"video_id": "8jnfsldTS9o", "question": "Which sequence of events is correct?", "question_wo_referring_query": "Which sequence of events is correct?", "candidates": ["First, in a white room, a black bronze statue standing on one leg with its left hand extending two fingers and its right hand extending three fingers appears. Then, a little girl wearing a rose-red top appears, looking at a black kitten in a display case. Finally, in front of a white relief sculpture, a little girl wearing a rose-red top and holding a book stands, with a woman wearing a black coat and scarf, also holding a book, standing next to her.", "First, in a white room, a black bronze statue standing on one leg with its left hand extending two fingers and its right hand extending three fingers appears. Then, a little girl wearing a rose-red top appears, looking at a black kitten in a display case. 
Finally, in front of a white relief sculpture, a little girl wearing a rose-red top and holding a book stands, with a woman wearing a black coat and scarf, also holding a book, standing next to her.", "First, a little girl wearing a rose-red top appears, looking at a black kitten in a display case. Then, in front of a white relief sculpture, a little girl wearing a rose-red top and holding a book stands, with a woman wearing a black coat and scarf, also holding a book, standing next to her. Finally, in a white room, a black bronze statue standing on one leg with its left hand extending two fingers and its right hand extending three fingers appears.", "First, in a white room, a black bronze statue standing on one leg with its left hand extending two fingers and its right hand extending three fingers appears. Then, in front of a white relief sculpture, a little girl wearing a rose-red top and holding a book stands, with a woman wearing a black coat and scarf, also holding a book, standing next to her. Finally, a little girl wearing a rose-red top appears, looking at a black kitten in a display case.", "First, a little girl wearing a rose-red top appears, looking at a black kitten in a display case. Then, in a white room, a black bronze statue standing on one leg with its left hand extending two fingers and its right hand extending three fingers appears. 
Finally, in front of a white relief sculpture, a little girl wearing a rose-red top and holding a book stands, with a woman wearing a black coat and scarf, also holding a book, standing next to her."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "8jnfsldTS9o_1", "video_path": "8jnfsldTS9o.mp4", "subtitle_path": "8jnfsldTS9o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 279.05, "view_count": 17114}, {"video_id": "PvH9CFI0ZD8", "question": "In this video, under a blue background, a man with a bald head wearing a blue and white checked shirt is giving an explanation. The screen shows an original white icy wilderness scene and another scene of a simulated map dominated by blue and green colors. Which location is the man first introducing in this video?", "question_wo_referring_query": "Which location is the man first introducing in this video?", "candidates": ["Arctic region", "Northern Europe region", "Antarctic region", "Pacific region"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "PvH9CFI0ZD8_1", "video_path": "PvH9CFI0ZD8.mp4", "subtitle_path": "PvH9CFI0ZD8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 500.46, "view_count": 87452}, {"video_id": "jKd06kNcAC8", "question": "A man is standing in front of a mirror. Behind him is a white wall with three people standing by the wall. There are two suitcases stacked next to the wall. The man is wearing a black round-neck shirt, his hair is brown, and there is a white pillar on the right side. When the phrase 'What the hell is that?' 
is mentioned, what is the non-existent object?", "question_wo_referring_query": ", what is the non-existent object?", "candidates": ["Round Lamp", "White Suitcase", "Blue Eyes", "Dark Blue Clothing", "Red Roadblock"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "jKd06kNcAC8_1", "video_path": "jKd06kNcAC8.mp4", "subtitle_path": "jKd06kNcAC8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 369.58, "view_count": 132523}, {"video_id": "2v7vlsY2RM8", "question": "In a PPT slide, there is a white background with a title in black font, under the title there are some shapes like stairs, rectangles, and arrows made of triangles. On the right side, there are some 3D cubes and rectangular prisms. When 'two uh matrix' is mentioned, what shape is the object labeled with 'VGG'?", "question_wo_referring_query": "What shape is the object labeled with 'VGG'?", "candidates": ["Inverted stair shape", "Square shape", "Cylinder shape", "Horizontal stair shape", "Triangle shape"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2A", "level": "IntraMoment", "id": "2v7vlsY2RM8_1", "video_path": "2v7vlsY2RM8.mp4", "subtitle_path": "2v7vlsY2RM8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 505.8, "view_count": 486}, {"video_id": "iihQ4bQ_5zA", "question": "In a room, there is a painting on the left wall with four characters on it. In the lower-left corner, there is a brown sofa. To the right of the sofa, there is a shelf with three musical instruments on it. On the right side, there is a window. There is a computer on the desk, and next to it, there is a bookshelf with colorful books. In front, there is a person in blue clothes. 
Which character in the upper-left corner of the painting is wearing a hat?", "question_wo_referring_query": "Which character in the upper-left corner of the painting is wearing a hat?", "candidates": ["The person wearing a helmet", "The person holding a knife", "The person wearing blue clothes", "The person raising both hands", "The person holding a shield"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "iihQ4bQ_5zA_1", "video_path": "iihQ4bQ_5zA.mp4", "subtitle_path": "iihQ4bQ_5zA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 298.33, "view_count": 20232}, {"video_id": "zak0M2Wfz9U", "question": "At the beginning of the video, there is a soldier holding a rifle, wearing a red uniform, with a black hat decorated with a red ornament. In which scenes does he appear?", "question_wo_referring_query": "In which scenes does he appear?", "candidates": ["In front of a blue pavilion with the letters 'ER,' next to a soldier riding a horse, wearing armor, and holding a rifle.", "In front of a green pavilion with the letters 'ER,' next to a soldier riding a horse, wearing armor, and holding a long sword.", "In front of a blue pavilion with the letters 'ER,' next to a soldier riding a horse, wearing armor, and holding a long sword.", "In front of a red pavilion with the letters 'ER,' next to a soldier riding a horse, wearing armor, and holding a long sword.", "In front of a white pavilion with the letters 'ER,' next to a soldier riding a horse, wearing armor, and holding a long sword."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "zak0M2Wfz9U_1", "video_path": "zak0M2Wfz9U.mp4", "subtitle_path": "zak0M2Wfz9U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 503.67, "view_count": 2177884}, {"video_id": "vpujwSIfWSc", "question": "Which of the following sequences of events is correct?", 
"question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a black tank appears in the lower-right corner of the blue background, then two black tanks appear in the blue background, and finally, three black tanks appear at the bottom of the blue background.", "First, a black tank appears in the lower-left corner of the blue background, then two black tanks appear in the blue background, and finally, three black tanks appear at the bottom of the blue background.", "First, a black tank appears in the upper-right corner of the blue background, then two black tanks appear in the blue background, and finally, three black tanks appear at the bottom of the blue background.", "First, a black tank appears in the lower-left corner of the blue background, then three black tanks appear in the blue background, and finally, two black tanks appear at the bottom of the blue background.", "First, a black tank appears in the lower-right corner of the blue background, then three black tanks appear in the blue background, and finally, two black tanks appear at the bottom of the blue background."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "vpujwSIfWSc_1", "video_path": "vpujwSIfWSc.mp4", "subtitle_path": "vpujwSIfWSc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 424.93, "view_count": 27695}, {"video_id": "N-SLWh1pVz8", "question": "On the left side, there is a purple background with a long-haired woman in purple clothing speaking. On the right side, there is a cover of a scrapbook. On the cover, there is a white text box with characters. 
What is the pattern on the cover of the scrapbook?", "question_wo_referring_query": "What is the pattern on the cover of the scrapbook?", "candidates": ["Inverted triangular pattern collage", "Square pattern collage", "Hexagonal pattern collage", "Circular pattern collage", "Triangular pattern collage"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "N-SLWh1pVz8_1", "video_path": "N-SLWh1pVz8.mp4", "subtitle_path": "N-SLWh1pVz8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 282.65999999999997, "view_count": 40196}, {"video_id": "OmNVB3ff98s", "question": "The room door is white. There is a black table with a laptop and an electronic device with colorful buttons on it. A black shelf is above the table. A man in a light-colored short-sleeve shirt is speaking, and after the subtitle 'on a completely different topic' appears, what object appears beside the man?", "question_wo_referring_query": "What object appears beside the man?", "candidates": ["a plate of food", "a silver fork", "a fruit plate", "a bunch of keys", "a cup of water"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "OmNVB3ff98s_1", "video_path": "OmNVB3ff98s.mp4", "subtitle_path": "OmNVB3ff98s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 533.48, "view_count": 536}, {"video_id": "RDg0FHqFqxg", "question": "On a white background, there is blue English text at the top and black English text on the left side. Below the black English text, there is a black rectangular frame. In the top right corner, there is a water drop icon. 
When black characters appear in the blank area in the middle of the screen, what changes occur to the black rectangular frame?", "question_wo_referring_query": "What changes occur to the black rectangular frame?", "candidates": ["The black frame becomes larger", "Red characters appear inside the black frame", "White characters appear inside the black frame", "The black frame moves to the left side", "The black frame becomes smaller"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "RDg0FHqFqxg_1", "video_path": "RDg0FHqFqxg.mp4", "subtitle_path": "RDg0FHqFqxg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.03, "view_count": 81}, {"video_id": "HQn1QKQYXVg", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which sequence of scenes below is correct?", "candidates": ["First, screens with yellow, blue, and purple code appear, followed by a screen with green code, and finally a screen with blue, green, yellow, and orange code appears.", "First, screens with yellow, blue, and purple code appear, followed by a screen with blue, green, yellow, and orange code. Finally, a screen with green code appears.", "First, a screen with green code appears, followed by a screen with blue, green, yellow, and orange code. 
Finally, screens with yellow, blue, and purple code appear.", "First, a screen with green code appears, followed by screens with yellow, blue, and purple code, and finally a screen with blue, green, yellow, and orange code appears."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SSS", "level": "L2-Relation", "id": "HQn1QKQYXVg_1", "video_path": "HQn1QKQYXVg.mp4", "subtitle_path": "HQn1QKQYXVg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 431, "duration": 445.0, "view_count": 37005}, {"video_id": "5As9xro9940", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a black background with the white text 'THE ERROR' and a military green tank is shown. Next, a red hand, a white tank, and the white text 'January 1943' are displayed. Finally, a tan with a camo pattern appears above the military green tank.", "First, a red hand, a white tank, and the white text 'January 1943' are shown. Next, a black background with the white text 'THE ERROR' and a military green tank is displayed. Finally, a tan with a camo pattern appears above the military green tank.", "First, a tan with a camo pattern appears above the military green tank. Then the same camo pattern tan appears again above the military green tank. Finally, a black background with the white text 'THE ERROR' and a black tank is shown.", "First, a red hand, a white tank, and the white text 'January 1943' are shown. Next, a tan with a camo pattern appears above the military green tank. Finally, a black background with the white text 'THE ERROR' and a military green tank is shown.", "First, a black background with the white text 'THE ERROR' and a military green tank is shown. Next, a tan with a camo pattern is displayed above the military green tank. 
Finally, a red hand, a white tank, and the white text 'January 1943' are shown."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "5As9xro9940_1", "video_path": "5As9xro9940.mp4", "subtitle_path": "5As9xro9940_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 487.93, "view_count": 119012}, {"video_id": "rBdOp4Btfrg", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, an image of two red circles with a black tank in the middle on a blue background was played, followed by a logo consisting of a green tank and an olive ear of wheat on a blue background, and finally, an image of a question mark inside a red circle on a blue background was played.", "First, a logo consisting of a green tank and an olive ear of wheat on a blue background was played, followed by an image of a question mark inside a red circle on a blue background, and finally, an image of two red circles with a black tank in the middle on a blue background was played.", "First, an image of a question mark inside a red circle on a blue background was played, then an image of two red circles with a black tank in the middle on a blue background was played, and finally, another image of a question mark inside a red circle on a blue background was played.", "First, a logo consisting of a green tank and an olive ear of wheat on a blue background was played, followed by an image of two red circles with a black tank in the middle on a blue background, and finally, an image of a question mark inside a red circle on a blue background was played.", "First, an image of a question mark inside a red circle on a blue background was played, followed by a logo consisting of a green tank and an olive ear of wheat on a blue background, and finally, an image of two red circles with a black tank in the middle on 
a blue background was played."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "rBdOp4Btfrg_0", "video_path": "rBdOp4Btfrg.mp4", "subtitle_path": "rBdOp4Btfrg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 217.23, "view_count": 73866}, {"video_id": "rBdOp4Btfrg", "question": "Which of the following sequences of scenarios is correct?", "question_wo_referring_query": "Which of the following sequences of scenarios is correct?", "candidates": ["First, a red circle like a light bulb with a blue background was shown, followed by a red circular shape with two faces, and finally a circular shape with three arrows.", "First, a circular shape with three arrows was shown, followed by a red circular shape with two faces, and finally a red circle like a light bulb with a blue background.", "First, a red circle like a light bulb with a blue background was shown, followed by a circular shape with three arrows, and finally a red circular shape with two faces.", "First, a circular shape with three arrows was shown, followed by a red circle like a light bulb with a blue background, and finally a red circular shape with two faces.", "First, a red circular shape with two faces was shown, followed by a red circle like a light bulb with a blue background, and finally a circular shape with three arrows."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "rBdOp4Btfrg_1", "video_path": "rBdOp4Btfrg.mp4", "subtitle_path": "rBdOp4Btfrg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 217.23, "view_count": 73866}, {"video_id": "A8B765SJwHk", "question": "There is a large bowl on a yellow wooden table, and a person is pouring clear water into the bowl containing red strawberries. 
What objects appear in this scene?", "question_wo_referring_query": "What objects appear in this scene?", "candidates": ["White flowers", "Red flowers", "Green leaves", "Yellow flowers", "Purple flowers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "A8B765SJwHk_1", "video_path": "A8B765SJwHk.mp4", "subtitle_path": "A8B765SJwHk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.0, "view_count": 349311}, {"video_id": "Q-FY5jRM8qE", "question": "A black striped cat is lying on a bamboo chair. There is a blue box to the back left of the chair. A girl in a white short-sleeved shirt reaches through the gap under the chair armrest to pet the cat. What does the girl do after petting the cat?", "question_wo_referring_query": "What does the girl do after petting the cat?", "candidates": ["Lie on the beach", "Sit on the chair and look at the menu", "Walk on a small road", "Drink a bottle of soda water", "Get on a motorcycle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "Q-FY5jRM8qE_1", "video_path": "Q-FY5jRM8qE.mp4", "subtitle_path": "Q-FY5jRM8qE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.04, "view_count": 32216}, {"video_id": "ZaZIkctuUvs", "question": "As sunlight shines on the yellow sandy beach, the inland side is a field of green plants while the seaward side is filled with varying shades of blue ocean. Numerous small boats are anchored near the shore. 
After the subtitle 'want to see more of these places in' appears, what can be seen on the screen?", "question_wo_referring_query": "What can be seen on the screen?", "candidates": ["A shirtless man with white floral blue shorts is strolling", "A shirtless man with white floral red shorts is strolling", "A woman in a black dress surfing in the sea", "A man wearing a red headscarf appears on the beach", "A cruise ship sailing on the sea"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "ZaZIkctuUvs_1", "video_path": "ZaZIkctuUvs.mp4", "subtitle_path": "ZaZIkctuUvs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 481.76, "view_count": 107346}, {"video_id": "IdScFcPHTKY", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a clip of a girl driving a car, then a video of the girl driving with friends, followed by a video of the girl driving alone, and finally a video of the girl driving alone.", "First, there is a video of the girl driving with friends, then a video of the girl driving alone, followed by a video of the girl arriving at a room with friends to go for a drive together, and finally a video of driving with friends.", "First, there is a clip of a girl driving a car, then a video of the girl driving, followed by a video of the girl arriving at a room with friends to go for a drive together, and finally a video of the girl driving alone.", "First, there is a clip of a girl driving a car, then a video of the girl driving, followed by a video of the girl arriving at a room with friends to go for a drive together, and finally a video of the girl and her friends driving together.", "First, there is a video of the girl driving with friends, then a video of the girl driving alone, followed by a video of the girl arriving at a room with 
friends to go for a drive together, and finally a video of the girl driving alone."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "IdScFcPHTKY_1", "video_path": "IdScFcPHTKY.mp4", "subtitle_path": "IdScFcPHTKY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 409.88, "view_count": 25453}, {"video_id": "xWLpCJTV94k", "question": "When the dragon boat passes by the woman with black hair who is wearing a blue floral dress and a straw hat, with her legs partially submerged in water and facing away from the camera, what is the first hand gesture this woman makes?", "question_wo_referring_query": ", what is the first hand gesture this woman makes?", "candidates": ["Touches her back", "Waves to the people on the dragon boat", "Crosses her hands behind her back", "Spreads the fingers of her right hand"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "xWLpCJTV94k_0", "video_path": "xWLpCJTV94k.mp4", "subtitle_path": "xWLpCJTV94k_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 14, "duration": 15.0, "view_count": 10968}, {"video_id": "79VnJRIZ_fI", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a man wearing a hat and sunglasses without a shirt rides a scooter, followed by four men holding giraffe-shaped floats running on the grass, and finally, a man wearing sunglasses and yellow shorts riding a scooter.", "First, a man wearing sunglasses and yellow shorts rides a scooter, followed by four men holding giraffe-shaped floats running on the grass, and finally, a man wearing a hat and sunglasses without a shirt riding a scooter.", "First, a man wearing sunglasses and yellow shorts rides a scooter, followed by a man wearing a hat and sunglasses without a shirt riding a scooter, and 
finally, four men holding giraffe-shaped floats run on the grass.", "First, a man wearing a hat and sunglasses without a shirt rides a scooter, followed by a man wearing sunglasses and yellow shorts riding a scooter, and finally, four men holding giraffe-shaped floats running on the grass.", "First, four men holding giraffe-shaped floats run on the grass, followed by a man wearing sunglasses and yellow shorts riding a scooter, and finally, a man wearing a hat and sunglasses without a shirt riding a scooter."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "79VnJRIZ_fI_0", "video_path": "79VnJRIZ_fI.mp4", "subtitle_path": "79VnJRIZ_fI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 145, "duration": 16.0, "view_count": 3071229}, {"video_id": "WyJDimr_BJ0", "question": "On a black background, there is a gray-white star and another star partially covered by a shadow. When the scene transitions to this setting, what event occurs?", "question_wo_referring_query": "When the scene transitions to this setting, what event occurs?", "candidates": ["A meteor flies past one of the stars", "A cloud appears on one of the stars", "One star is rotating while the other star is not", "The star rotates slowly", "A meteorite strikes the two stars"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "WyJDimr_BJ0_0", "video_path": "WyJDimr_BJ0.mp4", "subtitle_path": "WyJDimr_BJ0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1126, "duration": 51.99, "view_count": 395327}, {"video_id": "JB-DCLC6M3Q", "question": "Inside the brightly pure white building, when the name frame with Alex's name pops up and the conversation mentions 'one of the designers of this huge...', which object appears in the video?", "question_wo_referring_query": "Which object appears in the video?", "candidates": ["Green hat", "Green hooded feathered garment", "White 
feathered garment", "Black hooded raincoat", "White stone steps"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "JB-DCLC6M3Q_0", "video_path": "JB-DCLC6M3Q.mp4", "subtitle_path": "JB-DCLC6M3Q_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 271, "duration": 30.0, "view_count": 867}, {"video_id": "mz9Cy0xUH0E", "question": "Which of the following scenes in the video is correct?", "question_wo_referring_query": "Which of the following scenes in the video is correct?", "candidates": ["First, there is a picture with a black background and white letters, then two men in military uniforms appear, followed by two national flags, and finally a man in a military uniform with a red building on his right.", "First, there are two men wearing military uniforms, then a picture with a black background and white letters appears, followed by a man in a military uniform with a red building on his right, and finally two national flags.", "First, there are two men wearing military uniforms, then a picture with a black background and white letters appears, followed by two national flags, and finally a man in a military uniform with a red building on his right.", "First, two national flags appear, then a picture with a black background and white letters, followed by a man in a military uniform with a red building on his right, and finally two men in military uniforms.", "First, there is a picture with a black background and white letters, then two national flags appear, followed by a man in a military uniform with a red building on his right, and finally two men in military uniforms."], "topic_category": "KH-Knowledge-History", "question_category": "SSS", "level": "L2-Relation", "id": "mz9Cy0xUH0E_0", "video_path": "mz9Cy0xUH0E.mp4", "subtitle_path": "mz9Cy0xUH0E_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 128, "duration": 30.0, "view_count": 1176009}, {"video_id": "ezuFR5OwHU8", 
"question": "A group of soldiers dressed in military green uniforms and wearing green camouflage helmets are seen fighting fiercely in which of the following scenarios?", "question_wo_referring_query": "In which of the following scenarios do they appear?", "candidates": ["Near a white building on a green grassy field.", "Around an M-shaped door surrounding.", "On the road by a vehicle near a beige wall and grey brick buildings.", "Near three palm trees, behind a red brick wall and a vehicle.", "In a room with pink walls, right next to a window where a TV is placed."], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "ezuFR5OwHU8_0", "video_path": "ezuFR5OwHU8.mp4", "subtitle_path": "ezuFR5OwHU8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 101, "duration": 37.0, "view_count": 2470048}, {"video_id": "_8vHZvuticY", "question": "When the subtitle 'pretty Patel signed off that or order of' appears, what change occurs on the screen of the woman wearing a black coat, sitting in front of the bookshelf, and wearing black-framed glasses?", "question_wo_referring_query": "What change occurs?", "candidates": ["The screen shrinks to appear on the left side of the news video", "The screen shrinks to appear on the right side of the news video", "The screen shrinks to appear in the middle of the news video", "The screen shrinks to appear at the top of the news video", "The screen disappears"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "_8vHZvuticY_0", "video_path": "_8vHZvuticY.mp4", "subtitle_path": "_8vHZvuticY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 124, "duration": 58.0, "view_count": 1649}, {"video_id": "kPaaObNjn2w", "question": "In a kitchen with a man and a woman, the man is wearing a black T-shirt and black pants, and the woman is wearing a nude long sleeve shirt and black pants. 
What is not present in this kitchen?", "question_wo_referring_query": "What is not present in this kitchen?", "candidates": ["white bowl", "beautiful painting", "white gloves", "red spice jar", "yellow banana"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "kPaaObNjn2w_0", "video_path": "kPaaObNjn2w.mp4", "subtitle_path": "kPaaObNjn2w_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 477, "duration": 14.97, "view_count": 25533}, {"video_id": "KCrjrUNWyZc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which sequence of scenes is correct?", "candidates": ["First is the scene of stir-frying vegetables, then the scene of adding sauce, and finally the scene of plating the dish.", "First is the scene of plating the dish, then the scene of adding sauce, and finally the scene of stir-frying vegetables.", "First is the scene of adding sauce, then the scene of stir-frying vegetables, and finally the scene of plating the dish.", "First is the scene of stir-frying vegetables, then the scene of plating the dish, and finally the scene of adding sauce.", "First is the scene of adding sauce, then the scene of plating the dish, and finally the scene of stir-frying vegetables."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "KCrjrUNWyZc_0", "video_path": "KCrjrUNWyZc.mp4", "subtitle_path": "KCrjrUNWyZc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 105, "duration": 50.01, "view_count": 202555}, {"video_id": "GPSwz29LVdg", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following scenes/steps is in the correct sequence?", "candidates": ["First, sprinkle chili on the ingredients in the pot, then pour the ingredients from the pot into the iron plate, and finally cover the iron pot with a transparent lid.", "First, 
cover the iron pot with a transparent lid, then pour the ingredients from the pot into the iron plate, and finally sprinkle chili on the ingredients in the pot.", "First, pour the ingredients from the pot into the iron plate, then sprinkle chili on the ingredients in the pot, and finally cover the iron pot with a transparent lid.", "First, pour the ingredients from the pot into the iron plate, then cover the iron pot with a transparent lid, and finally sprinkle chili on the ingredients in the pot.", "First, sprinkle chili on the ingredients in the pot, then cover the iron pot with a transparent lid, and finally pour the ingredients from the pot into the iron plate."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "GPSwz29LVdg_0", "video_path": "GPSwz29LVdg.mp4", "subtitle_path": "GPSwz29LVdg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 400, "duration": 50.0, "view_count": 93991}, {"video_id": "RxHWxL9dCBI", "question": "What happened before the small figure, wearing blue clothes and having a round white face with only two black circles as eyes, made a move on the map that includes ocean and land?", "question_wo_referring_query": "What happened?", "candidates": ["A group of small figures, holding shields and wearing armor, were riding horses and moving forward.", "A soldier was setting a fire.", "A soldier was holding three money bags.", "Only one small figure, holding a shield and wearing armor, was riding a horse and moving forward.", "A soldier was killing someone with a knife."], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "RxHWxL9dCBI_0", "video_path": "RxHWxL9dCBI.mp4", "subtitle_path": "RxHWxL9dCBI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 3, "duration": 52.0, "view_count": 1041}, {"video_id": "hbrrrnXu4GI", "question": "In front of a wall with a greenish painting hanging, there is a woman with 
gold short hair wearing a black outfit, looking at the white label in the painting and pointing at it with her right hand. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She tears off the label and throws it into the trash can", "She takes the label off and introduces it to the camera", "She is wiping off the label's central ink stain with her hand", "She is pointing at the label for an introduction"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "hbrrrnXu4GI_0", "video_path": "hbrrrnXu4GI.mp4", "subtitle_path": "hbrrrnXu4GI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 140, "duration": 44.0, "view_count": 37048}, {"video_id": "_DiIXcHUg70", "question": "In a screen with a black background, there is a video window with a blue top featuring a picture of a young girl. Below are two sections with bold text descriptions. Which object appears in the screen?", "question_wo_referring_query": "Which object appears in the screen?", "candidates": ["A picture of a young girl with long brown hair", "A picture of a young girl with long black hair", "A picture of a young girl with short black hair", "A picture of a young girl with short brown hair"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "_DiIXcHUg70_0", "video_path": "_DiIXcHUg70.mp4", "subtitle_path": "_DiIXcHUg70_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1364, "duration": 43.0, "view_count": 574280}, {"video_id": "D1-xWH7K9rY", "question": "Which sequence of events is correct?", "question_wo_referring_query": "Which sequence of events is correct?", "candidates": ["First appears a blue star with a red line on it. Next comes a yellow ball with a black ring around it, and the words 'OZ GEOGRAPHICS' written in the middle. 
Lastly, a blue swimming pool with a pink and a blue float inside appears.", "First, a yellow ball with a black ring around it appears. In the middle, it has the words 'OZ GEOGRAPHICS' written. Next, a blue swimming pool with a pink and a blue float appears. Lastly, a blue star with a red line on it appears.", "First appears a blue star with a red line on it. Then, a blue swimming pool with a pink and a blue float appears. Finally, a yellow ball with a black ring around it appears, with the words 'OZ GEOGRAPHICS' written in the middle.", "First, a blue swimming pool with a pink and a blue float appears. Then, a blue star with a red line on it appears. Lastly, a yellow ball with a black ring around it appears, with the words 'OZ GEOGRAPHICS' written in the middle."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "D1-xWH7K9rY_0", "video_path": "D1-xWH7K9rY.mp4", "subtitle_path": "D1-xWH7K9rY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 27, "duration": 29.0, "view_count": 32602}, {"video_id": "__Bchxr3ejw", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a girl wearing a fur coat covers herself with a fur blanket and lies down on a sofa. Then she lights a match, and finally, she uses the match to light a candle.", "First, a girl wearing a fur coat stands up and lights a match. Then, she uses the match to light a candle. Finally, she covers herself with a fur blanket and lies down on a sofa.", "A girl covers herself with a fur blanket and lies down on a sofa, then opens a book.", "First, a girl wearing a fur coat kneels on the ground and lights a match. Then, she uses the match to light a candle. 
Finally, she covers herself with a fur blanket and lies down on a sofa."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "__Bchxr3ejw_0", "video_path": "__Bchxr3ejw.mp4", "subtitle_path": "__Bchxr3ejw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 333, "duration": 37.0, "view_count": 3906}, {"video_id": "kHy8_fjrkdk", "question": "In which scene does the high-five action, involving the person wearing a black hat and a black armband with a white pattern on a short sleeve shirt at the beginning of the video, occur?", "question_wo_referring_query": "In which scene does the high-five action, involving the person wearing a black hat and a black armband with a white pattern on a short sleeve shirt at the beginning of the video, occur?", "candidates": ["The man wearing a light blue shirt gave a high-five as he walked up the steps", "The man wearing a light blue shirt gave a high-five in front of the wall;", "The man wearing a light blue shirt gave a high-five as he walked up the steps; the man wearing a light blue shirt gave a high-five in front of the wall; the man wearing a light blue shirt gave a high-five with the last little boy", "The man wearing a light blue shirt gave a high-five as he walked up the steps; the man wearing a light blue shirt gave a high-five in front of the wall;"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "kHy8_fjrkdk_0", "video_path": "kHy8_fjrkdk.mp4", "subtitle_path": "kHy8_fjrkdk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 283, "duration": 50.99, "view_count": 29021}, {"video_id": "1OiCBI-m6eo", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a girl with hands in her pants pockets appears. 
Then, a girl wearing a beige top and pants, high heels, and a headscarf appears. Finally, a girl in high heels appears, with one hand on her waist and the other removing her headscarf.", "First, a girl wearing black pants, glasses, and with hands in her pockets appears. Then, a girl with a headscarf appears in a room filled with clothes and shoes. Finally, a girl in high heels appears, with one hand on her waist and the other removing her headscarf.", "First, a boy wearing purple pants appears. Then, a girl with a headscarf appears in a room filled with clothes and shoes. Finally, a girl in high heels appears, with one hand on her waist and the other removing her headscarf.", "First, a girl wearing a beige top and pants, with a headscarf, appears in a room filled with clothes and shoes. Then, a girl in high heels appears, with one hand on her waist and the other removing her headscarf. Finally, a girl wearing black pants, glasses, and with hands in her pockets appears."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "1OiCBI-m6eo_0", "video_path": "1OiCBI-m6eo.mp4", "subtitle_path": "1OiCBI-m6eo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 475, "duration": 19.02, "view_count": 222683}, {"video_id": "VUw_9tu3gUI", "question": "On the left side of the screen, there is a red cup resting against the wall. A cartoon mouse wearing glasses and a red outfit is kneeling down, and on the right side is another cartoon character mouse wearing a pink skirt. 
When the subtitle 'you find pal there he is he's running' appears, what does the dog on the left side of the comic book look like?", "question_wo_referring_query": "What does the dog on the left side of the comic book look like?", "candidates": ["A medium-sized white dog", "A large yellow wolf-dog", "A small brown dog", "A large brown dog", "A yellow puppy"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "VUw_9tu3gUI_0", "video_path": "VUw_9tu3gUI.mp4", "subtitle_path": "VUw_9tu3gUI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 106, "duration": 14.0, "view_count": 1600}, {"video_id": "3mHC-Pf8-dU", "question": "What kind of clothing is the person wearing, who is standing in front of a blue display screen with a rectangular device projected on the screen behind them showing a height of 62 cm?", "question_wo_referring_query": "What kind of clothing are they wearing?", "candidates": ["Wearing a black long-sleeve top with a white neck-drape dress over it", "Only wearing a white long-sleeve sweater", "Wearing a white long-sleeve top with a black neck-drape dress over it", "Wearing a white long-sleeve top with a red neck-drape dress over it", "Only wearing a black long-sleeve sweater"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "3mHC-Pf8-dU_0", "video_path": "3mHC-Pf8-dU.mp4", "subtitle_path": "3mHC-Pf8-dU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 285, "duration": 12.0, "view_count": 2192491}, {"video_id": "qGKUKj-ha3Q", "question": "How many children appear in the video in total?", "question_wo_referring_query": "How many children appear in the video in total?", "candidates": ["3 children", "6 children", "5 children", "2 children", "4 children"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "qGKUKj-ha3Q_0", "video_path": "qGKUKj-ha3Q.mp4", "subtitle_path": 
"qGKUKj-ha3Q_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 100, "duration": 12.0, "view_count": 24919}, {"video_id": "023NcazhBmI", "question": "In a bedroom with yellow walls and an open white door, a man wearing a black short-sleeved shirt is looking into a mirror with his arms bent in front of him. After mentioning 'speech at this tree is Rirls's present,' what did he do next?", "question_wo_referring_query": "What did he do next?", "candidates": ["The man stretched his arms and did a chest expansion exercise", "The man bent down to tie his shoe laces", "The man bent his arm to adjust his right sleeve", "The man bent his arm to adjust his left sleeve"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "023NcazhBmI_0", "video_path": "023NcazhBmI.mp4", "subtitle_path": "023NcazhBmI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 158, "duration": 12.97, "view_count": 3489292}, {"video_id": "8LycOZ7Wh_4", "question": "Which sequence of scenes in the video is correct?", "question_wo_referring_query": "Which sequence of scenes in the video is correct?", "candidates": ["First, two people are seen talking, then a sculpture with a black base and red accents appears, and finally a hand-made clay sculpture is shown.", "First, a sculpture with a black base and red accents appears, then two people are seen talking, and finally a hand-made clay sculpture is shown.", "First, two people are seen talking, then a hand-made clay sculpture is shown, and finally a sculpture with a black base and red accents appears.", "First, a hand-made clay sculpture is shown, then two people are seen talking, and finally a sculpture with a black base and red accents appears."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "8LycOZ7Wh_4_0", "video_path": "8LycOZ7Wh_4.mp4", "subtitle_path": "8LycOZ7Wh_4_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 242, "duration": 13.01, "view_count": 10391}, {"video_id": "Yo9pGYBMtds", "question": "In the video, a person wearing yellow clothes is handing over a purple object to another person dressed in white clothes, with the words 'TIBERIUS GRACCJUS' visible. In another scene, there are two people in swimwear, one of them holding a purse, with a group of people standing behind them. Which scene appears first?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["The only scene shows two similarly dressed men, one holding an ear of corn, and the other holding a purse.", "The only scene shows a man dressed entirely in yellow holding a purple object, alongside another man wearing a red inner shirt and white outer coat.", "The first scene shows a person in yellow clothes handing a purple object to another in white clothes, with the words 'TIBERIUS GRACCJUS' visible. The last scene features two people in swimwear, one holding a purse, with a group of people behind them.", "The first scene features two people in swimwear, one holding a purse, with a group of people behind them. The last scene shows a person in yellow clothes handing a purple object to another in white clothes, with the words 'TIBERIUS GRACCJUS' visible."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "Yo9pGYBMtds_0", "video_path": "Yo9pGYBMtds.mp4", "subtitle_path": "Yo9pGYBMtds_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 24, "duration": 11.01, "view_count": 1990}, {"video_id": "_OBUVipjUzk", "question": "A man in a dark gray short-sleeve shirt is facing the camera, standing in front of a photo wall. In the upper right corner of the screen, there are three lines of white subtitles beginning with 'Karachay'. 
After mentioning 'They have two main groups the Turkic Karachay peoples and the Circassian Turkish peoples, just like Kabardino-Balkaria,' what action did he take next?", "question_wo_referring_query": "What action did he take next?", "candidates": ["On screen, he turns his head to the right, holds wheat in his left hand, and gestures toward the subtitles above with his right hand.", "On screen, he turns his head to the left, holds wheat in his left hand, and gestures toward the subtitles above with his right hand.", "On screen, he turns his head to the right, holds wheat in his right hand, and gestures toward the subtitles above with his left hand.", "On screen, he turns his head to the left, holds wheat in his right hand, and gestures toward the subtitles above with his left hand."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "_OBUVipjUzk_0", "video_path": "_OBUVipjUzk.mp4", "subtitle_path": "_OBUVipjUzk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 510, "duration": 10.01, "view_count": 1097371}, {"video_id": "_wZMvhMeobQ", "question": "In front of a business street with a white crosswalk, a man with tied hair wearing a grey long-sleeve shirt is facing the camera, behind him is a woman wearing a purple-grey outfit with her hair tied. 
Who appears first?", "question_wo_referring_query": "Who appears first?", "candidates": ["The man with tied hair wearing a white long-sleeve shirt", "The man with tied hair wearing a grey long-sleeve shirt", "The woman wearing a purple-grey outfit with her hair tied", "Both people appear at the same time"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "_wZMvhMeobQ_0", "video_path": "_wZMvhMeobQ.mp4", "subtitle_path": "_wZMvhMeobQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 932, "duration": 10.98, "view_count": 103607}, {"video_id": "KfRS8RG18Kk", "question": "On a vast grassland, four warriors holding spears and shields are riding horses, leading the soldiers behind them forward. What changes did these soldiers undergo in the end?", "question_wo_referring_query": "What changes did they undergo?", "candidates": ["They used bows and arrows against the enemy.", "They returned to their homeland to celebrate victory.", "They held shields to resist enemy arrows.", "They rode horses and advanced with spears."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "KfRS8RG18Kk_0", "video_path": "KfRS8RG18Kk.mp4", "subtitle_path": "KfRS8RG18Kk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 167, "duration": 10.01, "view_count": 2759}, {"video_id": "UTAY-XpNgjk", "question": "In front of a black curtain, a man wearing a gray floral shirt has his hands crossed while standing next to a man in a black short-sleeve shirt. 
After mentioning 'how the breakdown of the population', what changes occurred with the two men?", "question_wo_referring_query": "After mentioning 'how the breakdown of the population', what changes occurred with the two men?", "candidates": ["Both men pointed towards the camera with one hand at the same time", "Both men spread their arms at the same time", "Both men crossed their hands at the same time", "One man pointed towards the camera with one hand while the other crossed his hands"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "UTAY-XpNgjk_0", "video_path": "UTAY-XpNgjk.mp4", "subtitle_path": "UTAY-XpNgjk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 549, "duration": 9.01, "view_count": 309447}, {"video_id": "3wAKNDVY3bk", "question": "In the video, when a person in a blue outfit holding a dustpan throws away garbage into a pit, at the moment the subtitle \"of discase wirhin a city or town\" appears, what is the color of the dustpan held by the person in the blue outfit?", "question_wo_referring_query": "What is the color of the dustpan held by the person in the blue outfit?", "candidates": ["Black", "Light brown", "Red", "Blue"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "3wAKNDVY3bk_0", "video_path": "3wAKNDVY3bk.mp4", "subtitle_path": "3wAKNDVY3bk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 317, "duration": 10.0, "view_count": 763627}, {"video_id": "pCNGTT9LbBQ", "question": "There are two men on the screen, one of whom is wearing black sunglasses. 
What is the man in the bottom right corner of the screen doing?", "question_wo_referring_query": "What is the man in the bottom right corner of the screen doing?", "candidates": ["Clasping his fists", "Squatting", "Shaking hands with the man wearing sunglasses", "Turning his head", "Laughing loudly"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "pCNGTT9LbBQ_0", "video_path": "pCNGTT9LbBQ.mp4", "subtitle_path": "pCNGTT9LbBQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.63, "view_count": 1839766}, {"video_id": "pCNGTT9LbBQ", "question": "A man and a woman are sitting together in the video. The man's collar is multi-striped, and there is also a logo on his chest. He has a neutral facial expression. What did the woman do?", "question_wo_referring_query": "What did the woman do?", "candidates": ["Get angry at the man", "Touch the man's head", "Laugh", "Help the man adjust his collar", "Make eye contact with the man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "pCNGTT9LbBQ_1", "video_path": "pCNGTT9LbBQ.mp4", "subtitle_path": "pCNGTT9LbBQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.63, "view_count": 1839766}, {"video_id": "pCNGTT9LbBQ", "question": "There are two glasses on the screen, and a man wearing black-framed glasses and a leather jacket. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Laughing loudly at the mirror", "Throwing the lighter into the trash can", "Drinking alcohol", "Taking off the jacket", "Playing with a lighter"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "pCNGTT9LbBQ_2", "video_path": "pCNGTT9LbBQ.mp4", "subtitle_path": "pCNGTT9LbBQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 988.63, "view_count": 1839766}, {"video_id": "gzNrnSxKA7c", "question": "An injured man is receiving fluids, wearing a thin chain around his neck, and has some bloodstains on the collar of his white shirt. When the subtitle 'that if he went back everyone would end' appears, what objects are present in the frame?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["camera", "blood bag", "white gauze bandage", "backpack", "emergency kit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "gzNrnSxKA7c_0", "video_path": "gzNrnSxKA7c.mp4", "subtitle_path": "gzNrnSxKA7c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.18, "view_count": 12093}, {"video_id": "gzNrnSxKA7c", "question": "In the lower left corner of the screen, there's a man with short curly hair wearing a white shirt. Next to him, there's a woman with long hair holding her chin with her hand. 
When the subtitle 'five years ago when his team got into' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["watch", "notebook", "cell phone", "cup", "pen"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "gzNrnSxKA7c_1", "video_path": "gzNrnSxKA7c.mp4", "subtitle_path": "gzNrnSxKA7c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.18, "view_count": 12093}, {"video_id": "gzNrnSxKA7c", "question": "In the screen, a man wearing a grey cap and a red-green armor is holding a little girl with both hands. When the subtitle 'helicopters crash into each other and' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["fishing net", "camera", "black-grey thick smoke", "wooden barrel", "mobile phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "gzNrnSxKA7c_2", "video_path": "gzNrnSxKA7c.mp4", "subtitle_path": "gzNrnSxKA7c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.18, "view_count": 12093}, {"video_id": "ajor5LGXs0Q", "question": "In the office, many employees are working on computers. A standing man pats the shoulder of a sitting man, and the sitting man looks up at the standing man. What color is the shirt of the man sitting down?", "question_wo_referring_query": "In the office, many employees are working on computers. A standing man pats the shoulder of a sitting man, and the sitting man looks up at the standing man. 
What color is the shirt of the man sitting down?", "candidates": ["khaki", "gray", "white", "blue", "black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "ajor5LGXs0Q_0", "video_path": "ajor5LGXs0Q.mp4", "subtitle_path": "ajor5LGXs0Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1087.76, "view_count": 17053}, {"video_id": "ajor5LGXs0Q", "question": "On the table in the scene, there is a predominantly white helmet, several books, a bottle with a yellow cap, and two photos of two people wearing blue-green tops and white pants, as well as a group photo. What is the shape of the photo frame used to display the group photo?", "question_wo_referring_query": "What is the shape of the photo frame used to display the group photo?", "candidates": ["Triangle", "Rectangle", "Square", "Circle", "Staircase"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "ajor5LGXs0Q_1", "video_path": "ajor5LGXs0Q.mp4", "subtitle_path": "ajor5LGXs0Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1087.76, "view_count": 17053}, {"video_id": "ajor5LGXs0Q", "question": "On the left side of the screen, a man and a woman are on a swing, while on the right side, there is a man alone on a swing. 
What color is the helmet worn by the man on the right side of the screen?", "question_wo_referring_query": "What color is the helmet worn by the man on the right side of the screen?", "candidates": ["black", "blue", "yellow", "red", "green"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "ajor5LGXs0Q_2", "video_path": "ajor5LGXs0Q.mp4", "subtitle_path": "ajor5LGXs0Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1087.76, "view_count": 17053}, {"video_id": "o9sDPX2FXeQ", "question": "Inside a white-walled room, a woman dressed in black stands in front of a glass cabinet on the right side of the screen. She is holding clothes and looking at a man sitting in front of a mirror. What is the woman's hairstyle when the caption 'the man has to say instead they just' appears?", "question_wo_referring_query": "What is the woman's hairstyle?", "candidates": ["Short black hair tucked behind the ear", "Long black wavy hair", "Tied with a band", "Long black hair with a shawl"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "o9sDPX2FXeQ_0", "video_path": "o9sDPX2FXeQ.mp4", "subtitle_path": "o9sDPX2FXeQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1065.08, "view_count": 238607}, {"video_id": "o9sDPX2FXeQ", "question": "In a messy room with various items placed around, a woman wearing white clothes is answering a phone call. 
When the subtitle 'night as the man was sick and wanted to' appears, what pattern is on the woman's headscarf?", "question_wo_referring_query": "What pattern is on the woman's headscarf?", "candidates": ["Gray polka dots", "Red stripes", "Red and white checkered", "Black and white stripes", "Green stripes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "o9sDPX2FXeQ_1", "video_path": "o9sDPX2FXeQ.mp4", "subtitle_path": "o9sDPX2FXeQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1065.08, "view_count": 238607}, {"video_id": "o9sDPX2FXeQ", "question": "The man in a high-neck coat is looking at a telescope. When the subtitle 'the old man to join him as hero drinks' appears, what is the color of the glowing circular accessory on the man's clothing?", "question_wo_referring_query": "What is the color of the glowing circular accessory on the man's clothing?", "candidates": ["red", "gold", "purple", "black", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "o9sDPX2FXeQ_2", "video_path": "o9sDPX2FXeQ.mp4", "subtitle_path": "o9sDPX2FXeQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1065.08, "view_count": 238607}, {"video_id": "qVdLXQgt6dw", "question": "In a crowded airport, there are many white and gray chairs. A woman dressed in all black is pushing a trolley. 
Who is the person carrying a black suitcase in the picture?", "question_wo_referring_query": "Who is the person carrying a black suitcase in the picture?", "candidates": ["Hugo Vega", "case", "Brody", "Joe", "Braga\n"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "qVdLXQgt6dw_0", "video_path": "qVdLXQgt6dw.mp4", "subtitle_path": "qVdLXQgt6dw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.92, "view_count": 9230}, {"video_id": "qVdLXQgt6dw", "question": "There are a few people watching the match around the boxing ring. On the ring, a shirtless man in yellow shorts is fighting a woman wearing a headscarf. Who is the woman in the headscarf in the video?", "question_wo_referring_query": "Who is the woman in the headscarf in the video?", "candidates": ["Brody", "Braga", "G", "Matty Ramos", "Joe"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "qVdLXQgt6dw_1", "video_path": "qVdLXQgt6dw.mp4", "subtitle_path": "qVdLXQgt6dw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.92, "view_count": 9230}, {"video_id": "qVdLXQgt6dw", "question": "On the left side of the screen, there is a black man wearing a gray short-sleeve shirt facing the right side of the screen. On the right side of the screen, there are three men in black short-sleeve shirts facing the man in the gray short-sleeve shirt. In the middle, a man in a suit is trying to mediate an argument. 
Who is the tallest person involved in the altercation on the screen?", "question_wo_referring_query": "Who is the tallest person involved in the altercation on the screen?", "candidates": ["Brody", "Braga", "Brody", "Joe", "Matty Ramos"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "qVdLXQgt6dw_2", "video_path": "qVdLXQgt6dw.mp4", "subtitle_path": "qVdLXQgt6dw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.92, "view_count": 9230}, {"video_id": "qVdLXQgt6dw", "question": "On the left side of the screen, a black man in a gray short-sleeve shirt is facing the right side of the screen. On the right side of the screen, three men in black short-sleeve shirts are facing the man in the gray shirt. In the middle, a man in a suit is trying to break up a fight. Among them, who is the tallest person involved in the conflict?", "question_wo_referring_query": "Who is the tallest person involved in the conflict?", "candidates": ["Joe", "Brody", "Matty Ramos", "Brody", "Braga"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "qVdLXQgt6dw_3", "video_path": "qVdLXQgt6dw.mp4", "subtitle_path": "qVdLXQgt6dw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 993.92, "view_count": 9230}, {"video_id": "MAttW7AUrq4", "question": "There are many cars parked within the parking lines on the street, and there are some green spaces and trees on the side of the road. 
What happened when the black car appeared for the first time?", "question_wo_referring_query": "What happened when it appeared?", "candidates": ["Caught fire", "Turned", "Fell into the water", "Flipped over", "Collided with another car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "MAttW7AUrq4_0", "video_path": "MAttW7AUrq4.mp4", "subtitle_path": "MAttW7AUrq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1937.07, "view_count": 14376}, {"video_id": "MAttW7AUrq4", "question": "There is a transparent bowl and several glass bottles with red stickers containing beverages in the scene. When the beverage held by a hand appears for the first time in the scene, what happened to it?", "question_wo_referring_query": "What happened to it?", "candidates": ["Got thrown into the distance", "Got placed neatly on the ground", "Got spilled on the ground", "Got snatched away by someone", "Got drunk"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "MAttW7AUrq4_1", "video_path": "MAttW7AUrq4.mp4", "subtitle_path": "MAttW7AUrq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1937.07, "view_count": 14376}, {"video_id": "MAttW7AUrq4", "question": "On a dimly lit street, there is a pink balloon, a white car, and a sign with red and green mixed characters. A man in a suit is standing by the street. 
When the man's tie appears, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["It falls to the ground", "It's given to someone else by the man", "It's blown away by a strong wind", "It's torn apart by the man", "It's taken off by the man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "MAttW7AUrq4_2", "video_path": "MAttW7AUrq4.mp4", "subtitle_path": "MAttW7AUrq4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1937.07, "view_count": 14376}, {"video_id": "eQFiZ_4AIsE", "question": "There is a person wearing black gloves and a mask on the screen. What happened when the subtitle 'Thomas removes his mask first, followed by Llyr and Dai' appeared?", "question_wo_referring_query": "What happened?", "candidates": ["This person put on the mask.", "This person snatched the mask from someone else's hand.", "This person handed the mask to someone else.", "This person took off their mask.", "This person threw away the mask."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "eQFiZ_4AIsE_0", "video_path": "eQFiZ_4AIsE.mp4", "subtitle_path": "eQFiZ_4AIsE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.71, "view_count": 71589}, {"video_id": "eQFiZ_4AIsE", "question": "On a pitch-black night, you can vaguely see the posts by the road, a car is driving in the dark night, what happened when the caption 'in to night' appeared?", "question_wo_referring_query": ", what happened?", "candidates": ["The car disappeared from the screen", "The car caught fire", "The car overturned", "The car roughly drove to the center position of the screen", "The car roughly drove to the bottom left position of the screen"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "eQFiZ_4AIsE_1", "video_path": 
"eQFiZ_4AIsE.mp4", "subtitle_path": "eQFiZ_4AIsE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.71, "view_count": 71589}, {"video_id": "eQFiZ_4AIsE", "question": "On a narrow road flanked by dense trees, there are two spheres floating in the air, one in front of the other. When the subtitle 'Ricky drives and Candy manages to shoot two of the spheres' appears, what happens to the sphere in the foreground?", "question_wo_referring_query": ", what happens to the sphere in the foreground?", "candidates": ["Falls to the ground", "Turns red", "Splits into three spheres", "Explodes", "Catches fire"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "eQFiZ_4AIsE_2", "video_path": "eQFiZ_4AIsE.mp4", "subtitle_path": "eQFiZ_4AIsE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.71, "view_count": 71589}, {"video_id": "VXP2v8ECVvo", "question": "There is a man in a blue shirt in the frame who is slightly looking down at something. Next to him, there is a woman in a green top. What happens when the woman's gaze shifts diagonally towards the man?", "question_wo_referring_query": "There is a man in a blue shirt in the frame who is slightly looking down at something. Next to him, there is a woman in a green top. 
What happens when the woman's gaze shifts diagonally towards the man?", "candidates": ["The man touches the woman's head", "The man smiles at the woman", "The man and the woman hug face to face", "The man catches the woman by the shoulder with one hand", "The woman adjusts the man's collar"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "VXP2v8ECVvo_0", "video_path": "VXP2v8ECVvo.mp4", "subtitle_path": "VXP2v8ECVvo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1004.47, "view_count": 79821}, {"video_id": "VXP2v8ECVvo", "question": "A woman with medium-length yellow hair, wearing a green undershirt and a black coat, stands at the entrance. What happens after she walks into the room?", "question_wo_referring_query": ", what happens?", "candidates": ["The woman shakes hands with another woman wearing a red top.", "The woman sits down on an office chair.", "The woman hugs a professor wearing a white gown.", "The woman shakes hands with a man.", "The woman kneels down to talk with a child."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "VXP2v8ECVvo_1", "video_path": "VXP2v8ECVvo.mp4", "subtitle_path": "VXP2v8ECVvo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1004.47, "view_count": 79821}, {"video_id": "VXP2v8ECVvo", "question": "Indoors, there is a tent and a chair with nine lattice patterns. A man is sitting in front of a round seat, and there is also a woman with medium-length yellow hair. 
What did the woman do before sitting down opposite the man?", "question_wo_referring_query": "What did the woman do before sitting down opposite the man?", "candidates": ["Went to the restroom to wash her face", "Changed into slippers", "Picked up a bottle of Brandy and a white cup", "Put down the bag she was carrying", "Picked up a bottle of Brandy and two white cups"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "VXP2v8ECVvo_2", "video_path": "VXP2v8ECVvo.mp4", "subtitle_path": "VXP2v8ECVvo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1004.47, "view_count": 79821}, {"video_id": "O8QYYNlg3U4", "question": "Which character appears first in the video?", "question_wo_referring_query": "Which character appears first in the video?", "candidates": ["Sophie", "Carl", "Logan", "Danny", "Ben"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "O8QYYNlg3U4_0", "video_path": "O8QYYNlg3U4.mp4", "subtitle_path": "O8QYYNlg3U4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1090.73, "view_count": 348856}, {"video_id": "O8QYYNlg3U4", "question": "Which female character appears first in the video?", "question_wo_referring_query": "Which female character appears first in the video?", "candidates": ["Trina", "Sophie's mother", "Helen", "Danny's mother", "Sophie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "O8QYYNlg3U4_1", "video_path": "O8QYYNlg3U4.mp4", "subtitle_path": "O8QYYNlg3U4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1090.73, "view_count": 348856}, {"video_id": "O8QYYNlg3U4", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["Black Backpack", "Intercom", "Staircase", "Straw 
Hat", "Guardian"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "O8QYYNlg3U4_2", "video_path": "O8QYYNlg3U4.mp4", "subtitle_path": "O8QYYNlg3U4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1090.73, "view_count": 348856}, {"video_id": "dPW5kUKjmR4", "question": "A man dressed in black is facing the camera, there is a mirror behind him. After the subtitle 'Josh looks pale and stressed when Leena finds him.', what happens?", "question_wo_referring_query": "What happens?", "candidates": ["Josh is holding a gun in his right hand, shouting angrily", "Devin takes off his headset", "Leena is holding a balloon", "Noel is being interviewed on television"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "dPW5kUKjmR4_0", "video_path": "dPW5kUKjmR4.mp4", "subtitle_path": "dPW5kUKjmR4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.62, "view_count": 2775419}, {"video_id": "dPW5kUKjmR4", "question": "Behind Josh are packed black bags and a paper box, and Josh, holding a wad of cash, looks to the left of the screen. 
After the phrase 'Josh is too happy about his dream coming true to doubt the story and moves out of the house' is mentioned, what happens?", "question_wo_referring_query": "Afterwards, what happens?", "candidates": ["Noel is interviewed on television", "Devin records a video with his mobile phone", "Leena flips through Marissa's diary", "Devin drags leena back home"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "dPW5kUKjmR4_1", "video_path": "dPW5kUKjmR4.mp4", "subtitle_path": "dPW5kUKjmR4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.62, "view_count": 2775419}, {"video_id": "dPW5kUKjmR4", "question": "Leena is sitting on a brown leather sofa, with her feet on a coffee table, flipping through a diary. What happened before the subtitle 'One evening, Leena is going through Marissa's diaries, which she's found in the attic.'?", "question_wo_referring_query": "What happened before?", "candidates": ["Devin records a video on his phone", "Noel is being interviewed on TV", "Josh is angrily shouting with a gun in his right hand", "Devin drags Leena back home"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "dPW5kUKjmR4_2", "video_path": "dPW5kUKjmR4.mp4", "subtitle_path": "dPW5kUKjmR4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.62, "view_count": 2775419}, {"video_id": "6rYciBvzkU4", "question": "A woman in a red dress is fixing her hair inside a bus, while outside the bus door, a man is facing away from the camera. The bus is dimly lit, and when the subtitle 'man turns out to be a passenger, and he, too, gets on the bus.' 
appears, what is the first item that appears?", "question_wo_referring_query": "What is the first item that appears?", "candidates": ["Scarf", "Earphones", "A signboard from Fuzhou to Longxiang", "License plate with 'TianC\u00b728730' in blue", "Black-rimmed glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6rYciBvzkU4_0", "video_path": "6rYciBvzkU4.mp4", "subtitle_path": "6rYciBvzkU4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 932.43, "view_count": 3405808}, {"video_id": "6rYciBvzkU4", "question": "In a dark environment, a man dressed in light khaki color is lowering his head with a fierce look in his eyes. After the subtitle 'stabbed' appears, what is the first object to appear?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["gun", "white bus", "blue blazer", "black butterfly knot", "red sand"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6rYciBvzkU4_1", "video_path": "6rYciBvzkU4.mp4", "subtitle_path": "6rYciBvzkU4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 932.43, "view_count": 3405808}, {"video_id": "6rYciBvzkU4", "question": "On the left side of the screen, there is a sculpture. Behind the sculpture is a calendar displaying the date July 15, 1996. 
When the subtitle 'date; July 15, 1996 often appeared on calendars' appears, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A middle-aged woman wearing a blue scarf", "A child holding a toy", "A man wearing a black coat", "A man dressed in a police uniform", "A woman in a red dress"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6rYciBvzkU4_2", "video_path": "6rYciBvzkU4.mp4", "subtitle_path": "6rYciBvzkU4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 932.43, "view_count": 3405808}, {"video_id": "7QUJ5SCKEhc", "question": "The surrounding area in the video shows yellow and brown hills, with a car on the road, and a wounded man lying on the ground. Where has the wounded man appeared before?", "question_wo_referring_query": "Where has the wounded man appeared before?", "candidates": ["In a kindergarten", "In a police station", "On a hospital bed", "In a swimming pool", "In an office"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "7QUJ5SCKEhc_0", "video_path": "7QUJ5SCKEhc.mp4", "subtitle_path": "7QUJ5SCKEhc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.41, "view_count": 83687}, {"video_id": "7QUJ5SCKEhc", "question": "The screen shows some golden grass and a small wooden hut, as well as a guard wearing an iron helmet. 
In which of the following places has the guard's iron helmet appeared?", "question_wo_referring_query": "In which of the following places has the guard's iron helmet appeared?", "candidates": ["In the river", "On the mountain top", "In the toilet", "Inside the wooden hut", "In the field"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "7QUJ5SCKEhc_1", "video_path": "7QUJ5SCKEhc.mp4", "subtitle_path": "7QUJ5SCKEhc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.41, "view_count": 83687}, {"video_id": "7QUJ5SCKEhc", "question": "In the scene with lush green leaves, there are three men and a woman with curly hair wearing a headscarf. Where has this woman appeared before?", "question_wo_referring_query": "Where has this woman appeared before?", "candidates": ["In a narrow alley being chased by someone on horseback wielding a weapon", "By a large river", "At the military headquarters", "On a mountaintop", "In a beautifully decorated house"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "7QUJ5SCKEhc_2", "video_path": "7QUJ5SCKEhc.mp4", "subtitle_path": "7QUJ5SCKEhc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.41, "view_count": 83687}, {"video_id": "ZD05hkfZAkk", "question": "In the scene, the surroundings are white furniture, with a hollowed-out dark brown guardrail, and a man with a tense and fierce expression facing the camera. 
Which subtitles have appeared together with this man?", "question_wo_referring_query": ", which subtitles have appeared together with this man?", "candidates": ["Let's fight", "I will definitely win", "Starting now", "Never give up", "and face several deadly fights today we"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "ZD05hkfZAkk_0", "video_path": "ZD05hkfZAkk.mp4", "subtitle_path": "ZD05hkfZAkk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1085.24, "view_count": 85215}, {"video_id": "ZD05hkfZAkk", "question": "In the scene, a man, wearing a V-neck shirt and with a bleeding head, is staring expressionlessly at the camera. Behind him is a square net. What subtitles appear together with the iron chain around the man's neck?", "question_wo_referring_query": "What subtitles appear together with the iron chain around the man's neck?", "candidates": ["I really hope to get through these days", "I hope I can break through myself", "One day, I will see the light again", "Tightly tied with iron chains", "shed on a leash and unable to act of his"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "ZD05hkfZAkk_1", "video_path": "ZD05hkfZAkk.mp4", "subtitle_path": "ZD05hkfZAkk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1085.24, "view_count": 85215}, {"video_id": "ZD05hkfZAkk", "question": "There are three people in the frame. In the center is an older man wearing black clothes and sunglasses. Behind him is a short-haired man with a startled expression. In the lower left corner of the frame is an elderly woman with white hair wearing a pearl necklace. 
When did this elderly woman and which subtitles appear together?", "question_wo_referring_query": "When did this elderly woman and which subtitles appear together?", "candidates": ["The person in the photo was once the company owner", "person in the photo as a former employee", "The person in the photo was once the main force of the company", "The person in the photo was once a company anchor", "The person in the photo was once a company customer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "ZD05hkfZAkk_2", "video_path": "ZD05hkfZAkk.mp4", "subtitle_path": "ZD05hkfZAkk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1085.24, "view_count": 85215}, {"video_id": "egG4VNZGqG8", "question": "There is a woman in a light blue top with pinned-up hair, and a man with short brown hair in a white top. Behind them, there is a woman in a red dress wearing a flower on her head. How does the flower on the woman's head change while she is dining with others?", "question_wo_referring_query": "How does the flower on the woman's head change while she is dining with others?", "candidates": ["From purple-red to golden-yellow", "From purple-red to white", "From purple-red to gray", "From purple-red to blue", "From purple-red to green"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "egG4VNZGqG8_0", "video_path": "egG4VNZGqG8.mp4", "subtitle_path": "egG4VNZGqG8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 958.83, "view_count": 108233}, {"video_id": "egG4VNZGqG8", "question": "In the scene with the golden light flashing on the stage, a sophisticatedly dressed woman stands in the center of the screen. 
How does this woman's hair differ when she is in a space surrounded by metallic poles?", "question_wo_referring_query": "How does this woman's hair differ when she is in a space surrounded by metallic poles?", "candidates": ["The hair changes from being styled to messy braids", "The hair color changes to blue", "The hair color changes to red", "It changes to short blue hair", "It changes to short hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "egG4VNZGqG8_1", "video_path": "egG4VNZGqG8.mp4", "subtitle_path": "egG4VNZGqG8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 958.83, "view_count": 108233}, {"video_id": "egG4VNZGqG8", "question": "On a rainy day, a woman dressed in black is sitting by a big tree, looking far away with a sad expression. How does this woman's attire change from when she was in the bushes?", "question_wo_referring_query": "How does this woman's attire change from when she was in the bushes?", "candidates": ["Her plain clothes changed to a black and white striped outfit with the number 16 on it", "Her plain clothes changed to a blue and white striped outfit with the number 12 on it", "Her plain clothes changed to a black and white striped outfit with the number 13 on it", "Her plain clothes changed to a black and white striped outfit with the number 12 on it", "Her plain clothes changed to an orange and white striped outfit with the number 12 on it"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "egG4VNZGqG8_2", "video_path": "egG4VNZGqG8.mp4", "subtitle_path": "egG4VNZGqG8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 958.83, "view_count": 108233}, {"video_id": "yn6zJisx4O8", "question": "In a poorly lit room, there's a man sitting in the middle on a stool, wearing a dark green coat; to the left is a green iron frame, to the right is a 
teak table with books on it and a window letting in light. What is the man in the room doing in the video?", "question_wo_referring_query": "What is the man in the room doing in the video?", "candidates": ["Smoking", "Reading", "Drinking water", "Exercising", "Sleeping"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "yn6zJisx4O8_0", "video_path": "yn6zJisx4O8.mp4", "subtitle_path": "yn6zJisx4O8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1381.55, "view_count": 15724}, {"video_id": "yn6zJisx4O8", "question": "In the screen, there is a man with a frowning expression on the left and a man wearing a black coat on the right. Behind them is a green wall, and there is a window in the distance to the right. What are the two men doing in the video?", "question_wo_referring_query": "What are the two men doing in the video?", "candidates": ["hugging", "sleeping", "shaking hands", "pulling each other", "eating noodles"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "yn6zJisx4O8_1", "video_path": "yn6zJisx4O8.mp4", "subtitle_path": "yn6zJisx4O8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1381.55, "view_count": 15724}, {"video_id": "yn6zJisx4O8", "question": "By the blue sea, there are a few red wine bottles below, a person standing in the middle (face not visible), and two men in black clothes sitting. 
What is the man sitting on the right side of the screen doing?", "question_wo_referring_query": "What is the man sitting on the right side of the screen doing?", "candidates": ["Reading a book", "Reading a book", "Nodding", "Waving", "Drinking wine"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "yn6zJisx4O8_2", "video_path": "yn6zJisx4O8.mp4", "subtitle_path": "yn6zJisx4O8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1381.55, "view_count": 15724}, {"video_id": "TTPotrEuzIo", "question": "From a bird's-eye view, a city composed of numerous buildings of varying heights stretches from the bottom to the top of the screen, with a grayish-white sky in the distance. What objects appear on the video screen?", "question_wo_referring_query": "What objects appear on the video screen?", "candidates": ["A helicopter", "A white ambulance", "An apple", "A man in black clothes", "A police car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "TTPotrEuzIo_0", "video_path": "TTPotrEuzIo.mp4", "subtitle_path": "TTPotrEuzIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 975.04, "view_count": 324628}, {"video_id": "TTPotrEuzIo", "question": "On the military training field, on the left are two rows of soldiers running, on the right is a grey-white ground, and in the distance are white buildings and a military green truck. 
What objects appear on the training field in the video?", "question_wo_referring_query": "What objects appear on the training field in the video?", "candidates": ["A red fire truck", "A helicopter", "A yellow truck head", "A blue keyboard", "A woman wearing a white suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "TTPotrEuzIo_1", "video_path": "TTPotrEuzIo.mp4", "subtitle_path": "TTPotrEuzIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 975.04, "view_count": 324628}, {"video_id": "TTPotrEuzIo", "question": "The screen shows a group of men and women dressed in green military uniforms or gray and black suits. Everyone is looking towards the camera. What objects or people appear in the video?", "question_wo_referring_query": ", what objects or people appear in the video?", "candidates": ["A man wearing a red suit", "A white clock", "A green military truck", "A black man wearing a gray and black suit", "A woman wearing a white suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "TTPotrEuzIo_2", "video_path": "TTPotrEuzIo.mp4", "subtitle_path": "TTPotrEuzIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 975.04, "view_count": 324628}, {"video_id": "PB5vTKqNUqw", "question": "In the bright room, there is a round white short table in the middle, a person wearing a black hat is in the lower left corner, a woman wearing gray clothes is standing on the right side, and a man wearing a gray-black suit is sitting in the upper middle. The three people are talking in the room. 
When the conversation mentions 'nean there looking like they knew each other', what items appear in the room?", "question_wo_referring_query": "What items appear in the room?", "candidates": ["White bottle", "Red bottle", "Boy wearing black clothes", "Desktop computer", "Blue bottle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "PB5vTKqNUqw_0", "video_path": "PB5vTKqNUqw.mp4", "subtitle_path": "PB5vTKqNUqw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2688.06, "view_count": 12845}, {"video_id": "PB5vTKqNUqw", "question": "In a room with a blue background wall, there is a woman sitting in front of a mirror wearing a blue and black sports jacket and applying a facial mask. On the wall to the left behind her, there is a green plant. When the subtitle mentions 'Z pretending to be someone named Koko,' what items appear in the room?", "question_wo_referring_query": "What items appear in the room?", "candidates": ["Laptop", "Black chair", "Red bottle", "White mug", "Glass cup"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "PB5vTKqNUqw_1", "video_path": "PB5vTKqNUqw.mp4", "subtitle_path": "PB5vTKqNUqw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2688.06, "view_count": 12845}, {"video_id": "PB5vTKqNUqw", "question": "In the bright office, there is a dark brown rectangular desk in the middle. A man wearing a silver-gray suit is sitting on the right side, and a woman with long hair holding a document is on the left side. In the distance, there are floor-to-ceiling windows. 
When the subtitle mentions 'next day sing Chan asked tongen to give', which items appear in the room?", "question_wo_referring_query": "Which items appear in the room?", "candidates": ["Cat", "Globe", "Telephone", "Smartphone", "Remote control car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "PB5vTKqNUqw_2", "video_path": "PB5vTKqNUqw.mp4", "subtitle_path": "PB5vTKqNUqw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2688.06, "view_count": 12845}, {"video_id": "eNqxgXkbaFw", "question": "In the room illuminated by dim light, in front of the mirror, on the left side there is a woman with long curly blonde hair wearing a white dress, and on the right side there is a woman with short hair. Both women are talking. What color is the short-haired woman\u2019s clothing?", "question_wo_referring_query": "What color is the short-haired woman\u2019s clothing in the room?", "candidates": ["Red", "Blue", "Yellow", "Orange", "White"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "eNqxgXkbaFw_0", "video_path": "eNqxgXkbaFw.mp4", "subtitle_path": "eNqxgXkbaFw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.95, "view_count": 113296}, {"video_id": "eNqxgXkbaFw", "question": "Under the white stone wall, a short-haired man wearing a tank top and a woman with curly hair are squatting next to the white wall. 
What color pants is the squatting woman wearing?", "question_wo_referring_query": "What color pants is the squatting woman wearing?", "candidates": ["Khaki", "Black", "Orange", "Gray", "White"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "eNqxgXkbaFw_1", "video_path": "eNqxgXkbaFw.mp4", "subtitle_path": "eNqxgXkbaFw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.95, "view_count": 113296}, {"video_id": "eNqxgXkbaFw", "question": "In front of a wall made of white blocks, a man with grayish-white hair and a mustache stands facing the camera, talking. What color clothes is the man wearing?", "question_wo_referring_query": ", what color clothes is the man wearing?", "candidates": ["red", "white", "gray", "blue", "yellow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "eNqxgXkbaFw_2", "video_path": "eNqxgXkbaFw.mp4", "subtitle_path": "eNqxgXkbaFw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.95, "view_count": 113296}, {"video_id": "knnoVAyOxuA", "question": "In a room with white walls, there's a gray table in the middle nearby. A man stands on each side of the table. In the distance, there is an open door, with a clock hanging above it. 
When the subtitle mentions 'and puts him back behind bars', what shape is the clock?", "question_wo_referring_query": "What shape is the clock?", "candidates": ["Triangle", "Rectangle", "Hexagon", "Square", "Circle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "knnoVAyOxuA_0", "video_path": "knnoVAyOxuA.mp4", "subtitle_path": "knnoVAyOxuA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.95, "view_count": 622223}, {"video_id": "knnoVAyOxuA", "question": "In a dimly lit room, a man wearing a suit and a bow-tie sits in front of a dark grey wall. When the subtitle mentions 'his case, making it possible to drop all charges against Gries,' what color is the man's suit?", "question_wo_referring_query": "What color is the man's suit?", "candidates": ["red", "green", "black", "white", "yellow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "knnoVAyOxuA_1", "video_path": "knnoVAyOxuA.mp4", "subtitle_path": "knnoVAyOxuA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.95, "view_count": 622223}, {"video_id": "knnoVAyOxuA", "question": "On a grey background screen, there is an illustration in the center formed by the overlapping of two circles, with a design on each side formed by overlapping rectangles. At the top, there is a two-line text. 
When the caption mentions 'the Earth,' what is the main color of the illustration formed by the overlapping circles in the center?", "question_wo_referring_query": "What is the main color of the illustration formed by the overlapping of two circles in the center?", "candidates": ["black", "grey", "purple", "blue", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "knnoVAyOxuA_2", "video_path": "knnoVAyOxuA.mp4", "subtitle_path": "knnoVAyOxuA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.95, "view_count": 622223}, {"video_id": "K4s_tnsfhH4", "question": "On the left is a glass wall and on the right is a canopy of hundred departments. On the left is a man wearing an olive green shirt and a checkered jacket, and on the right is a man wearing a black coat and glasses. Who is the person walking in the hallway with his hands in his pockets?", "question_wo_referring_query": "Who is the person walking in the hallway with his hands in his pockets?", "candidates": ["The man wearing an olive green shirt and a checkered jacket", "The man wearing a red coat and glasses", "The man wearing a yellow shirt and a checkered jacket", "The man wearing a black coat and glasses", "The woman with long hair wearing a black coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "K4s_tnsfhH4_0", "video_path": "K4s_tnsfhH4.mp4", "subtitle_path": "K4s_tnsfhH4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1482.02, "view_count": 27500}, {"video_id": "K4s_tnsfhH4", "question": "In the train, on the left, there is a man with a bald head, and on the right there is a man wearing glasses and a checkered shirt. 
Which character in the video is holding a phone and talking?", "question_wo_referring_query": "Which character in the video is holding a phone and talking?", "candidates": ["A boy wearing a black t-shirt.", "A man wearing a red jacket and glasses.", "A woman wearing a black jacket with long hair.", "A man wearing glasses and a checkered shirt.", "A man wearing a yellow undershirt and a checkered shirt jacket."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "K4s_tnsfhH4_1", "video_path": "K4s_tnsfhH4.mp4", "subtitle_path": "K4s_tnsfhH4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1482.02, "view_count": 27500}, {"video_id": "K4s_tnsfhH4", "question": "On the blurred background mountain, there is a car with a partially visible roof on the left side, a middle-aged woman in black clothes in front of the left side mirror, and a woman with a hat whose face is unclear on the right side. Which character is making a phone call in the scene?", "question_wo_referring_query": "Which character is making a phone call in the scene?", "candidates": ["The woman wearing a black coat with long red hair", "The man wearing a red coat and glasses", "The middle-aged woman in black clothes", "The man wearing a yellow shirt and a white coat", "The woman wearing a hat whose face is unclear"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "K4s_tnsfhH4_2", "video_path": "K4s_tnsfhH4.mp4", "subtitle_path": "K4s_tnsfhH4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1482.02, "view_count": 27500}, {"video_id": "sEj2lhXhkbU", "question": "In a red-walled corridor with red and white tiles on the floor, there's a person on the left side wearing black clothes and crouching, and a white animal in the middle. 
When the subtitle 'intruder while trying to capture the man' appears, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["The white animal falls to the ground", "The person wearing black clothes jumps up", "The yellow animal falls to the ground", "The white animal crouches down", "The black animal falls to the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "sEj2lhXhkbU_0", "video_path": "sEj2lhXhkbU.mp4", "subtitle_path": "sEj2lhXhkbU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 925.89, "view_count": 53263}, {"video_id": "sEj2lhXhkbU", "question": "Inside the silver-white pipe mouth, in the black lacquered pipe, there is an animal wearing green clothes and has a white head. When the subtitle mentions 'of escaping to prevent this friend from', what action does the white-headed animal take?", "question_wo_referring_query": "What action does the white-headed animal take?", "candidates": ["Nodded to the camera", "Put on a hat", "Made an 'OK' gesture to the camera", "Waved to the camera", "Picked up a bottle of water"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "sEj2lhXhkbU_1", "video_path": "sEj2lhXhkbU.mp4", "subtitle_path": "sEj2lhXhkbU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 925.89, "view_count": 53263}, {"video_id": "sEj2lhXhkbU", "question": "Under the blue sky and white clouds, there is a huge yellow rock mountain with a missile in it. 
What event occurs when the subtitles 'and presses the button to launch the' appear?", "question_wo_referring_query": "What event occurs?", "candidates": ["The sky turns black", "A missile is launched into the sky", "A hot air balloon flies into the sky", "The white clouds turn gray", "An airplane flies into the sky"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "sEj2lhXhkbU_2", "video_path": "sEj2lhXhkbU.mp4", "subtitle_path": "sEj2lhXhkbU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 925.89, "view_count": 53263}, {"video_id": "GH14cdbU2Jc", "question": "In front of the yellow wall, there are green plants on both sides and in the middle, there is a woman wearing pink and white clothes and a man wearing a white and green checkered shirt. The two are playing tug-of-war. After the woman in pink and white appears, which of the following people appear first?", "question_wo_referring_query": "Which of the following people appear first?", "candidates": ["A man wearing yellow clothes", "A child holding a skateboard", "A person wearing a black hat", "A woman wearing red clothes", "A man wearing white clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "GH14cdbU2Jc_0", "video_path": "GH14cdbU2Jc.mp4", "subtitle_path": "GH14cdbU2Jc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1175.23, "view_count": 160578}, {"video_id": "GH14cdbU2Jc", "question": "In the scene, there's a man on the left who is on the phone and wearing black clothes. On the right, there are blurry tree trunks and green plants in the distance. 
After the man on the phone wearing black clothes appears, which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["The bald man wearing a dark green garment", "The person wearing a black hat", "The man on the phone wearing a black watch", "The girl holding a skateboard", "The man wearing yellow clothing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "GH14cdbU2Jc_1", "video_path": "GH14cdbU2Jc.mp4", "subtitle_path": "GH14cdbU2Jc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1175.23, "view_count": 160578}, {"video_id": "GH14cdbU2Jc", "question": "In front of the orange-yellow door and window, a man with a white beard, wearing a golden watch, is sitting on a dark green armchair. After this man appears, which of the following people appear next?", "question_wo_referring_query": "Which of the following people appear next?", "candidates": ["A child holding an umbrella", "A man wearing a white shirt with a black beard", "A man wearing yellow clothes", "A bald man wearing dark green clothes", "A woman wearing a black hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "GH14cdbU2Jc_2", "video_path": "GH14cdbU2Jc.mp4", "subtitle_path": "GH14cdbU2Jc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1175.23, "view_count": 160578}, {"video_id": "GsMh_62RaLI", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a scene where 'a woman wearing pink clothes and having long golden hair is smiling and talking with a man dressed in blue and white stripes in front of a black car.' 
Then, there is a scene where 'on a coastal highway, a woman wearing red clothing and having golden hair is seen in the middle of the road, followed closely by a white car and a dark green car.' Finally, there is a scene where 'a man wearing a blue short-sleeved shirt is closing the car door in the driver's seat of a black interior car.'", "First, there is a scene where 'a man wearing a blue short-sleeved shirt is closing the car door in the driver's seat of a black interior car.' Then, there is a scene where 'a woman wearing pink clothes and having long golden hair is smiling and talking with a man dressed in blue and white stripes in front of a black car.' Finally, there is a scene where 'on a coastal highway, a woman wearing red clothing and having golden hair is seen in the middle of the road, followed closely by a white car and a dark green car.'", "First, there is a scene where 'a woman wearing pink clothes and having long golden hair is smiling and talking with a man dressed in blue and white stripes in front of a black car.' Then, there is a scene where 'a man wearing a blue short-sleeved shirt is closing the car door in the driver's seat of a black interior car.' Finally, there is a scene where 'on a coastal highway, a woman wearing red clothing and having golden hair is seen in the middle of the road, followed closely by a white car and a dark green car.'", "First, there is a scene where 'a man wearing a blue short-sleeved shirt is closing the car door in the driver's seat of a black interior car.' Then, there is a scene where 'on a coastal highway, a woman wearing red clothing and having golden hair is seen in the middle of the road, followed closely by a white car and a dark green car.' 
Finally, there is a scene where 'a woman wearing pink clothes and having long golden hair is smiling and talking with a man dressed in blue and white stripes in front of a black car.'", "First, there is a scene where 'on a coastal highway, a woman wearing red clothing and having golden hair is seen in the middle of the road, followed closely by a white car and a dark green car.' Then, there is a scene where 'a man wearing a blue short-sleeved shirt is closing the car door in the driver's seat of a black interior car.' Finally, there is a scene where 'a woman wearing pink clothes and having long golden hair is smiling and talking with a man dressed in blue and white stripes in front of a black car.'"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "GsMh_62RaLI_0", "video_path": "GsMh_62RaLI.mp4", "subtitle_path": "GsMh_62RaLI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1014.52, "view_count": 165490}, {"video_id": "GsMh_62RaLI", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, and a man wearing a white shirt and black jacket on the right'. Then, there is a scene in a 'pitch-black car, with a light shining through the window on the left, a woman with long golden hair in the middle, and a man's profile on the right'. Finally, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'", "First, there is a scene in a 'pitch-black car, with a light shining through the window on the left, a woman with long golden hair in the middle, and a man's profile on the right'. 
Then, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'. Finally, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, and a man wearing a white shirt and black jacket on the right'", "First, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, and a man wearing a white shirt and black jacket on the right'. Then, there is a scene in a 'pitch-black car, with a light shining through the window on the left, a woman with long golden hair in the middle, and a man's profile on the right'. Finally, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'", "First, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, and a man wearing a white shirt and black jacket on the right'. Then, there is a scene in a 'pitch-black car, with a light shining through the window on the left, a woman with long golden hair in the middle, and a man's profile on the right'. Finally, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'", "First, there is a scene in a 'pitch-black car, with a light shining through the window on the left, a woman with long golden hair in the middle, and a man's profile on the right'. Then, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, and a man wearing a white shirt and black jacket on the right'. 
Finally, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'", "First, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'. Then, there is a scene in a 'pitch-black car, with a light shining through the window on the left, a woman with long golden hair in the middle, and a man's profile on the right'. Finally, there is a scene in a 'background restaurant filled with people, with a lit desk lamp on the table in the middle, a woman with long golden hair on the left, and a black silhouette on the right'"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "GsMh_62RaLI_1", "video_path": "GsMh_62RaLI.mp4", "subtitle_path": "GsMh_62RaLI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1014.52, "view_count": 165490}, {"video_id": "GsMh_62RaLI", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First is the scene where 'a woman in white clothes is making a phone call while sitting on a white sofa with a red backrest'. Next is the scene where 'in front of the background of a floor-to-ceiling window, a woman in red clothes on the left and a man in white clothes on the right are rubbing his temples with both hands'. 
The last scene is 'in front of a floor-to-ceiling window with a grey curtain in the middle, a man in white clothes is standing on the left and a woman in pink clothes is holding a mobile phone on the right'.", "First is the scene where 'in front of the background of a floor-to-ceiling window, a woman in red clothes on the left and a man in white clothes on the right are rubbing his temples with both hands'. Next is the scene where 'a woman in white clothes is making a phone call while sitting on a white sofa with a red backrest'. The last scene is 'in front of a floor-to-ceiling window with a grey curtain in the middle, a man in white clothes is standing on the left and a woman in pink clothes is holding a mobile phone on the right'.", "First is the scene where 'in front of the background of a floor-to-ceiling window, a woman in red clothes on the left and a man in white clothes on the right are rubbing his temples with both hands'. Next is the scene where 'in front of a floor-to-ceiling window with a grey curtain in the middle, a man in white clothes is standing on the left and a woman in pink clothes is holding a mobile phone on the right'. The last scene is 'a woman in white clothes is making a phone call while sitting on a white sofa with a red backrest'.", "First is the scene where 'in front of a floor-to-ceiling window with a grey curtain in the middle, a man in white clothes is standing on the left and a woman in pink clothes is holding a mobile phone on the right'. Next is the scene where 'a woman in white clothes is making a phone call while sitting on a white sofa with a red backrest'. The last scene is 'in front of the background of a floor-to-ceiling window, a woman in red clothes on the left and a man in white clothes on the right are rubbing his temples with both hands'.", "First is the scene where 'a woman in white clothes is making a phone call while sitting on a white sofa with a red backrest'. 
Next is the scene where 'in front of a floor-to-ceiling window with a grey curtain in the middle, a man in white clothes is standing on the left and a woman in pink clothes is holding a mobile phone on the right'. The last scene is 'in front of the background of a floor-to-ceiling window, a woman in red clothes on the left and a man in white clothes on the right are rubbing his temples with both hands'."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "GsMh_62RaLI_2", "video_path": "GsMh_62RaLI.mp4", "subtitle_path": "GsMh_62RaLI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1014.52, "view_count": 165490}, {"video_id": "tKwuwLChP2g", "question": "Under a sky, there is a white house with green grass in front of it. On the right side of the house, there is a stone on the grass. At the very front, there is a man with short hair, wearing a long-sleeve black shirt and holding a black bag. Behind the house, there is a mountain. What does this man do?", "question_wo_referring_query": "What does this man do?", "candidates": ["Turn around", "Jump up", "Drink water", "Walk towards the white house", "Put the black bag on the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "tKwuwLChP2g_0", "video_path": "tKwuwLChP2g.mp4", "subtitle_path": "tKwuwLChP2g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1986.88, "view_count": 10397}, {"video_id": "tKwuwLChP2g", "question": "Outside a house, there is a white staircase. On the staircase sits a man with short black hair, wearing a white long-sleeve shirt, black pants, and a tie. To the left of the staircase, there are trees and a brown house. In front of the staircase, there is a person wearing khaki pants. 
What is the man in the white shirt and tie doing?", "question_wo_referring_query": "What is the man in the white shirt and tie doing?", "candidates": ["Drinking water", "Skipping rope", "Running", "Eating something", "Stood up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "tKwuwLChP2g_1", "video_path": "tKwuwLChP2g.mp4", "subtitle_path": "tKwuwLChP2g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1986.88, "view_count": 10397}, {"video_id": "tKwuwLChP2g", "question": "There are many people on the screen, among them, two men are particularly noticeable. On the left side of the screen, there is a man with short black hair wearing a suit. On the right side, there is another man with similar attire and short hair. They are surrounded by many people holding cameras, and these people are being blocked by a man in a green shirt. What did the man on the right side of the screen do?", "question_wo_referring_query": "What did the man on the right side of the screen do?", "candidates": ["Singing", "Bent down", "Dancing", "Running", "Jumped up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "tKwuwLChP2g_2", "video_path": "tKwuwLChP2g.mp4", "subtitle_path": "tKwuwLChP2g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1986.88, "view_count": 10397}, {"video_id": "l4uCFYiayIM", "question": "In a room, there is a gray table. Five people are sitting around the table. On the left side, there is a short-haired man in a white shirt and a long-haired woman in a white and blue dress. In the middle, there is a short-haired man in a green long-sleeve shirt. On the right side, there is a short-haired man in a white long-sleeve shirt and a long-haired woman in a white long-sleeve shirt. They are all looking at a television. 
When the subtitle 'similarly new doubr proposes a new idea' appears, what is present in the scene?", "question_wo_referring_query": "What is present in the scene when the subtitle appears?", "candidates": ["plate", "potted plant", "laptop", "jump rope", "snacks"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "l4uCFYiayIM_0", "video_path": "l4uCFYiayIM.mp4", "subtitle_path": "l4uCFYiayIM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2226.59, "view_count": 45467}, {"video_id": "l4uCFYiayIM", "question": "In a room with white walls, a blue curtain is hanging on the wall. There is a white table in the room with items on it, and beside the table there is a sofa with different colors. A woman with tied hair, wearing a white shirt with English letters and shorts, is sitting on the sofa. She is looking at an item in her hand. When the subtitle says 'still heavy she remembered the ber', what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["cup", "television", "snacks", "cushion", "treadmill"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "l4uCFYiayIM_1", "video_path": "l4uCFYiayIM.mp4", "subtitle_path": "l4uCFYiayIM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2226.59, "view_count": 45467}, {"video_id": "l4uCFYiayIM", "question": "There are three people standing on the screen. On the left is a man with short hair, wearing glasses, and dressed in a blue short-sleeved shirt. In the middle is a woman with black braided hair, wearing a blue dress. On the right is a man with short black hair, wearing a white and blue long-sleeved shirt, holding something in his hand. There are trees behind them. 
When the subtitle mentions 'planned a surprised birthday celebration,' what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["glass", "pillow", "snacks", "cake", "plate"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "l4uCFYiayIM_2", "video_path": "l4uCFYiayIM.mp4", "subtitle_path": "l4uCFYiayIM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2226.59, "view_count": 45467}, {"video_id": "u3rcqZRb8fo", "question": "In a room, there is a man with short hair, wearing a black long-sleeved suit and a red tie. Behind him, there is a window with objects placed outside. It's drizzling outside the house. What is the color of this man's hair?", "question_wo_referring_query": "What is the color of this man's hair?", "candidates": ["White", "Black", "Green", "Red", "Yellow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "u3rcqZRb8fo_0", "video_path": "u3rcqZRb8fo.mp4", "subtitle_path": "u3rcqZRb8fo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.74, "view_count": 255903}, {"video_id": "u3rcqZRb8fo", "question": "In a slightly dim room with a beam of light, there is a woman wearing short sleeves and a hat. She is performing in the room, and a group of people behind her are applauding. 
What is the color of the clothes she is wearing?", "question_wo_referring_query": "What is the color of the clothes she is wearing?", "candidates": ["red", "red", "green", "black", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "u3rcqZRb8fo_1", "video_path": "u3rcqZRb8fo.mp4", "subtitle_path": "u3rcqZRb8fo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.74, "view_count": 255903}, {"video_id": "u3rcqZRb8fo", "question": "In a room, there is a board covered with various pictures. In front of the pictures stands a woman with short hair, wearing a red long-sleeved dress with a hat. What is the shape of the picture of a man wearing a hat located at the top right of the board?", "question_wo_referring_query": "What is the shape of the picture of a man wearing a hat located at the top right of the board?", "candidates": ["ladder", "spike", "circle", "triangle", "square"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "u3rcqZRb8fo_2", "video_path": "u3rcqZRb8fo.mp4", "subtitle_path": "u3rcqZRb8fo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1069.74, "view_count": 255903}, {"video_id": "13ByPC8ac7w", "question": "In a forest, a man with curly hair is sitting next to a tree, wearing a coat and a black shirt underneath. He is wearing a necklace, and someone is looking at him. 
When the subtitle mentions 'Essel tries to talk Gawain out of going, but he insists that he must since he made a covenant,' what is the color of the man's coat?", "question_wo_referring_query": "What is the color of the man's coat?", "candidates": ["black", "white", "yellow", "red", "green"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "13ByPC8ac7w_0", "video_path": "13ByPC8ac7w.mp4", "subtitle_path": "13ByPC8ac7w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.49, "view_count": 730317}, {"video_id": "13ByPC8ac7w", "question": "In a forest, there is a short-haired man. He is wearing a black long-sleeved shirt underneath, with a brownish-yellow outer garment and a scarf. He is holding an axe in his hand. Behind him is a horse, and to his left is a person wearing a hat. When the subtitle mentions 'the green ax and rides away with his horse. Long after they're gone, Gawain imagines dying', what color is the man's axe?", "question_wo_referring_query": "What color is the man's axe?", "candidates": ["White", "Black", "Green", "Red", "Yellow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "13ByPC8ac7w_1", "video_path": "13ByPC8ac7w.mp4", "subtitle_path": "13ByPC8ac7w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.49, "view_count": 730317}, {"video_id": "13ByPC8ac7w", "question": "In a room with slightly dim lighting, there is a table with some objects on it, including a burning candle. To the left of the table sits a woman with her hair tied back, and to the right sits a woman wearing white clothes with a cloth covering her glasses. Behind them, there are some more burning candles. In front of them, there is a person looking towards them. 
When the subtitle mentions 'the green will always remain and overcome all of humanity's creations and achievements,' what is the color of the cloth covering the glasses of the woman with glasses covered by a cloth?", "question_wo_referring_query": "What is the color of the cloth covering the glasses of the woman with glasses covered by a cloth?", "candidates": ["white", "green", "blue", "red", "black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "13ByPC8ac7w_2", "video_path": "13ByPC8ac7w.mp4", "subtitle_path": "13ByPC8ac7w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.49, "view_count": 730317}, {"video_id": "RCeq26HPi-I", "question": "In a room, there is a table placed in the center. On the table, there is a potted plant. On the left wall, a lamp is hanging. There are some items placed on the table under the lamp. On the right side of the table, there is another lamp. In front of the central table, there is a person about to fall. Who is the person about to fall?", "question_wo_referring_query": "Who is the person about to fall?", "candidates": ["A person wearing a red long-sleeve shirt", "A person wearing a black short-sleeve shirt", "A person wearing a green black long-sleeve shirt", "A person wearing a yellow long-sleeve shirt", "A person wearing a black long-sleeve shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "RCeq26HPi-I_0", "video_path": "RCeq26HPi-I.mp4", "subtitle_path": "RCeq26HPi-I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1167.4, "view_count": 541556}, {"video_id": "RCeq26HPi-I", "question": "Outdoors, next to a car, there are two people standing. The person closest to the car is a short-haired man wearing black clothes, with one hand on the car. In front of him stands another person who is on a phone call. 
There are trees behind them and grass under their feet. Who is the person making the phone call?", "question_wo_referring_query": "Who is the person making the phone call?", "candidates": ["A short-haired woman", "A man in gray clothes", "A woman in red clothes", "A long-haired woman", "A short-haired man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "RCeq26HPi-I_1", "video_path": "RCeq26HPi-I.mp4", "subtitle_path": "RCeq26HPi-I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1167.4, "view_count": 541556}, {"video_id": "RCeq26HPi-I", "question": "In a yellow-white room, there is a person with short blonde hair who is calling for help. Standing behind that person are two people wearing black clothes. Who is the person calling for help?", "question_wo_referring_query": "Who is the person calling for help?", "candidates": ["a woman in red clothes", "a woman in yellow clothes", "a child", "a man in white clothes", "a woman in white clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "RCeq26HPi-I_2", "video_path": "RCeq26HPi-I.mp4", "subtitle_path": "RCeq26HPi-I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1167.4, "view_count": 541556}, {"video_id": "PkjzgMs5veQ", "question": "In a room, on the right side of the room, there is a short-haired person with a beard wearing blue clothes, and on the left side, there is a short-haired woman wearing a yellow short-sleeved shirt, holding a cake. On the cake, there is a white candle. Behind the woman on the left side, there is a white lamp. On the wall behind them, there are some square decorations hanging. 
What did the woman do when the cake first appeared?", "question_wo_referring_query": "What did the woman do when the cake first appeared?", "candidates": ["Light the candle", "Jump rope", "Dance", "Run", "Blow out the candle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "PkjzgMs5veQ_0", "video_path": "PkjzgMs5veQ.mp4", "subtitle_path": "PkjzgMs5veQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1097.83, "view_count": 36772}, {"video_id": "PkjzgMs5veQ", "question": "In a slightly dimly lit room, there are two people. On the left side of the room is a short-haired man wearing glasses and a short-sleeved shirt. On the right side, there is a short-haired woman wearing a short-sleeved shirt, holding a blowtorch in her hand. Behind her, there is a window with white curtains. Outside the window, there are buildings. What is this woman doing when the blowtorch first appears?", "question_wo_referring_query": "What is this woman doing when the blowtorch first appears?", "candidates": ["Welding something", "Eating something", "Kneeling down", "Drinking water", "Dancing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "PkjzgMs5veQ_1", "video_path": "PkjzgMs5veQ.mp4", "subtitle_path": "PkjzgMs5veQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1097.83, "view_count": 36772}, {"video_id": "PkjzgMs5veQ", "question": "In a room, three people are sitting on the ground. On the left sits a woman with braided hair, wearing a skirt, holding a cake with candles. In the middle sits a short-haired child. On the right sits a short-haired man wearing a blue long-sleeved shirt, holding a spoon. On the left side of the woman, there is a chair. Behind them, there are ribbons and paintings hanging, and some items are placed in front of them. 
What did the child do when he first appeared?", "question_wo_referring_query": "What did the child do when he first appeared?", "candidates": ["Danced", "Ate the cake", "Sang", "Blew out the candles on the cake", "Stood up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "PkjzgMs5veQ_2", "video_path": "PkjzgMs5veQ.mp4", "subtitle_path": "PkjzgMs5veQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1097.83, "view_count": 36772}, {"video_id": "epXIyiAAa84", "question": "In the room stands a long-haired woman wearing black clothes, with a few people sitting behind her. When the subtitle mentions 'Despitc that, Lisa asks the Pale Man about the scrapbook under her bedroom floor,\u2019 what does this woman do?", "question_wo_referring_query": "What does this woman do?", "candidates": ["run", "drink water", "kneel down", "sing", "make a phone call"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "epXIyiAAa84_0", "video_path": "epXIyiAAa84.mp4", "subtitle_path": "epXIyiAAa84_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1076.74, "view_count": 65253}, {"video_id": "epXIyiAAa84", "question": "In a dimly lit room, various utensils of different colors and shapes are placed on the table. Sitting beside the table is a short-haired man wearing a long-sleeved black shirt. Behind him stands a long-haired woman dressed in black clothing with patterns. To the left of the woman on the wall, there is a rectangular decoration and a window behind her. When the subtitle mentions 'Worried for her brother. Lisa tries to attack Edgar, but Robbie stops her. 
Bruce', what does the woman do?", "question_wo_referring_query": "What does the woman do?", "candidates": ["Sit down", "Jump up", "Eat something", "Try to attack someone", "Kneel down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "epXIyiAAa84_1", "video_path": "epXIyiAAa84.mp4", "subtitle_path": "epXIyiAAa84_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1076.74, "view_count": 65253}, {"video_id": "epXIyiAAa84", "question": "In the frame, there are two white objects lined up side by side. On the outermost white object, there is a hand of a person wearing white clothes. When the caption mentions 'Lisa then pushes the washing machine aside and breaks the wall, going inside the tunnel to pick', what did this hand do?", "question_wo_referring_query": "What did this hand do?", "candidates": ["Picked up a water cup", "Picked up a phone", "Lifted the white object", "Pushed the white object to the side", "Picked up some snacks"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "epXIyiAAa84_2", "video_path": "epXIyiAAa84.mp4", "subtitle_path": "epXIyiAAa84_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1076.74, "view_count": 65253}, {"video_id": "-oTVCItP1GQ", "question": "In a slightly dim space with a beam of light, there are two people sitting on the chair at the front. The person on the left is bald. Behind them, there is a person with short hair standing. 
What did the man in the front do after he noticed the person standing behind them?", "question_wo_referring_query": "What did the man in the front do after he noticed the person standing behind them?", "candidates": ["Danced", "Sang a song", "Let them sit down", "Picked up a water cup", "Slapped this person"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "-oTVCItP1GQ_0", "video_path": "-oTVCItP1GQ.mp4", "subtitle_path": "-oTVCItP1GQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.91, "view_count": 19137}, {"video_id": "-oTVCItP1GQ", "question": "In a dimly lit room with a faint light source, one person strikes another while holding something. As they start fighting, there are some glowing objects and other items behind them. What does one man do after hitting the other man?", "question_wo_referring_query": "What does one man do after hitting the other man?", "candidates": ["jumped up", "unlocked a door", "ran", "drank water", "lay down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "-oTVCItP1GQ_1", "video_path": "-oTVCItP1GQ.mp4", "subtitle_path": "-oTVCItP1GQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.91, "view_count": 19137}, {"video_id": "-oTVCItP1GQ", "question": "In the screen, there are three men. The man in the middle is wearing a green hat, green clothes, and is holding a gun to his own eyes. The man on the left is also wearing a hat and green clothes and is holding a microphone. The man on the right is wearing a green hat with an eye on it. 
What happened after the man in the middle was about to fire the gun?", "question_wo_referring_query": "What happened after the man in the middle was about to fire the gun?", "candidates": ["Ate something", "Jumped up", "A whale leaped out of the sea", "Fell into the water", "The people beside him disappeared"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "-oTVCItP1GQ_2", "video_path": "-oTVCItP1GQ.mp4", "subtitle_path": "-oTVCItP1GQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.91, "view_count": 19137}, {"video_id": "G8FZtQFttM8", "question": "Which animal appears first in the video?", "question_wo_referring_query": "Which animal appears first in the video?", "candidates": ["sheep", "bird", "donkey", "horse", "chicken"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "G8FZtQFttM8_0", "video_path": "G8FZtQFttM8.mp4", "subtitle_path": "G8FZtQFttM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.28, "view_count": 22120}, {"video_id": "G8FZtQFttM8", "question": "Who is the first character to start dancing in the video?", "question_wo_referring_query": "Who is the first character to start dancing in the video?", "candidates": ["Woman with green hair", "Man with black hair", "Woman with black hair", "Woman with yellow hair", "Woman with red hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "G8FZtQFttM8_1", "video_path": "G8FZtQFttM8.mp4", "subtitle_path": "G8FZtQFttM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.28, "view_count": 22120}, {"video_id": "G8FZtQFttM8", "question": "Who is the first person to dance on the table in the video?", "question_wo_referring_query": "Who is the first person to dance on the table in the video?", "candidates": ["Blond 
woman", "Black-haired woman", "Brown-haired woman", "Red-haired woman", "Blond man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "G8FZtQFttM8_2", "video_path": "G8FZtQFttM8.mp4", "subtitle_path": "G8FZtQFttM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.28, "view_count": 22120}, {"video_id": "Uvzq-vDhVmA", "question": "In a room with green walls, a man with short hair, wearing black clothes, sunglasses, and face paint is standing on the left side. On the right side, there is a woman and a person in white clothes. What did the man wearing sunglasses do after the subtitle 'with more injuries when he removed' appeared?", "question_wo_referring_query": "What did the man wearing sunglasses do?", "candidates": ["Left", "Sang", "Held a woman's hand", "Touched his forehead with one hand", "Jumped"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "Uvzq-vDhVmA_0", "video_path": "Uvzq-vDhVmA.mp4", "subtitle_path": "Uvzq-vDhVmA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2069.43, "view_count": 46176}, {"video_id": "Uvzq-vDhVmA", "question": "In a room with white walls, there are various items stuck on the pillars inside the room. There is a woman dressed in white with a blue skirt standing in the room, holding a white package, and a short-haired man dressed in white. After the subtitle 'call and asked her to leave due to the', what did the man do?", "question_wo_referring_query": "What did the man do?",
"candidates": ["picked up the woman", "drank water", "drove a car", "ran", "jumped"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "Uvzq-vDhVmA_1", "video_path": "Uvzq-vDhVmA.mp4", "subtitle_path": "Uvzq-vDhVmA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2069.43, "view_count": 46176}, {"video_id": "Uvzq-vDhVmA", "question": "In the video, a short-haired man in a white shirt is standing on the left side, and on the right side, there is a woman with long hair wearing a butterfly hairpin. There is also a white object in the background. After the subtitles mention 'unleash her Fury Hawk reached our and,' what do they do?", "question_wo_referring_query": "What do they do?", "candidates": ["Singing", "Driving", "Hugging", "Running", "Eating"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "Uvzq-vDhVmA_2", "video_path": "Uvzq-vDhVmA.mp4", "subtitle_path": "Uvzq-vDhVmA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2069.43, "view_count": 46176}, {"video_id": "PFrV_3IsJ2Y", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a woman wearing gray clothes lying on a green bed, her head bandaged with a white cloth, with a pillow under her head. Then, there is a person driving a green car in a forest, with trees around. Finally, we see a clock hanging on a wall, decorated with various ornaments, showing 12:25.", "First, there is a woman wearing gray clothes lying on a green bed, her head bandaged with a white cloth, with a pillow under her head. Then, we see a clock hanging on a wall, decorated with various ornaments, showing 12:25.
Finally, there is a person driving a green car in a forest, with trees around.", "First, there is a person driving a green car in a forest, with trees around. Then, there is a woman wearing gray clothes lying on a green bed, her head bandaged with a white cloth, with a pillow under her head. Finally, we see a clock hanging on a wall, decorated with various ornaments, showing 12:25.", "First, we see a clock hanging on a wall, decorated with various ornaments, showing 12:25. Then, there is a person driving a green car in a forest, with trees around. Finally, there is a woman wearing gray clothes lying on a green bed, her head bandaged with a white cloth, with a pillow under her head.", "First, there is a person driving a green car in a forest, with trees around. Then, we see a clock hanging on a wall, decorated with various ornaments, showing 12:25. Finally, there is a woman wearing gray clothes lying on a green bed, her head bandaged with a white cloth, with a pillow under her head."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "PFrV_3IsJ2Y_0", "video_path": "PFrV_3IsJ2Y.mp4", "subtitle_path": "PFrV_3IsJ2Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1079.3, "view_count": 250657}, {"video_id": "PFrV_3IsJ2Y", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First is a scene beside a grassy area where a woman in green is removing her coat and looking at the clothes in her hands. Then, there are two people hugging at night. Finally, there is a scene on a street at night, with someone wearing white clothes standing in front of a car, flanked by grass.", "First is a scene on a street at night, with someone wearing white clothes standing in front of a car, flanked by grass. 
Then, there is a scene beside a grassy area where a woman in green is removing her coat and looking at the clothes in her hands. Finally, there are two people hugging at night.", "First, there are two people hugging at night. Then, there is a scene on a street at night, with someone wearing white clothes standing in front of a car, flanked by grass. Finally, there is a scene beside a grassy area where a woman in green is removing her coat and looking at the clothes in her hands.", "First is a scene on a street at night, with someone wearing white clothes standing in front of a car, flanked by grass. Then, there are two people hugging at night. Finally, there is a scene beside a grassy area where a woman in green is removing her coat and looking at the clothes in her hands.", "First is a scene beside a grassy area where a woman in green is removing her coat and looking at the clothes in her hands. Then, there is a scene on a street at night, with someone wearing white clothes standing in front of a car, flanked by grass. 
Finally, there are two people hugging at night."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "PFrV_3IsJ2Y_1", "video_path": "PFrV_3IsJ2Y.mp4", "subtitle_path": "PFrV_3IsJ2Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1079.3, "view_count": 250657}, {"video_id": "PFrV_3IsJ2Y", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["There's a woman in long sleeves holding a gun with trees behind her, then the red boat in the river capsizes, with various trees in front of the boat, and lastly, there's a person rowing a red boat on a body of water.", "First, there's a person rowing a red boat on a body of water, then the red boat in the river capsizes, with various trees in front of the boat, and lastly, there's a woman in long sleeves holding a gun with trees behind her.", "First, the red boat in the river capsizes, with various trees in front of the boat, then there's a person rowing a red boat on a body of water, and lastly, there's a woman in long sleeves holding a gun with trees behind her.", "First, there's a woman in long sleeves holding a gun with trees behind her, then a person rowing a red boat on a body of water, and lastly, the red boat in the river capsizes, with various trees in front of the boat.", "First, there's a person rowing a red boat on a body of water, then there's a woman in long sleeves holding a gun with trees behind her, and lastly, the red boat in the river capsizes, with various trees in front of the boat."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "PFrV_3IsJ2Y_2", "video_path": "PFrV_3IsJ2Y.mp4", "subtitle_path": "PFrV_3IsJ2Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1079.3, "view_count": 250657}, 
{"video_id": "xetUZY33rSM", "question": "In a field, there is a group of white animals on the grass and some trees. In the middle of the scene, there is a short-haired man wearing a long-sleeved green shirt, holding a piece of white paper. In which of the following scenes did these white animals also appear?", "question_wo_referring_query": "In which of the following scenes did these white animals also appear?", "candidates": ["In a swamp", "Inside a house", "In the desert", "Outside a wooden house", "In the water"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "xetUZY33rSM_0", "video_path": "xetUZY33rSM.mp4", "subtitle_path": "xetUZY33rSM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.94, "view_count": 45212}, {"video_id": "xetUZY33rSM", "question": "On a calm sea, there is a wooden board floating on the water with three people lying on it. On the upper side, there is a man with short hair, wearing a blue short-sleeved shirt. On the lower left side, there is a man with short hair, wearing blue pants, holding something in his hands. On the right side, there is a man wearing a hat, dressed in white clothes and pants. Next to them, there are shoes of different colors and a rope. Where else has this rope appeared?", "question_wo_referring_query": "Where else has this rope appeared?", "candidates": ["In a tree", "Hanging on the wall of a house", "In water", "In a book", "Inside a hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "xetUZY33rSM_1", "video_path": "xetUZY33rSM.mp4", "subtitle_path": "xetUZY33rSM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.94, "view_count": 45212}, {"video_id": "xetUZY33rSM", "question": "On a calm sea, there is a wooden plank floating with three people on it. 
In the middle is a man with short hair, wearing a blue short-sleeve shirt and green pants. To the left is a man in yellow clothes with a lot of facial hair. To the right is a person wearing a yellow hat. Between them are various colored items. In which of the following scenarios has the man with the yellow hat also appeared?", "question_wo_referring_query": "In which of the following scenarios has the man with the yellow hat also appeared?", "candidates": ["Outside a house with a blue wall", "Outside a house with a white wall", "Outside a house with a green wall", "Outside a house with a black wall", "Outside a house with a red wall"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "xetUZY33rSM_2", "video_path": "xetUZY33rSM.mp4", "subtitle_path": "xetUZY33rSM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 907.94, "view_count": 45212}, {"video_id": "YLkFmWcZ8h8", "question": "A woman with long straight hair, wearing a gray coat and a dark green inner shirt, is looking out from a window where a bit of red brick wall is slightly visible on both sides. Which of the following subtitles has this woman appeared with?", "question_wo_referring_query": "Which of the following subtitles has this woman appeared with?", "candidates": ["I got up at 7 o'clock", "new term is coming and I will be wiser", "I like drawing pictures best", "was seeing future events again on the", "if you are interested"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "YLkFmWcZ8h8_0", "video_path": "YLkFmWcZ8h8.mp4", "subtitle_path": "YLkFmWcZ8h8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.68, "view_count": 7524}, {"video_id": "YLkFmWcZ8h8", "question": "In the forest at night, on the middle of the road stands a black-haired woman wearing a black shirt and a red coat. 
Behind the woman is a yellow car with its headlights on. Which of the following subtitles has this yellow car appeared with?", "question_wo_referring_query": "With which of the following subtitles has this yellow car appeared?", "candidates": ["He's very strong", "I became interested in English", "train station Cassie arriving in the", "made a lot of money, happy death", "a student from China and have noticed your requirement online"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "YLkFmWcZ8h8_1", "video_path": "YLkFmWcZ8h8.mp4", "subtitle_path": "YLkFmWcZ8h8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.68, "view_count": 7524}, {"video_id": "YLkFmWcZ8h8", "question": "In the middle of the city intersection at night, the upper right side shows a row of houses by the street, the lower right side of the road has many black or yellow cars, and in the upper left part of the sky, there is a white and red car. 
Which of the following subtitles does this white and red car appear in?", "question_wo_referring_query": "Which of the following subtitles does this white and red car appear in?", "candidates": ["the second crossing and you'll find a hospital", "was so happy we put up the tent", "triggering the bomb Ezekiel carried", "as doing the housework or making progress in", "Everybody wants to do better in it"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "YLkFmWcZ8h8_2", "video_path": "YLkFmWcZ8h8.mp4", "subtitle_path": "YLkFmWcZ8h8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1077.68, "view_count": 7524}, {"video_id": "nqjPYHZ37HE", "question": "When the woman in green shorts and a black tank top was wrestling, what clothes did she change into afterwards?", "question_wo_referring_query": "What clothes did the woman in green shorts and a black tank top change into after wrestling?", "candidates": ["She changed from a black tank top into a green T-shirt, wearing a helmet and carrying a backpack", "She changed from a black tank top into a red coat", "She changed from a black tank top into a white coat, wearing a helmet and carrying a backpack", "She changed from a black tank top into a blue T-shirt, wearing a helmet and carrying a backpack", "She changed from a black tank top into cycling gear, wearing a helmet and carrying a backpack"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "nqjPYHZ37HE_0", "video_path": "nqjPYHZ37HE.mp4", "subtitle_path": "nqjPYHZ37HE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 910.88, "view_count": 198488}, {"video_id": "nqjPYHZ37HE", "question": "In a small room, there are windows on both the right and left sides that let in light. In the middle, there is a woman with long hair wearing a gray coat. 
What change occurs after this woman moves to the bow of the ship?", "question_wo_referring_query": "What change occurs?", "candidates": ["Changed from a gray coat to a green T-shirt", "Changed from a gray coat to a white coat", "Changed from a gray coat to a red coat", "Put on a gray hat", "Took off the coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "nqjPYHZ37HE_1", "video_path": "nqjPYHZ37HE.mp4", "subtitle_path": "nqjPYHZ37HE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 910.88, "view_count": 198488}, {"video_id": "nqjPYHZ37HE", "question": "Below the yellow and white cliff is a pitch-black cave. In front of the cliff, on the left is a man wearing a red short-sleeved shirt, and on the right is a woman wearing a beige dress. What clothing did the woman change into when she appeared in the office?", "question_wo_referring_query": "What clothing did the woman change into when she appeared in the office?", "candidates": ["She changed from the beige dress into a red T-shirt.", "She changed from the beige dress into a green dress.", "She changed from the beige dress into a black T-shirt.", "She changed from the beige dress into a black leather jacket.", "She changed from the beige dress into a yellow dress."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "nqjPYHZ37HE_2", "video_path": "nqjPYHZ37HE.mp4", "subtitle_path": "nqjPYHZ37HE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 910.88, "view_count": 198488}, {"video_id": "RsoT9WImC8Q", "question": "In the room with a gray sofa in the center, there is a woman wearing a red and white dress sitting on the sofa. To the lower left, there is a wine-red short table, and to the upper right, there is a light red wall with a wooden texture. 
Which items are not present in the room?", "question_wo_referring_query": "Which items are not present in the room?", "candidates": ["Gray and white pillow", "Black remote-controlled car", "Laptop", "Green plant", "White water cup"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "RsoT9WImC8Q_0", "video_path": "RsoT9WImC8Q.mp4", "subtitle_path": "RsoT9WImC8Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2089.79, "view_count": 129779}, {"video_id": "RsoT9WImC8Q", "question": "On the seaside, in the distance, there are several groups of people and the light blue ocean. In the center near the shoreline, there is a man wearing white pants, and to the right, there is a woman wearing a light yellow dress. What items have appeared on the screen?", "question_wo_referring_query": "What items have appeared on the screen?", "candidates": ["Laptop computer", "White umbrella", "Black remote-controlled car", "White car", "White hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "RsoT9WImC8Q_1", "video_path": "RsoT9WImC8Q.mp4", "subtitle_path": "RsoT9WImC8Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2089.79, "view_count": 129779}, {"video_id": "RsoT9WImC8Q", "question": "In the white-walled conference room, there is a white and orange shelf formed by hexagonal grids on the left, a mustard yellow long table in the middle, and people sitting around the table having a meeting. 
Which of the following items are not present in the conference room?", "question_wo_referring_query": "Which items are not present in the conference room?", "candidates": ["White curtains", "White projector", "Black chair", "Black water dispenser", "Laptop"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "RsoT9WImC8Q_2", "video_path": "RsoT9WImC8Q.mp4", "subtitle_path": "RsoT9WImC8Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2089.79, "view_count": 129779}, {"video_id": "urXojVL9ULE", "question": "In a room illuminated by light coming through a window, there is a long pale red table at the bottom, with a man wearing a black coat sitting on the left middle side and a woman wearing black clothes sitting on the right side. What objects are present in the room when the subtitles mention 'and the board members'?", "question_wo_referring_query": "What objects are present in the room?", "candidates": ["a white scarf", "a silver pen", "a silver lighter", "a silver knife", "a black hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "urXojVL9ULE_0", "video_path": "urXojVL9ULE.mp4", "subtitle_path": "urXojVL9ULE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.63, "view_count": 25455}, {"video_id": "urXojVL9ULE", "question": "In the sunset on the beach, among the rocks in the middle of the sand, there is a man wearing blue-green clothes on the left, and a woman wearing yellow clothes on the right. 
When the subtitle mentions 'they came through, he only comes out through another door in the room,' what items appear on the screen?", "question_wo_referring_query": "What items appear on the screen?", "candidates": ["white clothes", "red clothes", "yellow pants", "red clothes", "white hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "urXojVL9ULE_1", "video_path": "urXojVL9ULE.mp4", "subtitle_path": "urXojVL9ULE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.63, "view_count": 25455}, {"video_id": "urXojVL9ULE", "question": "In the scene where two people are kissing each other, the person on the left is a woman with short black hair, and the person on the right is a man with short blond hair. When the subtitle 'Then the two of them kissed each other' appears, which objects have appeared in the scene?", "question_wo_referring_query": "Which objects have appeared in the scene?", "candidates": ["silver earrings", "white scarf", "yellow earrings", "black hat", "silver necklace"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "urXojVL9ULE_2", "video_path": "urXojVL9ULE.mp4", "subtitle_path": "urXojVL9ULE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.63, "view_count": 25455}, {"video_id": "c8MsJz2xUNg", "question": "Through the window outside, the high-rise doors and windows opposite are clearly visible. Inside the meeting room, there are four people (three men and one woman) sitting around the conference table. 
What is the color of the chairs in the meeting room?", "question_wo_referring_query": "What is the color of the chairs in the meeting room?", "candidates": ["gray", "green", "white", "black", "red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "c8MsJz2xUNg_0", "video_path": "c8MsJz2xUNg.mp4", "subtitle_path": "c8MsJz2xUNg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.04, "view_count": 194112}, {"video_id": "c8MsJz2xUNg", "question": "The background is a virtualized green surface and trees. On the left side of the frame, there is a black silhouette with only the shoulders visible. In the center, there is a man with curly hair. What color is the curly hair of the man in the center of the frame?", "question_wo_referring_query": "What color is the curly hair of the man in the center of the frame?", "candidates": ["Black-gray", "White", "Green", "Red", "Yellow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "c8MsJz2xUNg_1", "video_path": "c8MsJz2xUNg.mp4", "subtitle_path": "c8MsJz2xUNg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.04, "view_count": 194112}, {"video_id": "c8MsJz2xUNg", "question": "In the grand hall with black walls, there is a man wearing a gray coat on the left side, a man wearing a white shirt and gray coat on the right side, and in the middle is the silhouette of a man in a black police uniform with a black hat. 
What is the shape of the hat's brim shown in the image?", "question_wo_referring_query": "What is the shape of the hat's brim shown in the image?", "candidates": ["hexagon", "circle", "triangle", "square", "rectangle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "c8MsJz2xUNg_2", "video_path": "c8MsJz2xUNg.mp4", "subtitle_path": "c8MsJz2xUNg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 937.04, "view_count": 194112}, {"video_id": "lKxEXO0Ffio", "question": "In the animation, on a ground paved with square stone slabs, there is a carved stone sculpture standing on a stone platform in the middle, and in the lower left there's an old man wearing grey clothes. When the subtitles say 'puppet appears and steals the jewels but,' what shape is the stone sculpture on the stone platform?", "question_wo_referring_query": "What shape is the stone sculpture on the stone platform?", "candidates": ["circle", "rectangle", "pentagon", "square", "octagon"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "lKxEXO0Ffio_0", "video_path": "lKxEXO0Ffio.mp4", "subtitle_path": "lKxEXO0Ffio_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1155.24, "view_count": 73100}, {"video_id": "lKxEXO0Ffio", "question": "In the animated scene with the blue sky and white clouds, there is a man wearing gray and black clothes holding two cards. 
When the subtitles mention \"wouldn't return in time the man made a\", what color are the cards in the man's hand?", "question_wo_referring_query": "What color are the cards in the man's hand?", "candidates": ["red", "blue", "white", "yellow", "black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "lKxEXO0Ffio_1", "video_path": "lKxEXO0Ffio.mp4", "subtitle_path": "lKxEXO0Ffio_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1155.24, "view_count": 73100}, {"video_id": "lKxEXO0Ffio", "question": "In a sky filled with dark clouds, there are two glowing circular light orbs: one in the upper left and one in the lower right. When the subtitle mentions 'face and spits out one last sphere of', what color is the light orb in the upper left?", "question_wo_referring_query": "What color is the light orb in the upper left?", "candidates": ["Green", "Blue", "Red", "White", "Purple"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "lKxEXO0Ffio_2", "video_path": "lKxEXO0Ffio.mp4", "subtitle_path": "lKxEXO0Ffio_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1155.24, "view_count": 73100}, {"video_id": "-hgJVx12Qk4", "question": "In the animated scene, on the desktop in a dim room, there is a laptop with a red light on the left, a spider in the middle, and a black wall on the right. 
Which character is using the laptop in the scene?", "question_wo_referring_query": "Which character is using the laptop in the scene?", "candidates": ["wall tiger", "panda", "cat", "wolf", "spider"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "-hgJVx12Qk4_0", "video_path": "-hgJVx12Qk4.mp4", "subtitle_path": "-hgJVx12Qk4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.9, "view_count": 46044}, {"video_id": "-hgJVx12Qk4", "question": "In the animated scene, under the cloudy sky, on the left is a fox wearing black clothes with an orange head, and on the right is an animal with a gray-black head. Behind them is a car with only the roof visible. Which character is driving the car in the scene?", "question_wo_referring_query": "Which character is driving the car in the scene?", "candidates": ["A fox wearing black clothes with a black head", "A man wearing black clothes", "An animal with a gray-black head", "A fox wearing black clothes with an orange head", "An animal wearing black clothes with a gray-black head"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "-hgJVx12Qk4_1", "video_path": "-hgJVx12Qk4.mp4", "subtitle_path": "-hgJVx12Qk4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.9, "view_count": 46044}, {"video_id": "-hgJVx12Qk4", "question": "In the animated scene, there is a car in motion. In the front left seat is a red spider, in the middle is a creature with a red and green head, and on the right side is a wolf wearing white clothes. In the back seat is a shark wearing black clothes. 
Which character is holding the steering wheel and driving the car?", "question_wo_referring_query": "Which character is holding the steering wheel and driving the car?", "candidates": ["Woman wearing white clothes", "Wolf wearing white clothes", "Red spider", "Creature with a red and green head", "Shark wearing black clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "-hgJVx12Qk4_2", "video_path": "-hgJVx12Qk4.mp4", "subtitle_path": "-hgJVx12Qk4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1001.9, "view_count": 46044}, {"video_id": "27w6E2Vyt6A", "question": "In a room cluttered with various items, there is a woman with black hair on the right and a man with short blonde hair and wearing a green shirt on the left. What did the man do when he first appeared?", "question_wo_referring_query": "What did he do?", "candidates": ["Ate a cucumber", "Drank a can of beverage", "Ate a piece of bread", "Picked up the remote control", "Picked up a glass cup"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "27w6E2Vyt6A_0", "video_path": "27w6E2Vyt6A.mp4", "subtitle_path": "27w6E2Vyt6A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.36, "view_count": 23581}, {"video_id": "27w6E2Vyt6A", "question": "In the nighttime scene, there is a row of support pillars for the road on the left side and flat ground on the right side. Next to one of the support pillars on the left, there is a person with a dark silhouette. 
The first time this person appears, what happens to him?", "question_wo_referring_query": "What happens to him?", "candidates": ["Kneels on the ground", "Turns on a flashlight", "Opens the door of a car", "Falls to the ground", "Waves towards the camera"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "27w6E2Vyt6A_1", "video_path": "27w6E2Vyt6A.mp4", "subtitle_path": "27w6E2Vyt6A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.36, "view_count": 23581}, {"video_id": "27w6E2Vyt6A", "question": "In the scene where various items are neatly stacked in the background, a man in a green sweater is sitting by the window looking outside. What did the man by the window do the first time he appeared?", "question_wo_referring_query": "What did he do?", "candidates": ["Opened the window", "Waved outside the window", "Pulled up the blinds", "Took out a bottle of mineral water", "Stuck his hand out of the window"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "27w6E2Vyt6A_2", "video_path": "27w6E2Vyt6A.mp4", "subtitle_path": "27w6E2Vyt6A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.36, "view_count": 23581}, {"video_id": "FZfmPQMJ6rw", "question": "In front of a gray-white wall, a long-haired woman wearing a blue and partially white floral printed dress is leaning against the wall. 
When the subtitles mention 'recalled the invitation and they both', what action does the woman perform?", "question_wo_referring_query": "What action does the woman perform?", "candidates": ["Lifting her head", "Raising her hand and sticking out her tongue", "Drinking a sip of water", "Smiling and raising her hand", "Smiling and taking out her phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "FZfmPQMJ6rw_0", "video_path": "FZfmPQMJ6rw.mp4", "subtitle_path": "FZfmPQMJ6rw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.28, "view_count": 15033}, {"video_id": "FZfmPQMJ6rw", "question": "In the shot with a blurry background, on the left is a woman's face, and in the center is a woman wearing white clothes and tied-up hair. When the subtitle mentions 'finally found what they needed she asked,' what action did the woman wearing white clothes and tied-up hair in the center perform?", "question_wo_referring_query": "What action did the woman wearing white clothes and tied-up hair in the center perform?", "candidates": ["Hand covering mouth and smiling", "Holding a pen and smiling", "Touching hair and smiling", "Hand covering mouth crying", "Holding a phone and smiling"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "FZfmPQMJ6rw_1", "video_path": "FZfmPQMJ6rw.mp4", "subtitle_path": "FZfmPQMJ6rw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.28, "view_count": 15033}, {"video_id": "FZfmPQMJ6rw", "question": "On a white background wall, there is a painting with a moose. To the right of the painting, there is a blue door. A long-haired woman in an off-shoulder dress is facing a mirror. 
When the subtitle mentions 'couldn't make it what U didn't know at', what is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["Holding a phone and smiling", "Holding a pen and smiling", "Making a phone call with a cellphone", "Waving at the mirror", "Drinking a sip of water"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "FZfmPQMJ6rw_2", "video_path": "FZfmPQMJ6rw.mp4", "subtitle_path": "FZfmPQMJ6rw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1092.28, "view_count": 15033}, {"video_id": "N8JA_m_kcy4", "question": "In a room with red walls and columns, two women and one man are sitting on a sofa to the left, and a man wearing an olive green outfit is standing on the right. What does the man in the olive green outfit do before the subtitle 'He drinks, stating that if they were meant to be killed, it would've already happened' appears?", "question_wo_referring_query": "What does the man in the olive green outfit, who is standing, do?", "candidates": ["Kneeled on the ground", "Drank a glass of water", "Sat on the sofa", "Put on a hat", "Took off his coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "N8JA_m_kcy4_0", "video_path": "N8JA_m_kcy4.mp4", "subtitle_path": "N8JA_m_kcy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.8, "view_count": 220646}, {"video_id": "N8JA_m_kcy4", "question": "In a room with dark red tones, a man is facing away from the mirror while a woman in a light blue shirt is facing the mirror. 
After the subtitle 'She firmly states that either one of the parents dies, or they all perish with their daughter' appears, what does the woman in the light blue shirt do?", "question_wo_referring_query": "What does the woman in the light blue shirt do?", "candidates": ["Smile at the man", "Cry at the man", "Take off her shirt", "Wave at the mirror", "Put on sunglasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "N8JA_m_kcy4_1", "video_path": "N8JA_m_kcy4.mp4", "subtitle_path": "N8JA_m_kcy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.8, "view_count": 220646}, {"video_id": "N8JA_m_kcy4", "question": "Under the blue sky and white clouds, on a narrow road surrounded by green walls of plants slightly taller than a person, a woman in a light blue top is facing the camera. After the subtitle 'Elsewhere, Teresa spots the exit and dashes towards it' appears, what does the woman do?", "question_wo_referring_query": "What does the woman do?", "candidates": ["Faces the camera and sighs", "Faces the camera and waves", "Turns away from the camera and squats on the ground", "Puts on a hat", "Turns away from the camera and runs"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "N8JA_m_kcy4_2", "video_path": "N8JA_m_kcy4.mp4", "subtitle_path": "N8JA_m_kcy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 967.8, "view_count": 220646}, {"video_id": "86QlwqOey-s", "question": "In the office, on the upper left side of the wall, there is red and white wallpaper, and on the right side, there is a white cabinet with a mirror hanging with photos. In front of the mirror is a man wearing gray clothes. 
Before the subtitle 'Tommy annoys his colleagues with his constant filming, walking with the camera even to the...', which other characters appear?", "question_wo_referring_query": "Which other characters appear?", "candidates": ["A man wearing yellow clothes", "A man wearing red pants", "A man wearing blue clothes", "A man wearing red clothes", "A man wearing a red hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "86QlwqOey-s_0", "video_path": "86QlwqOey-s.mp4", "subtitle_path": "86QlwqOey-s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1061.31, "view_count": 18028}, {"video_id": "86QlwqOey-s", "question": "On the white snowy ground, some items are piled up at the top left, there's a shadow covering the bottom left, and the right side is an empty stretch of snow. After the subtitles 'in snow. Tommy has to turn off the camera. Later, Monica secretly turns it on to film ; ;' appear, what items are shown?", "question_wo_referring_query": "What items appear afterward?", "candidates": ["A blue box with a white lid", "A blue box with a black lid", "A yellow box", "A blue box with a red lid", "A red box with a white lid"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "86QlwqOey-s_1", "video_path": "86QlwqOey-s.mp4", "subtitle_path": "86QlwqOey-s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1061.31, "view_count": 18028}, {"video_id": "86QlwqOey-s", "question": "On the white snowy ground, there's a dense forest in the distance. In front of the camera is a man wearing red clothes with his back to the camera. After the subtitle 'The guy falls into apathy and returns to Monica. 
For a moment, the girl regains ...', which characters appear?", "question_wo_referring_query": "Which characters appear afterwards?", "candidates": ["A man with long black hair", "A woman with long black hair", "A woman with long blonde hair", "A boy with long black hair", "A woman with short black hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "86QlwqOey-s_2", "video_path": "86QlwqOey-s.mp4", "subtitle_path": "86QlwqOey-s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1061.31, "view_count": 18028}, {"video_id": "FSiOsYyU8jI", "question": "Under the blue sky, on the left side of the frame is a black person wearing gray clothes, and on the right side is the back of a person wearing a hat. In which of the following scenes does the black person wearing white clothes appear in the frame?", "question_wo_referring_query": "In which of the following scenes does the black person wearing white clothes appear in the frame?", "candidates": ["In a bus full of passengers", "In an empty classroom", "On a sunny beach", "In a busy kitchen", "In a classroom with students sitting"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "FSiOsYyU8jI_0", "video_path": "FSiOsYyU8jI.mp4", "subtitle_path": "FSiOsYyU8jI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 959.9, "view_count": 138618}, {"video_id": "FSiOsYyU8jI", "question": "In the room with faint lighting, there is a glass door with a white frame in the center. To the left in front of the door, there is a man wearing a red tie. To the right, there is a black man dressed in gray clothes and a silver tie. 
In which of the following scenes has the black man on the right appeared?", "question_wo_referring_query": "In which of the following scenes has the black man on the right appeared?", "candidates": ["Inside an ambulance", "On a deserted mountaintop", "In a crowded group of people", "On the deck of a luxury cruise ship", "On a hospital bed"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "FSiOsYyU8jI_1", "video_path": "FSiOsYyU8jI.mp4", "subtitle_path": "FSiOsYyU8jI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 959.9, "view_count": 138618}, {"video_id": "FSiOsYyU8jI", "question": "In the room with a pile of documents in the middle, on the left side stands a row of people wearing various colors of clothing, and on the right side stands a person with a white megaphone at their waist facing away from a mirror. In which of the following scenes does the person with the white megaphone at their waist appear?", "question_wo_referring_query": "In which of the following scenes does the person with the white megaphone at their waist facing away from a mirror appear?", "candidates": ["On an empty stage", "In a crowded basketball arena", "On a sunny grassland", "On a sunny beach", "In a bus full of passengers"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "FSiOsYyU8jI_2", "video_path": "FSiOsYyU8jI.mp4", "subtitle_path": "FSiOsYyU8jI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 959.9, "view_count": 138618}, {"video_id": "5wNiNXcG7lE", "question": "In front of the virtual background camera, on the left side, there is a hand showing only the palm, and on the right side, there is a hand showing both the arm and the palm. These two hands are shaking hands. There is a gray-black spider on the fingers of the left palm. 
With which of the following subtitles has the spider in front of the camera appeared together?", "question_wo_referring_query": "With which of the following subtitles has the spider in front of the camera appeared together?", "candidates": ["They introduce us different kinds of knowledge", "because I have a very beautiful bedroom", "Scard, Frank shook the spider off to the gang's amusement", "and have noticed your requirement on line", "curtains are blue too"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "5wNiNXcG7lE_0", "video_path": "5wNiNXcG7lE.mp4", "subtitle_path": "5wNiNXcG7lE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.94, "view_count": 159018}, {"video_id": "5wNiNXcG7lE", "question": "In a room with white walls, there is the corner of a bed and a yellow bedside table on the left side. On the right wall hangs a checkered garment. In the middle, a man wearing a white shirt is sitting on a chair. In the room, with which of the following subtitles has this man appeared together?", "question_wo_referring_query": "With which of the following subtitles has the man in the room appeared together?", "candidates": ["Books tell us what is good and what is evil", "Therefore to read more books is the best policy", "we can choose one of them to take part in", "and philosophy of life", "so, Eddie lies that he hasn't taken a shower in days"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "5wNiNXcG7lE_1", "video_path": "5wNiNXcG7lE.mp4", "subtitle_path": "5wNiNXcG7lE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.94, "view_count": 159018}, {"video_id": "5wNiNXcG7lE", "question": "In a dimly lit room, on the left side, there is a room bathed in red light with a table, and on the table, there is a lamp glowing with yellow light. 
On the right side, there are two men walking into the room. With which of the following subtitles does the scene with the lamp glowing yellow on the table appear?", "question_wo_referring_query": "With which of the following subtitles does the scene with the lamp glowing yellow on the table appear?", "candidates": ["are some photos on the wall", "The two men enter the bar, and Candy approaches", "will get ready to face the future", "i will get back to my everyday activities", "man should not depend on lucky which"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "5wNiNXcG7lE_2", "video_path": "5wNiNXcG7lE.mp4", "subtitle_path": "5wNiNXcG7lE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.94, "view_count": 159018}, {"video_id": "_7a8HVIK38E", "question": "In the corner of the blue wall, on the right side wall, there are two white switches lined up. In the middle is a woman with long golden hair wearing a blue-gray outfit. When this woman appears among the blurry crowd in the background, what clothing does she change into?", "question_wo_referring_query": "What clothing does she change into?", "candidates": ["Put on a green T-shirt", "Changed from blue-gray outfit to white clothing", "Changed from white clothing to black clothing", "Put on a black coat", "Changed from blue-gray outfit to burgundy clothing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "_7a8HVIK38E_0", "video_path": "_7a8HVIK38E.mp4", "subtitle_path": "_7a8HVIK38E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1199.56, "view_count": 688975}, {"video_id": "_7a8HVIK38E", "question": "On a black surface, the top right is a yellow-white wall, and in the middle is a woman sitting on the ground with her face covered by her hands, having long golden hair. 
When this woman appears in front of a wall that is white on the left side and dark green on the right side, what clothes did she change into?", "question_wo_referring_query": "What clothes did she change into?", "candidates": ["She changed from grey clothes to red clothes", "She changed from white clothes to grey clothes", "She changed from black clothes to grey clothes", "She changed from grey clothes to black clothes", "She changed from white clothes to yellow clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "_7a8HVIK38E_1", "video_path": "_7a8HVIK38E.mp4", "subtitle_path": "_7a8HVIK38E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1199.56, "view_count": 688975}, {"video_id": "_7a8HVIK38E", "question": "In the dimly lit room, on the right side there is a short table cluttered with various items, and on the right side is a long-haired woman sitting on the floor and leaning against the sofa. This woman has black legs. When the woman lies flat on the water-covered floor of the room, what transformation does she undergo?", "question_wo_referring_query": ", what transformation does she undergo?", "candidates": ["Her black legs turned into blue-gray mermaid tails.", "Her black legs changed into white legs.", "Her black legs turned into red legs.", "Her head turned into a fish head.", "Black wings grew on her body."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "_7a8HVIK38E_2", "video_path": "_7a8HVIK38E.mp4", "subtitle_path": "_7a8HVIK38E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1199.56, "view_count": 688975}, {"video_id": "tMVDTMrPzDM", "question": "On the ground in the forest, on the left side there is a man wearing dark green clothes. 
On the right side there is a man with grey and white long hair, and behind the man with long grey and white hair there is a group of people wearing hats. What is the man with long grey and white hair on the right side doing in the frame?", "question_wo_referring_query": "What is the man with long grey and white hair on the right side doing in the frame?", "candidates": ["Taking out a long sword", "Raising an axe towards the sky", "Handing an axe to the man wearing dark green clothes", "Handing an axe to the man wearing dark white clothes", "Throwing an axe to the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "tMVDTMrPzDM_0", "video_path": "tMVDTMrPzDM.mp4", "subtitle_path": "tMVDTMrPzDM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 957.19, "view_count": 330917}, {"video_id": "tMVDTMrPzDM", "question": "In a lush forest, there is a man in the middle who is wearing a black shirt and pants. He is surrounded by dark green low plants. What action is the man in the middle doing?", "question_wo_referring_query": "What action is the man in the middle doing?", "candidates": ["Hands clasped together", "Hands on hips", "Arms spread open", "Head bowed in prayer", "Waving at the camera"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "tMVDTMrPzDM_1", "video_path": "tMVDTMrPzDM.mp4", "subtitle_path": "tMVDTMrPzDM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 957.19, "view_count": 330917}, {"video_id": "tMVDTMrPzDM", "question": "On the sandy beach overshadowed by a mountain cave, in the distance is the bright blue-green sea under the sunlight. Nearby, inside the mountain cave, there's a silhouette of a man facing the sea. 
What is the man inside the mountain cave doing?", "question_wo_referring_query": "What is the man inside the mountain cave doing?", "candidates": ["Standing and drinking water", "Running towards the sea", "Doing push-ups on the sand", "Sitting on the ground drinking water", "Picking up a torch from the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "tMVDTMrPzDM_2", "video_path": "tMVDTMrPzDM.mp4", "subtitle_path": "tMVDTMrPzDM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 957.19, "view_count": 330917}, {"video_id": "6CUr6FGEVWw", "question": "In the classroom with white walls, the desks and chairs are neatly arranged and fully occupied by students who are attentively listening to the lecture. On the lower right is the back view of a teacher dressed in gray and white clothing giving a lecture. Which of the following items appear in the classroom?", "question_wo_referring_query": "Which of the following items appear in the classroom?", "candidates": ["black coat", "blue coat", "red hat", "yellow desks", "white curtains"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "6CUr6FGEVWw_0", "video_path": "6CUr6FGEVWw.mp4", "subtitle_path": "6CUr6FGEVWw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2352.27, "view_count": 669952}, {"video_id": "6CUr6FGEVWw", "question": "In a room illuminated by sunlight, there is a yellow table in the middle filled with various items. On either side of the table, there are two women both dressed in blue. Near the window at a distance, there are two colorful oil paintings. 
Which of the following items appear in the room?", "question_wo_referring_query": "Which of the following items appear in the room?", "candidates": ["White curtain", "Red drinking machine", "Orange umbrella", "Blue beverage bottle", "Orange beverage bottle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "6CUr6FGEVWw_1", "video_path": "6CUr6FGEVWw.mp4", "subtitle_path": "6CUr6FGEVWw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2352.27, "view_count": 669952}, {"video_id": "6CUr6FGEVWw", "question": "In front of a transparent glass window, on a windowsill with yellow and white colors, there is a man in the middle wearing a black shirt and gray trousers, and on the right there is a woman wearing a black shirt and a skirt. Which of the following items appears in the scene?", "question_wo_referring_query": "Which of the following items appears in the scene?", "candidates": ["Tie", "White coat", "Red trousers", "Hat", "Gray bookbag"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "6CUr6FGEVWw_2", "video_path": "6CUr6FGEVWw.mp4", "subtitle_path": "6CUr6FGEVWw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2352.27, "view_count": 669952}, {"video_id": "qk4LU_i4gsc", "question": "In the image, on the left is a tall yellow truck. Beside the truck are two people wearing green clothes and one person wearing grey clothes. In front of the mirror on the right, there is a man in yellow clothes and another man in white clothes having a conversation. 
When the subtitle mentions 'signs of his beloved son,' which of the following objects appear in the scene?", "question_wo_referring_query": "Which of the following objects appear in the scene?", "candidates": ["blue safety helmet", "white safety helmet", "black safety helmet", "blue knit cap", "black truck"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "qk4LU_i4gsc_0", "video_path": "qk4LU_i4gsc.mp4", "subtitle_path": "qk4LU_i4gsc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.87, "view_count": 263821}, {"video_id": "qk4LU_i4gsc", "question": "In front of several green tree trunks, there is a savage with white paint on his belly and holding a long stick. When the subtitle says 'Instead, he shoots at the savage aiming at Bill,' which of the following items appeared on the screen?", "question_wo_referring_query": "Which of the following items appeared on the screen?", "candidates": ["Green utility pole", "Long-handled axe", "Red bandana", "Black hat", "White shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "qk4LU_i4gsc_1", "video_path": "qk4LU_i4gsc.mp4", "subtitle_path": "qk4LU_i4gsc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.87, "view_count": 263821}, {"video_id": "qk4LU_i4gsc", "question": "Under the reflection of the disordered plant vines, a piece of uneven sand in the distance is illuminated by faint sunlight. 
When the subtitle mentions 'As a farewell, Bill embraces his son for the last time and returns to the city,' what objects appear on the sand?", "question_wo_referring_query": "What objects appear on the sand?", "candidates": ["Police car", "Blue safety helmet", "Pickup truck", "Tank", "Airplane"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "qk4LU_i4gsc_2", "video_path": "qk4LU_i4gsc.mp4", "subtitle_path": "qk4LU_i4gsc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.87, "view_count": 263821}, {"video_id": "WVNZvAh7dZw", "question": "In a train car with a row of white handrails suspended from above, there is a man in the middle wearing a red coat and looking at his phone, with a transparent window on the right side. What is the hair color of the man wearing the red coat on the train?", "question_wo_referring_query": "What is the hair color of the man wearing the red coat on the train?", "candidates": ["yellow", "white", "red", "black", "blue"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "WVNZvAh7dZw_0", "video_path": "WVNZvAh7dZw.mp4", "subtitle_path": "WVNZvAh7dZw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2093.06, "view_count": 9819}, {"video_id": "WVNZvAh7dZw", "question": "A man wearing a black suit is lying on a blue leather sofa, with an open book covering his face. 
What color is the book in the scene?", "question_wo_referring_query": "What color is the book in the scene?", "candidates": ["blue", "black", "beige", "red", "olive"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "WVNZvAh7dZw_1", "video_path": "WVNZvAh7dZw.mp4", "subtitle_path": "WVNZvAh7dZw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2093.06, "view_count": 9819}, {"video_id": "WVNZvAh7dZw", "question": "In the white-walled hair salon, on the left side, there are four mirrors lined up and a person wearing white clothes standing straight. In the middle and on the right side, there are teal chairs lined up. In the far right is a man wearing black clothes. What is the shape of the floor mat under the teal chair on the right side of the hair salon?", "question_wo_referring_query": "What is the shape of the floor mat under the teal chair on the right side of the hair salon?", "candidates": ["Circle", "Square", "Hexagon", "Rectangle", "Triangle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "WVNZvAh7dZw_2", "video_path": "WVNZvAh7dZw.mp4", "subtitle_path": "WVNZvAh7dZw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2093.06, "view_count": 9819}, {"video_id": "XL8RwvLNKCc", "question": "On a ground made of light yellow square stone slabs, there is an overturned wine bottle in the middle. When the subtitle 'Justin challenges them all to a game of Truth or Dare. The bottle is spun, where Eleanor ...' 
appears, what color is the wine bottle on the ground?", "question_wo_referring_query": "What color is the wine bottle on the ground?", "candidates": ["The bottle's body is golden and its neck is white", "Orange", "The bottle's body is golden and its neck is black", "The bottle's body is black and its neck is white", "The bottle's body is black and its neck is golden"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "XL8RwvLNKCc_0", "video_path": "XL8RwvLNKCc.mp4", "subtitle_path": "XL8RwvLNKCc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1015.77, "view_count": 2040201}, {"video_id": "XL8RwvLNKCc", "question": "In the background-blurred frame, there are several pieces of different-colored clothing hanging on the left, a gray floral sofa chair on the right, and a woman with facial injuries wearing blue clothes sitting in the middle. When the subtitle mentions 'As a result, Justin gets angry finally tells Paul to choose;', what type of hairstyle does the woman in the frame have?", "question_wo_referring_query": "What type of hairstyle does the woman in the frame have?", "candidates": ["Black long straight hair", "Black short curly hair", "White short hair", "Blonde long straight hair", "Red long straight hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "XL8RwvLNKCc_1", "video_path": "XL8RwvLNKCc.mp4", "subtitle_path": "XL8RwvLNKCc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1015.77, "view_count": 2040201}, {"video_id": "XL8RwvLNKCc", "question": "On a floor made of long wooden boards, there is a stick standing upright in the lower right corner. In the middle, there is a short-haired man lying down with his right arm stretched out. 
When the subtitle mentions 'the fork hoe, where he dies on the spot,' what color is the man's right sleeve?", "question_wo_referring_query": "What color is the man's right sleeve?", "candidates": ["gray", "white", "red", "yellow", "green"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "XL8RwvLNKCc_2", "video_path": "XL8RwvLNKCc.mp4", "subtitle_path": "XL8RwvLNKCc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1015.77, "view_count": 2040201}, {"video_id": "h5rdgtyx844", "question": "A black animal is lying on a black armchair, facing the camera. There's a blurry screen behind the chair. What is the animal lying on the chair, squinting and smiling?", "question_wo_referring_query": "What is the animal lying on the chair, squinting and smiling?", "candidates": ["bear", "tiger", "dog", "lion", "cat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "h5rdgtyx844_0", "video_path": "h5rdgtyx844.mp4", "subtitle_path": "h5rdgtyx844_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1089.96, "view_count": 322790}, {"video_id": "h5rdgtyx844", "question": "In the animation, on a yellow ground illuminated by sunlight, there is a person in the middle squatting and sticking out their tongue. To the right is a gray-white stone platform, and to the upper left is an animal wearing white clothes and a blue handcart. 
Who is the person squatting and sticking out their tongue?", "question_wo_referring_query": "Who is the person squatting and sticking out their tongue in the scene?", "candidates": ["a man in white clothes", "a man in yellow clothes", "a woman in yellow clothes", "a woman in black clothes", "a man in black clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "h5rdgtyx844_1", "video_path": "h5rdgtyx844.mp4", "subtitle_path": "h5rdgtyx844_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1089.96, "view_count": 322790}, {"video_id": "h5rdgtyx844", "question": "In the animated scene, in front of the stairs under a white house, there is a shadow of an animal with a white head on the left, and on the right, there is an animal wearing a red cloak, holding a black cat with both hands. What is the animal holding the black cat with both hands?", "question_wo_referring_query": "What is the animal holding the black cat with both hands in the scene?", "candidates": ["A dog with a white head", "A tiger", "A dog with a brown head", "A rabbit with a brown head", "An elephant"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "h5rdgtyx844_2", "video_path": "h5rdgtyx844.mp4", "subtitle_path": "h5rdgtyx844_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1089.96, "view_count": 322790}, {"video_id": "Vp8yU0vKti0", "question": "On the walkway with a white wall on the right side, several torches are successively placed on the wall of the walkway. On the left side of the walkway is a woman dressed in red, and in the middle is a man dressed in red. 
What does the man do the first time he appears?", "question_wo_referring_query": "What does he do?", "candidates": ["Drops a bow", "Breaks a bow", "Waves a sword", "Pulls open a bow", "Lifts up a hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Vp8yU0vKti0_0", "video_path": "Vp8yU0vKti0.mp4", "subtitle_path": "Vp8yU0vKti0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1256.82, "view_count": 72508}, {"video_id": "Vp8yU0vKti0", "question": "In the scene with dark green mountains in the distance and white dome buildings nearby, there is a person in red clothing on the right and a man wearing orange pants and an orange headscarf on the left. What did the man wearing the orange headscarf do when he first appeared?", "question_wo_referring_query": "What did the man wearing the orange headscarf do when he first appeared?", "candidates": ["Hugged the person in red clothing", "Kicked the person in red clothing", "Fell to the ground", "Pulled a bow", "Picked up a sword"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Vp8yU0vKti0_1", "video_path": "Vp8yU0vKti0.mp4", "subtitle_path": "Vp8yU0vKti0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1256.82, "view_count": 72508}, {"video_id": "Vp8yU0vKti0", "question": "In a field full of tall yellow grass, with white sky in the upper left corner, a man in black armor with white beard stands in front of the camera. Behind him, some soldiers are hidden in the grass. 
What is the man with the white beard doing when he first appears?", "question_wo_referring_query": "What is the man with the white beard doing when he first appears?", "candidates": ["Sticking a sword into the ground", "Drawing a bow", "Raising a helmet", "Raising a handgun", "Raising a sword"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Vp8yU0vKti0_2", "video_path": "Vp8yU0vKti0.mp4", "subtitle_path": "Vp8yU0vKti0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1256.82, "view_count": 72508}, {"video_id": "NMeViaAgct8", "question": "In a room with a white tiled wall, a man wearing a suit and a dark-colored tie is standing. Various items are placed on the cabinet behind him. When the subtitles mention 'in series three, two years have passed and Tommy has continued a partnership with Winston Churchill', what does the man do?", "question_wo_referring_query": "What does the man do?", "candidates": ["Run", "Sing", "Drink water", "Smoke", "Dance"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "NMeViaAgct8_0", "video_path": "NMeViaAgct8.mp4", "subtitle_path": "NMeViaAgct8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1088.17, "view_count": 42461}, {"video_id": "NMeViaAgct8", "question": "A man in a suit and tie is sitting next to a table in a room. On the table are items like a cup and an ashtray. Behind him is a tent. When the subtitle mentions 'with Winston Churchill. 
After the family are rewarded their freedom,' what does the man do?", "question_wo_referring_query": "What does the man do?", "candidates": ["Takes off his glasses", "Stands up", "Writes", "Plays the piano", "Kneels down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "NMeViaAgct8_1", "video_path": "NMeViaAgct8.mp4", "subtitle_path": "NMeViaAgct8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1088.17, "view_count": 42461}, {"video_id": "NMeViaAgct8", "question": "In the middle of the screen, there is a man wearing a black suit and glasses, looking slightly upwards into the distance. Behind him, there is a large fire. When the subtitle mentions 'then murders Michael for his betrayal, fulfilling the late Polly's prophecy,' what happens to this man?", "question_wo_referring_query": "What happens to this man?", "candidates": ["falls down", "jumps up", "drinks water", "runs", "kneels down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "NMeViaAgct8_2", "video_path": "NMeViaAgct8.mp4", "subtitle_path": "NMeViaAgct8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1088.17, "view_count": 42461}, {"video_id": "9BR7FaeS01A", "question": "In a room, two children are sitting on a sofa. The child on the right, with short hair and wearing black clothes, is holding and reading a book, while the child on the left is also looking at the book. There is a window behind them with white and red curtains. 
As the screen cuts off here, which object appears first?", "question_wo_referring_query": ", as the screen cuts off here, which object appears first?", "candidates": ["tennis racket", "balloon", "basketball hoop", "syringe", "sunglasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "9BR7FaeS01A_0", "video_path": "9BR7FaeS01A.mp4", "subtitle_path": "9BR7FaeS01A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1059.03, "view_count": 14588}, {"video_id": "9BR7FaeS01A", "question": "In a room with yellow walls, there is a short-haired man sitting next to a table wearing white clothes with a white coat. There are objects on the table, and on the windowsill behind him, there are white bars and potted plants. In front of the yellow wall, there is a small green tree. Which of the following items appears first?", "question_wo_referring_query": "Which of the following items appears first?", "candidates": ["drink", "star", "flower", "fire pit", "water cup"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "9BR7FaeS01A_1", "video_path": "9BR7FaeS01A.mp4", "subtitle_path": "9BR7FaeS01A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1059.03, "view_count": 14588}, {"video_id": "9BR7FaeS01A", "question": "Who appears first in the video?", "question_wo_referring_query": "Who appears first in the video?", "candidates": ["Cecilia", "Lisbon", "Trip", "Peter", "Chase"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "9BR7FaeS01A_2", "video_path": "9BR7FaeS01A.mp4", "subtitle_path": "9BR7FaeS01A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1059.03, "view_count": 14588}, {"video_id": "l2N3V5zzefU", "question": "There is a glass window on the wall. 
Below the window, there are four women standing side by side with their hands raised high, holding something tightly against the wall facing the mirror. After the subtitle 'she feels very annoyed with him and' appears, what is the first object to appear?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["book without cover", "camera", "table", "wooden plank", "soccer ball"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "l2N3V5zzefU_0", "video_path": "l2N3V5zzefU.mp4", "subtitle_path": "l2N3V5zzefU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 976.84, "view_count": 226367}, {"video_id": "l2N3V5zzefU", "question": "In the middle of the screen, there is a white computer with English letters on it. Behind the computer, there is a white window. To the right of the computer, there are curtains with a yellow and blue checkered pattern. When the subtitle 'after Bo Rose sends he and Jin's pager' appears, which item is shown on the screen for the first time?", "question_wo_referring_query": "Which item appears on the screen for the first time?", "candidates": ["Mouse", "Wooden board", "Pen", "A piece of paper", "Book"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "l2N3V5zzefU_1", "video_path": "l2N3V5zzefU.mp4", "subtitle_path": "l2N3V5zzefU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 976.84, "view_count": 226367}, {"video_id": "l2N3V5zzefU", "question": "In a room with pictures on the wall, a group of people sit next to a desk filled with books, looking at a woman in a red dress standing at the door. 
The woman is holding something in her hand, and after the subtitle mentions 'starts to pay attention to her when she', what kind of item appears on the screen?", "question_wo_referring_query": "What kind of item appears on the screen?", "candidates": ["green book bag", "a piece of paper", "wooden plank", "pen", "mouse"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "l2N3V5zzefU_2", "video_path": "l2N3V5zzefU.mp4", "subtitle_path": "l2N3V5zzefU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 976.84, "view_count": 226367}, {"video_id": "Uk5_GLGeRiY", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, in a room with white walls, there are two people sitting and eating. On the left is a man with short hair and green clothes, and on the right is a woman with hair tied up, wearing a blue dress and a white hairband on her head. Then, in another room with white walls, there is a woman with long hair and a red dress standing on the left, and a man with short hair and green clothes on the right. They are having a conversation, and the man is shorter than the woman. There are other people standing behind them. Finally, in a room with white walls, there are two women sitting. On the left is a woman with her hair tied up in a blue dress, and on the right is a woman with black hair and a red dress. They are holding food in their hands. Then, in another room with white walls, there are two women sitting. On the left is a woman with her hair tied up in a blue dress, and on the right is a woman with black hair and a red dress. They are holding food in their hands.", "First, in a room with white walls, there are two women sitting. On the left is a woman with her hair tied up in a blue dress, and on the right is a woman with black hair and a red dress. 
They are holding food in their hands. Then, in a room with white walls, there are two people sitting and eating. On the left is a man with short hair and green clothes, and on the right is a woman with hair tied up, wearing a blue dress and a white hairband on her head. Finally, in a room with white walls, there is a woman with long hair and a red dress standing on the left, and a man with short hair and green clothes on the right. They are having a conversation, and the man is shorter than the woman. There are other people standing behind them.", "First, in a room with white walls, there is a woman with long hair and a red dress standing on the left, and a man with short hair and green clothes on the right. They are having a conversation, and the man is shorter than the woman. There are other people standing behind them. Then, in a room with white walls, there are two women sitting. On the left is a woman with her hair tied up in a blue dress, and on the right is a woman with black hair and a red dress. They are holding food in their hands. Finally, in a room with white walls, there are two people sitting and eating. On the left is a man with short hair and green clothes, and on the right is a woman with hair tied up, wearing a blue dress and a white hairband on her head.", "First, in a room with white walls, there is a woman with long hair and a red dress standing on the left, and a man with short hair and green clothes on the right. They are having a conversation, and the man is shorter than the woman. There are other people standing behind them. Then, in a room with white walls, there are two people sitting and eating. On the left is a man with short hair and green clothes, and on the right is a woman with hair tied up, wearing a blue dress and a white hairband on her head. Finally, in a room with white walls, there are two women sitting. On the left is a woman with her hair tied up in a blue dress, and on the right is a woman with black hair and a red dress. 
They are holding food in their hands.", "First, in a room with white walls, there are two people sitting and eating. On the left is a man with short hair and green clothes, and on the right is a woman with hair tied up, wearing a blue dress and a white hairband on her head. Then, in another room with white walls, there is a woman with long hair and a red dress standing on the left, and a man with short hair and green clothes on the right. They are having a conversation, and the man is shorter than the woman. There are other people standing behind them. Finally, in a room with white walls, there are two women sitting. On the left is a woman with her hair tied up in a blue dress, and on the right is a woman with black hair and a red dress. They are holding food in their hands."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "Uk5_GLGeRiY_0", "video_path": "Uk5_GLGeRiY.mp4", "subtitle_path": "Uk5_GLGeRiY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 951.91, "view_count": 52321}, {"video_id": "Uk5_GLGeRiY", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, in the mirror, there are two people: on the left, a man in grey clothes, and on the right, a woman with long curly hair. Next, in a white room, on the left, there is a woman with black hair wearing a green outfit, and on the right, there is a man with short hair, wearing a grey jacket and a red tie. Finally, a person wearing grey presses the head of the woman in the green outfit into the water, with a red wall in the background.", "First, a person wearing grey presses the head of the woman in the green outfit into the water, with a red wall in the background. Next, in the mirror, there are two people: on the left, a man in grey clothes, and on the right, a woman with long curly hair. 
Finally, in a white room, on the left, there is a woman with black hair wearing a green outfit, and on the right, there is a man with short hair, wearing a grey jacket and a red tie.", "First, in a white room, on the left, there is a woman with black hair wearing a green outfit, and on the right, there is a man with short hair, wearing a grey jacket and a red tie. Next, there is a person wearing grey who presses the head of the woman in the green outfit into the water, with a red wall in the background. Finally, in the mirror, there are two people: on the left, a man in grey clothes, and on the right, a woman with long curly hair.", "First, a person wearing grey presses the head of the woman in the green outfit into the water, with a red wall in the background. Then, in a white room, on the left, there is a woman with black hair wearing a green outfit, and on the right, there is a man with short hair, wearing a grey jacket and a red tie. Lastly, in the mirror, there are two people: on the left, a man in grey clothes, and on the right, a woman with long curly hair.", "First, in a white room, on the left, there is a woman with black hair wearing a green outfit, and on the right, there is a man with short hair, wearing a grey jacket and a red tie. Next, in the mirror, there are two people: on the left, a man in grey clothes, and on the right, a woman with long curly hair. 
Finally, there is a person wearing grey who presses the head of the woman in the green outfit into the water, with a red wall in the background."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "Uk5_GLGeRiY_1", "video_path": "Uk5_GLGeRiY.mp4", "subtitle_path": "Uk5_GLGeRiY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 951.91, "view_count": 52321}, {"video_id": "Uk5_GLGeRiY", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, there is a short-haired man wearing a khaki-colored outfit in a room with white walls, with several people standing behind him. Then, there is a child with short hair wearing a green outfit on the screen. Finally, a long-haired woman with curly hair wearing a red outfit appears in a room with white walls.", "First, a long-haired woman with curly hair wearing a red outfit appears in a room with white walls. Then, there is a child with short hair wearing a green outfit on the screen. Finally, there is a short-haired man wearing a khaki-colored outfit in a room with white walls, with several people standing behind him.", "First, there is a short-haired man wearing a khaki-colored outfit in a room with white walls, with several people standing behind him. Then, a long-haired woman with curly hair wearing a red outfit appears in a room with white walls. Finally, there is a child with short hair wearing a green outfit on the screen.", "First, there is a child with short hair wearing a green outfit on the screen. Then, there is a short-haired man wearing a khaki-colored outfit in a room with white walls, with several people standing behind him. 
Finally, a long-haired woman with curly hair wearing a red outfit appears in a room with white walls.", "First, a long-haired woman with curly hair wearing a red outfit appears in a room with white walls. Then, there is a short-haired man wearing a khaki-colored outfit in a room with white walls, with several people standing behind him. Finally, there is a child with short hair wearing a green outfit on the screen."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "Uk5_GLGeRiY_2", "video_path": "Uk5_GLGeRiY.mp4", "subtitle_path": "Uk5_GLGeRiY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 951.91, "view_count": 52321}, {"video_id": "LXhP-mec5xo", "question": "In a slightly dimly lit room, a person wearing a checkered shirt is sitting by the table, looking at the box in her hand. On the table, there is a book and an orange object. When did this book and the corresponding subtitles appear together?", "question_wo_referring_query": "When did this book and the corresponding subtitles appear together?", "candidates": ["their boxes and try to find a way to ", "upon reading it they find out that it's", "surrender her phone as it is not allowed", "the box and a note card pops out of it", "centers the minos building she is greeted"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "LXhP-mec5xo_0", "video_path": "LXhP-mec5xo.mp4", "subtitle_path": "LXhP-mec5xo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 992.45, "view_count": 88337}, {"video_id": "LXhP-mec5xo", "question": "Outside a house made of timber, some people are standing. In front of the camera is a short-haired man wearing blue clothes, and behind him is another short-haired man wearing a long-sleeved black shirt. On the right side of the screen, there is a row of people standing. 
Have the people wearing blue clothes and some subtitles appeared together before?", "question_wo_referring_query": "Have the people wearing blue clothes and some subtitles appeared together before?", "candidates": ["ben stumbles upon a fishing hole then", "in the next room they find themselves in", "upon using it she discovers a magner", "as the frozen lake explodes and cracks", "concerned about what might happen next"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "LXhP-mec5xo_1", "video_path": "LXhP-mec5xo.mp4", "subtitle_path": "LXhP-mec5xo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 992.45, "view_count": 88337}, {"video_id": "LXhP-mec5xo", "question": "In a room, there is a long-haired woman wearing a long-sleeved dress standing inside. She is holding a gun and raising it. Behind her, there are various objects placed. In which subtitles did this gun appear together with?", "question_wo_referring_query": ", in which subtitles did this gun appear together with?", "candidates": ["be strangled by the game's master", "themes and people bet on who would win", "wall there he meets the games master", "and flees the room through a secret door", "thinking that he can finally leave the"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "LXhP-mec5xo_2", "video_path": "LXhP-mec5xo.mp4", "subtitle_path": "LXhP-mec5xo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 992.45, "view_count": 88337}, {"video_id": "z5b46_hwtkc", "question": "On the office desk are some office supplies, and there is a round stool under the desk. A man wearing a red checkered shirt with white stripes is kneeling in front of the desk. 
When the man in the red checkered shirt makes a phone call with his mobile phone, what change happens to him?", "question_wo_referring_query": "When the man wearing the red checkered shirt makes a phone call with his mobile phone, what change happens to him?", "candidates": ["He stood up", "Nothing changed", "He changed into a green shirt", "He jumped up", "He changed into a blue shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "z5b46_hwtkc_0", "video_path": "z5b46_hwtkc.mp4", "subtitle_path": "z5b46_hwtkc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.3, "view_count": 363805}, {"video_id": "z5b46_hwtkc", "question": "In a dimly lit room, a woman wearing short-sleeved clothing is standing next to a white door. Her gaze is fixed on the handle of the white door. What change did the handle undergo?", "question_wo_referring_query": "What change did the handle undergo?", "candidates": ["There was an additional handle", "The door handle turned slightly", "An extra hand appeared on the handle", "No changes occurred", "The handle disappeared"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "z5b46_hwtkc_1", "video_path": "z5b46_hwtkc.mp4", "subtitle_path": "z5b46_hwtkc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.3, "view_count": 363805}, {"video_id": "z5b46_hwtkc", "question": "On top of a desk there is a white book, on top of the book there is a black pen and a mobile phone, and beside them there is a hand of a person wearing short sleeves. 
What changed with the mobile phone on the desk?", "question_wo_referring_query": "What changed with the mobile phone on the desk?", "candidates": ["Disappeared", "Placed on a blue book", "Placed on a blue book", "Picked up by someone", "There is another identical mobile phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "z5b46_hwtkc_2", "video_path": "z5b46_hwtkc.mp4", "subtitle_path": "z5b46_hwtkc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 998.3, "view_count": 363805}, {"video_id": "WksCKMI8CWg", "question": "The screen shows a man and a woman. The screen is a bit blurry, but you can see that both of them have dark skin. They are in a dimly lit room. The woman is holding a knife, and the man is holding the woman's arm. What objects are present in the scene?", "question_wo_referring_query": ", what objects are present in the scene?", "candidates": ["short-sleeve t-shirt", "glass bottle", "shelf", "window", "curtain"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "WksCKMI8CWg_0", "video_path": "WksCKMI8CWg.mp4", "subtitle_path": "WksCKMI8CWg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 528, "duration": 9.98, "view_count": 178836}, {"video_id": "cj69uzQIfJA", "question": "The scene features a woman with braided hair, wearing a white shirt. In front of the mirror, there is another person dressed in a shirt, though their head is not visible. What object is present in the scene when the phrase 'When she orders him to hit Margaret, George bursts past Marvin and Elton to...' 
is mentioned?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["a yellow window", "a white-framed painting", "a green-framed painting", "a braided head of hair", "a black-framed painting"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "cj69uzQIfJA_0", "video_path": "cj69uzQIfJA.mp4", "subtitle_path": "cj69uzQIfJA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 219, "duration": 8.0, "view_count": 24970}, {"video_id": "PA4KyaxytsU", "question": "In the scene, a woman and a man are sitting cross-legged behind a short table. There are some plates, two glasses, and two cans on the table. Behind them is a green wall, and on the right side, there is a wooden bookshelf with books inside. There are also clothes hanging on the wall. The woman is wearing a white dress, and the man is wearing an olive green uniform. Both have black hair. What shape is the table where the two people are sitting?", "question_wo_referring_query": "What shape is the table where the two people are sitting?", "candidates": ["irregular", "square", "triangle", "circle", "rectangle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "PA4KyaxytsU_0", "video_path": "PA4KyaxytsU.mp4", "subtitle_path": "PA4KyaxytsU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 333, "duration": 11.0, "view_count": 3137352}, {"video_id": "pUmDyP20IBU", "question": "In the scene, there is a man wearing a green shirt against a black background. His hair is black, and his profile shows that he looks very frightened and is speaking. 
What happened the first time he appeared?", "question_wo_referring_query": "What happened the first time he appeared?", "candidates": ["He is shaking his head", "He is nodding", "He is holding his head with both hands", "He is waving his hands", "He is pointing forward with both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "pUmDyP20IBU_0", "video_path": "pUmDyP20IBU.mp4", "subtitle_path": "pUmDyP20IBU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 287, "duration": 11.0, "view_count": 408886}, {"video_id": "DzXEJjRqpys", "question": "In the scene, there is a woman and a man sitting inside a car with rain outside. The woman is wearing a blue coat, and the man in front of her is wearing glasses. What happens when it\u2019s mentioned 'she has to get rid of the cat because he doesn't want to miss the flight to Bulgaria. Thea steps...'?", "question_wo_referring_query": "What happens next?", "candidates": ["The woman and the man are shaking hands.", "The woman is looking forward and talking.", "The woman is talking to the man.", "The woman is fastening her seatbelt.", "The woman and the man are hugging."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "DzXEJjRqpys_0", "video_path": "DzXEJjRqpys.mp4", "subtitle_path": "DzXEJjRqpys_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 112, "duration": 14.02, "view_count": 808746}, {"video_id": "fGiKr-qjCOk", "question": "A curly-haired woman is walking in a room with green wallpaper, teal columns, and a staircase. She is wearing a black V-neck top. 
After she takes a step toward the front of the room and comes to a stop, what happens?", "question_wo_referring_query": "what happens after she stops?", "candidates": ["The woman sees a red door with a hole in it, and there is a flower vase inside the hole.", "The woman sees a red door with a hole in it, and there is a dog inside the hole.", "The woman sees a red door with a hole in it, and there is a rabbit inside the hole.", "The woman sees a red door with a hole in it, and there is a corpse inside the hole.", "The woman sees a red door with a hole in it, and there is a cat inside the hole."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "fGiKr-qjCOk_0", "video_path": "fGiKr-qjCOk.mp4", "subtitle_path": "fGiKr-qjCOk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 482, "duration": 12.01, "view_count": 309681}, {"video_id": "hSjIRgf10nw", "question": "The screen shows a man wearing a white T-shirt with an outer jacket, holding one side of his head with one hand. He is standing against a black background. Who is the first character to fall to the ground?", "question_wo_referring_query": "Who is the first character to fall to the ground?", "candidates": ["A man wearing a yellow T-shirt with an outer jacket", "A man with a bloodstain on the corner of his mouth", "A man wearing a black T-shirt with an outer jacket", "A man wearing a white T-shirt with an outer jacket", "A woman wearing earrings"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "hSjIRgf10nw_0", "video_path": "hSjIRgf10nw.mp4", "subtitle_path": "hSjIRgf10nw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 198, "duration": 8.0, "view_count": 474711}, {"video_id": "ybpvUr4__qc", "question": "A woman with black hair wearing a blue knitted top is speaking. In front of her, there is a little girl with black hair. Behind them is a lush tree. 
After mentioning 'Ah-young immediately breaks them up and tries to comfort Dan-Bi,' what happened?", "question_wo_referring_query": "What happened?", "candidates": ["The woman threw away her backpack.", "The woman smiled at the mirror.", "The woman yelled at the mirror.", "The woman talked to someone else in the mirror.", "The woman started to run."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "ybpvUr4__qc_0", "video_path": "ybpvUr4__qc.mp4", "subtitle_path": "ybpvUr4__qc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 75, "duration": 8.97, "view_count": 257568}, {"video_id": "ZrJNu0KDCIM", "question": "The screen shows a dark background with a woman in the background, indistinguishable in the night. When it says: 'a video and said she loves Wade and is grateful that Wade has come into her life. The fourth,' what object appears for the first time?", "question_wo_referring_query": "What object appears for the first time?", "candidates": ["A black car", "A red car", "A bottle with yellow transparent liquid", "A yellow car", "A white car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "ZrJNu0KDCIM_0", "video_path": "ZrJNu0KDCIM.mp4", "subtitle_path": "ZrJNu0KDCIM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 68, "duration": 11.97, "view_count": 40089}, {"video_id": "iHQNplhLuHc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a woman is shown talking to a man with gray hair dressed in a military green outfit. Then, there is a scene with a teacup and a metal can on a table. Finally, a woman is shown standing in front of a green door.", "First, there is a scene with a teacup and a metal can on a table. 
Then, a woman is shown talking to a man with gray hair dressed in a military green outfit. Finally, a woman is shown standing in front of a green door.", "First, a woman is shown standing in front of a green door. Then, a woman is shown talking to a man with gray hair dressed in a military green outfit. Finally, there is a scene with a teacup and a metal can on a table.", "First, there is a scene with a teacup and a metal can on a table. Then, a woman is shown standing in front of a green door. Finally, a woman is shown talking to a man with gray hair dressed in a military green outfit.", "First, a woman is shown talking to a man with gray hair dressed in a military green outfit. Then, a woman is shown standing in front of a green door. Finally, there is a scene with a teacup and a metal can on a table."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "iHQNplhLuHc_0", "video_path": "iHQNplhLuHc.mp4", "subtitle_path": "iHQNplhLuHc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 315, "duration": 8.0, "view_count": 1572}, {"video_id": "0O5UGUP_wEU", "question": "In the video, a man wearing a white shirt covers his mouth with a piece of paper. He has some beard stubble on his face, and there is a blurry background behind him. 
In which scenarios does this man appear?", "question_wo_referring_query": ", in which scenarios does this man appear?", "candidates": ["At a party with many people, with a yellow table full of food", "At a party with many people, with a black table full of food", "At a party with many people, with a red table full of food", "At a party with many people, with a blue table full of food", "At a party with many people, with a white table full of food"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "0O5UGUP_wEU_0", "video_path": "0O5UGUP_wEU.mp4", "subtitle_path": "0O5UGUP_wEU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 265, "duration": 13.01, "view_count": 801511}, {"video_id": "t7FFhg8yr5E", "question": "The screen shows a dim green background with a man whose hair is somewhat sparse and who has a rather haggard appearance. He's wearing a relatively thick coat, and the surroundings are grassy fields and woodland, with dark hills in the distance. In which subtitle does this man also appear?", "question_wo_referring_query": "In which subtitle does this man also appear?", "candidates": ["He discovers unusual changes in his hands. As", "lost and didn't know the way home, and if she was willing to help, he would let her and the", "FAR NORTH", "horror-stricken as he is, he forges home to", "She brings him to the hospital, where the doctors"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "t7FFhg8yr5E_0", "video_path": "t7FFhg8yr5E.mp4", "subtitle_path": "t7FFhg8yr5E_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 101, "duration": 12.0, "view_count": 12426}, {"video_id": "WliED1yvrLQ", "question": "In the video, a mutant creature with many legs is attacking a person wearing a white shirt. 
When a person wearing a short-sleeved jacket, jeans, and accessories appears, what kind of change does the mutant creature undergo?", "question_wo_referring_query": ", what kind of change does the mutant creature undergo?", "candidates": ["The mutant creature starts to shrink.", "The mutant creature begins to move sideways.", "The spider is knocked over.", "The creature attacks the person with the accessories.", "The mutant creature begins to move diagonally."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "WliED1yvrLQ_0", "video_path": "WliED1yvrLQ.mp4", "subtitle_path": "WliED1yvrLQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 378, "duration": 9.01, "view_count": 742961}, {"video_id": "Vn2LD_0uoMA", "question": "The screen shows a mountain cave surrounded by green vegetation, with rocks on the dirt road. A man in a jacket with black gloves and floral white pants is present, along with a person with white hair standing in front of a car. The car is stuck in the cave, and its trunk is open. The car is blue. What is the person in the jacket doing?", "question_wo_referring_query": "What is the person in the jacket doing?", "candidates": ["He is making a phone call to the person with white hair", "He is jogging away from the person with white hair", "He is dancing away from the person with white hair", "He is talking to the person with white hair", "He is walking away from the person with white hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "Vn2LD_0uoMA_0", "video_path": "Vn2LD_0uoMA.mp4", "subtitle_path": "Vn2LD_0uoMA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 321, "duration": 11.01, "view_count": 165157}, {"video_id": "nUooccjyNm4", "question": "In the scene, there are two men. 
One man with an astonished expression is sitting on a chair with a large backrest, and another man with a serious expression is in front of a mirror. There is also a book in the scene. The background is very dark, with only faint lighting. Both men are wearing black clothes. Who is the person touching the book in the scene?", "question_wo_referring_query": "Who is the person touching the book in the scene?", "candidates": ["A man with an astonished expression and yellow hair", "A man with an astonished expression and brown hair", "A man with an astonished expression and white hair", "A man with an astonished expression and black hair", "A man with a serious expression"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "nUooccjyNm4_0", "video_path": "nUooccjyNm4.mp4", "subtitle_path": "nUooccjyNm4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 447, "duration": 10.0, "view_count": 60735}, {"video_id": "v52FigdsvgQ", "question": "In the video, there is a man wearing a black jacket and a red inner shirt. He is sitting in front of a long table with one hand resting on the table behind him. There is a man in a police uniform beside him. 
When this man in the red inner shirt appears for the first time, what does he do?", "question_wo_referring_query": "When this man in the red inner shirt appears for the first time, what does he do?", "candidates": ["The man looks down at a camera being handed over his shoulder and catches it with his hand.", "The man looks down at a photo being handed over his shoulder and catches it with his hand.", "The man looks down at a vase being handed over his shoulder and catches it with his hand.", "The man looks down at a document being handed over his shoulder and catches it with his hand.", "The man looks down at an evidence bag being handed over his shoulder and catches it with his hand."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "v52FigdsvgQ_0", "video_path": "v52FigdsvgQ.mp4", "subtitle_path": "v52FigdsvgQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 90, "duration": 9.0, "view_count": 38009}, {"video_id": "uc2zDMN7bFY", "question": "The scene is set inside an ancient guesthouse where a man and a girl, both dressed in traditional attire, are present. The man's attire is well-fitted, his black hair is neatly combed and tied with a hairband. The girl's appearance is somewhat disheveled, her hair casually bound into a bun with a cloth wrapped around her head. A lot of plates with abundant food are stacked between them. When the line 'But in the middle of the journey, her horse and her supplies were robbed by someone.' 
is mentioned, what is the woman doing in the scene?", "question_wo_referring_query": "What is the woman doing in the scene?", "candidates": ["The woman is conversing with the man", "The woman is shaking hands with the man", "The woman is eating the food in her hand", "The woman is handing a backpack to the man", "The woman is receiving a backpack from the man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "uc2zDMN7bFY_0", "video_path": "uc2zDMN7bFY.mp4", "subtitle_path": "uc2zDMN7bFY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 132, "duration": 13.0, "view_count": 51508}, {"video_id": "MjD8kc49wOQ", "question": "In the scene, a man with black hair and wearing black clothes is talking to a gentleman in front of him who has white hair, a white beard, and is dressed in a black suit. What happens after they finish talking?", "question_wo_referring_query": "What happens?", "candidates": ["A pen points to a net woven with plastic, and the holes in the net are glowing.", "A pen points to a net woven with branches, and the holes in the net are glowing.", "A pen points to a net woven with hemp rope, and the holes in the net are glowing.", "A pen points to a net woven with rubber strips, and the holes in the net are glowing.", "A pen points to a net woven with wires, and the holes in the net are glowing."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "MjD8kc49wOQ_0", "video_path": "MjD8kc49wOQ.mp4", "subtitle_path": "MjD8kc49wOQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 181, "duration": 12.0, "view_count": 9486}, {"video_id": "Ga8YXLvZTsw", "question": "In the scene, there is a woman and a man sitting in a carriage. There is a window behind them. The woman is wearing a white headscarf with a black headscarf draped over the white one. The man is wearing a black hat and a black coat. 
Which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["Man with the black hat", "Man with the black hat", "Woman with the white headscarf", "Carriage", "Woman with the yellow headscarf"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Ga8YXLvZTsw_0", "video_path": "Ga8YXLvZTsw.mp4", "subtitle_path": "Ga8YXLvZTsw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 343, "duration": 9.01, "view_count": 25991}, {"video_id": "fH-eerD6-hE", "question": "In the blurry black screen, there is a man. What happens after the phrase 'Indio takes his life' is mentioned?", "question_wo_referring_query": "In the blurry black screen, there is a man. What happens after the phrase 'Indio takes his life' is mentioned?", "candidates": ["A pair of hands with brown sleeves is lighting a cigarette for a man with grayish white hair and sideburns.", "A pair of hands with yellow sleeves is lighting a cigarette for a man with grayish white hair and sideburns.", "A pair of hands with green sleeves is lighting a cigarette for a man with grayish white hair and sideburns.", "A pair of hands with pink sleeves is lighting a cigarette for a man with grayish white hair and sideburns.", "A pair of hands with black and white sleeves is lighting a cigarette for a man with grayish white hair and sideburns."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "fH-eerD6-hE_0", "video_path": "fH-eerD6-hE.mp4", "subtitle_path": "fH-eerD6-hE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 88, "duration": 8.0, "view_count": 161521}, {"video_id": "jqtYVWB9WyQ", "question": "The scene features a very dim background. A man with brown curly hair, dressed in a red and white uniform, is standing in front of the camera. 
There are also many other people in red and white uniforms around him, some behind and others beside him. In the distance, three faint torch lights can be seen. When he hears Thomas will march against Hotspur the next day, he gets upset. Who or what appears for the first time?", "question_wo_referring_query": "Who or what appears for the first time?", "candidates": ["Torchlight", "A group of people wearing blue pointed hats", "A group of people wearing black pointed hats", "A group of people wearing white pointed hats", "A group of people wearing brown pointed hats"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "jqtYVWB9WyQ_0", "video_path": "jqtYVWB9WyQ.mp4", "subtitle_path": "jqtYVWB9WyQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 77, "duration": 8.01, "view_count": 898905}, {"video_id": "POasWxrssEQ", "question": "In a dark background, a black robot with red eyes and a white robot with red eyes are fighting. In which other scene does the black robot appear?", "question_wo_referring_query": "In which other scene does the black robot appear?", "candidates": ["In a gray and white wasteland", "In a yellow wasteland", "On a black car", "On a grass field", "In a river"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "POasWxrssEQ_0", "video_path": "POasWxrssEQ.mp4", "subtitle_path": "POasWxrssEQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 484, "duration": 12.01, "view_count": 1110749}, {"video_id": "vkWVMVIbzqE", "question": "The screen shows a dark-skinned hand holding a damaged family photo. 
In which subtitle does this photo also appear simultaneously?", "question_wo_referring_query": "In which subtitle does this photo also appear simultaneously?", "candidates": ["do", "childhood because even after so much", "time he still misses having someone to", "so", "hello"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "vkWVMVIbzqE_0", "video_path": "vkWVMVIbzqE.mp4", "subtitle_path": "vkWVMVIbzqE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 91, "duration": 10.98, "view_count": 1048438}, {"video_id": "JuY8HEa3cJ0", "question": "The screen shows two men talking. They are surrounded by black bars and some strings of small lights with a faint glow. It is nighttime. One of the men is holding a cup. When the white text 'STORY RECAPPED' appears on the screen, what kind of change occurs to the man holding the cup?", "question_wo_referring_query": "What kind of change occurs to the man holding the cup?", "candidates": ["The man is holding a bear", "The man is holding a pillow", "The man is holding a toy", "The man is holding a child", "The man is holding a cat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "JuY8HEa3cJ0_0", "video_path": "JuY8HEa3cJ0.mp4", "subtitle_path": "JuY8HEa3cJ0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 106, "duration": 13.01, "view_count": 134847}, {"video_id": "Xrma_6PDlr4", "question": "The scene is inside an abandoned building where it is very dark and chaotic, the lighting is poor. A man dressed in black is lying on the ground covered in blood, and a woman in a pink dress with golden curly hair is kneeling beside him. 
What is she doing at this moment?", "question_wo_referring_query": "What is she doing at this moment?", "candidates": ["She is trying to stop the man's bleeding", "She is holding a long black rectangular object in one hand", "She is cradling the man's face", "She is treating the man", "She is talking to the man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "Xrma_6PDlr4_0", "video_path": "Xrma_6PDlr4.mp4", "subtitle_path": "Xrma_6PDlr4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 854, "duration": 12.01, "view_count": 2505975}, {"video_id": "py7wOsXCJmc", "question": "The scene is on a road with some cars parked on the side. In the distance, there are yellow, white, and gray buildings. There are also white streetlights on the road. A child is sitting on the ground, and there are three people: a man in black clothing, a man in a light olive coat, and a woman in a pink top and jeans. What objects are not present in the scene?", "question_wo_referring_query": "What objects are not present in the scene?", "candidates": ["White shoes", "Black window", "White railing", "Orange rooftop", "Blue car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "py7wOsXCJmc_0", "video_path": "py7wOsXCJmc.mp4", "subtitle_path": "py7wOsXCJmc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 160, "duration": 13.0, "view_count": 162524}, {"video_id": "yRsIdR6sxNw", "question": "A cat is perched on an object on the screen. It is flying in the midst of fiery sparks, wearing black armor. There is also a weapon resembling a green bat behind its tail. 
What color is this cat?", "question_wo_referring_query": "What color is this cat?", "candidates": ["Green", "Olive", "Pink", "Yellow", "Blue"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "yRsIdR6sxNw_0", "video_path": "yRsIdR6sxNw.mp4", "subtitle_path": "yRsIdR6sxNw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 968, "duration": 9.0, "view_count": 236023}, {"video_id": "4DX2ueH_tdU", "question": "On the screen, there is a man in a white shirt and a woman in a white dress with their fingers interlocked. The woman's other hand is resting on the man's shoulder, and they are in a building illuminated with green light, seemingly dancing. What type of hat is the woman wearing?", "question_wo_referring_query": "What type of hat is the woman wearing?", "candidates": ["Nurse's hat", "Cowboy hat", "Chef's hat", "Sun hat", "Doctor's hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "4DX2ueH_tdU_0", "video_path": "4DX2ueH_tdU.mp4", "subtitle_path": "4DX2ueH_tdU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 93, "duration": 12.0, "view_count": 170449}, {"video_id": "C7Rjcop6cFU", "question": "In the video, there is a white wall, a man wearing a black and red plaid shirt, another man in a black coat facing the mirror, and a woman in a white shirt whose profile is visible in the mirror talking to the man facing the mirror. 
Which character is standing next to the white tiled wall?", "question_wo_referring_query": "Which character is standing next to the white tiled wall?", "candidates": ["A man wearing a black and red plaid shirt and an orange jacket", "A man wearing a black and red plaid shirt and a yellow jacket", "A man wearing a black and red plaid shirt and a black jacket", "A man wearing a black and red plaid shirt and an olive jacket", "A man wearing a black and red plaid shirt and a green jacket"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "C7Rjcop6cFU_0", "video_path": "C7Rjcop6cFU.mp4", "subtitle_path": "C7Rjcop6cFU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 384, "duration": 13.0, "view_count": 277685}, {"video_id": "ljEOyZg4kQg", "question": "In the video, there are two characters with pointed heads and no hair talking. One of them is a woman wearing a denim jacket, standing by a bookshelf. Another pointed-headed character, wearing a white shirt, is facing a camera. 
What did the pointed-headed woman in the denim jacket do the first time she appeared?", "question_wo_referring_query": "What did the pointed-headed woman in the denim jacket do the first time she appeared?", "candidates": ["Running", "Carrying a bag of rice", "Walked towards the bookshelf", "Picked up a book", "Holding a bag of snacks"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "ljEOyZg4kQg_0", "video_path": "ljEOyZg4kQg.mp4", "subtitle_path": "ljEOyZg4kQg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 502, "duration": 12.01, "view_count": 617457}, {"video_id": "HOePjmPqGGk", "question": "In a yellow-gray scene with many falling leaves and an inclined slope covered with some weeds, what happens when the phrase 'of hope an emotional reunion ensues as' is mentioned?", "question_wo_referring_query": "What happens?", "candidates": ["Some of the falling leaves land on the ground while others are still falling.", "The falling leaves are drifting in the air.", "A person is picking up the fallen leaves.", "A person is picking up the fallen leaves.", "A person is running."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "HOePjmPqGGk_0", "video_path": "HOePjmPqGGk.mp4", "subtitle_path": "HOePjmPqGGk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 965, "duration": 10.0, "view_count": 352260}, {"video_id": "_95ecNvq5w4", "question": "A man wearing a white shirt and black armor is in the water, blowing some bubbles from his mouth. He has a goatee, and the camera shows a side view of his face. 
What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A mermaid swims up to the man with a whale", "A mermaid swims up to the man with shells", "A mermaid swims up to the man with crabs", "A mermaid swims up to the man with a sea turtle", "A mermaid swims up to the man with a school of fish"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "_95ecNvq5w4_0", "video_path": "_95ecNvq5w4.mp4", "subtitle_path": "_95ecNvq5w4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 737, "duration": 12.0, "view_count": 17999}, {"video_id": "2goiQSC6Zyc", "question": "Who is the first character to appear on the screen?", "question_wo_referring_query": "Who is the first character to appear on the screen?", "candidates": ["A man with very short hair wearing a black suit and a white bow tie, and a man with white hair wearing a black suit and a black bow tie", "A man with very short hair wearing a black suit and a white bow tie, and a man with white hair wearing a black suit and a black bow tie", "A man with very short hair wearing a black suit and a red bow tie, and a woman in red clothes lying on a hospital bed", "A man with very short hair wearing a black suit and a red bow tie, and a woman in white clothes lying on a hospital bed", "A man with very short hair wearing a black suit and a red bow tie, and a man with white hair wearing a black suit and a black bow tie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "2goiQSC6Zyc_0", "video_path": "2goiQSC6Zyc.mp4", "subtitle_path": "2goiQSC6Zyc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 257, "duration": 12.0, "view_count": 9653}, {"video_id": "W8EOivi2Bng", "question": "The screen shows the upper body of a character, holding a bunch of white wildflowers with green leaves. 
They are wearing an olive-colored striped shirt, and there are green plants illuminated by sunlight in the background. What happens after the phrase 'the guest is downcast as he crushes his' is mentioned?", "question_wo_referring_query": "What happens after?", "candidates": ["A man's face appears, and the man has scars on his face.", "A woman's face appears, and the woman has an olive-colored complexion.", "A small dog appears.", "A man's face appears, and the man has a bushy beard.", "A man's face appears, and the man has an olive-colored eyebrow."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "W8EOivi2Bng_0", "video_path": "W8EOivi2Bng.mp4", "subtitle_path": "W8EOivi2Bng_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 229, "duration": 12.01, "view_count": 113305}, {"video_id": "D88IVE3Dtyw", "question": "The screen shows a rocky landscape, with a yellow light glowing in the lower left corner. A person wearing white clothes appears in a flash on the right side, where another person is crouching on the rock, though their face is unclear. 
When 'He then goes for it, but as he accidentally presses the' is mentioned, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A person covered in yellow flames", "A person emitting white light", "A person covered in red flames", "A person covered in blue flames", "A person emitting green light"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "D88IVE3Dtyw_0", "video_path": "D88IVE3Dtyw.mp4", "subtitle_path": "D88IVE3Dtyw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 16, "duration": 10.97, "view_count": 161360}, {"video_id": "Y0IaijKNGX8", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a black man in black clothes is shown turning his head. Then, a group of people are shown standing in a cave with a circular top, with two people facing each other in front of the group. Finally, a group of people in white clothes are shown in a wasteland, surrounded by red sheds.", "First, a group of people in white clothes are shown in a wasteland, surrounded by red sheds. Then, a black man in black clothes is shown turning his head. Finally, a group of people are shown standing in a cave with a circular top, with two people facing each other in front of the group.", "First, a black man in black clothes is shown turning his head. Then, a group of people in white clothes are shown in a wasteland, surrounded by red sheds. Finally, a group of people are shown standing in a cave with a circular top, with two people facing each other in front of the group.", "First, a group of people in white clothes are shown in a wasteland, surrounded by red sheds. Then, a group of people are shown standing in a cave with a circular top, with two people facing each other in front of the group. 
Finally, a black man in black clothes is shown turning his head.", "First, a group of people are shown standing in a cave with a circular top, with two people facing each other in front of the group. Then, a group of people in white clothes are shown in a wasteland, surrounded by red sheds. Finally, a black man in black clothes is shown turning his head."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "Y0IaijKNGX8_0", "video_path": "Y0IaijKNGX8.mp4", "subtitle_path": "Y0IaijKNGX8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1266, "duration": 12.98, "view_count": 16361}, {"video_id": "zsagdM6vYJg", "question": "On the screen, there is a blonde woman wearing a black coat with a white scarf at the collar. She is talking to a man who is only visible as a small, blurry shadow in front of the camera and whose face is not shown. In what other scene does this woman appear?", "question_wo_referring_query": "In what other scene does this woman appear?", "candidates": ["In a store selling clocks", "In a room that looks very cozy", "In a store selling antiques", "In a restaurant with many white candles", "In a square at night"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "zsagdM6vYJg_0", "video_path": "zsagdM6vYJg.mp4", "subtitle_path": "zsagdM6vYJg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 637, "duration": 10.0, "view_count": 15809}, {"video_id": "Qf8vJieCRSU", "question": "A man with short brown hair and a white shirt is sitting in front of a white background wall. Next to him, there is a label with the white letters 'BENAZZA'. His smile reveals a hint of mockery. 
What subtitles appeared at the same time as this man?", "question_wo_referring_query": "What subtitles appeared at the same time as this man?", "candidates": ["a B to graduate early", "The instructor expressed that his work deserved a grade of", "but lacked personal interpretation.", "a notorious German dictator as the greatest man in history", "C, despite knowing that he needed"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "Qf8vJieCRSU_0", "video_path": "Qf8vJieCRSU.mp4", "subtitle_path": "Qf8vJieCRSU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 642, "duration": 14.02, "view_count": 45652}, {"video_id": "mNiBWqY2QDA", "question": "In a white room, with a police officer in the background, the screen is somewhat blurry, a woman with dyed blonde hair wearing a pink dress is hugging a black-skinned police officer wearing a hat. When the camera angle shows the side view of the two people, what kind of change occurs to the woman in the pink dress?", "question_wo_referring_query": "What kind of change occurs to the woman in the pink dress?", "candidates": ["She jumps up happily", "She gets a surprised, wide mouth", "She shakes hands with the police officer", "She finds a seat and sits down", "She raises both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "mNiBWqY2QDA_0", "video_path": "mNiBWqY2QDA.mp4", "subtitle_path": "mNiBWqY2QDA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 908, "duration": 11.97, "view_count": 4714493}, {"video_id": "ehSr-HIKVMw", "question": "The screen shows a lady with curled hair wearing a uniform. She has a name tag on her chest with a red and a green triangle on it. She is lying on a white chair, surrounded by a white object that emits blue light. 
What is this lady doing?", "question_wo_referring_query": "What is this lady doing?", "candidates": ["She is looking upwards", "She is looking downwards", "She is pointing at something and talking", "She is hugging her arms", "She is holding a cup of water"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "ehSr-HIKVMw_0", "video_path": "ehSr-HIKVMw.mp4", "subtitle_path": "ehSr-HIKVMw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 464, "duration": 7.98, "view_count": 3871516}, {"video_id": "f_YU6dhXDP0", "question": "A woman with black hair is lying on a table in the scene. She is holding a black microphone while talking. She is wearing black and white striped clothes. The background of the scene is a dark room. In front of her is a carved mirror. What is the object that does not exist in the scene?", "question_wo_referring_query": "What is the object that does not exist in the scene?", "candidates": ["Red bottle", "White plate", "Black plate", "Black chair", "Red pillow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "f_YU6dhXDP0_0", "video_path": "f_YU6dhXDP0.mp4", "subtitle_path": "f_YU6dhXDP0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 752, "duration": 13.01, "view_count": 1132108}, {"video_id": "yqueYON_Q90", "question": "The screen shows a man with slightly long black hair, wearing a black T-shirt, earrings, and some accessories. The background behind him is quite blurry, but you can see some greenery. What object is present on the screen?", "question_wo_referring_query": "The screen shows a man with slightly long black hair, wearing a black T-shirt, earrings, and some accessories. The background behind him is quite blurry, but you can see some greenery. 
What object is present on the screen?", "candidates": ["Mask", "Ring", "Hat", "Earring", "Building"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "yqueYON_Q90_0", "video_path": "yqueYON_Q90.mp4", "subtitle_path": "yqueYON_Q90_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1198, "duration": 14.0, "view_count": 651071}, {"video_id": "FOGgMULOwcw", "question": "The screen shows a man wearing black clothes. He is wearing a hat with a red stripe, has a goatee, and is talking to a blurry figure in front of a mirror. What type of hat is he wearing?", "question_wo_referring_query": "What type of hat is he wearing?", "candidates": ["A beanie", "A chef's hat", "A nurse's hat", "A hat with a brim", "A straw hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "FOGgMULOwcw_0", "video_path": "FOGgMULOwcw.mp4", "subtitle_path": "FOGgMULOwcw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 652, "duration": 10.01, "view_count": 21326}, {"video_id": "HCDrVeTj4nI", "question": "In the scene, three men are standing in front of a broken wooden house. They are wearing brown shoes. There are some barrels, wooden planks, and straw on both sides of the house. There are very tall trees outside the house. 
When mentioning 'three of them Furious because the Roman,' what material are the clothes they are wearing made of?", "question_wo_referring_query": "What material are the clothes worn by the three people made of?", "candidates": ["Flannel clothes", "Lamb wool clothes", "Silk clothes", "Woolen clothes", "Leather and cloth clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "HCDrVeTj4nI_0", "video_path": "HCDrVeTj4nI.mp4", "subtitle_path": "HCDrVeTj4nI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 828, "duration": 9.0, "view_count": 107893}, {"video_id": "MmB92mK2YDQ", "question": "In the video, a woman in a leather jacket is facing a man in short sleeves holding a cane. The man is sitting on a chair with a wooden window frame behind them and a sofa next to them. Who is supporting the chair in the scene?", "question_wo_referring_query": "Who is supporting the chair in the scene?", "candidates": ["The woman in the olive leather jacket", "The woman in the black leather jacket", "The woman in the red leather jacket", "The man in short sleeves", "The woman in the yellow leather jacket"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "MmB92mK2YDQ_0", "video_path": "MmB92mK2YDQ.mp4", "subtitle_path": "MmB92mK2YDQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 379, "duration": 9.01, "view_count": 5881}, {"video_id": "VQHumVT3nXg", "question": "In the scene, there is a gray toy with slightly pointed ears and large glasses. Its eyes are half white and half black. The toy's hands and feet are white. It is lying on the floor, with a chair leg visible nearby. 
What happened the first time this toy appeared?", "question_wo_referring_query": "What happened the first time this toy appeared?", "candidates": ["It was picked up by two hands", "Its ear was pinched by a hand", "It was kicked away by a foot", "It was picked up by a hand", "It was swept into a trash can"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "VQHumVT3nXg_0", "video_path": "VQHumVT3nXg.mp4", "subtitle_path": "VQHumVT3nXg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 456, "duration": 11.98, "view_count": 23570}, {"video_id": "rCutfFmTw-M", "question": "A woman in black clothing is standing inside a dim white building. In the distance in front of her, there is a white door that is half-open. Another woman in black clothing is standing there. When Marta returns home and brings up 'from Boucher,' what happens?", "question_wo_referring_query": "what happens?", "candidates": ["The woman is adjusting her clothes", "The woman is touching her hair", "The woman's hand is on the door lock", "The woman's hand is on the railing", "The woman raises both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "rCutfFmTw-M_0", "video_path": "rCutfFmTw-M.mp4", "subtitle_path": "rCutfFmTw-M_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 567, "duration": 14.02, "view_count": 7115443}, {"video_id": "JgcvvUPj8MU", "question": "In front of a dilapidated building in the frame, a man is holding up a white two-wheeled cart. He is wearing a leather jacket and is talking to another man with some gray hair, who is dressed in a black coat. 
After the man in the black coat starts walking forward, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["The man in the black coat walking raises his head", "The man in the black coat walking picks up a backpack", "The man holding the cart starts to ride it", "The man in the black coat walking starts running", "The man holding the cart puts on a helmet"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "JgcvvUPj8MU_0", "video_path": "JgcvvUPj8MU.mp4", "subtitle_path": "JgcvvUPj8MU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 102, "duration": 14.02, "view_count": 1447996}, {"video_id": "g0-77TIpESw", "question": "Who is the first person to open the door in the scene?", "question_wo_referring_query": "Who is the first person to open the door in the scene?", "candidates": ["The person wearing earphones and green clothes", "The person wearing green clothes and jeans", "The person wearing earphones and a denim jacket", "The person wearing green clothes and a denim jacket", "The person wearing earphones and yellow clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "g0-77TIpESw_0", "video_path": "g0-77TIpESw.mp4", "subtitle_path": "g0-77TIpESw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 49, "duration": 12.01, "view_count": 27816}, {"video_id": "Z9LSJ7AA16k", "question": "The screen shows a black space with stars, with a circular space station that has a ring shape, and in front, there is a round shuttle. 
What happens after Leo mentions the pod and ignores his captain's protests, reaching the storm in the space warps?", "question_wo_referring_query": "What happens?", "candidates": ["A female black-skinned astronaut is talking in a white room", "A female yellow-skinned astronaut is talking in a white room", "A male black-skinned astronaut is talking in a white room", "A female white-skinned astronaut is talking in a white room", "A male white-skinned astronaut is talking in a white room"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "Z9LSJ7AA16k_0", "video_path": "Z9LSJ7AA16k.mp4", "subtitle_path": "Z9LSJ7AA16k_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 107, "duration": 11.01, "view_count": 710226}, {"video_id": "VLNkDxUNjE0", "question": "In the screen, there's a dark-skinned man with a beard. He's looking at the person in front of him. Behind him are some messy office desks and chairs. He is sitting in front of an office desk and has an earring. Who is the first person to appear when mentioning 'Department he returns to his desk only'?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A dark-skinned person in a police uniform without a hat", "A dark-skinned person in a police uniform with a hat", "A person in a suit with a purple tie", "A person in a suit with a blue tie", "A person in a suit with a red tie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "VLNkDxUNjE0_0", "video_path": "VLNkDxUNjE0.mp4", "subtitle_path": "VLNkDxUNjE0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 154, "duration": 8.0, "view_count": 107811}, {"video_id": "H60Y_naBie0", "question": "In the scene, two long-haired women are standing beside the man in the middle. 
The woman on the left is talking to the camera, while the woman on the right is looking at the camera, running her hand through her hair, and gazing at the woman in white on the left. In which of the following scenes does the man in the middle appear?", "question_wo_referring_query": "In which of the following scenes does the man in the middle appear?", "candidates": ["A man is talking to the camera on the right side of the screen, with a blurry plant lightly swaying in the bottom right.", "In the background, there's a blue billboard on the right side of the screen. In front of a glass door is a woman with tied hair and a shoulder-baring outfit. There is a man standing in front of the blue billboard on the right.", "A person wearing clothes is looking up to the sky from a tree.", "In a blurry background, a man wearing a gray shirt is standing and blinking his eyes.", "In a blurry background, a man wearing a gray shirt is facing the camera, with two women standing with their backs to the camera. One of the women is holding a tray."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "H60Y_naBie0_0", "video_path": "H60Y_naBie0.mp4", "subtitle_path": "H60Y_naBie0_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 533, "duration": 13.0, "view_count": 87706}, {"video_id": "V_H6W2R88Z0", "question": "On the turbulent and gloomy sea, there is a black and white bird-shaped boat. 
When the subtitle 'predominant ways that they will be' appears, what change happens to this boat?", "question_wo_referring_query": "What change happens to this boat?", "candidates": ["Changes from a gloomy water surface to a bright blue water surface", "Changes from an intact boat to a dismantled and damaged boat", "Changes from black and white to black", "Flies from the water surface to the air", "Sinks from the surface to underwater"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "V_H6W2R88Z0_0", "video_path": "V_H6W2R88Z0.mp4", "subtitle_path": "V_H6W2R88Z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.55, "view_count": 266657}, {"video_id": "V_H6W2R88Z0", "question": "The dark green meteorite with a pitted and uneven surface floating in black space\u2014what transformation does it undergo when the caption 'probably been at least one kilometer' appears?", "question_wo_referring_query": "What transformation does the meteorite undergo?", "candidates": ["The meteorite collides with the Moon.", "The meteorite is hit by a rocket.", "The meteorite disintegrates in space.", "The meteorite collides with a spaceship.", "The meteorite collides with Earth, producing a huge flash of light."], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "V_H6W2R88Z0_1", "video_path": "V_H6W2R88Z0.mp4", "subtitle_path": "V_H6W2R88Z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.55, "view_count": 266657}, {"video_id": "V_H6W2R88Z0", "question": "At the top of the screen is a topographic relief map of the Earth, and at the bottom is a white-background blue line graph with a red arrow pointing to it. 
What changes occur in the line graph when the subtitle 'unsurprisingly perth's topography acted' appears?", "question_wo_referring_query": "What changes occur in the line graph?", "candidates": ["The blue color on the line graph changes to pink", "The line graph rises", "The line graph flattens", "The blue color on the line graph changes to yellow", "The red arrow on the line graph disappears"], "topic_category": "KG-Knowledge-Geography", "question_category": "TAA", "level": "L2-Relation", "id": "V_H6W2R88Z0_2", "video_path": "V_H6W2R88Z0.mp4", "subtitle_path": "V_H6W2R88Z0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 916.55, "view_count": 266657}, {"video_id": "eXEzHeMEEHs", "question": "At the beginning of the video, sitting in a news studio, there's a screen on the right displaying a photo of a man wearing a gray suit with a blue shirt, a white earphone in one ear, and holding a news script. When this man starts a video call with a curly-haired woman wearing a dark purple top and sitting in front of a bookshelf, what change happens to this man?", "question_wo_referring_query": "At the beginning of the video, sitting in a news studio, there's a screen on the right displaying a photo of a man wearing a gray suit with a blue shirt, a white earphone in one ear, and holding a news script. 
When this man starts a video call with a curly-haired woman wearing a dark purple top and sitting in front of a bookshelf, what change happens to this man?", "candidates": ["The blue shirt changes to a white shirt", "The white earphone changes to a black earphone", "The white earphone changes to no earphone", "The gray suit changes to a black suit", "The gray suit changes to a white suit"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "eXEzHeMEEHs_0", "video_path": "eXEzHeMEEHs.mp4", "subtitle_path": "eXEzHeMEEHs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1311.68, "view_count": 91217}, {"video_id": "eXEzHeMEEHs", "question": "A woman with short curly hair, wearing a deep purple top, is sitting in front of a circular floral-patterned wall with wooden shelves. Below the screen, there is a red and white background with a white English news tag. After the news tag disappears, what change happens to this woman?", "question_wo_referring_query": "What change happens to this woman?", "candidates": ["Her expression changes from blank to smiling", "Her deep purple top changes to a blue top", "Her curly hair changes to straight hair", "Her short hair grows long", "Her deep purple top changes to a black top"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "eXEzHeMEEHs_1", "video_path": "eXEzHeMEEHs.mp4", "subtitle_path": "eXEzHeMEEHs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1311.68, "view_count": 91217}, {"video_id": "eXEzHeMEEHs", "question": "The large screen in the news broadcast room displays a photo of a man. There is a man sitting in front of the screen wearing a gray suit jacket over a blue shirt with a white earpiece in one ear, holding a news script. Below the screen is a red and white background with a white English news tag. 
After the news tag under the man sitting in the news broadcast room disappears, what changes happen on the screen behind him?", "question_wo_referring_query": "What changes happen on the screen behind him?", "candidates": ["The large screen turns completely white", "The photo of a man on the screen changes to two photos of men", "The large screen changes from the man's photo to a news headline", "The large screen changes from one man's photo to another man's photo", "The large screen turns black"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "eXEzHeMEEHs_2", "video_path": "eXEzHeMEEHs.mp4", "subtitle_path": "eXEzHeMEEHs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1311.68, "view_count": 91217}, {"video_id": "5F6gTQtlgaI", "question": "The background is an image of Earth in space. Which subtitles have appeared at the same time as the man, who is wearing a green jumpsuit and sitting in front of the background wall, speaking to the camera?", "question_wo_referring_query": "Which subtitles have appeared at the same time?", "candidates": ["\"electromagnetic waves interact with\"", "\"he started by misunderstanding\"", "\"everything about 5g but he didn't stop\"", "\"regarding anything he's talking about\"", "\"you won't believe the things he says\""], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "5F6gTQtlgaI_0", "video_path": "5F6gTQtlgaI.mp4", "subtitle_path": "5F6gTQtlgaI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1121.6, "view_count": 86315}, {"video_id": "5F6gTQtlgaI", "question": "The background is a gray wall with some posters on it. In front of a small window with yellow patterned curtains, there's a globe. A man wearing a black shirt and glasses is sitting in front of the wall, speaking to the camera. 
Which subtitles appear simultaneously during his speech?", "question_wo_referring_query": "Which subtitles appear simultaneously?", "candidates": ["\u201cthat mark has no experience in radio\u201d", "\u201cyes i know it's sunday as i said mark\u201d", "\u201cregarding anything he's talking about\u201d", "\u201ccountless videos of misinformation about\u201d", "\u201cstill is on a crusade of sorts putting\u201d"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "5F6gTQtlgaI_1", "video_path": "5F6gTQtlgaI.mp4", "subtitle_path": "5F6gTQtlgaI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1121.6, "view_count": 86315}, {"video_id": "5F6gTQtlgaI", "question": "The background is a white wall with a white flag featuring red and yellow symbols and black English text. On the right side, there is a wooden shelf holding various helmets. A man wearing a white shirt and black sunglasses with thinning hair has appeared simultaneously with which subtitles?", "question_wo_referring_query": "Which subtitles has he appeared with at the same time?", "candidates": ["\"countless videos of misinformation about\"", "\"and god's law only applies to man and\"", "\u201cthat mark has no experience in radio\u201d", "\"yes i know it's sunday as i said mark\"", "\"regarding anything he's talking about\""], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "5F6gTQtlgaI_2", "video_path": "5F6gTQtlgaI.mp4", "subtitle_path": "5F6gTQtlgaI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1121.6, "view_count": 86315}, {"video_id": "UfkXwNxB5q0", "question": "On the grey and black soil, a man wearing an olive-green upper garment is squatting on the ground, using water to wash two plucked chickens placed on a round wooden stump. 
In which of the following scenarios do these two chickens also appear?", "question_wo_referring_query": ", In which of the following scenarios do these two chickens also appear?", "candidates": ["In a pile of burning logs below a stove", "Next to a black and white puppy lying on the grass", "On a cutting board with two cloves of garlic", "In a scene where a man swings an ax at a pile of wooden logs", "In an iron pot with boiling water emitting steam"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "UfkXwNxB5q0_0", "video_path": "UfkXwNxB5q0.mp4", "subtitle_path": "UfkXwNxB5q0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.24, "view_count": 2179666}, {"video_id": "UfkXwNxB5q0", "question": "In the background, there is a row of yellow, blue, and green bee boxes and a withered tree. A black and white dog is lying on the grass in front of the bee boxes. In which other scenes does this dog appear?", "question_wo_referring_query": "In which other scenes does this dog appear?", "candidates": ["On the round pile of wood behind a man who is cooking", "By a mountain spring", "On the open ground in front of a rock with flowing water", "In the forest", "By the stream"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "UfkXwNxB5q0_1", "video_path": "UfkXwNxB5q0.mp4", "subtitle_path": "UfkXwNxB5q0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.24, "view_count": 2179666}, {"video_id": "UfkXwNxB5q0", "question": "A man wearing an army green coat and an army green hat, holding a yellow woven basket, kneeling on the ground covered with dry leaves in a withered forest picking wild fruits, has he appeared in any other scenes?", "question_wo_referring_query": "Has he appeared in any other scenes?", "candidates": ["In a wooden house's living room", "In a wooden house's bedroom", "On 
the edge of a cliff", "In a green forest", "On a wooden fence in a grassy area surrounded by a bunch of children"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "UfkXwNxB5q0_2", "video_path": "UfkXwNxB5q0.mp4", "subtitle_path": "UfkXwNxB5q0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1204.24, "view_count": 2179666}, {"video_id": "gY0yACShcnU", "question": "Which of the following sequences of scenes in the video are correct?", "question_wo_referring_query": "Which of the following sequences of scenes in the video are correct?", "candidates": ["First, a man wearing a black short-sleeved shirt and gray pants hugs a woman in a red shirt and black pants in front of a wardrobe in a bedroom. Then, a man in a hooded jacket is running beside a stream among rocks illuminated by sunlight. Finally, a man in a black shirt is sitting in the bedroom, holding a black phone with both hands, talking into it while looking at a mirror.", "First, a man in a black shirt is sitting in the bedroom, holding a black phone with both hands, talking into it while looking at a mirror. Then, a man wearing a black short-sleeved shirt and gray pants hugs a woman in a red shirt and black pants in front of a wardrobe in a bedroom. Lastly, a man in a hooded jacket is running beside a stream among rocks illuminated by sunlight.", "First, a man in a hooded jacket is running beside a stream among rocks illuminated by sunlight. Then, a man in a black shirt is sitting in the bedroom, holding a black phone with both hands, talking into it while looking at a mirror. Finally, a man wearing a black short-sleeved shirt and gray pants hugs a woman in a red shirt and black pants in front of a wardrobe in a bedroom.", "First, a man in a black shirt is sitting in the bedroom, holding a black phone with both hands, talking into it while looking at a mirror. 
Then, a man in a hooded jacket is running beside a stream among rocks illuminated by sunlight. Finally, a man wearing a black short-sleeved shirt and gray pants hugs a woman in a red shirt and black pants in front of a wardrobe in a bedroom.", "First, a man wearing a black short-sleeved shirt and gray pants hugs a woman in a red shirt and black pants in front of a wardrobe in a bedroom. Then, a man in a black shirt is sitting in the bedroom, holding a black phone with both hands, talking into it while looking at a mirror. Lastly, a man in a hooded jacket is running beside a stream among rocks illuminated by sunlight."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "gY0yACShcnU_0", "video_path": "gY0yACShcnU.mp4", "subtitle_path": "gY0yACShcnU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1423.3, "view_count": 81768}, {"video_id": "gY0yACShcnU", "question": "Which of the following sequences of scenes in the video are correct?", "question_wo_referring_query": "Which of the following sequences of scenes in the video are correct?", "candidates": ["First, there is a man in a black hooded coat, and another man in a dark green top with dark hair, sitting at a wooden desk in a bedroom with purple walls, having a conversation. Then, there is the man in the dark green top with dark hair, sitting at the wooden desk, looking at a computer screen. Finally, there is a man wearing a dark top, light pants, a yellow hat, and a white mask, carrying a yellow bag and walking on a platform beside the tracks.", "First, there is a man wearing a dark top, light pants, a yellow hat, and a white mask, carrying a yellow bag and walking on a platform beside the tracks. Then, there is a man in a dark green top with dark hair, sitting at a wooden desk, looking at a computer screen. 
Finally, there is a man in a black hooded coat and the man in the dark green top with dark hair, sitting at a wooden desk in a bedroom with purple walls, having a conversation.", "First, there is a man in a black hooded coat, and another man in a dark green top with dark hair, sitting at a wooden desk in a bedroom with purple walls, having a conversation. Then, there is a man wearing a dark top, light pants, a yellow hat, and a white mask, carrying a yellow bag and walking on a platform beside the tracks. Finally, the man in the dark green top with dark hair, sitting at the wooden desk, looking at a computer screen.", "First, there is a man wearing a dark top, light pants, a yellow hat, and a white mask, carrying a yellow bag and walking on a platform beside the tracks. Then, there is a man in a black hooded coat, and another man in a dark green top with dark hair, sitting at a wooden desk in a bedroom with purple walls, having a conversation. Finally, the man in the dark green top with dark hair, sitting at the wooden desk, looking at a computer screen.", "First, there is the man in the dark green top with dark hair, sitting at the wooden desk, looking at a computer screen. Then, there is a man in a black hooded coat, and another man in a dark green top with dark hair, sitting at a wooden desk in a bedroom with purple walls, having a conversation. 
Finally, there is a man wearing a dark top, light pants, a yellow hat, and a white mask, carrying a yellow bag and walking on a platform beside the tracks."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "gY0yACShcnU_1", "video_path": "gY0yACShcnU.mp4", "subtitle_path": "gY0yACShcnU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1423.3, "view_count": 81768}, {"video_id": "gY0yACShcnU", "question": "Which of the following sequences of scenes in the video is correct?", "question_wo_referring_query": "Which of the following sequences of scenes in the video is correct?", "candidates": ["First, a pair of hands is using a mobile phone in front of a mirror, then two men are assembling a jigsaw puzzle under a desk lamp in a dimly lit bedroom, and finally, a man wearing a black hooded jacket and grey pants is organizing his suitcase in front of a mirror in a living room with wooden flooring.", "First, a man wearing a black hooded jacket and grey pants is organizing his suitcase in front of a mirror in a living room with wooden flooring, then two men are assembling a jigsaw puzzle under a desk lamp in a dimly lit bedroom, and finally, a pair of hands is using a mobile phone in front of a mirror.", "First, two men are assembling a jigsaw puzzle under a desk lamp in a dimly lit bedroom, then a man wearing a black hooded jacket and grey pants is organizing his suitcase in front of a mirror in a living room with wooden flooring, and finally, a pair of hands is using a mobile phone in front of a mirror.", "First, two men are assembling a jigsaw puzzle under a desk lamp in a dimly lit bedroom, then a pair of hands is using a mobile phone in front of a mirror, and finally, a man wearing a black hooded jacket and grey pants is organizing his suitcase in front of a mirror in a living room with wooden flooring.", "First, a pair of hands is using a mobile phone in front of a mirror, then a 
man wearing a black hooded jacket and grey pants is organizing his suitcase in front of a mirror in a living room with wooden flooring, and finally, two men are assembling a jigsaw puzzle under a desk lamp in a dimly lit bedroom."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "gY0yACShcnU_2", "video_path": "gY0yACShcnU.mp4", "subtitle_path": "gY0yACShcnU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1423.3, "view_count": 81768}, {"video_id": "FnblcP8QfPA", "question": "Outside the three black-framed windows is a wall covered with green plants. Inside the window is a dining room with a wooden table and wooden chairs. After the subtitle 'went and got some food at this' appears, who is the person that appears?", "question_wo_referring_query": "Who is the person that appears?", "candidates": ["A woman holding a green surfboard, smiling at the camera by the seaside", "A woman holding a skewer, wearing a light blue short sleeve and a grey hairband", "A woman doing yoga on the beach, wearing a pink top and black shorts", "A woman in a green off-the-shoulder top, smiling at the camera with the sea and sunset in the background", "A girl walking up stone steps, wearing a suspender and carrying a backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "FnblcP8QfPA_0", "video_path": "FnblcP8QfPA.mp4", "subtitle_path": "FnblcP8QfPA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 956.26, "view_count": 110562}, {"video_id": "FnblcP8QfPA", "question": "On a stone-paved path amidst lush green plants on both sides, a girl wearing a white sun hat, carrying a backpack, and dressed in a pink bikini stands facing the camera and smiles. 
What food appears after the subtitle 'online met Go Jungle Explorer in Kylie's'?", "question_wo_referring_query": "What food appears?", "candidates": ["Fried chicken strips presented on a white bowl with a piece of paper", "Skewers of meat placed in a bowl with wooden sticks", "Dragon fruits and other fruits placed on a wooden table", "Triangular cakes placed on a plate", "Fried pork chops with green and pink garnishes on a white plate"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "FnblcP8QfPA_1", "video_path": "FnblcP8QfPA.mp4", "subtitle_path": "FnblcP8QfPA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 956.26, "view_count": 110562}, {"video_id": "FnblcP8QfPA", "question": "On a dim wall illuminated by a yellow light with the sign 'Donna,' with a checkered wooden board on the right, what food appears after the subtitle 'that some locals'?", "question_wo_referring_query": "What food appears?", "candidates": ["Triangular cakes placed on a plate", "Skewers of meat placed on a plate with wooden sticks", "Fried pork chops garnished with green and pink vegetables in a white plate", "Fruits like dragon fruit on a wooden table", "Orange soup with green vegetables and large prawns on a wooden table in an iron dish"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "FnblcP8QfPA_2", "video_path": "FnblcP8QfPA.mp4", "subtitle_path": "FnblcP8QfPA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 956.26, "view_count": 110562}, {"video_id": "xYGvLyw1_CM", "question": "At the top of the gradient gray background, there is a black long line and a dark gray short line. 
After the subtitle 'Particularly interesting is this entry that relates to the Second World War nbsp;' appears, what is the first event that happens on the screen?", "question_wo_referring_query": "What is the first event that occurs on the screen?", "candidates": ["Five circular icons appear on the gray background.", "Soldiers cutting film clips appear on the gray background.", "Tanks cutting film clips appear on the gray background.", "English titles appear on the two lines in the gray background, and below appear black and white frames with characters.", "A large white title appears on the gray background."], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "xYGvLyw1_CM_0", "video_path": "xYGvLyw1_CM.mp4", "subtitle_path": "xYGvLyw1_CM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1157.48, "view_count": 165412}, {"video_id": "xYGvLyw1_CM", "question": "There is a white human figure icon at the top of the gray background. Below is a frame filled with black and white text, and in the upper right corner, there is an image of a green train. What is the first event that happens after the subtitle 'much how the Ukrainians operate, they call such cell-phones appearing \"spring flowers\".' 
appears?", "question_wo_referring_query": "What is the first event that happens?", "candidates": ["White English text appears to the right of a purple frame on the upper left of the gray background.", "The screen clears and two lines appear at the top with the title 'Reserve Capacities'.", "A tank diagram appears on the gray background.", "A purple frame with English text appears on the gray background.", "Two men are sitting on a couch and talking to each other."], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "xYGvLyw1_CM_1", "video_path": "xYGvLyw1_CM.mp4", "subtitle_path": "xYGvLyw1_CM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1157.48, "view_count": 165412}, {"video_id": "xYGvLyw1_CM", "question": "Two men are sitting on a sofa in front of a white wall having a conversation. The man on the left is wearing a gray short-sleeve shirt and has thinning curly hair, while the man on the right is wearing a plaid shirt and has some gray hair. What is the first event that happens after the caption 'It is like 50 pages of rules about how to stay alive in war. 
And you are reading it and you' appears on the screen?", "question_wo_referring_query": "What is the first event that happens in the screen?", "candidates": ["Two lines of English titles on a gray background appear, with black and white framed text at the bottom.", "Three circular icons with different drawings appear on the gray background.", "A black box with white text appears on the upper part of the screen, with four circular icons below.", "A byzantine tank icon appears in the screen.", "There is a purple frame with English text in the upper left corner of the gray and white background, and four white English sentences on the right."], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "xYGvLyw1_CM_2", "video_path": "xYGvLyw1_CM.mp4", "subtitle_path": "xYGvLyw1_CM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1157.48, "view_count": 165412}, {"video_id": "xJrKIPwVwGM", "question": "Which character appears first in the video?", "question_wo_referring_query": "Which character appears first in the video?", "candidates": ["exp marked with yellow", "provably circled in red", "green QK written", "FAVOR+ marked with yellow", "h(X) shaded in yellow"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "xJrKIPwVwGM_0", "video_path": "xJrKIPwVwGM.mp4", "subtitle_path": "xJrKIPwVwGM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3278.37, "view_count": 54907}, {"video_id": "xJrKIPwVwGM", "question": "Which character appears first in the video?", "question_wo_referring_query": "Which character appears first in the video?", "candidates": ["K pointed at by a red arrow", "K circled in blue", "L\u00d7d inside the yellow rectangle", "exp in the lower part marked with a red line", "Q'K' written in red next to the yellow rectangle"], "topic_category": "KC-Knowledge-Computer-Science", 
"question_category": "O3O", "level": "L2-Relation", "id": "xJrKIPwVwGM_1", "video_path": "xJrKIPwVwGM.mp4", "subtitle_path": "xJrKIPwVwGM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3278.37, "view_count": 54907}, {"video_id": "xJrKIPwVwGM", "question": "In the video, which concept is mentioned first?", "question_wo_referring_query": "Which concept is mentioned first in the video?", "candidates": ["Residuals", "Ordinary Differential Equations", "Exponential Function", "Magnitude", "Parameters"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "O3O", "level": "L2-Relation", "id": "xJrKIPwVwGM_2", "video_path": "xJrKIPwVwGM.mp4", "subtitle_path": "xJrKIPwVwGM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3278.37, "view_count": 54907}, {"video_id": "h_n0Z1mIxrI", "question": "A man is kneeling by a creek full of green grass, washing tomatoes. Next to him, there is a basket filled with green peppers. What did he do afterwards?", "question_wo_referring_query": "What did he do?", "candidates": ["Cut chicken into pieces", "Picked vegetables in the garden", "Cut beef on a wooden table", "Fixed a black pot on a large wooden frame", "Cut vegetables on a wooden table in the grass"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "h_n0Z1mIxrI_0", "video_path": "h_n0Z1mIxrI.mp4", "subtitle_path": "h_n0Z1mIxrI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1100.9, "view_count": 4055145}, {"video_id": "h_n0Z1mIxrI", "question": "In a green grassy field, a man wearing a black short-sleeved shirt and khaki pants is picking vegetables in the garden with a woven basket in his hand. 
What did he do after that?", "question_wo_referring_query": "What did he do?", "candidates": ["Cut wooden shelves with an electric saw", "Chopped chicken on a wooden table", "Squatted by a small creek to wash the vegetables", "Cut mutton oil on a wooden table", "Fixed an iron pot on a large wooden shelf"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "h_n0Z1mIxrI_1", "video_path": "h_n0Z1mIxrI.mp4", "subtitle_path": "h_n0Z1mIxrI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1100.9, "view_count": 4055145}, {"video_id": "h_n0Z1mIxrI", "question": "A man is standing in front of a wooden stake. Three black pots are mounted on the stake, each containing fried fish, chicken, and beans. What did the man do after stir-frying the food in the three pots?", "question_wo_referring_query": "What did he do?", "candidates": ["Fixed the black pots onto the large wooden stake", "Cut lamb fat on the wooden table", "Picked vegetables from the garden", "Kneeled by the small stream to wash vegetables", "Placed the beef into a well-oiled iron pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "h_n0Z1mIxrI_2", "video_path": "h_n0Z1mIxrI.mp4", "subtitle_path": "h_n0Z1mIxrI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1100.9, "view_count": 4055145}, {"video_id": "LJP7m4D8P8A", "question": "When a group of black and white magpies appear for the first time on a grass field covered in white snow, what are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["Calling toward the sky", "Grazing on items in the grass", "Basking in the sun", "Flying at low altitude", "Laying eggs"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "LJP7m4D8P8A_0", "video_path": "LJP7m4D8P8A.mp4", "subtitle_path": 
"LJP7m4D8P8A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1409.91, "view_count": 125785}, {"video_id": "LJP7m4D8P8A", "question": "There is a duck with a black head and white body on the surface of the lake. The water surface ripples in layers. What was this duck doing the first time it appeared on the lake surface?", "question_wo_referring_query": "What was it doing?", "candidates": ["Swimming on the lake", "Catching fish on the lake", "Jumping on the lake", "Splashing in the lake water", "Flying over the lake"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "LJP7m4D8P8A_1", "video_path": "LJP7m4D8P8A.mp4", "subtitle_path": "LJP7m4D8P8A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1409.91, "view_count": 125785}, {"video_id": "LJP7m4D8P8A", "question": "The background is a tall building and withered trees illuminated by sunlight, with snow and water accumulated on the ground. The first time a black-haired woman wearing a leather jacket, a khaki scarf, and silver earphones appears, what is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Walking outdoors while talking to the camera", "Using a computer in a bedroom", "Drinking coffee in a caf\u00e9", "Shopping for clothes in a mall", "Studying in a library"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "LJP7m4D8P8A_2", "video_path": "LJP7m4D8P8A.mp4", "subtitle_path": "LJP7m4D8P8A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1409.91, "view_count": 125785}, {"video_id": "zYd4w7hpbx0", "question": "Under the white sky on the impoverished hillside, there are black and cold rocks and orange-red rocks. 
What is the object in the scene that is spraying rocks?", "question_wo_referring_query": "What is the object in the scene that is spraying rocks?", "candidates": ["Green Mountain", "Plain", "Snow Mountain", "Volcano", "Rock"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "zYd4w7hpbx0_0", "video_path": "zYd4w7hpbx0.mp4", "subtitle_path": "zYd4w7hpbx0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.16, "view_count": 33711}, {"video_id": "zYd4w7hpbx0", "question": "The screen shows a vast area of poverty-stricken yellow earth, stained black in large patches. What is the red substance flowing on the black surface?", "question_wo_referring_query": "What is it?", "candidates": ["Lake water", "Lava", "River", "Red oil", "Sea water"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "zYd4w7hpbx0_1", "video_path": "zYd4w7hpbx0.mp4", "subtitle_path": "zYd4w7hpbx0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.16, "view_count": 33711}, {"video_id": "zYd4w7hpbx0", "question": "On a ground with brown fallen leaves and green grass, there is a white and reddish-orange rock. What is the object gradually sinking in the rock?", "question_wo_referring_query": "What is it?", "candidates": ["Lead", "Floating organisms", "Iron ore", "Fossil", "Coal"], "topic_category": "KG-Knowledge-Geography", "question_category": "E2O", "level": "IntraMoment", "id": "zYd4w7hpbx0_2", "video_path": "zYd4w7hpbx0.mp4", "subtitle_path": "zYd4w7hpbx0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 960.16, "view_count": 33711}, {"video_id": "uLJSf9Ws2ms", "question": "On the screen, a man in a dark suit is facing away from the camera, speaking to a woman across from him who is wearing a pink suit and a white low neckline top with her black curly hair down. 
Behind the woman is a screen displaying a picture of a white palace building. When the subtitle 'I wonder Governor when you look at the' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["mobile phone", "curtains", "camera", "black and white mug", "window"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "uLJSf9Ws2ms_0", "video_path": "uLJSf9Ws2ms.mp4", "subtitle_path": "uLJSf9Ws2ms_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1037.41, "view_count": 174529}, {"video_id": "uLJSf9Ws2ms", "question": "The background consists of three screens displaying images. The screen on the left shows a picture of an American leader giving a speech. In the middle of the screen, a man with some gray hair wearing a gray-blue suit and a woman with black hair wearing a pink suit with a white inner top are sitting and talking at a glass round table. When the subtitle 'years has been a master class close to' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["transparent glass cup", "white headband", "red suit", "American flag", "pink scarf"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "uLJSf9Ws2ms_1", "video_path": "uLJSf9Ws2ms.mp4", "subtitle_path": "uLJSf9Ws2ms_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1037.41, "view_count": 174529}, {"video_id": "uLJSf9Ws2ms", "question": "The background shows a screen with an image of a white palace building under a blue sky and white clouds. A woman with short black hair, wearing a pink suit with a white inner layer, is sitting in front of the screen, holding up a black pen while speaking. 
When the subtitle 'migrants from seeking Asylum would you' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Sunglasses", "Red pen", "White paper", "Necklace", "White round earrings"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "uLJSf9Ws2ms_2", "video_path": "uLJSf9Ws2ms.mp4", "subtitle_path": "uLJSf9Ws2ms_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1037.41, "view_count": 174529}, {"video_id": "FmbgUddvJHc", "question": "As white clouds and mist swirl around the black and gray mountain ridges under the blue sky, what objects are present on the screen at this moment?", "question_wo_referring_query": ", what objects are present on the screen at this moment?", "candidates": ["grassland", "green trees", "rocks", "flowers", "stream"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "FmbgUddvJHc_0", "video_path": "FmbgUddvJHc.mp4", "subtitle_path": "FmbgUddvJHc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1026.09, "view_count": 695511}, {"video_id": "FmbgUddvJHc", "question": "There are a few white clouds floating in the blue sky. Below the sky are yellow barren mountain ranges and lakes. 
What object is present on the screen at this moment?", "question_wo_referring_query": ", what object is present on the screen at this moment?", "candidates": ["Grassland", "Starry Sky", "Sun", "Green Trees", "Cave"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "FmbgUddvJHc_1", "video_path": "FmbgUddvJHc.mp4", "subtitle_path": "FmbgUddvJHc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1026.09, "view_count": 695511}, {"video_id": "FmbgUddvJHc", "question": "On the screen, there is a satellite floating above an orange surface of Mars, which has many protrusions and depressions. What object is present on the screen at this moment?", "question_wo_referring_query": "What object is present on the screen at this moment?", "candidates": ["the starry sky", "a circular hole", "a rocket", "the Earth", "mountain ranges"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "FmbgUddvJHc_2", "video_path": "FmbgUddvJHc.mp4", "subtitle_path": "FmbgUddvJHc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1026.09, "view_count": 695511}, {"video_id": "8liuFWolR88", "question": "In a black and white photo, a woman with black short curly hair is wearing a light-colored top and a dark skirt, standing with one hand on her waist in front of two abstract paintings. 
When she, with white hair wearing a black top and sitting in front of a purple curtain talking, what change did this woman undergo?", "question_wo_referring_query": "What change did this woman undergo?", "candidates": ["A: From young to old", "C: From white hair to black hair", "D: From old to young", "B: From fat to slim", "E: From wearing glasses to not wearing glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "8liuFWolR88_0", "video_path": "8liuFWolR88.mp4", "subtitle_path": "8liuFWolR88_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.76, "view_count": 170381}, {"video_id": "8liuFWolR88", "question": "In a black-and-white scene, a woman wearing a black plaid dress is washing dishes by the pool, while a man in a white shirt with a cigarette in his mouth is drying dishes on the right side. As they walk outside by a dry tree on the grass, with a black-and-white patterned dog beside them, what change happens to the woman?", "question_wo_referring_query": "What change happens to the woman?", "candidates": ["The woman, who was not wearing glasses, starts wearing glasses.", "The woman's short hair changes to long hair.", "The woman's curly hair changes to straight hair.", "The woman's black hair changes to white hair.", "The woman's plaid dress changes to pants."], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "8liuFWolR88_1", "video_path": "8liuFWolR88.mp4", "subtitle_path": "8liuFWolR88_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.76, "view_count": 170381}, {"video_id": "8liuFWolR88", "question": "A girl wearing a pink shirt and a grey and white skirt is sitting on the ground in the corner of a wall covered in blue, red, and yellow paint. 
When she is crouching down with a white top and black and white striped pants, painting the ground, what change happened to this girl?", "question_wo_referring_query": "A girl wearing a pink shirt and a grey and white skirt is sitting on the ground in the corner of a wall covered in blue, red, and yellow paint. When she is crouching down with a white top and black and white striped pants, painting the ground, what change happened to this girl?", "candidates": ["Changed from wearing a skirt to wearing pants", "Changed from wearing glasses to not wearing glasses", "Changed from not wearing a hat to wearing a hat", "Changed from having short hair to having straight hair", "Changed from having blonde hair to having black hair"], "topic_category": "KA-Knowledge-Art", "question_category": "SAA", "level": "L2-Relation", "id": "8liuFWolR88_2", "video_path": "8liuFWolR88.mp4", "subtitle_path": "8liuFWolR88_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 920.76, "view_count": 170381}, {"video_id": "LOb5p1G2lzw", "question": "A man inside a car, wearing a white short-sleeved shirt, a red and gold helmet, and extending a hand with a silver ring\u2014what subtitles appear simultaneously with this scene?", "question_wo_referring_query": "What subtitles appear simultaneously with this scene?", "candidates": ["\"oh great once teams have received their\"", "\"drivers so that they can head to\"", "\"clue they must sprint to find their\"", "\"only one player compete\"", "\u201cwe're gonna settle this how men do it\u201d"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "LOb5p1G2lzw_0", "video_path": "LOb5p1G2lzw.mp4", "subtitle_path": "LOb5p1G2lzw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1715.97, "view_count": 65411}, {"video_id": "LOb5p1G2lzw", "question": "On the wrestling stage, there are two men wearing red headgear. 
The man on the left is dressed in a white short-sleeve shirt, red shorts, and a red cape. The man on the right is dressed in a black short-sleeve shirt, floral pants, and black leather boots. With which subtitles did the man on the right appear simultaneously?", "question_wo_referring_query": "With which subtitles did the man on the right appear simultaneously?", "candidates": ["\"[Applause]\"", "\u201cmy challenge yeah can we just play with\u201d", "\"about go check the previous episode it's\"", "\"deport me from this country that's not\"", "\u201ckai if you don't know what i'm talking\u201d"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "LOb5p1G2lzw_1", "video_path": "LOb5p1G2lzw.mp4", "subtitle_path": "LOb5p1G2lzw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1715.97, "view_count": 65411}, {"video_id": "LOb5p1G2lzw", "question": "Which subtitles have appeared at the same time as the scene showing a man sitting in a car, wearing a white shirt and a necklace, with his arm resting on the steering wheel and talking to the camera?", "question_wo_referring_query": "Which subtitles have appeared at the same time?", "candidates": ["\"let's go let's go\"", "\"my chest hurts so bad wow am i really\"", "\u201cas you guys know bali is home to me\u201d", "\"that out of shape\"", "\"it's a race it's not a sprint so we need\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "LOb5p1G2lzw_2", "video_path": "LOb5p1G2lzw.mp4", "subtitle_path": "LOb5p1G2lzw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1715.97, "view_count": 65411}, {"video_id": "Z4AEXRtDEEY", "question": "In a black and white screen, there is an unfinished group sculpture with two people, one standing on a ladder and the other wearing a hat, carving. 
After the caption mentions 'Rodin was as much an entrepreneur as he was a sculptor. In fact never', what item appears in the video?", "question_wo_referring_query": "what item appears in the video?", "candidates": ["A black sculpture of The Thinker", "A green-painted sculpture of The Thinker", "A white sculpture of The Thinker", "A sculpture of The Thinker with green vegetation", "The Mona Lisa's smile"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Z4AEXRtDEEY_0", "video_path": "Z4AEXRtDEEY.mp4", "subtitle_path": "Z4AEXRtDEEY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.56, "view_count": 627739}, {"video_id": "Z4AEXRtDEEY", "question": "In a black and white scene, there is a man wearing suspenders, a hat, and a large coat. Beside him, there is a gigantic white statue showing only its legs and one arm. When the captions mention 'From then on, he would either make his figures smaller or larger than life.' what is the next object that appears in the video?", "question_wo_referring_query": ", what is the next object that appears in the video?", "candidates": ["A black sculpture of a man with his right hand on his head and left hand in a fist", "A painting of Mona Lisa's smile", "A car", "A sculpture of The Thinker", "A black sculpture of a man and woman embracing"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Z4AEXRtDEEY_1", "video_path": "Z4AEXRtDEEY.mp4", "subtitle_path": "Z4AEXRtDEEY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.56, "view_count": 627739}, {"video_id": "Z4AEXRtDEEY", "question": "In a painting, there is a person wearing a hat and a red robe in the center. On the left side, there is a mountain peak with a person climbing while carrying something on their back. On the right side, there is a building. 
The subtitles mention 'The Thinker,' but in 1880, he was still known as 'The Poet' and represented the writer Dante Alighieri. After that, who is the character that appears in the video?", "question_wo_referring_query": "After that, who is the character that appears in the video?", "candidates": ["A man wearing a hat and with a long beard", "A man in a black coat and tie", "A child with short hair and arms crossed", "A man in a white shirt and tie", "A man driving a horse-drawn carriage and wearing a top hat"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Z4AEXRtDEEY_2", "video_path": "Z4AEXRtDEEY.mp4", "subtitle_path": "Z4AEXRtDEEY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 947.56, "view_count": 627739}, {"video_id": "vOtKGg0url4", "question": "On a shelf in the room, some items are placed, and a woman wearing white clothes is sitting in the middle of the room. She puts her hands on her head to comb her hair, and after the subtitle mentions 'anything of it are going to', what does the woman do sitting on the brown sofa?", "question_wo_referring_query": ", what does the woman do sitting on the brown sofa?", "candidates": ["She is playing with her phone", "She took a sip of a drink", "She is reading a book", "She took off the scarf", "She crossed her hands in front of her chest"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "vOtKGg0url4_0", "video_path": "vOtKGg0url4.mp4", "subtitle_path": "vOtKGg0url4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1440.93, "view_count": 202103}, {"video_id": "vOtKGg0url4", "question": "Inside the room, there is an orange cabinet in front of a white bedspread. Next to the cabinet stands a woman with hands on her hips, wearing a white short-sleeve shirt. 
After the subtitle mentions 'because that whole time period I was,' what did this woman do on the leaf-covered street?", "question_wo_referring_query": "What did this woman do on the leaf-covered street?", "candidates": ["She picked up a red maple leaf", "She picked up a green leaf", "She was dancing", "She picked up a yellow leaf", "She lay down on the ground"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "vOtKGg0url4_1", "video_path": "vOtKGg0url4.mp4", "subtitle_path": "vOtKGg0url4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1440.93, "view_count": 202103}, {"video_id": "vOtKGg0url4", "question": "On the roadside paved with red tiles, a woman with long black hair and wearing a white top is sitting along the road with her phone. After the subtitles mention 'you're not doing enough I started,' what does she do beside a car with an open door?", "question_wo_referring_query": "What does she do?", "candidates": ["She closed the car door", "She threw the phone away", "She opened the car door", "She is jumping high", "She is spinning around"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "vOtKGg0url4_2", "video_path": "vOtKGg0url4.mp4", "subtitle_path": "vOtKGg0url4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1440.93, "view_count": 202103}, {"video_id": "bJWf8XUrdVQ", "question": "Two people are online. On the right, there is a man in a black suit against a blue background. On the left, there is a bald man wearing glasses who is currently lifting his head. After he lifts his head, when the first letter on the screen is 'd', what does he do next?", "question_wo_referring_query": "Two people are online. On the right, there is a man in a black suit against a blue background. On the left, there is a bald man wearing glasses who is currently lifting his head. 
After he lifts his head, when the first letter on the screen is 'd', what does he do next?", "candidates": ["He takes off his glasses", "He touches his head with his right hand", "He touches his chin with his left hand", "He touches his head with his left hand", "He touches his chin with his right hand"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "bJWf8XUrdVQ_0", "video_path": "bJWf8XUrdVQ.mp4", "subtitle_path": "bJWf8XUrdVQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.04, "view_count": 36148}, {"video_id": "bJWf8XUrdVQ", "question": "There are two people on the screen. The man on the right is wearing a white shirt and a tie. The man on the left is wearing glasses and has a short beard. He is extending his index finger. After this action, when the first letter of the subtitles appears, what did this man do?", "question_wo_referring_query": "What did this man do?", "candidates": ["He leaned back.", "He stood up.", "He extended his left index finger.", "He raised his left hand.", "He crossed his arms."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "bJWf8XUrdVQ_1", "video_path": "bJWf8XUrdVQ.mp4", "subtitle_path": "bJWf8XUrdVQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.04, "view_count": 36148}, {"video_id": "bJWf8XUrdVQ", "question": "The screen is divided into two parts. The person in front of the blue background has short hair and is wearing a tie. Next to him, a bald person is touching their chin with their left hand. 
After this action, when the first letter of the marquee appears, what does he do?", "question_wo_referring_query": "What does he do?", "candidates": ["He puts down his right hand", "He puts on glasses", "He touches his chin with his right hand", "He puts down his left hand", "He unties his tie"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "bJWf8XUrdVQ_2", "video_path": "bJWf8XUrdVQ.mp4", "subtitle_path": "bJWf8XUrdVQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 962.04, "view_count": 36148}, {"video_id": "1nC6eA8yy8g", "question": "Thick snow covers the ground, some pedestrians are walking on the snow, and on the wall next to them there are two pillars with black and white stripes. Between the pillars, there is an open door. When the subtitle mentions 'in schools', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A person is climbing the wall", "The large door closes", "A group of people are shoveling snow", "A group of people are destroying a wooden door", "A team of people rushes out from inside the door"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "1nC6eA8yy8g_0", "video_path": "1nC6eA8yy8g.mp4", "subtitle_path": "1nC6eA8yy8g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1880.88, "view_count": 322153}, {"video_id": "1nC6eA8yy8g", "question": "The screen displays a simulated map, with gray lines dividing it into several sections. In the upper part, there are green blocks representing troops with green flags, while in the lower part, there are white blocks with red stripes representing another group. 
What happens on the screen when the subtitles mention 'Moraviev Apostol hopes the opposing'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The red-striped unit moves towards the white block on the right.", "The people in the green block move towards the red-striped unit.", "The red-striped unit moves towards the white block on the left.", "The red-striped unit moves towards the green block above.", "The red-striped unit moves towards the green block below."], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "1nC6eA8yy8g_1", "video_path": "1nC6eA8yy8g.mp4", "subtitle_path": "1nC6eA8yy8g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1880.88, "view_count": 322153}, {"video_id": "1nC6eA8yy8g", "question": "Beneath the overcast sky, in front of the red high wall, a person with their upper body covered by a white cloth is hanging on a wooden gallows. When the subtitle mentions 'as the men are hanged ropes break', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A new person is hanged on the gallows", "The ropes suspending the body break", "The red high wall collapses", "The white cloth covering the body is removed", "The wooden frame of the gallows breaks"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "1nC6eA8yy8g_2", "video_path": "1nC6eA8yy8g.mp4", "subtitle_path": "1nC6eA8yy8g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1880.88, "view_count": 322153}, {"video_id": "T7bhbL1cqOw", "question": "Books filled the wall, and a woman stood in front of the bookshelf. She was wearing a black mask and a white coat. 
When this woman appeared in front of the bookshelf for the first time in this outfit, what did she do?", "question_wo_referring_query": "What did she do?", "candidates": ["She took a book from the shelf.", "She walked from left to right in front of the bookshelf.", "She walked from right to left in front of the bookshelf.", "She destroyed the bookshelf.", "She placed a book on the shelf."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "T7bhbL1cqOw_0", "video_path": "T7bhbL1cqOw.mp4", "subtitle_path": "T7bhbL1cqOw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.73, "view_count": 147393}, {"video_id": "T7bhbL1cqOw", "question": "Two cars are parked on the street, and next to them is a street lamp. Beside the street lamp is a white building, which has a large wooden door with a small opening. When the person wearing blue clothes appeared in front of the wooden door for the first time, what did he do?", "question_wo_referring_query": "What did he do?", "candidates": ["He walked into the building", "He got into the car", "He completely opened the wooden door", "She closed the wooden door", "He walked out of the building"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "T7bhbL1cqOw_1", "video_path": "T7bhbL1cqOw.mp4", "subtitle_path": "T7bhbL1cqOw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.73, "view_count": 147393}, {"video_id": "T7bhbL1cqOw", "question": "In front of a door which is green on the top part and yellow on the bottom part, there is a metal plate on the street. The metal plate has the word 'GAS' on it, and a bird is beside the metal plate. 
When the bird appears for the first time, what does it do?", "question_wo_referring_query": "What does it do when it appears for the first time?", "candidates": ["It flies into the sky", "It is defecating", "It bumps into the door", "It lowers its head to search for food", "It crashes into a car"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "T7bhbL1cqOw_2", "video_path": "T7bhbL1cqOw.mp4", "subtitle_path": "T7bhbL1cqOw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 994.73, "view_count": 147393}, {"video_id": "LXTVy1wF2WM", "question": "There is a photo hanging on the wall, and on the desk in front there are a telephone, a desk lamp, and a globe. Three people are interrogating the person sitting on the chair. Who is pointing a gun at the person on the chair?", "question_wo_referring_query": ", who is pointing a gun at the person on the chair?", "candidates": ["The person wearing suspenders and not wearing glasses", "The person with a tear in their clothes and not wearing a hat", "The person wearing a hat and a suit", "The person not wearing a hat but wearing suspenders", "The person wearing a hat and suspenders"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "LXTVy1wF2WM_0", "video_path": "LXTVy1wF2WM.mp4", "subtitle_path": "LXTVy1wF2WM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2478.13, "view_count": 881661}, {"video_id": "LXTVy1wF2WM", "question": "On a street in front of a row of buildings, there are debris and obstacles. Flames are burning fiercely on one side, and three people are arguing on the street. 
Who threw the incendiary bottle?", "question_wo_referring_query": "Who threw the incendiary bottle?", "candidates": ["The person holding a rifle", "The person wearing a brown hat and brown pants", "The person wearing a brown jacket and holding a rifle", "The person squatting down", "The person wearing a khaki jacket and black pants"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "LXTVy1wF2WM_1", "video_path": "LXTVy1wF2WM.mp4", "subtitle_path": "LXTVy1wF2WM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2478.13, "view_count": 881661}, {"video_id": "LXTVy1wF2WM", "question": "There are three people in the field. Two people are plowing the field with tools, and one person is hitting the plowing people with a whip. Who is the person using the whip?", "question_wo_referring_query": "Who is the person using the whip?", "candidates": ["The person wearing blue shorts", "The person wearing red shorts", "The person walking in the middle", "The person walking on the far right of the screen", "The person wearing olive shorts"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "LXTVy1wF2WM_2", "video_path": "LXTVy1wF2WM.mp4", "subtitle_path": "LXTVy1wF2WM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2478.13, "view_count": 881661}, {"video_id": "QO7Ymd99RtY", "question": "In front of the golden tower outside the house on a clear day, there are three men. The man on the left is wearing a hat and a blue short sleeve, while the man in the middle is wearing a white short sleeve. 
When the subtitle mentions 'to his home country because if he did he', what is the man on the far right wearing?", "question_wo_referring_query": "What is the man on the far right wearing?", "candidates": ["black woolen sweater", "blue short sleeve T-shirt", "black short sleeve T-shirt", "black vest", "white short sleeve T-shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "QO7Ymd99RtY_0", "video_path": "QO7Ymd99RtY.mp4", "subtitle_path": "QO7Ymd99RtY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1168.33, "view_count": 2313572}, {"video_id": "QO7Ymd99RtY", "question": "There is a statue of a person placed on the lawn in front of the house, with two people standing next to the statue. One person, who is not showing his face, is wearing a floral top and white pants. The other person, a man in a black short-sleeve shirt, is mentioned in the subtitles as saying 'wow nice oh it's nothing it's'. What kind of pants is this man wearing?", "question_wo_referring_query": "What kind of pants is this man wearing?", "candidates": ["green trousers", "blue shorts", "blue jeans", "blue tight pants", "blue sweatpants"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "QO7Ymd99RtY_1", "video_path": "QO7Ymd99RtY.mp4", "subtitle_path": "QO7Ymd99RtY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1168.33, "view_count": 2313572}, {"video_id": "QO7Ymd99RtY", "question": "In the distance, there is a white house at the foot of a mountain. Nearby, two people are sitting on a lilac-colored bench. One person, with dark skin, is wearing a multicolored striped short-sleeve shirt. The other, touching their earlobe, is wearing a horizontally-striped short-sleeve shirt. 
When the subtitle mentions 'understanding that the sing is pass for', what is the hairstyle of the person touching their earlobe?", "question_wo_referring_query": "What is the hairstyle of the person touching their earlobe?", "candidates": ["Black long hair", "Red mohawk", "Black Mohican", "Black bob", "Black short hair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "QO7Ymd99RtY_2", "video_path": "QO7Ymd99RtY.mp4", "subtitle_path": "QO7Ymd99RtY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1168.33, "view_count": 2313572}, {"video_id": "10nEx2-8J0M", "question": "There are two people in the scene: the person on the right has short brown hair and is wearing a shirt, while behind the person on the left there is a white cabinet. The person on the left is adjusting a hearing aid with their hand. When the subtitle mentions 'don\u2019t have to pay don\u2019t have to find the,' what object is present in the scene?", "question_wo_referring_query": "what object is present in the scene?", "candidates": ["display", "microphone", "sound", "necklace", "watch"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "10nEx2-8J0M_0", "video_path": "10nEx2-8J0M.mp4", "subtitle_path": "10nEx2-8J0M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1748.92, "view_count": 11963}, {"video_id": "10nEx2-8J0M", "question": "The screen shows a webpage with a digital human model on the left, next to an AI symbol. In the bottom-right corner, there is a person wearing a gray short-sleeved shirt with their hands raised. 
What is present on the screen when the subtitles mention 'imminent is this threat you know how'?", "question_wo_referring_query": "What is present on the screen?", "candidates": ["mouse", "black sunglasses", "red sunglasses", "gray sunglasses", "keyboard"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "10nEx2-8J0M_1", "video_path": "10nEx2-8J0M.mp4", "subtitle_path": "10nEx2-8J0M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1748.92, "view_count": 11963}, {"video_id": "10nEx2-8J0M", "question": "The screen is divided into two parts. On the left side, there is a room with a white wardrobe, and a person with very short hair is touching their chin. On the right side, there is a man wearing a checkered shirt. When the subtitle mentions 'disasters and more cyber attacks and', what items are not present on the screen?", "question_wo_referring_query": "What items are not present on the screen?", "candidates": ["Face mask", "Microphone", "Wired earphones", "Wireless earphones", "Button"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "10nEx2-8J0M_2", "video_path": "10nEx2-8J0M.mp4", "subtitle_path": "10nEx2-8J0M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1748.92, "view_count": 11963}, {"video_id": "0gOZy5xgy18", "question": "Under the blue sky and white clouds, there is a stretch of yellow sand. There are two people on the sand, one person is riding a horse, and the other person, who is not riding, is holding a gun. 
What is the person who is not riding a horse doing?", "question_wo_referring_query": "What is the person who is not riding a horse doing?", "candidates": ["He drew a sword", "He is shooting at the person on the horse", "He got on the horse", "He threw away the gun", "He is rolling on the ground"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "0gOZy5xgy18_0", "video_path": "0gOZy5xgy18.mp4", "subtitle_path": "0gOZy5xgy18_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1005.87, "view_count": 19381}, {"video_id": "0gOZy5xgy18", "question": "The distant mountain peaks show no trace of green, and a stretch of desert lies at the foot of the mountains. Military installations in the desert are firing into the sky, and a plane is flying through the air. What is the plane doing?", "question_wo_referring_query": "What is the plane doing?", "candidates": ["It is falling from the sky", "It collided with a bird", "It dropped a guided bomb", "It fired at the ground", "It is maneuvering to avoid missiles"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "0gOZy5xgy18_1", "video_path": "0gOZy5xgy18.mp4", "subtitle_path": "0gOZy5xgy18_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1005.87, "view_count": 19381}, {"video_id": "0gOZy5xgy18", "question": "Under the gloomy sky, there are black mountains in the distance. At the foot of the mountain is a stretch of yellow sandy ground. On the yellow sandy ground, two people riding horses are running towards the front. 
In the scene, what is the person wearing grey clothes doing?", "question_wo_referring_query": "In the scene, what is the person wearing grey clothes doing?", "candidates": ["He pulled out a gun", "He fell to the ground", "He is side-swinging with the knife in his hand", "He is picking up the weapon", "He jumped off the horse"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "0gOZy5xgy18_2", "video_path": "0gOZy5xgy18.mp4", "subtitle_path": "0gOZy5xgy18_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1005.87, "view_count": 19381}, {"video_id": "lD2tB1riyl4", "question": "This is a flag made up of blue and white stripes, with four red six-pointed stars on it. One of the red arrows points to the leftmost six-pointed star. When this flag appears along with the subtitle 'the second star is a symbol of the great,' what change occurs?", "question_wo_referring_query": ", what change occurs?", "candidates": ["An additional red arrow appears on the flag", "An additional red six-pointed star appears on the flag", "The flag turns completely white", "Three additional red arrows appear on the flag", "Two additional red arrows appear on the flag"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "lD2tB1riyl4_0", "video_path": "lD2tB1riyl4.mp4", "subtitle_path": "lD2tB1riyl4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1059.76, "view_count": 136181}, {"video_id": "lD2tB1riyl4", "question": "In a pure black background, there is a man wearing sunglasses, a black suit, and a black tie. He places his right index finger on his lips, making a gesture for silence. 
When he appears along with the subtitle 'in risk management,' what change occurs to him?", "question_wo_referring_query": "What change occurs to him?", "candidates": ["He extends his left index finger.", "He touches his chin with his left hand.", "His pose changes to crossing his arms.", "He places both hands on his head.", "He touches his chin with his right hand."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "lD2tB1riyl4_1", "video_path": "lD2tB1riyl4.mp4", "subtitle_path": "lD2tB1riyl4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1059.76, "view_count": 136181}, {"video_id": "lD2tB1riyl4", "question": "A woman with an afro hairstyle, wearing earrings, and having dark skin is standing in front of greenery and yellow flowers. On the right side of the screen, there are also the words '1984'. When she appears together with the subtitles 'renamed the Oprah Winfrey Show', what change does she undergo?", "question_wo_referring_query": "What change does she undergo?", "candidates": ["She changed into a yellow dress", "She changed into a blue coat", "She changed into a purple outfit", "She changed into a red dress", "She is holding a microphone"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "lD2tB1riyl4_2", "video_path": "lD2tB1riyl4.mp4", "subtitle_path": "lD2tB1riyl4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1059.76, "view_count": 136181}, {"video_id": "8iwUSDxE9dY", "question": "A picture of the Earth viewed from space is pasted on the wall. On the adjacent bookshelf, there are picture frames and some miscellaneous items. In front of the bookshelf sits a man wearing earth-tone clothing, with his hands clasped together in front of his chest. Text appears in a white horizontal bar below. 
This man and which subtitles have appeared together before?", "question_wo_referring_query": ", text appears in a white horizontal bar below. This man and which subtitles have appeared together before?", "candidates": ["to create a globe and then incorporate", "yeah that\u2019s right guy who measured", "first global projection of the world", "Flat Earth and doesn\u2019t on a ball", "true Aristarchus first did that in"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "8iwUSDxE9dY_0", "video_path": "8iwUSDxE9dY.mp4", "subtitle_path": "8iwUSDxE9dY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.92, "view_count": 643600}, {"video_id": "8iwUSDxE9dY", "question": "On a white background, there is a sepia paper with an old world map drawn on it. There is a yellow circle and a mouse pointer icon on the map. In what context did this map and which subtitles appear together?", "question_wo_referring_query": "In what context did this map and which subtitles appear together?", "candidates": ["if he stood at the shore Alexandre he", "would not have been able to see Turkey", "knew how it looked as per the map you", "would not have been able to see Turkey", "not really scientific because he didn\u2019t"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "8iwUSDxE9dY_1", "video_path": "8iwUSDxE9dY.mp4", "subtitle_path": "8iwUSDxE9dY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.92, "view_count": 643600}, {"video_id": "8iwUSDxE9dY", "question": "In the black background, there is only an image of a man. He has long curly hair and wears an olive-colored coat. There are numbers (1656-1741) in the lower part of the black background. 
In which captions did this image appear together?", "question_wo_referring_query": "In which captions did this image appear together?", "candidates": ["would not have been able to see Turkey", "the Royal Society and later to become", "just checking and this gave the British", "to this to deceive you because they will", "true Aristarchus first did that in"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "8iwUSDxE9dY_2", "video_path": "8iwUSDxE9dY.mp4", "subtitle_path": "8iwUSDxE9dY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 905.92, "view_count": 643600}, {"video_id": "cAf9llte0WE", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a man in a suit appears in front of a display screen showing a waterfall, then there is a gray globe model against a black background, and finally, a woman in a pink coat touching a building with her index finger.", "First, a man in a suit appears in front of a display screen showing a waterfall, then a woman in a pink coat touching a building with her index finger, and finally, there is a gray globe model against a black background.", "First, there is a gray globe model against a black background, followed by a man in a suit in front of a display screen showing a waterfall, and finally, a woman in a pink coat touching a building with her index finger.", "First, a woman in a pink coat touching a building with her index finger, followed by a gray globe model against a black background, and finally, a man in a suit in front of a display screen showing a waterfall.", "First, there is a gray globe model against a black background, followed by a woman in a pink coat touching a building with her index finger, and finally, a man in a suit in front of a display screen showing a waterfall."], "topic_category": 
"KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "cAf9llte0WE_0", "video_path": "cAf9llte0WE.mp4", "subtitle_path": "cAf9llte0WE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1516.85, "view_count": 53053}, {"video_id": "cAf9llte0WE", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a man with short golden hair wearing a suit appears in front of a background with blue sky, white clouds, and city buildings. Next, the screen shows only a black background with nothing on it. Finally, a bald man wearing a black coat holding a microphone is standing in a water alley.", "First, the screen shows only a black background with nothing on it. Next, a bald man wearing a black coat holding a microphone is standing in a water alley. Finally, a man with short golden hair wearing a suit appears in front of a background with blue sky, white clouds, and city buildings.", "First, a bald man wearing a black coat holding a microphone is standing in a water alley. Next, the screen shows only a black background with nothing on it. Finally, a man with short golden hair wearing a suit appears in front of a background with blue sky, white clouds, and city buildings.", "First, a man with short golden hair wearing a suit appears in front of a background with blue sky, white clouds, and city buildings. Next, a bald man wearing a black coat holding a microphone is standing in a water alley. Finally, the screen shows only a black background with nothing on it.", "First, the screen shows only a black background with nothing on it. Next, a man with short golden hair wearing a suit appears in front of a background with blue sky, white clouds, and city buildings. 
Finally, a bald man wearing a black coat holding a microphone is standing in a water alley."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "cAf9llte0WE_1", "video_path": "cAf9llte0WE.mp4", "subtitle_path": "cAf9llte0WE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1516.85, "view_count": 53053}, {"video_id": "cAf9llte0WE", "question": "Which of the following sequence of events is correct?", "question_wo_referring_query": "Which of the following sequence of events is correct?", "candidates": ["First, a group of people sits on a boat passing through flooded streets, next, an elderly man with white hair looks down at the undercarriage of a white car, and finally, a woman with blonde hair wearing a pink coat leans against a white railing.", "First, a woman with blonde hair wearing a pink coat leans against a white railing, next, an elderly man with white hair looks down at the undercarriage of a white car, and finally, a group of people sits on a boat passing through flooded streets.", "First, a woman with blonde hair wearing a pink coat leans against a white railing, next, a group of people sits on a boat passing through flooded streets, and finally, an elderly man with white hair looks down at the undercarriage of a white car.", "First, an elderly man with white hair looks down at the undercarriage of a white car, next, a group of people sits on a boat passing through flooded streets, and finally, a woman with blonde hair wearing a pink coat leans against a white railing.", "First, a group of people sits on a boat passing through flooded streets, next, a woman with blonde hair wearing a pink coat leans against a white railing, and finally, an elderly man with white hair looks down at the undercarriage of a white car."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "cAf9llte0WE_2", "video_path": "cAf9llte0WE.mp4", 
"subtitle_path": "cAf9llte0WE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1516.85, "view_count": 53053}, {"video_id": "QHIKlu6bn2M", "question": "A person with short hair, wearing black-rimmed glasses and a grey short-sleeved shirt, is standing in front of a black background. He is holding a cup in his right hand and his left hand is hanging straight down. After the subtitle says 'so little context originally I was not,' what does a man wearing black gloves do in front of a pointed green building?", "question_wo_referring_query": "What does a man wearing black gloves do in front of a pointed green building?", "candidates": ["He sits down on the ground", "He waves his right hand", "He jumps high", "He punches a man next to him", "He waves his left hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "QHIKlu6bn2M_0", "video_path": "QHIKlu6bn2M.mp4", "subtitle_path": "QHIKlu6bn2M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1492.06, "view_count": 2129340}, {"video_id": "QHIKlu6bn2M", "question": "In front of a bumpy white wall, a man wearing a black short-sleeved shirt is holding a skull-shaped cup in his left hand. 
After the subtitle mentions \u201ca little interesting so here it is,\u201d what did this man do?", "question_wo_referring_query": "What did this man do?", "candidates": ["He put on a backpack", "He used a fork to pick up a steak", "He picked up a mug", "He picked up a card from the restaurant room", "He jumped up"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "QHIKlu6bn2M_1", "video_path": "QHIKlu6bn2M.mp4", "subtitle_path": "QHIKlu6bn2M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1492.06, "view_count": 2129340}, {"video_id": "QHIKlu6bn2M", "question": "In front of a damaged building, there is a man wearing a hat and a gray-green coat. He is holding a microphone and when the subtitle mentions 'these buildings so this half was', what does this man do while sitting at a table with a couple?", "question_wo_referring_query": "What does this man do?", "candidates": ["He put on pants", "He washed his face with a towel", "He touched his head with his hand", "He kneeled on one knee on the ground", "He looked up and drank a cup of beverage"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "QHIKlu6bn2M_2", "video_path": "QHIKlu6bn2M.mp4", "subtitle_path": "QHIKlu6bn2M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1492.06, "view_count": 2129340}, {"video_id": "02jQiIkEGh8", "question": "A shirtless man wearing blue shorts is sitting on the edge of a white leather yacht and swimming in the water. 
What happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["Many yachts are anchored in the blue waters beside the mountain", "White clouds are swirling among the green mountains", "A shirtless man wearing blue shorts is swimming around a small boat under the sunlight", "A man wearing a red shirt is running on the grass amidst misty mountain surroundings", "White waterfalls flow down between the green trees of the mountain range"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "02jQiIkEGh8_0", "video_path": "02jQiIkEGh8.mp4", "subtitle_path": "02jQiIkEGh8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2817.07, "view_count": 3649944}, {"video_id": "02jQiIkEGh8", "question": "Surrounded by green waters and blue mountains, there are three people rowing on the water. In the frame, there is a man who is shirtless and wearing black shorts, running on a wooden plank with his back to the camera. 
What does he do next?", "question_wo_referring_query": ", what does he do next?", "candidates": ["He turns around and jumps from the plank into the green water, splashing water.", "Under the sunlight, a shirtless man wearing blue shorts circles in a small boat on the water.", "White waterfalls flow down between the green trees in the mountains.", "A man in a red shirt runs on the grassy area amid misty mountains.", "White mist spirals between the blue mountains."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "02jQiIkEGh8_1", "video_path": "02jQiIkEGh8.mp4", "subtitle_path": "02jQiIkEGh8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2817.07, "view_count": 3649944}, {"video_id": "02jQiIkEGh8", "question": "Under the blue sky and white clouds, a man with curly hair dressed in a gray top and carrying a backpack walks between sunlit Italian buildings talking to the camera. What does he do next?", "question_wo_referring_query": "What does he do next?", "candidates": ["Running on the grassland among cloud-covered mountain ridges.", "This man, shirtless on top and wearing blue pants, stands at the water's edge where many people are playing, talking to the camera.", "White mist curls between the green mountains.", "A shirtless man in black shorts jumps from a wooden plank into a green lake nestled among tall mountains.", "Sitting on a white boat with his back to the camera, moving over the water."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "02jQiIkEGh8_2", "video_path": "02jQiIkEGh8.mp4", "subtitle_path": "02jQiIkEGh8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2817.07, "view_count": 3649944}, {"video_id": "8OdjXijKTwU", "question": "On a white marble table, there is a handmade craft made with purple, orange, yellow, and green fabric strips, featuring black patterns. 
There is a pair of scissors placed on the left side. When the black wavy fabric strip appears for the first time, what is the hand in the screen doing?", "question_wo_referring_query": "What is the hand in the screen doing?", "candidates": ["Pasting the black wavy fabric strip onto the craft", "Sewing the black wavy fabric strip onto the craft", "Marking on the black wavy fabric strip", "Placing the black wavy fabric strip on the craft", "Trimming the black wavy fabric strip with the scissors"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "8OdjXijKTwU_0", "video_path": "8OdjXijKTwU.mp4", "subtitle_path": "8OdjXijKTwU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.42, "view_count": 1746}, {"video_id": "8OdjXijKTwU", "question": "On a white marble desktop, there's an intricate craft with interwoven purple and orange fabric strips. When a dark green fabric strip first appears, what is the hand in the screen doing?", "question_wo_referring_query": "What is the hand in the screen doing?", "candidates": ["Sewing the green fabric strip onto the yellow one", "Sewing the green fabric strip onto the orange one", "Sticking the green fabric strip onto the orange one", "Sewing the green fabric strip onto the purple one", "Holding a dark green fabric strip and interweaving it with the orange one"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "8OdjXijKTwU_1", "video_path": "8OdjXijKTwU.mp4", "subtitle_path": "8OdjXijKTwU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.42, "view_count": 1746}, {"video_id": "8OdjXijKTwU", "question": "On a white marble table, there is a handcrafted item made with strips of purple, orange, yellow, and green fabric. There is a pair of scissors on the left. 
When the black marker first appears, what are the hands with green gloves doing on the screen?", "question_wo_referring_query": "On a white marble table, there is a handcrafted item made with strips of purple, orange, yellow, and green fabric. There is a pair of scissors on the left. When the black marker first appears, what are the hands with green gloves doing on the screen?", "candidates": ["Drawing a design on the fabric strips with the marker", "Writing on white paper", "Refilling the marker with ink", "Cutting the fabric strips with scissors", "Drawing on a notebook"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "8OdjXijKTwU_2", "video_path": "8OdjXijKTwU.mp4", "subtitle_path": "8OdjXijKTwU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.42, "view_count": 1746}, {"video_id": "V2L9S67D9dU", "question": "On the beige wall, there are different colored speech bubbles and cartoon pencil images with words on them. There is a girl with long curly hair wearing a black coat and black pants on the right. 
Who is the person sitting next to her, listening to her talking?", "question_wo_referring_query": "Who is it?", "candidates": ["A girl with golden hair", "A girl with long hair wearing a black outfit", "A boy wearing a black coat, white shirt, tie, and glasses", "A girl wearing a short-sleeved dress", "A reporter wearing a black outfit"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "V2L9S67D9dU_0", "video_path": "V2L9S67D9dU.mp4", "subtitle_path": "V2L9S67D9dU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 934.76, "view_count": 1678}, {"video_id": "V2L9S67D9dU", "question": "Who is the person standing outside the dark green train carriage, to the left of the man with graying hair wearing a black coat and dark blue jeans?", "question_wo_referring_query": "Who is it?", "candidates": ["The man with dark brown hair wearing a shiny green work uniform", "The man with short dark brown hair wearing denim overalls", "The man with black hair wearing a black uniform", "The man with curly hair wearing a denim jacket", "The man with dark brown hair wearing a shiny red work uniform"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "V2L9S67D9dU_1", "video_path": "V2L9S67D9dU.mp4", "subtitle_path": "V2L9S67D9dU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 934.76, "view_count": 1678}, {"video_id": "V2L9S67D9dU", "question": "On the yellow-green grass field, there are three trash bins: the left one is blue, the middle one is yellow, and the right one is green. 
What is the object that a hand throws into the yellow trash bin in the video?", "question_wo_referring_query": "What is the object thrown?", "candidates": ["glass cup", "aluminum can", "toilet paper", "plastic water bottle", "paper cup"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "V2L9S67D9dU_2", "video_path": "V2L9S67D9dU.mp4", "subtitle_path": "V2L9S67D9dU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 934.76, "view_count": 1678}, {"video_id": "QqwDvxLVZII", "question": "Against a grey-white background, on the upper left side there is a circular design with radiating gradient red rays, topped with the black English text 'Special Attack Corps'. When the subtitles 'Now to give you a better understanding of chemical attacks. Let's look at some characteristics of the raids by the Special Attack Corps and' appear, what color is the circle on the left side of the screen?", "question_wo_referring_query": "What color is the circle on the left side of the screen?", "candidates": ["Yellow", "Black", "Red", "Green", "Blue"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "QqwDvxLVZII_0", "video_path": "QqwDvxLVZII.mp4", "subtitle_path": "QqwDvxLVZII_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1378.37, "view_count": 464578}, {"video_id": "QqwDvxLVZII", "question": "On the screen, there are white, gray, and red labels marked with 1, 2, and 3, respectively. At the top, on the red rectangular label, there are five black circular icons. Below these icons, there is some English text. 
When the subtitle 'In order to counter kamikazes, various methods were used and existing methods like combat air patrol were adapted or improved' appears, what does the arrow inside the third circular icon look like?", "question_wo_referring_query": "What does the arrow inside the third circular icon look like?", "candidates": ["Curved and pointing diagonally upwards", "Straight and pointing downwards", "Curved and pointing downwards", "Curved and pointing diagonally downwards", "Straight and pointing upwards"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "QqwDvxLVZII_1", "video_path": "QqwDvxLVZII.mp4", "subtitle_path": "QqwDvxLVZII_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1378.37, "view_count": 464578}, {"video_id": "QqwDvxLVZII", "question": "On the upper part of a gray and white background, there are images of two airplanes. There is a black circular icon on each of the left and right sides. When the subtitle '120 or more aircraft might be assigned for radar picket patrol. Be aware that these picket ships were' appears, what color are the images of the two airplanes on the screen?", "question_wo_referring_query": "What color are the images of the two airplanes on the screen?", "candidates": ["Dark Blue", "Light Blue", "Military Green", "White", "Gray"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "QqwDvxLVZII_2", "video_path": "QqwDvxLVZII.mp4", "subtitle_path": "QqwDvxLVZII_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1378.37, "view_count": 464578}, {"video_id": "nOBm4aYEYR4", "question": "There is black English text 'earning' framed in red on the left side of the white PPT. In the middle, there is a deep blue square. At the bottom, there is a dashed line framed in red. 
What color is the dashed line that is framed?", "question_wo_referring_query": "What color is the dashed line that is framed?", "candidates": ["White", "Green", "Purple", "Red", "Deep Blue"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "nOBm4aYEYR4_0", "video_path": "nOBm4aYEYR4.mp4", "subtitle_path": "nOBm4aYEYR4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2743.83, "view_count": 87767}, {"video_id": "nOBm4aYEYR4", "question": "The white background is filled with red and blue subtitles and symbols. In the middle of the screen, there are two lines marked with green highlighter. In the lower right corner, there is a man wearing a black short-sleeved shirt and sunglasses. What is this man's hairstyle?", "question_wo_referring_query": "What is this man's hairstyle?", "candidates": ["Long black hair", "Bald", "Black curly hair", "Short curly hair", "Sparse hair"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "nOBm4aYEYR4_1", "video_path": "nOBm4aYEYR4.mp4", "subtitle_path": "nOBm4aYEYR4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2743.83, "view_count": 87767}, {"video_id": "nOBm4aYEYR4", "question": "The left side of the white PPT shows a half-display of an English article, and there are two incomplete images in the upper left corner. The blank area on the right is filled with handwritten red and dark blue characters in a messy manner. In the lower right corner, there is a man with messy hair, wearing black top and sunglasses. 
What kind of top is this man wearing?", "question_wo_referring_query": "What kind of top is this man wearing?", "candidates": ["short sleeve", "hoodie", "suit", "leather jacket", "long sleeve"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "nOBm4aYEYR4_2", "video_path": "nOBm4aYEYR4.mp4", "subtitle_path": "nOBm4aYEYR4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2743.83, "view_count": 87767}, {"video_id": "OqkPdJgGFfE", "question": "On a dark green background, there are three round icons with graphics. At the bottom of the screen, there is a gray-green tank. When the subtitle \"book about the T-90, Zaloga noted the following: 'The list price per round in the early 1990s was '\" appears, what object is present in the scene?", "question_wo_referring_query": "On a dark green background, there are three round icons with graphics. At the bottom of the screen, there is a gray-green tank. When the subtitle \"book about the T-90, Zaloga noted the following: 'The list price per round in the early 1990s was '\" appears, what object is present in the scene?", "candidates": ["Cross", "Question mark", "Medal", "Soldier", "Rocket"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "OqkPdJgGFfE_0", "video_path": "OqkPdJgGFfE.mp4", "subtitle_path": "OqkPdJgGFfE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1023.16, "view_count": 192284}, {"video_id": "OqkPdJgGFfE", "question": "In the top part of the dark green background, there is a title written as 'The Wind of Chang.' In the middle, there are five rectangular boxes with a red background, each containing six circular icons. In the bottom right, there is an image of a tank. 
When the subtitle 'Fifth, there are also anti-tank guided missiles' appears, what object is present on the screen?", "question_wo_referring_query": "In the top part of the dark green background, there is a title written as 'The Wind of Chang.' In the middle, there are five rectangular boxes with a red background, each containing six circular icons. In the bottom right, there is an image of a tank. When the subtitle 'Fifth, there are also anti-tank guided missiles' appears, what object is present on the screen?", "candidates": ["Rocket", "Ribbon", "Question mark", "Soldier", "Tube"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "OqkPdJgGFfE_1", "video_path": "OqkPdJgGFfE.mp4", "subtitle_path": "OqkPdJgGFfE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1023.16, "view_count": 192284}, {"video_id": "OqkPdJgGFfE", "question": "A bald man wearing a taupe coat, dark blue jeans, and a blue mask is standing to the right side of an army green tank, facing the camera and talking. When the subtitle 'missiles haven't so far replaced guns in tanks. I originally recorded the museum's parts of this' appears, what object is present on the screen?", "question_wo_referring_query": "A bald man wearing a taupe coat, dark blue jeans, and a blue mask is standing to the right side of an army green tank, facing the camera and talking. When the subtitle 'missiles haven't so far replaced guns in tanks. 
I originally recorded the museum's parts of this' appears, what object is present on the screen?", "candidates": ["car tire", "missile model", "airplane model", "camera", "sunglasses"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "OqkPdJgGFfE_2", "video_path": "OqkPdJgGFfE.mp4", "subtitle_path": "OqkPdJgGFfE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1023.16, "view_count": 192284}, {"video_id": "ZgyEi-fjfy4", "question": "The background is a room with a wooden staircase. A woman with ash-colored hair in a gray sleeveless top is talking to the camera. In the upper right corner of the screen, there is a sketch of an old person with a yellow background. In the lower left corner, there is a white label with black English text. What object is present in this scene?", "question_wo_referring_query": "The background is a room with a wooden staircase. A woman with ash-colored hair in a gray sleeveless top is talking to the camera. In the upper right corner of the screen, there is a sketch of an old person with a yellow background. In the lower left corner, there is a white label with black English text. What object is present in this scene?", "candidates": ["kitchen", "washing machine", "black backpack", "refrigerator", "TV"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "ZgyEi-fjfy4_0", "video_path": "ZgyEi-fjfy4.mp4", "subtitle_path": "ZgyEi-fjfy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.13, "view_count": 5200}, {"video_id": "ZgyEi-fjfy4", "question": "The setting is a room with rice-white walls. Towards the rear-right, there is a black backpack hanging on a gray chair. A woman with ash-brown hair wearing a gray sleeveless top is holding a black pencil and speaking towards the camera. 
What object is present in this scene?", "question_wo_referring_query": "What object is present in this scene?", "candidates": ["display screen", "computer", "flat-bottomed pan", "cutting board", "wooden ladder"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "ZgyEi-fjfy4_1", "video_path": "ZgyEi-fjfy4.mp4", "subtitle_path": "ZgyEi-fjfy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.13, "view_count": 5200}, {"video_id": "ZgyEi-fjfy4", "question": "On a pink desktop, a pair of hands is holding a green pencil, drawing a rough sketch of a human head on white paper. What objects are present in the scene at this moment?", "question_wo_referring_query": "What objects are present in the scene at this moment?", "candidates": ["Pencil sharpener", "Planter", "Notebook", "Eraser", "Pen"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "ZgyEi-fjfy4_2", "video_path": "ZgyEi-fjfy4.mp4", "subtitle_path": "ZgyEi-fjfy4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.13, "view_count": 5200}, {"video_id": "KyP6ALCtIpc", "question": "At the beginning of the video, a chubby man wearing glasses and a cream-colored sweater is riding a bicycle on a path among greenery. A man with curly hair wearing a white short-sleeved shirt rides past him. When the subtitle 'female teacher scolds him for his messy' appears, what change occurs to this man?", "question_wo_referring_query": "At the beginning of the video, a chubby man wearing glasses and a cream-colored sweater is riding a bicycle on a path among greenery. A man with curly hair wearing a white short-sleeved shirt rides past him. 
When the subtitle 'female teacher scolds him for his messy' appears, what change occurs to this man?", "candidates": ["He changes from riding a bicycle to standing", "He goes from not wearing glasses to wearing glasses", "He changes from a white short-sleeved shirt to a white long-sleeved shirt", "He changes from a white short-sleeved shirt to a black short-sleeved shirt", "He changes from curly hair to straight hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "KyP6ALCtIpc_0", "video_path": "KyP6ALCtIpc.mp4", "subtitle_path": "KyP6ALCtIpc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 23, "duration": 10.0, "view_count": 24444}, {"video_id": "jQoPlUOzPfc", "question": "In a scene with white text 'BOOKER', a man with a mustache is putting chopsticks in his mouth. When this man with a mustache appears in a dark scene without the white text 'BOOKER', what change occurs to the man with a mustache?", "question_wo_referring_query": "What change occurs to the man with a mustache?", "candidates": ["He goes from having a mustache to being clean-shaven.", "He goes from wearing a white shirt to wearing a blue shirt.", "He goes from wearing a black shirt to wearing a dark blue shirt.", "He goes from wearing a blue shirt to wearing a white shirt.", "He goes from having chopsticks in his mouth to having nothing in his mouth."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "jQoPlUOzPfc_0", "video_path": "jQoPlUOzPfc.mp4", "subtitle_path": "jQoPlUOzPfc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 231, "duration": 11.98, "view_count": 466347}, {"video_id": "YdMaV0ujn0I", "question": "Sunlight streams through a window into a dimly lit room. By the window stand two people; one is wearing a black hoodie with tattoos on his neck, and the other is wearing a black jacket with half of his arm exposed. 
The person in the jacket is handing a gun to the person in the hoodie. When have the person in the hoodie and these subtitles appeared together?", "question_wo_referring_query": "When have the person in the hoodie and these subtitles appeared together?", "candidates": ["god bless you", "thank you", "and cover the window while he goes", "help me", "inside as he goes deeper guards spot him"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "YdMaV0ujn0I_0", "video_path": "YdMaV0ujn0I.mp4", "subtitle_path": "YdMaV0ujn0I_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 882, "duration": 9.01, "view_count": 1991}, {"video_id": "wE9iPlFtbv0", "question": "A man with black hair wearing a coat and hat appears in front of a large gate in an alley, he is fleeing hurriedly. In which of the following places has this man also appeared?", "question_wo_referring_query": "In which of the following places has this man also appeared?", "candidates": ["Inside the restroom", "On the rooftop", "On the lawn", "Inside the conference room", "The entrance of a security checkpoint with a computer nearby"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "wE9iPlFtbv0_0", "video_path": "wE9iPlFtbv0.mp4", "subtitle_path": "wE9iPlFtbv0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 917, "duration": 14.0, "view_count": 65072}, {"video_id": "nGhxB4Jfhr0", "question": "In a living room with hanging picture frames, sitting to the right of a dining table filled with food, what is a black-haired woman wearing a yellow short skirt and holding chopsticks doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Playing with her phone", "Brushing her hair", "Eating", "Singing", "Watching TV"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "nGhxB4Jfhr0_0", 
"video_path": "nGhxB4Jfhr0.mp4", "subtitle_path": "nGhxB4Jfhr0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 10, "duration": 43.01, "view_count": 60934}, {"video_id": "6p8p1MEibPo", "question": "On a road, a yellow car is speeding quickly. The people inside the car are looking out of the car windows. What object appears in this scene?", "question_wo_referring_query": "What object appears in this scene?", "candidates": ["rocket", "airplane", "ship", "ocean", "house"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "6p8p1MEibPo_0", "video_path": "6p8p1MEibPo.mp4", "subtitle_path": "6p8p1MEibPo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 131, "duration": 22.99, "view_count": 549654}, {"video_id": "vKbRujlR8I0", "question": "In a dimly lit room with a desk, there is a man wearing a tie and a blue shirt standing next to a woman. The subtitle says 'The next day, Ruby resigns to Paul and says she's always been attracted to Paul'. What object appears on the screen at this moment?", "question_wo_referring_query": "What object appears on the screen at this moment?", "candidates": ["Table lamp", "Water dispenser", "Pillow", "Refrigerator", "Bicycle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "vKbRujlR8I0_0", "video_path": "vKbRujlR8I0.mp4", "subtitle_path": "vKbRujlR8I0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 482, "duration": 44.0, "view_count": 7628}, {"video_id": "pwOWoq2ru5Q", "question": "Outside a house on level ground where a red car is parked, a man with a bald head, holding a gun, comes out of a white car. 
What color clothes is he wearing?", "question_wo_referring_query": "What color clothes is he wearing?", "candidates": ["pink", "black", "blue", "green", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "pwOWoq2ru5Q_0", "video_path": "pwOWoq2ru5Q.mp4", "subtitle_path": "pwOWoq2ru5Q_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 355, "duration": 16.98, "view_count": 7880}, {"video_id": "ocCoJetow8c", "question": "On a white hospital bed, a man with black hair styled as Liu Hai, covered with a white blanket, raises a hand with an attached tube. The subtitles say, 'Bob wakes up in a hospital and acts strangely. He moves his legs and hands slowly.' What color is the man's shirt?", "question_wo_referring_query": "What color is the man's shirt?", "candidates": ["Yellow", "Purple", "Red", "Black", "White"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "ocCoJetow8c_0", "video_path": "ocCoJetow8c.mp4", "subtitle_path": "ocCoJetow8c_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 237, "duration": 15.02, "view_count": 2259}, {"video_id": "thSv2mPKl4A", "question": "There are three characters on the screen. The man on the far left is smiling while looking forward. The man in the middle is wearing a hat and a yellow shirt, and the man on the far right is holding a cigarette in his mouth. Who is holding a stick?", "question_wo_referring_query": "There are three characters on the screen. The man on the far left is smiling while looking forward. The man in the middle is wearing a hat and a yellow shirt, and the man on the far right is holding a cigarette in his mouth. 
Who is holding a stick?", "candidates": ["The man on the far right with a cigarette", "The man on the far left with a smile", "The middle person wearing a black dress (female)", "The man in the middle wearing a hat and a yellow shirt", "The middle person wearing a hat and a yellow shirt (female)"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "thSv2mPKl4A_0", "video_path": "thSv2mPKl4A.mp4", "subtitle_path": "thSv2mPKl4A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 358, "duration": 46.98, "view_count": 41516}, {"video_id": "MeliJuoMnsI", "question": "On the yellow sand, a blonde woman on the far left is guarding a glowing man lying on a blanket, and a woman on the far right wearing shorts is bending over. What is the woman on the far right doing?", "question_wo_referring_query": "What is the woman bending over on the far right doing?", "candidates": ["Putting on a skirt", "Putting on shoes", "Wrapping her lower legs with a white towel", "Putting on pants", "Taking off shoes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "MeliJuoMnsI_0", "video_path": "MeliJuoMnsI.mp4", "subtitle_path": "MeliJuoMnsI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 325, "duration": 41.0, "view_count": 6390836}, {"video_id": "bL8Vv_PQ7UY", "question": "What action did the man sitting in a yellow car take when the subtitles said 'drive the cab through the city away from the villains who give chase in their own cars'?", "question_wo_referring_query": "What action did he take?", "candidates": ["Was running", "Pointed a gun out of the car window", "Grabbed the steering wheel", "Opened the car door", "Pointed the gun at himself"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "bL8Vv_PQ7UY_0", "video_path": "bL8Vv_PQ7UY.mp4", "subtitle_path": 
"bL8Vv_PQ7UY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 317, "duration": 23.0, "view_count": 208650}, {"video_id": "YFjqxbPO2VE", "question": "Kneeling on the yellow sand, after a curly-haired woman in white clothes was hit by a man in front of her with a stone, what happened?", "question_wo_referring_query": "What happened?", "candidates": ["The woman fell to the ground", "The woman held her head in pain and cried", "The woman lowered her head and cried", "The woman stood up", "The woman crawled on the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "YFjqxbPO2VE_0", "video_path": "YFjqxbPO2VE.mp4", "subtitle_path": "YFjqxbPO2VE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 476, "duration": 13.01, "view_count": 1297674}, {"video_id": "T9ihc0KhbDU", "question": "After the screen shows three characters standing in front of the iron railing, which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["A woman with black hair wearing a hat", "A man in armor standing under a black night sky", "A woman with blonde hair", "A woman with black hair", "A woman with red hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "T9ihc0KhbDU_0", "video_path": "T9ihc0KhbDU.mp4", "subtitle_path": "T9ihc0KhbDU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 382, "duration": 9.01, "view_count": 48043}, {"video_id": "NZJ9iyICyrI", "question": "A man with black hair, dressed in a blue fleece jacket and a white shirt, is lying on the ground in the video. After the subtitle says 'and she writes out the number on the advert and gives it to Guy. 
Guy calls the number and Zoley...', what happens on the screen?", "question_wo_referring_query": "what happens on the screen?", "candidates": ["A woman is sitting in front of a computer", "A woman is drinking coffee", "A man is holding a phone and dialing the number written on the paper", "A man is sitting in front of a computer", "A man is drinking coffee"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "NZJ9iyICyrI_0", "video_path": "NZJ9iyICyrI.mp4", "subtitle_path": "NZJ9iyICyrI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 65, "duration": 14.0, "view_count": 3927}, {"video_id": "ZjIP0kiDim4", "question": "A man wearing a hat and thick cotton clothes on the ground stretches out his arms and spins around, saying in the subtitles, 'like this. After that, he skated and slid on the ice like a little child.' What appeared on screen first after that?", "question_wo_referring_query": "What appeared on screen first?", "candidates": ["A book", "A bicycle", "A pair of chopsticks", "A chair", "A pot"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "ZjIP0kiDim4_0", "video_path": "ZjIP0kiDim4.mp4", "subtitle_path": "ZjIP0kiDim4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 39, "duration": 8.0, "view_count": 22437}, {"video_id": "ORoDJlCGBEI", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a woman in a red dress is holding a black microphone and singing on stage. Then, a woman wearing a hat and holding a mobile phone and a handbag appears at the door of a room. Finally, a man wearing a black tie stands in front of a long-haired girl.", "First, a woman in a red dress is holding a black microphone and singing on stage. 
Then, a man wearing a black tie stands in front of a long-haired girl. Finally, a woman wearing a hat and holding a mobile phone and a handbag appears at the door of a room.", "First, a man wearing a black tie stands in front of a long-haired girl. Then, a woman in a red dress is holding a black microphone and singing on stage. Finally, a woman wearing a hat and holding a mobile phone and a handbag appears at the door of a room.", "First, a woman wearing a hat and holding a mobile phone and a handbag appears at the door of a room. Then, a man wearing a black tie stands in front of a long-haired girl. Finally, a woman in a red dress is holding a black microphone and singing on stage.", "First, a woman wearing a hat and holding a mobile phone and a handbag appears at the door of a room. Then, a woman in a red dress is holding a black microphone and singing on stage. Finally, a man wearing a black tie stands in front of a long-haired girl."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "ORoDJlCGBEI_0", "video_path": "ORoDJlCGBEI.mp4", "subtitle_path": "ORoDJlCGBEI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 187, "duration": 10.01, "view_count": 100789}, {"video_id": "vUuob4SVfvo", "question": "At the beginning of the video, where else does the woman with black hair, writing on a calendar that shows the number 10, appear?", "question_wo_referring_query": "Where else does she appear?", "candidates": ["In a room on the far left where a black-skinned woman holding a yellow towel is standing.", "In a room on the far right where a black-skinned woman holding a yellow towel is standing.", "In a room on the far left where a black-skinned woman holding a red towel is standing.", "In a room on the far left where a black-skinned man holding a black towel is standing.", "In a room in the middle where a black-skinned man holding a white towel is standing."], "topic_category": "Recreational: 
MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "vUuob4SVfvo_0", "video_path": "vUuob4SVfvo.mp4", "subtitle_path": "vUuob4SVfvo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 319, "duration": 10.01, "view_count": 476505}, {"video_id": "cxJeFLTy0-U", "question": "Standing in the middle of a stage with three paintings hanging behind, which subtitles have appeared together with a man dressed in a black top and black pants holding a megaphone?", "question_wo_referring_query": "Which subtitles have appeared together with them?", "candidates": ["saying that no", "the girls to a piece of gum", "The next day", "and what is considered as sins in the bible", "And ever since the ritual"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "cxJeFLTy0-U_0", "video_path": "cxJeFLTy0-U.mp4", "subtitle_path": "cxJeFLTy0-U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 192, "duration": 11.98, "view_count": 9805}, {"video_id": "UIU8bOLxHjQ", "question": "At the beginning of the video, a man with black hair styled in bangs and wearing a white shirt appears in a shopping mall. 
What change happens to this man's clothing when he is at the desk in the classroom?", "question_wo_referring_query": "What change happens to this man's clothing?", "candidates": ["He has put on a black tie", "He has put on a yellow tie", "He has put on a blue tie", "He has put on a purple tie", "He has put on a red tie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "UIU8bOLxHjQ_0", "video_path": "UIU8bOLxHjQ.mp4", "subtitle_path": "UIU8bOLxHjQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 142, "duration": 13.0, "view_count": 13086}, {"video_id": "gJzqh08G85A", "question": "In the scene, a woman dressed in a white nurse uniform and wearing a white nurse hat walks past the corridor outside the ward door. What object is present in this scene?", "question_wo_referring_query": "What object is present in this scene?", "candidates": ["white chair", "green chair", "black chair", "yellow chair", "red chair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "gJzqh08G85A_0", "video_path": "gJzqh08G85A.mp4", "subtitle_path": "gJzqh08G85A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 406, "duration": 14.02, "view_count": 153097}, {"video_id": "K5HyWB5Aj0g", "question": "In the dimly lit car, a man sitting in the driver's seat looks at a woman beside him. 
When the subtitles say 'taunting virgins like Ito into giving in,' what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["steering wheel", "mirror", "toothbrush", "cup", "cell phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "K5HyWB5Aj0g_0", "video_path": "K5HyWB5Aj0g.mp4", "subtitle_path": "K5HyWB5Aj0g_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 270, "duration": 8.01, "view_count": 1884217}, {"video_id": "rPJTYkAzsqs", "question": "In the scene, golden sunlight is shining on a building with a narrow corridor. The building has dark shadows, and a man with some hair and wearing a black suit is walking in the front. Who is shown kneeling on the ground in the scene?", "question_wo_referring_query": "Who is shown kneeling on the ground in the scene?", "candidates": ["A man wearing a white coat and white pants", "A man wearing a white coat and black pants", "A man wearing a red coat and black pants", "A man wearing a yellow coat and black pants", "A man wearing a black coat and a white shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "rPJTYkAzsqs_0", "video_path": "rPJTYkAzsqs.mp4", "subtitle_path": "rPJTYkAzsqs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 76, "duration": 11.01, "view_count": 73251}, {"video_id": "--mUOD9Tok4", "question": "In the position in front of the man wearing a hat and a blue shirt, what is the bald man in a white short-sleeved shirt and overalls doing when he first appears?", "question_wo_referring_query": "What is he doing when he first appears?", "candidates": ["Rubbing his face", "Touching his head", "Touching his nose", "Touching his lower lip", "Touching his eyes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "--mUOD9Tok4_0", 
"video_path": "--mUOD9Tok4.mp4", "subtitle_path": "--mUOD9Tok4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 183, "duration": 13.01, "view_count": 476940}, {"video_id": "d2XpTyQtIfk", "question": "In a dark room, when the subtitles say 'but he takes pictures of them too. When Bo ferrs her, Mazcy tells them to leave', what happened to the woman in gray suspenders and black shorts on the screen?", "question_wo_referring_query": "What happened to the woman on the screen?", "candidates": ["The woman kneeled on the bed", "The woman stood on the bed", "The woman sat on the bed", "The woman was lying on the bed", "The woman was lying on the sofa"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "d2XpTyQtIfk_0", "video_path": "d2XpTyQtIfk.mp4", "subtitle_path": "d2XpTyQtIfk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 329, "duration": 12.0, "view_count": 4504}, {"video_id": "NBHW6k-qVSI", "question": "In a dimly lit room with lit torches on the walls, a beam of light shines through a window with square patterns, illuminating the room. In front of the window, a man standing in a horse stance with his fists clenched looks to his left. 
What does the man do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["Picking up a torch from the wall", "Fighting with a person wearing armor", "Standing by the window looking outside", "Sitting on the window sill", "Sitting on the ground making a fire"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "NBHW6k-qVSI_0", "video_path": "NBHW6k-qVSI.mp4", "subtitle_path": "NBHW6k-qVSI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 449, "duration": 9.01, "view_count": 31883}, {"video_id": "U8iajbW_iPk", "question": "After a man with curly hair wearing a black vest appears under the orange sky at the beginning of the video, who is the next character to appear?", "question_wo_referring_query": "Who is the next character to appear?", "candidates": ["A man with curly hair wearing a green suit and sunglasses", "A man wearing a black suit, white shirt, and black sunglasses", "A man with curly hair wearing a yellow suit and sunglasses", "A person wearing a white vest and a necklace", "A woman with curly hair wearing a purple suit and sunglasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "U8iajbW_iPk_0", "video_path": "U8iajbW_iPk.mp4", "subtitle_path": "U8iajbW_iPk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 160, "duration": 12.98, "view_count": 210869}, {"video_id": "WqEY3F7GVnU", "question": "Before the subtitle that says 'On the other hand, Elliot says, annoyed that he has found a real alien,' what happened on the left side of the screen where someone wearing a striped shirt picked up a drawing made by a boy in a red and white checkered shirt, sitting at a mahogany desk?", "question_wo_referring_query": "What happened on the screen?", "candidates": ["A boy wearing a blue hoodie was riding a bicycle on the road", "A boy wearing a green hoodie was skateboarding on the 
road", "A boy wearing a red hoodie was riding a bicycle on the road", "A boy wearing a multicolored hoodie was skateboarding on the road", "A boy wearing a purple hoodie was riding a bicycle on the road"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "WqEY3F7GVnU_0", "video_path": "WqEY3F7GVnU.mp4", "subtitle_path": "WqEY3F7GVnU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 237, "duration": 10.01, "view_count": 860}, {"video_id": "SmIQM36HFPs", "question": "A person with fresh red blood stains on their face is staring wide-eyed at the camera. After the subtitle says, 'transcendence. The movie zooms into Anna's eyes which look extremely hollow and empty to me,' which character appears first on the screen?", "question_wo_referring_query": "Which character appears first on the screen?", "candidates": ["A blonde woman wearing a red dress", "A black-haired woman wearing a blue dress", "A blonde woman wearing a white dress", "A blonde woman wearing a black dress", "A red-haired woman wearing a black dress"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "SmIQM36HFPs_0", "video_path": "SmIQM36HFPs.mp4", "subtitle_path": "SmIQM36HFPs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 504, "duration": 8.01, "view_count": 7198}, {"video_id": "PkHPaAPAQu4", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, two people sit on a red sofa with a painting hanging behind it. Then, a blonde woman in a black suit stands next to three round balloons in front of a window. Finally, at night, a car drives on a road with dim lights.", "First, a blonde woman in a black suit stands next to three round balloons in front of a window. Then, at night, a car drives on a road with dim lights. 
Finally, two people sit on a red sofa with a painting hanging behind it.", "First, a blonde woman in a black suit stands next to three round balloons in front of a window. Then, two people sit on a red sofa with a painting hanging behind it. Finally, at night, a car drives on a road with dim lights.", "First, two people sit on a red sofa with a painting hanging behind it. Then, at night, a car drives on a road with dim lights. Finally, a blonde woman in a black suit stands next to three round balloons in front of a window.", "First, at night, a car drives on a road with dim lights. Then, two people sit on a red sofa with a painting hanging behind it. Finally, a blonde woman in a black suit stands next to three round balloons in front of a window."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "PkHPaAPAQu4_0", "video_path": "PkHPaAPAQu4.mp4", "subtitle_path": "PkHPaAPAQu4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 43, "duration": 10.01, "view_count": 1197379}, {"video_id": "r6SP_FKSVPU", "question": "What other locations have the woman in white short sleeves and shorts appeared in, aside from sitting next to the man in white who is on the phone on the white sofa?", "question_wo_referring_query": "What other locations has she appeared in?", "candidates": ["Kitchen", "Restroom", "Dining hall", "Park", "Outside the elevator door"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "r6SP_FKSVPU_0", "video_path": "r6SP_FKSVPU.mp4", "subtitle_path": "r6SP_FKSVPU_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 274, "duration": 14.0, "view_count": 41635}, {"video_id": "WT_wNCTUQyk", "question": "Sitting in front of a dining table with food on it, which subtitles appear together with a man holding a gun and wearing a yellow coat?", "question_wo_referring_query": "Which subtitles appear together with them?", 
"candidates": ["involvement with Winter", "Olsen asks Kevin to tell him everything he knows about his involvement with Winter", "Olsen asks Kevin to tell him everything", "In the opening scene, Agent Eric Olsen is interrogating Kevin Weeks", "A quick warning, there will be major spoilers ahead"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "WT_wNCTUQyk_0", "video_path": "WT_wNCTUQyk.mp4", "subtitle_path": "WT_wNCTUQyk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 12, "duration": 9.0, "view_count": 16010}, {"video_id": "1pKWOimO1GA", "question": "What is the woman, dressed in red and with disheveled hair, who is leaning against the white pillar with red blood stains on her shoulder, doing in the video?", "question_wo_referring_query": "What is she doing?", "candidates": ["Singing", "Dancing", "Looking up at the sky", "Crying", "Looking down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "1pKWOimO1GA_0", "video_path": "1pKWOimO1GA.mp4", "subtitle_path": "1pKWOimO1GA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 413, "duration": 12.0, "view_count": 1153784}, {"video_id": "UmpR4B8KpZ4", "question": "In the dimly lit audience, many viewers are enthusiastically applauding and cheering for the performance on stage. 
Who can be seen in this scene?", "question_wo_referring_query": "Who can be seen in this scene?", "candidates": ["A woman in the audience not clapping, wearing a green coat and black inner clothes", "A woman in the audience not clapping, wearing a black dress", "A woman in the audience not clapping, wearing a red coat and black inner clothes", "A woman in the audience not clapping, wearing a red dress", "A woman in the audience not clapping, wearing a white coat and black inner clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "UmpR4B8KpZ4_0", "video_path": "UmpR4B8KpZ4.mp4", "subtitle_path": "UmpR4B8KpZ4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 68, "duration": 9.0, "view_count": 679099}, {"video_id": "y3QGdpHkfpw", "question": "A panda with black circles around its eyes lands on a firework stand that is decorated with a cannon, and the stand is surrounded by multicolored fireworks. When the subtitle mentions 'the panda lands on a firework stand and', what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["a bamboo stick", "noodles", "a string of red lanterns", "dumplings", "a steamed bun"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "y3QGdpHkfpw_0", "video_path": "y3QGdpHkfpw.mp4", "subtitle_path": "y3QGdpHkfpw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 884, "duration": 11.01, "view_count": 140573}, {"video_id": "gEPuBdGi8ws", "question": "Under the sky, several green spirits are gathered on the lush grass. The short round-headed spirit on the left is biting the hand of the tall spirit in the middle, who is wearing a hat and white glasses. 
What color is the sky at this time?", "question_wo_referring_query": "What color is the sky at this time?", "candidates": ["Blue", "Yellow", "Green", "White", "Pink"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "gEPuBdGi8ws_0", "video_path": "gEPuBdGi8ws.mp4", "subtitle_path": "gEPuBdGi8ws_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 166, "duration": 14.0, "view_count": 127638}, {"video_id": "jk3tfVuKO7Q", "question": "On a pitch-black night, the street is illuminated by some faint light, with very few pedestrians walking. Who is moving hurriedly beside the road with cars?", "question_wo_referring_query": "Who is moving hurriedly beside the road with cars?", "candidates": ["A woman wearing a yellow shirt", "A man wearing a white shirt", "A man wearing a black shirt", "A woman wearing a black shirt", "A woman wearing a white shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "jk3tfVuKO7Q_0", "video_path": "jk3tfVuKO7Q.mp4", "subtitle_path": "jk3tfVuKO7Q_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1122, "duration": 11.0, "view_count": 46707}, {"video_id": "Lm97zad8_lY", "question": "What did a man wearing a gray suit and sporting a mustache do the first time he appeared in front of a black table in a white room?", "question_wo_referring_query": "What did he do?", "candidates": ["Drank water", "Sat on a chair", "Fixed his hairstyle", "Read a book", "Shook hands with the woman in front of him"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Lm97zad8_lY_0", "video_path": "Lm97zad8_lY.mp4", "subtitle_path": "Lm97zad8_lY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 183, "duration": 12.0, "view_count": 121962}, {"video_id": "BUcu2lqINkw", "question": "In the pitch-black night, a woman dressed in a white shirt 
and white pants appears under a tree. The subtitle reads 'rings and the unsuspecting girl opens, but is shot in the face. Returning from her mission, the rebel.' What is she doing at this time?", "question_wo_referring_query": "What is she doing?", "candidates": ["dancing", "sitting on the ground", "running", "walking", "riding a bicycle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "BUcu2lqINkw_0", "video_path": "BUcu2lqINkw.mp4", "subtitle_path": "BUcu2lqINkw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 147, "duration": 10.0, "view_count": 152978}, {"video_id": "x2Uq8crt6GM", "question": "In a pitch-black scene, a blonde woman is looking down at something. What happens next in the scene?", "question_wo_referring_query": "What happens next in the scene?", "candidates": ["A plane is flying in the sky.", "A train is running on the tracks.", "A boat is floating on the sea.", "Three boats are floating on the sea.", "A rainbow appears in the sky."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "x2Uq8crt6GM_0", "video_path": "x2Uq8crt6GM.mp4", "subtitle_path": "x2Uq8crt6GM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 677, "duration": 9.01, "view_count": 951580}, {"video_id": "XRRV1VXSKpA", "question": "After a computer screen showing a video of people appears on a table in the room, which person appears first?", "question_wo_referring_query": "Which person appears first?", "candidates": ["A boy wearing black glasses and a tie", "A person wearing a white mask and yellow clothes", "A woman with long straight black hair wearing glasses", "A woman with long black curly hair wearing a hat", "A blonde woman with long curly hair wearing a hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "XRRV1VXSKpA_0", "video_path": "XRRV1VXSKpA.mp4", 
"subtitle_path": "XRRV1VXSKpA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 515, "duration": 10.0, "view_count": 82667}, {"video_id": "nECUdg1PtzA", "question": "On a sofa with a pillow, a boy with a Liu Hai hairstyle is sitting on the left holding a toy. After the subtitle 'thus getting them banished to the living room,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A man opens the window", "A man pulls open the curtains", "A woman in a red top lies on the bed", "A man opens the room door", "A woman in a green top lies on the bed"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "nECUdg1PtzA_0", "video_path": "nECUdg1PtzA.mp4", "subtitle_path": "nECUdg1PtzA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 94, "duration": 8.01, "view_count": 95077}, {"video_id": "1AbAtGY-2ks", "question": "In front of a house on the grass, a red-haired woman wearing a black coat is talking to two women who have their hands crossed in front of their chests. After the subtitle says 'Katherine a ride home along with Maureen. 
As Joel rearranges the over packed van,', what object appears first on the screen?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["An airplane", "A bicycle", "A scooter", "A baby stroller", "A red car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "1AbAtGY-2ks_0", "video_path": "1AbAtGY-2ks.mp4", "subtitle_path": "1AbAtGY-2ks_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 855, "duration": 13.98, "view_count": 56941}, {"video_id": "114Zz0q7Kqs", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a person sitting on a brown sofa with a gun is shot and falls to the ground, then a short-haired woman on the left side of the screen watches a man in black clothes in front of her, and finally, a bald man in a black shirt bends over.", "First, a bald man in a black shirt bends over, then a short-haired woman on the left side of the screen watches a man in black clothes in front of her, and finally, a person sitting on a brown sofa with a gun is shot and falls to the ground.", "First, a person sitting on a brown sofa with a gun is shot and falls to the ground, then a bald man in a black shirt bends over, and finally, a short-haired woman on the left side of the screen watches a man in black clothes in front of her.", "First, a short-haired woman on the left side of the screen watches a man in black clothes in front of her, then a bald man in a black shirt bends over, and finally, a person sitting on a brown sofa with a gun is shot and falls to the ground.", "First, a short-haired woman on the left side of the screen watches a man in black clothes in front of her, then a person sitting on a brown sofa with a gun is shot and falls to the ground, and finally, a bald man in a black shirt bends over."], "topic_category": 
"Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "114Zz0q7Kqs_0", "video_path": "114Zz0q7Kqs.mp4", "subtitle_path": "114Zz0q7Kqs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 674, "duration": 8.01, "view_count": 54391}, {"video_id": "C8wTD09gzWw", "question": "On the yellow desert, there is a sparse growth of plants. From a distance, we can see vehicles on the yellow sand between the plants on both sides. In which of the following places has this scene appeared?", "question_wo_referring_query": ", in which of the following places has this scene appeared?", "candidates": ["In front of a black iron tower", "In front of a red iron tower", "By the sea", "In front of a golden iron tower", "In front of a yellow iron tower"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "C8wTD09gzWw_0", "video_path": "C8wTD09gzWw.mp4", "subtitle_path": "C8wTD09gzWw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 700, "duration": 11.0, "view_count": 962215}, {"video_id": "EtS8UignYko", "question": "Sitting in front of a table with a cup on it, which subtitles have appeared together with a man in a black suit looking at his hands?", "question_wo_referring_query": "Which subtitles have appeared together?", "candidates": ["the same chair that Nick is sitting", "the same chair that Nick is sitting on. As Nick ponders his situation,", "Nick is sitting", "As Nick ponders his situation", "but notices that he doesn't have a reflection. 
He goes up to the mirror to see for himself and puts"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "EtS8UignYko_0", "video_path": "EtS8UignYko.mp4", "subtitle_path": "EtS8UignYko_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 225, "duration": 8.01, "view_count": 286359}, {"video_id": "LKNxGk_ee40", "question": "In a room, a woman in a green top is standing in front of a white window looking outside. What is a man in a blue shirt, who is sitting in front of a wooden table behind the woman, holding his chin with his hand, doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Watching TV", "Drinking tea", "Reading a newspaper", "Playing with a phone", "Using a computer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "LKNxGk_ee40_0", "video_path": "LKNxGk_ee40.mp4", "subtitle_path": "LKNxGk_ee40_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 697, "duration": 9.98, "view_count": 21746}, {"video_id": "2LYtuIOR50U", "question": "In a white room, there are several people standing. On the far left is a man wearing a black suit and a white shirt with a black tie. When the subtitle says 'has no pride and brought shame to the family. 
At Philippe's house, all his employees watch', who is present in the frame?", "question_wo_referring_query": "Who is present in the frame?", "candidates": ["A woman wearing a white baseball cap and a black shirt", "A woman wearing a black baseball cap and a black shirt", "A man wearing a black round hat and a red shirt", "A man wearing a white round hat and a black shirt", "A man wearing a black round hat and a black shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "2LYtuIOR50U_0", "video_path": "2LYtuIOR50U.mp4", "subtitle_path": "2LYtuIOR50U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1080, "duration": 12.0, "view_count": 173052}, {"video_id": "eSIHvLOvcJg", "question": "On a stone path with green grass on both sides, two people are walking side by side at the front. On the left side of the screen, a leg wearing blue jeans appears. What style of shoes is this leg wearing with the blue jeans?", "question_wo_referring_query": "What style of shoes is this leg wearing with the blue jeans?", "candidates": ["Short boots", "Leather shoes", "Canvas shoes", "Sandals", "Long boots"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "eSIHvLOvcJg_0", "video_path": "eSIHvLOvcJg.mp4", "subtitle_path": "eSIHvLOvcJg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 87, "duration": 12.0, "view_count": 90209}, {"video_id": "sU6GnZDUNso", "question": "In a dark room, a person appeared at the doorway, where there's a curtain with a bright white light behind it. 
Who appeared at this doorway?", "question_wo_referring_query": ", who appeared at this doorway?", "candidates": ["A person with two red circular light rings on their waist", "A person with two purple circular light rings on their eyes", "A person with two red circular light rings on their eyes", "A person with two white circular light rings on their eyes", "A person with two red circular light rings on their leg"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "sU6GnZDUNso_0", "video_path": "sU6GnZDUNso.mp4", "subtitle_path": "sU6GnZDUNso_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 879, "duration": 9.01, "view_count": 146216}, {"video_id": "aq2T9z6NOdI", "question": "What is the creature with white round glasses, white eyebrows, and a big nose doing at the end of the video?", "question_wo_referring_query": "What is it doing?", "candidates": ["closing its eyes", "touching its mustache", "opening its mouth", "touching its eyebrows", "staring"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "aq2T9z6NOdI_0", "video_path": "aq2T9z6NOdI.mp4", "subtitle_path": "aq2T9z6NOdI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 480, "duration": 9.01, "view_count": 105143}, {"video_id": "v6w-7q_9ZRU", "question": "On a busy street with many vehicles passing by, what did the woman wearing a black top and passing by a mirror say when the subtitle mentioned 'destroys technology no disease no fear'?", "question_wo_referring_query": "What did she do?", "candidates": ["Tying her shoelaces", "Pushing a baby stroller", "Looking at her phone", "Riding a scooter", "Riding a bicycle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "v6w-7q_9ZRU_0", "video_path": "v6w-7q_9ZRU.mp4", "subtitle_path": "v6w-7q_9ZRU_en.json", "duration_group": 60, 
"starting_timestamp_for_subtitles": 576, "duration": 11.01, "view_count": 117656}, {"video_id": "RqZZoJkX1Fg", "question": "After a woman holding a green handbag appears to the left behind a pot of green plants, who is the first person to appear below?", "question_wo_referring_query": "Who is the first person to appear below?", "candidates": ["A woman wearing a black skirt", "A woman wearing a black choker and a blue camisole dress", "A woman wearing a black choker and a red camisole dress", "A woman with earrings and Liu Hai hairstyle", "A man with a Liu Hai hairstyle holding a cup"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "RqZZoJkX1Fg_0", "video_path": "RqZZoJkX1Fg.mp4", "subtitle_path": "RqZZoJkX1Fg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 285, "duration": 11.0, "view_count": 422632}, {"video_id": "jKTbQ6IXwPI", "question": "Under the pink sky, two people with hats standing in front of the sea are looking at each other. After the subtitle says 'this has turned everybody against her,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The two people hug each other.", "The two people wave towards the sea.", "A person waves towards the sea.", "A person kneels down.", "A person walks onto a boat."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "jKTbQ6IXwPI_0", "video_path": "jKTbQ6IXwPI.mp4", "subtitle_path": "jKTbQ6IXwPI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 63, "duration": 12.98, "view_count": 486752}, {"video_id": "TfH_nTONAFk", "question": "In a room, standing to the far right in front of the curtain is a man wearing a black suit and a black bow tie, and on the far left, a man in a red collar is lifting a round pot lid. 
Before the subtitle \"During this, they encounter an assassin disguised as a waiter attempting to kill Leo\" appears, what object appears first on the screen?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["A bicycle", "A pot of roses", "A bed", "A pot of green plants", "A chair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "TfH_nTONAFk_0", "video_path": "TfH_nTONAFk.mp4", "subtitle_path": "TfH_nTONAFk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 448, "duration": 9.97, "view_count": 2394}, {"video_id": "RmguF2hsQyc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a white outfit on an animated character in the dark night becomes tattered, then several white flashes of lightning appear in the black sky, and finally, an animated character in a black suit with a black bow tie closes their eyes.", "First, an animated character in a black suit with a black bow tie closes their eyes, then a white outfit on an animated character in the dark night becomes tattered, and finally, several white flashes of lightning appear in the black sky.", "First, a white outfit on an animated character in the dark night becomes tattered, then an animated character in a black suit with a black bow tie closes their eyes, and finally, several white flashes of lightning appear in the black sky.", "First, several white flashes of lightning appear in the black sky, then an animated character in a black suit with a black bow tie closes their eyes, and finally, a white outfit on an animated character in the dark night becomes tattered.", "First, several white flashes of lightning appear in the black sky, then a white outfit on an animated character in the dark night becomes tattered, and finally, an animated character in a black suit 
with a black bow tie closes their eyes."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "RmguF2hsQyc_0", "video_path": "RmguF2hsQyc.mp4", "subtitle_path": "RmguF2hsQyc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 100, "duration": 9.0, "view_count": 893452}, {"video_id": "@healthfood-6876513762694679814", "question": "There is a bowl of white rice in a white ceramic bowl on the screen. Green sauce is poured over the rice, and there is also a lifted bottle of green sauce above the bowl. What is happening on the screen right now?", "question_wo_referring_query": "What is happening on the screen right now?", "candidates": ["Eating the rice", "Mixing the sauce", "Stirring the rice", "Making fried rice", "Pouring green sauce over the rice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@healthfood-6876513762694679814_0", "video_path": "6876513762694679814.mp4", "subtitle_path": "6876513762694679814_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.83, "view_count": 10594}, {"video_id": "@recipesbyanne-7285745052637318432", "question": "A blonde woman in a gray hoodie is sitting at a concrete table by a red brick wall, holding a plate of food on an iron plate, getting ready to taste it. 
What object is present in the scene at this time?", "question_wo_referring_query": ", what object is present in the scene at this time?", "candidates": ["sunhat", "lemon slice", "sunglasses", "white rice", "mobile phone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@recipesbyanne-7285745052637318432_0", "video_path": "7285745052637318432.mp4", "subtitle_path": "7285745052637318432_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.77, "view_count": 63304}, {"video_id": "@healthfood-6939966507778673926", "question": "A woman wearing a gray top with floral prints on the letters and a black turtleneck underneath, with her hair tied up in a bun and a black hairband, is standing next to a white door frame on the left side of the screen, frowning at the camera. When the caption 'Okay.I'm done' appears, what object is present in the screen?", "question_wo_referring_query": "What object is present in the screen?", "candidates": ["A mobile phone", "A washing machine", "A television", "A white door", "A desk lamp"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@healthfood-6939966507778673926_0", "video_path": "6939966507778673926.mp4", "subtitle_path": "6939966507778673926_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.6, "view_count": 14800}, {"video_id": "@healthfood-6941461918683008261", "question": "The background consists of an off-white wall and a white bed. On the left side of the screen, there is a white cylindrical object, a metal chain hanging from the ceiling, and a woman wearing beige suspenders and blue pants with one hand behind her back and the other hand pointing upwards. 
What material are the pants the woman is wearing made of?", "question_wo_referring_query": "What material are the pants the woman is wearing made of?", "candidates": ["Denim", "Leather pants", "Cotton pants", "Linen material", "Suit pants"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@healthfood-6941461918683008261_0", "video_path": "6941461918683008261.mp4", "subtitle_path": "6941461918683008261_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.9, "view_count": 18600}, {"video_id": "@recipesbyanne-7192619035610156293", "question": "On a marble tabletop, there is a bowl containing a mixture of white and coffee-colored creamy ingredients in a transparent container. A finger is scraping some coffee-colored ingredients from an iron spoon. When the subtitle 'Love is in the air, in the thunder of the sea.' appears, what material is the transparent container in the screen made of?", "question_wo_referring_query": "what material is the transparent container in the screen made of?", "candidates": ["glass", "ceramic", "plastic", "acrylic", "iron"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@recipesbyanne-7192619035610156293_0", "video_path": "7192619035610156293.mp4", "subtitle_path": "7192619035610156293_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.03, "view_count": 104850}, {"video_id": "@healthfood-6858752701841329414", "question": "In the video, a white container holds yellow flakes, and a ladle filled with white liquid is reaching down into the container. 
What is the substance being added to the container?", "question_wo_referring_query": "What is the substance being added to the container?", "candidates": ["Milk", "Sugar", "Soybean oil", "Butter", "Cooking oil"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@healthfood-6858752701841329414_0", "video_path": "6858752701841329414.mp4", "subtitle_path": "6858752701841329414_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.13, "view_count": 2440}, {"video_id": "@healthfood-7081730352011709742", "question": "On a gray marble table, there is a bowl of white cotton candy in a transparent glass container. What is happening on the screen when a spoon with gold color is used to add yellow oil to the container for the first time?", "question_wo_referring_query": "What is happening on the screen?", "candidates": ["Mixing yellow oil with cotton candy", "Adding yellow oil to the milk", "Adding yellow oil to the pot", "Adding white sugar to the yellow oil", "Adding yellow oil to the container"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@healthfood-7081730352011709742_0", "video_path": "7081730352011709742.mp4", "subtitle_path": "7081730352011709742_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.35, "view_count": 76044}, {"video_id": "@recipesbyanne-7135835285400587525", "question": "On a white marble table, there is a wooden cutting board. A hand wearing a ring is pressing down on a carrot, while another hand is holding a knife. In front of the knife are three slices of carrots. 
When the caption 'you' appears, what is the person in the video doing?", "question_wo_referring_query": "What is the person in the video doing?", "candidates": ["cutting carrot chunks", "cutting carrot slices", "cutting carrot strips", "juicing carrots", "cutting carrot pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@recipesbyanne-7135835285400587525_0", "video_path": "7135835285400587525.mp4", "subtitle_path": "7135835285400587525_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.43, "view_count": 5563}, {"video_id": "@healthfood-7098797783297281326", "question": "The screen shows a pack of potato chips with purple and cream white packaging. The bag has the word 'POTATO CHIPS' in purple letters, and below is a picture of potato chips. What happens in the video after displaying this pack of chips?", "question_wo_referring_query": "What happens in the video?", "candidates": ["Showing a plastic container with green sticker filled with guacamole.", "Holding a plastic container with a green sticker filled with guacamole by hand.", "Opening the lid of the guacamole container to show its contents to the camera.", "Using a potato chip to scoop up guacamole from a plastic container.", "Scooping a spoonful of guacamole with a spoon."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@healthfood-7098797783297281326_0", "video_path": "7098797783297281326.mp4", "subtitle_path": "7098797783297281326_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.67, "view_count": 60258}, {"video_id": "@healthfood-7091722648010476842", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["Three black cans with white English text", "Three cans of food inside a box", "A black box with white English text 
on a marble slab", "A white plate holding the fried eggs", "Three fried eggs in a pan"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-7091722648010476842_0", "video_path": "7091722648010476842.mp4", "subtitle_path": "7091722648010476842_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.47, "view_count": 26650}, {"video_id": "@recipesbyanne-7132833277357870341", "question": "In a transparent container on a wooden table, a mixer is whipping egg whites. After the subtitle 'all day, all day, why we say please when I got to leave and say it ain't big to stay,' what is the first item that appears?", "question_wo_referring_query": "What is the first item that appears?", "candidates": ["Banana slice", "Egg yolk", "Chocolate-colored sauce in a transparent glass container", "White plate", "Banana"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@recipesbyanne-7132833277357870341_0", "video_path": "7132833277357870341.mp4", "subtitle_path": "7132833277357870341_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.9, "view_count": 6588}, {"video_id": "@recipesbyanne-7169577213371763974", "question": "Which of the following scenarios is in the correct order?", "question_wo_referring_query": "Which of the following scenarios is in the correct order?", "candidates": ["Add the various ingredients placed on the side into the glass container with leafy greens, add the yellow dressing to the fruit and vegetable salad in the glass container, mix the fruit and vegetable salad in the glass container, use tongs to lift the fruit and vegetable salad onto a plate", "Mix the fruit and vegetable salad in the glass container, add the various ingredients placed on the side into the glass container with leafy greens, use tongs to lift the fruit and vegetable salad onto a plate, add the 
yellow dressing to the fruit and vegetable salad in the glass container", "Mix the fruit and vegetable salad in the glass container, add the various ingredients placed on the side into the glass container with leafy greens, add the yellow dressing to the fruit and vegetable salad in the glass container, use tongs to lift the fruit and vegetable salad onto a plate", "Mix the fruit and vegetable salad in the glass container, add the yellow dressing to the fruit and vegetable salad in the glass container, add the various ingredients placed on the side into the glass container with leafy greens, use tongs to lift the fruit and vegetable salad onto a plate", "Add the yellow dressing to the fruit and vegetable salad in the glass container, add the various ingredients placed on the side into the glass container with leafy greens, mix the fruit and vegetable salad in the glass container, use tongs to lift the fruit and vegetable salad onto a plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@recipesbyanne-7169577213371763974_0", "video_path": "7169577213371763974.mp4", "subtitle_path": "7169577213371763974_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.03, "view_count": 10989}, {"video_id": "@recipesbyanne-7213772637334490373", "question": "On a marble table, there is a Dutch oven filled with clay beans soaked in clear water. 
In which other scenarios have these clay beans appeared?", "question_wo_referring_query": "In which other scenarios have they appeared?", "candidates": ["In an iron pot filled with white milk", "In an oven on the marble tabletop", "In a glass bowl placed on a wooden board containing yellow food", "In a frying pan on the stovetop", "In the scene where the clay beans are pressed into clay with the help of iron tools inside an iron pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@recipesbyanne-7213772637334490373_0", "video_path": "7213772637334490373.mp4", "subtitle_path": "7213772637334490373_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.1, "view_count": 212632}, {"video_id": "@recipesbyanne-7105814667162914054", "question": "The background is a gray tiled wall. On a white table, there's a white bowl containing a mixture of milk and yellowish powder. What change occurs after the ingredients in the bowl are stirred with a metal spoon?", "question_wo_referring_query": "What change occurs?", "candidates": ["Changes from liquid to powder", "Changes from powder to semi-solid", "Changes from yellow to white", "Changes from powder to liquid", "Changes from yellow to red"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@recipesbyanne-7105814667162914054_0", "video_path": "7105814667162914054.mp4", "subtitle_path": "7105814667162914054_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.4, "view_count": 2340}, {"video_id": "@recipesbyanne-7229329927064046874", "question": "In a black pot with oil, a hand places three pieces of meat coated in sauce. 
When the subtitle 'Sugar, how you get so fly' appears, what changes do these pieces of meat undergo?", "question_wo_referring_query": "What changes do these pieces of meat undergo?", "candidates": ["The pieces of meat turn into shredded meat", "The pieces of meat turn into minced meat", "The golden-yellow color turns into charred black color", "The raw meat becomes cooked", "The pieces of meat turn into meat sauce"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@recipesbyanne-7229329927064046874_0", "video_path": "7229329927064046874.mp4", "subtitle_path": "7229329927064046874_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.35, "view_count": 18400000}, {"video_id": "@recipesbyanne-7165853138182868230", "question": "On a white marble table, there is a piece of wooden lattice board with a semi-transparent spring roll wrapper on it. In the center, there are green vegetables and sauce. What is the hand on the left side of the screen doing while holding the edge of the spring roll wrapper?", "question_wo_referring_query": "What are they doing?", "candidates": ["Putting the spring roll into a pot", "Tearing the spring roll wrapper apart", "Rolling up the spring roll wrapper", "Putting raw vegetables on the spring roll wrapper", "Sprinkling sauce on the spring roll wrapper"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@recipesbyanne-7165853138182868230_0", "video_path": "7165853138182868230.mp4", "subtitle_path": "7165853138182868230_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.8, "view_count": 36636}, {"video_id": "@healthfood-6895852781446384902", "question": "The background is an indoor setting with gray-white walls, featuring a black bed with white bed sheets and a duvet. 
A woman wearing a blue long coat, a black short inner top, and black shorts has one hand on her waist and the other hand raising a finger above her head. What object is present in the scene at this moment?", "question_wo_referring_query": "What object is present in the scene at this moment?", "candidates": ["Chain hanging on the wall", "Sunglasses", "Kitten", "Puppy", "Hat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@healthfood-6895852781446384902_0", "video_path": "6895852781446384902.mp4", "subtitle_path": "6895852781446384902_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.0, "view_count": 10700}, {"video_id": "@healthfood-7187787808302435627", "question": "In the scene, an iron spatula is stirring a brown sauce in a white bowl. In the upper left corner of the screen, there is a black label sticker with white English text. When the subtitle 'And in honor of that, we're using their pre-marinated teriyaki tofu' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Triangle-shaped tofu pieces", "Recipe", "Microwave oven", "Refrigerator", "Iron pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@healthfood-7187787808302435627_0", "video_path": "7187787808302435627.mp4", "subtitle_path": "7187787808302435627_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.2, "view_count": 13768}, {"video_id": "@recipesbyanne-7238614903454469402", "question": "In the black pot on the white marble countertop, a wooden spatula is stirring minced garlic, chopped carrots, and red raw meat. 
At this moment, what is the shape of the meat inside the pot?", "question_wo_referring_query": "What is the shape of the meat inside the pot at this moment?", "candidates": ["meat slices", "meat chunks", "minced meat", "shredded meat", "meat strips"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@recipesbyanne-7238614903454469402_0", "video_path": "7238614903454469402.mp4", "subtitle_path": "7238614903454469402_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.5, "view_count": 8063923}, {"video_id": "@recipesbyanne-7254206119885688090", "question": "On a white marble table, there is a bowl with four meatballs and salad. A metal spoon is placing sauce onto a plate. When the subtitle 'Won't you come with me and spend the night' appears, what color is the sauce?", "question_wo_referring_query": "What color is the sauce?", "candidates": ["white", "yellow", "red", "purple", "green"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@recipesbyanne-7254206119885688090_0", "video_path": "7254206119885688090.mp4", "subtitle_path": "7254206119885688090_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.23, "view_count": 123900}, {"video_id": "@recipesbyanne-7176280783961722118", "question": "In a black frying pan containing a sauce made with scallions and tomato juice, what did the hand with red nail polish put into the pan?", "question_wo_referring_query": "What was put into the pan?", "candidates": ["chicken legs", "chicken wings", "chicken breast", "pork", "beef"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@recipesbyanne-7176280783961722118_0", "video_path": "7176280783961722118.mp4", "subtitle_path": "7176280783961722118_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.8, 
"view_count": 66000}, {"video_id": "@healthfood-7025332892498922798", "question": "In front of a grey tiled wall, there is a wooden cutting board with two pieces of dough on it. When a white bottle of sauce with dark blue and light blue labels appears for the first time, what is the person in the video doing?", "question_wo_referring_query": ", what is the person in the video doing?", "candidates": ["Smearing sauce on the dough", "Smearing chili sauce on the dough", "Showing the bottle of sauce to the camera", "Rolling up the dough", "Spreading sauce on the dough"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@healthfood-7025332892498922798_0", "video_path": "7025332892498922798.mp4", "subtitle_path": "7025332892498922798_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.78, "view_count": 46264}, {"video_id": "@healthfood-7025971557440736559", "question": "A hand wearing a white bracelet is holding a tray lined with oil-absorbing paper. On the white paper, there are green peppers along with seasoning and ham. When the subtitle 'I'm sorry. 
Miss Jackson' appears, what is the person on the screen doing?", "question_wo_referring_query": "What is the person on the screen doing?", "candidates": ["Taking the food out of the oven", "Putting the tray of food into the oven", "Sprinkling cheese on the food", "Putting the food into a pot", "Brushing a layer of oil on the food"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@healthfood-7025971557440736559_0", "video_path": "7025971557440736559.mp4", "subtitle_path": "7025971557440736559_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.97, "view_count": 44235}, {"video_id": "@recipesbyanne-7149130969532861702", "question": "On a white marble table, after tearing open the pastry package and taking out the pastry, what was done first?", "question_wo_referring_query": "What was done first?", "candidates": ["Mix the pastry crumbs and yogurt", "Crumble the pastry and put it in a transparent glass container", "Put white yogurt into the pastry crumbs", "Put the pastry into milk", "Spread tomato sauce on the pastry"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@recipesbyanne-7149130969532861702_0", "video_path": "7149130969532861702.mp4", "subtitle_path": "7149130969532861702_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.03, "view_count": 1800000}, {"video_id": "@healthfood-6868399521332743429", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["Hot pot base ingredient", "Food ingredient with a label on its shell", "Grill rack", "Square-shaped white food ingredient", "Sticky white food ingredient"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-6868399521332743429_0", "video_path": 
"6868399521332743429.mp4", "subtitle_path": "6868399521332743429_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.8, "view_count": 8093}, {"video_id": "@recipesbyanne-7217889587736120603", "question": "On a white marble table, a pair of hands is whisking chocolate powder in a transparent glass container. After the subtitle 'Thank you.' appears, what happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["Add cooking oil to the container", "Add cocoa beans to the container", "Whisk the egg and cocoa beans together", "Crack an egg into the transparent glass container", "Put the whisked ingredients into a baking tray"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@recipesbyanne-7217889587736120603_0", "video_path": "7217889587736120603.mp4", "subtitle_path": "7217889587736120603_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.78, "view_count": 96925}, {"video_id": "@healthfood-6855359517790604549", "question": "Which of the following sequences of actions is correct?", "question_wo_referring_query": "Which of the following sequences of actions is correct?", "candidates": ["A jar with a red label of olive oil, a glass container filled with chunks of yellow fruit, the mixture of fruit chunks and olive oil spread on dried acid milk", "A glass container filled with chunks of yellow fruit, the mixture of fruit chunks and olive oil spread on dried acid milk, a jar with a red label of olive oil", "The mixture of fruit chunks and olive oil spread on dried acid milk, a glass container filled with chunks of yellow fruit, a jar with a red label of olive oil", "A glass container filled with chunks of yellow fruit, a jar with a red label of olive oil, the mixture of fruit chunks and olive oil spread on dried acid milk", "A jar with a red label of olive oil, the mixture of fruit chunks and 
olive oil spread on dried acid milk, a glass container filled with chunks of yellow fruit"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@healthfood-6855359517790604549_0", "video_path": "6855359517790604549.mp4", "subtitle_path": "6855359517790604549_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.07, "view_count": 4665}, {"video_id": "@healthfood-7142907293414280490", "question": "In a white ceramic bowl filled with a yellow-orange sauce, a metal spoon is holding up the sauce from the bowl. In what other contexts does this yellow-orange sauce appear?", "question_wo_referring_query": ", in what other contexts does this yellow-orange sauce appear?", "candidates": ["In a scene where the sauce is being scooped out of a white bowl with a spoon", "Next to a microwave", "In a black frying pan", "Next to tomato sauce", "In a transparent glass container"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@healthfood-7142907293414280490_0", "video_path": "7142907293414280490.mp4", "subtitle_path": "7142907293414280490_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.97, "view_count": 22800}, {"video_id": "@recipesbyanne-7099563338442607877", "question": "On a white table there are three glass containers filled with chia seeds, topped with desserts made of strawberries and blueberries. 
What change occurs in the video after a piece of dessert is lifted with a fork?", "question_wo_referring_query": "What change occurs in the video?", "candidates": ["The strawberries on the dessert disappear", "The screen becomes blurry", "The screen enlarges", "The screen shrinks", "The screen becomes clear"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@recipesbyanne-7099563338442607877_0", "video_path": "7099563338442607877.mp4", "subtitle_path": "7099563338442607877_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.1, "view_count": 3251}, {"video_id": "@recipesbyanne-7105005917229305093", "question": "In the video, a pair of hands is peeling garlic cloves on a wooden board. After the subtitle \u201cBring that ass back like a boom boom boom boom The translation of \"Bring that ass back like a boom boom boom boom\" to C\u201d appears, what change happens to the garlic cloves?", "question_wo_referring_query": "What change happens to the garlic cloves?", "candidates": ["The garlic cloves go from the wooden board into clear water", "The garlic cloves are peeled clean", "The garlic cloves are sliced", "The garlic cloves are crushed", "The garlic cloves turn into garlic paste"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@recipesbyanne-7105005917229305093_0", "video_path": "7105005917229305093.mp4", "subtitle_path": "7105005917229305093_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.47, "view_count": 2141}, {"video_id": "@recipesbyanne-7233794321277930778", "question": "Inside a transparent glass container on a white marble tabletop, there are a few pieces of seasoned chicken. A hand is currently squeezing a lemon into the container. 
What objects are present in the frame at this moment?", "question_wo_referring_query": "What objects are present in the frame at this moment?", "candidates": ["Chicken claw", "Knife and fork", "Clove", "Chili powder", "Cilantro"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@recipesbyanne-7233794321277930778_0", "video_path": "7233794321277930778.mp4", "subtitle_path": "7233794321277930778_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.87, "view_count": 299592}, {"video_id": "@recipesbyanne-7197441038406126854", "question": "On a white marble countertop, there is a large ceramic bowl containing eggs and tomato sauce. A hand is using chopsticks to mix them. What material is this pair of chopsticks made of?", "question_wo_referring_query": ", What material is this pair of chopsticks in the video made of?", "candidates": ["Acrylic chopsticks", "Paper chopsticks", "Plastic chopsticks", "Metal chopsticks", "Iron chopsticks"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@recipesbyanne-7197441038406126854_0", "video_path": "7197441038406126854.mp4", "subtitle_path": "7197441038406126854_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.77, "view_count": 115128}, {"video_id": "@healthfood-6921801196508384518", "question": "A woman with her hair in a bun, wearing a grey hoodie and holding a purple and white packaged food, is eating. 
When the subtitle 'warm and my favorite flavors are banana nut and lemon blueberry' appears, what type of top is this woman wearing?", "question_wo_referring_query": "What type of top is this woman wearing?", "candidates": ["shirt", "denim jacket", "sweater", "cotton jacket", "leather jacket"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@healthfood-6921801196508384518_0", "video_path": "6921801196508384518.mp4", "subtitle_path": "6921801196508384518_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.17, "view_count": 12293}, {"video_id": "@recipesbyanne-7182930123123969285", "question": "What is being added into the container on the white marble table, which contains a lump of white dough and has a hand at the top adding something?", "question_wo_referring_query": "What is being added to the container?", "candidates": ["Tomato Sauce", "Sugar", "Eggs", "Flour", "Water"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@recipesbyanne-7182930123123969285_0", "video_path": "7182930123123969285.mp4", "subtitle_path": "7182930123123969285_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.3, "view_count": 44100}, {"video_id": "@healthfood-6893963247997832453", "question": "What is the ingredient added to the food coated with a pink paste at the beginning of the video?", "question_wo_referring_query": "What is the ingredient added to the food coated with a pink paste at the beginning of the video?", "candidates": ["Carrot sticks", "Yellow fish decorations", "White dragon fruit balls", "Lychee", "Pink flowers"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-6893963247997832453_0", "video_path": "6893963247997832453.mp4", "subtitle_path": "6893963247997832453_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 0, "duration": 13.62, "view_count": 4943}, {"video_id": "@healthfood-7092061669479714094", "question": "A hand takes out a berry from a plastic bag stored in a refrigerator filled with many foods. After the subtitle 'college cooking hack for keeping your bread fresh' appears, what happens in the video?", "question_wo_referring_query": "What happens in the video?", "candidates": ["Puts the sliced berry into the bread machine", "Applies ketchup on the berry", "Applies sesame sauce on the berry", "Soaks the berry in milk", "Cuts the berry into pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@healthfood-7092061669479714094_0", "video_path": "7092061669479714094.mp4", "subtitle_path": "7092061669479714094_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 7.97, "view_count": 49652}, {"video_id": "@recipesbyanne-7241201480458358043", "question": "In the video, a small wooden spatula is stirring yellow bean sprouts and chicken in a white sauce. In which other scenes does this small wooden spatula appear?", "question_wo_referring_query": "In which other scenes does this small wooden spatula appear?", "candidates": ["On a white marble tabletop", "While stirring a pot of yellow shredded ingredients", "In a pot with three boiled eggs", "In a bowl of cold mixed vegetables", "In a bowl full of beef slices"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@recipesbyanne-7241201480458358043_0", "video_path": "7241201480458358043.mp4", "subtitle_path": "7241201480458358043_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.83, "view_count": 240362}, {"video_id": "@healthfood-7088354792359218478", "question": "A colorful whisk is stirring a blend of matcha powder and colorful sprinkles inside a glass container. 
When the top is drizzled with white sauce and the mixture is held up towards the camera, what changes occur to these mixtures?", "question_wo_referring_query": ", what changes occur to these mixtures?", "candidates": ["The mixture changes from a dough-like state to a cake-like state", "The mixture changes from olive color to yellow", "The mixture changes from liquid to solid", "The mixture changes from colorful to black", "The mixture changes from solid to liquid"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@healthfood-7088354792359218478_0", "video_path": "7088354792359218478.mp4", "subtitle_path": "7088354792359218478_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.08, "view_count": 32201}, {"video_id": "@healthfood-7145149520324627755", "question": "In the transparent container on the white table, there is some yellow liquid and two pieces of banana. What happened to the banana when the subtitles \u201cYou go down just like Holy Mary. Mariana, Mariana Cross.\u201d appeared?", "question_wo_referring_query": "What happened to the banana?", "candidates": ["The banana pieces turned into banana slices", "The solid turned into a liquid", "The banana turned into dried banana", "The banana turned into banana puree", "The banana changed from yellow to green"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@healthfood-7145149520324627755_0", "video_path": "7145149520324627755.mp4", "subtitle_path": "7145149520324627755_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.2, "view_count": 19700}, {"video_id": "@healthfood-7101376793185193258", "question": "A woman with curly hair, wearing a sleeveless dark brown top, is holding a dessert covered in chocolate sauce served on a white plate on a wooden table. 
What is she doing in front of the table?", "question_wo_referring_query": "What is she doing?", "candidates": ["Decorating the dessert", "Eating the dessert", "Making the dessert", "Cutting the dessert", "Introducing the dessert"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@healthfood-7101376793185193258_0", "video_path": "7101376793185193258.mp4", "subtitle_path": "7101376793185193258_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.35, "view_count": 151395}, {"video_id": "@recipesbyanne-7319540573717777696", "question": "The background is a red tile wall. A woman with long golden hair, wearing a long-sleeved dark green top with white floral print, is sitting in front of a gray table tasting delicious food. What objects are present in the scene at this moment?", "question_wo_referring_query": "What objects are present in the scene at this moment?", "candidates": ["Spatula", "Chopsticks", "Sunglasses", "Hat", "Knife and fork"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@recipesbyanne-7319540573717777696_0", "video_path": "7319540573717777696.mp4", "subtitle_path": "7319540573717777696_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.63, "view_count": 108809}, {"video_id": "@healthfood-7002350145610288390", "question": "Under the azure sky, a girl wearing a bun hairstyle, a hidden blue top, and white shorts is running on the beach. 
When the subtitle 'We tried to do as many summer activities we could in those five minutes' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["white cloud", "sunglasses", "sailboat", "white wave", "mobile phone"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@healthfood-7002350145610288390_0", "video_path": "7002350145610288390.mp4", "subtitle_path": "7002350145610288390_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.03, "view_count": 25130}, {"video_id": "@recipesbyanne-7195947239368396037", "question": "In the video, some green vegetables are added to a black pot containing white broth, and there is also a wooden spoon in the pot. What type of vegetables are these?", "question_wo_referring_query": "What type of vegetables are these?", "candidates": ["Water spinach", "Cabbage", "Lettuce", "Bok choy", "Snow peas"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@recipesbyanne-7195947239368396037_0", "video_path": "7195947239368396037.mp4", "subtitle_path": "7195947239368396037_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.57, "view_count": 137480}, {"video_id": "@healthfood-6858368172467195141", "question": "On a white marble countertop, there is a white ceramic bowl filled with yellow lentil paste. Some sauce is on top of the lentil paste, and a fork is stirring it. 
When the subtitle 'I promise your life is about to change for the better' appears, what kind of sauce is on the lentil paste?", "question_wo_referring_query": "What kind of sauce is on the lentil paste?", "candidates": ["Olive chocolate sauce", "Red tomato sauce", "White salad dressing", "Red chili sauce", "Green basil sauce"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@healthfood-6858368172467195141_0", "video_path": "6858368172467195141.mp4", "subtitle_path": "6858368172467195141_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.58, "view_count": 1882}, {"video_id": "@healthfood-7059826758039702830", "question": "There is a bowl on a wooden table filled with various colors and types of vegetables in a white ceramic bowl. A hand is holding a glass container and about to add something to the vegetables. What is being added?", "question_wo_referring_query": "What is being added to the vegetables?", "candidates": ["White salad dressing", "Garlic sauce", "Tomato sauce", "Chili sauce", "Milk"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@healthfood-7059826758039702830_0", "video_path": "7059826758039702830.mp4", "subtitle_path": "7059826758039702830_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.42, "view_count": 110119}, {"video_id": "@healthfood-6859546154200304902", "question": "In the top right corner of a white marble table is half a leek placed on a wooden board. Next to the wooden board is a blue bowl containing various vegetables. 
When the hand holding a lime first appears in the frame, what is being done?", "question_wo_referring_query": "What is being done in the frame?", "candidates": ["Tossing the ingredients in the bowl", "Putting the ingredients into a pot", "Adding olive oil to the bowl", "Cutting lime slices", "Juicing lime into the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@healthfood-6859546154200304902_0", "video_path": "6859546154200304902.mp4", "subtitle_path": "6859546154200304902_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.13, "view_count": 3011}, {"video_id": "@healthfood-6885743982949813510", "question": "A hand is holding a red rambutan in front of a white wall and showing it to the camera. What happens first on the screen after this?", "question_wo_referring_query": "What happens first on the screen?", "candidates": ["Showing the white flesh of the rambutan to the camera", "Peeling the rambutan by hand", "Eating the rambutan", "Cutting the rambutan on a white cutting board with a knife", "Displaying several neatly arranged rambutans"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@healthfood-6885743982949813510_0", "video_path": "6885743982949813510.mp4", "subtitle_path": "6885743982949813510_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.23, "view_count": 4329}, {"video_id": "@healthfood-7098803661677153579", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["Leek", "Cutting board", "Fork", "Cup", "Knife"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-7098803661677153579_0", "video_path": "7098803661677153579.mp4", "subtitle_path": "7098803661677153579_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 0, "duration": 12.78, "view_count": 85735}, {"video_id": "@healthfood-6927371951384120581", "question": "The screen shows a transparent cup containing white food. There is a pouch at the top of the cup, and yellow powder is being poured out of it. After mentioning 'Turmeric is a magic that's an anti-inflammatory plant to protect the heart and brain,' what happened next?", "question_wo_referring_query": "What happened next?", "candidates": ["Yellow semi-solid food is being poured into the transparent cup placed on a gray table.", "Yellow semi-solid food is being poured into the transparent cup placed on a pink table.", "Yellow semi-solid food is being poured into the transparent cup placed on a white table.", "Yellow semi-solid food is being poured into the transparent cup placed on a green table.", "Yellow semi-solid food is being poured into the transparent cup placed on a black table."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@healthfood-6927371951384120581_0", "video_path": "6927371951384120581.mp4", "subtitle_path": "6927371951384120581_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.33, "view_count": 22311}, {"video_id": "@healthfood-6902151185592077573", "question": "A lady wearing a white T-shirt and black shorts is facing her back with her hands in front of a mirror. Her hair is light brown with some black, and she is in a white bedroom. When 'Stop. What the hell are you talking about?' 
is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["blue letters 'quinoa'", "black letters 'quinoa'", "red letters 'quinoa'", "white letters 'whole grains'", "yellow letters 'quinoa'"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@healthfood-6902151185592077573_0", "video_path": "6902151185592077573.mp4", "subtitle_path": "6902151185592077573_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.07, "view_count": 8492}, {"video_id": "@healthfood-7213401928393461038", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene of a plate of cooked Italian noodles is shown, followed by a woman in a white coat with a red floral inner dress, and finally, a woman in a gray T-shirt eating Italian noodles.", "First, a scene of a plate of cooked Italian noodles is shown, followed by a woman in a gray T-shirt eating Italian noodles, and finally, a woman in a white coat with a red floral inner dress.", "First, a woman in a white coat with a red floral inner dress is shown, followed by a scene of a plate of cooked Italian noodles, and finally, a woman in a gray T-shirt eating Italian noodles.", "First, a woman in a gray T-shirt eating Italian noodles is shown, followed by a woman in a white coat with a red floral inner dress, and finally, a scene of a plate of cooked Italian noodles.", "First, a woman in a gray T-shirt eating Italian noodles is shown, followed by a scene of a plate of cooked Italian noodles, and finally, a woman in a white coat with a red floral inner dress."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@healthfood-7213401928393461038_0", "video_path": "7213401928393461038.mp4", 
"subtitle_path": "7213401928393461038_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.67, "view_count": 43941}, {"video_id": "@recipesbyanne-7246411417291656474", "question": "The scene shows a pot placed on a white table, containing a white liquid and green vegetables. A wooden spatula is stirring the food in the pot. In which other scene does this wooden spatula appear?", "question_wo_referring_query": "In which other scene does this wooden spatula appear?", "candidates": ["In a pot containing yellow broad noodles, green vegetables, and white liquid", "In a pot containing yellow thin noodles, green vegetables, and white liquid", "In a pot containing white broad noodles, green vegetables, and white liquid", "In a pot containing white thin noodles, green vegetables, and white liquid", "In a pot containing yellow thin noodles, green vegetables, and black liquid"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@recipesbyanne-7246411417291656474_0", "video_path": "7246411417291656474.mp4", "subtitle_path": "7246411417291656474_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.68, "view_count": 275459}, {"video_id": "@recipesbyanne-7098346738615913734", "question": "The screen shows a white bowl on a wooden table, with cut strawberries in it. When 'Thank you.' 
is mentioned, what change happens to the strawberries in the bowl?", "question_wo_referring_query": "What change happens to the strawberries in the bowl?", "candidates": ["The strawberries are drizzled with a green sauce", "The strawberries are drizzled with a white sauce", "The strawberries are drizzled with a yellow sauce", "The strawberries are drizzled with a light blue sauce", "The strawberries are drizzled with a black sauce"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@recipesbyanne-7098346738615913734_0", "video_path": "7098346738615913734.mp4", "subtitle_path": "7098346738615913734_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.77, "view_count": 1312}, {"video_id": "@recipesbyanne-7093174416544124166", "question": "The scene shows a hand with purple nail polish holding some green leaves. Below the hand is a black pot containing pasta and some white seasoning, sprinkled with green leaves. What is the hand in the scene doing?", "question_wo_referring_query": "What is the hand in the scene doing?", "candidates": ["Sprinkling white seasoning and leaves into the pot", "Adding water to the pot", "Stir-frying the pasta with a spatula", "Picking up leaves from the pot", "Sprinkling leaves into the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@recipesbyanne-7093174416544124166_0", "video_path": "7093174416544124166.mp4", "subtitle_path": "7093174416544124166_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.03, "view_count": 1412}, {"video_id": "@recipesbyanne-7151743128762977541", "question": "The screen shows a small wooden board on a white table, with two round seaweed pieces on it. There is also some rice, a bit of vegetables, and a piece of mackerel. 
What item is not present in the scene?", "question_wo_referring_query": "Which items are not present in the scene?", "candidates": ["golden spoon", "pumpkin", "rice", "piece of mackerel", "silver spoon"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@recipesbyanne-7151743128762977541_0", "video_path": "7151743128762977541.mp4", "subtitle_path": "7151743128762977541_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.33, "view_count": 40563}, {"video_id": "@healthfood-6932530520303652102", "question": "A woman is speaking on the screen. Her hair is tied up, she has black eyebrows, and she is sitting on a chair. There is a white wall behind her, and she is wearing a dark green garment. What is the fabric of this garment?", "question_wo_referring_query": "What is the fabric of this garment?", "candidates": ["uniform", "woolen clothes", "T-shirt", "windbreaker", "coat"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@healthfood-6932530520303652102_0", "video_path": "6932530520303652102.mp4", "subtitle_path": "6932530520303652102_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.07, "view_count": 4900000}, {"video_id": "@recipesbyanne-7205238929522494725", "question": "In the image, there is a white table, with a black mesh frame on top, covered with white paper. There is a rectangular baked yellow food item on the paper, and a knife is cutting this food item. 
When the subtitle 'When you put your body on my neck' appears, what shape is the food item cut into?", "question_wo_referring_query": "What shape is the food item cut into?", "candidates": ["Circle", "Strip", "Chunk", "Shred", "Triangle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@recipesbyanne-7205238929522494725_0", "video_path": "7205238929522494725.mp4", "subtitle_path": "7205238929522494725_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.5, "view_count": 92005}, {"video_id": "@recipesbyanne-7228227020096802074", "question": "A black pot is sitting on a white table in the video. Inside the pot is spacious eggplant pasta, with a green seasoning sprinkled on top. Two hands are holding a utensil above the pot, preparing to pick up the noodles. What is the utensil used to pick up the noodles?", "question_wo_referring_query": "What is the utensil used to pick up the noodles?", "candidates": ["Skimmer", "Pasta server", "Tongs", "Chopsticks", "Fork"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@recipesbyanne-7228227020096802074_0", "video_path": "7228227020096802074.mp4", "subtitle_path": "7228227020096802074_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.2, "view_count": 492951}, {"video_id": "@healthfood-6958515099313229061", "question": "The screen shows a white cutting board with some strawberries on it, along with a hand and a knife. 
What happened when the strawberries first appeared?", "question_wo_referring_query": "What happened when the strawberries first appeared?", "candidates": ["A hand is cutting the strawberries", "A hand is using a knife to move the strawberries", "A hand is lifting the strawberries", "A hand is picking up the strawberries", "A hand is putting the strawberries into a bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@healthfood-6958515099313229061_0", "video_path": "6958515099313229061.mp4", "subtitle_path": "6958515099313229061_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.33, "view_count": 31520}, {"video_id": "@healthfood-7141073153840762155", "question": "On the screen, there is a yellow table with a white cup on it containing yellow and white food. At the top, there's a line in small letters that says 'top with whipped cream and enjoy'. What happens on the screen when Outo Music is mentioned?", "question_wo_referring_query": "On the screen, there is a yellow table with a white cup on it containing yellow and white food. At the top, there's a line in small letters that says 'top with whipped cream and enjoy'. What happens on the screen when Outo Music is mentioned?", "candidates": ["A cat is mixing the food in the cup", "A cat is placed inside this cup", "A cat lifts the whipped cream out of the cup", "A hand lifts the cup up", "A cat lifts the food and whipped cream out of the cup"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@healthfood-7141073153840762155_0", "video_path": "7141073153840762155.mp4", "subtitle_path": "7141073153840762155_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.43, "view_count": 20547}, {"video_id": "@healthfood-7332625543822691627", "question": "The screen shows a black pot on a stove, with cooked yellow eggs inside. 
There is also a piece of toast with small tomatoes on top and sliced avocados on the side. Which food item was put into the pot first?", "question_wo_referring_query": "The screen shows a black pot on a stove, with cooked yellow eggs inside. There is also a piece of toast with small tomatoes on top and sliced avocados on the side. Which food item was put into the pot first?", "candidates": ["egg", "avocado", "passion fruit", "salt", "bread slice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-7332625543822691627_0", "video_path": "7332625543822691627.mp4", "subtitle_path": "7332625543822691627_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.23, "view_count": 16412847}, {"video_id": "@healthfood-7125121950590078254", "question": "The screen shows a woman wearing workout clothes. Her workout clothes are green, and she is holding a phone and taking a picture in front of a mirror. There are white walls and white windows around her. What happened when 'Lazy' and 'pick up' were mentioned?", "question_wo_referring_query": "What happened?", "candidates": ["A smoothie with dragon fruit color and fruits on top appeared.", "A salad with spiral pasta and various vegetables appeared.", "A piece of grilled sausage appeared.", "A pizza with grilled sausage appeared.", "A salad with tofu, lettuce, and cherry tomatoes appeared."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@healthfood-7125121950590078254_0", "video_path": "7125121950590078254.mp4", "subtitle_path": "7125121950590078254_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.53, "view_count": 20000}, {"video_id": "@recipesbyanne-7156580984606051589", "question": "In the video, there is a white table with a white box containing some partially sliced string beans. 
When the phrase 'You're my butterfly, baby' is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["A small pan", "A hand wearing a glove", "A bowl of green plants", "Seasoning powder and cooking oil", "A white box"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@recipesbyanne-7156580984606051589_0", "video_path": "7156580984606051589.mp4", "subtitle_path": "7156580984606051589_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.2, "view_count": 30678}, {"video_id": "@recipesbyanne-7205970972879457542", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the scene shows a lid being placed on the pot. Next, a black pot filled with yellow rice, with some chicken breast on top and sprinkled with green herbs is shown. Finally, a spatula stirring the rice in the pot is displayed.", "First, a black pot filled with yellow rice, with some chicken breast on top and sprinkled with green herbs is shown. Next, a spatula stirring the rice in the pot is displayed. Finally, the scene shows a lid being placed on the pot.", "First, a spatula stirring the rice in the pot is displayed. Next, a black pot filled with yellow rice, with some chicken breast on top and sprinkled with green herbs is shown. Finally, the scene shows a lid being placed on the pot.", "First, a spatula stirring the rice in the pot is displayed. Next, the scene shows a lid being placed on the pot. Finally, a black pot filled with yellow rice, with some chicken breast on top and sprinkled with green herbs is shown.", "First, a black pot filled with yellow rice, with some chicken breast on top and sprinkled with green herbs is shown. Next, the scene shows a lid being placed on the pot. 
Finally, a spatula stirring the rice in the pot is displayed."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@recipesbyanne-7205970972879457542_0", "video_path": "7205970972879457542.mp4", "subtitle_path": "7205970972879457542_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.18, "view_count": 161376}, {"video_id": "@recipesbyanne-7230820180786449691", "question": "In the scene, a hand is holding a white bowl and pouring chicken breast fillets into a black pot. The background is white. At the same time, which subtitles appeared together with the chicken breast fillets?", "question_wo_referring_query": "Which subtitles appeared at the same time as the chicken breast fillets?", "candidates": ["Thak ", "Thank You", "Thank you.", "THANK you", "THANK YOU"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "@recipesbyanne-7230820180786449691_0", "video_path": "7230820180786449691.mp4", "subtitle_path": "7230820180786449691_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.95, "view_count": 5300000}, {"video_id": "@recipesbyanne-7210432182249082118", "question": "On the screen, there is a black pot on top of a white marble table. A bowl is pouring egg liquid into the pot. 
What changes happen after the egg is heated in the pot?", "question_wo_referring_query": "What changes happen after the egg is heated in the pot?", "candidates": ["It turned into a fried egg", "It turned into steamed eggs", "It turned into an egg flower", "It turned into tender and smooth eggs", "It turned into scrambled eggs"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@recipesbyanne-7210432182249082118_0", "video_path": "7210432182249082118.mp4", "subtitle_path": "7210432182249082118_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.8, "view_count": 82300}, {"video_id": "@recipesbyanne-7246783133352201499", "question": "In the scene, there is a transparent bowl on a white stone slab, held by a pair of hands with nail polish. Inside the bowl are three pieces of chicken thigh meat. What changes occur to the chicken thigh meat after it is mixed with the seasoning?", "question_wo_referring_query": "What changes occur to the chicken thigh meat?", "candidates": ["The chicken thigh meat was braised to green.", "The chicken thigh meat was braised to white.", "The chicken thigh meat was braised to yellow.", "The chicken thigh meat was braised to red.", "The chicken thigh meat was braised to olive color."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@recipesbyanne-7246783133352201499_0", "video_path": "7246783133352201499.mp4", "subtitle_path": "7246783133352201499_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.45, "view_count": 1944354}, {"video_id": "@recipesbyanne-7148817637500292357", "question": "In the screen, there's a small wooden board with two round cakes on it. The cakes are topped with sauce, vegetables, and potato strips. In front of the wooden board, there's a round plate. Both the wooden board and the plate are on a white marble table. 
On the plate, there's also a round white thin cake topped with white sauce, green leaves, and tomato slices. There's also a hand in the screen. What is this hand doing?", "question_wo_referring_query": "What is this hand doing?", "candidates": ["Using a fork to stab a potato strip", "Using a skewer to hold a potato strip", "Using tongs to pick up a potato strip", "Using chopsticks to pick up a potato strip", "Using a knife to stab a potato strip"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@recipesbyanne-7148817637500292357_0", "video_path": "7148817637500292357.mp4", "subtitle_path": "7148817637500292357_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.3, "view_count": 16300}, {"video_id": "@recipesbyanne-7102764936157236486", "question": "The background in the video is a wooden board with a white bowl on it, filled with strawberries. When 'Jumping on a trampoline' is mentioned, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Frozen strawberries", "A round purple bowl", "A round blue bowl", "A round black bowl", "Frozen cherries"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@recipesbyanne-7102764936157236486_0", "video_path": "7102764936157236486.mp4", "subtitle_path": "7102764936157236486_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.43, "view_count": 2219}, {"video_id": "@recipesbyanne-7134365831621774597", "question": "In the video, there is a hand holding a jar, and a spoon is stirring the chocolate sauce inside the jar. 
What material is the jar made of?", "question_wo_referring_query": "What material is the jar holding the chocolate sauce made of?", "candidates": ["Plastic jar", "Glass jar", "Stainless steel jar", "Purple clay jar", "Ceramic jar"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@recipesbyanne-7134365831621774597_0", "video_path": "7134365831621774597.mp4", "subtitle_path": "7134365831621774597_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.03, "view_count": 32167}, {"video_id": "@recipesbyanne-7094925150604266758", "question": "On the screen, there is a white plate with three pieces of white dessert sprinkled with red decorations and halved blueberries. When the word 'you' appears, what shape is this dessert?", "question_wo_referring_query": "On the screen, there is a white plate with three pieces of white dessert sprinkled with red decorations and halved blueberries. When the word 'you' appears, what shape is this dessert?", "candidates": ["Triangle", "Rectangle", "Square", "Circle", "Heart"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@recipesbyanne-7094925150604266758_0", "video_path": "7094925150604266758.mp4", "subtitle_path": "7094925150604266758_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.27, "view_count": 1311}, {"video_id": "@recipesbyanne-7136224148522634502", "question": "In the video, there is a white plate with a piece of bread covered in chocolate sauce. 
What does the hand in the video put on the bread?", "question_wo_referring_query": "What does the hand in the video put on the bread?", "candidates": ["Banana powder", "Orange slices", "Lemon slices", "Banana slices", "Banana sticks"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@recipesbyanne-7136224148522634502_0", "video_path": "7136224148522634502.mp4", "subtitle_path": "7136224148522634502_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.23, "view_count": 5820}, {"video_id": "@recipesbyanne-7202269291725524230", "question": "There is an orange nine-compartment box on a white marble slab in the video. What happened the first time this box appeared?", "question_wo_referring_query": "What happened the first time this box appeared?", "candidates": ["Pouring strawberry sauce into the box", "Pouring a white liquid into the box", "Pouring a light yellow liquid into the box", "Pouring a coffee-colored liquid into the box", "Pouring an orange liquid into the box"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@recipesbyanne-7202269291725524230_0", "video_path": "7202269291725524230.mp4", "subtitle_path": "7202269291725524230_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.27, "view_count": 333564}, {"video_id": "@recipesbyanne-7164049069776669958", "question": "In the scene, there is a white bowl placed on a white countertop. One hand is holding a tool with a hole in it, and the other hand is holding a lemon over the bowl. Inside the bowl, there are red and orange foods. 
When the subtitle says 'Thank you', what is happening?", "question_wo_referring_query": "When the subtitle says 'Thank you', what is happening?", "candidates": ["Cutting the lemon", "Putting the lemon into the bowl", "Peeling the lemon", "Slicing the lemon", "Rubbing lemon zest into the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@recipesbyanne-7164049069776669958_0", "video_path": "7164049069776669958.mp4", "subtitle_path": "7164049069776669958_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.23, "view_count": 16274}, {"video_id": "@healthfood-6896519861577567494", "question": "In the video, a woman wearing a black and white knitted sweater with a bun in her hair is holding her face with one hand and pointing with one finger to a white popup box in the upper left corner. Behind her, there is a window with a black frame, and the rest of the background consists of two white walls. What happens after she points to the popup box?", "question_wo_referring_query": "What happens after she points to the popup box?", "candidates": ["She stands up and waves one hand", "She starts working out", "She starts running", "She starts singing", "She starts drawing"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@healthfood-6896519861577567494_0", "video_path": "6896519861577567494.mp4", "subtitle_path": "6896519861577567494_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.07, "view_count": 9940}, {"video_id": "@recipesbyanne-7167757797180886277", "question": "There is a white plate on a white marble table with black patterns. On the plate, there are two black triangular sandwiches with green sauce on top and two slices of lemon on the side. 
Which item does the hand in the video pick up first?", "question_wo_referring_query": "Which item does the hand in the video pick up first?", "candidates": ["Avocado", "Green vegetables", "Black sandwich piece", "Green beans", "Lemon slice"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@recipesbyanne-7167757797180886277_0", "video_path": "7167757797180886277.mp4", "subtitle_path": "7167757797180886277_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.5, "view_count": 10696}, {"video_id": "@recipesbyanne-7163652012352081158", "question": "On screen, there is a white table with a white plate holding a round cake. One hand is holding a white sauce container and the other hand is applying the sauce to the cake with a round brush. After mentioning 'So go and get right with me', what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["water", "edamame", "cream", "carrot", "broccoli"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@recipesbyanne-7163652012352081158_0", "video_path": "7163652012352081158.mp4", "subtitle_path": "7163652012352081158_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.5, "view_count": 11200}, {"video_id": "@healthfood-7132891597024316715", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, the sauce with banana slices on top is shown, followed by honey drizzled on top, then the finished dessert is shown with a container next to it, and finally a piece of bread on a plate spread with olive-colored sauce.", "First, the sauce with banana slices on top is shown, followed by honey drizzled on top, then a piece of bread on a plate spread with olive-colored sauce, and finally 
the finished dessert is shown with a container next to it.", "First, the finished dessert is shown with a container next to it, followed by a piece of bread on a plate spread with olive-colored sauce, and finally the sauce topped with banana slices and drizzled with honey.", "First, a piece of bread on a plate is spread with olive-colored sauce, then the sauce is topped with banana slices, followed by honey drizzled on top, and finally the finished dessert is shown, with a container next to it.", "First, a piece of bread on a plate is spread with olive-colored sauce, then the finished dessert is shown with a container next to it, followed by banana slices on the sauce, and finally honey is drizzled on top."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@healthfood-7132891597024316715_0", "video_path": "7132891597024316715.mp4", "subtitle_path": "7132891597024316715_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.7, "view_count": 44780}, {"video_id": "@healthfood-6911436168093617413", "question": "On a black stove, there is a square small frying pan with some striped patterns on it. There are two pieces of flatbread on the pan. 
In which other scenes does the flatbread appear?", "question_wo_referring_query": "In which other scenes does the flatbread appear?", "candidates": ["In a wooden background, inside a triangular dish", "In a wooden background, inside a round dish", "In a wooden background, inside a rectangular dish", "In a wooden background, inside an iron dish", "In a wooden background, inside a square dish"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@healthfood-6911436168093617413_0", "video_path": "6911436168093617413.mp4", "subtitle_path": "6911436168093617413_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.41, "view_count": 10597}, {"video_id": "@recipesbyanne-7176665095831817478", "question": "The screen shows a glass bowl on a white floral-patterned table, containing pale yellow food and some dry powder. There's a whisk inside it. When the food is put into a container and topped with green crackers, what changes occur to the food in the bowl?", "question_wo_referring_query": "What changes occur to the food in the bowl?", "candidates": ["The food turns red", "The food turns yellow", "The food turns orange", "The food turns black", "The food turns coffee-colored"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@recipesbyanne-7176665095831817478_0", "video_path": "7176665095831817478.mp4", "subtitle_path": "7176665095831817478_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.3, "view_count": 14500}, {"video_id": "@_eat_sleep_travel_repeat-7301815521832471840", "question": "The screen shows a road paved with cobblestones, and not far away, there is a glowing Christmas tree surrounded by white columns. In the distance, there is a building with a blue neon light on one side of its wall. 
In the lower corner of the wall, there are three people wearing white, black, and red clothes respectively. What are they doing?", "question_wo_referring_query": "What are they doing?", "candidates": ["making a wish", "walking", "setting off fireworks", "shaking hands", "hugging"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7301815521832471840_0", "video_path": "7301815521832471840.mp4", "subtitle_path": "7301815521832471840_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.13, "view_count": 2944}, {"video_id": "@placesunleashed-7299542502078434565", "question": "The screen shows a high cliff, surrounded by many long, black rocks. There are also some green plants. In the middle of the cliff, a small waterfall is flowing, and below the waterfall, there is a small lake. A person is standing beside the small lake, surrounded by many rocks, among which some green moss and plants are scattered. What is the object present in the screen?", "question_wo_referring_query": "What is the object present in the screen?", "candidates": ["White clothes", "Blue clothes", "Pink clothes", "Green clothes", "Yellow clothes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@placesunleashed-7299542502078434565_0", "video_path": "7299542502078434565.mp4", "subtitle_path": "7299542502078434565_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.47, "view_count": 85294}, {"video_id": "@jetset_anna-6933514222169050373", "question": "The screen shows a jade green sea with many clouds floating in the blue sky. There is a beach beside the jade sea, and several lounge chairs are placed on the beach. 
When 'Glory to you over royalty' is mentioned, what object is present in the screen?", "question_wo_referring_query": "What object is present in the screen?", "candidates": ["White waves", "Pink chair", "Orange chair", "Blue chair", "White chair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@jetset_anna-6933514222169050373_0", "video_path": "6933514222169050373.mp4", "subtitle_path": "6933514222169050373_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.07, "view_count": 12100}, {"video_id": "@placesunleashed-7287593696101534981", "question": "The screen shows a very high mountain, with clouds and mist in the distance. Through the mist, you can see distant slopes and rivers. Nearby, there is a road on the mountain with vehicles traveling on it. The road is surrounded by trees and greenery on the mountain. What type of road is shown in the scene?", "question_wo_referring_query": "What type of road is shown in the scene?", "candidates": ["Overpass bridge", "Straight concrete road", "Winding mountain road", "Dirt road", "Z-shaped road"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@placesunleashed-7287593696101534981_0", "video_path": "7287593696101534981.mp4", "subtitle_path": "7287593696101534981_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.47, "view_count": 4337}, {"video_id": "@jetset_anna-7049751480714185989", "question": "On a beach bathed in sunset light, the sky has a slight blue tint. On the left side of the screen, the sun's rays are dazzling, and a withered tree stands on the beach. A wind chime is hanging from it with a hemp rope, and there are also three pillows on the wind chime. 
When 'Thank you' is mentioned, what shape is the wind chime in the picture?", "question_wo_referring_query": "What shape is the wind chime in the picture?", "candidates": ["Triangle", "Ladder", "Circle", "Square", "Rectangle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@jetset_anna-7049751480714185989_0", "video_path": "7049751480714185989.mp4", "subtitle_path": "7049751480714185989_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.01, "view_count": 61213}, {"video_id": "@jetset_anna-6941426759938215173", "question": "In the video, there is a sky with shades of grey and blue, and an orange-red sunset lighting up the sky. There are some mountain slopes in the distance, and nearby there are some rock formations. On a platform covered with wooden planks, there are some objects placed. What objects are placed on the platform?", "question_wo_referring_query": "What objects are placed on the platform?", "candidates": ["Blue chair", "Round-shaped table", "Square-shaped table", "Pink bench", "Oval-shaped table"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@jetset_anna-6941426759938215173_0", "video_path": "6941426759938215173.mp4", "subtitle_path": "6941426759938215173_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.9, "view_count": 18600}, {"video_id": "@placesunleashed-7292457308779859205", "question": "In the scene, the distant sky is colored by the sunset, there are also black mountains, next to the mountains is a lake, and further ahead is a white building. All nearby buildings are lit with orange lights. There is a white bed on the terrace, and in front of the bed, there is an infinity pool. The pool is also surrounded by a ring of lights. 
What happened when this scene appeared?", "question_wo_referring_query": "What happened when this scene appeared?", "candidates": ["The white door on the terrace was opened.", "The pillow on the bed was taken away.", "The water in the pool is gently flowing.", "Someone in the pool is swimming.", "The cup on the bed was knocked over."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "@placesunleashed-7292457308779859205_0", "video_path": "7292457308779859205.mp4", "subtitle_path": "7292457308779859205_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.45, "view_count": 3101}, {"video_id": "@kelseyinlondon-7341017314445249825", "question": "In the video, which character appears first?", "question_wo_referring_query": "In the video, which character appears first?", "candidates": ["A woman in a white off-shoulder dress", "A woman in a black hooded coat", "A woman in a blue shirt", "A man in a white T-shirt", "A person swimming in turquoise water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@kelseyinlondon-7341017314445249825_0", "video_path": "7341017314445249825.mp4", "subtitle_path": "7341017314445249825_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.0, "view_count": 16412}, {"video_id": "@placesunleashed-7283881910172978438", "question": "The screen shows a beautiful castle being built in the mountains. The castle has red walls and pointy black roofs. It is surrounded by green trees, and in the distance, there is a lake. Sunlight shines on the trees. 
In which other scene does this castle appear?", "question_wo_referring_query": "In which other scene does this castle appear?", "candidates": ["In a snow-covered forest", "In a forest filled with yellow fallen leaves", "On the edge of a cliff", "On a coastline shrouded in the night", "In an amusement park with a merry-go-round"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@placesunleashed-7283881910172978438_0", "video_path": "7283881910172978438.mp4", "subtitle_path": "7283881910172978438_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.72, "view_count": 976}, {"video_id": "@kelseyinlondon-7320934425288641825", "question": "The woman with black hair in the video is walking towards a carved wall, with tall rocks on either side. What change occurs when she reaches a draped viewing platform?", "question_wo_referring_query": "What change occurs?", "candidates": ["The woman raises her hands", "The woman pins a flower", "The woman points into the distance", "The woman sits down", "The woman wears a hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@kelseyinlondon-7320934425288641825_0", "video_path": "7320934425288641825.mp4", "subtitle_path": "7320934425288641825_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.85, "view_count": 3191134}, {"video_id": "@kelseyinlondon-7121752735036280070", "question": "A dining table is placed beside a black railing, with mesh underneath. From outside the railing, you can see the scenery on the opposite side, including rocks, a lake, and greenery. Against the rocks, there are pink and yellow buildings. On the dining table, there is some food: one dish contains a sandwich with cream and tomato, another contains some meat and fries, and there are also two drinks. 
When the phrase 'That I could never forget the way' is mentioned, what changes occur to the dining table?", "question_wo_referring_query": "What changes occur to the dining table?", "candidates": ["A person begins eating with a fork.", "The dining table is moved out of the frame, leaving only a corner in view.", "Two people start clinking their glasses.", "The drinks on the dining table are taken away.", "A piece of tablecloth is placed on the dining table."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "@kelseyinlondon-7121752735036280070_0", "video_path": "7121752735036280070.mp4", "subtitle_path": "7121752735036280070_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.13, "view_count": 47200}, {"video_id": "@placesunleashed-7290191786834808070", "question": "By the side of the azure lake, there are two people wearing black tops. They are gazing at a person with a red safety harness stirring up a splash of water in the middle of the lake. What is this person with a red safety harness doing?", "question_wo_referring_query": "By the side of the azure lake, there are two people wearing black tops. They are gazing at a person with a red safety harness stirring up a splash of water in the middle of the lake. 
What is this person with a red safety harness doing?", "candidates": ["He is falling vertically from the air", "He is using a rope to skim the water surface", "He is swimming", "He is taking a bath", "He is performing a diving stunt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@placesunleashed-7290191786834808070_0", "video_path": "7290191786834808070.mp4", "subtitle_path": "7290191786834808070_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.42, "view_count": 5314}, {"video_id": "@jetset_anna-7027813170349657350", "question": "In a narrow alley paved with stone slabs, several wooden tables are placed. A woman is leaning against the wall, with her head tilted. Her gaze passes through the alley and falls on the blue sea ahead. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["dolphin", "sunglasses", "wooden chair", "sunshade", "car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@jetset_anna-7027813170349657350_0", "video_path": "7027813170349657350.mp4", "subtitle_path": "7027813170349657350_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.97, "view_count": 548400}, {"video_id": "@_eat_sleep_travel_repeat-7296112656413412641", "question": "Outdoors, a man with short blonde hair, wearing a black coat, is kneeling on the ground feeding milk to two lambs. When the subtitle says 'Thank you', what item is present in this scene?", "question_wo_referring_query": "Outdoors, a man with short blonde hair, wearing a black coat, is kneeling on the ground feeding milk to two lambs. 
When the subtitle says 'Thank you', what item is present in this scene?", "candidates": ["Ring", "Milk bottle", "Bowl full of cow's milk", "Sunglasses", "Hair clip"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7296112656413412641_0", "video_path": "7296112656413412641.mp4", "subtitle_path": "7296112656413412641_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.32, "view_count": 1630}, {"video_id": "@kelseyinlondon-7241610445990219034", "question": "In a narrow alley paved with stones, flanked by two low white walls, a woman with long hair wearing a pink top is walking. She walks straight toward the end of the alley, where a Greek flag flutters in the breeze. What type of skirt is this woman wearing?", "question_wo_referring_query": "In this scene, what type of skirt is the woman wearing?", "candidates": ["pink long skirt", "white short skirt", "white dress", "white long skirt", "pink short skirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@kelseyinlondon-7241610445990219034_0", "video_path": "7241610445990219034.mp4", "subtitle_path": "7241610445990219034_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.07, "view_count": 38200}, {"video_id": "@kelseyinlondon-7209247539881610501", "question": "In a restaurant decorated with green plants and a glass roof, many people are dining. A woman wearing light pink clothing is about to take a sip of her drink, and there are two large plates of pizza on the table in front of her. 
When the subtitle reads 'Three, six, nine girls wanna drink wine,' what is the man with a ponytail behind this woman wearing?", "question_wo_referring_query": "What is the man with a ponytail behind this woman wearing?", "candidates": ["Black short sleeve", "White long sleeve", "Light pink long skirt", "White short sleeve", "Blue long sleeve"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@kelseyinlondon-7209247539881610501_0", "video_path": "7209247539881610501.mp4", "subtitle_path": "7209247539881610501_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.9, "view_count": 134200}, {"video_id": "@placesunleashed-7285742445655084294", "question": "In the distance, there is a row of trees on the screen, and nearby there is a yellow-green grass field. Two men are on the grass field. Who is the one facing downward on the ground in the middle of the screen?", "question_wo_referring_query": "Who is the one facing downward on the ground in the middle of the screen?", "candidates": ["The man wearing a black shirt and floral pants", "The man who is shirtless and wearing floral shorts", "The man with a patterned back and wearing floral shorts", "The man who is shirtless and wearing floral pants", "The man wearing a black shirt and floral shorts"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@placesunleashed-7285742445655084294_0", "video_path": "7285742445655084294.mp4", "subtitle_path": "7285742445655084294_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.48, "view_count": 2061}, {"video_id": "@jetset_anna-6995875726817791238", "question": "Azure blue fills the sky, with a few white clouds drifting carelessly, and the jade green sea below remains calm with no waves. On a wooden platform by the sea, there is a pool. 
What happened when this pool appeared for the first time?", "question_wo_referring_query": "What happened when it appeared for the first time?", "candidates": ["The walls of the pool collapsed.", "A fish jumped out of the pool.", "The water in the pool started bubbling continuously.", "The water in the pool dried up.", "The water in the pool turned black."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@jetset_anna-6995875726817791238_0", "video_path": "6995875726817791238.mp4", "subtitle_path": "6995875726817791238_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.21, "view_count": 71667}, {"video_id": "@jetset_anna-6933266805553138950", "question": "The river water is flowing between the houses. A person wearing blue clothing is looking at the wide water surface in front of the olive-colored building. On the water surface, there are two boats, one of which is carrying passengers dressed in blue. The boatman picks up a wooden pole. After this action, what did the boatman do next?", "question_wo_referring_query": "The river water is flowing between the houses. A person wearing blue clothing is looking at the wide water surface in front of the olive-colored building. On the water surface, there are two boats, one of which is carrying passengers dressed in blue. The boatman picks up a wooden pole. 
After this action, what did the boatman do next?", "candidates": ["He jumped into the river.", "He is shaking the boat.", "He sat down on the boat.", "He docked the boat at the shore.", "He jumped onto another boat."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@jetset_anna-6933266805553138950_0", "video_path": "6933266805553138950.mp4", "subtitle_path": "6933266805553138950_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.7, "view_count": 10800}, {"video_id": "@_eat_sleep_travel_repeat-7258213061637590299", "question": "The sun disperses golden rays of light, adding a layer of color to the clouds on the horizon. The seawater reflects the sunlight, creating shimmering ripples. The rocks on the sea appear indistinct, with only a path of ripples visible. What is the first scene after this frame?", "question_wo_referring_query": "What is the first scene after this frame?", "candidates": ["People riding a boat fishing on the sea", "A group of people swimming in the sea", "Under the blue sky and white clouds, a field of radiant mountain peaks surrounds a blue lake", "A lighthouse on rocks by the seaside", "A flock of sea birds flying over the sea"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7258213061637590299_0", "video_path": "7258213061637590299.mp4", "subtitle_path": "7258213061637590299_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.33, "view_count": 1061}, {"video_id": "@luxtravelbe-7350017739282550048", "question": "On the beach, there is a person wearing a white short-sleeved shirt, blue shorts, and a colorful backpack, who is crouching and petting a small dog with their right hand. 
What object is present in this scene?", "question_wo_referring_query": "What object is present in this scene?", "candidates": ["ring", "earphones", "glasses", "necklace", "bracelet"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@luxtravelbe-7350017739282550048_0", "video_path": "7350017739282550048.mp4", "subtitle_path": "7350017739282550048_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.03, "view_count": 15651}, {"video_id": "@_eat_sleep_travel_repeat-7302068588939054369", "question": "In the distance, there are continuous mountain peaks, and nearby, under a tree branch, a person stretches out one hand palm up, with a small bird lowering its head to eat food from the person's palm. When the caption reads 'meet new people. I want to see this beautiful earth that we call home. Like that\u2019s all I want.', what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["A watch", "A sun hat", "An eagle", "Sunglasses", "A drone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7302068588939054369_0", "video_path": "7302068588939054369.mp4", "subtitle_path": "7302068588939054369_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.27, "view_count": 4789406}, {"video_id": "@kelseyinlondon-7284940568873012512", "question": "There are green plants, calm lake water, and mountain peaks outside the half-opened window. There is a small basket placed on the windowsill. 
What are the colors of the items inside the small basket?", "question_wo_referring_query": ", What are the colors of the items inside the small basket?", "candidates": ["Vermilion and green", "Yellow and green", "Pink and green", "White and green", "Green and white"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@kelseyinlondon-7284940568873012512_0", "video_path": "7284940568873012512.mp4", "subtitle_path": "7284940568873012512_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.57, "view_count": 35116}, {"video_id": "@_eat_sleep_travel_repeat-7249001347603860762", "question": "On the tan-colored wall, there is a window that opens inward. On the windowsill, there are two potted plants. Outside the window, there are tree branches full of green leaves. In this scene, what is swaying in the wind?", "question_wo_referring_query": "In this scene, what is swaying in the wind?", "candidates": ["window frame", "tree leaves", "wall", "window", "flower pots"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7249001347603860762_0", "video_path": "7249001347603860762.mp4", "subtitle_path": "7249001347603860762_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.77, "view_count": 2812}, {"video_id": "@kelseyinlondon-7105029975635512581", "question": "On a clear day with white clouds floating in the sky, there is a tall palm tree under the blue sky. Next to the palm tree is a road, and a group of girls are walking along the road. 
When the girl in black clothing first appears, what is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["She is running", "She is driving a car", "She is climbing the palm tree", "She is riding a bicycle", "She is taking pictures"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@kelseyinlondon-7105029975635512581_0", "video_path": "7105029975635512581.mp4", "subtitle_path": "7105029975635512581_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.87, "view_count": 13671}, {"video_id": "@placesunleashed-7300244575589862661", "question": "Two people are standing on a sandy beach with white lounge chairs, with the ocean behind them rolling waves towards the shore. What happened before this scene?", "question_wo_referring_query": "What happened before this scene?", "candidates": ["The sea waters surged under the bridge, overturning.", "The sea waters submerged the bridge.", "People on the bridge were swept away by the water.", "A school of fish jumped out of the sea.", "The sea waters submerged the gas station."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@placesunleashed-7300244575589862661_0", "video_path": "7300244575589862661.mp4", "subtitle_path": "7300244575589862661_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.47, "view_count": 7829}, {"video_id": "@kelseyinlondon-7324379997790047521", "question": "Under the green archway, there is a round platform. Below the round platform, a small channel connects to the front ditch. The sides of the ditch are made of stone bricks. Additionally, there are neatly arranged green plants beside the stone bricks. 
What is the first scene that appears after this frame?", "question_wo_referring_query": "What is the first scene that appears after this frame?", "candidates": ["A solemn and deserted corridor with not a single person in sight.", "On the brown wall, there is a trail of purple flowers climbing up. A fine water jet sprays in the ditch.", "Bright red flowers blooming along the wall.", "A group of people walking beside a green pond.", "A pavilion carved with various floral patterns."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@kelseyinlondon-7324379997790047521_0", "video_path": "7324379997790047521.mp4", "subtitle_path": "7324379997790047521_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.23, "view_count": 1165650}, {"video_id": "@daiki.shino-7265386463586503968", "question": "On a breezy and beautiful day, there are four people on a flat grassland. A child in red clothing is kicking a soccer ball on the grass. 
After the subtitle mentions 'you', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A motorbike drives onto the grassland.", "The boy in red falls to the ground.", "A dog runs to the middle of the road.", "A child in blue climbs up a streetlight.", "Three people walk past a store with the sign 'WINGS'."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@daiki.shino-7265386463586503968_0", "video_path": "7265386463586503968.mp4", "subtitle_path": "7265386463586503968_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.92, "view_count": 8972}, {"video_id": "@_eat_sleep_travel_repeat-7218515650828913925", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the sun is about to set behind the mountain peak, followed by a green tent on a grassy slope, and finally, a lake surrounded by mountains", "First, there is a green tent on a grassy slope, followed by a lake surrounded by mountains, and finally, the sun about to set behind the mountain peak", "First, there is a green tent on a grassy slope, followed by the sun about to set behind the mountain peak, and finally, a lake surrounded by mountains", "First, there is a lake surrounded by mountains, followed by the sun about to set behind the mountain peak, and finally, a green tent on a grassy slope", "First, there is a lake surrounded by mountains, followed by a green tent on a grassy slope, and finally, the sun about to set behind the mountain peak"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7218515650828913925_0", "video_path": "7218515650828913925.mp4", "subtitle_path": "7218515650828913925_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 0, "duration": 13.43, "view_count": 9096}, {"video_id": "@daiki.shino-7301767178024537376", "question": "There is a man in the room wearing a white long-sleeved shirt with his eyes closed. In front of him is a photo of another man who is putting his finger in his mouth. In what scenarios does the man in the photo appear?", "question_wo_referring_query": "In what scenarios does the man in the photo appear?", "candidates": ["On a motorcycle", "In a zoo", "On a bus", "In an indoor swimming pool", "In front of a green backdrop"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@daiki.shino-7301767178024537376_0", "video_path": "7301767178024537376.mp4", "subtitle_path": "7301767178024537376_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.18, "view_count": 24706}, {"video_id": "@kelseyinlondon-7243354190360169754", "question": "There is a group of buildings with multicolored exteriors built along the hillside, following the natural contours of the terrain. A road extends from outside into the buildings, and some pedestrians are walking along it. 
This building complex and which subtitles have appeared together?", "question_wo_referring_query": ", this building complex and which subtitles have appeared together?", "candidates": ["Beach day at Monterosso", "Tik Tok", "Silencio Bruno", "Dinner at La Regina", "Relax at Vernazza"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@kelseyinlondon-7243354190360169754_0", "video_path": "7243354190360169754.mp4", "subtitle_path": "7243354190360169754_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.77, "view_count": 281300}, {"video_id": "@jetset_anna-6947264120605412614", "question": "A road stretches out between two ponds, with a tall and grand building adjacent to both sides of the ponds, towering into the sky. The screen displays the text 'A rare Instagrammable spot in Dubai that is CROWDLESS'. When this text appears on the screen at the trunk of a palm tree, what change occurs?", "question_wo_referring_query": ", what change occurs?", "candidates": ["The text changes to GOOD PALACE", "The text changes to I LOVE DUBAI", "The text changes to THIS IS DUBAI", "The text changes to THANK YOU", "The text changes to PALACE, DOWNTOWN DUBAI"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@jetset_anna-6947264120605412614_0", "video_path": "6947264120605412614.mp4", "subtitle_path": "6947264120605412614_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.04, "view_count": 67973}, {"video_id": "@kelseyinlondon-7275008335106100513", "question": "Under the blue sky, there is a cluster of buildings extending in the distance. In front of these buildings, there is a flowing river, and a small boat is docked on it. A woman, wearing a polka dot long dress and carrying a bag in her hand, is standing on the riverbank opposite. 
When this woman appears together with the subtitle 'Sell that soul to be popular,' what change happens to her skirt?", "question_wo_referring_query": "What change happens to her skirt?", "candidates": ["She changes into a white floral dress.", "She changes into a pure white long dress.", "She changes into an off-white long dress.", "She changes into a pure white short skirt.", "She changes into a black floral dress."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "@kelseyinlondon-7275008335106100513_0", "video_path": "7275008335106100513.mp4", "subtitle_path": "7275008335106100513_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.1, "view_count": 105800}, {"video_id": "@jetset_anna-6937019927996026117", "question": "Under the sunlight, there is a white building. In front of the white building, there is a wall, with a portion of the wall in the shadow. A little girl in a pink dress is standing on the part of the wall that is in the shadow. 
What is this little girl doing?", "question_wo_referring_query": "What is this little girl doing?", "candidates": ["The little girl is sitting on the wall", "The little girl is crying on the wall", "The little girl jumped down from the wall", "The little girl is taking off her shoes", "The little girl is rolling on the wall"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@jetset_anna-6937019927996026117_0", "video_path": "6937019927996026117.mp4", "subtitle_path": "6937019927996026117_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.5, "view_count": 21100}, {"video_id": "@placesunleashed-7299854063850573061", "question": "Sunlight spills on the mountain wall, a building nestles into the mountainside, and below the staircase, people dressed in white and green clothes are looking at the building embedded in the mountain wall from a distance. What is present in this scene?", "question_wo_referring_query": "What is present in this scene?", "candidates": ["Sunshade", "Mobile Phone", "Bicycle", "Car", "Camera"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@placesunleashed-7299854063850573061_0", "video_path": "7299854063850573061.mp4", "subtitle_path": "7299854063850573061_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.47, "view_count": 12421}, {"video_id": "@_eat_sleep_travel_repeat-7254160563050253594", "question": "On a grassy hillside, a short-haired person wearing red clothing and carrying a backpack stands, looking out at the distant valley and hills. 
What kind of pants is this man wearing?", "question_wo_referring_query": "What kind of pants is this man wearing?", "candidates": ["Blue jeans", "Black shorts", "Black wool pants", "Black long pants", "Green shorts"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7254160563050253594_0", "video_path": "7254160563050253594.mp4", "subtitle_path": "7254160563050253594_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.07, "view_count": 2201}, {"video_id": "@luxtravelbe-7240140244354370843", "question": "The rushing stream crashes onto the rocks, creating white splashes, and then flows down into the jade green deep pool beneath the cliff bordered by lush green vegetation. Who is squatting on the ground, watching the white splashes?", "question_wo_referring_query": ", who is squatting on the ground, watching the white splashes?", "candidates": ["The person wearing green clothes", "The person wearing white clothes", "The person wearing olive clothes", "The person wearing blue clothes", "The person wearing black clothes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@luxtravelbe-7240140244354370843_0", "video_path": "7240140244354370843.mp4", "subtitle_path": "7240140244354370843_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.42, "view_count": 14194}, {"video_id": "@daiki.shino-7312964087342812448", "question": "Under the overcast sky, there is a Ferris wheel that has stopped moving. A bird swiftly flies past the Ferris wheel. 
What happened after this?", "question_wo_referring_query": "What happened after this?", "candidates": ["A person wearing a black coat walked from the right side to the left side of the bridge", "A flock of birds flew over the lake", "A fish jumped out of the lake", "A bird crashed into the Ferris wheel", "The Ferris wheel started moving"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@daiki.shino-7312964087342812448_0", "video_path": "7312964087342812448.mp4", "subtitle_path": "7312964087342812448_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.42, "view_count": 18726}, {"video_id": "@_eat_sleep_travel_repeat-7286774330107759905", "question": "A small road leads to a house with a green roof in the distance. Along the road, there are red leaves, and on both sides of the road, there are some low green plants. What animal appeared in the scene right before this?", "question_wo_referring_query": "What animal appeared in the scene right before this?", "candidates": ["deer", "chicken", "dog", "cow", "cat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7286774330107759905_0", "video_path": "7286774330107759905.mp4", "subtitle_path": "7286774330107759905_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.97, "view_count": 717}, {"video_id": "@_eat_sleep_travel_repeat-7280481273788009760", "question": "There is a green character statue on a white pillar, and in the open space in front of the statue, a group of people are dancing and singing in a circle. 
What happens after the subtitle 'Your turn' appears?", "question_wo_referring_query": "What happens?", "candidates": ["A person climbs onto the statue", "Water gushes out from a pipe", "A person is drinking beer", "A kite is flying towards a hill", "A dog is scratching its belly"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7280481273788009760_0", "video_path": "7280481273788009760.mp4", "subtitle_path": "7280481273788009760_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.63, "view_count": 9944}, {"video_id": "@kelseyinlondon-7155520659207048453", "question": "On a white table, there are two cups of white drinks, and on a column next to the table, there is a half-body portrait of a character. Below the column is an endless view of the sea, with a few yachts anchored on the sea. After the subtitle 'Thank you' appears, what item appears in the video?", "question_wo_referring_query": "What item appears in the video after the subtitle 'Thank you'?", "candidates": ["White Chair", "Sculpture", "Column", "Table", "Glass Cup"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "@kelseyinlondon-7155520659207048453_0", "video_path": "7155520659207048453.mp4", "subtitle_path": "7155520659207048453_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.7, "view_count": 2100000}, {"video_id": "@luxtravelbe-7225305829442931994", "question": "A person wearing white shorts is standing on a rock by the sea, with only one foot visible. In front of him is the sea. 
What is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": ["Playing the guitar", "Walking backward", "Walking forward", "Peeling an apple", "Clapping"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@luxtravelbe-7225305829442931994_0", "video_path": "7225305829442931994.mp4", "subtitle_path": "7225305829442931994_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.78, "view_count": 5703}, {"video_id": "@placesunleashed-7282425964263410949", "question": "On the side of the straight road, there are many houses and streetlights. On the road sign on the left, there is a speed limit sign of 40. The plaque on the streetlights reads 'Bon Machi Name Store Street'. In the distance on the road, there is a snowy mountain. Which type of vehicle has appeared here before?", "question_wo_referring_query": "Which type of vehicle has appeared here before?", "candidates": ["Motorcycle", "Car", "Ship", "Bicycle", "Horse-drawn carriage"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@placesunleashed-7282425964263410949_0", "video_path": "7282425964263410949.mp4", "subtitle_path": "7282425964263410949_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.42, "view_count": 753}, {"video_id": "@luxtravelbe-7295062360824302881", "question": "In the swimming pool, a woman is holding a man from behind. There are many dishes of food by the poolside. The back of the swimming pool faces the sea. When the subtitle 'This year, blessings, money, testimony.' 
appears, which of the following items is present?", "question_wo_referring_query": "Which of the following items is present?", "candidates": ["Watch", "Sunglasses", "Computer", "Toothbrush", "Mobile phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@luxtravelbe-7295062360824302881_0", "video_path": "7295062360824302881.mp4", "subtitle_path": "7295062360824302881_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.2, "view_count": 48289}, {"video_id": "@_eat_sleep_travel_repeat-7280853197122440481", "question": "A man is walking on a stone-paved road, wearing a hat and carrying a backpack. There is a railing on the roadside, and on the left side is a small creek with many rocks, and in the distance, there is a small waterfall. When the subtitle \"And lie on the backseat [And lie on the backseat] in Chinese Simplified can be translated as\" appears, what color is the man's hat?", "question_wo_referring_query": "What color is the man's hat?", "candidates": ["red", "black", "pink", "blue", "yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7280853197122440481_0", "video_path": "7280853197122440481.mp4", "subtitle_path": "7280853197122440481_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.53, "view_count": 65406}, {"video_id": "@_eat_sleep_travel_repeat-7199588809707736326", "question": "A bicycle is riding on the road, with grassland to the left and many trees planted on the right side of the road. In the distance, sunlight shines on the mountain. 
Who is riding the bicycle?", "question_wo_referring_query": "Who is riding the bicycle?", "candidates": ["Child", "Person not wearing a hat", "Person wearing a hat", "Person with long hair", "Bald person"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7199588809707736326_0", "video_path": "7199588809707736326.mp4", "subtitle_path": "7199588809707736326_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.83, "view_count": 532}, {"video_id": "@jetset_anna-6930744648658881798", "question": "Sunlight shines on the surface of the lake, where a few small boats are docked. Nearby is a small stream flowing from the lake, with green weeds on both sides. In the distance, there are mountains. When the camera moves to the left, what happens?", "question_wo_referring_query": ", what happens?", "candidates": ["A few people are running", "A few people are playing instruments", "A few people are dancing", "A few people are walking", "A few people are fighting"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@jetset_anna-6930744648658881798_0", "video_path": "6930744648658881798.mp4", "subtitle_path": "6930744648658881798_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.5, "view_count": 7320}, {"video_id": "@placesunleashed-7304704545899154694", "question": "What is the color of the first emoji that appears in the video?", "question_wo_referring_query": "What is the color of the first emoji that appears in the video?", "candidates": ["Purple", "Yellow", "Green", "White", "Blue"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@placesunleashed-7304704545899154694_0", "video_path": "7304704545899154694.mp4", "subtitle_path": "7304704545899154694_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 0, "duration": 14.51, "view_count": 6126}, {"video_id": "@kelseyinlondon-7222212014532201733", "question": "Sunlight shines through tree leaves onto the ground. A long-haired woman in a coat, back facing the camera, looks towards a distant building. The text 'Cotswolds' appears above her head. With which subtitles has this woman appeared together before?", "question_wo_referring_query": "With which subtitles has this woman appeared together before?", "candidates": ["cambridge", "1 hour 14 minute train", "1.5 hour train", "Somewhere in the crowd there"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@kelseyinlondon-7222212014532201733_0", "video_path": "7222212014532201733.mp4", "subtitle_path": "7222212014532201733_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.43, "view_count": 1200000}, {"video_id": "@kelseyinlondon-7235226440428522779", "question": "A woman is sitting at the dining table holding a teacup. The table is filled with lots of food and a bouquet of flowers. In the middle of the scene, there's a lake, and in the distance, there is a green hillside. What is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["drinking something", "peeling an apple", "playing guitar", "clapping", "eating an apple"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@kelseyinlondon-7235226440428522779_0", "video_path": "7235226440428522779.mp4", "subtitle_path": "7235226440428522779_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.47, "view_count": 513227}, {"video_id": "@placesunleashed-7299104442001919238", "question": "A person dressed in red clothes and wearing a hat stands on a grassy field, looking at distant mountains with white snow on the peaks. 
Which of the following items has appeared?", "question_wo_referring_query": "Which of the following items has appeared?", "candidates": ["Mobile phone", "Airplane", "Backpack", "Car", "Computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@placesunleashed-7299104442001919238_0", "video_path": "7299104442001919238.mp4", "subtitle_path": "7299104442001919238_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.41, "view_count": 54251}, {"video_id": "@kelseyinlondon-7063550915600485638", "question": "A pair of hands is holding some food, and in front of the hands there is a cylindrical object with a black bottom and a red top. In the distance, there is a building. When the subtitle 'Girl, you know you're lost, lost in the thrill of it all' appears, which of the following foods has appeared?", "question_wo_referring_query": "Which of the following foods has appeared?", "candidates": ["Dried cake", "Banana", "Apple", "Ice cream", "Instant noodles"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@kelseyinlondon-7063550915600485638_0", "video_path": "7063550915600485638.mp4", "subtitle_path": "7063550915600485638_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.77, "view_count": 65500}, {"video_id": "@jetset_anna-6954028871008259333", "question": "In the sunset, there are many blooming flowers. Behind the flowers, there is a house. The roof of the house is blue, and at the very top of the house, there is a cross. When the caption 'You dies, don't you want somebody to love?' 
appears, what color are the flowers?", "question_wo_referring_query": "What color are the flowers?", "candidates": ["black", "blue", "purple", "white", "yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@jetset_anna-6954028871008259333_0", "video_path": "6954028871008259333.mp4", "subtitle_path": "6954028871008259333_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.17, "view_count": 51400}, {"video_id": "@kelseyinlondon-7342869275679477025", "question": "There is a person jumping into a swimming pool. Along the edge of the pool, there are many green plants and two plants with pink flowers. Next to the plants with pink flowers, there are a few people lying on lounge chairs. Who is the person jumping into the pool?", "question_wo_referring_query": "Who is the person jumping into the pool?", "candidates": ["baby", "woman in black swimsuit", "person wearing red clothes", "one gentleman", "little boy"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@kelseyinlondon-7342869275679477025_0", "video_path": "7342869275679477025.mp4", "subtitle_path": "7342869275679477025_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.57, "view_count": 693245}, {"video_id": "@kelseyinlondon-7160314365235825925", "question": "A woman is standing in front of a window. She is wearing a hat, a trench coat, and holding a cup of milk tea. Behind her, there are many books in the window. 
When the caption \"and I can't no way\" appears, what is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Opening the handbag", "Lifting the cup of milk tea", "Taking off her coat", "Taking off her hat", "Brushing her hair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "@kelseyinlondon-7160314365235825925_0", "video_path": "7160314365235825925.mp4", "subtitle_path": "7160314365235825925_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.77, "view_count": 36725}, {"video_id": "@kelseyinlondon-7281317854153149728", "question": "Gray clouds are floating over the distant mountains, and nearby there are green grass and plants. Two rows of long chairs are vaguely visible. What happens in the scene after a hand lightly pushes open the window handle on the left side of the window?", "question_wo_referring_query": "What happens in the scene after the hand lightly pushes open the window handle on the left side?", "candidates": ["Watering the lawn downstairs.", "The scene switches to a wall with vines on the left side of the screen, and under the vines, there is a white window frame.", "Move the flower from the window sill to the outside.", "The scene switches to a grassy field where a corner of a brick building can be seen, and the right side of the scene has a gray wooden table."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@kelseyinlondon-7281317854153149728_0", "video_path": "7281317854153149728.mp4", "subtitle_path": "7281317854153149728_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.9, "view_count": 278784}, {"video_id": "@jetset_anna-7008578911613357317", "question": "What color is the handrail that appears first in the video?", "question_wo_referring_query": "What color is the handrail that appears first in the video?", 
"candidates": ["Yellow", "Green", "White", "Black", "Red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@jetset_anna-7008578911613357317_0", "video_path": "7008578911613357317.mp4", "subtitle_path": "7008578911613357317_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.05, "view_count": 83387}, {"video_id": "@kelseyinlondon-7207858544941780230", "question": "A woman is sitting by the lake, with a white pillow and quilt beside her. She is holding a wine glass, and there is a bottle of wine and a cup behind her. After the subtitle 'So I' appears, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["A person is driving a car", "A person is steering a boat", "A person is riding a motorcycle", "A person is riding a bicycle", "Three people are riding bicycles"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@kelseyinlondon-7207858544941780230_0", "video_path": "7207858544941780230.mp4", "subtitle_path": "7207858544941780230_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.27, "view_count": 38100}, {"video_id": "@placesunleashed-7309193488267037958", "question": "On the right side of the mountain wall, there are many green weeds, and the waterfall flows into the blue water. In the distance, there are cliffs with many trees below them. 
After the subtitle 'Its turquoise color and amazing red contrasted diffs are the perfect scenery' appears, what is the first object that appears?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["electromagnet", "chain", "iron pot", "computer", "mobile phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "@placesunleashed-7309193488267037958_0", "video_path": "7309193488267037958.mp4", "subtitle_path": "7309193488267037958_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.28, "view_count": 7500}, {"video_id": "@jetset_anna-6998579159387639046", "question": "On the sofa, there is a cerulean pillow. Behind the sofa are 2 round tables and 1 square table. In the back, there's a cerulean sea. On the right side of the screen, there is a house with a thatched roof. In which scenes does the cerulean pillow appear below?", "question_wo_referring_query": "In which scenes below does the cerulean pillow appear?", "candidates": ["There are white clouds with mist in the sky. On the cerulean sea surface, there is a grey wooden board with neatly placed dining tables and sofas.", "A group of people are surfing under the azure sky.", "A boat passes under the azure sky.", "Under the azure sky is an endless sea."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@jetset_anna-6998579159387639046_0", "video_path": "6998579159387639046.mp4", "subtitle_path": "6998579159387639046_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.59, "view_count": 82308}, {"video_id": "@movie.explained6-7252659832933649666", "question": "A man in a short-sleeved shirt is sitting in a bar. He is holding a bottle in one hand and a glass in the other, while several people are standing behind him. 
What is this man in the short-sleeved shirt doing?", "question_wo_referring_query": "What is this man in the short-sleeved shirt doing?", "candidates": ["Peeling an apple", "Drinking alcohol", "Playing basketball", "Singing", "Dancing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7252659832933649666_0", "video_path": "7252659832933649666.mp4", "subtitle_path": "7252659832933649666_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 34, "duration": 11.0, "view_count": 334}, {"video_id": "@movie.explained6-7277890595899706626", "question": "A person is spitting saliva into a container on a table. This person is wearing black clothes. On a white sofa behind him, there is another person lying down. Which of the following items have appeared?", "question_wo_referring_query": "Which of the following items have appeared?", "candidates": ["Iron cup", "Glass basin", "High-heeled cup", "Plate", "Iron basin"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7277890595899706626_0", "video_path": "7277890595899706626.mp4", "subtitle_path": "7277890595899706626_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 11, "duration": 12.0, "view_count": 5667}, {"video_id": "@movie.explained6-7252661962218261761", "question": "A person wearing short sleeves is sitting on the edge of a bed, with his hands supporting the bed. There are many papers stuck on the wall behind the bed. The bed is very messy. 
When the subtitle 'One day, as Dae-ho is still trying to find a way to locate his son, he comes across some information on lucid dreaming' appears, which of the following objects is present?", "question_wo_referring_query": "Which of the following objects is present?", "candidates": ["Blue pacifier", "Red pacifier", "Beer bottle", "Television", "White pacifier"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7252661962218261761_0", "video_path": "7252661962218261761.mp4", "subtitle_path": "7252661962218261761_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 12, "duration": 12.0, "view_count": 485}, {"video_id": "@movie.explained6-7270311507689245954", "question": "A person chained by an iron chain is lifting an object and smashing it towards a person wearing safety goggles and a mask. The chained person is wearing yellow clothes and kneeling on the ground. What color is the clothing of the person wearing safety goggles and a mask?", "question_wo_referring_query": "What color is the clothing of the person wearing safety goggles and a mask?", "candidates": ["green", "pink", "blue", "red", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "@movie.explained6-7270311507689245954_0", "video_path": "7270311507689245954.mp4", "subtitle_path": "7270311507689245954_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 37, "duration": 11.0, "view_count": 12200}, {"video_id": "@movie.explained6-7252659370511617281", "question": "In the upper right corner of the screen, there is an English word group printed with 'OMELETTE BAR'. In the dimly lit space, some dotted lights are shining. 
Who is the person holding a wine bottle and drinking in the screen?", "question_wo_referring_query": "Who is the person holding a wine bottle and drinking in the screen?", "candidates": ["The man standing and wearing a green short sleeve", "The man sitting and wearing a green short sleeve", "The woman standing and wearing a pink off-shoulder dress", "The woman standing and wearing a green short sleeve"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie.explained6-7252659370511617281_0", "video_path": "7252659370511617281.mp4", "subtitle_path": "7252659370511617281_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 25, "duration": 10.0, "view_count": 400}, {"video_id": "@daily.stories20-7254684960597019947", "question": "A man in a blue shirt appears on the screen, with the words 'in case you can sell a bag of cocaine one time at a child' visible. When the subtitles show 'Because you can sell a bag of cocaine one time,' what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Dancing", "Playing guitar", "Clapping", "Singing", "Talking"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@daily.stories20-7254684960597019947_0", "video_path": "7254684960597019947.mp4", "subtitle_path": "7254684960597019947_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 30, "duration": 12.0, "view_count": 765}, {"video_id": "@movie.explained6-7272321274221776129", "question": "In the dead of night, a man stands at the highest point of a transmission tower, preparing to jump down. 
After the man jumps down, what does he do?", "question_wo_referring_query": ", what does the man do?", "candidates": ["peel an apple", "climb back up", "roll on the ground", "play the guitar", "dance"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7272321274221776129_0", "video_path": "7272321274221776129.mp4", "subtitle_path": "7272321274221776129_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 2, "duration": 8.0, "view_count": 4012}, {"video_id": "@movie.explained6-7252662893307579650", "question": "Who is the first person to be pinned down in the video?", "question_wo_referring_query": "Who is the first person to be pinned down in the video?", "candidates": ["The little girl", "The person wearing sunglasses", "The woman", "The little boy", "The man in blue clothing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "@movie.explained6-7252662893307579650_0", "video_path": "7252662893307579650.mp4", "subtitle_path": "7252662893307579650_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 23, "duration": 11.0, "view_count": 540}, {"video_id": "@movie.explained6-7268843891120491778", "question": "In the scene, there is only one woman wearing glasses with short hair. 
What did the man do in the video before the subtitles mentioned, 'I want you to feed her first when Hobie\u2019s Heroes comes home.'?", "question_wo_referring_query": "What did the man do in the video?", "candidates": ["The man lay on the ground", "The man raised his hands", "The man held his chin with his hand", "The man stood up", "The man looked up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7268843891120491778_0", "video_path": "7268843891120491778.mp4", "subtitle_path": "7268843891120491778_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 25, "duration": 14.0, "view_count": 2378}, {"video_id": "@movie.explained6-7254803083685793026", "question": "In a room with a half-open door, there are three people: a woman with a red headscarf and golden hair stands between two men with shaved heads. After the subtitles mention \u201cThe dealers of the chocolate nest were shocked and immediately gave her an increased supply,\u201d what item appears in the video?", "question_wo_referring_query": "What item appears in the video?", "candidates": ["earrings", "a headscarf", "a ring", "a watch", "a scarf"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7254803083685793026_0", "video_path": "7254803083685793026.mp4", "subtitle_path": "7254803083685793026_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 11, "duration": 9.0, "view_count": 5015}, {"video_id": "@movie.explained6-7253163385716509953", "question": "There are two women in a pond. The one on the left side of the screen has long auburn hair and is wearing a blue bikini, while the one on the right side of the screen has black hair and is touching her ear. 
In which scenes does the black-haired woman appear?", "question_wo_referring_query": "In which scenes does the black-haired woman appear?", "candidates": ["On a motorcycle", "On a bus", "Next to a green statue", "In a white bathtub", "In a church"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7253163385716509953_0", "video_path": "7253163385716509953.mp4", "subtitle_path": "7253163385716509953_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 13, "duration": 14.0, "view_count": 1091}, {"video_id": "@movie.explained6-7269176266773777666", "question": "Below the white cabinet where items are placed, a man with short golden hair and wearing glasses looks at the palm of his raised hand. When this man appears next to the rolled-up blind, what change happens to him?", "question_wo_referring_query": ", what change happens to him?", "candidates": ["His hand sparkles", "He holds a cup in his right hand", "He removes his glasses", "He changes into an entirely gray outfit", "He changes into an entirely white outfit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7269176266773777666_0", "video_path": "7269176266773777666.mp4", "subtitle_path": "7269176266773777666_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 20, "duration": 9.0, "view_count": 2118}, {"video_id": "@movie.explained6-7270058251767631105", "question": "In a space with a black car parked, there is a man wearing a blue shirt and a woman wearing a yellow shirt. 
What is the man in the blue shirt doing?", "question_wo_referring_query": "What is the man wearing a blue shirt doing?", "candidates": ["He is lifting the woman wearing a yellow shirt horizontally", "He is undressing in front of the black car", "He is embracing the woman wearing a yellow shirt", "He is hugging the woman wearing a yellow shirt from behind", "He is squatting in front of the black car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7270058251767631105_0", "video_path": "7270058251767631105.mp4", "subtitle_path": "7270058251767631105_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 6, "duration": 9.0, "view_count": 4780}, {"video_id": "@movie.explained6-7279387480128867586", "question": "By a gray wall, there is a woman wearing a black top with her hair tied up. At the bottom of the screen, there are white letters indicating that this is not a robbery and murder. What objects can be seen in the scene?", "question_wo_referring_query": "What objects can be seen in the scene?", "candidates": ["Black bracelet", "Golden earrings", "Black flashlight", "Red bracelet", "Silver necklace"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7279387480128867586_0", "video_path": "7279387480128867586.mp4", "subtitle_path": "7279387480128867586_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 24, "duration": 11.0, "view_count": 7236}, {"video_id": "@movie.explained6-7258691109684071682", "question": "In a room with warm yellow lighting, a little girl wearing a dark blue short-sleeved shirt and braids is running forward. In front of her is a white door. When the subtitle 'At night, however, Ellie was hiding in the bathroom throwing up.' 
appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Yellow plush toy", "Golden bracelet", "White chrysanthemum", "Yellow rug", "Green plants"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7258691109684071682_0", "video_path": "7258691109684071682.mp4", "subtitle_path": "7258691109684071682_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 10, "duration": 9.99, "view_count": 2834}, {"video_id": "@movie.explained6-7275992010580856065", "question": "In a room with green walls, there is a man with a white beard, and at the bottom of the screen, there are white letters that read 'It seemed her glasses were blocking the transfer'. When the subtitle 'He decided to go with her son.' appears, what color is the hair of the man with the white beard?", "question_wo_referring_query": "What color is the hair of the man with the white beard?", "candidates": ["Golden long hair", "Silver hair", "White shoulder-length curly hair", "Golden shoulder-length curly hair", "White long hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "@movie.explained6-7275992010580856065_0", "video_path": "7275992010580856065.mp4", "subtitle_path": "7275992010580856065_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 14, "duration": 14.02, "view_count": 2949}, {"video_id": "@movie.explained6-7278937063854968065", "question": "In a scene with white text saying 'pop out and fly into the kids' toys', a little boy with curly hair, dressed in a bean green shirt, is shown with his mouth wide open and a terrified expression, looking ahead. 
What caused the boy in the bean green shirt to look ahead with such a terrified expression?", "question_wo_referring_query": "What caused the boy in the bean green shirt to look ahead with such a terrified expression?", "candidates": ["a human corpse", "a gigantic snake", "a dead bird", "a crawling centipede", "a decomposing cat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie.explained6-7278937063854968065_0", "video_path": "7278937063854968065.mp4", "subtitle_path": "7278937063854968065_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 5, "duration": 10.0, "view_count": 9341}, {"video_id": "@movie.explained6-7268252760057957633", "question": "In front of a blurry background, there is a man wearing a dark blue suit, black-framed glasses, and with short hair. Below the man, there are white words 'With only 7 minutes left'. What is the man in the dark blue suit doing the first time he appears?", "question_wo_referring_query": "What is the man in the dark blue suit doing the first time he appears?", "candidates": ["Talking while turning his head to the left", "Climbing a cliff", "Repairing a phone", "Running coolly on a bridge", "Opening a car door"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "@movie.explained6-7268252760057957633_0", "video_path": "7268252760057957633.mp4", "subtitle_path": "7268252760057957633_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 37, "duration": 11.0, "view_count": 2911}, {"video_id": "@movie.explained6-7269280939656760578", "question": "In a blurry screen, there is a passage with many people standing on it. One of them is a man in a black suit. When the subtitle \"Neil, it's so good to see you. I love what you're wearing, It really does. 
It's just a great ensemble.\" appears, what is the man in the black suit doing?", "question_wo_referring_query": "What is the man in the black suit doing?", "candidates": ["Facing a wall and punching it", "Shaking hands with a man in a gray-black suit", "Running forward joyfully", "Happily spinning in place", "Hugging a man in a gray-black suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7269280939656760578_0", "video_path": "7269280939656760578.mp4", "subtitle_path": "7269280939656760578_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 30, "duration": 13.0, "view_count": 2427}, {"video_id": "@movie.explained6-7253802764088675586", "question": "In a brightly lit space, a woman wearing black suspenders and with black, shoulder-length curly hair is undergoing an inspection. Behind her is an inspector wearing a blue-green hat. What did the woman in black suspenders do after placing something on her chest?", "question_wo_referring_query": "What did the woman wearing black suspenders do after placing something on her chest?", "candidates": ["Tightened the clothes on her chest", "Hugged the inspector wearing the blue-green hat", "Knelt down to pick something up", "Happily spun around on the spot", "Raised both hands with palms facing the inspector wearing the blue-green hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7253802764088675586_0", "video_path": "7253802764088675586.mp4", "subtitle_path": "7253802764088675586_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 3, "duration": 8.01, "view_count": 2207}, {"video_id": "@movie.explained6-7272501460695371010", "question": "On a screen that has the white text 'Stauffer is unable to avoid', a man rolls down from a white slope. 
After 'Stoffer is unable to avoid it and tumbles down the hill' is said, what happens to the man who rolls down from the white slope?", "question_wo_referring_query": "What happens to the man who rolls down from the white slope?", "candidates": ["Freezes to death", "Gets buried by white snow", "Is attacked and taken away by a wolf", "Falls into a coma", "Is rescued by a man wearing a military green coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7272501460695371010_0", "video_path": "7272501460695371010.mp4", "subtitle_path": "7272501460695371010_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 8, "duration": 9.0, "view_count": 3371}, {"video_id": "@movie.explained6-7268473687685352706", "question": "In a scene with white text saying 'I'm stuck inside the ATM,' there is a printer that has a hand pulling out a sheet of white paper with black text from the printer's mouth. After saying 'Do not give my consent to use my face on whatever show this is.' 
what was the first object that appeared in the scene?", "question_wo_referring_query": "What was the first object that appeared in the scene?", "candidates": ["Blue and white striped short sleeve", "Green plant", "Yellow round face icon", "Red light wave", "Blue hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7268473687685352706_0", "video_path": "7268473687685352706.mp4", "subtitle_path": "7268473687685352706_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 25, "duration": 9.0, "view_count": 4609}, {"video_id": "@movie.explained6-7252676006950145281", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a dark-skinned man leaning against a green wall; then, a man in a brown coat raising one hand; finally, a scene of one hand holding a white board.", "First, a man in a brown coat raising one hand; then, a scene of one hand holding a white board; finally, a dark-skinned man leaning against a green wall.", "First, a man in a brown coat raising one hand; then, a dark-skinned man leaning against a green wall; finally, a scene of one hand holding a white board.", "First, a scene of one hand holding a white board; then, a dark-skinned man leaning against a green wall; finally, a man in a brown coat raising one hand.", "First, a dark-skinned man leaning against a green wall; then, a scene of one hand holding a white board; finally, a man in a brown coat raising one hand."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "@movie.explained6-7252676006950145281_0", "video_path": "7252676006950145281.mp4", "subtitle_path": "7252676006950145281_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 23, "duration": 12.0, "view_count": 563}, {"video_id": 
"@movie.explained6-7269356762506054913", "question": "In a white room, there is a rectangular mirror and a cream-colored cabinet hanging on the wall. Next to the rectangular mirror, there is also a round mirror. In which of the following scenes does the round mirror next to the rectangular mirror appear?", "question_wo_referring_query": "In which of the following scenes does the round mirror next to the rectangular mirror appear?", "candidates": ["On a makeup table with many lipsticks", "In a bedroom with white curtains", "In a room with a man wearing glasses but no clothes", "In a dressing room full of bags", "On a white desk"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7269356762506054913_0", "video_path": "7269356762506054913.mp4", "subtitle_path": "7269356762506054913_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 8, "duration": 8.0, "view_count": 2151}, {"video_id": "@movie.explained6-7270058138693422338", "question": "A hand wearing various bracelets reaches out from the trunk of a red car; what subtitles have appeared along with the hand reaching out from the trunk?", "question_wo_referring_query": "Which subtitles have appeared along with the hand reaching out from the trunk?", "candidates": ["\"It looks very dangerous\"", "\"The driver behind witnesses this and immediately dials 911\"", "\"Be careful to avoid startling the snake with grass\"", "\"I don't know if the person in the trunk is safe\"", "\"As the kidnapper grows wary, he exits the highway.\""], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "@movie.explained6-7270058138693422338_0", "video_path": "7270058138693422338.mp4", "subtitle_path": "7270058138693422338_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1, "duration": 14.0, "view_count": 12077}, {"video_id": "@movie.explained6-7271974975819599105", 
"question": "On a road with white borders, a man in a white shirt is running forward. In front of the man, there are green trees. When a white car drives by on the road with white borders, what change occurs to the color of the trees?", "question_wo_referring_query": "What change occurs to the color of the trees?", "candidates": ["The color changed from green to orange", "The color changed from green to yellow", "The color changed from green to white", "The color changed from green to brown", "The color changed from green to purple"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7271974975819599105_0", "video_path": "7271974975819599105.mp4", "subtitle_path": "7271974975819599105_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 32, "duration": 13.98, "view_count": 4061}, {"video_id": "@movie.explained6-7266768102879153415", "question": "In a scene with a blurry background, a little girl with auburn long hair, wearing a green armored vest, appears. What change occurs to the little girl with auburn long hair when the subtitle 'Stop it!' 
appears for the third time?", "question_wo_referring_query": "What change occurs to the little girl with auburn long hair?", "candidates": ["The little girl changes to a blue pen.", "The little girl changes to a blue armored vest.", "The pen in the little girl's hand disappears.", "The little girl goes from sitting on the chair to standing by the wall.", "The little girl changes to a green pen."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7266768102879153415_0", "video_path": "7266768102879153415.mp4", "subtitle_path": "7266768102879153415_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1, "duration": 13.0, "view_count": 8417964}, {"video_id": "@movie.explained6-7268899125012270337", "question": "In front of a blurry background, there is a man with short black hair wearing glasses, and at the bottom of the screen there are white letters saying 'I couldn't go to work All day that day'. What is the man with glasses in the picture doing?", "question_wo_referring_query": "What is the man with glasses in the picture doing?", "candidates": ["Pushing a shopping cart", "Withdrawing money from an ATM", "Waiting for a bus at a bus stop", "Happily spinning in place", "Licking his left index finger with his tongue"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7268899125012270337_0", "video_path": "7268899125012270337.mp4", "subtitle_path": "7268899125012270337_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 19, "duration": 14.0, "view_count": 1582}, {"video_id": "@movie.explained6-7276650670554451202", "question": "Next to a red brick wall, a man wearing a mixed-color coat is running. The man is also holding a dark brown bag, and at the bottom of the screen, there are white text 'superpower does not include flight'. 
Which objects are present in the scene?", "question_wo_referring_query": "Which objects are present in the scene?", "candidates": ["Red-brown hat", "Yellow orchid", "White skirt", "Purple tail", "White chrysanthemum"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7276650670554451202_0", "video_path": "7276650670554451202.mp4", "subtitle_path": "7276650670554451202_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 33, "duration": 10.98, "view_count": 3498}, {"video_id": "@movie.explained6-7276264271447690497", "question": "In a room with many people standing, a man in a white shirt is punching another man wearing a black and green striped T-shirt. When the subtitle 'He tried hard to pull his hand out, but the next second' appears, what objects can be seen on the screen?", "question_wo_referring_query": "What objects can be seen on the screen?", "candidates": ["a green plant", "a blue hat", "a white long skirt", "a white flowerpot", "a rice white hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7276264271447690497_0", "video_path": "7276264271447690497.mp4", "subtitle_path": "7276264271447690497_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 8, "duration": 8.01, "view_count": 3159}, {"video_id": "@movie.explained6-7252675383697345793", "question": "In a dark room, there is a woman with her hair tied up and wearing a necklace. Above her is a flickering light. 
What color clothes is the woman with her hair tied up wearing?", "question_wo_referring_query": "What color clothes is the woman with her hair tied up wearing?", "candidates": ["blue", "purple", "white", "red", "green"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "@movie.explained6-7252675383697345793_0", "video_path": "7252675383697345793.mp4", "subtitle_path": "7252675383697345793_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 3, "duration": 8.0, "view_count": 604}, {"video_id": "@movie.explained6-7268872355361852674", "question": "In a green meadow, there is a little girl wearing pink pants and a man with a watch on his wrist. What was the little girl in pink pants doing the first time she appeared?", "question_wo_referring_query": "What was the little girl in pink pants doing the first time she appeared?", "candidates": ["Running to catch a butterfly on a flower", "Crawling on the grass catching bugs", "Holding a white kitten", "Romping in the autumn scenery", "Flying a kite on the grass"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "@movie.explained6-7268872355361852674_0", "video_path": "7268872355361852674.mp4", "subtitle_path": "7268872355361852674_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 45, "duration": 12.0, "view_count": 1667}, {"video_id": "@movie.explained6-7269743822848920833", "question": "In an empty room, a little girl wearing light-colored jeans is sitting on the floor. Beside the girl is a woman with her hair tied up. In front of them is a man wearing a brown suit. 
When the subtitle 'Well, I did it' appears, what is the man in the brown suit doing?", "question_wo_referring_query": "What is the man in the brown suit doing?", "candidates": ["Hugging the woman with her hair tied up", "Picking up the little girl who is sitting on the floor", "Crawling on the floor", "Kneeling on one knee on the ground", "Lying on the floor"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7269743822848920833_0", "video_path": "7269743822848920833.mp4", "subtitle_path": "7269743822848920833_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 36, "duration": 14.0, "view_count": 3001}, {"video_id": "@movie.explained6-7271571132912815362", "question": "In a white hospital, a little girl wearing a blue cap is sitting in a wheelchair. On both sides of her are doctors dressed in dark green short sleeves. After the doctor on the right side of the girl taps the handle of the wheelchair with his finger, what does he do next?", "question_wo_referring_query": "After the doctor on the right side of the girl taps the handle of the wheelchair with his finger, what does he do next?", "candidates": ["Puts on a mask", "Touches the girl's face", "Massages her shoulder", "Raises his hand into a fist", "Helps the girl with her pillow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7271571132912815362_0", "video_path": "7271571132912815362.mp4", "subtitle_path": "7271571132912815362_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 7, "duration": 14.0, "view_count": 19767}, {"video_id": "@movie.explained6-7254238959469890818", "question": "In the video, who is the first character to appear?", "question_wo_referring_query": "In the video, who is the first character to appear?", "candidates": ["The man with red eyes and fair skin", "The man wearing black sunglasses", "The man 
wearing a black trench coat", "The man with black short hair, wearing a dark blue jacket", "The man wearing white trousers"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "@movie.explained6-7254238959469890818_0", "video_path": "7254238959469890818.mp4", "subtitle_path": "7254238959469890818_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 28, "duration": 7.98, "view_count": 2904}, {"video_id": "@movie.explained6-7279295187652857090", "question": "In a blurry scene, there is a woman with long black hair wearing a silver ring on her hand. After the narration says 'more desperate is that she even choked on her teeth, and breathing became unusually difficult,' what did the woman with long black hair do?", "question_wo_referring_query": "What did the woman with long black hair do?", "candidates": ["Crying on the ground", "Crying with her head in her hands", "Touching her neck with her hand", "Happily stretching both arms", "Making a phone call with a green-cased cellphone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7279295187652857090_0", "video_path": "7279295187652857090.mp4", "subtitle_path": "7279295187652857090_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 18, "duration": 14.0, "view_count": 7974}, {"video_id": "@movie.explained6-7264915750425545992", "question": "On a clear and sunny day, there is a blue-gray pickup truck parked outside. Next to the pickup truck stands a man wearing an olive-green trench coat. 
When the man says, 'Hey teach,' who is the first character to appear?", "question_wo_referring_query": "Who is the first character to appear?", "candidates": ["A man wearing a white shirt and dark blue vest", "A man wearing a blue hoodie", "A man wearing a white suit", "A man wearing a red tie", "A man wearing a red shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7264915750425545992_0", "video_path": "7264915750425545992.mp4", "subtitle_path": "7264915750425545992_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 4, "duration": 14.0, "view_count": 595435}, {"video_id": "@movie.explained6-7253144750310575362", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First is a scene with a dark-skinned man wearing glasses; then a scene with a man in a black short-sleeved shirt talking; and finally a scene with a woman with short blonde hair holding glasses.", "First is a scene with a man in a black short-sleeved shirt talking; then a scene with a dark-skinned man wearing glasses; and finally a scene with a woman with short blonde hair holding glasses.", "First is a scene with a man in a black short-sleeved shirt talking; then a scene with a woman with short blonde hair holding glasses; and finally a scene with a dark-skinned man wearing glasses.", "First is a scene with a dark-skinned man wearing glasses; then a scene with a woman with short blonde hair holding glasses; and finally a scene with a man in a black short-sleeved shirt talking.", "First is a scene with a woman with short blonde hair holding glasses; then a scene with a man in a black short-sleeved shirt talking; and finally a scene with a dark-skinned man wearing glasses."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": 
"@movie.explained6-7253144750310575362_0", "video_path": "7253144750310575362.mp4", "subtitle_path": "7253144750310575362_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 37, "duration": 14.0, "view_count": 2999}, {"video_id": "@movie.explained6-7268834822968184065", "question": "In a room with lighting, a man with short hair holding a baby is making a phone call. In which of the following scenes does the man holding the baby appear?", "question_wo_referring_query": "In which of the following scenes does the man holding the baby appear?", "candidates": ["In a quiet park", "In a room with a white sofa", "In a white hospital", "On a beautiful beach", "In a scene with a window and people opposite the window"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7268834822968184065_0", "video_path": "7268834822968184065.mp4", "subtitle_path": "7268834822968184065_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 30, "duration": 14.0, "view_count": 2189}, {"video_id": "@movie.explained6-7268981987451424001", "question": "In a scene with the white text 'You don't Have to call him daddy.', a little boy in a checkered shirt is talking to the right. When the subtitle 'She said you're not her real father anyway.' 
appears, what change occurs to the boy in the checkered shirt?", "question_wo_referring_query": "What change occurs to the boy in the checkered shirt?", "candidates": ["He stops talking to the right and begins talking to the left", "He changes into a blue T-shirt", "He starts wearing a blue hat", "He changes into a yellow T-shirt", "He changes into a black hoodie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7268981987451424001_0", "video_path": "7268981987451424001.mp4", "subtitle_path": "7268981987451424001_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 3, "duration": 14.0, "view_count": 1933}, {"video_id": "@kerstinong-7287196927026924802", "question": "On a sunny and bright day, a woman in a purple bikini is standing on the beach, with the vast sea not far away. What is the woman in the purple bikini doing on the beach?", "question_wo_referring_query": "What is the woman in the purple bikini doing on the beach?", "candidates": ["Holding a phone, turning her head towards the camera", "Blowing a balloon", "Picking up seashells", "Flying a kite", "Eating an ice cream cone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-7287196927026924802_0", "video_path": "7287196927026924802.mp4", "subtitle_path": "7287196927026924802_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.13, "view_count": 7611}, {"video_id": "@jess.morg-7285497469075606826", "question": "Next to a white bay window, there is a white desk. On the desk, there is a silver-gray laptop, and some books are placed beside the laptop. 
What objects are on the white desk?", "question_wo_referring_query": "What objects are on the white desk?", "candidates": ["Orange mug", "Purple mug", "Red mug", "Black mug", "White mug"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@jess.morg-7285497469075606826_0", "video_path": "7285497469075606826.mp4", "subtitle_path": "7285497469075606826_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.18, "view_count": 729}, {"video_id": "@kerstinong-7157729958071176450", "question": "On a water-soaked surface, there is a woman wearing a purple sports bra and purple yoga pants doing a plank. When the subtitle 'Monday, Tuesday, Wednesday, Thursday' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["A black dog", "A red tricycle", "A purple skipping rope", "A yellow wind chime", "A white tricycle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "@kerstinong-7157729958071176450_0", "video_path": "7157729958071176450.mp4", "subtitle_path": "7157729958071176450_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.8, "view_count": 10100}, {"video_id": "@jonijawne-7196870015126113542", "question": "In a close-up shot of an eye, there is white text saying 'speedy mascara remover'. In front of the eye, there's an opened mascara. 
What is the color of the mascara bottle?", "question_wo_referring_query": "What is the color of the mascara bottle?", "candidates": ["green", "purple", "white", "blue", "red"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@jonijawne-7196870015126113542_0", "video_path": "7196870015126113542.mp4", "subtitle_path": "7196870015126113542_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.53, "view_count": 13543}, {"video_id": "@kerstinong-7115463916087561473", "question": "On a natural wood-colored table, there is a pair of silver earrings and a blue camisole. A hand wearing a silver watch is reaching for a bag placed on the natural wood-colored table. When the subtitle 'Love your imperfections, I'll be' appears, what color is the bag on the natural wood-colored table?", "question_wo_referring_query": "What color is the bag on the natural wood-colored table?", "candidates": ["Silver gray", "Red", "Blue", "Green", "Olive"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@kerstinong-7115463916087561473_0", "video_path": "7115463916087561473.mp4", "subtitle_path": "7115463916087561473_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.37, "view_count": 10903}, {"video_id": "@kerstinong-7309723212940692738", "question": "On a pitch-black night, there are many people taking pictures under a single horizontal bar. There is a woman wearing a black sports bra and a man without a shirt. 
Which woman is on the grey horizontal bar doing a 'Biye' pose?", "question_wo_referring_query": "Which woman is on the grey horizontal bar doing a 'Biye' pose?", "candidates": ["The woman wearing blue shorts", "The woman wearing purple yoga pants", "The woman wearing black yoga pants", "The woman wearing black shorts", "The woman wearing peach pink yoga pants"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-7309723212940692738_0", "video_path": "7309723212940692738.mp4", "subtitle_path": "7309723212940692738_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.07, "view_count": 7941}, {"video_id": "@lisolna-7236057511260949786", "question": "On a flat road, there is a person wearing a cream-colored long windbreaker. Many vehicles are parked on the right side of the road. When the caption \"I like to imagine the wind blowing my perfume and sliding around you.\" appears, what is the person in the cream-colored windbreaker doing?", "question_wo_referring_query": "What is the person in the cream-colored windbreaker doing?", "candidates": ["Eating an ice cream cone", "Crouching down to tie a shoelace", "Happily running forward", "Walking two dogs", "Happily spinning around in place"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@lisolna-7236057511260949786_0", "video_path": "7236057511260949786.mp4", "subtitle_path": "7236057511260949786_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.1, "view_count": 21397}, {"video_id": "@kerstinong-6966489287261015297", "question": "In a car with black seats, there is a woman with tied hair wearing a black-grey short-sleeve shirt. She is talking close to the mirror. 
After she finishes talking close to the mirror, what does she do next?", "question_wo_referring_query": "After she finishes talking close to the mirror, what does she do next?", "candidates": ["Covers her face with her hands", "Adjusts her hair with her right hand", "Stretches her arms out", "Adjusts her hair with her left hand", "Brings her hands together in front of her chest"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@kerstinong-6966489287261015297_0", "video_path": "6966489287261015297.mp4", "subtitle_path": "6966489287261015297_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.25, "view_count": 135464}, {"video_id": "@kerstinong-7035647472902999298", "question": "In a white room, there are some white cabinets. Next to the white cabinets stands a woman with black hair, wearing a black and grey short sleeve shirt. The woman is holding a white ring in her hand. After the person next to her says 'I love you for infinity,' what does the woman wearing the black and grey short sleeves do?", "question_wo_referring_query": "What is the woman wearing the black and grey short sleeves doing?", "candidates": ["Spinning the white ring in her hand", "Leaning against the white cabinet", "Hanging the white ring on her hand", "Putting the white ring around her neck", "Lying on the white bed"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "@kerstinong-7035647472902999298_0", "video_path": "7035647472902999298.mp4", "subtitle_path": "7035647472902999298_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.13, "view_count": 11419}, {"video_id": "@kerstinong-6940269333872479489", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a woman in a white long-tail dress is 
standing beside a man in a red hat riding a bicycle swiftly past. Then, the same woman in a white long-tail dress turns her head to look at the camera. Lastly, a woman in a white long-tail dress is walking forward with her back to the camera.", "First, a woman in a white long-tail dress is walking forward with her back to the camera. Then, the same woman in a white long-tail dress turns her head to look at the camera. Lastly, a woman in a white long-tail dress is standing beside a man in a red hat riding a bicycle swiftly past.", "First, a woman in a white long-tail dress is walking forward with her back to the camera. Then, a woman in a white long-tail dress is standing beside a man in a red hat riding a bicycle swiftly past. Lastly, the same woman in a white long-tail dress turns her head to look at the camera.", "First, the woman in a white long-tail dress turns her head to look at the camera. Then, the same woman in a white long-tail dress is walking forward with her back to the camera. Lastly, a woman in a white long-tail dress is standing beside a man in a red hat riding a bicycle swiftly past.", "First, the woman in a white long-tail dress turns her head to look at the camera. Then, a woman in a white long-tail dress is standing beside a man in a red hat riding a bicycle swiftly past. Lastly, the same woman in a white long-tail dress is walking forward with her back to the camera."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "@kerstinong-6940269333872479489_0", "video_path": "6940269333872479489.mp4", "subtitle_path": "6940269333872479489_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.13, "view_count": 19654}, {"video_id": "@jess.morg-7285125312709397803", "question": "Under a cyan sky, there's a patch of green tree leaves. In the middle of the screen, there are icons of a maple leaf, a sock, a small bear, and a coffee-colored heart. 
In which of the following scenes does the small bear icon appear?", "question_wo_referring_query": "In which of the following scenes does the small bear icon appear in the middle of the screen?", "candidates": ["On a screen with a serving of cooked steak", "On a screen with jade green seawater", "On a screen with a lot of windmills", "On a screen with a lot of yellow pumpkins", "On a screen with a lot of lollipops"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "@jess.morg-7285125312709397803_0", "video_path": "7285125312709397803.mp4", "subtitle_path": "7285125312709397803_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.75, "view_count": 697}, {"video_id": "@lisolna-7257154386323836187", "question": "On a white table, there is a glass with a few white flowers on it, and inside the glass, there are ice cubes. Which of the following subtitles appear along with this image?", "question_wo_referring_query": ", there is a glass with a few white flowers on it, and which of the following subtitles appear along with this image?", "candidates": ["\"He was out of town, and his two friends were so fine\"", "\"That's great\"", "\"It's very comfortable in this kind of life\"", "\"It's so delicious\"", "\"I really like it\""], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@lisolna-7257154386323836187_0", "video_path": "7257154386323836187.mp4", "subtitle_path": "7257154386323836187_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.07, "view_count": 13599}, {"video_id": "@jess.morg-7181954269728312619", "question": "In a scene with sunlight, sand, and waves, in a blue sky with white clouds, what object is present?", "question_wo_referring_query": ", what object is present?", "candidates": ["Tik Tok logo", "Little bear logo", "Yellow star logo", "Red heart logo", "Blue triangle logo"], 
"topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@jess.morg-7181954269728312619_0", "video_path": "7181954269728312619.mp4", "subtitle_path": "7181954269728312619_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.1, "view_count": 157}, {"video_id": "@kerstinong-6921697509085662465", "question": "In a brightly lit gym, there is a man and a woman working out. Behind them, there is a grey yoga mat and some exercise equipment. What color shirt is the man who is working out wearing?", "question_wo_referring_query": "What color shirt is the man who is working out wearing?", "candidates": ["purple", "white", "black", "red", "blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@kerstinong-6921697509085662465_0", "video_path": "6921697509085662465.mp4", "subtitle_path": "6921697509085662465_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.55, "view_count": 46568}, {"video_id": "@jess.morg-7279553487619345706", "question": "On a rocky shore with a broken tree, next to the broken tree is a sandy beach of earthy yellow, and in the distance is the vast ocean. 
A flock of birds flies across the sky, and when the subtitle 'Where I can free' appears, what is the color of the sky on the screen?", "question_wo_referring_query": "What is the color of the sky on the screen?", "candidates": ["White", "Black", "Purple", "Red and brown mixed with black and gray", "Blue"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@jess.morg-7279553487619345706_0", "video_path": "7279553487619345706.mp4", "subtitle_path": "7279553487619345706_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.1, "view_count": 1147}, {"video_id": "@kerstinong-7011546824238566657", "question": "In a scene with a giant tree, there is a doll wearing a yellow short-sleeve shirt in front of the tree, and behind the tree is a field of golden wheat. Who is sprinting in place in front of the doll wearing the yellow short-sleeve shirt?", "question_wo_referring_query": "Who is sprinting in place in front of the doll wearing the yellow short-sleeve shirt?", "candidates": ["A man wearing a black short-sleeve shirt", "A woman wearing a white tank top", "A woman wearing a white sports vest", "A woman wearing a white T-shirt", "A woman wearing a blue-green sports vest"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-7011546824238566657_0", "video_path": "7011546824238566657.mp4", "subtitle_path": "7011546824238566657_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.23, "view_count": 10231}, {"video_id": "@jonijawne-7311286036800261382", "question": "On a sunny day, on a flat road, there's a man wearing a black top, black pants, and carrying a backpack. 
What is the man wearing the black top doing the first time he appears?", "question_wo_referring_query": "What is the man wearing the black top doing the first time he appears?", "candidates": ["Spinning in place on the flat road", "Pushing a silver-white suitcase forward", "Kneeling on the flat road", "Running forward on the flat road", "Crawling on the flat road"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@jonijawne-7311286036800261382_0", "video_path": "7311286036800261382.mp4", "subtitle_path": "7311286036800261382_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.63, "view_count": 1970}, {"video_id": "@kerstinong-7258641784253779202", "question": "In a scene with a beautiful woman in the background, there is a woman with long hair wearing a green top. In the bottom right corner of the screen, there is a TikTok logo. When the caption 'So you won't be charged.' appears, what is the woman in the green top doing?", "question_wo_referring_query": "What is the woman in the green top doing?", "candidates": ["Making a heart shape with her hands in front of the mirror", "Adjusting her hair with her left hand", "Cupping her face with both hands", "Massaging her temples with both hands", "Standing in front of the mirror with her left index finger raised"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@kerstinong-7258641784253779202_0", "video_path": "7258641784253779202.mp4", "subtitle_path": "7258641784253779202_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.87, "view_count": 16900}, {"video_id": "@kerstinong-6818897963549543681", "question": "In a bright room, there is a woman wearing a light green sports tank top and a woman wearing a pink short-sleeved shirt. 
After the woman in the pink short-sleeved shirt wipes the floor with a piece of blue cloth, what does she do next?", "question_wo_referring_query": "What does the woman wearing the pink short-sleeved shirt do after wiping the floor with a piece of blue cloth?", "candidates": ["Take off the pink short-sleeved shirt", "Pick up the blue cloth", "Throw the blue cloth into the trash can", "Throw the blue cloth on the table", "Kick the woman wearing the light green sports tank top"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@kerstinong-6818897963549543681_0", "video_path": "6818897963549543681.mp4", "subtitle_path": "6818897963549543681_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.07, "view_count": 7728}, {"video_id": "@jess.morg-7224673326055296302", "question": "In the video, which type of cup appears first?", "question_wo_referring_query": "In the video, which type of cup appears first?", "candidates": ["Blue cup", "Green cup", "Transparent cup", "Purple cup", "Yellow cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@jess.morg-7224673326055296302_0", "video_path": "7224673326055296302.mp4", "subtitle_path": "7224673326055296302_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.73, "view_count": 1177}, {"video_id": "@lisolna-7254553666315652378", "question": "Inside a golden container, there is a viscous dark-colored substance. The container has a narrow waist, and both ends are wider than the waist. The waist of the container has a handle. 
After the subtitle 'You are not set on stone' appears, what happens on the screen?", "question_wo_referring_query": ", what happens on the screen?", "candidates": ["Someone pours the dark substance into an empty cup", "A plate gets broken", "A woman is eating a hot dog", "Someone is adding herbs to a dish", "A woman picks up a plate of food"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "@lisolna-7254553666315652378_0", "video_path": "7254553666315652378.mp4", "subtitle_path": "7254553666315652378_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.97, "view_count": 22250}, {"video_id": "@jess.morg-7249478966061370666", "question": "A gentleman is walking a yellow dog on a path. The dog's leash is blue. On both sides of the path, there are red flowers and green grass. After the subtitle 'I had this thing where I need everybody to think. I'm the greatest, the quote-unquote Fantastic Mr. Fox' appears, what object shows up?", "question_wo_referring_query": "What object shows up?", "candidates": ["curtain", "bench", "pillar", "street lamp", "potted plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@jess.morg-7249478966061370666_0", "video_path": "7249478966061370666.mp4", "subtitle_path": "7249478966061370666_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.21, "view_count": 996}, {"video_id": "@lisolna-7170417823083138310", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the lady takes the bus home and does her makeup and skincare, then drinks coffee and takes a walk, and finally goes shopping.", "First, the lady goes shopping, then drinks coffee and takes a walk, and finally takes the bus home and does her makeup and skincare.", "First, the lady 
takes the bus home and does her makeup and skincare, then goes shopping, and finally drinks coffee and takes a walk.", "First, the lady drinks coffee and takes a walk, then goes shopping, and finally takes the bus home and does her makeup and skincare.", "First, the lady drinks coffee and takes a walk, then takes the bus home and does her makeup and skincare, and finally goes shopping."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "@lisolna-7170417823083138310_0", "video_path": "7170417823083138310.mp4", "subtitle_path": "7170417823083138310_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.15, "view_count": 3954}, {"video_id": "@jonijawne-7204517054597106950", "question": "A man appears in the room, wearing a white shirt. His hair is parted in the middle. He is holding a mobile phone with both hands. Behind the man, there is a wall and a staircase. In which other scenario has the man's hand in the white shirt appeared?", "question_wo_referring_query": "In which other scenario has the man's hand in the white shirt appeared?", "candidates": ["On a blue sofa", "On a delicate handbag", "On a red round table", "In front of a small potted plant", "In front of transparent glass"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "@jonijawne-7204517054597106950_0", "video_path": "7204517054597106950.mp4", "subtitle_path": "7204517054597106950_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.2, "view_count": 44842}, {"video_id": "@kerstinong-7338379964762574082", "question": "A woman with a ponytail is working out on a treadmill. She is wearing shorts and a white sports bra. In front of her is a window, and outside the window, there are buildings. The treadmill's deck is blue, and there are other pieces of equipment around. 
What caption has appeared with this woman before?", "question_wo_referring_query": "With what caption has this woman appeared before?", "candidates": ["I think I can win this one", "No painstaking work", "That's a good way to do it", "I can do anything", "I can do it"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@kerstinong-7338379964762574082_0", "video_path": "7338379964762574082.mp4", "subtitle_path": "7338379964762574082_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.73, "view_count": 5066}, {"video_id": "@lisolna-7178966223307214086", "question": "A box with a red ribbon and white dots contains a snack. Surrounding the box are other snacks, some decorated with bows. When the area around the box is empty, what change occurs inside the box?", "question_wo_referring_query": "When the area around the box is empty, what change occurs inside the box?", "candidates": ["A cat appears in the box", "The white dots on the box turn blue", "The box changes from a quadrilateral to a polygon", "The box changes from mostly empty to being filled with snacks", "A rat appears in the box"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@lisolna-7178966223307214086_0", "video_path": "7178966223307214086.mp4", "subtitle_path": "7178966223307214086_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.77, "view_count": 14792}, {"video_id": "@jess.morg-7262119504526724398", "question": "Blue sky with floating white clouds; on the beach near the sea, there is a low wall made of piled stones. Inside the stone wall is the beach, and outside the stone wall is the vast sea. 
When the subtitle 'Bye' appears, what change occurs on the sea?", "question_wo_referring_query": "What change occurs on the sea?", "candidates": ["A surfer woman appears on the sea", "A dolphin appears on the sea", "A whale appears on the sea", "A surfer man appears on the sea", "The sea surface ripples with white waves"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "@jess.morg-7262119504526724398_0", "video_path": "7262119504526724398.mp4", "subtitle_path": "7262119504526724398_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.29, "view_count": 675}, {"video_id": "@kerstinong-7115039462840683777", "question": "There is a car placed inside the room. Behind the car, there is a yellow rectangular decoration and a shelving unit piled up with miscellaneous items. A bright hanging light is suspended from the room's ceiling panel, and in front of the light, a blue-lettered white screen is hanging. What can be found in this scene?", "question_wo_referring_query": "What can be found in this scene?", "candidates": ["potted plant", "fan", "table lamp", "hat", "watch"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@kerstinong-7115039462840683777_0", "video_path": "7115039462840683777.mp4", "subtitle_path": "7115039462840683777_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.57, "view_count": 14122}, {"video_id": "@kerstinong-7315271756133027074", "question": "A woman with long hair is standing in front of a sign. The woman is wearing a white top and pink shorts. There is grass beneath her feet. The sign is predominantly red and white, depicting a sports figure. 
When the text 'Outro Music' appears, what style is the white top that the woman in pink shorts is wearing?", "question_wo_referring_query": "What style is the white top that the woman in pink shorts is wearing?", "candidates": ["Shirt", "Suit", "Sweater", "Sweater", "Short sleeve"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@kerstinong-7315271756133027074_0", "video_path": "7315271756133027074.mp4", "subtitle_path": "7315271756133027074_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.07, "view_count": 4411}, {"video_id": "@kerstinong-7214871387620609298", "question": "A woman wearing a blue sports outfit is standing in the center of the screen, holding a numbered sign. Next to the woman on the blue track, two people are standing beside a hurdle. Behind the track are green trees and the sky. Who has a bottle of water under their armpit?", "question_wo_referring_query": "Who has a bottle of water under their armpit?", "candidates": ["The woman in the blue sports outfit", "The woman in the white shirt leaning on the hurdle", "The man in the white shirt leaning on the hurdle", "The woman in the black shirt leaning on the hurdle", "The man in the black shirt leaning on the hurdle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-7214871387620609298_0", "video_path": "7214871387620609298.mp4", "subtitle_path": "7214871387620609298_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.0, "view_count": 16600}, {"video_id": "@jonijawne-7192686219149511942", "question": "A man wearing a fur-collared coat is standing sideways to the camera. The background behind the man features a white wall and a staircase. The man's hair is parted in the middle. 
When the caption 'I am Mistress from the Stars' appears, what is this man with the middle part hairstyle doing?", "question_wo_referring_query": "What is this man with the middle part hairstyle doing when the caption appears?", "candidates": ["The man has his hands on his hips", "The man is raising both hands", "The man is raising one hand", "The man has his hands in his coat pockets", "The man has his arms crossed in front of his chest"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@jonijawne-7192686219149511942_0", "video_path": "7192686219149511942.mp4", "subtitle_path": "7192686219149511942_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.72, "view_count": 1750}, {"video_id": "@emsiiees-7251097980860665094", "question": "In the video, who is the person that first takes a photo with the girl holding a skewer at the beginning of the video?", "question_wo_referring_query": "In the video, who is the person that first takes a photo with the girl holding a skewer at the beginning of the video?", "candidates": ["The woman wearing white clothes and holding a white bag", "The woman wearing a pink hat and a black coat", "The woman wearing a blue coat", "The person wearing black sunglasses and a black short-sleeve shirt", "The man wearing a blue coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@emsiiees-7251097980860665094_0", "video_path": "7251097980860665094.mp4", "subtitle_path": "7251097980860665094_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.47, "view_count": 7429}, {"video_id": "@jess.morg-7256188695546514731", "question": "The clear blue sky has no white clouds in sight. On the left side of the white sidewalk, there are a row of tall trees and parked cars. On the right side of the sidewalk, there is a cobblestone ground, buildings, and planters. 
After the subtitle 'What do you want to be when you grow up?' appears, what object shows up on the screen?", "question_wo_referring_query": "What object shows up on the screen?", "candidates": ["a lamp", "a seagull", "a plate of fruit", "a puppy", "a pile of rocks"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@jess.morg-7256188695546514731_0", "video_path": "7256188695546514731.mp4", "subtitle_path": "7256188695546514731_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.4, "view_count": 1006}, {"video_id": "@jess.morg-7300001535734467883", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, it shows the scene of the sunset over the sea, then it showcases a small seaside house with unique style and exquisite decorations, and then it displays various recreational facilities and parked cars on the beach.", "First, it showcases a small seaside house with unique style and exquisite decorations, then it displays various recreational facilities and parked cars on the beach, and then it shows the scene of the sunset over the sea.", "First, it displays various recreational facilities and parked cars on the beach, then it shows the scene of the sunset over the sea, and then it showcases a small seaside house with unique style and exquisite decorations.", "First, it displays various recreational facilities and parked cars on the beach, then it showcases a small seaside house with unique style and exquisite decorations, and then it shows the scene of the sunset over the sea.", "First, it shows the scene of the sunset over the sea, then it displays various recreational facilities and parked cars on the beach, and then it showcases a small seaside house with unique style and exquisite decorations."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", 
"level": "L2-Relation", "id": "@jess.morg-7300001535734467883_0", "video_path": "7300001535734467883.mp4", "subtitle_path": "7300001535734467883_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.53, "view_count": 933}, {"video_id": "@jonijawne-7194138546754587910", "question": "The man wearing a black shirt is also wearing pink pants. He has a small bear design on his chest and is carrying a carton-shaped bag. In his hand, he is holding a pair of blue jeans hanging on a clothes rack. What subtitle has appeared with the carton-shaped bag of the man wearing the black shirt?", "question_wo_referring_query": ", what subtitle has appeared with the carton-shaped bag of the man wearing the black shirt?", "candidates": ["All is well", "Too bad", "I'm sorry", "Occasionally", "Maybe you're right"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@jonijawne-7194138546754587910_0", "video_path": "7194138546754587910.mp4", "subtitle_path": "7194138546754587910_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.82, "view_count": 3355}, {"video_id": "@lisolna-7270606262671314208", "question": "The sun is hanging high in the sky, and a golden line appears on the sea surface under the sunlight. A car is parked by the seaside. 
When a hand wearing a ring appears in the water, what change happens to the sun?", "question_wo_referring_query": "When a hand wearing a ring appears in the water, what change happens to the sun?", "candidates": ["The sun visually becomes larger", "The sun visually becomes smaller and appears closer to the sea surface", "The sun disappears", "The sun rises to a higher position", "The sun is submerged below the horizon"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@lisolna-7270606262671314208_0", "video_path": "7270606262671314208.mp4", "subtitle_path": "7270606262671314208_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.67, "view_count": 6830}, {"video_id": "@kerstinong-6933588450973076738", "question": "Under the yellow light, a woman and a man appear on the screen. The man is wearing short sleeves and glasses. The woman is wearing a shoulder-exposing top. Both of them have accessories on their wrists and there are white characters around them. What is the woman wearing the shoulder-exposing top doing?", "question_wo_referring_query": "What is the woman wearing the shoulder-exposing top doing?", "candidates": ["The woman is crossing her arms in front of her chest", "The woman is holding hands with the man", "The woman is holding a bag with both hands", "The woman is holding a bag with one hand", "The woman is placing her hands on her waist"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-6933588450973076738_0", "video_path": "6933588450973076738.mp4", "subtitle_path": "6933588450973076738_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.04, "view_count": 215559}, {"video_id": "@jess.morg-7270653695358946603", "question": "A small yellow desk lamp is placed on the table. Below the lamp, there is a black stand. 
On the table, there are objects emitting red light and some other decorations. Above the table, there are decorations hanging on the wall. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Book", "Pen", "Hat", "Planter", "Poker card"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@jess.morg-7270653695358946603_0", "video_path": "7270653695358946603.mp4", "subtitle_path": "7270653695358946603_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.71, "view_count": 669}, {"video_id": "@lisolna-7174159693596658950", "question": "A person wearing a black top appears on the screen, holding a phone in front of their face, and wearing a scarf. The person has jewelry on their fingers. When the caption 'I just appreciate silence in a world that never stops talking' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["Picture frame", "Lamp", "Hat", "Potted plant", "Glasses"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "@lisolna-7174159693596658950_0", "video_path": "7174159693596658950.mp4", "subtitle_path": "7174159693596658950_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.07, "view_count": 12172}, {"video_id": "@kerstinong-7066316029168717058", "question": "A long-haired lady is facing the camera. She is wearing a black mask, and one of her hands is holding down her lower lip. The black mask has various patterns on it. 
What is the shape of the pattern on the mask?", "question_wo_referring_query": "What is the shape of the pattern on the mask?", "candidates": ["Shape of musical notes", "Shape of stars", "Shape of water droplets", "Hexagonal shape", "Shape of hearts"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@kerstinong-7066316029168717058_0", "video_path": "7066316029168717058.mp4", "subtitle_path": "7066316029168717058_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.53, "view_count": 3182}, {"video_id": "@jess.morg-7217265268446334254", "question": "There's a ring appearing by the seaside, with a pillar beside it. Gazing at the endless sea with white waves, the sky above the sea is white. When the subtitles 'Smoking cigarettes on the ring' appear, what is the pillar beside the ring like?", "question_wo_referring_query": "What is the pillar beside the ring like?", "candidates": ["A dark-colored wooden pillar", "A red wooden pillar", "A yellow wooden pillar", "A black metal pillar", "A silver metal pillar"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@jess.morg-7217265268446334254_0", "video_path": "7217265268446334254.mp4", "subtitle_path": "7217265268446334254_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.83, "view_count": 761}, {"video_id": "@kerstinong-6900398825899396354", "question": "A girl appears on the screen. The girl has shoulder-length hair, is wearing a short-sleeve shirt with yellow stripes, and is wearing an ornament on her neck. Behind the girl, there is a woman wearing a white short-sleeve shirt. 
What did the girl with the necklace do the first time she appeared?", "question_wo_referring_query": ", what did the girl with the necklace do the first time she appeared?", "candidates": ["The girl was running", "The girl squatted on the ground", "The girl participated in a hurdle race", "The girl was drinking a beverage", "The girl was singing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@kerstinong-6900398825899396354_0", "video_path": "6900398825899396354.mp4", "subtitle_path": "6900398825899396354_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.06, "view_count": 9321}, {"video_id": "@lisolna-7265707367025691936", "question": "Who is the first person to appear in the video?", "question_wo_referring_query": "Who is the first person to appear in the video?", "candidates": ["The person riding a motorcycle", "The person wearing a black short-sleeve shirt", "The person riding a bicycle", "The person wearing a blue top", "The person wearing a white short-sleeve shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@lisolna-7265707367025691936_0", "video_path": "7265707367025691936.mp4", "subtitle_path": "7265707367025691936_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.23, "view_count": 31435}, {"video_id": "@kerstinong-7152541234723065090", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a scene of a gentleman leaving with the purchased cosmetics after checking out, then a scene of a gentleman shopping for cosmetics at a supermarket, followed by a scene of a male host testing makeup.", "First, there is a scene of a gentleman shopping for cosmetics at a supermarket, then a scene of a gentleman leaving with the purchased cosmetics 
after checking out, followed by a scene of a male host testing makeup.", "First, there is a scene of a gentleman shopping for cosmetics at a supermarket, then a scene of a male host testing makeup, followed by a scene of a gentleman leaving with the purchased cosmetics after checking out.", "First, there is a scene of a gentleman leaving with the purchased cosmetics after checking out, then a scene of a male host testing makeup, followed by a scene of a gentleman shopping for cosmetics at a supermarket.", "First, there is a scene of a male host testing makeup, then a scene of a gentleman shopping for cosmetics at a supermarket, followed by a scene of a gentleman leaving with the purchased cosmetics after checking out."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "@kerstinong-7152541234723065090_0", "video_path": "7152541234723065090.mp4", "subtitle_path": "7152541234723065090_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.6, "view_count": 15256}, {"video_id": "@lisolna-7159641152696716549", "question": "In the scene, a woman with black hair is applying makeup with a tool that has a red handle and a brush tip. In the top left corner, there are musical notes and alphabet symbols. 
What change occurs to the musical notes and alphabet symbols when a staff member is preparing a drink?", "question_wo_referring_query": "What change occurs to the musical notes and alphabet symbols in the top left corner when a staff member is preparing a drink?", "candidates": ["Only the alphabet symbols moved to the bottom right corner", "The musical notes and alphabet symbols moved to the bottom right corner", "Only the musical notes moved to the bottom right corner", "The musical notes and alphabet symbols moved to the top right corner", "Only the musical notes moved to the top right corner"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@lisolna-7159641152696716549_0", "video_path": "7159641152696716549.mp4", "subtitle_path": "7159641152696716549_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.28, "view_count": 1064}, {"video_id": "@kerstinong-7277224088521755905", "question": "A woman appears on a red track, wearing sports underwear and dark pants, with a headband on her forehead. Behind her, a man in a white shirt is exercising. What is the woman in sports underwear doing?", "question_wo_referring_query": "What is the woman in sports underwear doing?", "candidates": ["The woman is doing squats", "The woman is standing with her hands on her waist", "The woman is covering her chest", "The woman is fixing her hair with her hands", "The woman is waving"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-7277224088521755905_0", "video_path": "7277224088521755905.mp4", "subtitle_path": "7277224088521755905_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.87, "view_count": 3759}, {"video_id": "@lisolna-7274599412645645600", "question": "A man wearing gray pants appears in a mall. This man in gray pants is dragging a blue shopping basket. 
The supermarket floor is mainly white. What objects are present in this shopping scene?", "question_wo_referring_query": "What objects are present in this shopping scene?", "candidates": ["A pot of green plants", "A photo", "Shoes", "A fan", "A black cat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@lisolna-7274599412645645600_0", "video_path": "7274599412645645600.mp4", "subtitle_path": "7274599412645645600_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.27, "view_count": 18700}, {"video_id": "@kerstinong-7275644130808646914", "question": "During a sports event, several women are running on a red track. There are only a few spectators in the white stands, and a silver metallic object is placed on the track. A man in red clothing passes by from outside the track. What style of pants is the man in red wearing?", "question_wo_referring_query": "What style of pants is the man in red wearing?", "candidates": ["Blue ripped jeans", "White shorts", "Black and white striped pants", "Blue overalls", "White jeans"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@kerstinong-7275644130808646914_0", "video_path": "7275644130808646914.mp4", "subtitle_path": "7275644130808646914_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.27, "view_count": 32764}, {"video_id": "@kerstinong-7317859416609393921", "question": "A woman appears at a swimming pool, wearing sunglasses and a bikini. There is a shed behind her, and people are gathered underneath. 
Who among them is smiling at the camera with their teeth showing?", "question_wo_referring_query": "Who among them is smiling at the camera with their teeth showing?", "candidates": ["The woman in a red top", "The woman in a white top", "The woman wearing sunglasses", "The man in a black jacket", "The man in a yellow hooded coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-7317859416609393921_0", "video_path": "7317859416609393921.mp4", "subtitle_path": "7317859416609393921_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.25, "view_count": 2543}, {"video_id": "@jess.morg-7205014742136524078", "question": "A white car door and a hand appear on the screen. The owner of the hand is wearing a dark top. In the center of the screen, there are white characters and a pattern. After the person wearing the black top appears for the first time, what do they do?", "question_wo_referring_query": "After the person wearing the black top appears for the first time, what do they do?", "candidates": ["Order a cup of juice", "Drink juice", "Pull the car door", "Jump", "Drive"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@jess.morg-7205014742136524078_0", "video_path": "7205014742136524078.mp4", "subtitle_path": "7205014742136524078_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.83, "view_count": 487}, {"video_id": "@jess.morg-7268032278872231210", "question": "What is the first item placed on the table in the video?", "question_wo_referring_query": "What is the first item placed on the table in the video?", "candidates": ["mobile phone", "cup", "camera", "photo frame", "book"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@jess.morg-7268032278872231210_0", "video_path": "7268032278872231210.mp4", "subtitle_path": 
"7268032278872231210_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.2, "view_count": 646}, {"video_id": "@lisolna-7248301633161940250", "question": "There are gourmet foods and price tags placed in the glass showcase, and the top row of the case is decorated with lights. After the caption 'I can't think of a time that you weren't there' appears, what object shows up at the top of the case?", "question_wo_referring_query": ", what object shows up at the top of the case?", "candidates": ["a hat", "a large menu", "an apple", "a black cat", "a chandelier"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@lisolna-7248301633161940250_0", "video_path": "7248301633161940250.mp4", "subtitle_path": "7248301633161940250_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.8, "view_count": 34732}, {"video_id": "@lisolna-7243055628715838746", "question": "The woman wearing a black top is standing next to a car. She has long black hair that reaches her shoulders. There is a white car behind her. In what subtitle does this woman wearing a black top appear?", "question_wo_referring_query": "In what subtitle does this woman wearing a black top appear?", "candidates": ["I have a chronic disability", "Come on", "All good", "Bye", "Okay, too"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@lisolna-7243055628715838746_0", "video_path": "7243055628715838746.mp4", "subtitle_path": "7243055628715838746_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.87, "view_count": 24592}, {"video_id": "@jess.morg-7218657390588169518", "question": "Under the clear blue sky, a long corridor appears by the sea, with white waves on the sea surface. In the distance of the corridor, there is a cliff. In the center-left position of the screen, there are musical notes and alphabet icons. 
As the waves become more intense, what changes occur to the musical notes and alphabet icons on the left side of the screen?", "question_wo_referring_query": "As the waves become more intense, what changes occur to the musical notes and alphabet icons on the left side of the screen?", "candidates": ["Only the musical notes moved upwards", "Only the alphabet icons moved upwards", "The musical notes and alphabet icons moved to the right", "Only the alphabet icons moved to the right", "Only the musical notes moved to the right"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@jess.morg-7218657390588169518_0", "video_path": "7218657390588169518.mp4", "subtitle_path": "7218657390588169518_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.57, "view_count": 850}, {"video_id": "oNHgRt9xLZE", "question": "In the upper left corner of the screen is a woman with long hair in a white coat giving a lecture in front of a camera and microphone, in the lower left is a book, on the right side of the screen is a whiteboard with 6 lines of English explanation at the top and a large heading starting with the letter 'D' at the bottom, and there is a chemical formula beneath the heading. Which chemical units are not present in the screen?", "question_wo_referring_query": "Which chemical units are not present in the screen?", "candidates": ["kg", "L", "ML", "mol", "g"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "oNHgRt9xLZE_0", "video_path": "oNHgRt9xLZE.mp4", "subtitle_path": "oNHgRt9xLZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2922.7, "view_count": 40629}, {"video_id": "oNHgRt9xLZE", "question": "In the top left corner of the screen, there is a woman with long hair in a white coat, holding a pointer and giving a lecture. At the bottom left, there is a book. 
On the right side of the screen, there is a whiteboard with a pink English title and six lines of English explanations below it. Below the English explanations, there is a chemical formula. At the bottom right, there is a colored circle. Which chemical elements appear on the screen?", "question_wo_referring_query": "Which chemical elements appear on the screen?", "candidates": ["Chromium element", "Calcium element", "Lead element", "Boron element", "Aluminum element"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "oNHgRt9xLZE_1", "video_path": "oNHgRt9xLZE.mp4", "subtitle_path": "oNHgRt9xLZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2922.7, "view_count": 40629}, {"video_id": "oNHgRt9xLZE", "question": "In the upper left corner of the screen, there is a woman with long hair in a white coat giving a lecture to a camera and microphone. In the lower left corner, there is a book. The right side of the screen shows a whiteboard with the title 'Oxygen-18' written in English, followed by its chemical formula. There is also a colorful circle in the bottom right corner. 
Which chemical elements appear in the screen?", "question_wo_referring_query": "Which chemical elements appear in the screen?", "candidates": ["Boron", "Carbon", "Iron", "Oxygen", "Hydrogen"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2O", "level": "IntraMoment", "id": "oNHgRt9xLZE_2", "video_path": "oNHgRt9xLZE.mp4", "subtitle_path": "oNHgRt9xLZE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2922.7, "view_count": 40629}, {"video_id": "pMJYCFdVLt8", "question": "What is the man with gray hair, wearing glasses, dressed in black, with a guitar on his back, standing in front of the drum kit on stage doing while speaking into the microphone?", "question_wo_referring_query": "What is he doing?", "candidates": ["Arranging clothes", "Playing guitar and singing", "Combing his hair", "Playing the drum kit", "Adjusting the microphone"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "pMJYCFdVLt8_0", "video_path": "pMJYCFdVLt8.mp4", "subtitle_path": "pMJYCFdVLt8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1613.61, "view_count": 1446958}, {"video_id": "pMJYCFdVLt8", "question": "In the video, a black man wearing a blue short sleeve shirt is holding a green can of 7-Up drink. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Drinking Mirinda", "Shaking the can of 7-Up", "Drinking 7-Up", "Drinking Sprite", "Drinking Coca-Cola"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "pMJYCFdVLt8_1", "video_path": "pMJYCFdVLt8.mp4", "subtitle_path": "pMJYCFdVLt8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1613.61, "view_count": 1446958}, {"video_id": "pMJYCFdVLt8", "question": "The man in the middle of the screen is wearing a black suit, a white shirt, and a black tie around his neck. 
He is holding a lighter in his right hand and has a cigarette in his mouth. To his right stands another man in a black suit and white shirt. To his left is a man wearing a black suit and white shirt with a coffee-colored overcoat, holding a cigarette case with both hands. What is the man in the middle doing?", "question_wo_referring_query": "The man in the middle of the screen is wearing a black suit, a white shirt, and a black tie around his neck. He is holding a lighter in his right hand and has a cigarette in his mouth. To his right stands another man in a black suit and white shirt. To his left is a man wearing a black suit and white shirt with a coffee-colored overcoat, holding a cigarette case with both hands. What is the man in the middle doing?", "candidates": ["The man in the middle of the screen is spinning a cigarette with his hand.", "The man in the middle of the screen is chewing on a cigarette.", "The man in the middle of the screen is lighting a cigarette.", "The man in the middle of the screen is flicking the lighter.", "The man in the middle of the screen is shaking the lighter."], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "pMJYCFdVLt8_2", "video_path": "pMJYCFdVLt8.mp4", "subtitle_path": "pMJYCFdVLt8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1613.61, "view_count": 1446958}, {"video_id": "8wkgDnNxiVs", "question": "On a white document page, there are two animated images, within which there is a dynamic model with a blue-purple brain. 
When the dynamic model with the blue-purple brain appears in a scene with a stick and a red circle around it, what changes occur to the dynamic model?", "question_wo_referring_query": "What changes occur to the dynamic model?", "candidates": ["The dynamic model gains two purple circles.", "The dynamic model gains two red circles.", "The dynamic model gains two blue circles.", "The dynamic model gains two yellow circles.", "The dynamic model gains two pink circles."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "8wkgDnNxiVs_0", "video_path": "8wkgDnNxiVs.mp4", "subtitle_path": "8wkgDnNxiVs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2028.88, "view_count": 3083}, {"video_id": "8wkgDnNxiVs", "question": "On a white document with an image, there is a formula containing the term in black text 'applied_torque.' When the term 'applied_torque' in black text appears on a white page document without an image, what change occurs to the black text 'applied_torque'?", "question_wo_referring_query": "What change occurs to the black text 'applied_torque'?", "candidates": ["It is shaded yellow", "It is shaded purple", "It is shaded blue", "It is shaded red", "It is shaded green"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "8wkgDnNxiVs_1", "video_path": "8wkgDnNxiVs.mp4", "subtitle_path": "8wkgDnNxiVs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2028.88, "view_count": 3083}, {"video_id": "8wkgDnNxiVs", "question": "In a document screen with 14 pages, there are three incomplete circular diagrams. Above the first circular diagram, there are black letters 'Roughness'. 
When the black letter 'Roughness' appears on a screen with four circular diagrams, what changes occur to the black letter 'Roughness'?", "question_wo_referring_query": "What changes occur to the black letter 'Roughness'?", "candidates": ["It is shaded with a green shadow.", "It becomes clearer and is shaded with a yellow shadow.", "It is shaded with a blue shadow.", "It becomes blurrier and is shaded with a green shadow.", "It is shaded with a purple shadow."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SAA", "level": "L2-Relation", "id": "8wkgDnNxiVs_2", "video_path": "8wkgDnNxiVs.mp4", "subtitle_path": "8wkgDnNxiVs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2028.88, "view_count": 3083}, {"video_id": "Ccru29i7M-A", "question": "A blonde woman with red lipstick is wearing headphones and a yellow top, sitting indoors, broadcasting the news. Which subtitles appear simultaneously with this woman?", "question_wo_referring_query": "Which subtitles appear simultaneously with this woman?", "candidates": ["RISK SO IT DOES MAKE SENSE. IT'S NOT IRRATIONAL TO ASSUME", "NORTHERN ITALY. THE EXPLOSION OCCURRED AS", "NORTHERN ITALY. THE EXPLOSION OCCURRED AS", "SENTIMENT INDEX, IT IS NEUTRAL FOR THE FIRST", "PLACE. THEY DON'T DISPUTE THAT THE 787"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "Ccru29i7M-A_0", "video_path": "Ccru29i7M-A.mp4", "subtitle_path": "Ccru29i7M-A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2578.55, "view_count": 1553}, {"video_id": "Ccru29i7M-A", "question": "There are three NP-News-Programers being connected on the screen. On the right is a blonde woman wearing a black suit, in the middle is a blonde woman wearing a yellow top and earphones, and on the left is a man with short brown hair, wearing a dark blue suit, white shirt, and a purple tie. 
What subtitles appear simultaneously while this man is reporting?", "question_wo_referring_query": "What subtitles appear simultaneously while this man is reporting?", "candidates": ["ON THAT IPO I AM SURE. WILL IT COME TO FRUITION? MANUS:", "DATA. BE SPECIFIC OF WHAT WE ARE EXPECTING", "GOING ON. PERHAPS IN LINE TO SLIGHTLY", "CONSECUTIVE SESSIONS NOW WITHOUT A NEW RECORD", "FOR THAT CANDIDATE. THE MOMENT THAT PERSON IS"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "Ccru29i7M-A_1", "video_path": "Ccru29i7M-A.mp4", "subtitle_path": "Ccru29i7M-A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2578.55, "view_count": 1553}, {"video_id": "Ccru29i7M-A", "question": "There is a man with short black hair wearing a light blue shirt and black-framed glasses sitting in front of the camera explaining something. Which subtitles appear on the screen at the same time as this man?", "question_wo_referring_query": "Which subtitles appear on the screen at the same time as this man?", "candidates": ["THE. BITCOIN COMES BACK --BY THE", "STRAIT.THIS IS CRITICALLY IMPORTANT", "IN MAY AND START A TAPERING PROCESS IN JUNE.", "THE RISKS ARE TO BOTH SIDES. MANUS:", "NUMBER"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "Ccru29i7M-A_2", "video_path": "Ccru29i7M-A.mp4", "subtitle_path": "Ccru29i7M-A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2578.55, "view_count": 1553}, {"video_id": "IO8PSRb33bM", "question": "In the video, a man with short black hair, wearing black-rimmed glasses, and dressed in a white T-shirt is standing in a room full of photos. He is holding a microphone in his right hand. To his left is a map of South America with the words 'SOUTH AMERICA IS NEXT...' written on it. 
Where else does this man appear?", "question_wo_referring_query": "Where else does the man in the video appear?", "candidates": ["In a room full of musical instruments.", "In a room full of maps.", "A man with black hair, wearing a white T-shirt, is standing in a room full of photos and maps. On screen, he is holding a microphone in his left hand, while a man with brown hair, wearing a white shirt and holding a cup with a straw, is on his right.", "He is standing in a room full of photos and maps, holding a microphone in his left hand. To his right is a man with black hair, wearing a white shirt and holding a cup with a straw.", "He is standing in a room full of photos and maps, holding a microphone in his right hand. To his left is a man with brown hair, wearing a black shirt and holding a cup with a straw."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "IO8PSRb33bM_0", "video_path": "IO8PSRb33bM.mp4", "subtitle_path": "IO8PSRb33bM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1411.58, "view_count": 343587}, {"video_id": "IO8PSRb33bM", "question": "In the scene, there is a man with short black hair wearing a white T-shirt and black-framed glasses. He is holding a microphone in his right hand and standing in a room full of photos and maps. To his left, there is a man with brown hair wearing a white shirt. Below them are five individuals from different countries, arranged from left to right. The first one is a dark-haired man in a gray short-sleeve shirt. The second is a blonde woman in a pink top. The third is a dark-haired man in a burgundy and white outfit. The fourth is a blonde woman in a black top. The fifth is a woman in a dark blue top and black-framed glasses. 
Can you identify which scenes this woman in the dark blue top has appeared in?", "question_wo_referring_query": "Can you identify which scenes this woman in the dark blue top has appeared in?", "candidates": ["The woman in the dark blue top and black-framed glasses is sitting indoors. To her right is a pot with flowers. Above her left side is a photo of a giraffe, and below it is a photo of a wild boar.", "The woman in the dark blue top and black-framed glasses is sitting indoors. To her left is a pot with flowers and above it is a photo of a deer. Below it is a photo of a wild boar.", "The woman in the dark blue top and black-framed glasses is sitting indoors. To her right is a pot with flowers. Above her left side is a photo of a deer, and below it is a photo of an elephant.", "The woman in the dark blue top and black-framed glasses is sitting indoors. To her right is a pot with flowers. Above her left side is a photo of an elephant, and below it is a photo of a giraffe.", "The woman in the dark blue top and black-framed glasses is sitting indoors. To her right is a pot with flowers. Above her left side is a photo of a ram, and below it is a photo of an ostrich."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "IO8PSRb33bM_1", "video_path": "IO8PSRb33bM.mp4", "subtitle_path": "IO8PSRb33bM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1411.58, "view_count": 343587}, {"video_id": "IO8PSRb33bM", "question": "In the video, a man with short black hair and black-framed glasses is wearing a white T-shirt and holding a microphone with his right hand. He is standing in a room filled with photos. On his left side, there is an Argentine flag. Can you identify other scenes where the Argentine flag appears?", "question_wo_referring_query": "Can you identify other scenes where the Argentine flag appears?", "candidates": ["The screen shows a topographic map of South America. 
On the upper left side, there is the flag of the Republic of Chile; on the right side, there is the Argentine flag; and at the bottom, it says 68\u00b0 longitude.", "The screen shows a topographic map of South America. On the upper left side, there is the flag of Serbia; on the right side, there is the Argentine flag; and at the bottom, it says 68\u00b0 longitude.", "The screen shows a topographic map of South America. On the right side, there is the flag of the Republic of Chile; on the left side, there is the Argentine flag; and at the bottom, it says 68\u00b0 longitude.", "The screen shows a topographic map of South America. On the upper left side, there is the flag of the Republic of Chile; on the right side, there is the Argentine flag; and at the bottom, it says 70\u00b0 longitude.", "The screen shows a topographic map of South America. On the upper left side, there is the flag of Portugal; on the right side, there is the Argentine flag; and at the bottom, it says 68\u00b0 longitude."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "IO8PSRb33bM_2", "video_path": "IO8PSRb33bM.mp4", "subtitle_path": "IO8PSRb33bM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1411.58, "view_count": 343587}, {"video_id": "Tp_-DjBQPIo", "question": "Which of the following sequences of scenes in the video is correct?", "question_wo_referring_query": "Which of the following sequences of scenes in the video is correct?", "candidates": ["First, a man with graying hair wearing a gray-green coat and a seatbelt, driving a car; then, it's a black-haired man in a gray-black coat with Christmas elements sitting in front of a blue globe backdrop talking to the camera; finally, there's a man with graying hair in a black coat holding an orange pumpkin, standing in front of wooden cabinets in a kitchen.", "First, a black-haired man in a gray-black coat with Christmas elements sits in front of a blue 
globe backdrop talking to the camera; then, it's a man with graying hair wearing a gray-green coat and a seatbelt, driving a car; finally, there's a man with graying hair in a black coat holding an orange pumpkin, standing in front of wooden cabinets in a kitchen.", "First, a man with graying hair in a black coat holding an orange pumpkin, standing in front of wooden cabinets in a kitchen; then, it's a man with graying hair wearing a gray-green coat and a seatbelt, driving a car; finally, a black-haired man in a gray-black coat with Christmas elements sits in front of a blue globe backdrop talking to the camera.", "First, a man with graying hair wearing a gray-green coat and a seatbelt, driving a car; then, it's a man with graying hair in a black coat holding an orange pumpkin, standing in front of wooden cabinets in a kitchen; finally, it's a black-haired man in a gray-black coat with Christmas elements sitting in front of a blue globe backdrop talking to the camera.", "First, a black-haired man in a gray-black coat with Christmas elements sits in front of a blue globe backdrop talking to the camera; then, it's a man with graying hair in a black coat holding an orange pumpkin, standing in front of wooden cabinets in a kitchen; finally, it's a man with graying hair wearing a gray-green coat and a seatbelt, driving a car."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "Tp_-DjBQPIo_0", "video_path": "Tp_-DjBQPIo.mp4", "subtitle_path": "Tp_-DjBQPIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.92, "view_count": 153877}, {"video_id": "Tp_-DjBQPIo", "question": "Which of the following sequence of scenes in the video is correct?", "question_wo_referring_query": "Which of the following sequence of scenes in the video is correct?", "candidates": ["First, in the black-and-white video, a man wearing a dark coat over a light shirt with black hair is talking in front of a lectern. 
Then, there is a black background with two purple lines and some white English text above. Finally, this man in a dark coat over a light shirt stands with arms wide open in front of a blackboard with chalk symbols.", "First, this man in a dark coat over a light shirt stands with arms wide open in front of a blackboard with chalk symbols. Then, there is a black background with two purple lines and some white English text above. Finally, in the black-and-white video, a man wearing a dark coat over a light shirt with black hair is talking in front of a lectern.", "First, in the black-and-white video, a man wearing a dark coat over a light shirt with black hair is talking in front of a lectern. Then, this man in the dark coat over the light shirt stands with arms wide open in front of a blackboard with chalk symbols. Lastly, there is a black background with two purple lines and some white English text above.", "First, there is a black background with two purple lines and some white English text above. Then, this man in a dark coat over a light shirt stands with arms wide open in front of a blackboard with chalk symbols. Lastly, in the black-and-white video, a man wearing a dark coat over a light shirt with black hair is talking in front of a lectern.", "First, this man in a dark coat over a light shirt stands with arms wide open in front of a blackboard with chalk symbols. Then, in the black-and-white video, a man wearing a dark coat over a light shirt with black hair is talking in front of a lectern. 
Lastly, there is a black background with two purple lines and some white English text above."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "Tp_-DjBQPIo_1", "video_path": "Tp_-DjBQPIo.mp4", "subtitle_path": "Tp_-DjBQPIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.92, "view_count": 153877}, {"video_id": "Tp_-DjBQPIo", "question": "Which of the following scene sequences in the video are correct?", "question_wo_referring_query": "Which scene sequences in the video are correct?", "candidates": ["First, there is a silhouette of a man fully dressed in black holding a phone against a white background, then there are two iron balls connected by iron chains floating above a circle drawn on a whiteboard, and finally, a blue globe is placed on a wooden table indoors.", "First, a blue globe is placed on a wooden table indoors, then there is a silhouette of a man fully dressed in black holding a phone against a white background, and finally, there are two iron balls connected by iron chains floating above a circle drawn on a whiteboard.", "First, there is a silhouette of a man fully dressed in black holding a phone against a white background, then a blue globe is placed on a wooden table indoors, and finally, there are two iron balls connected by iron chains floating above a circle drawn on a whiteboard.", "First, there are two iron balls connected by iron chains floating above a circle drawn on a whiteboard, then there is a silhouette of a man fully dressed in black holding a phone against a white background, and finally, a blue globe is placed on a wooden table indoors.", "First, there are two iron balls connected by iron chains floating above a circle drawn on a whiteboard, then a blue globe is placed on a wooden table indoors, and finally, there is a silhouette of a man fully dressed in black holding a phone against a white background."], "topic_category": "KS-Knowledge-STEM", 
"question_category": "SSS", "level": "L2-Relation", "id": "Tp_-DjBQPIo_2", "video_path": "Tp_-DjBQPIo.mp4", "subtitle_path": "Tp_-DjBQPIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.92, "view_count": 153877}, {"video_id": "yDNj0aZ7oFY", "question": "Inside a room with bay windows, a man with black hair tied back on the left side is wearing a black suit and holding a book, while a blonde woman on the right is wearing a gray plaid coat and talking to the man. After the subtitles 'it a brief skim it a brief skim' appear, what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["viaduct", "picture of a person on the white wall", "woman doing yoga on a yoga mat", "piano", "white submarine on the water surface"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "yDNj0aZ7oFY_0", "video_path": "yDNj0aZ7oFY.mp4", "subtitle_path": "yDNj0aZ7oFY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.28, "view_count": 1173401}, {"video_id": "yDNj0aZ7oFY", "question": "In the scene, a man wearing a dark blue suit and black-rimmed glasses is holding a pile of US dollars. 
Before the subtitle 'so my first mission is to identify one' appeared, what object was shown?", "question_wo_referring_query": "What object was shown?", "candidates": ["A bean-green short-sleeve shirt", "A black short-sleeve shirt", "A high-rise building outside a rooftop", "A black and blue racing car parked in the middle of the road", "A woman wearing a white long-sleeve sweater with colorful stripes sitting and looking at a computer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "yDNj0aZ7oFY_1", "video_path": "yDNj0aZ7oFY.mp4", "subtitle_path": "yDNj0aZ7oFY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.28, "view_count": 1173401}, {"video_id": "yDNj0aZ7oFY", "question": "Outside the large floor-to-ceiling window, there are some potted green plants and wooden chairs. On the staircase beside the window, a man dressed in a black suit and pants is walking down. Before the subtitle 'okay now back to me becoming a' appears, what object is shown?", "question_wo_referring_query": "What object appears?", "candidates": ["A red arcade machine", "A bald man wearing a white shirt", "A ping-pong table", "A blue baseball cap", "A red sports car parked on the gray brick road in front of a house"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "yDNj0aZ7oFY_2", "video_path": "yDNj0aZ7oFY.mp4", "subtitle_path": "yDNj0aZ7oFY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1242.28, "view_count": 1173401}, {"video_id": "7aMVp75ybDA", "question": "In a room with white pillows and a white quilt, a woman wearing a black short-sleeved shirt is sitting on a cream-colored sofa. 
What did the woman wearing the black short-sleeved shirt do after she said 'each other'?", "question_wo_referring_query": "What did the woman wearing the black short-sleeved shirt do?", "candidates": ["Pointed at a book with the number 9 on the cover", "Held a book with a gray and white cover in one hand and made a gesture with the other hand", "Picked up a book with an orange cover", "Picked up a book with a gray and white cover"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "7aMVp75ybDA_0", "video_path": "7aMVp75ybDA.mp4", "subtitle_path": "7aMVp75ybDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1086.2, "view_count": 30635}, {"video_id": "7aMVp75ybDA", "question": "In a room with red maple leaves, a woman with long hair and wearing earrings sits on a cream-colored sofa. After she says \"vibes i recommend this a lot okay so the,\" what does the woman with earrings do?", "question_wo_referring_query": "What does the woman with earrings do?", "candidates": ["Picked up a book with a grey-white cover", "Picked up a book with a black cover", "Picked up a book with a cover showing a man in a blue shirt", "Picked up a book with an orange cover"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "7aMVp75ybDA_1", "video_path": "7aMVp75ybDA.mp4", "subtitle_path": "7aMVp75ybDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1086.2, "view_count": 30635}, {"video_id": "7aMVp75ybDA", "question": "In a room with a beige-colored small sand-hair carpet, there is a woman with long hair wearing a black short-sleeved shirt. 
After she says 'back to like normal,' what does the woman in the black short-sleeved shirt do?", "question_wo_referring_query": "What does the woman in the black short-sleeved shirt do?", "candidates": ["Picks up a book with a gray-white cover", "Picks up a book with a blue-green cover", "Picks up a book with a black cover", "Picks up a book with an orange cover"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "7aMVp75ybDA_2", "video_path": "7aMVp75ybDA.mp4", "subtitle_path": "7aMVp75ybDA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1086.2, "view_count": 30635}, {"video_id": "YaBogB6wKAI", "question": "In the background, there is a grey-green wall with a black-framed whiteboard hanging on it. A man wearing a black short-sleeve shirt and sporting dreadlocks drew a circle on the whiteboard and wrote the letters 'CBD' inside it. What did he do immediately afterwards?", "question_wo_referring_query": "What did he do immediately afterwards?", "candidates": ["He colored the circle black.", "He drew a five-pointed star outside the circle.", "He drew many symbols outside the circle containing the letters.", "He erased the circle with the letters inside.", "He drew a question mark inside the circle."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "YaBogB6wKAI_0", "video_path": "YaBogB6wKAI.mp4", "subtitle_path": "YaBogB6wKAI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1195.73, "view_count": 41834}, {"video_id": "YaBogB6wKAI", "question": "The background is a gray-green wall with a black-framed whiteboard on it. In the center of the whiteboard, there is a drawing of a circle with the text 'CBD' inside, surrounded by five radiating lines with many characters drawn along them. 
What did the man in a black short-sleeved shirt with curly hair do after this?", "question_wo_referring_query": "What did he do after this?", "candidates": ["Used a pen to circle the characters on the whiteboard", "Colored the circle black", "Erased the content on the whiteboard", "Drew some patterns in the four corners of the whiteboard", "Drew a question mark inside the circle"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "YaBogB6wKAI_1", "video_path": "YaBogB6wKAI.mp4", "subtitle_path": "YaBogB6wKAI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1195.73, "view_count": 41834}, {"video_id": "YaBogB6wKAI", "question": "The background is a gray-green wall with a black-framed whiteboard hanging on it. In the center of the whiteboard, there is a drawing of an irregular circle with the words 'inner city' written inside. What did the man in a black short-sleeved shirt with shaved head do after drawing five arrows on Friday?", "question_wo_referring_query": "What did he do afterward?", "candidates": ["Drew a rectangle on the whiteboard and wrote 'main city' inside", "Drew a star outside the circle", "Drew a question mark on the circle", "Drew a circle on the whiteboard and wrote the letters 'CBD' inside", "Filled the circle with black color"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "YaBogB6wKAI_2", "video_path": "YaBogB6wKAI.mp4", "subtitle_path": "YaBogB6wKAI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1195.73, "view_count": 41834}, {"video_id": "dCHFYOHlbBk", "question": "In a room with wooden cabinets and many photos on the wall, a woman wearing a black long-sleeved top is sitting in front of the cabinets. 
When the subtitle 'will be fully in person next monday so' appears, what style of hat is the woman in the black long-sleeved top wearing?", "question_wo_referring_query": "What style of hat is the woman wearing the black long-sleeved top wearing?", "candidates": ["Blue duckbill cap", "Black beret", "White and green checkered fisherman hat", "Black baseball cap", "White baseball cap"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "dCHFYOHlbBk_0", "video_path": "dCHFYOHlbBk.mp4", "subtitle_path": "dCHFYOHlbBk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.05, "view_count": 311042}, {"video_id": "dCHFYOHlbBk", "question": "In a brightly lit bathroom, there is a rectangular mirror. Two women with tied-up hair are standing in front of the mirror. When the subtitle 'no' appears, what style of clothing is the woman with orange hair wearing?", "question_wo_referring_query": "What style of clothing is the woman with orange hair wearing?", "candidates": ["Black hoodie", "White T-shirt", "Blue hoodie", "Dark blue long-sleeve scrub top", "White floral nightgown"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "dCHFYOHlbBk_1", "video_path": "dCHFYOHlbBk.mp4", "subtitle_path": "dCHFYOHlbBk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.05, "view_count": 311042}, {"video_id": "dCHFYOHlbBk", "question": "In a room with a wooden desk, there is a laptop on the desk. A woman wearing a blue and purple sweater is sitting at the desk. 
When the subtitle 'to provide that for her except she is so' appears, what hairstyle is the woman wearing?", "question_wo_referring_query": ", what hairstyle is the woman wearing, the woman who is wearing a blue and purple sweater?", "candidates": ["Green shoulder-length curly hair", "Gathered blonde long hair", "Blue long hair", "Blue shoulder-length short hair", "Gray long hair in a ponytail"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "dCHFYOHlbBk_2", "video_path": "dCHFYOHlbBk.mp4", "subtitle_path": "dCHFYOHlbBk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 979.05, "view_count": 311042}, {"video_id": "b8Q8yPBcNCk", "question": "Under a gray sky, there is a piece of reddish-brown land. On the land stand some soldiers holding daggers and wearing helmets. Beside the soldiers, there are also some trees without leaves. What color hats are the soldiers standing on the reddish-brown land in the picture wearing?", "question_wo_referring_query": "What color hats are the soldiers standing on the reddish-brown land in the picture wearing?", "candidates": ["red", "gray-green", "yellow", "purple", "white"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "b8Q8yPBcNCk_0", "video_path": "b8Q8yPBcNCk.mp4", "subtitle_path": "b8Q8yPBcNCk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1775.71, "view_count": 989205}, {"video_id": "b8Q8yPBcNCk", "question": "On a natural wood-colored floor, there is a table covered with a gray tablecloth. Several men wearing cream-colored shirts and cream-colored pants are standing next to the table. 
What kind of beard does the man holding a white bottle have?", "question_wo_referring_query": "What kind of beard does the man holding a white bottle have?", "candidates": ["A full beard", "A goatee", "A connected beard", "A single beard", "A mountain goat beard"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "b8Q8yPBcNCk_1", "video_path": "b8Q8yPBcNCk.mp4", "subtitle_path": "b8Q8yPBcNCk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1775.71, "view_count": 989205}, {"video_id": "b8Q8yPBcNCk", "question": "In a gray-blue waterscape, there is a man wearing an army green diving suit. The man has a gray diving mask on his head. What color is the man, who is wearing an army green diving suit, holding in his hand?", "question_wo_referring_query": "What color is the man, who is wearing an army green diving suit, holding in his hand?", "candidates": ["White", "Purple", "Green", "Black-gray", "Blue"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "b8Q8yPBcNCk_2", "video_path": "b8Q8yPBcNCk.mp4", "subtitle_path": "b8Q8yPBcNCk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1775.71, "view_count": 989205}, {"video_id": "CrdNZnpkfHM", "question": "In a game screen, there is a circle in the lower left corner, a radar map in the lower right corner, and a gradient green square in the upper right corner. 
What objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Purple text 'Why Intelligence'", "Red text 'Why Intelligence'", "White text 'Why Intelligence'", "Green text 'Why Intelligence'", "Blue text 'Why Intelligence'"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "CrdNZnpkfHM_0", "video_path": "CrdNZnpkfHM.mp4", "subtitle_path": "CrdNZnpkfHM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2412.36, "view_count": 13115}, {"video_id": "CrdNZnpkfHM", "question": "In a viewfinder screen, a mountain can be seen in the distance, with lush vegetation at its base. There is a rectangular radar map at the bottom right of the screen. What objects are present in the screen?", "question_wo_referring_query": "What objects are present in the screen?", "candidates": ["purple flowers", "blue whales", "white snow", "golden cannonball", "blue boat"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "CrdNZnpkfHM_1", "video_path": "CrdNZnpkfHM.mp4", "subtitle_path": "CrdNZnpkfHM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2412.36, "view_count": 13115}, {"video_id": "CrdNZnpkfHM", "question": "In a panoramic view, there is a deep blue sea at the bottom which includes some isolated islands. In the lower right corner of the frame, there is also a square radar map. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Yellow dots", "Blue dots", "Gray dots", "Orange dots", "Black dots"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "CrdNZnpkfHM_2", "video_path": "CrdNZnpkfHM.mp4", "subtitle_path": "CrdNZnpkfHM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2412.36, "view_count": 13115}, {"video_id": "L1PIf361b9c", "question": "Among the red and brown cliffs, there are some green plants. One plant growing in the middle is taller than the others. What happened to the plant growing in the middle?", "question_wo_referring_query": "What happened to the plant growing in the middle?", "candidates": ["Cut down by a man in a blue uniform", "Broken by a woman with long golden hair", "Blown into two pieces by the wind", "Swayed left and right by the wind", "Broken by a man in a white T-shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "L1PIf361b9c_0", "video_path": "L1PIf361b9c.mp4", "subtitle_path": "L1PIf361b9c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.17, "view_count": 135578}, {"video_id": "L1PIf361b9c", "question": "On a surface with various types of sand and soil, there are two red crabs. One larger red crab is on a red-brown rock. 
What happened to the larger red crab?", "question_wo_referring_query": "What happened to the larger red crab?", "candidates": ["Was swept away by water", "Covered and buried by black-gray sand", "Picked up by a man wearing a black hooded jacket using tongs", "Moving on the red-brown rock", "Picked up by a hand wearing a black glove"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "L1PIf361b9c_1", "video_path": "L1PIf361b9c.mp4", "subtitle_path": "L1PIf361b9c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.17, "view_count": 135578}, {"video_id": "L1PIf361b9c", "question": "In a gray and misty environment, there are two large patches of yellow-green land. Between these two yellow-green patches, there is also a black body of water. What has happened to the two yellow-green patches?", "question_wo_referring_query": "What has happened to the two yellow-green patches?", "candidates": ["continuously sinking", "gradually collapsed", "formed a circle", "separated further apart", "drawn closer together"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "L1PIf361b9c_2", "video_path": "L1PIf361b9c.mp4", "subtitle_path": "L1PIf361b9c_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 906.17, "view_count": 135578}, {"video_id": "XpoKB3usmKc", "question": "On a white PPT slide with the black text 'The Problem', there is a grey frame surrounding two dashed line boxes. 
When the subtitle 'typically the training process will just' appears, what change occurs to the grey frame surrounding the two dashed line boxes?", "question_wo_referring_query": "What change occurs to the grey frame surrounding the two dashed line boxes?", "candidates": ["The color of the frame changes from grey to purple", "It becomes smaller", "The color of the frame changes from grey to red", "It becomes larger", "The color of the frame changes from grey to blue"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "XpoKB3usmKc_0", "video_path": "XpoKB3usmKc.mp4", "subtitle_path": "XpoKB3usmKc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2217.62, "view_count": 27753}, {"video_id": "XpoKB3usmKc", "question": "On a white PPT slide with black text 'Ingredient 4: LORA', there is an image of multiple blue spheres connected together. What changes occur to this image of connected blue spheres when the subtitle 'are trainable while that's probably not' appears?", "question_wo_referring_query": "What changes occur to the image of connected blue spheres?", "candidates": ["Many white objects are inserted into the image of connected blue spheres.", "Many green objects are inserted into the image of connected blue spheres.", "Many yellow objects are inserted into the image of connected blue spheres.", "Many purple objects are inserted into the image of connected blue spheres.", "Many blue objects are inserted into the image of connected blue spheres."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "XpoKB3usmKc_1", "video_path": "XpoKB3usmKc.mp4", "subtitle_path": "XpoKB3usmKc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2217.62, "view_count": 27753}, {"video_id": "XpoKB3usmKc", "question": "On a white PPT slide with black text 'What is Quantization?', there is a grey-green 
rectangular gradient chart. What changes occur to the grey-green rectangular gradient chart when the subtitle 'here an alternative way we can do' appears?", "question_wo_referring_query": "What changes occur to the grey-green rectangular gradient chart?", "candidates": ["The color changed from grey-green to white", "The color changed from grey-green to purple", "The color changed from grey-green to blue", "It became longer and bigger", "It became shorter and smaller"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TAA", "level": "L2-Relation", "id": "XpoKB3usmKc_2", "video_path": "XpoKB3usmKc.mp4", "subtitle_path": "XpoKB3usmKc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2217.62, "view_count": 27753}, {"video_id": "3MI7MBOhnVs", "question": "On a white PPT slide with the black text 'Method', there is a red rectangle and two differently colored stair shapes. Which of the following subtitles have appeared alongside the red rectangle on the white PPT slide?", "question_wo_referring_query": "Which of the following subtitles have appeared alongside the red rectangle on the white PPT slide?", "candidates": ["\"and our paper is on high resolution\"", "\"the latent space representation and the\"", "\"image synthesis with latent diffusion\"", "\"models\"", "\"uh hello everyone uh we are group four\""], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "3MI7MBOhnVs_0", "video_path": "3MI7MBOhnVs.mp4", "subtitle_path": "3MI7MBOhnVs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1794.63, "view_count": 2059}, {"video_id": "3MI7MBOhnVs", "question": "On a white PPT slide with black text 'Method - Conditioning', there is a pink rectangle and a light blue rectangle. 
With which of the following subtitles has the pink rectangle appeared together?", "question_wo_referring_query": "With which of the following subtitles has the pink rectangle appeared together?", "candidates": ["\"space to get the latest representation\"", "\"and the encoder compresses the image\"", "\"input to an auto encoder is the image\"", "\"of the image we use the auto encoder the\"", "\"Transformer based Network\""], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "3MI7MBOhnVs_1", "video_path": "3MI7MBOhnVs.mp4", "subtitle_path": "3MI7MBOhnVs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1794.63, "view_count": 2059}, {"video_id": "3MI7MBOhnVs", "question": "On a white PPT slide with black text 'Super-Resolution with Latent Diffusion', there are several cat pictures with different levels of clarity. Which of the following subtitles appeared together with the cat pictures?", "question_wo_referring_query": "Which of the following subtitles appeared together with the cat pictures?", "candidates": ["\"interested in doing this you can just\"", "\"quality image for this case\"", "\"with your desired prompt choose your\"", "\"for Texas image synthesis so if you're\"", "\"use the script right here replace this\""], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "3MI7MBOhnVs_2", "video_path": "3MI7MBOhnVs.mp4", "subtitle_path": "3MI7MBOhnVs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1794.63, "view_count": 2059}, {"video_id": "vocQR-_MwXA", "question": "Which of the following scenarios is in the correct sequence?", "question_wo_referring_query": "Which of the following scenarios is in the correct sequence?", "candidates": ["First is a scene on a road with grass on both sides where many people are moving things; then is a scene where a man and a woman, both in white 
short-sleeved shirts, are hugging; and finally, a scene where a man in a white short-sleeved shirt is carrying something in an olive-yellow paper box.", "First is a scene where a man in a white short-sleeved shirt is carrying something in an olive-yellow paper box; then is a scene on a road with grass on both sides where many people are moving things; and finally, a scene where a man and a woman, both in white short-sleeved shirts, are hugging.", "First is a scene where a man in a white short-sleeved shirt is carrying something in an olive-yellow paper box; then is a scene where a man and a woman, both in white short-sleeved shirts, are hugging; and finally, a scene on a road with grass on both sides where many people are moving things.", "First is a scene on a road with grass on both sides where many people are moving things; then is a scene where a man in a white short-sleeved shirt is carrying something in an olive-yellow paper box; and finally, a scene where a man and a woman, both in white short-sleeved shirts, are hugging.", "First is a scene where a man and a woman, both in white short-sleeved shirts, are hugging; then is a scene on a road with grass on both sides where many people are moving things; and finally, a scene where a man in a white short-sleeved shirt is carrying something in an olive-yellow paper box."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "vocQR-_MwXA_0", "video_path": "vocQR-_MwXA.mp4", "subtitle_path": "vocQR-_MwXA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.61, "view_count": 5417442}, {"video_id": "vocQR-_MwXA", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First is a scene where a man in a black short-sleeve shirt and a woman with sunglasses on her head are sitting on a woven chair; then a scene where a woman 
wearing black sunglasses is talking to a woman in a white short-sleeve shirt; finally a scene where a woman in a dark blue short-sleeve shirt with hair tied up is sitting on a woven chair.", "First is a scene where a woman in a dark blue short-sleeve shirt with hair tied up is sitting on a woven chair; then a scene where a man in a black short-sleeve shirt and a woman with sunglasses on her head are sitting on a woven chair; finally a scene where a woman wearing black sunglasses is talking to a woman in a white short-sleeve shirt.", "First is a scene where a woman wearing black sunglasses is talking to a woman in a white short-sleeve shirt; then a scene where a man in a black short-sleeve shirt and a woman with sunglasses on her head are sitting on a woven chair; finally a scene where a woman in a dark blue short-sleeve shirt with hair tied up is sitting on a woven chair.", "First is a scene where a man in a black short-sleeve shirt and a woman with sunglasses on her head are sitting on a woven chair; then a scene where a woman in a dark blue short-sleeve shirt with hair tied up is sitting on a woven chair; finally a scene where a woman wearing black sunglasses is talking to a woman in a white short-sleeve shirt.", "First is a scene where a woman wearing black sunglasses is talking to a woman in a white short-sleeve shirt; then a scene where a woman in a dark blue short-sleeve shirt with hair tied up is sitting on a woven chair; finally a scene where a man in a black short-sleeve shirt and a woman with sunglasses on her head are sitting on a woven chair."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "vocQR-_MwXA_1", "video_path": "vocQR-_MwXA.mp4", "subtitle_path": "vocQR-_MwXA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.61, "view_count": 5417442}, {"video_id": "vocQR-_MwXA", "question": "Which of the following sequences of scenes is correct?", 
"question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First is a scene with a man in a white short-sleeve shirt standing in a room with a beige sofa; then, a scene with three women sitting together on a beige sofa; lastly, a man in a blue-green short-sleeve shirt standing in front of a door with a picture of a workspace, talking.", "First is a man in a blue-green short-sleeve shirt standing in front of a door with many pictures of planets, talking; then, three women sitting together on a beige sofa; lastly, a man in a white short-sleeve shirt standing in a room with a beige sofa.", "First is a scene with a man in a white short-sleeve shirt standing in a room with a beige sofa; then, a man in a blue-green short-sleeve shirt standing in front of a door with many pictures of planets, talking; lastly, a scene with three women sitting together on a beige sofa.", "First is a scene with three women sitting together on a beige sofa; then, a man in a white short-sleeve shirt standing in a room with a beige sofa; lastly, a man in a blue-green short-sleeve shirt standing in front of a door with many pictures of planets, talking.", "First is a scene with three women sitting together on a beige sofa; then, a man in a blue-green short-sleeve shirt standing in front of a door with many pictures of planets, talking; lastly, a man in a white short-sleeve shirt standing in a room with a beige sofa."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "vocQR-_MwXA_2", "video_path": "vocQR-_MwXA.mp4", "subtitle_path": "vocQR-_MwXA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 944.61, "view_count": 5417442}, {"video_id": "TQcODHmw378", "question": "In an elevator with dark oak walls, there is a man with short black hair and a woman with long black hair standing. 
After Bystander says 'most awkward places', who is the first person to appear on the screen?", "question_wo_referring_query": "Who is the first person to appear on the screen?", "candidates": ["A woman in a sleeveless white shirt", "Spider-Man", "A man with a yellow-green crossbody bag", "A man in a gray jacket", "A man wearing black earphones"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "TQcODHmw378_0", "video_path": "TQcODHmw378.mp4", "subtitle_path": "TQcODHmw378_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1117.58, "view_count": 3011811}, {"video_id": "TQcODHmw378", "question": "In a scene depicting a city, there are buildings of varying heights in the lower part of the screen, and in the distance, there's blue-black seawater. After the phrase 'although a part of the US' is spoken, who is the first person to appear on the screen?", "question_wo_referring_query": "Who is the first person to appear on the screen?", "candidates": ["A man wearing a yellow hat", "A man wearing a black-grey short-sleeve shirt", "A man wearing a black hooded jacket", "A little girl wearing a black short-sleeve shirt and a bun, dancing", "A woman wearing a red coat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "TQcODHmw378_1", "video_path": "TQcODHmw378.mp4", "subtitle_path": "TQcODHmw378_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1117.58, "view_count": 3011811}, {"video_id": "TQcODHmw378", "question": "In an elevator with a teal woodgrain wall, a man wearing a striped short-sleeve shirt is talking to the mirror. 
Who is the first person to appear on screen after he says, \"audible audible is an amazing tool to\"?", "question_wo_referring_query": "Who is the first person to appear on screen?", "candidates": ["A man with teal short hair, wearing a black short-sleeve shirt.", "A man wearing a black short-sleeve shirt, black pants, and talking on the phone.", "A man wearing a yellow hat.", "A man wearing a gray short-sleeve shirt.", "A man wearing red-rimmed glasses and a cream-colored top."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "TQcODHmw378_2", "video_path": "TQcODHmw378.mp4", "subtitle_path": "TQcODHmw378_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1117.58, "view_count": 3011811}, {"video_id": "U96FKT1zB68", "question": "The top of the white PPT has a black English title 'Efficient Network Architectures'. The screen shows four pictures respectively: a red bus, a yellow dog, some cartoon characters, and a building. After the subtitle 'so we we'll look at a couple of very' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A white line chart appears on the screen.", "Two green cubes appear on the screen.", "A grayscale table appears on the screen.", "Two differently shaped cubes appear, with green, yellow, and pink squares marked with numbers at the bottom.", "There appear three different shaped cubes, four of each shape."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "U96FKT1zB68_0", "video_path": "U96FKT1zB68.mp4", "subtitle_path": "U96FKT1zB68_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.13, "view_count": 28}, {"video_id": "U96FKT1zB68", "question": "The top of the white PPT slide displays the English title \u201cMobileNet,\u201d with two squares in the middle of the screen. 
At the bottom, there are six red-highlighted squares. After the subtitle 'looking at M Channel together right like' appears, what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Three squares appear on the screen, with the left one composed of dark blue, orange, yellow, and red, and the middle and right ones composed of orange, red, and yellow.", "Two differently-shaped squares appear on the screen, with the bottom ones marked with green, yellow, and pink numbers.", "Two green squares appear on the screen.", "A grayscale table appears on the screen.", "Three different-shaped squares, each displayed four times, appear on the screen."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "U96FKT1zB68_1", "video_path": "U96FKT1zB68.mp4", "subtitle_path": "U96FKT1zB68_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.13, "view_count": 28}, {"video_id": "U96FKT1zB68", "question": "In the top part of the white PPT slide, there is an English title 'MobileNet'. In the first line below, there is red-colored English text, and further down, there are three lines of formulas. After the subtitle 'multiplications and this is um you know' appears, what happens on the screen?", "question_wo_referring_query": ", what happens on the screen?", "candidates": ["Two different-shaped cuboids appear on the screen. The bottom one has green, yellow, and pink squares labeled with numbers.", "Two dark blue rectangles appear on the screen with a blue downward arrow in between.", "Three cuboids appear on the screen. 
The left one is composed of dark blue, orange, yellow, and red, while the middle and right ones are composed of orange, red, and yellow.", "Two green cuboids appear on the screen.", "Three different-shaped cuboids appear on the screen, each with four sides."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3E", "level": "L2-Relation", "id": "U96FKT1zB68_2", "video_path": "U96FKT1zB68.mp4", "subtitle_path": "U96FKT1zB68_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1505.13, "view_count": 28}, {"video_id": "FlMLRvrq5vI", "question": "In the scene, a woman with long curly hair wearing a yellow apron and a white short-sleeve shirt is standing in the kitchen. To her right, there are several vegetables in small bowls, a large bowl with dough, some small bowls with ingredients, and a green onion. In the middle, there is a glass cup containing oil. She is holding a yellow oil brush in her left hand and a cup of chopped garlic in her right hand. What does she do next?", "question_wo_referring_query": "What does she do next?", "candidates": ["She puts the onion into the oil.", "She pours the oil into the chopped garlic.", "She sweeps the garlic into the oil.", "She puts the green onion into the oil.", "She puts the oil brush into the oil."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "FlMLRvrq5vI_0", "video_path": "FlMLRvrq5vI.mp4", "subtitle_path": "FlMLRvrq5vI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.92, "view_count": 584142}, {"video_id": "FlMLRvrq5vI", "question": "In the screen, the right hand is pouring oil into a blender, while the left hand is holding the blender on a white marble countertop. There are four empty glass bowls on the left side of the blender, and three empty glass bowls on the right side. A ladle is placed below the blender. 
What did the woman do next?", "question_wo_referring_query": "What did the woman do next?", "candidates": ["Picked up the ladle and placed it on the right side of the blender.", "Picked up the ladle and placed it in the glass bowl on the left side.", "Poured the contents of the blender into the glass bowl on the right side.", "Picked up the ladle, took a scoop from the blender, and tasted it.", "Picked up the ladle and put the lid back on the blender."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "FlMLRvrq5vI_1", "video_path": "FlMLRvrq5vI.mp4", "subtitle_path": "FlMLRvrq5vI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.92, "view_count": 584142}, {"video_id": "FlMLRvrq5vI", "question": "The woman in the video is wearing a red apron, has a pink hairband on her left hand, and is holding a white cutting board. She is also wearing a bracelet and a ring on her right hand. She is cutting a dumpling into small pieces on a white marble countertop. 
What did she do next?", "question_wo_referring_query": "What did she do next?", "candidates": ["She picked up a small dumpling with both hands and divided it again.", "She picked up a small dumpling with both hands and placed it in a glass bowl on the left.", "She picked up a small dumpling with both hands and kneaded it.", "She put the small dumpling on the marble countertop, covered it with a towel, and started to ferment it.", "She picked up a small dumpling with both hands and added oil to it."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "FlMLRvrq5vI_2", "video_path": "FlMLRvrq5vI.mp4", "subtitle_path": "FlMLRvrq5vI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 919.92, "view_count": 584142}, {"video_id": "b8jlMJz5A0E", "question": "What is happening on the screen when the subtitles say \"signs and ample time to clear and many\"?", "question_wo_referring_query": "What is happening on the screen?", "candidates": ["The sky is overcast with dense clouds on the screen.", "The sky is clear and sunny on the screen.", "It is raining in the sky on the screen.", "It is snowing on the screen.", "There is a battle taking place on the screen, bombs are exploding in the air, and black smoke is billowing."], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "b8jlMJz5A0E_0", "video_path": "b8jlMJz5A0E.mp4", "subtitle_path": "b8jlMJz5A0E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2255.39, "view_count": 13391}, {"video_id": "b8jlMJz5A0E", "question": "When the subtitle reads 'though their debut was explosive it was,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A fountain sprays water on the screen", "Stream water is flowing on the screen", "Fireworks are burning on the screen", "Lava erupts upward from a volcano on the 
screen", "Coal is burning on the screen"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "b8jlMJz5A0E_1", "video_path": "b8jlMJz5A0E.mp4", "subtitle_path": "b8jlMJz5A0E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2255.39, "view_count": 13391}, {"video_id": "b8jlMJz5A0E", "question": "When the subtitle is 'eternal Testament to our planet's', a blonde woman wearing a burgundy red tank top, army green pants, and black sneakers appears on the screen. There is a tree in the top right corner, and a black backpack in the bottom right corner. At this moment, the woman is holding a pink water bottle. What is she doing on the rock wall?", "question_wo_referring_query": "What is she doing?", "candidates": ["Lying down against the rock wall", "Drinking water", "Standing up", "Placing the water bottle on the rock wall", "Tidying up her clothes"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "b8jlMJz5A0E_2", "video_path": "b8jlMJz5A0E.mp4", "subtitle_path": "b8jlMJz5A0E_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2255.39, "view_count": 13391}, {"video_id": "oqfW7xn215o", "question": "What was the man with the wine red short-sleeve shirt and black hair doing the first time he appeared?", "question_wo_referring_query": "What was he doing the first time he appeared?", "candidates": ["He was eating a strawberry.", "He was eating a plum.", "He was eating a cherry.", "He was eating a pancake.", "He was eating an apple."], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "oqfW7xn215o_0", "video_path": "oqfW7xn215o.mp4", "subtitle_path": "oqfW7xn215o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1138.26, "view_count": 1297381}, {"video_id": "oqfW7xn215o", "question": "What is the man with short black 
sleeves and brown hair doing when he first appears?", "question_wo_referring_query": "What is he doing when he first appears?", "candidates": ["Raising his hands above his head", "Holding his shoulders with both hands", "Holding a national flag", "Holding a ceramic cup", "Holding a glass of water"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "oqfW7xn215o_1", "video_path": "oqfW7xn215o.mp4", "subtitle_path": "oqfW7xn215o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1138.26, "view_count": 1297381}, {"video_id": "oqfW7xn215o", "question": "What is the blonde woman wearing a red dress doing when she first appears?", "question_wo_referring_query": "What is the blonde woman wearing a red dress doing when she first appears?", "candidates": ["Pulling the man in a grey T-shirt", "Holding an iron cup", "Holding a silver cup", "Holding a glass cup", "Pulling the man in a black T-shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "O2E", "level": "IntraMoment", "id": "oqfW7xn215o_2", "video_path": "oqfW7xn215o.mp4", "subtitle_path": "oqfW7xn215o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1138.26, "view_count": 1297381}, {"video_id": "zPIhMJGWiM8", "question": "In the ocean of coins, there are four people: two are sitting on the side, one person is submerged in the coins with only their head exposed, one person is swimming in the coins. 
Who is swimming in the coins?", "question_wo_referring_query": "Who is swimming in the coins?", "candidates": ["The person wearing a blue coat with nothing on their face", "The person with short hair wearing a brown coat", "The person wearing glasses and a brown coat", "The person submerged in the coins with only their head exposed", "The person wearing swimming goggles"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "zPIhMJGWiM8_0", "video_path": "zPIhMJGWiM8.mp4", "subtitle_path": "zPIhMJGWiM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.73, "view_count": 2106593}, {"video_id": "zPIhMJGWiM8", "question": "There is a sailboat in the scene, with the British flag hanging on the boat. There is a wooden box on the ground. There are three people in the scene. One person is carrying the wooden box. Who is carrying the wooden box?", "question_wo_referring_query": "There is a person carrying the wooden box. Who is carrying the wooden box?", "candidates": ["The person wearing a blue hat and has short hair", "The person wearing a blue coat and holding a wallet", "The person wearing a black hat and has long curly hair", "The person wearing a long black dress", "The person wearing a red headscarf and a sleeveless shirt"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "zPIhMJGWiM8_1", "video_path": "zPIhMJGWiM8.mp4", "subtitle_path": "zPIhMJGWiM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.73, "view_count": 2106593}, {"video_id": "zPIhMJGWiM8", "question": "In the scene, there are two people holding long guns. One is wearing gray clothes, and the other is wearing a red coat with black smoke above their head. There are three ships on the sea beside them, and one of them is sinking. 
Which ship is sinking?", "question_wo_referring_query": "Which ship is sinking?", "candidates": ["The ship with the Dutch flag", "The ship with the British flag", "The wooden ship with the Dutch flag", "The ship in the middle of the scene", "The ship on the left side of the scene"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "zPIhMJGWiM8_2", "video_path": "zPIhMJGWiM8.mp4", "subtitle_path": "zPIhMJGWiM8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.73, "view_count": 2106593}, {"video_id": "Py-sIVi8L-0", "question": "In front of the white building, there is a person wearing a blue coat walking by. Next to her, there is a lady wearing sunglasses and dressed in green floral clothing. When the subtitles mention 'voters to decide the issue this November,' what hairstyle does the lady wearing sunglasses have?", "question_wo_referring_query": "When the subtitles mention 'voters to decide the issue this November,' what hairstyle does the lady wearing sunglasses have?", "candidates": ["Olive shoulder-length wavy hair", "Olive ear-length short hair", "Olive shoulder-length straight hair", "Olive bob cut", "Black shoulder-length straight hair"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "Py-sIVi8L-0_0", "video_path": "Py-sIVi8L-0.mp4", "subtitle_path": "Py-sIVi8L-0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1250.52, "view_count": 875231}, {"video_id": "Py-sIVi8L-0", "question": "On the bookshelf in the room, there are books and photos. A man with short hair and glasses is sitting in front of the bookshelf. 
When the subtitles mention 'in Northern Gaza in Shian to look after,' what kind of outerwear is he wearing?", "question_wo_referring_query": "What kind of outerwear is he wearing?", "candidates": ["Black long-sleeved outerwear", "Olive long-sleeved outerwear", "Blue denim jacket", "Black wool sweater", "Black knit shirt"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "Py-sIVi8L-0_1", "video_path": "Py-sIVi8L-0.mp4", "subtitle_path": "Py-sIVi8L-0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1250.52, "view_count": 875231}, {"video_id": "Py-sIVi8L-0", "question": "In a room, there is a clock hanging on the wall, a photo placed on a wooden table nearby, and a bald man with a tie sitting in front of the wooden table. When the subtitle mentions 'choose to not go to college in the fall,' what are the colors of his glasses frame and lenses respectively?", "question_wo_referring_query": "What are the colors of his glasses frame and lenses respectively?", "candidates": ["Blue frame, white lenses", "Blue frame, black lenses", "Black frame, white lenses", "Brown frame, white lenses", "Blue frame, red lenses"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "Py-sIVi8L-0_2", "video_path": "Py-sIVi8L-0.mp4", "subtitle_path": "Py-sIVi8L-0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1250.52, "view_count": 875231}, {"video_id": "t6vzJ6ceJqs", "question": "In front of the red wall, there is a brown chair. A man wearing a black coat sits with his hands on the armrests, gazing into the distance. 
What kind of hairstyle does this man have?", "question_wo_referring_query": "What kind of hairstyle does this man have?", "candidates": ["black short hair", "crew cut", "white wavy hair", "black shoulder-length hair", "white short hair"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "t6vzJ6ceJqs_0", "video_path": "t6vzJ6ceJqs.mp4", "subtitle_path": "t6vzJ6ceJqs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1120.42, "view_count": 248642}, {"video_id": "t6vzJ6ceJqs", "question": "In a dimly lit room, the lamp on the table emits a soft glow. There is a statue on a pedestal in the room, and a woman is standing next to a chair under a white pillar. What kind of clothing is the woman wearing?", "question_wo_referring_query": "What kind of clothing is the woman wearing?", "candidates": ["red long dress", "pink long dress", "blue jeans", "pink short dress", "pink leather pants"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "t6vzJ6ceJqs_1", "video_path": "t6vzJ6ceJqs.mp4", "subtitle_path": "t6vzJ6ceJqs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1120.42, "view_count": 248642}, {"video_id": "t6vzJ6ceJqs", "question": "In the black background, there is a painting of a man wearing a hat and a suit. He is standing sideways, gazing forward. The two ends of his mustache curl upwards. 
What kind of hat is this man wearing?", "question_wo_referring_query": "What kind of hat is this man wearing?", "candidates": ["Olive bowler hat", "Black knitted hat", "Black wide-brimmed hat", "Black bowler hat", "Olive dress hat"], "topic_category": "KA-Knowledge-Art", "question_category": "S2A", "level": "IntraMoment", "id": "t6vzJ6ceJqs_2", "video_path": "t6vzJ6ceJqs.mp4", "subtitle_path": "t6vzJ6ceJqs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1120.42, "view_count": 248642}, {"video_id": "MebTIQNRU5g", "question": "In the white background, there is a flower pattern. In the bottom left corner, there is a book pattern. In the bottom right corner, next to a pink table, there is a woman wearing a blue coat. When the subtitle says 'going to be part of this equation so in a combustion reaction recall that we\u2019re', what item is present in the scene?", "question_wo_referring_query": "What item is present in the scene?", "candidates": ["blue phone", "pink long-sleeve coat", "book with blue cover", "desktop computer", "yellow lamp"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "MebTIQNRU5g_0", "video_path": "MebTIQNRU5g.mp4", "subtitle_path": "MebTIQNRU5g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1289.39, "view_count": 286569}, {"video_id": "MebTIQNRU5g", "question": "On a white background with text and graphics, a blue arrow points to the letter H. In the bottom right corner of the screen, there is a woman with long black hair wearing a blue outer garment. 
When the subtitles mention \u201cI want to align the unit so they can cancel so these two grams across from,\u201d what objects are present in this scene?", "question_wo_referring_query": "When the subtitles mention \u201cI want to align the unit so they can cancel so these two grams across from,\u201d what objects are present in this scene?", "candidates": ["pink telephone", "pink desk lamp", "wired earphones", "red arrow", "blue desk lamp"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "MebTIQNRU5g_1", "video_path": "MebTIQNRU5g.mp4", "subtitle_path": "MebTIQNRU5g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1289.39, "view_count": 286569}, {"video_id": "MebTIQNRU5g", "question": "A painting hangs on the pink wall in the room, and items are placed in the white cabinet in front of the wall. A lady with long hair is sitting next to the cabinet. She extends her hands' index fingers. When the subtitles mention 'this is my theoretical yield this question is a common test question and a', what item is not present in this scene?", "question_wo_referring_query": "what item is not present in this scene?", "candidates": ["table lamp", "telephone", "silver vase", "car model", "green plant"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "MebTIQNRU5g_2", "video_path": "MebTIQNRU5g.mp4", "subtitle_path": "MebTIQNRU5g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1289.39, "view_count": 286569}, {"video_id": "T7B-xr346JU", "question": "In a dimly lit room, a lamp emits a faint glow. A man with short hair, dressed in a black suit, stands in the room with his fingers interlocked. 
What item is not present in this scene?", "question_wo_referring_query": ", what item is not present in this scene?", "candidates": ["earphones", "mask", "cufflinks", "glasses", "tie"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "T7B-xr346JU_0", "video_path": "T7B-xr346JU.mp4", "subtitle_path": "T7B-xr346JU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.72, "view_count": 79598}, {"video_id": "T7B-xr346JU", "question": "On the black wall, there are white lines. In front of the wall, there are two sets of clothing. On the left is a set of black women's clothes, and on the right is a set of men's clothes with a black coat and a white shirt. What items are present in this scene?", "question_wo_referring_query": "What items are present in this scene?", "candidates": ["Necklace", "Tie", "Leather bag", "Button", "Belt"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "T7B-xr346JU_1", "video_path": "T7B-xr346JU.mp4", "subtitle_path": "T7B-xr346JU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.72, "view_count": 79598}, {"video_id": "T7B-xr346JU", "question": "There are two sets of black clothing in front of a black wall. The set on the left consists of a top and a bottom, while the set on the right is undivided and has the words 'LITTLE BLACK DRESS' in white on it. 
What items are present in this scene?", "question_wo_referring_query": ", What items are present in this scene?", "candidates": ["white mannequin", "scarf", "belt", "button", "glasses"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "T7B-xr346JU_2", "video_path": "T7B-xr346JU.mp4", "subtitle_path": "T7B-xr346JU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 929.72, "view_count": 79598}, {"video_id": "K16pFXgrdRw", "question": "In the elevator, a man with short hair wearing a blue shirt is placing his hands on the handrail. What is this man doing?", "question_wo_referring_query": "In the elevator, a man with short hair wearing a blue shirt is placing his hands on the handrail. What is this man doing?", "candidates": ["He is hitting his head with the handrail.", "He is throwing the handrail aside.", "He is taking down the handrail.", "He is installing the handrail.", "He is placing the handrail on the ground."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "K16pFXgrdRw_0", "video_path": "K16pFXgrdRw.mp4", "subtitle_path": "K16pFXgrdRw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.71, "view_count": 4214469}, {"video_id": "K16pFXgrdRw", "question": "There are two people in front of a clean white wall, with a water dispenser and some greenery placed on the walkway. The person standing on the right is wearing black and holding a gun. 
What is the person with the gun doing?", "question_wo_referring_query": "What is the person holding the gun doing?", "candidates": ["He is holding the other person", "He is shooting at the other person", "He is taking down an item hanging on the wall", "He is fist-bumping the other person", "He is throwing the gun away"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "K16pFXgrdRw_1", "video_path": "K16pFXgrdRw.mp4", "subtitle_path": "K16pFXgrdRw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.71, "view_count": 4214469}, {"video_id": "K16pFXgrdRw", "question": "A person wearing blue clothing is standing in front of a silver wall. He's holding a bottle of water in his left hand and his clothes are covered in bloodstains. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He pointed a gun at his own temple", "He is drinking water", "He opened the elevator door", "He took off his coat", "He is smoking"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "K16pFXgrdRw_2", "video_path": "K16pFXgrdRw.mp4", "subtitle_path": "K16pFXgrdRw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 939.71, "view_count": 4214469}, {"video_id": "UwyBbhyocJM", "question": "In front of a black wall, a woman wearing a yellow bikini looks up. Her hands are chained, and as the subtitle 'Not long after, Zed enters Daria\u2019s room and invites her to meet the guests.' 
appears, what changes occur to her?", "question_wo_referring_query": "What changes occur to her?", "candidates": ["She puts on a blue fur coat", "She puts on a black long-sleeved coat", "She drapes a white scarf", "She puts on a black long skirt", "She puts on a white long-sleeved coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "UwyBbhyocJM_0", "video_path": "UwyBbhyocJM.mp4", "subtitle_path": "UwyBbhyocJM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.44, "view_count": 181217}, {"video_id": "UwyBbhyocJM", "question": "In a room, there is a woman sitting on a wooden chair. She is wearing a black long-sleeved coat and has golden curly hair. In front of her, there is a silver cup. What changes occur when she appears with the subtitles 'The trio plots to set traps in the jungle, while Tisa stays behind to keep her eyes on'?", "question_wo_referring_query": "What changes occur to her?", "candidates": ["She changes into a transparent black dress.", "She changes into a black long skirt.", "She changes into a transparent white dress.", "She changes into a yellow bikini.", "She changes into a white coat."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "UwyBbhyocJM_1", "video_path": "UwyBbhyocJM.mp4", "subtitle_path": "UwyBbhyocJM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.44, "view_count": 181217}, {"video_id": "UwyBbhyocJM", "question": "On the screen, there are two people, a man with short black hair sitting beside a woman with long brunette hair. She is wearing a golden necklace and a black sleeveless dress. 
When the subtitle \u201cAt the same time, Daria and Shela take a break, and Shela expresses that she\u2019s tired of everything.\u201d appears, what change happened to her?", "question_wo_referring_query": "What change happened to her?", "candidates": ["She changed into a white sleeveless short dress.", "She changed into a yellow bikini.", "She changed into a black long dress.", "She changed into a white blouse.", "She changed into a black shirt."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "UwyBbhyocJM_2", "video_path": "UwyBbhyocJM.mp4", "subtitle_path": "UwyBbhyocJM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 901.44, "view_count": 181217}, {"video_id": "AL6U74YgA0Y", "question": "In a dimly lit room, there is a decoration hanging on the wall, a bunch of flowers placed next to the white wall, and beside the flowers stands a person with short hair wearing a brown coat. When this man appears at the doorway facing an elderly person with white hair, what changes occur?", "question_wo_referring_query": "What changes occur?", "candidates": ["He changes into a gray sweater.", "He changes into a gray coat.", "He changes into a plaid shirt.", "He puts on a hat.", "He changes into a brown sweater."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "AL6U74YgA0Y_0", "video_path": "AL6U74YgA0Y.mp4", "subtitle_path": "AL6U74YgA0Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1021.47, "view_count": 66709}, {"video_id": "AL6U74YgA0Y", "question": "Next to the pink flowers, there is a man in a police uniform. His hat is askew, and he is holding a cup and a bag in his left hand. 
When he appears in front of a police car, what changes occur?", "question_wo_referring_query": "What changes occur to him?", "candidates": ["He takes off his hat", "He takes off his police uniform", "He changes the items in his hand to a handgun", "He changes the items in his hand to a piece of butter", "He changes the items in his hand to a piece of clothing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "AL6U74YgA0Y_1", "video_path": "AL6U74YgA0Y.mp4", "subtitle_path": "AL6U74YgA0Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1021.47, "view_count": 66709}, {"video_id": "AL6U74YgA0Y", "question": "In a dark room, there is a woman with long hair standing. She is wearing a purple garment with floral patterns. When this woman appears on a busy street, standing opposite to a person with white hair, what changes did she undergo?", "question_wo_referring_query": "What changes did she undergo?", "candidates": ["She changed into a tan overcoat", "She is holding a bowl", "She changed into a beige overcoat", "She changed into a purple vest", "She sat in a wheelchair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "AL6U74YgA0Y_2", "video_path": "AL6U74YgA0Y.mp4", "subtitle_path": "AL6U74YgA0Y_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1021.47, "view_count": 66709}, {"video_id": "TCZQL4e4JtE", "question": "There are three people on the screen. The tall man on the left has a crew cut, the man in the middle is wearing a fur coat, and the woman on the right has blond hair and a crew cut. 
In which subtitles does the man with the crew cut appear together?", "question_wo_referring_query": "In which subtitles does the man with the crew cut appear together?", "candidates": ["to leave the island wanting to keep the", "Astrid and the other stop at a glacier", "have much effect and she is easily", "into the Hunter's ship and throws a few", "entrances to Burke forbidding citizens"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "TCZQL4e4JtE_0", "video_path": "TCZQL4e4JtE.mp4", "subtitle_path": "TCZQL4e4JtE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1072.16, "view_count": 69827}, {"video_id": "TCZQL4e4JtE", "question": "On the screen, there are two people: a lady with golden hair and a long skirt, and a man with brown hair wearing armor, sitting on the ground. What subtitles appear along with this man?", "question_wo_referring_query": "What subtitles appear along with this man?", "candidates": ["to ashes with a stoic being the only one", "peace hiccup suggests going to Drago\u2019s", "storm the meeting setting the ceiling on", "fire and reducing everyone in the hall", "about responsibility they spot a large"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "TCZQL4e4JtE_1", "video_path": "TCZQL4e4JtE.mp4", "subtitle_path": "TCZQL4e4JtE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1072.16, "view_count": 69827}, {"video_id": "TCZQL4e4JtE", "question": "In a room, there is a man with drooping bangs wearing a horn-shaped hat and a thick fur coat, holding a child who is looking ahead. 
With which captions does this man appear?", "question_wo_referring_query": "With which captions does this man wearing a horn-shaped hat appear?", "candidates": ["off his glider locking on toothless", "Left To Face Drago alone even though he", "while gliding through the air the young", "mother and son spend the whole day", "have much effect and she is easily"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "TCZQL4e4JtE_2", "video_path": "TCZQL4e4JtE.mp4", "subtitle_path": "TCZQL4e4JtE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1072.16, "view_count": 69827}, {"video_id": "VuxC8bamaVY", "question": "In a room with transparent glass, there is a man with black-framed glasses, wearing a black coat and a tie. Where has this man appeared?", "question_wo_referring_query": "Where has this man appeared?", "candidates": ["Beside a bald man wearing an orange outfit", "In a white car", "On a helicopter", "In an olive-colored car", "At the zoo"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "VuxC8bamaVY_0", "video_path": "VuxC8bamaVY.mp4", "subtitle_path": "VuxC8bamaVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1055.82, "view_count": 205199}, {"video_id": "VuxC8bamaVY", "question": "On the dimly lit street, there is a man wearing a black long shirt. He has a cigarette in his right hand and his left hand is deeply inserted into the wall beside him. 
Where has this man appeared?", "question_wo_referring_query": "Where has this man appeared?", "candidates": ["In a prison", "Next to a man wearing a red vest outside a house", "In the bathroom", "In a white car", "Next to a person with white hair wearing a blue uniform"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "VuxC8bamaVY_1", "video_path": "VuxC8bamaVY.mp4", "subtitle_path": "VuxC8bamaVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1055.82, "view_count": 205199}, {"video_id": "VuxC8bamaVY", "question": "In a brightly sunlit room, there is a wooden table with a black table lamp and other items on it. A man with short white hair, wearing a shirt, is sitting in front of the table. He is holding a red object in his right hand and a piece of paper in his left hand. Where has this man appeared before?", "question_wo_referring_query": "Where has this man appeared before?", "candidates": ["On a white bed", "In a boxing gym", "Next to a piano", "Next to a bald man holding a long gun", "In a bar"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "VuxC8bamaVY_2", "video_path": "VuxC8bamaVY.mp4", "subtitle_path": "VuxC8bamaVY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1055.82, "view_count": 205199}, {"video_id": "UvxhQuE1iw4", "question": "A clock is hanging on the kitchen wall. A young girl with long golden hair is sitting on the counter, and beside her, a man in a gray shirt is cooking. After the subtitle mentions 'After a week, Katie returns to work. 
During this time,', what does the man do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["He takes down a book", "He kisses a girl", "He picks up a child", "He sits on the stove", "He smokes a cigarette"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "UvxhQuE1iw4_0", "video_path": "UvxhQuE1iw4.mp4", "subtitle_path": "UvxhQuE1iw4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 995.56, "view_count": 122932}, {"video_id": "UvxhQuE1iw4", "question": "In the room, there is a man with short hair wearing a white patterned short-sleeved shirt. He pushes a woman with golden hair onto the bed. The subtitle mentions 'then confronts the writer about her actual whereabouts, pushing her against the bed.' What does the man do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["He drives a car with one hand", "He picks up the mobile phone", "He takes a white object from his left hand with his right hand", "He picks up an infant", "He turns on the computer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "UvxhQuE1iw4_1", "video_path": "UvxhQuE1iw4.mp4", "subtitle_path": "UvxhQuE1iw4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 995.56, "view_count": 122932}, {"video_id": "UvxhQuE1iw4", "question": "A lady is sitting in front of a white cabinet, she is wearing a gray and black coat, with a blue scarf wrapped around her hair, and she places her phone on a white table next to her. When the subtitle mentions 'but she contemplates that her hormones may influence her emotional response. 
Later;', what does the lady do next?", "question_wo_referring_query": "What does the lady do next?", "candidates": ["She takes a bath in the tub", "She sits on the stove", "She paints a bird on the wall", "She drinks from a bottle of wine", "She is playing on the computer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "UvxhQuE1iw4_2", "video_path": "UvxhQuE1iw4.mp4", "subtitle_path": "UvxhQuE1iw4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 995.56, "view_count": 122932}, {"video_id": "JwSSABe3B54", "question": "In the video, a man wearing a black coat and a blue and white striped shirt appears in front of a door and beside a railing. Who is the first person to appear on the scene?", "question_wo_referring_query": "Who is the first person to appear on the scene?", "candidates": ["A man in a shirt and tie who is on the phone", "A woman in a floral dress", "A woman wearing a black mask", "A man wearing a black and white checkered shirt", "A woman with curly hair wearing a V-neck black top and earrings"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "JwSSABe3B54_0", "video_path": "JwSSABe3B54.mp4", "subtitle_path": "JwSSABe3B54_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.67, "view_count": 488898}, {"video_id": "JwSSABe3B54", "question": "After the appearance of a woman wearing a black hat and black clothing while holding a camera on the grass, which of the following characters shows up first?", "question_wo_referring_query": "Which of the following characters shows up first?", "candidates": ["A boy wearing red shorts", "A man wearing a long nightgown", "A man wearing black and white checkered shirt and glasses", "A woman with red hair", "An old woman wearing a pearl necklace and earrings"], "topic_category": "Recreational: MR-Movie-Recaps", 
"question_category": "O3O", "level": "L2-Relation", "id": "JwSSABe3B54_1", "video_path": "JwSSABe3B54.mp4", "subtitle_path": "JwSSABe3B54_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.67, "view_count": 488898}, {"video_id": "JwSSABe3B54", "question": "After a man wearing a black and white checkered shirt and a woman walk through a door emitting green light, which of the following characters appears first?", "question_wo_referring_query": ", which of the following characters appears first?", "candidates": ["A woman wearing black clothing and a black hat", "A woman with pink hair", "A man wearing a red shirt and holding binoculars", "A man wearing a black and white checkered shirt", "A woman wearing a floral dress"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "JwSSABe3B54_2", "video_path": "JwSSABe3B54.mp4", "subtitle_path": "JwSSABe3B54_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 936.67, "view_count": 488898}, {"video_id": "rba81Y5l7rU", "question": "The ground in the scene is covered with leaves and mud. A person is trapped with only their head exposed while the rest of their body is buried in dirt. What did this person do after getting trapped?", "question_wo_referring_query": "What did this person do after getting trapped?", "candidates": ["He started a fire", "He picked up a gun", "He went to find water to drink", "He struggled to crawl out of the mud", "He jumped into the river"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "rba81Y5l7rU_0", "video_path": "rba81Y5l7rU.mp4", "subtitle_path": "rba81Y5l7rU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1200.9, "view_count": 9255}, {"video_id": "rba81Y5l7rU", "question": "On the left side of the screen, a person extends a hand with a wound. 
After someone on the right side of the screen gently applies medication to the wound with their finger, what happens next?", "question_wo_referring_query": ", what happens next on the screen?", "candidates": ["One person climbs up a tree branch", "The two people embrace", "One person gets stuck in the mud", "The person on the right side of the screen applies a white bandage to the hand of the injured person on the left side of the screen", "One person stands on a tree branch"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "rba81Y5l7rU_1", "video_path": "rba81Y5l7rU.mp4", "subtitle_path": "rba81Y5l7rU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1200.9, "view_count": 9255}, {"video_id": "rba81Y5l7rU", "question": "In the screen, a person is standing on a branch, holding a probe with a display screen. After a large ant crawls onto his hand, what happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["He hugs a girl", "He lies down by the river", "He picks up a gun and shoots", "He is attacked by a dinosaur", "He falls from the tree"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "rba81Y5l7rU_2", "video_path": "rba81Y5l7rU.mp4", "subtitle_path": "rba81Y5l7rU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1200.9, "view_count": 9255}, {"video_id": "YX2jnHv6R-w", "question": "In a room with a plum blossom design on the backdrop, a man in a skirt stands to the left of the table. 
What action does the man on the right wearing a black suit and blue jeans take when the subtitles say 'die left with no choice our hero heads'?", "question_wo_referring_query": "What action does he take?", "candidates": ["Picked up a book", "Picked up a cup of coffee", "Lay on the bed", "Sat on the sofa", "Put his hand into his pocket"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "YX2jnHv6R-w_0", "video_path": "YX2jnHv6R-w.mp4", "subtitle_path": "YX2jnHv6R-w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1545.93, "view_count": 21940}, {"video_id": "YX2jnHv6R-w", "question": "On the left side of the screen, there is a man dressed in a green suit with a tie around his neck, holding the arm of a man dressed in a black shirt. When the subtitle says 'find human souls and make a contract,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The hair of the man dressed in a green suit catches fire.", "The fingers of the man dressed in a black shirt catch fire.", "The hair of the man dressed in a black shirt catches fire.", "The buttocks of the man dressed in a black shirt catch fire.", "The fingers of the man dressed in a green suit catch fire."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "YX2jnHv6R-w_1", "video_path": "YX2jnHv6R-w.mp4", "subtitle_path": "YX2jnHv6R-w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1545.93, "view_count": 21940}, {"video_id": "YX2jnHv6R-w", "question": "In a restaurant full of customers, the table in the center of the screen has wine and food on it. Around it are five people, with a woman in white clothing sitting in the middle. 
When the subtitle reads 'he's still on duty while the others,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The two men sitting on the left side of the table raise their glasses", "The woman in white sitting in the middle of the table raises her glass", "The two men sitting on the left side of the table place their glasses on their heads", "The woman sitting in the middle of the table covers her ears with both hands", "The two men sitting on the left side of the table place their glasses on the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "YX2jnHv6R-w_2", "video_path": "YX2jnHv6R-w.mp4", "subtitle_path": "YX2jnHv6R-w_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1545.93, "view_count": 21940}, {"video_id": "Q6xlMp1yyng", "question": "In front of a window with white curtains, there is a long-haired little girl wearing a white top. Next to her is a pillow with a rabbit pattern. Behind her, a man is looking back at the little girl. What was this little girl doing the first time she appeared?", "question_wo_referring_query": "What was this little girl doing the first time she appeared?", "candidates": ["She was playing with a doll", "She was jumping on the bed", "She was swaying left and right", "She covered her head with her hands", "She was holding a fish cake"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Q6xlMp1yyng_0", "video_path": "Q6xlMp1yyng.mp4", "subtitle_path": "Q6xlMp1yyng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1504.28, "view_count": 147234}, {"video_id": "Q6xlMp1yyng", "question": "In front of a building's staircase, on the far left stands a little girl, in the middle stands a black-haired man wearing a dark blue coat, and on the far right is an old man with white hair. 
What did this old man with white hair do the first time he appeared?", "question_wo_referring_query": "What did this old man with white hair do the first time he appeared?", "candidates": ["He fell to the ground", "He picked up a little girl", "He was walking a dog", "He hugged the little girl", "He reached out and pushed the chest of the man in the dark blue coat in front of him"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Q6xlMp1yyng_1", "video_path": "Q6xlMp1yyng.mp4", "subtitle_path": "Q6xlMp1yyng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1504.28, "view_count": 147234}, {"video_id": "Q6xlMp1yyng", "question": "In front of a brown table, a woman wearing a black coat and a white blouse is sitting on the right side, looking at a man beside her who is wearing a gray lab coat with a red hat and earphones. What did the man do the first time he appeared?", "question_wo_referring_query": "What did the man do the first time he appeared?", "candidates": ["He stood up to dance", "He tapped his fingers on the table", "He took off his earphones", "He picked up a glass of water", "He took off his glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "Q6xlMp1yyng_2", "video_path": "Q6xlMp1yyng.mp4", "subtitle_path": "Q6xlMp1yyng_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1504.28, "view_count": 147234}, {"video_id": "_9NIU2r6RSs", "question": "In a dark environment, who is standing in front of a background with black horizontal lines and numbers indicating different heights, holding a black sign in their hand?", "question_wo_referring_query": "Who is it?", "candidates": ["A woman wearing a black fur coat", "A woman wearing a red top", "A woman wearing a pink fur coat", "A woman wearing a red fur coat", "A woman wearing a mask"], "topic_category": "Recreational: 
MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "_9NIU2r6RSs_0", "video_path": "_9NIU2r6RSs.mp4", "subtitle_path": "_9NIU2r6RSs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.12, "view_count": 491041}, {"video_id": "_9NIU2r6RSs", "question": "In a dimly lit room with different photos and paper stuck on a wall and window, who is the person standing to the left of the white house building model placed in the middle?", "question_wo_referring_query": "Who is it?", "candidates": ["A man wearing a black suit and glasses", "A man wearing a red suit and glasses", "A man wearing a pink suit and glasses", "A man wearing a gray suit and a hat", "A man wearing a white suit and glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "_9NIU2r6RSs_1", "video_path": "_9NIU2r6RSs.mp4", "subtitle_path": "_9NIU2r6RSs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.12, "view_count": 491041}, {"video_id": "_9NIU2r6RSs", "question": "In front of a building with several round columns outside its door, who is standing on the steps holding a gun and speaking while holding a piece of white paper?", "question_wo_referring_query": "Who is it?", "candidates": ["A woman in green clothing with her mask lifted, exposing her face", "A woman in blue clothing with her mask lifted, exposing her face", "A woman in red clothing with her mask lifted, exposing her face", "A person in red clothing with a mask covering their face", "A person in yellow clothing with a mask covering their face"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "_9NIU2r6RSs_2", "video_path": "_9NIU2r6RSs.mp4", "subtitle_path": "_9NIU2r6RSs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 980.12, "view_count": 491041}, {"video_id": "58AB1z5HdDs", "question": 
"In a lively bar, there is a handsome man standing in the middle of the screen, surrounded by many beautiful women. Among them, a woman in a red bikini on his right is dancing enthusiastically. When the subtitle says \"Just when Reed is finally loosening up and having fun dancing with women, Sue and\", what style of clothing is this man wearing?", "question_wo_referring_query": "What style of clothing is this man wearing?", "candidates": ["white vest", "white shirt and black vest", "black suit", "blue shirt and black vest", "blue shirt and pink vest"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "58AB1z5HdDs_0", "video_path": "58AB1z5HdDs.mp4", "subtitle_path": "58AB1z5HdDs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1046.57, "view_count": 149924}, {"video_id": "58AB1z5HdDs", "question": "In front of a building, there is a black plastic bag on the far left side. On the right side of the screen, a woman's leg is engulfed in golden flames, with many people around watching her. When the subtitle says, 'She tries to check on him by putting her hand on his forehead, and their powers suddenly,' what is the color of this woman's hair?", "question_wo_referring_query": "In front of a building, there is a black plastic bag on the far left side. On the right side of the screen, a woman's leg is engulfed in golden flames, with many people around watching her. 
When the subtitle says, 'She tries to check on him by putting her hand on his forehead, and their powers suddenly,' what is the color of this woman's hair?", "candidates": ["white", "pink", "black", "red", "yellow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "58AB1z5HdDs_1", "video_path": "58AB1z5HdDs.mp4", "subtitle_path": "58AB1z5HdDs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1046.57, "view_count": 149924}, {"video_id": "58AB1z5HdDs", "question": "On the green vegetated brown soil, there is a machine on the right side, and next to it stands a blonde woman wearing a tight outfit. She looks up at a silver figure floating in the air. When the subtitle says 'machine on, but the surfer stops her,' what color is the outfit the woman is wearing?", "question_wo_referring_query": "What color is the outfit the woman is wearing?", "candidates": ["white", "red", "blue", "purple", "black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "58AB1z5HdDs_2", "video_path": "58AB1z5HdDs.mp4", "subtitle_path": "58AB1z5HdDs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1046.57, "view_count": 149924}, {"video_id": "NWMUKBUkK1o", "question": "In the dimly lit airplane cabin, on the seats flanking the middle aisle, a man in a suit is seated on the far left side looking at his phone, and a woman on the far right side is holding a phone. 
What style of clothing is the man seated behind this woman wearing?", "question_wo_referring_query": "What style of clothing is the man seated behind this woman wearing?", "candidates": ["suit", "short sleeves", "vest", "cotton-padded jacket", "shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "NWMUKBUkK1o_0", "video_path": "NWMUKBUkK1o.mp4", "subtitle_path": "NWMUKBUkK1o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1012.31, "view_count": 20412}, {"video_id": "NWMUKBUkK1o", "question": "In a large conference room, there are many chairs placed around, and on the table, there are many scattered documents. Two men are looking for something on the table. On the far left of the screen, there is a man wearing a shirt standing. What color is the shirt worn by this man?", "question_wo_referring_query": "What color is the shirt worn by this man?", "candidates": ["blue", "white", "black", "green", "olive"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "NWMUKBUkK1o_1", "video_path": "NWMUKBUkK1o.mp4", "subtitle_path": "NWMUKBUkK1o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1012.31, "view_count": 20412}, {"video_id": "NWMUKBUkK1o", "question": "Three people are standing in the screen. The person on the far left is a woman with tied hair and wearing a black coat, looking down at an electronic screen in her hand. On the far right stands an elderly man in a suit with white hair. 
What style of clothing is the man standing in the middle wearing?", "question_wo_referring_query": "What style of clothing is the man standing in the middle wearing?", "candidates": ["olive green long-sleeve", "cotton-padded jacket", "baseball uniform", "short-sleeve", "undershirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "NWMUKBUkK1o_2", "video_path": "NWMUKBUkK1o.mp4", "subtitle_path": "NWMUKBUkK1o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1012.31, "view_count": 20412}, {"video_id": "vS8eQ5nVrRQ", "question": "A man wearing black sunglasses stands next to a black and orange motorcycle, with a wall made of wooden planks and a white tent behind him. What kind of clothes is he wearing?", "question_wo_referring_query": "A man wearing black sunglasses stands next to a black and orange motorcycle, with a wall made of wooden planks and a white tent behind him. What kind of clothes is he wearing?", "candidates": ["army green camouflage suit", "black long-sleeve", "gray hoodie", "suit jacket", "green leather jacket"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "vS8eQ5nVrRQ_0", "video_path": "vS8eQ5nVrRQ.mp4", "subtitle_path": "vS8eQ5nVrRQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1034.33, "view_count": 8701}, {"video_id": "vS8eQ5nVrRQ", "question": "In a blurry scene, there are two men talking. One man is wearing a gray-green coat, and the other man is wearing a black coat and a black hat. 
What kind of beard does the man wearing the black hat have?", "question_wo_referring_query": "What kind of beard does the man wearing the black hat have?", "candidates": ["Goatee", "Full beard", "Mustache", "Sideburns", "Mutton chops"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "vS8eQ5nVrRQ_1", "video_path": "vS8eQ5nVrRQ.mp4", "subtitle_path": "vS8eQ5nVrRQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1034.33, "view_count": 8701}, {"video_id": "vS8eQ5nVrRQ", "question": "In the video, a man with a scruffy beard is holding a transparent bottle with a green label. What color is this drink?", "question_wo_referring_query": "What color is this drink?", "candidates": ["orange", "white", "olive", "green", "red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "vS8eQ5nVrRQ_2", "video_path": "vS8eQ5nVrRQ.mp4", "subtitle_path": "vS8eQ5nVrRQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1034.33, "view_count": 8701}, {"video_id": "V6P2LkT0UuU", "question": "In a scene with a black car parked, a man in a dark blue suit gets out of the car, and another man in a black suit opens the door for him. 
When the subtitle 'following his release from custody' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["green plant", "black camera", "yellow incense stick", "white baseball cap", "white shoes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "V6P2LkT0UuU_0", "video_path": "V6P2LkT0UuU.mp4", "subtitle_path": "V6P2LkT0UuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1006.47, "view_count": 214416}, {"video_id": "V6P2LkT0UuU", "question": "In a dimly lit room, there are two men with different skin tones standing. One man has dark skin and is bald, while the other man has fair skin and short hair. When the caption 'York killing all of the passengers on' appears, what object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["black hooded jacket", "black-framed glasses", "black tie", "white suit", "blue hat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "V6P2LkT0UuU_1", "video_path": "V6P2LkT0UuU.mp4", "subtitle_path": "V6P2LkT0UuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1006.47, "view_count": 214416}, {"video_id": "V6P2LkT0UuU", "question": "In a dimly lit space, there is a woman sitting in a white long-sleeve top, and next to her is a man with short hair. 
When the subtitle 'begins dating Jared after coming to town' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["White suit", "Purple chair", "Blue jeans", "Red tie", "White flower pot"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "V6P2LkT0UuU_2", "video_path": "V6P2LkT0UuU.mp4", "subtitle_path": "V6P2LkT0UuU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1006.47, "view_count": 214416}, {"video_id": "W_GEfMAGYso", "question": "In a grassland, a man is riding a horse at a gallop. Next to him there is a car with a man and a woman inside, both of whom are looking over at the man on the horse. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["Red hat", "Black collar", "Gold saddle", "Black whip", "Red collar"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "W_GEfMAGYso_0", "video_path": "W_GEfMAGYso.mp4", "subtitle_path": "W_GEfMAGYso_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.0, "view_count": 24501}, {"video_id": "W_GEfMAGYso", "question": "Sunlight shines through the window and spills into the room. There are some green plants on the windowsill. A little girl with long hair is handling items beside the table in front of the window. She just turned her head to look in the direction of the camera. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["white dress", "bamboo backpack", "yellow bottle", "red hair clip", "saddle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "W_GEfMAGYso_1", "video_path": "W_GEfMAGYso.mp4", "subtitle_path": "W_GEfMAGYso_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.0, "view_count": 24501}, {"video_id": "W_GEfMAGYso", "question": "On a grassy field with green oil, a team of soldiers is passing by. A man with white hair is standing to the side, hugging a girl. Walking at the front is a soldier wearing a hat and a coat, and beside him is another soldier holding a horse. What is not present in this scene?", "question_wo_referring_query": "What is not present in this scene?", "candidates": ["Trees", "Armor", "Tank", "Rifle", "Car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "W_GEfMAGYso_2", "video_path": "W_GEfMAGYso.mp4", "subtitle_path": "W_GEfMAGYso_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1024.0, "view_count": 24501}, {"video_id": "NYYxIlCsFr0", "question": "On a cruise ship, there is a black-haired woman on the left side of the screen wearing a round hat and a skirt, holding a glass of wine. On the right side of the screen, there is a black-haired man in a black suit wearing glasses. Behind them, several people in swimsuits are walking, and some are lying on chairs. 
What is the man in the black suit and glasses doing on the right side of the screen?", "question_wo_referring_query": "What is the man in the black suit and glasses doing on the right side of the screen?", "candidates": ["Clinking glasses", "Drinking wine", "Kneeling on the ground", "Dancing", "Looking at his phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "NYYxIlCsFr0_0", "video_path": "NYYxIlCsFr0.mp4", "subtitle_path": "NYYxIlCsFr0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1131.97, "view_count": 38641}, {"video_id": "NYYxIlCsFr0", "question": "In a swimming pool, a man wearing a white shirt is fully submerged in the water except for his head. A woman is standing at the edge of the pool with a wheelchair next to her. What is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["She is pushing a wheelchair.", "She is jumping into the pool.", "She is changing clothes.", "She is throwing a lifebuoy into the pool.", "She is using a pool hook to press down on the person in the pool."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "NYYxIlCsFr0_1", "video_path": "NYYxIlCsFr0.mp4", "subtitle_path": "NYYxIlCsFr0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1131.97, "view_count": 38641}, {"video_id": "NYYxIlCsFr0", "question": "In a room with a black and white checkered floor, there are white ribbons and some small debris floating in the air. On the screen, there is a man wearing glasses and a black suit. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Drinking alcohol", "Kneeling", "Hugging a woman", "Kissing a woman", "Shaking hands with a woman"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "NYYxIlCsFr0_2", "video_path": "NYYxIlCsFr0.mp4", "subtitle_path": "NYYxIlCsFr0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1131.97, "view_count": 38641}, {"video_id": "M2wtT8L04Lw", "question": "A man appears on the screen wearing a red shirt and a black hat. He has short hair and is holding a red ball, smiling with perfectly white teeth. What change occurs to this man\u2019s appearance when the subtitle 'television one day he thought that if' appears?", "question_wo_referring_query": "What change happens to the man wearing a red shirt when the subtitle 'television one day he thought that if' appears?", "candidates": ["The man's black hat changes to green", "The man's shirt changes to blue", "The man's black hat changes to red", "The man's shirt changes to green", "The man puts on a pair of glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "M2wtT8L04Lw_0", "video_path": "M2wtT8L04Lw.mp4", "subtitle_path": "M2wtT8L04Lw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 990.8, "view_count": 1323659}, {"video_id": "M2wtT8L04Lw", "question": "A girl wearing a blue top appears on the screen. The girl's golden short hair covers her forehead, and she is holding a piece of white candy, trying to put it in her mouth. Next to the girl on the right, there is someone else wearing a blue top. 
When the subtitle 'throughout her body in addition Violet' appears, what change occurs on the face of the girl wearing the blue top?", "question_wo_referring_query": "When the subtitle 'throughout her body in addition Violet' appears, what change occurs on the face of the girl wearing the blue top?", "candidates": ["There were flies on the girl's face", "The girl's face turned black", "The girl's face turned green", "The girl's face turned blue", "The girl's face became covered with spots"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "M2wtT8L04Lw_1", "video_path": "M2wtT8L04Lw.mp4", "subtitle_path": "M2wtT8L04Lw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 990.8, "view_count": 1323659}, {"video_id": "M2wtT8L04Lw", "question": "A man and a boy are standing outside a building. The man is wearing a black hat and black sunglasses. There is a rectangular metal nameplate on the wall next to the boy. The window on the wall has white snow accumulated on it, and the blinds are drawn up. What change happens to the man in black sunglasses when the subtitle 'dentist recognizes his son and fearfully' appears?", "question_wo_referring_query": "What change happens to the man in black sunglasses when the subtitle 'dentist recognizes his son and fearfully' appears?", "candidates": ["The man changes into a red hat", "The man puts on a white fur coat", "The man changes into a white hat", "The man puts on a white scarf", "The man takes off his black sunglasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "M2wtT8L04Lw_2", "video_path": "M2wtT8L04Lw.mp4", "subtitle_path": "M2wtT8L04Lw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 990.8, "view_count": 1323659}, {"video_id": "VmNfku3Tej8", "question": "Next to a police car, there is a male police officer. 
Behind him, there are four people, one of whom is holding a camera. A person with a backpack is holding a black stick-like object. What is this police officer doing?", "question_wo_referring_query": "What is this police officer doing?", "candidates": ["driving a car", "making a phone call", "wearing glasses", "drinking milk tea", "firing a gun"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "VmNfku3Tej8_0", "video_path": "VmNfku3Tej8.mp4", "subtitle_path": "VmNfku3Tej8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 591.77, "view_count": 238327}, {"video_id": "VmNfku3Tej8", "question": "On the right, there is a curly-haired woman wearing a black coat. Next to her is a man in a white shirt holding a cup with a black straw. What is the man doing?", "question_wo_referring_query": "What is the man doing?", "candidates": ["Putting on clothes", "Drinking milk tea", "Eating an apple", "Wearing sunglasses", "Eating watermelon"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "VmNfku3Tej8_1", "video_path": "VmNfku3Tej8.mp4", "subtitle_path": "VmNfku3Tej8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 591.77, "view_count": 238327}, {"video_id": "ptq7zIt6Dyo", "question": "There are three people sitting on the code head. On the left side is a woman wearing a headscarf, a white short-sleeved shirt, and yellow pants. Next to her is a black person wearing a plaid jacket, and on the right side is a person with their back to the screen. 
Which of the following items has appeared?", "question_wo_referring_query": "Which of the following items has appeared?", "candidates": ["guitar", "watermelon", "car", "hat", "sunglasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "ptq7zIt6Dyo_0", "video_path": "ptq7zIt6Dyo.mp4", "subtitle_path": "ptq7zIt6Dyo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.46, "view_count": 162545}, {"video_id": "ptq7zIt6Dyo", "question": "In the evening, on the left is a person wearing a headband and a red-black-white outfit, with one hand in a fist position. Opposite him stands a person wearing a colorful hat and a black coat. Which of the following items appears?", "question_wo_referring_query": "Which of the following items appears?", "candidates": ["Sunglasses", "Necklace", "Airplane", "Watermelon", "Guitar"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "ptq7zIt6Dyo_1", "video_path": "ptq7zIt6Dyo.mp4", "subtitle_path": "ptq7zIt6Dyo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.46, "view_count": 162545}, {"video_id": "KGE6y95NhUM", "question": "A woman wearing a white coat and a short skirt is standing on a grass field with a small stream behind her. 
When the subtitle \"She wore a white shirt and black mini skirt, like a restaurant waitress\" appears, what color is the woman's short skirt?", "question_wo_referring_query": "What color is the woman's short skirt?", "candidates": ["red", "white", "yellow", "black", "green"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "KGE6y95NhUM_0", "video_path": "KGE6y95NhUM.mp4", "subtitle_path": "KGE6y95NhUM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 584.65, "view_count": 55681}, {"video_id": "KGE6y95NhUM", "question": "Three men are seated around a dining table. A woman in blue clothing is serving food. On the table, there is some tableware and a bottle of wine. Beside the wine, there is a rectangular box. When the subtitle 'to the dining room' appears, what is the color of the rectangular box?", "question_wo_referring_query": "What is the color of the rectangular box?", "candidates": ["Green", "Yellow", "Blue", "Black", "Red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "KGE6y95NhUM_1", "video_path": "KGE6y95NhUM.mp4", "subtitle_path": "KGE6y95NhUM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 584.65, "view_count": 55681}, {"video_id": "WrBLGuopAuA", "question": "A man in a gray coat is standing in a dispersed crowd, holding a bag in one hand, ready to throw the bag into the river. Among the crowd is a woman with a red hoodie holding a bag. 
When the woman in the red hoodie appears for the first time, what happens?", "question_wo_referring_query": "What happens?", "candidates": ["Explosion", "Car accident", "Fire", "Gunfight", "Earthquake"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "WrBLGuopAuA_0", "video_path": "WrBLGuopAuA.mp4", "subtitle_path": "WrBLGuopAuA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.0, "view_count": 38183}, {"video_id": "WrBLGuopAuA", "question": "A black car with a visible gun is driving, the background is a grass field, and the foreground is a black pole. What happened when the car with the visible gun appeared for the first time?", "question_wo_referring_query": "What happened?", "candidates": ["Explosion", "Earthquake", "Fire", "Gunfight", "Car accident"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "WrBLGuopAuA_1", "video_path": "WrBLGuopAuA.mp4", "subtitle_path": "WrBLGuopAuA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.0, "view_count": 38183}, {"video_id": "m5oFJf6nUdY", "question": "A person wearing white clothes and a white mask is lying on the ground, with one hand placed on the white mask. When the subtitle appears \"to unmask him first and is shocked to discover David. 
At that moment he regains consciousness\", what did this hand do?", "question_wo_referring_query": "What did this hand do?", "candidates": ["Played the piano", "Peeled a banana", "Removed the mask", "Peeled an apple", "Drank water"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "m5oFJf6nUdY_0", "video_path": "m5oFJf6nUdY.mp4", "subtitle_path": "m5oFJf6nUdY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 575.63, "view_count": 97133}, {"video_id": "m5oFJf6nUdY", "question": "A woman wearing a black coat is standing. To her left is a table, and on the table is a knife covered in blood. To her right is a man in a white suit. When the subtitle 'the townspeople jump on him, and Henry's furious girlfriend tries to attack Winnie. She quickly' appears, what does the woman do?", "question_wo_referring_query": "What does the woman do?", "candidates": ["Peeling a banana", "Holding a phone", "Picking up the knife", "Holding a banana", "Playing the piano"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "m5oFJf6nUdY_1", "video_path": "m5oFJf6nUdY.mp4", "subtitle_path": "m5oFJf6nUdY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 575.63, "view_count": 97133}, {"video_id": "R_gI0CNcLiM", "question": "A man wearing a white coat is pushing a food cart, and on the cart, there is a teapot and several tea cups. On the left, there is a police officer with his back turned. 
After the man pushing the cart appears, what does this man do?", "question_wo_referring_query": "What does this man do?", "candidates": ["Pour tea into the tea cup", "Snatch the police officer's hat", "Touch his head", "Touch his face", "Do push-ups"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "R_gI0CNcLiM_0", "video_path": "R_gI0CNcLiM.mp4", "subtitle_path": "R_gI0CNcLiM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 507.94, "view_count": 1288970}, {"video_id": "R_gI0CNcLiM", "question": "After a woman and a man kissed on the sofa, what happened?", "question_wo_referring_query": "What happened?", "candidates": ["fight", "earthquake", "explosion", "car accident", "fire"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "R_gI0CNcLiM_1", "video_path": "R_gI0CNcLiM.mp4", "subtitle_path": "R_gI0CNcLiM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 507.94, "view_count": 1288970}, {"video_id": "1q-iTza3iNo", "question": "A soldier wearing a helmet and holding a gun, on the wall to his left there is a row of flower pots, above the flower pots there is a white railing with some green leaves brushing against it. 
What happened just before the subtitle \"Then, Yousef and Ayana come dangerously close to colliding\" appears?", "question_wo_referring_query": "What happened just before?", "candidates": ["The woman waves to the soldier", "The woman is driving", "The woman is playing the guitar", "The woman shoots at the soldier", "The woman is eating"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "1q-iTza3iNo_0", "video_path": "1q-iTza3iNo.mp4", "subtitle_path": "1q-iTza3iNo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.03, "view_count": 53286}, {"video_id": "1q-iTza3iNo", "question": "A man wearing black clothes is standing in front of a transparent screen, on which there is a white object resembling a beetle. What happens before the subtitle 'the other soldier then says he can't leave his wingman, so he asks permission to fire back.' appears?", "question_wo_referring_query": "What happens before?", "candidates": ["Plane explosion", "Man in black clothes is eating", "Gunfight", "Car crash", "Earthquake"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "1q-iTza3iNo_1", "video_path": "1q-iTza3iNo.mp4", "subtitle_path": "1q-iTza3iNo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.03, "view_count": 53286}, {"video_id": "vKKakkHjzFs", "question": "A man with long hair dressed in a black suit is standing in front of a table. Opposite him is a man in black holding a trigger clip. On the table, there is an open box containing a handgun. 
In which of the following scenes has this long-haired man appeared before?", "question_wo_referring_query": "In which of the following scenes has this long-haired man appeared before?", "candidates": ["In a tank", "On a ship", "In a space station", "On horseback", "In a submarine"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "vKKakkHjzFs_0", "video_path": "vKKakkHjzFs.mp4", "subtitle_path": "vKKakkHjzFs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 523.79, "view_count": 292802}, {"video_id": "vKKakkHjzFs", "question": "A man dressed in black is standing on the street, holding a white bird in his hand and talking with a person carrying a briefcase. In which of the following scenes does this man appear?", "question_wo_referring_query": "In which of the following scenes does this man appear?", "candidates": ["On a rooftop, caressing the white bird", "In a submarine", "On a wheelboat", "On horseback", "In a tank"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "vKKakkHjzFs_1", "video_path": "vKKakkHjzFs.mp4", "subtitle_path": "vKKakkHjzFs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 523.79, "view_count": 292802}, {"video_id": "3YK21UHUtLE", "question": "Five cars are traveling on the highway, and four police cars surround one yellow car. In which subtitle does the yellow car simultaneously appear?", "question_wo_referring_query": "In which subtitle does the yellow car simultaneously appear?", "candidates": ["cars and parts - Ocho San, the same one he helped find an earring, comes to the rescue", "getting the car ready. Together with the police officers. Fo pulls out onto blocked highway", "gets him. The police arrive, and Fo's sister is with them! She is alive, and rescued. 
Amy", "The story begins with a factory where cars and machinery are assembled", "An Interpol officer arrives at the police station. At the meeting it is reported that the intruder"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "3YK21UHUtLE_0", "video_path": "3YK21UHUtLE.mp4", "subtitle_path": "3YK21UHUtLE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 484.13, "view_count": 30762}, {"video_id": "3YK21UHUtLE", "question": "A blond man wearing a yellow and blue racing suit is drinking a beverage while a blonde woman with glasses stands to his left with her hands in her pockets. In which subtitle does this man appear?", "question_wo_referring_query": "In which subtitle does this man appear?", "candidates": ["the police. Cougar slowly pulled up alongside them, and rolled down his window, smirking.", "is impressed with Fo, and offers to race, but Chen Foin refuses, claiming that he is not a racer.", "The story begins with a factory where cars and machinery are assembled", "he nearly falls out. Seeing the family running out into the noise, points the house at", "cars and parts-Ocho San, the same one he helped find an earring, comes to the rescue"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "3YK21UHUtLE_1", "video_path": "3YK21UHUtLE.mp4", "subtitle_path": "3YK21UHUtLE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 484.13, "view_count": 30762}, {"video_id": "6ql2xlebZYA", "question": "A man is riding a bicycle. He is wearing a red hat and carrying a bag. Behind him is a person wearing red pants. 
When this man, who is riding the bicycle, talks with a woman, what difference is there in their clothing?", "question_wo_referring_query": "What difference is there in the clothing when the man riding the bicycle talks with a woman?", "candidates": ["The pattern in the middle of the clothing changes to a motorcycle.", "The main color theme of the clothing changes from red to gray.", "The main color theme of the clothing changes from red to blue.", "The main color theme of the clothing changes from red to black.", "The main color theme of the clothing changes from red to white."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "6ql2xlebZYA_0", "video_path": "6ql2xlebZYA.mp4", "subtitle_path": "6ql2xlebZYA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 558.49, "view_count": 5244}, {"video_id": "6ql2xlebZYA", "question": "A man wearing a checkered jacket is holding a red item, and next to him is a woman with long hair. On the wall behind the woman, there is a black passage. What is different about the man's head while he is holding the box?", "question_wo_referring_query": "What is different about the man's head while he is holding the box?", "candidates": ["He is wearing a hat", "He has a face mask on", "He is wearing an earphone", "He is wearing an earring", "He has sunglasses on"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "6ql2xlebZYA_1", "video_path": "6ql2xlebZYA.mp4", "subtitle_path": "6ql2xlebZYA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 558.49, "view_count": 5244}, {"video_id": "ZEUBSCNQ-A0", "question": "In a car, there is a woman wearing glasses sitting in the driver's seat. 
What is present in the scene at this moment?", "question_wo_referring_query": "What is present in the scene at this moment?", "candidates": ["white earphones", "black sunglasses", "round glasses", "black mobile phone", "notebook computer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "ZEUBSCNQ-A0_0", "video_path": "ZEUBSCNQ-A0.mp4", "subtitle_path": "ZEUBSCNQ-A0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 563.7, "view_count": 495062}, {"video_id": "ZEUBSCNQ-A0", "question": "On a white bed, there is a woman with a wound on her face, looking to the right. What is present in the scene at this moment?", "question_wo_referring_query": ", what is present in the scene at this moment?", "candidates": ["a pair of glasses", "a pair of scissors", "a pair of earrings", "a dagger", "a bald head"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "ZEUBSCNQ-A0_1", "video_path": "ZEUBSCNQ-A0.mp4", "subtitle_path": "ZEUBSCNQ-A0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 563.7, "view_count": 495062}, {"video_id": "MbuXGXmrHFM", "question": "In a scene with a green container in the background, there is a robot with an orange exterior and green goggles. When the caption 'to help him create a new body to replace his dying original one. After gaining his new body,' appears, what is shown on the screen?", "question_wo_referring_query": "In a scene with a green container in the background, there is a robot with an orange exterior and green goggles. When the caption 'to help him create a new body to replace his dying original one. 
After gaining his new body,' appears, what is shown on the screen?", "candidates": ["Cauldron", "Turtle", "Scissors", "A test tube containing red liquid", "Wrench"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "MbuXGXmrHFM_0", "video_path": "MbuXGXmrHFM.mp4", "subtitle_path": "MbuXGXmrHFM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 237.0, "view_count": 51908}, {"video_id": "MbuXGXmrHFM", "question": "Inside a room, there's a door on the left, a many-sided window in the middle, a small wooden table below the window, and in the mirror, there's a black woman in a green jacket and a white man in a blue shirt kissing. What is present on the screen when the subtitle 'Following the battle, Mark and Amber reconcile, while Cecil covers up Nolan's' appears?", "question_wo_referring_query": "Inside a room, there's a door on the left, a many-sided window in the middle, a small wooden table below the window, and in the mirror, there's a black woman in a green jacket and a white man in a blue shirt kissing. What is present on the screen when the subtitle 'Following the battle, Mark and Amber reconcile, while Cecil covers up Nolan's' appears?", "candidates": ["glass cabinet", "black wristwatch", "metal bucket", "necklace", "white mobile phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "MbuXGXmrHFM_1", "video_path": "MbuXGXmrHFM.mp4", "subtitle_path": "MbuXGXmrHFM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 237.0, "view_count": 51908}, {"video_id": "oicBAgHe9p4", "question": "At sea, on the right side, there is a ship. There are three people floating on the sea. A woman in the front is opening her mouth. She is touching a man on his right side, whose mouth is bleeding. The woman at the back is wearing a life jacket. 
When the subtitle 'medical attention, Zach slowly and painfully dies, Michelle drifts away from the group and, finally' appears, what is the color of the life jacket worn by the woman at the back?", "question_wo_referring_query": "At sea, on the right side, there is a ship. There are three people floating on the sea. A woman in the front is opening her mouth. She is touching a man on his right side, whose mouth is bleeding. The woman at the back is wearing a life jacket. When the subtitle 'medical attention, Zach slowly and painfully dies, Michelle drifts away from the group and, finally' appears, what is the color of the life jacket worn by the woman at the back?", "candidates": ["Green", "Black", "Red", "Blue", "White"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "oicBAgHe9p4_0", "video_path": "oicBAgHe9p4.mp4", "subtitle_path": "oicBAgHe9p4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.04, "view_count": 2076016}, {"video_id": "oicBAgHe9p4", "question": "In the scene, there is a white cloth with a crying child on it whose hands are stretched out. When the subtitle 'the fishing boat sails away. Sarah wakes up in the cabin and begins to cry.' appears, what is the color of the child's clothes?", "question_wo_referring_query": "In the scene, there is a white cloth with a crying child on it whose hands are stretched out. When the subtitle 'the fishing boat sails away. Sarah wakes up in the cabin and begins to cry.' 
appears, what is the color of the child's clothes?", "candidates": ["blue", "purple", "white", "red", "black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "oicBAgHe9p4_1", "video_path": "oicBAgHe9p4.mp4", "subtitle_path": "oicBAgHe9p4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.04, "view_count": 2076016}, {"video_id": "OJnow7fdHH8", "question": "On a street, with a wire fence on the left side and a green pillar on the right side, there is a book beside the pillar, two cars are parked to the back left, in front of the camera is a man wearing a hat and a blue T-shirt, and behind him there is a horizontal yellow warning tape. What did the man in the blue T-shirt do when he appeared?", "question_wo_referring_query": "What did the man in the blue T-shirt do when he appeared?", "candidates": ["Kneel down", "Walk forward", "Raise both hands", "Jump up", "Clap"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "OJnow7fdHH8_0", "video_path": "OJnow7fdHH8.mp4", "subtitle_path": "OJnow7fdHH8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 325.42, "view_count": 6040927}, {"video_id": "OJnow7fdHH8", "question": "In a room, there's an object burning with flames in the background, and in the foreground, there's a bald Black person wearing a watch. 
When the object starts burning, what is this bald Black person doing?", "question_wo_referring_query": "When the object starts burning, what is this bald Black person doing?", "candidates": ["Jumping up", "Spreading arms", "Crouching down", "Clapping hands", "Running forward"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "OJnow7fdHH8_1", "video_path": "OJnow7fdHH8.mp4", "subtitle_path": "OJnow7fdHH8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 325.42, "view_count": 6040927}, {"video_id": "Xyj4IoQqf_k", "question": "In a schoolyard, surrounded by seated students, there is a girl in front of the camera with a damp forehead, wearing a black and white school uniform. What is this girl doing when the subtitle 'during Breon The Peculiar occurrences' appears?", "question_wo_referring_query": "What is this girl doing?", "candidates": ["Crying and shedding tears", "Clenching her fists", "Clapping hands", "Touching her lips with her hand", "Standing up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "Xyj4IoQqf_k_0", "video_path": "Xyj4IoQqf_k.mp4", "subtitle_path": "Xyj4IoQqf_k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 504.96, "view_count": 4899}, {"video_id": "Xyj4IoQqf_k", "question": "In a dark room, there is a short-haired woman dressed in black. In front of her, there is an object with green, purple, blue, and white colors, radiating golden light. 
When the subtitle 'fled momentarily to retrieve her sword fled momentarily to retrieve her sword' appears, what is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["Holding the glowing object above her head", "Squatting down", "With a closed mouth and tightly shut eyes", "Jumping up", "Smiling widely"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "Xyj4IoQqf_k_1", "video_path": "Xyj4IoQqf_k.mp4", "subtitle_path": "Xyj4IoQqf_k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 504.96, "view_count": 4899}, {"video_id": "havYEL19KLU", "question": "In a white-background room, on the left side there's a long-haired woman wearing black clothes, in the middle there's a man in green and red clothes with his eyes closed tightly, and on the right side there's a man holding his own shoulders. What did the man on the right do after holding his shoulders?", "question_wo_referring_query": "What did the man on the right do after holding his shoulders?", "candidates": ["Cries", "Covers his face with both hands", "Laughs", "Claps", "Shakes his right hand up and down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "havYEL19KLU_0", "video_path": "havYEL19KLU.mp4", "subtitle_path": "havYEL19KLU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 318.45, "view_count": 42235}, {"video_id": "havYEL19KLU", "question": "In a street area, there is a white building on the left, with two cars parked behind it. In the middle, there is a man wearing a white T-shirt and a black bulletproof vest, pointing forward with a gun. 
What did the man in the middle do right after pointing the gun forward?", "question_wo_referring_query": "What did the man in the middle do right after pointing the gun forward?", "candidates": ["ran forward", "clutched his wound with both hands", "rolled to the left", "shot himself", "lied down"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "havYEL19KLU_1", "video_path": "havYEL19KLU.mp4", "subtitle_path": "havYEL19KLU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 318.45, "view_count": 42235}, {"video_id": "Rs4VoeeqoaE", "question": "In this video, which of the following characters appears first?", "question_wo_referring_query": "In this video, which of the following characters appears first?", "candidates": ["The man with short hair, wearing black clothes and a bandage on his forehead", "The man wearing a black suit and glasses", "The woman with short hair, wearing a green shirt", "The man wearing a white coat, sporting a beard, and wearing deep blue glasses", "The woman with red lipstick, wearing earrings, and long golden hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Rs4VoeeqoaE_0", "video_path": "Rs4VoeeqoaE.mp4", "subtitle_path": "Rs4VoeeqoaE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 297.77, "view_count": 4271}, {"video_id": "Rs4VoeeqoaE", "question": "In this video, which of the following objects appears first?", "question_wo_referring_query": "In this video, which of the following objects appears first?", "candidates": ["White tea cup", "Canadian passport", "Black handbag", "Glass of water", "Horse sculpture"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Rs4VoeeqoaE_1", "video_path": "Rs4VoeeqoaE.mp4", "subtitle_path": "Rs4VoeeqoaE_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 297.77, "view_count": 4271}, {"video_id": "PHXLo-SXrxc", "question": "In a room, there are two doors with curtains at the back, and there is also a window with curtains on the right side. In the middle, there is a man with a beard wearing a blue outfit, sitting on a blue couch. Opposite him sits a black man wearing white clothing. When the subtitles 'designed for an unknown purpose. The simulation then creates new simulations with a different' appear, what does the man in the blue outfit do?", "question_wo_referring_query": "What does the man in the blue outfit do?", "candidates": ["Shaking his head", "Sleeping on the couch", "Clenching his hands", "Walking on the road", "Shaking hands with the black man in white clothing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "PHXLo-SXrxc_0", "video_path": "PHXLo-SXrxc.mp4", "subtitle_path": "PHXLo-SXrxc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 449.05, "view_count": 2577288}, {"video_id": "PHXLo-SXrxc", "question": "Inside a room, the middle window has the blinds down. On the left side, there is a piano. On top of the piano, there is a ceramic pot and a table lamp. In the center, there is a little girl with a ponytail sitting at a table writing. When the subtitle 'brings Sam back from the parallel world. 
Without wasting time, Brendan then takes advantage of\u2026' appears, what does the girl do?", "question_wo_referring_query": "What does the girl do?", "candidates": ["eats fruit", "picks up the ceramic pot", "plays the piano", "pulls up the blinds", "claps hands with a man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "PHXLo-SXrxc_1", "video_path": "PHXLo-SXrxc.mp4", "subtitle_path": "PHXLo-SXrxc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 449.05, "view_count": 2577288}, {"video_id": "6OUZXiaF_FU", "question": "In a dark room, there is a mirror showing the back of a boy wearing blue clothes with short blonde hair. After the subtitle 'Eli steps back and sees from a mirror a girl standing by the window writing with her finger' appears, who is the new character that appears in the mirror?", "question_wo_referring_query": "Who is the new character that appears in the mirror?", "candidates": ["a blonde woman with a wounded face", "a little girl", "a man holding a dagger", "a nurse holding a Bible", "an old man"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6OUZXiaF_FU_0", "video_path": "6OUZXiaF_FU.mp4", "subtitle_path": "6OUZXiaF_FU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.63, "view_count": 905125}, {"video_id": "6OUZXiaF_FU", "question": "In a blue-background infirmary, there is a pair of double doors with windows in the middle, a pane of glass window on the right side, and a table under the window with a lit lamp on it. Inside the infirmary, there are three people in nurse uniforms standing straight at the back. In front of the mirror, there is a blond boy with an injured face. 
When the subtitle 'Rose finally confesses the truth and tells Eli she's tried to have a son but nothing' appears, which new character is introduced?", "question_wo_referring_query": "Which new character is introduced?", "candidates": ["A man holding a crutch", "A blonde woman in a black coat", "A nurse holding a Bible", "A man in a red jacket", "A nun holding a crutch"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6OUZXiaF_FU_1", "video_path": "6OUZXiaF_FU.mp4", "subtitle_path": "6OUZXiaF_FU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.63, "view_count": 905125}, {"video_id": "mS_9f5txVGg", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a young boy with blonde hair walks forward holding a snake, then four children are laughing by the dining table, and lastly, a blonde woman in a pink coat is talking on the phone.", "First, a young boy with blonde hair walks forward holding a snake, then a blonde woman in a pink coat is talking on the phone, and lastly, four children are laughing by the dining table.", "First, four children are laughing by the dining table, then a young boy with blonde hair walks forward holding a snake, and lastly, a blonde woman in a pink coat is talking on the phone.", "First, four children are laughing by the dining table, then a blonde woman in a pink coat is talking on the phone, and lastly, a young boy with blonde hair walks forward holding a snake.", "First, a blonde woman in a pink coat is talking on the phone, then a young boy with blonde hair walks forward holding a snake, and lastly, four children are laughing by the dining table."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "mS_9f5txVGg_0", "video_path": "mS_9f5txVGg.mp4", "subtitle_path": 
"mS_9f5txVGg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 550.23, "view_count": 8968}, {"video_id": "mS_9f5txVGg", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, a woman in a white dress is putting things in the trunk of a car. Then, a blonde woman is lying on a bed, turning off the light. Finally, a woman wearing a ring is lying in a bathtub filled with bubbles.", "First, a woman in a white dress is putting things in the trunk of a car. Then, a woman wearing a ring is lying in a bathtub filled with bubbles. Finally, a blonde woman is lying on a bed, turning off the light.", "First, a woman wearing a ring is lying in a bathtub filled with bubbles. Then, a blonde woman is lying on a bed, turning off the light. Finally, a woman in a white dress is putting things in the trunk of a car.", "First, a blonde woman is lying on a bed, turning off the light. Then, a woman in a white dress is putting things in the trunk of a car. Finally, a woman wearing a ring is lying in a bathtub filled with bubbles.", "First, a blonde woman is lying on a bed, turning off the light. Then, a woman wearing a ring is lying in a bathtub filled with bubbles. Finally, a woman in a white dress is putting things in the trunk of a car."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "mS_9f5txVGg_1", "video_path": "mS_9f5txVGg.mp4", "subtitle_path": "mS_9f5txVGg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 550.23, "view_count": 8968}, {"video_id": "jnM7u39HXZM", "question": "In a laboratory, a blonde woman wearing a red dress is angrily looking to the right. 
In which of the following scenes does this blonde woman in the red dress appear?", "question_wo_referring_query": "In which of the following scenes does this blonde woman in the red dress appear?", "candidates": ["On a beach", "In a gym", "In a hospital room where a middle-aged blonde woman in a white dress is lying down and looking at medical equipment", "In a dining room filled with food", "In a room filled with laptops"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "jnM7u39HXZM_0", "video_path": "jnM7u39HXZM.mp4", "subtitle_path": "jnM7u39HXZM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 573.37, "view_count": 349273}, {"video_id": "jnM7u39HXZM", "question": "In a power plant room shrouded in darkness, a blonde girl wearing white clothing captures a mouse with a transparent container. Which of the following scenes has this mouse appeared in?", "question_wo_referring_query": "Which of the following scenes has this mouse appeared in?", "candidates": ["On a shallow boat", "In an old, abandoned house", "In a square-shaped incinerator", "On a sunny beach", "In a hospital"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "jnM7u39HXZM_1", "video_path": "jnM7u39HXZM.mp4", "subtitle_path": "jnM7u39HXZM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 573.37, "view_count": 349273}, {"video_id": "RiqoHOrQ7GU", "question": "On the roof of a car, there is a man lying down with a troubled expression on his face. He is using his hands to tap his helmet. 
When this man is kneeling against a wall and looking forward, what kind of change does this man undergo?", "question_wo_referring_query": "What kind of change does this man undergo?", "candidates": ["The black jacket becomes the Ant-Man suit", "The silver suit becomes a black T-shirt", "The Superman suit becomes a black suit", "The Ant-Man suit becomes a black T-shirt and pants", "The silver coat becomes a blue and white shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "RiqoHOrQ7GU_0", "video_path": "RiqoHOrQ7GU.mp4", "subtitle_path": "RiqoHOrQ7GU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 577.6, "view_count": 6771}, {"video_id": "RiqoHOrQ7GU", "question": "In a workshop filled with various drawings behind, there is an elderly man with white hair and glasses speaking near the table in the front. On the left stands a man in a short-sleeved shirt, and in the middle sits a woman with short hair wearing a watch, looking towards the elderly man. While this short-haired woman is leaning against the table making a phone call, what change happens to her?", "question_wo_referring_query": ", what change happens to her?", "candidates": ["The green dress turns into a black cloak", "The black vest turns into a sleeveless green dress", "The sleeveless green dress turns into a black suit", "The blue dress turns into a black vest", "The black vest turns into a black suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "RiqoHOrQ7GU_1", "video_path": "RiqoHOrQ7GU.mp4", "subtitle_path": "RiqoHOrQ7GU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 577.6, "view_count": 6771}, {"video_id": "kSNMOgmAgWA", "question": "There is a large circle on the screen with many small round windows inside. 
On the table, there are various experimental instruments, and a man with slightly hunched shoulders. What is this man doing?", "question_wo_referring_query": ", What is this man doing?", "candidates": ["Singing", "Taking notes", "Spraying water", "Drinking tea", "Talking to the person next to him"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "kSNMOgmAgWA_0", "video_path": "kSNMOgmAgWA.mp4", "subtitle_path": "kSNMOgmAgWA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 414.75, "view_count": 2658897}, {"video_id": "kSNMOgmAgWA", "question": "In the distance, there is a vast expanse of white snowy mountains, and below the mountains, there are two houses, a red car, and a circular decoration. Many people are standing inside the circular frame. In the bottom right corner of the screen, there is a man holding a camera and a blonde woman. What is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["Shaking hands with the man", "Laughing out loud on the spot", "Reporting on the scene facing the man's camera", "Adjusting her own clothes", "Turning around and looking behind"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "kSNMOgmAgWA_1", "video_path": "kSNMOgmAgWA.mp4", "subtitle_path": "kSNMOgmAgWA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 414.75, "view_count": 2658897}, {"video_id": "xaf1QP7wjio", "question": "In a courtyard, there are some sparse green vines. A man wearing a white shirt and a grey tie is standing in the center of the courtyard, terrifying the people around him. Some of the frightened people are huddling their bodies, while others are kneeling on the ground bowing their heads. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Flowerpot", "Cell phone", "Wristwatch", "Camera", "Handgun"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "xaf1QP7wjio_0", "video_path": "xaf1QP7wjio.mp4", "subtitle_path": "xaf1QP7wjio_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.47, "view_count": 505934}, {"video_id": "xaf1QP7wjio", "question": "In a room, there is a rectangular window with six panes, an old man lying on a sickbed, and a man and a woman watching over by the bed. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["red skirt", "white chair", "blue table", "green slippers", "white curtain"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "xaf1QP7wjio_1", "video_path": "xaf1QP7wjio.mp4", "subtitle_path": "xaf1QP7wjio_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.47, "view_count": 505934}, {"video_id": "aifIZasgnXY", "question": "There is a colorful painting on the wall in the office. A gray-haired teacher in a blue coat is communicating with a child in red clothes. When the subtitle 'add the numbers' appears, the kid scoffs and says that her method is stupid. The woman manages to suppress...' 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A metal bucket with color pencils", "A computer", "A cellphone", "A red scarf", "A black and white chair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "aifIZasgnXY_0", "video_path": "aifIZasgnXY.mp4", "subtitle_path": "aifIZasgnXY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.53, "view_count": 5133094}, {"video_id": "aifIZasgnXY", "question": "Against a cyan background, a man with short blond hair, dressed in a black suit and wearing a tie, faces the camera. On the table in front of him, there are two cups of water and a seat nameplate written in English. What object appears in the frame when the subtitle 'scoff at the comment and reply that she should admit the possibility of multiple correct answers' appears?", "question_wo_referring_query": "What object appears in the frame?", "candidates": ["bracelet", "mobile phone", "camera", "laptop", "wristwatch"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "aifIZasgnXY_1", "video_path": "aifIZasgnXY.mp4", "subtitle_path": "aifIZasgnXY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 561.53, "view_count": 5133094}, {"video_id": "u2z3Y6qf-fA", "question": "On the training field, on the white surface with the letters MS filled in yellow and framed in blue, several men in sports uniforms are facing a man in a dark green jacket holding a ball. Among them, one man in a white t-shirt with black undershirt has his hands crossed at his waist. 
What color are the English letters engraved on the ball in the video?", "question_wo_referring_query": "What color are the English letters engraved on the ball in the video?", "candidates": ["green", "red", "white", "black", "blue"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "u2z3Y6qf-fA_0", "video_path": "u2z3Y6qf-fA.mp4", "subtitle_path": "u2z3Y6qf-fA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 500.83, "view_count": 9608}, {"video_id": "u2z3Y6qf-fA", "question": "In a room with insufficient lighting, a man is sitting next to a TV watching the screen. Next to him is another man holding a helmet. What is the color of the outer shell of the helmet in the man\u2019s hand?", "question_wo_referring_query": "What is the color of the outer shell of the helmet in the man's hand?", "candidates": ["yellow", "light green", "gray", "black", "olive"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "u2z3Y6qf-fA_1", "video_path": "u2z3Y6qf-fA.mp4", "subtitle_path": "u2z3Y6qf-fA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 500.83, "view_count": 9608}, {"video_id": "MaG1Kn7RM2Q", "question": "In the video, there is a man wearing a bean green coat and a woman with long blonde curls. They both look distressed, and the woman even appears to be crying. 
When the subtitle 'face, but the doctor said that if her face was operated on, it was very likely that Penelope' appears, what color is the innermost clothing the man is wearing?", "question_wo_referring_query": "What color is the innermost clothing the man is wearing?", "candidates": ["yellow", "gray", "green", "blue", "red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "MaG1Kn7RM2Q_0", "video_path": "MaG1Kn7RM2Q.mp4", "subtitle_path": "MaG1Kn7RM2Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 569.61, "view_count": 364531}, {"video_id": "MaG1Kn7RM2Q", "question": "A man wearing olive-colored glasses, a white inner shirt, and a yellow outer coat is talking on the phone. When the subtitle 'the surprising fact that Maxwell's real name is Johnny. Johnny deliberately disguised himself' appears, what is the color of the phone the man is holding?", "question_wo_referring_query": "What is the color of the phone the man is holding?", "candidates": ["green", "red", "blue", "black", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "MaG1Kn7RM2Q_1", "video_path": "MaG1Kn7RM2Q.mp4", "subtitle_path": "MaG1Kn7RM2Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 569.61, "view_count": 364531}, {"video_id": "GDZdeUCNfL4", "question": "In the video, a man and a woman are seen along with two microphones. The man, looking somewhat downcast, is being interviewed silently, while the woman beside him is weeping in distress. 
Who is the person with their hand on the woman's shoulder?", "question_wo_referring_query": "Who is the person with their hand on the woman's shoulder?", "candidates": ["Hector", "Max", "The interviewing reporter", "Shawn", "Victor"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "GDZdeUCNfL4_0", "video_path": "GDZdeUCNfL4.mp4", "subtitle_path": "GDZdeUCNfL4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 578.91, "view_count": 302125}, {"video_id": "GDZdeUCNfL4", "question": "On a rather barren desert, there is a wooden fence rail, with two women and a horse. The saddle is orange. Who is the person leading the horse in the frame?", "question_wo_referring_query": "Who is the person leading the horse in the frame?", "candidates": ["Victor's daughter", "Shawn", "Carlos", "Max's wife", "Shawn's wife"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "GDZdeUCNfL4_1", "video_path": "GDZdeUCNfL4.mp4", "subtitle_path": "GDZdeUCNfL4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 578.91, "view_count": 302125}, {"video_id": "8rNhsftFj0Q", "question": "In a room, there is a white lamp, a woman with braided hair wearing black clothes sitting by the bed, and a baby wrapped in a white blanket. 
What happens when this baby appears for the first time?", "question_wo_referring_query": "What happens?", "candidates": ["The woman is feeding the baby with a green bottle.", "The baby is being held and soothed by the woman.", "The baby is sleeping alone on the bed.", "The woman is changing the baby's diaper.", "The woman is holding another baby."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "8rNhsftFj0Q_0", "video_path": "8rNhsftFj0Q.mp4", "subtitle_path": "8rNhsftFj0Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 575.61, "view_count": 116645}, {"video_id": "8rNhsftFj0Q", "question": "The roadside is lined with bright trees, and a black van drives towards the camera from the right side of the trees. What happened the first time this black van appeared?", "question_wo_referring_query": "What happened the first time this black van appeared?", "candidates": ["It was being driven to the prison.", "The car ran out of gas and stopped.", "The car overturned.", "The car crashed into a tree.", "The car fell into the water."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "8rNhsftFj0Q_1", "video_path": "8rNhsftFj0Q.mp4", "subtitle_path": "8rNhsftFj0Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 575.61, "view_count": 116645}, {"video_id": "LeAfsjsJnYI", "question": "A serious-looking man wearing a backpack is sitting inside a car, gripping the steering wheel tightly. Through the window, a little girl can be faintly seen. 
What happens after the man gets in the car and locks the door?", "question_wo_referring_query": "What happens?", "candidates": ["The little girl gets into the car and sits next to the man.", "The little girl starts crying loudly at the man.", "The man hugs the little girl.", "The little girl gets into the car and sits in the back seat.", "The man drives away immediately."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "LeAfsjsJnYI_0", "video_path": "LeAfsjsJnYI.mp4", "subtitle_path": "LeAfsjsJnYI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 529.23, "view_count": 7878}, {"video_id": "LeAfsjsJnYI", "question": "On a road, with sparsely scattered trees along the side, there is a man wearing an olive green jacket and a car with black letters on its yellow roof. What happened after the man got out of the car?", "question_wo_referring_query": "What happened?", "candidates": ["A little girl was seen standing on the side of the road", "The car exploded, and the man was injured and lying on the ground", "The car caught fire and started smoking", "A deer appeared", "Another car collided head-on"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "LeAfsjsJnYI_1", "video_path": "LeAfsjsJnYI.mp4", "subtitle_path": "LeAfsjsJnYI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 529.23, "view_count": 7878}, {"video_id": "AeOuyGbhcIY", "question": "Which medical product appeared first among the following?", "question_wo_referring_query": "Which medical product appeared first among the following?", "candidates": ["IV tube", "Crutch", "Needle", "Lab mouse", "Blood bag"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "AeOuyGbhcIY_0", "video_path": "AeOuyGbhcIY.mp4", "subtitle_path": "AeOuyGbhcIY_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 572.47, "view_count": 9398}, {"video_id": "AeOuyGbhcIY", "question": "Which of the following characters appears first?", "question_wo_referring_query": "Which of the following characters appears first?", "candidates": ["Joe", "Anna", "Lucien", "Martine", "Michael Morbius"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "AeOuyGbhcIY_1", "video_path": "AeOuyGbhcIY.mp4", "subtitle_path": "AeOuyGbhcIY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.47, "view_count": 9398}, {"video_id": "6IAvHpaRrkA", "question": "Next to several khaki trees are gray stone piles, with a hint of greenery barely visible. A man wearing a patterned blue short-sleeved shirt and another man in a red short-sleeved shirt with a khaki backpack are having a conversation face to face. After the subtitles 'Chozen. They both inspire and reinvigorate; Daniel. To top it all off, Daniel meets Yuna,' what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A police officer pins down a man wearing a gray short-sleeved shirt", "Waves goodbye to the man opposite", "A police officer enters the yard", "Johnny helps Miguel with recovery training", "Turns around and leaves"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "6IAvHpaRrkA_0", "video_path": "6IAvHpaRrkA.mp4", "subtitle_path": "6IAvHpaRrkA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.45, "view_count": 32518}, {"video_id": "6IAvHpaRrkA", "question": "In the pitch-black night, what happened when a man with short yellow hair, wearing a green shirt, and a man with short black hair, wearing an orange short sleeve shirt with a plaid jacket and a hateful expression, stood together after the subtitle 'Karate Tournament, and then makes a call to his old pal Terry Silver 
from Karate Kid:' appeared?", "question_wo_referring_query": "What happened?", "candidates": ["A woman answers a call", "A woman combs her hair", "A man writes a letter", "A child enters the doorway", "A man makes a telephone call"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "6IAvHpaRrkA_1", "video_path": "6IAvHpaRrkA.mp4", "subtitle_path": "6IAvHpaRrkA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.45, "view_count": 32518}, {"video_id": "6aUiAhASJak", "question": "In a pitch-black room, only a small window lets in a bit of light. A man is hunched over on his back making a phone call. From his neck, you can faintly see the white shirt he is wearing. When the subtitle 'he doesn't notice the bullet's shell; with his father's name, Otis, on it' appears, what is the first object to appear?", "question_wo_referring_query": "What is the first object to appear?", "candidates": ["A note", "A blue beetle's gun", "A green helicopter", "A bottle of blue-labeled wine", "A bullet shell with the English word 'OTIS' on it"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6aUiAhASJak_0", "video_path": "6aUiAhASJak.mp4", "subtitle_path": "6aUiAhASJak_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 592.55, "view_count": 98194}, {"video_id": "6aUiAhASJak", "question": "In the upper left corner of the screen, there are two fingers holding a bullet. 
After the subtitle \u201cTo his shock, the clerk's name is also Jeff\u201d appears, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A woman wearing a polka dot top", "A little girl wearing a pink dress", "A man wearing black inside and yellow outside", "A little boy wearing denim overalls", "A man wearing a white shirt inside and a black jacket outside"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "6aUiAhASJak_1", "video_path": "6aUiAhASJak.mp4", "subtitle_path": "6aUiAhASJak_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 592.55, "view_count": 98194}, {"video_id": "eBRia8iUJPI", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a blonde child is seen crying loudly while looking at an injured man lying on the ground, then a woman in a swimsuit is seen holding a drink, followed by a car explosion, and finally, a woman is seen holding an umbrella and dancing in the rain.", "First, a car explosion happens, then a woman in a blue swimsuit is seen holding a drink by the pool, followed by a blonde child crying loudly while looking at an injured man lying on the ground, and finally, a woman is seen holding an umbrella and making a phone call in the rain.", "First, a car explosion happens, then a woman is seen holding an umbrella and making a phone call in the rain, followed by a woman in a swimsuit holding a drink, and lastly, a blonde child is seen laughing while looking at an injured man lying on the ground.", "First, a car explosion happens, then a woman in a business suit is seen holding a drink, followed by a blonde child crying loudly while looking at an injured man lying on the ground, and finally, a woman is seen holding an umbrella and making a phone call in 
the rain.", "First, a woman in a blue swimsuit is seen holding a drink by the pool, then a car explosion happens, followed by a blonde child crying loudly while looking at an injured man lying on the ground, and finally, a woman is seen holding an umbrella and making a phone call in the rain."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "eBRia8iUJPI_0", "video_path": "eBRia8iUJPI.mp4", "subtitle_path": "eBRia8iUJPI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 334.97, "view_count": 20514}, {"video_id": "eBRia8iUJPI", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a woman cries and smokes, then a man plays a musical instrument, then a hand covers a bleeding wound, and finally, a man lies down and falls into the water.", "First, a man plays a musical instrument, then a woman cries and smokes, then a hand covers a bleeding wound, and finally, a man stands up and falls into the water.", "First, a man plays a musical instrument, then a woman cries and smokes, then a hand covers a bleeding wound, and finally, a man lies down and falls into the water.", "First, a man plays a musical instrument, then a woman laughs and smokes, then a hand covers a bleeding wound, and finally, a man lies down and falls into the water.", "First, a man drinks alcohol loudly, then a woman cries and smokes, then a hand covers a bleeding wound, and finally, a man lies down and falls into the water."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "eBRia8iUJPI_1", "video_path": "eBRia8iUJPI.mp4", "subtitle_path": "eBRia8iUJPI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 334.97, "view_count": 20514}, {"video_id": "Sj07YvzQ-o8", "question": "In a room with 
insufficient lighting, there are two places illuminated by white light and a yellowish-orange light emitted from a desk lamp. On a blackboard, some English words are written. There is also a man dressed in a white shirt, black coat, wearing a black tie and black glasses. Where has this man appeared?", "question_wo_referring_query": "Where has this man appeared below?", "candidates": ["The man is facing the camera; a woman in black clothes is holding a gun at him from behind.", "On a grassland, being held at gunpoint by a woman.", "In a conference room where many people are attending a meeting.", "In an academic hall where a lecture is being given.", "At a horse race track."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "Sj07YvzQ-o8_0", "video_path": "Sj07YvzQ-o8.mp4", "subtitle_path": "Sj07YvzQ-o8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.64, "view_count": 46852}, {"video_id": "Sj07YvzQ-o8", "question": "There is a short-haired man in a red outfit and a short-haired, slightly curly-haired woman also in a red outfit. Where has this woman appeared in the following places?", "question_wo_referring_query": "Where has this woman appeared in the following places?", "candidates": ["In a hot spring", "Beside a bed in a bedroom", "On a wide road", "In a corridor", "In a kitchen full of kitchenware"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "Sj07YvzQ-o8_1", "video_path": "Sj07YvzQ-o8.mp4", "subtitle_path": "Sj07YvzQ-o8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 376.64, "view_count": 46852}, {"video_id": "SHvA5Q7DX_o", "question": "In a laboratory, there is a man wearing glasses with dark and shiny hair, dressed in a white coat with blue and red stripes. He is holding a piece of paper. 
With which subtitles is this man shown together?", "question_wo_referring_query": "With which subtitles is this man shown together?", "candidates": ["The opening scene introduces us to a guy named Yoon Seok, who serves as a lab worker", "The opening scene introduces us to a guy named Yoon Seok Woo, who serves as a lab worker in the", "The opening scene introduces us to a guy named Yoon Seok Woo, who serves as a lab worker in", "The opening scene introduces us to a guy named Yoon Seok Woo, who serves as a lab worker", "The opening scene introduces us to a guy named Yoon Seok Woo, who serves as a lab worker in the"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "SHvA5Q7DX_o_0", "video_path": "SHvA5Q7DX_o.mp4", "subtitle_path": "SHvA5Q7DX_o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 536.8, "view_count": 100593}, {"video_id": "SHvA5Q7DX_o", "question": "Outside of a room, surrounded by sparse vegetation, a man in a white shirt and light blue pants stands on a long wooden plank. He grabs the hand of a woman with long black hair wearing a skirt with one hand, and with the other hand he holds her neck. In which subtitles did this woman appear together?", "question_wo_referring_query": ", in which subtitles did this woman appear together?", "candidates": ["he accidentally drops his phone on the road. ; ; Yoo Min is unable to reach him", "he accidentally drops his phone on the road. ; ; Due to this", "he accidentally drops his phone on the road. ; ; Due to this, Yoo Min is unable to reach him", "he accidentally drops his phone on the road. 
; Due to this, Yoo Min is unable to reach him"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "SHvA5Q7DX_o_1", "video_path": "SHvA5Q7DX_o.mp4", "subtitle_path": "SHvA5Q7DX_o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 536.8, "view_count": 100593}, {"video_id": "HsZNPsHB5Jw", "question": "In a large hall, there are two large calligraphy characters pasted on the wall, with a painting in the middle. Below the painting, some tribute items are placed. There are also some wooden chairs in the hall. Beside a round wooden table, there stands a man in gray clothes and a woman in white and red clothes. When this woman holding a wooden box and a man entered the door together, what was different about her clothing?", "question_wo_referring_query": "When the woman holding a wooden box and a man entered the door together, what was different about her clothing?", "candidates": ["The main color of the clothing changed from blue and pink to black and gray", "The main color of the clothing changed from white and pink to sky blue", "The main color of the clothing changed from gray to black and gray", "The main color of the clothing changed from white and pink to blue and gray", "The main color of the clothing changed from white and pink to black and gray"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "HsZNPsHB5Jw_0", "video_path": "HsZNPsHB5Jw.mp4", "subtitle_path": "HsZNPsHB5Jw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.17, "view_count": 4034}, {"video_id": "HsZNPsHB5Jw", "question": "There is a white vase on the wooden chair indoors. The vase holds white flowers in full bloom. In the frame, there are two people having a conversation face-to-face. 
One of them is a woman dressed in an outfit featuring shades of purple-red and pink, with a gold necklace around her neck and vibrant floral accessories in her hair. When this woman talks to an elderly woman by the door, what changes in her hair accessories?", "question_wo_referring_query": "When this woman talks to an elderly woman by the door, what changes in her hair accessories?", "candidates": ["The vibrant floral accessory changes to an amethyst necklace", "The vibrant floral accessory changes to a pearl hairpin", "The vibrant floral accessory changes to a grey hat", "The vibrant floral accessory changes to a black hat", "The vibrant floral accessory changes to loose hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "HsZNPsHB5Jw_1", "video_path": "HsZNPsHB5Jw.mp4", "subtitle_path": "HsZNPsHB5Jw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 562.17, "view_count": 4034}, {"video_id": "RKFaYL0RUPs", "question": "By the side of a river, there is a man with short hair, wearing blue clothes. Next to him, there is another man with short hair, wearing glasses and dressed in olive green clothes. On the other side of the river, there is a mountain and some houses. 
What is the person wearing blue clothes doing?", "question_wo_referring_query": "What is the person wearing blue clothes doing?", "candidates": ["Drinking water", "Running", "Changing clothes", "Taking off glasses", "Making a phone call"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "RKFaYL0RUPs_0", "video_path": "RKFaYL0RUPs.mp4", "subtitle_path": "RKFaYL0RUPs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.48, "view_count": 64030}, {"video_id": "RKFaYL0RUPs", "question": "Under a blue sky, there is a person with short hair, wearing sunglasses and dressed in white clothes, holding something in their hand. What is this man doing?", "question_wo_referring_query": "Under a blue sky, there is a person with short hair, wearing sunglasses and dressed in white clothes, holding something in their hand. What is this man doing?", "candidates": ["Dancing", "Reading a book", "Running", "Drinking water", "Making a phone call"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "RKFaYL0RUPs_1", "video_path": "RKFaYL0RUPs.mp4", "subtitle_path": "RKFaYL0RUPs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.48, "view_count": 64030}, {"video_id": "QdwCB925pFY", "question": "In a room with yellow walls, various pictures are pasted on the wall. There stands a man in the room with short white hair, wearing a black suit and a tie. Behind him, there is a cabinet with various books and a globe on it. Next to the cabinet, there is also a flag and two people, a woman with black hair and a man with short hair wearing a red dress. 
Which of the following items appeared in this room?", "question_wo_referring_query": "Which of the following items appeared in this room?", "candidates": ["Water cup", "Map", "Computer", "Snacks", "Mobile phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "QdwCB925pFY_0", "video_path": "QdwCB925pFY.mp4", "subtitle_path": "QdwCB925pFY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.81, "view_count": 12894}, {"video_id": "QdwCB925pFY", "question": "Outside a house, there is a person with short hair wearing red clothes standing. Behind him, there are also other people in different clothes. To the left and behind the person in red clothes, there is a green tree. Which of the following items appears in this scene?", "question_wo_referring_query": "Which of the following items appears in this scene?", "candidates": ["gold chain", "mobile phone", "red chain", "sunglasses", "green chain"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "QdwCB925pFY_1", "video_path": "QdwCB925pFY.mp4", "subtitle_path": "QdwCB925pFY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.81, "view_count": 12894}, {"video_id": "-07drzOov-M", "question": "In the scene, there is a man with short white hair, a mustache, and wearing a red outfit. He is wearing an accessory on his chest, and there are also some white flowers behind him. When the subtitle mentions, 'confronts the former President. Snow claims that he wasn't responsible for the bombings,' what is the color of the accessory on the chest of the man in red?", "question_wo_referring_query": "In the scene, there is a man with short white hair, a mustache, and wearing a red outfit. He is wearing an accessory on his chest, and there are also some white flowers behind him. When the subtitle mentions, 'confronts the former President. 
Snow claims that he wasn't responsible for the bombings,' what is the color of the accessory on the chest of the man in red?", "candidates": ["green", "blue", "black", "white", "red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "-07drzOov-M_0", "video_path": "-07drzOov-M.mp4", "subtitle_path": "-07drzOov-M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 555.97, "view_count": 58570}, {"video_id": "-07drzOov-M", "question": "On the screen, there is a woman on the left wearing a long-sleeved black dress, and on the right is a man with short hair wearing a long-sleeved black outfit, holding a gun in his hand. There is flying debris behind them, and some people are standing behind them. When the subtitles mention 'results in many innocent refugees being killed. The Capitol then drops bombs on the crowd,' what is the color of the gun in the man's hand?", "question_wo_referring_query": "What is the color of the gun in the man's hand?", "candidates": ["purple", "red", "blue", "yellow", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "-07drzOov-M_1", "video_path": "-07drzOov-M.mp4", "subtitle_path": "-07drzOov-M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 555.97, "view_count": 58570}, {"video_id": "O1iiLYHCQDE", "question": "There are two people sitting in the video, one of whom has their hair tied back and is wearing a long-sleeved white shirt. She is resting her head in her hands. Next to her, there is someone comforting her. 
Who is the person comforting her?", "question_wo_referring_query": "Who is the person comforting her?", "candidates": ["William", "Sofia", "Arthur", "Stian", "Ronny"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "O1iiLYHCQDE_0", "video_path": "O1iiLYHCQDE.mp4", "subtitle_path": "O1iiLYHCQDE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.27, "view_count": 6498}, {"video_id": "O1iiLYHCQDE", "question": "There is a monitor in the scene displaying a terrain map. A person is pointing at the monitor, and in front of them, there is another person looking at the screen. Who is the person pointing at the screen?", "question_wo_referring_query": "Who is the person pointing at the screen?", "candidates": ["Staff member", "William", "Arthur", "Stian", "Sofia"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "O1iiLYHCQDE_1", "video_path": "O1iiLYHCQDE.mp4", "subtitle_path": "O1iiLYHCQDE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 559.27, "view_count": 6498}, {"video_id": "hGCLb_bcTLU", "question": "In front of a person, there is a glowing object placed on the left side and a red phone on the right side. 
When the phone appears for the first time, what does this person do?", "question_wo_referring_query": "When the phone appears for the first time, what does this person do?", "candidates": ["drink water", "make a phone call", "run", "stand up", "eat something"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "hGCLb_bcTLU_0", "video_path": "hGCLb_bcTLU.mp4", "subtitle_path": "hGCLb_bcTLU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 481.15, "view_count": 2387228}, {"video_id": "hGCLb_bcTLU", "question": "A person is standing on the roof of a building, holding a signal gun and firing it into the sky. What does this person do after firing the signal gun for the first time?", "question_wo_referring_query": "After this person fires the signal gun for the first time, what does this person do?", "candidates": ["Quietly looks at the sky and then sits down", "Drinks water", "Quietly looks at the sky and then squats down", "Eats something", "Quietly looks at the sky and then observes the surroundings"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "hGCLb_bcTLU_1", "video_path": "hGCLb_bcTLU.mp4", "subtitle_path": "hGCLb_bcTLU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 481.15, "view_count": 2387228}, {"video_id": "Qs9emiQEJNA", "question": "In the screen, there is a blonde woman wearing a pink dress. Behind her, there is a white wall, and in front of her, there is a person standing. When the subtitle mentions 'reaching there, Lars discovers that Sturla has threatened to kill her if she aborts the child,' what event occurs?", "question_wo_referring_query": "In the screen, there is a blonde woman wearing a pink dress. Behind her, there is a white wall and in front of her, there is a person standing. 
When the subtitle mentions 'reaching there, Lars discovers that Sturla has threatened to kill her if she aborts the child,' what event occurs?", "candidates": ["Leaves the room", "A man hugs the woman in the pink dress", "Running", "Writing", "Reading a book"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "Qs9emiQEJNA_0", "video_path": "Qs9emiQEJNA.mp4", "subtitle_path": "Qs9emiQEJNA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 585.29, "view_count": 28018}, {"video_id": "Qs9emiQEJNA", "question": "Four people are standing beside a yellow table. On the far left, there is a bald man wearing a blue shirt, and on the far right, there is a short-haired woman in a blue shirt. In the middle, there is a long-haired woman in a black shirt and a short-haired man in a black shirt. Behind them, there are square objects pasted on the white wall. What happened when the subtitle mentions 'they had with Ben without permission. Harald also claims that the London Police are demanding that'?", "question_wo_referring_query": "What happened?", "candidates": ["Squatting", "The man in the blue shirt sat down", "Singing", "The woman in the blue shirt sat down", "Running"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "Qs9emiQEJNA_1", "video_path": "Qs9emiQEJNA.mp4", "subtitle_path": "Qs9emiQEJNA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 585.29, "view_count": 28018}, {"video_id": "jXEbN2ZJXAY", "question": "In the evening, a man with short hair and short sleeves is holding something in his hand and reading aloud, with a phone recording him in front and a bright lamp behind him. 
What happened after he finished reading?", "question_wo_referring_query": "What happened after he finished reading?", "candidates": ["Someone danced", "Someone sang", "The crowd dispersed", "Someone applauded him", "Someone ran"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "jXEbN2ZJXAY_0", "video_path": "jXEbN2ZJXAY.mp4", "subtitle_path": "jXEbN2ZJXAY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.96, "view_count": 318880}, {"video_id": "jXEbN2ZJXAY", "question": "A man wearing glasses and dressed in black is holding a silver tool to examine his own tongue. What did he do after he finished the examination?", "question_wo_referring_query": "What did he do after he finished the examination?", "candidates": ["Kneel down", "Go see a doctor", "Jump up", "Go for a run", "Sing a song"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "jXEbN2ZJXAY_1", "video_path": "jXEbN2ZJXAY.mp4", "subtitle_path": "jXEbN2ZJXAY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.96, "view_count": 318880}, {"video_id": "Xd4Du1q6B6Q", "question": "In a forest, a person wearing a white short-sleeved shirt is standing and looking at the trees in front. Pausing at this frame, which of the following items appears first?", "question_wo_referring_query": "Pausing at this frame, which of the following items appears first?", "candidates": ["table", "car", "hat", "a wood cabin", "photo"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Xd4Du1q6B6Q_0", "video_path": "Xd4Du1q6B6Q.mp4", "subtitle_path": "Xd4Du1q6B6Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 486.6, "view_count": 4844}, {"video_id": "Xd4Du1q6B6Q", "question": "In a forest, a long-haired woman is recording something. 
There is yellow grass at her feet, and there are trees in front of and behind her. Stopping at this frame, which of the following objects appears first?", "question_wo_referring_query": "Stopping at this frame, which of the following objects appears first?", "candidates": ["Small wooden horse sculpture", "Police car", "Book", "Binoculars", "Airplane"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Xd4Du1q6B6Q_1", "video_path": "Xd4Du1q6B6Q.mp4", "subtitle_path": "Xd4Du1q6B6Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 486.6, "view_count": 4844}, {"video_id": "xmfzMtH26Qg", "question": "Before the subtitle 'carter varum's girlfriend for the last three years but she really is tda she shows brain' appears, what action did the woman in white clothing do? There is a man's face partially shown on both sides of the car which is blue and filled with yellow boxes. The man on the left rear side is wearing a red shirt, and the one in the middle is a woman in white clothing.", "question_wo_referring_query": "What action did the woman in white clothing do?", "candidates": ["Put hands on the waist", "Extended both arms", "Waved hand to the mirror", "Stuck tongue out to the mirror", "Punched forwards"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "xmfzMtH26Qg_0", "video_path": "xmfzMtH26Qg.mp4", "subtitle_path": "xmfzMtH26Qg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 384.05, "view_count": 10291}, {"video_id": "xmfzMtH26Qg", "question": "At sea, there is a blue and white car in front of the camera. Inside the car is a man wearing red clothes and a bald black man. To the right in the distance, there is a forest. 
Before the subtitle 'lands on the boat carter walks up to the car with his shotgun in hand and brian pulls out his gun' appears, what event happened?", "question_wo_referring_query": "At sea, there is a blue and white car in front of the camera. Inside the car is a man wearing red clothes and a bald black man. To the right in the distance, there is a forest. Before the subtitle 'lands on the boat carter walks up to the car with his shotgun in hand and brian pulls out his gun' appears, what event happened?", "candidates": ["A car fell into the sea", "An airplane flew into the sky", "A car flew into the sky", "A car exploded", "A helicopter flew into the sky"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "xmfzMtH26Qg_1", "video_path": "xmfzMtH26Qg.mp4", "subtitle_path": "xmfzMtH26Qg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 384.05, "view_count": 10291}, {"video_id": "UUcdqrkYaBU", "question": "In a room filled with adults and children, the upper left side features a white cabinet decorated with Christmas ornaments, while the upper right corner shows a ground window letting light in and a Christmas tree. 
Which objects appear after the subtitle \u201cinside the house danny gives the albanians a bag of monopoly money as a joke the head albanian\u201d appears?", "question_wo_referring_query": "Which objects appear afterwards?", "candidates": ["Red cloth bag filled with money", "Santa Claus", "Table lamp", "Apple", "Christmas tree"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "UUcdqrkYaBU_0", "video_path": "UUcdqrkYaBU.mp4", "subtitle_path": "UUcdqrkYaBU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 520.39, "view_count": 601169}, {"video_id": "UUcdqrkYaBU", "question": "In a dark room, there is an open kitchen with various things and a small flame on the upper part of the screen. On the lower left, there is a table lamp. On the right wall, there is a square photo hanging. In the subtitle, 'wouldn't commit these crimes jimmy meets mike with his family at a lake house in north shore' appears. After these appear, what objects can be seen?", "question_wo_referring_query": "What new objects appear?", "candidates": ["Yellow car", "Table lamp", "Pillow", "Square picture frame", "White car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "UUcdqrkYaBU_1", "video_path": "UUcdqrkYaBU.mp4", "subtitle_path": "UUcdqrkYaBU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 520.39, "view_count": 601169}, {"video_id": "0Etb5BJFyE0", "question": "In front of the camera with two white lights on and a blurred background, there is a woman with blonde hair and wearing earrings in the middle. 
In which of the following scenes does this woman appear?", "question_wo_referring_query": "In which of the following scenes does this woman appear?", "candidates": ["At the crossroads under the rain", "On the mountain top under the sunlight", "On the plaza covered with fallen leaves", "On the blue sea", "In the busy kitchen"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "0Etb5BJFyE0_0", "video_path": "0Etb5BJFyE0.mp4", "subtitle_path": "0Etb5BJFyE0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 275.21, "view_count": 18121}, {"video_id": "0Etb5BJFyE0", "question": "In front of the blurry background, a woman with yellow long hair wearing black-rimmed glasses is gesturing a kiss to the left of the camera. Behind the woman, there is a blurred wine-red table and wall. In which of the following scenes has this woman appeared?", "question_wo_referring_query": "In which of the following scenes has the woman in front of the camera appeared?", "candidates": ["A busy kitchen", "A crossroad in the rain", "A mountain top under the sun", "A space with diamond-shaped grid windows", "A swimming pool on a high floor"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "0Etb5BJFyE0_1", "video_path": "0Etb5BJFyE0.mp4", "subtitle_path": "0Etb5BJFyE0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 275.21, "view_count": 18121}, {"video_id": "98Ip9meBD0g", "question": "During the night, there is a man in front of the mirror who is driving a car and wearing earphones. To the upper left, there is a bright white light. 
Which of the following subtitles has appeared along with the man in front of the mirror?", "question_wo_referring_query": "Which of the following subtitles has appeared along with the man in front of the mirror?", "candidates": ["wolves and other animals in the zoo", "Suddenly, Clay calls. Wheelman explains why he took the money and fled. Clay is baffled.", "are so much that I want to know, if you are interested.", "become increasingly popular in daily life of modern society", "the better. It's true that I will know more about the world and understand why"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "98Ip9meBD0g_0", "video_path": "98Ip9meBD0g.mp4", "subtitle_path": "98Ip9meBD0g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 560.36, "view_count": 20847}, {"video_id": "98Ip9meBD0g", "question": "In a garage with white ceiling tiles and white columns, inside a white car in front of the mirror, a man is sitting on the right side with his head tilted, and on the left side, a woman with long blonde hair is sitting. In the car, the woman with long blonde hair appeared with which of the following subtitles?", "question_wo_referring_query": "In the car, the woman with long blonde hair appeared with which of the following subtitles?", "candidates": ["there is my desk and chair. They're in front of the window.", "execution by the West End leader. 
Wheelman then hands over the money and escapes with Katie", "introduce us different kinds of knowledge.", "a student of No.3 middle school", "they will afford a telephone with numbers without four and others which is bad in their mind."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "98Ip9meBD0g_1", "video_path": "98Ip9meBD0g.mp4", "subtitle_path": "98Ip9meBD0g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 560.36, "view_count": 20847}, {"video_id": "ZaQC9s573Mk", "question": "On the screen, the right side shows a burgundy wooden door, while the other side has an illuminated doorway, with a man in a white shirt standing there. When this man appears in a room filled with various types of guns, what clothes did he change into?", "question_wo_referring_query": "What clothes did he change into?", "candidates": ["Changed into a black coat", "Changed into a red coat", "Changed into a black T-shirt", "Changed into a blue sweater", "Changed into a white T-shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "ZaQC9s573Mk_0", "video_path": "ZaQC9s573Mk.mp4", "subtitle_path": "ZaQC9s573Mk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 583.03, "view_count": 7539}, {"video_id": "ZaQC9s573Mk", "question": "At the gathering, on the left of the camera is a woman with her hair tied up wearing blue clothes, on the right is an old man wearing a blue shirt and black coat, and in the middle is a man wearing a white shirt, black coat, and a brown tie. 
When the man in the middle appeared at the half-open door, what change happened to him?", "question_wo_referring_query": "What change happened to him?", "candidates": ["He put on a white T-shirt", "He took off his clothes", "He put on a blue sweater", "He put on a red jacket", "He put on a black T-shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "ZaQC9s573Mk_1", "video_path": "ZaQC9s573Mk.mp4", "subtitle_path": "ZaQC9s573Mk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 583.03, "view_count": 7539}, {"video_id": "Wb--4VmKpUI", "question": "In the middle of a field with yellow grass surrounded by green grass, on the right side, there's a boy wearing blue clothes, and on the left side, there's a man wearing black pants and a gray-black checkered shirt. What is the man on the grass doing?", "question_wo_referring_query": "What is the man on the grass doing?", "candidates": ["Half-kneeling on the ground", "Lying on the ground", "Doing push-ups", "Picking up a wallet from the ground", "Helping the boy up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "Wb--4VmKpUI_0", "video_path": "Wb--4VmKpUI.mp4", "subtitle_path": "Wb--4VmKpUI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.36, "view_count": 18920}, {"video_id": "Wb--4VmKpUI", "question": "On a grassland full of green grass, there is a man on the left sitting on the ground and wearing khaki pants, and on the right, there is a man standing and wearing clothes with gray and white polka dots. 
What is the man sitting on the ground doing?", "question_wo_referring_query": "What is the man sitting on the ground doing?", "candidates": ["Reaching out to the man standing", "Throwing a stone at the man standing", "Touching his own hair", "Hugging his own legs with both hands", "Covering his stomach with both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "Wb--4VmKpUI_1", "video_path": "Wb--4VmKpUI.mp4", "subtitle_path": "Wb--4VmKpUI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.36, "view_count": 18920}, {"video_id": "jzPAWfBxLD4", "question": "In front of a yellow rocky wall, three people dressed in black outfits and equipped with swords at their waists are attempting to climb. Which items appear in the frame below?", "question_wo_referring_query": "Which items appear in the frame below?", "candidates": ["Red hat", "Black hat", "White clothing", "Bow and arrow", "Rifle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "jzPAWfBxLD4_0", "video_path": "jzPAWfBxLD4.mp4", "subtitle_path": "jzPAWfBxLD4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.07, "view_count": 124299}, {"video_id": "jzPAWfBxLD4", "question": "Near the edge of a stage piled with fallen leaves, there are broken wooden railings. Below, there are several people wearing black ancient costumes scattered around the stage steps. 
Which of the following objects have appeared in the scene?", "question_wo_referring_query": "Which of the following objects have appeared in the scene?", "candidates": ["Red hat", "Black axe", "Long sword", "Bow and arrow", "White clothes"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "jzPAWfBxLD4_1", "video_path": "jzPAWfBxLD4.mp4", "subtitle_path": "jzPAWfBxLD4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 574.07, "view_count": 124299}, {"video_id": "DUuwT1uWAmo", "question": "In the room, on the right side, there is a man in white clothes who is leaning his head and reading a book, along with a pile of books. In the upper middle part, there is a window letting in light, and below the window, there are also piled-up books. When the subtitle mentions 'He calls himself Hikikomori.', what items are present in the room?", "question_wo_referring_query": "What items are present in the room?", "candidates": ["white table", "black chair", "yellow chair", "black sofa", "notebook computer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "DUuwT1uWAmo_0", "video_path": "DUuwT1uWAmo.mp4", "subtitle_path": "DUuwT1uWAmo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 505.07, "view_count": 66358}, {"video_id": "DUuwT1uWAmo", "question": "In front of the wine-red wall of the building by the street, a man wearing a white top and orange shorts is running on the road on the right side, and on the left side is a black-painted alley. 
When the subtitle 'The huge metropolis of Tokyo is completely empty and deserted' appears, what items are shown on the screen?", "question_wo_referring_query": "What items are shown on the screen?", "candidates": ["Trash cans", "Black coats", "Bicycles", "Motorcycles", "Cars"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "DUuwT1uWAmo_1", "video_path": "DUuwT1uWAmo.mp4", "subtitle_path": "DUuwT1uWAmo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 505.07, "view_count": 66358}, {"video_id": "XsMma8JDn6g", "question": "The ground is bathed in sunlight. In front of the camera is a man wearing an olive green jacket and a blue inner shirt, talking on the phone. Behind him are several large trucks. What color are the pants of the man in front of the camera?", "question_wo_referring_query": "What color are the pants of the man in front of the camera?", "candidates": ["Green", "Yellow", "Black", "White", "Blue"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "XsMma8JDn6g_0", "video_path": "XsMma8JDn6g.mp4", "subtitle_path": "XsMma8JDn6g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.93, "view_count": 343215}, {"video_id": "XsMma8JDn6g", "question": "In the video frame, the left side is a black partition with glass windows, the center is still a pitch-black space, and the right side is a man wearing sunglasses and holding a phone, talking to the lens. 
What color is the hat the man in the frame is wearing?", "question_wo_referring_query": "What color is the hat the man in the frame is wearing?", "candidates": ["red", "purple", "white", "gray", "black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "XsMma8JDn6g_1", "video_path": "XsMma8JDn6g.mp4", "subtitle_path": "XsMma8JDn6g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.93, "view_count": 343215}, {"video_id": "nsqjR62BgTc", "question": "In front of several people wearing black or grey-white clothes, a woman wearing a grey-black hat and a necklace is looking at the camera. At the moment when the subtitle 'first time were aghast, because it turned out that Albert had a stutter and could not speak fluently.' appears, what material is the necklace worn by the woman looking at the camera made of?", "question_wo_referring_query": "In front of several people wearing black or grey-white clothes, a woman wearing a grey-black hat and a necklace is looking at the camera. At the moment when the subtitle 'first time were aghast, because it turned out that Albert had a stutter and could not speak fluently.' appears, what material is the necklace worn by the woman looking at the camera made of?", "candidates": ["pearls", "iron", "copper", "gold", "wood beads"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "nsqjR62BgTc_0", "video_path": "nsqjR62BgTc.mp4", "subtitle_path": "nsqjR62BgTc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 549.57, "view_count": 14041}, {"video_id": "nsqjR62BgTc", "question": "In a room illuminated with soft light, there is a fireplace burning on the left and a glass window on the right allowing light to pass through. Several people are sitting by the table near the window, engaged in conversation. 
When the subtitle mentions 'speech went smoothly and all the people who hear it returned to being optimistic and confident,' what shape is the table in the room?", "question_wo_referring_query": "In a room illuminated with soft light, there is a fireplace burning on the left and a glass window on the right allowing light to pass through. Several people are sitting by the table near the window, engaged in conversation. When the subtitle mentions 'speech went smoothly and all the people who hear it returned to being optimistic and confident,' what shape is the table in the room?", "candidates": ["Square", "Rectangle", "Triangle", "Hexagon", "Circle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "nsqjR62BgTc_1", "video_path": "nsqjR62BgTc.mp4", "subtitle_path": "nsqjR62BgTc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 549.57, "view_count": 14041}, {"video_id": "iR6BvWCzoFE", "question": "A person is lying on a white sofa, and another person wearing a camouflage uniform and a person wearing a black coat are pointing guns at him. Who is the person lying on the sofa?", "question_wo_referring_query": "Who is the person lying on the sofa?", "candidates": ["Jaime", "Rafa", "Felix", "KIki", "Pablo Escobar"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "iR6BvWCzoFE_0", "video_path": "iR6BvWCzoFE.mp4", "subtitle_path": "iR6BvWCzoFE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 345.45, "view_count": 69108}, {"video_id": "iR6BvWCzoFE", "question": "A person wearing sunglasses is sitting on a chair. There is a pillar behind the chair, and behind the pillar, there are two soldiers holding guns. 
Who is the person wearing sunglasses?", "question_wo_referring_query": "Who is the person wearing sunglasses?", "candidates": ["Pablo Escobar", "Rafa", "Felix", "Jaime", "KIki"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "iR6BvWCzoFE_1", "video_path": "iR6BvWCzoFE.mp4", "subtitle_path": "iR6BvWCzoFE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 345.45, "view_count": 69108}, {"video_id": "D5_wgH77uKE", "question": "A man wearing a tie is holding a glass of wine. Behind him is a bar counter with some glassware on it. To the bottom left of the bar counter is a round stool. What did the man do the first time he appeared holding the wine?", "question_wo_referring_query": "What did the man holding the wine do the first time he appeared?", "candidates": ["Eating", "Playing the guitar", "Drinking wine", "Peeling an apple", "Clapping"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "D5_wgH77uKE_0", "video_path": "D5_wgH77uKE.mp4", "subtitle_path": "D5_wgH77uKE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 533.23, "view_count": 422658}, {"video_id": "D5_wgH77uKE", "question": "A man wearing a black coat and a hat is holding a gun, with some wooden dummies behind him. 
What did the man holding the gun do the first time he appeared?", "question_wo_referring_query": "What did the man holding the gun do the first time he appeared?", "candidates": ["Fire a cannon", "Clap", "Shoot the gun", "Peel an apple", "Drink alcohol"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "D5_wgH77uKE_1", "video_path": "D5_wgH77uKE.mp4", "subtitle_path": "D5_wgH77uKE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 533.23, "view_count": 422658}, {"video_id": "v6Vo1cmnTF4", "question": "A woman in an orange dress is walking on the road, followed by a man. The man has one hand raised, standing beside a tree. When the subtitle 'an engineer in the irrigation department and had retired. Since then, he had to use a wheelchair.' appears, what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["playing guitar", "clapping", "peeling an apple", "picking his ear with his finger", "drinking water"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "v6Vo1cmnTF4_0", "video_path": "v6Vo1cmnTF4.mp4", "subtitle_path": "v6Vo1cmnTF4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 514.23, "view_count": 5024}, {"video_id": "v6Vo1cmnTF4", "question": "A man wearing a shirt is sitting on a sofa, pulling up his shirt with both hands. There is a tape recorder on the cabinet next to him. When the subtitle appears saying \"They continued dating each other until they married in 1992. 
She accompanied Obama until now\", what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Putting on clothes", "Using his fingers to pick his ear", "Wearing a hat", "Unbuttoning his shirt", "Peeling an apple"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "v6Vo1cmnTF4_1", "video_path": "v6Vo1cmnTF4.mp4", "subtitle_path": "v6Vo1cmnTF4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 514.23, "view_count": 5024}, {"video_id": "OSZbjqyhYuY", "question": "What is the first weapon that appears in the video?", "question_wo_referring_query": "What is the first weapon that appears in the video?", "candidates": ["Knife", "Whip", "Rifle", "Axe", "Sniper rifle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "OSZbjqyhYuY_0", "video_path": "OSZbjqyhYuY.mp4", "subtitle_path": "OSZbjqyhYuY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 540.13, "view_count": 449083}, {"video_id": "OSZbjqyhYuY", "question": "A person wearing a black coat is lying on a red sofa. He has been shot, and there are several iron pillars behind him. Behind the pillars, there's a chair and a sofa. Who was the first person to be killed?", "question_wo_referring_query": "Who was the first person to be killed?", "candidates": ["Leo", "Bobby", "Demyan", "Aleksey", "Mikhail"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "OSZbjqyhYuY_1", "video_path": "OSZbjqyhYuY.mp4", "subtitle_path": "OSZbjqyhYuY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 540.13, "view_count": 449083}, {"video_id": "8vRkvH4xx5s", "question": "A man with blonde hair in a suit is looking at a document in his hands. There's a lit lamp behind him. After the subtitle appears \"way to an event nearby. 
While Sinclair looks at documents on Richard's desk\", what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Playing the guitar", "Talking", "Cutting a watermelon", "Dancing", "Peeling an apple"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "8vRkvH4xx5s_0", "video_path": "8vRkvH4xx5s.mp4", "subtitle_path": "8vRkvH4xx5s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.14, "view_count": 38619}, {"video_id": "8vRkvH4xx5s", "question": "A man in a suit is standing on a construction site with a dirt pile behind him. There are two yellow excavators on either side of the dirt pile. When the subtitle 'construction sites until he returns home and becomes a drooling idiot.' appears, what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["flying a plane", "cutting a watermelon", "playing the guitar", "chasing someone", "dancing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "8vRkvH4xx5s_1", "video_path": "8vRkvH4xx5s.mp4", "subtitle_path": "8vRkvH4xx5s_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.14, "view_count": 38619}, {"video_id": "TCpdqSzhANc", "question": "A man in a suit is lying on the green ground with his eyes closed. 
After the subtitle \"Neo's mind has become trapped in a transition zone between the Matrix and the machine world\" appears, what is the first weapon to appear?", "question_wo_referring_query": "What is the first weapon to appear?", "candidates": ["Sword", "Axe", "RPG", "Gun", "Bow and arrow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "TCpdqSzhANc_0", "video_path": "TCpdqSzhANc.mp4", "subtitle_path": "TCpdqSzhANc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 537.0, "view_count": 264450}, {"video_id": "TCpdqSzhANc", "question": "In the game screen, there is a person wearing black clothes holding a gun pointed at a person lying on the ground. To his right is a fence, and to his left is a wall. After the subtitle 'was murdered by a mysterious assassin, who was actually a humanoid mass of flies. Yes seriously.' appears, who is the first real person to appear?", "question_wo_referring_query": "Who is the first real person to appear?", "candidates": ["Baby", "Man", "Woman", "Little boy", "Little girl"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "TCpdqSzhANc_1", "video_path": "TCpdqSzhANc.mp4", "subtitle_path": "TCpdqSzhANc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 537.0, "view_count": 264450}, {"video_id": "YA_i3mR3XgA", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, two people are standing inside a yellow circle fighting, then a monster emerges from the ground, several soldiers point their guns at the monster, and finally, on a flying device, there are several monkeys and humans, with the foremost monkey holding an axe.", "First, a monster emerges from the ground, several soldiers point their guns at the monster, then on a 
flying device, there are several monkeys and humans, with the foremost monkey holding an axe, and finally, a monster emerges from the ground, several soldiers point their guns at the monster again.", "First, on a flying device, there are several monkeys and humans, with the foremost monkey holding an axe, then a monster emerges from the ground, several soldiers point their guns at the monster, and finally, two people are standing inside a yellow circle fighting.", "First, two people are standing inside a yellow circle fighting, then on a flying device, there are several monkeys and humans, with the foremost monkey holding an axe, and finally, a monster emerges from the ground, several soldiers point their guns at the monster.", "First, on a flying device, there are several monkeys and humans, with the foremost monkey holding an axe, then two people are standing inside a yellow circle fighting, and finally, a monster emerges from the ground, several soldiers point their guns at the monster."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "YA_i3mR3XgA_0", "video_path": "YA_i3mR3XgA.mp4", "subtitle_path": "YA_i3mR3XgA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 300.53, "view_count": 54010}, {"video_id": "YA_i3mR3XgA", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a person is holding a helmet in one hand and a piece of fabric torn from clothing in the other, with grass behind her. Then, a person is sitting on a chair with a woman holding a blue bottle behind him. Finally, a man dressed in blue clothes is standing with his hands on his hips.", "First, a person is sitting on a chair with a woman holding a blue bottle behind him. 
Then, another person is holding a helmet in one hand and a piece of fabric torn from clothing in the other, with grass behind her. Finally, a man dressed in blue clothes is standing with his hands on his hips.", "First, a person is holding a helmet in one hand and a piece of fabric torn from clothing in the other, with grass behind her. Then, a man dressed in blue clothes is standing with his hands on his hips. Finally, a person is sitting on a chair with a woman holding a blue bottle behind him.", "First, a man dressed in blue clothes is standing with his hands on his hips. Then, a person is holding a helmet in one hand and a piece of fabric torn from clothing in the other, with grass behind her. Finally, a person is sitting on a chair with a woman holding a blue bottle behind him.", "First, a person is sitting on a chair with a woman holding a blue bottle behind him. Then, a man dressed in blue clothes is standing with his hands on his hips. Finally, another person is holding a helmet in one hand and a piece of fabric torn from clothing in the other, with grass behind her."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "YA_i3mR3XgA_1", "video_path": "YA_i3mR3XgA.mp4", "subtitle_path": "YA_i3mR3XgA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 300.53, "view_count": 54010}, {"video_id": "PF56BZp771k", "question": "Inside a room surrounded by iron shelves filled with various food items, a man dressed in formal attire with a beard is holding two white items in his hand. There is a red backpack beside him. 
In which of the following locations has this red backpack appeared?", "question_wo_referring_query": "In which of the following locations has this red backpack appeared?", "candidates": ["An outdoor cafeteria", "Inside a moving vehicle", "A library", "A gym", "A sunny beach"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "PF56BZp771k_0", "video_path": "PF56BZp771k.mp4", "subtitle_path": "PF56BZp771k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 444.88, "view_count": 61408}, {"video_id": "PF56BZp771k", "question": "In a room surrounded by bright screens, there are two black men running towards a red sports car. In which of the following scenes does this red sports car appear?", "question_wo_referring_query": "In which of the following scenes does this red sports car appear?", "candidates": ["Next to an open-air hotel", "Underground parking garage", "Beachside", "Highway", "Outside an amusement park"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "PF56BZp771k_1", "video_path": "PF56BZp771k.mp4", "subtitle_path": "PF56BZp771k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 444.88, "view_count": 61408}, {"video_id": "pGxT0qtxYxg", "question": "On a grassy field, a woman with an injured forehead is lying on the ground. 
Which subtitles did this injured woman appear with?", "question_wo_referring_query": "Which subtitles did this injured woman appear with?", "candidates": ["everything that has been going on between them, and is now going to blackmail her.", "they start making out", "Pam works for the mayor, and has a son named Sam.", "Tom then gives Billie the gun, and asks her to shoot the girl.", "He arrives at Pam's forest lodge, hoping to find peace away from his busy city life."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "pGxT0qtxYxg_0", "video_path": "pGxT0qtxYxg.mp4", "subtitle_path": "pGxT0qtxYxg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 598.43, "view_count": 206726}, {"video_id": "pGxT0qtxYxg", "question": "In a factory, the bottom left corner is piled with miscellaneous items, in front is a column of iron rods, and a person wearing a watch is holding a knife. Which subtitles has this knife appeared with?", "question_wo_referring_query": "Which subtitles has this knife appeared with?", "candidates": ["Tom then gives Billie the gun, and asks her to shoot the girl.", "leaves to meet Shannon at the decided spot.", "This is when Tom gets there and sees her, she tries to outrun him but he catches her.", "She tells him what she was doing in the forest, and that her bag contains a lot of proof but it fell somewhere.", "She gets close to Shannon, and Shannon stabs her in the foot with a knife and runs away."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "pGxT0qtxYxg_1", "video_path": "pGxT0qtxYxg.mp4", "subtitle_path": "pGxT0qtxYxg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 598.43, "view_count": 206726}, {"video_id": "Ey-4jL0mLfE", "question": "Which of the following sequences of scenes in the video is correct?", "question_wo_referring_query": "Which of the following 
sequences of scenes in the video is correct?", "candidates": ["First, a woman with short hair sitting indoors makes an incredulous expression with a man in black standing behind her. Then, a woman in a car shields her eyes from flashing lights outside, followed by a glowing figure floating between brightly lit high-rise buildings at night.", "First, at night, a glowing figure floats between brightly lit high-rise buildings. Then, a woman in a car shields her eyes from flashing lights outside. Lastly, a woman with short hair sitting indoors makes an incredulous expression with a man in black standing behind her.", "First, a woman in a car shields her eyes from flashing lights outside. Then, at night, a glowing figure floats between brightly lit high-rise buildings. Lastly, a woman with short hair sitting indoors makes an incredulous expression with a man in black standing behind her.", "First, a woman with short hair sitting indoors makes an incredulous expression with a man in black standing behind her. Then, at night, a glowing figure floats between brightly lit high-rise buildings. Lastly, a woman in a car shields her eyes from flashing lights outside.", "First, a woman in a car shields her eyes from flashing lights outside. Then, a woman with short hair sitting indoors makes an incredulous expression with a man in black standing behind her. Lastly, at night, a glowing figure floats between brightly lit high-rise buildings."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "Ey-4jL0mLfE_0", "video_path": "Ey-4jL0mLfE.mp4", "subtitle_path": "Ey-4jL0mLfE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 353, "duration": 14.0, "view_count": 19417}, {"video_id": "erP0Xk_OvxY", "question": "In a cityscape scene, there are moving cars and many tall buildings. Next to the buildings, there are some green trees. 
After the subtitle 'Today, I'm going to recap a 2017 action thriller film called: Instant Death.' appears, who is the first person to appear on the screen?", "question_wo_referring_query": "Who is the first person to appear on the screen?", "candidates": ["A man wearing a blue dress shirt", "A woman with short black hair", "A man wearing a black hoodie", "A man driving a white car on the road", "A man wearing a dark green long sleeve shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "erP0Xk_OvxY_0", "video_path": "erP0Xk_OvxY.mp4", "subtitle_path": "erP0Xk_OvxY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 11, "duration": 8.01, "view_count": 42459}, {"video_id": "clIQHKMlw10", "question": "In a room, a curtain hanging on a long rod is drawn to one side. A man with short hair, wearing a black tank top, is standing in front of the curtain. He extends his index fingers with his eyes closed tightly and walks slowly forward. After the subtitle 'stole it from his estranged father. Refusing to know more, the disturbed\u2026' appears, what does the man do?", "question_wo_referring_query": ", what does the man do?", "candidates": ["He sat down on the floor", "He fell down", "He lifted the table", "He held onto the table and stood up from the chair", "He held onto the table and sat on the chair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "clIQHKMlw10_0", "video_path": "clIQHKMlw10.mp4", "subtitle_path": "clIQHKMlw10_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 230, "duration": 10.98, "view_count": 150424}, {"video_id": "wMoA8TWHupA", "question": "On the left side of the screen, two hands are holding a notebook. 
After this scene appears where a hand is also holding the notebook on the right side, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear after this scene?", "candidates": ["A man wearing a black coat, white shirt, and a black tie", "A long-haired woman wearing a white trench coat", "A man wearing a grey suit and glasses, carrying a black backpack", "A man wearing a grey coat, white shirt, and a red tie", "A man wearing a grey suit with a striped tie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "wMoA8TWHupA_0", "video_path": "wMoA8TWHupA.mp4", "subtitle_path": "wMoA8TWHupA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 267, "duration": 14.02, "view_count": 147700}, {"video_id": "@movie.explained6-7254236386633600257", "question": "In the distance within the sports ground, there's a national flag, with a blue mat beneath it. A long-haired woman wearing a blue long-sleeve shirt and gray pants appears on the right side of the screen. What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Vaulting", "Long jumping", "Pole vaulting", "Walking with a stride", "High jumping"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7254236386633600257_0", "video_path": "7254236386633600257.mp4", "subtitle_path": "7254236386633600257_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 52.35, "view_count": 572550}, {"video_id": "@movie.explained6-7254237484949753090", "question": "A man and a woman are eating barbecue in a barbecue restaurant. The man is wearing white clothes and is using tongs to pass grilled meat to the woman. When the subtitle 'He did not care and continued to eat the roast meat with his girlfriend.' 
appears, which of the following items was present?", "question_wo_referring_query": "Which of the following items was present?", "candidates": ["Earphones", "Mask", "Book Bag", "Hat", "Glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7254237484949753090_0", "video_path": "7254237484949753090.mp4", "subtitle_path": "7254237484949753090_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 52.05, "view_count": 22200}, {"video_id": "@movie.explained6-7269964316974075138", "question": "In a scene with a blurry bridge in the background, there is a woman wearing a blue and black checkered shirt. Below the woman, there are white letters saying 'However the wily monster took on his mother's appearance.' What hairstyle does the woman with the blue and black checkered shirt have in the scene?", "question_wo_referring_query": "What hairstyle does the woman wearing the blue and black checkered shirt have in the scene?", "candidates": ["short blonde hair", "white short hair", "black short hair", "shoulder-length curly hair", "long blonde hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "@movie.explained6-7269964316974075138_0", "video_path": "7269964316974075138.mp4", "subtitle_path": "7269964316974075138_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 55.6, "view_count": 6065}, {"video_id": "@movie.explained6-7274915783518489857", "question": "In a room with tiled floors, there is a gray refrigerator placed against the wall. Inside the refrigerator, there is some food, and there is a person who has locked themselves inside the gray refrigerator. 
Who has locked themselves inside the gray refrigerator?", "question_wo_referring_query": "Who has locked themselves inside the gray refrigerator?", "candidates": ["The girl wearing a black dress", "The two-year-old girl wearing a dark blue top", "The girl wearing a white top", "The girl wearing a green T-shirt", "The two-year-old girl wearing a purple top"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie.explained6-7274915783518489857_0", "video_path": "7274915783518489857.mp4", "subtitle_path": "7274915783518489857_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.97, "view_count": 5742}, {"video_id": "@movie.explained6-7265406370826947847", "question": "In a narrow and dim room, there is a wooden shelf on the wall, with a white towel hanging from it. A man dressed in gray-black clothes is in the room. What was the man in the gray-black clothes doing the first time he appeared?", "question_wo_referring_query": "What was the man in gray-black clothes doing the first time he appeared?", "candidates": ["Lying on the ground, vomiting white foam", "Crawling on the ground", "Dancing in the narrow room", "Happily spinning around in circles", "Singing in the narrow room"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "@movie.explained6-7265406370826947847_0", "video_path": "7265406370826947847.mp4", "subtitle_path": "7265406370826947847_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.56, "view_count": 17700}, {"video_id": "@movie.explained6-7253134838876753153", "question": "In a blurry scene, there is a green field, and next to the green field is a dark green bench. On the dark green bench sits a woman wearing a light-colored, tight-fitting short-sleeve shirt. When the subtitle 'although she disregards him completely. 
However, when he leaves, the girl tells her friends that one day he will become her husband.' appears, what is the woman in the light-colored, tight-fitting short-sleeve shirt doing?", "question_wo_referring_query": "What is the woman wearing a light-colored, tight-fitting short-sleeve shirt doing?", "candidates": ["Sitting on the dark green bench tying shoelaces", "Lying on the dark green bench reading a book", "Sitting on the dark green bench playing with a phone", "Sitting on the dark green bench reading a book", "Lying on the dark green bench sleeping"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7253134838876753153_0", "video_path": "7253134838876753153.mp4", "subtitle_path": "7253134838876753153_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 51.53, "view_count": 2786}, {"video_id": "@movie.explained6-7268563086108265729", "question": "In a scene that says 'someone is trapped inside,' there is a woman wearing a black and white striped short-sleeve shirt. The woman is holding a phone and making a call facing a camera. 
What did the woman wearing the black and white striped short-sleeve shirt do after making the phone call while facing the camera?", "question_wo_referring_query": "What did the woman wearing the black and white striped short-sleeve shirt do after making the phone call facing the camera?", "candidates": ["Continued making the call facing the camera", "Started making the call with her back to the camera", "Sat on a white chair while making the call", "Sat on a black chair while making the call", "Crouched down while making the call"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7268563086108265729_0", "video_path": "7268563086108265729.mp4", "subtitle_path": "7268563086108265729_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 38.32, "view_count": 6392}, {"video_id": "@movie.explained6-7275317741823872257", "question": "In a swimming pool with blue tiles, a woman wearing a blue swimming cap and yellow swimsuit kisses a man with cropped hair on the cheek. After the narrator says, 'girl stealthily gave the boy a quick kiss and swiftly jumped into the water. 
However,' what happens to the girl in the yellow swimsuit?", "question_wo_referring_query": "What happens to the girl in the yellow swimsuit?", "candidates": ["She was taken away by a man wearing a blue shirt", "She swims forward desperately in the swimming pool", "She gets knocked unconscious", "Her belly gradually becomes larger in the swimming pool", "She struggles to call for help in the swimming pool"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7275317741823872257_0", "video_path": "7275317741823872257.mp4", "subtitle_path": "7275317741823872257_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.09, "view_count": 3469}, {"video_id": "@movie.explained6-7253143092533005569", "question": "In a bathroom with a round mirror, a woman with her hair down, wearing sexy lingerie, stands with her back to the mirror. After the narration \"That night, Pippa dresses and gets herself ready to do that same scene from their neighbor,\" who is the first person to appear on screen?", "question_wo_referring_query": "Who is the first person to appear on screen?", "candidates": ["A woman wearing a white shirt", "A woman with her hair tied up", "A woman in a black dress", "A woman wearing a white tank top", "A man with black curly hair wearing glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7253143092533005569_0", "video_path": "7253143092533005569.mp4", "subtitle_path": "7253143092533005569_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 57.93, "view_count": 1244}, {"video_id": "@movie.explained6-7263344488729365767", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a person wearing a striped long-sleeve is holding a 
strange dagger; then, a person wearing a white short-sleeve tries to pull the dagger out of the sheath; finally, a person wearing a striped long-sleeve picks up a wooden box.", "First, a person wearing a striped long-sleeve picks up a wooden box; then, a person wearing a white short-sleeve tries to pull the dagger out of the sheath; finally, a person wearing a striped long-sleeve is holding a strange dagger.", "First, a person wearing a striped long-sleeve picks up a wooden box; then, a person wearing a striped long-sleeve is holding a strange dagger; finally, a person wearing a white short-sleeve tries to pull the dagger out of the sheath.", "First, a person wearing a striped long-sleeve is holding a strange dagger; then, a person wearing a striped long-sleeve picks up a wooden box; finally, a person wearing a white short-sleeve tries to pull the dagger out of the sheath.", "First, a person wearing a white short-sleeve tries to pull the dagger out of the sheath; then, a person wearing a striped long-sleeve is holding a strange dagger; finally, a person wearing a striped long-sleeve picks up a wooden box."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "@movie.explained6-7263344488729365767_0", "video_path": "7263344488729365767.mp4", "subtitle_path": "7263344488729365767_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 10939}, {"video_id": "@movie.explained6-7252659149891177730", "question": "In front of a green wall, a man wearing a black leather jacket is walking while drinking something. At the bottom of the screen, there are white words 'Donny Berger, present day'. 
In which of the following scenes has the man in the black leather jacket appeared?", "question_wo_referring_query": "In which of the following scenes has the man in the black leather jacket appeared?", "candidates": ["In a room with a red chair and a green table", "In a rainy park", "On a sunny beach", "In a chilly bar", "In a cozy coffee shop"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7252659149891177730_0", "video_path": "7252659149891177730.mp4", "subtitle_path": "7252659149891177730_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 55.73, "view_count": 388}, {"video_id": "@movie.explained6-7270745607856229633", "question": "In a rectangular pit, there is a man with short black hair who is not wearing any clothes, and he is throwing a piece of leather upwards. When the man with short black hair appears on a busy street, what change has he undergone?", "question_wo_referring_query": "What change has the man with short black hair undergone?", "candidates": ["From not wearing a shirt to wearing a red long-sleeve shirt", "From not wearing a shirt to wearing a grey long-sleeve shirt", "From not wearing a shirt to wearing a purple long-sleeve shirt", "From not wearing a shirt to wearing a blue long-sleeve shirt", "From not wearing a shirt to wearing a green long-sleeve shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7270745607856229633_0", "video_path": "7270745607856229633.mp4", "subtitle_path": "7270745607856229633_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 39.1, "view_count": 41509}, {"video_id": "@movie.explained6-7272336488895991042", "question": "In the scene with the white text 'Stauffer hit the seal', there is a man wearing a yellow beret and holding a gun. 
When the subtitle 'It's the only way Stover can avoid capture.' appears, what changes happen to the man wearing the yellow beret?", "question_wo_referring_query": "What changes happen to the man wearing the yellow beret?", "candidates": ["Takes off the yellow beret", "Puts on a white coat", "Wears an additional white scarf", "Changes to a white beret", "Puts on a pair of glasses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7272336488895991042_0", "video_path": "7272336488895991042.mp4", "subtitle_path": "7272336488895991042_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 57.0, "view_count": 3794}, {"video_id": "@movie.explained6-7268943860573785345", "question": "Standing next to a wall covered with many pictures, there is a man with short black hair wearing a green coat. At the bottom of the screen, there is also white text saying: 'Our concern is what happens When Lucy turns 8.' What is the man in the green coat doing while standing next to the wall?", "question_wo_referring_query": "What is the man in the green coat doing while standing next to the wall?", "candidates": ["Tightly wringing his hands", "Covering his ears with both hands", "Squatting on the ground and holding his head in pain, crying", "Lowering his head", "Covering his face with both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7268943860573785345_0", "video_path": "7268943860573785345.mp4", "subtitle_path": "7268943860573785345_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 49.57, "view_count": 1453}, {"video_id": "@movie.explained6-7269387531609771266", "question": "In a nearly flooded bathroom, there is a koi fish swimming around. 
Next to the koi fish, there is an open door, and in front of the koi fish, there is a sliding door with a person standing inside. What object exists in this nearly flooded bathroom?", "question_wo_referring_query": "What object exists in this nearly flooded bathroom?", "candidates": ["green sink", "black sink", "white sink", "purple sink", "blue sink"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7269387531609771266_0", "video_path": "7269387531609771266.mp4", "subtitle_path": "7269387531609771266_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 49.53, "view_count": 3314}, {"video_id": "@movie.explained6-7272206374883314945", "question": "Next to a wooden-colored table, there is a person holding a white bowl and a small brush. The white bowl has a small crack on it. When the subtitle \"With the arbitrary alteration of data, cracks started appearing in everything around them.\" appears, what objects are present on the wooden-colored table?", "question_wo_referring_query": "What objects are present on the wooden-colored table?", "candidates": ["black bottle", "purple bottle", "blue bottle", "green bottle", "red bottle"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7272206374883314945_0", "video_path": "7272206374883314945.mp4", "subtitle_path": "7272206374883314945_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 45.65, "view_count": 3449}, {"video_id": "@movie.explained6-7273302310527257858", "question": "In a scene with the white text 'He didn't suspect anything when he saw how' displayed, there's a man wearing a black hat and sporting a beard. 
When the subtitle 'suspect anything when he saw how professional Jerry was and proceeded to come next to Anna' appears, what kind of beard is the man with the black hat sporting?", "question_wo_referring_query": "What kind of beard is the man wearing a black hat sporting?", "candidates": ["Full beard", "Goatee", "Stubble", "Circle beard", "Mutton chops"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "@movie.explained6-7273302310527257858_0", "video_path": "7273302310527257858.mp4", "subtitle_path": "7273302310527257858_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 57.22, "view_count": 3906}, {"video_id": "@movie.explained6-7269963817436695809", "question": "In a room with a yellow desk, there are many books on the desk. A person is sitting in front of the desk and is cutting their own hand with a knife. Who is the person sitting in front of the desk and cutting their own hand with a knife?", "question_wo_referring_query": "Who is the person sitting in front of the desk and cutting their own hand with a knife?", "candidates": ["The boy wearing a blue hoodie", "The boy wearing a red hoodie", "The boy wearing a white T-shirt", "The boy wearing a black-gray sweater with short black hair", "The boy wearing a black T-shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie.explained6-7269963817436695809_0", "video_path": "7269963817436695809.mp4", "subtitle_path": "7269963817436695809_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.53, "view_count": 7357}, {"video_id": "@movie.explained6-7253793316922346754", "question": "In a scene with white text that says 'because she had confidence in her newfound powers' and features some withered trees and branches, there is an orange tiger with black stripes next to the withered tree. 
What was the tiger doing when it first appeared?", "question_wo_referring_query": "What was the tiger doing when it first appeared?", "candidates": ["Rolling on the ground", "Hunting a deer", "Crouching in the bushes", "Walking forward slowly with steady steps", "Walking around a tree in circles"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "@movie.explained6-7253793316922346754_0", "video_path": "7253793316922346754.mp4", "subtitle_path": "7253793316922346754_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 55.42, "view_count": 1196}, {"video_id": "@movie.explained6-7269857899449552129", "question": "In a scene with white text 'they prepare for bed conceding to an early night's rest', a man and a woman are lying on a bed covered with a grey blanket. When the subtitle 'They prepare for bed, conceding to an early night's rest' appears, what is the man lying on the bed doing?", "question_wo_referring_query": "What is the man lying on the bed doing?", "candidates": ["Cupping his face in his hands", "Kissing the woman beside him", "Reaching out to turn off the bedside lamp", "Taking off his clothes", "Hugging the woman beside him"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7269857899449552129_0", "video_path": "7269857899449552129.mp4", "subtitle_path": "7269857899449552129_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 43.67, "view_count": 11591}, {"video_id": "@movie.explained6-7259234897154346242", "question": "In a scene with white text saying 'Ever since she and Em had been together', there's a woman with shoulder-length hair and a man wearing a blue shirt looking at each other. 
What did the woman with shoulder-length hair do after looking at the man?", "question_wo_referring_query": "What did the woman with shoulder-length hair do after looking at the man?", "candidates": ["Hugged the man wearing a blue shirt", "Held her head and cried", "Slapped the man in the face", "Gently touched the face of the man wearing a blue shirt with her ringed hand", "Held his face with both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7259234897154346242_0", "video_path": "7259234897154346242.mp4", "subtitle_path": "7259234897154346242_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.9, "view_count": 3700}, {"video_id": "@movie.explained6-7252662570014772482", "question": "In a scene with yellow text 'the one who helped him when he collapsed', there is a man with short black hair. Right after the narration says, 'Dae-ho re-enters the lucid dream, and this time, he learns that the man selling the balloons was wearing gloves, and that he was also,' what is the first object that appears on the screen?", "question_wo_referring_query": "What is the first object that appears on the screen?", "candidates": ["Green tree", "Dark blue coat", "Camel-colored sofa", "Black shoes", "Dark camel-colored handbag"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7252662570014772482_0", "video_path": "7252662570014772482.mp4", "subtitle_path": "7252662570014772482_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 52.57, "view_count": 521}, {"video_id": "@movie.explained6-7253799362508246274", "question": "In a scene with white text 'In order to get the key to officer's safe,' there is a woman with Liu Hai hairstyle and another woman with her hair tied up. 
Which of the following subtitles appeared together with the woman with her hair tied up?", "question_wo_referring_query": "Which of the following subtitles appeared together with the woman with her hair tied up?", "candidates": ["\"The man quickly walks up and deliberately collides with the\"", "\"While he was picking up the documents, the top agent used\"", "\"interpreter. While Officer is dancing the man quietly takes out the key to the safe from\"", "\"Officer's shirt pocket, quickly printing down the\"", "\"After going back, the man took out the map and drew the\""], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "@movie.explained6-7253799362508246274_0", "video_path": "7253799362508246274.mp4", "subtitle_path": "7253799362508246274_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 37.74, "view_count": 1503}, {"video_id": "@movie.explained6-7270478842798689537", "question": "In a room with many partitions, there are several men. A woman wearing a camisole is sitting in a wheelchair. 
When the woman wearing a camisole appears in a room with several people sitting on chairs, how does her posture in the wheelchair change?", "question_wo_referring_query": "How does the posture of the woman wearing a camisole sitting in the wheelchair change?", "candidates": ["From hugging her knees with both hands to hugging her knees with one hand", "From hugging her knees with both hands to placing one leg to the side on the wheelchair", "From hugging her knees with both hands to kneeling on the wheelchair", "From hugging her knees with both hands to spreading her legs wide", "From hugging her knees with both hands to sitting cross-legged on the wheelchair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7270478842798689537_0", "video_path": "7270478842798689537.mp4", "subtitle_path": "7270478842798689537_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 47.43, "view_count": 26881}, {"video_id": "@movie.explained6-7268989684171296001", "question": "In a dimly lit room, there is a man with short black hair sitting, and beside him stands a woman wearing a dark green top. 
What changes occur in the shot of the man with short black hair when the subtitle 'want her to go home with me okay not today thank you mister thank you' appears?", "question_wo_referring_query": "What changes occur in the shot of the man with short black hair?", "candidates": ["The shot changes from a long shot to a close-up", "The shot changes from a mid shot to a close-up", "The shot changes from a long shot to a close-up", "The shot changes from a close-up to a long shot", "The shot changes from a close-up to a close-up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7268989684171296001_0", "video_path": "7268989684171296001.mp4", "subtitle_path": "7268989684171296001_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.87, "view_count": 1761}, {"video_id": "@movie.explained6-7268796800855788801", "question": "In a scene set against a backdrop of green leaves, a man wearing a white shirt is standing next to a tree. The man has a wristwatch on his arm. What is the man in the white shirt doing next to the tree?", "question_wo_referring_query": "What is the man in the white shirt doing next to the tree?", "candidates": ["Holding his head with both hands", "Covering his face with both hands", "Kneeling by the tree and tying his shoelace", "Holding a sniper rifle and aiming forward", "Leaning against the tree and sleeping"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7268796800855788801_0", "video_path": "7268796800855788801.mp4", "subtitle_path": "7268796800855788801_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 1800}, {"video_id": "@movie.explained6-7252657721969773826", "question": "In a classroom filled with numerous desks and red chairs, there is a black walnut bookshelf against the wall. 
The bookshelf holds many books and items. At the bottom of the screen, there are yellow words 'where they both get intimate'. What objects are present in the classroom with the red chair?", "question_wo_referring_query": "What objects are present in the classroom with the red chair?", "candidates": ["Yellow sunflowers", "White chrysanthemums", "Globe", "Green book bag", "Red roses"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7252657721969773826_0", "video_path": "7252657721969773826.mp4", "subtitle_path": "7252657721969773826_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 57.37, "view_count": 864}, {"video_id": "@movie.explained6-7278290643472977153", "question": "In a scene with white text and the words 'and the dog was killed by him just now,' there are dense woods and a green meadow. When the subtitle 'and the dog was killed by him just now. Grandpa didn't believe it, but when he came out,' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A blue shirt", "A white short-sleeved jacket", "A white daisy", "A black hooded raincoat", "Yellow ylang-ylang"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7278290643472977153_0", "video_path": "7278290643472977153.mp4", "subtitle_path": "7278290643472977153_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.23, "view_count": 4194}, {"video_id": "@movie.explained6-7268952388361407745", "question": "In a scene with white text saying 'What are you doing Touching my kid?', there is a man holding a little girl with pigtails and a balloon. When the subtitle 'What are you doing touching my kid?' 
appears, what color is the balloon held by the little girl with pigtails?", "question_wo_referring_query": "What color is the balloon held by the little girl with pigtails?", "candidates": ["Blue", "Red", "Green", "Purple", "Orange"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "@movie.explained6-7268952388361407745_0", "video_path": "7268952388361407745.mp4", "subtitle_path": "7268952388361407745_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.37, "view_count": 1587}, {"video_id": "@movie_summary_-7223407726704004357", "question": "At a large entrance with white ceiling tiles, two men are standing. One man is being pointed at with a gun by another man wearing a watch on his wrist. Which man is being pointed at with the gun?", "question_wo_referring_query": "Which man is being pointed at with the gun?", "candidates": ["Man wearing a black short sleeve shirt", "Man wearing a black shirt", "Man wearing a black suit", "Man wearing a white T-shirt", "Man wearing a black hooded jacket"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie_summary_-7223407726704004357_0", "video_path": "7223407726704004357.mp4", "subtitle_path": "7223407726704004357_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 55.92, "view_count": 1789}, {"video_id": "@movie.explained6-7269194842515721473", "question": "In a room with a white door, there is a man wearing a dark blue coat. 
When the subtitle 'Let everyone who died be alive again' appears, what is the man in the dark blue coat doing?", "question_wo_referring_query": "What is the man in the dark blue coat doing?", "candidates": ["Waving his left hand while talking", "Rolling on the ground", "Holding his head with both hands", "Covering his face with both hands", "Kneeling down to tie his shoelaces"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7269194842515721473_0", "video_path": "7269194842515721473.mp4", "subtitle_path": "7269194842515721473_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.5, "view_count": 1735}, {"video_id": "@daily.stories20-7254684753503407406", "question": "In a scene with white text 'Sound of Freedom,' a man in a black short-sleeve shirt is sitting in the driver's seat. While watching a child seated in the back, what else does the man in the black short-sleeve shirt do?", "question_wo_referring_query": "What else does he do?", "candidates": ["Put his hands on his head", "Headbutt the steering wheel", "Turn his head to look at the man standing outside the car window", "Violently fists the steering wheel", "Reach out his hand towards the boy sitting in the back seat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@daily.stories20-7254684753503407406_0", "video_path": "7254684753503407406.mp4", "subtitle_path": "7254684753503407406_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 1728}, {"video_id": "@movie.explained6-7277170954722200834", "question": "In the video, which of the following characters appears first?", "question_wo_referring_query": "In the video, which of the following characters appears first?", "candidates": ["Boy wearing a blue T-shirt", "Yoda baby with green skin", "Man holding a gun", "Man wearing a white 
suit", "Man wearing a helmet"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "@movie.explained6-7277170954722200834_0", "video_path": "7277170954722200834.mp4", "subtitle_path": "7277170954722200834_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.09, "view_count": 3664}, {"video_id": "@movie.explained6-7253160621594774786", "question": "In a dimly lit room, there's a humanoid monster tearing open a cupboard and roaring. After the narration says 'He also hissed at the kitten,' what happens next in the room?", "question_wo_referring_query": "What happens next in the room?", "candidates": ["A kitten is meowing on the floor.", "A woman in a white dress walks into the room.", "A man in khaki armor walks into the room.", "A monster walks into the room, flailing its arms and legs.", "The humanoid monster is knocked down on the floor."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7253160621594774786_0", "video_path": "7253160621594774786.mp4", "subtitle_path": "7253160621594774786_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.64, "view_count": 1607}, {"video_id": "@movie.explained6-7275118301385100545", "question": "Next to a black column is a little girl wearing blue clothes. The little girl places her hands on the black column. After the voice-over says 'neighbor found the girl on the balcony. 
She spoke to the,' what is the first object to appear on the screen?", "question_wo_referring_query": "What is the first object to appear on the screen?", "candidates": ["Green tree", "White house", "Olive-colored leaves", "Yellow soil", "Black-cased phone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7275118301385100545_0", "video_path": "7275118301385100545.mp4", "subtitle_path": "7275118301385100545_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.57, "view_count": 8079}, {"video_id": "@movie.explained6-7269611422873619714", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a man in a white shirt is sitting in front of a black laptop; then, a man in a white shirt is bending over and talking to another man in a white shirt; finally, a white selection box is holding a blue car.", "First, a man in a white shirt is sitting in front of a black laptop; then, a white selection box is holding a blue car; finally, a man in a white shirt is bending over and talking to another man in a white shirt.", "First, a white selection box is holding a blue car; then, a man in a white shirt is sitting in front of a black laptop; finally, a man in a white shirt is bending over and talking to another man in a white shirt.", "First, a man in a white shirt is bending over and talking to another man in a white shirt; then, a white selection box is holding a blue car; finally, a man in a white shirt is sitting in front of a black laptop.", "First, a man in a white shirt is bending over and talking to another man in a white shirt; then, a man in a white shirt is sitting in front of a black laptop; finally, a white selection box is holding a blue car."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": 
"L2-Relation", "id": "@movie.explained6-7269611422873619714_0", "video_path": "7269611422873619714.mp4", "subtitle_path": "7269611422873619714_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.63, "view_count": 2742}, {"video_id": "@movie.explained6-7253157946857639170", "question": "In a dimly lit space, there is a pool with a humanoid creature in it. Beside the pool sits a woman with short black hair, wearing a green hair ornament. In which of the following scenes has the woman with short black hair appeared?", "question_wo_referring_query": "In which of the following scenes has the woman with short black hair appeared?", "candidates": ["In a room with many instruments", "In a room with a red sofa", "On the beach during rain", "In a beautiful garden", "In the deep blue sea"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7253157946857639170_0", "video_path": "7253157946857639170.mp4", "subtitle_path": "7253157946857639170_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.12, "view_count": 1820}, {"video_id": "@movie.explained6-7277661747174051073", "question": "In front of a mirror, there is a woman with black shoulder-length curly hair wearing a floral shirt, arranging her hairstyle. 
Which of the following subtitles have appeared together with this woman with black shoulder-length curly hair?", "question_wo_referring_query": "Which of the following subtitles have appeared together with the woman with black shoulder-length curly hair?", "candidates": ["\"fingerprints, surprisingly,\"", "\"the fingerprints matched those of the missing girl Why young girl would became an old\"", "\"She seems to be dead\"", "\"her roommate's car and phone were at home, but she was nowhere to be found, and the hot lady\"", "\"Too unreliable\""], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "@movie.explained6-7277661747174051073_0", "video_path": "7277661747174051073.mp4", "subtitle_path": "7277661747174051073_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 32.57, "view_count": 4492}, {"video_id": "@movie.explained6-7269748569827986689", "question": "In a space filled with water, there is a man without any clothes on. The man has a black underwater breathing apparatus in his mouth. 
When this unclothed man appears in a scene with a man wearing a white coat, what change happens to the unclothed man?", "question_wo_referring_query": "What change happens to the unclothed man?", "candidates": ["He changes from being unclothed to wearing a blue T-shirt", "He changes from being unclothed to wearing a white T-shirt", "He changes from being unclothed to wearing a black hoodie", "The black breathing apparatus in his mouth changes to not having anything in his mouth", "He changes from being unclothed to wearing a red T-shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7269748569827986689_0", "video_path": "7269748569827986689.mp4", "subtitle_path": "7269748569827986689_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 33.77, "view_count": 3930}, {"video_id": "@movie.explained6-7272908944903884034", "question": "In a bright room, a woman wearing a dark green long-sleeved shirt is sitting next to a white table. 
When the subtitle \"They bang frantically on the bathroom door, shouting their son's name\" appears, what change happens to the woman's clothing?", "question_wo_referring_query": "What change happens to the clothing of the woman wearing a dark green long-sleeved shirt?", "candidates": ["The dark green long-sleeved shirt changes to a dark blue off-shoulder top.", "The dark green long-sleeved shirt changes to a black hooded coat.", "The dark green long-sleeved shirt changes to an olive windbreaker.", "The dark green long-sleeved shirt changes to a white dress.", "The dark green long-sleeved shirt changes to a white T-shirt."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7272908944903884034_0", "video_path": "7272908944903884034.mp4", "subtitle_path": "7272908944903884034_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.3, "view_count": 4166}, {"video_id": "@movie.explained6-7253792412919794946", "question": "In a scene with white text 'The on-duty officer immediately fired at the woman,' there is a brown table with green plants underneath it, and a man wearing army green pants in front of the table. 
What is the man wearing army green pants doing?", "question_wo_referring_query": "What is the man wearing army green pants doing?", "candidates": ["Kneeling and tying shoelaces", "Holding his head with both hands", "Holding a gun and firing", "Crawling on the ground", "Covering his face with both hands"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7253792412919794946_0", "video_path": "7253792412919794946.mp4", "subtitle_path": "7253792412919794946_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.46, "view_count": 1414}, {"video_id": "@daily.stories20-7295879792677834030", "question": "In a gigantic room, a piece of skin with five sensory organs is placed in the middle of the room. Two people dressed in white clothes are standing on either side of the skin. When the subtitle 'She gave up her bones and unnecessary organs. She was so confident that she would be the most' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Red clothes", "White basin", "Blue flower", "Green plant", "Yellow flower"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@daily.stories20-7295879792677834030_0", "video_path": "7295879792677834030.mp4", "subtitle_path": "7295879792677834030_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.06, "view_count": 127772}, {"video_id": "@movie.explained6-7272839483966442753", "question": "In a room with a white table, a man and a woman are sitting on either side of the table. When the subtitle 'The old man takes out the cipher suitcase.' 
appears, what hairstyle does the man next to the table have?", "question_wo_referring_query": "What hairstyle does the man next to the table have?", "candidates": ["white buzz cut", "black curly hair", "black buzz cut", "white curly hair", "brown curly hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "@movie.explained6-7272839483966442753_0", "video_path": "7272839483966442753.mp4", "subtitle_path": "7272839483966442753_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.07, "view_count": 4286}, {"video_id": "@movie.explained6-7252676917998947585", "question": "In a dark space, there is a woman with short brown hair. At the bottom of the screen, there are yellow letters saying 'this angers the devil'. Someone angered the devil because he regretted. Who regretted?", "question_wo_referring_query": "Who regretted?", "candidates": ["Mary", "Tony", "Jack", "Anna", "Bruce"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie.explained6-7252676917998947585_0", "video_path": "7252676917998947585.mp4", "subtitle_path": "7252676917998947585_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.57, "view_count": 760}, {"video_id": "@movie.explained6-7269747704991993090", "question": "Beside a car driving forward, there is a dense forest, and in the dense forest, there is a deer. 
What is the deer doing the first time it appears?", "question_wo_referring_query": "What is the deer doing the first time it appears?", "candidates": ["Eating grass leisurely", "Following a fawn", "Running quickly", "Walking leisurely", "Digging with its antlers"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "@movie.explained6-7269747704991993090_0", "video_path": "7269747704991993090.mp4", "subtitle_path": "7269747704991993090_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 49.1, "view_count": 22500}, {"video_id": "@movie.explained6-7253797145961172226", "question": "In a scene with white text 'he heard his grandfather say', there is a boy lying on a rock. When the subtitle 'coma, he heard his grandfather say that leopards can climb trees, giving him an inspiration. It' appears, what is the boy lying on the rock doing?", "question_wo_referring_query": "What is the boy lying on the rock doing?", "candidates": ["Drinking water", "Washing clothes", "Squatting on the ground roasting food", "Holding binoculars and looking into the distance", "Lying on the rock sleeping"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7253797145961172226_0", "video_path": "7253797145961172226.mp4", "subtitle_path": "7253797145961172226_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 46.45, "view_count": 1522}, {"video_id": "@movie.explained6-7254234549356334337", "question": "In a dark space, a woman wearing jeans is lying on the ground, reaching out for a cell phone that is on the floor. 
What did the woman in jeans do after picking up the cell phone?", "question_wo_referring_query": "What did the woman in jeans do after picking up the cell phone?", "candidates": ["Removed the cell phone case", "Threw the cell phone away", "Used one hand to prop herself up and sat up to make a call", "Crouched on the ground to make a call", "Played on the cell phone while lying flat on the ground"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7254234549356334337_0", "video_path": "7254234549356334337.mp4", "subtitle_path": "7254234549356334337_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.29, "view_count": 2764}, {"video_id": "@movie.explained6-7253799591634636033", "question": "In the video screen, which of the following characters appears first?", "question_wo_referring_query": "In the video screen, which of the following characters appears first?", "candidates": ["A woman wearing a purple cheongsam", "A woman wearing a blue cheongsam", "A man wearing an army green jacket, holding a burning yellow plastic rod", "A woman wearing a white cheongsam", "A man wearing a blue jacket"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "@movie.explained6-7253799591634636033_0", "video_path": "7253799591634636033.mp4", "subtitle_path": "7253799591634636033_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 57.76, "view_count": 1637}, {"video_id": "@movie.explained6-7271451822135528705", "question": "In a scene with white text (please let me in) on the screen, a rough and dark hand is placed on a wall. After the caption \"apocalypse has hit. Guys are seeking refuge inside a supermarket. 
Cries for help echo from outside\" appears, what happens next in the video?", "question_wo_referring_query": "What happens next in the video?", "candidates": ["A man wearing a grayish-black hat is forcefully knocking on the door.", "A man is placing his hands on a woman's face.", "A woman wearing a black coat is pushing a shopping cart in the supermarket.", "A person wearing dark blue pants is walking around the supermarket.", "A man is forcefully tearing at a woman's face."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7271451822135528705_0", "video_path": "7271451822135528705.mp4", "subtitle_path": "7271451822135528705_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.23, "view_count": 6306}, {"video_id": "@movie.explained6-7270058552545398018", "question": "In a dark space, there is a man with an oily face. After the narration says, 'Just as he was about to harm the girl, Lola arrived,' who is the first character to appear on the screen?", "question_wo_referring_query": "Who is the first character to appear on the screen?", "candidates": ["A woman with dark skin and an Afro hairstyle", "A woman wearing a blue bra", "A man wearing a short-sleeve shirt", "A woman with fair skin", "A woman with long blonde hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7270058552545398018_0", "video_path": "7270058552545398018.mp4", "subtitle_path": "7270058552545398018_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 39.3, "view_count": 4916}, {"video_id": "@movie.explained6-7269859381020216578", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a man is locked inside an iron cage; then, a short-haired woman is 
sitting in a car, ready to drive; finally, milk is poured from one bottle into another.", "First, a short-haired woman is sitting in a car, ready to drive; then, milk is poured from one bottle into another; finally, a man is locked inside an iron cage.", "First, milk is poured from one bottle into another; then, a short-haired woman is sitting in a car, ready to drive; finally, a man is locked inside an iron cage.", "First, milk is poured from one bottle into another; then, a man is locked inside an iron cage; finally, a short-haired woman is sitting in a car, ready to drive.", "First, a man is locked inside an iron cage; then, milk is poured from one bottle into another; finally, a short-haired woman is sitting in a car, ready to drive."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "@movie.explained6-7269859381020216578_0", "video_path": "7269859381020216578.mp4", "subtitle_path": "7269859381020216578_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 46.57, "view_count": 3588}, {"video_id": "@movie.explained6-7269746745142185218", "question": "In front of a little girl with pigtails, there is a man wearing a gray coat sitting down. The man has a watch on his wrist. 
In which of the following scenes does this man in the gray coat appear?", "question_wo_referring_query": "In which of the following scenes does the man in the gray coat appear?", "candidates": ["In a classroom with many wooden desks", "In a park during rainfall", "In a room with many fresh flowers", "On a golden sand beach", "In a scene where many people are seated in rows and a man in a black suit is also standing"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7269746745142185218_0", "video_path": "7269746745142185218.mp4", "subtitle_path": "7269746745142185218_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.43, "view_count": 10657}, {"video_id": "@movie_summary_-7220159257197448453", "question": "In a telephone booth, there is a little boy with an orange backpack wearing a dark blue feather coat making a call. Which of the following subtitles has this scene appeared with?", "question_wo_referring_query": "Which of the following subtitles has the little boy in a dark blue feather coat appeared with?", "candidates": ["\"two of them knew anna lost her face when she heard it she hurried to the park the boy was found\"", "\"This world is really scary\"", "\"They're all crazy\"", "\"Anna has a panic on her face\"", "\"He felt extremely scared\""], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "@movie_summary_-7220159257197448453_0", "video_path": "7220159257197448453.mp4", "subtitle_path": "7220159257197448453_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.0, "view_count": 130649}, {"video_id": "@movie.explained6-7269963949913738498", "question": "In a dark room, a woman wearing a light blue fur coat is looking ahead. 
When the woman in the light blue fur coat appears in a room with a green large door, what change occurs to her iris?", "question_wo_referring_query": "What change occurs to the iris of the woman wearing a light blue fur coat?", "candidates": ["The iris changes from black to gold", "The iris changes from black to white", "The iris changes from black to blue", "The iris changes from black to green", "The iris changes from black to purple"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7269963949913738498_0", "video_path": "7269963949913738498.mp4", "subtitle_path": "7269963949913738498_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 55.3, "view_count": 6426}, {"video_id": "@movie.explained6-7252676274051616002", "question": "In a dark space, there is a man with dark skin wearing a black coat. Behind the man is a closed elevator door. When the subtitle 'In contrast, a young boy and a security guard try to open the door by hand, but fail, despite' appears, what happens to the clothing worn by the man with dark skin?", "question_wo_referring_query": "What happens to the clothing worn by the man with dark skin?", "candidates": ["He changes into a black hooded jacket", "He puts on a blue shirt", "He takes off the black coat", "He changes into a red coat", "He changes into a white suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7252676274051616002_0", "video_path": "7252676274051616002.mp4", "subtitle_path": "7252676274051616002_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 45.33, "view_count": 611}, {"video_id": "@movie.explained6-7252676446756424961", "question": "In a closed elevator, many people are trapped inside, and a man wearing a suit is lying on the floor of the elevator. 
What object is present in the scene?", "question_wo_referring_query": "What object is present in the scene?", "candidates": ["black suspenders", "white shirt", "red hooded hazmat suit", "green hooded hazmat suit", "blue hooded hazmat suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7252676446756424961_0", "video_path": "7252676446756424961.mp4", "subtitle_path": "7252676446756424961_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 40.9, "view_count": 614}, {"video_id": "@movie.explained6-7270820755481365762", "question": "In front of a yellow earthen wall, there are several women wearing hats and long skirts. One of these women is holding two flowers. When the subtitle 'naked, their hair shaved off, yet they were not sexually' appears, what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Blue flowers", "Yellow flowers", "Pink flowers", "Red flowers", "Green flowers"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7270820755481365762_0", "video_path": "7270820755481365762.mp4", "subtitle_path": "7270820755481365762_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.13, "view_count": 24980}, {"video_id": "@movie.explained6-7269386891286318337", "question": "In a dark underwater scene, there are green aquatic plants swaying with the water current. A woman with long auburn hair is swimming in the water. 
What color clothes is she wearing?", "question_wo_referring_query": "What color clothes is she wearing?", "candidates": ["black", "blue", "red", "purple", "white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "@movie.explained6-7269386891286318337_0", "video_path": "7269386891286318337.mp4", "subtitle_path": "7269386891286318337_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 34.0, "view_count": 4125}, {"video_id": "@movie.explained6-7262892216694230280", "question": "In a scene with the white text 'discovered by others' on the screen, there is a woman with exquisite makeup and red lipstick. When the subtitle 'To prevent being discovered by others.' appears, what hairstyle does the woman with exquisite makeup have?", "question_wo_referring_query": "What hairstyle does the woman with exquisite makeup have?", "candidates": ["Olive long curly hair", "Olive shoulder-length hair", "Blonde waist-length hair", "Blonde shoulder-length hair", "Black shoulder-length hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "@movie.explained6-7262892216694230280_0", "video_path": "7262892216694230280.mp4", "subtitle_path": "7262892216694230280_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 9492}, {"video_id": "@movie.explained6-7269963595880860929", "question": "In a scene with white text 'An worm dropped from the ceiling' on the screen, there is a person wearing white earphones lying on a bed and sleeping. 
Who is wearing white earphones and lying on the bed sleeping?", "question_wo_referring_query": "Who is wearing white earphones and lying on the bed sleeping?", "candidates": ["The man wearing a gray hoodie", "The man wearing a white T-shirt", "The man wearing a black suit", "The man wearing a red hoodie", "The man wearing a black hoodie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "@movie.explained6-7269963595880860929_0", "video_path": "7269963595880860929.mp4", "subtitle_path": "7269963595880860929_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.0, "view_count": 157682}, {"video_id": "@movie.explained6-7253798717914942722", "question": "In a room with a red mahogany desk, there are three people: a man sitting on a black chair and two men standing. What is the man wearing a military green hat doing when he first appears?", "question_wo_referring_query": "What is the man wearing a military green hat doing when he first appears?", "candidates": ["Lying on the ground, vomiting blood", "Kneeling on the ground", "Holding a black gun and pointing it at the man sitting on the black chair", "Handing a document in a deep blue folder to the man sitting on the black chair", "Holding a dagger and thrusting it toward the man sitting on the black chair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "@movie.explained6-7253798717914942722_0", "video_path": "7253798717914942722.mp4", "subtitle_path": "7253798717914942722_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.92, "view_count": 1394}, {"video_id": "@movie.explained6-7257359413067992321", "question": "In a bright room, there is a woman with short white hair wearing a silver-gray coat. When the subtitle 'Founder Ally told him, maybe he can save the planet.' 
appears, what is the woman in the silver-gray coat doing?", "question_wo_referring_query": "What is the woman in the silver-gray coat doing?", "candidates": ["Furiously throwing away a document in her hand", "Turning a pen with her hand", "Shouting in anger", "Signing a document", "Talking with her head raised"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "@movie.explained6-7257359413067992321_0", "video_path": "7257359413067992321.mp4", "subtitle_path": "7257359413067992321_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 2360}, {"video_id": "@movie.explained6-7274935347451120897", "question": "In a kitchen, there is a little girl with black hair. The little girl is standing in front of a stove and lighting a fire. What did the little girl do after lighting the fire on the stove?", "question_wo_referring_query": "What did the little girl do after lighting the fire on the stove?", "candidates": ["Sat on the floor", "Swung a transparent bottle", "Looked around", "Put a piece of bread on the fire", "Pulled the microwave door"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7274935347451120897_0", "video_path": "7274935347451120897.mp4", "subtitle_path": "7274935347451120897_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.97, "view_count": 5975}, {"video_id": "@daily.stories20-7123191229520678186", "question": "In the video, which of the following characters appears first?", "question_wo_referring_query": "In the video, which of the following characters appears first?", "candidates": ["Woman in a white top", "Man wearing a white shirt and black-rimmed glasses", "Woman wearing a blue short-sleeved shirt", "Man wearing a blue shirt", "Woman dressed in black"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": 
"O3O", "level": "L2-Relation", "id": "@daily.stories20-7123191229520678186_0", "video_path": "7123191229520678186.mp4", "subtitle_path": "7123191229520678186_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 54.88, "view_count": 411}, {"video_id": "@movie.explained6-7253138042167250178", "question": "In a scene with many people standing, there are two boys in dark blue clothes fighting. After the voiceover says 'After the fight ends, Adam's friend gets angry with him and they start fighting,' what happens to the two boys who were fighting?", "question_wo_referring_query": "What happens to the two boys who were fighting?", "candidates": ["Admitted to the hospital", "Taken away by the teacher", "Stumbling on the ground", "Lying flat on the ground", "Hugging and crying"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7253138042167250178_0", "video_path": "7253138042167250178.mp4", "subtitle_path": "7253138042167250178_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.2, "view_count": 2731}, {"video_id": "@movie.explained6-7268780243513527553", "question": "In a space with many pillars, there is a roller coaster full of people. After the voice-over says 'had no choice but to sit at the end. 
When the roller coaster was about to start, the security,' who is the first character to appear on screen?", "question_wo_referring_query": "Who is the first character to appear on screen?", "candidates": ["A woman in a pink long-sleeved top", "A man in a dark olive jacket", "A man wearing a dark blue coat and earphones", "A man with a black hat", "A woman in flower-patterned pants"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7268780243513527553_0", "video_path": "7268780243513527553.mp4", "subtitle_path": "7268780243513527553_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 21085}, {"video_id": "@movie.explained6-7253156792891378946", "question": "In a dimly lit room, there is a bookshelf filled with books. In front of the bookshelf is a woman wearing an olive-green bathrobe with wet hair. In which of the following scenes does this woman appear?", "question_wo_referring_query": "In which of the following scenes does this woman appear?", "candidates": ["In a seat by the window on a bus", "In a park during rain", "In a dark movie theater", "On a golden sandy beach", "In a bedroom with white sheets"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7253156792891378946_0", "video_path": "7253156792891378946.mp4", "subtitle_path": "7253156792891378946_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.4, "view_count": 2067}, {"video_id": "@movie.explained6-7271232320022023425", "question": "In front of a glass window with a crack, there is a woman wearing an orange top and orange pants. The woman is looking at the cracked window. 
Which of the following subtitles has appeared together with the woman wearing the orange top?", "question_wo_referring_query": "Which of the following subtitles has appeared together with the woman wearing the orange top?", "candidates": ["\"But the people on the street seemed to have gone crazy.\"", "\"She told her sister to drive away quickly.\"", "\"Alice had just finished her checkup and was about to bounce.\"", "\"She hurriedly ran out and got into her sister's car to leave.\"", "\"Not even the nurse could stop her.\""], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "@movie.explained6-7271232320022023425_0", "video_path": "7271232320022023425.mp4", "subtitle_path": "7271232320022023425_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.13, "view_count": 8558}, {"video_id": "@movie.explained6-7274506004186828033", "question": "In a bright office, there is a gray chair. On the chair sits a man wearing a black suit. When the man in the black suit appears in a dark room with a woman dressed in a white top, what changes occur to him?", "question_wo_referring_query": "What changes occur to the man in the black suit?", "candidates": ["He put on a white shirt", "He changed into a white bathrobe", "He put on a blue suit", "He took off the black suit", "He put on an olive suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7274506004186828033_0", "video_path": "7274506004186828033.mp4", "subtitle_path": "7274506004186828033_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.87, "view_count": 8873}, {"video_id": "@jess.morg-7232049679339408683", "question": "A slice of lemon is placed on orange juice with ice cubes in a transparent glass, and the bottom of the glass has slices of orange and lemon on ice cubes. 
What is the hand holding the iron spoon doing at this moment in the video?", "question_wo_referring_query": "What is the hand holding the iron spoon doing at this moment?", "candidates": ["Drinking the juice", "Lifting the juice in the glass", "Pouring the juice out", "Adding seasoning to the glass", "Stirring inside the glass"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@jess.morg-7232049679339408683_0", "video_path": "7232049679339408683.mp4", "subtitle_path": "7232049679339408683_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 26.6, "view_count": 818}, {"video_id": "@jonijawne-7292020937879194885", "question": "A man with a middle-part hairstyle, wearing a light-colored short-sleeved top, is sitting in front of a white plate covered with plastic wrap, holding up a jar of green sauce in front of a camera. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["tomato sauce", "fork and knife", "curtain", "fruit knife", "window"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@jonijawne-7292020937879194885_0", "video_path": "7292020937879194885.mp4", "subtitle_path": "7292020937879194885_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 35.2, "view_count": 34359}, {"video_id": "@kerstinong-7251845734398512385", "question": "Standing on the wooden floor in front of the wooden wardrobe, wearing a black top and white long pants, holding a small black bag and taking a selfie with her phone, what is her hairstyle?", "question_wo_referring_query": "What is her hairstyle?", "candidates": ["long curly hair", "red long hair", "short straight hair", "short curly hair", "long straight hair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": 
"@kerstinong-7251845734398512385_0", "video_path": "7251845734398512385.mp4", "subtitle_path": "7251845734398512385_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 42.43, "view_count": 4407}, {"video_id": "@kerstinong-7204107990654291208", "question": "The girl standing on the wooden floor in the corner of the wooden wardrobe, wearing an orange sports top and shorts, with her hands raised to her waist, when the subtitle 'Green, green grass' appears, what kind of shorts is she wearing?", "question_wo_referring_query": "What kind of shorts is she wearing?", "candidates": ["yellow shorts", "blue denim shorts", "black suit shorts", "gray-blue sports shorts", "green shorts"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@kerstinong-7204107990654291208_0", "video_path": "7204107990654291208.mp4", "subtitle_path": "7204107990654291208_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.45, "view_count": 4849}, {"video_id": "@jonijawne-7190789373963668741", "question": "On a gray floor tile stands a man wearing a black suit, black leather belt, black undershirt, and black outerwear. What is the object that this man is wearing around his neck?", "question_wo_referring_query": ", What is the object that this man is wearing around his neck?", "candidates": ["pearl necklace", "emerald necklace", "gold necklace", "scarf", "silver necklace"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@jonijawne-7190789373963668741_0", "video_path": "7190789373963668741.mp4", "subtitle_path": "7190789373963668741_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.13, "view_count": 2320}, {"video_id": "@kerstinong-7018191944165690625", "question": "A woman wearing a white strap top and white shorts is standing on a wooden floor next to brown and white curtains. 
What is the man in blue and white sportswear with a yellow headband doing the first time he appears in the bottom left corner of the screen?", "question_wo_referring_query": "What is he doing?", "candidates": ["Wiping the table", "Exercising", "Sleeping", "Washing his face", "Cooking"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@kerstinong-7018191944165690625_0", "video_path": "7018191944165690625.mp4", "subtitle_path": "7018191944165690625_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.67, "view_count": 23236}, {"video_id": "@lisolna-7308762530015366432", "question": "There is a row of silver shopping carts stacked together in the screen, with orange handles on the shopping carts. A hand is resting on the shopping cart. When the subtitle 'I'm going to make a so' appears, what is the person in the screen doing?", "question_wo_referring_query": "What is the person in the screen doing?", "candidates": ["Placing fruits in the shopping cart", "Placing vegetables in the shopping cart", "Taking out a shopping cart", "Pushing the shopping cart", "Placing chicken breast in the shopping cart"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@lisolna-7308762530015366432_0", "video_path": "7308762530015366432.mp4", "subtitle_path": "7308762530015366432_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.63, "view_count": 44151}, {"video_id": "@kerstinong-7217815409498541314", "question": "There is a hamburger on the dining table in front, and a person is wiping the dining utensils with a napkin. 
What happens in the video after this scene?", "question_wo_referring_query": "What happens in the video after this scene?", "candidates": ["Puts the knife into a clear glass of water", "Cuts a steak with a knife", "Cuts the hamburger with a knife", "Picks up a glass of clear water from the table", "A long-haired woman in a dark green dress talks to the camera"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@kerstinong-7217815409498541314_0", "video_path": "7217815409498541314.mp4", "subtitle_path": "7217815409498541314_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.9, "view_count": 42386}, {"video_id": "@kerstinong-7115387502784957697", "question": "Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["Yellow short coat", "White short sleeve", "Laptop", "KFC chicken roll", "Black athletic outfit"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@kerstinong-7115387502784957697_0", "video_path": "7115387502784957697.mp4", "subtitle_path": "7115387502784957697_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.77, "view_count": 145200}, {"video_id": "@lisolna-7240579981544410394", "question": "The buffet table in the dining hall is filled with many dishes in metal plates. When the subtitle 'Oh, that's the same?\nOh, that's the same?' 
appears, what happens next in the scene?", "question_wo_referring_query": "What happens next in the scene?", "candidates": ["Pick up an orange and place it on the plate", "Pick up an apple and place it on the plate", "Take a bowl of shredded carrot from the counter", "Pick up a bowl of blueberries and place it on the plate", "Pick up a bowl of grapes and place it on the plate"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "@lisolna-7240579981544410394_0", "video_path": "7240579981544410394.mp4", "subtitle_path": "7240579981544410394_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 36.67, "view_count": 621000}, {"video_id": "@lisolna-7227149138155031834", "question": "Inside a caf\u00e9 with a brown counter filled with many paper cups, there is a staff member wearing a white shirt working with their back to the camera. After the subtitle 'Oh, under the sun [\"Oh, under the sun\"] i' appears, what object can be seen?", "question_wo_referring_query": "What object appears?", "candidates": ["A woman with long black curly hair sitting at the table eating something", "Round cakes displayed in the glass case", "Various breads displayed on the glass counter", "A white truck outside the window", "A black cash register on the counter"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@lisolna-7227149138155031834_0", "video_path": "7227149138155031834.mp4", "subtitle_path": "7227149138155031834_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.03, "view_count": 5952}, {"video_id": "@kerstinong-7042683058008493313", "question": "A woman wearing black and yellow sportswear and black yoga pants is taking a selfie in the mirror. 
Which lyrics have appeared simultaneously with this scene?", "question_wo_referring_query": "Which lyrics have appeared simultaneously with this scene?", "candidates": ["\"Yesterday, all my troubles seemed so far away. Now it looks as though they're here to stay.\"", "\"Baby, you're a firework. Come on, let your colors burst.\"", "\"If only you knew, if only you knew\"", "\"Every night in my dreams, I see you, I feel you.\"", "\"Cause all of me loves all of you. Love your curves and all your edges, all your perfect imperfections.\""], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@kerstinong-7042683058008493313_0", "video_path": "7042683058008493313.mp4", "subtitle_path": "7042683058008493313_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.4, "view_count": 5813}, {"video_id": "@lisolna-7351428596537281825", "question": "A chef wearing a white chef's hat and uniform is stir-frying shrimp pancakes in front of an iron pan. 
When another chef, wearing a Tibetan blue apron and black gloves, places the stir-fried shrimp pancakes on a gray granite table, what changes have occurred to these shrimp pancakes?", "question_wo_referring_query": ", what changes have occurred to these shrimp pancakes?", "candidates": ["Changed from yellow to red", "Changed from inside the pot to the white plate", "Changed from yellow to white", "Changed from yellow to orange", "Changed from yellow to black"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@lisolna-7351428596537281825_0", "video_path": "7351428596537281825.mp4", "subtitle_path": "7351428596537281825_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.67, "view_count": 144552}, {"video_id": "@kerstinong-7165396465978772738", "question": "What is a woman wearing a pink short-sleeve shirt, black shorts, with long hair draped over her head, doing on the stairs next to the banister holding a blue vase?", "question_wo_referring_query": "What is she doing?", "candidates": ["Tying her hair", "Squatting", "Going up the stairs", "Making a phone call", "Looking out the window"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-7165396465978772738_0", "video_path": "7165396465978772738.mp4", "subtitle_path": "7165396465978772738_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.83, "view_count": 12980}, {"video_id": "@lisolna-7305815434505309473", "question": "On a silver dining table, there are fruits in a gray-blue and black bowl. A hand uses chopsticks to pick up some blueberries. 
What objects are present on the screen at this moment?", "question_wo_referring_query": "What objects are present on the screen at this moment?", "candidates": ["Cherry tomato", "Apple", "Shredded carrot", "Celery", "Watermelon"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@lisolna-7305815434505309473_0", "video_path": "7305815434505309473.mp4", "subtitle_path": "7305815434505309473_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.7, "view_count": 112405}, {"video_id": "@lisolna-7270586465476726048", "question": "At the seaside during sunset, there are two white chairs and an umbrella on the sand. Several people are swimming in the sea. When the subtitle 'let's go swimming at sunset I don't wanna fight your shadow' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["sunglasses", "barbecue rack", "sun", "pink flotation ring", "beer"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "@lisolna-7270586465476726048_0", "video_path": "7270586465476726048.mp4", "subtitle_path": "7270586465476726048_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 35.37, "view_count": 5350}, {"video_id": "@lisolna-7305430402783513888", "question": "A person in a gray top is holding two bouquets of flowers and standing on a gray and white floorboard. In front of them is a person wearing blue jeans. 
What color is the baby's breath that this person is holding?", "question_wo_referring_query": "What color is the baby's breath that this person is holding?", "candidates": ["blue", "off white", "yellow", "pink", "light purple"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@lisolna-7305430402783513888_0", "video_path": "7305430402783513888.mp4", "subtitle_path": "7305430402783513888_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.18, "view_count": 21680}, {"video_id": "@jonijawne-7305429122044497157", "question": "In front of a Christmas tree decorated with two colored lights, wearing a gray scarf, and making a heart gesture at the mirror, what kind of outerwear does the man appear in when the subtitle 'beautiful Christmas decorations held by the designer brand.It's so beautiful, cozy, peaceful, and really gives you that' appears?", "question_wo_referring_query": "What kind of outerwear is being worn?", "candidates": ["olive-colored leather jacket", "cream-colored windbreaker", "yellow coat", "brown coat", "olive-colored cotton coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@jonijawne-7305429122044497157_0", "video_path": "7305429122044497157.mp4", "subtitle_path": "7305429122044497157_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.97, "view_count": 6246}, {"video_id": "@jess.morg-7228722550849785131", "question": "The background is a green strawberry field on a yellow soil. 
After a hand lifted a white container filled with red strawberries, what did the owner of the hand do next?", "question_wo_referring_query": "What did the owner of the hand do next?", "candidates": ["Entered the supermarket", "Sat in a car driving on a highway", "Chose clothes from a rack filled with clothes", "Visited a sunflower garden", "Walked on the yellow soil"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@jess.morg-7228722550849785131_0", "video_path": "7228722550849785131.mp4", "subtitle_path": "7228722550849785131_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.17, "view_count": 2575}, {"video_id": "@kerstinong-7332822138476416257", "question": "The background is an indoor setting with red lanterns hanging and a 'Fu' character displayed. A woman in a red dress, with her hair covered and wearing earrings, holds a card resembling a red envelope with both hands, speaking towards a mirror. What happens on the screen when the subtitle 'Ah, get into a good college, make progress in studies! Get into a good college, make progress in studies!' 
appears?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The woman sits on the sofa and strokes her hair with both hands.", "The woman in a floral top holds two oranges and talks to the mirror.", "The woman in a red dress holds two oranges and talks to the mirror.", "The woman in a light pink top holds two oranges.", "The woman in a white camisole holds two oranges with both hands."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "@kerstinong-7332822138476416257_0", "video_path": "7332822138476416257.mp4", "subtitle_path": "7332822138476416257_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 39.25, "view_count": 31335}, {"video_id": "@lisolna-7163292971402644741", "question": "On a wall covered with white grid wallpaper, various styles of mugs are hanging. After the subtitle 'Hanging my mugs collection' appears, what object is shown?", "question_wo_referring_query": "What object is shown?", "candidates": ["A silver sink", "A black and white cat", "An electric drill", "A black bottle of oil cleaner", "A potted green plant"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@lisolna-7163292971402644741_0", "video_path": "7163292971402644741.mp4", "subtitle_path": "7163292971402644741_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.65, "view_count": 1186}, {"video_id": "@lisolna-7177725497588124934", "question": "In the video, which of the following sequences of scenes is correct?", "question_wo_referring_query": "In the video, which of the following sequences of scenes is correct?", "candidates": ["First, the lipstick is applied to the lips for a color test, then the pink lipstick exterior is shown, and finally, the silver lipstick packaging box is displayed on the beige podium.", "First, the pink lipstick exterior is shown, then the silver lipstick 
packaging box is displayed on the beige podium, and finally, the lipstick is applied to the lips for a color test.", "First, the silver lipstick packaging box is displayed on the beige podium, then the lipstick is applied to the lips for a color test, and finally, the pink lipstick exterior is shown.", "First, the pink lipstick exterior is shown, then the lipstick is applied to the lips for a color test, and finally, the silver lipstick packaging box is displayed on the beige podium.", "First, the silver lipstick packaging box is displayed on the beige podium, then the pink lipstick exterior is shown, and finally, the lipstick is applied to the lips for a color test."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "@lisolna-7177725497588124934_0", "video_path": "7177725497588124934.mp4", "subtitle_path": "7177725497588124934_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.98, "view_count": 28216}, {"video_id": "@lisolna-7222974512764112134", "question": "On a glass counter with some egg tarts, a person wearing a blue top and a black skirt is holding a bottle of seasoning and sprinkling it on an egg tart that has a piece of white paper underneath. 
Which subtitles have appeared along with this egg tart?", "question_wo_referring_query": "Which subtitles have appeared along with this egg tart?", "candidates": ["\"Let's get naughty, oh, naughty, oh, yeah\"", "\u201cLet's dance together\u201d", "\"So in love, you ain't gonna tell me nothing\"", "\"Let me hit this grip, cause I had no idea\"", "\u201cLet's party, oh, naughty, oh, yeah\u201d"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@lisolna-7222974512764112134_0", "video_path": "7222974512764112134.mp4", "subtitle_path": "7222974512764112134_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.92, "view_count": 3492}, {"video_id": "@kerstinong-6993273831196757250", "question": "Behind her is an air conditioner. She's wearing a black tank top, black-framed glasses, and has long golden hair. What changes occurred when she sat in front of the black computer with a white facial mask on?", "question_wo_referring_query": "What changes occurred?", "candidates": ["Changed from wearing glasses to not wearing glasses", "Changed from golden hair to black hair", "Changed from wearing a black tank top to wearing a white tank top", "Changed from wearing a necklace to not wearing a necklace", "Changed from long hair to short hair"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@kerstinong-6993273831196757250_0", "video_path": "6993273831196757250.mp4", "subtitle_path": "6993273831196757250_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.92, "view_count": 4729}, {"video_id": "@lisolna-7228573168582003994", "question": "A hand picked up a green apple from a tray filled with many red and green apples on a black dining table in the dining hall. 
When the subtitle \u201cBut now I'm left here feeling stupid\u201d appeared, what change happened to the green apple?", "question_wo_referring_query": "What change happened to the green apple?", "candidates": ["The apple moved from the table to a plate with some food on it", "The red apple turned green", "The green apple turned red", "The whole apple was bitten once", "The green apple turned yellow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "@lisolna-7228573168582003994_0", "video_path": "7228573168582003994.mp4", "subtitle_path": "7228573168582003994_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.73, "view_count": 18741}, {"video_id": "@jess.morg-7225722894263881003", "question": "The background is an overcast sky and green trees. In the frame, there is an object held by one hand that is white and cylindrical, with the word 'peach' in blue on it. What is being done in the scene?", "question_wo_referring_query": "What is being done in the scene?", "candidates": ["Taking the object apart", "Spraying the object on themselves", "Opening the lid of the object", "Demonstrating the object to the camera", "Throwing the object on the floor"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@jess.morg-7225722894263881003_0", "video_path": "7225722894263881003.mp4", "subtitle_path": "7225722894263881003_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 26.23, "view_count": 1191}, {"video_id": "@jonijawne-7292436855470050566", "question": "A man with a middle-part hairstyle wearing a black crossbody bag, a black long-sleeve shirt, and khaki pants, is taking a selfie in an elevator while holding up his phone with one hand. 
What objects are present on the screen at this moment?", "question_wo_referring_query": "What objects are present on the screen at this moment?", "candidates": ["White sneakers", "Baseball cap", "Sunglasses", "Necklace", "Suitcase"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@jonijawne-7292436855470050566_0", "video_path": "7292436855470050566.mp4", "subtitle_path": "7292436855470050566_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 49.8, "view_count": 2903}, {"video_id": "@lisolna-7160282657656458501", "question": "A woman wearing red lipstick and a black top is facing the mirror; what is her hairstyle at this moment?", "question_wo_referring_query": "What is her hairstyle at this moment?", "candidates": ["Wavy curls", "Short curls", "Short straight hair", "Black straight hair", "Long curls"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@lisolna-7160282657656458501_0", "video_path": "7160282657656458501.mp4", "subtitle_path": "7160282657656458501_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.78, "view_count": 1187}, {"video_id": "@jonijawne-7262037817457495302", "question": "A man with a middle-part hairstyle is wearing light blue ripped jeans and a black coat, standing in front of a white door and wall while putting on clothes. 
When the subtitle 'you can so a belt try to use outfit pieces that can stand' appears, what kind of coat is he wearing?", "question_wo_referring_query": "What kind of coat is he wearing?", "candidates": ["leather coat", "black short coat", "jacket", "cotton coat", "cotton coat"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@jonijawne-7262037817457495302_0", "video_path": "7262037817457495302.mp4", "subtitle_path": "7262037817457495302_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 36.97, "view_count": 90105}, {"video_id": "@kerstinong-7299292421882400002", "question": "A woman wearing a white short-sleeve shirt and light-colored floral shorts is in the gym in front of a mirror with a blue and black mat, holding something with both hands while doing a deep squat. What is she holding?", "question_wo_referring_query": "What is she holding with both hands while doing a deep squat?", "candidates": ["Mobile phone", "Dumbbell", "Mineral water bottle", "Barbell", "Brick"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-7299292421882400002_0", "video_path": "7299292421882400002.mp4", "subtitle_path": "7299292421882400002_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.3, "view_count": 24668}, {"video_id": "@lisolna-7192263848617905414", "question": "In a bowl with fried items such as 'pumpkin slices, grapes, and cherry tomatoes', what is happening on the screen when shredded carrots first appear at the top of the bowl?", "question_wo_referring_query": "What is happening on the screen?", "candidates": ["Eating shredded carrots", "Pouring ketchup on the shredded carrots", "Putting shredded carrots into the bowl", "Putting shredded carrots into the pot", "Pouring sauce into the bowl"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": 
"IntraMoment", "id": "@lisolna-7192263848617905414_0", "video_path": "7192263848617905414.mp4", "subtitle_path": "7192263848617905414_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.4, "view_count": 71456}, {"video_id": "@lisolna-7319899779604040992", "question": "In front of the lamp on the white table, what happened after a transparent pink flower-shaped high heel cup was taken out from a paper box?", "question_wo_referring_query": "What happened next?", "candidates": ["A black and white cat crawled on the table.", "Opened the paper box in front of the cat.", "The second high heel cup was taken out from the white wrapping paper.", "Moved the paper box containing the high heel cup in front of the cat.", "Rotated the mirror to display the high heel cup."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@lisolna-7319899779604040992_0", "video_path": "7319899779604040992.mp4", "subtitle_path": "7319899779604040992_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.27, "view_count": 17268}, {"video_id": "@lisolna-7277961827311471905", "question": "A hand picked up a bottle of purple-packaged juice from a shelf with green and orange juice. What happened after the subtitle 'I'm going to make a cake with the leftover cake. 
I so' appeared?", "question_wo_referring_query": "What happened?", "candidates": ["A hand picked up two cartons of yellow-packaged yogurt from the shelf", "A hand placed a bucket of cooking oil into a shopping cart", "A hand placed a carton of juice into a shopping cart", "A hand picked up a bottle of yogurt from the shelf", "A hand placed a bottle of beverage into a shopping cart"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "@lisolna-7277961827311471905_0", "video_path": "7277961827311471905.mp4", "subtitle_path": "7277961827311471905_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.19, "view_count": 31409}, {"video_id": "@lisolna-7315967352795909409", "question": "In front of a shelf with various vegetables, a person in a black long sleeve shirt is holding an orange pumpkin. After the subtitle 'I'm going to make a so' appears, what object appears next?", "question_wo_referring_query": "What is the object that appears next?", "candidates": ["a box of chicken breast", "a lemon", "a green pepper", "broccoli", "a tomato"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@lisolna-7315967352795909409_0", "video_path": "7315967352795909409.mp4", "subtitle_path": "7315967352795909409_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.8, "view_count": 29700}, {"video_id": "@lisolna-7295772751866989856", "question": "There is a gray plate at the bottom of the screen. 
In which other scenes does a bowl of vegetable salad in a transparent container taken from the counter appear?", "question_wo_referring_query": "In which other scenes does it appear?", "candidates": ["In a white ceramic bowl", "On the dining table of a person sitting across using a cell phone", "In a metal pot", "In the refrigerator", "On a black table"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "@lisolna-7295772751866989856_0", "video_path": "7295772751866989856.mp4", "subtitle_path": "7295772751866989856_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.67, "view_count": 222148}, {"video_id": "@kerstinong-7135010344707132673", "question": "In the gym, a woman dressed in a pink sports bra and pink yoga pants is holding a kettlebell. This pink sports bra has also appeared along with which subtitles?", "question_wo_referring_query": "Along with which subtitles has it also appeared?", "candidates": ["\"Explote jobs that range from psychologists, high performance\"", "\u201cwhere my eyes are looking at when I start and its performance.\u201d", "\u201csometimes hard to see with a naked eye. And so, sports science can help with that to improve an\u201d", "\"athletes' performance. 
Sports physiologists help athletes optimize training, performance and\"", "\u201cLearn ways to improve workout results such as food nutrition that you need for performance.\u201d"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@kerstinong-7135010344707132673_0", "video_path": "7135010344707132673.mp4", "subtitle_path": "7135010344707132673_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 52.73, "view_count": 34033}, {"video_id": "@lisolna-7284621298506927392", "question": "After placing four yellow and white pieces of dough-wrapped meat wrapped in paper inside a black oven, what change occurs to the meat when it appears next to a vegetable salad served on a white plate with blue floral patterns?", "question_wo_referring_query": ", what change occurs to the meat?", "candidates": ["Changes from whole pieces to cut pieces", "Changes from yellow-white to charred black", "Changes from yellow to olive green", "Changes from yellow to white", "Changes from yellow-white to golden yellow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@lisolna-7284621298506927392_0", "video_path": "7284621298506927392.mp4", "subtitle_path": "7284621298506927392_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.83, "view_count": 28644}, {"video_id": "@kerstinong-7351704763655785730", "question": "In a high-rise and green tree-surrounded playground, what is a girl with tied-up hair, dressed in a black short-sleeve shirt, shorts, and white sneakers doing on a red plastic track?", "question_wo_referring_query": "What is she doing?", "candidates": ["Talking on the phone", "Chatting with friends", "Walking", "Running", "Jumping rope"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-7351704763655785730_0", "video_path": "7351704763655785730.mp4", 
"subtitle_path": "7351704763655785730_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.63, "view_count": 6573}, {"video_id": "@lisolna-7253855068351499547", "question": "In the shopping cart with a metal frame, there are three bell peppers, a bottle of fruit juice, an apple wrapped in red netting, and a person is holding a yellow-green package with red XXL letters and placing it into the cart. What is the material of the fruit juice packaging?", "question_wo_referring_query": "What is the material of this fruit juice packaging?", "candidates": ["Paper cup", "Plastic", "Glass", "Paper box", "Tin can"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@lisolna-7253855068351499547_0", "video_path": "7253855068351499547.mp4", "subtitle_path": "7253855068351499547_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.93, "view_count": 33698}, {"video_id": "@lisolna-7304302312171031840", "question": "There are many food items placed on the wooden table using white plates. On the left side of the screen, there is a glass of orange juice. Next to the orange juice, there is a metal pitcher and a transparent glass. 
What is the golden pitcher pouring into the transparent glass?", "question_wo_referring_query": "What is the golden pitcher pouring into the transparent glass?", "candidates": ["fruit juice", "red wine", "coke", "tea", "water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@lisolna-7304302312171031840_0", "video_path": "7304302312171031840.mp4", "subtitle_path": "7304302312171031840_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.77, "view_count": 91330}, {"video_id": "@lisolna-7220044437685013766", "question": "On a display with many pink and white dolls in glass frames, what was the person holding a white doll doing when it first appeared?", "question_wo_referring_query": "What was being done?", "candidates": ["Choosing a toy", "Showing the doll in their hand to the camera", "Taking pictures of the toy dolls", "Touching the toy dolls", "Wrapping up the toy dolls"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@lisolna-7220044437685013766_0", "video_path": "7220044437685013766.mp4", "subtitle_path": "7220044437685013766_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.07, "view_count": 8348}, {"video_id": "@lisolna-7251590799392050458", "question": "After a hand with a ring on the middle finger took out a red-capped juice bottle from the white wall's drink cabinet, what did this hand do next?", "question_wo_referring_query": ", what did this hand do next?", "candidates": ["Took out chopsticks from a square iron container", "Picked up a piece of purple cabbage and placed it on the plate", "Picked up a piece of stir-fried yellow beans and green vegetables and placed it on the plate", "Placed shredded carrots onto the plate", "The chef fried meat on the iron plate"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": 
"@lisolna-7251590799392050458_0", "video_path": "7251590799392050458.mp4", "subtitle_path": "7251590799392050458_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 37.97, "view_count": 586822}, {"video_id": "@lisolna-7320233018507988257", "question": "Who appears after the female wearing blue clothes and a black apron, with her hair tied up, who is preparing a drink behind the glass counter where bread is displayed?", "question_wo_referring_query": ", who appears?", "candidates": ["The curly-haired man in black clothes sitting inside the coffee shop", "The male staff working inside the coffee shop", "The woman in pink clothes sitting behind the floor-to-ceiling window of the coffee shop", "The curly-haired woman sitting in front of the wooden table with a white plate, holding a drink and currently drinking", "The passerby walking outside the coffee shop"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@lisolna-7320233018507988257_0", "video_path": "7320233018507988257.mp4", "subtitle_path": "7320233018507988257_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 32.43, "view_count": 18111}, {"video_id": "@lisolna-7254167614245719323", "question": "On a gray wooden dining table, there are two white empty plates placed on the top right. In the screen, there is a plate of stir-fried vegetables with meat served on a black plate placed on the wooden table. 
After the subtitle 'knowing how spicy it is and she started crying because it actually physically hurts but we tried' appears, what is the object seen?", "question_wo_referring_query": "What is the object that appears?", "candidates": ["horse soaked in rain", "light green matcha ice cream sundae", "menu", "baked bun", "different colored drinks in a drink cabinet"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@lisolna-7254167614245719323_0", "video_path": "7254167614245719323.mp4", "subtitle_path": "7254167614245719323_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 55.33, "view_count": 30400}, {"video_id": "@lisolna-7282040659651923233", "question": "Which of the following sequences of scenes in the video is correct?", "question_wo_referring_query": "Which of the following sequences of scenes in the video is correct?", "candidates": ["First, a chef puts the cooked meat on the beautifully arranged plate, then a chef is frying meat in front of an iron plate in the kitchen, and finally, a hand takes a cheesecake from the refrigerator.", "First, a chef is frying meat in front of an iron plate in the kitchen, then a hand takes a cheesecake from the refrigerator, and finally, the chef puts the cooked meat on the beautifully arranged plate.", "First, a chef puts the cooked meat on the beautifully arranged plate, then a hand takes a cheesecake from the refrigerator, and finally, a chef is frying meat in front of an iron plate in the kitchen.", "First, a chef is frying meat in front of an iron plate in the kitchen, then the chef puts the cooked meat on the beautifully arranged plate, and finally, a hand takes a cheesecake from the refrigerator.", "First, a hand takes a cheesecake from the refrigerator, then the chef puts the cooked meat on the beautifully arranged plate, and finally, a chef is frying meat in front of an iron plate in the kitchen."], "topic_category": 
"LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "@lisolna-7282040659651923233_0", "video_path": "7282040659651923233.mp4", "subtitle_path": "7282040659651923233_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.5, "view_count": 238095}, {"video_id": "@jess.morg-7215309921653411115", "question": "A girl, wearing a low ponytail, a white shirt, and gray-green pants, is standing facing left toward the screen. She is in front of a white curtain, carrying a white single-shoulder bag with red striped patterns on her back. In what scenes has this girl appeared?", "question_wo_referring_query": "In what scenes has this girl appeared?", "candidates": ["In the kitchen", "In an amusement park", "In a caf\u00e9", "On the beach with a green can in her hand on a cloudy day", "In a shopping mall"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "@jess.morg-7215309921653411115_0", "video_path": "7215309921653411115.mp4", "subtitle_path": "7215309921653411115_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 42.9, "view_count": 769}, {"video_id": "@kerstinong-7278317749707754760", "question": "On the left side of the screen, there is a white French window. A girl with her hair tied up, wearing a blue jacket and blue sweatpants, is standing at the entrance of the bathroom. 
What changes when she, holding a handbag and a laptop, walks towards the mirror from the wooden floor in front of the black curtain?", "question_wo_referring_query": "What changes occurred?", "candidates": ["Not wearing a hat turned into wearing a hat", "The blue clothes turned into black clothes", "The tied-up hair turned into loose hair", "The blue clothes turned into white clothes", "Not wearing sunglasses turned into wearing sunglasses"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@kerstinong-7278317749707754760_0", "video_path": "7278317749707754760.mp4", "subtitle_path": "7278317749707754760_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.67, "view_count": 32123}, {"video_id": "@emsiiees-7345242839149743361", "question": "A pair of hands, wearing a white knitted glove, removes a light purple rectangular photo frame keychain from a plastic bag next to a blue shopping cart. When the caption 'everywhere this place is definitely a must visit the next time you're in North York' appears, what change occurs to this small photo frame?", "question_wo_referring_query": "What change occurs to this small photo frame?", "candidates": ["The frame is placed in a bedroom", "The frame is fitted with a photo of one woman", "The frame is fitted with a group photo of two women", "The frame is fitted with a group photo of three women", "The frame is fitted with a group photo of three men"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "@emsiiees-7345242839149743361_0", "video_path": "7345242839149743361.mp4", "subtitle_path": "7345242839149743361_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 35.3, "view_count": 29822}, {"video_id": "@kerstinong-7001129103185284353", "question": "A woman wearing a dark blue short-sleeved shirt, purple shorts, and tying her hair is doing something on a gray 
concrete ground. There is a black iron fence above a red brick wall, and many green trees outside the wall. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Jumping rope", "Walking", "Making a phone call", "Dancing", "Running"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-7001129103185284353_0", "video_path": "7001129103185284353.mp4", "subtitle_path": "7001129103185284353_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 27.83, "view_count": 6388}, {"video_id": "@jonijawne-7193007150229228805", "question": "Inside the paper box, there are many different types of cosmetics. When the hand in the video takes out a round olive-colored makeup palette, what object is present on the screen?", "question_wo_referring_query": ", what object is present on the screen?", "candidates": ["eyebrow pencil", "mobile phone", "makeup mirror", "lipstick", "pink packaging box"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@jonijawne-7193007150229228805_0", "video_path": "7193007150229228805.mp4", "subtitle_path": "7193007150229228805_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 41.55, "view_count": 887}, {"video_id": "@kerstinong-7289421146963905794", "question": "In the gym, there is a woman lying on a black piece of fitness equipment, wearing a sports bra and shorts, and lifting a black kettlebell with both hands. 
When the subtitle 'you' appears, what object is present on the screen?", "question_wo_referring_query": ", what object is present on the screen?", "candidates": ["Black sports shoes", "Large mirror", "Treadmill", "White shorts", "Black shorts"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "@kerstinong-7289421146963905794_0", "video_path": "7289421146963905794.mp4", "subtitle_path": "7289421146963905794_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 24.3, "view_count": 20767}, {"video_id": "@jess.morg-7263232350425845034", "question": "On a gray floor, a pair of hands open a box that contains many cans of beverages in different colored packages. What is the material of the box holding these cans?", "question_wo_referring_query": "What is the material of the box holding these cans?", "candidates": ["Metal box", "Wooden box", "Acrylic box", "Plastic box", "Paper box"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "@jess.morg-7263232350425845034_0", "video_path": "7263232350425845034.mp4", "subtitle_path": "7263232350425845034_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.3, "view_count": 787}, {"video_id": "@lisolna-7287216512023285025", "question": "In front of an acrylic display rack holding many pens of different colors, when the subtitle 'I surprise myself' appears, what kind of pen is the hand in the screen holding?", "question_wo_referring_query": "What kind of pen is the hand in the screen holding?", "candidates": ["Four brushes", "Black, dark blue, pink, and purple pens", "Colored pencils of different colors", "Four pencils", "Colored markers"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@lisolna-7287216512023285025_0", "video_path": "7287216512023285025.mp4", "subtitle_path": "7287216512023285025_en.json", 
"duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.37, "view_count": 27171}, {"video_id": "@kerstinong-6896457378645085441", "question": "Who is the person practicing leg lifts and jumps between the white crossbars surrounded by green trees and grass?", "question_wo_referring_query": "Who is the person?", "candidates": ["The woman wearing a black short-sleeve shirt and black shorts with her hair tied back", "The woman wearing black athletic underwear and light-colored floral shorts with a sun hat and tied-back hair", "The woman wearing athletic gear and sunglasses", "The woman wearing white athletic underwear, black shorts, and a sun hat", "The shirtless man with short hair wearing black shorts"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-6896457378645085441_0", "video_path": "6896457378645085441.mp4", "subtitle_path": "6896457378645085441_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 45.9, "view_count": 5955}, {"video_id": "@emsiiees-7352590329326996753", "question": "When a girl wearing a white coat with black innerwear and gray pants lies on a black bed for the first time, and another girl wearing a black short-sleeve shirt and a mask is sitting behind her head, with lamps on both sides facing the bed, what is the girl in the black short-sleeve shirt doing in the scene?", "question_wo_referring_query": "... 
what is the girl in the black short-sleeve shirt doing in the scene?", "candidates": ["Sending a short-sleeved message on a mobile phone", "Giving the girl lying down a facial treatment", "Chatting with the girl lying down", "Applying a facial mask to the girl lying down", "Answering a phone call"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@emsiiees-7352590329326996753_0", "video_path": "7352590329326996753.mp4", "subtitle_path": "7352590329326996753_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 34.57, "view_count": 2042}, {"video_id": "@lisolna-7280183396532309281", "question": "On the silver checkout counter in the video, there is a black cash register, two bottles of mineral water, one bottle of orange juice, and a can placed on top of it. When the caption 'I'm going to be back home. I'm so excited to be back home. I'm so excited to be back home.' appears, what is the person in the video doing?", "question_wo_referring_query": "What is the person in the video doing?", "candidates": ["Chatting with a friend", "Packing the purchased items", "Selecting goods", "Opening the orange juice bottle cap", "Settling the bill at the checkout counter"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@lisolna-7280183396532309281_0", "video_path": "7280183396532309281.mp4", "subtitle_path": "7280183396532309281_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.83, "view_count": 29149}, {"video_id": "@jonijawne-7304793585227123974", "question": "In a setting surrounded by golden decorations, a man wearing a grey scarf and khaki coat sits on a large chair. 
What does he do next?", "question_wo_referring_query": "What does he do next?", "candidates": ["Holds a yellow Winnie the Pooh toy in a store", "Jumps down from the stairs next to the decorations", "Takes a selfie in the mirror wearing a red and green wig", "Makes a heart gesture towards a mirror in front of a Christmas tree with decorations", "Stands on a white balcony looking out towards the sun"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@jonijawne-7304793585227123974_0", "video_path": "7304793585227123974.mp4", "subtitle_path": "7304793585227123974_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.6, "view_count": 1428}, {"video_id": "@lisolna-7301326399023877409", "question": "After placing two skincare products into a gray shopping cart with one handle extended in four directions, what is the first object that appears in the screen?", "question_wo_referring_query": "What is the first object that appears in the screen?", "candidates": ["A black bear-shaped ornament", "An olive-colored bottle of shampoo", "A yellow round package of face mask", "A transparent glass cup with a Christmas tree pattern placed on the shelf", "A dark green glass cup in the shape of a Christmas tree"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@lisolna-7301326399023877409_0", "video_path": "7301326399023877409.mp4", "subtitle_path": "7301326399023877409_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.2, "view_count": 40680}, {"video_id": "@lisolna-7296518225066495264", "question": "On a grey marble table, there is a dining plate with many crushed ice cubes, and on top of them, there are several transparent cups filled with white yogurt, mango sauce, and a combination of thin mint leaves. 
In addition to this dessert, what subtitles have also appeared at the same time?", "question_wo_referring_query": ", in addition to this dessert, what subtitles have also appeared at the same time?", "candidates": ["\u201cThis is really delicious\u201d", "\u201cPut the chicken on the plate\u201d", "\u201cPut the dessert on the plate\u201d", "\u201cThis dessert looks good\u201d", "\u201cI'm going to eat this\u201d"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@lisolna-7296518225066495264_0", "video_path": "7296518225066495264.mp4", "subtitle_path": "7296518225066495264_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.72, "view_count": 256779}, {"video_id": "@kerstinong-7070180197017898241", "question": "In a green screen smartphone interface, the top left contains a row of emojis. Beneath the emojis, there is a dialogue box with a white background that says 'Buy buy'. Below the dialogue box, there are some operation options. 
When the dialogue box below the row of emojis changes to a green background and says 'Noooooooo', what changes occur to the row of emojis?", "question_wo_referring_query": ", what changes occur to the row of emojis?", "candidates": ["The emojis gradually disappear", "The emojis move from the left side of the screen to the bottom left", "The emojis move from the left side of the screen to the right side", "The emojis move from the left side of the screen to the top", "The emojis move from the left side of the screen to the bottom right"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@kerstinong-7070180197017898241_0", "video_path": "7070180197017898241.mp4", "subtitle_path": "7070180197017898241_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.6, "view_count": 111758}, {"video_id": "@lisolna-7272043694084197664", "question": "In the sunlight streaming through the car windows, there are many green plants and potted flowers outside. A red flower pot appears in the upper part of the rearview mirror on the left side of the screen. 
Which subtitles have appeared at the same time?", "question_wo_referring_query": "Which subtitles have appeared at the same time?", "candidates": ["\"Because we don't have plants at home and I find it kind of\"", "\u201cNo, just kidding\u201d", "\"So this time we went to pick some plants for the outside\"", "\"If you don't love plants, I'm sorry, we can't be friends.\"", "\"This is one of my favorite activities to do with my mom\""], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "@lisolna-7272043694084197664_0", "video_path": "7272043694084197664.mp4", "subtitle_path": "7272043694084197664_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 48.37, "view_count": 8162}, {"video_id": "L6MtA_LR_m8", "question": "The screen shows an aerial view of a parking lot with some cars parked in it. On the right is a road divided into two lanes by yellow lines. There is a white car parked a bit larger than the other cars on the side of the parking spot. There are two people next to the white car, and they have a gurney between them. What are these two people doing?", "question_wo_referring_query": "What are these two people doing?", "candidates": ["One is pushing the gurney while the other walks ahead", "One is pushing the gurney while the other runs ahead", "They are hugging", "One is pushing the gurney while the other is leading the way", "They are shaking hands"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "L6MtA_LR_m8_0", "video_path": "L6MtA_LR_m8.mp4", "subtitle_path": "L6MtA_LR_m8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 70, "duration": 8.0, "view_count": 12650}, {"video_id": "6dL9KepCgck", "question": "The screen shows a black background with many blue and green stripes. 
In the center, there is a square frame, within which the screen displays two people wearing baseball uniforms standing on the side of a grassy path facing the camera. They are wearing jerseys and caps. What objects are present in the screen?", "question_wo_referring_query": "What objects are present in the screen?", "candidates": ["Yellow uniform number 17", "White uniform number 37", "Black uniform number 37", "White uniform number 17", "Black uniform number 17"], "topic_category": "KH-Knowledge-History", "question_category": "S2O", "level": "IntraMoment", "id": "6dL9KepCgck_0", "video_path": "6dL9KepCgck.mp4", "subtitle_path": "6dL9KepCgck_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1285, "duration": 10.01, "view_count": 916140}, {"video_id": "xGJzCTd2y0k", "question": "A black-haired man in a gray T-shirt is standing in front of a mirror in a room. The room behind the man is white, with lighting, a yellow wooden floor, and white walls. When the man mentions 'Rock climbing or kayaking, stuff like that.', what object is not present in the scene?", "question_wo_referring_query": "What object is not present in the scene?", "candidates": ["Yellow baseboard", "Yellow paper scattered on the floor", "White baseboard", "A wooden coffee table", "A small sofa with yellow striped fabric"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2O", "level": "IntraMoment", "id": "xGJzCTd2y0k_0", "video_path": "xGJzCTd2y0k.mp4", "subtitle_path": "xGJzCTd2y0k_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 283, "duration": 11.01, "view_count": 139828}, {"video_id": "JGZpjlQAno4", "question": "The screen shows a lake, surrounded by some small islands. There are many trees on the small islands, and above the trees is a bright sunny sky. 
What is the state of the lake water in the scene?", "question_wo_referring_query": "What is the state of the lake water in the scene?", "candidates": ["The lake water is splashing and making many waves.", "The lake water is stagnant and not flowing.", "The lake water is flowing gently.", "The lake water is turbulent and choppy.", "The lake water is flowing rapidly."], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "JGZpjlQAno4_0", "video_path": "JGZpjlQAno4.mp4", "subtitle_path": "JGZpjlQAno4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 146, "duration": 12.01, "view_count": 4502}, {"video_id": "wPnIv_20tvM", "question": "In the video, under a blue sky with white clouds, on the left side, there are some red buildings. The first floor of the buildings has white brick arch decorations, and under the arches are shop doors made of glass. The entire first floor of the buildings consists of shops. Pedestrians walk on the streets beside the shops, and there are also some street trees. In the middle of the screen, three people are walking, one of whom is a long-haired woman dressed in red and white clothes. When she is mentioned as 'fun kind of vibrant cities however from,' what type of pants is she wearing?", "question_wo_referring_query": "What type of pants is she wearing?", "candidates": ["Short jeans", "Long jeans", "Silk long pants", "White long skirt", "Purple dress"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "wPnIv_20tvM_0", "video_path": "wPnIv_20tvM.mp4", "subtitle_path": "wPnIv_20tvM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 200, "duration": 13.98, "view_count": 43913}, {"video_id": "64pH_OzfSLQ", "question": "The screen is divided into two sections. On the left, there are some workers walking on the road. How many workers are there in the scene? There is also a woman wearing a dress. 
On the right side, there is a woman with black curly hair wearing a red, floral-collared outfit. She is speaking while wearing white earphones. Who is the person wearing a white hat in the scene?", "question_wo_referring_query": "Who is the person wearing a white hat in the scene?", "candidates": ["The person wearing a floral dress", "The person wearing a white polo shirt", "The person wearing a black shirt", "The person wearing an orange T-shirt", "The person wearing a white shirt"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "64pH_OzfSLQ_0", "video_path": "64pH_OzfSLQ.mp4", "subtitle_path": "64pH_OzfSLQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 374, "duration": 14.0, "view_count": 110776}, {"video_id": "Q-O7gvg5_VI", "question": "The screen shows an animation depicting a barbershop. The signboard has the words 'BARBER SHOP' written in yellow letters. Below, the door has blue and pink 'OPEN' stickers. Inside, there are drawings of two chairs and black and white checkered tiles. On the wall next to the door, there is a red, blue, and white lamp box. What happened the first time this lamp box appeared?", "question_wo_referring_query": "What happened the first time this lamp box appeared?", "candidates": ["The lamp box was playing music", "The lamp box was flashing", "The lamp box was rotating", "The lamp box was changing colors", "The lamp box emitted a bright light"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "Q-O7gvg5_VI_0", "video_path": "Q-O7gvg5_VI.mp4", "subtitle_path": "Q-O7gvg5_VI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 211, "duration": 13.0, "view_count": 1504185}, {"video_id": "NVeVNuC34wI", "question": "In the video, there is a man wearing a white T-shirt lying on a somewhat narrow white sofa, talking to someone off-camera. The lighting in the room is a bit dim. 
What did the man in the video do after mentioning 'Why'?", "question_wo_referring_query": "What did the man in the video do?", "candidates": ["The man was getting up", "The man made a victory hand gesture towards the camera", "The man's hand was holding another person's hand", "The man turned over and fell asleep", "The man was throwing away trash"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "NVeVNuC34wI_0", "video_path": "NVeVNuC34wI.mp4", "subtitle_path": "NVeVNuC34wI_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 100, "duration": 14.02, "view_count": 2928950}, {"video_id": "9UJDle0QlkE", "question": "In the screen, there is a blue animated character. He is wearing a blue hat. His room has a yellow floor. There is a small sofa in the background, and a painting on the wall. On the far right, there is a bookshelf. The little blue character is explaining based on the content shown in a beige background on the right side of the screen. Which concept in the beige background appears first?", "question_wo_referring_query": "Which concept in the beige background appears first?", "candidates": ["I'LL B.C-ING", "Music", "YOU LATER", "Sources", "BY Attribution 4.0 License"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "9UJDle0QlkE_0", "video_path": "9UJDle0QlkE.mp4", "subtitle_path": "9UJDle0QlkE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 261, "duration": 11.01, "view_count": 3030}, {"video_id": "7gZvlv_5p6g", "question": "A man is sitting in a room with bookshelves and numerous papers and photos on the walls. He is seated in front of a mirror, wearing a black T-shirt, and has a cloth with many patterns and a green band with red and white patterns over him. The room has wooden flooring. 
The man is sitting with his head tilted and eyes closed, holding a small bag with one hand while the other hand is inside the bag. What happened after he said, 'Right, so I'm just gonna pick one out not gonna laugh like a look-see'?", "question_wo_referring_query": "What happened?", "candidates": ["The flag of Venezuela popped up in the top right corner of the screen.", "The flag of Venezuela popped up in the bottom left corner of the screen.", "The flag of Venezuela popped up in the top left corner of the screen.", "The flag of Venezuela popped up in the middle of the screen.", "The flag of Venezuela popped up in the bottom right corner of the screen."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "7gZvlv_5p6g_0", "video_path": "7gZvlv_5p6g.mp4", "subtitle_path": "7gZvlv_5p6g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 554, "duration": 9.01, "view_count": 138243}, {"video_id": "zqCPVTMP8P4", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a woman dressed in a red knit sweater with golden hair is shown speaking in front of a mirror, followed by a scene where a man in gray appears behind the woman, and finally a scene where the woman touches her face with one hand.", "First, a scene where the woman touches her face with one hand, followed by a man in gray appearing behind the woman, and finally a scene where a woman dressed in a red knit sweater with golden hair is shown speaking in front of a mirror.", "First, a woman dressed in a red knit sweater with golden hair is shown speaking in front of a mirror, followed by a scene where the woman touches her face with one hand, and finally a scene where a man in gray appears behind the woman.", "First, a scene where the woman touches her face with one hand, followed by a woman dressed in a red knit sweater with 
golden hair speaking in front of a mirror, and finally a scene where a man in gray appears behind the woman.", "First, a scene where a man in gray appears behind the woman, followed by a woman dressed in a red knit sweater with golden hair speaking in front of a mirror, and finally a scene where the woman touches her face with one hand."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "zqCPVTMP8P4_0", "video_path": "zqCPVTMP8P4.mp4", "subtitle_path": "zqCPVTMP8P4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 314, "duration": 13.0, "view_count": 26288}, {"video_id": "bG7oeflLWbY", "question": "In the scene, there is a man with some gray hair wearing a red shirt. He has a single-lens reflex camera hanging in front of him and is standing by a lakeside. Across the lake, there is an ancient high-rise building in black and white, with many pointed black roofs and some white trim. The sky is white. In which scene does this building also appear?", "question_wo_referring_query": "In which other scene does this building appear?", "candidates": ["Under a sky filled with the setting sun", "Under a blue sky, with a stone embankment below the building", "In the middle of a forest", "In a scenic area with many yellow leaves", "In an area with a lot of snow"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "bG7oeflLWbY_0", "video_path": "bG7oeflLWbY.mp4", "subtitle_path": "bG7oeflLWbY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 42, "duration": 13.0, "view_count": 352337}, {"video_id": "4EAVCS1rqMg", "question": "The scene features an aerial shot of a building, with a road and trees alongside it. Most of the structures are relatively short buildings, and there's an area that seems to be a park with yellow pathways. In the distance, you can see some water, and further away, the sky. 
What is the object present in the scene?", "question_wo_referring_query": "What is the object present in the scene?", "candidates": ["White tall building", "Orange building", "Blue short building", "Pink short building", "Black ancient building"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "4EAVCS1rqMg_0", "video_path": "4EAVCS1rqMg.mp4", "subtitle_path": "4EAVCS1rqMg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 57, "duration": 9.01, "view_count": 3597}, {"video_id": "TAfGMXTMX74", "question": "This is an indoor volleyball court. Some female athletes wearing yellow training uniforms and knee pads are playing volleyball. The two athletes in the middle of the frame are bent over and ready to receive the ball. When referring to 'questions', what items are present in the screen?", "question_wo_referring_query": "What items are present in the screen?", "candidates": ["White shorts", "Blue shorts", "Yellow shoes", "Black shorts", "Blue training uniforms"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "TAfGMXTMX74_0", "video_path": "TAfGMXTMX74.mp4", "subtitle_path": "TAfGMXTMX74_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 39, "duration": 10.0, "view_count": 24436}, {"video_id": "-II-ALmFvFM", "question": "The screen shows four ancient figures holding shields and long spears, with a large elephant beside them. The person at the front wearing a headband is draped in a green robe, while the three people behind him are dressed only in hemp ropes and white cloth. They have very dark skin and well-developed muscles. 
What shape are the shields that the people dressed in hemp ropes and white cloth are holding?", "question_wo_referring_query": "What shape are the shields held by the people dressed in hemp ropes and white cloth?", "candidates": ["Circular", "Rectangular", "Long wave-like", "Hexagonal", "Oval"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "-II-ALmFvFM_0", "video_path": "-II-ALmFvFM.mp4", "subtitle_path": "-II-ALmFvFM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 383, "duration": 13.01, "view_count": 84702}, {"video_id": "IXFgu93SnMc", "question": "The screen shows a black background with content written in a red and white rectangular frame, along with a picture. In the bottom right corner, there is a small inset where a woman is standing at a podium, with a computer in front of her. The woman is wearing black clothes. What did the woman do the first time she appeared?", "question_wo_referring_query": "What did the woman do the first time she appeared?", "candidates": ["The woman raised both hands", "The woman was shaking hands with someone", "The woman touched her hair", "The woman raised one hand", "The woman was nodding"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "IXFgu93SnMc_0", "video_path": "IXFgu93SnMc.mp4", "subtitle_path": "IXFgu93SnMc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1783, "duration": 9.01, "view_count": 4219}, {"video_id": "tbq53JvDa8g", "question": "The image features a woman with black hair wearing glasses and a white T-shirt. Behind her is a wall covered with many photos and a bookshelf filled with books and office supplies. 
What did the woman do when a really cool way to celebrate getting was mentioned?", "question_wo_referring_query": "What did the woman do?", "candidates": ["The woman touched her face", "The woman raised both hands", "The woman raised one hand", "The woman picked up a book", "The woman was playing the piano"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "tbq53JvDa8g_0", "video_path": "tbq53JvDa8g.mp4", "subtitle_path": "tbq53JvDa8g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 29, "duration": 8.0, "view_count": 79931}, {"video_id": "9sn033Kj1zE", "question": "A woman is in a kitchen. Behind her is a white counter with some kitchenware and green plants on it. In front of her is a yellow table, which has many bowls filled with food and a round tray with four cupcakes on it. After the woman smiles at the camera, what happens?", "question_wo_referring_query": "What happens next?", "candidates": ["The scene changes to a woman and a little boy shaking hands.", "The scene changes to a woman and a little girl shaking hands.", "The scene changes to a woman and a little girl talking.", "The scene changes to a woman and a little boy talking.", "The scene changes to a woman and a little girl hugging."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "9sn033Kj1zE_0", "video_path": "9sn033Kj1zE.mp4", "subtitle_path": "9sn033Kj1zE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 42, "duration": 10.01, "view_count": 294154}, {"video_id": "9yLB8qZY3uQ", "question": "On the screen, there is a pot with fried spaghetti inside. One hand is holding a utensil and a chef's wiping cloth and rubbing it inside the pot. 
Which utensil is used first to stir the spaghetti in the screen?", "question_wo_referring_query": "Which utensil is used first to stir the spaghetti on the screen?", "candidates": ["chopsticks", "tongs", "chef's wiping cloth", "ladle", "spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "9yLB8qZY3uQ_0", "video_path": "9yLB8qZY3uQ.mp4", "subtitle_path": "9yLB8qZY3uQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 278, "duration": 9.0, "view_count": 1472262}, {"video_id": "wLwXEFMchcQ", "question": "The screen shows a split view of the news segment. The man on the left is against a white background and is wearing a black suit, with a very long tie. On the right side, against a green background, is an older man with gray hair, wearing a gray suit and glasses. After the term 'been' is mentioned, what happened?", "question_wo_referring_query": "What happened?", "candidates": ["The screen disappeared", "The screen zoomed out of the man with the green background", "The screen zoomed out of the man with the white background", "The screen played the man with the green background in fullscreen", "The screen zoomed in on the man with the white background"], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "wLwXEFMchcQ_0", "video_path": "wLwXEFMchcQ.mp4", "subtitle_path": "wLwXEFMchcQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 45, "duration": 9.0, "view_count": 433407}, {"video_id": "FfAPCf0z2Mc", "question": "The person on the screen is a black woman wearing a pink apron. She has curly hair and is in a kitchen. In front of her is a cutting board with a piece of meat on it, along with four utensils. On the counter behind her, there are some seasonings and items like bananas. 
What is the first thing that appears on the screen before the woman mentions 'so usually when you flip your steak'?", "question_wo_referring_query": "What is the first thing that appears on the screen?", "candidates": ["Oiled vegetables on a white iron plate", "Oiled vegetables on a black iron plate", "Oiled fruits on a black iron plate", "Oiled meat on a black iron plate", "Oiled meat on a gray iron plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "FfAPCf0z2Mc_0", "video_path": "FfAPCf0z2Mc.mp4", "subtitle_path": "FfAPCf0z2Mc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 114, "duration": 11.01, "view_count": 272879}, {"video_id": "Q33Z151cBRM", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a person wearing glasses and a T-shirt is shown picking vegetables in front of the stove. Then, a clip of a soy sauce bottle pouring sauce into a yellow bowl is shown. Finally, a clip of powdered seasoning being added to noodles is shown.", "First, a clip of powdered seasoning being added to noodles is shown. Then, a person wearing glasses and a T-shirt is shown picking vegetables in front of the stove. Finally, a clip of a soy sauce bottle pouring sauce into a yellow bowl is shown.", "First, a clip of a soy sauce bottle pouring sauce into a yellow bowl is shown. Then, a person wearing glasses and a T-shirt is shown picking vegetables in front of the stove. Finally, a clip of powdered seasoning being added to noodles is shown.", "First, a clip of powdered seasoning being added to noodles is shown. Then, a clip of a soy sauce bottle pouring sauce into a yellow bowl is shown. 
Finally, a person wearing glasses and a T-shirt is shown picking vegetables in front of the stove.", "First, a clip of a soy sauce bottle pouring sauce into a yellow bowl is shown, with eggs also in the bowl. Then, a clip of powdered seasoning being added to noodles is shown. Finally, a person wearing glasses and a T-shirt is shown picking vegetables in front of the stove."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "Q33Z151cBRM_0", "video_path": "Q33Z151cBRM.mp4", "subtitle_path": "Q33Z151cBRM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 343, "duration": 8.01, "view_count": 87096}, {"video_id": "eDrCuOEJA5w", "question": "In the scene, there is a white plate with a small piece of cake on it. A spoon is scooping up a small piece. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Pink cream", "Fingers with nail polish", "Rock slab surface", "Blue cream", "Coffee-colored cream"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "eDrCuOEJA5w_0", "video_path": "eDrCuOEJA5w.mp4", "subtitle_path": "eDrCuOEJA5w_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 273, "duration": 11.0, "view_count": 7088}, {"video_id": "Gf4rpcQjqM8", "question": "The screen shows an interview scene. Two women are standing facing each other, having a conversation. Both of them are wearing black clothes. The background where they are standing is a deep blue curtain with three American flags. The woman on the left has blonde hair, and the woman on the right has brown hair. 
What type of clothing is the woman with brown hair wearing?", "question_wo_referring_query": "What type of clothing is the woman with brown hair wearing?", "candidates": ["black shirt", "black coat", "black silk dress", "black suit", "black backless dress"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "Gf4rpcQjqM8_0", "video_path": "Gf4rpcQjqM8.mp4", "subtitle_path": "Gf4rpcQjqM8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 36, "duration": 12.01, "view_count": 22270}, {"video_id": "TxKKivpeNmM", "question": "A woman is standing in front of a screen, admiring a painting. She is in a white gallery with frames on the white walls. The woman has black hair, is wearing black clothing, and is also wearing some accessories. She has bracelets on her hand. When the phrase 'creates this very dynamic rhythmic' is mentioned, what type of clothing is the woman wearing?", "question_wo_referring_query": "What type of clothing is the woman wearing?", "candidates": ["black T-shirt", "black shirt", "black dress", "black vest", "black suspenders"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "TxKKivpeNmM_0", "video_path": "TxKKivpeNmM.mp4", "subtitle_path": "TxKKivpeNmM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 58, "duration": 12.97, "view_count": 33209}, {"video_id": "nCC3ULumJLw", "question": "On the screen, there is a white table with a wooden cutting board on it, a green plant beside the cutting board, a transparent meat grinder on top of the cutting board, and a black lid ready to cover the meat grinder. 
What food is placed inside the meat grinder?", "question_wo_referring_query": "What food is placed inside the meat grinder?", "candidates": ["Tofu", "Meat chunks", "Bell pepper", "Green vegetables", "Garlic"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "nCC3ULumJLw_0", "video_path": "nCC3ULumJLw.mp4", "subtitle_path": "nCC3ULumJLw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 41, "duration": 11.0, "view_count": 51787}, {"video_id": "DUAmuCYTQjk", "question": "In the scene, there is a blurry background with a granular dough that looks a bit moist in front of the camera. The dough is also tied with a string. A hand is holding a brush, and there is another rolling pin in the distance. What happens the first time the brush appears?", "question_wo_referring_query": "What happens the first time the brush appears?", "candidates": ["Brushing the dough with the brush", "Using the brush to pick up the flour", "Painting the dough", "Sweeping the flour on the board", "Brushing off the ash on the dough"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "DUAmuCYTQjk_0", "video_path": "DUAmuCYTQjk.mp4", "subtitle_path": "DUAmuCYTQjk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 92, "duration": 10.01, "view_count": 359211}, {"video_id": "-xiPqQtMLNQ", "question": "The screen shows a wooden table. There is a pair of hands wearing black gloves holding a dough ball. 
What are the hands in the screen doing?", "question_wo_referring_query": "What are the hands in the screen doing?", "candidates": ["Stretching the dough ball into a long strip", "Filling the dough ball with stuffing", "Kneading the dough ball", "Molding the dough ball into a shape", "Dividing the dough ball into two pieces"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "-xiPqQtMLNQ_0", "video_path": "-xiPqQtMLNQ.mp4", "subtitle_path": "-xiPqQtMLNQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 234, "duration": 10.0, "view_count": 5048}, {"video_id": "icNgyDLnO-g", "question": "Who is the first character to appear in the video?", "question_wo_referring_query": "Which one is it?", "candidates": ["The person wearing a white shirt with an olive belt and no mustache", "The person wearing a black shirt with an olive belt and sporting a mustache", "The person wearing a white shirt with an olive belt and sporting a mustache", "The person wearing a black shirt with an olive belt and no mustache", "The person wearing a blue shirt with an olive belt and sporting a mustache"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "icNgyDLnO-g_0", "video_path": "icNgyDLnO-g.mp4", "subtitle_path": "icNgyDLnO-g_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 176, "duration": 12.01, "view_count": 1814}, {"video_id": "fw3FYfcR6GE", "question": "The screen shows a woman positioned in front of a white wall with a blackboard hanging on it. She has yellow hair and is wearing black-rimmed glasses. She is speaking to the camera. 
What happens after she mentions 'calling for weeks now for an immediat'?", "question_wo_referring_query": "What happens after?", "candidates": ["The woman hugged someone before continuing to speak.", "The woman lightly shook her head before continuing to speak.", "The woman shook hands with someone before continuing to speak.", "The woman raised her hand before continuing to speak.", "The woman touched her hair."], "topic_category": "NP-News-Programs", "question_category": "T3E", "level": "L2-Relation", "id": "fw3FYfcR6GE_0", "video_path": "fw3FYfcR6GE.mp4", "subtitle_path": "fw3FYfcR6GE_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 279, "duration": 11.0, "view_count": 30644}, {"video_id": "7FpeWSFTFRk", "question": "The scene depicts a painting with many soldiers. There is a person sitting in front of a stage draped in red cloth, surrounded by several other people. Below the stage, there's a person in white robes talking to the person sitting on the stage. The scene is filled with people and some horses, resembling an ancient battlefield gathering soldiers. Who is the first person wearing a headband that appears after mentioning 'initially very successful and then not'?", "question_wo_referring_query": "The scene depicts a painting with many soldiers. There is a person sitting in front of a stage draped in red cloth, surrounded by several other people. Below the stage, there's a person in white robes talking to the person sitting on the stage. The scene is filled with people and some horses, resembling an ancient battlefield gathering soldiers. 
Who is the first person wearing a headband that appears after mentioning 'initially very successful and then not'?", "candidates": ["A portrait against a gray background, of a woman wearing a pink dress with a headband", "A portrait against a gray background, of a woman wearing an olive dress with a headband", "A portrait against a gray background, of a woman wearing a blue dress with a headband", "A portrait against a gray background, of a woman wearing a green dress with a headband", "A portrait against a gray background, of a woman wearing a white dress with a headband"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "7FpeWSFTFRk_0", "video_path": "7FpeWSFTFRk.mp4", "subtitle_path": "7FpeWSFTFRk_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 91, "duration": 9.0, "view_count": 2244346}, {"video_id": "gK6m1CkoqA8", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a woman wearing a hat by the river is shown, followed by a scene of a hand holding a green stone, and finally, a scene of the woman standing by the river is shown.", "First, a woman wearing a hat by the river is shown, followed by a scene of the woman standing by the river, and finally, a scene of a hand holding a green stone is shown.", "First, a scene of a woman standing by the river is shown, followed by a scene of a woman wearing a hat by the river, and finally, a scene of a hand holding a green stone is shown.", "First, a scene of a hand holding a green stone is shown, followed by a scene of a woman wearing a hat by the river, and finally, a scene of a woman standing by the river is shown.", "First, a scene of a woman standing by the river is shown, followed by a scene of a hand holding a green stone, and finally, another scene of a woman wearing a hat by the river is shown."], "topic_category": 
"LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "gK6m1CkoqA8_0", "video_path": "gK6m1CkoqA8.mp4", "subtitle_path": "gK6m1CkoqA8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 74, "duration": 8.0, "view_count": 1002798}, {"video_id": "VbQ-gZ876xQ", "question": "The screen shows a somewhat narrow hallway. A person dressed in black is holding a little girl and helping her change shoes. The girl has small braids and is wearing a white dress with pink flowers. In the bottom right corner, there is a split screen where a person wearing a floral shirt is watching. In which scene does this little girl also appear?", "question_wo_referring_query": "The screen shows a somewhat narrow hallway. A person dressed in black is holding a little girl and helping her change shoes. The girl has small braids and is wearing a white dress with pink flowers. In the bottom right corner, there is a split screen where a person wearing a floral shirt is watching. In which scene does this little girl also appear?", "candidates": ["In a white hallway with a black door", "In a black hallway with a white door", "In a yellow hallway with a black door", "In a yellow hallway with a white door", "In a white hallway with a white door"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "VbQ-gZ876xQ_0", "video_path": "VbQ-gZ876xQ.mp4", "subtitle_path": "VbQ-gZ876xQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 42, "duration": 8.0, "view_count": 2524}, {"video_id": "OHucv0YV9c4", "question": "In the video, there is a flower ring woven from vines. A pair of hands is holding the ring. Below it, there are some leaves placed on a wooden board. 
What transformation occurs to the ring in a scene with scissors?", "question_wo_referring_query": "What kind of transformation occurs?", "candidates": ["The ring is woven into a net.", "The ring is cut off.", "A bunch of white flowers is attached to the ring.", "The vines around the ring are cut off.", "The ring is inserted with leaves."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "OHucv0YV9c4_0", "video_path": "OHucv0YV9c4.mp4", "subtitle_path": "OHucv0YV9c4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 228, "duration": 12.01, "view_count": 370915}, {"video_id": "UYZ1QXQoHdQ", "question": "There is a hand holding a cup in the screen, along with a small white tablet. The background material looks very soft, likely on a bed. At the bottom of the screen, the floral edge of a bedsheet is visible. Which object is not present in the screen?", "question_wo_referring_query": "Which object is not present in the screen?", "candidates": ["Off-white blanket", "Black text", "Gray sleeve", "Coca-Cola", "Coffee"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "UYZ1QXQoHdQ_0", "video_path": "UYZ1QXQoHdQ.mp4", "subtitle_path": "UYZ1QXQoHdQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 24, "duration": 13.0, "view_count": 6504}, {"video_id": "s45PsbHKKHM", "question": "On the screen, there is a short-haired man wearing a checkered shirt speaking. The background behind him is a mustard yellow color, with two black rectangles on either side of him. 
What object is present on the screen when he mentions 'right I don't have mine anyway there's'?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["COMMENT", "Black short hair", "WEIRD HISTORY", "Golden short hair", "Checkered shirt with red and gray squares"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "s45PsbHKKHM_0", "video_path": "s45PsbHKKHM.mp4", "subtitle_path": "s45PsbHKKHM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 254, "duration": 13.01, "view_count": 44349}, {"video_id": "C2lQ68C5opU", "question": "In the scene, there is a short-haired man wearing a black coat standing in front of a pile of wood. Nearby, there is also a yellow and black machine with the word 'Tigerca' on it. In the distance, there is a forest and the sky. What is the man in the scene doing?", "question_wo_referring_query": "What is the man in the scene doing?", "candidates": ["Fixing his hair", "Looking at the phone in his hand", "Looking into the distance", "Adjusting his clothes", "Gesturing with his hands in front of a mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "C2lQ68C5opU_0", "video_path": "C2lQ68C5opU.mp4", "subtitle_path": "C2lQ68C5opU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1528, "duration": 11.97, "view_count": 68695}, {"video_id": "kD6YFXHK_z8", "question": "In the video, there is a wooden table with a black pot on it, a hand holding a transparent bowl, and a pair of tongs. 
What happens when these red tongs appear?", "question_wo_referring_query": "What happens?", "candidates": ["The tongs pick up meat from the bowl and place it into the pot.", "The tongs pick up meat from the bowl.", "The stand is placed into the pot.", "The tongs take meat from the pot and put it into the bowl.", "Meat is taken out of the pot."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "kD6YFXHK_z8_0", "video_path": "kD6YFXHK_z8.mp4", "subtitle_path": "kD6YFXHK_z8_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 342, "duration": 9.01, "view_count": 132248}, {"video_id": "UvCUyd_gC6U", "question": "The screen shows a wooden table with some vegetables and fruits stacked at the top right corner. Inside a red square bowl are chicken breast, tomatoes, and vegetables. A pair of hands is mixing the salad with two forks. What happens after the salad is mixed?", "question_wo_referring_query": "What happens next?", "candidates": ["A person appears and starts eating the salad", "A pair of hands holds a wooden bowl filled with the freshly mixed salad", "A hand starts adding yogurt into the bowl", "A hand starts adding olive oil into the bowl", "A hand starts transferring the salad into the wooden bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "UvCUyd_gC6U_0", "video_path": "UvCUyd_gC6U.mp4", "subtitle_path": "UvCUyd_gC6U_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 232, "duration": 9.01, "view_count": 76812}, {"video_id": "Rsbzdlh-7TU", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a scene of a boy in a black and white striped shirt standing in front of a plastic booth is played, then a scene of a small dog jumping in front of a mirror is played, and finally a scene of a man 
facing away from the mirror, wearing a duckbill cap, and carrying a black backpack is played.", "First, a scene of a man facing away from the mirror, wearing a duckbill cap, and carrying a black backpack is played, then a scene of a small dog jumping in front of a mirror is played, and finally a scene of a boy in a black and white striped shirt standing in front of a plastic booth is played.", "First, a scene of a boy in a black and white striped shirt standing in front of a plastic booth is played, then a scene of a man facing away from the mirror, wearing a duckbill cap, and carrying a black backpack is played, and finally a scene of a small dog jumping in front of a mirror is played.", "First, a scene of a small dog jumping in front of a mirror is played, then a scene of a boy in a black and white striped shirt standing in front of a plastic booth is played, and finally a scene of a man facing away from the mirror, wearing a duckbill cap, and carrying a black backpack is played.", "First, a scene of a small dog jumping in front of a mirror is played, then a scene of a man facing away from the mirror, wearing a duckbill cap, and carrying a black backpack is played, and finally a scene of a boy in a black and white striped shirt standing in front of a plastic booth is played."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "Rsbzdlh-7TU_0", "video_path": "Rsbzdlh-7TU.mp4", "subtitle_path": "Rsbzdlh-7TU_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 276, "duration": 9.01, "view_count": 228811}, {"video_id": "J-mf3tlrFQY", "question": "On a white table in the video, there is a cutting board with a pair of hands chopping a pumpkin. Next to the cutting board, there's a metal tray with a white teapot on it. 
In which other scene does the pumpkin appear?", "question_wo_referring_query": "In which other scene does the pumpkin appear?", "candidates": ["Inside a white ceramic pot on a white stove", "Inside a yellow ceramic pot on a white stove", "Inside a pink ceramic pot on a white stove", "Inside a gray ceramic pot on a white stove", "Inside a black ceramic pot on a white stove"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "J-mf3tlrFQY_0", "video_path": "J-mf3tlrFQY.mp4", "subtitle_path": "J-mf3tlrFQY_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 329, "duration": 11.0, "view_count": 17140}, {"video_id": "4Ue6M0nb6x4", "question": "In the scene, there is a news segment. On the left is a woman with black hair wearing a pink outfit, and on the right is a man wearing glasses and a black suit. When did these two characters and which subtitles appear simultaneously?", "question_wo_referring_query": ", when did these two characters and which subtitles appear simultaneously?", "candidates": ["h", "but", "so", "uh", "looming uh can the house continue with"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "4Ue6M0nb6x4_0", "video_path": "4Ue6M0nb6x4.mp4", "subtitle_path": "4Ue6M0nb6x4_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 138, "duration": 12.0, "view_count": 45183}, {"video_id": "utIGoFHhQ7s", "question": "The scene shows the cabin of a large vehicle with several people inside, including a child in green clothes, a man in checkered clothing, and a woman in red and white striped clothing holding a child. The screen is somewhat blurry. Outside the cabin, there are green trees, a dirt road, and a soldier. 
What change occurred to the woman holding the child after the phrase 'many Armenians in Nagorno-Karabakh feel' was mentioned?", "question_wo_referring_query": "What change occurred to the woman holding the child?", "candidates": ["Put down the hand supporting the cabin roof", "Prepared to get off the vehicle", "Engaged in conversation with the man next to her", "Put down the child", "Started tying shoelaces"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "utIGoFHhQ7s_0", "video_path": "utIGoFHhQ7s.mp4", "subtitle_path": "utIGoFHhQ7s_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 51, "duration": 11.0, "view_count": 56386}, {"video_id": "IpM8_M6ktsg", "question": "In the video, there is a BBC NP-News-Program where a male presenter is standing in front of a screen talking. The screen is displaying an image, and there are purple light boxes on both sides of the screen. The man is wearing a grey suit and has short black hair. What action does the man take?", "question_wo_referring_query": "What action does the man take?", "candidates": ["The man takes out a pen", "The man points at the screen", "The man spreads the palms of both hands", "The man looks at the screen behind him", "The man spreads the palm of one hand"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "IpM8_M6ktsg_0", "video_path": "IpM8_M6ktsg.mp4", "subtitle_path": "IpM8_M6ktsg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 56, "duration": 14.0, "view_count": 52112}, {"video_id": "hhsGT6PH7Xg", "question": "In the screen, there is a woman wearing white clothes, holding her face with her hands and looking downwards. She has black hair and is in a room with an ancient, simple decor. There is a painting on the white wall behind her. 
What is the object that does not exist in the screen?", "question_wo_referring_query": "What is the object that does not exist in the screen?", "candidates": ["Earrings", "Yellow painting", "Mineral water bottle", "Cola bottle", "Blue painting"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "hhsGT6PH7Xg_0", "video_path": "hhsGT6PH7Xg.mp4", "subtitle_path": "hhsGT6PH7Xg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 239, "duration": 9.0, "view_count": 149997}, {"video_id": "m2FEtP5J8LM", "question": "The screen shows many children eating inside a pavilion made of wooden beams. On their dining table is a cloth with many floral patterns on it. The table is filled with various kinds of food. Surrounding the pavilion are thick wooden columns. The children's hair is black. What material is the jug holding orange juice on the dining table made of?", "question_wo_referring_query": "What material is the jug holding orange juice on the dining table made of?", "candidates": ["White ceramic jug", "Transparent glass jug", "Transparent crystal jug", "Transparent plastic jug", "Transparent glass jug"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "m2FEtP5J8LM_1", "video_path": "m2FEtP5J8LM.mp4", "subtitle_path": "m2FEtP5J8LM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1049, "duration": 9.0, "view_count": 1397069}, {"video_id": "rT9x9a23Q9k", "question": "A woman is standing in front of a mirror, wearing a black coat and a gray scarf. Her hair is combed back, and she is holding a hanger with a camel-colored cardigan on it. The woman is in a white room. 
When the term 'literally just going to target um' is mentioned, what type of cardigan is this?", "question_wo_referring_query": "What type of cardigan is this?", "candidates": ["Cotton cardigan", "Knitted cardigan", "Furry cardigan", "Down cardigan", "Wool cardigan"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "rT9x9a23Q9k_0", "video_path": "rT9x9a23Q9k.mp4", "subtitle_path": "rT9x9a23Q9k_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 600, "duration": 12.0, "view_count": 25849}, {"video_id": "HH1atPx40PQ", "question": "In the animation, a group of soldiers is standing in a grassy field, surrounded by green plants and trees. The soldiers have yellow hair and wear purple clothes, while some wear grey clothes. Some are holding long knives while others hold short swords. What does the soldier wearing a helmet and holding a short sword look like?", "question_wo_referring_query": "What does the soldier wearing a helmet and holding a short sword look like?", "candidates": ["A soldier wearing purple clothes holding a long knife", "A soldier wearing grey clothes holding a long knife", "A soldier wearing purple clothes with an olive-green belt", "A soldier wearing grey clothes holding a shield", "A soldier wearing purple clothes with a yellow belt"], "topic_category": "KH-Knowledge-History", "question_category": "E2O", "level": "IntraMoment", "id": "HH1atPx40PQ_0", "video_path": "HH1atPx40PQ.mp4", "subtitle_path": "HH1atPx40PQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 172, "duration": 11.01, "view_count": 3315}, {"video_id": "qGKUKj-ha3Q", "question": "The screen shows a green and pink wall, with two children's amusement facilities on the green turf: a yellow slide and a pink slide. There are three girls on the screen. The girl in red is holding a girl in a green skirt. Next to them is a girl wearing a yellow headscarf. 
What happens after the girl in red holds the girl in the green skirt?", "question_wo_referring_query": "What happens next?", "candidates": ["A man in gray clothes holds the girl in the green skirt, and the girl in red holds a red balloon.", "A man in gray clothes holds the girl in the green skirt, and the girl in red holds a green balloon.", "A man in gray clothes holds the girl in the green skirt, and the girl in red holds an orange balloon.", "A man in gray clothes holds the girl in the green skirt, and the girl in red holds a white balloon.", "A man in gray clothes holds the girl in the green skirt, and the girl in red holds a pink balloon."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "qGKUKj-ha3Q_1", "video_path": "qGKUKj-ha3Q.mp4", "subtitle_path": "qGKUKj-ha3Q_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 100, "duration": 12.0, "view_count": 24919}, {"video_id": "@recipesbyanne-7187799426268777734", "question": "On a white marble table, there is a white bowl. A pair of hands is holding the bowl steadily, and inside the bowl, there is a light yellow thick soup sprinkled with green herbs and chili peppers. 
After the subtitle 'Balls hanging low, wallet pop a bottle off of your chain' appears, what is the first object to appear?", "question_wo_referring_query": "what is the first object to appear?", "candidates": ["pumpkin", "green pepper", "pot lid", "blueberry", "garlic"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@recipesbyanne-7187799426268777734_0", "video_path": "7187799426268777734.mp4", "subtitle_path": "7187799426268777734_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.2, "view_count": 36075}, {"video_id": "@recipesbyanne-7138820604882357510", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene of slicing green vegetables is shown, followed by a scene with a white plate containing a yellow squash being cut with a fork, and finally a scene of an orange being sliced with a knife.", "First, a scene of an orange being sliced with a knife is shown, followed by a scene of slicing green vegetables, and finally a scene with a white plate containing a yellow squash being cut with a fork.", "First, a scene of an orange being sliced with a knife is shown, followed by a scene with a white plate containing a yellow squash being cut with a fork, and finally a scene of slicing green vegetables.", "First, a scene with a white plate containing a yellow squash being cut with a fork is shown, followed by a scene of slicing green vegetables, and finally a scene of an orange being sliced with a knife.", "First, a scene with a white plate containing a yellow squash being cut with a fork is shown, followed by a scene of an orange being sliced with a knife, and finally a scene of slicing green vegetables."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": 
"@recipesbyanne-7138820604882357510_0", "video_path": "7138820604882357510.mp4", "subtitle_path": "7138820604882357510_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.9, "view_count": 5429}, {"video_id": "@recipesbyanne-7282777187286011169", "question": "On the screen, there is a red brick wall. A woman is sitting in front of the brick wall, with a table in front of her. On the table, there is a black pot. The woman has golden hair and is wearing black clothes. She is stirring the food in the pot with a spatula. In which other scene does this pot appear?", "question_wo_referring_query": "In which other scene does this pot appear?", "candidates": ["In a scene with a yellow background where a hand is holding an iron spatula.", "In a scene with a red background where a hand is holding a wooden spatula.", "In a scene with a white background where a hand is holding an iron spatula.", "In a scene with a black background where a hand is holding an iron spatula.", "In a scene with a white background where a hand is holding a wooden spatula."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@recipesbyanne-7282777187286011169_0", "video_path": "7282777187286011169.mp4", "subtitle_path": "7282777187286011169_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.4, "view_count": 98256}, {"video_id": "@recipesbyanne-7174789827072773382", "question": "The screen shows a white floral patterned marble table with a black rectangular grid on top, on which there are cut white cubes. The surface of the cubes is uneven, resembling powdered sugar. A pair of tongs lifts one of the cubes. 
At which subtitle did the white cubes and this action appear simultaneously?", "question_wo_referring_query": "At which subtitle did the white cubes and this action appear simultaneously?", "candidates": ["thank you", "why", "Like Jesus said, I'm gonna dance, dance, dance with my", "if\n", "so"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "@recipesbyanne-7174789827072773382_0", "video_path": "7174789827072773382.mp4", "subtitle_path": "7174789827072773382_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.03, "view_count": 18100}, {"video_id": "@healthfood-7036391022099778862", "question": "A woman is standing inside a white building, with a bit of greenery behind her. She has blond hair and is wearing a white floral dress with green patterns. She is holding a phone and taking a selfie. What change occurs while she is having a drink in the kitchen?", "question_wo_referring_query": "What change occurs?", "candidates": ["Changed to a black suspender", "Changed to a jacket", "Changed to black shorts", "Changed to a black backpack", "Changed to a white backpack"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@healthfood-7036391022099778862_0", "video_path": "7036391022099778862.mp4", "subtitle_path": "7036391022099778862_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.27, "view_count": 63170}, {"video_id": "@healthfood-6950740596235455749", "question": "A lady is sitting in a bedroom, with a white wall and a bed with a white blanket behind her. There is a floor lamp with a white lampshade on the side. She is wearing a white T-shirt and light-colored jeans. 
What change occurs to the lady after mentioning 'I'm not hatin', I'm just tellin' you'?", "question_wo_referring_query": "What change occurs to the lady?", "candidates": ["The lady closed her eyes", "The lady looked back at the bed behind her", "The lady started making a phone call", "The lady started looking at her phone", "The lady raised one hand"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@healthfood-6950740596235455749_0", "video_path": "6950740596235455749.mp4", "subtitle_path": "6950740596235455749_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.17, "view_count": 23000}, {"video_id": "@healthfood-7352910847997496619", "question": "In the video, there is a transparent glass filled with chia seeds and strawberry milk, placed on a wooden table. Inside the glass, there is a spoon. From the top of the screen, a part of a hand's finger is visible. What is this hand doing with the spoon?", "question_wo_referring_query": "What is this hand doing with the spoon?", "candidates": ["Stirring the strawberry milk and chia seeds", "Stirring the strawberry milk and chia seeds", "Using the spoon to put chia seeds into the glass", "Using the spoon to add gelatin flakes", "Using the spoon to add sugar"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@healthfood-7352910847997496619_0", "video_path": "7352910847997496619.mp4", "subtitle_path": "7352910847997496619_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.5, "view_count": 66523}, {"video_id": "@recipesbyanne-7128784895203118342", "question": "A cutting board is placed on a wooden table, with a piece of cake on it. The cake has been drizzled with sauce and some small toppings. A hand is holding a small piece of the cake. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Blue small toppings", "Green small toppings", "Pink small toppings", "Yellow small toppings", "Orange small toppings"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@recipesbyanne-7128784895203118342_0", "video_path": "7128784895203118342.mp4", "subtitle_path": "7128784895203118342_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.52, "view_count": 3747}, {"video_id": "@recipesbyanne-7119521996882365701", "question": "There is a cake on the screen. The cake has a crust that is already baked to a golden brown, with some chocolate drizzled on top. There is a round hole in the middle of the cake. The cake has been cut. Into what shape has the cake been divided?", "question_wo_referring_query": "Into what shape has the cake been divided?", "candidates": ["Round shape", "Heart shape", "No shape", "Square shape", "Triangle shape"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@recipesbyanne-7119521996882365701_0", "video_path": "7119521996882365701.mp4", "subtitle_path": "7119521996882365701_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.57, "view_count": 2678}, {"video_id": "@healthfood-7109155206348279083", "question": "In the video, there's a hand holding a cup containing chia seeds and water, with a green straw inserted into it. The cup is in a room with white walls and a wooden floor. When the phrase 'This is love, you're roses.' 
is mentioned, what is the material of this cup?", "question_wo_referring_query": "What is the material of this cup?", "candidates": ["Crystal cup", "Plastic cup", "Ceramic cup", "Stainless steel cup", "Glass cup"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@healthfood-7109155206348279083_0", "video_path": "7109155206348279083.mp4", "subtitle_path": "7109155206348279083_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.77, "view_count": 2200000}, {"video_id": "@recipesbyanne-7212649090847345926", "question": "On a white stone countertop, there is a black pot containing some green squash and white kidney beans. What food is being added to the pot?", "question_wo_referring_query": "What food is being added to the pot?", "candidates": ["tomato", "cream", "lettuce", "white kidney beans", "yellow squash"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@recipesbyanne-7212649090847345926_0", "video_path": "7212649090847345926.mp4", "subtitle_path": "7212649090847345926_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.97, "view_count": 148484}, {"video_id": "@recipesbyanne-7123497934913686789", "question": "On the screen, there is a round plate on a wooden patterned table. On the plate, there is an egg, a bun with vegetable leaves, and a tomato. After the 'bye' subtitle disappears, what happens?", "question_wo_referring_query": "On the screen, there is a round plate on a wooden patterned table. On the plate, there is an egg, a bun with vegetable leaves, and a tomato. 
After the 'bye' subtitle disappears, what happens?", "candidates": ["The sauce is being poured onto the plate", "Chopsticks continuously pick up the egg", "A hand is holding a bun", "A hand is holding the plate", "A hand is holding an egg"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@recipesbyanne-7123497934913686789_0", "video_path": "7123497934913686789.mp4", "subtitle_path": "7123497934913686789_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.43, "view_count": 4037}, {"video_id": "@recipesbyanne-7150640587299933445", "question": "The screen shows a black pot containing a yellow mixture of flour and vegetables. A spatula is pressing it down, spreading it evenly in the pot. What happens after the spatula finishes pressing it?", "question_wo_referring_query": "What happens after the spatula finishes pressing it?", "candidates": ["The finished pancake is placed on a plate and topped with honey mustard sauce.", "The finished pancake is placed on a plate and topped with Thousand Island dressing.", "The pancake is slowly soaking in oil.", "The finished pancake is placed on a plate and topped with tomato sauce and salad dressing.", "Vegetables are placed on top of the pancake."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@recipesbyanne-7150640587299933445_0", "video_path": "7150640587299933445.mp4", "subtitle_path": "7150640587299933445_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.07, "view_count": 28800}, {"video_id": "@healthfood-7263414853128228138", "question": "Which food item appears first in the video?", "question_wo_referring_query": "Which one is it?", "candidates": ["Long slices of pickled cucumber", "Long strips of pickled cucumber", "Cubes of pickled cucumber", "Whole pickled cucumber", "Round slices of pickled cucumber"], "topic_category": 
"LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-7263414853128228138_0", "video_path": "7263414853128228138.mp4", "subtitle_path": "7263414853128228138_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.1, "view_count": 4536410}, {"video_id": "@jetset_anna-6925802847506369797", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a scene with two lounge chairs is played, followed by a scene with a wooden staircase leading to the sea, and finally a scene of a white sofa with a backdrop of a turquoise blue sky.", "First, a scene with two lounge chairs is played, followed by a scene of a white sofa with a backdrop of a turquoise blue sky, and finally a scene with a wooden staircase leading to the sea.", "First, a scene of a white sofa with a backdrop of a turquoise blue sky is played, followed by a scene with two lounge chairs, and finally a scene with a wooden staircase leading to the sea.", "First, a scene of a white sofa with a backdrop of a turquoise blue sky is played, followed by a scene with a wooden staircase leading to the sea, and finally a scene with two lounge chairs.", "First, a scene with a wooden staircase leading to the sea is played, followed by a scene of a white sofa with a backdrop of a turquoise blue sky, and finally a scene with two lounge chairs."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "@jetset_anna-6925802847506369797_0", "video_path": "6925802847506369797.mp4", "subtitle_path": "6925802847506369797_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.0, "view_count": 10200}, {"video_id": "@daiki.shino-7119916888062315782", "question": "Two men are walking on the street in the video: the street is paved with cobblestones, the trees lining the street 
are tall with lush green leaves. One man is dressed in a white shirt and black pants, the other in a white shirt and blue pants. In which subtitle do these two men appear at the same time?", "question_wo_referring_query": "In which subtitle do these two men appear at the same time?", "candidates": ["In my heart.", "In my eyes.", "so", "In my passion.", "much"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@daiki.shino-7119916888062315782_0", "video_path": "7119916888062315782.mp4", "subtitle_path": "7119916888062315782_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.88, "view_count": 157600}, {"video_id": "@jetset_anna-6944026211962195205", "question": "The screen shows a blue sea with some hills in the distance. The sky is very blue, and there is a white pavilion with a domed roof in front of the camera. The surrounding buildings are all white, and the pavilion's roof is blue. On top of the pavilion stands a white cross. When the camera reveals a corner of a gray curtain, there are red flowers below the curtain. There is a blue door in the bottom left corner. 
When the architecture, very similar to the pavilion, appears in the shot, what changes occur to the pavilion?", "question_wo_referring_query": "What changes occur to the pavilion?", "candidates": ["The cross on top of the pavilion turns yellow", "The cross on top of the pavilion turns green", "The cross on top of the pavilion turns black", "The pavilion disappears from the screen", "The cross on top of the pavilion turns blue"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@jetset_anna-6944026211962195205_0", "video_path": "6944026211962195205.mp4", "subtitle_path": "6944026211962195205_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.73, "view_count": 70800}, {"video_id": "@daiki.shino-7129222222815841542", "question": "The person on the screen is a man wearing a white T-shirt and black shorts. He is standing next to a bridge with green trees around. There is a microphone stand in front of him, and he has a guitar on his back. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Playing the guitar and singing", "Singing with the microphone stand", "Dancing with his hands up", "Putting down the guitar", "Interacting with the audience"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@daiki.shino-7129222222815841542_0", "video_path": "7129222222815841542.mp4", "subtitle_path": "7129222222815841542_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.28, "view_count": 7130}, {"video_id": "@luxtravelbe-7233458302762470683", "question": "The screen shows a stone-stacked pond with some aquatic plants. The pond is surrounded by green plants, with the turquoise sea, blue sky, and white clouds in the distance. 
What is something that is not present in the scene?", "question_wo_referring_query": "What is something that is not present in the scene?", "candidates": ["Palm tree", "Red plants", "Green lotus leaves", "Purple lotus", "Banana tree"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@luxtravelbe-7233458302762470683_0", "video_path": "7233458302762470683.mp4", "subtitle_path": "7233458302762470683_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.52, "view_count": 62041}, {"video_id": "@kelseyinlondon-7329947846646435104", "question": "The screen shows a dining hall with uniquely shaped hanging lights. The ceiling is blue, the tables are white with some dining utensils on them. When the line 'If you need me, call me, no matter where you are, no matter how far.' is mentioned, what object is missing from the screen?", "question_wo_referring_query": "What object is missing from the screen?", "candidates": ["Knife and fork", "Yellow hanging light", "Wooden spice jar", "Glass cup", "Air conditioner"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@kelseyinlondon-7329947846646435104_0", "video_path": "7329947846646435104.mp4", "subtitle_path": "7329947846646435104_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.25, "view_count": 679688}, {"video_id": "@kelseyinlondon-7309532634403605793", "question": "The screen shows a sky illuminated by the evening glow, with some short buildings in the distance, a sea, and a viewing platform with glass railings. The building next to the viewing platform has a red roof. 
What object is placed next to the viewing platform?", "question_wo_referring_query": "What object is placed next to the viewing platform?", "candidates": ["a pink round table", "a white round table", "a green round table", "a black round table", "an olive round table"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@kelseyinlondon-7309532634403605793_0", "video_path": "7309532634403605793.mp4", "subtitle_path": "7309532634403605793_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.93, "view_count": 19609}, {"video_id": "@daiki.shino-7238336395050306842", "question": "The screen shows a man lying on a sofa. He has black hair and is wearing a loose plain T-shirt. Behind him is a large glass window with two white curtains. Next to the sofa is a rectangular table. What is the man doing?", "question_wo_referring_query": "What is the man doing?", "candidates": ["Covering himself with a blanket", "Eating something", "Watching a tablet", "Watching TV", "Flipping through a book"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@daiki.shino-7238336395050306842_0", "video_path": "7238336395050306842.mp4", "subtitle_path": "7238336395050306842_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.4, "view_count": 5569}, {"video_id": "@placesunleashed-7297289751974006022", "question": "What is the first object to appear in the video?", "question_wo_referring_query": "What is the first object to appear in the video?", "candidates": ["A blue chapel", "A white chapel", "A red rock with rose-red flowers growing on it", "A white rock with green plants growing on it", "A mushroom-shaped tree"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@placesunleashed-7297289751974006022_0", "video_path": "7297289751974006022.mp4", "subtitle_path": 
"7297289751974006022_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.5, "view_count": 10676}, {"video_id": "@movie.explained6-7253148471778135297", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["A scene of a woman in a black coat sitting at a desk talking, followed by a scene of a man and a woman in a living room, then a scene of a man.", "A scene of a man, followed by a scene of a man and a woman in a living room, then a scene of a woman in a black coat sitting at a desk talking.", "A scene of a man and a woman in a living room, followed by a scene of a man, then a scene of a woman in a black coat sitting at a desk talking.", "A scene of a man and a woman in a living room, followed by a scene of a woman in a black coat sitting at a desk talking, then a scene of a man.", "First, a scene of a woman in a black coat sitting at a desk talking, followed by a scene of a man, then a scene of a man and a woman in a living room."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "@movie.explained6-7253148471778135297_0", "video_path": "7253148471778135297.mp4", "subtitle_path": "7253148471778135297_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 10, "duration": 10.0, "view_count": 10854}, {"video_id": "@movie.explained6-7275900555145317633", "question": "In the scene, there is a man wearing an olive-green coat and headphones. Nearby, there is a building with tiled roof and another man with long hair wearing a shirt. The two are facing each other. 
In what other scenes does the man in the olive-green coat appear?", "question_wo_referring_query": "In what other scenes does the man in the olive-green coat appear?", "candidates": ["In front of a golden pagoda", "In a white church", "In a busy square", "On a bustling street with blue signposts and trees", "By a small river"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7275900555145317633_0", "video_path": "7275900555145317633.mp4", "subtitle_path": "7275900555145317633_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.01, "view_count": 2576}, {"video_id": "@movie.explained6-7254235354243386626", "question": "In the scene, there is a profile of a woman with black hair. She looks ahead with a very sorrowful expression. The scene takes place at night with dim lighting. In the same scene, when the white text below the woman changes to 'no engineer from the beginning,' what change occurs in the woman?", "question_wo_referring_query": "What change occurs in the woman?", "candidates": ["The woman's head lifts slightly, her eyes filled with tears", "The woman covers her face with her hands", "The woman starts crying heavily", "The woman glares at someone with hatred", "The woman looks at someone with a face full of surprise"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7254235354243386626_0", "video_path": "7254235354243386626.mp4", "subtitle_path": "7254235354243386626_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 3, "duration": 14.02, "view_count": 2874}, {"video_id": "@movie.explained6-7252675724711054593", "question": "The screen shows a side profile of a man with black curly hair, wearing a gray suit and a red tie. What change occurs to this man when the phrase 'And it stops working.' 
is mentioned?", "question_wo_referring_query": "What change occurs to this man?", "candidates": ["The man raised both his hands", "The man changed to a blue tie", "The man turned to look at someone else", "The man turned his back", "The man revealed his full face"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7252675724711054593_0", "video_path": "7252675724711054593.mp4", "subtitle_path": "7252675724711054593_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 13, "duration": 14.0, "view_count": 561}, {"video_id": "@movie.explained6-7253161825448004865", "question": "A woman is standing in the rain at night; her black hair is already wet, her skin is very fair. What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Touching her hair", "Making a phone call", "Turning around", "Picking up her phone", "Raising her hand"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7253161825448004865_0", "video_path": "7253161825448004865.mp4", "subtitle_path": "7253161825448004865_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 48, "duration": 8.0, "view_count": 14600}, {"video_id": "@movie.explained6-7268621303383346433", "question": "In the scene, a tall man in black clothing and a young girl in red clothing are walking down the street. When the term 'Four dollars?' 
is mentioned, what is absent from the scene?", "question_wo_referring_query": ", what is absent from the scene?", "candidates": ["road sign", "white socks", "black skirt", "car", "olive pants"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7268621303383346433_0", "video_path": "7268621303383346433.mp4", "subtitle_path": "7268621303383346433_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 19, "duration": 9.0, "view_count": 2984}, {"video_id": "@movie.explained6-7269964117132315905", "question": "In the scene, there's a blue sky and a white object with steps on it. A man with black hair, wearing a black coat and white shirt, walks by. What is the shape of the object with the steps?", "question_wo_referring_query": "What is the shape of the object with steps?", "candidates": ["Square", "Rectangle", "Staircase", "Triangle", "Cylinder"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "@movie.explained6-7269964117132315905_0", "video_path": "7269964117132315905.mp4", "subtitle_path": "7269964117132315905_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 31, "duration": 10.0, "view_count": 6335}, {"video_id": "@kerstinong-6970696200672627969", "question": "The woman with brown hair is in a white room next to a white table. She is wearing a black sports bra with her long hair draped over one shoulder. 
What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Talking to someone", "Showing her arm muscles", "Making a phone call", "Smiling", "Showing her leg muscles"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-6970696200672627969_0", "video_path": "6970696200672627969.mp4", "subtitle_path": "6970696200672627969_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.11, "view_count": 11273}, {"video_id": "@jonijawne-7360667596355669253", "question": "The scene features a person with short hair, wearing sunglasses, a black jacket over a white shirt. He is taking a selfie in front of a mirror. Behind him, there's an old pavilion, a dirt road with some weeds, and some buildings in the distance that blend with the ground color and look old. Sunlight is shining on the ground. What did this person do when he first appeared?", "question_wo_referring_query": "What did he do?", "candidates": ["He starts dancing.", "He stands with his back to the mirror.", "He is slightly turning his face to one side while admiring something.", "He is making a phone call.", "He starts talking with his companion."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@jonijawne-7360667596355669253_0", "video_path": "7360667596355669253.mp4", "subtitle_path": "7360667596355669253_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.97, "view_count": 548}, {"video_id": "@jess.morg-7233539433092451630", "question": "The screen shows the interior of a car, and the car's rearview mirror has a round ornament hanging from it. Outside the car window, there is a blue sky and bright sunlight, as well as white buildings. 
When it mentions 'And as my mind begins to spread its wings, there's no stopping curiosity,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The ornament lightly sways hanging from the rearview mirror.", "A bird flies past the car window.", "A person's shadow is reflected in the rearview mirror.", "The rearview mirror falls off.", "The ornament is blown by the wind."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@jess.morg-7233539433092451630_0", "video_path": "7233539433092451630.mp4", "subtitle_path": "7233539433092451630_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.68, "view_count": 2950}, {"video_id": "@jess.morg-7212368372221119787", "question": "The screen shows a descending staircase leading to a pathway surrounded by green plants. Along the pathway, the sea is visible in the distance, with sunlight shimmering on the water. The sea and the sky merge together, with thick clouds above. What happens when the sea water is illuminated by the sunlight?", "question_wo_referring_query": "What happens?", "candidates": ["People are playing volleyball by the sea", "A boat appears at the seaside", "The plants by the sea are gently swayed by the wind", "People are swimming in the sea", "Waves form on the surface of the sea"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "@jess.morg-7212368372221119787_0", "video_path": "7212368372221119787.mp4", "subtitle_path": "7212368372221119787_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.37, "view_count": 774}, {"video_id": "@kerstinong-7253789113843830018", "question": "The video shows a chat log detailing someone's process of purchasing a Mercedes-Benz CLA180. 
What was the first question this person asked during the buying process?", "question_wo_referring_query": "What was the first question asked?", "candidates": ["How much", "Do you want to buy a sister car?", "How many years left?", "When can I pick up my car?", "Do you want to buy a mother car?"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "@kerstinong-7253789113843830018_0", "video_path": "7253789113843830018.mp4", "subtitle_path": "7253789113843830018_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 9.5, "view_count": 14100}, {"video_id": "@jonijawne-7197561669575331077", "question": "The scene shows a stone countertop with a stainless steel pan on it. Someone is holding a bottle and pouring seasoning into the pan. There are also cold water jars with transparent bodies and white lids around. After the phrase 'I'm so mature, I got me a car' is mentioned, what happens?", "question_wo_referring_query": "What happens after the phrase is mentioned?", "candidates": ["Crushes tofu and puts it into the pan", "Cooks the tofu", "Pours salad dressing into the pan", "Sprinkles seasoning into the stainless steel pan", "Puts tofu in the stainless steel pan"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "@jonijawne-7197561669575331077_0", "video_path": "7197561669575331077.mp4", "subtitle_path": "7197561669575331077_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.07, "view_count": 2106}, {"video_id": "@jess.morg-7313356655184334126", "question": "The scene depicts a place at night adorned with many colorful lights. There is a snowman on the grass, a red airplane, and a lamp resembling a rocking horse. Nearby, there is a white building also decorated with colorful lights, and the green grass beneath it is illuminated as well. 
When the phrase 'Jack Frost nipping at your nose' is mentioned, what is the first object that appears on the screen?", "question_wo_referring_query": "What is the first object that appears on the screen?", "candidates": ["A trampoline adorned with colorful lights", "A large hammer adorned with colorful lights", "A pirate ship adorned with colorful lights", "A green house adorned with colorful lights", "A tree adorned with colorful lights"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@jess.morg-7313356655184334126_0", "video_path": "7313356655184334126.mp4", "subtitle_path": "7313356655184334126_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 13.7, "view_count": 3001}, {"video_id": "@jess.morg-7243538934234402090", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene with a small bed with a guitar hanging above it is shown, followed by a scene with a small table holding a pink card and a candle, and finally, a scene with a scented candle is shown.", "First, a scene with a scented candle is shown, followed by a scene with a small bed with a guitar hanging above it, and finally, a scene with a small table holding a pink card and a candle is shown.", "First, a scene with a small table holding a pink card and a candle is shown, followed by a scene with a small bed with a guitar hanging above it, and finally, a scene with a scented candle is shown.", "First, a scene with a small bed with a guitar hanging above it is shown, followed by a scene with a scented candle, and finally, a scene with a small table holding a pink card and a candle is shown.", "First, a scene with a scented candle is shown, followed by a scene with a small table holding a pink card and a candle, and finally, a scene with a small bed with a guitar hanging above it is 
shown."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "@jess.morg-7243538934234402090_0", "video_path": "7243538934234402090.mp4", "subtitle_path": "7243538934234402090_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 11.13, "view_count": 799}, {"video_id": "@jonijawne-7290534252309957893", "question": "The scene shows a wooden table with a few plates and cups on it. The plates contain food, and the cups have red and white wine in them. In which scene has the glass cup appeared before?", "question_wo_referring_query": "In which scene has the glass cup appeared before?", "candidates": ["In a scene with a wooden table and a woman wearing a black dress", "In a scene with a wooden table and a woman wearing a red dress", "In a scene with a wooden table and a woman wearing a pink dress", "In a plaza with a dining area", "In a field where someone is admiring plants"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "@jonijawne-7290534252309957893_0", "video_path": "7290534252309957893.mp4", "subtitle_path": "7290534252309957893_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.33, "view_count": 1368}, {"video_id": "@jess.morg-7198972261762878766", "question": "In the video, there is a wooden floor with a pattern of a white floral design. There is also a paper box on it. 
What subtitle appeared at the same time as this paper box?", "question_wo_referring_query": ", what subtitle appeared at the same time as this paper box?", "candidates": ["what", "wach", "Thank you", "h", "SO"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@jess.morg-7198972261762878766_0", "video_path": "7198972261762878766.mp4", "subtitle_path": "7198972261762878766_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 14.7, "view_count": 6347}, {"video_id": "@emsiiees-7206512793443437851", "question": "On the screen is a woman wearing a red headband clip at the back of her head, inside a white room. There's a yellow painting on the wall behind her. She is sitting at a table painting a white ceramic cup. The table holds a pen holder filled with brushes and pens. In the same room, with a plate of dumplings on the table, what change occurred?", "question_wo_referring_query": "What change occurred?", "candidates": ["The woman picked up a cup of coffee", "The woman picked up a cup of tea", "The woman picked up a cup of cola", "The woman picked up a cup of orange juice", "The woman picked up a cup of milk"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SAA", "level": "L2-Relation", "id": "@emsiiees-7206512793443437851_0", "video_path": "7206512793443437851.mp4", "subtitle_path": "7206512793443437851_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 15.63, "view_count": 8119}, {"video_id": "@kerstinong-7223807492675882242", "question": "In the video, there is a hand extended forward, set against the backdrop of a store with black pillars and signs. 
When the phrase 'Cause all I need is beauty and love me' is mentioned, what change occurs to the hand?", "question_wo_referring_query": "What change occurs to the hand?", "candidates": ["A victory hand gesture", "Holding a handbag", "The hand is holding another hand", "Pressing a button in the car", "The hand is holding a cup of water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "@kerstinong-7223807492675882242_0", "video_path": "7223807492675882242.mp4", "subtitle_path": "7223807492675882242_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 12.13, "view_count": 31300}, {"video_id": "@kerstinong-6966568746131426561", "question": "The person on the screen is a woman with brown hair, she is in a relatively open area, behind her is a building with a red roof, there are cars parked by the roadside, this woman is wearing a white strapless wedding dress, what is she doing?", "question_wo_referring_query": ", what is she doing?", "candidates": ["Walking with a skirt", "Holding a bouquet", "Spinning around", "Wearing a headband", "Touching her headband"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@kerstinong-6966568746131426561_0", "video_path": "6966568746131426561.mp4", "subtitle_path": "6966568746131426561_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 10.57, "view_count": 8243}, {"video_id": "@lisolna-7178476407058615558", "question": "The scene is in a small restaurant. There are wooden shelves and paintings on the wall of the restaurant. Bottles of alcohol are placed on the shelves. The tables in the restaurant are also wooden, and the chairs are orange wooden chairs. There is a small lamp on the ceiling panel of the restaurant, and a decorative item is hung with a hemp rope. 
What is the object that is not present in the scene?", "question_wo_referring_query": "What is the object that is not present in the scene?", "candidates": ["White computer", "Black wine bottle", "Person wearing a red T-shirt", "Person wearing a black jacket", "Pink handbag"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "@lisolna-7178476407058615558_0", "video_path": "7178476407058615558.mp4", "subtitle_path": "7178476407058615558_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 0, "duration": 8.9, "view_count": 13478}, {"video_id": "xERGt6BWkCg", "question": "The scene is a segment from a movie, showing the silhouettes of two men standing next to a white wall with cracks, holding handguns. They are wearing masks and the image is very blurry. When mentioning drug factories in Mexico, where they can easily take down Leo's men and extract information about, what is the object that is not present in the scene?", "question_wo_referring_query": "What is the object that is not present in the scene?", "candidates": ["black mask", "black mask with white pattern", "black clothes", "white column", "black stick"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "xERGt6BWkCg_0", "video_path": "xERGt6BWkCg.mp4", "subtitle_path": "xERGt6BWkCg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 352, "duration": 13.0, "view_count": 61691}, {"video_id": "xQ7J9OM3Kyg", "question": "The scene depicts a dark background, with a group of people seated around a rectangular dining table, which is filled with food. There are two servants on the side pouring drinks and serving dishes. A chandelier hangs from the ceiling. What shape would the chandelier appear to be when viewed directly from the front when it is said that 'is sitting at a banquet table. The men are dressed in expensive suits. 
The ladies have'?", "question_wo_referring_query": "The scene depicts a dark background, with a group of people seated around a rectangular dining table, which is filled with food. There are two servants on the side pouring drinks and serving dishes. A chandelier hangs from the ceiling. What shape would the chandelier appear to be when viewed directly from the front?", "candidates": ["Oval", "Triangular", "Elliptical", "Round", "Rectangular"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "xQ7J9OM3Kyg_0", "video_path": "xQ7J9OM3Kyg.mp4", "subtitle_path": "xQ7J9OM3Kyg_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 2, "duration": 8.0, "view_count": 198543}, {"video_id": "Lw9sPOoNhmA", "question": "The screen shows a newspaper, which contains some text and two pictures. One of the pictures is of a man and a woman, and the other picture is of a man. Based on the screen, who is the person printed on the left side of the newspaper?", "question_wo_referring_query": "Based on the screen, who is the person printed on the left side of the newspaper?", "candidates": ["man in a shirt", "man with somewhat long hair", "woman in a floral dress", "woman in a purple dress", "woman in a yellow dress"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "Lw9sPOoNhmA_0", "video_path": "Lw9sPOoNhmA.mp4", "subtitle_path": "Lw9sPOoNhmA_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 174, "duration": 10.0, "view_count": 8581}, {"video_id": "92LqaP1PyNQ", "question": "The screen shows a man wearing glasses. He is dressed in a suit, his hair is black, and there is another person in a suit beside him whose head is not visible. Behind him, there are two bouquets of flowers. 
What did this man do the first time he appeared?", "question_wo_referring_query": "What did this man do the first time he appeared?", "candidates": ["He was smiling and clapping", "He handed a notebook to someone", "He took a sip of water", "He was making a phone call", "He was talking to someone"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "92LqaP1PyNQ_0", "video_path": "92LqaP1PyNQ.mp4", "subtitle_path": "92LqaP1PyNQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 419, "duration": 8.01, "view_count": 839058}, {"video_id": "paYWjH41rKc", "question": "In the scene, a man is standing in a white room. He is wearing a pink shirt and looking ahead with a somewhat worried and guarded expression. What does the man do when 'take a look at Peter who was in complete' is mentioned?", "question_wo_referring_query": "What does the man do in the scene?", "candidates": ["He is holding a baby.", "He is holding a rabbit.", "He is holding a pile of clothes.", "He is holding a puppy.", "He is holding a kitten."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "paYWjH41rKc_0", "video_path": "paYWjH41rKc.mp4", "subtitle_path": "paYWjH41rKc_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 670, "duration": 13.0, "view_count": 13717}, {"video_id": "oAt82-kuVfo", "question": "The scene is of a movie where a black car is speeding, spraying a lot of black debris from under its tires. It is raining heavily. There is a spare tire at the back of the car. In front of the car, there is a door. 
What happens when the car rushes forward?", "question_wo_referring_query": "What happens?", "candidates": ["Some uniformed people with helmets appear in the rain, surrounded by houses with thatched roofs.", "Some uniformed people with helmets appear in the rain, surrounded by trees.", "Some uniformed people with helmets appear in the rain, surrounded by a large river.", "Some uniformed people with helmets appear in the rain, surrounded by the sea.", "Some uniformed people with helmets appear in the rain, surrounded by mountains."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "oAt82-kuVfo_0", "video_path": "oAt82-kuVfo.mp4", "subtitle_path": "oAt82-kuVfo_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 686, "duration": 9.01, "view_count": 23418}, {"video_id": "Kas_qgodLTs", "question": "Who is the first character to appear on screen?", "question_wo_referring_query": "Who is the first character to appear on screen?", "candidates": ["A young man wearing a white suit", "A young man wearing a white shirt", "A young woman with black hair, black eyebrows, and very fair skin", "A young man wearing a black suit", "A young man wearing a gray suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Kas_qgodLTs_0", "video_path": "Kas_qgodLTs.mp4", "subtitle_path": "Kas_qgodLTs_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 1752, "duration": 11.97, "view_count": 27225}, {"video_id": "XMoIjRggKpM", "question": "In the scene, two men are standing under a tree. The camera angle appears to be looking slightly upwards at them. They are gazing in the direction of the camera and both are dressed in somewhat old-fashioned coats. One man has darker skin, and the other is wearing a hat. 
When mentioning 'the cemetery and share stories about the', who is the first character that appears?", "question_wo_referring_query": "Who is the first character that appears?", "candidates": ["The man wearing a red and black plaid coat", "The man wearing a gray and black plaid coat", "The man wearing a blue and black plaid coat", "The man wearing a yellow plaid coat", "The man wearing a yellow and black plaid coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "XMoIjRggKpM_0", "video_path": "XMoIjRggKpM.mp4", "subtitle_path": "XMoIjRggKpM_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 191, "duration": 10.01, "view_count": 127625}, {"video_id": "e0RQyszjIWM", "question": "A transparent bowl is placed on a wooden table. The top of the bowl reads 'Powdered Sugar 1/2 Cup'. Inside the bowl, there are chocolate-colored and white ingredients. A hand is adding ingredients into the bowl. When this bowl and the subtitle [Music] appear at the same time, what changes occur?", "question_wo_referring_query": "What changes occur when this bowl and the subtitle [Music] appear at the same time?", "candidates": ["The text on the bowl's rim changes to 'Honey 1/4 cup'", "The ingredients inside the bowl change to a chocolate-colored paste", "The text on the bowl's rim changes to 'Vanilla 1/2 tsp'", "The text on the bowl's rim changes to 'Salt 1/2 tsp'", "The text on the bowl's rim changes to 'Coconut Oil 1/4 Cup'"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "e0RQyszjIWM_0", "video_path": "e0RQyszjIWM.mp4", "subtitle_path": "e0RQyszjIWM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 244.92, "view_count": 854614}, {"video_id": "e0RQyszjIWM", "question": "There is a piece of wood placed on a black platform. On the wood, there is a golden circular food item. A hand is picking up one of the food items. 
When this food item appears on the screen along with the subtitle 'do\u3010Music\u3011', what change occurs to it?", "question_wo_referring_query": "What change occurs to it?", "candidates": ["Its surface turns green", "Its surface turns purple", "Its surface turns red", "Its surface turns yellow", "Its surface turns olive"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "e0RQyszjIWM_1", "video_path": "e0RQyszjIWM.mp4", "subtitle_path": "e0RQyszjIWM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 244.92, "view_count": 854614}, {"video_id": "GLoPS7HXVy0", "question": "There is a painting on the wall. In front of the painting, on a wooden table, there is a lamp. Next to the lamp, a man with short white hair, wearing a black suit, is walking towards the podium. What change occurred when the man appeared between the two flags?", "question_wo_referring_query": "What change occurred?", "candidates": ["He touched his forehead with his right hand", "He covered his face with both hands", "He raised his left hand", "He placed both hands on the table in front", "He raised his right hand"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "GLoPS7HXVy0_0", "video_path": "GLoPS7HXVy0.mp4", "subtitle_path": "GLoPS7HXVy0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 218.96, "view_count": 4818}, {"video_id": "GLoPS7HXVy0", "question": "In the room, there is a lime-colored bookshelf filled with books. In front of the bookshelf, there is a flag of Israel, and next to the flag stands a man dressed in a black suit with a red tie. He raises both his hands with palms facing each other. 
What change occurs when this man appears next to the American flag?", "question_wo_referring_query": "What change occurs?", "candidates": ["He touches his chin with his left hand.", "He raises both his hands above his head.", "He places his hands together.", "He clenches both his fists.", "He lowers his right hand."], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "GLoPS7HXVy0_1", "video_path": "GLoPS7HXVy0.mp4", "subtitle_path": "GLoPS7HXVy0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 218.96, "view_count": 4818}, {"video_id": "X8PqPrv8bSc", "question": "In the city, rows of houses stand tall, with a chimney constantly spewing black smoke next to a house. On the house at the lower left corner of the screen, two British flags are hanging. With which subtitles have these two flags appeared together?", "question_wo_referring_query": "With which subtitles have these two British flags appeared together?", "candidates": ["century that the demand really increased", "the job was in high demand master sweeps", "have the job of climbing up into narrow", "chimney-sweep dangerous jobs in history", "took on apprentices known as climbing"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "X8PqPrv8bSc_0", "video_path": "X8PqPrv8bSc.mp4", "subtitle_path": "X8PqPrv8bSc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.54, "view_count": 676544}, {"video_id": "X8PqPrv8bSc", "question": "In front of a wall, there is a man wearing a gray hat, a white shirt, and olive green suspenders. His left hand hangs naturally by his side, and he holds a brush in his right hand. 
With which subtitles in the video does this brush appear at the same time?", "question_wo_referring_query": "With which subtitles in the video does this brush appear at the same time?", "candidates": ["taken by the master sweep instead they", "were the source of heat for most homes", "sometimes as young as four by the time", "century that the demand really increased", "without this maintenance chimneys could"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "X8PqPrv8bSc_1", "video_path": "X8PqPrv8bSc.mp4", "subtitle_path": "X8PqPrv8bSc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.54, "view_count": 676544}, {"video_id": "vkJi_VJ9myA", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a cat peeks out from a cupboard containing bowls, then a woman wearing a green coat stretches her hand to pick a leaf on top of a tree on a hillside, and lastly, a person wearing a white top, wine-red pants, and yellow gloves is kneeling on the ground digging soil.", "First, a person wearing a white top, wine-red pants, and yellow gloves is kneeling on the ground digging soil, then a woman wearing a green coat stretches her hand to pick a leaf on top of a tree on a hillside, and lastly, a cat peeks out from a cupboard containing bowls.", "First, a person wearing a white top, wine-red pants, and yellow gloves is kneeling on the ground digging soil, then a cat peeks out from a cupboard containing bowls, and lastly, a woman wearing a green coat stretches her hand to pick a leaf on top of a tree on a hillside.", "First, a cat peeks out from a cupboard containing bowls, then a person wearing a white top, wine-red pants, and yellow gloves is kneeling on the ground digging soil, and lastly, a woman wearing a green coat stretches her hand to pick a leaf on top of a tree 
on a hillside.", "First, a woman wearing a green coat stretches her hand to pick a leaf on top of a tree on a hillside, then a person wearing a white top, wine-red pants, and yellow gloves is kneeling on the ground digging soil, and lastly, a cat peeks out from a cupboard containing bowls."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "vkJi_VJ9myA_0", "video_path": "vkJi_VJ9myA.mp4", "subtitle_path": "vkJi_VJ9myA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 528.7, "view_count": 183876}, {"video_id": "vkJi_VJ9myA", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which scene sequence below is correct?", "candidates": ["First, a woman with long golden hair and a scarf is talking between two white tree trunks. Next, a red and a white candle on a table in a dimly lit room are burning. Lastly, a woman in a red outfit is holding a model of a house beside a table.", "First, a woman in a red outfit is holding a model of a house beside a table. Then, a red and a white candle on a table in a dimly lit room are burning. Lastly, a woman with long golden hair and a scarf is talking between two white tree trunks.", "First, a woman with long golden hair and a scarf is talking between two white tree trunks. Then, a woman in a red outfit is holding a model of a house beside a table. Finally, in a dimly lit room, a red and a white candle on a table are burning.", "First, a woman in a red outfit is holding a model of a house beside a table. Then, a woman with long golden hair and a scarf is talking between two white tree trunks. Finally, in a dimly lit room, a red and a white candle on a table are burning.", "First, a red and a white candle on a table in a dimly lit room are burning. Next, a woman with long golden hair and a scarf is talking between two white tree trunks. 
Finally, a woman in a red outfit is holding a model of a house beside a table."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "vkJi_VJ9myA_1", "video_path": "vkJi_VJ9myA.mp4", "subtitle_path": "vkJi_VJ9myA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 528.7, "view_count": 183876}, {"video_id": "dGmWIcFylAg", "question": "At the bottom of the screen, there is a pitch-black field, and the distant mountain peaks still have melting snow. Farther beyond the mountains, there is a sun. After the subtitle mentions 'punishment of humanity punishing the,' what sculpture appears in the video?", "question_wo_referring_query": ", what sculpture appears in the video?", "candidates": ["A person with a bare upper body, looking up to the sky with hands behind their back", "A person sitting with a hat and holding a scepter", "A person kneeling on one knee and raising a torch", "A person with long hair, wearing a cloak, sitting on a table", "An elephant with its trunk raised"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "dGmWIcFylAg_0", "video_path": "dGmWIcFylAg.mp4", "subtitle_path": "dGmWIcFylAg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 474.68, "view_count": 17185}, {"video_id": "dGmWIcFylAg", "question": "In the night sky, the stars twinkle with light. The mountain peaks and trees appear only as black silhouettes in the evening. On the right side of the screen, there is a sculpture of a man with long beard and closed eyes. 
After the subtitle mentions 'petulant and childish over being fooled', what does the sculpture in the video depict?", "question_wo_referring_query": ", what does the sculpture in the video depict?", "candidates": ["A man with a shining upper body, holding a torch in his right hand, standing", "A man with his right hand raised holding a hammer and his left hand holding an object, placed on a stone platform", "A woman with tied hair, covering her chest with her hands", "A man with a shining body, his right hand placed in a jar", "A man with a shining body, his limbs locked in chains, imprisoned on a stone slab"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "dGmWIcFylAg_1", "video_path": "dGmWIcFylAg.mp4", "subtitle_path": "dGmWIcFylAg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 474.68, "view_count": 17185}, {"video_id": "MxBfxLrcUtQ", "question": "In a brightly lit hall, a man wearing a white short-sleeved shirt is standing, and behind him is a man with a beard wearing a black coat and carrying a backpack. After the subtitle mentions 'that Egypt is are picking us up they,' what does the man with the backpack do in front of a blue sign with a car diagram?", "question_wo_referring_query": "What does the man with the backpack do?", "candidates": ["He hugs a bald man wearing white clothes.", "He waves a golf club.", "He lies on the ground.", "He takes out his ID.", "He puts down his backpack."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "MxBfxLrcUtQ_0", "video_path": "MxBfxLrcUtQ.mp4", "subtitle_path": "MxBfxLrcUtQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 573.2, "view_count": 1661189}, {"video_id": "MxBfxLrcUtQ", "question": "Outside, a man wearing a white short-sleeve shirt is putting food into his mouth. Next to him is a dazed-looking man wearing green clothing. 
After the subtitle mentions 'what I\u2019m Harvard oh my god there\u2019s most', what does the man in green clothing do in front of a screen?", "question_wo_referring_query": "What does the man in green clothing do in front of a screen?", "candidates": ["He crouches on the ground", "He is eating food", "It hugs the bald man wearing white clothes", "He swings a golf club", "He puts on earphones"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "MxBfxLrcUtQ_1", "video_path": "MxBfxLrcUtQ.mp4", "subtitle_path": "MxBfxLrcUtQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 573.2, "view_count": 1661189}, {"video_id": "T_o4bJt0dXw", "question": "Who is the last person to appear in the video?", "question_wo_referring_query": "Who is the last person to appear in the video?", "candidates": ["A woman with blonde hair wearing black clothes", "A man with brown hair wearing green clothes", "A woman with brown hair wearing black clothes", "A woman with brown hair wearing grey clothes", "A woman with brown hair wearing green clothes"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "T_o4bJt0dXw_0", "video_path": "T_o4bJt0dXw.mp4", "subtitle_path": "T_o4bJt0dXw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 286.15, "view_count": 6588}, {"video_id": "T_o4bJt0dXw", "question": "In a brightly lit room, there are many glowing displays, and there is a woman with olive-colored hair wearing a black coat and gray interior clothing. She is speaking with her eyes closed. 
After this scene, who is the first character to appear?", "question_wo_referring_query": "Who is the first character to appear?", "candidates": ["A woman wearing a black coat", "A bald man", "A man wearing a suit", "A woman wearing gray interior clothing", "A woman wearing green clothing"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "T_o4bJt0dXw_1", "video_path": "T_o4bJt0dXw.mp4", "subtitle_path": "T_o4bJt0dXw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 286.15, "view_count": 6588}, {"video_id": "Ul6Kgqx-qoo", "question": "A piece of butter is placed in an empty, black iron pot, and a spatula is stirring inside the pot. The screen also shows the words '20g butter'. What is the next step?", "question_wo_referring_query": "What is the next step?", "candidates": ["The next step is to add the banana to the pot", "The next step is to pour the flour into a transparent bowl", "The next step is to add cocoa powder to the pot", "The next step is to cut the banana", "The next step is to pour milk into the measuring cup"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "Ul6Kgqx-qoo_0", "video_path": "Ul6Kgqx-qoo.mp4", "subtitle_path": "Ul6Kgqx-qoo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.0, "view_count": 1620}, {"video_id": "Ul6Kgqx-qoo", "question": "A person wearing black gloves placed a cake onto a green table, using a white plate. 
What did this person do next?", "question_wo_referring_query": ", what did this person do next?", "candidates": ["He cut the cake with a silver knife.", "He inserted a toothpick into the cake.", "He added a banana to the pot.", "He lifted the pot lid.", "He covered the pot with a lid."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "Ul6Kgqx-qoo_1", "video_path": "Ul6Kgqx-qoo.mp4", "subtitle_path": "Ul6Kgqx-qoo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 221.0, "view_count": 1620}, {"video_id": "pi-4mBmahts", "question": "On the wall, there is a yellow part with the letters 'BB52' written on it. In front of the wall, there is a woman with golden long hair wearing gray clothes sitting. What did this woman do when she first appeared?", "question_wo_referring_query": "What did she do?", "candidates": ["She was eating food", "She was drinking water", "She was exercising", "She was walking the runway", "She was doing a handstand"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "pi-4mBmahts_0", "video_path": "pi-4mBmahts.mp4", "subtitle_path": "pi-4mBmahts_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 582.57, "view_count": 16313}, {"video_id": "pi-4mBmahts", "question": "In a room equipped with transparent glass, with green plants planted outside the room, and a white pillar inside the room, a woman with long blonde hair in a bikini has one hand resting on the pillar. 
When this woman appears for the first time, what does she do?", "question_wo_referring_query": "What does she do?", "candidates": ["She walks towards the camera.", "She is dancing.", "She picks up a tambourine.", "She climbs onto the pillar.", "She puts on a coat."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "pi-4mBmahts_1", "video_path": "pi-4mBmahts.mp4", "subtitle_path": "pi-4mBmahts_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 582.57, "view_count": 16313}, {"video_id": "ti5dkG4R8lw", "question": "Against a white background with a black (4,4) font style and a red dot, in the lower right corner of the screen, there is a man with short hair wearing a black jacket. His left hand is holding an object, and his right hand is clawed with the palm facing up. Which character is being illuminated by the red dot?", "question_wo_referring_query": "Against a white background with a black (4,4) font style and a red dot, in the lower right corner of the screen, there is a man with short hair wearing a black jacket. His left hand is holding an object, and his right hand is clawed with the palm facing up. Which character is being illuminated by the red dot?", "candidates": ["1", "y", "p", "x", "4"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "ti5dkG4R8lw_0", "video_path": "ti5dkG4R8lw.mp4", "subtitle_path": "ti5dkG4R8lw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.24, "view_count": 3}, {"video_id": "ti5dkG4R8lw", "question": "In the bottom right corner of the screen, there is a man with short black hair wearing a long-sleeved black jacket. At this moment, his eyes are closed. The background is white, and outside of the character, there is black text 'x.view\n(-1,8)'. A red light is shining on one of the characters. 
Which character is being illuminated?", "question_wo_referring_query": "Which character is being illuminated?", "candidates": ["x", "w", "i", "v", "8"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "ti5dkG4R8lw_1", "video_path": "ti5dkG4R8lw.mp4", "subtitle_path": "ti5dkG4R8lw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.24, "view_count": 3}, {"video_id": "vXmjhZcTt_I", "question": "In the auditorium, different images are being projected on the screen at the back. At the front, there is a man with short hair wearing a black tie. At this moment, he just closed his eyes. When the subtitle mentions 'editor of NASA', what are the colors of the man's suit and shirt?", "question_wo_referring_query": "What are the colors of the man's suit and shirt?", "candidates": ["Black suit and red shirt", "Olive suit and pink shirt", "Black suit and white shirt", "Black suit and pink shirt", "Blue suit and pink shirt"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "vXmjhZcTt_I_0", "video_path": "vXmjhZcTt_I.mp4", "subtitle_path": "vXmjhZcTt_I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.64, "view_count": 71880}, {"video_id": "vXmjhZcTt_I", "question": "The bookshelf in the room is full of books, and there is a globe model on the cabinet in front of the bookshelf. In front of the model, there is a man wearing glasses. He is pointing with a finger. 
When the subtitle mentions 'this the spacecraft that just was just', what kind of outerwear is he wearing?", "question_wo_referring_query": "What kind of outerwear is he wearing?", "candidates": ["Blue long-sleeved outerwear", "Nude-colored long-sleeved outerwear", "Blue long-sleeved knit shirt", "Blue short-sleeved leather jacket", "Olive long-sleeved outerwear"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "vXmjhZcTt_I_1", "video_path": "vXmjhZcTt_I.mp4", "subtitle_path": "vXmjhZcTt_I_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 191.64, "view_count": 71880}, {"video_id": "C0MrVovn6s4", "question": "On the left side of the white background, there are black characters, and on the right side, there is a red dot. Below the red dot, there is a man wearing a blue jacket, with his hands clasped together. What is the hairstyle of this man?", "question_wo_referring_query": "What is the hairstyle of this man?", "candidates": ["Gray short hair", "Black short hair", "Black mushroom cut", "Black long hair", "Black mohawk"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "C0MrVovn6s4_0", "video_path": "C0MrVovn6s4.mp4", "subtitle_path": "C0MrVovn6s4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 519.24, "view_count": 14}, {"video_id": "C0MrVovn6s4", "question": "In a white background with black text, a red dot is positioned between '\u5355\u8bcde' and '\u5355\u8bcdr'. In the bottom right corner of the screen, there is a short-haired man wearing glasses. He is holding something in his left hand and making a gesture with his right hand. 
What kind of outerwear is this man wearing?", "question_wo_referring_query": "What kind of outerwear is this man wearing?", "candidates": ["blue knitted sweater", "blue suit", "blue wool sweater", "gray suit", "gray knitted sweater"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2A", "level": "IntraMoment", "id": "C0MrVovn6s4_1", "video_path": "C0MrVovn6s4.mp4", "subtitle_path": "C0MrVovn6s4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 519.24, "view_count": 14}, {"video_id": "s9Jam74Mxoo", "question": "On the grass, there is a white car, a red basket, a silver table, and a man with tattooed arms wearing a black short-sleeve shirt sitting beside. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He stands up", "He is cutting the item in his hand with a knife", "He is tapping on the silver table", "He is drinking alcohol", "He throws the item in his hand onto the ground"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "s9Jam74Mxoo_0", "video_path": "s9Jam74Mxoo.mp4", "subtitle_path": "s9Jam74Mxoo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 425.47, "view_count": 161245}, {"video_id": "s9Jam74Mxoo", "question": "In an uneven grassy area, green and yellow grasses are mixed and covering the whole ground. On the grassland, there is a section made up of piled stones, with black ash inside it, indicating a previous fire. Around this area, there stands a tripod. A man dressed in black is squatting nearby. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is taking wood out of the ash pile.", "He is gathering the tripod.", "He stands up.", "He is sitting on the ground.", "He is adding wood to the ash pile."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "s9Jam74Mxoo_1", "video_path": "s9Jam74Mxoo.mp4", "subtitle_path": "s9Jam74Mxoo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 425.47, "view_count": 161245}, {"video_id": "J54lSwAmxzQ", "question": "In a spacious room filled with many products, a person dressed in purple stands in the aisle. A woman with long hair, wearing a black hat, stands in front of a rack with a small bag hanging on it. She is holding something and looking at a mirror. When this woman and the subtitle 'then she came along and she found it' appear together, what change occurs to her?", "question_wo_referring_query": "What change occurs to her?", "candidates": ["She takes off her hat", "She is holding a camera", "She takes a bag", "She changes to a white coat", "She changes to a black coat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "J54lSwAmxzQ_0", "video_path": "J54lSwAmxzQ.mp4", "subtitle_path": "J54lSwAmxzQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.03, "view_count": 5217}, {"video_id": "J54lSwAmxzQ", "question": "In a room, there are two people sitting on a chair. The person on the left is only partially visible, while the man on the right is wearing a black coat and a blue shirt. 
When this man appears together with the subtitle 'watch some oversized dudes hit each,' what change happens to him?", "question_wo_referring_query": "what change happens to him?", "candidates": ["He wears a blue sweater", "He takes off the black coat", "He puts on a watch", "He puts on glasses", "He puts on a hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "J54lSwAmxzQ_1", "video_path": "J54lSwAmxzQ.mp4", "subtitle_path": "J54lSwAmxzQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 567.03, "view_count": 5217}, {"video_id": "jC6XV6Ap1Pc", "question": "On a brown wood table, there is a black round plate which is empty in the middle. What happens when a hand wearing a black glove appears next to the round plate?", "question_wo_referring_query": "What changes occur when a hand wearing a black glove appears?", "candidates": ["White rice appears in the round plate.", "An egg appears in the round plate.", "Oil appears in the round plate.", "Carrots appear in the round plate.", "Flour appears in the round plate."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "jC6XV6Ap1Pc_0", "video_path": "jC6XV6Ap1Pc.mp4", "subtitle_path": "jC6XV6Ap1Pc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 262.4, "view_count": 21104}, {"video_id": "jC6XV6Ap1Pc", "question": "A black tray is placed on a wooden surface, with a white object inside the tray. Someone removes a thin film covering the object. 
When this object appears next to a knife, what change does it undergo?", "question_wo_referring_query": "What change does it undergo?", "candidates": ["A needle is inserted into the object", "It turns yellowish brown", "It turns green", "It is missing half of it", "It turns black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "jC6XV6Ap1Pc_1", "video_path": "jC6XV6Ap1Pc.mp4", "subtitle_path": "jC6XV6Ap1Pc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 262.4, "view_count": 21104}, {"video_id": "Rukse1qJ_gU", "question": "A white background is divided into two areas by a gray line. On the left side of the gray line, there are a few lines of text. The topmost line reads 'Attack Effectiveness.' On the right side of the gray line, there is an icon of a mouse cursor. Which subtitles appear together with this mouse cursor icon?", "question_wo_referring_query": ", which subtitles appear together with this mouse cursor icon?", "candidates": ["basically the attack success rate which my ", "images by 99.53%", "means i ", "arxs badet Blended and w", "non-visible"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "Rukse1qJ_gU_0", "video_path": "Rukse1qJ_gU.mp4", "subtitle_path": "Rukse1qJ_gU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 865, "duration": 452.0, "view_count": 74}, {"video_id": "Rukse1qJ_gU", "question": "In a white background there is a gray arc line, to the right of the arc line is a clean white area, to the left of the arc line is a line of text that says 'Resistance to Fine-Pruning', and an image of a mouse icon is below all the text. 
With which subtitles did the text 'Resistance to Fine-Pruning' appear together?", "question_wo_referring_query": "In a white background there is a gray arc line, to the right of the arc line is a clean white area, to the left of the arc line is a line of text that says 'Resistance to Fine-Pruning', and an image of a mouse icon is below all the text. With which subtitles did the text 'Resistance to Fine-Pruning' appear together?", "candidates": ["interested in that uh they segment that", "been listed", "two evaluation metrics which is ba ba", "benign data set that during training we", "strip so it uh in fine ping what happen"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "TOS", "level": "L2-Relation", "id": "Rukse1qJ_gU_1", "video_path": "Rukse1qJ_gU.mp4", "subtitle_path": "Rukse1qJ_gU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 865, "duration": 452.0, "view_count": 74}, {"video_id": "AVIYgEzP-80", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a wooden table with a book showing only a corner of its flower-patterned cover. Then, an open book with 'Yu Zhen Ru Dao' and a red seal on the left page, and a drawing of two women on the right page. Finally, a painting of a woman with her hair up, wearing a wide robe, reading a book.", "First, an open book with 'Yu Zhen Ru Dao' and a red seal on the left page, and a drawing of two women on the right page. Then, a painting of a woman with her hair up, wearing a wide robe, reading a book. Finally, a wooden table with a book showing only a corner of its flower-patterned cover.", "First, a painting of a woman with her hair up, wearing a wide robe, reading a book. Then, a wooden table with a book showing only a corner of its flower-patterned cover. 
Finally, an open book with 'Yu Zhen Ru Dao' and a red seal on the left page, and a drawing of two women on the right page.", "First, there is a wooden table with a book showing only a corner of its flower-patterned cover. Then, a painting of a woman with her hair up, wearing a wide robe, reading a book. Finally, an open book with 'Yu Zhen Ru Dao' and a red seal on the left page, and a drawing of two women on the right page.", "First, an open book with 'Yu Zhen Ru Dao' and a red seal on the left page, and a drawing of two women on the right page. Then, a wooden table with a book showing only a corner of its flower-patterned cover. Finally, a painting of a woman with her hair up, wearing a wide robe, reading a book."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "AVIYgEzP-80_0", "video_path": "AVIYgEzP-80.mp4", "subtitle_path": "AVIYgEzP-80_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.56, "view_count": 3665}, {"video_id": "AVIYgEzP-80", "question": "Which of the following scenarios is in the correct sequence?", "question_wo_referring_query": "Which of the following scenarios is in the correct sequence?", "candidates": ["First, there is a piece of calligraphy with the words 'Twelve banquets, playing the flute and yellow bamboo'; then there is a painting with a woman with tied-up hair, wearing a wide robe, holding a brush; finally, there is another painting with a horse and a man wearing a headscarf standing next to it.", "First, there is a painting with a woman with tied-up hair, wearing a wide robe, holding a brush; then there is a piece of calligraphy with the words 'Twelve banquets, playing the flute and yellow bamboo'; finally, there is another painting with a horse and a man wearing a headscarf standing next to it.", "First, there is a painting with a horse and a man wearing a headscarf standing next to it; then there is a piece of calligraphy with the words 'Twelve 
banquets, playing the flute and yellow bamboo'; finally, there is another painting, with a woman with tied-up hair, wearing a wide robe, and holding a brush.", "First, there is a piece of calligraphy with the words 'Twelve banquets, playing the flute and yellow bamboo'; then there is a painting with a horse and a man wearing a headscarf standing next to it; finally, there is another painting, with a woman with tied-up hair, wearing a wide robe, and holding a brush.", "First, there is a painting with a horse and a man wearing a headscarf standing next to it; then there is a painting with a woman with tied-up hair, wearing a wide robe, holding a brush; finally, there is a piece of calligraphy with the words 'Twelve banquets, playing the flute and yellow bamboo'."], "topic_category": "KA-Knowledge-Art", "question_category": "SSS", "level": "L2-Relation", "id": "AVIYgEzP-80_1", "video_path": "AVIYgEzP-80.mp4", "subtitle_path": "AVIYgEzP-80_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.56, "view_count": 3665}, {"video_id": "e-_92MYcANk", "question": "By a thriving grassland stands a forest. A man with a bun is standing in front of the tree wearing a white long-sleeve shirt. Before the subtitle mentions (rustling and soft scraping), what insect appears in the video?", "question_wo_referring_query": "What insect appears in the video?", "candidates": ["ant", "spider", "moth", "beetle", "bee"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "e-_92MYcANk_0", "video_path": "e-_92MYcANk.mp4", "subtitle_path": "e-_92MYcANk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.07, "view_count": 97760}, {"video_id": "e-_92MYcANk", "question": "In a room with a window and white walls, there is a white cloth placed. 
Seated on the cloth are two people: on the left is a bald man wearing a white shirt, and on the right is a woman wearing a green outfit and glasses. After the subtitles mention (traffic, honking, and other city noises), what object appears in the video?", "question_wo_referring_query": "What object appears in the video?", "candidates": ["Yellow tower crane", "White bowl", "American flag", "White pillow", "Silver fork"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "e-_92MYcANk_1", "video_path": "e-_92MYcANk.mp4", "subtitle_path": "e-_92MYcANk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.07, "view_count": 97760}, {"video_id": "VRwWJyy24So", "question": "There are many tools and a shiny metal plate on a workbench in a room. The metal plate reflects the checkered shirt. The owner of the checkered shirt is sitting at the workbench. He has short hair and is wearing glasses, holding a tool in his hand. After the subtitle mentions 'that I\u2019m working with doesn\u2019t feel flat,' what does the man do with the hose?", "question_wo_referring_query": "What does the man do with the hose?", "candidates": ["He washes the floor with the hose", "He washes his hair with the hose", "He washes the metal plate with the hose", "He waters the plants with the hose", "He cuts the hose"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "VRwWJyy24So_0", "video_path": "VRwWJyy24So.mp4", "subtitle_path": "VRwWJyy24So_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 304.41, "view_count": 141337}, {"video_id": "VRwWJyy24So", "question": "A person wearing black gloves and not showing their face uses a white paper to remove excess ink from the surface of a black plate. 
After the subtitle mentions 'to remove excess ink from the raised surface', what does the man do next?", "question_wo_referring_query": "What does the man do next?", "candidates": ["He rinses the plate with water", "He wipes the edges of the plate with white paper", "He applies black paint on the plate", "He carves on the plate with a tool", "He bakes the plate with fire"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "VRwWJyy24So_1", "video_path": "VRwWJyy24So.mp4", "subtitle_path": "VRwWJyy24So_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 304.41, "view_count": 141337}, {"video_id": "-ToD8fKbhiw", "question": "In a room with a white door, there are some posters and photos on the wall. A man with short hair, wearing green clothes, is sitting in the room, holding a photo and showing it. On the right side of the screen, the word 'Kosovo' appears. After this scene, which country's flag appears first in the video?", "question_wo_referring_query": "Which country's flag appears first in the video?", "candidates": ["Palestine", "Greece", "Canadian flag", "Japan", "United States flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "-ToD8fKbhiw_0", "video_path": "-ToD8fKbhiw.mp4", "subtitle_path": "-ToD8fKbhiw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 357.89, "view_count": 127811}, {"video_id": "-ToD8fKbhiw", "question": "In a room, there is a bed covered with an orange blanket. Seated on the side of the bed is a man with short hair wearing a green short-sleeved shirt. He raises his drawing and smiles happily. 
After this scene, what is the final image that appears in the video?", "question_wo_referring_query": "What is the final image that appears in the video?", "candidates": ["A movie poster with many people on it", "Three people dancing", "A person wearing a suit looking at some materials", "A castle", "A white building"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "-ToD8fKbhiw_1", "video_path": "-ToD8fKbhiw.mp4", "subtitle_path": "-ToD8fKbhiw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 357.89, "view_count": 127811}, {"video_id": "7axu8JonrD0", "question": "In a room, there is a world map hanging on the wall and a model of a tiger. A man with short hair, wearing a green shirt, is speaking. At this moment, he closes his eyes. What happened earlier in the video?", "question_wo_referring_query": "What happened earlier in the video?", "candidates": ["A bird was trapped in a birdcage", "A firework exploded in the air", "A bird landed on an exposed rock surface", "A man shot a pheasant", "Two cars collided"], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "7axu8JonrD0_0", "video_path": "7axu8JonrD0.mp4", "subtitle_path": "7axu8JonrD0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 503.7, "view_count": 38189}, {"video_id": "7axu8JonrD0", "question": "Under the green tree, there is a patch of withered yellow grass. A man wearing a hat and a blue jacket is riding a horse on the grass. 
What happens next?", "question_wo_referring_query": ", what happens next?", "candidates": ["A cow lying on the grass.", "A rocket launching into the sky.", "A man wearing a checkered shirt lassoing a calf while riding on horseback.", "A horse eating grass.", "Mickey Mouse riding a horse and galloping wildly."], "topic_category": "KG-Knowledge-Geography", "question_category": "E3E", "level": "L2-Relation", "id": "7axu8JonrD0_1", "video_path": "7axu8JonrD0.mp4", "subtitle_path": "7axu8JonrD0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 503.7, "view_count": 38189}, {"video_id": "-cFVXLa2SWw", "question": "Two men are standing in front of a white column. The man on the right has short hair and is wearing a red short-sleeved shirt. The man on the left is wearing a hat and sunglasses, and a white short-sleeved shirt. When the subtitle mentions 'kidding okay now you go away it\u2019s way', what is the man in the white shirt doing?", "question_wo_referring_query": "What is the man in the white shirt doing?", "candidates": ["He is holding his head and crying", "He is wiping his glasses", "He is waving his arms", "He slapped his own ear", "He is taking off his glasses"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "-cFVXLa2SWw_0", "video_path": "-cFVXLa2SWw.mp4", "subtitle_path": "-cFVXLa2SWw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.43, "view_count": 12864}, {"video_id": "-cFVXLa2SWw", "question": "In the distance, there is a building with green glass. Nearby, a man wearing a hat and a white shirt is talking with a man wearing sunglasses and a black shirt. Behind them, there is a woman in a purple short-sleeve shirt. 
When the subtitle mentions 'largest city on the Canadian border,' what is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["She was dancing", "She quickly walked out of the frame with her left hand near her mouth", "She squatted on the ground", "She nudged the man in the white shirt", "She flipped over a railing"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "-cFVXLa2SWw_1", "video_path": "-cFVXLa2SWw.mp4", "subtitle_path": "-cFVXLa2SWw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.43, "view_count": 12864}, {"video_id": "zPhZGkG8AkU", "question": "There is a row of people standing on the grass. The three people on the left are wearing hats and holding rifles and shields. The four people on the right are wearing gloves. The person in the middle is holding a long sword and a shield. What did the man holding the long sword do when he appeared for the first time?", "question_wo_referring_query": "What did he do?", "candidates": ["He was performing a sword dance", "He was slashing the people wearing gloves", "He put on a hat", "He threw away the shield", "He threw away the long sword"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "zPhZGkG8AkU_0", "video_path": "zPhZGkG8AkU.mp4", "subtitle_path": "zPhZGkG8AkU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.53, "view_count": 4402}, {"video_id": "zPhZGkG8AkU", "question": "A painting is hanging on the wall of the room. Next to the painting are raised curtains and windows. There is a computer on the desk in front of the window, and there are also books on the bookshelf beside it. In the room, there is also a person wearing a hat and a blue outfit. 
What did the person wearing the hat do when they appeared for the first time?", "question_wo_referring_query": "What did he do?", "candidates": ["He raised his hand and cheered", "He took a book from the bookshelf", "He turned off the computer", "He raised the curtain", "He took off his hat"], "topic_category": "KH-Knowledge-History", "question_category": "O2E", "level": "IntraMoment", "id": "zPhZGkG8AkU_1", "video_path": "zPhZGkG8AkU.mp4", "subtitle_path": "zPhZGkG8AkU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 407.53, "view_count": 4402}, {"video_id": "8O3CY-LxxmY", "question": "In a room, there is a man sitting wearing a multicolored fur coat and red shorts. His right hand is placed on the armrest next to him. When the subtitle mentions 'are you down to', what is the man's hairstyle?", "question_wo_referring_query": "What is the man's hairstyle?", "candidates": ["Mullet", "Mohawk", "Buzz cut", "Liu Hai hairstyle", "Afro"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "8O3CY-LxxmY_0", "video_path": "8O3CY-LxxmY.mp4", "subtitle_path": "8O3CY-LxxmY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.02, "view_count": 4376}, {"video_id": "8O3CY-LxxmY", "question": "In the laundry room, a washing machine in the background is spinning and washing clothes. A man with black hair is talking to a person holding clothes whose face is not shown. 
When the subtitle mentions 'the back I\u2019ll take itow leave you with,' what kind of outerwear is the man wearing?", "question_wo_referring_query": "What kind of outerwear is the man wearing?", "candidates": ["A blue windbreaker", "A denim jacket without a hood", "A gray suit", "A gray wool sweater", "A gray hoodie"], "topic_category": "NP-News-Programs", "question_category": "T2A", "level": "IntraMoment", "id": "8O3CY-LxxmY_1", "video_path": "8O3CY-LxxmY.mp4", "subtitle_path": "8O3CY-LxxmY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 190.02, "view_count": 4376}, {"video_id": "ItyYS0tbs6g", "question": "Dense stones are laid on the grass, with a patch of grass similar to a heart shape exposed at the center. Sunlight shines on the uneven stones, casting shadows on the grass. What shape is the stone at the center?", "question_wo_referring_query": ", what shape is the stone at the center?", "candidates": ["Hexagon-like shape", "Triangle", "Circle", "Rectangle-like shape", "Square-like shape"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "ItyYS0tbs6g_0", "video_path": "ItyYS0tbs6g.mp4", "subtitle_path": "ItyYS0tbs6g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 547.09, "view_count": 210905}, {"video_id": "ItyYS0tbs6g", "question": "A man appears in front of a green background, wearing glasses and a checkered shirt. The shirt is fully buttoned, and the man's hands are positioned below his chest, out of the frame. 
What style are the man's glasses?", "question_wo_referring_query": "What style are the man's glasses in the checkered shirt?", "candidates": ["Black large round frame", "Black rectangular frame", "Gold rectangular frame", "Black small round frame", "Gold small round frame"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "ItyYS0tbs6g_1", "video_path": "ItyYS0tbs6g.mp4", "subtitle_path": "ItyYS0tbs6g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 547.09, "view_count": 210905}, {"video_id": "igyu1fxfSqQ", "question": "A man wearing a shirt is sitting on a black chair. The man has headphones on his head. There is a bookshelf behind the man, on which books and paper boxes are placed. When the subtitles show \u201cAnd so if you have, we're trying to measure changes in the\u201d, what object is present in the room with the bookshelf?", "question_wo_referring_query": "What object is present in the room with the bookshelf?", "candidates": ["pot plant", "plush toy", "globe", "transparent jar", "watch"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "igyu1fxfSqQ_0", "video_path": "igyu1fxfSqQ.mp4", "subtitle_path": "igyu1fxfSqQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.9, "view_count": 118533}, {"video_id": "igyu1fxfSqQ", "question": "A man appears in front of a blue background. The man is wearing glasses and a striped shirt. His hands are clasped together near his waist. 
When the subtitle 'Never underestimate the combination' appears, what object is present on the man in the striped shirt?", "question_wo_referring_query": "What object is present on the man in the striped shirt?", "candidates": ["badge", "ring", "hat", "watch", "necklace"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2O", "level": "IntraMoment", "id": "igyu1fxfSqQ_1", "video_path": "igyu1fxfSqQ.mp4", "subtitle_path": "igyu1fxfSqQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 564.9, "view_count": 118533}, {"video_id": "yBvyYixwuIk", "question": "A man wearing a dark short-sleeved shirt is sitting on a chair. There is a pocket on the chest of the short-sleeved shirt. The man's hands are raised on both sides of his head. There is an empty chair beside the man, and behind the man are some miscellaneous items. What object is present on this man wearing the short-sleeved shirt?", "question_wo_referring_query": "What object is present on this man wearing the short-sleeved shirt?", "candidates": ["hat", "glasses", "necklace", "watch", "bracelet"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "yBvyYixwuIk_0", "video_path": "yBvyYixwuIk.mp4", "subtitle_path": "yBvyYixwuIk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 546.38, "view_count": 210339}, {"video_id": "yBvyYixwuIk", "question": "Two men appear on a field. The man on the left is wearing blue jeans and a dark-colored shirt, while the man on the right is wearing gray trousers and a black shirt. Both men are holding a sheep together. There is a pillar next to the man on the left. 
What object exists in this scene?", "question_wo_referring_query": "What object exists in this scene?", "candidates": ["wristwatch", "shoes", "hat", "bird", "shepherd dog"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "yBvyYixwuIk_1", "video_path": "yBvyYixwuIk.mp4", "subtitle_path": "yBvyYixwuIk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 546.38, "view_count": 210339}, {"video_id": "DwtmeNDWtHg", "question": "A man appears in a broadcast studio. He is wearing a white shirt and a dark suit. The man's hair is silver, and behind him is a blurry electronic screen. There is a red strip below the man's chest. What is this silver-haired man doing?", "question_wo_referring_query": ", what is this silver-haired man doing?", "candidates": ["Holding glasses with one hand", "Holding a ring with one hand", "Holding a ring with both hands", "Holding a cup with one hand", "Holding glasses with both hands"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "DwtmeNDWtHg_0", "video_path": "DwtmeNDWtHg.mp4", "subtitle_path": "DwtmeNDWtHg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 341.88, "view_count": 351665}, {"video_id": "DwtmeNDWtHg", "question": "A man appears in a broadcast room, wearing a white shirt and a dark-colored suit. The man's hair is silver, and there is a blurred electronic screen behind him. His eyes are tightly closed. 
What is this silver-haired man doing?", "question_wo_referring_query": "What is this silver-haired man doing?", "candidates": ["Waving hands", "Hands touching", "Holding a ring", "Holding a glass of water", "Crossing arms"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "DwtmeNDWtHg_1", "video_path": "DwtmeNDWtHg.mp4", "subtitle_path": "DwtmeNDWtHg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 341.88, "view_count": 351665}, {"video_id": "yf93BOmNNfQ", "question": "A man and a woman are communicating online. The man on the left is wearing a white shirt and a blue suit, with a red tie. The woman on the right has golden hair and is wearing a black top. In what caption did the man on the left, who is wearing a blue suit, appear?", "question_wo_referring_query": "In what caption did the man on the left, who is wearing a blue suit, appear?", "candidates": ["Here I see the danger of being outlined", "way? If this is somehow passed and TikTok is", "policies espoused by the Chinese government", "Regulations on how the company manages the content", "Well. And of course, this is not the only"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "yf93BOmNNfQ_0", "video_path": "yf93BOmNNfQ.mp4", "subtitle_path": "yf93BOmNNfQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.57, "view_count": 2462}, {"video_id": "yf93BOmNNfQ", "question": "Two women are having an online conversation. The woman on the left is wearing earrings and a dark top, with one hand positioned on her chest. The woman on the right has blonde hair, closed eyes, and is wearing a black top. In what subtitle does the woman on the left wearing earrings appear?", "question_wo_referring_query": "In what subtitle does the woman on the left wearing earrings appear?", "candidates": ["legislation like this? 
No, and definitely not before the", "Here I see the danger of being outlined", "Well. And of course, this is not the only", "Regulations on how the company manages the content", "way? If this is somehow passed and TikTok is"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "yf93BOmNNfQ_1", "video_path": "yf93BOmNNfQ.mp4", "subtitle_path": "yf93BOmNNfQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.57, "view_count": 2462}, {"video_id": "0DIHeTrm4WI", "question": "A man in camo pants is walking on the street. The man is wearing camo pants, a hat, his pants are dark in color, he has a stubble beard, there are buildings beside the street, and there is clutter on the street. Where else has this man in camo pants appeared?", "question_wo_referring_query": "Where else has this man in camo pants appeared?", "candidates": ["Restaurant sofa", "House with a liquor cabinet", "Park bench", "Park lawn", "Restaurant table"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "0DIHeTrm4WI_0", "video_path": "0DIHeTrm4WI.mp4", "subtitle_path": "0DIHeTrm4WI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.88, "view_count": 410381}, {"video_id": "0DIHeTrm4WI", "question": "In a forest clearing, there are three characters from cartoons on the left side, dressed in uniforms and wearing hats, while on the right side there is a character kneeling on one knee shooting, also wearing a helmet and a uniform. 
Where else has the gun in the right-side helmeted character's hand appeared before?", "question_wo_referring_query": "Where else has the gun in the right-side helmeted character's hand appeared before?", "candidates": ["On the ground near a rock", "On the branch of a big tree", "On a stone by the lake", "On a bench on the grass", "In the grass by the lake"], "topic_category": "KH-Knowledge-History", "question_category": "SOS", "level": "L2-Relation", "id": "0DIHeTrm4WI_1", "video_path": "0DIHeTrm4WI.mp4", "subtitle_path": "0DIHeTrm4WI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 572.88, "view_count": 410381}, {"video_id": "zurEDTHjShY", "question": "What is the correct sequence of the following scenes?", "question_wo_referring_query": "What is the correct sequence of the following scenes?", "candidates": ["First, the king enters the venue in a carriage and gives a speech, then a group of people discuss the king's speech, followed by a man introducing while circling around the king's speech.", "First, a man introduces while circling around the king's speech, then the king enters the venue in a carriage and gives a speech, followed by a group of people discussing the king's speech.", "First, the king enters the venue in a carriage and gives a speech, then a man introduces while circling around the king's speech, followed by a group of people discussing the king's speech.", "First, a group of people discuss the king's speech, then the king enters the venue in a carriage and gives a speech, followed by a man introducing while circling around the king's speech.", "First, a group of people discuss the king's speech, then a man introduces while circling around the king's speech, followed by the king entering the venue in a carriage and giving a speech."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "zurEDTHjShY_0", "video_path": "zurEDTHjShY.mp4", "subtitle_path": 
"zurEDTHjShY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 259.0, "view_count": 96285}, {"video_id": "zurEDTHjShY", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the scene of other attendees entering, then the scene of the king delivering a speech, followed by the scene of the king's entry ceremony.", "First, the scene of the king's entry ceremony, then the scene of other attendees entering, followed by the scene of the king delivering a speech.", "First, the scene of the king delivering a speech, then the scene of the king's entry ceremony, followed by the scene of other attendees entering.", "First, the scene of the king's entry ceremony, then the scene of the king delivering a speech, followed by the scene of other attendees entering.", "First, the scene of other attendees entering, then the scene of the king's entry ceremony, followed by the scene of the king delivering a speech."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "zurEDTHjShY_1", "video_path": "zurEDTHjShY.mp4", "subtitle_path": "zurEDTHjShY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 259.0, "view_count": 96285}, {"video_id": "7FdGA-NefJo", "question": "A man appears in a room, wearing a black hooded robe with his eyes tightly closed. He has a beard on his chin, and behind him are a chair, green plants, and a cabinet. 
After the subtitle 'is forest fence treasure' appears, what action does the man in the black robe perform?", "question_wo_referring_query": "What action does the man in the black robe perform?", "candidates": ["The man cups his face with both hands", "The man supports his forehead with one hand", "The man supports his chin with one hand", "The man crosses his hands at the waist", "The man supports his chin with both hands"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "7FdGA-NefJo_0", "video_path": "7FdGA-NefJo.mp4", "subtitle_path": "7FdGA-NefJo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.73, "view_count": 2301658}, {"video_id": "7FdGA-NefJo", "question": "Two gentlemen appear in the kitchen. The gentleman on the left is wearing a white shirt and red pants, along with a tie. The gentleman on the right is wearing a black shirt. On the table in front of them, there are kitchen utensils and ingredients. 
After the subtitle 'tommy and cam have started their own,' what did the man in the black shirt do?", "question_wo_referring_query": "What did the man in the black shirt do?", "candidates": ["Picked up a plate", "Picked up a knife", "Opened the cupboard", "Picked up a fork", "Picked up an apple"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "7FdGA-NefJo_1", "video_path": "7FdGA-NefJo.mp4", "subtitle_path": "7FdGA-NefJo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.73, "view_count": 2301658}, {"video_id": "Cq7DpoLB7dc", "question": "Who is the first person to appear in the video?", "question_wo_referring_query": "Who is the first person to appear in the video?", "candidates": ["A man wearing a black short-sleeved shirt", "A man wearing a yellow shirt", "A woman wearing a yellow shirt", "A man wearing a white shirt", "A soldier in a green uniform"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "Cq7DpoLB7dc_0", "video_path": "Cq7DpoLB7dc.mp4", "subtitle_path": "Cq7DpoLB7dc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.99, "view_count": 5654595}, {"video_id": "Cq7DpoLB7dc", "question": "What is the first mode of transportation that appears in the video?", "question_wo_referring_query": "What is the first mode of transportation that appears in the video?", "candidates": ["Rickshaw", "Tricycle", "Helicopter", "Bus", "Bicycle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "Cq7DpoLB7dc_1", "video_path": "Cq7DpoLB7dc.mp4", "subtitle_path": "Cq7DpoLB7dc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 450.99, "view_count": 5654595}, {"video_id": "NuCwkQKfbYk", "question": "There are two lines of pink English text at the bottom of a white background. 
The words in the first line are bolded. The subtitles introduce the main character's name and profession. What is mentioned after the name and profession introduction ends?", "question_wo_referring_query": "What is mentioned after the name and profession introduction in the subtitles?", "candidates": ["Modernist philosophy", "The main character is a fan of an artist", "Portrait", "Decorative art", "Concept art"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "NuCwkQKfbYk_0", "video_path": "NuCwkQKfbYk.mp4", "subtitle_path": "NuCwkQKfbYk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.37, "view_count": 6092}, {"video_id": "NuCwkQKfbYk", "question": "A naked woman appears in the frame, wearing a headscarf, with her back to the camera. There is a screen near her feet, and the frame is brown with elaborate floral patterns. The subtitle mentions conceptual art. What else does the subtitle mention after conceptual art?", "question_wo_referring_query": "The subtitle mentions conceptual art. What else does the subtitle mention after conceptual art?", "candidates": ["Image creation", "Decorative art", "Do not view the world through colored lenses", "Portrait painting", "Modern philosophy"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "NuCwkQKfbYk_1", "video_path": "NuCwkQKfbYk.mp4", "subtitle_path": "NuCwkQKfbYk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 201.37, "view_count": 6092}, {"video_id": "QnLqohT0NDs", "question": "A lady is leaning against a window. She is wearing a dark-colored top and blue denim shorts. The lady has a black bag on her back and is wearing glasses. The window is red-brown with a white door beside it. Inside the window, there is a white curtain. 
When the word 'Music' appears in the subtitles, what is the lady with glasses doing?", "question_wo_referring_query": "What is the lady with glasses doing?", "candidates": ["Biting a hot dog with her mouth", "Biting a fork with her mouth", "Adjusting her top", "Adjusting her glasses", "Biting an apple with her mouth"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "QnLqohT0NDs_0", "video_path": "QnLqohT0NDs.mp4", "subtitle_path": "QnLqohT0NDs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 253.42, "view_count": 75355}, {"video_id": "QnLqohT0NDs", "question": "A man appears on the screen. He is wearing a black top, carrying a backpack, and standing in front of a building. The building has a yellow pipe on its exterior wall, which is emitting white gas. What is the man carrying the backpack doing when the subtitle 'people from all over the world come to' appears?", "question_wo_referring_query": "What is the man carrying the backpack doing?", "candidates": ["taking a photo", "drinking water", "clapping", "bending", "stretching"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "QnLqohT0NDs_1", "video_path": "QnLqohT0NDs.mp4", "subtitle_path": "QnLqohT0NDs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 253.42, "view_count": 75355}, {"video_id": "I8EQAnKntLs", "question": "Four cartoon characters appear on the screen. The cartoon character on the left is wearing a black top and a tie, while the character on the right is wearing a white top and glasses, sitting on a table. There is a clock behind the four cartoon characters. 
When the subtitle 'Many low level bureaucratic positions are filled through competitive exam-based civil' appears, what style of glasses is the second man from the left wearing?", "question_wo_referring_query": "What style of glasses is the second man from the left wearing?", "candidates": ["Large round frames", "Fan-shaped frames", "Rectangular frames", "Oval frames", "Small round frames"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "I8EQAnKntLs_0", "video_path": "I8EQAnKntLs.mp4", "subtitle_path": "I8EQAnKntLs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 418.0, "view_count": 1437156}, {"video_id": "I8EQAnKntLs", "question": "Three cartoon characters appear on a yellow screen, standing next to a machine. The woman on the left is wearing a blue skirt and a white hat, the woman on the right is wearing an olive shirt and a white hat, and the man in the middle is wearing a white kilt. When the subtitle says 'ones called regulations. In doing this, they're acting like a legislature, especially since,' what shape is the red pattern on the man's white kilt in the middle?", "question_wo_referring_query": "What shape is the red pattern on the front of the man's white kilt?", "candidates": ["star", "square", "rectangle", "circle", "triangle"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "I8EQAnKntLs_1", "video_path": "I8EQAnKntLs.mp4", "subtitle_path": "I8EQAnKntLs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 418.0, "view_count": 1437156}, {"video_id": "cN69OEJdcrw", "question": "A group of people is on a sandy terrain, with greenery on the left side. The sandy terrain is sloped, and the group is moving upwards. The person on the far left is wearing a white short-sleeve shirt and black pants. The man at the back is wearing a yellow top and shorts, and he is carrying a black backpack. 
What type of top is the man carrying the black backpack wearing?", "question_wo_referring_query": "What type of top is the man carrying the black backpack wearing?", "candidates": ["Sleeveless tank top", "Jacket", "Hooded hazmat suit", "Hazmat suit", "Sweater"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "cN69OEJdcrw_0", "video_path": "cN69OEJdcrw.mp4", "subtitle_path": "cN69OEJdcrw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 544.45, "view_count": 46728}, {"video_id": "cN69OEJdcrw", "question": "In the blue sky, white clouds are floating. A temple appears on the screen. There is a signboard on the side of the road leading to the temple. In front of the temple, there is a golden pillar and a small pavilion. The lower part of the temple structure is square, while the upper part gets narrower. What is the shape of the signboard on the side of the road?", "question_wo_referring_query": "What is the shape of the signboard on the side of the road?", "candidates": ["Circular", "Hexagonal", "Triangular", "Heart-shaped", "Square"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "cN69OEJdcrw_1", "video_path": "cN69OEJdcrw.mp4", "subtitle_path": "cN69OEJdcrw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 544.45, "view_count": 46728}, {"video_id": "kb0GwABVwGQ", "question": "A man stands next to a ceramic sculpture, wearing a gray top, a mask, and glasses. Behind the man are transparent glass and glaring lights. Square prayer cards are hanging around the ceramic sculpture. 
What objects are present at this scene?", "question_wo_referring_query": ", what objects are present at this scene?", "candidates": ["hat", "watch", "earrings", "umbrella", "bracelet"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "kb0GwABVwGQ_0", "video_path": "kb0GwABVwGQ.mp4", "subtitle_path": "kb0GwABVwGQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 355.99, "view_count": 9191}, {"video_id": "kb0GwABVwGQ", "question": "A woman is pushing a ceramic sculpture forward. She is wearing a white mask and a black top. The ceramic sculpture is placed on a red cloth-covered object. There are pedestrians and cars on the street next to the woman. To the left of the ceramic sculpture is a shop window display. What items are present in this scene?", "question_wo_referring_query": "What items are present in this scene?", "candidates": ["Backpack", "Cart", "Yellow dog", "White dog", "Motorcycle"], "topic_category": "KA-Knowledge-Art", "question_category": "S2O", "level": "IntraMoment", "id": "kb0GwABVwGQ_1", "video_path": "kb0GwABVwGQ.mp4", "subtitle_path": "kb0GwABVwGQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 355.99, "view_count": 9191}, {"video_id": "ML6vsuE38HI", "question": "In the scene, there is a man and a woman standing in a place where many paintings are hanging on the wall. The man is wearing a suit and a red tie, and the woman is wearing a pink dress, a headband, and holding a bouquet of flowers. They are having a wedding ceremony and making vows. Surrounding them are several men and women in formal attire. 
What happened after they completed their vows?", "question_wo_referring_query": "What happened after they completed their vows?", "candidates": ["They were chatting with the guests", "They exchanged gifts", "They exchanged rings", "They ran on a bridge", "They thanked the guests"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "ML6vsuE38HI_0", "video_path": "ML6vsuE38HI.mp4", "subtitle_path": "ML6vsuE38HI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 546.88, "view_count": 3989}, {"video_id": "ML6vsuE38HI", "question": "The scene shows a man wearing a black suit jacket and a woman wearing black and white clothing sitting on a bench by the roadside. There are green plants and trees around them, as well as small paths and pedestrians. Both the man and the woman have light brown hair. The man is looking at the woman smilingly, and the woman has her back facing a mirror. What happens next?", "question_wo_referring_query": "The woman has her back facing a mirror. What happens next?", "candidates": ["The man hands the woman a phone", "The man leaves", "The woman and the man shake hands", "The woman hugs the man", "The man hands the woman a parcel"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "ML6vsuE38HI_1", "video_path": "ML6vsuE38HI.mp4", "subtitle_path": "ML6vsuE38HI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 546.88, "view_count": 3989}, {"video_id": "YV5XfB7VvFM", "question": "On a bright sunny day, there is a green meadow. In the middle of the green meadow, there's a small wooden hut. Beside the green meadow, there's a woman wearing a white windbreaker with her hair tied up. When the subtitle \"and they return to their lodge. 
She contemplates the situation, and she's full of denial,\" appears, what is the woman in the white windbreaker doing?", "question_wo_referring_query": "On a bright sunny day, there is a green meadow. In the middle of the green meadow, there's a small wooden hut. Beside the green meadow, there's a woman wearing a white windbreaker with her hair tied up. When the subtitle \"and they return to their lodge. She contemplates the situation, and she's full of denial,\" appears, what is the woman in the white windbreaker doing?", "candidates": ["Dancing beside the green meadow", "Picking white daisies while kneeling beside the green meadow", "Walking slowly forward on a dirt road", "Kneeling and petting a puppy's head", "Singing beside the green meadow"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "YV5XfB7VvFM_0", "video_path": "YV5XfB7VvFM.mp4", "subtitle_path": "YV5XfB7VvFM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 594.72, "view_count": 39177}, {"video_id": "YV5XfB7VvFM", "question": "In a dense forest, sunlight filters through the leaves, casting light onto the green grass. On the grass, there is a woman with long hair, wearing a blue-green shirt. When the subtitle \"not even knowing the difference between her as a human, and Lynx, as a dog\" appears, what is the long-haired woman doing?", "question_wo_referring_query": "In a dense forest, sunlight filters through the leaves, casting light onto the green grass. On the grass, there is a woman with long hair, wearing a blue-green shirt. 
When the subtitle \"not even knowing the difference between her as a human, and Lynx, as a dog\" appears, what is the long-haired woman doing?", "candidates": ["Squatting by a big tree looking for mushrooms", "Chopping a tree with an axe", "Kneeling on the grass, holding her head and crying", "Lying motionless on the green grass", "Leaning against a big tree, looking upwards"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2E", "level": "IntraMoment", "id": "YV5XfB7VvFM_1", "video_path": "YV5XfB7VvFM.mp4", "subtitle_path": "YV5XfB7VvFM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 594.72, "view_count": 39177}, {"video_id": "tdTvMwbeCRg", "question": "A large picture frame is hanging on the wall inside the room, containing a group photo of many people. The lamp in front of the picture frame emits a halo of light, and a person with short hair wearing a black suit is standing under the light. When he appeared for the first time, what did he do?", "question_wo_referring_query": "When he appeared for the first time, what did he do?", "candidates": ["He raised the wine glass in his hand", "He poured himself a glass of wine", "He took a sip of wine", "He turned to face the group photo", "He smashed the wine glass"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "tdTvMwbeCRg_0", "video_path": "tdTvMwbeCRg.mp4", "subtitle_path": "tdTvMwbeCRg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 219.6, "view_count": 114257}, {"video_id": "tdTvMwbeCRg", "question": "In the pitch-black night, there is a pole with a lamp and a road sign on it. The road sign reads 'MAPLE STREET 300.' 
What happened when the lamp first appeared?", "question_wo_referring_query": "What happened when it first appeared?", "candidates": ["The lamp rolled onto the ground.", "The lamp that was on went out and didn't light up again.", "The lamp was taken down from the pole.", "The lamp kept turning on and off repeatedly.", "The lamp exploded."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "tdTvMwbeCRg_1", "video_path": "tdTvMwbeCRg.mp4", "subtitle_path": "tdTvMwbeCRg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 219.6, "view_count": 114257}, {"video_id": "56zAHSpSM7k", "question": "On the left side of the screen, there is a woman with black hair wearing a white shirt and carrying a backpack, holding a cane in her hand. On the right side, there is another black-haired woman looking at her. Who is holding the glasses in the middle of the screen?", "question_wo_referring_query": "Who is holding the glasses in the middle of the screen?", "candidates": ["A red-haired woman with a white bag on her shoulder", "A black-haired woman with a black bag on her shoulder", "A black-haired woman with a brown bag on her shoulder", "A black-haired woman holding a cane", "A black-haired woman with a purple bag on her shoulder"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "56zAHSpSM7k_0", "video_path": "56zAHSpSM7k.mp4", "subtitle_path": "56zAHSpSM7k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 579.28, "view_count": 2190}, {"video_id": "56zAHSpSM7k", "question": "A car is driving on the road with three people inside. One person, a black-haired woman wearing glasses, is seated in the front passenger seat. A woman wearing a black top is seated in the back seat. 
Who is shown in the driver's seat holding the steering wheel and driving?", "question_wo_referring_query": "Who is shown in the driver's seat holding the steering wheel and driving?", "candidates": ["A man wearing a white shirt and a yellow tie", "A man wearing a blue denim jacket", "A man wearing a white shirt and a black tie", "A man wearing a gray suit and a black tie", "A man wearing a black suit and a black tie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "56zAHSpSM7k_1", "video_path": "56zAHSpSM7k.mp4", "subtitle_path": "56zAHSpSM7k_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 579.28, "view_count": 2190}, {"video_id": "@healthfood-7066170430322724142", "question": "On the screen, there is a stainless steel bowl on the table, filled with green onions and other foods. Above the stainless steel bowl, there is a white bowl. What happened?", "question_wo_referring_query": "What happened?", "candidates": ["The small tomatoes were poured into the stainless steel bowl", "The small tomatoes were cut in half", "The small tomatoes were poured into the sink", "The pumpkin was put into the stainless steel bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@healthfood-7066170430322724142_0", "video_path": "7066170430322724142.mp4", "subtitle_path": "7066170430322724142_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 33.07, "view_count": 162700}, {"video_id": "@tiffycooks-7053166639109475589", "question": "In the video, in a room, there is a woman with black hair wearing a white short-sleeved shirt. The woman is wearing a gold bracelet on her right hand, and there is a box of fruit in front of her. 
What objects appeared in the video?", "question_wo_referring_query": "What objects appeared in the video?", "candidates": ["Mangoes", "Strawberries", "Blueberries", "Carrots"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@tiffycooks-7053166639109475589_0", "video_path": "7053166639109475589.mp4", "subtitle_path": "7053166639109475589_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.27, "view_count": 434510}, {"video_id": "@tiffycooks-6929587730867440902", "question": "In the kitchen, there is a woman with black hair wearing a black top. In front of her, there is a black pot. She is holding a spatula in her left hand. When she mentions 'the onion for two to three minutes and garlic add in sliced beef saute together for two to three,' what objects can be seen on screen?", "question_wo_referring_query": "What objects can be seen on the screen?", "candidates": ["Crab", "Shrimp", "Beef", "Lamb chops"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@tiffycooks-6929587730867440902_0", "video_path": "6929587730867440902.mp4", "subtitle_path": "6929587730867440902_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 28.8, "view_count": 469972}, {"video_id": "@thatrecipe.us-7348615144991165738", "question": "In the scene, on the wooden table, there is a cutting board with a bread roll on it. There is also a glass bowl filled with yellow food on the table. 
What is the shape of this bread roll?", "question_wo_referring_query": "What is the shape of this bread roll?", "candidates": ["Square", "Cylindrical", "Spherical", "Rectangular"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@thatrecipe.us-7348615144991165738_0", "video_path": "7348615144991165738.mp4", "subtitle_path": "7348615144991165738_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 57688}, {"video_id": "@healthfood-6857265309942811910", "question": "On the screen, a person is wearing a ring on their right hand and holding a glass jar. There is a silver lid on top of the jar. When mentioning 'You have to start thinking of yourself as the main character,' what color is the substance inside the jar?", "question_wo_referring_query": "What color is the substance inside the jar?", "candidates": ["Dark Brown", "Green", "Orange", "Red"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@healthfood-6857265309942811910_0", "video_path": "6857265309942811910.mp4", "subtitle_path": "6857265309942811910_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.7, "view_count": 887}, {"video_id": "@tiffycooks-6993376820880018694", "question": "In the video, there is a steamer basket on the table. A woman is holding a bowl with her right hand and chopsticks with her left hand. 
Who is the woman performing this action?", "question_wo_referring_query": "Who is the woman performing this action?", "candidates": ["A woman wearing a blue short-sleeve shirt", "A woman wearing a purple short-sleeve shirt", "A woman with black hair wearing a black short-sleeve shirt", "A woman wearing an orange top"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@tiffycooks-6993376820880018694_0", "video_path": "6993376820880018694.mp4", "subtitle_path": "6993376820880018694_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 37.03, "view_count": 600251}, {"video_id": "@healthfood-7042012177213050117", "question": "On a yellow table, there is a cup, and a person places an egg on top of the cup. The screen shows the text '6 eggs'. What happened the first time the egg appeared?", "question_wo_referring_query": "What happened the first time the egg appeared?", "candidates": ["The egg was placed into the refrigerator", "The egg was placed into the pot", "The egg was thrown", "The egg was cracked into the cup"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@healthfood-7042012177213050117_0", "video_path": "7042012177213050117.mp4", "subtitle_path": "7042012177213050117_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.17, "view_count": 92024}, {"video_id": "@tiffycooks-7101713245085388037", "question": "In the video, there is a black-haired woman wearing a dark blue top, with the text 'KALORIK MAXX DIGITAL AIR FRYER OVEN' above her. 
When she mentions 'Of course there are other factors such as price, size, and other functions', what action is she performing?", "question_wo_referring_query": "What action is this woman performing?", "candidates": ["Her left hand is on her forehead", "She is holding a chicken leg in her left hand, and her right hand is spread open with the palm facing the camera", "She is making a 'Yay' gesture with both hands", "She is clenching her fists"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@tiffycooks-7101713245085388037_0", "video_path": "7101713245085388037.mp4", "subtitle_path": "7101713245085388037_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 35.7, "view_count": 871143}, {"video_id": "@recipesbyanne-7265365892819717409", "question": "On a white table, someone is holding a glass bowl with pieces of chicken in it. After placing the glass bowl on the table, what did she do?", "question_wo_referring_query": "What did she do?", "candidates": ["She added some fruit sauce to the chicken pieces.", "She added some sour cream to the chicken pieces.", "She added some water to the chicken pieces.", "She added some seasonings to the chicken pieces."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@recipesbyanne-7265365892819717409_0", "video_path": "7265365892819717409.mp4", "subtitle_path": "7265365892819717409_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.0, "view_count": 73483}, {"video_id": "@healthfood-6948505079125118213", "question": "In the video, one scene shows a machine with yellow cornflakes, and another scene shows the words 'blend then mix in 1/2 cup blueberries' with blueberries underneath. 
Which scene appears first?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["Honey appears first", "Yogurt appears first", "Cornflakes appear first", "Blueberries appear first"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@healthfood-6948505079125118213_0", "video_path": "6948505079125118213.mp4", "subtitle_path": "6948505079125118213_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 28.1, "view_count": 25800}, {"video_id": "@thatrecipe.us-7185275988463471915", "question": "On the table, there is a black pot. A person wearing gloves is holding a yellow object and applying it inside the pot. After mentioning 'Hit the like button already and leave your comment on this recipe, Thank you', which object appears?", "question_wo_referring_query": "Which object appears?", "candidates": ["Milk", "Greenish powdery substance", "White sugar", "Dough and eggs"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@thatrecipe.us-7185275988463471915_0", "video_path": "7185275988463471915.mp4", "subtitle_path": "7185275988463471915_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.06, "view_count": 38323}, {"video_id": "@tiffycooks-7351068654408174854", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First appears a woman with black hair wearing a white top in a white-themed kitchen, holding a pot filled with water and placing it on the gas stove. Then, the woman is holding a bowl of noodles in both hands. Finally, the woman appears holding a red bottle, a yellow bottle, and a green bottle of seasoning.", "First appears the woman holding a red bottle, a yellow bottle, and a green bottle of seasoning. 
Then, in a white-themed kitchen, the woman wearing a white top with black hair holds a pot filled with water and places it on the gas stove. Finally, the woman is holding a bowl of noodles in both hands.", "First appears the woman holding a red bottle, a yellow bottle, and a green bottle of seasoning. Then, the woman is holding a bowl of noodles in both hands. Finally, in a white-themed kitchen, the woman wearing a white top with black hair holds a pot filled with water and places it on the gas stove.", "First appears a woman with black hair wearing a white top in a white-themed kitchen, holding a pot filled with water and placing it on the gas stove. Then, the woman appears holding a red bottle, a yellow bottle, and a green bottle of seasoning. Finally, the woman is holding a bowl of noodles in both hands."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@tiffycooks-7351068654408174854_0", "video_path": "7351068654408174854.mp4", "subtitle_path": "7351068654408174854_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 46.37, "view_count": 226050}, {"video_id": "@tiffycooks-7067984305435282694", "question": "On a brownish desktop, there is a cup of beverage. The screen shows the text 'Have you ever tried McDonald's Creamy corn soup?'. 
Does this beverage appear along with that text on the screen?", "question_wo_referring_query": "Does this beverage appear along with that text on the screen?", "candidates": ["Saute onion with butter until translucent and add in a can of cream-style corn.", "Have you ever tried McDonald's Creamy corn soup and Here's my copycat version that is vegetarian friendly, easy and affordable", "Sim", "pour it back in the corn mixture and then add in more corn"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "@tiffycooks-7067984305435282694_0", "video_path": "7067984305435282694.mp4", "subtitle_path": "7067984305435282694_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 27.63, "view_count": 315285}, {"video_id": "@tiffycooks-6939966685000633606", "question": "In the video, a woman with black hair wearing a gray short-sleeved shirt is seen in front of a table with a bowl on it. Inside the bowl, there is flour, accompanied by text indicating 'Flour, pint of salt, drizzle of oil.' What change occurred to the flour once it was put into the bowl?", "question_wo_referring_query": "What change occurred to the flour once it was put into the bowl?", "candidates": ["It turned into a bun", "It turned into a long strip", "It turned into a dough ball", "No change"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@tiffycooks-6939966685000633606_0", "video_path": "6939966685000633606.mp4", "subtitle_path": "6939966685000633606_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 30.23, "view_count": 427093}, {"video_id": "@tiffycooks-6883908854552349953", "question": "In the kitchen, there is a woman with black hair wearing a gray short-sleeved shirt. In front of her is a pot of oil and she is holding a large piece of chicken. 
What is she going to do?", "question_wo_referring_query": "What is she going to do?", "candidates": ["She is putting the chicken in a bowl", "She is soaking the chicken in water", "She is putting the chicken into the oil pot", "She is putting the chicken into the oven"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@tiffycooks-6883908854552349953_0", "video_path": "6883908854552349953.mp4", "subtitle_path": "6883908854552349953_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.8, "view_count": 1124332}, {"video_id": "@thatrecipe.us-7333331356707048746", "question": "On the wooden table, there are three square glass bowls, each containing different items: flour, beaten eggs, and breadcrumbs. Which of these items appears in the video?", "question_wo_referring_query": "Which of these items appears in the video?", "candidates": ["black disposable gloves", "white linen gloves", "red rubber gloves", "white gloves"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@thatrecipe.us-7333331356707048746_0", "video_path": "7333331356707048746.mp4", "subtitle_path": "7333331356707048746_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 154133}, {"video_id": "@tiffycooks-7042378599319489797", "question": "In the screen, there is a woman with long hair wearing a grey short-sleeved shirt. She is holding a knife with her left hand and pressing on the knife's back with her right hand. When she mentioned, 'I remember being so nervous, I almost didn't post. When I woke up, the video had 40k views and I was so surprised.' 
which object appeared on the screen?", "question_wo_referring_query": "Which object appeared on the screen?", "candidates": ["Dark brown cutting board", "Orange carrot", "White-handled knife", "Fried food"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@tiffycooks-7042378599319489797_0", "video_path": "7042378599319489797.mp4", "subtitle_path": "7042378599319489797_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 50.97, "view_count": 10042215}, {"video_id": "@thatrecipe.us-7318874169280679211", "question": "On a wooden-colored tabletop, there is a white plate with three pieces of food. A hand is holding the plate, and on the left side of the screen, a fork is pressing down on the food. What material is this fork made of?", "question_wo_referring_query": "What material is this fork made of?", "candidates": ["Gold metal fork", "Silver stainless steel fork", "White plastic fork", "Red rubber fork"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@thatrecipe.us-7318874169280679211_0", "video_path": "7318874169280679211.mp4", "subtitle_path": "7318874169280679211_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.3, "view_count": 78408}, {"video_id": "@thatrecipe.us-7311830493836659998", "question": "On a white table, there is a metal plate. A person is holding a yellow brush and brushing something on the plate. 
Who is this person brushing the plate?", "question_wo_referring_query": "Who is this person brushing the plate?", "candidates": ["A person wearing black gloves", "A person wearing oven mitts", "A person wearing transparent gloves", "A person wearing red rubber gloves"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@thatrecipe.us-7311830493836659998_0", "video_path": "7311830493836659998.mp4", "subtitle_path": "7311830493836659998_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.06, "view_count": 25063}, {"video_id": "@thatrecipe.us-7288500091076349227", "question": "On a wooden-colored desktop, there is a bowl with a stainless steel exterior and a white interior. The bowl contains a mushy food with green leaves. In the upper right corner of the screen, a hand is holding a lemon. What is he doing at this moment?", "question_wo_referring_query": "What is he doing at this moment?", "candidates": ["Breaking the lemon into pieces and putting them in the bowl", "Squeezing lemon juice into the bowl", "Placing lemon slices into the bowl", "Taking the lemon out of the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@thatrecipe.us-7288500091076349227_0", "video_path": "7288500091076349227.mp4", "subtitle_path": "7288500091076349227_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.04, "view_count": 902991}, {"video_id": "@thatrecipe.us-7281820611297447198", "question": "On a wooden-colored desktop, there is a black plate with four kidney beans that have slits in the middle. Also on the right side of the screen, there is a hand. 
When 'one pinch of garlic flakes, and one pinch of chimichurri' is mentioned, what action does the hand perform?", "question_wo_referring_query": "What action does the hand perform?", "candidates": ["Mashes the kidney beans into a paste", "Flips the kidney beans over", "Spreads wax paper onto the kidney beans", "Sprinkles spices onto the kidney beans"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@thatrecipe.us-7281820611297447198_0", "video_path": "7281820611297447198.mp4", "subtitle_path": "7281820611297447198_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 620763}, {"video_id": "@tiffycooks-7193375566681099526", "question": "In a black pot, there are wide noodles covered with a lot of sauce. On top, there is also a lot of green parsley as decoration. After mentioning 'Finish off with some fresh parsley. Now look at that!', what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["Use tongs to lift the noodles", "Take a wooden spoon and stir in the pot", "Add some oil to the pot", "Sprinkle the pot with parsley bits"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@tiffycooks-7193375566681099526_0", "video_path": "7193375566681099526.mp4", "subtitle_path": "7193375566681099526_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.6, "view_count": 2214552}, {"video_id": "@thatrecipe.us-7305520564511214891", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, someone pours chocolate from a glass bowl into a glass dish. Then, someone is using a spatula to spread chocolate over a cake. 
Finally, someone is in a silver-colored mold, holding a plastic bag containing a yellow paste.", "First, someone pours chocolate from a glass bowl into a glass dish. Then, someone is in a silver-colored mold, holding a plastic bag containing a yellow paste. Finally, someone is using a spatula to spread chocolate over a cake.", "First, someone is in a silver-colored mold. Someone is holding a plastic bag containing a yellow paste. Then, someone is using a spatula to spread chocolate over a cake. Finally, someone pours chocolate from a glass bowl into a glass dish.", "First, someone is in a silver-colored mold, holding a plastic bag containing a yellow paste. Then, someone pours chocolate from a glass bowl into a glass dish. Finally, someone is using a spatula to spread chocolate over a cake."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@thatrecipe.us-7305520564511214891_0", "video_path": "7305520564511214891.mp4", "subtitle_path": "7305520564511214891_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 31444}, {"video_id": "@tiffycooks-6968516812627660038", "question": "In a scene where green cabinets are hanging above, a woman in a gray outfit is holding a plate full of chicken. 
Where has this plate of chicken appeared before?", "question_wo_referring_query": "Where has this plate of chicken appeared before?", "candidates": ["In a black tray", "In a red pressure cooker", "In a black bowl", "In a stainless steel bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@tiffycooks-6968516812627660038_0", "video_path": "6968516812627660038.mp4", "subtitle_path": "6968516812627660038_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 34.3, "view_count": 923840}, {"video_id": "@tiffycooks-7021579879120407814", "question": "In the video, a long-haired woman dressed in a grey top is holding a plate of food and a pair of chopsticks with her left hand. When the subtitle mentions 'if you like it spicy add in spicy douban sauce water let it simmer for 10 to 15 minutes look at that,' what changes occur to the items she is holding?", "question_wo_referring_query": "What changes occur to the items the woman is holding?", "candidates": ["The chopsticks in her left hand are replaced with a blue bowl", "The chopsticks in her left hand are replaced with a transparent container", "The chopsticks in her left hand are replaced with a spoon", "The chopsticks in her left hand are replaced with a fork"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@tiffycooks-7021579879120407814_0", "video_path": "7021579879120407814.mp4", "subtitle_path": "7021579879120407814_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.23, "view_count": 312847}, {"video_id": "@thatrecipe.us-7346768282894191918", "question": "There is a pair of hands on the screen. The right hand is wearing a ring on the ring finger. There is a cutting board with three sausages on it. 
What is the left hand doing in the scene?", "question_wo_referring_query": "What is the left hand doing in the scene?", "candidates": ["Holding a knife and cutting the sausages", "Holding chopsticks and stirring", "Holding a knife and cutting carrots", "Holding a spoon and stirring"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@thatrecipe.us-7346768282894191918_0", "video_path": "7346768282894191918.mp4", "subtitle_path": "7346768282894191918_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 211286}, {"video_id": "@thatrecipe.us-7295179013629250858", "question": "There is a pot with a black exterior and a white interior in the scene. Inside the pot, there is a brown liquid, and water is being added to the pot with a transparent container. What else is in the scene?", "question_wo_referring_query": "What else is in the scene?", "candidates": ["a spatula", "a tofu block", "a juicer", "a yellow ladle"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@thatrecipe.us-7295179013629250858_0", "video_path": "7295179013629250858.mp4", "subtitle_path": "7295179013629250858_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 45792}, {"video_id": "@thatrecipe.us-7340381821273722154", "question": "There is a pair of hands on the screen, the left hand wearing a ring, pressing an orange, and the right hand cutting the orange with a kitchen knife. Below the orange, there is a cutting board. When the subtitle mentions '1 beat condensed milk with oranges and result was amazing. 
Thank you.', which object did not appear on the screen?", "question_wo_referring_query": "Which object did not appear on the screen?", "candidates": ["ring", "condensed milk", "cutting board", "orange"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@thatrecipe.us-7340381821273722154_0", "video_path": "7340381821273722154.mp4", "subtitle_path": "7340381821273722154_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 23005}, {"video_id": "@tiffycooks-6883547402523921665", "question": "In the scene, there is a woman with long hair wearing a black short-sleeved shirt in a room. Her left hand is placed on the pot lid. What is the color of the woman's hair in the video?", "question_wo_referring_query": "What is the color of the woman's hair in the video?", "candidates": ["black", "white", "green", "yellow"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@tiffycooks-6883547402523921665_0", "video_path": "6883547402523921665.mp4", "subtitle_path": "6883547402523921665_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 48.5, "view_count": 1955838}, {"video_id": "@tiffycooks-7040152447154785541", "question": "In the video, a woman wearing a gray long-sleeve shirt is holding a transparent cup in her left hand and pouring egg mixture into a pot. 
When the subtitle reads 'Pour in one third of the egg mixture.', what is this woman's hairstyle?", "question_wo_referring_query": "What is this woman's hairstyle?", "candidates": ["Bald", "Medium hair", "Short hair", "Long hair"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@tiffycooks-7040152447154785541_0", "video_path": "7040152447154785541.mp4", "subtitle_path": "7040152447154785541_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.23, "view_count": 355738}, {"video_id": "@recipesbyanne-7283888929219022112", "question": "In the video, there is a piece of dough placed in a red bowl. A tool is used to poke holes in the dough. What is the tool used to poke the dough in the video?", "question_wo_referring_query": "What is the tool used to poke the dough in the video?", "candidates": ["needle", "fork", "ladle", "chopstick"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@recipesbyanne-7283888929219022112_0", "video_path": "7283888929219022112.mp4", "subtitle_path": "7283888929219022112_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.63, "view_count": 65321}, {"video_id": "@thatrecipe.us-7270264394657910058", "question": "There is a transparent bowl on the screen, with some food including a chicken egg and corn inside. Someone is holding a cup with oatmeal. 
What happens when the oatmeal appears on the screen?", "question_wo_referring_query": "What happens when the oatmeal appears on the screen?", "candidates": ["Eat it", "Throw into the trash can", "Put into the pot", "Pour into the glass bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@thatrecipe.us-7270264394657910058_0", "video_path": "7270264394657910058.mp4", "subtitle_path": "7270264394657910058_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 13013}, {"video_id": "@healthfood-6954029804291132677", "question": "In the video, there is a woman wearing a blue long-sleeved shirt. She is holding a full cup in her left hand, and she is wearing a ring on her left ring finger. What happens when the subtitle says 'It tastes like strawberry milk from Nestle when I was a kid'?", "question_wo_referring_query": "What event happens?", "candidates": ["The woman eats something", "The woman drinks the liquid from the cup", "Nothing happens", "The woman leaves the house"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@healthfood-6954029804291132677_0", "video_path": "6954029804291132677.mp4", "subtitle_path": "6954029804291132677_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.6, "view_count": 19000}, {"video_id": "@thatrecipe.us-7240230589893233966", "question": "In the video, a semi-liquid substance is poured from a bowl into a white plate on the table. The plate contains red and green food items. 
What action is done after pouring the semi-liquid substance?", "question_wo_referring_query": "What action is done after pouring the semi-liquid substance in the video?", "candidates": ["Stir", "Sprinkle fine yellow strips on top", "Sprinkle powdered substance", "Chop vegetables"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@thatrecipe.us-7240230589893233966_0", "video_path": "7240230589893233966.mp4", "subtitle_path": "7240230589893233966_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.33, "view_count": 7151}, {"video_id": "@tiffycooks-6946627205128572165", "question": "In the video, the woman is making fish balls at home using various ingredients. She pours the ingredients into a bowl. Which of the following ingredients does she add to the bowl first?", "question_wo_referring_query": "Which of the following ingredients does she add to the bowl first?", "candidates": ["Egg", "Water", "Flour", "Fish"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@tiffycooks-6946627205128572165_0", "video_path": "6946627205128572165.mp4", "subtitle_path": "6946627205128572165_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 29.7, "view_count": 659842}, {"video_id": "@tiffycooks-7042006134680554757", "question": "In the video, after putting the four marinated chicken legs into the oven, when the subtitle mentions 'at least one hour or overnight bake at 400 for 30 minutes simmer the sauce until bubbly brush a', what is the object that appears afterwards?", "question_wo_referring_query": "What is the object that appears afterwards?", "candidates": ["Light soy sauce", "Old soy sauce", "Brush", "Honey"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@tiffycooks-7042006134680554757_0", "video_path": "7042006134680554757.mp4", 
"subtitle_path": "7042006134680554757_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 27.87, "view_count": 554239}, {"video_id": "@recipesbyanne-7283149977201757472", "question": "In the video, a burger with purple vegetables and meat was taken out from a white plate. In what other scene does the burger appear?", "question_wo_referring_query": "In what other scene does the burger appear in the video?", "candidates": ["In the pot", "In the hands of a woman wearing gray long sleeves", "In the hands of a woman wearing white long sleeves", "In the hands of a woman wearing black long sleeves"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@recipesbyanne-7283149977201757472_0", "video_path": "7283149977201757472.mp4", "subtitle_path": "7283149977201757472_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.45, "view_count": 44405}, {"video_id": "@healthfood-6868287221548485893", "question": "In the video, a hand with blue nail polish is placed on the pie crust, and there is a piece of text in the upper left corner of the screen. Which subtitle in the video appears together with the pie crust?", "question_wo_referring_query": ", which subtitle in the video appears together with the pie crust?", "candidates": ["then flipped it over and cooked it for another 3 minutes.", "It was so good, and I highly recommend trying it.", "Then I took the rest of the butter and cinnamon sugar and put it over top of the apples.", "And here's the finished product. 
Look at that cross section."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "@healthfood-6868287221548485893_0", "video_path": "6868287221548485893.mp4", "subtitle_path": "6868287221548485893_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 41.14, "view_count": 15400}, {"video_id": "@healthfood-6871310712245980422", "question": "The left hand wearing a ring is placing some strawberries into a white mug on the screen. What changes occurred to the strawberries placed in the mug in the video?", "question_wo_referring_query": "What changes occurred to the strawberries placed in the mug in the video?", "candidates": ["Were thrown away", "Were eaten", "Were crushed and stirred", "Were taken out"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@healthfood-6871310712245980422_0", "video_path": "6871310712245980422.mp4", "subtitle_path": "6871310712245980422_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 39.13, "view_count": 14526}, {"video_id": "@healthfood-7146298559732731178", "question": "There is a transparent cup on the screen, inside the cup there are various colored gummy bears. There is a bag on the cup. 
What happens to the bag in the video?", "question_wo_referring_query": "What happens to the bag in the video?", "candidates": ["Pour the gummy bears into the cup", "Throw it away", "Put it into the cup", "Take it away"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@healthfood-7146298559732731178_0", "video_path": "7146298559732731178.mp4", "subtitle_path": "7146298559732731178_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.9, "view_count": 3555993}, {"video_id": "@thatrecipe.us-7206447074857078062", "question": "A hand is adding scallions to a pot with beans and bell peppers. Which of the following items does not appear in the video?", "question_wo_referring_query": "Which of the following items does not appear in the video?", "candidates": ["egg", "beans", "scallions", "bell peppers"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@thatrecipe.us-7206447074857078062_0", "video_path": "7206447074857078062.mp4", "subtitle_path": "7206447074857078062_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.53, "view_count": 50292}, {"video_id": "@healthfood-6841948717499682053", "question": "In the video, three egg yolks are placed into a transparent bowl containing flour. 
When the subtitle mentions 'Add in three egg yolks, then add in the hot milk,' which object does not appear on the screen?", "question_wo_referring_query": "Which object does not appear on the screen?", "candidates": ["egg yolk", "egg shell", "transparent bowl", "pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@healthfood-6841948717499682053_0", "video_path": "6841948717499682053.mp4", "subtitle_path": "6841948717499682053_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 35.1, "view_count": 2174}, {"video_id": "@tiffycooks-7213027333878467846", "question": "In the video, there is a woman sitting next to a dark wood table with green plants and polaroid pictures in the background, holding golden colored food. What color is her hair?", "question_wo_referring_query": "What color is her hair?", "candidates": ["Dark Brown", "Silver", "Yellow", "Black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@tiffycooks-7213027333878467846_0", "video_path": "7213027333878467846.mp4", "subtitle_path": "7213027333878467846_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.4, "view_count": 10794690}, {"video_id": "@thatrecipe.us-7288448632120954154", "question": "There is a transparent bowl on the screen containing various items. A hand places a yellow item inside the bowl. What is the last item added?", "question_wo_referring_query": "There is a transparent bowl on the screen containing various items. A hand places a yellow item inside the bowl. 
What is the last item added?", "candidates": ["Mango", "Banana", "Lemon juice", "Papaya"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@thatrecipe.us-7288448632120954154_0", "video_path": "7288448632120954154.mp4", "subtitle_path": "7288448632120954154_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 19069}, {"video_id": "@healthfood-6867705614831848710", "question": "In the video, on a white marble patterned desktop, there is a transparent square glass bowl and a round white small bowl. Before the author used the white spatula to scoop a piece of cake from the square transparent glass bowl onto the white small bowl, what did the author do?", "question_wo_referring_query": ", what did the author do?", "candidates": ["Wipe the edge of the glass bowl with a white towel", "Wipe the white ceramic bowl", "Wipe the table", "Wipe the spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@healthfood-6867705614831848710_0", "video_path": "6867705614831848710.mp4", "subtitle_path": "6867705614831848710_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.13, "view_count": 5912}, {"video_id": "@thatrecipe.us-7338224548200090926", "question": "On a wooden-colored table, there is a glass bowl filled with beans. Above it, a pair of hands is pouring oil into the bowl. 
After mentioning 'Add two tablespoons of unsalted butter, one teaspoon of salt,' what does this person do next?", "question_wo_referring_query": "What did this person do next?", "candidates": ["Sprinkle seasoning into the bowl with minced meat", "Pour water into the pot filled with beans", "Sprinkle chives on the food", "Add minced meat into the bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@thatrecipe.us-7338224548200090926_0", "video_path": "7338224548200090926.mp4", "subtitle_path": "7338224548200090926_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 223532}, {"video_id": "@healthfood-6857195167095540997", "question": "On the screen, there is a black flat-bottomed pan on the stove. A tortilla is added on top of the egg mixture in the pan. The subtitle reads, 'I then add the mixture to the pan and throw a tortilla on top.' What appears on the screen next?", "question_wo_referring_query": "On the screen, there is a black flat-bottomed pan on the stove. A tortilla is added on top of the egg mixture in the pan. The subtitle reads, 'I then add the mixture to the pan and throw a tortilla on top.' 
What appears on the screen next?", "candidates": ["A bunch of avocados", "A black spatula", "Some chopped parsley", "A pile of chili powder"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "@healthfood-6857195167095540997_0", "video_path": "6857195167095540997.mp4", "subtitle_path": "6857195167095540997_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 33.27, "view_count": 1931}, {"video_id": "@healthfood-7039414357536738606", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, place a sandwich loaf on a wooden-textured table and cut it into thin slices. Then, add two small pieces of butter to the golden egg yolk liquid to let it melt. Finally, place the two toasted slices on a plate and use a spoon to evenly spread the egg yolk sauce from a small white bowl on top.", "First, evenly spread the prepared egg yolk sauce on the toasted sandwich slices. Then, add two small pieces of butter to the golden egg yolk liquid to let it melt. Finally, place a sandwich loaf on a wooden-textured table and cut it into thin slices.", "First, add two small pieces of butter to the golden egg yolk liquid to let it melt. Then, place a sandwich loaf on a wooden-textured table and cut it into thin slices. Finally, evenly spread the prepared egg yolk sauce on the toasted sandwich slices.", "First, place a sandwich loaf on a wooden-textured table and cut it into thin slices. Then, evenly spread the prepared egg yolk sauce on the toasted sandwich slices. 
Finally, add two small pieces of butter to the golden egg yolk liquid to let it melt."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "@healthfood-7039414357536738606_0", "video_path": "7039414357536738606.mp4", "subtitle_path": "7039414357536738606_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 24.9, "view_count": 88922}, {"video_id": "@tiffycooks-7044606182362860806", "question": "In the scene where a woman with long hair parted in the middle, wearing a black short-sleeved shirt, holds a pair of chopsticks and picks up a piece of crispy tofu with sauce, which of the following scenes did this woman appear in?", "question_wo_referring_query": "In which of the following scenes did this woman appear?", "candidates": ["Beside a wooden desktop.", "Standing in front of a grey cabinet, beside a white countertop, with a cutting board holding tofu on the table.", "In a scene with deep-fried tofu.", "In front of a black cabinet, with a knife on the table."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@tiffycooks-7044606182362860806_0", "video_path": "7044606182362860806.mp4", "subtitle_path": "7044606182362860806_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 27.03, "view_count": 1045326}, {"video_id": "@thatrecipe.us-7337041937762962730", "question": "What changes occur to the chicken pieces, which were cut into small cubes and placed in a transparent glass bowl on a wooden tabletop, after being put in the oil pot and fried?", "question_wo_referring_query": "What changes occur after being put in the oil pot and fried?", "candidates": ["Turned into golden crispy chicken nuggets", "Turned into black chicken nuggets", "Turned into red chicken nuggets", "No changes"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": 
"@thatrecipe.us-7337041937762962730_0", "video_path": "7337041937762962730.mp4", "subtitle_path": "7337041937762962730_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 46067}, {"video_id": "@healthfood-6880908792259824897", "question": "What change occurs to the woman wearing a white high-collar long-sleeve top with a red skirt and making a standing thumbs-up gesture in the video when the subtitle says 'broth let that simmer again for about five minutes to thicken up then stir in your pasta and enjoy'?", "question_wo_referring_query": "What change occurs to her?", "candidates": ["She puts on a white sweater.", "She puts on a red jacket.", "She puts on a black jacket.", "She does not change."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "@healthfood-6880908792259824897_0", "video_path": "6880908792259824897.mp4", "subtitle_path": "6880908792259824897_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 40.73, "view_count": 7321}, {"video_id": "@asianfoodrecipes-7121805852654210305", "question": "In the video, on the gas stove, there is a pot with green and red soup containing diced corn and carrots. What is happening above the soup pot?", "question_wo_referring_query": "In the video, on the gas stove, there is a pot with green and red soup containing diced corn and carrots. 
What is happening above the soup pot?", "candidates": ["A white container is pouring clear water into the pot", "Other ingredients are being added", "Nothing is happening", "A white container is pouring milk into the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@asianfoodrecipes-7121805852654210305_0", "video_path": "7121805852654210305.mp4", "subtitle_path": "7121805852654210305_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.22, "view_count": 8713}, {"video_id": "@tiffycooks-7179273257328086277", "question": "In the video, there is a woman with long hair wearing black clothes, a necklace on her neck, a cabinet behind her, and a table in front of her. On the table, there is a bowl containing various ingredients. The woman is using her right hand to point at some food, and what is she holding in her left hand?", "question_wo_referring_query": "What is she holding in her left hand?", "candidates": ["spoon", "chopsticks", "bottle", "fork"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@tiffycooks-7179273257328086277_0", "video_path": "7179273257328086277.mp4", "subtitle_path": "7179273257328086277_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 44.7, "view_count": 471104}, {"video_id": "@tiffycooks-7187462082315521286", "question": "There is a man in a black shirt on the screen. There is a round wall decoration hanging behind him. The man is holding a fork in his right hand, which has a piece of food on it. 
When the subtitle mentions 'I like it because the beef is kind of battered and it`s actually kind of,' what is the man holding in his left hand?", "question_wo_referring_query": "What is the man holding in his left hand?", "candidates": ["white bowl", "white plate", "silver knife", "transparent glass"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2O", "level": "IntraMoment", "id": "@tiffycooks-7187462082315521286_0", "video_path": "7187462082315521286.mp4", "subtitle_path": "7187462082315521286_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.7, "view_count": 219363}, {"video_id": "@thatrecipe.us-7293261691951926570", "question": "A powdered substance is poured into a transparent bowl on the screen. The bowl contains yellow custard and matcha powder. What color is the substance being poured?", "question_wo_referring_query": "What color is the substance being poured?", "candidates": ["Light pink", "Gray", "White", "Black"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@thatrecipe.us-7293261691951926570_0", "video_path": "7293261691951926570.mp4", "subtitle_path": "7293261691951926570_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 333183}, {"video_id": "@tiffycooks-7054998830491634950", "question": "There is a plate on the table. On the plate, there is an object. A woman wearing a gray top is putting scallions into this object. 
What is this object?", "question_wo_referring_query": ", what is this object?", "candidates": ["fish", "chicken", "dumplings", "duck"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E2O", "level": "IntraMoment", "id": "@tiffycooks-7054998830491634950_0", "video_path": "7054998830491634950.mp4", "subtitle_path": "7054998830491634950_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 31.23, "view_count": 378412}, {"video_id": "@thatrecipe.us-7344846410593914155", "question": "On a yellow table, there is a glass bowl containing vegetables. A person is holding a black object over the bowl. What is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": ["Cutting vegetables", "Sprinkling pepper powder", "Peeling", "Mixing vegetables"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@thatrecipe.us-7344846410593914155_0", "video_path": "7344846410593914155.mp4", "subtitle_path": "7344846410593914155_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 69728}, {"video_id": "@recipesbyanne-7099103544250322182", "question": "In the video, a yellow liquid is poured into a transparent bowl that contains chocolate-colored powder. 
What do they do after pouring the liquid?", "question_wo_referring_query": "After pouring the liquid into the transparent bowl, what action is taken next?", "candidates": ["Add soy sauce", "Stir", "Add brown sugar", "Throw away"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "E3E", "level": "L2-Relation", "id": "@recipesbyanne-7099103544250322182_0", "video_path": "7099103544250322182.mp4", "subtitle_path": "7099103544250322182_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 15.93, "view_count": 1312}, {"video_id": "@tiffycooks-7289099196710259973", "question": "In the video, the sauce is poured over the cooked pumpkin and the cooked greens. Which vegetable is covered with the sauce first in the video?", "question_wo_referring_query": "Which vegetable is covered with the sauce first in the video?", "candidates": ["Greens", "Pumpkin", "String Beans", "Celery"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "@tiffycooks-7289099196710259973_0", "video_path": "7289099196710259973.mp4", "subtitle_path": "7289099196710259973_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 28.17, "view_count": 631640}, {"video_id": "@tiffycooks-7140756785920593157", "question": "In the scene, there is a long-haired woman in a grey outfit sitting at a table. She places her hands on the table, forming fists. There are two green plants in pots in the upper background. 
After the subtitle mentions 'but I've been nominated for the vegan,' what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["The woman clenches her fists", "The woman raises her hands and claps", "Her right hand is on the table while her left hand is extended", "The woman spreads her hands open"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@tiffycooks-7140756785920593157_0", "video_path": "7140756785920593157.mp4", "subtitle_path": "7140756785920593157_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.08, "view_count": 109387}, {"video_id": "@tiffycooks-7268316272755068165", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, there is a scene of cutting a pancake on a wooden board with a kitchen knife, followed by a scene of stirring items in a bowl with chopsticks, and finally ends with a scene of a long-haired woman in a dark blue dress eating the pancake.", "First, there is a scene of stirring items in a bowl with chopsticks, followed by a scene of a long-haired woman in a dark blue dress eating the pancake, and finally ends with a scene of cutting a pancake on a wooden board with a kitchen knife.", "First, there is a scene of cutting a pancake on a wooden board with a kitchen knife, followed by a scene of a long-haired woman in a dark blue dress eating the pancake, and finally ends with a scene of stirring items in a bowl with chopsticks.", "First, there is a scene of a long-haired woman in a dark blue dress eating the pancake, followed by a scene of cutting a pancake on a wooden board with a kitchen knife, and finally ends with a scene of stirring items in a bowl with chopsticks."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": 
"@tiffycooks-7268316272755068165_0", "video_path": "7268316272755068165.mp4", "subtitle_path": "7268316272755068165_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 30.73, "view_count": 712035}, {"video_id": "@tiffycooks-7294291784639859974", "question": "There is a man in the video eating from a transparent bowl using chopsticks. On the top of the screen, there are noodles, and on the bottom, there is rice. In which other scene does the man eat the same noodles?", "question_wo_referring_query": "In which other scene does the man eat the same noodles?", "candidates": ["In a white pot", "In a black pot", "In a cup", "In a milk tea shop"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@tiffycooks-7294291784639859974_0", "video_path": "7294291784639859974.mp4", "subtitle_path": "7294291784639859974_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 29.17, "view_count": 345348}, {"video_id": "@tiffycooks-7249392485397548293", "question": "In the video, the cooked egg is placed into the fried rice using a black spatula. 
Along with which subtitle did the fried egg appear together in the video?", "question_wo_referring_query": ", along with which subtitle did the fried egg appear together in the video?", "candidates": ["Mix together soy sauce, fish sauce, and a pinch of sugar.", "Welcome back to another episode of Titty Broke Her Hand, featuring recipes that are so easy.", "Today we're making spicy Thai fried rice in 20 minutes.", "Sauce together for 1-2 minutes."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "@tiffycooks-7249392485397548293_0", "video_path": "7249392485397548293.mp4", "subtitle_path": "7249392485397548293_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 32.93, "view_count": 1414256}, {"video_id": "@asianfoodrecipes-7106995879961300226", "question": "The gloved hand is massaging the chicken in the video, with some sauce in the bowl. What changes did the chicken undergo in the video?", "question_wo_referring_query": ", what changes did the chicken undergo in the video?", "candidates": ["Boiled in water", "Steamed", "Roasted in the oven", "Stir-fried"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@asianfoodrecipes-7106995879961300226_0", "video_path": "7106995879961300226.mp4", "subtitle_path": "7106995879961300226_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.32, "view_count": 17789}, {"video_id": "@luxtravelbe-7335896612276817185", "question": "In the video, there is a woman with a white headband and wearing a black long-sleeved shirt, holding a reindeer with a red collar around its neck. There are a few trees in the background, and the ground is covered in snow. 
What is the reindeer doing in the video?", "question_wo_referring_query": "What is the reindeer doing in the video?", "candidates": ["Crawling", "Sitting", "Sleeping", "Walking"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@luxtravelbe-7335896612276817185_0", "video_path": "7335896612276817185.mp4", "subtitle_path": "7335896612276817185_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.33, "view_count": 120884}, {"video_id": "@_eat_sleep_travel_repeat-7301704385149537568", "question": "There is a house made of stone in the video, with many leaves on the ground and a few trees behind it. There is a waterfall behind the house as well. What color are the leaves on the ground in the video?", "question_wo_referring_query": "What color are the leaves on the ground in the video?", "candidates": ["purple", "green", "yellow", "red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7301704385149537568_0", "video_path": "7301704385149537568.mp4", "subtitle_path": "7301704385149537568_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.47, "view_count": 12916}, {"video_id": "@placesunleashed-7288028320216747270", "question": "There is a black car in the front of the screen, and behind it is a relatively dilapidated green building. 
When the subtitle mentions 'Norilsk located in Russia, is one of the northernmost cities in the world and is part of', what is the car in the screen?", "question_wo_referring_query": "What kind of car is in the screen?", "candidates": ["Car", "Bus", "Tricycle", "Bicycle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@placesunleashed-7288028320216747270_0", "video_path": "7288028320216747270.mp4", "subtitle_path": "7288028320216747270_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.77, "view_count": 7761}, {"video_id": "@kelseyinlondon-7259796551596330267", "question": "There is a person wearing a white long-sleeved shirt in the frame, with a ring on the left index finger, and holding a pillow below. What is the person holding in their left hand in the video?", "question_wo_referring_query": "What is the person holding in their left hand in the video?", "candidates": ["cup", "watch", "computer", "phone"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@kelseyinlondon-7259796551596330267_0", "video_path": "7259796551596330267.mp4", "subtitle_path": "7259796551596330267_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 30.0, "view_count": 16083}, {"video_id": "@jetset_anna-7081941485486198021", "question": "There is a swimming pool on the screen, with a person in it. There are several lounge chairs next to the pool, as well as two houses. 
What did the person do when they appeared in the pool in the video?", "question_wo_referring_query": "What did the person do when they appeared in the pool in the video?", "candidates": ["Standing", "Lying on the lounge chair", "Sleeping", "Swimming"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@jetset_anna-7081941485486198021_0", "video_path": "7081941485486198021.mp4", "subtitle_path": "7081941485486198021_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.53, "view_count": 46853}, {"video_id": "@placesunleashed-7354550985710243078", "question": "In the video, there is a pair of feet wearing black shoes. The left foot is resting on a stick, and below the stick, there is a black bird flying. What action does the bird in the video do after flying?", "question_wo_referring_query": "What action does the bird in the video do after flying?", "candidates": ["Lands on the stick", "Falls down", "Continues flying", "Flies upward"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@placesunleashed-7354550985710243078_0", "video_path": "7354550985710243078.mp4", "subtitle_path": "7354550985710243078_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 37.1, "view_count": 489}, {"video_id": "@placesunleashed-7303296883546410246", "question": "In the video, there is a highway with a green tree outside it. There are a few white clouds in the sky and a green cone-shaped mountain. Additionally, a woman wearing a yellow short-sleeve shirt and a headscarf appears next to the tree. 
Which object appears first in the video?", "question_wo_referring_query": "Which object appears first in the video?", "candidates": ["The iron rod next to the highway", "None of these appeared", "A green cone-shaped mountain", "A woman wearing a yellow short-sleeve shirt with a headscarf"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@placesunleashed-7303296883546410246_0", "video_path": "7303296883546410246.mp4", "subtitle_path": "7303296883546410246_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.4, "view_count": 6996}, {"video_id": "@kelseyinlondon-7143654046694395141", "question": "In the video, there is a person wearing a pilot costume on a basketball court with a basketball on the ground. After the caption 'When you're near me' appears, what event occurs?", "question_wo_referring_query": ", what event occurs?", "candidates": ["A right hand is scooping ice cream", "Two cups clink together", "A woman leans against a railing with her hand", "A swimming pool appears"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@kelseyinlondon-7143654046694395141_0", "video_path": "7143654046694395141.mp4", "subtitle_path": "7143654046694395141_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 28.43, "view_count": 74937}, {"video_id": "@placesunleashed-7325959145322974470", "question": "A screen shows an island covered with many trees, with the words \"MALLORCA KNOWN AS THE LARGEST\" appearing in the middle. 
When the caption reads \"Mallorca, known as the largest Spanish island, is a traveler's paradise,\" what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["Ship", "Swimming pool", "Green islet", "Map segment"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "@placesunleashed-7325959145322974470_0", "video_path": "7325959145322974470.mp4", "subtitle_path": "7325959145322974470_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.02, "view_count": 4592}, {"video_id": "@kelseyinlondon-7310926465745308960", "question": "In the video, there is a long-haired woman wearing a white long-sleeve shirt and holding a camera in front of her eyes. In what other scene does this woman appear?", "question_wo_referring_query": "Where else does the woman in the video appear?", "candidates": ["On the stairs", "In the fruit shop", "In the milk tea shop", "Outside the museum"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@kelseyinlondon-7310926465745308960_0", "video_path": "7310926465745308960.mp4", "subtitle_path": "7310926465745308960_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 32.73, "view_count": 47137}, {"video_id": "@placesunleashed-7293209833279343877", "question": "In the video, there is a woman wearing a pink outfit and a pink hat skating on a skateboard, with a tall building featuring yellow characters in the background. 
In which subtitle does this building with yellow characters also appear?", "question_wo_referring_query": "In the video, in which subtitle does the building with yellow characters appear?", "candidates": ["How have I only just heard about this futuristic city?", "Its ancient Chinese architecture combined with utopian neon lights is mesmerizing.", "Subway trains go through the top floors of buildings like a video game.", "I think I'd willingly become nocturnal if I lived here."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@placesunleashed-7293209833279343877_0", "video_path": "7293209833279343877.mp4", "subtitle_path": "7293209833279343877_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.87, "view_count": 21063}, {"video_id": "@_eat_sleep_travel_repeat-7319473137253485857", "question": "In the scene, there is half a sun, a hill, a person on the hill, and a river beside the hill. What changes occur to the person on the hill?", "question_wo_referring_query": "What changes occur to the person on the hill in the scene?", "candidates": ["Walking up", "Walking down", "Sitting", "Lying down"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7319473137253485857_0", "video_path": "7319473137253485857.mp4", "subtitle_path": "7319473137253485857_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.73, "view_count": 3020}, {"video_id": "@placesunleashed-7283570031777025285", "question": "In the black and white screen, there is a man wearing a shirt with a knitted sweater over it. The screen shows the words 'Roy Bates established the'. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is smiling at the camera", "He is making a 'yay' gesture with his left hand", "He is placing his left hand on his forehead", "He is clenching his left fist and placing it on his chest"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@placesunleashed-7283570031777025285_0", "video_path": "7283570031777025285.mp4", "subtitle_path": "7283570031777025285_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 31.4, "view_count": 31118}, {"video_id": "@placesunleashed-7310322207266180358", "question": "In the scene, the ground in the white valley is covered with yellow leaves, and there are yellow letters that say 'FIRST HOT AIR' in the middle of the screen. What objects appeared in the scene?", "question_wo_referring_query": "What objects appeared in the scene?", "candidates": ["Airplane", "Hot air balloon", "Boat", "Car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@placesunleashed-7310322207266180358_0", "video_path": "7310322207266180358.mp4", "subtitle_path": "7310322207266180358_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.8, "view_count": 4266}, {"video_id": "@luxtravelbe-7297307905479560480", "question": "Under the blue sky, the road is flanked by green plants, and there is a car on the gray road. 
When 'Flying by on the Hawaiian' is mentioned, what color is the car?", "question_wo_referring_query": "What color is the car?", "candidates": ["red", "white", "black", "yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@luxtravelbe-7297307905479560480_0", "video_path": "7297307905479560480.mp4", "subtitle_path": "7297307905479560480_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.53, "view_count": 9693427}, {"video_id": "@kelseyinlondon-7101390452473007366", "question": "In the video, on the meadow, there are many yellow flowers and trees. There is a silver chair on the meadow. On the chair, there is a person sitting and drinking water from a cup. Who is this person sitting on the chair?", "question_wo_referring_query": "Who is this person sitting on the chair?", "candidates": ["The woman wearing a blue short-sleeve shirt", "The woman wearing a pink dress", "The woman wearing a black short-sleeve shirt", "The woman wearing a white dress"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@kelseyinlondon-7101390452473007366_0", "video_path": "7101390452473007366.mp4", "subtitle_path": "7101390452473007366_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 42.2, "view_count": 24100}, {"video_id": "@luxtravelbe-7352606135754509600", "question": "In the scene, the road is covered by white snow. Surrounding the road are pine trees. Next to the pine trees is a wooden structure door. 
When the person wearing a black top, black pants, and a hat appears for the first time, what is this person doing?", "question_wo_referring_query": "What is this person doing?", "candidates": ["Walking in the snow", "Having a snowball fight", "Fishing", "Building a snowman"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@luxtravelbe-7352606135754509600_0", "video_path": "7352606135754509600.mp4", "subtitle_path": "7352606135754509600_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.53, "view_count": 12156}, {"video_id": "@placesunleashed-7290280057040424197", "question": "In the scene with the text YOU'RE IN, there is a rock in the sea with a seal next to it. When the subtitle mentions 'Luckily if you travel here you're in one of the only places in British Columbia that', what is the seal doing?", "question_wo_referring_query": "What is the seal doing?", "candidates": ["It is in the deep sea", "It is on a fishing boat", "It is lying on the rock", "It is caught in a fishing net in the water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "@placesunleashed-7290280057040424197_0", "video_path": "7290280057040424197.mp4", "subtitle_path": "7290280057040424197_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.23, "view_count": 390626}, {"video_id": "@placesunleashed-7328134394315672838", "question": "In the scene, the entire screen is filled with various small pieces of paper, with green text saying 'FOR PIRATES'. In the scene, there is a woman wearing an orange life jacket, floating on the surface of a turquoise lake with her arms outstretched. 
Which object appears first?", "question_wo_referring_query": "Which object appears first?", "candidates": ["A man wearing a black life jacket on the surface of a turquoise lake appears first", "In the scene, a woman wearing an orange life jacket, floating on the surface of a turquoise lake with her arms outstretched appears first", "A woman wearing a blue swimsuit in a swimming pool appears first", "The entire screen is filled with various small pieces of paper, with green text saying 'FOR PIRATES' appearing first"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@placesunleashed-7328134394315672838_0", "video_path": "7328134394315672838.mp4", "subtitle_path": "7328134394315672838_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.27, "view_count": 11701}, {"video_id": "@kelseyinlondon-7255723592460029210", "question": "Which of the following sequences is correct?", "question_wo_referring_query": "Which of the following sequences is correct?", "candidates": ["First, a woman wearing a light-colored top and a white skirt appears in a luxurious golden room. Then, on a road between two rows of trees, a woman wearing a white top and pink pants, holding a yellow bag in her right hand, appears. Finally, there is a bouquet of flowers in front of a mirror, with the Eiffel Tower in the background.", "First, there is a bouquet of flowers in front of a mirror, with the Eiffel Tower in the background. Then, a woman wearing a light-colored top and a white skirt appears in a luxurious golden room. Finally, on a road between two rows of trees, a woman wearing a white top and pink pants, holding a yellow bag in her right hand, appears.", "First, there is a bouquet of flowers in front of a mirror, with the Eiffel Tower in the background. Then, on a road between two rows of trees, a woman wearing a white top and pink pants, holding a yellow bag in her right hand, appears. 
Finally, a woman wearing a light-colored top and a white skirt appears in a luxurious golden room.", "First, a woman wearing a light-colored top and a white skirt appears in a luxurious golden room. Then, there is a bouquet of flowers in front of a mirror, with the Eiffel Tower in the background. Finally, on a road between two rows of trees, a woman wearing a white top and pink pants, holding a yellow bag in her right hand, appears."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "@kelseyinlondon-7255723592460029210_0", "video_path": "7255723592460029210.mp4", "subtitle_path": "7255723592460029210_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.03, "view_count": 113814}, {"video_id": "@placesunleashed-7305076505724407045", "question": "In the scene, there's an orange mountain bathed in sunset light, with green vegetation in front of it. In which subtitles does this scene appear together with?", "question_wo_referring_query": "In which subtitles does this scene appear together with?", "candidates": ["I'm sorry, Duane", "Have you ever heard of the biggest rock in the world? This thing is over 1100 feet tall, but you are unable to climb it out of respect to the", "Uluru, located in Australia, is the most beautiful rock in the world", "Aboriginal people in the area"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@placesunleashed-7305076505724407045_0", "video_path": "7305076505724407045.mp4", "subtitle_path": "7305076505724407045_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.13, "view_count": 26256}, {"video_id": "@placesunleashed-7322988438955838725", "question": "In the scene, there are many buildings surrounded by green trees. The text 'THAT YOU'LL NEVER' is displayed in green. 
What changes are seen in the scene of this building?", "question_wo_referring_query": "What changes are seen in the scene of this building?", "candidates": ["No changes.", "A blue light beam is shining on the building.", "An orange light beam is shining on the building.", "A green light beam is shining on the building."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@placesunleashed-7322988438955838725_0", "video_path": "7322988438955838725.mp4", "subtitle_path": "7322988438955838725_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.4, "view_count": 3220}, {"video_id": "@daiki.shino-7302866356590955808", "question": "In the circular screen, a blue sphere is surrounded by several people. Each person is wearing different clothes. The background features a blue sky and the yellow text 'THE V&A'. What objects appear in this scene?", "question_wo_referring_query": "What objects appear in this scene?", "candidates": ["wu clouds", "swan", "airplane", "clouds"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@daiki.shino-7302866356590955808_0", "video_path": "7302866356590955808.mp4", "subtitle_path": "7302866356590955808_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.0, "view_count": 27005}, {"video_id": "@placesunleashed-7295819151183138053", "question": "In the scene, there is a train passing through a narrow space, with buildings on both sides and a woman in a blue top standing nearby. The screen displays the text 'HOW MANY ACCIDENTS HAVE OCCURRED ON THIS FAMOUS.' 
When the subtitle 'How many accidents have occurred on this famous train cafe street' appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["high-speed rail", "tracks", "cars", "airplanes"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@placesunleashed-7295819151183138053_0", "video_path": "7295819151183138053.mp4", "subtitle_path": "7295819151183138053_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.33, "view_count": 78085}, {"video_id": "@daiki.shino-7298388543250992416", "question": "In the image, there is a woman's back in the bottom section, and on the green grass above, there is a man wearing a blue shirt, khaki pants, and a hat, standing next to a woman in dark blue jeans. What color jacket is the woman wearing?", "question_wo_referring_query": "What color jacket is the woman wearing?", "candidates": ["purple", "black", "pink", "blue"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@daiki.shino-7298388543250992416_0", "video_path": "7298388543250992416.mp4", "subtitle_path": "7298388543250992416_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.56, "view_count": 8999}, {"video_id": "@placesunleashed-7299160023094021381", "question": "In the video, on a stone there are two animals, and there is text saying 'SPLIT OFF FROM THE CONTINENT' on the screen. 
What color are the legs of these two animals when the phrase 'but since the islands split off from the continent' is mentioned?", "question_wo_referring_query": "What color are the legs of these two animals?", "candidates": ["white", "black", "blue", "yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@placesunleashed-7299160023094021381_0", "video_path": "7299160023094021381.mp4", "subtitle_path": "7299160023094021381_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.07, "view_count": 1058426}, {"video_id": "@placesunleashed-7335922760335510789", "question": "In the screen, on the mountain, there is a person wearing pink clothes, carrying a silver backpack, and wearing a red hat standing on the mountain. The screen also displays green text that says 'AND VAST THAT MORE'. What was this person doing the first time they appeared?", "question_wo_referring_query": "What was this person doing the first time they appeared?", "candidates": ["Sitting on the ground", "Walking on the mountain", "Exercising", "Running"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@placesunleashed-7335922760335510789_0", "video_path": "7335922760335510789.mp4", "subtitle_path": "7335922760335510789_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.2, "view_count": 3706}, {"video_id": "@kelseyinlondon-7203033684801244421", "question": "In the video, on the jade green sea, there is a white road that ends with some buildings. A woman is riding a bicycle on the white road with a backpack on her back. 
What did the woman do afterwards?", "question_wo_referring_query": "What did the woman do afterwards?", "candidates": ["Worked out in a room", "Took a walk on the white road", "Took a bath in a bathtub", "Swam in the sea"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@kelseyinlondon-7203033684801244421_0", "video_path": "7203033684801244421.mp4", "subtitle_path": "7203033684801244421_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 56.17, "view_count": 1346273}, {"video_id": "@daiki.shino-7287621763444264224", "question": "In the scene, in front of a European-style building, two red double-decker buses encounter each other on the road. And in the scene, below the building, there is a man wearing a black shirt playing the saxophone, with a microphone in front of him. Which of these two things appears first?", "question_wo_referring_query": "Which of these two things appears first?", "candidates": ["The big clock appears first", "The seaside appears first", "In front of a European-style building, two red double-decker buses encounter each other on the road, appearing first", "Below the building, there is a man wearing a black shirt playing the saxophone, with a microphone in front of him, appearing first"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@daiki.shino-7287621763444264224_0", "video_path": "7287621763444264224.mp4", "subtitle_path": "7287621763444264224_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.4, "view_count": 26594}, {"video_id": "@placesunleashed-7315898941659188485", "question": "In an area surrounded by glass, there are three women dressed in different clothes and one man wearing a black short-sleeved shirt. There is also a large building behind them. 
After mentioning 'This church that's been in construction since 1882 has an end date in the very near future,' who are the people present?", "question_wo_referring_query": "Who are the people present?", "candidates": ["A woman wearing a black feather coat and black pants", "A man wearing a black leather jacket and black pants", "A man wearing a red and blue striped short-sleeved shirt with tattoos on his arm", "A woman wearing blue clothes with a yellow bag"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "@placesunleashed-7315898941659188485_0", "video_path": "7315898941659188485.mp4", "subtitle_path": "7315898941659188485_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.38, "view_count": 3867399}, {"video_id": "@jetset_anna-6987800083022613766", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a white house with a blue door, and the outside ground is paved with irregularly shaped stones. There are also purple-red flowers on the left side of the screen. Then, there is a table covered with a white tablecloth and blue chairs placed on the ground. Finally, there is a yellow bicycle with a basket on the back seat that contains purple-red flowers.", "First, there is a yellow bicycle with a basket on the back seat that contains purple-red flowers. Then, there is a table covered with a white tablecloth and blue chairs placed on the ground. Finally, there is a white house with a blue door, and the outside ground is paved with irregularly shaped stones. There are also purple-red flowers on the left side of the screen.", "First, there is a white house with a blue door, and the outside ground is paved with irregularly shaped stones. There are also purple-red flowers on the left side of the screen. 
Then, there is a yellow bicycle with a basket on the back seat that contains purple-red flowers. Finally, there is a table covered with a white tablecloth and blue chairs placed on the ground.", "First, there is a yellow bicycle with a basket on the back seat that contains purple-red flowers. Then, there is a white house with a blue door, and the outside ground is paved with irregularly shaped stones. There are also purple-red flowers on the left side of the screen. Finally, there is a table covered with a white tablecloth and blue chairs placed on the ground."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "@jetset_anna-6987800083022613766_0", "video_path": "6987800083022613766.mp4", "subtitle_path": "6987800083022613766_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.07, "view_count": 1851522}, {"video_id": "@_eat_sleep_travel_repeat-7291601889475530017", "question": "In a scene with many people in the background, there is a man in black clothes sitting at a wooden table, holding a drink in his hand. What change happened to him while he was climbing the mountain later?", "question_wo_referring_query": "What change happened?", "candidates": ["He put on a hat and glasses.", "He changed into long pants and put on black-framed sunglasses.", "He changed into a palm-green windbreaker.", "He changed into shorts."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7291601889475530017_0", "video_path": "7291601889475530017.mp4", "subtitle_path": "7291601889475530017_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.37, "view_count": 6925}, {"video_id": "@jetset_anna-6923192008718781702", "question": "Outside the white building, there is a swimming pool, and at the top of the screen, there are also tall and beautiful trees. 
Which of the following objects did not appear on the screen?", "question_wo_referring_query": "Which of the following objects did not appear on the screen?", "candidates": ["White lounge chair", "White parasol", "Terracotta tiles", "Green lounge chair"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@jetset_anna-6923192008718781702_0", "video_path": "6923192008718781702.mp4", "subtitle_path": "6923192008718781702_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.7, "view_count": 4572}, {"video_id": "@luxtravelbe-7242768257147538715", "question": "Under the blue sky, there is a vast ocean, and on the left side of the screen, there is also a piece of land. When the phrase 'So say you will love me one day' is mentioned, what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["The high-hanging sun", "A banyan tree with green leaves", "A small boat on the sea", "The sunset on the shore"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@luxtravelbe-7242768257147538715_0", "video_path": "7242768257147538715.mp4", "subtitle_path": "7242768257147538715_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.27, "view_count": 355946}, {"video_id": "@jetset_anna-7095725950322609414", "question": "Between two rows of buildings, there is a small path with many black poles. Above the buildings, there is a row of bright small lights. 
What shape are these lights?", "question_wo_referring_query": "What shape are these bright lights?", "candidates": ["Rectangular", "Star-shaped", "Small animal-shaped", "Trumpet-shaped"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@jetset_anna-7095725950322609414_0", "video_path": "7095725950322609414.mp4", "subtitle_path": "7095725950322609414_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.0, "view_count": 38843}, {"video_id": "@_eat_sleep_travel_repeat-7298750645320895777", "question": "On a small path surrounded by green leaves, outside a wooden railing, there is a lady with a camera and a hat. When she mentioned 'I like to think about life this way,' what kind of clothes was she wearing?", "question_wo_referring_query": "What kind of clothes was she wearing?", "candidates": ["Dark-colored top, purple jeans, and short boots", "Light-colored top, dark jeans, and short boots", "Light-colored top, dark jeans, and long boots", "Dark-colored top, purple jeans, and long boots"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@_eat_sleep_travel_repeat-7298750645320895777_0", "video_path": "7298750645320895777.mp4", "subtitle_path": "7298750645320895777_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.6, "view_count": 579388}, {"video_id": "@placesunleashed-7339290195533057286", "question": "Under a blue sky with white clouds lies an endless sea. In the picture, there is a red and white beach umbrella, and not far from it, there are two people floating on the water. 
Who are these two people?", "question_wo_referring_query": "Who are these two people?", "candidates": ["A person wearing a deep blue swimsuit and a person wearing a white swimsuit", "A person holding a swim ring and wearing a white swimsuit", "A person holding a swim ring and wearing a deep blue swimsuit", "Two people wearing deep blue swimsuits"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@placesunleashed-7339290195533057286_0", "video_path": "7339290195533057286.mp4", "subtitle_path": "7339290195533057286_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 21.67, "view_count": 5345}, {"video_id": "@jetset_anna-7042661763845852422", "question": "Next to the wooden deck shaded by a jujube tree, there is a pool filled with clean water and paved with flower stones at the bottom. What happened when this pool appeared for the first time?", "question_wo_referring_query": ", what happened when this pool appeared for the first time?", "candidates": ["There were some green leaves floating on the pool", "There were some flower petals floating on the pool", "There were some brown leaves floating on the pool", "There was a swimming ring floating on the water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@jetset_anna-7042661763845852422_0", "video_path": "7042661763845852422.mp4", "subtitle_path": "7042661763845852422_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.99, "view_count": 76172}, {"video_id": "@luxtravelbe-7254638090537422106", "question": "In the screen where the map with a blue outlined triangle is shown, with white letters on it, what happens when 'Makes me feel alive' is mentioned?", "question_wo_referring_query": "What happens?", "candidates": ["A small white marker appears on top", "A small red marker appears on top", "A section of blue letters appears on it", "A 
small red question mark appears on top"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "@luxtravelbe-7254638090537422106_0", "video_path": "7254638090537422106.mp4", "subtitle_path": "7254638090537422106_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.97, "view_count": 323420}, {"video_id": "@_eat_sleep_travel_repeat-7237049173462306074", "question": "On a black stone road, a brown dog is lying next to a white wall. Someone wearing a black wristwatch is petting it. After the phrase 'Let go of all your haunted dreams tonight' is mentioned, what happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A person carrying a blue backpack is walking", "A person sitting on a blue swing bed is looking at the scenery", "A person wearing a wristwatch is feeding a bird in their hand", "A woman wearing a striped top and a floral skirt is looking at the scenery"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7237049173462306074_0", "video_path": "7237049173462306074.mp4", "subtitle_path": "7237049173462306074_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.67, "view_count": 754}, {"video_id": "@_eat_sleep_travel_repeat-7202935218167172357", "question": "In a scene with many people in the background, there is a man in black clothes sitting at a wooden table holding a drink in his hand. 
After mentioning 'Tell people you love them', what appeared?", "question_wo_referring_query": "What appeared?", "candidates": ["A person wearing a blue cap and holding a cup of water appeared", "A woman wearing a black patterned coat and a backpack appeared", "A plate of chocolate cake appeared", "A white dog appeared"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7202935218167172357_0", "video_path": "7202935218167172357.mp4", "subtitle_path": "7202935218167172357_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.57, "view_count": 1440}, {"video_id": "@placesunleashed-7287278043519880453", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, there is a green meadow at the foot of the lofty mountain, with a few small huts on it. Then, beside a small road, there are roofs covered in glistening white snow and outdoor tables and chairs. Finally, within a wooden enclosure on another meadow, two cows are grazing.", "First, beside a small road, there are roofs covered in glistening white snow and outdoor tables and chairs. Then, at the foot of the lofty mountain, there is a green meadow with a few small huts on it. Finally, within a wooden enclosure on another meadow, two cows are grazing.", "First, there is a green meadow at the foot of the lofty mountain, with a few small huts on it. Then, within a wooden enclosure on another meadow, two cows are grazing. Finally, beside a small road, there are roofs covered in glistening white snow and outdoor tables and chairs.", "First, beside a small road, there are roofs covered in glistening white snow and outdoor tables and chairs. Then, within a wooden enclosure on another meadow, two cows are grazing. 
Finally, at the foot of the lofty mountain, there is a green meadow with a few small huts on it."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "@placesunleashed-7287278043519880453_0", "video_path": "7287278043519880453.mp4", "subtitle_path": "7287278043519880453_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.5, "view_count": 8223200}, {"video_id": "@placesunleashed-7299562810047040773", "question": "On a greyish-white crystal, there are two people wearing orange clothes, carrying white bags, and wearing white hats. Where else has this white hat appeared?", "question_wo_referring_query": "Where else has this white hat appeared?", "candidates": ["Stuck in a crevice of the crystal", "On a piece of empty ground", "On the head of a person wearing white clothes and carrying a black backpack", "On the head of a person wearing orange clothes and carrying a black backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@placesunleashed-7299562810047040773_0", "video_path": "7299562810047040773.mp4", "subtitle_path": "7299562810047040773_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.47, "view_count": 15878}, {"video_id": "@placesunleashed-7320724094641523974", "question": "On a blackish-green cliff face, there are three silver-shell, oval-shaped objects hanging, and nearby there's another object with a spherical top. Where have these oval-shaped objects and the subtitles appeared together before?", "question_wo_referring_query": ", and nearby there's another object with a spherical top. Where have these oval-shaped objects and the subtitles appeared together before?", "candidates": ["The Sky Lodge Suites in Peru are the hardest hotel to get to. 
The Sky Lodge Suites in Peru are the hardest hot", "Once you're there, you get to enjoy what it feels like to be suspended way up in the air", "just make sure not to think about the trip back down because you have to rappel.", "The reason why is because you have to rock climb up to get there."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@placesunleashed-7320724094641523974_0", "video_path": "7320724094641523974.mp4", "subtitle_path": "7320724094641523974_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.07, "view_count": 1605}, {"video_id": "@luxtravelbe-7248698403943632155", "question": "There is a staircase in the screen, there is a woman on the staircase, below the staircase is flat ground, next to the flat ground is a sea. What is the woman in the screen doing?", "question_wo_referring_query": "What is the woman in the screen doing?", "candidates": ["Sitting", "Lying down", "Standing on the staircase", "Crawling"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@luxtravelbe-7248698403943632155_0", "video_path": "7248698403943632155.mp4", "subtitle_path": "7248698403943632155_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.38, "view_count": 332986}, {"video_id": "@placesunleashed-7347821051583155461", "question": "In the scene, there is a person in black clothing wearing glasses walking in the middle, and on the right side, there is another person in black clothing. 
What objects appear in the scene?", "question_wo_referring_query": "What objects appear in the scene?", "candidates": ["Car", "Boat", "Purple hat", "Traffic light"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "@placesunleashed-7347821051583155461_0", "video_path": "7347821051583155461.mp4", "subtitle_path": "7347821051583155461_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.3, "view_count": 4397}, {"video_id": "@placesunleashed-7291384249851153670", "question": "There is a partial map on the screen, along with a line that says 'LOCATE IN KENYA'. There is also a green symbol featuring a white tree. What is the shape of the symbol in the video?", "question_wo_referring_query": ", the symbol features a white tree. What is the shape of the symbol in the video?", "candidates": ["Square", "Triangle", "Ellipse", "Circle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@placesunleashed-7291384249851153670_0", "video_path": "7291384249851153670.mp4", "subtitle_path": "7291384249851153670_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.6, "view_count": 5111}, {"video_id": "@placesunleashed-7340410524678638853", "question": "In the video, there is a golden tower and the text 'BUDDHIST CULTURE.' There is also a symbol below. 
When the subtitle mentions 'Buddhist culture is one factor as to why everything's gold,' what shape is the symbol?", "question_wo_referring_query": "What is the shape of the symbol?", "candidates": ["Round Cone", "Ellipse", "Circle", "Square"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@placesunleashed-7340410524678638853_0", "video_path": "7340410524678638853.mp4", "subtitle_path": "7340410524678638853_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.78, "view_count": 1099}, {"video_id": "@placesunleashed-7290644595925437701", "question": "In the scene, there are many white buildings and a person standing on a white staircase. The screen displays the text 'WHEN YOU TRAVEL TO MATERA.' Who is the person standing on the staircase?", "question_wo_referring_query": "Who is the person standing on the staircase?", "candidates": ["A woman in a black coat", "A woman in a white short-sleeved shirt", "A black-haired woman in a beige coat", "A woman in a black short-sleeved shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@placesunleashed-7290644595925437701_0", "video_path": "7290644595925437701.mp4", "subtitle_path": "7290644595925437701_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.5, "view_count": 5755}, {"video_id": "@placesunleashed-7338179642588794117", "question": "In the screen, there are green letters spelling AND RHINOCEROS, and there is a yellow tiger. 
What is this tiger doing?", "question_wo_referring_query": "What is this tiger doing?", "candidates": ["Running on the grassland", "Sitting on the ground", "Standing and looking forward", "Lying on the ground"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O2E", "level": "IntraMoment", "id": "@placesunleashed-7338179642588794117_0", "video_path": "7338179642588794117.mp4", "subtitle_path": "7338179642588794117_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 19.42, "view_count": 3665}, {"video_id": "@placesunleashed-7314411089406610693", "question": "In the scene, there is a snow-covered mountain with a lake in front of it. On the surface of the lake, there is an animal with the text 'THE NEARBY WATERS'. When it mentions 'In addition to the amazing landscape, the nearby waters have some of the most insane,' what is this animal doing?", "question_wo_referring_query": "In the scene, there is a snow-covered mountain with a lake in front of it. On the surface of the lake, there is an animal with the text 'THE NEARBY WATERS'. When it mentions 'In addition to the amazing landscape, the nearby waters have some of the most insane,' what is this animal doing?", "candidates": ["It is carrying a ball", "It is eating a fish", "It is peeking its head out of the lake", "It is swimming in the water"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "@placesunleashed-7314411089406610693_0", "video_path": "7314411089406610693.mp4", "subtitle_path": "7314411089406610693_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.5, "view_count": 2611}, {"video_id": "@placesunleashed-7324058647598877957", "question": "In the video, there is a scene that shows a map with green text reading 'IN THE SOUTH OF'. 
What happens after the map appears?", "question_wo_referring_query": "What happens after the map appears?", "candidates": ["There are people riding motorcycles on the road", "Many tourists appear by the sea", "A restaurant appears with some people dining", "In the palm-red valley, there are several people riding horses"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@placesunleashed-7324058647598877957_0", "video_path": "7324058647598877957.mp4", "subtitle_path": "7324058647598877957_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.03, "view_count": 4362}, {"video_id": "@jetset_anna-7215464534239169798", "question": "In the scene, there is an outdoor swimming pool on the blue sea with a woman in the pool. Surrounding the sea is a green mountain range, and on the sea, there is a wooden house. Which object appears first?", "question_wo_referring_query": "Which object appears first?", "candidates": ["In a sea surrounded by green mountains, many wooden houses on the sea appear first.", "In a sea surrounded by green mountains, a wooden house on the sea appears first.", "On the blue sea, an outdoor swimming pool with many people in it appears first.", "On the blue sea, an outdoor swimming pool with a woman in it appears first."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@jetset_anna-7215464534239169798_0", "video_path": "7215464534239169798.mp4", "subtitle_path": "7215464534239169798_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 18.05, "view_count": 15760}, {"video_id": "@placesunleashed-7322626857751366918", "question": "In a scene with a green tree canopy, a white waterfall cascades down, and at the bottom of the scene, there is a woman in a white top and blue pants. After mentioning 'and I doubt it's that hard to advertise when you're showing a place like this.' 
what happens next?", "question_wo_referring_query": "what happens next?", "candidates": ["A man holding a yellow surfboard is walking", "Four monkeys are sitting on a vine-covered wooden frame", "A shirtless man is playing in the water", "A man is standing on the beach admiring the sunset"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@placesunleashed-7322626857751366918_0", "video_path": "7322626857751366918.mp4", "subtitle_path": "7322626857751366918_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.33, "view_count": 1082}, {"video_id": "@luxtravelbe-7226412891790953754", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, beneath the blue sky and white clouds, there is a vast expanse of sea with many boats anchored in the distance; then, on a curved road, there is a car driving along with a green sign overhead; finally, on a desolate piece of land, there is a parked airplane with many flat-topped buildings behind it.", "First, beneath the blue sky and white clouds, there is a vast expanse of sea with many boats anchored in the distance; then, on a desolate piece of land, there is a parked airplane with many flat-topped buildings behind it; finally, on a curved road, there is a car driving along with a green sign overhead.", "First, on a curved road, there is a car driving along with a green sign overhead; then, on a desolate piece of land, there is a parked airplane with many flat-topped buildings behind it; finally, beneath the blue sky and white clouds, there is a vast expanse of sea with many boats anchored in the distance.", "First, on a curved road, there is a car driving along with a green sign overhead; then, beneath the blue sky and white clouds, there is a vast expanse of sea with many boats anchored in the distance; finally, on a 
desolate piece of land, there is a parked airplane with many flat-topped buildings behind it."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "@luxtravelbe-7226412891790953754_0", "video_path": "7226412891790953754.mp4", "subtitle_path": "7226412891790953754_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 25.53, "view_count": 43134}, {"video_id": "@daiki.shino-7296156406548385057", "question": "In the video, there's a woman wearing black clothes and blue pants standing with her arms crossed. She's in the upper right corner of the screen, with a few skyscrapers in the background. Where has this woman appeared before?", "question_wo_referring_query": "Where has this woman appeared before?", "candidates": ["Inside a tidy library", "Under a twilight sky, in front of many buildings", "In a tidy bedroom", "On the tallest skyscraper"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@daiki.shino-7296156406548385057_0", "video_path": "7296156406548385057.mp4", "subtitle_path": "7296156406548385057_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.77, "view_count": 9030}, {"video_id": "@kelseyinlondon-7295809105883663648", "question": "Outside, surrounded by tall buildings, there is an intersection with zebra crossing lines. A woman dressed in green clothes and wearing a headscarf stands on the zebra crossing. 
When she stands in front of a white desk, what clothes does she change into?", "question_wo_referring_query": "What clothes does she change into?", "candidates": ["She changes into a pink outfit", "She changes into a pink coat", "She changes into a white outfit", "She changes into a light green outfit"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@kelseyinlondon-7295809105883663648_0", "video_path": "7295809105883663648.mp4", "subtitle_path": "7295809105883663648_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.17, "view_count": 137206}, {"video_id": "@kelseyinlondon-7154038655244340485", "question": "In a scene with a large bridge structure, a woman wearing a brown top, white pants, and carrying a bag on her left shoulder is standing on the shore looking into the distance. When mentioning 'Ooh, you're in a trance, baby['Ooh, you're in a trance, baby'] in Chinese Simplified can be translated as.' what changes does this woman undergo?", "question_wo_referring_query": "What changes does this woman undergo?", "candidates": ["She moves the bag from her left shoulder to her right side", "She changes to a black bag", "She ties her hair up", "She takes the bag down"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "@kelseyinlondon-7154038655244340485_0", "video_path": "7154038655244340485.mp4", "subtitle_path": "7154038655244340485_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.63, "view_count": 114200}, {"video_id": "-UDBrWJufZ4", "question": "On the screen is a woman with long black hair wearing a grey checkered coat, sitting in front of a white wall with a white shelf. There is a green plant and a small blackboard on the shelf. The woman is seated at a table with a tablet in front of her. 
What is she doing?", "question_wo_referring_query": "What is she doing?", "candidates": ["Opening her arms", "Raising one hand", "The woman is arranging her shirt", "Pointing at the tablet", "The woman is touching her head"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "-UDBrWJufZ4_0", "video_path": "-UDBrWJufZ4.mp4", "subtitle_path": "-UDBrWJufZ4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1558.16, "view_count": 41981}, {"video_id": "-UDBrWJufZ4", "question": "The screen shows a white PPT, with two split screens on the right. One shows a woman with black straight hair wearing a gray coat, and the other shows a woman wearing a black lab coat and glasses. The white PPT contains chemical formulas. What is the woman with black straight hair and a gray coat doing?", "question_wo_referring_query": "What is the woman with black straight hair and a gray coat doing?", "candidates": ["She is fixing her hair", "One hand is supporting her face", "One hand is raised", "One hand is clenched in a fist in front of her body", "One hand is holding a pen"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "-UDBrWJufZ4_1", "video_path": "-UDBrWJufZ4.mp4", "subtitle_path": "-UDBrWJufZ4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1558.16, "view_count": 41981}, {"video_id": "-UDBrWJufZ4", "question": "In the screen, there is a white PPT on the right side with two separate frames. One shows a woman with straight black hair wearing a grey coat, and the other shows a woman wearing a black lab coat and glasses. The white PPT contains a chemical formula. The woman with straight black hair is smiling, showing her white teeth. 
What is the woman with glasses doing?", "question_wo_referring_query": "What is the woman with glasses doing?", "candidates": ["Pointing at a tablet", "One hand clenched in a fist in front of her body", "One hand holding a pen", "One hand supporting her face", "One hand raised up"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2E", "level": "IntraMoment", "id": "-UDBrWJufZ4_2", "video_path": "-UDBrWJufZ4.mp4", "subtitle_path": "-UDBrWJufZ4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1558.16, "view_count": 41981}, {"video_id": "0PBgtlsO9c4", "question": "The scene depicts an animation in a poorly lit area. A soldier appears on the right side of the screen, wearing a green helmet and a soldier's uniform, and holding a rifle. When the phrase 'returned to England to serve as a' is mentioned, what change occurs in the soldier?", "question_wo_referring_query": "What change occurs in the soldier?", "candidates": ["He put down his backpack", "He put down the weapon in his hand", "He took off his hat", "He removed his bulletproof vest", "He took a few steps forward"], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "0PBgtlsO9c4_0", "video_path": "0PBgtlsO9c4.mp4", "subtitle_path": "0PBgtlsO9c4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2397.13, "view_count": 644978}, {"video_id": "0PBgtlsO9c4", "question": "The person on the screen is a soldier wearing a khaki camouflage helmet, military green uniform, and a camouflage vest. He is in a yellow room, with several guns on the table in front of him. He is also holding a gun in his hand. 
What change did the soldier undergo when the term 'equipment an even more pertinent concern' was mentioned?", "question_wo_referring_query": "What change did the soldier undergo?", "candidates": ["The soldier had two more guns in his hand.", "The soldier picked up a green gun.", "The soldier took off his helmet.", "The soldier removed his armor.", "The soldier no longer had a gun in his hand; he placed it on the table."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "0PBgtlsO9c4_1", "video_path": "0PBgtlsO9c4.mp4", "subtitle_path": "0PBgtlsO9c4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2397.13, "view_count": 644978}, {"video_id": "0PBgtlsO9c4", "question": "The scene is in a green forest where two characters are crouching on the ground and shooting forward. One character is wearing white clothes and the other is wearing black clothes. The surrounding forest is green, and the ground under their feet is olive-colored. When 'and rifles but other stances such as the' is mentioned, what changes happen to the characters in the scene?", "question_wo_referring_query": "What changes happen to the characters in the scene?", "candidates": ["The character in white clothing stands up.", "The character in green clothing kneels on the ground.", "The character in white clothing starts to lie down on the ground.", "The character in green clothing stands up.", "The character in white clothing starts to squat."], "topic_category": "KH-Knowledge-History", "question_category": "TAA", "level": "L2-Relation", "id": "0PBgtlsO9c4_2", "video_path": "0PBgtlsO9c4.mp4", "subtitle_path": "0PBgtlsO9c4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2397.13, "view_count": 644978}, {"video_id": "6sAgglqMSBA", "question": "In the video, there is a man with brown hair and a woman with black hair in an elevator. 
The man is wearing a black jacket and a white T-shirt, and the woman is wearing a white tank top. When they appear in an open area with white buildings and parked cars in the background, what change happens to the woman?", "question_wo_referring_query": "What change happens to the woman?", "candidates": ["The woman changes into a black shirt", "The woman changes into a black tank top", "The woman changes into a black T-shirt", "The woman changes into a black fitness outfit", "The woman changes into a black dress"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "6sAgglqMSBA_0", "video_path": "6sAgglqMSBA.mp4", "subtitle_path": "6sAgglqMSBA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1187.37, "view_count": 3319118}, {"video_id": "6sAgglqMSBA", "question": "The screen shows a man with curly hair wearing a floral shirt, standing in front of a grand hall adorned with gilded decorations. The hall has a circular ceiling medallion and sparkling walls, with yellow wall sconces hanging on the white walls. People dressed in white pass by under the wall lights. This man appears in a place with a large vase, a round crystal chandelier, and green walls. 
What kind of change does he undergo?", "question_wo_referring_query": "What kind of change does he undergo?", "candidates": ["He changes into a red shirt", "He changes into a white shirt with black floral pattern", "He changes into a white shirt with blue floral pattern", "He changes into a blue shirt", "He changes into a white shirt with red floral pattern"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "6sAgglqMSBA_1", "video_path": "6sAgglqMSBA.mp4", "subtitle_path": "6sAgglqMSBA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1187.37, "view_count": 3319118}, {"video_id": "6sAgglqMSBA", "question": "In the video, there are a woman and a man wearing black clothes. The woman is wearing an off-shoulder top and glasses, and the man is wearing a cardigan. They are sitting in front of a table with a laptop and a bouquet of flowers. Behind them, there is a white bed and a wooden door. At one point, the woman appears on a yacht with a river and a Ferris wheel in the background. What change occurs to the woman at this point?", "question_wo_referring_query": "What change occurs to the woman at this point?", "candidates": ["The woman changes into a black T-shirt", "The woman changes into black pants", "The woman changes into a pink blouse", "The woman changes into a black dress", "The woman changes into a white tank top"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "6sAgglqMSBA_2", "video_path": "6sAgglqMSBA.mp4", "subtitle_path": "6sAgglqMSBA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1187.37, "view_count": 3319118}, {"video_id": "WakN0KKSiJ8", "question": "The man on the screen is wearing a black suit with a light blue shirt and black-framed glasses. His hair is graying, and there are some gray hairs in his beard as well. 
This man is sitting in a room with a white bookshelf behind him, filled with books. In which other scene does this man appear?", "question_wo_referring_query": "In which other scene does this man appear?", "candidates": ["In a PPT with a black background, this man appears on the right side of the slide, with a map of Paris displayed on the PPT.", "In a PPT with a black background, this man appears on the right side of the slide, with a map of Azerbaijan displayed on the PPT.", "In a PPT with a black background, this man appears on the right side of the slide, with a map of Venezuela displayed on the PPT.", "In a PPT with a black background, this man appears on the right side of the slide, with a map of Tokyo displayed on the PPT.", "In a PPT with a black background, this man appears on the right side of the slide, with a map of Pakistan displayed on the PPT."], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "WakN0KKSiJ8_0", "video_path": "WakN0KKSiJ8.mp4", "subtitle_path": "WakN0KKSiJ8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2255.56, "view_count": 537}, {"video_id": "WakN0KKSiJ8", "question": "The screen shows a PPT with a black background. The PPT displays some artifacts, all of which are circular copper plates with floral carvings. There are a total of 10 plates in various sizes. On the right side, a man is explaining in a small split screen. 
In which scene does the circular plate on the screen appear?", "question_wo_referring_query": "In which scene does the circular plate on the screen appear?", "candidates": ["In the gray image inscribed with BASILEUS EZANAS 'King Ezana'", "In the green photo inscribed with BASILEUS EZANAS 'King Ezana'", "In the yellow photo inscribed with BASILEUS EZANAS 'King Ezana'", "In the green room inscribed with BASILEUS EZANAS 'King Ezana'", "In the black PPT background inscribed with BASILEUS EZANAS 'King Ezana'"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "WakN0KKSiJ8_1", "video_path": "WakN0KKSiJ8.mp4", "subtitle_path": "WakN0KKSiJ8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2255.56, "view_count": 537}, {"video_id": "WakN0KKSiJ8", "question": "The screen shows a black PPT background with a white title that says 'Basilica: Pendant.' There is a black stone head with a raised middle section, and some floral patterns are carved on it. On the right, there is a man explaining in the frame. In which other scene did this stone head appear?", "question_wo_referring_query": "In which other scene did this stone head appear?", "candidates": ["A black background with red letters saying 'cf. Abba Garima Monastery Gospels (near Adwa) AD 390-660'", "A black background with olive letters saying 'cf. Abba Garima Monastery Gospels (near Adwa) AD 390-660'", "A black background with white letters saying 'cf. Abba Garima Monastery Gospels (near Adwa) AD 390-660'", "A black background with green letters saying 'cf. Abba Garima Monastery Gospels (near Adwa) AD 390-660'", "A black background with yellow letters saying 'cf. 
Abba Garima Monastery Gospels (near Adwa) AD 390-660'"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "WakN0KKSiJ8_2", "video_path": "WakN0KKSiJ8.mp4", "subtitle_path": "WakN0KKSiJ8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2255.56, "view_count": 537}, {"video_id": "YO3iBRvqzac", "question": "In the video, there is a plant that is full of small fruits, some of which are ripe and black while others are still green and unripe. A hand is picking the fruits. What is the first object that appears after mentioning Blackberry?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["a grassy field", "an ear of corn", "water flowing among rocks", "a campfire", "a chair"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "YO3iBRvqzac_0", "video_path": "YO3iBRvqzac.mp4", "subtitle_path": "YO3iBRvqzac_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1153.02, "view_count": 2119102}, {"video_id": "YO3iBRvqzac", "question": "In the screen, there is a hand holding a piece of fresh ham, with a rock behind the ham. The cuff of the hand is army green, and the other hand is sprinkling seasoning on the ham. 
When Pepper is mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Cymbopogon (fragrant grass) on the rock", "Lemon grass on the rock", "A yellowish green plant on the rock", "Dogtail grass on the rock", "A small yellow flower on the rock"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "YO3iBRvqzac_1", "video_path": "YO3iBRvqzac.mp4", "subtitle_path": "YO3iBRvqzac_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1153.02, "view_count": 2119102}, {"video_id": "YO3iBRvqzac", "question": "The screen shows a rock, with a rack made of wooden sticks. A white cloth is spread on the rack, and on the white cloth, there's a pie. A pair of hands is holding a pie. Before mentioning Thin pita bread, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Pink pattern on the white cloth", "Blue pattern on the white cloth", "Red pattern on the white cloth", "Purple pattern on the white cloth", "Green pattern on the white cloth"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "YO3iBRvqzac_2", "video_path": "YO3iBRvqzac.mp4", "subtitle_path": "YO3iBRvqzac_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1153.02, "view_count": 2119102}, {"video_id": "7F2Gc9ec26s", "question": "On the screen, a man with thinning, already white hair is sitting in front of a mirror wearing a suit. He has a red tie on, and behind him is a wall colored white and light blue. There is also a painting hanging on the wall. 
After mentioning 'talk a lot,' what action does this man perform?", "question_wo_referring_query": "What action does he perform?", "candidates": ["The man adjusts his tie.", "The man looks to his right.", "The man takes a sip of water.", "The man's hands are placed in front of him.", "The man takes out a notepad."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "7F2Gc9ec26s_0", "video_path": "7F2Gc9ec26s.mp4", "subtitle_path": "7F2Gc9ec26s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2089.16, "view_count": 4466}, {"video_id": "7F2Gc9ec26s", "question": "The screen is divided into two parts. On the left, there is a man with white hair, sitting in a room with a painting on the wall. On the right, there is a man in a white room with brown hair, wearing a suit and striped shirt, a black tie, a black and white somewhat chaotic painting on the wall behind him, and a black sculpture. What action did the man on the right take after mentioning 'transformative'?", "question_wo_referring_query": "What action did the man on the right take?", "candidates": ["The man picked up a phone", "The man started reading a book", "The man picked up a notebook", "The man smiled", "The man used his two fingers to point down"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "7F2Gc9ec26s_1", "video_path": "7F2Gc9ec26s.mp4", "subtitle_path": "7F2Gc9ec26s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2089.16, "view_count": 4466}, {"video_id": "7F2Gc9ec26s", "question": "The screen is divided into two parts. The man on the left has white hair and is sitting in a room with a painting hanging on the wall. On the right is a man with brown hair, wearing a suit and a striped shirt with a black tie. Behind him, there is a black and white painting with some messy lines and a black sculpture. 
After mentioning 'thank you matt but i want you to know i', what action did the man on the right take?", "question_wo_referring_query": "What action did the man on the right take?", "candidates": ["The man revealed a clean set of teeth", "The man picked up a pen", "The man picked up a notebook", "The man looked to the right", "The man picked up the phone"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "7F2Gc9ec26s_2", "video_path": "7F2Gc9ec26s.mp4", "subtitle_path": "7F2Gc9ec26s_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2089.16, "view_count": 4466}, {"video_id": "fA-6YZ9PJxo", "question": "Which character appears first on screen?", "question_wo_referring_query": "Which character appears first on screen?", "candidates": ["The soldier in white with feathers in his helmet", "The soldier in blue with feathers in his helmet", "The soldier in yellow with feathers in his helmet", "The soldier in black with feathers in his helmet", "The soldier in olive with feathers in his helmet"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "fA-6YZ9PJxo_0", "video_path": "fA-6YZ9PJxo.mp4", "subtitle_path": "fA-6YZ9PJxo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 981.42, "view_count": 7879}, {"video_id": "fA-6YZ9PJxo", "question": "The video mentions some different ethnic groups. Which group is mentioned first?", "question_wo_referring_query": "The video mentions some different ethnic groups. 
Which group is mentioned first?", "candidates": ["Samnites", "Rome", "Gauls", "Egyptian", "Greeks"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "fA-6YZ9PJxo_1", "video_path": "fA-6YZ9PJxo.mp4", "subtitle_path": "fA-6YZ9PJxo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 981.42, "view_count": 7879}, {"video_id": "fA-6YZ9PJxo", "question": "In the scene, there are two animated soldiers. One soldier is wearing blue clothes with an olive-colored belt and a khaki helmet with feathers. Next to him is another soldier whose helmet does not have feathers. They are standing on grassy ground, with some trees and the sky in the background. There are shrubs beneath their feet, and they are holding spears and shields. After these two soldiers appear, who is the first character to make an entrance?", "question_wo_referring_query": "Who is the first character to make an entrance?", "candidates": ["The person wearing silver armor and holding a green shield", "The person wearing silver armor and holding a blue shield", "The person wearing silver armor and holding an olive-colored shield", "The soldier wearing silver armor and holding a red shield", "The person wearing silver armor and holding a yellow shield"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "fA-6YZ9PJxo_2", "video_path": "fA-6YZ9PJxo.mp4", "subtitle_path": "fA-6YZ9PJxo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 981.42, "view_count": 7879}, {"video_id": "yWOsyFAUS4g", "question": "On the screen, there is a tabby cat lying on a white bed sheet. The room has green walls, and there are some blurry elements in the background. 
What happens after the cat lies down on the bed?", "question_wo_referring_query": "What happens after the cat lies down on the bed?", "candidates": ["The cat stands up", "The cat begins to stretch lazily", "The cat jumps off the bed", "A hand starts to pet the cat's neck", "The cat moves closer to the mirror"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "yWOsyFAUS4g_0", "video_path": "yWOsyFAUS4g.mp4", "subtitle_path": "yWOsyFAUS4g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1023.22, "view_count": 248099}, {"video_id": "yWOsyFAUS4g", "question": "The screen shows a woman wearing a white top standing in an open-style kitchen. Behind her, there is a stove and a wall that is both white and black. There are many colorful pictures on the white wall. On the table in front of her, there are pots and a wooden cutting board. On the wooden board, there is a bowl. The woman is speaking while holding a kitchen knife in one hand and placing the other hand in a stainless steel basin. What happened after she picked up the kitchen knife?", "question_wo_referring_query": "What happened?", "candidates": ["She started pouring water", "She picked up a seasoning bottle", "She started stir-frying vegetables", "She started cutting something", "She started mixing the container"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "yWOsyFAUS4g_1", "video_path": "yWOsyFAUS4g.mp4", "subtitle_path": "yWOsyFAUS4g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1023.22, "view_count": 248099}, {"video_id": "yWOsyFAUS4g", "question": "The scene shows an open kitchen with white and black walls. There are many colorful pictures on the white wall, while the black wall has a coffee machine and a stove. There is also a range hood and black cabinets. 
On the black countertop in front of the mirror, there is a cutting board with a blue knife and a few vegetables. In the corner, there's a box of eggs, and on the counter, there's a stainless steel milk pot. What happens next?", "question_wo_referring_query": "What happens next?", "candidates": ["A woman wearing green overalls and glasses starts chopping vegetables.", "A woman wearing pink overalls and glasses starts chopping vegetables.", "A woman wearing black overalls and glasses starts chopping vegetables.", "A woman wearing white overalls and glasses starts chopping vegetables.", "A woman wearing blue overalls and glasses starts chopping vegetables."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "yWOsyFAUS4g_2", "video_path": "yWOsyFAUS4g.mp4", "subtitle_path": "yWOsyFAUS4g_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1023.22, "view_count": 248099}, {"video_id": "9ljMlfC3kUw", "question": "In the video, there is a pure black background with a man wearing a black T-shirt and with black hair speaking. To the right of the screen are an American flag and a flag of Turkmenistan. What is the man in the video doing when he mentions 'there's only like a small diaspora of'?", "question_wo_referring_query": "What is the man in the video doing?", "candidates": ["The man opened his hands", "The man held a cup", "The man took out a microphone", "The man pointed in the direction of the flag", "The man touched his hair"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "9ljMlfC3kUw_0", "video_path": "9ljMlfC3kUw.mp4", "subtitle_path": "9ljMlfC3kUw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1915.88, "view_count": 557220}, {"video_id": "9ljMlfC3kUw", "question": "The screen shows a solid black background. 
A man in a black T-shirt and a woman with black hair wearing red clothes are standing together. In the top right corner, there is a white rectangular image with the text 'Ersari Tribe' written below it. What is the woman doing in the video when the phrase 'seljuk Empire the El saris were like the' is mentioned?", "question_wo_referring_query": "What is the woman doing in the video?", "candidates": ["The woman is raising both hands", "The woman is holding a flower", "The woman is looking at the man beside her", "The woman is pointing at the logo", "The woman is raising one hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "9ljMlfC3kUw_1", "video_path": "9ljMlfC3kUw.mp4", "subtitle_path": "9ljMlfC3kUw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1915.88, "view_count": 557220}, {"video_id": "9ljMlfC3kUw", "question": "In the scene, a man with curly hair and wearing a navy blue T-shirt is sitting in a room with white cabinets and a wooden door. When he says 'what's going on everybody I survived the,' what is he doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Pointing at the camera", "Raising one hand", "Putting his palms together", "Showing a victory gesture", "Clapping his hands together"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "9ljMlfC3kUw_2", "video_path": "9ljMlfC3kUw.mp4", "subtitle_path": "9ljMlfC3kUw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1915.88, "view_count": 557220}, {"video_id": "Q5fbU6slFtg", "question": "On the screen, there is a white background with a piece of white paper, a black pencil sharpener, and a white eraser. There is also a pair of hands. 
What did the pair of hands do when they first appeared?", "question_wo_referring_query": "What did the pair of hands do when they first appeared?", "candidates": ["Picked up the piece of white paper", "The pair of hands picked up the pen", "Placed it on the white desktop", "Picked up a bottle of ink", "Picked up the eraser"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "Q5fbU6slFtg_0", "video_path": "Q5fbU6slFtg.mp4", "subtitle_path": "Q5fbU6slFtg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1875.37, "view_count": 4825}, {"video_id": "Q5fbU6slFtg", "question": "The screen shows a woman wearing glasses and a deep blue outfit. Behind her is a window, and on the desk in front of the window, there are many crayons and pens. Her hair is tied up. What happened when she first appeared?", "question_wo_referring_query": "What happened when she first appeared?", "candidates": ["A white-framed pop-up appeared on the left side of the screen", "She picked up a pen", "She picked up a piece of paper", "The woman stood up and faced away from the camera", "She picked up a box of crayons"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "Q5fbU6slFtg_1", "video_path": "Q5fbU6slFtg.mp4", "subtitle_path": "Q5fbU6slFtg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1875.37, "view_count": 4825}, {"video_id": "Q5fbU6slFtg", "question": "The screen shows a white background, with a bird-like stone relief sculpture on the right. This sculpture is a work featuring a bird-headed human figure. It is wearing a long robe and has wings on its back. This image is related to the eagle-headed humans of Mesoamerican mythology. 
When the dark pencil first appears in the hand of that figure at the bottom of the stone carving, what happens?", "question_wo_referring_query": "When _____ happens, what occurs?", "candidates": ["Taking out a bottle of ink", "Picking up a pencil sharpener", "Picking up a color palette", "Picking up another yellow pencil", "Picking up an eraser"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "Q5fbU6slFtg_2", "video_path": "Q5fbU6slFtg.mp4", "subtitle_path": "Q5fbU6slFtg_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1875.37, "view_count": 4825}, {"video_id": "r1i6hMQbkx8", "question": "The screen shows a shop with a variety of cakes on a tray at the entrance. Each cake is placed in a different small box and covered with different colored sauces. They also have different shapes. What type of cake is closest to the camera?", "question_wo_referring_query": "What type of cake is closest to the camera?", "candidates": ["Cake with chocolate sauce", "Cake with strawberry and yellow sauce", "Cake with strawberry sauce", "Cake with white chocolate sauce", "Cake with strawberry and white sauce"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "r1i6hMQbkx8_0", "video_path": "r1i6hMQbkx8.mp4", "subtitle_path": "r1i6hMQbkx8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1603.4, "view_count": 119798}, {"video_id": "r1i6hMQbkx8", "question": "In the screen, there is a grassy field surrounded by a gray wooden fence, with some hills in the distance. Two women are standing in front of the gray wooden fence. On the right side of the frame, there is a wooden structure, and the plants on it have already withered. There are also some withered grasses around. Both women are standing on the grass, most of which has also withered. One woman has her hair tied up, while the other has her hair down. 
They are both wearing black clothes. Among them, which one is wearing a purple scarf?", "question_wo_referring_query": "Among them, which one is wearing a purple scarf?", "candidates": ["The woman wearing black clothes with her hair tied up", "The woman wearing black clothes with white pants", "The woman wearing black clothes with olive pants", "The woman wearing black clothes with her hair down", "The woman wearing black clothes with gray pants"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "r1i6hMQbkx8_1", "video_path": "r1i6hMQbkx8.mp4", "subtitle_path": "r1i6hMQbkx8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1603.4, "view_count": 119798}, {"video_id": "r1i6hMQbkx8", "question": "In the scene, there is a man wearing a bright red sweater. His hair is brown and he is speaking in front of a camera. Behind him, there is a black bookshelf made up of squares, with books, a yellow globe, a lamp, and several green plants placed on it. What item is placed together with the globe?", "question_wo_referring_query": "What item is placed together with the globe?", "candidates": ["Black book", "Red book", "Yellow book", "Green plant in a black pot", "Green plant in a white pot"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "r1i6hMQbkx8_2", "video_path": "r1i6hMQbkx8.mp4", "subtitle_path": "r1i6hMQbkx8_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1603.4, "view_count": 119798}, {"video_id": "H1cLtHtTM-A", "question": "On the screen, there is a white PPT. One hand is holding a piece of paper in front of the PPT background, while the other hand is pointing at content on the PPT. The paper has many boxes, some with shapes inside. The title on the white paper is 'Common pKa Values'. In the second row, middle box of the screen, there is a shape labeled with OH. 
When referring to 'oxygen and then lastly looking at that', what is the shape of this figure?", "question_wo_referring_query": "What is the shape of this figure?", "candidates": ["ellipse", "hexagon", "square", "triangle", "pentagon"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "H1cLtHtTM-A_0", "video_path": "H1cLtHtTM-A.mp4", "subtitle_path": "H1cLtHtTM-A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1885.82, "view_count": 11855}, {"video_id": "H1cLtHtTM-A", "question": "The screen shows a white PPT with an orange frame on the left. Inside the frame, there are some letters. A pair of hands is pointing at the orange frame. In the middle of the PPT, there is a piece of paper with the title 'ACS EXAM QUESTION' written on it. There are also four lines drawn on it. When mentioning 'you know if something had resonance it's,' what type are these four lines?", "question_wo_referring_query": "What type are these four lines?", "candidates": ["Rounded wave line", "S-shaped line", "Z-shaped line", "Sharp-curved line", "Straight line"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "H1cLtHtTM-A_1", "video_path": "H1cLtHtTM-A.mp4", "subtitle_path": "H1cLtHtTM-A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1885.82, "view_count": 11855}, {"video_id": "H1cLtHtTM-A", "question": "The screen shows a white PPT background with a pair of hands with painted nails pointing at the content inside. There is a piece of paper with a pointed line on it, and behind the line, there is an 'O' character. Above the character is a white PPT background with a shape on it. When mentioning 'has to gain a proton it doesn't have a,' the 'O' character is pressed. 
What is the shape enclosing a minus sign '-' on the upper right side of the character?", "question_wo_referring_query": "What is the shape enclosing a minus sign '-' on the upper right side of the 'O' character?", "candidates": ["Triangle", "Square", "Step Shape", "Hexagon", "Circle"], "topic_category": "KS-Knowledge-STEM", "question_category": "T2A", "level": "IntraMoment", "id": "H1cLtHtTM-A_2", "video_path": "H1cLtHtTM-A.mp4", "subtitle_path": "H1cLtHtTM-A_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1885.82, "view_count": 11855}, {"video_id": "3-6FG5eaiiU", "question": "The screen shows a man in a suit, with short black hair, reporting news in front of a camera. The man is wearing a red tie, and behind him, there is a white building with thick columns that have patterns on them. When the phrase 'and in the White House they' be happy uh' is mentioned, what is something not present in the screen?", "question_wo_referring_query": "What is something not present in the screen?", "candidates": ["earring", "white shirt", "black camera", "fill light", "person with white hair"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "3-6FG5eaiiU_0", "video_path": "3-6FG5eaiiU.mp4", "subtitle_path": "3-6FG5eaiiU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3039.77, "view_count": 136470}, {"video_id": "3-6FG5eaiiU", "question": "The scene shows two women and a man sitting at a round table talking. The room has a vintage style, with blue walls, a yellow door, and a painting with an olive green frame. One woman is wearing black clothing with an image of a white palace behind her, while another woman is dressed in blue clothing. The man is wearing a suit and a wristwatch. 
When mentioning 'theual folks who are you know basically', what object is not present on the screen?", "question_wo_referring_query": "What object is not present on the screen?", "candidates": ["Earrings", "Purple inner shirt", "A white cup with a black handle", "Pearl necklace", "Gold hairpin"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "3-6FG5eaiiU_1", "video_path": "3-6FG5eaiiU.mp4", "subtitle_path": "3-6FG5eaiiU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3039.77, "view_count": 136470}, {"video_id": "3-6FG5eaiiU", "question": "The screen shows a dark-skinned lady with black hair, sitting in front of a camera reporting news, with a white palace building in the background. She is wearing black clothes and a purple inner shirt. When she mentions 'Union Address tonight for more on what', what is not present in the scene?", "question_wo_referring_query": "what is not present in the scene?", "candidates": ["silver necklace", "white clouds", "blue sky", "earrings", "green trees"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "3-6FG5eaiiU_2", "video_path": "3-6FG5eaiiU.mp4", "subtitle_path": "3-6FG5eaiiU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 3039.77, "view_count": 136470}, {"video_id": "QOHpS66pheQ", "question": "The screen is divided into two sections. In the left section, there is a news broadcast with a woman in a rose-red outfit and a man in a suit. The background behind them features some tall buildings, with green and yellow colors. In the right section, there's a woman with shoulder-length hair wearing a gray coat with a black top underneath. The background behind her is somewhat blurry, and she's in front of a building. 
What object is missing from the screen?", "question_wo_referring_query": "What object is missing from the screen?", "candidates": ["black mouse", "white keyboard", "black-rimmed glasses", "necklace", "white mouse"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "QOHpS66pheQ_0", "video_path": "QOHpS66pheQ.mp4", "subtitle_path": "QOHpS66pheQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1318.55, "view_count": 35801}, {"video_id": "QOHpS66pheQ", "question": "There are two split screens in the video, the left split screen is in a news broadcast room, with a woman in a rose-red dress and a man in a suit. The background behind them consists of some high-rise buildings with green and yellow colors. The right split screen shows a woman wearing black-framed glasses in front of a hall with a screen. The hall also has some seats, and the woman is wearing a coffee-colored inner shirt and a suit jacket. What object is not present in the video?", "question_wo_referring_query": "What object is not present in the video?", "candidates": ["an earring", "a silver necklace", "a button", "a pen", "a hidden blue floral tie"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "QOHpS66pheQ_1", "video_path": "QOHpS66pheQ.mp4", "subtitle_path": "QOHpS66pheQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1318.55, "view_count": 35801}, {"video_id": "QOHpS66pheQ", "question": "The woman on the screen is in a large hall with lots of glass. There are some patterns and letters on the glass behind her, and there is also yellow light on the glass. Beside the woman is a white-framed screen which shows a map. The woman is talking based on the map. She is wearing a patterned dress with dark green sleeves. 
What object is not present in the scene?", "question_wo_referring_query": "What object is not present in the scene?", "candidates": ["Earrings", "Green map block", "High heels", "Document bag", "Blue stripe"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "QOHpS66pheQ_2", "video_path": "QOHpS66pheQ.mp4", "subtitle_path": "QOHpS66pheQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1318.55, "view_count": 35801}, {"video_id": "fzBmDMQYmTI", "question": "The screen shows a somewhat dimly lit dining room with a man wearing a gray hat and a tan T-shirt standing in front of a mirror. He has his eyes downcast, and there is a plate of food in front of him. Behind him is a shelf, a green wreath, and a Christmas tree. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He is holding a leaf", "He is holding a french thistle", "He is holding a french fry", "He is holding a bag of ketchup", "He is holding a bag of french fries"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "fzBmDMQYmTI_0", "video_path": "fzBmDMQYmTI.mp4", "subtitle_path": "fzBmDMQYmTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.18, "view_count": 130827}, {"video_id": "fzBmDMQYmTI", "question": "In the video, there are two men standing together in a square, with two shopping malls in the background. One of the malls has a green pillar and yellow lights inside. There are a few steps and some scattered pedestrians not far from the two men. One man is wearing a white shirt, and the other is wearing a grey polo shirt and light yellow shorts. 
What are they doing?", "question_wo_referring_query": "What are the two men doing?", "candidates": ["Hugging", "Shaking hands", "Taking a photo", "Clapping", "Chatting"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "fzBmDMQYmTI_1", "video_path": "fzBmDMQYmTI.mp4", "subtitle_path": "fzBmDMQYmTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.18, "view_count": 130827}, {"video_id": "fzBmDMQYmTI", "question": "The scene is at a public square at night. The buildings around are illuminated with yellow lights. There is a man in white clothes and another man wearing a hat and a striped T-shirt in front of the camera. There are a few other people behind them. The man in white also has a white necklace, and the man in the striped shirt has tattoos on his arms. What are these two doing?", "question_wo_referring_query": "What are these two people doing?", "candidates": ["Taking photos", "Shaking hands", "Clapping", "Hugging", "Chatting"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "fzBmDMQYmTI_2", "video_path": "fzBmDMQYmTI.mp4", "subtitle_path": "fzBmDMQYmTI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 915.18, "view_count": 130827}, {"video_id": "_TedFmvfCYo", "question": "In the video, there is a black car driving on a gray road. On both sides of the road, there are short buildings, some with red roofs and some with gray rooftops. 
What happens to the car when 'how far would $10,000 get us, in one of them' is mentioned?", "question_wo_referring_query": "What changes occur to the car?", "candidates": ["The car drives into a mountainous area with a castle.", "The car drives into a field of wheat.", "The car drives into a location with a lake.", "The car drives into a green grassy area with large trees on both sides of the road.", "The car drives into a city with tall buildings."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "_TedFmvfCYo_0", "video_path": "_TedFmvfCYo.mp4", "subtitle_path": "_TedFmvfCYo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2966.18, "view_count": 6776317}, {"video_id": "_TedFmvfCYo", "question": "In the scene, two men are sitting in the back seat of a car, with green trees visible outside the car window. One man is wearing a black hat and a patterned tank top, and he also has a scarf around his neck, with one of his arms resting on the other man's shoulder. The other man is dressed in a round-neck T-shirt. Both men have black skin and are wearing seat belts. When the phrase 'Ohh.. Ok lads, we're here now.' is mentioned, what change occurs to the man wearing the black hat?", "question_wo_referring_query": "In the scene, two men are sitting in the back seat of a car, with green trees visible outside the car window. One man is wearing a black hat and a patterned tank top, and he also has a scarf around his neck, with one of his arms resting on the other man's shoulder. The other man is dressed in a round-neck T-shirt. Both men have black skin and are wearing seat belts. When the phrase 'Ohh.. Ok lads, we're here now.' 
is mentioned, what change occurs to the man wearing the black hat?", "candidates": ["The man changed into a checkered shirt", "The man changed into a black shirt", "The man changed into a black T-shirt", "The man changed into a white shirt", "The man changed into a uniform"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "_TedFmvfCYo_1", "video_path": "_TedFmvfCYo.mp4", "subtitle_path": "_TedFmvfCYo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2966.18, "view_count": 6776317}, {"video_id": "_TedFmvfCYo", "question": "In the scene, four men are standing shoulder to shoulder. They are under a white sunshade and in front of a white building with sunlight shining on the ground. One man is wearing pink clothes, another is in a black coat, one man is wearing a black hat and a patterned sleeveless jacket, and the last man is wearing a white T-shirt with a baseball cap. When the phrase \"So far they've treated me like\" is mentioned, what changes occur to the man wearing the black hat and the patterned sleeveless jacket?", "question_wo_referring_query": "In the scene, four men are standing shoulder to shoulder. They are under a white sunshade and in front of a white building with sunlight shining on the ground. One man is wearing pink clothes, another is in a black coat, one man is wearing a black hat and a patterned sleeveless jacket, and the last man is wearing a white T-shirt with a baseball cap. 
When the phrase \"So far they've treated me like\" is mentioned, what changes occur to the man wearing the black hat and the patterned sleeveless jacket?", "candidates": ["The man changed into a blue shirt", "The man changed into a striped shirt", "The man changed into a white shirt", "The man changed into a white shirt", "The man changed into a black T-shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "_TedFmvfCYo_2", "video_path": "_TedFmvfCYo.mp4", "subtitle_path": "_TedFmvfCYo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2966.18, "view_count": 6776317}, {"video_id": "Iez6JSb5ndw", "question": "The screen shows a PPT background in military green, with two pictures in the background. One is a drawing of a pressure-assisted artillery design with some annotations, and the other is a black-and-white photograph of a real object. Which subtitles did these two images appear with?", "question_wo_referring_query": "Which subtitles did these two images appear with?", "candidates": ["The data for a rocket with high-explosives were as follows", "it weighed 34.2 kg, the propellant", "Yet, when we think of Nebelwerfer we mostly think of the later versions namely the rocket", "used the spin of the rocket", "Soviet Katyusha, this was due to the fact that the Germans"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "Iez6JSb5ndw_0", "video_path": "Iez6JSb5ndw.mp4", "subtitle_path": "Iez6JSb5ndw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1049.9, "view_count": 604975}, {"video_id": "Iez6JSb5ndw", "question": "The screen shows a military green background, within which there is a dark-colored, drawn artillery vehicle. This artillery piece has a very long barrel, and the wheels have structures composed of cylindrical and rectangular shapes. 
At what point do this artillery vehicle and subtitles appear simultaneously?", "question_wo_referring_query": "At what point do this artillery vehicle and subtitles appear simultaneously?", "candidates": ["used the spin of the rocket", "Yet, when we think of Nebelwerfer we mostly think of the later versions namely the rocket", "it weighed 34.2 kg, the propellant", "large launchers just used frames.", "Soviet Katyusha, this was due to the fact that the Germans"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "Iez6JSb5ndw_1", "video_path": "Iez6JSb5ndw.mp4", "subtitle_path": "Iez6JSb5ndw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1049.9, "view_count": 604975}, {"video_id": "Iez6JSb5ndw", "question": "In the video, there is a green-colored PPT background with an image composed of three circles lined up inside. There is also text content written with the alphabet inside gray and red backgrounds. What text appears simultaneously with the cloud image in the left circle?", "question_wo_referring_query": "What text appears simultaneously with the cloud image in the left circle?", "candidates": ["used the spin of the rocket", "large launchers just used frames.", "it weighed 34.2 kg, the propellant", "Yet, when we think of Nebelwerfer we mostly think of the later versions namely the rocket", "Therefore, multiple targets must be engaged quickly one after the other or from different"], "topic_category": "KH-Knowledge-History", "question_category": "TOS", "level": "L2-Relation", "id": "Iez6JSb5ndw_2", "video_path": "Iez6JSb5ndw.mp4", "subtitle_path": "Iez6JSb5ndw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1049.9, "view_count": 604975}, {"video_id": "KGRrgXAPw1Q", "question": "The person on the screen is a man with short hair wearing a white t-shirt. 
He is in a room covered with photos, and in the top right corner, there is an image of the Somali flag. In which other scene does the Somali flag appear in the video?", "question_wo_referring_query": "In which other scene does the Somali flag appear in the video?", "candidates": ["In a room covered with photos, in the top right corner of an image, with a man wearing a white t-shirt and white socks, on the background wall behind him.", "In a room covered with photos, in the top right corner of an image, with a man wearing a green t-shirt and white socks, on the background wall behind him.", "In a room covered with photos, in the top right corner of an image, with a man wearing a yellow t-shirt and white socks, on the background wall behind him.", "In a room covered with photos, in the top right corner of an image, with a man wearing a black t-shirt and white socks, on the background wall behind him.", "In a room covered with photos, in the top right corner of an image, with a man wearing a blue t-shirt and white socks, on the background wall behind him."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "KGRrgXAPw1Q_0", "video_path": "KGRrgXAPw1Q.mp4", "subtitle_path": "KGRrgXAPw1Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.16, "view_count": 123478}, {"video_id": "KGRrgXAPw1Q", "question": "In the video, two men are sitting on a beige leather sofa in a room. There is a coffee table in front of them with many items on it, including a paper box, a cup, and many paper-based miscellaneous items. One of the men is wearing a light purple long-sleeved T-shirt, and the other is wearing a white T-shirt. 
Where else has the man with brown hair and the purple T-shirt appeared?", "question_wo_referring_query": "Where else has the man with brown hair and the purple T-shirt appeared?", "candidates": ["In a room full of photos, a man wearing a white T-shirt is explaining a screen in the top right corner, which shows the man with the purple T-shirt climbing a mountain.", "In a room full of photos, a man wearing a white T-shirt is explaining a screen in the top right corner, which shows the man with the purple T-shirt standing in a room full of photos.", "In a room full of photos, a man wearing a white T-shirt is explaining a screen in the top right corner, which shows the man with the purple T-shirt standing in a plaza.", "In a room full of photos, a man wearing a white T-shirt is explaining a screen in the top right corner, which shows the man with the purple T-shirt at the beach.", "In a room full of photos, a man wearing a white T-shirt is explaining a screen in the top right corner, which shows the man with the purple T-shirt cooking in a kitchen."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "KGRrgXAPw1Q_1", "video_path": "KGRrgXAPw1Q.mp4", "subtitle_path": "KGRrgXAPw1Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.16, "view_count": 123478}, {"video_id": "KGRrgXAPw1Q", "question": "In the video, there is a man standing in a room filled with photos. On the photo wall, there's also a world map hanging. The man is speaking, he has black hair, and is wearing a white T-shirt. He's holding a white mug with blue and green floral patterns. 
In which other scene does this mug appear?", "question_wo_referring_query": "In which other scene does this mug appear?", "candidates": ["In a room with a sofa and a coffee table, two men are wearing green clothes and red and white headscarves", "In a room with a sofa and a coffee table, two men are wearing black clothes and red and white headscarves", "In a room with a sofa and a coffee table, two men are wearing yellow clothes and red and white headscarves", "In a room with a sofa and a coffee table, two men are wearing olive clothes and red and white headscarves", "In a room with a sofa and a coffee table, two men are wearing white clothes and red and white headscarves"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "KGRrgXAPw1Q_2", "video_path": "KGRrgXAPw1Q.mp4", "subtitle_path": "KGRrgXAPw1Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 913.16, "view_count": 123478}, {"video_id": "A7cWtICGgHE", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene of the Earth with the Northern Lights is shown, followed by a scene of a penguin, and lastly a scene of a man in a blue shirt sitting in a room.", "First, a scene of a man in a blue shirt sitting in a room is shown, followed by a scene of the Earth with the Northern Lights, and lastly a scene of a penguin.", "First, a scene of the Earth with the Northern Lights is shown, followed by a scene of a man in a blue shirt sitting in a room, and lastly a scene of a penguin.", "First, a scene of a penguin is shown, followed by a scene of a man in a blue shirt sitting in a room, and lastly a scene of the Earth with the Northern Lights.", "First, a scene of a man in a blue shirt sitting in a room is shown, followed by a scene of a penguin, and then a scene of the Earth with the Northern Lights."], 
"topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "A7cWtICGgHE_0", "video_path": "A7cWtICGgHE.mp4", "subtitle_path": "A7cWtICGgHE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1483.82, "view_count": 1306802}, {"video_id": "A7cWtICGgHE", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First an image of some geese on yellow soil at the water's edge, followed by an image of a dark blue map with yellow-green blocks, and lastly a scene of waves crashing against rocks", "First an image of a dark blue map with yellow-green blocks, followed by an image of some geese on yellow soil at the water's edge, and lastly a scene of waves crashing against rocks", "First an image of a dark blue map with yellow-green blocks, followed by a scene of waves crashing against rocks, and lastly an image of some geese on yellow soil at the water's edge", "First an image of some geese on yellow soil at the water's edge, followed by a scene of waves crashing against rocks, and lastly an image of a dark blue map with yellow-green blocks", "First a scene of waves crashing against rocks, followed by an image of some geese on yellow soil at the water's edge, and lastly an image of a dark blue map with yellow-green blocks"], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "A7cWtICGgHE_1", "video_path": "A7cWtICGgHE.mp4", "subtitle_path": "A7cWtICGgHE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1483.82, "view_count": 1306802}, {"video_id": "A7cWtICGgHE", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a scene with a drawing made of lines is shown. 
Next, many penguins standing on the shore are shown. Finally, a group of penguins resting and grooming is shown.", "First, a group of penguins resting and grooming is shown. Then a scene with a drawing made of lines is shown. Finally, many penguins standing on the shore are shown.", "First, a scene with a drawing made of lines is shown. In the scene, there are many boats, each with people on board. Then a group of penguins resting and grooming is shown. Finally, many penguins standing on the shore are shown.", "First, many penguins standing on the shore are shown. Then a scene with a drawing made of lines is shown. Finally, a group of penguins resting and grooming is shown.", "First, a group of penguins resting and grooming is shown. Next, many penguins standing on the shore are shown. Finally, a scene with a drawing made of lines is shown."], "topic_category": "KG-Knowledge-Geography", "question_category": "SSS", "level": "L2-Relation", "id": "A7cWtICGgHE_2", "video_path": "A7cWtICGgHE.mp4", "subtitle_path": "A7cWtICGgHE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1483.82, "view_count": 1306802}, {"video_id": "mFfJ_-XadbQ", "question": "In the scene, there is a man wearing a yellow T-shirt in a room filled with photos. The room has a white door and a world map. The man has auburn hair and is holding a small microphone. 
Before mentioning 'Senegal for those of you that don't know', what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Flag of Venezuela", "Flag of Senegal", "Flag of Canada", "Flag of the United States", "Flag of France"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "mFfJ_-XadbQ_0", "video_path": "mFfJ_-XadbQ.mp4", "subtitle_path": "mFfJ_-XadbQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1705.16, "view_count": 821227}, {"video_id": "mFfJ_-XadbQ", "question": "The screen shows a gigantic sculpture standing in a large circular plaza. The surrounding buildings and roads appear very small in comparison. The sculpture depicts a male figure holding a short-haired female figure. In the other hand of the male figure, there is a child. They are all looking in the same direction. This sculpture originates from Senegal and is called 'African Renaissance'. After mentioning 'symbolizing defiance tradition and the,' what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears after 'symbolizing defiance tradition and the'?", "candidates": ["USA Flag", "Canada Flag", "North Korea Flag", "Venezuela Flag", "Germany Flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "mFfJ_-XadbQ_1", "video_path": "mFfJ_-XadbQ.mp4", "subtitle_path": "mFfJ_-XadbQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1705.16, "view_count": 821227}, {"video_id": "mFfJ_-XadbQ", "question": "The screen shows a man in a room filled with photos. The room has a white wooden door and a world map. The man is talking to the camera. He is wearing a khaki colored T-shirt. He has brown hair and some stubble on his chin. On the left side of the screen, there is a picture of the North Korean flag. 
When 'prices' are mentioned, what is the first object that appears?", "question_wo_referring_query": "What is the first object that appears?", "candidates": ["Statue of Kim Jong Il", "Statue of the President of Gabon", "Flag of Venezuela", "Flag of Gabon", "French flag"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3O", "level": "L2-Relation", "id": "mFfJ_-XadbQ_2", "video_path": "mFfJ_-XadbQ.mp4", "subtitle_path": "mFfJ_-XadbQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1705.16, "view_count": 821227}, {"video_id": "Hn8XXPl1vjU", "question": "The screen shows a pure black background, and a man with black hair wearing a red POLO shirt is speaking in front of the camera. What happened after he mentioned 'all right we've reached Poland Europe'?", "question_wo_referring_query": "What happened after he mentioned 'all right we've reached Poland Europe'?", "candidates": ["A man wearing green clothes appeared standing in the center of the screen, and the man in red clothes is hitting him with his elbow.", "A man wearing yellow clothes appeared standing in the center of the screen, and the man in red clothes is hitting him with his elbow.", "A man wearing black clothes appeared standing in the center of the screen, and the man in red clothes is hitting him with his elbow.", "A man wearing gray clothes appeared standing in the center of the screen, and the man in red clothes is hitting him with his elbow.", "A man wearing white clothes appeared standing in the center of the screen, and the man in red clothes is hitting him with his elbow."], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Hn8XXPl1vjU_0", "video_path": "Hn8XXPl1vjU.mp4", "subtitle_path": "Hn8XXPl1vjU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1149.32, "view_count": 3619334}, {"video_id": "Hn8XXPl1vjU", "question": "In the video, there's a man with 
very short hair wearing a gray shirt, and another man with black hair wearing a red polo shirt. The man in the gray shirt is grabbing the collar of the man in the red polo shirt. To the left of the screen, there's an image of a bottle, which is a glass bottle with a black cap containing olive-colored liquid. There are letters on the packaging. When 'no you haven't that was not cute whiskey' is mentioned, what does the man in the gray shirt do?", "question_wo_referring_query": "What does the man in the gray shirt do?", "candidates": ["The man raises one hand", "Starts drinking water", "Raises both hands", "Takes out a cup", "Shakes hands with the other man"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Hn8XXPl1vjU_1", "video_path": "Hn8XXPl1vjU.mp4", "subtitle_path": "Hn8XXPl1vjU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1149.32, "view_count": 3619334}, {"video_id": "Hn8XXPl1vjU", "question": "The screen shows a black background, with a man in a red polo shirt speaking. On the left of the screen, there might be an image of the Indian flag. After mentioning 'mentioned but that would take way too long', what did the man do?", "question_wo_referring_query": "What did the man do?", "candidates": ["Took out a cup", "The man raised one hand", "The man pointed at the flag", "The man looked to his right", "The man spread his arms"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "Hn8XXPl1vjU_2", "video_path": "Hn8XXPl1vjU.mp4", "subtitle_path": "Hn8XXPl1vjU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1149.32, "view_count": 3619334}, {"video_id": "IWoq9N4fL9k", "question": "The man on the screen is wearing a gray and white lab coat, with black hair and glasses. He has some dark spots on his skin, and the background is a solid black color. 
On the right side of the screen, there is a string of white letters that reads 'Thai ambassador'. When the 'on the us and britain' is mentioned, what is the man in the screen doing?", "question_wo_referring_query": "What is the man in the screen doing?", "candidates": ["Holding a marker cup", "Making a phone call", "Introducing a cup", "Looking at a piece of paper in his hand", "Taking notes"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "IWoq9N4fL9k_0", "video_path": "IWoq9N4fL9k.mp4", "subtitle_path": "IWoq9N4fL9k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2007.68, "view_count": 866444}, {"video_id": "IWoq9N4fL9k", "question": "On the screen, there is a man in a white T-shirt and a woman in a yellow shirt. Both the man and the woman have brown hair. They are talking to the camera, and on the right side of the screen, it says #8(Asia). The background behind them is pure black. What is the man on the screen doing when 'industrialized country making the eighth' is mentioned?", "question_wo_referring_query": "What is the man on the screen doing?", "candidates": ["The man is pouring water", "The man is introducing a cup", "The man is holding a mug", "The man is drinking water", "The man is shaking hands with the woman"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "IWoq9N4fL9k_1", "video_path": "IWoq9N4fL9k.mp4", "subtitle_path": "IWoq9N4fL9k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2007.68, "view_count": 866444}, {"video_id": "IWoq9N4fL9k", "question": "In the screen, there is a man in a gray T-shirt with brown hair. He is standing in front of a completely black background. On the left side of the screen, there is a picture of an athlete playing badminton with the text 'Ranked 4th in BWF' written below it. 
When mentioning 'favorite sports, I like badminton a lot,' what is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["The man has one hand on his chest", "The man is holding a mug", "The man is pouring water", "The man is looking at a paper in his hand", "The man has both hands on his chest"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "IWoq9N4fL9k_2", "video_path": "IWoq9N4fL9k.mp4", "subtitle_path": "IWoq9N4fL9k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2007.68, "view_count": 866444}, {"video_id": "ZsZLPiypPPE", "question": "The man in the video is standing in front of a mirror, with a backdrop of the Earth behind him. There are two small shelves on the wall holding some items. The man is wearing a grey lab coat and has short black hair. What did this man do when he first appeared?", "question_wo_referring_query": "What did this man do when he first appeared?", "candidates": ["The man mentioned black holes", "The man mentioned the theory of geocentrism", "The man mentioned quantum mechanics", "The man mentioned the theory of relativity", "The man mentioned the theory of atoms"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "ZsZLPiypPPE_0", "video_path": "ZsZLPiypPPE.mp4", "subtitle_path": "ZsZLPiypPPE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.68, "view_count": 258204}, {"video_id": "ZsZLPiypPPE", "question": "The screen shows the social media profile of a person named D.Marble. It includes his profile picture, indicating he is a black male. His profile also features some videos he has posted. Above his profile picture, there is a photo of him in which he is wearing black clothes. 
What did this man do the first time he appeared in the photo with the blue background above his profile picture?", "question_wo_referring_query": "What did this man do the first time he appeared in the photo with the blue background above his profile picture?", "candidates": ["The man is talking with a big grin while wearing sunglasses.", "The man is listening to music with earphones and wearing sunglasses.", "The man is sitting on a boat wearing sunglasses.", "The man is standing by the sea smiling while wearing sunglasses.", "The man is holding a coconut while wearing sunglasses."], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "ZsZLPiypPPE_1", "video_path": "ZsZLPiypPPE.mp4", "subtitle_path": "ZsZLPiypPPE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.68, "view_count": 258204}, {"video_id": "ZsZLPiypPPE", "question": "In the video, there is a child wearing a white shirt with clouds in a blue sky and rays of sunlight behind him. The child's hair is brown. There is also a title on the screen that reads 'THE REASON YOU CANNOT FEEL THE EARTH SPINNING IS BECAUSE IT IS NOT SPINNING.' What did this child do the first time they appeared?", "question_wo_referring_query": "What did this child do the first time they appeared?", "candidates": ["He was holding a globe model", "He was holding a model of Jupiter", "He was holding a model of the sun", "He was holding a model of Venus", "He was holding a model of Mercury"], "topic_category": "KS-Knowledge-STEM", "question_category": "O2E", "level": "IntraMoment", "id": "ZsZLPiypPPE_2", "video_path": "ZsZLPiypPPE.mp4", "subtitle_path": "ZsZLPiypPPE_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 926.68, "view_count": 258204}, {"video_id": "CwK5ukue0Pw", "question": "The video shows an ink painting of a black-furred cat with white eyebrows on a grassy field. It has some patterns on its legs and body. 
The cat's pupils are slit-shaped, and it is hunting something in the grass. What animal did the cat catch?", "question_wo_referring_query": "What animal did the cat catch?", "candidates": ["Mouse", "Bird", "Cricket", "Frog", "Rabbit"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "CwK5ukue0Pw_0", "video_path": "CwK5ukue0Pw.mp4", "subtitle_path": "CwK5ukue0Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1462.46, "view_count": 48095}, {"video_id": "CwK5ukue0Pw", "question": "The scene shows an animated segment. In the segment, there's a window with a pink frame. The window has bars, and on the pink window sill, there is a piece of blue fabric and a blue bowl. The wall behind the window sill is blue with white floral patterns. Outside the window, there is a thatched roof house with a yellow roof. Who is looking out the window?", "question_wo_referring_query": "Who is looking out the window?", "candidates": ["A calico cat", "A white rabbit", "A black cat", "A white cat", "A tabby cat"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "CwK5ukue0Pw_1", "video_path": "CwK5ukue0Pw.mp4", "subtitle_path": "CwK5ukue0Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1462.46, "view_count": 48095}, {"video_id": "CwK5ukue0Pw", "question": "In the image, there is a painting that depicts a table covered with a white cloth. On the table, there are some yellow objects. A person wearing yellow and black striped pants is sitting beside the table. The person is also wearing a red shirt and black shoes. There is a red cloth on the floor below the pants, and there is a pair of black shoes under the table as well. 
Who is crouching under the table?", "question_wo_referring_query": "Who is crouching under the table?", "candidates": ["A small black dog", "A tortoiseshell cat", "An orange cat", "A calico cat", "A small yellow dog"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "CwK5ukue0Pw_2", "video_path": "CwK5ukue0Pw.mp4", "subtitle_path": "CwK5ukue0Pw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1462.46, "view_count": 48095}, {"video_id": "ENuzZNYbJgU", "question": "In the screen, there is a pure black background, and a soldier wearing a hat and mask is in front of the camera. He is dressed in a camouflage uniform, with two bulletproof vests in front of him, wearing a belt of the same color. On the right side of the screen, there is a red and black ribbon with a plaque on it. The plaque has figures and some letters on it. When mentioned 'for his heroism above and beyond the', what is the shape of this plaque?", "question_wo_referring_query": "What is the shape of this plaque?", "candidates": ["rectangle", "triangle", "oval", "square", "circle"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "ENuzZNYbJgU_0", "video_path": "ENuzZNYbJgU.mp4", "subtitle_path": "ENuzZNYbJgU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2540.17, "view_count": 180579}, {"video_id": "ENuzZNYbJgU", "question": "In the animation with a gray background on the screen, there are some handcrafted knives and tools, as well as some miscellaneous items like tape. There is a blue figure in front of the mirror, with an octagon in the middle. Hanging below it is a golden object shaped like a bird with wings, and below that is a golden pentagram with some green on it. In the very middle, there is a character. 
When selfless heroism and sacrifice until are mentioned, what shape is the white pattern in the octagon?", "question_wo_referring_query": "What shape is the white pattern in the octagon?", "candidates": ["Triangle", "Heart-shaped", "Square", "Pentagram", "Circle"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "ENuzZNYbJgU_1", "video_path": "ENuzZNYbJgU.mp4", "subtitle_path": "ENuzZNYbJgU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2540.17, "view_count": 180579}, {"video_id": "ENuzZNYbJgU", "question": "In the scene, there are four soldiers on the battlefield. They are wearing military green bulletproof vests and camouflage clothing. The ground beneath them is tan-colored terrain. Behind them are two long battle machines, and each of them is holding a gun. One character has black hair, with the darkest complexion. What type of mask is he wearing when referenced as a Pathfinder nine years ago?", "question_wo_referring_query": "What type of mask is he wearing?", "candidates": ["Red with black stripes", "Red with tan checkered pattern", "Red with white stripes", "Red with tan stripes", "Red with green stripes"], "topic_category": "KH-Knowledge-History", "question_category": "T2A", "level": "IntraMoment", "id": "ENuzZNYbJgU_2", "video_path": "ENuzZNYbJgU.mp4", "subtitle_path": "ENuzZNYbJgU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2540.17, "view_count": 180579}, {"video_id": "bMxSJAzZWMw", "question": "In the video, there is a man and a woman sitting in a travel van. Both the man and the woman have brown hair. The man is wearing a white T-shirt, and the woman is wearing a black T-shirt with glasses hanging on her clothes. Their seats are black with brown headrests. The top of the van is white, and it also has white curtains. 
What material are the seats in this travel van made of?", "question_wo_referring_query": "What material are the seats in this travel van made of?", "candidates": ["Sandpaper", "Velvet", "Fabric", "Wool", "Leather"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "bMxSJAzZWMw_0", "video_path": "bMxSJAzZWMw.mp4", "subtitle_path": "bMxSJAzZWMw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1224.09, "view_count": 417454}, {"video_id": "bMxSJAzZWMw", "question": "The screen shows a man wearing a white T-shirt. He is carrying a black backpack and is in a large white hall filled with lights. Beside him is a woman wearing a black vest, her clothes have sunglasses, and she has white nail polish and is holding a mobile phone. The woman has light-colored hair. Nearby, there is another woman with black hair wearing an orange dress. What type of hat is the man in front of the camera wearing?", "question_wo_referring_query": "What type of hat is the man in front of the camera wearing?", "candidates": ["Cowboy hat", "Beret", "Baseball cap", "Sun hat", "Fur hat"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "bMxSJAzZWMw_1", "video_path": "bMxSJAzZWMw.mp4", "subtitle_path": "bMxSJAzZWMw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1224.09, "view_count": 417454}, {"video_id": "bMxSJAzZWMw", "question": "The screen shows the upper half of a person wearing a white striped short-sleeve shirt, a yellow badge on the chest, and a jacket. Behind the person is a concrete road with buildings made of stone and concrete, many windows are visible, and there are two pedestrians wearing white clothes walking on the road. 
What type of jacket is this person wearing?", "question_wo_referring_query": "What type of jacket is this person wearing?", "candidates": ["Short-sleeve lab coat", "Woolen knight's armor", "Knight's armor", "Short-sleeve shirt jacket", "Blazer"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "bMxSJAzZWMw_2", "video_path": "bMxSJAzZWMw.mp4", "subtitle_path": "bMxSJAzZWMw_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1224.09, "view_count": 417454}, {"video_id": "ZLL6m6Uz7eM", "question": "In the video, the figure on the screen is Russia's leader Putin, and behind him are the flags of Russia and Ukraine. Putin is wearing a black suit with a striped tie and is speaking. When he mentions 'Putin claims Ukraine to be a historical...', what object is not present on the screen?", "question_wo_referring_query": ", what object is not present on the screen?", "candidates": ["Black striped tie", "White shirt", "Microphone", "Wristwatch", "Black background"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "ZLL6m6Uz7eM_0", "video_path": "ZLL6m6Uz7eM.mp4", "subtitle_path": "ZLL6m6Uz7eM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 909.76, "view_count": 14008}, {"video_id": "ZLL6m6Uz7eM", "question": "The screen is divided into two frames: one showing a building with flames and black smoke, and another showing Russian leader Putin giving a speech. The building is white with black-framed windows, and behind Putin is a yellow wall. He is wearing a black suit. 
When 'in 2022' is mentioned, what object is not present in the screen?", "question_wo_referring_query": "What object is not present in the screen?", "candidates": ["Car", "Pedestrian", "White wall", "Red tie", "Black tie with floral pattern"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "ZLL6m6Uz7eM_1", "video_path": "ZLL6m6Uz7eM.mp4", "subtitle_path": "ZLL6m6Uz7eM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 909.76, "view_count": 14008}, {"video_id": "ZLL6m6Uz7eM", "question": "The scene shows a soldier walking. In the distance, there is a hill covered with plants, most of which are already withered. There are also some military vehicles parked by the roadside. The soldiers are walking on a dirt road with grass on it, and there are white walls on the sides. The soldiers are carrying packs, some holding guns. What object is not present in the scene?", "question_wo_referring_query": ", what object is not present in the scene?", "candidates": ["Army green helmet", "Army green car", "Army green uniform", "Khaki slanted bag", "Camouflage uniform"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "ZLL6m6Uz7eM_2", "video_path": "ZLL6m6Uz7eM.mp4", "subtitle_path": "ZLL6m6Uz7eM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 909.76, "view_count": 14008}, {"video_id": "ZLVndSXE3Jk", "question": "Two men are having an online conversation. The man on the left is wearing a white shirt, a dark suit, and a tie, while the man on the right is wearing a dark top. There is a large bookshelf full of books behind the man, and a square frame is hanging next to the bookshelf. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["umbrella", "potted plant", "fan", "desk lamp", "glasses"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "ZLVndSXE3Jk_0", "video_path": "ZLVndSXE3Jk.mp4", "subtitle_path": "ZLVndSXE3Jk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.16, "view_count": 235108}, {"video_id": "ZLVndSXE3Jk", "question": "The man is wearing a dark upper garment, the bookshelf behind him is filled with books. There is a white door with a window on the left of the bookshelf, and a picture frame is placed next to the window on the right side of the bookshelf. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["necklace", "earphones", "desk lamp", "glasses", "potted plant"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "ZLVndSXE3Jk_1", "video_path": "ZLVndSXE3Jk.mp4", "subtitle_path": "ZLVndSXE3Jk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.16, "view_count": 235108}, {"video_id": "ZLVndSXE3Jk", "question": "In the studio, two gentlemen are interacting. The gentleman on the left is wearing a dark-colored suit and a white shirt with a tie, and has both hands on the table. The gentleman on the right is wearing glasses. A screen behind them shows tanks, flags, and soldiers. 
What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["desk lamp", "pen", "notebook", "flower pot", "umbrella"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "ZLVndSXE3Jk_2", "video_path": "ZLVndSXE3Jk.mp4", "subtitle_path": "ZLVndSXE3Jk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1362.16, "view_count": 235108}, {"video_id": "2kmb1ARg9m0", "question": "A fiery bird appears at the shallow edge of a lake, its feet on the ground beneath the water. The bird's coloration is primarily red, with the area around its neck being the most vivid. Waves ripple across the lake's surface. What is this fiery bird doing at the edge of the lake?", "question_wo_referring_query": "What is this fiery bird doing at the edge of the lake?", "candidates": ["Dipping its head into the water to look for food", "Using its beak to groom its feathers", "Splashing around in the water", "Flapping its wings", "Gazing up at the sky and calling out"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "2kmb1ARg9m0_0", "video_path": "2kmb1ARg9m0.mp4", "subtitle_path": "2kmb1ARg9m0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1449.48, "view_count": 249945}, {"video_id": "2kmb1ARg9m0", "question": "A tunnel appears beneath the ground, a caterpillar appears on the right side of the tunnel. Inside the tunnel, the width varies. Roots of plants are on the upper side of the tunnel. 
What is the caterpillar on the right side of the tunnel doing?", "question_wo_referring_query": "What is the caterpillar on the right side of the tunnel doing?", "candidates": ["arched its body", "standing", "curled up its body", "eating fruits", "eating leaves"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "2kmb1ARg9m0_1", "video_path": "2kmb1ARg9m0.mp4", "subtitle_path": "2kmb1ARg9m0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1449.48, "view_count": 249945}, {"video_id": "2kmb1ARg9m0", "question": "A piece of light-colored cloth is spread out on the table, with a small bowl in one corner of the cloth containing white granules. In the middle of the cloth, there is a glass bowl. A hand appears in the frame. What is the hand doing?", "question_wo_referring_query": ", what is the hand doing in the frame?", "candidates": ["Holding the light-colored cloth", "Holding fruits", "Holding some white granules", "Holding the glass bowl", "Holding the small bowl"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2E", "level": "IntraMoment", "id": "2kmb1ARg9m0_2", "video_path": "2kmb1ARg9m0.mp4", "subtitle_path": "2kmb1ARg9m0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1449.48, "view_count": 249945}, {"video_id": "ieKFYXzPgIo", "question": "A girl in a yellow short-sleeve shirt is sitting on a sofa. The girl is wearing glasses and holding a controller and a bottle. There are cushions and miscellaneous items on the sofa. There are posters on the wall behind her and fan-shaped decorations on the side wall. 
When the subtitle 'advocate forward the rights of others' appears, what change occurs to the girl's clothing?", "question_wo_referring_query": "What change occurs to the clothing of the girl sitting on the sofa?", "candidates": ["The yellow short-sleeve shirt changes to a jacket", "The yellow short-sleeve shirt changes to a sweater", "The yellow short-sleeve shirt changes to a blue short-sleeve shirt", "The yellow short-sleeve shirt changes to a lab coat", "The yellow short-sleeve shirt changes to a white short-sleeve shirt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "ieKFYXzPgIo_0", "video_path": "ieKFYXzPgIo.mp4", "subtitle_path": "ieKFYXzPgIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1636.0, "view_count": 447138}, {"video_id": "ieKFYXzPgIo", "question": "A girl in a yellow short-sleeve shirt is sitting on a sofa. She is wearing glasses and gloves. Her hands are spread out to the sides. There are cushions and miscellaneous items on the sofa. There are posters on the wall behind her, and there are fan-shaped decorations on the side wall. 
When the subtitle 'gonna use a brush because it's not' appears, what changes occur to the girl's hairstyle?", "question_wo_referring_query": "What changes occur to the hairstyle of the girl in the yellow shirt?", "candidates": ["The girl's hair is styled into twin buns.", "The girl ties her hair back into a ponytail.", "The girl's hair is braided into a single plait.", "The girl's hair is braided into two plaits.", "The girl's hair changes into a high ponytail."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "ieKFYXzPgIo_1", "video_path": "ieKFYXzPgIo.mp4", "subtitle_path": "ieKFYXzPgIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1636.0, "view_count": 447138}, {"video_id": "ieKFYXzPgIo", "question": "The girl in light-colored clothing is doing her hair with a device. The front part of the girl's hair is pink. The girl is wearing glasses. There is a potted plant near the girl. The shelf behind the girl has various items. The left wall of the shelf is pasted with a star design. 
When the subtitle 'and these silver hoops that we match all' appears, what change occurs to the girl's upper clothing?", "question_wo_referring_query": "What change occurs to the upper clothing of the girl who is doing her hair?", "candidates": ["The light-colored clothing changes to a black coat.", "The light-colored clothing changes to a hoodie.", "The light-colored clothing changes to a black hoodie.", "The light-colored clothing changes to a black short sleeve.", "The light-colored clothing changes to a denim jacket."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "ieKFYXzPgIo_2", "video_path": "ieKFYXzPgIo.mp4", "subtitle_path": "ieKFYXzPgIo_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1636.0, "view_count": 447138}, {"video_id": "Og2UveOeAVk", "question": "In the middle of the screen, there are two tank silhouette patterns with their barrels facing each other. The tank on the right has a vertical line on its top. There are five white circles surrounding the tank patterns, each containing images of gears, snowflakes, toys, etc. 
When white English text appears at the top of the screen and a white vertical finger image appears at the bottom, what change occurs to the left tank?", "question_wo_referring_query": "When white English text appears at the top of the screen and a white vertical finger image appears at the bottom, what change occurs to the left tank?", "candidates": ["The tank silhouette pattern appears colored", "The barrel of the tank silhouette pattern becomes longer", "The barrel of the tank silhouette pattern becomes thinner", "The barrel of the tank silhouette pattern becomes thicker", "The barrel of the tank silhouette pattern becomes shorter"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "Og2UveOeAVk_0", "video_path": "Og2UveOeAVk.mp4", "subtitle_path": "Og2UveOeAVk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1028.07, "view_count": 731154}, {"video_id": "Og2UveOeAVk", "question": "A tank appeared in the center of the screen. The outer circle of the wheels on the tank's tread is black. The tank's left side has three bullets wrapped in a white circle, and its right side has two tools wrapped in a white circle. 
What happens to the tank in the center of the screen when a red banner with white text appears at the top of the screen?", "question_wo_referring_query": "What happens to the tank in the center of the screen?", "candidates": ["The tank becomes a hologram.", "The tank shows cracks.", "The tank turns red.", "The tank's tread disappears.", "The tank's gun barrel turns to the left."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "Og2UveOeAVk_1", "video_path": "Og2UveOeAVk.mp4", "subtitle_path": "Og2UveOeAVk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1028.07, "view_count": 731154}, {"video_id": "Og2UveOeAVk", "question": "The center of the screen features two tank designs, consisting of lines. In the upper-left corner, there are white bullet and shield designs. When a red rectangular box with white English characters appears in the center of the screen, what change occurs to the white shield in the upper-right corner?", "question_wo_referring_query": "When a red rectangular box with white English characters appears in the center of the screen, what change occurs to the white shield in the upper-right corner?", "candidates": ["The white shield develops a crack.", "The white shield turns red.", "The white shield turns black.", "The white shield turns blue.", "The white shield splits into two pieces."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "Og2UveOeAVk_2", "video_path": "Og2UveOeAVk.mp4", "subtitle_path": "Og2UveOeAVk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1028.07, "view_count": 731154}, {"video_id": "2rfRk_mTf7M", "question": "A man is using a farm tool to weed and loosen the soil for crops in the field. The man is wearing a short-sleeved shirt, shorts, and a hat. Behind the man, there are green forests and rolling hills. 
There are two farm tools lying on the ground in front of the field. In the field to the left side, these tools and some subtitles have appeared together. What does the subtitle say?", "question_wo_referring_query": "What does the subtitle say that appeared with the tools on the left side of the field?", "candidates": ["more dependent on Aid meanwhile the", "education is extremely important", "mainly due to a lack of quality Care", "a wise decision", "In order to deal with these problems"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "2rfRk_mTf7M_0", "video_path": "2rfRk_mTf7M.mp4", "subtitle_path": "2rfRk_mTf7M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 930.29, "view_count": 72802}, {"video_id": "2rfRk_mTf7M", "question": "Three men appear in the field. The two men on the left are working with farm tools, while the man on the right holds a basket of agricultural products. The man on the right wears a checkered shirt, grey trousers, and a hat. Behind the three men are green forests and rolling mountains. 
With what caption has the man on the right, who is holding the agricultural products, appeared?", "question_wo_referring_query": "With what caption has the man on the right, who is holding the agricultural products, appeared?", "candidates": ["bubble so Haitian has a food problem but", "a wise decision\n", "Haitian farmers at market price to", "In order to deal with these problems", "education is extremely important"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "2rfRk_mTf7M_1", "video_path": "2rfRk_mTf7M.mp4", "subtitle_path": "2rfRk_mTf7M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 930.29, "view_count": 72802}, {"video_id": "2rfRk_mTf7M", "question": "The woman is wearing a top with a blue pattern, behind the woman there are potted plants and a desk lamp, to the left of the woman there is a cartoon girl holding binoculars and observing, to the right of the woman there is a cartoon globe, the globe is covered with pink and green patterns. What subtitles have appeared together with the cartoon girl with binoculars to the left of the woman?", "question_wo_referring_query": "What subtitles have appeared together with the cartoon girl with binoculars to the left of the woman?", "candidates": ["somewhere else but this is flawed", "education is extremely important", "word there are health issues that we", "In order to deal with these problems", "a wise decision\n"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "2rfRk_mTf7M_2", "video_path": "2rfRk_mTf7M.mp4", "subtitle_path": "2rfRk_mTf7M_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 930.29, "view_count": 72802}, {"video_id": "5jkwfoA8xL4", "question": "A lady in a black skirt is standing in front of a table. She is holding a pot in one hand and a glass bowl in the other hand. 
On the left side of the table, there is a plate with vegetables and a rectangular plastic container. On the right side of the table, there is a black device. On the countertop behind the lady, there are potted plants and books, and the countertop has white square tiles. Where else has the rectangular plastic container, placed on the left side of the table, appeared?", "question_wo_referring_query": "Where else has the rectangular plastic container, placed on the left side of the table, appeared?", "candidates": ["In the transparent glass bowl", "On the yellow cutting board", "In the woman's hand", "On the red plastic plate", "On the black oven"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "5jkwfoA8xL4_0", "video_path": "5jkwfoA8xL4.mp4", "subtitle_path": "5jkwfoA8xL4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2359.49, "view_count": 484327}, {"video_id": "5jkwfoA8xL4", "question": "A woman wearing a black skirt is standing in front of a table. She holds a piece of tin foil in her hand. To the left of the table, there is a blue pot and a standing pepper grinder. In the middle of the table, there is a wooden cutting board. On the countertop behind the woman, there are fresh flowers placed. The countertop has white square tiles. 
Where else has the pepper grinder seen to the left of the table, appeared?", "question_wo_referring_query": "Where else has the pepper grinder to the left of the table appeared?", "candidates": ["In the transparent glass bowl", "On the red plate", "On the yellow cutting board", "On the black oven", "On the wooden cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "5jkwfoA8xL4_1", "video_path": "5jkwfoA8xL4.mp4", "subtitle_path": "5jkwfoA8xL4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2359.49, "view_count": 484327}, {"video_id": "5jkwfoA8xL4", "question": "The woman in the yellow top is standing in front of a table. She is holding a paper document with both hands. On the left side of the table, there is a silver Apple laptop, and on the right side, there is a teapot. Behind the woman, there are fresh flowers and books on the kitchen counter, which has white square tiles. Besides the table, where else has this paper document appeared?", "question_wo_referring_query": "Besides the table, where else has the paper document held by the woman in the yellow top appeared?", "candidates": ["On the red plate", "Behind the laptop", "On the black oven", "Next to the teapot", "On the yellow cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "5jkwfoA8xL4_2", "video_path": "5jkwfoA8xL4.mp4", "subtitle_path": "5jkwfoA8xL4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2359.49, "view_count": 484327}, {"video_id": "V6yXZirD1jU", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First is a short clip introducing the features of Tokyo, then the American man summarizes his week-long trip to Tokyo, and lastly the man in Tokyo personally 
showcases his experiences of Tokyo life, including shopping, trains, games, and robots.", "First, an American man in Tokyo personally showcases his experiences of Tokyo life, including shopping, trains, games, and robots, then a short clip introduces the features of Tokyo, and finally, the man summarizes his week-long trip to Tokyo.", "First the man summarizes his week-long trip to Tokyo, then an American man in Tokyo personally showcases his experiences of Tokyo life, including shopping, trains, games, and robots, and lastly a short clip introduces the features of Tokyo.", "First is a short clip introducing the features of Tokyo, followed by an American man in Tokyo personally showcasing his experiences of Tokyo life, including shopping, trains, games, and robots, and then the man summarizes his week-long trip to Tokyo.", "First the man summarizes his week-long trip to Tokyo, followed by a short clip introducing the features of Tokyo, and then an American man in Tokyo personally showcasing his experiences of Tokyo life, including shopping, trains, games, and robots."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "V6yXZirD1jU_0", "video_path": "V6yXZirD1jU.mp4", "subtitle_path": "V6yXZirD1jU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.06, "view_count": 233933}, {"video_id": "V6yXZirD1jU", "question": "During an American man's one-week trip to Tokyo, what is the correct sequence of events?", "question_wo_referring_query": "During an American man's one-week trip to Tokyo, what is the correct sequence of events?", "candidates": ["First, the man and his friend experience various entertainment games together. Then, the man visits various robots. After that, the man enjoys Tokyo's delicious food, shopping, and experiences Japanese daily life.", "First, the man visits various robots. Then, the man and his friend experience various entertainment games together. 
After that, the man enjoys Tokyo's delicious food, shopping, and experiences Japanese daily life.", "First, the man and his friend experience various entertainment games together. Then, the man enjoys Tokyo's delicious food, shopping, and experiences Japanese daily life. After that, the man visits various robots.", "First, the man visits various robots. Then, the man enjoys Tokyo's delicious food, shopping, and experiences Japanese daily life. After that, the man and his friend experience various entertainment games together.", "First, the man enjoys Tokyo's delicious food, shopping, and experiences Japanese daily life. Then, the man visits various robots. After that, the man and his friend experience various entertainment games together."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "V6yXZirD1jU_1", "video_path": "V6yXZirD1jU.mp4", "subtitle_path": "V6yXZirD1jU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.06, "view_count": 233933}, {"video_id": "V6yXZirD1jU", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the man summarizes his week-long trip in Tokyo, then the American man experiences gaming culture and robotics technology with a friend, followed by the American man exploring Tokyo's convenience store culture and travel culture, and finally, a short segment introduces the unique characteristics of the city of Tokyo.", "First, a short segment introduces the unique characteristics of the city of Tokyo, then the American man experiences gaming culture and robotics technology with a friend, followed by the American man exploring Tokyo's convenience store culture and travel culture, and finally, the man summarizes his week-long trip in Tokyo.", "First, a short segment introduces the unique characteristics of the city of Tokyo, 
then the American man experiences gaming culture and robotics technology with a friend, followed by the man summarizing his week-long trip in Tokyo, and finally, the American man explores Tokyo's convenience store culture and travel culture.", "First, the man summarizes his week-long trip in Tokyo, followed by the American man exploring Tokyo's convenience store culture and travel culture, then a short segment introduces the unique characteristics of the city of Tokyo, and finally, the American man experiences gaming culture and robotics technology with a friend.", "First, a short segment introduces the unique characteristics of the city of Tokyo, then the American man explores Tokyo's convenience store culture and travel culture, followed by the man summarizing his week-long trip in Tokyo, and finally, the American man experiences gaming culture and robotics technology with a friend."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "V6yXZirD1jU_2", "video_path": "V6yXZirD1jU.mp4", "subtitle_path": "V6yXZirD1jU_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 971.06, "view_count": 233933}, {"video_id": "CoPfYGgzCkM", "question": "In the middle of the screen, there is a large red X symbol. Two cartoon characters appear on the ground. The character on the left is wearing a uniform, helmet, and holding a gun. The masked cartoon character on the right is lying on the ground, with a gun near their feet. In the distance, there are building structures. 
After the subtitle 'for military' appears, what object shows up on the screen?", "question_wo_referring_query": "What object shows up on the screen?", "candidates": ["Medical personnel", "Airplane", "Handbag", "Armored vehicle", "Tent"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "CoPfYGgzCkM_0", "video_path": "CoPfYGgzCkM.mp4", "subtitle_path": "CoPfYGgzCkM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1854.83, "view_count": 638413}, {"video_id": "CoPfYGgzCkM", "question": "A path appears in the forest, with grass, shrubs, and large trees distributed on both sides of the path. Some of the trees have spotted patterns. The grass on the ground is lined with landmines. After the subtitle 'landmines are deployed there are' appears, what objects appear in the grass on both sides of the path?", "question_wo_referring_query": "What objects appear in the grass on both sides of the path?", "candidates": ["Airplane", "Warning sign", "Handbag", "Basket", "Medical personnel"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "CoPfYGgzCkM_1", "video_path": "CoPfYGgzCkM.mp4", "subtitle_path": "CoPfYGgzCkM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1854.83, "view_count": 638413}, {"video_id": "CoPfYGgzCkM", "question": "A plane is flying in the blue sky with two round control buttons inside the cockpit. Ahead of the glass in the cockpit, another aircraft is emitting gray exhaust. 
Once the subtitle 'leadership believed that upholding the' appears, what object can be seen in the blue sky?", "question_wo_referring_query": "What object appears in the blue sky?", "candidates": ["Handbag", "Medical personnel", "Armored vehicle", "Parachute", "Basket"], "topic_category": "KH-Knowledge-History", "question_category": "T3O", "level": "L2-Relation", "id": "CoPfYGgzCkM_2", "video_path": "CoPfYGgzCkM.mp4", "subtitle_path": "CoPfYGgzCkM_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1854.83, "view_count": 638413}, {"video_id": "8o8EzSwIZ6Q", "question": "A man appears in a snow-covered pond, with chunks of ice floating on the surface. In front of the man is a snow-covered platform with two dark objects placed on it. After the subtitle 'Music' appears, what does this black-haired man in the snowy pond do?", "question_wo_referring_query": "What does this black-haired man in the snowy pond do?", "candidates": ["The man picks up a frozen fish.", "The man enters a house.", "The man grabs a piece of ice.", "The man lies down on the snow.", "The man picks up a towel."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "8o8EzSwIZ6Q_0", "video_path": "8o8EzSwIZ6Q.mp4", "subtitle_path": "8o8EzSwIZ6Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.68, "view_count": 3541327}, {"video_id": "8o8EzSwIZ6Q", "question": "Two men are standing together. The man on the right is wearing a blue down jacket, gloves, and a hat. The man on the left is wearing a dark down jacket. Both men are carrying camera equipment. Behind them, there is a blue building and a blue car. 
After the subtitles 'need freezers because outside is the' appear, what does the man in the dark jacket holding the camera equipment do?", "question_wo_referring_query": "What does the man in the dark jacket holding the camera equipment do?", "candidates": ["He put on a pair of gloves.", "He handed the camera equipment to the man next to him.", "He put on a hat.", "He added an outer layer.", "He picked up a frozen fish."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "8o8EzSwIZ6Q_1", "video_path": "8o8EzSwIZ6Q.mp4", "subtitle_path": "8o8EzSwIZ6Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.68, "view_count": 3541327}, {"video_id": "8o8EzSwIZ6Q", "question": "Three men are standing in the snow. The man on the right is wearing a long coat and gloves, the man in the middle is wearing only trunks and shoes, and the man on the left is wearing black shorts and shoes. In the distant snowy background, there are city buildings. After the subtitle 'we go baby' appears, what does the man in the middle wearing trunks do?", "question_wo_referring_query": "What does the man in the middle wearing trunks do?", "candidates": ["He puts on a pair of gloves", "He puts on a large coat", "He lies down in the snow", "He picks up a towel", "He attempts to enter the water under the snow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "8o8EzSwIZ6Q_2", "video_path": "8o8EzSwIZ6Q.mp4", "subtitle_path": "8o8EzSwIZ6Q_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1383.68, "view_count": 3541327}, {"video_id": "Jg7VMn2h9Fk", "question": "A man wearing a striped shirt appears on screen, with a wall behind him. On the left side of the wall, there is a poster and a wall tiger model, and on the right side, there is a map formed by connecting six pictures. 
When this man in the striped shirt introduces the geographical features of Michigan, which park does he mention first?", "question_wo_referring_query": "When this man in the striped shirt introduces the geographical features of Michigan, which park does he mention first?", "candidates": ["Redwood National Park", "Yellowstone National Park", "Acadia National Park", "Isle Royale National Park", "Niagara Falls National Park"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "Jg7VMn2h9Fk_0", "video_path": "Jg7VMn2h9Fk.mp4", "subtitle_path": "Jg7VMn2h9Fk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1663.2, "view_count": 158978}, {"video_id": "Jg7VMn2h9Fk", "question": "Who is the first person to appear in the video?", "question_wo_referring_query": "Who is the first person to appear in the video?", "candidates": ["The man wearing a yellow short-sleeved shirt", "The man wearing a green short-sleeved shirt", "The woman wearing a blue short-sleeved shirt", "The man wearing a blue short-sleeved shirt", "The man wearing a striped shirt"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "Jg7VMn2h9Fk_1", "video_path": "Jg7VMn2h9Fk.mp4", "subtitle_path": "Jg7VMn2h9Fk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1663.2, "view_count": 158978}, {"video_id": "Jg7VMn2h9Fk", "question": "What is the first mode of transportation that appears in the video?", "question_wo_referring_query": "What is the first mode of transportation that appears in the video?", "candidates": ["Bicycle", "Boat", "Car", "Helicopter", "Train"], "topic_category": "KG-Knowledge-Geography", "question_category": "O3O", "level": "L2-Relation", "id": "Jg7VMn2h9Fk_2", "video_path": "Jg7VMn2h9Fk.mp4", "subtitle_path": "Jg7VMn2h9Fk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1663.2, 
"view_count": 158978}, {"video_id": "kswaAPPRmWY", "question": "A man is standing in front of a table. He is wearing a dark-colored short-sleeve shirt and holding a kitchen knife. There are ingredients and kitchen utensils placed on the wooden table. Behind the man are stacked wood and debris. When the subtitle 'Beef' appears, what is the man in the dark short-sleeve shirt doing at the table?", "question_wo_referring_query": "What is the man in the dark short-sleeve shirt doing at the table?", "candidates": ["Wiping the utensils", "Wiping the cutting board", "Cutting green onions", "Cutting garlic", "Cutting beef"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "kswaAPPRmWY_0", "video_path": "kswaAPPRmWY.mp4", "subtitle_path": "kswaAPPRmWY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1003.34, "view_count": 1862087}, {"video_id": "kswaAPPRmWY", "question": "A man stands in front of a table. The man is wearing a dark-colored short-sleeved shirt and dark-colored long pants. On the wooden table, there is a wooden chopping board. Stacked wooden materials can be seen on the stone wall behind the man. To the right of the stone wall, there is a path through the woods. What is the man in the dark short-sleeved shirt doing when the subtitle 'Dough' appears?", "question_wo_referring_query": "What is the man in the dark short-sleeved shirt doing in front of the table?", "candidates": ["Slicing beef", "Kneading dough", "Holding a plate", "Eating an apple", "Cutting beef"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "kswaAPPRmWY_1", "video_path": "kswaAPPRmWY.mp4", "subtitle_path": "kswaAPPRmWY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1003.34, "view_count": 1862087}, {"video_id": "kswaAPPRmWY", "question": "A man is standing in front of a table. 
The man is wearing a dark short-sleeve shirt and dark long pants. On the wooden table, there is a green bowl and silver utensils. There are white ingredients on the left side of the wooden table, and there are green plants behind the man. When the subtitle 'Egg' appears, what is the man in the black short-sleeve shirt in front of the table doing?", "question_wo_referring_query": "What is the man in the black short-sleeve shirt in front of the table doing?", "candidates": ["Wiping the cutting board", "Wiping the wooden bowl", "Cutting fruit", "Stirring beef", "Cracking an egg"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "kswaAPPRmWY_2", "video_path": "kswaAPPRmWY.mp4", "subtitle_path": "kswaAPPRmWY_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1003.34, "view_count": 1862087}, {"video_id": "kNgUiYzUlWc", "question": "A group of pedestrians appear at the street corner. The man on the far left is wearing a light-colored hooded jacket and black trousers. In the center, there is a man wearing a hat and sitting in a wheelchair. Behind the man in the wheelchair is a man dressed in black clothes and black trousers. The street is lined with utility poles and miscellaneous items. 
What is the man in black clothes and black trousers doing on the street?", "question_wo_referring_query": "What is the man in black clothes and black trousers doing on the street?", "candidates": ["Holding a black bag", "Carrying a yellow bag", "Riding a bicycle", "Pushing a wheelchair", "Riding a motorcycle"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "kNgUiYzUlWc_0", "video_path": "kNgUiYzUlWc.mp4", "subtitle_path": "kNgUiYzUlWc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1252.4, "view_count": 152003}, {"video_id": "kNgUiYzUlWc", "question": "On the left side of the screen, three individuals are standing in a row, wearing red uniforms. Opposite them are parked cars and a crowd, with someone in the crowd holding a flag. On the right side of the screen, there is a man in a suit, who is wearing a white shirt and a tie. Which person has their back towards the camera and has both hands clasped behind their back?", "question_wo_referring_query": "Which person has their back towards the camera and has both hands clasped behind their back?", "candidates": ["The man on the left side of the screen wearing a white shirt", "The person on the left side of the screen wearing a red uniform and blue trousers", "The person on the right side of the screen wearing a suit", "The person on the left side of the screen wearing a red uniform and black trousers", "The person on the left side of the screen holding the green flag"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "kNgUiYzUlWc_1", "video_path": "kNgUiYzUlWc.mp4", "subtitle_path": "kNgUiYzUlWc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1252.4, "view_count": 152003}, {"video_id": "kNgUiYzUlWc", "question": "A group of pedestrians and a horse appear on the street. 
The man on the far left is wearing a black short-sleeved shirt, and the man on the far right is wearing a gray top. The man in the center is wearing a white short-sleeved shirt and a hat. There are green plants and power lines on both sides of the street. Which person on the street is controlling the horse-drawn cart?", "question_wo_referring_query": "Which person on the street is controlling the horse-drawn cart?", "candidates": ["The man wearing a gray top", "The man wearing a white short-sleeved shirt and a hat", "The man wearing a red top", "The man wearing a black short-sleeved shirt", "The man wearing a yellow top"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "kNgUiYzUlWc_2", "video_path": "kNgUiYzUlWc.mp4", "subtitle_path": "kNgUiYzUlWc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1252.4, "view_count": 152003}, {"video_id": "DBBCKghodVI", "question": "The top part of the screen shows a gray landmass. The land on the left side is relatively flat, while the terrain on the right side is more complex. There is a sunken area on the land. The bottom part of the screen shows the sea, and on the flat land on the left side, there is an arrow pointing from left to right. What color is this arrow pointing from left to right?", "question_wo_referring_query": "What color is this arrow pointing from left to right?", "candidates": ["yellow", "blue", "green", "black", "red"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "DBBCKghodVI_0", "video_path": "DBBCKghodVI.mp4", "subtitle_path": "DBBCKghodVI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1918.79, "view_count": 54202}, {"video_id": "DBBCKghodVI", "question": "The top left corner and the bottom part of the map are blue seas, while the land is mainly concentrated in the upper right part of the map. 
The land is interspersed with roads and urban areas. At the upper part of the land, there are green areas and light yellow areas. Near the green areas, there is a cluster of white light spots on the sea. What is the shape of this cluster of white light spots?", "question_wo_referring_query": "What is the shape of this cluster of white light spots?", "candidates": ["circle", "triangle", "rectangle", "square", "star"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "DBBCKghodVI_1", "video_path": "DBBCKghodVI.mp4", "subtitle_path": "DBBCKghodVI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1918.79, "view_count": 54202}, {"video_id": "DBBCKghodVI", "question": "On the left side of the screen, there is a vast blue ocean, while on the right side, there is expansive land. The boundary between the land and the ocean is light yellow. To the right of this boundary, there are complex terrains and green vegetation, and the farthest right area shows relatively flat land along with some lakes. How would you describe the boundary between the land and the ocean?", "question_wo_referring_query": "How would you describe the boundary between the land and the ocean in the middle of the screen?", "candidates": ["Appears like a combination of two curved lines", "Similar to a curved line", "Appears like a combination of straight and curved lines", "Appears like a combination of multiple straight lines", "Appears like a straight line extending into the distance"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "DBBCKghodVI_2", "video_path": "DBBCKghodVI.mp4", "subtitle_path": "DBBCKghodVI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1918.79, "view_count": 54202}, {"video_id": "SCR_g385pCs", "question": "A woman is wearing glasses and clothing decorated with pink patterns. 
There is a planter behind her, and a poster is hanging on the room\u2019s wall. When the subtitle shows 'some of the books that I got I'm so,' what object is present on the screen?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["Umbrella", "Pillow", "Necklace", "Apple", "Lamp"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "SCR_g385pCs_0", "video_path": "SCR_g385pCs.mp4", "subtitle_path": "SCR_g385pCs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.95, "view_count": 440263}, {"video_id": "SCR_g385pCs", "question": "The woman is wearing glasses and a dress with a pink design. There is a book in front of the woman. A poster is hanging on the wall. What object is present on the screen when the subtitle 'why do it so big I have a strain and' appears?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["potted plant", "watch", "mobile phone", "cushion", "desk lamp"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "SCR_g385pCs_1", "video_path": "SCR_g385pCs.mp4", "subtitle_path": "SCR_g385pCs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.95, "view_count": 440263}, {"video_id": "SCR_g385pCs", "question": "The woman is wearing glasses and clothes with pink patterns. She is holding a book with a black cover in one hand and adjusting her glasses with the other hand. 
There is a poster hanging on the wall, and when the subtitle 'never told you is about a Chinese' appears, what objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["desk lamp", "potted plant", "cushion", "wristwatch", "mobile phone"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "SCR_g385pCs_2", "video_path": "SCR_g385pCs.mp4", "subtitle_path": "SCR_g385pCs_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 948.95, "view_count": 440263}, {"video_id": "28jSAi9N1rg", "question": "Sitting in front of a painting, what is an elderly man with white hair, wearing a grey suit, blue shirt, and striped tie doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["Talking to the camera", "Taking off his coat", "Adjusting his collar", "Making a V sign at the camera", "Drinking tea"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "28jSAi9N1rg_0", "video_path": "28jSAi9N1rg.mp4", "subtitle_path": "28jSAi9N1rg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 54, "duration": 21.99, "view_count": 8541}, {"video_id": "3DDIU4aIhmo", "question": "On the blue sea, a black-skinned swimmer wearing a yellow top is sitting on a boat. He is looking at the sea and has one hand on the railing of the boat. 
What other objects are present in the scene?", "question_wo_referring_query": "What other objects are present in the scene?", "candidates": ["Submarine", "Warship", "Airplane", "House", "Bicycle"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "3DDIU4aIhmo_0", "video_path": "3DDIU4aIhmo.mp4", "subtitle_path": "3DDIU4aIhmo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 755, "duration": 31.0, "view_count": 143656}, {"video_id": "8rCBiYEKxko", "question": "Under the blue sky and white clouds, a small person dressed in green clothing and holding a sword is riding a black horse. There are many neatly arranged soldiers standing behind him. What objects are present on the screen when the subtitle says 'war against the Gauls of France'?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["Green grass", "Squirrel", "Table", "Fresh flowers", "Rabbit"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "8rCBiYEKxko_0", "video_path": "8rCBiYEKxko.mp4", "subtitle_path": "8rCBiYEKxko_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 21, "duration": 27.0, "view_count": 9236}, {"video_id": "4FguF3forVw", "question": "In a room with green plants, in front of a window with white curtains, a woman with blonde curly hair is wiping the table with a cloth. What color top is the woman wearing?", "question_wo_referring_query": "What color top is the woman wearing?", "candidates": ["white", "black", "red", "purple", "olive"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "4FguF3forVw_0", "video_path": "4FguF3forVw.mp4", "subtitle_path": "4FguF3forVw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 292, "duration": 47.97, "view_count": 461683}, {"video_id": "8naFij-PwfI", "question": "Three characters appear on the screen. 
In the top left corner, there is a curly-haired woman wearing earrings. In the top right corner, there is a man sitting in a room with green plants. In the bottom center, there is a man sitting in a room with books. Who in the screen is touching their face with their finger?", "question_wo_referring_query": "Who in the screen is touching their face with their finger?", "candidates": ["A man wearing glasses and a black top", "A curly-haired woman wearing earrings", "A man wearing glasses and a purple top", "A man wearing glasses and a red top", "A man in a suit jacket and a blue shirt"], "topic_category": "KA-Knowledge-Art", "question_category": "E2O", "level": "IntraMoment", "id": "8naFij-PwfI_0", "video_path": "8naFij-PwfI.mp4", "subtitle_path": "8naFij-PwfI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1024, "duration": 26.0, "view_count": 8097}, {"video_id": "_U2P-A8btAk", "question": "In a dimly lit room, when a man in a suit and coat with black hair appeared for the first time at the podium with a microphone, what did he do?", "question_wo_referring_query": "What did this man do?", "candidates": ["Playing the guitar", "Dancing", "Giving a speech facing the audience", "Presenting a PPT", "Playing the piano"], "topic_category": "KA-Knowledge-Art", "question_category": "O2E", "level": "IntraMoment", "id": "_U2P-A8btAk_0", "video_path": "_U2P-A8btAk.mp4", "subtitle_path": "_U2P-A8btAk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 610, "duration": 20.02, "view_count": 921}, {"video_id": "SAksyGDDvfM", "question": "On the right side of the screen, there is a table with two monitors and a mouse. The monitors display a kitchen sink and a faucet. A man wearing a black suit and a white shirt is sitting in front of the monitors. 
What happened when the man said 'Electronics but they also show um what'?", "question_wo_referring_query": "What happened to the man?", "candidates": ["Moved the mouse", "Turned off the monitor", "Placed his palm on the monitor", "Pointed to the rightmost monitor with his index finger", "Pointed to the leftmost monitor with his index finger"], "topic_category": "NP-News-Programs", "question_category": "T2E", "level": "IntraMoment", "id": "SAksyGDDvfM_0", "video_path": "SAksyGDDvfM.mp4", "subtitle_path": "SAksyGDDvfM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 46, "duration": 43.0, "view_count": 144390}, {"video_id": "nrLRCWPLJ_w", "question": "After thick smoke billowed from the densely packed houses in the distance and drifted into the sky, what event occurred on the screen?", "question_wo_referring_query": ", what event occurred on the screen?", "candidates": ["A man wearing a purple shirt and black earphones is talking in front of the camera.", "A man wearing a blue shirt and sunglasses is talking in front of the camera.", "A man wearing a blue shirt and a baseball cap is talking in front of the camera.", "A man wearing a black shirt and Bluetooth earphones is talking in front of the camera.", "A man wearing a blue shirt and white earphones is talking in front of the camera."], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "nrLRCWPLJ_w_0", "video_path": "nrLRCWPLJ_w.mp4", "subtitle_path": "nrLRCWPLJ_w_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 759, "duration": 28.0, "view_count": 166013}, {"video_id": "zfoCtOEWwGM", "question": "In front of a window with green plants, after a pair of hands opens a colorful picture book, which of the following individuals appear first?", "question_wo_referring_query": "Which of the following individuals appear first?", "candidates": ["A man sitting on a chair wearing a black headscarf and white pants", "A man sitting on a chair 
wearing a grey headscarf and black pants", "A man kneeling in front of a yellow wall holding a kettle", "A man sitting on the grass wearing a white headscarf and black pants", "A woman sitting at the edge of a grassy field wearing a green headscarf and a skirt"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "zfoCtOEWwGM_0", "video_path": "zfoCtOEWwGM.mp4", "subtitle_path": "zfoCtOEWwGM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 91, "duration": 30.0, "view_count": 3492}, {"video_id": "qSPtiH920iQ", "question": "On the far right side of the screen is a man in a black shirt holding a camera lens, and on the left side of the screen, there is a woman wearing glasses and a denim jacket along with another woman wearing a black hat and a white shirt. Before the subtitle says 'I'm starstruck so Jamie's been like one,' what happens in the scene?", "question_wo_referring_query": "What happens in the scene?", "candidates": ["The woman wearing black glasses is shaking hands", "The woman wearing black glasses is tying a shoelace", "Two hands reach towards the woman wearing black glasses", "A hand reaches towards the woman wearing black glasses", "The woman wearing black glasses is hugging"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "qSPtiH920iQ_0", "video_path": "qSPtiH920iQ.mp4", "subtitle_path": "qSPtiH920iQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 387, "duration": 46.98, "view_count": 85402}, {"video_id": "lao2MP2Esi8", "question": "The blue ocean in the picture is surrounded by some mountain ranges. On the mountain ranges, many densely packed houses are distributed. 
After the subtitle says 'your favorite european city is', what object appears first on the screen?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["an airplane", "a website", "a ship", "a globe", "a car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "lao2MP2Esi8_0", "video_path": "lao2MP2Esi8.mp4", "subtitle_path": "lao2MP2Esi8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 507, "duration": 27.03, "view_count": 381416}, {"video_id": "esc3xBBn9ic", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, pour the green viscous substance from the glass bowl into the white cup, then place a green round chocolate into the white cup with the green viscous substance, and finally use a whisk to stir the green viscous substance in the glass bowl.", "First, use a whisk to stir the green viscous substance in the glass bowl, then pour the green viscous substance from the glass bowl into the white cup, and finally place a green round chocolate into the white cup with the green viscous substance.", "First, place a green round chocolate into the white cup with the green viscous substance, then use a whisk to stir the green viscous substance in the glass bowl, and finally pour the green viscous substance from the glass bowl into the white cup.", "First, pour the green viscous substance from the glass bowl into the white cup, then use a whisk to stir the green viscous substance in the glass bowl, and finally place a green round chocolate into the white cup with the green viscous substance.", "First, use a whisk to stir the green viscous substance in the glass bowl, then place a green round chocolate into the white cup with the green viscous substance, and finally pour the green viscous substance from the glass bowl into the white 
cup."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "esc3xBBn9ic_0", "video_path": "esc3xBBn9ic.mp4", "subtitle_path": "esc3xBBn9ic_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 362, "duration": 18.98, "view_count": 429114}, {"video_id": "dTDNlD6yMxo", "question": "Only half of a woman's reflection appears on the far right side of the screen, while a man is at the far left side, facing the mirror and talking to the woman. Where else has this man appeared?", "question_wo_referring_query": "Where else has this man appeared?", "candidates": ["In the bathroom", "In front of the bookshelf", "In the elevator", "On the bed", "In the kitchen"], "topic_category": "KA-Knowledge-Art", "question_category": "SOS", "level": "L2-Relation", "id": "dTDNlD6yMxo_0", "video_path": "dTDNlD6yMxo.mp4", "subtitle_path": "dTDNlD6yMxo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 107, "duration": 29.99, "view_count": 19883}, {"video_id": "sN3_n6HEA1s", "question": "A map appeared on the screen, with four red arrows marked on it and an area circled by a black line. 
In which subtitles did this map appear together?", "question_wo_referring_query": "In which subtitles did this map appear together?", "candidates": ["continents don't merge they drift apart", "history as we explore further the quest", "there's an unexpected twist sometimes", "seasoned of miners recent Explorations", "with risks challenging even the most"], "topic_category": "KG-Knowledge-Geography", "question_category": "TOS", "level": "L2-Relation", "id": "sN3_n6HEA1s_0", "video_path": "sN3_n6HEA1s.mp4", "subtitle_path": "sN3_n6HEA1s_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 74, "duration": 58.03, "view_count": 8076}, {"video_id": "pONr8-d-VfI", "question": "Under the clear blue sky, the cut surface of the roadside terrain shows rock layers of different colors, including yellow, white, red, and olive. At the top of the rock layers, there are guardrails and a few trees. What other objects are present in the scene?", "question_wo_referring_query": "What other objects are present in the scene?", "candidates": ["Airplane", "Car", "Trees", "Bicycle", "Windmill"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2O", "level": "IntraMoment", "id": "pONr8-d-VfI_0", "video_path": "pONr8-d-VfI.mp4", "subtitle_path": "pONr8-d-VfI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 289, "duration": 15.98, "view_count": 10060}, {"video_id": "0tIzo0Uik6I", "question": "In front of a bookshelf in a room, a blonde woman wearing a blue top and jeans covers her mouth and nose with her hands. 
What other objects are present on the screen when the subtitle says 'please'?", "question_wo_referring_query": "What other objects are present on the screen?", "candidates": ["books", "green plant", "flower pot", "snake", "piano"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "0tIzo0Uik6I_0", "video_path": "0tIzo0Uik6I.mp4", "subtitle_path": "0tIzo0Uik6I_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 931, "duration": 34.99, "view_count": 473041}, {"video_id": "nPH6EwIPZPg", "question": "If an asteroid falls and hits the Earth's surface in the video, what shape does the red flame emitted by the asteroid's impact take?", "question_wo_referring_query": "What shape does it take?", "candidates": ["Square", "Heart", "Staircase", "Rectangle", "Circle"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "nPH6EwIPZPg_0", "video_path": "nPH6EwIPZPg.mp4", "subtitle_path": "nPH6EwIPZPg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 666, "duration": 31.0, "view_count": 135305}, {"video_id": "UkPyFvAvl0A", "question": "On the far right of the screen stands a woman with a green shawl over her shoulders, in the middle is a woman wearing a pink bikini top holding a bottle of alcohol and a drink. Her legs are covered with green paint. 
When the subtitles say 'haha', what color pants is the woman in the pink bikini top wearing?", "question_wo_referring_query": "What color pants is the woman in the pink bikini top wearing when the subtitles say 'haha'?", "candidates": ["purple", "yellow", "green", "pink", "black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "UkPyFvAvl0A_0", "video_path": "UkPyFvAvl0A.mp4", "subtitle_path": "UkPyFvAvl0A_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 56, "duration": 43.1, "view_count": 33507}, {"video_id": "WzPT3amcdqo", "question": "What did the man wearing a yellow striped shirt with a pink round collar do the first time he appeared in front of the window outside the house?", "question_wo_referring_query": "What did he do?", "candidates": ["Rubbed his eyes", "Touched his hair", "Touched his ear", "Touched his chin", "Touched his nose"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "WzPT3amcdqo_0", "video_path": "WzPT3amcdqo.mp4", "subtitle_path": "WzPT3amcdqo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1217, "duration": 32.03, "view_count": 97837}, {"video_id": "vbxX6PGkLlI", "question": "On a green grassy field, a soldier with a bloody white cloth on his face is holding a gun in the middle of the screen. 
When the subtitle says 'so terrifying to the approaching German', what is the soldier in the middle of the screen doing?", "question_wo_referring_query": "What is the soldier in the middle of the screen, who is holding a gun, doing?", "candidates": ["Running", "Spinning", "Lying on the ground", "Sitting on the ground", "Jumping"], "topic_category": "KH-Knowledge-History", "question_category": "T2E", "level": "IntraMoment", "id": "vbxX6PGkLlI_0", "video_path": "vbxX6PGkLlI.mp4", "subtitle_path": "vbxX6PGkLlI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1172, "duration": 33.0, "view_count": 4514463}, {"video_id": "7QddAi2Jl4o", "question": "Sitting in front of a bed with a white comforter, a woman wearing a gray coat is holding a book with a pink cover. After she appears, which book is mentioned first?", "question_wo_referring_query": "Which book is mentioned first below?", "candidates": ["Mieko and the Fifth Treasure", "The House on Mango Street", "Park Avenue summer", "Charlotte's Web", "The Outsiders"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "7QddAi2Jl4o_0", "video_path": "7QddAi2Jl4o.mp4", "subtitle_path": "7QddAi2Jl4o_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 545, "duration": 42.01, "view_count": 30602}, {"video_id": "_JK5GJ7ms0U", "question": "On a gray wall covered with a series of black ancient symbols, after the subtitle says 'pray but I didn't know what I was saying', what happens next in the scene?", "question_wo_referring_query": "what happens next in the scene?", "candidates": ["A young person with black hair and wearing a white scarf is looking around in an exhibition hall.", "An elderly person with white hair and wearing a white hat is looking around in an exhibition hall.", "An elderly person with white hair and wearing a black scarf is looking around in an exhibition hall.", "An elderly person with white hair and wearing black sunglasses 
is looking around in an exhibition hall.", "An elderly person with white hair and wearing a white scarf is looking around in an exhibition hall."], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "_JK5GJ7ms0U_0", "video_path": "_JK5GJ7ms0U.mp4", "subtitle_path": "_JK5GJ7ms0U_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 101, "duration": 49.02, "view_count": 13973}, {"video_id": "Fq4iHI1kST4", "question": "Several travelers are waiting in front of the airline check-in counter for service. After the subtitle says 'between China and Japan still aren't,' who is the first person to appear on the screen?", "question_wo_referring_query": "Who is the first person to appear on the screen?", "candidates": ["A woman wearing a pink top and green pants", "A woman wearing a pink skirt", "A woman wearing a pink top and yellow skirt", "A woman wearing a pink top and white pants", "A man wearing a pink top and white pants"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "Fq4iHI1kST4_0", "video_path": "Fq4iHI1kST4.mp4", "subtitle_path": "Fq4iHI1kST4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 123, "duration": 28.0, "view_count": 5306}, {"video_id": "6RIPretVwPg", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a man wearing a hat in the kitchen uses a large round spatula to lift a white liquid from a pot. Then, he pours the meat from a round iron plate into a pot containing the white liquid. Finally, he uses the large round spatula to move the meat into a brown-colored clay pot.", "First, he uses the large round spatula to move the meat into a brown-colored clay pot. Then, he pours the meat from a round iron plate into a pot containing the white liquid. 
Finally, a man wearing a hat in the kitchen uses a large round spatula to lift a white liquid from a pot.", "First, he uses the large round spatula to move the meat into a brown-colored clay pot. Then, a man wearing a hat in the kitchen uses a large round spatula to lift a white liquid from a pot. Finally, he pours the meat from a round iron plate into a pot containing the white liquid.", "First, he uses the large round spatula to move the meat into a brown-colored clay pot. Then, a man wearing a hat in the kitchen uses a large round spatula to lift a white liquid from a pot. Finally, he pours the meat from a round iron plate into a pot containing the white liquid.", "First, a man wearing a hat in the kitchen uses a large round spatula to lift a white liquid from a pot. Then, he uses the large round spatula to move the meat into a brown-colored clay pot. Finally, he pours the meat from a round iron plate into a pot containing the white liquid."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "6RIPretVwPg_0", "video_path": "6RIPretVwPg.mp4", "subtitle_path": "6RIPretVwPg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 755, "duration": 40.0, "view_count": 7904385}, {"video_id": "iWL6Z9e6QWE", "question": "In the frame, there is a picturesque mountain peak, surrounded by dense houses and a blue ocean. In the middle of the frame, within a yellow box, the black English words 'THE FINAL CHALLENGE' are written. 
In which of the following places have these English words appeared before?", "question_wo_referring_query": "In which of the following places have these English words appeared before?", "candidates": ["In a green webpage", "In the background inside a car with two women sitting", "In a black webpage", "In the background of a green mountain peak with a narrow long staircase path in the middle", "In a white webpage"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "iWL6Z9e6QWE_0", "video_path": "iWL6Z9e6QWE.mp4", "subtitle_path": "iWL6Z9e6QWE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 727, "duration": 52.0, "view_count": 77040}, {"video_id": "BzGz18LynZY", "question": "Sunlight is shining down from a distance, and a man in a green shirt is sitting outside. Behind him, there are trees and a body of water. What did this man change into when he last appeared on the screen?", "question_wo_referring_query": "What did this man change into when he last appeared on the screen?", "candidates": ["Gray short sleeves", "Orange short sleeves", "Red short sleeves", "Black vest", "Black short sleeves"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "BzGz18LynZY_0", "video_path": "BzGz18LynZY.mp4", "subtitle_path": "BzGz18LynZY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 311, "duration": 32.99, "view_count": 34438}, {"video_id": "Md1u8HDBwZo", "question": "After the video transitions to a scene where a group of tourists are gathered on the fence by the lake, watching a golden fountain spray in the distance, what are the tourists in the video doing?", "question_wo_referring_query": "What are the tourists in the video doing?", "candidates": ["Jumping into the lake", "Holding up a red flag", "Holding up a banner", "Boarding a boat", "Taking photos with their phones"], "topic_category": "LT-Lifestyle-Travel-Guides", 
"question_category": "S2E", "level": "IntraMoment", "id": "Md1u8HDBwZo_0", "video_path": "Md1u8HDBwZo.mp4", "subtitle_path": "Md1u8HDBwZo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 188, "duration": 45.0, "view_count": 474032}, {"video_id": "xEl7OeLWmeg", "question": "A person is holding a piping bag filled with green frosting, and is piping green frosting onto a Christmas tree-shaped cookie. What other objects can be seen in this scene?", "question_wo_referring_query": ", what other objects can be seen in this scene?", "candidates": ["Ring", "Green plant", "Flower", "Egg beater", "Necklace"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "xEl7OeLWmeg_0", "video_path": "xEl7OeLWmeg.mp4", "subtitle_path": "xEl7OeLWmeg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 140, "duration": 27.99, "view_count": 166834}, {"video_id": "Hism4KUBsR0", "question": "A curly-haired woman wearing a white headband and an exquisite necklace is looking down in the screen. 
When the subtitle says 'set the whole place Ablaze her red,' what else is visible on the screen?", "question_wo_referring_query": "What else is visible on the screen?", "candidates": ["Scepter", "Earrings", "High heels", "Ladder", "Scarf"], "topic_category": "KA-Knowledge-Art", "question_category": "T2O", "level": "IntraMoment", "id": "Hism4KUBsR0_0", "video_path": "Hism4KUBsR0.mp4", "subtitle_path": "Hism4KUBsR0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 81, "duration": 17.0, "view_count": 2120919}, {"video_id": "-w_Xp5JeVV8", "question": "What type of outerwear is a black-haired woman wearing who is sitting at a white desk with a laptop in front of her and holding a phone?", "question_wo_referring_query": "What type of outerwear is she wearing?", "candidates": ["Hoodie", "Denim jacket", "Wool coat", "Leather jacket", "Blazer"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "-w_Xp5JeVV8_0", "video_path": "-w_Xp5JeVV8.mp4", "subtitle_path": "-w_Xp5JeVV8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 133, "duration": 44.98, "view_count": 3837}, {"video_id": "ApSkDR3YnmA", "question": "There are two different people in the videoframe, the person on the right is a man with black hair and a mustache. 
In the whole frame, who is resting their hand on their chin and holding a pen?", "question_wo_referring_query": "In the whole frame, who is resting their hand on their chin and holding a pen?", "candidates": ["A man wearing a white shirt", "A woman wearing glasses and a black top", "A man wearing glasses and a black top", "A woman wearing a green floral dress", "A woman wearing glasses and a red top"], "topic_category": "NP-News-Programs", "question_category": "E2O", "level": "IntraMoment", "id": "ApSkDR3YnmA_0", "video_path": "ApSkDR3YnmA.mp4", "subtitle_path": "ApSkDR3YnmA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 97, "duration": 52.0, "view_count": 6237}, {"video_id": "H50_a6AowB0", "question": "In the video, there are two desks. On the desk to the far right, there is an elderly man with white hair wearing a blue shirt, who is sitting with his legs crossed. On the desk to the far left, there is a man wearing jeans. After the man performs a downward palm motion on the white paper in his hand, what does he do next?", "question_wo_referring_query": "What does the man do after performing the downward palm motion?", "candidates": ["Throw the white paper from his hand into the trash can", "Drop the white paper from his hand onto the ground", "Fold the white paper from his hand into a paper airplane", "Place the white paper from his hand onto the desk", "Hand the white paper to the person next to him"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "H50_a6AowB0_0", "video_path": "H50_a6AowB0.mp4", "subtitle_path": "H50_a6AowB0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 358, "duration": 42.0, "view_count": 103434}, {"video_id": "cwoauTdLTqc", "question": "After a short-haired woman in a pink top appears in front of a white exhibition wall with two different paintings hanging on it, who is the first person to enter the scene?", "question_wo_referring_query": "Who is 
the first person to enter the scene?", "candidates": ["A man wearing a gray coat and a white shirt", "A man wearing a yellow coat and a white shirt", "A woman wearing a gray coat and a white shirt", "A man wearing a black suit and a white shirt", "A man wearing a gray coat and a black shirt"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "cwoauTdLTqc_0", "video_path": "cwoauTdLTqc.mp4", "subtitle_path": "cwoauTdLTqc_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 146, "duration": 29.99, "view_count": 34796}, {"video_id": "F12LUOc8MTQ", "question": "On both sides of a road with vehicles on a highway, several large green grassy areas with green trees are distributed. After the subtitle says 'route that follows Interstate 10 and,' what happens on the screen?", "question_wo_referring_query": "what happens on the screen?", "candidates": ["A picture of a meteor shower appears", "A globe appears", "A picture of Mars appears", "A map appears", "A road with a zigzag line appears"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "F12LUOc8MTQ_0", "video_path": "F12LUOc8MTQ.mp4", "subtitle_path": "F12LUOc8MTQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 444, "duration": 44.01, "view_count": 120450}, {"video_id": "Zwox4zk7rmo", "question": "In the top left corner of the screen, there's a red pigment, and in the middle, there is a round barrel drawn with black pen. The surface of the barrel has some red pigment. 
After the subtitle mentions 'and sculptures', what is the first object that appears on the screen?", "question_wo_referring_query": "What is the first object to appear on the screen?", "candidates": ["A ship", "A globe", "A moon", "A blue ocean", "A world map"], "topic_category": "KA-Knowledge-Art", "question_category": "T3O", "level": "L2-Relation", "id": "Zwox4zk7rmo_0", "video_path": "Zwox4zk7rmo.mp4", "subtitle_path": "Zwox4zk7rmo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 158, "duration": 52.97, "view_count": 84484}, {"video_id": "SvfB28mU41M", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a hand holding a pen is drawing in the middle of the screen, then a colorful Rubik's cube appears in a background with the black text 'AMERICA'S GOT TALENT', and finally, a bottle of drink appears on the left side of a background with the blue text 'STAY ON TOP OF TRENDS'.", "First, a bottle of drink appears on the left side of a background with the blue text 'STAY ON TOP OF TRENDS', then a hand holding a pen is drawing in the middle of the screen, and finally, a colorful Rubik's cube appears in a background with the black text 'AMERICA'S GOT TALENT'.", "First, a bottle of drink appears on the left side of a background with the blue text 'STAY ON TOP OF TRENDS', then a colorful Rubik's cube appears in a background with the black text 'AMERICA'S GOT TALENT', and finally, a hand holding a pen is drawing in the middle of the screen.", "First, a hand holding a pen is drawing in the middle of the screen, then a bottle of drink appears on the left side of a background with the blue text 'STAY ON TOP OF TRENDS', and finally, a colorful Rubik's cube appears in a background with the black text 'AMERICA'S GOT TALENT'.", "First, a colorful Rubik's cube appears in a background with the black text 'AMERICA'S GOT TALENT', then a 
bottle of drink appears on the left side of a background with the blue text 'STAY ON TOP OF TRENDS', and finally, a hand holding a pen is drawing in the middle of the screen."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "SvfB28mU41M_0", "video_path": "SvfB28mU41M.mp4", "subtitle_path": "SvfB28mU41M_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 116, "duration": 30.0, "view_count": 6456}, {"video_id": "VNCYr4ZT1uQ", "question": "A man wearing a baseball cap appears in front of a lakeside at the foot of a mountain. Where else has this man appeared?", "question_wo_referring_query": "Where else has this man appeared?", "candidates": ["Inside an airplane cockpit", "In front of a window of a yellow building", "In an elevator", "On a boat", "On a train"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "VNCYr4ZT1uQ_0", "video_path": "VNCYr4ZT1uQ.mp4", "subtitle_path": "VNCYr4ZT1uQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 303, "duration": 59.02, "view_count": 8181}, {"video_id": "pMHIYXguz2M", "question": "In front of an orange background, which small round beads strung on a thin wire and which subtitles have appeared together?", "question_wo_referring_query": "Which subtitles have appeared together?", "candidates": ["upbeat music", "pass one of the loose ends through the first loop", "until there are enough loops to fit around your fingertips", "from it into a circle to make the crown", "shape the ring"], "topic_category": "KA-Knowledge-Art", "question_category": "TOS", "level": "L2-Relation", "id": "pMHIYXguz2M_0", "video_path": "pMHIYXguz2M.mp4", "subtitle_path": "pMHIYXguz2M_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 53, "duration": 50.99, "view_count": 8584}, {"video_id": "N9ahW3s7rS0", "question": "At the very beginning of the scene, on the far right side of the map, two small 
figures holding a red shield and a sword in their hands appear. Before these small figures finally appear in front of a smoking house, what changes occur to the feathers on their headbands?", "question_wo_referring_query": ", what changes occur to the feathers on the headbands of these small figures?", "candidates": ["The three black feathers on the headbands have disappeared.", "The three black feathers on the headbands have turned yellow.", "The three black feathers on the headbands have turned into one feather.", "The three black feathers on the headbands have turned into two feathers.", "The three black feathers on the headbands have turned purple."], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "N9ahW3s7rS0_0", "video_path": "N9ahW3s7rS0.mp4", "subtitle_path": "N9ahW3s7rS0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 22, "duration": 53.01, "view_count": 7222}, {"video_id": "HWb-YBxV6jo", "question": "What change occurs to the pale yellow star-shaped dough extruded from the churro maker into the hot oil in the white pot shown on the screen when the subtitle reads 'didn't make a huge difference in ease or efficiency'?", "question_wo_referring_query": "What change occurs to its state?", "candidates": ["It turned into a gas", "It turned green", "It turned black", "It turned into a liquid", "It turned into a solid"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TAA", "level": "L2-Relation", "id": "HWb-YBxV6jo_0", "video_path": "HWb-YBxV6jo.mp4", "subtitle_path": "HWb-YBxV6jo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 183, "duration": 25.03, "view_count": 768709}, {"video_id": "RcPUUgdbbSo", "question": "A long-haired woman is sitting on the right side of the screen, while on the left side of the screen, a man wearing a white short-sleeved shirt and carrying a black backpack is standing. 
In front of him, a person dressed in black, wearing a black hat, and holding a pink suitcase is doing what?", "question_wo_referring_query": "What is the person doing?", "candidates": ["Opening the pink suitcase", "Pushing the pink suitcase in their hand", "Putting the pink suitcase on a vehicle", "Taking off the hat", "Sitting on the pink suitcase"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "RcPUUgdbbSo_0", "video_path": "RcPUUgdbbSo.mp4", "subtitle_path": "RcPUUgdbbSo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 43, "duration": 58.0, "view_count": 12798}, {"video_id": "JA4Z8PdKR58", "question": "In the scene, a woman wearing a jumpsuit and gloves is standing in front of a mirror fixing her hair. What else is present in this scene?", "question_wo_referring_query": "What else is present in this scene?", "candidates": ["Dining table", "Fresh flowers", "Bedside table", "Sofa", "Green plant"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2O", "level": "IntraMoment", "id": "JA4Z8PdKR58_0", "video_path": "JA4Z8PdKR58.mp4", "subtitle_path": "JA4Z8PdKR58_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 73, "duration": 50.0, "view_count": 24645}, {"video_id": "UoRkqoEuzAk", "question": "A pair of hands is touching a white smartphone screen. 
When the subtitle says 'archive well for more on this I'm joined,' what other objects are present on the screen?", "question_wo_referring_query": "What other objects are present on the screen?", "candidates": ["Flower pot", "Fan", "Bicycle", "Keyboard", "Laptop"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "UoRkqoEuzAk_0", "video_path": "UoRkqoEuzAk.mp4", "subtitle_path": "UoRkqoEuzAk_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 6, "duration": 19.0, "view_count": 1241}, {"video_id": "Q2A3ZfG8cco", "question": "On an ancient Roman map, what color is the area labeled 'SELEUKOS' located below the blue sea labeled 'CASPIAN SEA' in white English letters?", "question_wo_referring_query": "What color is it?", "candidates": ["purple", "red", "black", "green", "white"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "Q2A3ZfG8cco_0", "video_path": "Q2A3ZfG8cco.mp4", "subtitle_path": "Q2A3ZfG8cco_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 920, "duration": 50.0, "view_count": 374443}, {"video_id": "aLhB9t-uqWw", "question": "There is a reflection on the lake in the scene. When the subtitle says 'ones here are literally under the,' what color is the surface of the lake?", "question_wo_referring_query": "What color is the surface of the lake?", "candidates": ["red", "blue", "black", "yellow", "purple"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2A", "level": "IntraMoment", "id": "aLhB9t-uqWw_0", "video_path": "aLhB9t-uqWw.mp4", "subtitle_path": "aLhB9t-uqWw_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 293, "duration": 34.0, "view_count": 136971}, {"video_id": "E1kzWqyRtD4", "question": "In a room with green plants, a man wearing black pants is sitting on a gray sofa. 
What is this man doing when the subtitle says 'the 12th century can you imagine an'?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Playing with a mobile phone on the sofa", "Looking at a computer on the sofa", "Putting one foot on the sofa", "Lying on the sofa", "Reading a book on the sofa"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "E1kzWqyRtD4_0", "video_path": "E1kzWqyRtD4.mp4", "subtitle_path": "E1kzWqyRtD4_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 163, "duration": 17.0, "view_count": 9541}, {"video_id": "H8Xwif_sWTA", "question": "After a worker is shown standing on the cockpit of a plane performing maintenance, what happens next?", "question_wo_referring_query": ", what happens next?", "candidates": ["Several black planes are flying in a blue sky", "Several green and black planes are flying over a grassy field", "Several red planes are flying over a grassy field", "Several green and black planes are flying over the ocean", "Several white planes are flying over a grassy field"], "topic_category": "KH-Knowledge-History", "question_category": "E3E", "level": "L2-Relation", "id": "H8Xwif_sWTA_0", "video_path": "H8Xwif_sWTA.mp4", "subtitle_path": "H8Xwif_sWTA_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 397, "duration": 38.0, "view_count": 1449145}, {"video_id": "eLw2VuyaBN0", "question": "After the white net is thrown into the blue ocean on screen, which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["The person on the far left of the white building wearing a grey suit", "A man on the boat wearing an olive-green hat", "A man on the boat wearing a blue shirt", "A man on the boat wearing a pink hat", "A man on the boat wearing a purple shirt"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "eLw2VuyaBN0_0", 
"video_path": "eLw2VuyaBN0.mp4", "subtitle_path": "eLw2VuyaBN0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 144, "duration": 44.0, "view_count": 235136}, {"video_id": "ElWG0_kjy_Y", "question": "In the middle of the dark background, there is a famous painting of Mona Lisa. Before the subtitle says \"Mona Lisa is also rather content and self-assured, which was more how aristocratic men were,\" what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The screen turns white", "The screen turns purple", "The screen turns pink", "The screen turns yellow", "The screen turns black"], "topic_category": "KA-Knowledge-Art", "question_category": "T3E", "level": "L2-Relation", "id": "ElWG0_kjy_Y_0", "video_path": "ElWG0_kjy_Y.mp4", "subtitle_path": "ElWG0_kjy_Y_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 1000, "duration": 42.01, "view_count": 1716887}, {"video_id": "N98CtwsWrr8", "question": "In the video, a black-haired woman wearing a blue top is speaking in front of a blue background. 
Where are the white English words 'SiriusXM' displayed?", "question_wo_referring_query": "Where are the white English words 'SiriusXM' displayed?", "candidates": ["Inside the black rectangle at the bottom left corner of the screen", "Inside the white rectangle at the bottom of the screen", "Inside the purple rectangle at the bottom left corner of the screen", "Inside the yellow rectangle at the bottom left corner of the screen", "Inside the red rectangle at the bottom left corner of the screen"], "topic_category": "NP-News-Programs", "question_category": "SOS", "level": "L2-Relation", "id": "N98CtwsWrr8_0", "video_path": "N98CtwsWrr8.mp4", "subtitle_path": "N98CtwsWrr8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 17, "duration": 16.02, "view_count": 2169}, {"video_id": "kYc6h1lhhpM", "question": "In the kitchen, there is a man wearing a black short-sleeved shirt and a man wearing a white short-sleeved shirt. There is a black pot on a brown wooden board. Which subtitles have appeared along with this black pot?", "question_wo_referring_query": "Which subtitles have appeared along with this black pot?", "candidates": ["going to do is the roasted mushroom and", "tomato so I'll just slice these up some", "that I will not add a lot of because", "shallot roasted garlic and cabine chili", "into a Caston with some"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "kYc6h1lhhpM_0", "video_path": "kYc6h1lhhpM.mp4", "subtitle_path": "kYc6h1lhhpM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 39, "duration": 25.0, "view_count": 526817}, {"video_id": "34gmDYernok", "question": "In a news broadcast room, three people are sitting beside a table with various documents. The person sitting at the far left of the table is a female host wearing a pink suit and glasses. 
When this female host in the pink suit last appears on the screen, what change happens to her?", "question_wo_referring_query": "What change happens to this female host?", "candidates": ["She changed into a green suit", "She took off her glasses", "She changed into a purple suit", "She tied up her hair", "She changed into a black suit"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "34gmDYernok_0", "video_path": "34gmDYernok.mp4", "subtitle_path": "34gmDYernok_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 233, "duration": 22.0, "view_count": 11018}, {"video_id": "iUdr2wKyqck", "question": "On the right side of the screen, a woman with black hair wearing black clothes is speaking in front of the camera, while on the left side of the screen, a female celebrity is standing and being filmed in front of a blue-lit background. When the subtitles mention 'something original without offending', what change occurs in the background color behind the female celebrity?", "question_wo_referring_query": "What change occurs in the background color behind the female celebrity?", "candidates": ["changes to purple", "changes to red", "changes to black", "changes to yellow", "changes to green"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "iUdr2wKyqck_0", "video_path": "iUdr2wKyqck.mp4", "subtitle_path": "iUdr2wKyqck_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 189, "duration": 32.0, "view_count": 14155}, {"video_id": "PYMqes9DHG0", "question": "At the entrance of a store, a man dressed in a black top and black pants is holding a white packaged item in his hand. Next to him stands another man wearing an olive green suit. 
What is the man in black holding the white item doing?", "question_wo_referring_query": "What is the man in black holding the white item doing?", "candidates": ["Drinking coffee", "Buying fruits", "About to enter the store", "Washing dishes", "Driving a car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "PYMqes9DHG0_0", "video_path": "PYMqes9DHG0.mp4", "subtitle_path": "PYMqes9DHG0_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 227, "duration": 52.99, "view_count": 1915692}, {"video_id": "Fghl51o7aVE", "question": "In the scene, a man wearing a green coat with a beard is sitting in front of a mirror talking. What other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["green plant", "black microphone", "refrigerator", "flower pot", "blue microphone"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "Fghl51o7aVE_0", "video_path": "Fghl51o7aVE.mp4", "subtitle_path": "Fghl51o7aVE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 94, "duration": 16.0, "view_count": 147064}, {"video_id": "Q6OXIKIPVT8", "question": "In the video, a curly-haired woman wearing a light blue dress appears at the entrance of a wooden cabin. She places her hands on the wooden door and then gently closes it as she prepares to go outside. 
What style of clothing is the woman wearing?", "question_wo_referring_query": "What style of clothing is the woman wearing?", "candidates": ["Wool coat", "Feather coat", "Dress", "T-shirt", "Suit"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "Q6OXIKIPVT8_0", "video_path": "Q6OXIKIPVT8.mp4", "subtitle_path": "Q6OXIKIPVT8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 5, "duration": 33.03, "view_count": 208372}, {"video_id": "DwuGnwpkOOo", "question": "On the hillside of Xiaoya, a man wearing black pants and carrying a bag is descending. When the subtitles read \u201cas I would like go down this rock I'll just be like...Cool watch,\u201d what color shoes is this man wearing?", "question_wo_referring_query": "What color shoes is this man wearing?", "candidates": ["red", "white", "gray", "black", "blue"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "DwuGnwpkOOo_0", "video_path": "DwuGnwpkOOo.mp4", "subtitle_path": "DwuGnwpkOOo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 175, "duration": 36.0, "view_count": 12772}, {"video_id": "z41FzY428z8", "question": "What did the black-haired woman, who was wearing a black short-sleeve shirt and a ring on her hand, do the first time she appeared in the driver's seat inside the car?", "question_wo_referring_query": "What did she do?", "candidates": ["Looking at her phone", "Drinking a beverage", "Driving the car", "Sleeping", "Washing the car"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "z41FzY428z8_0", "video_path": "z41FzY428z8.mp4", "subtitle_path": "z41FzY428z8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 161, "duration": 45.0, "view_count": 213594}, {"video_id": "21JVHgEO5tE", "question": "There is a black pot in the center of the screen, containing chopped white scallions. 
What happens on the screen when the subtitle says 'Fry~1 minute'?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Green peppers are added to the pot", "Coriander is added to the pot", "A hand is chopping scallions on a cutting board with a knife", "Oil is being poured into the pot", "A hand is stirring the pot with a wooden spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "21JVHgEO5tE_0", "video_path": "21JVHgEO5tE.mp4", "subtitle_path": "21JVHgEO5tE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 10, "duration": 22.0, "view_count": 1531004}, {"video_id": "y4lFKK0b53w", "question": "In the video, there is black text in English reading 'at the age of 69' within a yellow moving frame at the bottom. On the left side, there is a woman dressed in red, and on the right side, before engaging in a conversation with the woman, what action does the man dressed in a black feathered coat perform?", "question_wo_referring_query": "In the video, there is black English text reading 'at the age of 69' within a yellow moving frame at the bottom. 
On the left side, there is a woman dressed in red, and on the right side, before engaging in a conversation with the woman, what action does the man dressed in a black feathered coat perform?", "candidates": ["Walking with a black and white umbrella in his hand", "Walking with a red and white umbrella in his hand", "Walking with a blue and white umbrella in his hand", "Walking with a green and white umbrella in his hand", "Walking with a yellow and white umbrella in his hand"], "topic_category": "NP-News-Programs", "question_category": "E3E", "level": "L2-Relation", "id": "y4lFKK0b53w_0", "video_path": "y4lFKK0b53w.mp4", "subtitle_path": "y4lFKK0b53w_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 15, "duration": 41.0, "view_count": 8646}, {"video_id": "Ts7oORrI7cI", "question": "After a person wearing a red top appears upper body behind the camera, which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["A woman holding a basket and wearing a black short-sleeved shirt", "A woman with long hair", "A man with a crew cut", "A woman holding a basket and wearing a white short-sleeved shirt", "A woman holding two daisies"], "topic_category": "KA-Knowledge-Art", "question_category": "O3O", "level": "L2-Relation", "id": "Ts7oORrI7cI_0", "video_path": "Ts7oORrI7cI.mp4", "subtitle_path": "Ts7oORrI7cI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 107, "duration": 48.01, "view_count": 39411}, {"video_id": "@tiffycooks-7350695886466469126", "question": "A man wearing black glasses and a lab coat is holding chopsticks while sitting in front of a brown wooden table. On the table, there is a bowl of noodles and a metal fork. 
After he says \"Now look at that\" in the subtitles, what happens to this man?", "question_wo_referring_query": "What happens to this man?", "candidates": ["He picks up a metal fork", "A peace sign is shown towards the camera", "A hand makes a thumbs-up gesture towards the camera", "He picks up the noodles from the bowl", "He drinks soup with the metal fork"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3E", "level": "L2-Relation", "id": "@tiffycooks-7350695886466469126_0", "video_path": "7350695886466469126.mp4", "subtitle_path": "7350695886466469126_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 36.83, "view_count": 901222}, {"video_id": "@thatrecipe.us-7321109153286606110", "question": "Which of the following sequences of steps is correct?", "question_wo_referring_query": "Which of the following sequences of steps is correct?", "candidates": ["First, pour the red liquid into the pot, then use your hand to peel the pomegranate on a glass bowl, and finally pour white sugar into the glass bowl containing the pomegranate seeds.", "First, pour white sugar into the glass bowl containing the pomegranate seeds, then use your hand to peel the pomegranate on a glass bowl, and finally pour the red liquid into the pot.", "First, pour the red liquid into the pot, then pour white sugar into the glass bowl containing the pomegranate seeds, and finally use your hand to peel the pomegranate on a glass bowl.", "First, use your hand to peel the pomegranate on a glass bowl, then pour white sugar into the glass bowl containing the pomegranate seeds, and finally pour the red liquid into the pot.", "First, use your hand to peel the pomegranate on a glass bowl, then pour the red liquid into the pot, and finally pour white sugar into the glass bowl containing the pomegranate seeds."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": 
"@thatrecipe.us-7321109153286606110_0", "video_path": "7321109153286606110.mp4", "subtitle_path": "7321109153286606110_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.33, "view_count": 27067}, {"video_id": "@thatrecipe.us-7327402732199955755", "question": "A person wearing white gloves is putting a skewer on a piece of chicken on a cutting board. Where else has this piece of chicken appeared?", "question_wo_referring_query": "Where else has this piece of chicken appeared?", "candidates": ["In the rice cooker", "On the grill", "On the barbecue pit", "In the freezer", "In the microwave"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "@thatrecipe.us-7327402732199955755_0", "video_path": "7327402732199955755.mp4", "subtitle_path": "7327402732199955755_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.47, "view_count": 104534}, {"video_id": "@healthfood-6921794711883615493", "question": "In the screen, a black-haired woman in a white hoodie puts a fork into her mouth. She seems to be enjoying this delicacy. 
Which subtitles appeared together when she put the fork into her mouth?", "question_wo_referring_query": "Which subtitles appeared together when she put the fork into her mouth?", "candidates": ["I cut like a little hat off my potato", "Hi, my name's Priyamand and this is my super easy jack potato recipe", "I grab a large potato and I always get the muddy potatoes because they taste so much better", "Then I pop it in the oven for about an hour and I know it's ready when I can jab it with a knife", "And then I mush mush mush and then I scoop and all the fillings into the skin and I pop the hat back on and then it goes back in the oven"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "TOS", "level": "L2-Relation", "id": "@healthfood-6921794711883615493_0", "video_path": "6921794711883615493.mp4", "subtitle_path": "6921794711883615493_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 35.87, "view_count": 14803}, {"video_id": "@thatrecipe.us-7269175925797817643", "question": "When white milkshake liquid in a glass bowl is poured into a glass bowl containing melted chocolate liquid, what color change occurs to the white liquid?", "question_wo_referring_query": "What color change occurs to the white liquid?", "candidates": ["It turns blue", "It turns green", "It turns black", "It turns red", "It turns yellow"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SAA", "level": "L2-Relation", "id": "@thatrecipe.us-7269175925797817643_0", "video_path": "7269175925797817643.mp4", "subtitle_path": "7269175925797817643_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.4, "view_count": 17975}, {"video_id": "@tiffycooks-7231208737380306182", "question": "In a room with green plants, a woman with black hair is sitting holding chopsticks. There is a glass on a green table in front of her, containing a round ice cube. 
What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["Stirring with chopsticks in the glass", "Pouring the round ice cube into a white dish", "Putting the chopsticks on the table", "Using chopsticks to pick up a round ice cube", "Pouring milk into the glass"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "@tiffycooks-7231208737380306182_0", "video_path": "7231208737380306182.mp4", "subtitle_path": "7231208737380306182_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.53, "view_count": 802863}, {"video_id": "@thatrecipe.us-7294389422991183147", "question": "In the scene, a hand is holding a black seasoning bottle, sprinkling some black pepper powder into a glass bowl containing a white liquid. What other objects are present in this scene?", "question_wo_referring_query": ", what other objects are present in this scene?", "candidates": ["green pepper", "chicken", "green onions", "aromatic greens", "egg yolk"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2O", "level": "IntraMoment", "id": "@thatrecipe.us-7294389422991183147_0", "video_path": "7294389422991183147.mp4", "subtitle_path": "7294389422991183147_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.06, "view_count": 70588}, {"video_id": "@thatrecipe.us-7280717512092798238", "question": "What changes occur in the state of the milk and flour in the glass bowl on the wooden table after mixing them with a wooden spatula?", "question_wo_referring_query": "What changes occur in its state?", "candidates": ["It becomes a liquid", "It turns red", "It becomes a sticky substance", "It becomes a solid", "It becomes a gas"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2A", "level": "IntraMoment", "id": "@thatrecipe.us-7280717512092798238_0", "video_path": "7280717512092798238.mp4", 
"subtitle_path": "7280717512092798238_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 163183}, {"video_id": "@healthfood-7119520332372905262", "question": "After mixing the egg yolk and egg white in the glass bowl with chopsticks, what color change occurs in the bowl when the subtitle says 'Add in your rum egg'?", "question_wo_referring_query": "What color change occurs in the bowl?", "candidates": ["Completely turned yellow", "Completely turned black", "Completely turned green", "Completely turned red", "Completely turned white"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "@healthfood-7119520332372905262_0", "video_path": "7119520332372905262.mp4", "subtitle_path": "7119520332372905262_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 26.0, "view_count": 50547}, {"video_id": "@healthfood-6891734930716413189", "question": "What happened when a few fresh red strawberries appeared in someone's hand for the first time?", "question_wo_referring_query": "What happened?", "candidates": ["They were put into someone's mouth", "They were placed on a white plate", "The strawberries were placed under a faucet for washing", "They were put into a juicer", "They were put on a cutting board"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "@healthfood-6891734930716413189_0", "video_path": "6891734930716413189.mp4", "subtitle_path": "6891734930716413189_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.2, "view_count": 8611}, {"video_id": "@thatrecipe.us-7302531523461336366", "question": "On the table, there's a white plastic bag containing a few pieces of chicken. Above the plastic bag, there's a hand holding a white bowl with yellow egg liquid in it. 
When the subtitle says 'Here's my secret to making some delicious chicken very quickly. Thank you,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["Add scallions to the white bowl with the egg liquid", "Put the white bowl with the egg liquid into the oven", "Pour the egg liquid from the white bowl into the pan", "Pour the egg liquid from the white bowl onto the chicken", "Rub the egg liquid from the white bowl onto the face"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2E", "level": "IntraMoment", "id": "@thatrecipe.us-7302531523461336366_0", "video_path": "7302531523461336366.mp4", "subtitle_path": "7302531523461336366_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.03, "view_count": 33678}, {"video_id": "@kelseyinlondon-7294940232196115744", "question": "On a round table with fresh flowers, there is an orange drink and gourmet food on a white plate. After a person picks up a cup with a green powder on top, what happens on the screen?", "question_wo_referring_query": ", what happens on the screen?", "candidates": ["A person is cutting an egg on the plate with a knife and fork", "A person puts the egg into their mouth with chopsticks", "A person uses chopsticks to pick up the egg from the plate", "A person picks up the white napkin on the table", "A person picks up the orange drink"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E3E", "level": "L2-Relation", "id": "@kelseyinlondon-7294940232196115744_0", "video_path": "7294940232196115744.mp4", "subtitle_path": "7294940232196115744_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.87, "view_count": 300776}, {"video_id": "@jetset_anna-7044132407611624709", "question": "In the top left corner of the screen, a banyan tree appears, and after its shadow is cast on the ground, which of the following objects appears first?", 
"question_wo_referring_query": "Which of the following objects appears first?", "candidates": ["submarine", "airplane", "ocean", "sailboat", "shark"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "@jetset_anna-7044132407611624709_0", "video_path": "7044132407611624709.mp4", "subtitle_path": "7044132407611624709_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.05, "view_count": 4657583}, {"video_id": "@luxtravelbe-7229003172793142554", "question": "A golden sun appears in the middle of a vast empty green field, and after the subtitle says 'I walked across an empty land,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A giraffe walks on the grassland", "A tiger walks on the grassland", "An elephant walks on the grassland", "A giraffe walks on the road", "A lion runs on the grassland"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3E", "level": "L2-Relation", "id": "@luxtravelbe-7229003172793142554_0", "video_path": "7229003172793142554.mp4", "subtitle_path": "7229003172793142554_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.07, "view_count": 10919}, {"video_id": "@placesunleashed-7299927097462557957", "question": "In the distance, under the blue sky, there is a golden building. In the middle of the screen, there is a black and white Taiji (Yin-Yang) symbol. 
After the subtitle says, 'It's not due to religion either, even though it's the founding place of Taoism,' which character appears first on the screen?", "question_wo_referring_query": "Which character appears first on the screen?", "candidates": ["A nurse in a white nurse uniform", "A woman wearing a black short-sleeved shirt", "A man wearing a black short-sleeved shirt", "A soldier wearing a green military uniform", "A man wearing a white short-sleeved shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "@placesunleashed-7299927097462557957_0", "video_path": "7299927097462557957.mp4", "subtitle_path": "7299927097462557957_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.83, "view_count": 18821}, {"video_id": "@luxtravelbe-7257978873088953627", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a room with a bed and a sofa is shown, then a pink sky with clouds appears above the distant sea, and lastly, a round cake with floral decorations is seen.", "First, a pink sky with clouds appears above the distant sea, then a round cake with floral decorations is seen, and lastly, a room with a bed and a sofa is shown.", "First, a pink sky with clouds appears above the distant sea, then a room with a bed and a sofa is shown, and finally a round cake with floral decorations is seen.", "First, a round cake with floral decorations is seen, then a room with a bed and a sofa is shown, and lastly, a pink sky with clouds appears above the distant sea.", "First, a room with a bed and a sofa is shown, then a round cake with floral decorations is seen, and lastly, a pink sky with clouds appears above the distant sea."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": 
"@luxtravelbe-7257978873088953627_0", "video_path": "7257978873088953627.mp4", "subtitle_path": "7257978873088953627_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.93, "view_count": 23540}, {"video_id": "@_eat_sleep_travel_repeat-7292003905557286176", "question": "In front of a green mountain range, a person is walking forward on a distant green field, facing away from the camera. The whole scene gives a sense of tranquility. In which of the following places has this person also appeared?", "question_wo_referring_query": "In which of the following places has this person also appeared?", "candidates": ["On the volcano", "On a wooden bridge over a flowing stream", "In the desert", "On the white snowy ground", "In the kitchen"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "@_eat_sleep_travel_repeat-7292003905557286176_0", "video_path": "7292003905557286176.mp4", "subtitle_path": "7292003905557286176_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 22.93, "view_count": 4860}, {"video_id": "@daiki.shino-7210856063711726853", "question": "Under the dark sky, a rotating wooden horse with bright little lights on a rooftop is slowly moving, occasionally passing by a few pedestrians. 
With which subtitles did this rotating wooden horse appear?", "question_wo_referring_query": "With which subtitles did this rotating wooden horse appear?", "candidates": ["It's our own creation", "And what's so cool is that this whole evening, all our time together shouldn't", "It's so weird", "It's like our time together is just ours", "It must be like a new dream, a new in mind or something"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "@daiki.shino-7210856063711726853_0", "video_path": "7210856063711726853.mp4", "subtitle_path": "7210856063711726853_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 23.4, "view_count": 104414}, {"video_id": "@kelseyinlondon-7107283587463908613", "question": "What change occurred to the color of the woman's clothing when she appeared walking on a green lawn with yellow flowers after walking in front of a house with a triangular roof?", "question_wo_referring_query": "What change occurred to the color of the woman's clothing?", "candidates": ["It changed to green", "It changed to yellow", "It changed to purple", "It changed to white", "It changed to black"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SAA", "level": "L2-Relation", "id": "@kelseyinlondon-7107283587463908613_0", "video_path": "7107283587463908613.mp4", "subtitle_path": "7107283587463908613_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 27.17, "view_count": 133100}, {"video_id": "@luxtravelbe-7285054085957504289", "question": "In the video, there is a woman wearing black sunglasses and a black swimsuit in a blue swimming pool. 
When the subtitle says \u2018Everything we see in Chinese Simplified can be translated as,\u2019 what change occurs to the woman's clothing?", "question_wo_referring_query": "What change occurs to the woman's clothing?", "candidates": ["She changes into a black long-sleeve shirt", "She changes into a green long-sleeve shirt", "She changes into a white short-sleeve shirt", "She changes into a white long-sleeve shirt", "She changes into a white tank top"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TAA", "level": "L2-Relation", "id": "@luxtravelbe-7285054085957504289_0", "video_path": "7285054085957504289.mp4", "subtitle_path": "7285054085957504289_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 28.5, "view_count": 52439}, {"video_id": "@placesunleashed-7306231098491669766", "question": "In the video, there is a girl wearing a blue top and a white skirt who appears on a street with white zebra stripes in front of a house. What is this girl doing?", "question_wo_referring_query": "What is this girl doing?", "candidates": ["She is preparing to get into a car.", "She is preparing to sit on the road.", "She is lying on a street with red, orange, yellow, green, and other various colors.", "She is preparing to ride a bicycle.", "She is walking on a street with red, orange, yellow, green, and other various colors."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2E", "level": "IntraMoment", "id": "@placesunleashed-7306231098491669766_0", "video_path": "7306231098491669766.mp4", "subtitle_path": "7306231098491669766_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.1, "view_count": 4216}, {"video_id": "@placesunleashed-7327385320847068422", "question": "On a small group of islands in the sea, there is a rectangular green soccer field. 
In what subtitles did this green soccer field appear together?", "question_wo_referring_query": "In what subtitles did this green soccer field appear together?", "candidates": ["many Viking artifacts and burial grounds", "So if you somehow make it there, make sure to greet Erling Haaland at the field", "This place in Norway has the best view of the Northern Lights", "The village is also home to many Viking artifacts and burial grounds", "The place is located off the coast in the north of Norway"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "@placesunleashed-7327385320847068422_0", "video_path": "7327385320847068422.mp4", "subtitle_path": "7327385320847068422_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.88, "view_count": 1335}, {"video_id": "@kelseyinlondon-7146610567086673157", "question": "In a white room, there is a bouquet of flowers on a table next to a washing machine. A woman in a white dress is standing beside it, holding a piece of clothing in her hand. What color is the piece of clothing?", "question_wo_referring_query": "What color is the piece of clothing she is holding?", "candidates": ["Blue", "Black", "Purple", "Green", "Yellow"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "@kelseyinlondon-7146610567086673157_0", "video_path": "7146610567086673157.mp4", "subtitle_path": "7146610567086673157_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.87, "view_count": 869808}, {"video_id": "@luxtravelbe-7241628654151683355", "question": "On a car, there are two people seated. One of them is a man in white shorts sitting in the passenger seat, wearing a helmet. 
When the subtitles show 'Thank you', what color helmet is the man in the passenger seat wearing?", "question_wo_referring_query": "What color helmet is the man in the passenger seat wearing?", "candidates": ["Green", "White", "Black", "Purple", "Blue"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "@luxtravelbe-7241628654151683355_0", "video_path": "7241628654151683355.mp4", "subtitle_path": "7241628654151683355_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 17.52, "view_count": 42073}, {"video_id": "@placesunleashed-7328590524481309957", "question": "Who is the person walking in the middle of a narrow path flanked by white carved art pieces, heading towards the main hall?", "question_wo_referring_query": "Who is it?", "candidates": ["A man wearing a gray short-sleeved shirt and black shorts", "A man wearing a white short-sleeved shirt and beige shorts", "A man wearing a red short-sleeved shirt and beige shorts", "A man wearing a gray sweatshirt and beige trousers", "A man wearing a gray short-sleeved shirt and beige shorts"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "@placesunleashed-7328590524481309957_0", "video_path": "7328590524481309957.mp4", "subtitle_path": "7328590524481309957_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 16.45, "view_count": 10898}, {"video_id": "@movie.explained6-7253796559240957186", "question": "In front of a dilapidated house, what did a boy do after hastily taking off his black overcoat?", "question_wo_referring_query": "What did he do?", "candidates": ["Hid in a car", "Crawled into a hole", "Hid under the bed", "Hid on the roof", "Hid in a cabinet"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "@movie.explained6-7253796559240957186_0", "video_path": 
"7253796559240957186.mp4", "subtitle_path": "7253796559240957186_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.73, "view_count": 1564}, {"video_id": "@movie.explained6-7253794118814551298", "question": "After a woman dressed in red and wearing a pearl necklace appears, which of the following characters is the first to enter the scene?", "question_wo_referring_query": "Which of the following characters is the first to enter the scene?", "candidates": ["A character with black hair and red eyes", "A woman wearing a black choker and earrings", "A man with green hair", "A man dressed in a white coat", "A woman lying on the bed with curly hair"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "@movie.explained6-7253794118814551298_0", "video_path": "7253794118814551298.mp4", "subtitle_path": "7253794118814551298_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.56, "view_count": 1978}, {"video_id": "@movie.explained6-7271311433630043394", "question": "In the scene, after a man with glasses and wearing an orange jacket, who is sitting in the back seat of a car, says in the subtitles 'The monster caused the vehicle to shake violently,' what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["The car door is opened", "The people inside the car are shaking", "The car explodes", "Someone gets into the back seat", "The car flips over"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "@movie.explained6-7271311433630043394_0", "video_path": "7271311433630043394.mp4", "subtitle_path": "7271311433630043394_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 59.2, "view_count": 7238}, {"video_id": "@movie.explained6-7263426496725814535", "question": "In the scene where sunlight filters through the tree 
branches and illuminates a tranquil forest, just before the subtitle says, 'Not long after. A few people came to the dark forest. An old lady suddenly appeared in the mist. The kind child told her grandmother the truth. Knowing the purpose of her,' what object appears first on the screen?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["airplane", "train", "ship", "bicycle", "car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "@movie.explained6-7263426496725814535_0", "video_path": "7263426496725814535.mp4", "subtitle_path": "7263426496725814535_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.0, "view_count": 13115}, {"video_id": "@movie.explained6-7252659626720677122", "question": "In the scene, a man wearing a white short-sleeve shirt and a vest is standing in front of a dining table in the living room. Where else has this character appeared?", "question_wo_referring_query": "Where else has this character appeared?", "candidates": ["Football field", "Swimming pool", "Gym", "Bedroom", "Side view"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "@movie.explained6-7252659626720677122_0", "video_path": "7252659626720677122.mp4", "subtitle_path": "7252659626720677122_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 54.37, "view_count": 345}, {"video_id": "@movie.explained6-7253793462833761537", "question": "In the scene, a man wearing a police uniform is looking forward while holding a gun in both hands. 
Along with which subtitles has this gun appeared?", "question_wo_referring_query": ", with which subtitles has this gun appeared?", "candidates": ["Enraged, the old might demon revealed her true form", "begin to absorb life essence from Anna's body", "Anna erupted with a strong will to survive and use her hands to kill John", "In a critical moment", "seizing the opportunity"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "@movie.explained6-7253793462833761537_0", "video_path": "7253793462833761537.mp4", "subtitle_path": "7253793462833761537_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 58.99, "view_count": 1385}, {"video_id": "@movie.explained6-7271158207718952193", "question": "In front of a vast ocean, there are two people with their backs facing the camera. On the far right, there is a woman wearing a red top, and on the far left, there is a man wearing a yellow top. When this man in the yellow top appeared underwater, what change happened to his clothes?", "question_wo_referring_query": "What change happened to his clothes?", "candidates": ["The clothes turned green", "The clothes turned blue", "The clothes turned red", "The clothes changed from dry to wet", "The clothes turned white"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "@movie.explained6-7271158207718952193_0", "video_path": "7271158207718952193.mp4", "subtitle_path": "7271158207718952193_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 37.57, "view_count": 82380}, {"video_id": "@movie.explained6-7261188991548984583", "question": "At the beginning of the video, a man wearing a grey short-sleeved shirt is holding a bottle of wine. 
When the subtitle says 'However, these few short sentences gave Bruce great inspiration,' what changes occur to his clothing?", "question_wo_referring_query": "What changes occur to his clothing?", "candidates": ["He changes into a black hoodie", "He changes into a feathered garment", "He changes into clothes with a hat", "He changes into military attire", "He changes into a suit"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TAA", "level": "L2-Relation", "id": "@movie.explained6-7261188991548984583_0", "video_path": "7261188991548984583.mp4", "subtitle_path": "7261188991548984583_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 47.56, "view_count": 4711}, {"video_id": "@movie.explained6-7259232574977920257", "question": "In a swimming pool, there is a man with black hair who is shirtless. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["Eating cake", "Blowing bubbles", "Taking photos", "Cutting a cake", "Picking someone up"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "@movie.explained6-7259232574977920257_0", "video_path": "7259232574977920257.mp4", "subtitle_path": "7259232574977920257_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.23, "view_count": 3688}, {"video_id": "@movie.explained6-7253796754267671809", "question": "In the scene, a hand wearing a gray glove appears in front of an open box containing some soil beans and canned food. 
What other objects are present in this scene?", "question_wo_referring_query": "What other objects are present in this scene?", "candidates": ["A black mouse", "An eggplant", "A snake", "A white mouse", "A cat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "@movie.explained6-7253796754267671809_0", "video_path": "7253796754267671809.mp4", "subtitle_path": "7253796754267671809_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 53.05, "view_count": 1096}, {"video_id": "@movie.explained6-7275440169925299458", "question": "In the scene, a black statue with a collar appears in front of a building. When the subtitle says 'good times don't last. The Six died accidentally during an operation, and after the Fifth,' what object appears on the screen?", "question_wo_referring_query": "What object appears on the screen?", "candidates": ["sofa", "bed", "flower", "window", "curtain"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "@movie.explained6-7275440169925299458_0", "video_path": "7275440169925299458.mp4", "subtitle_path": "7275440169925299458_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 57.53, "view_count": 3197}, {"video_id": "@movie_summary_-7220158369586220293", "question": "In a room with a painting on the wall, a short-haired woman is holding the shoulder of a small boy and is about to leave through a brown wooden door. 
What color is the shirt of the short-haired woman?", "question_wo_referring_query": "What color is the shirt of the short-haired woman?", "candidates": ["blue", "gray", "yellow", "black", "red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2A", "level": "IntraMoment", "id": "@movie_summary_-7220158369586220293_0", "video_path": "7220158369586220293.mp4", "subtitle_path": "7220158369586220293_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 51.8, "view_count": 26734}, {"video_id": "@jonijawne-7210497582349454597", "question": "On a white sink, a person is forcefully squeezing the water out from a morphing black powder in their hand. When the subtitle says 'Instead of when it's wet or else the makeup will float on top,' what shape does the black powder take?", "question_wo_referring_query": "What shape does the black powder take?", "candidates": ["square", "rectangle", "ladder shape", "triangle", "circle"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "@jonijawne-7210497582349454597_0", "video_path": "7210497582349454597.mp4", "subtitle_path": "7210497582349454597_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.25, "view_count": 47700}, {"video_id": "@kerstinong-7336794603368008962", "question": "In a room, who is the person pushing open a glass door with a black frame?", "question_wo_referring_query": "Who is it?", "candidates": ["A woman wearing a white top and black shorts", "A woman wearing a black top and pink shorts", "A woman wearing a pink long dress", "A woman wearing a white dress", "A woman wearing a white top and pink shorts"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E2O", "level": "IntraMoment", "id": "@kerstinong-7336794603368008962_0", "video_path": "7336794603368008962.mp4", "subtitle_path": "7336794603368008962_en.json", "duration_group": 60, 
"starting_timestamp_for_subtitles": 0, "duration": 23.4, "view_count": 3782}, {"video_id": "@lisolna-7217028174557777158", "question": "In the room, what did the woman with straight hair and wearing a black top do the first time she appeared?", "question_wo_referring_query": "What did she do?", "candidates": ["Brushing her eyebrows", "Applying lipstick", "Wearing earrings", "Curling her eyelashes", "Applying blush"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O2E", "level": "IntraMoment", "id": "@lisolna-7217028174557777158_0", "video_path": "7217028174557777158.mp4", "subtitle_path": "7217028174557777158_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.02, "view_count": 3490}, {"video_id": "@lisolna-7303571556846947616", "question": "In the video, there is a metal container filled with several iron clips. A hand wearing a black sleeve appears from above. What happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A person picks up an iron clip", "A person picks up a pair of chopsticks", "A person reaches for a napkin", "A person uses chopsticks to pick up shredded carrots", "A person uses chopsticks to pick up a dumpling"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2E", "level": "IntraMoment", "id": "@lisolna-7303571556846947616_0", "video_path": "7303571556846947616.mp4", "subtitle_path": "7303571556846947616_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.75, "view_count": 180687}, {"video_id": "@lisolna-7316550085532241185", "question": "Before the subtitle says 'I'm going to make a sandwich with the leftover bread', what object appears first on the screen as a person is using tongs to pick up herbs?", "question_wo_referring_query": "What object appears first on the screen?", "candidates": ["A cutting board", "A bowl", "A round plate", "A square plate", "A fork"], "topic_category": 
"LV-Lifestyle-Life-Vlogs", "question_category": "T3O", "level": "L2-Relation", "id": "@lisolna-7316550085532241185_0", "video_path": "7316550085532241185.mp4", "subtitle_path": "7316550085532241185_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.27, "view_count": 210257}, {"video_id": "@lisolna-7352901177203313953", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a white rectangular box containing seafood is placed on an electronic scale for weighing, then a plate of seafood delicacies is arranged on an olive green dining table, and finally a few black menus and a white round plate are placed on the olive green dining table.", "First, a few black menus and a white round plate are placed on an olive green dining table, then a plate of seafood delicacies is arranged on the olive green dining table, and finally a white rectangular box containing seafood is placed on an electronic scale for weighing.", "First, a few black menus and a white round plate are placed on an olive green dining table, then a white rectangular box containing seafood is placed on an electronic scale for weighing, and finally a plate of seafood delicacies is arranged on the olive green dining table.", "First, a white rectangular box containing seafood is placed on an electronic scale for weighing, then a few black menus and a white round plate are placed on an olive green dining table, and finally a plate of seafood delicacies is arranged on the olive green dining table.", "First, a plate of seafood delicacies is arranged on an olive green dining table, then a few black menus and a white round plate are placed on the olive green dining table, and finally a white rectangular box containing seafood is placed on an electronic scale for weighing."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": 
"L2-Relation", "id": "@lisolna-7352901177203313953_0", "video_path": "7352901177203313953.mp4", "subtitle_path": "7352901177203313953_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.7, "view_count": 70307}, {"video_id": "@lisolna-7289818300576550176", "question": "There are a few pieces of green avocado slices placed on a wooden board. In which of the following places have these pieces of avocado also appeared?", "question_wo_referring_query": "In which of the following places have these pieces of avocado also appeared?", "candidates": ["In a black pot", "In the refrigerator", "In an oven", "On a round plate", "In a blender"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SOS", "level": "L2-Relation", "id": "@lisolna-7289818300576550176_0", "video_path": "7289818300576550176.mp4", "subtitle_path": "7289818300576550176_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.86, "view_count": 47028}, {"video_id": "@lisolna-7194108610824211717", "question": "A woman in the video is using a metal spatula to scoop the flesh of an orange-skinned fruit from her hand. With which subtitles did this spatula appear together?", "question_wo_referring_query": "With which subtitles did this spatula appear together?", "candidates": ["So I've never had this before. I'm gonna try it together with you guys", "And then you start feeling", "One, two, yeah, it breaks", "It's so delicious", "It smells amazing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "@lisolna-7194108610824211717_0", "video_path": "7194108610824211717.mp4", "subtitle_path": "7194108610824211717_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 34.77, "view_count": 8119}, {"video_id": "@lisolna-7188971449074371846", "question": "In the video, a white cotton swab is used to wipe a fingernail. 
When the subtitles say 'And I know that you're worth it, I can't walk away,' what change occurs on the fingernail?", "question_wo_referring_query": "What change occurs on the fingernail?", "candidates": ["A red heart appears on the fingernail of the little finger", "A red heart appears on the fingernail of the ring finger", "A red heart appears on the fingernail of the index finger", "A red heart appears on the fingernail of the middle finger", "A red heart appears on the fingernail of the thumb"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "@lisolna-7188971449074371846_0", "video_path": "7188971449074371846.mp4", "subtitle_path": "7188971449074371846_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 20.03, "view_count": 7889}, {"video_id": "@lisolna-7292045529943526688", "question": "In a bright room, a woman with curly hair, wearing a dark blue top, is sitting. In front of the woman, there is a wooden table with some plates of food on it. What is the woman in the dark blue top doing sitting beside the table?", "question_wo_referring_query": "What is the woman in the dark blue top doing sitting beside the table?", "candidates": ["Eating something", "Drinking a glass of red wine", "Holding her face with both hands, taking a photo", "Stacking the plates on the table", "Making a peace sign with both hands"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2E", "level": "IntraMoment", "id": "@lisolna-7292045529943526688_0", "video_path": "7292045529943526688.mp4", "subtitle_path": "7292045529943526688_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 0, "duration": 60.3, "view_count": 47866}, {"video_id": "NNCvzs3Y_hM", "question": "In a dimly lit room, with various weapons mounted on the walls, a woman with short green hair is carefully examining a gun. 
When the subtitle \"hypertech specialist that gives the team\" appears, what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["a red long-sleeved shirt", "green plants", "a white shirt", "a black case filled with nails", "gold"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "NNCvzs3Y_hM_0", "video_path": "NNCvzs3Y_hM.mp4", "subtitle_path": "NNCvzs3Y_hM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 197, "duration": 44.98, "view_count": 338242}, {"video_id": "s6c35iEqpe8", "question": "On a street with many pedestrians, there is a man in a mascot costume and another man wearing a colorful bow tie around his neck. When the subtitle \"him his phone, and Omar calls Waj. He tells him to surrender just like he will. Unfortunately,\" appears, what hairstyle does the man in the mascot costume have?", "question_wo_referring_query": "What hairstyle does the man in the mascot costume have?", "candidates": ["Brown short hair", "Blonde short hair", "Black buzz cut", "Blue short hair", "Black afro"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2A", "level": "IntraMoment", "id": "s6c35iEqpe8_0", "video_path": "s6c35iEqpe8.mp4", "subtitle_path": "s6c35iEqpe8_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 487, "duration": 30.0, "view_count": 645}, {"video_id": "bVpPiCFIh7c", "question": "In a dark room, two men wearing different colored clothes are fighting. One of them overpowered a corrupt policeman and ran away with a gun. 
Who overpowered the corrupt policeman?", "question_wo_referring_query": "Who overpowered the corrupt policeman?", "candidates": ["John", "David", "Miller", "James", "William"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "bVpPiCFIh7c_0", "video_path": "bVpPiCFIh7c.mp4", "subtitle_path": "bVpPiCFIh7c_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 270, "duration": 50.01, "view_count": 87282}, {"video_id": "DmBdW4QX5_s", "question": "On a clear and sunny day, on a road with many pedestrians, there's a woman wearing a khaki long coat. What is she doing the first time she appears?", "question_wo_referring_query": "What is she doing the first time she appears?", "candidates": ["Running towards and hugging a little boy", "Holding her head and crying", "Buying a lollipop for a little boy", "Holding her face with both hands", "Handing a book bag to a little girl"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "DmBdW4QX5_s_0", "video_path": "DmBdW4QX5_s.mp4", "subtitle_path": "DmBdW4QX5_s_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 363, "duration": 54.99, "view_count": 157018}, {"video_id": "niLJuTjAXLg", "question": "In a scene featuring a floating yellow flag, a man wearing a helmet and with long hair makes a gesture at the back of his head. 
What happens next in the scene?", "question_wo_referring_query": "What happens next in the scene?", "candidates": ["A man and a woman embrace each other", "A piece of red cloth falls from the sky", "A yellow sun descends from the sky", "Countless birds fly out from the sky", "A man wearing a helmet holds his head and cries in pain"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "niLJuTjAXLg_0", "video_path": "niLJuTjAXLg.mp4", "subtitle_path": "niLJuTjAXLg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 905, "duration": 42.01, "view_count": 362026}, {"video_id": "Mn67rZT8CzM", "question": "In the video, which of the following characters appears first?", "question_wo_referring_query": "In the video, which of the following characters appears first?", "candidates": ["The man in a black hoodie", "The man in a gray suit", "The man in a white shirt", "The man in a black suit", "The man in a black shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "Mn67rZT8CzM_0", "video_path": "Mn67rZT8CzM.mp4", "subtitle_path": "Mn67rZT8CzM_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 130, "duration": 21.02, "view_count": 62646}, {"video_id": "sJhHQpLSHFg", "question": "In front of a white door, a woman wearing an off-white coat and clutching her hair is standing. 
After someone beside her says, 'It turned out that the older woman had to move in because her house was flooded,' what happens next on the screen?", "question_wo_referring_query": "What happens next on the screen?", "candidates": ["A woman wearing a dark blue coat walks around the room holding a glass of water.", "A woman with short hair walks around the room holding a gun.", "A woman in a white tracksuit is jumping rope.", "A man in a black hooded jacket is drinking water.", "An elderly person is sitting in a wheelchair resting."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3E", "level": "L2-Relation", "id": "sJhHQpLSHFg_0", "video_path": "sJhHQpLSHFg.mp4", "subtitle_path": "sJhHQpLSHFg_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 322, "duration": 19.99, "view_count": 50707}, {"video_id": "_JgvpYrilys", "question": "In a dimly lit room, there's a chestnut-colored door frame, and a woman with long black hair leans against it. After the line 'and whether or not her mother may survive. The two women walk back to the front hall later,' who is the first character to appear on the screen?", "question_wo_referring_query": "Who is the first character to appear on the screen?", "candidates": ["A woman wearing a black coat", "A woman with long chestnut-colored hair", "A man with short black hair, wearing an army green coat", "A woman wearing a gray coat", "A man with short chestnut hair, wearing a black coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "_JgvpYrilys_0", "video_path": "_JgvpYrilys.mp4", "subtitle_path": "_JgvpYrilys_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 410, "duration": 24.99, "view_count": 258628}, {"video_id": "KvRgXDQnKKo", "question": "A person on the screen is wearing black gloves and holding a peeler and a potato. The potato skin falls onto a brown table. 
What is the person on the screen doing?", "question_wo_referring_query": "A person on the screen is wearing black gloves and holding a peeler and a potato. The potato skin falls onto a brown table. What is the person on the screen doing?", "candidates": ["Peeling a carrot", "Peeling a dragon fruit", "Peeling a potato", "Peeling a pear", "Peeling an apple"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "KvRgXDQnKKo_0", "video_path": "KvRgXDQnKKo.mp4", "subtitle_path": "KvRgXDQnKKo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.6, "view_count": 1399}, {"video_id": "KvRgXDQnKKo", "question": "The person on the screen is wearing black gloves, holding a knife in the left hand, and pressing down on a tofu cake with the right hand. The tofu cake is cut into several pieces. What is the person on the screen doing?", "question_wo_referring_query": "What is the person on the screen doing?", "candidates": ["Baking tofu cake", "Sprinkling seasoning on the tofu cake", "Cutting tofu cake", "Pressing tofu cake", "Steaming tofu cake"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "KvRgXDQnKKo_1", "video_path": "KvRgXDQnKKo.mp4", "subtitle_path": "KvRgXDQnKKo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 227.6, "view_count": 1399}, {"video_id": "TQLBLNTDrLk", "question": "There is an image of a cell with a black-bordered frame in the screen, and the head is filled with white color marked with Y. The background picture is yellow. 
When the subtitle reads: 'an X, so it's more easily disrupted by things like high gravitational forces', what changes occurred to the Y chromosome in the image?", "question_wo_referring_query": "What changes occurred to the Y chromosome in the image?", "candidates": ["The Y chromosome turned green", "The Y chromosome became larger", "The Y chromosome turned black", "The Y chromosome increased", "The Y chromosome became smaller"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "TQLBLNTDrLk_0", "video_path": "TQLBLNTDrLk.mp4", "subtitle_path": "TQLBLNTDrLk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 235.53, "view_count": 151840}, {"video_id": "TQLBLNTDrLk", "question": "The man with brown hair, wearing black-rimmed glasses and a black and white polka dot shirt, is speaking on screen with his hands clasped together. When the subtitle reads \"going to have any better than the rest of us here on the ground,\" how do the man's hands change?", "question_wo_referring_query": "The man with brown hair, wearing black-rimmed glasses and a black and white polka dot shirt, is speaking on screen with his hands clasped together. When the subtitle reads \"going to have any better than the rest of us here on the ground,\" how do the man's hands change?", "candidates": ["He opens his hands", "He clenches his left hand into a fist and opens his right hand", "He clenches his right hand into a fist and opens his left hand", "His left hand rises above his head", "His right hand rises above his head"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "TQLBLNTDrLk_1", "video_path": "TQLBLNTDrLk.mp4", "subtitle_path": "TQLBLNTDrLk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 235.53, "view_count": 151840}, {"video_id": "mWACA1DJHhA", "question": "A man with curly hair appears in front of a blue background. 
He is wearing a black armor and a dark-colored shirt, and he is wearing glasses and a watch. The man\u2019s hands are spread open with palms facing each other. When a white bar appears in front of the man with the palms facing up, what change happens to the man in black armor with curly hair?", "question_wo_referring_query": "What change happens to the man in black armor with curly hair?", "candidates": ["The watch on the man\u2019s hand disappears", "The man\u2019s mouth opens and reveals his teeth", "A hat appears on the man\u2019s head", "A blue belt appears around the man\u2019s waist", "The necklace on the man\u2019s neck turns red"], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "mWACA1DJHhA_0", "video_path": "mWACA1DJHhA.mp4", "subtitle_path": "mWACA1DJHhA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 396.11, "view_count": 160551}, {"video_id": "mWACA1DJHhA", "question": "Eight gears appear in the center of the screen, forming a circle. The gear at the center right is yellow and contains a small circle within it. Below the circle formed by the gears, there is a row of square boxes, one of which (marked with the number 8) is filled in black. 
What change occurs to the yellow gear on the right when a sphere appears in the top left corner and a blue area appears in the top right corner of the screen?", "question_wo_referring_query": "What change occurs to the yellow gear on the right?", "candidates": ["The small circle in the lower left part inside the yellow gear rotates counterclockwise to the upper left position within the gear.", "The small circle in the lower left part inside the yellow gear rotates counterclockwise to the upper right position within the gear.", "The small circle in the lower left part inside the yellow gear rotates counterclockwise to the bottom center position within the gear.", "The small circle in the lower left part inside the yellow gear rotates counterclockwise to the lower right position within the gear.", "The small circle in the lower left part inside the yellow gear disappears."], "topic_category": "KS-Knowledge-STEM", "question_category": "SAA", "level": "L2-Relation", "id": "mWACA1DJHhA_1", "video_path": "mWACA1DJHhA.mp4", "subtitle_path": "mWACA1DJHhA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 396.11, "view_count": 160551}, {"video_id": "sZ8yMYvHc5E", "question": "Other famous artworks and related descriptions appear on the screen, with an image of Claude Monet in the center. 
Which subtitles and Claude Monet's image appear simultaneously?", "question_wo_referring_query": "Which subtitles and Claude Monet's image appear simultaneously?", "candidates": ["We can't even imagine what those would look like.", "art might be the most unique.", "I promise this is about science", "been devastating for him", "in his life"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "sZ8yMYvHc5E_0", "video_path": "sZ8yMYvHc5E.mp4", "subtitle_path": "sZ8yMYvHc5E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.96, "view_count": 615340}, {"video_id": "sZ8yMYvHc5E", "question": "On the left side of the screen, there's a man with brown hair. On the right side, there is a Monet painting of a beehive. Which subtitles appear simultaneously with the Monet beehive image?", "question_wo_referring_query": "Which subtitles appear simultaneously with the Monet beehive image?", "candidates": ["our eyes are most sensitive to yellow light.", "view them through the lens of science.", "But luckily Monet tried to show us", "His blues would have been bluer, his violets violet-er, and his whites?", "Voice-over: You are a beautiful and unique snowflake"], "topic_category": "KS-Knowledge-STEM", "question_category": "TOS", "level": "L2-Relation", "id": "sZ8yMYvHc5E_1", "video_path": "sZ8yMYvHc5E.mp4", "subtitle_path": "sZ8yMYvHc5E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.96, "view_count": 615340}, {"video_id": "CiHhwL3Ui7E", "question": "In the top left of the screen, there is a logo, and below the logo, there is an English title and a line of text. On the right side, there is a diagram consisting of 8 circles. Can you please identify in which other scene this circular diagram appears?", "question_wo_referring_query": "In the top left of the screen, there is a logo, and below the logo, there is an English title and a line of text. 
On the right side, there is a diagram consisting of 8 circles. Can you please identify in which other scene this circular diagram appears?", "candidates": ["In the top left of the screen, there is a logo, and below the logo, there is an English title and a line of text. On the right side, there is a diagram consisting of 8 circles, and on the left side of the circular diagram, there are 5 lines of yellow-highlighted English text.", "In the top left of the screen, there is a logo, and below the logo, there is an English title and a line of text. On the right side, there is a diagram consisting of 8 circles, and on the left side of the circular diagram, there are 5 lines of Chinese text.", "In the top left of the screen, there is a logo, and below the logo, there is an English title and two lines of text. On the right side, there is a diagram consisting of 8 circles, and on the left side of the circular diagram, there are 5 lines of English text.", "In the top left of the screen, there is a logo, and below the logo, there is an English title and a line of text. On the right side, there is a diagram consisting of 8 circles, and on the left side of the circular diagram, there are 6 lines of red-highlighted Chinese text.", "In the top left of the screen, there is a logo, and below the logo, there is an English title and a line of text. On the right side, there is a diagram consisting of 8 circles, and on the left side of the circular diagram, there are 5 lines of red-colored English text."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "CiHhwL3Ui7E_0", "video_path": "CiHhwL3Ui7E.mp4", "subtitle_path": "CiHhwL3Ui7E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.8, "view_count": 2}, {"video_id": "CiHhwL3Ui7E", "question": "The top left corner of the screen features a logo, and below the logo is an English title and one sentence. 
On the right side, there is an illustration comprised of 8 circles. In which scenario does the English sentence 'our networks typically start with random weights' also appear?", "question_wo_referring_query": "In which scenario does the English sentence 'our networks typically start with random weights' also appear?", "candidates": ["The top left corner of the screen features a logo, and below the logo is an English title and two sentences, with 3 lines of English text below that.", "The top left corner of the screen features a logo, and below the logo is an English title and two sentences, with 3 lines of highlighted purple English text below that.", "The top left corner of the screen features a logo, and below the logo is an English title and two sentences, with 5 lines of Chinese text below that.", "The top left corner of the screen features a logo, and below the logo is an English title and two sentences, with 4 lines of red English text below that.", "The top left corner of the screen features a logo, and below the logo is an English title and two sentences, with 5 lines of English text below that."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "SOS", "level": "L2-Relation", "id": "CiHhwL3Ui7E_1", "video_path": "CiHhwL3Ui7E.mp4", "subtitle_path": "CiHhwL3Ui7E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 184.8, "view_count": 2}, {"video_id": "cjeA-0tmsoc", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a group of men setting off for a trip, then a scene of someone controlling a mobile device to play a video, and finally a man in a leather jacket concluding the show with thanks.", "First, a scene of someone controlling a mobile device to play a video, then a man in a leather jacket concluding the show with thanks, and finally a group of men setting off for a 
trip.", "First, a man in a leather jacket concluding the show with thanks, then a group of men setting off for a trip, and finally a scene of someone controlling a mobile device to play a video.", "First, a scene of someone controlling a mobile device to play a video, then a group of men setting off for a trip, and finally a man in a leather jacket concluding the show with thanks.", "First, a man in a leather jacket concluding the show with thanks, then a scene of someone controlling a mobile device to play a video, and finally a group of men setting off for a trip."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "cjeA-0tmsoc_0", "video_path": "cjeA-0tmsoc.mp4", "subtitle_path": "cjeA-0tmsoc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 198.3, "view_count": 809683}, {"video_id": "cjeA-0tmsoc", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, the men walk to the chairs by a patch of grass and take off their glasses. Then, they walk to the chairs by the swimming pool and sit down. Finally, they jump into the water and start filming their swimming.", "First, the men walk to the chairs by a patch of grass and sit down to rest. Then, they walk to the chairs by the swimming pool and sit down. Finally, they jump into the water and start filming their swimming.", "First, the men walk to the chairs by the swimming pool and sit down. Then, they jump into the water and start filming their swimming. Finally, they walk to the chairs by a patch of grass and sit down to rest.", "First, the men jump into the water and start filming their swimming. Then, they walk to the chairs by the swimming pool and sit down. 
Finally, they walk to the chairs by a patch of grass and sit down to rest.", "First, the men jump into the water and start filming their swimming. Then, they walk to the chairs by a patch of grass and sit down to rest. Finally, they walk to the chairs by the swimming pool and sit down."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SSS", "level": "L2-Relation", "id": "cjeA-0tmsoc_1", "video_path": "cjeA-0tmsoc.mp4", "subtitle_path": "cjeA-0tmsoc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 198.3, "view_count": 809683}, {"video_id": "En8HANIpK-g", "question": "In the image, there is a blue 'Shepherd' text in the upper center, Meta AI and its related logo on the right, and five lines of English text at the bottom. The first line is labeled in pink. After the subtitle 'AI tited Shepherd: A Critic for Language Model Generation,' what appears below the blue English text?", "question_wo_referring_query": "What image appears below the blue English text?", "candidates": ["A red-bordered sketch of a dragon head appears", "A yellow-bordered sketch of a dragon head appears", "A green-bordered sketch of a dragon head appears", "A purple-bordered sketch of a dragon head appears", "A black-bordered sketch of a dragon head appears"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "En8HANIpK-g_0", "video_path": "En8HANIpK-g.mp4", "subtitle_path": "En8HANIpK-g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 455.1, "view_count": 647}, {"video_id": "En8HANIpK-g", "question": "In the middle upper section of the image, there is a blue 'Critique a LLM Response' phrase. The upper left part shows 5 lines of English text. The upper right part has two blue arrows and a water dragon head with an image of a yellow-based yang ram inside. The lower left part has a water dragon head and a blue arrow. 
After the subtitle 'there is risk with the suggested investments.' appears, what image appears in the lower left corner?", "question_wo_referring_query": ", what image appears in the lower left corner?", "candidates": ["An image of a rabbit with a hat", "An image of a sheep with glasses", "An image of a dog with a hat", "An image of a dog with glasses", "An image of a sheep with a scarf"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T3O", "level": "L2-Relation", "id": "En8HANIpK-g_1", "video_path": "En8HANIpK-g.mp4", "subtitle_path": "En8HANIpK-g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 455.1, "view_count": 647}, {"video_id": "_34MSahxlS4", "question": "On the left side of the screen, there is an electronic whiteboard with a row of numerical large titles and four columns of alphabetical subtitles. In the top right corner, there are two women giving a lecture, and in the bottom right corner, there is a two-column table with red borders. After the subtitle 'all the possible to concrete numbers so' appears, what happens on the whiteboard?", "question_wo_referring_query": "On the left side of the screen, there is an electronic whiteboard with a row of numerical large titles and four columns of alphabetical subtitles. In the top right corner, there are two women giving a lecture, and in the bottom right corner, there is a two-column table with red borders. 
After the subtitle 'all the possible to concrete numbers so' appears, what happens on the whiteboard?", "candidates": ["After 'c' in the title row, 4d shows n=8", "After 'b' in the title row, 5d shows n=9", "After 'a' in the title row, 3d shows n=3", "After 'a' in the title row, 3d shows l=3", "After 'b' in the title row, 3d shows n=3"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "_34MSahxlS4_0", "video_path": "_34MSahxlS4.mp4", "subtitle_path": "_34MSahxlS4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 457.26, "view_count": 23470}, {"video_id": "_34MSahxlS4", "question": "On the left side of the screen is an electronic whiteboard, which displays a large headline consisting of one row of numbers and 4 subheadings made up of letters arranged into two rows. In the top right corner are two women giving a lecture, and in the bottom right corner is a red-bordered table with two columns. What appears on the whiteboard after the subtitle 'it's' appears?", "question_wo_referring_query": "What appears on the whiteboard?", "candidates": ["A number 5 appears below the subheading 'a'", "A number 9 appears below the subheading 'd'", "A number 8 appears below the subheading 'c'", "A number 3 appears below the subheading 'b'", "A number 3 appears below the subheading 'a'"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "_34MSahxlS4_1", "video_path": "_34MSahxlS4.mp4", "subtitle_path": "_34MSahxlS4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 457.26, "view_count": 23470}, {"video_id": "9HmERWNR7fk", "question": "In the video, there are three women wearing white hats, each with a wooden bucket in front of them. Behind them stand three soldiers dressed in red shirts, white pants, and black hats, holding guns. 
Up to the current frame, who appeared first?", "question_wo_referring_query": "Up to the current frame, who appeared first?", "candidates": ["The first to appear is the soldier wearing a purple shirt, white pants, and a black hat holding a gun.", "The first to appear is the soldier wearing a black shirt, white pants, and a black hat holding a gun.", "The first to appear is the soldier wearing a red shirt, white pants, and a black hat holding a gun.", "The first to appear is the man wearing a red shirt, white pants, and a black hat.", "The first to appear is the soldier wearing a green shirt, black pants, and a black hat holding a gun."], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "9HmERWNR7fk_0", "video_path": "9HmERWNR7fk.mp4", "subtitle_path": "9HmERWNR7fk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 279.33, "view_count": 1423221}, {"video_id": "9HmERWNR7fk", "question": "In the video, there are three men wearing clothes with black and yellow floral patterns, each holding a wooden basin. The man on the left has brown long hair, the man in the middle has black long hair, and the man on the right has burgundy short hair. 
Could you please tell me, as of the current frame, who appeared first?", "question_wo_referring_query": "As of the current frame, who appeared first?", "candidates": ["The man with red long hair wearing clothes with black and white floral patterns", "The man with brown short hair wearing clothes with black and yellow floral patterns", "The man with brown long hair wearing clothes with black and yellow floral patterns", "The man with green long hair wearing clothes with red and black floral patterns", "The man with red long hair wearing clothes with black and yellow floral patterns"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "9HmERWNR7fk_1", "video_path": "9HmERWNR7fk.mp4", "subtitle_path": "9HmERWNR7fk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 279.33, "view_count": 1423221}, {"video_id": "zCikTT1zjfY", "question": "When the subtitle says, 'First, a small gold ring will be fused to the surface around which the granules will be placed,' what is the woman wearing a light blue denim short sleeve shirt, a necklace, a bracelet on her right hand, and holding an empty glass bottle doing with the tweezers in her left hand?", "question_wo_referring_query": "At this time, what is she doing with the tweezers in her left hand?", "candidates": ["Using the tweezers to take out a small gold ring from the bottle", "Using the tweezers to take out an alcohol cotton ball from the bottle", "Using the tweezers to take out a small copper ring from the bottle", "Using the tweezers to take out a small gold nugget from the bottle", "Using the tweezers to take out a gold rod from the bottle"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "zCikTT1zjfY_0", "video_path": "zCikTT1zjfY.mp4", "subtitle_path": "zCikTT1zjfY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 206.61, "view_count": 23430}, {"video_id": 
"zCikTT1zjfY", "question": "When the caption is 'Tiny granules are made by snipping fine gold wire over a charcoal block', two hands appear on the screen, the left hand holding a pair of scissors and the right hand gold wire, what is she doing with the scissors at this time?", "question_wo_referring_query": "What is she doing with the scissors at this time?", "candidates": ["Pinching the gold wire into flower shapes", "Cutting nails", "Cutting iron wire", "Cutting wood", "Cutting the gold wire into pieces"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "zCikTT1zjfY_1", "video_path": "zCikTT1zjfY.mp4", "subtitle_path": "zCikTT1zjfY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 206.61, "view_count": 23430}, {"video_id": "FgFNFVwJx5E", "question": "On the top right corner of the gray table, there is a blue plate with four sausages. In the middle, there is a white cutting board. The person is holding a skewer with sausages in their left hand and a knife in their right hand. 
What happens the first time this knife appears?", "question_wo_referring_query": "What happens the first time this knife appears?", "candidates": ["The person in the video is holding chopsticks and picking up the sausage.", "The person in the video is holding the knife and cutting the sausage.", "The person in the video is holding scissors and cutting the sausage.", "The person in the video is holding tongs and gripping the sausage.", "The person in the video is holding the knife and cutting a carrot."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "FgFNFVwJx5E_0", "video_path": "FgFNFVwJx5E.mp4", "subtitle_path": "FgFNFVwJx5E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.48, "view_count": 1678707}, {"video_id": "FgFNFVwJx5E", "question": "In the video, a person wearing black capri pants is holding a pair of tongs. In front of them are an iron brazier and burning charcoal. On the iron brazier, there is a grill with four wrapped food items. 
What happened the first time the tongs appeared?", "question_wo_referring_query": "What happened the first time the tongs appeared?", "candidates": ["The person in the video is picking up the wrapped food items with the tongs.", "The person in the video is turning the wrapped food items with scissors.", "The person in the video is turning the wrapped food items with the tongs.", "The person in the video is turning the eggplants with the tongs.", "The person in the video is turning the charcoal with the tongs."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O2E", "level": "IntraMoment", "id": "FgFNFVwJx5E_1", "video_path": "FgFNFVwJx5E.mp4", "subtitle_path": "FgFNFVwJx5E_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 228.48, "view_count": 1678707}, {"video_id": "uzDM2Z7cyyM", "question": "When the subtitle reads \u201cI'm working there alia 35 year 35 year,\u201d on the left side of the screen there is a high-rise building along a street with a red billboard visible. On the street, there is a woman with long black hair wearing a teal short-sleeve outfit and carrying a black handbag. In the middle of the screen, there is a man with short black hair wearing black-framed glasses, a white floral short-sleeve shirt, and carrying a crossbody bag. On his right side, there is a hanging curtain. 
What is the color of this curtain?", "question_wo_referring_query": "What is the color of this curtain?", "candidates": ["green", "purple", "white", "black", "red"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "uzDM2Z7cyyM_0", "video_path": "uzDM2Z7cyyM.mp4", "subtitle_path": "uzDM2Z7cyyM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.32, "view_count": 6998}, {"video_id": "uzDM2Z7cyyM", "question": "When the subtitle reads 'strawberry soda I have no idea what it,' a boy with curly hair wearing a white short-sleeve shirt with a maroon backpack is standing in front of the beverage shelf. He's holding a skateboard in his right hand and a pink bottle with a strawberry image and green Thai text in his left hand. What material is this bottle made of?", "question_wo_referring_query": "What material is this bottle made of?", "candidates": ["iron", "silver", "glass", "copper", "plastic"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2A", "level": "IntraMoment", "id": "uzDM2Z7cyyM_1", "video_path": "uzDM2Z7cyyM.mp4", "subtitle_path": "uzDM2Z7cyyM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 439.32, "view_count": 6998}, {"video_id": "DVwcV0nYSeM", "question": "In the scene, there is a cartoon character with long yellow hair wearing gray clothes, holding a baby wrapped in a white sheet. Next to them stands a person with brown headgear, also dressed in gray, in front of a wooden house. There are two trees with white flowers nearby. 
When the subtitle reads 'to the new illyrian king,' what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["A cartoon character with short yellow hair wearing black clothes, holding a baby wrapped in a white sheet.", "A cartoon character with black headgear wearing gray clothes.", "A cartoon character with long yellow hair wearing black clothes, holding a baby wrapped in a white sheet.", "A cartoon character with long yellow hair wearing gray clothes, holding a baby wrapped in a white sheet.", "A cartoon character with brown headgear wearing black clothes."], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "DVwcV0nYSeM_0", "video_path": "DVwcV0nYSeM.mp4", "subtitle_path": "DVwcV0nYSeM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 591.63, "view_count": 8939}, {"video_id": "DVwcV0nYSeM", "question": "There is a yellow-haired cartoon character wearing a brown armor on the screen, holding a weapon in their hand. Next to them, there is a person standing with a gray helmet and gray clothes, in front of a wooden house. There are two trees with white flowers beside them. 
What objects are present on the screen when the subtitle is 'of the situaton sienna overconfident in'?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["A cartoon character wearing a gray armor with a gray helmet, holding a long spear in their left hand", "A yellow-haired cartoon character wearing a gray armor, holding a long spear in their left hand", "A yellow-haired cartoon character wearing a brown armor, holding a long spear in their left hand", "A cartoon character wearing a gray armor with a yellow helmet, holding a long spear in their left hand", "A yellow-haired cartoon character wearing a brown armor, holding a long spear in their left hand"], "topic_category": "KH-Knowledge-History", "question_category": "T2O", "level": "IntraMoment", "id": "DVwcV0nYSeM_1", "video_path": "DVwcV0nYSeM.mp4", "subtitle_path": "DVwcV0nYSeM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 591.63, "view_count": 8939}, {"video_id": "BkSS5lrrd2M", "question": "The top part of the screen shows a blue sky, and there are houses of varying heights on both sides of the road. The screen displays the English phrase 'YOU DON'T WANT TO MISS IT'. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["red traffic light", "white traffic light", "lightning rod", "white cat", "white signpost"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "BkSS5lrrd2M_0", "video_path": "BkSS5lrrd2M.mp4", "subtitle_path": "BkSS5lrrd2M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 312.19, "view_count": 20682}, {"video_id": "BkSS5lrrd2M", "question": "A blond man wearing a brown baseball cap and a black T-shirt is sitting in a car with a black seatbelt on. 
What is not present in this scene?", "question_wo_referring_query": "What is not present in this scene?", "candidates": ["blond man", "black T-shirt", "white-grey baseball cap", "pure white baseball cap", "black seatbelt"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2O", "level": "IntraMoment", "id": "BkSS5lrrd2M_1", "video_path": "BkSS5lrrd2M.mp4", "subtitle_path": "BkSS5lrrd2M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 312.19, "view_count": 20682}, {"video_id": "INlCLmWlojY", "question": "In a white checkered room, a man wearing a black baseball cap and a white T-shirt with a stubbly face, has a black watch on his right hand, holding a microphone in front of the camera. What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He is using both hands to gesture while conducting a band", "He is using both hands to gesture while recording a video", "He is using his left hand to gesture while recording a video", "He is using both hands to gesture while exercising", "He is using both hands to gesture while writing"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "INlCLmWlojY_0", "video_path": "INlCLmWlojY.mp4", "subtitle_path": "INlCLmWlojY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 595.2, "view_count": 3476}, {"video_id": "INlCLmWlojY", "question": "In a white checkered room, a man with a full beard wearing a black baseball cap and a white T-shirt, a black watch on his right hand, holds a microphone in front of him facing the camera. 
What is he doing?", "question_wo_referring_query": "What is he doing?", "candidates": ["He is gesturing with his left hand while exercising.", "He is gesturing with his right hand while recording a video.", "He is dancing with his right hand gesturing.", "He is shadowboxing with both hands.", "He is drawing with his right hand."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "INlCLmWlojY_1", "video_path": "INlCLmWlojY.mp4", "subtitle_path": "INlCLmWlojY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 595.2, "view_count": 3476}, {"video_id": "eGGgHzEq7D4", "question": "A man wearing a black T-shirt and denim shorts is walking barefoot towards the camera on a leaf-covered ground. There is a large tree planted on his left side. When the subtitle reads 'it here they go oh if I really grip it,' what change happens to this person?", "question_wo_referring_query": "What change happens to this person?", "candidates": ["The person turns sideways towards the camera.", "The person puts on a hat.", "The person wears denim pants.", "The person is wearing black shorts.", "The clothes change to a white T-shirt."], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "eGGgHzEq7D4_0", "video_path": "eGGgHzEq7D4.mp4", "subtitle_path": "eGGgHzEq7D4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 316.08, "view_count": 6502}, {"video_id": "eGGgHzEq7D4", "question": "On the screen, the green grass on the left side is next to a red cable car, and on the right side, there is a wooden fence wrapped around by white columns. In the middle, there is a white sprinkler spraying water. 
When the subtitles read 'those in the industry that do a lot of,' what change occurs to the sprinkler on the screen?", "question_wo_referring_query": "What change occurs to the sprinkler on the screen?", "candidates": ["The sprinkler broke", "The sprinkler got longer", "The sprinkler turned into a black hose", "The water flow from the sprinkler decreased", "The sprinkler turned green"], "topic_category": "NP-News-Programs", "question_category": "TAA", "level": "L2-Relation", "id": "eGGgHzEq7D4_1", "video_path": "eGGgHzEq7D4.mp4", "subtitle_path": "eGGgHzEq7D4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 316.08, "view_count": 6502}, {"video_id": "NNbUNlvwGhk", "question": "Two women and one boy appear on the street. The woman in front is wearing a green top and blue jeans. The boy in the middle is holding a paper box and wearing a short-sleeve shirt. The woman at the back is wearing short sleeves and light green pants. There is a rusty handrail on the side of the road. What subtitle appeared with the boy in the short-sleeve shirt in the middle?", "question_wo_referring_query": "What subtitle appeared with the boy in the short-sleeve shirt in the middle?", "candidates": ["immigration truck inside it is full of", "we can walk this way ", "parts of Haiti schools are open but some", "paper and work other cross hoping to", "in here they are about to We Believe be"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "NNbUNlvwGhk_0", "video_path": "NNbUNlvwGhk.mp4", "subtitle_path": "NNbUNlvwGhk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.71, "view_count": 29191}, {"video_id": "NNbUNlvwGhk", "question": "A woman and two men are having a conversation. The woman on the left is wearing a black top, with a green plant beside her. The two men on the right are wearing yellow and gray tops respectively. The man in the yellow top is wearing glasses. 
Behind them is a green wall. What subtitle has the man in the yellow top with glasses appeared with?", "question_wo_referring_query": "What subtitle has the man in the yellow top with glasses appeared with?", "candidates": ["but it's", "we can walk this way \n", "in here they are about to We Believe be", "immigration truck inside it is full of", "of them will not have food and are"], "topic_category": "NP-News-Programs", "question_category": "TOS", "level": "L2-Relation", "id": "NNbUNlvwGhk_1", "video_path": "NNbUNlvwGhk.mp4", "subtitle_path": "NNbUNlvwGhk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 271.71, "view_count": 29191}, {"video_id": "0bTcXe3YjGE", "question": "On a brown wooden cutting board, a person wearing a ring is chopping scallions with a vegetable knife. In the corner of the cutting board, there is still half an uncut scallion. Where else have these chopped scallions appeared?", "question_wo_referring_query": "Where else have these chopped scallions appeared?", "candidates": ["In a white ceramic bowl", "In the pan with cooking oil added", "In a clear glass cup", "In a clear glass bowl", "On a silver ceramic plate"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "0bTcXe3YjGE_0", "video_path": "0bTcXe3YjGE.mp4", "subtitle_path": "0bTcXe3YjGE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 427.68, "view_count": 86087}, {"video_id": "0bTcXe3YjGE", "question": "A man wearing a white shirt is chopping chili peppers on a brown wooden cutting board. The man is wearing a ring, holding the chili pepper with one hand and the knife with the other. There is a yellow lamp behind the man. 
Where else have the chopped chili peppers appeared on the table?", "question_wo_referring_query": "Where else have the chopped chili peppers appeared on the table?", "candidates": ["On a silver ceramic plate", "In a pot with added cooking oil", "In a transparent glass cup", "In a transparent glass bowl", "In a white ceramic bowl"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "0bTcXe3YjGE_1", "video_path": "0bTcXe3YjGE.mp4", "subtitle_path": "0bTcXe3YjGE_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 427.68, "view_count": 86087}, {"video_id": "ph_g5czA2y4", "question": "Under the grey sky, a truck appears on the street. The front of the truck is white, and it is dragging a red can. There are road signs and piles of snow around the street. After the subtitle 'circulating online and was apparently' appears, what object is seen on this snow-covered street?", "question_wo_referring_query": "What object appears on this snow-covered street?", "candidates": ["Bookshelf", "Car", "Utility pole", "Road sign", "Glasses"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "ph_g5czA2y4_0", "video_path": "ph_g5czA2y4.mp4", "subtitle_path": "ph_g5czA2y4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 283.08, "view_count": 318524}, {"video_id": "ph_g5czA2y4", "question": "A man wearing a dark-colored jacket appears on the street, holding a bag in his hand. The street is surrounded by accumulated snow. 
After the subtitle 'has verified that this footage is of a,' what appears on this snow-covered street?", "question_wo_referring_query": "What object appears on this snow-covered street?", "candidates": ["road sign", "utility pole", "glasses", "car", "bookshelf"], "topic_category": "NP-News-Programs", "question_category": "T3O", "level": "L2-Relation", "id": "ph_g5czA2y4_1", "video_path": "ph_g5czA2y4.mp4", "subtitle_path": "ph_g5czA2y4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 283.08, "view_count": 318524}, {"video_id": "qUJ80HBsBdY", "question": "A man wearing glasses appears in front of a purple background. The man is wearing a colorful shirt, with both hands positioned on either side of his waist. After the subtitle 'But for most of human history, heart attacks were basically a mystery' appears, what happens on the purple background?", "question_wo_referring_query": "What happens on the purple background?", "candidates": ["A heart image pops up on the upper right side", "A cartoon character pops up on the upper left side", "An animal image pops up on the upper left side", "A heart image pops up on the upper left side", "A cartoon character pops up on the upper right side"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "qUJ80HBsBdY_0", "video_path": "qUJ80HBsBdY.mp4", "subtitle_path": "qUJ80HBsBdY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 334.29, "view_count": 154125}, {"video_id": "qUJ80HBsBdY", "question": "A bespectacled man appears in front of a purple background. The man is wearing a multicolored shirt and has his hands spread apart. He is also wearing a ring on his hand. 
After the subtitle 'was recognized as the cause of fatal heart attacks,' what does the bespectacled man do?", "question_wo_referring_query": "After the subtitle 'was recognized as the cause of fatal heart attacks,' what does the bespectacled man do?", "candidates": ["Raised his hands", "Put one hand on his hip", "Clenched his fists", "Put his hands on his hips", "Put his hands together"], "topic_category": "KS-Knowledge-STEM", "question_category": "T3E", "level": "L2-Relation", "id": "qUJ80HBsBdY_1", "video_path": "qUJ80HBsBdY.mp4", "subtitle_path": "qUJ80HBsBdY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 334.29, "view_count": 154125}, {"video_id": "XHw0bDa16xA", "question": "What is the first animal to appear in the video?", "question_wo_referring_query": "What is the first animal to appear in the video?", "candidates": ["black cat", "green snake", "black dog", "white sheep", "resting bird"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "XHw0bDa16xA_0", "video_path": "XHw0bDa16xA.mp4", "subtitle_path": "XHw0bDa16xA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.65, "view_count": 820978}, {"video_id": "XHw0bDa16xA", "question": "A girl and a black dog appear on the road. The girl is wearing a hat and a skirt, and she has a backpack on her shoulders. On the left side of the road, there is a tall tree, and on the right side, there is a tree with blooming flowers. 
Which flower does the girl wearing a hat pick up first?", "question_wo_referring_query": "Which flower does the girl wearing a hat pick up first?", "candidates": ["a white flower", "a yellow flower", "a red flower", "a purple flower", "a black flower"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "XHw0bDa16xA_1", "video_path": "XHw0bDa16xA.mp4", "subtitle_path": "XHw0bDa16xA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 525.65, "view_count": 820978}, {"video_id": "ldZ8zXWP-Eg", "question": "When the first light blue cell with 36 colors on the left side of the screen is projected onto the blue cell below, the cell in the second column of the first row changes to dark blue from left to right. What happens to the blue cell below after the upper colorful cell changes to a dark blue cell and is projected?", "question_wo_referring_query": "When the first light blue cell with 36 colors on the left side of the screen is projected onto the blue cell below, the cell in the second column of the first row changes to dark blue from left to right. 
What happens to the blue cell below after the upper colorful cell changes to a dark blue cell and is projected?", "candidates": ["The cell below moves one column to the right, and its color changes to purple", "The cell below moves one column to the right, and its color changes to dark blue", "The cell below moves one column to the left, and its color changes to dark blue", "The cell below moves one row down, and its color changes to dark blue", "The cell below changes to green"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "ldZ8zXWP-Eg_0", "video_path": "ldZ8zXWP-Eg.mp4", "subtitle_path": "ldZ8zXWP-Eg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.27, "view_count": 11}, {"video_id": "ldZ8zXWP-Eg", "question": "When the dark green square in the first of the 25 light green squares on the right side of the screen is projected onto the 49 blue and white squares below, the squares in the 5th, 6th, and 7th rows and the 3rd, 4th, and 5th columns from left to right turn gray. What happens to the gray squares below after the dark green square above moves to the right and then is projected?", "question_wo_referring_query": "When the dark green square in the first of the 25 light green squares on the right side of the screen is projected onto the 49 blue and white squares below, the squares in the 5th, 6th, and 7th rows and the 3rd, 4th, and 5th columns from left to right turn gray. 
What happens to the gray squares below after the dark green square above moves to the right and then is projected?", "candidates": ["The squares in the 5th, 6th, and 7th rows and the 1st, 2nd, and 3rd columns from left to right turned gray.", "The squares in the 5th, 6th, and 7th rows and the 3rd, 4th, and 5th columns from left to right turned gray.", "The squares in the 5th, 6th, and 7th rows and the 5th, 6th, and 7th columns from left to right turned gray.", "The squares in the 1st, 2nd, and 3rd rows and the 3rd, 4th, and 5th columns from left to right turned gray.", "The squares in the 5th, 6th, and 7th rows and the 1st, 2nd, and 3rd columns from left to right turned gray."], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E3E", "level": "L2-Relation", "id": "ldZ8zXWP-Eg_1", "video_path": "ldZ8zXWP-Eg.mp4", "subtitle_path": "ldZ8zXWP-Eg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.27, "view_count": 11}, {"video_id": "nJXsi98pgWc", "question": "A man is in a room, wearing a short-sleeved shirt and jeans. He is standing beside a machine located in a corner of the room. In the center of the room, there is a large glass container with a silver metallic base and dark liquid inside. When the subtitle 'Aha aha' appears, what is this man doing?", "question_wo_referring_query": "What is this man, who is standing in the corner of the room, doing?", "candidates": ["Putting one hand on his waist", "Raising both hands", "Raising his left hand", "Raising his right hand", "Putting both hands on his waist"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "nJXsi98pgWc_0", "video_path": "nJXsi98pgWc.mp4", "subtitle_path": "nJXsi98pgWc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 272.23, "view_count": 7883}, {"video_id": "nJXsi98pgWc", "question": "A woman wearing a purple top appears on the screen. 
She is wearing glasses and a name tag. To her side, there is another woman in a white top, who is also wearing glasses and holding a black folder. When the subtitle 'In the rich man's world' appears, what is the woman in the purple top doing?", "question_wo_referring_query": "What is the woman in the purple top doing?", "candidates": ["Crossing her arms", "Waving her hand", "Eating something", "Nodding", "Drinking water"], "topic_category": "KA-Knowledge-Art", "question_category": "T2E", "level": "IntraMoment", "id": "nJXsi98pgWc_1", "video_path": "nJXsi98pgWc.mp4", "subtitle_path": "nJXsi98pgWc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 272.23, "view_count": 7883}, {"video_id": "UjbGZu3-8cg", "question": "A black car is parked at the street corner, surrounded by police officers and pedestrians. A woman with golden long hair, dressed in black, gets out of the car. What did the golden-haired woman do first after arriving?", "question_wo_referring_query": "What did the golden-haired woman do first after arriving?", "candidates": ["Closed the car door with her hand", "Nodded to a policeman", "Picked up her handbag", "Rested one hand on the car door", "Shook hands with a female officer"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "UjbGZu3-8cg_0", "video_path": "UjbGZu3-8cg.mp4", "subtitle_path": "UjbGZu3-8cg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 187.12, "view_count": 81492}, {"video_id": "UjbGZu3-8cg", "question": "A blonde woman appears on the street, dressed in black attire with a black car parked beside her. There are pedestrians and police officers nearby, including a female police officer with her hair tied at the back and wearing a uniform. 
What was the first thing this female officer did after arriving at the scene?", "question_wo_referring_query": "What was the first thing this female officer did after arriving at the scene?", "candidates": ["Pushed aside a male officer standing in front of the blonde woman", "Placed her hands on her waist", "Shook hands with the blonde woman", "Raised her right hand", "Put her hands into her coat pockets"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "UjbGZu3-8cg_1", "video_path": "UjbGZu3-8cg.mp4", "subtitle_path": "UjbGZu3-8cg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 187.12, "view_count": 81492}, {"video_id": "wr-_WFR_o6Y", "question": "In the bottom right corner of the screen, there is a man wearing glasses who is speaking. In the top left corner of the screen, there is a yellow icon, below which there is a line of English text. Below the English text, there is a picture. In the picture, there is a man in a yellow top positioned in the center, surrounded by tables and chairs. A group of people is sitting around the table in the background. Who is the person raising their hands and dancing in the middle of the screen?", "question_wo_referring_query": "Who is the person raising their hands and dancing in the middle of the screen?", "candidates": ["The man in a yellow top", "The man in a black top", "The woman in a gray top", "The woman in a red short-sleeve top", "The man in a blue suit"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "wr-_WFR_o6Y_0", "video_path": "wr-_WFR_o6Y.mp4", "subtitle_path": "wr-_WFR_o6Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 372.6, "view_count": 45}, {"video_id": "wr-_WFR_o6Y", "question": "In the bottom right corner of the screen, a man is speaking. The man is wearing a suit and glasses. 
In the upper left corner of the screen, there is a yellow icon, with a line of English text below it and six pictures below the text. The pictures include images of robots, humans, and animals. Which individual in the middle of the screen is seated in front of a computer, touching the keyboard?", "question_wo_referring_query": "Which individual in the middle of the screen is seated in front of a computer, touching the keyboard?", "candidates": ["Man wearing short sleeves", "Man in a suit", "Robot", "Dog", "Man wearing glasses"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "E2O", "level": "IntraMoment", "id": "wr-_WFR_o6Y_1", "video_path": "wr-_WFR_o6Y.mp4", "subtitle_path": "wr-_WFR_o6Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 372.6, "view_count": 45}, {"video_id": "e0DBcrfELfQ", "question": "A pancake is spread out on the table, with red and white beef filling scattered on top of it. A hand wearing a black glove is holding a spatula to spread the beef filling. When the subtitle 'Evenly distribute the beef filling' appears, what material is the spatula being held by the hand in the black glove?", "question_wo_referring_query": "What material is the spatula being held by the hand in the black glove?", "candidates": ["Ceramic spatula", "Wooden spatula", "Silver spatula", "Plastic spatula", "Glass spatula"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "e0DBcrfELfQ_0", "video_path": "e0DBcrfELfQ.mp4", "subtitle_path": "e0DBcrfELfQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 362.88, "view_count": 1056259}, {"video_id": "e0DBcrfELfQ", "question": "There is a lump of pancake-like food on a red and brown table. The main color of the food is white. A pair of hands wearing black gloves are using a rolling pin to further process the food. 
When the subtitle 'Use the rolling pin to flatten out the pancakes' appears, what material is the rolling pin made of?", "question_wo_referring_query": "What material is the rolling pin made of?", "candidates": ["Bamboo rolling pin", "Metal rolling pin", "Yellow wooden rolling pin", "Stone rolling pin", "Red wooden rolling pin"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T2A", "level": "IntraMoment", "id": "e0DBcrfELfQ_1", "video_path": "e0DBcrfELfQ.mp4", "subtitle_path": "e0DBcrfELfQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 362.88, "view_count": 1056259}, {"video_id": "Lh1XhKo_b58", "question": "A woman is sitting on a bed, wearing a light-colored top with her hands raised. Behind her, there is a white cabinet with a basket on top, and a white door next to the cabinet. What shape is the tattoo on this woman's forearm?", "question_wo_referring_query": "What shape is the tattoo on the forearm of the woman wearing the light-colored top?", "candidates": ["heart-shaped", "round", "drop-shaped", "triangular", "star-shaped"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "Lh1XhKo_b58_0", "video_path": "Lh1XhKo_b58.mp4", "subtitle_path": "Lh1XhKo_b58_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 452.2, "view_count": 4985}, {"video_id": "Lh1XhKo_b58", "question": "A plate appears on a white table, holding white pumpkins, apples, a paper box, etc. There are two white chairs next to the table, and a white bookshelf in the distance. 
What is the shape of the plate on the white table?", "question_wo_referring_query": "What is the shape of the plate on the white table?", "candidates": ["rectangle", "circle", "oval", "pentagon", "hexagon"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "S2A", "level": "IntraMoment", "id": "Lh1XhKo_b58_1", "video_path": "Lh1XhKo_b58.mp4", "subtitle_path": "Lh1XhKo_b58_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 452.2, "view_count": 4985}, {"video_id": "_iutI-kYv5A", "question": "A lady wearing a red top appears on the screen. The lady is wearing a necklace on her neck. One of the lady's hands is open with all fingers spread out. Behind the lady is a supermarket shelf with bottled and bucketed items on it. What objects are present in this scene?", "question_wo_referring_query": "What objects are present in this scene?", "candidates": ["ring", "watch", "flower pot", "fan", "mobile phone"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "_iutI-kYv5A_0", "video_path": "_iutI-kYv5A.mp4", "subtitle_path": "_iutI-kYv5A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 345.90999999999997, "view_count": 12806}, {"video_id": "_iutI-kYv5A", "question": "A woman wearing a checkered floral pattern is shopping in the supermarket. The woman is holding a shopping basket in her hand. To the left of the woman is a man with a backpack. The man is wearing a light-colored top. In front of the woman are bottled products and price tags. 
What object is present in this scene?", "question_wo_referring_query": "What object is present in this scene?", "candidates": ["flower pot", "mask", "fan", "cell phone", "desk lamp"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "_iutI-kYv5A_1", "video_path": "_iutI-kYv5A.mp4", "subtitle_path": "_iutI-kYv5A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 345.90999999999997, "view_count": 12806}, {"video_id": "rxx046l4IAc", "question": "On the left side of the screen, there is a man wearing a blue shirt with white buttons. On the right side of the screen, there are two men wearing suits, both with neckties, standing in front of a flag with light-colored curtains behind it. What is the man wearing glasses and a suit on the right side of the screen doing?", "question_wo_referring_query": "What is the man wearing glasses and a suit on the right side of the screen doing?", "candidates": ["Crossing arms", "Handshaking", "Bow", "Waving", "Nodding"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "rxx046l4IAc_0", "video_path": "rxx046l4IAc.mp4", "subtitle_path": "rxx046l4IAc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 454.15999999999997, "view_count": 47647}, {"video_id": "rxx046l4IAc", "question": "On the left side of the screen, there is a man wearing a blue shirt with white buttons. On the right side of the screen, there is a man in a black shirt sitting in a wheelchair. Next to the man in the black shirt is a flag. Behind the man in the black shirt, there is a worker wearing a grey short-sleeve shirt. In front of the man in the black shirt, there is a row of speakers. 
What is the worker behind the man in the black shirt doing?", "question_wo_referring_query": "What is the worker behind the man in the black shirt doing?", "candidates": ["Carrying a laptop bag", "Cleaning the floor", "Pushing the wheelchair", "Holding a folder", "Holding the flag"], "topic_category": "NP-News-Programs", "question_category": "S2E", "level": "IntraMoment", "id": "rxx046l4IAc_1", "video_path": "rxx046l4IAc.mp4", "subtitle_path": "rxx046l4IAc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 454.15999999999997, "view_count": 47647}, {"video_id": "67v3Lqtzr-w", "question": "A girl is eating noodles, she is wearing a dark colored short sleeve shirt, holding chopsticks in one hand and a bowl in the other. Behind her is a white curtain. When the subtitle 'oh let me believe that this was gonna be' appears, what changes occur to the hair of the girl in the short sleeves?", "question_wo_referring_query": "What changes occur to the hair of the girl in the short sleeves?", "candidates": ["The girl has cut her hair.", "The girl has braided her hair into twin tails.", "The girl's hair is dyed red.", "The girl's hair is tied up.", "The girl's hair is dyed yellow."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "67v3Lqtzr-w_0", "video_path": "67v3Lqtzr-w.mp4", "subtitle_path": "67v3Lqtzr-w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 423.12, "view_count": 29193}, {"video_id": "67v3Lqtzr-w", "question": "A girl appears in front of a white curtain. She is wearing a dark short-sleeve shirt, holding a fork in one hand and a white bowl in the other. She is wearing accessories on her wrist, and a ring on her finger. 
When the subtitle 'for the rest of my life and I'm also' appears, what changes happen to the girl's arm who is wearing the short-sleeve shirt?", "question_wo_referring_query": "What changes happen to the arm of the girl wearing the short-sleeve shirt?", "candidates": ["A tattoo appears on the girl's wrist", "The accessories on the girl's wrist turn into a pearl bracelet", "The accessories on the girl's wrist turn red", "The accessories on the girl's wrist turn into a watch", "The accessories on the girl's wrist disappear"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "67v3Lqtzr-w_1", "video_path": "67v3Lqtzr-w.mp4", "subtitle_path": "67v3Lqtzr-w_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 423.12, "view_count": 29193}, {"video_id": "o0qFeEZ7Wd8", "question": "On the gray desktop, a pair of hands are writing on a piece of lined paper. The hand holding the paper is wearing a ring, while the other hand is grasping a pen. There is a lit candle in front of the hands, and on the side are some rolled-up materials and a white jar. 
When the candle appears in the top right corner of the screen, what change happens to the hand holding the pen?", "question_wo_referring_query": "When the candle appears in the top right corner of the screen, what change happens to the hand holding the pen?", "candidates": ["The hand changes from holding the pen to making a fist", "The hand changes from holding the pen to spreading all five fingers", "The hand changes from holding the pen to palm facing up", "The hand changes from holding the pen to pointing at the notebook", "The hand changes from holding the pen to standing up with a thumbs-up"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "o0qFeEZ7Wd8_0", "video_path": "o0qFeEZ7Wd8.mp4", "subtitle_path": "o0qFeEZ7Wd8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.76, "view_count": 423913}, {"video_id": "o0qFeEZ7Wd8", "question": "A knight appears on the screen. The knight is riding a horse, wearing armor on his upper body. He holds the reins in one hand and a sword in the other. The hilt of the sword is positioned in front of the knight's chest, with the blade pointing upwards. 
When the knight's back obscures the horse's head, what change occurs in the actions of this sword-wielding knight?", "question_wo_referring_query": ", what change occurs in the actions of this sword-wielding knight?", "candidates": ["The knight steps back with the sword in hand", "The knight switches to holding the sword with both hands", "The knight raises the sword high, moving it from in front of his chest", "The knight thrusts the sword forward", "The knight lowers the sword and dismounts"], "topic_category": "KH-Knowledge-History", "question_category": "SAA", "level": "L2-Relation", "id": "o0qFeEZ7Wd8_1", "video_path": "o0qFeEZ7Wd8.mp4", "subtitle_path": "o0qFeEZ7Wd8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 521.76, "view_count": 423913}, {"video_id": "jd28-FKH9ho", "question": "Under the white sky, a lady appears by a rail door. The lady is wearing a yellow hat and a teal (cyan) scarf while holding a basket in her hand. The area around the rail is overgrown with wild grass. With what caption did this lady wearing a hat appear?", "question_wo_referring_query": "With what caption did this lady wearing a hat appear?", "candidates": ["on multiple walks everyday no matter the weather", "I don't really like it here", "Learn to relax", "Always at home", "change their life is them"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "jd28-FKH9ho_0", "video_path": "jd28-FKH9ho.mp4", "subtitle_path": "jd28-FKH9ho_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.42, "view_count": 332138}, {"video_id": "jd28-FKH9ho", "question": "A black-colored dog is sitting on a wooden floor. The dog's ears are floppy, and it is wearing an accessory on its neck. In the lower left corner of the floor, there is an object with a red handle. 
What caption has this black dog appeared with?", "question_wo_referring_query": "What caption has this black dog appeared with?", "candidates": ["Learn to relax", "on multiple walks everyday no matter the weather", "taking a drive to a nearby park or lake", "I don't really like it here", "Always at home"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TOS", "level": "L2-Relation", "id": "jd28-FKH9ho_1", "video_path": "jd28-FKH9ho.mp4", "subtitle_path": "jd28-FKH9ho_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 590.42, "view_count": 332138}, {"video_id": "ipVw772hCrM", "question": "A picture appears on the screen, showing various flags arranged in six rows and seven columns, totaling forty-two flags. The last flag in the second row is the flag of Afghanistan. Where else has this Afghan flag appeared?", "question_wo_referring_query": "Where else has this Afghan flag appeared?", "candidates": ["On a bench in the park", "In an abandoned factory", "Upper right of a man in dark jacket", "In a sunny square", "In an abandoned hotel"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "ipVw772hCrM_0", "video_path": "ipVw772hCrM.mp4", "subtitle_path": "ipVw772hCrM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 401.65, "view_count": 3758652}, {"video_id": "ipVw772hCrM", "question": "Two men appear on the map; the man on the right is wearing white pants, with his hands clasped in front of his chest. The man on the left is wearing a blue top and a hat, with his hands resting on his hips. The area on the map where they are standing is outlined with colored lines. 
Where else has the man on the right, who is wearing white pants, appeared?", "question_wo_referring_query": "Where else has the man on the right, who is wearing white pants, appeared?", "candidates": ["On a park bench", "In a sunny square", "A larger map with the Turkish flag", "On a chilly street", "In a crowded restaurant"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "ipVw772hCrM_1", "video_path": "ipVw772hCrM.mp4", "subtitle_path": "ipVw772hCrM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 401.65, "view_count": 3758652}, {"video_id": "NZMqV-66b1Y", "question": "What is the correct sequence of the following scenes?", "question_wo_referring_query": "What is the correct sequence of the following scenes?", "candidates": ["First, the woman starts explaining the topic, then makes concluding remarks, and finally gives an introduction.", "First, the woman starts explaining the topic, then gives an introduction, and finally makes concluding remarks.", "First, the woman gives an introduction, then she makes concluding remarks, and finally starts explaining the topic.", "First, the woman makes concluding remarks, then gives an introduction, and finally starts explaining the topic.", "First, the woman gives an introduction, then she starts explaining the topic, and finally makes concluding remarks."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "NZMqV-66b1Y_0", "video_path": "NZMqV-66b1Y.mp4", "subtitle_path": "NZMqV-66b1Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 225.06, "view_count": 98188}, {"video_id": "NZMqV-66b1Y", "question": "During the lady's lecture, which of the following sequences of events is correct?", "question_wo_referring_query": "During the lady's lecture, which of the following sequences of events is correct?", "candidates": ["First, there is an explanation of how to 
determine the speed law, then a mention of another video, and finally an explanation of how to find the k value", "First, there is an explanation of how to find the k value, then an explanation of how to determine the speed law, and finally a mention of another video", "First, there is a mention of another video, then an explanation of how to determine the speed law, and finally an explanation of how to find the k value", "First, there is a mention of another video, then an explanation of how to find the k value, and finally an explanation of how to determine the speed law", "First, there is an explanation of how to determine the speed law, then an explanation of how to find the k value, and finally a mention of another video"], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "NZMqV-66b1Y_1", "video_path": "NZMqV-66b1Y.mp4", "subtitle_path": "NZMqV-66b1Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 225.06, "view_count": 98188}, {"video_id": "NlejS6SCpHg", "question": "A blue sea appears with an island, where the buildings are mainly concentrated along the coast and low sea-level mountainous areas. There are boats on the nearby sea surface. The buildings on the island are primarily red and white. After the subtitle 'Medieval Times now to reach Cinque Terre' appears, what object appears on the ground?", "question_wo_referring_query": "What object appears on the ground?", "candidates": ["Beach chair", "Railroad tracks", "Chapel", "Backpack", "Statue"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "NlejS6SCpHg_0", "video_path": "NlejS6SCpHg.mp4", "subtitle_path": "NlejS6SCpHg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.36, "view_count": 140950}, {"video_id": "NlejS6SCpHg", "question": "A white boat appears only on the blue sea. 
The boat has a yellow object on the top and white waves behind it. In the distance, there is land with yellow and white houses. After the caption 'from the nearby Town such as Rapala' appears, what object appears on the land?", "question_wo_referring_query": "What object appears on the land?", "candidates": ["Lighthouse", "Beach chair", "Statue", "Train", "Backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T3O", "level": "L2-Relation", "id": "NlejS6SCpHg_1", "video_path": "NlejS6SCpHg.mp4", "subtitle_path": "NlejS6SCpHg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.36, "view_count": 140950}, {"video_id": "hA4ZV72wnpI", "question": "The screen shows a map, with a portion of the sea dividing the land into two parts. On the right side of the land, three yellow arrows appear. On the left side of the map, a section near the sea is highlighted. After the subtitle 'the results looks like a backwards rain shadow effect' appears, what happens on the map?", "question_wo_referring_query": "What happens on the map?", "candidates": ["A hunting dog starts running", "A flag starts waving", "A cloud starts raining on the map", "An ant starts crawling", "A goat starts eating grass"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "hA4ZV72wnpI_0", "video_path": "hA4ZV72wnpI.mp4", "subtitle_path": "hA4ZV72wnpI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 576.54, "view_count": 1596592}, {"video_id": "hA4ZV72wnpI", "question": "In the map, four simulated human images appear. Two larger images are on the left side of the map, while the two smaller ones are respectively near the center and the bottom right corner of the map. The image near the center has a white speech bubble above it. 
After the caption 'They's made enemies with pretty much all of their neighbors' appears, what happens on the map?", "question_wo_referring_query": "What happens on the map?", "candidates": ["The four simulated human images start fighting with kicks", "The four simulated human images overlap with each other", "The four simulated human images start shooting each other", "The four simulated human images are holding hands", "The four simulated human images start fighting with fists"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "hA4ZV72wnpI_1", "video_path": "hA4ZV72wnpI.mp4", "subtitle_path": "hA4ZV72wnpI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 576.54, "view_count": 1596592}, {"video_id": "dpw8tAEuy6g", "question": "A blonde woman appears indoors, wearing a wine red top. Beside her is a red inflatable toy. She is surrounded by a dense crowd. What is the first electronic device this blonde woman picks up?", "question_wo_referring_query": "What is the first electronic device this blonde woman picks up?", "candidates": ["Digital camera", "Smartwatch", "Mobile phone", "Earphones", "Smart ring"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "dpw8tAEuy6g_0", "video_path": "dpw8tAEuy6g.mp4", "subtitle_path": "dpw8tAEuy6g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 266.48, "view_count": 45725}, {"video_id": "dpw8tAEuy6g", "question": "A blonde woman appears in a room, wearing a burgundy top. Beside her is a red inflatable toy. The woman is surrounded by a dense crowd of people. 
Who is the first person to pass by this blonde woman?", "question_wo_referring_query": "Who is the first person to pass by this blonde woman?", "candidates": ["A woman with a bag", "A woman with a backpack", "A man with a bag", "A man carrying a yellow handbag", "A man with a backpack"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "dpw8tAEuy6g_1", "video_path": "dpw8tAEuy6g.mp4", "subtitle_path": "dpw8tAEuy6g_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 266.48, "view_count": 45725}, {"video_id": "m__e6_z0U1Q", "question": "A man in a striped short-sleeve shirt is stuffing something into a yellow box. The man is wearing blue shorts and white shoes. Surrounding the man is a woman sitting on a stone bench, who is wearing short sleeves and shorts, and shaking her legs. To the left of the man is a quiet street. What did the man in the striped short-sleeve shirt do after stuffing the items?", "question_wo_referring_query": "What did the man in the striped short-sleeve shirt do after stuffing the items?", "candidates": ["Kicked the yellow box", "Clapped with the woman", "Hugged the woman", "Shook hands with the woman", "Picked up the backpack"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "m__e6_z0U1Q_0", "video_path": "m__e6_z0U1Q.mp4", "subtitle_path": "m__e6_z0U1Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 447.87, "view_count": 127096}, {"video_id": "m__e6_z0U1Q", "question": "A woman in a short-sleeved shirt appears on the screen, wearing a necklace around her neck and holding a paper bag with a bun inside. In front of the woman is a shop counter with two service staff members dressed in black. 
What does this woman in a white short-sleeved shirt do after buying the bun?", "question_wo_referring_query": "What does this woman in a white short-sleeved shirt do after buying the bun?", "candidates": ["Hug a man", "High-five a man", "Shake hands with a man", "Enter another store", "Eat the bun with a man"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "E3E", "level": "L2-Relation", "id": "m__e6_z0U1Q_1", "video_path": "m__e6_z0U1Q.mp4", "subtitle_path": "m__e6_z0U1Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 447.87, "view_count": 127096}, {"video_id": "wVMT3eEBbkQ", "question": "In the bottom right corner, a man in a white short-sleeved shirt is sitting in front of a laptop. In the white background, two cartoon characters appear. They are holding a circle. The cartoon character on the left is wearing a helmet, armor, and a gray badge on his waist. The cartoon character on the right is wearing a striped shirt with a yellow belt on his waist. When the subtitle 'Woah, look at the animation style' appears, what is the cartoon character in the striped shirt on the right doing?", "question_wo_referring_query": "What is the cartoon character in the striped shirt on the right doing?", "candidates": ["Holding up a long sword", "Swinging a long sword", "Swinging a pair of scissors", "Holding a weapon", "Holding up a rifle"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "wVMT3eEBbkQ_0", "video_path": "wVMT3eEBbkQ.mp4", "subtitle_path": "wVMT3eEBbkQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.22, "view_count": 117154}, {"video_id": "wVMT3eEBbkQ", "question": "In the bottom right corner, a man in a white short-sleeve shirt is sitting at a yellow desk with a laptop on it. A team of soldiers, wearing red tops, black pants, and black hats, appears on a white background. 
When the subtitle 'It's like Microsoft Paint style' appears, what is the man in the white short-sleeve shirt in the bottom right corner doing?", "question_wo_referring_query": "What is the man in the white short-sleeve shirt in the bottom right corner doing?", "candidates": ["Placing one hand on his hip", "Placing both hands on his hips", "Crossing his hands in front of his chest", "Raising both hands", "Supporting his chin with his hand"], "topic_category": "KG-Knowledge-Geography", "question_category": "T2E", "level": "IntraMoment", "id": "wVMT3eEBbkQ_1", "video_path": "wVMT3eEBbkQ.mp4", "subtitle_path": "wVMT3eEBbkQ_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 554.22, "view_count": 117154}, {"video_id": "YwnfVj5qeo4", "question": "A guitar is hanging on a white panel. In the room, the walls are equipped with brown cabinets. The cabinet on the right is empty, while the one on the left is full of items. When the subtitle 'with the guitar for its entire time' appears, what is the shape of the panel that the guitar is hanging on?", "question_wo_referring_query": "What is the shape of the panel that the guitar is hanging on?", "candidates": ["Square", "Circle", "Triangle", "Pentagon", "Hexagon"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "YwnfVj5qeo4_0", "video_path": "YwnfVj5qeo4.mp4", "subtitle_path": "YwnfVj5qeo4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 196.06, "view_count": 23809}, {"video_id": "YwnfVj5qeo4", "question": "A man appears in front of a desk. He is wearing a black shirt and has both hands on the desk. There is a white sheet of paper on the desk with a black object on it. 
When the subtitle 'so that I could get to that depth of the knack' appears, what shape is the black object on the white paper?", "question_wo_referring_query": "What shape is the black object on the white paper?", "candidates": ["circle", "square", "triangle", "teardrop", "hexagon"], "topic_category": "KA-Knowledge-Art", "question_category": "T2A", "level": "IntraMoment", "id": "YwnfVj5qeo4_1", "video_path": "YwnfVj5qeo4.mp4", "subtitle_path": "YwnfVj5qeo4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 196.06, "view_count": 23809}, {"video_id": "X0-iI6FzHww", "question": "There are three plates, a fork, a knife, and a spoon on the table. The plate on the right contains food with a long shape, while the middle plate has beef and vegetables. What is the shape of the plate in the middle that contains beef?", "question_wo_referring_query": ", what is the shape of the plate in the middle that contains beef?", "candidates": ["Circle", "Heart-shaped", "Rectangle", "Square", "Hexagon"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "X0-iI6FzHww_0", "video_path": "X0-iI6FzHww.mp4", "subtitle_path": "X0-iI6FzHww_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 502.04, "view_count": 206060}, {"video_id": "X0-iI6FzHww", "question": "In the distance, there are trees without leaves. A man wearing a headscarf is walking on the road. The man is wearing a gray jacket and has a camera hanging around his neck. There are white piles of snow on the road. On both sides of the road, there are fences. There is a house inside the fence on the left side. 
What is the shape of the window of the house inside the fence?", "question_wo_referring_query": "What is the shape of the window of the house inside the fence?", "candidates": ["Square", "Fan-shaped", "Elliptical", "Semi-circular", "Round"], "topic_category": "KH-Knowledge-History", "question_category": "S2A", "level": "IntraMoment", "id": "X0-iI6FzHww_1", "video_path": "X0-iI6FzHww.mp4", "subtitle_path": "X0-iI6FzHww_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 502.04, "view_count": 206060}, {"video_id": "iMqk4weSqh8", "question": "A woman is standing in front of a gym mirror. In the mirror, the woman is wearing a sports bra and black long pants, holding a mobile phone. There are treadmills and other equipment around her. When the subtitle 'Music' appears, what object is present on the woman wearing the sports bra?", "question_wo_referring_query": "What object is present on the woman wearing a sports bra?", "candidates": ["bracelet", "necklace", "earphone", "mechanical watch", "sports watch"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "iMqk4weSqh8_0", "video_path": "iMqk4weSqh8.mp4", "subtitle_path": "iMqk4weSqh8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 497.77, "view_count": 119349}, {"video_id": "iMqk4weSqh8", "question": "A woman appears in the kitchen, wearing a white tank top and light-colored pants. She is holding a glass bottle in one hand and a spatula in the other, stirring the bottle. On the table in front of her, there is a white plate. When the subtitle 'once a week just to switch up my workout' appears, what objects are present in this scene?", "question_wo_referring_query": "A woman appears in the kitchen, wearing a white tank top and light-colored pants. She is holding a glass bottle in one hand and a spatula in the other, stirring the bottle. On the table in front of her, there is a white plate. 
When the subtitle 'once a week just to switch up my workout' appears, what objects are present in this scene?", "candidates": ["potted plant", "ring", "lamp", "watch", "glove"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2O", "level": "IntraMoment", "id": "iMqk4weSqh8_1", "video_path": "iMqk4weSqh8.mp4", "subtitle_path": "iMqk4weSqh8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 497.77, "view_count": 119349}, {"video_id": "oUeb0BUhssU", "question": "In the bottom-right corner of a white PPT slide with black text showing a sigmoid function, there is a man in a blue-gray suit speaking. What objects are present on the screen?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["A yellow triangle", "A black square", "A white dashed box", "A small red dot", "A blue rectangle"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "oUeb0BUhssU_0", "video_path": "oUeb0BUhssU.mp4", "subtitle_path": "oUeb0BUhssU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 291.56, "view_count": 20}, {"video_id": "oUeb0BUhssU", "question": "On a PPT slide titled 'Activation Functions' with black font, there is a man with short black hair and glasses talking. What objects are present in the frame?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["Red rectangle", "Black triangle with letters", "Purple rectangle", "Blue rectangle", "Black circled letters"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2O", "level": "IntraMoment", "id": "oUeb0BUhssU_1", "video_path": "oUeb0BUhssU.mp4", "subtitle_path": "oUeb0BUhssU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 291.56, "view_count": 20}, {"video_id": "K3Q_TqqMH4M", "question": "On a white wall, there is a giant painting hanging. 
A man wearing a white T-shirt and with short hair is standing in front of the painting. What is the man in the white T-shirt doing in front of the painting?", "question_wo_referring_query": "What is the man in the white T-shirt doing in front of the painting?", "candidates": ["Looking at the painting with his left hand resting on his forehead", "Looking at the painting with his right hand behind his back", "Looking at the painting while supporting his chin with his left hand", "Looking at the painting with both hands covering his face", "Looking at the painting with his arms crossed over his chest"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "K3Q_TqqMH4M_0", "video_path": "K3Q_TqqMH4M.mp4", "subtitle_path": "K3Q_TqqMH4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 488.96, "view_count": 301}, {"video_id": "K3Q_TqqMH4M", "question": "In a room with a wooden floor, several people are standing, with some white papers scattered around their feet. There is also a huge painting next to them. What are these people doing in this room with a wooden floor?", "question_wo_referring_query": "What are these people doing in the room with a wooden floor?", "candidates": ["Eating barbecue", "Reading a book", "Listening to music", "Standing shoulder to shoulder, forming a circle", "Dancing together"], "topic_category": "KA-Knowledge-Art", "question_category": "S2E", "level": "IntraMoment", "id": "K3Q_TqqMH4M_1", "video_path": "K3Q_TqqMH4M.mp4", "subtitle_path": "K3Q_TqqMH4M_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 488.96, "view_count": 301}, {"video_id": "MqS21wV75ls", "question": "In a flat area covered with green grass, there are a white goat and some black chickens. 
When the white goat appears on a brown soil area with high straw stalks, what changes occur in the goat's reflection?", "question_wo_referring_query": ", what changes occur in the goat's reflection?", "candidates": ["From a back view to a front view", "From a front view to a back view", "From a front view to a side view", "From a side view to a back view", "From a side view to a front view"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "MqS21wV75ls_0", "video_path": "MqS21wV75ls.mp4", "subtitle_path": "MqS21wV75ls_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.2, "view_count": 3779}, {"video_id": "MqS21wV75ls", "question": "On a square panel with a picture of a rooster, there is white text 'COPAS'. When the white text 'COPAS' appears on the red side panel of a green trailer, what color change occurs to the white text 'COPAS'?", "question_wo_referring_query": "What color change occurs to the white text 'COPAS'?", "candidates": ["Changes from white to purple", "Changes from white to yellow", "Changes from white to blue", "Changes from white to black", "Changes from white to green"], "topic_category": "NP-News-Programs", "question_category": "SAA", "level": "L2-Relation", "id": "MqS21wV75ls_1", "video_path": "MqS21wV75ls.mp4", "subtitle_path": "MqS21wV75ls_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 239.2, "view_count": 3779}, {"video_id": "oYe-QM8ZimI", "question": "In a corner with a red fire extinguisher, there is a woman sitting wearing a black fur coat and with long black hair. 
Which of the following subtitles appeared together with the woman in the black fur coat?", "question_wo_referring_query": "Which of the following subtitles appeared together with the woman in the black fur coat?", "candidates": ["\"I'm terrified of commitment and opening up\"", "\"To me strangely enough,\"", "\"that instilled the fear inside of them.\"", "\"feel my heart right now.\"", "\"or something to happen to them in the past\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "oYe-QM8ZimI_0", "video_path": "oYe-QM8ZimI.mp4", "subtitle_path": "oYe-QM8ZimI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 592.13, "view_count": 3580082}, {"video_id": "oYe-QM8ZimI", "question": "In a car with black seats, there is a man with short hair wearing a dark blue suit sitting in the passenger seat. Which of the following subtitles appear together with the man in the dark blue suit?", "question_wo_referring_query": ", which of the following subtitles appear together with the man in the dark blue suit?", "candidates": ["\"I'll order a plate of snails and you'll try one of them.\"", "\"Yeah, it's really good. 
It comes with pesto.\"", "\"Have you ever eaten snails?\"", "\"I just want the moment to be here.\"", "\"Oh my god...\""], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "TOS", "level": "L2-Relation", "id": "oYe-QM8ZimI_1", "video_path": "oYe-QM8ZimI.mp4", "subtitle_path": "oYe-QM8ZimI_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 592.13, "view_count": 3580082}, {"video_id": "gFY0oSQACVM", "question": "Which of the following sequence of scenes is correct?", "question_wo_referring_query": "Which of the following sequence of scenes is correct?", "candidates": ["First, there\u2019s a scene with the cut sausage inside a gray-black pot; then, there\u2019s a scene where someone is cutting a half onion on a wooden cutting board; finally, there\u2019s a scene where someone is cutting sausage on a wooden cutting board.", "First, there\u2019s a scene with the cut sausage inside a gray-black pot; then, there\u2019s a scene where someone is cutting sausage on a wooden cutting board; finally, there\u2019s a scene where someone is cutting a half onion on a wooden cutting board.", "First, there is a scene where someone is cutting sausage on a wooden cutting board; then, there\u2019s a scene with the cut sausage inside a gray-black pot; finally, there\u2019s a scene with someone cutting a half onion on the same wooden cutting board.", "First, there is a scene where someone is cutting sausage on a wooden cutting board; then, there\u2019s a scene with someone cutting a half onion on a wooden cutting board; finally, there\u2019s a scene with the cut sausage inside a gray-black pot.", "First, there\u2019s a scene where someone is cutting a half onion on a wooden cutting board; then, there\u2019s a scene with the cut sausage inside a gray-black pot; finally, there\u2019s a scene where someone is cutting sausage on a wooden cutting board."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": 
"L2-Relation", "id": "gFY0oSQACVM_0", "video_path": "gFY0oSQACVM.mp4", "subtitle_path": "gFY0oSQACVM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 234.27, "view_count": 369}, {"video_id": "gFY0oSQACVM", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, there is a scene with chopped sausage, green onions, and garlic in a black-gray pot; then there is a scene next to a pile of chopped green onions where someone is cutting garlic with a knife; finally, there is a scene where someone adds hot water to a glass bowl containing noodles.", "First, there is a scene with chopped sausage, green onions, and garlic in a black-gray pot; then there is a scene where someone adds hot water to a glass bowl containing noodles; finally, there is a scene next to a pile of chopped green onions where someone is cutting garlic with a knife.", "First, there is a scene next to a pile of chopped green onions where someone is cutting garlic with a knife; then there is a scene where someone adds hot water to a glass bowl containing noodles; finally, there is a scene with chopped sausage, green onions, and garlic in a black-gray pot.", "First, there is a scene next to a pile of chopped green onions where someone is cutting garlic with a knife; then there is a scene with chopped sausage, green onions, and garlic in a black-gray pot; finally, there is a scene where someone adds hot water to a glass bowl containing noodles.", "First, there is a scene where someone adds hot water to a glass bowl containing noodles; then there is a scene with chopped sausage, green onions, and garlic in a black-gray pot; finally, there is a scene next to a pile of chopped green onions where someone is cutting garlic with a knife."], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SSS", "level": "L2-Relation", "id": "gFY0oSQACVM_1", 
"video_path": "gFY0oSQACVM.mp4", "subtitle_path": "gFY0oSQACVM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 234.27, "view_count": 369}, {"video_id": "bRUAth73f6A", "question": "In a scene with a woodgrain background, there are white letters saying 'cool completely and chill'. After the subtitle '[Applause]' appears, what is the first object that appears on the screen?", "question_wo_referring_query": ", what is the first object that appears on the screen?", "candidates": ["transparent glass cup", "green radish", "clear bowl", "yellow cucumber", "green straw"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "bRUAth73f6A_0", "video_path": "bRUAth73f6A.mp4", "subtitle_path": "bRUAth73f6A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 206.37, "view_count": 184036}, {"video_id": "bRUAth73f6A", "question": "On a white background, there is a blue horizontal stripe with white text 'WATCH MORE' inside. After the subtitle 'oh yes' appears, what is the first object to appear on the screen?", "question_wo_referring_query": "What is the first object to appear on the screen?", "candidates": ["purple can", "green cloth", "yellow carrot", "silver ring", "green spade"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "bRUAth73f6A_1", "video_path": "bRUAth73f6A.mp4", "subtitle_path": "bRUAth73f6A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 206.37, "view_count": 184036}, {"video_id": "WCHRSrGB58A", "question": "In a vast green plain, there's a small river. The plain is covered with many green trees. 
After the voiceover says \"nature but beyond its Scenic Beauty and\", what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A flock of swans swims across a lake", "A herd of antelopes runs across the grassland", "A volcano erupts, spewing billowing black smoke", "A shrinking map of the Earth appears", "A group of wild geese flies south"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "WCHRSrGB58A_0", "video_path": "WCHRSrGB58A.mp4", "subtitle_path": "WCHRSrGB58A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.59, "view_count": 4789}, {"video_id": "WCHRSrGB58A", "question": "In the middle of a plain, there is a blue pond with white steam rising from it. After the narration says 'a hot spot of geological activity', what happens on the screen?", "question_wo_referring_query": "What happens on the screen?", "candidates": ["A herd of antelopes is running across the grassland", "A volcano erupts, emitting swirling black smoke", "A flock of large birds is flying south", "A globe appears and shrinks", "A group of hippos are grazing and resting on the greenish-yellow grassland"], "topic_category": "KG-Knowledge-Geography", "question_category": "T3E", "level": "L2-Relation", "id": "WCHRSrGB58A_1", "video_path": "WCHRSrGB58A.mp4", "subtitle_path": "WCHRSrGB58A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.59, "view_count": 4789}, {"video_id": "OR4b6gvh5as", "question": "In the video, which type of shield appears first?", "question_wo_referring_query": "Which type of shield appears first in the video?", "candidates": ["A round red shield", "A round shield with a V shape", "A yellow rectangular shield", "A round blue shield", "A red rectangular shield with a golden wheat design"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": 
"OR4b6gvh5as_0", "video_path": "OR4b6gvh5as.mp4", "subtitle_path": "OR4b6gvh5as_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.21, "view_count": 7686}, {"video_id": "OR4b6gvh5as", "question": "In the video, which type of animal appears first?", "question_wo_referring_query": "In the video, which type of animal appears first?", "candidates": ["Cow", "Spider", "Snake", "Mouse", "Elephant"], "topic_category": "KH-Knowledge-History", "question_category": "O3O", "level": "L2-Relation", "id": "OR4b6gvh5as_1", "video_path": "OR4b6gvh5as.mp4", "subtitle_path": "OR4b6gvh5as_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 255.21, "view_count": 7686}, {"video_id": "YVKK2z4QKbg", "question": "In a car with black seats, a woman with tattoos on her arms and wearing a dark blue mask is sitting in the driver's seat. After spraying hand sanitizer on her hands, what did the woman wearing the dark blue mask do next?", "question_wo_referring_query": "What did she do next?", "candidates": ["She rubbed her hands together, evenly spreading the hand sanitizer.", "She shook the bottle containing the hand sanitizer.", "She took off her dark blue mask.", "She sprayed some hand sanitizer onto the steering wheel.", "She spread the hand sanitizer evenly on her arms."], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "YVKK2z4QKbg_0", "video_path": "YVKK2z4QKbg.mp4", "subtitle_path": "YVKK2z4QKbg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 273.15, "view_count": 208438}, {"video_id": "YVKK2z4QKbg", "question": "In front of a dark brown door, there's a person wearing white clothes and black gloves, holding a blue cloth to wipe the door handle. 
What did the person wearing black gloves do before wiping the door handle with the blue cloth?", "question_wo_referring_query": "What did the person wearing black gloves do before wiping the door handle with the blue cloth?", "candidates": ["Sprayed water on the door handle", "Adjusted their sleeve", "Closed the dark brown door", "Sprayed alcohol on the door handle", "Sprayed disinfectant on the door handle"], "topic_category": "KS-Knowledge-STEM", "question_category": "E3E", "level": "L2-Relation", "id": "YVKK2z4QKbg_1", "video_path": "YVKK2z4QKbg.mp4", "subtitle_path": "YVKK2z4QKbg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 273.15, "view_count": 208438}, {"video_id": "OSlW3ifjQV0", "question": "In the bottom right corner of a white PPT slide with black text 'Convolution Operation', there is a man in a blue-gray suit. When the subtitle 'convolution app uh the contion operation' appears, what is the man in the blue-gray suit doing?", "question_wo_referring_query": "What is the man in the blue-gray suit doing?", "candidates": ["Rubbed his eye with his right hand", "Touched his head", "Crossed his arms in front of his chest", "Pressed his temples with both hands", "Tapped his face with his right hand a few times"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "OSlW3ifjQV0_0", "video_path": "OSlW3ifjQV0.mp4", "subtitle_path": "OSlW3ifjQV0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 342.04, "view_count": 11}, {"video_id": "OSlW3ifjQV0", "question": "On a white PPT page with a small red dot, there is a man wearing glasses and a blue-grey suit. 
What is the man wearing glasses doing when the caption \"standard uh conion operation between two\" appears?", "question_wo_referring_query": "What is the man wearing glasses doing?", "candidates": ["Raising both hands, palms facing each other", "Crossing both arms in front of his chest", "Using his left hand to touch his collar", "Pressing his temples with both hands", "Using his right hand to touch his ear"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2E", "level": "IntraMoment", "id": "OSlW3ifjQV0_1", "video_path": "OSlW3ifjQV0.mp4", "subtitle_path": "OSlW3ifjQV0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 342.04, "view_count": 11}, {"video_id": "_V0hvU4d7-o", "question": "In a scene with a mountain peak surrounded by clouds and mist, there are endless mountains in the distance with scattered green plants on the mountains. A narrator mentions a certain country that looks particularly like Switzerland. Which country looks particularly like Switzerland?", "question_wo_referring_query": "Which country looks particularly like Switzerland?", "candidates": ["Austria", "Russia", "Germany", "England", "France"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "_V0hvU4d7-o_0", "video_path": "_V0hvU4d7-o.mp4", "subtitle_path": "_V0hvU4d7-o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 516.85, "view_count": 605428}, {"video_id": "_V0hvU4d7-o", "question": "Under a blue sky with white clouds, there are many houses of different colors surrounded by green trees. The narrator is discussing one of his favorite towns. 
Which town is it?", "question_wo_referring_query": "Which town is the narrator's current favorite?", "candidates": ["Salzburg", "Florence", "Giethoorn", "Suzdal", "Heidelberg"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "E2O", "level": "IntraMoment", "id": "_V0hvU4d7-o_1", "video_path": "_V0hvU4d7-o.mp4", "subtitle_path": "_V0hvU4d7-o_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 516.85, "view_count": 605428}, {"video_id": "QZE-WunN9Hs", "question": "Next to a wooden-colored desk, there is a person wearing a purple long-sleeved shirt using a pair of scissors to cut paper. When the '[music]' subtitle appears for the first time, what color are the scissors that the person in the purple long-sleeved shirt is holding?", "question_wo_referring_query": "What color are the scissors that the person in the purple long-sleeved shirt is holding?", "candidates": ["Green", "Purple", "Red", "Blue", "Yellow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "QZE-WunN9Hs_0", "video_path": "QZE-WunN9Hs.mp4", "subtitle_path": "QZE-WunN9Hs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 421.47, "view_count": 2456083}, {"video_id": "QZE-WunN9Hs", "question": "On a wooden table, there is a notebook. A person wearing a purple long-sleeve shirt is holding a sticker and attaching it to a blank notebook. 
When the subtitle \"[music]\" appears for the last time, what color is the notebook on the wooden table?", "question_wo_referring_query": "What color is the notebook on the wooden table?", "candidates": ["green", "red", "black", "blue", "purple"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2A", "level": "IntraMoment", "id": "QZE-WunN9Hs_1", "video_path": "QZE-WunN9Hs.mp4", "subtitle_path": "QZE-WunN9Hs_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 421.47, "view_count": 2456083}, {"video_id": "l8k4WiMITXw", "question": "In a car with black seats, there is a man wearing black sunglasses. When the subtitle 'we take a beaut' appears, what objects are present in the frame?", "question_wo_referring_query": "What objects are present in the frame?", "candidates": ["a yellow sunflower", "a yellow truck", "a purple peacock tail", "a white swan", "a yellow accessory"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "l8k4WiMITXw_0", "video_path": "l8k4WiMITXw.mp4", "subtitle_path": "l8k4WiMITXw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.83, "view_count": 5818}, {"video_id": "l8k4WiMITXw", "question": "In a bright room, there is a woman wearing a blue-green headscarf leaning against the wall. 
What objects are present on the screen when the caption 'notice that for cosmetic is a man' appears?", "question_wo_referring_query": "What objects are present on the screen?", "candidates": ["blue pants", "white feathered garment", "green plant", "yellow sunflower", "white bowl"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T2O", "level": "IntraMoment", "id": "l8k4WiMITXw_1", "video_path": "l8k4WiMITXw.mp4", "subtitle_path": "l8k4WiMITXw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 293.83, "view_count": 5818}, {"video_id": "OwsBi-xeZ50", "question": "In a scene with a blurry human figure in the background, there is a woman standing and wearing a black top, holding yellow and white papers in her hand. Next to her is a man in a dark red top looking at her. What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Green plant", "Blue hat", "White earphones", "Red balloons", "Black rods"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "OwsBi-xeZ50_0", "video_path": "OwsBi-xeZ50.mp4", "subtitle_path": "OwsBi-xeZ50_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 240.96, "view_count": 23029}, {"video_id": "OwsBi-xeZ50", "question": "In a scene with white text 'Indian opposition protest arrest of leader', there are many yellow flags and standing people. 
What objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["Green plant", "Yellow incense burner", "Black electric fan", "Red banner", "Yellow sunflower"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "OwsBi-xeZ50_1", "video_path": "OwsBi-xeZ50.mp4", "subtitle_path": "OwsBi-xeZ50_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 240.96, "view_count": 23029}, {"video_id": "eROy3BrqEVk", "question": "In a screen with a document page as the background, a man with sparse hair wearing sunglasses appears in the lower left corner of the screen. What is the man wearing sunglasses doing in the lower left corner of the screen?", "question_wo_referring_query": "What is the man wearing sunglasses doing in the lower left corner of the screen?", "candidates": ["Touching his head", "Touching his chin with his right hand", "Touching his forehead with his left hand", "Making a gesture with his left hand", "Taking off the sunglasses from his eyes"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "eROy3BrqEVk_0", "video_path": "eROy3BrqEVk.mp4", "subtitle_path": "eROy3BrqEVk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 530, "duration": 431.0, "view_count": 17636}, {"video_id": "eROy3BrqEVk", "question": "In a scene with a black document page as the background, a man with sparse hair and wearing sunglasses appears in the lower right corner of the screen. Next to him, the document contains two pictures of glass boxes filled with water. 
What is the man with sparse hair in the lower right corner of the screen doing?", "question_wo_referring_query": "What is the man with sparse hair in the lower right corner of the screen doing?", "candidates": ["Touching his head with his hand", "Holding his face with both hands", "Gesturing with both hands in front of his chest while explaining", "Covering his mouth and laughing quietly", "Clapping happily"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "S2E", "level": "IntraMoment", "id": "eROy3BrqEVk_1", "video_path": "eROy3BrqEVk.mp4", "subtitle_path": "eROy3BrqEVk_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 530, "duration": 431.0, "view_count": 17636}, {"video_id": "5c4eWe2-F6U", "question": "In a room with a red brick sofa, a woman wearing a white long sleeve shirt and headphones is cleaning. What change happened to the woman wearing a white long sleeve shirt when the subtitle 'want it to be a nice clean place for' appeared?", "question_wo_referring_query": "What change happened to the woman wearing a white long sleeve shirt?", "candidates": ["Changed into a white nightgown", "Changed into a black T-shirt", "Changed into a blue hoodie", "Put on a white baseball cap", "Took off the headphones she was wearing"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "5c4eWe2-F6U_0", "video_path": "5c4eWe2-F6U.mp4", "subtitle_path": "5c4eWe2-F6U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 512.58, "view_count": 111007}, {"video_id": "5c4eWe2-F6U", "question": "In front of a yellow-green background, there is a girl wearing glasses and a black jacket. 
What change occurs to her when the subtitle 'practice' appears?", "question_wo_referring_query": "What change occurs to the girl in the black jacket?", "candidates": ["She puts on a black headset and changes into a black T-shirt.", "She changes into a blue dress.", "She puts on a yellow headset.", "She changes into a white T-shirt and puts on a gray headset.", "She changes into a blue hooded raincoat."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "TAA", "level": "L2-Relation", "id": "5c4eWe2-F6U_1", "video_path": "5c4eWe2-F6U.mp4", "subtitle_path": "5c4eWe2-F6U_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 512.58, "view_count": 111007}, {"video_id": "ia5vxCSiwm4", "question": "In an open field covered with many green plants and yellow dry grass, there is a fox hunting for food. The fox jumps particularly high. In which of the following scenes has such a high-jumping fox appeared?", "question_wo_referring_query": "In which of the following scenes has such a high-jumping fox appeared?", "candidates": ["In a room with red sand", "In a zoo with many tourists", "In a quiet park", "On a snow-covered ground", "In the shrubs by the water"], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "ia5vxCSiwm4_0", "video_path": "ia5vxCSiwm4.mp4", "subtitle_path": "ia5vxCSiwm4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 298.51, "view_count": 102980}, {"video_id": "ia5vxCSiwm4", "question": "On a deep blue background, there are two white arrows. In the bottom right corner of the screen, there is also a green SCI logo. 
In which of the following scenes do the white arrows appear?", "question_wo_referring_query": "In which of the following scenes do the white arrows appear?", "candidates": ["In a scene where two men are fighting.", "In a scene where a woman is singing.", "In a scene where a boy is stacking blocks.", "In a scene with a red magnet.", "In a scene where a woman in a white dress is dancing."], "topic_category": "KS-Knowledge-STEM", "question_category": "SOS", "level": "L2-Relation", "id": "ia5vxCSiwm4_1", "video_path": "ia5vxCSiwm4.mp4", "subtitle_path": "ia5vxCSiwm4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 298.51, "view_count": 102980}, {"video_id": "49wFX248sMw", "question": "On a wooden table, there is a square white plate, with tower-shaped golden foods placed on it. What is the first object to appear in the frame after the word \"you\" appears in the subtitles?", "question_wo_referring_query": "What is the first object to appear in the frame?", "candidates": ["Kiwifruit", "Pumpkin", "Orange carrot", "Sausage", "Mango cake roll"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "49wFX248sMw_0", "video_path": "49wFX248sMw.mp4", "subtitle_path": "49wFX248sMw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.4, "view_count": 63419}, {"video_id": "49wFX248sMw", "question": "On a white plate, there are some biscuits, with heart-shaped sauce in the middle of the biscuits. A hand wearing a ring appears on the plate. 
After the subtitle 'Ting Ting Ting' shows up, what is the first item that appears on the screen?", "question_wo_referring_query": "What is the first item that appears on the screen?", "candidates": ["Cookies", "Carrots", "Pumpkin", "Walnuts", "Sausages"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "T3O", "level": "L2-Relation", "id": "49wFX248sMw_1", "video_path": "49wFX248sMw.mp4", "subtitle_path": "49wFX248sMw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 229.4, "view_count": 63419}, {"video_id": "J-Xl-4wEhu0", "question": "Under the gray sky lies a seemingly endless mountain range, with some mountains covered in dense forests. After the phrase 'nestled among the mountain slopes where' is mentioned, what happens in the video?", "question_wo_referring_query": "What happens in the video at this point?", "candidates": ["A woman kneels on a rock and uses her hands to scoop a handful of stream water to drink.", "A white dog is chasing a butterfly.", "A man sits on a gray boulder.", "Hail smashes the car window.", "A woman is dancing."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "J-Xl-4wEhu0_0", "video_path": "J-Xl-4wEhu0.mp4", "subtitle_path": "J-Xl-4wEhu0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 352.8, "view_count": 200855}, {"video_id": "J-Xl-4wEhu0", "question": "In a waterscape surrounded by towering trees, there is a woman with her hair tied up, wearing earrings. 
After she says 'in spirit light anticipating the sunset,' what happens next in the scene?", "question_wo_referring_query": "What happens next in the scene?", "candidates": ["A woman wearing deep blue jeans walks on the yellow-green grass", "A woman wearing a gray-green top walking in the water", "A woman with her hair tied up walking in the water", "A woman wearing a gray-green top picking something", "A woman with her hair tied up standing in the water"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "T3E", "level": "L2-Relation", "id": "J-Xl-4wEhu0_1", "video_path": "J-Xl-4wEhu0.mp4", "subtitle_path": "J-Xl-4wEhu0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 352.8, "view_count": 200855}, {"video_id": "NWbWKFhEYp8", "question": "In the video, who introduces themselves first?", "question_wo_referring_query": "In the video, who introduces themselves first?", "candidates": ["David", "Diane", "Olivia", "Stephanie", "Michael"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "NWbWKFhEYp8_0", "video_path": "NWbWKFhEYp8.mp4", "subtitle_path": "NWbWKFhEYp8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 563.19, "view_count": 2298047}, {"video_id": "NWbWKFhEYp8", "question": "Which food item is mentioned first in the video?", "question_wo_referring_query": "Which food item is mentioned first in the video?", "candidates": ["potato", "tomato", "Arepas", "sausage", "cucumber"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "O3O", "level": "L2-Relation", "id": "NWbWKFhEYp8_1", "video_path": "NWbWKFhEYp8.mp4", "subtitle_path": "NWbWKFhEYp8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 563.19, "view_count": 2298047}, {"video_id": "wUuZ4G_Ydhc", "question": "In a black background, there is a man with short hair, holding a clapboard. 
After the clapboard is closed, what does the man with short hair do?", "question_wo_referring_query": "What does the man with short hair do?", "candidates": ["Speaking to the camera", "Adjusted his clothes", "Touched his hair", "Rubbed his eyes with his hand", "Touched his face with his left hand"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "wUuZ4G_Ydhc_0", "video_path": "wUuZ4G_Ydhc.mp4", "subtitle_path": "wUuZ4G_Ydhc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.04, "view_count": 2407}, {"video_id": "wUuZ4G_Ydhc", "question": "In a black background, there is a white-haired man facing the camera. After mentioning that the idea of same-sex relationships among roosters is a terrifying thought, what is mentioned next?", "question_wo_referring_query": "What is mentioned next?", "candidates": ["Explain that this is just a fictional story", "Discuss medical implementations", "Give examples of this situation in real life", "Explain the necessity of their existence"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "wUuZ4G_Ydhc_1", "video_path": "wUuZ4G_Ydhc.mp4", "subtitle_path": "wUuZ4G_Ydhc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 207.04, "view_count": 2407}, {"video_id": "bkgi83TAe3Q", "question": "In a room with a gray sofa, there is a man wearing a black short-sleeve shirt. In front of the man, there is a black cabinet with a television on top of it. 
When the subtitle 'through and you find the specific person' appears, what is the man in the black short-sleeve shirt doing?", "question_wo_referring_query": "What is the man wearing a black short-sleeve shirt doing?", "candidates": ["Leaning on the gray sofa with his legs on the black coffee table", "Crossing his arms over his chest", "Holding a remote control, changing the channel", "Propping his chin with his hand, watching TV", "Lying on the sofa sleeping"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "bkgi83TAe3Q_0", "video_path": "bkgi83TAe3Q.mp4", "subtitle_path": "bkgi83TAe3Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.7, "view_count": 83854}, {"video_id": "bkgi83TAe3Q", "question": "In a space with many trees and benches, a man wearing a grey and white short sleeve shirt is holding a white selfie stick. When the subtitle 'bad boy right here' appears, what is the man wearing the white short sleeve shirt doing?", "question_wo_referring_query": "What is the man wearing a white short sleeve shirt doing?", "candidates": ["Holding a selfie stick with a phone attached, filming himself", "Crouching on the ground to take photos", "Holding a selfie stick with a phone attached, filming the surrounding scenery", "Attaching the phone to the selfie stick", "Opening a black backpack"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "T2E", "level": "IntraMoment", "id": "bkgi83TAe3Q_1", "video_path": "bkgi83TAe3Q.mp4", "subtitle_path": "bkgi83TAe3Q_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.7, "view_count": 83854}, {"video_id": "0BwnYerHb6Y", "question": "On a colorful stage, there are five beautiful women standing. Next to the five beautiful women, there's a man wearing a black hat. 
What was the man wearing a black hat doing when he first appeared?", "question_wo_referring_query": "What was the man wearing a black hat doing when he first appeared?", "candidates": ["Waving at the five women", "Squatting down to tie his shoelace", "Holding his forehead with his hand", "Occasionally looking down at his notes while talking", "Bowing to the audience"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "0BwnYerHb6Y_0", "video_path": "0BwnYerHb6Y.mp4", "subtitle_path": "0BwnYerHb6Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 209.94, "view_count": 20194}, {"video_id": "0BwnYerHb6Y", "question": "On a stage occupied by several beautiful women, there is a woman wearing a khaki hat and a white shirt. She is holding a microphone. What is the woman with the khaki hat doing the first time she appears?", "question_wo_referring_query": "What is the woman with the khaki hat doing the first time she appears?", "candidates": ["Adjusting her hat", "Rubbing her eyes", "Dancing with both hands in front of the camera", "Holding a microphone and looking at the woman in the gray long-sleeve top", "Waving and greeting at the camera"], "topic_category": "NP-News-Programs", "question_category": "O2E", "level": "IntraMoment", "id": "0BwnYerHb6Y_1", "video_path": "0BwnYerHb6Y.mp4", "subtitle_path": "0BwnYerHb6Y_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 209.94, "view_count": 20194}, {"video_id": "4CUOKVlgDq0", "question": "In front of a blue background, a man wearing a black shirt with a mustache is giving a presentation in front of the camera. 
What is the object that is shaking left and right in the top right corner of the screen?", "question_wo_referring_query": "What is the object that is shaking left and right in the top right corner of the screen?", "candidates": ["A sheep", "An eye", "A blue projector", "A mouse", "A globe"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "4CUOKVlgDq0_0", "video_path": "4CUOKVlgDq0.mp4", "subtitle_path": "4CUOKVlgDq0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.8, "view_count": 66743}, {"video_id": "4CUOKVlgDq0", "question": "In front of a blue background, a man in a black shirt with a beard is speaking in front of the camera. What is the green English text that first appears in the top left corner of the screen?", "question_wo_referring_query": "What is the green English text that first appears in the top left corner of the screen?", "candidates": ["MONOCLONAL", "FOR PROTEINS", "STILL BIG", "TO PATHOGENS", "BUT THEY'RE"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "4CUOKVlgDq0_1", "video_path": "4CUOKVlgDq0.mp4", "subtitle_path": "4CUOKVlgDq0_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 385.8, "view_count": 66743}, {"video_id": "fXIRWrJXrwU", "question": "In the screen with a yellow background, at the top right corner, there is a purple plant. The screen also shows a blue circular design with the letter 'a' and a white paper strip with a black plus sign. 
What color is the chemical formula Mg on the white paper strip in this screen?", "question_wo_referring_query": "What color is the chemical formula Mg on the white paper strip in this screen?", "candidates": ["yellow", "blue", "purple", "black", "white"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "fXIRWrJXrwU_0", "video_path": "fXIRWrJXrwU.mp4", "subtitle_path": "fXIRWrJXrwU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 349.39, "view_count": 166928}, {"video_id": "fXIRWrJXrwU", "question": "In a yellow background, a finger with red nail polish points to a white strip of paper with the black English word 'magnesium'. In this scene, what color is the strip of paper with the black English words 'BALANCED EQUATION'?", "question_wo_referring_query": "What color is it?", "candidates": ["black", "blue", "yellow", "white", "purple"], "topic_category": "KS-Knowledge-STEM", "question_category": "S2A", "level": "IntraMoment", "id": "fXIRWrJXrwU_1", "video_path": "fXIRWrJXrwU.mp4", "subtitle_path": "fXIRWrJXrwU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 349.39, "view_count": 166928}, {"video_id": "2qfysRT_IQU", "question": "In front of a table with food on it, two people are holding each other's lower arm with one hand simultaneously. On the far right of the screen, a man wearing a gray jacket is looking at a woman next to him. 
When the subtitle says 'While they are eating, an elevated train passes by on the track nearby, making the building,' what objects are present in the scene?", "question_wo_referring_query": "What objects are present in the scene?", "candidates": ["a bed", "bicycle", "juicer", "wine glass", "car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "2qfysRT_IQU_0", "video_path": "2qfysRT_IQU.mp4", "subtitle_path": "2qfysRT_IQU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 539.17, "view_count": 6606}, {"video_id": "2qfysRT_IQU", "question": "In a dark room, there is a man in a suit sitting to the left of a black table, holding a pen and writing on paper. On the right side, there is a man in a red shirt looking towards a mirror. When the subtitle says 'He talks to his lawyer and agrees that if he can take Somerset and Mills to two more', what objects are still present on the screen?", "question_wo_referring_query": "What objects are still present on the screen?", "candidates": ["a cup", "a bed", "a refrigerator", "a green plant", "a water dispenser"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T2O", "level": "IntraMoment", "id": "2qfysRT_IQU_1", "video_path": "2qfysRT_IQU.mp4", "subtitle_path": "2qfysRT_IQU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 539.17, "view_count": 6606}, {"video_id": "gIoy2WkybEA", "question": "On an olive-colored wooden desk, there is a table lamp. On the sofa beside the table, there is a dog lying on a blanket. 
What else is present in the scene?", "question_wo_referring_query": "What else is present in the scene?", "candidates": ["a chair", "a bookshelf", "a book", "a plant", "a cup"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "gIoy2WkybEA_0", "video_path": "gIoy2WkybEA.mp4", "subtitle_path": "gIoy2WkybEA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 595.24, "view_count": 4743232}, {"video_id": "gIoy2WkybEA", "question": "On a white desk in an office, there is a computer. A man in a gray plaid shirt has his back to the camera, with his hands behind his head. What other objects are present in the scene?", "question_wo_referring_query": "What other objects are present in the scene?", "candidates": ["black keyboard", "water dispenser", "white keyboard", "telephone", "flower pot"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2O", "level": "IntraMoment", "id": "gIoy2WkybEA_1", "video_path": "gIoy2WkybEA.mp4", "subtitle_path": "gIoy2WkybEA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 595.24, "view_count": 4743232}, {"video_id": "gLJnjDUmAQc", "question": "Inside a shoe store, various styles of leather shoes are displayed on shelves on both sides. In front of a door with a long curtain in the middle of the shelves, a woman with black hair is seen, with her back facing the mirror and her hair tied up. 
What is this woman doing?", "question_wo_referring_query": "What is this woman doing?", "candidates": ["She puts the shoe box on the ground", "She picks up a pair of leather shoes from the shelf", "She is wearing shoes on a sofa", "She is drinking tea", "She picks up a shoe box"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "gLJnjDUmAQc_0", "video_path": "gLJnjDUmAQc.mp4", "subtitle_path": "gLJnjDUmAQc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 598.03, "view_count": 96643}, {"video_id": "gLJnjDUmAQc", "question": "In a shoe store, a man in a blue suit on the left is sitting on a chair with one leg raised, while a curly-haired woman is kneeling in front of him. What is the woman doing?", "question_wo_referring_query": "What is the woman doing?", "candidates": ["She is mopping the floor", "She is helping the man try on shoes", "She is drinking a beverage", "She is washing clothes", "She is helping the man with his tie"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "S2E", "level": "IntraMoment", "id": "gLJnjDUmAQc_1", "video_path": "gLJnjDUmAQc.mp4", "subtitle_path": "gLJnjDUmAQc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 598.03, "view_count": 96643}, {"video_id": "F7P8uGkwhmY", "question": "A woman who is running with wired earphones and wearing a green backpack encounters a man wearing a blue short-sleeved shirt outside a room and engages in a conversation with him. 
What change occurs to the woman's hairstyle?", "question_wo_referring_query": "What change occurs to the woman's hairstyle?", "candidates": ["Her hair turns from tied up to let down", "Her hair turns into a bob cut", "Her hair turns black", "Her hair turns red", "She places a flower clip on her head"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "F7P8uGkwhmY_0", "video_path": "F7P8uGkwhmY.mp4", "subtitle_path": "F7P8uGkwhmY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.96, "view_count": 6129}, {"video_id": "F7P8uGkwhmY", "question": "On the main street, a woman wearing a white top and a black shawl on her shoulders is chatting with a man wearing a white shirt. What changed about her clothing when she appeared in the hospital?", "question_wo_referring_query": "What changed about her clothing?", "candidates": ["She changed into a blue hospital gown", "She changed into a black hospital gown", "She changed into a white hospital gown", "She changed into a white skirt", "She changed into a green hospital gown"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SAA", "level": "L2-Relation", "id": "F7P8uGkwhmY_1", "video_path": "F7P8uGkwhmY.mp4", "subtitle_path": "F7P8uGkwhmY_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 526.96, "view_count": 6129}, {"video_id": "9Z_pTQrk3t4", "question": "On the grass, a man with a beard wearing a black hat and smoking a cigarette appeared with which subtitles?", "question_wo_referring_query": "Appeared with which subtitles?", "candidates": ["and one judge openly demanded bribes from Baijiu to issue an arrest warrant", "His friend told him that he should remember the case because they had investigated it", "After further investigation, Baijiu discovered Liu's real identity", "return to the police station and goes to the judges to obtain an arrest warrant for Tang Long", 
"about Tang Long's whereabouts and hoped to receive a reward"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "9Z_pTQrk3t4_0", "video_path": "9Z_pTQrk3t4.mp4", "subtitle_path": "9Z_pTQrk3t4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.66, "view_count": 337432}, {"video_id": "9Z_pTQrk3t4", "question": "Suspended in mid-air with a river surging behind her, which subtitles had a woman with black hair and fresh blood on her mouth appeared together with?", "question_wo_referring_query": "Which subtitles had appeared together with?", "candidates": ["Li, who was frightened, then hid to avoid the robbers' attacks", "Everyone then thought he was just lucky", "martial arts practitioners possessed", "Liu to defeat the robbers alone", "Madam, kills a villager to force him to reveal who he really is"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "TOS", "level": "L2-Relation", "id": "9Z_pTQrk3t4_1", "video_path": "9Z_pTQrk3t4.mp4", "subtitle_path": "9Z_pTQrk3t4_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 490.66, "view_count": 337432}, {"video_id": "QkSAt9IWtLA", "question": "Where else has the man, who is sitting in the driver's seat holding the steering wheel, appeared?", "question_wo_referring_query": "Where else has he appeared?", "candidates": ["On the train", "In front of the elevator", "On the bullet train", "In the bathroom", "In the zoo"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "QkSAt9IWtLA_0", "video_path": "QkSAt9IWtLA.mp4", "subtitle_path": "QkSAt9IWtLA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 565.84, "view_count": 38730}, {"video_id": "QkSAt9IWtLA", "question": "A man wearing a suit and white pants is holding a bag and appearing on the main street. 
In which of the following places has this man also appeared?", "question_wo_referring_query": "In which of the following places has this man also appeared?", "candidates": ["In the karaoke room", "In the submarine", "On the volcano", "In the office", "On the snowy mountain"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SOS", "level": "L2-Relation", "id": "QkSAt9IWtLA_1", "video_path": "QkSAt9IWtLA.mp4", "subtitle_path": "QkSAt9IWtLA_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 565.84, "view_count": 38730}, {"video_id": "SHXLvR4P8mM", "question": "Which of the following sequence of events is correct?", "question_wo_referring_query": "Which of the following sequence of events is correct?", "candidates": ["First, a man in a white hoodie is eating noodles, then a boy is holding a refrigerator tightly with both hands, and finally two people fall from upstairs.", "First, a man in a white hoodie is eating noodles, then two people fall from upstairs, and finally a boy is holding a refrigerator tightly with both hands.", "First, a boy is holding a refrigerator tightly with both hands, then two people fall from upstairs, and finally a man in a white hoodie is eating noodles.", "First, a boy is holding a refrigerator tightly with both hands, then a man in a white hoodie is eating noodles, and finally two people fall from upstairs.", "First, two people fall from upstairs, then a man in a white hoodie is eating noodles, and finally a boy is holding a refrigerator tightly with both hands."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "SHXLvR4P8mM_0", "video_path": "SHXLvR4P8mM.mp4", "subtitle_path": "SHXLvR4P8mM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 394.8, "view_count": 36154}, {"video_id": "SHXLvR4P8mM", "question": "Which of the following sequences of events is correct?", 
"question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a person in the room grabs another person's neck tightly with their hand. Then, a white-haired man wearing black armor points a gun at his own mouth. Finally, a person's hands are chained and lifted up.", "First, a person in the room grabs another person's neck tightly with their hand. Then, a person's hands are chained and lifted up. Finally, a white-haired man wearing black armor points a gun at his own mouth.", "First, a white-haired man wearing black armor points a gun at his own mouth. Then, a person's hands are chained and lifted up. Finally, a person in the room grabs another person's neck tightly with their hand.", "First, a person's hands are chained and lifted up. Then, a white-haired man wearing black armor points a gun at his own mouth. Finally, a person in the room grabs another person's neck tightly with their hand.", "First, a person's hands are chained and lifted up. Then, a person in the room grabs another person's neck tightly with their hand. Finally, a white-haired man wearing black armor points a gun at his own mouth."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "SHXLvR4P8mM_1", "video_path": "SHXLvR4P8mM.mp4", "subtitle_path": "SHXLvR4P8mM_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 394.8, "view_count": 36154}, {"video_id": "RklJ0TLh0_A", "question": "On the white snowy ground, some dry tree branches have snow on them. A woman wearing a blue coat helps another woman wearing a brown coat to walk. The subtitle says 'Soon-Hee comes home and sees Ji-Ah crying so she comforts her. 
The next day,' After this, what is the first thing that appears on the screen?", "question_wo_referring_query": "What is the first thing that appears on the screen?", "candidates": ["A rabbit", "A snow leopard", "A black bear", "A fox", "A reindeer"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "RklJ0TLh0_A_0", "video_path": "RklJ0TLh0_A.mp4", "subtitle_path": "RklJ0TLh0_A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.67, "view_count": 18294}, {"video_id": "RklJ0TLh0_A", "question": "Snowflakes are floating in the air, and a short-haired woman wearing a thick coat is holding a handful of white snow, ready to start a snowball fight. After the subtitle reads 'Eventually Winter has come and they're getting closer each day. As she reads the story of Kumiho,' what appears on the screen first?", "question_wo_referring_query": "What appears on the screen first?", "candidates": ["a bed", "a vase", "a book", "a sofa", "a car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "T3O", "level": "L2-Relation", "id": "RklJ0TLh0_A_1", "video_path": "RklJ0TLh0_A.mp4", "subtitle_path": "RklJ0TLh0_A_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 571.67, "view_count": 18294}, {"video_id": "JEWVq08LT28", "question": "After the woman in a black top falls on the pink yoga mat, who is the first person to appear?", "question_wo_referring_query": "Who is the first person to appear?", "candidates": ["A woman in a pink bikini", "A man in a yellow suit and white shirt", "A woman in a black and white maid outfit", "A man in a red suit and blue shirt", "A man in a black suit and blue shirt"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "JEWVq08LT28_0", "video_path": "JEWVq08LT28.mp4", "subtitle_path": "JEWVq08LT28_en.json", "duration_group": 600, 
"starting_timestamp_for_subtitles": 0, "duration": 596.8, "view_count": 406011}, {"video_id": "JEWVq08LT28", "question": "After a man in a blue shirt appears in front of a dining table with a glass, who is the first person to enter the scene?", "question_wo_referring_query": "Who is the first person to enter the scene?", "candidates": ["A woman in a black dress", "An elderly person with white hair wearing glasses", "An elderly person with white hair wearing a pearl necklace", "A woman in a white dress", "A woman in a pink dress"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O3O", "level": "L2-Relation", "id": "JEWVq08LT28_1", "video_path": "JEWVq08LT28.mp4", "subtitle_path": "JEWVq08LT28_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 596.8, "view_count": 406011}, {"video_id": "USdIHwQMeOo", "question": "In a store with several paintings hanging on the wall, a woman wearing a yellow floral dress speaks with a man next to her who is dressed in a yellow and blue patchwork top with a mustache. 
What does the woman do next?", "question_wo_referring_query": "What does the woman do next?", "candidates": ["She picks up a book from the table.", "She sits on the chair and drinks a beverage.", "She lies on the bed.", "She picks up a little boy.", "She sits on the sofa and tells a story."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "USdIHwQMeOo_0", "video_path": "USdIHwQMeOo.mp4", "subtitle_path": "USdIHwQMeOo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 531.6, "view_count": 5085}, {"video_id": "USdIHwQMeOo", "question": "On the white staircase, after a woman dressed in an orange halter dress with her hair up walks down, what does she do?", "question_wo_referring_query": "What does she do?", "candidates": ["She hugs a man in a suit with dark skin.", "She raises a glass of wine.", "She ties a necktie for a man in a suit with dark skin.", "She holds hands with a man in a suit with dark skin.", "She kisses a man in a suit with dark skin."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E3E", "level": "L2-Relation", "id": "USdIHwQMeOo_1", "video_path": "USdIHwQMeOo.mp4", "subtitle_path": "USdIHwQMeOo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 531.6, "view_count": 5085}, {"video_id": "sQImYnr_wCg", "question": "What happened when the figure of an upper body floating in the air with a pink heart-shaped balloon animal made its first appearance?", "question_wo_referring_query": "What happened during its first appearance?", "candidates": ["Got a big hole punctured", "Gradually ascended", "Started burning", "Gradually descended", "Was painted red"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "sQImYnr_wCg_0", "video_path": "sQImYnr_wCg.mp4", "subtitle_path": "sQImYnr_wCg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 
0, "duration": 517.33, "view_count": 9376}, {"video_id": "sQImYnr_wCg", "question": "What happened the first time the woman in a pink coat appeared next to the red car on screen?", "question_wo_referring_query": "What happened?", "candidates": ["She was put into a closet", "She was pushed into the water", "She was grabbed by the shoulders by two men", "She was thrown into a swimming pool", "She was thrown into a car"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "O2E", "level": "IntraMoment", "id": "sQImYnr_wCg_1", "video_path": "sQImYnr_wCg.mp4", "subtitle_path": "sQImYnr_wCg_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 517.33, "view_count": 9376}, {"video_id": "ISPIc88aTiU", "question": "Who is the person dressed in a teddy bear costume and clenching their fists on screen?", "question_wo_referring_query": "Who is it?", "candidates": ["A man wearing glasses and sweating on his forehead", "A white-haired man in black clothes", "A woman with long curly hair", "A man in a gray short-sleeved shirt", "A woman with a cigarette in her mouth"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "ISPIc88aTiU_0", "video_path": "ISPIc88aTiU.mp4", "subtitle_path": "ISPIc88aTiU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.65, "view_count": 18465}, {"video_id": "ISPIc88aTiU", "question": "Who is the person with a cigarette in their mouth and holding a lighter in the picture?", "question_wo_referring_query": "Who is it?", "candidates": ["A woman wearing black sunglasses and gold round earrings", "A man dressed in a teddy bear costume", "A red-haired man", "A white-haired man wearing black clothes", "A girl wearing a white coat"], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "E2O", "level": "IntraMoment", "id": "ISPIc88aTiU_1", "video_path": "ISPIc88aTiU.mp4", "subtitle_path": 
"ISPIc88aTiU_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 211.65, "view_count": 18465}, {"video_id": "AH8vF6yReIQ", "question": "In the video, when the subtitle changes to 'for our ecosystems and just huge also in,' what changes occur to the man with short brown hair, wearing black glasses and a brown and black checkered shirt?", "question_wo_referring_query": "What changes occur to the man?", "candidates": ["The man's clothing color and style change.", "The man's hair grows longer.", "The man raises his hands above his head.", "The man puts his hands together.", "The hair color of the person on screen changes."], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "AH8vF6yReIQ_0", "video_path": "AH8vF6yReIQ.mp4", "subtitle_path": "AH8vF6yReIQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1762.22, "view_count": 402187}, {"video_id": "AH8vF6yReIQ", "question": "On the screen is a man with black long hair wearing a black cloak. To his left are 4 alternating lines of white and orange English text. 
When the subtitle changes to 'trouble seabird poop is not quite as,' what change occurs to the English text on the screen?", "question_wo_referring_query": "What change occurs to the English text on the screen?", "candidates": ["7 lines of white English text appear on the man's right side", "3 lines of white English text appear on the man's right side", "7 lines of red English text appear on the man's left side", "3 lines of white English text appear on the man's left side", "7 lines of white English text appear on the man's right side"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "AH8vF6yReIQ_1", "video_path": "AH8vF6yReIQ.mp4", "subtitle_path": "AH8vF6yReIQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1762.22, "view_count": 402187}, {"video_id": "AH8vF6yReIQ", "question": "A bee is standing on a large tree in a forest in the video. When the subtitle changes to 'concern for us see north american bees,' what happens to the bee in the video?", "question_wo_referring_query": "What happens to the bee in the video?", "candidates": ["The color of the bee's head changes.", "The bee's head is gone.", "The wings on the bee's body disappear.", "The bee's four limbs are gone."], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "AH8vF6yReIQ_2", "video_path": "AH8vF6yReIQ.mp4", "subtitle_path": "AH8vF6yReIQ_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1762.22, "view_count": 402187}, {"video_id": "e_NjEi6i8uc", "question": "In a room with white lighting, two people stand in the back watching a man in gray clothes holding a yellow dog sitting on a dark yellow floor. 
Which character appears first?", "question_wo_referring_query": "Which character appears first?", "candidates": ["The man wearing a gray long-sleeve coat and black pants", "The short-haired woman holding her hand on the table and wearing white earphones and a gray top", "The man wearing a gray short-sleeve shirt and black pants", "The man wearing a gray long-sleeve coat and white pants", "The man wearing a military green jacket and a hat"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "e_NjEi6i8uc_0", "video_path": "e_NjEi6i8uc.mp4", "subtitle_path": "e_NjEi6i8uc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 479.71, "view_count": 8052}, {"video_id": "e_NjEi6i8uc", "question": "Only three characters appear on the screen. On the left is a woman wearing a black top and blue jeans. In the middle, there is a man in a grey top sitting with his legs apart on a yellow bench. On the right, there is a man sitting on a white bench, wearing a grey top and black pants. 
After these characters appear on the screen, which one comes on stage first?", "question_wo_referring_query": "After these characters appear on the screen, which one comes on stage first?", "candidates": ["A man with black hair, wearing glasses, with both hands on a white table, watching a video of a dog", "A man wearing an army green top and holding a microphone", "A short-haired woman wearing a grey top, with one hand placed on the table and wearing white earphones", "A dog", "A short-haired woman wearing glasses"], "topic_category": "NP-News-Programs", "question_category": "O3O", "level": "L2-Relation", "id": "e_NjEi6i8uc_1", "video_path": "e_NjEi6i8uc.mp4", "subtitle_path": "e_NjEi6i8uc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 479.71, "view_count": 8052}, {"video_id": "6dzE8v-if3k", "question": "Next to the wooden house, there is a tan-colored wooden table with a wooden board on it. A man dressed in a black short-sleeved shirt is cutting meat with a knife. In which scenarios has this man appeared?", "question_wo_referring_query": "In which scenarios has this man appeared?", "candidates": ["Beside a chicken feeding on a hillside", "Under a waterfall", "Inside a wooden cabin with a bed", "Next to a small stone fountain built with artificial stones", "Next to a blue beehive filled with bees"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "6dzE8v-if3k_0", "video_path": "6dzE8v-if3k.mp4", "subtitle_path": "6dzE8v-if3k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.11, "view_count": 2597233}, {"video_id": "6dzE8v-if3k", "question": "On the hillside, green low-growing plants are scattered around. A goat stands on a rock, looking up while tearing at leaves from the trees. 
In which scenes does this goat appear?", "question_wo_referring_query": "In which scenes does this goat appear?", "candidates": ["In a black iron pot", "Next to a man wearing a black short-sleeved shirt", "Next to a blue beehive", "On a wooden dining table", "In a stone-surrounded fountain"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "6dzE8v-if3k_1", "video_path": "6dzE8v-if3k.mp4", "subtitle_path": "6dzE8v-if3k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.11, "view_count": 2597233}, {"video_id": "6dzE8v-if3k", "question": "A black-clad man is using a silver bucket to cover the intestines that are hanging on an object wrapped in newspaper and inserted into a block of stone. In what scenes have these intestines appeared?", "question_wo_referring_query": "In what scenes have these intestines appeared?", "candidates": ["On the back of a mountain goat", "In a fountain surrounded by stones", "In a black iron plate", "In a transparent teapot", "In a white pot on the stove"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "SOS", "level": "L2-Relation", "id": "6dzE8v-if3k_2", "video_path": "6dzE8v-if3k.mp4", "subtitle_path": "6dzE8v-if3k_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1109.11, "view_count": 2597233}, {"video_id": "lTaxqZ6V0S0", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a man wearing a white hat and a black short-sleeved shirt stands outside the window and hands a little girl to a woman wearing a pink dress. Then, President Xi Jinping, wearing a white helmet and a black suit, appears on the right side of the screen. 
Finally, a blonde woman in a red dress, holding a white paper and pen, sits cross-legged in front of a purple screen.", "First, President Xi Jinping, wearing a white helmet and a black suit, appears on the right side of the screen. Then, a blonde woman in a red dress, holding a white paper and pen, sits cross-legged in front of a purple screen. Finally, a man wearing a white hat and a black short-sleeved shirt stands outside the window and hands a little girl to a woman wearing a pink dress.", "First, a blonde woman in a red dress, holding a white paper and pen, sits cross-legged in front of a purple screen. Then, President Xi Jinping, wearing a white helmet and a black suit, appears on the right side of the screen. Finally, a man wearing a white hat and a black short-sleeved shirt stands outside the window and hands a little girl to a woman wearing a pink dress.", "First, President Xi Jinping, wearing a white helmet and a black suit, appears on the right side of the screen. Then, a man wearing a white hat and a black short-sleeved shirt stands outside the window and hands a little girl to a woman wearing a pink dress. Finally, a blonde woman in a red dress, holding a white paper and pen, sits cross-legged in front of a purple screen.", "First, a man wearing a white hat and a black short-sleeved shirt stands outside the window and hands a little girl to a woman wearing a pink dress. Then, a blonde woman in a red dress, holding a white paper and pen, sits cross-legged in front of a purple screen. 
Finally, President Xi Jinping, wearing a white helmet and a black suit, appears on the right side of the screen."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "lTaxqZ6V0S0_0", "video_path": "lTaxqZ6V0S0.mp4", "subtitle_path": "lTaxqZ6V0S0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2584.99, "view_count": 1562}, {"video_id": "lTaxqZ6V0S0", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, the screen shows a white building with an eagle statue. Then, a man wearing a black suit and black tie is speaking in front of a white wall with a painting. Finally, a black-skinned woman wearing black glasses and black clothes is speaking in front of a blue screen.", "First, a black-skinned woman wearing black glasses and black clothes is speaking in front of a blue screen. Then, a man wearing a black suit and black tie is speaking in front of a white wall with a painting. Lastly, the screen shows a white building with an eagle statue.", "First, a black-skinned woman wearing black glasses and black clothes is speaking in front of a blue screen. Then, the screen shows a white building with an eagle statue. Finally, a man wearing a black suit and black tie is speaking in front of a white wall with a painting.", "First, the screen shows a white building with an eagle statue. Then, a black-skinned woman wearing black glasses and black clothes is speaking in front of a blue screen. Finally, a man wearing a black suit and black tie is speaking in front of a white wall with a painting.", "First, a man wearing a black suit and black tie is speaking in front of a white wall with a painting. Then, a black-skinned woman wearing black glasses and black clothes is speaking in front of a blue screen. 
Lastly, the screen shows a white building with an eagle statue."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "lTaxqZ6V0S0_1", "video_path": "lTaxqZ6V0S0.mp4", "subtitle_path": "lTaxqZ6V0S0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2584.99, "view_count": 1562}, {"video_id": "lTaxqZ6V0S0", "question": "Which of the following scene sequences is correct?", "question_wo_referring_query": "Which of the following scene sequences is correct?", "candidates": ["First, a woman wearing black glasses and a golden wig is seen sitting in front of a room door with a picture frame hanging nearby. Then, a bald man wearing black glasses and a black suit is shown sitting on a black chair. Finally, a woman with white hair wearing a purple coat appears on screen.", "First, a woman with white hair wearing a purple coat appears on screen. Then, a woman wearing black glasses and a golden wig is seen sitting in front of a room door with a picture frame hanging nearby. Finally, a bald man wearing black glasses and a black suit is shown sitting on a black chair.", "First, a bald man wearing black glasses and a black suit is shown sitting on a black chair. Next, a woman with white hair wearing a purple coat appears on screen. Finally, a woman wearing black glasses and a golden wig is seated in front of a room door with a picture frame hanging nearby.", "First, a woman with white hair wearing a purple coat appears on screen. Next, a bald man wearing black glasses and a black suit is shown sitting on a black chair. Finally, a woman wearing black glasses and a golden wig is seated in front of a room door with a picture frame hanging nearby.", "First, a bald man wearing black glasses and a black suit is shown sitting on a black chair. Then, a woman wearing black glasses and a golden wig is seen sitting in front of a room door with a picture frame hanging nearby. 
Finally, a woman with white hair wearing a purple coat appears on screen."], "topic_category": "NP-News-Programs", "question_category": "SSS", "level": "L2-Relation", "id": "lTaxqZ6V0S0_2", "video_path": "lTaxqZ6V0S0.mp4", "subtitle_path": "lTaxqZ6V0S0_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2584.99, "view_count": 1562}, {"video_id": "XtV6Rwdg8no", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, a side view of a woman's face is shown, with a white pole on the left side of the frame. Then, two women lying together are shown, and finally, one woman is shown looking sideways at another woman.", "First, one woman is shown looking sideways at another woman. Then, a side view of a woman's face is shown, with a white pole on the left side of the frame, and finally, two women lying together are shown.", "First, two women lying together are shown. Then, a side view of a woman's face is shown, with a white pole on the left side of the frame, and finally, one woman is shown looking sideways at another woman.", "First, a side view of a woman's face is shown, with a white pole on the left side of the frame. Then, one woman is shown looking sideways at another woman, and finally, two women lying together are shown.", "First, two women lying together are shown. 
Then, one woman is shown looking sideways at another woman, and finally, a side view of a woman's face with a white pole on the left side of the frame is shown."], "topic_category": "Recreational: MR-Movie-Recaps", "question_category": "SSS", "level": "L2-Relation", "id": "XtV6Rwdg8no_0", "video_path": "XtV6Rwdg8no.mp4", "subtitle_path": "XtV6Rwdg8no_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 16, "duration": 14.02, "view_count": 8966}, {"video_id": "ZAsxJpETLGw", "question": "In a room, a man wearing a black shirt is sitting on a chair, his hands are open, and there is a yellow table behind him. In the bottom left corner of the screen, there is text saying Matt@mattdajer. In what other scene does this man appear?", "question_wo_referring_query": "In a room, a man wearing a black shirt is sitting on a chair, his hands are open, and there is a yellow table behind him. In the bottom left corner of the screen, there is text saying Matt@mattdajer. In what other scene does this man appear?", "candidates": ["He appears in the study room.", "In a white room, there is another blond man in a black short-sleeve shirt beside him, and a man in gray clothes standing behind.", "He appears in the house of a man wearing gray clothes.", "He appears in the kitchen."], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "ZAsxJpETLGw_0", "video_path": "ZAsxJpETLGw.mp4", "subtitle_path": "ZAsxJpETLGw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 187.63, "view_count": 78113}, {"video_id": "ZAsxJpETLGw", "question": "In a room, a man wearing a black shirt is sitting on a chair with his hands open. There is also a yellow table behind him. 
In which other scene does this man appear?", "question_wo_referring_query": "In which other scene does this man appear?", "candidates": ["He and another man with black hair are lying naked on a grey bed", "He and another man are in a study room", "He and two other men are lying on a bed", "He and another man are in the kitchen"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "SOS", "level": "L2-Relation", "id": "ZAsxJpETLGw_1", "video_path": "ZAsxJpETLGw.mp4", "subtitle_path": "ZAsxJpETLGw_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 187.63, "view_count": 78113}, {"video_id": "MJazCMYQb8I", "question": "The truck's compartment was opened, and three men stood inside. The edge of the compartment had red stripes. The men in the compartment were wearing different outfits: one in blue denim with black pants, one in white denim with black pants, and one in a gray short-sleeve shirt with black pants. There were green trees and houses around the truck. Which of these men jumped out of the compartment first?", "question_wo_referring_query": "Which of these men jumped out of the compartment first?", "candidates": ["Man in white denim", "Man in blue denim", "Man in red short sleeves", "Man in black denim", "Man in gray short sleeves"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "MJazCMYQb8I_0", "video_path": "MJazCMYQb8I.mp4", "subtitle_path": "MJazCMYQb8I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 969.14, "view_count": 6998734}, {"video_id": "MJazCMYQb8I", "question": "A man in a lab coat and a woman in a blue denim jacket are walking side by side. The woman has her hands crossed in front of her chest and her hair is tied up. The man is wearing sunglasses, and there are light-colored square patches on the chest area of his lab coat. Beside them, there is a house roof and green trees. 
Which animal does the woman first touch?", "question_wo_referring_query": "Which animal does the woman first touch?", "candidates": ["An almost entirely black dog", "An almost entirely white dog", "A yellow dog", "A dog with a large gray area", "A yellow and white patterned dog"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "MJazCMYQb8I_1", "video_path": "MJazCMYQb8I.mp4", "subtitle_path": "MJazCMYQb8I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 969.14, "view_count": 6998734}, {"video_id": "MJazCMYQb8I", "question": "What is the first mode of transportation to appear in the video?", "question_wo_referring_query": "What is the first mode of transportation to appear in the video?", "candidates": ["an airplane", "a train", "a helicopter", "a motorcycle", "a white car"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "MJazCMYQb8I_2", "video_path": "MJazCMYQb8I.mp4", "subtitle_path": "MJazCMYQb8I_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 969.14, "view_count": 6998734}, {"video_id": "G7vget_QAmo", "question": "A book with a white building on the cover, set against a yellow background, with the title CHINESE MYTHOLOGY. What action does this book perform?", "question_wo_referring_query": "What action does this book perform?", "candidates": ["Close", "Rotate", "Open", "Thrown"], "topic_category": "KH-Knowledge-History", "question_category": "S2E", "level": "IntraMoment", "id": "G7vget_QAmo_0", "video_path": "G7vget_QAmo.mp4", "subtitle_path": "G7vget_QAmo_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 415, "duration": 25.99, "view_count": 302080}, {"video_id": "4POH95v46hs", "question": "Along the undulating mountain range, there is a lush, dense forest. The branches are abundant, and the leaves are luxuriant, presenting a vibrant green appearance. 
What color is this mountain range?", "question_wo_referring_query": "What color is this mountain range?", "candidates": ["Yellowish Green", "Green", "White", "Blue"], "topic_category": "KG-Knowledge-Geography", "question_category": "S2A", "level": "IntraMoment", "id": "4POH95v46hs_0", "video_path": "4POH95v46hs.mp4", "subtitle_path": "4POH95v46hs_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 226, "duration": 44.98, "view_count": 31092}, {"video_id": "UpcjgMSaecY", "question": "During a meeting, a group of people came in. When the subtitle 'him back to prison but does not require' appeared, there was a bald man wearing a black suit, glasses, and a mask. What color was his tie?", "question_wo_referring_query": "What color was his tie?", "candidates": ["blue", "yellow", "black", "red"], "topic_category": "NP-News-Programs", "question_category": "T2O", "level": "IntraMoment", "id": "UpcjgMSaecY_0", "video_path": "UpcjgMSaecY.mp4", "subtitle_path": "UpcjgMSaecY_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 195, "duration": 38.01, "view_count": 30142}, {"video_id": "GCb8RWnev_o", "question": "In front of the staircase, there is a woman with black hair wearing a purple dress, two men in black suits, and a man wearing a white hat, a black top, and white trousers. What is the man in white trousers holding?", "question_wo_referring_query": "In front of the staircase, there is a woman with black hair wearing a purple dress, two men in black suits, and a man wearing a white hat, a black top, and white trousers. 
What is the man in white trousers holding?", "candidates": ["Microphone", "Flag", "Bouquet", "Wine glass"], "topic_category": "NP-News-Programs", "question_category": "S2O", "level": "IntraMoment", "id": "GCb8RWnev_o_0", "video_path": "GCb8RWnev_o.mp4", "subtitle_path": "GCb8RWnev_o_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 146, "duration": 41.0, "view_count": 44552}, {"video_id": "bhb17HI6NnQ", "question": "In front of the coffee machine, the coffee is poured into a yellow flower-patterned cup, and the milk is loaded into a gray-blue packaged bottle. What is the first item to appear in the video?", "question_wo_referring_query": "What is the first item to appear in the video?", "candidates": ["Milk loaded into a gray-blue packaged bottle", "Coffee loaded into a gray-blue packaged bottle", "Milk poured into a yellow flower-patterned cup", "Coffee poured into a yellow flower-patterned cup"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "bhb17HI6NnQ_0", "video_path": "bhb17HI6NnQ.mp4", "subtitle_path": "bhb17HI6NnQ_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 175, "duration": 51.02, "view_count": 679423}, {"video_id": "rmoOthXlNzI", "question": "In the screen, there are little people wearing various clothes and holding various things. 
After the subtitle mentions \"these expansions had made the Roman,\" what action does the little person holding the moneybag do?", "question_wo_referring_query": "What action does the little person holding the moneybag do?", "candidates": ["Turns around and leaves", "Puts the moneybag on the ground", "Stretches out the moneybag", "Throws the moneybag into the air"], "topic_category": "KH-Knowledge-History", "question_category": "T3E", "level": "L2-Relation", "id": "rmoOthXlNzI_0", "video_path": "rmoOthXlNzI.mp4", "subtitle_path": "rmoOthXlNzI_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 18, "duration": 16.02, "view_count": 10918}, {"video_id": "1pRIUWwvtjc", "question": "In a room with white painted walls, there is a double door. Outside the door is a balcony. A woman with long golden hair, wearing stockings, is standing on a marble pedestal. What type of clothing is the woman wearing at the beginning?", "question_wo_referring_query": "What type of clothing is the woman wearing at the beginning?", "candidates": ["white tight top", "blue long-sleeve shirt", "white tank top", "gray long-sleeve shirt"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "1pRIUWwvtjc_0", "video_path": "1pRIUWwvtjc.mp4", "subtitle_path": "1pRIUWwvtjc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.88, "view_count": 11243}, {"video_id": "1pRIUWwvtjc", "question": "In front of a large painting, a woman with long golden hair is sitting at a desk working. There is a white cup next to her right hand. She is wearing a white top. 
Where did this woman with long golden hair appear first?", "question_wo_referring_query": "Where did this woman with long golden hair appear first?", "candidates": ["On the bed in the bedroom", "On the balcony of the bedroom", "Outdoors in the sunlight", "In front of the bathroom mirror"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "O3O", "level": "L2-Relation", "id": "1pRIUWwvtjc_1", "video_path": "1pRIUWwvtjc.mp4", "subtitle_path": "1pRIUWwvtjc_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 480.88, "view_count": 11243}, {"video_id": "si_PQDNctjo", "question": "In front of a red brick house, there's a big tree, and under the tree are a few people wearing white T-shirts with a blue-green circular pattern. In which of the following scenes has this pattern appeared?", "question_wo_referring_query": "In which of the following scenes has this pattern appeared?", "candidates": ["Outdoors in bright sunlight, on the black robe of a man with short golden hair.", "In a park, on the purple robe of a man with short black hair.", "Inside a library, on the black robe of a man with short golden hair.", "Inside a room with many photos, on the black robe of a man with short black hair."], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "si_PQDNctjo_0", "video_path": "si_PQDNctjo.mp4", "subtitle_path": "si_PQDNctjo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 289.83, "view_count": 70037}, {"video_id": "si_PQDNctjo", "question": "In the next room filled with pictures, there is a man dressed in a black hooded cloak with short black hair. Beside him, there is a blue bird-like logo. 
In which of the following scenes has this logo appeared before?", "question_wo_referring_query": "In which of the following scenes has this logo appeared before?", "candidates": ["In a screen with a library as the background", "In a screen with the Sun as the background", "In a screen with Earth as the background", "In a screen with a park as the background"], "topic_category": "KG-Knowledge-Geography", "question_category": "SOS", "level": "L2-Relation", "id": "si_PQDNctjo_1", "video_path": "si_PQDNctjo.mp4", "subtitle_path": "si_PQDNctjo_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 289.83, "view_count": 70037}, {"video_id": "78JvysqpNMc", "question": "There is a white table in front of a light blue curtain, and there is an object emitting white light on the table. To the left of the table stands a woman with red hair wearing a hair accessory and dressed in black clothes, and to the right stands a man wearing red clothes and glasses. Between them, there is a man in black holding a green paper. When the conversation reaches 'ponds I was gonna go with unfortunately,' what change occurs to the object emitting white light on the table?", "question_wo_referring_query": "When the conversation reaches 'ponds I was gonna go with unfortunately,' what change occurs to the object emitting white light on the table?", "candidates": ["No change", "Changes from white to green", "Changes from white to blue", "Changes from white to purple", "Changes from white to red"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "78JvysqpNMc_0", "video_path": "78JvysqpNMc.mp4", "subtitle_path": "78JvysqpNMc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1366.24, "view_count": 213014}, {"video_id": "78JvysqpNMc", "question": "In front of a light blue curtain, a man wearing red clothes and glasses is smiling at the camera. 
When the conversation reaches 'in for a treat thanks dude okay so in', what change occurs in the man wearing red clothes and glasses?", "question_wo_referring_query": "What change occurs in the man wearing red clothes and glasses?", "candidates": ["Facing the camera changes to standing with his back to the camera", "Facing the camera changes to side-facing the camera", "Facing the camera changes to crouching and facing the camera", "Facing the camera changes to crouching with his back to the camera"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "78JvysqpNMc_1", "video_path": "78JvysqpNMc.mp4", "subtitle_path": "78JvysqpNMc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1366.24, "view_count": 213014}, {"video_id": "78JvysqpNMc", "question": "Against a green background, a man wearing glasses and a shirt is speaking towards the camera. To the right of the man, there is an English sentence containing a percentage. 
When he talks about 'project so the church ordered 300', what change occurs to the color of the first English word on the right side of the man?", "question_wo_referring_query": "What change occurs to the color of the first English word on the right side of the man?", "candidates": ["Purple changes to red", "Purple changes to white", "Purple changes to yellow", "Purple changes to green", "Purple changes to black"], "topic_category": "KS-Knowledge-STEM", "question_category": "TAA", "level": "L2-Relation", "id": "78JvysqpNMc_2", "video_path": "78JvysqpNMc.mp4", "subtitle_path": "78JvysqpNMc_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1366.24, "view_count": 213014}, {"video_id": "k4E7jeDJcY4", "question": "Who is the first character to appear in the video?", "question_wo_referring_query": "Who is the first character to appear in the video?", "candidates": ["A woman dressed in a blue short-sleeved shirt and white pants", "The man wearing a multicolored striped undershirt, a light blue short-sleeved shirt, and black glasses", "A man with short black hair, wearing black clothes with a white logo", "A man wearing a black suit and tie", "A woman with curly hair wearing a green dress"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "k4E7jeDJcY4_0", "video_path": "k4E7jeDJcY4.mp4", "subtitle_path": "k4E7jeDJcY4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1227.52, "view_count": 254909}, {"video_id": "k4E7jeDJcY4", "question": "Which scene appears first in the video?", "question_wo_referring_query": "Which scene appears first in the video?", "candidates": ["A high-stemmed glass decorated with yellow and pink at the U-shaped base on a yellow wave line", "A toy car falls into the water", "A pink car driving past two elderly people", "Green and pink arrows lined up vertically", "A big-eyed green frog on green foliage"], "topic_category": 
"KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "k4E7jeDJcY4_1", "video_path": "k4E7jeDJcY4.mp4", "subtitle_path": "k4E7jeDJcY4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1227.52, "view_count": 254909}, {"video_id": "k4E7jeDJcY4", "question": "Which plant is displayed first in the video?", "question_wo_referring_query": "Which plant is displayed first in the video?", "candidates": ["Small flower with pink petals featuring spots", "Four-leaf clover", "Small flower with purple-red petals", "Gold and silver flower", "Small flower with green calyx and yellow petals"], "topic_category": "KS-Knowledge-STEM", "question_category": "O3O", "level": "L2-Relation", "id": "k4E7jeDJcY4_2", "video_path": "k4E7jeDJcY4.mp4", "subtitle_path": "k4E7jeDJcY4_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1227.52, "view_count": 254909}, {"video_id": "VZtKWlOH4IE", "question": "In front of a rotating globe depicted in chalkboard drawing with deep green and light green, what appears on the screen?", "question_wo_referring_query": "What appears on the screen?", "candidates": ["Descriptions of three pictures written in chalk", "Mountains, white clouds, and crystals drawn in white chalk", "An English word written in chalk", "English words written in chalk and crystals drawn at the bottom", "Three pictures of ancient artworks"], "topic_category": "KA-Knowledge-Art", "question_category": "E3E", "level": "L2-Relation", "id": "VZtKWlOH4IE_0", "video_path": "VZtKWlOH4IE.mp4", "subtitle_path": "VZtKWlOH4IE_en.json", "duration_group": 60, "starting_timestamp_for_subtitles": 73, "duration": 50.01, "view_count": 27409}, {"video_id": "eugVGJkOjsI", "question": "On a white tabletop in front of a white tile wall, there's a pot placed. A man wearing a black coat and a gray scarf is holding a white board, standing in front of the pot. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["The man is putting down the board", "The man is lifting the pot", "The man is covering the pot with a lid", "The man is cutting the ingredients", "The man is pouring the ingredients from the board into the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "eugVGJkOjsI_0", "video_path": "eugVGJkOjsI.mp4", "subtitle_path": "eugVGJkOjsI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1352.48, "view_count": 3253385}, {"video_id": "eugVGJkOjsI", "question": "In a room with a white tiled wall, a pot is placed on a white table. A man with short hair, wearing a grey short-sleeved shirt and grey apron, is holding a brown bag in his hand. What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["He throws away the bag", "He takes something out of the bag with his left hand", "He takes something out of the bag with his right hand", "He pours the contents of the bag into the pot", "He puts the bag into the pot"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "eugVGJkOjsI_1", "video_path": "eugVGJkOjsI.mp4", "subtitle_path": "eugVGJkOjsI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1352.48, "view_count": 3253385}, {"video_id": "eugVGJkOjsI", "question": "A wooden board is placed on a white platform. A man and a woman are standing in front of the platform. The woman is looking at the man in front of her. The man has his left eye closed, his left hand on his waist, and his right hand raised. 
What is this man doing?", "question_wo_referring_query": "What is this man doing?", "candidates": ["This man is tasting food", "This man is pulling a tooth", "This man is shaking hands with the woman", "This man is combing his hair with his hand", "This man is squatting down"], "topic_category": "LC-Lifestyle-Cooking-Recipes", "question_category": "S2E", "level": "IntraMoment", "id": "eugVGJkOjsI_2", "video_path": "eugVGJkOjsI.mp4", "subtitle_path": "eugVGJkOjsI_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1352.48, "view_count": 3253385}, {"video_id": "hDbIzOx0Mnk", "question": "There is a white phone holder clipped to the wooden board at the head of the bed. On the bed, there is a woman wearing a white outer garment. She has her hands on the earphones, legs extended straight, and is leaning against the headboard. What was the last animal to appear on the screen before this scene?", "question_wo_referring_query": "What was the last animal to appear on the screen before this scene?", "candidates": ["dog", "chicken", "pig", "cat", "cow"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "hDbIzOx0Mnk_0", "video_path": "hDbIzOx0Mnk.mp4", "subtitle_path": "hDbIzOx0Mnk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.47, "view_count": 187089}, {"video_id": "hDbIzOx0Mnk", "question": "In a room with white walls, there is a woman wearing a white shirt and olive pants lying diagonally on a red sofa. Her left hand is raised high, and the white text '7:00 pm' is precisely on her arm. 
What is the first time that appears on the screen after this scene?", "question_wo_referring_query": "What is the first time that appears on the screen after this scene?", "candidates": ["9:30 pm", "6:30 pm", "6:00 pm", "2:00 pm", "8:00 pm"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "hDbIzOx0Mnk_1", "video_path": "hDbIzOx0Mnk.mp4", "subtitle_path": "hDbIzOx0Mnk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.47, "view_count": 187089}, {"video_id": "hDbIzOx0Mnk", "question": "In a room with pink lighting, a woman wearing earbuds and a white short-sleeved shirt is sitting cross-legged on a bed. She is putting an item into a gray bag. After this scene, what text first appears inside the green bar?", "question_wo_referring_query": "After this scene, what text first appears inside the green bar?", "candidates": ["not to brag", "but also because i am just filled with gratitude", "and so much love; thank you for letting me be part of your life,\n", "happy noises", "but also because i am just filled with gratitude"], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "O3O", "level": "L2-Relation", "id": "hDbIzOx0Mnk_2", "video_path": "hDbIzOx0Mnk.mp4", "subtitle_path": "hDbIzOx0Mnk_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 900.47, "view_count": 187089}, {"video_id": "BHw4NStQsT8", "question": "There is a man wearing a black and white striped shirt, holding a piece of olive-colored cardboard with the words 'STARVING BILLIONAIRE' on it. 
What is pasted on this cardboard?", "question_wo_referring_query": "What is pasted on this cardboard?", "candidates": ["Coin", "Flower", "Fruit", "Leaf"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "BHw4NStQsT8_0", "video_path": "BHw4NStQsT8.mp4", "subtitle_path": "BHw4NStQsT8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 593.59, "view_count": 1229471}, {"video_id": "BHw4NStQsT8", "question": "In a laboratory, there is a woman wearing blue and a yellow hat, along with a man wearing a black coat and a yellow hat standing in front of a control panel. There are four objects floating in mid-air on the screen. What are the objects floating in mid-air?", "question_wo_referring_query": "What are the objects floating in mid-air?", "candidates": ["paper money", "hats", "white paper", "balloons"], "topic_category": "KS-Knowledge-STEM", "question_category": "E2O", "level": "IntraMoment", "id": "BHw4NStQsT8_1", "video_path": "BHw4NStQsT8.mp4", "subtitle_path": "BHw4NStQsT8_en.json", "duration_group": 600, "starting_timestamp_for_subtitles": 0, "duration": 593.59, "view_count": 1229471}, {"video_id": "_skKO-cjRtw", "question": "In a vast gray studio, a man with short blond hair, wearing a light blue suit jacket, is sitting at a long black table with a glass of water and a sheet of paper on it. 
What is the material of the glass on the screen?", "question_wo_referring_query": "What is the material of the glass on the screen?", "candidates": ["A white thermos cup", "A light blue paper cup", "A glass of water filled to nine-tenths capacity", "A plastic cup with an ice blue glass pattern"], "topic_category": "NP-News-Programs", "question_category": "S2A", "level": "IntraMoment", "id": "_skKO-cjRtw_0", "video_path": "_skKO-cjRtw.mp4", "subtitle_path": "_skKO-cjRtw_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 296, "duration": 11.0, "view_count": 62084}, {"video_id": "3Cl9ey8eEtY", "question": "In the video, there is a blonde woman wearing a blue jacket, driving a car. In the upper left corner of the screen, there's a text that reads 'cinnamon dolce latte'; in another scene, the same woman is shown with one hand on the steering wheel and the other hand pointing forward. Which scene appears first?", "question_wo_referring_query": "Which scene appears first?", "candidates": ["The scene with a blonde woman wearing a blue jacket, driving a car, with the text 'cinnamon dolce latte' in the upper left corner, appears first; the scene with the woman, one hand on the steering wheel and the other hand pointing forward, appears later.", "It's the same scene.", "The scene with a blonde woman wearing a blue jacket, driving a car, with the text 'cinnamon dolce latte' in the upper left corner, appears first; the scene with the woman holding a cup of coffee appears later.", "The scene with the woman, one hand on the steering wheel and the other hand pointing forward, appears first; the scene with a blonde woman wearing a blue jacket, driving a car, with the text 'cinnamon dolce latte' in the upper left corner, appears later."], "topic_category": "LV-Lifestyle-Life-Vlogs", "question_category": "SSS", "level": "L2-Relation", "id": "3Cl9ey8eEtY_0", "video_path": "3Cl9ey8eEtY.mp4", "subtitle_path": "3Cl9ey8eEtY_en.json", "duration_group": 15, 
"starting_timestamp_for_subtitles": 163, "duration": 8.0, "view_count": 23335}, {"video_id": "bCcTBSnp6rQ", "question": "In the scene where two men in white long sleeves are standing by the river under a white-clouded daytime sky, what is the color of the leaves on the tree behind them?", "question_wo_referring_query": "In this scene, what is the color of the leaves on the tree behind the two men?", "candidates": ["purple", "black", "green", "white"], "topic_category": "LT-Lifestyle-Travel-Guides", "question_category": "S2A", "level": "IntraMoment", "id": "bCcTBSnp6rQ_0", "video_path": "bCcTBSnp6rQ.mp4", "subtitle_path": "bCcTBSnp6rQ_en.json", "duration_group": 15, "starting_timestamp_for_subtitles": 532, "duration": 14.0, "view_count": 729668}, {"video_id": "FNDVy_BR8aA", "question": "On the far right of the screen, from top to bottom, there are three characters appearing in a rectangular frame. To the left is a white background with text. When the subtitle says 'decent initialization', what objects are present on the entire screen?", "question_wo_referring_query": "What objects are present on the entire screen?", "candidates": ["A pie chart", "Two line graphs with red lines", "A bar chart", "Two line graphs with green lines", "A bar chart"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "FNDVy_BR8aA_0", "video_path": "FNDVy_BR8aA.mp4", "subtitle_path": "FNDVy_BR8aA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2686.57, "view_count": 4228}, {"video_id": "FNDVy_BR8aA", "question": "In a room with a green rectangular bookcase and books, a black-haired man with a pompadour hairstyle and glasses is sitting down. 
What objects are present in the room when the subtitle says 'um the um the future of'?", "question_wo_referring_query": "What objects are present in the room?", "candidates": ["refrigerator", "oven", "washing machine", "computer", "pushcart"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "FNDVy_BR8aA_1", "video_path": "FNDVy_BR8aA.mp4", "subtitle_path": "FNDVy_BR8aA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2686.57, "view_count": 4228}, {"video_id": "FNDVy_BR8aA", "question": "In front of a green curtain, there is a man wearing black clothes and a white headset sitting on a black chair. What object is present on the screen when the subtitle says 'available'?", "question_wo_referring_query": "What object is present on the screen?", "candidates": ["blue swimming goggles", "purple sunglasses", "red swimming goggles", "black sunglasses", "red sunglasses"], "topic_category": "KC-Knowledge-Computer-Science", "question_category": "T2O", "level": "IntraMoment", "id": "FNDVy_BR8aA_2", "video_path": "FNDVy_BR8aA.mp4", "subtitle_path": "FNDVy_BR8aA_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 2686.57, "view_count": 4228}, {"video_id": "3R0e6Emdy5o", "question": "Which of the following sequences of scenes is correct?", "question_wo_referring_query": "Which of the following sequences of scenes is correct?", "candidates": ["First, two lines of English text with underlines appear, then a long-haired girl wearing a purple-red coat faces a mirror, followed by two arrows, one red and one blue-purple, in the top left corner of the screen, and finally, the letters XAZ in black appear in the center of the screen.", "First, two arrows, one red and one blue-purple, appear in the top left corner of the screen, then a long-haired girl wearing a purple-red coat faces a mirror, followed by the letters XAZ in black in the center of the screen, and 
finally, two lines of English text with underlines appear.", "First, a long-haired girl wearing a purple-red coat faces a mirror, then two arrows, one red and one blue-purple, appear in the top left corner of the screen, followed by the letters XAZ in black in the center of the screen, and finally, two lines of English text with underlines appear.", "First, the letters XAZ in black appear in the center of the screen, then a long-haired girl wearing a purple-red coat faces a mirror, followed by two arrows, one red and one blue-purple, in the top left corner of the screen, and finally, two lines of English text with underlines appear.", "First, two lines of English text with underlines appear, then a long-haired girl wearing a purple-red coat faces a mirror, followed by two arrows, one red and one blue-purple, in the top left corner of the screen, and finally, the letters XYZ in black appear in the center of the screen."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "3R0e6Emdy5o_0", "video_path": "3R0e6Emdy5o.mp4", "subtitle_path": "3R0e6Emdy5o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1466.57, "view_count": 47943}, {"video_id": "3R0e6Emdy5o", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen, with a blue formula appearing during the analysis. Then, a rectangular blue-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen. 
Finally, a rectangular red-bordered frame appears in the upper right corner of the screen showing a woman wearing a sky blue fur coat, explaining the topic on the left side of the screen.", "First, a rectangular red-bordered frame appears in the upper right corner of the screen showing a woman wearing a purple-red fur coat, explaining the topic on the left side of the screen. Then, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen. Finally, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen, with a blue formula appearing during the analysis.", "First, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen, with a blue formula appearing during the analysis. Then, a rectangular red-bordered frame appears in the upper right corner of the screen showing a woman wearing a purple-red fur coat, explaining the topic on the left side of the screen. Finally, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen.", "First, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen, with a blue formula appearing during the analysis. Then, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen. 
Finally, a rectangular red-bordered frame appears in the upper right corner of the screen showing a woman wearing a purple-red fur coat, explaining the topic on the left side of the screen.", "First, a rectangular purple-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen, with a blue formula appearing during the analysis. Then, a rectangular blue-bordered frame appears in the upper right corner of the screen showing a woman wearing a gray coat, explaining the topic on the left side of the screen. Finally, a rectangular red-bordered frame appears in the upper right corner of the screen showing a woman wearing a purple-red fur coat, explaining the topic on the left side of the screen."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "3R0e6Emdy5o_1", "video_path": "3R0e6Emdy5o.mp4", "subtitle_path": "3R0e6Emdy5o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1466.57, "view_count": 47943}, {"video_id": "3R0e6Emdy5o", "question": "Which of the following sequences of events is correct?", "question_wo_referring_query": "Which of the following sequences of events is correct?", "candidates": ["First, a small black star appears on the white grid, then an A appears in the top left corner of the white grid. After that, an A in the bottom right corner of the white grid is circled. Finally, a larger X appears in the bottom right corner.", "First, a larger X appears in the bottom right corner of the white grid, then an A appears in the top left corner. After that, an A in the bottom right corner of the white grid is circled. Finally, a small black star appears.", "First, an A appears in the top left corner of the white grid, then a larger X appears in the bottom right corner. After that, an A in the bottom right corner of the white grid is circled. 
Finally, a small black star appears on the white grid.", "First, a small black star appears on the white grid, then a larger X appears in the bottom right corner of the white grid. After that, an A in the bottom right corner of the white grid is circled. Finally, an A appears in the top left corner of the white grid.", "First, a larger X appears in the bottom right corner of the white grid, then an A appears in the top left corner. After that, an A in the bottom right corner of the white grid is circled. Finally, a small black star appears on the white grid."], "topic_category": "KS-Knowledge-STEM", "question_category": "SSS", "level": "L2-Relation", "id": "3R0e6Emdy5o_2", "video_path": "3R0e6Emdy5o.mp4", "subtitle_path": "3R0e6Emdy5o_en.json", "duration_group": 3600, "starting_timestamp_for_subtitles": 0, "duration": 1466.57, "view_count": 47943}]
================================================
FILE: lmms-eval_videochat/eval_annotations/LongVideoBench/lvb_val.json
================================================
[
{
"video_id": "86CxyhFV9MI",
"question": "In the video, which subtitles appear at the same time as the man with black hair, dressed in grey clothes with black sleeves, on stage?",
"question_wo_referring_query": "Which subtitles appear at the same time?",
"candidates": [
"promisc has come to an end, in and run away countless times, i was just scared, i still",
"run away countless times, i was just scared, i still and front of our crown, like a world of souls,",
"promisc has come to an end, in and front of our crown, like a world of souls,",
"promisc has come to an end, in and captain of the godson, three three three three three three"
],
"correct_choice": 1,
"position": [
854,
948,
1373
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "86CxyhFV9MI_0",
"video_path": "86CxyhFV9MI.mp4",
"subtitle_path": "86CxyhFV9MI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 190.16,
"view_count": 259852
},
{
"video_id": "86CxyhFV9MI",
"question": "Which subtitles appear at the same time as the gray-haired man in the black shirt on stage in the video?",
"question_wo_referring_query": "Which subtitles appear simultaneously?",
"candidates": [
"[applause] and Seye Seye Seye Seye Seye",
"[applause] and my love that I made on the ke is",
"captain of the godson, three three three three three three and my love that I made on the ke is",
"[applause] and captain of the godson, three three three three three three"
],
"correct_choice": 2,
"position": [
353,
1981,
3899
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "86CxyhFV9MI_1",
"video_path": "86CxyhFV9MI.mp4",
"subtitle_path": "86CxyhFV9MI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 190.16,
"view_count": 259852
},
{
"video_id": "9PD3ciudpIE",
"question": "In front of an ancient building, a woman walks by holding something. When the subtitle says 'temples scattered throughout the', what color is the woman's clothing at that moment?",
"question_wo_referring_query": "In front of an ancient building, a woman walks by holding something. When the subtitle says 'temples scattered throughout the', what color is the woman's clothing at that moment?",
"candidates": [
"yellow",
"green",
"blue",
"red"
],
"correct_choice": 0,
"position": [
3068
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "9PD3ciudpIE_0",
"video_path": "9PD3ciudpIE.mp4",
"subtitle_path": "9PD3ciudpIE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 467.92,
"view_count": 4172
},
{
"video_id": "9PD3ciudpIE",
"question": "In front of a light yellow wall, there is a woman wearing a pink hat sitting with another woman with long hair wearing a dark blue outfit. When the subtitle mentions 'the old people that can't work anymore,' what is the woman with the pink hat wearing?",
"question_wo_referring_query": "What is the woman with the pink hat wearing?",
"candidates": [
"red short sleeves",
"red long sleeves",
"black long sleeves",
"pink short sleeves"
],
"correct_choice": 0,
"position": [
8561
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "9PD3ciudpIE_1",
"video_path": "9PD3ciudpIE.mp4",
"subtitle_path": "9PD3ciudpIE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 467.92,
"view_count": 4172
},
{
"video_id": "5qMcDQd17Y4",
"question": "In the scene, there is a man wearing a white shirt standing in front of a red and white building. When the subtitle mentions 'conference happening about central,' what is behind the man to his right?",
"question_wo_referring_query": "What is behind the man to his right?",
"candidates": [
"cell phone",
"car",
"computer",
"street light"
],
"correct_choice": 3,
"position": [
1290
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "5qMcDQd17Y4_0",
"video_path": "5qMcDQd17Y4.mp4",
"subtitle_path": "5qMcDQd17Y4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 584.64,
"view_count": 74618
},
{
"video_id": "5qMcDQd17Y4",
"question": "A woman wearing a black and white woven top, glasses, with long black hair is mentioned with the subtitle 'you're interested in.' What is behind her on the right?",
"question_wo_referring_query": "What is behind this woman on the right?",
"candidates": [
"door",
"computer",
"lamp",
"phone"
],
"correct_choice": 0,
"position": [
10687
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "5qMcDQd17Y4_1",
"video_path": "5qMcDQd17Y4.mp4",
"subtitle_path": "5qMcDQd17Y4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 584.64,
"view_count": 74618
},
{
"video_id": "7F9IrtSHmc0",
"question": "In a room with a wall tiger and a map on the wall, there is a man wearing a white shirt. What is he doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"drinking water",
"playing with a cell phone",
"speaking",
"dancing"
],
"correct_choice": 2,
"position": [
305
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "7F9IrtSHmc0_0",
"video_path": "7F9IrtSHmc0.mp4",
"subtitle_path": "7F9IrtSHmc0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 422.66,
"view_count": 46214
},
{
"video_id": "7F9IrtSHmc0",
"question": "On a stage with lights, there are many people wearing colorful outfits. What are these people in the colorful outfits doing?",
"question_wo_referring_query": "What are these people in the colorful outfits doing?",
"candidates": [
"Drinking water",
"Playing piano",
"Sitting",
"Dancing"
],
"correct_choice": 3,
"position": [
2711
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "7F9IrtSHmc0_1",
"video_path": "7F9IrtSHmc0.mp4",
"subtitle_path": "7F9IrtSHmc0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 422.66,
"view_count": 46214
},
{
"video_id": "QJ6sjg7SXOQ",
"question": "The screen is split into two sections, and in the small section on the far right, what is the man wearing a hat doing in front of a brown horse?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Walking the horse",
"Punching towards the camera",
"Riding the horse",
"Kneeling down",
"Extending his palm forward while facing the camera"
],
"correct_choice": 4,
"position": [
6643
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "QJ6sjg7SXOQ_0",
"video_path": "QJ6sjg7SXOQ.mp4",
"subtitle_path": "QJ6sjg7SXOQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 468.48,
"view_count": 210195
},
{
"video_id": "QJ6sjg7SXOQ",
"question": "The screen switches to a split-screen view with two sections, left and right. At the bottom of the screen, there is a red rectangle with white text that reads 'Israel's renewed bombardment of Gaza.' In the video on the left side of the screen, amidst scattered debris and items on the ground, a person dressed in a black shirt and black pants is holding a large round metal bucket. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Throwing the metal bucket into the sea",
"Burning the metal bucket",
"Throwing the metal bucket on the ground",
"Dumping garbage into the metal bucket",
"Placing the metal bucket in a vehicle"
],
"correct_choice": 2,
"position": [
7951
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "QJ6sjg7SXOQ_1",
"video_path": "QJ6sjg7SXOQ.mp4",
"subtitle_path": "QJ6sjg7SXOQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 468.48,
"view_count": 210195
},
{
"video_id": "mH9LdC7IFH8",
"question": "In the scene, a man wearing glasses and dressed in a black suit is speaking to the camera under the golden sky. After the subtitle 'of that still uh in tanks and will be,' what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The screen switches to four cameras",
"The screen switches to two cameras",
"The camera view is enlarged",
"The screen switches to three cameras",
"The camera view is reduced"
],
"correct_choice": 1,
"position": [
4737,
4797
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "mH9LdC7IFH8_0",
"video_path": "mH9LdC7IFH8.mp4",
"subtitle_path": "mH9LdC7IFH8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 322.04,
"view_count": 7310
},
{
"video_id": "mH9LdC7IFH8",
"question": "What happened on the screen before a man in black armor with glasses spoke into the microphone in front of a golden sky and the subtitles said 'country uh so we've seen significant'?",
"question_wo_referring_query": "What happened on the screen?",
"candidates": [
"A black-haired girl was shaking a wine glass in her hand",
"A woman in pink clothes was standing on a ladder",
"A red-haired girl was shaking a wine glass in her hand",
"A person was opening a wine bottle cap",
"A car was driving on the grass"
],
"correct_choice": 0,
"position": [
2276,
2341
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "mH9LdC7IFH8_1",
"video_path": "mH9LdC7IFH8.mp4",
"subtitle_path": "mH9LdC7IFH8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 322.04,
"view_count": 7310
},
{
"video_id": "LB1qPExHQY8",
"question": "In the bottom right corner with a black tank in the background, what change occurred to the white word 'Plans' located at the bottom left of the screen when the subtitle said 'upcoming plans next steps are'?",
"question_wo_referring_query": "What change occurred?",
"candidates": [
"The text got bigger",
"The text turned black",
"The text turned red",
"The text got smaller",
"The text turned purple"
],
"correct_choice": 0,
"position": [
377,
1150
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "LB1qPExHQY8_0",
"video_path": "LB1qPExHQY8.mp4",
"subtitle_path": "LB1qPExHQY8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 181.4,
"view_count": 5521
},
{
"video_id": "LB1qPExHQY8",
"question": "In the background with a black tank in the lower right corner, when the word 'The ugly' in white text at the bottom left of the screen appears and the subtitle says 'framework now the ugly part well my', what change occurs?",
"question_wo_referring_query": "What change occurs?",
"candidates": [
"The font color changes to white",
"The font size increases",
"The font size decreases",
"The font color changes to red",
"The font color changes to black"
],
"correct_choice": 1,
"position": [
310,
2111
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "LB1qPExHQY8_1",
"video_path": "LB1qPExHQY8.mp4",
"subtitle_path": "LB1qPExHQY8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 181.4,
"view_count": 5521
},
{
"video_id": "8hhcFRoR0mw",
"question": "In front of the beige cabinet, the woman in the green apron seen on screen puts her hand into the iron pot containing a green and white mushy substance. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Rubbing the contents of the pot on her face",
"Adding an egg to the pot",
"Shaping the contents of the pot into a ball",
"Using her hands to stir the contents of the pot",
"Adding water to the pot"
],
"correct_choice": 3,
"position": [
6721
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "8hhcFRoR0mw_0",
"video_path": "8hhcFRoR0mw.mp4",
"subtitle_path": "8hhcFRoR0mw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 516.27,
"view_count": 144552
},
{
"video_id": "8hhcFRoR0mw",
"question": "On a brown floor, a woman with black hair, wearing a white top and a green skirt, is kneeling on a white mat. On the leftmost side of the white table in front of her, there's a metal pot. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Making dumplings",
"Watching TV",
"Listening to music",
"Drinking water",
"Stir-frying vegetables"
],
"correct_choice": 0,
"position": [
10479
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "8hhcFRoR0mw_1",
"video_path": "8hhcFRoR0mw.mp4",
"subtitle_path": "8hhcFRoR0mw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 516.27,
"view_count": 144552
},
{
"video_id": "h4jIoMxZopU",
"question": "In front of a blue background, a gentleman wearing a shirt with pink floral patterns is speaking. What did the gentleman do after becoming friends with the unicorn?",
"question_wo_referring_query": "What did the gentleman do after becoming friends with the unicorn?",
"candidates": [
"Put on a watch",
"Changed into a different shirt",
"Put on a mask",
"Put on a pair of sunglasses",
"Put on a unicorn headpiece"
],
"correct_choice": 4,
"position": [
72,
556
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "h4jIoMxZopU_0",
"video_path": "h4jIoMxZopU.mp4",
"subtitle_path": "h4jIoMxZopU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 538.08,
"view_count": 175981
},
{
"video_id": "h4jIoMxZopU",
"question": "Two yaks, one large and one small, appeared by the muddy riverbank. The ground by the riverbank is a chaotic mix of yellow earth. The small yak's head is resting on the hind legs of the large yak. What did the yak do after coming to the riverbank?",
"question_wo_referring_query": ", what did the yak do after coming to the riverbank?",
"candidates": [
"Lowered its head to drink water",
"Moved backward",
"Raised its head and made a sound",
"Left the riverbank",
"Turned around to look at the small yak"
],
"correct_choice": 0,
"position": [
10600,
10648
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "h4jIoMxZopU_1",
"video_path": "h4jIoMxZopU.mp4",
"subtitle_path": "h4jIoMxZopU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 538.08,
"view_count": 175981
},
{
"video_id": "M7YSCIkUaNw",
"question": "On the black table, there is a white tray and bowl. The tray has a dark-colored bottle with a white paper label on it, and the bowl contains a white powder. A hand covered with a sleeve is mixing the contents with a kitchen utensil. After the subtitle 'basic muffin butter mixed with a cup of', what appears on the screen?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"silver kitchen utensils",
"white bowl",
"red fruit",
"butterfly",
"white water bottle"
],
"correct_choice": 2,
"position": [
8438,
8450,
8921
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "M7YSCIkUaNw_0",
"video_path": "M7YSCIkUaNw.mp4",
"subtitle_path": "M7YSCIkUaNw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 522.9,
"view_count": 23762
},
{
"video_id": "M7YSCIkUaNw",
"question": "Walking through rows of shelves in a dazzling shop, ahead are humanoid figures with long white beards and plant decorations. The shelves on the right are filled with red and white Christmas toys. After the subtitle 'really happy' appears, what item appears on the screen?",
"question_wo_referring_query": "Lin, what item appears on the screen?",
"candidates": [
"A burning white candle",
"A little girl",
"A courier paper box",
"Santa Claus mug",
"Two books"
],
"correct_choice": 3,
"position": [
4853,
4890,
5046
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "M7YSCIkUaNw_1",
"video_path": "M7YSCIkUaNw.mp4",
"subtitle_path": "M7YSCIkUaNw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 522.9,
"view_count": 23762
},
{
"video_id": "x1FkhxMMIcg",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, the male host speaks, then the female host and the teacher have an online discussion, and finally, the female host introduces the teacher.",
"First, the male host speaks, then the female host introduces the teacher, and finally, the female host and the teacher have an online discussion.",
"First, the female host introduces a teacher, then the female host and the teacher have an online discussion, and finally, a male host speaks.",
"First, the female host and the teacher have an online discussion, then the female host introduces the teacher, and finally, a male host speaks.",
"First, the female host and the teacher have an online discussion, then the male host speaks, and finally, the female host introduces the teacher."
],
"correct_choice": 2,
"position": [
1,
601,
4577
],
"topic_category": "NP-News-Programs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "x1FkhxMMIcg_0",
"video_path": "x1FkhxMMIcg.mp4",
"subtitle_path": "x1FkhxMMIcg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 199.73,
"view_count": 985
},
{
"video_id": "CEZ9rbjK3P4",
"question": "A woman in a black top is sitting in front of a table. Behind and beside her are white walls and a bookshelf filled with books. There is a wooden object in front of her, and there are drawings on the table in front of her. Where else has this woman appeared?",
"question_wo_referring_query": "Where else has this woman appeared?",
"candidates": [
"A room with a window",
"Under green trees outdoors",
"In front of a table with potted plants",
"In front of a table with a desk lamp",
"A bench in the park"
],
"correct_choice": 0,
"position": [
112,
310
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "CEZ9rbjK3P4_0",
"video_path": "CEZ9rbjK3P4.mp4",
"subtitle_path": "CEZ9rbjK3P4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 200.8,
"view_count": 12920
},
{
"video_id": "CEZ9rbjK3P4",
"question": "In the center of the screen is a drawing, depicting a person resting their hand on their forehead on a yellowish paper. The drawing is composed of lines and lacks colors. Where else has this artwork appeared?",
"question_wo_referring_query": "Where else has this artwork appeared?",
"candidates": [
"In a room with a bookshelf",
"On a bench outside",
"In front of a table holding a vase",
"On the glass of a transparent window",
"In front of a table with a desk lamp"
],
"correct_choice": 0,
"position": [
1674,
3277
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "CEZ9rbjK3P4_1",
"video_path": "CEZ9rbjK3P4.mp4",
"subtitle_path": "CEZ9rbjK3P4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 200.8,
"view_count": 12920
},
{
"video_id": "MPQn_orwpfA",
"question": "The floor of the room is brown, there are flags and pictures on the wall. To the right is a bookshelf filled with books. A zoomed out map appears in the top right corner. A man wearing a grey coat is sitting on a black stool. What subtitle appears together with the map in the top right corner?",
"question_wo_referring_query": "What subtitle appears together with the map in the top right corner?",
"candidates": [
"Music",
"Europe Russia China, India and Australia most of these lines actually don't exist. So let's assume",
"65 to 94",
"We're funding maybe about 75% of all these train lines",
"You know at first glance this picture"
],
"correct_choice": 1,
"position": [
4943,
5022,
5098,
6870
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "MPQn_orwpfA_0",
"video_path": "MPQn_orwpfA.mp4",
"subtitle_path": "MPQn_orwpfA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 443.81,
"view_count": 984578
},
{
"video_id": "MPQn_orwpfA",
"question": "The floor of the room is tan, there are flags and pictures posted on the walls, a bookshelf filled with books on the right side, in the upper right corner there is a picture of an airplane flying in a clear sky, a man wearing a gray coat sitting on a black chair, which subtitle does this airplane picture appear together with?",
"question_wo_referring_query": ", which subtitle does this airplane picture appear together with?",
"candidates": [
"Europe Russia China, India and Australia most of these lines actually don't exist. So let's assume",
"We're funding maybe about 75% of all these train lines",
"You konw at first glance this picture",
"We live in a time in which air travel is the preferred method of long distance Journeys however",
"65 to 94"
],
"correct_choice": 3,
"position": [
6975,
6986,
5022,
5098
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "MPQn_orwpfA_1",
"video_path": "MPQn_orwpfA.mp4",
"subtitle_path": "MPQn_orwpfA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 443.81,
"view_count": 984578
},
{
"video_id": "5KLSf7kwr7s",
"question": "There is a machine next to the white wall. The machine's inlet has a gradually narrowing conical shape. At the outlet of the machine, there is a green plastic container. The engine of the machine is located on the left side. A person wearing jeans and a blue shirt is operating the machine. Who passes by from the right side of the machine?",
"question_wo_referring_query": "Who passes by from the right side of the machine?",
"candidates": [
"A woman",
"A man",
"A child",
"A cat",
"A dog"
],
"correct_choice": 4,
"position": [
7270
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "5KLSf7kwr7s_0",
"video_path": "5KLSf7kwr7s.mp4",
"subtitle_path": "5KLSf7kwr7s_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 494.26,
"view_count": 145023
},
{
"video_id": "5KLSf7kwr7s",
"question": "From the aerial view of the highway, there is a green water flow and a white boat below, a sandy area next to the green water flow, and three lines of wires on the left side of the highway. What passed by on the highway?",
"question_wo_referring_query": "What passed by on the highway?",
"candidates": [
"A white truck",
"A black motorcycle",
"A black truck",
"A white car with a black roof",
"A blue car with a black roof"
],
"correct_choice": 3,
"position": [
5336
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "5KLSf7kwr7s_1",
"video_path": "5KLSf7kwr7s.mp4",
"subtitle_path": "5KLSf7kwr7s_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 494.26,
"view_count": 145023
},
{
"video_id": "rMXJOKhf_AA",
"question": "The white-robed lady wearing the crown sits on a golden chair, with golden ornaments hanging on her chest. On both sides of her are two men, each with various badges on their outer clothes. One man with a golden cape is bending in front of the lady. What did the crowned lady do on her first appearance?",
"question_wo_referring_query": "What did the crowned lady do on her first appearance?",
"candidates": [
"Accepted the item handed over by the man on the right",
"Accepted the item handed over by the man on the left",
"Refused the item handed over by the man on the right",
"Accepted the item handed over by the man who was bending",
"Refused the item handed over by the man who was bending"
],
"correct_choice": 3,
"position": [
6850
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "rMXJOKhf_AA_0",
"video_path": "rMXJOKhf_AA.mp4",
"subtitle_path": "rMXJOKhf_AA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 341.6,
"view_count": 3909869
},
{
"video_id": "rMXJOKhf_AA",
"question": "A man dressed as a clown is standing in front of a brick wall. The man has a red ball on his nose and is adorned with red decorations on his face. He is wearing a flowery ring and a purple coat. What did the man do the first time he appeared?",
"question_wo_referring_query": "What did the man do the first time he appeared?",
"candidates": [
"The man ran to the left side.",
"The man kept waving his hands.",
"The man wiggled his waist up and down.",
"The man spread his hands and climbed the wall.",
"The man's flowery ring fell on the ground."
],
"correct_choice": 2,
"position": [
2328
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "rMXJOKhf_AA_1",
"video_path": "rMXJOKhf_AA.mp4",
"subtitle_path": "rMXJOKhf_AA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 341.6,
"view_count": 3909869
},
{
"video_id": "LcZxjqtzXJI",
"question": "The room is lit with dim, warm-colored lights. Three items are displayed in the right cabinet. The middle item is standing on a glossy surface, and the other two are suspended on the left and right sides. In the center of the room's screen, a dark-colored item is encased in transparent glass. On the left side, there are two black shadows. What are the black shadows doing?",
"question_wo_referring_query": "What are the black shadows doing?",
"candidates": [
"Walking towards the exit",
"Observing the items in the cabinet",
"Tidying up their clothes",
"Touching the cabinet",
"Tapping the cabinet"
],
"correct_choice": 0,
"position": [
3330
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "LcZxjqtzXJI_0",
"video_path": "LcZxjqtzXJI.mp4",
"subtitle_path": "LcZxjqtzXJI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 208.04,
"view_count": 5498
},
{
"video_id": "LcZxjqtzXJI",
"question": "In a room filled with light, a man with a bald head wearing black-rimmed glasses stands in front of a glass showcase. The man is dressed in a black suit. Inside the transparent glass, there are two small sculptures, one of which is facing the man. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Touching the showcase",
"Adjusting his collar",
"Knocking on the showcase",
"Adjusting his sleeve",
"Staring at the sculpture"
],
"correct_choice": 4,
"position": [
723
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "LcZxjqtzXJI_1",
"video_path": "LcZxjqtzXJI.mp4",
"subtitle_path": "LcZxjqtzXJI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 208.04,
"view_count": 5498
},
{
"video_id": "JIoE18JYGcM",
"question": "In front of a pure blue background with white squares, there is a man with short hair wearing a gray suit with a white printed shirt inside. What color are his glasses?",
"question_wo_referring_query": "What color are his glasses?",
"candidates": [
"gold",
"black",
"blue",
"white",
"silver"
],
"correct_choice": 1,
"position": [
4518
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2A",
"level": "L1-Perception",
"id": "JIoE18JYGcM_0",
"video_path": "JIoE18JYGcM.mp4",
"subtitle_path": "JIoE18JYGcM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 270.31,
"view_count": 210555
},
{
"video_id": "JIoE18JYGcM",
"question": "On a white webpage background, it says 'Physics of the Everyday,' with a yellow icon next to the text. There is also a bird next to the icon. What kind of feathers does this bird have?",
"question_wo_referring_query": "What kind of feathers does this bird have?",
"candidates": [
"Gray feathers",
"Black feathers",
"Black and white feathers",
"Pink feathers",
"White feathers"
],
"correct_choice": 3,
"position": [
5837
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2A",
"level": "L1-Perception",
"id": "JIoE18JYGcM_1",
"video_path": "JIoE18JYGcM.mp4",
"subtitle_path": "JIoE18JYGcM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 270.31,
"view_count": 210555
},
{
"video_id": "rZq-8Bq3mkU",
"question": "In front of a yellow background, a man with short hair, wearing black-framed glasses and dressed in a deep purple hooded sweatshirt. What is he doing when he appears for the first time?",
"question_wo_referring_query": "What is he doing when he appears for the first time?",
"candidates": [
"Frantically waving hands, speaking towards the camera",
"Raising both hands with palms open",
"Right hand holding the left hand, speaking towards the camera",
"Arms spread wide, speaking towards the camera",
"Left hand holding the right hand, speaking towards the camera"
],
"correct_choice": 4,
"position": [
100
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "rZq-8Bq3mkU_0",
"video_path": "rZq-8Bq3mkU.mp4",
"subtitle_path": "rZq-8Bq3mkU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 515.18,
"view_count": 242267
},
{
"video_id": "rZq-8Bq3mkU",
"question": "In a white room, a person wearing blue protective clothing, a mask, and a face shield is standing in front of a person with short blonde hair. What is the person in the blue protective clothing doing the first time they appear?",
"question_wo_referring_query": "What is the person in blue protective clothing doing the first time they appear?",
"candidates": [
"Using a cotton swab to treat a wound of the person with short blonde hair",
"Putting a mask on the person with short blonde hair",
"Using a bandage to wrap the person with short blonde hair",
"Taking a photo of the person with short blonde hair using a phone",
"Taking the temperature of the person with short blonde hair with a thermometer"
],
"correct_choice": 4,
"position": [
390
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "rZq-8Bq3mkU_1",
"video_path": "rZq-8Bq3mkU.mp4",
"subtitle_path": "rZq-8Bq3mkU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 515.18,
"view_count": 242267
},
{
"video_id": "VSZ8ywgGNGM",
"question": "On a grass field, a woman dressed in a sleeveless white shirt is crouching on the grass. In front of her lies a small tabby kitten. What is the woman holding in her hand to play with the kitten?",
"question_wo_referring_query": "What is the woman holding in her hand to play with the kitten?",
"candidates": [
"a flower",
"a string",
"a piece of cloth",
"a blade of grass",
"a cat tease stick"
],
"correct_choice": 3,
"position": [
1604
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E2O",
"level": "L1-Perception",
"id": "VSZ8ywgGNGM_0",
"video_path": "VSZ8ywgGNGM.mp4",
"subtitle_path": "VSZ8ywgGNGM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 232.98,
"view_count": 1112314
},
{
"video_id": "VSZ8ywgGNGM",
"question": "There is a lioness inside an iron mesh enclosure. Outside the iron mesh, someone is extending a pair of tongs through the mesh towards the lioness's nose. What is being clamped by the tongs that are reaching towards the lioness's nose?",
"question_wo_referring_query": "What is being clamped by the tongs that are reaching towards the lioness's nose?",
"candidates": [
"A flower",
"Catnip",
"A piece of meat",
"A leaf",
"A chicken"
],
"correct_choice": 1,
"position": [
2851
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E2O",
"level": "L1-Perception",
"id": "VSZ8ywgGNGM_1",
"video_path": "VSZ8ywgGNGM.mp4",
"subtitle_path": "VSZ8ywgGNGM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 232.98,
"view_count": 1112314
},
{
"video_id": "ezhafkxoRdo",
"question": "In front of a blue background, a man wearing a dark grey short-sleeve shirt and black-framed glasses with short black hair said \"Thank you for watching this episode of SciShow!\" After he finished speaking, what was the first action he took?",
"question_wo_referring_query": "What was the first action he took?",
"candidates": [
"Crossed his hands into fists",
"Placed both hands on his abdomen",
"Raised his right hand with palm open",
"Took off his glasses with both hands open",
"Raised both hands with palms open"
],
"correct_choice": 2,
"position": [
6066,
6080,
6205
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "ezhafkxoRdo_0",
"video_path": "ezhafkxoRdo.mp4",
"subtitle_path": "ezhafkxoRdo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 292.88,
"view_count": 129688
},
{
"video_id": "ezhafkxoRdo",
"question": "Against a blue background, with a small green icon in the bottom right corner, a man wearing a dark grey short-sleeved shirt, with short black hair, and black-framed glasses, after he says \"or as an added layer of security against potential predators,\" what does he do first?",
"question_wo_referring_query": "What does he do first?",
"candidates": [
"Takes off glasses and wipes the lenses",
"Both palms clasped together",
"Right hand raised with palm open",
"Both hands raised with palms open",
"Both hands placed on the abdomen"
],
"correct_choice": 3,
"position": [
1486,
1508
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "ezhafkxoRdo_1",
"video_path": "ezhafkxoRdo.mp4",
"subtitle_path": "ezhafkxoRdo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 292.88,
"view_count": 129688
},
{
"video_id": "wxWo44MoCTI",
"question": "On a wooden-colored table, after a strip of meat in a glass bowl is placed into a coffee-colored pot, what change occurs to the strip of meat?",
"question_wo_referring_query": "On a wooden-colored table, after a strip of meat in a glass bowl is placed into a coffee-colored pot, what change occurs to the strip of meat?",
"candidates": [
"Changes from a strip shape to a triangular shape",
"Changes from a strip shape to a pie shape",
"Changes from a strip shape to a pentagonal shape",
"Changes from a strip shape to a paste-like state"
],
"correct_choice": 1,
"position": [
4279,
4559
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "wxWo44MoCTI_0",
"video_path": "wxWo44MoCTI.mp4",
"subtitle_path": "wxWo44MoCTI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 384.59,
"view_count": 200485
},
{
"video_id": "wxWo44MoCTI",
"question": "Once the white mold in the coffee-colored pot appears on a wooden table, what changes occur to the white mold?",
"question_wo_referring_query": "What changes occur to the white mold that appears on a wooden table from the coffee-colored pot?",
"candidates": [
"The mold changes from white to sauce color",
"The mold changes from scattered to cake-like",
"The mold changes from sheet-like to cake-like",
"The mold changes from white to yellow"
],
"correct_choice": 0,
"position": [
4749,
5156
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "wxWo44MoCTI_1",
"video_path": "wxWo44MoCTI.mp4",
"subtitle_path": "wxWo44MoCTI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 384.59,
"view_count": 200485
},
{
"video_id": "VkNF0rXuDXw",
"question": "On a piece of white paper, there is a black frame drawn on it, filled with blue water. Someone is holding a yellow pen. What is he doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Coloring the roof in the drawing",
"Coloring the flowers in the drawing",
"Coloring the radish in the drawing",
"Coloring the sun in the drawing",
"Coloring the western red cedar in the drawing"
],
"correct_choice": 3,
"position": [
2170
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "VkNF0rXuDXw_0",
"video_path": "VkNF0rXuDXw.mp4",
"subtitle_path": "VkNF0rXuDXw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 221.13,
"view_count": 465186
},
{
"video_id": "VkNF0rXuDXw",
"question": "On a piece of white paper, the words 'not useful' are written, and there is a downward-curving arrow. Someone is holding a white and blue pen. What are they doing?",
"question_wo_referring_query": "What are they doing?",
"candidates": [
"Coloring the roof in the drawing",
"Drawing sweat on the little person in the drawing",
"Coloring the sky in the drawing",
"Coloring the flowers in the drawing",
"Putting clothes on the little person in the drawing"
],
"correct_choice": 1,
"position": [
3275
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "VkNF0rXuDXw_1",
"video_path": "VkNF0rXuDXw.mp4",
"subtitle_path": "VkNF0rXuDXw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 221.13,
"view_count": 465186
},
{
"video_id": "v0QLje6xYgA",
"question": "On a pink stage, there's a girl with long black hair wearing a white top, white short skirt, and white knee-high stockings. When she appears for the first time, what is she doing?",
"question_wo_referring_query": "When she appears for the first time, what is she doing?",
"candidates": [
"Touching her face with her palm",
"One leg kneeling on the ground",
"Kneeling with her right hand supporting her chin",
"Kneeling with both hands resting on her left leg",
"Both legs kneeling on the ground"
],
"correct_choice": 3,
"position": [
247
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "v0QLje6xYgA_0",
"video_path": "v0QLje6xYgA.mp4",
"subtitle_path": "v0QLje6xYgA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 181.28,
"view_count": 523551
},
{
"video_id": "UUaiqR1I454",
"question": "At the beginning of the video, a man with numerous arm injuries is crawling on the ground. He has black short hair, and he is wearing olive-colored clothes with damaged sleeves. With which subtitles does this man appear together?",
"question_wo_referring_query": "With which subtitles does this man appear together?",
"candidates": [
"still blanketed darkness\u4e0eothers weren\u2019t so lucky",
"still blanketed darkness\u4e0ethey did not surrender",
"still blanketed in darkness\u4e0ealmost eclipsed the morning sun his",
"others weren\u2019t so lucky\u4e0ealmost eclipsed the morning sun his",
"they did not surrender\u4e0ealmost eclipsed the morning sun his"
],
"correct_choice": 2,
"position": [
175,
1980,
2082
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "UUaiqR1I454_0",
"video_path": "UUaiqR1I454.mp4",
"subtitle_path": "UUaiqR1I454_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 494.13,
"view_count": 1429720
},
{
"video_id": "UUaiqR1I454",
"question": "Inside a room, hanging on the wall are an orange bag and a green and white garment. A woman wearing green clothes is holding a baby dressed in white. In which captions does this baby appear together?",
"question_wo_referring_query": ", in which captions does this baby appear together?",
"candidates": [
"cuts and bruises and they did not surrender",
"cuts and bruises and years that followed his wife sadly died",
"cuts and bruises and many more suffered similarly",
"years that followed his wife sadly died and many more suffered similarly",
"they did not surrender and many more suffered similarly"
],
"correct_choice": 2,
"position": [
7002,
7044,
9074
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "UUaiqR1I454_1",
"video_path": "UUaiqR1I454.mp4",
"subtitle_path": "UUaiqR1I454_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 494.13,
"view_count": 1429720
},
{
"video_id": "x4dFLY_-vKs",
"question": "In front of a tree outside a bright window, a man dressed in black and wearing glasses is talking. When the subtitle mentions 'But we don't understand everything,' what is the man wearing on his head at this moment?",
"question_wo_referring_query": "At this moment, what is the man wearing on his head?",
"candidates": [
"black cap",
"green cap",
"red maple leaf",
"white overcoat",
"red-framed glasses"
],
"correct_choice": 0,
"position": [
1686
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "x4dFLY_-vKs_0",
"video_path": "x4dFLY_-vKs.mp4",
"subtitle_path": "x4dFLY_-vKs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 264.76,
"view_count": 48990
},
{
"video_id": "x4dFLY_-vKs",
"question": "In a studio, a woman with long hair dressed in black is giving an explanation. In front of her, a white pattern fills the screen and displays the word 'HEADLINES'. When the subtitle mentions 'Science, hence the telescope is on its way', what object is present on the screen?",
"question_wo_referring_query": "What object is present on the screen?",
"candidates": [
"Flesh-colored spacesuit without straps",
"Black spacesuit with straps",
"Black spacesuit without straps",
"Glasses",
"Flesh-colored spacesuit with straps"
],
"correct_choice": 1,
"position": [
4284
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "x4dFLY_-vKs_1",
"video_path": "x4dFLY_-vKs.mp4",
"subtitle_path": "x4dFLY_-vKs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 264.76,
"view_count": 48990
},
{
"video_id": "yn7oTvw8QRY",
"question": "There are two images here. One shows a girl in green clothing with braided hair, holding a clay container in front of a solid color background wall. The other shows a girl in black and white floral clothing with loose hair. According to the video, which character appears first?",
"question_wo_referring_query": "According to the video, which character appears first?",
"candidates": [
"Boy with short hair and green stripes",
"Boy with golden hair",
"Girl in green clothing with loose hair",
"Girl in green clothing with braided hair",
"Girl in black and white floral clothing with loose hair"
],
"correct_choice": 3,
"position": [
4480,
4509,
4651
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O3O",
"level": "L2-Relation",
"id": "yn7oTvw8QRY_0",
"video_path": "yn7oTvw8QRY.mp4",
"subtitle_path": "yn7oTvw8QRY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 225.97,
"view_count": 421988
},
{
"video_id": "yn7oTvw8QRY",
"question": "There are two pictures here. One features a little boy with short golden hair, wearing a short-sleeve striped shirt, standing in front of a solid-colored background wall, holding a mud-colored clay container made of clay. The other picture features a pottery jar with a mountain goat design. According to the video, which of the following pictures appears last?",
"question_wo_referring_query": "According to the video, which of the following pictures appears last?",
"candidates": [
"a small bowl made of clay",
"a boy with short golden hair wearing a long-sleeve shirt",
"a boy with short golden hair wearing a short-sleeve shirt",
"a pottery jar with a mountain goat design",
"a girl wearing green clothing and tying her hair"
],
"correct_choice": 3,
"position": [
4718,
4873
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O3O",
"level": "L2-Relation",
"id": "yn7oTvw8QRY_1",
"video_path": "yn7oTvw8QRY.mp4",
"subtitle_path": "yn7oTvw8QRY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 225.97,
"view_count": 421988
},
{
"video_id": "t1nhAnMQBHg",
"question": "In a dimly lit room, there is a chair. A man in front of the chair, wearing a white short-sleeved shirt, is bending over looking at something in his hands. When the subtitles mention 'I don't know what this is, it looks like a pointed object', what is the item on the man's head?",
"question_wo_referring_query": "What is the item on the man's head?",
"candidates": [
"A black beret with red and white interlaced text",
"A black baseball cap with red and white interlaced text",
"A black top hat with red and white interlaced text",
"A yellow baseball cap with red and white interlaced text",
"A black baseball cap with no text"
],
"correct_choice": 1,
"position": [
7874,
7942
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3O",
"level": "L2-Relation",
"id": "t1nhAnMQBHg_0",
"video_path": "t1nhAnMQBHg.mp4",
"subtitle_path": "t1nhAnMQBHg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 398.13,
"view_count": 251684
},
{
"video_id": "s1TjVlKBpjs",
"question": "In the distance, there is a multi-story building with red brick walls, and nearby, there is a blue enclosure. In the middle of the enclosure, there is a gray house with a column and bar. In front of the house, there is a flagpole with a national flag. Before the subtitle 'capturing the German U-boat bases' appears, what is the first type of electronic communication tool that appears?",
"question_wo_referring_query": "What is the first type of electronic communication tool that appears?",
"candidates": [
"telegraph",
"letter",
"cell phone",
"pager",
"landline phone"
],
"correct_choice": 1,
"position": [
666,
369
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "s1TjVlKBpjs_0",
"video_path": "s1TjVlKBpjs.mp4",
"subtitle_path": "s1TjVlKBpjs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 212.25,
"view_count": 3229465
},
{
"video_id": "s1TjVlKBpjs",
"question": "It was pouring rain, and among a grove of withered trees, the ground was all mud and bloodstains. When the subtitle 'The awful conditions caused rifles to become clogged up with mud and tanks to become stuck.' appears, what is the first type of weapon shown on the screen?",
"question_wo_referring_query": "What is the first type of weapon shown on the screen?",
"candidates": [
"Sword",
"Explosive Pack",
"Rifle",
"Hand Grenade",
"Tank"
],
"correct_choice": 2,
"position": [
1887,
2172,
2279
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "s1TjVlKBpjs_1",
"video_path": "s1TjVlKBpjs.mp4",
"subtitle_path": "s1TjVlKBpjs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 212.25,
"view_count": 3229465
},
{
"video_id": "KIf2fGmluhY",
"question": "The title on the screen is 'Datasets', below it are inference formulas, and a dynamic graph appears in the middle. In the top right corner, there is a man with glasses explaining. What did the boy holding the ball do when he appeared for the first time?",
"question_wo_referring_query": "What did the boy holding the ball do when he appeared for the first time?",
"candidates": [
"Threw the ball into the distance",
"Shot the ball towards the basket",
"Dribbled the ball",
"Stepped on the ball with his foot",
"Passed the ball to someone else"
],
"correct_choice": 2,
"position": [
4165
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "O2E",
"level": "L1-Perception",
"id": "KIf2fGmluhY_0",
"video_path": "KIf2fGmluhY.mp4",
"subtitle_path": "KIf2fGmluhY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 765,
"duration": 508.0,
"view_count": 349
},
{
"video_id": "KIf2fGmluhY",
"question": "On the left side, there are some English words and sentences. In the top right corner, there is a man in a suit explaining something. Below the man, there is a fishbowl. What did the fish do when it first appeared in the fishbowl?",
"question_wo_referring_query": "What did the fish do when it first appeared in the fishbowl?",
"candidates": [
"Swam back and forth in the fishbowl",
"Fought in the bathtub",
"Slowly sank in the fishbowl",
"Swallowed each other in the bathtub"
],
"correct_choice": 0,
"position": [
11433
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "O2E",
"level": "L1-Perception",
"id": "KIf2fGmluhY_1",
"video_path": "KIf2fGmluhY.mp4",
"subtitle_path": "KIf2fGmluhY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 765,
"duration": 508.0,
"view_count": 349
},
{
"video_id": "eJr-y6UXnRE",
"question": "What is the color of the first piece of clothing shown in the video?",
"question_wo_referring_query": "What is the color of the first piece of clothing shown in the video?",
"candidates": [
"white",
"purple",
"red",
"olive",
"black"
],
"correct_choice": 4,
"position": [
125,
315
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "eJr-y6UXnRE_0",
"video_path": "eJr-y6UXnRE.mp4",
"subtitle_path": "eJr-y6UXnRE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 592.56,
"view_count": 47581
},
{
"video_id": "eJr-y6UXnRE",
"question": "What is the first food item displayed in the video?",
"question_wo_referring_query": "What is the first food item displayed in the video?",
"candidates": [
"Avocado",
"Beverage with ice cubes",
"Watermelon",
"Apple",
"Potato chips"
],
"correct_choice": 1,
"position": [
62,
13307
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "eJr-y6UXnRE_1",
"video_path": "eJr-y6UXnRE.mp4",
"subtitle_path": "eJr-y6UXnRE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 592.56,
"view_count": 47581
},
{
"video_id": "_zEM6Fc-7NI",
"question": "On the street, in the distance there is the wall of a house, nearby there is a black car, on the right there are several people wearing overalls, in the middle there is a soldier wearing camouflage. When the narration mentions 'we just got to downtown LA to join the', which of these objects is not present on the screen?",
"question_wo_referring_query": "Which object is not present on the screen?",
"candidates": [
"people wearing overalls",
"big tree",
"police car",
"soldier wearing camouflage",
"black car"
],
"correct_choice": 2,
"position": [
680
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "_zEM6Fc-7NI_0",
"video_path": "_zEM6Fc-7NI.mp4",
"subtitle_path": "_zEM6Fc-7NI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 285.95,
"view_count": 994355
},
{
"video_id": "_zEM6Fc-7NI",
"question": "In front of a wall with white lacquer on both sides and a yellow wooden board in the middle, on the left side is a tree trunk, with a white and yellow prismatic barrier next to the right side of the trunk. When the commentary mentions 'protesting or I must sleep in my bed my,' which object is on the screen?",
"question_wo_referring_query": "Which object is on the screen?",
"candidates": [
"portrait of a woman with short hair",
"wine glass",
"airplane",
"armored vehicle",
"police"
],
"correct_choice": 0,
"position": [
6039
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "_zEM6Fc-7NI_1",
"video_path": "_zEM6Fc-7NI.mp4",
"subtitle_path": "_zEM6Fc-7NI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 285.95,
"view_count": 994355
},
{
"video_id": "Z00vWImw1KQ",
"question": "On the street, there is an ambulance with a white and green pattern in the middle. To the left is a soldier wearing camouflage clothing, and in the right seat of the ambulance is a person wearing red clothes. On the right side are onlookers. Which character closed the door of the ambulance?",
"question_wo_referring_query": "Which character closed the door of the ambulance?",
"candidates": [
"doctor",
"police officer wearing sunglasses",
"person wearing red clothes",
"woman wearing white clothes",
"soldier wearing camouflage clothing"
],
"correct_choice": 2,
"position": [
2624
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "Z00vWImw1KQ_0",
"video_path": "Z00vWImw1KQ.mp4",
"subtitle_path": "Z00vWImw1KQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 476.52,
"view_count": 81127
},
{
"video_id": "Z00vWImw1KQ",
"question": "Among the many vehicles and crowded places, with white lights on the left, a tent on the right, and an ambulance with yellow lights in the distance, which object is moving quickly?",
"question_wo_referring_query": ", which object is moving quickly?",
"candidates": [
"airplane",
"a green tank",
"tent",
"large truck",
"ambulance with yellow lights"
],
"correct_choice": 4,
"position": [
7870
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "Z00vWImw1KQ_1",
"video_path": "Z00vWImw1KQ.mp4",
"subtitle_path": "Z00vWImw1KQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 476.52,
"view_count": 81127
},
{
"video_id": "tGiRbGGwRj8",
"question": "In the bottom-left corner of the screen, there's a circle with a woman with long black hair inside it providing explanations. Behind her are 12 and a half sheets of informational background. What color is the outfit the woman is wearing?",
"question_wo_referring_query": "What color is the outfit the woman is wearing?",
"candidates": [
"Black",
"White",
"Red",
"Gray"
],
"correct_choice": 1,
"position": [
519
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2A",
"level": "L1-Perception",
"id": "tGiRbGGwRj8_0",
"video_path": "tGiRbGGwRj8.mp4",
"subtitle_path": "tGiRbGGwRj8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 223.43,
"view_count": 41604
},
{
"video_id": "tGiRbGGwRj8",
"question": "On the half blue, half pink tablecloth, this person places three pieces of paper with 'The partial pressure' and formulas written on them. At this moment, what is the color of the nail polish on their fingers?",
"question_wo_referring_query": "At this moment, what is the color of the nail polish on their fingers?",
"candidates": [
"pink",
"blue",
"white",
"red"
],
"correct_choice": 0,
"position": [
3904
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2A",
"level": "L1-Perception",
"id": "tGiRbGGwRj8_1",
"video_path": "tGiRbGGwRj8.mp4",
"subtitle_path": "tGiRbGGwRj8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 223.43,
"view_count": 41604
},
{
"video_id": "ixlQX7lV8dc",
"question": "In a picture featuring an animal with chameleon eyes, after the subtitle 'helped it seek out prey, primarily' appears, which animal is shown?",
"question_wo_referring_query": "After the subtitle 'helped it seek out prey, primarily' appears, which animal is shown?",
"candidates": [
"gecko",
"snake",
"chipmunk",
"frog"
],
"correct_choice": 3,
"position": [
6047,
6184
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "ixlQX7lV8dc_0",
"video_path": "ixlQX7lV8dc.mp4",
"subtitle_path": "ixlQX7lV8dc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 590.06,
"view_count": 140751
},
{
"video_id": "ixlQX7lV8dc",
"question": "In a picture with a microscope, after the subtitle 'After completing a study about prehistoric insects' appears, what person appears?",
"question_wo_referring_query": "What person appears?",
"candidates": [
"A man in a yellow shirt",
"A man wearing glasses and smiling slightly",
"A man in a green shirt",
"A woman wearing glasses"
],
"correct_choice": 1,
"position": [
11022,
11098
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "ixlQX7lV8dc_1",
"video_path": "ixlQX7lV8dc.mp4",
"subtitle_path": "ixlQX7lV8dc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 590.06,
"view_count": 140751
},
{
"video_id": "AnLMDMzO4QY",
"question": "Which subtitles appear together with the woman wearing a white top, denim overalls, and round earrings in the beginning of the video?",
"question_wo_referring_query": "Which subtitles appear together with this woman?",
"candidates": [
"we've got the cookies and the freezer and cooling for about maybe 10 minutes so appear together",
"love the fact they're sweet and",
"the a hint of salt\n",
"beath bars are mike chocolate covered"
],
"correct_choice": 0,
"position": [
46,
3066,
4556
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TOS",
"level": "L2-Relation",
"id": "AnLMDMzO4QY_0",
"video_path": "AnLMDMzO4QY.mp4",
"subtitle_path": "AnLMDMzO4QY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 261.8,
"view_count": 233386
},
{
"video_id": "AnLMDMzO4QY",
"question": "In the opening of the video, the person wearing a white top, black skirt, and black short hair, which subtitles do they appear with?",
"question_wo_referring_query": "Which subtitles do they appear with?",
"candidates": [
"beath bars are mike chocolate covered",
"cooling for about maybe 10 minutes so and let us know what you guys think in the appearing together",
"crunch around the room definitely",
"the a hint of salt\n"
],
"correct_choice": 1,
"position": [
4556,
5476
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TOS",
"level": "L2-Relation",
"id": "AnLMDMzO4QY_1",
"video_path": "AnLMDMzO4QY.mp4",
"subtitle_path": "AnLMDMzO4QY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 261.8,
"view_count": 233386
},
{
"video_id": "marOr6mJefE",
"question": "The man wearing a blue and white striped shirt in the van is talking to another man wearing a black shirt and a cap. What is the man in the black shirt and cap doing at this time?",
"question_wo_referring_query": "What is the man in the black shirt and cap doing at this time?",
"candidates": [
"playing with a phone",
"playing games",
"driving",
"drinking water"
],
"correct_choice": 2,
"position": [
3081
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "marOr6mJefE_0",
"video_path": "marOr6mJefE.mp4",
"subtitle_path": "marOr6mJefE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 367.99,
"view_count": 1121573
},
{
"video_id": "marOr6mJefE",
"question": "At the flat ground at the base of the mountain, someone is heading towards the lush green small hill. On the screen, there is also a man in a gray top wearing black frame glasses. What does this man in the gray top wearing black frame glasses stop to do?",
"question_wo_referring_query": "What does the man in the gray top wearing black frame glasses stop to do?",
"candidates": [
"look through binoculars",
"run",
"shoot",
"take a photo"
],
"correct_choice": 3,
"position": [
6893
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "marOr6mJefE_1",
"video_path": "marOr6mJefE.mp4",
"subtitle_path": "marOr6mJefE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 367.99,
"view_count": 1121573
},
{
"video_id": "LHXS0QR1ThA",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, there is a girl on a white background with text and the same image appearing twice, once at 47:39 and once at 50:37, then a page with the word 'results' appears, and finally 9 small animals appear.",
"First, there is a girl on a white background with text and the same image appearing twice, once at 47:39 and once at 50:37, then 9 small animals appear, and finally a page with the word 'results' appears.",
"First, 9 small animals appear, then a girl on a white background with text and the same image appearing twice, once at 47:39 and once at 50:37, and finally a page with the word 'results' appears.",
"First, a page with the word 'results' appears, then 9 small animals appear, and finally a girl on a white background with text and the same image appearing twice, once at 47:39 and once at 50:37."
],
"correct_choice": 1,
"position": [
8778,
9455,
10552
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SSS",
"level": "L2-Relation",
"id": "LHXS0QR1ThA_0",
"video_path": "LHXS0QR1ThA.mp4",
"subtitle_path": "LHXS0QR1ThA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 931,
"duration": 484.0,
"view_count": 434
},
{
"video_id": "qjY9kmveQAk",
"question": "On a train, a person wearing a green military uniform and a green face mask is making a phone call. What other items appear on this train?",
"question_wo_referring_query": "What other items appear on this train?",
"candidates": [
"Biscuit",
"Flower",
"Gun",
"Piano"
],
"correct_choice": 2,
"position": [
647
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "qjY9kmveQAk_0",
"video_path": "qjY9kmveQAk.mp4",
"subtitle_path": "qjY9kmveQAk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 589.89,
"view_count": 545005
},
{
"video_id": "qjY9kmveQAk",
"question": "Some soldiers, dressed in white shirts, gray pants, and wearing green hats, are holding guns and battling the people on the train. At this moment, what else appears?",
"question_wo_referring_query": "At this moment, what else appears?",
"candidates": [
"sun",
"airplane",
"rainbow",
"stars"
],
"correct_choice": 1,
"position": [
13286
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "qjY9kmveQAk_1",
"video_path": "qjY9kmveQAk.mp4",
"subtitle_path": "qjY9kmveQAk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 589.89,
"view_count": 545005
},
{
"video_id": "QHS9ZZBdK-g",
"question": "In the bottom right corner of the picture, there is a man wearing white clothes. Behind him is a black screen showing many texts and four green rectangular graphics. What action did this man perform?",
"question_wo_referring_query": "What action did this man perform?",
"candidates": [
"Raised left hand with index and middle fingers pointing up",
"Raised right hand in a fist",
"Raised right hand with index and middle fingers pointing up",
"Raised left hand in a fist"
],
"correct_choice": 0,
"position": [
1669
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2A",
"level": "L1-Perception",
"id": "QHS9ZZBdK-g_0",
"video_path": "QHS9ZZBdK-g.mp4",
"subtitle_path": "QHS9ZZBdK-g_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 512,
"duration": 375.0,
"view_count": 2394
},
{
"video_id": "QHS9ZZBdK-g",
"question": "On a beige screen, there is English text in orange, red, and black. In the lower right corner, a man is talking to Mike. What color is the jacket that the man is wearing?",
"question_wo_referring_query": "What color is the jacket that the man is wearing?",
"candidates": [
"green",
"blue",
"yellow",
"red"
],
"correct_choice": 1,
"position": [
8579
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2A",
"level": "L1-Perception",
"id": "QHS9ZZBdK-g_1",
"video_path": "QHS9ZZBdK-g.mp4",
"subtitle_path": "QHS9ZZBdK-g_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 512,
"duration": 375.0,
"view_count": 2394
},
{
"video_id": "h0OHi9uAcBo",
"question": "In the video, a man wearing a green jacket and holding a green bag is talking to a long-haired woman in a white dress sitting at a desk. What did the long-haired woman in a white sleeveless top do after picking up a pen?",
"question_wo_referring_query": "In the video, a man wearing a green jacket and holding a green bag is talking to a long-haired woman in a white dress sitting at a desk. What did the long-haired woman in a white sleeveless top do after picking up a pen?",
"candidates": [
"Retouched a photo",
"Drew a picture on paper",
"Drank water",
"Wrote"
],
"correct_choice": 1,
"position": [
2734,
2832
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "h0OHi9uAcBo_0",
"video_path": "h0OHi9uAcBo.mp4",
"subtitle_path": "h0OHi9uAcBo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 224.93,
"view_count": 7108
},
{
"video_id": "h0OHi9uAcBo",
"question": "The woman in the black long-sleeve shirt is seated in front of the computer screen, intently reading a book. What did the woman in the black shirt do afterwards?",
"question_wo_referring_query": "What did the woman in black do afterwards?",
"candidates": [
"Put the book under the computer",
"Put the book into the copier",
"Put the book into the drawer",
"Put the book on the bookshelf"
],
"correct_choice": 1,
"position": [
1520,
1580
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "h0OHi9uAcBo_1",
"video_path": "h0OHi9uAcBo.mp4",
"subtitle_path": "h0OHi9uAcBo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 224.93,
"view_count": 7108
},
{
"video_id": "SO3czkzeFjw",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, on a sunny road, a boy in black clothes holding a skateboard opens the door of a house and walks out. Next, a man in white clothes opens a door inside a building with many doors and walks out. Finally, a boy wearing a purple shirt with yellow floral patterns is sitting in a room full of items, explaining in front of a mirror.",
"First, on a sunny road, a boy in black clothes holding a skateboard opens the door of a house and walks out. Next, a boy wearing a purple shirt with yellow floral patterns is sitting in a room full of items, explaining in front of a mirror. Finally, a man in white clothes opens a door inside a building with many doors and walks out.",
"First, a boy wearing a purple shirt with yellow floral patterns is sitting in a room full of items, explaining in front of a mirror. Next, on a sunny road, a boy in black clothes holding a skateboard opens the door of a house and walks out. Finally, a man in white clothes opens a door inside a building with many doors and walks out.",
"First, a man in white clothes opens a door inside a building with many doors and walks out. Next, on a sunny road, a boy in black clothes holding a skateboard opens the door of a house and walks out. Finally, a boy wearing a purple shirt with yellow floral patterns is sitting in a room full of items, explaining in front of a mirror.",
"First, a man in white clothes opens a door inside a building with many doors and walks out. Next, a boy wearing a purple shirt with yellow floral patterns is sitting in a room full of items, explaining in front of a mirror. Finally, on a sunny road, a boy in black clothes holding a skateboard opens the door of a house and walks out."
],
"correct_choice": 2,
"position": [
108,
120,
203
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "SO3czkzeFjw_0",
"video_path": "SO3czkzeFjw.mp4",
"subtitle_path": "SO3czkzeFjw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 401.44,
"view_count": 39971
},
{
"video_id": "SO3czkzeFjw",
"question": "Which of the following sequence of events is correct?",
"question_wo_referring_query": "Which of the following sequence of events is correct?",
"candidates": [
"First, the video shows a boy in a blue shirt crossing over a tree trunk in a forest. Then, the video shows a room with yellow and white walls with a staircase inside it. The boy, wearing a book bag, walks up the stairs. Finally, in a bedroom, a book bag placed on the bed is shown being zipped up. The boy then wears the book bag and walks out of the room.",
"First, a book bag placed on the bed in a bedroom is shown being zipped up. Then, a boy puts on the book bag and walks out of the room. Next, the video shows a room with yellow and white walls with a staircase inside it. The boy, still wearing the book bag, walks up the stairs. Finally, the video shows a boy in a blue shirt crossing over a tree trunk in a forest.",
"First, the video shows a room with yellow and white walls with a staircase inside it. The boy, wearing a book bag, walks up the stairs. Then, the video shows a book bag placed on the bed in a bedroom being zipped up. The boy then wears the book bag and walks out of the room. Finally, the video shows a boy in a blue shirt crossing over a tree trunk in a forest.",
"First, the video shows a boy in a blue shirt crossing over a tree trunk in a forest. Then, the video shows a book bag placed on the bed in a bedroom being zipped up. The boy then wears the book bag and walks out of the room. Finally, the video shows a room with yellow and white walls with a staircase inside it. The boy, wearing the book bag, walks up the stairs.",
"First, the video shows a room with yellow and white walls with a staircase inside it. The boy, wearing a book bag, walks up the stairs. Then, the video shows a boy in a blue shirt crossing over a tree trunk in a forest. Finally, in a bedroom, the video shows a book bag placed on the bed being zipped up. The boy then wears the book bag and walks out of the room."
],
"correct_choice": 1,
"position": [
6381,
6987,
7640
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "SO3czkzeFjw_1",
"video_path": "SO3czkzeFjw.mp4",
"subtitle_path": "SO3czkzeFjw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 401.44,
"view_count": 39971
},
{
"video_id": "liRm38L9EC4",
"question": "In front of a background with silver waves, there is a woman with long blonde hair, wearing a blue and white fitted top and black pants. In front of her, there is a wooden colored table. What objects appear in this scene?",
"question_wo_referring_query": "What objects appear in this scene?",
"candidates": [
"A plate with brown objects",
"A plate with purple objects",
"A plate with pink objects",
"A plate with green objects",
"A plate with blue objects"
],
"correct_choice": 0,
"position": [
198
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "liRm38L9EC4_0",
"video_path": "liRm38L9EC4.mp4",
"subtitle_path": "liRm38L9EC4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 191.23,
"view_count": 167271
},
{
"video_id": "liRm38L9EC4",
"question": "In front of a white wall, there is a woman wearing a beige coat with a white camisole underneath and short hair. She is holding a cup of black liquid in her hand. On either side of her, there is a woman: one with a ponytail and another wearing a hairband with long earrings. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"A pure white plate",
"A pure blue plate",
"A blue plate with white floral patterns",
"A blue painting board",
"A white plate with blue floral patterns"
],
"correct_choice": 4,
"position": [
1705
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "liRm38L9EC4_1",
"video_path": "liRm38L9EC4.mp4",
"subtitle_path": "liRm38L9EC4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 191.23,
"view_count": 167271
},
{
"video_id": "zk6fgbXI5AE",
"question": "A curling-haired boy wearing a black shirt with a red schoolbag strap on his shoulder is talking to the camera on the right side of the screen. On the left side of the screen, there is a man wearing a black short-sleeve shirt with a mustache. What appears first on the screen after the subtitle says \u201c$21,000 flights we're gonna do one for a\u201d?",
"question_wo_referring_query": "What appears first on the screen?",
"candidates": [
"A man wearing black clothes with a green bookbag and holding hands with a child",
"A man wearing a red floral outfit with a green bookbag and holding hands with a child",
"A woman wearing black clothes with a red bookbag and holding hands with a child",
"A woman wearing a red floral dress with a pink bookbag and holding hands with a child",
"A woman wearing a red floral dress with a yellow bookbag and holding hands with a child"
],
"correct_choice": 3,
"position": [
4175,
4306
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "zk6fgbXI5AE_0",
"video_path": "zk6fgbXI5AE.mp4",
"subtitle_path": "zk6fgbXI5AE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 389.22,
"view_count": 55541
},
{
"video_id": "zk6fgbXI5AE",
"question": "What object appeared before the subtitle saying 'but you have to pay for your bed' in a dark airplane cabin with a person wearing a black shirt without a scarf holding a round yellow paper cup containing ice and coffee-colored liquid?",
"question_wo_referring_query": "What object appeared?",
"candidates": [
"A red backpack",
"A bed",
"An air conditioner",
"A magazine with a woman with long hair on the cover",
"A flower pot"
],
"correct_choice": 3,
"position": [
6313,
6449
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "zk6fgbXI5AE_1",
"video_path": "zk6fgbXI5AE.mp4",
"subtitle_path": "zk6fgbXI5AE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 389.22,
"view_count": 55541
},
{
"video_id": "fxCRCMLJ0PU",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a group of soldiers making rowing motions sitting in a green boat on a river, then several gray guns and two green grenades appear against a red background, and finally, five soldiers with helmets and face paint holding guns fiercely facing the camera.",
"First, a group of soldiers making rowing motions sitting in a green boat on a river, then several gray guns and two green grenades appear against a green background, and finally, three soldiers with helmets and face paint holding guns fiercely facing the camera.",
"First, several gray guns and two green grenades appear against a green background, then a group of soldiers making rowing motions sitting in a green boat on a river, and finally, three soldiers with helmets and face paint holding guns fiercely facing the camera.",
"First, several gray guns and two green grenades appear against a green background, then three soldiers with helmets and face paint holding guns fiercely facing the camera, and finally, a group of soldiers making rowing motions sitting in a green boat on a river.",
"First, a group of soldiers making rowing motions sitting in a green boat on a river, then several gray guns and two green grenades appear against a blue background, and finally, two soldiers with helmets and face paint holding guns fiercely facing the camera."
],
"correct_choice": 3,
"position": [
3053,
3393,
5308
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "fxCRCMLJ0PU_0",
"video_path": "fxCRCMLJ0PU.mp4",
"subtitle_path": "fxCRCMLJ0PU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 291.21,
"view_count": 2006850
},
{
"video_id": "bUaFXONIXzM",
"question": "In a foam paper cup containing three drinks, one of the drinks has a raised milk cap in a creamy yellow color. Where else has this drink appeared?",
"question_wo_referring_query": "In a foam paper cup containing three drinks, one of the drinks has a raised milk cap in a creamy yellow color. Where else has this drink appeared?",
"candidates": [
"In the kitchen",
"In the fridge",
"In an amusement park",
"In the hand of a woman without a ring holding a green straw",
"In a Starbucks store"
],
"correct_choice": 3,
"position": [
3042,
4962
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "bUaFXONIXzM_0",
"video_path": "bUaFXONIXzM.mp4",
"subtitle_path": "bUaFXONIXzM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.48,
"view_count": 30848
},
{
"video_id": "bUaFXONIXzM",
"question": "A long-haired woman wearing a white long-sleeved shirt is holding a cup of a brown-packaged drink with a green human face printed on it. Where has this drink appeared before?",
"question_wo_referring_query": "Where has this drink appeared before?",
"candidates": [
"In a kitchen",
"On a grass field",
"In the hands of a woman looking at a yellow stain on a white shirt",
"On a table",
"In a coffee shop"
],
"correct_choice": 2,
"position": [
6388,
7245
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "bUaFXONIXzM_1",
"video_path": "bUaFXONIXzM.mp4",
"subtitle_path": "bUaFXONIXzM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.48,
"view_count": 30848
},
{
"video_id": "kLuqCtnKr_8",
"question": "In a jade-green ocean, there is a huge rock with a hole standing in the ocean. Several people are standing on the rock, and one person is running forward. What did he do next?",
"question_wo_referring_query": ", what did he do next?",
"candidates": [
"Rushed towards another person standing on the rock",
"Jumped into someone's arms",
"Jumped into the jade-green sea",
"Hugged the person in front of him",
"Fell down on the rock"
],
"correct_choice": 2,
"position": [
121,
150
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "kLuqCtnKr_8_0",
"video_path": "kLuqCtnKr_8.mp4",
"subtitle_path": "kLuqCtnKr_8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 600.07,
"view_count": 98916
},
{
"video_id": "kLuqCtnKr_8",
"question": "On a mountain lush with greenery and overlooking a vast blue ocean, a girl dressed in a black suspenders and donning golden long hair smiles slightly at the mirror, what does she do next?",
"question_wo_referring_query": "what does she do next?",
"candidates": [
"Faces the mirror, bends at the waist, and laughs heartily",
"Turns her back to the mirror and faces the ocean",
"Turns her back to the mirror and extends both arms",
"Faces the mirror and extends both hands",
"Faces the mirror and extends one hand"
],
"correct_choice": 1,
"position": [
13198,
13219
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "kLuqCtnKr_8_1",
"video_path": "kLuqCtnKr_8.mp4",
"subtitle_path": "kLuqCtnKr_8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 600.07,
"view_count": 98916
},
{
"video_id": "5zbV24vyO44",
"question": "In a coding algorithm context, there is a box in the bottom right corner. Inside the box, there is a blue background wall with a black piece of clothing hanging on it. There is a man with short black hair wearing a plaid shirt. After he says 'this cell', what does he do?",
"question_wo_referring_query": "What does he do?",
"candidates": [
"Touches his chin with his hand",
"Touches his nose with his hand",
"Makes a 'Yay' gesture with his hand",
"Holds his fist with one hand",
"Touches his forehead with his hand"
],
"correct_choice": 1,
"position": [
590,
607
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T3E",
"level": "L2-Relation",
"id": "5zbV24vyO44_0",
"video_path": "5zbV24vyO44.mp4",
"subtitle_path": "5zbV24vyO44_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 191,
"duration": 543.0,
"view_count": 11023
},
{
"video_id": "5zbV24vyO44",
"question": "In the bottom right corner of the screen, there's a frame with a blue background wall, and on the wall, there are black clothes and a bag hanging, along with a man with short black hair. After this man says 'by a thousand samples of the test set,' what does he do next?",
"question_wo_referring_query": "What does he do next?",
"candidates": [
"Touches the brain with a hand",
"One hand extends with the index finger pointing upwards",
"Both hands extend with index fingers pointing upwards",
"Both hands spread outwards",
"One hand touches the forehead"
],
"correct_choice": 1,
"position": [
8734,
8751
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T3E",
"level": "L2-Relation",
"id": "5zbV24vyO44_1",
"video_path": "5zbV24vyO44.mp4",
"subtitle_path": "5zbV24vyO44_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 191,
"duration": 543.0,
"view_count": 11023
},
{
"video_id": "OGaML8Gg8JQ",
"question": "A man with short black hair, standing in front of a black background, wearing a purple short-sleeve shirt, in which of the following scenes did he appear?",
"question_wo_referring_query": ", in which of the following scenes did he appear?",
"candidates": [
"A scene with a pure black background and a small green map on the side",
"A room filled with clutter and books",
"A scene with a pure white background with a small flower on the side",
"A courtyard with many fresh flowers",
"In a park with many people"
],
"correct_choice": 0,
"position": [
478,
2556
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "OGaML8Gg8JQ_0",
"video_path": "OGaML8Gg8JQ.mp4",
"subtitle_path": "OGaML8Gg8JQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 542.71,
"view_count": 799173
},
{
"video_id": "OGaML8Gg8JQ",
"question": "In front of a backdrop with a plant illustration, there is a man wearing a white hat, a gray T-shirt, and a black wristwatch. In which of the following scenes does he appear?",
"question_wo_referring_query": "In which of the following scenes does he appear?",
"candidates": [
"In front of a pure black backdrop, there is a scene with a man wearing a white short-sleeved shirt.",
"In front of a pure black backdrop, there is a scene with a man wearing a pink short-sleeved shirt.",
"In front of a pure black backdrop, there is a scene with a man wearing a green short-sleeved shirt.",
"In front of a pure black backdrop, there is a scene with a man wearing a red short-sleeved shirt.",
"In front of a pure black backdrop, there is a scene with a man wearing a blue short-sleeved shirt."
],
"correct_choice": 4,
"position": [
2003,
9445
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "OGaML8Gg8JQ_1",
"video_path": "OGaML8Gg8JQ.mp4",
"subtitle_path": "OGaML8Gg8JQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 542.71,
"view_count": 799173
},
{
"video_id": "otaJfBSlsG8",
"question": "In front of a white cabinet, there is a woman with long black hair wearing a pink top, and there's also a green potted plant on the surface behind her. With which of the following subtitles has she appeared together?",
"question_wo_referring_query": "With which of the following subtitles has she appeared together?",
"candidates": [
"history",
"develop",
"warfare",
"people",
"science"
],
"correct_choice": 3,
"position": [
257,
9304
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TOS",
"level": "L2-Relation",
"id": "otaJfBSlsG8_0",
"video_path": "otaJfBSlsG8.mp4",
"subtitle_path": "otaJfBSlsG8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.49,
"view_count": 672
},
{
"video_id": "otaJfBSlsG8",
"question": "In front of a white cabinet, there is a green potted plant on the side. In front, there is a woman with long black hair holding an illustrated book. Which of the following subtitles appeared with this illustrated book?",
"question_wo_referring_query": "Which of the following subtitles appeared with this illustrated book?",
"candidates": [
"that afternoon stella and her clay",
"let's find a cozy spot and let's get",
"let's begin with our welcome song just",
"started",
"story time and activity now"
],
"correct_choice": 0,
"position": [
329,
383,
410,
467,
3427
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TOS",
"level": "L2-Relation",
"id": "otaJfBSlsG8_1",
"video_path": "otaJfBSlsG8.mp4",
"subtitle_path": "otaJfBSlsG8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.49,
"view_count": 672
},
{
"video_id": "60-ADf8OL9A",
"question": "On a stage with three people, three men are seated in a triangular formation. When one of the men, who is wearing a complete olive-colored suit and sitting cross-legged, mentions 'whether they are traditional artists,' what change occurs to this man?",
"question_wo_referring_query": "What change occurs to this man?",
"candidates": [
"Both hands change from resting on his legs to resting on the sofa.",
"His right hand changes from resting on the sofa to touching his forehead.",
"His right hand changes from resting on the sofa to being raised.",
"His left hand changes from resting on the sofa to touching his forehead.",
"His left hand changes from resting on the sofa to being raised."
],
"correct_choice": 4,
"position": [
318,
364
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "60-ADf8OL9A_0",
"video_path": "60-ADf8OL9A.mp4",
"subtitle_path": "60-ADf8OL9A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 336.05,
"view_count": 6733
},
{
"video_id": "60-ADf8OL9A",
"question": "On a stage, a few men seated on a sofa and chairs, among them a man with crossed legs, holding a microphone, and wearing an olive-colored suit. When this man mentions 'the life that you've led', what changes occur to the man with glasses and graying hair?",
"question_wo_referring_query": "What changes occur to the man with glasses and graying hair?",
"candidates": [
"Changes from touching his forehead with his left hand to touching his forehead with his right hand",
"Changes from facing the mirror to having his back to the mirror",
"Changes from raising his left hand to raising his right hand",
"Changes from having his back to the mirror to facing the mirror",
"Changes from holding the microphone in his left hand to holding it in his right hand"
],
"correct_choice": 1,
"position": [
527,
540
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "60-ADf8OL9A_1",
"video_path": "60-ADf8OL9A.mp4",
"subtitle_path": "60-ADf8OL9A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 336.05,
"view_count": 6733
},
{
"video_id": "VL259eBJ68w",
"question": "In the video, a black-haired woman wearing a black strappy dress and a man with dark skin in a black chef's uniform are in a kitchen. The cabinets behind them are white, and the windows have white curtains. On either side of the windows are wooden shelves holding some items. In front of them is a large wooden table. The woman has her arms crossed and is smiling slightly, while the man is holding a semi-circular lid. What are they doing in the kitchen?",
"question_wo_referring_query": "In the video, a black-haired woman wearing a black strappy dress and a man with dark skin in a black chef's uniform are in a kitchen. The cabinets behind them are white, and the windows have white curtains. On either side of the windows are wooden shelves holding some items. In front of them is a large wooden table. The woman has her arms crossed and is smiling slightly, while the man is holding a semi-circular lid. What are they doing in the kitchen?",
"candidates": [
"The man lifts the lid in his hand",
"The man flips the lid in his hand",
"The woman opens the lid in the man's hand",
"The man drops the lid in his hand",
"The man and woman are shaking hands"
],
"correct_choice": 0,
"position": [
1739
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "VL259eBJ68w_0",
"video_path": "VL259eBJ68w.mp4",
"subtitle_path": "VL259eBJ68w_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 550.62,
"view_count": 566917
},
{
"video_id": "VL259eBJ68w",
"question": "In the video, there is a woman in a black dress in the kitchen. Behind her, there is a white cabinet with some items on it and a window with white curtains. The curtains are flanked by wooden shelves. In front of her, there is a large counter with a cast iron pot on it. Above the pot is a large stainless steel bowl which she is holding with both hands. Next to them, there is a large pink container with red liquid inside. What is the woman doing in the kitchen?",
"question_wo_referring_query": "What is the woman doing in the kitchen?",
"candidates": [
"Turned the large bowl upside down",
"Put a lid on the large bowl",
"Moved the large bowl to the windowsill",
"Placed the large bowl on the counter",
"Put the large bowl on the pink container"
],
"correct_choice": 4,
"position": [
7801
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "VL259eBJ68w_1",
"video_path": "VL259eBJ68w.mp4",
"subtitle_path": "VL259eBJ68w_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 550.62,
"view_count": 566917
},
{
"video_id": "d8H7hgQY9ew",
"question": "In a news scene, a woman with short hair wearing red clothes and a black inner shirt with earrings is smiling. Behind her, there's a large screen displaying the letters 'cna' and a red triangular line, along with some red and black characters. When the ticker below her scrolls to 'Australia's energy giants to face annual earnings slump on bleak,' what event occurred?",
"question_wo_referring_query": "What event occurred?",
"candidates": [
"She shook hands with another man",
"She put on glasses",
"She nodded",
"She stood up from her chair",
"She hugged a man next to her"
],
"correct_choice": 2,
"position": [
682
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "d8H7hgQY9ew_0",
"video_path": "d8H7hgQY9ew.mp4",
"subtitle_path": "d8H7hgQY9ew_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 439.36,
"view_count": 1104
},
{
"video_id": "d8H7hgQY9ew",
"question": "In an interview setting, a man in a suit and a woman are having a conversation. Behind them on the left is a model made of red lines, and on the right is a screen with red and black letters. Both the man and the woman are sitting on stools. In front of them is a round table, which also has red triangular lines with the letters cna printed on it. When the letters 'Rahm and Hovland' in the bottom scrolling black area moved to the left, what event occurred?",
"question_wo_referring_query": "What event occurred when the text moved to the left?",
"candidates": [
"The man smoothed his clothes",
"The man stood up",
"The man picked up a piece of paper from the table",
"The man took a sip of water",
"The woman adjusted her hair"
],
"correct_choice": 4,
"position": [
8031
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "d8H7hgQY9ew_1",
"video_path": "d8H7hgQY9ew.mp4",
"subtitle_path": "d8H7hgQY9ew_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 439.36,
"view_count": 1104
},
{
"video_id": "LYvStKy8iAc",
"question": "In front of a white background wall, a woman with earrings and hair tied up, wearing a short-sleeved T-shirt, is pressing foundation liquid onto a sponge in her hand. Behind her, there are green leaf decorations on the wall. Her nails are pink, and she is wearing a ring on her finger. What happened after she pressed the foundation liquid onto the sponge?",
"question_wo_referring_query": "What happened after she pressed the foundation liquid onto the sponge?",
"candidates": [
"Pressed the sponge onto her neck",
"Pressed the sponge onto her face",
"Washed dishes with the sponge",
"Pressed the sponge onto her hand",
"Threw the sponge away"
],
"correct_choice": 1,
"position": [
2423,
2424
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "LYvStKy8iAc_0",
"video_path": "LYvStKy8iAc.mp4",
"subtitle_path": "LYvStKy8iAc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 575.32,
"view_count": 35141
},
{
"video_id": "LYvStKy8iAc",
"question": "A woman with earrings, hair tied up, and wearing a short-sleeved T-shirt is standing against a white background wall. She is holding a box containing some powder. There are green leafy decorations on the wall behind her. Her nails are pink, and she is wearing a ring on her finger. What happened after she picked up the powder box?",
"question_wo_referring_query": "What happened after she picked up the powder box?",
"candidates": [
"She brushed her cheeks",
"She closed the lid of the powder box",
"She brushed her eyelids",
"She opened the lid of the box",
"She threw the powder box"
],
"correct_choice": 0,
"position": [
9251,
9330
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "LYvStKy8iAc_1",
"video_path": "LYvStKy8iAc.mp4",
"subtitle_path": "LYvStKy8iAc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 575.32,
"view_count": 35141
},
{
"video_id": "rIU_BQEuKQ8",
"question": "In a room, there is a sofa and a staircase. To the left, there are some black shelves. The background is a bit blurry. A man with black hair wearing a white T-shirt is sitting in front of a mirror. To his left, there is a pop-up image showing a group photo of 10 people. After he mentions, 'However, keep in mind the Chinese spoken here is more in reference to Cantonese and the other southern dialects rather than Mandarin,' what happens?",
"question_wo_referring_query": "What happens afterward?",
"candidates": [
"The pop-up on the left shows a picture of an old building under a blue sky",
"The photo pop-up on the left shows a picture of a man and a woman sitting on a beach holding a child, with '10%' written on it",
"The photo pop-up on the left disappears",
"The man changes into a jacket",
"The man takes out a cup"
],
"correct_choice": 0,
"position": [
5559,
5572
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "rIU_BQEuKQ8_0",
"video_path": "rIU_BQEuKQ8.mp4",
"subtitle_path": "rIU_BQEuKQ8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 290.36,
"view_count": 277624
},
{
"video_id": "X0U3fP0tZyY",
"question": "A man wearing a red lettered t-shirt is explaining in front of a background wall with musical instruments hanging on it. The instruments on the background wall are all string instruments made of wood. There's also a white board with some words written on it for introduction. What is the first object that appears after mentioning 'performed by a wonderful cuatrista, Fabiola Mendez.'?",
"question_wo_referring_query": "What is the first object that appears?",
"candidates": [
"A woman in a black dress playing the piano",
"A woman in black clothing sitting and playing a small string instrument",
"A woman in purple clothing sitting and playing a small string instrument",
"A woman in a white dress playing the piano",
"A woman in a black dress sitting and playing a cello"
],
"correct_choice": 1,
"position": [
1380,
1448
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "X0U3fP0tZyY_0",
"video_path": "X0U3fP0tZyY.mp4",
"subtitle_path": "X0U3fP0tZyY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 458.39,
"view_count": 24800
},
{
"video_id": "X0U3fP0tZyY",
"question": "A woman with short hair dressed in black is sitting in a large hall with many glass display cases. She is wearing glasses and smiling while playing a white string instrument. When 'music flourishes, ends' is mentioned, what is the first object that appears?",
"question_wo_referring_query": "What is the first object that appears?",
"candidates": [
"A black and red piano",
"A black and red string instrument",
"A white guitar",
"Olive-colored loafers",
"White glasses"
],
"correct_choice": 1,
"position": [
4700,
4762
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "X0U3fP0tZyY_1",
"video_path": "X0U3fP0tZyY.mp4",
"subtitle_path": "X0U3fP0tZyY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 458.39,
"view_count": 24800
},
{
"video_id": "NexB4vj8_54",
"question": "What is the person in the video doing when a bowl of white yogurt appears above a transparent glass container filled with beaten yellow egg liquid on a dark red wooden table?",
"question_wo_referring_query": "What is the person in the video doing?",
"candidates": [
"Pouring the yogurt into the container with egg liquid",
"Stirring the yogurt and egg liquid",
"Drinking the yogurt",
"Stirring the yogurt",
"Stirring the egg liquid"
],
"correct_choice": 0,
"position": [
1099
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "NexB4vj8_54_0",
"video_path": "NexB4vj8_54.mp4",
"subtitle_path": "NexB4vj8_54_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 228.2,
"view_count": 1833
},
{
"video_id": "NexB4vj8_54",
"question": "On a dark red table, there is a round white plate holding a yellow cake decorated with white powdered sugar. When a silver serrated knife appears above the cake, what is the video creator doing?",
"question_wo_referring_query": "What is the video creator doing?",
"candidates": [
"Eating the cake",
"Setting the table",
"Cutting the cake",
"Cleaning the utensils",
"Decorating the cake"
],
"correct_choice": 2,
"position": [
4394
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "NexB4vj8_54_1",
"video_path": "NexB4vj8_54.mp4",
"subtitle_path": "NexB4vj8_54_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 228.2,
"view_count": 1833
},
{
"video_id": "JpjytHmGHZ4",
"question": "In front of an olive background, a man wearing a red short-sleeved shirt holding a guitar is standing in front of a microphone. When the subtitle 'ask she should have an irrational fear' appears, what is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Waving to the audience",
"Making a speech",
"Performing magic",
"Dancing on stage",
"Playing the guitar and singing"
],
"correct_choice": 4,
"position": [
1825
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2E",
"level": "L1-Perception",
"id": "JpjytHmGHZ4_0",
"video_path": "JpjytHmGHZ4.mp4",
"subtitle_path": "JpjytHmGHZ4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 228.16,
"view_count": 521708
},
{
"video_id": "JpjytHmGHZ4",
"question": "A man holding a guitar dressed in a red short-sleeve shirt is standing behind a microphone on stage under the lights. On the screen behind him with a black background, two large English words appear. There is a row of audience members seated in front of the stage. When the subtitles 'Thank you' appear, what are the audience members doing?",
"question_wo_referring_query": "A man holding a guitar dressed in a red short-sleeve shirt is standing behind a microphone on stage under the lights. On the screen behind him with a black background, two large English words appear. There is a row of audience members seated in front of the stage. When the subtitles 'Thank you' appear, what are the audience members doing?",
"candidates": [
"Giving flowers to the man on stage",
"Asking the man on stage a question",
"Standing up and leaving",
"Clapping",
"Sitting still"
],
"correct_choice": 3,
"position": [
5189
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2E",
"level": "L1-Perception",
"id": "JpjytHmGHZ4_1",
"video_path": "JpjytHmGHZ4.mp4",
"subtitle_path": "JpjytHmGHZ4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 228.16,
"view_count": 521708
},
{
"video_id": "Ng2rNm6Nwsg",
"question": "There are two abstract landscape paintings on the screen. What happens after the camera moves from left to right?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"Three men appear on the screen, sitting around a table chatting.",
"An old man wearing a checkered suit and round glasses appears on the right side of the screen, standing in front of a wall full of paintings. On the left side, a woman is playing the guitar on a sofa, while a man is sitting on the sofa creating with a pencil.",
"An old man wearing a checkered suit and round glasses appears on the right side of the screen, standing in front of a wall full of paintings. On the left side, a woman is sitting on a sofa creating with a pencil, while a man is sitting on the sofa playing the guitar.",
"An old man wearing a checkered suit and round glasses appears on the left side of the screen, standing in front of a wall full of paintings. On the right side, a woman is sitting on a sofa creating with a pencil, while a man is sitting on the sofa playing the guitar.",
"An old man wearing a checkered suit and round glasses appears on the left side of the screen, standing in front of a wall full of paintings. On the right side, a woman is playing the guitar on a sofa, while a man is sitting on the sofa creating with a pencil."
],
"correct_choice": 3,
"position": [
3521,
3678
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Ng2rNm6Nwsg_0",
"video_path": "Ng2rNm6Nwsg.mp4",
"subtitle_path": "Ng2rNm6Nwsg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 254.38,
"view_count": 22643
},
{
"video_id": "Ng2rNm6Nwsg",
"question": "There are many abstract landscape paintings on the wall. On the right side of the screen, a man with thinning hair and wearing round glasses is sternly looking into a mirror. What happened in the video?",
"question_wo_referring_query": "What happened in the video?",
"candidates": [
"A family is standing in front of a house taking a family photo.",
"Three children are playing around the house.",
"A man and a woman are creating something with paper and pen.",
"Three women are sitting in front of a house having a conversation.",
"The background is a tiled house with a two-story white staircase. Three men are sitting in front of the house having a discussion, and the man on the far right is smoking."
],
"correct_choice": 4,
"position": [
3856,
3908
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Ng2rNm6Nwsg_1",
"video_path": "Ng2rNm6Nwsg.mp4",
"subtitle_path": "Ng2rNm6Nwsg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 254.38,
"view_count": 22643
},
{
"video_id": "BktEeBeA7a8",
"question": "A woman wearing a pink fur coat over a white slip dress, with her hair tied in a bun and facing a mirror, is standing in a brightly lit kitchen. When she opens the wooden cabinet in front of her with her right hand, what is the first item she takes out?",
"question_wo_referring_query": "What is the item?",
"candidates": [
"A coffee machine",
"A glass with a wooden stirrer and a straw inserted",
"A can of coffee",
"Some cutlery",
"A plate of fruit"
],
"correct_choice": 1,
"position": [
820,
782,
960
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "BktEeBeA7a8_0",
"video_path": "BktEeBeA7a8.mp4",
"subtitle_path": "BktEeBeA7a8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 320.6,
"view_count": 4188
},
{
"video_id": "BktEeBeA7a8",
"question": "A woman dressed in a beige long-sleeved top, black tight pants, and white sneakers with curly hair draped over, stands on a low wooden platform in a forest bathed in golden sunlight. What is the first plant to appear?",
"question_wo_referring_query": "",
"candidates": [
"Wild flowers",
"Vegetables",
"Melon",
"Some colorful trees",
"A pile of green grass"
],
"correct_choice": 4,
"position": [
4644,
4728,
4963,
5038,
5665
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "BktEeBeA7a8_1",
"video_path": "BktEeBeA7a8.mp4",
"subtitle_path": "BktEeBeA7a8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 320.6,
"view_count": 4188
},
{
"video_id": "ydm72ftJStQ",
"question": "On a white background PPT, there are three cubes. The leftmost is a narrow blue cube, the middle is an orange cube, and the rightmost is a wide orange cube. In the lower right corner, there is a man wearing a grey coat with a khaki base. When the subtitle 'third dimension of these filters should' appears, what object appears on the screen?",
"question_wo_referring_query": "What object appears on the screen?",
"candidates": [
"backpack",
"leather shoes",
"collar",
"glasses",
"hat"
],
"correct_choice": 3,
"position": [
3575
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "ydm72ftJStQ_0",
"video_path": "ydm72ftJStQ.mp4",
"subtitle_path": "ydm72ftJStQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 277.6,
"view_count": 17
},
{
"video_id": "ydm72ftJStQ",
"question": "In a white background PPT, there is a yellow icon and black English text in the top left corner. In the center, there are black English letters saying 'Thank you'. When the caption 'here' appears, what object appears on the screen?",
"question_wo_referring_query": "What object appears on the screen?",
"candidates": [
"a man wearing glasses",
"a sphere",
"black English letters",
"a round shape",
"a cube"
],
"correct_choice": 2,
"position": [
6448
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "ydm72ftJStQ_1",
"video_path": "ydm72ftJStQ.mp4",
"subtitle_path": "ydm72ftJStQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 277.6,
"view_count": 17
},
{
"video_id": "88LbBgZP1vQ",
"question": "In a car, there is a black seat, a short-haired man wearing a gray shirt, and a long-haired woman sitting next to him. In which of the following locations has the short-haired man appeared?",
"question_wo_referring_query": "In which of the following locations has the short-haired man appeared?",
"candidates": [
"Verdant forest",
"Beautiful seaside",
"Quiet park",
"Crowded amusement park",
"Indoor basketball court"
],
"correct_choice": 4,
"position": [
444,
5322
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "88LbBgZP1vQ_0",
"video_path": "88LbBgZP1vQ.mp4",
"subtitle_path": "88LbBgZP1vQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 488.6,
"view_count": 3763
},
{
"video_id": "88LbBgZP1vQ",
"question": "In an indoor basketball court with red walls and a yellow floor, there is a girl wearing a purple short-sleeve shirt with her hair tied up, holding a basketball. In which of the following places has the girl appeared?",
"question_wo_referring_query": "In which of the following places has the girl appeared?",
"candidates": [
"In a room with a beige sofa",
"In a playground",
"In a quiet park",
"By the beach",
"In a clothing store with many beautiful clothes"
],
"correct_choice": 0,
"position": [
4622,
9191
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "88LbBgZP1vQ_1",
"video_path": "88LbBgZP1vQ.mp4",
"subtitle_path": "88LbBgZP1vQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 488.6,
"view_count": 3763
},
{
"video_id": "F7RSW-2rF4w",
"question": "On a street, next to the road, there is a house built with stones. The house has a black tiled roof and is covered with vines on the walls. A woman wearing a yellow top and a white floral half-skirt is walking on the street holding a book. From the subtitles listed below, which ones have appeared while this woman in the white floral half-skirt is present?",
"question_wo_referring_query": "During the scene with the woman in the white floral half-skirt, which of the following subtitles appear?",
"candidates": [
"\"Music\"",
"\"built here since 16th century\"",
"\"places\"",
"\"this title\"",
"\"and it does feel like time has stopped\""
],
"correct_choice": 0,
"position": [
428,
1686,
2669,
3205,
3271
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "F7RSW-2rF4w_0",
"video_path": "F7RSW-2rF4w.mp4",
"subtitle_path": "F7RSW-2rF4w_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 482.57,
"view_count": 8229
},
{
"video_id": "F7RSW-2rF4w",
"question": "On a dark and gloomy day, on a street lined with stone-built houses, there's a red and white round traffic barrier at the side of the street. Which of the following subtitles appeared together with the round traffic barrier?",
"question_wo_referring_query": "Which of the following subtitles appeared together with the round traffic barrier?",
"candidates": [
"\"sometimes we feel that our problems are\"",
"\"unbearable\"",
"\"with their own problems and everyday\"",
"\"and things are so bad that it's almost\"",
"\"Music\""
],
"correct_choice": 4,
"position": [
3654,
4731,
5491,
5564,
5571
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "F7RSW-2rF4w_1",
"video_path": "F7RSW-2rF4w.mp4",
"subtitle_path": "F7RSW-2rF4w_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 482.57,
"view_count": 8229
},
{
"video_id": "3JzhP8qfbqE",
"question": "On a red road, with yellow and green plants on both sides and tall peaks in the distance, what color is the car parked on the red road?",
"question_wo_referring_query": "What color is the car parked on the red road?",
"candidates": [
"Black",
"Pink",
"Silver Grey",
"Red",
"Silver White"
],
"correct_choice": 2,
"position": [
7433
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "3JzhP8qfbqE_0",
"video_path": "3JzhP8qfbqE.mp4",
"subtitle_path": "3JzhP8qfbqE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 453.42,
"view_count": 75930
},
{
"video_id": "3JzhP8qfbqE",
"question": "On a piece of white paper displaying various Earth plates, there is a blue line, below which there is a box with a red stripe, and below the box, there is a black horizontal stripe with white letters inside. What shape is the box with the red stripe?",
"question_wo_referring_query": "What shape is the box with the red stripe?",
"candidates": [
"Triangle",
"Rectangle",
"Square",
"Ellipse",
"Rounded Square"
],
"correct_choice": 3,
"position": [
2714
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "3JzhP8qfbqE_1",
"video_path": "3JzhP8qfbqE.mp4",
"subtitle_path": "3JzhP8qfbqE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 453.42,
"view_count": 75930
},
{
"video_id": "wq90Upjaij8",
"question": "In front of a white glass background, a woman wearing a brown top, with long hair and earrings, what did she do the first time she appeared?",
"question_wo_referring_query": "What did she do the first time she appeared?",
"candidates": [
"Smoothed her hair",
"Adjusted her clothes",
"Talked to the camera",
"Bent down to look at the floor",
"Covered her face with both hands"
],
"correct_choice": 2,
"position": [
149
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "wq90Upjaij8_0",
"video_path": "wq90Upjaij8.mp4",
"subtitle_path": "wq90Upjaij8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 212.57999999999998,
"view_count": 8458
},
{
"video_id": "wq90Upjaij8",
"question": "In a room, there is a woman with long black hair wearing a black suit. Behind her is a glass door, and there is a table with a round object on it next to her. What did she do when she first appeared?",
"question_wo_referring_query": "What did she do when she first appeared?",
"candidates": [
"Walked forward with big steps",
"Bent down to pick up a document",
"Talked to the mirror",
"Tidied up her hair",
"Holding a phone making a call"
],
"correct_choice": 0,
"position": [
3062
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "wq90Upjaij8_1",
"video_path": "wq90Upjaij8.mp4",
"subtitle_path": "wq90Upjaij8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 212.57999999999998,
"view_count": 8458
},
{
"video_id": "dE5iWeCVpGI",
"question": "On a black stage, there are many black frames; behind the frames, people are standing, while in front of the frames, someone is performing. There is a woman dressed in red suspenders with black leggings. What is she doing when the subtitle 'I know that there is a leader for us the' appears?",
"question_wo_referring_query": "On a black stage, there are many black frames; behind the frames, people are standing, while in front of the frames, someone is performing. There is a woman dressed in red suspenders with black leggings. What is she doing when the subtitle 'I know that there is a leader for us the' appears?",
"candidates": [
"Kneeling on one knee with both hands on the ground",
"Standing and singing, with her right hand raised",
"On both knees with both hands on the ground",
"On both knees with one hand on the ground",
"Kneeling on one knee with both arms open"
],
"correct_choice": 4,
"position": [
4218
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "dE5iWeCVpGI_0",
"video_path": "dE5iWeCVpGI.mp4",
"subtitle_path": "dE5iWeCVpGI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 412.54,
"view_count": 6328
},
{
"video_id": "dE5iWeCVpGI",
"question": "In a scene with a slightly blurred background, what is a woman with long curly hair, wearing a long-sleeved top and a necklace, doing when the caption 'work you breathe it you live it and' appears?",
"question_wo_referring_query": "What is the woman doing?",
"candidates": [
"Holding a blue pen and writing something",
"Holding a black pen and writing something",
"Holding a purple pen and writing something",
"Holding a white pen and writing something",
"Holding a red pen and writing something"
],
"correct_choice": 0,
"position": [
8155
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "dE5iWeCVpGI_1",
"video_path": "dE5iWeCVpGI.mp4",
"subtitle_path": "dE5iWeCVpGI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 412.54,
"view_count": 6328
},
{
"video_id": "jJGbXCCU5yc",
"question": "In the Google logo, there is a white box below it, and text continuously appears in it. What is the complete text that appears in the white box?",
"question_wo_referring_query": "What is the complete text that appears in the white box?",
"candidates": [
"lucky",
"google flights",
"google flight",
"google search",
"search"
],
"correct_choice": 1,
"position": [
1025
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "jJGbXCCU5yc_0",
"video_path": "jJGbXCCU5yc.mp4",
"subtitle_path": "jJGbXCCU5yc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 404.73,
"view_count": 64655
},
{
"video_id": "jJGbXCCU5yc",
"question": "White clouds are flowing atop the mountain peak. A man wearing a gray coat and black shorts is standing in front of the cliff, gazing at the distant green mountains. What is the item in this man's hand?",
"question_wo_referring_query": "What is the item in this man's hand?",
"candidates": [
"glasses",
"backpack",
"camera",
"phone",
"stick"
],
"correct_choice": 2,
"position": [
8783
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "jJGbXCCU5yc_1",
"video_path": "jJGbXCCU5yc.mp4",
"subtitle_path": "jJGbXCCU5yc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 404.73,
"view_count": 64655
},
{
"video_id": "WpbB_swXHkc",
"question": "In front of a ruin of a building with two stone pillars standing at the entrance, there is a woman sitting. She has black curly hair and is wearing a blue long-sleeved garment, covering her face with both hands. Which subtitles have appeared together with this woman?",
"question_wo_referring_query": "Which subtitles have appeared together with this woman?",
"candidates": [
"all right that\u2019s the fun part of the day",
"is just bam it\u2019s kind of slaps",
"Is this a special treatment for members?",
"It\u2019s seven o\u2019clock, did you hear?",
"if you take a deep breath and keep it"
],
"correct_choice": 1,
"position": [
3849,
1203
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TOS",
"level": "L2-Relation",
"id": "WpbB_swXHkc_0",
"video_path": "WpbB_swXHkc.mp4",
"subtitle_path": "WpbB_swXHkc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.83,
"view_count": 4541
},
{
"video_id": "WpbB_swXHkc",
"question": "On the right side of a desk with a building in the background, there are three icons, and next to the icons is a video being recorded. In the video, on a sofa with red flowers embroidered on it, sits a woman wearing earrings and a long-sleeved wine-red garment. With which subtitles does this woman appear together?",
"question_wo_referring_query": "With which subtitles does this woman appear together?",
"candidates": [
"before the time of the pandemic here",
"crowded men",
"myself",
"really exciting to me to have it all to",
"me to "
],
"correct_choice": 4,
"position": [
5419,
7948
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TOS",
"level": "L2-Relation",
"id": "WpbB_swXHkc_1",
"video_path": "WpbB_swXHkc.mp4",
"subtitle_path": "WpbB_swXHkc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.83,
"view_count": 4541
},
{
"video_id": "D7VYbsORD8k",
"question": "In a room with green wall tiles, there is a woman with long hair wearing a white dress. In the lower part of the screen near her head, white text appears that says 'someone started playing drums in the back.' What change happens to her when she appears in the restroom?",
"question_wo_referring_query": "In a room with green wall tiles, there is a woman with long hair wearing a white dress. What change happens to her?",
"candidates": [
"A black bag appears on her shoulder.",
"Her top changes from white to red.",
"A white bag appears on her shoulder.",
"Her top changes from white to black.",
"A red bag appears on her shoulder."
],
"correct_choice": 0,
"position": [
6702,
7340
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "D7VYbsORD8k_0",
"video_path": "D7VYbsORD8k.mp4",
"subtitle_path": "D7VYbsORD8k_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.93,
"view_count": 23629
},
{
"video_id": "D7VYbsORD8k",
"question": "In a car, a woman with long hair, wearing a white long-sleeve shirt, is holding a food container and showing it to the camera. When a white box with the text 'So I drive by this place' appears to her left, what happens to the object in her hand?",
"question_wo_referring_query": "What happens to the object in her hand?",
"candidates": [
"The object in her hand changes from a food container to a fork",
"The object in her hand changes from a food container to a phone with a pink case",
"The object in her hand changes from a food container to a black shoulder bag",
"The object in her hand changes from a food container to a phone with a black case",
"The object in her hand changes from a food container to a phone with a white case"
],
"correct_choice": 1,
"position": [
10941,
1821
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "D7VYbsORD8k_1",
"video_path": "D7VYbsORD8k.mp4",
"subtitle_path": "D7VYbsORD8k_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 526.93,
"view_count": 23629
},
{
"video_id": "sv4wLIbSDqo",
"question": "Two people are in a video call. The person on the right is wearing headphones and sitting in front of a bookshelf full of books, while the person on the left is wearing black-frame glasses, a tie, and raising his right hand. When the subtitle 'joined by military analyst Frank Ledwich' appears on the screen, what change occurs in the action of the man raising his right hand?",
"question_wo_referring_query": "What change occurs in his action?",
"candidates": [
"He removes his glasses with his right hand",
"He supports his head with his right hand",
"He puts down his right hand",
"He supports his head with his left hand",
"He raises his left hand"
],
"correct_choice": 2,
"position": [
2455,
726
],
"topic_category": "NP-News-Programs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "sv4wLIbSDqo_0",
"video_path": "sv4wLIbSDqo.mp4",
"subtitle_path": "sv4wLIbSDqo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 315.08,
"view_count": 100434
},
{
"video_id": "sv4wLIbSDqo",
"question": "There are two screens. On the right screen, a man wearing headphones is sitting in front of a bookshelf filled with books. His eyes are closed tightly, and his mouth is open. On the left screen, there is a green military vehicle parked next to some damaged buildings. When the man on the right screen and the subtitles 'course as with all these major battles' appear, what changes occur on the left screen?",
"question_wo_referring_query": "What changes occur on the left screen?",
"candidates": [
"The left screen changes to a map",
"The left screen changes to a man wearing glasses and a tie",
"The left screen changes to a man riding a bicycle on the street",
"The left screen changes to a white car",
"The left screen changes to a red car"
],
"correct_choice": 2,
"position": [
3372,
3620
],
"topic_category": "NP-News-Programs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "sv4wLIbSDqo_1",
"video_path": "sv4wLIbSDqo.mp4",
"subtitle_path": "sv4wLIbSDqo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 315.08,
"view_count": 100434
},
{
"video_id": "7rMgpExA4kM",
"question": "In the video, a white airplane is taxiing on the runway at the airport, with a backdrop of golden hills and an orange sky. What happens first on screen after the caption 'water on runway' appears?",
"question_wo_referring_query": "In the video, a white airplane is taxiing on the runway at the airport, with a backdrop of golden hills and an orange sky. What happens first on screen after the caption 'water on runway' appears?",
"candidates": [
"Two men in work uniforms are manufacturing an airplane in a workshop.",
"At night, a man leans out from the front of a parked airplane to talk to the camera.",
"The airplane skids on a water puddle on the runway, splashing a huge spray of water.",
"A hand is drawing two buildings with a river flowing between them on white paper.",
"A person is working in a warehouse filled with boxes of paper."
],
"correct_choice": 2,
"position": [
2770,
2897,
5076,
6073,
6340,
6809
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "7rMgpExA4kM_0",
"video_path": "7rMgpExA4kM.mp4",
"subtitle_path": "7rMgpExA4kM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 313.48,
"view_count": 3395728
},
{
"video_id": "7rMgpExA4kM",
"question": "What happens first on the screen after the subtitle 'and I am getting ready to go' appears, featuring a man wearing a black cold hat with an English letter logo, dressed in black clothes, carrying a black backpack, and sporting a stubbly mustache?",
"question_wo_referring_query": "What happens first on the screen?",
"candidates": [
"Manufacturing airplane wheels in a factory.",
"An airplane model suspended by several steel wires is displayed.",
"The man sits in a car looking out the window at the sunlit grass and trees.",
"Manufacturing the interior of an airplane fuselage.",
"Viewing the city from above."
],
"correct_choice": 2,
"position": [
833,
1005,
1100,
2469,
2979,
3727
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "7rMgpExA4kM_1",
"video_path": "7rMgpExA4kM.mp4",
"subtitle_path": "7rMgpExA4kM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 313.48,
"view_count": 3395728
},
{
"video_id": "lN3WnXMaE0o",
"question": "A soldier wearing a golden helmet is standing on a grassy field near a wooden fence. Holding a water bag, three drops fall from it. After the subtitle 'you will have to pay for it in blood' appears, what is the first object that appears on the screen?",
"question_wo_referring_query": "What is the first object that appears on the screen?",
"candidates": [
"A person with white hair holding a green shield",
"A person wearing a red robe with black hair",
"Three shields",
"A person with yellow hair holding a water bag",
"A soldier holding a red shield and wearing a helmet"
],
"correct_choice": 2,
"position": [
1519,
1548
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "lN3WnXMaE0o_0",
"video_path": "lN3WnXMaE0o.mp4",
"subtitle_path": "lN3WnXMaE0o_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 295.92,
"view_count": 2928
},
{
"video_id": "lN3WnXMaE0o",
"question": "On the right side of the screen, there are three soldiers holding shield banners with olive patterns, armed with spears and swords. On the left side, there is a person dressed in red clothes with black hair. What object first appeared on the screen after the subtitle 'creating a ruckus that echoed throughout' appeared?",
"question_wo_referring_query": "What object first appeared on the screen?",
"candidates": [
"A person with yellow hair holding a waterskin",
"Three soldiers with green shield banners and white-haired with long swords",
"A person dressed in a red robe with black hair",
"Three baskets",
"A soldier holding a red shield banner wearing a helmet"
],
"correct_choice": 1,
"position": [
2578,
2632
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "lN3WnXMaE0o_1",
"video_path": "lN3WnXMaE0o.mp4",
"subtitle_path": "lN3WnXMaE0o_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 295.92,
"view_count": 2928
},
{
"video_id": "Fq3zbbp-lv4",
"question": "Outdoors, there is a patch of ground with some fallen leaves and wild grass, with a chrysanthemum plant blooming with purple petals and yellow florets. On it, a bee is stopping on the florets. This chrysanthemum has also appeared simultaneously with which subtitles?",
"question_wo_referring_query": ", this chrysanthemum has also appeared simultaneously with which subtitles?",
"candidates": [
"friend",
"my life is full of routines and",
"as defined by someone else's standards",
"or gentle or helpful will we be",
"be when we grow up"
],
"correct_choice": 2,
"position": [
2561,
2619,
2708,
3083,
3168,
5323
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "Fq3zbbp-lv4_0",
"video_path": "Fq3zbbp-lv4.mp4",
"subtitle_path": "Fq3zbbp-lv4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 404.73,
"view_count": 408944
},
{
"video_id": "Fq3zbbp-lv4",
"question": "In the grove of yellow leaves illuminated by sunlight, there is a woman with a checkered scarf, khaki-colored jacket, and jeans playing with a black Labrador beside her. Which of the following subtitles appeared simultaneously with the sight of this black dog?",
"question_wo_referring_query": "Which of the following subtitles appeared simultaneously with the sight of this black dog?",
"candidates": [
"\"define who we are\"",
"\"slow down and recharge and communicate\"",
"\"in my case i wanted to be either a\"",
"\"my life is full of routines and\"",
"\u201cthat puts me in a nostalgic mood it's\u201d"
],
"correct_choice": 3,
"position": [
1068,
1544,
2833,
5321,
7578,
8489
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "Fq3zbbp-lv4_1",
"video_path": "Fq3zbbp-lv4.mp4",
"subtitle_path": "Fq3zbbp-lv4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 404.73,
"view_count": 408944
},
{
"video_id": "vJ9hYCUDHTo",
"question": "A lady with brown hair is explaining in front of a mirror. She is wearing light brown clothes, with a white wall behind her decorated with long wooden planks. In the top left corner behind her, there is a display screen showing various times. When the phrase 'best of Malaysia Airline and people who' is mentioned, what objects are present?",
"question_wo_referring_query": "What objects are present?",
"candidates": [
"A hanging picture with timestamps",
"Earrings",
"Glasses",
"A blue robe",
"A necklace"
],
"correct_choice": 1,
"position": [
8222
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "vJ9hYCUDHTo_0",
"video_path": "vJ9hYCUDHTo.mp4",
"subtitle_path": "vJ9hYCUDHTo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 554.4,
"view_count": 73592
},
{
"video_id": "vJ9hYCUDHTo",
"question": "A man in a gray suit and tie is speaking on a podium. He is in a large hall, with three blue flags behind him, and has short hair and wears glasses. In front of him are two microphones. When he mentions 'of a tactical or technical issue it is,' which non-existent objects are present?",
"question_wo_referring_query": "Which non-existent objects are present?",
"candidates": [
"yellow pillar",
"black glasses",
"blue flags",
"dark brown door",
"blue curtain"
],
"correct_choice": 4,
"position": [
11121
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "vJ9hYCUDHTo_1",
"video_path": "vJ9hYCUDHTo.mp4",
"subtitle_path": "vJ9hYCUDHTo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 554.4,
"view_count": 73592
},
{
"video_id": "Q47I8AdRgzc",
"question": "A white hemisphere moves from left to right in a pitch-black space, along with some bright spots flashing in the space. Where have this hemisphere and which subtitles appeared together before?",
"question_wo_referring_query": ", where have this hemisphere and which subtitles appeared together before?",
"candidates": [
"the paths they previously simulated",
"like Venus",
"Earth's long-time neighbor - Mars",
"exhibited earth-like place tectonics",
"stagnant lid period align the simulation"
],
"correct_choice": 4,
"position": [
4949,
5085
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Q47I8AdRgzc_0",
"video_path": "Q47I8AdRgzc.mp4",
"subtitle_path": "Q47I8AdRgzc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 293.23,
"view_count": 1889
},
{
"video_id": "Q47I8AdRgzc",
"question": "In the dark space dotted with some stars, an Earth-like sphere appears and continuously shrinks. With which subtitles has this sphere appeared together?",
"question_wo_referring_query": ", with which subtitles has this sphere appeared together?",
"candidates": [
"original plate structure",
"ancient history of a planet",
"creating a suitable environment",
"stagnant lid period align",
"understanding of planetary bodies like"
],
"correct_choice": 4,
"position": [
4199,
4311
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Q47I8AdRgzc_1",
"video_path": "Q47I8AdRgzc.mp4",
"subtitle_path": "Q47I8AdRgzc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 293.23,
"view_count": 1889
},
{
"video_id": "tPQf5sB03ZE",
"question": "In a background dominated by shades of purple, there's a vast galaxy dotted with countless stars. On the right side of the screen, there's a glowing golden object. After the phrase 'if you enjoyed this video consider' is mentioned, what change takes place?",
"question_wo_referring_query": "What change takes place after that?",
"candidates": [
"It stops moving without any change",
"It drops vertically downward",
"It collides with another star",
"It flies down to the middle of the screen"
],
"correct_choice": 3,
"position": [
4082,
4223
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "tPQf5sB03ZE_0",
"video_path": "tPQf5sB03ZE.mp4",
"subtitle_path": "tPQf5sB03ZE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 202.24,
"view_count": 15649
},
{
"video_id": "tPQf5sB03ZE",
"question": "On the left side of a screen displaying a map with different green-colored pixel patterns, there is a blue section. After mentioning 'there's a city called Zealand in the U.S,' what change occurs on this map?",
"question_wo_referring_query": "What change occurs on this map?",
"candidates": [
"The red-marked area on the map is enlarged",
"The map rotates upwards to reveal a city",
"The map rotates to the left to reveal a city",
"The blue section on the map is enlarged"
],
"correct_choice": 0,
"position": [
2754,
2928
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "tPQf5sB03ZE_1",
"video_path": "tPQf5sB03ZE.mp4",
"subtitle_path": "tPQf5sB03ZE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 202.24,
"view_count": 15649
},
{
"video_id": "PzUxuZ-KGsU",
"question": "On a red wooden table, there is a round pancake laid horizontally, and the surface of the pancake is covered with an even spread of sauce. On the screen, a hand is seen sprinkling some garnish onto the sauce. Which of the following ingredients appear?",
"question_wo_referring_query": "Which of the following ingredients appear?",
"candidates": [
"Cooked white meat mince",
"Green scallions",
"White diced onions",
"Yellow minced ginger sauce"
],
"correct_choice": 1,
"position": [
2695
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "PzUxuZ-KGsU_0",
"video_path": "PzUxuZ-KGsU.mp4",
"subtitle_path": "PzUxuZ-KGsU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 245.47,
"view_count": 1910
},
{
"video_id": "c6fuIEzOZ2E",
"question": "In a small room, a man and a woman are sitting on a white sofa. The man is wearing a dark gray jacket, and the woman is wearing a black dress. When the phrase 'people more a lot of things maybe how we' is mentioned, which objects are not present in the scene?",
"question_wo_referring_query": "Which objects are not present in the scene?",
"candidates": [
"A painting with a black frame",
"A red poster",
"A black microphone with a white pattern",
"A beige pillow with blue design"
],
"correct_choice": 1,
"position": [
2790
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "c6fuIEzOZ2E_0",
"video_path": "c6fuIEzOZ2E.mp4",
"subtitle_path": "c6fuIEzOZ2E_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 464.93,
"view_count": 61884
},
{
"video_id": "c6fuIEzOZ2E",
"question": "In a waiting room, a man with black hair wearing a green down jacket is sitting. Next to him is a man wearing a blue jacket. When the phrase 'but maybe build on the mindset of the' is mentioned, which object does not appear on the screen?",
"question_wo_referring_query": "Which object does not appear on the screen?",
"candidates": [
"black television",
"black wristwatch",
"black-framed glasses",
"black knitted hat"
],
"correct_choice": 1,
"position": [
2662
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "c6fuIEzOZ2E_1",
"video_path": "c6fuIEzOZ2E.mp4",
"subtitle_path": "c6fuIEzOZ2E_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 464.93,
"view_count": 61884
},
{
"video_id": "0mEN5Jf2hU0",
"question": "The long-haired woman wearing a blue and white floral pattern long-sleeve shirt is speaking in the center of the screen. After the subtitle 'For question two, label each molecule as chiral or achiral' appears, what does she do with her hands?",
"question_wo_referring_query": "What does she do with her hands?",
"candidates": [
"Extends three fingers",
"Extends five fingers",
"Extends two fingers",
"Extends four fingers"
],
"correct_choice": 2,
"position": [
1885,
2006
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "0mEN5Jf2hU0_0",
"video_path": "0mEN5Jf2hU0.mp4",
"subtitle_path": "0mEN5Jf2hU0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 219.51,
"view_count": 97270
},
{
"video_id": "0mEN5Jf2hU0",
"question": "On the left side of the screen, there is an image with several pieces of paper, while on the right side, there is a hand with blue nail polish holding a piece of paper with text on it. After the subtitle says 'triple bond will never be a chiral center so this one is not a chiral,' what happened?",
"question_wo_referring_query": "What happened?",
"candidates": [
"Took away the picture on the left",
"No change",
"Added a piece of paper",
"The right-hand paper disappeared"
],
"correct_choice": 3,
"position": [
1108,
1244
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "0mEN5Jf2hU0_1",
"video_path": "0mEN5Jf2hU0.mp4",
"subtitle_path": "0mEN5Jf2hU0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 219.51,
"view_count": 97270
},
{
"video_id": "ZoUsR8t8IxE",
"question": "In front of a yellow patterned blue text on a white wallpaper, a man with gray hair standing in white clothes mentioned 'President Danielle OTA it's obvious'. After that, who appeared?",
"question_wo_referring_query": "Who appeared?",
"candidates": [
"A woman with long black hair wearing a blue coat appeared.",
"A man with brown hair wearing a light blue coat appeared.",
"A woman with long black hair wearing a purple coat appeared.",
"A woman with brown hair wearing a light blue coat appeared."
],
"correct_choice": 3,
"position": [
3438,
4768
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "ZoUsR8t8IxE_0",
"video_path": "ZoUsR8t8IxE.mp4",
"subtitle_path": "ZoUsR8t8IxE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 394.52,
"view_count": 40305
},
{
"video_id": "22iOyzE8Ec0",
"question": "In front of a purple background, there is a man with long black hair standing. He has his hands clenched into fists in front of him. In the top left corner of the screen, there are bold white and blue captions. What did he do after this?",
"question_wo_referring_query": "What did he do after this?",
"candidates": [
"He styled his hair facing the mirror",
"He made a victory gesture",
"He touched his beard",
"He closed his eyes and clasped his hands in front of him"
],
"correct_choice": 3,
"position": [
1579,
4496
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "22iOyzE8Ec0_0",
"video_path": "22iOyzE8Ec0.mp4",
"subtitle_path": "22iOyzE8Ec0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 231.86,
"view_count": 163649
},
{
"video_id": "22iOyzE8Ec0",
"question": "In front of a purple background, there is a man with long black hair standing. He is clasping his hands together in front of him. In the upper right corner of the screen, there are bold subtitles in white and blue. What action does he take afterwards?",
"question_wo_referring_query": "What action does he take afterwards?",
"candidates": [
"He tilts his head and supports his cheek with one hand.",
"He squats down to tie his shoelaces.",
"He bends down to pick something up.",
"He spreads his hands open, palms facing outward."
],
"correct_choice": 3,
"position": [
1661,
4355
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "22iOyzE8Ec0_1",
"video_path": "22iOyzE8Ec0.mp4",
"subtitle_path": "22iOyzE8Ec0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 231.86,
"view_count": 163649
},
{
"video_id": "JwoBdRC2fzE",
"question": "There are two statues in the background, and a group of people raising their fists watching two shirtless men wearing white shorts. What are these two shirtless men in white shorts doing?",
"question_wo_referring_query": "What are these two shirtless men in white shorts doing?",
"candidates": [
"Running race",
"Long jump competition",
"Swimming competition",
"Wrestling match"
],
"correct_choice": 3,
"position": [
2546
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "JwoBdRC2fzE_0",
"video_path": "JwoBdRC2fzE.mp4",
"subtitle_path": "JwoBdRC2fzE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 504.67,
"view_count": 587243
},
{
"video_id": "JwoBdRC2fzE",
"question": "On the golden ground with black and gray caves and scattered wood, what did the man, who appeared for the first time when a shirtless man with a cloth strip tied around his waist and a beard appeared, do?",
"question_wo_referring_query": "What action did the man perform?",
"candidates": [
"lift a horse with both hands",
"lift a tree trunk with both hands",
"clench fists with both hands",
"lift a bull with both hands"
],
"correct_choice": 1,
"position": [
7962
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "JwoBdRC2fzE_1",
"video_path": "JwoBdRC2fzE.mp4",
"subtitle_path": "JwoBdRC2fzE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 504.67,
"view_count": 587243
},
{
"video_id": "qYnloYaeQA8",
"question": "In front of a gray wall and a yellowish-brown door, there is a transparent glass table with dark brown edges. A man in a dark blue short-sleeved shirt is standing beside the table. What kind of hat is this man wearing?",
"question_wo_referring_query": "What kind of hat is this man wearing?",
"candidates": [
"A blue beret",
"A white ceremonial hat adorned with red flowers",
"An off-white baseball cap",
"A black baseball cap with a white design in the middle and on the upper side of the ear flaps"
],
"correct_choice": 3,
"position": [
2862
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "qYnloYaeQA8_0",
"video_path": "qYnloYaeQA8.mp4",
"subtitle_path": "qYnloYaeQA8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 408.03,
"view_count": 1230478
},
{
"video_id": "qYnloYaeQA8",
"question": "Under the glaring sunlight, three men are running on the golden sand, with many green trees and people having fun in the background. What style of top is the man in the middle wearing?",
"question_wo_referring_query": "What style of top is the man in the middle wearing?",
"candidates": [
"Gray sweater",
"Olive suit",
"Blue sleeveless top with red and blue patterned front",
"White and black striped long sleeve"
],
"correct_choice": 2,
"position": [
7089
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "qYnloYaeQA8_1",
"video_path": "qYnloYaeQA8.mp4",
"subtitle_path": "qYnloYaeQA8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 408.03,
"view_count": 1230478
},
{
"video_id": "Oht0i1DACcA",
"question": "What changes occur to the clothing of the woman, who appears at the beginning of the video with short hair, wearing a blue jacket, carrying shoes in her hand, and walking barefoot on a path between fields, when she walks down a path lined with brightly decorated shops?",
"question_wo_referring_query": "What changes occur to this woman's clothing?",
"candidates": [
"No changes",
"She changed to a red jacket",
"She took off the blue jacket and tied it around her waist, wearing a black short-sleeved top",
"She took off the blue jacket and tied it around her waist, wearing a white short-sleeved top"
],
"correct_choice": 2,
"position": [
377,
547
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Oht0i1DACcA_0",
"video_path": "Oht0i1DACcA.mp4",
"subtitle_path": "Oht0i1DACcA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 238.91,
"view_count": 4683
},
{
"video_id": "zda-T6wrEhs",
"question": "What is the woman, who is wearing half-rimmed glasses, a white coat, and light blue jeans, doing in the vegetable-filled garden in the video?",
"question_wo_referring_query": "What is the woman doing?",
"candidates": [
"Crossing both hands",
"Using one hand to stroke her hair",
"Holding her hair with both hands",
"Placing one hand on her knee"
],
"correct_choice": 1,
"position": [
5901
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "zda-T6wrEhs_0",
"video_path": "zda-T6wrEhs.mp4",
"subtitle_path": "zda-T6wrEhs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 507.47,
"view_count": 289752
},
{
"video_id": "zda-T6wrEhs",
"question": "When the video shows a room with black and orange colors, there is a bald man wearing a floral shirt sitting on a sofa. What is this man doing?",
"question_wo_referring_query": "What action is this man performing?",
"candidates": [
"Both hands are raised",
"One hand is holding a pen and the other hand is holding a book",
"One hand is placed on his knee",
"Both hands are placed on his knees and he is clenching his fists"
],
"correct_choice": 3,
"position": [
10069
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "zda-T6wrEhs_1",
"video_path": "zda-T6wrEhs.mp4",
"subtitle_path": "zda-T6wrEhs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 507.47,
"view_count": 289752
},
{
"video_id": "-kaF6SnSEo8",
"question": "When a man wearing a red shirt with white stripes and a man wearing a white short-sleeve shirt with red and blue patterns appear in the video, which item are they both wearing?",
"question_wo_referring_query": "Which item are they both wearing?",
"candidates": [
"The man in the white shirt is wearing black-framed glasses, while the man in the red shirt is not",
"A blue bracelet",
"The man in the red shirt is wearing black-framed glasses, while the man in the white shirt is not",
"The man in the red shirt is wearing an orange scarf, while the man in the white shirt is not"
],
"correct_choice": 2,
"position": [
5210
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "-kaF6SnSEo8_0",
"video_path": "-kaF6SnSEo8.mp4",
"subtitle_path": "-kaF6SnSEo8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 486.45,
"view_count": 3003793
},
{
"video_id": "-kaF6SnSEo8",
"question": "When a pie chart representing the Czech Ethnicity appears in the video, with blue occupying the largest portion, red being the second, and light green the least, which of the following sentences is displayed on the screen?",
"question_wo_referring_query": "Which of the following sentences is displayed on the screen?",
"candidates": [
"25% \"Unspecified\"",
"60% \u201cdeclared\u201d Czech",
"95% Czech",
"38% Czech"
],
"correct_choice": 0,
"position": [
7136
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "-kaF6SnSEo8_1",
"video_path": "-kaF6SnSEo8.mp4",
"subtitle_path": "-kaF6SnSEo8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 486.45,
"view_count": 3003793
},
{
"video_id": "P0BSTjziVys",
"question": "In the opening of the video, Melissa Maribel with dark brown hair and wearing a white V-neck top appears. In which of the following scenes does Melissa Maribel appear?",
"question_wo_referring_query": "In which of the following scenes does Melissa Maribel appear?",
"candidates": [
"In an orange background with a green paper showing a hint in the middle.",
"On the left side with an orange background containing white text 'SUBSCRIBE', and on the right side, a green horizontal line, a red vertical line, and some black text.",
"In an orange background with a green plant pot in the upper right corner and some curved needles on the left.",
"On the right side with an orange background containing white text 'SUBSCRIBE', and on the left, a green horizontal line, a red vertical line, and some black text."
],
"correct_choice": 1,
"position": [
166,
5402
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "P0BSTjziVys_0",
"video_path": "P0BSTjziVys.mp4",
"subtitle_path": "P0BSTjziVys_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 239.62,
"view_count": 149376
},
{
"video_id": "P0BSTjziVys",
"question": "In the video, there is a green planter in the upper right corner and a curved needle against an orange background on the left. There is a green piece of paper labeled Tip1 on the screen. In which of the following scenes does the green piece of paper also appear?",
"question_wo_referring_query": "In which of the following scenes does the green piece of paper also appear?",
"candidates": [
"It appears simultaneously with the last SUBSCRIBE in the video.",
"In the video, there is a green planter in the upper left corner and a curved needle against an orange background on the right. There is a piece of paper labeled Tip2 on the screen.",
"In the video, there is a green planter in the upper right corner and a curved needle against an orange background on the left. There is a piece of paper labeled TIP2 on the screen.",
"In the video, there is a green planter in the upper right corner and a curved needle against an orange background on the left. There is a piece of paper labeled NOTE on the screen."
],
"correct_choice": 2,
"position": [
352,
3052
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "P0BSTjziVys_1",
"video_path": "P0BSTjziVys.mp4",
"subtitle_path": "P0BSTjziVys_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 239.62,
"view_count": 149376
},
{
"video_id": "JASFwBtUK40",
"question": "In a laboratory shown in the video, there are many experimental instruments and devices on an olive-colored desk. The window behind is white. When the subtitle mentions 'At this moment that we are defining and redefining', which person appears on the screen at this moment?",
"question_wo_referring_query": "Which person appears on the screen at this moment?",
"candidates": [
"Jack",
"Patrick Craine",
"John",
"Andr\u00e9s Jacque"
],
"correct_choice": 3,
"position": [
629
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "JASFwBtUK40_0",
"video_path": "JASFwBtUK40.mp4",
"subtitle_path": "JASFwBtUK40_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 191.57,
"view_count": 18691
},
{
"video_id": "JASFwBtUK40",
"question": "In a room with white walls in the video, there are several brown experiment tables. A man wearing a blue and white shirt and black glasses is explaining. When the subtitle mentions 'I can. The higher I lift it, the faster the explanation goes,' which item is not present in the room at this time?",
"question_wo_referring_query": "Which item is not present in the room at this time?",
"candidates": [
"Some transparent tubes",
"Yellow pipes",
"Some white pipes",
"Black glasses"
],
"correct_choice": 1,
"position": [
2198
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "JASFwBtUK40_1",
"video_path": "JASFwBtUK40.mp4",
"subtitle_path": "JASFwBtUK40_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 191.57,
"view_count": 18691
},
{
"video_id": "athabNMGceo",
"question": "Against a blue background, a man wearing a pair of black-rimmed glasses and a white short-sleeved shirt with a small bird pattern is explaining something. Which of the following animals can spray feces up to 40 cm?",
"question_wo_referring_query": "Which of the following animals can spray feces up to 40 cm?",
"candidates": [
"Hamster",
"Whale",
"Ostrich",
"Rabbit"
],
"correct_choice": 2,
"position": [
3856
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E2O",
"level": "L1-Perception",
"id": "athabNMGceo_0",
"video_path": "athabNMGceo.mp4",
"subtitle_path": "athabNMGceo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 477.52,
"view_count": 218301
},
{
"video_id": "athabNMGceo",
"question": "Against a blue background, a man wearing black-framed glasses and a white short-sleeve shirt with a small bird pattern is explaining. Which of the following animals evolved hindgut fermentation?",
"question_wo_referring_query": "Which of the following animals evolved hindgut fermentation?",
"candidates": [
"Ostrich",
"Rabbit",
"Kangaroo",
"Whale"
],
"correct_choice": 1,
"position": [
5485
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E2O",
"level": "L1-Perception",
"id": "athabNMGceo_1",
"video_path": "athabNMGceo.mp4",
"subtitle_path": "athabNMGceo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 477.52,
"view_count": 218301
},
{
"video_id": "SCZ_Z4NnikA",
"question": "On a seaside hill covered with green grass, there is a shining sea in the distance. A white building is constructed on the hill, with a red building at the bottom right corner and a white object as well. What shape is the white building?",
"question_wo_referring_query": "What shape is the white building?",
"candidates": [
"cylindrical",
"rectangular",
"square",
"irregular",
"stepped"
],
"correct_choice": 0,
"position": [
11247
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "SCZ_Z4NnikA_0",
"video_path": "SCZ_Z4NnikA.mp4",
"subtitle_path": "SCZ_Z4NnikA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 589.36,
"view_count": 12509
},
{
"video_id": "SCZ_Z4NnikA",
"question": "There is a river in a grove, with green grass on the hillside next to it. On the left side of the river are green plants, and on the right side are white rocks. Some water is flowing in the river, and there is a bridge above it. What color are the railings of the bridge?",
"question_wo_referring_query": "What color are the railings of the bridge?",
"candidates": [
"Yellow",
"White",
"Brown",
"Green",
"Black"
],
"correct_choice": 1,
"position": [
13323
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "SCZ_Z4NnikA_1",
"video_path": "SCZ_Z4NnikA.mp4",
"subtitle_path": "SCZ_Z4NnikA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 589.36,
"view_count": 12509
},
{
"video_id": "pFtKaT3GF9I",
"question": "Which of the following sequence of scenes is correct?",
"question_wo_referring_query": "Which of the following sequence of scenes is correct?",
"candidates": [
"First, a picture with the flag of Iraq is shown, followed by a picture of a man in green clothes and a man in white clothes with a black background in the upper right corner, and lastly a man discussing a picture with a sheep and a desert is shown.",
"First, a picture of a man in green clothes and a man in white clothes with a black background in the upper right corner is shown, followed by a picture with the flag of Iraq, and lastly a man discussing a picture with a sheep and a desert is shown.",
"First, a man discussing a picture with a sheep and a desert is shown, followed by a picture with the flag of Iraq, and lastly a picture of a man in green clothes and a man in white clothes with a black background in the upper right corner is shown.",
"First, a man discussing a picture with a sheep and a desert is shown, followed by a picture of a man in green clothes and a man in white clothes with a black background in the upper right corner, and lastly a picture with the flag of Iraq is shown.",
"First, a picture of a man in green clothes and a man in white clothes with a black background in the upper right corner is shown, followed by a man discussing a picture with a sheep and a desert, and lastly a picture with the flag of Iraq is shown."
],
"correct_choice": 2,
"position": [
483,
1188,
1510
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SSS",
"level": "L2-Relation",
"id": "pFtKaT3GF9I_0",
"video_path": "pFtKaT3GF9I.mp4",
"subtitle_path": "pFtKaT3GF9I_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 524.83,
"view_count": 233479
},
{
"video_id": "PQRyGacBRA4",
"question": "In the video, a man wearing black clothes with curly hair is facing the camera. Behind him is a complex building with a clock tower. There are also some withered branches on the left side. In which of the following scenes does this man appear?",
"question_wo_referring_query": "In which of the following scenes does this man appear?",
"candidates": [
"On top of a building with a lot of reporters, and the roof is red.",
"On a viewing platform without the Union Jack flag.",
"In front of a building with black rectangular tiles, a white carved door frame, black double doors, and surrounded by windows.",
"On a road surrounded by trees with cars around.",
"On a circular stage with many people below."
],
"correct_choice": 3,
"position": [
107,
176,
4060,
3391,
3501,
2803
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "PQRyGacBRA4_0",
"video_path": "PQRyGacBRA4.mp4",
"subtitle_path": "PQRyGacBRA4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 212.68,
"view_count": 1456490
},
{
"video_id": "PQRyGacBRA4",
"question": "A man wearing a suit and a purple tie is walking on a concrete road surrounded by trees and parked cars. He has short hair and is holding a piece of white paper in his hand. What other scenes did this man appear in?",
"question_wo_referring_query": "In which other scenes did he appear?",
"candidates": [
"On a stage with British flags displayed",
"In a library with a brown bookshelf full of books, in front of a green desk",
"In front of a blue backdrop with the British flag",
"In front of a complex building with a bell tower and some withered branches on the left",
"In front of a green backdrop painted with various colorful buildings"
],
"correct_choice": 2,
"position": [
3391,
4046,
4271,
4748,
687,
2224
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "PQRyGacBRA4_1",
"video_path": "PQRyGacBRA4.mp4",
"subtitle_path": "PQRyGacBRA4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 212.68,
"view_count": 1456490
},
{
"video_id": "Efuyl2Anehg",
"question": "There are two gentlemen standing against a worn-down wall on the screen. One is wearing a white T-shirt and the other is wearing a purple T-shirt, with sunglasses and two strings hanging from his chest. This gentleman simultaneously appears with what kind of subtitles?",
"question_wo_referring_query": "This gentleman simultaneously appears with what kind of subtitles?",
"candidates": [
"How hot does it typically get in Bahrain in the summer?",
"They play bagpipes in the Arab world?",
"Abdullah: So she would go like",
"in Bahrain, like 200 years ago",
"Abdullah: Probably like this."
],
"correct_choice": 0,
"position": [
7404,
9258,
10152,
10315,
7974
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "Efuyl2Anehg_0",
"video_path": "Efuyl2Anehg.mp4",
"subtitle_path": "Efuyl2Anehg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 511.64,
"view_count": 257904
},
{
"video_id": "HAED3riiZkw",
"question": "A bald man is sitting in a courtyard surrounded by bamboo walls. Behind him is a rectangular object. He is wearing a gray T-shirt with a white butterfly printed on it. When the man reappears among the trees, holding a board covered with a beehive, how does he change?",
"question_wo_referring_query": "How does he change?",
"candidates": [
"He puts on a white brimmed hat and a gray face mask, and wears a yellow beekeeping suit.",
"He puts on a black brimmed hat and a gray face mask, and wears a white beekeeping suit.",
"He puts on a yellow brimmed hat and a gray face mask, and wears a white beekeeping suit.",
"He puts on a white brimmed hat and a gray face mask, and wears a white beekeeping suit.",
"He puts on a white brimmed hat and a gray face mask, and wears a black beekeeping suit."
],
"correct_choice": 3,
"position": [
438,
546
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "HAED3riiZkw_0",
"video_path": "HAED3riiZkw.mp4",
"subtitle_path": "HAED3riiZkw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 300.97,
"view_count": 424853
},
{
"video_id": "HAED3riiZkw",
"question": "This is a piece of honeycomb taken from a box, placed in an iron tray. Below the iron tray is a table with a green tablecloth. There is also a pair of hands holding chopsticks, and next to it, there is a board of honeycomb. When this piece of honeycomb is placed in a large round stainless steel bowl and stirred with chopsticks, what changes occur?",
"question_wo_referring_query": "What changes occur?",
"candidates": [
"Noodles were placed on the honeycomb",
"The honeycomb was poured out of the large bowl",
"Water was added to the honeycomb",
"The honeycomb was cut into pieces",
"The honeycomb was broken into pieces"
],
"correct_choice": 3,
"position": [
5098,
5601
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "HAED3riiZkw_1",
"video_path": "HAED3riiZkw.mp4",
"subtitle_path": "HAED3riiZkw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 300.97,
"view_count": 424853
},
{
"video_id": "0_YDrJoUe8s",
"question": "A woman is sitting inside a gallery. She is wearing a red coat and black clothes. Her hair is blonde and she is wearing a watch on her wrist. There are two paintings hanging on the wall behind her. When the phrase 'want to experience the painting as the' is mentioned, what change occurs to this woman?",
"question_wo_referring_query": "What change occurs to this woman?",
"candidates": [
"The woman shakes hands in front of the painting",
"The woman takes a picture in front of the painting",
"The woman dances in front of the painting",
"The woman flips her hair in front of the painting",
"The woman stands in front of the painting admiring it"
],
"correct_choice": 4,
"position": [
634,
867
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "0_YDrJoUe8s_0",
"video_path": "0_YDrJoUe8s.mp4",
"subtitle_path": "0_YDrJoUe8s_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 268.6,
"view_count": 314463
},
{
"video_id": "0_YDrJoUe8s",
"question": "A woman is sitting inside a gallery. She is wearing a red coat and black clothes. Her hair is blond, and she has a watch on her wrist. There are two paintings hanging on the wall behind her. When she mentions 'his work is incredibly worked out and um,' what change occurs to the woman onscreen?",
"question_wo_referring_query": "What change occurs to the woman onscreen?",
"candidates": [
"The woman walks to admire four artworks.",
"The woman starts introducing the artworks.",
"The woman changed her clothes.",
"The woman sits down to admire four artworks.",
"The woman's hair was tied up."
],
"correct_choice": 0,
"position": [
634,
3716
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "0_YDrJoUe8s_1",
"video_path": "0_YDrJoUe8s.mp4",
"subtitle_path": "0_YDrJoUe8s_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 268.6,
"view_count": 314463
},
{
"video_id": "TFbGLEZ4qt0",
"question": "A woman with long black straight hair is in a white room. She is wearing a white jacket and a pink shirt. On the left side is the room's door, and on the right side is a white display shelf with a desk lamp, a vase, and some pictures. She is sitting in front of a desk talking. On the right side of the desk, there is also a bucket filled with many colored pencils and a bouquet of flowers. What action did this woman take?",
"question_wo_referring_query": "What action did this woman take?",
"candidates": [
"She adjusted her hair a bit",
"She picked up the colored pencils",
"She turned around and looked at the shelf",
"She stood up and walked out of the room",
"She is gesturing with her hands"
],
"correct_choice": 4,
"position": [
626
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "TFbGLEZ4qt0_0",
"video_path": "TFbGLEZ4qt0.mp4",
"subtitle_path": "TFbGLEZ4qt0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 295.25,
"view_count": 511914
},
{
"video_id": "TFbGLEZ4qt0",
"question": "A lady with long black straight hair is in a white room. She is wearing a white coat and a pink top. To her right is the door of the room, and to her left is a white display shelf with a table lamp, vase, and some pictures on it. She is sitting in front of a table, talking. There is also a bucket with many colored pencils and a bunch of flowers to the left of the table. What action did this lady do?",
"question_wo_referring_query": "What action did this lady do?",
"candidates": [
"She used a purple colored pencil",
"She moved her hand to the right",
"She got up and opened the room door",
"She picked up the bunch of flowers",
"She got up and walked out of the room"
],
"correct_choice": 1,
"position": [
4812
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "TFbGLEZ4qt0_1",
"video_path": "TFbGLEZ4qt0.mp4",
"subtitle_path": "TFbGLEZ4qt0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 295.25,
"view_count": 511914
},
{
"video_id": "VwZeSoYugZk",
"question": "On a blue-green sea, there is a boat floating. There are yellow letters in the scene, and some clouds in the distant sky. The green sea is shimmering. When it mentions 'The crater is roughly 30 kilometers in diameter, and would have created a megatsunami,' what color is the roof of the boat in the scene?",
"question_wo_referring_query": "What color is the roof of the boat in the scene?",
"candidates": [
"purple",
"blue",
"orange",
"white",
"yellow"
],
"correct_choice": 1,
"position": [
7635
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "VwZeSoYugZk_0",
"video_path": "VwZeSoYugZk.mp4",
"subtitle_path": "VwZeSoYugZk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 553.52,
"view_count": 829765
},
{
"video_id": "VwZeSoYugZk",
"question": "Above a green lake, there is a small building by the shore. In the distance, there are some mountain peaks and a deep blue sky with clouds floating in it. The small building by the lake is light yellow with two windows. When it is mentioned that 'and only allowing a small group of people and animals, to survive and repopulate,' what is the shape of the roof of the small building?",
"question_wo_referring_query": "What is the shape of the roof of the small building?",
"candidates": [
"Staircase",
"Square",
"Triangle",
"Rectangle",
"Semi-circle"
],
"correct_choice": 2,
"position": [
3374
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "VwZeSoYugZk_1",
"video_path": "VwZeSoYugZk.mp4",
"subtitle_path": "VwZeSoYugZk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 553.52,
"view_count": 829765
},
{
"video_id": "ip8khYCMb8Y",
"question": "When the man wearing a yellow short-sleeve shirt and black shorts first appears on the surfboard in the desert, what is he doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Resting on the beach",
"Waving at the camera",
"Playing on the surfboard in the desert",
"Talking with friends",
"Making a phone call"
],
"correct_choice": 2,
"position": [
1856
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "ip8khYCMb8Y_0",
"video_path": "ip8khYCMb8Y.mp4",
"subtitle_path": "ip8khYCMb8Y_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 314.82,
"view_count": 121531
},
{
"video_id": "ip8khYCMb8Y",
"question": "A group of young people are riding bicycles and skateboards on a bridge with red railings, surrounded by a wire mesh fence. When a boy wearing a green short-sleeve shirt and khaki pants first appears in the middle of the bridge, what is he doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Walking",
"Talking with friends",
"Riding a bicycle",
"Answering a phone call",
"Skateboarding"
],
"correct_choice": 4,
"position": [
293
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "ip8khYCMb8Y_1",
"video_path": "ip8khYCMb8Y.mp4",
"subtitle_path": "ip8khYCMb8Y_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 314.82,
"view_count": 121531
},
{
"video_id": "8_MG-E8QlBM",
"question": "Some women wearing headscarves are standing inside the hut, while the sunlight outside is dazzling. A man in a black and white striped short sleeve is holding a mobile phone. When the subtitle 'time news arrived of another body' appears, what is the woman in the middle with her hands covering her face and carrying a child on her back doing?",
"question_wo_referring_query": "What is the woman in the middle with her hands covering her face and carrying a child on her back doing?",
"candidates": [
"Sweeping the floor",
"Chatting with others",
"Waving to the camera",
"Taking care of the child",
"Covering her face with both hands, looking distressed"
],
"correct_choice": 4,
"position": [
2232
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "8_MG-E8QlBM_0",
"video_path": "8_MG-E8QlBM.mp4",
"subtitle_path": "8_MG-E8QlBM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 268.08,
"view_count": 398377
},
{
"video_id": "8_MG-E8QlBM",
"question": "Under the scorching sun, a few people are standing on the yellow earth outside the home. On the right side of the screen, there is a parked yellow cart. In front of the camera, there are two men. The man on the left is wearing a black top with a distressed expression, while the man on the right is wearing a black and white striped top with rolled-up sleeves and has both hands placed on the man on the left. When the subtitle 'was being comforted by everyone who saw' appears, what is the man on the right doing?",
"question_wo_referring_query": "What is the man on the right doing?",
"candidates": [
"Chatting with the man",
"Waving to the camera",
"Comforting the distressed man",
"Distributing food to the needy",
"Wiping the man's tears"
],
"correct_choice": 2,
"position": [
2639
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "8_MG-E8QlBM_1",
"video_path": "8_MG-E8QlBM.mp4",
"subtitle_path": "8_MG-E8QlBM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 268.08,
"view_count": 398377
},
{
"video_id": "Lc7RikDaa30",
"question": "Green plants and decorative shelves are respectively on the left and right sides of the screen. A man with glasses, wearing a yellow suit and having a middle part hairstyle, is speaking into the microphone. Which of the following concepts is mentioned first?",
"question_wo_referring_query": "Which of the following concepts is mentioned first?",
"candidates": [
"The COE prices in Singapore have always been high",
"COE prices are determined by supply and demand",
"Using cars is the real issue",
"We need a rational and accurate road pricing system",
"More Singaporeans can own cars"
],
"correct_choice": 0,
"position": [
480,
525,
1176,
2126,
3043,
4138
],
"topic_category": "NP-News-Programs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Lc7RikDaa30_0",
"video_path": "Lc7RikDaa30.mp4",
"subtitle_path": "Lc7RikDaa30_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 191.88,
"view_count": 24390
},
{
"video_id": "Lc7RikDaa30",
"question": "With green plants and decorative shelves on the left and right sides of the screen, a man wearing glasses and a yellow suit with a middle part hairstyle is introducing himself with a mirror. Which of the following scenes appears first?",
"question_wo_referring_query": "Which of the following scenes appears first?",
"candidates": [
"A woman sitting in a car holding the steering wheel",
"A screenshot of news about COE",
"A display screen showing toll information on a road in Singapore",
"People walking and cars driving on a road",
"Tightly packed and neatly arranged cars in a parking lot"
],
"correct_choice": 1,
"position": [
418,
516,
1007,
1433,
2628,
2954
],
"topic_category": "NP-News-Programs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Lc7RikDaa30_1",
"video_path": "Lc7RikDaa30.mp4",
"subtitle_path": "Lc7RikDaa30_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 191.88,
"view_count": 24390
},
{
"video_id": "lCvQtGVhUrc",
"question": "The background shows a wall covered with numerous postcards and posters, with a bookshelf full of books on the right side. A bespectacled woman with straight bangs is holding a book titled 'Beautiful Boy'. The cover of the book features a black-and-white photo of two men. When the camera zooms in on the woman who is hiding half of her face with the book, what changes occur to the book?",
"question_wo_referring_query": "What changes occur to the book?",
"candidates": [
"The book changes from being in the woman's hand to being on the table",
"The book changes from a black-and-white cover to a colored cover",
"The book changes from a full shot to a close-up",
"The book changes from brand new to damaged",
"The book changes from closed to open"
],
"correct_choice": 2,
"position": [
5386,
5506
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "lCvQtGVhUrc_0",
"video_path": "lCvQtGVhUrc.mp4",
"subtitle_path": "lCvQtGVhUrc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 513.96,
"view_count": 160689
},
{
"video_id": "lCvQtGVhUrc",
"question": "The background shows a wall covered with many postcards and posters, a bookshelf filled with books on the right side, and a woman with bangs wearing glasses and a yellow-green wool coat sitting in front of a mirror talking. At the end of the video, in the yellow-orange background, there are some pictures with white English words on the left side of the round photo frame. What changes can be observed about her?",
"question_wo_referring_query": "What changes can be observed about her?",
"candidates": [
"The yellow-green coat changes into a black top",
"She is no longer wearing glasses",
"Her bangs are no longer visible",
"The yellow-green wool coat changes into a white short-sleeve shirt",
"The wool coat changes into a denim jacket"
],
"correct_choice": 0,
"position": [
728,
11937
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "lCvQtGVhUrc_1",
"video_path": "lCvQtGVhUrc.mp4",
"subtitle_path": "lCvQtGVhUrc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 513.96,
"view_count": 160689
},
{
"video_id": "aoxy2e7j9Bc",
"question": "Under the blue sky, there are many lush green trees, golden shining stone walls under the sunlight, and below is a desolate road. What objects are present in the scene at this time?",
"question_wo_referring_query": "What objects are present in the scene at this time?",
"candidates": [
"Utility Pole",
"Traffic Light",
"A Group of People",
"White Cloud",
"Telegraph Line"
],
"correct_choice": 0,
"position": [
2860
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "aoxy2e7j9Bc_0",
"video_path": "aoxy2e7j9Bc.mp4",
"subtitle_path": "aoxy2e7j9Bc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 307.44,
"view_count": 14704
},
{
"video_id": "aoxy2e7j9Bc",
"question": "In the forest with green trees and grass on both sides, there is a clear stream in the middle, and above the stream, there is a bridge with a white railing. What object can be seen on the screen at this moment?",
"question_wo_referring_query": "What object can be seen on the screen at this moment?",
"candidates": [
"child",
"puppy",
"bird",
"goldfish",
"rock"
],
"correct_choice": 4,
"position": [
6520
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "aoxy2e7j9Bc_1",
"video_path": "aoxy2e7j9Bc.mp4",
"subtitle_path": "aoxy2e7j9Bc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 307.44,
"view_count": 14704
},
{
"video_id": "KWv8DJMEHsE",
"question": "Two men wearing straw hats and gray clothes with knee pads are standing in the grass holding knives. There are some grass huts behind them. What kind of knives are they holding?",
"question_wo_referring_query": "What kind of knives are they holding?",
"candidates": [
"Small knife",
"Vegetable knife",
"Military knife",
"Long knife",
"Fruit knife"
],
"correct_choice": 3,
"position": [
1145
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "KWv8DJMEHsE_0",
"video_path": "KWv8DJMEHsE.mp4",
"subtitle_path": "KWv8DJMEHsE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 237.82,
"view_count": 10234
},
{
"video_id": "KWv8DJMEHsE",
"question": "Two men wearing straw hats and grey clothes stand in a grass field holding long knives. Behind them are a few green trees and a house. What does the house behind them look like?",
"question_wo_referring_query": "What does the house behind them look like?",
"candidates": [
"Wooden house",
"Building",
"Villa",
"Straw hut",
"Earthen house"
],
"correct_choice": 3,
"position": [
3160
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "KWv8DJMEHsE_1",
"video_path": "KWv8DJMEHsE.mp4",
"subtitle_path": "KWv8DJMEHsE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 237.82,
"view_count": 10234
},
{
"video_id": "0WEnmqVVbHo",
"question": "In the bright sunlight outside, there are lakes and green trees in the distance. A group of young people were having a picnic on the grass, surrounded by others. In the picture, who is the person being hugged by the girl wearing a white short-sleeved shirt and light blue jeans?",
"question_wo_referring_query": "Who is the person being hugged?",
"candidates": [
"The girl wearing a beige long-sleeved outer jacket with a white short-sleeved shirt underneath",
"The man wearing a blue short-sleeved shirt and white shorts",
"The girl wearing a dark green suspender dress with a white short-sleeved shirt underneath",
"The man wearing a light blue long-sleeved outer jacket and beige long pants",
"The man wearing a white short-sleeved shirt"
],
"correct_choice": 2,
"position": [
10211
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "0WEnmqVVbHo_0",
"video_path": "0WEnmqVVbHo.mp4",
"subtitle_path": "0WEnmqVVbHo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 479.48,
"view_count": 343405
},
{
"video_id": "f2-6Mh6lyQQ",
"question": "A woman wearing a dark red floral short-sleeve top, standing with purple gloves between white tables, with a piece of wooden craft on the table beside her. She is holding part of the craft with her right hand. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Repairing the item",
"Creating a craft",
"Dismantling the craft",
"Excavating the item",
"Cleaning the item"
],
"correct_choice": 2,
"position": [
1489
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "f2-6Mh6lyQQ_0",
"video_path": "f2-6Mh6lyQQ.mp4",
"subtitle_path": "f2-6Mh6lyQQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 331.87,
"view_count": 58159
},
{
"video_id": "f2-6Mh6lyQQ",
"question": "There are three people on the screen. The man on the left is wearing a khaki suit with a white shirt and a tie. The man in the middle is wearing a white hoodie, purple gloves, and glasses, holding a document in his hand. The woman on the right is wearing a grey long-sleeved jacket with a black and white floral underlay, with her hands in purple gloves resting on the table. What are these three people doing around the document on the table?",
"question_wo_referring_query": "What are these three people doing around the document on the table?",
"candidates": [
"Organizing the document",
"Making a craft",
"Disassembling the document",
"Assembling the document",
"Exploring the contents of the document"
],
"correct_choice": 4,
"position": [
4467
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "f2-6Mh6lyQQ_1",
"video_path": "f2-6Mh6lyQQ.mp4",
"subtitle_path": "f2-6Mh6lyQQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 331.87,
"view_count": 58159
},
{
"video_id": "uWBh0meTg08",
"question": "On a white background, after a hand writes the words 'Neural Architecture Search' in blue ink and then writes 'YOLO-NAS' over it in black ink with a big bracket, what follows next?",
"question_wo_referring_query": "What follows next?",
"candidates": [
"Draws a cartoon figure with raised hands standing next to a board filled with sketches on the left of the English text",
"Draws a timeline",
"Draws a table composed of colored lines",
"Draws a square frame",
"Draws two gears, one blue and one black"
],
"correct_choice": 0,
"position": [
1259,
1355
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "uWBh0meTg08_0",
"video_path": "uWBh0meTg08.mp4",
"subtitle_path": "uWBh0meTg08_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 297.1,
"view_count": 4888
},
{
"video_id": "uWBh0meTg08",
"question": "On a white backdrop, after a hand writes the blue English words 'YOLO-NAS as foundational model,' what do they do next?",
"question_wo_referring_query": "What do they do next?",
"candidates": [
"Draws a rectangle",
"Draws a table composed of colored lines",
"Draws two yellow smiley faces",
"Draws two purple cylindrical shapes on the left side of the whiteboard, and a fish tank and a shoulder on the right side",
"Draws a timeline"
],
"correct_choice": 3,
"position": [
5819,
6727
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "uWBh0meTg08_1",
"video_path": "uWBh0meTg08.mp4",
"subtitle_path": "uWBh0meTg08_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 297.1,
"view_count": 4888
},
{
"video_id": "GawGUhl9zuQ",
"question": "On a white background board, there is a notebook with pictures attached. The video screen of two women connecting is split into the top left and top right corners. The woman in the top left corner is wearing a red coat and a wristwatch. When the subtitle 'There's a lonely electron' appears, what change occurs to the red-clothed woman's wristwatch?",
"question_wo_referring_query": "What change occurs to the red-clothed woman's wristwatch?",
"candidates": [
"The watch face changes from dim to bright.",
"The watch face changes from bright to dim.",
"The watch face changes from red to black.",
"The watch face changes from black to white.",
"The watch face changes from white to yellow."
],
"correct_choice": 1,
"position": [
302,
549
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "GawGUhl9zuQ_0",
"video_path": "GawGUhl9zuQ.mp4",
"subtitle_path": "GawGUhl9zuQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 340.94,
"view_count": 2860
},
{
"video_id": "GawGUhl9zuQ",
"question": "Between two connected videos, there is a white notebook in the middle held by a female. The notebook has an additional paper attached with a figure on it. What changes happen to the notebook when the subtitle mentions 'there good and then'?",
"question_wo_referring_query": "What changes happen to the notebook?",
"candidates": [
"The additional paper on the notebook is gone",
"The notebook only has a blue figure drawn with a pen",
"The notebook has an additional paper and a black figure drawn with a pen",
"The notebook has an additional paper and a black figure drawn with a pen",
"The notebook has an additional paper and a blue figure drawn with a pen"
],
"correct_choice": 4,
"position": [
88,
7116
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "GawGUhl9zuQ_1",
"video_path": "GawGUhl9zuQ.mp4",
"subtitle_path": "GawGUhl9zuQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 340.94,
"view_count": 2860
},
{
"video_id": "2zZSMnGLGao",
"question": "In a room with a green plant in the background, a woman with long hair draped over her shoulders is holding a yellow book with a square cover featuring cartoon characters. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Throwing the book into the trash can",
"Turning the back of the book towards the camera",
"Picking up another book with a blue cover",
"Watering the plant",
"Holding the book facing the camera"
],
"correct_choice": 4,
"position": [
1855
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "2zZSMnGLGao_0",
"video_path": "2zZSMnGLGao.mp4",
"subtitle_path": "2zZSMnGLGao_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 463.03,
"view_count": 641
},
{
"video_id": "2zZSMnGLGao",
"question": "The woman in the video is holding an open book, and on the bottom of the yellow page on the right side of the book, there are two whales. What is this woman doing while holding the book?",
"question_wo_referring_query": "The woman in the video is holding an open book, and on the bottom of the yellow page on the right side of the book, there are two whales. What is this woman doing while holding the book?",
"candidates": [
"Holding the book with one hand and explaining",
"Standing up with the book",
"Turning her back to the camera",
"Holding the book with both hands and explaining",
"Bending down to pick something up"
],
"correct_choice": 0,
"position": [
6469
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "2zZSMnGLGao_1",
"video_path": "2zZSMnGLGao.mp4",
"subtitle_path": "2zZSMnGLGao_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 463.03,
"view_count": 641
},
{
"video_id": "T57jVsvVVR0",
"question": "Standing in front of two flagpoles tied with white ribbons, what color is the clothing of the man holding the white paper and giving a speech when he says 'minister Wong and others to continue uh'?",
"question_wo_referring_query": "What color is the clothing?",
"candidates": [
"black",
"blue",
"red",
"yellow",
"white"
],
"correct_choice": 1,
"position": [
2369
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "T57jVsvVVR0_0",
"video_path": "T57jVsvVVR0.mp4",
"subtitle_path": "T57jVsvVVR0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 263.96,
"view_count": 10438
},
{
"video_id": "mkqgTAe2_O4",
"question": "A man and a woman, dressed as ordinary people, are holding shovels and digging soil. The man is wearing a hat, and the woman is wearing a headscarf. There are two soldiers in blue uniforms with black hats holding guns nearby. In the distance, there are some green plants. What is present in this scene?",
"question_wo_referring_query": "What is present in this scene?",
"candidates": [
"A photographer wearing a hat",
"Skirts",
"Advancing cannons",
"NP-News-Programers",
"Various colored horses"
],
"correct_choice": 1,
"position": [
6168
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mkqgTAe2_O4_0",
"video_path": "mkqgTAe2_O4.mp4",
"subtitle_path": "mkqgTAe2_O4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 520.04,
"view_count": 1776919
},
{
"video_id": "mkqgTAe2_O4",
"question": "In the distance of the battlefield is a broken white building, with an archway on the left side of the building. A soldier wearing a green uniform and helmet is shooting, while another soldier without a helmet is wounded and leaning in the corner. There is a rifle to the left side of the wounded soldier. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"Doctor in a white coat",
"Military green aircraft",
"White horse",
"Broken iron pillar",
"Stacked lumber"
],
"correct_choice": 4,
"position": [
2035
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mkqgTAe2_O4_1",
"video_path": "mkqgTAe2_O4.mp4",
"subtitle_path": "mkqgTAe2_O4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 520.04,
"view_count": 1776919
},
{
"video_id": "lQODAJ_F5yE",
"question": "On yellow paper lies white paper with black symbols on it and a black arrow. The white paper and the arrow form a formula. Pink paper strips are used as markers. Below, the two largest sheets of paper are orange and blue. In the upper left corner is a black-headed, multicolored pencil. A hand with red nails is holding a pen and writing. The subtitle reads 'The first answer is with a negative sign because heat is released. We could have also written'. What elements are present in this scene?",
"question_wo_referring_query": "What elements are present in this scene?",
"candidates": [
"Red triangular paper with 'b' written on it",
"Orange triangular paper with 'b' written on it",
"Red rectangular paper with 'b' written on it",
"Red round paper with 'b' written on it",
"Orange round paper with 'b' written on it"
],
"correct_choice": 4,
"position": [
5362
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "lQODAJ_F5yE_0",
"video_path": "lQODAJ_F5yE.mp4",
"subtitle_path": "lQODAJ_F5yE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 258.97,
"view_count": 126323
},
{
"video_id": "lQODAJ_F5yE",
"question": "On a yellow paper, there are three strips of paper with characters written on them. In the top left corner, there is a stationery item with a black head and a colorful body. In the bottom right corner, a woman with long hair is speaking inside a red circular frame. The subtitle appears 'enthalpy stoichiometry and how to use them. So if'. What is on the woman's body?",
"question_wo_referring_query": "What is on the woman's body?",
"candidates": [
"Blue nail polish",
"Black nail polish",
"White outerwear",
"Necklace",
"Black and white striped outerwear"
],
"correct_choice": 2,
"position": [
79
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "lQODAJ_F5yE_1",
"video_path": "lQODAJ_F5yE.mp4",
"subtitle_path": "lQODAJ_F5yE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 258.97,
"view_count": 126323
},
{
"video_id": "zkmoxOKhpvk",
"question": "The bridge with green railings is halfway in the air. Dense green forests are beside the bridge. The sunshine casts the railings' reflections. Who crossed this bridge?",
"question_wo_referring_query": "Who crossed this bridge?",
"candidates": [
"The woman wearing a black corset",
"The curly-haired man wearing a green cat pattern onesie",
"The curly-haired man wearing a pink dog pattern onesie",
"The man wearing a black baseball cap",
"The curly-haired man wearing a pink cat pattern onesie"
],
"correct_choice": 4,
"position": [
654
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "zkmoxOKhpvk_0",
"video_path": "zkmoxOKhpvk.mp4",
"subtitle_path": "zkmoxOKhpvk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 541.0,
"view_count": 21577
},
{
"video_id": "zkmoxOKhpvk",
"question": "A pink boy stands on the green springboard, a short-sleeved boy on the left is holding a blue bungee cord, and a red short-sleeved man with a hat is watching. Who is assisting the pink boy to jump down from behind?",
"question_wo_referring_query": "Who is assisting the pink boy to jump down from behind?",
"candidates": [
"A woman in a yellow short-sleeved shirt",
"A man with sunglasses and a blue baseball cap",
"A man in orange pants",
"A woman in a black tank top",
"A woman with tied hair in a short-sleeved shirt"
],
"correct_choice": 1,
"position": [
4244
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "zkmoxOKhpvk_1",
"video_path": "zkmoxOKhpvk.mp4",
"subtitle_path": "zkmoxOKhpvk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 541.0,
"view_count": 21577
},
{
"video_id": "fp1r40w_PtA",
"question": "What did the woman with curly hair wearing black clothes and silver earrings do when she appeared in front of the blue background and the green pattern in the bottom right corner?",
"question_wo_referring_query": "What did she do when she appeared?",
"candidates": [
"Talked about the interactions between blood and venom",
"Talked about topics related to venom",
"Talked about topics related to muscles",
"Talked about some scientific research progress",
"Talked about topics related to poisonous snakes"
],
"correct_choice": 2,
"position": [
438
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "fp1r40w_PtA_0",
"video_path": "fp1r40w_PtA.mp4",
"subtitle_path": "fp1r40w_PtA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 402.86,
"view_count": 634467
},
{
"video_id": "fp1r40w_PtA",
"question": "On the sandy ground with good light, a patterned brown rattlesnake coils its body together, extending its head. What did the snake do when it appeared?",
"question_wo_referring_query": "What did the snake do when it appeared?",
"candidates": [
"Coiled its body",
"Opened its mouth",
"Launched an attack",
"Shook its tail",
"Twisted its neck"
],
"correct_choice": 3,
"position": [
1284
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "fp1r40w_PtA_1",
"video_path": "fp1r40w_PtA.mp4",
"subtitle_path": "fp1r40w_PtA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 402.86,
"view_count": 634467
},
{
"video_id": "Aw1_7wSaeKk",
"question": "On a grey background, four website addresses are displayed in white text at the center of the background. After the subtitle 'week also there might be a few surprises' appears, what happens above and below the URLs?",
"question_wo_referring_query": "On a grey background, four website addresses are displayed in white text at the center. After the subtitle 'week also there might be a few surprises' appears, what happens above and below the URLs?",
"candidates": [
"An airplane icon flies above the URLs, and a submarine icon flies below.",
"An airplane icon flies above the URLs, and an armored vehicle icon flies below.",
"A submarine icon flies above the URLs, and an airplane icon flies below.",
"An airplane icon flies above the URLs, and another airplane icon flies below.",
"A submarine icon flies above the URLs, and another submarine icon flies below."
],
"correct_choice": 3,
"position": [
2351,
2372
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3E",
"level": "L2-Relation",
"id": "Aw1_7wSaeKk_0",
"video_path": "Aw1_7wSaeKk.mp4",
"subtitle_path": "Aw1_7wSaeKk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 222.67,
"view_count": 36099
},
{
"video_id": "Aw1_7wSaeKk",
"question": "A handheld drill lies on the green grass, with a red button on the handle. The drill body is mainly gray with red letters BOSCH on it. After the subtitle 'this was no regular drill and was a ' appears, what happens in the center of the screen?",
"question_wo_referring_query": "What happens in the center of the screen?",
"candidates": [
"An image of a machete appears in the center of the screen.",
"The grass background becomes blurry, and an image of a gun appears in the center of the screen.",
"An identical drill appears.",
"An image of a hammer appears in the center of the screen.",
"An image of an eagle appears in the center of the screen."
],
"correct_choice": 1,
"position": [
712,
833
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3E",
"level": "L2-Relation",
"id": "Aw1_7wSaeKk_1",
"video_path": "Aw1_7wSaeKk.mp4",
"subtitle_path": "Aw1_7wSaeKk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 222.67,
"view_count": 36099
},
{
"video_id": "DRIpznER-VQ",
"question": "Four people are standing in a row in front of a window: two women are in the middle, and two men are on the outside. The man on the right is wearing a black coat, black-framed glasses, and carrying a backpack. One woman is wearing a gold-black patterned headscarf, and the other woman is in a black and white striped long skirt. The man on the left has a long beard, is resting his arm on the counter, and is dressed in gold-embroidered attire. In which other scenes does the man resting his arm appear?",
"question_wo_referring_query": "In which other scenes does the man resting his arm appear?",
"candidates": [
"Outside a store window in a high-end mall",
"At a speech venue, holding a mic",
"Outside a plant-filled balcony",
"Flipping through a book on a table",
"On a chair in a library"
],
"correct_choice": 1,
"position": [
490,
758
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "DRIpznER-VQ_0",
"video_path": "DRIpznER-VQ.mp4",
"subtitle_path": "DRIpznER-VQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 308.98,
"view_count": 18346
},
{
"video_id": "iJgh2dnudIU",
"question": "Next to a refrigerator covered in many pictures, there is a woman with purple hair wearing a green top. Her hands are open with the palms facing upwards. What items are behind her to the left?",
"question_wo_referring_query": "What items are behind her to the left?",
"candidates": [
"Green and gray boards",
"A pot and some knives",
"Tea cup",
"Water faucet"
],
"correct_choice": 3,
"position": [
1087
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "iJgh2dnudIU_0",
"video_path": "iJgh2dnudIU.mp4",
"subtitle_path": "iJgh2dnudIU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 349.27,
"view_count": 182106
},
{
"video_id": "iJgh2dnudIU",
"question": "A person wearing a green short-sleeved shirt, holding a phone in their right hand, is facing 3 bottles on the table. What is this person holding in their left hand?",
"question_wo_referring_query": "What is this person holding in their left hand?",
"candidates": [
"A can",
"A bun",
"A dumpling",
"Beef"
],
"correct_choice": 0,
"position": [
4904
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "iJgh2dnudIU_1",
"video_path": "iJgh2dnudIU.mp4",
"subtitle_path": "iJgh2dnudIU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 349.27,
"view_count": 182106
},
{
"video_id": "ze66pbJYr18",
"question": "Sitting at a table are 5 people, one of whom is a woman wearing a hat and pouring tea from a teapot. When the subtitle 'lived in a small hut and slept on straw' appears, what objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"Bed",
"Lamp, table, washing machine",
"Lamp, table, chair",
"TV, lamp, table"
],
"correct_choice": 2,
"position": [
7436
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "ze66pbJYr18_0",
"video_path": "ze66pbJYr18.mp4",
"subtitle_path": "ze66pbJYr18_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 388.99,
"view_count": 23349
},
{
"video_id": "ze66pbJYr18",
"question": "In the middle of the black background, there is a picture. The picture shows a blue sky and yellow stars. The color of the house on the left is yellow and gray. When the subtitle 'iconic works the quaint setting and use' appears, what object is on the screen?",
"question_wo_referring_query": "What object is on the screen?",
"candidates": [
"Table and chair",
"Teapot",
"Bed",
"Table and stereo"
],
"correct_choice": 0,
"position": [
183
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "ze66pbJYr18_1",
"video_path": "ze66pbJYr18.mp4",
"subtitle_path": "ze66pbJYr18_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 388.99,
"view_count": 23349
},
{
"video_id": "AjF13uKVQa0",
"question": "A man wearing glasses and a suit is standing in front of three blocks with the letters 'Y', 'G', and 'B' in yellow, green, and blue. He is holding a math kit with his right hand. What action did this man take?",
"question_wo_referring_query": "What action did this man take?",
"candidates": [
"Right hand raised",
"Both hands spread out, palms facing each other with a small distance between them, holding a pen in the right hand",
"Both hands clenched into fists",
"Both hands spread out, palms facing each other with a small distance between them, holding a pen in the left hand"
],
"correct_choice": 1,
"position": [
11276
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2A",
"level": "L1-Perception",
"id": "AjF13uKVQa0_0",
"video_path": "AjF13uKVQa0.mp4",
"subtitle_path": "AjF13uKVQa0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 34,
"duration": 579.0,
"view_count": 286
},
{
"video_id": "AjF13uKVQa0",
"question": "In the lower right corner of the screen, there is a man wearing a suit. His right hand is raised, and beside his right hand is the number 47. Behind this man, there is a white background with 10 lines of text. What color are these words?",
"question_wo_referring_query": "What color are these words?",
"candidates": [
"The top line is yellow, the rest are black",
"The third line has some blue text, the rest are black",
"The top 8 lines of text are black, the bottom 2 lines are red",
"All are black"
],
"correct_choice": 3,
"position": [
9868
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2A",
"level": "L1-Perception",
"id": "AjF13uKVQa0_1",
"video_path": "AjF13uKVQa0.mp4",
"subtitle_path": "AjF13uKVQa0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 34,
"duration": 579.0,
"view_count": 286
},
{
"video_id": "X29dPzJIMbA",
"question": "In the video, there are three women sitting on a bench. The woman on the left is wearing a black long-sleeve shirt, with her right hand extended forward. The woman in the middle is wearing a light gray long-sleeve shirt. The woman on the right is wearing a white long-sleeve shirt. Who is sitting in the middle?",
"question_wo_referring_query": "In the video, there are three women sitting on a bench. The woman on the left is wearing a black long-sleeve shirt, with her right hand extended forward. The woman in the middle is wearing a light gray long-sleeve shirt. The woman on the right is wearing a white long-sleeve shirt. Who is sitting in the middle?",
"candidates": [
"Nancy",
"Davy",
"Jo-Anne",
"Cara"
],
"correct_choice": 2,
"position": [
2318
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "X29dPzJIMbA_0",
"video_path": "X29dPzJIMbA.mp4",
"subtitle_path": "X29dPzJIMbA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 209.28,
"view_count": 5171
},
{
"video_id": "X29dPzJIMbA",
"question": "In the video, there is a woman with long black hair wearing sunglasses, dressed in a white long-sleeve shirt, holding a blue phone in her right hand, wearing three rings, and making a 'peace' sign with her left hand. Who is taking a photo with the phone in the video?",
"question_wo_referring_query": "Who is taking a photo with the phone in the video?",
"candidates": [
"Cara",
"Nancy",
"Davy",
"Jo-Anne"
],
"correct_choice": 0,
"position": [
2884
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "X29dPzJIMbA_1",
"video_path": "X29dPzJIMbA.mp4",
"subtitle_path": "X29dPzJIMbA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 209.28,
"view_count": 5171
},
{
"video_id": "WM78_KqcrSY",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a few small planes appear on the left, top right, and bottom right, with a big plane in the middle right, then some lines of yellow and white subtitles appear, and finally, ends with a white bear.",
"First, a white bear appears, then a few small planes appear on the left, top right, and bottom right, with a big plane in the middle right, and finally, ends with some lines of yellow and white subtitles.",
"First, some lines of yellow and white subtitles appear, then a white bear appears, and finally, ends with a few small planes on the left, top right, and bottom right, with a big plane in the middle right.",
"First, a white bear appears, then some lines of yellow and white subtitles appear, and finally, the sequence ends with a screen showing a few small planes on the left, a few small planes on the top right and bottom right, and a big plane in the middle right."
],
"correct_choice": 3,
"position": [
110,
536,
2978
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "WM78_KqcrSY_0",
"video_path": "WM78_KqcrSY.mp4",
"subtitle_path": "WM78_KqcrSY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 490.23,
"view_count": 162769
},
{
"video_id": "WM78_KqcrSY",
"question": "Which of the following scenarios is in the correct order?",
"question_wo_referring_query": "Which of the following scenarios is in the correct order?",
"candidates": [
"First, a map block appears on the left with details, then the left side shows an icon, the right side shows a screen with two icons, and finally, the left shows many small drones and the right side top shows a screen with a few small drones to conclude.",
"First, the left side shows many small drones, the right side top shows a screen with a few small drones, then a detailed map block appears on the left, and finally, as the conclusion, the left shows an icon and the right shows two icons.",
"First, a map block appears on the left with details, then the left side shows many small drones, the right side top shows a screen with a few small drones, and finally, the left side shows an icon and the right side shows two icons to conclude.",
"First, the left shows an icon, the right side shows a screen with two icons, then the left side shows many small drones, the right side top shows a screen with a few small drones, and finally, the left side shows a detailed map block."
],
"correct_choice": 2,
"position": [
4838,
7979,
11249
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "WM78_KqcrSY_1",
"video_path": "WM78_KqcrSY.mp4",
"subtitle_path": "WM78_KqcrSY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 490.23,
"view_count": 162769
},
{
"video_id": "F8Ma1qs0Rkg",
"question": "There is a man wearing a black shirt with white patterns and gloves sitting in the middle of the screen. His hands are open, and there are two lamps lit behind him. In the latter part of the video, what changes on the screen when this man's left palm faces upwards?",
"question_wo_referring_query": "What changes on the screen when the man\u2019s left palm faces upwards in the latter part of the video?",
"candidates": [
"A book appears in the top right corner",
"A book appears in the top left corner",
"A book appears in the bottom left corner",
"A book appears in the bottom right corner"
],
"correct_choice": 2,
"position": [
9686,
9780
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "F8Ma1qs0Rkg_0",
"video_path": "F8Ma1qs0Rkg.mp4",
"subtitle_path": "F8Ma1qs0Rkg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 494.44,
"view_count": 2394
},
{
"video_id": "F8Ma1qs0Rkg",
"question": "There is a man in the middle of the screen wearing a black short-sleeved shirt with white floral patterns. His hands are placed below, and there is an orange subtitle below as well. What changes occur on the screen after the man raises his left hand?",
"question_wo_referring_query": "What changes occur on the screen after the man raises his left hand?",
"candidates": [
"The orange subtitle changes to black",
"An additional line of orange subtitle appears",
"No changes",
"The orange subtitle disappears"
],
"correct_choice": 3,
"position": [
10216,
10703
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "F8Ma1qs0Rkg_1",
"video_path": "F8Ma1qs0Rkg.mp4",
"subtitle_path": "F8Ma1qs0Rkg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 494.44,
"view_count": 2394
},
{
"video_id": "AYMdAVxALP4",
"question": "In the scene, there are water and mountains, and the sun's reflection is in the water. When the subtitles mention 'planet changing it fresh to salt,' what change happens to the sun?",
"question_wo_referring_query": "What change happens to the sun?",
"candidates": [
"shifts to the left",
"falls down",
"no change",
"rises up"
],
"correct_choice": 1,
"position": [
7530,
7628
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "AYMdAVxALP4_0",
"video_path": "AYMdAVxALP4.mp4",
"subtitle_path": "AYMdAVxALP4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 379.31,
"view_count": 15994
},
{
"video_id": "AYMdAVxALP4",
"question": "How many rectangular yellow metal blocks are on the table? There is a person with purple gloves in the background. When the subtitle mentions \"eneficial way is difficult to say the,\" what changes occur on the table?",
"question_wo_referring_query": "What changes occur on the table?",
"candidates": [
"The number of rectangular yellow metal blocks increased by two",
"The number of rectangular yellow metal blocks increased by one",
"The number of rectangular yellow metal blocks decreased by two",
"The number of rectangular yellow metal blocks decreased by one"
],
"correct_choice": 1,
"position": [
6431,
6511
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "AYMdAVxALP4_1",
"video_path": "AYMdAVxALP4.mp4",
"subtitle_path": "AYMdAVxALP4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 379.31,
"view_count": 15994
},
{
"video_id": "cc0T2vtuJtc",
"question": "In the video, when the spoon is used to place the prepared meat onto the mashed potatoes in the glass dish, and the subtitle mentions 'Just pour the potatoes over minced meat!!', what other items are visible on the screen?",
"question_wo_referring_query": "What other items are visible on the screen?",
"candidates": [
"Knife",
"Pot",
"Green onion",
"Cheese"
],
"correct_choice": 1,
"position": [
3824
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "cc0T2vtuJtc_0",
"video_path": "cc0T2vtuJtc.mp4",
"subtitle_path": "cc0T2vtuJtc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 246.73,
"view_count": 1691
},
{
"video_id": "cc0T2vtuJtc",
"question": "There is a white plate in the middle of the screen with food on it, and a hand wearing a black glove is placed above the food. When the subtitle mentions 'I'd be happy to improve my channel!', what other object can be seen on the screen?",
"question_wo_referring_query": "What other object can be seen on the screen?",
"candidates": [
"spoon",
"pot",
"red seasoning",
"fork"
],
"correct_choice": 0,
"position": [
5512
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "cc0T2vtuJtc_1",
"video_path": "cc0T2vtuJtc.mp4",
"subtitle_path": "cc0T2vtuJtc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 246.73,
"view_count": 1691
},
{
"video_id": "WmrwQMFZLqI",
"question": "There is a short-haired woman wearing a long-sleeved suit with an apron in the video. Her hands are placed in front of her chest, with her right thumb pointing up. In front of her is a stove with a yellow pot on it. What color is the apron worn by the woman in the video?",
"question_wo_referring_query": "What color is the apron worn by the woman in the video?",
"candidates": [
"Green",
"Black",
"Yellow",
"White"
],
"correct_choice": 1,
"position": [
3332
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "WmrwQMFZLqI_0",
"video_path": "WmrwQMFZLqI.mp4",
"subtitle_path": "WmrwQMFZLqI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 504.8,
"view_count": 2416229
},
{
"video_id": "WmrwQMFZLqI",
"question": "In the video, there is a person wearing a black apron with the word 'TASTY' on it. They are wearing a ring on their left hand which is resting on a dough. There is a wooden board on the table. What is the shape of the dough in the video?",
"question_wo_referring_query": "What is the shape of the dough in the video?",
"candidates": [
"Square",
"Oval",
"Round",
"Triangular"
],
"correct_choice": 0,
"position": [
8428
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "WmrwQMFZLqI_1",
"video_path": "WmrwQMFZLqI.mp4",
"subtitle_path": "WmrwQMFZLqI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 504.8,
"view_count": 2416229
},
{
"video_id": "6Lb1PyJxVQM",
"question": "In the scene, there is a black car parked on the beach by the sea. A man has his left hand on his waist and his right hand on the car. Who is the man making this gesture?",
"question_wo_referring_query": "Who is the man making this gesture?",
"candidates": [
"The man wearing a purple short-sleeved shirt",
"The man wearing a white short-sleeved shirt with tattoos on both arms",
"The man wearing a blue uniform",
"The man wearing a black short-sleeved shirt"
],
"correct_choice": 1,
"position": [
1417
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "6Lb1PyJxVQM_0",
"video_path": "6Lb1PyJxVQM.mp4",
"subtitle_path": "6Lb1PyJxVQM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 394.85,
"view_count": 363619
},
{
"video_id": "6Lb1PyJxVQM",
"question": "In the scene, who is the person standing in front of the yellow dirt slope, holding food in their left hand and chopsticks in their right hand?",
"question_wo_referring_query": "Who is the person eating?",
"candidates": [
"The man wearing a white short-sleeve shirt with tattoos on both arms",
"The woman wearing a black short-sleeve shirt",
"The man wearing a yellow short-sleeve shirt",
"The man wearing a black jacket"
],
"correct_choice": 0,
"position": [
8137
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "6Lb1PyJxVQM_1",
"video_path": "6Lb1PyJxVQM.mp4",
"subtitle_path": "6Lb1PyJxVQM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 394.85,
"view_count": 363619
},
{
"video_id": "BtaVRhoLpC0",
"question": "In front of a building, there is a person wearing a white shirt and a black suit jacket. There are many microphones in front of this person. What is this person doing?",
"question_wo_referring_query": "What is this person doing?",
"candidates": [
"This person is taking a walk",
"This person is answering reporters' questions",
"This person is eating",
"This person is chatting with friends"
],
"correct_choice": 1,
"position": [
960
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "BtaVRhoLpC0_0",
"video_path": "BtaVRhoLpC0.mp4",
"subtitle_path": "BtaVRhoLpC0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 190.16,
"view_count": 50640
},
{
"video_id": "BtaVRhoLpC0",
"question": "The screen shows the text 'MEN and women filed claims after a special defence ministry inspection'. There is a person wearing a brown top. What is this person doing?",
"question_wo_referring_query": "What is this person doing?",
"candidates": [
"This person is looking down at a phone",
"This person is running",
"This person is eating",
"This person is talking to someone"
],
"correct_choice": 0,
"position": [
3963
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "BtaVRhoLpC0_1",
"video_path": "BtaVRhoLpC0.mp4",
"subtitle_path": "BtaVRhoLpC0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 190.16,
"view_count": 50640
},
{
"video_id": "ozpGTw6DrXs",
"question": "A blonde man wearing a gray short sleeve shirt is in front of a mountain and an ocean in the scene. In which other scene does this man appear?",
"question_wo_referring_query": "In which other scene does this man appear?",
"candidates": [
"Appears on the road between the sea and the buildings",
"Appears on the beach",
"Appears on a boat",
"Appears in a restaurant"
],
"correct_choice": 0,
"position": [
46,
4596
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "ozpGTw6DrXs_0",
"video_path": "ozpGTw6DrXs.mp4",
"subtitle_path": "ozpGTw6DrXs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 495.2,
"view_count": 54575
},
{
"video_id": "ozpGTw6DrXs",
"question": "On a blue object at sea, there is a blonde woman wearing a blue swimsuit. In which other scene does this woman appear?",
"question_wo_referring_query": ", in which other scene does this woman appear?",
"candidates": [
"She appears in a library.",
"She appears in a kitchen.",
"She appears on a path between a white building on one side and a gray building on the other.",
"She appears in a restaurant."
],
"correct_choice": 2,
"position": [
5800,
9523
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "ozpGTw6DrXs_1",
"video_path": "ozpGTw6DrXs.mp4",
"subtitle_path": "ozpGTw6DrXs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 495.2,
"view_count": 54575
},
{
"video_id": "XR3Ov2nQ39s",
"question": "In an abandoned lot, there is a green tank. Next to the tank, a man wearing a black short-sleeve shirt and black pants is standing. With what subtitle did this man appear together?",
"question_wo_referring_query": "With what subtitle did this man appear together?",
"candidates": [
"a male beforehand and about taking",
"to full rights use them in my videos",
"masoom military and of course my",
"bovis and aircraft gun but more on this"
],
"correct_choice": 3,
"position": [
2656,
2852
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "XR3Ov2nQ39s_0",
"video_path": "XR3Ov2nQ39s.mp4",
"subtitle_path": "XR3Ov2nQ39s_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 591.63,
"view_count": 24016
},
{
"video_id": "XR3Ov2nQ39s",
"question": "On a small road in the forest, there is a green tank riding on it, a man wearing green clothes and a hat is standing in the tank. With which subtitles does this man appear together?",
"question_wo_referring_query": "With which subtitles does this man appear together?",
"candidates": [
"dead areas certain anti aircraft machine",
"ride is 10 euros and 5 euros if you're",
"prices in October 2016 the regular price",
"for an adult is 7.5 euros if you have a"
],
"correct_choice": 0,
"position": [
8722,
9247
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "XR3Ov2nQ39s_1",
"video_path": "XR3Ov2nQ39s.mp4",
"subtitle_path": "XR3Ov2nQ39s_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 591.63,
"view_count": 24016
},
{
"video_id": "_qepWb_NVj4",
"question": "In the video, there are several blocks on the screen, including a red block, a green block, and a purple block. The other blocks are light green. What change occurs to these blocks when the phrase 'Antarctica but ask a person from South' is mentioned?",
"question_wo_referring_query": "What change occurs to these blocks?",
"candidates": [
"The blocks turn orange, blue, pink, and red",
"The blocks turn red and green",
"The blocks turn yellow, pink, purple, and blue",
"All blocks turn light green"
],
"correct_choice": 3,
"position": [
295,
426
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "_qepWb_NVj4_0",
"video_path": "_qepWb_NVj4.mp4",
"subtitle_path": "_qepWb_NVj4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 460.26,
"view_count": 2649332
},
{
"video_id": "_qepWb_NVj4",
"question": "In the screen, there is a picture of a white island within a blue background, surrounded by a yellow line. After mentioning 'be a continent then the biggest island,' what changes occur in this screen?",
"question_wo_referring_query": "What changes occur in this screen?",
"candidates": [
"A picture of a blue island and a picture of an olive island appear to the right of the white island.",
"A picture of a pink island appears next to the white island.",
"A picture of a red island appears next to the white island.",
"A picture of a purple island appears next to the white island."
],
"correct_choice": 0,
"position": [
8000,
8600
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "_qepWb_NVj4_1",
"video_path": "_qepWb_NVj4.mp4",
"subtitle_path": "_qepWb_NVj4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 460.26,
"view_count": 2649332
},
{
"video_id": "z6HTO2SOxUc",
"question": "There's a black surface on the table with a plate on it, inside the plate there's a multicolored dotted face mask, next to it there's a black and green cucumber model, and a yellow label with the word 'Blondies'. What other objects appear in the video?",
"question_wo_referring_query": "What other objects appear in the video?",
"candidates": [
"Piano",
"Mobile phone",
"Television",
"Leaf"
],
"correct_choice": 3,
"position": [
1950
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "z6HTO2SOxUc_0",
"video_path": "z6HTO2SOxUc.mp4",
"subtitle_path": "z6HTO2SOxUc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 277.24,
"view_count": 82453
},
{
"video_id": "z6HTO2SOxUc",
"question": "In the video, two men and a woman with straight hair wearing a black and white striped outfit appear, they are facing the camera and waving. There are also many objects on the table with the word JELL-O on them. Besides this, what else appears in the room?",
"question_wo_referring_query": "What else appears in the room?",
"candidates": [
"Mobile phone",
"Television",
"Flower pot",
"Piano"
],
"correct_choice": 2,
"position": [
292
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "z6HTO2SOxUc_1",
"video_path": "z6HTO2SOxUc.mp4",
"subtitle_path": "z6HTO2SOxUc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 277.24,
"view_count": 82453
},
{
"video_id": "Jaw7eWzgWr0",
"question": "In a room, there is a man wearing a suit with a white shirt and a tie, holding and playing an instrument. What color is the tie the man is wearing?",
"question_wo_referring_query": "What color is the tie the man is wearing?",
"candidates": [
"white",
"green",
"black",
"yellow"
],
"correct_choice": 2,
"position": [
560
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Jaw7eWzgWr0_0",
"video_path": "Jaw7eWzgWr0.mp4",
"subtitle_path": "Jaw7eWzgWr0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 248.58,
"view_count": 13697
},
{
"video_id": "Jaw7eWzgWr0",
"question": "In a room filled with many musical instruments, there is a man wearing a dark green suit with a light blue shirt underneath, a blue tie, and glasses, holding a musical instrument. What is the color of the gloves worn by the man with glasses?",
"question_wo_referring_query": "What is the color of the gloves worn by the man with glasses?",
"candidates": [
"Black",
"Red",
"White",
"Yellow"
],
"correct_choice": 2,
"position": [
1607
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Jaw7eWzgWr0_1",
"video_path": "Jaw7eWzgWr0.mp4",
"subtitle_path": "Jaw7eWzgWr0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 248.58,
"view_count": 13697
},
{
"video_id": "9m4wi5gPdHg",
"question": "A man wearing a white shirt and brown shoes, and another man wearing a black coat and black shoes, both with guns, are standing on a yellow surface near the entrance of a store. After they mutually kill each other, what did the man in the white shirt do?",
"question_wo_referring_query": "What did the man in the white shirt do?",
"candidates": [
"Sing",
"Play the piano",
"Lie down",
"Dance"
],
"correct_choice": 2,
"position": [
664,
688
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "9m4wi5gPdHg_0",
"video_path": "9m4wi5gPdHg.mp4",
"subtitle_path": "9m4wi5gPdHg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 302.33,
"view_count": 1361003
},
{
"video_id": "9m4wi5gPdHg",
"question": "In a room, there stands a person in a black silhouette wearing a hat. After this person opens a safe containing gold and paper money, what does the man, who is in a black silhouette and wearing a hat, do?",
"question_wo_referring_query": "What does the man in a black silhouette and wearing a hat do?",
"candidates": [
"Took the paper money",
"Danced",
"Drank water",
"Watched TV"
],
"correct_choice": 0,
"position": [
4410,
4582
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "9m4wi5gPdHg_1",
"video_path": "9m4wi5gPdHg.mp4",
"subtitle_path": "9m4wi5gPdHg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 302.33,
"view_count": 1361003
},
{
"video_id": "a_8G0PzVFbc",
"question": "In a grassy area where fresh sprouts are just emerging, there is a long-haired cat with black and yellow fur. To its right, there is a large tree with some blue buds, and on the grassland with a red wall, there is a small rabbit sitting. Which of these two animals appeared first?",
"question_wo_referring_query": ", which of these two animals appeared first?",
"candidates": [
"Long-haired cat",
"Neither appeared",
"Small rabbit",
"Both appeared at the same time"
],
"correct_choice": 0,
"position": [
10236,
10394
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "a_8G0PzVFbc_0",
"video_path": "a_8G0PzVFbc.mp4",
"subtitle_path": "a_8G0PzVFbc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 532.99,
"view_count": 322346
},
{
"video_id": "a_8G0PzVFbc",
"question": "In a scene with a big tree as the background, a black and yellow long-haired cat is walking on a horizontal wooden beam, and in a scene with a big tree as the background, a black dog is holding its front legs on a horizontal wooden beam. Which of these two animals appeared first?",
"question_wo_referring_query": "Which of these two animals appeared first?",
"candidates": [
"Neither appeared",
"Black dog",
"Both appeared at the same time",
"Black and yellow long-haired cat"
],
"correct_choice": 3,
"position": [
1680,
2616
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "a_8G0PzVFbc_1",
"video_path": "a_8G0PzVFbc.mp4",
"subtitle_path": "a_8G0PzVFbc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 532.99,
"view_count": 322346
},
{
"video_id": "97nEBjiQI1M",
"question": "A news anchor with curled hair is wearing a pink blazer over a black base and sitting in front of the camera reading the news. What happened after the caption \u2018standards our climate editor Justin rout\u2019 appeared?",
"question_wo_referring_query": "What happened?",
"candidates": [
"The camera cuts to fish swimming around coral in the ocean.",
"The camera cuts to a woman wearing an orange floral shirt and earphones.",
"A woman wearing a red hat and white short-sleeved shirt is handing out mineral water in a crowd.",
"The camera cuts to a man speaking in front of multiple display screens.",
"The camera cuts to two farmers working in a field surrounded by green trees under a blue sky with white clouds."
],
"correct_choice": 4,
"position": [
596,
751,
1031,
2045,
2792,
5545
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "97nEBjiQI1M_0",
"video_path": "97nEBjiQI1M.mp4",
"subtitle_path": "97nEBjiQI1M_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 381.28,
"view_count": 38518
},
{
"video_id": "97nEBjiQI1M",
"question": "In the blue seawater filled with white and olive pearls, and with some small fish swimming among the pearls, what happens on the screen after the subtitle 'might not fall back as expected a sign' appears?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The camera switches to a scene of a forest being illuminated by sunlight",
"The camera switches to a news anchor speaking to the camera",
"The camera switches to a man sitting in front of multiple display screens, speaking to the camera",
"A turtle swims next to a reef in the blue seawater",
"The camera switches to a woman wearing an orange flowered top and an earpiece"
],
"correct_choice": 3,
"position": [
2825,
2992,
3098,
3187,
4530,
2211
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "97nEBjiQI1M_1",
"video_path": "97nEBjiQI1M.mp4",
"subtitle_path": "97nEBjiQI1M_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 381.28,
"view_count": 38518
},
{
"video_id": "KlZpZVphLrc",
"question": "The BBC is broadcasting news. A woman with gray hair, wearing a red striped dress and glasses, is speaking. Behind her, there is another woman with black hair wearing a yellow dress and a man with glasses wearing a gray dress. There are several people behind them all listening attentively. When the phrase 'create the conditions for a sustainable' is mentioned, which item is not present in the scene?",
"question_wo_referring_query": "Which item is not present in the scene?",
"candidates": [
"diamond necklace",
"green clothes",
"pearl earrings",
"microphone used for talking",
"pearl necklace"
],
"correct_choice": 0,
"position": [
1700
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "KlZpZVphLrc_0",
"video_path": "KlZpZVphLrc.mp4",
"subtitle_path": "KlZpZVphLrc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 341.88,
"view_count": 89340
},
{
"video_id": "KlZpZVphLrc",
"question": "BBC Television is currently broadcasting the news. On the left side of the split screen, a woman with long auburn hair, wearing blue clothes, is explaining something. On the right side of the screen, a video shows a man with white hair speaking, holding a white piece of paper, with a man and a woman seated behind him. When the phrase 'must be a condemnation of all parties if' is mentioned, which of the following objects does not appear in the scene?",
"question_wo_referring_query": "BBC Television is currently broadcasting the news. On the left side of the split screen, a woman with long auburn hair, wearing blue clothes, is explaining something. On the right side of the screen, a video shows a man with white hair speaking, holding a white piece of paper, with a man and a woman seated behind him. When the phrase 'must be a condemnation of all parties if' is mentioned, which of the following objects does not appear in the scene?",
"candidates": [
"Badge with a blue strap",
"White wristwatch",
"Earring",
"Microphone",
"Blue clothes"
],
"correct_choice": 1,
"position": [
4988
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "KlZpZVphLrc_1",
"video_path": "KlZpZVphLrc.mp4",
"subtitle_path": "KlZpZVphLrc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 341.88,
"view_count": 89340
},
{
"video_id": "AxciimuEZAc",
"question": "In front of a building made of glass, there are some green plants and benches around, as well as red and orange striped sculptures. A man is walking down the steps on a marble-paved ground. In front of him, there is a black sculpture covered in raised dots. When the phrase 'sculpture garden looking at' is mentioned, what is this man wearing?",
"question_wo_referring_query": "In front of a building made of glass, there are some green plants and benches around, as well as red and orange striped sculptures. A man is walking down the steps on a marble-paved ground. In front of him, there is a black sculpture covered in raised dots. When the phrase 'sculpture garden looking at' is mentioned, what is this man wearing?",
"candidates": [
"blue long-sleeve shirt",
"blue vest",
"blue short-sleeve jacket",
"blue T-shirt",
"blue jacket"
],
"correct_choice": 0,
"position": [
201
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2A",
"level": "L1-Perception",
"id": "AxciimuEZAc_0",
"video_path": "AxciimuEZAc.mp4",
"subtitle_path": "AxciimuEZAc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 254.26,
"view_count": 12446
},
{
"video_id": "AxciimuEZAc",
"question": "In front of a transparent glass building, surrounded by green trees and a few other trees, there are some black rectangular objects placed in front of the trees. The sunlight is shining on the marble surface. A man in white and a blonde woman walk past a black, prominently sculpted point structure. When the phrase 'what does that mean for others' is mentioned, what kind of shoes is the woman wearing?",
"question_wo_referring_query": "What kind of shoes is the woman wearing?",
"candidates": [
"White sneakers with black edges",
"White slippers with black edges",
"White high heels with black edges",
"White leather shoes with black edges",
"White straw shoes with black edges"
],
"correct_choice": 0,
"position": [
4666
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2A",
"level": "L1-Perception",
"id": "AxciimuEZAc_1",
"video_path": "AxciimuEZAc.mp4",
"subtitle_path": "AxciimuEZAc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 254.26,
"view_count": 12446
},
{
"video_id": "kk-iRzLv81o",
"question": "In an animated building constructed with yellow and olive colors, the buildings on the left and center have white roofs. In front of the building, there are several people wearing yellow and black clothes and red and white clothes. The person wearing red and white clothes has a very tall grey hat and is holding a gun. One person, who is shirtless on the upper body and wearing yellow and black pants, is tied up with two sticks. There is another person swinging a bat and hitting him. Who is the character using the bat to hit the shirtless person?",
"question_wo_referring_query": "Who is the character using the bat to hit the shirtless person?",
"candidates": [
"The person wearing a white V-neck shirt, yellow pants, and has black hair",
"The person wearing yellow and black clothes with a double-eared hat",
"The person wearing yellow and black clothes with a round hat",
"The person wearing red and white clothes with a tall hat, holding a gun",
"The person wearing yellow and black clothes with olive-colored hair"
],
"correct_choice": 0,
"position": [
2769
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "kk-iRzLv81o_0",
"video_path": "kk-iRzLv81o.mp4",
"subtitle_path": "kk-iRzLv81o_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 447.25,
"view_count": 250909
},
{
"video_id": "kk-iRzLv81o",
"question": "The screen shows a plot of land surrounded by green trees. The land is brown, and there is a person wearing a yellow outfit with a cowboy hat and a mustache who is making a fire. There is also a person wearing white clothes and a gray skirt with a backpack behind him. Which one is carrying a gun?",
"question_wo_referring_query": "Which one is carrying a gun?",
"candidates": [
"The person wearing white clothes and a gray skirt",
"The person wearing white clothes and a gray skirt without a hat",
"The person wearing white clothes and a gray skirt with a hat",
"The person wearing yellow clothes and a hat making a fire",
"The person wearing white clothes, a gray skirt, and green shoes"
],
"correct_choice": 2,
"position": [
8147
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "kk-iRzLv81o_1",
"video_path": "kk-iRzLv81o.mp4",
"subtitle_path": "kk-iRzLv81o_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 447.25,
"view_count": 250909
},
{
"video_id": "Fo97qfO-9_g",
"question": "In the live news broadcast set, there are three paintings made into window shapes, with black grids and some tall buildings inside the frames. Two men in suits and a woman in a black skirt are conversing. The woman is looking at a piece of paper in her hand, the man in the middle has his arms crossed, and another man is speaking. There is a round table with a coffee cup in front of them. What did the man in the middle do?",
"question_wo_referring_query": "What did the man in the middle do?",
"candidates": [
"The man speaking picked up the coffee cup",
"The man sitting on the left stood up",
"The woman and the man in the middle shook hands",
"The man sitting in the middle placed his hands on his own legs",
"The woman stood up from her seat"
],
"correct_choice": 3,
"position": [
5836,
5878
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Fo97qfO-9_g_0",
"video_path": "Fo97qfO-9_g.mp4",
"subtitle_path": "Fo97qfO-9_g_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 451.99,
"view_count": 2519
},
{
"video_id": "Fo97qfO-9_g",
"question": "In the news broadcast room, there are three decorative paintings made to resemble windows. The frames contain black grids and some high-rise buildings. There are two men in suits and a woman in a black dress having a conversation. The woman is holding a piece of paper and speaking. The man in the middle is smiling at the man on the far right. What action is the woman performing?",
"question_wo_referring_query": "What action is the woman performing?",
"candidates": [
"The woman picked up a white piece of paper",
"The woman put one hand over the other",
"The woman shook hands with the man beside her",
"The woman stood up from her seat",
"The woman walked out of the frame"
],
"correct_choice": 1,
"position": [
8648,
8676
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Fo97qfO-9_g_1",
"video_path": "Fo97qfO-9_g.mp4",
"subtitle_path": "Fo97qfO-9_g_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 451.99,
"view_count": 2519
},
{
"video_id": "qbA42wQoWAs",
"question": "In a bedroom, three men are changing clothes in front of an open wardrobe and a clothing rack filled with items. What clothing are they changing into?",
"question_wo_referring_query": "What clothing are they changing into?",
"candidates": [
"white shirts",
"black shirts",
"white T-shirts",
"short sleeve shirts",
"black T-shirts"
],
"correct_choice": 0,
"position": [
4606
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "qbA42wQoWAs_0",
"video_path": "qbA42wQoWAs.mp4",
"subtitle_path": "qbA42wQoWAs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 454.75,
"view_count": 1604467
},
{
"video_id": "qbA42wQoWAs",
"question": "In the night-draped airport, four people are experiencing a ride in a vertical lift aircraft. Inside the aircraft, they are wearing earphones and fluorescent-patterned armor. Who among the following is participating?",
"question_wo_referring_query": "... Who among the following is participating?",
"candidates": [
"A man with white hair",
"A woman with long hair",
"A child",
"An elderly man with white hair",
"A woman with green hair"
],
"correct_choice": 1,
"position": [
8820
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "qbA42wQoWAs_1",
"video_path": "qbA42wQoWAs.mp4",
"subtitle_path": "qbA42wQoWAs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 454.75,
"view_count": 1604467
},
{
"video_id": "lojHyp1k0gE",
"question": "A woman wearing a dark green long-sleeved dress is crouching in the grass with several yellow chrysanthemums in front of her. What is she doing with the pruning shears in her hand?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"Pruning branches",
"Picking flowers",
"Fertilizing flowers",
"Planting flowers",
"Watering flowers"
],
"correct_choice": 1,
"position": [
3714
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "lojHyp1k0gE_0",
"video_path": "lojHyp1k0gE.mp4",
"subtitle_path": "lojHyp1k0gE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 466.34000000000003,
"view_count": 228398
},
{
"video_id": "lojHyp1k0gE",
"question": "In a dim room, there is a glass window as the background. On the table in front of the window, there are many green plants and flowers. A woman is standing in front of the wooden table, lighting a fire stick. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"Burning plants in a pot",
"Lighting incense",
"Lighting a candle",
"Burning a stick",
"Lighting a handheld incense stick"
],
"correct_choice": 1,
"position": [
8909
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "lojHyp1k0gE_1",
"video_path": "lojHyp1k0gE.mp4",
"subtitle_path": "lojHyp1k0gE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 466.34000000000003,
"view_count": 228398
},
{
"video_id": "gtX_oRpLClY",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First is the scene of a picture of a cartoon mouse, next is the scene of a mobile photo album, and finally the scene of a mobile app icon appears",
"First is the scene of a mobile photo album, next is the scene of a mobile app icon, and finally the scene of a picture of a cartoon mouse appears",
"First is the scene of a mobile app icon, next is the scene of a mobile photo album, and finally the scene of a picture of a cartoon mouse appears",
"First is the scene of a mobile photo album, next is the scene of a picture of a cartoon mouse, and finally the scene of a mobile app icon appears",
"First is the scene of a picture of a cartoon mouse, next is the scene of a mobile app icon, and finally the scene of a mobile photo album appears"
],
"correct_choice": 3,
"position": [
186,
3546,
3877
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "gtX_oRpLClY_0",
"video_path": "gtX_oRpLClY.mp4",
"subtitle_path": "gtX_oRpLClY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 236.17000000000002,
"view_count": 191304
},
{
"video_id": "UN3ICsfqKEY",
"question": "In the open space in front of a red building, a person is holding a skateboard with a design that reads '8.0', running towards a woman wearing glasses and dressed in wine-red pants. What objects are present in this scene?",
"question_wo_referring_query": ", what objects are present in this scene?",
"candidates": [
"A necklace",
"A white building",
"A yellow skateboard with a design that reads '8.0'",
"A green skateboard with a design that reads '8.0'",
"Black-framed glasses"
],
"correct_choice": 3,
"position": [
3602
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "UN3ICsfqKEY_0",
"video_path": "UN3ICsfqKEY.mp4",
"subtitle_path": "UN3ICsfqKEY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 518.69,
"view_count": 124730
},
{
"video_id": "UN3ICsfqKEY",
"question": "White clouds drift in the blue sky. A long-haired girl, wearing a red short-sleeve shirt and a reversed baseball cap, falls onto the ground. Beside her is a skateboard. What items are present in this scene?",
"question_wo_referring_query": "What items are present in this scene?",
"candidates": [
"A gray baseball cap",
"A black baseball cap",
"A gray short-sleeve shirt",
"A green skateboard",
"A red baseball cap"
],
"correct_choice": 0,
"position": [
7159
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "UN3ICsfqKEY_1",
"video_path": "UN3ICsfqKEY.mp4",
"subtitle_path": "UN3ICsfqKEY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 518.69,
"view_count": 124730
},
{
"video_id": "yXXhrMqfMlk",
"question": "In a narrow alley flanked by yellow buildings, a man wearing a black short-sleeved shirt and a hat is walking. Sunlight casts on his face, reflecting a dazzling glare. When the subtitles mention 'growing and learning and maturing as a', what object is present in the scene?",
"question_wo_referring_query": "What object is present in the scene?",
"candidates": [
"a round hat",
"a top hat",
"a duckbill cap",
"a black car",
"a pair of black-framed glasses"
],
"correct_choice": 2,
"position": [
1971
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "yXXhrMqfMlk_0",
"video_path": "yXXhrMqfMlk.mp4",
"subtitle_path": "yXXhrMqfMlk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 317.9,
"view_count": 78499
},
{
"video_id": "yXXhrMqfMlk",
"question": "In a dimly lit room, a man wearing a duckbill hat and colorful clothes is reaching out with both hands, with 'TRAVELWITHMAX.ORG' written above his head. When the subtitle mentions 'that should have all the information you', what object is present in this scene?",
"question_wo_referring_query": "what object is present in this scene?",
"candidates": [
"a pair of headphones",
"a decorative lamp emitting white light",
"a decorative lamp emitting purple light",
"a bracelet",
"a decorative lamp emitting red light"
],
"correct_choice": 2,
"position": [
6730
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "yXXhrMqfMlk_1",
"video_path": "yXXhrMqfMlk.mp4",
"subtitle_path": "yXXhrMqfMlk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 317.9,
"view_count": 78499
},
{
"video_id": "Z7Cox6lPW3c",
"question": "A woman wearing red clothes and a man wearing black clothes are in a video call, and a sentence starting with 'TSMC GETS' is gradually being revealed at the bottom. What kind of hair does the woman in the video call have?",
"question_wo_referring_query": "What kind of hair does the woman in the video call have?",
"candidates": [
"She has short blonde hair.",
"She has black hair.",
"She has no hair.",
"She has long blonde hair.",
"She has a black bob cut."
],
"correct_choice": 1,
"position": [
3160
],
"topic_category": "NP-News-Programs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Z7Cox6lPW3c_0",
"video_path": "Z7Cox6lPW3c.mp4",
"subtitle_path": "Z7Cox6lPW3c_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 208.07999999999998,
"view_count": 3498
},
{
"video_id": "Z7Cox6lPW3c",
"question": "In front of a background board with the word 'Bloomberg' and a high-rise building, a man with gold and gray-white short hair is speaking directly into the camera. The text in front of him is gradually appearing. What kind of clothes is this man wearing?",
"question_wo_referring_query": "What kind of clothes is this man wearing?",
"candidates": [
"He is wearing a pure black coat.",
"He is wearing a pure black shirt.",
"He is wearing a black checkered shirt.",
"He is wearing a pure black suit.",
"He is wearing a black striped shirt."
],
"correct_choice": 1,
"position": [
3851
],
"topic_category": "NP-News-Programs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Z7Cox6lPW3c_1",
"video_path": "Z7Cox6lPW3c.mp4",
"subtitle_path": "Z7Cox6lPW3c_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 208.07999999999998,
"view_count": 3498
},
{
"video_id": "D0RyFh0hnkQ",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First, a red dot moves across a gray-black screen with two images and text; then the red dot moves across a gray-black screen without any images; finally, a blue dot moves across a gray-black screen with a red bar graph.",
"First, a red dot moves across a gray-black screen with a blue bar graph; then the red dot moves across a gray-black screen without any images; finally, the red dot moves across a gray-black screen with two images and text.",
"First, a red dot moves across a gray-black screen with two images and text; then the red dot moves across a gray-black screen without any images; finally, the red dot moves across a gray-black screen with a blue bar.",
"First, a blue dot moves across a gray-black screen with two images and text; then the blue dot moves across a gray-black screen without any images; finally, the blue dot moves across a gray-black screen with a red bar graph.",
"First, a red dot moves across a gray, dark screen without any images; then the red dot moves across a gray-black screen with two images and text; finally, the red dot moves across a gray-black screen with a blue bar graph."
],
"correct_choice": 4,
"position": [
1554,
2081,
2258
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SSS",
"level": "L2-Relation",
"id": "D0RyFh0hnkQ_0",
"video_path": "D0RyFh0hnkQ.mp4",
"subtitle_path": "D0RyFh0hnkQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 521.96,
"view_count": 12
},
{
"video_id": "2edlqFUTDVc",
"question": "Under the blue and white sky, in front of a short house backdrop, on the far right side stand two figures wearing olive-colored clothes and belts, one of whom is holding a scythe. Which of the figures on the grass in the bottom left corner is simultaneously opening a scroll?",
"question_wo_referring_query": "Which of the figures in the bottom left corner is simultaneously opening a scroll?",
"candidates": [
"Two small figures wearing black helmets",
"Two small figures wearing white helmets",
"Two small figures wearing green helmets",
"Two small figures wearing black helmets",
"Two small figures wearing yellow helmets"
],
"correct_choice": 4,
"position": [
4426
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "2edlqFUTDVc_0",
"video_path": "2edlqFUTDVc.mp4",
"subtitle_path": "2edlqFUTDVc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 355.27,
"view_count": 4097
},
{
"video_id": "2edlqFUTDVc",
"question": "In front of a building with seven gray columns and green plants, two small figures on the far right are holding an olive shield and a saber, both wearing a gray helmet. On the left side of the screen, who gets stabbed in the back with the saber?",
"question_wo_referring_query": "Who gets stabbed in the back with the saber on the left side of the screen?",
"candidates": [
"A small figure with white hair wearing a crown",
"A small figure with golden hair wearing a gray helmet",
"A small figure with golden hair not wearing a crown",
"A small figure with golden hair wearing a crown",
"A small figure with black hair wearing a crown"
],
"correct_choice": 3,
"position": [
6864
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "2edlqFUTDVc_1",
"video_path": "2edlqFUTDVc.mp4",
"subtitle_path": "2edlqFUTDVc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 355.27,
"view_count": 4097
},
{
"video_id": "hf4WUOagFAw",
"question": "On the golden stage, what was the last action performed by the man wearing a light yellow jacket and white shirt, who stood in front of the mirror, bit his tongue, and gestured with his hands?",
"question_wo_referring_query": "What action did he perform?",
"candidates": [
"Made a 'yeah' gesture towards the mirror",
"Put his hand on someone's shoulder",
"Turned his back to the mirror",
"Ran towards the center of the stage",
"Sat on the ground of the stage and stretched his waist"
],
"correct_choice": 0,
"position": [
384,
5091
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "hf4WUOagFAw_0",
"video_path": "hf4WUOagFAw.mp4",
"subtitle_path": "hf4WUOagFAw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 216.02,
"view_count": 5512
},
{
"video_id": "hf4WUOagFAw",
"question": "What did the last person in a purple top who is facing the camera and making the gesture for the number seven with their eyes closed do the first time they appeared on the golden stage?",
"question_wo_referring_query": "What action did they perform?",
"candidates": [
"Faced the camera and bit their tongue",
"Sat on the stage and stretched their back",
"Put their hand on someone else's shoulder",
"Pointed towards the camera with their hand",
"Danced holding a microphone"
],
"correct_choice": 1,
"position": [
354,
5042
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "hf4WUOagFAw_1",
"video_path": "hf4WUOagFAw.mp4",
"subtitle_path": "hf4WUOagFAw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 216.02,
"view_count": 5512
},
{
"video_id": "zTeDF7mQ88A",
"question": "In front of the palace, a man dressed in colorful underwear and a white cloak, wearing tree branch decorations on his head, is dragging a woman in a skirt whose eyes sparkle with pink hearts. What change happened the last time this woman with sparkling pink hearts in her eyes appeared?",
"question_wo_referring_query": "What change happened the last time this woman with sparkling pink hearts in her eyes appeared?",
"candidates": [
"Her eyes changed from pink heart shapes to black tears streaming.",
"Her glasses changed from black with tears to pink heart shapes.",
"The woman held a shield.",
"She went from standing to sitting.",
"She wore a woven grass headband."
],
"correct_choice": 0,
"position": [
1559,
1841
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "zTeDF7mQ88A_0",
"video_path": "zTeDF7mQ88A.mp4",
"subtitle_path": "zTeDF7mQ88A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 270.6,
"view_count": 2066
},
{
"video_id": "zTeDF7mQ88A",
"question": "A person wearing a grass-woven headband is standing in the middle of the screen. On both sides, there are boxes filled with coins and jewels. In front of the boxes, there is a pile of money bags. When the man wearing the grass headband appears to the right of two long-haired dwarves, the background is red curtains over sunlit floor tiles. What change occurs to this man?",
"question_wo_referring_query": "A person wearing a grass-woven headband is standing in the middle of the screen. On both sides, there are boxes filled with coins and jewels. In front of the boxes, there is a pile of money bags. When the man wearing the grass headband appears to the right of two long-haired dwarves, the background is red curtains over sunlit floor tiles. What change occurs to this man?",
"candidates": [
"Black eyes turned into pink hearts with tears",
"Clothing turned green",
"Writing turned white",
"Turned into a mirror image"
],
"correct_choice": 0,
"position": [
3514,
4097
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "zTeDF7mQ88A_1",
"video_path": "zTeDF7mQ88A.mp4",
"subtitle_path": "zTeDF7mQ88A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 270.6,
"view_count": 2066
},
{
"video_id": "yl-6-Yzt--A",
"question": "In the top left corner of the screen, there is a rectangular land-sea distribution map with a white question mark in the middle. On the right side of the map, there is a man wearing a gray shirt sitting. What is this man doing?",
"question_wo_referring_query": "In the top left corner of the screen, there is a rectangular land-sea distribution map with a white question mark in the middle. On the right side of the map, there is a man wearing a gray shirt sitting. What is this man doing?",
"candidates": [
"Lying on the desk sleeping",
"Crossing his arms in front of his chest",
"Speaking with animated hand gestures",
"Standing up and turning around",
"Sitting silently with a closed mouth"
],
"correct_choice": 2,
"position": [
5277,
5393
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "yl-6-Yzt--A_0",
"video_path": "yl-6-Yzt--A.mp4",
"subtitle_path": "yl-6-Yzt--A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 349.62,
"view_count": 1971479
},
{
"video_id": "md3LVlEzFBU",
"question": "The dazzling sunlight shines in the upper left corner of the sky, a yellow tank's gun barrel pointing to the upper left, with a white C-shaped logo on the tank's body, and the tank\u2019s tracks are pressing against the grass. When the subtitle 'serious drawback its gun could only be' appears, what color is the background of the square within the C shape on the tank?",
"question_wo_referring_query": "What color is the background of the square within the C shape on the tank?",
"candidates": [
"white",
"blue",
"olive",
"black",
"yellow"
],
"correct_choice": 3,
"position": [
5070
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "md3LVlEzFBU_0",
"video_path": "md3LVlEzFBU.mp4",
"subtitle_path": "md3LVlEzFBU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 294.17,
"view_count": 491420
},
{
"video_id": "md3LVlEzFBU",
"question": "The sky is blue over a yellow desert, and there are two military vehicles facing each other. On these vehicles, there are two people wearing hats with their upper bodies exposed. When the subtitle 'legacy of service in british military' appears, what is the shape of the camera-like object in the middle of the left-hand vehicle?",
"question_wo_referring_query": "What is the shape of the camera-like object in the middle of the left-hand vehicle?",
"candidates": [
"square",
"rectangle",
"step-shape",
"circle",
"semicircle"
],
"correct_choice": 3,
"position": [
6573
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "md3LVlEzFBU_1",
"video_path": "md3LVlEzFBU.mp4",
"subtitle_path": "md3LVlEzFBU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 294.17,
"view_count": 491420
},
{
"video_id": "7MemY9jOmuk",
"question": "In the radio room scene, there are two soldiers holding a boy in a short-sleeved shirt and black pants against the wall. A man wearing a blue shirt with a pink tie and a blue suit is speaking in the radio room. A pink handkerchief is visible in the man's suit pocket. Which subtitle lines appear together with this man?",
"question_wo_referring_query": "Which subtitle lines appear together with this man?",
"candidates": [
"amplifying local Dynamics um however as",
"and your analysis",
"between cartel and",
"where we have a set of policies across",
"control well I'm joined Now by Vonda"
],
"correct_choice": 4,
"position": [
200,
3343,
5921
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "7MemY9jOmuk_0",
"video_path": "7MemY9jOmuk.mp4",
"subtitle_path": "7MemY9jOmuk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 551.84,
"view_count": 161845
},
{
"video_id": "7MemY9jOmuk",
"question": "On the left, there is a topless man covering his head, while behind him stands a uniformed armed personnel wearing a mask. On the right, there is a woman sitting on a black sofa, dressed in black and white striped clothing. In which subtitles does the topless man appear?",
"question_wo_referring_query": "In which subtitles does the topless man appear?",
"candidates": [
"in",
"the two Mexican uh criminal groups at",
"moment",
"between cartel and",
"where we have a set of policies across"
],
"correct_choice": 1,
"position": [
5453,
5697,
6720
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "7MemY9jOmuk_1",
"video_path": "7MemY9jOmuk.mp4",
"subtitle_path": "7MemY9jOmuk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 551.84,
"view_count": 161845
},
{
"video_id": "N4VtpYgZLVg",
"question": "On the white wall, there are many square picture frames hanging. There is an air conditioner on the left side of the wall, and below the air conditioner, there is a rolled-up curtain. The ceiling is a light color. A man with curly hair wearing an orange coat is speaking. What objects are present in the room?",
"question_wo_referring_query": "What objects are present in the room?",
"candidates": [
"A silver necklace",
"A tall potted plant",
"A small dog painting",
"A hat",
"A Mona Lisa painting"
],
"correct_choice": 4,
"position": [
10416
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "N4VtpYgZLVg_0",
"video_path": "N4VtpYgZLVg.mp4",
"subtitle_path": "N4VtpYgZLVg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 569.57,
"view_count": 50019
},
{
"video_id": "N4VtpYgZLVg",
"question": "At the restaurant beside the dim street, a man in a gray short-sleeved shirt is sitting by the tree for a meal. On his left, there is a man wearing glasses and a white short-sleeved shirt along with a woodpile. On his right, there is someone in a yellow outfit and his friend sitting in a dark corner. What items are present in the scene?",
"question_wo_referring_query": "What items are present in the scene?",
"candidates": [
"a cat",
"a watch",
"a bicycle",
"a hat",
"a dog"
],
"correct_choice": 1,
"position": [
9190
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "N4VtpYgZLVg_1",
"video_path": "N4VtpYgZLVg.mp4",
"subtitle_path": "N4VtpYgZLVg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 569.57,
"view_count": 50019
},
{
"video_id": "6d-EVupvWzU",
"question": "A man wearing a grey hat and black clothes is sitting on an off-white chair. The chair to his right is empty, and beside the empty chair, there is a potted plant. The wall behind him is blue. Among the photos that the man is showing, which photo appears first?",
"question_wo_referring_query": "Among the photos that the man is showing, which photo appears first?",
"candidates": [
"A solo photo",
"A group photo of four people",
"A group photo of five people",
"A group photo of two people",
"A group photo of three people"
],
"correct_choice": 2,
"position": [
423,
424,
470
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "6d-EVupvWzU_0",
"video_path": "6d-EVupvWzU.mp4",
"subtitle_path": "6d-EVupvWzU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 427.8,
"view_count": 153970
},
{
"video_id": "6d-EVupvWzU",
"question": "A man wearing a gray hat and black clothes is sitting on a cream-colored chair. The chair on his right is empty, and there is a potted plant next to the empty chair. The wall behind him is blue. When the man introduces the situation of his filming, which place is mentioned first?",
"question_wo_referring_query": "Which place is mentioned first?",
"candidates": [
"Haunted House",
"Singapore Flyer",
"Sky Tower",
"Rural Village",
"The Ritz-Carlton, Millenia Singapore"
],
"correct_choice": 4,
"position": [
892,
893,
921,
951
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "6d-EVupvWzU_1",
"video_path": "6d-EVupvWzU.mp4",
"subtitle_path": "6d-EVupvWzU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 427.8,
"view_count": 153970
},
{
"video_id": "4QSmRYQBfN4",
"question": "In the middle of the screen, a flag with red and black appears, with a blue circle and yellow design in the center of the flag. After the subtitle 'The emblem is a machete with half a cogwheel positioned in a av that somewhat 1mitates' appears, what shows up in the bottom right corner of the screen?",
"question_wo_referring_query": "What shows up in the bottom right corner of the screen?",
"candidates": [
"Only a white arrowhead",
"Only a flag with a machete, a hammer, and a five-pointed star",
"A white arrowhead and a red flag with two yellow designs",
"A white arrowhead and a flag with only a five-pointed star"
],
"correct_choice": 2,
"position": [
32,
76
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "4QSmRYQBfN4_0",
"video_path": "4QSmRYQBfN4.mp4",
"subtitle_path": "4QSmRYQBfN4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 45,
"duration": 16.98,
"view_count": 1246875
},
{
"video_id": "RQOdl64DtdI",
"question": "In the video, a woman with red hair is wearing a white top, holding a mirror in one hand and combing her long hair with the other. After mentioning the explanation 'rules rounding are back to a time when,' which character appears afterward?",
"question_wo_referring_query": "Which character appears afterward?",
"candidates": [
"Naked man",
"Woman lying in the water",
"Mona Lisa",
"Angel"
],
"correct_choice": 1,
"position": [
280,
827
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "RQOdl64DtdI_0",
"video_path": "RQOdl64DtdI.mp4",
"subtitle_path": "RQOdl64DtdI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 859,
"duration": 52.0,
"view_count": 184867
},
{
"video_id": "vix2_R79-l4",
"question": "There is a long-haired woman wearing black clothes, holding a mobile phone in her right hand. The time on the phone screen shows 11:43. After the subtitle \"don't have time so now\" appears, what changes does the woman's phone screen undergo?",
"question_wo_referring_query": "What changes occur on the woman's phone screen?",
"candidates": [
"The time on the phone screen is 12:00",
"There are two long-haired women on the phone screen",
"The phone screen is black",
"The phone screen shows 'in call'"
],
"correct_choice": 1,
"position": [
388,
518
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "vix2_R79-l4_0",
"video_path": "vix2_R79-l4.mp4",
"subtitle_path": "vix2_R79-l4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 75,
"duration": 22.0,
"view_count": 42675
},
{
"video_id": "gyV6EqgiPNg",
"question": "In the screen, a white plate is holding several buns, with a hand picking up one of the buns. When the subtitle 'the biggest change in the recipe was adding chocolate chips i did it for sake of my children' appears, what pattern is on the cloth under the plate?",
"question_wo_referring_query": "In the screen, a white plate is holding several buns, with a hand picking up one of the buns. When the subtitle 'the biggest change in the recipe was adding chocolate chips i did it for sake of my children' appears, what pattern is on the cloth under the plate?",
"candidates": [
"white striped",
"black and white striped",
"black and white checkered",
"white checkered"
],
"correct_choice": 3,
"position": [
1099
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "gyV6EqgiPNg_0",
"video_path": "gyV6EqgiPNg.mp4",
"subtitle_path": "gyV6EqgiPNg_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 238,
"duration": 46.0,
"view_count": 6453
},
{
"video_id": "M-YfPangEfA",
"question": "On the yellow wooden dining table in the video, some food and tableware are placed. Which of the following items do not exist?",
"question_wo_referring_query": "Which of the following items do not exist?",
"candidates": [
"Noodles",
"A glass of water",
"Silver fork",
"Black fork",
"Bun"
],
"correct_choice": 3,
"position": [
355
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "M-YfPangEfA_0",
"video_path": "M-YfPangEfA.mp4",
"subtitle_path": "M-YfPangEfA_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1088,
"duration": 24.03,
"view_count": 428396
},
{
"video_id": "Kz8_rwVn094",
"question": "In a hospital with blue walls, there are two doctors in white coats and several black-clothed patients resting in their beds. There are also three IV drips in the room. What did the gray-haired doctor without a mask do when 'fast the lack of money throughout the' was mentioned?",
"question_wo_referring_query": "What did the gray-haired doctor without a mask do?",
"candidates": [
"Gave an IV to a patient in bed",
"Listened to the patient lying in bed",
"Inserted his hand into the pocket of his coat while leaning against the wall",
"Talked with a patient in bed",
"Shook hands with another masked doctor"
],
"correct_choice": 2,
"position": [
38,
70
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "Kz8_rwVn094_0",
"video_path": "Kz8_rwVn094.mp4",
"subtitle_path": "Kz8_rwVn094_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 265,
"duration": 17.0,
"view_count": 3215
},
{
"video_id": "TxbSa3ux8SU",
"question": "After the subtitle '[Music]' appears, what appears above the rock where a long green reed stands in the seawater?",
"question_wo_referring_query": "What appeared?",
"candidates": [
"Seawater slowly receding",
"A flock of birds circling above the rock",
"Seawater slowly rising",
"Sea waves violently hitting the rock wall",
"A storm of wind and rain whipped up near the rock"
],
"correct_choice": 1,
"position": [
30
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3E",
"level": "L2-Relation",
"id": "TxbSa3ux8SU_0",
"video_path": "TxbSa3ux8SU.mp4",
"subtitle_path": "TxbSa3ux8SU_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 384,
"duration": 18.02,
"view_count": 741324
},
{
"video_id": "iHNjWhx3EaI",
"question": "When the curly-haired man wearing a gray hoodie picks up a box of pink-packaged tea leaves and the subtitle \u2018editing my video drinking some yamamoto\u2019 appears, which item is not present in the room behind him?",
"question_wo_referring_query": "Which item is not present in the room behind him?",
"candidates": [
"Black Overcoat",
"Coat Rack",
"Television",
"Tableware",
"Refrigerator"
],
"correct_choice": 3,
"position": [
129
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "iHNjWhx3EaI_0",
"video_path": "iHNjWhx3EaI.mp4",
"subtitle_path": "iHNjWhx3EaI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 284,
"duration": 34.0,
"view_count": 14644
},
{
"video_id": "UbcWAfHo5j0",
"question": "When a group of girls wearing blue tops walk down the stairs from the second floor of the building with green plants on the right, what kind of skirts are they wearing?",
"question_wo_referring_query": "What kind of skirts are they wearing?",
"candidates": [
"Super short skirt",
"Pleated short skirt",
"Dress",
"Knee-length skirt",
"Ankle-length skirt"
],
"correct_choice": 3,
"position": [
490
],
"topic_category": "NP-News-Programs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "UbcWAfHo5j0_0",
"video_path": "UbcWAfHo5j0.mp4",
"subtitle_path": "UbcWAfHo5j0_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 90,
"duration": 47.0,
"view_count": 575
},
{
"video_id": "uDZrl35yeZE",
"question": "In front of a building crowded with people, there is a person wearing a grey T-shirt, a hat, and holding a camera, standing next to a woman with blonde hair tied in braids, wearing a black and white floral suspender dress, and sunglasses. Where has he appeared before?",
"question_wo_referring_query": "Where has he appeared before?",
"candidates": [
"on a boat at sea",
"in a stylish caf\u00e9",
"in a dark underground mine",
"on a sunny beach",
"by a rocky seaside"
],
"correct_choice": 4,
"position": [
51,
244
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "uDZrl35yeZE_0",
"video_path": "uDZrl35yeZE.mp4",
"subtitle_path": "uDZrl35yeZE_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 82,
"duration": 18.02,
"view_count": 60384
},
{
"video_id": "US5Oz0q0BVE",
"question": "In a room with two rows of bookshelves, there is a desk in front of the bookshelves. On the desk are two paintings, one upright and one lying flat. Sitting next to the desk is a woman dressed in black. What style of glasses is this woman wearing?",
"question_wo_referring_query": "What style of glasses is this woman wearing?",
"candidates": [
"Not wearing glasses",
"Black-frame glasses",
"Red-frame glasses",
"Monocle",
"Gold-frame glasses"
],
"correct_choice": 1,
"position": [
149
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "US5Oz0q0BVE_0",
"video_path": "US5Oz0q0BVE.mp4",
"subtitle_path": "US5Oz0q0BVE_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 93,
"duration": 20.02,
"view_count": 20601
},
{
"video_id": "lcEaHk8f4Co",
"question": "Someone is holding a floral tray and pushing a tray filled with yellow block-like food material into a metal rack. What is the food material that's being pushed forward in the tray?",
"question_wo_referring_query": "Someone is holding a floral tray and pushing a tray filled with yellow block-like food material into a metal rack. What is the food material that's being pushed forward in the tray?",
"candidates": [
"Bread",
"Tofu",
"Banana",
"Papaya",
"Butter"
],
"correct_choice": 1,
"position": [
347
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "lcEaHk8f4Co_0",
"video_path": "lcEaHk8f4Co.mp4",
"subtitle_path": "lcEaHk8f4Co_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 576,
"duration": 44.0,
"view_count": 745205
},
{
"video_id": "i6C6r2g4Y7Q",
"question": "In a space with a wall marked with an 's', there are three elevator doors. Three men and three women are standing in front of the elevator doors, tossing two basketballs to each other. What action does the person wearing a black leather jacket first take when they appear?",
"question_wo_referring_query": "What action does the person wearing a black leather jacket first take when they appear?",
"candidates": [
"He steps back into the crowd and then turns around with his back facing the elevator doors",
"He walks forward into the crowd and then turns around with his back facing the elevator doors",
"He walks backward straight through the crowd",
"He steps back into the crowd and then turns around facing the elevator doors",
"He walks forward into the crowd and then turns around facing the elevator doors"
],
"correct_choice": 0,
"position": [
364
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "i6C6r2g4Y7Q_0",
"video_path": "i6C6r2g4Y7Q.mp4",
"subtitle_path": "i6C6r2g4Y7Q_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 110,
"duration": 22.02,
"view_count": 947
},
{
"video_id": "W94Rth-aIkc",
"question": "White clouds are floating in the sky, with a mountain peak towering underneath them. The jade blue sea quietly presses against the mountain peak. What is the state of the mountain peak?",
"question_wo_referring_query": "What is the state of the mountain peak?",
"candidates": [
"Jade-green mountain peak",
"Fiery-red mountain peak",
"Bare mountain peak",
"Glowing mountain peak",
"Yellow mountain peak"
],
"correct_choice": 0,
"position": [
28
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "W94Rth-aIkc_0",
"video_path": "W94Rth-aIkc.mp4",
"subtitle_path": "W94Rth-aIkc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1118,
"duration": 23.99,
"view_count": 1771056
},
{
"video_id": "k194yU67Zik",
"question": "In a kitchen with a background showing an egg on the left and a roasted chicken hanging on the right wall, which words appear together with the man wearing a black hat and a black T-shirt?",
"question_wo_referring_query": "Which words appear together?",
"candidates": [
"no matter how old",
"good it all comes down",
"to judge a book or a burger",
"you bye",
"there's nothing really bad about it but"
],
"correct_choice": 4,
"position": [
257,
227
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TOS",
"level": "L2-Relation",
"id": "k194yU67Zik_0",
"video_path": "k194yU67Zik.mp4",
"subtitle_path": "k194yU67Zik_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 271,
"duration": 33.99,
"view_count": 1008170
},
{
"video_id": "v-7zF3Y0yJs",
"question": "In front of a screen with a blue background and a turret-shaped building on the far right, what color suit is the man sitting in front of the phone booth wearing?",
"question_wo_referring_query": "What color suit is the man sitting in front of the phone booth wearing?",
"candidates": [
"blue",
"pink",
"white",
"black",
"yellow"
],
"correct_choice": 3,
"position": [
413
],
"topic_category": "NP-News-Programs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "v-7zF3Y0yJs_0",
"video_path": "v-7zF3Y0yJs.mp4",
"subtitle_path": "v-7zF3Y0yJs_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 291,
"duration": 32.0,
"view_count": 6354
},
{
"video_id": "PCz04UJFaUY",
"question": "What happened after the person wearing a white helmet and orange goggles raised the hand with the yellow sleeve?",
"question_wo_referring_query": "What happened?",
"candidates": [
"A group of people were dancing",
"A group of people were riding scooters",
"A group of people were mountain climbing",
"A group of people were skiing",
"A bird flew onto the hand of the person in the yellow long sleeve"
],
"correct_choice": 4,
"position": [
2,
107
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "PCz04UJFaUY_0",
"video_path": "PCz04UJFaUY.mp4",
"subtitle_path": "PCz04UJFaUY_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 262,
"duration": 16.0,
"view_count": 7168
},
{
"video_id": "I2Fyzav8MxE",
"question": "When the screen zooms in to show a hand holding a tool adding water from a red bowl into a white tray with blue paint, what appears first?",
"question_wo_referring_query": "What appears first?",
"candidates": [
"A man wearing a black short-sleeve shirt and black pants",
"A piece of fabric with yellow paint",
"A man wearing a black short-sleeve shirt and white pants",
"A piece of fabric with red paint",
"A man wearing a white short-sleeve shirt and black pants"
],
"correct_choice": 0,
"position": [
125,
228,
463
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O3O",
"level": "L2-Relation",
"id": "I2Fyzav8MxE_0",
"video_path": "I2Fyzav8MxE.mp4",
"subtitle_path": "I2Fyzav8MxE_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 855,
"duration": 21.98,
"view_count": 383259
},
{
"video_id": "SFfWUrsg4gE",
"question": "In a well-lit room, what did a man dressed in a white lab coat and green plaid outerwear do after saying 'nothing too fancy here'?",
"question_wo_referring_query": "What action did he take?",
"candidates": [
"Turned the camera towards a bedroom with a guitar",
"Used his hand to open the door of a white wardrobe",
"Made a gesture of peace towards the camera",
"Zoomed in with the camera",
"Pointed his hand towards the balcony outside"
],
"correct_choice": 1,
"position": [
1022,
1094
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3E",
"level": "L2-Relation",
"id": "SFfWUrsg4gE_0",
"video_path": "SFfWUrsg4gE.mp4",
"subtitle_path": "SFfWUrsg4gE_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 782,
"duration": 54.97,
"view_count": 115726
},
{
"video_id": "j7kxn5CsHnw",
"question": "A silver plate was placed on the wooden table, which contained some nuts and fragments of pastries like buns. A hand was holding a red wooden spoon picking up some pastry fragments. What subtitles appeared along with this red wooden spoon?",
"question_wo_referring_query": "?",
"candidates": [
"Meat buns were once popular in the USA and the UK",
"The situation in the UK is the same",
"They ultimately lost green onions",
"Sharing some information about meat buns",
"Music"
],
"correct_choice": 4,
"position": [
16,
150
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TOS",
"level": "L2-Relation",
"id": "j7kxn5CsHnw_0",
"video_path": "j7kxn5CsHnw.mp4",
"subtitle_path": "j7kxn5CsHnw_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 429,
"duration": 35.0,
"view_count": 104267
},
{
"video_id": "Yh48p5efBTk",
"question": "In a well-lit room, a silver-haired man wearing a black jacket and blue jeans pulls a poster from a desk, which also has a green paper box, a yellow book, white documents, and a silver box on it. The shelves hold a large number of paper items. What kind of top is this silver-haired man wearing?",
"question_wo_referring_query": "What kind of top is the silver-haired man wearing?",
"candidates": [
"Gray robe",
"Black short sleeve",
"Black jacket",
"Black long sleeve",
"Gray sweater"
],
"correct_choice": 2,
"position": [
106
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Yh48p5efBTk_0",
"video_path": "Yh48p5efBTk.mp4",
"subtitle_path": "Yh48p5efBTk_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 339,
"duration": 19.02,
"view_count": 18683
},
{
"video_id": "iJk-HMfO4yQ",
"question": "There is a white car parked on the road, with trees and a red brick building behind it. Next to the trees, there is a utility pole. An officer in a black and white uniform, wearing glasses and an ID badge, is giving an interview. In front of the officer, there are two communication devices, one white labeled 'LBC' and one black. Who is the first person to pass by behind the officer?",
"question_wo_referring_query": "Who is the first person to pass by behind the officer?",
"candidates": [
"A woman wearing a suit",
"A woman wearing black clothes with a hat",
"A man wearing black clothes with a hat",
"A woman wearing a hat and sunglasses",
"An elderly person wearing a hat and sunglasses"
],
"correct_choice": 2,
"position": [
526,
723
],
"topic_category": "NP-News-Programs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "iJk-HMfO4yQ_0",
"video_path": "iJk-HMfO4yQ.mp4",
"subtitle_path": "iJk-HMfO4yQ_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 68,
"duration": 51.0,
"view_count": 386941
},
{
"video_id": "QeXuzEwC9z4",
"question": "Below the ceiling in the room is a fluffy white bed, with a pink light behind it. A girl wearing earrings and a white short-sleeved shirt with English letters printed on it is holding a mustard-colored clothes with a cartoon smiley face on it. After the subtitle 'with this one that I ended up getting it' appears, what item appeared?",
"question_wo_referring_query": "What item appeared?",
"candidates": [
"A mobile phone",
"A brooch",
"A small camera",
"A small cat",
"A plush toy"
],
"correct_choice": 2,
"position": [
203,
243
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "QeXuzEwC9z4_0",
"video_path": "QeXuzEwC9z4.mp4",
"subtitle_path": "QeXuzEwC9z4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 108,
"duration": 42.0,
"view_count": 12398
},
{
"video_id": "hznvV2bBkX4",
"question": "In the black and white footage, there is a curly-haired woman wearing a black dress, holding a bag, standing on an empty street. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She is turning around and greeting someone",
"She is waiting for the bus by the roadside",
"She is walking on the street",
"She is looking through her bag"
],
"correct_choice": 3,
"position": [
137
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "hznvV2bBkX4_0",
"video_path": "hznvV2bBkX4.mp4",
"subtitle_path": "hznvV2bBkX4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 176,
"duration": 49.01,
"view_count": 16362
},
{
"video_id": "GZc0P3Apfx4",
"question": "In front of a group of large buildings, there's a man wearing a light red long robe and a light red headscarf. Behind him, there is a layered cylindrical building. What did he do at that moment?",
"question_wo_referring_query": "What did he do at that moment?",
"candidates": [
"Holding a microphone and speaking.",
"Hands in his pockets.",
"Holding a book in one hand and raising the other hand in the air.",
"Hands on his waist."
],
"correct_choice": 2,
"position": [
1166
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "GZc0P3Apfx4_0",
"video_path": "GZc0P3Apfx4.mp4",
"subtitle_path": "GZc0P3Apfx4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 13,
"duration": 49.02,
"view_count": 235973
},
{
"video_id": "HWyDOQrYtCk",
"question": "On the yellow table, there is a white bowl with instant noodles inside and a pair of chopsticks placed in the bowl. There is also a dark green bowl with vegetables and a blue plate with a fried egg. What items appeared after the word 'Music' was mentioned?",
"question_wo_referring_query": "There is a dark green bowl with vegetables and a blue plate with a fried egg. What items appeared after the word 'Music' was mentioned?",
"candidates": [
"Fried egg",
"Instant noodles",
"Tablet, Milk Tea",
"Vegetables"
],
"correct_choice": 2,
"position": [
145,
409
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "HWyDOQrYtCk_0",
"video_path": "HWyDOQrYtCk.mp4",
"subtitle_path": "HWyDOQrYtCk_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 803,
"duration": 39.0,
"view_count": 375800
},
{
"video_id": "lNReCCShKJQ",
"question": "Next to a large building, a tank is driving on a desolate street with signs of explosion in the background. In which of the following scenes has this tank appeared?",
"question_wo_referring_query": ", in which of the following scenes has this tank appeared?",
"candidates": [
"On a street with many pedestrians",
"On a square during a military parade",
"On a square during a rainy day",
"On a desolate grassland with dried grass"
],
"correct_choice": 3,
"position": [
359,
523
],
"topic_category": "KH-Knowledge-History",
"question_category": "SOS",
"level": "L2-Relation",
"id": "lNReCCShKJQ_0",
"video_path": "lNReCCShKJQ.mp4",
"subtitle_path": "lNReCCShKJQ_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 246,
"duration": 22.0,
"view_count": 4241622
},
{
"video_id": "DxxVla1CRvU",
"question": "In the middle of 2 parallel poles, there's a man wearing a police uniform. There are many pedestrians behind him. What is the man in the police uniform doing?",
"question_wo_referring_query": "What is the man in the police uniform doing?",
"candidates": [
"He is raising his hands above his head",
"He is bending down to pick something up",
"He is lying on the ground",
"He is putting his hands in front of his chest"
],
"correct_choice": 1,
"position": [
304
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "DxxVla1CRvU_0",
"video_path": "DxxVla1CRvU.mp4",
"subtitle_path": "DxxVla1CRvU_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 330,
"duration": 14.98,
"view_count": 970897
},
{
"video_id": "SAkgZJllNzw",
"question": "There is a yellow title of State Parks at the top of the screen. In a picture on the left, there is a yellow tree and a river. Below, there is a plane, and next to the plane, there is the number 1. On the right, there is a green mountain scene, and below it, there is a plane with the number 5 next to it. When mentioning 'lot of great state parks I've hiked and,' which object does not appear on the screen?",
"question_wo_referring_query": "Which object does not appear on the screen?",
"candidates": [
"A plane made of white and black",
"A plane made of white and green",
"A plane made of white, blue, and red",
"A green valley"
],
"correct_choice": 1,
"position": [
401
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "SAkgZJllNzw_0",
"video_path": "SAkgZJllNzw.mp4",
"subtitle_path": "SAkgZJllNzw_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 165,
"duration": 46.98,
"view_count": 20205
},
{
"video_id": "ixAU3l0sX_o",
"question": "On screen, a short-haired woman puts her entire right hand into her mouth. There are many people behind her, including a man wearing a striped shirt sitting in the background. What color clothing is this woman wearing?",
"question_wo_referring_query": "What color clothing is this woman wearing?",
"candidates": [
"Blue",
"Red",
"White",
"Black"
],
"correct_choice": 3,
"position": [
658
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "ixAU3l0sX_o_0",
"video_path": "ixAU3l0sX_o.mp4",
"subtitle_path": "ixAU3l0sX_o_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 145,
"duration": 41.0,
"view_count": 104879
},
{
"video_id": "Owmy5Vu3BnA",
"question": "On the left side of the screen, there is a stock chart displayed, with a line of subtitles at the bottom. On the far right, the subtitle 'Lulu Chen BLOOMBERG NEWS' is shown. Who is speaking in the middle of the screen?",
"question_wo_referring_query": "Who is speaking in the middle of the screen?",
"candidates": [
"White-haired man",
"Woman with brown hair",
"Black-haired man",
"Woman with long black hair wearing a black coat"
],
"correct_choice": 3,
"position": [
515
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "Owmy5Vu3BnA_0",
"video_path": "Owmy5Vu3BnA.mp4",
"subtitle_path": "Owmy5Vu3BnA_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1131,
"duration": 22.99,
"view_count": 6464
},
{
"video_id": "XAAqKoQM8Qc",
"question": "In the screen, there is a black tank on the yellow ground. In the screen, amidst the black smoke, there is a yellow tank on the yellow ground. Which of these two scenes appears first?",
"question_wo_referring_query": "Which of these two scenes appears first?",
"candidates": [
"They appear at the same time.",
"The first to appear are two black tanks. The last to appear are two yellow tanks.",
"The first to appear is the black tank on the yellow ground. The last to appear is the yellow tank amidst the black smoke on the yellow ground.",
"The first to appear is the yellow tank amidst the black smoke on the yellow ground. The last to appear is the black tank on the yellow ground."
],
"correct_choice": 2,
"position": [
26,
321
],
"topic_category": "KH-Knowledge-History",
"question_category": "O3O",
"level": "L2-Relation",
"id": "XAAqKoQM8Qc_0",
"video_path": "XAAqKoQM8Qc.mp4",
"subtitle_path": "XAAqKoQM8Qc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 143,
"duration": 25.0,
"view_count": 670402
},
{
"video_id": "GwcXYTX2GmA",
"question": "With a green background, there are two overlapping photos of Black children. Additionally, there are bold white text elements on the top and right sides of the screen. Which of the following subtitles was displayed at the same time?",
"question_wo_referring_query": "Which of the following subtitles was displayed at the same time?",
"candidates": [
"population belonging to the next biggest",
"group the beija people in the east",
"have ancestral roots actually to arabs",
"that leaves about four percent of the "
],
"correct_choice": 2,
"position": [
183,
238
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "GwcXYTX2GmA_0",
"video_path": "GwcXYTX2GmA.mp4",
"subtitle_path": "GwcXYTX2GmA_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 939,
"duration": 30.99,
"view_count": 713223
},
{
"video_id": "LQ-mwO30_68",
"question": "With a dark blue background, a green topographic map is displayed on the left side and a 'Clipperton Island' label appears on the right side of the screen. After mentioning 'kilometers southwest of mace', what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"A picture with sea and trees gradually appears.",
"A picture with sea and trees emerges from below in the form of a window frame.",
"A picture with sea and trees pops up on screen.",
"A picture with sea and trees emerges from above in the form of a window frame."
],
"correct_choice": 0,
"position": [
490,
536
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "LQ-mwO30_68_0",
"video_path": "LQ-mwO30_68.mp4",
"subtitle_path": "LQ-mwO30_68_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 524,
"duration": 26.0,
"view_count": 1516974
},
{
"video_id": "StLBHguzO8s",
"question": "In a sepia-toned room, there is a black woman with black curly hair looking to the left of the screen. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She is talking to a mirror",
"She is introducing a tool in her hand",
"She is fixing her hair",
"She is displaying new clothes"
],
"correct_choice": 0,
"position": [
869
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "StLBHguzO8s_0",
"video_path": "StLBHguzO8s.mp4",
"subtitle_path": "StLBHguzO8s_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 235,
"duration": 47.01,
"view_count": 3664
},
{
"video_id": "-iCLYpeghJs",
"question": "In front of a white tiled wallpaper with icy and sweet candy circles, there are two men standing side by side, one wearing short sleeves and the other holding a remote control wearing long sleeves. When the phrase 'tlicking topping motion is that how you' is mentioned, how are their clothes described?",
"question_wo_referring_query": "How are their clothes described?",
"candidates": [
"The man on the left is wearing light red short sleeves, and the man on the right is wearing a white long sleeve.",
"The man on the right is wearing light yellow short sleeves, and the man on the left is wearing a black long sleeve.",
"The man on the right is wearing light red short sleeves, and the man on the left is wearing a black long sleeve.",
"The man on the left is wearing light red short sleeves, and the man on the right is wearing a black long sleeve."
],
"correct_choice": 3,
"position": [
231
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2A",
"level": "L1-Perception",
"id": "-iCLYpeghJs_0",
"video_path": "-iCLYpeghJs.mp4",
"subtitle_path": "-iCLYpeghJs_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 46,
"duration": 42.0,
"view_count": 635379
},
{
"video_id": "oKX6y9-zTfI",
"question": "In the middle of the video, there is a man with black hair wearing a white and green robe standing. Beside him, there is a man wearing a purple and gray outfit with a gray hood. After the subtitle appears 'arms which was largely seen as a sign of royalty fearing a monarchy many', what happens next in the video?",
"question_wo_referring_query": "What changes happen in the video?",
"candidates": [
"Four women wearing white and orange robes appear",
"Two men wearing white and orange robes appear",
"Three men wearing white and orange robes appear",
"Four young boys wearing white and orange robes appear"
],
"correct_choice": 3,
"position": [
295,
331
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3E",
"level": "L2-Relation",
"id": "oKX6y9-zTfI_0",
"video_path": "oKX6y9-zTfI.mp4",
"subtitle_path": "oKX6y9-zTfI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 139,
"duration": 20.0,
"view_count": 12619
},
{
"video_id": "ie9NxdgNe0g",
"question": "In front of a white curtain with some plastic vine embellishments, there is a woman wearing a pearl headband, draped long hair, and dressed in a suit. In which of the following scenarios has she appeared?",
"question_wo_referring_query": "In which of the following scenarios has she appeared?",
"candidates": [
"At the ticket checkpoint of a high-speed railway station",
"On a busy street",
"In the driver's seat of a car",
"In front of a bus stop on a rainy day"
],
"correct_choice": 2,
"position": [
278,
798
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "ie9NxdgNe0g_0",
"video_path": "ie9NxdgNe0g.mp4",
"subtitle_path": "ie9NxdgNe0g_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 412,
"duration": 35.0,
"view_count": 28258
},
{
"video_id": "luRqMb5qfhM",
"question": "There are many cucumber slices on the cutting board in the kitchen. A woman wearing a black dress and a gold bracelet is holding a knife. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She is peeling the cucumber",
"She is slicing the cucumber",
"She is cleaning the cutting board",
"She is putting the cucumber into a bowl"
],
"correct_choice": 1,
"position": [
773
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "luRqMb5qfhM_0",
"video_path": "luRqMb5qfhM.mp4",
"subtitle_path": "luRqMb5qfhM_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 235,
"duration": 36.0,
"view_count": 287875
},
{
"video_id": "8ew0d0JmsfA",
"question": "In the desolate military base, parked next to two tents, there are four fully armed individuals holding water guns, and in front there are two soldiers dressed in olive military uniforms and carrying guns. Which of the following objects has not appeared?",
"question_wo_referring_query": "Which of the following objects has not appeared?",
"candidates": [
"A light olive helmet",
"A red rifle",
"A yellow oil drum",
"An army green tank"
],
"correct_choice": 1,
"position": [
817
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "8ew0d0JmsfA_0",
"video_path": "8ew0d0JmsfA.mp4",
"subtitle_path": "8ew0d0JmsfA_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 396,
"duration": 39.0,
"view_count": 1650070
},
{
"video_id": "aCPNlZ7bvRc",
"question": "There is a man in the screen wearing a green military uniform, raising his hands. There are guns on both sides of him. Who is being held at gunpoint in the video?",
"question_wo_referring_query": ", who is being held at gunpoint in the video?",
"candidates": [
"Commander Jeremiah Denton",
"Tom",
"Nancy",
"Lhcy"
],
"correct_choice": 0,
"position": [
151
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "aCPNlZ7bvRc_0",
"video_path": "aCPNlZ7bvRc.mp4",
"subtitle_path": "aCPNlZ7bvRc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 40,
"duration": 38.0,
"view_count": 9068204
},
{
"video_id": "mYotOV3Q51g",
"question": "In the screen, a man wearing a mask and short sleeves is serving coffee to two seated people. The woman on the left is wearing a denim jacket, and the man on the right is wearing a red short-sleeved shirt and a hat. What did the woman on the left do the first time she appeared?",
"question_wo_referring_query": "What did the woman on the left do the first time she appeared?",
"candidates": [
"Shaking hands",
"Hugging",
"Raising her phone",
"Lying down"
],
"correct_choice": 2,
"position": [
506
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "mYotOV3Q51g_0",
"video_path": "mYotOV3Q51g.mp4",
"subtitle_path": "mYotOV3Q51g_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 867,
"duration": 45.0,
"view_count": 70043
},
{
"video_id": "tYqDvtknII4",
"question": "A bar chart appears on the screen, one bar is purple, and the other is yellow. The two bars are compared, with the years indicated below. What shows up after the bar chart appears?",
"question_wo_referring_query": "What shows up after the bar chart appears?",
"candidates": [
"A city map",
"A glass of water",
"A smartphone",
"Fried chicken"
],
"correct_choice": 0,
"position": [
936,
960
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E3E",
"level": "L2-Relation",
"id": "tYqDvtknII4_0",
"video_path": "tYqDvtknII4.mp4",
"subtitle_path": "tYqDvtknII4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 234,
"duration": 47.98,
"view_count": 87348
},
{
"video_id": "e-EtQHH4aWE",
"question": "There are five women talking in a room on screen. They are sitting on a couch. The second woman from the left is wearing red high heels. What does the second woman from the left do with her hands after the subtitle mentions 'the proposal we're doing for google in'?",
"question_wo_referring_query": "What does the second woman from the left do with her hands?",
"candidates": [
"Clasp hands together",
"No change",
"Make a fist",
"Shake hands"
],
"correct_choice": 0,
"position": [
397,
420
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3E",
"level": "L2-Relation",
"id": "e-EtQHH4aWE_0",
"video_path": "e-EtQHH4aWE.mp4",
"subtitle_path": "e-EtQHH4aWE_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1311,
"duration": 23.0,
"view_count": 185
},
{
"video_id": "Jim_NG0PYpc",
"question": "In the video, there is a black person wearing a blue short-sleeve shirt with a beard speaking against a black background. There is a picture of barbeque meat in the upper left corner. Where else has the black person who appears at the beginning of the video appeared?",
"question_wo_referring_query": "Where else has the black person who appears at the beginning of the video appeared?",
"candidates": [
"The upper left corner is an image of a black person with a black background barbequing meat",
"Green background",
"Yellow background",
"Blue background"
],
"correct_choice": 0,
"position": [
788,
183
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "Jim_NG0PYpc_0",
"video_path": "Jim_NG0PYpc.mp4",
"subtitle_path": "Jim_NG0PYpc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 51,
"duration": 44.0,
"view_count": 929213
},
{
"video_id": "3t_Knk7FWT8",
"question": "In the scene, two people are having a conversation. The woman on the left is sitting in a broadcasting room, while the man on the right is in a room having a discussion. When the man on the right says 'end not in a good place I mean I wish I', what accessory is the woman on the left wearing?",
"question_wo_referring_query": "What accessory is the woman on the left wearing?",
"candidates": [
"bracelet",
"ring",
"earrings",
"glasses"
],
"correct_choice": 3,
"position": [
176
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "3t_Knk7FWT8_0",
"video_path": "3t_Knk7FWT8.mp4",
"subtitle_path": "3t_Knk7FWT8_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 468,
"duration": 15.0,
"view_count": 171529
},
{
"video_id": "DX-mYMYf8jI",
"question": "In the scene, in the kitchen, there are many colorful plates, cups, and bowls on the shelves. On the white marble table, there are many food ingredients. A blonde woman wearing a light green short-sleeved top and a dark green apron is standing in front of the table. She spreads out her fingers and places them on a white plate. What is the woman doing the first time she appears?",
"question_wo_referring_query": "What is the woman doing the first time she appears?",
"candidates": [
"She is picking up other food ingredients",
"She is putting food into the oven",
"She is taking items from the refrigerator",
"She is introducing the food ingredients on the table"
],
"correct_choice": 3,
"position": [
29
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "DX-mYMYf8jI_0",
"video_path": "DX-mYMYf8jI.mp4",
"subtitle_path": "DX-mYMYf8jI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 351,
"duration": 57.02,
"view_count": 374074
},
{
"video_id": "nvG2YMWiwqU",
"question": "In the scene, inside a kitchen, a woman wearing a white short sleeve shirt and a red apron, with a smile on her face, is holding food with her right hand. On the table in front of her stands a cake cartoon character. What is the woman doing when the subtitles 'here you go copy here's your new friend' appear?",
"question_wo_referring_query": "What is the woman doing?",
"candidates": [
"She is planning to put the food in her hand into the refrigerator.",
"She is handing the food in her hand to a cake cartoon character.",
"She is tasting the food in her hand.",
"She is planning to throw away the food in her hand."
],
"correct_choice": 1,
"position": [
1038
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "nvG2YMWiwqU_0",
"video_path": "nvG2YMWiwqU.mp4",
"subtitle_path": "nvG2YMWiwqU_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 321,
"duration": 58.98,
"view_count": 314850
},
{
"video_id": "TXiX3NO5f5w",
"question": "In the video, there is a green plant on the table, and someone is placing a pine cone on this green plant. Where else has this green plant appeared?",
"question_wo_referring_query": "Where else has this green plant appeared?",
"candidates": [
"On the grass",
"On a yellow wooden board, which also has an olive-colored piece of clothing on it.",
"Inside a flower pot",
"On TV"
],
"correct_choice": 1,
"position": [
573,
860
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "TXiX3NO5f5w_0",
"video_path": "TXiX3NO5f5w.mp4",
"subtitle_path": "TXiX3NO5f5w_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 214,
"duration": 36.0,
"view_count": 316843
},
{
"video_id": "oO8OJZFZT9s",
"question": "In the green mountains, a blond man wearing a blue short-sleeve shirt, carrying a green backpack, and a red jacket around his waist, appeared with which captions?",
"question_wo_referring_query": "appeared with which captions?",
"candidates": [
"into the beach and you're gonna start and marsh and then you're gonna kind of hit",
"gonna blow your frickin mind like I",
"that on your list and make sure you",
"and we came right when the Sun was"
],
"correct_choice": 0,
"position": [
388,
486
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TOS",
"level": "L2-Relation",
"id": "oO8OJZFZT9s_0",
"video_path": "oO8OJZFZT9s.mp4",
"subtitle_path": "oO8OJZFZT9s_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 57,
"duration": 33.99,
"view_count": 90385
},
{
"video_id": "wGWbI3BHVM8",
"question": "In a grey background with a purple map pattern on the screen, a little figure wearing a blue hat is looking to the left of the screen, with both hands pointing to the left of the screen. After mentioning 'of arms and thunderous Hooves creating,' what change did he undergo?",
"question_wo_referring_query": "What change did he undergo?",
"candidates": [
"He looks toward the camera, with his left hand down and right hand up",
"He looks down and turns around",
"He turns around",
"He looks toward the camera, with his right hand down and left hand up"
],
"correct_choice": 0,
"position": [
45,
144
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "wGWbI3BHVM8_0",
"video_path": "wGWbI3BHVM8.mp4",
"subtitle_path": "wGWbI3BHVM8_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 21,
"duration": 37.0,
"view_count": 2191
},
{
"video_id": "_1kZe-2kiuQ",
"question": "In an office with a yellow desk, three ladies are sitting by the desk, and on the left side of the screen, there is a bookshelf with many books on it. Who was expressing their opinion at that time?",
"question_wo_referring_query": "Who was expressing their opinion at that time?",
"candidates": [
"The second person",
"The first person",
"The third person",
"Speaking simultaneously"
],
"correct_choice": 0,
"position": [
291
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "_1kZe-2kiuQ_0",
"video_path": "_1kZe-2kiuQ.mp4",
"subtitle_path": "_1kZe-2kiuQ_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 211,
"duration": 58.02,
"view_count": 10349
},
{
"video_id": "TynnVVRNZGs",
"question": "In the video, two women are sitting in the car. The woman sitting in the driver's seat, who is on the right side of the screen, what is she doing when she appears for the first time?",
"question_wo_referring_query": "In the video, two women are sitting in the car. The woman sitting in the driver's seat, who is on the right side of the screen, what is she doing when she appears for the first time?",
"candidates": [
"Holding four cups of coffee",
"Exchanging coffee",
"Putting down coffee",
"Drinking a cup of coffee"
],
"correct_choice": 3,
"position": [
45
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "TynnVVRNZGs_0",
"video_path": "TynnVVRNZGs.mp4",
"subtitle_path": "TynnVVRNZGs_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 433,
"duration": 37.97,
"view_count": 271524
},
{
"video_id": "fLn06p2HtAc",
"question": "In the video, there is a small person figure wearing a blue duckbill cap and dressed in blue clothes. Where has this blue-clothed character appeared before?",
"question_wo_referring_query": "Where has this blue-clothed character appeared before?",
"candidates": [
"A room with a map in front, including 3 guitars of different colors, a computer, an olive-colored sofa, and a bookshelf.",
"A room with a map in front, including 2 guitars of different colors and a computer.",
"A room with a map in front, including an olive-colored sofa, a computer, and a bookshelf.",
"A room with a map in front, including an olive-colored sofa and a bookshelf."
],
"correct_choice": 0,
"position": [
127,
1178
],
"topic_category": "KH-Knowledge-History",
"question_category": "SOS",
"level": "L2-Relation",
"id": "fLn06p2HtAc_0",
"video_path": "fLn06p2HtAc.mp4",
"subtitle_path": "fLn06p2HtAc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 313,
"duration": 58.0,
"view_count": 813
},
{
"video_id": "nBWB-h68E4k",
"question": "When the subtitle in the video mentions 'they assume that I make the work that I do,' what style of clothing is the person speaking wearing?",
"question_wo_referring_query": "What style of clothing is the person speaking in the video wearing?",
"candidates": [
"A black and white polka dot shirt",
"A white and gold shirt",
"A plain black shirt",
"A black and white striped shirt with a black and gold collar"
],
"correct_choice": 3,
"position": [
759
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2A",
"level": "L1-Perception",
"id": "nBWB-h68E4k_0",
"video_path": "nBWB-h68E4k.mp4",
"subtitle_path": "nBWB-h68E4k_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1790,
"duration": 59.0,
"view_count": 4823
},
{
"video_id": "vP-fQu22bng",
"question": "In the video, it shows a clip of a blue sea with distant, continuous mountains. What happens after this video finishes playing?",
"question_wo_referring_query": "In the video, it shows a clip of a blue sea with distant, continuous mountains. What happens after this video finishes playing?",
"candidates": [
"A boy wearing white and blue clothes appears and speaks",
"A ship appears",
"A boy wearing black clothes appears and speaks",
"A dolphin appears and swims in the water"
],
"correct_choice": 0,
"position": [
69,
109
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "vP-fQu22bng_0",
"video_path": "vP-fQu22bng.mp4",
"subtitle_path": "vP-fQu22bng_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 31,
"duration": 32.0,
"view_count": 8586
},
{
"video_id": "byaqQClCO-Y",
"question": "In the video, what did the woman holding the child do before the subtitle mentions 'since he first child was born seven'?",
"question_wo_referring_query": "What did the woman holding the child do?",
"candidates": [
"Touched her hair with her hand",
"Put down the child and squat",
"Held her head with both hands",
"Put her hair up"
],
"correct_choice": 0,
"position": [
660,
666
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "byaqQClCO-Y_0",
"video_path": "byaqQClCO-Y.mp4",
"subtitle_path": "byaqQClCO-Y_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 490,
"duration": 36.0,
"view_count": 88737
},
{
"video_id": "yU9fGAEcxJY",
"question": "After the female protagonist in the video adds melted butter into the blender containing chocolate cake mix, what happens next in the video?",
"question_wo_referring_query": "What happens next in the video?",
"candidates": [
"Mix the two together",
"Pour them into a bowl",
"Add more chocolate cake mix",
"Add milk"
],
"correct_choice": 0,
"position": [
711
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "yU9fGAEcxJY_0",
"video_path": "yU9fGAEcxJY.mp4",
"subtitle_path": "yU9fGAEcxJY_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 75,
"duration": 54.97,
"view_count": 210748
},
{
"video_id": "m1mGerQhdHA",
"question": "In this video, which character appears first: the man wearing a plain white short-sleeve shirt and a watch, or the man wearing a dark blue short-sleeve shirt?",
"question_wo_referring_query": "Which character appears first?",
"candidates": [
"The man wearing a plain white short-sleeve shirt and a watch",
"The man wearing a floral short-sleeve shirt and a buzz cut",
"The woman wearing a plain white short-sleeve shirt",
"The man wearing a dark blue short-sleeve shirt"
],
"correct_choice": 0,
"position": [
35,
252
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "m1mGerQhdHA_0",
"video_path": "m1mGerQhdHA.mp4",
"subtitle_path": "m1mGerQhdHA_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 686,
"duration": 33.0,
"view_count": 57228
},
{
"video_id": "oiBhGhROP5o",
"question": "When various indicative markers with summary English words are shown in the video and the phrase 'with this assumption ever after' is mentioned, which of the following icons does not appear?",
"question_wo_referring_query": "Which of the following icons does not appear?",
"candidates": [
"Open-ended icon with 'assuming surprise'",
"Book icon with 'study hard'",
"Sequence icon with 'didn't try to bring maximum carrier'",
"Icon with 'limited plan'"
],
"correct_choice": 1,
"position": [
289
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2O",
"level": "L1-Perception",
"id": "oiBhGhROP5o_0",
"video_path": "oiBhGhROP5o.mp4",
"subtitle_path": "oiBhGhROP5o_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 681,
"duration": 15.0,
"view_count": 1112126
},
{
"video_id": "SYqEMs0EYoI",
"question": "At the beginning of the video, with a blue background, a short clip is played in the middle of the screen. Who is the first person to appear in this clip?",
"question_wo_referring_query": "Who is the first person to appear in this video?",
"candidates": [
"A woman wearing a sleeveless military green top and a pink hat",
"A woman wearing a floral cheongsam",
"A man wearing a black short-sleeve shirt and striped shorts",
"A man wearing a pink shirt and black pants"
],
"correct_choice": 0,
"position": [
5,
45
],
"topic_category": "KH-Knowledge-History",
"question_category": "O3O",
"level": "L2-Relation",
"id": "SYqEMs0EYoI_0",
"video_path": "SYqEMs0EYoI.mp4",
"subtitle_path": "SYqEMs0EYoI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 256,
"duration": 47.0,
"view_count": 721796
},
{
"video_id": "Z9JSvDVMSm0",
"question": "When the image marked with 'Ecuador' is mentioned with 'Yesterday masked gunmen raided a TV' in the video, what change occurs to the image?",
"question_wo_referring_query": "What change occurs to the image?",
"candidates": [
"A man in black clothes appears in the video",
"The video adds a tag of Guayaquil",
"The video adds tags of Quito and Guayaquil",
"The video adds a tag of Quito"
],
"correct_choice": 2,
"position": [
122,
185
],
"topic_category": "NP-News-Programs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Z9JSvDVMSm0_0",
"video_path": "Z9JSvDVMSm0.mp4",
"subtitle_path": "Z9JSvDVMSm0_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 25,
"duration": 28.0,
"view_count": 146830
},
{
"video_id": "3b_v1KO_U8A",
"question": "What event occurred on the video screen just before the phrase 'wilderness and the open skies' was mentioned?",
"question_wo_referring_query": "What event occurred on the video screen?",
"candidates": [
"A host was explaining a map of the western region.",
"A map with the word 'city' appeared.",
"A map of China appeared.",
"A picture of the sea appeared."
],
"correct_choice": 1,
"position": [
750,
756
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "3b_v1KO_U8A_0",
"video_path": "3b_v1KO_U8A.mp4",
"subtitle_path": "3b_v1KO_U8A_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 329,
"duration": 53.99,
"view_count": 175582
},
{
"video_id": "kMryvefpcF8",
"question": "A man dressed in a white shirt, raising one hand, with a slight smile on his face, sitting on a black chair and speaking, this man appears with which subtitles?",
"question_wo_referring_query": "This man appears with which subtitles?",
"candidates": [
"I eventually ended up living",
"offered me his couch to crash on",
"Pyramid of Giza",
"I got a tap"
],
"correct_choice": 2,
"position": [
179,
585
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TOS",
"level": "L2-Relation",
"id": "kMryvefpcF8_0",
"video_path": "kMryvefpcF8.mp4",
"subtitle_path": "kMryvefpcF8_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 157,
"duration": 32.0,
"view_count": 1050171
},
{
"video_id": "NIxyQQfuVoc",
"question": "In front of a white house, surrounded by big trees, there is a woman wearing sunglasses, holding the hands of two little girls. What color is the top that the woman wearing sunglasses is wearing?",
"question_wo_referring_query": "What color is the top that the woman wearing sunglasses is wearing?",
"candidates": [
"Blue",
"Yellow",
"Red",
"White"
],
"correct_choice": 3,
"position": [
471
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "NIxyQQfuVoc_0",
"video_path": "NIxyQQfuVoc.mp4",
"subtitle_path": "NIxyQQfuVoc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 135,
"duration": 45.0,
"view_count": 554125
},
{
"video_id": "HvSEKzpSdzw",
"question": "In a room, a young man wearing a blue uniform, with a white inner layer, and a black wristwatch, is holding a box. What is this young man doing in the room?",
"question_wo_referring_query": "What is this young man doing in the room?",
"candidates": [
"Reaching into the box",
"Stepping on the box",
"Throwing the box away",
"Cutting the box with a knife"
],
"correct_choice": 3,
"position": [
246
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "HvSEKzpSdzw_0",
"video_path": "HvSEKzpSdzw.mp4",
"subtitle_path": "HvSEKzpSdzw_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 315,
"duration": 38.0,
"view_count": 44313
},
{
"video_id": "tCnelzIAHA0",
"question": "A long-haired woman wearing round earrings, a blue top, and black inner wear appears on the screen holding a book. What type of clothing is this woman wearing?",
"question_wo_referring_query": "What type of clothing is this woman wearing?",
"candidates": [
"Suit",
"Blouse",
"Feather coat",
"Denim jacket"
],
"correct_choice": 3,
"position": [
49
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "tCnelzIAHA0_0",
"video_path": "tCnelzIAHA0.mp4",
"subtitle_path": "tCnelzIAHA0_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 143,
"duration": 24.99,
"view_count": 63479
},
{
"video_id": "7Q8d8Vvk6oo",
"question": "A cute, gray-furred little animal is in a thick, clean white snowy field, with dry tree branches at the side. What is this animal doing in the white snow in the video?",
"question_wo_referring_query": "What is this animal doing in the white snow in the video?",
"candidates": [
"Barking",
"Running",
"Rolling",
"Digging"
],
"correct_choice": 1,
"position": [
261
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "7Q8d8Vvk6oo_0",
"video_path": "7Q8d8Vvk6oo.mp4",
"subtitle_path": "7Q8d8Vvk6oo_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 95,
"duration": 45.0,
"view_count": 250628
},
{
"video_id": "i8TJ7RgimNM",
"question": "In a bright room, there is a woman wearing earrings and a necklace, dressed in green, sitting with an elderly woman wearing a gray sleeveless top with a necklace. When the woman in green, with her hands open and the backs of her hands facing outwards, naturally hangs them down, what is the elderly woman doing?",
"question_wo_referring_query": "What is the elderly woman doing?",
"candidates": [
"Raising hand",
"Standing up",
"Swaying left and right",
"Listening attentively"
],
"correct_choice": 3,
"position": [
666
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "i8TJ7RgimNM_0",
"video_path": "i8TJ7RgimNM.mp4",
"subtitle_path": "i8TJ7RgimNM_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 3345,
"duration": 33.03,
"view_count": 4984
},
{
"video_id": "m7sL8LuMD8Q",
"question": "In the video, there is a person resting their head against the wall, someone sitting on the windowsill, a long-haired woman leaning against the curtain, and a short-haired woman with her elbow on her knee. What action does the woman leaning against the curtain do afterward?",
"question_wo_referring_query": "What does the woman leaning against the curtain do afterward?",
"candidates": [
"Turn her head",
"Nod",
"Stand up",
"Wave her hand"
],
"correct_choice": 0,
"position": [
92,
169
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "m7sL8LuMD8Q_0",
"video_path": "m7sL8LuMD8Q.mp4",
"subtitle_path": "m7sL8LuMD8Q_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 237,
"duration": 15.98,
"view_count": 54253
},
{
"video_id": "oCXKARwr6PA",
"question": "In an image of a young girl with a yellow background, wearing a headscarf, accompanied by feathers and some simple ornaments, which objects have never appeared?",
"question_wo_referring_query": "Which objects have never appeared?",
"candidates": [
"Yellow headscarf",
"Pearl earrings",
"Blue gemstone earrings",
"Blue gemstones",
"Peacock feathers"
],
"correct_choice": 2,
"position": [
19
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "oCXKARwr6PA_0",
"video_path": "oCXKARwr6PA.mp4",
"subtitle_path": "oCXKARwr6PA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 120,
"duration": 12.0,
"view_count": 174923
},
{
"video_id": "IuVyCbtGBFQ",
"question": "The blue sky is filled with many clouds, beneath the clouds lies a mountain ridge resembling a centipede, below the mountain ridge there are green plants and an endless expanse of seawater. A man stands sideways in front of the water. When the subtitle 'This island itself resembles a harbor' appears, what is present in the video frame?",
"question_wo_referring_query": "What is present in the video frame when the subtitle appears?",
"candidates": [
"White sailor cap",
"Red short sleeve shirt",
"Black sunglasses",
"A ferry",
"Sailboat"
],
"correct_choice": 2,
"position": [
192
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "IuVyCbtGBFQ_0",
"video_path": "IuVyCbtGBFQ.mp4",
"subtitle_path": "IuVyCbtGBFQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 386,
"duration": 14.02,
"view_count": 9840
},
{
"video_id": "-if_Wg43vTk",
"question": "After the oil in the blue iron pot is heated, what was done with the glass bowl containing marinated meat that was picked up with both hands?",
"question_wo_referring_query": "After the oil in the blue iron pot is heated, what was done with the glass bowl containing marinated meat that was picked up with both hands?",
"candidates": [
"Poured the oil from the pot into the glass bowl.",
"Added a pinch of fermented bean curd.",
"Added white seasoning oil.",
"The marinated meat was poured into the pot.",
"Poured chili peppers into the pot."
],
"correct_choice": 3,
"position": [
27
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "-if_Wg43vTk_0",
"video_path": "-if_Wg43vTk.mp4",
"subtitle_path": "-if_Wg43vTk_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 142,
"duration": 8.97,
"view_count": 220825
},
{
"video_id": "kFHVBCEwC3w",
"question": "When sunlight shines on the building wall, an exterior green sign on a Roman building, and some passersby are standing by the poles, which object appears first after the video screen reaches this moment?",
"question_wo_referring_query": "Which object appears first after the video screen reaches this moment?",
"candidates": [
"A man wearing a gray shirt, round glasses, and a baseball cap",
"A man wearing a green short-sleeved shirt and having a plump physique",
"A row of arch-shaped doors",
"A woman wearing a floral short-sleeved shirt and carrying a white single-shoulder bag",
"A golden yellow circular dome"
],
"correct_choice": 4,
"position": [
33,
71,
110,
246
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "kFHVBCEwC3w_0",
"video_path": "kFHVBCEwC3w.mp4",
"subtitle_path": "kFHVBCEwC3w_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 839,
"duration": 13.0,
"view_count": 39387
},
{
"video_id": "WHxBO7XecSY",
"question": "Which of the following video sequences is correct?",
"question_wo_referring_query": "Which of the following video sequences is correct?",
"candidates": [
"First, a computer webpage with a picture of a mobile phone card, then a man wearing a black short-sleeve sitting in an office talks to the camera, and finally, a hand wearing a silver ring is using a mobile phone.",
"First, a man wearing a black short-sleeve sitting in an office talks to the camera, then a hand wearing a silver ring is using a mobile phone, and finally, a computer webpage with a picture of a mobile phone card.",
"First, a man wearing a black short-sleeve sitting in an office talks to the camera, then a computer webpage with a picture of a mobile phone card, and finally, a hand wearing a silver ring is using a mobile phone.",
"First, a hand wearing a silver ring is using a mobile phone, then a computer webpage with a picture of a mobile phone card, and finally, a man wearing a black short-sleeve sitting in an office talks to the camera.",
"First, a hand wearing a silver ring is using a mobile phone, then a man wearing a black short-sleeve sitting in an office talks to the camera, and finally, a computer webpage with a picture of a mobile phone card."
],
"correct_choice": 4,
"position": [
35,
90,
162
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "WHxBO7XecSY_0",
"video_path": "WHxBO7XecSY.mp4",
"subtitle_path": "WHxBO7XecSY_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 312,
"duration": 12.01,
"view_count": 170797
},
{
"video_id": "alEV01SMFNk",
"question": "What is the woman, wearing a plaid short-sleeve shirt and dark green apron, sitting in front of the kitchen table, preparing to do with a bucket of rice wine in her hand?",
"question_wo_referring_query": "What is she preparing to do?",
"candidates": [
"This woman is preparing to drink it",
"Preparing to pour it out and discard it",
"Preparing to pour it into a pot to cook",
"Preparing to pour it into a glass container and mix it",
"Preparing to spread it on dumplings"
],
"correct_choice": 3,
"position": [
65
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "alEV01SMFNk_0",
"video_path": "alEV01SMFNk.mp4",
"subtitle_path": "alEV01SMFNk_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 155,
"duration": 11.01,
"view_count": 410054
},
{
"video_id": "FoeeEQ0VVFE",
"question": "When a soldier, crouching and using a gun mechanism, and wearing a camouflage helmet is in combat with the enemy, what object appears on the screen?",
"question_wo_referring_query": "When a soldier, crouching and using a gun mechanism, and wearing a camouflage helmet is in combat with the enemy, what object appears on the screen?",
"candidates": [
"Bur",
"Grenade",
"Handgun",
"Sword",
"Rifle"
],
"correct_choice": 0,
"position": [
72
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "FoeeEQ0VVFE_0",
"video_path": "FoeeEQ0VVFE.mp4",
"subtitle_path": "FoeeEQ0VVFE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 20,
"duration": 11.0,
"view_count": 4749088
},
{
"video_id": "GHScTyR2OgQ",
"question": "On a red table, there is a transparent bowl, and a pair of hands wearing black gloves appear. What did the hands wearing black gloves do the first time they appeared?",
"question_wo_referring_query": "What did the hands wearing black gloves do the first time they appeared?",
"candidates": [
"Picked up a whisk to stir cream",
"Picked up a whisk to press olive oil",
"Picked up a whisk to beat eggs",
"Picked up a whisk to press tofu",
"Picked up a whisk to mix batter"
],
"correct_choice": 3,
"position": [
1
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "GHScTyR2OgQ_0",
"video_path": "GHScTyR2OgQ.mp4",
"subtitle_path": "GHScTyR2OgQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 68,
"duration": 9.0,
"view_count": 4190
},
{
"video_id": "mHARxee4EzQ",
"question": "A man wearing glasses, puffing on a cigarette, dressed in a grey long-sleeve shirt with rolled-up sleeves, and holding a handful of red feed is feeding a fish with a big mouth. In which of the following places did he appear?",
"question_wo_referring_query": "In which of the following places did he appear?",
"candidates": [
"On a boat being blown by the wind",
"In a crowded talent show venue",
"On a flying airplane",
"In a quiet park",
"In a boxing ring"
],
"correct_choice": 0,
"position": [
246,
293
],
"topic_category": "KH-Knowledge-History",
"question_category": "SOS",
"level": "L2-Relation",
"id": "mHARxee4EzQ_0",
"video_path": "mHARxee4EzQ.mp4",
"subtitle_path": "mHARxee4EzQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 25,
"duration": 13.01,
"view_count": 427738
},
{
"video_id": "_hODR1cR9lo",
"question": "In a white room, there is a woman with yellow hair, wearing a brown coat, a pair of necklaces, and glasses. With which of the following subtitles did she appear together?",
"question_wo_referring_query": "With which of the following subtitles did she appear together?",
"candidates": [
"\"I like to wear it when I sleep\"",
"\"which i am wearing right now and i have\"",
"\"This style is particularly comfortable to wear\"",
"\"There are a lot of people who like to wear this\"",
"\"I really like this style\""
],
"correct_choice": 1,
"position": [
37,
94,
138,
180,
236,
301
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "_hODR1cR9lo_0",
"video_path": "_hODR1cR9lo.mp4",
"subtitle_path": "_hODR1cR9lo_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 76,
"duration": 13.01,
"view_count": 193094
},
{
"video_id": "8wbNEJjBa0k",
"question": "When a man with short hair, who is wearing a red tight shirt, appears next to a woman with long golden hair, who is wearing a tight outfit, in front of a solid black background, what changes in him?",
"question_wo_referring_query": "What changes in him?",
"candidates": [
"Both hands change from being clasped in front of the abdomen to having palms up, pointing to the left",
"The left hand changes from touching the forehead to touching the nose",
"Both hands change from arms placed in front and behind to being clasped together",
"Both hands change from one hand clenched into a fist to both hands clenched into fists",
"Both hands change from being clasped in front of the abdomen to having palms up, pointing to the right"
],
"correct_choice": 4,
"position": [
130,
276
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SAA",
"level": "L2-Relation",
"id": "8wbNEJjBa0k_0",
"video_path": "8wbNEJjBa0k.mp4",
"subtitle_path": "8wbNEJjBa0k_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 644,
"duration": 13.01,
"view_count": 1005384
},
{
"video_id": "g7zuBUMBr2E",
"question": "In a space constructed with lines, featuring various colors, what objects are present when the subtitle 'they hear a mix of different sounds' appears?",
"question_wo_referring_query": "What objects are present in the space?",
"candidates": [
"A man with hands down, looking up",
"A woman with dark skin",
"A woman with tied hair, wearing a patterned shawl",
"A woman wearing a colorful headscarf",
"A man raising both hands with palms facing outward"
],
"correct_choice": 4,
"position": [
150
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "g7zuBUMBr2E_0",
"video_path": "g7zuBUMBr2E.mp4",
"subtitle_path": "g7zuBUMBr2E_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 180,
"duration": 13.01,
"view_count": 4136
},
{
"video_id": "fySqsm5kNl4",
"question": "In a bright sunny outdoor setting, there is a hole in the ground. When a man wearing a black short-sleeve shirt and a hat appears in front of the hole for the first time, holding a wooden stick, what does he do?",
"question_wo_referring_query": "What does he do?",
"candidates": [
"He kneels down.",
"He jumps into the hole.",
"He throws the wooden stick into the hole.",
"He steps over the hole.",
"He throws the wooden stick outside the hole."
],
"correct_choice": 2,
"position": [
300
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "fySqsm5kNl4_0",
"video_path": "fySqsm5kNl4.mp4",
"subtitle_path": "fySqsm5kNl4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 561,
"duration": 14.0,
"view_count": 1798483
},
{
"video_id": "pJuq8D1NGJQ",
"question": "According to the video, which of the following scenes appears first?",
"question_wo_referring_query": "According to the video, which of the following scenes appears first?",
"candidates": [
"A green field under a blue sky with several red-walled, black-roofed buildings",
"A construction vehicle lifting timber",
"A truck driving on the road",
"A vast wilderness with a huge pit",
"A grassy area with a lakeside and a row of trees separating the grass from the lake"
],
"correct_choice": 4,
"position": [
1,
22,
90
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "pJuq8D1NGJQ_0",
"video_path": "pJuq8D1NGJQ.mp4",
"subtitle_path": "pJuq8D1NGJQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 608,
"duration": 8.01,
"view_count": 1560829
},
{
"video_id": "QUtk5xEw1mQ",
"question": "At the beginning of the video, there is a man with greyish-white hair and a beard in front of a light purple background. He is wearing glasses and a brown coat. With which subtitles did this man appear?",
"question_wo_referring_query": "With which subtitles did this man appear?",
"candidates": [
"the price of man",
"helpless but why they are forced to pay",
"music",
"happyless but why they are forced to pay",
"helpful but why they are forced to pay"
],
"correct_choice": 1,
"position": [
22,
75,
141
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "QUtk5xEw1mQ_0",
"video_path": "QUtk5xEw1mQ.mp4",
"subtitle_path": "QUtk5xEw1mQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 69,
"duration": 8.01,
"view_count": 2100
},
{
"video_id": "cceTeD4JADU",
"question": "In the video, there is a woman with blue hair wearing blue clothes. How many times did she look at the computer in total?",
"question_wo_referring_query": "How many times did she look at the computer in total?",
"candidates": [
"2",
"5",
"1",
"4",
"3"
],
"correct_choice": 2,
"position": [
171
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "cceTeD4JADU_0",
"video_path": "cceTeD4JADU.mp4",
"subtitle_path": "cceTeD4JADU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 90,
"duration": 9.98,
"view_count": 1112608
},
{
"video_id": "LKQK0lud4fo",
"question": "There are a man, a woman, and a child on the screen. In the top-left corner, there is a yellow 'US' symbol. A red arrow is pointing to the little girl. What is the little girl doing?",
"question_wo_referring_query": "What is the little girl doing?",
"candidates": [
"Closing her eyes",
"Dancing",
"Running forward",
"Crying with her head in her hands",
"Opening her mouth to breathe"
],
"correct_choice": 4,
"position": [
218
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2E",
"level": "L1-Perception",
"id": "LKQK0lud4fo_0",
"video_path": "LKQK0lud4fo.mp4",
"subtitle_path": "LKQK0lud4fo_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 32,
"duration": 9.97,
"view_count": 943249
},
{
"video_id": "anrKq6HgPvs",
"question": "A pair of hands holds a fresh yellow egg and places it on top of a cup with black spotted designs. When the subtitle says 'when the weather is dark and gloomy', what changes occur to the egg?",
"question_wo_referring_query": "What changes occur to the egg?",
"candidates": [
"The eggshell turns white.",
"The eggshell shatters into small fragments.",
"The eggshell splits into two halves from an intact oval shape.",
"The eggshell does not change.",
"The eggshell turns red."
],
"correct_choice": 2,
"position": [
214,
235
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "anrKq6HgPvs_0",
"video_path": "anrKq6HgPvs.mp4",
"subtitle_path": "anrKq6HgPvs_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 200,
"duration": 10.0,
"view_count": 11187
},
{
"video_id": "5Ugjg2ru5QQ",
"question": "On the left stands a man with sparse hair and a long beard, on the right stands a man with thick hair tied in a braid. When the subtitle 'organism 4 has apple special' appears, what action is the man with thick hair tied in a braid on the right doing?",
"question_wo_referring_query": "What action is the man with thick hair tied in a braid on the right doing?",
"candidates": [
"Both fists clenched",
"Both hands with fingers interlaced",
"Both hands placed on either side of his legs",
"Both hands with palms facing each other",
"One hand pointing to the car behind"
],
"correct_choice": 3,
"position": [
158
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2E",
"level": "L1-Perception",
"id": "5Ugjg2ru5QQ_0",
"video_path": "5Ugjg2ru5QQ.mp4",
"subtitle_path": "5Ugjg2ru5QQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 542,
"duration": 11.0,
"view_count": 146512
},
{
"video_id": "14ot4DrXdds",
"question": "Under the golden twilight, with industrial zones emitting exhaust fumes into the sky, what happened first after the subtitle 'the water while increased levels of' appeared?",
"question_wo_referring_query": "What happened first?",
"candidates": [
"Two molecular structures appeared simultaneously on the screen",
"A structure with a black center labeled CO\u2082, flanked by two red molecular structures, rotated and popped up on the right side of the screen",
"Three molecular structures appeared simultaneously on the screen",
"A structure with a black center labeled CO\u2082, flanked by two red molecular structures, rotated and popped up on the left side of the screen",
"A structure with a red center labeled CO\u2082, flanked by two black molecular structures, rotated and popped up on the left side of the screen"
],
"correct_choice": 3,
"position": [
20,
58
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "14ot4DrXdds_0",
"video_path": "14ot4DrXdds.mp4",
"subtitle_path": "14ot4DrXdds_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 319,
"duration": 8.01,
"view_count": 131811
},
{
"video_id": "aqGQw1bXFN4",
"question": "At the beginning of the scene, in front of two buildings with triangular roofs and red beams, there is a small figure dressed in grey, holding a quill and writing on a scroll. Which subtitles appear alongside this scene?",
"question_wo_referring_query": "Which subtitles appear alongside this scene?",
"candidates": [
"I will not let this law go through",
"Nevertheless",
"in spite of",
"What are you doing",
"were also the source of the tribunes' power. Any Tribune could intercede on"
],
"correct_choice": 4,
"position": [
110,
288
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "aqGQw1bXFN4_0",
"video_path": "aqGQw1bXFN4.mp4",
"subtitle_path": "aqGQw1bXFN4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 172,
"duration": 12.0,
"view_count": 16224
},
{
"video_id": "bgklOaBBmB8",
"question": "On the brown wooden table, there is a transparent bowl filled with meat and broth. A pair of hands on either side of the bowl lifts it up. After lifting the bowl, what transition occurs in the scene?",
"question_wo_referring_query": "What transition occurs in the scene?",
"candidates": [
"From the floor to the microwave",
"From the table to the floor",
"From the table to the microwave",
"From the table to the chair",
"From the table to outside"
],
"correct_choice": 2,
"position": [
18,
51
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "bgklOaBBmB8_0",
"video_path": "bgklOaBBmB8.mp4",
"subtitle_path": "bgklOaBBmB8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 298,
"duration": 11.98,
"view_count": 3687
},
{
"video_id": "-QSAotqKqX8",
"question": "How many women appear in the video in total?",
"question_wo_referring_query": "How many women appear in the video in total?",
"candidates": [
"4 women",
"5 women",
"2 women",
"1 woman",
"3 women"
],
"correct_choice": 3,
"position": [
88
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "-QSAotqKqX8_0",
"video_path": "-QSAotqKqX8.mp4",
"subtitle_path": "-QSAotqKqX8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 99,
"duration": 7.98,
"view_count": 79518
},
{
"video_id": "FH1L2Dq6964",
"question": "When 'make a marshmallow cereal treat and' is mentioned, there's a yellow pot on a gas stove in the kitchen, and on the counter, there are eggs and milk. A woman with blue hair, what is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She is not holding anything in her hands.",
"She is holding a rectangular container full of colorful ingredients in one hand and oats in the other.",
"She is holding an egg in her hand.",
"She is holding an egg in one hand and milk in the other."
],
"correct_choice": 1,
"position": [
57
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "FH1L2Dq6964_0",
"video_path": "FH1L2Dq6964.mp4",
"subtitle_path": "FH1L2Dq6964_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 172,
"duration": 8.01,
"view_count": 127447
},
{
"video_id": "npLd4WTSQsM",
"question": "On this yellow cutting board, two chicken breasts are placed side by side. After this person places the chicken breasts horizontally, what does he do next?",
"question_wo_referring_query": "What does he do next?",
"candidates": [
"He throws the chicken breasts away.",
"He puts the chicken breasts into a pot.",
"He does nothing.",
"He uses a knife to cut the two whole chicken breasts into evenly sized small pieces."
],
"correct_choice": 3,
"position": [
51,
336
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E3E",
"level": "L2-Relation",
"id": "npLd4WTSQsM_0",
"video_path": "npLd4WTSQsM.mp4",
"subtitle_path": "npLd4WTSQsM_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 332,
"duration": 14.0,
"view_count": 2855970
},
{
"video_id": "59iv6EYWNn8",
"question": "When mentioning 'foreign', there's a Chinese flag in front of a dark blue background, and a man wearing a white shirt, black suit jacket, and a blue and white striped tie. What is his hairstyle like?",
"question_wo_referring_query": "What is his hairstyle like?",
"candidates": [
"Long straight black hair",
"Big waves",
"Explosive hair",
"Black and white mixed short hair"
],
"correct_choice": 3,
"position": [
14
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "59iv6EYWNn8_0",
"video_path": "59iv6EYWNn8.mp4",
"subtitle_path": "59iv6EYWNn8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 284,
"duration": 9.0,
"view_count": 252557
},
{
"video_id": "PS0gXPNBZy8",
"question": "A woman wearing floral prints is walking on a small path in the countryside, surrounded by trees. There is a fence made of tree trunks on one side of the path, with a black cat walking on it. What did the woman do after leaving the path?",
"question_wo_referring_query": "What did the woman do after leaving the path?",
"candidates": [
"Go to the market",
"Kneel down to pick flowers",
"Go to the riverside",
"Go home"
],
"correct_choice": 1,
"position": [
1,
38
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "PS0gXPNBZy8_0",
"video_path": "PS0gXPNBZy8.mp4",
"subtitle_path": "PS0gXPNBZy8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 50,
"duration": 8.97,
"view_count": 1310354
},
{
"video_id": "l3vxOGgAM2g",
"question": "When mentioning 'Music', the man wearing a white shirt, yellow jacket, and yellow pants, what type is his jacket?",
"question_wo_referring_query": "What type is his jacket?",
"candidates": [
"Sweater",
"Cotton Clothing",
"Coat",
"Jacket"
],
"correct_choice": 3,
"position": [
78
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "l3vxOGgAM2g_0",
"video_path": "l3vxOGgAM2g.mp4",
"subtitle_path": "l3vxOGgAM2g_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 187,
"duration": 8.01,
"view_count": 16651
},
{
"video_id": "ZfapKqwklG4",
"question": "When the phrase 'forbidden instead they were welcome to' was mentioned, what action did the man in the red and purple striped clothes do in the old, broken room?",
"question_wo_referring_query": "What action did he do?",
"candidates": [
"Got up from the bed and ran",
"Ate",
"Covered his ears with both hands",
"Picked up a cup and drank water"
],
"correct_choice": 2,
"position": [
19
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2E",
"level": "L1-Perception",
"id": "ZfapKqwklG4_0",
"video_path": "ZfapKqwklG4.mp4",
"subtitle_path": "ZfapKqwklG4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 982,
"duration": 13.0,
"view_count": 352170
},
{
"video_id": "SH7Unhifaj0",
"question": "In a bedroom with red curtains and two different types of green plants on the balcony, what is a woman wearing a red long-sleeve sweater doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She has her hands on her legs",
"She has her hands raised above her head",
"She has her hands clasped in front of her chest",
"She has her hands at her sides"
],
"correct_choice": 2,
"position": [
81
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "SH7Unhifaj0_0",
"video_path": "SH7Unhifaj0.mp4",
"subtitle_path": "SH7Unhifaj0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 328,
"duration": 13.98,
"view_count": 2553
},
{
"video_id": "Msz128EJeWE",
"question": "In a room, a bald man wearing a dark blue shirt and a suit, with a necklace, what did he do the first time he appeared?",
"question_wo_referring_query": "What did he do the first time he appeared?",
"candidates": [
"He placed his hands behind his back",
"He made an X gesture with his hands",
"He placed his hands by his sides",
"He raised his hands above his head"
],
"correct_choice": 1,
"position": [
64
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Msz128EJeWE_0",
"video_path": "Msz128EJeWE.mp4",
"subtitle_path": "Msz128EJeWE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 59,
"duration": 11.01,
"view_count": 5734
},
{
"video_id": "6Z7AAcD8rbo",
"question": "Next to the swimming pool, there are white sunshades and trees, and there are some people inside the pool. A man wearing a gray short-sleeve shirt is taking pictures in front of the outdoor swimming pool. Besides the swimming pool, there is also a wooden dining table and some chairs. Which scene appears first?",
"question_wo_referring_query": "Which scene appears first?",
"candidates": [
"Next to the swimming pool, there are white sunshades and trees, and there are some people inside the pool, appears first; A man wearing a blue short-sleeve shirt is taking pictures in front of the outdoor swimming pool. Besides the swimming pool, there is also a wooden dining table and some chairs, appears later.",
"They are one scene.",
"Next to the swimming pool, there are white sunshades and trees, and there are some people inside the pool, appears first; A man wearing a gray short-sleeve shirt is taking pictures in front of the outdoor swimming pool. Besides the swimming pool, there is also a wooden dining table and some chairs, appears later.",
"A man wearing a gray short-sleeve shirt is taking pictures in front of the outdoor swimming pool. Besides the swimming pool, there is also a wooden dining table and some chairs, appears first. Next to the swimming pool, there are white sunshades and trees, and there are some people inside the pool, appears later."
],
"correct_choice": 2,
"position": [
97,
188
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "6Z7AAcD8rbo_0",
"video_path": "6Z7AAcD8rbo.mp4",
"subtitle_path": "6Z7AAcD8rbo_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 11,
"duration": 13.01,
"view_count": 83657
},
{
"video_id": "9aWPbYosJUw",
"question": "On a red table, there is a white plate containing a delicious dish with a golden crust on the outside and cheese underneath, with the right side cut off. A person wearing gloves is picking up half of the dish. When they mention 'Its crispy exterior, with a soft cheesy heart, makes it irresistible delicious,' what kind of gloves is this person wearing?",
"question_wo_referring_query": "What kind of gloves is this person wearing?",
"candidates": [
"Wearing heat-resistant cotton gloves",
"Wearing transparent plastic gloves",
"Not wearing gloves",
"Wearing black plastic gloves"
],
"correct_choice": 3,
"position": [
142
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2A",
"level": "L1-Perception",
"id": "9aWPbYosJUw_0",
"video_path": "9aWPbYosJUw.mp4",
"subtitle_path": "9aWPbYosJUw_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 128,
"duration": 12.0,
"view_count": 1623
},
{
"video_id": "JuthlQ2Zo38",
"question": "In front of the white curtain, when a woman in a light blue denim jacket with long hair lifts her hand, what is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"The woman is holding a white bottle and giving a detailed introduction about it.",
"The woman is taking off her earrings.",
"The woman is taking off her hair tie from her wrist and tying her hair.",
"The woman is taking off her jacket and placing it on the ground."
],
"correct_choice": 0,
"position": [
275
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "JuthlQ2Zo38_0",
"video_path": "JuthlQ2Zo38.mp4",
"subtitle_path": "JuthlQ2Zo38_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 269,
"duration": 13.0,
"view_count": 53863
},
{
"video_id": "iHzypa15yRA",
"question": "A woman in a black coat is standing in front of a closed store inside a mall, with a 'cna' news ticker moving below. When the phrase 'assistance at about 4:50 p.m and uh' is mentioned, what does she do?",
"question_wo_referring_query": "What does she do?",
"candidates": [
"She looked down slightly at the manuscript on her phone",
"She ran her fingers through her hair",
"She switched the drink from her left hand to her right hand",
"She faced the camera and adjusted the lens position"
],
"correct_choice": 0,
"position": [
96
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "iHzypa15yRA_0",
"video_path": "iHzypa15yRA.mp4",
"subtitle_path": "iHzypa15yRA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 156,
"duration": 13.0,
"view_count": 78396
},
{
"video_id": "Scne0ls23MA",
"question": "In a scene with a man in a white short-sleeve shirt, with a large rectangular account book (or register) behind him, and a brown object with a round spike at the top right corner of the screen, which of the following characters has appeared before?",
"question_wo_referring_query": "Which of the following characters has appeared before?",
"candidates": [
"A person wearing a blue and white headscarf and a black short-sleeve shirt",
"A person wearing a blue and white headscarf and a white short-sleeve shirt",
"A person wearing a red and white headscarf and a black short-sleeve shirt",
"A person wearing a red and white headscarf and a black and white striped short-sleeve shirt"
],
"correct_choice": 3,
"position": [
59,
145
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Scne0ls23MA_0",
"video_path": "Scne0ls23MA.mp4",
"subtitle_path": "Scne0ls23MA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 266,
"duration": 9.0,
"view_count": 205702
},
{
"video_id": "WLl3SeraTV0",
"question": "On a large marble countertop, there is a baking model. In the middle of the screen, a hand is holding a yellow jar and spraying it inward. What changes happen on the screen after the subtitle mentions 'so'?",
"question_wo_referring_query": "What changes happen on the screen?",
"candidates": [
"This person is holding a gray scoop and is stirring it inside.",
"This person is holding a jar of chocolate sauce and is about to pour it in.",
"This person is holding a yellow jar of fruit sauce and is about to pour it in.",
"This person is holding a red scoop and is stirring it inside."
],
"correct_choice": 1,
"position": [
95,
132
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "WLl3SeraTV0_0",
"video_path": "WLl3SeraTV0.mp4",
"subtitle_path": "WLl3SeraTV0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 95,
"duration": 11.01,
"view_count": 389605
},
{
"video_id": "kbRtl58u_kk",
"question": "In a dark room, there is a rotating fan hanging. When the screen displays white subtitles saying 'i watch all ur vedeo', what is a short-haired man wearing a green short-sleeved shirt doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"He is preparing to change clothes.",
"He is preparing to turn off the fan.",
"He is touching his own hair.",
"He is looking at the electric fan."
],
"correct_choice": 3,
"position": [
44
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2E",
"level": "L1-Perception",
"id": "kbRtl58u_kk_0",
"video_path": "kbRtl58u_kk.mp4",
"subtitle_path": "kbRtl58u_kk_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 431,
"duration": 7.98,
"view_count": 65625
},
{
"video_id": "pZ-9hGQNHuA",
"question": "At night, on a street with a car parked with its hazard lights on, a person is standing beside the car with a bent waist. On one side of the road, there are houses, and on the other side, there are houses surrounded by fences. When the text 'what's your name' is mentioned, what objects appear on the screen?",
"question_wo_referring_query": "What objects appear on the screen?",
"candidates": [
"Truck",
"Police",
"Table",
"Car, streetlight, fence"
],
"correct_choice": 3,
"position": [
257
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "pZ-9hGQNHuA_0",
"video_path": "pZ-9hGQNHuA.mp4",
"subtitle_path": "pZ-9hGQNHuA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 93,
"duration": 12.01,
"view_count": 5248783
},
{
"video_id": "GZFL58_pXPg",
"question": "On the screen, there are many advanced math problems on a piece of white paper. Among the problems, there is also a complete brain diagram. When the subtitle mentions 'an extraordinary brain that lacked the,' what color appears on the right side of the brain diagram?",
"question_wo_referring_query": "What color appears on the right side of the brain diagram?",
"candidates": [
"The left side of the brain diagram is gray with stained patterns.",
"The right side of the brain is blue with stains.",
"The right side of the brain diagram is gray, resembling stains.",
"The left side of the brain is blue with stains."
],
"correct_choice": 2,
"position": [
78
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "GZFL58_pXPg_0",
"video_path": "GZFL58_pXPg.mp4",
"subtitle_path": "GZFL58_pXPg_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 107,
"duration": 7.97,
"view_count": 22656
},
{
"video_id": "F2i-LchFVCU",
"question": "In a vast Milky Way, there are countless stars scattered throughout. In the upper right side of the screen, there is a slowly moving object. What is this object?",
"question_wo_referring_query": "What is this object?",
"candidates": [
"It is a blue planet",
"It is a launched satellite",
"It is a celestial body emitting a golden light",
"It is a purple planet"
],
"correct_choice": 2,
"position": [
312
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "F2i-LchFVCU_0",
"video_path": "F2i-LchFVCU.mp4",
"subtitle_path": "F2i-LchFVCU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 259,
"duration": 13.98,
"view_count": 5977
},
{
"video_id": "FOMS_yqdN1o",
"question": "In a room with warm lighting, a woman dressed in black is sitting on a chair holding a comb. What is she doing?",
"question_wo_referring_query": "Holding a comb, what is she doing?",
"candidates": [
"She is brushing her teeth",
"She is washing her face",
"She is putting on makeup",
"She is combing her hair"
],
"correct_choice": 3,
"position": [
15
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "FOMS_yqdN1o_0",
"video_path": "FOMS_yqdN1o.mp4",
"subtitle_path": "FOMS_yqdN1o_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 51,
"duration": 10.01,
"view_count": 339864
},
{
"video_id": "YvM4rzSxX1U",
"question": "In a plaza surrounded by high-walled buildings, many pairs of small figures are scattered around. On the right side of the screen, there is a character wearing a blue hat. To his left, there are two soldiers. What are they doing?",
"question_wo_referring_query": "What are they doing?",
"candidates": [
"The two helmeted soldiers are raising their shields and running forward",
"The two soldiers are looking for weapons at hand",
"The two soldiers are sparring",
"The two soldiers are greeting each other"
],
"correct_choice": 0,
"position": [
179
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "YvM4rzSxX1U_0",
"video_path": "YvM4rzSxX1U.mp4",
"subtitle_path": "YvM4rzSxX1U_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 127,
"duration": 14.0,
"view_count": 1947
},
{
"video_id": "M5Umm-ltnao",
"question": "A woman with gray clothing and pink manicured hands is holding a copper-colored bracelet up to the camera and says, 'it also comes with this chain one I.' Which object is not shown on the screen?",
"question_wo_referring_query": "Which object is not shown on the screen?",
"candidates": [
"black bracelet",
"copper bracelet",
"wooden door",
"white-framed Pegasus painting"
],
"correct_choice": 0,
"position": [
172
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "M5Umm-ltnao_0",
"video_path": "M5Umm-ltnao.mp4",
"subtitle_path": "M5Umm-ltnao_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 1097,
"duration": 12.98,
"view_count": 41546
},
{
"video_id": "_kQXNFG664Y",
"question": "In a corner of a bedroom, a short-haired woman wearing an olive green tank top is sitting in front of a bed. To the right is a bookshelf filled with books. What color is the bookshelf in the screen?",
"question_wo_referring_query": "What color is the bookshelf in the screen?",
"candidates": [
"A three-layer pink bookshelf",
"A three-layer blue bookshelf",
"A four-layer red bookshelf",
"A five-layer orange bookshelf"
],
"correct_choice": 3,
"position": [
40
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "_kQXNFG664Y_0",
"video_path": "_kQXNFG664Y.mp4",
"subtitle_path": "_kQXNFG664Y_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 678,
"duration": 8.0,
"view_count": 807323
},
{
"video_id": "h2RbVVz3NZc",
"question": "In a small car, a man wearing a blue hoodie is next to a man wearing a white shirt. In which subtitle did these two people appear together?",
"question_wo_referring_query": "In which subtitle did these two people appear together?",
"candidates": [
"yep",
"there that's good there's a sick",
"we could jump off of the ones off the 20",
"look"
],
"correct_choice": 2,
"position": [
56,
96,
135
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TOS",
"level": "L2-Relation",
"id": "h2RbVVz3NZc_0",
"video_path": "h2RbVVz3NZc.mp4",
"subtitle_path": "h2RbVVz3NZc_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 120,
"duration": 13.01,
"view_count": 191003
},
{
"video_id": "UbgwG8fcIu0",
"question": "In front of a white wall with a hanging picture frame, a woman wearing a blue uniform, black shorts, and holding a green notebook did what while facing the camera?",
"question_wo_referring_query": "What did she do while facing the camera?",
"candidates": [
"She stuck out her tongue at the camera",
"She adjusted her long hair facing the camera",
"She made a funny face at the camera",
"She showed her notes to the camera"
],
"correct_choice": 0,
"position": [
60
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "UbgwG8fcIu0_0",
"video_path": "UbgwG8fcIu0.mp4",
"subtitle_path": "UbgwG8fcIu0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 258,
"duration": 9.0,
"view_count": 13601
},
{
"video_id": "GAoK4XjssrM",
"question": "In front of a black screen, a man wearing a black short-sleeved shirt, with his left hand clenched into a fist and his right hand tightly grasping his left forearm, what object is present on the screen when he mentions 'guards for helping us out and uh that'?",
"question_wo_referring_query": "What object is present on the screen?",
"candidates": [
"A plain black short-sleeved shirt",
"A short-sleeved shirt with red, blue, and green prints",
"A long-sleeved shirt with red, blue, and green prints",
"A plain black long-sleeved shirt"
],
"correct_choice": 1,
"position": [
115
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "GAoK4XjssrM_0",
"video_path": "GAoK4XjssrM.mp4",
"subtitle_path": "GAoK4XjssrM_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 304,
"duration": 9.01,
"view_count": 223717
},
{
"video_id": "jbPR2SJuFHg",
"question": "On a large marble table, there is a piece of baked food. A person wearing a ring is using chopsticks to apply sauce. What kind of ring is this person wearing?",
"question_wo_referring_query": "What kind of ring is this person wearing?",
"candidates": [
"Gold ring on the index finger",
"Diamond ring on the middle finger",
"Gold ring on the middle finger",
"Silver ring on the middle finger"
],
"correct_choice": 2,
"position": [
153
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "jbPR2SJuFHg_0",
"video_path": "jbPR2SJuFHg.mp4",
"subtitle_path": "jbPR2SJuFHg_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 57,
"duration": 9.01,
"view_count": 1179070
},
{
"video_id": "XvJqVKi8i_I",
"question": "In front of a vast outdoor sports field, a man with curly hair wearing a grey short-sleeved shirt and with blue paint on his hands, appeared in which subtitle together?",
"question_wo_referring_query": "In which subtitle did he appear?",
"candidates": [
"it's the ultimate office and everywhere",
"where dreams are made of",
"is vacation",
"yeah yeah New York"
],
"correct_choice": 0,
"position": [
43,
202
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TOS",
"level": "L2-Relation",
"id": "XvJqVKi8i_I_0",
"video_path": "XvJqVKi8i_I.mp4",
"subtitle_path": "XvJqVKi8i_I_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 100,
"duration": 12.01,
"view_count": 51292
},
{
"video_id": "PE32sjfN-uM",
"question": "In a hall with a half-body statue and two national flags in the background, two rows of people are sitting on the left and right sides, and two country representatives are seated in the middle. After mentioning 'remain on good terms with both the', what changes occur on the screen?",
"question_wo_referring_query": "What changes occur on the screen?",
"candidates": [
"The screen focuses on the original representative on the left side.",
"The screen focuses on the original representative on the right side.",
"The screen focuses on the two national flags in the background.",
"The screen cuts to the two rows of people on both sides."
],
"correct_choice": 1,
"position": [
86,
110
],
"topic_category": "NP-News-Programs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "PE32sjfN-uM_0",
"video_path": "PE32sjfN-uM.mp4",
"subtitle_path": "PE32sjfN-uM_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 46,
"duration": 9.0,
"view_count": 48968
},
{
"video_id": "2F7d7aUCmUU",
"question": "In a dense forest, a boy wearing a black short-sleeve shirt is sitting on a bench. When the subtitle mentions 'to positively influence us all that said', what action does the boy take?",
"question_wo_referring_query": "What action does the boy take?",
"candidates": [
"Makes a fist",
"Waves his hand",
"Stands up",
"Claps his hands"
],
"correct_choice": 1,
"position": [
16
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "2F7d7aUCmUU_0",
"video_path": "2F7d7aUCmUU.mp4",
"subtitle_path": "2F7d7aUCmUU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 725,
"duration": 7.97,
"view_count": 119706
},
{
"video_id": "mB7NW91REF4",
"question": "In the video, the woman with the long black hair, wearing a scarf and earrings, is about to go out. Where did she go after going out?",
"question_wo_referring_query": "Where did she go after going out?",
"candidates": [
"On a spacious street",
"In a convenience store",
"In a barbecue restaurant",
"In a bar"
],
"correct_choice": 0,
"position": [
9,
224
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "mB7NW91REF4_0",
"video_path": "mB7NW91REF4.mp4",
"subtitle_path": "mB7NW91REF4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 31,
"duration": 13.98,
"view_count": 88054
},
{
"video_id": "UM0VOrAtnPA",
"question": "There is a blonde woman sitting on a chair wearing a long skirt and reading a book. Where else has the book she's holding appeared?",
"question_wo_referring_query": "Where else has the book she's holding appeared?",
"candidates": [
"On the table",
"On the ground",
"On the sofa",
"On the chair"
],
"correct_choice": 0,
"position": [
38,
96
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "UM0VOrAtnPA_0",
"video_path": "UM0VOrAtnPA.mp4",
"subtitle_path": "UM0VOrAtnPA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 115,
"duration": 8.01,
"view_count": 282345
},
{
"video_id": "BkCbbFn21LY",
"question": "In the video, when the eggs and white granules in the bowl are mixed together with chopsticks, the bowl is placed on a red and white striped cloth and placed on the ground. When the subtitle mentions '2 tbsp spelled flour', what is the substance that the person takes out?",
"question_wo_referring_query": "What is the substance that the person takes out?",
"candidates": [
"powder",
"paste",
"frost-like",
"liquid"
],
"correct_choice": 0,
"position": [
53,
235
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "BkCbbFn21LY_0",
"video_path": "BkCbbFn21LY.mp4",
"subtitle_path": "BkCbbFn21LY_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 27,
"duration": 11.0,
"view_count": 3440
},
{
"video_id": "H7gnsO4UZmU",
"question": "In the room, a woman wearing a necklace and white clothes is talking. When she mentions 'don't have to go to the gym for like a', which object does not appear in the scene?",
"question_wo_referring_query": "Which object does not appear in the scene?",
"candidates": [
"Milk tea",
"Tissue",
"Mirror",
"Bright lamp"
],
"correct_choice": 0,
"position": [
52
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "H7gnsO4UZmU_0",
"video_path": "H7gnsO4UZmU.mp4",
"subtitle_path": "H7gnsO4UZmU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 313,
"duration": 13.0,
"view_count": 309587
},
{
"video_id": "W8w09l_mmT4",
"question": "In the kitchen, there is a man wearing a white short-sleeved patterned shirt cutting meat. In front of him, there are other side dishes, too. At the beginning, what happens to the meat being cut on the chopping board?",
"question_wo_referring_query": "At the beginning, what happens to the meat being cut on the chopping board?",
"candidates": [
"Thrown away",
"Lifted up",
"Fried in oil",
"Boiled in water"
],
"correct_choice": 1,
"position": [
22,
109
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "W8w09l_mmT4_0",
"video_path": "W8w09l_mmT4.mp4",
"subtitle_path": "W8w09l_mmT4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 302,
"duration": 9.01,
"view_count": 576732
},
{
"video_id": "7PqVVjEW0LM",
"question": "There is a platter with exquisite arrangement in the video, featuring items like noodles and wrapped eggs. When the subtitle mentions 'enjoy it try to base that your,' what change occurs to this platter?",
"question_wo_referring_query": "What change occurs to this platter?",
"candidates": [
"Completely eaten",
"Disappears",
"Sprinkled with powdery substance",
"Thrown into the trash can"
],
"correct_choice": 2,
"position": [
194,
231
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "7PqVVjEW0LM_0",
"video_path": "7PqVVjEW0LM.mp4",
"subtitle_path": "7PqVVjEW0LM_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 307,
"duration": 10.01,
"view_count": 209409
},
{
"video_id": "NS2V_OHYkvA",
"question": "In the video, there are many well-arranged categories of foods in the supermarket. When the subtitle mentions 'so', what object is present in the bottom row of the screen?",
"question_wo_referring_query": "What object is present in the bottom row of the screen?",
"candidates": [
"eggs",
"milk tea",
"ice cream bars",
"fried chicken"
],
"correct_choice": 0,
"position": [
159
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "NS2V_OHYkvA_0",
"video_path": "NS2V_OHYkvA.mp4",
"subtitle_path": "NS2V_OHYkvA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 223,
"duration": 9.01,
"view_count": 298229
},
{
"video_id": "iDFDxwPTjeU",
"question": "At the beginning of the video, a calendar and a hand-drawn booklet appear. With which subtitle did they appear together?",
"question_wo_referring_query": "With which subtitle did they appear together?",
"candidates": [
"the key",
"do you know",
"say you want",
"which is a very big adulting thing that"
],
"correct_choice": 3,
"position": [
1,
28
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "iDFDxwPTjeU_0",
"video_path": "iDFDxwPTjeU.mp4",
"subtitle_path": "iDFDxwPTjeU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 153,
"duration": 13.0,
"view_count": 347155
},
{
"video_id": "18rZLZqg6C8",
"question": "In the video, a woman wearing a white long-sleeve sweater is sitting on a chair with a tree behind her. What movement occurs with her hands after she says 'and we booked this dreamy Cup in in'?",
"question_wo_referring_query": "What movement occurs with her hands?",
"candidates": [
"Tightening into a prayer position with both hands",
"From raising hands to a prayer position with both hands",
"No change",
"From clasping hands to raising hands"
],
"correct_choice": 2,
"position": [
71,
86
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "18rZLZqg6C8_0",
"video_path": "18rZLZqg6C8.mp4",
"subtitle_path": "18rZLZqg6C8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 103,
"duration": 8.0,
"view_count": 45905
},
{
"video_id": "AU0UJVNZMvA",
"question": "The screen has two pairs of separate eyes and a cartoon girl headshot, and in the bottom right corner there is an old man with white hair, wearing glasses and a mustache. What is the object in the middle of this screen?",
"question_wo_referring_query": "What is the object in the middle of this screen?",
"candidates": [
"sword",
"refrigerator",
"mobile phone",
"computer"
],
"correct_choice": 0,
"position": [
93
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "AU0UJVNZMvA_0",
"video_path": "AU0UJVNZMvA.mp4",
"subtitle_path": "AU0UJVNZMvA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 1020,
"duration": 14.0,
"view_count": 185138
},
{
"video_id": "u-vYvqxHejY",
"question": "In the video, a blonde woman is explaining something while wearing a black coat. When the person speaking mentions 'Oh, and here's a picture of her at the printing press,' there is a shadow of her pointing to an object on the wall. What is the object present in the screen?",
"question_wo_referring_query": "What is the object present in the screen?",
"candidates": [
"Tree",
"Black and white photo",
"Book",
"Watercolor painting"
],
"correct_choice": 1,
"position": [
131
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2O",
"level": "L1-Perception",
"id": "u-vYvqxHejY_0",
"video_path": "u-vYvqxHejY.mp4",
"subtitle_path": "u-vYvqxHejY_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 75,
"duration": 7.97,
"view_count": 143935
},
{
"video_id": "KFqlW0APKRA",
"question": "In the video, there is a woman with long hair wearing a black coat explaining a protest activity in an empty room. What was her first action when she appeared?",
"question_wo_referring_query": "What was her first action when she appeared?",
"candidates": [
"Sitting",
"Lying down",
"Standing",
"Crouching"
],
"correct_choice": 0,
"position": [
191
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "KFqlW0APKRA_0",
"video_path": "KFqlW0APKRA.mp4",
"subtitle_path": "KFqlW0APKRA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 223,
"duration": 14.02,
"view_count": 2782
},
{
"video_id": "Ge5SjsgaA4o",
"question": "In the video, a metal tray filled with baked round food items is depicted. These items are evenly distributed across the tray and each one appears golden yellow in color. The tray is placed on a wooden table and was taken out of an oven. After taking it out, what was added to the tray?",
"question_wo_referring_query": "After taking it out, what was added to the tray?",
"candidates": [
"Salad",
"Red sauce",
"Fruit juice",
"Chopped scallions"
],
"correct_choice": 1,
"position": [
71,
90
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Ge5SjsgaA4o_0",
"video_path": "Ge5SjsgaA4o.mp4",
"subtitle_path": "Ge5SjsgaA4o_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 116,
"duration": 8.01,
"view_count": 606674
},
{
"video_id": "tygk9-aneC4",
"question": "The screen shows a bar chart comparing the cost-effectiveness of different German tanks during World War II. Which tank, introduced first, has the lowest cost?",
"question_wo_referring_query": "Which tank, introduced first, has the lowest cost?",
"candidates": [
"Main Battle Tank",
"Tiger",
"Heavy Tank",
"Panther"
],
"correct_choice": 3,
"position": [
17,
167
],
"topic_category": "KH-Knowledge-History",
"question_category": "O3O",
"level": "L2-Relation",
"id": "tygk9-aneC4_0",
"video_path": "tygk9-aneC4.mp4",
"subtitle_path": "tygk9-aneC4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 178,
"duration": 14.02,
"view_count": 582141
},
{
"video_id": "CFm3bc9gqYE",
"question": "In the video discussing events related to the Roman soldiers, who appears before the subtitles mention 'conditions however most historians agree'?",
"question_wo_referring_query": "Who appears?",
"candidates": [
"Wearing short sleeves",
"Wearing green clothes",
"Not wearing a helmet",
"Wearing a helmet and red clothes"
],
"correct_choice": 3,
"position": [
25,
57
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "CFm3bc9gqYE_0",
"video_path": "CFm3bc9gqYE.mp4",
"subtitle_path": "CFm3bc9gqYE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 417,
"duration": 9.0,
"view_count": 3211907
},
{
"video_id": "mOiEOs3ZlT8",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is accurate?",
"candidates": [
"First, a woman explaining a panel of a sacrificial altar painting is shown. Then, a painting with only two women appears. Finally, another panel depicting a sacrificial altar painting is shown.",
"First, a painting with only two women appears. Then, a panel depicting a sacrificial altar painting is shown. Finally, a woman explaining this panel appears.",
"First, a panel depicting a sacrificial altar painting is shown. Then, a woman explaining this panel appears. Finally, a painting with only two women appears.",
"First, a painting with only two women appears. Then, a woman explaining a panel of a sacrificial altar painting is shown. Finally, the video ends with another panel depicting a sacrificial altar painting."
],
"correct_choice": 3,
"position": [
22,
99,
189
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "mOiEOs3ZlT8_0",
"video_path": "mOiEOs3ZlT8.mp4",
"subtitle_path": "mOiEOs3ZlT8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 1009,
"duration": 12.01,
"view_count": 205801
},
{
"video_id": "QpDL3mopnWM",
"question": "There is a group of people on the screen, with a hillside and some scattered grass in the background. Two smaller individuals wearing yellow helmets use weapons to knock down a man with a grey helmet and beard to the ground. Where else does the man with the grey helmet and beard appear in the video?",
"question_wo_referring_query": "Where else does the man with the grey helmet and beard appear in the video?",
"candidates": [
"On the bed",
"On horseback",
"In the palace",
"In the office"
],
"correct_choice": 1,
"position": [
57,
219
],
"topic_category": "KH-Knowledge-History",
"question_category": "SOS",
"level": "L2-Relation",
"id": "QpDL3mopnWM_0",
"video_path": "QpDL3mopnWM.mp4",
"subtitle_path": "QpDL3mopnWM_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 182,
"duration": 14.02,
"view_count": 3936
},
{
"video_id": "9yQV0t5meo4",
"question": "In the scene on the wide street, there are a few cars and a few buildings in the background. What change occurred to the bird after it was on the ground?",
"question_wo_referring_query": "What change occurred to the bird after it was on the ground?",
"candidates": [
"Stayed in the same place",
"Walked on the ground",
"Flew onto a tree",
"Flew into the sky"
],
"correct_choice": 3,
"position": [
36,
149
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "9yQV0t5meo4_0",
"video_path": "9yQV0t5meo4.mp4",
"subtitle_path": "9yQV0t5meo4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 14,
"duration": 9.01,
"view_count": 3792
},
{
"video_id": "8urCPp_yhWs",
"question": "In the video, a desert scene appears at dusk. When the subtitle mentions 'so our stay here at Thousand nights was,' how many people appear on the screen?",
"question_wo_referring_query": "How many people appear on the screen?",
"candidates": [
"4",
"3",
"2",
"5"
],
"correct_choice": 2,
"position": [
203
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "8urCPp_yhWs_0",
"video_path": "8urCPp_yhWs.mp4",
"subtitle_path": "8urCPp_yhWs_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 465,
"duration": 9.0,
"view_count": 625947
},
{
"video_id": "Lf7uW4xK2bc",
"question": "In the video, there is a bowl on the table containing a white powdery substance. What color is the liquid being dripped into the bowl by the hand on screen?",
"question_wo_referring_query": "What color is the liquid being dripped into the bowl by the hand on screen?",
"candidates": [
"black",
"yellow",
"purple",
"red"
],
"correct_choice": 3,
"position": [
102
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "Lf7uW4xK2bc_0",
"video_path": "Lf7uW4xK2bc.mp4",
"subtitle_path": "Lf7uW4xK2bc_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 283,
"duration": 7.97,
"view_count": 211504
},
{
"video_id": "Lk9ZihSgjdo",
"question": "In the video, a short-haired man and a bald man are making pizza in the kitchen. What action does the short-haired man do when he appears for the first time?",
"question_wo_referring_query": "What action does the short-haired man do when he appears for the first time in the video?",
"candidates": [
"Shake hands",
"Raise thumbs up",
"Hug",
"Bow"
],
"correct_choice": 1,
"position": [
45
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Lk9ZihSgjdo_0",
"video_path": "Lk9ZihSgjdo.mp4",
"subtitle_path": "Lk9ZihSgjdo_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 780,
"duration": 12.01,
"view_count": 11231399
},
{
"video_id": "NWOF4GqpWaw",
"question": "In the video, a man wearing glasses, a black suit, and a tie is explaining a picture on the right side. After the subtitle mentions 'modern world', what is the next picture that appears?",
"question_wo_referring_query": "What is the next picture that appears?",
"candidates": [
"A picture of a dog",
"A picture with the sun and the sea accompanied by subtitles",
"A monochrome picture",
"A picture of a person standing"
],
"correct_choice": 1,
"position": [
106,
128
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "NWOF4GqpWaw_0",
"video_path": "NWOF4GqpWaw.mp4",
"subtitle_path": "NWOF4GqpWaw_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 2871,
"duration": 13.01,
"view_count": 4382
},
{
"video_id": "TARe4G-SXfk",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a completely black image appears, then a messy room with a blue bed on the desk, and finally a man standing by a window looking out serves as the ending.",
"First, a man appears standing by a window looking out, then a completely black image appears, and finally a messy room with a blue bed on the desk serves as the ending.",
"First, a completely black image appears, then a man standing by a window looking out, and finally a messy room with a blue bed on the desk serves as the ending.",
"First, a man appears standing by a window looking out, then a messy room with a blue bed on the desk appears, and finally a completely black image serves as the ending."
],
"correct_choice": 1,
"position": [
47,
214,
303
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "TARe4G-SXfk_0",
"video_path": "TARe4G-SXfk.mp4",
"subtitle_path": "TARe4G-SXfk_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 71,
"duration": 13.97,
"view_count": 125114
},
{
"video_id": "3Qv2K_0GEE0",
"question": "There is a woman in the room in the video, and there is a green plant in the room. Which caption appears with the green plant at the beginning of the video?",
"question_wo_referring_query": "Which caption appears with the green plant at the beginning of the video?",
"candidates": [
"no",
"yes",
"I grew my first plant when I was about six years old",
"ok"
],
"correct_choice": 2,
"position": [
34,
181
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "3Qv2K_0GEE0_0",
"video_path": "3Qv2K_0GEE0.mp4",
"subtitle_path": "3Qv2K_0GEE0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 30,
"duration": 8.0,
"view_count": 503397
},
{
"video_id": "dxjKdnJFmLs",
"question": "On a small road, a man wearing a dark red short-sleeved T-shirt and a green hat is chatting with a man wearing a black short-sleeved T-shirt and a red hat. When the man wearing the green hat holds a shoe in his right hand, what action does the man wearing the black short-sleeved T-shirt and red hat do with his left hand?",
"question_wo_referring_query": "What action does the man wearing the black short-sleeved T-shirt and red hat do with his left hand?",
"candidates": [
"Throwing the shoe away",
"Holding a red teacup",
"Holding a red teacup and drinking tea",
"Point with his thumb and index finger at the man wearing the green hat"
],
"correct_choice": 3,
"position": [
65
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "dxjKdnJFmLs_0",
"video_path": "dxjKdnJFmLs.mp4",
"subtitle_path": "dxjKdnJFmLs_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 363,
"duration": 10.98,
"view_count": 312966
},
{
"video_id": "bH1Q9AM5vik",
"question": "In front of several straight trees, there are a few pedestrians, including a man wearing a gray knit cap, black-rimmed glasses, and a black-and-white checkered coat. He looks sideways at the camera. What happens to him in the end?",
"question_wo_referring_query": "What happens to him in the end?",
"candidates": [
"He stands in a room with paintings on the wall",
"He sits on a bench in the park",
"He is now giving a lecture on a podium",
"He kneels in a room with paintings on the wall"
],
"correct_choice": 0,
"position": [
229,
264
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "bH1Q9AM5vik_0",
"video_path": "bH1Q9AM5vik.mp4",
"subtitle_path": "bH1Q9AM5vik_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 99,
"duration": 11.0,
"view_count": 43070
},
{
"video_id": "1IpgyV9u5nE",
"question": "It is already late. A man under a dim yellow streetlamp, with his right hand slightly clenched, is looking at a mirror. What clothes is he wearing?",
"question_wo_referring_query": "What clothes is he wearing?",
"candidates": [
"He is wearing a red top",
"He is wearing a red short-sleeved shirt",
"Wearing a light yellow windbreaker",
"Wearing a light yellow top"
],
"correct_choice": 3,
"position": [
24
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "1IpgyV9u5nE_0",
"video_path": "1IpgyV9u5nE.mp4",
"subtitle_path": "1IpgyV9u5nE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 475,
"duration": 8.01,
"view_count": 4780
},
{
"video_id": "V0h7rJShw0g",
"question": "Inside a small car, a woman wearing a pink fleece jacket is sitting in the driver's seat with her left hand fully opened upwards. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She turns her head to wave at a pedestrian",
"She turns her head to look at the rearview mirror",
"She looks at the mirror and is talking",
"She closes her eyes to rest"
],
"correct_choice": 2,
"position": [
77
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "V0h7rJShw0g_0",
"video_path": "V0h7rJShw0g.mp4",
"subtitle_path": "V0h7rJShw0g_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 51,
"duration": 13.01,
"view_count": 15542
},
{
"video_id": "GKgl3aJr5Xw",
"question": "In the purple background, there's a curly-haired man wearing a gray short-sleeved jacket sitting at a desk with a purple plastic bottle of water. When mentioning 'Sunghoon is your favorite person there? Soomi, doesn't this mean that Sunghoon is your favorite person there?', what did he do?",
"question_wo_referring_query": "What did he do?",
"candidates": [
"He spread his hands open on the desk",
"He picked up the water on the desk",
"He put his hands together and looked at the camera",
"He put one hand on his cheek and looked to the right"
],
"correct_choice": 0,
"position": [
79
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "GKgl3aJr5Xw_0",
"video_path": "GKgl3aJr5Xw.mp4",
"subtitle_path": "GKgl3aJr5Xw_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 248,
"duration": 13.0,
"view_count": 15488
},
{
"video_id": "il_wMYlDQ6I",
"question": "In daytime with the sun, the video shows the process of an airplane slowly taking off. At the beginning of the video, where does the airplane on the flat ground appear?",
"question_wo_referring_query": ",where does the airplane on the flat ground appear at the beginning of the video?",
"candidates": [
"in the sky",
"remaining on the flat ground",
"in the ocean",
"on a hill"
],
"correct_choice": 0,
"position": [
3,
172
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "il_wMYlDQ6I_0",
"video_path": "il_wMYlDQ6I.mp4",
"subtitle_path": "il_wMYlDQ6I_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 13,
"duration": 12.0,
"view_count": 35954
},
{
"video_id": "3ZRVpYPFOl0",
"question": "The video explains what Mad Jack did after leaving the military and going to England. In England, he also learned archery. In the footage, he is wearing a blue long-sleeve shirt, holding a bow and arrow on a grassland. What object appeared when he shot the arrow?",
"question_wo_referring_query": "Which object appeared when he shot the arrow?",
"candidates": [
"boat",
"bird",
"arrow",
"target"
],
"correct_choice": 2,
"position": [
122
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2O",
"level": "L1-Perception",
"id": "3ZRVpYPFOl0_0",
"video_path": "3ZRVpYPFOl0.mp4",
"subtitle_path": "3ZRVpYPFOl0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 80,
"duration": 13.0,
"view_count": 7079962
},
{
"video_id": "OLO19ZtdRwQ",
"question": "The video talks about famous tourist attractions. When the subtitle mentions 'ambitious Island boasting beautiful', there are a few people inside a hall, with a couple taking photos. What type of hall appears on the screen?",
"question_wo_referring_query": "What type of hall appears on the screen?",
"candidates": [
"Aquarium",
"Museum",
"Not appeared",
"Gymnasium"
],
"correct_choice": 0,
"position": [
112
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "OLO19ZtdRwQ_0",
"video_path": "OLO19ZtdRwQ.mp4",
"subtitle_path": "OLO19ZtdRwQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 130,
"duration": 8.98,
"view_count": 1157
},
{
"video_id": "gJuOMPiixUA",
"question": "In a white room, there is a man wearing blue clothes and earphones, and a man wearing a grey suit jacket. In the video, which person turns around and clenches their fist?",
"question_wo_referring_query": "In the video, which person turns around and clenches their fist?",
"candidates": [
"The man wearing a grey suit",
"The man wearing a grey suit and earphones",
"Both turned around",
"The man wearing blue clothes and earphones"
],
"correct_choice": 0,
"position": [
180
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "gJuOMPiixUA_0",
"video_path": "gJuOMPiixUA.mp4",
"subtitle_path": "gJuOMPiixUA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 105,
"duration": 8.01,
"view_count": 21166
},
{
"video_id": "bDpgz_2Piqg",
"question": "A man wearing a grey shirt and a black hat is standing in front of a container holding food. There are many people around him, including one at the forefront wearing a black headscarf and another person in a grey shirt holding a plate of food. There's also a man wearing glasses in the corner. What type of hat is this?",
"question_wo_referring_query": "What type of hat is this?",
"candidates": [
"Western cowboy hat",
"Fisherman hat",
"Duck tongue hat",
"Beret"
],
"correct_choice": 2,
"position": [
22
],
"topic_category": "NP-News-Programs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "bDpgz_2Piqg_0",
"video_path": "bDpgz_2Piqg.mp4",
"subtitle_path": "bDpgz_2Piqg_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 116,
"duration": 13.0,
"view_count": 2384
},
{
"video_id": "e_nAKiSutiA",
"question": "In the exhibition hall, there are two paintings hanging on the wall. A woman wearing a black shirt with her glasses on top of her head appears. What is she doing the first time she appears?",
"question_wo_referring_query": "What is this woman doing the first time she appears?",
"candidates": [
"She crosses her arms in front of her chest",
"She is clenching her fists",
"She raises both hands above her head",
"She has her palm up pointing opposite"
],
"correct_choice": 3,
"position": [
10
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "e_nAKiSutiA_0",
"video_path": "e_nAKiSutiA.mp4",
"subtitle_path": "e_nAKiSutiA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 172,
"duration": 13.97,
"view_count": 44661
},
{
"video_id": "H_b5d-rLXJU",
"question": "Inside a room, there is a bookshelf filled with books and a wall covered in wallpaper. A man wearing a red short sleeve shirt is sitting on a gaming chair. When mentioning 'smd, what is caught? Piper, aka resource, is a trace format multi,' what is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"He has both hands raised above his head",
"He has both hands crossed in front of his chest",
"He is doing the V-sign with both hands",
"He is doing the V-sign with one hand and clenching a fist with the other"
],
"correct_choice": 2,
"position": [
20
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2E",
"level": "L1-Perception",
"id": "H_b5d-rLXJU_0",
"video_path": "H_b5d-rLXJU.mp4",
"subtitle_path": "H_b5d-rLXJU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 216,
"duration": 10.01,
"view_count": 153014
},
{
"video_id": "anQ5KFW8Gn4",
"question": "On a grassland with blue sky and white clouds, a soldier riding a horse and holding a long spear leads other soldiers also riding horses. What happens next?",
"question_wo_referring_query": "What happens next?",
"candidates": [
"Two generals lead a row of soldiers holding long spears and shields in training.",
"A general leads a row of soldiers resting on the grassland.",
"A general leads a row of soldiers on horseback in battle.",
"A general leads a row of soldiers holding long spears and shields in training."
],
"correct_choice": 3,
"position": [
118,
219
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "anQ5KFW8Gn4_0",
"video_path": "anQ5KFW8Gn4.mp4",
"subtitle_path": "anQ5KFW8Gn4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 147,
"duration": 12.01,
"view_count": 2563
},
{
"video_id": "Vn-z0qXp9bQ",
"question": "A man wearing a pink hat is sitting next to a computer. He is wearing a brown short-sleeved T-shirt and glasses. After raising his right hand and saying 'with different lenses if you look at,' what action did he take next?",
"question_wo_referring_query": "What action did the man take next?",
"candidates": [
"The man on the left wearing a pink hat is standing with his hands in his pockets looking forward, and the man on the right wearing a black hat is standing with his hands in his pockets looking to the right front.",
"The man on the left wearing a pink hat is standing with his hands in his pockets looking forward, and the man on the right wearing a black hat is standing with his hands in his pockets looking down.",
"The man on the left wearing a pink hat is standing with his hands raised, and the man on the right wearing a black hat is standing with his hands in his pockets looking to the right front.",
"The man on the left wearing a pink hat is standing with his hands in his pockets looking forward, and the man on the right wearing a black hat is leaning against a tree looking up at the sky."
],
"correct_choice": 0,
"position": [
27,
39
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3E",
"level": "L2-Relation",
"id": "Vn-z0qXp9bQ_0",
"video_path": "Vn-z0qXp9bQ.mp4",
"subtitle_path": "Vn-z0qXp9bQ_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 148,
"duration": 10.01,
"view_count": 728
},
{
"video_id": "rVGE6xhfVJE",
"question": "In a glass tray, there is a ball of meat. A pair of gloved hands is manipulating the ball from above. On the side of the tray, there are some flat pieces of meat. What change happened to the ball of meat in the tray?",
"question_wo_referring_query": "What change happened to the ball of meat in the tray?",
"candidates": [
"It turned into three meatballs",
"No change",
"It turned into two meatballs",
"It turned into four meatballs"
],
"correct_choice": 2,
"position": [
54,
227
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "rVGE6xhfVJE_0",
"video_path": "rVGE6xhfVJE.mp4",
"subtitle_path": "rVGE6xhfVJE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 288,
"duration": 11.0,
"view_count": 141
},
{
"video_id": "ZL07dkaoHOg",
"question": "A girl wearing a knitted shirt, with her right hand, took a white piece of clothing from a grey bag. When 'music' is mentioned, what other changes does this bag undergo?",
"question_wo_referring_query": "When 'music' is mentioned, what other changes does this bag undergo?",
"candidates": [
"The color of the bag changes.",
"The bag becomes empty.",
"The bag becomes smaller.",
"The bag becomes bigger."
],
"correct_choice": 1,
"position": [
56,
178
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "ZL07dkaoHOg_0",
"video_path": "ZL07dkaoHOg.mp4",
"subtitle_path": "ZL07dkaoHOg_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 841,
"duration": 13.0,
"view_count": 33475
},
{
"video_id": "g_kziK-UOSU",
"question": "A car is stuck in a crevice between mountain rocks, severely damaged and emitting white smoke upwards. A person wearing blue clothing approaches the accident scene. What did this person in blue do afterwards?",
"question_wo_referring_query": "What did the person in blue do afterwards?",
"candidates": [
"Hold a fire extinguisher in their left hand",
"Hold a fire extinguisher in their right hand",
"Climb the mountain holding something in their left hand",
"Climb the mountain holding something in their right hand"
],
"correct_choice": 3,
"position": [
204,
241
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "g_kziK-UOSU_0",
"video_path": "g_kziK-UOSU.mp4",
"subtitle_path": "g_kziK-UOSU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 2554,
"duration": 10.01,
"view_count": 27412
},
{
"video_id": "VPcdtoldWCA",
"question": "In the cold winter, a simple kitchen made of wood was built; the kitchen has many food ingredients piled up. Which ingredient is currently being cut?",
"question_wo_referring_query": "Which ingredient is currently being cut?",
"candidates": [
"Onion",
"Tomato",
"Potato",
"Chili Pepper",
"Pumpkin"
],
"correct_choice": 0,
"position": [
94
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "VPcdtoldWCA_0",
"video_path": "VPcdtoldWCA.mp4",
"subtitle_path": "VPcdtoldWCA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 567,
"duration": 8.01,
"view_count": 2392783
},
{
"video_id": "K24dFfIM0gI",
"question": "On the narrow mountain roads where the road is blocked tightly with crawling traffic, filled with all kinds of trucks, which subtitles appear simultaneously when two rows of trucks are tightly connected on the congested road?",
"question_wo_referring_query": "When two rows of trucks are tightly connected on the congested road, which subtitles appear simultaneously?",
"candidates": [
"disputes Pakistan sometimes shuts is ",
"Crossings with Afghanistan causing",
"Economically and it's not only on its",
"Millions is sulfur"
],
"correct_choice": 2,
"position": [
179
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "K24dFfIM0gI_0",
"video_path": "K24dFfIM0gI.mp4",
"subtitle_path": "K24dFfIM0gI_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 91,
"duration": 10.0,
"view_count": 78728
},
{
"video_id": "qksR2Zvd-FM",
"question": "A bullet hits the green tank and sparks fly. Who is the character that first appears after Bai mentions 'period in World War II'?",
"question_wo_referring_query": "Who is the character that first appears after Bai mentions 'period in World War II'?",
"candidates": [
"The soldier wearing a red headscarf",
"The soldier wearing a black headscarf and green camouflage uniform",
"The soldier without a helmet, with short blond hair, wearing a green camouflage uniform",
"The soldier wearing a green helmet, black scarf, and camouflage uniform",
"The soldier wearing a green helmet and gray clothes"
],
"correct_choice": 0,
"position": [
1327,
1341,
2062,
3745,
14695
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "qksR2Zvd-FM_0",
"video_path": "qksR2Zvd-FM.mp4",
"subtitle_path": "qksR2Zvd-FM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 917.33,
"view_count": 1858932
},
{
"video_id": "qksR2Zvd-FM",
"question": "In the distance, there are lush plants. On the right, two soldiers are operating long-barreled weapons in a bunker. On the left, two soldiers are standing face to face. In front of the cannon, there are military vehicles and tanks. After the subtitle 'of damage over long distance' appears, who is the first person to appear?",
"question_wo_referring_query": "Who is the first person to appear?",
"candidates": [
"A soldier dressed in military green camouflage, wearing a black headgear.",
"A man wearing camouflage clothes, with short khaki hair.",
"A man with short gray hair, without a hat, exposing his forehead, wearing dark brown shoes.",
"A soldier with short khaki hair, wearing military green camouflage without a helmet.",
"A soldier wearing a green steel helmet, dressed in khaki military uniform."
],
"correct_choice": 2,
"position": [
5277,
5298,
11237,
12850,
14349
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "qksR2Zvd-FM_1",
"video_path": "qksR2Zvd-FM.mp4",
"subtitle_path": "qksR2Zvd-FM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 917.33,
"view_count": 1858932
},
{
"video_id": "qksR2Zvd-FM",
"question": "In the distance, there are high-rise buildings and green vegetation. A black car is underneath the vegetation. On the left side of the screen, there is a soldier wearing a black hood. In the middle, there is a soldier wearing a green hood. On the right side, there is a soldier wearing a green helmet. After the caption mentions 'this weapon can be nowadays,' who is the first person to appear?",
"question_wo_referring_query": "Who is the first person to appear?",
"candidates": [
"A soldier wearing a gray knit cap, a backpack, and a dark yellow uniform.",
"A soldier with a white helmet that has glasses on it.",
"A soldier in an army green camo outfit wearing a black hood.",
"A man wearing a camo outfit with short sandy hair."
],
"correct_choice": 1,
"position": [
10368,
10380,
9232,
14349,
19767
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "qksR2Zvd-FM_2",
"video_path": "qksR2Zvd-FM.mp4",
"subtitle_path": "qksR2Zvd-FM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 917.33,
"view_count": 1858932
},
{
"video_id": "rwL_XPw46zQ",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a soldier wearing a green-yellow hat appears and is giving ammo to the caravan. Then, a section of a gun appears with a hand holding a handgun and shooting at the section. Finally, a soldier wearing a steel helmet appears and is shooting towards the right side of the screen with his right hand holding the gun.",
"First, a soldier wearing a green-yellow hat appears and is giving ammo to the caravan. Then, a soldier wearing a steel helmet appears and is shooting towards the right side of the screen with his right hand holding the gun. Finally, a section of a gun appears with a hand holding a handgun and shooting at the section.",
"First, a section of a gun appears with a hand holding a handgun and shooting at the section. Then, a soldier wearing a steel helmet appears and is shooting towards the right side of the screen with his right hand holding the gun. Finally, a soldier wearing a green-yellow hat appears and is giving ammo to the caravan.",
"First, a soldier wearing a steel helmet appears and is shooting towards the right side of the screen with his right hand holding the gun. Then, a soldier wearing a green-yellow hat appears and is giving ammo to the caravan. Finally, a section of a gun appears with a hand holding a handgun and shooting at the section.",
"First, a soldier wearing a steel helmet appears and is shooting towards the right side of the screen with his right hand holding the gun. Then, a section of a gun appears with a hand holding a handgun and shooting at the section. Finally, a soldier wearing a green-yellow hat appears and is giving ammo to the caravan."
],
"correct_choice": 1,
"position": [
2341,
5105,
6620
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "rwL_XPw46zQ_0",
"video_path": "rwL_XPw46zQ.mp4",
"subtitle_path": "rwL_XPw46zQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1623.21,
"view_count": 2452996
},
{
"video_id": "rwL_XPw46zQ",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a soldier wearing a hat with an eagle insignia stands on a platform with his hand on his hip, then three soldiers with rifles appear on the battlefield, and finally a Japanese soldier is holding a handgun with both hands raised.",
"First, a Japanese soldier is holding a handgun with both hands raised, then three soldiers with rifles appear on the battlefield, and finally a soldier wearing a hat with an eagle insignia stands on a platform with his hand on his hip.",
"First, three soldiers with rifles appear on the battlefield, then a soldier wearing a hat with an eagle insignia stands on a platform with his hand on his hip, and finally a Japanese soldier is holding a handgun with both hands raised.",
"First, a soldier wearing a hat with an eagle insignia stands on a platform with his hand on his hip, then a Japanese soldier is holding a handgun with both hands raised, and finally three soldiers with rifles appear on the battlefield.",
"First, a Japanese soldier is holding a handgun with both hands raised, then a soldier wearing a hat with an eagle insignia stands on a platform with his hand on his hip, and finally three soldiers with rifles appear on the battlefield."
],
"correct_choice": 4,
"position": [
15039,
17682,
21440
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "rwL_XPw46zQ_1",
"video_path": "rwL_XPw46zQ.mp4",
"subtitle_path": "rwL_XPw46zQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1623.21,
"view_count": 2452996
},
{
"video_id": "rwL_XPw46zQ",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a soldier is crawling and shooting with a light machine gun. Then, three soldiers are standing in the forest holding light machine guns. Finally, a parachute lands with military supplies.",
"First, a parachute lands with military supplies. Then, three soldiers are standing in the forest holding light machine guns. Finally, a soldier is crawling and shooting with a light machine gun.",
"First, a parachute lands with military supplies. Then, a soldier is crawling and shooting with a light machine gun. Finally, three soldiers are standing in the forest holding light machine guns.",
"First, three soldiers are standing in the forest holding light machine guns. Then, a parachute lands with military supplies. Finally, a soldier is crawling and shooting with a light machine gun.",
"First, a soldier is crawling and shooting with a light machine gun. Then, a parachute lands with military supplies. Finally, three soldiers are standing in the forest holding light machine guns."
],
"correct_choice": 4,
"position": [
24743,
31902,
37296
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "rwL_XPw46zQ_2",
"video_path": "rwL_XPw46zQ.mp4",
"subtitle_path": "rwL_XPw46zQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1623.21,
"view_count": 2452996
},
{
"video_id": "VFXJnbnN5ro",
"question": "In a space setting, there is a yellow sun design in the center with the word 'Emission' below. In which of the following scenarios has this sun design appeared?",
"question_wo_referring_query": "In which of the following scenarios has this sun design appeared?",
"candidates": [
"In the starry sky from a side view, there is a glaring white light in the center, surrounded by a halo of white light.",
"In a red cloud-like background, there is a cluster of bright light spots emitting strong light in the center.",
"In a black night sky, there is a purple circular object emitting purple light, with the words 'Protoplanetary Nebula' inscribed on it.",
"In a space background, there is a yellow design on the left side with radiating lines around it, and on the right side, there is a circular design with light blocked by something, with a yellow line attached to the sun.",
"In a blue background, the center is sparkling with densely packed light dots."
],
"correct_choice": 3,
"position": [
3227,
5603,
15804,
21224,
25523
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "VFXJnbnN5ro_0",
"video_path": "VFXJnbnN5ro.mp4",
"subtitle_path": "VFXJnbnN5ro_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1214.35,
"view_count": 379611
},
{
"video_id": "VFXJnbnN5ro",
"question": "Against a blue background, there's an image of a green nebula in the middle. The wallpaper's cover is a computer screenshot, and the page contains many computer desktop icons. In which of the following scenes does this green nebula image appear?",
"question_wo_referring_query": "In which of the following scenes does this green nebula image appear?",
"candidates": [
"In space, there's a green mist-like star in the middle, with some white star points in the green background.",
"In the middle of the screen, there's a selfie of a couple. The boy has his eyes closed, with a yellow backpack strap on his right shoulder, and the girl is closely leaning against the boy's face.",
"The screen shows a semi-circle with a yellow light along its edge, and the words 'Planetary Nebula' printed in the middle.",
"In a hazy, dark gray background, a rainbow-colored drop-like object is in the middle of the screen, with the label 'SN 1604' in the lower-left corner.",
"In a black starry background with many star points, there's a reddish cloud-like object in the starry sky, with the label 'NGC 1976' in the lower-left corner."
],
"correct_choice": 0,
"position": [
506,
1508,
791,
16095,
22910
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "VFXJnbnN5ro_1",
"video_path": "VFXJnbnN5ro.mp4",
"subtitle_path": "VFXJnbnN5ro_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1214.35,
"view_count": 379611
},
{
"video_id": "VFXJnbnN5ro",
"question": "Against a black background, there is an object on the left emitting yellow-green light, consisting of two connected dots, dispersing light upwards and downwards. In the middle, there is a large bright spot radiating red-yellow light in all directions. On the right, there is an arched light spot emitting red light in all directions. Can you tell in which of the following scenes the object on the left emitting yellow-green light has appeared before?",
"question_wo_referring_query": "Can you tell in which of the following scenes the object on the left emitting yellow-green light has appeared before?",
"candidates": [
"A sphere emitting red-blue light, with the phrase 'Reflection Nebula' written at the top.",
"Against a black background, there are two connected dots emitting yellow-green light upwards and downwards, with the word 'Ejected' written above.",
"A blue spherical object radiating strong light with two rings of red light around it, imprinted with 'NGC 6720' at the bottom left of the screen.",
"A white spherical object emitting white light in the pitch-black night sky."
],
"correct_choice": 1,
"position": [
11229,
14571,
7901,
3698,
17799
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SOS",
"level": "L2-Relation",
"id": "VFXJnbnN5ro_2",
"video_path": "VFXJnbnN5ro.mp4",
"subtitle_path": "VFXJnbnN5ro_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1214.35,
"view_count": 379611
},
{
"video_id": "zJ2uPZfkYMk",
"question": "The white window frame holds a transparent glass that reflects the snowy landscape outside. In front of the window, there are porcelain items, a red candle, and a white candle. A person walks in from the right side of the screen. When the shot changes to a woman wearing a green linen dress facing the mirror, what change occurs to the red candle on the left side of the woman?",
"question_wo_referring_query": "What change occurs to the red candle on the left side of the woman?",
"candidates": [
"The red candle becomes longer",
"The red candle turns black",
"The red candle is burned out",
"The red candle is ignited",
"The red candle burns down and becomes shorter"
],
"correct_choice": 3,
"position": [
276,
2964
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "zJ2uPZfkYMk_0",
"video_path": "zJ2uPZfkYMk.mp4",
"subtitle_path": "zJ2uPZfkYMk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1460.59,
"view_count": 394634
},
{
"video_id": "zJ2uPZfkYMk",
"question": "In a room, there is a fallen mirror at the back, ceramic tableware in the cabinet, and three white candles and one red candle standing on the right side of the screen. A blonde woman wearing earrings and a necklace is sitting in the middle. When the scene changes to inside a small wooden cabin and the woman stands facing the mirror, what changes occur to her outfit?",
"question_wo_referring_query": "What changes occur to her outfit?",
"candidates": [
"A gray shirt changes into a green apron",
"A green outer coat changes into a black dress",
"A black T-shirt changes into a green dress",
"A green jacket changes into a blue apron",
"A gray inner shirt and green outer coat change into a green tank dress"
],
"correct_choice": 4,
"position": [
18557,
26156
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "zJ2uPZfkYMk_1",
"video_path": "zJ2uPZfkYMk.mp4",
"subtitle_path": "zJ2uPZfkYMk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1460.59,
"view_count": 394634
},
{
"video_id": "zJ2uPZfkYMk",
"question": "On a ground covered with yellow leaves, there is a golden-haired lady squatting and picking up the yellow leaves. When the scene changes to the lady, facing the left side of the screen, wearing a sunhat and collecting, what changes occur to the lady's outfit?",
"question_wo_referring_query": "What changes occur to the lady's outfit?",
"candidates": [
"The black jacket becomes a burgundy-colored robe.",
"The green dress becomes a blue T-shirt.",
"The black hoodie becomes a green jacket.",
"The burgundy-colored outerwear with a hat becomes a green T-shirt.",
"The burgundy-colored robe becomes a blue vest."
],
"correct_choice": 3,
"position": [
12846,
12967
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "zJ2uPZfkYMk_2",
"video_path": "zJ2uPZfkYMk.mp4",
"subtitle_path": "zJ2uPZfkYMk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1460.59,
"view_count": 394634
},
{
"video_id": "mqtJErix0ss",
"question": "In a room with a backdrop of Earth and outer space, there are books and a YouTube trophy on the left wall, character images and space pictures on the right wall, and a satellite model on the desk in the lower left corner. A man wearing a white T-shirt is sitting in the center with his hands clasped together. When the subtitle 'around the Earth not fly off into space' appears, what change happens to his hand movements?",
"question_wo_referring_query": "What change happens to his hand movements?",
"candidates": [
"The palms face each other and change to the left hand raised above the head",
"The palms face each other and change to both hands clenched",
"Both hands are waving",
"Both hands extend forward",
"Both hands are crossed in front of the chest"
],
"correct_choice": 0,
"position": [
11941,
11955
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "mqtJErix0ss_0",
"video_path": "mqtJErix0ss.mp4",
"subtitle_path": "mqtJErix0ss_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 981.64,
"view_count": 151032
},
{
"video_id": "mqtJErix0ss",
"question": "In a dark environment, an old-fashioned rotary television is playing a scene where a missile emits light and rushes upwards. What change occurs to the missile when the subtitle 'it seems that whenever they were' appears?",
"question_wo_referring_query": "What change occurs to the missile?",
"candidates": [
"The missile turns into a wooden head",
"The missile turns into plastic",
"The missile splits into two halves",
"The missile turns into an ice block",
"The missile explodes into smoke"
],
"correct_choice": 4,
"position": [
14108,
14122
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "mqtJErix0ss_1",
"video_path": "mqtJErix0ss.mp4",
"subtitle_path": "mqtJErix0ss_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 981.64,
"view_count": 151032
},
{
"video_id": "mqtJErix0ss",
"question": "In the sky, a rocket is blazing and flying swiftly away from the Earth, with the blue and white surface of the Earth as the background. What change occurs to the rocket when the subtitle 'camera if you watch a NASA launch or a' appears?",
"question_wo_referring_query": "What change occurs to the rocket?",
"candidates": [
"The flames at the bottom of the rocket extinguish",
"Water sprays from the bottom of the rocket",
"The rocket splits into two halves",
"The surface of the rocket turns black",
"The rocket begins to split"
],
"correct_choice": 0,
"position": [
16213,
16236
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "mqtJErix0ss_2",
"video_path": "mqtJErix0ss.mp4",
"subtitle_path": "mqtJErix0ss_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 981.64,
"view_count": 151032
},
{
"video_id": "sEiyR7-0FOA",
"question": "There are two people holding a book and a pen, wearing khaki outerwear and hats. There is a tank between them. On the tank, there is a person wearing black clothes. When the subtitle 'the next day two Italian War' appears, what color is the book the person on the left is holding?",
"question_wo_referring_query": "What color is the book that the person on the left is holding?",
"candidates": [
"red",
"green",
"yellow",
"black",
"white"
],
"correct_choice": 1,
"position": [
16438
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "sEiyR7-0FOA_0",
"video_path": "sEiyR7-0FOA.mp4",
"subtitle_path": "sEiyR7-0FOA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2853.83,
"view_count": 656796
},
{
"video_id": "sEiyR7-0FOA",
"question": "Two people are standing on an orange ground. On the left is a person wearing a helmet and holding a knife and shield. On the right is a person with black hair, also holding a knife and shield. When the subtitle 'threat of death or severe injury ever' appears, what is the shape of the shield held by the person on the left?",
"question_wo_referring_query": "What is the shape of the shield held by the person on the left?",
"candidates": [
"Square",
"Pentagon",
"Rectangle",
"Circle",
"Parallelogram"
],
"correct_choice": 3,
"position": [
44600
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "sEiyR7-0FOA_1",
"video_path": "sEiyR7-0FOA.mp4",
"subtitle_path": "sEiyR7-0FOA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2853.83,
"view_count": 656796
},
{
"video_id": "sEiyR7-0FOA",
"question": "Two soldiers are standing next to a fallen tank, one soldier is holding a bucket, and there are 6 buckets on the ground next to the tank. When the subtitle 'misinformation is extremely effective in' appears, what color are the buckets on the ground?",
"question_wo_referring_query": "What color are the buckets on the ground?",
"candidates": [
"orange",
"white",
"red",
"pink",
"green"
],
"correct_choice": 1,
"position": [
67004
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "sEiyR7-0FOA_2",
"video_path": "sEiyR7-0FOA.mp4",
"subtitle_path": "sEiyR7-0FOA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2853.83,
"view_count": 656796
},
{
"video_id": "LlqsCCa6y58",
"question": "On the left side of the screen, it says 'Showcase your artwork without all the hard work,' and on the right side, there is a phone with a purple case. What happens in the middle of the screen when the subtitle 'Advanced smart tools and unlimited customizations smartest makes marketing your own art easier so' appears?",
"question_wo_referring_query": "What happens in the middle of the screen?",
"candidates": [
"The left thumb taps the phone screen",
"The right index finger taps the phone screen",
"The right thumb taps the phone screen",
"The right middle finger slides on the phone screen",
"The right index finger turns off the phone's power button"
],
"correct_choice": 2,
"position": [
4774
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "LlqsCCa6y58_0",
"video_path": "LlqsCCa6y58.mp4",
"subtitle_path": "LlqsCCa6y58_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1810.23,
"view_count": 10392
},
{
"video_id": "LlqsCCa6y58",
"question": "There are some black ink marks and pen drawings on the white paper. When the subtitle \"condens with his profound belief in the emotional power of colors and forms compared painting to\" appears, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The right hand is holding a pen and drawing quickly.",
"The left hand is holding a pen and swirling ink.",
"The right hand is holding a pen and practicing calligraphy.",
"The right hand is holding a pen and swirling ink."
],
"correct_choice": 0,
"position": [
27451
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "LlqsCCa6y58_1",
"video_path": "LlqsCCa6y58.mp4",
"subtitle_path": "LlqsCCa6y58_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1810.23,
"view_count": 10392
},
{
"video_id": "LlqsCCa6y58",
"question": "In a black and white scene, a person is bending over, with a white skull next to them. In the distance, there are many mounds of earth. When the subtitle mentions \"American modernism her art reflects the diverse Landscapes of her homes from Wisconsin to New York\", what does this person do?",
"question_wo_referring_query": "What does this person do?",
"candidates": [
"Grabs a handful of sand",
"Picks up the skull from the ground",
"Dusts off the ash from their body",
"Picks up clothes from the ground",
"Raises their hand to block the wind and sand"
],
"correct_choice": 1,
"position": [
37076
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "LlqsCCa6y58_2",
"video_path": "LlqsCCa6y58.mp4",
"subtitle_path": "LlqsCCa6y58_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1810.23,
"view_count": 10392
},
{
"video_id": "mS1QPVgBDQo",
"question": "During the conversation with an olive background, there is an audience member visible in the lower part of the screen. At the top, there are three women, with the woman in the middle currently speaking. The woman on the left has a white book resting on her lap. When the woman on the left picks up the microphone and starts talking, what happens to the white book?",
"question_wo_referring_query": "What happens to the white book?",
"candidates": [
"It is placed vertically on the woman's lap",
"It disappears from the screen",
"It becomes green",
"It turns into a circle",
"It is opened up"
],
"correct_choice": 0,
"position": [
9597,
9758
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SAA",
"level": "L2-Relation",
"id": "mS1QPVgBDQo_0",
"video_path": "mS1QPVgBDQo.mp4",
"subtitle_path": "mS1QPVgBDQo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1950.72,
"view_count": 5086
},
{
"video_id": "mS1QPVgBDQo",
"question": "In a conversation with an olive-colored background, there is an audience below with heads exposed, and above are three women. The woman on the left is holding a microphone and speaking. On the left woman's lap is a vertically placed white book. What change occurs to the white book when the woman on the left puts down the microphone?",
"question_wo_referring_query": "What change occurs to the white book?",
"candidates": [
"It turned into a circle",
"A corner was missing",
"It became thicker",
"It was opened",
"It turned into two white books"
],
"correct_choice": 3,
"position": [
15811,
16242
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SAA",
"level": "L2-Relation",
"id": "mS1QPVgBDQo_1",
"video_path": "mS1QPVgBDQo.mp4",
"subtitle_path": "mS1QPVgBDQo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1950.72,
"view_count": 5086
},
{
"video_id": "mS1QPVgBDQo",
"question": "In the interview with a green background, at the bottom, there is a visible audience, and at the top, there are three women sitting and talking. The woman in the middle is wearing a red and blue checkered shirt and has white hair. During the performance later, what change occurs to the woman in the middle, who is wearing a red and blue checkered shirt and has white hair?",
"question_wo_referring_query": "What change occurs to the woman in the middle, who is wearing a red and blue checkered shirt and has white hair?",
"candidates": [
"She put on a white hat",
"She put on a black hat",
"Her hair turned black",
"She tied a black ribbon around her head",
"She changed into a white suit"
],
"correct_choice": 3,
"position": [
23034,
23344
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SAA",
"level": "L2-Relation",
"id": "mS1QPVgBDQo_2",
"video_path": "mS1QPVgBDQo.mp4",
"subtitle_path": "mS1QPVgBDQo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1950.72,
"view_count": 5086
},
{
"video_id": "hg2Q_O5b9w4",
"question": "In the PPT slide with a white background, there is a group of words enclosed in a rounded rectangle at the bottom. In the middle, there is a doodle composed of letters in blue parentheses, blue, and green. What change occurs to the doodle when the subtitle 'that if I put in this representation' appears?",
"question_wo_referring_query": "What change occurs to the doodle?",
"candidates": [
"Moved to the far left",
"Entirely turned black",
"Covered by a yellow overlay",
"Moved to the far right",
"Enlarged"
],
"correct_choice": 2,
"position": [
9661,
9715
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TAA",
"level": "L2-Relation",
"id": "hg2Q_O5b9w4_0",
"video_path": "hg2Q_O5b9w4.mp4",
"subtitle_path": "hg2Q_O5b9w4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1724.54,
"view_count": 11490
},
{
"video_id": "hg2Q_O5b9w4",
"question": "In the white-background PPT screen, there is black text at the bottom, yellow-highlighted text on the left, and in the middle, an illustration composed of a blue rectangle, a green circle, and a blue triangle. When the subtitle 'spaceship this and this and so right and' appears, what changes occur to the illustration?",
"question_wo_referring_query": "What changes occur to the illustration?",
"candidates": [
"Covered by a blue highlight",
"A star appeared in the middle",
"Got bigger",
"Got smaller",
"A red dot appeared in the middle"
],
"correct_choice": 0,
"position": [
14021,
14104
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TAA",
"level": "L2-Relation",
"id": "hg2Q_O5b9w4_1",
"video_path": "hg2Q_O5b9w4.mp4",
"subtitle_path": "hg2Q_O5b9w4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1724.54,
"view_count": 11490
},
{
"video_id": "hg2Q_O5b9w4",
"question": "In the PPT slide with white background and black text, on the left side, there are three shapes made up of three overlapping blue squares each, and the rest of the screen is text. When the caption 'here somewhere the anchor is cropped' appears, what changes occur to the shapes?",
"question_wo_referring_query": "What changes occur to the shapes?",
"candidates": [
"Turned green",
"Covered by a yellow overlay",
"Got bigger",
"Turned yellow",
"Got smaller"
],
"correct_choice": 2,
"position": [
35209,
35487
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TAA",
"level": "L2-Relation",
"id": "hg2Q_O5b9w4_2",
"video_path": "hg2Q_O5b9w4.mp4",
"subtitle_path": "hg2Q_O5b9w4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1724.54,
"view_count": 11490
},
{
"video_id": "CGngv8vTQOs",
"question": "On a green grass field, there are some logos on the side. In the middle, there is a blue disk and a gray one, along with three cylindrical wrapped objects. What is it doing?",
"question_wo_referring_query": "What is it doing?",
"candidates": [
"Spiraling upwards",
"Flying to the right",
"Flying straight up",
"Flying to the left",
"Falling vertically down"
],
"correct_choice": 2,
"position": [
5957
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "CGngv8vTQOs_0",
"video_path": "CGngv8vTQOs.mp4",
"subtitle_path": "CGngv8vTQOs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 922.48,
"view_count": 2188
},
{
"video_id": "CGngv8vTQOs",
"question": "In the sky, there is a blue beam of light in the distance, green grass below, deep blue waters, and a gray rocket flying upwards, with blue flames shooting out below it. On either side of the rocket are red-topped cylinders. What are the cylinders doing?",
"question_wo_referring_query": "What are the cylinders doing?",
"candidates": [
"Breaking into pieces",
"Flying out to the right",
"Emitting blue flames",
"Spinning downwards",
"Flying out to the left"
],
"correct_choice": 3,
"position": [
14859
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "CGngv8vTQOs_1",
"video_path": "CGngv8vTQOs.mp4",
"subtitle_path": "CGngv8vTQOs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 922.48,
"view_count": 2188
},
{
"video_id": "CGngv8vTQOs",
"question": "Against a starry backdrop, there is a shrunken image of the Earth in the upper left corner with a gray rocket on it. Alongside the image, the word 'Playlist' is written in red text. What is the gray rocket doing?",
"question_wo_referring_query": "What is the gray rocket doing?",
"candidates": [
"Flying towards the Sun",
"Flying towards Jupiter",
"Rotating with its head and tail separated",
"Flying towards Uranus",
"Flying downward"
],
"correct_choice": 2,
"position": [
21717
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "CGngv8vTQOs_2",
"video_path": "CGngv8vTQOs.mp4",
"subtitle_path": "CGngv8vTQOs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 922.48,
"view_count": 2188
},
{
"video_id": "8Gl6iy7OEM4",
"question": "In a room filled with photos, two men of different skin tones are having a discussion. One man says 'malayta ah mala,' while the other man is holding a yellow piece of paper with some drawings and words on it. Which man is holding the yellow paper in front of the camera?",
"question_wo_referring_query": "Which man is holding the yellow paper in front of the camera?",
"candidates": [
"The man wearing a black long-sleeve shirt",
"The man wearing black-framed glasses",
"The man with dark skin",
"The man wearing a black short-sleeve shirt",
"The man wearing a yellow short-sleeve shirt"
],
"correct_choice": 4,
"position": [
6171
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "8Gl6iy7OEM4_0",
"video_path": "8Gl6iy7OEM4.mp4",
"subtitle_path": "8Gl6iy7OEM4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1598.27,
"view_count": 557690
},
{
"video_id": "8Gl6iy7OEM4",
"question": "In a room filled with pictures, two men are discussing some topics. There's a picture with a river at the bottom left corner of the screen. One of the men is saying 'a pew river the whole'. Which man is saying 'a pew river the whole'?",
"question_wo_referring_query": "Which man is saying 'a pew river the whole'?",
"candidates": [
"The man with fair skin",
"The man wearing a black short-sleeve shirt",
"The man wearing an olive hat",
"The man wearing a white short-sleeve shirt",
"The man wearing a yellow short-sleeve shirt"
],
"correct_choice": 1,
"position": [
14346
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "8Gl6iy7OEM4_1",
"video_path": "8Gl6iy7OEM4.mp4",
"subtitle_path": "8Gl6iy7OEM4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1598.27,
"view_count": 557690
},
{
"video_id": "8Gl6iy7OEM4",
"question": "In front of a map, a man with short blonde hair wearing a black short-sleeved shirt is working out in the video. Behind him is a rice-white door. What equipment is the man with short blonde hair holding?",
"question_wo_referring_query": "What equipment is the man with short blonde hair holding?",
"candidates": [
"black resistance band",
"red hand gripper",
"white kettlebell",
"black jump rope",
"black kettlebell"
],
"correct_choice": 0,
"position": [
29540
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "8Gl6iy7OEM4_2",
"video_path": "8Gl6iy7OEM4.mp4",
"subtitle_path": "8Gl6iy7OEM4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1598.27,
"view_count": 557690
},
{
"video_id": "Bwnkg6GbXwU",
"question": "In a screen with a blue border, there is a highway. On the highway, many cars are parked. Two horses of different colors appear on the highway. What did the yellow-brown horse do the first time it appeared?",
"question_wo_referring_query": "What did the yellow-brown horse do the first time it appeared?",
"candidates": [
"Was hit by a white car",
"Ran against the direction of the traffic flow",
"Crossed the highway",
"Was hit by a black car",
"Ran in the direction of the traffic flow"
],
"correct_choice": 1,
"position": [
2052
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Bwnkg6GbXwU_0",
"video_path": "Bwnkg6GbXwU.mp4",
"subtitle_path": "Bwnkg6GbXwU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1299.47,
"view_count": 678259
},
{
"video_id": "Bwnkg6GbXwU",
"question": "In front of a white wall, a woman wearing black clothes and a white headscarf appears. What did she do the first time she appeared?",
"question_wo_referring_query": "What did she do the first time she appeared?",
"candidates": [
"Lowering her head",
"Adjusting the white headscarf",
"Covering her mouth with both hands",
"Speaking to the camera",
"Wiping tears with her hand"
],
"correct_choice": 3,
"position": [
8364
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Bwnkg6GbXwU_1",
"video_path": "Bwnkg6GbXwU.mp4",
"subtitle_path": "Bwnkg6GbXwU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1299.47,
"view_count": 678259
},
{
"video_id": "Bwnkg6GbXwU",
"question": "In front of a few houses, three men stand before a mirror. One man is wearing a green hat, and another man is wearing black glasses. What did the man wearing black glasses do the first time he appeared?",
"question_wo_referring_query": "What did the man wearing black glasses do the first time he appeared?",
"candidates": [
"Touched his head with his right hand.",
"Looked at the man wearing the green hat.",
"Touched his glasses with his right hand.",
"Lowered his head to look at a script, occasionally raising his head to speak to the mirror.",
"Covered his face with both hands."
],
"correct_choice": 3,
"position": [
11299
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Bwnkg6GbXwU_2",
"video_path": "Bwnkg6GbXwU.mp4",
"subtitle_path": "Bwnkg6GbXwU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1299.47,
"view_count": 678259
},
{
"video_id": "-eRimFrm6kQ",
"question": "The room filled with photographs has a white door, with a map next to it. Which character appears first in the room?",
"question_wo_referring_query": "Which character appears first in the room?",
"candidates": [
"A man wearing a blue T-shirt with a crew cut",
"A man wearing a white T-shirt with short hair",
"A woman with long blonde hair wearing a pink top",
"A woman with long black hair wearing a white dress",
"A man wearing a black T-shirt with curly hair"
],
"correct_choice": 1,
"position": [
369,
470
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "-eRimFrm6kQ_0",
"video_path": "-eRimFrm6kQ.mp4",
"subtitle_path": "-eRimFrm6kQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1540.88,
"view_count": 687388
},
{
"video_id": "-eRimFrm6kQ",
"question": "Which country's flag appears first in the video?",
"question_wo_referring_query": "Which country's flag appears first in the video?",
"candidates": [
"Singapore",
"Kiribati",
"India",
"St. Kitts",
"Barbados"
],
"correct_choice": 4,
"position": [
468,
582
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "-eRimFrm6kQ_1",
"video_path": "-eRimFrm6kQ.mp4",
"subtitle_path": "-eRimFrm6kQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1540.88,
"view_count": 687388
},
{
"video_id": "-eRimFrm6kQ",
"question": "In a room full of photos, a man wearing a white T-shirt, with short hair, holding a black microphone, which country does he mention first?",
"question_wo_referring_query": "Which country does he mention first?",
"candidates": [
"Serbia",
"Barbados",
"Pakistan",
"Balkan",
"India"
],
"correct_choice": 0,
"position": [
44,
187
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "-eRimFrm6kQ_2",
"video_path": "-eRimFrm6kQ.mp4",
"subtitle_path": "-eRimFrm6kQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1540.88,
"view_count": 687388
},
{
"video_id": "acAWfzV__XI",
"question": "On a white screen, there are many letters written, and at the bottom, there is also a blue arrow drawn. A hand is holding a pen. After the text 'topological sort' appears on the screen, what does the hand holding the pen do?",
"question_wo_referring_query": "What does the hand holding the pen do?",
"candidates": [
"Draws a blue question mark on the white screen",
"Draws a blue arrow on the white screen",
"Draws an orange arrow on the white screen",
"Writes the black text 'Not' on the white screen",
"Draws a black line on the white screen"
],
"correct_choice": 3,
"position": [
5652,
5687,
5741,
5814,
5886,
6188
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T3E",
"level": "L2-Relation",
"id": "acAWfzV__XI_0",
"video_path": "acAWfzV__XI.mp4",
"subtitle_path": "acAWfzV__XI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.0,
"view_count": 79
},
{
"video_id": "acAWfzV__XI",
"question": "On a white screen, there are many formulas written, at the top there is blue text, below there is a black circle connected by a black arrow, and to the left there is also a blue arrow. After the subtitle \"empty queue\" appears, what happens to the blue arrow?",
"question_wo_referring_query": "What happens to the blue arrow?",
"candidates": [
"Moves down",
"Disappears",
"Moves up",
"Becomes wider",
"Becomes narrower"
],
"correct_choice": 0,
"position": [
16402,
16453,
16809,
17162,
18079,
20246
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T3E",
"level": "L2-Relation",
"id": "acAWfzV__XI_1",
"video_path": "acAWfzV__XI.mp4",
"subtitle_path": "acAWfzV__XI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.0,
"view_count": 79
},
{
"video_id": "acAWfzV__XI",
"question": "On a white screen, many formulas are written, and at the top, there is a blue title. On the formula on the left side, there is a gray box. What happens to the gray box after the subtitle \"complexity is Big O of V\" appears?",
"question_wo_referring_query": ", what happens to the gray box?",
"candidates": [
"Becomes more concise",
"Contents in the frame decrease",
"Just gets smaller",
"Moves down and gets larger",
"Moves down and gets smaller"
],
"correct_choice": 3,
"position": [
23557,
23595,
24660
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T3E",
"level": "L2-Relation",
"id": "acAWfzV__XI_2",
"video_path": "acAWfzV__XI.mp4",
"subtitle_path": "acAWfzV__XI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.0,
"view_count": 79
},
{
"video_id": "RN2g9sRuJhA",
"question": "In a room with white tiles, there is a gray countertop. Next to the countertop stands a man wearing a blue checkered shirt, with tattoos on his arms. Beside him, there is a white bowl containing potatoes. What is he doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Washing the tomatoes",
"Peeling the potatoes",
"Cutting the potatoes",
"Washing the green peppers",
"Washing the potatoes"
],
"correct_choice": 4,
"position": [
7784
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "RN2g9sRuJhA_0",
"video_path": "RN2g9sRuJhA.mp4",
"subtitle_path": "RN2g9sRuJhA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 988.66,
"view_count": 956017
},
{
"video_id": "RN2g9sRuJhA",
"question": "On a natural wood-colored floor, there is a man standing who is wearing a blue shirt and has tattoos on his arms. In front of him, there are cut tofu strips, and to the left, there is a white plate. To his right, there is an orange object. What is he doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Sprinkling salt on tofu",
"Cutting green peppers",
"Washing tofu strips",
"Frying tofu strips",
"Cutting tofu strips"
],
"correct_choice": 3,
"position": [
17001
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "RN2g9sRuJhA_1",
"video_path": "RN2g9sRuJhA.mp4",
"subtitle_path": "RN2g9sRuJhA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 988.66,
"view_count": 956017
},
{
"video_id": "RN2g9sRuJhA",
"question": "In a kitchen, there is a man standing wearing a blue shirt with tattoos on his arms. Behind him are some white cabinets and a black countertop. On the black countertop, there is a black pot. In front of the man, there is a plate of prepared food. What is he doing?",
"question_wo_referring_query": "In a kitchen, there is a man standing wearing a blue shirt with tattoos on his arms. Behind him are some white cabinets and a black countertop. On the black countertop, there is a black pot. In front of the man, there is a plate of prepared food. What is he doing?",
"candidates": [
"Taking photos of the food",
"Eating the prepared food",
"Cutting the food with a knife",
"Picking up the food with chopsticks",
"Putting on a disposable glove"
],
"correct_choice": 1,
"position": [
22578
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "RN2g9sRuJhA_2",
"video_path": "RN2g9sRuJhA.mp4",
"subtitle_path": "RN2g9sRuJhA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 988.66,
"view_count": 956017
},
{
"video_id": "FnKDgC9aNu0",
"question": "Next to the tan table made of two pieces put together, there are three men sitting on black chairs. When the subtitle 'August you could say okay we can ship' appears, what objects are present in the frame?",
"question_wo_referring_query": "What objects are present in the frame?",
"candidates": [
"black T-shirt",
"blue T-shirt",
"black backpack",
"white shirt",
"white T-shirt"
],
"correct_choice": 0,
"position": [
20446
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2O",
"level": "L1-Perception",
"id": "FnKDgC9aNu0_0",
"video_path": "FnKDgC9aNu0.mp4",
"subtitle_path": "FnKDgC9aNu0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3468.0,
"view_count": 161398
},
{
"video_id": "FnKDgC9aNu0",
"question": "In a white room, three men are sitting. In the top right corner of the screen, there is a square picture. When the subtitle 'doomed doomed attempt ah they don't know' appears, what objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"A blue shirt",
"A red arrow",
"A purple backpack",
"A black chair",
"A white short sleeve"
],
"correct_choice": 3,
"position": [
43112
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2O",
"level": "L1-Perception",
"id": "FnKDgC9aNu0_1",
"video_path": "FnKDgC9aNu0.mp4",
"subtitle_path": "FnKDgC9aNu0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3468.0,
"view_count": 161398
},
{
"video_id": "FnKDgC9aNu0",
"question": "In a room, there are three men sitting. One is wearing a black short-sleeve shirt, another is wearing a dark grey short-sleeve shirt, and the third is wearing a grey-and-white dress shirt. When the subtitle 'Sicily to Libya and some technical' appears, what objects can be seen in the frame?",
"question_wo_referring_query": "What objects can be seen in the frame?",
"candidates": [
"red arrow",
"white chair",
"blue water tank",
"black camera",
"black phone"
],
"correct_choice": 3,
"position": [
70027
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2O",
"level": "L1-Perception",
"id": "FnKDgC9aNu0_2",
"video_path": "FnKDgC9aNu0.mp4",
"subtitle_path": "FnKDgC9aNu0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3468.0,
"view_count": 161398
},
{
"video_id": "o2F-N42Ufo4",
"question": "Which of the following scenarios is in the correct sequence?",
"question_wo_referring_query": "Which of the following scenarios is in the correct sequence?",
"candidates": [
"First, a scene of a man in a gray suit with a black and white tie holding a gun is shown, followed by a photo of a person in a black suit with a black and red tie against a white background, and finally a scene of two people sitting and five people standing in a conference room is shown.",
"First, a scene of a man in a gray suit with a black and white tie holding a gun is shown, followed by a scene of two people sitting and five people standing in a conference room, and finally a photo of a person in a black suit with a black and red tie against a white background is shown.",
"First, a scene of two people sitting and five people standing in a conference room is shown, followed by a photo of a person in a black suit with a black and red tie against a white background, and finally a scene of a man in a gray suit with a black and white tie holding a gun is shown.",
"First, a photo of a person in a black suit with a black and red tie against a white background is shown, followed by a scene of two people sitting and five people standing in a conference room, and finally a scene of a man in a gray suit with a black and white tie holding a gun is shown.",
"First, a photo of a person in a black suit with a black and red tie against a white background is shown, followed by a scene of a man in a gray suit with a black and white tie holding a gun, and finally a scene of two people sitting and five people standing in a conference room is shown."
],
"correct_choice": 0,
"position": [
6468,
6586,
6947
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "o2F-N42Ufo4_0",
"video_path": "o2F-N42Ufo4.mp4",
"subtitle_path": "o2F-N42Ufo4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1130.3,
"view_count": 1701677
},
{
"video_id": "o2F-N42Ufo4",
"question": "Which of the following sequences is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First, a cartoon with a white background featuring two strong men and a woman in a skirt doing splits was shown. Then, a photo was shown of a man with graying hair, combed middle part, wearing a suit with a red and green tie, against a gray background. Finally, an image was shown of a man with an eye mask, shirtless with brown hair, a black tattoo on his abdomen, and a crowd in the background.",
"First, an image was shown of a man with an eye mask, shirtless with brown hair, a black tattoo on his abdomen, and a crowd in the background. Then, a cartoon with a white background featuring two strong men and a woman in a skirt doing splits was shown. Finally, a photo was shown of a man with graying hair, combed middle part, wearing a suit with a red and green tie, against a gray background.",
"First, a cartoon with a white background featuring two strong men and a woman in a skirt doing splits was shown. Then, an image was shown of a man with an eye mask, shirtless with brown hair, a black tattoo on his abdomen, and a crowd in the background. Finally, a photo was shown of a man with graying hair, combed middle part, wearing a suit with a red and green tie, against a gray background.",
"First, a photo was shown of a man with graying hair, combed middle part, wearing a suit with a red and green tie, against a gray background. Then, a cartoon with a white background featuring two strong men and a woman in a skirt doing splits was shown. Finally, an image was shown of a man with an eye mask, shirtless with brown hair, a black tattoo on his abdomen, and a crowd in the background.",
"First, a photo was shown of a man with graying hair, combed middle part, wearing a suit with a red and green tie, against a gray background. Then, an image was shown of a man with an eye mask, shirtless with brown hair, a black tattoo on his abdomen, and a crowd in the background. Finally, a cartoon with a white background featuring two strong men and a woman in a skirt doing splits was shown."
],
"correct_choice": 3,
"position": [
14865,
15010,
15713
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "o2F-N42Ufo4_1",
"video_path": "o2F-N42Ufo4.mp4",
"subtitle_path": "o2F-N42Ufo4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1130.3,
"view_count": 1701677
},
{
"video_id": "o2F-N42Ufo4",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, two people with long hair leaning against the wall in a grayscale photo were shown, followed by an orange-bordered screen with a blue interior, then a screen with a purple background and yellow letters, and lastly, a scene of a man in black clothes with white hair talking.",
"First, a scene of a man in black clothes with white hair talking was played, then an orange-bordered screen with a blue interior, followed by a screen with a purple background and yellow letters, and lastly, two people with long hair leaning against the wall in a grayscale photo were shown.",
"First, a scene of a man in black clothes with white hair talking was played, followed by two people with long hair leaning against the wall in a grayscale photo, then an orange-bordered screen with a blue interior, and lastly, a screen with a purple background and yellow letters was shown.",
"First, an orange-bordered screen with a blue interior was played, then a screen with a purple background and yellow letters, followed by a scene of a man in black clothes with white hair talking. Lastly, two people with long hair leaning against the wall in a black-and-white photo were shown.",
"First, two people with long hair leaning against the wall in a grayscale photo were shown, followed by a scene of a man in black clothes with white hair talking, then an orange-bordered screen with a blue interior, and lastly, a screen with a purple background and yellow letters."
],
"correct_choice": 3,
"position": [
21019,
21088,
23341
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "o2F-N42Ufo4_2",
"video_path": "o2F-N42Ufo4.mp4",
"subtitle_path": "o2F-N42Ufo4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1130.3,
"view_count": 1701677
},
{
"video_id": "XVXczyheik0",
"question": "In a white room, there is a man wearing a dark blue jacket speaking on the screen. In front of him is a black microphone. He has dark skin and a mustache. On the wall behind him, there's a black map. To the left of the frame, there is a whiteboard, and to the right, there is also a wall with a piece of paper in a black frame. While this man is writing on a black background with white text that says 'How much data do we need?', what change occurs?",
"question_wo_referring_query": "While this man is writing on a black background with white text that says 'How much data do we need?', what change occurs?",
"candidates": [
"He puts on glasses",
"He changes to a grey microphone",
"He puts on a hat",
"He changes to black clothes",
"He changes to white clothes"
],
"correct_choice": 2,
"position": [
149,
1737
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "XVXczyheik0_0",
"video_path": "XVXczyheik0.mp4",
"subtitle_path": "XVXczyheik0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 967.97,
"view_count": 1949
},
{
"video_id": "XVXczyheik0",
"question": "Against a black background, a white-lettered title says 'How much data do we need?' with a picture of a golden retriever on the left. The word 'dog' appears on the far right, and a man is explaining in the bottom right corner. When the image of a shaking dog appears, what changes occur in the black scene?",
"question_wo_referring_query": "What changes occur in the black scene?",
"candidates": [
"The image of the shaking dog gradually shrinks",
"The arrow on the right side is labeled 'not dog'",
"The picture of the golden retriever disappears",
"The image of the shaking dog gradually enlarges",
"The man in the bottom right corner changes into a blue shirt"
],
"correct_choice": 1,
"position": [
4860,
4884
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "XVXczyheik0_1",
"video_path": "XVXczyheik0.mp4",
"subtitle_path": "XVXczyheik0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 967.97,
"view_count": 1949
},
{
"video_id": "XVXczyheik0",
"question": "In a white room, a man wearing a dark blue jacket is speaking on screen. There is a black microphone in front of him, he has dark skin, and he is sporting a beard. A black map is hanging on the wall behind him, there is a whiteboard on the left side of the screen, and on the right side, there is another wall with a black-bordered paper on it. What changes occur to this man's position on the screen when the 'Quiz Time 3' title, featuring many horizontal grids with black letters, appears?",
"question_wo_referring_query": "What changes occur to this man's position on the screen?",
"candidates": [
"The man disappears from the screen.",
"The man starts wearing sunglasses.",
"The man's position on the screen shifts to the bottom-left corner.",
"The man's position on the screen shifts to the bottom-right corner.",
"The man changes to a white shirt."
],
"correct_choice": 3,
"position": [
20608,
20682
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "XVXczyheik0_2",
"video_path": "XVXczyheik0.mp4",
"subtitle_path": "XVXczyheik0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 967.97,
"view_count": 1949
},
{
"video_id": "LfUsGv-ESbc",
"question": "The screen displays a webpage. The webpage contains some code, a gray image at the top, three clickable icons in the left sidebar, and login and settings options on the right side. In the bottom right corner, a man with sunglasses and short hair, wearing black clothes, is explaining something. When he mentions 'there you go and nothing happens of,' what is the shape of the gray icon in the top-right corner of the screen?",
"question_wo_referring_query": "The screen displays a webpage. The webpage contains some code, a gray image at the top, three clickable icons in the left sidebar, and login and settings options on the right side. In the bottom right corner, a man with sunglasses and short hair, wearing black clothes, is explaining something. When he mentions 'there you go and nothing happens of,' what is the shape of the gray icon in the top-right corner of the screen?",
"candidates": [
"Stairs",
"Rectangle",
"Triangle",
"Circle",
"Pentagram"
],
"correct_choice": 3,
"position": [
11759
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2A",
"level": "L1-Perception",
"id": "LfUsGv-ESbc_0",
"video_path": "LfUsGv-ESbc.mp4",
"subtitle_path": "LfUsGv-ESbc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2009.63,
"view_count": 45480
},
{
"video_id": "LfUsGv-ESbc",
"question": "The screen shows a webpage. The top of the webpage has a sea-green background, the middle has a white background, and there's a black-and-white image on the right side. The middle contains some program code. In the black-and-white image, there is a man running with his hands raised. There are some icons connected with lines on his legs. In the lower right corner of the screen, there is a man wearing sunglasses and dressed in black, who is explaining something. The background behind him is grey. What is the shape of the icon connected by white lines on the screen?",
"question_wo_referring_query": "What is the shape of the icon connected by white lines on the screen?",
"candidates": [
"Circle",
"Triangle",
"Square",
"Staircase shape",
"Rectangle"
],
"correct_choice": 2,
"position": [
26073
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2A",
"level": "L1-Perception",
"id": "LfUsGv-ESbc_1",
"video_path": "LfUsGv-ESbc.mp4",
"subtitle_path": "LfUsGv-ESbc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2009.63,
"view_count": 45480
},
{
"video_id": "LfUsGv-ESbc",
"question": "The screen displays a Google webpage, with a white background featuring five pictures. The right section has a black background with one picture. In the bottom right corner, there's a man with short hair wearing glasses and dressed in black who is explaining something. He has a beard, and the background behind him is grey. Above these images on the white background, there are eight identical icons related to the images. What is the shape of these icons?",
"question_wo_referring_query": "What is the shape of these icons?",
"candidates": [
"Oval",
"Circle",
"Triangle",
"Square",
"Rectangle"
],
"correct_choice": 1,
"position": [
46193
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2A",
"level": "L1-Perception",
"id": "LfUsGv-ESbc_2",
"video_path": "LfUsGv-ESbc.mp4",
"subtitle_path": "LfUsGv-ESbc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2009.63,
"view_count": 45480
},
{
"video_id": "L-XGTMusZvc",
"question": "Three fighter jets were flying in a foggy sky at dusk. What event occurred after the subtitle 'attacks were not impressive. rather' appeared?",
"question_wo_referring_query": "Three fighter jets were flying in a foggy sky at dusk. What event occurred after the subtitle 'attacks were not impressive. rather' appeared?",
"candidates": [
"The fighter jet dropped bombs on the ground.",
"A pilot jumped off the fighter jet.",
"The fighter jet crashed from the sky.",
"The fighter jet crashed into the sea.",
"The fighter jet was hit by artillery fire."
],
"correct_choice": 0,
"position": [
4846,
15735
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3E",
"level": "L2-Relation",
"id": "L-XGTMusZvc_0",
"video_path": "L-XGTMusZvc.mp4",
"subtitle_path": "L-XGTMusZvc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 916.97,
"view_count": 153039
},
{
"video_id": "L-XGTMusZvc",
"question": "In a gray background, there are two lines of English sentences in white font on a red and black base in the upper left corner, and five white circular icons in the middle. What happened after the subtitle 'proper channels for requesting close air support' appeared?",
"question_wo_referring_query": "What happened?",
"candidates": [
"A fighter jet crashes from the sky",
"A fighter jet crashes into the sea",
"A pilot jumps off the fighter jet",
"A fighter jet is engaging in ground combat",
"A fighter jet is hit by artillery"
],
"correct_choice": 3,
"position": [
3259,
11066
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3E",
"level": "L2-Relation",
"id": "L-XGTMusZvc_1",
"video_path": "L-XGTMusZvc.mp4",
"subtitle_path": "L-XGTMusZvc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 916.97,
"view_count": 153039
},
{
"video_id": "L-XGTMusZvc",
"question": "In the gray background, there is a paragraph made up of black and white English words in the middle. There are four circular white icons on the top and bottom of the screen respectively. What happens after the subtitle 'close air support force, the 8th Air Corps' appears?",
"question_wo_referring_query": "What happens?",
"candidates": [
"Explosion occurs around the ship",
"Explosion occurs on the ground",
"Fighter jet is hit by artillery",
"Fighter jet crashes into the sea",
"Personnel on the tank are injured by explosion"
],
"correct_choice": 1,
"position": [
9227,
15749
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3E",
"level": "L2-Relation",
"id": "L-XGTMusZvc_2",
"video_path": "L-XGTMusZvc.mp4",
"subtitle_path": "L-XGTMusZvc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 916.97,
"view_count": 153039
},
{
"video_id": "8MkL3W6wU3g",
"question": "Some workers are working in a yard, surrounded by wooden frames. On the left side of the screen, there is a pillar, and the subtitle 'five months of working side by side we' appears. Who is the first person to appear after this?",
"question_wo_referring_query": "Who is the first person to appear?",
"candidates": [
"A foreign man wearing a light blue shirt, khaki pants, and a hat",
"A Chinese worker wearing dark blue sleeves",
"A foreign journalist holding a camera",
"A Chinese worker wearing a red hat",
"A foreign leader taking a group photo with the workers"
],
"correct_choice": 0,
"position": [
13373,
13647
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "8MkL3W6wU3g_0",
"video_path": "8MkL3W6wU3g.mp4",
"subtitle_path": "8MkL3W6wU3g_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1735.44,
"view_count": 7314
},
{
"video_id": "8MkL3W6wU3g",
"question": "In the video, there are two Chinese men. The Chinese man on the left, dressed in a grey work uniform, is sitting on a stone bench. The man on the right, wearing dark blue and grey pants, is also seated. After the subtitle 'the Chinese word for landscape is Shan' appears, what is the first object that appears?",
"question_wo_referring_query": "What is the first object that appears?",
"candidates": [
"A fake mountain with flowing water",
"A small red wooden scroll table",
"A classic window with green bamboo and a fake mountain",
"A sign with Chinese characters inside a house",
"A lantern hanging from the beam"
],
"correct_choice": 0,
"position": [
33237,
33636
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "8MkL3W6wU3g_1",
"video_path": "8MkL3W6wU3g.mp4",
"subtitle_path": "8MkL3W6wU3g_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1735.44,
"view_count": 7314
},
{
"video_id": "8MkL3W6wU3g",
"question": "On the wooden couch, there is a small red wooden scroll-shaped table with an open book on it, and the caption 'the little scroll table on the couch is' appears. What is the object that appears next?",
"question_wo_referring_query": "What is the object that appears?",
"candidates": [
"beehive-shaped window in the courtyard",
"sign with Chinese characters",
"worker wearing a safety helmet with the Chinese flag",
"silver nail-shaped object",
"goldfish swimming in the pond"
],
"correct_choice": 4,
"position": [
36847,
39587
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "8MkL3W6wU3g_2",
"video_path": "8MkL3W6wU3g.mp4",
"subtitle_path": "8MkL3W6wU3g_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1735.44,
"view_count": 7314
},
{
"video_id": "2ekjGl8yWZk",
"question": "The girl with straight long hair, dressed in a deep purple suit and a black inner shirt, is seated on the sofa in front of the yellow background wall with two yellow cushions behind her. What is she doing the first time she appears?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"Sitting on the sofa and talking to the mirror",
"Watching TV",
"Writing",
"Cooking",
"Talking on the phone"
],
"correct_choice": 0,
"position": [
14
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "2ekjGl8yWZk_0",
"video_path": "2ekjGl8yWZk.mp4",
"subtitle_path": "2ekjGl8yWZk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1390.46,
"view_count": 58073
},
{
"video_id": "2ekjGl8yWZk",
"question": "During the appearance of the black arrow between the English text on the white background PPT with two lines at the top, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The arrow moves from top to bottom",
"The arrow moves from right to left",
"The arrow changes from large to small",
"The arrow moves from left to right",
"The arrow changes from small to large"
],
"correct_choice": 3,
"position": [
498
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "2ekjGl8yWZk_1",
"video_path": "2ekjGl8yWZk.mp4",
"subtitle_path": "2ekjGl8yWZk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1390.46,
"view_count": 58073
},
{
"video_id": "2ekjGl8yWZk",
"question": "On the white-background PPT, there is text in English by Liu that appears. When the blue arrow first appears above the second line of English text, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The blue arrow moves from bottom to top.",
"The blue arrow moves from left to right.",
"The blue arrow moves from right to left.",
"The blue arrow moves from top to bottom.",
"The blue arrow enlarges."
],
"correct_choice": 3,
"position": [
13770
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "2ekjGl8yWZk_2",
"video_path": "2ekjGl8yWZk.mp4",
"subtitle_path": "2ekjGl8yWZk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1390.46,
"view_count": 58073
},
{
"video_id": "_BDzMutoy6A",
"question": "On a wooden table, there is a pan with light blue handles on both sides. Inside the pan, there is yellow oil and four sesame balls. There is a silver strainer on top of the pan. When the subtitles 'Also, you'll see as we fry these, they'll puff up' appear, what is happening on the screen?",
"question_wo_referring_query": "What is happening on the screen?",
"candidates": [
"Boiling sesame balls",
"Pan-frying sesame balls",
"Eating sesame balls",
"Frying sesame balls",
"Stir-frying sesame balls"
],
"correct_choice": 3,
"position": [
6790
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "_BDzMutoy6A_0",
"video_path": "_BDzMutoy6A.mp4",
"subtitle_path": "_BDzMutoy6A_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 937.86,
"view_count": 390768
},
{
"video_id": "_BDzMutoy6A",
"question": "The background features a white cabinet decorated with red lanterns. On the right side, there's a silver refrigerator with a red couplet sticker on it. A woman in a red short-sleeve top and apron is standing in front of a table holding a sesame ball. On the table, there are two glass containers and an iron plate. When the subtitle 'I fell like it's so refreshing' appears, what is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"kneading a sesame ball",
"cutting a sesame ball",
"eating a sesame ball",
"deep-frying a sesame ball",
"making tangyuan"
],
"correct_choice": 0,
"position": [
14987
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "_BDzMutoy6A_1",
"video_path": "_BDzMutoy6A.mp4",
"subtitle_path": "_BDzMutoy6A_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 937.86,
"view_count": 390768
},
{
"video_id": "_BDzMutoy6A",
"question": "In the upper left side of the wooden table, there's a plate of white and pink tangyuan. In the lower left, there's ice water with ice cubes in a glass container. On the right, there are six tangyuan in a blue pot on a black gas stove. When the subtitle 'It'll stick to the bottom of your pan' appears, what is the hand holding a green slotted spoon doing?",
"question_wo_referring_query": "What is the hand holding a green slotted spoon doing?",
"candidates": [
"Boiling tangyuan",
"Stir-frying tangyuan",
"Kneading tangyuan",
"Deep-frying tangyuan",
"Pan-frying tangyuan"
],
"correct_choice": 0,
"position": [
20384
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "_BDzMutoy6A_2",
"video_path": "_BDzMutoy6A.mp4",
"subtitle_path": "_BDzMutoy6A_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 937.86,
"view_count": 390768
},
{
"video_id": "crmV4OduHYA",
"question": "On the right side of the screen, there is a green plant. A woman with long straight hair, wearing a brick-red coat, is sitting on a yellow chair in front of a blue background. After she speaks into the camera, what happens on the screen first?",
"question_wo_referring_query": "What happens first on the screen?",
"candidates": [
"In the upper right corner of the screen, a man in a red short-sleeve shirt is adding a picture to the screen",
"In the upper right corner of the screen, a man in a red short-sleeve shirt is explaining the PPT on the screen",
"In the upper right corner of the screen, a man in a red short-sleeve shirt is moving a picture on the screen with his hand",
"A man in a black shirt and a woman in a white outerwear and gray inner outfit are sitting in front of the wall painting and talking",
"In the upper right corner of the screen, a man in a red short-sleeve shirt is pointing at a picture on the screen with his finger"
],
"correct_choice": 3,
"position": [
440,
19068,
26600
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "crmV4OduHYA_0",
"video_path": "crmV4OduHYA.mp4",
"subtitle_path": "crmV4OduHYA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1233.11,
"view_count": 24109
},
{
"video_id": "crmV4OduHYA",
"question": "Some people are sitting face-to-face at a long table outdoors, wearing name tags around their necks and smiling at each other. What happened earlier in the video?",
"question_wo_referring_query": "What happened earlier in the video?",
"candidates": [
"A woman in a brick-red suit jacket is sitting and talking in front of a camera.",
"In the upper right corner of the screen, a man in a red short-sleeve shirt is explaining the PPT shown on the screen.",
"A man in a black shirt and a woman in a white coat with a grey inner lining are sitting in front of a camera, with the woman lifting her legs.",
"In the upper right corner of the screen, a man in a red short-sleeve shirt is pointing to a picture on the screen.",
"A man in a black shirt and a woman in a white coat with a grey inner lining are sitting in front of a mural and having a conversation."
],
"correct_choice": 0,
"position": [
211,
126,
6735,
26615
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "crmV4OduHYA_1",
"video_path": "crmV4OduHYA.mp4",
"subtitle_path": "crmV4OduHYA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1233.11,
"view_count": 24109
},
{
"video_id": "crmV4OduHYA",
"question": "In the upper right corner of the screen, there is a man wearing a red short-sleeved shirt. On a white background, two cups labeled with different chemical elements contain blue liquids. The liquid on the left has a paper strip marked 'Zn' in it. What happens first on the screen after the man picks up the paper strip?",
"question_wo_referring_query": "What happens first on the screen?",
"candidates": [
"A pair of hands in the screen place two photos on a panel with blue liquids.",
"Some people with badges on their necks sit facing each other at a long table outside, smiling at the camera.",
"A woman in a brick-red suit jacket sits in front of a mirror, talking.",
"A man in a black shirt and a woman in a white coat with a gray inner top sit in front of a mirror, and the woman's hands rest on her lap.",
"A man in a black shirt and a woman in a white coat with a gray inner top sit in front of a mirror, and the woman opens her hands."
],
"correct_choice": 0,
"position": [
26687,
26759
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "crmV4OduHYA_2",
"video_path": "crmV4OduHYA.mp4",
"subtitle_path": "crmV4OduHYA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1233.11,
"view_count": 24109
},
{
"video_id": "i327DBSS_iE",
"question": "In the video, which of the following characters appears first?",
"question_wo_referring_query": "Which of the following characters appears first in the video?",
"candidates": [
"The little boy in a dark blue short-sleeved shirt with gray and white stripes, drawing with a crayon in a drawing book",
"The short-haired woman in a misty blue long-sleeve shirt and glasses, holding a little boy",
"The short-haired woman in the black coat speaking in front of a large glass window",
"The man in a black outfit and black baseball cap, sitting in front of a shelf and looking at the camera",
"The woman in a gray shirt wearing a white mask and a blue head covering"
],
"correct_choice": 2,
"position": [
880,
1619,
2706,
8517,
10420
],
"topic_category": "NP-News-Programs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "i327DBSS_iE_0",
"video_path": "i327DBSS_iE.mp4",
"subtitle_path": "i327DBSS_iE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 905.0,
"view_count": 71399
},
{
"video_id": "i327DBSS_iE",
"question": "After a video scene of a woman wearing a light-colored coat, carrying a red bag, walking a small dog through a forest filled with red leaves, which season appears first?",
"question_wo_referring_query": "Which season appears first?",
"candidates": [
"Autumn",
"Summer",
"Winter",
"Spring"
],
"correct_choice": 2,
"position": [
15751,
16834,
17156,
17369,
17644
],
"topic_category": "NP-News-Programs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "i327DBSS_iE_1",
"video_path": "i327DBSS_iE.mp4",
"subtitle_path": "i327DBSS_iE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 905.0,
"view_count": 71399
},
{
"video_id": "i327DBSS_iE",
"question": "The screen shows some people pushing shopping carts in the vegetable section of a supermarket. What is the first product to appear on the screen after this?",
"question_wo_referring_query": "What is the first product to appear on the screen?",
"candidates": [
"Leek",
"Eraser",
"Cauliflower",
"Carrot",
"Mini Tomatoes"
],
"correct_choice": 2,
"position": [
18364,
18442,
18507,
18607,
18970
],
"topic_category": "NP-News-Programs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "i327DBSS_iE_2",
"video_path": "i327DBSS_iE.mp4",
"subtitle_path": "i327DBSS_iE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 905.0,
"view_count": 71399
},
{
"video_id": "d-GKQeu4S6M",
"question": "In the upper right corner, there are two minimized screens of two women. At the top of the white background, there's a line of black English text. In the middle, there are four rectangles with some formulas on them. What happens on the screen after the caption 'of questions cool that's not bad at all' appears?",
"question_wo_referring_query": "What happens first on the screen?",
"candidates": [
"The video screen of the two women moves from the right side to the left side.",
"Two women explain the content of the PPT composed of four differently colored rectangles on a white background.",
"The video screen of the two women moves from the top to the bottom.",
"The video screen of the two women disappears.",
"The video screen of the two women enlarges."
],
"correct_choice": 1,
"position": [
12019,
12745
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "d-GKQeu4S6M_0",
"video_path": "d-GKQeu4S6M.mp4",
"subtitle_path": "d-GKQeu4S6M_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1257.69,
"view_count": 66300
},
{
"video_id": "d-GKQeu4S6M",
"question": "In the top right corner of the video, there are two small screens of women. On a white background, there are some English formulas. After the subtitle 'there's p that's 1 DS next which is 2' appears, what does the woman in glasses and a grey T-shirt in the top right corner do?",
"question_wo_referring_query": "What does she do?",
"candidates": [
"Yawns",
"Brushes her hair with both hands",
"Stands up",
"Adjusts her glasses",
"Takes off her glasses"
],
"correct_choice": 1,
"position": [
20756,
20901
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "d-GKQeu4S6M_1",
"video_path": "d-GKQeu4S6M.mp4",
"subtitle_path": "d-GKQeu4S6M_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1257.69,
"view_count": 66300
},
{
"video_id": "d-GKQeu4S6M",
"question": "In the top right corner of the video, there are two minimized screens with women. On a white background, there are four formulas. After the subtitle 'so check next thing we look at is just' appears, what does the woman in the top right corner wearing a gray short sleeve and glasses do?",
"question_wo_referring_query": "What does the woman in the top right corner wearing a gray short sleeve and glasses do?",
"candidates": [
"Takes off her glasses",
"Supports her face with one hand",
"Stretches her waist",
"Holds her glasses with one hand",
"One hand grabs a handful of hair, the other hand touches her head"
],
"correct_choice": 4,
"position": [
25781,
25975
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3E",
"level": "L2-Relation",
"id": "d-GKQeu4S6M_2",
"video_path": "d-GKQeu4S6M.mp4",
"subtitle_path": "d-GKQeu4S6M_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1257.69,
"view_count": 66300
},
{
"video_id": "P9hDA0u6FO0",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"On the right, a man in a dark blue military uniform with a white flower in his hair and medals on his chest. On the left, a yellow paper sheet. A man wearing a military hat, a long coat, and long boots, with white and red English words in the frame. Under a blue sky, a man in a white short sleeve shirt holding a camera stands by a yellow rock.",
"A man wearing a military hat, a long coat, and long boots, with white and red English words in the frame. On the right, a man in a dark blue military uniform with a white flower in his hair and medals on his chest. On the left, a yellow paper sheet. Under a blue sky, a man in a white short sleeve shirt holding a camera stands by a yellow rock.",
"A man wearing a military hat, a long coat, and long boots, with white and red English words in the frame. Under a blue sky, a man in a white short sleeve shirt holding a camera stands by a yellow rock. On the right, a man in a dark blue military uniform with a white flower in his hair and medals on his chest. On the left, a yellow paper sheet.",
"On the right, a man in a dark blue military uniform with a white flower in his hair and medals on his chest. On the left, a yellow paper sheet. Under a blue sky, a man in a white short sleeve shirt holding a camera stands by a yellow rock. A man wearing a military hat, a long coat, and long boots, with white and red English words in the frame."
],
"correct_choice": 1,
"position": [
316,
2822,
7028
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "P9hDA0u6FO0_0",
"video_path": "P9hDA0u6FO0.mp4",
"subtitle_path": "P9hDA0u6FO0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1995.88,
"view_count": 2792898
},
{
"video_id": "P9hDA0u6FO0",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"To the left, a man with curly hair tied in a queue, wearing a white uniform, and to his right is a map. A group of soldiers wearing hats and different colored uniforms are riding white, black, and olive horses. On a yellow-green map, two yellow circles are drawn.",
"On a yellow-green map, two yellow circles are drawn, a group of soldiers wearing hats and different colored uniforms are riding white, black, and olive horses. To the left, a man with curly hair tied in a queue, wearing a white uniform, and to his right is a map.",
"A group of soldiers wearing hats and different colored uniforms are riding white, black, and olive horses. On a yellow-green map, two yellow circles are drawn. To the left, a man with curly hair tied in a queue, wearing a white uniform, and to his right is a map.",
"On a yellow-green map, two yellow circles are drawn. To the left, a man with curly hair tied in a queue, wearing a white uniform, and to his right is a map. A group of soldiers wearing hats and different colored uniforms are riding white, black, and olive horses.",
"A group of soldiers wearing hats and different colored uniforms are riding white, black, and olive horses. To the left, a man with curly hair tied in a queue, wearing a white uniform, and to his right is a map. On a yellow-green map, two yellow circles are drawn."
],
"correct_choice": 1,
"position": [
24635,
40848,
44238
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "P9hDA0u6FO0_1",
"video_path": "P9hDA0u6FO0.mp4",
"subtitle_path": "P9hDA0u6FO0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1995.88,
"view_count": 2792898
},
{
"video_id": "P9hDA0u6FO0",
"question": "Which of the following sequences of scenarios is correct?",
"question_wo_referring_query": "Which of the following sequences of scenarios is correct?",
"candidates": [
"Six soldiers from different countries are holding rifles with bayonets in front of a map. A group of French soldiers dressed in olive green military uniforms are holding rifles and fighting in the midst of smoke. A soldier dressed in black top and white pants with a black high hat is standing with a rifle with a bayonet aimed.",
"A group of French soldiers dressed in olive green military uniforms are holding rifles and fighting in the midst of smoke. A soldier dressed in black top and white pants with a black high hat is standing with a rifle with a bayonet aimed. Six soldiers from different countries are holding rifles with bayonets in front of a map.",
"Six soldiers from different countries are holding rifles with bayonets in front of a map. A soldier dressed in black top and white pants with a black high hat is standing with a rifle with a bayonet aimed. A group of French soldiers dressed in olive green military uniforms are holding rifles and fighting in the midst of smoke.",
"A group of French soldiers dressed in olive green military uniforms are holding rifles and fighting in the midst of smoke. Six soldiers from different countries are holding rifles with bayonets in front of a map. A soldier dressed in black top and white pants with a black high hat is standing with a rifle with a bayonet aimed.",
"A soldier dressed in black top and white pants with a black high hat is standing with a rifle with a bayonet aimed. Six soldiers from different countries are holding rifles with bayonets in front of a map. A group of French soldiers dressed in olive green military uniforms are holding rifles and fighting in the midst of smoke."
],
"correct_choice": 3,
"position": [
20963,
26865,
31370
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "P9hDA0u6FO0_2",
"video_path": "P9hDA0u6FO0.mp4",
"subtitle_path": "P9hDA0u6FO0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1995.88,
"view_count": 2792898
},
{
"video_id": "wvfctNd-Aio",
"question": "On the left is black text in English with a red icon below, and on the right side of the screen is a stack of newspapers. Below the news headline, on a ticker tape, there is a black background with yellow text reading 'BREAKING NEWS'. Which subtitles have appeared at the same time as this icon?",
"question_wo_referring_query": "Which subtitles have appeared at the same time as this icon?",
"candidates": [
"\u201cGPS and health workers insisting that\u201d",
"\u201cand 37 and the polling station that have\u201d",
"\u201cthe QR code you'll see on screen during\u201d",
"\u201copened in Russia their elections taking\u201d",
"\u201cstart with the times world uh Pages 36\u201d"
],
"correct_choice": 0,
"position": [
42,
1162
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "wvfctNd-Aio_0",
"video_path": "wvfctNd-Aio.mp4",
"subtitle_path": "wvfctNd-Aio_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1097.6,
"view_count": 10087
},
{
"video_id": "wvfctNd-Aio",
"question": "On the left side of the news screen is a bald man wearing a black suit with a white shirt, while on the right side of the news screen is a newspaper showing a picture of a woman with curly hair wearing a white shirt. Which subtitles appeared on the screen at the same time as this newspaper?",
"question_wo_referring_query": "Which subtitles appeared on the screen at the same time as this newspaper?",
"candidates": [
"\"politics totally so he's right there the\"",
"\"general election would somehow be a\"",
"\"Minister um in sunak government in what\"",
"\"accountability and there's be flights to\"",
"\"the sword having lovely nice thick gold\""
],
"correct_choice": 0,
"position": [
16601,
23896
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "wvfctNd-Aio_1",
"video_path": "wvfctNd-Aio.mp4",
"subtitle_path": "wvfctNd-Aio_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1097.6,
"view_count": 10087
},
{
"video_id": "wvfctNd-Aio",
"question": "In the news footage, on the left is a woman with long blue sleeves and dark red hair, and on the right is a bald man in a black suit with a white shirt underneath. Which subtitle appears simultaneously with the glass in the man's right hand?",
"question_wo_referring_query": "Which subtitle appears simultaneously?",
"candidates": [
"\u201cthe the headlines about the end of\u201d",
"\"they they can desent no because there\"",
"\"descent from from Russians as much as\"",
"\"and that we are seeing this open um\"",
"\u201cdefending it all I'm saying uh is that I\u201d"
],
"correct_choice": 0,
"position": [
17250,
25742
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "wvfctNd-Aio_2",
"video_path": "wvfctNd-Aio.mp4",
"subtitle_path": "wvfctNd-Aio_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1097.6,
"view_count": 10087
},
{
"video_id": "UwJTCg5fpXg",
"question": "In the screen, there is a lady sitting on a chair. The lady is holding two children beside her. There is a child standing behind her. One boy is wearing a black coat, and the boy behind her is wearing a blue coat. The girl is wearing a red coat and a checkered skirt. In the background, there is a white glass door, and there are green plants beside it. Who is the child that the lady sitting on the chair is holding?",
"question_wo_referring_query": "Who is the child that the lady sitting on the chair is holding?",
"candidates": [
"The child wearing gray clothes",
"The child wearing a blue coat",
"The child wearing a green coat",
"The child wearing a red coat",
"The child wearing a checkered shirt"
],
"correct_choice": 1,
"position": [
475
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "UwJTCg5fpXg_0",
"video_path": "UwJTCg5fpXg.mp4",
"subtitle_path": "UwJTCg5fpXg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1357.56,
"view_count": 12771
},
{
"video_id": "UwJTCg5fpXg",
"question": "In the picture, a woman is sitting on a chair holding two children beside her. Another child is standing behind her. A boy in a black woolen coat and another boy in a blue jacket are visible. The girl is wearing a red woolen coat and a checkered skirt. There is a white glass door in the background and green plants on the side. Who is the child sitting on the armrest of the chair in the picture?",
"question_wo_referring_query": "Who is the child sitting on the armrest of the chair in the picture?",
"candidates": [
"The child wearing a red woolen coat",
"The child wearing a blue jacket",
"The child wearing a grey jacket",
"The child wearing a green woolen coat",
"The child wearing a checkered shirt"
],
"correct_choice": 0,
"position": [
13237
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "UwJTCg5fpXg_1",
"video_path": "UwJTCg5fpXg.mp4",
"subtitle_path": "UwJTCg5fpXg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1357.56,
"view_count": 12771
},
{
"video_id": "UwJTCg5fpXg",
"question": "In the scene, a woman is sitting on a chair holding two children beside her. There is a child standing behind her. One boy is wearing a black coat, the boy behind is wearing a blue jacket, and the girl is wearing a red coat and a checkered skirt. In the background, there's a white glass door and a green plant. Who is wearing black leather shoes?",
"question_wo_referring_query": "Who is wearing black leather shoes?",
"candidates": [
"The child in the checkered shirt",
"The child in the grey jacket",
"The child in the green coat",
"The child in the red coat",
"The child in the blue jacket"
],
"correct_choice": 3,
"position": [
22932
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "UwJTCg5fpXg_2",
"video_path": "UwJTCg5fpXg.mp4",
"subtitle_path": "UwJTCg5fpXg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1357.56,
"view_count": 12771
},
{
"video_id": "-PnG8Jp2gFw",
"question": "On a beige wall, a television screen is hanging, displaying a webpage along with some file icons. A man wearing a black lab coat and a yellow and black badge is standing in front of the screen talking. When he mentions, 'being you don't know that 30-second ad,' what happens?",
"question_wo_referring_query": "What happens?",
"candidates": [
"The man waved his hand.",
"The man pointed at the audience listening to the lecture.",
"The man turned around to face the screen.",
"The man put on a hat.",
"The man pointed at the screen."
],
"correct_choice": 0,
"position": [
9794
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "-PnG8Jp2gFw_0",
"video_path": "-PnG8Jp2gFw.mp4",
"subtitle_path": "-PnG8Jp2gFw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1526.96,
"view_count": 291304
},
{
"video_id": "-PnG8Jp2gFw",
"question": "On a light yellow wall, there is a television screen displaying a webpage along with some file icons. A man wearing a yellow and black work badge is standing in the lower left corner of the frame, speaking. He is wearing a black lab coat. What happened when he mentioned 'linked through Amazon now Amazon's'?",
"question_wo_referring_query": "What happened?",
"candidates": [
"The man walked out of the frame",
"The man took off his outer coat",
"The man touched his head with one hand",
"The man adjusted his clothes",
"The man jumped up"
],
"correct_choice": 2,
"position": [
14161
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "-PnG8Jp2gFw_1",
"video_path": "-PnG8Jp2gFw.mp4",
"subtitle_path": "-PnG8Jp2gFw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1526.96,
"view_count": 291304
},
{
"video_id": "-PnG8Jp2gFw",
"question": "On a mustard-yellow wall hangs a television screen, which displays a webpage with black and white images and some document icons. A man with a yellow and black name tag is speaking. He is facing towards the right side of the screen, wearing a military green and black camouflaged uniform with a black and white accessory on the sleeve. What happens when he mentions: 'a gimbal now there's gimbals that stab'?",
"question_wo_referring_query": "What happens?",
"candidates": [
"The man opens one hand and clenches the other into a fist",
"The man turns to face the screen",
"The man brushes his hair",
"The man takes a sip of water",
"The man changes his jacket"
],
"correct_choice": 0,
"position": [
28039
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "-PnG8Jp2gFw_2",
"video_path": "-PnG8Jp2gFw.mp4",
"subtitle_path": "-PnG8Jp2gFw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1526.96,
"view_count": 291304
},
{
"video_id": "jvkmcX47bKU",
"question": "A girl is sitting in a room with colorful lights. There is a green plant behind her, and a white curtain on the left side of the screen. She is holding a box in her hands. Her hair is a mix of yellow and black, she is wearing glasses, and she has on a letterman jacket. What happened when she mentioned 'stuff was left behind in high school big'?",
"question_wo_referring_query": "What happened?",
"candidates": [
"She put down the spoon.",
"The woman picked up a spoon and a paper cup.",
"She fixed her hair.",
"She took off her glasses.",
"She put the spoon into the paper cup."
],
"correct_choice": 1,
"position": [
4299,
4404
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "jvkmcX47bKU_0",
"video_path": "jvkmcX47bKU.mp4",
"subtitle_path": "jvkmcX47bKU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1417.0,
"view_count": 248606
},
{
"video_id": "jvkmcX47bKU",
"question": "A girl is sitting in a room with colorful lights, with green plants behind her. There's also a white curtain to the left of the screen. She has blond and black hair, is wearing glasses, and has a letter sweater on. While she is talking in front of a mirror, what happens after she mentions 'about myself'?",
"question_wo_referring_query": "What happens?",
"candidates": [
"She put down the fork",
"She put a fork into a paper cup",
"She changed into a gray T-shirt",
"She picked up a cake",
"She took off her glasses"
],
"correct_choice": 2,
"position": [
8788,
9447
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "jvkmcX47bKU_1",
"video_path": "jvkmcX47bKU.mp4",
"subtitle_path": "jvkmcX47bKU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1417.0,
"view_count": 248606
},
{
"video_id": "jvkmcX47bKU",
"question": "A girl is sitting in a room lit by colorful lights, with green plants behind her and a white curtain on the left side of the screen. Her hair is yellow and black, she is wearing glasses, and she has on a gray T-shirt along with gloves. She is speaking to the camera and after mentioning 'and she has this series of artwork named', what happened?",
"question_wo_referring_query": "What happened next?",
"candidates": [
"A transparent pen holder with blue sticky notes on it appeared to the right of the camera.",
"A transparent pen holder with purple sticky notes on it appeared to the right of the camera.",
"A transparent pen holder with olive sticky notes on it appeared to the right of the camera.",
"A transparent pen holder with yellow sticky notes on it appeared to the right of the camera.",
"A transparent pen holder with green sticky notes on it appeared to the right of the camera."
],
"correct_choice": 0,
"position": [
25601,
25658
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "jvkmcX47bKU_2",
"video_path": "jvkmcX47bKU.mp4",
"subtitle_path": "jvkmcX47bKU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1417.0,
"view_count": 248606
},
{
"video_id": "1pxrIj9Xyps",
"question": "In a room decorated with various paintings, there are two men standing in the middle. On the left is a man with short hair wearing a red short-sleeve shirt, holding a beer. On the right is a man with a black beard, wearing a gray short-sleeve shirt. Behind the man in red is a white door. When the subtitle mentions 'over 2600 meters yeah there you go we', what did the man in the red shirt do?",
"question_wo_referring_query": "What did the man in the red shirt do?",
"candidates": [
"Lowered his head",
"Raised both hands",
"Turned his head to the right",
"Looked up",
"Turned his head to the left"
],
"correct_choice": 2,
"position": [
14005,
14038
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "1pxrIj9Xyps_0",
"video_path": "1pxrIj9Xyps.mp4",
"subtitle_path": "1pxrIj9Xyps_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1667.13,
"view_count": 1430015
},
{
"video_id": "1pxrIj9Xyps",
"question": "In a white room, a man with short hair, wearing blue clothes and black socks, is standing in the middle. Behind him is a cabinet filled with objects. To the right of the man, there's a balcony. After the subtitle \"about a sixth of the population is\" appears, what action does the man take?",
"question_wo_referring_query": "What action does the man take?",
"candidates": [
"Picked up a phone",
"Ate a cake",
"Put on a hat",
"Sat down",
"Drank something"
],
"correct_choice": 4,
"position": [
16299,
16338
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "1pxrIj9Xyps_1",
"video_path": "1pxrIj9Xyps.mp4",
"subtitle_path": "1pxrIj9Xyps_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1667.13,
"view_count": 1430015
},
{
"video_id": "1pxrIj9Xyps",
"question": "Under a blue sky, a woman with long purple hair and a black short-sleeve shirt is in the middle of the scene. There is a white house behind her on the left and trees on the right. After the subtitle 'our blood for example' appears, what action does the woman take?",
"question_wo_referring_query": "What action does the woman take?",
"candidates": [
"Raised one hand",
"Covered her face",
"Stood up",
"Raised both hands",
"Took out a book"
],
"correct_choice": 0,
"position": [
21876,
21898
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "1pxrIj9Xyps_2",
"video_path": "1pxrIj9Xyps.mp4",
"subtitle_path": "1pxrIj9Xyps_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1667.13,
"view_count": 1430015
},
{
"video_id": "E7FSg22MdKE",
"question": "In a room, blue curtains are hanging in the middle, and below the curtains is a parquet floor. After the subtitle mentions 'timeschedhereattheinem,' what is the first glowing object that appears?",
"question_wo_referring_query": "What is the first glowing object that appears?",
"candidates": [
"lamp",
"flashlight",
"fire",
"firefly",
"torch"
],
"correct_choice": 0,
"position": [
7316,
1517,
7546
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "E7FSg22MdKE_0",
"video_path": "E7FSg22MdKE.mp4",
"subtitle_path": "E7FSg22MdKE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 905.0,
"view_count": 127817
},
{
"video_id": "E7FSg22MdKE",
"question": "In a dimly lit room, there is a lamp hanging from the ceiling. Below the lamp, there is an olive-colored table, with olive-colored objects placed on it. Behind the table, there is a wall filled with square grids. Below the wall, there are two windows, and behind the windows, there are three seated people. After the subtitle mentions 'think it would be great if we could remember those things', what is the first fruit that appears?",
"question_wo_referring_query": "What is the first fruit that appears?",
"candidates": [
"Apple",
"Mango",
"Pear",
"Watermelon",
"Banana"
],
"correct_choice": 3,
"position": [
14413,
14635
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "E7FSg22MdKE_1",
"video_path": "E7FSg22MdKE.mp4",
"subtitle_path": "E7FSg22MdKE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 905.0,
"view_count": 127817
},
{
"video_id": "E7FSg22MdKE",
"question": "There are four people walking down a street: two adults and two children. The child on the left has long hair and is wearing white clothes. The child on the right has long hair and is wearing a white and purple dress. The adult on the left is wearing a purple short-sleeve shirt and white pants. The adult on the right has short hair and is wearing white clothes and pants. They are all holding hands. Which vegetable appears first after the subtitle 'Music' is mentioned?",
"question_wo_referring_query": "Which vegetable appears first?",
"candidates": [
"Tomato",
"Carrot",
"Cabbage",
"Watermelon",
"Eggplant"
],
"correct_choice": 0,
"position": [
17396,
17864
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "E7FSg22MdKE_2",
"video_path": "E7FSg22MdKE.mp4",
"subtitle_path": "E7FSg22MdKE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 905.0,
"view_count": 127817
},
{
"video_id": "er1oRjH2iu8",
"question": "On the screen, a woman with long golden hair, wearing black clothes and a red collar, is slightly turning her face towards the camera. What change occurs in this woman when the subtitle mentions 'that the letters were assumed to have been love type letters.'?",
"question_wo_referring_query": "What change occurs in this woman?",
"candidates": [
"She changes to sitting on a white sofa",
"She changes to sitting on a purple sofa",
"She changes to squatting",
"She changes to sitting on a red sofa",
"She changes to sitting on a blue sofa"
],
"correct_choice": 0,
"position": [
11029,
11134
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "er1oRjH2iu8_0",
"video_path": "er1oRjH2iu8.mp4",
"subtitle_path": "er1oRjH2iu8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1255.4,
"view_count": 1355407
},
{
"video_id": "er1oRjH2iu8",
"question": "During the concert, a man with tattoos and long hair was singing shirtless while holding a microphone. When the subtitles displayed 'We don't want to spend all our time discussing,' what change occurred to the man's upper body?",
"question_wo_referring_query": "What change occurred to the man's upper body?",
"candidates": [
"His shirtless body changed into wearing a red suit.",
"His shirtless body changed into wearing a black suit.",
"His tattooed body changed into a shirtless body without tattoos.",
"His shirtless body changed into wearing a white suit."
],
"correct_choice": 1,
"position": [
4711,
4968
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "er1oRjH2iu8_1",
"video_path": "er1oRjH2iu8.mp4",
"subtitle_path": "er1oRjH2iu8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1255.4,
"view_count": 1355407
},
{
"video_id": "er1oRjH2iu8",
"question": "A white-framed black screen shows no image. When the caption mentions 'Fahlman posted this note, which would solve everything,' what changes occur on the black screen?",
"question_wo_referring_query": "What changes occur on the black screen?",
"candidates": [
"Blue English text appears on the screen",
"Green English text appears on the screen",
"Green numbers appear on the screen",
"Red English text appears on the screen",
"Yellow English text appears on the screen"
],
"correct_choice": 1,
"position": [
18647,
18754
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "er1oRjH2iu8_2",
"video_path": "er1oRjH2iu8.mp4",
"subtitle_path": "er1oRjH2iu8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1255.4,
"view_count": 1355407
},
{
"video_id": "XYsCVqz3iug",
"question": "On the table, there's a woman in blue on the left and a woman in gray on the right. The woman in blue has her hands on the table. What is the woman in blue doing?",
"question_wo_referring_query": "What is the woman in blue doing?",
"candidates": [
"Placed a watermelon on the table",
"Placed a banana on the table",
"Placed an apple on the table",
"Placed a light blue arrow on the table",
"Placed a dragon fruit on the table"
],
"correct_choice": 3,
"position": [
10621
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "XYsCVqz3iug_0",
"video_path": "XYsCVqz3iug.mp4",
"subtitle_path": "XYsCVqz3iug_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1489.36,
"view_count": 3984
},
{
"video_id": "XYsCVqz3iug",
"question": "On the table, a woman wearing a blue dress is on the left side, with her hands on a white paper that says \"DIFFICULTY LEVEL 1\". A woman wearing a gray dress is on the right side. What is the woman in the gray dress doing?",
"question_wo_referring_query": "What is the woman in the gray dress doing?",
"candidates": [
"Combing her hair",
"Tidying up the items on the table",
"Playing the piano",
"Doing her nails",
"Cooking"
],
"correct_choice": 1,
"position": [
19645
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "XYsCVqz3iug_1",
"video_path": "XYsCVqz3iug.mp4",
"subtitle_path": "XYsCVqz3iug_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1489.36,
"view_count": 3984
},
{
"video_id": "XYsCVqz3iug",
"question": "On the table, there is a woman dressed in blue on the left, and a woman dressed in grey on the right. She reaches out her ring-bearing hand next to a white paper with 'Br' written on it. What is the woman in grey doing?",
"question_wo_referring_query": "What is the woman in grey doing?",
"candidates": [
"Moved a computer",
"Moved the paper with 'Br' written on it",
"Moved a teapot",
"Moved a book",
"Moved an apple"
],
"correct_choice": 1,
"position": [
30522
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "S2E",
"level": "L1-Perception",
"id": "XYsCVqz3iug_2",
"video_path": "XYsCVqz3iug.mp4",
"subtitle_path": "XYsCVqz3iug_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1489.36,
"view_count": 3984
},
{
"video_id": "60oHeCZHtvI",
"question": "An armored vehicle is traveling on a sandy ground. There is a person extending their upper body out of the armored vehicle. Behind the armored vehicle, there are several tanks. Among these tanks, there is a person wearing a green vest. What is the color of the traveling armored vehicle?",
"question_wo_referring_query": "What is the color of the traveling armored vehicle?",
"candidates": [
"Red",
"Camouflage",
"White",
"Pink",
"Black"
],
"correct_choice": 1,
"position": [
857
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "60oHeCZHtvI_0",
"video_path": "60oHeCZHtvI.mp4",
"subtitle_path": "60oHeCZHtvI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 965.0,
"view_count": 369586
},
{
"video_id": "60oHeCZHtvI",
"question": "In the PPT, there is a technical data label at the top, an armored vehicle with tracks in the middle, and 'Empty Weight' with 7 000kg (15 432 lbs) written below the vehicle. What is the color of the tracked armored vehicle?",
"question_wo_referring_query": "What is the color of the tracked armored vehicle?",
"candidates": [
"Pink",
"Red",
"Black",
"Tan and green",
"White"
],
"correct_choice": 3,
"position": [
12839
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "60oHeCZHtvI_1",
"video_path": "60oHeCZHtvI.mp4",
"subtitle_path": "60oHeCZHtvI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 965.0,
"view_count": 369586
},
{
"video_id": "60oHeCZHtvI",
"question": "In the PPT, the words 'Thanks to all Supporters!!!' are written at the top. There are two tanks on the left and right in the middle, between which are the words 'Enjoy this type of Content? Consider supporting me.' What color are the two tanks?",
"question_wo_referring_query": "What color are the two tanks?",
"candidates": [
"Yellow",
"Red",
"White",
"Gray",
"Pink"
],
"correct_choice": 3,
"position": [
22818
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "60oHeCZHtvI_2",
"video_path": "60oHeCZHtvI.mp4",
"subtitle_path": "60oHeCZHtvI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 965.0,
"view_count": 369586
},
{
"video_id": "s49y2RP5C7E",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, four soldiers are cooking food around a fire, then three soldiers are opening a can, and finally, one person is opening a can.",
"First, one person is opening a can, then three soldiers are opening a can, and finally, four soldiers are cooking food around a fire.",
"First, three soldiers are opening a can, then four soldiers are cooking food around a fire, and finally, one person is opening a can.",
"First, three soldiers are opening a can, then one person is opening a can, and finally, four soldiers are cooking food around a fire.",
"First, four soldiers are cooking food around a fire, then one person is opening a can, and finally, three soldiers are opening a can."
],
"correct_choice": 0,
"position": [
6253,
6347,
6592
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "s49y2RP5C7E_0",
"video_path": "s49y2RP5C7E.mp4",
"subtitle_path": "s49y2RP5C7E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1086.25,
"view_count": 1458608
},
{
"video_id": "s49y2RP5C7E",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a person wearing a camouflage outfit, one holding a bag of food, and another holding a coffee-colored fork. Then, two soldiers are sitting on a rock in front of the tanks, eating the food in their hands. Finally, three soldiers are sitting in front of two tanks, eating the food in their hands.",
"First, two soldiers are sitting on a rock in front of the tanks, eating the food in their hands. Then, three soldiers are sitting in front of two tanks, eating the food in their hands. Finally, a person wearing a camouflage outfit, one holding a bag of food, and another holding a coffee-colored fork.",
"First, three soldiers are sitting in front of two tanks, eating the food in their hands. Then, two soldiers are sitting on a rock in front of the tanks, eating the food in their hands. Finally, a person wearing a camouflage outfit, one holding a bag of food.",
"First, a person wearing a camouflage outfit, one holding a bag of food, and another holding a coffee-colored fork. Then, three soldiers are sitting in front of two tanks, eating the food in their hands. Finally, two soldiers are sitting on a rock in front of the tanks, eating the food in their hands.",
"First, three soldiers are sitting in front of two tanks, eating the food in their hands. Then, a person wearing a camouflage outfit, one holding a bag of food, and another holding a coffee-colored fork. Finally, two soldiers are sitting on a rock in front of the tanks, eating the food in their hands."
],
"correct_choice": 4,
"position": [
16098,
16456,
17453
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "s49y2RP5C7E_1",
"video_path": "s49y2RP5C7E.mp4",
"subtitle_path": "s49y2RP5C7E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1086.25,
"view_count": 1458608
},
{
"video_id": "s49y2RP5C7E",
"question": "Which of the following sequences is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First, a soldier sits in front of a tent eating food from his hand, with two women in orange shorts beside him. Next, on the battlefield, two soldiers holding guns stand beside a soldier wearing a red hat. Finally, a soldier inside a room takes an item out of his pocket and places it on the table.",
"First, on the battlefield, two soldiers holding guns stand beside a soldier wearing a red hat. Next, a soldier sits in front of a tent eating food from his hand, with two women in orange shorts beside him. Finally, a soldier inside a room takes an item out of his pocket and places it on the table.",
"First, a soldier inside a room takes an item out of his pocket and places it on the table. Next, a soldier sits in front of a tent eating food from his hand, with two women in orange shorts beside him. Finally, on the battlefield, two soldiers holding guns stand beside a soldier wearing a red hat.",
"First, on the battlefield, two soldiers holding guns stand beside a soldier wearing a red hat. Next, a soldier inside a room takes an item out of his pocket and places it on the table. Finally, a soldier sits in front of a tent eating food from his hand, with two women in orange shorts beside him.",
"First, a soldier inside a room takes an item out of his pocket and places it on the table. Next, on the battlefield, two soldiers holding guns stand beside a soldier wearing a red hat. Finally, a soldier sits in front of a tent eating food from his hand, with two women in orange shorts beside him."
],
"correct_choice": 3,
"position": [
21495,
21715,
22723
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "s49y2RP5C7E_2",
"video_path": "s49y2RP5C7E.mp4",
"subtitle_path": "s49y2RP5C7E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1086.25,
"view_count": 1458608
},
{
"video_id": "61SYvhojGvg",
"question": "Against the backdrop of a large wooden ship floating on the sea, the word 'FORECASTLE' is written at the top of the video. In the middle, there is a painting depicting a deep blue sky with three wooden ships floating on the water. One of the ships is crowded with soldiers wearing armor. What is the shape of the border area marked with 'FORECASTLE'?",
"question_wo_referring_query": "What is the shape of the border area marked with 'FORECASTLE'?",
"candidates": [
"Cuboid",
"Parallelogram",
"Rectangle",
"Circle",
"Square"
],
"correct_choice": 2,
"position": [
17145
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "61SYvhojGvg_0",
"video_path": "61SYvhojGvg.mp4",
"subtitle_path": "61SYvhojGvg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1673.6,
"view_count": 3236964
},
{
"video_id": "61SYvhojGvg",
"question": "In a storage room, many wine casks are stacked. At the top of the video, 'WATER CASKS 300 tons' is written. Two wooden pillars stand in the middle. A lamp is hanging from the back pillar. A pile of hemp ropes, an iron hook, and a white string lie to the left of the lamp. What is the shape of the bottom of the wine casks?",
"question_wo_referring_query": "What is the shape of the bottom of the wine casks?",
"candidates": [
"rectangular",
"round",
"square",
"stepped",
"parallelogram"
],
"correct_choice": 1,
"position": [
31629
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "61SYvhojGvg_1",
"video_path": "61SYvhojGvg.mp4",
"subtitle_path": "61SYvhojGvg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1673.6,
"view_count": 3236964
},
{
"video_id": "61SYvhojGvg",
"question": "On the wooden floor, there is a large cannon placed on the left side, tied with a thick rope. In front of it is a wooden wall with windows, and there is a protruding section at the bottom of the wall. Cannonballs are neatly arranged on it. What is the material of the cannonballs?",
"question_wo_referring_query": "What is the material of the cannonballs?",
"candidates": [
"tin",
"copper",
"gold",
"wood",
"iron"
],
"correct_choice": 4,
"position": [
17757
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2A",
"level": "L1-Perception",
"id": "61SYvhojGvg_2",
"video_path": "61SYvhojGvg.mp4",
"subtitle_path": "61SYvhojGvg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1673.6,
"view_count": 3236964
},
{
"video_id": "u1D4ArcBjLI",
"question": "Which of the following scenario sequences is correct?",
"question_wo_referring_query": "Which of the following scenario sequences is correct?",
"candidates": [
"First appears the scene of a wine-red background wall projecting a white screen, with an empty stage in the middle and a woman speaking on the right, and an audience seated below. Then, the video shifts to the projector screen with a white background and black figures, showing an elderly man in a black suit with a red tie at the top right. Finally, the scene shows a blonde woman in a black suit looking at a computer screen and speaking, set against a wine-red wall corner.",
"First appears the scene of a blonde woman in a black suit looking at a computer screen and speaking, set against a wine-red wall. Then, the video shows a wine-red background wall projecting a white screen, with an empty stage in the middle and a woman speaking on the right, and an audience seated below. Finally, the video shifts to the projector screen with a white background and black figures, showing an elderly man in a black suit with a red tie at the top right.",
"First appears the scene of a projector screen with a white background and black figures, showing an elderly man in a black suit with a red tie positioned at the top right. Then, the scene shows a blonde woman in a black suit looking at a computer screen and speaking, set against a wine-red wall corner. Finally, the video shows a wine-red background wall projecting a white screen, with an empty stage in the middle and a woman speaking on the right, and an audience seated below.",
"First appears the scene of a blonde woman in a black suit looking at a computer screen and speaking, set against a wine-red wall. Then, the video shifts to a projector screen with a white background and black figures, showing an elderly man in a black suit with a red tie positioned at the top right. Finally, a scene appears with a wine-red background wall projecting a white screen, with an empty stage in the middle and a woman speaking on the right, and an audience seated below.",
"First appears the scene of a projector screen with a white background and black figures, showing an elderly man in a black suit with a red tie positioned at the top right. Then, the video shows a wine-red background wall projecting a white screen, with an empty stage in the middle and a woman speaking on the right, and an audience seated below. Finally, a blonde woman in a black suit looking at a computer screen and speaking, set against a wine-red wall corner appears."
],
"correct_choice": 4,
"position": [
3131,
3280,
3434
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "u1D4ArcBjLI_0",
"video_path": "u1D4ArcBjLI.mp4",
"subtitle_path": "u1D4ArcBjLI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1652.82,
"view_count": 2057
},
{
"video_id": "u1D4ArcBjLI",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, the scene shows 'a corner of a wall with a wine-red background, where a blonde woman in a black suit is speaking with her hands spread apart', followed by 'the video screen focuses on the projector screen, with a white background and multiple pictures on the projector screen. The top right shows a picture of a person's head with their hand raised, and the bottom right shows a hand wearing a blue glove', and finally, 'the video screen focuses on the projector screen, displaying a white box filled with small items of different colors.'",
"First, the scene shows 'the video screen focuses on the projector screen, displaying a white box filled with small items of different colors', followed by 'a corner of a wall with a wine-red background, where a blonde woman in a black suit is speaking with her hands spread apart', and finally, 'the video screen focuses on the projector screen, with a white background and multiple pictures on the projector screen. The top right shows a picture of a person's head with their hand raised, and the bottom right shows a hand wearing a blue glove.'",
"First, the scene shows 'the video screen focuses on the projector screen, with a white background and multiple pictures on the projector screen. The top right shows a picture of a person's head with their hand raised, and the bottom right shows a hand wearing a blue glove', followed by 'the video screen focuses on the projector screen, displaying a white box filled with small items of different colors', and finally, 'a corner of a wall with a wine-red background, where a blonde woman in a black suit is speaking with her hands spread apart.'",
"First, the scene shows 'the video screen focuses on the projector screen, with a white background and multiple pictures on the projector screen. The top right shows a picture of a person's head with their hand raised, and the bottom right shows a hand wearing a blue glove', followed by 'the scene shows a corner of a wall with a wine-red background, where a blonde woman in a black suit is speaking with her hands spread apart', and finally, 'the video screen focuses on the projector screen, displaying a white box filled with small items of different colors.'",
"First, the scene shows 'the video screen focuses on the projector screen, displaying a white box filled with small items of different colors', followed by 'the video screen focuses on the projector screen, with a white background and multiple pictures on the projector screen. The top right shows a picture of a person's head with their hand raised, and the bottom right shows a hand wearing a blue glove', and finally, 'a corner of a wall with a wine-red background, where a blonde woman in a black suit is speaking with her hands spread apart.'"
],
"correct_choice": 2,
"position": [
25201,
25275,
25294
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "u1D4ArcBjLI_1",
"video_path": "u1D4ArcBjLI.mp4",
"subtitle_path": "u1D4ArcBjLI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1652.82,
"view_count": 2057
},
{
"video_id": "u1D4ArcBjLI",
"question": "Which of the following scene orders is correct?",
"question_wo_referring_query": "Which of the following scene orders is correct?",
"candidates": [
"First, the scene where a blonde woman in a black suit is seen speaking to a computer screen in front of a wine-red background wall, with PPT slides of beetles, cockroaches, and other insects projected on the upper left; next, the scene where the same woman and background are shown, but with a projected screen of several people walking in a square in front of a red building on the upper left; finally, the scene focuses on the projector screen displaying the image of several people walking in a square in front of a red building.",
"First, the scene where a blonde woman in a black suit is seen speaking to a computer screen in front of a wine-red background wall, and a projected screen of several people walking in a square in front of a red building is shown on the upper left; next, the scene focuses on the projector screen displaying the same image of several people walking in a square in front of a red building; finally, the scene where the PPT slides of beetles, cockroaches, and other insects are shown on the upper left.",
"First, the scene where the video screen focuses on the projector screen displaying the image of several people walking in a square in front of a red building; next, the scene where a blonde woman in a black suit is seen speaking to a computer screen in front of a wine-red background wall, with PPT slides of beetles, cockroaches, and other insects projected on the upper left; finally, the scene where the same woman and background are shown, but with a projected screen of several people walking in a square in front of a red building on the upper left.",
"First, the scene where the video screen focuses on the projector screen displaying the image of several people walking in a square in front of a red building; next, the scene where a blonde woman in a black suit is seen speaking to a computer screen in front of a wine-red background wall, and a projected screen of several people walking in a square in front of a red building is shown on the upper left; finally, the scene where the PPT slides of beetles, cockroaches, and other insects are shown on the upper left.",
"First, the scene where a blonde woman in a black suit is seen speaking to a computer screen in front of a wine-red background wall, with PPT slides of beetles, cockroaches, and other insects projected on the upper left; next, the scene focuses on the projector screen displaying the image of several people walking in a square in front of a red building; finally, the scene where the same woman and background are shown, but with a projected screen of several people walking in a square in front of a red building on the upper left."
],
"correct_choice": 0,
"position": [
34444,
34612,
34905
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "u1D4ArcBjLI_2",
"video_path": "u1D4ArcBjLI.mp4",
"subtitle_path": "u1D4ArcBjLI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1652.82,
"view_count": 2057
},
{
"video_id": "yFAuXmcGk2Y",
"question": "In the scene with a white background, there are two '\u53ea' characters sketched incorrectly on the bottom left of the screen. When the scene changes to an image with multiple '\u53ea' characters sketched, there are many red, dark blue, and green arrows. What changes occur to the '\u53ea' character in the bottom left corner at this time?",
"question_wo_referring_query": "What changes occur to the '\u53ea' character in the bottom left corner at this time?",
"candidates": [
"Green characters and sketches appeared above the '\u53ea' character",
"The '\u53ea' character turned black",
"The '\u53ea' character turned green",
"The bottom part of the '\u53ea' character disappeared",
"Blue characters and sketches appeared above the '\u53ea' character"
],
"correct_choice": 0,
"position": [
14743,
20826
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "yFAuXmcGk2Y_0",
"video_path": "yFAuXmcGk2Y.mp4",
"subtitle_path": "yFAuXmcGk2Y_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3128.57,
"view_count": 17437
},
{
"video_id": "yFAuXmcGk2Y",
"question": "In the white background with black text, there is a segment of bold black text in the middle containing the phrase 'Benefit Tasks.' Above this, some words are partially covered with a yellow overlay. After the phrase 'Benefit Tasks' is enlarged on the screen, what changes occur?",
"question_wo_referring_query": "What changes occur?",
"candidates": [
"Turned red",
"Shrunk",
"Covered with yellow overlay",
"Disappeared",
"Covered with blue overlay"
],
"correct_choice": 2,
"position": [
47357,
47667
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "yFAuXmcGk2Y_1",
"video_path": "yFAuXmcGk2Y.mp4",
"subtitle_path": "yFAuXmcGk2Y_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3128.57,
"view_count": 17437
},
{
"video_id": "yFAuXmcGk2Y",
"question": "In the white background screen, in the middle left is the green letter A, the upper left is the green letter C, and the middle is the green letter B. After a large section of text appears on the left side of the screen, what change happens to letter A?",
"question_wo_referring_query": "What change happens?",
"candidates": [
"Got covered by a yellow layer",
"Turned red",
"Moved to the right",
"Got enlarged",
"Moved to the left"
],
"correct_choice": 2,
"position": [
66516,
67251
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "yFAuXmcGk2Y_2",
"video_path": "yFAuXmcGk2Y.mp4",
"subtitle_path": "yFAuXmcGk2Y_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3128.57,
"view_count": 17437
},
{
"video_id": "kOZnpwI2hIM",
"question": "Outside a house in Yangguanmingyao, a man wearing a white short-sleeved shirt is sitting on a chair. Behind the man, there are green grass, trees, and a house. Next to the man, there is an empty chair. The man has a tattooed arm and is wearing a watch and a bracelet on one hand, while the other hand is wrapped with a white bandage. After the subtitle 'Music' appears, what characters show up on the screen?",
"question_wo_referring_query": "What characters show up on the screen?",
"candidates": [
"A man with a black backpack",
"A man wearing a yellow short-sleeved shirt",
"A man wearing a red shirt and a man wearing a blue short-sleeved shirt",
"A shirtless man",
"A man wearing a blue shirt and a man wearing a black short-sleeved shirt"
],
"correct_choice": 3,
"position": [
13757,
13804
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "kOZnpwI2hIM_0",
"video_path": "kOZnpwI2hIM.mp4",
"subtitle_path": "kOZnpwI2hIM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1000.75,
"view_count": 11283521
},
{
"video_id": "kOZnpwI2hIM",
"question": "Inside the airport, there is a dense crowd. On the left side of the screen, there's an area surrounded by blue lines and silver pillars, with white stripes on the floor. On the right side of the screen, there's a person in a blue short-sleeve shirt and a girl with a backpack pushing a suitcase. After the word 'it' appears in the subtitles, what person(s) appear(s) on the screen?",
"question_wo_referring_query": "After the word 'it' appears in the subtitles, what person(s) appear(s) on the screen?",
"candidates": [
"A woman in blue with a work badge",
"A woman in red with a work badge",
"A man in red with a work badge",
"A potted plant",
"An umbrella"
],
"correct_choice": 1,
"position": [
5022,
5030
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "kOZnpwI2hIM_1",
"video_path": "kOZnpwI2hIM.mp4",
"subtitle_path": "kOZnpwI2hIM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1000.75,
"view_count": 11283521
},
{
"video_id": "kOZnpwI2hIM",
"question": "Blue skies with floating white clouds, the distant and nearby continuous mountain ranges, the blue sea and the deserted sandy beach, after the subtitle 'Luke decided to tell us a bit about his' appeared, which character appeared on the screen?",
"question_wo_referring_query": "Which character appeared on the screen?",
"candidates": [
"A small cat",
"A small dog",
"A man in a black short-sleeved shirt",
"A man in a blue short-sleeved shirt",
"A man in a white short-sleeved shirt"
],
"correct_choice": 4,
"position": [
11731,
11791
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "kOZnpwI2hIM_2",
"video_path": "kOZnpwI2hIM.mp4",
"subtitle_path": "kOZnpwI2hIM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1000.75,
"view_count": 11283521
},
{
"video_id": "eE5Z7gDbgVA",
"question": "In a black-and-white scene, two police officers in uniform are holding a woman. The woman looks excited, wearing a short-sleeved shirt and a medal around her neck. The policeman on the right is wearing sunglasses, and the one on the left is wearing a watch. Behind them are buildings and a dense crowd. When the subtitle 'Not cool, Rosie' appears, what is the tattoo design on the chest of the woman in the middle?",
"question_wo_referring_query": "What is the tattoo design on the chest of the woman in the middle?",
"candidates": [
"Round tattoo",
"Three-leaf clover tattoo",
"Small dog tattoo",
"Square tattoo",
"Small cat tattoo"
],
"correct_choice": 1,
"position": [
8397
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "eE5Z7gDbgVA_0",
"video_path": "eE5Z7gDbgVA.mp4",
"subtitle_path": "eE5Z7gDbgVA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1074.2,
"view_count": 2937116
},
{
"video_id": "eE5Z7gDbgVA",
"question": "A man in a yellow shirt is standing in front of a gray background. His mouth shows a full set of white teeth. The man is holding a toy in one hand and a bag of food in the other hand. There is a paper bag in front of him. When the subtitle 'a whole lot easier' appears, what is the man's hairstyle like?",
"question_wo_referring_query": "What is the man's hairstyle like?",
"candidates": [
"golden long hair",
"silver short hair",
"black short hair",
"silver long hair",
"golden short hair"
],
"correct_choice": 2,
"position": [
23002
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "eE5Z7gDbgVA_1",
"video_path": "eE5Z7gDbgVA.mp4",
"subtitle_path": "eE5Z7gDbgVA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1074.2,
"view_count": 2937116
},
{
"video_id": "eE5Z7gDbgVA",
"question": "Against a black-and-white background, three men appear on the screen. The man on the right has his hands crossed in front of his chest, the man on the left is wearing a hat and his finger is pointing to the upper right. The other man is staring sharply at the man on the right. What is the style of the hat worn by the man on the left when the subtitle 'There's no way I'm going down' appears?",
"question_wo_referring_query": "What is the style of the hat worn by the man on the left?",
"candidates": [
"a beanie",
"a top hat",
"a denim cap",
"a baseball cap",
"a beret"
],
"correct_choice": 1,
"position": [
24629
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "eE5Z7gDbgVA_2",
"video_path": "eE5Z7gDbgVA.mp4",
"subtitle_path": "eE5Z7gDbgVA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1074.2,
"view_count": 2937116
},
{
"video_id": "V-RIpt7Tknc",
"question": "In the scene, a group of people are gathered outside. They are surrounding a man who is wearing a hat and dressed in white. Next to the man is a white horse, and in the upper right corner, someone is raising a flag. The story mentions that Serbia achieved autonomy. What happened next?",
"question_wo_referring_query": "What happened next?",
"candidates": [
"Russia supported Bulgaria",
"The empire was sacked",
"Italy and Germany each achieved unification",
"Greece was marginalized",
"Democratic thought flourished"
],
"correct_choice": 2,
"position": [
2343,
2690
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "V-RIpt7Tknc_0",
"video_path": "V-RIpt7Tknc.mp4",
"subtitle_path": "V-RIpt7Tknc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 952.22,
"view_count": 56363
},
{
"video_id": "V-RIpt7Tknc",
"question": "In a room, a group of people dressed in formal attire are seated around a long table with paper documents placed on it. They are actively engaging in discussion. The room has podiums and paintings around. The story mentions the signing of the London Treaty at the end of May 1913. What else is mentioned?",
"question_wo_referring_query": ", what else is mentioned?",
"candidates": [
"Montenegro permanently leaving Bulgaria",
"Bulgaria signing an armistice agreement",
"Signing of the Treaty of Bucharest",
"Second Balkan War",
"Greco-Bulgarian War"
],
"correct_choice": 3,
"position": [
15540,
15745
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "V-RIpt7Tknc_1",
"video_path": "V-RIpt7Tknc.mp4",
"subtitle_path": "V-RIpt7Tknc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 952.22,
"view_count": 56363
},
{
"video_id": "V-RIpt7Tknc",
"question": "A group of people are gathered in front of a large building's gate, where there is a round ball ornament on top of the gate, and plants on both sides of the gate. There is also a mention of the year 1908. After mentioning the Young Turk Revolution shaking the Ottoman Empire, what other event is mentioned?",
"question_wo_referring_query": "What other event is mentioned?",
"candidates": [
"Bulgaria's proclamation of independence",
"The war between the Ottoman Empire and Greece",
"The signing of the Treaty of Bucharest",
"The Greek War",
"The Second Balkan War"
],
"correct_choice": 0,
"position": [
5323,
5581
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "V-RIpt7Tknc_2",
"video_path": "V-RIpt7Tknc.mp4",
"subtitle_path": "V-RIpt7Tknc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 952.22,
"view_count": 56363
},
{
"video_id": "Fw1rirubXiU",
"question": "In a black background, there is a screen framed with a red line. On the screen, there's a man with short white hair and a pouting mouth. When this man appears next to the text 'THE WALL STREET JOURNAL', what change occurs to him?",
"question_wo_referring_query": "What change occurs to him?",
"candidates": [
"He has an additional pair of gold-framed glasses on his face",
"He has an additional hat on his head",
"He has an additional pair of silver-framed glasses on his face",
"His hair changes to black",
"He has an additional pair of sunglasses on his face"
],
"correct_choice": 1,
"position": [
2823,
49245
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Fw1rirubXiU_0",
"video_path": "Fw1rirubXiU.mp4",
"subtitle_path": "Fw1rirubXiU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3128.83,
"view_count": 256293
},
{
"video_id": "Fw1rirubXiU",
"question": "In the scene, there is only one woman with long chestnut hair, dressed in a black coat and a white blouse. On the blue stripe in front of her chest, the letters 'WHERE IS KATE?' are displayed. What change occurs when this woman appears next to a man in a black suit?",
"question_wo_referring_query": "What change occurs to her?",
"candidates": [
"She changes into a gray coat",
"She changes into a red coat",
"She changes into a purple floral coat",
"She changes into a green coat",
"She changes into a pink coat"
],
"correct_choice": 4,
"position": [
37546,
37866
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Fw1rirubXiU_1",
"video_path": "Fw1rirubXiU.mp4",
"subtitle_path": "Fw1rirubXiU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3128.83,
"view_count": 256293
},
{
"video_id": "Fw1rirubXiU",
"question": "In the broadcast room, there is a man wearing a black suit and a white shirt. His right hand is on the table, and his left hand is slightly raised. On the screen next to his right hand, the text 'DYING FOR AID' appears. What change occurs to his right hand when the text on the screen changes to 'GLOBAL WATCH'?",
"question_wo_referring_query": "What change occurs to his right hand?",
"candidates": [
"He places his right hand on a mouse",
"He places his right hand on his forehead",
"He places his right hand on his chin",
"He is holding a piece of paper with his right hand",
"He is holding a phone with his right hand"
],
"correct_choice": 3,
"position": [
13791,
60696
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Fw1rirubXiU_2",
"video_path": "Fw1rirubXiU.mp4",
"subtitle_path": "Fw1rirubXiU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3128.83,
"view_count": 256293
},
{
"video_id": "tdm72-vYxTs",
"question": "On the screen, there is a beige stool with a green bowl on top. The bowl contains yellow food, and there is a silver spoon inside the bowl. What happens after the spoon is placed in the bowl?",
"question_wo_referring_query": ", what happens after the spoon is placed in the bowl?",
"candidates": [
"The spoon stirs around in the bowl",
"The spoon scoops up the food",
"New food is added to the green bowl",
"The green bowl is carried away",
"The spoon hits the green bowl"
],
"correct_choice": 0,
"position": [
2622
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "tdm72-vYxTs_0",
"video_path": "tdm72-vYxTs.mp4",
"subtitle_path": "tdm72-vYxTs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1163.97,
"view_count": 1474
},
{
"video_id": "tdm72-vYxTs",
"question": "On a beige mat, there is a black plate with multiple layers of thin pancakes on it. A few slices of tomato are placed on top of the pancakes. A knife appears on the screen. What is this knife doing?",
"question_wo_referring_query": "What is this knife doing?",
"candidates": [
"The knife is broken",
"The knife is cutting the thin pancakes",
"The knife is inserted into the table",
"The knife is smashing the plate",
"The knife is cutting open the beige mat"
],
"correct_choice": 1,
"position": [
6912
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "tdm72-vYxTs_1",
"video_path": "tdm72-vYxTs.mp4",
"subtitle_path": "tdm72-vYxTs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1163.97,
"view_count": 1474
},
{
"video_id": "tdm72-vYxTs",
"question": "A pair of hands appears on the screen holding a knife on a wooden board, which has a white food item broken in half. What are these hands doing?",
"question_wo_referring_query": "What are these hands doing?",
"candidates": [
"The hands are using the knife to slice the wooden board vertically",
"The hands are taking away the wooden board",
"The hands are using the knife to slice the food item vertically",
"The hands are using the knife to slice the wooden board horizontally",
"The hands are using the knife to slice the food item horizontally"
],
"correct_choice": 4,
"position": [
15919
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "tdm72-vYxTs_2",
"video_path": "tdm72-vYxTs.mp4",
"subtitle_path": "tdm72-vYxTs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1163.97,
"view_count": 1474
},
{
"video_id": "DoizYSYQRqU",
"question": "In a sunlit room, there is a woman wearing a black short-sleeve shirt. She is grabbing her hair with her left hand and combing her hair with her right hand. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"Comb",
"Hair clip",
"Necklace",
"Glasses",
"Watch"
],
"correct_choice": 0,
"position": [
7247
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "DoizYSYQRqU_0",
"video_path": "DoizYSYQRqU.mp4",
"subtitle_path": "DoizYSYQRqU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1240.27,
"view_count": 74119
},
{
"video_id": "DoizYSYQRqU",
"question": "A woman in a red dress is singing on the screen, and behind her is a dancer wearing white clothing. The dancer has the word 'ICONIC' on their head. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"Display screen",
"Microphone",
"Water cup",
"Car",
"Speaker"
],
"correct_choice": 1,
"position": [
20086
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "DoizYSYQRqU_1",
"video_path": "DoizYSYQRqU.mp4",
"subtitle_path": "DoizYSYQRqU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1240.27,
"view_count": 74119
},
{
"video_id": "DoizYSYQRqU",
"question": "In a car, a woman wearing a white coat is sitting with her eyes closed. She is holding a cup of coffee in her right hand and a mobile phone in her left hand. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"transparent water cup",
"nose ring",
"earring",
"watch",
"earphone"
],
"correct_choice": 2,
"position": [
27073
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "DoizYSYQRqU_2",
"video_path": "DoizYSYQRqU.mp4",
"subtitle_path": "DoizYSYQRqU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1240.27,
"view_count": 74119
},
{
"video_id": "NAJOZTNkhlI",
"question": "In front of a pure black background, there is a man with cornrows. He is wearing a black coat and sunglasses. What did he do when he first appeared?",
"question_wo_referring_query": "In front of a pure black background, there is a man with cornrows. He is wearing a black coat and sunglasses. What did he do when he first appeared?",
"candidates": [
"He took off his sunglasses",
"He clenched his right hand into a fist",
"He raised his right hand",
"He touched his head with his hand",
"He raised his left hand"
],
"correct_choice": 2,
"position": [
2468
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "O2E",
"level": "L1-Perception",
"id": "NAJOZTNkhlI_0",
"video_path": "NAJOZTNkhlI.mp4",
"subtitle_path": "NAJOZTNkhlI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3135.5,
"view_count": 35507
},
{
"video_id": "NAJOZTNkhlI",
"question": "This is a document, within which there are bolded words 'START' and 'YIELD'. The page number at the bottom of the document shows '3'. What happened when this document first appeared?",
"question_wo_referring_query": "What happened when it first appeared?",
"candidates": [
"The bolded word 'YIELD' in the document was highlighted in green.",
"The bolded word 'START' in the document was highlighted in green.",
"The page number '3' in the document was highlighted in yellow.",
"The bolded word 'START' in the document was highlighted in yellow.",
"The bolded word 'YIELD' in the document was highlighted in yellow."
],
"correct_choice": 3,
"position": [
34731
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "O2E",
"level": "L1-Perception",
"id": "NAJOZTNkhlI_1",
"video_path": "NAJOZTNkhlI.mp4",
"subtitle_path": "NAJOZTNkhlI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3135.5,
"view_count": 35507
},
{
"video_id": "NAJOZTNkhlI",
"question": "This is a document with the words 'Preprint. Under review.' at the top. There is a table below with the word 'Method' in the header. What happened when this document first appeared?",
"question_wo_referring_query": "What happened when this document first appeared?",
"candidates": [
"There was a black line drawn across the document.",
"There was a yellow line drawn across the document.",
"The text in the document was highlighted in yellow.",
"The text in the document was highlighted in green.",
"There was a red line drawn across the document."
],
"correct_choice": 4,
"position": [
58147
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "O2E",
"level": "L1-Perception",
"id": "NAJOZTNkhlI_2",
"video_path": "NAJOZTNkhlI.mp4",
"subtitle_path": "NAJOZTNkhlI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3135.5,
"view_count": 35507
},
{
"video_id": "iwXp1fT89-M",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, in a room with three white doors, the woman points to the word 'live' on her sleeve. Next, in the same room with three white doors, the woman stands under a green figurehead and lifts her left hand to touch her hair. Finally, in a room with green walls, a woman wearing a blue sleeveless knit top is drinking water with a straw while wearing earphones.",
"First, in a room with three white doors, the woman points to the word 'live' on her sleeve. Next, in a room with green walls, a woman wearing a blue sleeveless knit top is drinking water with a straw while wearing earphones. Finally, in the same room with three white doors, the woman stands under a green figurehead and lifts her left hand to touch her hair.",
"First, in a room with green walls, a woman wearing a blue sleeveless knit top is drinking water with a straw while wearing earphones. Next, in a room with three white doors, the woman points to the word 'live' on her sleeve. Finally, in the same room with three white doors, the woman stands under a green figurehead and lifts her left hand to touch her hair.",
"First, in a room with three white doors, the woman stands under a green figurehead and lifts her left hand to touch her hair. Next, in a room with green walls, a woman wearing a blue sleeveless knit top is drinking water with a straw while wearing earphones. Finally, in the same room with three white doors, the woman points to the word 'live' on her sleeve.",
"First, in a room with green walls, a woman wearing a blue sleeveless knit top is drinking water with a straw while wearing earphones. Next, in a room with three white doors, the woman stands under a green figurehead and lifts her left hand to touch her hair. Finally, in the same room with three white doors, the woman points to the word 'live' on her sleeve."
],
"correct_choice": 2,
"position": [
575,
4829,
5871
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "iwXp1fT89-M_0",
"video_path": "iwXp1fT89-M.mp4",
"subtitle_path": "iwXp1fT89-M_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1156.19,
"view_count": 167192
},
{
"video_id": "iwXp1fT89-M",
"question": "Which of the following sequence of scenes is correct?",
"question_wo_referring_query": "Which of the following sequence of scenes is correct?",
"candidates": [
"First, in a small frame on a pure yellow background, a green straightener is displayed. Next, in a room with green walls, a woman in a blue sleeveless knit shirt lifts a pair of shoes with an 'N' pattern. Finally, on a pure yellow background, there is green text 'BEAUTY'.",
"First, in a room with green walls, a woman in a blue sleeveless knit shirt lifts a pair of shoes with an 'N' pattern. Next, on a pure yellow background, there is green text 'BEAUTY'. Finally, on the yellow background, a small frame displays a green straightener.",
"First, on a pure yellow background, there is green text 'BEAUTY'. Next, in a small frame on a pure yellow background, a green straightener is displayed. Finally, in a room with green walls, a woman in a blue sleeveless knit shirt lifts a pair of shoes with an 'N' pattern.",
"First, on a pure yellow background, there is green text 'BEAUTY'. Next, in a room with green walls, a woman in a blue sleeveless knit shirt lifts a pair of shoes with an 'N' pattern. Finally, in a small frame on a pure yellow background, a green straightener is displayed.",
"First, in a room with green walls, a woman in a blue sleeveless knit shirt lifts a pair of shoes with an 'N' pattern. Next, in a small frame on a pure yellow background, a green straightener is displayed. Finally, on the yellow background, there is green text 'BEAUTY'."
],
"correct_choice": 1,
"position": [
11352,
14174,
16816
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "iwXp1fT89-M_1",
"video_path": "iwXp1fT89-M.mp4",
"subtitle_path": "iwXp1fT89-M_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1156.19,
"view_count": 167192
},
{
"video_id": "iwXp1fT89-M",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a camera is displayed in a small frame on a plain yellow background, then the woman in the sleeveless blue knit shirt covers her face with both hands, and finally, the woman in the sleeveless blue knit shirt hangs a pair of headphones around her neck.",
"First, the woman in the sleeveless blue knit shirt hangs a pair of headphones around her neck, then the woman in the sleeveless blue knit shirt covers her face with both hands, and lastly, a camera is displayed in a small frame on a plain yellow background.",
"First, the woman in a sleeveless blue knit shirt covers her face with both hands, then the woman in the sleeveless blue knit shirt hangs a pair of headphones around her neck, and lastly, a camera is displayed in a small frame on a plain yellow background.",
"First, a camera is displayed in a small frame on a plain yellow background, then the woman in the sleeveless blue knit shirt hangs a pair of headphones around her neck, and finally, the woman in the sleeveless blue knit shirt covers her face with both hands.",
"First, the woman in a sleeveless blue knit shirt covers her face with both hands, then a camera is displayed in a small frame on a plain yellow background, and finally, the woman in the sleeveless blue knit shirt hangs a pair of headphones around her neck."
],
"correct_choice": 4,
"position": [
21752,
23882,
25795
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "iwXp1fT89-M_2",
"video_path": "iwXp1fT89-M.mp4",
"subtitle_path": "iwXp1fT89-M_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1156.19,
"view_count": 167192
},
{
"video_id": "pGEF7Tme3Tk",
"question": "In front of the white cabinet, there is a man with short hair and wearing a black short-sleeved shirt. He is holding a chickpea in his left hand with both hands raised. When he appears alongside the subtitle 'pork ragu or i\u2019m gonna do some romesco,' what changes occur to him?",
"question_wo_referring_query": "What changes occur to him?",
"candidates": [
"The chickpea in his hand turns into pork.",
"The chickpea in his hand turns into an egg.",
"The chickpea in his hand turns into a glass of milk.",
"The chickpea in his hand turns into a bluebonnet.",
"The chickpea in his hand turns into a kitchen knife."
],
"correct_choice": 3,
"position": [
1856,
2664
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "pGEF7Tme3Tk_0",
"video_path": "pGEF7Tme3Tk.mp4",
"subtitle_path": "pGEF7Tme3Tk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 941.94,
"view_count": 163165
},
{
"video_id": "pGEF7Tme3Tk",
"question": "On a silver desktop, there is a wooden board. On the wooden board, there is an empty silver metal plate. A blond man wearing a black short-sleeved shirt is holding a spatula with his left hand and looking at the plate. When the plate and the subtitle 'uh and like I was saying at the start' appear simultaneously, what change occurs to the plate?",
"question_wo_referring_query": "What change occurs to the plate?",
"candidates": [
"Yellow food material appears in the plate",
"Green food material appears in the plate",
"Eggs appear in the plate",
"Blue flowers appear in the plate",
"Eggs and beans appear in the plate"
],
"correct_choice": 0,
"position": [
8754,
10197
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "pGEF7Tme3Tk_1",
"video_path": "pGEF7Tme3Tk.mp4",
"subtitle_path": "pGEF7Tme3Tk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 941.94,
"view_count": 163165
},
{
"video_id": "pGEF7Tme3Tk",
"question": "On the silver countertop, there is a wooden board. A man with short hair wearing a black short-sleeved shirt is placing two empty white porcelain plates on the board. When the subtitles 'don't need you know a massive plate of' appear simultaneously with these two plates, what change occurs?",
"question_wo_referring_query": "What change occurs?",
"candidates": [
"The plates broke",
"The plates gained patterns",
"Only the left plate has extra food material",
"Both plates have extra food material",
"Only the right plate has extra food material"
],
"correct_choice": 4,
"position": [
20507,
20749
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "pGEF7Tme3Tk_2",
"video_path": "pGEF7Tme3Tk.mp4",
"subtitle_path": "pGEF7Tme3Tk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 941.94,
"view_count": 163165
},
{
"video_id": "aqUisZS9Ruw",
"question": "In a room, a woman with long black hair wearing a white coat is clenching her hands tightly. On the screen, there is the word 'but'. In which subtitles has this woman appeared together with these words?",
"question_wo_referring_query": "In which subtitles has this woman appeared together with these words?",
"candidates": [
"hope you want to",
"museums I saw my first dog just",
"please take please take my car s if they",
"people here also really nice especially",
"be spending all the night Moon"
],
"correct_choice": 3,
"position": [
975,
17325,
6560,
12129
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "aqUisZS9Ruw_0",
"video_path": "aqUisZS9Ruw.mp4",
"subtitle_path": "aqUisZS9Ruw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.01,
"view_count": 112220
},
{
"video_id": "aqUisZS9Ruw",
"question": "A long-haired woman is holding two books and smiling happily. The book in her right hand has a blue cover, and the book in her left hand has a green cover. With which subtitles does the book with the green cover appear together?",
"question_wo_referring_query": "With which subtitles does the book with the green cover appear together?",
"candidates": [
"wait I think they took it oh I was lit",
"up in an alternate reality which is",
"looks so yum",
"which one to get I just came back before",
"please take please take my car s if they"
],
"correct_choice": 1,
"position": [
1296,
2368,
6560,
7178,
9687
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "aqUisZS9Ruw_1",
"video_path": "aqUisZS9Ruw.mp4",
"subtitle_path": "aqUisZS9Ruw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.01,
"view_count": 112220
},
{
"video_id": "aqUisZS9Ruw",
"question": "In the scene, there is a woman wearing a gray coat and a black floral-patterned shirt. She is holding a gray-shelled mobile phone. With which subtitles did this mobile phone appear together?",
"question_wo_referring_query": "With which subtitles did this mobile phone appear together?",
"candidates": [
"hope you want to",
"water um I\u2019m not going to lie guys you",
"please take please take my car s if they",
"nice these malls are huge and they\u2019re",
"people here also really nice especially"
],
"correct_choice": 1,
"position": [
12875,
10162,
6560,
17325,
12129
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "aqUisZS9Ruw_2",
"video_path": "aqUisZS9Ruw.mp4",
"subtitle_path": "aqUisZS9Ruw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.01,
"view_count": 112220
},
{
"video_id": "TMe7oXMJoSM",
"question": "The woman wearing a black dress with braids is facing the camera. She is singing 'Don't Pull' while the word 'run' appears in the subtitles. What change happens to the woman?",
"question_wo_referring_query": "What change happens to the woman?",
"candidates": [
"She changes from standing on stage to sitting on one knee on the ground",
"She changes from standing on stage to sitting on a bench",
"She changes from singing on stage to kneeling on stage",
"She changes from standing on stage to sitting on both knees on the ground"
],
"correct_choice": 1,
"position": [
1538,
7280
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "TMe7oXMJoSM_0",
"video_path": "TMe7oXMJoSM.mp4",
"subtitle_path": "TMe7oXMJoSM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2805.11,
"view_count": 1011
},
{
"video_id": "TMe7oXMJoSM",
"question": "On the left side of the screen is a woman wearing a black low-necked dress, and on the right side of the screen is a black man with dreadlocks wearing a shirt. When the subtitle 'convict TC Whitehurst Gordon Georgia May' appears, what change happens to this man?",
"question_wo_referring_query": "What change happens to this man?",
"candidates": [
"This man goes from standing to sitting",
"This man has a lyrics book in his hand",
"This man puts on glasses",
"This man starts holding his head with both hands",
"This man has a Doberman on him"
],
"correct_choice": 1,
"position": [
15842,
34830
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "TMe7oXMJoSM_1",
"video_path": "TMe7oXMJoSM.mp4",
"subtitle_path": "TMe7oXMJoSM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2805.11,
"view_count": 1011
},
{
"video_id": "TMe7oXMJoSM",
"question": "On the left side of the screen is a woman wearing a black low-cut top, and on the right side of the screen is a man with glasses wearing a grey suit. The man is holding a lyric book and singing. What change occurs to this man when the lyrics 'our iise up good God good God iise up' appear?",
"question_wo_referring_query": "What change occurs to this man?",
"candidates": [
"Wearing glasses changes to not wearing glasses",
"Standing on the stage changes to kneeling on one knee",
"Holding a lyric book changes to tightly holding the woman's hand",
"Standing on the stage changes to sitting on a bench",
"Holding a lyric book changes to holding a dumbbell"
],
"correct_choice": 2,
"position": [
22954,
33449
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "TMe7oXMJoSM_2",
"video_path": "TMe7oXMJoSM.mp4",
"subtitle_path": "TMe7oXMJoSM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2805.11,
"view_count": 1011
},
{
"video_id": "8foMISZGiyw",
"question": "There are three people in the room. One person is sitting on a red sofa, one person is standing, and another person is sitting on a chair in the corner of the room. The person sitting on the chair has white hair, is wearing white inner clothes, and a coat over them. What is the man sitting on the chair doing?",
"question_wo_referring_query": "What is the man sitting on the chair doing?",
"candidates": [
"He is rubbing his eyes with his right hand",
"He picked up a wine glass",
"He is standing up from the chair",
"He is rubbing his eyes with his left hand",
"He is rubbing his eyes with both hands"
],
"correct_choice": 3,
"position": [
8228
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "8foMISZGiyw_0",
"video_path": "8foMISZGiyw.mp4",
"subtitle_path": "8foMISZGiyw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1215.42,
"view_count": 128549
},
{
"video_id": "8foMISZGiyw",
"question": "There is a half-open door in the room. Next to the door is a chair. A man wearing a black coat and white pants is beside the chair. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"He took off his coat.",
"He is preparing to sit on the chair.",
"He is standing up from the chair.",
"He moved the chair away.",
"He closed the door."
],
"correct_choice": 1,
"position": [
15699
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "8foMISZGiyw_1",
"video_path": "8foMISZGiyw.mp4",
"subtitle_path": "8foMISZGiyw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1215.42,
"view_count": 128549
},
{
"video_id": "8foMISZGiyw",
"question": "A woman wearing a striped short-sleeve dress is sitting on a chair in front of a pink wall window. A white cloth bag is placed on her lap. What is she doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"She is mending the cloth bag",
"She tore the cloth bag",
"She took a mask out of the cloth bag",
"She placed the cloth bag on the ground",
"She took a gun out of the cloth bag"
],
"correct_choice": 4,
"position": [
20240
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "8foMISZGiyw_2",
"video_path": "8foMISZGiyw.mp4",
"subtitle_path": "8foMISZGiyw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1215.42,
"view_count": 128549
},
{
"video_id": "J1IwKg2ufk8",
"question": "A person wearing an embroidered dress, whose face is not visible, is cutting a tomato on a wooden board with a vegetable knife. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present?",
"candidates": [
"golden fork",
"wooden spoon",
"ring",
"parsley",
"pasta"
],
"correct_choice": 2,
"position": [
7566
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "J1IwKg2ufk8_0",
"video_path": "J1IwKg2ufk8.mp4",
"subtitle_path": "J1IwKg2ufk8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1201.0,
"view_count": 528545
},
{
"video_id": "J1IwKg2ufk8",
"question": "A person wearing gray clothes without showing their face is holding the handle of the vegetable knife with their right hand and pressing the back of the knife with their left hand, cutting garlic on a wooden board. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"Silver bracelet",
"Watch",
"Vegetable knife without letters on the blade",
"Golden ring",
"Vegetable knife with letters on the blade"
],
"correct_choice": 4,
"position": [
14925
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "J1IwKg2ufk8_1",
"video_path": "J1IwKg2ufk8.mp4",
"subtitle_path": "J1IwKg2ufk8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1201.0,
"view_count": 528545
},
{
"video_id": "J1IwKg2ufk8",
"question": "In a black pot, there are red ingredients being cooked. A person without a visible face is holding a spatula, stirring the contents in the pot. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"chopsticks",
"noodles",
"wooden spatula",
"iron spatula",
"fork"
],
"correct_choice": 2,
"position": [
27288
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "J1IwKg2ufk8_2",
"video_path": "J1IwKg2ufk8.mp4",
"subtitle_path": "J1IwKg2ufk8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1201.0,
"view_count": 528545
},
{
"video_id": "XuQswmEPgxU",
"question": "At the very beginning of the video, a woman wearing a hat and a fur coat is standing in front of a car. In the car, in the driver's seat, there is a man also wearing a fur coat. In the upper left corner of the screen, the white text 'HARLEM IS EVERY WHERE' appears. Where else does this white text appear?",
"question_wo_referring_query": "Where else does the white text appear?",
"candidates": [
"In the purple background on the left side, there is a group of English words, and on the right side, there is a picture of a nude woman sitting with her hands hugging her knees, staring at the stove",
"In a black background on the left side, there is a picture of a person wearing a purple skirt sitting on the grass",
"In a green background on the right side, there is a picture of a person wearing a red skirt sitting in front of a stove",
"In a red background on the right side, there is a picture of a nude person sitting in front of a stove",
"In a purple background on the right side, there is a picture of a person wearing a white short-sleeve shirt sitting in front of a stove"
],
"correct_choice": 0,
"position": [
1307,
17693
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "XuQswmEPgxU_0",
"video_path": "XuQswmEPgxU.mp4",
"subtitle_path": "XuQswmEPgxU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1843.23,
"view_count": 2299
},
{
"video_id": "XuQswmEPgxU",
"question": "In the gray background, on the far right side, there is a picture. In the picture, a woman is turning her head to look at a man behind her. Below the picture, in white English text, it says 'James Van Der Zee, Self-Portrait with Gaynella Greenlee'. Where else has this picture appeared?",
"question_wo_referring_query": "Where else has this picture appeared?",
"candidates": [
"Appears in the top right corner of a screen displaying eight pictures at the same time",
"Appears in the top left corner of a screen displaying six pictures at the same time",
"Appears in the top right corner of a screen displaying six pictures at the same time",
"Appears in the bottom left corner of a screen displaying six pictures at the same time",
"Appears in the bottom right corner of a screen displaying six pictures at the same time"
],
"correct_choice": 4,
"position": [
16165,
42245
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "XuQswmEPgxU_1",
"video_path": "XuQswmEPgxU.mp4",
"subtitle_path": "XuQswmEPgxU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1843.23,
"view_count": 2299
},
{
"video_id": "XuQswmEPgxU",
"question": "On the far right side of the gray background, there is a picture depicting a crescent yellow moon and two black-skinned figures. Below this picture, there are white English words reading 'William H. Johnson, Street Life, Harlem, ca.' Where else has this picture on the far right appeared?",
"question_wo_referring_query": "Where else has the picture on the far right appeared?",
"candidates": [
"It appeared in the upper left corner of a screen with six pictures simultaneously.",
"It appeared in the lower right corner of a screen with six pictures simultaneously.",
"It appeared in the upper right corner of a screen with four pictures simultaneously.",
"It appeared in the lower left corner of a screen with six pictures simultaneously.",
"It appeared in the upper right corner of a screen with eight pictures simultaneously."
],
"correct_choice": 3,
"position": [
25545,
43654
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "XuQswmEPgxU_2",
"video_path": "XuQswmEPgxU.mp4",
"subtitle_path": "XuQswmEPgxU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1843.23,
"view_count": 2299
},
{
"video_id": "NSeq-nVSY_E",
"question": "In a white background PPT, the top left corner has the black English text '1st Pass: Contrastive Loss'. Which captions appear at the same time as the blue background icon with English text 'image Encoder' inside the dashed box in the middle of the screen?",
"question_wo_referring_query": "Which captions appear at the same time?",
"candidates": [
"'you're going to get a very low value for'",
"'you turn on the cross attention switch'",
"'and when they're like further apart'",
"'going to get a very high value for pi'",
"'pi and then what you're going to do is'"
],
"correct_choice": 1,
"position": [
8428,
12232
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TOS",
"level": "L2-Relation",
"id": "NSeq-nVSY_E_0",
"video_path": "NSeq-nVSY_E.mp4",
"subtitle_path": "NSeq-nVSY_E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1309.6,
"view_count": 67
},
{
"video_id": "NSeq-nVSY_E",
"question": "In the PPT with a white background, the upper left corner shows the black English text 'Proposed Solution', and on the right side of the screen there's a yellow background icon with the English text 'language Decoder'. Which subtitles appear simultaneously with it?",
"question_wo_referring_query": "Which subtitles appear simultaneously with it?",
"candidates": [
"\"functions but mmud already has learned\"",
"\"usually take some frame by frame\"",
"\"positional EMB adding sign s or cosine\"",
"\"some embeddings from the encoder so what\"",
"\"you're going to get a very low value for\""
],
"correct_choice": 1,
"position": [
3786,
16616
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TOS",
"level": "L2-Relation",
"id": "NSeq-nVSY_E_1",
"video_path": "NSeq-nVSY_E.mp4",
"subtitle_path": "NSeq-nVSY_E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1309.6,
"view_count": 67
},
{
"video_id": "NSeq-nVSY_E",
"question": "In the white-background PPT, the top left corner is the black English text '1st Pass: Contrastive Loss'. Below the dashed box in the middle of the screen, which subtitles have appeared at the same time as a horizontal arrow pointing to the right?",
"question_wo_referring_query": "Which subtitles have appeared at the same time?",
"candidates": [
"\"the same part this it's not a very good\"",
"\"you're going to get a very low value for\"",
"\"image so when you uh so they do not work\"",
"\"well for U detection at Region level\"",
"\"pi and then what you're going to do is\""
],
"correct_choice": 0,
"position": [
8151,
13092
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TOS",
"level": "L2-Relation",
"id": "NSeq-nVSY_E_2",
"video_path": "NSeq-nVSY_E.mp4",
"subtitle_path": "NSeq-nVSY_E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1309.6,
"view_count": 67
},
{
"video_id": "WTT7XZko3qk",
"question": "In a room with a wall painting behind them and a wooden cabinet, on the left side of the screen, there is a man wearing a red plaid jacket and khaki pants. In the middle, there is a man wearing a black hoodie and jeans. On the right side, there is a man with a goatee wearing a black jacket over a white hoodie. What did they do after leaving the dining hall?",
"question_wo_referring_query": "What did they do after leaving the dining hall?",
"candidates": [
"They bid farewell at the entrance of the dining hall.",
"They ate food by the roadside.",
"They sat inside a car.",
"They had a conversation at the entrance of the dining hall.",
"They took a walk outside."
],
"correct_choice": 1,
"position": [
3297,
3695
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "WTT7XZko3qk_0",
"video_path": "WTT7XZko3qk.mp4",
"subtitle_path": "WTT7XZko3qk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1208.96,
"view_count": 1374559
},
{
"video_id": "WTT7XZko3qk",
"question": "Three men are seated inside a car. The driver and the front passenger are wearing white hooded suits, while the man in the back seat is holding a camera. What happened after the driver looked towards the rear window and turned the steering wheel?",
"question_wo_referring_query": "What happened?",
"candidates": [
"They stood on the ground and hugged each other.",
"A white car skidded on the dusty ground, kicking up a large amount of sand.",
"The man in the white suit struggled in the trunk.",
"They arrived at a restaurant to eat.",
"The man in the white suit's hands were tied up and he was placed in the trunk."
],
"correct_choice": 1,
"position": [
15539,
15783
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "WTT7XZko3qk_1",
"video_path": "WTT7XZko3qk.mp4",
"subtitle_path": "WTT7XZko3qk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1208.96,
"view_count": 1374559
},
{
"video_id": "WTT7XZko3qk",
"question": "A man wearing a black coat and blue jeans is lying on his side in the trunk of a car with his hands and feet tied up with tape. What happens next?",
"question_wo_referring_query": "What happens next?",
"candidates": [
"Three men are practicing hand-tied drills on a deserted ground",
"The man in black clothes struggles free from the tape binding his hands and feet inside the trunk",
"Three men are listening to a professional in black short sleeves explaining something",
"Three men are singing inside a car",
"Three men are dining inside a restaurant"
],
"correct_choice": 1,
"position": [
8510,
9418
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "WTT7XZko3qk_2",
"video_path": "WTT7XZko3qk.mp4",
"subtitle_path": "WTT7XZko3qk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1208.96,
"view_count": 1374559
},
{
"video_id": "qsH6q5wNso4",
"question": "Which sequence of scenes in the video is correct?",
"question_wo_referring_query": "Which sequence of scenes in the video is correct?",
"candidates": [
"A man wearing a green hooded coat is talking in front of a background wall with a picture of the Earth and some paintings hanging, four 3D demonstration images of the moon rover landing on the moon, and on his right, there is an English news webpage with some icons.",
"Four 3D demonstration images of the moon rover landing on the moon, on the right is an English news webpage with some icons, and a man wearing a green hooded coat is talking in front of a background wall with a picture of the Earth and some paintings hanging.",
"On the right is an English news webpage with some icons, four 3D demonstration images of the moon rover landing on the moon, and a man wearing a green hooded coat is talking in front of a background wall with a picture of the Earth and some paintings hanging.",
"A man wearing a green hooded coat is talking in front of a background wall with a picture of the Earth and some paintings hanging. On his right, there is an English news webpage with some icons, and four 3D demonstration images of the moon rover landing on the moon.",
"On the right is an English news webpage with some icons, a man wearing a green hooded coat is talking in front of a background wall with a picture of the Earth and some paintings hanging, and four 3D demonstration images of the moon rover landing on the moon."
],
"correct_choice": 3,
"position": [
2106,
2685,
4620
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SSS",
"level": "L2-Relation",
"id": "qsH6q5wNso4_0",
"video_path": "qsH6q5wNso4.mp4",
"subtitle_path": "qsH6q5wNso4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1002.36,
"view_count": 97613
},
{
"video_id": "qsH6q5wNso4",
"question": "In the video, which of the following sequences of scenes is correct?",
"question_wo_referring_query": "In the video, which of the following sequences of scenes is correct?",
"candidates": [
"A yellow lunar rover with the Indian flag on the moon with a blue Earth in the background, a man in front of a starry background wearing a black T-shirt with white letters, a gray hat, and headphones speaking, a grayscale photo of craters on the moon's surface taken on the moon",
"A grayscale photo of craters on the moon's surface taken on the moon, a man in front of a starry background wearing a black T-shirt with white letters, a gray hat, and headphones speaking, a yellow lunar rover with the Indian flag on the moon with a blue Earth in the background",
"A grayscale photo of craters on the moon's surface taken on the moon, a yellow lunar rover with the Indian flag on the moon with a blue Earth in the background, a man in front of a starry background wearing a black T-shirt with white letters, a gray hat, and headphones speaking",
"A yellow lunar rover with the Indian flag on the moon with a blue Earth in the background, a grayscale photo of craters on the moon's surface taken on the moon, a man in front of a starry background wearing a black T-shirt with white letters, a gray hat, and headphones speaking",
"A man in front of a starry background wearing a black T-shirt with white letters, a gray hat, and headphones speaking, a grayscale photo of craters on the moon's surface taken on the moon, a yellow lunar rover with the Indian flag on the moon with a blue Earth in the background"
],
"correct_choice": 3,
"position": [
8707,
9353,
10892
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SSS",
"level": "L2-Relation",
"id": "qsH6q5wNso4_1",
"video_path": "qsH6q5wNso4.mp4",
"subtitle_path": "qsH6q5wNso4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1002.36,
"view_count": 97613
},
{
"video_id": "qsH6q5wNso4",
"question": "Which of the following sequences of scenes in the video is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes in the video is correct?",
"candidates": [
"Indoors, a transparent globe, held by a figure in a red sleeve who is holding a yellow map board, is displayed in front of a mirror; in a white background, to the left, a gray moon is shown with three arrows pointing at a blue and white Earth; at night, under the moonlight, the shadow of a pine tree is cast on the snow.",
"In a white background, to the left, a gray moon is shown with three arrows pointing at a blue and white Earth; at night, under the moonlight, the shadow of a pine tree is cast on the snow; indoors, a transparent globe, held by a figure in a red sleeve who is holding a yellow map board, is displayed in front of a mirror.",
"At night, under the moonlight, the shadow of a pine tree is cast on the snow; indoors, a transparent globe, held by a figure in a red sleeve who is holding a yellow map board, is displayed in front of a mirror; in a white background, to the left, a gray moon is shown with three arrows pointing at a blue and white Earth.",
"In a white background, to the left, a gray moon is shown with three arrows pointing at a blue and white Earth; indoors, a transparent globe, held by a figure in a red sleeve who is holding a yellow map board, is displayed in front of a mirror; at night, under the moonlight, the shadow of a pine tree is cast on the snow.",
"At night, under the moonlight, the shadow of a pine tree is cast on the snow; in a white background, to the left, a gray moon is shown with three arrows pointing at a blue and white Earth; indoors, a transparent globe, held by a figure in a red sleeve who is holding a yellow map board, is displayed in front of a mirror."
],
"correct_choice": 4,
"position": [
16436,
17237,
18186
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SSS",
"level": "L2-Relation",
"id": "qsH6q5wNso4_2",
"video_path": "qsH6q5wNso4.mp4",
"subtitle_path": "qsH6q5wNso4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1002.36,
"view_count": 97613
},
{
"video_id": "Ro_8-CCORzk",
"question": "A woman with short blonde hair and wearing earrings is speaking in front of a mirror. She is wearing black clothes, with a gray background wall behind her. In front of her is a white subtitle bar that reads 'BLANC:PENSIONS MARKETGROWING'. The subtitle bar changes to 'BLANC: CHANGETHROUGH AITO COMEIN WAVES', and a pop-up window appears on the right side of the screen that reads 'Amanda Blanc'. What change did this woman experience?",
"question_wo_referring_query": "What change did this woman experience?",
"candidates": [
"The woman shook hands with a person next to her",
"The woman got a haircut",
"The woman raised both hands",
"The woman stood up",
"The woman took a sip of water"
],
"correct_choice": 2,
"position": [
2001,
6298
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Ro_8-CCORzk_0",
"video_path": "Ro_8-CCORzk.mp4",
"subtitle_path": "Ro_8-CCORzk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1327.9,
"view_count": 1515
},
{
"video_id": "Ro_8-CCORzk",
"question": "The screen shows a woman with black and yellow hair wearing a light blue suit. She has a necklace on and is speaking in front of a mirror, slightly angled. Behind her is a gray-white background wall. When she and a short-haired woman in black clothes are sitting at a small round table talking, with a picture of a building with a light strip on the wall, what change happens to this woman?",
"question_wo_referring_query": "When there is a picture of a building with a light strip on the wall, what change happens to this woman?",
"candidates": [
"The woman is fixing her hair.",
"The woman stands up.",
"The woman is shaking hands with the woman in black clothes.",
"The woman is listening with her hands on her knees.",
"The woman sits upright holding a glass of water."
],
"correct_choice": 3,
"position": [
7405,
9688
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Ro_8-CCORzk_1",
"video_path": "Ro_8-CCORzk.mp4",
"subtitle_path": "Ro_8-CCORzk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1327.9,
"view_count": 1515
},
{
"video_id": "Ro_8-CCORzk",
"question": "In the video, there are two men in suits standing together, both lightly resting their crossed hands in front of them. The man on the left is wearing a black suit with a brown tie and black-framed glasses. The man on the right has some grey in his hair and is wearing a maroon tie. Behind them is a blue wall with some blue rectangular logos and two flags in front of the wall. When the camera is focused solely on the man with greyish hair on the right, with a blue flag behind him, what kind of change does this man undergo?",
"question_wo_referring_query": "What kind of change does this man undergo?",
"candidates": [
"The man sits down",
"The man's tie changes to purple",
"The man turns his head to look to the side",
"The man has an extra pin on his chest",
"The man raises his hands"
],
"correct_choice": 4,
"position": [
26551,
26783
],
"topic_category": "NP-News-Programs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Ro_8-CCORzk_2",
"video_path": "Ro_8-CCORzk.mp4",
"subtitle_path": "Ro_8-CCORzk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1327.9,
"view_count": 1515
},
{
"video_id": "QPth_xqBXGY",
"question": "The screen displays a black PPT background with the title 'War in Ukraine Factors' in gray letters. Below are three circles containing some drawings, followed by gray and red rectangles with some content written inside. What changes occur to the rectangles when the phrase 'actions of mot. riflemen and tanks to accomplish the task.' is mentioned?",
"question_wo_referring_query": "What changes occur to the rectangles?",
"candidates": [
"Both rectangles became empty",
"Both rectangles were filled with more content",
"The gray rectangle disappeared",
"The red rectangle disappeared",
"The red rectangle was enlarged"
],
"correct_choice": 2,
"position": [
6621,
6733
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "QPth_xqBXGY_0",
"video_path": "QPth_xqBXGY.mp4",
"subtitle_path": "QPth_xqBXGY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1745.97,
"view_count": 376091
},
{
"video_id": "QPth_xqBXGY",
"question": "The screen shows a black PPT background with the title 'Interwar' on the slide. There are three white circles at the bottom containing drawings, and also a white line drawing of a tank. When the subtitle 'debate, as discussed in an article by Walther Nehring about anti-tank defense from 1936' appears, what happens to the tank?",
"question_wo_referring_query": "The screen shows a black PPT background with the title 'Interwar' on the slide. There are three white circles at the bottom containing drawings, and also a white line drawing of a tank. When the subtitle 'debate, as discussed in an article by Walther Nehring about anti-tank defense from 1936' appears, what happens to the tank?",
"candidates": [
"The tank rotates",
"The tank gradually shrinks",
"The tank gradually enlarges",
"The tank disappears",
"The tank rotates in circles on the screen"
],
"correct_choice": 3,
"position": [
16376,
16534
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "QPth_xqBXGY_1",
"video_path": "QPth_xqBXGY.mp4",
"subtitle_path": "QPth_xqBXGY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1745.97,
"view_count": 376091
},
{
"video_id": "QPth_xqBXGY",
"question": "The screen shows a black PPT background with the title 'Simple Argument...' on it. There are 10 different circle shapes in the PPT, one of which has a question mark drawn inside. When the phrase 'of tanks, even older ones they can actually decimate your infantry fighting vehicles as' is mentioned, what change occurs to the circle with the question mark?",
"question_wo_referring_query": "What change occurs to the circle with the question mark?",
"candidates": [
"The circle shape moves to the right side of the PPT",
"The circle shape moves to the left side of the PPT",
"The circle shape moves to the bottom left corner",
"The circle shape disappears from the screen",
"The circle shape moves to the right side of the PPT"
],
"correct_choice": 2,
"position": [
30275,
31198
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "QPth_xqBXGY_2",
"video_path": "QPth_xqBXGY.mp4",
"subtitle_path": "QPth_xqBXGY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1745.97,
"view_count": 376091
},
{
"video_id": "_7sd4fjnmvc",
"question": "A woman wearing a purple dress with an injured face is lying on a wooden floor, and a man is helping her up. In what scenes has the woman in the purple dress appeared?",
"question_wo_referring_query": "In what scenes has the woman in the purple dress appeared?",
"candidates": [
"A woman sitting on a running horse",
"A woman sitting on a chair, holding her waist with her hands, next to her is a man with a peaked cap with his hands on a table, and there is a red hat on the table",
"A woman sitting on a chair, holding her waist with her hands, next to her is a man with a peaked cap with his hands on a table, and there is a black hat on the table",
"On a road filled with wooden logs",
"On a boat"
],
"correct_choice": 2,
"position": [
7780,
8216
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "_7sd4fjnmvc_0",
"video_path": "_7sd4fjnmvc.mp4",
"subtitle_path": "_7sd4fjnmvc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1317.28,
"view_count": 61595
},
{
"video_id": "_7sd4fjnmvc",
"question": "In the dark night, there is an old man wearing a hat, one hand holding a flickering candle, one hand on the door, and watching a man and a woman. In which scenes has the old man in the hat appeared?",
"question_wo_referring_query": "In which scenes has the old man in the hat appeared?",
"candidates": [
"A crowded street",
"Sitting in a room with a glowing light bulb",
"Lying in a room with a flickering candle",
"Sitting in a room with a flickering candle",
"Sitting in a carriage"
],
"correct_choice": 3,
"position": [
18027,
18986
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "_7sd4fjnmvc_1",
"video_path": "_7sd4fjnmvc.mp4",
"subtitle_path": "_7sd4fjnmvc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1317.28,
"view_count": 61595
},
{
"video_id": "_7sd4fjnmvc",
"question": "Under a tree, there's a pile of dead wood. A man wearing a black headscarf is taking a yellow box out from the dead wood. In what scenes has the yellow box appeared?",
"question_wo_referring_query": "In what scenes has the yellow box appeared?",
"candidates": [
"The box is on a tree",
"The box is on a rooftop",
"The box is on a carriage",
"The box is on a boat",
"The box is buried in the soil, a man and a woman are digging it up"
],
"correct_choice": 4,
"position": [
3207,
30978
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "_7sd4fjnmvc_2",
"video_path": "_7sd4fjnmvc.mp4",
"subtitle_path": "_7sd4fjnmvc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1317.28,
"view_count": 61595
},
{
"video_id": "T15Kv6dtYO0",
"question": "In the oil painting, there are a few men wearing clothes of various colors discussing in the back left, while on the front right, there's a bucket containing a liquid. Beside the bucket, a few women are stirring the liquid with wooden sticks. What color is the liquid inside the bucket in the painting?",
"question_wo_referring_query": "What color is the liquid inside the bucket in the painting?",
"candidates": [
"yellow",
"white",
"red",
"black",
"blue"
],
"correct_choice": 2,
"position": [
7336
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "T15Kv6dtYO0_0",
"video_path": "T15Kv6dtYO0.mp4",
"subtitle_path": "T15Kv6dtYO0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 939.87,
"view_count": 628328
},
{
"video_id": "T15Kv6dtYO0",
"question": "On the wooden floor, there is a pair of white sandals on the left side and a pair of red slippers along with a colorful rug on the right side. Behind the slippers is a cabinet. What color is the cabinet in the video?",
"question_wo_referring_query": "What color is the cabinet in the video?",
"candidates": [
"white",
"brown",
"blue",
"red",
"green"
],
"correct_choice": 3,
"position": [
10931
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "T15Kv6dtYO0_1",
"video_path": "T15Kv6dtYO0.mp4",
"subtitle_path": "T15Kv6dtYO0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 939.87,
"view_count": 628328
},
{
"video_id": "T15Kv6dtYO0",
"question": "In the room, on the left is an open window, on the right is a red bed, on the back wall there is a gear-shaped decoration hanging, on the ceiling there is a chandelier, in the front a man dressed in black and wearing a black hat is holding the hand of a woman dressed in green and wearing a white headscarf. What is the shape of the hat in the room?",
"question_wo_referring_query": "What is the shape of the hat in the room?",
"candidates": [
"round",
"square",
"triangular",
"rectangular",
"stair-shaped"
],
"correct_choice": 0,
"position": [
20175
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "T15Kv6dtYO0_2",
"video_path": "T15Kv6dtYO0.mp4",
"subtitle_path": "T15Kv6dtYO0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 939.87,
"view_count": 628328
},
{
"video_id": "51dUUxFOjDE",
"question": "The screen is divided into three parts: left, center, and right. The left side is all text, the center is a map, and the right side is a man in black clothing. After the man waves his hands to greet the camera, what does he do?",
"question_wo_referring_query": "After the man waves his hands to greet the camera, what does he do?",
"candidates": [
"Put on a hat",
"Coughed",
"Took a sip of water",
"Tidied his clothes",
"Picked up a book"
],
"correct_choice": 4,
"position": [
52,
131
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "51dUUxFOjDE_0",
"video_path": "51dUUxFOjDE.mp4",
"subtitle_path": "51dUUxFOjDE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1508.0,
"view_count": 16800
},
{
"video_id": "51dUUxFOjDE",
"question": "The screen is divided into three parts: left, center, and right. The left side is all text, the center is a map, and the right side is a man in black clothing. What did the man do after the map in the center moved?",
"question_wo_referring_query": "What did the man do after the map in the center moved?",
"candidates": [
"Waved at the camera",
"Put on a coat",
"Stood up",
"Kneeled down",
"Looked up"
],
"correct_choice": 4,
"position": [
14830,
14918
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "51dUUxFOjDE_1",
"video_path": "51dUUxFOjDE.mp4",
"subtitle_path": "51dUUxFOjDE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1508.0,
"view_count": 16800
},
{
"video_id": "51dUUxFOjDE",
"question": "The screen is divided into three sections: the left is all text, the center is a map, and the right shows a man in black clothing. After gesturing with his hand towards the camera, what did the man do next?",
"question_wo_referring_query": "After gesturing with his hand towards the camera, what did the man do next?",
"candidates": [
"Took a sip of water",
"Lifted a stool slightly",
"Touched his own face with his hand",
"Touched his hair briefly",
"Knocked over a cup of water"
],
"correct_choice": 2,
"position": [
34937,
34994
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "51dUUxFOjDE_2",
"video_path": "51dUUxFOjDE.mp4",
"subtitle_path": "51dUUxFOjDE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1508.0,
"view_count": 16800
},
{
"video_id": "2vVQo_GMA70",
"question": "Who is the first person to appear in the video?",
"question_wo_referring_query": "Is it?",
"candidates": [
"The woman wearing a black T-shirt with white patterns and black long hair",
"The woman wearing a white coat and glasses",
"The child wearing a hat",
"The man wearing a suit",
"The child without a shirt"
],
"correct_choice": 0,
"position": [
112,
228
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O3O",
"level": "L2-Relation",
"id": "2vVQo_GMA70_0",
"video_path": "2vVQo_GMA70.mp4",
"subtitle_path": "2vVQo_GMA70_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2041.01,
"view_count": 12309
},
{
"video_id": "2vVQo_GMA70",
"question": "In the video, after the blue yarn ring appears for the first time, what is attached?",
"question_wo_referring_query": "What is it?",
"candidates": [
"purple yarn ring",
"pencil",
"rubber eraser",
"animal head sticker",
"a piece of paper with 'Aldehyde' written on it"
],
"correct_choice": 4,
"position": [
31204,
31249
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O3O",
"level": "L2-Relation",
"id": "2vVQo_GMA70_1",
"video_path": "2vVQo_GMA70.mp4",
"subtitle_path": "2vVQo_GMA70_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2041.01,
"view_count": 12309
},
{
"video_id": "2vVQo_GMA70",
"question": "In the video, after the yellow-bordered rectangular strip of paper is pasted, what appears next?",
"question_wo_referring_query": ", what appears next?",
"candidates": [
"red strip of paper",
"pencil",
"notebook",
"mobile phone",
"white strip of paper with lines on it"
],
"correct_choice": 4,
"position": [
35114,
35216
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O3O",
"level": "L2-Relation",
"id": "2vVQo_GMA70_2",
"video_path": "2vVQo_GMA70.mp4",
"subtitle_path": "2vVQo_GMA70_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2041.01,
"view_count": 12309
},
{
"video_id": "Sn7JPKbG6tY",
"question": "In a room with sunlight streaming through the window, decorated with many green plants, and with a shelf holding books and other items, there is a woman holding a cup. What is the color of this cup?",
"question_wo_referring_query": "What is the color of this cup?",
"candidates": [
"green",
"purple",
"red",
"black",
"white"
],
"correct_choice": 0,
"position": [
229
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Sn7JPKbG6tY_0",
"video_path": "Sn7JPKbG6tY.mp4",
"subtitle_path": "Sn7JPKbG6tY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1034.0,
"view_count": 476047
},
{
"video_id": "Sn7JPKbG6tY",
"question": "In a room, sunlight passes through the window and shines in. The room is decorated with many green plants. There is also a shelf containing books and other items. A woman is holding a makeup item. What shape are the glasses the woman is wearing?",
"question_wo_referring_query": "In a room, sunlight passes through the window and shines in. The room is decorated with many green plants. There is also a shelf containing books and other items. A woman is holding a makeup item. What shape are the glasses the woman is wearing?",
"candidates": [
"Triangle",
"Irregular round shape",
"Square",
"Circle",
"Stair-shaped"
],
"correct_choice": 1,
"position": [
9802
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Sn7JPKbG6tY_1",
"video_path": "Sn7JPKbG6tY.mp4",
"subtitle_path": "Sn7JPKbG6tY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1034.0,
"view_count": 476047
},
{
"video_id": "Sn7JPKbG6tY",
"question": "In a room with sunlight streaming through the windows, the room is decorated with many green plants. There is also a shelf with books and other items on it. A woman is admiring her eyeshadow in a mirror. What color is the woman's nail polish?",
"question_wo_referring_query": "What color is the woman's nail polish?",
"candidates": [
"black",
"green",
"red",
"purple",
"white"
],
"correct_choice": 0,
"position": [
19940
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Sn7JPKbG6tY_2",
"video_path": "Sn7JPKbG6tY.mp4",
"subtitle_path": "Sn7JPKbG6tY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1034.0,
"view_count": 476047
},
{
"video_id": "pJI5ZU6wxqg",
"question": "Many private cars, motorcycles, and buses are driving on a road marked with white dashed lines. On both sides of the road are orderly rows of big trees and green grass. After the white car at the bottom center of the screen disappears from view, what happens?",
"question_wo_referring_query": "what happens?",
"candidates": [
"A man is washing a car",
"A man is helping a woman wash a car",
"A blue car and a person appear",
"A red car and a person appear",
"A blue car and two people appear"
],
"correct_choice": 3,
"position": [
6142,
6143
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "pJI5ZU6wxqg_0",
"video_path": "pJI5ZU6wxqg.mp4",
"subtitle_path": "pJI5ZU6wxqg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1261.12,
"view_count": 3015
},
{
"video_id": "pJI5ZU6wxqg",
"question": "In the scene, floodwaters are surging, and some vehicles are driving through the water. A woman with long hair, wearing a red jacket and a white undershirt, is holding an umbrella. Beside her is a young boy in a blue short-sleeved shirt and a cyclist in a blue raincoat. After the woman and the young boy brush past the cyclist in the blue raincoat, what happens?",
"question_wo_referring_query": "What happens next?",
"candidates": [
"A tall man is carrying a young girl",
"Two women are holding an umbrella, walking forward in the water",
"An elderly person is being carried by a rescue team member",
"A rescue dog appears",
"A woman wearing a red hat is moving forward in the water"
],
"correct_choice": 4,
"position": [
12845,
12848
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "pJI5ZU6wxqg_1",
"video_path": "pJI5ZU6wxqg.mp4",
"subtitle_path": "pJI5ZU6wxqg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1261.12,
"view_count": 3015
},
{
"video_id": "pJI5ZU6wxqg",
"question": "There is a lush tree on the screen. Beside it, there is a small wooden house. A man wearing a hat and dressed in yellow and black clothing is looking towards the camera while riding an electric vehicle. Some people are chatting by the roadside. What happens after a white car passes through the screen?",
"question_wo_referring_query": "What happens after?",
"candidates": [
"A white car appears",
"An elderly person holding a palm fan appears",
"A little girl with a backpack appears",
"A large truck appears",
"A person wearing a helmet appears riding a motorcycle"
],
"correct_choice": 4,
"position": [
23080,
23132
],
"topic_category": "NP-News-Programs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "pJI5ZU6wxqg_2",
"video_path": "pJI5ZU6wxqg.mp4",
"subtitle_path": "pJI5ZU6wxqg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1261.12,
"view_count": 3015
},
{
"video_id": "aVHAr8rc-Ks",
"question": "Who appears first in the video among the following characters?",
"question_wo_referring_query": "Who appears first in the video among the following characters?",
"candidates": [
"A man wearing an orange jacket with a small name tag, sitting in front of a bookshelf filled with books.",
"A man wearing a white baseball cap inside a museum.",
"A woman with golden hair talking to a mirror, with a glowing display screen in the lower right corner.",
"A woman with golden hair wearing a checkered scarf inside a museum.",
"A man wearing a black top with a patterned neckline, a black beret, and black-rimmed glasses, sitting in front of a name tag."
],
"correct_choice": 0,
"position": [
95,
1468,
4359
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O3O",
"level": "L2-Relation",
"id": "aVHAr8rc-Ks_0",
"video_path": "aVHAr8rc-Ks.mp4",
"subtitle_path": "aVHAr8rc-Ks_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 928.56,
"view_count": 253142
},
{
"video_id": "aVHAr8rc-Ks",
"question": "Which of the following scenes appears first in the video?",
"question_wo_referring_query": "Which of the following scenes appears first in the video?",
"candidates": [
"A museum filled with dinosaur fossils.",
"A room with wallpaper on the right side, a wooden bookshelf on the left with books and a globe.",
"A zoo with many animals.",
"A room with hats hanging on the left wall, a black shelf with a white fan, and some wall decorations on the right wall.",
"A dimly lit room with a yellowed screen displaying text in English."
],
"correct_choice": 1,
"position": [
867,
1854,
2305,
11596
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O3O",
"level": "L2-Relation",
"id": "aVHAr8rc-Ks_1",
"video_path": "aVHAr8rc-Ks.mp4",
"subtitle_path": "aVHAr8rc-Ks_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 928.56,
"view_count": 253142
},
{
"video_id": "aVHAr8rc-Ks",
"question": "What is the first concept mentioned after the man, sitting in front of the microphone wearing a black shirt with a pattern on the neck and a black cap and black-rimmed glasses, talks about evolution?",
"question_wo_referring_query": "Which concept is mentioned first?",
"candidates": [
"Human evolution differences",
"Animal fossilization",
"Vertebrate",
"Plant fossilization",
"Mythical creature"
],
"correct_choice": 3,
"position": [
1983,
3141,
7495,
13242,
14994,
20565
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O3O",
"level": "L2-Relation",
"id": "aVHAr8rc-Ks_2",
"video_path": "aVHAr8rc-Ks.mp4",
"subtitle_path": "aVHAr8rc-Ks_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 928.56,
"view_count": 253142
},
{
"video_id": "UMFy3keSk-s",
"question": "In a room with white walls, there is a brown wardrobe and a black television. A man in a pink short-sleeve shirt is sitting in the room. Which objects are present in the room with white walls?",
"question_wo_referring_query": "Which objects are present in the room with white walls?",
"candidates": [
"A blue hat",
"A blue curtain",
"A black chair",
"A black and grey backpack",
"A white bed"
],
"correct_choice": 3,
"position": [
8328
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "UMFy3keSk-s_0",
"video_path": "UMFy3keSk-s.mp4",
"subtitle_path": "UMFy3keSk-s_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1139.41,
"view_count": 22433
},
{
"video_id": "UMFy3keSk-s",
"question": "In a scene with a lakeside view, there is a green tree by the lakeside, and a woman in a black jacket is running towards the lakeside. What objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"White glasses",
"Blue skirt",
"Black skirt",
"Pink shorts",
"Pink skirt"
],
"correct_choice": 3,
"position": [
18986
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "UMFy3keSk-s_1",
"video_path": "UMFy3keSk-s.mp4",
"subtitle_path": "UMFy3keSk-s_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1139.41,
"view_count": 22433
},
{
"video_id": "UMFy3keSk-s",
"question": "In clear weather, there is a dirt road with a green forest in the distance. A man in a gray short-sleeved shirt is standing on the dirt road. Which objects are present in the scene?",
"question_wo_referring_query": "Which objects are present in the scene?",
"candidates": [
"Black dog",
"White chrysanthemum",
"Elephant",
"White hat",
"Blue hat"
],
"correct_choice": 2,
"position": [
27108
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "UMFy3keSk-s_2",
"video_path": "UMFy3keSk-s.mp4",
"subtitle_path": "UMFy3keSk-s_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1139.41,
"view_count": 22433
},
{
"video_id": "fryyNwUCPWA",
"question": "In a round cylindrical space, there is a man sitting who is wearing a white coat and blue jeans. When the subtitle 'orange nice well I honestly never' appears, what objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"white hat",
"silver ring",
"red shoes",
"blue hat",
"black-framed glasses"
],
"correct_choice": 2,
"position": [
6674
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "fryyNwUCPWA_0",
"video_path": "fryyNwUCPWA.mp4",
"subtitle_path": "fryyNwUCPWA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 980.96,
"view_count": 19676
},
{
"video_id": "fryyNwUCPWA",
"question": "In a room with white walls, there is a beige curved table with a computer on it. When the subtitle 'we're about to fly well I think it's a' appears, what objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"black remote control",
"white computer",
"white keyboard",
"blue hat",
"red coat"
],
"correct_choice": 0,
"position": [
14288
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "fryyNwUCPWA_1",
"video_path": "fryyNwUCPWA.mp4",
"subtitle_path": "fryyNwUCPWA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 980.96,
"view_count": 19676
},
{
"video_id": "fryyNwUCPWA",
"question": "In a room with various instruments and control panels, there is a man with short hair wearing a white lab coat. When the subtitle 'think you'll see this technology be used' appears, what objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"a gold chain",
"a red button",
"a white button",
"a black remote",
"a black steering wheel"
],
"correct_choice": 1,
"position": [
19834
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2O",
"level": "L1-Perception",
"id": "fryyNwUCPWA_2",
"video_path": "fryyNwUCPWA.mp4",
"subtitle_path": "fryyNwUCPWA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 980.96,
"view_count": 19676
},
{
"video_id": "bqQTWdk1DAM",
"question": "A short-haired man wearing a blue shirt is standing on the street, with a pocket on each side of the chest of the blue shirt. Behind the man is a commercial street with duty-free shops and restaurants. What is the first food the man eats after entering the restaurant?",
"question_wo_referring_query": "What is the first food the man eats after entering the restaurant?",
"candidates": [
"fried noodles",
"steamed bread",
"soup dumplings",
"canned chive dumplings",
"fritters"
],
"correct_choice": 3,
"position": [
1735,
2352,
2698
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "bqQTWdk1DAM_0",
"video_path": "bqQTWdk1DAM.mp4",
"subtitle_path": "bqQTWdk1DAM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 922.09,
"view_count": 321187
},
{
"video_id": "bqQTWdk1DAM",
"question": "The man in the blue shirt is standing in front of a transparent display window. Inside the window, there are hanging roast ducks and other meats. On the right side of the window, there's a black area with white stripes. What food does the man eat first in front of the display window?",
"question_wo_referring_query": "What food does the man eat first in front of the display window?",
"candidates": [
"bun",
"roast duck",
"mooncake",
"mixed ham",
"apple"
],
"correct_choice": 3,
"position": [
4667,
5194,
6029
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "bqQTWdk1DAM_1",
"video_path": "bqQTWdk1DAM.mp4",
"subtitle_path": "bqQTWdk1DAM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 922.09,
"view_count": 321187
},
{
"video_id": "bqQTWdk1DAM",
"question": "A man wearing a multi-colored shirt is sitting in the outdoor dining area of a restaurant. The man is holding a yellow package in his hand. Next to him are a white basket and green plants. Behind the man, there is an open umbrella and other diners. On the left side of the screen, there are colorful flags and the entrance of a shop. What is the first food the man eats?",
"question_wo_referring_query": "What is the first food the man eats?",
"candidates": [
"Apple",
"Pie",
"Mooncake",
"Bread",
"Yogurt"
],
"correct_choice": 3,
"position": [
11510,
12250,
12877
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "bqQTWdk1DAM_2",
"video_path": "bqQTWdk1DAM.mp4",
"subtitle_path": "bqQTWdk1DAM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 922.09,
"view_count": 321187
},
{
"video_id": "I-yg_3yx6iA",
"question": "Under a gray sky, a bullet-riddled airplane is flying with red and white stripes on its tail wings. There is a red pattern dot on the back of the fuselage, and the rest of the plane is silver. When the subtitle 'the aircraft falling away after being' appears, what is the shape of the red pattern on the tail of the fuselage?",
"question_wo_referring_query": "What is the shape of the red pattern on the tail of the fuselage?",
"candidates": [
"A star",
"A square",
"A triangle",
"A pattern composed of a rectangle and a star",
"A rectangle"
],
"correct_choice": 3,
"position": [
2904
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "I-yg_3yx6iA_0",
"video_path": "I-yg_3yx6iA.mp4",
"subtitle_path": "I-yg_3yx6iA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1204.46,
"view_count": 270779
},
{
"video_id": "I-yg_3yx6iA",
"question": "A green transport truck is parked on the roadside, with two groups of people fighting around the car. One group is in dark green uniforms holding axes, clubs and other weapons, while the other group is wearing white hats and uniforms with armbands. The highway is lined with greenery on both sides. When the subtitle 'there have been dozens of Border' appears, what shape is the decorative pattern on the rear of the car?",
"question_wo_referring_query": "What shape is the decorative pattern on the rear of the car?",
"candidates": [
"Circle",
"Square",
"Pentagon",
"Rectangle",
"Triangle"
],
"correct_choice": 2,
"position": [
15405
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "I-yg_3yx6iA_1",
"video_path": "I-yg_3yx6iA.mp4",
"subtitle_path": "I-yg_3yx6iA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1204.46,
"view_count": 270779
},
{
"video_id": "I-yg_3yx6iA",
"question": "In the gray background, a row of soldiers wearing green uniforms is advancing along the diagonal staircase, holding guns. There is a bold white English text in the center of the screen. When the subtitle 'koreas the war was an international' appears, what shape is the figure at the bottom of the stairs?",
"question_wo_referring_query": "What shape is the figure at the bottom of the stairs?",
"candidates": [
"triangle",
"square",
"pentagon",
"rectangle",
"circle"
],
"correct_choice": 4,
"position": [
16952
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "I-yg_3yx6iA_2",
"video_path": "I-yg_3yx6iA.mp4",
"subtitle_path": "I-yg_3yx6iA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1204.46,
"view_count": 270779
},
{
"video_id": "fZBC3nmvJb8",
"question": "In a room, there is a white piece of furniture in the back. Next to the furniture, there is a wooden object. On the right side, there are various bottles and tools on a table. On the left side, there is an easel. In the middle, there is a table with a palette containing black paint and a canvas. A man in a black T-shirt is standing in the middle, holding something while painting on the canvas. What is the man holding while painting?",
"question_wo_referring_query": "What is the man holding while painting on the canvas?",
"candidates": [
"Black marker",
"Rag",
"Small brush",
"Brush",
"Black crayon"
],
"correct_choice": 2,
"position": [
5347
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "fZBC3nmvJb8_0",
"video_path": "fZBC3nmvJb8.mp4",
"subtitle_path": "fZBC3nmvJb8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1207.58,
"view_count": 3693262
},
{
"video_id": "fZBC3nmvJb8",
"question": "A man with a beard is holding a paintbrush and intently painting on a canvas. What object is turning white at this moment?",
"question_wo_referring_query": "What object is turning white at this moment?",
"candidates": [
"canvas",
"man's finger",
"brush",
"scissors",
"easel"
],
"correct_choice": 0,
"position": [
14535
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "fZBC3nmvJb8_1",
"video_path": "fZBC3nmvJb8.mp4",
"subtitle_path": "fZBC3nmvJb8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1207.58,
"view_count": 3693262
},
{
"video_id": "fZBC3nmvJb8",
"question": "On a black drawing board in the screen, there is a white design. In the distance, a plastic box filled with items is placed on the desk. In the lower right corner, a hand is holding a drawing pen and adjusting a color palette. What object is sliding down at this moment?",
"question_wo_referring_query": "What object is sliding down at this moment?",
"candidates": [
"black canvas",
"white pigment",
"drawing pen",
"color palette",
"easel"
],
"correct_choice": 1,
"position": [
25293
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E2O",
"level": "L1-Perception",
"id": "fZBC3nmvJb8_2",
"video_path": "fZBC3nmvJb8.mp4",
"subtitle_path": "fZBC3nmvJb8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1207.58,
"view_count": 3693262
},
{
"video_id": "2W2ZkYARds4",
"question": "On a sea level, there are dazzling lights in the distance, a big bridge stands in the middle, and there's a boat with lights on the left. At the bottom, it says 'BALTIMORE BRIDGE COLLAPSE INVESTIGATION.' When this boat appears, what happens?",
"question_wo_referring_query": "When this boat appears, what happens?",
"candidates": [
"A dolphin jumps",
"A helicopter flies over",
"The bridge collapses",
"The boat capsizes",
"The boat explodes"
],
"correct_choice": 2,
"position": [
20941
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "2W2ZkYARds4_0",
"video_path": "2W2ZkYARds4.mp4",
"subtitle_path": "2W2ZkYARds4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3024.89,
"view_count": 236885
},
{
"video_id": "2W2ZkYARds4",
"question": "On a flat ground, there is a building with a brick wall behind, a car with an open trunk on the left, and a police officer with a police dog on the right. What happened when the police dog appeared?",
"question_wo_referring_query": "What happened when the police dog appeared?",
"candidates": [
"The police dog bit a tire.",
"The police officer drove away.",
"The police dog smelled the brick wall.",
"The police dog jumped into the car's trunk.",
"The police officer closed the car's trunk."
],
"correct_choice": 3,
"position": [
64591
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "2W2ZkYARds4_1",
"video_path": "2W2ZkYARds4.mp4",
"subtitle_path": "2W2ZkYARds4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3024.89,
"view_count": 236885
},
{
"video_id": "2W2ZkYARds4",
"question": "In a grassy field, there are many trees in the distance, a house at the back right, a yellow warning tape nearby, and 6 policemen walking forward. What happened when the yellow warning tape appeared?",
"question_wo_referring_query": "What happened when the yellow warning tape appeared?",
"candidates": [
"A policeman with white hair bent down in the front right",
"A policeman cut the warning tape",
"A policeman wearing sunglasses at the back right raised his hand",
"A police dog ran out",
"A policeman in the back took a photo"
],
"correct_choice": 0,
"position": [
24863
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "2W2ZkYARds4_2",
"video_path": "2W2ZkYARds4.mp4",
"subtitle_path": "2W2ZkYARds4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3024.89,
"view_count": 236885
},
{
"video_id": "MJYBHfYF8LI",
"question": "At a presentation with a black background, what action did a man wearing black clothing and glasses take while speaking towards the microphone when he mentioned 'stack and the higher memory. And it's the to die within the same'?",
"question_wo_referring_query": "What action did he take?",
"candidates": [
"Put on a hat",
"Removed his glasses",
"Showed two GPU chips to the camera",
"Took a sip of water",
"Took off his outer garment"
],
"correct_choice": 2,
"position": [
26527
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "MJYBHfYF8LI_0",
"video_path": "MJYBHfYF8LI.mp4",
"subtitle_path": "MJYBHfYF8LI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2781.75,
"view_count": 2571
},
{
"video_id": "MJYBHfYF8LI",
"question": "On the city streets, there is a white truck in the middle of the screen. In the distance, there is a yellow building, and nearby on the left, a woman in black clothes with short hair is talking to a man on the right who is also wearing black. When the subtitles mention 'you met with investors in Hong Kong and Kuala Lumpur in Singapore who are', what did the man do?",
"question_wo_referring_query": "What did the man do?",
"candidates": [
"opened the car door",
"hugged the woman briefly",
"waved a few times to the woman",
"took out a phone",
"picked up a pair of scissors"
],
"correct_choice": 2,
"position": [
43156
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "MJYBHfYF8LI_1",
"video_path": "MJYBHfYF8LI.mp4",
"subtitle_path": "MJYBHfYF8LI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2781.75,
"view_count": 2571
},
{
"video_id": "MJYBHfYF8LI",
"question": "At the blue-background press conference, there are two microphones in front of the camera, and a man in a suit and wearing glasses is standing in front of the microphones. The man is holding a green report in his hand. When the subtitle reads 'about 90 lawmakers that there are in the Legislative council to,' what action does the man take?",
"question_wo_referring_query": "What action does the man take?",
"candidates": [
"Touches his hair",
"Bows to the camera",
"Puts on a hat",
"Puts down the report",
"Takes out his mobile phone"
],
"correct_choice": 3,
"position": [
62912
],
"topic_category": "NP-News-Programs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "MJYBHfYF8LI_2",
"video_path": "MJYBHfYF8LI.mp4",
"subtitle_path": "MJYBHfYF8LI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2781.75,
"view_count": 2571
},
{
"video_id": "yqejTvYILlA",
"question": "Outside the window where there is a shelf, in the distance, there are many people sitting and eating. Nearby, from left to right, there are a woman with blonde hair, a man wearing a hat and drinking, and a short-haired white man looking into a mirror. After the black man finishes drinking, what action does he perform?",
"question_wo_referring_query": "What action does the black man perform after finishing his drink?",
"candidates": [
"Falls to the ground",
"Puts the bottle on the table",
"Stands up",
"Takes out a piece of chewing gum",
"Eats a mouthful of bread"
],
"correct_choice": 1,
"position": [
7831,
7857
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "yqejTvYILlA_0",
"video_path": "yqejTvYILlA.mp4",
"subtitle_path": "yqejTvYILlA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1039.97,
"view_count": 1377927
},
{
"video_id": "yqejTvYILlA",
"question": "On the street, in the distance are roadside houses and cars driving on the road. Nearby is a man in a black T-shirt talking to the camera. To the left behind the man are piles of filled black garbage bags. What did the man do after finishing speaking?",
"question_wo_referring_query": "What did the man do after finishing speaking?",
"candidates": [
"Touched his hair",
"Took a sip of iced coffee",
"Turned and walked towards the garbage bags",
"Waved his left hand",
"Stuck out his tongue at the camera"
],
"correct_choice": 2,
"position": [
18685,
18732
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "yqejTvYILlA_1",
"video_path": "yqejTvYILlA.mp4",
"subtitle_path": "yqejTvYILlA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1039.97,
"view_count": 1377927
},
{
"video_id": "yqejTvYILlA",
"question": "In a brightly lit room, a man sitting on a bed wearing a green T-shirt is showing a wallet in front of a mirror. Behind him is a bright window with gray curtains. What does the man do after showing the wallet?",
"question_wo_referring_query": "What does the man do after showing the wallet?",
"candidates": [
"Put the wallet on the bed",
"Took out a handkerchief",
"Lay down on the bed",
"Took a sip of water",
"Took off the T-shirt"
],
"correct_choice": 0,
"position": [
22905,
22945
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "yqejTvYILlA_2",
"video_path": "yqejTvYILlA.mp4",
"subtitle_path": "yqejTvYILlA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1039.97,
"view_count": 1377927
},
{
"video_id": "571nruSayeo",
"question": "In the news footage, in front of the virtual cityscape background, there is a man in a gray suit and red tie speaking to the camera, with the subtitle 'long has it been like that in the'. In which sentences does this man wearing a red tie appear together with those subtitles?",
"question_wo_referring_query": "In which sentences does this man wearing a red tie appear together with those subtitles?",
"candidates": [
"You make a duck become a beautiful swan",
"Since then great changes have taken place there",
"Capital well I I think uh what we just",
"Of all the subject",
"Time passed quickly"
],
"correct_choice": 2,
"position": [
20153,
20260
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "571nruSayeo_0",
"video_path": "571nruSayeo.mp4",
"subtitle_path": "571nruSayeo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2745.12,
"view_count": 958990
},
{
"video_id": "571nruSayeo",
"question": "In the news footage, there is a man wearing green clothes in front of a virtual city background, speaking to the camera. The subtitles show 'the impression that uh the gangs were'. With which subtitles did this man in green clothes appear?",
"question_wo_referring_query": "With which subtitles did this man in green clothes appear?",
"candidates": [
"my parents gave me what I wanted",
"used in the past by politic S I mean for",
"Because he is so funny",
"He has a big mouth",
"And we're getting on well with each other"
],
"correct_choice": 1,
"position": [
38050,
38125
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "571nruSayeo_1",
"video_path": "571nruSayeo.mp4",
"subtitle_path": "571nruSayeo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2745.12,
"view_count": 958990
},
{
"video_id": "571nruSayeo",
"question": "In the press conference, in front of a background with the United Nations logo, a woman dressed in a red blouse and a gray plaid suit is speaking at the podium, saying 'establishment of a uh presidential'. With which subtitles does this woman in gray clothing appear together?",
"question_wo_referring_query": "With which subtitles does this woman in gray clothing appear together?",
"candidates": [
"and love is even beyond the life and death",
"Too often we take it as granted",
"love brings us bright when life gets hard and dark.",
"You'd better not get too tired while revising your lessons",
"transitional Council that will lead to"
],
"correct_choice": 4,
"position": [
54900,
54976
],
"topic_category": "NP-News-Programs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "571nruSayeo_2",
"video_path": "571nruSayeo.mp4",
"subtitle_path": "571nruSayeo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2745.12,
"view_count": 958990
},
{
"video_id": "RTUFPjliMCU",
"question": "What style of clothing is the woman in the upper right corner of the screen wearing when the subtitle 'doesn\u2019t allow me to produces as much' appears while two women are discussing the combination of milk and coffee?",
"question_wo_referring_query": "What style of clothing is the woman in the upper right corner of the screen wearing?",
"candidates": [
"Black and white striped suit",
"Short sleeves with red wave patterns",
"Green and grey striped suit",
"Brown and green striped suit",
"Blue and grey striped suit"
],
"correct_choice": 0,
"position": [
2073
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "RTUFPjliMCU_0",
"video_path": "RTUFPjliMCU.mp4",
"subtitle_path": "RTUFPjliMCU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2093.19,
"view_count": 157649
},
{
"video_id": "RTUFPjliMCU",
"question": "In the video, two women are explaining a chemistry problem. There are four numbers circled in the problem analysis. When the subtitles display 'kind of remove all this and just show', what are the characteristics of the line that highlights the circled numbers?",
"question_wo_referring_query": "What are the characteristics of the line that highlights the circled numbers?",
"candidates": [
"a dashed blue line",
"a wavy blue line",
"a double curved blue line",
"a solid blue line",
"a curved blue line"
],
"correct_choice": 0,
"position": [
31762
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "RTUFPjliMCU_1",
"video_path": "RTUFPjliMCU.mp4",
"subtitle_path": "RTUFPjliMCU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2093.19,
"view_count": 157649
},
{
"video_id": "RTUFPjliMCU",
"question": "Two women on the screen are explaining a question. In the problem, there is a number crossing a pink bar at the bottom and a slanted line. When the subtitle 'figs but I still would round up so I' appears, what color is the slanted line?",
"question_wo_referring_query": "What color is the slanted line?",
"candidates": [
"Olive Yellow",
"Light Yellow",
"Light Blue",
"Grass Green",
"Ink Green"
],
"correct_choice": 3,
"position": [
42118
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "RTUFPjliMCU_2",
"video_path": "RTUFPjliMCU.mp4",
"subtitle_path": "RTUFPjliMCU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2093.19,
"view_count": 157649
},
{
"video_id": "8k6M0HD162k",
"question": "In the video, some people are listening to someone on stage explain concepts of projection, including arrow symbols and plus-minus symbols. What happens on the screen when a formula starting with the letter P appears at the bottom of the projection?",
"question_wo_referring_query": "What happens on the screen when a formula starting with the letter P appears at the bottom of the projection?",
"candidates": [
"A green mathematical expression appears.",
"A rectangular box appears around the formula starting with the letter P, and an arrow appears with the word 'satisfied' on it.",
"A circular graphic appears around the formula starting with the letter P.",
"A video link appears.",
"The formula starting with the letter P disappears."
],
"correct_choice": 1,
"position": [
8331,
8369
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "8k6M0HD162k_0",
"video_path": "8k6M0HD162k.mp4",
"subtitle_path": "8k6M0HD162k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.16,
"view_count": 304
},
{
"video_id": "8k6M0HD162k",
"question": "In the video, some people are listening to a person on the podium explaining concepts related to projectiles. The concepts include sequences, red text in English, black text in English, and diagrams of red nets. What happens on the screen after the red net diagram appears?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"Pie chart",
"Line chart",
"Bar chart",
"Circular diagram",
"Scatter plot"
],
"correct_choice": 2,
"position": [
17997,
18465
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "8k6M0HD162k_1",
"video_path": "8k6M0HD162k.mp4",
"subtitle_path": "8k6M0HD162k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.16,
"view_count": 304
},
{
"video_id": "8k6M0HD162k",
"question": "In the scene, some people are listening to a person on stage explaining some English knowledge points on shooting. A man stands up from his seat and walks to the front of the woman on stage. After this man stands up, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The man walks out of the classroom",
"A person under the stage touches the back of their head with their hand",
"Everyone under the stage stands up",
"Everyone under the stage claps",
"The man walks onto the stage"
],
"correct_choice": 1,
"position": [
25071,
25143
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "8k6M0HD162k_2",
"video_path": "8k6M0HD162k.mp4",
"subtitle_path": "8k6M0HD162k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.16,
"view_count": 304
},
{
"video_id": "ll3tR0kUZHc",
"question": "On the right side of the white table, there is a white plate, and on the left side, there are silver kitchen utensils with a net pattern. On the kitchen utensils, there is a strawberry chocolate cake with golden edges. After the subtitle 'transfer to plate and throw the cake' appears, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The cake fell onto the table",
"The cake was moved to the plate",
"The cake split open",
"The cake fell onto the ground",
"The cake toppled over"
],
"correct_choice": 1,
"position": [
18690,
18762
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3E",
"level": "L2-Relation",
"id": "ll3tR0kUZHc_0",
"video_path": "ll3tR0kUZHc.mp4",
"subtitle_path": "ll3tR0kUZHc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 902.57,
"view_count": 932659
},
{
"video_id": "ll3tR0kUZHc",
"question": "A long-haired woman wearing a white short-sleeve shirt is standing in front of a white table. On the countertop behind her are some small potted plants and kitchen utensils. There is a colorful question mark sticker on the cupboard above the countertop, and the tiles on the wall are hexagonal. A paper is placed in front of the woman. After the subtitle 'be right back cover with the small oven' appears, what does the woman do?",
"question_wo_referring_query": "What does the woman do after that?",
"candidates": [
"The woman picks up a plate",
"The woman picks up the paper",
"The woman eats an apple",
"The woman searches for the baking tray",
"The woman picks up a vegetable knife"
],
"correct_choice": 3,
"position": [
1622,
1684
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3E",
"level": "L2-Relation",
"id": "ll3tR0kUZHc_1",
"video_path": "ll3tR0kUZHc.mp4",
"subtitle_path": "ll3tR0kUZHc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 902.57,
"view_count": 932659
},
{
"video_id": "ll3tR0kUZHc",
"question": "A long-haired woman wearing a white short-sleeve shirt is standing in front of a white table. On the counter behind her, there are small potted plants and kitchen utensils. The cabinet above the counter has a colorful question mark sticker on it. The tiles on the wall are hexagonal. At the center of the table, there is a blue scale. To the left of the table, there is a jar containing white seasoning, and a glass bowl is placed to the right of the scale. The glass bowl contains white ingredients and green leaves. After the subtitle 'two tablespoons of powdered sugar i am' appears, what does the woman do?",
"question_wo_referring_query": "What does the woman do?",
"candidates": [
"The woman picks up a fork",
"The woman lifts the glass bowl",
"The woman is holding a knife and cutting a slender ingredient",
"The woman picks up a white paper to look at it",
"The woman is stirring the white ingredients with a whisk"
],
"correct_choice": 2,
"position": [
10673,
10697
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3E",
"level": "L2-Relation",
"id": "ll3tR0kUZHc_2",
"video_path": "ll3tR0kUZHc.mp4",
"subtitle_path": "ll3tR0kUZHc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 902.57,
"view_count": 932659
},
{
"video_id": "Y1YCvEip_ko",
"question": "The man in the black coat and the man with glasses sitting opposite him are having a meal together. The man with glasses is holding a phone in one hand and wearing a watch on the other. There is a red pillow on his left side. Reflected in the window behind the man with glasses is a service staff wearing a mask. There is a colorful cartoon character design behind the man in the black coat. He is holding a cup with both hands and has a black bag to his left. The tablecloth is a combination of olive and white colors. Where else does the man with glasses appear?",
"question_wo_referring_query": "Where else does the man with glasses appear?",
"candidates": [
"On the bicycle",
"In front of the hotel bed",
"On the beach",
"In the back seat of a taxi",
"On the bus"
],
"correct_choice": 1,
"position": [
6552,
12090,
2018
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Y1YCvEip_ko_0",
"video_path": "Y1YCvEip_ko.mp4",
"subtitle_path": "Y1YCvEip_ko_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1100.18,
"view_count": 69255
},
{
"video_id": "Y1YCvEip_ko",
"question": "The man wearing glasses with a black backpack is outside the airport. He is wearing a yellow hat, a black mask, and a grey hoodie. He is surrounded by white columns and large glass windows. Where else does the man's hat appear?",
"question_wo_referring_query": "Where else does the man's hat appear?",
"candidates": [
"In the taxi",
"On the dining table",
"On the bus",
"In the elevator",
"On the beach"
],
"correct_choice": 3,
"position": [
2098,
2400,
6582
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Y1YCvEip_ko_1",
"video_path": "Y1YCvEip_ko.mp4",
"subtitle_path": "Y1YCvEip_ko_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1100.18,
"view_count": 69255
},
{
"video_id": "Y1YCvEip_ko",
"question": "In a spacious room, a woman with long black hair and wearing black clothes, a man with glasses in white clothes, and another man in blue clothes are sitting on the floor. There is a silver frame above their heads. The man on the right is wearing jeans and sitting cross-legged. The woman on the left is wearing black shoes, sitting with her legs straight, resting one hand on the floor, and holding an interview device in the other hand. Where else does this woman appear?",
"question_wo_referring_query": "Where else does this woman appear?",
"candidates": [
"On an escalator at the airport",
"Inside a hotel",
"In a taxi",
"On the light yellow sofa inside the room",
"On a bus"
],
"correct_choice": 3,
"position": [
16550,
18240,
11025
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Y1YCvEip_ko_2",
"video_path": "Y1YCvEip_ko.mp4",
"subtitle_path": "Y1YCvEip_ko_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1100.18,
"view_count": 69255
},
{
"video_id": "vVRC-0VKPrg",
"question": "At the top of the screen, there is a red search box. Below the search box, there are characters aligned to the left, which are composed of red and black colors. In the bottom right corner, a man wearing sunglasses and dressed in black is speaking into a microphone. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"The man is drawing a circle with his hand",
"The man is adjusting his glasses",
"The man is making a scissor hand gesture",
"The man is making a heart hand gesture",
"The man is making an OK hand gesture"
],
"correct_choice": 4,
"position": [
8013
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2E",
"level": "L1-Perception",
"id": "vVRC-0VKPrg_0",
"video_path": "vVRC-0VKPrg.mp4",
"subtitle_path": "vVRC-0VKPrg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2354.8,
"view_count": 15286
},
{
"video_id": "vVRC-0VKPrg",
"question": "The top of the screen shows a red search box, below the search box is left-aligned text, with bold black characters at the bottom. Some characters in the upper right center are on a blue background. In the lower right corner, there is a man in black wearing sunglasses explaining something using a speech bubble. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Using the mouse pointer to add a green background to part of the text",
"Using the mouse pointer to select part of the text on the screen",
"Using the mouse pointer to add a red background to part of the text",
"Using the mouse pointer to add a yellow background to part of the text",
"Using the mouse pointer to add a white background to part of the text"
],
"correct_choice": 1,
"position": [
5492
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2E",
"level": "L1-Perception",
"id": "vVRC-0VKPrg_1",
"video_path": "vVRC-0VKPrg.mp4",
"subtitle_path": "vVRC-0VKPrg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2354.8,
"view_count": 15286
},
{
"video_id": "vVRC-0VKPrg",
"question": "In the bottom left corner of the white background is a rectangular frame with curves and color variations. Inside the frame, there are blue and yellow arrows and characters. The top of the frame has handwritten colorful characters above which there are black lines and characters. In the bottom right corner of the screen are a cartoon dog and the symbol \u03c0 (pi). What is happening to the right of the handwritten colorful characters?",
"question_wo_referring_query": "What is happening to the right of the handwritten colorful characters?",
"candidates": [
"A hand-drawn line is getting thinner.",
"A hand-drawn line is shortening.",
"A hand-drawn line is moving parallel.",
"A hand-drawn line is getting thicker.",
"A hand-drawn line is extending and eventually forms an arrowhead."
],
"correct_choice": 4,
"position": [
14784
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2E",
"level": "L1-Perception",
"id": "vVRC-0VKPrg_2",
"video_path": "vVRC-0VKPrg.mp4",
"subtitle_path": "vVRC-0VKPrg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2354.8,
"view_count": 15286
},
{
"video_id": "ZaXpMou55lw",
"question": "In the upper right corner of a white background, a middle-aged man wearing a gray short-sleeved shirt is explaining. Behind him are a piano keyboard and various objects. At the top of the screen are bold English letters, and in the center is a table with white data on a black background. What object is present in the scene?",
"question_wo_referring_query": "What object is present in the scene?",
"candidates": [
"A silver bracelet",
"A pair of glasses",
"A URL",
"A wristwatch",
"A yellow bracelet"
],
"correct_choice": 2,
"position": [
22318
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "ZaXpMou55lw_0",
"video_path": "ZaXpMou55lw.mp4",
"subtitle_path": "ZaXpMou55lw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1018.35,
"view_count": 704
},
{
"video_id": "ZaXpMou55lw",
"question": "A man wearing a grey short-sleeve shirt is sitting in a chair and talking. Behind him, there's a black and white piano keyboard and a black microphone stand. The curtain next to the keyboard is white. In the corner of the room, there is a guitar and a drum set. What objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"black wristwatch",
"jeans jacket",
"black baseball cap",
"necklace",
"silver wristwatch"
],
"correct_choice": 0,
"position": [
428
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "ZaXpMou55lw_1",
"video_path": "ZaXpMou55lw.mp4",
"subtitle_path": "ZaXpMou55lw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1018.35,
"view_count": 704
},
{
"video_id": "ZaXpMou55lw",
"question": "In the top-right corner of a white background, a man with short grey hair is explaining something. Behind the man, there is a piano keyboard and miscellaneous items. At the top of the screen, there are bold English characters, and below the characters, there are two web addresses. To the left of the web addresses, there are yellow and blue circular icons and simple cartoon graphics. Below the web addresses, there is a black rectangular text box filled with white and green characters. What objects or elements are present in the scene?",
"question_wo_referring_query": "What objects or elements are present in the scene?",
"candidates": [
"Red circular icon",
"Black wristwatch",
"Green circular icon",
"Black circular icon",
"White wristwatch"
],
"correct_choice": 1,
"position": [
14427
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "ZaXpMou55lw_2",
"video_path": "ZaXpMou55lw.mp4",
"subtitle_path": "ZaXpMou55lw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1018.35,
"view_count": 704
},
{
"video_id": "wSHPuI7wWIg",
"question": "A man wearing black clothes and a black mask appears at the center of the screen, with a woman in a black hooded outfit next to him. Behind the man, there is a parked red car, a person in dark long clothes, and some trees. When the subtitle 'photo spot of the Tokyo Tower' appears, what objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"A black handbag",
"A white mask",
"A blue mask",
"A black car",
"A white handbag"
],
"correct_choice": 0,
"position": [
25326
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "wSHPuI7wWIg_0",
"video_path": "wSHPuI7wWIg.mp4",
"subtitle_path": "wSHPuI7wWIg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1746.54,
"view_count": 143551
},
{
"video_id": "wSHPuI7wWIg",
"question": "A man wearing a black coat is sitting on a stool. The man is holding a black straw and a pink drink. Behind the man, there is a white stool and a screen. The screen displays an image of a large group of people gathered together. When the caption 'Can't complain' appears, what object is present in the scene?",
"question_wo_referring_query": "What object is present in the scene?",
"candidates": [
"A ring",
"A plate",
"A potted plant",
"A fork",
"A table lamp"
],
"correct_choice": 0,
"position": [
3234
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "wSHPuI7wWIg_1",
"video_path": "wSHPuI7wWIg.mp4",
"subtitle_path": "wSHPuI7wWIg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1746.54,
"view_count": 143551
},
{
"video_id": "wSHPuI7wWIg",
"question": "A screen with a black frame is placed on the table. On the screen, a blue-haired girl is holding a small tile with a design. The girl is wearing a uniform with a red tie, and her eyes are white with circular patterns. When the subtitle 'Ah, are you kidding me' appears, what items are present in the scene?",
"question_wo_referring_query": "What items are present in the scene?",
"candidates": [
"A red heart tile",
"A white hat",
"A black peach tile",
"A plum blossom tile",
"A red hat"
],
"correct_choice": 0,
"position": [
36424
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "wSHPuI7wWIg_2",
"video_path": "wSHPuI7wWIg.mp4",
"subtitle_path": "wSHPuI7wWIg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1746.54,
"view_count": 143551
},
{
"video_id": "NMHmqgO04rU",
"question": "The man wearing a brown apron and blue jeans is sitting on a sofa, with flower-patterned white cushions and pillows. The man is holding a plate in one hand and chopsticks in the other. There is a dog with a collar in front of him. In the direction of his hand holding the plate, there is a kitchen full of utensils. When the subtitle 'You can't eat this' appears, what is the clothing like under the man's brown apron?",
"question_wo_referring_query": "What is the clothing like under the man's brown apron?",
"candidates": [
"White shirt",
"Blue shirt",
"White chef uniform",
"Blue short sleeves",
"White short sleeves"
],
"correct_choice": 4,
"position": [
13672
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "NMHmqgO04rU_0",
"video_path": "NMHmqgO04rU.mp4",
"subtitle_path": "NMHmqgO04rU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1196.83,
"view_count": 24981
},
{
"video_id": "NMHmqgO04rU",
"question": "A white Pekingese dog is lying on a white cushion on the sofa. The dog is wearing a pearl necklace around its neck. There is a plaid pillow on the sofa. In the bottom left corner of the screen, there is an upper body image of a man wearing a white coat. When the subtitle 'It was just a small amount' appears, what does the sofa look like?",
"question_wo_referring_query": "What does the sofa look like?",
"candidates": [
"Olive-colored artificial leather sofa",
"Red artificial leather sofa",
"Red wooden sofa",
"Black artificial leather sofa",
"Olive-colored wooden sofa"
],
"correct_choice": 4,
"position": [
14646
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "NMHmqgO04rU_1",
"video_path": "NMHmqgO04rU.mp4",
"subtitle_path": "NMHmqgO04rU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1196.83,
"view_count": 24981
},
{
"video_id": "NMHmqgO04rU",
"question": "The man in a dark suit is handling food on a cutting board with a knife. He is wearing sunglasses, and behind him are various kitchen utensils and white cabinets. On the right side of the table, there is a bottle with green packaging containing cooking oil. In the lower left and right corners of the screen, there are images of two men from the waist up. When the subtitle 'Sheesh...' appears, what is the lighting on the ceiling like?",
"question_wo_referring_query": "What is the lighting on the ceiling like?",
"candidates": [
"Round wall lamp",
"Triangular wall lamp",
"Square wall lamp",
"Gold pendant lamp",
"Silver pendant lamp"
],
"correct_choice": 2,
"position": [
25387
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "NMHmqgO04rU_2",
"video_path": "NMHmqgO04rU.mp4",
"subtitle_path": "NMHmqgO04rU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1196.83,
"view_count": 24981
},
{
"video_id": "dCscvoOX2as",
"question": "On the shimmering lake, there is a bridge. A woman, wearing a gray headscarf and a black coat, is putting her phone into her handbag. Two other women are walking towards her. After this woman finishes putting her phone away, who is the first person to enter the scene?",
"question_wo_referring_query": "On the shimmering lake, there is a bridge. A woman, wearing a gray headscarf and a black coat, is putting her phone into her handbag. Two other women are walking towards her. After this woman finishes putting her phone away, who is the first person to enter the scene?",
"candidates": [
"A woman with a ponytail, wearing a floral dress",
"A woman with straight blonde hair, wearing white trousers",
"A woman with straight blonde hair, wearing a white coat",
"A woman with blonde hair, wearing a black coat",
"A woman wearing sunglasses and a black coat"
],
"correct_choice": 2,
"position": [
1504,
1601
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "dCscvoOX2as_0",
"video_path": "dCscvoOX2as.mp4",
"subtitle_path": "dCscvoOX2as_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1083.28,
"view_count": 3119
},
{
"video_id": "dCscvoOX2as",
"question": "A man wearing a grey coat and blue jeans is standing in front of a building with a glass facade. To his left, there is a small tree with red leaves. Between the man and the tree, there is the text 'FAZER'. What is the first white text that appears on the screen after this?",
"question_wo_referring_query": "What is the first white text that appears on the screen after this?",
"candidates": [
"EKiM 2021",
"KASIM 2021",
"Subscribe",
"SANOMATALO",
"OODI KESKUSTAKIRJASTO"
],
"correct_choice": 0,
"position": [
9349,
9512
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "dCscvoOX2as_1",
"video_path": "dCscvoOX2as.mp4",
"subtitle_path": "dCscvoOX2as_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1083.28,
"view_count": 3119
},
{
"video_id": "dCscvoOX2as",
"question": "In the evening, when the lights of distant buildings are bright, a man with short hair, wearing a black shirt, stands outside. On his chest, there's a white inscription 'ALLAS SEA POOL'. Who is the first person that appears behind this man after this scene?",
"question_wo_referring_query": "Who is the first person that appears behind this man after this scene?",
"candidates": [
"A man wearing a black long-sleeved shirt and black shorts",
"A man wearing black long pants with no shirt",
"A man wearing black shorts with no shirt",
"A man wearing white shorts with no shirt",
"A man wearing a khaki short-sleeved shirt and black shorts"
],
"correct_choice": 2,
"position": [
24583,
25138
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "dCscvoOX2as_2",
"video_path": "dCscvoOX2as.mp4",
"subtitle_path": "dCscvoOX2as_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1083.28,
"view_count": 3119
},
{
"video_id": "Dkm35G5kkcc",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, outside, two women walk side by side, the one on the left with her eyes closed and fingers touching her face. Next, in a laundry room with three washing machines, a man in black clothes puts clothes into a washing machine. Finally, on the left side of a corridor with a green billboard, a woman in red clothes is skipping and jumping.",
"First, on the left side of a corridor with a green billboard, a woman in red clothes is skipping and jumping. Next, outside, two women walk side by side, the one on the left with her eyes closed and fingers touching her face. Finally, in a laundry room with three washing machines, a man in black clothes puts clothes into a washing machine.",
"First, on the left side of a corridor with a green billboard, a woman in red clothes is skipping and jumping. Next, in a laundry room with three washing machines, a man in black clothes puts clothes into a washing machine. Finally, outside, two women walk side by side, the one on the left with her eyes closed and fingers touching her face.",
"First, in a laundry room with three washing machines, a man in black clothes puts clothes into a washing machine. Then, on the left side of a corridor with a green billboard, a woman in red clothes is skipping and jumping. Finally, outside, two women walk side by side, the one on the left with her eyes closed and fingers touching her face.",
"First, in a laundry room with three washing machines, a man in black clothes puts clothes into a washing machine. Next, outside, two women walk side by side, the one on the left with her eyes closed and fingers touching her face. Finally, on the left side of a corridor with a green billboard, a woman in red clothes is skipping and jumping."
],
"correct_choice": 3,
"position": [
1378,
5942,
7293
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "Dkm35G5kkcc_0",
"video_path": "Dkm35G5kkcc.mp4",
"subtitle_path": "Dkm35G5kkcc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1275.11,
"view_count": 131426
},
{
"video_id": "Dkm35G5kkcc",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a man in blue clothes holds his head with both hands, next, a man in black clothes holds a guitar with his right hand placed over his eyes, and finally, a man plays bocce on the floor indoors.",
"First, a man in black clothes holds a guitar with his right hand placed over his eyes, next, a man in blue clothes holds his head with both hands, and finally, a man plays bocce on the floor indoors.",
"First, a man plays bocce on the floor indoors, next, a man in blue clothes holds his head with both hands, and finally, a man in black clothes holds a guitar with his right hand placed over his eyes.",
"First, a man plays bocce on the floor indoors, next, a man in black clothes holds a guitar with his right hand placed over his eyes, and finally, a man in blue clothes holds his head with both hands.",
"First, a man in black clothes holds a guitar with his right hand placed over his eyes, next, a man plays bocce on the floor indoors, and finally, a man in blue clothes holds his head with both hands."
],
"correct_choice": 4,
"position": [
12820,
15909,
17595
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "Dkm35G5kkcc_1",
"video_path": "Dkm35G5kkcc.mp4",
"subtitle_path": "Dkm35G5kkcc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1275.11,
"view_count": 131426
},
{
"video_id": "Dkm35G5kkcc",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a man pins a map onto the lampshade with thumbtacks, then a man with headphones starts crying with his eyes closed, and finally someone walks into an empty classroom.",
"First, a man with headphones starts crying with his eyes closed, then a man pins a map onto the lampshade with thumbtacks, and finally someone walks into an empty classroom.",
"First, a man pins a map onto the lampshade with thumbtacks, then someone walks into an empty classroom, and finally a man with headphones starts crying with his eyes closed.",
"First, a man with headphones starts crying with his eyes closed, then someone walks into an empty classroom, and finally a man pins a map onto the lampshade with thumbtacks.",
"First, someone walks into an empty classroom, then a man pins a map onto the lampshade with thumbtacks, and finally someone walks into an empty classroom again."
],
"correct_choice": 0,
"position": [
20010,
21683,
26357
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "Dkm35G5kkcc_2",
"video_path": "Dkm35G5kkcc.mp4",
"subtitle_path": "Dkm35G5kkcc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1275.11,
"view_count": 131426
},
{
"video_id": "M5YKW6fhlss",
"question": "In a room with a green door and a water dispenser by the wall, a person with golden hair, without a nose or lips, is looking at a fly wearing a green coat. In which scenes has this fly appeared?",
"question_wo_referring_query": "In which scenes has this fly appeared?",
"candidates": [
"In a mine.",
"In a room with flies wearing pink coats and flies wearing white striped coats.",
"Next to a man wearing a plaid shirt.",
"Next to a man wearing a purple coat.",
"Next to a dalmatian."
],
"correct_choice": 1,
"position": [
411,
1010
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "M5YKW6fhlss_0",
"video_path": "M5YKW6fhlss.mp4",
"subtitle_path": "M5YKW6fhlss_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1010.47,
"view_count": 269904
},
{
"video_id": "M5YKW6fhlss",
"question": "In a park, next to a garbage bin, there is a tree. Beside the tree, there is a bench on which a man wearing glasses is seated. His right hand is on his waist, and he is holding a cellphone in his left hand. In what scenes has this man appeared?",
"question_wo_referring_query": "In what scenes has this man appeared?",
"candidates": [
"In a car",
"Next to a figure without a nose and mouth",
"In an outdoor swimming pool",
"In an underground cave with a river flowing through it",
"Next to a butterfly wearing a green coat"
],
"correct_choice": 3,
"position": [
17944,
23326
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "M5YKW6fhlss_1",
"video_path": "M5YKW6fhlss.mp4",
"subtitle_path": "M5YKW6fhlss_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1010.47,
"view_count": 269904
},
{
"video_id": "M5YKW6fhlss",
"question": "There are a few red photo frames hanging on a wooden wall. A man wearing black-rimmed glasses and a plaid shirt is explaining. To his right, there is a phone that is changing screens. In which scenes has this phone appeared before?",
"question_wo_referring_query": "In which scenes has this phone appeared before?",
"candidates": [
"Next to an insect wearing a pink coat",
"Next to an insect with a tie",
"In the hands of a woman wearing a black coat in the park",
"In the hands of a man wearing blue pants in the park",
"In the hands of a character without a nose or mouth"
],
"correct_choice": 3,
"position": [
16871,
16988
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "M5YKW6fhlss_2",
"video_path": "M5YKW6fhlss.mp4",
"subtitle_path": "M5YKW6fhlss_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1010.47,
"view_count": 269904
},
{
"video_id": "7m9XIXyT5_I",
"question": "On the left side of the screen, there is a house with only its frame remaining. A man wearing a gray hat is extending half of his body out from the black middle section while holding a bucket. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"He filled the bucket with water.",
"He closed the door.",
"He took off his hat.",
"He is pouring the contents of the bucket outside.",
"He put down the bucket."
],
"correct_choice": 3,
"position": [
9860
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "7m9XIXyT5_I_0",
"video_path": "7m9XIXyT5_I.mp4",
"subtitle_path": "7m9XIXyT5_I_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1376.54,
"view_count": 293861
},
{
"video_id": "7m9XIXyT5_I",
"question": "In the top left corner of the screen, there is the red text 'WEAPONRY'. Inside an operation room outlined with a red line, there is a soldier wearing an olive-colored hat. What did this soldier do?",
"question_wo_referring_query": "What did this soldier do?",
"candidates": [
"He removed the shell from the cannon",
"He collapsed onto the ground",
"He left the operation room",
"He loaded a shell into the cannon and fired it",
"He disassembled the cannon"
],
"correct_choice": 3,
"position": [
14693
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "7m9XIXyT5_I_1",
"video_path": "7m9XIXyT5_I.mp4",
"subtitle_path": "7m9XIXyT5_I_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1376.54,
"view_count": 293861
},
{
"video_id": "7m9XIXyT5_I",
"question": "On the airplane, there are two soldiers wearing khaki hats. The soldier on the left has a hand on his hat, while the soldier on the right is holding a paper and a pen. What is the soldier on the right doing?",
"question_wo_referring_query": "What is the soldier on the right doing?",
"candidates": [
"He is throwing the pen",
"He is tearing up the paper",
"He is putting on goggles",
"He is writing on the paper",
"He is taking off the hat"
],
"correct_choice": 3,
"position": [
30714
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "7m9XIXyT5_I_2",
"video_path": "7m9XIXyT5_I.mp4",
"subtitle_path": "7m9XIXyT5_I_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1376.54,
"view_count": 293861
},
{
"video_id": "_Y1pW77M3Pg",
"question": "In the video, a man dressed in a black hoodie is sitting in front of a white window, with the text 'That's where BetterHelp can come in.' also appearing on the screen. After mentioning 'don't know what to do that's where', what action does he take?",
"question_wo_referring_query": "In the video, a man dressed in a black hoodie is sitting in front of a white window, with the text 'That's where BetterHelp can come in.' also appearing on the screen. After mentioning 'don't know what to do that's where', what action does he take?",
"candidates": [
"He places his left hand on his forehead",
"He puts his hands behind his head",
"He gestures 'Yeah' with both hands",
"He places his fists on his chest"
],
"correct_choice": 1,
"position": [
1727,
6661
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "_Y1pW77M3Pg_0",
"video_path": "_Y1pW77M3Pg.mp4",
"subtitle_path": "_Y1pW77M3Pg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1013.89,
"view_count": 81189
},
{
"video_id": "_Y1pW77M3Pg",
"question": "In the video, a man wearing a black hoodie is sitting in front of a white window. The screen also displays the text 'But I think that they can feel that.' After mentioning 'no one can see that but I think they can,' what does he do?",
"question_wo_referring_query": "What does he do?",
"candidates": [
"He clasps his hands together and crosses them in front of his chest.",
"He clenches his fists and places them on his chest.",
"He places his left hand on his forehead.",
"He makes a 'yeah' gesture with both hands."
],
"correct_choice": 0,
"position": [
7374,
12250
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "_Y1pW77M3Pg_1",
"video_path": "_Y1pW77M3Pg.mp4",
"subtitle_path": "_Y1pW77M3Pg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1013.89,
"view_count": 81189
},
{
"video_id": "_Y1pW77M3Pg",
"question": "On the screen, a man wearing a black hooded jacket is sitting in front of a white window. He is looking to the left. The screen also shows the text 'Look at the window.' After the phrase 'wait hold that look look at the window' is mentioned, what does he do?",
"question_wo_referring_query": "What does he do?",
"candidates": [
"He raises both hands in a 'Y' shape.",
"He places both hands on his head.",
"He clenches both fists and places them on his chest.",
"He places his left hand on his forehead."
],
"correct_choice": 1,
"position": [
13052,
19963
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "_Y1pW77M3Pg_2",
"video_path": "_Y1pW77M3Pg.mp4",
"subtitle_path": "_Y1pW77M3Pg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1013.89,
"view_count": 81189
},
{
"video_id": "nWDNzv1Gk8Q",
"question": "Sitting in the driver's seat of a car, there is a blonde woman wearing a red top and a necklace. Her left hand is placed on her chest. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She is placing her left hand on her forehead",
"She is making a 'yeah' gesture with her left hand",
"Her left hand is clenched in a fist on her chest",
"She is holding a phone in her right hand"
],
"correct_choice": 3,
"position": [
2152
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "nWDNzv1Gk8Q_0",
"video_path": "nWDNzv1Gk8Q.mp4",
"subtitle_path": "nWDNzv1Gk8Q_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1059.48,
"view_count": 25122
},
{
"video_id": "nWDNzv1Gk8Q",
"question": "In the driver's seat of a car, there is a woman with blonde hair, wearing a red top and a necklace. She extends her right hand forward with fingers spread. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"She is making a fist with her right hand and placing it on her chest",
"She is placing her right hand on her forehead",
"She is holding a phone in her left hand",
"She is making a 'peace' sign with her right hand"
],
"correct_choice": 2,
"position": [
10241
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "nWDNzv1Gk8Q_1",
"video_path": "nWDNzv1Gk8Q.mp4",
"subtitle_path": "nWDNzv1Gk8Q_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1059.48,
"view_count": 25122
},
{
"video_id": "nWDNzv1Gk8Q",
"question": "Sitting in the driver's seat of a car, there is a woman with blonde hair wearing a red top and a necklace. She is holding a phone in her right hand, leaning forward slightly, and smiling. What is she doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"Her right hand is placed on her forehead",
"Her right hand is making a 'yeah' gesture",
"Her right hand is clenched in a fist at her chest",
"Her left hand is on the steering wheel"
],
"correct_choice": 3,
"position": [
22919
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "nWDNzv1Gk8Q_2",
"video_path": "nWDNzv1Gk8Q.mp4",
"subtitle_path": "nWDNzv1Gk8Q_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1059.48,
"view_count": 25122
},
{
"video_id": "J_ZmaKRpyoU",
"question": "On the screen, there are three images on the PPT. One of the images has the word 'Image/Video' below it, another one has the word 'Where?' below it, and in the bottom right corner, there is a woman wearing a dark blue inner outfit, a black outer coat, and glasses. What objects have appeared on the screen?",
"question_wo_referring_query": "What objects have appeared on the screen?",
"candidates": [
"A photo of a globe",
"A photo of a satellite",
"A photo of a road at night",
"A photo of a man in a black short-sleeved shirt"
],
"correct_choice": 0,
"position": [
593
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "J_ZmaKRpyoU_0",
"video_path": "J_ZmaKRpyoU.mp4",
"subtitle_path": "J_ZmaKRpyoU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 983.67,
"view_count": 319
},
{
"video_id": "J_ZmaKRpyoU",
"question": "On the screen, there is a data table on a PPT, with English text below the data table. The text above the data table reads 'Cross-view Geo-localization Datasets'. Several rows in the data table are highlighted, and a woman wearing a black outer jacket over a dark blue inner top is seen in the bottom right corner. What lines are present on the screen?",
"question_wo_referring_query": "What lines are present on the screen?",
"candidates": [
"blue dashed line",
"yellow double horizontal line",
"red dashed line",
"purple double horizontal line"
],
"correct_choice": 2,
"position": [
9018
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "J_ZmaKRpyoU_1",
"video_path": "J_ZmaKRpyoU.mp4",
"subtitle_path": "J_ZmaKRpyoU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 983.67,
"view_count": 319
},
{
"video_id": "J_ZmaKRpyoU",
"question": "On the screen, there are 15 images on the PPT. On the right side of the images, there are several lines of English text, with the words 'Comparison: Qualitative Results' written on the images. In the bottom right corner, there is a woman wearing a dark blue inner coat with a black outer coat. Which images are highlighted on the screen?",
"question_wo_referring_query": "Which images are highlighted on the screen?",
"candidates": [
"The last image in the bottom right corner is highlighted with a green box",
"The last image in the top left corner is highlighted with a green box",
"The last image in the bottom right corner is highlighted with a red box",
"An image in the middle is highlighted"
],
"correct_choice": 0,
"position": [
18158
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "J_ZmaKRpyoU_2",
"video_path": "J_ZmaKRpyoU.mp4",
"subtitle_path": "J_ZmaKRpyoU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 983.67,
"view_count": 319
},
{
"video_id": "6S_e34j6q9U",
"question": "Among the characters on the screen\u2014 a man with curly brown hair wearing a striped shirt and holding a blue egg beater while whisking eggs, and a woman with long brown hair wearing a white short-sleeve T-shirt and holding a dark blue egg beater while whisking eggs\u2014 which character appears first?",
"question_wo_referring_query": "Which character appears first?",
"candidates": [
"They appear at the same time.",
"The woman with long brown hair wearing a white short-sleeve T-shirt.",
"The man with curly brown hair wearing a striped shirt.",
"The man with short brown hair wearing a white short-sleeve T-shirt."
],
"correct_choice": 1,
"position": [
740,
105
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "6S_e34j6q9U_0",
"video_path": "6S_e34j6q9U.mp4",
"subtitle_path": "6S_e34j6q9U_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1238.45,
"view_count": 430899
},
{
"video_id": "6S_e34j6q9U",
"question": "In the video, within a kitchen with a row of white cabinets in the background, there is a man with curly hair wearing a striped shirt. On the table in front of him, there is a bottle of oil, a glass bowl containing egg liquid with a blue egg beater, a square box with colorful question marks, an empty glass bowl, and an empty small glass bowl. Which item appears first?",
"question_wo_referring_query": "Which item appears first?",
"candidates": [
"A square box with colorful question marks",
"An empty glass bowl",
"A small empty glass bowl",
"A glass bowl containing egg liquid with a blue egg beater"
],
"correct_choice": 3,
"position": [
7221,
3860
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "6S_e34j6q9U_1",
"video_path": "6S_e34j6q9U.mp4",
"subtitle_path": "6S_e34j6q9U_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1238.45,
"view_count": 430899
},
{
"video_id": "6S_e34j6q9U",
"question": "On the screen, there is a round plate with small golden yellow food cubes, a polygonal plate with brown small food cubes, and another round plate with large golden yellow food pieces on a wooden table. Which plate of food appears first?",
"question_wo_referring_query": "Which plate of food appears first?",
"candidates": [
"Polygonal plate with large golden yellow food pieces",
"Round plate with large golden yellow food pieces",
"Polygonal plate with brown small food cubes",
"Round plate with small golden yellow food cubes"
],
"correct_choice": 3,
"position": [
27526,
26933
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "6S_e34j6q9U_2",
"video_path": "6S_e34j6q9U.mp4",
"subtitle_path": "6S_e34j6q9U_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1238.45,
"view_count": 430899
},
{
"video_id": "WaiGdRYD36k",
"question": "The screen shows a blonde woman with a medal hanging on the left side of her clothing. Below, there is a subtitle 'Former U.S. House Speaker Nancy Pelosi.' After the subtitle mentions 'assassination of Alexa Lani,' which character appears next?",
"question_wo_referring_query": "Which character appears next?",
"candidates": [
"A woman with long hair wearing a checkered shirt, a white inner garment, and earrings",
"A woman with long hair wearing a checkered shirt, a red inner garment, and earrings",
"A woman with long hair wearing a checkered shirt, a pink inner garment, and earrings",
"A woman with long hair wearing a checkered jacket, a black inner garment, and earrings"
],
"correct_choice": 3,
"position": [
263,
1865
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "WaiGdRYD36k_0",
"video_path": "WaiGdRYD36k.mp4",
"subtitle_path": "WaiGdRYD36k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1195.2,
"view_count": 180472
},
{
"video_id": "WaiGdRYD36k",
"question": "On the screen is a long-haired woman wearing a checkered coat over a black shirt, and there is a bright lamp on the wall behind her. When the subtitle mentions 'strategically on how to get the job done,' what object appears on the wall?",
"question_wo_referring_query": "What object appears on the wall?",
"candidates": [
"A pendant lamp",
"A potted plant",
"A painting",
"A mobile phone"
],
"correct_choice": 2,
"position": [
11152,
16408
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "WaiGdRYD36k_1",
"video_path": "WaiGdRYD36k.mp4",
"subtitle_path": "WaiGdRYD36k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1195.2,
"view_count": 180472
},
{
"video_id": "WaiGdRYD36k",
"question": "In the video, there is a long-haired woman wearing a checkered coat with a black shirt underneath. On the wall behind her, there is a bright lamp. After the subtitle mentions 'reputational damage um how the IDF is', what object appears in the woman's hand?",
"question_wo_referring_query": "What object appears in the woman's hand?",
"candidates": [
"ball",
"phone",
"apple",
"camera"
],
"correct_choice": 1,
"position": [
22223,
23231
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "WaiGdRYD36k_2",
"video_path": "WaiGdRYD36k.mp4",
"subtitle_path": "WaiGdRYD36k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1195.2,
"view_count": 180472
},
{
"video_id": "oI975O1BUu0",
"question": "In the video, there is a man wearing a red shirt with a beard in the middle of the screen, and there is a logo in the bottom right corner. What clothing does this man change into when the subtitle mentions 'as Caitlyn will tell you that might not'?",
"question_wo_referring_query": "What clothing does this man change into?",
"candidates": [
"Black shirt changes to patterned short sleeves",
"White shirt changes to patterned short sleeves",
"Red shirt changes to patterned short sleeves",
"Yellow shirt changes to patterned short sleeves"
],
"correct_choice": 2,
"position": [
867,
6891
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "oI975O1BUu0_0",
"video_path": "oI975O1BUu0.mp4",
"subtitle_path": "oI975O1BUu0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1385.09,
"view_count": 98142
},
{
"video_id": "oI975O1BUu0",
"question": "In the middle of the video, there is a man wearing a floral short-sleeved shirt and a cap. In the bottom right corner, there is a logo. When the subtitle mentions 'be 36 years before we could land on,' what does this man change into?",
"question_wo_referring_query": "What does this man change into?",
"candidates": [
"The floral short-sleeved shirt changes to a white outfit",
"The floral short-sleeved shirt changes to a yellow outfit",
"The floral short-sleeved shirt changes to a black outfit",
"The floral short-sleeved shirt changes to a red outfit"
],
"correct_choice": 2,
"position": [
6703,
18360
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "oI975O1BUu0_1",
"video_path": "oI975O1BUu0.mp4",
"subtitle_path": "oI975O1BUu0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1385.09,
"view_count": 98142
},
{
"video_id": "oI975O1BUu0",
"question": "In the middle of the video, there is a man wearing a black shirt with a mustache. There is a logo in the bottom right corner. When the subtitle mentions 'Hayabusa had some surprises under the', what does the man change into?",
"question_wo_referring_query": "What does this man change into?",
"candidates": [
"The black shirt changes into a yellow shirt",
"The black shirt changes into a patterned short sleeve",
"No change",
"The black shirt changes into a red shirt"
],
"correct_choice": 1,
"position": [
18445,
26326
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "oI975O1BUu0_2",
"video_path": "oI975O1BUu0.mp4",
"subtitle_path": "oI975O1BUu0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1385.09,
"view_count": 98142
},
{
"video_id": "GFg98TDqCpw",
"question": "In the lower right corner of the video, there is a bald person wearing sunglasses and dressed in black. There are also three boys on the screen: the one on the left is wearing a blue short-sleeve shirt, the one on the right is wearing a red long-sleeve shirt, and the one in the middle is wearing a gray short-sleeve shirt. Which of the following items does not appear in the video?",
"question_wo_referring_query": "Which of the following items does not appear in the video?",
"candidates": [
"Black hat",
"White hat",
"Sunglasses",
"Blue short-sleeve shirt"
],
"correct_choice": 1,
"position": [
24449
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "GFg98TDqCpw_0",
"video_path": "GFg98TDqCpw.mp4",
"subtitle_path": "GFg98TDqCpw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1088.99,
"view_count": 50993
},
{
"video_id": "GFg98TDqCpw",
"question": "In the bottom right corner of the video, there is a bald man wearing sunglasses and dressed in black. In the frame, there are two bloodshot eyeballs. Which object does not appear in the video?",
"question_wo_referring_query": "Which object does not appear in the video?",
"candidates": [
"sunglasses",
"microphone",
"a man wearing glasses and a suit",
"eyeballs"
],
"correct_choice": 2,
"position": [
21852
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "GFg98TDqCpw_1",
"video_path": "GFg98TDqCpw.mp4",
"subtitle_path": "GFg98TDqCpw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1088.99,
"view_count": 50993
},
{
"video_id": "GFg98TDqCpw",
"question": "There is a bald man wearing sunglasses in the bottom right corner of the video, he is dressed in black clothing, and there is a man in the screen wearing a white short-sleeve shirt with a beard. The man is holding a cigarette in his right hand. Which object does not appear in the video?",
"question_wo_referring_query": "Which object does not appear in the video?",
"candidates": [
"Hat",
"Sunglasses",
"Cigarette",
"White short-sleeve shirt"
],
"correct_choice": 0,
"position": [
24768
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2O",
"level": "L1-Perception",
"id": "GFg98TDqCpw_2",
"video_path": "GFg98TDqCpw.mp4",
"subtitle_path": "GFg98TDqCpw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1088.99,
"view_count": 50993
},
{
"video_id": "YebUIUOCo94",
"question": "In the scene, there is a man wearing a military uniform, with yellow decorations on his shoulders, and a red shoulder strap. When the subtitle mentions 'in the challenging years that lay ahead,' what is the man's hairstyle?",
"question_wo_referring_query": "What is the man's hairstyle in the scene?",
"candidates": [
"Bald",
"Mediterranean",
"Long hair",
"Crew cut"
],
"correct_choice": 1,
"position": [
37340
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "YebUIUOCo94_0",
"video_path": "YebUIUOCo94.mp4",
"subtitle_path": "YebUIUOCo94_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2560.72,
"view_count": 907381
},
{
"video_id": "YebUIUOCo94",
"question": "In the video, there is a man wearing white pants lying on a blue-sheeted rack. There is another man watching him, and behind them, there is a person wearing a white skirt and an injured person leaning against the wall. When the caption mentions 'The operation went well. But the wound became infected, and Lannes died nine days later.', what is the condition of the lying man's leg?",
"question_wo_referring_query": "What is the condition of the lying man's leg?",
"candidates": [
"Right leg amputated",
"Left leg amputated",
"Both legs intact",
"Both legs amputated"
],
"correct_choice": 0,
"position": [
35746
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "YebUIUOCo94_1",
"video_path": "YebUIUOCo94.mp4",
"subtitle_path": "YebUIUOCo94_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2560.72,
"view_count": 907381
},
{
"video_id": "YebUIUOCo94",
"question": "On the left side of the screen, there's a black and white photo of a man with many words around it. On the right side, there's a picture showing a man sitting on a red chair, with other people standing nearby. When the subtitle mentions 'Davout only surrendered Hamburg in May 1814, after confirmation arrived of Napoleon's abdication.', what color are the pants of the man sitting on the right?",
"question_wo_referring_query": "What color are the pants of the man sitting on the right?",
"candidates": [
"Green",
"White",
"Yellow",
"Black"
],
"correct_choice": 1,
"position": [
55565
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2A",
"level": "L1-Perception",
"id": "YebUIUOCo94_2",
"video_path": "YebUIUOCo94.mp4",
"subtitle_path": "YebUIUOCo94_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2560.72,
"view_count": 907381
},
{
"video_id": "xiK00WS0lkE",
"question": "There are several men in a dimly lit room, with a few playing guitars in the background. On the left, there is a man wearing a hat holding a saxophone. On the right, there is a man in a short-sleeved shirt with a bracelet holding a saxophone. What did the man holding the saxophone on the right do when he entered the scene?",
"question_wo_referring_query": "What did the man holding the saxophone on the right do when he entered the scene?",
"candidates": [
"Played the piano",
"Played the flute",
"Played the saxophone",
"Played the guitar"
],
"correct_choice": 2,
"position": [
12435
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "xiK00WS0lkE_0",
"video_path": "xiK00WS0lkE.mp4",
"subtitle_path": "xiK00WS0lkE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2311.64,
"view_count": 71392
},
{
"video_id": "xiK00WS0lkE",
"question": "In the room on the screen, there is a man wearing green clothes with a beard. Behind him, there is a white cabinet and a wooden shelf. On the shelf, there is a black bag. What did the man in the video do upon entering the scene?",
"question_wo_referring_query": "What did the man in the video do upon entering the scene?",
"candidates": [
"Raised his hand",
"Touched his beard with his right hand",
"Carried a bag on his back",
"Touched his beard with his left hand"
],
"correct_choice": 3,
"position": [
23314
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "xiK00WS0lkE_1",
"video_path": "xiK00WS0lkE.mp4",
"subtitle_path": "xiK00WS0lkE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2311.64,
"view_count": 71392
},
{
"video_id": "xiK00WS0lkE",
"question": "In the video, there is a man wearing gray clothes with a beard facing a whiteboard. He is holding a pen in his right hand, and the whiteboard has markings in blue and black pen. What did the man do when he appeared in the video?",
"question_wo_referring_query": "What did the man do when he appeared in the video?",
"candidates": [
"Dropped the pen",
"Marked the whiteboard with a pen",
"Did nothing",
"Erased the whiteboard"
],
"correct_choice": 1,
"position": [
40672
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "O2E",
"level": "L1-Perception",
"id": "xiK00WS0lkE_2",
"video_path": "xiK00WS0lkE.mp4",
"subtitle_path": "xiK00WS0lkE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2311.64,
"view_count": 71392
},
{
"video_id": "OMJc43wUPLM",
"question": "On a map with many place names, there are some gray-green arrows. Beside the map, there's a standing red box. Inside the red box, there are two gray circles. In one of the gray circles, there are two opposite white arrows. When the gray circle with the two opposite white arrows appears in a pure red screen with four gray circles, what changes occur to the white arrows in the gray circle?",
"question_wo_referring_query": "What changes occur to the white arrows in the gray circle?",
"candidates": [
"The arrows change from opposing to rotating in the same direction.",
"The white arrows change to black arrows.",
"The arrows change from opposing to intersecting with each other.",
"The arrows change from opposing to being parallel to each other.",
"The white arrows change to yellow arrows."
],
"correct_choice": 0,
"position": [
6913,
12448
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "OMJc43wUPLM_0",
"video_path": "OMJc43wUPLM.mp4",
"subtitle_path": "OMJc43wUPLM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.3,
"view_count": 461856
},
{
"video_id": "OMJc43wUPLM",
"question": "There is a map with many colors, and on the white area of the map, there are squares in red and green colors. Next to the map, there is a red frame with a grey circle inside it. Inside the grey circle, there are three bombs drawn. When the grey circle with three bombs appears on a pure red background with five grey circle icons, showing the text '1200 Planes' in white, what changes occur to the grey circle icon with three bombs?",
"question_wo_referring_query": "What changes occur to the grey circle icon with three bombs?",
"candidates": [
"The icon becomes smaller",
"The icon becomes larger",
"The color of the bombs inside the icon changes from white to black",
"The color of the bombs inside the icon changes from white to green",
"The color of the bombs inside the icon changes from white to red"
],
"correct_choice": 1,
"position": [
11001,
16442
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "OMJc43wUPLM_1",
"video_path": "OMJc43wUPLM.mp4",
"subtitle_path": "OMJc43wUPLM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.3,
"view_count": 461856
},
{
"video_id": "OMJc43wUPLM",
"question": "On a red background, there is a white box containing a small red square. Next to the red square, there are white letters reading '26\u00d7.' Inside the red square, there are two small black X's and one large black X. What changes occur to the red square when it appears on a red background with white letters reading 'Lwow - Lviv - Lemberg'?",
"question_wo_referring_query": "What changes occur to the red square?",
"candidates": [
"It changes from red to green, and the number of small X's inside the box changes from two to one.",
"It changes from red to green, and the number of small X's inside the box changes from two to four.",
"It changes from red to green, and the number of small X's inside the box changes from two to three.",
"It changes from red to green, and the number of small X's inside the box changes from two to six.",
"It changes from red to green, and the number of small X's inside the box changes from two to five."
],
"correct_choice": 1,
"position": [
1233,
15440
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "OMJc43wUPLM_2",
"video_path": "OMJc43wUPLM.mp4",
"subtitle_path": "OMJc43wUPLM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.3,
"view_count": 461856
},
{
"video_id": "Ytv-9RM4e0o",
"question": "In a spacious yellow room with many lights on, there is a man wearing a red and black checkered shirt. When the subtitle 'everyone you join me' appears, what change happens to the man wearing a red and black checkered shirt?",
"question_wo_referring_query": "What change happens to the man wearing a red and black checkered shirt?",
"candidates": [
"The man's shirt changes from a red and black checkered shirt to a white hoodie",
"The man's shirt changes from a red and black checkered shirt to a yellow short sleeve",
"The man's shirt changes from a red and black checkered shirt to a black short sleeve",
"The man's shirt changes from a red and black checkered shirt to a blue suit",
"The man's shirt changes from a red and black checkered shirt to a black hoodie"
],
"correct_choice": 4,
"position": [
75,
225
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Ytv-9RM4e0o_0",
"video_path": "Ytv-9RM4e0o.mp4",
"subtitle_path": "Ytv-9RM4e0o_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1523.48,
"view_count": 29325
},
{
"video_id": "Ytv-9RM4e0o",
"question": "In front of a white wall, a man wearing a patterned red shirt sits on a black chair. Next to him is a black shelf. When the subtitle 'has been happening my birthday happened' appears, what transformation occurs to the man in the patterned red shirt?",
"question_wo_referring_query": "What transformation occurs to the man in the patterned red shirt?",
"candidates": [
"The man's shirt changes from red to black.",
"The man's shirt changes from red to yellow.",
"The man's shirt changes from red to green.",
"The man's shirt changes from red to purple.",
"The man's shirt changes from red to white."
],
"correct_choice": 0,
"position": [
1363,
31112
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Ytv-9RM4e0o_1",
"video_path": "Ytv-9RM4e0o.mp4",
"subtitle_path": "Ytv-9RM4e0o_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1523.48,
"view_count": 29325
},
{
"video_id": "Ytv-9RM4e0o",
"question": "In a white room, in front of a gray cabinet placed against the wall, a man in a black suit is speaking. What change occurs to the man in the black suit when the subtitle 'the opportunity to lay out a bunch of my' appears?",
"question_wo_referring_query": "What change occurs to the man in the black suit?",
"candidates": [
"The man's clothing changes from a black suit to a blue T-shirt.",
"The man's clothing changes from a black suit to a red and black checkered shirt.",
"The man's clothing changes from a black suit to a white hazmat suit.",
"The man's clothing changes from a black suit to a yellow hazmat suit.",
"The man's clothing changes from a black suit to a blue jacket."
],
"correct_choice": 1,
"position": [
9262,
22182
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Ytv-9RM4e0o_2",
"video_path": "Ytv-9RM4e0o.mp4",
"subtitle_path": "Ytv-9RM4e0o_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1523.48,
"view_count": 29325
},
{
"video_id": "PbiTIR8N4Hc",
"question": "On a yellow ground, there is a fossil of a trilobite. After the narrator says \"the Devonian era was an incredibly,\" which fossil appears next?",
"question_wo_referring_query": "Which fossil appears next after the narrator says?",
"candidates": [
"Shell fossil",
"Plant fossil",
"Insect fossil",
"Coral fossil",
"Fish fossil"
],
"correct_choice": 4,
"position": [
65,
101
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3O",
"level": "L2-Relation",
"id": "PbiTIR8N4Hc_0",
"video_path": "PbiTIR8N4Hc.mp4",
"subtitle_path": "PbiTIR8N4Hc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2573.74,
"view_count": 82462
},
{
"video_id": "PbiTIR8N4Hc",
"question": "In a blurry screen, there are two hands holding two grayish-white clods of dirt. Inside these clods there are some grayish-white objects. After the phrase 'that contain hollow cavities lined with' is spoken, what is the first object to appear?",
"question_wo_referring_query": "What is the first object to appear?",
"candidates": [
"Black crystals",
"Trilobite fossil",
"Pearl fossil",
"Plant fossil",
"Fish fossil"
],
"correct_choice": 0,
"position": [
45470,
45489,
47,
98
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3O",
"level": "L2-Relation",
"id": "PbiTIR8N4Hc_1",
"video_path": "PbiTIR8N4Hc.mp4",
"subtitle_path": "PbiTIR8N4Hc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2573.74,
"view_count": 82462
},
{
"video_id": "PbiTIR8N4Hc",
"question": "In a blurry screen, there are some soils of different colors. In the bottom right corner of the screen, there is also white text that reads 'Credit: AdventureON'. When the subtitle 'good specimens could be found near the' appears next to it, what is the first object that appears?",
"question_wo_referring_query": "what is the first object that appears?",
"candidates": [
"Grayish white crystal",
"Trilobite fossil",
"Fish fossil",
"Black crystal",
"Green quartz"
],
"correct_choice": 4,
"position": [
49312,
49341,
46,
99,
45470,
45489
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3O",
"level": "L2-Relation",
"id": "PbiTIR8N4Hc_2",
"video_path": "PbiTIR8N4Hc.mp4",
"subtitle_path": "PbiTIR8N4Hc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2573.74,
"view_count": 82462
},
{
"video_id": "fsz6bkkIHzQ",
"question": "In the black and white scene featuring a statue of the goddess of liberty, when the subtitle 'They arrived as immigrants, speaking no English at Ellis Island in late 1913' appears, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The screen gradually becomes clear",
"The camera moves from right to left",
"The camera moves from bottom to top",
"The camera moves from left to right",
"The camera moves from top to bottom"
],
"correct_choice": 3,
"position": [
4133
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "fsz6bkkIHzQ_0",
"video_path": "fsz6bkkIHzQ.mp4",
"subtitle_path": "fsz6bkkIHzQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.88,
"view_count": 1190228
},
{
"video_id": "fsz6bkkIHzQ",
"question": "In a black-and-white photo, a group of people are holding signs on the steps in front of a building. When the subtitles '35,000 was an enormous sum in 1958, and the commission would bring Rothko's left-wing views into conflict.' appear, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The group is advertising for a store",
"The group is protesting",
"The group is making an advertisement",
"The group is playing games",
"The group is holding an event"
],
"correct_choice": 1,
"position": [
6536
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "fsz6bkkIHzQ_1",
"video_path": "fsz6bkkIHzQ.mp4",
"subtitle_path": "fsz6bkkIHzQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.88,
"view_count": 1190228
},
{
"video_id": "fsz6bkkIHzQ",
"question": "On a yellow wooden floor in front of a red painting, there are two red stools. On the stool in the front, there is a woman with a pink backpack. When the subtitle \u201cRothko was aware that people often burst into tears when confronted with his paintings.\u201d appears, what is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Doing homework",
"Talking with a friend",
"Making a phone call",
"Wiping her tears",
"Looking at the painting"
],
"correct_choice": 4,
"position": [
3355
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "fsz6bkkIHzQ_2",
"video_path": "fsz6bkkIHzQ.mp4",
"subtitle_path": "fsz6bkkIHzQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.88,
"view_count": 1190228
},
{
"video_id": "s5edwp0PEqk",
"question": "Which of the following sequences of scenes in the video are correct?",
"question_wo_referring_query": "Which of the following sequences of scenes in the video are correct?",
"candidates": [
"First, a woman wearing a blue mask and glasses is leaning against the window inside a moving train, then a man wearing a white short-sleeved shirt and glasses is drawing on a round wooden table with a palette, and finally, a woman wearing an olive coat, khaki plaid pants, and a blue mask is posing on a platform in front of a house with red brick and white walls.",
"First, a man wearing a white short-sleeved shirt and glasses is drawing on a round wooden table with a palette, then a woman wearing a blue mask and glasses is leaning against the window inside a moving train, and finally, a woman wearing an olive coat, khaki plaid pants, and a blue mask is posing on a platform in front of a house with red brick and white walls.",
"First, a man wearing a white short-sleeved shirt and glasses is drawing on a round wooden table with a palette, then a woman wearing an olive coat, khaki plaid pants, and a blue mask is posing on a platform in front of a house with red brick and white walls, and finally, a woman wearing a blue mask and glasses is leaning against the window inside a moving train.",
"First, a woman wearing an olive coat, khaki plaid pants, and a blue mask is posing on a platform in front of a house with red brick and white walls, then a woman wearing a blue mask and glasses is leaning against the window inside a moving train, and finally, a man wearing a white short-sleeved shirt and glasses is drawing on a round wooden table with a palette.",
"First, a woman wearing a blue mask and glasses is leaning against the window inside a moving train, then a woman wearing an olive coat, khaki plaid pants, and a blue mask is posing on a platform in front of a house with red brick and white walls, and finally, a man wearing a white short-sleeved shirt and glasses is drawing on a round wooden table with a palette."
],
"correct_choice": 1,
"position": [
3696,
4627,
4978
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "s5edwp0PEqk_0",
"video_path": "s5edwp0PEqk.mp4",
"subtitle_path": "s5edwp0PEqk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1023.09,
"view_count": 233693
},
{
"video_id": "s5edwp0PEqk",
"question": "Which sequence of scenes in the video is correct?",
"question_wo_referring_query": "Which sequence of scenes in the video is correct?",
"candidates": [
"First, three girls sit on a giant crooked tree trunk. Then, a girl in a white tank top and white shorts and a girl in a black short-sleeve top and denim shorts with glasses embrace in front of a statue. Finally, the three people eat fruit on a colorful striped cloth on the beach.",
"First, three girls sit on a giant crooked tree trunk. Then, the three people eat fruit on a colorful striped cloth on the beach. Finally, a girl in a white tank top and white shorts and a girl in a black short-sleeve top and denim shorts with glasses embrace in front of a statue.",
"First, a girl in a white tank top and white shorts and a girl in a black short-sleeve top and denim shorts with glasses embrace in front of a statue. Then, three girls sit on a giant crooked tree trunk. Finally, the three people eat fruit on a colorful striped cloth on the beach.",
"First, the three people eat fruit on a colorful striped cloth on the beach. Then, three girls sit on a giant crooked tree trunk. Finally, a girl in a white tank top and white shorts and a girl in a black short-sleeve top and denim shorts with glasses embrace in front of a statue.",
"First, a girl in a white tank top and white shorts and a girl in a black short-sleeve top and denim shorts with glasses embrace in front of a statue. Then, the three people eat fruit on a colorful striped cloth on the beach. Finally, three girls sit on a giant crooked tree trunk."
],
"correct_choice": 2,
"position": [
5827,
6463,
11168
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "s5edwp0PEqk_1",
"video_path": "s5edwp0PEqk.mp4",
"subtitle_path": "s5edwp0PEqk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1023.09,
"view_count": 233693
},
{
"video_id": "s5edwp0PEqk",
"question": "Which of the following scene sequences in the video are correct?",
"question_wo_referring_query": "Which of the following scene sequences in the video are correct?",
"candidates": [
"First, inside a museum, many colorful crystals are displayed under a glass cover, then some people are sitting around a wooden table wrapping dumplings, and finally, a girl with short, pink-dyed hair wearing a black bikini sits on the beach spraying sunscreen mist.",
"First, some people are sitting around a wooden table wrapping dumplings, then inside a museum, many colorful crystals are displayed under a glass cover, and finally, a girl with short, pink-dyed hair wearing a black bikini sits on the beach spraying sunscreen mist.",
"First, a girl with short, pink-dyed hair wearing a black bikini sits on the beach spraying sunscreen mist, then some people are sitting around a wooden table wrapping dumplings, and finally, inside a museum, many colorful crystals are displayed under a glass cover.",
"First, a girl with short, pink-dyed hair wearing a black bikini sits on the beach spraying sunscreen mist, then inside a museum, many colorful crystals are displayed under a glass cover, and finally, some people are sitting around a wooden table wrapping dumplings.",
"First, inside a museum, many colorful crystals are displayed under a glass cover, then a girl with short, pink-dyed hair wearing a black bikini sits on the beach spraying sunscreen mist, and finally, some people are sitting around a wooden table wrapping dumplings."
],
"correct_choice": 3,
"position": [
13479,
15755,
16992
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "s5edwp0PEqk_2",
"video_path": "s5edwp0PEqk.mp4",
"subtitle_path": "s5edwp0PEqk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1023.09,
"view_count": 233693
},
{
"video_id": "9dSkvxS2EB0",
"question": "In the PPT slide with a white background, there is a black text block on the left, and some parts of the text are covered with a yellow overlay. To the right, there's a blank white area. In the center, there's a main graphic composed of red and blue circles and letters. After the text block on the left disappears, what changes occur to the central graphic?",
"question_wo_referring_query": "What changes occur?",
"candidates": [
"Disappeared",
"Covered by the yellow overlay",
"Shrunk",
"Moved slightly to the bottom left, and new parts appear in the screen",
"Moved to the right"
],
"correct_choice": 3,
"position": [
8231,
8362
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "9dSkvxS2EB0_0",
"video_path": "9dSkvxS2EB0.mp4",
"subtitle_path": "9dSkvxS2EB0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2439.97,
"view_count": 116935
},
{
"video_id": "9dSkvxS2EB0",
"question": "On a white background with a text overlay, the letters 'K' and 'y' at the top are circled in green. What change occurs to the letters 'K' and 'y' when the middle part of the text is covered by a yellow and green overlay?",
"question_wo_referring_query": ", what change occurs to the letters 'K' and 'y'?",
"candidates": [
"Covered by blue overlay",
"Disappeared",
"Got shrunk",
"Covered by yellow overlay",
"Got enlarged"
],
"correct_choice": 2,
"position": [
41629,
41770
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "9dSkvxS2EB0_1",
"video_path": "9dSkvxS2EB0.mp4",
"subtitle_path": "9dSkvxS2EB0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2439.97,
"view_count": 116935
},
{
"video_id": "9dSkvxS2EB0",
"question": "In the screen with white background and text, the right side is blank, the upper left section contains a circuit diagram, and the text information is located in both the middle left and bottom left sections. When the text information in the middle left and bottom left sections almost disappears, what change occurs to the circuit diagram in the upper left section?",
"question_wo_referring_query": "What change occurs to the circuit diagram in the upper left section?",
"candidates": [
"Shrinks",
"Covered by a yellow overlay",
"Enlarges",
"Covered by a blue overlay",
"Disappears"
],
"correct_choice": 2,
"position": [
57684,
57830
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SAA",
"level": "L2-Relation",
"id": "9dSkvxS2EB0_2",
"video_path": "9dSkvxS2EB0.mp4",
"subtitle_path": "9dSkvxS2EB0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2439.97,
"view_count": 116935
},
{
"video_id": "aX_HgA5SNLQ",
"question": "In a white room, there is a glass door with a black frame. In front of the glass door, there is a man wearing a green coat and white earphones. After he says 'story', who is the first person to appear?",
"question_wo_referring_query": "Who is the first person to appear?",
"candidates": [
"A woman with dark skin wearing black-framed glasses",
"A man wearing black-framed glasses and a black top",
"A woman with black hair tied up",
"A woman wearing a blue top",
"A man with dreadlocks"
],
"correct_choice": 1,
"position": [
4329,
4358,
1978
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "aX_HgA5SNLQ_0",
"video_path": "aX_HgA5SNLQ.mp4",
"subtitle_path": "aX_HgA5SNLQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3204.68,
"view_count": 8996
},
{
"video_id": "aX_HgA5SNLQ",
"question": "In front of a bookshelf filled with many books, a man wearing black glasses, with short hair and dressed in a black top, is talking. After he says \"that really took away the waste scene so\", what is the first item that appears?",
"question_wo_referring_query": ", what is the first item that appears?",
"candidates": [
"gold necklace",
"green coat",
"white necklace",
"white earphones",
"black and gray long tail skirt"
],
"correct_choice": 4,
"position": [
56936,
56996,
8626,
7616
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "aX_HgA5SNLQ_1",
"video_path": "aX_HgA5SNLQ.mp4",
"subtitle_path": "aX_HgA5SNLQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3204.68,
"view_count": 8996
},
{
"video_id": "aX_HgA5SNLQ",
"question": "In front of a glass door with a black frame, there is a man sitting who is wearing a green jacket, has a goatee, and is wearing white earphones. After he says \"spaces,\" who is the first person to appear?",
"question_wo_referring_query": "Who is the first person to appear?",
"candidates": [
"A woman with black long hair wearing black top",
"A man with short hair wearing black top",
"A man wearing black glasses",
"A woman wearing black glasses",
"A woman wearing a blue top and tying her hair back with a black band"
],
"correct_choice": 0,
"position": [
8112,
8136,
6186,
2146
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3O",
"level": "L2-Relation",
"id": "aX_HgA5SNLQ_2",
"video_path": "aX_HgA5SNLQ.mp4",
"subtitle_path": "aX_HgA5SNLQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3204.68,
"view_count": 8996
},
{
"video_id": "NHIT9vq6mJU",
"question": "In a white room, there are many glass cabinets. A yellowish-white pillar is placed against the wall, and next to the pillar, there is a niche in the wall with a painting inside it. What is the shape of the niche?",
"question_wo_referring_query": "What is the shape of the niche in the wall?",
"candidates": [
"Square",
"Round",
"Triangular",
"Rectangle",
"Arch-shaped"
],
"correct_choice": 4,
"position": [
9114
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "NHIT9vq6mJU_0",
"video_path": "NHIT9vq6mJU.mp4",
"subtitle_path": "NHIT9vq6mJU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1269.02,
"view_count": 246975
},
{
"video_id": "NHIT9vq6mJU",
"question": "In a white room, there is a large glass display case against the wall, inside the case is a large painting. A woman dressed in a red suit is looking at the painting hanging on the wall. What hairstyle does the woman in the red suit have?",
"question_wo_referring_query": "What hairstyle does the woman in the red suit have?",
"candidates": [
"Black long curly hair",
"Black long straight hair",
"Blonde long curly hair",
"Black short hair",
"Blonde short hair"
],
"correct_choice": 0,
"position": [
26118
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "NHIT9vq6mJU_1",
"video_path": "NHIT9vq6mJU.mp4",
"subtitle_path": "NHIT9vq6mJU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1269.02,
"view_count": 246975
},
{
"video_id": "NHIT9vq6mJU",
"question": "In a huge room with various exhibits on either side, there is a large pillar in the middle of the room. Next to the pillar, in the middle of the room, there is a long table. What is the color of the long table?",
"question_wo_referring_query": "What is the color of the long table?",
"candidates": [
"Green",
"Black",
"Blue",
"Gray",
"White"
],
"correct_choice": 1,
"position": [
16436
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "NHIT9vq6mJU_2",
"video_path": "NHIT9vq6mJU.mp4",
"subtitle_path": "NHIT9vq6mJU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1269.02,
"view_count": 246975
},
{
"video_id": "_uL3a3aMdMQ",
"question": "In front of a black background, two men are standing: one is wearing a white short-sleeve, and the other is wearing a blue long-sleeve coat. When the man in the white short-sleeve says \"wouldn't be a monarch to like fantasize\", what color is the inner wear of the man in the blue long-sleeve coat?",
"question_wo_referring_query": "What color is the inner wear of the man wearing the blue long-sleeve coat?",
"candidates": [
"blue",
"purple",
"black",
"green",
"white"
],
"correct_choice": 4,
"position": [
17488
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "_uL3a3aMdMQ_0",
"video_path": "_uL3a3aMdMQ.mp4",
"subtitle_path": "_uL3a3aMdMQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2844.85,
"view_count": 1228568
},
{
"video_id": "_uL3a3aMdMQ",
"question": "In front of a black background, there is a woman wearing a gray and white suspender dress. She is wearing a ring on her hand, and there is also a picture of a British flag beside her. When the subtitle \"was there too okay this is gonna be a\" appears, what kind of hairstyle does the woman in the gray and white suspender dress have?",
"question_wo_referring_query": "What kind of hairstyle does the woman in the gray and white suspender dress have?",
"candidates": [
"Long black hair",
"Long silver hair",
"Short blonde hair",
"Long black curls",
"Long blonde curls"
],
"correct_choice": 4,
"position": [
52389
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "_uL3a3aMdMQ_1",
"video_path": "_uL3a3aMdMQ.mp4",
"subtitle_path": "_uL3a3aMdMQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2844.85,
"view_count": 1228568
},
{
"video_id": "_uL3a3aMdMQ",
"question": "On a green meadow, next to it is a riverside road with many pedestrians walking. On the green meadow, there is a woman with dark skin wearing a white short-sleeve shirt and holding wheat. When she says 'everyone is mixed up with different', what shape is the hair accessory on her head?",
"question_wo_referring_query": "What shape is the hair accessory on her head?",
"candidates": [
"Bear ears shape",
"Rose shape",
"Butterfly shape",
"Bunny ears shape",
"Feather shape"
],
"correct_choice": 2,
"position": [
36325
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "_uL3a3aMdMQ_2",
"video_path": "_uL3a3aMdMQ.mp4",
"subtitle_path": "_uL3a3aMdMQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2844.85,
"view_count": 1228568
},
{
"video_id": "LAbtlJJhUlY",
"question": "A hand is holding a pen and coloring a design on white paper. The wrist of this hand has a silver item attached to it. The pen's body is black, and the tip is olive yellow. On the white paper, there are three bells drawn. Next to the bells, there is a sketch. The pen tip is currently positioned on this sketch. What did this pair of hands do after coloring?",
"question_wo_referring_query": "A hand is holding a pen and coloring a design on white paper. The wrist of this hand has a silver item attached to it. The pen's body is black, and the tip is olive yellow. On the white paper, there are three bells drawn. Next to the bells, there is a sketch. The pen tip is currently positioned on this sketch. What did this pair of hands do after coloring?",
"candidates": [
"Picked up a pair of scissors",
"Dropped the piece of paper on the ground",
"Picked up the piece of paper",
"Picked up a book",
"Tore the piece of paper"
],
"correct_choice": 2,
"position": [
22127,
22253
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "LAbtlJJhUlY_0",
"video_path": "LAbtlJJhUlY.mp4",
"subtitle_path": "LAbtlJJhUlY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1028.11,
"view_count": 219300
},
{
"video_id": "LAbtlJJhUlY",
"question": "A woman in a black coat is standing in the center of the screen. The coat has yellow vertical stripes and colorful badge-like decorations. The clothing underneath the coat is also black. The woman has silver accessories on both hands placed in front of her chest. Behind her, on the wall, there are colorful posters and sketches. After placing her hands in front of her chest, what did the woman do next?",
"question_wo_referring_query": "After placing her hands in front of her chest, what did the woman do next?",
"candidates": [
"Picked up a white bottle",
"Picked up a pair of scissors",
"Picked up a black bottle",
"Picked up a piece of white paper",
"Picked up a magazine"
],
"correct_choice": 0,
"position": [
5459,
5487
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "LAbtlJJhUlY_1",
"video_path": "LAbtlJJhUlY.mp4",
"subtitle_path": "LAbtlJJhUlY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1028.11,
"view_count": 219300
},
{
"video_id": "LAbtlJJhUlY",
"question": "A woman in a black coat with yellow vertical stripes and colorful badge dots is standing in the center of the screen. She has accessories on her wrist, holding scissors in one hand and a white paper in the other. Behind her, there is a table with a lamp and clutter. The wall is decorated with pictures and miscellaneous items. What does the woman do after picking up the scissors and the white paper?",
"question_wo_referring_query": "What does the woman do after picking up the scissors and the white paper?",
"candidates": [
"The woman picks up a magazine",
"The woman drops the white paper",
"The woman puts down the white paper",
"The woman puts down the scissors",
"The woman cuts the white paper with the scissors"
],
"correct_choice": 4,
"position": [
7980,
8036
],
"topic_category": "KA-Knowledge-Art",
"question_category": "E3E",
"level": "L2-Relation",
"id": "LAbtlJJhUlY_2",
"video_path": "LAbtlJJhUlY.mp4",
"subtitle_path": "LAbtlJJhUlY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1028.11,
"view_count": 219300
},
{
"video_id": "_g3Y_mk64Wc",
"question": "A woman wearing a shoulder-baring top is sitting on a white sofa. She is wearing a wristwatch and a ring. Behind her are a black railing and a white wall. When the subtitle 'blackford' appears, what is the woman doing?",
"question_wo_referring_query": "What is the woman doing?",
"candidates": [
"The woman is holding a book",
"The woman is holding a cat",
"The woman is holding two pieces of paper",
"The woman is holding two books",
"The woman is holding a pen"
],
"correct_choice": 0,
"position": [
2791
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "_g3Y_mk64Wc_0",
"video_path": "_g3Y_mk64Wc.mp4",
"subtitle_path": "_g3Y_mk64Wc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.24,
"view_count": 462834
},
{
"video_id": "_g3Y_mk64Wc",
"question": "A woman wearing an off-shoulder top is sitting on a white sofa. She has orange nail polish and is wearing a necklace. Behind her is a white wall and a black armrest. Two books are on either side of her. When the subtitle 'honeymooners is one of my favorites and' appears, what is the woman doing?",
"question_wo_referring_query": "A woman wearing an off-shoulder top is sitting on a white sofa. She has orange nail polish and is wearing a necklace. Behind her is a white wall and a black armrest. Two books are on either side of her. When the subtitle 'honeymooners is one of my favorites and' appears, what is the woman doing?",
"candidates": [
"The woman is holding a book in each hand",
"The woman is holding a pen in both hands",
"The woman is holding a book in her right hand and a pen in her left hand",
"The woman is holding a book in her left hand and a paper in her right hand",
"The woman is holding a book in her left hand and a pen in her right hand"
],
"correct_choice": 0,
"position": [
4237
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "_g3Y_mk64Wc_1",
"video_path": "_g3Y_mk64Wc.mp4",
"subtitle_path": "_g3Y_mk64Wc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.24,
"view_count": 462834
},
{
"video_id": "_g3Y_mk64Wc",
"question": "A woman dressed in an off-shoulder top is sitting on a white sofa, wearing a necklace and a ring. Behind her, there's a black railing and a white wall with a plant on the railing. To her right is the cover of a book, and to her left is a book. What is the woman doing when the subtitle 'colleen hooper you did it again' appears?",
"question_wo_referring_query": "What is the woman doing?",
"candidates": [
"The woman is holding a pen with one hand.",
"The woman is holding a book with one hand.",
"The woman is holding a book with one hand.",
"The woman is lifting the book above her head.",
"The woman is holding a book with both hands."
],
"correct_choice": 4,
"position": [
19293
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "_g3Y_mk64Wc_2",
"video_path": "_g3Y_mk64Wc.mp4",
"subtitle_path": "_g3Y_mk64Wc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1078.24,
"view_count": 462834
},
{
"video_id": "OAcbasjxljY",
"question": "On a green grass field, a man wearing a short-sleeved shirt is crouching on the ground, and in front of him is a standing prairie dog. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Holding the prairie dog in his hand",
"Shaking hands with the prairie dog",
"Putting the prairie dog into a basket",
"Feeding the prairie dog",
"Putting clothes on the prairie dog"
],
"correct_choice": 3,
"position": [
11109
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "OAcbasjxljY_0",
"video_path": "OAcbasjxljY.mp4",
"subtitle_path": "OAcbasjxljY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1401.44,
"view_count": 1085327
},
{
"video_id": "OAcbasjxljY",
"question": "In a room with walls covered in pictures and a world map hanging, what is a man wearing a white beard and a gray shirt doing?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Holding a rabbit",
"Holding a squirrel",
"Holding a sheep",
"Holding a small dog",
"Holding a cat"
],
"correct_choice": 3,
"position": [
20886
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "OAcbasjxljY_1",
"video_path": "OAcbasjxljY.mp4",
"subtitle_path": "OAcbasjxljY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1401.44,
"view_count": 1085327
},
{
"video_id": "OAcbasjxljY",
"question": "In the scene where the words 'wanna make a meaningful connection' in white English letters are written at the top left corner, there is a man with long curly hair standing in the room, wearing a black outfit with a white heart pattern. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Dancing",
"Playing on a computer",
"Listening to music",
"Watching TV",
"Looking at a phone"
],
"correct_choice": 4,
"position": [
25976
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "OAcbasjxljY_2",
"video_path": "OAcbasjxljY.mp4",
"subtitle_path": "OAcbasjxljY_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1401.44,
"view_count": 1085327
},
{
"video_id": "ysRFFN5nzqE",
"question": "In a room with a rectangular wooden board hanging on a wall, a woman with long black hair, wearing a black leather jacket, is sitting in front of a table with a water cup and a flat panel. What is the color of the water cup on the table when the subtitle says 'to Melissa miracle calm and I'll see you'?",
"question_wo_referring_query": "What is the color of the water cup on the table?",
"candidates": [
"yellow",
"white",
"blue",
"black",
"red"
],
"correct_choice": 2,
"position": [
51847
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "ysRFFN5nzqE_0",
"video_path": "ysRFFN5nzqE.mp4",
"subtitle_path": "ysRFFN5nzqE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2161.16,
"view_count": 82069
},
{
"video_id": "ysRFFN5nzqE",
"question": "On the left side of the screen, there is an orange area above the pink text on a white screen, and on the right side of the white screen there are two character cutouts from top to bottom. When the subtitle says 'actually we can actually change that to', what shape is the orange area in the middle of the white screen?",
"question_wo_referring_query": "What shape is the orange area in the middle of the white screen?",
"candidates": [
"circle",
"triangle",
"rectangle",
"stair shape",
"square"
],
"correct_choice": 2,
"position": [
20665
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "ysRFFN5nzqE_1",
"video_path": "ysRFFN5nzqE.mp4",
"subtitle_path": "ysRFFN5nzqE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2161.16,
"view_count": 82069
},
{
"video_id": "ysRFFN5nzqE",
"question": "On the rightmost side of the white screen, from top to bottom, there are two person frames. The top one is a black-haired woman sitting in front of a desk, and the bottom one is a woman wearing glasses. When the subtitle says 'get a little bit technical which you,' what type of clothing is the woman wearing glasses at the bottom wearing?",
"question_wo_referring_query": "What type of clothing is the woman wearing glasses at the bottom wearing?",
"candidates": [
"T-shirt",
"Leather jacket",
"Sweater",
"Swimsuit",
"Suit"
],
"correct_choice": 0,
"position": [
36114
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "ysRFFN5nzqE_2",
"video_path": "ysRFFN5nzqE.mp4",
"subtitle_path": "ysRFFN5nzqE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2161.16,
"view_count": 82069
},
{
"video_id": "lzAESaVqix0",
"question": "Inside the gymnasium, standing in front of a ping pong table, a little boy wearing a blue short-sleeve shirt is holding a red ping pong paddle. After picking up a white ping pong ball from the red tray, what action did the boy take?",
"question_wo_referring_query": "What action did the boy take?",
"candidates": [
"Sit on the ping pong table",
"Place the ball on the table",
"Hit the ball with the paddle",
"Throw the ball on the ground",
"Put the paddle on the table"
],
"correct_choice": 2,
"position": [
21333,
21339
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E3E",
"level": "L2-Relation",
"id": "lzAESaVqix0_0",
"video_path": "lzAESaVqix0.mp4",
"subtitle_path": "lzAESaVqix0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1189.36,
"view_count": 6106123
},
{
"video_id": "lzAESaVqix0",
"question": "After a man wearing a red short-sleeved shirt and a black hat finished speaking in front of a black background, what did this man do?",
"question_wo_referring_query": "What did this man do?",
"candidates": [
"picked up a basketball",
"picked up a stick",
"picked up a soccer ball",
"picked up a painting",
"picked up a pot of flowers"
],
"correct_choice": 0,
"position": [
19898,
19955
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E3E",
"level": "L2-Relation",
"id": "lzAESaVqix0_1",
"video_path": "lzAESaVqix0.mp4",
"subtitle_path": "lzAESaVqix0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1189.36,
"view_count": 6106123
},
{
"video_id": "lzAESaVqix0",
"question": "Standing on the left side of a black screen, a man in a red short-sleeved shirt is hit by another man in a red and white striped shirt on the right side of the screen. After being slapped, what did the man in the red short-sleeved shirt do?",
"question_wo_referring_query": "What did the man in the red short-sleeved shirt on the left side of the screen do?",
"candidates": [
"Fell to the ground",
"Kneeled down on the ground",
"Grabbed someone else",
"Turned around and bent down",
"Jumped up"
],
"correct_choice": 3,
"position": [
27878,
27889
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E3E",
"level": "L2-Relation",
"id": "lzAESaVqix0_2",
"video_path": "lzAESaVqix0.mp4",
"subtitle_path": "lzAESaVqix0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1189.36,
"view_count": 6106123
},
{
"video_id": "0up5NxTiGZE",
"question": "In a house built of wooden planks, there is a brown sheep eating something. A man in a black coat is petting it. Where else has this sheep appeared?",
"question_wo_referring_query": "Where else has this sheep appeared?",
"candidates": [
"In front of a wooden table under a thatched roof",
"On the grass",
"In a zoo",
"In a park",
"On the roof"
],
"correct_choice": 0,
"position": [
726,
16864
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "0up5NxTiGZE_0",
"video_path": "0up5NxTiGZE.mp4",
"subtitle_path": "0up5NxTiGZE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 991.83,
"view_count": 555292
},
{
"video_id": "0up5NxTiGZE",
"question": "Above a circular stove burning with red flames, a person is holding a cleaned and dried chicken, roasting it on the stove. Where else has this chicken appeared?",
"question_wo_referring_query": "Where else has this chicken appeared?",
"candidates": [
"In the refrigerator",
"In a round iron tray on an olive-colored wooden table",
"On the grass",
"On the snow-covered ground",
"In a sheep pen"
],
"correct_choice": 1,
"position": [
3118,
11540
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "0up5NxTiGZE_1",
"video_path": "0up5NxTiGZE.mp4",
"subtitle_path": "0up5NxTiGZE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 991.83,
"view_count": 555292
},
{
"video_id": "0up5NxTiGZE",
"question": "Under the built wooden frame, where else has the red shredded carrots inside the round iron plate on the olive-colored wooden table appeared?",
"question_wo_referring_query": "Where else has it appeared?",
"candidates": [
"On the roof",
"In the refrigerator",
"In the pot",
"In the oven",
"On the snowy ground"
],
"correct_choice": 2,
"position": [
7730,
7901
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "0up5NxTiGZE_2",
"video_path": "0up5NxTiGZE.mp4",
"subtitle_path": "0up5NxTiGZE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 991.83,
"view_count": 555292
},
{
"video_id": "4ouAf1ldH60",
"question": "On a brown cutting board, a person is holding a knife and cutting an orange carrot. What shape are the carrot pieces that have already been cut?",
"question_wo_referring_query": "What shape are the already cut carrot pieces?",
"candidates": [
"Round",
"Triangular",
"Long strips",
"Minced",
"Square"
],
"correct_choice": 2,
"position": [
15689
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "4ouAf1ldH60_0",
"video_path": "4ouAf1ldH60.mp4",
"subtitle_path": "4ouAf1ldH60_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1232.2,
"view_count": 1415739
},
{
"video_id": "4ouAf1ldH60",
"question": "Underneath a shelf filled with round wooden logs, a man is stretching his arms while pulling a long, thin white noodle. What color is the shirt the man is wearing?",
"question_wo_referring_query": "What color is the shirt the man is wearing?",
"candidates": [
"white",
"olive",
"red",
"black",
"purple"
],
"correct_choice": 3,
"position": [
9020
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "4ouAf1ldH60_1",
"video_path": "4ouAf1ldH60.mp4",
"subtitle_path": "4ouAf1ldH60_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1232.2,
"view_count": 1415739
},
{
"video_id": "4ouAf1ldH60",
"question": "A person is holding a spatula and putting pre-cut meat slices into a pan with hot oil. What is the shape of the pan containing the hot oil?",
"question_wo_referring_query": "What is the shape of the pan containing the hot oil?",
"candidates": [
"Triangular",
"Square",
"Stepped",
"Rectangular",
"Round"
],
"correct_choice": 4,
"position": [
20423
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "4ouAf1ldH60_2",
"video_path": "4ouAf1ldH60.mp4",
"subtitle_path": "4ouAf1ldH60_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1232.2,
"view_count": 1415739
},
{
"video_id": "Z-1lgAXOEc8",
"question": "A man wearing a leather jacket is standing by the side of the road, behind him is a black iron gate and columns, surrounded by some green leaves and small yellow flowers. On the left side of the screen, there is a traffic light. In the distance, there are some buildings. In which other scenes does this man appear?",
"question_wo_referring_query": "In which other scenes does this man appear?",
"candidates": [
"In a green photo green ceiling with two storefront pizza shops.",
"In a place with blue sky and white clouds, red buildings on both sides, a crosswalk on the road, and traffic lights hanging on a pole.",
"In a place with orange pillars and glass doors with PIZZA stickers, at the top of the storefront, there is a grid-style pizza shop sign.",
"In a long corridor covered with photos.",
"In an open kitchen with white cabinets and a black stove, there is also a white door nearby."
],
"correct_choice": 4,
"position": [
22,
778,
61388250,
6691,
3242,
3242
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Z-1lgAXOEc8_0",
"video_path": "Z-1lgAXOEc8.mp4",
"subtitle_path": "Z-1lgAXOEc8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1013.97,
"view_count": 263495
},
{
"video_id": "Z-1lgAXOEc8",
"question": "In the scene, there is a blonde woman wearing a black tank top and a man with curly hair wearing a black T-shirt. They are talking in front of a mirror, with a black column and a thick tree behind them. There are also two white cars parked by the roadside. In which scene does the woman in the gold tank top appear?",
"question_wo_referring_query": "In which scene does the woman in the gold tank top appear?",
"candidates": [
"On the stairs in front of a red building with black handrails.",
"On the stairs in front of a red building with white handrails.",
"On the stairs in front of a red building with gray handrails.",
"On the stairs in front of a red building with yellow handrails.",
"On the stairs in front of a red building with orange handrails."
],
"correct_choice": 0,
"position": [
1113,
4785
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Z-1lgAXOEc8_1",
"video_path": "Z-1lgAXOEc8.mp4",
"subtitle_path": "Z-1lgAXOEc8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1013.97,
"view_count": 263495
},
{
"video_id": "Z-1lgAXOEc8",
"question": "A man wearing a black T-shirt is standing in front of a black and white painted roller shutter door. He has black hair and is holding a white paper plate with pizza in it. In which other scene does this white paper plate appear?",
"question_wo_referring_query": "In which other scene does this white paper plate appear?",
"candidates": [
"In a scene where someone is holding a pizza next to a transparent box containing red sauce.",
"In a scene where someone is holding a pizza next to a transparent box containing yellow sauce.",
"In a scene where someone is holding a pizza next to a transparent box containing white sauce.",
"In a scene where someone is holding a pizza next to a transparent box containing black sauce.",
"In a scene where someone is holding a pizza next to a transparent box containing green sauce."
],
"correct_choice": 0,
"position": [
14643,
17300
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Z-1lgAXOEc8_2",
"video_path": "Z-1lgAXOEc8.mp4",
"subtitle_path": "Z-1lgAXOEc8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1013.97,
"view_count": 263495
},
{
"video_id": "eDso3zHFxL8",
"question": "The screen shows a man wearing a black hat and a white T-shirt. On his right, there is a four-panel photo showing a person in the same T-shirt at different locations. The background behind the man is pure black. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"The man is holding a photo of the person and having a conversation with someone.",
"The man is holding a photo of the person and shaking hands with someone.",
"The man is holding a photo of the person and talking.",
"The man is holding a photo of the person and turning around.",
"The man is holding a photo of the person and facing away from the camera."
],
"correct_choice": 2,
"position": [
338
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "eDso3zHFxL8_0",
"video_path": "eDso3zHFxL8.mp4",
"subtitle_path": "eDso3zHFxL8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1123.79,
"view_count": 84852
},
{
"video_id": "eDso3zHFxL8",
"question": "The individual on the screen is a man wearing a black hat and a white T-shirt. To his right, there is a photo that shows the back silhouette of a person looking at a sculpture. On the left side, there is also a yellow building. What is the man in the screen doing?",
"question_wo_referring_query": "What is the man in the screen doing?",
"candidates": [
"Raising both hands and facing away from a mirror while talking",
"Raising both hands and looking up while talking",
"Raising both hands and nodding while talking",
"Raising both hands and looking down while talking",
"Raising both hands and facing a mirror while talking"
],
"correct_choice": 1,
"position": [
8769
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "eDso3zHFxL8_1",
"video_path": "eDso3zHFxL8.mp4",
"subtitle_path": "eDso3zHFxL8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1123.79,
"view_count": 84852
},
{
"video_id": "eDso3zHFxL8",
"question": "The screen shows a man wearing a black hat and a white t-shirt. To his right, there is an equation written with white characters. The background behind the man is pure black. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"The man is holding a photo of a person and talking about it.",
"Spreading both hands open and facing the camera while speaking.",
"Pointing with both hands in the direction of the equation while speaking.",
"Raising both hands and facing the camera while speaking.",
"The man is holding a photo of a person and facing away from the camera."
],
"correct_choice": 2,
"position": [
22362
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "eDso3zHFxL8_2",
"video_path": "eDso3zHFxL8.mp4",
"subtitle_path": "eDso3zHFxL8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1123.79,
"view_count": 84852
},
{
"video_id": "0t1vtW0cT1E",
"question": "In a dark car, there is a screen above, and a clock in front displaying 0:17. A man wearing glasses sitting in the front right seat is looking at the mirror. When the subtitle 'you guys to charity what's your name' appears, what action is this man with glasses doing?",
"question_wo_referring_query": "What action is the man with glasses doing?",
"candidates": [
"Pointing at the mirror with his middle finger",
"Giving a thumbs-up",
"Stretching both hands towards the mirror",
"Pointing at the mirror with his index finger",
"Closing his eyes tightly"
],
"correct_choice": 3,
"position": [
9852
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "0t1vtW0cT1E_0",
"video_path": "0t1vtW0cT1E.mp4",
"subtitle_path": "0t1vtW0cT1E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 951.4,
"view_count": 413019
},
{
"video_id": "0t1vtW0cT1E",
"question": "In the deep mountains, with overgrown weeds ahead, trees in the distance, a stone at the bottom right corner, and a wooden handrail with a hand wearing a bracelet directly in front, what action does the hand perform when the subtitle 'while do not put too much weight on this' appears?",
"question_wo_referring_query": "What action does the hand perform?",
"candidates": [
"pulls out the handrail",
"shakes the handrail",
"raises the thumb",
"lifts the stone",
"pulls grass"
],
"correct_choice": 1,
"position": [
15387
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "0t1vtW0cT1E_1",
"video_path": "0t1vtW0cT1E.mp4",
"subtitle_path": "0t1vtW0cT1E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 951.4,
"view_count": 413019
},
{
"video_id": "0t1vtW0cT1E",
"question": "In a canyon, the walls on both sides are covered with green plants. There is a small stream below, with several stones on the left side and a golden-haired person on the right. In front of the camera is a man wearing a red headscarf and holding a black backpack. When the subtitle 'gonna weather proof my stuff like this' appears, what action does this man take?",
"question_wo_referring_query": "What action does this man take when the subtitle 'gonna weather proof my stuff like this' appears?",
"candidates": [
"Pats the black backpack",
"Raises his thumb",
"Opens the backpack",
"Puts down the backpack",
"Takes off the headscarf"
],
"correct_choice": 0,
"position": [
17732
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "0t1vtW0cT1E_2",
"video_path": "0t1vtW0cT1E.mp4",
"subtitle_path": "0t1vtW0cT1E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 951.4,
"view_count": 413019
},
{
"video_id": "mfS6gyP0mwo",
"question": "A woman with long hair, wearing a purple top and a necklace, is giving an introduction at the beginning of the video and later gives a lecture. What changes occur in the color of the wall behind her at these times?",
"question_wo_referring_query": "What changes occur in the color of the wall behind her?",
"candidates": [
"White changes to blue",
"Olive green changes to white",
"White changes to olive green",
"Olive green changes to blue"
],
"correct_choice": 1,
"position": [
167,
1000
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SAA",
"level": "L2-Relation",
"id": "mfS6gyP0mwo_0",
"video_path": "mfS6gyP0mwo.mp4",
"subtitle_path": "mfS6gyP0mwo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1114.05,
"view_count": 22066
},
{
"video_id": "mfS6gyP0mwo",
"question": "In the upper right corner of the frame, there is a woman with long hair wearing a purple top, sitting on a black object. The wall behind her is white. At this moment, the camera is facing forward. When the camera turns and the yellow wall on the left is revealed, what change occurs to the object the woman is holding in her left hand?",
"question_wo_referring_query": "What change occurs to the object the woman is holding in her left hand?",
"candidates": [
"It changes to a calculator",
"It changes to a ruler",
"It changes to a piece of white paper",
"It changes to a circle"
],
"correct_choice": 2,
"position": [
9095,
11458
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SAA",
"level": "L2-Relation",
"id": "mfS6gyP0mwo_1",
"video_path": "mfS6gyP0mwo.mp4",
"subtitle_path": "mfS6gyP0mwo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1114.05,
"view_count": 22066
},
{
"video_id": "mfS6gyP0mwo",
"question": "In the top right corner of the video, there is a woman wearing a purple outfit, holding a white pen in her left hand, sitting on a black object. The wall is white. When she explains 11.5110*21.20/(44.11+1.223) and during the summary at the end of the video, how does the color of the wall change?",
"question_wo_referring_query": "How does the color of the wall change?",
"candidates": [
"White turns to blue",
"White turns to purple",
"White turns to green",
"White turns to black"
],
"correct_choice": 1,
"position": [
25000,
26446
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SAA",
"level": "L2-Relation",
"id": "mfS6gyP0mwo_2",
"video_path": "mfS6gyP0mwo.mp4",
"subtitle_path": "mfS6gyP0mwo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1114.05,
"view_count": 22066
},
{
"video_id": "zVudr8cxHRE",
"question": "In the video, there are two men in a room. The man on the left is wearing a yellow shirt, a hat backwards, and a watch on his left hand. The man on the right is wearing a shirt with red flowers and green leaves and has a bracelet on his left hand. When the subtitle mentions \"interested enough to join us but first,\" what change happens to the man wearing the shirt with red flowers and green leaves?",
"question_wo_referring_query": "What change happens to the man wearing the shirt with red flowers and green leaves?",
"candidates": [
"He takes off his bracelet",
"He changes into a black short-sleeved shirt",
"He changes into a white long-sleeved shirt",
"He changes into a black jacket"
],
"correct_choice": 1,
"position": [
2087,
3123
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TAA",
"level": "L2-Relation",
"id": "zVudr8cxHRE_0",
"video_path": "zVudr8cxHRE.mp4",
"subtitle_path": "zVudr8cxHRE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1236.32,
"view_count": 5037235
},
{
"video_id": "zVudr8cxHRE",
"question": "In a confined space, on the right side, there is a man wearing a yellow shirt, with earphones around his neck, and a watch on his left hand. In the middle, there is a man wearing a white shirt and a visor, sitting on a black chair. On the left side, there is a person wearing black clothes, carrying a white bag. What change did the man in the yellow shirt undergo when the subtitles mentioned 'yourself feel terrible and like flirty'?",
"question_wo_referring_query": "What change did the man in the yellow shirt undergo?",
"candidates": [
"He took off his watch",
"He changed into a black shirt",
"He changed into a black jacket",
"He took off his earphones"
],
"correct_choice": 3,
"position": [
12249,
13746
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TAA",
"level": "L2-Relation",
"id": "zVudr8cxHRE_1",
"video_path": "zVudr8cxHRE.mp4",
"subtitle_path": "zVudr8cxHRE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1236.32,
"view_count": 5037235
},
{
"video_id": "zVudr8cxHRE",
"question": "In the video, there's a man wearing a blue shirt with earphones around his neck and carrying a backpack. To his left, there's a parked blue bus, and to his right, there's an open glass door. What change occurs to the man when the subtitles mention 'there are no rules there are no rules'?",
"question_wo_referring_query": "What change occurs to the man?",
"candidates": [
"He changed into a white shirt",
"He changed into a purple-red shirt",
"He changed into a yellow shirt",
"He changed into a blue shirt"
],
"correct_choice": 1,
"position": [
18524,
20125
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TAA",
"level": "L2-Relation",
"id": "zVudr8cxHRE_2",
"video_path": "zVudr8cxHRE.mp4",
"subtitle_path": "zVudr8cxHRE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1236.32,
"view_count": 5037235
},
{
"video_id": "LVFvRNRTEd4",
"question": "In the screen, there is a man on the right wearing a dark blue coat, paired with a green shirt underneath, pointing forward with his right index finger. To the left, there is a black man wearing a red coat, holding a metal rod. When the subtitle mentions 'fishing and research here you know why,' what material is the handle of the metal rod made of?",
"question_wo_referring_query": "What material is the handle of the metal rod made of?",
"candidates": [
"Stone",
"Glass",
"Cloth",
"Wood"
],
"correct_choice": 3,
"position": [
5545
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "LVFvRNRTEd4_0",
"video_path": "LVFvRNRTEd4.mp4",
"subtitle_path": "LVFvRNRTEd4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1181.27,
"view_count": 4450606
},
{
"video_id": "LVFvRNRTEd4",
"question": "On the left side of the screen, there are some yellow objects under the tree being exposed to the sun. On the right side, there is a person with long hair wearing a white top and blue pants standing in front of a wall. The person with long hair is carrying a bag. When the subtitle mentions 'largest nickel and abaca or Manila home,' what is the shape of the bag the person with long hair is carrying?",
"question_wo_referring_query": "What is the shape of the bag that the person with long hair is carrying?",
"candidates": [
"circle",
"rectangle",
"triangle",
"square"
],
"correct_choice": 0,
"position": [
11247
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "LVFvRNRTEd4_1",
"video_path": "LVFvRNRTEd4.mp4",
"subtitle_path": "LVFvRNRTEd4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1181.27,
"view_count": 4450606
},
{
"video_id": "LVFvRNRTEd4",
"question": "Among the group of people in the video, there is a woman wearing a blue jacket with a white shirt underneath. She is raising her right hand, holding a large stuffed toy in her left hand, and carrying a white plastic bag filled with items. When the subtitle mentions 'those and the word balikbayan means,' what color is the stuffed toy?",
"question_wo_referring_query": "What color is the stuffed toy?",
"candidates": [
"purple",
"red",
"blue",
"green",
"yellow"
],
"correct_choice": 1,
"position": [
19668
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "LVFvRNRTEd4_2",
"video_path": "LVFvRNRTEd4.mp4",
"subtitle_path": "LVFvRNRTEd4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1181.27,
"view_count": 4450606
},
{
"video_id": "OAHsR02dUc0",
"question": "On a black screen, there are ten different colored particle-like objects sprayed on the screen, and at the bottom, there are circles of various colors from 0 to 9. After this, what happened to these different colored objects?",
"question_wo_referring_query": "After this, what happened to these different colored objects?",
"candidates": [
"These differently colored objects became blurry and rectangular in shape",
"These differently colored objects became blurry and strip-like in shape",
"These differently colored objects became blurry and circular in shape",
"These differently colored objects were assembled together"
],
"correct_choice": 2,
"position": [
53402,
55573
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "OAHsR02dUc0_0",
"video_path": "OAHsR02dUc0.mp4",
"subtitle_path": "OAHsR02dUc0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3485.12,
"view_count": 345
},
{
"video_id": "OAHsR02dUc0",
"question": "Against a white-screen background, at the top of the screen, there is a conspicuous title 'Lagrangian Mechanics.' On the right side, there's a black frame with sharp edges, inside of which are irregular blue lines. What happens to these blue lines afterward?",
"question_wo_referring_query": "What happens to these blue lines afterward?",
"candidates": [
"In the 0-6 region, it descends, then in the 6-8 region, it rapidly spirals up and then descends again.",
"In the 0-4 region, it ascends, then rapidly descends in the 6-8 region.",
"It becomes a straight horizontal line.",
"In the 0-6 region, it rises, then in the 6-8 region, it rapidly spirals down and then rises again."
],
"correct_choice": 3,
"position": [
5541,
8458
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "OAHsR02dUc0_1",
"video_path": "OAHsR02dUc0.mp4",
"subtitle_path": "OAHsR02dUc0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3485.12,
"view_count": 345
},
{
"video_id": "OAHsR02dUc0",
"question": "In a screen with a white background, the title is 'General Loss', below which there is a line of black mathematical symbols. After that, what happens to these mathematical symbols?",
"question_wo_referring_query": "After that, what happens to these mathematical symbols?",
"candidates": [
"They are sequentially framed by four boxes in green, red, yellow, and blue, and then appear.",
"They are sequentially circled by four circles in green, red, yellow, and blue, and then appear.",
"They are sequentially circled by four circles in green, red, blue, and yellow, and then appear.",
"They are sequentially framed by four boxes in green, red, blue, and yellow, and then appear."
],
"correct_choice": 3,
"position": [
72291,
74243
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "E3E",
"level": "L2-Relation",
"id": "OAHsR02dUc0_2",
"video_path": "OAHsR02dUc0.mp4",
"subtitle_path": "OAHsR02dUc0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3485.12,
"view_count": 345
},
{
"video_id": "bwDfdTh0VYs",
"question": "In the blue background, there is a dark blue circular icon. What appears on the screen after the phrase 'expect the damage in this yeah yeah' is mentioned?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"Five circular icons appear.",
"Four circular icons appear.",
"A black airplane icon appears, with four dark blue circular icons underneath it.",
"Two circular icons appear."
],
"correct_choice": 2,
"position": [
4291,
6260
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "bwDfdTh0VYs_0",
"video_path": "bwDfdTh0VYs.mp4",
"subtitle_path": "bwDfdTh0VYs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1732.0,
"view_count": 73641
},
{
"video_id": "bwDfdTh0VYs",
"question": "In the blue background PPT, what appears on the screen after mentioning 'look at it and it's like oh yeah they'?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"A red airplane appears, with two lines of English text below it.",
"A black airplane appears, with two lines of English text below it.",
"A black airplane appears, with two lines of red English text below it.",
"A black airplane appears, with three lines of red English text below it."
],
"correct_choice": 1,
"position": [
7442,
15287
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "bwDfdTh0VYs_1",
"video_path": "bwDfdTh0VYs.mp4",
"subtitle_path": "bwDfdTh0VYs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1732.0,
"view_count": 73641
},
{
"video_id": "bwDfdTh0VYs",
"question": "On the blue background PPT, there's a text style for 'Virtual and Real Attrition.' After mentioning 'real attrition because of every air attack,' what appears on the screen?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"Three dark blue circular icons and two lines of English text in red font appear",
"Five yellow circular icons appear",
"Four square icons appear",
"A red airplane icon appears"
],
"correct_choice": 0,
"position": [
21686,
37649
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "bwDfdTh0VYs_2",
"video_path": "bwDfdTh0VYs.mp4",
"subtitle_path": "bwDfdTh0VYs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1732.0,
"view_count": 73641
},
{
"video_id": "-0aM99dMu_4",
"question": "On an image with a white background, there is densely packed black English text. The top line has the letters 'NAMICAL DISTANCE FUNCTIONS', and in the third line of the screen, there is another line of distance functions. When 'in time step I and then they simply' is mentioned, what change occurs to this line of distance functions?",
"question_wo_referring_query": "What change occurs to this line of distance functions?",
"candidates": [
"It turns red, and the latter part is circled in red for annotation",
"It turns red, and the latter part is circled in yellow for annotation",
"It turns yellow, and the latter part is circled in red for annotation",
"It turns yellow, and the latter part is circled in yellow for annotation"
],
"correct_choice": 2,
"position": [
22155,
23611
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TAA",
"level": "L2-Relation",
"id": "-0aM99dMu_4_0",
"video_path": "-0aM99dMu_4.mp4",
"subtitle_path": "-0aM99dMu_4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1355.29,
"view_count": 1898
},
{
"video_id": "-0aM99dMu_4",
"question": "In an image with a white background and black English text, where the heading is styled as 'published as a conference paper at ICLR 2020,' the top row contains photos of six machines. When 'gradient so that's the outset let's' is mentioned, what changes occur to these six images?",
"question_wo_referring_query": "What changes occur to these six images?",
"candidates": [
"There is a yellow circle annotation at the bottom of the images.",
"There is a red circle annotation at the top of the images.",
"There is a red circle annotation at the bottom of the images.",
"There is a yellow circle annotation at the top of the images."
],
"correct_choice": 1,
"position": [
1996,
5801
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TAA",
"level": "L2-Relation",
"id": "-0aM99dMu_4_1",
"video_path": "-0aM99dMu_4.mp4",
"subtitle_path": "-0aM99dMu_4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1355.29,
"view_count": 1898
},
{
"video_id": "-0aM99dMu_4",
"question": "In a black-bordered, white-background image, there are three equally sized video windows labeled 'timesteps'. When 'the unsupervised method will discover' is mentioned, what change occurs in the first video window?",
"question_wo_referring_query": "When 'the unsupervised method will discover' is mentioned, what change occurs in the first video window?",
"candidates": [
"It shows one segment of a black-and-white grid with two green objects.",
"It shows two segments of a black-and-white grid with two green objects.",
"It shows two segments of a pure black screen with two green objects.",
"It shows one segment of a pure black screen with two green objects."
],
"correct_choice": 1,
"position": [
25461,
26965
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TAA",
"level": "L2-Relation",
"id": "-0aM99dMu_4_2",
"video_path": "-0aM99dMu_4.mp4",
"subtitle_path": "-0aM99dMu_4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1355.29,
"view_count": 1898
},
{
"video_id": "wFrztzzohJ8",
"question": "In front of a grey wall with a huge oil painting hanging on it, there's a dense row of people standing and admiring this artwork. Which objects have not appeared in the scene?",
"question_wo_referring_query": "Which objects have not appeared in the scene?",
"candidates": [
"Blue backpack",
"White pocket",
"Red hat",
"Red handbag"
],
"correct_choice": 0,
"position": [
16491
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "wFrztzzohJ8_0",
"video_path": "wFrztzzohJ8.mp4",
"subtitle_path": "wFrztzzohJ8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1061.56,
"view_count": 461091
},
{
"video_id": "wFrztzzohJ8",
"question": "In an oil painting depicting many people, there is a man in blue clothes holding an object and kneeling towards a man in black clothes. There are also many people standing behind them. Which character appears in this scene?",
"question_wo_referring_query": "Which character appears in this scene?",
"candidates": [
"A person in black clothes leaning on a crutch",
"A person wearing a red robe",
"A person in gray clothes holding a musical instrument",
"A person wearing a purple robe"
],
"correct_choice": 1,
"position": [
15558
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "wFrztzzohJ8_1",
"video_path": "wFrztzzohJ8.mp4",
"subtitle_path": "wFrztzzohJ8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1061.56,
"view_count": 461091
},
{
"video_id": "wFrztzzohJ8",
"question": "In a meadow with many animals, an old man dressed in blue clothes and wearing a red cape is talking to two naked people. Which animal appears on the screen?",
"question_wo_referring_query": "Which animal appears on the screen?",
"candidates": [
"Peacock",
"Tiger",
"Crow",
"Giraffe"
],
"correct_choice": 0,
"position": [
18723
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "wFrztzzohJ8_2",
"video_path": "wFrztzzohJ8.mp4",
"subtitle_path": "wFrztzzohJ8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1061.56,
"view_count": 461091
},
{
"video_id": "bAGhXcYc0o4",
"question": "In a white room with a painting on the wall, a woman with light brown shoulder-length hair, wearing a white short-sleeved shirt, mentions \u2018So in standard Norwegian, you would say \u201cSoppel\u201d, but in Bergen, we say \u201cBoss\u201d\u2019. What objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"gray desk lamp",
"red desk lamp",
"wooden desk lamp",
"white desk lamp"
],
"correct_choice": 3,
"position": [
19126
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "bAGhXcYc0o4_0",
"video_path": "bAGhXcYc0o4.mp4",
"subtitle_path": "bAGhXcYc0o4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1151.99,
"view_count": 2657081
},
{
"video_id": "bAGhXcYc0o4",
"question": "In front of a wooden wall decorated with a face flag, a man wearing a blue striped short sleeve shirt is sitting on a grey sofa. When 'Norwegian Geograpeeps explaining' is mentioned, what objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"A black hat",
"A blue hat",
"A yellow small flag",
"A black small flag"
],
"correct_choice": 1,
"position": [
18507
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "bAGhXcYc0o4_1",
"video_path": "bAGhXcYc0o4.mp4",
"subtitle_path": "bAGhXcYc0o4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1151.99,
"view_count": 2657081
},
{
"video_id": "bAGhXcYc0o4",
"question": "Beneath the clouds, there are two cliffs covered with patches of green moss. Between them, there's a strange boulder. When mentioning 'There's that weird boulder jammed between two cliffs,' what objects are present on the screen?",
"question_wo_referring_query": "What objects are present on the screen?",
"candidates": [
"white arrowhead with black outline",
"pure white arrowhead",
"black arrowhead with white outline",
"pure black arrowhead"
],
"correct_choice": 0,
"position": [
7422
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2O",
"level": "L1-Perception",
"id": "bAGhXcYc0o4_2",
"video_path": "bAGhXcYc0o4.mp4",
"subtitle_path": "bAGhXcYc0o4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1151.99,
"view_count": 2657081
},
{
"video_id": "mq6L8CnNJXc",
"question": "On a white shelf, there is a pair of mismatched shoes, and next to the shoes, there's a wooden-colored bag. What kind of fastener does the bag in the video have?",
"question_wo_referring_query": "What kind of fastener does the bag in the video have?",
"candidates": [
"It is a gold square fastener",
"It is a silver square fastener",
"It is a silver round fastener",
"It is a gold round fastener"
],
"correct_choice": 1,
"position": [
10896
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "mq6L8CnNJXc_0",
"video_path": "mq6L8CnNJXc.mp4",
"subtitle_path": "mq6L8CnNJXc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1184.77,
"view_count": 1898940
},
{
"video_id": "mq6L8CnNJXc",
"question": "In front of a wall with blue ceramic tiles, there's a man with a mustache, carrying a backpack and wearing a short-sleeved shirt. What is the color of the straps on the backpack he's carrying in the video?",
"question_wo_referring_query": "In the video, what is the color of the straps on the backpack he is carrying?",
"candidates": [
"Olive",
"Black",
"White",
"Gray"
],
"correct_choice": 1,
"position": [
18489
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "mq6L8CnNJXc_1",
"video_path": "mq6L8CnNJXc.mp4",
"subtitle_path": "mq6L8CnNJXc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1184.77,
"view_count": 1898940
},
{
"video_id": "mq6L8CnNJXc",
"question": "In the scene, there are two characters made to look like maps with eyes. The one on the left is holding a cup, and the one on the right is holding a fish. What kind of cup is the one in the scene?",
"question_wo_referring_query": "What kind of cup is the one in the scene?",
"candidates": [
"A glass tea cup containing red tea",
"A ceramic cup containing red tea",
"A ceramic cup containing green tea",
"A glass tea cup containing coffee"
],
"correct_choice": 0,
"position": [
24925
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2A",
"level": "L1-Perception",
"id": "mq6L8CnNJXc_2",
"video_path": "mq6L8CnNJXc.mp4",
"subtitle_path": "mq6L8CnNJXc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1184.77,
"view_count": 1898940
},
{
"video_id": "ZRMbh0wSly0",
"question": "On a grey concrete ground, a group of girls dressed in the same white tops and beige floral skirts hold hands to form a circle. When mentioning the 'collodion anole rain making ritual,' what kind of ornaments are on the girls' skirts?",
"question_wo_referring_query": "What kind of ornaments are on the girls' skirts?",
"candidates": [
"Waist belts adorned with colorful flowers",
"Small skirts adorned with colorful flowers",
"Green leaf-shaped small skirts",
"Small animal-shaped ornaments made of folded green leaves"
],
"correct_choice": 2,
"position": [
18926
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "ZRMbh0wSly0_0",
"video_path": "ZRMbh0wSly0.mp4",
"subtitle_path": "ZRMbh0wSly0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1177.47,
"view_count": 2691164
},
{
"video_id": "ZRMbh0wSly0",
"question": "Looking at the screen full of people, how many people are holding a blue and yellow interwoven banner? In the middle of the screen, a woman in a black leather jacket is holding a little girl with a blue headscarf with star and moon designs. When mentioning 'on to their culture language and claim,' what kind of glasses is this woman wearing who is dressed in leather and wearing a black hat?",
"question_wo_referring_query": "What kind of glasses is this woman wearing who is dressed in leather and wearing a black hat?",
"candidates": [
"Wearing glasses with black frames and purple lenses",
"Wearing glasses with purple frames and black lenses",
"Wearing red-framed sunglasses",
"Wearing black-framed sunglasses"
],
"correct_choice": 0,
"position": [
16808
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "ZRMbh0wSly0_1",
"video_path": "ZRMbh0wSly0.mp4",
"subtitle_path": "ZRMbh0wSly0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1177.47,
"view_count": 2691164
},
{
"video_id": "ZRMbh0wSly0",
"question": "In a white room with a red wooden chair, there are four people wearing identical white tops with black and red floral patterns. When the phrase 'notice they all kind of have like a' is mentioned, what kind of accessory is the woman on the far left of the screen wearing on her head?",
"question_wo_referring_query": "What kind of accessory is the woman on the far left of the screen wearing on her head?",
"candidates": [
"Wearing a red scarf",
"Wearing a black scarf",
"Wearing a red hat",
"Wearing a black hat"
],
"correct_choice": 0,
"position": [
20125
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "ZRMbh0wSly0_2",
"video_path": "ZRMbh0wSly0.mp4",
"subtitle_path": "ZRMbh0wSly0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1177.47,
"view_count": 2691164
},
{
"video_id": "F2OhCCEIOcU",
"question": "In the video, the man wearing red short sleeves and sunglasses is holding a phone in his right hand, and sitting outside with a few green plants in the background. In which other scene does this man appear?",
"question_wo_referring_query": "In which other scene does this man appear?",
"candidates": [
"By the seaside",
"Inside a room with a TV in the background",
"Inside a milk tea shop",
"Inside a fried chicken shop"
],
"correct_choice": 1,
"position": [
1749,
2635
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "F2OhCCEIOcU_0",
"video_path": "F2OhCCEIOcU.mp4",
"subtitle_path": "F2OhCCEIOcU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1159.37,
"view_count": 3101422
},
{
"video_id": "F2OhCCEIOcU",
"question": "In the video, many people are sitting on the grass. On the screen, there is a man with blonde hair wearing a necklace and an overcoat with a white inner layer, sitting cross-legged on the ground. In which other scene does the man wearing the white inner layer appear in the video?",
"question_wo_referring_query": "In which other scene does the man wearing the white inner layer appear in the video?",
"candidates": [
"Inside a room",
"At the seaside",
"Inside a cargo truck",
"In a milk tea shop"
],
"correct_choice": 0,
"position": [
16962,
14472
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "F2OhCCEIOcU_1",
"video_path": "F2OhCCEIOcU.mp4",
"subtitle_path": "F2OhCCEIOcU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1159.37,
"view_count": 3101422
},
{
"video_id": "F2OhCCEIOcU",
"question": "On the left side of the screen is a man wearing a white long-sleeved shirt, and on the right side is a man wearing a blue and white checkered shirt and a watch. They are in a room. In which other scene does the man wearing the white long-sleeved shirt appear?",
"question_wo_referring_query": "In which other scene does the man wearing the white long-sleeved shirt appear?",
"candidates": [
"In front of a car",
"By the sea",
"In a hamburger store",
"Inside a truck"
],
"correct_choice": 0,
"position": [
4370,
3869
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "F2OhCCEIOcU_2",
"video_path": "F2OhCCEIOcU.mp4",
"subtitle_path": "F2OhCCEIOcU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1159.37,
"view_count": 3101422
},
{
"video_id": "1D9TgBrW6Sw",
"question": "In the video, the woman wearing white on the left side of the screen is lighting a match, while the woman in red on the right side is holding a cigarette and has a hair clip on her head. During which subtitle does the woman holding the cigarette appear?",
"question_wo_referring_query": "During which subtitle does the woman holding the cigarette appear in the video?",
"candidates": [
"His teeth were clenched so tightly",
"I'm OK.",
"In fact, Flash forward to March of 1971",
"(SINGING) Don't settle for some of the taste some of the time"
],
"correct_choice": 3,
"position": [
13154,
12332
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "1D9TgBrW6Sw_0",
"video_path": "1D9TgBrW6Sw.mp4",
"subtitle_path": "1D9TgBrW6Sw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1517.88,
"view_count": 707714
},
{
"video_id": "1D9TgBrW6Sw",
"question": "In the middle of the screen, there is a man wearing black clothes and a black hat talking to a man wearing a gray coat. On the left side, there is a man wearing a windbreaker. In the video, which subtitle appears along with the man wearing a black hat?",
"question_wo_referring_query": "In the video, which subtitle appears along with the man wearing a black hat?",
"candidates": [
"(SINGING) Don't settle for some of the taste some of the time",
"of the most influential bands of all time",
"heart disease, and birth defects including",
"I know every inch of the 707."
],
"correct_choice": 3,
"position": [
20581,
20331
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "1D9TgBrW6Sw_1",
"video_path": "1D9TgBrW6Sw.mp4",
"subtitle_path": "1D9TgBrW6Sw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1517.88,
"view_count": 707714
},
{
"video_id": "1D9TgBrW6Sw",
"question": "In the scene where a man in a red and white striped long-sleeve shirt is on the left side of the room and a man in a blue long-sleeve shirt is on the right side, which subtitle does the man in the red and white striped shirt appear with?",
"question_wo_referring_query": "Which subtitle does the man in the red and white striped shirt appear with in the video?",
"candidates": [
"heart disease, and birth defects including",
"I know every inch of the 707.",
"I'm OK.",
"You won't be sorry."
],
"correct_choice": 3,
"position": [
29771,
29922
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "1D9TgBrW6Sw_2",
"video_path": "1D9TgBrW6Sw.mp4",
"subtitle_path": "1D9TgBrW6Sw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1517.88,
"view_count": 707714
},
{
"video_id": "ZsnfXfuGRrg",
"question": "What change occurs to the book held by a bald man with a mustache, wearing a gray shirt, which has a black cover with the English word 'STUKA' written on it, when the screen shows the text 'Early testprint of the German text' in white letters at the bottom?",
"question_wo_referring_query": "What change occurs to the book?",
"candidates": [
"The pages of the book turn black.",
"The book is opened.",
"A cartoon image appears on the book.",
"The pages of the book turn yellow.",
"The book is torn."
],
"correct_choice": 1,
"position": [
13461,
13671
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "ZsnfXfuGRrg_0",
"video_path": "ZsnfXfuGRrg.mp4",
"subtitle_path": "ZsnfXfuGRrg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1059.7,
"view_count": 158317
},
{
"video_id": "ZsnfXfuGRrg",
"question": "In the beginning of the video, a tank with green and multi-colored camouflage appears on a red screen. When it appears below a white and black rectangle with English text at the top of this red screen, what change does this tank undergo?",
"question_wo_referring_query": "What change does this tank undergo?",
"candidates": [
"Stands up",
"Becomes smaller",
"Becomes larger",
"Becomes black",
"Flips over"
],
"correct_choice": 1,
"position": [
364,
24829
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "ZsnfXfuGRrg_1",
"video_path": "ZsnfXfuGRrg.mp4",
"subtitle_path": "ZsnfXfuGRrg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1059.7,
"view_count": 158317
},
{
"video_id": "ZsnfXfuGRrg",
"question": "In the scene, a white tank appears in the lower left corner of a red screen with 'Production' written in white at the top. What change occurs to this white tank when it appears in the red screen with many small white figures and white guns at the top?",
"question_wo_referring_query": "What change occurs to the white tank?",
"candidates": [
"turns black",
"turns blue",
"turns red",
"gets smaller",
"gets bigger"
],
"correct_choice": 3,
"position": [
6747,
9723
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "ZsnfXfuGRrg_2",
"video_path": "ZsnfXfuGRrg.mp4",
"subtitle_path": "ZsnfXfuGRrg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1059.7,
"view_count": 158317
},
{
"video_id": "5tN9hyfdkaE",
"question": "Walking on a street filled with red lanterns, a black-haired man wearing a black coat is speaking to the camera. What hairstyle did he have when the subtitle reads 'celebrated Chinese culture but really to'?",
"question_wo_referring_query": "What hairstyle did he have?",
"candidates": [
"Long curly hair",
"Shoulder-length hair",
"Bald",
"Curly hair",
"Crew cut"
],
"correct_choice": 3,
"position": [
11173
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "5tN9hyfdkaE_0",
"video_path": "5tN9hyfdkaE.mp4",
"subtitle_path": "5tN9hyfdkaE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1489.96,
"view_count": 18917
},
{
"video_id": "5tN9hyfdkaE",
"question": "What color pants is the woman, who is walking away from the camera on a street decorated with red lanterns, wearing when she says in the subtitles 'me to try the viral TikTok Foods in' while wearing a brown top and with long hair?",
"question_wo_referring_query": "What color pants is she wearing?",
"candidates": [
"Gray",
"Black",
"Blue",
"White",
"Olive"
],
"correct_choice": 2,
"position": [
19774
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "5tN9hyfdkaE_1",
"video_path": "5tN9hyfdkaE.mp4",
"subtitle_path": "5tN9hyfdkaE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1489.96,
"view_count": 18917
},
{
"video_id": "5tN9hyfdkaE",
"question": "In a kitchen, there is a white plate with ingredients on the table. A person with an apron tied around their waist is cutting vegetables on a brown wooden board with a knife. What color top is the person wearing when the subtitle says 'Cuisines that extend well beyond Asia'?",
"question_wo_referring_query": "What color top is the person wearing?",
"candidates": [
"red",
"white",
"blue",
"black",
"purple"
],
"correct_choice": 3,
"position": [
30422
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "5tN9hyfdkaE_2",
"video_path": "5tN9hyfdkaE.mp4",
"subtitle_path": "5tN9hyfdkaE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1489.96,
"view_count": 18917
},
{
"video_id": "a_hkRXc_bYg",
"question": "During a broadcast, there is a man wearing a black suit and a white shirt. His eyes and mouth are tightly closed. After the subtitle 'here that want to do that my colleague' appears, what does the man do immediately?",
"question_wo_referring_query": "What does the man do immediately after?",
"candidates": [
"He only opens his mouth",
"He opens his eyes and mouth",
"He covers his face with his hand",
"He only opens his eyes",
"He covers his lower jaw with his hand"
],
"correct_choice": 1,
"position": [
18071,
18097
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "a_hkRXc_bYg_0",
"video_path": "a_hkRXc_bYg.mp4",
"subtitle_path": "a_hkRXc_bYg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2995.8,
"view_count": 179989
},
{
"video_id": "a_hkRXc_bYg",
"question": "In a broadcast room, there is a man with short blond hair wearing a black suit and a white shirt. What action does this man take immediately after the subtitle mentions 'something where conservatives can say'?",
"question_wo_referring_query": "In a broadcast room, there is a man with short blond hair wearing a black suit and a white shirt. What action does this man take immediately after the subtitle mentions 'something where conservatives can say'?",
"candidates": [
"He raises both hands with palms facing inward",
"He only raises his left hand",
"He places his hand over his chest",
"He only raises his right hand",
"He raises both hands with palms facing outward"
],
"correct_choice": 4,
"position": [
32092,
32102
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "a_hkRXc_bYg_1",
"video_path": "a_hkRXc_bYg.mp4",
"subtitle_path": "a_hkRXc_bYg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2995.8,
"view_count": 179989
},
{
"video_id": "a_hkRXc_bYg",
"question": "On the right side of the screen is a female host wearing red clothes, and on the left side of the screen, a drone is taking off. What happens in the video after the subtitle mentions 'that may be why in recent weeks there'?",
"question_wo_referring_query": "What happens in the video?",
"candidates": [
"A group of drones drop bombs",
"A black-skinned woman gives a speech on stage",
"A blonde man raises his hands with palms facing outwards",
"A drone gets shot down",
"A drone drops a bomb"
],
"correct_choice": 4,
"position": [
55930,
55960
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "a_hkRXc_bYg_2",
"video_path": "a_hkRXc_bYg.mp4",
"subtitle_path": "a_hkRXc_bYg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2995.8,
"view_count": 179989
},
{
"video_id": "td35F9LNazA",
"question": "Inside a vehicle, there's a man wearing a colorful long-sleeve shirt and sporting short hair and stubble. He's standing near the electrical control module, introducing something with his left hand raised. When this man appears in a room with dim yellow lighting, what change does he undergo?",
"question_wo_referring_query": "What change does he undergo?",
"candidates": [
"He changed into a checkered shirt",
"He changed into a black garment",
"He changed into a black and white patterned garment",
"He changed into a white garment",
"He changed into a gray garment"
],
"correct_choice": 3,
"position": [
3171,
10266
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "td35F9LNazA_0",
"video_path": "td35F9LNazA.mp4",
"subtitle_path": "td35F9LNazA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 984.33,
"view_count": 242038
},
{
"video_id": "td35F9LNazA",
"question": "On the surface of the sink are a toothbrush and toothpaste. Reflected in the glass above the sink is a man wearing a white hat and a white short-sleeved shirt, holding a camera. What changes when this man appears in the driver's seat?",
"question_wo_referring_query": "What changes when this man appears in the driver's seat?",
"candidates": [
"He changed to a white long-sleeved coat",
"He changed to a gray long-sleeved coat",
"He changed to a black hat",
"He changed to a black short-sleeved coat",
"He changed to a black long-sleeved coat"
],
"correct_choice": 4,
"position": [
11642,
14728
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "td35F9LNazA_1",
"video_path": "td35F9LNazA.mp4",
"subtitle_path": "td35F9LNazA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 984.33,
"view_count": 242038
},
{
"video_id": "td35F9LNazA",
"question": "A man holding a laptop stands in front of a glass window. He is wearing a black short-sleeved shirt, has short black hair, and is wearing jeans. What changes occur to him when a woman wearing a grey long-sleeved shirt appears next to him?",
"question_wo_referring_query": "What changes occur to him?",
"candidates": [
"He changed into a grey short-sleeved shirt",
"He changed into a grey and white long-sleeved shirt",
"He changed into a grey and white short-sleeved shirt",
"He changed into a green long-sleeved shirt",
"He changed into a white short-sleeved shirt"
],
"correct_choice": 1,
"position": [
15038,
19053
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "td35F9LNazA_2",
"video_path": "td35F9LNazA.mp4",
"subtitle_path": "td35F9LNazA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 984.33,
"view_count": 242038
},
{
"video_id": "ZGMGQsnSdLE",
"question": "This is a composite image where red magma is pressing against a green crust. Flames are erupting from the gaps in the crust. When the magma and the subtitle 'magma before the eruption seismic' appear together, what changes occur in the magma?",
"question_wo_referring_query": ", what changes occur in the magma?",
"candidates": [
"The magma flowed on the surface",
"Some rocks appeared in the magma",
"The color of the magma turned green",
"The magma changed from liquid to solid",
"The orange-red color on the top of the magma disappeared"
],
"correct_choice": 2,
"position": [
3853,
4013
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "ZGMGQsnSdLE_0",
"video_path": "ZGMGQsnSdLE.mp4",
"subtitle_path": "ZGMGQsnSdLE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2376.54,
"view_count": 17624
},
{
"video_id": "ZGMGQsnSdLE",
"question": "Amidst the thick black smoke, a burst of yellow flames is erupting. When these flames appear together with the subtitles 'forth basaltic magma from the mantle in', what change occurs to the flames?",
"question_wo_referring_query": "What change occurs to the flames?",
"candidates": [
"It extinguishes.",
"Its color changes to blue.",
"Its color changes to orange.",
"Its color changes to red.",
"Its color changes to purple."
],
"correct_choice": 1,
"position": [
23503,
23557
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "ZGMGQsnSdLE_1",
"video_path": "ZGMGQsnSdLE.mp4",
"subtitle_path": "ZGMGQsnSdLE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2376.54,
"view_count": 17624
},
{
"video_id": "ZGMGQsnSdLE",
"question": "This is a model of the Earth. The blue ocean is surrounded by green and white land. When this image appeared together with the subtitle 'virgent plate boundaries on our planet,' what changes occurred?",
"question_wo_referring_query": "What changes occurred?",
"candidates": [
"The surface of the image has additional red lines, arrows, and black boxes with white text inside the black boxes",
"The white color on the image disappeared",
"The blue color on the image disappeared",
"The green color on the image disappeared",
"The surface of the image only has additional red lines"
],
"correct_choice": 0,
"position": [
40764,
41027
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TAA",
"level": "L2-Relation",
"id": "ZGMGQsnSdLE_2",
"video_path": "ZGMGQsnSdLE.mp4",
"subtitle_path": "ZGMGQsnSdLE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2376.54,
"view_count": 17624
},
{
"video_id": "mTn_C-SyW84",
"question": "In a room, a woman wearing a white coat is facing the mirror and picking up a bouquet of flowers. One of her braids is draped between her eyes. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"a necklace",
"a watch",
"a ring",
"a hair clip",
"a door"
],
"correct_choice": 4,
"position": [
11130
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mTn_C-SyW84_0",
"video_path": "mTn_C-SyW84.mp4",
"subtitle_path": "mTn_C-SyW84_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 917.71,
"view_count": 413657
},
{
"video_id": "mTn_C-SyW84",
"question": "On a wooden surface, a pair of hands is embroidering flowers on a cloth. There are already some yellow and white flowers embroidered on this cloth. What objects exist in this scene?",
"question_wo_referring_query": "What objects exist in this scene?",
"candidates": [
"A flower basket",
"A knife",
"A wooden stick",
"A cup",
"A needle"
],
"correct_choice": 4,
"position": [
14095
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mTn_C-SyW84_1",
"video_path": "mTn_C-SyW84.mp4",
"subtitle_path": "mTn_C-SyW84_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 917.71,
"view_count": 413657
},
{
"video_id": "mTn_C-SyW84",
"question": "In a room with wooden walls, a woman in a grey strapless long dress stands in front of a white door, looking down and arranging her clothes. What is the object present in this scene?",
"question_wo_referring_query": "What is the object present in this scene?",
"candidates": [
"Bracelet",
"Earrings",
"Book",
"Necklace",
"A broom"
],
"correct_choice": 2,
"position": [
18668
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mTn_C-SyW84_2",
"video_path": "mTn_C-SyW84.mp4",
"subtitle_path": "mTn_C-SyW84_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 917.71,
"view_count": 413657
},
{
"video_id": "GdFMKGNFXaE",
"question": "In the small inset at the bottom left of the screen, a male presenter dressed in a gray suit and blue shirt is explaining. In the video behind the inset, next to a pile of burning bear dolls with golden flames, there is a man with his back facing the inset, covered in white cloth. What happened after this man, who is facing away, says in the subtitles 'comfort although it's kind of unclear'?",
"question_wo_referring_query": "What happened?",
"candidates": [
"Looked for something near the fire pile",
"Took off his shoes and walked away",
"Took off his clothes and threw them into the fire",
"Used a fire extinguisher to put out the fire",
"Threw an object in his hand into the fire"
],
"correct_choice": 4,
"position": [
18403,
18408
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "GdFMKGNFXaE_0",
"video_path": "GdFMKGNFXaE.mp4",
"subtitle_path": "GdFMKGNFXaE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2774.88,
"view_count": 26716
},
{
"video_id": "GdFMKGNFXaE",
"question": "In the conference room at the displayed time of 11:13, a bald man wearing glasses is standing in front of a blue display board, looking down at a book with a blue cover on the table. Before the subtitle says 'limited stocks in each EU country so,' what did this man do?",
"question_wo_referring_query": "What did this man do?",
"candidates": [
"Placed the book on his shoulder",
"Picked up the book with a blue cover from the table",
"Took the book away from the scene",
"Put the book on the bookshelf",
"Sat down to read the book in his hand"
],
"correct_choice": 1,
"position": [
48230,
48202
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "GdFMKGNFXaE_1",
"video_path": "GdFMKGNFXaE.mp4",
"subtitle_path": "GdFMKGNFXaE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2774.88,
"view_count": 26716
},
{
"video_id": "GdFMKGNFXaE",
"question": "Sitting in front of the intercom, after a woman wearing a yellow sweater and glasses says in the subtitles 'from the announcements they're going to,' what action does she take?",
"question_wo_referring_query": "What action does this woman take?",
"candidates": [
"Touched her nose",
"Touched her collar",
"Touched her lips",
"Touched her earlobe",
"Touched the intercom"
],
"correct_choice": 3,
"position": [
63829,
63855
],
"topic_category": "NP-News-Programs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "GdFMKGNFXaE_2",
"video_path": "GdFMKGNFXaE.mp4",
"subtitle_path": "GdFMKGNFXaE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2774.88,
"view_count": 26716
},
{
"video_id": "rP7sQe784k8",
"question": "There is a gray pot above the burning logs, and a piece of yellow butter is thrown into the pot. What changes occur to the butter when the subtitle 'Butter' appears?",
"question_wo_referring_query": "What changes occur?",
"candidates": [
"Turned black",
"Turned blue",
"Turned red",
"Turned white",
"Melted into a liquid"
],
"correct_choice": 4,
"position": [
26201,
26401
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "rP7sQe784k8_0",
"video_path": "rP7sQe784k8.mp4",
"subtitle_path": "rP7sQe784k8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1200.12,
"view_count": 4679209
},
{
"video_id": "rP7sQe784k8",
"question": "On a rectangular table with food ingredients, there are some green onions on the wooden tray on the far left side. What change occurred to the green onions when the subtitle says 'Onion'?",
"question_wo_referring_query": "What change occurred to the green onions?",
"candidates": [
"The green onions were cut into chunks",
"The green onions were put into a bottle",
"The green onions were placed on a rack",
"The green onions were put into a pot",
"The green onions were cut into pieces"
],
"correct_choice": 0,
"position": [
15269,
15645
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "rP7sQe784k8_1",
"video_path": "rP7sQe784k8.mp4",
"subtitle_path": "rP7sQe784k8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1200.12,
"view_count": 4679209
},
{
"video_id": "rP7sQe784k8",
"question": "A man is holding a spatula and preparing to pour uncooked rice from a bowl into a pot on a stove, and in the subtitles, it says 'while serving on a plate, place the pilav first and then the sauce over the pilav.' What change occurred to the rice in the pot?",
"question_wo_referring_query": "What change occurred to the rice in the pot?",
"candidates": [
"The rice turned purple",
"The rice is already cooked",
"The rice turned red",
"The rice turned black",
"The rice turned yellow"
],
"correct_choice": 1,
"position": [
26704,
27674
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "rP7sQe784k8_2",
"video_path": "rP7sQe784k8.mp4",
"subtitle_path": "rP7sQe784k8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1200.12,
"view_count": 4679209
},
{
"video_id": "KTY9bogonyw",
"question": "In the scene where red ribbons are floating in the air, what is the woman with black hair, dressed in a white coat and wearing a watch, doing?",
"question_wo_referring_query": "What is she doing?",
"candidates": [
"Putting her hand on her temple",
"Raising both hands",
"Facing away from the mirror",
"Raising one hand",
"Covering her nose with her hand"
],
"correct_choice": 1,
"position": [
18143
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "KTY9bogonyw_0",
"video_path": "KTY9bogonyw.mp4",
"subtitle_path": "KTY9bogonyw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2617.55,
"view_count": 4182
},
{
"video_id": "KTY9bogonyw",
"question": "In the middle of the screen, there is a white-haired man dressed in a black suit with a dark blue tie, sitting in front of a white flag with a blue Star of David. What is this white-haired man doing?",
"question_wo_referring_query": "What is this white-haired man doing?",
"candidates": [
"Drinking water",
"Sleeping",
"Writing",
"Speaking into a microphone",
"Brewing tea"
],
"correct_choice": 3,
"position": [
22384
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "KTY9bogonyw_1",
"video_path": "KTY9bogonyw.mp4",
"subtitle_path": "KTY9bogonyw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2617.55,
"view_count": 4182
},
{
"video_id": "KTY9bogonyw",
"question": "When the video switches to a screen showing a broadcast room with the number 151.86 at the bottom, there is a woman with braided hair wearing black pants sitting on the left side of the broadcast room. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Reading a book",
"Touching her ear",
"Stretching her waist",
"Crossing her legs",
"Standing up to move around"
],
"correct_choice": 3,
"position": [
26639
],
"topic_category": "NP-News-Programs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "KTY9bogonyw_2",
"video_path": "KTY9bogonyw.mp4",
"subtitle_path": "KTY9bogonyw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2617.55,
"view_count": 4182
},
{
"video_id": "f0IbZGfTgUM",
"question": "In the top-left corner of a black background with white English text 'Explosive Reactive Armor', what happens on the screen after a gray circle with a white Star of David appears on the left side of the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"A line of white English text appears at the bottom of the black screen",
"A line of green English text appears in the middle of the black screen",
"A line of white English text appears in the middle of the black screen",
"A line of red English text appears in the top right corner of the black screen",
"A line of white Chinese text appears in the middle of the black screen"
],
"correct_choice": 2,
"position": [
11462,
11510
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "f0IbZGfTgUM_0",
"video_path": "f0IbZGfTgUM.mp4",
"subtitle_path": "f0IbZGfTgUM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1117.4,
"view_count": 585632
},
{
"video_id": "f0IbZGfTgUM",
"question": "After a white milk bottle appears inside a gray circle on the far right side of the black screen, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"Both images on the screen move downward simultaneously",
"Both images on the screen move to the right simultaneously",
"Both images on the screen move upward simultaneously",
"Both images on the screen change to red",
"Both images on the screen move to the left simultaneously"
],
"correct_choice": 2,
"position": [
16266,
16344
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "f0IbZGfTgUM_1",
"video_path": "f0IbZGfTgUM.mp4",
"subtitle_path": "f0IbZGfTgUM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1117.4,
"view_count": 585632
},
{
"video_id": "f0IbZGfTgUM",
"question": "On the black screen in the top left corner, with white text saying 'Surface Design', a white tank appears below a white skull graphic. What change occurs on the screen after that?",
"question_wo_referring_query": "What change occurs on the screen?",
"candidates": [
"A blue arrow points to the white tank",
"A red arrow points to the white tank",
"A red arrow points to the skull",
"A purple arrow points to the white tank",
"A blue arrow points to the white tank"
],
"correct_choice": 1,
"position": [
18365,
18381
],
"topic_category": "KH-Knowledge-History",
"question_category": "E3E",
"level": "L2-Relation",
"id": "f0IbZGfTgUM_2",
"video_path": "f0IbZGfTgUM.mp4",
"subtitle_path": "f0IbZGfTgUM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1117.4,
"view_count": 585632
},
{
"video_id": "z6THwql5c6w",
"question": "In the black and white screen, two people wearing black suits are shaking hands. After the subtitles mention 'flex s overt there is econovation 15 seconds as well as elite he', what appears on the screen?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"Red prohibition sign",
"Blue prohibition sign",
"No right turn sign",
"Two-way traffic sign",
"No left turn sign"
],
"correct_choice": 0,
"position": [
16314,
16355
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "z6THwql5c6w_0",
"video_path": "z6THwql5c6w.mp4",
"subtitle_path": "z6THwql5c6w_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.64,
"view_count": 39935
},
{
"video_id": "z6THwql5c6w",
"question": "In a black and white scene, a group of people are charging up a hill with guns. One person is holding a flag with a five-starred red emblem. After the subtitle 'spot where t e Ah,this qv who konws it', what appears on the screen?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"French Flag",
"American Flag",
"British Flag",
"German Flag",
"Chinese Flag"
],
"correct_choice": 1,
"position": [
6802,
6827
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "z6THwql5c6w_1",
"video_path": "z6THwql5c6w.mp4",
"subtitle_path": "z6THwql5c6w_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.64,
"view_count": 39935
},
{
"video_id": "z6THwql5c6w",
"question": "At the top of the screen, there are three flags, with the American flag in the middle. A person in a black suit is speaking in front of a microphone in the center. After the captions mention 'punching domo in toner mau 9 term,' what appears on the screen?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"Stars",
"Sun",
"Moon",
"A piece of white paper",
"Satellite map"
],
"correct_choice": 2,
"position": [
11785,
11827
],
"topic_category": "KH-Knowledge-History",
"question_category": "T3O",
"level": "L2-Relation",
"id": "z6THwql5c6w_2",
"video_path": "z6THwql5c6w.mp4",
"subtitle_path": "z6THwql5c6w_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 910.64,
"view_count": 39935
},
{
"video_id": "GuEptwLiAvs",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"A man wearing a blue-gray coat with white hair sits in a car with a seat belt fastened; a man wearing a black long-sleeve shirt holding an orange pumpkin stands in the kitchen; a man wearing a black coat with gray inner wear sits in a car with a seat belt fastened.",
"A man wearing a black coat with gray inner wear sits in a car with a seat belt fastened; a man wearing a black long-sleeve shirt holding an orange pumpkin stands in the kitchen; a man wearing a blue-gray coat with white hair sits in a car with a seat belt fastened.",
"A man wearing a black coat with gray inner wear sits in a car with a seat belt fastened; a man wearing a blue-gray coat with white hair sits in a car with a seat belt fastened; a man wearing a black long-sleeve shirt holding an orange pumpkin stands in the kitchen.",
"A man wearing a black long-sleeve shirt holding an orange pumpkin stands in the kitchen; a man wearing a blue-gray coat with white hair sits in a car with a seat belt fastened; a man wearing a black coat with gray inner wear sits in a car with a seat belt fastened.",
"A man wearing a black long-sleeve shirt holding an orange pumpkin stands in the kitchen; a man wearing a black coat with gray inner wear sits in a car with a seat belt fastened; a man wearing a blue-gray coat with white hair sits in a car with a seat belt fastened."
],
"correct_choice": 3,
"position": [
1174,
8960,
18425
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SSS",
"level": "L2-Relation",
"id": "GuEptwLiAvs_0",
"video_path": "GuEptwLiAvs.mp4",
"subtitle_path": "GuEptwLiAvs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 927.64,
"view_count": 438241
},
{
"video_id": "GuEptwLiAvs",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"A man wearing an orange hoodie with a red and white logo on the upper right corner sits in front of the bookshelf; a man wearing a black long-sleeved shirt holding an orange pumpkin stands in the kitchen; a man wearing gray sportswear with black earbuds is running outside.",
"A man wearing a black long-sleeved shirt holding an orange pumpkin stands in the kitchen; a man wearing gray sportswear with black earbuds is running outside; a man wearing an orange hoodie with a red and white logo on the upper right corner sits in front of the bookshelf.",
"A man wearing an orange hoodie with a red and white logo on the upper right corner sits in front of the bookshelf; a man wearing gray sportswear with black earbuds is running outside; a man wearing a black long-sleeved shirt holding an orange pumpkin stands in the kitchen.",
"A man wearing a black long-sleeved shirt holding an orange pumpkin stands in the kitchen; a man wearing an orange hoodie with a red and white logo on the upper right corner sits in front of the bookshelf; a man wearing gray sportswear with black earbuds is running outside.",
"A man wearing gray sportswear with black earbuds is running outside; a man wearing an orange hoodie with a red and white logo on the upper right corner sits in front of the bookshelf; a man wearing a black long-sleeved shirt holding an orange pumpkin stands in the kitchen."
],
"correct_choice": 3,
"position": [
1182,
3008,
4396
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SSS",
"level": "L2-Relation",
"id": "GuEptwLiAvs_1",
"video_path": "GuEptwLiAvs.mp4",
"subtitle_path": "GuEptwLiAvs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 927.64,
"view_count": 438241
},
{
"video_id": "GuEptwLiAvs",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"A man wearing an orange hooded sweatshirt with a red and white logo design on the top right corner sitting in front of a bookshelf, a man wearing a blue-gray jacket with white hair sitting in a car while buckled up with a seatbelt, a man wearing gray sportswear and black earphones running outdoors.",
"A man wearing an orange hooded sweatshirt with a red and white logo design on the top right corner sitting in front of a bookshelf, a man wearing gray sportswear and black earphones running outdoors, a man wearing a blue-gray jacket with white hair sitting in a car while buckled up with a seatbelt.",
"A man wearing gray sportswear and black earphones running outdoors, a man wearing a blue-gray jacket with white hair sitting in a car while buckled up with a seatbelt, a man wearing an orange hooded sweatshirt with a red and white logo design on the top right corner sitting in front of a bookshelf.",
"A man wearing a blue-gray jacket with white hair sitting in a car while buckled up with a seatbelt, a man wearing gray sportswear and black earphones running outdoors, a man wearing an orange hooded sweatshirt with a red and white logo design on the top right corner sitting in front of a bookshelf.",
"A man wearing gray sportswear and black earphones running outdoors, a man wearing an orange hooded sweatshirt with a red and white logo design on the top right corner sitting in front of a bookshelf, a man wearing a blue-gray jacket with white hair sitting in a car while buckled up with a seatbelt."
],
"correct_choice": 1,
"position": [
3053,
4351,
5738
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SSS",
"level": "L2-Relation",
"id": "GuEptwLiAvs_2",
"video_path": "GuEptwLiAvs.mp4",
"subtitle_path": "GuEptwLiAvs_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 927.64,
"view_count": 438241
},
{
"video_id": "IGmuaY1jB1w",
"question": "In the bottom right corner of the screen, there is a number 36. There are 9 colorful images in the center, and the top right corner has an image of an airplane. When the subtitles say 'find paper to the power,' what other objects are present on the screen?",
"question_wo_referring_query": "What other objects are present on the screen?",
"candidates": [
"watch, radio, car, sunflower",
"watch, radio, car, boat",
"watch, radio, car, elephant",
"watch, radio, car, big tree"
],
"correct_choice": 3,
"position": [
30877
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "IGmuaY1jB1w_0",
"video_path": "IGmuaY1jB1w.mp4",
"subtitle_path": "IGmuaY1jB1w_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1777.48,
"view_count": 330
},
{
"video_id": "IGmuaY1jB1w",
"question": "In the bottom right corner of the screen, the number is 31. A colorful picture appears on the screen, along with three purple rectangles containing black text. When the subtitle says 'have the image and you have the text so', what else is on the screen?",
"question_wo_referring_query": "What else is on the screen?",
"candidates": [
"Sunflower",
"Airplane",
"Car",
"Radio"
],
"correct_choice": 0,
"position": [
26731
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "IGmuaY1jB1w_1",
"video_path": "IGmuaY1jB1w.mp4",
"subtitle_path": "IGmuaY1jB1w_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1777.48,
"view_count": 330
},
{
"video_id": "IGmuaY1jB1w",
"question": "When the subtitles say 'packages after just cating lunch other,' the number 26 appears in the bottom right corner of the screen, and an image with colorful parts is shown. Beside the image, there are two lines of green text, four lines of yellow text, and two lines of red text. What other items are present on the screen?",
"question_wo_referring_query": "What other items are present on the screen?",
"candidates": [
"Sunglasses with earrings",
"Airplane",
"Sunflower",
"Microphone"
],
"correct_choice": 0,
"position": [
23087
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "IGmuaY1jB1w_2",
"video_path": "IGmuaY1jB1w.mp4",
"subtitle_path": "IGmuaY1jB1w_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1777.48,
"view_count": 330
},
{
"video_id": "DVsw1brd_Yc",
"question": "On a white desk, there is an open book, and in front of it stands a book with a black side face silhouette cutout on its cover, which also has bold white text. In which of the following scenes does this book with the cutout appear?",
"question_wo_referring_query": "In which of the following scenes does this book with the cutout appear?",
"candidates": [
"In the hands of a man wearing black clothes",
"In the hands of a man sitting in a subway",
"In a quiet library",
"In the hands of a woman wearing black clothes"
],
"correct_choice": 0,
"position": [
15504,
24681
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "DVsw1brd_Yc_0",
"video_path": "DVsw1brd_Yc.mp4",
"subtitle_path": "DVsw1brd_Yc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1071.81,
"view_count": 59971
},
{
"video_id": "DVsw1brd_Yc",
"question": "In a scene with two display screens in the background, a man in a black coat is holding a blue book with a black cap in his left hand. In which of the following scenarios does this book appear?",
"question_wo_referring_query": "In which of the following scenarios does this book appear?",
"candidates": [
"On a black keyboard",
"On a piano key",
"On a white bookshelf",
"On an olive-colored desk"
],
"correct_choice": 3,
"position": [
17978,
17728
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "DVsw1brd_Yc_1",
"video_path": "DVsw1brd_Yc.mp4",
"subtitle_path": "DVsw1brd_Yc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1071.81,
"view_count": 59971
},
{
"video_id": "DVsw1brd_Yc",
"question": "On a white desk with an open book and a black keyboard, there is an orange-red book with the words 'LUCKY PLANET' on the cover. In which of the following scenes has this book appeared?",
"question_wo_referring_query": "In which of the following scenes has this book appeared?",
"candidates": [
"In the hands of a child wearing a black short-sleeve shirt",
"In the hands of a woman wearing black scrubs",
"In the hands of a man wearing black scrubs",
"In the hands of a man wearing a black short-sleeve shirt"
],
"correct_choice": 2,
"position": [
2567,
24576
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "SOS",
"level": "L2-Relation",
"id": "DVsw1brd_Yc_2",
"video_path": "DVsw1brd_Yc.mp4",
"subtitle_path": "DVsw1brd_Yc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1071.81,
"view_count": 59971
},
{
"video_id": "JLBsG65WoVU",
"question": "In a screen with a white background, there are three headshot photos of people and their information. Below these, there is a headshot photo of one person and their information. On the right side of the screen, there is a male wearing a blue striped shirt. Which object appears on the screen?",
"question_wo_referring_query": "Which object appears on the screen?",
"candidates": [
"A round photo of a man wearing a black suit",
"A rectangular photo of a man wearing a black suit",
"A rectangular photo of a man wearing a white shirt",
"A round photo of a man wearing a black shirt"
],
"correct_choice": 0,
"position": [
3743
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "JLBsG65WoVU_0",
"video_path": "JLBsG65WoVU.mp4",
"subtitle_path": "JLBsG65WoVU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1004.41,
"view_count": 53671
},
{
"video_id": "JLBsG65WoVU",
"question": "There is a row of neatly arranged green fields on the screen, with differently sized houses with gray roofs on top. Which object appears on the screen below?",
"question_wo_referring_query": "Which object appears on the screen below?",
"candidates": [
"green tree",
"yellow tree",
"white clouds",
"yellow sun"
],
"correct_choice": 0,
"position": [
23971
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "JLBsG65WoVU_1",
"video_path": "JLBsG65WoVU.mp4",
"subtitle_path": "JLBsG65WoVU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1004.41,
"view_count": 53671
},
{
"video_id": "JLBsG65WoVU",
"question": "In front of a green wall hung with various objects, a man wearing a hat and a blue plaid shirt is sitting. Which of the following objects has appeared?",
"question_wo_referring_query": "Which of the following objects has appeared?",
"candidates": [
"Black license plate",
"Blue floral wallpaper tiger",
"White world map",
"Purple ship model"
],
"correct_choice": 1,
"position": [
22512
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2O",
"level": "L1-Perception",
"id": "JLBsG65WoVU_2",
"video_path": "JLBsG65WoVU.mp4",
"subtitle_path": "JLBsG65WoVU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1004.41,
"view_count": 53671
},
{
"video_id": "fWNJmZAWRNg",
"question": "In a room where all the furniture is made of solid wood, there is a person sitting on a chair holding an embroidery with blue fabric. Who is this person doing the embroidery?",
"question_wo_referring_query": "Who is this person doing the embroidery?",
"candidates": [
"A woman in a white dress",
"A woman in a blue dress",
"A woman in a black dress",
"A woman in a crimson dress"
],
"correct_choice": 3,
"position": [
4983
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "fWNJmZAWRNg_0",
"video_path": "fWNJmZAWRNg.mp4",
"subtitle_path": "fWNJmZAWRNg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 914.92,
"view_count": 278426
},
{
"video_id": "fWNJmZAWRNg",
"question": "In a room, in front of a wooden table, there is a person wearing a light gray knitted sweater holding embroidery and a needle, doing embroidery. Who is this person doing the embroidery?",
"question_wo_referring_query": "Who is this person doing the embroidery?",
"candidates": [
"The woman with loose brown hair",
"The woman with loose black hair",
"The woman with black braided hair",
"The woman with brown braided hair"
],
"correct_choice": 3,
"position": [
10779
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "fWNJmZAWRNg_1",
"video_path": "fWNJmZAWRNg.mp4",
"subtitle_path": "fWNJmZAWRNg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 914.92,
"view_count": 278426
},
{
"video_id": "fWNJmZAWRNg",
"question": "In front of a wooden platform with three potted plants, there is a person looking down with both legs raised on the table. Who is this person?",
"question_wo_referring_query": "Who is this person?",
"candidates": [
"A woman with brown short hair",
"A woman with brown long hair",
"A woman with black short hair",
"A woman with black long hair"
],
"correct_choice": 1,
"position": [
6486
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "fWNJmZAWRNg_2",
"video_path": "fWNJmZAWRNg.mp4",
"subtitle_path": "fWNJmZAWRNg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 914.92,
"view_count": 278426
},
{
"video_id": "x4UBaEojM6U",
"question": "In a white room with a gray-white curtain, there is a woman wearing a black short-sleeved shirt. What action does she make when she says 'old women so it just goes back to like'?",
"question_wo_referring_query": "What action does she make?",
"candidates": [
"She raises both hands",
"She props up with both arms",
"She props up her right arm and raises her left hand",
"She props up her left arm and raises her right hand"
],
"correct_choice": 3,
"position": [
19798
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "x4UBaEojM6U_0",
"video_path": "x4UBaEojM6U.mp4",
"subtitle_path": "x4UBaEojM6U_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 961.03,
"view_count": 17405
},
{
"video_id": "x4UBaEojM6U",
"question": "In a white room with a greyish-white canopy, there is a woman with a headband, wearing a black short-sleeved shirt. What action does she perform when she mentions 'fast forward to castings for hot Couture'?",
"question_wo_referring_query": "What action does she perform?",
"candidates": [
"She uses her right hand to apply mascara to her right eye.",
"She uses her left hand to apply mascara to her right eye.",
"She uses her left hand to apply mascara to her left eye.",
"She uses her right hand to apply mascara to her left eye."
],
"correct_choice": 2,
"position": [
11910
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "x4UBaEojM6U_1",
"video_path": "x4UBaEojM6U.mp4",
"subtitle_path": "x4UBaEojM6U_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 961.03,
"view_count": 17405
},
{
"video_id": "x4UBaEojM6U",
"question": "In a white room with grey and white curtains, there is a woman wearing a black short-sleeved shirt who is holding something in her left hand. What action does she perform when she mentions 'I had literally no idea'?",
"question_wo_referring_query": "What action does she perform?",
"candidates": [
"She extends her right middle finger and rubs something on her left eyelid",
"She extends her right middle finger and rubs something on her right eyelid",
"She extends her right index finger and rubs something on her left eyelid",
"She extends her right index finger and rubs something on her right eyelid"
],
"correct_choice": 0,
"position": [
17605
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "x4UBaEojM6U_2",
"video_path": "x4UBaEojM6U.mp4",
"subtitle_path": "x4UBaEojM6U_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 961.03,
"view_count": 17405
},
{
"video_id": "x1FkhxMMIcg",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, the female host introduces a teacher. Then, the female host connects with the teacher, who talks about the early development of her project. After that, the teacher recounts her own experiences along with those of a little girl and a student. Finally, a female host speaks.",
"First, a male host speaks. Then, the female host connects with the teacher, who talks about the early development of her project. After that, the teacher recounts her own experiences along with those of a little girl and a student. Finally, a female host speaks.",
"First, the female host interacts with the teacher. Then, the female host introduces the teacher, who talks about the early development of her project. After that, the teacher recounts her own experiences along with those of a little girl and a student. Finally, a male host speaks.",
"First, the female host introduces a teacher. Then, the female host connects with the teacher, who talks about the early development of her project. After that, the teacher recounts her own experiences along with those of a little girl and a student. Finally, a male host speaks.",
"First, a male host introduces a teacher. Then, the female host connects with the teacher, who talks about the early development of her project. After that, the teacher recounts her own experiences along with those of a little girl and a student. Finally, a female host speaks."
],
"correct_choice": 3,
"position": [
1,
601,
2285,
4577
],
"topic_category": "NP-News-Programs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "x1FkhxMMIcg_1",
"video_path": "x1FkhxMMIcg.mp4",
"subtitle_path": "x1FkhxMMIcg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 199.73,
"view_count": 985
},
{
"video_id": "BRiFXVCr1Ak",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First is a scene where a black boy wearing a hooded sweater puts on a hat in front of a white wall, then a scene with a man in a checkered shirt wearing his hat backwards next to a lamp, and finally a scene where a woman with long hair wearing a black short-sleeved shirt stands in front of an olive-colored curtain.",
"First is a scene with a man in a checkered shirt wearing his hat backwards next to a lamp, then a scene where a woman with long hair wearing a black short-sleeved shirt stands in front of an olive-colored curtain, and finally a scene where a black boy wearing a hooded sweater puts on a hat in front of a white wall.",
"First is a scene where a woman with long hair wearing a black short-sleeved shirt stands in front of an olive-colored curtain, then a scene where a black boy wearing a hooded sweater puts on a hat in front of a white wall, and finally a scene with a man in a checkered shirt wearing his hat backwards next to a lamp.",
"First is a scene with a man in a checkered shirt wearing his hat backwards next to a lamp, then a scene where a black boy wearing a hooded sweater puts on a hat in front of a white wall, and finally a scene where a woman with long hair wearing a black short-sleeved shirt stands in front of an olive-colored curtain.",
"First is a scene where a woman with long hair wearing a black short-sleeved shirt stands in front of an olive-colored curtain, then a scene with a man in a checkered shirt wearing his hat backwards next to a lamp, and finally a scene where a black boy wearing a hooded sweater puts on a hat in front of a white wall."
],
"correct_choice": 4,
"position": [
548,
1786,
2227
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "BRiFXVCr1Ak_0",
"video_path": "BRiFXVCr1Ak.mp4",
"subtitle_path": "BRiFXVCr1Ak_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 377.24,
"view_count": 23143
},
{
"video_id": "BRiFXVCr1Ak",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First is a man seated in a gaming chair, with light strips on the wall behind him, wearing headphones. Next is a man in a brown long-sleeve shirt talking to a white microphone, with a blue curtain beside him. Finally, a scene with two women and a man wearing a hat standing above them, with a fan behind the man.",
"First is a scene with two women and a man wearing a hat standing above them, with a fan behind the man. Next is a man seated in a gaming chair, with light strips on the wall behind him, wearing headphones. Finally, a man in a brown long-sleeve shirt talking to a white microphone, with a blue curtain beside him.",
"First is a man in a brown long-sleeve shirt talking to a white microphone, with a blue curtain beside him. Next is a man seated in a gaming chair, with light strips on the wall behind him, wearing headphones. Finally, a scene with two women and a man wearing a hat standing above them, with a fan behind the man.",
"First is a scene with two women and a man wearing a hat standing above them, with a fan behind the man. Next is a man in a brown long-sleeve shirt talking to a white microphone, with a blue curtain beside him. Finally, a man wearing headphones is seated in a gaming chair, with light strips on the wall behind him.",
"First is a man in a brown long-sleeve shirt talking to a white microphone, with a blue curtain beside him. Next is a scene with two women and a man wearing a hat standing above them, with a fan behind the man. Finally, a man seated in a gaming chair, with light strips on the wall behind him, wearing headphones."
],
"correct_choice": 3,
"position": [
6216,
6363,
7982
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "BRiFXVCr1Ak_1",
"video_path": "BRiFXVCr1Ak.mp4",
"subtitle_path": "BRiFXVCr1Ak_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 377.24,
"view_count": 23143
},
{
"video_id": "t1nhAnMQBHg",
"question": "In a room with a piano and a display hanging on the wall, a man with a beard wearing black clothes is explaining a small microphone with his thumb and index finger. After the subtitles appear saying, \u2018So let\u2019s just jump into the flag and stuff, shall we? The flag known as the Hinomaru or \u2018Circle of the sun\u2019,\u2019 what is the image that appears in the center of the green-patterned background?",
"question_wo_referring_query": "What is the image that appears in the center of the green-patterned background?",
"candidates": [
"British flag",
"Japanese flag",
"Thai flag",
"Vietnamese flag",
"Brazilian flag"
],
"correct_choice": 1,
"position": [
1047,
1111
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3O",
"level": "L2-Relation",
"id": "t1nhAnMQBHg_1",
"video_path": "t1nhAnMQBHg.mp4",
"subtitle_path": "t1nhAnMQBHg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 398.13,
"view_count": 251684
},
{
"video_id": "LHXS0QR1ThA",
"question": "Which of the following sequences is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First appears Question1: How to detect key-points in the etc. below step1 etc., then Question1: How to detect key-points in the etc., followed by two red circles where the formula is circled, and finally Question2: Why use gaussian etc.",
"First appears Question1: How to detect key-points in the etc., followed by two red circles where the formula is circled, then Question1: How to detect key-points in the etc., below appears step1 etc., and finally appears Question2: Why use gaussian etc.",
"First appears Question1: How to detect key-points in the etc., followed by two red circles where the formula is circled, then Question2: Why use gaussian etc., and finally Question1: How to detect key-points in the etc. below step1 etc.",
"First appears Question2: Why use gaussian etc., then Question1: How to detect key-points in the etc. below step1 etc., and finally Question1: How to detect key-points in the etc., followed by two red circles where the formula is circled."
],
"correct_choice": 1,
"position": [
4057,
5008,
5236
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SSS",
"level": "L2-Relation",
"id": "LHXS0QR1ThA_1",
"video_path": "LHXS0QR1ThA.mp4",
"subtitle_path": "LHXS0QR1ThA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 931,
"duration": 484.0,
"view_count": 434
},
{
"video_id": "fxCRCMLJ0PU",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following scene sequences is accurate?",
"candidates": [
"First, a character dressed in green military attire with olive green camouflage on the face and a rifle on the back appears in front of a black screen. Then, a green aircraft flies in a sky with light orange and light purple hues. Finally, a little figure wearing an olive helmet holds onto a parachute and floats in the air.",
"First, a little figure wearing an olive helmet holds onto a parachute and floats in the air. Then, a character dressed in red military attire with yellow camouflage on the face and a rifle on the back stands in a military pose. Finally, a green aircraft flies in a sky with light orange and light purple hues.",
"First, a little figure wearing an olive helmet holds onto a parachute and floats in the air. Then, a character dressed in green military attire with olive green camouflage on the face and a rifle on the back stands in a military pose. Finally, a green aircraft flies in a sky with light orange and light purple hues.",
"First, a yellow aircraft flies in a sky with light orange and light purple hues. Then, a character dressed in red military attire with yellow camouflage on the face and a rifle on the back stands in a military pose. Finally, a little figure wearing an olive helmet holds onto a parachute and floats in the air.",
"First, a little figure wearing an olive helmet holds onto a parachute and floats in the air. Then, a character dressed in red military attire with yellow camouflage on the face and a rifle on the back stands in a military pose. Finally, a yellow aircraft flies in a sky with light orange and light purple hues."
],
"correct_choice": 0,
"position": [
50,
254,
379
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "fxCRCMLJ0PU_1",
"video_path": "fxCRCMLJ0PU.mp4",
"subtitle_path": "fxCRCMLJ0PU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 291.21,
"view_count": 2006850
},
{
"video_id": "rIU_BQEuKQ8",
"question": "In the scene, there is a dark blue background with a green map tile on it. The bottom left corner of the tile is highlighted with a red circle, and there is also a white arrow on the left side. Below the arrow are the words 'Winifred' and 'beach' written in white letters. What happens after the statement 'whereas the southwest Winifred beach zone is kinda left alone' is mentioned?",
"question_wo_referring_query": "What happens next?",
"candidates": [
"A man wearing a white T-shirt with black hair appears on the screen",
"A man wearing a green T-shirt with black hair appears on the screen",
"A man wearing a black T-shirt with black hair appears on the screen",
"A man wearing a grey T-shirt with black hair appears on the screen",
"A man wearing a red T-shirt with black hair appears on the screen"
],
"correct_choice": 0,
"position": [
1108,
1118
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T3E",
"level": "L2-Relation",
"id": "rIU_BQEuKQ8_1",
"video_path": "rIU_BQEuKQ8.mp4",
"subtitle_path": "rIU_BQEuKQ8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 290.36,
"view_count": 277624
},
{
"video_id": "PzUxuZ-KGsU",
"question": "On the red wooden table, there is an iron grid rack with a glass bowl containing four rolls of food. In the frame, there is a brush covered with yellow liquid decorating them. Which of the following objects did not appear?",
"question_wo_referring_query": "Which of the following objects did not appear?",
"candidates": [
"Light blue brush",
"Red brush",
"Glass bowl with egg liquid",
"Black iron rack"
],
"correct_choice": 1,
"position": [
4351
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "PzUxuZ-KGsU_1",
"video_path": "PzUxuZ-KGsU.mp4",
"subtitle_path": "PzUxuZ-KGsU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 245.47,
"view_count": 1910
},
{
"video_id": "ZoUsR8t8IxE",
"question": "In front of a red house with a black roof, there is a woman wearing a light blue suit and holding a microphone. There is also a blue and white caption bar at the bottom of the screen. After the phrase 'court so this case is still really at' is mentioned, who appears on the screen?",
"question_wo_referring_query": "Who appears on the screen?",
"candidates": [
"A woman wearing a turquoise coat appears",
"A man wearing a black suit appears",
"A woman wearing headphones and a red coat appears",
"An elderly person wearing headphones and a black-gray coat appears"
],
"correct_choice": 3,
"position": [
8486,
9068
],
"topic_category": "NP-News-Programs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "ZoUsR8t8IxE_1",
"video_path": "ZoUsR8t8IxE.mp4",
"subtitle_path": "ZoUsR8t8IxE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 394.52,
"view_count": 40305
},
{
"video_id": "Oht0i1DACcA",
"question": "At the beginning of the video, a woman with a headband tied around her head, wearing a red top, is carrying a black backpack. When the woman comes down from a hill with tall rocks, what changes occur to her backpack?",
"question_wo_referring_query": "What changes occur to her backpack?",
"candidates": [
"There is a dark red jacket hanging on her black backpack",
"Nothing changed",
"There is a white jacket hanging on her black backpack",
"There is a dark blue jacket hanging on her black backpack"
],
"correct_choice": 3,
"position": [
769,
5241
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "Oht0i1DACcA_1",
"video_path": "Oht0i1DACcA.mp4",
"subtitle_path": "Oht0i1DACcA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 238.91,
"view_count": 4683
},
{
"video_id": "0oALTLKRWBA",
"question": "In the video, a blonde girl appears, standing high up with a rose in her hand, overlooking the city. Then, a shirtless man wearing black shorts stands in front of a woman dressed in a bikini. There also appears a segment with three people on a boat traveling on the sea. What is the order of these scenes?",
"question_wo_referring_query": "In the video, a blonde girl appears, standing high up with a rose in her hand, overlooking the city. Then, a shirtless man wearing black shorts stands in front of a woman dressed in a bikini. There also appears a segment with three people on a boat traveling on the sea. What is the order of these scenes?",
"candidates": [
"First, a blonde girl appears, standing high up with a rose in her hand, overlooking the city. Then, a shirtless man wearing black shorts stands in front of a woman dressed in a bikini, and a segment with three people on a boat traveling on the sea appears.",
"First, a shirtless man wearing black shorts stands in front of a woman dressed in a bikini. Then, a blonde girl appears, standing high up with a rose in her hand, overlooking the city. Finally, a segment with three people on a boat traveling on the sea appears.",
"First, a segment with three people on a boat traveling on the sea appears. Then, a shirtless man wearing black shorts stands in front of a woman dressed in a bikini. Finally, a blonde girl appears, standing high up with a rose in her hand, overlooking the city.",
"First, a blonde girl appears, standing high up with a rose in her hand, overlooking the city. Then, a shirtless man wearing black shorts stands in front of a woman dressed in a bikini. Finally, a segment with three people on a boat traveling on the sea appears."
],
"correct_choice": 3,
"position": [
22,
48,
61
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "0oALTLKRWBA_1",
"video_path": "0oALTLKRWBA.mp4",
"subtitle_path": "0oALTLKRWBA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 407.33,
"view_count": 33009
},
{
"video_id": "pFtKaT3GF9I",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, a picture featuring the upper-left corner with the flag of Myanmar is shown, followed by a scene showing a person holding a piece of paper with numerous circular drawings made from flag designs, and finally a scene of a man holding a bag of colorful candies.",
"First, a picture featuring the upper-left corner with the flag of Myanmar is shown, followed by a scene of a man holding a bag of colorful candies, and finally a scene showing a person holding a piece of paper with numerous circular drawings made from flag designs.",
"First, a scene of a man holding a bag of colorful candies is shown, followed by a scene showing a person holding a piece of paper with numerous circular drawings made from flag designs, and finally a picture featuring the upper-left corner with the flag of Myanmar.",
"First, a scene showing a person holding a piece of paper with numerous circular drawings made from flag designs is shown, followed by a picture featuring the upper-left corner with the flag of Myanmar, and finally a scene of a man holding a bag of colorful candies.",
"First, a scene of a man holding a bag of colorful candies is shown, followed by a picture featuring the upper-left corner with the flag of Myanmar, and finally a scene showing a person holding a piece of paper with numerous circular drawings made from flag designs."
],
"correct_choice": 0,
"position": [
8125,
8637,
10428
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SSS",
"level": "L2-Relation",
"id": "pFtKaT3GF9I_1",
"video_path": "pFtKaT3GF9I.mp4",
"subtitle_path": "pFtKaT3GF9I_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 524.83,
"view_count": 233479
},
{
"video_id": "Efuyl2Anehg",
"question": "A man wearing a gray coat and sporting short black hair is sitting upright on a cream-colored couch against a cream-colored background, eating from a green-white dessert bowl. What subtitle appeared at the same time as this man?",
"question_wo_referring_query": "What subtitle appeared at the same time as this man?",
"candidates": [
"The next morning",
"And it comes with my own Arabic TV.",
"And if you can see, it's written in Arabic.",
"It's a little after 10 and I just made it to Bahrain.",
"You gotta buy tulip bulbs."
],
"correct_choice": 3,
"position": [
459,
1542,
1766,
1886,
2193,
2841
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "Efuyl2Anehg_1",
"video_path": "Efuyl2Anehg.mp4",
"subtitle_path": "Efuyl2Anehg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 511.64,
"view_count": 257904
},
{
"video_id": "0WEnmqVVbHo",
"question": "A woman dressed in a checkered suspender dress with a long-sleeved dark green inner layer is kneeling next to a white drum washing machine, preparing to do laundry. Beside her is a blue laundry bag. What item does she put into the washing machine?",
"question_wo_referring_query": "What item does she put into the washing machine?",
"candidates": [
"laundry powder",
"clothes",
"laundry pods",
"plush toy",
"laundry detergent"
],
"correct_choice": 1,
"position": [
6822
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "0WEnmqVVbHo_1",
"video_path": "0WEnmqVVbHo.mp4",
"subtitle_path": "0WEnmqVVbHo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 479.48,
"view_count": 343405
},
{
"video_id": "q3FAxTSENEw",
"question": "According to the video, which of the following sequences of scenes is correct?",
"question_wo_referring_query": "According to the video, which of the following sequences of scenes is correct?",
"candidates": [
"A person in a white background initially sees the word 'Architecture', followed by a triangle with the letter A, and lastly a circle with the letter G appears.",
"A person in a white background initially sees a triangle with the letter A, followed by a circle with the letter G, and lastly the word 'Architecture' appears.",
"A person in a white background initially sees the word 'Architecture', followed by a circle with the letter G, and lastly a triangle with the letter A appears.",
"A person in a white background initially sees a circle with the letter G, followed by a triangle with the letter A, and lastly the word 'Architecture' appears.",
"A person in a white background initially sees a circle with the letter G, followed by the word 'Architecture', and lastly a triangle with the letter A appears."
],
"correct_choice": 1,
"position": [
1519,
1568,
3064
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SSS",
"level": "L2-Relation",
"id": "q3FAxTSENEw_0",
"video_path": "q3FAxTSENEw.mp4",
"subtitle_path": "q3FAxTSENEw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 432,
"duration": 416.0,
"view_count": 1046
},
{
"video_id": "q3FAxTSENEw",
"question": "Which sequence of scenes appearing in the video is correct?",
"question_wo_referring_query": "Which sequence of scenes from the video is correct?",
"candidates": [
"A person is in a white background, and the remaining screen shows the word 'Q Learning' first, then '1 step', and finally 'Training'",
"A person is in a white background, and the remaining screen shows the word '1 step' first, then 'Q Learning', and finally 'Training'",
"A person is in a white background, and the remaining screen shows the word 'Q Learning' first, then 'Training', and finally '1 step'",
"A person is in a white background, and the remaining screen shows the word 'Training' first, then 'Q Learning', and finally '1 step'",
"A person is in a white background, and the remaining screen shows the word 'Training' first, then '1 step', and finally 'Q Learning'"
],
"correct_choice": 3,
"position": [
7224,
7301,
7567
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SSS",
"level": "L2-Relation",
"id": "q3FAxTSENEw_1",
"video_path": "q3FAxTSENEw.mp4",
"subtitle_path": "q3FAxTSENEw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 432,
"duration": 416.0,
"view_count": 1046
},
{
"video_id": "T57jVsvVVR0",
"question": "On the left side of the screen, there is a large white board with black text saying 'NO EXTRADITION Free Julian Assange'. On the right side, there is a person holding a rectangular board with the text 'PRESS FREEDOM IS MY FREEDOM'. What style of shoes is the person on the right wearing when the subtitles say 'well I think the Australian government'?",
"question_wo_referring_query": "What style of shoes is he wearing?",
"candidates": [
"skate shoes",
"black boots",
"black loafers",
"black sneakers",
"black high heels"
],
"correct_choice": 1,
"position": [
4109
],
"topic_category": "NP-News-Programs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "T57jVsvVVR0_1",
"video_path": "T57jVsvVVR0.mp4",
"subtitle_path": "T57jVsvVVR0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 263.96,
"view_count": 10438
},
{
"video_id": "qyaQ-wfojbM",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, there is a cooking demonstration video, then the lady explains. Next, the lady continues explaining, then a cooking demonstration, the lady explains again, then another cooking demonstration, the lady continues explaining, then another cooking demonstration, the lady explains again, and then the cooking video wraps up.",
"First, the lady with the yellow ribbon explains cooking, then there is a cooking demonstration. Next, the lady continues explaining, then another cooking demonstration, the lady explains again, then a cooking demonstration, the lady continues explaining, then another cooking demonstration, then another cooking demonstration, then the lady explains, and finally the lady concludes.",
"First, there is a cooking demonstration video, then the lady explains. Next, the lady continues explaining, then a cooking demonstration, the lady explains again, then another cooking demonstration, the lady continues explaining, then another cooking demonstration, the lady explains again, then another cooking demonstration, and finally the lady explains the last part of the cooking process.",
"First, there is a cooking demonstration video, then the lady explains. Next, there is another cooking demonstration, followed by yet another cooking demonstration, the lady continues explaining, then another cooking demonstration, the lady explains again, then another cooking demonstration, the lady explains again, and then the cooking video wraps up.",
"First, there is an introduction, followed by a cooking demonstration. Next, the lady continues explaining, then another cooking demonstration, the lady explains again, then a cooking demonstration, the lady continues explaining, then another cooking demonstration, the lady explains again, then another cooking demonstration, and finally the lady explains the last part of the cooking process."
],
"correct_choice": 1,
"position": [
1,
301,
848,
1236,
1640,
1696,
1961,
2419,
2645,
3156,
3686
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SSS",
"level": "L2-Relation",
"id": "qyaQ-wfojbM_0",
"video_path": "qyaQ-wfojbM.mp4",
"subtitle_path": "qyaQ-wfojbM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 198.7,
"view_count": 431998
},
{
"video_id": "qyaQ-wfojbM",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First there is an introduction, then an advertisement, and finally, the woman explains while demonstrating the video",
"First, the woman explains while demonstrating the video, then there is an advertisement, and finally, there is an introduction",
"First there is an introduction, then a woman explains while demonstrating the video, and finally, there is an advertisement",
"First there is an introduction, then an advertisement, and finally, the woman explains while demonstrating the video",
"First, the woman explains while demonstrating the video, then there is the woman's introduction, and finally, there is an advertisement"
],
"correct_choice": 2,
"position": [
1,
107,
4303
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SSS",
"level": "L2-Relation",
"id": "qyaQ-wfojbM_1",
"video_path": "qyaQ-wfojbM.mp4",
"subtitle_path": "qyaQ-wfojbM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 198.7,
"view_count": 431998
},
{
"video_id": "DRIpznER-VQ",
"question": "A man with a blue shirt and curled hair and a woman with a checkered shirt are in a room, both looking down. In front of them is a pile of yellow cardboard-like objects. The man's black-framed glasses are in the pocket of his shirt. In which other scenes does this man appear?",
"question_wo_referring_query": "In which other scenes does this man appear?",
"candidates": [
"A high-floor room in a hotel",
"In front of a shop's glass window",
"An outdoor area full of green plants",
"A room with blurred block patterns in the background and a silver support frame",
"Inside a luxurious living room"
],
"correct_choice": 3,
"position": [
966,
1450
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "DRIpznER-VQ_1",
"video_path": "DRIpznER-VQ.mp4",
"subtitle_path": "DRIpznER-VQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 308.98,
"view_count": 18346
},
{
"video_id": "gtX_oRpLClY",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, the scene of removing the transparent phone case, then the scene of taking out the green phone case, and finally the scene of putting on the green phone case",
"First, the scene of taking out the green phone case, then the scene of putting on the green phone case, and finally the scene of removing the transparent phone case",
"First, the scene of taking out the green phone case, then the scene of removing the transparent phone case, and finally the scene of putting on the green phone case",
"First, the scene of removing the transparent phone case, then the scene of putting on the green phone case, and finally the scene of taking out the green phone case",
"First, the scene of putting on the green phone case, then the scene of removing the transparent phone case, and finally the scene of taking out the green phone case"
],
"correct_choice": 0,
"position": [
5240,
5343,
5443
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "gtX_oRpLClY_1",
"video_path": "gtX_oRpLClY.mp4",
"subtitle_path": "gtX_oRpLClY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 236.17,
"view_count": 191304
},
{
"video_id": "D0RyFh0hnkQ",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, the red dot moves on a screen with two curved wave diagrams on the far right. Then, it moves back and forth on two white backgrounds showing purple, red, and orange rectangular diagrams. Finally, the red dot moves across a grayish-black screen with six images and white text.",
"First, the red dot moves back and forth on two black backgrounds showing purple, red, and orange rectangular diagrams. Then, it moves across a grayish-black screen with six images and white text. Finally, the red dot moves on a screen with two curved wave diagrams on the far left.",
"First, the red dot moves across a grayish-black screen with six images and white text. Then, it moves back and forth on two white backgrounds showing purple, red, and orange rectangular diagrams. Finally, the red dot moves on a screen with two curved wave diagrams on the far right.",
"First, the red dot moves back and forth on two white backgrounds showing purple, red, and orange rectangular diagrams. Then, it moves across a grayish-black screen with six images and white text. Lastly, the red dot moves on a screen with two curved wave diagrams on the far right.",
"First, the red dot moves back and forth on two black backgrounds showing purple, red, and orange rectangular diagrams. Then, it moves across a grayish-black screen with six images and white text in red. Finally, the red dot moves on a screen with two curved wave diagrams on the far left."
],
"correct_choice": 2,
"position": [
7055,
9215,
10050
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "SSS",
"level": "L2-Relation",
"id": "D0RyFh0hnkQ_1",
"video_path": "D0RyFh0hnkQ.mp4",
"subtitle_path": "D0RyFh0hnkQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 521.96,
"view_count": 12
},
{
"video_id": "yl-6-Yzt--A",
"question": "In the top left corner of the screen, there is a globe with white English text displaying continents and oceans. To the right of the image, there's a man sitting and wearing a gray shirt. What is this man doing?",
"question_wo_referring_query": "In the top left corner of the screen, there is a globe with white English text displaying continents and oceans. To the right of the image, there's a man sitting and wearing a gray shirt. What is this man doing?",
"candidates": [
"Raising both hands above his head",
"Crossing both arms in front of his chest",
"Clenching both hands into fists",
"Interlacing his fingers with both hands",
"Holding his chin with one hand then opening it"
],
"correct_choice": 4,
"position": [
7759,
7780
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "S2E",
"level": "L1-Perception",
"id": "yl-6-Yzt--A_1",
"video_path": "yl-6-Yzt--A.mp4",
"subtitle_path": "yl-6-Yzt--A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 349.62,
"view_count": 1971479
},
{
"video_id": "2Uh_KNmsoqI",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, the camera moves to the right, revealing a deep red area on the map labeled Valley. Then, the camera slowly zooms in, showing a map with a light pink area labeled Delaware Valley. Finally, two markers appear on the map pointing to two locations.",
"First, the camera slowly zooms in, showing a map with a light pink area labeled Delaware Valley. Then, the camera moves to the right, revealing a deep red area on the map labeled Valley. Finally, two markers appear on the map pointing to two locations.",
"First, the camera slowly zooms in, showing a map with a light pink area labeled Delaware Valley. Then, two markers appear on the map pointing to two locations. Finally, the camera moves to the right, revealing a deep red area on the map labeled Valley.",
"First, two markers appear on the map pointing to two locations. Then, the camera slowly zooms in, showing a map with a light pink area labeled Delaware Valley. Finally, the camera moves to the right, revealing a deep red area on the map labeled Valley."
],
"correct_choice": 2,
"position": [
50,
227,
407
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SSS",
"level": "L2-Relation",
"id": "2Uh_KNmsoqI_0",
"video_path": "2Uh_KNmsoqI.mp4",
"subtitle_path": "2Uh_KNmsoqI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1212,
"duration": 17.02,
"view_count": 455159
},
{
"video_id": "xSzXFQPldJg",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, two soldiers and a wheeled machine appeared, followed by another wheeled machine, then several soldiers appeared. After that, another wheeled machine appeared. Next, a tank appeared, and finally, the tank extended two cannons.",
"First, several soldiers appeared, then a wheeled machine appeared, followed by two soldiers and a wheeled machine. After that, another wheeled machine appeared. Next, a tank appeared, and finally, the tank extended two cannons.",
"First, a single wheeled machine appeared, then two soldiers and a wheeled machine appeared, followed by several soldiers. After that, another wheeled machine appeared. Next, a tank appeared, and finally, the tank extended two cannons.",
"First, a tank appeared, then a wheeled machine appeared, followed by several soldiers. After that, another wheeled machine appeared. Next, two soldiers and a wheeled machine appeared, and finally, the tank extended two cannons."
],
"correct_choice": 0,
"position": [
29,
121,
296,
559,
893,
1056
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "xSzXFQPldJg_0",
"video_path": "xSzXFQPldJg.mp4",
"subtitle_path": "xSzXFQPldJg_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 37,
"duration": 51.0,
"view_count": 1492152
},
{
"video_id": "HPunfsyjETs",
"question": "There are several small bowls used for holding items on the screen. In which scenes do these small bowls appear?",
"question_wo_referring_query": "In which scenes do the small bowls appear?",
"candidates": [
"One scene involves pouring flour into a bowl, another scene involves adding colorful frosting to a big bowl, and another scene involves adding a liquid oil substance.",
"Adding colorful frosting to a big bowl.",
"One scene is adding colorful frosting to a big bowl, another scene is adding a liquid oil substance.",
"Adding a liquid oil substance."
],
"correct_choice": 2,
"position": [
218,
296
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "HPunfsyjETs_0",
"video_path": "HPunfsyjETs.mp4",
"subtitle_path": "HPunfsyjETs_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 88,
"duration": 19.02,
"view_count": 228465
},
{
"video_id": "GdZYLAI0vpc",
"question": "A woman wearing an overcoat, spreading her hands, and a woman wearing a white jacket with earrings are standing in front of the house. Who appears after the phrase 'back to' is mentioned?",
"question_wo_referring_query": "Who appears?",
"candidates": [
"A short-haired man wearing glasses, a brown shirt, and blue pants with his hands in his pockets",
"A woman wearing a purple shirt",
"An old man holding a cane",
"A student carrying a schoolbag"
],
"correct_choice": 0,
"position": [
209,
236
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "GdZYLAI0vpc_0",
"video_path": "GdZYLAI0vpc.mp4",
"subtitle_path": "GdZYLAI0vpc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1296,
"duration": 24.0,
"view_count": 102531
},
{
"video_id": "mHccnoh9f5w",
"question": "In a gray and hazy image, there is a terrain map with vertical and horizontal grooves in the middle, with a whitish-grayish background. When 'cover Australia or Antarctica' is mentioned, what kind of arrow appears on the screen?",
"question_wo_referring_query": "What kind of arrow appears on the screen?",
"candidates": [
"A blue arrow with a scissor tail",
"A standard red arrow",
"A red arrow with a scissor tail",
"A standard blue arrow"
],
"correct_choice": 3,
"position": [
102
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "mHccnoh9f5w_0",
"video_path": "mHccnoh9f5w.mp4",
"subtitle_path": "mHccnoh9f5w_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 249,
"duration": 8.01,
"view_count": 2186
},
{
"video_id": "Q4GK4asczVA",
"question": "In a bedroom with a few paintings hanging on the wall, a woman wearing a blue shirt, suspenders, and glasses, with a smile on her face, is holding items in dark blue, white, and olive green colors. On a table, there are 3 perfume cards, each with items of dark blue, white, and olive green colors. Which scene appears first?",
"question_wo_referring_query": "Which scene appears first?",
"candidates": [
"First appears the woman in a bedroom with a few paintings hanging on the wall, wearing a blue shirt, suspenders, and glasses, smiling and holding items in dark blue, white, and olive green colors. Last appears the table with 3 perfume cards, each with items in dark blue, white, and olive green colors.",
"First appears the table with 3 perfume cards, each with items in dark blue, white, and olive green colors. Last appears the woman in a bedroom with a few paintings hanging on the wall, wearing a blue shirt, suspenders, and glasses, smiling and holding items in dark blue, white, and olive green colors.",
"They appear simultaneously.",
"First appears the woman in a bedroom with a few paintings hanging on the wall, wearing a blue shirt, suspenders, and glasses, smiling and holding items in dark blue, white, and olive green colors. Last appears the solo scene of the woman."
],
"correct_choice": 0,
"position": [
141,
254
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Q4GK4asczVA_0",
"video_path": "Q4GK4asczVA.mp4",
"subtitle_path": "Q4GK4asczVA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 308,
"duration": 11.01,
"view_count": 126137
},
{
"video_id": "mFliMGufpwc",
"question": "In a dark room, there is a woman with long hair dressed in yellow and a child. They are both sitting on a bed. What items are present in this room?",
"question_wo_referring_query": "What items are present in this room?",
"candidates": [
"electric fan",
"parrot",
"refrigerator",
"dog",
"cat"
],
"correct_choice": 0,
"position": [
15714
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mFliMGufpwc_0",
"video_path": "mFliMGufpwc.mp4",
"subtitle_path": "mFliMGufpwc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1987.47,
"view_count": 5809
},
{
"video_id": "mFliMGufpwc",
"question": "There is a woman with medium-length wavy black hair on the screen, wearing a black and grey suit jacket, sitting on a chair facing a mirror. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"notebook",
"mobile phone",
"cup",
"camera",
"pen"
],
"correct_choice": 2,
"position": [
17712
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mFliMGufpwc_1",
"video_path": "mFliMGufpwc.mp4",
"subtitle_path": "mFliMGufpwc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1987.47,
"view_count": 5809
},
{
"video_id": "mFliMGufpwc",
"question": "There is a woman with short black hair on the screen, wearing glasses with purple-red frames, a green undershirt, and a white outer coat, with a black ornament on the collar. What items have appeared on the screen?",
"question_wo_referring_query": "There is a woman with short black hair on the screen, wearing glasses with purple-red frames, a green undershirt, and a white outer coat, with a black ornament on the collar. What items have appeared on the screen?",
"candidates": [
"mobile phone",
"bracelet",
"ring",
"watch",
"earring"
],
"correct_choice": 0,
"position": [
32789
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "mFliMGufpwc_2",
"video_path": "mFliMGufpwc.mp4",
"subtitle_path": "mFliMGufpwc_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1987.47,
"view_count": 5809
},
{
"video_id": "NpYUxd1vUUE",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, the woman pulls up the curtain. Then, the short-haired blonde woman pulls up the room's curtain. Next, the short-haired blonde woman takes out a letter. Finally, the short-haired blonde woman shakes hands with an elderly man with white hair.",
"First, the short-haired blonde woman pulls up the room's curtain. Then, the short-haired blonde woman takes out a letter. Next, the short-haired blonde woman shakes hands with an elderly man with white hair. Finally, Anne pulls up the curtain.",
"First, Anne pulls up the curtain. Then, the short-haired blonde woman pulls up the room's curtain. Next, the short-haired blonde woman takes out a letter. Finally, the short-haired blonde woman shakes hands with an elderly man with white hair.",
"First, the short-haired blonde woman takes out a letter. Then, the short-haired blonde woman shakes hands with an elderly man with white hair. Next, the short-haired blonde woman pulls up the room's curtain. Finally, Anne pulls up the curtain.",
"First, a short-haired blonde woman shakes hands with an elderly man with white hair. Then, the short-haired blonde woman takes out a letter. Next, the short-haired blonde woman pulls up the room's curtain. Finally, Anne pulls up the curtain."
],
"correct_choice": 4,
"position": [
1230,
3387,
3728,
5779
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "NpYUxd1vUUE_0",
"video_path": "NpYUxd1vUUE.mp4",
"subtitle_path": "NpYUxd1vUUE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 913.53,
"view_count": 176456
},
{
"video_id": "NpYUxd1vUUE",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, there is a painting showing a father, mother, a young boy, and a frightening old woman, then a short-haired blonde woman is embroidering, next, Lydia and Mrs. Mills are conversing in a room, and finally, the short-haired blonde woman uncovers a piece of furniture.",
"First, a short-haired blonde woman is embroidering, then Lydia and Mrs. Mills are conversing in a room, next, the short-haired blonde woman uncovers a piece of furniture, and finally, there is a painting showing a father, mother, a young boy, and a frightening old woman.",
"First, there is a painting showing a father, mother, a young boy, and a frightening old woman, then a short-haired blonde woman is dancing, next, Lydia and Mrs. Mills are conversing in a room, and finally, the short-haired blonde woman uncovers a piece of furniture.",
"First, Lydia and Mrs. Mills are conversing in a room, then a short-haired blonde woman is embroidering, next, the short-haired blonde woman uncovers a piece of furniture, and finally, there is a painting showing a father, mother, a young boy, and a frightening old woman.",
"First, there is a painting showing a father, mother, a young boy, and a frightening old woman, then a short-haired blonde woman is embroidering, next, the short-haired blonde woman uncovers a piece of furniture, and finally, Lydia and Mrs. Mills are conversing in a room."
],
"correct_choice": 1,
"position": [
6868,
7031,
7547,
7959
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "NpYUxd1vUUE_1",
"video_path": "NpYUxd1vUUE.mp4",
"subtitle_path": "NpYUxd1vUUE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 913.53,
"view_count": 176456
},
{
"video_id": "NpYUxd1vUUE",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a pair of hands flips through a photo album, then a man embraces a child, followed by a pair of hands pouring a liquid with medicine into a sink, and finally a short-haired blonde woman points a gun at three people.",
"First, a short-haired blonde woman points a gun at three people, then a pair of hands pours a liquid with medicine into a sink, followed by a pair of hands flipping through a photo album, and finally a man embraces a child.",
"First, a man embraces a child, then a pair of hands pours a liquid with medicine into a sink, followed by a pair of hands flipping through a photo album, and finally a short-haired blonde woman points a gun at three people.",
"First, a short-haired blonde woman points a gun at three people, then a pair of hands pours a liquid with medicine into a sink, followed by a man embracing a child, and finally a pair of hands flipping through a photo album.",
"First, a man embraces a child, then a pair of hands flips through a photo album, followed by a pair of hands pouring a liquid with medicine into a sink, and finally a short-haired blonde woman points a gun at three people."
],
"correct_choice": 0,
"position": [
8673,
11802,
13972,
15737
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "NpYUxd1vUUE_2",
"video_path": "NpYUxd1vUUE.mp4",
"subtitle_path": "NpYUxd1vUUE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 913.53,
"view_count": 176456
},
{
"video_id": "HeRS3nwySI8",
"question": "On a street lined with gray and black houses, there is a woman with long hair crying on the left, and a man wearing a beige fur coat on the right. What did the man do when he appeared for the first time?",
"question_wo_referring_query": "What did the man do when he appeared for the first time?",
"candidates": [
"Put on a pair of glasses",
"Put on a coat himself",
"Took off his fur coat",
"Put on a black hat",
"Put a coat on the woman"
],
"correct_choice": 4,
"position": [
8272
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O2E",
"level": "L1-Perception",
"id": "HeRS3nwySI8_0",
"video_path": "HeRS3nwySI8.mp4",
"subtitle_path": "HeRS3nwySI8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 974.2,
"view_count": 1748036
},
{
"video_id": "HeRS3nwySI8",
"question": "In the context of the setting, how many human figures are subtly visible? In front of the mirror is a man wearing a green outfit and sporting short curly hair. What does the man do the first time he appears?",
"question_wo_referring_query": "What does the man do the first time he appears?",
"candidates": [
"Smokes a cigarette",
"Takes out a phone",
"Waves at the mirror",
"Eats a piece of bread",
"Drinks a glass of water"
],
"correct_choice": 4,
"position": [
15122
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O2E",
"level": "L1-Perception",
"id": "HeRS3nwySI8_1",
"video_path": "HeRS3nwySI8.mp4",
"subtitle_path": "HeRS3nwySI8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 974.2,
"view_count": 1748036
},
{
"video_id": "HeRS3nwySI8",
"question": "In the dark tunnel, a green light flickers in the distance. In the middle is a woman wearing a blue top and black pants. When the woman appears for the first time, what happens?",
"question_wo_referring_query": "When the woman appears for the first time, what happens?",
"candidates": [
"Falls to the ground",
"Takes out a wooden stick",
"Half-kneels on the ground",
"Leans against the wall",
"Runs away into the distance"
],
"correct_choice": 0,
"position": [
19465
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O2E",
"level": "L1-Perception",
"id": "HeRS3nwySI8_2",
"video_path": "HeRS3nwySI8.mp4",
"subtitle_path": "HeRS3nwySI8_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 974.2,
"view_count": 1748036
},
{
"video_id": "3u__SZlBLC0",
"question": "In the bright office, there is a white and red checkered desk in the center. Behind the desk, a man with glasses is sitting while another man in black clothes is standing with his hands on his hips. There is a transparent window behind them. What did the man in black do after putting his hands on his hips?",
"question_wo_referring_query": "What did he do?",
"candidates": [
"Clapped his hands",
"Picked up a pen",
"Patted his shoulder",
"Touched his head",
"Waved his hand"
],
"correct_choice": 3,
"position": [
14020,
14058
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "3u__SZlBLC0_0",
"video_path": "3u__SZlBLC0.mp4",
"subtitle_path": "3u__SZlBLC0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1835.47,
"view_count": 9148
},
{
"video_id": "3u__SZlBLC0",
"question": "On a desk with papers stacked on it, there's a hand on the left side touching a black cup, and on the right side, there's a pair of glasses. What happens when the black cup is knocked over?",
"question_wo_referring_query": "What happens?",
"candidates": [
"A hand takes away the glasses",
"The phone gets wet",
"The glasses fall to the ground",
"A hand takes away the soaked paper",
"The pencil gets wet"
],
"correct_choice": 3,
"position": [
28081,
28116
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "3u__SZlBLC0_1",
"video_path": "3u__SZlBLC0.mp4",
"subtitle_path": "3u__SZlBLC0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1835.47,
"view_count": 9148
},
{
"video_id": "3u__SZlBLC0",
"question": "On the gray road, a woman dressed in a black jacket, wearing a black helmet, and with long curly hair is smiling at the mirror. After the woman smiles, what does she do?",
"question_wo_referring_query": "What does she do?",
"candidates": [
"Changed into a red coat",
"Got into a taxi",
"Squatted down on the ground",
"Got on a motorcycle",
"Got on a bicycle"
],
"correct_choice": 3,
"position": [
37077,
37159
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "3u__SZlBLC0_2",
"video_path": "3u__SZlBLC0.mp4",
"subtitle_path": "3u__SZlBLC0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1835.47,
"view_count": 9148
},
{
"video_id": "yz3lOAe32Tw",
"question": "In front of a white wall with prints and paintings, there is a woman in red clothes on the left and a young girl in white clothes on the right. They are having a conversation. What happened after the conversation mentioned 'mommy gerbil ate her offspring's head'?",
"question_wo_referring_query": "What happened?",
"candidates": [
"Jessica stroked Ted's mechanical arm",
"A robot with glowing yellow eyes came through the door",
"Two adult men were playing a board game",
"A man lay in the bathtub with a towel over his eyes",
"A woman in purple grabbed the wrist of a young girl with glasses"
],
"correct_choice": 0,
"position": [
8550,
10174,
7286,
6504,
1358,
2483
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "yz3lOAe32Tw_0",
"video_path": "yz3lOAe32Tw.mp4",
"subtitle_path": "yz3lOAe32Tw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1115.91,
"view_count": 67305
},
{
"video_id": "yz3lOAe32Tw",
"question": "In front of a white wall with various cards attached, there is a woman on the left wearing clothes in yellow, green, and blue, and a girl on the right with a blue skirt. Before the subtitle 'when her mom pushes the button' appears, what action did the woman take?",
"question_wo_referring_query": "What action did the woman take?",
"candidates": [
"A man in a colorful patched suit with a mustache is reading a letter",
"Donna is holding a baby, and Steve is standing behind her",
"Steve is holding a wrapped paper box",
"The woman is grabbing the man's robotic arm that is emitting red light"
],
"correct_choice": 2,
"position": [
18306,
17545,
20080,
22024,
25860
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "yz3lOAe32Tw_1",
"video_path": "yz3lOAe32Tw.mp4",
"subtitle_path": "yz3lOAe32Tw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1115.91,
"view_count": 67305
},
{
"video_id": "yz3lOAe32Tw",
"question": "In the room with the yellow walls, on the right side, there's a rectangular window with rounded corners. In the middle, there is a hemispherical object and a man wearing a gray coat. What action did the man take before the subtitle mentioned 'to smoke'?",
"question_wo_referring_query": "What action did the man take?",
"candidates": [
"Steve holding a packed paper box",
"A man lying in the bathtub with a towel over his eyes",
"Donna holding a baby, with Steve standing behind her",
"Jessica stroking Ted's mechanical arm"
],
"correct_choice": 2,
"position": [
23111,
25860,
17545,
10174,
7286
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "yz3lOAe32Tw_2",
"video_path": "yz3lOAe32Tw.mp4",
"subtitle_path": "yz3lOAe32Tw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1115.91,
"view_count": 67305
},
{
"video_id": "YcbKamVxDzI",
"question": "In the distant view framed by black borders, on the green meadow, there are two individuals facing the camera on the left, dressed in white and gray clothes, respectively. What characters appear after the subtitle 'happily as they walk towards their rooms'?",
"question_wo_referring_query": "What characters appear?",
"candidates": [
"A man wearing white clothes and a red hat",
"A man wearing white clothes and sunglasses",
"A woman wearing a yellow hat and sunglasses",
"A man wearing black clothes and a red hat",
"A man wearing white clothes and a yellow hat"
],
"correct_choice": 4,
"position": [
9062,
9082
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "YcbKamVxDzI_0",
"video_path": "YcbKamVxDzI.mp4",
"subtitle_path": "YcbKamVxDzI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1050.8,
"view_count": 645698
},
{
"video_id": "YcbKamVxDzI",
"question": "In a park full of trees and plants, there is a winding path in the middle. On the path, there is a woman wearing a dark blue top and pink trousers, and a man wearing a white top. After the subtitle 'girlfriend who are having an argument' appears, which characters appear?",
"question_wo_referring_query": "Which characters appear?",
"candidates": [
"A woman wearing a yellow hat and sunglasses",
"A man wearing blue clothes",
"A girl wearing blue clothes",
"A man wearing black clothes and a red hat",
"A man wearing grey clothes"
],
"correct_choice": 1,
"position": [
16560,
16584
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "YcbKamVxDzI_1",
"video_path": "YcbKamVxDzI.mp4",
"subtitle_path": "YcbKamVxDzI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1050.8,
"view_count": 645698
},
{
"video_id": "YcbKamVxDzI",
"question": "In the video, the left side shows a dim yellow-lit beige room in the background, while the right side has a blurry white background. In the middle, there's a long-haired woman in purple clothing facing a mirror. Which characters appear after the subtitle 'concerns about the ending of the story' appears?",
"question_wo_referring_query": "Which characters appear afterward?",
"candidates": [
"A man wearing white clothes and a yellow hat",
"A man wearing black clothes and a red hat",
"A woman wearing yellow clothes with blonde hair",
"A woman wearing a black hat and sunglasses",
"A man wearing white clothes with long hair"
],
"correct_choice": 2,
"position": [
23891,
23911
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "YcbKamVxDzI_2",
"video_path": "YcbKamVxDzI.mp4",
"subtitle_path": "YcbKamVxDzI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1050.8,
"view_count": 645698
},
{
"video_id": "5dJUUQufzw4",
"question": "Under a blue sky with white clouds, there are undulating mountains in the distance. In the sky, there is an airplane with black smoke trailing from its tail. In which of the following scenes has this airplane appeared before?",
"question_wo_referring_query": "In which of the following scenes has this airplane appeared before?",
"candidates": [
"Over the mouth of an active volcano",
"Over a vast grassland",
"At a crowded crossroad",
"Above the blue sea",
"In the low airspace in front of a forest"
],
"correct_choice": 4,
"position": [
355,
397
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "5dJUUQufzw4_0",
"video_path": "5dJUUQufzw4.mp4",
"subtitle_path": "5dJUUQufzw4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1193.46,
"view_count": 916285
},
{
"video_id": "5dJUUQufzw4",
"question": "In a grassy area covered with yellow and green weeds, there is a man lying in the middle wearing a brown coat and sporting short curly hair. In which of the following scenes has the man on the grassy area appeared?",
"question_wo_referring_query": "In which of the following scenes has the man on the grassy area appeared?",
"candidates": [
"On a yellowish-brown boulder",
"Low air above the ground in front of a forest",
"Inside a car with red interior",
"On the roof of a red van",
"Inside a car with white interior"
],
"correct_choice": 2,
"position": [
7449,
7754
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "5dJUUQufzw4_1",
"video_path": "5dJUUQufzw4.mp4",
"subtitle_path": "5dJUUQufzw4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1193.46,
"view_count": 916285
},
{
"video_id": "5dJUUQufzw4",
"question": "By the blue-green waterside of the lake, a man in black clothes is sitting on the ground on the left, and a man in a hooded jacket is standing on the right. In which of the following scenes does the man standing by the lake appear?",
"question_wo_referring_query": "In which of the following scenes does the man standing by the lakeside appear?",
"candidates": [
"On the ground in front of a red hill",
"On the ground in front of a yellow hill",
"On a jade-green grassland",
"At a crowded intersection",
"On a yacht on the blue sea"
],
"correct_choice": 1,
"position": [
28161,
28178
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "5dJUUQufzw4_2",
"video_path": "5dJUUQufzw4.mp4",
"subtitle_path": "5dJUUQufzw4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1193.46,
"view_count": 916285
},
{
"video_id": "bXRuqcmTIuk",
"question": "In a room with slightly dim lighting and white walls, a long-haired woman in a red dress is sitting on a chair. She is holding a red musical instrument in her hand. Behind her are a few gray chairs. On the wall behind the chairs hangs a white square board. There is a window behind her on the right side. With which subtitles does the girl's musical instrument appear together?",
"question_wo_referring_query": "With which subtitles does the girl's musical instrument appear together?",
"candidates": [
"Here we learn that young Joe had a lot of enemies because he was a bully",
"Right after, Joe manages to answer the first question about who he is",
"served at every single Chinese restaurant with the word 'dragon' on its name, because it's the",
"a prep school called Evergreen Academy, which is a school Joe went to when he was younger.",
"hope. therefore he begins to write letters to his daughter, hoping that once he gets out, his"
],
"correct_choice": 4,
"position": [
6159,
6281,
10897,
15818,
16328,
19502
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "bXRuqcmTIuk_0",
"video_path": "bXRuqcmTIuk.mp4",
"subtitle_path": "bXRuqcmTIuk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 963.87,
"view_count": 65830
},
{
"video_id": "bXRuqcmTIuk",
"question": "Two people are sitting on a screen, with a long-haired woman in a black long-sleeve dress on the left and a short-haired man in a black long-sleeve shirt on the right. Behind them, there are various types of trees. In front of them, there is a blue water cup. With which subtitles did this blue water cup appear together?",
"question_wo_referring_query": "With which subtitles did this blue water cup appear together?",
"candidates": [
"served at every single Chinese restaurant with the word 'dragon' in its name, because it's the",
"Here we learn that young Joe had a lot of enemies because he was a bully",
"a perp school called Evergreen Academy. which is a school Joe went to when he was younger.",
"by two loving parents, and seems to be thriving with the cello. this provides Joe with newfound",
"Right after, Joe manages to answer the first question about who he is"
],
"correct_choice": 0,
"position": [
10881,
10897,
6159,
15818,
16318,
19502
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "bXRuqcmTIuk_1",
"video_path": "bXRuqcmTIuk.mp4",
"subtitle_path": "bXRuqcmTIuk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 963.87,
"view_count": 65830
},
{
"video_id": "bXRuqcmTIuk",
"question": "In a room, a long-haired woman in a black long-sleeved dress is sitting by a desk, with a short-haired man in a black long-sleeved shirt standing beside her. There is a laptop on the desk. Behind them, there is a white door, and two paintings are hanging on the right side. Which subtitles have appeared together with the laptop on the desk?",
"question_wo_referring_query": "Which subtitles have appeared together with the laptop on the desk?",
"candidates": [
"At night, Joe and Maric who are still unaware of what is going on, sneak into the school",
"After the visit, the two decide to take a rest and clean themselves up. The tension",
"Back at the bar, Chucky who also went to Evergreen with Joe is closing his own research",
"Here we learn that young Joe had a bit of enemies because he was a bully",
"The two try looking up the school on the internet but the website is inaccessible"
],
"correct_choice": 4,
"position": [
15837,
15860,
16328,
16541,
16891,
17585
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "bXRuqcmTIuk_2",
"video_path": "bXRuqcmTIuk.mp4",
"subtitle_path": "bXRuqcmTIuk_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 963.87,
"view_count": 65830
},
{
"video_id": "O471uwTNx6k",
"question": "In a blurry background, a man with thick eyebrows and short black hair is wearing earphones. When the scene changes to another man in a white shirt standing behind him outdoors, what change occurs to the first man?",
"question_wo_referring_query": "What change occurs to the man?",
"candidates": [
"The man's earphones have changed to wireless.",
"The man is no longer wearing earphones.",
"The man's earphones have changed from black to white.",
"The man's earphones have changed to a hands-free device."
],
"correct_choice": 1,
"position": [
1718,
2704
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SAA",
"level": "L2-Relation",
"id": "O471uwTNx6k_0",
"video_path": "O471uwTNx6k.mp4",
"subtitle_path": "O471uwTNx6k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 959.7,
"view_count": 59664
},
{
"video_id": "O471uwTNx6k",
"question": "There are many screens on the left side of the scene, and on a nearby desk, there is a laptop. A person wearing a blue sleeveless outfit is standing next to the laptop. At the screen, a man in gray clothing is facing the camera. As the scene shifts to show a blonde woman in a suspender skirt bending down on the right side of this man, what change occurs in the man's outfit?",
"question_wo_referring_query": "What change occurs in the man's outfit?",
"candidates": [
"The outer garment changes to short sleeves",
"The outer garment changes to a hooded coat",
"The outer garment changes to a vest",
"The outer garment changes to black"
],
"correct_choice": 3,
"position": [
4566,
6625
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SAA",
"level": "L2-Relation",
"id": "O471uwTNx6k_1",
"video_path": "O471uwTNx6k.mp4",
"subtitle_path": "O471uwTNx6k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 959.7,
"view_count": 59664
},
{
"video_id": "O471uwTNx6k",
"question": "Two men are having a conflict in the woods. The man on the left side of the screen is holding a scarf in his right hand and sunglasses in his left hand. The man on the right side of the screen is wearing a gray hoodie and has long hair. When the scene changes to a Jeep, what change happens to the man sitting in the passenger seat?",
"question_wo_referring_query": "What change happens to the man sitting in the passenger seat?",
"candidates": [
"He changed into a black short-sleeved shirt",
"He is now wearing a hat",
"His face has more scars",
"He became bald"
],
"correct_choice": 2,
"position": [
11019,
20638
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SAA",
"level": "L2-Relation",
"id": "O471uwTNx6k_2",
"video_path": "O471uwTNx6k.mp4",
"subtitle_path": "O471uwTNx6k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 959.7,
"view_count": 59664
},
{
"video_id": "fO7nwCix8xU",
"question": "In a room that is slightly dimly lit, there is a white desk with various items on it. Next to the desk stands a man with short hair, wearing a long-sleeve black shirt and a red and white striped tie. He is surrounded by white bookshelves filled with books. Which object appears in the scene?",
"question_wo_referring_query": "Which object appears in the scene?",
"candidates": [
"Water cup",
"Rock",
"Red desk",
"Snacks",
"Table lamp"
],
"correct_choice": 4,
"position": [
11137
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "fO7nwCix8xU_0",
"video_path": "fO7nwCix8xU.mp4",
"subtitle_path": "fO7nwCix8xU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1132.53,
"view_count": 26171
},
{
"video_id": "fO7nwCix8xU",
"question": "In a room with white walls and a decorative object hanging on the wall, there is a table and chairs. Two people are sitting at the table, which is cluttered with various items. On the left is a short-haired man in a white long-sleeved shirt, and on the right is a long-haired woman in a white long-sleeved shirt. Which of the following items has appeared on the table?",
"question_wo_referring_query": "Which of the following items has appeared on the table?",
"candidates": [
"Lamp",
"Red cup",
"Snacks",
"Jump rope",
"Beverage"
],
"correct_choice": 4,
"position": [
12963
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "fO7nwCix8xU_1",
"video_path": "fO7nwCix8xU.mp4",
"subtitle_path": "fO7nwCix8xU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1132.53,
"view_count": 26171
},
{
"video_id": "fO7nwCix8xU",
"question": "In a slightly dimly lit room with a light beam, three men are sitting by a gray table. On the left is a man with short hair wearing a black shirt and a red tie. In the middle is a man wearing a yellow long-sleeve shirt and a red tie. On the right is another man in a black shirt. There are several items placed on the table. Which of the following items has appeared on the table?",
"question_wo_referring_query": "Which of the following items has appeared on the table?",
"candidates": [
"Jump rope",
"Book",
"Drink",
"Snacks",
"Cup"
],
"correct_choice": 1,
"position": [
20744
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "fO7nwCix8xU_2",
"video_path": "fO7nwCix8xU.mp4",
"subtitle_path": "fO7nwCix8xU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1132.53,
"view_count": 26171
},
{
"video_id": "X5v4nBo5y28",
"question": "In a room, there is a person with short hair, dressed in black clothing and wearing a blue tie. Behind him, some lights are on the ceiling. After the subtitle mentions 'Heath also confirms his suspicion and reveals the traffic lights were hacked', which of the following items appears for the first time?",
"question_wo_referring_query": "Which of the following items appears for the first time?",
"candidates": [
"snacks",
"fan",
"cup",
"computer",
"mobile phone"
],
"correct_choice": 4,
"position": [
7352,
7389
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "X5v4nBo5y28_0",
"video_path": "X5v4nBo5y28.mp4",
"subtitle_path": "X5v4nBo5y28_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1206.31,
"view_count": 105206
},
{
"video_id": "X5v4nBo5y28",
"question": "In the scene, there is a white house by the river, with various trees planted around it. Two people in long sleeves are walking towards the white house. When the subtitles mention 'They run into landlady and she tells them that Abigail disappeared one day and never ', which of the following items appears for the first time?",
"question_wo_referring_query": "Which of the following items appears for the first time?",
"candidates": [
"computer",
"snack",
"mobile phone",
"balloon",
"scissors"
],
"correct_choice": 4,
"position": [
21081,
21117,
7389
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "X5v4nBo5y28_1",
"video_path": "X5v4nBo5y28.mp4",
"subtitle_path": "X5v4nBo5y28_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1206.31,
"view_count": 105206
},
{
"video_id": "X5v4nBo5y28",
"question": "In a room with white walls, there is a woman with curly hair wearing a black short-sleeved shirt on the left, and a man with short hair wearing a white coat on the right. Behind them, there is a bright object. When the subtitle mentions 'She then asks him why he wants to know, and he tells her because he has been arrested,' who is the first person to appear making a phone call in prison?",
"question_wo_referring_query": "Who is the first person to appear making a phone call in prison?",
"candidates": [
"A man in a black outfit",
"A woman in an orange outfit",
"A man in a purple outfit",
"A man in an orange outfit",
"A man in a white outfit"
],
"correct_choice": 3,
"position": [
13508,
13551
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "X5v4nBo5y28_2",
"video_path": "X5v4nBo5y28.mp4",
"subtitle_path": "X5v4nBo5y28_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1206.31,
"view_count": 105206
},
{
"video_id": "MvgJAD6tZXo",
"question": "Inside a house with black wooden walls, there is a door on the left that lets in sunlight. Next to the door on the left is a man wearing black clothes, and on the right is the silhouette of a person also wearing black clothes. What is the man next to the door on the left, inside the house, doing?",
"question_wo_referring_query": "What is the man wearing black clothes next to the door on the left inside the house doing?",
"candidates": [
"Waving towards the door",
"Picking up a stick from the ground",
"Drinking water",
"Picking up a noodle",
"Moving a box"
],
"correct_choice": 3,
"position": [
7918
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "MvgJAD6tZXo_0",
"video_path": "MvgJAD6tZXo.mp4",
"subtitle_path": "MvgJAD6tZXo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1037.41,
"view_count": 52001
},
{
"video_id": "MvgJAD6tZXo",
"question": "On the wall between blue and white hang several paintings with golden frames. On the blue sofa in front of the wall sit a man wearing a blue suit and a woman wearing a pink dress. What is the woman on the sofa doing?",
"question_wo_referring_query": "What is the woman on the sofa doing?",
"candidates": [
"Waving to the camera",
"Handing a black dog to the man",
"Handing a white dog to the man",
"Handing a white cat to the man",
"Handing a white box to the man"
],
"correct_choice": 2,
"position": [
16177
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "MvgJAD6tZXo_1",
"video_path": "MvgJAD6tZXo.mp4",
"subtitle_path": "MvgJAD6tZXo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1037.41,
"view_count": 52001
},
{
"video_id": "MvgJAD6tZXo",
"question": "Under the blue sky, on the left is the blue sea and sandy beach, while on the right, there is a cliff. On the cliff, there is a woman wearing a white shirt and a man wearing blue armor. What is the man on the cliff doing?",
"question_wo_referring_query": "What is the man on the cliff doing?",
"candidates": [
"Holding the woman's hand",
"Holding the woman's waist",
"Throwing a stone towards the beach",
"Touching the woman's hair",
"Waving at the camera"
],
"correct_choice": 0,
"position": [
23398
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "MvgJAD6tZXo_2",
"video_path": "MvgJAD6tZXo.mp4",
"subtitle_path": "MvgJAD6tZXo_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1037.41,
"view_count": 52001
},
{
"video_id": "HQns-h_82qU",
"question": "In the animated scene, under the blue-green sky, in the middle of a road lined with tall grass on both sides, there is a creature wearing white clothes and has a tail. It's lifting a cane, what does it do after lifting the cane?",
"question_wo_referring_query": "What does it do next?",
"candidates": [
"Stuck the cane into the ground",
"Kneeled on the ground",
"Lay down on the ground",
"Put on a hat",
"Dropped the cane"
],
"correct_choice": 0,
"position": [
1185,
1196
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "HQns-h_82qU_0",
"video_path": "HQns-h_82qU.mp4",
"subtitle_path": "HQns-h_82qU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1333.77,
"view_count": 995835
},
{
"video_id": "HQns-h_82qU",
"question": "In the animated scene, the left side features a pitch-black sky, the middle shows a cliff face that is currently exploding, and the right side is a rock wall illuminated by a red light. What event occurred after the explosion?",
"question_wo_referring_query": "What event occurred?",
"candidates": [
"A man lifted up a cane",
"A woman fell to the ground",
"A tiger fell to the ground",
"A panda took out a bamboo stick",
"A panda fell to the ground"
],
"correct_choice": 4,
"position": [
20177,
20217
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "HQns-h_82qU_1",
"video_path": "HQns-h_82qU.mp4",
"subtitle_path": "HQns-h_82qU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1333.77,
"view_count": 995835
},
{
"video_id": "HQns-h_82qU",
"question": "In the animation scene, under the sky with sunlight penetrating through the clouds, there is an olive-colored animal head in the lower right corner. In the air on the left side, there is a panda with a glowing blue helmet and glowing green gloves. It is spreading its arms and staring forward. What event occurred after it spread its arms?",
"question_wo_referring_query": ", what event occurred?",
"candidates": [
"A square-pyramid-shaped platform flew into the sky",
"A square-pyramid-shaped platform fell to the ground",
"A missile landed on the ground",
"An airplane crashed to the ground",
"A tiger fell to the ground"
],
"correct_choice": 1,
"position": [
27221,
27267
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "HQns-h_82qU_2",
"video_path": "HQns-h_82qU.mp4",
"subtitle_path": "HQns-h_82qU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1333.77,
"view_count": 995835
},
{
"video_id": "O3Hwh0uv8Mg",
"question": "In a purely black screen, which character appears first?",
"question_wo_referring_query": "Which character appears first?",
"candidates": [
"The woman with long hair",
"The man with short hair",
"The woman wearing a hat",
"The boy holding a soccer ball",
"The woman with short hair"
],
"correct_choice": 0,
"position": [
186,
224
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "O3Hwh0uv8Mg_0",
"video_path": "O3Hwh0uv8Mg.mp4",
"subtitle_path": "O3Hwh0uv8Mg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1149.0,
"view_count": 99805
},
{
"video_id": "O3Hwh0uv8Mg",
"question": "In the movie scene, there is a man in gray-black clothes standing between a red door and wall on the left, and a silver-white window and yellow wall on the right. After this man appears, which person or object appears first?",
"question_wo_referring_query": "Which person or object appears first?",
"candidates": [
"ambulance",
"woman in yellow clothes",
"car",
"motorcycle",
"woman in red clothes"
],
"correct_choice": 2,
"position": [
16558,
16613
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "O3Hwh0uv8Mg_1",
"video_path": "O3Hwh0uv8Mg.mp4",
"subtitle_path": "O3Hwh0uv8Mg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1149.0,
"view_count": 99805
},
{
"video_id": "O3Hwh0uv8Mg",
"question": "In the movie scene, on the left is the background of a person with long hair, and on the right is a man dressed in a gray and black suit with a tie. Behind them is a doorway letting in light and blurry trees in the distance. After the man in the gray and black suit with a tie appears, which character appears first?",
"question_wo_referring_query": "After the man in the gray and black suit with a tie appears, which character appears first?",
"candidates": [
"A girl in a red dress",
"A man wearing a hat",
"A man in a burgundy suit",
"A woman in a yellow dress",
"A woman in a burgundy dress"
],
"correct_choice": 2,
"position": [
23263,
23309
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "O3Hwh0uv8Mg_2",
"video_path": "O3Hwh0uv8Mg.mp4",
"subtitle_path": "O3Hwh0uv8Mg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1149.0,
"view_count": 99805
},
{
"video_id": "NSn78eNspwU",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, there is a scene where 'a small snow goose spreads its wings and opens its beak in front of a white snow goose's belly,' followed by 'a few snow geese walking back to the camera on the snowy field,' and finally 'a snowy field with many snow geese arranged in rows, with a tall white cliff in the distance.'",
"First, there is a scene where 'a snowy field with many snow geese arranged in rows, with a tall white cliff in the distance,' followed by 'a few snow geese walking back to the camera on the snowy field,' and finally 'a small snow goose spreads its wings and opens its beak in front of a white snow goose's belly.'",
"First, there is a scene where 'a few snow geese walking back to the camera on the snowy field,' followed by 'a small snow goose spreads its wings and opens its beak in front of a white snow goose's belly,' and finally 'a snowy field with many snow geese arranged in rows, with a tall white cliff in the distance.'",
"First, there is a scene where 'a snowy field with many snow geese arranged in rows, with a tall white cliff in the distance,' followed by 'a small snow goose spreads its wings and opens its beak in front of a white snow goose's belly,' and finally 'a few snow geese walking back to the camera on the snowy field.'",
"First, there is a scene where 'a small snow goose spreads its wings and opens its beak in front of a white snow goose's belly,' followed by 'a snowy field with many snow geese arranged in rows, with a tall white cliff in the distance,' and finally 'a few snow geese walking back to the camera on the snowy field.'"
],
"correct_choice": 3,
"position": [
2046,
2062,
2112
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "NSn78eNspwU_0",
"video_path": "NSn78eNspwU.mp4",
"subtitle_path": "NSn78eNspwU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 990.96,
"view_count": 31455
},
{
"video_id": "NSn78eNspwU",
"question": "Which of the following sequences is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First, there is a bright spot on the left side of the screen in a black background, and on the right side is a rock wall in a blue screen. Then, on the ice, there are several icebergs and a seal in the distance on the left side, and in the near distance on the right side, there is a big penguin and several small penguins. Finally, on the ice under the blue sky, there are several icebergs in the distance, and in the near distance on the left side, there is a penguin, and in the middle, there is a black seal.",
"First, on the ice, there are several icebergs and a seal in the distance on the left side, and in the near distance on the right side, there is a big penguin and several small penguins. Then, in the black background on the left side of the screen, there is a bright spot, and on the right side, there is a rock wall in a blue screen. Finally, on the ice under the blue sky, there are several icebergs in the distance, and in the near distance on the left side, there is a penguin, and in the middle, there is still a black seal.",
"First, the scene appears on the ice under the blue sky, with several icebergs in the distance. In the near distance on the left side, there is a penguin, and in the middle, there is a black seal. Then, in the black background on the left side of the screen, there is a bright spot, and on the right side, there is a rock wall in a blue screen. Finally, on the ice, there are several icebergs and a seal in the distance on the left side, and in the near distance on the right side, there is a big penguin and several small penguins.",
"First, there is a bright spot on the left side of the screen in a black background, and on the right side is a rock wall in a blue screen. Then, on the ice under the blue sky, there are several icebergs in the distance, and in the near distance on the left side, there is a penguin, and in the middle, there is a black seal. Finally, on the ice, there are several icebergs and a seal in the distance on the left side, and in the near distance on the right side, there is a big penguin and several small penguins.",
"First, on the ice, there are several icebergs and a seal in the distance on the left side, and in the near distance on the right side, there is a big penguin and several small penguins. Then, on the ice under the blue sky, there are several icebergs in the distance. In the near distance on the left side, there is a penguin, and in the middle, there is still a black seal. Finally, in the black background on the left side of the screen, there is a bright spot, and on the right side, there is a rock wall in a blue screen."
],
"correct_choice": 3,
"position": [
7788,
7820,
7859
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "NSn78eNspwU_1",
"video_path": "NSn78eNspwU.mp4",
"subtitle_path": "NSn78eNspwU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 990.96,
"view_count": 31455
},
{
"video_id": "NSn78eNspwU",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, a scene appears with 'a white goose looking sideways at the gray-white sky and the blue-black sea', then a scene with 'a goose with a yellow crown lying on the ground next to several goose legs on a white surface', and finally a scene with 'a yellow box and some black objects floating on a blue-green water surface.'",
"First, a scene appears with 'a goose with a yellow crown lying on the ground next to several goose legs on a white surface', then a scene with 'a white goose looking sideways at the gray-white sky and the blue-black sea', and finally a scene with 'a yellow box and some black objects floating on a blue-green water surface.'",
"First, a scene appears with a 'yellow box and some black objects floating on a blue-green water surface', then a scene with 'a goose with a yellow crown lying on the ground next to several goose legs on a white surface', and finally a scene with 'a white goose looking sideways at the gray-white sky and the blue-black sea.'",
"First, a scene appears with a 'yellow box and some black objects floating on a blue-green water surface', then a scene with 'a white goose looking sideways at the gray-white sky and the blue-black sea', and finally a scene with 'on a white surface, a goose with a yellow crown lying on the ground next to several goose legs.'",
"First, a scene appears with 'a white goose looking sideways at the gray-white sky and the blue-black sea', then a scene with 'a yellow box and some black objects floating on a blue-green water surface', and finally a scene with 'a goose with a yellow crown lying on the ground next to several goose legs on a white surface.'"
],
"correct_choice": 2,
"position": [
17045,
17105,
17148
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "NSn78eNspwU_2",
"video_path": "NSn78eNspwU.mp4",
"subtitle_path": "NSn78eNspwU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 990.96,
"view_count": 31455
},
{
"video_id": "0ln2qdCR5lA",
"question": "In a room, a person wearing a white protective suit places a bag under the neck of a lying man. Behind the person in the protective suit, there is a transparent glass, and behind the glass stands a person wearing a white long-sleeved shirt. What did the person in the protective suit do after placing the gray bag under the man's neck?",
"question_wo_referring_query": "What did the person in the protective suit do after placing the gray bag under the man's neck?",
"candidates": [
"Jump up",
"Insert a tube into the man's throat",
"Leave",
"Drink water",
"Kneel down"
],
"correct_choice": 1,
"position": [
10281,
10383
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "0ln2qdCR5lA_0",
"video_path": "0ln2qdCR5lA.mp4",
"subtitle_path": "0ln2qdCR5lA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1358.82,
"view_count": 757428
},
{
"video_id": "0ln2qdCR5lA",
"question": "Many green lights were emitted on the wall, a black-haired man wearing a gray suit picked up a box and was about to leave the house. What happened after the man picked up the box?",
"question_wo_referring_query": "What happened after the man picked up the box?",
"candidates": [
"Someone climbed a ladder",
"Drinking water",
"Someone sang",
"Someone ran",
"An explosion"
],
"correct_choice": 0,
"position": [
18509,
18522
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "0ln2qdCR5lA_1",
"video_path": "0ln2qdCR5lA.mp4",
"subtitle_path": "0ln2qdCR5lA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1358.82,
"view_count": 757428
},
{
"video_id": "0ln2qdCR5lA",
"question": "There are many instruments on the shelves of the laboratory. A man with short black hair, wearing a gray bulletproof vest, picks up a blue cup from the shelf. What did the man do after picking up the blue cup?",
"question_wo_referring_query": "What did the man do after picking up the blue cup?",
"candidates": [
"Kneeled down",
"Talked while turning around",
"Drank water",
"Ate something",
"Talked while running"
],
"correct_choice": 1,
"position": [
20582,
20662
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "0ln2qdCR5lA_2",
"video_path": "0ln2qdCR5lA.mp4",
"subtitle_path": "0ln2qdCR5lA_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1358.82,
"view_count": 757428
},
{
"video_id": "j6beJTHUT_c",
"question": "Under a white sky, a man wearing an olive hat and a long-sleeved white robe, along with another person wearing a white hat, are restraining a short-haired man. There is a house behind them surrounded by trees, and to the left of the house, there is a woman dressed in a long-sleeved white garment. When the subtitle mentions 'was trying to escape, Grisha is considered a rebel and will be executed along with other rebels,' what did the two people restraining the short-haired man do to him?",
"question_wo_referring_query": "What did the two people restraining the short-haired man do to him afterwards?",
"candidates": [
"Ran",
"Drank water",
"Tied a rope around the restrained man's neck",
"Crouched down",
"Let the restrained person go"
],
"correct_choice": 2,
"position": [
5301,
5422
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "j6beJTHUT_c_0",
"video_path": "j6beJTHUT_c.mp4",
"subtitle_path": "j6beJTHUT_c_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 918.4,
"view_count": 22762
},
{
"video_id": "j6beJTHUT_c",
"question": "Under a piece of sky, there are five different people kneeling on yellow ground. Behind them, there are white columns and trees, and on both sides behind them, there is a brown house. In front of them, there is a person holding a whip pointing at them. After the subtitle 'The supervisor then punished one of them randomly, and unfortunately, the chosen one was Artyom,' appeared, what happened?",
"question_wo_referring_query": "What happened?",
"candidates": [
"Ate something",
"Stood up",
"A kneeling man was taken away",
"Drank water",
"Went for a run"
],
"correct_choice": 2,
"position": [
10261,
10407
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "j6beJTHUT_c_1",
"video_path": "j6beJTHUT_c.mp4",
"subtitle_path": "j6beJTHUT_c_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 918.4,
"view_count": 22762
},
{
"video_id": "j6beJTHUT_c",
"question": "In a room, a woman with long black hair is sitting on the left, and a woman with yellow hair is sitting on the right. The woman with black hair is putting makeup on the woman with yellow hair. A man in yellow and white clothing is standing in the middle of the screen. After the subtitle 'Lisa is the female lead, replacing Polina. While getting ready, Lisa told her boyfriend, Alexey,', what does the woman with long black hair do?",
"question_wo_referring_query": "What does the woman with long black hair do?",
"candidates": [
"Running",
"Leaving",
"Drinking water",
"Eating something",
"Fixing the yellow-haired woman's hair"
],
"correct_choice": 4,
"position": [
12863,
12924
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "j6beJTHUT_c_2",
"video_path": "j6beJTHUT_c.mp4",
"subtitle_path": "j6beJTHUT_c_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 918.4,
"view_count": 22762
},
{
"video_id": "TxS1JnfuG34",
"question": "In a slightly dimly lit room, there is a woman with long hair, wearing a white short-sleeved shirt, sitting on a bed. She is holding a mobile phone. There is a white cabinet behind her, filled with various items. In which of the following scenes has the mobile phone appeared?",
"question_wo_referring_query": "In which of the following scenes has the mobile phone appeared?",
"candidates": [
"On a white perforated table",
"On a green perforated table",
"On a yellow perforated table",
"On a blue perforated table",
"On a red perforated table"
],
"correct_choice": 0,
"position": [
7791,
8427
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "TxS1JnfuG34_0",
"video_path": "TxS1JnfuG34.mp4",
"subtitle_path": "TxS1JnfuG34_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1217.59,
"view_count": 1354237
},
{
"video_id": "TxS1JnfuG34",
"question": "On a street at night, there are two people. Among them, a woman with long black hair wearing a red coat is sitting on the ground. In front of her, a short-haired man wearing an olive-colored outfit is kneeling. In which of the following scenes did this woman appear?",
"question_wo_referring_query": "In which of the following scenes did this woman appear?",
"candidates": [
"In the woods",
"Inside a room",
"In the desert",
"In the sea",
"On a plane"
],
"correct_choice": 1,
"position": [
12937,
13122
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "TxS1JnfuG34_1",
"video_path": "TxS1JnfuG34.mp4",
"subtitle_path": "TxS1JnfuG34_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1217.59,
"view_count": 1354237
},
{
"video_id": "TxS1JnfuG34",
"question": "In a room, a man with short black hair, wearing black clothes, is holding a gun. He is pointing the gun at another man with short black hair, wearing black clothes with a belt. The question is, in which of the following scenes does this gun appear?",
"question_wo_referring_query": "In which of the following scenes does the gun appear?",
"candidates": [
"In the hand of the man with red hair",
"In the hand of the man with green hair",
"In the hand of the man with white hair",
"In the hand of the man with blue hair",
"In the hand of the man with purple hair"
],
"correct_choice": 2,
"position": [
23852,
24452
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "TxS1JnfuG34_2",
"video_path": "TxS1JnfuG34.mp4",
"subtitle_path": "TxS1JnfuG34_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1217.59,
"view_count": 1354237
},
{
"video_id": "d5JlCEDlHGE",
"question": "In a white room, on a silver floor sits a man with short hair, wearing a black short-sleeved shirt and black shorts. He is holding a white plate. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"singing",
"running",
"drinking water",
"dancing",
"eating something"
],
"correct_choice": 4,
"position": [
6699
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "d5JlCEDlHGE_0",
"video_path": "d5JlCEDlHGE.mp4",
"subtitle_path": "d5JlCEDlHGE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 931.63,
"view_count": 29184
},
{
"video_id": "d5JlCEDlHGE",
"question": "In a room with a white, square-patterned floor, there is a person with some white hair, wearing a black short-sleeve shirt with red designs. He also has a beard. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Reading a book",
"Fishing",
"Drinking water",
"Playing piano",
"Talking"
],
"correct_choice": 4,
"position": [
10918
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "d5JlCEDlHGE_1",
"video_path": "d5JlCEDlHGE.mp4",
"subtitle_path": "d5JlCEDlHGE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 931.63,
"view_count": 29184
},
{
"video_id": "d5JlCEDlHGE",
"question": "Outside the room, there are two brown doors on a white wall. In front of the doors, there's a woman in gray clothes and a silver railing. On the wall, there's also a lamp emitting white light. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"walking",
"dancing",
"talking",
"running",
"reading"
],
"correct_choice": 0,
"position": [
12740
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "d5JlCEDlHGE_2",
"video_path": "d5JlCEDlHGE.mp4",
"subtitle_path": "d5JlCEDlHGE_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 931.63,
"view_count": 29184
},
{
"video_id": "Y5833KeDmp4",
"question": "In the scene, there is a man with black clothes and black hair and a woman with vintage hair styled in a court manner and wearing a light green dress. Behind them is a walkway with hanging lanterns and a large chandelier. What did the two do?",
"question_wo_referring_query": "?",
"candidates": [
"shake hands",
"embrace",
"look at each other",
"hold hands",
"run"
],
"correct_choice": 2,
"position": [
100
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "Y5833KeDmp4_0",
"video_path": "Y5833KeDmp4.mp4",
"subtitle_path": "Y5833KeDmp4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 157,
"duration": 9.01,
"view_count": 59905
},
{
"video_id": "e11Q4ThFu5A",
"question": "In the video, a man and a woman are leaning on each other. The woman is wearing a black backless dress and has golden hair, while the man is wearing a grey suit and a yellow tie. There are many other people in the background, and they are in a banquet hall. The woman is wearing a necklace and holding a cup in her hand. What material is the cup made of?",
"question_wo_referring_query": "What material is the cup made of?",
"candidates": [
"Crystal glass cup",
"Transparent glass cup",
"Metal cup",
"Ceramic cup",
"Transparent plastic cup"
],
"correct_choice": 1,
"position": [
191
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "e11Q4ThFu5A_0",
"video_path": "e11Q4ThFu5A.mp4",
"subtitle_path": "e11Q4ThFu5A_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 92,
"duration": 13.98,
"view_count": 826146
},
{
"video_id": "GRPLynULvJY",
"question": "In the scene, a woman is standing in front of a cash register, and there are two other people behind the counter. The four people in the scene are clearly visible. The woman at the register is wearing black clothes, and the woman buying coffee is wearing an olive-green trench coat. Who is the person in the scene with their head slightly bowed and smiling?",
"question_wo_referring_query": "Who is the person in the scene with their head slightly bowed and smiling?",
"candidates": [
"The black-haired woman in the white jacket",
"The black-haired woman in the olive-green jacket",
"The man in the black shirt",
"The woman in the olive-green trench coat",
"The woman in the black dress"
],
"correct_choice": 3,
"position": [
140
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E2O",
"level": "L1-Perception",
"id": "GRPLynULvJY_0",
"video_path": "GRPLynULvJY.mp4",
"subtitle_path": "GRPLynULvJY_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 54,
"duration": 14.02,
"view_count": 162730
},
{
"video_id": "T1K4rgs-1b8",
"question": "The screen shows a woman with black hair tied up. Behind her is a brick wall, she is wearing a lanyard necklace, and it seems she is also wearing a black bracelet on her hand. What object is not present in the scene?",
"question_wo_referring_query": "What object is not present in the scene?",
"candidates": [
"Red and white scarf",
"Black and white scarf",
"Brick wall",
"Red lanyard necklace",
"Black pillar"
],
"correct_choice": 0,
"position": [
260
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "T1K4rgs-1b8_0",
"video_path": "T1K4rgs-1b8.mp4",
"subtitle_path": "T1K4rgs-1b8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 258,
"duration": 12.0,
"view_count": 39459
},
{
"video_id": "3zv9RkDPX14",
"question": "In the scene, a short-haired man is standing in a forest. He is wearing a thick green coat and pointing a gun at someone in front of him. The person in front is dressed in black and has their back to the camera. What objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"Black gloves",
"Green and black gloves",
"White gloves",
"Green gloves",
"Yellow gloves"
],
"correct_choice": 0,
"position": [
203
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2O",
"level": "L1-Perception",
"id": "3zv9RkDPX14_0",
"video_path": "3zv9RkDPX14.mp4",
"subtitle_path": "3zv9RkDPX14_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 484,
"duration": 9.01,
"view_count": 30538
},
{
"video_id": "uJgZo4KxoZw",
"question": "In the scene, there is a man and a woman standing in a large room. In the room, there is a dining table with a blue tablecloth, a glass display case with black objects inside, a lot of glass, and a spiral staircase. The man is wearing a black suit and a white shirt, and he has brown hair. The woman has black hair and is wearing a gray coat. On the white wall next to the dining table, there is a protruding decoration. What is the shape of this decoration?",
"question_wo_referring_query": "What is the shape of this decoration?",
"candidates": [
"Heart",
"Square",
"Triangle",
"Circle",
"Rectangle"
],
"correct_choice": 3,
"position": [
223
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "uJgZo4KxoZw_0",
"video_path": "uJgZo4KxoZw.mp4",
"subtitle_path": "uJgZo4KxoZw_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 186,
"duration": 13.97,
"view_count": 1567763
},
{
"video_id": "oddHY1vwcjo",
"question": "A man and a woman are standing by the roadside talking. The woman has her brown hair tied up and is wearing a shirt. The man also has brown hair. In the distance, there are buildings, traffic lights, and a road. The image is blurry, and the whole scene is shrouded in darkness. What material is the man's jacket made of?",
"question_wo_referring_query": "What material is the man's jacket made of?",
"candidates": [
"A woolen jacket",
"A mohair jacket",
"A denim jacket",
"A cotton jacket",
"A silk jacket"
],
"correct_choice": 2,
"position": [
174
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "oddHY1vwcjo_0",
"video_path": "oddHY1vwcjo.mp4",
"subtitle_path": "oddHY1vwcjo_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 75,
"duration": 13.97,
"view_count": 228301
},
{
"video_id": "kWUmHAzCp7s",
"question": "A group of people wearing uniforms are standing with their backs turned at a platform. On the platform are three people, all standing in a dimly lit building with light pouring in from a rectangular window behind them, casting long shadows. When it mentions 'the boys are put through a strict training regime by the Peacekeepers. They also have to', what objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"yellow uniform",
"olive-colored uniform",
"gray uniform",
"green uniform",
"white and red uniform"
],
"correct_choice": 2,
"position": [
247
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2O",
"level": "L1-Perception",
"id": "kWUmHAzCp7s_0",
"video_path": "kWUmHAzCp7s.mp4",
"subtitle_path": "kWUmHAzCp7s_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 732,
"duration": 12.98,
"view_count": 305907
},
{
"video_id": "p7YxwveUrjI",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, an elderly woman with white hair wearing a dark green short-sleeved suit is sitting on the sofa with a book on her lap. Then, a Black man and a White man stand side by side in front of the camera, discussing something. Finally, a group of people are sitting on the grass playing.",
"First, a Black man and a White man stand side by side in front of the camera, discussing something. Then, an elderly woman with white hair wearing a dark green short-sleeved suit is sitting on the sofa with a book on her lap. Finally, a group of people are sitting on the grass playing.",
"First, a Black man and a White man stand side by side in front of the camera, discussing something. Then, an elderly woman with white hair wearing a dark green short-sleeved suit is sitting on the sofa with a book on her lap. Finally, a group of people are sitting on the grass playing.",
"First, a group of people are sitting on the grass playing. Then, a Black man and a White man stand side by side in front of the camera, discussing something. Finally, an elderly woman with white hair wearing a dark green short-sleeved suit is sitting on the sofa with a book on her lap.",
"First, an elderly woman with white hair wearing a dark green short-sleeved suit is sitting on the sofa with a book on her lap. Then, a group of people are sitting on the grass playing. Finally, a Black man and a White man stand side by side in front of the camera, discussing something."
],
"correct_choice": 0,
"position": [
12,
146,
281
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "p7YxwveUrjI_0",
"video_path": "p7YxwveUrjI.mp4",
"subtitle_path": "p7YxwveUrjI_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 12,
"duration": 14.02,
"view_count": 251020
},
{
"video_id": "H2ksp6sRR-k",
"question": "A man wearing a dark red checkered shirt and black-framed glasses is standing in front of a green background. There is a blue circular pattern on the left side of the screen and a pink circular pattern on the right side of the screen. What is this man doing when the subtitle 'person will receive the I lost scishow' appears?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Chatting with a friend",
"Speaking to the camera",
"Talking on the phone",
"Looking at his phone",
"Nodding"
],
"correct_choice": 1,
"position": [
2631
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2E",
"level": "L1-Perception",
"id": "H2ksp6sRR-k_0",
"video_path": "H2ksp6sRR-k.mp4",
"subtitle_path": "H2ksp6sRR-k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1058.68,
"view_count": 118102
},
{
"video_id": "H2ksp6sRR-k",
"question": "A woman with long curly hair wearing a white coat and black-framed glasses is standing in front of a green background. On the right side of the screen, there are green, yellow, and white English texts. What is this woman doing when the subtitle 'including a luminous Toberman right and' appears?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Raising both hands to waist height and talking to the camera",
"Playing with her hair with both hands",
"Looking at her phone",
"Talking with a friend",
"Waving at the camera"
],
"correct_choice": 0,
"position": [
6614
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2E",
"level": "L1-Perception",
"id": "H2ksp6sRR-k_1",
"video_path": "H2ksp6sRR-k.mp4",
"subtitle_path": "H2ksp6sRR-k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1058.68,
"view_count": 118102
},
{
"video_id": "H2ksp6sRR-k",
"question": "The background features a blue curtain. Three men are standing by a white powder-coated round table. The man on the left is wearing a black suit with a brick-red shirt underneath, and holding a green piece of paper in his hand. The man in the middle is dressed in a black suit with a tie and holding a green piece of paper in both hands. The man on the right is wearing a purple-blue checkered shirt and black-framed glasses, holding a green card and a pen up with both hands. What is the man on the right doing when the subtitle 'researchers named midi-chlorian' appears?",
"question_wo_referring_query": "The background features a blue curtain. Three men are standing by a white powder-coated round table. The man on the left is wearing a black suit with a brick-red shirt underneath, and holding a green piece of paper in his hand. The man in the middle is dressed in a black suit with a tie and holding a green piece of paper in both hands. The man on the right is wearing a purple-blue checkered shirt and black-framed glasses, holding a green card and a pen up with both hands. What is the man on the right doing when the subtitle 'researchers named midi-chlorian' appears?",
"candidates": [
"Closing his eyes and thinking",
"Talking to the man next to him",
"Waving to the camera",
"Looking up at the ceiling",
"Writing on the green piece of paper"
],
"correct_choice": 4,
"position": [
19581
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2E",
"level": "L1-Perception",
"id": "H2ksp6sRR-k_2",
"video_path": "H2ksp6sRR-k.mp4",
"subtitle_path": "H2ksp6sRR-k_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1058.68,
"view_count": 118102
},
{
"video_id": "TlaX2iIYZD4",
"question": "The PPT on the screen has a three-line English title at the top, chemical formulas are filled below, and there are video screens of two women on the right; the woman at the bottom right is wearing a light-colored short-sleeve shirt, has long hair, black-frame glasses, and a wristwatch with a white strap. What kind of necklace is the woman at the bottom right wearing when the subtitle 'okay so what do we do now to find the' appears?",
"question_wo_referring_query": "What kind of necklace is the woman at the bottom right wearing?",
"candidates": [
"silver necklace",
"blue gemstone necklace",
"gold square necklace",
"pearl necklace",
"green gemstone necklace"
],
"correct_choice": 2,
"position": [
38052
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "TlaX2iIYZD4_0",
"video_path": "TlaX2iIYZD4.mp4",
"subtitle_path": "TlaX2iIYZD4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2475.31,
"view_count": 17317
},
{
"video_id": "TlaX2iIYZD4",
"question": "At the top of the PPT in the video, there are three lines of English titles, with the bottom filled with chemical formulas. On the right, there are video frames of two women. The woman in the lower right is wearing a light-colored short-sleeve, long hair, and black-rimmed glasses. The woman in the upper right is wearing a black-and-white checkered jacket with a dark red inner layer. What is the hairstyle of the woman in the upper right when the subtitle 'of these and the limiting reactant is' appears?",
"question_wo_referring_query": "What is the hairstyle of the woman in the upper right?",
"candidates": [
"Short curly hair",
"Short black hair",
"Long straight black hair",
"Curly black hair",
"Brown hair"
],
"correct_choice": 2,
"position": [
38192
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "TlaX2iIYZD4_1",
"video_path": "TlaX2iIYZD4.mp4",
"subtitle_path": "TlaX2iIYZD4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2475.31,
"view_count": 17317
},
{
"video_id": "TlaX2iIYZD4",
"question": "There is a line of English text at the top of the PPT in the video, with some chemical formulas written below it. On the right side, there are two women in video frames. The woman in the lower right is sitting in front of a mirror with her head down, while the woman in the upper right with long black straight hair is wearing a black and white checkered coat and a dark red inner garment, facing the mirror. When the subtitle 'one and it\u2018s not of that I what I want' appears, what is the color of the chair that the woman in the lower right is sitting on?",
"question_wo_referring_query": "What color is it?",
"candidates": [
"Black",
"Off-white",
"Yellow",
"Olive",
"Green"
],
"correct_choice": 1,
"position": [
43656
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2A",
"level": "L1-Perception",
"id": "TlaX2iIYZD4_2",
"video_path": "TlaX2iIYZD4.mp4",
"subtitle_path": "TlaX2iIYZD4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2475.31,
"view_count": 17317
},
{
"video_id": "UO_6TQnnOxM",
"question": "The video shows the interior of a museum with two white light fixtures. Many white cabinets containing different artifacts are placed on the dark-colored floor. On the right side of the screen, a woman is sitting in a room illuminated by white lights, shown in a small video frame. What material is the protective cover placed over the artifacts on the cabinet made of?",
"question_wo_referring_query": "What material is the protective cover placed over the artifacts on the cabinet made of?",
"candidates": [
"Ceramics",
"Plastic",
"Glass",
"Iron",
"Acrylic"
],
"correct_choice": 2,
"position": [
2924
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "UO_6TQnnOxM_0",
"video_path": "UO_6TQnnOxM.mp4",
"subtitle_path": "UO_6TQnnOxM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3573.96,
"view_count": 2756
},
{
"video_id": "UO_6TQnnOxM",
"question": "The screen shows two black men sitting and performing music and singing under the lights. The man towards the left-back is wearing an olive-colored suit, while the man on the right is holding a wooden guitar and wearing a blue patterned robe and singing into a microphone. In the top-right corner of the screen, there is a small video of a black man wearing a white shirt with a blue collar. What is the hairstyle of the man holding the guitar?",
"question_wo_referring_query": "What is the hairstyle of the man holding the guitar?",
"candidates": [
"Long blonde hair",
"Short blonde hair",
"Black cornrows",
"Black afro",
"Short curly hair"
],
"correct_choice": 2,
"position": [
60713
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "UO_6TQnnOxM_1",
"video_path": "UO_6TQnnOxM.mp4",
"subtitle_path": "UO_6TQnnOxM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3573.96,
"view_count": 2756
},
{
"video_id": "UO_6TQnnOxM",
"question": "The scene shows two Black men sitting under a lamp performing music and singing. The man to the back left is wearing an olive suit, and the man on the right is holding a guitar and wearing a floral-patterned blue robe while singing into a microphone. In the top right corner of the screen, there's a small video frame of another Black man wearing a white shirt with a blue collar. What type of guitar is the man holding while singing into the microphone?",
"question_wo_referring_query": "What type of guitar is the man holding while singing into the microphone?",
"candidates": [
"black wood guitar",
"orange wood guitar",
"white electric guitar",
"yellow electric guitar",
"black electric guitar"
],
"correct_choice": 1,
"position": [
68846
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "UO_6TQnnOxM_2",
"video_path": "UO_6TQnnOxM.mp4",
"subtitle_path": "UO_6TQnnOxM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 3573.96,
"view_count": 2756
},
{
"video_id": "Aau-XoIebno",
"question": "A man wearing a crimson jacket, black pants, black-framed glasses, and with black hair is sitting cross-legged in front of a silver screen. He is holding a microphone in one hand and his other hand is open, suspended by his leg. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"Scratching an itch",
"Talking into the microphone",
"Holding a water bottle",
"Waving to the audience",
"Making a phone call"
],
"correct_choice": 1,
"position": [
15433
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "Aau-XoIebno_0",
"video_path": "Aau-XoIebno.mp4",
"subtitle_path": "Aau-XoIebno_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1499.57,
"view_count": 12738
},
{
"video_id": "Aau-XoIebno",
"question": "There are two men sitting on a stage. The man on the left is wearing a black jacket, black pants, and khaki shoes. He is crossing his legs and holding a microphone resting on his leg. The man on the right is wearing glasses, a khaki jacket, black pants, and olive shoes. He is also crossing his legs. What is he doing with his right hand?",
"question_wo_referring_query": "There are two men sitting on a stage. The man on the left is wearing a black jacket, black pants, and khaki shoes. He is crossing his legs and holding a microphone resting on his leg. The man on the right is wearing glasses, a khaki jacket, black pants, and olive shoes. He is also crossing his legs. What is he doing with his right hand?",
"candidates": [
"Pointing to the ground",
"Holding a bottle of mineral water and drinking it",
"Scratching his head",
"Waving to the audience",
"Adjusting his glasses"
],
"correct_choice": 4,
"position": [
26208
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "Aau-XoIebno_1",
"video_path": "Aau-XoIebno.mp4",
"subtitle_path": "Aau-XoIebno_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1499.57,
"view_count": 12738
},
{
"video_id": "Aau-XoIebno",
"question": "On screen, two men are sitting on a stage. The man on the right is wearing glasses, a beige coat, and brown shoes, sitting cross-legged with a microphone in his hands resting on his lap. The man on the left is wearing a black coat, black pants, and beige shoes, also sitting cross-legged with a microphone in his hand. What is he doing with his raised right hand?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Raising his right hand and pointing towards the man next to him",
"Raising his right hand and waving",
"Raising his right hand and drinking water",
"Raising his right hand and pointing towards the stage",
"Raising his right hand and scratching his head"
],
"correct_choice": 3,
"position": [
30280
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "Aau-XoIebno_2",
"video_path": "Aau-XoIebno.mp4",
"subtitle_path": "Aau-XoIebno_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1499.57,
"view_count": 12738
},
{
"video_id": "O6UedmnRJc0",
"question": "The background is a white wall, with a black cabinet on the right side. A man wearing a white long-sleeved shirt and black pants stands next to a round wooden table with fruit and food on it, speaking to the camera. In which other scenes does this man appear?",
"question_wo_referring_query": "In which other scenes does this man appear?",
"candidates": [
"On a sandy beach by the sea",
"On a mountain covered with plants",
"Next to a man in a gym wearing a black tank top holding two dumbbells",
"By a swimming pool",
"In a dining hall"
],
"correct_choice": 2,
"position": [
6790,
9497
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "O6UedmnRJc0_0",
"video_path": "O6UedmnRJc0.mp4",
"subtitle_path": "O6UedmnRJc0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1035.12,
"view_count": 4497825
},
{
"video_id": "O6UedmnRJc0",
"question": "In a warmly lit office, a man wearing a gray short-sleeve shirt and a wristwatch is sitting face-to-face with a woman in a white coat with golden curly hair at a wooden table. In what other scenes has this woman appeared?",
"question_wo_referring_query": "In what other scenes has she appeared?",
"candidates": [
"In a gym.",
"In a coffee shop.",
"In a hospital lobby.",
"In a swimming pool.",
"In an office with wooden walls, beside a black screen and a laptop, opposite a man in a light denim jacket."
],
"correct_choice": 4,
"position": [
2398,
20765
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "O6UedmnRJc0_1",
"video_path": "O6UedmnRJc0.mp4",
"subtitle_path": "O6UedmnRJc0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1035.12,
"view_count": 4497825
},
{
"video_id": "O6UedmnRJc0",
"question": "The background features sunlit green trees and a doorway. A man wearing a white short-sleeved shirt, black pants, and sporting a goatee is sitting on a white sofa outside. In what other scenes does this man appear?",
"question_wo_referring_query": "In what other scenes does this man appear?",
"candidates": [
"On a rooftop terrace",
"In the driver's seat of a car illuminated by sunlight",
"Inside a cafe",
"Inside a library",
"Inside a museum"
],
"correct_choice": 1,
"position": [
15478,
16289
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "O6UedmnRJc0_2",
"video_path": "O6UedmnRJc0.mp4",
"subtitle_path": "O6UedmnRJc0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1035.12,
"view_count": 4497825
},
{
"video_id": "9WjElCiDpzM",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, the braided-haired woman holding a book opens it to a page where the left side shows a long branch with some people on it, and the right side is blue. Then, the dark-skinned woman on the white sofa lifts a book with the cover marked 'EXTRA YARN'. Finally, the dark-skinned woman on the white sofa holds a colored pencil in each hand.",
"First, the dark-skinned woman on the white sofa holds a colored pencil in each hand. Then, she lifts a book with the cover marked 'EXTRA YARN'. Lastly, the braided-haired woman holds a book, opens it to a page where the left side shows a long branch with some people on it, and the right side is blue.",
"First, a dark-skinned woman sitting on a white sofa lifts a book with the cover marked 'EXTRA YARN'. Then, a woman with braided hair holding a book opens to a page with a long branch on the left side with some people on it. The right side of the page is blue. Finally, the dark-skinned woman on the white sofa holds a colored pencil in each hand.",
"First, a dark-skinned woman sitting on a white sofa lifts a book with the cover marked 'EXTRA YARN'. Then, the same woman on the white sofa holds a colored pencil in each hand. Lastly, the braided-haired woman holds a book, opens it, and the left page shows a long branch with some people on it, while the right page is blue.",
"First, the braided-haired woman holding a book opens it to a page where the left side shows a long branch with some people on it, and the right side is blue. Then, the dark-skinned woman on the white sofa holds a colored pencil in each hand. Lastly, the dark-skinned woman on the white sofa lifts a book with the cover marked 'EXTRA YARN'."
],
"correct_choice": 2,
"position": [
1777,
12207,
16036
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "9WjElCiDpzM_0",
"video_path": "9WjElCiDpzM.mp4",
"subtitle_path": "9WjElCiDpzM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 949.83,
"view_count": 682
},
{
"video_id": "9WjElCiDpzM",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a woman holding an open book appears; on the left page of the book, a person is standing under a tree, and on the right page, a man with red hair wearing a backpack appears against a black background. Then, a dark-skinned woman sitting on a white sofa places the index finger of her right hand on her lips, and finally, a dark-skinned woman with tightly braided hair holds two small wooden blocks in her hands.",
"First, a dark-skinned woman with tightly braided hair holds two small wooden blocks in her hands, then a dark-skinned woman sitting on a white sofa places the index finger of her right hand on her lips, and lastly, a woman holding an open book appears; on the left page of the book, a person is standing under a tree, and on the right page, a man with red hair wearing a backpack appears against a black background.",
"First, a dark-skinned woman sitting on a white sofa places the index finger of her right hand on her lips, then a woman holding an open book appears; on the left page of the book, a person is standing under a tree, and on the right page, a man with red hair wearing a backpack appears against a black background. Finally, a dark-skinned woman with tightly braided hair holds two small wooden blocks in her hands.",
"First, a woman holding an open book appears; on the left page of the book, a person is standing under a tree, and on the right page, a man with red hair wearing a backpack appears against a black background. Then, a dark-skinned woman with tightly braided hair holds two small wooden blocks in her hands, and finally, a dark-skinned woman sitting on a white sofa places the index finger of her right hand on her lips.",
"First, a dark-skinned woman sitting on a white sofa places the index finger of her right hand on her lips, then a dark-skinned woman with tightly braided hair holds two small wooden blocks in her hands, and lastly, a woman holding an open book appears; on the left page of the book, a person is standing under a tree, and on the right page, a man with red hair wearing a backpack appears against a black background."
],
"correct_choice": 2,
"position": [
1464,
6037,
16441
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "9WjElCiDpzM_1",
"video_path": "9WjElCiDpzM.mp4",
"subtitle_path": "9WjElCiDpzM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 949.83,
"view_count": 682
},
{
"video_id": "9WjElCiDpzM",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which scenario is in the correct order?",
"candidates": [
"First, a woman sitting on a white sofa, combing a doll, holds scissors in her right hand and a coloring pencil in her left hand; then, the dark-skinned woman secures a small white wooden block with the yarn; finally, a dark-skinned woman combing a doll holds a tuft of yarn.",
"First, a woman sitting on a white sofa, combing a doll, holds scissors in her right hand and a coloring pencil in her left hand; then, a dark-skinned woman combing a doll holds a tuft of yarn; finally, the dark-skinned woman secures a small white wooden block with the yarn.",
"First, a dark-skinned woman combing a doll holds a tuft of yarn; then, a woman sitting on a white sofa, combing a doll, holds scissors in her right hand and a coloring pencil in her left hand; finally, the dark-skinned woman secures a small white wooden block with the yarn.",
"First, the dark-skinned woman secures a small white wooden block with the yarn; then, a dark-skinned woman combing a doll holds a tuft of yarn; finally, a woman sitting on a white sofa, combing a doll, holds scissors in her right hand and a coloring pencil in her left hand.",
"First, a dark-skinned woman combing a doll holds a tuft of yarn; then, the dark-skinned woman secures a small white wooden block with the yarn; finally, a woman sitting on a white sofa, combing a doll, holds scissors in her right hand and a coloring pencil in her left hand."
],
"correct_choice": 2,
"position": [
15742,
16151,
17428
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "9WjElCiDpzM_2",
"video_path": "9WjElCiDpzM.mp4",
"subtitle_path": "9WjElCiDpzM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 949.83,
"view_count": 682
},
{
"video_id": "ClYmTkGTGYg",
"question": "A painting is hanging on the wall. In the painting, a woman wearing a long dress is raising her right hand and looking forward. Next to her stands a man with short hair and his hands naturally hanging down. What kind of outerwear is this man wearing?",
"question_wo_referring_query": "What kind of outerwear is this man wearing?",
"candidates": [
"Red suit",
"White sweater",
"Red windbreaker",
"Gray sweater",
"Red long-sleeved jacket"
],
"correct_choice": 4,
"position": [
2745
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "ClYmTkGTGYg_0",
"video_path": "ClYmTkGTGYg.mp4",
"subtitle_path": "ClYmTkGTGYg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1512.98,
"view_count": 7031
},
{
"video_id": "ClYmTkGTGYg",
"question": "There is a flower bed in the center of the screen, in which green plants and flowers are planted. There are two people beside the flower bed, one stands looking at the flower bed, and the other bends down to admire the flowers. What kind of outer garment is the person who is bending down wearing?",
"question_wo_referring_query": "What kind of outer garment is the person who is bending down wearing?",
"candidates": [
"Gray long-sleeve coat",
"Black leather jacket",
"Gray wool sweater",
"Black knit shirt",
"Gray knit shirt"
],
"correct_choice": 0,
"position": [
21403
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "ClYmTkGTGYg_1",
"video_path": "ClYmTkGTGYg.mp4",
"subtitle_path": "ClYmTkGTGYg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1512.98,
"view_count": 7031
},
{
"video_id": "ClYmTkGTGYg",
"question": "There is a small garden surrounded by white columns, where the greenery is luxuriant. Three people are in the garden. The person on the right has short hair, the person in the middle is facing a mirror, and the person on the left is wearing a blue coat. What kind of hairstyle does the person wearing a blue coat have?",
"question_wo_referring_query": "What kind of hairstyle does the person wearing a blue coat have?",
"candidates": [
"Blond short curly hair",
"Blond bob cut",
"Blond shoulder-length straight hair",
"White long hair",
"Crew cut"
],
"correct_choice": 2,
"position": [
27423
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "ClYmTkGTGYg_2",
"video_path": "ClYmTkGTGYg.mp4",
"subtitle_path": "ClYmTkGTGYg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1512.98,
"view_count": 7031
},
{
"video_id": "b__dUom9AcQ",
"question": "In front of the curtain, there is a transparent lectern with red text on it. Next to the lectern, a person wearing a blue coat is raising their right hand and speaking with their head lowered. What item is present in this scene?",
"question_wo_referring_query": "What item is present in this scene?",
"candidates": [
"Tie",
"Belt",
"Microphone",
"Display Screen",
"Mobile Phone"
],
"correct_choice": 2,
"position": [
4607
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "b__dUom9AcQ_0",
"video_path": "b__dUom9AcQ.mp4",
"subtitle_path": "b__dUom9AcQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2454.92,
"view_count": 1862
},
{
"video_id": "b__dUom9AcQ",
"question": "On stage, a large screen in the background shows a scene of a building, and in the center of the stage, two people are seated on chairs. One is a bald man wearing a blue suit, and the other is a person in a white shirt, holding a microphone in the left hand and raising the right hand. What object is not present in this scene?",
"question_wo_referring_query": "What object is not present in this scene?",
"candidates": [
"table",
"tie",
"glass cup",
"red lamp",
"car"
],
"correct_choice": 4,
"position": [
21487
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "b__dUom9AcQ_1",
"video_path": "b__dUom9AcQ.mp4",
"subtitle_path": "b__dUom9AcQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2454.92,
"view_count": 1862
},
{
"video_id": "b__dUom9AcQ",
"question": "In the black background, there are two screens. On the left, there is a white car passing by on a road in front of a construction building. On the right, there are two people conversing on a red carpet stage. What objects are present in this screen?",
"question_wo_referring_query": "What objects are present in this screen?",
"candidates": [
"helicopter",
"green stone lion",
"black car",
"traffic light",
"handgun"
],
"correct_choice": 3,
"position": [
53780
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2O",
"level": "L1-Perception",
"id": "b__dUom9AcQ_2",
"video_path": "b__dUom9AcQ.mp4",
"subtitle_path": "b__dUom9AcQ_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2454.92,
"view_count": 1862
},
{
"video_id": "JDtVwz1R-kI",
"question": "In a restroom, the door to a room with a toilet is open, and there's a sink outside the door. A man wearing a gray short-sleeve shirt and carrying a backpack stands with his right thumb up. Where else has this man appeared?",
"question_wo_referring_query": "Where else has this man appeared?",
"candidates": [
"On a motorcycle",
"In a black car",
"In a helicopter",
"On a suspension bridge",
"On the beach"
],
"correct_choice": 4,
"position": [
3826,
17628
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "JDtVwz1R-kI_0",
"video_path": "JDtVwz1R-kI.mp4",
"subtitle_path": "JDtVwz1R-kI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1197.24,
"view_count": 908702
},
{
"video_id": "JDtVwz1R-kI",
"question": "There are people resting in front of the flower bed in the spacious park. Not far from the flower bed, two people are standing. On the left is a man wearing gray clothes, and on the right is a woman wearing black clothes who is drinking a beverage. Where has this woman appeared before?",
"question_wo_referring_query": "Where has this woman appeared before?",
"candidates": [
"In the subway",
"In the convenience store",
"On the bridge",
"In the restroom",
"In the hotel room"
],
"correct_choice": 2,
"position": [
23588,
24797
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "JDtVwz1R-kI_1",
"video_path": "JDtVwz1R-kI.mp4",
"subtitle_path": "JDtVwz1R-kI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1197.24,
"view_count": 908702
},
{
"video_id": "JDtVwz1R-kI",
"question": "In a room, the ceiling light emits a blend of colors, a red typewriter is placed on the black table, and beside the table, a woman with dark skin wearing gray clothing is holding a bottle of mineral water. Where has this bottle of mineral water been placed?",
"question_wo_referring_query": "Where has this bottle of mineral water been placed?",
"candidates": [
"On the brown table",
"In the trash can",
"In the car",
"On the airplane",
"On the white bed"
],
"correct_choice": 0,
"position": [
21689,
22431
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "JDtVwz1R-kI_2",
"video_path": "JDtVwz1R-kI.mp4",
"subtitle_path": "JDtVwz1R-kI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1197.24,
"view_count": 908702
},
{
"video_id": "_ZIa6SEJEyg",
"question": "Against the white background, there are texts in both pink and black colors. In the bottom right corner of the screen, there is a woman with long black hair, wearing a pink outer garment and a black inner garment. She is raising both hands with palms facing inwards. Before the subtitle mentions 'see this topic on final exams so if you,' what number appears in the screen's bottom right corner?",
"question_wo_referring_query": "What number appears in the screen's bottom right corner?",
"candidates": [
"3",
"4",
"7",
"6",
"5"
],
"correct_choice": 0,
"position": [
34753,
7206
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3O",
"level": "L2-Relation",
"id": "_ZIa6SEJEyg_0",
"video_path": "_ZIa6SEJEyg.mp4",
"subtitle_path": "_ZIa6SEJEyg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1463.26,
"view_count": 77419
},
{
"video_id": "_ZIa6SEJEyg",
"question": "On a white background with some text and mathematical formulas, there is an orange dialogue box. In the lower right corner, a woman wearing white clothing is extending two fingers on her left hand. After the subtitle mentions 'the coffee beans and the milk so that\u2019s,' what is the green letter that appears in the video?",
"question_wo_referring_query": ", what is the green letter that appears in the video?",
"candidates": [
"F",
"used",
"O",
"g",
"R"
],
"correct_choice": 1,
"position": [
16521,
26058
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3O",
"level": "L2-Relation",
"id": "_ZIa6SEJEyg_1",
"video_path": "_ZIa6SEJEyg.mp4",
"subtitle_path": "_ZIa6SEJEyg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1463.26,
"view_count": 77419
},
{
"video_id": "_ZIa6SEJEyg",
"question": "In the bottom right corner of the screen, there is a woman with long hair wearing white. Her left hand is open with the palm facing up. In the remaining white background, there are words and a red light spot, which stops below the letter 'e'. After the subtitles mention 'out the leftover amount you\u2019re gonna,' what object appears in the hand of the woman wearing white?",
"question_wo_referring_query": "What object appears in the hand of the woman wearing white?",
"candidates": [
"cup without a straw",
"milk carton",
"green cup",
"coffee beans",
"a pen"
],
"correct_choice": 2,
"position": [
25619,
30745
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T3O",
"level": "L2-Relation",
"id": "_ZIa6SEJEyg_2",
"video_path": "_ZIa6SEJEyg.mp4",
"subtitle_path": "_ZIa6SEJEyg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1463.26,
"view_count": 77419
},
{
"video_id": "-An3wZyoYe0",
"question": "In the orange-red sunset, what is the girl, who is sitting back to the camera on a kayak with short sleeves and hair tied, doing when the subtitle 'exciting activities like kayaking and' appears?",
"question_wo_referring_query": "What is the girl doing?",
"candidates": [
"Admiring the sunset on the kayak",
"Sitting on the kayak and waving her hands",
"Chatting with friends",
"Taking photos with a mobile phone",
"Rowing the boat on the water"
],
"correct_choice": 4,
"position": [
2536
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "-An3wZyoYe0_0",
"video_path": "-An3wZyoYe0.mp4",
"subtitle_path": "-An3wZyoYe0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 996.53,
"view_count": 434
},
{
"video_id": "-An3wZyoYe0",
"question": "In the video, a woman dressed in a blue suit with black hair adorned with a pink flower is squatting in a green vegetable garden. Next to her, a little girl with golden hair wearing a white outfit and a white hat is also squatting. When the subtitle 'making conservation education enjoyable' appears, what are they doing?",
"question_wo_referring_query": "What are they doing?",
"candidates": [
"Picking vegetables",
"Planting flowers",
"Planting vegetables",
"Watering the vegetables",
"Playing a game"
],
"correct_choice": 0,
"position": [
8473
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "-An3wZyoYe0_1",
"video_path": "-An3wZyoYe0.mp4",
"subtitle_path": "-An3wZyoYe0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 996.53,
"view_count": 434
},
{
"video_id": "-An3wZyoYe0",
"question": "In the blue seawater, there are coral reefs and seashells. On the screen, there is a person wearing a diving suit. When the subtitle 'beaches and plenty of drilling dive' appears, what is this person in the diving suit doing?",
"question_wo_referring_query": "What is the person in the diving suit doing?",
"candidates": [
"Diving in the seawater",
"Performing an underwater show",
"Playing with a sea turtle",
"Waving at the camera",
"Catching seafood"
],
"correct_choice": 0,
"position": [
19472
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "-An3wZyoYe0_2",
"video_path": "-An3wZyoYe0.mp4",
"subtitle_path": "-An3wZyoYe0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 996.53,
"view_count": 434
},
{
"video_id": "lnCPn8gX3FU",
"question": "In a restaurant, what is the man wearing a white shirt and a gray patterned jacket doing, standing to the right of a black-haired girl with Liu Hai hairstyle in white clothes?",
"question_wo_referring_query": "What is the man wearing a white shirt and a gray patterned jacket doing?",
"candidates": [
"Hugging the girl",
"Pulling the girl next to him and looking at her",
"Pinching the girl's cheek",
"Kissing",
"Touching the girl's head"
],
"correct_choice": 1,
"position": [
123
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "lnCPn8gX3FU_0",
"video_path": "lnCPn8gX3FU.mp4",
"subtitle_path": "lnCPn8gX3FU_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 378,
"duration": 8.01,
"view_count": 36577
},
{
"video_id": "49YMA0f1yhI",
"question": "In the driver's seat of a car, a tense-looking man with a seatbelt on is driving. The car is speeding down the road. What color shirt is the man wearing?",
"question_wo_referring_query": ", What color shirt is the man wearing?",
"candidates": [
"Blue",
"Brown",
"Green",
"White",
"Black"
],
"correct_choice": 1,
"position": [
101
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "49YMA0f1yhI_0",
"video_path": "49YMA0f1yhI.mp4",
"subtitle_path": "49YMA0f1yhI_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1,
"duration": 11.01,
"view_count": 7193
},
{
"video_id": "hXFfPjytMo0",
"question": "In a room with a grey chair, two people are sitting on a sofa watching a broadcasted news program on TV. On the TV screen, a blond female reporter in a red coat is speaking into a microphone. When the subtitle says 'but after seeing that Terence is being accused of involvement in the murder as well as the media's...', what is the color of the wall behind the TV?",
"question_wo_referring_query": "What is the color of the wall behind the TV?",
"candidates": [
"red",
"green",
"white",
"blue",
"black"
],
"correct_choice": 3,
"position": [
71
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "hXFfPjytMo0_0",
"video_path": "hXFfPjytMo0.mp4",
"subtitle_path": "hXFfPjytMo0_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 140,
"duration": 10.0,
"view_count": 195399
},
{
"video_id": "shGkzHpzwAQ",
"question": "In a dimly lit room with a bed and a lamp, a woman picks up a black phone with one hand and puts it to her ear. When the subtitle says 'it also wonders why she isn't following the rules. Greta drops the phone, but then there's knocking', what type of clothing is this woman wearing?",
"question_wo_referring_query": "What type of clothing is this woman wearing?",
"candidates": [
"nurse's uniform",
"sweater",
"T-shirt",
"suit",
"wool coat"
],
"correct_choice": 2,
"position": [
49
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "shGkzHpzwAQ_0",
"video_path": "shGkzHpzwAQ.mp4",
"subtitle_path": "shGkzHpzwAQ_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 605,
"duration": 11.01,
"view_count": 3725038
},
{
"video_id": "8Qe03WDCrB4",
"question": "In the scene, there is a group of people, young and old, male and female, gathered in front of a building watching a performance on a stage. What object exists in this scene?",
"question_wo_referring_query": "What object exists in this scene?",
"candidates": [
"Basketball",
"Olive",
"Soccer ball",
"Ping pong ball",
"Balloon"
],
"correct_choice": 2,
"position": [
74
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "8Qe03WDCrB4_0",
"video_path": "8Qe03WDCrB4.mp4",
"subtitle_path": "8Qe03WDCrB4_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 91,
"duration": 8.01,
"view_count": 351393
},
{
"video_id": "58mCJVNMJvw",
"question": "At the very beginning, a man with a mustache and curly hair says in the subtitles 'the prospective bride, where both families had agreed to marry Aziz and Isa's daughter. However, what color is this man's hair?",
"question_wo_referring_query": "What color is this man's hair?",
"candidates": [
"Brown",
"Green",
"Black",
"Blonde",
"Red"
],
"correct_choice": 2,
"position": [
33
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "58mCJVNMJvw_0",
"video_path": "58mCJVNMJvw.mp4",
"subtitle_path": "58mCJVNMJvw_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 729,
"duration": 12.98,
"view_count": 208892
},
{
"video_id": "ErGYJ7kqIow",
"question": "In the red video frame, after a man with a mustache lies down facing the left side, what event happens on the screen?",
"question_wo_referring_query": "What event happens on the screen?",
"candidates": [
"The man has a green ribbon tied over his eyes.",
"The man has a green ribbon tied around his neck.",
"The man has a white ribbon in his mouth.",
"The man has a red ribbon in his mouth.",
"The man's eyes are covered."
],
"correct_choice": 2,
"position": [
280,
337
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "ErGYJ7kqIow_0",
"video_path": "ErGYJ7kqIow.mp4",
"subtitle_path": "ErGYJ7kqIow_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 380,
"duration": 14.02,
"view_count": 564
},
{
"video_id": "@recipesbyanne-7141686631676808454",
"question": "On a white marble countertop, there is a rectangular container lined with white paper. On the white paper, there is a large piece of chocolate-colored dessert sprinkled with yellow hard fruit bits. Which subtitles have appeared simultaneously with this chocolate dessert?",
"question_wo_referring_query": "Which subtitles have appeared simultaneously?",
"candidates": [
"Doing it all night, all summer'in Chinese Simplified can be translated",
"live my day as if it was no past",
"Doing it all night, all summer",
"It looks good to eat",
"Got my day, my heart out till the dawn"
],
"correct_choice": 4,
"position": [
195,
292
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TOS",
"level": "L2-Relation",
"id": "@recipesbyanne-7141686631676808454_0",
"video_path": "7141686631676808454.mp4",
"subtitle_path": "7141686631676808454_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 13.53,
"view_count": 10253
},
{
"video_id": "@recipesbyanne-7112477530824756486",
"question": "On a wooden table, a green container pours chocolate sauce into a red cake mold. After the subtitle 'Nom, nom, nom, nom, nom, nom, nom.' appears, what is the first object shown on the screen?",
"question_wo_referring_query": "What is the first object shown on the screen?",
"candidates": [
"Three cake molds filled with chocolate sauce",
"A chocolate cake with a bite taken out of it",
"Cake drizzled with chocolate sauce",
"A bowl of jelly pieces",
"Cake topped with jelly pieces"
],
"correct_choice": 0,
"position": [
32,
47
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@recipesbyanne-7112477530824756486_0",
"video_path": "7112477530824756486.mp4",
"subtitle_path": "7112477530824756486_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.17,
"view_count": 2660
},
{
"video_id": "@asianfoodrecipes-7109221563697876225",
"question": "On a wooden table, there is a wooden cutting board covered with a layer of white paper. There are six pieces of tofu on the cutting board with red sauce spread on top. On the right side of the screen, there is a pink brush. What happens in the screen at this moment?",
"question_wo_referring_query": "What happens in the screen at this moment?",
"candidates": [
"The tofu pieces are being stir-fried",
"The brush is applying oil to the tofu pieces",
"The brush is applying sauce to the tofu pieces",
"The tofu pieces are being fried",
"The tofu pieces are being baked"
],
"correct_choice": 2,
"position": [
141
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@asianfoodrecipes-7109221563697876225_0",
"video_path": "7109221563697876225.mp4",
"subtitle_path": "7109221563697876225_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.95,
"view_count": 7140
},
{
"video_id": "@recipesbyanne-7234536361296940314",
"question": "In the black pot with some broth, tomato sauce, and yellow vegetable pieces, what items are visible when the subtitle 'You still give me butterflies, my butterfly' appears?",
"question_wo_referring_query": "What items are visible in the scene?",
"candidates": [
"Plate",
"Chopsticks",
"Lid",
"Garlic",
"Green vegetable"
],
"correct_choice": 3,
"position": [
132
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "@recipesbyanne-7234536361296940314_0",
"video_path": "7234536361296940314.mp4",
"subtitle_path": "7234536361296940314_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 11.03,
"view_count": 866061
},
{
"video_id": "@healthfood-6880172230588894465",
"question": "The upper left side of the screen shows a window overlooking the street below. When the yellow-green packaged snack first appears on the screen, what is happening in the scene?",
"question_wo_referring_query": "What is happening in the scene?",
"candidates": [
"The screen transitions from blurry to clear",
"Pouring the snacks out",
"Tearing open the snack package",
"The person in the video is eating the snacks",
"Showing the snack package to the camera"
],
"correct_choice": 4,
"position": [
38
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@healthfood-6880172230588894465_0",
"video_path": "6880172230588894465.mp4",
"subtitle_path": "6880172230588894465_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.07,
"view_count": 3326
},
{
"video_id": "@healthfood-6917750501975100677",
"question": "There are some scattered white chocolate balls on a white table, a knife is placed on a chocolate ball that has been cut in half. When the subtitle 'Please' appears, what happens on the screen?",
"question_wo_referring_query": ", what happens on the screen?",
"candidates": [
"Picking up the chocolate ball",
"Cutting the chocolate ball",
"Pouring sauce on the chocolate ball",
"Eating the chocolate ball",
"Decorating the chocolate ball"
],
"correct_choice": 1,
"position": [
29
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@healthfood-6917750501975100677_0",
"video_path": "6917750501975100677.mp4",
"subtitle_path": "6917750501975100677_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.03,
"view_count": 15300
},
{
"video_id": "@tiffycooks-7022693587703926022",
"question": "After a woman with green nail polish sitting in a red seat in the car holds up an eyebrow pencil to the mirror, what does she do first?",
"question_wo_referring_query": "What does she do first?",
"candidates": [
"Apply eyebrow brush",
"Drive",
"Pound squash in the kitchen",
"Draw eyebrows",
"Toss squash in an iron pan"
],
"correct_choice": 3,
"position": [
108,
164
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@tiffycooks-7022693587703926022_0",
"video_path": "7022693587703926022.mp4",
"subtitle_path": "7022693587703926022_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 12.13,
"view_count": 224075
},
{
"video_id": "@recipesbyanne-7231566545263103259",
"question": "On a white marble tabletop, there is a white bowl with some brown sauce inside. There is a glass bottle pouring oil onto a metal spatula over the bowl. After the subtitle 'All I need is your love tonight.' appears, what object appears?",
"question_wo_referring_query": "What object appears?",
"candidates": [
"Raw meat chunk",
"White sugar",
"Used oil",
"White vinegar",
"Half a lime"
],
"correct_choice": 4,
"position": [
121,
158
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@recipesbyanne-7231566545263103259_0",
"video_path": "7231566545263103259.mp4",
"subtitle_path": "7231566545263103259_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.22,
"view_count": 444000
},
{
"video_id": "@healthfood-7115052132578921774",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, pour olive oil into the iron pot, then demonstrate the olive oil in the pot against the background of a white cabinet and wooden countertop, and finally demonstrate the olive oil with dark green packaging and glass bottle.",
"First, demonstrate the olive oil in the pot against the background of a white cabinet and wooden countertop, then pour the olive oil into the iron pot, and finally demonstrate the olive oil with dark green packaging and glass bottle.",
"First, demonstrate the olive oil with dark green packaging and glass bottle, then pour the olive oil into the iron pot, and finally demonstrate the olive oil in the pot against the background of a white cabinet and wooden countertop.",
"First, pour olive oil into the iron pot, then demonstrate the olive oil with dark green packaging and glass bottle, and finally demonstrate the olive oil in the pot against the background of a white cabinet and wooden countertop.",
"First, demonstrate the olive oil with dark green packaging and glass bottle, then demonstrate the olive oil in the pot against the background of a white cabinet and wooden countertop, and finally pour the olive oil into the iron pot."
],
"correct_choice": 2,
"position": [
35,
86,
152
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@healthfood-7115052132578921774_0",
"video_path": "7115052132578921774.mp4",
"subtitle_path": "7115052132578921774_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 8.57,
"view_count": 913600
},
{
"video_id": "@healthfood-7200404243168415022",
"question": "In the video, there is a white bowl on a wooden grain table, containing white condensed milk. When the condensed milk is placed on a piece of green vegetable and then put on a wooden board, what kind of change occurs?",
"question_wo_referring_query": "In the video, there is a white bowl on a wooden grain table, containing white condensed milk. When the condensed milk is placed on a piece of green vegetable and then put on a wooden board, what kind of change occurs?",
"candidates": [
"Red bread crumbs are sprinkled on top",
"Red goji berries are sprinkled on top",
"Red dates are sprinkled on top",
"Red goji berries are sprinkled on top",
"Red dried strawberries are sprinkled on top"
],
"correct_choice": 0,
"position": [
145,
250
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "@healthfood-7200404243168415022_0",
"video_path": "7200404243168415022.mp4",
"subtitle_path": "7200404243168415022_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.12,
"view_count": 41428
},
{
"video_id": "@healthfood-6867204066108329221",
"question": "A lady with gray hair tied up is sitting on a sofa watching TV, holding a white bowl in one hand and a spoon in the other. There's a mirror on the white wall behind her reflecting the TV screen. When 'Who you gonna call?' is mentioned, what object is not present in the scene?",
"question_wo_referring_query": "What object is not present in the scene?",
"candidates": [
"white socks",
"green pillow",
"blue pillow",
"black belt",
"black pillow"
],
"correct_choice": 4,
"position": [
221
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "@healthfood-6867204066108329221_0",
"video_path": "6867204066108329221.mp4",
"subtitle_path": "6867204066108329221_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.07,
"view_count": 1370
},
{
"video_id": "@recipesbyanne-7190395053343296774",
"question": "The rectangular box is filled with brown-colored desserts, and it seems to have some white nuts inside. Chocolate sauce is drizzled over the food in the box. What happened after the chocolate sauce was drizzled on?",
"question_wo_referring_query": ", what happened?",
"candidates": [
"The food got burnt",
"The food formed into cubes",
"The food formed into long strips",
"Oats were sprinkled on the food",
"Yellow nuts were sprinkled on the food"
],
"correct_choice": 1,
"position": [
150,
201
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@recipesbyanne-7190395053343296774_0",
"video_path": "7190395053343296774.mp4",
"subtitle_path": "7190395053343296774_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 8.8,
"view_count": 59914
},
{
"video_id": "@recipesbyanne-7161798161395240197",
"question": "The screen shows a prepared wrap, with a piece cut off and held by a pair of hands. The background is a white marble table. In which other scene does this wrap appear?",
"question_wo_referring_query": "In which other scene does this wrap appear?",
"candidates": [
"Still on a marble table, there is a round green plate with the wrap covered in sesame seeds.",
"Still on a marble table, there is a round yellow plate with the wrap covered in sesame seeds.",
"Still on a marble table, there is a round black plate with the wrap covered in sesame seeds.",
"Still on a marble table, there is a round white plate with the wrap covered in sesame seeds.",
"Still on a marble table, there is a round pink plate with the wrap covered in sesame seeds."
],
"correct_choice": 3,
"position": [
233,
296
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "@recipesbyanne-7161798161395240197_0",
"video_path": "7161798161395240197.mp4",
"subtitle_path": "7161798161395240197_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 12.97,
"view_count": 24700
},
{
"video_id": "@healthfood-6999715804623293702",
"question": "The screen shows a pile of colorful foods, with a white paper towel underneath, a silver plate below the paper towel, and the plate has floral patterns. What is not present in the screen?",
"question_wo_referring_query": "The screen shows a pile of colorful foods, with a white paper towel underneath, a silver plate below the paper towel, and the plate has floral patterns. What is not present in the screen?",
"candidates": [
"Yellow bell peppers",
"Cherry tomatoes",
"Cauliflower florets",
"Carrot",
"White blocks sprinkled with seasoning"
],
"correct_choice": 3,
"position": [
237
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@healthfood-6999715804623293702_0",
"video_path": "6999715804623293702.mp4",
"subtitle_path": "6999715804623293702_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 13.52,
"view_count": 54887
},
{
"video_id": "@_eat_sleep_travel_repeat-7275734357799652640",
"question": "The screen shows a close-up of a purple flower. The entire petal of the flower is elongated. The stamen is yellow, and there are many similar purple flowers nearby. What happened when the flower first appeared?",
"question_wo_referring_query": ", what happened?",
"candidates": [
"Two bees landed on the flower.",
"A bee was flying over the flower.",
"Several bees were flying over the flower.",
"A bee landed on the flower.",
"Two bees were flying over the flower."
],
"correct_choice": 3,
"position": [
210
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@_eat_sleep_travel_repeat-7275734357799652640_0",
"video_path": "7275734357799652640.mp4",
"subtitle_path": "7275734357799652640_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.63,
"view_count": 7969
},
{
"video_id": "@kelseyinlondon-7258968758130085146",
"question": "In the scene, there is a woman wearing a white dress. She is standing on a rock, gazing into the distance. In front of her, there is a green lake, and in the distance, there are some orange buildings. After the woman gazes into the distance, what happens next?",
"question_wo_referring_query": "After the woman gazes into the distance, what happens next?",
"candidates": [
"The woman walks on a stone path surrounded by green plants, between pink and yellow buildings.",
"The woman walks on a stone path surrounded by green plants, between green and black buildings.",
"The woman walks on a stone path surrounded by green plants, between green and yellow buildings.",
"The woman walks on a stone path surrounded by green plants, between blue and yellow buildings.",
"The woman walks on a stone path surrounded by green plants, between green and white buildings."
],
"correct_choice": 0,
"position": [
189,
227
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@kelseyinlondon-7258968758130085146_0",
"video_path": "7258968758130085146.mp4",
"subtitle_path": "7258968758130085146_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 11.5,
"view_count": 53172
},
{
"video_id": "@daiki.shino-7120297338056281349",
"question": "Under the grey sky, there is a person dressed in a blue and pink patchwork garment. She is kneeling on the ground, raising her hands, with her gaze fixed on her hands. When this woman began walking toward a distant building emitting light in the evening, what change occurred to her?",
"question_wo_referring_query": "What change occurred to her?",
"candidates": [
"She changed into black clothes",
"She changed into a white outer garment",
"She stood up",
"She lay down on the ground",
"She raised her hands above her head"
],
"correct_choice": 2,
"position": [
12,
159
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "@daiki.shino-7120297338056281349_0",
"video_path": "7120297338056281349.mp4",
"subtitle_path": "7120297338056281349_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.12,
"view_count": 4194
},
{
"video_id": "@placesunleashed-7297000148201262341",
"question": "On the sea, a woman wearing sunglasses and long hair is sitting in a cart with the text 'WATER LINK' on the side. What is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"She is driving an amphibious car",
"She jumped out of the car",
"She is swimming",
"She jumped into the sea",
"She is lying in the car"
],
"correct_choice": 0,
"position": [
103
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@placesunleashed-7297000148201262341_0",
"video_path": "7297000148201262341.mp4",
"subtitle_path": "7297000148201262341_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.13,
"view_count": 8449
},
{
"video_id": "@placesunleashed-7325156862486220038",
"question": "On the cliff, there are many protruding rocks. Water flows down the rocks from the peak. Below, there are two men wearing shorts and bare upper bodies. Another man runs towards them from the staircase below. When the subtitle mentions 'Visitors can choose to enter the spectacular La Gloria cave, which is filled with mineral deposits,' what two colors are present on the clothes of the man who is running?",
"question_wo_referring_query": "When the subtitle mentions 'Visitors can choose to enter the spectacular La Gloria cave, which is filled with mineral deposits,' what two colors are present on the clothes of the man who is running?",
"candidates": [
"green and white",
"olive and white",
"red and olive",
"red and green",
"red and white"
],
"correct_choice": 4,
"position": [
229
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@placesunleashed-7325156862486220038_0",
"video_path": "7325156862486220038.mp4",
"subtitle_path": "7325156862486220038_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.8,
"view_count": 1320
},
{
"video_id": "@kelseyinlondon-7287537087203511585",
"question": "In front of a building with a green roof, there is an ice surface reflecting the building. Many people are skating and playing on the ice. After the subtitle 'So show up your soul' appears, what item is shown in the video?",
"question_wo_referring_query": ", what item is shown in the video?",
"candidates": [
"a silver knife",
"a silver fork",
"a mug with coffee",
"a red chair",
"a green bag"
],
"correct_choice": 4,
"position": [
228,
255
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@kelseyinlondon-7287537087203511585_0",
"video_path": "7287537087203511585.mp4",
"subtitle_path": "7287537087203511585_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 12.6,
"view_count": 241276
},
{
"video_id": "@jetset_anna-6985193727107009798",
"question": "A narrow alley lies between two clean, white houses. The alley is paved with stone slabs, and several blue flower pots with plants are placed along the alley. A cat appears in the alley. When the caption mentions 'Why I love a mama me a summer. [Why I love a mama me a summer] translates to Chinese Simplified as.', what is the cat doing?",
"question_wo_referring_query": "What is the cat doing?",
"candidates": [
"The cat is walking away from the camera",
"The cat is jumping onto the wall",
"The cat is walking towards the camera",
"The cat is doing a backflip",
"The cat is crouching on the ground"
],
"correct_choice": 0,
"position": [
45
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@jetset_anna-6985193727107009798_0",
"video_path": "6985193727107009798.mp4",
"subtitle_path": "6985193727107009798_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.07,
"view_count": 1985824
},
{
"video_id": "@jetset_anna-7126584501228129542",
"question": "In the middle of the white room, there is a door with two pots of flowers on its sides. The flower on the right side is very tall, while the pot on the left side has a white pipe on it. The top part of the door is triangular. What color is the door?",
"question_wo_referring_query": ", what color is the door?",
"candidates": [
"Pink",
"Green",
"Blue",
"Black",
"Yellow"
],
"correct_choice": 0,
"position": [
109
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@jetset_anna-7126584501228129542_0",
"video_path": "7126584501228129542.mp4",
"subtitle_path": "7126584501228129542_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.21,
"view_count": 72069
},
{
"video_id": "@luxtravelbe-7233851486474636570",
"question": "Holding two cups containing liquid, the cup on the right has some bubbles inside. In the distance, there is a sunset. What happens when the two cups touch for the first time?",
"question_wo_referring_query": "What happens when the two cups touch for the first time?",
"candidates": [
"A third cup appears",
"Bubbles appear in the left cup",
"The bubbles in the right cup disappear",
"They clink the cups again",
"The cups move out of the frame"
],
"correct_choice": 4,
"position": [
17
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@luxtravelbe-7233851486474636570_0",
"video_path": "7233851486474636570.mp4",
"subtitle_path": "7233851486474636570_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.31,
"view_count": 8200
},
{
"video_id": "@placesunleashed-7327779917989416197",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, there's a red sphere in the darkness, then in outer space, a glowing object shines light on the Earth, and finally the green glow flows through the treetops at night.",
"First, the green glow flows through the treetops at night, then in outer space, a glowing object shines light on the Earth, and finally there's a red sphere in the darkness.",
"First, the green glow flows through the treetops at night, then there's a red sphere in the darkness, and finally in outer space, a glowing object shines light on the Earth.",
"First, there's a red sphere in the darkness, then the green glow flows through the treetops at night, and finally in outer space, a glowing object shines light on the Earth.",
"First, in outer space, a glowing object shines light on the Earth, then the green glow flows through the treetops at night, and finally there's a red sphere in the darkness."
],
"correct_choice": 2,
"position": [
133,
211,
248
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@placesunleashed-7327779917989416197_0",
"video_path": "7327779917989416197.mp4",
"subtitle_path": "7327779917989416197_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.17,
"view_count": 1055
},
{
"video_id": "@placesunleashed-7334544531490196741",
"question": "Sunlight is shining on a stone wall, beneath which is a blue sea. Two white boats are anchored on the sea, and in the distance, the sea stretches out infinitely. When the camera moves to the right, what changes on the screen?",
"question_wo_referring_query": "What changes on the screen?",
"candidates": [
"A cruise ship appears on the screen",
"The sun appears on the screen",
"Stars appear on the screen",
"An airplane appears on the screen",
"The moon appears on the screen"
],
"correct_choice": 1,
"position": [
81,
206
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "@placesunleashed-7334544531490196741_0",
"video_path": "7334544531490196741.mp4",
"subtitle_path": "7334544531490196741_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.5,
"view_count": 1522
},
{
"video_id": "@placesunleashed-7303594391850044678",
"question": "There is a small island on the blue sea, the island is full of green plants, and there are many boats docked along the shore. Among them, the largest boat in the middle, what color is the largest boat in the middle?",
"question_wo_referring_query": "What color is the largest boat in the middle?",
"candidates": [
"red",
"black",
"yellow",
"white",
"green"
],
"correct_choice": 3,
"position": [
198
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@placesunleashed-7303594391850044678_0",
"video_path": "7303594391850044678.mp4",
"subtitle_path": "7303594391850044678_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.47,
"view_count": 12429
},
{
"video_id": "@placesunleashed-7297652941551439109",
"question": "On the sandy beach, there are 4 kangaroos. Three of them are in a row, while the other one is stepping on a clump of dry grass. In the distance, the screen shows a blue ocean and a small island. What did the kangaroo do the first time it appeared?",
"question_wo_referring_query": "What did the kangaroo do the first time it appeared?",
"candidates": [
"Crossing",
"Jumping",
"Fighting",
"Walking",
"Drinking water"
],
"correct_choice": 1,
"position": [
211
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@placesunleashed-7297652941551439109_0",
"video_path": "7297652941551439109.mp4",
"subtitle_path": "7297652941551439109_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.47,
"view_count": 11952
},
{
"video_id": "@jetset_anna-6933580627824495878",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a green lake with a few small boats docked at the pier, then a green lake with two people rowing a boat, and finally a green lake with no boats in sight, with mountains and trees in the distance.",
"First, a green lake with a few small boats docked at the pier, then a green lake with no boats in sight, with mountains and trees in the distance, and finally a green lake with two people rowing a boat.",
"First, a green lake with two people rowing a boat, then a green lake with a few small boats docked at the pier, and finally a green lake with no boats in sight, with mountains and trees in the distance.",
"First, a green lake with two people rowing a boat, then a green lake with no boats in sight, with mountains and trees in the distance, and finally a green lake with a few small boats docked at the pier.",
"First, a green lake with no boats in sight, with mountains and trees in the distance, then a green lake with a few small boats docked at the pier, and finally a green lake with two people rowing a boat."
],
"correct_choice": 0,
"position": [
25,
147,
350
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@jetset_anna-6933580627824495878_0",
"video_path": "6933580627824495878.mp4",
"subtitle_path": "6933580627824495878_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.07,
"view_count": 14198
},
{
"video_id": "@movie.explained6-7278310085217094914",
"question": "A man wearing a white short-sleeved shirt is holding a newspaper, and beside him is another man wearing a striped shirt. Behind them is a white house. What did the man holding the newspaper do the first time he appeared?",
"question_wo_referring_query": ", what did the man holding the newspaper do the first time he appeared?",
"candidates": [
"peeling an apple",
"using the newspaper to block the window",
"playing guitar",
"wrapping beans with the newspaper",
"driving a car"
],
"correct_choice": 1,
"position": [
192
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@movie.explained6-7278310085217094914_0",
"video_path": "7278310085217094914.mp4",
"subtitle_path": "7278310085217094914_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 28,
"duration": 10.0,
"view_count": 6936
},
{
"video_id": "@movie.explained6-7270058445577948418",
"question": "Which of the following scenario sequences is correct?",
"question_wo_referring_query": "Which of the following scenario sequences is correct?",
"candidates": [
"First, flames engulf a person, then a woman in yellow clothing and a person in a black outfit lie together, and finally, a man wearing a grey outfit lights a lighter.",
"First, a man wearing a grey outfit lights a lighter, then flames engulf a person, and finally, a woman in yellow clothing and a person in a black outfit lie together.",
"First, a man wearing a grey outfit lights a lighter, then a woman in yellow clothing and a person in a black outfit lie together, and finally, flames engulf a person.",
"First, flames engulf a person, then a man wearing a grey outfit lights a lighter, and finally, a woman in yellow clothing and a person in a black outfit lie together.",
"First, a woman in yellow clothing and a person in a black outfit lie together, then a man wearing a grey outfit lights a lighter, and finally, flames engulf a person."
],
"correct_choice": 1,
"position": [
95,
195,
281
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@movie.explained6-7270058445577948418_0",
"video_path": "7270058445577948418.mp4",
"subtitle_path": "7270058445577948418_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 17,
"duration": 13.0,
"view_count": 4716
},
{
"video_id": "@movie.explained6-7252661595875183874",
"question": "A man with black hair wearing a green coat sits with a boy who also has black hair. The man is saying something to the boy. In which subtitles have this man appeared before?",
"question_wo_referring_query": "In which subtitles have this man appeared before?",
"candidates": [
"I am a photographer",
"I know",
"He buys the things and then returns to his son who was waiting for him",
"Help me",
"Don't do this"
],
"correct_choice": 2,
"position": [
282,
180
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "@movie.explained6-7252661595875183874_0",
"video_path": "7252661595875183874.mp4",
"subtitle_path": "7252661595875183874_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 22,
"duration": 13.0,
"view_count": 554
},
{
"video_id": "@movie.explained6-7275485247351901442",
"question": "In front of a building, a man wearing a green coat throws a punch towards a man in black clothes. The man in black dodges, and when the man in the green coat and the subtitle 'Accidentally, the eldest brother\u2019s punch shattered everything about the sixth sibling.' appear simultaneously, what change occurs to him?",
"question_wo_referring_query": "What change occurs to him?",
"candidates": [
"He takes off his outer coat.",
"He falls to the ground.",
"He holds his arm with his hand.",
"He looks to the side as he moves past.",
"He raises both hands above his head."
],
"correct_choice": 3,
"position": [
32,
196
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@movie.explained6-7275485247351901442_0",
"video_path": "7275485247351901442.mp4",
"subtitle_path": "7275485247351901442_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 31,
"duration": 8.98,
"view_count": 3023
},
{
"video_id": "@movie.explained6-7268771669123042562",
"question": "In the amusement park at night, there is a woman standing with her hair covered, holding a camera. Behind her, there is an amusement facility with yellow lights. What color clothes is the woman holding the camera wearing?",
"question_wo_referring_query": "What color clothes is the woman holding the camera wearing?",
"candidates": [
"blue",
"pink",
"white",
"green",
"purple"
],
"correct_choice": 1,
"position": [
101
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@movie.explained6-7268771669123042562_0",
"video_path": "7268771669123042562.mp4",
"subtitle_path": "7268771669123042562_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 1,
"duration": 12.01,
"view_count": 4700
},
{
"video_id": "@movie.explained6-7254803496900267266",
"question": "In a car with black seats, there are four people sitting in a row. One of them is a man wearing a black shirt. He rests his right hand on the shoulder of the woman next to him. When the subtitle 'She didn't want to harm the flowers of her country' appears, what hairstyle does the man in the black shirt have?",
"question_wo_referring_query": "What hairstyle does the man in the black shirt have?",
"candidates": [
"Short hair",
"Pigtails",
"Long hair",
"Bald",
"Shoulder-length curls"
],
"correct_choice": 0,
"position": [
171
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@movie.explained6-7254803496900267266_0",
"video_path": "7254803496900267266.mp4",
"subtitle_path": "7254803496900267266_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 30,
"duration": 13.0,
"view_count": 7423
},
{
"video_id": "@movie.explained6-7253792552996867330",
"question": "In a dim room, there is a black sofa. Two women are moving forward, one in a green dress and one in a pink dress. One of them is being pushed while moving forward. Who is being pushed while moving forward?",
"question_wo_referring_query": "Who is being pushed while moving forward?",
"candidates": [
"The woman in the green dress",
"The woman with a ponytail",
"The woman with a bun hairstyle",
"The woman with braided pigtails",
"The woman in the pink dress"
],
"correct_choice": 4,
"position": [
63
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E2O",
"level": "L1-Perception",
"id": "@movie.explained6-7253792552996867330_0",
"video_path": "7253792552996867330.mp4",
"subtitle_path": "7253792552996867330_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 42,
"duration": 8.01,
"view_count": 1796
},
{
"video_id": "@movie.explained6-7274542274997013761",
"question": "In a dark room, there is a man and a woman. The woman is wearing a white nightgown, and the man is wearing a blue shirt. When the man in the blue shirt appears in an office with brown walls, what change occurs to his clothing?",
"question_wo_referring_query": ", what change occurs to his clothing?",
"candidates": [
"Changes from a blue shirt to a red shirt",
"Changes from a blue shirt to a white T-shirt",
"Changes from a blue shirt to a purple shirt",
"Changes from a blue shirt to a black hoodie",
"Wears an additional black suit"
],
"correct_choice": 4,
"position": [
130,
250
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SAA",
"level": "L2-Relation",
"id": "@movie.explained6-7274542274997013761_0",
"video_path": "7274542274997013761.mp4",
"subtitle_path": "7274542274997013761_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 30,
"duration": 14.0,
"view_count": 4882
},
{
"video_id": "@jonijawne-7305901654417706246",
"question": "In the video, which of the following characters appears first?",
"question_wo_referring_query": "In the video, which of the following characters appears first?",
"candidates": [
"A woman wearing a navy blue hooded jacket",
"A man wearing a gray coat, earphones, and with short black hair",
"A woman wearing a white T-shirt",
"A man wearing a white shirt",
"A woman wearing a dark orange coat"
],
"correct_choice": 1,
"position": [
95,
105
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@jonijawne-7305901654417706246_0",
"video_path": "7305901654417706246.mp4",
"subtitle_path": "7305901654417706246_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.27,
"view_count": 1597
},
{
"video_id": "@kerstinong-6965059948090821890",
"question": "On a gloomy day, there is a flat road. On the road, there is a woman wearing a white top and green pants. What is the woman in the white top doing on the flat road?",
"question_wo_referring_query": "On a gloomy day, there is a flat road. On the road, there is a woman wearing a white top and green pants. What is the woman in the white top doing on the flat road?",
"candidates": [
"Jumping rope",
"Running",
"Dancing",
"Doing push-ups",
"Doing high leg raises"
],
"correct_choice": 1,
"position": [
211
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@kerstinong-6965059948090821890_0",
"video_path": "6965059948090821890.mp4",
"subtitle_path": "6965059948090821890_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 12.79,
"view_count": 25861
},
{
"video_id": "@jess.morg-7195624647830850858",
"question": "A long-haired woman appears beside a tree trunk. The woman is wearing a white top and black jeans, with grass around her legs. Behind the woman are a large tree and the blue sky. What is this woman in the white top doing?",
"question_wo_referring_query": "What is this woman in the white top doing?",
"candidates": [
"The woman is touching her cheek",
"The woman is kicking",
"The woman is bending at the waist",
"The woman is waving her hand",
"The woman is touching her forehead"
],
"correct_choice": 1,
"position": [
208
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@jess.morg-7195624647830850858_0",
"video_path": "7195624647830850858.mp4",
"subtitle_path": "7195624647830850858_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 13.87,
"view_count": 279
},
{
"video_id": "@jess.morg-7291800958152084782",
"question": "The sun is shining on the utility pole, and the shadow beside it looks like a person leaning against the pole. There is a black-and-white object stuck on the pole. A person wearing black pants appears on the screen. What did this person do when they first appeared?",
"question_wo_referring_query": "What did the person in black pants do when they first appeared?",
"candidates": [
"Lifted a tray",
"Stepped over a barrier",
"Picked up a drink",
"Opened a door",
"Stepped on the object on the pole"
],
"correct_choice": 4,
"position": [
8
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@jess.morg-7291800958152084782_0",
"video_path": "7291800958152084782.mp4",
"subtitle_path": "7291800958152084782_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.7,
"view_count": 887
},
{
"video_id": "@jonijawne-7318074908645264645",
"question": "A light-colored building is located on the left side of the screen, with exquisite carvings on its walls. At the entrance at the front of the building, a red UNIQLO sign is hanging. The entrance area is surrounded by a crowd of pedestrians. After the camera finishes filming the entrance of UNIQLO, what does it film next?",
"question_wo_referring_query": "What does the camera film next?",
"candidates": [
"It films the scene inside the UNIQLO store.",
"It films the art museum next door.",
"It films the library next door.",
"It films the hotel next door.",
"It films the restaurant next door."
],
"correct_choice": 0,
"position": [
13,
80
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@jonijawne-7318074908645264645_0",
"video_path": "7318074908645264645.mp4",
"subtitle_path": "7318074908645264645_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 8.37,
"view_count": 2788
},
{
"video_id": "@jess.morg-7275475379375164715",
"question": "There is a shelf on the white wall, and on the shelf, there are green plants and various items such as photos. To the right of the shelf is a window covered by a green curtain. After the subtitle 'Halloween is cool' appears, what happens in the room?",
"question_wo_referring_query": "What happens in the room?",
"candidates": [
"An old man lies on the sofa resting",
"A cat lies on the cushion resting",
"A little girl lies on the sofa resting",
"A dog lies on the cushion resting",
"A little boy lies on the sofa resting"
],
"correct_choice": 3,
"position": [
98,
113
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@jess.morg-7275475379375164715_0",
"video_path": "7275475379375164715.mp4",
"subtitle_path": "7275475379375164715_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 11.6,
"view_count": 671
},
{
"video_id": "@jonijawne-7184469162956246277",
"question": "The man wearing a black shirt appears in the center of the screen, he's wearing sunglasses, his hair is parted in the middle, and he has a white accessory at the collar. Behind him is a green field, trees, and a white sky, with people resting on the grass. In which other locations has this man wearing sunglasses appeared?",
"question_wo_referring_query": "In which other locations has this man wearing sunglasses appeared?",
"candidates": [
"White sofa",
"Path by the water",
"Spacious theater",
"Hill in the grass field",
"Bench in the grass field"
],
"correct_choice": 1,
"position": [
50,
174
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "@jonijawne-7184469162956246277_0",
"video_path": "7184469162956246277.mp4",
"subtitle_path": "7184469162956246277_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.87,
"view_count": 282
},
{
"video_id": "@kerstinong-6900436106790194434",
"question": "A woman is working out on a piece of fitness equipment. She is holding the left and right parts of the equipment above with both hands, and her legs are spread apart. The woman is wearing a sports bra and black shorts. There are white and blue text boxes respectively on the upper and right parts of the woman. When the subtitle 'Story' appears, what change occurs to the woman who is working out?",
"question_wo_referring_query": "What change occurs to the woman who is working out?",
"candidates": [
"The woman's hands holding the equipment change to being crossed in front of her chest",
"The woman's hands holding the equipment change to being on her waist",
"The woman's hands holding the equipment change to being raised",
"The woman's spread-apart legs spread even wider",
"The woman's spread-apart legs change to being crossed together"
],
"correct_choice": 4,
"position": [
5,
81
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@kerstinong-6900436106790194434_0",
"video_path": "6900436106790194434.mp4",
"subtitle_path": "6900436106790194434_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 13.73,
"view_count": 32017
},
{
"video_id": "@jonijawne-7312063059160206597",
"question": "Inside the subway station, on the left is the train and on the right are the waiting passengers. Some passengers are sitting with their phones, while others are standing. Screens and signs are suspended in the upper space of the subway station. After filming the subway station, what else did the camera capture?",
"question_wo_referring_query": ", after filming the subway station, what else did the camera capture?",
"candidates": [
"a man wearing a hat",
"exit",
"a restaurant",
"subway car",
"a convenience store"
],
"correct_choice": 3,
"position": [
11,
62
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@jonijawne-7312063059160206597_0",
"video_path": "7312063059160206597.mp4",
"subtitle_path": "7312063059160206597_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.1,
"view_count": 12504
},
{
"video_id": "@jess.morg-7255843192677911851",
"question": "A pile of large rocks is stacked together, with the sea right next to the rocks. As the waves crash against the rocks, white splashes are created. The colors of these rocks vary in shades. After the caption 'I don't count all the pickles on your face' appears, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"A seagull appears in the sky",
"A crab appears on the rocks",
"A woman is taking a photo with a man",
"A sailboat passes by on the sea",
"A cruise ship appears on the sea"
],
"correct_choice": 2,
"position": [
100,
130
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@jess.morg-7255843192677911851_0",
"video_path": "7255843192677911851.mp4",
"subtitle_path": "7255843192677911851_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.69,
"view_count": 661
},
{
"video_id": "@jess.morg-7284004157311421742",
"question": "A notebook computer is placed on the table, the notebook is silver-grey, and the screen frame is black. A webpage is open on the laptop screen, and a lady appears on the webpage, surrounded by text and small icons. After the subtitle 'and everyone knows' appears, what object appears on the laptop screen?",
"question_wo_referring_query": "What object appears on the laptop screen?",
"candidates": [
"a pair of sunglasses",
"a black cat",
"an umbrella",
"a black dog",
"a desk lamp"
],
"correct_choice": 0,
"position": [
160,
180
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@jess.morg-7284004157311421742_0",
"video_path": "7284004157311421742.mp4",
"subtitle_path": "7284004157311421742_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.87,
"view_count": 2266
},
{
"video_id": "@jess.morg-7272881154838023467",
"question": "Under the blue sky, a flock of seagulls is soaring. There are white characters in the center of the screen. In what other scenes have this flock of seagulls appeared?",
"question_wo_referring_query": "In what other scenes have this flock of seagulls appeared?",
"candidates": [
"Giant rocks by the sea",
"Cliff by the sea",
"Volleyball net on the beach",
"Branches on the beach",
"The sea where the tourists sail a red boat"
],
"correct_choice": 4,
"position": [
153,
192
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "@jess.morg-7272881154838023467_0",
"video_path": "7272881154838023467.mp4",
"subtitle_path": "7272881154838023467_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.83,
"view_count": 670
},
{
"video_id": "@jess.morg-7248745529272700202",
"question": "White clouds float in the azure sky, stones are piled up on the beach, white waves rise from the sea, there is a mountain connected to the sea in the distance, and as the subtitle 'shouldn't be comparing ourselves to other people because we don't know the' appears, what changes occur on the sea?",
"question_wo_referring_query": "What changes occur on the sea?",
"candidates": [
"The sea becomes calm.",
"Massive waves rise from the sea.",
"The blue sea turns dark.",
"A cruise ship appears on the sea.",
"A sailboat appears on the sea."
],
"correct_choice": 2,
"position": [
3,
40
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@jess.morg-7248745529272700202_0",
"video_path": "7248745529272700202.mp4",
"subtitle_path": "7248745529272700202_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 11.33,
"view_count": 1088
},
{
"video_id": "@lisolna-7203500499360877829",
"question": "A cat is putting its head on a white cushion. On the wall above the cat, there's a white shelf with some colorful vases on it. When the subtitle 'Thinking of running to get by' appears, what color is the cat's face?",
"question_wo_referring_query": "What color is the cat's face?",
"candidates": [
"Blue and white face",
"All black face",
"All white face",
"Black and white face",
"Blue face"
],
"correct_choice": 3,
"position": [
35
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@lisolna-7203500499360877829_0",
"video_path": "7203500499360877829.mp4",
"subtitle_path": "7203500499360877829_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 11.53,
"view_count": 3075
},
{
"video_id": "@kerstinong-6976239624578419969",
"question": "A hand with manicured nails is placed near the shoulder of a reclining woman. The woman is wearing a black short sleeve shirt and is smiling. There is a toy figure behind her. What did the hand with manicured nails do after being placed on the reclining woman's shoulder?",
"question_wo_referring_query": "What did the hand with manicured nails do after being placed on the reclining woman's shoulder?",
"candidates": [
"Tapped the reclining woman's forehead",
"Tapped the reclining woman's face",
"Tapped the reclining woman's shoulder",
"Tapped the reclining woman's stomach",
"Tapped the reclining woman's buttocks"
],
"correct_choice": 2,
"position": [
235,
253
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@kerstinong-6976239624578419969_0",
"video_path": "6976239624578419969.mp4",
"subtitle_path": "6976239624578419969_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 15.08,
"view_count": 54166
},
{
"video_id": "@jonijawne-7194500194648476933",
"question": "A man appears on the screen wearing a grey shirt and black pants, holding a black long strip in his hand. After the subtitle 'I got the chronic by the tree' appears, what does the man in the grey shirt do?",
"question_wo_referring_query": "What does the man in the grey shirt do?",
"candidates": [
"Holds both hands in front of his chest",
"Presses the black long strip against his chest with both hands",
"Inserts both hands into his pockets",
"Inserts both hands into his coat pockets",
"Crosses his hands at his waist"
],
"correct_choice": 1,
"position": [
105,
121
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@jonijawne-7194500194648476933_0",
"video_path": "7194500194648476933.mp4",
"subtitle_path": "7194500194648476933_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.77,
"view_count": 15124
},
{
"video_id": "@jess.morg-7079970495499717934",
"question": "What is the correct order of the following scenes?",
"question_wo_referring_query": "What is the correct order of the following scenes?",
"candidates": [
"First, a red heart-shaped screen, then a dimly lit street scene, followed by a gas station screen.",
"First, a gas station screen, then a dimly lit street scene, followed by a red heart-shaped screen.",
"First, a dimly lit street scene, then a gas station screen, followed by a red heart-shaped screen.",
"First, a gas station screen, then a red heart-shaped screen, followed by a dimly lit street scene.",
"First, a dimly lit street scene, then a red heart-shaped screen, followed by a gas station screen."
],
"correct_choice": 0,
"position": [
2,
48,
106
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@jess.morg-7079970495499717934_0",
"video_path": "7079970495499717934.mp4",
"subtitle_path": "7079970495499717934_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.17,
"view_count": 115
},
{
"video_id": "Jfp1Ks7Hh1E",
"question": "Who is the first person to appear in the video?",
"question_wo_referring_query": "Who is the first person to appear in the video?",
"candidates": [
"The man sitting in front of a desk with a projector and computer in a room with hanging lights, wearing a black hoodie and jeans",
"The man sitting on the bed in a room with hanging lights, wearing a black short-sleeve shirt and black pants",
"The woman with black hair, wearing a black and white coat with a white top, a white headset around her neck, opening a door",
"The woman walking outdoors at night in front of a wall with English writing, wearing a black short-sleeve shirt and shorts, and carrying a paper bag while wearing headphones",
"The man in a dimly lit room, sitting in front of a computer, wearing a white hoodie and facing a mirror"
],
"correct_choice": 1,
"position": [
159,
723,
6506,
13179,
14020
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Jfp1Ks7Hh1E_0",
"video_path": "Jfp1Ks7Hh1E.mp4",
"subtitle_path": "Jfp1Ks7Hh1E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 909.33,
"view_count": 569236
},
{
"video_id": "Jfp1Ks7Hh1E",
"question": "In the corridor with light yellow walls and a glass door behind him showing the green trees outside, a man with a middle part hairstyle, dressed in a white hoodie and dark shorts, is holding a camera. Who is the first person to appear behind him?",
"question_wo_referring_query": "In the corridor with light yellow walls and a glass door showing the green trees outside, a man with a middle part hairstyle, dressed in a white hoodie and dark shorts, is holding a camera. Who is the first person to appear behind him?",
"candidates": [
"A woman with black hair wearing a black and white outerwear with a white inner layer, with white earphones around her neck, opening a door.",
"A woman walking outside at night in front of a wall with English writing, wearing a black short-sleeve shirt, shorts, earphones, and holding a paper bag.",
"A man sitting at a desk with a projector and computer in a room with hanging lanterns, wearing a black hoodie and jeans.",
"A man sitting on a bed in a room with hanging lanterns, wearing a black short-sleeve shirt and black long pants.",
"A man leaning against the light yellow wall in the corridor, wearing a gray long-sleeve shirt, black long pants, and glasses, with his arms crossed in front of his chest."
],
"correct_choice": 4,
"position": [
14174,
15742
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Jfp1Ks7Hh1E_1",
"video_path": "Jfp1Ks7Hh1E.mp4",
"subtitle_path": "Jfp1Ks7Hh1E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 909.33,
"view_count": 569236
},
{
"video_id": "Jfp1Ks7Hh1E",
"question": "Who is the person that appears first after the man on the right side of the screen, wearing a black jacket, a gold necklace, and having black middle-parted hair?",
"question_wo_referring_query": "Who is the person that appears first afterward?",
"candidates": [
"The man sitting at a desk with a projector and computer in a bedroom with colorful lights, wearing a black hoodie and jeans.",
"The man with middle-parted hair holding a camera, wearing a white hoodie and dark shorts.",
"The woman in the room behind the door, wearing a gray hoodie and glasses, with her hair tied up.",
"The man sitting on a bed with colorful lights in the bedroom, wearing a black short-sleeve shirt and black pants.",
"The man leaning against a yellow wall in the corridor, wearing a gray long-sleeve shirt, black pants, and glasses, with his arms crossed."
],
"correct_choice": 2,
"position": [
16129,
16367
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "Jfp1Ks7Hh1E_2",
"video_path": "Jfp1Ks7Hh1E.mp4",
"subtitle_path": "Jfp1Ks7Hh1E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 909.33,
"view_count": 569236
},
{
"video_id": "gURB1JwPfJw",
"question": "On a PPT with a white background, there is a blue header at the top that says 'Molar Mass'. Below it, there are two light green rectangles with English text inside. In the circular frame at the bottom right corner, there is a video screen showing a long-haired woman wearing a white coat over a black outfit. When the subtitle 'let's say if you know what one we use' appears, what is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Combing her hair",
"Tying her hair",
"Raising one hand to explain",
"Wearing glasses",
"Waving at the mirror"
],
"correct_choice": 2,
"position": [
3667
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2E",
"level": "L1-Perception",
"id": "gURB1JwPfJw_0",
"video_path": "gURB1JwPfJw.mp4",
"subtitle_path": "gURB1JwPfJw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 903.14,
"view_count": 199542
},
{
"video_id": "gURB1JwPfJw",
"question": "On a white background PPT, at the top, there is a blue title 'Mole to Mole Ratios', at the bottom, there is a black chemistry formula with an arrow and an English question in small black font. In the bottom right corner, within a circular frame, there is a video screen showing a long-haired woman wearing a white coat over a black outfit. When the subtitle 'compound yeah so that's what you need' appears, what is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Touching her head with both hands",
"Both hands resting flat on the table, talking towards the webcam",
"Waving towards the webcam",
"Propping up her chin with both hands",
"Elbows resting on the table with both hands raised, giving an explanation"
],
"correct_choice": 4,
"position": [
5952
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2E",
"level": "L1-Perception",
"id": "gURB1JwPfJw_1",
"video_path": "gURB1JwPfJw.mp4",
"subtitle_path": "gURB1JwPfJw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 903.14,
"view_count": 199542
},
{
"video_id": "gURB1JwPfJw",
"question": "On a white background PPT, there's a black English question at the top. In the middle of the screen, there's one black arrow and three blue arrows. At the bottom, there are chemical element symbols written inside a black lined border. In the lower right corner, within a circular frame, there's a video screen of a long-haired woman wearing a white coat with black inner clothing. What is this woman doing when the subtitle 'that's a good thing you get it now yeah' appears?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Yawning",
"Raising both hands and smiling with thumbs up",
"Resting her face on one hand",
"Making a peace sign at the camera",
"Tucking her hair behind her ear"
],
"correct_choice": 1,
"position": [
18322
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "T2E",
"level": "L1-Perception",
"id": "gURB1JwPfJw_2",
"video_path": "gURB1JwPfJw.mp4",
"subtitle_path": "gURB1JwPfJw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 903.14,
"view_count": 199542
},
{
"video_id": "9S9i12n0TIw",
"question": "In front of a wall displaying a blue and white map, what is the man with white hair wearing a black short-sleeved shirt doing when he first appears?",
"question_wo_referring_query": "What is he doing?",
"candidates": [
"Smiling at the camera",
"Making a mark on the map",
"Making a 'Yay' sign towards the camera",
"Raising his hand to greet",
"Talking to the camera"
],
"correct_choice": 4,
"position": [
485
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "9S9i12n0TIw_0",
"video_path": "9S9i12n0TIw.mp4",
"subtitle_path": "9S9i12n0TIw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1168.54,
"view_count": 26499
},
{
"video_id": "9S9i12n0TIw",
"question": "When a blue and orange map appears on the green wall for the first time, what is the person with sweat-covered arm hair doing on the screen?",
"question_wo_referring_query": "What is this person doing?",
"candidates": [
"Spreading fingers and placing hand on the map",
"Making notes on the map",
"Drawing on the map",
"Blackening the map",
"Pointing to the map with an outstretched finger"
],
"correct_choice": 4,
"position": [
22326
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "9S9i12n0TIw_1",
"video_path": "9S9i12n0TIw.mp4",
"subtitle_path": "9S9i12n0TIw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1168.54,
"view_count": 26499
},
{
"video_id": "9S9i12n0TIw",
"question": "When the map with black lines on white paper first appears, what is the sweaty arm doing in the scene?",
"question_wo_referring_query": "What is happening?",
"candidates": [
"Spreading the map with open hands",
"Drawing on the map with a pen",
"Pointing at the map with a finger",
"Placing objects on the map",
"Painting the map black"
],
"correct_choice": 2,
"position": [
24413
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O2E",
"level": "L1-Perception",
"id": "9S9i12n0TIw_2",
"video_path": "9S9i12n0TIw.mp4",
"subtitle_path": "9S9i12n0TIw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1168.54,
"view_count": 26499
},
{
"video_id": "gJijNOktmoI",
"question": "On a stretch of yellow sandbrisk\u0435d by jade-green seawater, a man wearing a grey short-sleeve shirt, glasses, and sporting a goatee is taking a selfie. In which of the following scenes does this goateed man appear?",
"question_wo_referring_query": "In which of the following scenes has the goateed man appeared?",
"candidates": [
"Inside a large turtle tank",
"On a beach during the rain",
"In a dense primeval forest",
"In a park during the rain",
"In a park with a slide"
],
"correct_choice": 0,
"position": [
152,
15667
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "gJijNOktmoI_0",
"video_path": "gJijNOktmoI.mp4",
"subtitle_path": "gJijNOktmoI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 967.48,
"view_count": 145552
},
{
"video_id": "gJijNOktmoI",
"question": "There is a sea with jade-green water, a sky that is azure blue, small boats floating on the jade-green water, and a man with curly hair standing on the yellow sandy beach, wearing a grey short-sleeve shirt. In which of the following scenes has the man with curly hair, who is wearing a grey short-sleeve shirt, appeared before?",
"question_wo_referring_query": "In which of the following scenes has the man with curly hair, who is wearing a grey short-sleeve shirt, appeared before?",
"candidates": [
"On the beach during a rainstorm",
"In front of an off-white wall with a hanging painting",
"In a peaceful park",
"In a pool with large turtles",
"In a restaurant with yellow curry"
],
"correct_choice": 1,
"position": [
167,
9307
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "gJijNOktmoI_1",
"video_path": "gJijNOktmoI.mp4",
"subtitle_path": "gJijNOktmoI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 967.48,
"view_count": 145552
},
{
"video_id": "gJijNOktmoI",
"question": "Under a cloudless blue sky, there is a turquoise sea. A dark-skinned woman wearing glasses and a white bikini is in the turquoise sea. In which of the following scenes has this dark-skinned woman appeared?",
"question_wo_referring_query": "In which of the following scenes has this dark-skinned woman appeared?",
"candidates": [
"On the beach during rain",
"In a room with a red sofa",
"In a park with slides",
"In a dense primal forest",
"By the sea with a yellow boat"
],
"correct_choice": 4,
"position": [
6043,
10837
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SOS",
"level": "L2-Relation",
"id": "gJijNOktmoI_2",
"video_path": "gJijNOktmoI.mp4",
"subtitle_path": "gJijNOktmoI_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 967.48,
"view_count": 145552
},
{
"video_id": "d7IqrLV6Tlg",
"question": "Which sequence of scenes is correct from the options below?",
"question_wo_referring_query": "Which sequence of scenes is correct from the options below?",
"candidates": [
"First, a man wearing a black suit is writing on a blackboard; then, a man wearing a baseball cap and glasses is standing in front of the interstellar gate; finally, a man wearing a beret is standing in front of a gate that looks like water waves.",
"First, a man wearing a beret is standing in front of a gate that looks like water waves; then, a man wearing a black suit is writing on a blackboard; finally, a man wearing a baseball cap and glasses is standing in front of the interstellar gate.",
"First, a man wearing a black suit is writing on a blackboard; then, a man wearing a beret is standing in front of a gate that looks like water waves; finally, a man wearing a baseball cap and glasses is standing in front of the interstellar gate.",
"First, a man wearing a beret is standing in front of a gate that looks like water waves; then, a man wearing a baseball cap and glasses is standing in front of the interstellar gate; finally, a man wearing a black suit is writing on a blackboard.",
"First, a man wearing a baseball cap and glasses is standing in front of the interstellar gate; then, a man wearing a beret is standing in front of a gate that looks like water waves; finally, a man wearing a black suit is writing on a blackboard."
],
"correct_choice": 2,
"position": [
1450,
6316,
6512
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "d7IqrLV6Tlg_0",
"video_path": "d7IqrLV6Tlg.mp4",
"subtitle_path": "d7IqrLV6Tlg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1008.68,
"view_count": 352153
},
{
"video_id": "d7IqrLV6Tlg",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, many people are seen running across an endless desert; then, a man in a black short-sleeved shirt punches a man wearing a helmet; finally, a man wearing glasses lifts a curtain.",
"First, a man wearing glasses lifts a curtain; then, a man in a black short-sleeved shirt punches a man wearing a helmet; finally, many people are seen running across an endless desert.",
"First, a man in a black short-sleeved shirt punches a man wearing a helmet; then, a man wearing glasses lifts a curtain; finally, many people are seen running across an endless desert.",
"First, a man in a black short-sleeved shirt punches a man wearing a helmet; then, many people are seen running across an endless desert; finally, a man wearing glasses lifts a curtain.",
"First, a man wearing glasses lifts a curtain; then, many people are seen running across an endless desert; finally, a man in a black short-sleeved shirt punches a man wearing a helmet."
],
"correct_choice": 1,
"position": [
12120,
17355,
23044
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "d7IqrLV6Tlg_1",
"video_path": "d7IqrLV6Tlg.mp4",
"subtitle_path": "d7IqrLV6Tlg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1008.68,
"view_count": 352153
},
{
"video_id": "d7IqrLV6Tlg",
"question": "Which of the following sequence of scenes is correct?",
"question_wo_referring_query": "Which of the following sequence of scenes is correct?",
"candidates": [
"First, a flying ship resembling a pyramid appears in the desert; then a flying ship resembling a pyramid appears in outer space; finally, a man wearing a black short-sleeve shirt and black sunglasses holding a binocular appears.",
"First, a man wearing a black short-sleeve shirt and black sunglasses holding a binocular appears; then a flying ship resembling a pyramid appears in the desert; finally, another flying ship resembling a pyramid appears in outer space.",
"First, a man wearing a black short-sleeve shirt and black sunglasses holding a binocular appears; then a flying ship resembling a pyramid appears in outer space; finally, a flying ship resembling a pyramid appears in the desert.",
"First, a flying ship resembling a pyramid appears in the desert; then a man wearing a black short-sleeve shirt and black sunglasses holding a binocular appears; finally, a flying ship resembling a pyramid appears in outer space.",
"First, a flying ship resembling a pyramid appears in outer space; then a flying ship resembling a pyramid appears in the desert; finally, a man wearing a black short-sleeve shirt and black sunglasses holding a binocular appears."
],
"correct_choice": 1,
"position": [
8763,
15907,
23393
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "d7IqrLV6Tlg_2",
"video_path": "d7IqrLV6Tlg.mp4",
"subtitle_path": "d7IqrLV6Tlg_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1008.68,
"view_count": 352153
},
{
"video_id": "PCPQToF10IM",
"question": "In the desert, there are three mannequins. In front of the mannequins, there is a man wearing an olive-colored coat. Next to him, a person wearing a black hat and black clothes is holding a handgun. After the subtitle mentions 'off target by 0.5 mm sensing', what item appears in the video?",
"question_wo_referring_query": ", what item appears in the video?",
"candidates": [
"\u624b\u8868",
"\u9999\u70df",
"\u68cb\u76d8",
"\u6728\u684c",
"\u624b\u67aa"
],
"correct_choice": 2,
"position": [
829,
3712
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "PCPQToF10IM_0",
"video_path": "PCPQToF10IM.mp4",
"subtitle_path": "PCPQToF10IM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 904.67,
"view_count": 85677
},
{
"video_id": "PCPQToF10IM",
"question": "A man outdoors wearing a black hat and dressed in black clothes has his hands in his pockets, looking at a child in front of him wearing a jacket. After the subtitle mentions 'fixed during their Journey with Odo', who is the character that appears in the video?",
"question_wo_referring_query": "Who is the character that appears in the video?",
"candidates": [
"A child with black hair wearing a jacket",
"A bald man wearing a green coat",
"A woman wearing a yellow headscarf and green coat",
"A man with an afro wearing black clothes",
"A woman with blonde hair sitting in a wheelchair"
],
"correct_choice": 3,
"position": [
6972,
10142
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "PCPQToF10IM_1",
"video_path": "PCPQToF10IM.mp4",
"subtitle_path": "PCPQToF10IM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 904.67,
"view_count": 85677
},
{
"video_id": "PCPQToF10IM",
"question": "In the scene, there is only one woman with golden hair wearing a white dress, she is intently looking at her raised hand. After the subtitle mentions 'an immediate abduction impossible', who is the character that appears in the video?",
"question_wo_referring_query": ", who is the character that appears in the video?",
"candidates": [
"A man with white hair wearing a white dress",
"A man wearing a white hat and a floral-patterned shirt",
"A man wearing an olive-colored jacket and a white shirt",
"A man with blonde hair and a bare upper body",
"A man wearing a black hat and an all-black suit"
],
"correct_choice": 1,
"position": [
9059,
11367
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "PCPQToF10IM_2",
"video_path": "PCPQToF10IM.mp4",
"subtitle_path": "PCPQToF10IM_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 904.67,
"view_count": 85677
},
{
"video_id": "IN0osLg-Mn8",
"question": "A man wearing a black coat is kissing a woman with a ponytail. When the subtitle 'the conspiracy Frank sends his wife, Jordan to safety in Venezuela, then burns down his casino' appears, which of the following items is present?",
"question_wo_referring_query": "Which of the following items is present?",
"candidates": [
"hat",
"watermelon",
"earring",
"ring",
"earphone"
],
"correct_choice": 2,
"position": [
7049
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2O",
"level": "L1-Perception",
"id": "IN0osLg-Mn8_0",
"video_path": "IN0osLg-Mn8.mp4",
"subtitle_path": "IN0osLg-Mn8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 468.57,
"view_count": 79320
},
{
"video_id": "IN0osLg-Mn8",
"question": "A person wearing a black jacket is pressing down another person wearing a checkered jacket on the ground. When the subtitle 'The prideful Frank refuses to cooperate, and is fatally stabbed and left for dead in the desert' appears, which of the following items is present?",
"question_wo_referring_query": "Which of the following items is present?",
"candidates": [
"hat",
"earring",
"computer",
"phone",
"watch"
],
"correct_choice": 4,
"position": [
9567
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2O",
"level": "L1-Perception",
"id": "IN0osLg-Mn8_1",
"video_path": "IN0osLg-Mn8.mp4",
"subtitle_path": "IN0osLg-Mn8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 468.57,
"view_count": 79320
},
{
"video_id": "pFpZvRsEGZs",
"question": "A bald person wearing a green coat is holding a phone to his ear. Behind him is a wall made of wooden blocks, and to his left-rear side, there is a white light bulb. What color is the phone?",
"question_wo_referring_query": "What color is the phone?",
"candidates": [
"black",
"blue",
"red",
"yellow",
"white"
],
"correct_choice": 2,
"position": [
3591
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "pFpZvRsEGZs_0",
"video_path": "pFpZvRsEGZs.mp4",
"subtitle_path": "pFpZvRsEGZs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 559.56,
"view_count": 53769
},
{
"video_id": "pFpZvRsEGZs",
"question": "A man in a white coat and a long-haired woman are standing face to face next to a shelf. The man is holding a bottle with liquid in one hand and resting the other hand on the shelf. What is the color of the liquid in the bottle?",
"question_wo_referring_query": "What is the color of the liquid in the bottle?",
"candidates": [
"Black",
"Yellow",
"White",
"Red",
"Blue"
],
"correct_choice": 3,
"position": [
8630
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "pFpZvRsEGZs_1",
"video_path": "pFpZvRsEGZs.mp4",
"subtitle_path": "pFpZvRsEGZs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 559.56,
"view_count": 53769
},
{
"video_id": "6hBbXVkgxGE",
"question": "A person dressed in black clothes is making a phone call. One of his hands is placed on the table. On the table, there is a desk lamp. Next to the desk lamp is an open window. To the left of the window is a cabinet, on which there are many bottles of alcohol. Who is making the phone call?",
"question_wo_referring_query": "Who is making the phone call?",
"candidates": [
"Jan",
"Dougie",
"Bullet",
"Mike",
"Statham"
],
"correct_choice": 3,
"position": [
4618
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E2O",
"level": "L1-Perception",
"id": "6hBbXVkgxGE_0",
"video_path": "6hBbXVkgxGE.mp4",
"subtitle_path": "6hBbXVkgxGE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 539.58,
"view_count": 2127031
},
{
"video_id": "6hBbXVkgxGE",
"question": "A person wearing a black coat is sitting, holding a document. Next to him is a table with a gun and a cell phone on it. There's a black lamp on the table. Who is holding the document?",
"question_wo_referring_query": "Who is holding the document?",
"candidates": [
"Statham",
"Dougie",
"Jan",
"Mike",
"Bullet"
],
"correct_choice": 0,
"position": [
11637
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E2O",
"level": "L1-Perception",
"id": "6hBbXVkgxGE_1",
"video_path": "6hBbXVkgxGE.mp4",
"subtitle_path": "6hBbXVkgxGE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 539.58,
"view_count": 2127031
},
{
"video_id": "kj3Po7zUeyw",
"question": "What is the ethnicity of the first person to appear in the video?",
"question_wo_referring_query": "What is the ethnicity of the first person to appear in the video?",
"candidates": [
"White",
"Black",
"Asian"
],
"correct_choice": 1,
"position": [
190,
2183
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "kj3Po7zUeyw_0",
"video_path": "kj3Po7zUeyw.mp4",
"subtitle_path": "kj3Po7zUeyw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 504.37,
"view_count": 7665
},
{
"video_id": "kj3Po7zUeyw",
"question": "Who is the first person to die in the video?",
"question_wo_referring_query": "Who is the first person to die in the video?",
"candidates": [
"Grace",
"Philip",
"Sarah Douglas",
"Kate",
"John"
],
"correct_choice": 1,
"position": [
3471,
9641
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "kj3Po7zUeyw_1",
"video_path": "kj3Po7zUeyw.mp4",
"subtitle_path": "kj3Po7zUeyw_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 504.37,
"view_count": 7665
},
{
"video_id": "w2HKqOhf7KQ",
"question": "A man wearing a checkered shirt stands opposite a woman. One of the man's eyes is covered with a white bandage. Behind him, there is a tree, and behind the tree, there is a house. Before the subtitle 'fleeing after the attack Miguel finally' appears, what is the first mode of transportation that appears?",
"question_wo_referring_query": "What is the first mode of transportation that appears?",
"candidates": [
"Train",
"Car",
"Bicycle",
"Horse-drawn carriage",
"Motorcycle"
],
"correct_choice": 1,
"position": [
8550,
90
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "w2HKqOhf7KQ_0",
"video_path": "w2HKqOhf7KQ.mp4",
"subtitle_path": "w2HKqOhf7KQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 598.27,
"view_count": 18855
},
{
"video_id": "w2HKqOhf7KQ",
"question": "A man and a woman dressed in white clothes are standing by a car. To their left is a person wearing a hat and striped clothes. After the subtitle 'other people seeking refuge the next day' appears, what is the first mode of transportation that appears?",
"question_wo_referring_query": "What is the first mode of transportation to appear?",
"candidates": [
"bicycle",
"horse carriage",
"car",
"train",
"motorcycle"
],
"correct_choice": 2,
"position": [
12482,
12934
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3O",
"level": "L2-Relation",
"id": "w2HKqOhf7KQ_1",
"video_path": "w2HKqOhf7KQ.mp4",
"subtitle_path": "w2HKqOhf7KQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 598.27,
"view_count": 18855
},
{
"video_id": "k4jiEuZbN-4",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, in a room, the person in the black and white outfit is talking to a man wearing a hat. Then, the person in the black and white outfit is talking with a lady in black clothes, both having their hands on the table. Finally, a person wearing a black and white outfit with glasses is making a phone call.",
"First, the person in the black and white outfit is talking with a lady in black clothes, both having their hands on the table. Then, a person wearing a black and white outfit with glasses is making a phone call. Finally, in a room, the person in the black and white outfit is talking to a man wearing a hat.",
"First, in a room, the person in the black and white outfit is talking to a man wearing a hat. Then, a person wearing a black and white outfit with glasses is making a phone call. Finally, the person in the black and white outfit is talking with a lady in black clothes, both having their hands on the table.",
"First, a person wearing a black and white outfit with glasses is making a phone call. Then, the person in the black and white outfit is talking with a lady in black clothes, both having their hands on the table. Finally, in a room, the person in the black and white outfit is talking to a man wearing a hat.",
"First, a person wearing a black and white outfit with glasses is making a phone call. Then, in a room, the person in the black and white outfit is talking to a man wearing a hat. Finally, the person in the black and white outfit is talking with a lady in black clothes, both having their hands on the table."
],
"correct_choice": 4,
"position": [
3940,
7518,
8342
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "k4jiEuZbN-4_0",
"video_path": "k4jiEuZbN-4.mp4",
"subtitle_path": "k4jiEuZbN-4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 600.0,
"view_count": 824616
},
{
"video_id": "k4jiEuZbN-4",
"question": "Which of the following sequence of scenes is correct?",
"question_wo_referring_query": "Which of the following sequence of scenes is correct?",
"candidates": [
"First, two people are hugging in a red room with a window behind them. The window has blue curtains on both sides. Next, a police officer is standing behind a person wearing a prison uniform, facing a door with a transparent glass panel. Lastly, a person is washing a small knife covered in fresh blood at a sink.",
"First, a police officer is standing behind a person wearing a prison uniform, facing a door with a transparent glass panel. Next, a person is washing a small knife covered in fresh blood at a sink. Finally, two people are hugging in a red room with a window behind them. The window has blue curtains on both sides.",
"First, a person is washing a small knife covered in fresh blood at a sink. Next, a police officer is standing behind a person wearing a prison uniform, facing a door with a transparent glass panel. Finally, two people are hugging in a red room with a window behind them. The window has blue curtains on both sides.",
"First, a person is washing a small knife covered in fresh blood at a sink. Next, two people are hugging in a red room with a window behind them. The window has blue curtains on both sides. Lastly, a police officer is standing behind a person wearing a prison uniform, facing a door with a transparent glass panel.",
"First, two people are hugging in a red room with a window behind them. The window has blue curtains on both sides. Next, a person is washing a small knife covered in fresh blood at a sink. Finally, a police officer is standing behind a person wearing a prison uniform, facing a door with a transparent glass panel."
],
"correct_choice": 3,
"position": [
12219,
12746,
13655
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "k4jiEuZbN-4_1",
"video_path": "k4jiEuZbN-4.mp4",
"subtitle_path": "k4jiEuZbN-4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 600.0,
"view_count": 824616
},
{
"video_id": "OUeE8nCKWGA",
"question": "In a store, there are shelves with bottles and other items at the back, a bookshelf full of books in the distance, a stack of books in the bottom right corner, books on a glass counter in the middle, a white Christmas tree, glass bottles containing things, a person wearing a black coat on the left, and a man wearing glasses in the middle. What is the man wearing glasses in the middle doing?",
"question_wo_referring_query": "What is the man wearing glasses in the middle doing?",
"candidates": [
"Writing",
"Tearing a book",
"Holding a glass cup",
"Holding a book",
"Picking up the Christmas tree"
],
"correct_choice": 2,
"position": [
3149
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "OUeE8nCKWGA_0",
"video_path": "OUeE8nCKWGA.mp4",
"subtitle_path": "OUeE8nCKWGA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 275.44,
"view_count": 24304
},
{
"video_id": "OUeE8nCKWGA",
"question": "In a room with a white background, there is a man with short hair wearing a short-sleeved T-shirt, sitting in front of a mirror. What is he doing at this moment?",
"question_wo_referring_query": "What is he doing at this moment?",
"candidates": [
"Raised both hands upwards",
"Shaking head",
"Stood up",
"Crying",
"Hands clasped together"
],
"correct_choice": 4,
"position": [
6282
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "OUeE8nCKWGA_1",
"video_path": "OUeE8nCKWGA.mp4",
"subtitle_path": "OUeE8nCKWGA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 275.44,
"view_count": 24304
},
{
"video_id": "N7RTTiHsSjI",
"question": "On the exterior facade of a building, there are three windows in the middle, a black pipe on the left side, and a white air conditioner unit on the right side. What is the shape of the air conditioner unit?",
"question_wo_referring_query": "What is the shape of the air conditioner unit?",
"candidates": [
"Cuboid",
"Spherical",
"Cube",
"Cylindrical",
"Conical"
],
"correct_choice": 0,
"position": [
5830
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "N7RTTiHsSjI_0",
"video_path": "N7RTTiHsSjI.mp4",
"subtitle_path": "N7RTTiHsSjI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 481.24,
"view_count": 3646388
},
{
"video_id": "N7RTTiHsSjI",
"question": "On a flat ground, the left side is the exterior wall of a building. There are two windows on the exterior wall, two outdoor units of air conditioners, and several plants at the bottom. In the middle, there are 7 people, 5 of whom are looking upward. There is a car in front of them, and there is a liquid on the car. Could you please tell me the color of the liquid on the car at this time?",
"question_wo_referring_query": "Could you please tell me the color of the liquid on the car at this time?",
"candidates": [
"Purple",
"Black",
"Blue",
"White",
"Blood Red"
],
"correct_choice": 4,
"position": [
8859
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "N7RTTiHsSjI_1",
"video_path": "N7RTTiHsSjI.mp4",
"subtitle_path": "N7RTTiHsSjI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 481.24,
"view_count": 3646388
},
{
"video_id": "e6HwinLBK_Y",
"question": "In the background of the scene, there's a green plant. In the middle, there is a black man who is Kamala Khan's clone. Next to him, there is a woman wearing a white hat. Who is smiling in the scene at this moment?",
"question_wo_referring_query": "Who is smiling in the scene at this moment?",
"candidates": [
"The woman wearing a black hat",
"Kamala Khan's clone",
"The woman wearing a white hat",
"The man wearing a white hat"
],
"correct_choice": 1,
"position": [
6291
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E2O",
"level": "L1-Perception",
"id": "e6HwinLBK_Y_0",
"video_path": "e6HwinLBK_Y.mp4",
"subtitle_path": "e6HwinLBK_Y_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 512.08,
"view_count": 489768
},
{
"video_id": "e6HwinLBK_Y",
"question": "In the middle of the lake, there is a person wearing green pants standing inside holding a box. There are ripples on the lake. What is the object that is falling at this moment?",
"question_wo_referring_query": "What is the object that is falling at this moment?",
"candidates": [
"bird droppings",
"red powder",
"tears",
"ashes",
"a box of ashes"
],
"correct_choice": 3,
"position": [
9639
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E2O",
"level": "L1-Perception",
"id": "e6HwinLBK_Y_1",
"video_path": "e6HwinLBK_Y.mp4",
"subtitle_path": "e6HwinLBK_Y_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 512.08,
"view_count": 489768
},
{
"video_id": "T5bTeGzgJFs",
"question": "On a grassy field with a gentle breeze and beautiful sunshine, there is a withered tree on the right. In the middle, a group of people dressed in animal skins are standing. They are holding long spears, and the leader is placing stone heads with carved patterns together. Which of the following subtitles have appeared along with the stone heads with carved patterns?",
"question_wo_referring_query": "Which of the following subtitles have appeared along with the stone heads with carved patterns?",
"candidates": [
"The trip continues with only stops to hunt and rest. One evening,when they find a large and",
"men bring the sleds to carry the bison on,Tau stays by the edge of the diff,crying in denial.",
"Now that their hunt has been divided equally,Xi's party leaves on their own to return to their home",
"to give Kappa a memorial service that symbolizes the passing of one's spirit to the afterlife.",
"on a further ledge,where he breaks a leg and falls unconscious. Desperate to save his san,"
],
"correct_choice": 3,
"position": [
2822,
2868,
3066,
3579,
3780,
3913
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "T5bTeGzgJFs_0",
"video_path": "T5bTeGzgJFs.mp4",
"subtitle_path": "T5bTeGzgJFs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 534.34,
"view_count": 1611471
},
{
"video_id": "T5bTeGzgJFs",
"question": "In a mountain cave, there is a person on the right wearing something, and on the left, there is a wolf baring its teeth at the person wearing something. Below the wolf, there is a coil of rope. Which of the following subtitles have appeared with the wolf?",
"question_wo_referring_query": "Which of the following subtitles have appeared with the wolf?",
"candidates": [
"on a further ledge, where he breaks a leg and falls unconscious. Desperate to save his san,",
"snowstorms become more common, so Keda is having difficulty making much progress each day,",
"Keda tells Alpha to go with them, so now he is left to continue traveling alone. As time passer,",
"men bring the sleds to carry the bison on, Tau stays by the edge of the diff, crying in denial.",
"finds with the wolf as well. But worms are not enough to keep themselves fed, so Keda finally"
],
"correct_choice": 4,
"position": [
6710,
6920,
3780,
9051,
9102,
3579
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "T5bTeGzgJFs_1",
"video_path": "T5bTeGzgJFs.mp4",
"subtitle_path": "T5bTeGzgJFs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 534.34,
"view_count": 1611471
},
{
"video_id": "n24n_20Kwe4",
"question": "In the scene, there is a half-open brown door, some flowers painted on the white wall, and a girl standing in the doorway. She has two ponytails and is wearing a red inner clothing with a purple fur coat. When the subtitle 'appeared she must have been hiding when' appears, what happens to the girl?",
"question_wo_referring_query": "What happens to the girl?",
"candidates": [
"Holds a doll and smiles at the camera",
"Holds a doll and shows a fearful expression",
"Holds a pillow and squats down",
"Holds a doll and squats down",
"Looks at a kitten nearby"
],
"correct_choice": 1,
"position": [
1022
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2E",
"level": "L1-Perception",
"id": "n24n_20Kwe4_0",
"video_path": "n24n_20Kwe4.mp4",
"subtitle_path": "n24n_20Kwe4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 593.36,
"view_count": 13151
},
{
"video_id": "n24n_20Kwe4",
"question": "In the scene, there are two large trees with thick branches and dense foliage. Below the trees is a path with wooden railings, and a car is parked on the road. When the subtitle 'from her so after all that craziness Tom' appears, what happens?",
"question_wo_referring_query": "What happens?",
"candidates": [
"A man holding a bag walks towards the camera with his head down.",
"A dog and a cat chase each other under the tree.",
"A woman holding a bag walks towards the camera with her head down.",
"A child runs with a toy in his arms.",
"A black car drives by."
],
"correct_choice": 0,
"position": [
10684
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2E",
"level": "L1-Perception",
"id": "n24n_20Kwe4_1",
"video_path": "n24n_20Kwe4.mp4",
"subtitle_path": "n24n_20Kwe4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 593.36,
"view_count": 13151
},
{
"video_id": "sHUEJw1BsGQ",
"question": "In a car, in the front left seat sits a man wearing a green hat, with his face covered, dressed in khaki clothes. Behind him is a woman with blonde hair in white clothes. In the front right seat is a man wearing sunglasses, with his face covered, dressed in gray clothes, with a black steering wheel in front of him. Behind him is a man with short hair, dressed in a white long-sleeve shirt, wearing a black bow tie. When the subtitle mentions 'The pirates intend to take them to their boss, but Darey tricks them by asking for a cigarette,' what object is present in the picture?",
"question_wo_referring_query": "What object is present in the picture?",
"candidates": [
"fan",
"cup",
"cigarette",
"computer",
"mobile phone"
],
"correct_choice": 2,
"position": [
3528
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2O",
"level": "L1-Perception",
"id": "sHUEJw1BsGQ_0",
"video_path": "sHUEJw1BsGQ.mp4",
"subtitle_path": "sHUEJw1BsGQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 470.1,
"view_count": 4207
},
{
"video_id": "sHUEJw1BsGQ",
"question": "By the side of a sea, a person with short hair, dressed in a white long-sleeve shirt with a bit of black, stands in the middle. In front of him stands a person with a covered face, dressed in olive clothing. Behind them sits a group of people wearing different styles of clothes. Trees are also planted behind the crowd. When the subtitle reaches 'When Tom is taken to the other captives, he provocatively says something in Balinese to,' what object is present in the frame?",
"question_wo_referring_query": "What object is present in the frame?",
"candidates": [
"computer",
"mobile phone",
"water cup",
"fishing rod",
"hat"
],
"correct_choice": 4,
"position": [
6630
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2O",
"level": "L1-Perception",
"id": "sHUEJw1BsGQ_1",
"video_path": "sHUEJw1BsGQ.mp4",
"subtitle_path": "sHUEJw1BsGQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 470.1,
"view_count": 4207
},
{
"video_id": "XJ6REZOXsvM",
"question": "In a room, there is a table beside a sofa. On the sofa, there is a man with short hair wearing a long-sleeved shirt. In front of him, there is a transparent window. In front of the window, there is another table with decorations on it. To the left of it, there is a glowing screen. To the left of the screen, there is a burning fireplace. Next to the man, there is also a lamp. What is the shape of the glowing screen?",
"question_wo_referring_query": "What is the shape of the glowing screen?",
"candidates": [
"Circle",
"Pentagon",
"Oval",
"Triangle",
"Square"
],
"correct_choice": 4,
"position": [
13327
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "XJ6REZOXsvM_0",
"video_path": "XJ6REZOXsvM.mp4",
"subtitle_path": "XJ6REZOXsvM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 588.09,
"view_count": 116450
},
{
"video_id": "XJ6REZOXsvM",
"question": "In a room with slightly dim lighting, there stands a woman with long hair, wearing blue clothes. She is looking at a square screen in front of her and is touching the screen with her hand. On the screen, there is a man. Not far from them, there is a lit lamp. What is the color of the man's clothes on the screen?",
"question_wo_referring_query": "What is the color of the man's clothes on the screen?",
"candidates": [
"white",
"yellow",
"purple",
"olive",
"pink"
],
"correct_choice": 0,
"position": [
8738
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "XJ6REZOXsvM_1",
"video_path": "XJ6REZOXsvM.mp4",
"subtitle_path": "XJ6REZOXsvM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 588.09,
"view_count": 116450
},
{
"video_id": "pPJq1rMDRGs",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, a scene appears with 'various hanging lights glowing fuzzily with a woman in the middle whose face is covered in blood,' then a scene appears with 'a white cloud-filled sky, with a power pole on the left and a woman with golden curly hair and a sideways face on the right,' and finally, a scene appears with 'a man in grey-green clothes standing in front of a blurred hilly background.'",
"First, a scene appears with 'a man in grey-green clothes standing in front of a blurred hilly background,' then a scene appears with 'various hanging lights glowing fuzzily with a woman in the middle whose face is covered in blood,' and finally, a scene appears with 'a white cloud-filled sky, with a power pole on the left and a woman with golden curly hair and a sideways face on the right.'",
"First, a scene appears with 'various hanging lights glowing fuzzily with a woman in the middle whose face is covered in blood,' then a scene appears with 'a man in grey-green clothes standing in front of a blurred hilly background,' and finally, a scene appears with 'a white cloud-filled sky, with a power pole on the left and a woman with golden curly hair and a sideways face on the right.'",
"First, a scene appears with 'a white cloud-filled sky, with a power pole on the left and a woman with golden curly hair and a sideways face on the right,' then a scene appears with 'various hanging lights glowing fuzzily with a woman in the middle whose face is covered in blood,' and finally, a scene appears with 'a man in grey-green clothes standing in front of a blurred hilly background.'",
"First, a scene appears with 'a man in grey-green clothes standing in front of a blurred hilly background,' then a scene appears with 'a white cloud-filled sky, with a power pole on the left and a woman with golden curly hair and a sideways face on the right,' and finally, a scene appears with 'various hanging lights glowing fuzzily with a woman in the middle whose face is covered in blood.'"
],
"correct_choice": 1,
"position": [
947,
2086,
3534
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "pPJq1rMDRGs_0",
"video_path": "pPJq1rMDRGs.mp4",
"subtitle_path": "pPJq1rMDRGs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 484.33,
"view_count": 27739
},
{
"video_id": "pPJq1rMDRGs",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a scene where 'inside a running car, a woman with a headscarf and a man driving the car, outside the car window is a vast yellow plain.' Then, a scene where 'in the middle of the frame, a door is open with strong light shining through it, a man is standing in the light. Both sides of the frame show walls with cracks.' Lastly, a scene where 'a woman with long straight hair is standing in front of a background wall with a green light, shouting into a microphone.'",
"First, a scene where 'inside a running car, a woman with a headscarf and a man driving the car, outside the car window is a vast yellow plain.' Then, a scene where 'a woman with long straight hair is standing in front of a background wall with a green light, shouting into a microphone.' Lastly, a scene where 'in the middle of the frame, a door is open with strong light shining through it, a man is standing in the light. Both sides of the frame show walls with cracks.'",
"First, a scene where 'in the middle of the frame, a door is open with strong light shining through it, a man is standing in the light. Both sides of the frame show walls with cracks.' Then, a scene where 'a woman with long straight hair is standing in front of a background wall with a green light, shouting into a microphone.' Lastly, a scene where 'inside a running car, a woman with a headscarf and a man driving the car, outside the car window is a vast yellow plain.'",
"First, a scene where 'a woman with long straight hair is standing in front of a background wall with a green light, shouting into a microphone.' Then, a scene where 'in the middle of the frame, a door is open with strong light shining through it, a man is standing in the light. Both sides of the frame show walls with cracks.' Lastly, a scene where 'inside a running car, a woman with a headscarf and a man driving the car, outside the car window is a vast yellow plain.'",
"First, a scene where 'in the middle of the frame, a door is open with strong light shining through it, a man is standing in the light. Both sides of the frame show walls with cracks.' Then, a scene where 'inside a running car, a woman with a headscarf and a man driving the car, outside the car window is a vast yellow plain.' Lastly, a scene where 'a woman with long straight hair is standing in front of a background wall with a green light, shouting into a microphone.'"
],
"correct_choice": 3,
"position": [
7661,
9061,
10599
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "pPJq1rMDRGs_1",
"video_path": "pPJq1rMDRGs.mp4",
"subtitle_path": "pPJq1rMDRGs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 484.33,
"view_count": 27739
},
{
"video_id": "tWiGnu2BNsY",
"question": "A man in a yellow coat is shopping with a red shopping basket in a supermarket. To his left is a woman wearing blue clothes. What happened in front of the man while he was shopping?",
"question_wo_referring_query": "What happened?",
"candidates": [
"Kissed a prison guard",
"Fought with a prison guard",
"Danced with a prison guard",
"Shook hands with a prison guard",
"Hugged a prison guard"
],
"correct_choice": 3,
"position": [
5087,
688
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "tWiGnu2BNsY_0",
"video_path": "tWiGnu2BNsY.mp4",
"subtitle_path": "tWiGnu2BNsY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 585.93,
"view_count": 268176
},
{
"video_id": "tWiGnu2BNsY",
"question": "A man wearing white clothes is lying on the grass in a graveyard. Next to him are yellow clothes and a black bag. On the grass, there are many standing tombstones. Before the man lies down on the grass, what was he doing?",
"question_wo_referring_query": "What was the man doing?",
"candidates": [
"Piloting a plane",
"Peeling an apple",
"Cutting a watermelon",
"Coughing",
"Opening a wheelbarrow"
],
"correct_choice": 3,
"position": [
11193,
9532
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "E3E",
"level": "L2-Relation",
"id": "tWiGnu2BNsY_1",
"video_path": "tWiGnu2BNsY.mp4",
"subtitle_path": "tWiGnu2BNsY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 585.93,
"view_count": 268176
},
{
"video_id": "LAQRjgx_OY8",
"question": "In a museum, the surrounding display stands are filled with various artifacts. There is a short-haired man wearing a hoodie who is touching a metal tool in front of him. When he appears in a room, bound to a chair, the mirror behind him reflects the image of the man in the hoodie. At the same time, there is also a man in a black coat sitting opposite him. What kind of change did the man in the hoodie experience?",
"question_wo_referring_query": "What kind of change did the man in the hoodie experience?",
"candidates": [
"The yellow-green workwear changed to a white shirt paired with a dark blue blazer.",
"The yellow shirt changed to a blue suit.",
"The green cleaning uniform changed to a blue suit with a white shirt.",
"The black jacket changed to a blue suit.",
"The yellow workwear changed to a black formal outfit."
],
"correct_choice": 2,
"position": [
4322,
4562
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SAA",
"level": "L2-Relation",
"id": "LAQRjgx_OY8_0",
"video_path": "LAQRjgx_OY8.mp4",
"subtitle_path": "LAQRjgx_OY8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 468.37,
"view_count": 58819
},
{
"video_id": "LAQRjgx_OY8",
"question": "In a forest surrounded by trees, there is a plant with long leaves. To the right, there is a black man wearing glasses and grinning. When he appears in the center of a grassy area with his hands in front of him, what kind of transformation does this man undergo?",
"question_wo_referring_query": "What kind of transformation does this black man undergo?",
"candidates": [
"The black tight-fitting suit transforms into a black suit",
"The olive green trench coat transforms into a black skirt",
"The black jacket transforms into a white suit",
"The purple T-shirt transforms into a blue jacket",
"The black backpack transforms into a blue jacket"
],
"correct_choice": 0,
"position": [
9317,
10389
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SAA",
"level": "L2-Relation",
"id": "LAQRjgx_OY8_1",
"video_path": "LAQRjgx_OY8.mp4",
"subtitle_path": "LAQRjgx_OY8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 468.37,
"view_count": 58819
},
{
"video_id": "@movie.explained6-7259227637992705282",
"question": "A woman with a ponytail appears in the middle of the screen, with the text 'The camera only works with people' at the bottom of the screen. Which item appears in the middle of the screen?",
"question_wo_referring_query": "Which item appears in the middle of the screen?",
"candidates": [
"earrings",
"hat",
"crown",
"necklace",
"earphones"
],
"correct_choice": 0,
"position": [
960
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@movie.explained6-7259227637992705282_0",
"video_path": "7259227637992705282.mp4",
"subtitle_path": "7259227637992705282_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.0,
"view_count": 2511
},
{
"video_id": "@movie.explained6-7253164232307379457",
"question": "In a room with a black and white checkered floor, a man wearing a black hooded jacket is holding a woman whose mouth is taped with blue tape. When the subtitle \"In a short time, her friend fell to the ground in pain. Her organs were dissolved by the venom\" appears, what color clothes is the woman wearing?",
"question_wo_referring_query": "What color clothes is the woman whose mouth is taped with blue tape wearing?",
"candidates": [
"Purple",
"Blue",
"White",
"Red",
"Green"
],
"correct_choice": 1,
"position": [
1023
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@movie.explained6-7253164232307379457_0",
"video_path": "7253164232307379457.mp4",
"subtitle_path": "7253164232307379457_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.0,
"view_count": 9417
},
{
"video_id": "@movie.explained6-7280742496882134274",
"question": "Which of the following animals appears first in the video?",
"question_wo_referring_query": "Which of the following animals appears first in the video?",
"candidates": [
"Snake",
"Spider",
"Chicken",
"Cicada",
"Mouse"
],
"correct_choice": 2,
"position": [
421,
972
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@movie.explained6-7280742496882134274_0",
"video_path": "7280742496882134274.mp4",
"subtitle_path": "7280742496882134274_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 57.97,
"view_count": 53373
},
{
"video_id": "@movie.explained6-7262938043315686664",
"question": "In a classroom with many wooden desks and chairs, a woman with black hair wearing a black dress is standing. The woman is surrounded by many children. With which of the following subtitles did the woman in the black dress appear together?",
"question_wo_referring_query": "With which of the following subtitles did the woman in the black dress appear together?",
"candidates": [
"\"the truth of the matter. Call on your friend and sneak into the teacher's house. The three have\"",
"\"against the wall.\"",
"\"living room\"",
"\"The three of them had to hide behind the sofa.\"",
"\"Unexpectedly, a few people even entered the room, discovering the previous teacher leaning\""
],
"correct_choice": 2,
"position": [
66,
913,
288,
728,
822,
939
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "@movie.explained6-7262938043315686664_0",
"video_path": "7262938043315686664.mp4",
"subtitle_path": "7262938043315686664_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 51.36,
"view_count": 11954
},
{
"video_id": "@movie.explained6-7268167132591000833",
"question": "In a scene where a pair of hands is holding a phone with a black case, there is a text message on the phone containing a string of numbers 558441328. What color is the nail polish on the hands in the scene?",
"question_wo_referring_query": "What color is the nail polish on the hands in the scene?",
"candidates": [
"green",
"black",
"purple",
"blue",
"red"
],
"correct_choice": 4,
"position": [
633
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@movie.explained6-7268167132591000833_0",
"video_path": "7268167132591000833.mp4",
"subtitle_path": "7268167132591000833_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 56.13,
"view_count": 3443
},
{
"video_id": "@movie.explained6-7277548218106367234",
"question": "In a room with portraits on the wall, there is a white sofa with a woman lying on it. What was the first outfit the woman on the white sofa was wearing when she appeared?",
"question_wo_referring_query": "What was the first outfit the woman on the white sofa was wearing when she appeared?",
"candidates": [
"A pink long-sleeve top",
"A white dress",
"A black hoodie",
"A dark blue top with floral prints",
"A blue bodysuit"
],
"correct_choice": 3,
"position": [
52,
546
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@movie.explained6-7277548218106367234_0",
"video_path": "7277548218106367234.mp4",
"subtitle_path": "7277548218106367234_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 49.4,
"view_count": 5134
},
{
"video_id": "@movie.explained6-7269647281668852993",
"question": "In a vast open area, there is a centipede formed by criminals dressed in orange uniforms. After the narration saying \"In American prisons, the mouths and buttocks of 800 inmates are sewn together to form a long human centipede\", what event happened in the video?",
"question_wo_referring_query": "What event happened in the video?",
"candidates": [
"Criminals in orange clothes gather to gamble",
"Criminals in orange clothes gather to fight",
"A man in orange clothes is pushing his head against the person in front of him",
"Criminals in orange clothes go around killing people",
"A man in gray-green clothes is touching the brain of a dark-skinned man"
],
"correct_choice": 4,
"position": [
141,
166,
41
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@movie.explained6-7269647281668852993_0",
"video_path": "7269647281668852993.mp4",
"subtitle_path": "7269647281668852993_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 59.67,
"view_count": 26516
},
{
"video_id": "@movie.explained6-7267308320413797650",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a person holding a phone taking a photo of their leg; then a pale-skinned woman and a dark-skinned man kissing; and finally, a woman wearing a green headband sliding down a hill being pulled back by someone.",
"First, a pale-skinned woman and a dark-skinned man kissing; then a person holding a phone taking a photo of their leg; and finally, a woman wearing a green headband sliding down a hill being pulled back by someone.",
"First, a pale-skinned woman and a dark-skinned man kissing; then a woman wearing a green headband sliding down a hill being pulled back by someone; and finally, a person holding a phone taking a photo of their leg.",
"First, a person holding a phone taking a photo of their leg; then a woman wearing a green headband sliding down a hill being pulled back by someone; and finally, a pale-skinned woman and a dark-skinned man kissing.",
"First, a woman wearing a green headband sliding down a hill being pulled back by someone; then a pale-skinned woman and a dark-skinned man kissing; and finally, a person holding a phone taking a photo of their leg."
],
"correct_choice": 0,
"position": [
172,
1211,
1353
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@movie.explained6-7267308320413797650_0",
"video_path": "7267308320413797650.mp4",
"subtitle_path": "7267308320413797650_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.03,
"view_count": 9701
},
{
"video_id": "@movie.explained6-7275653401025826050",
"question": "In a sunny outdoor setting, a fire truck stands nearby along with a firefighter whose skin is dark. In which of the following scenes has this dark-skinned firefighter appeared before?",
"question_wo_referring_query": "In which of the following scenes has the dark-skinned firefighter appeared before?",
"candidates": [
"In front of a burning house",
"Inside a garbage truck loaded with trash",
"Outside a forest on fire",
"In a rainy park",
"On a golden sandy beach"
],
"correct_choice": 1,
"position": [
502,
963
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SOS",
"level": "L2-Relation",
"id": "@movie.explained6-7275653401025826050_0",
"video_path": "7275653401025826050.mp4",
"subtitle_path": "7275653401025826050_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.03,
"view_count": 3541
},
{
"video_id": "@movie.explained6-7267884432420277506",
"question": "In a scene with the white text 'to partner in the firm', there is a man wearing a blue shirt who is resting his hand on the shoulder of another man beside him. What kind of beard does the man in the blue shirt have?",
"question_wo_referring_query": "What kind of beard does the man in the blue shirt have?",
"candidates": [
"Mountain goat beard",
"Connected beard",
"Eight-character beard",
"One-character beard",
"Shaving beard"
],
"correct_choice": 1,
"position": [
1336
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@movie.explained6-7267884432420277506_0",
"video_path": "7267884432420277506.mp4",
"subtitle_path": "7267884432420277506_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.02,
"view_count": 5452
},
{
"video_id": "@movie.explained6-7268213090708245761",
"question": "In a scene with white text 'Tom used his eyes to signal Mike to take action', there is a table with many codes on it, and beside the table, there is a woman with short white hair wearing a red blouse. What is the woman in the red blouse doing when she first appears?",
"question_wo_referring_query": "What is the woman in the red blouse doing when she first appears?",
"candidates": [
"Covering her face with both hands",
"Putting both hands on the table with many codes",
"Drinking a glass of red wine",
"Brushing her hair with her right hand",
"Brushing her hair with her left hand"
],
"correct_choice": 1,
"position": [
559
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@movie.explained6-7268213090708245761_0",
"video_path": "7268213090708245761.mp4",
"subtitle_path": "7268213090708245761_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 56.47,
"view_count": 3112
},
{
"video_id": "@movie.explained6-7269746510462536962",
"question": "In a scene with white text 'I want you to start with that man over there, okay?' there are many people, and in front of a man in a black suit there is a triangular shelf with a basketball on it. What other objects are present in the scene?",
"question_wo_referring_query": "What other objects are present in the scene?",
"candidates": [
"A doll wearing purple clothes",
"A green plant",
"A yellow incense burner",
"A white chrysanthemum",
"A white dress"
],
"correct_choice": 0,
"position": [
144
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@movie.explained6-7269746510462536962_0",
"video_path": "7269746510462536962.mp4",
"subtitle_path": "7269746510462536962_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 58.7,
"view_count": 8420
},
{
"video_id": "@movie.explained6-7270293967017725185",
"question": "On a flat surface illuminated by the sun, there is a woman wearing a gray hoodie and a black choker on her neck. Behind her, there is a white door and white columns. What hairstyle does the woman in the gray hoodie have?",
"question_wo_referring_query": "What hairstyle does the woman in the gray hoodie have?",
"candidates": [
"high ponytail",
"braids",
"curly hair",
"bun",
"pigtails"
],
"correct_choice": 0,
"position": [
640
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@movie.explained6-7270293967017725185_0",
"video_path": "7270293967017725185.mp4",
"subtitle_path": "7270293967017725185_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 48.63,
"view_count": 13900
},
{
"video_id": "@movie.explained6-7254238802577820929",
"question": "Outside a bright window, there is a boy wearing a white shirt and a boy holding a basketball. At the bottom of the screen, there is white text that says 'Tom is really a pair of'. What is the boy in the white shirt doing in the scene?",
"question_wo_referring_query": "What is the boy in the white shirt doing in the scene?",
"candidates": [
"Holding his head and crying",
"Crouching on the boy holding the basketball",
"Fighting with the boy holding the basketball",
"Kneeling on the ground tying his shoelaces",
"Putting his arm around the boy holding the basketball"
],
"correct_choice": 1,
"position": [
121
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@movie.explained6-7254238802577820929_0",
"video_path": "7254238802577820929.mp4",
"subtitle_path": "7254238802577820929_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 54.19,
"view_count": 3006
},
{
"video_id": "@movie.explained6-7269006883380399362",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a man wearing a black hoodie yawning; then, a man holding a white sign with the red text 'I HATE'; finally, two dark-skinned children lying at the doorway looking outside.",
"First, two dark-skinned children lying at the doorway looking outside; then, a man wearing a black hoodie yawning; finally, a man holding a white sign with the red text 'I HATE'.",
"First, a man holding a white sign with the red text 'I HATE'; then, a man wearing a black hoodie yawning; finally, two dark-skinned children lying at the doorway looking outside.",
"First, a man holding a white sign with the red text 'I HATE'; then, two dark-skinned children lying at the doorway looking outside; finally, a man wearing a black hoodie yawning.",
"First, two dark-skinned children lying at the doorway looking outside; then, a man holding a white sign with the red text 'I HATE'; finally, a man wearing a black hoodie yawning."
],
"correct_choice": 4,
"position": [
167,
335,
838
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@movie.explained6-7269006883380399362_0",
"video_path": "7269006883380399362.mp4",
"subtitle_path": "7269006883380399362_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 57.47,
"view_count": 2009
},
{
"video_id": "@movie.explained6-7270058010888768770",
"question": "In a dark room, there is a phone, and the screen of the phone shows the number 911. When the subtitle \"However, the girl used a disposable phone without a chip, making it impossible to locate her\" appears, what change happens to the text in the middle of the phone screen?",
"question_wo_referring_query": "What change happens to the text in the middle of the phone screen?",
"candidates": [
"Changes from the number 911 to the number 129",
"Changes from the number 911 to the blue text 'New Voicemail RAUL'",
"Changes from the number 911 to the number 119",
"Changes from the number 911 to the blue text 'save me'",
"Changes from the number 911 to the number 110"
],
"correct_choice": 1,
"position": [
437,
657
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@movie.explained6-7270058010888768770_0",
"video_path": "7270058010888768770.mp4",
"subtitle_path": "7270058010888768770_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 40.6,
"view_count": 4212
},
{
"video_id": "@kerstinong-7074483613273754882",
"question": "Inside a floor-to-ceiling window with a view of tall buildings there is a beauty room with pink walls and a white beauty bed. When the subtitle 'Okay, let's get started' appears, what objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"false eyelashes",
"face mask",
"powder puff",
"lighting stand",
"mirror"
],
"correct_choice": 3,
"position": [
261
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "@kerstinong-7074483613273754882_0",
"video_path": "7074483613273754882.mp4",
"subtitle_path": "7074483613273754882_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 47.4,
"view_count": 21400
},
{
"video_id": "@lisolna-7288701821369978144",
"question": "Which of the following sequences of scenes in the video is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes in the video is correct?",
"candidates": [
"First, take a bottle of juice from the drinks cabinet, then use chopsticks to pick up a reddish food item from the white plate, and finally pick up an orange from the dining table.",
"First, pick up an orange from the dining table, then take a bottle of juice from the drinks cabinet, and finally use chopsticks to pick up a reddish food item from the white plate.",
"First, pick up an orange from the dining table, then use chopsticks to pick up a reddish food item from the white plate, and finally take a bottle of juice from the drinks cabinet.",
"First, take a bottle of juice from the drinks cabinet, then pick up an orange from the dining table, and finally use chopsticks to pick up a reddish food item from the white plate.",
"First, use chopsticks to pick up a reddish food item from the white plate, then take a bottle of juice from the drinks cabinet, and finally pick up an orange from the dining table."
],
"correct_choice": 1,
"position": [
483,
612,
917
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@lisolna-7288701821369978144_0",
"video_path": "7288701821369978144.mp4",
"subtitle_path": "7288701821369978144_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.17,
"view_count": 76706
},
{
"video_id": "@jonijawne-7197144530024500485",
"question": "Where else have we seen the man in black clothes standing indoors beside a large floor-to-ceiling window with a black frame, through which we can see the view of tall buildings outside?",
"question_wo_referring_query": "In what other scenes has he appeared?",
"candidates": [
"Inside the hotel lobby",
"On the street downstairs",
"On the couch in the mall",
"Beside the transparent glass wall on the rooftop under the blue sky",
"Inside the coffee shop"
],
"correct_choice": 3,
"position": [
56,
820
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "@jonijawne-7197144530024500485_0",
"video_path": "7197144530024500485.mp4",
"subtitle_path": "7197144530024500485_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 38.67,
"view_count": 3083
},
{
"video_id": "@lisolna-7284252812194729248",
"question": "What change occurred on the gray marble table with food placed on a white plate when the subtitles 'ASAP baby Hurry up don't say maybe Hi, it's me again, I'm back' appeared?",
"question_wo_referring_query": "What change occurred?",
"candidates": [
"From indoors to outdoors on a wooden table",
"From a white plate to a black plate",
"From a white plate to a yellow plate",
"From a full plate to a partially empty plate",
"From a white plate to a blue plate"
],
"correct_choice": 0,
"position": [
433,
830
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@lisolna-7284252812194729248_0",
"video_path": "7284252812194729248.mp4",
"subtitle_path": "7284252812194729248_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.9,
"view_count": 59191
},
{
"video_id": "@emsiiees-7296863219178638594",
"question": "Inside the glass window that looks out onto the street, there are five cups with black straws on the wooden table. Who is the person holding one of the drinks in the video while tasting it?",
"question_wo_referring_query": "Who is the person?",
"candidates": [
"The woman wearing a black top and sunglasses",
"The man wearing a black short-sleeved shirt",
"The woman wearing a white top and glasses",
"The woman with long black hair wearing a black short-sleeved shirt",
"The short-haired woman wearing a white top"
],
"correct_choice": 3,
"position": [
332
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "@emsiiees-7296863219178638594_0",
"video_path": "7296863219178638594.mp4",
"subtitle_path": "7296863219178638594_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 37.27,
"view_count": 2472
},
{
"video_id": "@lisolna-7184160724715883781",
"question": "In front of a white wall, on a golden iron frame, what is happening on the screen when a transparent light bulb appears for the first time?",
"question_wo_referring_query": ", what is happening on the screen?",
"candidates": [
"Disassembling the table lamp",
"Turning on the table lamp",
"Repairing the table lamp",
"Assembling the table lamp",
"Cleaning the table lamp"
],
"correct_choice": 3,
"position": [
286
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@lisolna-7184160724715883781_0",
"video_path": "7184160724715883781.mp4",
"subtitle_path": "7184160724715883781_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 22.5,
"view_count": 12628
},
{
"video_id": "@jonijawne-7214559428471770374",
"question": "A man wearing a gray-green top and a black mask, with black short hair clipped with four red and yellow hair curlers, holding a blue hair dryer in one hand, what is he doing when the subtitle 'Firstly, starting off with flat hair, you want to use' appears?",
"question_wo_referring_query": "A man wearing a gray-green top and a black mask, with black short hair clipped with four red and yellow hair curlers, holding a blue hair dryer in one hand, what is he doing when the subtitle 'Firstly, starting off with flat hair, you want to use' appears?",
"candidates": [
"Blow-drying hair",
"Brushing teeth",
"Washing face",
"Spraying hair with styling mist",
"Burning hair"
],
"correct_choice": 0,
"position": [
209
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@jonijawne-7214559428471770374_0",
"video_path": "7214559428471770374.mp4",
"subtitle_path": "7214559428471770374_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 28.53,
"view_count": 333952
},
{
"video_id": "@lisolna-7321010343067618592",
"question": "Which object appears first in the video?",
"question_wo_referring_query": "Which object appears first in the video?",
"candidates": [
"A bag of chips",
"Transparent glass cup",
"Transparent glass bowl",
"Black and white cat",
"Toaster"
],
"correct_choice": 1,
"position": [
222,
578,
640,
669,
1204
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@lisolna-7321010343067618592_0",
"video_path": "7321010343067618592.mp4",
"subtitle_path": "7321010343067618592_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.63,
"view_count": 22775
},
{
"video_id": "@jess.morg-7206809277867052330",
"question": "On an olive-colored marble table, there is a black stove with a pot of boiling water. When the subtitle 'tall white house with an empty room and your name carved over the door Facing up to the tallest hill' appears, what object is present on the screen?",
"question_wo_referring_query": "What object is present on the screen?",
"candidates": [
"White plate",
"Blue flame",
"Scissors",
"Kitchen knife",
"Napkin"
],
"correct_choice": 1,
"position": [
313
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "@jess.morg-7206809277867052330_0",
"video_path": "7206809277867052330.mp4",
"subtitle_path": "7206809277867052330_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 34.73,
"view_count": 638
},
{
"video_id": "@lisolna-7359951777845775648",
"question": "What is the first item placed in the shopping cart in the video?",
"question_wo_referring_query": "What is the first item placed in the shopping cart in the video?",
"candidates": [
"A box of blueberries",
"A net of tangerines",
"A bag of carrots",
"A bag of nuts",
"A box of cherry tomatoes"
],
"correct_choice": 0,
"position": [
158,
224,
286,
371,
424
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@lisolna-7359951777845775648_0",
"video_path": "7359951777845775648.mp4",
"subtitle_path": "7359951777845775648_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.67,
"view_count": 12003
},
{
"video_id": "@jess.morg-7203120578537049386",
"question": "Which of the following scene sequences in the video is correct?",
"question_wo_referring_query": "Which of the following scene sequences in the video is correct?",
"candidates": [
"First, flipping the menu on a wooden table in a dining room, then showing a phone to a mirror in a car, and finally showing a black shopping bag to a mirror in a shopping mall.",
"First, showing a black shopping bag to a mirror in a shopping mall, then showing a phone to a mirror in a car, and finally flipping the menu on a wooden table in a dining room.",
"First, showing a phone to a mirror in a car, then showing a black shopping bag to a mirror in a shopping mall, and finally flipping the menu on a wooden table in a dining room.",
"First, flipping the menu on a wooden table in a dining room, then showing a black shopping bag to a mirror in a shopping mall, and finally showing a phone to a mirror in a car.",
"First, showing a phone to a mirror in a car, then flipping the menu on a wooden table in a dining room, and finally showing a black shopping bag to a mirror in a shopping mall."
],
"correct_choice": 0,
"position": [
102,
196,
382
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@jess.morg-7203120578537049386_0",
"video_path": "7203120578537049386.mp4",
"subtitle_path": "7203120578537049386_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 20.17,
"view_count": 714
},
{
"video_id": "@kerstinong-7197781091162279169",
"question": "At the beginning of the video, a woman wearing a white short-sleeved top and gray shorts is standing barefoot on a wooden platform, holding black and pink clothes. When the subtitle 'Ever since I left the city, you' appears, what change happens to this woman?",
"question_wo_referring_query": "What change happens to this woman?",
"candidates": [
"The gray shorts change to black shorts.",
"She goes from not wearing sunglasses to wearing sunglasses.",
"The white short-sleeved top changes to a pink plaid long-sleeved top.",
"She goes from not wearing a hat to wearing a hat.",
"Her hair changes from black to yellow."
],
"correct_choice": 2,
"position": [
12,
82
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@kerstinong-7197781091162279169_0",
"video_path": "7197781091162279169.mp4",
"subtitle_path": "7197781091162279169_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 19.27,
"view_count": 988800
},
{
"video_id": "@lisolna-7198577717292322054",
"question": "Through the clouds, there is a red sunset. Behind many neatly parked cars, there is a white house and a red brick house. What objects are present in the scene at this moment?",
"question_wo_referring_query": "What objects are present in the scene at this moment?",
"candidates": [
"Blue sky",
"Street lamp",
"High-rise building",
"Grassland",
"Fire truck"
],
"correct_choice": 1,
"position": [
377
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@lisolna-7198577717292322054_0",
"video_path": "7198577717292322054.mp4",
"subtitle_path": "7198577717292322054_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 18.17,
"view_count": 15230
},
{
"video_id": "@lisolna-7333990579963039009",
"question": "Next to a shelf on a gray tile floor, a hand with red nail polish is holding a small bottle of nail polish. When the subtitle 'so I'm going to make a Thank you.' appears, what is the color of the nail polish in the bottle?",
"question_wo_referring_query": "What is the color of the nail polish in the bottle?",
"candidates": [
"White",
"Olive",
"Pink",
"Crimson",
"Black"
],
"correct_choice": 3,
"position": [
729
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@lisolna-7333990579963039009_0",
"video_path": "7333990579963039009.mp4",
"subtitle_path": "7333990579963039009_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.13,
"view_count": 22739
},
{
"video_id": "@kerstinong-7254520467002854657",
"question": "Behind her is a yellow wooden wardrobe. A woman in a pink off-shoulder dress with her hair tied up raises her hand next to her head. 'Tip No.2' in white appears on the screen. When the subtitle 'you can try out this to win some attractive prizes. We're free, oh we got that, that freedom' shows up, what is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Waving at the camera",
"Making a phone call",
"Exercising",
"Giving a thumbs up",
"Taking a photo"
],
"correct_choice": 3,
"position": [
1178
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@kerstinong-7254520467002854657_0",
"video_path": "7254520467002854657.mp4",
"subtitle_path": "7254520467002854657_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.37,
"view_count": 35607
},
{
"video_id": "@lisolna-7265680288636767520",
"question": "In the top left corner of the screen, there's a person wearing jeans and white sneakers holding a white teddy dog standing on a gray floor. Sunlight is shining in from the top right corner. After the subtitle 'Such beautiful day, so I'm gonna start it at a cafe and I'm gonna be productive and that's my ride' appears, what is the object shown in the middle of the screen?",
"question_wo_referring_query": "What is the object shown in the middle of the screen?",
"candidates": [
"A train running on the railway",
"A flowing river below the railway",
"A green sandwich placed on a black plate",
"Coffee inside a red polka dot mug",
"A building surrounded by green trees"
],
"correct_choice": 2,
"position": [
938,
1203
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@lisolna-7265680288636767520_0",
"video_path": "7265680288636767520.mp4",
"subtitle_path": "7265680288636767520_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 53.47,
"view_count": 12943
},
{
"video_id": "@lisolna-7235995901477604635",
"question": "Which of the following sequences of scenes from the video is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes from the video is correct?",
"candidates": [
"First, the chef uses a ladle to transfer the prepared food onto a white plate. Next, a chef is cooking a meal in a kitchen filled with various ingredients. Then, chopsticks are used to pick up noodles from the plate. Finally, a hand takes a bottle of purple drink from a drink cabinet.",
"First, a chef is cooking a meal in a kitchen filled with various ingredients. Next, a hand takes a bottle of purple drink from a drink cabinet. Then, the chef uses a ladle to transfer the prepared food onto a white plate. Finally, chopsticks are used to pick up noodles from the plate.",
"First, a chef is cooking a meal in a kitchen filled with various ingredients. Next, chopsticks are used to pick up noodles from the plate. Then, a hand takes a bottle of purple drink from a drink cabinet. Finally, the chef uses a ladle to transfer the prepared food onto a white plate.",
"First, a hand takes a bottle of purple drink from a drink cabinet. Next, a chef is cooking a meal in a kitchen filled with various ingredients. Then, chopsticks are used to pick up noodles from the plate. Finally, the chef uses a ladle to transfer the prepared food onto a white plate.",
"First, a chef is cooking a meal in a kitchen filled with various ingredients. Next, the chef uses a ladle to transfer the prepared food onto a white plate. Then, a hand takes a bottle of purple drink from a drink cabinet. Finally, chopsticks are used to pick up noodles from the plate."
],
"correct_choice": 4,
"position": [
122,
347,
523,
742
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@lisolna-7235995901477604635_0",
"video_path": "7235995901477604635.mp4",
"subtitle_path": "7235995901477604635_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 36.9,
"view_count": 1700000
},
{
"video_id": "@kerstinong-7097604725134118146",
"question": "The background is a vibrant sky with sunlight above, featuring trees and a green meadow. A woman wearing an orange dress, a green shoulder bag, and green shoes is walking on stone steps with a handrail on the right side. In which other scenes does this woman appear?",
"question_wo_referring_query": "In which other scenes does this woman appear?",
"candidates": [
"On a green meadow with long shadows cast by the sun",
"In a coffee shop",
"Inside a clothing store in a shopping mall",
"Next to the dining table in a dining hall",
"By the swimming pool"
],
"correct_choice": 0,
"position": [
93,
317
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SOS",
"level": "L2-Relation",
"id": "@kerstinong-7097604725134118146_0",
"video_path": "7097604725134118146.mp4",
"subtitle_path": "7097604725134118146_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 16.27,
"view_count": 2254
},
{
"video_id": "VNvH5u5CWV8",
"question": "In the video, there's a tray with parchment paper on a white background, neatly arranged with carrot slices that have a layer of oil on them. What is the first utensil that appears after the subtitle 'i' shows up?",
"question_wo_referring_query": "What is the first utensil that appears?",
"candidates": [
"round bowl",
"transparent bowl",
"transparent plate",
"square bowl",
"black bowl"
],
"correct_choice": 3,
"position": [
224,
272
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3O",
"level": "L2-Relation",
"id": "VNvH5u5CWV8_0",
"video_path": "VNvH5u5CWV8.mp4",
"subtitle_path": "VNvH5u5CWV8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 345,
"duration": 13.0,
"view_count": 125700
},
{
"video_id": "baFsMWNavQ4",
"question": "On the screen, there is a black-skinned woman wearing a white and green shirt with black floral patterns. She is standing in a white room with a large screen behind her displaying some images. The woman is speaking with her hands open, facing the camera. What subtitles appeared simultaneously with this woman?",
"question_wo_referring_query": "What subtitles appeared simultaneously with this woman?",
"candidates": [
"Thank you",
"that we were doing",
"waht",
"SO\n",
"Can you"
],
"correct_choice": 1,
"position": [
150,
207
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TOS",
"level": "L2-Relation",
"id": "baFsMWNavQ4_0",
"video_path": "baFsMWNavQ4.mp4",
"subtitle_path": "baFsMWNavQ4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 220,
"duration": 14.02,
"view_count": 5440
},
{
"video_id": "akoJDx23QWU",
"question": "A man and a woman are sitting in front of a red background. The man has black hair and is wearing black clothes, while the woman has brown hair and is wearing off-white clothes. In front of them is a long black table with some items on it. When the camera zooms in and the man raises one of his arms, what change happens to the woman next to him?",
"question_wo_referring_query": "What change happens to the woman next to him?",
"candidates": [
"An 'omg' icon appears on the woman's face",
"A bowtie icon appears on the woman's face",
"A sad face icon appears on the woman's face",
"A small cat icon appears on the woman's face",
"A smiley face icon appears on the woman's face"
],
"correct_choice": 0,
"position": [
96,
167
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "akoJDx23QWU_0",
"video_path": "akoJDx23QWU.mp4",
"subtitle_path": "akoJDx23QWU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 233,
"duration": 14.02,
"view_count": 48870
},
{
"video_id": "MTmUo6QITo4",
"question": "The screen shows a woman with braided hair. She is wearing black clothes and is in a white room. On the right side of the screen, there's a clothes rack with clothes hanging on it. The woman is talking to the camera. When she mentions 'such a good mood today and I feel like', what change occurs in the woman?",
"question_wo_referring_query": "What change occurs in the woman?",
"candidates": [
"The woman picks up a book",
"The woman starts singing",
"The woman touches her own braid with one hand",
"The woman touches her hair",
"The woman turns to look for clothes"
],
"correct_choice": 2,
"position": [
84,
174
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "MTmUo6QITo4_0",
"video_path": "MTmUo6QITo4.mp4",
"subtitle_path": "MTmUo6QITo4_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 127,
"duration": 13.98,
"view_count": 63253
},
{
"video_id": "zU3bjOKfsww",
"question": "In the video, there are tomatoes and zucchini in the pan. A hand is holding a spatula. What is this hand doing?",
"question_wo_referring_query": ", what is this hand doing?",
"candidates": [
"Adding sugar to the spatula",
"Taking out the tomatoes with the spatula",
"Adding salt to the spatula",
"Stir-frying the zucchini and tomatoes",
"Taking out the zucchini with the spatula"
],
"correct_choice": 3,
"position": [
248
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "zU3bjOKfsww_0",
"video_path": "zU3bjOKfsww.mp4",
"subtitle_path": "zU3bjOKfsww_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 292,
"duration": 12.0,
"view_count": 1875470
},
{
"video_id": "qZVBFAtfp2A",
"question": "In the video, a man in pink garb with a ponytail stands in front of a white wall with a white door, as well as a lantern. The man is wearing a pink garment. Not far away, there's a street with some small buildings and parked cars. Behind the man, there is a person in black clothing facing away from the camera. When the phrase 'I can't wait to surprise my mom' is mentioned, what type of garment is the man wearing?",
"question_wo_referring_query": "What type of garment is the man wearing?",
"candidates": [
"Pink tank top",
"Pink T-shirt",
"Pink dress shirt",
"Pink jacket",
"Pink sweater"
],
"correct_choice": 3,
"position": [
14
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2A",
"level": "L1-Perception",
"id": "qZVBFAtfp2A_0",
"video_path": "qZVBFAtfp2A.mp4",
"subtitle_path": "qZVBFAtfp2A_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 531,
"duration": 10.01,
"view_count": 667590
},
{
"video_id": "Uy_o-WCq2Cc",
"question": "The scene shows a very large space with gray-white walls and gray tiles on the floor. There are a few people standing by the wall and four long tables on the ground. A woman in black clothing is sitting in front of one of the tables. In which other scene does this woman appear?",
"question_wo_referring_query": "In which other scene does this woman appear?",
"candidates": [
"In a place with all white walls and gray tiles on the floor, with a piece of paper on the table",
"In front of a table in a spacious square",
"In a square",
"In a forest",
"In a seat in a movie theater"
],
"correct_choice": 0,
"position": [
24,
159
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SOS",
"level": "L2-Relation",
"id": "Uy_o-WCq2Cc_0",
"video_path": "Uy_o-WCq2Cc.mp4",
"subtitle_path": "Uy_o-WCq2Cc_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 113,
"duration": 11.01,
"view_count": 23634
},
{
"video_id": "q2bpkhjNxf0",
"question": "The screen shows an animation with blue sky and white clouds, followed by a staircase. The grey staircase has an olive green slope in the middle. A rectangular object wrapped with a string is being pulled upwards along the slope. Which subtitles have appeared together with this rectangular object?",
"question_wo_referring_query": "Which subtitles have appeared together with this rectangular object?",
"candidates": [
"welcome",
"While it is important to note that this evidence comes",
"thak you ",
"h",
"so"
],
"correct_choice": 1,
"position": [
19,
95
],
"topic_category": "KH-Knowledge-History",
"question_category": "TOS",
"level": "L2-Relation",
"id": "q2bpkhjNxf0_0",
"video_path": "q2bpkhjNxf0.mp4",
"subtitle_path": "q2bpkhjNxf0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 239,
"duration": 9.01,
"view_count": 631633
},
{
"video_id": "z8YF-D-t7AI",
"question": "In the video, a hand is holding a piece of raw meat patty with green seasoning inside. What changes occur after the meat patty is placed on a black flat cast iron pan?",
"question_wo_referring_query": "What changes occur?",
"candidates": [
"The meat patty gets sauce poured over it",
"The meat patty develops caramelized grill marks",
"The meat patty gets burnt",
"The meat patty falls apart",
"A piece of vegetable is placed on the meat patty"
],
"correct_choice": 1,
"position": [
30,
260
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "z8YF-D-t7AI_0",
"video_path": "z8YF-D-t7AI.mp4",
"subtitle_path": "z8YF-D-t7AI_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 214,
"duration": 13.01,
"view_count": 189097
},
{
"video_id": "wl58wNMvi6Y",
"question": "The screen shows a blue background with a map visible through it. At the top of the background, there are three arrows and an image resembling a factory logo. There is also a white cloud on the side. What changes occur on the screen when 'begin to look carefully at the water' is mentioned?",
"question_wo_referring_query": "What changes occur on the screen?",
"candidates": [
"The arrows in the screen turn transparent",
"Five water drop icons appear on the screen",
"The blue background darkens, and the map disappears",
"The arrows in the screen disappear",
"The blue background disappears, and the map becomes clearly visible"
],
"correct_choice": 4,
"position": [
271,
298
],
"topic_category": "KA-Knowledge-Art",
"question_category": "TAA",
"level": "L2-Relation",
"id": "wl58wNMvi6Y_0",
"video_path": "wl58wNMvi6Y.mp4",
"subtitle_path": "wl58wNMvi6Y_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 345,
"duration": 13.01,
"view_count": 8882
},
{
"video_id": "u5NAcHhI_Uc",
"question": "The screen shows a completely black background with a drawing of a soldier. He is wearing a uniform, has bullets hanging on his body, and is holding a gun. What is this soldier doing?",
"question_wo_referring_query": "What is this soldier doing?",
"candidates": [
"The soldier is looking upwards",
"The soldier is saluting",
"The soldier is looking to the left",
"The soldier is looking to the right",
"The soldier is standing straight"
],
"correct_choice": 4,
"position": [
178
],
"topic_category": "KH-Knowledge-History",
"question_category": "S2E",
"level": "L1-Perception",
"id": "u5NAcHhI_Uc_0",
"video_path": "u5NAcHhI_Uc.mp4",
"subtitle_path": "u5NAcHhI_Uc_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 447,
"duration": 8.0,
"view_count": 3319061
},
{
"video_id": "sWfcgeDth_w",
"question": "The scene shows a lady sitting in a yellow flower field looking ahead, with a mountain peak in the distance. Yellow flowers bloom on a hillside. The lady has brown hair and is wearing a long-sleeved outfit. When 'Not the least of which, is' is mentioned, what object is not present in the scene?",
"question_wo_referring_query": "What object is not present in the scene?",
"candidates": [
"Snowy mountain",
"Green long-sleeved outfit",
"Blue sky",
"Brown long-sleeved outfit",
"Yellow flowers of yellow flower clusters"
],
"correct_choice": 3,
"position": [
106
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "sWfcgeDth_w_0",
"video_path": "sWfcgeDth_w.mp4",
"subtitle_path": "sWfcgeDth_w_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 91,
"duration": 9.0,
"view_count": 1149326
},
{
"video_id": "N2mrK33gCYE",
"question": "In the scene, there is a person wearing a shirt leaning over a desk with white paper on it. He dips one hand into the ink on the plate. What happens after Annc: IIc would talk about the things hc would do to try to achieve a heightened state of awareness?",
"question_wo_referring_query": "What happens?",
"candidates": [
"A man wearing a shirt is holding a wooden object in one hand and a sculpture in the other. He places the sculpture in front of the wooden object and observes it. The background behind him is blue.",
"A man wearing a shirt is holding a wooden object in one hand and a sculpture in the other. He places the sculpture in front of the wooden object and observes it. The background behind him is black.",
"A man wearing a shirt is holding a wooden object in one hand and a sculpture in the other. He places the sculpture in front of the wooden object and observes it. The background behind him is purple.",
"A man wearing a shirt is holding a wooden object in one hand and a sculpture in the other. He places the sculpture in front of the wooden object and observes it. The background behind him is white.",
"A man wearing a shirt is holding a wooden object in one hand and a sculpture in the other. He places the sculpture in front of the wooden object and observes it. The background behind him is gray."
],
"correct_choice": 1,
"position": [
295,
330
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2E",
"level": "L1-Perception",
"id": "N2mrK33gCYE_0",
"video_path": "N2mrK33gCYE.mp4",
"subtitle_path": "N2mrK33gCYE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 191,
"duration": 14.02,
"view_count": 567811
},
{
"video_id": "VLInjyogciw",
"question": "The screen shows a red wall with a black door. Beside the door, there is a round stone carving with several holes in the middle, seemingly sculpted into human facial features. In front of the stone carving, there is a black-striped pole. Which subtitles appear at the same time as this stone carving?",
"question_wo_referring_query": "Which subtitles appear at the same time as this stone carving?",
"candidates": [
"TRUTH",
"MOUTH OF TRUTH",
"truthfulness albeit with a mnuch lighter",
"EVENING",
"EARLY "
],
"correct_choice": 2,
"position": [
89,
292
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TOS",
"level": "L2-Relation",
"id": "VLInjyogciw_0",
"video_path": "VLInjyogciw.mp4",
"subtitle_path": "VLInjyogciw_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 666,
"duration": 12.98,
"view_count": 52366
},
{
"video_id": "kONR5_mHofA",
"question": "There is a white background in the screen with a blue and white rectangular pattern inside. The rectangle contains eight identical character icons, seven of which are green and one is brown. What change occurs on the screen when 'nurnbers sha finished 11th but by comning' is mentioned?",
"question_wo_referring_query": "What change occurs on the screen when this is mentioned?",
"candidates": [
"The green character below the brown character is labeled '8th place'",
"The green character below the brown character is labeled '11th place'",
"The green character below the brown character is labeled '9th place'",
"The green character below the brown character is labeled '10th place'",
"The green character below the brown character is labeled '12th place'"
],
"correct_choice": 0,
"position": [
212,
270
],
"topic_category": "NP-News-Programs",
"question_category": "TAA",
"level": "L2-Relation",
"id": "kONR5_mHofA_0",
"video_path": "kONR5_mHofA.mp4",
"subtitle_path": "kONR5_mHofA_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 62,
"duration": 13.98,
"view_count": 5526
},
{
"video_id": "5Hn1UhlLphk",
"question": "In the scene, a woman is standing in the kitchen wearing an apron. Behind her are white cabinets with many items on the counter, including cutting boards, bowls, small appliances, and bottles. Next to her on the counter is a microwave, along with two small bowls and a small bottle. The woman has light brown hair and is wearing a blue outfit. What is the shape of the picture on the wall behind the counter where she is standing?",
"question_wo_referring_query": "What is the shape of the picture on the wall behind the counter where she is standing?",
"candidates": [
"Pentagon",
"Square",
"Octagon",
"Hexagon",
"Rectangle"
],
"correct_choice": 3,
"position": [
141
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "5Hn1UhlLphk_0",
"video_path": "5Hn1UhlLphk.mp4",
"subtitle_path": "5Hn1UhlLphk_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 126,
"duration": 7.97,
"view_count": 258166
},
{
"video_id": "cceTeD4JADU",
"question": "In the screen, there is a woman with green and black hair wearing glasses. She is in a room covered with drawings. In front of her is a computer desk, which besides the computer also has some cluttered items. She is wearing a sweater that matches the color of her green hair. When she mentions 'graded for it my exam,' what object is not present in the scene?",
"question_wo_referring_query": "What object is not present in the scene?",
"candidates": [
"An orange string",
"A blue painting",
"A golden lamp",
"A white bookshelf",
"A white keyboard"
],
"correct_choice": 3,
"position": [
192
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "cceTeD4JADU_1",
"video_path": "cceTeD4JADU.mp4",
"subtitle_path": "cceTeD4JADU_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 90,
"duration": 9.98,
"view_count": 1112608
},
{
"video_id": "YiHjsOqxtk0",
"question": "On the screen, there is a wooden table, with an iron plate on top covered in oil paper. There are four dumplings on the iron plate. One hand is holding a brush, and the other hand holds a small box containing egg liquid. What was done when the brush appeared for the first time?",
"question_wo_referring_query": "What was done when the brush appeared for the first time?",
"candidates": [
"The brush is coloring the dumplings",
"The brush is mixing the egg liquid",
"The brush is applying the egg liquid",
"The brush is brushing egg liquid on the dumplings",
"The brush is placed on the iron plate"
],
"correct_choice": 2,
"position": [
95
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "YiHjsOqxtk0_0",
"video_path": "YiHjsOqxtk0.mp4",
"subtitle_path": "YiHjsOqxtk0_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 157,
"duration": 10.97,
"view_count": 228349
},
{
"video_id": "bgklOaBBmB8",
"question": "On the table in the video, there is a transparent bowl containing orange food. What happened when the subtitle 'foreign' appears?",
"question_wo_referring_query": "What happened?",
"candidates": [
"Holding a spoon and stirring the food in the bowl",
"A hand is holding a transparent bowl",
"Both hands are holding a transparent bowl",
"Holding chopsticks and stirring the food in the bowl",
"Holding a spoon and scooping the food in the bowl"
],
"correct_choice": 2,
"position": [
16
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "bgklOaBBmB8_1",
"video_path": "bgklOaBBmB8.mp4",
"subtitle_path": "bgklOaBBmB8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 298,
"duration": 11.98,
"view_count": 3687
},
{
"video_id": "-QSAotqKqX8",
"question": "Who is the first character to appear on screen?",
"question_wo_referring_query": "Who is the first character to appear on screen?",
"candidates": [
"The long-haired man in a black coat",
"The man in a gray coat and blue hat",
"The long-haired woman in a black floral dress",
"The man in a gray coat and black hat",
"The short-haired man in a black coat"
],
"correct_choice": 4,
"position": [
14,
37
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "O3O",
"level": "L2-Relation",
"id": "-QSAotqKqX8_1",
"video_path": "-QSAotqKqX8.mp4",
"subtitle_path": "-QSAotqKqX8_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 99,
"duration": 7.98,
"view_count": 79518
},
{
"video_id": "@recipesbyanne-7153617149041446150",
"question": "In the video, there is a white rectangular bowl on a white table. The bowl contains white cream cheese cubes. A hand is holding a piece of cream cheese. What does the hand do when the white cream cheese first appears?",
"question_wo_referring_query": "What does the hand do?",
"candidates": [
"Spread the cream cheese evenly",
"Use a torch to melt the cream cheese",
"Break the cream cheese into pieces",
"Season the cream cheese",
"Take the cream cheese off"
],
"correct_choice": 0,
"position": [
123
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@recipesbyanne-7153617149041446150_0",
"video_path": "7153617149041446150.mp4",
"subtitle_path": "7153617149041446150_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 9.77,
"view_count": 2256136
},
{
"video_id": "@kelseyinlondon-7068018011139050758",
"question": "The screen shows three women sitting by the river, facing the Eiffel Tower. The sky is very blue. The women are respectively wearing a white coat, black and white striped clothing, and a pink outfit. They have their backs to the camera. What kind of hats are the two women in the screen wearing?",
"question_wo_referring_query": "What kind of hats are the two women in the screen wearing?",
"candidates": [
"sun hat",
"straw hat",
"cowboy hat",
"wool hat",
"beret"
],
"correct_choice": 4,
"position": [
237
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@kelseyinlondon-7068018011139050758_0",
"video_path": "7068018011139050758.mp4",
"subtitle_path": "7068018011139050758_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 10.5,
"view_count": 25155
},
{
"video_id": "@placesunleashed-7293176969531870470",
"question": "In the scene, a lady is walking towards a tree with yellow leaves. In the distance, there are some mountains and a lake. The sky is white. The lady is wearing a black top and black pants. What happens as the lady approaches the tree?",
"question_wo_referring_query": "What happens?",
"candidates": [
"A person is lying by the lake; in the distance, there's the tree with yellow leaves and mountains.",
"A person is jumping by the lake; in the distance, there's the tree with yellow leaves and mountains.",
"A person is kneeling by the lake; in the distance, there's the tree with yellow leaves and mountains.",
"A person is standing by the lake; in the distance, there's the tree with yellow leaves and mountains.",
"A person is sitting by the lake; in the distance, there's the tree with yellow leaves and mountains."
],
"correct_choice": 2,
"position": [
281,
302
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@placesunleashed-7293176969531870470_0",
"video_path": "7293176969531870470.mp4",
"subtitle_path": "7293176969531870470_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 14.47,
"view_count": 8943
},
{
"video_id": "@kelseyinlondon-7094322812327906565",
"question": "In the scene, there is a white dining room with a white dining table on which there is a dessert plate. Two hands are holding glasses for a toast. After the subtitle 'Thak you' appears, what happens?",
"question_wo_referring_query": "What happens after that?",
"candidates": [
"Two hands take two pieces of dessert",
"Two people drink the wine in the glasses",
"Two people start drinking tea",
"Two people start eating the dessert",
"One hand takes a piece of dessert"
],
"correct_choice": 4,
"position": [
75,
138
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@kelseyinlondon-7094322812327906565_0",
"video_path": "7094322812327906565.mp4",
"subtitle_path": "7094322812327906565_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 11.0,
"view_count": 48300
},
{
"video_id": "@_eat_sleep_travel_repeat-7296468146473536800",
"question": "The screen shows a grassland with a large tree on it. In the distance, there is a lake and some plants. After the subtitle 'Thank you.' appears, what is the first food item that appears?",
"question_wo_referring_query": "What is the first food item that appears?",
"candidates": [
"Strawberry",
"Pumpkin",
"Grapes",
"Watermelon",
"Melon"
],
"correct_choice": 1,
"position": [
36,
216
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@_eat_sleep_travel_repeat-7296468146473536800_0",
"video_path": "7296468146473536800.mp4",
"subtitle_path": "7296468146473536800_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 0,
"duration": 12.33,
"view_count": 7334
},
{
"video_id": "@movie.explained6-7268936523481943297",
"question": "In the scene, there is a little boy and a little girl. The little boy is wearing black clothes, the little girl is wearing white clothes, and there is also an older man in the background. At the same time, what subtitles appear with the little girl?",
"question_wo_referring_query": "At the same time, what subtitles appear with the little girl?",
"candidates": [
"He told me.",
"h",
"thanks",
"thank you",
"so"
],
"correct_choice": 0,
"position": [
76,
264
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TOS",
"level": "L2-Relation",
"id": "@movie.explained6-7268936523481943297_0",
"video_path": "7268936523481943297.mp4",
"subtitle_path": "7268936523481943297_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 3,
"duration": 12.0,
"view_count": 1536
},
{
"video_id": "@movie.explained6-7257359603489443073",
"question": "In the scene, there is a woman with white hair wearing blue clothes. Behind her, there is a man in a white coat and another woman in white clothes. They are in a futuristic-looking glass room. What is the object present in the scene?",
"question_wo_referring_query": "What is the object present in the scene?",
"candidates": [
"Orange-red thread",
"Blue clothing with a yellow collar",
"Yellow pants",
"Blue clothing with a black collar",
"Rose-colored pants"
],
"correct_choice": 0,
"position": [
59
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@movie.explained6-7257359603489443073_0",
"video_path": "7257359603489443073.mp4",
"subtitle_path": "7257359603489443073_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 9,
"duration": 12.0,
"view_count": 3405
},
{
"video_id": "UKU-W5bBCwY",
"question": "In the scene of a movie on the screen, a group of people are dressed in very shabby clothes, with burlap bags over their heads showing only their eyes. They are in a well-decorated room. The walls of the room have orange-colored light strips, the ceiling is blue, and there are orange-colored sofas and chairs in the room. There is a red laser on the ceiling. What is the shape of the patterns on the ceiling?",
"question_wo_referring_query": "What is the shape of the patterns on the ceiling?",
"candidates": [
"Triangle",
"Circle",
"Spiral",
"Heart",
"Square"
],
"correct_choice": 4,
"position": [
120
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "UKU-W5bBCwY_0",
"video_path": "UKU-W5bBCwY.mp4",
"subtitle_path": "UKU-W5bBCwY_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 514,
"duration": 9.01,
"view_count": 270836
},
{
"video_id": "VGQ_djSR7zE",
"question": "In the scene, a hand wearing a pair of white gloves is touching someone's neck, which shows signs of bruises. When 'The ME shows thetn an incision on the throat of one of the dead passengers. It is a deep,' is mentioned, what happens next?",
"question_wo_referring_query": "What happens next?",
"candidates": [
"A woman wearing glasses and dark clothes appears along with a man dressed in a white-gray coat.",
"A woman wearing glasses and dark clothes appears along with a man dressed in a black coat.",
"A woman wearing glasses and dark clothes appears along with a man dressed in a yellow coat.",
"A woman wearing glasses and dark clothes appears along with a man dressed in an olive coat.",
"A woman wearing glasses and dark clothes appears along with a man dressed in a gray coat."
],
"correct_choice": 1,
"position": [
40,
141
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "VGQ_djSR7zE_0",
"video_path": "VGQ_djSR7zE.mp4",
"subtitle_path": "VGQ_djSR7zE_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 335,
"duration": 8.01,
"view_count": 24818
},
{
"video_id": "0rWA-p4p5IM",
"question": "On a white surface, there is a small silver bowl with a piece of cake inside it. A person, whose face is not visible, is using a tool to take the cake out of the bowl. Where does the cake appear?",
"question_wo_referring_query": "Where does the cake appear?",
"candidates": [
"In a teacup",
"In a red bag",
"On a silver plate",
"In a bucket with alcohol",
"On a black plate"
],
"correct_choice": 2,
"position": [
5343,
7270
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "0rWA-p4p5IM_0",
"video_path": "0rWA-p4p5IM.mp4",
"subtitle_path": "0rWA-p4p5IM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 461.09000000000003,
"view_count": 843757
},
{
"video_id": "0rWA-p4p5IM",
"question": "On a white surface, a hand places a transparent bowl on the table with an unpeeled egg inside. Where has the egg appeared?",
"question_wo_referring_query": "Where has the egg appeared?",
"candidates": [
"In a cake",
"In a silver metal box",
"In a blue plastic container",
"In a black pot",
"In the refrigerator"
],
"correct_choice": 2,
"position": [
1014,
2760
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "0rWA-p4p5IM_1",
"video_path": "0rWA-p4p5IM.mp4",
"subtitle_path": "0rWA-p4p5IM_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 461.09000000000003,
"view_count": 843757
},
{
"video_id": "fvCrE5NCsts",
"question": "In the black-and-white scene, there is a grass hut. Inside the grass hut, there is a person holding a long gun without revealing their body. When the subtitle mentions 'to six enemy soldiers every day,' what happens in the scene?",
"question_wo_referring_query": "What happens in the scene?",
"candidates": [
"The gun fired",
"A bird landed on the ground",
"The gun was placed on the ground",
"The person stood up from the grass hut",
"The gun was thrown out"
],
"correct_choice": 0,
"position": [
3050
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2E",
"level": "L1-Perception",
"id": "fvCrE5NCsts_0",
"video_path": "fvCrE5NCsts.mp4",
"subtitle_path": "fvCrE5NCsts_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 511.48,
"view_count": 7576619
},
{
"video_id": "fvCrE5NCsts",
"question": "In the black and white scene, it is a snowy landscape with trees covered in a thick layer of snow. There is a group of people holding items in their hands on the snow. What happens on the screen when the subtitle mentions 'with artillery strikes.'?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The people in the scene shoot towards the sky",
"The people holding the items walk towards the camera",
"The people in the scene all fall to the ground",
"The people in the scene start fighting each other",
"The people in the scene throw the items in their hands to the ground"
],
"correct_choice": 1,
"position": [
9956
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2E",
"level": "L1-Perception",
"id": "fvCrE5NCsts_1",
"video_path": "fvCrE5NCsts.mp4",
"subtitle_path": "fvCrE5NCsts_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 511.48,
"view_count": 7576619
},
{
"video_id": "8905KCkLDYc",
"question": "Under the sunlight, the sea water reflects dazzling lights, and several men and women laugh heartily on the beach, raising their beer bottles to toast each other. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"American flag",
"beach cart",
"watch",
"balloon",
"mobile phone"
],
"correct_choice": 0,
"position": [
4406
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "8905KCkLDYc_0",
"video_path": "8905KCkLDYc.mp4",
"subtitle_path": "8905KCkLDYc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 497.93,
"view_count": 54556
},
{
"video_id": "8905KCkLDYc",
"question": "In the blue seawater, there are black rocks. A sea turtle is swimming, and beside it, there is a man wearing black shorts following closely with a light on his body. What objects are present in this scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"Diving mask",
"Cell phone",
"Oxygen tank",
"Wetsuit",
"Net"
],
"correct_choice": 0,
"position": [
9032
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "8905KCkLDYc_1",
"video_path": "8905KCkLDYc.mp4",
"subtitle_path": "8905KCkLDYc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 497.93,
"view_count": 54556
},
{
"video_id": "6qLYX6bzsl4",
"question": "In a black iron pot with some oil, there is a lump of dough. A person, only showing one hand, is holding a silver ladle and spreads the dough in the pot. Where has this ladle appeared before?",
"question_wo_referring_query": "Where has this ladle appeared before?",
"candidates": [
"In a green porcelain bowl",
"In a bowl rack",
"In a red bag",
"In a cupboard",
"In a black bag"
],
"correct_choice": 0,
"position": [
3286,
4912
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "6qLYX6bzsl4_0",
"video_path": "6qLYX6bzsl4.mp4",
"subtitle_path": "6qLYX6bzsl4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 307.43,
"view_count": 1244
},
{
"video_id": "6qLYX6bzsl4",
"question": "On a wooden surface, there is a beige mat. On the mat, there is a black plate with white floral patterns. A pair of hands is placing a pancake on the plate. Where has this pancake appeared before?",
"question_wo_referring_query": "Where has this pancake appeared before?",
"candidates": [
"In a golden plate",
"In a black pot",
"In a black bucket",
"In a white pot",
"In a silver plate"
],
"correct_choice": 1,
"position": [
4843,
3787
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SOS",
"level": "L2-Relation",
"id": "6qLYX6bzsl4_1",
"video_path": "6qLYX6bzsl4.mp4",
"subtitle_path": "6qLYX6bzsl4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 307.43,
"view_count": 1244
},
{
"video_id": "UbNyMSwoT5A",
"question": "A group of people is advancing down the mountain. A person without a helmet is riding an elephant, followed by a group of people wearing helmets. A soldier is waving a long sword. Who is the person waving the long sword?",
"question_wo_referring_query": "Who is the person waving the long sword?",
"candidates": [
"The person with a scabbard but without a helmet",
"The person wearing a helmet without a scabbard",
"The person without a helmet and without a shield",
"The person riding an elephant at the front",
"The person wearing a helmet without a shield"
],
"correct_choice": 4,
"position": [
325
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "UbNyMSwoT5A_0",
"video_path": "UbNyMSwoT5A.mp4",
"subtitle_path": "UbNyMSwoT5A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 250.92,
"view_count": 11784
},
{
"video_id": "UbNyMSwoT5A",
"question": "On the grass, six people are fighting two against two. One person wielding a long sword is slashing at a person holding a long spear. The person with the long spear is falling backward. Who is the person swinging the long sword?",
"question_wo_referring_query": "Who is the person swinging the long sword?",
"candidates": [
"The person without a helmet and without a shield",
"The person wearing a helmet and holding a shield",
"The person wearing a white headscarf",
"The person with golden hair and a shield",
"The person wearing a headscarf and with no hair"
],
"correct_choice": 3,
"position": [
2401
],
"topic_category": "KH-Knowledge-History",
"question_category": "E2O",
"level": "L1-Perception",
"id": "UbNyMSwoT5A_1",
"video_path": "UbNyMSwoT5A.mp4",
"subtitle_path": "UbNyMSwoT5A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 250.92,
"view_count": 11784
},
{
"video_id": "Sy2unO22PUE",
"question": "Four armed characters appear on the screen: two characters in green uniforms with helmets on the first floor, and two characters in yellow shirts with hats on the second floor. The characters on the second floor are shooting from kneeling positions, while the characters on the first floor are shooting from standing positions. What change occurs to the positions of the two characters in yellow shirts when the subtitle 'made their way down to the third floor' appears?",
"question_wo_referring_query": "What change occurs to the positions of the two characters in yellow shirts?",
"candidates": [
"They go from kneeling to prostrate shooting positions",
"They go from kneeling to shooting while leaning against a railing",
"They go from kneeling to shooting while leaning against a wall",
"They go from kneeling to prone positions",
"They go from kneeling to standing positions"
],
"correct_choice": 3,
"position": [
6764,
6805
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Sy2unO22PUE_0",
"video_path": "Sy2unO22PUE.mp4",
"subtitle_path": "Sy2unO22PUE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 557.67,
"view_count": 83680
},
{
"video_id": "Sy2unO22PUE",
"question": "A cartoon character is standing at the end of a corridor in a room. The cartoon character is wearing a green uniform, a green hat, and black shoes. The corridor has doors on both sides arranged in an orderly manner. When the subtitle 'around a corner he was suddenly within' appears, what change occurs in the posture of this character in the green uniform?",
"question_wo_referring_query": "When the subtitle 'around a corner he was suddenly within' appears, what change occurs in the posture of the character in the green uniform?",
"candidates": [
"From standing to leaning back",
"From standing to kneeling",
"From standing to bending over",
"From standing to crawling",
"From standing to lying on the side"
],
"correct_choice": 1,
"position": [
6862,
6902
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "Sy2unO22PUE_1",
"video_path": "Sy2unO22PUE.mp4",
"subtitle_path": "Sy2unO22PUE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 557.67,
"view_count": 83680
},
{
"video_id": "66dwcQ1Y048",
"question": "A man appears on the screen, wearing a striped short-sleeved round-neck shirt and sporting a goatee. In the background behind the man, there are circles of various colors and sizes. When this goateed man stands in front of the signboard holding hands with two friends, what changes occur to his clothing?",
"question_wo_referring_query": "When this goateed man stands in front of the signboard holding hands with two friends, what changes occur to his clothing?",
"candidates": [
"The striped short-sleeved shirt changes to a cotton jacket.",
"The striped short-sleeved shirt changes to a hooded jacket.",
"The striped short-sleeved shirt changes to a white short-sleeved shirt with letters.",
"The striped short-sleeved shirt changes to a suit.",
"The striped short-sleeved shirt changes to a black jacket."
],
"correct_choice": 2,
"position": [
226,
956
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "66dwcQ1Y048_0",
"video_path": "66dwcQ1Y048.mp4",
"subtitle_path": "66dwcQ1Y048_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 186.81,
"view_count": 470742
},
{
"video_id": "66dwcQ1Y048",
"question": "Under the blue sky, a yellow car is driving on the road, with a truck behind it. There are tall trees planted in the greenbelt on both sides of the road. When a basketball hoop and a man in a black t-shirt appear around the yellow car, what change occurs with the yellow car?",
"question_wo_referring_query": "When a basketball hoop and a man in a black t-shirt appear around the yellow car, what change occurs with the yellow car?",
"candidates": [
"The door of the yellow car changes from closed to open.",
"The door of the yellow car shows some cartoon drawings.",
"The body of the yellow car gets sprayed with paint.",
"Some cartoon drawings appear on the body of the yellow car.",
"The headlights of the yellow car show some cartoon drawings."
],
"correct_choice": 0,
"position": [
2873,
2915
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SAA",
"level": "L2-Relation",
"id": "66dwcQ1Y048_1",
"video_path": "66dwcQ1Y048.mp4",
"subtitle_path": "66dwcQ1Y048_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 186.81,
"view_count": 470742
},
{
"video_id": "UwlKYM2Sotg",
"question": "Three men appear on the screen. The man on the left is wearing a red shirt, the man in the middle is wearing a black shirt and jeans, and the man on the right is leaning on a blue shelf that holds various items. What did the man in the middle, who is wearing a black shirt, do the first time he appeared?",
"question_wo_referring_query": "What did the man in the middle, who is wearing a black shirt, do the first time he appeared?",
"candidates": [
"Folded his arms in front of his chest",
"Picked up a key",
"Picked up a watch",
"Put his hand on a companion's shoulder",
"Picked up a cup of water"
],
"correct_choice": 3,
"position": [
290
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "UwlKYM2Sotg_0",
"video_path": "UwlKYM2Sotg.mp4",
"subtitle_path": "UwlKYM2Sotg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 503.12,
"view_count": 32388
},
{
"video_id": "UwlKYM2Sotg",
"question": "A man is sitting on a couch. The man is wearing a pink top and blue jeans. He is holding a black cup in his hand. The couch is dark-colored, and there is a green handkerchief placed on the armrest of the couch. What did the man in the pink top do the first time he appeared?",
"question_wo_referring_query": "What did the man in the pink top do the first time he appeared?",
"candidates": [
"Picked up a key",
"Held his forehead with his hand",
"Wiped his pants",
"Picked up a watch",
"Wiped his shirt"
],
"correct_choice": 1,
"position": [
138
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "UwlKYM2Sotg_1",
"video_path": "UwlKYM2Sotg.mp4",
"subtitle_path": "UwlKYM2Sotg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 503.12,
"view_count": 32388
},
{
"video_id": "1oL4ugMdDvQ",
"question": "A man and three women appear outdoors. They are gathered around a gray table. The man is wearing a white short-sleeve shirt and black shorts. One woman is wearing a light-colored top and blue shorts. Another woman has sunglasses on her head. The third woman is wearing a black top and black pants. There is a pot on the table. Who is using their hand to take food from the pot?",
"question_wo_referring_query": "Who is using their hand to take food from the pot?",
"candidates": [
"The man wearing a blue top",
"The man wearing a yellow top",
"The woman wearing shorts",
"The woman with sunglasses on her head",
"The man wearing a short-sleeve shirt"
],
"correct_choice": 3,
"position": [
9735
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "1oL4ugMdDvQ_0",
"video_path": "1oL4ugMdDvQ.mp4",
"subtitle_path": "1oL4ugMdDvQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 428.68,
"view_count": 114400
},
{
"video_id": "1oL4ugMdDvQ",
"question": "A man appears in the wilderness. The man is wearing black pants and has tattoos on his arms. On the table in front of the man, there is a pot and a cutting board with red ingredients on it. Behind the man is green foliage. What is the ingredient that the man added to the pot?",
"question_wo_referring_query": "What is the ingredient that the man added to the pot?",
"candidates": [
"radishes",
"apples",
"chili peppers",
"corn slices",
"tomatoes"
],
"correct_choice": 3,
"position": [
5154
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "1oL4ugMdDvQ_1",
"video_path": "1oL4ugMdDvQ.mp4",
"subtitle_path": "1oL4ugMdDvQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 428.68,
"view_count": 114400
},
{
"video_id": "mld0TnA2jEs",
"question": "A man in a red short-sleeve shirt is sitting on a sofa with his hands folded. There is a laptop on the left side of the sofa and a pillow on the right side. Behind the sofa is an open kitchen with brown cabinets. When the subtitle 'a very desirable property now let's talk' appears, what objects are present in this scene?",
"question_wo_referring_query": "What objects are present in this scene?",
"candidates": [
"a black thread",
"an umbrella",
"a potted plant",
"a watch",
"a necklace"
],
"correct_choice": 0,
"position": [
3115
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "mld0TnA2jEs_0",
"video_path": "mld0TnA2jEs.mp4",
"subtitle_path": "mld0TnA2jEs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 457.0,
"view_count": 3660
},
{
"video_id": "mld0TnA2jEs",
"question": "A man wearing a red short-sleeve shirt and dark pants is sitting on the sofa with his hands spread out to the sides. There is a laptop on the left side of the sofa. Behind the sofa is an open kitchen with cabinets that are oak-colored. When the subtitle 'be used in machine learning algorithms' appears, what objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"watch",
"potted plant",
"pillow",
"umbrella",
"necklace"
],
"correct_choice": 2,
"position": [
2808
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2O",
"level": "L1-Perception",
"id": "mld0TnA2jEs_1",
"video_path": "mld0TnA2jEs.mp4",
"subtitle_path": "mld0TnA2jEs_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 457.0,
"view_count": 3660
},
{
"video_id": "JLnsWrzV_j4",
"question": "In the distance, there are some green plants within grey buildings. A woman dressed in a Tibetan blue long-sleeve top, a blue checkered short-sleeve shirt, black leggings, and carrying a black crossbody bag is taking a photo of a man in front of her who is wearing a Tibetan blue top and grey pants with a backpack. What is the hairstyle of this woman when the subtitle 'ignores her and walks away Yuan secretly' appears?",
"question_wo_referring_query": "What is the hairstyle of this woman?",
"candidates": [
"Loose hair",
"Long straight black hair",
"Tied in a ponytail",
"Short brown hair",
"Brown curly hair"
],
"correct_choice": 2,
"position": [
911
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "JLnsWrzV_j4_0",
"video_path": "JLnsWrzV_j4.mp4",
"subtitle_path": "JLnsWrzV_j4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 590.51,
"view_count": 18383
},
{
"video_id": "JLnsWrzV_j4",
"question": "A man with black parted hair, wearing a blue shirt, a black backpack, and a wristwatch is kneeling on the ground, looking troubled at the scattered items on the ground. When the caption 'having an accident that distracts her' appears, what color is the wristwatch worn by this man?",
"question_wo_referring_query": "What color is the wristwatch worn by this man?",
"candidates": [
"Blue",
"Pink",
"Green",
"White",
"Black"
],
"correct_choice": 4,
"position": [
10463
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2A",
"level": "L1-Perception",
"id": "JLnsWrzV_j4_1",
"video_path": "JLnsWrzV_j4.mp4",
"subtitle_path": "JLnsWrzV_j4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 590.51,
"view_count": 18383
},
{
"video_id": "@healthfood-6874943613630041350",
"question": "On a grey table, there's a white glass bowl containing flour and eggs. A person holds the bowl with their left hand while using their right hand to place a whisk inside the bowl. What happens after 'Make the pancake batter' is mentioned?",
"question_wo_referring_query": "What happens next?",
"candidates": [
"Slices an orange",
"Washes blueberries in a water basin",
"Slices a banana",
"Cuts an apple into slices and removes the core"
],
"correct_choice": 3,
"position": [
176,
274
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@healthfood-6874943613630041350_0",
"video_path": "6874943613630041350.mp4",
"subtitle_path": "6874943613630041350_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 27.83,
"view_count": 5296
},
{
"video_id": "@healthfood-7135100078221430058",
"question": "On a white table, there is a white plate with strawberries on it. A person is holding a strawberry with their left hand, and holding a knife above the strawberry with their right hand. After saying 'Welcome back to episode 5 of my healthy snack series,' what happens to the strawberry?",
"question_wo_referring_query": "What happens to the strawberry?",
"candidates": [
"The strawberry is cut into small pieces",
"The strawberry is cut in half",
"The strawberry is covered in yellow powder",
"Blueberry sauce is added to the strawberry"
],
"correct_choice": 2,
"position": [
46,
306
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@healthfood-7135100078221430058_0",
"video_path": "7135100078221430058.mp4",
"subtitle_path": "7135100078221430058_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 17.05,
"view_count": 30743
},
{
"video_id": "@tiffycooks-7201571680890670342",
"question": "A woman wearing green clothes and draped in long hair is sitting in front of a table outdoors. There is also a dish in a brown bowl in front of her. When mentioning 'The cricket was crispy, savory, and paired with pad thai, super delicious,' what is the material of the chopsticks the woman is holding?",
"question_wo_referring_query": "What is the material of the chopsticks the woman is holding?",
"candidates": [
"Red plastic chopsticks",
"White ivory chopsticks",
"Wooden chopsticks",
"Silver stainless steel chopsticks"
],
"correct_choice": 2,
"position": [
892
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@tiffycooks-7201571680890670342_0",
"video_path": "7201571680890670342.mp4",
"subtitle_path": "7201571680890670342_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 48.53,
"view_count": 206578
},
{
"video_id": "@tiffycooks-7171182579058511110",
"question": "In front of a row of black glass doors of a freezer, a woman wearing a beige wool coat and a black mask is holding a box of food, standing in front of a glass cabinet. What did she do next?",
"question_wo_referring_query": ", what did she do next?",
"candidates": [
"She was sleeping in the bedroom",
"She was eating chicken wings from a plate",
"She was reading a book in the library",
"She was cleaning the room in the bedroom"
],
"correct_choice": 1,
"position": [
165,
338
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@tiffycooks-7171182579058511110_0",
"video_path": "7171182579058511110.mp4",
"subtitle_path": "7171182579058511110_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 27.28,
"view_count": 3185875
},
{
"video_id": "@thatrecipe.us-7266555380925287723",
"question": "On a wooden-colored table, there is a round transparent bowl with someone stirring the chocolate and butter inside it using a ladle. According to the video, which of these two stirred items appears first?",
"question_wo_referring_query": "Which of these two stirred items appears first?",
"candidates": [
"Butter",
"Chocolate and butter appear simultaneously",
"Neither of these items appears",
"Chocolate"
],
"correct_choice": 3,
"position": [
390,
158
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@thatrecipe.us-7266555380925287723_0",
"video_path": "7266555380925287723.mp4",
"subtitle_path": "7266555380925287723_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.0,
"view_count": 15537
},
{
"video_id": "@tiffycooks-7080578381569445125",
"question": "Someone is washing fresh green vegetables in a stainless steel basin on the screen. After mentioning 'back from a long road trip in the middle of the night. He had been driving for hours and I wanted,' what appears on the screen?",
"question_wo_referring_query": "What appears on the screen?",
"candidates": [
"meat frying in the pan",
"white ceramic plate with yellow edges",
"dark liquid seasoning in a jar",
"raw red meat"
],
"correct_choice": 0,
"position": [
307,
597
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@tiffycooks-7080578381569445125_0",
"video_path": "7080578381569445125.mp4",
"subtitle_path": "7080578381569445125_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 27.63,
"view_count": 803055
},
{
"video_id": "@thatrecipe.us-7301855783438896427",
"question": "There's a whole cabbage on the cutting board. A person's left hand is pressing the cabbage while the right hand is cutting it with a stainless steel knife with a black handle. What happens to the cabbage in the end?",
"question_wo_referring_query": "What happens to the cabbage in the end?",
"candidates": [
"It becomes bigger",
"The color changes",
"There is no change",
"It becomes smaller"
],
"correct_choice": 3,
"position": [
60,
1313
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SAA",
"level": "L2-Relation",
"id": "@thatrecipe.us-7301855783438896427_0",
"video_path": "7301855783438896427.mp4",
"subtitle_path": "7301855783438896427_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.13,
"view_count": 139730
},
{
"video_id": "@tiffycooks-7062861483687873798",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, a mixer in use appears, then an egg is cracked into a bowl, and finally, a woman pouring flour into the bowl serves as the conclusion.",
"First, a woman pours flour into a bowl, then an egg is cracked into the bowl, and finally, a mixer being used serves as the conclusion.",
"First, a woman pours flour into a bowl, then a mixer in use appears, and finally, an egg being cracked into the bowl serves as the conclusion.",
"First, an egg is cracked into a bowl, then a woman pours flour into the bowl, and finally, a mixer in use serves as the conclusion."
],
"correct_choice": 1,
"position": [
5,
145,
212
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@tiffycooks-7062861483687873798_0",
"video_path": "7062861483687873798.mp4",
"subtitle_path": "7062861483687873798_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 41.13,
"view_count": 114669
},
{
"video_id": "@healthfood-6862748507976011013",
"question": "In the video, there are two slices of bread. The one on the left has been spread with chocolate sauce, while the one on the right has been spread with orange sauce. When the subtitle says 'And then slice up your favorite fruit. I'm choosing a nectarine today', what change occurs to the bread slices?",
"question_wo_referring_query": "What change occurs to the bread slices?",
"candidates": [
"The left bread slice adds a lemon",
"The left bread slice adds a peach",
"The left bread slice adds a watermelon",
"The right bread slice adds a peach"
],
"correct_choice": 1,
"position": [
245,
274
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@healthfood-6862748507976011013_0",
"video_path": "6862748507976011013.mp4",
"subtitle_path": "6862748507976011013_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 22.83,
"view_count": 2390
},
{
"video_id": "@healthfood-6948509102305791238",
"question": "When the subtitle: 'First, you're going to want to heat your tortillas on a pan for a little bit so that they're more flexible.' appears at the bottom of the screen, what color sauce is spread on the tortilla placed in the white speckled round plate on the gray marble countertop?",
"question_wo_referring_query": "What color sauce is spread on the tortilla?",
"candidates": [
"Green",
"Red",
"Yellow",
"White"
],
"correct_choice": 1,
"position": [
170
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2A",
"level": "L1-Perception",
"id": "@healthfood-6948509102305791238_0",
"video_path": "6948509102305791238.mp4",
"subtitle_path": "6948509102305791238_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 47.33,
"view_count": 58396
},
{
"video_id": "@thatrecipe.us-7262468242692689194",
"question": "There is a wooden table in the video, and on the table there is a wooden board. A pair of hands is using a knife to cut a vegetable that has a yellow outer part and white inner part. What is the vegetable being cut?",
"question_wo_referring_query": "What is the vegetable being cut?",
"candidates": [
"Turnip",
"Carrot",
"Garlic",
"Leek"
],
"correct_choice": 3,
"position": [
181
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "@thatrecipe.us-7262468242692689194_0",
"video_path": "7262468242692689194.mp4",
"subtitle_path": "7262468242692689194_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.0,
"view_count": 14953
},
{
"video_id": "@healthfood-7125042270381853995",
"question": "When the text 'Now for the sweet part, add some sweetened condensed milk right on top.' appears at the bottom of the screen, what changes happen in the transparent glass bowl containing green avocado?",
"question_wo_referring_query": "What changes occur in the transparent glass bowl containing green avocado when the text 'Now for the sweet part, add some sweetened condensed milk right on top.' appears at the bottom of the screen?",
"candidates": [
"Red tomato sauce is added",
"Pink condensed milk is added",
"Sweetened condensed milk is added",
"Green fruit sauce is added"
],
"correct_choice": 2,
"position": [
269
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@healthfood-7125042270381853995_0",
"video_path": "7125042270381853995.mp4",
"subtitle_path": "7125042270381853995_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 16.28,
"view_count": 16675
},
{
"video_id": "@thatrecipe.us-7245035133688892714",
"question": "In the video, two rows of sliced green onions are neatly arranged on a silver baking tray, with lemon slices placed on top of the green onions. Which of these ingredients appears first in the video?",
"question_wo_referring_query": "Which of these ingredients appears first in the video?",
"candidates": [
"Lemon slices",
"Neither of them appear",
"Both appear at the same time",
"Sliced white green onions"
],
"correct_choice": 3,
"position": [
1120,
788
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@thatrecipe.us-7245035133688892714_0",
"video_path": "7245035133688892714.mp4",
"subtitle_path": "7245035133688892714_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.93,
"view_count": 40665
},
{
"video_id": "@tiffycooks-6940704678569053446",
"question": "In the video, a woman in a white top is holding a large dumpling in her left hand and a white bowl in her right hand. What other items appeared together with the dumpling?",
"question_wo_referring_query": "What other items appeared together with the dumpling?",
"candidates": [
"Glutinous rice flour, sugar, hot water, knead until it forms into a ball.",
"Coat the rice balls with water. Coat with sesame seeds. And don't want to add any filling. All you have to do is just roll them into little balls.",
"Like this, roll it out. Cut into 10 pieces.",
"15 Shree food for 15 days."
],
"correct_choice": 1,
"position": [
180,
290,
470
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TOS",
"level": "L2-Relation",
"id": "@tiffycooks-6940704678569053446_0",
"video_path": "6940704678569053446.mp4",
"subtitle_path": "6940704678569053446_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 29.87,
"view_count": 1220741
},
{
"video_id": "@thatrecipe.us-7309241426028596523",
"question": "On a yellow table, there is a yellow cutting board with a piece of chicken breast on it. A person wearing a glove on the left hand has the left hand placed on the chicken breast, while holding a knife with the right hand on the chicken breast. When mentioning 'Only these ingredients and my family loves it,' what is this person doing?",
"question_wo_referring_query": "What is this person doing?",
"candidates": [
"This person is frying chicken breast",
"This person is frying a steak",
"This person is cutting chicken breast",
"This person is washing chicken breast"
],
"correct_choice": 2,
"position": [
173
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@thatrecipe.us-7309241426028596523_0",
"video_path": "7309241426028596523.mp4",
"subtitle_path": "7309241426028596523_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.03,
"view_count": 21427
},
{
"video_id": "@placesunleashed-7324761016909188358",
"question": "In the video, there is a woman wearing a yellow coat with a ponytail. She is holding a mobile phone in her right hand and raises it to her ear. Which object does not appear in the video?",
"question_wo_referring_query": "Which object does not appear in the video?",
"candidates": [
"Yellow coat",
"Ring",
"Swimming pool",
"Mobile phone"
],
"correct_choice": 2,
"position": [
322
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@placesunleashed-7324761016909188358_0",
"video_path": "7324761016909188358.mp4",
"subtitle_path": "7324761016909188358_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 20.37,
"view_count": 66793
},
{
"video_id": "@placesunleashed-7289528053112343814",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, a shot of the sea with a mountain in the background appears, then a scene of a waterfall appears, and finally, it concludes with the appearance of a frog.",
"First, a scene of a waterfall appears, then a frog appears, and finally, it concludes with a shot of the sea with a mountain in the background.",
"First, a scene of a waterfall appears, then a shot of the sea with a mountain in the background, and finally, it concludes with the appearance of a frog.",
"First, a frog appears, then a shot of the sea with a mountain in the background appears, and finally, it concludes with a scene of a waterfall."
],
"correct_choice": 2,
"position": [
149,
437,
496
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@placesunleashed-7289528053112343814_0",
"video_path": "7289528053112343814.mp4",
"subtitle_path": "7289528053112343814_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 24.37,
"view_count": 4303
},
{
"video_id": "@placesunleashed-7298820532466683142",
"question": "On the yellowish vast terrain, when the words 'TRY TO VIEW IT FROM' appear on the screen, and when it mentions 'Try to view it from space if you can', what object is present on the screen?",
"question_wo_referring_query": "What object is present on the screen?",
"candidates": [
"irregular structured object",
"circular structured object",
"square structured object",
"bead-like circular structured object"
],
"correct_choice": 1,
"position": [
358
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T2O",
"level": "L1-Perception",
"id": "@placesunleashed-7298820532466683142_0",
"video_path": "7298820532466683142.mp4",
"subtitle_path": "7298820532466683142_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 16.03,
"view_count": 2508
},
{
"video_id": "@placesunleashed-7324498849857326341",
"question": "In the scene, in the green valley filled with green plants, there are green letters spelling 'GREENERY IN THE'. What is the color of the path in the scene?",
"question_wo_referring_query": "What is the color of the path in the scene?",
"candidates": [
"green",
"blue",
"gray",
"orange"
],
"correct_choice": 3,
"position": [
314
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2A",
"level": "L1-Perception",
"id": "@placesunleashed-7324498849857326341_0",
"video_path": "7324498849857326341.mp4",
"subtitle_path": "7324498849857326341_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 16.72,
"view_count": 3025
},
{
"video_id": "@placesunleashed-7308098412078025989",
"question": "In the video, there is a turquoise lake with a rocky pile at the side of the lake. In the middle of the screen, there are yellow letters stating 'FOR MULTIPLE.' After the subtitle 'This is the world's top-ranked beach for multiple reasons' is mentioned, what objects appear?",
"question_wo_referring_query": "What objects appear after that?",
"candidates": [
"Seaweed",
"Ordinary green plants",
"Small fish",
"Coconuts on the coconut tree"
],
"correct_choice": 3,
"position": [
58,
378
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@placesunleashed-7308098412078025989_0",
"video_path": "7308098412078025989.mp4",
"subtitle_path": "7308098412078025989_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 23.97,
"view_count": 12649
},
{
"video_id": "@luxtravelbe-7286546546932370721",
"question": "In the scene, on the bustling street with many cars on the road, there is a man in a yellow short sleeve shirt also on the street. What is this man doing?",
"question_wo_referring_query": "What is this man doing?",
"candidates": [
"He is walking",
"He is running",
"He is riding a motorcycle",
"He is riding a bicycle"
],
"correct_choice": 3,
"position": [
222
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@luxtravelbe-7286546546932370721_0",
"video_path": "7286546546932370721.mp4",
"subtitle_path": "7286546546932370721_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 27.03,
"view_count": 18475
},
{
"video_id": "@luxtravelbe-7273938266183847200",
"question": "In the scene, there is a white rock beside the blue sea, surrounded by some green trees. Who is the person jumping off the white rock?",
"question_wo_referring_query": "Who is the person jumping off the white rock?",
"candidates": [
"A man wearing blue swimming trunks",
"A woman wearing a blue swimsuit",
"A woman wearing a black swimsuit",
"A man wearing black swimming trunks"
],
"correct_choice": 3,
"position": [
385
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E2O",
"level": "L1-Perception",
"id": "@luxtravelbe-7273938266183847200_0",
"video_path": "7273938266183847200.mp4",
"subtitle_path": "7273938266183847200_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 23.82,
"view_count": 15087
},
{
"video_id": "@placesunleashed-7285075986662624518",
"question": "Under the blue sky and white clouds, there is a piece of yellow land with a lot of withered grass and decaying trees. After mentioning 'which you can still find communities of today. The region also includes the most southern city in the world, Ushuaia, which is often', what happened?",
"question_wo_referring_query": ", what happened?",
"candidates": [
"A blue map appeared with the word 'Ushuaia' in black letters, accompanied by green and white designs",
"A woman in red reading a book appeared",
"Three penguins appeared",
"A glacier appeared"
],
"correct_choice": 0,
"position": [
653,
737
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@placesunleashed-7285075986662624518_0",
"video_path": "7285075986662624518.mp4",
"subtitle_path": "7285075986662624518_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 30.73,
"view_count": 11054
},
{
"video_id": "@placesunleashed-7283971258755140869",
"question": "Under a blue sky with white clouds, there is a reddish-brown rock. In its center, there is a piece of rock, surrounded by water, and the whole scene resembles a horseshoe shape. Which subtitles have mentioned this place together?",
"question_wo_referring_query": ", which subtitles have mentioned this place together?",
"candidates": [
"You've got the Wave, Antelope Canyon, Havasu Falls, Monument Valley, Horseshoe Bend, Hoover Dam, and the Grand Canyon, just to name a few.",
"I mean, look at all these places.",
"the vast landscape",
"And that doesn't even include the vast landscape that stretches for eternity."
],
"correct_choice": 0,
"position": [
84,
308
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TOS",
"level": "L2-Relation",
"id": "@placesunleashed-7283971258755140869_0",
"video_path": "7283971258755140869.mp4",
"subtitle_path": "7283971258755140869_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 19.33,
"view_count": 1157
},
{
"video_id": "@placesunleashed-7302913253959617797",
"question": "Under the blue sky and white clouds, there is a vast sea stretching to the horizon. On the sea's surface, there are two huge and oddly shaped rocks. A person is on the sandy shore; what is this person doing?",
"question_wo_referring_query": "What is this person doing?",
"candidates": [
"Building a sandcastle on the beach",
"Lying on the beach resting",
"Walking on the beach",
"Squatting on the beach catching crabs"
],
"correct_choice": 2,
"position": [
341
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2E",
"level": "L1-Perception",
"id": "@placesunleashed-7302913253959617797_0",
"video_path": "7302913253959617797.mp4",
"subtitle_path": "7302913253959617797_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 17.03,
"view_count": 5497
},
{
"video_id": "@placesunleashed-7321079612488862981",
"question": "On a sea with white waves rolling, a man slightly bends his knees, leans back, and stands on a light blue surfboard to surf. After this, what happens next?",
"question_wo_referring_query": ", what happens next?",
"candidates": [
"A person in a blue swimsuit jumps into the water",
"A person in a black swimsuit jumps into the water",
"Two fully armed people walk on a forest path",
"A person in a swimsuit walks on the beach"
],
"correct_choice": 1,
"position": [
332,
362
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@placesunleashed-7321079612488862981_0",
"video_path": "7321079612488862981.mp4",
"subtitle_path": "7321079612488862981_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 20.07,
"view_count": 1155
},
{
"video_id": "@placesunleashed-7319588197858610438",
"question": "Under an orange-yellow radiant sky, there is a sea of flowers filled with marigolds. In the foreground of the scene are purple-red marigolds. Behind them are pink and red marigolds. Which of these marigold colors appears first?",
"question_wo_referring_query": "Which of these marigold colors appears first?",
"candidates": [
"All three appear simultaneously",
"Red marigolds",
"Purple-red marigolds",
"Pink marigolds"
],
"correct_choice": 1,
"position": [
241,
65
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@placesunleashed-7319588197858610438_0",
"video_path": "7319588197858610438.mp4",
"subtitle_path": "7319588197858610438_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 17.13,
"view_count": 1583
},
{
"video_id": "@placesunleashed-7337436458078260485",
"question": "Under the blue sky, there are majestic snow-capped mountains, with many houses at their base. When 'and have some of the cutest cows' is mentioned, what changes occur in the clouds over the mountains?",
"question_wo_referring_query": "What changes occur in the clouds over the mountains?",
"candidates": [
"The white clouds are replaced by dark clouds",
"The white clouds move smoothly to the left side of the screen",
"The clouds are tinted by the evening glow",
"The white clouds move smoothly to the right side of the screen"
],
"correct_choice": 1,
"position": [
59,
91
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@placesunleashed-7337436458078260485_0",
"video_path": "7337436458078260485.mp4",
"subtitle_path": "7337436458078260485_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 16.9,
"view_count": 1006
},
{
"video_id": "1R5uPaL0V-0",
"question": "In the scene, a man is conversing with a woman. They are seated in a colorful room with plants and colorful curtains. The man is sitting on a white chair, and the woman is sitting on a sofa, holding a pillow. The man is wearing a round-neck T-shirt, and the woman is dressed in pink. Both have black hair. What type of outfit is the woman wearing?",
"question_wo_referring_query": "What type of outfit is the woman wearing?",
"candidates": [
"coat",
"suit",
"dress",
"sweater",
"shirt"
],
"correct_choice": 0,
"position": [
2257
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "1R5uPaL0V-0_0",
"video_path": "1R5uPaL0V-0.mp4",
"subtitle_path": "1R5uPaL0V-0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1008.01,
"view_count": 9980770
},
{
"video_id": "1R5uPaL0V-0",
"question": "The person on the screen is a black-haired man wearing a T-shirt and a black scarf. He is in a white kitchen, with a stove behind him. On the stove, there are some cylinders and tools, as well as transparent bottles and various other items. In front of the man, there is a countertop with a cutting board and a metal tray containing a large sushi roll. The man is holding a yellow model of a boat. What material is this boat made of?",
"question_wo_referring_query": "What material is this boat made of?",
"candidates": [
"plastic",
"wood",
"ceramic",
"stainless steel",
"glass"
],
"correct_choice": 1,
"position": [
17662
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "1R5uPaL0V-0_1",
"video_path": "1R5uPaL0V-0.mp4",
"subtitle_path": "1R5uPaL0V-0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1008.01,
"view_count": 9980770
},
{
"video_id": "1R5uPaL0V-0",
"question": "The screen shows a woman in a pink outfit holding a bun filled with some meat and vegetables. She also has a spoon in her hand with some sauce on it, which she pours onto the bun. The background is blurry, and there is a cute white design on her outfit. In front of the camera, there are some green avocados. What color is the sauce that the woman pours?",
"question_wo_referring_query": "What color is the sauce that the woman pours?",
"candidates": [
"pink",
"red",
"flesh color",
"yellow",
"green"
],
"correct_choice": 3,
"position": [
22556
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2A",
"level": "L1-Perception",
"id": "1R5uPaL0V-0_2",
"video_path": "1R5uPaL0V-0.mp4",
"subtitle_path": "1R5uPaL0V-0_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1008.01,
"view_count": 9980770
},
{
"video_id": "fPLjjr8w6DU",
"question": "Which character is the first to jump into the swimming pool in the video?",
"question_wo_referring_query": "Which character is the first to jump into the swimming pool in the video?",
"candidates": [
"A woman in a red swimsuit",
"A man in black swim trunks",
"A woman in a white swimsuit",
"A woman in a yellow swimsuit",
"A woman in a black swimsuit"
],
"correct_choice": 1,
"position": [
17,
97
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "fPLjjr8w6DU_0",
"video_path": "fPLjjr8w6DU.mp4",
"subtitle_path": "fPLjjr8w6DU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1158.76,
"view_count": 704450
},
{
"video_id": "fPLjjr8w6DU",
"question": "Who is the first character standing in front of the bathtub with flower petals in the video?",
"question_wo_referring_query": "Who is the first character standing in front of the bathtub with flower petals in the video?",
"candidates": [
"A lady wearing a white floral dress",
"A lady wearing a pink floral dress",
"A lady wearing a purple floral dress",
"A lady wearing a yellow floral dress",
"A lady wearing a green floral dress"
],
"correct_choice": 0,
"position": [
468,
485
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "fPLjjr8w6DU_1",
"video_path": "fPLjjr8w6DU.mp4",
"subtitle_path": "fPLjjr8w6DU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1158.76,
"view_count": 704450
},
{
"video_id": "fPLjjr8w6DU",
"question": "Which mode of transportation is mentioned first in the video?",
"question_wo_referring_query": "Which mode of transportation is mentioned first in the video?",
"candidates": [
"A blue and black sports car",
"A white and black sports car",
"A yellow and black sports car",
"A purple and black sports car",
"A silver sports car"
],
"correct_choice": 2,
"position": [
1888,
1918
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "O3O",
"level": "L2-Relation",
"id": "fPLjjr8w6DU_2",
"video_path": "fPLjjr8w6DU.mp4",
"subtitle_path": "fPLjjr8w6DU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1158.76,
"view_count": 704450
},
{
"video_id": "Z6Hx_BAZeUw",
"question": "In the video, a woman with long black hair is sitting in front of a white desk. Behind her, there's a white shelf with two green plants on it. She is wearing a denim shirt and is speaking. After raising one of her hands while speaking, what action does she perform?",
"question_wo_referring_query": "What action does she perform?",
"candidates": [
"The woman takes out a notebook",
"The woman picks up a pen",
"The woman starts using a tablet",
"The woman picks up a pencil case",
"The woman's hand rests on the desk"
],
"correct_choice": 4,
"position": [
343,
385
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Z6Hx_BAZeUw_0",
"video_path": "Z6Hx_BAZeUw.mp4",
"subtitle_path": "Z6Hx_BAZeUw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2186.42,
"view_count": 319483
},
{
"video_id": "Z6Hx_BAZeUw",
"question": "The screen shows a PPT with a white background. The title of the PPT is 'How to Draw Lewis Structures'. There are some handwritten notes in black marker inside the PPT, and there is a decorative floral border with some content inside the border as well. There are two women in split screens on the right side of the PPT. What happens after the content of this PPT is presented?",
"question_wo_referring_query": "What happens?",
"candidates": [
"The content on the PPT disappears, and a light purple box appears, with CH3OH written at the top of the box.",
"The content on the PPT disappears, and a light green box appears, with CH3OH written at the top of the box.",
"The content on the PPT disappears, and a light red box appears, with CH3OH written at the top of the box.",
"The content on the PPT disappears, and a light yellow box appears, with CH3OH written at the top of the box.",
"The content on the PPT disappears, and a light pink box appears, with CH3OH written at the top of the box."
],
"correct_choice": 1,
"position": [
24839,
25004
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Z6Hx_BAZeUw_1",
"video_path": "Z6Hx_BAZeUw.mp4",
"subtitle_path": "Z6Hx_BAZeUw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2186.42,
"view_count": 319483
},
{
"video_id": "Z6Hx_BAZeUw",
"question": "The screen shows a white PPT background with the title 'Exceptions to the Octet Rule'. Below the title, the content 'XeF4' is displayed. There is a pink frame underneath, and to the right, there is a figure composed of green and pink elements with black letters inside. On the left, there is a split screen showing two women. In the split screen, a woman with black straight hair, wearing a denim jacket, is speaking. After she clasps her hands, what action does this woman take?",
"question_wo_referring_query": "What action does this woman take after clasping her hands?",
"candidates": [
"The woman takes out a notebook",
"The woman starts making a phone call",
"The woman raises a cup",
"The woman picks up a pen",
"The woman spreads her hands open"
],
"correct_choice": 4,
"position": [
43048,
43067
],
"topic_category": "KS-Knowledge-STEM",
"question_category": "E3E",
"level": "L2-Relation",
"id": "Z6Hx_BAZeUw_2",
"video_path": "Z6Hx_BAZeUw.mp4",
"subtitle_path": "Z6Hx_BAZeUw_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2186.42,
"view_count": 319483
},
{
"video_id": "7TljSpTBS9c",
"question": "The shoe rack is filled with shoes, and there is a shoe box placed on the top of the shoe rack. A person wearing a black top picks up a pair of Nike shoes with multiple colors including red, blue, and white. The shoe laces are blue. After the person in the black top puts down the Nike shoes, what does he do next?",
"question_wo_referring_query": "After the person in the black top puts down the Nike shoes, what does he do next?",
"candidates": [
"Picks up a pair of shoes",
"Touches the shoe next to him",
"Picks up a can of drink",
"Picks up a black backpack",
"Picks up a shoe box"
],
"correct_choice": 1,
"position": [
4003,
4021
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "7TljSpTBS9c_0",
"video_path": "7TljSpTBS9c.mp4",
"subtitle_path": "7TljSpTBS9c_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1347.35,
"view_count": 168866
},
{
"video_id": "7TljSpTBS9c",
"question": "A strip of paper is placed on a white background, with three lines of English written on it. To the left of the paper strip, there is a black wheel-shaped object and a purple light. A hand is pointing at the paper strip while explaining something. After explaining the paper strip, what does the owner of the hand do?",
"question_wo_referring_query": "After explaining the paper strip, what does the owner of the hand do?",
"candidates": [
"Take a yellow toy chicken from the corner",
"Take a magazine from a black bag",
"Take a music box from the corner",
"Take a green toy from the corner",
"Take an apple from a black bag"
],
"correct_choice": 3,
"position": [
9480,
9790
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "7TljSpTBS9c_1",
"video_path": "7TljSpTBS9c.mp4",
"subtitle_path": "7TljSpTBS9c_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1347.35,
"view_count": 168866
},
{
"video_id": "7TljSpTBS9c",
"question": "There is a poster hanging on the wall showing a pink semicircle positioned over a blue sea. Various cartoon characters are standing on the pink semicircle. There is a string of lights in the upper right corner of the poster, with a yellow object on the left and a red object on the right. After introducing this blue poster, what did the person with the hand do next?",
"question_wo_referring_query": "After introducing this blue poster, what did the person with the hand do next?",
"candidates": [
"Took out a fan",
"Took out a book",
"Took out a pair of shoes",
"Held a guitar",
"Picked up a cartoon toy"
],
"correct_choice": 3,
"position": [
18685,
18765
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "7TljSpTBS9c_2",
"video_path": "7TljSpTBS9c.mp4",
"subtitle_path": "7TljSpTBS9c_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1347.35,
"view_count": 168866
},
{
"video_id": "duxO1EZ650E",
"question": "On the left side of the screen is a woman wearing glasses and earrings, with books and miscellaneous items on the shelf behind her. On the right side of the screen is a man wearing a black top, with long hair and glasses. What did the man with glasses on the right side of the screen do when he appeared for the first time?",
"question_wo_referring_query": "What did the man with glasses on the right side of the screen do when he appeared for the first time?",
"candidates": [
"Crossed his arms in front of his chest",
"Supported his chin with one hand",
"Placed one hand on the other arm's forearm",
"Held a wristwatch",
"Held his forehead with one hand"
],
"correct_choice": 2,
"position": [
6003
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "duxO1EZ650E_0",
"video_path": "duxO1EZ650E.mp4",
"subtitle_path": "duxO1EZ650E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2066.5,
"view_count": 2383
},
{
"video_id": "duxO1EZ650E",
"question": "Two people appear in the room. One person is lying on debris in the corner, while the other person is sitting by the feet of the lying person. There is a can placed inside the room, and behind the sitting man is a fireplace. What did the person lying on the debris do the first time they appeared?",
"question_wo_referring_query": "What did the person lying on the debris do the first time they appeared?",
"candidates": [
"Holding the head with both hands",
"Supporting the chin with one hand",
"Using hands to touch the top of the head",
"Crossing arms over the chest",
"Using hand to support the forehead"
],
"correct_choice": 2,
"position": [
17881
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "duxO1EZ650E_1",
"video_path": "duxO1EZ650E.mp4",
"subtitle_path": "duxO1EZ650E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2066.5,
"view_count": 2383
},
{
"video_id": "duxO1EZ650E",
"question": "A woman is sitting on the ground with her hair parted in the middle, exposing her forehead. She is surrounded by exquisitely woven items, and there are tree branches and leaves behind her. What did this seated woman do the first time she appeared?",
"question_wo_referring_query": "What did this seated woman do the first time she appeared?",
"candidates": [
"Holding a flower",
"Hugging a cat",
"Praying with hands clasped together",
"Holding a woven item with both hands",
"Holding her chin with one hand"
],
"correct_choice": 3,
"position": [
33444
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "duxO1EZ650E_2",
"video_path": "duxO1EZ650E.mp4",
"subtitle_path": "duxO1EZ650E_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 2066.5,
"view_count": 2383
},
{
"video_id": "vEy6tcU6eLU",
"question": "A man wearing an orange shirt appears in front of a black background. The man rolls up his sleeves, and to his left, there is a square frame with green and white inside. When the subtitle 'Vegetarian and this one means' appears, what is the shape at the center of the green area inside the square frame on the left?",
"question_wo_referring_query": "What is the shape of the green pattern at the center of the square frame on the left?",
"candidates": [
"Circle",
"Triangle",
"Rectangle",
"Square",
"Droplet shape"
],
"correct_choice": 0,
"position": [
11582
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "vEy6tcU6eLU_0",
"video_path": "vEy6tcU6eLU.mp4",
"subtitle_path": "vEy6tcU6eLU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1150.53,
"view_count": 5449194
},
{
"video_id": "vEy6tcU6eLU",
"question": "A man wearing an orange shirt appears in front of a black background. The man rolls up his sleeves. The man is holding a plate of Indian specialty food in his hand. There is a food picture to the man's left, and when the subtitle 'These usually come to the Northern regions of India' appears, what is the shape of the plate in the picture to the right?",
"question_wo_referring_query": "What is the shape of the plate in the picture on the right?",
"candidates": [
"Rectangle",
"Circle",
"Square",
"Pentagon",
"Irregular Shape"
],
"correct_choice": 1,
"position": [
14197
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "vEy6tcU6eLU_1",
"video_path": "vEy6tcU6eLU.mp4",
"subtitle_path": "vEy6tcU6eLU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1150.53,
"view_count": 5449194
},
{
"video_id": "vEy6tcU6eLU",
"question": "A man appears in front of a black background. The man has thick and long hair. He is wearing a short-sleeve shirt, and the buttons of the short-sleeve shirt are not fastened. When the subtitle 'So how do they communicate with each other?' appears, what is the color of the shirt under the short-sleeve shirt of the man with thick hair?",
"question_wo_referring_query": "What is the color of the shirt under the short-sleeve shirt of the man with thick hair?",
"candidates": [
"yellow",
"black",
"blue",
"white",
"green"
],
"correct_choice": 1,
"position": [
16691
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "T2A",
"level": "L1-Perception",
"id": "vEy6tcU6eLU_2",
"video_path": "vEy6tcU6eLU.mp4",
"subtitle_path": "vEy6tcU6eLU_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1150.53,
"view_count": 5449194
},
{
"video_id": "vE0faFD1F_g",
"question": "A painting is hanging on the blue wall, and a person wearing long sleeves is standing with their back to the camera, admiring the painting on the wall. When the subtitle says 'day knowing this would be no more than,' what color clothing is the person standing in front of the painting wearing?",
"question_wo_referring_query": "What color of clothing is the person standing in front of the painting wearing?",
"candidates": [
"red",
"blue",
"white",
"black",
"pink"
],
"correct_choice": 3,
"position": [
54
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T2A",
"level": "L1-Perception",
"id": "vE0faFD1F_g_0",
"video_path": "vE0faFD1F_g.mp4",
"subtitle_path": "vE0faFD1F_g_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 107,
"duration": 56.02,
"view_count": 2267
},
{
"video_id": "jiR7qHzgq_c",
"question": "Soldiers wearing green helmets and holding tan rifles are sitting neatly on two rows of tan chairs. What clothing did they change into when they finally appeared on the white snowy ground?",
"question_wo_referring_query": "What clothing did they change into?",
"candidates": [
"changed into coats",
"changed into vests",
"changed into skirts",
"changed into suits",
"changed into short sleeves"
],
"correct_choice": 0,
"position": [
144,
889
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "jiR7qHzgq_c_0",
"video_path": "jiR7qHzgq_c.mp4",
"subtitle_path": "jiR7qHzgq_c_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 95,
"duration": 54.0,
"view_count": 2424752
},
{
"video_id": "tdA5atpqaAc",
"question": "In the top-right corner of the brown cutting board, there are several pieces of already-cut meat. On the left side of the cutting board, there is uncut meat on a metal plate. In the middle of the screen, a person is holding a pair of tongs. What is this person doing?",
"question_wo_referring_query": "What is this person doing?",
"candidates": [
"Placing the tongs on the cutting board",
"Placing the meat from the metal plate onto the cutting board",
"Using the tongs to grab vegetables",
"Placing the cut meat from the cutting board onto the metal plate",
"Putting the tongs into a bowl"
],
"correct_choice": 1,
"position": [
318
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "S2E",
"level": "L1-Perception",
"id": "tdA5atpqaAc_0",
"video_path": "tdA5atpqaAc.mp4",
"subtitle_path": "tdA5atpqaAc_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 904,
"duration": 23.98,
"view_count": 136606
},
{
"video_id": "t48HXAjjDAU",
"question": "The screen is split into two headshots of different people, on the left and right. The left headshot shows a person with glasses shouting. In the red frame at the bottom, white English text reads 'Hundreds rally in Tel Aviv for ceasefire in Gaza'. Whose hair is being blown messily by the strong wind on the screen?",
"question_wo_referring_query": "Whose hair is being blown messily by the strong wind on the screen?",
"candidates": [
"A female reporter wearing a white coat",
"A male reporter wearing a grey coat",
"A female reporter wearing a black coat",
"A female reporter with glasses",
"A female reporter wearing a purple coat"
],
"correct_choice": 2,
"position": [
284
],
"topic_category": "NP-News-Programs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "t48HXAjjDAU_0",
"video_path": "t48HXAjjDAU.mp4",
"subtitle_path": "t48HXAjjDAU_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 319,
"duration": 15.0,
"view_count": 70594
},
{
"video_id": "AHeq99pojLo",
"question": "In the top left corner of the black background, there is a picture where many people are sitting together. In the middle of the black background, a man wearing an orange shirt makes a 'Yay' hand gesture. What action did the man take after that?",
"question_wo_referring_query": "What action did the man take after that?",
"candidates": [
"Touched his nose",
"Put his hands on his head",
"Touched his ear",
"Clasped his hands together",
"Crossed his hands in front of his chest"
],
"correct_choice": 3,
"position": [
422,
445
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E3E",
"level": "L2-Relation",
"id": "AHeq99pojLo_0",
"video_path": "AHeq99pojLo.mp4",
"subtitle_path": "AHeq99pojLo_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 512,
"duration": 30.99,
"view_count": 1412312
},
{
"video_id": "6moss7hGbvg",
"question": "In the driver's seat of a car, a woman wearing a white top and earrings is holding a cup of green beverage. Along with which subtitles did this cup of green beverage appear?",
"question_wo_referring_query": "Along with which subtitles did this cup of green beverage appear?",
"candidates": [
"good luck",
"out of slump clean and just get your",
"best wish",
"wow",
"goodbye"
],
"correct_choice": 1,
"position": [
205,
373
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "6moss7hGbvg_0",
"video_path": "6moss7hGbvg.mp4",
"subtitle_path": "6moss7hGbvg_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 675,
"duration": 24.0,
"view_count": 47183
},
{
"video_id": "HRYlXC_ChzU",
"question": "Sitting in the driver's seat of the car, a woman wearing blue jeans and a high ponytail mentioned in the subtitles 'really in depth car videos like those'. What color top was she wearing?",
"question_wo_referring_query": "What color top was she wearing?",
"candidates": [
"purple",
"white",
"blue",
"black",
"green"
],
"correct_choice": 3,
"position": [
341
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T2A",
"level": "L1-Perception",
"id": "HRYlXC_ChzU_0",
"video_path": "HRYlXC_ChzU.mp4",
"subtitle_path": "HRYlXC_ChzU_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 309,
"duration": 25.0,
"view_count": 93584
},
{
"video_id": "MyDAFlHsYsQ",
"question": "In the top right corner of the screen, there is a picture of a clay pot. In the middle of the screen, a long-haired woman is sitting on a sofa, holding a book with a blue cover that has a pink koi fish on it. What happened when this book first appeared?",
"question_wo_referring_query": "What happened when this book first appeared?",
"candidates": [
"The book was placed on the TV",
"The book was lifted towards the camera",
"The book was torn",
"The book was opened",
"The book was placed on a bookshelf"
],
"correct_choice": 1,
"position": [
1045
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "MyDAFlHsYsQ_0",
"video_path": "MyDAFlHsYsQ.mp4",
"subtitle_path": "MyDAFlHsYsQ_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 372,
"duration": 48.02,
"view_count": 1381
},
{
"video_id": "rzIiQ4Vxlbk",
"question": "What action did the curly-haired man wearing a striped black-and-white shirt, who was crouching in front of a camera, take when he said 'shoot had already started' in the subtitles?",
"question_wo_referring_query": "What action did he take?",
"candidates": [
"Crouched on the ground",
"Looked down at the camera in his hand",
"Sat on the ground",
"Took off the striped black-and-white shirt",
"Picked up the camera and took a photo"
],
"correct_choice": 1,
"position": [
93
],
"topic_category": "KH-Knowledge-History",
"question_category": "T2E",
"level": "L1-Perception",
"id": "rzIiQ4Vxlbk_0",
"video_path": "rzIiQ4Vxlbk.mp4",
"subtitle_path": "rzIiQ4Vxlbk_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 490,
"duration": 50.99,
"view_count": 514584
},
{
"video_id": "SKET024GdlE",
"question": "Who is the person standing in front of the wall with several rectangular maps, talking to the camera?",
"question_wo_referring_query": "Who is it?",
"candidates": [
"A man wearing a green shirt",
"A man wearing a red shirt",
"A man wearing a multicolored shirt",
"A man wearing a black shirt",
"A man wearing a purple shirt"
],
"correct_choice": 0,
"position": [
366
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "SKET024GdlE_0",
"video_path": "SKET024GdlE.mp4",
"subtitle_path": "SKET024GdlE_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 14,
"duration": 17.02,
"view_count": 23558
},
{
"video_id": "3pTVbQilDqY",
"question": "Sitting in a studio, what is an elderly person wearing glasses and dressed in a black coat and purple shirt doing?",
"question_wo_referring_query": "What are they doing?",
"candidates": [
"Moving materials",
"Shaking hands with someone",
"Talking to the camera",
"Testing equipment",
"Taking photos"
],
"correct_choice": 2,
"position": [
209
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "3pTVbQilDqY_0",
"video_path": "3pTVbQilDqY.mp4",
"subtitle_path": "3pTVbQilDqY_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 61,
"duration": 26.99,
"view_count": 26240
},
{
"video_id": "wKd804fWOyQ",
"question": "In a room cluttered with various small objects, a man wearing black glasses and a green lab coat is sitting. Before the subtitle reads 'so yeah that's', what object appears first on the screen?",
"question_wo_referring_query": "What object appears first on the screen?",
"candidates": [
"A refrigerator",
"A water dispenser",
"A bicycle",
"An oven",
"A small green plant"
],
"correct_choice": 4,
"position": [
446,
383
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3O",
"level": "L2-Relation",
"id": "wKd804fWOyQ_0",
"video_path": "wKd804fWOyQ.mp4",
"subtitle_path": "wKd804fWOyQ_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 1198,
"duration": 48.01,
"view_count": 112563
},
{
"video_id": "3avWNHoEDAg",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First, rocky surfaces in a pitch-black environment, then green vegetation and mountain ranges under a blue sky, and finally continuously flowing red lava",
"First, continuously flowing red lava, then rocky surfaces in a pitch-black environment, and finally green vegetation and mountain ranges under a blue sky",
"First, continuously flowing red lava, then green vegetation and mountain ranges under a blue sky, and finally rocky surfaces in a pitch-black environment",
"First, rocky surfaces in a pitch-black environment, then continuously flowing red lava, and finally green vegetation and mountain ranges under a blue sky",
"First, green vegetation and mountain ranges under a blue sky, then continuously flowing red lava, and finally rocky surfaces in a pitch-black environment"
],
"correct_choice": 1,
"position": [
294,
557,
670
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "SSS",
"level": "L2-Relation",
"id": "3avWNHoEDAg_0",
"video_path": "3avWNHoEDAg.mp4",
"subtitle_path": "3avWNHoEDAg_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 243,
"duration": 48.02,
"view_count": 21822
},
{
"video_id": "wWNBYEUxews",
"question": "Who is the person standing in front of a screen, holding a camera and carrying a red bag on their shoulder?",
"question_wo_referring_query": "Who is it?",
"candidates": [
"A woman wearing a blue top and purple skirt",
"A woman wearing a green top and olive green skirt",
"A woman wearing a white top and olive green skirt",
"A woman wearing a red top and olive green skirt",
"A woman wearing a blue top and olive green skirt"
],
"correct_choice": 4,
"position": [
479
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E2O",
"level": "L1-Perception",
"id": "wWNBYEUxews_0",
"video_path": "wWNBYEUxews.mp4",
"subtitle_path": "wWNBYEUxews_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 159,
"duration": 52.0,
"view_count": 352802
},
{
"video_id": "@tiffycooks-7003765189401185542",
"question": "In the pot of rolling hot oil, there are many golden fried chicken pieces being deep-fried. Before the subtitle says 'Fry the chicken for four to five minutes,' which person first appears on the screen?",
"question_wo_referring_query": "Which person first appears on the screen?",
"candidates": [
"A blonde-haired woman wearing a black top",
"A red-haired woman wearing a black top",
"A black-haired woman wearing a green top",
"A black-haired woman wearing a white top",
"A black-haired woman wearing a black top"
],
"correct_choice": 4,
"position": [
776,
751
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T3O",
"level": "L2-Relation",
"id": "@tiffycooks-7003765189401185542_0",
"video_path": "7003765189401185542.mp4",
"subtitle_path": "7003765189401185542_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 51.5,
"view_count": 434585
},
{
"video_id": "@thatrecipe.us-7335998138500648235",
"question": "When the yellow egg mixture in the glass bowl is poured into the water-filled pot, what change in state occurs?",
"question_wo_referring_query": "What change in state occurs?",
"candidates": [
"It turns into a solid",
"It turns into a gas",
"It turns black",
"It turns white",
"It turns into a liquid"
],
"correct_choice": 0,
"position": [
6,
103
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "TAA",
"level": "L2-Relation",
"id": "@thatrecipe.us-7335998138500648235_0",
"video_path": "7335998138500648235.mp4",
"subtitle_path": "7335998138500648235_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.03,
"view_count": 660191
},
{
"video_id": "@healthfood-7267233980473314606",
"question": "On the table in the kitchen, a person is adding yellow lemon slices to a glass jar filled with yellow squash slices. When the subtitle says 'I'm a New York Medicine Ave,' what other objects are present in the room?",
"question_wo_referring_query": "What other objects are present in the room?",
"candidates": [
"Refrigerator",
"Red flowers",
"Yellow flowers",
"Green plants",
"Oven"
],
"correct_choice": 1,
"position": [
457
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "@healthfood-7267233980473314606_0",
"video_path": "7267233980473314606.mp4",
"subtitle_path": "7267233980473314606_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 23.9,
"view_count": 55942
},
{
"video_id": "@tiffycooks-7194857776877817094",
"question": "Who is the person, in a room with green plants, holding chopsticks and picking up food from the round plate on the table in front, and then putting it into their mouth?",
"question_wo_referring_query": "Who is it?",
"candidates": [
"A black-haired woman wearing a black top",
"A black-haired man wearing a white top",
"A black-haired woman wearing a white top",
"A white-haired woman wearing a black top",
"A black-haired woman wearing a purple top"
],
"correct_choice": 0,
"position": [
79
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "@tiffycooks-7194857776877817094_0",
"video_path": "7194857776877817094.mp4",
"subtitle_path": "7194857776877817094_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 24.13,
"view_count": 2610602
},
{
"video_id": "@placesunleashed-7326709884102216965",
"question": "In the video, there is a black-haired woman wearing a backless yellow floral dress with her back to the camera, facing a white curtain. What other objects are present in this scene?",
"question_wo_referring_query": "What other objects are present in this scene?",
"candidates": [
"life preserver",
"green bedspread",
"dinghy",
"sailboat",
"cradle"
],
"correct_choice": 1,
"position": [
40
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@placesunleashed-7326709884102216965_0",
"video_path": "7326709884102216965.mp4",
"subtitle_path": "7326709884102216965_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 17.4,
"view_count": 3558
},
{
"video_id": "@movie.explained6-7276598470151032066",
"question": "In a room, a young man stands on the left side of the screen, and an elderly man with white hair, wearing a red plaid shirt, stands on the right side of the screen. The two face each other. What did this elderly man do the first time he appeared?",
"question_wo_referring_query": "What did this elderly man do the first time he appeared?",
"candidates": [
"Was washing his hair",
"Held a knife towards himself",
"Looked in the mirror",
"Held a knife towards the person in front",
"Was cutting his hair"
],
"correct_choice": 3,
"position": [
857
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "O2E",
"level": "L1-Perception",
"id": "@movie.explained6-7276598470151032066_0",
"video_path": "7276598470151032066.mp4",
"subtitle_path": "7276598470151032066_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 58.19,
"view_count": 3215
},
{
"video_id": "@movie.explained6-7277140989259599105",
"question": "What did the Yoda baby with big black eyes in the screen do when the subtitle said, \"to obtain any information about the Yoda baby's owner\"?",
"question_wo_referring_query": "What action did it take?",
"candidates": [
"Raised one ear",
"Raised both ears",
"Blinked its eyes",
"Ran on the ground",
"Walked on the ground"
],
"correct_choice": 2,
"position": [
288
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2E",
"level": "L1-Perception",
"id": "@movie.explained6-7277140989259599105_0",
"video_path": "7277140989259599105.mp4",
"subtitle_path": "7277140989259599105_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 59.63,
"view_count": 3038
},
{
"video_id": "@movie.explained6-7261188695892462856",
"question": "Which of the following scene sequences is correct?",
"question_wo_referring_query": "Which of the following scene sequences is correct?",
"candidates": [
"First, a man wearing blue clothes and a blue hat is making a phone call. Then, a woman with her hair wrapped is sitting in the room tidying up clothes. Lastly, a woman wearing a hat and a skirt is playing by the seaside.",
"First, a man wearing blue clothes and a blue hat is making a phone call. Then, a woman wearing a hat and a skirt is playing by the seaside. Lastly, a woman with her hair wrapped is sitting in the room tidying up clothes.",
"First, a woman wearing a hat and a skirt is playing by the seaside. Then, a man wearing blue clothes and a blue hat is making a phone call. Lastly, a woman with her hair wrapped is sitting in the room tidying up clothes.",
"First, a woman with her hair wrapped is sitting in the room tidying up clothes. Then, a woman wearing a hat and a skirt is playing by the seaside. Lastly, a man wearing blue clothes and a blue hat is making a phone call.",
"First, a woman wearing a hat and a skirt is playing by the seaside. Then, a woman with her hair wrapped is sitting in the room tidying up clothes. Lastly, a man wearing blue clothes and a blue hat is making a phone call."
],
"correct_choice": 1,
"position": [
346,
699,
1176
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "SSS",
"level": "L2-Relation",
"id": "@movie.explained6-7261188695892462856_0",
"video_path": "7261188695892462856.mp4",
"subtitle_path": "7261188695892462856_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.0,
"view_count": 5730
},
{
"video_id": "@lisolna-7282789187676294432",
"question": "In the kitchen, after a chef pours a yellow liquid into a wok with stir-fried vegetables, what happens on the screen?",
"question_wo_referring_query": "What happens on the screen?",
"candidates": [
"The chef adds eggplant to the wok.",
"The chef transfers the vegetables from the wok into a rectangular white plate containing bread.",
"The chef adds scallions to the wok.",
"The chef transfers the vegetables from the wok into a rectangular white plate containing rice.",
"The chef transfers the vegetables from the wok into a round white plate containing rice."
],
"correct_choice": 4,
"position": [
296,
358
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "@lisolna-7282789187676294432_0",
"video_path": "7282789187676294432.mp4",
"subtitle_path": "7282789187676294432_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.37,
"view_count": 76800
},
{
"video_id": "@lisolna-7315451966732193057",
"question": "After a person places a round white plate on the table, which of the following objects appears first?",
"question_wo_referring_query": "Which of the following objects appears first?",
"candidates": [
"Straw",
"Ladle",
"Cup",
"Fork",
"Iron weight"
],
"correct_choice": 4,
"position": [
295,
322,
400,
456
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "O3O",
"level": "L2-Relation",
"id": "@lisolna-7315451966732193057_0",
"video_path": "7315451966732193057.mp4",
"subtitle_path": "7315451966732193057_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 60.24,
"view_count": 204400
},
{
"video_id": "@kerstinong-6999997457392307457",
"question": "In a room, a long-haired girl wearing a white top is sitting on the ground, holding a blue wrapped gift. After the subtitle says 'I hope he doesn't mind,' what does she do?",
"question_wo_referring_query": "What does she do?",
"candidates": [
"She puts the gift under the bed.",
"She puts the gift into the cabinet.",
"She puts the gift into a courier box.",
"She puts the gift into the car's trunk.",
"She puts the gift into a paper bag."
],
"correct_choice": 4,
"position": [
171,
202
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "T3E",
"level": "L2-Relation",
"id": "@kerstinong-6999997457392307457_0",
"video_path": "6999997457392307457.mp4",
"subtitle_path": "6999997457392307457_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 35.83,
"view_count": 107361
},
{
"video_id": "@lisolna-7265337261737217312",
"question": "In a rectangular plate containing various buns, when a piece of toast picked up with tongs is placed into a round plate with forks, what changes occur to the toast?",
"question_wo_referring_query": "What changes occur to the toast?",
"candidates": [
"The toast turned black",
"A corner of the toast is missing",
"The toast turned into a triangle",
"The toast turned into a square",
"The toast turned red"
],
"correct_choice": 1,
"position": [
299,
738
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "SAA",
"level": "L2-Relation",
"id": "@lisolna-7265337261737217312_0",
"video_path": "7265337261737217312.mp4",
"subtitle_path": "7265337261737217312_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 34.2,
"view_count": 183792
},
{
"video_id": "@lisolna-7231929658869107994",
"question": "In a glass bowl containing some white and green foods, there is some brown powder. There are some fruits beside the glass bowl. A person is holding a wooden spoon, stirring inside the glass bowl. What objects are present in the scene?",
"question_wo_referring_query": "What objects are present in the scene?",
"candidates": [
"Yellow plums",
"Yellow bananas",
"Yellow pineapple",
"Green mangoes",
"Pink peaches"
],
"correct_choice": 1,
"position": [
978
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "S2O",
"level": "L1-Perception",
"id": "@lisolna-7231929658869107994_0",
"video_path": "7231929658869107994.mp4",
"subtitle_path": "7231929658869107994_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 0,
"duration": 56.8,
"view_count": 13847
},
{
"video_id": "s2N4cPqCdys",
"question": "In a dimly lit space, there's a well made of stone, surrounded by people. A man wearing a dark yellow short-sleeved shirt is shining a flashlight into the well. What color is the flashlight beam shining into the well?",
"question_wo_referring_query": "What color is the flashlight beam shining into the well?",
"candidates": [
"green",
"yellow",
"blue",
"white",
"red"
],
"correct_choice": 1,
"position": [
866
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "S2A",
"level": "L1-Perception",
"id": "s2N4cPqCdys_0",
"video_path": "s2N4cPqCdys.mp4",
"subtitle_path": "s2N4cPqCdys_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 99,
"duration": 53.99,
"view_count": 184995
},
{
"video_id": "lzuixOdhZow",
"question": "In a dense forest, there is a man with short black hair and a woman with a bun. When the subtitle \"spirit again and kisses Oliver. Concurrently, Korrigan senses the kiss\" appears, what are the man with short hair and the woman with a bun doing?",
"question_wo_referring_query": "What are the man with short hair and the woman with a bun doing?",
"candidates": [
"dancing",
"kissing",
"hugging each other tightly",
"crying in each other's arms",
"the man with short black hair is knocking the woman with a bun unconscious"
],
"correct_choice": 1,
"position": [
487
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2E",
"level": "L1-Perception",
"id": "lzuixOdhZow_0",
"video_path": "lzuixOdhZow.mp4",
"subtitle_path": "lzuixOdhZow_en.json",
"duration_group": 60,
"starting_timestamp_for_subtitles": 434,
"duration": 27.03,
"view_count": 347057
},
{
"video_id": "1vvYsirvA2I",
"question": "On the screen, there is a wooden cutting board placed on a grey table. In the upper right corner of the board, there are two pieces of chicken breast, which have been sliced. Who sliced the meat on the cutting board?",
"question_wo_referring_query": "Who sliced the meat on the cutting board?",
"candidates": [
"The woman in green clothes",
"The woman in denim-colored clothes",
"The woman in pink clothes",
"The person wearing leopard print clothes",
"The woman in black clothes"
],
"correct_choice": 3,
"position": [
487
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "1vvYsirvA2I_0",
"video_path": "1vvYsirvA2I.mp4",
"subtitle_path": "1vvYsirvA2I_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 339.23,
"view_count": 48975
},
{
"video_id": "1vvYsirvA2I",
"question": "In the video, a white-ringed vegetable is being cut with a knife on a wooden cutting board. What is this vegetable that is being cut?",
"question_wo_referring_query": "What is this vegetable that is being cut?",
"candidates": [
"Apple",
"Yuzu",
"Leek",
"Radish",
"Pumpkin"
],
"correct_choice": 2,
"position": [
1991
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "E2O",
"level": "L1-Perception",
"id": "1vvYsirvA2I_1",
"video_path": "1vvYsirvA2I.mp4",
"subtitle_path": "1vvYsirvA2I_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 339.23,
"view_count": 48975
},
{
"video_id": "OmhVj_-cfH0",
"question": "A middle-aged man wearing a dark brown hat, a suit with a dark blue coat and blue shirt is standing in a gallery. There are paintings displayed on both sides of the gallery, and there is a row of spotlights on the left side of the ceiling. What color is the suit the middle-aged man is wearing?",
"question_wo_referring_query": "What color is the suit the middle-aged man is wearing?",
"candidates": [
"Brown",
"Blue",
"Gray",
"Black",
"Green"
],
"correct_choice": 2,
"position": [
4325
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "OmhVj_-cfH0_0",
"video_path": "OmhVj_-cfH0.mp4",
"subtitle_path": "OmhVj_-cfH0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 204.74,
"view_count": 4846
},
{
"video_id": "OmhVj_-cfH0",
"question": "On the screen, there is an elderly man wearing a dark brown hat, dressed in a brown suit, with a dark blue sweater and a blue shirt underneath, walking in the gallery. On the right side of the gallery, there are artworks displayed, among which a painting shows a woman in a green dress sitting on a sofa. Can you tell what color the sofa is?",
"question_wo_referring_query": "Can you tell what color the sofa is?",
"candidates": [
"red",
"black",
"pink",
"green",
"white"
],
"correct_choice": 2,
"position": [
420
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "OmhVj_-cfH0_1",
"video_path": "OmhVj_-cfH0.mp4",
"subtitle_path": "OmhVj_-cfH0_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 204.74,
"view_count": 4846
},
{
"video_id": "t3P11ENBZyc",
"question": "In the scene, three officers are dressed in dark green military uniforms with caps. The officer on the left is smoking with his right hand, and there is an eagle insignia on his cap. A rifle is placed in front of the window on the left. The officer in the middle stands with a rifle on his back, while the officer on the right is seated. When the scene changes to show four officers in a frontline bunker exchanging fire, the officer on the left is disguised with grass on his helmet and is aiming a rifle. The seated officer on the right has facial injuries. To his right, there is an officer with a steel helmet firing a rifle, and they are positioned behind a barricade. The standing officer with the eagle insignia is present. What change has this officer undergone?",
"question_wo_referring_query": "What change has this officer undergone?",
"candidates": [
"He is standing",
"He is holding water",
"There are some wounds on his face",
"He is standing straight",
"He is holding a wooden stick"
],
"correct_choice": 2,
"position": [
3056,
8279
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "t3P11ENBZyc_0",
"video_path": "t3P11ENBZyc.mp4",
"subtitle_path": "t3P11ENBZyc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 489.42,
"view_count": 4274368
},
{
"video_id": "t3P11ENBZyc",
"question": "In the scene, an officer wearing a yellow-green uniform is climbing up a wall. He is wearing a steel helmet, has a gun on his back, and is carrying a black item. The scene then transitions to a military doctor holding a syringe and administering an injection to a soldier on the ground, surrounded by fallen soldiers. What changes have occurred to the soldier on the ground?",
"question_wo_referring_query": "What changes have occurred to the soldier on the ground?",
"candidates": [
"His clothes have changed to blue and green.",
"He is drinking water.",
"He is bleeding.",
"He is sleeping.",
"His helmet is gone."
],
"correct_choice": 2,
"position": [
2950,
7746
],
"topic_category": "KH-Knowledge-History",
"question_category": "SAA",
"level": "L2-Relation",
"id": "t3P11ENBZyc_1",
"video_path": "t3P11ENBZyc.mp4",
"subtitle_path": "t3P11ENBZyc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 489.42,
"view_count": 4274368
},
{
"video_id": "F4bDyyEO4PU",
"question": "Which of the following sequences is correct?",
"question_wo_referring_query": "Which of the following sequences is correct?",
"candidates": [
"First, there is a food demonstration, followed by a cooking lesson including ingredient preparation, ingredient processing, and dish preparation, and then a presentation of the final cooked dishes.",
"First, there is a cooking lesson including ingredient preparation, ingredient processing, and dish preparation, followed by a presentation of the final cooked dishes, and then a food demonstration.",
"First, there is a food demonstration, followed by a presentation of the final cooked dishes, and then a cooking lesson including ingredient preparation, ingredient processing, and dish preparation.",
"First, there is a cooking lesson including ingredient preparation, ingredient processing, and dish preparation, followed by a food demonstration, and then a presentation of the final cooked dishes.",
"First, there is a presentation of the final cooked dishes, followed by a food demonstration, and then a cooking lesson including ingredient preparation, ingredient processing, and dish preparation."
],
"correct_choice": 0,
"position": [
3,
357,
10830
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SSS",
"level": "L2-Relation",
"id": "F4bDyyEO4PU_0",
"video_path": "F4bDyyEO4PU.mp4",
"subtitle_path": "F4bDyyEO4PU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 481.3,
"view_count": 235
},
{
"video_id": "F4bDyyEO4PU",
"question": "In the cooking tutorial, which of the following sequences of steps is correct?",
"question_wo_referring_query": "In the cooking tutorial, which of the following sequences of steps is correct?",
"candidates": [
"First, demonstrate how to use purple sweet potatoes and red peppers to make a side dish; then demonstrate how to press the beef patties into cake shapes, coat them with flour, and fry them in a pan; finally, demonstrate how to use beef, green onions, and related seasonings to make beef patties.",
"First, demonstrate how to use beef, green onions, and related seasonings to make beef patties; then demonstrate how to use purple sweet potatoes and red peppers to make a side dish; finally, demonstrate how to press the beef patties into cake shapes, coat them with flour, and fry them in a pan.",
"First, demonstrate how to press the beef patties into cake shapes, coat them with flour, and fry them in a pan; then demonstrate how to use beef, green onions, and related seasonings to make beef patties; finally, demonstrate how to use purple sweet potatoes and red peppers to make a side dish.",
"First, demonstrate how to use purple sweet potatoes and red peppers to make a side dish; then demonstrate how to use beef, green onions, and related seasonings to make beef patties; finally, demonstrate how to press the beef patties into cake shapes, coat them with flour, and fry them in a pan.",
"First, demonstrate how to use beef, green onions, and related seasonings to make beef patties; then demonstrate how to press the beef patties into cake shapes, coat them with flour, and fry them in a pan; finally, demonstrate how to use purple sweet potatoes and red peppers to make a side dish."
],
"correct_choice": 4,
"position": [
1955,
5238,
7414
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "SSS",
"level": "L2-Relation",
"id": "F4bDyyEO4PU_1",
"video_path": "F4bDyyEO4PU.mp4",
"subtitle_path": "F4bDyyEO4PU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 481.3,
"view_count": 235
},
{
"video_id": "-Xg2-SUq6wo",
"question": "A man appears in a studio. The man is wearing a black suit and a white shirt. He has glasses on and his hands are crossed. Behind the man on the screen is a nighttime city view. When the subtitle 'direction rather than total loss of' appears, what object is present in the studio?",
"question_wo_referring_query": "What object is present in the studio?",
"candidates": [
"watch",
"doll",
"earrings",
"potted plant",
"necklace"
],
"correct_choice": 0,
"position": [
7665
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "-Xg2-SUq6wo_0",
"video_path": "-Xg2-SUq6wo.mp4",
"subtitle_path": "-Xg2-SUq6wo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 405.84,
"view_count": 25510
},
{
"video_id": "-Xg2-SUq6wo",
"question": "A woman appears in the broadcast room, wearing a dark top and a watch. She is holding a pen in one hand, drawing something. The screen behind her shows a nighttime cityscape. When the subtitle 'investigations can kick off properly but' appears, what object is present in the broadcast room?",
"question_wo_referring_query": "What object is present in the broadcast room?",
"candidates": [
"necklace",
"plant",
"doll",
"microphone",
"desk lamp"
],
"correct_choice": 0,
"position": [
7831
],
"topic_category": "NP-News-Programs",
"question_category": "T2O",
"level": "L1-Perception",
"id": "-Xg2-SUq6wo_1",
"video_path": "-Xg2-SUq6wo.mp4",
"subtitle_path": "-Xg2-SUq6wo_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 405.84,
"view_count": 25510
},
{
"video_id": "Ip9DbdOtqF4",
"question": "A little girl appears in front of a wall. She is holding two whips and is dressed in a light-colored top and dark-colored trousers. There is a piece of white paper hanging on the wall in front of her. What did the little girl do the first time she appeared on the scene?",
"question_wo_referring_query": "What did the little girl do the first time she appeared on the scene?",
"candidates": [
"Grabbed the white paper",
"Raised both hands",
"Used a tool to imprint a design on the white paper",
"Bent down and bowed",
"Waved her hands left and right"
],
"correct_choice": 2,
"position": [
4800
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Ip9DbdOtqF4_0",
"video_path": "Ip9DbdOtqF4.mp4",
"subtitle_path": "Ip9DbdOtqF4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 249.67,
"view_count": 1679
},
{
"video_id": "Ip9DbdOtqF4",
"question": "A group of women are sitting in front of a white table, where they are engaging in handmade crafts. There are colored paper and tools scattered on the table. A woman wearing glasses is sitting on the right side of the table. She has short black hair and is wearing earrings. Outside the table, a grey-haired woman in a black top is observing. What did the woman with short hair and glasses sitting at the table do the first time she appeared?",
"question_wo_referring_query": "What did the woman with short hair and glasses sitting at the table do the first time she appeared?",
"candidates": [
"Waved her hands left and right",
"Held her face with both hands",
"Supported her forehead with one hand",
"Cut materials with scissors",
"Held her face with one hand"
],
"correct_choice": 3,
"position": [
1160
],
"topic_category": "KA-Knowledge-Art",
"question_category": "O2E",
"level": "L1-Perception",
"id": "Ip9DbdOtqF4_1",
"video_path": "Ip9DbdOtqF4.mp4",
"subtitle_path": "Ip9DbdOtqF4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 249.67,
"view_count": 1679
},
{
"video_id": "gAgCnu82RHE",
"question": "Two pieces of green land appear in the blue background. In the blue background, there are sea and land. A cloud of black smoke appears on the screen. What object emitted this black smoke?",
"question_wo_referring_query": "What object emitted this black smoke?",
"candidates": [
"An airplane in the blue background",
"A cargo ship in the blue background",
"A waste dump in the blue background",
"A volcano on the green land",
"A volcano in the blue background"
],
"correct_choice": 3,
"position": [
2186
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "gAgCnu82RHE_0",
"video_path": "gAgCnu82RHE.mp4",
"subtitle_path": "gAgCnu82RHE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 389.02,
"view_count": 169437
},
{
"video_id": "gAgCnu82RHE",
"question": "In the blue sky, white clouds are floating, a road appears near the blue sea, on the left side of the road is a green forest, and trees are planted on both sides of the road. What is moving on the road?",
"question_wo_referring_query": ", what is moving on the road?",
"candidates": [
"Hunting Dog",
"Helicopter",
"Pedestrian",
"Car",
"Seabird"
],
"correct_choice": 3,
"position": [
190
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "E2O",
"level": "L1-Perception",
"id": "gAgCnu82RHE_1",
"video_path": "gAgCnu82RHE.mp4",
"subtitle_path": "gAgCnu82RHE_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 389.02,
"view_count": 169437
},
{
"video_id": "sKvvuo9Yxqk",
"question": "In a scene with many yellow pages as the background, there is a blue-gray map. What changes occurred to the blue-gray map when the subtitle \"subcontinent\" appeared?",
"question_wo_referring_query": "What changes occurred to the blue-gray map?",
"candidates": [
"The entire map changed from blue-gray to red",
"The entire map changed from blue-gray to green",
"Different colors filled in the regions",
"The entire map changed from blue-gray to white",
"The entire map changed from blue-gray to purple"
],
"correct_choice": 2,
"position": [
258,
352
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "sKvvuo9Yxqk_0",
"video_path": "sKvvuo9Yxqk.mp4",
"subtitle_path": "sKvvuo9Yxqk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 582.28,
"view_count": 11134
},
{
"video_id": "sKvvuo9Yxqk",
"question": "On a white background, there is an image of a golden pyramid. The top of the golden pyramid is blue, and there is a red circle around the blue top. When the subtitle 'finally at the bottom were the shudras' appears, what change occurs to the red circle on the golden pyramid?",
"question_wo_referring_query": "What change occurs to the red circle on the golden pyramid?",
"candidates": [
"It changes from appearing at the top of the golden pyramid to appearing at the bottom of the golden pyramid.",
"The red circle changes to a purple circle.",
"The red circle changes to a green circle.",
"It changes from appearing at the top of the golden pyramid to appearing in the middle of the golden pyramid.",
"The red circle changes to a blue circle."
],
"correct_choice": 0,
"position": [
4997,
5337
],
"topic_category": "KH-Knowledge-History",
"question_category": "TAA",
"level": "L2-Relation",
"id": "sKvvuo9Yxqk_1",
"video_path": "sKvvuo9Yxqk.mp4",
"subtitle_path": "sKvvuo9Yxqk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 582.28,
"view_count": 11134
},
{
"video_id": "CdTijM0_es4",
"question": "Under the blue sky and white clouds, there is an endless stretch of mountain ranges. In front of the mountain ranges, there are some brown tents with two soldiers standing beside them. The soldiers are wearing gray helmets and holding crescent-shaped shields. In which of the following scenes have soldiers wearing gray helmets appeared?",
"question_wo_referring_query": "In which of the following scenes have soldiers wearing gray helmets appeared?",
"candidates": [
"Inside a dense forest",
"In a desert with swirling sandstorms",
"On a high mountain covered with white snow",
"In a scene with an olive tree and numerous arrows flying in the sky",
"On a grassland during rain"
],
"correct_choice": 3,
"position": [
930,
1110
],
"topic_category": "KH-Knowledge-History",
"question_category": "SOS",
"level": "L2-Relation",
"id": "CdTijM0_es4_0",
"video_path": "CdTijM0_es4.mp4",
"subtitle_path": "CdTijM0_es4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 240.57,
"view_count": 3466
},
{
"video_id": "CdTijM0_es4",
"question": "On a green meadow, in the distance there are gray mountains. In front of the mountains stands a group of soldiers wearing gray helmets and a group wearing dark blue helmets. In which of the following scenes do the soldiers wearing dark blue helmets appear?",
"question_wo_referring_query": "In which of the following scenes do the soldiers wearing dark blue helmets appear?",
"candidates": [
"In a desert with no vegetation",
"On a snow-covered grassland",
"In a forest during the rain",
"On a plain during a thunderstorm",
"In a scene where soldiers wearing yellow helmets are riding a blackish brown horse"
],
"correct_choice": 4,
"position": [
2623,
4842
],
"topic_category": "KH-Knowledge-History",
"question_category": "SOS",
"level": "L2-Relation",
"id": "CdTijM0_es4_1",
"video_path": "CdTijM0_es4.mp4",
"subtitle_path": "CdTijM0_es4_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 240.57,
"view_count": 3466
},
{
"video_id": "YehqA9xoGTY",
"question": "In a scene with a vague background, there is a man with short black hair wearing a dark blue suit. What was the man in the dark blue suit doing when he first appeared?",
"question_wo_referring_query": "What was the man in the dark blue suit doing when he first appeared?",
"candidates": [
"Talking to the camera",
"Smiling and clapping",
"Supporting his chin with his left hand",
"Supporting his forehead with his left hand",
"Crossing his arms in front of his chest"
],
"correct_choice": 0,
"position": [
39
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "YehqA9xoGTY_0",
"video_path": "YehqA9xoGTY.mp4",
"subtitle_path": "YehqA9xoGTY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 264.0,
"view_count": 18941
},
{
"video_id": "YehqA9xoGTY",
"question": "In a room with blue lighting, there is a white table. Next to the table, there is a woman wearing a black top. What is the woman wearing the black top doing when she first appears?",
"question_wo_referring_query": "What is the woman wearing the black top doing when she first appears?",
"candidates": [
"Fixing her hair",
"Sitting facing the mirror with both hands on the white table",
"Clapping happily",
"Supporting her head with her right hand",
"Supporting her head with her left hand"
],
"correct_choice": 1,
"position": [
118
],
"topic_category": "NP-News-Programs",
"question_category": "O2E",
"level": "L1-Perception",
"id": "YehqA9xoGTY_1",
"video_path": "YehqA9xoGTY.mp4",
"subtitle_path": "YehqA9xoGTY_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 264.0,
"view_count": 18941
},
{
"video_id": "Bjymxow3TVQ",
"question": "In a room with a large control panel, there is a man with short hair and glasses. Under the man's hand is a cream-colored paper. What style of clothing is the man with glasses wearing?",
"question_wo_referring_query": "What style of clothing is the man with glasses wearing?",
"candidates": [
"Black hooded jacket",
"Pink plaid shirt",
"Blue hooded jacket",
"Black suit",
"White T-shirt"
],
"correct_choice": 1,
"position": [
1175
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Bjymxow3TVQ_0",
"video_path": "Bjymxow3TVQ.mp4",
"subtitle_path": "Bjymxow3TVQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 411.11,
"view_count": 115701
},
{
"video_id": "Bjymxow3TVQ",
"question": "In a room with a white table, there is a man wearing black pants and glasses standing. In front of the man, there is a gray wooden board. What hairstyle does the man with glasses have?",
"question_wo_referring_query": "What hairstyle does the man with glasses have?",
"candidates": [
"Brown short hair",
"Brown shoulder-length curls",
"Blue short hair",
"Black shoulder-length curls",
"Black buzz cut"
],
"correct_choice": 0,
"position": [
6629
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2A",
"level": "L1-Perception",
"id": "Bjymxow3TVQ_1",
"video_path": "Bjymxow3TVQ.mp4",
"subtitle_path": "Bjymxow3TVQ_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 411.11,
"view_count": 115701
},
{
"video_id": "by3NxI0dA6w",
"question": "In a scene with a grayish-purple background, there are white letters spelling 'NATIVE AMERICA'. Next to the white letters, there is a handicraft. When the white letters 'NATIVE AMERICA' appear on a white wall, what color change occurs to the letters?",
"question_wo_referring_query": "When the letters appear on a white wall, what color change occurs to the letters?",
"candidates": [
"From white to blackish-gray",
"From white to yellow",
"From white to purple",
"From white to blue",
"From white to green"
],
"correct_choice": 0,
"position": [
176,
219
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SAA",
"level": "L2-Relation",
"id": "by3NxI0dA6w_0",
"video_path": "by3NxI0dA6w.mp4",
"subtitle_path": "by3NxI0dA6w_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 200.32999999999998,
"view_count": 17670
},
{
"video_id": "by3NxI0dA6w",
"question": "Within the white frame, there is a painting featuring eight people. The first person from the left is facing the right side of the painting, while the remaining seven people are facing the left side of the painting. When the scene cuts to a close-up of the current painting, what change occurs to the person wearing yellow in the painting?",
"question_wo_referring_query": "What change occurs to the person wearing yellow in the painting?",
"candidates": [
"The character is enlarged",
"The character is shrunk",
"The character becomes blurred",
"Only the upper half of the character remains"
],
"correct_choice": 0,
"position": [
3608,
3678
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SAA",
"level": "L2-Relation",
"id": "by3NxI0dA6w_1",
"video_path": "by3NxI0dA6w.mp4",
"subtitle_path": "by3NxI0dA6w_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 200.32999999999998,
"view_count": 17670
},
{
"video_id": "3hyPwjkdHEA",
"question": "In a room with a wooden surface, a woman in a yellow floral dress is standing in front of the wooden surface, holding a white piece of paper. Which of the following captions have appeared together with the woman in the yellow floral dress?",
"question_wo_referring_query": "Which of the following captions have appeared together with the woman in the yellow floral dress?",
"candidates": [
"\"salutations welcome to the cottage fairy\"",
"\"the process it's quite simple\"",
"\"i want to send out peaceful thoughts and\"",
"\"today i will be making my own paper and\"",
"\"wanted to give you some insight into\""
],
"correct_choice": 2,
"position": [
518,
7876,
107,
172,
226,
303
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "3hyPwjkdHEA_0",
"video_path": "3hyPwjkdHEA.mp4",
"subtitle_path": "3hyPwjkdHEA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 455.11,
"view_count": 441690
},
{
"video_id": "3hyPwjkdHEA",
"question": "On a ground covered with dry grass, there is a woman standing in a yellow floral dress. Behind the woman is a black dog. The woman is holding a hat in her hand. Which of the following subtitles appeared along with the black dog following the woman?",
"question_wo_referring_query": "which of the following subtitles appeared along with the black dog following the woman?",
"candidates": [
"\"you can give yourself than a homemade\"",
"\"of care\"",
"\"meal\"",
"\"after a long day i find no greater gift\"",
"\"and faith that anything is possible and\""
],
"correct_choice": 4,
"position": [
7907,
7954,
9568,
9590,
9637,
9663
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "TOS",
"level": "L2-Relation",
"id": "3hyPwjkdHEA_1",
"video_path": "3hyPwjkdHEA.mp4",
"subtitle_path": "3hyPwjkdHEA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 455.11,
"view_count": 441690
},
{
"video_id": "jdbG9gmg_SA",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, there's a scene on a flat road with many cars parked on either side; then, a scene with a man dressed in blue sitting in the front passenger seat of a car; finally, a scene where a woman dressed in white is close to a man dressed in blue.",
"First, there's a scene where a woman dressed in white is close to a man dressed in blue; then, a scene on a flat road with many cars parked on either side; finally, a scene with a man dressed in blue sitting in the front passenger seat of a car.",
"First, there's a scene with a man dressed in blue sitting in the front passenger seat of a car; then, a scene on a flat road with many cars parked on either side; finally, a scene where a woman dressed in white is close to a man dressed in blue.",
"First, there's a scene on a flat road with many cars parked on either side; then, a scene where a woman dressed in white is close to a man dressed in blue; finally, a scene with a man dressed in blue sitting in the front passenger seat of a car.",
"First, there's a scene where a woman dressed in white is close to a man dressed in blue; then, a scene with a man dressed in blue sitting in the front passenger seat of a car; finally, a scene on a flat road with many cars parked on either side."
],
"correct_choice": 1,
"position": [
91,
359,
945
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "jdbG9gmg_SA_0",
"video_path": "jdbG9gmg_SA.mp4",
"subtitle_path": "jdbG9gmg_SA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 272.13,
"view_count": 6718
},
{
"video_id": "jdbG9gmg_SA",
"question": "Which of the following sequence of scenes is correct?",
"question_wo_referring_query": "Which of the following sequence of scenes is correct?",
"candidates": [
"First, it's near dusk with many people raising their hands and cheering wildly; then, it's in sunny weather with many people queuing; lastly, it's in the evening with many people camping.",
"First, it's in the evening with many people camping; then, it's near dusk with many people raising their hands and cheering wildly; lastly, it's in sunny weather with many people queuing.",
"First, it's in the evening with many people camping; then, it's in sunny weather with many people queuing; lastly, it's near dusk with many people raising their hands and cheering wildly.",
"First, it's in sunny weather with many people queuing; then, it's in the evening with many people camping; lastly, it's near dusk with many people raising their hands and cheering wildly.",
"First, it's in sunny weather with many people queuing; then, it's near dusk with many people raising their hands and cheering wildly; lastly, it's in the evening with many people camping."
],
"correct_choice": 2,
"position": [
1918,
2517,
5469
],
"topic_category": "LT-Lifestyle-Travel-Guides",
"question_category": "SSS",
"level": "L2-Relation",
"id": "jdbG9gmg_SA_1",
"video_path": "jdbG9gmg_SA.mp4",
"subtitle_path": "jdbG9gmg_SA_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 272.13,
"view_count": 6718
},
{
"video_id": "UD5ifzOPzhc",
"question": "In the white background at the top that has black English text 'Classifiers: Nearest neighbor', in the bottom right corner, there is a man wearing glasses and a suit. The screen also has some blue rectangular prints, and when the subtitle says 'category just assign that blue category', what color is the circle print on the screen?",
"question_wo_referring_query": "What color is the circle print on the screen?",
"candidates": [
"green",
"yellow",
"purple",
"blue",
"red"
],
"correct_choice": 4,
"position": [
5373
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2A",
"level": "L1-Perception",
"id": "UD5ifzOPzhc_0",
"video_path": "UD5ifzOPzhc.mp4",
"subtitle_path": "UD5ifzOPzhc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 564.32,
"view_count": 135
},
{
"video_id": "UD5ifzOPzhc",
"question": "In the white background at the top with black English letters 'K-nearest neighbor' written on it, there is a man wearing glasses and a suit in the bottom right corner. The screen has some red error symbols. When the subtitles say 'okay because these might be some noise,' what is the shape of the green graphic on the screen?",
"question_wo_referring_query": "What is the shape of the green graphic on the screen?",
"candidates": [
"rectangle",
"circle",
"triangle",
"star",
"square"
],
"correct_choice": 1,
"position": [
10697
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "T2A",
"level": "L1-Perception",
"id": "UD5ifzOPzhc_1",
"video_path": "UD5ifzOPzhc.mp4",
"subtitle_path": "UD5ifzOPzhc_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 564.32,
"view_count": 135
},
{
"video_id": "soNQYQXrx_A",
"question": "In the scene where a woman wearing a white top is standing on barren ground, the subtitles say 'Hatch Flood freaks out and leaves, but Justin manages to tell him one last time where he' \u2014 during this time, what change occurs to the woman's clothing?",
"question_wo_referring_query": "What change occurs to this woman's clothing?",
"candidates": [
"She changes into an olive-colored coat",
"She changes into a blue coat",
"She changes into a black coat",
"She changes into a yellow floral dress",
"She changes into a white sweater"
],
"correct_choice": 0,
"position": [
2496,
4398
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TAA",
"level": "L2-Relation",
"id": "soNQYQXrx_A_0",
"video_path": "soNQYQXrx_A.mp4",
"subtitle_path": "soNQYQXrx_A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 563.37,
"view_count": 93803
},
{
"video_id": "soNQYQXrx_A",
"question": "A man dressed in a black suit and white shirt with a tie is floating on the surface of the water. What happens to his clothes when the subtitle says 'driven out of the world'?",
"question_wo_referring_query": "What happens to this man's clothes?",
"candidates": [
"He puts on a denim jacket.",
"He puts on a white tank top.",
"He puts on a black bathrobe.",
"He puts on a green short-sleeve shirt.",
"He is shirtless."
],
"correct_choice": 4,
"position": [
4030,
7571
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "TAA",
"level": "L2-Relation",
"id": "soNQYQXrx_A_1",
"video_path": "soNQYQXrx_A.mp4",
"subtitle_path": "soNQYQXrx_A_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 563.37,
"view_count": 93803
},
{
"video_id": "vpKtHB8x0js",
"question": "On a piece of grass, a woman wearing a blue top and a hat has her hands in her pockets. Next to her stands a man. Before the subtitle reads 'Liam, who is also being treated in the same ward. It turns out his damaged lungs have still not', what is this woman doing?",
"question_wo_referring_query": "What is this woman doing?",
"candidates": [
"Curling her eyelashes in front of a mirror",
"Applying blush in front of a mirror",
"Putting on lipstick in front of a mirror",
"Drawing her eyebrows in front of a mirror",
"Combing her hair in front of a mirror"
],
"correct_choice": 2,
"position": [
5360,
5302
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "vpKtHB8x0js_0",
"video_path": "vpKtHB8x0js.mp4",
"subtitle_path": "vpKtHB8x0js_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 564.3,
"view_count": 178690
},
{
"video_id": "vpKtHB8x0js",
"question": "Before the subtitle says, 'she quickly hides inside a dead tree with James. The latter wants her to return to,' what does the woman in a blue top do?",
"question_wo_referring_query": "What does the woman do?",
"candidates": [
"She is swimming in the water",
"She assists a person in walking",
"She is combing her hair",
"She wakes up a man lying on the ground",
"She is applying lipstick"
],
"correct_choice": 1,
"position": [
8463,
8432
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T3E",
"level": "L2-Relation",
"id": "vpKtHB8x0js_1",
"video_path": "vpKtHB8x0js.mp4",
"subtitle_path": "vpKtHB8x0js_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 564.3,
"view_count": 178690
},
{
"video_id": "JhlzvoqKOc8",
"question": "By the riverbank, a woman in a pink backpack and black pants says in the subtitles, 'have intercourse. However, she changes her mind at the last moment and returns home.' What action did she take?",
"question_wo_referring_query": "What action did she take?",
"candidates": [
"Picked up the white clothes",
"Threw the white clothes into the fire",
"Threw the white clothes into the river",
"Threw the white clothes on the ground",
"Put the white clothes on"
],
"correct_choice": 3,
"position": [
11105
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2E",
"level": "L1-Perception",
"id": "JhlzvoqKOc8_0",
"video_path": "JhlzvoqKOc8.mp4",
"subtitle_path": "JhlzvoqKOc8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 592.29,
"view_count": 521764
},
{
"video_id": "JhlzvoqKOc8",
"question": "Two people are standing on a grass field. On the far left is a man wearing a denim jacket, and on the far right is a blonde woman in a white top. When the subtitles say 'As the days pass, the group forgets all their miseries and has the best time of their lives,' what action does this woman perform?",
"question_wo_referring_query": "What action does this woman perform?",
"candidates": [
"She picks up a gun",
"She picks up a shield",
"She picks up a flower",
"She picks up a dagger",
"She picks up a sword"
],
"correct_choice": 0,
"position": [
7863
],
"topic_category": "Recreational: MR-Movie-Recaps",
"question_category": "T2E",
"level": "L1-Perception",
"id": "JhlzvoqKOc8_1",
"video_path": "JhlzvoqKOc8.mp4",
"subtitle_path": "JhlzvoqKOc8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 592.29,
"view_count": 521764
},
{
"video_id": "mbcvVYobCXI",
"question": "Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "Which of the following sequences of scenes is correct?",
"candidates": [
"First, a black egg-shaped stone on the left and a white egg-shaped stone on the right, then a pile of seaweed on the beach, and finally, two white egg-shaped stones on the beach.",
"First, two white egg-shaped stones on the beach, then a black egg-shaped stone on the left and a white egg-shaped stone on the right, and finally, a pile of seaweed on the beach.",
"First, a pile of seaweed on the beach, then a black egg-shaped stone on the left and a white egg-shaped stone on the right, and finally, two white egg-shaped stones on the beach.",
"First, two white egg-shaped stones on the beach, then a pile of seaweed on the beach, and finally, a black egg-shaped stone on the left and a white egg-shaped stone on the right.",
"First, a pile of seaweed on the beach, then two white egg-shaped stones on the beach, and finally, a black egg-shaped stone on the left and a white egg-shaped stone on the right."
],
"correct_choice": 2,
"position": [
574,
884,
1426
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "mbcvVYobCXI_0",
"video_path": "mbcvVYobCXI.mp4",
"subtitle_path": "mbcvVYobCXI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 192.09,
"view_count": 6868
},
{
"video_id": "mbcvVYobCXI",
"question": "Which of the following sequences of events is correct?",
"question_wo_referring_query": "Which of the following sequences of events is correct?",
"candidates": [
"First, on the left side of a wooden board, there is a white shell, and on the right side, there is a black shell. Then, on the ground, there is a snail shell placed on an upright branch. Finally, on the ground, there is a black snail shell.",
"First, on the ground, there is a black snail shell. Then, on the left side of a wooden board, there is a white shell, and on the right side, there is a black shell. Finally, on the ground, there is a snail shell placed on an upright branch.",
"First, on the ground, there is a snail shell placed on an upright branch. Then, on the ground, there is a black snail shell. Finally, on the left side of a wooden board, there is a white shell, and on the right side, there is a black shell.",
"First, on the left side of a wooden board, there is a white shell, and on the right side, there is a black shell. Then, on the ground, there is a black snail shell. Finally, on the ground, there is a snail shell placed on an upright branch.",
"First, on the ground, there is a black snail shell. Then, on the ground, there is a snail shell placed on an upright branch. Finally, on the left side of a wooden board, there is a white shell, and on the right side, there is a black shell."
],
"correct_choice": 3,
"position": [
2835,
3192,
3481
],
"topic_category": "KA-Knowledge-Art",
"question_category": "SSS",
"level": "L2-Relation",
"id": "mbcvVYobCXI_1",
"video_path": "mbcvVYobCXI.mp4",
"subtitle_path": "mbcvVYobCXI_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 192.09,
"view_count": 6868
},
{
"video_id": "mFcEWmtn3ag",
"question": "This video utilizes pictures, illustrations, and text explanations to narrate many historical stories and depict numerous scenes from those times. Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "This video uses pictures, illustrations, and text explanations to tell many historical stories and depict various scenes from those times. Which of the following sequences of scenes is correct?",
"candidates": [
"First, there is a giant mail steamship on the vast ocean, followed by a splendidly dressed squire holding a cane standing in front of a fireplace in a luxurious room, and lastly, a man in a suit and tie answering a phone call.",
"First, there is a giant mail steamship on the vast ocean, followed by a man in a suit and tie answering a phone call, and lastly, a splendidly dressed squire holding a cane standing in front of a fireplace in a luxurious room.",
"First, a splendidly dressed squire holding a cane stands in front of a fireplace in a luxurious room, followed by a man in a suit and tie answering a phone call, and lastly, there is a giant mail steamship on the vast ocean.",
"First, a man in a suit and tie is answering a phone call, followed by a splendidly dressed squire holding a cane standing in front of a fireplace in a luxurious room, and lastly, there is a giant mail steamship on the vast ocean."
],
"correct_choice": 3,
"position": [
62,
1219,
7377
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "mFcEWmtn3ag_0",
"video_path": "mFcEWmtn3ag.mp4",
"subtitle_path": "mFcEWmtn3ag_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 523.26,
"view_count": 489106
},
{
"video_id": "mFcEWmtn3ag",
"question": "This video uses images, illustrations, and text explanations to tell many historical stories, depicting numerous scenes from those times. Which of the following sequences of scenes is correct?",
"question_wo_referring_query": "This video uses images, illustrations, and text explanations to narrate many historical stories, depicting numerous scenes from those times. Which of the following sequences of scenes is correct?",
"candidates": [
"First, an animation of two men wearing top hats having a conversation is played, followed by a picture of a man with a mustache in a black frame, and finally, a segment of a man in a suit making a phone call in black-and-white footage is shown.",
"First, a segment of a man in a suit using a notebook in black-and-white footage is shown, followed by an animation of two men wearing top hats having a conversation, and finally, a painting of a man with a mustache in a black frame appears.",
"First, a picture of a man with a mustache in a black frame appears, followed by a segment of a man in a suit making a phone call in black-and-white footage, and finally, an animation of two men wearing top hats having a conversation is played.",
"First, an animation of two men wearing top hats having a conversation is played, followed by a segment of black-and-white footage, and finally, a painting of a man with a mustache in a black frame appears."
],
"correct_choice": 1,
"position": [
44,
182,
477
],
"topic_category": "KH-Knowledge-History",
"question_category": "SSS",
"level": "L2-Relation",
"id": "mFcEWmtn3ag_1",
"video_path": "mFcEWmtn3ag.mp4",
"subtitle_path": "mFcEWmtn3ag_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 523.26,
"view_count": 489106
},
{
"video_id": "Pm93D8CVlY8",
"question": "The background is a website page with icons of a supermarket, a shopping cart, as well as labels for price and source keywords. A man wearing sunglasses is talking to the microphone. What happened on the screen after the man lowered his right hand?",
"question_wo_referring_query": "What happened on the screen after the man lowered his right hand?",
"candidates": [
"The mouse cursor moved to the English word 'Pranksy' and stopped.",
"The mouse cursor moved to the upper right corner and then stopped.",
"The mouse cursor slid to the number '0.9' and painted over it.",
"The mouse cursor slid to the number '78.88' and painted over it.",
"The mouse cursor moved to the upper left corner and then stopped."
],
"correct_choice": 3,
"position": [
5119
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2E",
"level": "L1-Perception",
"id": "Pm93D8CVlY8_0",
"video_path": "Pm93D8CVlY8.mp4",
"subtitle_path": "Pm93D8CVlY8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 247,
"duration": 244.0,
"view_count": 46252
},
{
"video_id": "Pm93D8CVlY8",
"question": "The background is a web interface. In the background, the only moving object is a monkey wearing a red hat. Above the monkey, there is a sign with '9,999 items'. What is the man wearing black sunglasses and standing to the right of the screen doing?",
"question_wo_referring_query": "What is the man wearing black sunglasses and standing to the right of the screen doing?",
"candidates": [
"Pointing at the screen behind him with his left hand and talking",
"Holding a microphone with his right hand and talking",
"Turning around and pointing at the screen behind him",
"Pointing at the screen behind him with his right hand"
],
"correct_choice": 0,
"position": [
3440
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "S2E",
"level": "L1-Perception",
"id": "Pm93D8CVlY8_1",
"video_path": "Pm93D8CVlY8.mp4",
"subtitle_path": "Pm93D8CVlY8_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 247,
"duration": 244.0,
"view_count": 46252
},
{
"video_id": "brZugTJ0odg",
"question": "In a room filled with all kinds of items, with a yellow floor, there is a woman wearing a blue long-sleeved top pushing a cart. After the subtitle 'The St. Mark's Tower is one of Frank Lloyd Wright's earliest designs for a' appears, what does she do?",
"question_wo_referring_query": "What does she do?",
"candidates": [
"Walks around the cart in a circle",
"Speaks into the mirror",
"Brushes something on the cart with a small brush",
"Lifts the bottom of the cart"
],
"correct_choice": 0,
"position": [
294,
406,
487,
987,
1046
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3E",
"level": "L2-Relation",
"id": "brZugTJ0odg_0",
"video_path": "brZugTJ0odg.mp4",
"subtitle_path": "brZugTJ0odg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 260.14,
"view_count": 95521
},
{
"video_id": "brZugTJ0odg",
"question": "On the yellow floor, a person wearing a blue top is pushing a wooden cart filled with various items. After the subtitles 'For the restoration, a lot of it was just cutting acid-free paperboard to the' appear, what does the person in the blue top do?",
"question_wo_referring_query": "What does the person wearing the blue top do?",
"candidates": [
"Opened the pigment tray",
"Wrote with a paintbrush",
"Held a red hammer",
"Used a brush to apply glue to the items"
],
"correct_choice": 0,
"position": [
4545,
4723,
5029,
5343
],
"topic_category": "KA-Knowledge-Art",
"question_category": "T3E",
"level": "L2-Relation",
"id": "brZugTJ0odg_1",
"video_path": "brZugTJ0odg.mp4",
"subtitle_path": "brZugTJ0odg_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 260.14,
"view_count": 95521
},
{
"video_id": "kN88RP3XWUU",
"question": "A man who is not wearing clothes but has a piece of cloth hanging around his waist is carrying a large blue and white ball. The man has bronze-colored skin and black curly hair. Which of the following subtitles has this man appeared with?",
"question_wo_referring_query": "Which of the following subtitles has this man appeared with?",
"candidates": [
"punishments one of the Titans Atlas was",
"it'll come back up in just a second over",
"world looked like this kind of like an O",
"with the T inside here was Europe here"
],
"correct_choice": 0,
"position": [
759,
1063,
1160,
1228
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "kN88RP3XWUU_0",
"video_path": "kN88RP3XWUU.mp4",
"subtitle_path": "kN88RP3XWUU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 313.25,
"view_count": 190668
},
{
"video_id": "kN88RP3XWUU",
"question": "Against a vintage backdrop, there is a black frame within which there is a person with curly hair wearing a wreath made of leaves on their head. In which of the following subtitles did this person appear?",
"question_wo_referring_query": "In which of the following subtitles did this person appear?",
"candidates": [
"the conclusion that Atlas was derived",
"holding up the sky enduring going back",
"attached to what we call the Atlas",
"enduring Duras durable yeah which makes"
],
"correct_choice": 0,
"position": [
5558,
5696,
5801,
5897
],
"topic_category": "KG-Knowledge-Geography",
"question_category": "TOS",
"level": "L2-Relation",
"id": "kN88RP3XWUU_1",
"video_path": "kN88RP3XWUU.mp4",
"subtitle_path": "kN88RP3XWUU_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 313.25,
"view_count": 190668
},
{
"video_id": "to7vCdkLi4s",
"question": "In the screen, on the PPT, there is a red rectangle with the word 'Reinforcement Learning' inside it, and below the rectangle is a segment of English text. Which subtitles have appeared together with it?",
"question_wo_referring_query": "Which subtitles have appeared together with it?",
"candidates": [
"quite grandiose so we'll dive into it",
"better at it that is reinforcement and pictures right you have a database and",
"with augmented data this paper is by",
"plugin like here so this is basically"
],
"correct_choice": 1,
"position": [
1464,
2232
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TOS",
"level": "L2-Relation",
"id": "to7vCdkLi4s_0",
"video_path": "to7vCdkLi4s.mp4",
"subtitle_path": "to7vCdkLi4s_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1334.03,
"view_count": 7041
},
{
"video_id": "to7vCdkLi4s",
"question": "There are 6 pictures on the PPT, and each picture has an English word on it. Which subtitles appear on the screen together with these words?",
"question_wo_referring_query": "Which subtitles appear on the screen together with this screen?",
"candidates": [
"you rotate it random conf means you and of in a sort of way that doesn't mess up",
"frames then you can decide on a on a",
"across the stacked frames so basically",
"sometimes it's beneficial to feed the"
],
"correct_choice": 0,
"position": [
6586,
7056
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TOS",
"level": "L2-Relation",
"id": "to7vCdkLi4s_1",
"video_path": "to7vCdkLi4s.mp4",
"subtitle_path": "to7vCdkLi4s_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1334.03,
"view_count": 7041
},
{
"video_id": "to7vCdkLi4s",
"question": "In the picture, there is an image with colors like red, orange, and blue on a white background. Alongside which subtitles does this screen appear?",
"question_wo_referring_query": "Alongside which subtitles does this screen appear?",
"candidates": [
"whatever 300 points in this Walker if and okay crop is the most the most effective",
"frames then you can decide on a on a\n",
"quite grandiose so we'll dive into it",
"across the stacked frames so basically"
],
"correct_choice": 0,
"position": [
17288,
17585
],
"topic_category": "KC-Knowledge-Computer-Science",
"question_category": "TOS",
"level": "L2-Relation",
"id": "to7vCdkLi4s_2",
"video_path": "to7vCdkLi4s.mp4",
"subtitle_path": "to7vCdkLi4s_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 1334.03,
"view_count": 7041
},
{
"video_id": "GGwHSz9towk",
"question": "The curly-haired man in the blue shirt, standing in front of the stone carving, what is the curly-haired man in the blue shirt doing?",
"question_wo_referring_query": "What is the curly-haired man in the blue shirt doing?",
"candidates": [
"Touching",
"Painting",
"Admiring",
"Carving"
],
"correct_choice": 2,
"position": [
223
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "GGwHSz9towk_0",
"video_path": "GGwHSz9towk.mp4",
"subtitle_path": "GGwHSz9towk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 193.09,
"view_count": 2840
},
{
"video_id": "GGwHSz9towk",
"question": "The man with curly hair wearing a blue shirt, with floral patterns on the right sleeve, is standing next to a transparent display case on his left. What is this man in the blue shirt doing?",
"question_wo_referring_query": "What is the man with curly hair in the blue shirt doing?",
"candidates": [
"Petting",
"Taking a photo",
"Painting",
"Observing"
],
"correct_choice": 3,
"position": [
2533
],
"topic_category": "KA-Knowledge-Art",
"question_category": "S2E",
"level": "L1-Perception",
"id": "GGwHSz9towk_1",
"video_path": "GGwHSz9towk.mp4",
"subtitle_path": "GGwHSz9towk_en.json",
"duration_group": 600,
"starting_timestamp_for_subtitles": 0,
"duration": 193.09,
"view_count": 2840
},
{
"video_id": "_EUDpS9UF9o",
"question": "In the scene, there's a person cutting a green onion with a knife, and in the upper left corner, there's also a screen with burning wood. When the subtitle mentions 'Onion,' what other objects are present in the scene?",
"question_wo_referring_query": "What other objects are present in the scene?",
"candidates": [
"On the table, there's also a silver bowl containing a tomato and a pumpkin.",
"Oven",
"Bread",
"Watch"
],
"correct_choice": 0,
"position": [
20
],
"topic_category": "LC-Lifestyle-Cooking-Recipes",
"question_category": "T2O",
"level": "L1-Perception",
"id": "_EUDpS9UF9o_0",
"video_path": "_EUDpS9UF9o.mp4",
"subtitle_path": "_EUDpS9UF9o_en.json",
"duration_group": 15,
"starting_timestamp_for_subtitles": 1267,
"duration": 9.0,
"view_count": 1629269
},
{
"video_id": "7XWqI121-Q4",
"question": "In a bedroom with two warm-colored table lamps and two paintings hanging on the wall, a short-haired woman wearing an orange outfit is sitting on the bed turning on a computer. What did she do before turning on the computer on the bed?",
"question_wo_referring_query": "What did she do before turning on the computer on the bed?",
"candidates": [
"Changed into a black coat",
"Visited a museum",
"Took a shower",
"Wrote something in front of the computer",
"Put on shoes"
],
"correct_choice": 2,
"position": [
7917,
7861
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "7XWqI121-Q4_0",
"video_path": "7XWqI121-Q4.mp4",
"subtitle_path": "7XWqI121-Q4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 954.46,
"view_count": 263911
},
{
"video_id": "7XWqI121-Q4",
"question": "A person wearing khaki-colored pants with a leather jacket and carrying a black handbag is walking on the sidewalk along the road. After this person walks past, a string of yellow English words appears. What is the first thing that happens after this?",
"question_wo_referring_query": "What is the first thing that happens after this?",
"candidates": [
"The woman opens a laptop on the bed",
"The woman visits a museum",
"The woman sits on a bed in the bedroom and talks",
"The woman stands on the bridge and takes a selfie with one hand making a V sign",
"The woman goes to eat in the dining hall"
],
"correct_choice": 3,
"position": [
5539,
5834
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "7XWqI121-Q4_1",
"video_path": "7XWqI121-Q4.mp4",
"subtitle_path": "7XWqI121-Q4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 954.46,
"view_count": 263911
},
{
"video_id": "7XWqI121-Q4",
"question": "After entering the museum, where did the short-haired woman in a black coat, black long skirt, and black mask visit first?",
"question_wo_referring_query": "After entering the museum, where did the short-haired woman in a black coat, black long skirt, and black mask visit first?",
"candidates": [
"Sculptures",
"Bronze statues",
"Library",
"Wall paintings",
"Paintings"
],
"correct_choice": 0,
"position": [
10148,
10451,
11445
],
"topic_category": "LV-Lifestyle-Life-Vlogs",
"question_category": "E3E",
"level": "L2-Relation",
"id": "7XWqI121-Q4_2",
"video_path": "7XWqI121-Q4.mp4",
"subtitle_path": "7XWqI121-Q4_en.json",
"duration_group": 3600,
"starting_timestamp_for_subtitles": 0,
"duration": 954.46,
"view_count": 263911
}
]
================================================
FILE: lmms-eval_videochat/eval_annotations/MLVU_MC/README.md
================================================
---
license: mit
extra_gated_prompt: >-
You agree to not use the dataset to conduct experiments that cause harm to
human subjects. Please note that the data in this dataset may be subject to
other agreements. Before using the data, be sure to read the relevant
agreements carefully to ensure compliant use. Video copyrights belong to the
original video creators or platforms and are for academic research use only.
task_categories:
- visual-question-answering
extra_gated_fields:
Name: text
Company/Organization: text
Country: text
E-Mail: text
modalities:
- Video
- Text
configs:
- config_name: 1_plotQA
data_files: json/1_plotQA.json
- config_name: 2_needle
data_files: json/2_needle.json
- config_name: 3_ego
data_files: json/3_ego.json
- config_name: 4_count
data_files: json/4_count.json
- config_name: 5_order
data_files: json/5_order.json
- config_name: 6_anomaly_reco
data_files: json/6_anomaly_reco.json
- config_name: 7_topic_reasoning
data_files: json/7_topic_reasoning.json
language:
- en
size_categories:
- 1K<n<10K
"2->1->3->4",
"3->2->1->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_151.mp4",
"duration": 7590.67,
"question": "Arrange the following events from the video in the correct chronological order: (1)The woman starts working on her nails using bottles from a box next to her; (2)The words \"Love Food & Money with Angie Greenup\" appears on screen; (3)Her twitter handle and subscribe screen are shown while she holds her dogs; (4)The woman speaks to the camera from her living room while her dogs play fight behind her.",
"candidates": [
"2->4->1->3",
"2->1->4->3",
"4->2->1->3",
"1->2->4->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_214.mp4",
"duration": 485.77,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"abseiling --> making jewelry --> milking cow --> cleaning toilet",
"making jewelry --> abseiling --> cleaning toilet --> milking cow",
"milking cow --> cleaning toilet --> abseiling --> making jewelry",
"cleaning toilet --> milking cow --> abseiling --> making jewelry"
],
"answer": "abseiling --> making jewelry --> milking cow --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_288.mp4",
"duration": 511.25,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"water sliding --> making jewelry --> abseiling --> javelin throw",
"abseiling --> water sliding --> javelin throw --> making jewelry",
"javelin throw --> water sliding --> abseiling --> making jewelry",
"water sliding --> javelin throw --> abseiling --> making jewelry"
],
"answer": "javelin throw --> water sliding --> abseiling --> making jewelry",
"question_type": "order"
},
{
"video": "order_108.mp4",
"duration": 667.7199999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Water is added to the cup and more limes are squeezed in by hand; (2)The small bowls of salt are arranged and limes are sliced in halves; (3)The cup is stirred with more water and a set of cups filled with the refreshment are seen; (4)The limes are juiced into a cup using a hand held press.",
"candidates": [
"2->4->1->3",
"4->2->1->3",
"2->1->4->3",
"1->2->4->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_302.mp4",
"duration": 490.05999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"javelin throw --> stomping grapes --> baking cookies --> playing trombone",
"playing trombone --> baking cookies --> stomping grapes --> javelin throw",
"javelin throw --> stomping grapes --> playing trombone --> baking cookies",
"stomping grapes --> playing trombone --> baking cookies --> javelin throw"
],
"answer": "stomping grapes --> playing trombone --> baking cookies --> javelin throw",
"question_type": "order"
},
{
"video": "order_133.mp4",
"duration": 1734.55,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man starts to demonstrate playing the bongos in a lesson; (2)LP and Giovanni Logo appear on the black screen opening; (3)The lesson continues, alternating between color and black and white footage; (4)A man sits behind a set of bongo drums.",
"candidates": [
"2->4->1->3",
"4->2->1->3",
"2->1->4->3",
"1->2->3->4"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_221.mp4",
"duration": 637.64,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"jetskiing --> tossing coin --> shredding paper --> zumba",
"zumba --> tossing coin --> jetskiing --> shredding paper",
"jetskiing --> shredding paper --> zumba --> tossing coin",
"zumba --> tossing coin --> shredding paper --> jetskiing"
],
"answer": "jetskiing --> shredding paper --> zumba --> tossing coin",
"question_type": "order"
},
{
"video": "order_295.mp4",
"duration": 519.0899999999999,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"paragliding --> javelin throw --> playing harp --> carving pumpkin",
"javelin throw --> playing harp --> paragliding --> carving pumpkin",
"javelin throw --> playing harp --> carving pumpkin --> paragliding",
"playing harp --> javelin throw --> paragliding --> carving pumpkin"
],
"answer": "playing harp --> javelin throw --> paragliding --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_115.mp4",
"duration": 706.6700000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1)A lady in black reads names a man hands her and passes out prize buckets to the kids; (2)A group of kids is building a moat filled with water around a sand castle; (3)We see an opening title screen; (4)We see kids across the beach working on their castles in the wet sand.",
"candidates": [
"3->2->4->1",
"4->1->2->3",
"2->3->1->4",
"1->4->2->3"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_140.mp4",
"duration": 6191.46,
"question": "Arrange the following events from the video in the correct chronological order: (1)People crash into the bottom of a bridge; (2)People are sitting on a raft going down a river; (3)People are walking across the water and down a trail; (4)People are carrying their raft and get into a van.",
"candidates": [
"1->2->3->4",
"2->1->3->4",
"2->3->1->4",
"1->3->2->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_79.mp4",
"duration": 1225.45,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man melts the wax with the tool and wipes the ski; (2)The man adds a substance from a jug to the ski and wipes it with a paper towel; (3)The man exchanges skis and waxes the second one with the tool; (4)The man then scrapes the wax off the ski and uses a different tools and paper towel on the ski.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_203.mp4",
"duration": 490.01,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing trombone --> milking cow --> stomping grapes --> riding mule",
"stomping grapes --> riding mule --> playing trombone --> milking cow",
"playing trombone --> stomping grapes --> milking cow --> riding mule",
"riding mule --> milking cow --> stomping grapes --> playing trombone"
],
"answer": "riding mule --> milking cow --> stomping grapes --> playing trombone",
"question_type": "order"
},
{
"video": "order_78.mp4",
"duration": 2241.51,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man melts the wax with the tool and wipes the ski; (2)The man adds a substance from a jug to the ski and wipes it with a paper towel; (3)The man exchanges skis and waxes the second one with the tool; (4)The man then scrapes the wax off the ski and uses a different tools and paper towel on the ski.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->3->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_77.mp4",
"duration": 688.89,
"question": "Arrange the following events from the video in the correct chronological order: (1)Woman begins ripping the wrapping paper with her hands; (2)Woman sets the box on top of the wrapping paper and begins wrapping the box; (3)Woman lifts up a box and sets it on the table; (4)Woman grabs a pair of scissors and tape.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"2->1->3->4",
"3->4->1->2"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_76.mp4",
"duration": 1093.34,
"question": "Arrange the following events from the video in the correct chronological order: (1)Woman begins ripping the wrapping paper with her hands; (2)Woman sets the box on top of the wrapping paper and begins wrapping the box; (3)Woman lifts up a box and sets it on the table; (4)Woman grabs a pair of scissors and tape.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"3->4->1->2",
"4->3->2->1"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_75.mp4",
"duration": 688.89,
"question": "Arrange the following events from the video in the correct chronological order: (1)Woman begins ripping the wrapping paper with her hands; (2)Woman sets the box on top of the wrapping paper and begins wrapping the box; (3)Woman lifts up a box and sets it on the table; (4)Woman grabs a pair of scissors and tape.",
"candidates": [
"3->4->1->2",
"4->3->2->1",
"1->2->3->4",
"2->1->3->4"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_74.mp4",
"duration": 684.15,
"question": "Arrange the following events from the video in the correct chronological order: (1)A shirtless man lifts a ball onto one shoulder; (2)A series of tug of war matches are shown; (3)A third man flips a heavy tire; (4)Individuals are shown exercising with weights, kegs, or tires in a parking lot.",
"candidates": [
"1->2->3->4",
"2->1->3->4",
"3->1->2->4",
"1->3->4->2"
],
"answer": "1->3->4->2",
"question_type": "order"
},
{
"video": "order_287.mp4",
"duration": 490.03999999999996,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"zumba --> javelin throw --> riding mule --> water sliding",
"riding mule --> zumba --> javelin throw --> water sliding",
"riding mule --> javelin throw --> water sliding --> zumba",
"water sliding --> javelin throw --> zumba --> riding mule"
],
"answer": "zumba --> javelin throw --> riding mule --> water sliding",
"question_type": "order"
},
{
"video": "order_107.mp4",
"duration": 685.98,
"question": "Arrange the following events from the video in the correct chronological order: (1)The camera focuses on an older man's face; (2)The two children dance together; (3)The camera focuses on a bug on the wall; (4)The two children interact with each other in a cluttered room.",
"candidates": [
"4->3->2->1",
"1->4->3->2",
"4->1->2->3",
"1->2->3->4"
],
"answer": "1->4->3->2",
"question_type": "order"
},
{
"video": "order_73.mp4",
"duration": 714.15,
"question": "Arrange the following events from the video in the correct chronological order: (1)A shirtless man lifts a ball onto one shoulder; (2)A series of tug of war matches are shown; (3)A third man flips a heavy tire; (4)Individuals are shown exercising with weights, kegs, or tires in a parking lot.",
"candidates": [
"1->2->3->4",
"3->1->2->4",
"2->1->3->4",
"1->3->4->2"
],
"answer": "1->3->4->2",
"question_type": "order"
},
{
"video": "order_210.mp4",
"duration": 777.0500000000001,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"cooking sausages --> jetskiing --> javelin throw --> playing trombone",
"jetskiing --> playing trombone --> javelin throw --> cooking sausages",
"jetskiing --> cooking sausages --> javelin throw --> playing trombone",
"cooking sausages --> javelin throw --> jetskiing --> playing trombone"
],
"answer": "jetskiing --> playing trombone --> javelin throw --> cooking sausages",
"question_type": "order"
},
{
"video": "order_72.mp4",
"duration": 684.1500000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1)A shirtless man lifts a ball onto one shoulder; (2)A series of tug of war matches are shown; (3)A third man flips a heavy tire; (4)Individuals are shown exercising with weights, kegs, or tires in a parking lot.",
"candidates": [
"1->3->4->2",
"3->1->2->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "1->3->4->2",
"question_type": "order"
},
{
"video": "order_132.mp4",
"duration": 688.49,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man starts to demonstrate playing the bongos in a lesson; (2)LP and Giovanni Logo appear on the black screen opening; (3)The lesson continues, alternating between color and black and white footage; (4)A man sits behind a set of bongo drums.",
"candidates": [
"2->4->1->3",
"2->1->4->3",
"4->2->1->3",
"1->2->3->4"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_71.mp4",
"duration": 665.25,
"question": "Arrange the following events from the video in the correct chronological order: (1) Woman measures and cuts the wallpaper; (2) Woman grabs wallpaper paste and materials; (3) Woman hangs the wallpaper and flattens it; (4) Woman pastes the wallpaper with a brush and soaks it.",
"candidates": [
"4->3->2->1",
"2->1->4->3",
"1->2->3->4",
"3->2->1->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_269.mp4",
"duration": 485.98999999999995,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"jetskiing --> abseiling --> water sliding --> playing trombone",
"playing trombone --> jetskiing --> water sliding --> abseiling",
"playing trombone --> abseiling --> jetskiing --> water sliding",
"abseiling --> playing trombone --> jetskiing --> water sliding"
],
"answer": "playing trombone --> jetskiing --> water sliding --> abseiling",
"question_type": "order"
},
{
"video": "order_70.mp4",
"duration": 695.25,
"question": "Arrange the following events from the video in the correct chronological order: (1) Woman measures and cuts the wallpaper; (2) Woman grabs wallpaper paste and materials; (3) Woman hangs the wallpaper and flattens it; (4) Woman pastes the wallpaper with a brush and soaks it.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_19.mp4",
"duration": 679.45,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man explains wakeboarding concepts while his daughter wakeboards in a lake; (2)The video introduction about teaching a child to wakeboard is shown; (3)The girl wakeboards in the lake again while her father continues to explain the teaching techniques; (4)They practice wakeboarding in a pool while discussing techniques.",
"candidates": [
"1->2->3->4",
"3->4->1->2",
"2->1->4->3",
"4->3->2->1"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_18.mp4",
"duration": 679.45,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man explains wakeboarding concepts while his daughter wakeboards in a lake; (2)The video introduction about teaching a child to wakeboard is shown; (3)The girl wakeboards in the lake again while her father continues to explain the teaching techniques; (4)They practice wakeboarding in a pool while discussing techniques.",
"candidates": [
"1->2->3->4",
"3->4->1->2",
"2->1->4->3",
"4->3->2->1"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_294.mp4",
"duration": 490.10999999999996,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"shredding paper --> riding mule --> baking cookies --> milking cow",
"shredding paper --> baking cookies --> riding mule --> milking cow",
"riding mule --> shredding paper --> milking cow --> baking cookies",
"shredding paper --> milking cow --> baking cookies --> riding mule"
],
"answer": "shredding paper --> riding mule --> baking cookies --> milking cow",
"question_type": "order"
},
{
"video": "order_17.mp4",
"duration": 681.28,
"question": "Arrange the following events from the video in the correct chronological order: (1)People crash into the bottom of a bridge; (2)People are sitting on a raft going down a river; (3)People are walking across the water and down a trail; (4)People are carrying their raft and get into a van.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"1->3->2->4",
"2->3->1->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_114.mp4",
"duration": 676.67,
"question": "Arrange the following events from the video in the correct chronological order: (1)A lady in black reads names a man hands her and passes out prize buckets to the kids; (2)A group of kids is building a moat filled with water around a sand castle; (3)We see an opening title screen; (4)We see kids across the beach working on their castles in the wet sand.",
"candidates": [
"1->4->2->3",
"4->1->2->3",
"3->2->4->1",
"2->3->1->4"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_16.mp4",
"duration": 681.2,
"question": "Arrange the following events from the video in the correct chronological order: (1)People crash into the bottom of a bridge; (2)People are sitting on a raft going down a river; (3)People are walking across the water and down a trail; (4)People are carrying their raft and get into a van.",
"candidates": [
"1->2->3->4",
"1->3->2->4",
"2->1->3->4",
"2->3->1->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_15.mp4",
"duration": 711.2,
"question": "Arrange the following events from the video in the correct chronological order: (1)People crash into the bottom of a bridge; (2)People are sitting on a raft going down a river; (3)People are walking across the water and down a trail; (4)People are carrying their raft and get into a van.",
"candidates": [
"1->3->2->4",
"1->2->3->4",
"2->1->3->4",
"2->3->1->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_202.mp4",
"duration": 490.04999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"abseiling --> carving pumpkin --> javelin throw --> riding mule",
"riding mule --> carving pumpkin --> javelin throw --> abseiling",
"javelin throw --> abseiling --> carving pumpkin --> riding mule",
"abseiling --> riding mule --> javelin throw --> carving pumpkin"
],
"answer": "abseiling --> carving pumpkin --> javelin throw --> riding mule",
"question_type": "order"
},
{
"video": "order_14.mp4",
"duration": 687.24,
"question": "Arrange the following events from the video in the correct chronological order: (1)The batter is poured into bowls and dye is added; (2)Ingredients are shown on a counter; (3)The cake is frosted with blue frosting and sprinkles are added; (4)The pans are greased and the different colored batter is poured into them.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"2->1->4->3",
"3->4->1->2"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_276.mp4",
"duration": 7200.02,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> clean and jerk --> zumba --> cleaning toilet",
"cleaning toilet --> zumba --> clean and jerk --> riding mule",
"cleaning toilet --> zumba --> riding mule --> clean and jerk",
"riding mule --> cleaning toilet --> zumba --> clean and jerk"
],
"answer": "riding mule --> clean and jerk --> zumba --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_339.mp4",
"duration": 486.78,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing trombone --> paragliding --> riding mule --> jetskiing",
"playing trombone --> riding mule --> jetskiing --> paragliding",
"paragliding --> riding mule --> jetskiing --> playing trombone",
"paragliding --> jetskiing --> riding mule --> playing trombone"
],
"answer": "paragliding --> jetskiing --> riding mule --> playing trombone",
"question_type": "order"
},
{
"video": "order_13.mp4",
"duration": 673.2,
"question": "Arrange the following events from the video in the correct chronological order: (1)The batter is poured into bowls and dye is added; (2)Ingredients are shown on a counter; (3)The cake is frosted with blue frosting and sprinkles are added; (4)The pans are greased and the different colored batter is poured into them.",
"candidates": [
"2->1->4->3",
"4->3->2->1",
"1->2->3->4",
"3->4->1->2"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_12.mp4",
"duration": 687.24,
"question": "Arrange the following events from the video in the correct chronological order: (1)The batter is poured into bowls and dye is added; (2)Ingredients are shown on a counter; (3)The cake is frosted with blue frosting and sprinkles are added; (4)The pans are greased and the different colored batter is poured into them.",
"candidates": [
"4->3->2->1",
"3->4->1->2",
"2->1->4->3",
"1->2->3->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_121.mp4",
"duration": 673.38,
"question": "Arrange the following events from the video in the correct chronological order: (1)The interviewer plays with the dogs; (2)A man has dogs on a city street near a car; (3)We see a title screen over the UK flag; (4)We see a banner across the bottom of the screen and the man kneeling playing with his dogs.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"3->2->1->4",
"4->3->2->1"
],
"answer": "3->2->1->4",
"question_type": "order"
},
{
"video": "order_11.mp4",
"duration": 707.69,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man attempts to walk across the rope but falls and holds onto the rope; (2)A seal sits on a rock near an ocean; (3)The man films from a beach cliff next to a tent; (4)The man walks across the rope all the way to the attached rock.",
"candidates": [
"2->1->4->3",
"1->2->3->4",
"3->1->2->4",
"2->4->1->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_258.mp4",
"duration": 605.05,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"javelin throw --> stomping grapes --> tossing coin --> cooking sausages",
"tossing coin --> stomping grapes --> javelin throw --> cooking sausages",
"cooking sausages --> javelin throw --> stomping grapes --> tossing coin",
"javelin throw --> stomping grapes --> cooking sausages --> tossing coin"
],
"answer": "javelin throw --> stomping grapes --> tossing coin --> cooking sausages",
"question_type": "order"
},
{
"video": "order_10.mp4",
"duration": 677.73,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man attempts to walk across the rope but falls and holds onto the rope; (2)A seal sits on a rock near an ocean; (3)The man films from a beach cliff next to a tent; (4)The man walks across the rope all the way to the attached rock.",
"candidates": [
"3->1->2->4",
"1->2->3->4",
"2->4->1->3",
"2->1->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_283.mp4",
"duration": 515.05,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"stomping grapes --> tossing coin --> zumba --> cleaning toilet",
"zumba --> tossing coin --> stomping grapes --> cleaning toilet",
"tossing coin --> cleaning toilet --> stomping grapes --> zumba",
"cleaning toilet --> stomping grapes --> zumba --> tossing coin"
],
"answer": "cleaning toilet --> stomping grapes --> zumba --> tossing coin",
"question_type": "order"
},
{
"video": "order_346.mp4",
"duration": 5786.9,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"abseiling --> javelin throw --> pole vault --> cleaning toilet",
"pole vault --> cleaning toilet --> javelin throw --> abseiling",
"abseiling --> javelin throw --> cleaning toilet --> pole vault",
"javelin throw --> abseiling --> cleaning toilet --> pole vault"
],
"answer": "pole vault --> cleaning toilet --> javelin throw --> abseiling",
"question_type": "order"
},
{
"video": "order_103.mp4",
"duration": 717.89,
"question": "Arrange the following events from the video in the correct chronological order: (1)A lady in blue talks about the Extreme Dog Grooming company; (2)A poodle is groomed and dyed with different colors; (3)A dog painted to resemble a zebra is shown; (4)A dog with spots resembling a giraffe is brought to an elementary school for kids to play with.",
"candidates": [
"4->3->2->1",
"3->4->1->2",
"1->2->4->3",
"2->1->3->4"
],
"answer": "1->2->4->3",
"question_type": "order"
},
{
"video": "order_265.mp4",
"duration": 641.0899999999999,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"pole vault --> playing harp --> paragliding --> cooking sausages",
"cooking sausages --> paragliding --> pole vault --> playing harp",
"cooking sausages --> pole vault --> paragliding --> playing harp",
"playing harp --> pole vault --> cooking sausages --> paragliding"
],
"answer": "pole vault --> playing harp --> paragliding --> cooking sausages",
"question_type": "order"
},
{
"video": "order_328.mp4",
"duration": 593.3299999999999,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"playing trombone --> shredding paper --> jetskiing --> carving pumpkin",
"playing trombone --> carving pumpkin --> shredding paper --> jetskiing",
"shredding paper --> playing trombone --> carving pumpkin --> jetskiing",
"playing trombone --> carving pumpkin --> jetskiing --> shredding paper"
],
"answer": "playing trombone --> carving pumpkin --> jetskiing --> shredding paper",
"question_type": "order"
},
{
"video": "order_290.mp4",
"duration": 504.38,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"clean and jerk --> jetskiing --> pole vault --> javelin throw",
"jetskiing --> pole vault --> clean and jerk --> javelin throw",
"pole vault --> jetskiing --> clean and jerk --> javelin throw",
"javelin throw --> jetskiing --> pole vault --> clean and jerk"
],
"answer": "pole vault --> jetskiing --> clean and jerk --> javelin throw",
"question_type": "order"
},
{
"video": "order_110.mp4",
"duration": 667.64,
"question": "Arrange the following events from the video in the correct chronological order: (1)Water is added to the cup and more limes are squeezed in by hand; (2)The small bowls of salt are arranged and limes are sliced in halves; (3)The cup is stirred with more water and a set of cups filled with the refreshment are seen; (4)The limes are juiced into a cup using a hand held press.",
"candidates": [
"2->1->4->3",
"1->2->4->3",
"2->4->1->3",
"4->2->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_169.mp4",
"duration": 484.91,
"question": "Arrange the following events from the video in the correct chronological order: (1)A lady in blue talks about the Extreme Dog Grooming company; (2)A poodle is groomed and dyed with different colors; (3)A dog painted to resemble a zebra is shown; (4)A dog with spots resembling a giraffe is brought to an elementary school for kids to play with.",
"candidates": [
"4->3->2->1",
"3->4->1->2",
"2->1->3->4",
"1->2->4->3"
],
"answer": "1->2->4->3",
"question_type": "order"
},
{
"video": "order_257.mp4",
"duration": 1394.62,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"zumba --> playing harp --> carving pumpkin --> pole vault",
"zumba --> pole vault --> carving pumpkin --> playing harp",
"pole vault --> zumba --> carving pumpkin --> playing harp",
"zumba --> pole vault --> playing harp --> carving pumpkin"
],
"answer": "zumba --> playing harp --> carving pumpkin --> pole vault",
"question_type": "order"
},
{
"video": "order_282.mp4",
"duration": 490.03999999999996,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"jetskiing --> abseiling --> zumba --> cooking sausages",
"abseiling --> cooking sausages --> zumba --> jetskiing",
"abseiling --> zumba --> cooking sausages --> jetskiing",
"cooking sausages --> abseiling --> zumba --> jetskiing"
],
"answer": "abseiling --> cooking sausages --> zumba --> jetskiing",
"question_type": "order"
},
{
"video": "order_345.mp4",
"duration": 488.32,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"pole vault --> playing harp --> paragliding --> playing trombone",
"paragliding --> playing harp --> playing trombone --> pole vault",
"pole vault --> paragliding --> playing harp --> playing trombone",
"playing harp --> paragliding --> pole vault --> playing trombone"
],
"answer": "pole vault --> paragliding --> playing harp --> playing trombone",
"question_type": "order"
},
{
"video": "order_102.mp4",
"duration": 1728.94,
"question": "Arrange the following events from the video in the correct chronological order: (1)A lady in blue talks about the Extreme Dog Grooming company; (2)A poodle is groomed and dyed with different colors; (3)A dog painted to resemble a zebra is shown; (4)A dog with spots resembling a giraffe is brought to an elementary school for kids to play with.",
"candidates": [
"4->3->2->1",
"2->1->3->4",
"1->2->4->3",
"3->4->1->2"
],
"answer": "1->2->4->3",
"question_type": "order"
},
{
"video": "order_176.mp4",
"duration": 942.72,
"question": "Arrange the following events from the video in the correct chronological order: (1)The woman hangs the washed clothes on a line; (2)The woman fills a metal bucket with water; (3)The woman washes and scrubs clothes by hand; (4)The woman places a small wooden stool near a larger bucket.",
"candidates": [
"2->4->3->1",
"3->2->4->1",
"4->3->2->1",
"1->2->3->4"
],
"answer": "2->4->3->1",
"question_type": "order"
},
{
"video": "order_239.mp4",
"duration": 515.76,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"abseiling --> water sliding --> baking cookies --> jetskiing",
"abseiling --> water sliding --> jetskiing --> baking cookies",
"water sliding --> jetskiing --> baking cookies --> abseiling",
"jetskiing --> abseiling --> water sliding --> baking cookies"
],
"answer": "water sliding --> jetskiing --> baking cookies --> abseiling",
"question_type": "order"
},
{
"video": "order_264.mp4",
"duration": 274.05,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"carving pumpkin --> tossing coin --> milking cow --> shredding paper",
"tossing coin --> shredding paper --> milking cow --> carving pumpkin",
"milking cow --> shredding paper --> carving pumpkin --> tossing coin",
"carving pumpkin --> milking cow --> tossing coin --> shredding paper"
],
"answer": "milking cow --> shredding paper --> carving pumpkin --> tossing coin",
"question_type": "order"
},
{
"video": "order_327.mp4",
"duration": 513.06,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing harp --> paragliding --> shredding paper --> jetskiing",
"jetskiing --> shredding paper --> playing harp --> paragliding",
"jetskiing --> paragliding --> playing harp --> shredding paper",
"jetskiing --> shredding paper --> paragliding --> playing harp"
],
"answer": "jetskiing --> paragliding --> playing harp --> shredding paper",
"question_type": "order"
},
{
"video": "order_158.mp4",
"duration": 686.27,
"question": "Arrange the following events from the video in the correct chronological order: (1) Woman measures and cuts the wallpaper; (2) Woman grabs wallpaper paste and materials; (3) Woman hangs the wallpaper and flattens it; (4) Woman pastes the wallpaper with a brush and soaks it.",
"candidates": [
"1->2->3->4",
"2->1->4->3",
"4->3->2->1",
"3->2->1->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_246.mp4",
"duration": 600.03,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"making jewelry --> riding mule --> playing trombone --> shredding paper",
"shredding paper --> riding mule --> making jewelry --> playing trombone",
"making jewelry --> shredding paper --> playing trombone --> riding mule",
"playing trombone --> riding mule --> making jewelry --> shredding paper"
],
"answer": "making jewelry --> riding mule --> playing trombone --> shredding paper",
"question_type": "order"
},
{
"video": "order_309.mp4",
"duration": 535.1099999999999,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"zumba --> clean and jerk --> milking cow --> playing trombone",
"clean and jerk --> zumba --> milking cow --> playing trombone",
"playing trombone --> milking cow --> clean and jerk --> zumba",
"milking cow --> zumba --> clean and jerk --> playing trombone"
],
"answer": "milking cow --> zumba --> clean and jerk --> playing trombone",
"question_type": "order"
},
{
"video": "order_271.mp4",
"duration": 520.02,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"cleaning toilet --> making jewelry --> abseiling --> javelin throw",
"javelin throw --> making jewelry --> abseiling --> cleaning toilet",
"cleaning toilet --> javelin throw --> making jewelry --> abseiling",
"abseiling --> javelin throw --> making jewelry --> cleaning toilet"
],
"answer": "cleaning toilet --> making jewelry --> abseiling --> javelin throw",
"question_type": "order"
},
{
"video": "order_334.mp4",
"duration": 1365.29,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"stomping grapes --> carving pumpkin --> milking cow --> shredding paper",
"shredding paper --> stomping grapes --> milking cow --> carving pumpkin",
"shredding paper --> milking cow --> stomping grapes --> carving pumpkin",
"milking cow --> shredding paper --> stomping grapes --> carving pumpkin"
],
"answer": "stomping grapes --> carving pumpkin --> milking cow --> shredding paper",
"question_type": "order"
},
{
"video": "order_165.mp4",
"duration": 986.37,
"question": "Arrange the following events from the video in the correct chronological order: (1)Man is on lake side talking to the camera like other couples as well; (2)Man is talking to the camera; (3)People are kayaking on calm river and have a good picnic day; (4)People are standing on a side of a rock wall.",
"candidates": [
"1->2->3->4",
"2->1->4->3",
"4->2->1->3",
"2->4->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_228.mp4",
"duration": 7200.040000000001,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"javelin throw --> cooking sausages --> riding mule --> cleaning toilet",
"cleaning toilet --> cooking sausages --> javelin throw --> riding mule",
"cooking sausages --> cleaning toilet --> riding mule --> javelin throw",
"javelin throw --> cooking sausages --> cleaning toilet --> riding mule"
],
"answer": "javelin throw --> cooking sausages --> cleaning toilet --> riding mule",
"question_type": "order"
},
{
"video": "order_253.mp4",
"duration": 1365.11,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"shredding paper --> making jewelry --> playing harp --> cooking sausages",
"shredding paper --> cooking sausages --> making jewelry --> playing harp",
"making jewelry --> playing harp --> cooking sausages --> shredding paper",
"playing harp --> making jewelry --> cooking sausages --> shredding paper"
],
"answer": "shredding paper --> cooking sausages --> making jewelry --> playing harp",
"question_type": "order"
},
{
"video": "order_69.mp4",
"duration": 665.38,
"question": "Arrange the following events from the video in the correct chronological order: (1) Woman measures and cuts the wallpaper; (2) Woman grabs wallpaper paste and materials; (3) Woman hangs the wallpaper and flattens it; (4) Woman pastes the wallpaper with a brush and soaks it.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"2->1->4->3",
"3->2->1->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_316.mp4",
"duration": 479.47,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"baking cookies --> stomping grapes --> clean and jerk --> zumba",
"zumba --> baking cookies --> clean and jerk --> stomping grapes",
"baking cookies --> zumba --> clean and jerk --> stomping grapes",
"clean and jerk --> stomping grapes --> baking cookies --> zumba"
],
"answer": "zumba --> baking cookies --> clean and jerk --> stomping grapes",
"question_type": "order"
},
{
"video": "order_68.mp4",
"duration": 686.23,
"question": "Arrange the following events from the video in the correct chronological order: (1)The mix is poured into cupcake liners; (2)A cake with a Hershey shape is placed on a white plate; (3)Eggs, flour, and other ingredients are mixed in a bowl; (4)The cake is cut into a piece and served on a white plate.",
"candidates": [
"3->1->4->2",
"3->1->2->4",
"2->3->1->4",
"1->2->3->4"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_67.mp4",
"duration": 686.23,
"question": "Arrange the following events from the video in the correct chronological order: (1)The mix is poured into cupcake liners; (2)A cake with a Hershey shape is placed on a white plate; (3)Eggs, flour, and other ingredients are mixed in a bowl; (4)The cake is cut into a piece and served on a white plate.",
"candidates": [
"1->2->3->4",
"3->1->4->2",
"2->3->1->4",
"3->1->2->4"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_341.mp4",
"duration": 519.3399999999999,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> milking cow --> pole vault --> stomping grapes",
"stomping grapes --> pole vault --> riding mule --> milking cow",
"riding mule --> stomping grapes --> pole vault --> milking cow",
"pole vault --> riding mule --> stomping grapes --> milking cow"
],
"answer": "riding mule --> milking cow --> pole vault --> stomping grapes",
"question_type": "order"
},
{
"video": "order_172.mp4",
"duration": 6510.83,
"question": "Arrange the following events from the video in the correct chronological order: (1)The guy measures the ingredient on the table; (2)The child and guy added the egg to the bowl; (3)The guy uses silverware to put dough on a baking pan; (4)The child, guy, and dog watch the baking process through the oven window.",
"candidates": [
"2->1->3->4",
"3->2->1->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_66.mp4",
"duration": 743.44,
"question": "Arrange the following events from the video in the correct chronological order: (1)The mix is poured into cupcake liners; (2)A cake with a Hershey shape is placed on a white plate; (3)Eggs, flour, and other ingredients are mixed in a bowl; (4)The cake is cut into a piece and served on a white plate.",
"candidates": [
"3->1->4->2",
"3->1->2->4",
"1->2->3->4",
"2->3->1->4"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_235.mp4",
"duration": 646.57,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"water sliding --> stomping grapes --> abseiling --> cleaning toilet",
"stomping grapes --> cleaning toilet --> water sliding --> abseiling",
"abseiling --> water sliding --> stomping grapes --> cleaning toilet",
"cleaning toilet --> abseiling --> stomping grapes --> water sliding"
],
"answer": "water sliding --> stomping grapes --> abseiling --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_65.mp4",
"duration": 1071.78,
"question": "Arrange the following events from the video in the correct chronological order: (1)The action of the lures is shown underwater as several different fish go after the lures; (2)Several men show off the different lures they are using for ice fishing; (3)The video ends with the closing credits and Graphics shown on the screen; (4)An introduction comes onto the screen for a video about fishing lures.",
"candidates": [
"1->2->3->4",
"4->2->1->3",
"3->2->1->4",
"2->1->3->4"
],
"answer": "4->2->1->3",
"question_type": "order"
},
{
"video": "order_64.mp4",
"duration": 674.54,
"question": "Arrange the following events from the video in the correct chronological order: (1)The action of the lures is shown underwater as several different fish go after the lures; (2)Several men show off the different lures they are using for ice fishing; (3)The video ends with the closing credits and Graphics shown on the screen; (4)An introduction comes onto the screen for a video about fishing lures.",
"candidates": [
"3->2->1->4",
"2->1->3->4",
"1->2->3->4",
"4->2->1->3"
],
"answer": "4->2->1->3",
"question_type": "order"
},
{
"video": "order_157.mp4",
"duration": 835.17,
"question": "Arrange the following events from the video in the correct chronological order: (1)The mix is poured into cupcake liners; (2)A cake with a Hershey shape is placed on a white plate; (3)Eggs, flour, and other ingredients are mixed in a bowl; (4)The cake is cut into a piece and served on a white plate.",
"candidates": [
"3->1->4->2",
"1->2->3->4",
"3->1->2->4",
"2->3->1->4"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_260.mp4",
"duration": 6236.110000000001,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"jetskiing --> paragliding --> abseiling --> zumba",
"abseiling --> zumba --> paragliding --> jetskiing",
"zumba --> jetskiing --> abseiling --> paragliding",
"zumba --> jetskiing --> paragliding --> abseiling"
],
"answer": "jetskiing --> paragliding --> abseiling --> zumba",
"question_type": "order"
},
{
"video": "order_63.mp4",
"duration": 674.62,
"question": "Arrange the following events from the video in the correct chronological order: (1)The action of the lures is shown underwater as several different fish go after the lures; (2)Several men show off the different lures they are using for ice fishing; (3)The video ends with the closing credits and Graphics shown on the screen; (4)An introduction comes onto the screen for a video about fishing lures.",
"candidates": [
"3->2->1->4",
"4->2->1->3",
"1->2->3->4",
"2->1->3->4"
],
"answer": "4->2->1->3",
"question_type": "order"
},
{
"video": "order_323.mp4",
"duration": 378.04,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"javelin throw --> stomping grapes --> shredding paper --> abseiling",
"shredding paper --> javelin throw --> abseiling --> stomping grapes",
"abseiling --> javelin throw --> stomping grapes --> shredding paper",
"shredding paper --> stomping grapes --> javelin throw --> abseiling"
],
"answer": "shredding paper --> javelin throw --> abseiling --> stomping grapes",
"question_type": "order"
},
{
"video": "order_62.mp4",
"duration": 1102.12,
"question": "Arrange the following events from the video in the correct chronological order: (1)The chef pours soy sauce into the cup; (2)The chef shows off shredded garlic before throwing it into the cup; (3)The chef grabs a bowl of salad and shows it off; (4)The chef grabs a cup of nuts and throws it on top of a salad.",
"candidates": [
"3->4->1->2",
"4->3->2->1",
"1->2->3->4",
"2->1->3->4"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_61.mp4",
"duration": 688.21,
"question": "Arrange the following events from the video in the correct chronological order: (1)The chef pours soy sauce into the cup; (2)The chef shows off shredded garlic before throwing it into the cup; (3)The chef grabs a bowl of salad and shows it off; (4)The chef grabs a cup of nuts and throws it on top of a salad.",
"candidates": [
"1->2->3->4",
"3->4->1->2",
"2->1->3->4",
"4->3->2->1"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_139.mp4",
"duration": 694.26,
"question": "Arrange the following events from the video in the correct chronological order: (1)The batter is poured into bowls and dye is added; (2)Ingredients are shown on a counter; (3)The cake is frosted with blue frosting and sprinkles are added; (4)The pans are greased and the different colored batter is poured into them.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"3->4->1->2",
"2->1->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_60.mp4",
"duration": 688.3399999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The chef pours soy sauce into the cup; (2)The chef shows off shredded garlic before throwing it into the cup; (3)The chef grabs a bowl of salad and shows it off; (4)The chef grabs a cup of nuts and throws it on top of a salad.",
"candidates": [
"3->4->1->2",
"2->1->3->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_164.mp4",
"duration": 5936.1900000000005,
"question": "Arrange the following events from the video in the correct chronological order: (1) The slack line athletes finish competing and shake hands at the finish; (2) A bald headed man performs tricks on a yellow trick line; (3) Two men meet to compete in a slack line competition; (4) The camera pans back to the bald man performing trick on the slack line and back to the man wearing a baseball cap performing tricks on the traditional line.",
"candidates": [
"4->3->2->1",
"2->3->1->4",
"3->2->4->1",
"1->2->3->4"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_227.mp4",
"duration": 520.05,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"making jewelry --> abseiling --> milking cow --> jetskiing",
"jetskiing --> abseiling --> making jewelry --> milking cow",
"jetskiing --> making jewelry --> abseiling --> milking cow",
"jetskiing --> making jewelry --> milking cow --> abseiling"
],
"answer": "jetskiing --> abseiling --> making jewelry --> milking cow",
"question_type": "order"
},
{
"video": "order_330.mp4",
"duration": 522.04,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"playing trombone --> milking cow --> cleaning toilet --> jetskiing",
"playing trombone --> cleaning toilet --> milking cow --> jetskiing",
"cleaning toilet --> playing trombone --> milking cow --> jetskiing",
"playing trombone --> jetskiing --> milking cow --> cleaning toilet"
],
"answer": "playing trombone --> milking cow --> cleaning toilet --> jetskiing",
"question_type": "order"
},
{
"video": "order_252.mp4",
"duration": 532.0899999999999,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing trombone --> carving pumpkin --> playing harp --> cleaning toilet",
"playing trombone --> cleaning toilet --> carving pumpkin --> playing harp",
"cleaning toilet --> playing trombone --> playing harp --> carving pumpkin",
"playing harp --> carving pumpkin --> playing trombone --> cleaning toilet"
],
"answer": "playing trombone --> carving pumpkin --> playing harp --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_315.mp4",
"duration": 6428.970000000001,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing trombone --> zumba --> abseiling --> javelin throw",
"playing trombone --> abseiling --> zumba --> javelin throw",
"abseiling --> javelin throw --> zumba --> playing trombone",
"playing trombone --> javelin throw --> abseiling --> zumba"
],
"answer": "abseiling --> javelin throw --> zumba --> playing trombone",
"question_type": "order"
},
{
"video": "order_146.mp4",
"duration": 851.67,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man snowboards down a hill and turns around; (2)An old man holds a surfboard and puts on a helmet to snowboard; (3)A young person sits on the snow wearing a snowboard; (4)The man has a hot drink with other people.",
"candidates": [
"1->2->4->3",
"2->1->4->3",
"2->1->3->4",
"1->2->3->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_209.mp4",
"duration": 6071.650000000001,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"pole vault --> abseiling --> clean and jerk --> making jewelry",
"clean and jerk --> abseiling --> pole vault --> making jewelry",
"making jewelry --> abseiling --> clean and jerk --> pole vault",
"pole vault --> clean and jerk --> abseiling --> making jewelry"
],
"answer": "pole vault --> clean and jerk --> abseiling --> making jewelry",
"question_type": "order"
},
{
"video": "order_171.mp4",
"duration": 8177.369999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Water is added to the cup and more limes are squeezed in by hand; (2)The small bowls of salt are arranged and limes are sliced in halves; (3)The cup is stirred with more water and a set of cups filled with the refreshment are seen; (4)The limes are juiced into a cup using a hand held press.",
"candidates": [
"1->2->4->3",
"2->4->1->3",
"2->1->4->3",
"4->2->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_234.mp4",
"duration": 520.02,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"stomping grapes --> javelin throw --> clean and jerk --> milking cow",
"clean and jerk --> milking cow --> javelin throw --> stomping grapes",
"milking cow --> javelin throw --> clean and jerk --> stomping grapes",
"clean and jerk --> javelin throw --> milking cow --> stomping grapes"
],
"answer": "stomping grapes --> javelin throw --> clean and jerk --> milking cow",
"question_type": "order"
},
{
"video": "order_128.mp4",
"duration": 665.3399999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Woman tapes her hands with white tape; (2)Woman starts boxing in the ring with a guy; (3)Woman does sit ups on a towel on the beach; (4)Pictures of woman in her bikini are shown.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"2->1->3->4",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_322.mp4",
"duration": 6126.740000000001,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"jetskiing --> playing harp --> tossing coin --> cleaning toilet",
"jetskiing --> tossing coin --> cleaning toilet --> playing harp",
"cleaning toilet --> tossing coin --> jetskiing --> playing harp",
"cleaning toilet --> jetskiing --> tossing coin --> playing harp"
],
"answer": "cleaning toilet --> jetskiing --> tossing coin --> playing harp",
"question_type": "order"
},
{
"video": "order_153.mp4",
"duration": 828.52,
"question": "Arrange the following events from the video in the correct chronological order: (1)A man picks up the baby from the pool; (2)A person carries two bags out of a house; (3)A baby falls into the swimming pool; (4)A dog walks out of a house.",
"candidates": [
"2->4->3->1",
"1->3->2->4",
"4->2->1->3",
"3->1->2->4"
],
"answer": "2->4->3->1",
"question_type": "order"
},
{
"video": "order_216.mp4",
"duration": 490.03,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> baking cookies --> pole vault --> jetskiing",
"baking cookies --> jetskiing --> pole vault --> riding mule",
"jetskiing --> baking cookies --> pole vault --> riding mule",
"riding mule --> pole vault --> baking cookies --> jetskiing"
],
"answer": "jetskiing --> baking cookies --> pole vault --> riding mule",
"question_type": "order"
},
{
"video": "order_241.mp4",
"duration": 490.15,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> baking cookies --> playing trombone --> water sliding",
"riding mule --> water sliding --> playing trombone --> baking cookies",
"water sliding --> baking cookies --> riding mule --> playing trombone",
"playing trombone --> riding mule --> water sliding --> baking cookies"
],
"answer": "playing trombone --> riding mule --> water sliding --> baking cookies",
"question_type": "order"
},
{
"video": "order_304.mp4",
"duration": 485.98,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"milking cow --> paragliding --> water sliding --> riding mule",
"water sliding --> paragliding --> riding mule --> milking cow",
"water sliding --> milking cow --> riding mule --> paragliding",
"paragliding --> milking cow --> riding mule --> water sliding"
],
"answer": "water sliding --> milking cow --> riding mule --> paragliding",
"question_type": "order"
},
{
"video": "order_135.mp4",
"duration": 421.76,
"question": "Arrange the following events from the video in the correct chronological order: (1)A young and a kid are doing balance in a balance rope; (2)People are walking in a bridge to see a competition of men doing tricks on top of a balance rope; (3)A man is jumping and doing tricks in a balance rope above a cold river; (4)The boy is in a competition in snowy path doing tricks on a balance rope with people behind a fence watching him.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_160.mp4",
"duration": 8198.619999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Woman begins ripping the wrapping paper with her hands; (2)Woman sets the box on top of the wrapping paper and begins wrapping the box; (3)Woman lifts up a box and sets it on the table; (4)Woman grabs a pair of scissors and tape.",
"candidates": [
"3->4->1->2",
"4->3->2->1",
"1->2->3->4",
"2->1->3->4"
],
"answer": "3->4->1->2",
"question_type": "order"
},
{
"video": "order_223.mp4",
"duration": 506.7,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"cooking sausages --> jetskiing --> baking cookies --> playing harp",
"baking cookies --> cooking sausages --> jetskiing --> playing harp",
"playing harp --> jetskiing --> cooking sausages --> baking cookies",
"cooking sausages --> baking cookies --> playing harp --> jetskiing"
],
"answer": "cooking sausages --> jetskiing --> baking cookies --> playing harp",
"question_type": "order"
},
{
"video": "order_99.mp4",
"duration": 1578.18,
"question": "Arrange the following events from the video in the correct chronological order: (1)The trainer and class step in a circle and up on the platform; (2)The trainer leads an aerobic class with people in a gym; (3)The trainer and class walk over then in reverse over the platform; (4)The trainer and class step up sideways on the platform.",
"candidates": [
"4->3->2->1",
"2->1->4->3",
"1->2->3->4",
"3->4->1->2"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_98.mp4",
"duration": 840.4,
"question": "Arrange the following events from the video in the correct chronological order: (1)A person in a red coat cleans the snow off their car; (2)The trunk of the car is lifted open; (3)A person in a tan coat cleans off the front of the car; (4)A man in a white jacket starts to clear the snow off of another car.",
"candidates": [
"3->2->1->4",
"1->2->3->4",
"2->1->3->4",
"4->3->2->1"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_311.mp4",
"duration": 516.06,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"cooking sausages --> javelin throw --> cleaning toilet --> carving pumpkin",
"cleaning toilet --> cooking sausages --> javelin throw --> carving pumpkin",
"cooking sausages --> cleaning toilet --> javelin throw --> carving pumpkin",
"cleaning toilet --> javelin throw --> cooking sausages --> carving pumpkin"
],
"answer": "cleaning toilet --> javelin throw --> cooking sausages --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_97.mp4",
"duration": 1220.92,
"question": "Arrange the following events from the video in the correct chronological order: (1)A person in a red coat cleans the snow off their car; (2)The trunk of the car is lifted open; (3)A person in a tan coat cleans off the front of the car; (4)A man in a white jacket starts to clear the snow off of another car.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"4->3->2->1",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_205.mp4",
"duration": 490.03999999999996,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"paragliding --> zumba --> playing trombone --> making jewelry",
"zumba --> making jewelry --> playing trombone --> paragliding",
"paragliding --> zumba --> making jewelry --> playing trombone",
"zumba --> playing trombone --> paragliding --> making jewelry"
],
"answer": "zumba --> playing trombone --> paragliding --> making jewelry",
"question_type": "order"
},
{
"video": "order_96.mp4",
"duration": 810.4,
"question": "Arrange the following events from the video in the correct chronological order: (1)A person in a red coat cleans the snow off their car; (2)The trunk of the car is lifted open; (3)A person in a tan coat cleans off the front of the car; (4)A man in a white jacket starts to clear the snow off of another car.",
"candidates": [
"3->2->1->4",
"4->3->2->1",
"2->1->3->4",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_95.mp4",
"duration": 677.6,
"question": "Arrange the following events from the video in the correct chronological order: (1) The marching band aligns in the street with their instruments; (2) A man passes in front of the marching band holding a camera; (3) The marching band performs in a field and in a gym, moving around while playing; (4) The marching band performs in front of a building and other places.",
"candidates": [
"2->4->3->1",
"4->2->1->3",
"1->3->2->4",
"3->1->4->2"
],
"answer": "3->1->4->2",
"question_type": "order"
},
{
"video": "order_127.mp4",
"duration": 665.34,
"question": "Arrange the following events from the video in the correct chronological order: (1)Woman tapes her hands with white tape; (2)Woman starts boxing in the ring with a guy; (3)Woman does sit ups on a towel on the beach; (4)Pictures of woman in her bikini are shown.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"2->1->3->4",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_230.mp4",
"duration": 486.96999999999997,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"cleaning toilet --> tossing coin --> playing trombone --> shredding paper",
"cleaning toilet --> playing trombone --> tossing coin --> shredding paper",
"tossing coin --> cleaning toilet --> playing trombone --> shredding paper",
"cleaning toilet --> tossing coin --> shredding paper --> playing trombone"
],
"answer": "tossing coin --> cleaning toilet --> playing trombone --> shredding paper",
"question_type": "order"
},
{
"video": "order_94.mp4",
"duration": 707.6,
"question": "Arrange the following events from the video in the correct chronological order: (1) The marching band aligns in the street with their instruments; (2) A man passes in front of the marching band holding a camera; (3) The marching band performs in a field and in a gym, moving around while playing; (4) The marching band performs in front of a building and other places.",
"candidates": [
"1->3->2->4",
"4->2->1->3",
"2->4->3->1",
"3->1->4->2"
],
"answer": "3->1->4->2",
"question_type": "order"
},
{
"video": "order_93.mp4",
"duration": 1723.73,
"question": "Arrange the following events from the video in the correct chronological order: (1) The marching band aligns in the street with their instruments; (2) A man passes in front of the marching band holding a camera; (3) The marching band performs in a field and in a gym, moving around while playing; (4) The marching band performs in front of a building and other places.",
"candidates": [
"4->2->1->3",
"2->4->3->1",
"1->3->2->4",
"3->1->4->2"
],
"answer": "3->1->4->2",
"question_type": "order"
},
{
"video": "order_92.mp4",
"duration": 1578.99,
"question": "Arrange the following events from the video in the correct chronological order: (1)Man is on lake side talking to the camera like other couples as well; (2)Man is talking to the camera; (3)People are kayaking on calm river and have a good picnic day; (4)People are standing on a side of a rock wall.",
"candidates": [
"2->1->4->3",
"2->4->1->3",
"1->2->3->4",
"4->2->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_289.mp4",
"duration": 6936.370000000001,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"cooking sausages --> pole vault --> clean and jerk --> stomping grapes",
"clean and jerk --> cooking sausages --> pole vault --> stomping grapes",
"clean and jerk --> stomping grapes --> cooking sausages --> pole vault",
"pole vault --> stomping grapes --> cooking sausages --> clean and jerk"
],
"answer": "pole vault --> stomping grapes --> cooking sausages --> clean and jerk",
"question_type": "order"
},
{
"video": "order_90.mp4",
"duration": 703.3499999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Man is on lake side talking to the camera like other couples as well; (2)Man is talking to the camera; (3)People are kayaking on calm river and have a good picnic day; (4)People are standing on a side of a rock wall.",
"candidates": [
"4->2->1->3",
"2->1->4->3",
"2->4->1->3",
"1->2->3->4"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_134.mp4",
"duration": 688.49,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man starts to demonstrate playing the bongos in a lesson; (2)LP and Giovanni Logo appear on the black screen opening; (3)The lesson continues, alternating between color and black and white footage; (4)A man sits behind a set of bongo drums.",
"candidates": [
"4->2->1->3",
"2->1->4->3",
"1->2->3->4",
"2->4->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_222.mp4",
"duration": 520.03,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"cleaning toilet --> milking cow --> abseiling --> playing trombone",
"milking cow --> cleaning toilet --> playing trombone --> abseiling",
"milking cow --> abseiling --> cleaning toilet --> playing trombone",
"playing trombone --> milking cow --> cleaning toilet --> abseiling"
],
"answer": "milking cow --> abseiling --> cleaning toilet --> playing trombone",
"question_type": "order"
},
{
"video": "order_296.mp4",
"duration": 518.9,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> clean and jerk --> tossing coin --> abseiling",
"abseiling --> riding mule --> clean and jerk --> tossing coin",
"riding mule --> abseiling --> tossing coin --> clean and jerk",
"abseiling --> tossing coin --> clean and jerk --> riding mule"
],
"answer": "abseiling --> riding mule --> clean and jerk --> tossing coin",
"question_type": "order"
},
{
"video": "order_141.mp4",
"duration": 518.43,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man explains wakeboarding concepts while his daughter wakeboards in a lake; (2)The video introduction about teaching a child to wakeboard is shown; (3)The girl wakeboards in the lake again while her father continues to explain the teaching techniques; (4)They practice wakeboarding in a pool while discussing techniques.",
"candidates": [
"3->4->1->2",
"1->2->3->4",
"4->3->2->1",
"2->1->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_204.mp4",
"duration": 497.04999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"abseiling --> cleaning toilet --> jetskiing --> clean and jerk",
"jetskiing --> cleaning toilet --> clean and jerk --> abseiling",
"clean and jerk --> jetskiing --> abseiling --> cleaning toilet",
"cleaning toilet --> jetskiing --> abseiling --> clean and jerk"
],
"answer": "cleaning toilet --> jetskiing --> abseiling --> clean and jerk",
"question_type": "order"
},
{
"video": "order_278.mp4",
"duration": 497.05999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"shredding paper --> jetskiing --> riding mule --> playing trombone",
"playing trombone --> shredding paper --> jetskiing --> riding mule",
"jetskiing --> playing trombone --> riding mule --> shredding paper",
"shredding paper --> riding mule --> jetskiing --> playing trombone"
],
"answer": "playing trombone --> shredding paper --> jetskiing --> riding mule",
"question_type": "order"
},
{
"video": "order_58.mp4",
"duration": 673.02,
"question": "Arrange the following events from the video in the correct chronological order: (1)The boy begins hopping on the squares, starting from his driveway; (2)The girl joins him near the sidewalk and walks along his side as he hops across the squares; (3)He hops till he reaches the end of the sidewalk which marks the end of the hopscotch squares; (4)After he's done hopping he smiles and begins walking back.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_57.mp4",
"duration": 672.89,
"question": "Arrange the following events from the video in the correct chronological order: (1)The boy begins hopping on the squares, starting from his driveway; (2)The girl joins him near the sidewalk and walks along his side as he hops across the squares; (3)He hops till he reaches the end of the sidewalk which marks the end of the hopscotch squares; (4)After he's done hopping he smiles and begins walking back.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"3->2->1->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_285.mp4",
"duration": 455.04999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"shredding paper --> paragliding --> making jewelry --> carving pumpkin",
"paragliding --> making jewelry --> shredding paper --> carving pumpkin",
"shredding paper --> carving pumpkin --> making jewelry --> paragliding",
"making jewelry --> paragliding --> carving pumpkin --> shredding paper"
],
"answer": "shredding paper --> paragliding --> making jewelry --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_105.mp4",
"duration": 715.98,
"question": "Arrange the following events from the video in the correct chronological order: (1)The camera focuses on an older man's face; (2)The two children dance together; (3)The camera focuses on a bug on the wall; (4)The two children interact with each other in a cluttered room.",
"candidates": [
"1->4->3->2",
"1->2->3->4",
"4->1->2->3",
"4->3->2->1"
],
"answer": "1->4->3->2",
"question_type": "order"
},
{
"video": "order_55.mp4",
"duration": 665.54,
"question": "Arrange the following events from the video in the correct chronological order: (1)A man picks up the baby from the pool; (2)A person carries two bags out of a house; (3)A baby falls into the swimming pool; (4)A dog walks out of a house.",
"candidates": [
"1->3->2->4",
"2->4->3->1",
"4->2->1->3",
"3->1->2->4"
],
"answer": "2->4->3->1",
"question_type": "order"
},
{
"video": "order_54.mp4",
"duration": 665.54,
"question": "Arrange the following events from the video in the correct chronological order: (1)A man picks up the baby from the pool; (2)A person carries two bags out of a house; (3)A baby falls into the swimming pool; (4)A dog walks out of a house.",
"candidates": [
"3->1->2->4",
"4->2->1->3",
"2->4->3->1",
"1->3->2->4"
],
"answer": "2->4->3->1",
"question_type": "order"
},
{
"video": "order_130.mp4",
"duration": 684.1,
"question": "Arrange the following events from the video in the correct chronological order: (1)The blue and gray guy are on a seesaw and the blue guy jumps at the gray guy; (2)A man's image as he talks is imposed over trees and the man make gestures towards his mouth; (3)Both of the characters fall off the map; (4)We see a game of rock em sock em style robots on a computer screen cut with images of the man narrating in the lower left corner.",
"candidates": [
"4->2->1->3",
"1->2->3->4",
"2->4->1->3",
"2->1->4->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_53.mp4",
"duration": 665.78,
"question": "Arrange the following events from the video in the correct chronological order: (1)A guy sits and talks inside; (2)A man surfs on a body of water; (3)A lady spins on skates; (4)The credits of the video are shown.",
"candidates": [
"1->2->3->4",
"2->1->3->4",
"4->3->2->1",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_52.mp4",
"duration": 695.78,
"question": "Arrange the following events from the video in the correct chronological order: (1)A guy sits and talks inside; (2)A man surfs on a body of water; (3)A lady spins on skates; (4)The credits of the video are shown.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_51.mp4",
"duration": 665.78,
"question": "Arrange the following events from the video in the correct chronological order: (1)A guy sits and talks inside; (2)A man surfs on a body of water; (3)A lady spins on skates; (4)The credits of the video are shown.",
"candidates": [
"2->1->3->4",
"3->2->1->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_50.mp4",
"duration": 707.3999999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The woman starts working on her nails using bottles from a box next to her; (2)The words \"Love Food & Money with Angie Greenup\" appears on screen; (3)Her twitter handle and subscribe screen are shown while she holds her dogs; (4)The woman speaks to the camera from her living room while her dogs play fight behind her.",
"candidates": [
"2->4->1->3",
"2->1->4->3",
"4->2->1->3",
"1->2->4->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_122.mp4",
"duration": 673.46,
"question": "Arrange the following events from the video in the correct chronological order: (1)The interviewer plays with the dogs; (2)A man has dogs on a city street near a car; (3)We see a title screen over the UK flag; (4)We see a banner across the bottom of the screen and the man kneeling playing with his dogs.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->3->4"
],
"answer": "3->2->1->4",
"question_type": "order"
},
{
"video": "order_259.mp4",
"duration": 388.07,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"paragliding --> milking cow --> clean and jerk --> stomping grapes",
"stomping grapes --> clean and jerk --> milking cow --> paragliding",
"clean and jerk --> paragliding --> stomping grapes --> milking cow",
"paragliding --> clean and jerk --> stomping grapes --> milking cow"
],
"answer": "clean and jerk --> paragliding --> stomping grapes --> milking cow",
"question_type": "order"
},
{
"video": "order_284.mp4",
"duration": 593.3299999999999,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"milking cow --> shredding paper --> carving pumpkin --> pole vault",
"pole vault --> shredding paper --> carving pumpkin --> milking cow",
"carving pumpkin --> milking cow --> shredding paper --> pole vault",
"pole vault --> milking cow --> shredding paper --> carving pumpkin"
],
"answer": "pole vault --> shredding paper --> carving pumpkin --> milking cow",
"question_type": "order"
},
{
"video": "order_347.mp4",
"duration": 490.02,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"making jewelry --> abseiling --> milking cow --> cooking sausages",
"cooking sausages --> abseiling --> milking cow --> making jewelry",
"milking cow --> abseiling --> making jewelry --> cooking sausages",
"abseiling --> milking cow --> cooking sausages --> making jewelry"
],
"answer": "making jewelry --> abseiling --> milking cow --> cooking sausages",
"question_type": "order"
},
{
"video": "order_266.mp4",
"duration": 358.03999999999996,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"baking cookies --> javelin throw --> jetskiing --> making jewelry",
"baking cookies --> javelin throw --> making jewelry --> jetskiing",
"baking cookies --> making jewelry --> javelin throw --> jetskiing",
"jetskiing --> making jewelry --> baking cookies --> javelin throw"
],
"answer": "baking cookies --> making jewelry --> javelin throw --> jetskiing",
"question_type": "order"
},
{
"video": "order_329.mp4",
"duration": 488.55999999999995,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"tossing coin --> jetskiing --> zumba --> cooking sausages",
"tossing coin --> zumba --> jetskiing --> cooking sausages",
"tossing coin --> jetskiing --> cooking sausages --> zumba",
"jetskiing --> cooking sausages --> zumba --> tossing coin"
],
"answer": "tossing coin --> jetskiing --> zumba --> cooking sausages",
"question_type": "order"
},
{
"video": "order_111.mp4",
"duration": 675.0699999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The guy measures the ingredient on the table; (2)The child and guy added the egg to the bowl; (3)The guy uses silverware to put dough on a baking pan; (4)The child, guy, and dog watch the baking process through the oven window.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_273.mp4",
"duration": 490.02,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"baking cookies --> jetskiing --> water sliding --> paragliding",
"water sliding --> baking cookies --> jetskiing --> paragliding",
"baking cookies --> water sliding --> jetskiing --> paragliding",
"jetskiing --> water sliding --> paragliding --> baking cookies"
],
"answer": "water sliding --> baking cookies --> jetskiing --> paragliding",
"question_type": "order"
},
{
"video": "order_336.mp4",
"duration": 388.05,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"zumba --> paragliding --> stomping grapes --> baking cookies",
"baking cookies --> stomping grapes --> zumba --> paragliding",
"stomping grapes --> zumba --> baking cookies --> paragliding",
"zumba --> stomping grapes --> baking cookies --> paragliding"
],
"answer": "zumba --> stomping grapes --> baking cookies --> paragliding",
"question_type": "order"
},
{
"video": "order_0.mp4",
"duration": 665.74,
"question": "Arrange the following events from the video in the correct chronological order: (1)A young and a kid are doing balance in a balance rope; (2)People are walking in a bridge to see a competition of men doing tricks on top of a balance rope; (3)A man is jumping and doing tricks in a balance rope above a cold river; (4)The boy is in a competition in snowy path doing tricks on a balance rope with people behind a fence watching him.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_88.mp4",
"duration": 705.1400000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1) The slack line athletes finish competing and shake hands at the finish; (2) A bald headed man performs tricks on a yellow trick line; (3) Two men meet to compete in a slack line competition; (4) The camera pans back to the bald man performing trick on the slack line and back to the man wearing a baseball cap performing tricks on the traditional line.",
"candidates": [
"2->3->1->4",
"1->2->3->4",
"3->2->4->1",
"4->3->2->1"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_87.mp4",
"duration": 675.14,
"question": "Arrange the following events from the video in the correct chronological order: (1) The slack line athletes finish competing and shake hands at the finish; (2) A bald headed man performs tricks on a yellow trick line; (3) Two men meet to compete in a slack line competition; (4) The camera pans back to the bald man performing trick on the slack line and back to the man wearing a baseball cap performing tricks on the traditional line.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->4->1",
"2->3->1->4"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_255.mp4",
"duration": 784.1899999999999,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"tossing coin --> clean and jerk --> carving pumpkin --> milking cow",
"tossing coin --> milking cow --> clean and jerk --> carving pumpkin",
"clean and jerk --> tossing coin --> carving pumpkin --> milking cow",
"milking cow --> clean and jerk --> tossing coin --> carving pumpkin"
],
"answer": "milking cow --> clean and jerk --> tossing coin --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_318.mp4",
"duration": 520.03,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> cooking sausages --> javelin throw --> carving pumpkin",
"riding mule --> javelin throw --> cooking sausages --> carving pumpkin",
"javelin throw --> riding mule --> carving pumpkin --> cooking sausages",
"cooking sausages --> javelin throw --> carving pumpkin --> riding mule"
],
"answer": "cooking sausages --> javelin throw --> carving pumpkin --> riding mule",
"question_type": "order"
},
{
"video": "order_86.mp4",
"duration": 673.98,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man shows how to work the monkey bars; (2)A male fitness trainer from Iron Edge is about to demonstrate various workouts using bars; (3)The trainer shows how to maneuver straight bar, pulleys, and medicine ball; (4)A workout regimen is displayed as part of the conclusion of the demonstration.",
"candidates": [
"3->2->1->4",
"4->3->2->1",
"2->1->3->4",
"1->2->3->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_280.mp4",
"duration": 490.16999999999996,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"tossing coin --> cooking sausages --> baking cookies --> playing trombone",
"cooking sausages --> tossing coin --> baking cookies --> playing trombone",
"baking cookies --> tossing coin --> playing trombone --> cooking sausages",
"baking cookies --> tossing coin --> cooking sausages --> playing trombone"
],
"answer": "baking cookies --> tossing coin --> playing trombone --> cooking sausages",
"question_type": "order"
},
{
"video": "order_343.mp4",
"duration": 520.02,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"tossing coin --> riding mule --> pole vault --> javelin throw",
"riding mule --> pole vault --> javelin throw --> tossing coin",
"tossing coin --> riding mule --> javelin throw --> pole vault",
"pole vault --> riding mule --> javelin throw --> tossing coin"
],
"answer": "riding mule --> pole vault --> javelin throw --> tossing coin",
"question_type": "order"
},
{
"video": "order_100.mp4",
"duration": 672.54,
"question": "Arrange the following events from the video in the correct chronological order: (1)The trainer and class step in a circle and up on the platform; (2)The trainer leads an aerobic class with people in a gym; (3)The trainer and class walk over then in reverse over the platform; (4)The trainer and class step up sideways on the platform.",
"candidates": [
"3->4->1->2",
"1->2->3->4",
"2->1->4->3",
"4->3->2->1"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_84.mp4",
"duration": 673.98,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man shows how to work the monkey bars; (2)A male fitness trainer from Iron Edge is about to demonstrate various workouts using bars; (3)The trainer shows how to maneuver straight bar, pulleys, and medicine ball; (4)A workout regimen is displayed as part of the conclusion of the demonstration.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->3->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_83.mp4",
"duration": 673.5799999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The instructor finishes the class; (2)A woman's indoor aerobic class is in process; (3)The logo 'Zumba Toning' appears on screen; (4)The camera briefly scans to the mirrored wall and then back to the class.",
"candidates": [
"2->1->3->4",
"4->3->2->1",
"1->2->3->4",
"3->2->4->1"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_82.mp4",
"duration": 703.5799999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The instructor finishes the class; (2)A woman's indoor aerobic class is in process; (3)The logo 'Zumba Toning' appears on screen; (4)The camera briefly scans to the mirrored wall and then back to the class.",
"candidates": [
"4->3->2->1",
"3->2->4->1",
"2->1->3->4",
"1->2->3->4"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_81.mp4",
"duration": 796.8399999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The instructor finishes the class; (2)A woman's indoor aerobic class is in process; (3)The logo 'Zumba Toning' appears on screen; (4)The camera briefly scans to the mirrored wall and then back to the class.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"3->2->4->1",
"2->1->3->4"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_247.mp4",
"duration": 396.33,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"cooking sausages --> jetskiing --> cleaning toilet --> abseiling",
"abseiling --> jetskiing --> cooking sausages --> cleaning toilet",
"jetskiing --> abseiling --> cleaning toilet --> cooking sausages",
"abseiling --> cooking sausages --> cleaning toilet --> jetskiing"
],
"answer": "jetskiing --> abseiling --> cleaning toilet --> cooking sausages",
"question_type": "order"
},
{
"video": "order_272.mp4",
"duration": 488.48,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"tossing coin --> jetskiing --> javelin throw --> carving pumpkin",
"jetskiing --> javelin throw --> tossing coin --> carving pumpkin",
"tossing coin --> jetskiing --> carving pumpkin --> javelin throw",
"carving pumpkin --> javelin throw --> tossing coin --> jetskiing"
],
"answer": "jetskiing --> javelin throw --> tossing coin --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_335.mp4",
"duration": 647.04,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"jetskiing --> milking cow --> clean and jerk --> baking cookies",
"milking cow --> baking cookies --> jetskiing --> clean and jerk",
"clean and jerk --> jetskiing --> milking cow --> baking cookies",
"milking cow --> clean and jerk --> baking cookies --> jetskiing"
],
"answer": "milking cow --> clean and jerk --> baking cookies --> jetskiing",
"question_type": "order"
},
{
"video": "order_229.mp4",
"duration": 605.46,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"zumba --> milking cow --> javelin throw --> water sliding",
"zumba --> water sliding --> javelin throw --> milking cow",
"javelin throw --> water sliding --> zumba --> milking cow",
"milking cow --> water sliding --> javelin throw --> zumba"
],
"answer": "javelin throw --> water sliding --> zumba --> milking cow",
"question_type": "order"
},
{
"video": "order_254.mp4",
"duration": 463.04999999999995,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"clean and jerk --> cooking sausages --> playing harp --> making jewelry",
"cooking sausages --> making jewelry --> clean and jerk --> playing harp",
"playing harp --> cooking sausages --> clean and jerk --> making jewelry",
"cooking sausages --> making jewelry --> playing harp --> clean and jerk"
],
"answer": "playing harp --> cooking sausages --> clean and jerk --> making jewelry",
"question_type": "order"
},
{
"video": "order_6.mp4",
"duration": 711.8000000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1)A guy approaches a weight on a stage; (2)A man massages a guy's shoulders; (3)A guy lifts a weight on a stage and releases it; (4)A guy kisses the weight plates.",
"candidates": [
"1->3->2->4",
"2->1->3->4",
"2->3->1->4",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_342.mp4",
"duration": 515.1899999999999,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"making jewelry --> cooking sausages --> water sliding --> cleaning toilet",
"cooking sausages --> water sliding --> making jewelry --> cleaning toilet",
"making jewelry --> cleaning toilet --> water sliding --> cooking sausages",
"making jewelry --> cleaning toilet --> cooking sausages --> water sliding"
],
"answer": "cooking sausages --> water sliding --> making jewelry --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_49.mp4",
"duration": 682.37,
"question": "Arrange the following events from the video in the correct chronological order: (1)The woman starts working on her nails using bottles from a box next to her; (2)The words \"Love Food & Money with Angie Greenup\" appears on screen; (3)Her twitter handle and subscribe screen are shown while she holds her dogs; (4)The woman speaks to the camera from her living room while her dogs play fight behind her.",
"candidates": [
"2->1->4->3",
"4->2->1->3",
"1->2->4->3",
"2->4->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_1.mp4",
"duration": 665.74,
"question": "Arrange the following events from the video in the correct chronological order: (1)A young and a kid are doing balance in a balance rope; (2)People are walking in a bridge to see a competition of men doing tricks on top of a balance rope; (3)A man is jumping and doing tricks in a balance rope above a cold river; (4)The boy is in a competition in snowy path doing tricks on a balance rope with people behind a fence watching him.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"2->1->3->4",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_261.mp4",
"duration": 395.73999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"water sliding --> making jewelry --> paragliding --> playing trombone",
"making jewelry --> paragliding --> water sliding --> playing trombone",
"making jewelry --> paragliding --> playing trombone --> water sliding",
"paragliding --> water sliding --> making jewelry --> playing trombone"
],
"answer": "paragliding --> water sliding --> making jewelry --> playing trombone",
"question_type": "order"
},
{
"video": "order_47.mp4",
"duration": 673.61,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man in red cap stands outside a barbershop talking; (2)The man pretends to be asleep during his haircut; (3)The man points out the cameras and explains it to the barber; (4)The man appears to fall out of the chair.",
"candidates": [
"3->2->1->4",
"1->3->2->4",
"2->1->3->4",
"1->2->4->3"
],
"answer": "1->2->4->3",
"question_type": "order"
},
{
"video": "order_324.mp4",
"duration": 490.02,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"cooking sausages --> cleaning toilet --> jetskiing --> stomping grapes",
"cooking sausages --> jetskiing --> stomping grapes --> cleaning toilet",
"cleaning toilet --> jetskiing --> stomping grapes --> cooking sausages",
"jetskiing --> stomping grapes --> cleaning toilet --> cooking sausages"
],
"answer": "cooking sausages --> jetskiing --> stomping grapes --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_46.mp4",
"duration": 673.61,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man in red cap stands outside a barbershop talking; (2)The man pretends to be asleep during his haircut; (3)The man points out the cameras and explains it to the barber; (4)The man appears to fall out of the chair.",
"candidates": [
"2->1->3->4",
"3->2->1->4",
"1->2->4->3",
"1->3->2->4"
],
"answer": "1->2->4->3",
"question_type": "order"
},
{
"video": "order_218.mp4",
"duration": 245.11,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"javelin throw --> clean and jerk --> cooking sausages --> carving pumpkin",
"carving pumpkin --> clean and jerk --> javelin throw --> cooking sausages",
"cooking sausages --> javelin throw --> clean and jerk --> carving pumpkin",
"javelin throw --> carving pumpkin --> clean and jerk --> cooking sausages"
],
"answer": "carving pumpkin --> clean and jerk --> javelin throw --> cooking sausages",
"question_type": "order"
},
{
"video": "order_44.mp4",
"duration": 677.8399999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The bowler from the blue team hits an overhand ball to the batter; (2)The bowler raises his hand to claim that the batter has not made a run; (3)The bowler causes the batter to get out by hitting the stumps behind him, the entire team cheers; (4)The batter walks out and another batter from his team comes on the field.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"4->3->2->1",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_243.mp4",
"duration": 414.05999999999995,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"pole vault --> zumba --> cleaning toilet --> shredding paper",
"pole vault --> shredding paper --> cleaning toilet --> zumba",
"zumba --> cleaning toilet --> shredding paper --> pole vault",
"cleaning toilet --> shredding paper --> pole vault --> zumba"
],
"answer": "pole vault --> shredding paper --> cleaning toilet --> zumba",
"question_type": "order"
},
{
"video": "order_43.mp4",
"duration": 677.76,
"question": "Arrange the following events from the video in the correct chronological order: (1)The bowler from the blue team hits an overhand ball to the batter; (2)The bowler raises his hand to claim that the batter has not made a run; (3)The bowler causes the batter to get out by hitting the stumps behind him, the entire team cheers; (4)The batter walks out and another batter from his team comes on the field.",
"candidates": [
"1->2->3->4",
"2->1->3->4",
"3->2->1->4",
"4->3->2->1"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_42.mp4",
"duration": 677.76,
"question": "Arrange the following events from the video in the correct chronological order: (1)The bowler from the blue team hits an overhand ball to the batter; (2)The bowler raises his hand to claim that the batter has not made a run; (3)The bowler causes the batter to get out by hitting the stumps behind him, the entire team cheers; (4)The batter walks out and another batter from his team comes on the field.",
"candidates": [
"1->2->3->4",
"3->2->1->4",
"2->1->3->4",
"4->3->2->1"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_41.mp4",
"duration": 678.6899999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man replaces the tire on the front rim and pumps it up; (2)The man removes the front tire of the bike from the frame; (3)The man installs a headlamp to the bike; (4)The man reinstalls the front tire onto the bike frame.",
"candidates": [
"4->3->2->1",
"2->1->4->3",
"3->4->1->2",
"1->2->3->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_331.mp4",
"duration": 375.04999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"zumba --> baking cookies --> playing harp --> cooking sausages",
"playing harp --> zumba --> baking cookies --> cooking sausages",
"playing harp --> cooking sausages --> zumba --> baking cookies",
"zumba --> baking cookies --> cooking sausages --> playing harp"
],
"answer": "zumba --> baking cookies --> cooking sausages --> playing harp",
"question_type": "order"
},
{
"video": "order_162.mp4",
"duration": 502.59999999999997,
"question": "Arrange the following events from the video in the correct chronological order: (1)The instructor finishes the class; (2)A woman's indoor aerobic class is in process; (3)The logo 'Zumba Toning' appears on screen; (4)The camera briefly scans to the mirrored wall and then back to the class.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"3->2->4->1",
"4->3->2->1"
],
"answer": "3->2->4->1",
"question_type": "order"
},
{
"video": "order_225.mp4",
"duration": 782.8,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"riding mule --> milking cow --> jetskiing --> playing harp",
"playing harp --> riding mule --> jetskiing --> milking cow",
"playing harp --> jetskiing --> riding mule --> milking cow",
"milking cow --> playing harp --> riding mule --> jetskiing"
],
"answer": "milking cow --> playing harp --> riding mule --> jetskiing",
"question_type": "order"
},
{
"video": "order_250.mp4",
"duration": 490.03,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"pole vault --> baking cookies --> tossing coin --> abseiling",
"tossing coin --> pole vault --> baking cookies --> abseiling",
"baking cookies --> tossing coin --> pole vault --> abseiling",
"baking cookies --> abseiling --> tossing coin --> pole vault"
],
"answer": "baking cookies --> tossing coin --> pole vault --> abseiling",
"question_type": "order"
},
{
"video": "order_313.mp4",
"duration": 490.03,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"playing trombone --> cooking sausages --> making jewelry --> shredding paper",
"cooking sausages --> making jewelry --> shredding paper --> playing trombone",
"making jewelry --> shredding paper --> playing trombone --> cooking sausages",
"playing trombone --> shredding paper --> cooking sausages --> making jewelry"
],
"answer": "playing trombone --> shredding paper --> cooking sausages --> making jewelry",
"question_type": "order"
},
{
"video": "order_154.mp4",
"duration": 511.90999999999997,
"question": "Arrange the following events from the video in the correct chronological order: (1)The boy begins hopping on the squares, starting from his driveway; (2)The girl joins him near the sidewalk and walks along his side as he hops across the squares; (3)He hops till he reaches the end of the sidewalk which marks the end of the hopscotch squares; (4)After he's done hopping he smiles and begins walking back.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"2->1->3->4",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_217.mp4",
"duration": 485.42,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"jetskiing --> stomping grapes --> playing trombone --> water sliding",
"water sliding --> jetskiing --> playing trombone --> stomping grapes",
"stomping grapes --> playing trombone --> water sliding --> jetskiing",
"playing trombone --> water sliding --> jetskiing --> stomping grapes"
],
"answer": "stomping grapes --> playing trombone --> water sliding --> jetskiing",
"question_type": "order"
},
{
"video": "order_320.mp4",
"duration": 490.03999999999996,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"jetskiing --> shredding paper --> milking cow --> baking cookies",
"milking cow --> jetskiing --> shredding paper --> baking cookies",
"milking cow --> jetskiing --> baking cookies --> shredding paper",
"baking cookies --> jetskiing --> milking cow --> shredding paper"
],
"answer": "milking cow --> jetskiing --> shredding paper --> baking cookies",
"question_type": "order"
},
{
"video": "order_7.mp4",
"duration": 681.8,
"question": "Arrange the following events from the video in the correct chronological order: (1)A guy approaches a weight on a stage; (2)A man massages a guy's shoulders; (3)A guy lifts a weight on a stage and releases it; (4)A guy kisses the weight plates.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"1->3->2->4",
"2->3->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_242.mp4",
"duration": 647.76,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"clean and jerk --> shredding paper --> cooking sausages --> paragliding",
"cooking sausages --> shredding paper --> clean and jerk --> paragliding",
"shredding paper --> paragliding --> cooking sausages --> clean and jerk",
"paragliding --> shredding paper --> cooking sausages --> clean and jerk"
],
"answer": "cooking sausages --> shredding paper --> clean and jerk --> paragliding",
"question_type": "order"
},
{
"video": "order_305.mp4",
"duration": 575.0600000000001,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"stomping grapes --> paragliding --> milking cow --> shredding paper",
"paragliding --> shredding paper --> milking cow --> stomping grapes",
"stomping grapes --> shredding paper --> milking cow --> paragliding",
"milking cow --> stomping grapes --> paragliding --> shredding paper"
],
"answer": "stomping grapes --> paragliding --> milking cow --> shredding paper",
"question_type": "order"
},
{
"video": "order_2.mp4",
"duration": 695.74,
"question": "Arrange the following events from the video in the correct chronological order: (1)A young and a kid are doing balance in a balance rope; (2)People are walking in a bridge to see a competition of men doing tricks on top of a balance rope; (3)A man is jumping and doing tricks in a balance rope above a cold river; (4)The boy is in a competition in snowy path doing tricks on a balance rope with people behind a fence watching him.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"4->3->2->1",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_224.mp4",
"duration": 482.23999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"playing harp --> javelin throw --> cleaning toilet --> baking cookies",
"baking cookies --> javelin throw --> cleaning toilet --> playing harp",
"javelin throw --> playing harp --> baking cookies --> cleaning toilet",
"cleaning toilet --> baking cookies --> javelin throw --> playing harp"
],
"answer": "playing harp --> javelin throw --> cleaning toilet --> baking cookies",
"question_type": "order"
},
{
"video": "order_298.mp4",
"duration": 490.03,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"riding mule --> shredding paper --> milking cow --> stomping grapes",
"milking cow --> stomping grapes --> riding mule --> shredding paper",
"milking cow --> riding mule --> stomping grapes --> shredding paper",
"stomping grapes --> milking cow --> shredding paper --> riding mule"
],
"answer": "riding mule --> shredding paper --> milking cow --> stomping grapes",
"question_type": "order"
},
{
"video": "order_118.mp4",
"duration": 699.22,
"question": "Arrange the following events from the video in the correct chronological order: (1)A white car drives by in the background; (2)A black car drives by in the background; (3)Two people walk by in the background; (4)The ball is kicked into the camera.",
"candidates": [
"4->3->2->1",
"3->2->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_312.mp4",
"duration": 514.23,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"javelin throw --> cooking sausages --> riding mule --> pole vault",
"javelin throw --> riding mule --> cooking sausages --> pole vault",
"riding mule --> pole vault --> javelin throw --> cooking sausages",
"cooking sausages --> riding mule --> javelin throw --> pole vault"
],
"answer": "javelin throw --> riding mule --> cooking sausages --> pole vault",
"question_type": "order"
},
{
"video": "order_206.mp4",
"duration": 489.13,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing trombone --> javelin throw --> zumba --> playing harp",
"javelin throw --> playing trombone --> zumba --> playing harp",
"playing trombone --> playing harp --> zumba --> javelin throw",
"zumba --> playing trombone --> playing harp --> javelin throw"
],
"answer": "playing trombone --> javelin throw --> zumba --> playing harp",
"question_type": "order"
},
{
"video": "order_231.mp4",
"duration": 588.06,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"cooking sausages --> baking cookies --> milking cow --> making jewelry",
"making jewelry --> cooking sausages --> milking cow --> baking cookies",
"cooking sausages --> making jewelry --> baking cookies --> milking cow",
"baking cookies --> making jewelry --> milking cow --> cooking sausages"
],
"answer": "cooking sausages --> baking cookies --> milking cow --> making jewelry",
"question_type": "order"
},
{
"video": "order_125.mp4",
"duration": 825.7,
"question": "Arrange the following events from the video in the correct chronological order: (1)The woman hangs the washed clothes on a line; (2)The woman fills a metal bucket with water; (3)The woman washes and scrubs clothes by hand; (4)The woman places a small wooden stool near a larger bucket.",
"candidates": [
"1->2->3->4",
"4->3->2->1",
"3->2->4->1",
"2->4->3->1"
],
"answer": "2->4->3->1",
"question_type": "order"
},
{
"video": "order_301.mp4",
"duration": 781.49,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"milking cow --> tossing coin --> making jewelry --> jetskiing",
"jetskiing --> making jewelry --> milking cow --> tossing coin",
"milking cow --> jetskiing --> tossing coin --> making jewelry",
"making jewelry --> jetskiing --> milking cow --> tossing coin"
],
"answer": "jetskiing --> making jewelry --> milking cow --> tossing coin",
"question_type": "order"
},
{
"video": "order_117.mp4",
"duration": 669.22,
"question": "Arrange the following events from the video in the correct chronological order: (1)A white car drives by in the background; (2)A black car drives by in the background; (3)Two people walk by in the background; (4)The ball is kicked into the camera.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"3->2->1->4",
"4->3->2->1"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_220.mp4",
"duration": 490.04,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"baking cookies --> water sliding --> abseiling --> jetskiing",
"abseiling --> baking cookies --> jetskiing --> water sliding",
"jetskiing --> water sliding --> abseiling --> baking cookies",
"abseiling --> jetskiing --> water sliding --> baking cookies"
],
"answer": "baking cookies --> water sliding --> abseiling --> jetskiing",
"question_type": "order"
},
{
"video": "order_279.mp4",
"duration": 485.03,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"stomping grapes --> paragliding --> jetskiing --> playing trombone",
"paragliding --> jetskiing --> playing trombone --> stomping grapes",
"stomping grapes --> paragliding --> playing trombone --> jetskiing",
"playing trombone --> paragliding --> stomping grapes --> jetskiing"
],
"answer": "paragliding --> jetskiing --> playing trombone --> stomping grapes",
"question_type": "order"
},
{
"video": "order_3.mp4",
"duration": 708.41,
"question": "Arrange the following events from the video in the correct chronological order: (1)The group begins to dance in unison; (2)Some of the group are on their feet; (3)A group gathers to the center of a gym floor; (4)Some are in wheel chairs.",
"candidates": [
"3->1->2->4",
"2->3->1->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_124.mp4",
"duration": 702.7,
"question": "Arrange the following events from the video in the correct chronological order: (1)The woman hangs the washed clothes on a line; (2)The woman fills a metal bucket with water; (3)The woman washes and scrubs clothes by hand; (4)The woman places a small wooden stool near a larger bucket.",
"candidates": [
"1->2->3->4",
"3->2->4->1",
"4->3->2->1",
"2->4->3->1"
],
"answer": "2->4->3->1",
"question_type": "order"
},
{
"video": "order_349.mp4",
"duration": 280.06,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"baking cookies --> making jewelry --> zumba --> jetskiing",
"jetskiing --> making jewelry --> zumba --> baking cookies",
"jetskiing --> baking cookies --> making jewelry --> zumba",
"baking cookies --> zumba --> jetskiing --> making jewelry"
],
"answer": "baking cookies --> making jewelry --> zumba --> jetskiing",
"question_type": "order"
},
{
"video": "order_106.mp4",
"duration": 685.98,
"question": "Arrange the following events from the video in the correct chronological order: (1)The camera focuses on an older man's face; (2)The two children dance together; (3)The camera focuses on a bug on the wall; (4)The two children interact with each other in a cluttered room.",
"candidates": [
"4->1->2->3",
"1->2->3->4",
"4->3->2->1",
"1->4->3->2"
],
"answer": "1->4->3->2",
"question_type": "order"
},
{
"video": "order_39.mp4",
"duration": 678.69,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man replaces the tire on the front rim and pumps it up; (2)The man removes the front tire of the bike from the frame; (3)The man installs a headlamp to the bike; (4)The man reinstalls the front tire onto the bike frame.",
"candidates": [
"2->1->4->3",
"4->3->2->1",
"1->2->3->4",
"3->4->1->2"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_38.mp4",
"duration": 673.61,
"question": "Arrange the following events from the video in the correct chronological order: (1)The word BIKE is overlaid on a mountain scene; (2)Oregon daily emerald logo and title card pops up; (3)The instructions follow with a man in a white ensemble and purple hat; (4)REPAIR is then overlaid under BIKE, Becoming BIKE REPAIR.",
"candidates": [
"1->2->4->3",
"2->1->4->3",
"2->1->3->4",
"1->2->3->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_268.mp4",
"duration": 663.8199999999999,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"tossing coin --> clean and jerk --> riding mule --> pole vault",
"clean and jerk --> tossing coin --> pole vault --> riding mule",
"pole vault --> clean and jerk --> riding mule --> tossing coin",
"clean and jerk --> tossing coin --> riding mule --> pole vault"
],
"answer": "clean and jerk --> tossing coin --> riding mule --> pole vault",
"question_type": "order"
},
{
"video": "order_36.mp4",
"duration": 703.61,
"question": "Arrange the following events from the video in the correct chronological order: (1)The word BIKE is overlaid on a mountain scene; (2)Oregon daily emerald logo and title card pops up; (3)The instructions follow with a man in a white ensemble and purple hat; (4)REPAIR is then overlaid under BIKE, Becoming BIKE REPAIR.",
"candidates": [
"2->1->3->4",
"2->1->4->3",
"1->2->3->4",
"1->2->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_35.mp4",
"duration": 675.78,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man snowboards down a hill and turns around; (2)An old man holds a surfboard and puts on a helmet to snowboard; (3)A young person sits on the snow wearing a snowboard; (4)The man has a hot drink with other people.",
"candidates": [
"2->1->3->4",
"2->1->4->3",
"1->2->3->4",
"1->2->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_293.mp4",
"duration": 487.58,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"clean and jerk --> stomping grapes --> zumba --> water sliding",
"water sliding --> stomping grapes --> clean and jerk --> zumba",
"water sliding --> stomping grapes --> zumba --> clean and jerk",
"clean and jerk --> water sliding --> zumba --> stomping grapes"
],
"answer": "clean and jerk --> stomping grapes --> zumba --> water sliding",
"question_type": "order"
},
{
"video": "order_113.mp4",
"duration": 675.0699999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The guy measures the ingredient on the table; (2)The child and guy added the egg to the bowl; (3)The guy uses silverware to put dough on a baking pan; (4)The child, guy, and dog watch the baking process through the oven window.",
"candidates": [
"2->1->3->4",
"3->2->1->4",
"1->2->3->4",
"4->3->2->1"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_33.mp4",
"duration": 675.6500000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man snowboards down a hill and turns around; (2)An old man holds a surfboard and puts on a helmet to snowboard; (3)A young person sits on the snow wearing a snowboard; (4)The man has a hot drink with other people.",
"candidates": [
"1->2->3->4",
"2->1->3->4",
"1->2->4->3",
"2->1->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_32.mp4",
"duration": 665.06,
"question": "Arrange the following events from the video in the correct chronological order: (1)A man on the street with a poster sign tries to get customers; (2)A university swim team is doing a fund raiser washing cars; (3)The students thank people in the video and to come support them; (4)A black screen appears with a website address.",
"candidates": [
"1->2->3->4",
"2->3->1->4",
"2->1->3->4",
"1->3->2->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_31.mp4",
"duration": 665.1,
"question": "Arrange the following events from the video in the correct chronological order: (1)A man on the street with a poster sign tries to get customers; (2)A university swim team is doing a fund raiser washing cars; (3)The students thank people in the video and to come support them; (4)A black screen appears with a website address.",
"candidates": [
"2->1->3->4",
"1->3->2->4",
"1->2->3->4",
"2->3->1->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_201.mp4",
"duration": 489.08,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"carving pumpkin --> milking cow --> cooking sausages --> javelin throw",
"carving pumpkin --> cooking sausages --> milking cow --> javelin throw",
"carving pumpkin --> cooking sausages --> javelin throw --> milking cow",
"carving pumpkin --> milking cow --> javelin throw --> cooking sausages"
],
"answer": "carving pumpkin --> cooking sausages --> javelin throw --> milking cow",
"question_type": "order"
},
{
"video": "order_275.mp4",
"duration": 483.82000000000005,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"abseiling --> baking cookies --> clean and jerk --> water sliding",
"water sliding --> clean and jerk --> baking cookies --> abseiling",
"clean and jerk --> abseiling --> baking cookies --> water sliding",
"abseiling --> water sliding --> clean and jerk --> baking cookies"
],
"answer": "abseiling --> baking cookies --> clean and jerk --> water sliding",
"question_type": "order"
},
{
"video": "order_30.mp4",
"duration": 665.06,
"question": "Arrange the following events from the video in the correct chronological order: (1)A man on the street with a poster sign tries to get customers; (2)A university swim team is doing a fund raiser washing cars; (3)The students thank people in the video and to come support them; (4)A black screen appears with a website address.",
"candidates": [
"2->1->3->4",
"1->2->3->4",
"2->3->1->4",
"1->3->2->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_338.mp4",
"duration": 520.03,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"shredding paper --> making jewelry --> cooking sausages --> carving pumpkin",
"carving pumpkin --> making jewelry --> cooking sausages --> shredding paper",
"cooking sausages --> carving pumpkin --> making jewelry --> shredding paper",
"shredding paper --> carving pumpkin --> cooking sausages --> making jewelry"
],
"answer": "carving pumpkin --> making jewelry --> cooking sausages --> shredding paper",
"question_type": "order"
},
{
"video": "order_120.mp4",
"duration": 673.38,
"question": "Arrange the following events from the video in the correct chronological order: (1)The interviewer plays with the dogs; (2)A man has dogs on a city street near a car; (3)We see a title screen over the UK flag; (4)We see a banner across the bottom of the screen and the man kneeling playing with his dogs.",
"candidates": [
"3->2->1->4",
"1->2->3->4",
"4->3->2->1",
"2->1->3->4"
],
"answer": "3->2->1->4",
"question_type": "order"
},
{
"video": "order_9.mp4",
"duration": 677.82,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man attempts to walk across the rope but falls and holds onto the rope; (2)A seal sits on a rock near an ocean; (3)The man films from a beach cliff next to a tent; (4)The man walks across the rope all the way to the attached rock.",
"candidates": [
"3->1->2->4",
"2->4->1->3",
"1->2->3->4",
"2->1->4->3"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_179.mp4",
"duration": 485.51000000000005,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man starts to demonstrate playing the bongos in a lesson; (2)LP and Giovanni Logo appear on the black screen opening; (3)The lesson continues, alternating between color and black and white footage; (4)A man sits behind a set of bongo drums.",
"candidates": [
"2->1->4->3",
"4->2->1->3",
"1->2->3->4",
"2->4->1->3"
],
"answer": "2->4->1->3",
"question_type": "order"
},
{
"video": "order_4.mp4",
"duration": 708.4100000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1)The group begins to dance in unison; (2)Some of the group are on their feet; (3)A group gathers to the center of a gym floor; (4)Some are in wheel chairs.",
"candidates": [
"1->2->3->4",
"2->3->1->4",
"4->3->2->1",
"3->1->2->4"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_292.mp4",
"duration": 490.03999999999996,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"carving pumpkin --> cooking sausages --> playing trombone --> stomping grapes",
"stomping grapes --> carving pumpkin --> cooking sausages --> playing trombone",
"playing trombone --> cooking sausages --> stomping grapes --> carving pumpkin",
"carving pumpkin --> playing trombone --> cooking sausages --> stomping grapes"
],
"answer": "playing trombone --> cooking sausages --> stomping grapes --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_112.mp4",
"duration": 675.0699999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)The guy measures the ingredient on the table; (2)The child and guy added the egg to the bowl; (3)The guy uses silverware to put dough on a baking pan; (4)The child, guy, and dog watch the baking process through the oven window.",
"candidates": [
"3->2->1->4",
"2->1->3->4",
"1->2->3->4",
"4->3->2->1"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_249.mp4",
"duration": 777.05,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"paragliding --> stomping grapes --> carving pumpkin --> shredding paper",
"paragliding --> shredding paper --> carving pumpkin --> stomping grapes",
"shredding paper --> carving pumpkin --> stomping grapes --> paragliding",
"stomping grapes --> shredding paper --> carving pumpkin --> paragliding"
],
"answer": "stomping grapes --> shredding paper --> carving pumpkin --> paragliding",
"question_type": "order"
},
{
"video": "order_256.mp4",
"duration": 490.03999999999996,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"shredding paper --> clean and jerk --> playing trombone --> javelin throw",
"javelin throw --> shredding paper --> playing trombone --> clean and jerk",
"clean and jerk --> javelin throw --> playing trombone --> shredding paper",
"shredding paper --> javelin throw --> clean and jerk --> playing trombone"
],
"answer": "javelin throw --> shredding paper --> playing trombone --> clean and jerk",
"question_type": "order"
},
{
"video": "order_319.mp4",
"duration": 728.06,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"milking cow --> riding mule --> paragliding --> cooking sausages",
"cooking sausages --> riding mule --> paragliding --> milking cow",
"milking cow --> riding mule --> cooking sausages --> paragliding",
"cooking sausages --> milking cow --> paragliding --> riding mule"
],
"answer": "cooking sausages --> riding mule --> paragliding --> milking cow",
"question_type": "order"
},
{
"video": "order_281.mp4",
"duration": 605.5200000000001,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"javelin throw --> tossing coin --> stomping grapes --> milking cow",
"javelin throw --> milking cow --> tossing coin --> stomping grapes",
"milking cow --> tossing coin --> stomping grapes --> javelin throw",
"milking cow --> tossing coin --> javelin throw --> stomping grapes"
],
"answer": "milking cow --> tossing coin --> stomping grapes --> javelin throw",
"question_type": "order"
},
{
"video": "order_101.mp4",
"duration": 702.54,
"question": "Arrange the following events from the video in the correct chronological order: (1)The trainer and class step in a circle and up on the platform; (2)The trainer leads an aerobic class with people in a gym; (3)The trainer and class walk over then in reverse over the platform; (4)The trainer and class step up sideways on the platform.",
"candidates": [
"2->1->4->3",
"4->3->2->1",
"3->4->1->2",
"1->2->3->4"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_175.mp4",
"duration": 591.4,
"question": "Arrange the following events from the video in the correct chronological order: (1)The interviewer plays with the dogs; (2)A man has dogs on a city street near a car; (3)We see a title screen over the UK flag; (4)We see a banner across the bottom of the screen and the man kneeling playing with his dogs.",
"candidates": [
"3->2->1->4",
"1->2->3->4",
"4->3->2->1",
"2->1->3->4"
],
"answer": "3->2->1->4",
"question_type": "order"
},
{
"video": "order_238.mp4",
"duration": 490.03,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"riding mule --> milking cow --> stomping grapes --> pole vault",
"stomping grapes --> milking cow --> pole vault --> riding mule",
"riding mule --> milking cow --> pole vault --> stomping grapes",
"milking cow --> pole vault --> riding mule --> stomping grapes"
],
"answer": "stomping grapes --> milking cow --> pole vault --> riding mule",
"question_type": "order"
},
{
"video": "order_263.mp4",
"duration": 490.03000000000003,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"pole vault --> jetskiing --> zumba --> clean and jerk",
"pole vault --> clean and jerk --> zumba --> jetskiing",
"jetskiing --> pole vault --> zumba --> clean and jerk",
"clean and jerk --> jetskiing --> zumba --> pole vault"
],
"answer": "pole vault --> clean and jerk --> zumba --> jetskiing",
"question_type": "order"
},
{
"video": "order_326.mp4",
"duration": 490.03999999999996,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"tossing coin --> clean and jerk --> zumba --> stomping grapes",
"zumba --> tossing coin --> clean and jerk --> stomping grapes",
"zumba --> stomping grapes --> clean and jerk --> tossing coin",
"clean and jerk --> tossing coin --> zumba --> stomping grapes"
],
"answer": "zumba --> tossing coin --> clean and jerk --> stomping grapes",
"question_type": "order"
},
{
"video": "order_5.mp4",
"duration": 678.4100000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1)The group begins to dance in unison; (2)Some of the group are on their feet; (3)A group gathers to the center of a gym floor; (4)Some are in wheel chairs.",
"candidates": [
"1->2->3->4",
"3->1->2->4",
"2->3->1->4",
"4->3->2->1"
],
"answer": "3->1->2->4",
"question_type": "order"
},
{
"video": "order_308.mp4",
"duration": 481.04999999999995,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"pole vault --> making jewelry --> javelin throw --> zumba",
"making jewelry --> zumba --> pole vault --> javelin throw",
"making jewelry --> javelin throw --> zumba --> pole vault",
"making jewelry --> pole vault --> zumba --> javelin throw"
],
"answer": "making jewelry --> zumba --> pole vault --> javelin throw",
"question_type": "order"
},
{
"video": "order_270.mp4",
"duration": 359.03999999999996,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"baking cookies --> playing trombone --> cleaning toilet --> playing harp",
"playing harp --> cleaning toilet --> playing trombone --> baking cookies",
"cleaning toilet --> baking cookies --> playing harp --> playing trombone",
"playing harp --> cleaning toilet --> baking cookies --> playing trombone"
],
"answer": "playing harp --> cleaning toilet --> playing trombone --> baking cookies",
"question_type": "order"
},
{
"video": "order_333.mp4",
"duration": 486.18,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"clean and jerk --> paragliding --> making jewelry --> carving pumpkin",
"making jewelry --> paragliding --> carving pumpkin --> clean and jerk",
"making jewelry --> carving pumpkin --> clean and jerk --> paragliding",
"making jewelry --> paragliding --> clean and jerk --> carving pumpkin"
],
"answer": "clean and jerk --> paragliding --> making jewelry --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_149.mp4",
"duration": 694.78,
"question": "Arrange the following events from the video in the correct chronological order: (1)The bowler from the blue team hits an overhand ball to the batter; (2)The bowler raises his hand to claim that the batter has not made a run; (3)The bowler causes the batter to get out by hitting the stumps behind him, the entire team cheers; (4)The batter walks out and another batter from his team comes on the field.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_237.mp4",
"duration": 487.61999999999995,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"stomping grapes --> milking cow --> making jewelry --> water sliding",
"water sliding --> stomping grapes --> making jewelry --> milking cow",
"milking cow --> making jewelry --> stomping grapes --> water sliding",
"milking cow --> water sliding --> making jewelry --> stomping grapes"
],
"answer": "stomping grapes --> milking cow --> making jewelry --> water sliding",
"question_type": "order"
},
{
"video": "order_340.mp4",
"duration": 368.05999999999995,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"cleaning toilet --> paragliding --> abseiling --> jetskiing",
"jetskiing --> paragliding --> cleaning toilet --> abseiling",
"abseiling --> paragliding --> jetskiing --> cleaning toilet",
"abseiling --> paragliding --> cleaning toilet --> jetskiing"
],
"answer": "abseiling --> paragliding --> jetskiing --> cleaning toilet",
"question_type": "order"
},
{
"video": "order_325.mp4",
"duration": 489.15,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"clean and jerk --> shredding paper --> cooking sausages --> zumba",
"zumba --> cooking sausages --> clean and jerk --> shredding paper",
"shredding paper --> clean and jerk --> cooking sausages --> zumba",
"clean and jerk --> zumba --> cooking sausages --> shredding paper"
],
"answer": "clean and jerk --> zumba --> cooking sausages --> shredding paper",
"question_type": "order"
},
{
"video": "order_29.mp4",
"duration": 679.6700000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1) Cheese is sprinkled on the spaghetti; (2) A plate of spaghetti is shown; (3) Vegetables are added to the pot; (4) All of the contents get mixed and cooked.",
"candidates": [
"4->3->2->1",
"2->3->1->4",
"1->2->3->4",
"2->1->3->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_28.mp4",
"duration": 679.71,
"question": "Arrange the following events from the video in the correct chronological order: (1) Cheese is sprinkled on the spaghetti; (2) A plate of spaghetti is shown; (3) Vegetables are added to the pot; (4) All of the contents get mixed and cooked.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"2->1->3->4",
"2->3->1->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_244.mp4",
"duration": 324.64,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"baking cookies --> zumba --> water sliding --> carving pumpkin",
"water sliding --> zumba --> baking cookies --> carving pumpkin",
"water sliding --> baking cookies --> zumba --> carving pumpkin",
"zumba --> water sliding --> carving pumpkin --> baking cookies"
],
"answer": "water sliding --> baking cookies --> zumba --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_27.mp4",
"duration": 679.6700000000001,
"question": "Arrange the following events from the video in the correct chronological order: (1) Cheese is sprinkled on the spaghetti; (2) A plate of spaghetti is shown; (3) Vegetables are added to the pot; (4) All of the contents get mixed and cooked.",
"candidates": [
"2->1->3->4",
"2->3->1->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_26.mp4",
"duration": 688.0699999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Two people are paddling down rapids on a river in canoes; (2)One of them stops at a bank where there is a person in a blue canoe; (3)People are seen in a group large red tube rapids ride; (4)They pass by a building and then fall into the water.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"3->2->1->4",
"2->1->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_25.mp4",
"duration": 718.0699999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Two people are paddling down rapids on a river in canoes; (2)One of them stops at a bank where there is a person in a blue canoe; (3)People are seen in a group large red tube rapids ride; (4)They pass by a building and then fall into the water.",
"candidates": [
"2->1->3->4",
"3->2->1->4",
"4->3->2->1",
"1->2->3->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_24.mp4",
"duration": 688.0699999999999,
"question": "Arrange the following events from the video in the correct chronological order: (1)Two people are paddling down rapids on a river in canoes; (2)One of them stops at a bank where there is a person in a blue canoe; (3)People are seen in a group large red tube rapids ride; (4)They pass by a building and then fall into the water.",
"candidates": [
"2->1->3->4",
"4->3->2->1",
"1->2->3->4",
"3->2->1->4"
],
"answer": "1->2->3->4",
"question_type": "order"
},
{
"video": "order_23.mp4",
"duration": 667.42,
"question": "Arrange the following events from the video in the correct chronological order: (1)Mestre Calango performs by the water on the pier; (2)View of a large body of water with a city around it; (3)Mestre Calango takes his shirt and shoes off and performs on the beach; (4)Credits overlay a black screen.",
"candidates": [
"4->3->2->1",
"1->2->3->4",
"2->1->3->4",
"2->3->1->4"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_22.mp4",
"duration": 667.29,
"question": "Arrange the following events from the video in the correct chronological order: (1)Mestre Calango performs by the water on the pier; (2)View of a large body of water with a city around it; (3)Mestre Calango takes his shirt and shoes off and performs on the beach; (4)Credits overlay a black screen.",
"candidates": [
"2->3->1->4",
"1->2->3->4",
"2->1->3->4",
"4->3->2->1"
],
"answer": "2->1->3->4",
"question_type": "order"
},
{
"video": "order_314.mp4",
"duration": 507.07,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"water sliding --> shredding paper --> pole vault --> milking cow",
"milking cow --> pole vault --> shredding paper --> water sliding",
"pole vault --> water sliding --> milking cow --> shredding paper",
"milking cow --> pole vault --> water sliding --> shredding paper"
],
"answer": "milking cow --> pole vault --> shredding paper --> water sliding",
"question_type": "order"
},
{
"video": "order_20.mp4",
"duration": 679.54,
"question": "Arrange the following events from the video in the correct chronological order: (1)The man explains wakeboarding concepts while his daughter wakeboards in a lake; (2)The video introduction about teaching a child to wakeboard is shown; (3)The girl wakeboards in the lake again while her father continues to explain the teaching techniques; (4)They practice wakeboarding in a pool while discussing techniques.",
"candidates": [
"2->1->4->3",
"1->2->3->4",
"4->3->2->1",
"3->4->1->2"
],
"answer": "2->1->4->3",
"question_type": "order"
},
{
"video": "order_208.mp4",
"duration": 465.28999999999996,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"paragliding --> playing harp --> riding mule --> making jewelry",
"making jewelry --> paragliding --> playing harp --> riding mule",
"paragliding --> riding mule --> playing harp --> making jewelry",
"riding mule --> making jewelry --> paragliding --> playing harp"
],
"answer": "making jewelry --> paragliding --> playing harp --> riding mule",
"question_type": "order"
},
{
"video": "order_233.mp4",
"duration": 490.03,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"javelin throw --> cleaning toilet --> stomping grapes --> riding mule",
"riding mule --> cleaning toilet --> javelin throw --> stomping grapes",
"riding mule --> stomping grapes --> javelin throw --> cleaning toilet",
"javelin throw --> riding mule --> cleaning toilet --> stomping grapes"
],
"answer": "javelin throw --> riding mule --> cleaning toilet --> stomping grapes",
"question_type": "order"
},
{
"video": "order_321.mp4",
"duration": 284.06999999999994,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"making jewelry --> cleaning toilet --> paragliding --> carving pumpkin",
"paragliding --> making jewelry --> cleaning toilet --> carving pumpkin",
"carving pumpkin --> cleaning toilet --> making jewelry --> paragliding",
"paragliding --> cleaning toilet --> carving pumpkin --> making jewelry"
],
"answer": "paragliding --> making jewelry --> cleaning toilet --> carving pumpkin",
"question_type": "order"
},
{
"video": "order_215.mp4",
"duration": 246.05,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"paragliding --> pole vault --> playing harp --> riding mule",
"paragliding --> playing harp --> pole vault --> riding mule",
"playing harp --> pole vault --> riding mule --> paragliding",
"riding mule --> playing harp --> pole vault --> paragliding"
],
"answer": "playing harp --> pole vault --> riding mule --> paragliding",
"question_type": "order"
},
{
"video": "order_240.mp4",
"duration": 505.03999999999996,
"question": "Please identify the option that corresponds to the order of events as they occur in the video.",
"candidates": [
"playing trombone --> playing harp --> paragliding --> stomping grapes",
"playing harp --> playing trombone --> paragliding --> stomping grapes",
"paragliding --> playing harp --> playing trombone --> stomping grapes",
"paragliding --> stomping grapes --> playing trombone --> playing harp"
],
"answer": "playing trombone --> playing harp --> paragliding --> stomping grapes",
"question_type": "order"
},
{
"video": "order_303.mp4",
"duration": 513.05,
"question": "Which of the following options correctly matches the sequence of actions as they actually appear in the video?",
"candidates": [
"javelin throw --> zumba --> riding mule --> paragliding",
"riding mule --> javelin throw --> paragliding --> zumba",
"paragliding --> javelin throw --> riding mule --> zumba",
"riding mule --> zumba --> javelin throw --> paragliding"
],
"answer": "riding mule --> javelin throw --> paragliding --> zumba",
"question_type": "order"
},
{
"video": "order_207.mp4",
"duration": 681.05,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"stomping grapes --> tossing coin --> carving pumpkin --> shredding paper",
"tossing coin --> carving pumpkin --> shredding paper --> stomping grapes",
"tossing coin --> shredding paper --> carving pumpkin --> stomping grapes",
"shredding paper --> carving pumpkin --> stomping grapes --> tossing coin"
],
"answer": "stomping grapes --> tossing coin --> carving pumpkin --> shredding paper",
"question_type": "order"
},
{
"video": "order_310.mp4",
"duration": 505.66,
"question": "Can you tell me which option represents the actual order of actions shown in the video?",
"candidates": [
"zumba --> tossing coin --> abseiling --> clean and jerk",
"abseiling --> zumba --> tossing coin --> clean and jerk",
"clean and jerk --> tossing coin --> abseiling --> zumba",
"tossing coin --> zumba --> clean and jerk --> abseiling"
],
"answer": "clean and jerk --> tossing coin --> abseiling --> zumba",
"question_type": "order"
}
]
================================================
FILE: lmms-eval_videochat/eval_annotations/MLVU_MC/json/6_anomaly_reco.json
================================================
[
{
"video": "surveil_20.mp4",
"duration": 485.17,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"RoadAccidents",
"Shooting",
"Shoplifting",
"Assault"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_24.mp4",
"duration": 227.4,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Fighting",
"Vandalism",
"Robbery",
"Assault"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_22.mp4",
"duration": 195.5,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Fighting",
"RoadAccidents",
"Shooting"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_11.mp4",
"duration": 428.55,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Fighting",
"Abuse",
"Stealing",
"Shooting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_91.mp4",
"duration": 526.55,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Burglary",
"Shooting",
"Normal",
"Vandalism"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_98.mp4",
"duration": 331.59,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Stealing",
"Fighting",
"Assault"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_82.mp4",
"duration": 288.02,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Fighting",
"Abuse",
"RoadAccidents",
"Arson"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_63.mp4",
"duration": 607.54,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Assault",
"Normal",
"RoadAccidents",
"Shooting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_134.mp4",
"duration": 231.97,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Normal",
"Shooting",
"Shooting",
"Robbery"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_103.mp4",
"duration": 3600.03,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Normal",
"Stealing",
"Vandalism",
"Burglary"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_165.mp4",
"duration": 258.93,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Burglary",
"Assault",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_192.mp4",
"duration": 722.77,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shoplifting",
"Burglary",
"Vandalism",
"Shooting"
],
"answer": "Shooting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_140.mp4",
"duration": 318.39,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Robbery",
"Normal",
"Stealing",
"Arson"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_3.mp4",
"duration": 559.28,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Arrest",
"Assault",
"Fighting",
"Burglary"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_35.mp4",
"duration": 332.64,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Vandalism",
"Arson",
"Shoplifting",
"Burglary"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_75.mp4",
"duration": 381.04,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Stealing",
"Burglary",
"Explosion",
"Abuse"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_144.mp4",
"duration": 274.67,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Fighting",
"Arrest",
"Normal",
"Stealing"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_33.mp4",
"duration": 430.79,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Normal",
"Vandalism",
"RoadAccidents",
"Burglary"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_195.mp4",
"duration": 239.49,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Robbery",
"Stealing",
"Assault",
"Vandalism"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_149.mp4",
"duration": 307.32,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Robbery",
"Burglary",
"Explosion",
"Shooting"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_45.mp4",
"duration": 299.83,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shoplifting",
"Abuse",
"Fighting",
"Assault"
],
"answer": "Assault",
"question_type": "anomaly_reco"
},
{
"video": "surveil_50.mp4",
"duration": 212.74,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Normal",
"Arson",
"Robbery",
"Shooting"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_179.mp4",
"duration": 208.89,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shoplifting",
"Robbery",
"Explosion",
"Fighting"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_90.mp4",
"duration": 198.2,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"RoadAccidents",
"Shooting",
"Arson",
"Fighting"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_57.mp4",
"duration": 1355.21,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arrest",
"Abuse",
"Burglary",
"Vandalism"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_180.mp4",
"duration": 196.27,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arrest",
"Assault",
"Burglary",
"Vandalism"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_54.mp4",
"duration": 254.88,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Assault",
"Explosion",
"Shoplifting",
"Normal"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_5.mp4",
"duration": 2016.3,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Normal",
"Stealing",
"Assault",
"Shooting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_46.mp4",
"duration": 539.36,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arrest",
"Explosion",
"Assault",
"RoadAccidents"
],
"answer": "Assault",
"question_type": "anomaly_reco"
},
{
"video": "surveil_171.mp4",
"duration": 602.03,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Stealing",
"Robbery",
"Explosion",
"RoadAccidents"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_188.mp4",
"duration": 386.4,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arson",
"Stealing",
"Vandalism",
"Shoplifting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_138.mp4",
"duration": 330.94,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Shooting",
"Normal",
"Fighting",
"Arson"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_173.mp4",
"duration": 352.88,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Arrest",
"Burglary",
"Explosion",
"Shooting"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_2.mp4",
"duration": 225.89,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Abuse",
"Arrest",
"Arson",
"Stealing"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_30.mp4",
"duration": 387.21,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arrest",
"Arson",
"Robbery",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_58.mp4",
"duration": 184.66,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Stealing",
"RoadAccidents",
"Shooting"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_174.mp4",
"duration": 246.88,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Burglary",
"Robbery",
"Shooting",
"Stealing"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_87.mp4",
"duration": 300.03,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"RoadAccidents",
"Normal",
"Shoplifting",
"Burglary"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_147.mp4",
"duration": 251.2,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Normal",
"Burglary",
"Assault",
"Robbery"
],
"answer": "Assault",
"question_type": "anomaly_reco"
},
{
"video": "surveil_156.mp4",
"duration": 258.17,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"RoadAccidents",
"Stealing",
"Explosion",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_137.mp4",
"duration": 276.23,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Explosion",
"RoadAccidents",
"Shooting",
"Assault"
],
"answer": "Shooting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_155.mp4",
"duration": 266.68,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Abuse",
"RoadAccidents",
"Vandalism"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_124.mp4",
"duration": 1054.78,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Arrest",
"Robbery",
"Fighting"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_178.mp4",
"duration": 210.07,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"RoadAccidents",
"Stealing",
"Vandalism",
"Fighting"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_117.mp4",
"duration": 387.33,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Arrest",
"Arson",
"Shoplifting",
"Abuse"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_86.mp4",
"duration": 252.27,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Stealing",
"RoadAccidents",
"Shooting",
"Shooting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_193.mp4",
"duration": 205.7,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Abuse",
"Normal",
"Robbery",
"Shooting"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_187.mp4",
"duration": 365.38,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Assault",
"Vandalism",
"Burglary"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_133.mp4",
"duration": 297.95,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arson",
"Fighting",
"Abuse",
"Burglary"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_176.mp4",
"duration": 32550.13,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Stealing",
"Shooting",
"Normal",
"Fighting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_60.mp4",
"duration": 274.1,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Robbery",
"Normal",
"Shooting",
"Burglary"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_142.mp4",
"duration": 273.74,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Stealing",
"Explosion",
"Robbery",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_129.mp4",
"duration": 250.19,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Shoplifting",
"Vandalism",
"Robbery"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_123.mp4",
"duration": 600.36,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arson",
"Arrest",
"RoadAccidents",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_198.mp4",
"duration": 299.7,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Normal",
"Robbery",
"Arrest",
"Fighting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_115.mp4",
"duration": 806.48,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shoplifting",
"Stealing",
"RoadAccidents",
"Shooting"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_109.mp4",
"duration": 538.75,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Abuse",
"Burglary",
"Explosion",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_104.mp4",
"duration": 220.13,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Burglary",
"Stealing",
"Arrest",
"Abuse"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_32.mp4",
"duration": 371.85,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Vandalism",
"Shoplifting",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_141.mp4",
"duration": 471.52,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Explosion",
"Stealing",
"Abuse"
],
"answer": "Abuse",
"question_type": "anomaly_reco"
},
{
"video": "surveil_14.mp4",
"duration": 237.27,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Abuse",
"Robbery",
"Stealing",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_62.mp4",
"duration": 231.67,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Assault",
"Vandalism",
"Shooting",
"Robbery"
],
"answer": "Assault",
"question_type": "anomaly_reco"
},
{
"video": "surveil_81.mp4",
"duration": 285.96,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Assault",
"Arson",
"Normal",
"Vandalism"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_164.mp4",
"duration": 288.18,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Assault",
"Normal",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_71.mp4",
"duration": 189.7,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Assault",
"Shoplifting",
"Normal",
"Abuse"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_169.mp4",
"duration": 196.63,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Vandalism",
"Burglary",
"Shoplifting"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_6.mp4",
"duration": 185.6,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Normal",
"Arson",
"Shooting"
],
"answer": "Shooting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_96.mp4",
"duration": 258.08,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Vandalism",
"Shooting",
"Burglary",
"Arrest"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_128.mp4",
"duration": 208.45,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Robbery",
"Fighting",
"Arson",
"Arrest"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_167.mp4",
"duration": 375.89,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"RoadAccidents",
"Normal",
"Robbery",
"Shoplifting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_166.mp4",
"duration": 262.53,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Normal",
"Abuse",
"Shoplifting"
],
"answer": "Abuse",
"question_type": "anomaly_reco"
},
{
"video": "surveil_26.mp4",
"duration": 209.64,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Fighting",
"Vandalism",
"Normal",
"RoadAccidents"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_15.mp4",
"duration": 727.24,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Normal",
"Stealing",
"RoadAccidents",
"Shooting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_189.mp4",
"duration": 220.32,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Shoplifting",
"Arrest",
"Stealing",
"Robbery"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_194.mp4",
"duration": 1320.17,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Stealing",
"Normal",
"Robbery",
"Arrest"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_135.mp4",
"duration": 180.14,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Assault",
"Shooting",
"Vandalism",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_181.mp4",
"duration": 225.7,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shoplifting",
"Explosion",
"RoadAccidents",
"Burglary"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_47.mp4",
"duration": 1420.03,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Vandalism",
"Normal",
"Shoplifting",
"Shooting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_139.mp4",
"duration": 239.03,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Assault",
"Vandalism",
"Normal",
"Robbery"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_161.mp4",
"duration": 299.67,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Stealing",
"Burglary",
"Shooting",
"Arrest"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_190.mp4",
"duration": 216.2,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arrest",
"Shoplifting",
"Normal",
"Burglary"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_88.mp4",
"duration": 248.34,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shoplifting",
"RoadAccidents",
"Fighting",
"Explosion"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_152.mp4",
"duration": 290.9,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Abuse",
"Arrest",
"Normal",
"Robbery"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_43.mp4",
"duration": 409.79,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Normal",
"Arrest",
"Shoplifting",
"Shooting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_160.mp4",
"duration": 431.48,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Stealing",
"Vandalism",
"Arson",
"RoadAccidents"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_28.mp4",
"duration": 915.2,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Normal",
"Burglary",
"Abuse",
"Vandalism"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_130.mp4",
"duration": 375.26,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shoplifting",
"Assault",
"Burglary",
"Robbery"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_68.mp4",
"duration": 221.08,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Robbery",
"Burglary",
"Arrest",
"Arson"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_183.mp4",
"duration": 397.9,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Assault",
"Stealing",
"Shoplifting"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_49.mp4",
"duration": 484.49,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Robbery",
"RoadAccidents",
"Normal",
"Stealing"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_37.mp4",
"duration": 2223.27,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Fighting",
"Shoplifting",
"Explosion"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_118.mp4",
"duration": 435.79,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Robbery",
"Shooting",
"Burglary",
"Arson"
],
"answer": "Shooting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_59.mp4",
"duration": 1200.63,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Burglary",
"Shooting",
"Normal",
"RoadAccidents"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_97.mp4",
"duration": 269.87,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Assault",
"Shooting",
"Fighting",
"RoadAccidents"
],
"answer": "Assault",
"question_type": "anomaly_reco"
},
{
"video": "surveil_186.mp4",
"duration": 403.58,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Abuse",
"Arson",
"Stealing",
"Robbery"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_84.mp4",
"duration": 200.03,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Fighting",
"Normal",
"Vandalism",
"Arson"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_17.mp4",
"duration": 554.26,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Robbery",
"Normal",
"Vandalism",
"Arrest"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_112.mp4",
"duration": 317.3,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shoplifting",
"Fighting",
"Assault",
"Arrest"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_41.mp4",
"duration": 781.1,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Burglary",
"Shoplifting",
"Normal",
"Explosion"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_29.mp4",
"duration": 197.17,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arson",
"Fighting",
"Stealing",
"Arrest"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_53.mp4",
"duration": 300.06,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Fighting",
"Stealing",
"RoadAccidents",
"Robbery"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_184.mp4",
"duration": 218.17,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arrest",
"Shoplifting",
"Fighting",
"Explosion"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_9.mp4",
"duration": 220.08,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Robbery",
"Stealing",
"Fighting"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_76.mp4",
"duration": 308.37,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"RoadAccidents",
"Explosion",
"Vandalism",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_8.mp4",
"duration": 223.87,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Burglary",
"Fighting",
"Robbery",
"Assault"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_151.mp4",
"duration": 228.67,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Stealing",
"Shoplifting",
"Fighting",
"Normal"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_136.mp4",
"duration": 195.84,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Arrest",
"Stealing",
"RoadAccidents"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_100.mp4",
"duration": 4730.04,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Explosion",
"Robbery",
"Arson",
"Shoplifting"
],
"answer": "Explosion",
"question_type": "anomaly_reco"
},
{
"video": "surveil_12.mp4",
"duration": 192.37,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shoplifting",
"Normal",
"Burglary",
"Fighting"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_95.mp4",
"duration": 301.13,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Fighting",
"Vandalism",
"Shooting",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_72.mp4",
"duration": 193.73,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Vandalism",
"Abuse",
"Fighting",
"Arson"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_191.mp4",
"duration": 604.98,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shoplifting",
"RoadAccidents",
"Shooting",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_182.mp4",
"duration": 254.38,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Vandalism",
"Arson",
"Normal",
"Explosion"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_168.mp4",
"duration": 315.14,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Explosion",
"Robbery",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_48.mp4",
"duration": 435.26,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Stealing",
"Shoplifting",
"Arson",
"Normal"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_199.mp4",
"duration": 2102.0,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arrest",
"Explosion",
"Assault",
"Shooting"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_106.mp4",
"duration": 393.35,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Normal",
"Burglary",
"Fighting",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_10.mp4",
"duration": 359.97,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"RoadAccidents",
"Arrest",
"Assault",
"Stealing"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_154.mp4",
"duration": 343.6,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Stealing",
"Arson",
"Assault",
"Vandalism"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_1.mp4",
"duration": 182.88,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Abuse",
"Stealing",
"Shoplifting",
"RoadAccidents"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_79.mp4",
"duration": 360.62,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Fighting",
"Shoplifting",
"RoadAccidents",
"Shooting"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_39.mp4",
"duration": 196.05,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Abuse",
"RoadAccidents",
"Burglary",
"Arson"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_145.mp4",
"duration": 211.07,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Robbery",
"Explosion",
"Abuse"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_23.mp4",
"duration": 949.28,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Abuse",
"Assault",
"Arson",
"Robbery"
],
"answer": "Abuse",
"question_type": "anomaly_reco"
},
{
"video": "surveil_111.mp4",
"duration": 229.6,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"RoadAccidents",
"Normal",
"Explosion",
"Fighting"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_110.mp4",
"duration": 527.86,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arrest",
"Normal",
"Burglary",
"Shoplifting"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_105.mp4",
"duration": 207.17,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Robbery",
"Burglary",
"Arson",
"Shoplifting"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_121.mp4",
"duration": 314.07,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Burglary",
"Shooting",
"Shooting",
"RoadAccidents"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_153.mp4",
"duration": 495.13,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Burglary",
"Shoplifting",
"Shooting",
"Stealing"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_102.mp4",
"duration": 415.0,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Arrest",
"Vandalism",
"Fighting"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_4.mp4",
"duration": 192.03,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Shooting",
"Shooting",
"Robbery",
"Burglary"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_93.mp4",
"duration": 257.58,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arson",
"Normal",
"Shooting",
"RoadAccidents"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_177.mp4",
"duration": 212.51,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Explosion",
"Assault",
"RoadAccidents",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_99.mp4",
"duration": 334.18,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Stealing",
"Arson",
"Burglary",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_67.mp4",
"duration": 543.0,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Vandalism",
"Arson",
"Explosion",
"RoadAccidents"
],
"answer": "Explosion",
"question_type": "anomaly_reco"
},
{
"video": "surveil_119.mp4",
"duration": 519.99,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Shooting",
"Vandalism",
"RoadAccidents",
"Burglary"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_146.mp4",
"duration": 3599.84,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Normal",
"Arson",
"Shooting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_56.mp4",
"duration": 237.76,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Normal",
"Arson",
"Stealing"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_0.mp4",
"duration": 875.51,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Shooting",
"Fighting",
"Assault",
"Arson"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_120.mp4",
"duration": 280.29,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Vandalism",
"Abuse",
"Assault",
"Burglary"
],
"answer": "Abuse",
"question_type": "anomaly_reco"
},
{
"video": "surveil_19.mp4",
"duration": 559.82,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arson",
"Shoplifting",
"Explosion",
"Abuse"
],
"answer": "Abuse",
"question_type": "anomaly_reco"
},
{
"video": "surveil_7.mp4",
"duration": 262.83,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Robbery",
"Abuse",
"Vandalism",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_52.mp4",
"duration": 208.61,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Burglary",
"Shooting",
"Arson"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_13.mp4",
"duration": 248.09,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Stealing",
"Robbery",
"Normal",
"Burglary"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_25.mp4",
"duration": 195.06,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Stealing",
"Robbery",
"Assault"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_42.mp4",
"duration": 240.64,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shoplifting",
"Shooting",
"Explosion",
"Vandalism"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_77.mp4",
"duration": 291.04,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Burglary",
"Normal",
"Abuse",
"Stealing"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_65.mp4",
"duration": 187.53,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Normal",
"Fighting",
"Stealing",
"Shooting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_40.mp4",
"duration": 183.31,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Stealing",
"Shooting",
"RoadAccidents",
"Robbery"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_55.mp4",
"duration": 220.6,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Burglary",
"Assault",
"Vandalism"
],
"answer": "Assault",
"question_type": "anomaly_reco"
},
{
"video": "surveil_38.mp4",
"duration": 355.4,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Assault",
"RoadAccidents",
"Normal",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_69.mp4",
"duration": 229.1,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Vandalism",
"Shooting",
"Burglary",
"Shoplifting"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_185.mp4",
"duration": 907.05,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"RoadAccidents",
"Vandalism",
"Shoplifting",
"Arrest"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_114.mp4",
"duration": 220.27,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Abuse",
"Shooting",
"Explosion",
"Fighting"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_78.mp4",
"duration": 294.96,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"RoadAccidents",
"Arrest",
"Normal",
"Shooting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_157.mp4",
"duration": 255.0,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arrest",
"Explosion",
"Arson",
"Shooting"
],
"answer": "Explosion",
"question_type": "anomaly_reco"
},
{
"video": "surveil_132.mp4",
"duration": 297.57,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"RoadAccidents",
"Assault",
"Stealing",
"Shoplifting"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_83.mp4",
"duration": 511.91,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Normal",
"Stealing",
"Shooting",
"Fighting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_101.mp4",
"duration": 419.73,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Stealing",
"Vandalism",
"Robbery",
"Fighting"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_131.mp4",
"duration": 180.0,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Assault",
"Robbery",
"Fighting",
"Stealing"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_34.mp4",
"duration": 353.57,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Fighting",
"Arrest",
"Shooting",
"RoadAccidents"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_170.mp4",
"duration": 184.83,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Abuse",
"Arrest",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_127.mp4",
"duration": 4218.5,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Arson",
"RoadAccidents",
"Abuse",
"Assault"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_126.mp4",
"duration": 364.04,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Explosion",
"Arson",
"Shooting",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_162.mp4",
"duration": 236.21,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Robbery",
"Shooting",
"Normal"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_74.mp4",
"duration": 524.46,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Robbery",
"Stealing",
"Abuse"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_73.mp4",
"duration": 212.97,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Explosion",
"Shoplifting",
"Arrest",
"Robbery"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_64.mp4",
"duration": 360.0,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arrest",
"Robbery",
"Shooting",
"Explosion"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_51.mp4",
"duration": 180.0,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Normal",
"Fighting",
"Assault",
"Shoplifting"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_196.mp4",
"duration": 312.55,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"RoadAccidents",
"Assault",
"Burglary",
"Normal"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_148.mp4",
"duration": 180.33,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Assault",
"Burglary",
"Vandalism",
"Normal"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_16.mp4",
"duration": 190.11,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Explosion",
"Burglary",
"Vandalism",
"Fighting"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_18.mp4",
"duration": 306.71,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Vandalism",
"Normal",
"Shooting",
"Abuse"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_163.mp4",
"duration": 509.27,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"RoadAccidents",
"Robbery",
"Arrest",
"Shooting"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_175.mp4",
"duration": 437.17,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Robbery",
"Normal",
"Explosion",
"Vandalism"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_36.mp4",
"duration": 222.48,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"RoadAccidents",
"Shoplifting",
"Arrest"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_172.mp4",
"duration": 321.85,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Arson",
"Abuse",
"Fighting",
"Robbery"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_94.mp4",
"duration": 451.44,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Shoplifting",
"Arson",
"Stealing",
"Assault"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_107.mp4",
"duration": 639.59,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Fighting",
"Stealing",
"RoadAccidents",
"Arrest"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_21.mp4",
"duration": 180.0,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Shoplifting",
"Robbery",
"Shooting"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_85.mp4",
"duration": 601.84,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Normal",
"Robbery",
"Burglary",
"Arrest"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_31.mp4",
"duration": 220.96,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Stealing",
"Shooting",
"Normal",
"RoadAccidents"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_44.mp4",
"duration": 206.36,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Explosion",
"RoadAccidents",
"Abuse",
"Arrest"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_150.mp4",
"duration": 223.52,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Fighting",
"Arson",
"RoadAccidents",
"Arrest"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_197.mp4",
"duration": 221.8,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Stealing",
"Assault",
"Robbery",
"Shoplifting"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_61.mp4",
"duration": 396.5,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arrest",
"RoadAccidents",
"Abuse",
"Shooting"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_116.mp4",
"duration": 291.18,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Shooting",
"Arrest",
"Stealing",
"Robbery"
],
"answer": "Arrest",
"question_type": "anomaly_reco"
},
{
"video": "surveil_80.mp4",
"duration": 411.3,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Assault",
"Shooting",
"Shoplifting",
"Vandalism"
],
"answer": "Shoplifting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_27.mp4",
"duration": 318.46,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Stealing",
"Robbery",
"Abuse",
"Shooting"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_113.mp4",
"duration": 202.62,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Vandalism",
"Arson",
"Shooting",
"Stealing"
],
"answer": "Stealing",
"question_type": "anomaly_reco"
},
{
"video": "surveil_70.mp4",
"duration": 185.27,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Normal",
"Explosion",
"Vandalism"
],
"answer": "Vandalism",
"question_type": "anomaly_reco"
},
{
"video": "surveil_122.mp4",
"duration": 240.2,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Robbery",
"Fighting",
"Normal",
"Shooting"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_143.mp4",
"duration": 341.07,
"question": "Is there anything unusual in this surveillance video? If there is, what type of unusual activity is it?",
"candidates": [
"Burglary",
"RoadAccidents",
"Arson",
"Vandalism"
],
"answer": "Arson",
"question_type": "anomaly_reco"
},
{
"video": "surveil_89.mp4",
"duration": 187.37,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"RoadAccidents",
"Fighting",
"Shooting",
"Arson"
],
"answer": "RoadAccidents",
"question_type": "anomaly_reco"
},
{
"video": "surveil_125.mp4",
"duration": 190.32,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Fighting",
"Robbery",
"RoadAccidents",
"Burglary"
],
"answer": "Robbery",
"question_type": "anomaly_reco"
},
{
"video": "surveil_108.mp4",
"duration": 1620.85,
"question": "Is there any abnormality in this surveillance video? If so, what type of abnormality is it?",
"candidates": [
"Shooting",
"Explosion",
"Assault",
"Burglary"
],
"answer": "Burglary",
"question_type": "anomaly_reco"
},
{
"video": "surveil_159.mp4",
"duration": 448.68,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Burglary",
"Assault",
"Normal",
"RoadAccidents"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_158.mp4",
"duration": 213.69,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Shooting",
"Shoplifting",
"Explosion",
"Normal"
],
"answer": "Shooting",
"question_type": "anomaly_reco"
},
{
"video": "surveil_66.mp4",
"duration": 319.72,
"question": "Does this surveillance footage contain any anomalies? If yes, which kind of anomaly?",
"candidates": [
"Assault",
"Explosion",
"Normal",
"Burglary"
],
"answer": "Normal",
"question_type": "anomaly_reco"
},
{
"video": "surveil_92.mp4",
"duration": 437.8,
"question": "Are there any irregularities in this surveillance video? If there are, what sort are they?",
"candidates": [
"Arson",
"Fighting",
"Burglary",
"Assault"
],
"answer": "Fighting",
"question_type": "anomaly_reco"
}
]
================================================
FILE: lmms-eval_videochat/eval_annotations/MLVU_MC/json/7_topic_reasoning.json
================================================
[
{
"video": "AWA-6.mp4",
"duration": 450.0,
"question": "What is the main background of the video?",
"candidates": [
"Grassland",
"Lake",
"Ocean",
"Desert"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_10.mp4",
"duration": 716,
"question": "What color is the scarf worn by the woman in the video?",
"candidates": [
"Red",
"Blue",
"White",
"Pink"
],
"answer": "Blue",
"question_type": "topic_reasoning"
},
{
"video": "movie101_83.mp4",
"duration": 640,
"question": "What type of film is this?",
"candidates": [
"Mystery",
"Comedy",
"Romance",
"Action"
],
"answer": "Romance",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_28.mp4",
"duration": 575,
"question": "What type of film is this?",
"candidates": [
"History",
"Romance",
"Action",
"Comedy"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "movie101_74.mp4",
"duration": 323,
"question": "What is the genre of this film?",
"candidates": [
"Sci-Fi",
"Romance",
"Action",
"Mystery"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "ego_13.mp4",
"duration": 480.02,
"question": "What is the first-person character doing in this video?",
"candidates": [
"Making coffee",
"Making milk",
"Making a cake",
"Baking cookies"
],
"answer": "Making coffee",
"question_type": "topic_reasoning"
},
{
"video": "movie101_86.mp4",
"duration": 611,
"question": "What story does the whole video tell?",
"candidates": [
"Criminal Investigation",
"Wedding Scene",
"Drama Performance",
"Chase Incident"
],
"answer": "Criminal Investigation",
"question_type": "topic_reasoning"
},
{
"video": "movie101_52.mp4",
"duration": 338,
"question": "What season is it in the video?",
"candidates": [
"Summer",
"Spring",
"Autumn",
"Winter"
],
"answer": "Winter",
"question_type": "topic_reasoning"
},
{
"video": "haimian_1.mp4",
"duration": 305,
"question": "What is the background of the video?",
"candidates": [
"Desert",
"Undersea",
"Forest",
"Beach"
],
"answer": "Undersea",
"question_type": "topic_reasoning"
},
{
"video": "movie101_106.mp4",
"duration": 746,
"question": "What is the main setting of the video?",
"candidates": [
"Desert",
"Ocean",
"City",
"Palace"
],
"answer": "Palace",
"question_type": "topic_reasoning"
},
{
"video": "movie101_36.mp4",
"duration": 415,
"question": "What is the setting of the scene in the video?",
"candidates": [
"City",
"Island",
"Snowy Mountain",
"Forest"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_1.mp4",
"duration": 514,
"question": "Who is the main character shown in the video?",
"candidates": [
"A man in a red coat",
"A woman in a green coat",
"A woman in a blue coat",
"A woman in a red coat"
],
"answer": "A woman in a red coat",
"question_type": "topic_reasoning"
},
{
"video": "203.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Science Fiction",
"Animals",
"Romance",
"Comedy"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "ego_17.mp4",
"duration": 480.03,
"question": "What is the weather like in the video?",
"candidates": [
"Blizzard",
"Overcast",
"Sunny",
"Hail"
],
"answer": "Overcast",
"question_type": "topic_reasoning"
},
{
"video": "232.mp4",
"duration": 480.0,
"question": "What is the video related to?",
"candidates": [
"The video is related to traditional culture",
"The video is related to holidays",
"The video is related to nature",
"The video is related to food"
],
"answer": "The video is related to nature",
"question_type": "topic_reasoning"
},
{
"video": "movie101_11.mp4",
"duration": 206,
"question": "What type of movie is the scene in the video?",
"candidates": [
"Historical drama",
"Horror",
"Action",
"Comedy"
],
"answer": "Historical drama",
"question_type": "topic_reasoning"
},
{
"video": "movie101_56.mp4",
"duration": 338,
"question": "What is the main setting of the video?",
"candidates": [
"Countryside",
"Desert",
"City",
"Seaside"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_27.mp4",
"duration": 486,
"question": "What type of film is this?",
"candidates": [
"Comedy",
"Romance",
"History",
"Action"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "movie101_72.mp4",
"duration": 289,
"question": "In what setting does the majority of the video take place?",
"candidates": [
"Ancient Folk",
"Modern City",
"Modern Rural",
"Ancient Palace"
],
"answer": "Ancient Folk",
"question_type": "topic_reasoning"
},
{
"video": "ego_21.mp4",
"duration": 480.02,
"question": "What is the main activity of the first person perspective character in this video?",
"candidates": [
"Mopping the floor",
"Washing clothes",
"Wiping windows",
"Sweeping the floor"
],
"answer": "Sweeping the floor",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_20.mp4",
"duration": 426,
"question": "In what setting does the scene in the video take place?",
"candidates": [
"Forest",
"Snowy Mountain",
"City",
"Island"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "AWC-3.mp4",
"duration": 450.0,
"question": "What is the most frequent scene in the video?",
"candidates": [
"Cliff",
"Forest",
"Desert",
"Ocean"
],
"answer": "Desert",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_1.mp4",
"duration": 381,
"question": "What is the background of the video?",
"candidates": [
"Forest",
"Desert",
"Underwater",
"Beach"
],
"answer": "Underwater",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_22.mp4",
"duration": 441,
"question": "What genre of movie is the clip in the video from?",
"candidates": [
"Horror",
"Action",
"War",
"Documentary"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "215.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"It is a video documenting daily life.",
"It is a video documenting traditional customs.",
"It is a video documenting food.",
"It is a video documenting nature."
],
"answer": "It is a video documenting nature.",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_2.mp4",
"duration": 364,
"question": "What is the background of the video?",
"candidates": [
"Under the sea",
"Beach",
"Desert",
"Forest"
],
"answer": "Under the sea",
"question_type": "topic_reasoning"
},
{
"video": "movie101_89.mp4",
"duration": 677,
"question": "",
"candidates": [
"",
"",
"",
""
],
"answer": "",
"question_type": "topic_reasoning"
},
{
"video": "movie101_69.mp4",
"duration": 385,
"question": "What event is depicted in the entire video?",
"candidates": [
"Police drug bust",
"Technology research",
"Love story",
"Action fight"
],
"answer": "Police drug bust",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_8.mp4",
"duration": 308,
"question": "What type of video is this?",
"candidates": [
"Mystery",
"Romance",
"Science fiction",
"Cartoon animation"
],
"answer": "Cartoon animation",
"question_type": "topic_reasoning"
},
{
"video": "AWA-17.mp4",
"duration": 450.0,
"question": "What type of video is this?",
"candidates": [
"Comedy",
"Animal",
"Science Fiction",
"Action"
],
"answer": "Animal",
"question_type": "topic_reasoning"
},
{
"video": "movie101_8.mp4",
"duration": 750,
"question": "What is the weather in the video scene?",
"candidates": [
"Rainy",
"Sunny",
"Foggy",
"Snowy"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "AWA-12.mp4",
"duration": 450.0,
"question": "What is the main environment in the video?",
"candidates": [
"Grassland",
"Gobi",
"Forest",
"Desert"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_7.mp4",
"duration": 306,
"question": "Who are the characters in the video?",
"candidates": [
"Two cartoon cats",
"Two cartoon cats and two cartoon mice",
"Two cartoon cats and one cartoon mouse",
"One cartoon cat and one cartoon mouse"
],
"answer": "Two cartoon cats and one cartoon mouse",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_12.mp4",
"duration": 587,
"question": "What type of video is this?",
"candidates": [
"Advertisement video",
"Daily life documentary",
"Animation",
"Musical"
],
"answer": "Daily life documentary",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_16.mp4",
"duration": 552,
"question": "In what scenario does the scene in the video take place?",
"candidates": [
"Snow mountain",
"Forest",
"Island",
"City"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_11.mp4",
"duration": 322,
"question": "What character appears the most in the video?",
"candidates": [
"Cartoon fish",
"Cartoon dog",
"Cartoon bear",
"Cartoon mouse"
],
"answer": "Cartoon mouse",
"question_type": "topic_reasoning"
},
{
"video": "movie101_30.mp4",
"duration": 726,
"question": "What is the genre of this movie clip?",
"candidates": [
"Comedy",
"Horror",
"Action",
"War"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_7.mp4",
"duration": 533,
"question": "What type of video is this?",
"candidates": [
"Animation",
"Music video",
"Movie clip",
"Documentary"
],
"answer": "Movie clip",
"question_type": "topic_reasoning"
},
{
"video": "239.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Holiday",
"Nature",
"Food",
"Lifestyle"
],
"answer": "Nature",
"question_type": "topic_reasoning"
},
{
"video": "movie101_16.mp4",
"duration": 740,
"question": "What type of movie is the scene in the video?",
"candidates": [
"Science Fiction",
"Comedy",
"Horror",
"Documentary"
],
"answer": "Documentary",
"question_type": "topic_reasoning"
},
{
"video": "movie101_106.mp4",
"duration": 746,
"question": "What is the style of the characters' clothing in the video?",
"candidates": [
"Ancient royal style",
"Western style",
"Ethnic style",
"Exotic style"
],
"answer": "Ancient royal style",
"question_type": "topic_reasoning"
},
{
"video": "ego_12.mp4",
"duration": 480.02,
"question": "Who is captured in the video?",
"candidates": [
"Elderly",
"Child",
"Man",
"Young woman"
],
"answer": "Young woman",
"question_type": "topic_reasoning"
},
{
"video": "movie101_70.mp4",
"duration": 361,
"question": "What is the genre of this film?",
"candidates": [
"Sci-Fi",
"Romance",
"Action",
"Mystery"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "AWD-7.mp4",
"duration": 450.13,
"question": "What type of video is this?",
"candidates": [
"Traditional Festivals",
"Food Flavor",
"History and Culture",
"Natural Science"
],
"answer": "Natural Science",
"question_type": "topic_reasoning"
},
{
"video": "movie101_73.mp4",
"duration": 421,
"question": "What is the main story of the film?",
"candidates": [
"Blind date",
"Fighting",
"Dancing",
"Eating"
],
"answer": "Blind date",
"question_type": "topic_reasoning"
},
{
"video": "movie101_70.mp4",
"duration": 361,
"question": "What event is mainly narrated in the video?",
"candidates": [
"Theft",
"Romance",
"Dance",
"Chase"
],
"answer": "Theft",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_28.mp4",
"duration": 575,
"question": "Where is the setting of the video story?",
"candidates": [
"City",
"Seaside",
"Desert",
"Countryside"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "ego_12.mp4",
"duration": 480.02,
"question": "What is the weather like in the video?",
"candidates": [
"Snowy",
"Windy",
"Sunny",
"Rainy"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "movie101_12.mp4",
"duration": 346,
"question": "What is the weather in the scene of the video?",
"candidates": [
"Foggy",
"Snowy",
"Rainy",
"Sunny"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "AWD-1.mp4",
"duration": 450.13,
"question": "What type of video is this?",
"candidates": [
"Nature Science",
"History Culture",
"Traditional Festival",
"Food Flavor"
],
"answer": "Nature Science",
"question_type": "topic_reasoning"
},
{
"video": "AWA-4.mp4",
"duration": 450.0,
"question": "What is the main background of the video?",
"candidates": [
"Grassland",
"Lake",
"Ocean",
"Desert"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_22.mp4",
"duration": 335,
"question": "What type of movie is the scene in the video from?",
"candidates": [
"Horror",
"Comedy",
"Historical drama",
"Science fiction"
],
"answer": "Historical drama",
"question_type": "topic_reasoning"
},
{
"video": "ego_2.mp4",
"duration": 480.03,
"question": "What is the weather like in the video?",
"candidates": [
"Sunny",
"Snowing",
"Windy",
"Drizzling"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "movie101_28.mp4",
"duration": 234,
"question": "What is the genre of this movie clip?",
"candidates": [
"Horror",
"Science Fiction",
"War",
"Comedy"
],
"answer": "Science Fiction",
"question_type": "topic_reasoning"
},
{
"video": "AWB-12.mp4",
"duration": 450.0,
"question": "What is the primary environment in the video?",
"candidates": [
"Forest",
"Gobi",
"Desert",
"Grassland"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "AWE-4.mp4",
"duration": 450.0,
"question": "What is the main content of the video related to?",
"candidates": [
"Weather",
"Food",
"Animals",
"Plants"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "movie101_33.mp4",
"duration": 240,
"question": "What is the genre of this movie clip?",
"candidates": [
"Horror",
"Comedy",
"War",
"Action"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "movie101_29.mp4",
"duration": 482,
"question": "What kind of weather is depicted in the video?",
"candidates": [
"Foggy",
"Rainy",
"Snowy",
"Sunny"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "ego_11.mp4",
"duration": 480.02,
"question": "What is the first-person character doing in this first-person video?",
"candidates": [
"Posting sticky notes",
"Hanging wallpaper",
"Posting posters",
"Posting Spring Festival couplets"
],
"answer": "Posting posters",
"question_type": "topic_reasoning"
},
{
"video": "223.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Natural Science",
"Food Culture",
"Natural Animals",
"Traditional Customs"
],
"answer": "Natural Science",
"question_type": "topic_reasoning"
},
{
"video": "movie101_61.mp4",
"duration": 627,
"question": "What type of film is this?",
"candidates": [
"Romance",
"Thriller",
"Mystery",
"Action"
],
"answer": "Romance",
"question_type": "topic_reasoning"
},
{
"video": "movie101_108.mp4",
"duration": 330,
"question": "Which scene does not appear in the video?",
"candidates": [
"Bathroom",
"Playground",
"Mountain road",
"Auditorium"
],
"answer": "Auditorium",
"question_type": "topic_reasoning"
},
{
"video": "AWC-8.mp4",
"duration": 450.0,
"question": "What is the content of this video about?",
"candidates": [
"Birds",
"Dinosaurs",
"Whales",
"Sea Turtles"
],
"answer": "Dinosaurs",
"question_type": "topic_reasoning"
},
{
"video": "movie101_111.mp4",
"duration": 720,
"question": "What is the main plot shown in the video?",
"candidates": [
"Police solving a case",
"Basketball match",
"Friends gathering",
"Traveling and sightseeing"
],
"answer": "Police solving a case",
"question_type": "topic_reasoning"
},
{
"video": "AWA-16.mp4",
"duration": 450.0,
"question": "What is the setting of the video?",
"candidates": [
"Ocean",
"Desert",
"City",
"Grassland"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_40.mp4",
"duration": 450,
"question": "In what setting does the scene in the video take place?",
"candidates": [
"Snowy Mountain",
"Forest",
"Island",
"City"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "203.mp4",
"duration": 480.0,
"question": "What is the main scene in the video?",
"candidates": [
"Sky",
"Barren land",
"Ocean",
"Wetland"
],
"answer": "Barren land",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_2.mp4",
"duration": 509,
"question": "What type of video is this?",
"candidates": [
"Movie clip",
"Music video",
"Cartoon",
"Documentary"
],
"answer": "Movie clip",
"question_type": "topic_reasoning"
},
{
"video": "ego_2.mp4",
"duration": 480.03,
"question": "Where is the person in the video?",
"candidates": [
"Car dealership",
"Fruit store",
"Clothing store",
"Mobile phone store"
],
"answer": "Clothing store",
"question_type": "topic_reasoning"
},
{
"video": "230.mp4",
"duration": 480.0,
"question": "What kind of video is this?",
"candidates": [
"This is a video related to nature",
"This is a video related to traditional culture",
"This is a video related to food",
"This is a video related to transportation"
],
"answer": "This is a video related to nature",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_2.mp4",
"duration": 364,
"question": "What is the protagonist in the video?",
"candidates": [
"Marine animal",
"Bird",
"Bear",
"Dinosaur"
],
"answer": "Marine animal",
"question_type": "topic_reasoning"
},
{
"video": "movie101_68.mp4",
"duration": 445,
"question": "What is the genre of this movie?",
"candidates": [
"Romance",
"Comedy",
"Science Fiction",
"Mystery"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "AWB-10.mp4",
"duration": 450.0,
"question": "What is the main content related to in the video?",
"candidates": [
"Plants",
"Food",
"Humans",
"Animals"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "movie101_19.mp4",
"duration": 477,
"question": "What genre of movie does the scene in the video belong to?",
"candidates": [
"Action",
"Science Fiction",
"Comedy",
"Documentary"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "AWB-1.mp4",
"duration": 450.04,
"question": "What is the main background of the video?",
"candidates": [
"Grassland",
"Ocean",
"Desert",
"Lake"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_6.mp4",
"duration": 400,
"question": "What is the relationship between the cartoon characters of carp, jellyfish, seahorse, and turtle?",
"candidates": [
"Friends",
"Lovers",
"Teacher-student",
"Enemies"
],
"answer": "Friends",
"question_type": "topic_reasoning"
},
{
"video": "movie101_82.mp4",
"duration": 467,
"question": "What story does the entire video tell?",
"candidates": [
"Drama performance",
"Chase event",
"Wedding scene",
"Criminal investigation"
],
"answer": "Criminal investigation",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_26.mp4",
"duration": 522,
"question": "What genre of movie is the clip in the video from?",
"candidates": [
"War",
"Action",
"Horror",
"Documentary"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "movie101_33.mp4",
"duration": 240,
"question": "In which scene does the footage of being chased by bees in the video take place?",
"candidates": [
"Forest",
"City",
"Snowy Mountain",
"Grassland"
],
"answer": "Forest",
"question_type": "topic_reasoning"
},
{
"video": "206.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Science Fiction",
"Comedy",
"Animal",
"Action"
],
"answer": "Animal",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_33.mp4",
"duration": 532,
"question": "Where does the story in the video take place?",
"candidates": [
"Countryside",
"Desert",
"City",
"Seaside"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "AWD-1.mp4",
"duration": 450.13,
"question": "What is the background of the video?",
"candidates": [
"Forest",
"Ocean",
"Desert",
"Glacier"
],
"answer": "Glacier",
"question_type": "topic_reasoning"
},
{
"video": "217.mp4",
"duration": 480.0,
"question": "What is the main scene shown in the video?",
"candidates": [
"Sky",
"Ocean",
"Desert",
"Grassland"
],
"answer": "Ocean",
"question_type": "topic_reasoning"
},
{
"video": "210.mp4",
"duration": 480.0,
"question": "In what environment does the main event in the video occur?",
"candidates": [
"Sky",
"Water area",
"Forest",
"Desert"
],
"answer": "Water area",
"question_type": "topic_reasoning"
},
{
"video": "movie101_22.mp4",
"duration": 335,
"question": "What is the weather in the scene of the video?",
"candidates": [
"Sunny",
"Snowy",
"Foggy",
"Rainy"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "AWB-4.mp4",
"duration": 450.0,
"question": "What is the main setting of the video?",
"candidates": [
"Grassland",
"Desert",
"Ocean",
"City"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "208.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Comedy",
"Animals",
"Science Fiction",
"Action"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "movie101_54.mp4",
"duration": 580,
"question": "Which activity is not included in the police's actions in the video?",
"candidates": [
"Rescuing the injured",
"Gunfight",
"Arresting",
"Escorting prisoners"
],
"answer": "Escorting prisoners",
"question_type": "topic_reasoning"
},
{
"video": "222.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"It's a video related to animals",
"It's a video related to nature",
"It's a video related to food",
"It's a video related to traditional culture"
],
"answer": "It's a video related to nature",
"question_type": "topic_reasoning"
},
{
"video": "movie101_51.mp4",
"duration": 796,
"question": "What kind of clothing is the old man preparing food on the street wearing?",
"candidates": [
"Sportswear",
"Zhongshan suit",
"Japanese clothing",
"Casual wear"
],
"answer": "Japanese clothing",
"question_type": "topic_reasoning"
},
{
"video": "211.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Comedy",
"Animals",
"Science Fiction",
"Romance"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_30.mp4",
"duration": 584,
"question": "Where does the story of the video take place?",
"candidates": [
"Countryside",
"Seaside",
"City",
"Desert"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_3.mp4",
"duration": 405,
"question": "What type of video is this?",
"candidates": [
"Thriller",
"Cartoon animation",
"Mystery",
"Science fiction"
],
"answer": "Cartoon animation",
"question_type": "topic_reasoning"
},
{
"video": "movie101_107.mp4",
"duration": 471,
"question": "What is the main setting of the video?",
"candidates": [
"Marketplace",
"Park",
"Office",
"Stadium"
],
"answer": "Marketplace",
"question_type": "topic_reasoning"
},
{
"video": "movie101_15.mp4",
"duration": 441,
"question": "What type of movie is the scene in the video from?",
"candidates": [
"Comedy",
"Action",
"Horror",
"Modern film"
],
"answer": "Modern film",
"question_type": "topic_reasoning"
},
{
"video": "movie101_67.mp4",
"duration": 530,
"question": "What genre is this movie?",
"candidates": [
"Sci-fi",
"Comedy",
"Thriller",
"Romance"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "movie101_89.mp4",
"duration": 677,
"question": "What type of film is this?",
"candidates": [
"Comedy",
"Mystery",
"Action",
"Documentary"
],
"answer": "Documentary",
"question_type": "topic_reasoning"
},
{
"video": "movie101_15.mp4",
"duration": 441,
"question": "What animal appears in the video?",
"candidates": [
"Dog",
"Cat",
"Llama",
"Sheep"
],
"answer": "Llama",
"question_type": "topic_reasoning"
},
{
"video": "movie101_10.mp4",
"duration": 716,
"question": "What color is the hat worn by the person who appeared in the market?",
"candidates": [
"Red",
"Blue",
"Black",
"Green"
],
"answer": "Red",
"question_type": "topic_reasoning"
},
{
"video": "ego_15.mp4",
"duration": 480.02,
"question": "In this first-person video, what is the first-person character doing?",
"candidates": [
"Making the bed",
"Organizing kitchen utensils",
"Organizing the wardrobe",
"Organizing the bookshelf"
],
"answer": "Organizing kitchen utensils",
"question_type": "topic_reasoning"
},
{
"video": "movie101_20.mp4",
"duration": 476,
"question": "What type of movie is the scene in the video from?",
"candidates": [
"Science fiction",
"Comedy",
"War film",
"Horror"
],
"answer": "War film",
"question_type": "topic_reasoning"
},
{
"video": "ego_1.mp4",
"duration": 480.0,
"question": "In this first-person perspective video, what is the main activity the first-person character is doing?",
"candidates": [
"Sawing wood",
"Watering the lawn",
"Repairing pipes",
"Installing wooden boards"
],
"answer": "Installing wooden boards",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_13.mp4",
"duration": 585,
"question": "What type of video is this?",
"candidates": [
"Cartoon",
"Daily life documentary",
"Advertisement video",
"Music video"
],
"answer": "Daily life documentary",
"question_type": "topic_reasoning"
},
{
"video": "movie101_81.mp4",
"duration": 252,
"question": "In what setting does the entire video take place?",
"candidates": [
"Amusement park",
"Library",
"Wedding",
"School"
],
"answer": "Wedding",
"question_type": "topic_reasoning"
},
{
"video": "ego_17.mp4",
"duration": 480.03,
"question": "What is the main activity of the first person character in this first person video?",
"candidates": [
"Drilling holes in glass",
"Punching holes in wood",
"Drilling holes in diamonds",
"Punching holes in the wall"
],
"answer": "Punching holes in wood",
"question_type": "topic_reasoning"
},
{
"video": "AWA-17.mp4",
"duration": 450.0,
"question": "What is the scene of the video?",
"candidates": [
"Ocean",
"Grassland",
"Desert",
"Sky"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_69.mp4",
"duration": 385,
"question": "What is the genre of this film?",
"candidates": [
"Police and criminals",
"Romance",
"Science fiction",
"Mystery"
],
"answer": "Police and criminals",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_5.mp4",
"duration": 415,
"question": "What is the main task of the three cartoon animals in the entire clip?",
"candidates": [
"Playing",
"Going home",
"Arguing",
"Looking for the cartoon turtle"
],
"answer": "Looking for the cartoon turtle",
"question_type": "topic_reasoning"
},
{
"video": "haimian_1.mp4",
"duration": 305,
"question": "What is the protagonist of the video?",
"candidates": [
"Shark",
"Cartoon Sponge",
"Starfish",
"Octopus"
],
"answer": "Cartoon Sponge",
"question_type": "topic_reasoning"
},
{
"video": "movie101_102.mp4",
"duration": 607,
"question": "Where does the main event in the video take place?",
"candidates": [
"School",
"Temple",
"Desert",
"Forest"
],
"answer": "Temple",
"question_type": "topic_reasoning"
},
{
"video": "movie101_81.mp4",
"duration": 252,
"question": "What type of film is this?",
"candidates": [
"Action",
"Romance",
"Thriller",
"Mystery"
],
"answer": "Romance",
"question_type": "topic_reasoning"
},
{
"video": "movie101_0.mp4",
"duration": 789,
"question": "What is the weather in the video?",
"candidates": [
"Sunny",
"Blizzard",
"Heavy Rain",
"Overcast"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "movie101_85.mp4",
"duration": 560,
"question": "What type of film is this?",
"candidates": [
"Mystery",
"Animation",
"Action",
"Comedy"
],
"answer": "Animation",
"question_type": "topic_reasoning"
},
{
"video": "238.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Science Fiction",
"Comedy",
"Romance",
"Nature Documentary"
],
"answer": "Nature Documentary",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_10.mp4",
"duration": 302,
"question": "Where is the scene of the video?",
"candidates": [
"Street",
"Park",
"Outdoors",
"Inside the house"
],
"answer": "Inside the house",
"question_type": "topic_reasoning"
},
{
"video": "movie101_68.mp4",
"duration": 445,
"question": "In what environment does the story take place?",
"candidates": [
"Snowy mountains",
"Ocean",
"River",
"Forest"
],
"answer": "Snowy mountains",
"question_type": "topic_reasoning"
},
{
"video": "movie101_26.mp4",
"duration": 763,
"question": "What is the genre of this movie clip?",
"candidates": [
"War",
"Documentary Drama",
"Horror",
"Comedy"
],
"answer": "Documentary Drama",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_10.mp4",
"duration": 545,
"question": "What type of video is this?",
"candidates": [
"Documentary",
"Cartoon",
"Movie clip",
"Stage play"
],
"answer": "Movie clip",
"question_type": "topic_reasoning"
},
{
"video": "AWB-15.mp4",
"duration": 450.0,
"question": "What is the main content related to in the video?",
"candidates": [
"Animals",
"Humans",
"Food",
"Plants"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "movie101_84.mp4",
"duration": 387,
"question": "What is the main setting shown in the video?",
"candidates": [
"Desert",
"City",
"Ocean",
"Forest"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "movie101_111.mp4",
"duration": 720,
"question": "What type of video is this?",
"candidates": [
"Comedy",
"Science fiction",
"Action",
"Mystery"
],
"answer": "Mystery",
"question_type": "topic_reasoning"
},
{
"video": "AWA-15.mp4",
"duration": 450.0,
"question": "What is this video related to?",
"candidates": [
"Lifestyle",
"Wildlife",
"Food Flavors",
"Traditional Festivals"
],
"answer": "Wildlife",
"question_type": "topic_reasoning"
},
{
"video": "movie101_14.mp4",
"duration": 465,
"question": "In what setting does the clip in the video take place?",
"candidates": [
"Grassland",
"Forest",
"City",
"Snow Mountain"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "movie101_25.mp4",
"duration": 737,
"question": "What is the genre of the movie clip?",
"candidates": [
"Horror",
"Modern",
"War",
"Comedy"
],
"answer": "Modern",
"question_type": "topic_reasoning"
},
{
"video": "AWA-15.mp4",
"duration": 450.0,
"question": "What is the main background of the video?",
"candidates": [
"Ocean",
"Lake",
"Grassland",
"Desert"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "ego_25.mp4",
"duration": 480.0,
"question": "What is the first-person character mainly doing in this first-person video?",
"candidates": [
"Riding a horse",
"Riding a motorcycle",
"Riding a bicycle",
"Riding a tricycle"
],
"answer": "Riding a bicycle",
"question_type": "topic_reasoning"
},
{
"video": "movie101_34.mp4",
"duration": 748,
"question": "What is the genre of the movie clip?",
"candidates": [
"Horror",
"War",
"Modern",
"Comedy"
],
"answer": "Modern",
"question_type": "topic_reasoning"
},
{
"video": "haimian_8.mp4",
"duration": 445,
"question": "What type of video is this?",
"candidates": [
"Mystery",
"Cartoon",
"Science Fiction",
"Thriller"
],
"answer": "Cartoon",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_6.mp4",
"duration": 400,
"question": "What is the living environment of the giant octopus?",
"candidates": [
"Bright",
"Dark and lightless",
"Spacious",
"Sunny"
],
"answer": "Dark and lightless",
"question_type": "topic_reasoning"
},
{
"video": "haimian_9.mp4",
"duration": 444,
"question": "What type of video is this?",
"candidates": [
"Thriller",
"Mystery",
"Cartoon",
"Sci-Fi"
],
"answer": "Cartoon",
"question_type": "topic_reasoning"
},
{
"video": "9.mp4",
"duration": 471.67,
"question": "What is the main setting of the video?",
"candidates": [
"Library",
"Construction site",
"Stadium",
"School"
],
"answer": "Construction site",
"question_type": "topic_reasoning"
},
{
"video": "AWC-10.mp4",
"duration": 450.0,
"question": "What type of video is this?",
"candidates": [
"Traditional Festivals",
"Natural Science Popularization",
"Food Flavor",
"Historical Culture"
],
"answer": "Natural Science Popularization",
"question_type": "topic_reasoning"
},
{
"video": "movie101_28.mp4",
"duration": 234,
"question": "In what setting does the scene in the video take place?",
"candidates": [
"Island",
"Snowy Mountain",
"Forest",
"City"
],
"answer": "Forest",
"question_type": "topic_reasoning"
},
{
"video": "movie101_71.mp4",
"duration": 227,
"question": "What event is primarily depicted in the video?",
"candidates": [
"Prison Break",
"Romance",
"Dance",
"Competition"
],
"answer": "Prison Break",
"question_type": "topic_reasoning"
},
{
"video": "haimian_7.mp4",
"duration": 426,
"question": "Who is the protagonist of the video?",
"candidates": [
"Cartoon Sponge",
"Cartoon Fish",
"Cartoon Shark",
"Cartoon Jellyfish"
],
"answer": "Cartoon Sponge",
"question_type": "topic_reasoning"
},
{
"video": "movie101_32.mp4",
"duration": 247,
"question": "What is the genre of this film clip?",
"candidates": [
"Comedy",
"War",
"Horror",
"Modern"
],
"answer": "Modern",
"question_type": "topic_reasoning"
},
{
"video": "237.mp4",
"duration": 480.0,
"question": "What is this video related to?",
"candidates": [
"This video is related to holidays",
"This video is related to nature",
"This video is related to traditional culture",
"This video is related to food"
],
"answer": "This video is related to nature",
"question_type": "topic_reasoning"
},
{
"video": "AWE-3.mp4",
"duration": 450.0,
"question": "What type of video is this?",
"candidates": [
"Food Flavor",
"Natural Science",
"Traditional Festivals",
"Historical Culture"
],
"answer": "Natural Science",
"question_type": "topic_reasoning"
},
{
"video": "movie101_71.mp4",
"duration": 227,
"question": "What is the genre of this film?",
"candidates": [
"Romance",
"Comedy",
"Mystery",
"Science Fiction"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "AWG-6.mp4",
"duration": 450.0,
"question": "What is the main content related to in the video?",
"candidates": [
"Animals",
"Food",
"Weather",
"Plants"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_9.mp4",
"duration": 328,
"question": "When does the video take place?",
"candidates": [
"Evening",
"Morning",
"Night",
"Noon"
],
"answer": "Night",
"question_type": "topic_reasoning"
},
{
"video": "movie101_62.mp4",
"duration": 601,
"question": "What is the setting of the video?",
"candidates": [
"Grassland",
"Ocean",
"Desert",
"Forest"
],
"answer": "Ocean",
"question_type": "topic_reasoning"
},
{
"video": "216.mp4",
"duration": 480.0,
"question": "What is the main scene shown in the video?",
"candidates": [
"Ocean",
"Forest",
"Grassland",
"Desert"
],
"answer": "Desert",
"question_type": "topic_reasoning"
},
{
"video": "movie101_87.mp4",
"duration": 599,
"question": "What is the main setting shown in the video?",
"candidates": [
"Hospital",
"Campus",
"Countryside",
"Temple"
],
"answer": "Campus",
"question_type": "topic_reasoning"
},
{
"video": "204.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"This is a video documenting characters",
"This is a video documenting daily life",
"This is a video documenting animals",
"This is a video documenting food"
],
"answer": "This is a video documenting animals",
"question_type": "topic_reasoning"
},
{
"video": "ego_26.mp4",
"duration": 375.37,
"question": "In this first-person perspective video, what is the main activity of the person in the first-person perspective?",
"candidates": [
"Buying a bicycle",
"Repairing a car",
"Repairing a bicycle",
"Riding a bicycle"
],
"answer": "Repairing a bicycle",
"question_type": "topic_reasoning"
},
{
"video": "movie101_41.mp4",
"duration": 319,
"question": "In what setting does the scene in the video take place?",
"candidates": [
"Forest",
"Snowy Mountain",
"City",
"Island"
],
"answer": "Forest",
"question_type": "topic_reasoning"
},
{
"video": "1.mp4",
"duration": 401.23,
"question": "In what environment is the man in the black sweater?",
"candidates": [
"Desert",
"Grassland",
"Forest",
"Ocean"
],
"answer": "Ocean",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_1.mp4",
"duration": 311,
"question": "Where is the main setting of the video?",
"candidates": [
"Desert",
"Grassland",
"Outside the house",
"Inside the house"
],
"answer": "Outside the house",
"question_type": "topic_reasoning"
},
{
"video": "AWB-7.mp4",
"duration": 450.0,
"question": "What type of video is this?",
"candidates": [
"Animals",
"Comedy",
"Science Fiction",
"Romance"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "AWD-7.mp4",
"duration": 450.13,
"question": "What is the main content related to in the video?",
"candidates": [
"Animals",
"Weather",
"Plants",
"Food"
],
"answer": "Animals",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_25.mp4",
"duration": 478,
"question": "What genre of film is the clip in the video from?",
"candidates": [
"War film",
"Horror film",
"Documentary",
"Action film"
],
"answer": "Action film",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_12.mp4",
"duration": 587,
"question": "What does the video primarily describe?",
"candidates": [
"People's daily life",
"A global natural disaster",
"A group of people performing and partying at a concert",
"A vivid outdoor adventure"
],
"answer": "People's daily life",
"question_type": "topic_reasoning"
},
{
"video": "movie101_51.mp4",
"duration": 796,
"question": "In the beginning of the video, under what conditions is the crane working?",
"candidates": [
"At noon",
"In the morning",
"At night",
"In the afternoon"
],
"answer": "At night",
"question_type": "topic_reasoning"
},
{
"video": "xiaoliyu_4.mp4",
"duration": 399,
"question": "What type of video is this?",
"candidates": [
"Mystery",
"Thriller",
"Science Fiction",
"Cartoon animation"
],
"answer": "Cartoon animation",
"question_type": "topic_reasoning"
},
{
"video": "AWB-11.mp4",
"duration": 450.0,
"question": "What is the main environment in the video?",
"candidates": [
"Desert",
"Gobi",
"Forest",
"Grassland"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_78.mp4",
"duration": 607,
"question": "In what setting does the event in the video take place?",
"candidates": [
"Modern Countryside",
"Ancient Folklore",
"Modern City",
"Ancient Palace"
],
"answer": "Ancient Folklore",
"question_type": "topic_reasoning"
},
{
"video": "movie101_25.mp4",
"duration": 737,
"question": "What is the weather in the scene of the video?",
"candidates": [
"Sunny day",
"Snowy day",
"Rainy day",
"Foggy day"
],
"answer": "Sunny day",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_2.mp4",
"duration": 509,
"question": "What time is the overall video set in?",
"candidates": [
"Early Morning",
"Noon",
"Afternoon",
"Evening"
],
"answer": "Evening",
"question_type": "topic_reasoning"
},
{
"video": "AWA-4.mp4",
"duration": 450.0,
"question": "What type of video is this?",
"candidates": [
"Food and Flavors",
"Lifestyle",
"Traditional Festivals",
"Nature and Animals"
],
"answer": "Nature and Animals",
"question_type": "topic_reasoning"
},
{
"video": "215.mp4",
"duration": 480.0,
"question": "Where is the main environment in the video?",
"candidates": [
"River",
"Forest",
"Desert",
"Ocean"
],
"answer": "Desert",
"question_type": "topic_reasoning"
},
{
"video": "AWC-10.mp4",
"duration": 450.0,
"question": "What is the character in the video studying?",
"candidates": [
"Whales",
"Sea Turtles",
"Birds",
"Dinosaurs"
],
"answer": "Dinosaurs",
"question_type": "topic_reasoning"
},
{
"video": "movie101_83.mp4",
"duration": 640,
"question": "What is the main setting shown in the video?",
"candidates": [
"City",
"Forest",
"Ocean",
"Desert"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "ego_11.mp4",
"duration": 480.02,
"question": "Where is the person in the video?",
"candidates": [
"Living room",
"Lounge",
"Hall",
"Living room"
],
"answer": "Living room",
"question_type": "topic_reasoning"
},
{
"video": "movie101_21.mp4",
"duration": 289,
"question": "What is the environment in the scene of the video?",
"candidates": [
"Island",
"City",
"Grassland",
"Forest"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_15.mp4",
"duration": 541,
"question": "What genre of movie is the clip in the video from?",
"candidates": [
"Action movie",
"Documentary",
"Horror movie",
"War movie"
],
"answer": "Action movie",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_20.mp4",
"duration": 426,
"question": "What genre of movie is the clip in the video from?",
"candidates": [
"War",
"Documentary",
"Action",
"Horror"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "movie101_39.mp4",
"duration": 473,
"question": "What is the genre of the film clip?",
"candidates": [
"War",
"Horror",
"Action",
"Comedy"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "4.mp4",
"duration": 603.0,
"question": "What environment does the video start in?",
"candidates": [
"Grassland",
"In the water",
"Desert",
"On a tree"
],
"answer": "In the water",
"question_type": "topic_reasoning"
},
{
"video": "232.mp4",
"duration": 480.0,
"question": "Who is the person doing the explanation in the video?",
"candidates": [
"It's a man",
"It's a woman",
"It's a child",
"It's an old person"
],
"answer": "It's a man",
"question_type": "topic_reasoning"
},
{
"video": "movie101_73.mp4",
"duration": 421,
"question": "What type of film is this?",
"candidates": [
"Sci-fi",
"Comedy",
"Mystery",
"Romance"
],
"answer": "Romance",
"question_type": "topic_reasoning"
},
{
"video": "movie101_23.mp4",
"duration": 473,
"question": "What type of movie is the scene in the video?",
"candidates": [
"Action",
"Horror",
"Science Fiction",
"Documentary"
],
"answer": "Documentary",
"question_type": "topic_reasoning"
},
{
"video": "ego_28.mp4",
"duration": 480.0,
"question": "What is the first-person character mainly doing in this video?",
"candidates": [
"Installing stairs",
"Installing doors and windows",
"Installing air conditioning",
"Installing stove"
],
"answer": "Installing stairs",
"question_type": "topic_reasoning"
},
{
"video": "movie101_30.mp4",
"duration": 726,
"question": "In what scenario does the scene in the video take place?",
"candidates": [
"Snowy Mountain",
"Forest",
"City",
"Island"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_39.mp4",
"duration": 476,
"question": "What type of film is this?",
"candidates": [
"Thriller",
"Romance",
"Comedy",
"Science Fiction"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "movie101_109.mp4",
"duration": 670,
"question": "What is the setting of the video?",
"candidates": [
"Ocean",
"Forest",
"Prairie",
"City"
],
"answer": "Prairie",
"question_type": "topic_reasoning"
},
{
"video": "movie101_57.mp4",
"duration": 753,
"question": "What type of movie is this?",
"candidates": [
"Comedy",
"Mystery",
"Thriller",
"Romance"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "movie101_17.mp4",
"duration": 613,
"question": "What is the weather in the scene of the video?",
"candidates": [
"Rainy",
"Foggy",
"Sunny",
"Snowy"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_4.mp4",
"duration": 421,
"question": "When does the event in the video take place?",
"candidates": [
"Noon",
"Evening",
"Early morning",
"Afternoon"
],
"answer": "Evening",
"question_type": "topic_reasoning"
},
{
"video": "movie101_105.mp4",
"duration": 744,
"question": "What event is primarily narrated in the video?",
"candidates": [
"Tsunami",
"People's beautiful life is disrupted by an earthquake",
"Flood",
"Sandstorm"
],
"answer": "People's beautiful life is disrupted by an earthquake",
"question_type": "topic_reasoning"
},
{
"video": "206.mp4",
"duration": 480.0,
"question": "What is the main scene shown in the video?",
"candidates": [
"Desert",
"Forest",
"City",
"Ocean"
],
"answer": "Forest",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_14.mp4",
"duration": 459,
"question": "What is the setting of the clip in the video?",
"candidates": [
"Grassland",
"Snowy mountain",
"Island",
"City"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "movie101_105.mp4",
"duration": 744,
"question": "What scene is primarily depicted in the video?",
"candidates": [
"City",
"Desert",
"Grassland",
"Forest"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_15.mp4",
"duration": 541,
"question": "In what kind of setting does the scene in the video take place?",
"candidates": [
"Snowy mountain",
"Island",
"City",
"Forest"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "AWA-4.mp4",
"duration": 450.0,
"question": "What is the main subject of the video?",
"candidates": [
"Poultry",
"Various types of dinosaurs",
"Fish",
"Birds"
],
"answer": "Various types of dinosaurs",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_16.mp4",
"duration": 552,
"question": "What genre of movie is the clip in the video from?",
"candidates": [
"Action movie",
"Documentary",
"Horror movie",
"War movie"
],
"answer": "Action movie",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_8.mp4",
"duration": 308,
"question": "What is the protagonist of the video?",
"candidates": [
"Two cartoon cats",
"Two cartoon cats and two cartoon mice",
"Two cartoon cats and a cartoon mouse",
"A cartoon cat and a cartoon mouse"
],
"answer": "A cartoon cat and a cartoon mouse",
"question_type": "topic_reasoning"
},
{
"video": "movie101_108.mp4",
"duration": 330,
"question": "What is the main subject shown in the video?",
"candidates": [
"Two little boys",
"Three little girls",
"Two little girls",
"A boy and a girl"
],
"answer": "Two little girls",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_33.mp4",
"duration": 532,
"question": "What genre does this film belong to?",
"candidates": [
"Thriller",
"Sci-fi",
"Romance",
"Comedy"
],
"answer": "Sci-fi",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_6.mp4",
"duration": 313,
"question": "What type of video is this?",
"candidates": [
"Cartoon animation",
"Thriller mystery",
"Ethics",
"Science fiction"
],
"answer": "Cartoon animation",
"question_type": "topic_reasoning"
},
{
"video": "AWC-7.mp4",
"duration": 450.0,
"question": "What is the content of the video about?",
"candidates": [
"Dinosaurs",
"Birds",
"Sea Turtles",
"Whales"
],
"answer": "Dinosaurs",
"question_type": "topic_reasoning"
},
{
"video": "movie101_40.mp4",
"duration": 450,
"question": "What genre of movie clip is this?",
"candidates": [
"Horror",
"War",
"Action",
"Comedy"
],
"answer": "Action",
"question_type": "topic_reasoning"
},
{
"video": "221.mp4",
"duration": 480.0,
"question": "Who is the protagonist in the video?",
"candidates": [
"An elderly person",
"A woman",
"A man",
"A child"
],
"answer": "A man",
"question_type": "topic_reasoning"
},
{
"video": "AWB-7.mp4",
"duration": 450.0,
"question": "What is the main setting of the video?",
"candidates": [
"Grassland",
"City",
"Ocean",
"Desert"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_87.mp4",
"duration": 599,
"question": "What type of film is this?",
"candidates": [
"Action",
"Mystery",
"Comedy",
"Romance"
],
"answer": "Romance",
"question_type": "topic_reasoning"
},
{
"video": "AWB-10.mp4",
"duration": 450.0,
"question": "What is the main environment in the video?",
"candidates": [
"Desert",
"Grassland",
"Gobi",
"Forest"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "AWC-8.mp4",
"duration": 450.0,
"question": "What type of video is this?",
"candidates": [
"Traditional Festivals",
"Natural Science Popularization",
"History and Culture",
"Food and Flavor"
],
"answer": "Natural Science Popularization",
"question_type": "topic_reasoning"
},
{
"video": "movie101_107.mp4",
"duration": 471,
"question": "Who is the main character of the video?",
"candidates": [
"An old man with white hair",
"A man in green clothes",
"The woman selling fish",
"A young child"
],
"answer": "The woman selling fish",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_34.mp4",
"duration": 492,
"question": "Where is the setting of the video story?",
"candidates": [
"Desert",
"Seaside",
"Countryside",
"City"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "haimian_2.mp4",
"duration": 356,
"question": "Who is the main character of the video?",
"candidates": [
"Cartoon Whale",
"Cartoon Starfish",
"Cartoon Sponge and Cartoon Octopus",
"Cartoon Shark"
],
"answer": "Cartoon Sponge and Cartoon Octopus",
"question_type": "topic_reasoning"
},
{
"video": "movie101_88.mp4",
"duration": 348,
"question": "What is the main setting shown in the video?",
"candidates": [
"Desert",
"Ocean",
"City",
"Forest"
],
"answer": "Ocean",
"question_type": "topic_reasoning"
},
{
"video": "216.mp4",
"duration": 480.0,
"question": "What type of video is this?",
"candidates": [
"Romance",
"Nature",
"Sci-fi",
"Comedy"
],
"answer": "Nature",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_39.mp4",
"duration": 476,
"question": "Where does the story of the video take place?",
"candidates": [
"Countryside",
"Desert",
"Seaside",
"City"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "movie101_35.mp4",
"duration": 481,
"question": "What is the weather in the video scene?",
"candidates": [
"Snowy",
"Foggy",
"Rainy",
"Sunny"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "ego_14.mp4",
"duration": 480.07,
"question": "Where is the person in the video?",
"candidates": [
"Basement",
"Bathroom",
"Bedroom",
"Kitchen"
],
"answer": "Bedroom",
"question_type": "topic_reasoning"
},
{
"video": "ego_3.mp4",
"duration": 418.06,
"question": "In this first-person video, what is the first-person character doing?",
"candidates": [
"Trying on jewelry",
"Trying on a tie",
"Trying on clothes",
"Trying on shoes"
],
"answer": "Trying on clothes",
"question_type": "topic_reasoning"
},
{
"video": "movie101_24.mp4",
"duration": 311,
"question": "What genre of movie is the animation in the video?",
"candidates": [
"War",
"Comedy",
"Science Fiction",
"Horror"
],
"answer": "Science Fiction",
"question_type": "topic_reasoning"
},
{
"video": "ego_21.mp4",
"duration": 480.02,
"question": "Where is the person in the video?",
"candidates": [
"Bedroom",
"Living room",
"Kitchen",
"Bathroom"
],
"answer": "Living room",
"question_type": "topic_reasoning"
},
{
"video": "ego_10.mp4",
"duration": 420.57,
"question": "What is the first-person character doing in this first-person video?",
"candidates": [
"Textile making",
"Glass making",
"Woodworking",
"Ceramics making"
],
"answer": "Woodworking",
"question_type": "topic_reasoning"
},
{
"video": "movie101_14.mp4",
"duration": 465,
"question": "What is the weather in the scene of the video?",
"candidates": [
"Rainy",
"Foggy",
"Sunny",
"Snowy"
],
"answer": "Sunny",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_18.mp4",
"duration": 568,
"question": "In what setting does the scene in the video take place?",
"candidates": [
"City",
"Island",
"Forest",
"Snowy mountain"
],
"answer": "City",
"question_type": "topic_reasoning"
},
{
"video": "AWA-5.mp4",
"duration": 450.0,
"question": "What is the main setting of the video?",
"candidates": [
"City",
"Grassland",
"Desert",
"Ocean"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "movie101_85.mp4",
"duration": 560,
"question": "What is the main setting shown in the video?",
"candidates": [
"City",
"Desert",
"Forest",
"Ocean"
],
"answer": "Ocean",
"question_type": "topic_reasoning"
},
{
"video": "en_tv_38.mp4",
"duration": 410,
"question": "What type of film is this?",
"candidates": [
"Thriller",
"Comedy",
"Action",
"Romance"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "movie101_62.mp4",
"duration": 601,
"question": "What type of film is this?",
"candidates": [
"Action",
"Thriller",
"Mystery",
"Comedy"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "231.mp4",
"duration": 480.0,
"question": "What scenery is mainly shown in the video?",
"candidates": [
"Space scenery",
"Grassland scenery",
"Ocean scenery",
"Desert scenery"
],
"answer": "Space scenery",
"question_type": "topic_reasoning"
},
{
"video": "AWD-5.mp4",
"duration": 450.09,
"question": "What type of video is this?",
"candidates": [
"Traditional Festivals",
"Natural Science",
"Food Flavor",
"History and Culture"
],
"answer": "Natural Science",
"question_type": "topic_reasoning"
},
{
"video": "movie101_34.mp4",
"duration": 748,
"question": "In what setting does the scene in the video take place?",
"candidates": [
"Forest",
"Island",
"Snowy Mountain",
"Town"
],
"answer": "Town",
"question_type": "topic_reasoning"
},
{
"video": "tomjerry_2.mp4",
"duration": 303,
"question": "What is the main character of the video?",
"candidates": [
"A cartoon cat and a cartoon mouse",
"Three cats",
"One cat and two mice",
"Three mice"
],
"answer": "A cartoon cat and a cartoon mouse",
"question_type": "topic_reasoning"
},
{
"video": "movie101_79.mp4",
"duration": 590,
"question": "What type of film is this?",
"candidates": [
"Thriller",
"Action",
"Mystery",
"Comedy"
],
"answer": "Comedy",
"question_type": "topic_reasoning"
},
{
"video": "AWB-15.mp4",
"duration": 450.0,
"question": "What is the main environment in the video?",
"candidates": [
"Gobi",
"Forest",
"Grassland",
"Desert"
],
"answer": "Grassland",
"question_type": "topic_reasoning"
},
{
"video": "haimian_4.mp4",
"duration": 323,
"question": "What is the background of the video?",
"candidates": [
"Forest",
"Desert",
"Gobi",
"Under the sea"
],
"answer": "Under the sea",
"question_type": "topic_reasoning"
},
{
"video": "game_21.mp4",
"duration": 275.07,
"question": "What is the type of this video?",
"candidates": [
"Movie trailer",
"Documentary",
"Tutorial",
"Video game"
],
"answer": "Video game",
"question_type": "topic_reasoning"
},
{
"video": "game_5.mp4",
"duration": 208.05,
"question": "What are the characteristics of the object being built in the video?",
"candidates": [
"green",
"high",
"solid",
"cylindroid"
],
"answer": "high",
"question_type": "topic_reasoning"
},
{
"video": "game_29.mp4",
"duration": 275.11,
"question": "Where is the game protagonist setting up the armor equipment?",
"candidates": [
"In an open field outside",
"In a small underground bunker",
"Inside a large indoor building",
"On the top of a high mountain"
],
"answer": "Inside a large indoor building",
"question_type": "topic_reasoning"
},
{
"video": "game_19.mp4",
"duration": 203.73,
"question": "What animal appears in this gameplay video?",
"candidates": [
"Parrots",
"Wolves",
"Ocelots",
"Chickens"
],
"answer": "Ocelots",
"question_type": "topic_reasoning"
},
{
"video": "game_6.mp4",
"duration": 195.81,
"question": "What is the object built by the main character in the video?",
"candidates": [
"Castle",
"fort",
"Tent",
"fireworks"
],
"answer": "Tent",
"question_type": "topic_reasoning"
},
{
"video": "game_43.mp4",
"duration": 229.04,
"question": "What is the person in the game doing?",
"candidates": [
"Building an automatic farm",
"Fighting with a game boss",
"Exploring a haunted house",
"Designing a character's outfit"
],
"answer": "Building an automatic farm",
"question_type": "topic_reasoning"
},
{
"video": "game_4.mp4",
"duration": 187.62,
"question": "What color is the object being built in the video?",
"candidates": [
"green",
"blue",
"white",
"brown"
],
"answer": "brown",
"question_type": "topic_reasoning"
},
{
"video": "game_41.mp4",
"duration": 312.45,
"question": "What is this video about?",
"candidates": [
"A person in the game planting trees by the lake",
"A person in the game building a structure by the lake",
"A person in the game taking care of pets",
"A documentary about humans and nature"
],
"answer": "A person in the game building a structure by the lake",
"question_type": "topic_reasoning"
},
{
"video": "game_20.mp4",
"duration": 299.26,
"question": "What color is the building that appears in this gameplay video?",
"candidates": [
"Dark blue",
"Silver-gray",
"Red",
"Brown"
],
"answer": "Silver-gray",
"question_type": "topic_reasoning"
},
{
"video": "game_22.mp4",
"duration": 253.68,
"question": "What is being built in this video?",
"candidates": [
"A floating platform",
"An underwater base",
"A treehouse",
"A fortress"
],
"answer": "A floating platform",
"question_type": "topic_reasoning"
},
{
"video": "game_18.mp4",
"duration": 236.31,
"question": "What animal appears in this gameplay video?",
"candidates": [
"Wolf",
"Horse",
"Sheep",
"Wild boar"
],
"answer": "Wild boar",
"question_type": "topic_reasoning"
},
{
"video": "game_15.mp4",
"duration": 335.32,
"question": "What is the object being built in this video?",
"candidates": [
"A wall and a trench behind it",
"A house and a garden",
"A bridge and a river",
"A tower and a moat"
],
"answer": "A wall and a trench behind it",
"question_type": "topic_reasoning"
},
{
"video": "game_30.mp4",
"duration": 200.85,
"question": "What is the main outdoor setting where the game protagonist is located?",
"candidates": [
"Desert",
"Forest",
"Mountain",
"City"
],
"answer": "Forest",
"question_type": "topic_reasoning"
},
{
"video": "game_11.mp4",
"duration": 264.89,
"question": "What is the type of this video?",
"candidates": [
"Minecraft gameplay video",
"Minecraft strategy guide",
"Minecraft mod review",
"Minecraft developer diary"
],
"answer": "Minecraft gameplay video",
"question_type": "topic_reasoning"
},
{
"video": "game_17.mp4",
"duration": 182.65,
"question": "Where is the setting of this gameplay video?",
"candidates": [
"Jungle",
"Mountains",
"Forest",
"Desert"
],
"answer": "Desert",
"question_type": "topic_reasoning"
},
{
"video": "game_46.mp4",
"duration": 208.01,
"question": "What is this video about?",
"candidates": [
"A person performing a song",
"A person doing a live product promotion",
"A game tutorial video",
"A cartoon animation"
],
"answer": "A game tutorial video",
"question_type": "topic_reasoning"
},
{
"video": "game_32.mp4",
"duration": 278.83,
"question": "What is the theme of this video?",
"candidates": [
"A person singing",
"A cartoon animation",
"A person demonstrating how they play a game",
"A person live-streaming a sale"
],
"answer": "A person demonstrating how they play a game",
"question_type": "topic_reasoning"
},
{
"video": "game_38.mp4",
"duration": 331.3,
"question": "What is the video mainly about?",
"candidates": [
"A person livestreaming product promotion",
"A person demonstrating how to build a castle in the game.",
"A person demonstrating a jungle crossing in the game",
"A person demonstrating a castle adventure in the game"
],
"answer": "A person demonstrating a castle adventure in the game",
"question_type": "topic_reasoning"
},
{
"video": "game_2.mp4",
"duration": 290.83,
"question": "What did the man in the video put on the constructed device?",
"candidates": [
"cups",
"explosive boxes",
"bags",
"guns"
],
"answer": "explosive boxes",
"question_type": "topic_reasoning"
},
{
"video": "game_13.mp4",
"duration": 193.84,
"question": "What is the main action performed by the character in this video?",
"candidates": [
"Attacking",
"Crafting",
"Trading",
"Exploring"
],
"answer": "Attacking",
"question_type": "topic_reasoning"
},
{
"video": "game_26.mp4",
"duration": 438.53,
"question": "What is the indoor scene where the game protagonist is located?",
"candidates": [
"A high school classroom",
"A luxury penthouse",
"A local library",
"A McDonald's fast food restaurant"
],
"answer": "A McDonald's fast food restaurant",
"question_type": "topic_reasoning"
},
{
"video": "game_7.mp4",
"duration": 258.07,
"question": "What is the object built by the mission in the video?",
"candidates": [
"Chair",
"Gun",
"Table",
"flying device"
],
"answer": "Table",
"question_type": "topic_reasoning"
},
{
"video": "game_33.mp4",
"duration": 244.93,
"question": "What is the game player doing?",
"candidates": [
"Building a windmill",
"Constructing a tank",
"Performing a flight",
"Digging a hole"
],
"answer": "Building a windmill",
"question_type": "topic_reasoning"
},
{
"video": "game_37.mp4",
"duration": 259.07,
"question": "What is the video demonstrating?",
"candidates": [
"How to build a pool in the game",
"How to build a tiny fishing hut in the game",
"A cartoon animation",
"How to catch fish in the game"
],
"answer": "How to build a tiny fishing hut in the game",
"question_type": "topic_reasoning"
},
{
"video": "game_9.mp4",
"duration": 240.91,
"question": "What is the character always holding in their hand in the game?",
"candidates": [
"Pickaxe",
"Gun",
"Map",
"Torch"
],
"answer": "Map",
"question_type": "topic_reasoning"
},
{
"video": "game_44.mp4",
"duration": 340.61,
"question": "What is this video about?",
"candidates": [
"A game player digging underground tunnels in the game",
"A game player fighting monsters in the game",
"A game player building a castle in the game",
"A game player exploring a forest in the game"
],
"answer": "A game player digging underground tunnels in the game",
"question_type": "topic_reasoning"
},
{
"video": "game_35.mp4",
"duration": 202.64,
"question": "What is the protagonist mainly constructing in the game?",
"candidates": [
"Christmas Doorbell",
"Digging basements",
"Building houses",
"Tanks"
],
"answer": "Christmas Doorbell",
"question_type": "topic_reasoning"
},
{
"video": "game_42.mp4",
"duration": 1173.14,
"question": "What is this video about?",
"candidates": [
"A person demonstrating how to win in a shooting game",
"Someone live-streaming a singing performance",
"A person demonstrating how to cook in the game",
"A person demonstrating how to build a device in the game"
],
"answer": "A person demonstrating how to build a device in the game",
"question_type": "topic_reasoning"
},
{
"video": "game_10.mp4",
"duration": 291.27,
"question": "What is the character building in the game?",
"candidates": [
"Pool",
"House",
"Bridge",
"Garden"
],
"answer": "Pool",
"question_type": "topic_reasoning"
},
{
"video": "game_24.mp4",
"duration": 350.23,
"question": "What is the protagonist of the game doing?",
"candidates": [
"Building a bridge and setting up a trap",
"Planting trees and setting a fence",
"Constructing a house and installing windows",
"Digging a well and placing a ladder"
],
"answer": "Digging a well and placing a ladder",
"question_type": "topic_reasoning"
},
{
"video": "game_36.mp4",
"duration": 264.5,
"question": "What scene is demonstrated in the video?",
"candidates": [
"How to build a bunker in the game",
"How to build a bee farm in the game",
"Educational content about the social behavior of bees",
"How to get along with bees"
],
"answer": "How to build a bee farm in the game",
"question_type": "topic_reasoning"
},
{
"video": "game_25.mp4",
"duration": 189.89,
"question": "What is the game character doing?",
"candidates": [
"Making a scarecrow",
"Fighting with enemies",
"Collecting resources",
"Building a house"
],
"answer": "Making a scarecrow",
"question_type": "topic_reasoning"
},
{
"video": "game_12.mp4",
"duration": 224.47,
"question": "What is the main action performed by the character in this video?",
"candidates": [
"Mining",
"Farming",
"Building",
"Exploring"
],
"answer": "Building",
"question_type": "topic_reasoning"
},
{
"video": "game_16.mp4",
"duration": 294.99,
"question": "Question: What is the object being built in this video?",
"candidates": [
"An underground bunker",
"A floating platform",
"A rooftop garden",
"A water tower"
],
"answer": "A floating platform",
"question_type": "topic_reasoning"
},
{
"video": "game_31.mp4",
"duration": 309.43,
"question": "What does this appear to be a video of?",
"candidates": [
"A game match",
"Game tutorial video",
"Behind-the-scenes of game development",
"Game developer interview"
],
"answer": "Game tutorial video",
"question_type": "topic_reasoning"
},
{
"video": "game_27.mp4",
"duration": 200.37,
"question": "What is the protagonist of the game doing?",
"candidates": [
"Constructing a sandcastle on the beach",
"Setting up a tent at a coastal campsite",
"Exploring a cave",
"Building a house by the seaside"
],
"answer": "Building a house by the seaside",
"question_type": "topic_reasoning"
},
{
"video": "game_14.mp4",
"duration": 260.09,
"question": "What is the object being built in this video?",
"candidates": [
"Farm",
"Bridge",
"Tower",
"Pool"
],
"answer": "Farm",
"question_type": "topic_reasoning"
},
{
"video": "game_34.mp4",
"duration": 232.22,
"question": "What is the protagonist mainly doing in the game?",
"candidates": [
"Raising pets",
"Planting trees",
"Constructing buildings",
"Digging holes"
],
"answer": "Constructing buildings",
"question_type": "topic_reasoning"
},
{
"video": "game_28.mp4",
"duration": 194.1,
"question": "What is the protagonist of the game doing?",
"candidates": [
"Building a house",
"Building a blinking device",
"Digging a tunnel",
"Building a scarecrow"
],
"answer": "Building a blinking device",
"question_type": "topic_reasoning"
},
{
"video": "game_1.mp4",
"duration": 313.33,
"question": "What shape is the object built by the main character in the video?",
"candidates": [
"circle",
"triangle",
"rectangle",
"heart"
],
"answer": "heart",
"question_type": "topic_reasoning"
},
{
"video": "game_3.mp4",
"duration": 316.99,
"question": "What is my main action in the video?",
"candidates": [
"sitting",
"fishing",
"running",
"climbing"
],
"answer": "running",
"question_type": "topic_reasoning"
},
{
"video": "game_23.mp4",
"duration": 247.36,
"question": "What is the type of this video?",
"candidates": [
"Movie trailer",
"Video game",
"Documentary",
"Tutorial"
],
"answer": "Video game",
"question_type": "topic_reasoning"
}
]
================================================
FILE: lmms-eval_videochat/eval_annotations/MVBench/README.md
================================================
---
license: mit
extra_gated_prompt: >-
You agree to not use the dataset to conduct experiments that cause harm to
human subjects. Please note that the data in this dataset may be subject to
other agreements. Before using the data, be sure to read the relevant
agreements carefully to ensure compliant use. Video copyrights belong to the
original video creators or platforms and are for academic research use only.
task_categories:
- visual-question-answering
- video-classification
extra_gated_fields:
Name: text
Company/Organization: text
Country: text
E-Mail: text
modalities:
- Video
- Text
configs:
- config_name: action_sequence
data_files: json/action_sequence.json
- config_name: moving_count
data_files: json/moving_count.json
- config_name: action_prediction
data_files: json/action_prediction.json
- config_name: episodic_reasoning
data_files: json/episodic_reasoning.json
- config_name: action_antonym
data_files: json/action_antonym.json
- config_name: action_count
data_files: json/action_count.json
- config_name: scene_transition
data_files: json/scene_transition.json
- config_name: object_shuffle
data_files: json/object_shuffle.json
- config_name: object_existence
data_files: json/object_existence.json
- config_name: fine_grained_pose
data_files: json/fine_grained_pose.json
- config_name: unexpected_action
data_files: json/unexpected_action.json
- config_name: moving_direction
data_files: json/moving_direction.json
- config_name: state_change
data_files: json/state_change.json
- config_name: object_interaction
data_files: json/object_interaction.json
- config_name: character_order
data_files: json/character_order.json
- config_name: action_localization
data_files: json/action_localization.json
- config_name: counterfactual_inference
data_files: json/counterfactual_inference.json
- config_name: fine_grained_action
data_files: json/fine_grained_action.json
- config_name: moving_attribute
data_files: json/moving_attribute.json
- config_name: egocentric_navigation
data_files: json/egocentric_navigation.json
language:
- en
size_categories:
- 1K<n<10K
You agree to not use the dataset to conduct experiments that cause harm to
human subjects. Please note that the data in this dataset may be subject to
other agreements. Before using the data, be sure to read the relevant
agreements carefully to ensure compliant use. Video copyrights belong to the
original video creators or platforms and are for academic research use only.
task_categories:
- visual-question-answering
extra_gated_fields:
Name: text
Company/Organization: text
Country: text
E-Mail: text
modalities:
- Video
- Text
configs:
- config_name: charades
data_files: json/temporal_grounding_charades.json
language:
- en
size_categories:
- 1K<n<10K
def parse_eval_args() -> argparse.Namespace:
parser = argparse.ArgumentParser(formatter_class=argparse.RawTextHelpFormatter)
parser.add_argument("--config", default="", help="Path to a yaml file specifying all eval arguments, will ignore cli arguments if specified")
parser.add_argument("--model", default="hf", help="Name of model e.g. `hf`")
parser.add_argument(
"--tasks",
default=None,
help="To get full list of tasks, use the command lmms-eval --tasks list",
)
parser.add_argument(
"--model_args",
default="",
help="String arguments for model, e.g. `pretrained=EleutherAI/pythia-160m,dtype=float32`",
)
parser.add_argument(
"--num_fewshot",
type=int,
default=None,
help="Number of examples in few-shot context",
)
parser.add_argument("--batch_size", type=str, default=1)
parser.add_argument(
"--device",
type=str,
default=None,
help="Device to use (e.g. cuda, cuda:0, cpu)",
)
parser.add_argument(
"--output_path",
default=None,
type=str,
metavar="= [dir/file.jsonl] [DIR]",
help="The path to the output file where the result metrics will be saved. If the path is a directory and log_samples is true, the results will be saved in the directory. Else the parent directory will be used.",
)
parser.add_argument(
"--limit",
type=float,
default=None,
help="Limit the number of examples per task. " "If <1, limit is a percentage of the total number of examples.",
)
parser.add_argument(
"--check_integrity",
action="store_true",
help="Whether to run the relevant part of the test suite for the tasks",
)
parser.add_argument(
"--show_task_to_terminal",
action="store_true",
default=False,
help="Prints the prompt for the first few documents",
)
parser.add_argument(
"--log_samples",
action="store_true",
default=False,
help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis",
)
parser.add_argument(
"--wandb_log_samples",
action="store_true",
default=False,
help="If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis to Weights and Biases",
)
parser.add_argument(
"--log_samples_suffix",
type=str,
default="model_outputs",
help="Specify a suffix for the log_samples file name.",
)
parser.add_argument(
"--predict_only",
"-x",
action="store_true",
default=False,
help="Use with --log_samples. Only model outputs will be saved and metrics will not be evaluated.",
)
parser.add_argument(
"--show_config",
action="store_true",
default=False,
        help="If True, shows the full config of all tasks at the end of the evaluation.",
)
parser.add_argument(
"--include_path",
type=str,
default=None,
help="Additional path to include if there are external tasks to include.",
)
parser.add_argument(
"--gen_kwargs",
default="",
help=("String arguments for model generation on greedy_until tasks," " e.g. `temperature=0,top_k=0,top_p=0`"),
)
parser.add_argument(
"--verbosity",
type=str,
default="INFO",
        help="Logging level passed to the evaluation logger, e.g. DEBUG, INFO, WARNING, ERROR.",
)
parser.add_argument(
"--wandb_args",
default="",
        help="Comma separated string arguments passed to wandb.init, e.g. `project=lmms-eval,job_type=eval`",
)
parser.add_argument(
"--timezone",
default="Asia/Singapore",
help="Timezone for datetime string, e.g. Asia/Singapore, America/New_York, America/Los_Angeles",
)
args = parser.parse_args()
return args
def cli_evaluate(args: Union[argparse.Namespace, None] = None) -> None:
    # print("Starting!")
if not args:
args = parse_eval_args()
# Check if no arguments were passed after parsing
if len(sys.argv) == 1:
print("┌───────────────────────────────────────────────────────────────────────────────┐")
print("│ Please provide arguments to evaluate the model. e.g. │")
print("│ `lmms-eval --model llava --model_path liuhaotian/llava-v1.6-7b --tasks okvqa` │")
print("│ Use `lmms-eval --help` for more information. │")
print("└───────────────────────────────────────────────────────────────────────────────┘")
sys.exit(1)
# reset logger
eval_logger.remove()
eval_logger.add(sys.stdout, colorize=True, level=args.verbosity)
eval_logger.info(f"Verbosity set to {args.verbosity}")
os.environ["TOKENIZERS_PARALLELISM"] = "false"
args_list = []
results_list = []
if args.config:
if not os.path.exists(args.config):
raise ValueError(f"Config file does not exist: {args.config}")
with open(args.config, "r") as file:
config_args = yaml.safe_load(file)
        config_args = config_args if isinstance(config_args, list) else [config_args]
# multiple configs, create args list first
for config in config_args:
args_copy = argparse.Namespace(**vars(args))
for key, value in config.items():
setattr(args_copy, key, value)
args_list.append(args_copy)
else:
args_list.append(args)
# initialize Accelerator
kwargs_handler = InitProcessGroupKwargs(timeout=datetime.timedelta(seconds=60000))
accelerator = Accelerator(kwargs_handlers=[kwargs_handler])
    is_main_process = accelerator.is_main_process
for args in args_list:
try:
# if is_main_process and args.wandb_args: # thoughtfully we should only init wandb once, instead of multiple ranks to avoid network traffics and unwanted behaviors.
# wandb_logger = WandbLogger(args)
results, samples = cli_evaluate_single(args)
results_list.append(results)
accelerator.wait_for_everyone()
# if is_main_process and args.wandb_args:
# wandb_logger.post_init(results)
# wandb_logger.log_eval_result()
# if args.wandb_log_samples and samples is not None:
# wandb_logger.log_eval_samples(samples)
# wandb_logger.finish()
except Exception as e:
            # Print the traceback once, then log a one-line summary of the failure.
            traceback.print_exc()
            eval_logger.error(f"Error during evaluation: {e}")
results_list.append(None)
for args, results in zip(args_list, results_list):
# cli_evaluate will return none if the process is not the main process (rank 0)
if results is not None:
print_results(args, results)
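# Illustrative sketch, not part of the original module: how the `--config` branch of
# `cli_evaluate` turns one or more config entries into per-run argument namespaces.
# The helper name `expand_config_args` and the task names are hypothetical, and the
# YAML file is assumed to have been parsed already (a dict or a list of dicts).

```python
import argparse


def expand_config_args(base, config_args):
    """Return one Namespace per config entry; each entry overrides a copy of `base`."""
    # A single mapping is treated as a one-element list, as in cli_evaluate.
    if not isinstance(config_args, list):
        config_args = [config_args]
    expanded = []
    for config in config_args:
        # Copy the CLI namespace so that runs do not share state.
        args_copy = argparse.Namespace(**vars(base))
        for key, value in config.items():
            setattr(args_copy, key, value)
        expanded.append(args_copy)
    return expanded


# Hypothetical values, for illustration only.
base = argparse.Namespace(model="hf", tasks=None, batch_size=1)
runs = expand_config_args(base, [{"tasks": "mvbench"}, {"tasks": "videomme", "batch_size": 2}])
```

# Keys missing from a config entry keep the value parsed from the command line,
# and the base namespace itself is left unchanged.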
def cli_evaluate_single(args: Union[argparse.Namespace, None] = None) -> tuple:
    # Returns (results, samples) on the main process and (None, None) otherwise.
initialize_tasks(args.verbosity)
if args.predict_only:
args.log_samples = True
if (args.log_samples or args.predict_only) and not args.output_path:
raise ValueError("Specify --output_path if providing --log_samples or --predict_only")
if args.limit:
        eval_logger.warning("--limit SHOULD ONLY BE USED FOR TESTING. REAL METRICS SHOULD NOT BE COMPUTED USING LIMIT.")
if args.include_path is not None:
eval_logger.info(f"Including path: {args.include_path}")
include_path(args.include_path)
if os.environ.get("LMMS_EVAL_PLUGINS", None):
for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
package_tasks_location = importlib.util.find_spec(f"{plugin}.tasks").submodule_search_locations[0]
            eval_logger.info(f"Including path: {package_tasks_location}")
include_path(package_tasks_location)
if args.tasks is None:
task_names = ALL_TASKS
elif args.tasks == "list":
eval_logger.info("Available Tasks:\n - {}".format(f"\n - ".join(sorted(ALL_TASKS))))
sys.exit()
elif args.tasks == "list_with_num":
log_message = (
"\n" + "=" * 70 + "\n" + "\n\tYou are trying to check all the numbers in each task." + "\n\tThis action will download the complete dataset." + "\n\tIf the results are not clear initially, call this again." + "\n\n" + "=" * 70
)
eval_logger.info(log_message)
for task_name in sorted(ALL_TASKS):
try:
task_dict = get_task_dict([task_name], model_name="llava")
task_obj = task_dict[task_name]
                if isinstance(task_obj, tuple):
group, task_obj = task_obj
if task_obj is None:
continue
eval_logger.info(f"\nTask : {task_obj.config.task}\n - #num : {len(task_obj.test_docs()) if task_obj.has_test_docs() else len(task_obj.validation_docs())}")
except Exception as e:
eval_logger.debug(f"\nTask : {task_name} fail to load \n Exception : \n {e}")
sys.exit()
else:
tasks_list = args.tasks.split(",")
eval_logger.info(f"Evaluating on {len(tasks_list)} tasks.")
task_names = utils.pattern_match(tasks_list, ALL_TASKS)
task_missing = [task for task in tasks_list if task not in task_names and "*" not in task] # we don't want errors if a wildcard ("*") task name was used
if task_missing:
missing = ", ".join(task_missing)
eval_logger.error(
f"Tasks were not found: {missing}. Try `lmms-eval --tasks list` for list of available tasks",
)
# eval_logger.warn(f"Tasks {missing} were not found. Try `lmms-eval --tasks list` for list of available tasks.")
eval_logger.info(f"Selected Tasks: {task_names}")
# set datetime before evaluation
datetime_str = utils.get_datetime_str(timezone=args.timezone)
if args.output_path:
if args.log_samples_suffix and len(args.log_samples_suffix) > 15:
eval_logger.warning("The suffix for log_samples is too long. It is recommended to keep it under 15 characters.")
args.log_samples_suffix = args.log_samples_suffix[:5] + "..." + args.log_samples_suffix[-5:]
hash_input = f"{args.model_args}".encode("utf-8")
hash_output = hashlib.sha256(hash_input).hexdigest()[:6]
path = Path(args.output_path)
path = path.expanduser().resolve().joinpath(f"{datetime_str}_{args.log_samples_suffix}_{args.model}_model_args_{hash_output}")
args.output_path = path
elif args.log_samples and not args.output_path:
assert args.output_path, "Specify --output_path"
results = evaluator.simple_evaluate(
model=args.model,
model_args=args.model_args,
tasks=task_names,
num_fewshot=args.num_fewshot,
batch_size=args.batch_size,
device=args.device,
limit=args.limit,
check_integrity=args.check_integrity,
show_task_to_terminal=args.show_task_to_terminal,
log_samples=args.log_samples,
gen_kwargs=args.gen_kwargs,
cli_args=args,
predict_only=args.predict_only,
)
if results is not None:
if args.log_samples:
samples = results.pop("samples")
else:
samples = None
dumped = json.dumps(results, indent=4, default=_handle_non_serializable)
if args.show_config:
print(dumped)
if args.output_path:
args.output_path.mkdir(parents=True, exist_ok=True)
result_file_path = path.joinpath("results.json")
if result_file_path.exists():
eval_logger.warning(f"Output file {result_file_path} already exists and will be overwritten.")
            # write_text closes the file handle, unlike a bare open().write().
            result_file_path.write_text(dumped)
if args.log_samples:
for task_name, config in results["configs"].items():
filename = args.output_path.joinpath(f"{task_name}.json")
# Structure the data with 'args' and 'logs' keys
data_to_dump = {"args": vars(args), "model_configs": config, "logs": sorted(samples[task_name], key=lambda x: x["doc_id"]), "time": datetime_str}
samples_dumped = json.dumps(data_to_dump, indent=4, default=_handle_non_serializable, ensure_ascii=False)
                    filename.write_text(samples_dumped, encoding="utf-8")
eval_logger.info(f"Saved samples to {filename}")
return results, samples
return None, None
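# Illustrative sketch, not part of the original module: how `cli_evaluate_single`
# derives the per-run output directory name from the timestamp, the log suffix,
# the model name and a short hash of `model_args`. The helper name `build_run_dir`
# and the sample values are hypothetical; unlike the original, this sketch does not
# call `expanduser()` or `resolve()`, so the result stays relative.

```python
import hashlib
from pathlib import Path


def build_run_dir(output_path, datetime_str, suffix, model, model_args):
    """Compose the run directory path the same way the evaluation entry point does."""
    # Suffixes longer than 15 characters are shortened to head + "..." + tail.
    if suffix and len(suffix) > 15:
        suffix = suffix[:5] + "..." + suffix[-5:]
    # The first 6 hex digits of SHA-256 distinguish runs with different model_args.
    digest = hashlib.sha256(f"{model_args}".encode("utf-8")).hexdigest()[:6]
    return Path(output_path).joinpath(f"{datetime_str}_{suffix}_{model}_model_args_{digest}")


run_dir = build_run_dir("logs", "0101_0000", "model_outputs", "hf", "pretrained=some/model")
long_dir = build_run_dir("logs", "0101_0000", "a_very_long_suffix_name", "hf", "")
```

# Two runs with identical `model_args` map to the same hash, so only the timestamp
# keeps their directories apart.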
def print_results(args, results):
print(f"{args.model} ({args.model_args}),\ngen_kwargs: ({args.gen_kwargs}),\nlimit: {args.limit},\nnum_fewshot: {args.num_fewshot},\nbatch_size: {args.batch_size}")
print(evaluator.make_table(results))
if "groups" in results:
print(evaluator.make_table(results, "groups"))
if __name__ == "__main__":
import os # NOTE
os.environ['HF_ENDPOINT'] = "https://hf-mirror.com"
os.environ['HF_DATASETS_OFFLINE'] = '1'
os.environ['HF_EVALUATE_OFFLINE'] = '1'
cli_evaluate()
================================================
FILE: lmms-eval_videochat/lmms_eval/api/__init__.py
================================================
================================================
FILE: lmms-eval_videochat/lmms_eval/api/filter.py
================================================
from dataclasses import dataclass
from typing import List
from lmms_eval.api.instance import Instance
from datasets import Dataset
class Filter:
"""
Filter classes operate on a per-task level.
They take all model outputs (`instance.resps` for all `task.instances`)
across all instances of a task, and perform operations.
In a single run, one can configure any number of separate filters or lists of filters.
"""
def __init__(self, *args, **kwargs) -> None:
"""
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
def apply(self, resps, docs):
"""
Defines the operation to perform on a list of the `inst.resps` properties of `Instance` objects.
Should return the list of (filtered) response lists *in the same order as they were input*, e.g.
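        if passed [<inst.resps for instance 0>, <inst.resps for instance 1>], it should return
        [<filtered resps for instance 0>, <filtered resps for instance 1>]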
"""
return resps
@dataclass
class FilterEnsemble:
"""
FilterEnsemble creates a pipeline applying multiple filters.
Its intended usage is to stack multiple post-processing steps in order.
`task.apply_filters` should use a list of FilterEnsemble classes that it stores, to apply each
pipeline separately.
"""
name: str
filters: List[Filter]
def apply(self, instances: List[Instance], docs: List[Dataset]) -> None:
resps = [inst.resps for inst in instances] # operate just on the model responses
for f in self.filters:
# apply filters in sequence
resps = f.apply(resps, docs)
# add the end results after filtering to filtered_requests of their respective source instances.
# has key `self.name`: each FilterEnsemble applied in a given run should use a different name.
for inst, resp in zip(instances, resps):
inst.filtered_resps[self.name] = resp
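# Illustrative sketch, not part of the original module: a two-step filter pipeline
# in the style of `FilterEnsemble.apply`. To stay self-contained it uses stand-ins;
# `DemoInstance`, `LowercaseFilter`, `TakeFirstFilter` and `run_pipeline` are
# hypothetical names, not classes shipped with lmms-eval.

```python
from dataclasses import dataclass, field


class LowercaseFilter:
    """Lowercase every response of every instance, preserving order."""

    def apply(self, resps, docs):
        return [[r.lower() for r in inst_resps] for inst_resps in resps]


class TakeFirstFilter:
    """Keep only the first response of each instance."""

    def apply(self, resps, docs):
        return [inst_resps[0] for inst_resps in resps]


@dataclass
class DemoInstance:
    resps: list
    filtered_resps: dict = field(default_factory=dict)


def run_pipeline(name, filters, instances, docs):
    # Operate only on the model responses, applying the filters in sequence.
    resps = [inst.resps for inst in instances]
    for f in filters:
        resps = f.apply(resps, docs)
    # Store the final result on each source instance under the pipeline name.
    for inst, resp in zip(instances, resps):
        inst.filtered_resps[name] = resp


instances = [DemoInstance(resps=["A. Comedy"]), DemoInstance(resps=["B", "C"])]
run_pipeline("lower_first", [LowercaseFilter(), TakeFirstFilter()], instances, docs=[{}, {}])
```

# Each pipeline writes under its own key, which is why every ensemble in a run
# needs a distinct name.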
================================================
FILE: lmms-eval_videochat/lmms_eval/api/instance.py
================================================
from dataclasses import dataclass, field
from typing import Literal, Tuple
@dataclass
class Instance:
request_type: Literal["loglikelihood", "generate_until"]
arguments: tuple
idx: int
    # __post_init__ reads metadata by key, so the field must be a mapping with these keys.
    metadata: dict = field(default_factory=lambda: {"task": None, "doc_id": None, "repeats": None})
resps: list = field(default_factory=list)
filtered_resps: dict = field(default_factory=dict)
# initialized after init
task_name: str = None
doc_id: str = None
repeats: str = None
doc: dict = None
def __post_init__(self) -> None:
# unpack metadata field
self.task_name, self.doc_id, self.repeats = self.metadata["task"], self.metadata["doc_id"], self.metadata["repeats"]
@property
def args(self):
"""
Returns (string,) where `string` is the string to calculate loglikelihood over
"""
return self.arguments if isinstance(self.arguments, tuple) else (self.arguments,)
================================================
FILE: lmms-eval_videochat/lmms_eval/api/metrics.py
================================================
import math
from collections.abc import Iterable
import numpy as np
import sacrebleu
import sklearn.metrics
import random
import evaluate
import torch
from lmms_eval.api.registry import register_metric, register_aggregation
from loguru import logger as eval_logger
# Register Aggregations First
@register_aggregation("bypass")
def bypass_agg(arr):
return 999
@register_aggregation("mean")
def mean(arr):
return sum(arr) / len(arr)
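# Illustrative sketch, not the actual `lmms_eval.api.registry` implementation: a
# minimal decorator-based registry showing how a name such as "mean" can be mapped
# to an aggregation function and looked up later. `AGGREGATION_REGISTRY`,
# `register_aggregation_demo` and `mean_demo` are hypothetical names.

```python
AGGREGATION_REGISTRY = {}


def register_aggregation_demo(name):
    """Return a decorator that stores the decorated function under `name`."""

    def decorate(fn):
        if name in AGGREGATION_REGISTRY:
            raise ValueError(f"aggregation {name!r} is already registered")
        AGGREGATION_REGISTRY[name] = fn
        return fn  # the function stays usable under its own name

    return decorate


@register_aggregation_demo("mean")
def mean_demo(arr):
    return sum(arr) / len(arr)


# Per-sample scores are reduced to one number by looking the aggregation up by name.
per_sample_acc = [1.0, 0.0, 1.0, 1.0]
score = AGGREGATION_REGISTRY["mean"](per_sample_acc)
```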
@register_aggregation("median")
def median(arr):
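    # Sort first so the middle element is the median; for even lengths this is the upper middle.
    return sorted(arr)[len(arr) // 2]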
# Certain metrics must be calculated across all documents in a benchmark.
# We use them as aggregation metrics, paired with no-op passthrough metric fns.
@register_aggregation("perplexity")
def perplexity(items):
# return math.exp(-mean(items))
items = torch.exp(torch.tensor(items)).tolist()
return sum(items) / len(items)
@register_aggregation("weighted_perplexity")
def weighted_perplexity(items):
return math.exp(-weighted_mean(items))
@register_aggregation("bits_per_byte")
def bits_per_byte(items):
return -weighted_mean(items) / math.log(2)
@register_aggregation("f1")
def f1_score(items):
unzipped_list = list(zip(*items))
golds = unzipped_list[0]
preds = unzipped_list[1]
fscore = sklearn.metrics.f1_score(golds, preds)
return np.max(fscore)
@register_aggregation("matthews_corrcoef")
def matthews_corrcoef(items):
unzipped_list = list(zip(*items))
golds = unzipped_list[0]
preds = unzipped_list[1]
# print(preds)
return sklearn.metrics.matthews_corrcoef(golds, preds)
@register_aggregation("bleu")
def bleu(items):
"""The Bilingual Evaluation Understudy Score, or BLEU for short, is a metric
for evaluating a generated sentence to a reference sentence. It counts matching
n-grams in the candidate translation to n-grams in the reference text, where
1-gram or unigram would be each token and a bigram comparison would be each
word pair. The comparison is made regardless of word order
Source: https://machinelearningmastery.com/calculate-bleu-score-for-text-python/
Paper: https://www.aclweb.org/anthology/P02-1040/
Higher is better
"""
refs = list(zip(*items))[0]
preds = list(zip(*items))[1]
refs, preds = _sacreformat(refs, preds)
return sacrebleu.corpus_bleu(preds, refs).score
@register_aggregation("chrf")
def chrf(items):
"""chrF++ is a tool for automatic evaluation of machine translation output
based on character n-gram precision and recall enhanced with word n-grams.
Source: https://github.com/m-popovic/chrF
Paper: https://www.aclweb.org/anthology/W15-3049.pdf
    Higher is better
"""
refs = list(zip(*items))[0]
preds = list(zip(*items))[1]
refs, preds = _sacreformat(refs, preds)
return sacrebleu.corpus_chrf(preds, refs).score
@register_aggregation("ter")
def ter(items):
"""Translation Error Rate is an error metric for machine translation that
measures the number of edits required to change a system output into one
of the references
Source: http://www.cs.umd.edu/~snover/tercom/
Paper: http://mt-archive.info/AMTA-2006-Snover.pdf
Lower is better
"""
refs = list(zip(*items))[0]
preds = list(zip(*items))[1]
refs, preds = _sacreformat(refs, preds)
return sacrebleu.corpus_ter(preds, refs).score
@register_metric(
metric="acc",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice"],
aggregation="mean",
)
def acc_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_norm",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice"],
aggregation="mean",
)
def acc_norm_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_mutual_info",
higher_is_better=True,
output_type="multiple_choice",
aggregation="mean",
)
def acc_mutual_info_fn(items): # This is a passthrough function
return items
exact_match = evaluate.load("exact_match")
@register_metric(
metric="exact_match",
higher_is_better=True,
output_type="generate_until",
aggregation="mean",
)
def exact_match_fn(**kwargs):
return exact_match.compute(**kwargs)
@register_metric(
metric="perplexity",
higher_is_better=False,
output_type="loglikelihood",
aggregation="perplexity",
)
def perplexity_fn(items): # This is a passthrough function
return items
def levenshtein_distance(s1, s2):
if len(s1) > len(s2):
s1, s2 = s2, s1
distances = range(len(s1) + 1)
for i2, c2 in enumerate(s2):
distances_ = [i2 + 1]
for i1, c1 in enumerate(s1):
if c1 == c2:
distances_.append(distances[i1])
else:
distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
distances = distances_
return distances[-1]
@register_metric(
metric="anls",
higher_is_better=True,
output_type="generate_until",
aggregation="mean",
)
def anls(
references,
predictions,
thresh_hold=0.5,
): # Scores one prediction against every reference answer and keeps the best match
"""https://github.com/QwenLM/Qwen-VL/blob/master/eval_mm/infographicsvqa_eval.py"""
values = []
for answer in references:
# preprocess both the answers - gt and prediction
gt_answer = " ".join(answer.strip().lower().split())
det_answer = " ".join(predictions[0].strip().lower().split())
# dist = levenshtein_distance(answer.lower(), detObject['answer'].lower())
dist = levenshtein_distance(gt_answer, det_answer)
length = max(len(answer.upper()), len(predictions[0].upper()))
values.append(0.0 if length == 0 else float(dist) / float(length))
question_result = 1 - min(values)
if question_result < thresh_hold:
question_result = 0
return {"anls": question_result}
def pop_stddev(arr):
mu = mean(arr)
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / len(arr))
def sample_stddev(arr):
mu = mean(arr)
return math.sqrt(sum([(x - mu) ** 2 for x in arr]) / (len(arr) - 1))
def mean_stderr(arr):
return sample_stddev(arr) / math.sqrt(len(arr))
@register_metric(
metric="bypass",
higher_is_better=True,
output_type=["loglikelihood", "multiple_choice", "generate_until"],
aggregation="bypass",
)
def bypass(items):
return items
@register_metric(
metric="mcc",
higher_is_better=True,
output_type="multiple_choice",
aggregation="matthews_corrcoef",
)
def mcc_fn(items): # This is a passthrough function
return items
@register_metric(
metric="f1",
higher_is_better=True,
output_type="multiple_choice",
aggregation="f1",
)
def f1_fn(items): # This is a passthrough function
return items
@register_metric(
metric="bleu",
higher_is_better=True,
output_type="generate_until",
aggregation="bleu",
)
def bleu_fn(items): # This is a passthrough function
return items
@register_metric(
metric="chrf",
higher_is_better=True,
output_type="generate_until",
aggregation="chrf",
)
def chrf_fn(items): # This is a passthrough function
return items
@register_metric(
metric="ter",
higher_is_better=False, # TER is an error rate, so lower is better (see the `ter` docstring)
output_type="generate_until",
aggregation="ter",
)
def ter_fn(items): # This is a passthrough function
return items
@register_metric(
metric="acc_all",
higher_is_better=True,
output_type="loglikelihood",
aggregation="mean",
)
def acc_all(items):
# Only count as correct if all answers are labeled correctly for each question
question_scoring_dict = {}
preds = list(zip(*items))[0]
docs = list(zip(*items))[1]
for doc, pred in zip(docs, preds):
paragraph_id = doc["idx"]["paragraph"]
question_id = doc["idx"]["question"]
if (paragraph_id, question_id) not in question_scoring_dict:
question_scoring_dict[(paragraph_id, question_id)] = []
gold_label = doc["label"] == 1
question_scoring_dict[(paragraph_id, question_id)].append(gold_label == pred)
acc = np.mean([int(all(x)) for x in question_scoring_dict.values()])
return acc
def acc_all_stderr(items):
# Only count as correct if all answers are labeled correctly for each question
question_scoring_dict = {}
preds = list(zip(*items))[0]
docs = list(zip(*items))[1]
for doc, pred in zip(docs, preds):
question_id = doc["idx"]["question"]
if question_id not in question_scoring_dict:
question_scoring_dict[question_id] = []
gold_label = doc["label"] == 1
question_scoring_dict[question_id].append(gold_label == pred)
acc = mean_stderr([int(all(x)) for x in question_scoring_dict.values()])
return acc
def metric_max_over_ground_truths(metric_fn, prediction, ground_truths):
"""Compute max metric between prediction and each ground truth."""
scores_for_ground_truths = []
for ground_truth in ground_truths:
score = metric_fn(prediction, ground_truth)
scores_for_ground_truths.append(score)
return max(scores_for_ground_truths)
def weighted_mean(items):
a, b = zip(*items)
return sum(a) / sum(b)
def is_non_str_iterable(obj):
return isinstance(obj, Iterable) and not isinstance(obj, str)
def _sacreformat(refs, preds):
"""Format refs and preds for sacrebleu corpus calculation. It is very particular"""
# Sacrebleu expects (List[str], List[List[str]])
# e.g. sacrebleu.corpus_bleu([pred_t], [[ref1_stream], [ref2_stream], ...])
# Note [ref1_stream] is the first reference for each pred.
# So lists are size N and (M, N) for N preds and M possible refs for each pred
# This is a different order of dimensions than I would expect
# We expect refs to be List[str] or List[List[str]], the outer list corresponding to preds
# Must become List[List[str]] with the inner list corresponding to preds
if not is_non_str_iterable(refs):
refs = list(refs)
if not is_non_str_iterable(refs[0]):
refs = [[ref] for ref in refs]
refs = list(zip(*refs))
# Note the number of refs in each ref list must match the number of preds
# We expect preds to be List[str] or List[List[str]]. Must become List[str]
if not is_non_str_iterable(preds):
preds = list(preds)
if is_non_str_iterable(preds[0]):
assert len(preds[0]) == 1, f"Pred must be a str, was {preds[0]}"
preds = [pred[0] for pred in preds]
return refs, preds
# stderr stuff
class _bootstrap_internal:
def __init__(self, f, n) -> None:
self.f = f
self.n = n
def __call__(self, v):
i, xs = v
rnd = random.Random()
rnd.seed(i)
res = []
for _ in range(self.n):
res.append(self.f(rnd.choices(xs, k=len(xs))))
return res
def bootstrap_stderr(f, xs, iters):
import multiprocessing as mp
pool = mp.Pool(mp.cpu_count())
# This gives a biased estimate of the stderr (i.e., for the mean, it gives something
# equivalent to the stderr calculated without Bessel's correction in the stddev).
# Unfortunately, I haven't been able to figure out what the right correction is
# to make the bootstrap unbiased. I considered multiplying by sqrt(n/(n-1)), but
# that would be ad hoc and I can't prove that it would actually be an unbiased estimator.
# Thankfully, it shouldn't matter because our samples are usually pretty big anyway.
res = []
chunk_size = min(1000, iters)
from tqdm import tqdm
print("bootstrapping for stddev:", f.__name__)
for bootstrap in tqdm(
pool.imap(
_bootstrap_internal(f, chunk_size),
[(i, xs) for i in range(iters // chunk_size)],
),
total=iters // chunk_size,
):
# sample w replacement
res.extend(bootstrap)
pool.close()
return sample_stddev(res)
def stderr_for_metric(metric, bootstrap_iters):
bootstrappable = [
median,
matthews_corrcoef,
f1_score,
perplexity,
bleu,
chrf,
ter,
]
if metric in bootstrappable:
return lambda x: bootstrap_stderr(metric, x, iters=bootstrap_iters)
stderr = {mean: mean_stderr, acc_all: acc_all_stderr}
return stderr.get(metric, None)
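# Illustrative sketch (not part of the module above): a single-process version of the
# bootstrap used by `bootstrap_stderr`, assuming no multiprocessing pool and one fixed
# seed. The names `bootstrap_stderr_serial` and `seed` are hypothetical. It resamples
# `xs` with replacement `iters` times, applies the metric `f` to each resample, and
# reports the sample standard deviation of those estimates.

```python
import math
import random


def mean(arr):
    return sum(arr) / len(arr)


def sample_stddev(arr):
    # Standard deviation with Bessel's correction (divide by n - 1).
    mu = mean(arr)
    return math.sqrt(sum((x - mu) ** 2 for x in arr) / (len(arr) - 1))


def bootstrap_stderr_serial(f, xs, iters, seed=0):
    """Estimate the standard error of f(xs) by resampling xs with replacement."""
    rnd = random.Random(seed)
    estimates = [f(rnd.choices(xs, k=len(xs))) for _ in range(iters)]
    return sample_stddev(estimates)


xs = [float(i) for i in range(100)]
boot = bootstrap_stderr_serial(mean, xs, iters=1000)
# For the mean, the bootstrap estimate should land close to the analytic stderr.
analytic = sample_stddev(xs) / math.sqrt(len(xs))
print(f"bootstrap={boot:.3f} analytic={analytic:.3f}")
```

# For the mean, the two numbers should agree to within a few percent; for metrics such
# as BLEU or F1 there is no closed form, which is why those are bootstrapped.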
================================================
FILE: lmms-eval_videochat/lmms_eval/api/model.py
================================================
import abc
import os
from typing import Union, List, Tuple, Optional, Type, TypeVar
from sqlitedict import SqliteDict
import json
import hashlib
from lmms_eval.api.instance import Instance
from tqdm import tqdm
from lmms_eval import utils
from loguru import logger as eval_logger
T = TypeVar("T", bound="lmms")
class lmms(abc.ABC):
def __init__(self) -> None:
"""Defines the interface that should be implemented by all lmms subclasses.
LMMs are assumed to take image-text as input and yield strings as output
(inputs/outputs should be tokenization-agnostic.)
"""
# set rank and world size to a single process, by default.
self._rank = 0
self._world_size = 1
self.cache_hook = CacheHook(None)
self.task_dict = {}
@abc.abstractmethod
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
"""Compute log-likelihood of generating a continuation from a context.
Downstream tasks should attempt to use loglikelihood instead of other
LMM calls whenever possible.
:param requests: list[Instance]
A list of Instance objects, with property `args` which returns a tuple (context, continuation).
`context: str`
Context string. Implementations of LMM must be able to handle an
empty context string.
`continuation: str`
The continuation over which log likelihood will be calculated. If
there is a word boundary, the space should be in the continuation.
For example, context="hello" continuation=" world" is correct.
'visual_list: list[dict]'
Visual input to the model. Can be None.
:return: list[tuple[float, bool]]
A list of pairs (logprob, isgreedy)
`logprob: float`
The log probability of `continuation`.
`isgreedy`:
Whether `continuation` would be generated by greedy sampling from `context`.
"""
pass
# TODO: Add an optional max length
@abc.abstractmethod
def generate_until(self, requests) -> List[str]:
"""Generate greedily until a stopping sequence
:param requests: list[Instance]
A list of Instance objects with property `args` which returns a tuple starting with (context, generation_kwargs).
context: str
Context string
generation_kwargs: dict
Generation Kwargs
'visual_list: list[dict]'
Visual input to the model. Can be None.
:return: list[str]
A list of strings continuation
continuation: str
The generated continuation.
"""
pass
@classmethod
def create_from_arg_string(cls: Type[T], arg_string: str, additional_config: Optional[dict] = None) -> T:
"""
Creates an instance of the LMM class using the given argument string and additional config.
Parameters:
- arg_string: A string containing arguments in the format key1=value1,key2=value2.
- additional_config: Optional dictionary containing additional configuration parameters.
Returns:
- Instance of the LMM class.
"""
additional_config = {} if additional_config is None else additional_config
args = utils.simple_parse_args_string(arg_string)
args2 = {k: v for k, v in additional_config.items() if v is not None}
return cls(**args, **args2)
@property
def rank(self):
# used in the case of parallelism. Hardcoded to
# ensure no errors arise using API models which do
# not support multi-device parallelism nor expect it.
return self._rank
@property
def world_size(self):
# used in the case of parallelism. Hardcoded to
# ensure no errors arise using API models which do
# not support multi-device parallelism nor expect it.
return self._world_size
def set_cache_hook(self, cache_hook) -> None:
self.cache_hook = cache_hook
### SQLite-based caching of LMM responses
def hash_args(attr, args):
dat = json.dumps([attr] + list(args))
return hashlib.sha256(dat.encode("utf-8")).hexdigest()
class CacheHook:
def __init__(self, cachinglm) -> None:
if cachinglm is None:
self.dbdict = None
return
self.dbdict = cachinglm.dbdict
def add_partial(self, attr, req, res) -> None:
if self.dbdict is None:
return
hsh = hash_args(attr, req)
self.dbdict[hsh] = res
class CachingLMM:
def __init__(self, lm, cache_db) -> None:
"""LMM wrapper that returns cached results if they exist, and uses the underlying LMM if not.
:param lm: LMM
Underlying LMM
:param cache_db: str
Path to cache db
"""
self.lm = lm
self.cache_db = cache_db
if os.path.dirname(cache_db):
os.makedirs(os.path.dirname(cache_db), exist_ok=True)
self.dbdict = SqliteDict(cache_db, autocommit=True)
# add hook to lm
lm.set_cache_hook(self.get_cache_hook())
def __getattr__(self, attr):
lm_attr = getattr(self.lm, attr)
if not callable(lm_attr):
return lm_attr
def fn(requests):
res = []
remaining_reqs = []
warned = False
# figure out which ones are cached and which ones are new
eval_logger.info(f"Loading '{attr}' responses from cache '{self.cache_db}' where possible...")
for req in tqdm(requests):
hsh = hash_args(attr, req.args)
if attr == "generate_until" and req.args[1].get("do_sample", False):
# when we are doing non-greedy generation, don't use the cache
# (else every "randomly sampled" generation would be identical for repeats > 1).
if not warned:
eval_logger.warning(f"Arguments to lm.generate_until() '{req.args[1]}' include non-deterministic sampling. Caching will not be performed for such requests.")
warned = True
res.append(None)
remaining_reqs.append(req)
elif hsh in self.dbdict:
ob = self.dbdict[hsh]
assert ob is not None
res.append(ob)
else:
res.append(None)
remaining_reqs.append(req)
# actually run the LMM on the requests that do not have cached results
rem_res = getattr(self.lm, attr)(remaining_reqs)
# stick the new ones back into the list and also cache any of the new ones
resptr = 0
for req, r in zip(remaining_reqs, rem_res):
while res[resptr] is not None:
resptr += 1
res[resptr] = r
# caching
hsh = hash_args(attr, req.args)
self.dbdict[hsh] = r
self.dbdict.commit()
return res
return fn
def get_cache_hook(self):
return CacheHook(self)
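# Illustrative sketch (not part of the module above): the cache-then-fill pattern of
# `CachingLMM`, assuming a plain dict in place of `SqliteDict` and a stand-in model
# function. `cached_call` and `fake_generate` are hypothetical names. Unlike the wrapper
# above, which scans for `None` slots, this version records the indices of the misses.

```python
import hashlib
import json


def hash_args(attr, args):
    # Same keying scheme as above: method name plus JSON-serialised request args.
    dat = json.dumps([attr] + list(args))
    return hashlib.sha256(dat.encode("utf-8")).hexdigest()


def cached_call(cache, attr, fn, requests):
    """Answer `requests` from `cache` where possible; call `fn` only for the misses."""
    results = [None] * len(requests)
    missing = []
    for i, args in enumerate(requests):
        key = hash_args(attr, args)
        if key in cache:
            results[i] = cache[key]
        else:
            missing.append(i)
    if missing:
        fresh = fn([requests[i] for i in missing])
        for i, value in zip(missing, fresh):
            results[i] = value
            cache[hash_args(attr, requests[i])] = value
    return results


calls = []


def fake_generate(batch):
    # Stand-in for a model call; records each batch it actually receives.
    calls.append(list(batch))
    return [context.upper() for context, _ in batch]


cache = {}
reqs = [("a cat", {"do_sample": False}), ("a dog", {"do_sample": False})]
first = cached_call(cache, "generate_until", fake_generate, reqs)
second = cached_call(cache, "generate_until", fake_generate, reqs + [("a bird", {"do_sample": False})])
```

# On the second call only the new request reaches `fake_generate`; the first two
# answers come from the cache.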
================================================
FILE: lmms-eval_videochat/lmms_eval/api/registry.py
================================================
from lmms_eval.api.model import lmms
from typing import Callable, Dict
import evaluate as hf_evaluate
from loguru import logger as eval_logger
MODEL_REGISTRY = {}
def register_model(*names):
# either pass a list or a single alias.
# function receives them as a tuple of strings
def decorate(cls):
for name in names:
assert issubclass(cls, lmms), f"Model '{name}' ({cls.__name__}) must extend lmms class"
assert name not in MODEL_REGISTRY, f"Model named '{name}' conflicts with existing model! Please register with a non-conflicting alias instead."
MODEL_REGISTRY[name] = cls
return cls
return decorate
def get_model(model_name):
try:
return MODEL_REGISTRY[model_name]
except KeyError:
raise ValueError(f"Attempted to load model '{model_name}', but no model for this name found! Supported model names: {', '.join(MODEL_REGISTRY.keys())}")
TASK_REGISTRY = {} # Key: task name, Value: task ConfigurableTask class
GROUP_REGISTRY = {} # Key: group name, Value: list of task names or group names
TASK_INITIALIZED = False
ALL_TASKS = set() # Set of all task names and group names
func2task_index = {} # Key: `__name__` of the registered task class, Value: task name
def register_task(name):
def decorate(fn):
assert name not in TASK_REGISTRY, f"task named '{name}' conflicts with existing registered task!"
TASK_REGISTRY[name] = fn
ALL_TASKS.add(name)
func2task_index[fn.__name__] = name
return fn
return decorate
def register_group(name):
def decorate(fn):
func_name = func2task_index[fn.__name__]
if name in GROUP_REGISTRY:
GROUP_REGISTRY[name].append(func_name)
else:
GROUP_REGISTRY[name] = [func_name]
ALL_TASKS.add(name)
return fn
return decorate
OUTPUT_TYPE_REGISTRY = {}
METRIC_REGISTRY = {}
METRIC_AGGREGATION_REGISTRY = {}
AGGREGATION_REGISTRY = {}
HIGHER_IS_BETTER_REGISTRY = {}
DEFAULT_METRIC_REGISTRY = {
"loglikelihood": [
"perplexity",
"acc",
],
"multiple_choice": ["acc", "acc_norm"],
"generate_until": ["exact_match"],
}
def register_metric(**args):
# TODO: do we want to enforce a certain interface to registered metrics?
def decorate(fn):
assert "metric" in args
name = args["metric"]
for key, registry in [
("metric", METRIC_REGISTRY),
("higher_is_better", HIGHER_IS_BETTER_REGISTRY),
("aggregation", METRIC_AGGREGATION_REGISTRY),
]:
if key in args:
value = args[key]
assert value not in registry, f"{key} named '{value}' conflicts with existing registered {key}!"
if key == "metric":
registry[name] = fn
elif key == "aggregation":
registry[name] = AGGREGATION_REGISTRY[value]
else:
registry[name] = value
return fn
return decorate
def get_metric(name: str, hf_evaluate_metric=False) -> Callable:
if not hf_evaluate_metric:
if name in METRIC_REGISTRY:
return METRIC_REGISTRY[name]
else:
eval_logger.warning(f"Could not find registered metric '{name}' in lmms-eval, searching in HF Evaluate library...")
try:
metric_object = hf_evaluate.load(name)
return metric_object.compute
except Exception:
eval_logger.error(
f"{name} not found in the evaluate library! Please check https://huggingface.co/evaluate-metric",
)
def register_aggregation(name):
def decorate(fn):
assert name not in AGGREGATION_REGISTRY, f"aggregation named '{name}' conflicts with existing registered aggregation!"
AGGREGATION_REGISTRY[name] = fn
return fn
return decorate
def get_aggregation(name):
try:
return AGGREGATION_REGISTRY[name]
except KeyError:
eval_logger.warning(
"{} not a registered aggregation metric!".format(name),
)
def get_metric_aggregation(name):
try:
return METRIC_AGGREGATION_REGISTRY[name]
except KeyError:
eval_logger.warning(
"{} metric is not assigned a default aggregation!".format(name),
)
def is_higher_better(metric_name):
try:
return HIGHER_IS_BETTER_REGISTRY[metric_name]
except KeyError:
eval_logger.warning(f"higher_is_better not specified for metric '{metric_name}'!")
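# Illustrative sketch (not part of the module above): the decorator-registry pattern
# shared by `register_model`, `register_task`, and `register_aggregation`, reduced to a
# single dict. `REGISTRY`, `register`, and `get` are hypothetical names.

```python
REGISTRY = {}


def register(name):
    """Return a decorator that stores the decorated function under `name`."""

    def decorate(fn):
        assert name not in REGISTRY, f"'{name}' conflicts with an existing registration!"
        REGISTRY[name] = fn
        return fn  # hand the function back unchanged so it stays usable directly

    return decorate


def get(name):
    try:
        return REGISTRY[name]
    except KeyError:
        raise ValueError(f"'{name}' is not registered. Known names: {', '.join(REGISTRY)}") from None


@register("mean")
def mean(items):
    return sum(items) / len(items)


print(get("mean")([1, 2, 3]))
```

# Registering the same name twice trips the assertion, which is how the registries above
# catch alias conflicts at import time.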
================================================
FILE: lmms-eval_videochat/lmms_eval/api/samplers.py
================================================
class ContextSampler:
def __init__(self, docs, task, fewshot_indices=None, rnd=None) -> None:
self.rnd = rnd
assert self.rnd, "must pass rnd to ContextSampler!"
self.task = task
self.config = task._config
self.target_delimiter = self.config.target_delimiter
self.fewshot_delimiter = self.config.fewshot_delimiter
self.doc_to_text = self.task.doc_to_text
self.doc_to_target = self.task.doc_to_target
self.doc_to_choice = self.task.doc_to_choice
self.docs = docs # HF dataset split, provided by task._fewshot_docs()
if fewshot_indices: # subset few-shot docs from
self.docs = self.docs.select(fewshot_indices)
def get_context(self, doc, num_fewshot):
# draw an extra fewshot sample if using same split as evaluating on
n_samples = num_fewshot + 1 if self.config.fewshot_split == self.config.test_split else num_fewshot
# draw `n_samples` docs from fewshot_docs
fewshotex = self.sample(n_samples)
# get rid of the doc that's the one we're evaluating, if it's in the fewshot
# TODO: should we just stop people from using fewshot from same split as evaluating?
selected_docs = [x for x in fewshotex if x != doc][:num_fewshot]
labeled_examples = (
self.fewshot_delimiter.join(
[
# TODO: is separating doc_to_text and doc_to_target by one space always desired?
(self.doc_to_text(doc) if (self.config.doc_to_choice is None or type(self.doc_to_text(doc)) is str) else self.doc_to_choice(doc)[self.doc_to_text(doc)])
+ self.target_delimiter
+ (
str(self.doc_to_target(doc)[0])
if type(self.doc_to_target(doc)) is list
else self.doc_to_target(doc) if (self.config.doc_to_choice is None or type(self.doc_to_target(doc)) is str) else str(self.doc_to_choice(doc)[self.doc_to_target(doc)])
)
for doc in selected_docs
]
)
+ self.fewshot_delimiter
)
return labeled_examples
def sample(self, n):
"""
Draw `n` samples from our fewshot docs. This method should be overridden by subclasses.
"""
return self.rnd.sample(self.docs, n)
class FirstNSampler(ContextSampler):
def sample(self, n):
"""
Draw the first `n` samples in order from the specified split.
Used for tasks with "canonical" ordered fewshot examples, such as MMLU and CMMLU.
"""
assert n <= len(self.docs), f"Error: number of fewshot samples requested exceeds the {len(self.docs)} that are available."
return self.docs[:n]
class BalancedSampler(ContextSampler):
def sample(self, n) -> None:
"""
TODO: this should return approximately class-balanced samples from our fewshot examples.
TODO: what order should they be in? maybe random?
"""
pass
class ManualSampler(ContextSampler):
def sample(self, n) -> None:
""" """
pass
SAMPLER_REGISTRY = {
"default": ContextSampler,
"first_n": FirstNSampler,
}
def get_sampler(name):
try:
return SAMPLER_REGISTRY[name]
except KeyError:
raise ValueError(f"Attempted to use contextsampler '{name}', but no sampling strategy for this name found! Supported sampler names: {', '.join(SAMPLER_REGISTRY.keys())}")
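# Illustrative sketch (not part of the module above): how `ContextSampler.get_context`
# assembles a few-shot prefix, assuming docs are plain dicts with hypothetical
# "question" and "answer" keys. This version always draws one extra sample, whereas the
# class above does so only when the few-shot split is also the test split.

```python
import random


def build_fewshot_context(docs, doc, num_fewshot, rnd, target_delimiter=" ", fewshot_delimiter="\n\n"):
    """Join `num_fewshot` labelled examples, never including the doc being evaluated."""
    # Draw one extra example so the evaluated doc can be dropped if it was sampled.
    n_samples = min(num_fewshot + 1, len(docs))
    sampled = rnd.sample(docs, n_samples)
    selected = [d for d in sampled if d != doc][:num_fewshot]
    examples = [d["question"] + target_delimiter + d["answer"] for d in selected]
    return fewshot_delimiter.join(examples) + fewshot_delimiter


docs = [
    {"question": "Q0: 1+1?", "answer": "2"},
    {"question": "Q1: 2+2?", "answer": "4"},
    {"question": "Q2: 3+3?", "answer": "6"},
]
ctx = build_fewshot_context(docs, docs[0], num_fewshot=2, rnd=random.Random(0))
print(ctx)
```

# The evaluated doc (`docs[0]`) never appears in its own context, and the string ends
# with the few-shot delimiter so the test question can be appended directly.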
================================================
FILE: lmms-eval_videochat/lmms_eval/api/task.py
================================================
import abc
import ast
import itertools
import json
import os
import re
import random
import shutil
import inspect
import subprocess
from collections.abc import Callable
from dataclasses import dataclass, field, asdict
from glob import glob
from typing import Any, List, Union
import datasets
import numpy as np
from PIL import ImageFile
from datasets import DownloadConfig, Image, Sequence
from huggingface_hub import snapshot_download
from tenacity import retry, stop_after_attempt, wait_fixed, stop_after_delay
from tqdm import tqdm
from accelerate import Accelerator
from lmms_eval import utils
from lmms_eval.api import samplers
from lmms_eval.api.instance import Instance
from lmms_eval.api.registry import (
AGGREGATION_REGISTRY,
DEFAULT_METRIC_REGISTRY,
METRIC_REGISTRY,
OUTPUT_TYPE_REGISTRY,
get_aggregation,
get_metric,
get_metric_aggregation,
is_higher_better,
)
from lmms_eval.filters import build_filter_ensemble
from loguru import logger as eval_logger
# HuggingfaceM4/NoCaps contains truncated image in test split
# Include this inside code block to avoid error
ImageFile.LOAD_TRUNCATED_IMAGES = True
ALL_OUTPUT_TYPES = [
"loglikelihood",
"multiple_choice",
"generate_until",
]
@dataclass
class TaskConfig(dict):
# task naming/registry
task: str = None
task_alias: str = None
group: Union[str, list] = None
group_alias: Union[str, list] = None
# HF dataset options.
# which dataset to use,
# and what splits for what purpose
dataset_path: str = None
dataset_name: str = None
dataset_kwargs: dict = None
training_split: str = None
validation_split: str = None
test_split: str = None
fewshot_split: str = None # TODO: assert that this not None if num_fewshot > 0. (?) assert if this is same split as one evaling (?)
full_docs: bool = False
# formatting / prompting options.
# see docs/advanced_task_guide.md for more info
process_results_use_image: bool = False
process_docs: Callable = None
doc_to_visual: Union[Callable, str] = None
doc_to_text: Union[Callable, str] = None
doc_to_target: Union[Callable, str] = None
doc_to_choice: Union[Callable, str, dict, list] = None
process_results: Union[Callable, str] = None
use_prompt: str = None
description: str = ""
target_delimiter: str = " "
fewshot_delimiter: str = "\n\n"
fewshot_config: dict = None
# runtime configuration options
num_fewshot: int = None
# scoring options
metric_list: list = None
output_type: str = "generate_until"
generation_kwargs: dict = None
repeats: int = 1
filter_list: Union[str, list] = None
should_decontaminate: bool = False
doc_to_decontamination_query: str = None
metadata: Union[str, list] = None # by default, not used in the code. allows for users to pass arbitrary info to tasks
lmms_eval_specific_kwargs: dict = None
model_specific_generation_kwargs: dict = None
model_specific_target_kwargs: dict = None
def __post_init__(self) -> None:
if self.dataset_path and os.path.exists(os.path.dirname(self.dataset_path)):
import inspect
from importlib import import_module
# self.dataset_path = inspect.getfile(import_module(self.dataset_path))
if self.generation_kwargs is not None:
if self.output_type != "generate_until":
eval_logger.warning(f"[{self.task}] passed `generation_kwargs`, but not using `output_type: generate_until`!")
assert self.output_type != "generate_until"
if "temperature" in self.generation_kwargs:
self.generation_kwargs["temperature"] = float(self.generation_kwargs["temperature"])
if "until" not in self.generation_kwargs:
self.generation_kwargs["until"] = [self.fewshot_delimiter]
else:
if self.output_type == "generate_until":
# ensure that we greedily generate in absence of explicit arguments otherwise
self.generation_kwargs = {
"until": None if self.fewshot_delimiter is None else [self.fewshot_delimiter],
"do_sample": False,
}
# TODO: how to make TaskConfigs be de- and re-serializable, even when using the !function constructor?
def __getitem__(self, item):
return getattr(self, item)
def __setitem__(self, item, value):
return setattr(self, item, value)
def to_dict(self):
"""dumps the current config as a dictionary object, as a printable format.
null fields will not be printed.
Used for dumping results alongside full task configuration
:return: dict
A printable dictionary version of the TaskConfig object.
# TODO: should any default value in the TaskConfig not be printed?
"""
cfg_dict = asdict(self)
# remove values that are `None`
for k, v in list(cfg_dict.items()):
if v is None:
cfg_dict.pop(k)
elif isinstance(v, Callable):
# TODO: this should handle Promptsource template objects as a separate case?
cfg_dict[k] = str(v)
return cfg_dict
class Task(abc.ABC):
"""A task represents an entire benchmark including its dataset, problems,
answers, and evaluation methods. See BoolQ for a simple example implementation
A `doc` can be any python object which represents one instance of evaluation.
This is usually a dictionary e.g.
{"question": ..., "answer": ...} or
(question, answer)
"""
VERSION = None
# The name of the `Task` benchmark as denoted in the HuggingFace datasets Hub
# or a path to a custom `datasets` loading script.
DATASET_PATH: str = None
# The name of a subset within `DATASET_PATH`.
DATASET_NAME: str = None
OUTPUT_TYPE: str = None
def __init__(
self,
data_dir=None,
cache_dir=None,
download_mode=None,
config=None,
) -> None:
"""
:param data_dir: str
Stores the path to a local folder containing the `Task`'s data files.
Use this to specify the path to manually downloaded data (usually when
the dataset is not publicly accessible).
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
self.download(data_dir, cache_dir, download_mode) # NOTE lxh
self._training_docs = None
self._fewshot_docs = None
self._instances = None
self._config = TaskConfig({**config}) if config else TaskConfig()
self._filters = [build_filter_ensemble("none", [["take_first", None]])]
def download(self, data_dir=None, cache_dir=None, download_mode=None) -> None:
"""Downloads the task dataset and stores it on `self.dataset`.
Override this method to download the dataset from a custom API.
:param data_dir: str
Stores the path to a local folder containing the `Task`'s data files.
Use this to specify the path to manually downloaded data (usually when
the dataset is not publicly accessible).
:param cache_dir: str
The directory to read/write the `Task` dataset. This follows the
HuggingFace `datasets` API with the default cache directory located at:
`~/.cache/huggingface/datasets`
NOTE: You can change the cache location globally for a given process
by setting the shell environment variable, `HF_DATASETS_CACHE`,
to another directory:
`export HF_DATASETS_CACHE="/path/to/another/directory"`
:param download_mode: datasets.DownloadMode
How to treat pre-existing `Task` downloads and data.
- `datasets.DownloadMode.REUSE_DATASET_IF_EXISTS`
Reuse download and reuse dataset.
- `datasets.DownloadMode.REUSE_CACHE_IF_EXISTS`
Reuse download with fresh dataset.
- `datasets.DownloadMode.FORCE_REDOWNLOAD`
Fresh download and fresh dataset.
"""
raise NotImplementedError("I don't want it.")
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
cache_dir=cache_dir,
download_mode=download_mode,
)
self.dataset_no_image = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
data_dir=data_dir,
cache_dir=cache_dir,
download_mode=download_mode,
)
for doc_name in self.dataset_no_image:
remove_cols = []
features = self.dataset_no_image[doc_name].features
# If it is an Image instance or a Sequence of Image instance. Remove it
for feature in features:
if isinstance(features[feature], Image):
remove_cols.append(feature)
elif isinstance(features[feature], Sequence) and isinstance(features[feature].feature, Image):
remove_cols.append(feature)
for remove_col in remove_cols:
self.dataset_no_image[doc_name] = self.dataset_no_image[doc_name].remove_columns(remove_col)
@property
def config(self):
"""Returns the TaskConfig associated with this class."""
return self._config
@abc.abstractmethod
def has_training_docs(self):
"""Whether the task has a training set"""
pass
@abc.abstractmethod
def has_validation_docs(self):
"""Whether the task has a validation set"""
pass
@abc.abstractmethod
def has_test_docs(self):
"""Whether the task has a test set"""
pass
def training_docs(self):
"""
:return: Iterable[obj]
An iterable of objects that doc_to_text can handle
"""
return []
def validation_docs(self):
"""
:return: Iterable[obj]
An iterable of objects that doc_to_text can handle
"""
return []
def test_docs(self):
"""
:return: Iterable[obj]
An iterable of objects that doc_to_text can handle
"""
return []
def fewshot_docs(self):
"""
:return: Iterable[obj]
An iterable of objects that doc_to_text can handle
"""
if self.has_training_docs():
return self.training_docs()
elif self.has_validation_docs():
return self.validation_docs()
else:
if self.config.num_fewshot is not None:
eval_logger.warning("has_training_docs and has_validation_docs are False, using test_docs as fewshot_docs but this is not recommended.")
return self.test_docs()
def _process_doc(self, doc):
"""
Override this to process (detokenize, strip, replace, etc.) individual
documents. This can be used in a map over documents of a data split.
E.g. `map(self._process_doc, self.dataset["validation"])`
:return: dict
The processed version of the specified `doc`.
"""
return doc
@property
def instances(self):
"""After calling `task.build_all_requests()`, tasks
maintain a list of the dataset instances which will be evaluated.
"""
return self._instances
def fewshot_examples(self, k, rnd):
if self._training_docs is None:
self._training_docs = list(self.training_docs())
return rnd.sample(self._training_docs, k)
def doc_to_decontamination_query(self, doc) -> None:
print("Override doc_to_decontamination_query with document specific decontamination query.")
assert False
@abc.abstractmethod
def doc_to_text(self, doc):
pass
@abc.abstractmethod
def doc_to_target(self, doc):
pass
# @profile
def build_all_requests(self, limit=None, rank=None, world_size=None) -> None:
"""Build a set of Instances for a task, and store them in task.instances"""
if self.has_test_docs():
docs = self.test_docs()
split = self.config.test_split
elif self.has_validation_docs():
docs = self.validation_docs()
split = self.config.validation_split
else:
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
eval_logger.info(f"Building contexts for task {self.CONFIG.task} on rank {rank}...")
instances = []
doc_id_iterator = utils.create_iterator([i for i in range(len(docs))], rank, world_size, limit)
doc_id_iterator, doc_id_iterator_counting = itertools.tee(doc_id_iterator)
total_docs = sum(1 for _ in doc_id_iterator_counting)
pbar = tqdm(total=total_docs, desc="Building context", disable=(rank != 0))
for doc_id in doc_id_iterator:
# sample fewshot context #TODO: need to offset doc_id by rank now!
fewshot_ctx = self.fewshot_context(doc_id, 0 if self.config.num_fewshot is None else self.config.num_fewshot, self.config.training_split if self.has_training_docs() else split)
# TODO: we should override self.config.repeats if doing greedy gen so users don't waste time+compute
per_task_metadata = {"task": self.config["task"], "doc_id": doc_id, "repeats": self.config.repeats}
if self.config.metadata and isinstance(self.config.metadata, dict): # TODO: temporary fix for metadata loading, ignore the list of dict type.
per_task_metadata.update(self.config.metadata)
inst = self.construct_requests(doc_id=doc_id, ctx=fewshot_ctx, metadata=per_task_metadata, split=split)
if not isinstance(inst, list):
inst = [inst]
instances.extend(inst)
pbar.update(1)
pbar.close()
self._instances = instances
assert len(self._instances) != 0, "task.build_all_requests() did not find any docs!"
@abc.abstractmethod
def construct_requests(self, doc_id, ctx, **kwargs):
"""Uses RequestFactory to construct Requests and returns an iterable of
Requests which will be sent to the LMM.
:param doc_id: int
The index of a document within `self.test_docs()` or `self.validation_docs()`,
whichever is the main split used.
:param ctx: str
The context string, generated by fewshot_context. This includes the natural
language description, as well as the few shot examples, and the question
part of the document for `doc`.
:param repeats: int
TODO: update this docstring
The number of times each instance in a dataset is inferred on. Defaults to 1,
can be increased for techniques like majority voting.
"""
pass
@abc.abstractmethod
def process_results(self, doc, results):
"""Take a single document and the LMM results and evaluates, returning a
dict where keys are the names of submetrics and values are the values of
the metric for that one document
:param doc:
The document as returned from training_docs, validation_docs, or test_docs.
:param results:
The results of the requests created in construct_requests.
"""
pass
@abc.abstractmethod
def aggregation(self):
"""
:returns: {str: [metric_score] -> float}
A dictionary where keys are the names of submetrics and values are
functions that aggregate a list of metric scores
"""
pass
@abc.abstractmethod
def higher_is_better(self):
"""
:returns: {str: bool}
A dictionary where keys are the names of submetrics and values are
whether a higher value of the submetric is better
"""
pass
@classmethod
def count_bytes(cls, doc):
"""Used for byte-level perplexity metrics in rolling loglikelihood"""
return len(doc.encode("utf-8"))
@utils.positional_deprecated
def fewshot_context(
self,
doc_id,
num_fewshot,
split,
rnd=random.Random(1234),
description=None,
):
"""Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example.
:param doc_id: int
The document id as returned from training_docs, validation_docs, or test_docs.
:param num_fewshot: int
The number of fewshot examples to provide in the returned context string.
:param split: str
The split of the document to retrieve from the dataset
:param rnd: random.Random
The pseudo-random number generator used to randomly sample examples.
WARNING: This is effectively a required arg; passing `None` trips the assertion below even though the signature supplies a default generator.
:param description: str
The task's description that will be prepended to the fewshot examples.
:returns: str
The fewshot context.
"""
assert rnd is not None, "A `random.Random` generator argument must be provided to `rnd`"
description = description if description else ""
doc = self.dataset_no_image[split][doc_id]
if num_fewshot == 0:
labeled_examples = ""
else:
# for sets with no training docs, draw from other set *but ensure no overlap with current doc*
if self.has_training_docs():
fewshotex = self.fewshot_examples(k=num_fewshot, rnd=rnd)
else:
if self._fewshot_docs is None:
self._fewshot_docs = list(self.validation_docs() if self.has_validation_docs() else self.test_docs())
fewshotex = rnd.sample(self._fewshot_docs, num_fewshot + 1)
# get rid of the doc that's the one we're evaluating, if it's in the fewshot
fewshotex = [x for x in fewshotex if x != doc][:num_fewshot]
labeled_examples = "\n\n".join([self.doc_to_text(doc) + self.doc_to_target(doc) for doc in fewshotex]) + "\n\n"
example = self.doc_to_text(doc)
return description + labeled_examples + example
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances, None)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
def dump_config(self) -> dict:
"""Returns a dictionary representing the task's config.
:returns: dict
The task's configuration as a dictionary.
"""
# TODO: this should only return the overrides applied to a non-YAML task's configuration.
# (num_fewshot)
return self.config.to_dict()
def override_metric(self, metric_name: str) -> None:
"""
Override the default metrics used for evaluation with custom metrics.
Parameters:
- metric_name (str): The name of the custom metric to override. Should be registered in api.metrics.
"""
(
self._metric_fn_list,
self._aggregation_list,
self._metric_fn_kwargs,
self._higher_is_better,
) = ({}, {}, {}, {})
self._metric_fn_list[metric_name] = get_metric(metric_name)
self._aggregation_list[metric_name] = get_metric_aggregation(metric_name)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
self._metric_fn_kwargs[metric_name] = {}
if not isinstance(self, ConfigurableTask):
self.process_results = lambda x, y: {metric_name: get_metric(metric_name)}
self.aggregation = lambda: {metric_name: get_metric_aggregation(metric_name)}
setattr(self._config, "metric_list", [{"metric": metric_name}])
setattr(self._config, "process_results", None)
class ConfigurableTask(Task):
VERSION = "Yaml"
OUTPUT_TYPE = None
CONFIG = None
def __init__(self, model_name) -> None: # TODO no super() call here
# Get pre-configured attributes
self._config = self.CONFIG
# different model requires different prompt, we have to take those into account.
self.model_name = model_name
self._prepare_model_specific_config()
assert self.config.output_type in ALL_OUTPUT_TYPES
self.OUTPUT_TYPE = self.config.output_type
self.DATASET_PATH = self.config.dataset_path
if self.config.dataset_name is not None:
self.DATASET_NAME = self.config.dataset_name
self._prepare_metric_and_aggregation()
self.download(self.config.dataset_kwargs)
self._training_docs = None
self._fewshot_docs = None
if self.config.filter_list is not None:
self._filters = []
for filter_config in self.config.filter_list:
filter_name = filter_config["name"]
filter_functions = filter_config["filter"]
components = []
for function in filter_functions:
kwargs = {key: function[key] for key in function if key != "function"}
components.append([function["function"], kwargs])
filter_pipeline = build_filter_ensemble(filter_name, components)
self._filters.append(filter_pipeline)
else:
self._filters = [build_filter_ensemble("none", [["take_first", None]])]
if self.config.fewshot_config is not None:
self.sampler = samplers.get_sampler(self.config.fewshot_config.get("sampler", "default") if self.config.fewshot_config else "default")(list(self.fewshot_docs()), self, rnd=random.Random(1234))
if self.has_test_docs():
self.task_docs = self.test_docs()
elif self.has_validation_docs():
self.task_docs = self.validation_docs()
else:
assert False, f"Task dataset (path={self.DATASET_PATH}, name={self.DATASET_NAME}) must have valid or test docs!"
# Test One Doc
self.features = list(self.task_docs.features.keys())
self.multiple_input = 0
self.multiple_target = 0
test_doc = self.task_docs[0]
test_text = self.doc_to_text(test_doc)
test_target = self.doc_to_target(test_doc)
if self.config.doc_to_choice is not None:
test_choice = self.doc_to_choice(test_doc)
if type(test_choice) is not list:
eval_logger.error("doc_to_choice must return list")
else:
num_choice = len(test_choice)
if type(test_text) is int:
self.multiple_input = num_choice
else:
test_choice = None
if type(test_target) is list:
self.multiple_target = len(test_target)
else:
if (type(test_target) is int) and (test_choice is not None):
test_target = test_choice[test_target]
else:
test_target = str(test_target)
if test_choice is not None:
check_choices = test_choice
else:
check_choices = [test_target]
if self.config.doc_to_choice is not None:
for choice in check_choices:
choice_has_whitespace = choice[0].isspace()
delimiter_has_whitespace = self.config.target_delimiter.rstrip() != self.config.target_delimiter
if delimiter_has_whitespace and choice_has_whitespace:
eval_logger.warning(f'Both target_delimiter and target choice: "{choice}" have whitespace')
elif (not delimiter_has_whitespace) and (not choice_has_whitespace):
eval_logger.warning(f'Both target_delimiter "{self.config.target_delimiter}" and target choice: "{choice}" do not have whitespace, ignore if the language you are evaluating on does not require/use whitespace')
def _prepare_model_specific_config(self):
self.lmms_eval_specific_kwargs = self.config.lmms_eval_specific_kwargs
if self.lmms_eval_specific_kwargs is not None:
if self.model_name in self.lmms_eval_specific_kwargs:
self.lmms_eval_specific_kwargs = self.lmms_eval_specific_kwargs[self.model_name]
if "default" in self.lmms_eval_specific_kwargs:
self.lmms_eval_specific_kwargs.update(self.lmms_eval_specific_kwargs.get("default", {}))
if "dataset" in self.lmms_eval_specific_kwargs:
self.lmms_eval_specific_kwargs.update(self.lmms_eval_specific_kwargs.get("dataset", {}))
self.model_specific_target_kwargs = self.config.model_specific_target_kwargs
if self.model_specific_target_kwargs is not None:
if self.model_name in self.model_specific_target_kwargs:
self.model_specific_target_kwargs = self.model_specific_target_kwargs[self.model_name]
else:
self.model_specific_target_kwargs = self.model_specific_target_kwargs.get("default", None)
self.model_specific_generation_kwargs = self.config.model_specific_generation_kwargs
if self.model_specific_generation_kwargs is not None:
if self.model_name in self.model_specific_generation_kwargs:
self.model_specific_generation_kwargs = self.model_specific_generation_kwargs[self.model_name]
else:
self.model_specific_generation_kwargs = self.model_specific_generation_kwargs.get("default", {})
self.config.generation_kwargs.update(self.model_specific_generation_kwargs)
def _prepare_metric_and_aggregation(self):
self._metric_fn_list = {}
self._metric_fn_kwargs = {}
self._aggregation_list = {}
self._higher_is_better = {}
if self.config.metric_list is None:
# TODO: handle this in TaskConfig.__post_init__ ?
_metric_list = DEFAULT_METRIC_REGISTRY[self.config.output_type]
for metric_name in _metric_list:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
self._metric_fn_kwargs[metric_name] = {}
self._aggregation_list[metric_name] = get_metric_aggregation(metric_name)
self._higher_is_better[metric_name] = is_higher_better(metric_name)
else:
for metric_config in self.config.metric_list:
assert "metric" in metric_config
metric_name = metric_config["metric"]
kwargs = {key: metric_config[key] for key in metric_config if key not in ["metric", "aggregation", "higher_is_better"]}
if self.config.process_results is not None:
self._metric_fn_list[metric_name] = None
self._metric_fn_kwargs[metric_name] = {}
elif callable(metric_name):
metric_fn = metric_name.__call__
metric_name = metric_name.__name__
self._metric_fn_list[metric_name] = metric_fn
self._metric_fn_kwargs[metric_name] = kwargs
else:
self._metric_fn_list[metric_name] = METRIC_REGISTRY[metric_name]
self._metric_fn_kwargs[metric_name] = kwargs
if "aggregation" in metric_config:
agg_name = metric_config["aggregation"]
if type(agg_name) == str:
self._aggregation_list[metric_name] = get_aggregation(agg_name)
elif callable(agg_name):
self._aggregation_list[metric_name] = metric_config["aggregation"]
else:
INV_AGG_REGISTRY = {v: k for k, v in AGGREGATION_REGISTRY.items()}
metric_agg = get_metric_aggregation(metric_name)
eval_logger.warning(f"[Task: {self._config.task}] metric {metric_name} is defined, but aggregation is not. " f"using default " f"aggregation={INV_AGG_REGISTRY[metric_agg]}")
self._aggregation_list[metric_name] = metric_agg
if "higher_is_better" in metric_config:
self._higher_is_better[metric_name] = metric_config["higher_is_better"]
else:
eval_logger.warning(f"[Task: {self._config.task}] metric {metric_name} is defined, but higher_is_better is not. " f"using default " f"higher_is_better={is_higher_better(metric_name)}")
self._higher_is_better[metric_name] = is_higher_better(metric_name)
@retry(stop=(stop_after_attempt(5) | stop_after_delay(60)), wait=wait_fixed(2))
def download(self, dataset_kwargs=None) -> None:
# If the dataset is a video dataset,
# recursively search for zip archives and unzip them into the Hugging Face home
download_config = DownloadConfig()
download_config.max_retries = dataset_kwargs.get("max_retries", 10) if dataset_kwargs is not None else 10
download_config.num_proc = dataset_kwargs.get("num_proc", 8) if dataset_kwargs is not None else 8
download_config.local_files_only = dataset_kwargs.get("local_files_only", True) if dataset_kwargs is not None else True # NOTE: use local files by default
if dataset_kwargs is not None: # NOTE lxh
if "From_YouTube" in dataset_kwargs:
raise NotImplementedError("Downloading videos via From_YouTube is not supported.")
def _download_from_youtube(path):
try:
for video in tqdm(self.all_dataset[split]):
video_id = video["videoID"]
target_path = os.path.join(path, f"{video_id}.mp4")
assert shutil.which("yt-dlp") is not None, "yt-dlp must be installed and available in the system's PATH"
command = f"yt-dlp -o {target_path} -f mp4 https://www.youtube.com/watch?v={video_id}"
subprocess.run(command, shell=True)
with open(os.path.join(cache_path, f"{task}_download_status.json"), "w") as f:
f.write(json.dumps({task: "downloaded"}))
except Exception as e:
eval_logger.error(f"Error while downloading {task} data: {e}")
with open(os.path.join(cache_path, f"{task}_download_status.json"), "w") as f:
f.write(json.dumps({task: "not downloaded"}))
hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")
accelerator = Accelerator()
if accelerator.is_main_process:
dataset_kwargs.pop("From_YouTube")
self.all_dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
**dataset_kwargs if dataset_kwargs is not None else {},
)
dataset_kwargs["From_YouTube"] = True
cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset") # download_parquet
split = vars(self.config)["test_split"]
task = vars(self.config)["task"]
video_path = os.path.join(hf_home, task)
if os.path.exists(os.path.join(cache_path, f"{task}_download_status.json")):
download_status = json.load(open(os.path.join(cache_path, f"{task}_download_status.json"), "r"))
if download_status[task] == "downloaded":
eval_logger.info(f"Data for {task} already downloaded!")
else:
eval_logger.info(f"Start downloading YouTube data to {video_path}...")
_download_from_youtube(video_path)
else:
eval_logger.info(f"Start downloading YouTube data to {video_path}...")
_download_from_youtube(video_path)
accelerator.wait_for_everyone()
if "builder_script" in dataset_kwargs:
builder_script = dataset_kwargs["builder_script"]
self.DATASET_PATH = os.path.join(cache_path, builder_script)
dataset_kwargs.pop("builder_script")
downloaded_video_ids = [i.split(".mp4")[0] for i in os.listdir(os.path.expanduser(video_path)) if i.endswith(".mp4")]
# Filter the existing dataset by the downloaded video ids
self.dataset = datasets.DatasetDict({split: self.all_dataset[split].filter(lambda x: x["videoID"] in downloaded_video_ids)})
self.dataset_no_image = self.dataset
dataset_kwargs.pop("From_YouTube")
return
if "video" in dataset_kwargs and dataset_kwargs["video"]:
# hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")
cache_dir = dataset_kwargs["cache_dir"]
# cache_dir = os.path.join(hf_home, cache_dir)
if not (os.path.exists(cache_dir) or 's3://' in cache_dir):
raise NotImplementedError(f"cache_dir must be an existing local path or an s3:// URI, got: {cache_dir}.")
accelerator = Accelerator()
if accelerator.is_main_process:
force_download = dataset_kwargs.get("force_download", False)
force_unzip = dataset_kwargs.get("force_unzip", False)
cache_path = snapshot_download(repo_id=self.DATASET_PATH, repo_type="dataset", force_download=force_download, etag_timeout=60)
zip_files = glob(os.path.join(cache_path, "**/*.zip"), recursive=True)
tar_files = glob(os.path.join(cache_path, "**/*.tar*"), recursive=True)
def unzip_video_data(zip_file):
import zipfile
import os
with zipfile.ZipFile(zip_file, "r") as zip_ref:
for file_info in zip_ref.infolist():
target_path = os.path.join(cache_dir, file_info.filename)
if not os.path.exists(target_path):
zip_ref.extract(file_info, cache_dir)
else:
eval_logger.info(f"Skipping existing file: {target_path}")
eval_logger.info(f"Extracted all files from {zip_file} to {cache_dir}")
def untar_video_data(tar_file):
import tarfile
with tarfile.open(tar_file, "r") as tar_ref:
tar_ref.extractall(cache_dir)
eval_logger.info(f"Extracted all files from {tar_file} to {cache_dir}")
def concat_tar_parts(tar_parts, output_tar):
with open(output_tar, "wb") as out_tar:
from tqdm import tqdm
for part in tqdm(sorted(tar_parts)):
with open(part, "rb") as part_file:
# Stream the part instead of reading the whole file into memory.
shutil.copyfileobj(part_file, out_tar)
eval_logger.info(f"Concatenated parts {tar_parts} into {output_tar}")
# Unzip zip files if needed
if force_unzip or (not os.path.exists(cache_dir) and len(zip_files) > 0):
for zip_file in zip_files:
unzip_video_data(zip_file)
# Concatenate and extract tar files if needed
if force_unzip or (not os.path.exists(cache_dir) and len(tar_files) > 0):
tar_parts_dict = {}
# Group tar parts together
for tar_file in tar_files:
base_name = tar_file.split(".tar")[0]
if base_name not in tar_parts_dict:
tar_parts_dict[base_name] = []
tar_parts_dict[base_name].append(tar_file)
# Concatenate and untar split parts
for base_name, parts in tar_parts_dict.items():
eval_logger.info(f"Extracting following tar files: {parts}")
output_tar = base_name + ".tar"
if not os.path.exists(output_tar):
eval_logger.info("Start concatenating tar files")
concat_tar_parts(parts, output_tar)
eval_logger.info("Finish concatenating tar files")
if not os.path.exists(os.path.join(cache_dir, os.path.basename(base_name))):
untar_video_data(output_tar)
accelerator.wait_for_everyone()
dataset_kwargs.pop("cache_dir")
dataset_kwargs.pop("video")
if "builder_script" in dataset_kwargs:
raise NotImplementedError("builder_script is not supported.")
builder_script = dataset_kwargs["builder_script"]
self.DATASET_PATH = os.path.join(cache_path, builder_script)
dataset_kwargs.pop("builder_script")
if "force_download" in dataset_kwargs:
dataset_kwargs.pop("force_download")
if "force_unzip" in dataset_kwargs:
dataset_kwargs.pop("force_unzip")
if "local_files_only" in dataset_kwargs:
dataset_kwargs.pop("local_files_only")
# self.dataset = datasets.load_dataset(
# path=self.DATASET_PATH,
# name=self.DATASET_NAME,
# download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
# download_config=download_config,
# **dataset_kwargs if dataset_kwargs is not None else {},
# )
self.dataset = datasets.load_dataset(
path=self.DATASET_PATH,
name=self.DATASET_NAME,
# local_files_only=True
# download_mode=datasets.DownloadMode.REUSE_DATASET_IF_EXISTS,
# download_config=download_config,
# **dataset_kwargs if dataset_kwargs is not None else {},
)
if self.config.process_docs is not None:
for split in self.dataset:
if split in [
self.config.training_split, self.config.validation_split, self.config.test_split, self.config.fewshot_split
]:
self.dataset[split] = self.config.process_docs(self.dataset[split])
# copy dataset, remove image features
self.dataset_no_image = self.dataset.copy()
for doc_name in self.dataset_no_image:
remove_cols = []
features = self.dataset_no_image[doc_name].features
# If it is an Image instance or a Sequence of Image instance. Remove it
for feature in features:
if isinstance(features[feature], Image):
remove_cols.append(feature)
elif isinstance(features[feature], Sequence) and isinstance(features[feature].feature, Image):
remove_cols.append(feature)
for remove_col in remove_cols:
self.dataset_no_image[doc_name] = self.dataset_no_image[doc_name].remove_columns(remove_col)
def has_training_docs(self) -> bool:
return self.config.training_split is not None
def has_validation_docs(self) -> bool:
return self.config.validation_split is not None
def has_test_docs(self) -> bool:
return self.config.test_split is not None
def training_docs(self) -> datasets.Dataset:
if self.has_training_docs():
return self.dataset[self.config.training_split]
def validation_docs(self) -> datasets.Dataset:
if self.has_validation_docs():
return self.dataset[self.config.validation_split]
def test_docs(self) -> datasets.Dataset:
if self.has_test_docs():
return self.dataset[self.config.test_split]
def fewshot_docs(self):
if self.config.fewshot_split is not None:
return self.dataset[self.config.fewshot_split]
else:
if (self.config.num_fewshot is not None) and (self.config.num_fewshot > 0):
eval_logger.warning(f"Task '{self.config.task}': " "num_fewshot > 0 but fewshot_split is None. " "using preconfigured rule.")
return super().fewshot_docs()
@utils.positional_deprecated
def fewshot_context(self, doc_id, num_fewshot, split):
"""Returns a fewshot context string that is made up of a prepended description
(if provided), the `num_fewshot` number of examples, and an appended prompt example.
:param doc_id: int
The index of the document within `split` of `self.dataset_no_image`.
:param num_fewshot: int
The number of fewshot examples to provide in the returned context string.
:param split: str
The dataset split from which the document is retrieved.
:returns: str
The fewshot context.
"""
doc = self.dataset_no_image[split][doc_id]
if num_fewshot == 0:
# always prepend the (possibly empty) task description
labeled_examples = self.config.description
else:
labeled_examples = self.config.description + self.sampler.get_context(doc, num_fewshot)
example = self.doc_to_text(doc)
if type(example) == str:
return labeled_examples + example
elif type(example) == list:
return [labeled_examples + ex for ex in example]
elif type(example) == int:
if self.config.doc_to_choice is not None:
choices = self.doc_to_choice(doc)
return labeled_examples + choices[example]
else:
return labeled_examples + str(example)
def apply_filters(self):
if hasattr(self, "_filters"):
for f in self._filters:
f.apply(self._instances, self.task_docs)
else:
eval_logger.warning("No filter defined, passing through instances")
return self._instances
def should_decontaminate(self):
return self.config.should_decontaminate
def doc_to_decontamination_query(self, doc):
if self.config.should_decontaminate:
if self.config.doc_to_decontamination_query is None:
return self.doc_to_text(doc)
else:
doc_to_decontamination_query = self.config.doc_to_decontamination_query
if doc_to_decontamination_query in self.features:
return doc[doc_to_decontamination_query]
elif callable(doc_to_decontamination_query):
return doc_to_decontamination_query(doc)
else:
return ast.literal_eval(utils.apply_template(self.config.doc_to_decontamination_query, doc))
def _process_doc(self, doc):
"""
Override this to process (detokenize, strip, replace, etc.) individual
documents. This can be used in a map over documents of a data split.
E.g. `map(self._process_doc, self.dataset["validation"])`
:return: dict
The processed version of the specified `doc`.
"""
return doc
def doc_to_text(self, doc):
doc_to_text = self.config.doc_to_text
if type(doc_to_text) == int:
return doc_to_text
elif type(doc_to_text) == str:
if doc_to_text in self.features:
# if self.config.doc_to_choice is not None:
# return self.doc_to_choice(doc)[doc[doc_to_text]]
# else:
return doc[doc_to_text]
else:
text_string = utils.apply_template(doc_to_text, doc)
if text_string.isdigit() and self._config.doc_to_choice is not None:
return ast.literal_eval(text_string)
else:
return text_string
elif callable(doc_to_text):
return (
doc_to_text(doc, self.lmms_eval_specific_kwargs)
if self.lmms_eval_specific_kwargs is not None
else doc_to_text(
doc,
)
)
# Used when applying a Promptsource template
elif hasattr(doc_to_text, "apply"):
applied_prompt = doc_to_text.apply(doc)
if len(applied_prompt) == 2:
return applied_prompt[0]
else:
eval_logger.warning("Applied prompt returns empty string")
return self.config.fewshot_delimiter
else:
raise TypeError(f"Unsupported doc_to_text type: {type(doc_to_text)}")
def doc_to_target(self, doc: dict) -> Union[int, str, list]:
doc_to_target = self.config.doc_to_target
if type(doc_to_target) == int:
return doc_to_target
elif type(doc_to_target) == str:
if doc_to_target in self.features:
# if self.config.doc_to_choice is not None:
# return self.doc_to_choice(doc)[doc[doc_to_target]]
# else:
return doc[doc_to_target]
else:
target_string = utils.apply_template(doc_to_target, doc)
if target_string.isdigit() and self._config.doc_to_choice is not None:
return ast.literal_eval(target_string)
elif len(target_string) >= 2 and (target_string[0] == "[") and (target_string[-1] == "]"):
try:
return ast.literal_eval(target_string)
except (SyntaxError, ValueError):
return target_string
else:
return target_string
elif type(doc_to_target) == list:
return doc_to_target
elif callable(doc_to_target):
return doc_to_target(doc, self.model_specific_target_kwargs) if self.model_specific_target_kwargs is not None else doc_to_target(doc)
# Used when applying a Promptsource template
elif hasattr(doc_to_target, "apply"):
applied_prompt = doc_to_target.apply(doc)
if len(applied_prompt) == 2:
return applied_prompt[1]
else:
eval_logger.warning("Applied prompt returns empty string")
return self.config.fewshot_delimiter
else:
raise TypeError
def doc_to_visual(self, doc: dict) -> Union[int, str, list]:
if type(self.config.doc_to_visual) == str:
assert self.config.doc_to_visual in self.features
# Single image. Still return a list for consistency.
return [doc[self.config.doc_to_visual]]
elif callable(self.config.doc_to_visual):
return (
self.config.doc_to_visual(doc, self.lmms_eval_specific_kwargs)
if self.lmms_eval_specific_kwargs is not None and len(inspect.signature(self.config.doc_to_visual).parameters) == 2
else self.config.doc_to_visual(
doc,
)
)
else:
raise TypeError
def doc_to_choice(self, doc: Any) -> List[str]:
if self.config.doc_to_choice is None:
eval_logger.error("doc_to_choice was called but not set in config")
else:
doc_to_choice = self.config.doc_to_choice
if type(doc_to_choice) == str:
if doc_to_choice in self.features:
return doc[doc_to_choice]
else:
return ast.literal_eval(utils.apply_template(doc_to_choice, doc))
elif type(doc_to_choice) == list:
return doc_to_choice
elif type(doc_to_choice) == dict:
return list(doc_to_choice.values())
elif callable(doc_to_choice):
return doc_to_choice(doc)
elif hasattr(doc_to_choice, "get_answer_choices_list"):
return doc_to_choice.get_answer_choices_list(doc)
else:
raise TypeError
def construct_requests(self, doc_id: int, ctx: str, **kwargs) -> Union[List[Instance], Instance]:
split = kwargs.pop("split")
if self.OUTPUT_TYPE == "loglikelihood":
arguments = (ctx, self.doc_to_target, self.doc_to_visual, doc_id, self.config.task, split)
elif self.OUTPUT_TYPE == "multiple_choice":
doc = self.dataset[split][doc_id]
choices = self.doc_to_choice(doc)
target_delimiter = self.config.target_delimiter
if self.multiple_input:
# If there are multiple inputs, choices are placed in the ctx
cont = self.doc_to_target(doc)
arguments = [(ctx, f"{target_delimiter}{cont}", self.doc_to_visual, doc_id, self.config.task, split) for ctx in choices]
else:
# Otherwise they are placed in the continuation
arguments = [(ctx, f"{target_delimiter}{cont}", self.doc_to_visual, doc_id, self.config.task, split) for cont in choices]
request_list = [
Instance(
request_type="loglikelihood",
# doc=doc,
arguments=arg,
idx=i,
**kwargs,
)
for i, arg in enumerate(arguments)
]
# TODO: we should raise a warning telling users this will at most ~2x runtime.
if "acc_mutual_info" in self._metric_fn_list.keys():
# if we are calculating multiple choice accuracy
# using mutual information instead of raw loglikelihood as metric, need unconditional lls.
# here mutual info refers to calculating
# log(P(choice|ctx) / P(choice)) = log(P(choice|ctx)) - log(P(choice))
# in other words normalizing by subtracting the unconditional logprob of each choice.
request_list.extend(
[
Instance(
request_type="loglikelihood",
# doc=doc,
arguments=("", "{}".format(choice)),
idx=i,
**kwargs,
)
for i, choice in enumerate(choices)
]
)
return request_list
elif self.OUTPUT_TYPE == "generate_until":
arguments = (ctx, self.config.generation_kwargs, self.doc_to_visual, doc_id, self.config.task, split)
return Instance(request_type=self.OUTPUT_TYPE, arguments=arguments, idx=0, **kwargs)
# TODO: we add a full_docs interface here for some evaluations that needs to access the full datasets during process_results function. we may have better ways to handle this.
@retry(stop=(stop_after_attempt(5) | stop_after_delay(1200)), wait=wait_fixed(2))
def process_results(self, doc, results, full_docs=None):
if self.OUTPUT_TYPE == "generate_until":
results[0] = results[0].strip()
kwargs = {}
if full_docs is not None:
kwargs["full_docs"] = full_docs
if callable(self.config.process_results):
return self.config.process_results(doc, results, **kwargs)
result_dict = {}
use_metric = list(self._metric_fn_list.keys())
if self.OUTPUT_TYPE == "loglikelihood":
results = results[0]
ll, is_greedy = results
return {
**({"perplexity": ll} if "perplexity" in use_metric else {}),
**({"acc": int(is_greedy)} if "acc" in use_metric else {}),
}
elif self.OUTPUT_TYPE == "multiple_choice":
lls, is_greedy = zip(*results)
# retrieve choices in List[str] form, to compute choice lengths, etc.
choices = self.doc_to_choice(doc)
completion_len = np.array([float(len(i)) for i in choices])
if 2 * len(choices) == len(lls) and "acc_mutual_info" in self._metric_fn_list.keys():
# then we are doing mutual info.
# this stores the "dryrun" / unconditional answer loglikelihoods
lls_unconditional = lls[1::2]
assert len(lls_unconditional) == len(choices)
# and this stores our "regular" conditional loglikelihoods
lls = lls[::2]
pred = np.argmax(lls)
pred_norm = np.argmax(lls / completion_len)
if self.multiple_input:
gold = self.doc_to_text(doc)
else:
gold = self.doc_to_target(doc)
gold_index_error = False
if type(gold) is list:
gold = [i if i < len(choices) else -100 for i in gold]
if -100 in gold:
gold_index_error = True
else:
if type(gold) is int:
gold = gold if gold < len(choices) else -100
elif type(gold) is str:
gold = choices.index(gold) if gold in choices else -100
if gold == -100:
gold_index_error = True
if gold_index_error:
eval_logger.warning(f"Label index was not within range of available choices. Sample:\n\n{doc}\n\n")
if self.multiple_target:
acc = 1.0 if pred in gold else 0.0
acc_norm = 1.0 if pred_norm in gold else 0.0
exact_match = int(any([is_greedy[i] if i != -100 else 0 for i in gold]))
else:
acc = 1.0 if pred == gold else 0.0
acc_norm = 1.0 if pred_norm == gold else 0.0
# TODO: this gets score of 0 on arc_challenge for pythia-70m. need to test that this works properly
exact_match = int(is_greedy[gold]) if gold != -100 else 0
result_dict = {
**({"acc": acc} if "acc" in use_metric else {}),
**({"f1": (gold, pred)} if "f1" in use_metric else {}),
**({"mcc": (gold, pred)} if "mcc" in use_metric else {}),
**({"acc_norm": acc_norm} if "acc_norm" in use_metric else {}),
**({"exact_match": exact_match} if "exact_match" in use_metric else {}),
}
if "acc_mutual_info" in use_metric:
lls_mutual_info = [ll_c - ll_u for ll_c, ll_u in zip(lls, lls_unconditional)]
acc_mutual_info = 1.0 if np.argmax(lls_mutual_info) == gold else 0.0
result_dict["acc_mutual_info"] = acc_mutual_info
elif self.OUTPUT_TYPE == "generate_until":
gold = self.doc_to_target(doc)
result = results[0]
if self.config.doc_to_choice is not None:
# If you set doc_to_choice,
# it assumes that doc_to_target returns a number.
choices = self.doc_to_choice(doc)
gold = choices[gold]
# we expect multiple_targets to be a list.
elif self.multiple_target:
gold = list(gold)
elif type(gold) != type(result):
# cast gold to the same type as result
gold = type(result)(gold)
for metric in self._metric_fn_list.keys():
if self.multiple_target:
# in the case where we have multiple targets,
# return true if any are true
# TODO: this may break for multiple_target, non zero-or-1 metrics
scores = []
if not isinstance(gold, list):
# sometimes, a multiple_target dataset has exceptions where one doc has only one string answer
# print(gold)
gold = [gold]
for gold_option in gold:
try:
result_score = self._metric_fn_list[metric](
references=[gold_option],
predictions=[result],
**self._metric_fn_kwargs[metric],
)
except TypeError: # TODO: this is hacky and I don't want to do it
result_score = self._metric_fn_list[metric]([gold_option, result])
if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict.
result_score = result_score[metric]
scores.append(result_score)
if any(scores):
result_score = 1.0
else:
result_score = 0.0
else:
try:
result_score = self._metric_fn_list[metric](
references=[gold],
predictions=[result],
**self._metric_fn_kwargs[metric],
)
except TypeError: # needed for now in order to use a different interface between our own metrics and HF Evaluate metrics
result_score = self._metric_fn_list[metric]([gold, result])
if isinstance(result_score, dict):
# TODO: this handles the case where HF evaluate returns a dict.
result_score = result_score[metric]
result_dict[metric] = result_score
else:
raise ValueError(
f"Passed invalid output_type '{self.OUTPUT_TYPE}' ! Please use one of "
"'loglikelihood', 'generate_until' or 'multiple_choice'"
)
return result_dict
def aggregation(self):
return self._aggregation_list
def higher_is_better(self):
return self._higher_is_better
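# Illustrative sketch (not part of the original file): how the "multiple_choice"
# branch of process_results arrives at `acc` and `acc_norm`. It mirrors the
# argmax over log-likelihoods and the division by choice length shown above,
# but uses only the standard library instead of numpy. The helper name
# `score_multiple_choice` and the sample values are assumptions made for
# this example.

```python
def score_multiple_choice(lls, choices, gold):
    """Return (acc, acc_norm) for one document.

    lls: one log-likelihood per choice; choices: the choice strings;
    gold: index of the correct choice.
    """
    indices = range(len(lls))
    # Plain accuracy: pick the choice with the highest log-likelihood.
    pred = max(indices, key=lambda i: lls[i])
    # Length-normalised accuracy: divide each log-likelihood by the
    # character length of its choice before taking the argmax.
    pred_norm = max(indices, key=lambda i: lls[i] / float(len(choices[i])))
    acc = 1.0 if pred == gold else 0.0
    acc_norm = 1.0 if pred_norm == gold else 0.0
    return acc, acc_norm


# The longer choice has a lower raw log-likelihood but wins after normalisation.
lls = [-4.0, -6.0]
choices = ["no", "absolutely"]
acc, acc_norm = score_multiple_choice(lls, choices, gold=1)
```

# Here the raw argmax picks index 0, while the normalised scores (-2.0 vs -0.6)
# pick index 1, so the two metrics can disagree on the same document.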
================================================
FILE: lmms-eval_videochat/lmms_eval/evaluator.py
================================================
import os
import time
import random
import itertools
import json
import collections
import sys
import inspect
from tqdm import tqdm
import torch
import numpy as np
from datasets import Image, Sequence
import lmms_eval.api
import lmms_eval.tasks
import lmms_eval.models
import lmms_eval.api.metrics
import lmms_eval.api.registry
from lmms_eval.utils import (
positional_deprecated,
run_task_tests,
make_table,
create_iterator,
get_git_commit_hash,
simple_parse_args_string,
)
from loguru import logger as eval_logger
@positional_deprecated
def simple_evaluate(
model,
model_args=None,
tasks=[],
num_fewshot=None,
batch_size=None,
device=None,
limit=None,
bootstrap_iters: int = 100000,
check_integrity: bool = False,
show_task_to_terminal: bool = False,
log_samples: bool = True,
gen_kwargs: str = None,
cli_args=None, # Bo: put args into more functions (cost 48 Bytes per call)
predict_only: bool = False,
):
"""Instantiate and evaluate a model on a list of tasks.
:param model: Union[str, LMM]
Name of model or LMM object, see lmms_eval.models.get_model
:param model_args: Optional[str]
String arguments for each model class, see LMM.create_from_arg_string.
Ignored if `model` argument is a LMM object.
:param tasks: list[Union[str, Task]]
List of task names or Task objects. Task objects will be taken to have name task.EVAL_HARNESS_NAME if defined and type(task).__name__ otherwise.
:param num_fewshot: int
Number of examples in few-shot context
:param batch_size: int or str, optional
Batch size for model
:param device: str, optional
PyTorch device (e.g. "cpu" or "cuda:0") for running models
:param limit: int or float, optional
Limit the number of examples per task (only use this for testing). If <1, limit is a fraction of the total number of examples.
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param check_integrity: bool
Whether to run the relevant part of the test suite for the tasks
:param show_task_to_terminal: bool
If True, write out an example document and model input for checking task integrity
:param log_samples: bool
If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis
:param gen_kwargs: str
String arguments for model generation
Ignored for all tasks with loglikelihood output_type
:return
Dictionary of results
"""
random.seed(0)
np.random.seed(1234)
torch.manual_seed(1234) # TODO: this may affect training runs that are run with evaluation mid-run.
assert tasks != [], "No tasks specified, or no tasks found. Please verify the task names."
if gen_kwargs:
gen_kwargs = simple_parse_args_string(gen_kwargs)
eval_logger.warning(f"generation_kwargs specified through cli, these settings will be used over set parameters in yaml tasks.")
if gen_kwargs == "":
gen_kwargs = None
if model_args is None:
model_args = ""
lm = lmms_eval.api.registry.get_model(model).create_from_arg_string(
model_args,
{
"batch_size": batch_size,
"device": device,
},
)
task_dict = lmms_eval.tasks.get_task_dict(tasks, model_name=model)
for task_name in task_dict.keys():
task_obj = task_dict[task_name]
if type(task_obj) == tuple:
group, task_obj = task_obj
if task_obj is None:
continue
lm.task_dict[task_name] = task_obj.dataset
config = task_obj._config
if config["output_type"] == "generate_until" and gen_kwargs:
config["generation_kwargs"].update(gen_kwargs)
if predict_only:
log_samples = True
eval_logger.info(f"Processing {task_name} in output-only mode. Metrics will not be calculated!")
# we have to change the class properties post-hoc. This is pretty hacky.
task_obj.override_metric(metric_name="bypass")
if num_fewshot is not None:
if config["num_fewshot"] == 0:
eval_logger.info(f"num_fewshot has been set to 0 for {task_name} in its config. Manual configuration will be ignored.")
else:
default_num_fewshot = config["num_fewshot"]
eval_logger.warning(f"Overwriting default num_fewshot of {task_name} from {default_num_fewshot} to {num_fewshot}")
task_obj._config["num_fewshot"] = num_fewshot
if check_integrity:
run_task_tests(task_list=tasks)
# print('task_dict:')
# print(task_dict)
results = evaluate(
lm=lm,
task_dict=task_dict,
limit=limit,
bootstrap_iters=bootstrap_iters,
show_task_to_terminal=show_task_to_terminal,
log_samples=log_samples,
cli_args=cli_args,
)
if lm.rank == 0:
# add info about the model and few shot config
results["model_configs"] = {
"model": model if isinstance(model, str) else model.model.config._name_or_path,
"model_args": model_args,
"batch_size": batch_size,
"device": device,
"limit": limit,
"bootstrap_iters": bootstrap_iters,
"gen_kwargs": gen_kwargs,
}
results["git_hash"] = get_git_commit_hash()
return results
else:
return None
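# Illustrative sketch (not part of the original file): how the `limit`
# argument documented above is resolved, and how `evaluate` then splits
# documents across ranks with `itertools.islice(enumerate(docs), rank, limit,
# world_size)`. The helper names `resolve_limit` and `shard_docs` and the toy
# document list are assumptions made for this example.

```python
import itertools


def resolve_limit(limit, num_docs):
    """Turn a fractional (<1) or absolute limit into a document count."""
    if limit is None:
        return None
    return int(num_docs * limit) if limit < 1.0 else int(limit)


def shard_docs(docs, rank, world_size, limit=None):
    """Round-robin split: rank r gets documents r, r + world_size, ... below limit."""
    return list(itertools.islice(enumerate(docs), rank, limit, world_size))


docs = [f"doc{i}" for i in range(10)]
limit = resolve_limit(0.5, len(docs))  # half of the ten documents
shard0 = shard_docs(docs, rank=0, world_size=2, limit=limit)
shard1 = shard_docs(docs, rank=1, world_size=2, limit=limit)
```

# Each rank keeps the original document index alongside the document, which is
# what lets the post-processing loop match instances back to `doc_id`.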
decontaminate_suffix = "_decontaminate"
@positional_deprecated
def evaluate(
lm,
task_dict,
limit=None,
bootstrap_iters: int = 100000,
show_task_to_terminal: bool = False,
log_samples: bool = True,
cli_args=None,
):
"""Evaluate an already-instantiated model on a dictionary of tasks.
:param lm: obj
Language Model
:param task_dict: dict[str, Task]
Dictionary of tasks. Tasks will be taken to have name type(task).config.task .
:param limit: int, optional
Limit the number of examples per task (only use this for testing)
:param bootstrap_iters:
Number of iterations for bootstrap statistics
:param show_task_to_terminal: bool
If True, write out an example document and model input for checking task integrity
:param log_samples: bool
If True, write out all model outputs and documents for per-sample measurement and post-hoc analysis
:return
Dictionary of results
"""
# stores the final result for each task, for each metric/filter pair.
results = collections.defaultdict(dict)
# Tracks each task's version.
versions = collections.defaultdict(dict)
# Tracks the YAML configs of all chosen tasks.
configs = collections.defaultdict(dict)
# logs info about each document evaluated.
samples = collections.defaultdict(list)
# tracks all Instances/requests a model must generate output on.
requests = collections.defaultdict(list)
# Aggregated task scores presented with groups
results_agg = collections.defaultdict(dict)
# Aggregated groups scores only
groups_agg = collections.defaultdict(dict)
# stores the amount to pad out reqs per req. type so that
# number of fwd passes per distributed rank is equal
padding_requests = collections.defaultdict(int)
# store the hierarchy to do proper ordering
task_hierarchy = collections.defaultdict(list)
# store the ordering of tasks and groups
task_order = collections.defaultdict(int)
task_group_alias = collections.defaultdict(dict)
# store num-fewshot value per task
num_fewshot = collections.defaultdict(int)
# get lists of each type of request
for task_name, task in task_dict.items():
if type(task) == tuple:
group_name, task = task
task_hierarchy[group_name].append(task_name)
versions[group_name] = "N/A"
else:
group_name = None
task_hierarchy[task_name] = []
if task is None:
continue
versions[task_name] = task.VERSION
configs[task_name] = dict(task.dump_config())
if "num_fewshot" in configs[task_name]:
n_shot = configs[task_name]["num_fewshot"]
else:
n_shot = 0
num_fewshot[task_name] = n_shot
if "task_alias" in configs[task_name]:
task_group_alias[task_name] = configs[task_name]["task_alias"]
if ("group_alias" in configs[task_name]) and (group_name not in task_group_alias) and (group_name is not None):
task_group_alias[group_name] = configs[task_name]["group_alias"]
if limit is not None:
if task.has_test_docs():
task_docs = task.test_docs()
elif task.has_validation_docs():
task_docs = task.validation_docs()
else:
raise RuntimeError("Task has neither test_docs nor validation_docs")
limit = int(len(task_docs) * limit) if limit < 1.0 else int(limit)
task.build_all_requests(limit=limit, rank=lm.rank, world_size=lm.world_size)
eval_logger.debug(f"Task: {task_name}; number of requests on rank {lm.rank}: {len(task.instances)}")
if show_task_to_terminal:
for inst in task.instances:
# print the prompt for the first few documents
if inst.doc_id < 1:
eval_logger.info(
f"Task: {task_name}; document {inst.doc_id}; context prompt (starting on next line):\
\n{inst.args[0]}\n(end of prompt on previous line)\ntarget string or answer choice index (starting on next line):\n{task.doc_to_target(inst.doc)}\n(end of target on previous line)"
)
eval_logger.info(f"Request: {str(inst)}")
# aggregate Instances by LMM method requested to get output.
for instance in task.instances:
reqtype = instance.request_type
requests[reqtype].append(instance)
if lm.world_size > 1:
instances_rnk = torch.tensor(len(task._instances), device=lm.device)
gathered_item = lm.accelerator.gather(instances_rnk).cpu().detach().numpy().tolist()
# compute number of pseudobatches to pad with (FSDP/DDP require even batches among ranks)
numpad = max(gathered_item) - gathered_item[lm.rank]
padding_requests[task.OUTPUT_TYPE] += numpad
### Run LMM on inputs, get all outputs ###
# execute each type of request
for reqtype, reqs in requests.items():
eval_logger.info("Running {} requests".format(reqtype))
# create `K` copies of each request `req` based off `K = req.repeats`
cloned_reqs = []
for req in reqs:
cloned_reqs.extend([req] * req.repeats)
if (lm.world_size > 1) and (padding_requests[reqtype] > 0):
for _ in range(padding_requests[reqtype]):
cloned_reqs.extend([req] * req.repeats)
# run requests through model
resps = getattr(lm, reqtype)(cloned_reqs) # Choiszt run generate until
# put responses from model into a list of length K for each request.
for x, req in zip(resps, cloned_reqs):
req.resps.append(x)
if lm.world_size > 1:
lm.accelerator.wait_for_everyone()
### Postprocess outputs ###
# TODO: del model here, maybe (idea: allow user to specify device of e.g. reward model separately)
for task_name, task in task_dict.items():
if type(task) == tuple:
group, task = task
if task is None:
continue
task.apply_filters()
### Collect values of metrics on all datapoints ###
vals = collections.defaultdict(list)
# unpack results and sort back in order and return control to Task
for task_name, task in task_dict.items():
if type(task) == tuple:
group, task = task
if task is None:
continue
# TODO: make it possible to use a different metric per filter
# iterate over different filters used
for key in task.instances[0].filtered_resps.keys():
# hack: remove image columns to avoid loading images and speed up postprocessing
# reason: doc_iterator will actually load image if it's in the doc.
docs = task.test_docs() if task.has_test_docs() else task.validation_docs()
if not task.config["process_results_use_image"]:
remove_cols = []
features = docs.features
# Remove the feature if it is an Image instance or a Sequence of Image instances.
for feature in features:
if isinstance(features[feature], Image):
remove_cols.append(feature)
elif isinstance(features[feature], Sequence) and isinstance(features[feature].feature, Image):
remove_cols.append(feature)
if remove_cols:
docs = docs.remove_columns(remove_cols)
####################### Processing with Full Docs Mode #######################
full_docs = task.config["full_docs"]
doc_iterator = itertools.islice(enumerate(docs), lm.rank, limit, lm.world_size)
# Instead of converting the iterator to a list, use `itertools.tee` to create a parallel iterator for counting
# doc_iterator, doc_iterator_for_counting = itertools.tee(doc_iterator)
# Do not use the `tee` version above: it is very slow and can crash when doc_iterator_for_counting holds too many objects.
doc_iterator_for_counting = itertools.islice(range(len(task.test_docs())), lm.rank, limit, lm.world_size) if task.has_test_docs() else itertools.islice(range(len(task.validation_docs())), lm.rank, limit, lm.world_size)
total_docs = sum(1 for _ in doc_iterator_for_counting)
pbar = tqdm(total=total_docs, desc=f"Postprocessing", disable=(lm.rank != 0))
for doc_id, doc in doc_iterator:
# subset instances to only this document id ; sort by idx
requests = list(filter(lambda x: x.doc_id == doc_id, task.instances))
requests.sort(key=lambda x: x.idx)
if full_docs:
metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests], full_docs=docs)
else:
metrics = task.process_results(doc, [req.filtered_resps[key] for req in requests])
if log_samples:
target = task.doc_to_target(doc)
example = {
"doc_id": doc_id,
"target": target,
"doc": doc,
"arguments": [tuple(a for a in req.args if isinstance(a, (int, str))) for req in requests], # do not include image
"resps": [req.resps for req in requests],
"filtered_resps": [req.filtered_resps[key] for req in requests],
}
example.update(metrics)
samples[task_name].append(example)
for metric, value in metrics.items():
vals[(task_name, key, metric)].append(value)
pbar.update(1)
pbar.close()
if hasattr(lm, "_model"):
del lm._model
torch.cuda.empty_cache()
if lm.world_size > 1:
# if multigpu, then gather data across all ranks
# first gather logged samples across all ranks
for task_name, task_samples in list(samples.items()):
full_samples = [None] * lm.world_size
torch.distributed.all_gather_object(full_samples, task_samples)
samples[task_name] = list(itertools.chain.from_iterable(full_samples))
# then collect metrics across all ranks
vals_torch = collections.defaultdict(list)
for (task_name, key, metric), items in vals.items():
numitem = 0
if type(items[0]) == tuple:
numitem = len(items[0])
if isinstance(items[0], (str, list, dict)):
# handle the string case
gathered_items = [None] * lm.accelerator.num_processes
torch.distributed.all_gather_object(gathered_items, items)
gathered_item = list(itertools.chain.from_iterable(gathered_items))
else:
# distributed gather requires all ranks to have same dimensions
# so we pad out with float32 min value
pad_value = torch.finfo(torch.float32).min
metrics_tensor = torch.tensor(items, device=lm.device)
original_dtype = metrics_tensor.dtype # store original dtype
torch_device_tensor = lm.accelerator.pad_across_processes(metrics_tensor.to(torch.float32), pad_index=pad_value)
gathered_item = lm.accelerator.gather(torch_device_tensor)
if numitem > 0:
gathered_filtered = gathered_item[gathered_item[:, 0] != pad_value]
else:
gathered_filtered = gathered_item[gathered_item != pad_value]
gathered_item = gathered_filtered.to(original_dtype).cpu().detach().numpy().tolist()
# reconvert if we were passed a tuple of values
if numitem > 0:
gathered_item = [tuple(g) for g in gathered_item]
if lm.rank == 0:
vals_torch[(task_name, key, metric)] = gathered_item
vals = vals_torch
# Ensure all ranks wait for rank 0 to finish aggregation
torch.distributed.barrier()
# Synchronize processes with a temp file in case the evaluation metric requires gpus
# TODO: fix barriers' taking up gpu computation
os.makedirs(cli_args.output_path, exist_ok=True)
if os.path.exists(f"{cli_args.output_path}/rank{int(os.environ.get('RANK', 0))}_metric_eval_done.txt"):
os.remove(f"{cli_args.output_path}/rank{int(os.environ.get('RANK', 0))}_metric_eval_done.txt")
if lm.rank == 0:
### Get task ordering for correct sample-wide aggregation
group_to_task = {}
for group in task_hierarchy.keys():
if group not in task_order:
task_order[group] = 0
if len(task_hierarchy[group]) > 0:
group_to_task[group] = task_hierarchy[group].copy()
for task in task_hierarchy[group]:
if task in task_order:
task_order[task] += 1
else:
task_order[task] = 1 + task_order[group]
if task in task_hierarchy:
group_to_task[group].remove(task)
group_to_task[group].extend(task_hierarchy[task])
task_to_group = {}
for group in group_to_task:
for task in group_to_task[group]:
if task in task_to_group:
task_to_group[task].append(group)
else:
task_to_group[task] = [group]
### Aggregate results over all datapoints ###
# aggregate results ; run bootstrap CIs
for (task_name, key, metric), items in vals.items():
task = task_dict[task_name]
metric_key = metric + "," + key
if type(task) == tuple:
group_name, task = task
else:
group_name = None
if metric not in task.aggregation():
continue
agg_fn = task.aggregation()[metric]
# Bo: for models that need to know the args to save to correct path
if inspect.getfullargspec(agg_fn).args == ["results", "args"]:
results[task_name][metric_key] = agg_fn(items, cli_args)
else:
# Bo: for models only need agg items
results[task_name][metric_key] = agg_fn(items)
results[task_name]["samples"] = len(items)
# hotfix: bleu, chrf, ter seem to be really expensive to bootstrap
# so we run them for fewer iterations. still looking for a cleaner way to do this
if bootstrap_iters > 0:
stderr = lmms_eval.api.metrics.stderr_for_metric(
metric=task.aggregation()[metric],
bootstrap_iters=min(bootstrap_iters, 100) if metric in ["bleu", "chrf", "ter"] else bootstrap_iters,
)
if stderr is not None and len(items) > 1:
results[task_name][metric + "_stderr" + "," + key] = stderr(items)
else:
results[task_name][metric + "_stderr" + "," + key] = "N/A"
if bool(results):
for group, task_list in reversed(task_hierarchy.items()):
if task_list == []:
total_size = results[group]["samples"]
else:
total_size = 0
for task in task_list:
metrics = results[task]
current_size = metrics.pop("samples")
# TODO: There should be a way for users
# to toggle between weighted and
# unweighted averaging
# For unweighted averaging, use:
# current_size = 1
all_stderr = []
for metric in [key for key in metrics.keys() if "_stderr" not in key]:
stderr = "_stderr,".join(metric.split(","))
stderr_score = results[task][stderr]
var_score = stderr_score**2 if stderr_score != "N/A" else 0
metric_score = results[task][metric]
all_stderr.append(stderr)
if metric_score is None:
results[group][metric] = None
results[group][stderr] = 0
continue
if metric in results[group]:
if not isinstance(results[group][metric], str):
results[group][metric] = (results[group][metric] * total_size + metric_score * current_size) / (total_size + current_size)
# $$s_z^2 = \frac{(n-1) s_x^2 + (m-1) s_y^2}{n+m-1} + \frac{nm(\bar x - \bar y)^2}{(n+m)(n+m-1)}.$$
results[group][stderr] = ((total_size - 1) * results[group][stderr] + (current_size - 1) * var_score) / (total_size + current_size - 1) + total_size * current_size / (
(total_size + current_size) * (total_size + current_size - 1)
) * (results[group][metric] - metric_score) ** 2
else:
# accuracy = re.search(r'acc: ([\d.]+)%', results[group][metric]).group(1)
# score = re.search(r'score: ([\d.]+)', results[group][metric]).group(1)
# group_accuracy = float(accuracy)
# group_score = float(score)
# group_accuracy = (group_accuracy * total_size + metric_score * current_size) / total_size
# group_score = (group_score * total_size + metric_score * current_size) / total_size
# results[group][metric] = "Acc: " + str(group_accuracy) + " Score: " + str(group_score)
results[group][metric] = "group_results"
results[group][stderr] = 0
else:
results[group][metric] = metric_score
results[group][stderr] = var_score
total_size += current_size
for stderr in all_stderr:
results[group][stderr] = np.sqrt(results[group][stderr])
results[group]["samples"] = total_size
def print_tasks(task_hierarchy, task_order, task_version, task_group_alias):
results_agg = collections.defaultdict(dict)
groups_agg = collections.defaultdict(dict)
for group_name, task_list in task_hierarchy.items():
order = task_order[group_name]
results_agg[group_name] = results[group_name].copy()
results_agg[group_name]["tab"] = order
if (order < max(task_order.values())) and (len(task_list) > 0):
groups_agg[group_name] = results[group_name].copy()
groups_agg[group_name]["tab"] = order
if task_list != []:
for task in sorted(task_list):
if task in task_hierarchy:
_task_hierarchy = {task: task_hierarchy[task]}
else:
_task_hierarchy = {task: []}
_results_agg, _groups_agg, task_version = print_tasks(_task_hierarchy, task_order, task_version, task_group_alias)
results_agg = {**results_agg, **_results_agg}
groups_agg = {**groups_agg, **_groups_agg}
return results_agg, groups_agg, task_version
results_agg, groups_agg, versions = print_tasks(task_hierarchy, task_order, versions, task_group_alias)
for task in results_agg:
task_results = results_agg[task]
if "samples" in task_results:
task_results.pop("samples")
tab_string = ""
if "tab" in task_results:
tab = task_results.pop("tab")
tab_string = " " * tab + "- " if tab > 0 else ""
if task in task_group_alias:
task_alias = task_group_alias[task]
results_agg[task]["alias"] = tab_string + task_alias
else:
results_agg[task]["alias"] = tab_string + task
for group in groups_agg:
group_results = groups_agg[group]
if "samples" in group_results:
group_results.pop("samples")
tab_string = ""
if "tab" in group_results:
tab = group_results.pop("tab")
tab_string = " " * tab + "- " if tab > 0 else ""
if group in task_group_alias:
group_alias = task_group_alias[group]
groups_agg[group]["alias"] = tab_string + group_alias
else:
groups_agg[group]["alias"] = tab_string + group
for group_name, task_list in task_hierarchy.items():
if task_list != []:
num_fewshot[group_name] = num_fewshot[task_list[0]]
results_dict = {
"results": dict(results_agg.items()),
**({"groups": dict(groups_agg.items())} if bool(groups_agg) else {}),
"configs": dict(sorted(configs.items())),
"versions": dict(sorted(versions.items())),
"n-shot": dict(sorted(num_fewshot.items())),
}
if log_samples:
results_dict["samples"] = dict(samples)
else:
results_dict = None
with open(f"{cli_args.output_path}/rank{int(os.environ.get('RANK', 0))}_metric_eval_done.txt", "w") as f:
f.write(f"rank {int(os.environ.get('RANK', 0))} eval done")
while len([file for file in os.listdir(cli_args.output_path) if file.endswith("metric_eval_done.txt")]) < lm._world_size:
time.sleep(1)
return results_dict
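# Illustrative sketch (not part of the original file): the pooled mean and
# pooled variance formula quoted in the group-aggregation comment above,
#   s_z^2 = ((n-1) s_x^2 + (m-1) s_y^2) / (n+m-1) + n m (x_bar - y_bar)^2 / ((n+m)(n+m-1)),
# checked against the variance of the concatenated samples. The function name
# `pool` and the sample data are assumptions made for this example. The sketch
# uses the two means from before merging, exactly as the formula is written.

```python
import math
import statistics


def pool(mean_x, var_x, n, mean_y, var_y, m):
    """Merge two (mean, sample variance, size) summaries into one."""
    # Size-weighted mean, as used for the group metric.
    mean_z = (mean_x * n + mean_y * m) / (n + m)
    # Within-group term plus between-group term of the quoted formula.
    within = ((n - 1) * var_x + (m - 1) * var_y) / (n + m - 1)
    between = n * m * (mean_x - mean_y) ** 2 / ((n + m) * (n + m - 1))
    return mean_z, within + between


x = [1.0, 2.0, 3.0]
y = [5.0, 7.0]
mean_z, var_z = pool(
    statistics.mean(x), statistics.variance(x), len(x),
    statistics.mean(y), statistics.variance(y), len(y),
)
```

# In the aggregation loop the group's `_stderr` slot holds this variance while
# tasks are being merged; the square root is only taken once all tasks of the
# group have been folded in.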
================================================
FILE: lmms-eval_videochat/lmms_eval/filters/__init__.py
================================================
from lmms_eval.api.filter import FilterEnsemble, Filter
from . import selection
from . import extraction
from . import transformation
FILTER_REGISTRY = {
"take_first": selection.TakeFirstFilter,
"regex": extraction.RegexFilter,
"majority_vote": selection.MajorityVoteFilter,
"take_first_k": selection.TakeKFilter,
"remove_whitespace": extraction.WhitespaceFilter,
"lowercase": transformation.LowercaseFilter,
"uppercase": transformation.UppercaseFilter,
"map": transformation.MapFilter,
"multi_choice_regex": extraction.MultiChoiceRegexFilter,
# TODO: implement this filter. either it should take in an arbitrary "scoring"/reward function
# that takes an input and returns a scalar and then should select the max reward,
# or should implement different filters for different ways of handling a reward model's inference.
# "arg_max": selection.ArgMaxFilter,
}
def get_filter(filter_name):
if filter_name in FILTER_REGISTRY:
return FILTER_REGISTRY[filter_name]
else:
return filter_name
def build_filter_ensemble(filter_name, components):
"""
Create a filtering pipeline.
"""
filters = []
for function, kwargs in components:
if kwargs is None:
f = get_filter(function)()
else:
# create a filter given its name in the registry
f = get_filter(function)(**kwargs) # TODO: pass kwargs to filters properly
# add the filter as a pipeline step
filters.append(f)
return FilterEnsemble(name=filter_name, filters=filters)
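# Illustrative sketch (not part of the original file): the registry lookup and
# pipeline construction performed by `get_filter` and `build_filter_ensemble`.
# The two filter classes, `REGISTRY`, `build_pipeline` and `run_pipeline` are
# simplified stand-ins written for this example; they are not the real
# lmms_eval classes, and how `FilterEnsemble` applies its filters is assumed
# here to be a simple left-to-right chain.

```python
class LowercaseFilter:
    """Stand-in filter: lowercase every response."""

    def apply(self, resps, docs):
        return [[r.lower() for r in inst] for inst in resps]


class TakeFirstFilter:
    """Stand-in filter: keep only the first response per document."""

    def apply(self, resps, docs):
        return [inst[0] for inst in resps]


REGISTRY = {"lowercase": LowercaseFilter, "take_first": TakeFirstFilter}


def build_pipeline(components):
    """components: list of (registry name, kwargs or None) pairs."""
    filters = []
    for name, kwargs in components:
        cls = REGISTRY[name]
        # Instantiate with no arguments when kwargs is None, else unpack them.
        filters.append(cls() if kwargs is None else cls(**kwargs))
    return filters


def run_pipeline(filters, resps, docs):
    """Feed the output of each filter into the next one."""
    for f in filters:
        resps = f.apply(resps, docs)
    return resps


pipeline = build_pipeline([("lowercase", None), ("take_first", None)])
out = run_pipeline(pipeline, [["Yes", "No"], ["MAYBE"]], docs=[{}, {}])
```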
================================================
FILE: lmms-eval_videochat/lmms_eval/filters/decontamination.py
================================================
from lmms_eval.api.filter import Filter
class DecontaminationFilter(Filter):
"""
A filter which evaluates
"""
name = "track_decontamination"
def __init__(self, path) -> None:
"""
TODO: make sure only ever run one time on the train set (should this be cached as a class var? keyed by value for "path").
should further cache result on a given (task_name, doc_id)
"""
self._decontam_results = None
def apply(self, resps, docs) -> None:
"""
Return {"no_contamination", "only_contamination"} keys for the 2 different subsets
"""
pass
================================================
FILE: lmms-eval_videochat/lmms_eval/filters/extraction.py
================================================
import re
import sys
import unicodedata
from lmms_eval.api.filter import Filter
class WhitespaceFilter(Filter):
""" """
def __init__(self) -> None:
pass
def apply(self, resps, docs):
def filter_set(inst):
filtered_resp = []
for resp in inst:
if resp.startswith(" "):
resp = resp[1:]
filtered_resp.append(resp)
return filtered_resp
filtered_resps = [filter_set(resp) for resp in resps]
return filtered_resps
class RegexFilter(Filter):
""" """
def __init__(
self,
regex_pattern: str = r"#### (\-?[0-9\.\,]+)",
group_select=0,
fallback: str = "[invalid]",
) -> None:
"""
Pass a string `regex_pattern`, which is compiled with `re.compile`.
`fallback` defines the output returned if no matches for the regex are located.
"""
self.regex_pattern = regex_pattern
self.regex = re.compile(regex_pattern)
self.group_select = group_select
self.fallback = fallback
def apply(self, resps, docs):
# here, we assume we have a list, in which each element is
# a list of model responses for some particular input/target pair.
# so we process each of these (same input/target response sets)
# independently (and keep them a list.)
def filter_set(inst):
filtered = []
for resp in inst:
match = self.regex.findall(resp)
if match:
match = match[self.group_select]
if isinstance(match, tuple):
match = [m for m in match if m][0]
match = match.strip()
else:
match = self.fallback
filtered.append(match)
return filtered
# print(resps)
filtered_resps = list(map(lambda x: filter_set(x), resps))
# print(filtered_resps)
return filtered_resps
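# Illustrative sketch (not part of the original file): what RegexFilter does
# to a single response, using the same default pattern, `group_select` index
# and fallback string as above. The standalone function `extract` and the
# sample responses are assumptions made for this example.

```python
import re


def extract(resp, pattern=r"#### (\-?[0-9\.\,]+)", group_select=0, fallback="[invalid]"):
    """Return the selected regex match from resp, or fallback if none is found."""
    matches = re.findall(pattern, resp)
    if not matches:
        return fallback
    match = matches[group_select]
    # With several capture groups, findall yields tuples; keep the first
    # non-empty group, as the filter does.
    if isinstance(match, tuple):
        match = [m for m in match if m][0]
    return match.strip()


first = extract("The total is 12.\n#### 12")
last = extract("first #### 3 then #### -4,000", group_select=-1)
missing = extract("no marker here")
```

# `group_select` indexes into the list of all matches, so -1 selects the last
# answer marker in a response that contains several.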
class MultiChoiceRegexFilter(RegexFilter):
"""
A filter used to extract a model's answer on multiple choice questions with
letter answers. Assumes each document has a "choices" field
containing the list of answer choices and that the answer label symbols
are of the form (A), (B), (C), ... or A, B, C.
"""
def __init__(
self,
regex_pattern: str = r"#### (\-?[0-9\.\,]+)",
group_select=0,
fallback: str = "[invalid]",
ignore_case=False,
ignore_punctuation=False,
regexes_to_ignore=None,
) -> None:
r"""
regex_pattern: The basic regex pattern to use. If it fails to match, we use the customized match procedure:
- step 1 : We parse the choices between ([A-Z])s then try to find these choices in the response.
- step 2 : We parse the choice with regex :[\s]*([A-?]), where ? varies by number of choices.
group_select: Selects the (group_select)th match from the findall result.
ignore_case: Ignores the case during step 1 matching
ignore_punctuation: Remove the punctuation during step 1 matching
regexes_to_ignore: Remove these regexes during step 1 matching
"""
super().__init__(regex_pattern, group_select, fallback)
self.ignore_case = ignore_case
self.ignore_punctuation = ignore_punctuation
self.regexes_to_ignore = regexes_to_ignore
def apply(self, resps, docs):
# here, we assume we have a list, in which each element is
# a list of model responses for some particular input/target pair.
# so we process each of these (same input/target response sets)
# independently (and keep them a list.)
def find_match(regex, resp, convert_dict={}):
match = regex.findall(resp)
if match:
match = match[self.group_select]
if isinstance(match, tuple):
match = [m for m in match if m][0]
match = match.strip()
if match and match in convert_dict:
match = convert_dict[match]
return match
punct_tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith("P"))
def filter_ignores(st):
if self.regexes_to_ignore is not None:
for s in self.regexes_to_ignore:
st = re.sub(s, "", st)
if self.ignore_case:
st = st.lower()
if self.ignore_punctuation:
# https://stackoverflow.com/a/266162
st = st.translate(punct_tbl)
return st
filtered_resps = []
for r, doc in zip(resps, docs):
fallback_regexes = []
choice_to_alpha = {}
next_alpha = "A"
without_paren_fallback_regexes = []
without_paren_to_target = {}
choices = doc["choices"]
for c in choices:
m = filter_ignores(c.strip())
fallback_regexes.append(f"{re.escape(m)}")
choice_to_alpha[m] = f"({next_alpha})"
without_paren_fallback_regexes.append(next_alpha)
without_paren_to_target[next_alpha] = f"({next_alpha})"
next_alpha = chr(ord(next_alpha) + 1)
fallback_regex = re.compile("|".join(fallback_regexes))
without_paren_fallback_regex = "|".join(without_paren_fallback_regexes)
without_paren_fallback_regex = re.compile(rf":[\s]*({without_paren_fallback_regex})")
filtered = []
for resp in r:
match = find_match(self.regex, resp)
if not match:
match = find_match(fallback_regex, filter_ignores(resp), choice_to_alpha)
if not match:
match = find_match(without_paren_fallback_regex, resp, without_paren_to_target)
if not match:
match = self.fallback
filtered.append(match)
filtered_resps.append(filtered)
return filtered_resps
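# Illustrative sketch (not part of the original file): the fallback order used
# by MultiChoiceRegexFilter for one response. It tries the primary regex, then
# searches for the literal choice text (step 1), then looks for a bare letter
# after a colon (step 2), and finally returns the fallback string. The function
# `extract_choice`, the primary pattern `(\([A-Z]\))` and the sample choices are
# assumptions made for this example; the ignore_case / ignore_punctuation
# options are left out for brevity.

```python
import re


def extract_choice(resp, choices, primary=r"(\([A-Z]\))", fallback="[invalid]"):
    """Map a free-form response to a label such as "(B)"."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]

    # Primary pattern: an explicit "(X)" label in the response.
    found = re.findall(primary, resp)
    if found:
        return found[0]

    # Step 1: look for the text of a choice and convert it to its label.
    text_to_label = {c.strip(): f"({l})" for c, l in zip(choices, letters)}
    text_regex = re.compile("|".join(re.escape(t) for t in text_to_label))
    found = text_regex.findall(resp)
    if found:
        return text_to_label[found[0]]

    # Step 2: look for a bare letter after a colon, e.g. "Answer: C".
    alternatives = "|".join(letters)
    letter_regex = re.compile(rf":[\s]*({alternatives})")
    found = letter_regex.findall(resp)
    if found:
        return f"({found[0]})"

    return fallback


choices = ["red", "green", "blue"]
by_label = extract_choice("I pick (B).", choices)
by_text = extract_choice("It is green", choices)
by_letter = extract_choice("Answer: C", choices)
no_match = extract_choice("no idea", choices)
```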
class ExtendedRegexFilter(RegexFilter):
punct_tbl = dict.fromkeys(i for i in range(sys.maxunicode) if unicodedata.category(chr(i)).startswith("P"))
def __init__(
self,
regex_pattern: str = r"#### (\-?[0-9\.\,]+)",
group_select=0,
fallback: str = "[invalid]",
ignore_case=False,
ignore_punctuation=False,
regexes_to_ignore=None,
) -> None:
super().__init__(regex_pattern, group_select, fallback)
self.ignore_case = ignore_case
self.ignore_punctuation = ignore_punctuation
self.regexes_to_ignore = regexes_to_ignore
def filter_ignores(self, st):
if self.regexes_to_ignore is not None:
for s in self.regexes_to_ignore:
st = re.sub(s, "", st)
if self.ignore_case:
st = st.lower()
if self.ignore_punctuation:
# https://stackoverflow.com/a/266162
st = st.translate(self.punct_tbl)
return st
def find_match(self, regex, resp, convert_dict={}):
match = regex.findall(resp)
if match:
match = match[self.group_select]
if isinstance(match, tuple):
match = [m for m in match if m][0]
match = match.strip()
if match and match in convert_dict:
match = convert_dict[match]
return match
# Designed for the AI2D/RealworldQA dataset
class SimpleMultiChoiceRegexFilter(ExtendedRegexFilter):
def __init__(self, *args, **kwargs):
"""
regex_pattern: The basic regex pattern to use. If fails to match, we will use the customized match procedure
- step 1 : We parse the choices between ([A-Z])s then try to find these choices in the response.
- step 2 : We parse the choice with regex :[\s]*([A-?]), where ? varies by number of choices.
group_select: Selects the (group_select)th match from the findall result.
ignore_case: Ignores the case during step 1 matching
ignore_punctuation: Remove the punctuation during step 1 matching
regexes_to_ignore: Remove these regexes during step 1 matching
"""
super().__init__(*args, **kwargs)
def apply(self, resps, docs):
# here, we assume we have a list, in which each element is
# a list of model responses for some particular input/target pair.
# so we process each of these (same input/target response sets)
# independently (and keep them a list.)
filtered_resps = []
for r, doc in zip(resps, docs):
fallback_regexes = []
choice_to_alpha = {}
next_alpha = "A"
without_paren_fallback_regexes = []
without_paren_to_target = {}
# Regex to extract multiple choice options from the question
multiple_choices_regex = re.compile(r"\b([A-Z])\.\s+([^\n]*)")
matches = multiple_choices_regex.findall(doc["question"])
# Build regex patterns and mappings for each choice
for m in matches:
choice_text = m[1].strip()
fallback_regexes.append(f"{re.escape(choice_text)}")
choice_to_alpha[choice_text] = next_alpha
next_alpha = chr(ord(next_alpha) + 1)
# Compile regex to match any of the extracted choices
fallback_regex = re.compile("|".join(fallback_regexes))
# Process each response
filtered = []
for resp in r:
# Remove any punctuation and extra spaces
cleaned_resp = re.sub(r"[^\w\s]", "", resp).strip()
# Try to match cleaned response with the choice text
match = fallback_regex.search(cleaned_resp)
if match and match.group() in choice_to_alpha:
# Map the matched choice text back to its corresponding letter
filtered.append(choice_to_alpha[match.group()])
else:
# If no match, return the cleaned response
filtered.append(cleaned_resp)
filtered_resps.append(filtered[0])
return filtered_resps
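# Illustrative sketch, not part of the original module: the choice-text fallback used by
# SimpleMultiChoiceRegexFilter.apply, reduced to a standalone function. The name
# `extract_choice_letter` and the sample question are assumptions made for this example.
# Unlike the filter above, it takes the letter from the question text itself instead of
# counting up from "A", and it does not strip punctuation from the response.

```python
import re


def extract_choice_letter(question: str, response: str) -> str:
    """Map a free-form response to an option letter by matching option text."""
    # Options are assumed to be written as "A. some text", one per line.
    options = re.findall(r"\b([A-Z])\.\s+([^\n]*)", question)
    text_to_letter = {text.strip(): letter for letter, text in options if text.strip()}
    if not text_to_letter:
        return response.strip()
    # One alternation over all option texts, escaped so they match literally.
    pattern = re.compile("|".join(re.escape(text) for text in text_to_letter))
    found = pattern.search(response)
    # Fall back to the raw response when no option text occurs in it.
    return text_to_letter[found.group()] if found else response.strip()


question = "What is shown?\nA. a cat\nB. a dog"
print(extract_choice_letter(question, "It looks like a dog to me"))
```

# When no option text appears in the response, the stripped response is returned unchanged,
# which mirrors the `else` branch of the filter above.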
================================================
FILE: lmms-eval_videochat/lmms_eval/filters/selection.py
================================================
from collections import Counter
from lmms_eval.api.filter import Filter
class TakeFirstFilter(Filter):
def __init__(self) -> None:
"""
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
def apply(self, resps, docs):
"""
Assuming each entry of `resps` is a list of model responses, we discard all but the first response.
"""
return map(lambda r: r[0], resps)
class TakeKFilter(Filter):
def __init__(self, *args, **kwargs) -> None:
self.k = kwargs.pop("k")
super().__init__(*args, **kwargs)
def apply(self, resps, docs):
# check we have at least k responses per doc, else we can't take the first k
assert len(resps[0]) >= self.k, f"Need at least {self.k} responses per doc to take the first {self.k}, but got only {len(resps[0])}! Please increase TaskConfig.repeats."
return map(lambda r: r[: self.k], resps)
class MajorityVoteFilter(Filter):
def __init__(self) -> None:
"""
Can define custom behavior here, if an individual instantiation of a Filter class should have state.
"""
def apply(self, resps, docs):
"""
Each entry of `resps` is a list of model responses.
We select the response that occurs most frequently in each entry of `resps`.
"""
def select_majority(resp):
counts = Counter(resp)
vote = counts.most_common(1)[0][0]
return vote
return map(lambda r: [select_majority(r)], resps)
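# Illustrative sketch, not part of the original module: the voting rule of MajorityVoteFilter
# on plain lists. `majority_vote` and the sample responses are assumptions for this example.
# The filters above return lazy `map` objects; the sketch builds a list so the result can be
# inspected more than once.

```python
from collections import Counter


def majority_vote(responses):
    """Return the most frequent response in one response set."""
    return Counter(responses).most_common(1)[0][0]


# Two documents, three sampled responses each.
resps = [["B", "A", "B"], ["C", "C", "D"]]
voted = [[majority_vote(r)] for r in resps]
print(voted)
```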
================================================
FILE: lmms-eval_videochat/lmms_eval/filters/transformation.py
================================================
from lmms_eval.api.filter import Filter
class LowercaseFilter(Filter):
def __init__(self) -> None:
pass
def apply(self, resps, docs):
def filter_set(inst):
return [resp.lower() for resp in inst]
return [filter_set(resp) for resp in resps]
class UppercaseFilter(Filter):
def __init__(self) -> None:
pass
def apply(self, resps, docs):
def filter_set(inst):
return [resp.upper() for resp in inst]
return [filter_set(resp) for resp in resps]
class MapFilter(Filter):
def __init__(self, mapping_dict: dict = {}, default_value=None) -> None:
"""
Initializes the MapFilter with a given mapping dictionary and default value.
Args:
- mapping_dict (dict): A dictionary containing the key-value mappings.
Default is an empty dictionary.
- default_value (Any): The value to be returned when a key is not found in the mapping_dict.
Default is None.
Example:
mapper = MapFilter({'A': 1, 'B': 2}, default_value=0)
"""
assert isinstance(mapping_dict, dict), "Provided mapping_dict is not a dictionary"
self.mapping_dict = mapping_dict
self.default_value = default_value
def apply(self, resps, docs):
def filter_set(inst):
return [self.mapping_dict.get(resp, self.default_value) for resp in inst]
return [filter_set(resp) for resp in resps]
================================================
FILE: lmms-eval_videochat/lmms_eval/logging_utils.py
================================================
# Code mostly from: https://github.com/EleutherAI/lm-evaluation-harness/pull/1339, credit to: https://github.com/ayulockin
import copy
import re
import os
import json
import glob
import pandas as pd
import numpy as np
from datetime import datetime
from typing import Any, Dict, List, Literal, Tuple, Union
from packaging.version import Version
from lmms_eval import utils
import tenacity
from loguru import logger
try:
import wandb
assert Version(wandb.__version__) >= Version("0.13.6")
if Version(wandb.__version__) < Version("0.13.6"):
wandb.require("report-editing:v0")
except Exception as e:
logger.warning("To use the wandb reporting functionality please install wandb>=0.13.6.\n" "To install the latest version of wandb run `pip install wandb --upgrade`\n" f"{e}")
def remove_none_pattern(input_string):
# Define the pattern to match ',none' at the end of the string
pattern = re.compile(r",none$")
# Use sub() to replace ',none' with an empty string
result = re.sub(pattern, "", input_string)
# check if the input_string changed
removed = result != input_string
return result, removed
def _handle_non_serializable(o: Any) -> Union[int, str, list]:
"""Handle non-serializable objects by converting them to serializable types.
Args:
o (Any): The object to be handled.
Returns:
Union[int, str, list]: The converted object. If the object is of type np.int64 or np.int32,
it will be converted to int. If the object is of type set, it will be converted
to a list. Otherwise, it will be converted to str.
"""
if isinstance(o, np.int64) or isinstance(o, np.int32):
return int(o)
elif isinstance(o, set):
return list(o)
else:
return str(o)
def get_wandb_printer() -> Literal["Printer"]:
"""Returns a wandb printer instance for pretty stdout."""
from wandb.sdk.lib.printer import get_printer
from wandb.sdk.wandb_settings import Settings
printer = get_printer(Settings()._jupyter)
return printer
class WandbLogger:
def __init__(self, args):
self.wandb_args = utils.simple_parse_args_string(args.wandb_args)
self.args = args
self.all_args_dict = vars(args)
self.printer = get_wandb_printer()
try:
self.init_run()
except Exception as e:
logger.warning(f"Failed to initialize W&B run: {e}")
os.environ["WANDB_MODE"] = "offline"
self.init_run()
def finish(self):
self.run.finish()
@tenacity.retry(wait=tenacity.wait_fixed(5), stop=tenacity.stop_after_attempt(5))
def init_run(self):
if "name" not in self.wandb_args:
if "config" in self.all_args_dict and self.all_args_dict["config"] != "":
self.wandb_args["name"] = self.all_args_dict["config"].split("/")[-1].replace(".yaml", "") + "/" + self.args.log_samples_suffix
else:
task_names = self.args.tasks.replace(",", "/")
self.wandb_args["name"] = f"{self.args.model}/<{task_names}>/{self.args.log_samples_suffix}"
if self.args.num_fewshot:
self.wandb_args["name"] += f"_{self.args.num_fewshot}shot"
if "project" not in self.wandb_args:
self.wandb_args["project"] = "lmms-eval"
# initialize a W&B run
self.run = wandb.init(**self.wandb_args)
def post_init(self, results: Dict[str, Any]) -> None:
self.results: Dict[str, Any] = copy.deepcopy(results)
self.task_names: List[str] = list(results.get("results", {}).keys())
self.group_names: List[str] = list(results.get("groups", {}).keys())
def _get_config(self) -> Dict[str, Any]:
"""Get configuration parameters."""
self.task_configs = self.results.get("configs", {})
cli_configs = self.results.get("config", {})
configs = {
"task_configs": self.task_configs,
"cli_configs": cli_configs,
}
return configs
def _sanitize_results_dict(self) -> Tuple[Dict[str, str], Dict[str, Any]]:
"""Sanitize the results dictionary."""
_results = copy.deepcopy(self.results.get("results", dict()))
_results["model_configs"] = self.results.get("model_configs", dict())
# Remove None from the metric string name
tmp_results = copy.deepcopy(_results)
for task_name in self.task_names:
task_result = tmp_results.get(task_name, dict())
for metric_name, metric_value in task_result.items():
_metric_name, removed = remove_none_pattern(metric_name)
if removed:
_results[task_name][_metric_name] = metric_value
_results[task_name].pop(metric_name)
# remove string valued keys from the results dict
wandb_summary = {}
for task in self.task_names:
task_result = _results.get(task, dict())
for metric_name, metric_value in task_result.items():
if isinstance(metric_value, str):
wandb_summary[f"{task}/{metric_name}"] = metric_value
wandb_summary["model_configs"] = self.results.get("model_configs", dict())
for summary_metric, summary_value in wandb_summary.items():
if summary_metric != "model_configs":
_task, _summary_metric = summary_metric.split("/")
_results[_task].pop(_summary_metric)
tmp_results = copy.deepcopy(_results)
for task_name, task_results in tmp_results.items():
if task_name != "model_configs":
for metric_name, metric_value in task_results.items():
_results[f"{task_name}/{metric_name}"] = metric_value
_results[task_name].pop(metric_name)
for task in self.task_names:
_results.pop(task)
return wandb_summary, _results
def _log_results_as_table(self) -> None:
"""Generate and log evaluation results as a table to W&B."""
columns = [
"Model",
"Args",
"Tasks",
"Version",
"Filter",
"num_fewshot",
"Metric",
"Value",
"Stderr",
]
def make_table(columns: List[str], key: str = "results"):
table = wandb.Table(columns=columns)
results = copy.deepcopy(self.results)
model_name = results.get("model_configs").get("model")
model_args = results.get("model_configs").get("model_args")
for k, dic in results.get(key).items():
if k in self.group_names and not key == "groups":
continue
version = results.get("versions").get(k)
if version == "N/A":
version = None
n = results.get("n-shot").get(k)
for (mf), v in dic.items():
m, _, f = mf.partition(",")
if m.endswith("_stderr"):
continue
if m == "alias":
continue
if m + "_stderr" + "," + f in dic:
se = dic[m + "_stderr" + "," + f]
if se != "N/A":
se = "%.4f" % se
data = [model_name, model_args, k, version, f, n, m, str(v), str(se)]
if key == "groups":
data = [self.group_names] + data
table.add_data(*data)
else:
data = [model_name, model_args, k, version, f, n, m, str(v), ""]
if key == "groups":
data = [self.group_names] + data
table.add_data(*data)
return table
# log the complete eval result to W&B Table
table = make_table(columns, "results")
self.run.log({"evaluation/eval_results": table})
if "groups" in self.results.keys():
table = make_table(["Groups"] + columns, "groups")
self.run.log({"evaluation/group_eval_results": table})
def _log_results_as_artifact(self) -> None:
"""Log results as JSON artifact to W&B."""
dumped = json.dumps(self.results, indent=2, default=_handle_non_serializable, ensure_ascii=False)
artifact = wandb.Artifact("results", type="eval_results")
with artifact.new_file("results.json", mode="w", encoding="utf-8") as f:
f.write(dumped)
self.run.log_artifact(artifact)
def log_eval_result(self) -> None:
"""Log evaluation results to W&B."""
# Log configs to wandb
configs = self._get_config()
self.run.config.update(configs, allow_val_change=True)
wandb_summary, self.wandb_results = self._sanitize_results_dict()
# update wandb.run.summary with items that were removed
self.run.summary.update(wandb_summary)
# Log the evaluation metrics to wandb
self.run.log(self.wandb_results)
# Log the evaluation metrics as W&B Table
self._log_results_as_table()
# Log the results dict as json to W&B Artifacts
self._log_results_as_artifact()
def _generate_dataset(self, data: List[Dict[str, Any]], config: Dict[str, Any]) -> pd.DataFrame:
"""Generate a dataset from evaluation data.
Args:
data (List[Dict[str, Any]]): The data to generate a dataset for.
config (Dict[str, Any]): The configuration of the task.
Returns:
pd.DataFrame: A dataframe that is ready to be uploaded to W&B.
"""
ids = [x["doc_id"] for x in data]
labels = [x["target"] for x in data]
instance = [""] * len(ids)
resps = [""] * len(ids)
filtered_resps = [""] * len(ids)
model_outputs = {}
metrics_list = config["metric_list"]
metrics = {}
for metric in metrics_list:
metric = metric.get("metric")
if metric in ["word_perplexity", "byte_perplexity", "bits_per_byte"]:
metrics[f"{metric}_loglikelihood"] = [x[metric][0] for x in data]
if metric in ["byte_perplexity", "bits_per_byte"]:
metrics[f"{metric}_bytes"] = [x[metric][1] for x in data]
else:
metrics[f"{metric}_words"] = [x[metric][1] for x in data]
else:
metrics[metric] = [x[metric] for x in data]
if config["output_type"] == "loglikelihood":
instance = [x["arguments"][0][0] for x in data]
labels = [x["arguments"][0][1] for x in data]
resps = [f'log probability of continuation is {x["resps"][0][0][0]} ' + "\n\n" + "continuation will {} generated with greedy sampling".format("not be" if not x["resps"][0][0][1] else "be") for x in data]
filtered_resps = [f'log probability of continuation is {x["filtered_resps"][0][0]} ' + "\n\n" + "continuation will {} generated with greedy sampling".format("not be" if not x["filtered_resps"][0][1] else "be") for x in data]
elif config["output_type"] == "multiple_choice":
instance = [x["arguments"][0][0] for x in data]
choices = ["\n".join([f"{idx}. {y[1]}" for idx, y in enumerate(x["arguments"])]) for x in data]
resps = [np.argmax([n[0][0] for n in x["resps"]]) for x in data]
filtered_resps = [np.argmax([n[0] for n in x["filtered_resps"]]) for x in data]
elif config["output_type"] == "generate_until":
instance = [x["arguments"][0][0] for x in data]
resps = [x["resps"][0][0] for x in data]
filtered_resps = [x["filtered_resps"][0] for x in data]
model_outputs["raw_predictions"] = resps
model_outputs["filtered_predictions"] = filtered_resps
df_data = {
"id": ids,
"data": instance,
}
if config["output_type"] == "multiple_choice":
df_data["choices"] = choices
tmp_data = {
"input_len": [len(x) for x in instance],
"labels": labels,
"output_type": config["output_type"],
}
df_data.update(tmp_data)
df_data.update(model_outputs)
df_data.update(metrics)
return pd.DataFrame(df_data)
def _log_samples_as_artifact(self, data: List[Dict[str, Any]], task_name: str) -> None:
# log the samples as an artifact
dumped = json.dumps(
data,
indent=2,
default=_handle_non_serializable,
ensure_ascii=False,
)
artifact = wandb.Artifact(f"{task_name}", type="samples_by_task")
with artifact.new_file(f"{task_name}_eval_samples.json", mode="w", encoding="utf-8") as f:
f.write(dumped)
self.run.log_artifact(artifact)
# artifact.wait()
def log_eval_samples(self, samples: Dict[str, List[Dict[str, Any]]]) -> None:
"""Log evaluation samples to W&B.
Args:
samples (Dict[str, List[Dict[str, Any]]]): Evaluation samples for each task.
"""
task_names: List[str] = [x for x in self.task_names if x not in self.group_names]
ungrouped_tasks = []
tasks_by_groups = {}
for task_name in task_names:
group_names = self.task_configs[task_name].get("group", None)
if group_names:
if isinstance(group_names, str):
group_names = [group_names]
for group_name in group_names:
if not tasks_by_groups.get(group_name):
tasks_by_groups[group_name] = [task_name]
else:
tasks_by_groups[group_name].append(task_name)
else:
ungrouped_tasks.append(task_name)
for task_name in ungrouped_tasks:
eval_preds = samples[task_name]
# log the samples as a W&B Table
df = self._generate_dataset(eval_preds, self.task_configs.get(task_name))
self.run.log({f"{task_name}_eval_results": df})
# log the samples as a json file as W&B Artifact
self._log_samples_as_artifact(eval_preds, task_name)
for group, grouped_tasks in tasks_by_groups.items():
grouped_df = pd.DataFrame()
for task_name in grouped_tasks:
eval_preds = samples[task_name]
df = self._generate_dataset(eval_preds, self.task_configs.get(task_name))
df["group"] = group
df["task"] = task_name
grouped_df = pd.concat([grouped_df, df], ignore_index=True)
# log the samples as a json file as W&B Artifact
self._log_samples_as_artifact(eval_preds, task_name)
self.run.log({f"{group}_eval_results": grouped_df})
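# Illustrative sketch, not part of the original module: roughly what _sanitize_results_dict
# does to metric names before logging, without the W&B dependency. `strip_none_suffix`,
# `flatten_results` and the "demo_task" entry are assumptions made for this example.

```python
import re


def strip_none_suffix(metric_name: str) -> str:
    """Drop a trailing ',none' filter marker from a metric name."""
    return re.sub(r",none$", "", metric_name)


def flatten_results(results: dict) -> dict:
    """Flatten {task: {metric: value}} into {'task/metric': value}."""
    flat = {}
    for task, metrics in results.items():
        for metric, value in metrics.items():
            if isinstance(value, str):
                # String-valued entries are routed to the run summary instead.
                continue
            flat[f"{task}/{strip_none_suffix(metric)}"] = value
    return flat


results = {"demo_task": {"accuracy,none": 0.71, "alias": "demo_task"}}
print(flatten_results(results))
```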
================================================
FILE: lmms-eval_videochat/lmms_eval/models/__init__.py
================================================
import importlib
import os
import hf_transfer
from loguru import logger
import sys
os.environ["HF_HUB_ENABLE_HF_TRANSFER"] = "1"
logger.remove()
logger.add(sys.stdout, level="WARNING")
AVAILABLE_MODELS = {
"videochat_flash": "VideoChat_Flash"
}
for model_name, model_class in AVAILABLE_MODELS.items():
try:
exec(f"from .{model_name} import {model_class}")
except Exception as e:
logger.debug(f"Failed to import {model_class} from {model_name}: {e}")
if os.environ.get("LMMS_EVAL_PLUGINS", None):
# Allow specifying other packages to import models from
for plugin in os.environ["LMMS_EVAL_PLUGINS"].split(","):
m = importlib.import_module(f"{plugin}.models")
for model_name, model_class in getattr(m, "AVAILABLE_MODELS").items():
try:
exec(f"from {plugin}.models.{model_name} import {model_class}")
except ImportError as e:
logger.debug(f"Failed to import {model_class} from {model_name}: {e}")
================================================
FILE: lmms-eval_videochat/lmms_eval/models/videochat_flash.py
================================================
import logging
import warnings
from datetime import timedelta
from typing import List, Optional, Union, Tuple
import PIL
import torch
from tqdm import tqdm
from packaging import version
from accelerate import Accelerator, DistributedType, InitProcessGroupKwargs
from accelerate.state import AcceleratorState
from transformers import AutoTokenizer, AutoModel
from lmms_eval import utils
from lmms_eval.api.instance import Instance
from lmms_eval.api.model import lmms
from lmms_eval.api.registry import register_model
# Suppress warnings
warnings.filterwarnings("ignore")
# Configure logging
eval_logger = logging.getLogger("lmms-eval")
# Enable TF32 for CUDA
torch.backends.cuda.matmul.allow_tf32 = True
# Determine best attention implementation
if version.parse(torch.__version__) >= version.parse("2.1.2"):
best_fit_attn_implementation = "sdpa"
else:
best_fit_attn_implementation = "eager"
@register_model("videochat_flash")
class VideoChat_Flash(lmms):
"""
VideoChat Flash
"""
def __init__(
self,
pretrained: str = "xxx",
device: Optional[str] = "cuda:0",
batch_size: Optional[Union[int, str]] = 1,
device_map: Optional[str] = "cuda:0",
use_cache: Optional[bool] = True,
max_num_frames: Optional[int] = 32,
**kwargs,
) -> None:
super().__init__()
# Do not use kwargs for now
assert kwargs == {}, f"Unexpected kwargs: {kwargs}"
assert torch.cuda.device_count() > 0, torch.cuda.device_count()
accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52))
accelerator = Accelerator(kwargs_handlers=[accelerator_kwargs])
if accelerator.num_processes > 1:
self._device = torch.device(f"cuda:{accelerator.local_process_index}")
self.device_map = f"cuda:{accelerator.local_process_index}"
elif accelerator.num_processes == 1 and device_map == "auto":
self._device = torch.device(device)
self.device_map = device_map
else:
self._device = torch.device(f"cuda:{accelerator.local_process_index}")
self.device_map = f"cuda:{accelerator.local_process_index}"
self.max_num_frames = max_num_frames
self._tokenizer = AutoTokenizer.from_pretrained(pretrained, trust_remote_code=True)
self._model = AutoModel.from_pretrained(pretrained, trust_remote_code=True).half().cuda()
# modify here to use video-level compress
self.model.config.mm_llm_compress = False
self.model.config.llm_compress_type = "attention"
self.model.config.llm_compress_layer_list = [24]
self.model.config.llm_image_token_ratio_list = [1.0, 0.5]
self._config = self._model.config
self.model.eval()
self.batch_size_per_gpu = int(batch_size)
self.use_cache = use_cache
assert self.batch_size_per_gpu == 1
if accelerator.num_processes > 1:
assert accelerator.distributed_type in [DistributedType.FSDP, DistributedType.MULTI_GPU, DistributedType.DEEPSPEED], "Unsupported distributed type provided. Only DDP, FSDP and DeepSpeed are supported."
# To use DistributedType.DEEPSPEED, run `accelerate config` before using the model.
# Also select ZeRO stage 0 (equivalent to DDP) so that preparing the model works.
# Setting other parameters in the kwargs to make the default ZeRO stage 2 work was tried without success.
if accelerator.distributed_type == DistributedType.DEEPSPEED:
kwargs = {
"train_micro_batch_size_per_gpu": self.batch_size_per_gpu,
"train_batch_size": self.batch_size_per_gpu * accelerator.num_processes,
}
AcceleratorState().deepspeed_plugin.deepspeed_config_process(must_match=True, **kwargs)
eval_logger.info("Detected that you are using DistributedType.DEEPSPEED. Make sure you run `accelerate config` and set zero stage to 0")
if accelerator.distributed_type == DistributedType.FSDP or accelerator.distributed_type == DistributedType.DEEPSPEED:
self._model = accelerator.prepare(self.model)
else:
self._model = accelerator.prepare_model(self.model, evaluation_mode=True)
self.accelerator = accelerator
if self.accelerator.is_local_main_process:
eval_logger.info(f"Using {accelerator.num_processes} devices with data parallelism")
self._rank = self.accelerator.local_process_index
self._world_size = self.accelerator.num_processes
elif accelerator.num_processes == 1 and device_map == "auto":
eval_logger.info(f"Using {accelerator.num_processes} devices with tensor parallelism")
self._rank = 0
self._world_size = 1
else:
eval_logger.info(f"Using single device: {self._device}")
self.model.to(self._device)
self._rank = 0
self._world_size = 1
@property
def config(self):
# return the associated transformers.AutoConfig for the given pretrained model.
return self._config
@property
def tokenizer(self):
return self._tokenizer
@property
def model(self):
# returns the model, unwrapping it if using Accelerate
if hasattr(self, "accelerator"):
return self.accelerator.unwrap_model(self._model)
else:
return self._model
@property
def eot_token_id(self):
# we use EOT because end of *text* is more accurate for what we're doing than end of *sentence*
return self.tokenizer.eos_token_id
@property
def max_length(self):
return self._max_length
@property
def batch_size(self):
return self.batch_size_per_gpu
@property
def device(self):
return self._device
@property
def rank(self):
return self._rank
@property
def world_size(self):
return self._world_size
def tok_encode(self, string: str, left_truncate_len=None, add_special_tokens=None) -> List[int]:
""" """
add_special_tokens = False if add_special_tokens is None else add_special_tokens
encoding = self.tokenizer.encode(string, add_special_tokens=add_special_tokens)
# left-truncate the encoded context to be at most `left_truncate_len` tokens long
if left_truncate_len:
encoding = encoding[-left_truncate_len:]
return encoding
def tok_decode(self, tokens):
try:
return self.tokenizer.decode(tokens)
except Exception:  # `tokens` may be a single id rather than a sequence
return self.tokenizer.decode([tokens])
def loglikelihood(self, requests: List[Instance]) -> List[Tuple[float, bool]]:
raise NotImplementedError
def flatten(self, input):
new_list = []
for i in input:
for j in i:
new_list.append(j)
return new_list
def generate_until(self, requests: List[Instance]) -> List[str]:
res = []
def _collate(x):
toks = self.tok_encode(x[0])
return -len(toks), x[0]
# we group requests by their generation_kwargs,
# so that we don't try to execute e.g. greedy sampling and temp=0.8 sampling
# in the same batch.
metadata = requests[0].metadata
re_ords = utils.Collator([reg.args for reg in requests], _collate, grouping=True)
chunks = re_ords.get_batched(n=self.batch_size, batch_fn=None)
num_iters = len(requests) // self.batch_size if len(requests) % self.batch_size == 0 else len(requests) // self.batch_size + 1
pbar = tqdm(total=num_iters, disable=(self.rank != 0), desc="Model Responding")
for chunk in chunks:
batched_contexts, all_gen_kwargs, batched_doc_to_visual, batched_doc_id, batched_task, batched_split = zip(*chunk)
task = batched_task[0]
split = batched_split[0]
batched_visuals = [batched_doc_to_visual[0](self.task_dict[task][split][ids]) for ids in batched_doc_id] # [B, N]
assert len(batched_visuals) == 1
# we assume all gen kwargs in the batch are the same
# this is safe to assume because the `grouper` object ensures it.
gen_kwargs = all_gen_kwargs[0]
if "until" in gen_kwargs:
gen_kwargs.pop("until")
# preconfigure gen_kwargs with defaults
if "max_new_tokens" not in gen_kwargs:
gen_kwargs["max_new_tokens"] = 1024
if "temperature" not in gen_kwargs:
gen_kwargs["temperature"] = 0
if "do_sample" not in gen_kwargs:
gen_kwargs["do_sample"] = False
if "top_p" not in gen_kwargs:
gen_kwargs["top_p"] = None
if "num_beams" not in gen_kwargs:
gen_kwargs["num_beams"] = 1
question_input = []
text_outputs = []
for visual, context in zip(batched_visuals, batched_contexts):
if len(visual) > 1 or "image_aspect_ratio" not in self._config.__dict__: # for multi image case, we treat per image aspect ratio as "pad" by default.
self._config.image_aspect_ratio = gen_kwargs.get("image_aspect_ratio", "pad")  # gen_kwargs is a dict, so getattr() would always return the default
eval_logger.info(f"Setting image aspect ratio: {self._config.image_aspect_ratio}")
if type(visual[0]) == PIL.Image.Image: # and "task_type" not in metadata and "sample_frames" not in metadata: # For image task
raise NotImplementedError(f"Image tasks are not supported yet: {visual}, {task}, {metadata}")
elif type(visual[0]) == str: # For video task
if len(visual) > 1:
assert len(visual) == 2, visual
media_dict = visual[1]
else:
media_dict = {'video_read_type': 'decord'}
video_path = visual[0]
question_input.append(context)
try:
response = self.model.chat(
video_path,
self.tokenizer,
context,
chat_history=None,
return_history=False,
max_num_frames=self.max_num_frames,
media_dict=media_dict,
generation_config={
"max_new_tokens":gen_kwargs["max_new_tokens"],
"temperature":gen_kwargs["temperature"],
"do_sample":gen_kwargs["do_sample"],
"top_p":gen_kwargs["top_p"],
"num_beams":gen_kwargs["num_beams"]}
)
text_outputs.append(response)
except Exception as e:
raise e
text_outputs = [response.strip() for response in text_outputs]
res.extend(text_outputs)
self.cache_hook.add_partial("generate_until", (context, gen_kwargs), text_outputs)
pbar.update(1)
# reorder this group of results back to original unsorted form
res = re_ords.get_original(res)
pbar.close()
return res
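# Illustrative sketch, not part of the original module: the default-filling step of
# generate_until written with dict.setdefault. `with_generation_defaults` is a name assumed
# for this example. It works on a copy, whereas the method above updates the task's
# gen_kwargs dict in place.

```python
def with_generation_defaults(gen_kwargs: dict) -> dict:
    """Return a copy of gen_kwargs with the same defaults generate_until applies."""
    merged = dict(gen_kwargs)
    # Stop sequences are dropped because model.chat() is not given them.
    merged.pop("until", None)
    merged.setdefault("max_new_tokens", 1024)
    merged.setdefault("temperature", 0)
    merged.setdefault("do_sample", False)
    merged.setdefault("top_p", None)
    merged.setdefault("num_beams", 1)
    return merged


task_kwargs = {"until": ["\n"], "max_new_tokens": 16}
print(with_generation_defaults(task_kwargs))
```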
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/__init__.py
================================================
import os, sys
from typing import List, Union, Dict
from lmms_eval import utils
# from lmms_eval import prompts
from lmms_eval.api.task import TaskConfig, Task, ConfigurableTask
from lmms_eval.api.registry import (
register_task,
register_group,
TASK_REGISTRY,
GROUP_REGISTRY,
ALL_TASKS,
TASK_INITIALIZED,
)
from loguru import logger
eval_logger = logger
def register_configurable_task(config: Dict[str, str]) -> int:
SubClass = type(
config["task"] + "ConfigurableTask",
(ConfigurableTask,),
{"CONFIG": TaskConfig(**config)},
)
if "task" in config:
task_name = "{}".format(config["task"])
register_task(task_name)(SubClass)
if "group" in config:
if config["group"] == config["task"]:
raise ValueError("task and group name cannot be the same")
elif type(config["group"]) == str:
group_name = [config["group"]]
else:
group_name = config["group"]
for group in group_name:
register_group(group)(SubClass)
return 0
def register_configurable_group(config: Dict[str, str]) -> int:
group = config["group"]
task_list = config["task"]
task_names = utils.pattern_match(task_list, ALL_TASKS)
for task in task_names:
if (task in TASK_REGISTRY) or (task in GROUP_REGISTRY):
if group in GROUP_REGISTRY:
GROUP_REGISTRY[group].append(task)
else:
GROUP_REGISTRY[group] = [task]
ALL_TASKS.add(group)
return 0
def get_task_name_from_config(task_config: Dict[str, str]) -> str:
if "dataset_name" in task_config:
return "{dataset_path}_{dataset_name}".format(**task_config)
else:
return "{dataset_path}".format(**task_config)
def include_task_folder(task_dir: str, register_task: bool = True) -> None:
"""
Calling this function
"""
for root, subdirs, file_list in os.walk(task_dir):
# if (subdirs == [] or subdirs == ["__pycache__"]) and (len(file_list) > 0):
for f in file_list:
if f.endswith(".yaml"):
yaml_path = os.path.join(root, f)
try:
config = utils.load_yaml_config(yaml_path)
if "task" not in config:
continue
if register_task:
if type(config["task"]) == str:
# print("register task: ", config["task"])
register_configurable_task(config)
else:
if type(config["task"]) == list:
# print("register task list: ", config["task"])
register_configurable_group(config)
# Log this silently and show it only when
# the user defines the appropriate verbosity.
except ModuleNotFoundError as e:
eval_logger.debug(f"{yaml_path}: {e}. Config will not be added to registry.")
except Exception as error:
import traceback
eval_logger.debug(f"Failed to load config in {yaml_path}. Config will not be added to registry\n" f"Error: {error}\n" f"Traceback: {traceback.format_exc()}")
return 0
def include_path(task_dir):
include_task_folder(task_dir)
# Register Benchmarks after all tasks have been added
include_task_folder(task_dir, register_task=False)
return 0
def initialize_tasks(verbosity="INFO"):
logger.remove()
eval_logger.add(sys.stdout, colorize=True, level=verbosity)
global TASK_INITIALIZED
if TASK_INITIALIZED:
eval_logger.info("Tasks already initialized, skipping re-initialization.")
return
# eval_logger.add(sys.stderr, level=verbosity)
task_dir = os.path.dirname(os.path.abspath(__file__)) + "/"
include_path(task_dir)
TASK_INITIALIZED = True
def get_task(task_name, model_name):
try:
return TASK_REGISTRY[task_name](model_name=model_name) # TODO choiszt the return result need to check " 'mmeConfigurableTask' object has no attribute '_instances'. Did you mean: 'instances'?"
except KeyError:
eval_logger.info("Available tasks:")
eval_logger.info(list(TASK_REGISTRY) + list(GROUP_REGISTRY))
raise KeyError(f"Missing task {task_name}")
def get_task_name_from_object(task_object):
for name, class_ in TASK_REGISTRY.items():
if class_ is task_object:
return name
# TODO: scrap this
# this gives a mechanism for non-registered tasks to have a custom name anyways when reporting
return task_object.EVAL_HARNESS_NAME if hasattr(task_object, "EVAL_HARNESS_NAME") else type(task_object).__name__
# TODO: pass num_fewshot and other cmdline overrides in a better way
def get_task_dict(task_name_list: List[Union[str, Dict, Task]], model_name: str):
all_task_dict = {}
# Ensure task_name_list is a list to simplify processing
if not isinstance(task_name_list, list):
task_name_list = [task_name_list]
for task_element in task_name_list:
if isinstance(task_element, str) and task_element in GROUP_REGISTRY:
group_name = task_element
for task_name in GROUP_REGISTRY[task_element]:
if task_name not in all_task_dict:
# Recursively get the task dictionary for nested groups
task_obj = get_task_dict([task_name], model_name)
# Merge the dictionaries
all_task_dict.update({task_name: (group_name, task_obj.get(task_name, None))})
else:
task_name = task_element if isinstance(task_element, str) else task_element.EVAL_HARNESS_NAME
if task_name not in all_task_dict:
task_obj = get_task(task_name=task_name, model_name=model_name)
all_task_dict[task_name] = task_obj
return all_task_dict
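# Illustrative sketch (not part of the harness): how get_task_dict above shapes its
# result. Members of a group map to a (group_name, task_object) tuple, while tasks
# requested directly map to the task object itself. The registries and the string
# "task objects" below are hypothetical stand-ins for TASK_REGISTRY / GROUP_REGISTRY.

```python
# Hypothetical stand-ins: each "task class" is a callable taking model_name.
TASK_REGISTRY = {
    "lvbench_tv": lambda model_name: f"tv-task[{model_name}]",
    "lvbench_sport": lambda model_name: f"sport-task[{model_name}]",
    "mlvu_mc_count": lambda model_name: f"count-task[{model_name}]",
}
GROUP_REGISTRY = {"lvbench": ["lvbench_tv", "lvbench_sport"]}


def expand(names, model_name):
    """Expand group names into member tasks, mirroring the loop above."""
    out = {}
    for name in names:
        if name in GROUP_REGISTRY:
            for member in GROUP_REGISTRY[name]:
                if member not in out:
                    # Recurse so that nested groups would also be resolved.
                    resolved = expand([member], model_name)
                    out[member] = (name, resolved.get(member))
        elif name not in out:
            out[name] = TASK_REGISTRY[name](model_name=model_name)
    return out


task_dict = expand(["lvbench", "mlvu_mc_count"], "demo")
for task_name, value in task_dict.items():
    print(task_name, "->", value)
```

# Callers therefore have to handle both shapes: a tuple for group members and a
# bare task object for everything else.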
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/_task_utils/file_utils.py
================================================
import os
def generate_submission_file(file_name, args, subpath="submissions"):
path = os.path.join(args.output_path, subpath)
os.makedirs(path, exist_ok=True)
path = os.path.join(path, file_name)
return os.path.abspath(path)
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/_task_utils/gpt_eval_utils.py
================================================
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/_task_utils/video_loader.py
================================================
import os
def get_cache_dir(config, sub_dir="videos"):
HF_HOME = os.environ["HF_HOME"]
cache_dir = config["dataset_kwargs"]["cache_dir"]
cache_dir = os.path.join(HF_HOME, cache_dir)
cache_dir = os.path.join(cache_dir, sub_dir)
return cache_dir
def _get_video_file(prefix: str, video_name: str, suffix: str):
if not isinstance(video_name, str):
video_name = str(video_name)
if not video_name.endswith(suffix):
video_name = f"{video_name}.{suffix}"
video_path = os.path.join(prefix, video_name)
return video_path
def get_video(prefix: str, video_name: str, suffix: str = "mp4"):
tried = [os.path.abspath(_get_video_file(prefix, video_name, suffix)), os.path.abspath(_get_video_file(prefix, video_name, suffix.upper())), os.path.abspath(_get_video_file(prefix, video_name, suffix.lower()))]
for video_path in tried:
if os.path.exists(video_path):
return video_path
raise FileNotFoundError(f"Tried {tried} but none of them exist, please check")
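# Illustrative sketch of the lookup order used by get_video above: the suffix as
# given, then upper-cased, then lower-cased. find_video is a hypothetical,
# self-contained re-statement for demonstration; the file name "clip42" is made up.

```python
import os
import tempfile


def find_video(prefix, video_name, suffix="mp4"):
    """Return the first existing path among the suffix case variants."""
    candidates = []
    for ext in (suffix, suffix.upper(), suffix.lower()):
        name = str(video_name)
        if not name.endswith(ext):
            name = f"{name}.{ext}"
        candidates.append(os.path.abspath(os.path.join(prefix, name)))
    for path in candidates:
        if os.path.exists(path):
            return path
    raise FileNotFoundError(f"Tried {candidates} but none of them exist")


with tempfile.TemporaryDirectory() as tmp:
    # Only an upper-case extension exists on disk.
    open(os.path.join(tmp, "clip42.MP4"), "w").close()
    found_name = os.path.basename(find_video(tmp, "clip42"))
    print(found_name)
```

# On a case-insensitive file system the first candidate may already match, so the
# returned name can differ in case from the file that was created.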
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/_task_utils/vqa_eval_metric.py
================================================
import re
class EvalAIAnswerProcessor:
"""
Processes an answer similar to Eval AI
copied from
https://github.com/facebookresearch/mmf/blob/c46b3b3391275b4181567db80943473a89ab98ab/pythia/tasks/processors.py#L897
"""
CONTRACTIONS = {
"aint": "ain't",
"arent": "aren't",
"cant": "can't",
"couldve": "could've",
"couldnt": "couldn't",
"couldn'tve": "couldn't've",
"couldnt've": "couldn't've",
"didnt": "didn't",
"doesnt": "doesn't",
"dont": "don't",
"hadnt": "hadn't",
"hadnt've": "hadn't've",
"hadn'tve": "hadn't've",
"hasnt": "hasn't",
"havent": "haven't",
"hed": "he'd",
"hed've": "he'd've",
"he'dve": "he'd've",
"hes": "he's",
"howd": "how'd",
"howll": "how'll",
"hows": "how's",
"Id've": "I'd've",
"I'dve": "I'd've",
"Im": "I'm",
"Ive": "I've",
"isnt": "isn't",
"itd": "it'd",
"itd've": "it'd've",
"it'dve": "it'd've",
"itll": "it'll",
"let's": "let's",
"maam": "ma'am",
"mightnt": "mightn't",
"mightnt've": "mightn't've",
"mightn'tve": "mightn't've",
"mightve": "might've",
"mustnt": "mustn't",
"mustve": "must've",
"neednt": "needn't",
"notve": "not've",
"oclock": "o'clock",
"oughtnt": "oughtn't",
"ow's'at": "'ow's'at",
"'ows'at": "'ow's'at",
"'ow'sat": "'ow's'at",
"shant": "shan't",
"shed've": "she'd've",
"she'dve": "she'd've",
"she's": "she's",
"shouldve": "should've",
"shouldnt": "shouldn't",
"shouldnt've": "shouldn't've",
"shouldn'tve": "shouldn't've",
"somebody'd": "somebodyd",
"somebodyd've": "somebody'd've",
"somebody'dve": "somebody'd've",
"somebodyll": "somebody'll",
"somebodys": "somebody's",
"someoned": "someone'd",
"someoned've": "someone'd've",
"someone'dve": "someone'd've",
"someonell": "someone'll",
"someones": "someone's",
"somethingd": "something'd",
"somethingd've": "something'd've",
"something'dve": "something'd've",
"somethingll": "something'll",
"thats": "that's",
"thered": "there'd",
"thered've": "there'd've",
"there'dve": "there'd've",
"therere": "there're",
"theres": "there's",
"theyd": "they'd",
"theyd've": "they'd've",
"they'dve": "they'd've",
"theyll": "they'll",
"theyre": "they're",
"theyve": "they've",
"twas": "'twas",
"wasnt": "wasn't",
"wed've": "we'd've",
"we'dve": "we'd've",
"weve": "we've",
"werent": "weren't",
"whatll": "what'll",
"whatre": "what're",
"whats": "what's",
"whatve": "what've",
"whens": "when's",
"whered": "where'd",
"wheres": "where's",
"whereve": "where've",
"whod": "who'd",
"whod've": "who'd've",
"who'dve": "who'd've",
"wholl": "who'll",
"whos": "who's",
"whove": "who've",
"whyll": "why'll",
"whyre": "why're",
"whys": "why's",
"wont": "won't",
"wouldve": "would've",
"wouldnt": "wouldn't",
"wouldnt've": "wouldn't've",
"wouldn'tve": "wouldn't've",
"yall": "y'all",
"yall'll": "y'all'll",
"y'allll": "y'all'll",
"yall'd've": "y'all'd've",
"y'alld've": "y'all'd've",
"y'all'dve": "y'all'd've",
"youd": "you'd",
"youd've": "you'd've",
"you'dve": "you'd've",
"youll": "you'll",
"youre": "you're",
"youve": "you've",
}
NUMBER_MAP = {
"none": "0",
"zero": "0",
"one": "1",
"two": "2",
"three": "3",
"four": "4",
"five": "5",
"six": "6",
"seven": "7",
"eight": "8",
"nine": "9",
"ten": "10",
}
ARTICLES = ["a", "an", "the"]
PERIOD_STRIP = re.compile(r"(?!<=\d)(\.)(?!\d)")
COMMA_STRIP = re.compile(r"(?<=\d)(\,)+(?=\d)")
PUNCTUATIONS = [
";",
r"/",
"[",
"]",
'"',
"{",
"}",
"(",
")",
"=",
"+",
"\\",
"_",
"-",
">",
"<",
"@",
"`",
",",
"?",
"!",
]
def __init__(self, *args, **kwargs):
pass
def word_tokenize(self, word):
word = word.lower()
word = word.replace(",", "").replace("?", "").replace("'s", " 's")
return word.strip()
def process_punctuation(self, in_text):
out_text = in_text
for p in self.PUNCTUATIONS:
if (p + " " in in_text or " " + p in in_text) or (re.search(self.COMMA_STRIP, in_text) is not None):
out_text = out_text.replace(p, "")
else:
out_text = out_text.replace(p, " ")
out_text = self.PERIOD_STRIP.sub("", out_text)  # the third positional argument of sub() is count, not flags
return out_text
def process_digit_article(self, in_text):
out_text = []
temp_text = in_text.lower().split()
for word in temp_text:
word = self.NUMBER_MAP.get(word, word)  # look up without mutating the shared class-level map
if word not in self.ARTICLES:
out_text.append(word)
else:
pass
for word_id, word in enumerate(out_text):
if word in self.CONTRACTIONS:
out_text[word_id] = self.CONTRACTIONS[word]
out_text = " ".join(out_text)
return out_text
def __call__(self, item):
item = self.word_tokenize(item)
item = item.replace("\n", " ").replace("\t", " ").strip()
item = self.process_punctuation(item)
item = self.process_digit_article(item)
return item
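# Illustrative sketch: PERIOD_STRIP above begins with (?!<=\d), which Python parses
# as a negative lookahead for the literal text "<=" followed by a digit, not as the
# negative lookbehind (?<!\d). The comparison below shows how the two differ; the
# helper name and the sample sentence are hypothetical.

```python
import re

# As written in the class: effectively "a period not followed by a digit".
lookahead_form = re.compile(r"(?!<=\d)(\.)(?!\d)")
# Lookbehind variant: the period must also not be preceded by a digit.
lookbehind_form = re.compile(r"(?<!\d)(\.)(?!\d)")


def strip_periods(pattern, text):
    """Remove every period the pattern matches."""
    return pattern.sub("", text)


sample = "approx. 3.5 meters, about 2."
as_written = strip_periods(lookahead_form, sample)
with_lookbehind = strip_periods(lookbehind_form, sample)
print(as_written)
print(with_lookbehind)
```

# Both forms keep the decimal point in "3.5"; they differ only on a period that
# directly follows a digit, such as the trailing "2." in the sample.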
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/longvideobench/longvideobench_test_v.yaml
================================================
dataset_path: eval_data_jsons/LongVideoBench
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/LongVideoBench/
video: True
force_download: False
local_files_only: False
# From_YouTube: True
task: longvideobench_test_v
test_split: test
doc_to_visual: !function utils.longvideobench_doc_to_visual_v
doc_to_text: !function utils.longvideobench_doc_to_text
doc_to_target: "correct_choice"
generation_kwargs:
max_new_tokens: 32
temperature: 0
do_sample: False
process_results: !function utils.longvideobench_process_results
metric_list:
- metric: submission
aggregation: !function utils.longvideobench_aggregate_results_for_submission
higher_is_better: true
lmms_eval_specific_kwargs:
default:
pre_prompt: ""
post_prompt: "Answer with the option's letter from the given choices directly.\n"
insert_interleave_subtitles: True
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/longvideobench/longvideobench_val_i.yaml
================================================
dataset_path: lmms-eval_data/LongVideoBench
dataset_kwargs:
token: True
cache_dir: phdd2:s3://LongVideoBench/
video: True
force_download: False
local_files_only: False
# From_YouTube: True
task: longvideobench_val_i
test_split: validation
doc_to_visual: !function utils.longvideobench_doc_to_visual_i
doc_to_text: !function utils.longvideobench_doc_to_text
doc_to_target: "correct_choice"
generation_kwargs:
max_new_tokens: 32
temperature: 0
do_sample: False
process_results: !function utils.longvideobench_process_results
metric_list:
- metric: lvb_acc
aggregation: !function utils.longvideobench_aggregate_results
higher_is_better: true
lmms_eval_specific_kwargs:
default:
pre_prompt: ""
post_prompt: "Answer with the option's letter from the given choices directly.\n"
insert_interleave_subtitles: True
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/longvideobench/longvideobench_val_v.yaml
================================================
dataset_path: eval_data_jsons/LongVideoBench
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/LongVideoBench/
video: True
force_download: False
local_files_only: False
# From_YouTube: True
task: longvideobench_val_v
test_split: validation
doc_to_visual: !function utils.longvideobench_doc_to_visual_v
doc_to_text: !function utils.longvideobench_doc_to_text
doc_to_target: "correct_choice"
generation_kwargs:
max_new_tokens: 32
temperature: 0
do_sample: False
process_results: !function utils.longvideobench_process_results
metric_list:
- metric: lvb_acc
aggregation: !function utils.longvideobench_aggregate_results
higher_is_better: true
lmms_eval_specific_kwargs:
default:
pre_prompt: ""
post_prompt: "Answer with the option's letter from the given choices directly.\n"
insert_interleave_subtitles: True
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/longvideobench/utils.py
================================================
import json
import os
import random
import re
import sys
from collections import Counter, defaultdict
from pathlib import Path
from typing import Dict, List, Optional, Union
import decord
import numpy as np
import torch
import yaml
from decord import VideoReader, cpu
from loguru import logger as eval_logger
from PIL import Image
import io
from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
def timestamp_to_seconds(timestamp):
# Split the timestamp into hours, minutes, and seconds
h, m, s = timestamp.split(":")
# Hours and minutes are integers; seconds may carry a fractional part
total_seconds = int(h) * 3600 + int(m) * 60 + float(s)
return total_seconds
def load_video(video_file, duration, max_num_frames=16):
from decord import VideoReader
vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
fps = vr.get_avg_fps()
total_valid_frames = int(duration * fps)
num_frames = max(1, min(max_num_frames, int(duration)))  # at least one frame; avoids division by zero for sub-second clips
frame_indices = [int(total_valid_frames / num_frames) * i for i in range(num_frames)]
frames = vr.get_batch(frame_indices)
if isinstance(frames, torch.Tensor):
frames = frames.numpy()
else:
frames = frames.asnumpy()
frame_timestamps = [frame_index / fps for frame_index in frame_indices]
return [Image.fromarray(fr).convert("RGB") for fr in frames]
def compute_frame_timestamps(duration, max_num_frames=16):
if duration > max_num_frames:
return [duration / max_num_frames * i for i in range(max_num_frames)]
else:
return [i for i in range(int(duration))]
def insert_subtitles_into_frames(frame_timestamps, subtitles, starting_timestamp_for_subtitles, duration, add_image):
interleaved_list = []
cur_i = 0
for subtitle in subtitles:
if "timestamp" in subtitle:
start, end = subtitle["timestamp"]
if not isinstance(end, float):
end = duration
start -= starting_timestamp_for_subtitles
end -= starting_timestamp_for_subtitles
subtitle_timestamp = (start + end) / 2
subtitle_text = subtitle["text"]
else:
start, end = subtitle["start"], subtitle["end"]
start = timestamp_to_seconds(start)
end = timestamp_to_seconds(end)
start -= starting_timestamp_for_subtitles
end -= starting_timestamp_for_subtitles
subtitle_timestamp = (start + end) / 2
subtitle_text = subtitle["line"]
for i, frame_timestamp in enumerate(frame_timestamps[cur_i:]):
if frame_timestamp <= subtitle_timestamp:
# print("frame:", frame_timestamp)
if add_image:
interleaved_list.append("<image>")
cur_i += 1
else:
break
if end - start < 1:
end = subtitle_timestamp + 0.5
start = subtitle_timestamp - 0.5
covering_frames = False
for frame_timestamp in frame_timestamps:
if frame_timestamp < end and frame_timestamp > start:
covering_frames = True
break
if covering_frames:
# print("subtitle:", subtitle_timestamp, start, end)
interleaved_list.append(subtitle_text)
else:
pass
# print("leaving out subtitle:", start, end)
if add_image:
for i, frame_timestamp in enumerate(frame_timestamps[cur_i:]):
# print(frame_timestamp)
interleaved_list.append("<image>")
return "\n".join(interleaved_list)
def longvideobench_doc_to_text(doc, lmms_eval_specific_kwargs):
candidates = []
for i in range(5):
candidate = doc.get(f"option{i}")
if candidate != "N/A":
candidates.append(candidate)
question = doc["question"] + "\n" + "\n".join([". ".join([f'({chr(ord("A") + i)})', candidate]) for i, candidate in enumerate(candidates)])
pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
post_prompt = lmms_eval_specific_kwargs["post_prompt"]
if lmms_eval_specific_kwargs.get("insert_interleave_subtitles", False):
with open(Path(__file__).parent / "longvideobench_val_i.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
subtitle_subdir_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("subtitle_subdir", "subtitles")
cache_dir = os.path.join(cache_name, subtitle_subdir_name)
if '_en.json' in doc["subtitle_path"]:
subtitle_path = doc["subtitle_path"]
else:
subtitle_path = doc["subtitle_path"].replace(".json", "_en.json")
with open(os.path.join(cache_dir, subtitle_path), "r") as f:
subtitles = json.load(f)
max_num_frames = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("max_num_frames", 512)
frame_timestamps = compute_frame_timestamps(doc["duration"], max_num_frames)
interleaved_prefix = insert_subtitles_into_frames(frame_timestamps, subtitles, doc["starting_timestamp_for_subtitles"], doc["duration"], add_image=False)
return f"{pre_prompt} This video's subtitles are listed below: \n{interleaved_prefix}.\n Select the best answer to the following multiple-choice question based on the video and the subtitles. Respond with only the letter of the correct option.\nQuestion: {question}\n{post_prompt}"
else:
raise NotImplementedError
return f"{pre_prompt}{question}\n{post_prompt}"
# hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")
# base_cache_dir = os.path.expanduser(hf_home)
def longvideobench_doc_to_visual_v(doc):
with open(Path(__file__).parent / "longvideobench_val_v.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
vid_subdir_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("video_subdir", "videos/")
cache_dir = os.path.join(cache_name, vid_subdir_name)
video_path = doc["video_path"]
video_path = os.path.join(cache_dir, video_path)
return [video_path]
def longvideobench_doc_to_visual_i(doc):
with open(Path(__file__).parent / "longvideobench_val_i.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
vid_subdir_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("video_subdir", "videos/")
cache_dir = os.path.join(cache_name, vid_subdir_name)
video_path = doc["video_path"]
video_path = os.path.join(cache_dir, video_path)
max_num_frames = yaml.safe_load("".join(safe_data))["dataset_kwargs"].get("max_num_frames", 16)
return load_video(video_path, doc["duration"], max_num_frames)
def get_multi_choice_info(options):
"""
Given the list of options for multiple choice question
Return the index2ans and all_choices
https://github.com/MMMU-Benchmark/MMMU/blob/51ce7f3e829c16bb44bc5445782686b4c3508794/eval/data_utils.py#L54
"""
start_chr = "A"
all_choices = []
index2ans = {}
for i, option in enumerate(options):
index2ans[chr(ord(start_chr) + i)] = option
all_choices.append(chr(ord(start_chr) + i))
return index2ans, all_choices
def parse_multi_choice_response(response, all_choices, index2ans):
"""
Parse the prediction from the generated response.
Return the predicted index e.g., A, B, C, D.
https://github.com/MMMU-Benchmark/MMMU/blob/51ce7f3e829c16bb44bc5445782686b4c3508794/eval/eval_utils.py#L10
"""
for char in [",", ".", "!", "?", ";", ":", "'"]:
response = response.strip(char)
response = " " + response + " " # add space to avoid partial match
index_ans = True
ans_with_brack = False
candidates = []
for choice in all_choices: # e.g., (A) (B) (C) (D)
if f"({choice})" in response:
candidates.append(choice)
ans_with_brack = True
if len(candidates) == 0:
for choice in all_choices: # e.g., A B C D
if f"{choice} " in response:
candidates.append(choice)
if len(candidates) == 0:
for choice in all_choices: # e.g., A. B. C. D.
if f"{choice}." in response:
candidates.append(choice)
# if none of the above yields a candidate and the response is longer than 5 tokens, try matching the option text itself
if len(candidates) == 0 and len(response.split()) > 5:
for index, ans in index2ans.items():
if ans.lower() in response.lower():
candidates.append(index)
index_ans = False # it's content ans.
if len(candidates) == 0: # still not get answer, randomly choose one.
pred_index = random.choice(all_choices)
elif len(candidates) > 1:
start_indexes = []
if index_ans:
if ans_with_brack:
for can in candidates:
index = response.rfind(f"({can})")
start_indexes.append(index) # -1 will be ignored anyway
# start_indexes = [generated_response.index(f'({can})') for can in candidates]
else:
for can in candidates:
index = response.rfind(f" {can} ")
start_indexes.append(index)
else:
for can in candidates:
index = response.lower().rfind(index2ans[can].lower())
start_indexes.append(index)
# get the last one
pred_index = candidates[np.argmax(start_indexes)]
else: # if only one candidate, use it.
pred_index = candidates[0]
return pred_index
def evaluate_longvideobench(samples):
pred_correct = 0
judge_dict = dict()
for sample in samples:
gold_i = sample["answer"]
pred_i = sample["parsed_pred"]
correct = eval_multi_choice(gold_i, pred_i)
if correct:
judge_dict[sample["id"]] = "Correct"
pred_correct += 1
else:
judge_dict[sample["id"]] = "Wrong"
if len(samples) == 0:
return judge_dict, {"acc": 0}  # keep the (judge_dict, metrics) shape that callers unpack
return judge_dict, {"acc": pred_correct / len(samples)}
def eval_multi_choice(gold_i, pred_i):
correct = False
# count the prediction as correct only on an exact match
if isinstance(gold_i, list):
for answer in gold_i:
if answer == pred_i:
correct = True
break
else: # gold_i is a string
if gold_i == pred_i:
correct = True
return correct
def calculate_ins_level_acc(results):
"""Calculate the instruction level accuracy for given Subject results
https://github.com/MMMU-Benchmark/MMMU/blob/51ce7f3e829c16bb44bc5445782686b4c3508794/eval/eval_utils.py#L246
"""
acc = 0
ins_num = 0
for cat_results in results.values():
acc += cat_results["acc"] * cat_results["num_example"]
ins_num += cat_results["num_example"]
if ins_num == 0:
return 0
return acc / ins_num
def longvideobench_process_results(doc, results):
pred = results[0]
all_choices = []
index2ans = {}
for i in range(5):
option = doc.get(f"option{i}")
if option == "N/A":
break
index2ans[chr(ord("A") + i)] = option
all_choices.append(chr(ord("A") + i))
parsed_pred = parse_multi_choice_response(pred, all_choices, index2ans)
id = doc["video_id"]
lvb_acc = {"id": id, "duration_group": doc["duration_group"], "question_category": doc["question_category"], "answer": chr(ord("A") + doc["correct_choice"]), "parsed_pred": parsed_pred}
return {
"lvb_acc": lvb_acc,
"submission": {
id: pred,
},
}
def longvideobench_aggregate_results(results):
evaluation_result = {}
subset_to_eval_samples = defaultdict(list)
for result in results:
subset_to_eval_samples[result["duration_group"]].append(result)
subset_to_eval_samples[result["question_category"]].append(result)
for subset, sub_eval_samples in subset_to_eval_samples.items():
judge_dict, metric_dict = evaluate_longvideobench(sub_eval_samples)
metric_dict.update({"num_example": len(sub_eval_samples)})
evaluation_result[subset] = metric_dict
printable_results = {}
for cat_name, cat_results in evaluation_result.items():
printable_results[cat_name] = {
"num": int(cat_results["num_example"]),
"acc": round(cat_results["acc"], 5),
}
all_ins_acc = calculate_ins_level_acc(evaluation_result)
printable_results["Overall"] = {
"num": sum([cat_results["num_example"] for cat_results in evaluation_result.values()]),
"acc": round(all_ins_acc, 5),
}
eval_logger.info(printable_results)
return printable_results["Overall"]["acc"]
def longvideobench_aggregate_results_for_submission(results, args):
path = generate_submission_file("longvideobench_test_for_submission.json", args)
results_dict = {list(item.keys())[0]: list(item.values())[0] for item in results}
with open(path, "w") as f:
json.dump(results_dict, f)
eval_logger.info(f"Results saved to {path}.")
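# Illustrative sketch of the timing arithmetic used above when interleaving
# subtitles: frame timestamps are spread uniformly over the clip, and a subtitle is
# kept only if some sampled frame falls strictly inside its (start, end) window.
# The two helpers are re-stated so the block runs on its own; the durations and the
# subtitle window are made-up values.

```python
def timestamp_to_seconds(timestamp):
    """Convert 'HH:MM:SS(.fff)' into seconds."""
    h, m, s = timestamp.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


def compute_frame_timestamps(duration, max_num_frames=16):
    """Uniform timestamps for long clips, one per whole second for short ones."""
    if duration > max_num_frames:
        return [duration / max_num_frames * i for i in range(max_num_frames)]
    return [i for i in range(int(duration))]


start = timestamp_to_seconds("00:01:30.5")
end = timestamp_to_seconds("00:01:32.5")
midpoint = (start + end) / 2

long_clip = compute_frame_timestamps(120, max_num_frames=4)
short_clip = compute_frame_timestamps(3.7, max_num_frames=16)

# A subtitle survives only if a sampled frame lies strictly inside its window.
covered = any(start < t < end for t in long_clip)
print(start, end, midpoint, long_clip, short_clip, covered)
```

# With only four frames over two minutes, the subtitle at 90.5 s to 92.5 s has no
# covering frame and would be left out of the interleaved prompt.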
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/_default_template.yaml
================================================
dataset_path: eval_data_jsons/LVBench
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/LVBench/frames/ # NOTE: we extract frames to avoid oom.
video: True
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
output_type: generate_until
doc_to_visual: !function utils.lvbench_mc_doc_to_visual
doc_to_text: !function utils.lvbench_mc_doc_to_text
doc_to_target: "answer"
# The return value of process_results will be used by metrics
process_results: !function utils.lvbench_mc_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: lvbench_mc_accuracy
aggregation: !function utils.lvbench_mc_aggregate_results
higher_is_better: true
lmms_eval_specific_kwargs:
default:
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench.yaml
================================================
group: lvbench
task:
- lvbench_cartoon
- lvbench_documentary
- lvbench_live
- lvbench_selfmedia
- lvbench_sport
- lvbench_tv
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench_cartoon.yaml
================================================
include: _default_template.yaml
task: lvbench_cartoon
dataset_name: lvbench_cartoon
test_split: train
lmms_eval_specific_kwargs:
default:
sub_task: lvbench_cartoon
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench_documentary.yaml
================================================
include: _default_template.yaml
task: lvbench_documentary
dataset_name: lvbench_documentary
test_split: train
lmms_eval_specific_kwargs:
default:
sub_task: lvbench_documentary
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench_live.yaml
================================================
include: _default_template.yaml
task: lvbench_live
dataset_name: lvbench_live
test_split: train
lmms_eval_specific_kwargs:
default:
sub_task: lvbench_live
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench_selfmedia.yaml
================================================
include: _default_template.yaml
task: lvbench_selfmedia
dataset_name: lvbench_selfmedia
test_split: train
lmms_eval_specific_kwargs:
default:
sub_task: lvbench_selfmedia
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench_sport.yaml
================================================
include: _default_template.yaml
task: lvbench_sport
dataset_name: lvbench_sport
test_split: train
lmms_eval_specific_kwargs:
default:
sub_task: lvbench_sport
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/lvbench_tv.yaml
================================================
include: _default_template.yaml
task: lvbench_tv
dataset_name: lvbench_tv
test_split: train
lmms_eval_specific_kwargs:
default:
sub_task: lvbench_tv
post_prompt: "Answer with the option's letter from the given choices directly."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/lvbench/utils.py
================================================
from collections import defaultdict
import os
import datetime
import json
from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
from pathlib import Path
import yaml
import sys, string
from typing import List, Dict, Optional, Union
import re
import PIL
import numpy as np
from loguru import logger as eval_logger
# hf_home = os.getenv("HF_HOME", "./~/.cache/huggingface")
# base_cache_dir = os.path.expanduser(hf_home)
with open(Path(__file__).parent / "_default_template.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
cache_dir = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
def lvbench_mc_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
# cache_dir = os.path.join(base_cache_dir, cache_name)
video_path = os.path.join(cache_dir, doc["video"])
if not os.path.exists(video_path) and "s3://" not in video_path:
eval_logger.error(f"Video path: {video_path} does not exist, please check.")
if "start" in doc:
start, end = doc['start'], doc['end']
media_dict = {'start':start, 'end':end, 'video_read_type': 'img'}
else:
media_dict = {'video_read_type': 'img'}
print("video_path:", video_path)
return [video_path, media_dict]
def lvbench_mc_doc_to_text(doc, lmms_eval_specific_kwargs=None):
option_prompt = ""
option_list = doc["candidates"]
option_letters = string.ascii_uppercase
for char_index, option in enumerate(option_list):
option_letter = option_letters[char_index]
option_prompt += f"{option_letter}. {option}\n"
full_text = doc["question"] + "\n" + option_prompt + lmms_eval_specific_kwargs["post_prompt"]
return full_text
def mcq_acc(answer, pred):
periodStrip = re.compile(r"(?!<=\d)(\.)(?!\d)")  # raw string avoids invalid-escape warnings
commaStrip = re.compile(r"(\d)(\,)(\d)")
punct = [";", r"/", "[", "]", '"', "{", "}", "(", ")", "=", "+", "\\", "_", "-", ">", "<", "@", "`", ",", "?", "!"]
def processPunctuation(inText):
outText = inText
for p in punct:
if (p + " " in inText or " " + p in inText) or (re.search(commaStrip, inText) is not None):
outText = outText.replace(p, "")
else:
outText = outText.replace(p, " ")
outText = periodStrip.sub("", outText)  # the third positional argument of sub() is count, not flags
return outText
def process(answer):
option_regex = re.compile(r"^([A-E])\.\s*(.+)$", re.IGNORECASE)
match = option_regex.match(answer.strip())
if match:
# If matched, return the option letter in uppercase
return match.group(1).upper()
else:
# If no match, process the answer as before
answer = answer.replace("\n", " ")
answer = answer.replace("\t", " ")
answer = answer.strip()
answer = processPunctuation(answer)
answer = answer.strip("'")
answer = answer.strip('"')
answer = answer.strip(")")
answer = answer.strip("(")
answer = answer.strip().lower()
# Try to find any single letter (A-E) in the processed answer
letter_match = re.search(r"\b([A-E])\b", answer, re.IGNORECASE)
if letter_match:
return letter_match.group(1).upper()
return answer
pred = process(pred)
answer = process(answer)
if pred == answer:
score = 1
else:
score = 0
return score
def lvbench_mc_process_results(doc, results):
"""
Args:
doc: an instance of the eval dataset
results: [pred]
Returns:
a dictionary with key: metric name (in this case lvbench_mc_accuracy), value: a dict holding the prediction, the ground-truth letter, and the score
"""
pred = results[0]
# Calculate the ground truth option letter
option_letters = string.ascii_uppercase
gt_option_letter = None
for i, candidate in enumerate(doc["candidates"]):
if candidate == doc["answer"]:
gt_option_letter = option_letters[i]
break
if gt_option_letter is not None:
# Calculate the score using mcq_acc function
score = mcq_acc(gt_option_letter, pred)
else:
score = 0
data_dict = {"pred_answer": pred, "gt_answer": gt_option_letter, "score": score}
return {"lvbench_mc_accuracy": data_dict}
def lvbench_mc_aggregate_results(results):
"""
Args:
results: a list of values returned by process_results
Returns:
A score
"""
total_answered = 0
total_correct = 0
for result in results:
if result["pred_answer"] != "":
total_answered += 1
total_correct += result["score"]
return 100 * total_correct / total_answered if total_answered > 0 else 0
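# Illustrative sketch of the aggregation rule above: empty predictions are left out
# of the denominator, so the score is a percentage over answered questions only.
# aggregate_accuracy and the sample records are hypothetical.

```python
def aggregate_accuracy(results):
    """Percentage of correct answers among non-empty predictions."""
    answered = [r for r in results if r["pred_answer"] != ""]
    if not answered:
        return 0
    return 100 * sum(r["score"] for r in answered) / len(answered)


sample_results = [
    {"pred_answer": "A", "gt_answer": "A", "score": 1},
    {"pred_answer": "", "gt_answer": "B", "score": 0},  # skipped: no prediction
    {"pred_answer": "C", "gt_answer": "D", "score": 0},
]
accuracy = aggregate_accuracy(sample_results)
print(accuracy)
```

# One design consequence: a model that returns empty strings is not penalised for
# them, so the count of answered items is worth reporting next to the score.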
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/_default_template.yaml
================================================
dataset_path: eval_data_jsons/MLVU_MC
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/MLVU_MC
video: True
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
output_type: generate_until
doc_to_visual: !function utils.mlvu_mc_doc_to_visual
doc_to_text: !function utils.mlvu_mc_doc_to_text
doc_to_target: "answer"
# The return value of process_results will be used by metrics
process_results: !function utils.mlvu_mc_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: mlvu_mc_accuracy
aggregation: !function utils.mlvu_mc_aggregate_results
higher_is_better: true
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc.yaml
================================================
group: mlvu_mc
task:
- mlvu_mc_count
- mlvu_mc_ego
- mlvu_mc_needle
- mlvu_mc_order
- mlvu_mc_plotqa
- mlvu_mc_anomaly_reco
- mlvu_mc_topic_reasoning
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_anomaly_reco.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_anomaly_reco
dataset_name: 6_anomaly_reco
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 6_anomaly_reco
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 6_anomaly_reco
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 6_anomaly_reco
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_count.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_count
dataset_name: 4_count
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 4_count
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 4_count
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 4_count
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_ego.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_ego
dataset_name: 3_ego
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 3_ego
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 3_ego
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 3_ego
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_needle.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_needle
dataset_name: 2_needle
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 2_needle
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 2_needle
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 2_needle
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_order.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_order
dataset_name: 5_order
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 5_order
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 5_order
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 5_order
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_plotqa.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_plotqa
dataset_name: 1_plotQA
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 1_plotQA
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 1_plotQA
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 1_plotQA
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/mlvu_mc_topic_reasoning.yaml
================================================
include: _default_template.yaml
task: mlvu_mc_topic_reasoning
dataset_name: 7_topic_reasoning
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: 7_topic_reasoning
    post_prompt: "Answer with the option's letter from the given choices directly."
  videochat_next_dynamic_newprompt:
    sub_task: 7_topic_reasoning
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
  videochat_next_dynamic_pdrop_newprompt:
    sub_task: 7_topic_reasoning
    yinan_prompt: "()"
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mlvu_mc/utils.py
================================================
from collections import defaultdict
import os
import datetime
import json
from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
from pathlib import Path
import yaml
import sys, string
from typing import List, Dict, Optional, Union
import re
import PIL
import numpy as np
from loguru import logger as eval_logger
import io
DATA_LIST = {
"4_count": "your_eval_data_dir/MVLU/MLVU/video/4_count",
"3_ego": "your_eval_data_dir/MVLU/MLVU/video/3_ego",
"2_needle": "your_eval_data_dir/MVLU/MLVU/video/2_needle",
"5_order": "your_eval_data_dir/MVLU/MLVU/video/5_order",
"1_plotQA": "your_eval_data_dir/MVLU/MLVU/video/1_plotQA",
"6_anomaly_reco": "your_eval_data_dir/MVLU/MLVU/video/6_anomaly_reco",
"7_topic_reasoning": "your_eval_data_dir/MVLU/MLVU/video/7_topic_reasoning"
}
# hf_home = os.getenv("HF_HOME", "./~/.cache/huggingface")
# base_cache_dir = os.path.expanduser(hf_home)
with open(Path(__file__).parent / "_default_template.yaml", "r") as f:
    raw_data = f.readlines()
    safe_data = []
    for i, line in enumerate(raw_data):
        # drop lines carrying the !function tag, since yaml.safe_load cannot parse it
        if "!function" not in line:
            safe_data.append(line)
    cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
def mlvu_mc_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
    # cache_dir = os.path.join(base_cache_dir, cache_name)
    cache_dir = ""
    dataset_folder = DATA_LIST[lmms_eval_specific_kwargs["sub_task"]]
    video_path = os.path.join(cache_dir, dataset_folder, doc["video"])
    if os.path.exists(video_path):
        pass  # the path exists locally; use it as-is
    elif os.path.basename(dataset_folder) in ["clevrer", "star"]:
        alternative_video_path = os.path.join(cache_dir, "data0613", dataset_folder, doc["video"])
        if os.path.exists(alternative_video_path):
            video_path = alternative_video_path
        else:
            eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    elif "s3://" not in video_path:
        eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    if "start" in doc:
        start, end = doc['start'], doc['end']
        media_dict = {'start': start, 'end': end, 'video_read_type': 'decord'}
    else:
        media_dict = {'video_read_type': 'decord'}
    return [video_path, media_dict]
def mlvu_mc_frames_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
    # cache_dir = os.path.join(base_cache_dir, cache_name)
    cache_dir = ""
    dataset_folder = DATA_LIST[lmms_eval_specific_kwargs["sub_task"]]
    video_path = os.path.join(cache_dir, dataset_folder, doc["video"])
    if os.path.exists(video_path):
        pass  # the path exists locally; use it as-is
    elif os.path.basename(dataset_folder) in ["clevrer", "star"]:
        alternative_video_path = os.path.join(cache_dir, "data0613", dataset_folder, doc["video"])
        if os.path.exists(alternative_video_path):
            video_path = alternative_video_path
        else:
            eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    elif "s3://" not in video_path:
        eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    # frame_image_list = read_frame(video_path)
    if "start" in doc:
        start, end = doc['start'], doc['end']
        media_dict = {'start': start, 'end': end, 'video_read_type': 'img'}
    else:
        media_dict = {'video_read_type': 'img'}
    return [video_path, media_dict]
def mlvu_mc_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    option_prompt = ""
    option_list = doc["candidates"]
    option_letters = string.ascii_uppercase
    for char_index, option in enumerate(option_list):
        option_letter = option_letters[char_index]
        option_prompt += f"({option_letter}) {option}\n"
    full_text = "Question:" + doc["question"] + "\nOption:\n" + option_prompt + lmms_eval_specific_kwargs["post_prompt"]
    return full_text
def mcq_acc(answer, pred):
    # Raw strings avoid invalid-escape warnings. Note that `(?!<=\d)` is a lookahead for the
    # literal text "<=" plus a digit, so this pattern matches any period not followed by a digit.
    periodStrip = re.compile(r"(?!<=\d)(\.)(?!\d)")
    commaStrip = re.compile(r"(\d)(\,)(\d)")
    punct = [";", r"/", "[", "]", '"', "{", "}", "(", ")", "=", "+", "\\", "_", "-", ">", "<", "@", "`", ",", "?", "!"]
    def processPunctuation(inText):
        outText = inText
        for p in punct:
            if (p + " " in inText or " " + p in inText) or (re.search(commaStrip, inText) is not None):
                outText = outText.replace(p, "")
            else:
                outText = outText.replace(p, " ")
        # The third positional argument of sub() is `count`, not `flags`, so re.UNICODE is dropped.
        outText = periodStrip.sub("", outText)
        return outText
    def process(answer):
        option_regex = re.compile(r"^([A-E])\.\s*(.+)$", re.IGNORECASE)
        match = option_regex.match(answer.strip())
        if match:
            # If matched, return the option letter in uppercase
            return match.group(1).upper()
        else:
            # If no match, normalize whitespace, punctuation, and case
            answer = answer.replace("\n", " ")
            answer = answer.replace("\t", " ")
            answer = answer.strip()
            answer = processPunctuation(answer)
            answer = answer.strip("'")
            answer = answer.strip('"')
            answer = answer.strip(")")
            answer = answer.strip("(")
            answer = answer.strip().lower()
            # Try to find any single letter (A-E) in the processed answer
            letter_match = re.search(r"\b([A-E])\b", answer, re.IGNORECASE)
            if letter_match:
                return letter_match.group(1).upper()
            return answer
    pred = process(pred)
    answer = process(answer)
    if pred == answer:
        score = 1
    else:
        score = 0
    return score
def mlvu_mc_process_results(doc, results):
    """
    Args:
        doc: an instance of the eval dataset
        results: [pred]
    Returns:
        a dictionary with key: metric name (in this case mlvu_mc_accuracy), value: metric value
    """
    pred = results[0]
    # Calculate the ground truth option letter
    option_letters = string.ascii_uppercase
    gt_option_letter = None
    for i, candidate in enumerate(doc["candidates"]):
        if candidate == doc["answer"]:
            gt_option_letter = option_letters[i]
            break
    if gt_option_letter is not None:
        # Calculate the score using mcq_acc function
        score = mcq_acc(gt_option_letter, pred)
    else:
        score = 0
    data_dict = {"pred_answer": pred, "gt_answer": gt_option_letter, "score": score}
    return {"mlvu_mc_accuracy": data_dict}
def mlvu_mc_aggregate_results(results):
    """
    Args:
        results: a list of values returned by process_results
    Returns:
        A score
    """
    total_answered = 0
    total_correct = 0
    for result in results:
        if result["pred_answer"] != "":
            total_answered += 1
            total_correct += result["score"]
    return 100 * total_correct / total_answered if total_answered > 0 else 0
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/_default_template.yaml
================================================
dataset_path: eval_data_jsons/MVBench
dataset_kwargs:
  token: True
  cache_dir: your_eval_data_dir/MVBench
  video: True
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
output_type: generate_until
doc_to_visual: !function utils.mvbench_doc_to_visual
doc_to_text: !function utils.mvbench_doc_to_text
doc_to_target: "answer"
# The return value of process_results will be used by metrics
process_results: !function utils.mvbench_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
  - metric: mvbench_accuracy
    aggregation: !function utils.mvbench_aggregate_results
    higher_is_better: true
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench.yaml
================================================
group: mvbench
task:
- mvbench_action_sequence
- mvbench_moving_count
- mvbench_action_prediction
- mvbench_episodic_reasoning
- mvbench_action_antonym
- mvbench_action_count
- mvbench_scene_transition
- mvbench_object_shuffle
- mvbench_object_existence
- mvbench_fine_grained_pose
- mvbench_unexpected_action
- mvbench_moving_direction
- mvbench_state_change
- mvbench_object_interaction
- mvbench_character_order
- mvbench_action_localization
- mvbench_counterfactual_inference
- mvbench_fine_grained_action
- mvbench_moving_attribute
- mvbench_egocentric_navigation
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_action_antonym.yaml
================================================
include: _default_template.yaml
task: mvbench_action_antonym
dataset_name: action_antonym
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: action_antonym
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_action_count.yaml
================================================
include: _default_template.yaml
task: mvbench_action_count
dataset_name: action_count
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: action_count
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_action_localization.yaml
================================================
include: _default_template.yaml
task: mvbench_action_localization
dataset_name: action_localization
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: action_localization
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_action_prediction.yaml
================================================
include: _default_template.yaml
task: mvbench_action_prediction
dataset_name: action_prediction
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: action_prediction
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_action_sequence.yaml
================================================
include: _default_template.yaml
task: mvbench_action_sequence
dataset_name: action_sequence
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: action_sequence
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_character_order.yaml
================================================
include: _default_template.yaml
task: mvbench_character_order
dataset_name: character_order
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: character_order
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_counterfactual_inference.yaml
================================================
include: _default_template.yaml
task: mvbench_counterfactual_inference
dataset_name: counterfactual_inference
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: counterfactual_inference
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_egocentric_navigation.yaml
================================================
include: _default_template.yaml
task: mvbench_egocentric_navigation
dataset_name: egocentric_navigation
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: egocentric_navigation
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_episodic_reasoning.yaml
================================================
include: _default_template.yaml
task: mvbench_episodic_reasoning
dataset_name: episodic_reasoning
test_split: train
doc_to_visual: !function utils.mvbench_frames_doc_to_visual
generation_kwargs:
  max_new_tokens: 16
  temperature: 0
  top_p: 1.0
  num_beams: 1
  do_sample: false
lmms_eval_specific_kwargs:
  default:
    sub_task: episodic_reasoning
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_fine_grained_action.yaml
================================================
include: _default_template.yaml
task: mvbench_fine_grained_action
dataset_name: fine_grained_action
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: fine_grained_action
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_fine_grained_pose.yaml
================================================
include: _default_template.yaml
task: mvbench_fine_grained_pose
dataset_name: fine_grained_pose
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: fine_grained_pose
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_moving_attribute.yaml
================================================
include: _default_template.yaml
task: mvbench_moving_attribute
dataset_name: moving_attribute
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: moving_attribute
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_moving_count.yaml
================================================
include: _default_template.yaml
task: mvbench_moving_count
dataset_name: moving_count
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: moving_count
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_moving_direction.yaml
================================================
include: _default_template.yaml
task: mvbench_moving_direction
dataset_name: moving_direction
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: moving_direction
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_object_existence.yaml
================================================
include: _default_template.yaml
task: mvbench_object_existence
dataset_name: object_existence
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: object_existence
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_object_interaction.yaml
================================================
include: _default_template.yaml
task: mvbench_object_interaction
dataset_name: object_interaction
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: object_interaction
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_object_shuffle.yaml
================================================
include: _default_template.yaml
task: mvbench_object_shuffle
dataset_name: object_shuffle
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: object_shuffle
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_scene_transition.yaml
================================================
include: _default_template.yaml
task: mvbench_scene_transition
dataset_name: scene_transition
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: scene_transition
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_state_change.yaml
================================================
include: _default_template.yaml
task: mvbench_state_change
dataset_name: state_change
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: state_change
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/mvbench_unexpected_action.yaml
================================================
include: _default_template.yaml
task: mvbench_unexpected_action
dataset_name: unexpected_action
test_split: train
lmms_eval_specific_kwargs:
  default:
    sub_task: unexpected_action
    post_prompt: "\nOnly give the best option."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/mvbench/utils.py
================================================
from collections import defaultdict
import os
import datetime
import json
from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
from pathlib import Path
import yaml
import sys, string
from typing import List, Dict, Optional, Union
import re
import PIL
import numpy as np
from loguru import logger as eval_logger
import io
DATA_LIST = {
"action_sequence": "your_data_dir/star/Charades_v1_480/",
"action_prediction": "your_data_dir/star/Charades_v1_480/",
"action_antonym": "your_data_dir/ssv2-video/",
"fine_grained_action": "p2hdd:s3://Moments_in_Time_Raw/videos/",
"unexpected_action": "your_data_dir/funqa-test/test/",
"object_existence": "your_data_dir/clevrer/video_validation/",
"object_interaction": "your_data_dir/star/Charades_v1_480/",
"object_shuffle": "your_data_dir/perception/videos/",
"moving_direction": "your_data_dir/clevrer/video_validation/",
"action_localization": "your_data_dir/sta/sta_video/",
"scene_transition": "your_data_dir/scene-qa/video/",
"action_count": "your_data_dir/perception/videos/",
"moving_count": "your_data_dir/clevrer/video_validation/",
"moving_attribute": "your_data_dir/clevrer/video_validation/",
"state_change": "your_data_dir/perception/videos/",
"fine_grained_pose": "your_data_dir/nturgbd/",
"character_order": "your_data_dir/perception/videos/",
"egocentric_navigation": "your_data_dir/vlnqa/",
"episodic_reasoning": "your_data_dir/tvqa/frames_fps3_hq/",
"counterfactual_inference": "your_data_dir/clevrer/video_validation/",
}
# hf_home = os.getenv("HF_HOME", "./~/.cache/huggingface")
# base_cache_dir = os.path.expanduser(hf_home)
with open(Path(__file__).parent / "_default_template.yaml", "r") as f:
    raw_data = f.readlines()
    safe_data = []
    for i, line in enumerate(raw_data):
        # drop lines carrying the !function tag, since yaml.safe_load cannot parse it
        if "!function" not in line:
            safe_data.append(line)
    cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
def mvbench_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
    # cache_dir = os.path.join(base_cache_dir, cache_name)
    cache_dir = ""
    dataset_folder = DATA_LIST[lmms_eval_specific_kwargs["sub_task"]]
    video_path = os.path.join(cache_dir, dataset_folder, doc["video"])
    if os.path.exists(video_path):
        pass  # the path exists locally; use it as-is
    elif os.path.basename(dataset_folder) in ["clevrer", "star"]:
        alternative_video_path = os.path.join(cache_dir, "data0613", dataset_folder, doc["video"])
        if os.path.exists(alternative_video_path):
            video_path = alternative_video_path
        else:
            eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    elif "s3://" not in video_path:
        eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    if "start" in doc:
        start, end = doc['start'], doc['end']
        media_dict = {'start': start, 'end': end, 'video_read_type': 'decord'}
    else:
        media_dict = {'video_read_type': 'decord'}
    return [video_path, media_dict]
def mvbench_frames_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
    # cache_dir = os.path.join(base_cache_dir, cache_name)
    cache_dir = ""
    dataset_folder = DATA_LIST[lmms_eval_specific_kwargs["sub_task"]]
    video_path = os.path.join(cache_dir, dataset_folder, doc["video"])
    if os.path.exists(video_path):
        pass  # the path exists locally; use it as-is
    elif os.path.basename(dataset_folder) in ["clevrer", "star"]:
        alternative_video_path = os.path.join(cache_dir, "data0613", dataset_folder, doc["video"])
        if os.path.exists(alternative_video_path):
            video_path = alternative_video_path
        else:
            eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    elif "s3://" not in video_path:
        eval_logger.error(f"Video path: {video_path} does not exist, please check.")
    # frame_image_list = read_frame(video_path)
    if "start" in doc:
        start, end = doc['start'], doc['end']
        media_dict = {'start': start, 'end': end, 'video_read_type': 'img'}
    else:
        media_dict = {'video_read_type': 'img'}
    return [video_path, media_dict]
def mvbench_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    option_prompt = ""
    option_list = doc["candidates"]
    option_letters = string.ascii_uppercase
    for char_index, option in enumerate(option_list):
        option_letter = option_letters[char_index]
        option_prompt += f"({option_letter}) {option}\n"
    full_text = "Question: " + doc["question"] + "\nOption:\n" + option_prompt + lmms_eval_specific_kwargs["post_prompt"]
    return full_text
def mcq_acc(answer, pred):
    # Raw strings avoid invalid-escape warnings. Note that `(?!<=\d)` is a lookahead for the
    # literal text "<=" plus a digit, so this pattern matches any period not followed by a digit.
    periodStrip = re.compile(r"(?!<=\d)(\.)(?!\d)")
    commaStrip = re.compile(r"(\d)(\,)(\d)")
    punct = [";", r"/", "[", "]", '"', "{", "}", "(", ")", "=", "+", "\\", "_", "-", ">", "<", "@", "`", ",", "?", "!"]
    def processPunctuation(inText):
        outText = inText
        for p in punct:
            if (p + " " in inText or " " + p in inText) or (re.search(commaStrip, inText) is not None):
                outText = outText.replace(p, "")
            else:
                outText = outText.replace(p, " ")
        # The third positional argument of sub() is `count`, not `flags`, so re.UNICODE is dropped.
        outText = periodStrip.sub("", outText)
        return outText
    def process(answer):
        option_regex = re.compile(r"^([A-E])\.\s*(.+)$", re.IGNORECASE)
        match = option_regex.match(answer.strip())
        if match:
            # If matched, return the option letter in uppercase
            return match.group(1).upper()
        else:
            # If no match, normalize whitespace, punctuation, and case
            answer = answer.replace("\n", " ")
            answer = answer.replace("\t", " ")
            answer = answer.strip()
            answer = processPunctuation(answer)
            answer = answer.strip("'")
            answer = answer.strip('"')
            answer = answer.strip(")")
            answer = answer.strip("(")
            answer = answer.strip().lower()
            # Try to find any single letter (A-E) in the processed answer
            letter_match = re.search(r"\b([A-E])\b", answer, re.IGNORECASE)
            if letter_match:
                return letter_match.group(1).upper()
            return answer
    pred = process(pred)
    answer = process(answer)
    if pred == answer:
        score = 1
    else:
        score = 0
    return score
def mvbench_process_results(doc, results):
    """
    Args:
        doc: an instance of the eval dataset
        results: [pred]
    Returns:
        a dictionary with key: metric name (in this case mvbench_accuracy), value: metric value
    """
    pred = results[0]
    # Calculate the ground truth option letter
    option_letters = string.ascii_uppercase
    gt_option_letter = None
    for i, candidate in enumerate(doc["candidates"]):
        if candidate == doc["answer"]:
            gt_option_letter = option_letters[i]
            break
    if gt_option_letter is not None:
        # Calculate the score using mcq_acc function
        score = mcq_acc(gt_option_letter, pred)
    else:
        score = 0
    data_dict = {"pred_answer": pred, "gt_answer": gt_option_letter, "score": score}
    return {"mvbench_accuracy": data_dict}
def mvbench_aggregate_results(results):
    """
    Args:
        results: a list of values returned by process_results
    Returns:
        A score
    """
    total_answered = 0
    total_correct = 0
    for result in results:
        if result["pred_answer"] != "":
            total_answered += 1
            total_correct += result["score"]
    return 100 * total_correct / total_answered if total_answered > 0 else 0
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/perceptiontest/val/_default_template_yaml
================================================
dataset_path: eval_data_jsons/PerceptionTest_Val
dataset_kwargs:
  token: True
  video: True
  cache_dir: pssd:s3://perception/
lmms_eval_specific_kwargs:
  default:
    pre_prompt: ""
    post_prompt: ""
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/perceptiontest/val/perceptiontest_mc.yaml
================================================
dataset_name: "mc_question_val"
task: "perceptiontest_val_mc"
test_split: validation
output_type: generate_until
doc_to_visual: !function utils.perceptiontest_val_doc_to_visual
doc_to_text: !function utils.perceptiontest_val_doc_to_text
doc_to_target: !function utils.perceptiontest_val_doc_to_answer
process_results: !function utils.perceptiontest_val_process_results_mc
metric_list:
  - metric: accuracy
    aggregation: !function utils.perceptiontest_val_aggregate_accuracy
    higher_is_better: true
include: _default_template_yaml
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/perceptiontest/val/utils.py
================================================
from decord import VideoReader, cpu
import numpy as np
import os
import sys
import datetime
import lmms_eval.tasks._task_utils.file_utils as file_utils
import json
import yaml
from pathlib import Path
with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
    raw_data = f.readlines()
    safe_data = []
    for i, line in enumerate(raw_data):
        # drop lines carrying the !function tag, since yaml.safe_load cannot parse it
        if "!function" not in line:
            safe_data.append(line)
    config = yaml.safe_load("".join(safe_data))
# We will unzip all the zip files
# To HF HOME cache dir
# And load it here
# HF_HOME = os.environ["HF_HOME"]
cache_dir = config["dataset_kwargs"]["cache_dir"]
# cache_dir = os.path.join(HF_HOME, cache_dir)
cache_dir = os.path.join(cache_dir, "videos")
from loguru import logger as eval_logger
# Pass in video path here
# Can only work correctly with video llm
def perceptiontest_val_doc_to_visual(doc):
    video_path = doc["video_name"] + ".mp4"
    video_path = os.path.join(cache_dir, video_path)
    # if os.path.exists(video_path):
    #     video_path = video_path
    # elif os.path.exists(video_path.replace("mp4", "MP4")):
    #     video_path = video_path.replace("mp4", "MP4")
    # else:
    #     sys.exit(f"video path:{video_path} does not exist, please check")
    return [video_path]
# This is the place where you format your question
def perceptiontest_val_doc_to_text(doc, lmms_eval_specific_kwargs=None):
    if lmms_eval_specific_kwargs is None:
        lmms_eval_specific_kwargs = {}
    pre_prompt = ""
    post_prompt = ""
    if "pre_prompt" in lmms_eval_specific_kwargs:
        pre_prompt = lmms_eval_specific_kwargs["pre_prompt"]
    if "post_prompt" in lmms_eval_specific_kwargs:
        post_prompt = lmms_eval_specific_kwargs["post_prompt"]
    question = doc["question"]
    if "options" in doc:
        index = 0
        for op in doc["options"]:
            if index == 0:
                question += "\n" + "(A). " + op
            elif index == 1:
                question += "\n" + "(B). " + op
            else:
                question += "\n" + "(C). " + op
            index += 1
        post_prompt = "\nOnly give the best option."
    return f"{pre_prompt}\nQuestion: {question}{post_prompt}"
def perceptiontest_val_doc_to_answer(doc):
    return doc["answer_id"]
# Process result for mc_ppl
def perceptiontest_val_process_results_mc_ppl(doc, result):
    # Initialize minimum value and index
    min_value = float("inf")
    min_index = -1
    # Iterate through the results to find the index of the lowest value
    for i, (value, _) in enumerate(result):
        if value < min_value:
            min_value = value
            min_index = i
    # Return the result with the index of the lowest value
    return {
        "accuracy": {
            "video_name": doc["video_name"],
            "question": doc["question"],
            "question_id": doc["question_id"],
            "pred_id": min_index,
            "answer_id": doc["answer_id"],
            "area": doc["area"],
            "reasoning": doc["reasoning"],
            "tag": doc["tag"],
        }
    }
# Process result for generation
import re
def perceptiontest_val_process_results_mc(doc, result):
    pred = result[0].strip()  # raw text prediction
    # Use regex to match A, B, C, or D
    match = re.search(r"\b([A-D])\b", pred)
    if match:
        pred = match.group(1)  # Extract the matched letter
        pred = pred.upper()
    else:
        pred = ""  # Set to empty string if no match found
    # Map the prediction to an index
    pred_to_index = {"A": 0, "B": 1, "C": 2, "D": 3}
    index = pred_to_index.get(pred, -1)  # Default to -1 if the prediction is not found
    correct = 1 if index == int(doc["answer_id"]) else 0
    return {
        "accuracy": {
            "video_name": doc["video_name"],
            "question": doc["question"],
            "pred_id": index,
            "answer_id": doc["answer_id"],
            "correct": correct,
        }
    }
def perceptiontest_val_aggregate_accuracy(results, args):
yes_count = 0
# results is a list of dict
for answer_dict in results:
if answer_dict["correct"] == 1:
yes_count = yes_count + 1
accuracy = yes_count / len(results) if results else 0.0
return accuracy
def perceptiontest_val_doc_to_choice(doc):
return [op for op in doc["options"]]
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/temporal_grounding/_default_template.yaml
================================================
dataset_path: eval_data_jsons/Temporal_Grounding
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/Temporal_Grounding
video: True
generation_kwargs:
max_new_tokens: 50
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
output_type: generate_until
doc_to_visual: !function utils.temporal_grounding_doc_to_visual
doc_to_text: !function utils.temporal_grounding_doc_to_text
doc_to_target: !function utils.temporal_grounding_doc_to_answer
process_results: !function utils.temporal_grounding_process_results_generation
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/temporal_grounding/charades.yaml
================================================
include: _default_template.yaml
task: temporal_grounding_charades
dataset_name: charades
test_split: train
metric_list:
- metric: submission
aggregation: !function utils.temporal_grounding_aggregate_charades
higher_is_better: true
lmms_eval_specific_kwargs:
default:
sub_task: charades
pre_prompt: "Please find the visual event described by a sentence in the video, determining its starting and ending times. The format should be: 'The event happens in the start time - end time'. For example, The event 'person turn a light on' happens in the 24.3 - 30.4 seconds. Now I will give you the textual sentence: "
post_prompt: "Please return its start time and end time."
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/temporal_grounding/eval_tvg.py
================================================
import json
import argparse
import os
import re
from copy import deepcopy
import pdb
import numpy as np
from pathlib import Path
# read json files
def read_json(path):
with open(path, "r") as fin:
datas = json.load(fin)
return datas
def write_json(path, data):
with open(path, "w") as fout:
json.dump(data, fout)
print("The format file has been saved at:{}".format(path))
return
def extract_time(paragraph):
prompt = 'A specific example is : 20.8 - 30.0 seconds'.lower()
paragraph = paragraph.lower().replace(prompt, '').replace("to", '-')
# Split text into sentences based on common delimiters
sentences = re.split(r'[!?\n]', paragraph)
# Keywords that might indicate the presence of time information
keywords = ["starts", "ends", "happens in", "start time", "end time", "start", "end", "happen"]
# filter sentences by keywords
candidates = []
for sentence in sentences:
# If sentence contains one of the keywords
if any(keyword in sentence for keyword in keywords):
candidates.append(sentence)
timestamps = []
# Check for The given query happens in m - n (seconds)
patterns = [
r"(\d+\.*\d*)\s*-\s*(\d+\.*\d*)"
]
for time_pattern in patterns:
time_matches = re.findall(time_pattern, paragraph)
if time_matches:
timestamps = [[float(start), float(end)] for start, end in time_matches]
if len(sentences) == 0:
return []
# check for other formats e.g.:
# 1 .Starting time: 0.8 seconds
# Ending time: 1.1 seconds
# 2. The start time for this event is 0 seconds, and the end time is 12 seconds.
if len(timestamps) == 0:
times = []
time_regex = re.compile(r'\b(\d+\.\d+\b|\b\d+)\b') # time formats (e.g., 18, 18.5)
for sentence in candidates:
time = re.findall(time_regex, sentence)
if time:
time_in_sec = float(time[0])
times.append(time_in_sec)
times = times[:len(times)//2*2]
timestamps = [(times[i], times[i+1]) for i in range(0, len(times), 2)]
# Check for examples like:
# 3. The event 'person flipped the light switch near the door' starts at 00:00:18 and ends at 00:00:23.
if len(timestamps) == 0:
times = []
time_regex = re.compile(r'\b(\d{1,2}:\d{2}:\d{2})\b') # time format HH:MM:SS (e.g., 00:18:05); a single group so findall returns strings
for sentence in candidates:
time = re.findall(time_regex, sentence)
if time:
t = time[0]
else:
continue
# If time is in HH:MM:SS format, convert to seconds
if t.count(':') == 2:
h, m, s = map(int, t.split(':'))
time_in_sec = h * 3600 + m * 60 + s
elif t.count(':') == 1:
m, s = map(int, t.split(':'))
time_in_sec = m * 60 + s
times.append(time_in_sec)
times = times[:len(times)//2*2]
timestamps = [(times[i], times[i+1]) for i in range(0, len(times), 2)]
results = []
for (start, end) in timestamps:
if end > start:
results.append([start, end])
else:
results.append([end, start])
if len(results) > 1:
results = results[:1]
return results
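# Illustrative sketch (not part of the original script): the primary "m - n" path
# of extract_time, reduced to a hypothetical helper named first_span. It assumes
# the same normalisation as above (lower-casing and replacing "to" with "-") and
# keeps only the first match, ordered so that start <= end.

```python
import re

RANGE_PATTERN = re.compile(r"(\d+\.*\d*)\s*-\s*(\d+\.*\d*)")


def first_span(text):
    """Return [start, end] in seconds for the first 'm - n' range, or None."""
    normalized = text.lower().replace("to", "-")
    matches = RANGE_PATTERN.findall(normalized)
    if not matches:
        return None
    start, end = (float(value) for value in matches[0])
    return [min(start, end), max(start, end)]


print(first_span("The event happens in the 24.3 - 30.4 seconds."))
print(first_span("It runs from 12 to 5 seconds."))
print(first_span("I cannot tell."))
```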
def iou(A, B):
max0 = max((A[0]), (B[0]))
min0 = min((A[0]), (B[0]))
max1 = max((A[1]), (B[1]))
min1 = min((A[1]), (B[1]))
span = max1 - min0
return max(min1 - max0, 0) / span if span > 0 else 0.0
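# Illustrative sketch (not part of the original script): a worked example of the
# temporal IoU used for the 0.3 / 0.5 / 0.7 thresholds. The helper name
# temporal_iou and the zero-span guard are assumptions made for this example.

```python
def temporal_iou(a, b):
    """IoU of two [start, end] segments, using the enclosing span as denominator."""
    intersection = max(min(a[1], b[1]) - max(a[0], b[0]), 0)
    span = max(a[1], b[1]) - min(a[0], b[0])
    return intersection / span if span > 0 else 0.0


gt = [10.0, 20.0]
pred = [15.0, 25.0]
score = temporal_iou(gt, pred)  # intersection 5 s, enclosing span 15 s
for threshold in (0.3, 0.5, 0.7):
    print(threshold, score >= threshold)
```

# For disjoint segments the numerator is clamped to 0, so the score is 0 no matter
# how large the enclosing span is.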
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument('-f', default='your_result.json')
args = parser.parse_args()
datas = read_json(args.f)
num = len(datas)
Result = {0.3:0, 0.5:0, 0.7:0}
for c_iou in [0.3, 0.5, 0.7]:
for k in datas.keys():
vid, caption, gt = k.split(">>>")
pred = datas[k]
gt = eval(gt)
timestamps = extract_time(pred)
if len(timestamps) != 1:
print(f"pred={pred},timestamps={timestamps}")
timestamps = [[gt[1]+10, gt[1]+20]]
print(f"GT: {gt}, Pred: {timestamps[0]}")
if(iou(gt, timestamps[0]) >= c_iou):
Result[c_iou] = Result[c_iou] + 1
print("IOU 0.3: {0}\nIOU 0.5: {1}\nIOU 0.7: {2}".format(Result[0.3]*100/num, Result[0.5]*100/num, Result[0.7]*100/num))
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/temporal_grounding/utils.py
================================================
from decord import VideoReader, cpu
import numpy as np
import os
import sys
import datetime
import lmms_eval.tasks._task_utils.file_utils as file_utils
import json
import yaml
import random
from pathlib import Path
with open(Path(__file__).parent / "_default_template.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
config = yaml.safe_load("".join(safe_data))
from loguru import logger as eval_logger
DATA_LIST = {
"charades": 'your_data_dir/Charades/',
}
# Pass in video path here
# Can only work correctly with video llm
def temporal_grounding_doc_to_visual(doc, lmms_eval_specific_kwargs=None):
video_path = doc["video"]
data_root = DATA_LIST[lmms_eval_specific_kwargs["sub_task"]]
video_path = os.path.join(data_root, video_path)
if os.path.exists(video_path):
video_path = video_path
elif "s3://" not in video_path:
sys.exit(f"video path:{video_path} does not exist, please check")
return [video_path]
# This is the place where you format your question
def temporal_grounding_doc_to_text(doc, lmms_eval_specific_kwargs=None):
if lmms_eval_specific_kwargs is None:
lmms_eval_specific_kwargs = {}
# Fall back to empty prompts so the f-string below never hits an undefined name
pre_prompt = lmms_eval_specific_kwargs.get("pre_prompt", "")
post_prompt = lmms_eval_specific_kwargs.get("post_prompt", "")
question = doc["caption"]
return f"{pre_prompt}{question}. {post_prompt}"
def temporal_grounding_doc_to_answer(doc):
return doc["timestamp"]
# Process result for temporal grounding generation: key each raw prediction by "video>>>caption>>>timestamp"
def temporal_grounding_process_results_generation(doc, result):
pred = result[0]
return {"submission": {f'{doc["video"]}>>>{doc["caption"]}>>>{doc["timestamp"]}': pred}}
def temporal_grounding_aggregate_charades(results, args):
temporal_grounding_aggregate_submissions(results, args, "charades")
def temporal_grounding_aggregate_submissions(results, args, task):
now_date_time = datetime.datetime.now().strftime("%Y-%m-%d-%H-%M-%S")
submission_file_name = f"inference_results_temporal_grounding_{task}_{now_date_time}.json"
path = file_utils.generate_submission_file(submission_file_name, args)
# results is a list of single-entry dicts (one per sample);
# merge them into a single dict with one key-value pair per sample
combined_submission = {}
for submission_dict in results:
combined_submission.update(submission_dict)
with open(path, "w") as f:
json.dump(combined_submission, f, indent=4)
eval_logger.info(f"Submission file saved to {path}")
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/videomme/utils.py
================================================
from collections import defaultdict
import os
import datetime
import json
from lmms_eval.tasks._task_utils.file_utils import generate_submission_file
import io
from pathlib import Path
import yaml
import sys
from typing import List, Dict, Optional, Union
import re
import cv2
import numpy as np
from loguru import logger as eval_logger
VIDEO_TYPE = ["short", "medium", "long"]
CATEGORIES = ["Knowledge", "Film & Television", "Sports Competition", "Artistic Performance", "Life Record", "Multilingual"]
SUB_CATEGORIES = [
"Humanity & History",
"Literature & Art",
"Biology & Medicine",
"Finance & Commerce",
"Astronomy",
"Geography",
"Law",
"Life Tip",
"Technology",
"Animation",
"Movie & TV Show",
"Documentary",
"News Report",
"Esports",
"Basketball",
"Football",
"Athletics",
"Other Sports",
"Stage Play",
"Magic Show",
"Variety Show",
"Acrobatics",
"Handicraft",
"Food",
"Fashion",
"Daily Life",
"Travel",
"Pet & Animal",
"Exercise",
"Multilingual",
]
TASK_CATEGORIES = [
"Temporal Perception",
"Spatial Perception",
"Attribute Perception",
"Action Recognition",
"Object Recognition",
"OCR Problems",
"Counting Problem",
"Temporal Reasoning",
"Spatial Reasoning",
"Action Reasoning",
"Object Reasoning",
"Information Synopsis",
]
replace_prompt = " Please answer yes or no."
# with open(Path(__file__).parent / "_default_template_yaml", "r") as f:
# raw_data = f.readlines()
# safe_data = []
# for i, line in enumerate(raw_data):
# # remove function definition since yaml load cannot handle it
# if "!function" not in line:
# safe_data.append(line)
# config = yaml.safe_load("".join(safe_data))
hf_home = os.getenv("HF_HOME", "~/.cache/huggingface/")
# cache_dir = os.path.join(hf_home, cache_dir)
# base_cache_dir = config["dataset_kwargs"]["cache_dir"]
# base_cache_dir = os.path.expanduser(hf_home)
with open(Path(__file__).parent / "videomme.yaml", "r") as f:
raw_data = f.readlines()
safe_data = []
for i, line in enumerate(raw_data):
# remove function definition since yaml load cannot handle it
if "!function" not in line:
safe_data.append(line)
cache_name = yaml.safe_load("".join(safe_data))["dataset_kwargs"]["cache_dir"]
def parse_subtitle_time(time_str):
h, m, s_ms = time_str.split(":")
s, ms = s_ms.split(",")
return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000
def load_subtitles(subtitle_path):
subtitles = {}
with open(subtitle_path, "r", encoding="utf-8") as file:
content = file.read().split("\n\n")
for section in content:
if section.strip():
lines = section.split("\n")
if len(lines) >= 3:
time_range = lines[1].split(" --> ")
start_time = parse_subtitle_time(time_range[0])
end_time = parse_subtitle_time(time_range[1])
text = " ".join(line for line in lines[2:])
subtitles[(start_time, end_time)] = text
return subtitles
def convert_time_to_frame(time_in_seconds, fps):
return int(time_in_seconds * fps)
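# Illustrative sketch (not part of the original module): how one SRT timing line
# turns into seconds and then into frame indices. The helper name
# srt_time_to_seconds, the sample timing line, and the 30 fps value are
# assumptions for this example.

```python
def srt_time_to_seconds(time_str):
    """Convert an SRT timestamp such as '00:01:02,500' to seconds."""
    h, m, s_ms = time_str.split(":")
    s, ms = s_ms.split(",")
    return int(h) * 3600 + int(m) * 60 + int(s) + int(ms) / 1000


timing_line = "00:01:02,500 --> 00:01:05,000"
start_str, end_str = timing_line.split(" --> ")
start_sec = srt_time_to_seconds(start_str)
end_sec = srt_time_to_seconds(end_str)

fps = 30.0  # assumed frame rate
start_frame = int(start_sec * fps)
end_frame = int(end_sec * fps)
print(start_sec, end_sec, start_frame, end_frame)
```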
def extract_subtitles(video_path, subtitle_path):
video = cv2.VideoCapture(video_path)
fps = video.get(cv2.CAP_PROP_FPS)
total_frame = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
subtitles = load_subtitles(subtitle_path)
subtitle_frames = []
for (start_time, end_time), text in subtitles.items():
start_frame = convert_time_to_frame(start_time, fps)
end_frame = convert_time_to_frame(end_time, fps)
subtitle_frames.append((start_frame, end_frame, text))
return subtitle_frames, total_frame
def videomme_doc_to_visual(doc):
# cache_dir = os.path.join(base_cache_dir, cache_name)
cache_dir = cache_name
video_path = doc["videoID"] + ".mp4"
video_path = os.path.join(cache_dir, "videos", video_path)
if os.path.exists(video_path):
video_path = video_path
elif os.path.exists(video_path.replace("mp4", "MP4")):
video_path = video_path.replace("mp4", "MP4")
elif os.path.exists(video_path.replace("mp4", "mkv")):
video_path = video_path.replace("mp4", "mkv")
elif 's3://' not in video_path:
sys.exit(f"video path:{video_path} does not exist, please check")
return [video_path]
def videomme_doc_to_text(doc, lmms_eval_specific_kwargs=None):
"""
VLMEvalKit style
"""
option_prompt = "These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
question = doc["question"]
# option = str(doc["options"])
option = "\n".join(doc['options'])
question = "Question: " + question + "\nOption:\n" + option
post_prompt = lmms_eval_specific_kwargs["post_prompt"] if "post_prompt" in lmms_eval_specific_kwargs else "The best answer is:"
full_prompt = option_prompt + "\n" + question + post_prompt
return full_prompt
# Frames + Subs
# This video's subtitles are listed below:
# 【subtitles】
# Select the best answer to the following multiple-choice question based on the video and the subtitles. Respond with only the letter (A, B, C, or D) of the correct option.
# 【question】
# The best answer is:
# Frames / Frames + Audio
# Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.
# 【question】
# The best answer is:
def videomme_doc_to_text_subtitle(doc, lmms_eval_specific_kwargs=None):
# cache_dir = os.path.join(base_cache_dir, cache_name)
cache_dir = cache_name
video_path = doc["videoID"] + ".mp4"
subtitle_path = os.path.join(cache_dir, "subtitle", doc["videoID"] + ".srt")
video_path = os.path.join(cache_dir, "videos", video_path)
subtitle = ""
subtitles_prompt = ""
if os.path.exists(subtitle_path): # Denote have subtitle
subtitle = open(subtitle_path).readlines()
subtitles_prompt = "This video's subtitles are listed below: \n"
else:
print(f"Can't find subtitle for: {subtitle_path}")
if subtitle != "":
if "gemini_api_flag" in lmms_eval_specific_kwargs: # specific for gemini_api
if lmms_eval_specific_kwargs["gemini_api_flag"] == "full subtitle":
textlist = []
for ele in subtitle:
pattern = r'<font color="white" size=".72c">(.*?)</font>'
matches = re.findall(pattern, ele)
if matches:
textlist.append(matches[0])
subtitle_text = "\n".join(textlist)
else:
raise NotImplementedError("Just use the full subtitles.")
if "frame_num" in lmms_eval_specific_kwargs:
frame_num = lmms_eval_specific_kwargs["frame_num"]
subtitle_by_frame, total_frame = extract_subtitles(video_path, subtitle_path)
uniform_sampled_frames = np.linspace(0, total_frame - 1, frame_num, dtype=int).tolist()
subtitle_by_frame_idx = []
for frame_idx in uniform_sampled_frames:
for idx, title in enumerate(subtitle_by_frame):
if frame_idx < title[1] and frame_idx >= title[0]:
subtitle_by_frame_idx.append(idx)
subtitle_by_frame_idx = list(set(subtitle_by_frame_idx))
textlist = []
for idx in subtitle_by_frame_idx:
pattern = r'<font color="white" size=".72c">(.*?)</font>'
raw_text = re.findall(pattern, subtitle_by_frame[idx][2])
try:
textlist.append(raw_text[0])
except:
continue
subtitle_text = "\n".join(textlist)
subtitle = subtitle_text
if subtitle != "":
option_prompt = "These are the frames and subtitles of a video. Select the best answer to the following multiple-choice question based on the video and the subtitles. Respond with only the letter (A, B, C, or D) of the correct option."
else:
option_prompt = "These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option."
question = doc["question"]
# option = str(doc["options"])
option = "\n".join(doc['options'])
question = "Question: " + question + "\nOption:\n" + option
full_prompt = subtitles_prompt + subtitle + "\n" + option_prompt + "\n" + question + "\n" + "The best answer is:"
return full_prompt
def extract_characters_regex(s):
s = s.strip()
answer_prefixes = [
"The best answer is",
"The correct answer is",
"The answer is",
"The answer",
"The best option is",
"The correct option is",
"Best answer:",
"Best option:",
]
for answer_prefix in answer_prefixes:
s = s.replace(answer_prefix, "")
if len(s.split()) > 10 and not re.search("[ABCD]", s):
return ""
matches = re.search(r"[ABCD]", s)
if matches is None:
return ""
return matches[0]
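# Illustrative sketch (not part of the original module): prefix lists such as
# answer_prefixes are sensitive to missing commas, because Python concatenates
# adjacent string literals into one string. The helper extract_choice and the two
# example lists are hypothetical and only mirror the prefix-stripping logic above.

```python
import re


def extract_choice(s, prefixes):
    """Strip known answer prefixes, then return the first A-D letter (or '')."""
    s = s.strip()
    for prefix in prefixes:
        s = s.replace(prefix, "")
    if len(s.split()) > 10 and not re.search("[ABCD]", s):
        return ""
    match = re.search(r"[ABCD]", s)
    return match[0] if match else ""


# Two entries only: each pair of adjacent literals is merged into one string.
MERGED = ["The best option is" "The correct option is", "Best answer:" "Best option:"]
# Four entries, as intended.
SEPARATE = ["The best option is", "The correct option is", "Best answer:", "Best option:"]

print(len(MERGED), len(SEPARATE))
print(extract_choice("Best option: C", MERGED))    # prefix not stripped, "B" of "Best" wins
print(extract_choice("Best option: C", SEPARATE))  # prefix stripped first
```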
matrices = []
for i in VIDEO_TYPE:
for j in CATEGORIES:
for k in SUB_CATEGORIES:
for l in TASK_CATEGORIES:
matrices.append(f"{i}_{j}_{k}_{l}")
def videomme_process_results(doc, results):
"""
Args:
doc: an instance of the eval dataset
results: [pred]
Returns:
a dictionary with key: metric name (in this case videomme score), value: metric value
"""
pred = results[0]
pred_ans = extract_characters_regex(pred)
# gt_ans = doc["answer"].lower().strip().replace(".", "")
category = doc["domain"]
sub_category = doc["sub_category"]
task_category = doc["task_type"]
data_dict = {"question_id": doc["question_id"], "duration": doc["duration"], "category": category, "sub_category": sub_category, "task_category": task_category, "pred_answer": pred_ans, "answer": doc["answer"]}
# return {f"videomme_percetion_score": data_dict for metric in matrices}
return {f"videomme_percetion_score": data_dict}
def videomme_aggregate_results(results):
"""
Args:
results: a list of values returned by process_results
Returns:
A score
"""
category2score = {}
for video_type in VIDEO_TYPE:
for category in CATEGORIES:
for sub_category in SUB_CATEGORIES:
for task_category in TASK_CATEGORIES:
key = f"{video_type}_{category}_{sub_category}_{task_category}"
category2score[key] = {"correct": 0, "answered": 0}
for result in results:
video_type = result["duration"]
category = result["category"]
sub_category = result["sub_category"]
task_category = result["task_category"]
key = f"{video_type}_{category}_{sub_category}_{task_category}"
category2score[key]["answered"] += 1
category2score[key]["correct"] += result["pred_answer"] == result["answer"]
for video_type in VIDEO_TYPE:
total_correct = 0
total_answered = 0
for k, v in category2score.items():
if video_type in k:
total_correct += v["correct"]
total_answered += v["answered"]
eval_logger.info(f"Evaluation on video Type: {video_type}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
for category in CATEGORIES:
total_correct = 0
total_answered = 0
for k, v in category2score.items():
if category in k:
total_correct += v["correct"]
total_answered += v["answered"]
eval_logger.info(f"Evaluation on Categories: {category}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
for sub_cate in SUB_CATEGORIES:
total_correct = 0
total_answered = 0
for k, v in category2score.items():
if sub_cate in k:
total_correct += v["correct"]
total_answered += v["answered"]
eval_logger.info(f"Evaluation on Video Sub Categories: {sub_cate}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
for task_cate in TASK_CATEGORIES:
total_correct = 0
total_answered = 0
for k, v in category2score.items():
if task_cate in k:
total_correct += v["correct"]
total_answered += v["answered"]
eval_logger.info(f"Evaluation on Task Categories: {task_cate}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
print((f"Evaluation on Task Categories: {task_cate}: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%"))
total_correct = 0
total_answered = 0
for k, v in category2score.items():
total_correct += v["correct"]
total_answered += v["answered"]
eval_logger.info(f"Overall Performance: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
print(f"Overall Performance: {100 * total_correct / total_answered if total_answered > 0 else 0 : .1f}%")
return 100 * total_correct / total_answered if total_answered > 0 else 0
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/videomme/videomme.yaml
================================================
dataset_path: eval_data_jsons/Video-MME
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/VideoMME_0629
video: True
# From_YouTube: True
task: videomme
test_split: test
output_type: generate_until
doc_to_visual: !function utils.videomme_doc_to_visual
doc_to_text: !function utils.videomme_doc_to_text
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
# The return value of process_results will be used by metrics
process_results: !function utils.videomme_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: videomme_percetion_score
aggregation: !function utils.videomme_aggregate_results
higher_is_better: true
lmms_eval_specific_kwargs:
default:
pre_prompt: ""
post_prompt: "\nAnswer with the option's letter from the given choices directly."
metadata:
- version: 0.0
================================================
FILE: lmms-eval_videochat/lmms_eval/tasks/videomme/videomme_w_subtitle.yaml
================================================
dataset_path: eval_data_jsons/Video-MME
dataset_kwargs:
token: True
cache_dir: your_eval_data_dir/VideoMME_0629
video: True
# From_YouTube: True
task: videomme_w_subtitle
test_split: test
output_type: generate_until
doc_to_visual: !function utils.videomme_doc_to_visual
doc_to_text: !function utils.videomme_doc_to_text_subtitle
doc_to_target: "answer"
generation_kwargs:
max_new_tokens: 16
temperature: 0
top_p: 1.0
num_beams: 1
do_sample: false
# The return value of process_results will be used by metrics
process_results: !function utils.videomme_process_results
# Note that the metric name can be either a registered metric function (such as the case for GQA) or a key name returned by process_results
metric_list:
- metric: videomme_percetion_score
aggregation: !function utils.videomme_aggregate_results
higher_is_better: true
lmms_eval_specific_kwargs:
default:
gemini_api_flag: "full subtitle"
gemini_api:
gemini_api_flag: "full subtitle"
metadata:
- version: 0.0
================================================
FILE: lmms-eval_videochat/lmms_eval/utils.py
================================================
import os
import re
import sys
import yaml
import json
import inspect
import pathlib
import functools
import subprocess
import collections
import importlib.util
import fnmatch
import datetime
from typing import (
Any,
Callable,
Iterable,
Iterator,
List,
Literal,
Optional,
Tuple,
Type,
Union,
)
import warnings
warnings.simplefilter("ignore", category=DeprecationWarning)
warnings.filterwarnings("ignore")
import gc
import torch
import transformers
from jinja2 import BaseLoader, Environment, StrictUndefined
from itertools import islice
import pytz
from loguru import logger as eval_logger
SPACING = " " * 47
def is_json(string):
try:
json.loads(string)
return True
except json.JSONDecodeError:
return False
def escaped_split(text, sep_char, maxsplit=-1):
"""Split text into a list on occurrences of the given separation
character `sep_char`. The separation character may be escaped by a
backslash to avoid splitting at that location.
The separation character must be a string of size 1.
If `maxsplit` is given, at most `maxsplit` splits are done (thus,
the list will have at most `maxsplit + 1` elements). If `maxsplit`
is not specified or less than 0, then there is no limit on the
number of splits (all possible splits are made).
"""
assert len(sep_char) == 1, "separation string must be a single character for escaped splitting"
if maxsplit == 0:
return text
maxsplit = max(0, maxsplit)
return re.split(r"(?<!\\)" + sep_char, text, maxsplit)
# ...
class MultiChoice:
def __init__(self, choices) -> None:
self.choices = choices
# Simple wildcard support (linux filename patterns)
def __contains__(self, values) -> bool:
for value in values.split(","):
if len(fnmatch.filter(self.choices, value)) == 0:
eval_logger.info(f"Available tasks to choose:")
for choice in self.choices:
eval_logger.info(f" - {choice}")
raise ValueError("'{}' is not in task list".format(value))
return True
def __iter__(self) -> Iterator:
for choice in self.choices:
yield choice
# Returns a list containing all values of the source_list that
# match at least one of the patterns
def pattern_match(patterns, source_list):
if isinstance(patterns, str):
patterns = [patterns]
task_names = set()
for pattern in patterns:
for matching in fnmatch.filter(source_list, pattern):
task_names.add(matching)
return sorted(list(task_names))
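# Illustrative sketch (not part of the original module): shell-style wildcard
# selection of task names with fnmatch, in the spirit of pattern_match. The helper
# name match_tasks is hypothetical; the task names are taken from the YAML files
# in this repository.

```python
import fnmatch


def match_tasks(patterns, available):
    """Return the sorted names in `available` that match any wildcard pattern."""
    if isinstance(patterns, str):
        patterns = [patterns]
    matched = set()
    for pattern in patterns:
        matched.update(fnmatch.filter(available, pattern))
    return sorted(matched)


tasks = ["videomme", "videomme_w_subtitle", "temporal_grounding_charades"]
print(match_tasks("videomme*", tasks))
print(match_tasks(["*charades", "videomme"], tasks))
print(match_tasks("mvbench", tasks))
```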
def general_detokenize(string):
string = string.replace(" n't", "n't")
string = string.replace(" )", ")")
string = string.replace("( ", "(")
string = string.replace('" ', '"')
string = string.replace(' "', '"')
string = re.sub(r" (['.,])", r"\1", string)
return string
def get_rolling_token_windows(token_list, prefix_token, max_seq_len, context_len):
"""
- context_len allows for a rolling window context, allowing each prediction window to potentially
condition on some context
:param token_list: list
List of tokens to be PREDICTED
:param max_seq_len: int
max_seq_len of model (or max_seq_len we want to use)
:param context_len: int
Amount of desired token context for prediction. Needs to be at least 1.
:param prefix_token: token
Dummy token like <eos> so the first token has something to condition on
:return: generator
Generator of tuples
(input_tokens, pred_tokens)
Note: Score only the last len(pred_tokens) logits of the LMM
"""
assert 1 <= context_len <= max_seq_len
if not token_list:
return
# +1 offset, going from input->preds
pred_len = max_seq_len - context_len + 1
predicted = 0
# Special handling for first window: predict all tokens
first_seq_len = min(max_seq_len, len(token_list))
yield ([prefix_token] + token_list[: first_seq_len - 1], token_list[:first_seq_len])
predicted += first_seq_len
while predicted < len(token_list):
window_pred_len = min(len(token_list) - predicted, pred_len)
window_end = predicted + window_pred_len
yield (
token_list[window_end - max_seq_len - 1 : window_end - 1],
token_list[window_end - window_pred_len : window_end],
)
predicted += window_pred_len
def make_disjoint_window(pair):
"""Takes output from get_rolling_token_windows and makes the context not overlap with the continuation"""
a, b = pair
return a[: len(a) - (len(b) - 1)], b
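# Illustrative sketch (not part of the original module): a small worked run of the
# two helpers above. Both functions are copied into the block so it runs on its
# own; the token values, prefix token -100, and window sizes are arbitrary choices
# for the example.

```python
def get_rolling_token_windows(token_list, prefix_token, max_seq_len, context_len):
    assert 1 <= context_len <= max_seq_len
    if not token_list:
        return
    pred_len = max_seq_len - context_len + 1
    predicted = 0
    # First window: every token is predicted, conditioned on the prefix token.
    first_seq_len = min(max_seq_len, len(token_list))
    yield ([prefix_token] + token_list[: first_seq_len - 1], token_list[:first_seq_len])
    predicted += first_seq_len
    while predicted < len(token_list):
        window_pred_len = min(len(token_list) - predicted, pred_len)
        window_end = predicted + window_pred_len
        yield (
            token_list[window_end - max_seq_len - 1 : window_end - 1],
            token_list[window_end - window_pred_len : window_end],
        )
        predicted += window_pred_len


def make_disjoint_window(pair):
    # Trim the input so that it no longer overlaps with the tokens to predict.
    a, b = pair
    return a[: len(a) - (len(b) - 1)], b


tokens = list(range(10))
windows = list(get_rolling_token_windows(tokens, prefix_token=-100, max_seq_len=4, context_len=1))
disjoint = [make_disjoint_window(w) for w in windows]
for inputs, preds in windows:
    print(inputs, "->", preds)
```

# Every token appears exactly once across the prediction halves of the windows, so
# per-window log-likelihoods can be summed without double counting.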
class Reorderer:
def __init__(self, arr: List[Any], fn: Callable) -> None:
"""Reorder an array according to some function
Args:
arr (List[Any]): The initial array
fn (Callable[[Any], Any]): A function to determine the priority of elements
"""
self.size = len(arr)
arr = list(enumerate(arr))
arr = group(arr, lambda x: fn(x[1]))
# arr = [([y[0] for y in x], x[0][1]) for x in arr]
# TODO: overhaul reorderer. It currently groups requests by content, but we don't want this
arr = [([y[0]], x[0][1]) for x in arr for y in x]
arr.sort(key=lambda x: fn(x[1]))
self.arr = arr
def get_reordered(self):
"""Gets the reordered array
Returns:
List[Any]: The reordered array
"""
return [x[1] for x in self.arr]
def get_original(self, newarr):
"""Restores the original order of a new array based on the old array's order
Args:
newarr (List[Any]): The array to be restored
Returns:
List[Any]: The array restored to the original order
"""
res = [None] * self.size
cov = [False] * self.size
for (inds, _), v in zip(self.arr, newarr):
for ind in inds:
res[ind] = v
cov[ind] = True
assert all(cov)
return res
class Grouper:
"""
takes an array `arr` and function `fn` and returns a dictionary
with keys fn(ob) for each ob in `arr` and with values `self.arr[key]` a list of all
objects in `arr` satisfying `key == fn(ob)`.
"""
def __init__(self, arr, fn) -> None:
# self.orig_arr = arr
self.size = len(arr)
arr = list(enumerate(arr))
def group_return_dict(arr, fn):
res = collections.defaultdict(list)
for ob in arr:
res[fn(ob)].append(ob)
return res
arr = group_return_dict(arr, lambda x: fn(x[1]))
# self.arr has format Dict[Tuple[int, <entry from orig. arr>]]
self.arr = arr
self._grouped = None
def get_grouped(self):
# return the contents but not indices for our grouped dict.
if self._grouped:
return self._grouped
grouped = {}
for key in self.arr.keys():
# drop the index from each element of self.arr
grouped[key] = [y[1] for y in self.arr[key]]
self._grouped = grouped
return grouped
def get_original(self, grouped_dict):
# take in a grouped dictionary with e.g. results for each key listed
# in the same order as the instances in `self.arr`, and
# return the results in the same (single list) order as `self.orig_arr`.
res = [None] * self.size
cov = [False] * self.size
# orig = [None] * self.size
assert grouped_dict.keys() == self.arr.keys()
for key in grouped_dict.keys():
for (ind, _), v in zip(self.arr[key], grouped_dict[key]):
res[ind] = v
cov[ind] = True
# orig[ind] = _
assert all(cov)
# assert orig == self.orig_arr
return res
def make_table(result_dict, column: str = "results"):
"""Generate table of results."""
from pytablewriter import MarkdownTableWriter, LatexTableWriter
if column == "results":
column_name = "Tasks"
elif column == "groups":
column_name = "Groups"
all_headers = [
column_name,
"Version",
"Filter",
"n-shot",
"Metric",
"Value",
"",
"Stderr",
]
md_writer = MarkdownTableWriter()
latex_writer = LatexTableWriter()
md_writer.headers = all_headers
latex_writer.headers = all_headers
# Set column alignments for LaTeX
latex_writer.column_alignments = ["center"] * len(all_headers)
# Set padding for LaTeX columns (this will add space between columns)
latex_writer.column_format = " ".join(["|c"] * len(all_headers)) + "|"
values = []
for k, dic in result_dict[column].items():
version = result_dict["versions"][k]
n = str(result_dict["n-shot"][k])
if "alias" in dic:
k = dic.pop("alias")
for (mf), v in dic.items():
m, _, f = mf.partition(",")
if m.endswith("_stderr"):
continue
points = "N/A"
if v is not None:
if isinstance(v, str):
points = v
else:
# if 0 <= v <= 1:
# # v *= 100
points = "%.4f" % v
if m + "_stderr" + "," + f in dic:
if v is None:
se = "N/A"
else:
se = dic[m + "_stderr" + "," + f]
if se != "N/A":
se = "%.4f" % se
values.append([k, version, f, n, m, points, "±", se])
else:
values.append([k, version, f, n, m, points, "", ""])
k = ""
version = ""
md_writer.value_matrix = values
latex_writer.value_matrix = values
# Print LaTeX table to see how it looks
# print(latex_writer.dumps())
# Return Markdown table (note: column width and text alignment may not be supported)
return md_writer.dumps()
def positional_deprecated(fn):
"""
A decorator to nudge users into passing only keyword args (`kwargs`) to the
wrapped function, `fn`.
"""
@functools.wraps(fn)
def _wrapper(*args, **kwargs):
if len(args) != (1 if inspect.ismethod(fn) else 0):
print(f"WARNING: using {fn.__name__} with positional arguments is " "deprecated and will be disallowed in a future version of " "lmms-evaluation-harness!")
return fn(*args, **kwargs)
return _wrapper
@positional_deprecated
def find_test_root(start_path: pathlib.Path) -> pathlib.Path:
"""
Search upward in the directory tree to a maximum of three layers
to find and return the package root (containing the 'tests' folder)
"""
cur_path = start_path.resolve()
max_layers = 3
for _ in range(max_layers):
if (cur_path / "tests" / "test_version_stable.py").exists():
return cur_path
else:
cur_path = cur_path.parent.resolve()
raise FileNotFoundError(f"Unable to find package root within {max_layers} layers upwards of {start_path}")
@positional_deprecated
def run_task_tests(task_list: List[str]):
"""
Find the package root and run the tests for the given tasks
"""
import pytest
package_root = find_test_root(start_path=pathlib.Path(__file__))
task_string = " or ".join(task_list)
args = [
f"{package_root}/tests/test_version_stable.py",
f"--rootdir={package_root}",
"-k",
f"{task_string}",
]
sys.path.append(str(package_root))
pytest_return_val = pytest.main(args)
if pytest_return_val:
raise ValueError(f"Not all tests for the specified tasks ({task_list}) ran successfully! Error code: {pytest_return_val}")
def get_git_commit_hash():
"""
Gets the git commit hash of your current repo (if it exists).
Source: https://github.com/EleutherAI/gpt-neox/blob/b608043be541602170bfcfb8ec9bf85e8a0799e0/megatron/neox_arguments/neox_args.py#L42
"""
try:
git_hash = subprocess.check_output(["git", "describe", "--always"]).strip()
git_hash = git_hash.decode()
except (subprocess.CalledProcessError, FileNotFoundError):
# FileNotFoundError occurs when git not installed on system
git_hash = None
return git_hash
def get_datetime_str(timezone="Asia/Singapore"):
"""
Gets the current datetime in the given timezone (default: Asia/Singapore, UTC+8) as an `MMDD_HHMM` string.
"""
# Default: UTC+8 timezone
tz = pytz.timezone(timezone)
utc_now = datetime.datetime.now(datetime.timezone.utc)
local_time = utc_now.astimezone(tz)
return local_time.strftime("%m%d_%H%M")
def import_function(loader, node):
function_name = loader.construct_scalar(node)
yaml_path = os.path.dirname(loader.name)
*module_name, function_name = function_name.split(".")
if type(module_name) == list:
module_name = ".".join(module_name)
module_path = os.path.normpath(os.path.join(yaml_path, "{}.py".format(module_name)))
spec = importlib.util.spec_from_file_location(module_name, module_path)
module = importlib.util.module_from_spec(spec)
spec.loader.exec_module(module)
function = getattr(module, function_name)
return function
# Add the import_function constructor to the YAML loader
yaml.add_constructor("!function", import_function)
def load_yaml_config(yaml_path=None, yaml_config=None, yaml_dir=None):
if yaml_config is None:
with open(yaml_path, "rb") as file:
yaml_config = yaml.full_load(file)
if yaml_dir is None:
yaml_dir = os.path.dirname(yaml_path)
assert yaml_dir is not None
assert yaml_config is not None, f"Failed to load yaml config from {yaml_path}"
if "include" in yaml_config:
include_path = yaml_config["include"]
del yaml_config["include"]
if type(include_path) == str:
include_path = [include_path]
# Load from the last one first
include_path.reverse()
final_yaml_config = {}
for path in include_path:
# Assumes that path is a full path.
# If not found, assume the included yaml
# is in the same dir as the original yaml
if not os.path.isfile(path):
path = os.path.join(yaml_dir, path)
try:
included_yaml_config = load_yaml_config(path)
final_yaml_config.update(included_yaml_config)
except Exception as ex:
# Re-raise so that a broken include fails loudly instead of being silently ignored
raise ex
final_yaml_config.update(yaml_config)
return final_yaml_config
return yaml_config
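# A minimal sketch of the merge order used for `include` entries, assuming the
# included files have already been parsed into plain dicts. `merge_with_includes`
# is a hypothetical helper, not part of this module.

```python
from typing import Dict, List


def merge_with_includes(config: Dict, includes: List[Dict]) -> Dict:
    """Merge include dicts last-to-first, then apply the including config on top."""
    merged: Dict = {}
    for included in reversed(includes):  # the last listed include is applied first
        merged.update(included)
    merged.update(config)  # keys in the including config always win
    return merged


base_a = {"metric": "acc", "repeats": 1}
base_b = {"metric": "f1", "batch_size": 4}
task = {"task": "demo", "repeats": 3}

merged = merge_with_includes(task, [base_a, base_b])
print(merged)
```

# In this sketch the first listed include (`base_a`) overrides later ones, and the
# including config overrides every include. The merge is shallow: nested dicts
# are replaced wholesale rather than merged key by key.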
def regex_replace(string, pattern, repl, count: int = 0):
"""Implements the `re.sub` function as a custom Jinja filter."""
return re.sub(pattern, repl, string, count=count)
env = Environment(loader=BaseLoader, undefined=StrictUndefined)
env.filters["regex_replace"] = regex_replace
def apply_template(template: str, doc: dict) -> str:
rtemplate = env.from_string(template)
return rtemplate.render(**doc)
def create_iterator(raw_iterator, rank, world_size, limit=None):
"""
Method for creating a (potentially) sliced and limited
iterator from a raw document iterator. Used for splitting data
among ranks in multigpu setting or only pulling a sample of documents
"""
return islice(raw_iterator, rank, limit, world_size)
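# A small sketch of the slicing that `create_iterator` performs, using a list of
# integers as a stand-in for documents. The variable names are illustrative only.

```python
from itertools import islice

docs = list(range(10))
world_size = 3

# Each rank starts at its own offset and strides by world_size.
shards = [list(islice(docs, rank, None, world_size)) for rank in range(world_size)]
print(shards)

# `limit` is a stop index on the raw iterator, not a per-rank document count.
limited = list(islice(docs, 1, 6, world_size))
print(limited)
```

# Because `limit` bounds the raw index, each rank receives roughly
# `limit / world_size` documents rather than `limit` documents.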
def pad_and_concat(
max_length: int,
tensors: List[torch.Tensor],
padding_side: Literal["right", "left"] = "right",
):
"""
Method for padding a list of tensors given the maximum tensor
length in the batch. Used for batching inputs and continuations in
seq2seq models.
"""
assert padding_side == "left" or padding_side == "right", f"Unrecognized padding type: '{padding_side}' not 'left' or 'right'"
for i, tensor in enumerate(tensors):
if len(tensor.shape) == 2:
tensor = tensor.squeeze(0) # squeeze, in case passed [1, seq] size
tensor_len = tensor.shape[0]
if tensor_len < max_length:
if padding_side == "right":
# right-pad
tensors[i] = torch.cat(
[
tensor, # [seq]
torch.zeros(
max_length - tensor_len,
dtype=torch.long,
device=tensor.device,
), # [padding_length - seq]
],
dim=0,
).unsqueeze(0)
else:
# left-pad
tensors[i] = torch.cat(
[
torch.zeros(
max_length - tensor_len,
dtype=torch.long,
device=tensor.device,
), # [padding_length - seq]
tensor, # [seq]
],
dim=0,
).unsqueeze(0)
else:
tensors[i] = tensor.unsqueeze(0)
return torch.cat(tensors, dim=0)
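# A dependency-free sketch of the same padding idea, assuming plain lists of
# token ids instead of tensors. `pad_rows` and `pad_value` are hypothetical names
# used only for illustration.

```python
from typing import List


def pad_rows(max_length: int, rows: List[List[int]], padding_side: str = "right", pad_value: int = 0) -> List[List[int]]:
    """Pad every row to max_length on the chosen side."""
    if padding_side not in ("left", "right"):
        raise ValueError(f"Unrecognized padding type: '{padding_side}' not 'left' or 'right'")
    padded = []
    for row in rows:
        # A non-positive difference yields an empty padding list.
        padding = [pad_value] * (max_length - len(row))
        padded.append(row + padding if padding_side == "right" else padding + row)
    return padded


batch = [[1, 2], [3, 4, 5, 6]]
print(pad_rows(4, batch, padding_side="right"))
print(pad_rows(4, batch, padding_side="left"))
```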
def clear_torch_cache() -> None:
gc.collect()
torch.cuda.empty_cache()
def get_dtype(dtype: Union[str, torch.dtype]) -> torch.dtype:
"""Converts `dtype` from `str` to torch.dtype when possible. Does not use an instantiated HF AutoConfig"""
if isinstance(dtype, str) and dtype != "auto":
# Convert `str` args torch dtype: `float16` -> `torch.float16`
_torch_dtype = getattr(torch, dtype)
else:
_torch_dtype = dtype
return _torch_dtype
# Multi-token stopping criteria
class MultiTokenEOSCriteria(transformers.StoppingCriteria):
"""Criteria to stop on the specified multi-token sequence."""
def __init__(
self,
sequence: str,
tokenizer: transformers.PreTrainedTokenizer,
initial_decoder_input_length: int,
batch_size: int,
) -> None:
self.initial_decoder_input_length = initial_decoder_input_length
self.done_tracker = [False] * batch_size
self.sequence = sequence
self.sequence_ids = tokenizer.encode(sequence, add_special_tokens=False)
# we look back for 2 more tokens than it takes to encode our stop sequence
# because tokenizers suck, and a model might generate `['\n', '\n']` but our `sequence` is `['\n\n']`
# and we don't want to mistakenly not stop a generation because our
# (string) stop sequence was output in a different tokenization
# NOTE: there is a minor danger that this will end up looking back 2 tokens into the past, into the inputs to the model,
# and stopping generation immediately as a result. With only 2 extra tokens of lookback, this risk is minimized
self.sequence_id_len = len(self.sequence_ids) + 2
self.tokenizer = tokenizer
def __call__(self, input_ids, scores, **kwargs) -> bool:
# For efficiency, decode only the last `sequence_id_len` generated tokens (the stop sequence's token count plus 2 tokens of lookback)
lookback_ids_batch = input_ids[:, self.initial_decoder_input_length :][:, -self.sequence_id_len :]
lookback_tokens_batch = self.tokenizer.batch_decode(lookback_ids_batch)
for i, done in enumerate(self.done_tracker):
if not done:
self.done_tracker[i] = self.sequence in lookback_tokens_batch[i]
return False not in self.done_tracker
def stop_sequences_criteria(
tokenizer: transformers.PreTrainedTokenizer,
stop_sequences: List[str],
initial_decoder_input_length: int,
batch_size: int,
) -> transformers.StoppingCriteriaList:
return transformers.StoppingCriteriaList(
[
*[MultiTokenEOSCriteria(sequence, tokenizer, initial_decoder_input_length, batch_size) for sequence in stop_sequences],
]
)
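# A toy sketch of why the stopping criterion decodes a slightly longer window
# than the stop sequence itself. It assumes tokens are plain strings that join
# back into text; `hits_stop` is a hypothetical helper, not a transformers API.

```python
from typing import List


def hits_stop(generated: List[str], stop: str, stop_token_count: int, extra_lookback: int = 2) -> bool:
    """Decode the last (stop_token_count + extra_lookback) tokens and search for the stop string."""
    window = generated[-(stop_token_count + extra_lookback):]
    return stop in "".join(window)


# Assume "\n\n" encodes to a single token, but the model emitted two "\n" tokens.
tokens = ["Answer", ":", " B", "\n", "\n"]

with_lookback = hits_stop(tokens, "\n\n", stop_token_count=1)
without_lookback = hits_stop(tokens, "\n\n", stop_token_count=1, extra_lookback=0)
print(with_lookback, without_lookback)
```

# Matching on the decoded string, rather than on token ids, is what lets the
# check succeed when the same text is produced under a different tokenization.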
# from more_itertools
def divide(iterable, n) -> List[Iterator]:
"""Divide the elements from *iterable* into *n* parts, maintaining
order.
>>> group_1, group_2 = divide([1, 2, 3, 4, 5, 6], 2)
>>> list(group_1)
[1, 2, 3]
>>> list(group_2)
[4, 5, 6]
If the length of *iterable* is not evenly divisible by *n*, then the
length of the returned iterables will not be identical:
>>> children = divide([1, 2, 3, 4, 5, 6, 7], 3)
>>> [list(c) for c in children]
[[1, 2, 3], [4, 5], [6, 7]]
If the length of the iterable is smaller than n, then the last returned
iterables will be empty:
>>> children = divide([1, 2, 3], 5)
>>> [list(c) for c in children]
[[1], [2], [3], [], []]
This function will exhaust the iterable before returning and may require
significant storage. If order is not important, see :func:`distribute`,
which does not first pull the iterable into memory.
"""
if n < 1:
raise ValueError("n must be at least 1")
try:
iterable[:0]
except TypeError:
seq = tuple(iterable)
else:
seq = iterable
q, r = divmod(len(seq), n)
ret = []
stop = 0
for i in range(1, n + 1):
start = stop
stop += q + 1 if i <= r else q
ret.append(iter(seq[start:stop]))
return ret
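# A compact sketch of the same splitting arithmetic, using the `(iterable, n)`
# argument order of the signature above and returning lists instead of
# iterators. `split_evenly` is a hypothetical name.

```python
from typing import Iterable, List


def split_evenly(iterable: Iterable, n: int) -> List[list]:
    """Split into n contiguous parts whose lengths differ by at most one."""
    if n < 1:
        raise ValueError("n must be at least 1")
    seq = list(iterable)
    q, r = divmod(len(seq), n)
    parts, stop = [], 0
    for i in range(n):
        start = stop
        stop += q + 1 if i < r else q  # the first r parts get one extra element
        parts.append(seq[start:stop])
    return parts


print(split_evenly(range(1, 8), 3))
print(split_evenly([1, 2, 3], 5))
```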
class Collator:
"""
A class for reordering and batching elements of an array.
This class allows for sorting an array based on a provided sorting function, grouping elements based on a grouping function, and generating batches from the sorted and grouped data.
"""
def __init__(
self,
arr: List,
sort_fn: Callable,
group_fn: Callable = lambda x: x[1],
grouping: bool = False,
) -> None:
self.grouping = grouping
self.fn = sort_fn
self.group_fn = lambda x: group_fn(x[1]) # first index are enumerated indices
self.reorder_indices: List = []
self.size = len(arr)
self.arr_with_indices: Iterable[Any] = tuple(enumerate(arr)) # [indices, (arr)]
if self.grouping is True:
self.group_by_index()
def group_by_index(self) -> None:
self.arr_with_indices = self.group(self.arr_with_indices, fn=self.group_fn, values=False)
def get_batched(self, n: int = 1, batch_fn: Optional[Callable] = None) -> Iterator:
"""
Generates and yields batches from the reordered array.
Parameters:
- n (int): The size of each batch. Defaults to 1.
- batch_fn (Optional[Callable[[int, Iterable], int]]): A function to determine the size of each batch. Defaults to None.
Yields:
Iterator: An iterator over batches of reordered elements.
"""
if self.grouping:
for (
key,
values,
) in self.arr_with_indices.items(): # type: ignore
values = self._reorder(values)
batch = self.get_chunks(values, n=n, fn=batch_fn)
yield from batch
else:
values = self._reorder(self.arr_with_indices) # type: ignore
batch = self.get_chunks(values, n=n, fn=batch_fn)
yield from batch
def _reorder(self, arr: Union[List, Tuple[Tuple[int, Any], ...]]) -> Iterator:
"""
Reorders the elements in the array based on the sorting function.
Parameters:
- arr (Union[List, Tuple[Tuple[int, Any], ...]]): The array or iterable to be reordered.
Yields:
List: Yields reordered elements one by one.
"""
arr = sorted(arr, key=lambda x: self.fn(x[1]))
self.reorder_indices.extend([x[0] for x in arr])
yield from [x[1] for x in arr]
def get_original(self, newarr: List) -> List:
"""
Restores the original order of elements from the reordered list.
Parameters:
- newarr (List): The reordered array.
Returns:
List: The array with elements restored to their original order.
"""
res = [None] * self.size
cov = [False] * self.size
for ind, v in zip(self.reorder_indices, newarr):
res[ind] = v
cov[ind] = True
assert all(cov)
return res
def __len__(self):
return self.size
@staticmethod
def group(arr: Iterable, fn: Callable, values: bool = False) -> Iterable:
"""
Groups elements of an iterable based on a provided function.
Parameters:
- arr (Iterable): The iterable to be grouped.
- fn (Callable): The function to determine the grouping.
- values (bool): If True, returns the values of the group. Defaults to False.
Returns:
Iterable: An iterable of grouped elements.
"""
res = collections.defaultdict(list)
for ob in arr:
try:
hashable_dict = tuple(
(
key,
tuple(value) if isinstance(value, collections.abc.Iterable) else value,
)
for key, value in sorted(fn(ob).items())
)
res[hashable_dict].append(ob)
except TypeError:
res[fn(ob)].append(ob)
if not values:
return res
return res.values()
@staticmethod
def get_chunks(_iter, n: int = 0, fn=None):
"""
Divides an iterable into chunks of specified size or based on a given function.
Useful for batching
Parameters:
- _iter: The input iterable to be divided into chunks.
- n: An integer representing the size of each chunk. Default is 0.
- fn: A function that takes the current index and the iterable as arguments and returns the size of the chunk. Default is None.
Returns:
An iterator that yields chunks of the input iterable.
Example usage:
```
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
for chunk in Collator.get_chunks(data, 3):
print(chunk)
```
Output:
```
[1, 2, 3]
[4, 5, 6]
[7, 8, 9]
[10]
```
"""
arr = []
_iter = tuple(_iter)
for i, x in enumerate(_iter):
arr.append(x)
if len(arr) == (fn(i, _iter) if fn else n):
yield arr
arr = []
if arr:
yield arr
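# A standalone sketch of the sort, batch, and restore cycle that `Collator`
# wraps, written with plain lists. It assumes requests are sorted by descending
# length and that "running the model" is just upper-casing each string.

```python
requests = ["a much longer prompt", "short", "medium one"]

# Sort by descending length while remembering each item's original index.
indexed = sorted(enumerate(requests), key=lambda pair: -len(pair[1]))
order = [index for index, _ in indexed]
sorted_requests = [request for _, request in indexed]

# Fixed-size batching over the sorted items.
batch_size = 2
batches = [sorted_requests[i:i + batch_size] for i in range(0, len(sorted_requests), batch_size)]

# Stand-in for model inference over each batch.
outputs = [request.upper() for batch in batches for request in batch]

# Scatter the outputs back to their original positions.
restored = [None] * len(requests)
for original_index, output in zip(order, outputs):
    restored[original_index] = output

print(batches)
print(restored)
```

# Sorting by length keeps similarly sized items in the same batch, and the
# recorded indices make it possible to return results in the caller's order.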
================================================
FILE: lmms-eval_videochat/pyproject.toml
================================================
[tool.black]
line-length = 240
[build-system]
requires = ["setuptools>=42", "wheel", "setuptools_scm[tomli]>=6.3"]
build-backend = "setuptools.build_meta"
[project]
name = "lmms_eval"
version = "0.2.1"
authors = [
{ name = "LMMMs-Lab Evaluation Team", email = "lmms_eval@outlook.com" },
]
description = "A framework for evaluating large multi-modality language models"
readme = "README.md"
classifiers = [
"Programming Language :: Python :: 3",
"License :: OSI Approved :: MIT License",
"Operating System :: OS Independent",
]
requires-python = ">=3.8"
license = { text = "MIT" }
dependencies = [
"accelerate>=0.29.1",
"black==24.1.0",
"datasets==2.16.1",
"evaluate>=0.4.0",
"httpx==0.23.3",
"jsonlines",
"numexpr",
"numpy==1.26.4",
"peft>=0.2.0",
"pybind11>=2.6.2",
"pytablewriter",
"sacrebleu>=1.5.0",
"scikit-learn>=0.24.1",
"sqlitedict==2.1.0",
"torch>=2.1.0", # to enable sdpa mode for running 34B model on one 80GB GPU
"torchvision>=0.16.0",
"timm",
"einops",
"ftfy",
"openai",
"opencv-python-headless",
"av",
"hf_transfer",
"pywsd",
"nltk",
"sentencepiece==0.1.99",
"yt-dlp",
"pycocoevalcap",
"tqdm-multiprocess",
"transformers==4.39.2",
"transformers-stream-generator",
"zstandard",
"pillow",
"pyyaml",
"sympy",
"mpmath",
"Jinja2",
"openpyxl",
"loguru",
"Levenshtein",
"tenacity==8.3.0",
"wandb>=0.16.0",
"tiktoken",
"pre-commit",
"pydantic",
"packaging",
"decord",
"zss",
"spacy",
"anls",
"rouge",
"capture_metric",
"protobuf==3.20",
]
[project.optional-dependencies]
gemini = [
"google-generativeai",
]
reka = [
"httpx==0.23.3",
"reka-api",
]
all = [
"vila",
"gemini",
"reka",
]
[tool.setuptools.packages.find]
include = ["lmms_eval*"]
exclude = [
"assets*",
"benchmark*",
"docs",
"dist*",
"playground*",
"scripts*",
"tests*",
"checkpoints*",
"project_checkpoints*",
"debug_checkpoints*",
"mlx_configs*",
"wandb*",
"notebooks*",
"logs*",
]
[tool.wheel]
exclude = [
"assets*",
"benchmark*",
"docs",
"dist*",
"playground*",
"scripts*",
"tests*",
"checkpoints*",
"project_checkpoints*",
"debug_checkpoints*",
"mlx_configs*",
"wandb*",
"notebooks*",
"logs*",
]
[project.scripts]
lmms-eval = "lmms_eval.__main__:cli_evaluate"
================================================
FILE: lmms-eval_videochat/scripts/eval_longvideobench.sh
================================================
TASK=longvideobench_val_v
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
================================================
FILE: lmms-eval_videochat/scripts/eval_lvbench.sh
================================================
TASK=lvbench
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=640
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
================================================
FILE: lmms-eval_videochat/scripts/eval_mlvu.sh
================================================
TASK=mlvu_mc
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
================================================
FILE: lmms-eval_videochat/scripts/eval_mvbench.sh
================================================
TASK=mvbench
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
================================================
FILE: lmms-eval_videochat/scripts/eval_perceptiontest_val_mc.sh
================================================
TASK=perceptiontest_val_mc
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
================================================
FILE: lmms-eval_videochat/scripts/eval_temporal_grounding_chardes.sh
================================================
TASK=temporal_grounding_charades
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
python lmms_eval/tasks/temporal_grounding/eval_tvg.py -f ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}/submissons/your_result.json # replace your_result.json with the actual result filename
================================================
FILE: lmms-eval_videochat/scripts/eval_videomme.sh
================================================
TASK=videomme,videomme_w_subtitle
MODEL_NAME=videochat_flash
MAX_NUM_FRAMES=512
CKPT_PATH=OpenGVLab/VideoChat-Flash-Qwen2-7B_res448
echo $TASK
TASK_SUFFIX="${TASK//,/_}"
echo $TASK_SUFFIX
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
MASTER_PORT=$((18000 + $RANDOM % 100))
NUM_GPUS=8
accelerate launch --num_processes ${NUM_GPUS} --main_process_port ${MASTER_PORT} -m lmms_eval \
--model ${MODEL_NAME} \
--model_args pretrained=$CKPT_PATH,max_num_frames=$MAX_NUM_FRAMES \
--tasks $TASK \
--batch_size 1 \
--log_samples \
--log_samples_suffix $TASK_SUFFIX \
--output_path ./logs/${JOB_NAME}_${MODEL_NAME}_f${MAX_NUM_FRAMES}
================================================
FILE: lmms-eval_videochat/setup.py
================================================
import setuptools
# This is to make sure that the package supports editable installs
setuptools.setup()
================================================
FILE: lmms-eval_videochat/videochat-flash-7B@448_eval_log_videomme.json
================================================
{
"args": {
"config": "",
"model": "videochat_next_dynamic_newprompt",
"tasks": "videomme",
"model_args": "pretrained=xx,conv_template=qwen_2,max_frames_num=512",
"num_fewshot": null,
"batch_size": "1",
"device": null,
"output_path": "xx",
"limit": null,
"check_integrity": false,
"show_task_to_terminal": false,
"log_samples": true,
"wandb_log_samples": false,
"log_samples_suffix": "videomme",
"predict_only": false,
"show_config": false,
"include_path": null,
"gen_kwargs": "",
"verbosity": "INFO",
"wandb_args": "",
"timezone": "Asia/Singapore"
},
"model_configs": {
"task": "videomme",
"dataset_path": "Video-MME",
"dataset_kwargs": {
"token": true
},
"test_split": "test",
"full_docs": false,
"process_results_use_image": false,
"doc_to_visual": "",
"doc_to_text": "",
"doc_to_target": "answer",
"process_results": "",
"description": "",
"target_delimiter": " ",
"fewshot_delimiter": "\n\n",
"metric_list": [
{
"metric": "videomme_percetion_score",
"aggregation": "",
"higher_is_better": true
}
],
"output_type": "generate_until",
"generation_kwargs": {
"max_new_tokens": 16,
"temperature": 0.0,
"top_p": 1.0,
"num_beams": 1,
"do_sample": false,
"until": [
"\n\n"
]
},
"repeats": 1,
"should_decontaminate": false,
"metadata": [
{
"version": 0.0
}
],
"lmms_eval_specific_kwargs": {
"default": {
"pre_prompt": "",
"post_prompt": "\nAnswer with the option's letter from the given choices directly."
},
"gpt4v": {
"pre_prompt": "",
"post_prompt": "\nAnswer the question with A, B, C, or D."
},
"videochat": {
"post_prompt": "\nOnly give the best option. Best option:("
},
"videochat_pdrop": {
"post_prompt": "\nOnly give the best option. Best option:("
},
"xcomposer2_4khd": {
"pre_prompt": "[UNUSED_TOKEN_146]user\n",
"post_prompt": " Answer this question with A, B, C, or D.[UNUSED_TOKEN_145]\n[UNUSED_TOKEN_146]assistant\n"
},
"pre_prompt": "",
"post_prompt": "\nAnswer with the option's letter from the given choices directly."
}
},
"logs": [
{
"doc_id": 0,
"target": "C",
"doc": {
"video_id": "001",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=fFjv93ACGo8",
"videoID": "fFjv93ACGo8",
"question_id": "001-1",
"task_type": "Counting Problem",
"question": "When demonstrating the Germany modern Christmas tree is initially decorated with apples, candles and berries, which kind of the decoration has the largest number?",
"options": [
"A. Apples.",
"B. Candles.",
"C. Berries.",
"D. The three kinds are of the same number."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When demonstrating the Germany modern Christmas tree is initially decorated with apples, candles and berries, which kind of the decoration has the largest number?\nOption:\nA. Apples.\nB. Candles.\nC. Berries.\nD. The three kinds are of the same number.\nAnswer with the option's letter from the given choices directly.",
0,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "001-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1,
"target": "A",
"doc": {
"video_id": "001",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=fFjv93ACGo8",
"videoID": "fFjv93ACGo8",
"question_id": "001-2",
"task_type": "Information Synopsis",
"question": "What is the genre of this video?",
"options": [
"A. It is a news report that introduces the history behind Christmas decorations.",
"B. It is a documentary on the evolution of Christmas holiday recipes.",
"C. It is a travel vlog exploring Christmas markets around the world.",
"D. It is a tutorial on DIY Christmas ornament crafting."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the genre of this video?\nOption:\nA. It is a news report that introduces the history behind Christmas decorations.\nB. It is a documentary on the evolution of Christmas holiday recipes.\nC. It is a travel vlog exploring Christmas markets around the world.\nD. It is a tutorial on DIY Christmas ornament crafting.\nAnswer with the option's letter from the given choices directly.",
1,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "001-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2,
"target": "D",
"doc": {
"video_id": "001",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=fFjv93ACGo8",
"videoID": "fFjv93ACGo8",
"question_id": "001-3",
"task_type": "Counting Problem",
"question": "How many red socks are above the fireplace at the end of this video?",
"options": [
"A. 1.",
"B. 4.",
"C. 2.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many red socks are above the fireplace at the end of this video?\nOption:\nA. 1.\nB. 4.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
2,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "001-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 3,
"target": "C",
"doc": {
"video_id": "002",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=N1cdUjctpG8",
"videoID": "N1cdUjctpG8",
"question_id": "002-1",
"task_type": "Object Recognition",
"question": "Which of the following features/items is not discussed in the video in relation to the tomb?",
"options": [
"A. Inkstone.",
"B. Niche.",
"C. Jade.",
"D. Sacrificial table."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following features/items is not discussed in the video in relation to the tomb?\nOption:\nA. Inkstone.\nB. Niche.\nC. Jade.\nD. Sacrificial table.\nAnswer with the option's letter from the given choices directly.",
3,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "002-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 4,
"target": "D",
"doc": {
"video_id": "002",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=N1cdUjctpG8",
"videoID": "N1cdUjctpG8",
"question_id": "002-2",
"task_type": "Action Reasoning",
"question": "Which of the following reasons motivated the archaeologists to excavate the tomb?",
"options": [
"A. Because it's from Ming Dynasty and of specific archaeological significance.",
"B. Because a new railway line will be built nearby.",
"C. Because there were treasures inside the tomb.",
"D. Highway realignment."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following reasons motivated the archaeologists to excavate the tomb?\nOption:\nA. Because it's from Ming Dynasty and of specific archaeological significance.\nB. Because a new railway line will be built nearby.\nC. Because there were treasures inside the tomb.\nD. Highway realignment.\nAnswer with the option's letter from the given choices directly.",
4,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "002-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 5,
"target": "B",
"doc": {
"video_id": "002",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=N1cdUjctpG8",
"videoID": "N1cdUjctpG8",
"question_id": "002-3",
"task_type": "Counting Problem",
"question": "How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?",
"options": [
"A. 4.",
"B. 9.",
"C. 5.",
"D. 13."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many porcelain jars were discovered in the niches located in the primary chamber of the tomb?\nOption:\nA. 4.\nB. 9.\nC. 5.\nD. 13.\nAnswer with the option's letter from the given choices directly.",
5,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "002-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 6,
"target": "B",
"doc": {
"video_id": "003",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HIjX8OPuf-w",
"videoID": "HIjX8OPuf-w",
"question_id": "003-1",
"task_type": "Counting Problem",
"question": "How many national flags appear in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many national flags appear in the video?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
6,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "003-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 7,
"target": "C",
"doc": {
"video_id": "003",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HIjX8OPuf-w",
"videoID": "HIjX8OPuf-w",
"question_id": "003-2",
"task_type": "Object Recognition",
"question": "What is the video telling when the burger placed in the upper right corner at the end of the video first appears?",
"options": [
"A. Beef with spices came from Russia to Germany.",
"B. The steak began to be sandwiched between two pieces of bread.",
"C. Steak burgers spread throughout the United States.",
"D. The standardization of hamburgers."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video telling when the burger placed in the upper right corner at the end of the video first appears?\nOption:\nA. Beef with spices came from Russia to Germany.\nB. The steak began to be sandwiched between two pieces of bread.\nC. Steak burgers spread throughout the United States.\nD. The standardization of hamburgers.\nAnswer with the option's letter from the given choices directly.",
7,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "003-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 8,
"target": "D",
"doc": {
"video_id": "003",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HIjX8OPuf-w",
"videoID": "HIjX8OPuf-w",
"question_id": "003-3",
"task_type": "Object Reasoning",
"question": "In which country is the food featured in the video recognized worldwide?",
"options": [
"A. Mongolia.",
"B. Russia.",
"C. Germany.",
"D. United States."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which country is the food featured in the video recognized worldwide?\nOption:\nA. Mongolia.\nB. Russia.\nC. Germany.\nD. United States.\nAnswer with the option's letter from the given choices directly.",
8,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "003-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 9,
"target": "B",
"doc": {
"video_id": "004",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HwnB8aCn8yE",
"videoID": "HwnB8aCn8yE",
"question_id": "004-1",
"task_type": "Temporal Perception",
"question": "According to the video, which of the following is considered the earliest stage of human evolution?",
"options": [
"A. Ramapithecus.",
"B. Dryopithecus.",
"C. Ardipithecus Ramidus.",
"D. Homo Sapiens."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following is considered the earliest stage of human evolution?\nOption:\nA. Ramapithecus.\nB. Dryopithecus.\nC. Ardipithecus Ramidus.\nD. Homo Sapiens.\nAnswer with the option's letter from the given choices directly.",
9,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "004-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 10,
"target": "B",
"doc": {
"video_id": "004",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HwnB8aCn8yE",
"videoID": "HwnB8aCn8yE",
"question_id": "004-2",
"task_type": "Object Recognition",
"question": "According to the video, in which state did human's ancestors first begin to walk on two legs?",
"options": [
"A. Dryopithecus.",
"B. Ramapithecus.",
"C. Ardipithecus Ramidus.",
"D. Homo Habilis."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, in which state did human's ancestors first begin to walk on two legs?\nOption:\nA. Dryopithecus.\nB. Ramapithecus.\nC. Ardipithecus Ramidus.\nD. Homo Habilis.\nAnswer with the option's letter from the given choices directly.",
10,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "004-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 11,
"target": "D",
"doc": {
"video_id": "004",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HwnB8aCn8yE",
"videoID": "HwnB8aCn8yE",
"question_id": "004-3",
"task_type": "Object Recognition",
"question": "Given the information provided by the video, during which stage of human evolution did the regression of body hair occur rapidly?",
"options": [
"A. Dryopithecus.",
"B. Homo Sapiens.",
"C. Ardipithecus Ramidus.",
"D. Homo Sapiens Neanderthalensi."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Given the information provided by the video, during which stage of human evolution did the regression of body hair occur rapidly?\nOption:\nA. Dryopithecus.\nB. Homo Sapiens.\nC. Ardipithecus Ramidus.\nD. Homo Sapiens Neanderthalensi.\nAnswer with the option's letter from the given choices directly.",
11,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "004-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 12,
"target": "A",
"doc": {
"video_id": "005",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=24i4ncHuf6A",
"videoID": "24i4ncHuf6A",
"question_id": "005-1",
"task_type": "Counting Problem",
"question": "According to the video, how many individuals were in the car when Archduke Franz Ferdinand was assassinated?",
"options": [
"A. Three.",
"B. Two.",
"C. One.",
"D. Four."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many individuals were in the car when Archduke Franz Ferdinand was assassinated?\nOption:\nA. Three.\nB. Two.\nC. One.\nD. Four.\nAnswer with the option's letter from the given choices directly.",
12,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "005-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 13,
"target": "B",
"doc": {
"video_id": "005",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=24i4ncHuf6A",
"videoID": "24i4ncHuf6A",
"question_id": "005-2",
"task_type": "Action Reasoning",
"question": "What's the trigger that set off the war mentioned in the video?",
"options": [
"A. Militarisim.",
"B. The assasination of Archduke Franze Fredinand.",
"C. King Edward III of England's claim to the French throne.",
"D. Germany's blitzkrieg against Poland."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the trigger that set off the war mentioned in the video?\nOption:\nA. Militarisim.\nB. The assasination of Archduke Franze Fredinand.\nC. King Edward III of England's claim to the French throne.\nD. Germany's blitzkrieg against Poland.\nAnswer with the option's letter from the given choices directly.",
13,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "005-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 14,
"target": "C",
"doc": {
"video_id": "005",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=24i4ncHuf6A",
"videoID": "24i4ncHuf6A",
"question_id": "005-3",
"task_type": "Attribute Perception",
"question": "Which of the following options is incorrect regarding the events in Sarajevo depicted in the video?",
"options": [
"A. It was organized by the Black Hand in Serbia.",
"B. Ferdinand was sitting on the left side of the car.",
"C. Ferdinand was wearing a white hat.",
"D. The killer was wearing black suit."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is incorrect regarding the events in Sarajevo depicted in the video?\nOption:\nA. It was organized by the Black Hand in Serbia.\nB. Ferdinand was sitting on the left side of the car.\nC. Ferdinand was wearing a white hat.\nD. The killer was wearing black suit.\nAnswer with the option's letter from the given choices directly.",
14,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "005-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 15,
"target": "D",
"doc": {
"video_id": "006",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=40BlVzjxu-I",
"videoID": "40BlVzjxu-I",
"question_id": "006-1",
"task_type": "Object Reasoning",
"question": "What is one of the symbols of the festival that is introduced by the video?",
"options": [
"A. Turkey.",
"B. Moose.",
"C. Moon cake.",
"D. Shamrock."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is one of the symbols of the festival that is introduced by the video?\nOption:\nA. Turkey.\nB. Moose.\nC. Moon cake.\nD. Shamrock.\nAnswer with the option's letter from the given choices directly.",
15,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "006-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 16,
"target": "A",
"doc": {
"video_id": "006",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=40BlVzjxu-I",
"videoID": "40BlVzjxu-I",
"question_id": "006-2",
"task_type": "Temporal Reasoning",
"question": "What happened when Irish imiggrants brought the tradition of St. Patrick's Day to America?",
"options": [
"A. They began to center around drinking and celebrating the festival.",
"B. St. Patrick moved to America.",
"C. They celebrated St. Patrick's birth instead.",
"D. They attended the church and gather for feasts."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened when Irish imiggrants brought the tradition of St. Patrick's Day to America?\nOption:\nA. They began to center around drinking and celebrating the festival.\nB. St. Patrick moved to America.\nC. They celebrated St. Patrick's birth instead.\nD. They attended the church and gather for feasts.\nAnswer with the option's letter from the given choices directly.",
16,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "006-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 17,
"target": "A",
"doc": {
"video_id": "006",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=40BlVzjxu-I",
"videoID": "40BlVzjxu-I",
"question_id": "006-3",
"task_type": "Action Recognition",
"question": "What is special about the celebration in New York according to the video?",
"options": [
"A. Hosting large parades.",
"B. Dressing in green and dyeing the river to green.",
"C. Drinking a lot.",
"D. Planting shamrocks."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is special about the celebration in New York according to the video?\nOption:\nA. Hosting large parades.\nB. Dressing in green and dyeing the river to green.\nC. Drinking a lot.\nD. Planting shamrocks.\nAnswer with the option's letter from the given choices directly.",
17,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "006-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 18,
"target": "D",
"doc": {
"video_id": "007",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=0ay2Qy3wBe8",
"videoID": "0ay2Qy3wBe8",
"question_id": "007-1",
"task_type": "Action Recognition",
"question": "How is the smoke generated by the man depicted in the video?",
"options": [
"A. By burning a piece of cloth.",
"B. By lighting a torch.",
"C. By smoking.",
"D. By lighting a bonefire."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the smoke generated by the man depicted in the video?\nOption:\nA. By burning a piece of cloth.\nB. By lighting a torch.\nC. By smoking.\nD. By lighting a bonefire.\nAnswer with the option's letter from the given choices directly.",
18,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "007-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 19,
"target": "D",
"doc": {
"video_id": "007",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=0ay2Qy3wBe8",
"videoID": "0ay2Qy3wBe8",
"question_id": "007-2",
"task_type": "Temporal Reasoning",
"question": "What kind of communication is listed before Semaphore?",
"options": [
"A. Telephone.",
"B. Homing pigeon.",
"C. Telegraph.",
"D. Pony express."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of communication is listed before Semaphore?\nOption:\nA. Telephone.\nB. Homing pigeon.\nC. Telegraph.\nD. Pony express.\nAnswer with the option's letter from the given choices directly.",
19,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "007-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 20,
"target": "A",
"doc": {
"video_id": "007",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=0ay2Qy3wBe8",
"videoID": "0ay2Qy3wBe8",
"question_id": "007-3",
"task_type": "OCR Problems",
"question": "What is the specific sentence in the smart phone that makes the man embarrassed?",
"options": [
"A. BTW...you got something in your teeth!",
"B. Dude! You gotta come toBigStuf Camps!",
"C. Smoke Sig 2.0.",
"D. Last but not least."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the specific sentence in the smart phone that makes the man embarrassed?\nOption:\nA. BTW...you got something in your teeth!\nB. Dude! You gotta come toBigStuf Camps!\nC. Smoke Sig 2.0.\nD. Last but not least.\nAnswer with the option's letter from the given choices directly.",
20,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "007-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 21,
"target": "D",
"doc": {
"video_id": "008",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=_tvmjsKXTu8",
"videoID": "_tvmjsKXTu8",
"question_id": "008-1",
"task_type": "Attribute Perception",
"question": "In the video, what color pen did the author use when he wrote \"guitar\" for the second time?",
"options": [
"A. Brown.",
"B. Blue.",
"C. Black.",
"D. Pink."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what color pen did the author use when he wrote \"guitar\" for the second time?\nOption:\nA. Brown.\nB. Blue.\nC. Black.\nD. Pink.\nAnswer with the option's letter from the given choices directly.",
21,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "008-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 22,
"target": "B",
"doc": {
"video_id": "008",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=_tvmjsKXTu8",
"videoID": "_tvmjsKXTu8",
"question_id": "008-2",
"task_type": "Counting Problem",
"question": "How many different guitar-shaped instruments are there in the video?",
"options": [
"A. 2.",
"B. 7.",
"C. 3.",
"D. 11."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different guitar-shaped instruments are there in the video?\nOption:\nA. 2.\nB. 7.\nC. 3.\nD. 11.\nAnswer with the option's letter from the given choices directly.",
22,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "008-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 23,
"target": "A",
"doc": {
"video_id": "008",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=_tvmjsKXTu8",
"videoID": "_tvmjsKXTu8",
"question_id": "008-3",
"task_type": "Information Synopsis",
"question": "What does the second half of the video show?",
"options": [
"A. An advertisement for a music software.",
"B. A tutorial on how to practice guitar on electronic devices.",
"C. The history of guitar software on electronic devices.",
"D. An advertisement for a tablet computer."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the second half of the video show?\nOption:\nA. An advertisement for a music software.\nB. A tutorial on how to practice guitar on electronic devices.\nC. The history of guitar software on electronic devices.\nD. An advertisement for a tablet computer.\nAnswer with the option's letter from the given choices directly.",
23,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "008-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 24,
"target": "A",
"doc": {
"video_id": "009",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=sUDY-SMREtA",
"videoID": "sUDY-SMREtA",
"question_id": "009-1",
"task_type": "Attribute Perception",
"question": "Which color of clothes is QuYuan wearing in the video?",
"options": [
"A. White.",
"B. Blue.",
"C. Brown.",
"D. Yellow."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color of clothes is QuYuan wearing in the video?\nOption:\nA. White.\nB. Blue.\nC. Brown.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
24,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "009-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 25,
"target": "B",
"doc": {
"video_id": "009",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=sUDY-SMREtA",
"videoID": "sUDY-SMREtA",
"question_id": "009-2",
"task_type": "Attribute Perception",
"question": "Based on the video, which day does the festival fall on the lunar calendar?",
"options": [
"A. January 1th.",
"B. May 5th.",
"C. September 9th.",
"D. May 15th."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which day does the festival fall on the lunar calendar?\nOption:\nA. January 1th.\nB. May 5th.\nC. September 9th.\nD. May 15th.\nAnswer with the option's letter from the given choices directly.",
25,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "009-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 26,
"target": "C",
"doc": {
"video_id": "009",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=sUDY-SMREtA",
"videoID": "sUDY-SMREtA",
"question_id": "009-3",
"task_type": "Action Reasoning",
"question": "According to the video, which of the following is the main reason why people commemorate Qu Yuan?",
"options": [
"A. Because people love Zongzi.",
"B. Because he committed suicide by drowing himself in Miluo River.",
"C. Because he brought peace and prosperity to the state.",
"D. Because he was exile."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following is the main reason why people commemorate Qu Yuan?\nOption:\nA. Because people love Zongzi.\nB. Because he committed suicide by drowing himself in Miluo River.\nC. Because he brought peace and prosperity to the state.\nD. Because he was exile.\nAnswer with the option's letter from the given choices directly.",
26,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "009-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 27,
"target": "C",
"doc": {
"video_id": "010",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=PSt_op3fQck",
"videoID": "PSt_op3fQck",
"question_id": "010-1",
"task_type": "Information Synopsis",
"question": "What can be learned from this video?",
"options": [
"A. The history of intercultural conflicts and disputes.",
"B. Quick ways to grasp the fundamentals of intercultural communication. understanding.",
"C. Techniques for mastering a foreign language in a short period of time.",
"D. Strategies for resolving international political conflicts."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be learned from this video?\nOption:\nA. The history of intercultural conflicts and disputes.\nB. Quick ways to grasp the fundamentals of intercultural communication. understanding.\nC. Techniques for mastering a foreign language in a short period of time.\nD. Strategies for resolving international political conflicts.\nAnswer with the option's letter from the given choices directly.",
27,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "010-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 28,
"target": "A",
"doc": {
"video_id": "010",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=PSt_op3fQck",
"videoID": "PSt_op3fQck",
"question_id": "010-2",
"task_type": "Object Recognition",
"question": "What is the pattern on the clothing of the staff who is behaving informally at the airline counter?",
"options": [
"A. A flower.",
"B. A coconut tree.",
"C. A flying bird.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the pattern on the clothing of the staff who is behaving informally at the airline counter?\nOption:\nA. A flower.\nB. A coconut tree.\nC. A flying bird.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
28,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "010-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 29,
"target": "B",
"doc": {
"video_id": "010",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=PSt_op3fQck",
"videoID": "PSt_op3fQck",
"question_id": "010-3",
"task_type": "Object Recognition",
"question": "At the end of the animation, which place does not the copilot travel to?",
"options": [
"A. The White House.",
"B. The Big Ben.",
"C. The Egyptian Pyramids.",
"D. The Eiffel Tower."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the animation, which place does not the copilot travel to?\nOption:\nA. The White House.\nB. The Big Ben.\nC. The Egyptian Pyramids.\nD. The Eiffel Tower.\nAnswer with the option's letter from the given choices directly.",
29,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "010-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 30,
"target": "C",
"doc": {
"video_id": "011",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=OE5S-NbNsro",
"videoID": "OE5S-NbNsro",
"question_id": "011-1",
"task_type": "Action Reasoning",
"question": "What is the kind of the dance in the video?",
"options": [
"A. Hip hop.",
"B. Jazz.",
"C. Ballet.",
"D. Folk dance."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the kind of the dance in the video?\nOption:\nA. Hip hop.\nB. Jazz.\nC. Ballet.\nD. Folk dance.\nAnswer with the option's letter from the given choices directly.",
30,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "011-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 31,
"target": "B",
"doc": {
"video_id": "011",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=OE5S-NbNsro",
"videoID": "OE5S-NbNsro",
"question_id": "011-2",
"task_type": "Attribute Perception",
"question": "What's the color of the cosmos introduced by the video?",
"options": [
"A. White.",
"B. Black.",
"C. Navy blue.",
"D. Dark purple."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the color of the cosmos introduced by the video?\nOption:\nA. White.\nB. Black.\nC. Navy blue.\nD. Dark purple.\nAnswer with the option's letter from the given choices directly.",
31,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "011-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 32,
"target": "A",
"doc": {
"video_id": "011",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=OE5S-NbNsro",
"videoID": "OE5S-NbNsro",
"question_id": "011-3",
"task_type": "Action Recognition",
"question": "What are the moves in the last scene of this dance?",
"options": [
"A. Kneel down on one knee and lean back.",
"B. Passe and then chasse.",
"C. Releve and then pirouette.",
"D. Passe and then Grand jete."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the moves in the last scene of this dance?\nOption:\nA. Kneel down on one knee and lean back.\nB. Passe and then chasse.\nC. Releve and then pirouette.\nD. Passe and then Grand jete.\nAnswer with the option's letter from the given choices directly.",
32,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "011-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 33,
"target": "B",
"doc": {
"video_id": "012",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=qvwfWyU9Gfc",
"videoID": "qvwfWyU9Gfc",
"question_id": "012-1",
"task_type": "Action Reasoning",
"question": "Based on the video, which of the following describes the reason why the student ate the banana?",
"options": [
"A. Because the banana looks tasty.",
"B. Because he considered the process of eating a banana is art.",
"C. Because he didn't think the banana worth $120,000.",
"D. Because he wanted to followed the man who ate a banana in a exhibition in 2019."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following describes the reason why the student ate the banana?\nOption:\nA. Because the banana looks tasty.\nB. Because he considered the process of eating a banana is art.\nC. Because he didn't think the banana worth $120,000.\nD. Because he wanted to followed the man who ate a banana in a exhibition in 2019.\nAnswer with the option's letter from the given choices directly.",
33,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "012-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 34,
"target": "C",
"doc": {
"video_id": "012",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=qvwfWyU9Gfc",
"videoID": "qvwfWyU9Gfc",
"question_id": "012-2",
"task_type": "Temporal Reasoning",
"question": "What did the student do after eating the banana in the video?",
"options": [
"A. He ate a banana in exhibition again in 2019.",
"B. He throwed the skin of banana.",
"C. He reattached the skin of banana using the same tape.",
"D. The action of eating a banana was considered art."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the student do after eating the banana in the video?\nOption:\nA. He ate a banana in exhibition again in 2019.\nB. He throwed the skin of banana.\nC. He reattached the skin of banana using the same tape.\nD. The action of eating a banana was considered art.\nAnswer with the option's letter from the given choices directly.",
34,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "012-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 35,
"target": "D",
"doc": {
"video_id": "012",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=qvwfWyU9Gfc",
"videoID": "qvwfWyU9Gfc",
"question_id": "012-3",
"task_type": "Attribute Perception",
"question": "What's the color of the tape in the video?",
"options": [
"A. Green.",
"B. Black.",
"C. Yellow.",
"D. Gray."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the color of the tape in the video?\nOption:\nA. Green.\nB. Black.\nC. Yellow.\nD. Gray.\nAnswer with the option's letter from the given choices directly.",
35,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "012-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 36,
"target": "C",
"doc": {
"video_id": "013",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=5wLv3pCqZ9o",
"videoID": "5wLv3pCqZ9o",
"question_id": "013-1",
"task_type": "Spatial Perception",
"question": "Which of the following is visible in the background of the video when the miniature bottle is shown empty?",
"options": [
"A. Water.",
"B. Cork.",
"C. Shells.",
"D. Star."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is visible in the background of the video when the miniature bottle is shown empty?\nOption:\nA. Water.\nB. Cork.\nC. Shells.\nD. Star.\nAnswer with the option's letter from the given choices directly.",
36,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "013-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 37,
"target": "A",
"doc": {
"video_id": "013",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=5wLv3pCqZ9o",
"videoID": "5wLv3pCqZ9o",
"question_id": "013-2",
"task_type": "Action Reasoning",
"question": "According to the video, which of the following ingredients is not used in the artwork?",
"options": [
"A. Shell.",
"B. Glue.",
"C. Blue Food Dye.",
"D. Oil."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following ingredients is not used in the artwork?\nOption:\nA. Shell.\nB. Glue.\nC. Blue Food Dye.\nD. Oil.\nAnswer with the option's letter from the given choices directly.",
37,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "013-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 38,
"target": "A",
"doc": {
"video_id": "013",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=5wLv3pCqZ9o",
"videoID": "5wLv3pCqZ9o",
"question_id": "013-3",
"task_type": "Action Recognition",
"question": "What was the next step taken by the operator after filling the bottle with oil?",
"options": [
"A. Applied glue to cork.",
"B. Added blue food dye.",
"C. Added 1/3 water.",
"D. Put the cork on the bottle."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the next step taken by the operator after filling the bottle with oil?\nOption:\nA. Applied glue to cork.\nB. Added blue food dye.\nC. Added 1/3 water.\nD. Put the cork on the bottle.\nAnswer with the option's letter from the given choices directly.",
38,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "013-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 39,
"target": "D",
"doc": {
"video_id": "014",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=uqILuTcux_o",
"videoID": "uqILuTcux_o",
"question_id": "014-1",
"task_type": "Object Recognition",
"question": "Which of the following can be identified as Susan White-Oakes' most recent art piece, based on the video?",
"options": [
"A. A tiger.",
"B. A bird.",
"C. A fish.",
"D. A pangolin."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following can be identified as Susan White-Oakes' most recent art piece, based on the video?\nOption:\nA. A tiger.\nB. A bird.\nC. A fish.\nD. A pangolin.\nAnswer with the option's letter from the given choices directly.",
39,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "014-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 40,
"target": "D",
"doc": {
"video_id": "014",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=uqILuTcux_o",
"videoID": "uqILuTcux_o",
"question_id": "014-2",
"task_type": "Attribute Perception",
"question": "Based on the video, what is Susan White-Oakes' latest art work made of?",
"options": [
"A. Silver.",
"B. Iron.",
"C. Bronze.",
"D. Copper."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is Susan White-Oakes' latest art work made of?\nOption:\nA. Silver.\nB. Iron.\nC. Bronze.\nD. Copper.\nAnswer with the option's letter from the given choices directly.",
40,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "014-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 41,
"target": "C",
"doc": {
"video_id": "014",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=uqILuTcux_o",
"videoID": "uqILuTcux_o",
"question_id": "014-3",
"task_type": "Temporal Reasoning",
"question": "What does Susan White-Oakes often do before polishing the whole body?",
"options": [
"A. She sculptures the legs.",
"B. She attaches the copper scales to the body.",
"C. She decides what size the head should be.",
"D. She shapes the claws."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Susan White-Oakes often do before polishing the whole body?\nOption:\nA. She sculptures the legs.\nB. She attaches the copper scales to the body.\nC. She decides what size the head should be.\nD. She shapes the claws.\nAnswer with the option's letter from the given choices directly.",
41,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "014-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 42,
"target": "A",
"doc": {
"video_id": "015",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=eXKE0nAMmg4",
"videoID": "eXKE0nAMmg4",
"question_id": "015-1",
"task_type": "Attribute Perception",
"question": "Which elements are depicted in the painting introduced by the video?",
"options": [
"A. A little girl and a red balloon.",
"B. A little boy and a red balloon.",
"C. A little girl and a blue balloon.",
"D. An adult and a blue balloon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which elements are depicted in the painting introduced by the video?\nOption:\nA. A little girl and a red balloon.\nB. A little boy and a red balloon.\nC. A little girl and a blue balloon.\nD. An adult and a blue balloon.\nAnswer with the option's letter from the given choices directly.",
42,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "015-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 43,
"target": "D",
"doc": {
"video_id": "015",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=eXKE0nAMmg4",
"videoID": "eXKE0nAMmg4",
"question_id": "015-2",
"task_type": "Action Recognition",
"question": "Based on the video, what happened when the painting was sold?",
"options": [
"A. Someone put a shredder inside the painting.",
"B. The buyer was very happy because it got more worthy.",
"C. It disapeared.",
"D. It shredded into pieces."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what happened when the painting was sold?\nOption:\nA. Someone put a shredder inside the painting.\nB. The buyer was very happy because it got more worthy.\nC. It disapeared.\nD. It shredded into pieces.\nAnswer with the option's letter from the given choices directly.",
43,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "015-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 44,
"target": "B",
"doc": {
"video_id": "015",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=eXKE0nAMmg4",
"videoID": "eXKE0nAMmg4",
"question_id": "015-3",
"task_type": "Object Reasoning",
"question": "Based on the information provided by the video, which of the following statements is not correct?",
"options": [
"A. The painting got more expensive because it shredded.",
"B. The buyer put a shredder inside the painting many years ago.",
"C. People in the auction house were stunned when the painting shredded.",
"D. Banksky considered the shredding of painting is part of art."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, which of the following statements is not correct?\nOption:\nA. The painting got more expensive because it shredded.\nB. The buyer put a shredder inside the painting many years ago.\nC. People in the auction house were stunned when the painting shredded.\nD. Banksky considered the shredding of painting is part of art.\nAnswer with the option's letter from the given choices directly.",
44,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "015-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 45,
"target": "B",
"doc": {
"video_id": "016",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=5_fXicEnKKk",
"videoID": "5_fXicEnKKk",
"question_id": "016-1",
"task_type": "Attribute Perception",
"question": "When was the painting mentioned in the video painted?",
"options": [
"A. 1998.",
"B. 1889.",
"C. 1899.",
"D. 1988."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When was the painting mentioned in the video painted?\nOption:\nA. 1998.\nB. 1889.\nC. 1899.\nD. 1988.\nAnswer with the option's letter from the given choices directly.",
45,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "016-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 46,
"target": "A",
"doc": {
"video_id": "016",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=5_fXicEnKKk",
"videoID": "5_fXicEnKKk",
"question_id": "016-2",
"task_type": "Object Reasoning",
"question": "Which of the following elements does not appear in the paintings primarily mentioned in the video?",
"options": [
"A. Van Gogh' tragic life.",
"B. The Van Gogh's artistic attainments.",
"C. The value of the painting.",
"D. The hope that led Van Gogh into success."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following elements does not appear in the paintings primarily mentioned in the video?\nOption:\nA. Van Gogh' tragic life.\nB. The Van Gogh's artistic attainments.\nC. The value of the painting.\nD. The hope that led Van Gogh into success.\nAnswer with the option's letter from the given choices directly.",
46,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "016-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 47,
"target": "D",
"doc": {
"video_id": "016",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=5_fXicEnKKk",
"videoID": "5_fXicEnKKk",
"question_id": "016-3",
"task_type": "Object Reasoning",
"question": "Which of the following elements is not present in the \"Starry Sky\"?",
"options": [
"A. Houses.",
"B. Stars.",
"C. Castle.",
"D. Moon."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following elements is not present in the \"Starry Sky\"?\nOption:\nA. Houses.\nB. Stars.\nC. Castle.\nD. Moon.\nAnswer with the option's letter from the given choices directly.",
47,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "016-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 48,
"target": "C",
"doc": {
"video_id": "017",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=O0qVPW1fUn4",
"videoID": "O0qVPW1fUn4",
"question_id": "017-1",
"task_type": "Counting Problem",
"question": "Based on the information provided by the video, what is the maximum number of fingers required to play this piece of music?",
"options": [
"A. Three.",
"B. Ten.",
"C. Four.",
"D. Five."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, what is the maximum number of fingers required to play this piece of music?\nOption:\nA. Three.\nB. Ten.\nC. Four.\nD. Five.\nAnswer with the option's letter from the given choices directly.",
48,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "017-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 49,
"target": "B",
"doc": {
"video_id": "017",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=O0qVPW1fUn4",
"videoID": "O0qVPW1fUn4",
"question_id": "017-2",
"task_type": "Object Reasoning",
"question": "What do the characters in this video mean?",
"options": [
"A. Volume.",
"B. Scale.",
"C. Note.",
"D. Strength of pressing the piano keys."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the characters in this video mean?\nOption:\nA. Volume.\nB. Scale.\nC. Note.\nD. Strength of pressing the piano keys.\nAnswer with the option's letter from the given choices directly.",
49,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "017-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 50,
"target": "A",
"doc": {
"video_id": "017",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=O0qVPW1fUn4",
"videoID": "O0qVPW1fUn4",
"question_id": "017-3",
"task_type": "Object Reasoning",
"question": "In this video, why some keys are blue while others are green?",
"options": [
"A. Because it indicates the hand to play them.",
"B. Because it can help to make the video more colorful.",
"C. Because the YouTuber love these two colors.",
"D. Because keys on the special piano are eithor blue or green."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In this video, why some keys are blue while others are green?\nOption:\nA. Because it indicates the hand to play them.\nB. Because it can help to make the video more colorful.\nC. Because the YouTuber love these two colors.\nD. Because keys on the special piano are eithor blue or green.\nAnswer with the option's letter from the given choices directly.",
50,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "017-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 51,
"target": "D",
"doc": {
"video_id": "018",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=tX8a00l_Dfs",
"videoID": "tX8a00l_Dfs",
"question_id": "018-1",
"task_type": "Action Recognition",
"question": "According to the video, which finger does not need to press the strings when playing Em9?",
"options": [
"A. Index finger.",
"B. Middle finger.",
"C. Ring finger.",
"D. Little finger."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which finger does not need to press the strings when playing Em9?\nOption:\nA. Index finger.\nB. Middle finger.\nC. Ring finger.\nD. Little finger.\nAnswer with the option's letter from the given choices directly.",
51,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "018-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 52,
"target": "B",
"doc": {
"video_id": "018",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=tX8a00l_Dfs",
"videoID": "tX8a00l_Dfs",
"question_id": "018-2",
"task_type": "Counting Problem",
"question": "How many kinds of chords are played in the video?",
"options": [
"A. Eight.",
"B. Seven.",
"C. Fourteen.",
"D. Six."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many kinds of chords are played in the video?\nOption:\nA. Eight.\nB. Seven.\nC. Fourteen.\nD. Six.\nAnswer with the option's letter from the given choices directly.",
52,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "018-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 53,
"target": "D",
"doc": {
"video_id": "018",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=tX8a00l_Dfs",
"videoID": "tX8a00l_Dfs",
"question_id": "018-3",
"task_type": "Action Reasoning",
"question": "What did the YouTuber do at the latter part of the video?",
"options": [
"A. He played his favorite tune.",
"B. He claimed that he loves the chords mentioned above.",
"C. He played some complex chords.",
"D. He strung many chords."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the YouTuber do at the latter part of the video?\nOption:\nA. He played his favorite tune.\nB. He claimed that he loves the chords mentioned above.\nC. He played some complex chords.\nD. He strung many chords.\nAnswer with the option's letter from the given choices directly.",
53,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "018-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 54,
"target": "C",
"doc": {
"video_id": "019",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=mVIXU0x9ocI",
"videoID": "mVIXU0x9ocI",
"question_id": "019-1",
"task_type": "Action Reasoning",
"question": "What is the author's intention in presenting a scene where all birds fly away except for one on the branch?",
"options": [
"A. The bird feels tired of flying.",
"B. The bird likes to stay on the branch.",
"C. The bird might get used to being alone.",
"D. The bird is dead."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the author's intention in presenting a scene where all birds fly away except for one on the branch?\nOption:\nA. The bird feels tired of flying.\nB. The bird likes to stay on the branch.\nC. The bird might get used to being alone.\nD. The bird is dead.\nAnswer with the option's letter from the given choices directly.",
54,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "019-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 55,
"target": "A",
"doc": {
"video_id": "019",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=mVIXU0x9ocI",
"videoID": "mVIXU0x9ocI",
"question_id": "019-2",
"task_type": "Action Recognition",
"question": "What is the raven doing in the video?",
"options": [
"A. Eating.",
"B. Flying.",
"C. Walking.",
"D. Sleeping."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the raven doing in the video?\nOption:\nA. Eating.\nB. Flying.\nC. Walking.\nD. Sleeping.\nAnswer with the option's letter from the given choices directly.",
55,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "019-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 56,
"target": "C",
"doc": {
"video_id": "019",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=mVIXU0x9ocI",
"videoID": "mVIXU0x9ocI",
"question_id": "019-3",
"task_type": "OCR Problems",
"question": "Whom is the poem in the video written by?",
"options": [
"A. Michael.",
"B. Alone.",
"C. Edgar.",
"D. Shakespeare."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Whom is the poem in the video written by?\nOption:\nA. Michael.\nB. Alone.\nC. Edgar.\nD. Shakespeare.\nAnswer with the option's letter from the given choices directly.",
56,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "019-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 57,
"target": "A",
"doc": {
"video_id": "020",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=YPaLGP_xr-w",
"videoID": "YPaLGP_xr-w",
"question_id": "020-1",
"task_type": "Object Reasoning",
"question": "Based on the video, which of the following statements provides the most accurate explanation for why the Statue of Liberty is green in color?",
"options": [
"A. Because the copper is rusty.",
"B. Because she was painted green.",
"C. Because people didn't like her original color.",
"D. Because she is not the origin one any more."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following statements provides the most accurate explanation for why the Statue of Liberty is green in color?\nOption:\nA. Because the copper is rusty.\nB. Because she was painted green.\nC. Because people didn't like her original color.\nD. Because she is not the origin one any more.\nAnswer with the option's letter from the given choices directly.",
57,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "020-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 58,
"target": "D",
"doc": {
"video_id": "020",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=YPaLGP_xr-w",
"videoID": "YPaLGP_xr-w",
"question_id": "020-2",
"task_type": "Action Reasoning",
"question": "How did the photographer who took pictures of the three tourists in the video take the photo?",
"options": [
"A. By standing at the foot of the statue.",
"B. By going up to the crown of the statue.",
"C. By asking the vistor to lie down.",
"D. By lying down himself."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the photographer who took pictures of the three tourists in the video take the photo?\nOption:\nA. By standing at the foot of the statue.\nB. By going up to the crown of the statue.\nC. By asking the vistor to lie down.\nD. By lying down himself.\nAnswer with the option's letter from the given choices directly.",
58,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "020-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 59,
"target": "B",
"doc": {
"video_id": "020",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=YPaLGP_xr-w",
"videoID": "YPaLGP_xr-w",
"question_id": "020-3",
"task_type": "Spatial Perception",
"question": "Based on the information provided by the video, where is the best spot for tourists to take aerial photos of the sculpture mentioned in the video?",
"options": [
"A. At her feet.",
"B. At her crown.",
"C. Taken from the sea in the distance.",
"D. At her neck."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, where is the best spot for tourists to take aerial photos of the sculpture mentioned in the video?\nOption:\nA. At her feet.\nB. At her crown.\nC. Taken from the sea in the distance.\nD. At her neck.\nAnswer with the option's letter from the given choices directly.",
59,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "020-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 60,
"target": "A",
"doc": {
"video_id": "021",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=TbaCxIJ_VP4",
"videoID": "TbaCxIJ_VP4",
"question_id": "021-1",
"task_type": "Information Synopsis",
"question": "Which of the following accurately describes the content of the video?",
"options": [
"A. How mRNA vaccines work.",
"B. Production process and specifications of mRNA vaccines.",
"C. Application market of mRNA vaccine.",
"D. Possible side effects of mRNA vaccines."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following accurately describes the content of the video?\nOption:\nA. How mRNA vaccines work.\nB. Production process and specifications of mRNA vaccines.\nC. Application market of mRNA vaccine.\nD. Possible side effects of mRNA vaccines.\nAnswer with the option's letter from the given choices directly.",
60,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "021-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 61,
"target": "B",
"doc": {
"video_id": "021",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=TbaCxIJ_VP4",
"videoID": "TbaCxIJ_VP4",
"question_id": "021-2",
"task_type": "Attribute Perception",
"question": "Where does mRNA come from in the video?",
"options": [
"A. Cells.",
"B. Vaccines.",
"C. COVID-19 virus.",
"D. Antibodies."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does mRNA come from in the video?\nOption:\nA. Cells.\nB. Vaccines.\nC. COVID-19 virus.\nD. Antibodies.\nAnswer with the option's letter from the given choices directly.",
61,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "021-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 62,
"target": "C",
"doc": {
"video_id": "021",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=TbaCxIJ_VP4",
"videoID": "TbaCxIJ_VP4",
"question_id": "021-3",
"task_type": "Attribute Perception",
"question": "What color is the cell membrane of the virus-invaded cell in the video?",
"options": [
"A. Black.",
"B. Yellow.",
"C. Red.",
"D. Blue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the cell membrane of the virus-invaded cell in the video?\nOption:\nA. Black.\nB. Yellow.\nC. Red.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
62,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "021-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 63,
"target": "B",
"doc": {
"video_id": "022",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=n3IYmdy6d4Y",
"videoID": "n3IYmdy6d4Y",
"question_id": "022-1",
"task_type": "OCR Problems",
"question": "According to the video, how long should one wait before cracking the same knuckle again?",
"options": [
"A. About 25 minutes.",
"B. About 20 minutes.",
"C. About 15 minutes.",
"D. About 23 minutes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how long should one wait before cracking the same knuckle again?\nOption:\nA. About 25 minutes.\nB. About 20 minutes.\nC. About 15 minutes.\nD. About 23 minutes.\nAnswer with the option's letter from the given choices directly.",
63,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "022-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 64,
"target": "A",
"doc": {
"video_id": "022",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=n3IYmdy6d4Y",
"videoID": "n3IYmdy6d4Y",
"question_id": "022-2",
"task_type": "Temporal Perception",
"question": "What detail does the caption provide about the term \"synovial\" at the beginning of the video?",
"options": [
"A. It is derived from the Latin word for 'egg'.",
"B. It is a term used to describe a type of joint movement.",
"C. It refers to a type of bone tissue.",
"D. It indicates a condition related to the joints."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What detail does the caption provide about the term \"synovial\" at the beginning of the video?\nOption:\nA. It is derived from the Latin word for 'egg'.\nB. It is a term used to describe a type of joint movement.\nC. It refers to a type of bone tissue.\nD. It indicates a condition related to the joints.\nAnswer with the option's letter from the given choices directly.",
64,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "022-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 65,
"target": "D",
"doc": {
"video_id": "022",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=n3IYmdy6d4Y",
"videoID": "n3IYmdy6d4Y",
"question_id": "022-3",
"task_type": "Information Synopsis",
"question": "Which of the following could be the main topic of the video segment?",
"options": [
"A. The surgical procedure for joint replacement.",
"B. The nutritional requirements for healthy bones.",
"C. The explanation of how joint lubrication works.",
"D. What happens to your knuckles when you crack them."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following could be the main topic of the video segment?\nOption:\nA. The surgical procedure for joint replacement.\nB. The nutritional requirements for healthy bones.\nC. The explanation of how joint lubrication works.\nD. What happens to your knuckles when you crack them.\nAnswer with the option's letter from the given choices directly.",
65,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "022-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 66,
"target": "C",
"doc": {
"video_id": "023",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iZYLeIJwe4w",
"videoID": "iZYLeIJwe4w",
"question_id": "023-1",
"task_type": "Information Synopsis",
"question": "Which option best describes the main topic of the video?",
"options": [
"A. Bacteria are replicating inside a cell.",
"B. A cell is being infected by a virus.",
"C. An immune cell is destroying pathogens.",
"D. A platelet is clotting blood."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which option best describes the main topic of the video?\nOption:\nA. Bacteria are replicating inside a cell.\nB. A cell is being infected by a virus.\nC. An immune cell is destroying pathogens.\nD. A platelet is clotting blood.\nAnswer with the option's letter from the given choices directly.",
66,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "023-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 67,
"target": "A",
"doc": {
"video_id": "023",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iZYLeIJwe4w",
"videoID": "iZYLeIJwe4w",
"question_id": "023-2",
"task_type": "Object Recognition",
"question": "What is the mechanism/process that can break down bacteria according to the video?",
"options": [
"A. Special digestive enzymes: such as proteases or nucleases, which can break down bacterial proteins or nucleic acids under specific conditions.",
"B. Medical surgical instruments: e.g. scalpels, dissecting scissors, grinding equipment, etc., used for handling bacteria in surgery or cell culture.",
"C. Environmental disinfectants: Certain disinfectants, such as alcohol or hydrogen peroxide, are capable of disrupting the cell walls and membranes of bacteria, causing them to lyse.",
"D. Autoclaves: Autoclaving kills microorganisms, including bacteria, in medical devices or sample processing."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the mechanism/process that can break down bacteria according to the video?\nOption:\nA. Special digestive enzymes: such as proteases or nucleases, which can break down bacterial proteins or nucleic acids under specific conditions.\nB. Medical surgical instruments: e.g. scalpels, dissecting scissors, grinding equipment, etc., used for handling bacteria in surgery or cell culture.\nC. Environmental disinfectants: Certain disinfectants, such as alcohol or hydrogen peroxide, are capable of disrupting the cell walls and membranes of bacteria, causing them to lyse.\nD. Autoclaves: Autoclaving kills microorganisms, including bacteria, in medical devices or sample processing.\nAnswer with the option's letter from the given choices directly.",
67,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "023-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 68,
"target": "A",
"doc": {
"video_id": "023",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iZYLeIJwe4w",
"videoID": "iZYLeIJwe4w",
"question_id": "023-3",
"task_type": "Attribute Perception",
"question": "Which cell shown in the video is of the same type as the white cell scrolling at the beginning?",
"options": [
"A. The green cells in the middle of the video, devour a red cell.",
"B. The devoured green cells at the beginning of the video.",
"C. Tiny dense cells in blue at the beginning of the video.",
"D. It is impossible to extrapolate."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which cell shown in the video is of the same type as the white cell scrolling at the beginning?\nOption:\nA. The green cells in the middle of the video, devour a red cell.\nB. The devoured green cells at the beginning of the video.\nC. Tiny dense cells in blue at the beginning of the video.\nD. It is impossible to extrapolate.\nAnswer with the option's letter from the given choices directly.",
68,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "023-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 69,
"target": "A",
"doc": {
"video_id": "024",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=jkNxmTrrZSk",
"videoID": "jkNxmTrrZSk",
"question_id": "024-1",
"task_type": "Information Synopsis",
"question": "What does the video depict?",
"options": [
"A. A virus attacks a cell.",
"B. The replication of DNA.",
"C. A neuron sends a signal.",
"D. White blood cells attacking a pathogen."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video depict?\nOption:\nA. A virus attacks a cell.\nB. The replication of DNA.\nC. A neuron sends a signal.\nD. White blood cells attacking a pathogen.\nAnswer with the option's letter from the given choices directly.",
69,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "024-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 70,
"target": "B",
"doc": {
"video_id": "024",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=jkNxmTrrZSk",
"videoID": "jkNxmTrrZSk",
"question_id": "024-2",
"task_type": "Temporal Perception",
"question": "Which object is depicted as a small flying black dot at the start of the video?",
"options": [
"A. Dust mites in the lungs.",
"B. Virus.",
"C. Protein Receptors.",
"D. Black bacteria."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which object is depicted as a small flying black dot at the start of the video?\nOption:\nA. Dust mites in the lungs.\nB. Virus.\nC. Protein Receptors.\nD. Black bacteria.\nAnswer with the option's letter from the given choices directly.",
70,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "024-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 71,
"target": "B",
"doc": {
"video_id": "024",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=jkNxmTrrZSk",
"videoID": "jkNxmTrrZSk",
"question_id": "024-3",
"task_type": "Information Synopsis",
"question": "What might be the primary focus of the video?",
"options": [
"A. The process by which viruses are eliminated by the body's immune response.",
"B. Mechanisms of viral infection.",
"C. Production and spread of viruses.",
"D. Types of viruses and the most harmful ones."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What might be the primary focus of the video?\nOption:\nA. The process by which viruses are eliminated by the body's immune response.\nB. Mechanisms of viral infection.\nC. Production and spread of viruses.\nD. Types of viruses and the most harmful ones.\nAnswer with the option's letter from the given choices directly.",
71,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "024-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 72,
"target": "B",
"doc": {
"video_id": "025",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ebzbKa32kuk",
"videoID": "ebzbKa32kuk",
"question_id": "025-1",
"task_type": "Object Recognition",
"question": "What structure is primarily depicted in the video?",
"options": [
"A. A model of a human lung.",
"B. A model of a human heart.",
"C. A model of a human liver.",
"D. A model of a human kidney."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What structure is primarily depicted in the video?\nOption:\nA. A model of a human lung.\nB. A model of a human heart.\nC. A model of a human liver.\nD. A model of a human kidney.\nAnswer with the option's letter from the given choices directly.",
72,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "025-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 73,
"target": "C",
"doc": {
"video_id": "025",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ebzbKa32kuk",
"videoID": "ebzbKa32kuk",
"question_id": "025-2",
"task_type": "Attribute Perception",
"question": "Which of the following features is NOT visible on the model of the heart in the video?",
"options": [
"A. Coronary arteries.",
"B. Pulmonary veins.",
"C. Hepatic veins.",
"D. Aortic valve."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following features is NOT visible on the model of the heart in the video?\nOption:\nA. Coronary arteries.\nB. Pulmonary veins.\nC. Hepatic veins.\nD. Aortic valve.\nAnswer with the option's letter from the given choices directly.",
73,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "025-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 74,
"target": "A",
"doc": {
"video_id": "025",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ebzbKa32kuk",
"videoID": "ebzbKa32kuk",
"question_id": "025-3",
"task_type": "Object Reasoning",
"question": "Which of the following best describes the purpose of the content shown in the video?",
"options": [
"A. For medical education.",
"B. For psychological prevention and treatment.",
"C. For diagnosis of special cases.",
"D. To guide home exercise programmes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the purpose of the content shown in the video?\nOption:\nA. For medical education.\nB. For psychological prevention and treatment.\nC. For diagnosis of special cases.\nD. To guide home exercise programmes.\nAnswer with the option's letter from the given choices directly.",
74,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "025-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 75,
"target": "C",
"doc": {
"video_id": "026",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=XWfqBKeC0g8",
"videoID": "XWfqBKeC0g8",
"question_id": "026-1",
"task_type": "Object Recognition",
"question": "What type of cells are highlighted in the video?",
"options": [
"A. Lung cancer cells.",
"B. Brain cancer cells.",
"C. Pancreatic cancer cells.",
"D. Stomach cancer cells."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of cells are highlighted in the video?\nOption:\nA. Lung cancer cells.\nB. Brain cancer cells.\nC. Pancreatic cancer cells.\nD. Stomach cancer cells.\nAnswer with the option's letter from the given choices directly.",
75,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "026-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 76,
"target": "A",
"doc": {
"video_id": "026",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=XWfqBKeC0g8",
"videoID": "XWfqBKeC0g8",
"question_id": "026-2",
"task_type": "OCR Problems",
"question": "What level of magnification is used to view the cells in the video?",
"options": [
"A. 2000x magnification.",
"B. 1500x magnification.",
"C. 1000x magnification.",
"D. 500x magnification."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What level of magnification is used to view the cells in the video?\nOption:\nA. 2000x magnification.\nB. 1500x magnification.\nC. 1000x magnification.\nD. 500x magnification.\nAnswer with the option's letter from the given choices directly.",
76,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "026-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 77,
"target": "D",
"doc": {
"video_id": "026",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=XWfqBKeC0g8",
"videoID": "XWfqBKeC0g8",
"question_id": "026-3",
"task_type": "Action Recognition",
"question": "What process might these cells be undergoing?",
"options": [
"A. Amitosis.",
"B. Photosynthesis.",
"C. Enzymatic reactions.",
"D. Mitosis."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What process might these cells be undergoing?\nOption:\nA. Amitosis.\nB. Photosynthesis.\nC. Enzymatic reactions.\nD. Mitosis.\nAnswer with the option's letter from the given choices directly.",
77,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "026-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 78,
"target": "A",
"doc": {
"video_id": "027",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=tS4a6I4-Yjo",
"videoID": "tS4a6I4-Yjo",
"question_id": "027-1",
"task_type": "Temporal Perception",
"question": "Why does the woman need to drink water at the beginning of the video?",
"options": [
"A. Because she is preparing for a medical examination.",
"B. Because she went through an operation and needed to be hydrated.",
"C. Because she is sick and takes medication that requires a lot of water for adequate reflection.",
"D. Instead of water, she drinks a special medicine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the woman need to drink water at the beginning of the video?\nOption:\nA. Because she is preparing for a medical examination.\nB. Because she went through an operation and needed to be hydrated.\nC. Because she is sick and takes medication that requires a lot of water for adequate reflection.\nD. Instead of water, she drinks a special medicine.\nAnswer with the option's letter from the given choices directly.",
78,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "027-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 79,
"target": "C",
"doc": {
"video_id": "027",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=tS4a6I4-Yjo",
"videoID": "tS4a6I4-Yjo",
"question_id": "027-2",
"task_type": "OCR Problems",
"question": "Which organization is responsible for publishing the video?",
"options": [
"A. UCLA Medical Center.",
"B. Eye Research USA.",
"C. Cancer Research UK.",
"D. South Africa Medical Center."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which organization is responsible for publishing the video?\nOption:\nA. UCLA Medical Center.\nB. Eye Research USA.\nC. Cancer Research UK.\nD. South Africa Medical Center.\nAnswer with the option's letter from the given choices directly.",
79,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "027-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 80,
"target": "B",
"doc": {
"video_id": "027",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=tS4a6I4-Yjo",
"videoID": "tS4a6I4-Yjo",
"question_id": "027-3",
"task_type": "Action Reasoning",
"question": "Which of the following statements can be inferred from the video?",
"options": [
"A. The patient is undergoing an ECG.",
"B. The patient is undergoing a CT scan.",
"C. The healthcare worker is teaching a medical student.",
"D. The healthcare worker is conducting a medical experiment."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements can be inferred from the video?\nOption:\nA. The patient is undergoing an ECG.\nB. The patient is undergoing a CT scan.\nC. The healthcare worker is teaching a medical student.\nD. The healthcare worker is conducting a medical experiment.\nAnswer with the option's letter from the given choices directly.",
80,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "027-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 81,
"target": "A",
"doc": {
"video_id": "028",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=TVkck1ACKDQ",
"videoID": "TVkck1ACKDQ",
"question_id": "028-1",
"task_type": "Object Recognition",
"question": "What organ did the woman in the video remove from the medical model?",
"options": [
"A. She removed the liver.",
"B. She removed the lungs.",
"C. She removed the kidney.",
"D. She removed the heart."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What organ did the woman in the video remove from the medical model?\nOption:\nA. She removed the liver.\nB. She removed the lungs.\nC. She removed the kidney.\nD. She removed the heart.\nAnswer with the option's letter from the given choices directly.",
81,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "028-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 82,
"target": "B",
"doc": {
"video_id": "028",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=TVkck1ACKDQ",
"videoID": "TVkck1ACKDQ",
"question_id": "028-2",
"task_type": "Information Synopsis",
"question": "What is the purpose of the video?",
"options": [
"A. Promotion for a hospital.",
"B. Admissions brochure for a school.",
"C. Introduction of a newly established medical laboratory.",
"D. Introduction of excellent characters."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the video?\nOption:\nA. Promotion for a hospital.\nB. Admissions brochure for a school.\nC. Introduction of a newly established medical laboratory.\nD. Introduction of excellent characters.\nAnswer with the option's letter from the given choices directly.",
82,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "028-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 83,
"target": "C",
"doc": {
"video_id": "028",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=TVkck1ACKDQ",
"videoID": "TVkck1ACKDQ",
"question_id": "028-3",
"task_type": "Counting Problem",
"question": "How many different people appear in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 1.",
"D. 2."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different people appear in the video?\nOption:\nA. 3.\nB. 4.\nC. 1.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
83,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "028-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 84,
"target": "C",
"doc": {
"video_id": "029",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=JbDRs0ja3PE",
"videoID": "JbDRs0ja3PE",
"question_id": "029-1",
"task_type": "Object Recognition",
"question": "Which human organ is visible in the video?",
"options": [
"A. Stomach.",
"B. Liver.",
"C. Lungs.",
"D. Kidneys."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which human organ is visible in the video?\nOption:\nA. Stomach.\nB. Liver.\nC. Lungs.\nD. Kidneys.\nAnswer with the option's letter from the given choices directly.",
84,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "029-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 85,
"target": "B",
"doc": {
"video_id": "029",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=JbDRs0ja3PE",
"videoID": "JbDRs0ja3PE",
"question_id": "029-2",
"task_type": "Information Synopsis",
"question": "What is the purpose of the video?",
"options": [
"A. To warn about environmental issues.",
"B. To offer health-related advice.",
"C. To provide a weather update.",
"D. To advertise a new movie."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the video?\nOption:\nA. To warn about environmental issues.\nB. To offer health-related advice.\nC. To provide a weather update.\nD. To advertise a new movie.\nAnswer with the option's letter from the given choices directly.",
85,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "029-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 86,
"target": "D",
"doc": {
"video_id": "029",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=JbDRs0ja3PE",
"videoID": "JbDRs0ja3PE",
"question_id": "029-3",
"task_type": "Information Synopsis",
"question": "What is the most likely message being conveyed by the video?",
"options": [
"A. The beauty of natural phenomena.",
"B. The excitement of scientific discovery.",
"C. The global impact of wildfires.",
"D. The dangers of drug interactions."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most likely message being conveyed by the video?\nOption:\nA. The beauty of natural phenomena.\nB. The excitement of scientific discovery.\nC. The global impact of wildfires.\nD. The dangers of drug interactions.\nAnswer with the option's letter from the given choices directly.",
86,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "029-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 87,
"target": "A",
"doc": {
"video_id": "030",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=39HTpUG1MwQ",
"videoID": "39HTpUG1MwQ",
"question_id": "030-1",
"task_type": "Temporal Perception",
"question": "What organelle is depicted at the beginning of the video?",
"options": [
"A. Mitochondrion.",
"B. Ribosome.",
"C. Chloroplast.",
"D. Golgi apparatus."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What organelle is depicted at the beginning of the video?\nOption:\nA. Mitochondrion.\nB. Ribosome.\nC. Chloroplast.\nD. Golgi apparatus.\nAnswer with the option's letter from the given choices directly.",
87,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "030-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 88,
"target": "D",
"doc": {
"video_id": "030",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=39HTpUG1MwQ",
"videoID": "39HTpUG1MwQ",
"question_id": "030-2",
"task_type": "Object Recognition",
"question": "What color is the ATP in the video?",
"options": [
"A. Black.",
"B. Green.",
"C. Blue.",
"D. Red."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the ATP in the video?\nOption:\nA. Black.\nB. Green.\nC. Blue.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
88,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "030-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 89,
"target": "B",
"doc": {
"video_id": "030",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=39HTpUG1MwQ",
"videoID": "39HTpUG1MwQ",
"question_id": "030-3",
"task_type": "Attribute Perception",
"question": "Based on the video, what is the main function of the organelle featured in the footage?",
"options": [
"A. Protein synthesis.",
"B. Energy production.",
"C. Photosynthesis.",
"D. Lipid metabolism."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is the main function of the organelle featured in the footage?\nOption:\nA. Protein synthesis.\nB. Energy production.\nC. Photosynthesis.\nD. Lipid metabolism.\nAnswer with the option's letter from the given choices directly.",
89,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "030-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 90,
"target": "D",
"doc": {
"video_id": "031",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=TPS22HRRY1k",
"videoID": "TPS22HRRY1k",
"question_id": "031-1",
"task_type": "Attribute Perception",
"question": "Which of the following best describes the types of assets included within Will's \"Assets Under Management\" visual bar in the video, showcasing the composition of the mutual fund's holdings?",
"options": [
"A. Real Estate (blue), Precious Metals (red), and Foreign Currencies (yellow).",
"B. Real Estate (blue), Foreign Currencies (red), and Precious Metals (yellow).",
"C. Bonds (blue), Stocks (red), and Cash (yellow).",
"D. Stocks (blue), Bonds (red), and Cash (yellow)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the types of assets included within Will's \"Assets Under Management\" visual bar in the video, showcasing the composition of the mutual fund's holdings?\nOption:\nA. Real Estate (blue), Precious Metals (red), and Foreign Currencies (yellow).\nB. Real Estate (blue), Foreign Currencies (red), and Precious Metals (yellow).\nC. Bonds (blue), Stocks (red), and Cash (yellow).\nD. Stocks (blue), Bonds (red), and Cash (yellow).\nAnswer with the option's letter from the given choices directly.",
90,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "031-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 91,
"target": "B",
"doc": {
"video_id": "031",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=TPS22HRRY1k",
"videoID": "TPS22HRRY1k",
"question_id": "031-2",
"task_type": "Information Synopsis",
"question": "Which best summarizes the content of the video?",
"options": [
"A. The process of registering a new mutual fund with the regulator and the associated legal requirements.",
"B. A basic explanation of what mutual funds are, and how they work.",
"C. A basic explanation of what national debts are, how they work and some key terms management.",
"D. The benefits and risks of investing in mutual funds as compared to other investment options such as individual stocks and bonds."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which best summarizes the content of the video?\nOption:\nA. The process of registering a new mutual fund with the regulator and the associated legal requirements.\nB. A basic explanation of what mutual funds are, and how they work.\nC. A basic explanation of what national debts are, how they work and some key terms management.\nD. The benefits and risks of investing in mutual funds as compared to other investment options such as individual stocks and bonds.\nAnswer with the option's letter from the given choices directly.",
91,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "031-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 92,
"target": "A",
"doc": {
"video_id": "031",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=TPS22HRRY1k",
"videoID": "TPS22HRRY1k",
"question_id": "031-3",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following statements is true?",
"options": [
"A. People no longer need to choose their own diversified investment portfolio after buying a mutual fund.",
"B. Investors in a closed-ended fund can withdraw their investment without trading to another investor.",
"C. Closed-ended funds can continually accept more funds from outsiders.",
"D. Anyone can set up a mutual fund at will."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements is true?\nOption:\nA. People no longer need to choose their own diversified investment portfolio after buying a mutual fund.\nB. Investors in a closed-ended fund can withdraw their investment without trading to another investor.\nC. Closed-ended funds can continually accept more funds from outsiders.\nD. Anyone can set up a mutual fund at will.\nAnswer with the option's letter from the given choices directly.",
92,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "031-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 93,
"target": "B",
"doc": {
"video_id": "032",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=X30VlGh3HwQ",
"videoID": "X30VlGh3HwQ",
"question_id": "032-1",
"task_type": "Temporal Perception",
"question": "What is the video illustrating when there are three buildings appear in the video at the same time?",
"options": [
"A. The debt relationship between the federal government, municipalities, and businesses.",
"B. The issuers in the bond market include the federal government, municipalities, and corporations.",
"C. Presentations by and between the federal government, municipalities, and businesses.",
"D. The issuance of bonds to fund the construction of infrastructure."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video illustrating when there are three buildings appear in the video at the same time?\nOption:\nA. The debt relationship between the federal government, municipalities, and businesses.\nB. The issuers in the bond market include the federal government, municipalities, and corporations.\nC. Presentations by and between the federal government, municipalities, and businesses.\nD. The issuance of bonds to fund the construction of infrastructure.\nAnswer with the option's letter from the given choices directly.",
93,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "032-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 94,
"target": "B",
"doc": {
"video_id": "032",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=X30VlGh3HwQ",
"videoID": "X30VlGh3HwQ",
"question_id": "032-2",
"task_type": "Object Reasoning",
"question": "Based on data presented in the video, what was the approximate difference in value between the US bond market and the US stock market as of 2020?",
"options": [
"A. $5 trillion.",
"B. $10 trillion.",
"C. $15 trillion.",
"D. $20 trillion."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on data presented in the video, what was the approximate difference in value between the US bond market and the US stock market as of 2020?\nOption:\nA. $5 trillion.\nB. $10 trillion.\nC. $15 trillion.\nD. $20 trillion.\nAnswer with the option's letter from the given choices directly.",
94,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "032-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 95,
"target": "A",
"doc": {
"video_id": "032",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=X30VlGh3HwQ",
"videoID": "X30VlGh3HwQ",
"question_id": "032-3",
"task_type": "Counting Problem",
"question": "How many laptops are produced in the company factory in the video animation?",
"options": [
"A. 9.",
"B. 6.",
"C. 8.",
"D. 7."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many laptops are produced in the company factory in the video animation?\nOption:\nA. 9.\nB. 6.\nC. 8.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
95,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "032-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 96,
"target": "C",
"doc": {
"video_id": "033",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=zKyWRRJQbkM",
"videoID": "zKyWRRJQbkM",
"question_id": "033-1",
"task_type": "Object Recognition",
"question": "Which denominations of bills and/or coins are shown in the animation at the beginning of the video, and what is the total value of these denominations?",
"options": [
"A. $15.25, including three $5 bills and one quarter.",
"B. $2.25, including two $1 bills and one quarter.",
"C. $7.25, including one $5 bill, two $1 bills and one quarter.",
"D. $10.25, including two $5 bills and one quarter."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which denominations of bills and/or coins are shown in the animation at the beginning of the video, and what is the total value of these denominations?\nOption:\nA. $15.25, including three $5 bills and one quarter.\nB. $2.25, including two $1 bills and one quarter.\nC. $7.25, including one $5 bill, two $1 bills and one quarter.\nD. $10.25, including two $5 bills and one quarter.\nAnswer with the option's letter from the given choices directly.",
96,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "033-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 97,
"target": "D",
"doc": {
"video_id": "033",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=zKyWRRJQbkM",
"videoID": "zKyWRRJQbkM",
"question_id": "033-2",
"task_type": "Object Reasoning",
"question": "Which historical figure is visually depicted in a framed portrait in the video?",
"options": [
"A. Abraham Lincoln.",
"B. George Washington.",
"C. John F. Kennedy.",
"D. Franklin D. Roosevelt."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which historical figure is visually depicted in a framed portrait in the video?\nOption:\nA. Abraham Lincoln.\nB. George Washington.\nC. John F. Kennedy.\nD. Franklin D. Roosevelt.\nAnswer with the option's letter from the given choices directly.",
97,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "033-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 98,
"target": "C",
"doc": {
"video_id": "033",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=zKyWRRJQbkM",
"videoID": "zKyWRRJQbkM",
"question_id": "033-3",
"task_type": "Object Recognition",
"question": "In the video, what visual aid is used to compare the nominal minimum wage with its value in 2019 dollars, demonstrating the impact of inflation?",
"options": [
"A. A map of the United States with states color-coded by their minimum wage laws.",
"B. A line graph showing the trend of the federal minimum wage over time.",
"C. A line graph with two lines.",
"D. A side-by-side comparison of a 2019 dollar bill and a 1938 dollar bill."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what visual aid is used to compare the nominal minimum wage with its value in 2019 dollars, demonstrating the impact of inflation?\nOption:\nA. A map of the United States with states color-coded by their minimum wage laws.\nB. A line graph showing the trend of the federal minimum wage over time.\nC. A line graph with two lines.\nD. A side-by-side comparison of a 2019 dollar bill and a 1938 dollar bill.\nAnswer with the option's letter from the given choices directly.",
98,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "033-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 99,
"target": "D",
"doc": {
"video_id": "034",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=blSnLEZe-sI",
"videoID": "blSnLEZe-sI",
"question_id": "034-1",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. Shares.",
"B. Central bank.",
"C. Financial crisis.",
"D. Quantitative easing."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. Shares.\nB. Central bank.\nC. Financial crisis.\nD. Quantitative easing.\nAnswer with the option's letter from the given choices directly.",
99,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "034-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 100,
"target": "A",
"doc": {
"video_id": "034",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=blSnLEZe-sI",
"videoID": "blSnLEZe-sI",
"question_id": "034-2",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. The overall stock market could see stronger gains because of quantitative easing.",
"B. Quantitative easing requires the central bank to print money.",
"C. QE makes those so-called safe assets more attractive.",
"D. Quantitative easing leads to more money in the system."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. The overall stock market could see stronger gains because of quantitative easing.\nB. Quantitative easing requires the central bank to print money.\nC. QE makes those so-called safe assets more attractive.\nD. Quantitative easing leads to more money in the system.\nAnswer with the option's letter from the given choices directly.",
100,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "034-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 101,
"target": "B",
"doc": {
"video_id": "034",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=blSnLEZe-sI",
"videoID": "blSnLEZe-sI",
"question_id": "034-3",
"task_type": "Object Reasoning",
"question": "What does the combination of a bond certificate and the UK flag most likely symbolize in the video?",
"options": [
"A. The UK's dependence on foreign investments.",
"B. The UK's use of bonds for quantitative easing.",
"C. The UK's high national debt.",
"D. The UK's strong bond market."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the combination of a bond certificate and the UK flag most likely symbolize in the video?\nOption:\nA. The UK's dependence on foreign investments.\nB. The UK's use of bonds for quantitative easing.\nC. The UK's high national debt.\nD. The UK's strong bond market.\nAnswer with the option's letter from the given choices directly.",
101,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "034-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 102,
"target": "B",
"doc": {
"video_id": "035",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=Gh_GtYtQoVI",
"videoID": "Gh_GtYtQoVI",
"question_id": "035-1",
"task_type": "Object Reasoning",
"question": "What is the function of the stopwatch in the upper right corner of the video?",
"options": [
"A. To indicate the total duration of this video.",
"B. The video author made the annotations in order to explain clearly what finance is within one minute.",
"C. To display the current time when the narrator is sharing financial knowledge.",
"D. It is a countdown timer."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the stopwatch in the upper right corner of the video?\nOption:\nA. To indicate the total duration of this video.\nB. The video author made the annotations in order to explain clearly what finance is within one minute.\nC. To display the current time when the narrator is sharing financial knowledge.\nD. It is a countdown timer.\nAnswer with the option's letter from the given choices directly.",
102,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "035-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 103,
"target": "D",
"doc": {
"video_id": "035",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=Gh_GtYtQoVI",
"videoID": "Gh_GtYtQoVI",
"question_id": "035-2",
"task_type": "Spatial Perception",
"question": "In what direction does the video author draw the various financial institutions?",
"options": [
"A. From left to right.",
"B. From right to left.",
"C. From bottom to top.",
"D. From top to bottom."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what direction does the video author draw the various financial institutions?\nOption:\nA. From left to right.\nB. From right to left.\nC. From bottom to top.\nD. From top to bottom.\nAnswer with the option's letter from the given choices directly.",
103,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "035-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 104,
"target": "A",
"doc": {
"video_id": "035",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=Gh_GtYtQoVI",
"videoID": "Gh_GtYtQoVI",
"question_id": "035-3",
"task_type": "Information Synopsis",
"question": "From which aspect does the video describe the role of finance?",
"options": [
"A. Business investment.",
"B. Market pricing.",
"C. Resource allocation.",
"D. Structural adjustment."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From which aspect does the video describe the role of finance?\nOption:\nA. Business investment.\nB. Market pricing.\nC. Resource allocation.\nD. Structural adjustment.\nAnswer with the option's letter from the given choices directly.",
104,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "035-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 105,
"target": "A",
"doc": {
"video_id": "036",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=MXKDvuEkSF0",
"videoID": "MXKDvuEkSF0",
"question_id": "036-1",
"task_type": "Counting Problem",
"question": "How many new advertising situations are introduced in the video?",
"options": [
"A. 3.",
"B. 1.",
"C. 2.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many new advertising situations are introduced in the video?\nOption:\nA. 3.\nB. 1.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
105,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "036-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 106,
"target": "D",
"doc": {
"video_id": "036",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=MXKDvuEkSF0",
"videoID": "MXKDvuEkSF0",
"question_id": "036-2",
"task_type": "OCR Problems",
"question": "How much did sales increase at Dunkin' Donuts locations situated near the scent marketing dispensers?",
"options": [
"A. 37%.",
"B. 39%.",
"C. 27%.",
"D. 29%."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How much did sales increase at Dunkin' Donuts locations situated near the scent marketing dispensers?\nOption:\nA. 37%.\nB. 39%.\nC. 27%.\nD. 29%.\nAnswer with the option's letter from the given choices directly.",
106,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "036-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 107,
"target": "D",
"doc": {
"video_id": "036",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=MXKDvuEkSF0",
"videoID": "MXKDvuEkSF0",
"question_id": "036-3",
"task_type": "OCR Problems",
"question": "What is the estimated value of the global marketing industry?",
"options": [
"A. $612 billion.",
"B. $45 million.",
"C. $58 million.",
"D. $475 billion."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the estimated value of the global marketing industry?\nOption:\nA. $612 billion.\nB. $45 million.\nC. $58 million.\nD. $475 billion.\nAnswer with the option's letter from the given choices directly.",
107,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "036-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 108,
"target": "D",
"doc": {
"video_id": "037",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=EQ-67udZEeg",
"videoID": "EQ-67udZEeg",
"question_id": "037-1",
"task_type": "OCR Problems",
"question": "Which of the following companies is not featured in the video?",
"options": [
"A. General Electric.",
"B. Starbucks.",
"C. Adobe.",
"D. Pepsi."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following companies is not featured in the video?\nOption:\nA. General Electric.\nB. Starbucks.\nC. Adobe.\nD. Pepsi.\nAnswer with the option's letter from the given choices directly.",
108,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "037-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 109,
"target": "C",
"doc": {
"video_id": "037",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=EQ-67udZEeg",
"videoID": "EQ-67udZEeg",
"question_id": "037-2",
"task_type": "Information Synopsis",
"question": "What visual element is used to showcase companies that are included in the Dow Jones Industrial Average?",
"options": [
"A. Photographs of corporate buildings.",
"B. Animated stock tickers scrolling across the screen.",
"C. Hand-drawn company logos.",
"D. Charts showing stock performance over time."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What visual element is used to showcase companies that are included in the Dow Jones Industrial Average?\nOption:\nA. Photographs of corporate buildings.\nB. Animated stock tickers scrolling across the screen.\nC. Hand-drawn company logos.\nD. Charts showing stock performance over time.\nAnswer with the option's letter from the given choices directly.",
109,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "037-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 110,
"target": "D",
"doc": {
"video_id": "037",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=EQ-67udZEeg",
"videoID": "EQ-67udZEeg",
"question_id": "037-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following is the correct order for the three indices presented in the video?",
"options": [
"A. Dow Jones, Nasdaq, S&P 500.",
"B. Nasdaq, S&P 500, Dow Jones.",
"C. S&P 500, Dow Jones, Nasdaq.",
"D. Dow Jones, S&P 500, Nasdaq."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the correct order for the three indices presented in the video?\nOption:\nA. Dow Jones, Nasdaq, S&P 500.\nB. Nasdaq, S&P 500, Dow Jones.\nC. S&P 500, Dow Jones, Nasdaq.\nD. Dow Jones, S&P 500, Nasdaq.\nAnswer with the option's letter from the given choices directly.",
110,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "037-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 111,
"target": "A",
"doc": {
"video_id": "038",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GqeRnxSuLFI",
"videoID": "GqeRnxSuLFI",
"question_id": "038-1",
"task_type": "Attribute Perception",
"question": "Which color of clothing is worn by the first person selling bananas in the video?",
"options": [
"A. Blue.",
"B. Purple.",
"C. Black.",
"D. Green."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color of clothing is worn by the first person selling bananas in the video?\nOption:\nA. Blue.\nB. Purple.\nC. Black.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
111,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "038-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 112,
"target": "A",
"doc": {
"video_id": "038",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GqeRnxSuLFI",
"videoID": "GqeRnxSuLFI",
"question_id": "038-2",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. When supply is less than demand, you can make money by selling bananas.",
"B. You cannot make money by selling bananas.",
"C. You can make money by selling bananas.",
"D. When there are few people selling bananas, you can make money by selling bananas."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. When supply is less than demand, you can make money by selling bananas.\nB. You cannot make money by selling bananas.\nC. You can make money by selling bananas.\nD. When there are few people selling bananas, you can make money by selling bananas.\nAnswer with the option's letter from the given choices directly.",
112,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "038-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 113,
"target": "A",
"doc": {
"video_id": "038",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GqeRnxSuLFI",
"videoID": "GqeRnxSuLFI",
"question_id": "038-3",
"task_type": "Information Synopsis",
"question": "Which best summarizes the content of the video?",
"options": [
"A. Supply and demand.",
"B. Bananas supply.",
"C. Business competition.",
"D. Banana selling."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which best summarizes the content of the video?\nOption:\nA. Supply and demand.\nB. Bananas supply.\nC. Business competition.\nD. Banana selling.\nAnswer with the option's letter from the given choices directly.",
113,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "038-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 114,
"target": "D",
"doc": {
"video_id": "039",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=XOJKiOw8Xqo",
"videoID": "XOJKiOw8Xqo",
"question_id": "039-1",
"task_type": "Object Recognition",
"question": "Which time is displayed on the clock in the video?",
"options": [
"A. 08:00.",
"B. 03:00.",
"C. 11:00.",
"D. 05:00."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which time is displayed on the clock in the video?\nOption:\nA. 08:00.\nB. 03:00.\nC. 11:00.\nD. 05:00.\nAnswer with the option's letter from the given choices directly.",
114,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "039-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 115,
"target": "C",
"doc": {
"video_id": "039",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=XOJKiOw8Xqo",
"videoID": "XOJKiOw8Xqo",
"question_id": "039-2",
"task_type": "Attribute Perception",
"question": "What is the color of Canada on the map in the video?",
"options": [
"A. Orange.",
"B. Yellow.",
"C. Pink.",
"D. Green."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of Canada on the map in the video?\nOption:\nA. Orange.\nB. Yellow.\nC. Pink.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
115,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "039-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 116,
"target": "B",
"doc": {
"video_id": "039",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=XOJKiOw8Xqo",
"videoID": "XOJKiOw8Xqo",
"question_id": "039-3",
"task_type": "Information Synopsis",
"question": "What is the topic of this video?",
"options": [
"A. Gambling.",
"B. Money laundering.",
"C. Remittance.",
"D. Tax avoidance."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of this video?\nOption:\nA. Gambling.\nB. Money laundering.\nC. Remittance.\nD. Tax avoidance.\nAnswer with the option's letter from the given choices directly.",
116,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "039-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 117,
"target": "C",
"doc": {
"video_id": "040",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=5ksVshqVuiM",
"videoID": "5ksVshqVuiM",
"question_id": "040-1",
"task_type": "Object Reasoning",
"question": "When did the events described in the video take place?",
"options": [
"A. Between 2014 and 2016.",
"B. Between 2020 and 2023.",
"C. Between 1997 and 2002.",
"D. Between 2006 and 2008."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When did the events described in the video take place?\nOption:\nA. Between 2014 and 2016.\nB. Between 2020 and 2023.\nC. Between 1997 and 2002.\nD. Between 2006 and 2008.\nAnswer with the option's letter from the given choices directly.",
117,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "040-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 118,
"target": "B",
"doc": {
"video_id": "040",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=5ksVshqVuiM",
"videoID": "5ksVshqVuiM",
"question_id": "040-2",
"task_type": "OCR Problems",
"question": "What was Amazon's stock price at the peak of the dot-com bubble?",
"options": [
"A. $7/Share.",
"B. $100/Share.",
"C. $600/Share.",
"D. $300/Share."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was Amazon's stock price at the peak of the dot-com bubble?\nOption:\nA. $7/Share.\nB. $100/Share.\nC. $600/Share.\nD. $300/Share.\nAnswer with the option's letter from the given choices directly.",
118,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "040-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 119,
"target": "B",
"doc": {
"video_id": "040",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=5ksVshqVuiM",
"videoID": "5ksVshqVuiM",
"question_id": "040-3",
"task_type": "Information Synopsis",
"question": "What is the main content of this video?",
"options": [
"A. Amazon's development.",
"B. The dot-com Bubble.",
"C. Company competition.",
"D. Stock market."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this video?\nOption:\nA. Amazon's development.\nB. The dot-com Bubble.\nC. Company competition.\nD. Stock market.\nAnswer with the option's letter from the given choices directly.",
119,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "040-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 120,
"target": "A",
"doc": {
"video_id": "041",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=yl5ZXQmrtP0",
"videoID": "yl5ZXQmrtP0",
"question_id": "041-1",
"task_type": "Temporal Reasoning",
"question": "When is the zodiacal light visible from the video?",
"options": [
"A. On March 19.",
"B. On March 24.",
"C. On March 25.",
"D. On March 29."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When is the zodiacal light visible from the video?\nOption:\nA. On March 19.\nB. On March 24.\nC. On March 25.\nD. On March 29.\nAnswer with the option's letter from the given choices directly.",
120,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "041-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 121,
"target": "B",
"doc": {
"video_id": "041",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=yl5ZXQmrtP0",
"videoID": "yl5ZXQmrtP0",
"question_id": "041-2",
"task_type": "Counting Problem",
"question": "How many times is the sun visible in the video?",
"options": [
"A. 2.",
"B. 4.",
"C. 3.",
"D. 1."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times is the sun visible in the video?\nOption:\nA. 2.\nB. 4.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
121,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "041-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 122,
"target": "D",
"doc": {
"video_id": "041",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=yl5ZXQmrtP0",
"videoID": "yl5ZXQmrtP0",
"question_id": "041-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of the video?",
"options": [
"A. Diurnal alternation.",
"B. The relationship between the sun, Earth and moon.",
"C. Seasonal change.",
"D. Astronomical phenomena visible in March."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of the video?\nOption:\nA. Diurnal alternation.\nB. The relationship between the sun, Earth and moon.\nC. Seasonal change.\nD. Astronomical phenomena visible in March.\nAnswer with the option's letter from the given choices directly.",
122,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "041-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 123,
"target": "B",
"doc": {
"video_id": "042",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=kRlhlCWplqk",
"videoID": "kRlhlCWplqk",
"question_id": "042-1",
"task_type": "Counting Problem",
"question": "How many bodies are created after the collision?",
"options": [
"A. 2.",
"B. 3.",
"C. 1.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many bodies are created after the collision?\nOption:\nA. 2.\nB. 3.\nC. 1.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
123,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "042-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 124,
"target": "C",
"doc": {
"video_id": "042",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=kRlhlCWplqk",
"videoID": "kRlhlCWplqk",
"question_id": "042-2",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the topic of the video?",
"options": [
"A. How is the Mars created.",
"B. How is the Earth created.",
"C. How is the moon created.",
"D. Why does Earth collide with Mars."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the topic of the video?\nOption:\nA. How is the Mars created.\nB. How is the Earth created.\nC. How is the moon created.\nD. Why does Earth collide with Mars.\nAnswer with the option's letter from the given choices directly.",
124,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "042-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 125,
"target": "A",
"doc": {
"video_id": "042",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=kRlhlCWplqk",
"videoID": "kRlhlCWplqk",
"question_id": "042-3",
"task_type": "Object Reasoning",
"question": "How was the video most likely made?",
"options": [
"A. Using computer simulation.",
"B. Through special effects production.",
"C. By location shooting.",
"D. Manual model."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How was the video most likely made?\nOption:\nA. Using computer simulation.\nB. Through special effects production.\nC. By location shooting.\nD. Manual model.\nAnswer with the option's letter from the given choices directly.",
125,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "042-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 126,
"target": "D",
"doc": {
"video_id": "043",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=9jjTGpWmc5U",
"videoID": "9jjTGpWmc5U",
"question_id": "043-1",
"task_type": "Spatial Reasoning",
"question": "Why does the roof object fly?",
"options": [
"A. There was an explosion.",
"B. Thrown into the air by a child.",
"C. Tied to a balloon.",
"D. Lunar approach."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the roof object fly?\nOption:\nA. There was an explosion.\nB. Thrown into the air by a child.\nC. Tied to a balloon.\nD. Lunar approach.\nAnswer with the option's letter from the given choices directly.",
126,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "043-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 127,
"target": "C",
"doc": {
"video_id": "043",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=9jjTGpWmc5U",
"videoID": "9jjTGpWmc5U",
"question_id": "043-2",
"task_type": "Counting Problem",
"question": "How many people are watching TV in the video?",
"options": [
"A. 1.",
"B. 3.",
"C. 2.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are watching TV in the video?\nOption:\nA. 1.\nB. 3.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
127,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "043-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 128,
"target": "B",
"doc": {
"video_id": "043",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=9jjTGpWmc5U",
"videoID": "9jjTGpWmc5U",
"question_id": "043-3",
"task_type": "Spatial Reasoning",
"question": "Which type of disaster is portrayed in the video?",
"options": [
"A. Man-made disaster.",
"B. Astronomical disaster.",
"C. Meteorological disaster.",
"D. Geological disaster."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which type of disaster is portrayed in the video?\nOption:\nA. Man-made disaster.\nB. Astronomical disaster.\nC. Meteorological disaster.\nD. Geological disaster.\nAnswer with the option's letter from the given choices directly.",
128,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "043-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 129,
"target": "A",
"doc": {
"video_id": "044",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=fo0Hmch2YS0",
"videoID": "fo0Hmch2YS0",
"question_id": "044-1",
"task_type": "Counting Problem",
"question": "How many people can be seen in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people can be seen in the video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
129,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "044-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 130,
"target": "C",
"doc": {
"video_id": "044",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=fo0Hmch2YS0",
"videoID": "fo0Hmch2YS0",
"question_id": "044-2",
"task_type": "Spatial Reasoning",
"question": "What is the most likely cause of the disaster in the video?",
"options": [
"A. Warfare.",
"B. Meteorite attack.",
"C. Lunar impact.",
"D. Alien attack."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most likely cause of the disaster in the video?\nOption:\nA. Warfare.\nB. Meteorite attack.\nC. Lunar impact.\nD. Alien attack.\nAnswer with the option's letter from the given choices directly.",
130,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "044-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 131,
"target": "D",
"doc": {
"video_id": "044",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=fo0Hmch2YS0",
"videoID": "fo0Hmch2YS0",
"question_id": "044-3",
"task_type": "Action Recognition",
"question": "What are the people in the video doing?",
"options": [
"A. Exploring the earth.",
"B. Saving the earth.",
"C. Enjoying a meteor shower.",
"D. Escaping from a disaster."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people in the video doing?\nOption:\nA. Exploring the earth.\nB. Saving the earth.\nC. Enjoying a meteor shower.\nD. Escaping from a disaster.\nAnswer with the option's letter from the given choices directly.",
131,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "044-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 132,
"target": "C",
"doc": {
"video_id": "045",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=990ci2H9BBc",
"videoID": "990ci2H9BBc",
"question_id": "045-1",
"task_type": "Object Recognition",
"question": "What astronomical phenomenon is depicted in the video?",
"options": [
"A. Solar flare.",
"B. Observed black hole.",
"C. Total solar eclipse.",
"D. Mirage."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What astronomical phenomenon is depicted in the video?\nOption:\nA. Solar flare.\nB. Observed black hole.\nC. Total solar eclipse.\nD. Mirage.\nAnswer with the option's letter from the given choices directly.",
132,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "045-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 133,
"target": "B",
"doc": {
"video_id": "045",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=990ci2H9BBc",
"videoID": "990ci2H9BBc",
"question_id": "045-2",
"task_type": "Action Recognition",
"question": "Which of the following is NOT known to occur during a total solar eclipse, according to the video?",
"options": [
"A. Temperature drop.",
"B. Appearance of snakes.",
"C. Appearance of planets and stars.",
"D. Appearance of rare shadow bands."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT known to occur during a total solar eclipse, according to the video?\nOption:\nA. Temperature drop.\nB. Appearance of snakes.\nC. Appearance of planets and stars.\nD. Appearance of rare shadow bands.\nAnswer with the option's letter from the given choices directly.",
133,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "045-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 134,
"target": "A",
"doc": {
"video_id": "045",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=990ci2H9BBc",
"videoID": "990ci2H9BBc",
"question_id": "045-3",
"task_type": "Object Recognition",
"question": "Where were the people in the video located while observing the total solar eclipse?",
"options": [
"A. Stadium.",
"B. Nature park.",
"C. Street.",
"D. Rooftop."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where were the people in the video located while observing the total solar eclipse?\nOption:\nA. Stadium.\nB. Nature park.\nC. Street.\nD. Rooftop.\nAnswer with the option's letter from the given choices directly.",
134,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "045-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 135,
"target": "D",
"doc": {
"video_id": "046",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=gD0MvkDGAMg",
"videoID": "gD0MvkDGAMg",
"question_id": "046-1",
"task_type": "Information Synopsis",
"question": "What's the main point of this video?",
"options": [
"A. The role of supercomputers in the development of astronomy.",
"B. Astronomy enrollment outreach.",
"C. The role of atmospheric science.",
"D. The role of astronomical disciplines."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the main point of this video?\nOption:\nA. The role of supercomputers in the development of astronomy.\nB. Astronomy enrollment outreach.\nC. The role of atmospheric science.\nD. The role of astronomical disciplines.\nAnswer with the option's letter from the given choices directly.",
135,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "046-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 136,
"target": "A",
"doc": {
"video_id": "046",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=gD0MvkDGAMg",
"videoID": "gD0MvkDGAMg",
"question_id": "046-2",
"task_type": "Object Recognition",
"question": "What is the first celestial object shown in the video?",
"options": [
"A. Earth.",
"B. Moon.",
"C. Jupiter.",
"D. Saturn."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first celestial object shown in the video?\nOption:\nA. Earth.\nB. Moon.\nC. Jupiter.\nD. Saturn.\nAnswer with the option's letter from the given choices directly.",
136,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "046-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 137,
"target": "B",
"doc": {
"video_id": "046",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=gD0MvkDGAMg",
"videoID": "gD0MvkDGAMg",
"question_id": "046-3",
"task_type": "Counting Problem",
"question": "How many times does the instructor in the video appear in different scenarios?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the instructor in the video appear in different scenarios?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
137,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "046-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 138,
"target": "A",
"doc": {
"video_id": "047",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=WTQ7_e8bnfM",
"videoID": "WTQ7_e8bnfM",
"question_id": "047-1",
"task_type": "Action Reasoning",
"question": "Which of the following emotions might the individuals in the video be feeling?",
"options": [
"A. Fear from an impending cataclysmic event.",
"B. Excitement about an airshow.",
"C. Admiration of a natural celestial event.",
"D. Appreciation of the natural landscape."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following emotions might the individuals in the video be feeling?\nOption:\nA. Fear from an impending cataclysmic event.\nB. Excitement about an airshow.\nC. Admiration of a natural celestial event.\nD. Appreciation of the natural landscape.\nAnswer with the option's letter from the given choices directly.",
138,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "047-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 139,
"target": "B",
"doc": {
"video_id": "047",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=WTQ7_e8bnfM",
"videoID": "WTQ7_e8bnfM",
"question_id": "047-2",
"task_type": "Information Synopsis",
"question": "What is depicted in the video?",
"options": [
"A. A nuclear explosion on Earth viewed from space.",
"B. The impact of a giant asteroid on Earth.",
"C. A volcanic eruption on the Earth's surface.",
"D. The sunrise as seen from space."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is depicted in the video?\nOption:\nA. A nuclear explosion on Earth viewed from space.\nB. The impact of a giant asteroid on Earth.\nC. A volcanic eruption on the Earth's surface.\nD. The sunrise as seen from space.\nAnswer with the option's letter from the given choices directly.",
139,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "047-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 140,
"target": "C",
"doc": {
"video_id": "047",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=WTQ7_e8bnfM",
"videoID": "WTQ7_e8bnfM",
"question_id": "047-3",
"task_type": "Spatial Reasoning",
"question": "What might be the immediate consequence of such an event according to the video?",
"options": [
"A. The onset of a new ice age.",
"B. A slight change in weather patterns.",
"C. Global seismic and tidal disturbances.",
"D. An increase in solar radiation entering the atmosphere."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What might be the immediate consequence of such an event according to the video?\nOption:\nA. The onset of a new ice age.\nB. A slight change in weather patterns.\nC. Global seismic and tidal disturbances.\nD. An increase in solar radiation entering the atmosphere.\nAnswer with the option's letter from the given choices directly.",
140,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "047-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 141,
"target": "D",
"doc": {
"video_id": "048",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=M7WOXFvwbSY",
"videoID": "M7WOXFvwbSY",
"question_id": "048-1",
"task_type": "Object Recognition",
"question": "Which galaxies are depicted in the video?",
"options": [
"A. The Sombrero Galaxy and the Triangulum Galaxy.",
"B. The Milky Way and the Whirlpool Galaxy.",
"C. The Triangulum Galaxy and the Whirlpool Galaxy.",
"D. The Andromeda Galaxy and the Milky Way."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which galaxies are depicted in the video?\nOption:\nA. The Sombrero Galaxy and the Triangulum Galaxy.\nB. The Milky Way and the Whirlpool Galaxy.\nC. The Triangulum Galaxy and the Whirlpool Galaxy.\nD. The Andromeda Galaxy and the Milky Way.\nAnswer with the option's letter from the given choices directly.",
141,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "048-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 142,
"target": "A",
"doc": {
"video_id": "048",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=M7WOXFvwbSY",
"videoID": "M7WOXFvwbSY",
"question_id": "048-2",
"task_type": "Object Reasoning",
"question": "What significant event is the number at the bottom of the video referring to?",
"options": [
"A. The time until the Andromeda and Milky Way galaxies collide.",
"B. The number of stars in the Milky Way.",
"C. The age of the universe.",
"D. The time remaining before the Sun burns out."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What significant event is the number at the bottom of the video referring to?\nOption:\nA. The time until the Andromeda and Milky Way galaxies collide.\nB. The number of stars in the Milky Way.\nC. The age of the universe.\nD. The time remaining before the Sun burns out.\nAnswer with the option's letter from the given choices directly.",
142,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "048-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 143,
"target": "C",
"doc": {
"video_id": "048",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=M7WOXFvwbSY",
"videoID": "M7WOXFvwbSY",
"question_id": "048-3",
"task_type": "Object Reasoning",
"question": "Which of the following statements can be inferred about the Triangulum Galaxy (M33) based on the information presented in the video about the other two galaxies?",
"options": [
"A. It interacts gravitationally with the Andromeda Galaxy (M31).",
"B. It may be a companion galaxy to the Andromeda Galaxy, influenced by its gravity.",
"C. It is a satellite galaxy of the Andromeda Galaxy.",
"D. It is involved in the collision between the Milky Way and Andromeda."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements can be inferred about the Triangulum Galaxy (M33) based on the information presented in the video about the other two galaxies?\nOption:\nA. It interacts gravitationally with the Andromeda Galaxy (M31).\nB. It may be a companion galaxy to the Andromeda Galaxy, influenced by its gravity.\nC. It is a satellite galaxy of the Andromeda Galaxy.\nD. It is involved in the collision between the Milky Way and Andromeda.\nAnswer with the option's letter from the given choices directly.",
143,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "048-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 144,
"target": "C",
"doc": {
"video_id": "049",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=RvnC--JjDBw",
"videoID": "RvnC--JjDBw",
"question_id": "049-1",
"task_type": "Action Recognition",
"question": "Which activity is being performed by the astronaut in the video?",
"options": [
"A. Boarding a spacecraft.",
"B. Conducting a spacewalk with a tether.",
"C. Using a Manned Maneuvering Unit (MMU) to float free in space.",
"D. Performing maintenance on the exterior of a space station."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity is being performed by the astronaut in the video?\nOption:\nA. Boarding a spacecraft.\nB. Conducting a spacewalk with a tether.\nC. Using a Manned Maneuvering Unit (MMU) to float free in space.\nD. Performing maintenance on the exterior of a space station.\nAnswer with the option's letter from the given choices directly.",
144,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "049-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 145,
"target": "B",
"doc": {
"video_id": "049",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=RvnC--JjDBw",
"videoID": "RvnC--JjDBw",
"question_id": "049-2",
"task_type": "Spatial Reasoning",
"question": "Which feature of the astronaut's equipment indicates that they can move independently in space in the video?",
"options": [
"A. The tether connecting the astronaut to the spacecraft.",
"B. The presence of a jetpack-like device on the astronaut's back.",
"C. The flag on the astronaut's suit.",
"D. The navigation system for determining position and direction of the movement."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which feature of the astronaut's equipment indicates that they can move independently in space in the video?\nOption:\nA. The tether connecting the astronaut to the spacecraft.\nB. The presence of a jetpack-like device on the astronaut's back.\nC. The flag on the astronaut's suit.\nD. The navigation system for determining position and direction of the movement.\nAnswer with the option's letter from the given choices directly.",
145,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "049-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 146,
"target": "D",
"doc": {
"video_id": "049",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=RvnC--JjDBw",
"videoID": "RvnC--JjDBw",
"question_id": "049-3",
"task_type": "Action Reasoning",
"question": "Considering the context of the astronaut in the video, what historic milestone might this represent?",
"options": [
"A. The first space tourist taking a trip to Earth orbit.",
"B. The first person to walk on the Moon.",
"C. The first human to orbit the Earth.",
"D. The first astronaut to float freely in space without a safety tether."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Considering the context of the astronaut in the video, what historic milestone might this represent?\nOption:\nA. The first space tourist taking a trip to Earth orbit.\nB. The first person to walk on the Moon.\nC. The first human to orbit the Earth.\nD. The first astronaut to float freely in space without a safety tether.\nAnswer with the option's letter from the given choices directly.",
146,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "049-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 147,
"target": "A",
"doc": {
"video_id": "050",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=08km9Yqbt-A",
"videoID": "08km9Yqbt-A",
"question_id": "050-1",
"task_type": "Temporal Perception",
"question": "What is the No.2 celestial event shown in the video?",
"options": [
"A. Perseid meteor shower.",
"B. A solar eclipse.",
"C. A lunar eclipse.",
"D. A meteor shower."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the No.2 celestial event shown in the video?\nOption:\nA. Perseid meteor shower.\nB. A solar eclipse.\nC. A lunar eclipse.\nD. A meteor shower.\nAnswer with the option's letter from the given choices directly.",
147,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "050-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 148,
"target": "D",
"doc": {
"video_id": "050",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=08km9Yqbt-A",
"videoID": "08km9Yqbt-A",
"question_id": "050-2",
"task_type": "Temporal Perception",
"question": "On what date is the total eclipse supposed to occur in the video?",
"options": [
"A. August 1, 2008.",
"B. July 22, 2009.",
"C. June 29, 2024.",
"D. April 8, 2024."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: On what date is the total eclipse supposed to occur in the video?\nOption:\nA. August 1, 2008.\nB. July 22, 2009.\nC. June 29, 2024.\nD. April 8, 2024.\nAnswer with the option's letter from the given choices directly.",
148,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "050-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 149,
"target": "B",
"doc": {
"video_id": "050",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=08km9Yqbt-A",
"videoID": "08km9Yqbt-A",
"question_id": "050-3",
"task_type": "Spatial Perception",
"question": "What direction should one face to observe the No.5 celestial event?",
"options": [
"A. North.",
"B. East.",
"C. South.",
"D. West."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What direction should one face to observe the No.5 celestial event?\nOption:\nA. North.\nB. East.\nC. South.\nD. West.\nAnswer with the option's letter from the given choices directly.",
149,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "050-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 150,
"target": "B",
"doc": {
"video_id": "051",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Qyg_91gNHCc",
"videoID": "Qyg_91gNHCc",
"question_id": "051-1",
"task_type": "Counting Problem",
"question": "How many bridges can be spotted in the video?",
"options": [
"A. Three.",
"B. Four.",
"C. Five.",
"D. One."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many bridges can be spotted in the video?\nOption:\nA. Three.\nB. Four.\nC. Five.\nD. One.\nAnswer with the option's letter from the given choices directly.",
150,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "051-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 151,
"target": "A",
"doc": {
"video_id": "051",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Qyg_91gNHCc",
"videoID": "Qyg_91gNHCc",
"question_id": "051-2",
"task_type": "Object Recognition",
"question": "What do the scenes shot on Indus River and Yangze River have in common?",
"options": [
"A. They are all shot at dusk.",
"B. The land is covered with snow in both scene.",
"C. There are bridges on both scene.",
"D. They both contain an island in the middle of the lake."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the scenes shot on Indus River and Yangze River have in common?\nOption:\nA. They are all shot at dusk.\nB. The land is covered with snow in both scene.\nC. There are bridges on both scene.\nD. They both contain an island in the middle of the lake.\nAnswer with the option's letter from the given choices directly.",
151,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "051-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 152,
"target": "C",
"doc": {
"video_id": "051",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Qyg_91gNHCc",
"videoID": "Qyg_91gNHCc",
"question_id": "051-3",
"task_type": "Object Recognition",
"question": "In which river's bank is it possible to see snow?",
"options": [
"A. Lena River.",
"B. Yangze River.",
"C. Ob-Irtysh River.",
"D. Syr Darya River."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which river's bank is it possible to see snow?\nOption:\nA. Lena River.\nB. Yangze River.\nC. Ob-Irtysh River.\nD. Syr Darya River.\nAnswer with the option's letter from the given choices directly.",
152,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "051-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 153,
"target": "D",
"doc": {
"video_id": "052",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=nYLMNQ77FjM",
"videoID": "nYLMNQ77FjM",
"question_id": "052-1",
"task_type": "Spatial Reasoning",
"question": "According to the video, which of the following statement is true?",
"options": [
"A. It often erupts.",
"B. Mt. Fuji is an active volcano and thus very dangerous.",
"C. It is completely covered in snow.",
"D. It can be seen from Tokyo if the weather is fine."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statement is true?\nOption:\nA. It often erupts.\nB. Mt. Fuji is an active volcano and thus very dangerous.\nC. It is completely covered in snow.\nD. It can be seen from Tokyo if the weather is fine.\nAnswer with the option's letter from the given choices directly.",
153,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "052-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 154,
"target": "A",
"doc": {
"video_id": "052",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=nYLMNQ77FjM",
"videoID": "nYLMNQ77FjM",
"question_id": "052-2",
"task_type": "Counting Problem",
"question": "How many climbers are there in the video?",
"options": [
"A. Five.",
"B. Four.",
"C. One.",
"D. Ten."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many climbers are there in the video?\nOption:\nA. Five.\nB. Four.\nC. One.\nD. Ten.\nAnswer with the option's letter from the given choices directly.",
154,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "052-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 155,
"target": "A",
"doc": {
"video_id": "052",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=nYLMNQ77FjM",
"videoID": "nYLMNQ77FjM",
"question_id": "052-3",
"task_type": "Attribute Perception",
"question": "What color of clothing is the painter wearing?",
"options": [
"A. Blue.",
"B. Yellow.",
"C. Black.",
"D. White."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color of clothing is the painter wearing?\nOption:\nA. Blue.\nB. Yellow.\nC. Black.\nD. White.\nAnswer with the option's letter from the given choices directly.",
155,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "052-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 156,
"target": "C",
"doc": {
"video_id": "053",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=DF_J3vCcbBA",
"videoID": "DF_J3vCcbBA",
"question_id": "053-1",
"task_type": "Attribute Perception",
"question": "How does the color of lava change when exposed to the air for a short while?",
"options": [
"A. It turns blue.",
"B. It turns red.",
"C. It turns silver.",
"D. It turns black."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the color of lava change when exposed to the air for a short while?\nOption:\nA. It turns blue.\nB. It turns red.\nC. It turns silver.\nD. It turns black.\nAnswer with the option's letter from the given choices directly.",
156,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "053-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 157,
"target": "B",
"doc": {
"video_id": "053",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=DF_J3vCcbBA",
"videoID": "DF_J3vCcbBA",
"question_id": "053-2",
"task_type": "Action Recognition",
"question": "According to the video, how do the geologists collect lava safely?",
"options": [
"A. They use shovel to collects lava safely.",
"B. They wear black mask when collecting lava.",
"C. They manipulate drones to collect magma.",
"D. They stay far away from the lava."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how do the geologists collect lava safely?\nOption:\nA. They use shovel to collects lava safely.\nB. They wear black mask when collecting lava.\nC. They manipulate drones to collect magma.\nD. They stay far away from the lava.\nAnswer with the option's letter from the given choices directly.",
157,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "053-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 158,
"target": "C",
"doc": {
"video_id": "053",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=DF_J3vCcbBA",
"videoID": "DF_J3vCcbBA",
"question_id": "053-3",
"task_type": "Action Reasoning",
"question": "How do the geologists cool the lava down?",
"options": [
"A. By special oily coolant.",
"B. By leaving the lava on the ground.",
"C. By water.",
"D. By ice."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do the geologists cool the lava down?\nOption:\nA. By special oily coolant.\nB. By leaving the lava on the ground.\nC. By water.\nD. By ice.\nAnswer with the option's letter from the given choices directly.",
158,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "053-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 159,
"target": "B",
"doc": {
"video_id": "054",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=wVlfyhs64IY",
"videoID": "wVlfyhs64IY",
"question_id": "054-1",
"task_type": "Attribute Perception",
"question": "Which of the following statements is not correct according to the video?",
"options": [
"A. Weather in one part of the world can affect the whole world.",
"B. El Niño and La Niña bring similar weather to North America.",
"C. El Niño usually shows up in winter.",
"D. El Niño and La Niña are both relevant to the abnormal intensity of trade winds."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not correct according to the video?\nOption:\nA. Weather in one part of the world can affect the whole world.\nB. El Niño and La Niña bring similar weather to North America.\nC. El Niño usually shows up in winter.\nD. El Niño and La Niña are both relevant to the abnormal intensity of trade winds.\nAnswer with the option's letter from the given choices directly.",
159,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "054-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 160,
"target": "C",
"doc": {
"video_id": "054",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=wVlfyhs64IY",
"videoID": "wVlfyhs64IY",
"question_id": "054-2",
"task_type": "Object Reasoning",
"question": "Which of the following is considered to be the primary cause of El Niño?",
"options": [
"A. West winds strengthen.",
"B. The reduction of upwelling water.",
"C. Trade winds weaken.",
"D. It often happens in winter over North America."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is considered to be the primary cause of El Niño?\nOption:\nA. West winds strengthen.\nB. The reduction of upwelling water.\nC. Trade winds weaken.\nD. It often happens in winter over North America.\nAnswer with the option's letter from the given choices directly.",
160,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "054-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 161,
"target": "A",
"doc": {
"video_id": "054",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=wVlfyhs64IY",
"videoID": "wVlfyhs64IY",
"question_id": "054-3",
"task_type": "Object Reasoning",
"question": "How is cold water formed on the sea surface?",
"options": [
"A. It comes from deeper sea.",
"B. The trade winds bring away the heat on the sea surface.",
"C. The trade wind are colder than the sea water and thus cooling the sea water down.",
"D. Earthquake."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is cold water formed on the sea surface?\nOption:\nA. It comes from deeper sea.\nB. The trade winds bring away the heat on the sea surface.\nC. The trade wind are colder than the sea water and thus cooling the sea water down.\nD. Earthquake.\nAnswer with the option's letter from the given choices directly.",
161,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "054-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 162,
"target": "B",
"doc": {
"video_id": "055",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K0mjUgHKfJo",
"videoID": "K0mjUgHKfJo",
"question_id": "055-1",
"task_type": "Counting Problem",
"question": "How many circles can be observed in the first few seconds of the video?",
"options": [
"A. Four.",
"B. Five.",
"C. Nine.",
"D. Six."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many circles can be observed in the first few seconds of the video?\nOption:\nA. Four.\nB. Five.\nC. Nine.\nD. Six.\nAnswer with the option's letter from the given choices directly.",
162,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "055-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 163,
"target": "C",
"doc": {
"video_id": "055",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K0mjUgHKfJo",
"videoID": "K0mjUgHKfJo",
"question_id": "055-2",
"task_type": "Spatial Reasoning",
"question": "Based on the information provided by the video, what is the reason for the phenomenon that the atmospheric temperature in some places is lower than the sea water temperature?",
"options": [
"A. Because the sea water in these palces absorbs less heat from the sun.",
"B. Because the sun releases less heat in these palces.",
"C. Because the current brings hot water from a warmer place.",
"D. Because the current brings cold water from a colder place."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, what is the reason for the phenomenon that the atmospheric temperature in some places is lower than the sea water temperature?\nOption:\nA. Because the sea water in these palces absorbs less heat from the sun.\nB. Because the sun releases less heat in these palces.\nC. Because the current brings hot water from a warmer place.\nD. Because the current brings cold water from a colder place.\nAnswer with the option's letter from the given choices directly.",
163,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "055-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 164,
"target": "D",
"doc": {
"video_id": "055",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K0mjUgHKfJo",
"videoID": "K0mjUgHKfJo",
"question_id": "055-3",
"task_type": "Object Reasoning",
"question": "What is the red color in the circle most likely to represent?",
"options": [
"A. Seriousness.",
"B. Blood.",
"C. Pain.",
"D. Heat."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the red color in the circle most likely to represent?\nOption:\nA. Seriousness.\nB. Blood.\nC. Pain.\nD. Heat.\nAnswer with the option's letter from the given choices directly.",
164,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "055-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 165,
"target": "B",
"doc": {
"video_id": "056",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=WmVLcj-XKnM",
"videoID": "WmVLcj-XKnM",
"question_id": "056-1",
"task_type": "Object Reasoning",
"question": "What does \"you\" in the video represent?",
"options": [
"A. Nature.",
"B. Human.",
"C. Animals.",
"D. Plants."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does \"you\" in the video represent?\nOption:\nA. Nature.\nB. Human.\nC. Animals.\nD. Plants.\nAnswer with the option's letter from the given choices directly.",
165,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "056-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 166,
"target": "B",
"doc": {
"video_id": "056",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=WmVLcj-XKnM",
"videoID": "WmVLcj-XKnM",
"question_id": "056-2",
"task_type": "Object Recognition",
"question": "Which of the following elements does not appear in the video?",
"options": [
"A. Iceberg.",
"B. Moon.",
"C. Earth.",
"D. River."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following elements does not appear in the video?\nOption:\nA. Iceberg.\nB. Moon.\nC. Earth.\nD. River.\nAnswer with the option's letter from the given choices directly.",
166,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "056-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 167,
"target": "D",
"doc": {
"video_id": "056",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=WmVLcj-XKnM",
"videoID": "WmVLcj-XKnM",
"question_id": "056-3",
"task_type": "Temporal Perception",
"question": "According to the video, what is the focal point in the scene referred to as \"thrive\" by the speaker?",
"options": [
"A. Mountain.",
"B. Forest.",
"C. River.",
"D. Mushroom."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the focal point in the scene referred to as \"thrive\" by the speaker?\nOption:\nA. Mountain.\nB. Forest.\nC. River.\nD. Mushroom.\nAnswer with the option's letter from the given choices directly.",
167,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "056-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 168,
"target": "C",
"doc": {
"video_id": "057",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=zBnKgwnn7i4",
"videoID": "zBnKgwnn7i4",
"question_id": "057-1",
"task_type": "Object Reasoning",
"question": "According to the video, what do the three curved lines extending from bottom up symbolize?",
"options": [
"A. Heat flow.",
"B. Stream.",
"C. Vapor.",
"D. Air."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what do the three curved lines extending from bottom up symbolize?\nOption:\nA. Heat flow.\nB. Stream.\nC. Vapor.\nD. Air.\nAnswer with the option's letter from the given choices directly.",
168,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "057-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 169,
"target": "A",
"doc": {
"video_id": "057",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=zBnKgwnn7i4",
"videoID": "zBnKgwnn7i4",
"question_id": "057-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct sequence of events in the formation of rain?",
"options": [
"A. Evaporation, condensation and then the raindrop forms.",
"B. Condense, evapor and then the raindrop forms.",
"C. The raindrop forms, evapors and then condenses.",
"D. The raindrop forms, condenses and then evapors."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct sequence of events in the formation of rain?\nOption:\nA. Evaporation, condensation and then the raindrop forms.\nB. Condense, evapor and then the raindrop forms.\nC. The raindrop forms, evapors and then condenses.\nD. The raindrop forms, condenses and then evapors.\nAnswer with the option's letter from the given choices directly.",
169,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "057-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 170,
"target": "A",
"doc": {
"video_id": "057",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=zBnKgwnn7i4",
"videoID": "zBnKgwnn7i4",
"question_id": "057-3",
"task_type": "Object Reasoning",
"question": "Which of the following statement is true according to the video?",
"options": [
"A. The raindrop continues to grow up after leaving the cloud.",
"B. The cloud is made of water vapor.",
"C. The river flows towards the left part of the image in the video.",
"D. Rivers don't take part in the process of water cycles."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statement is true according to the video?\nOption:\nA. The raindrop continues to grow up after leaving the cloud.\nB. The cloud is made of water vapor.\nC. The river flows towards the left part of the image in the video.\nD. Rivers don't take part in the process of water cycles.\nAnswer with the option's letter from the given choices directly.",
170,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "057-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 171,
"target": "B",
"doc": {
"video_id": "058",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=D52rTzibFRc",
"videoID": "D52rTzibFRc",
"question_id": "058-1",
"task_type": "Spatial Reasoning",
"question": "Based on the information provided by the video, which one is the most direct cause of the phenomenon that the smoke flows towards the lamp?",
"options": [
"A. Light makes the smoke more obvious to our eyes.",
"B. Near the lamp, the temperature rises.",
"C. There is a vacuum cleaner near the lamp.",
"D. The lamps attracts insects and the flying insects fan the smoke towards the lamp."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, which one is the most direct cause of the phenomenon that the smoke flows towards the lamp?\nOption:\nA. Light makes the smoke more obvious to our eyes.\nB. Near the lamp, the temperature rises.\nC. There is a vacuum cleaner near the lamp.\nD. The lamps attracts insects and the flying insects fan the smoke towards the lamp.\nAnswer with the option's letter from the given choices directly.",
171,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "058-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 172,
"target": "D",
"doc": {
"video_id": "058",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=D52rTzibFRc",
"videoID": "D52rTzibFRc",
"question_id": "058-2",
"task_type": "Attribute Perception",
"question": "Which color are the mountains in the video?",
"options": [
"A. Brown.",
"B. Blue.",
"C. Green.",
"D. Purple."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color are the mountains in the video?\nOption:\nA. Brown.\nB. Blue.\nC. Green.\nD. Purple.\nAnswer with the option's letter from the given choices directly.",
172,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "058-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 173,
"target": "C",
"doc": {
"video_id": "058",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=D52rTzibFRc",
"videoID": "D52rTzibFRc",
"question_id": "058-3",
"task_type": "Spatial Reasoning",
"question": "What is one of the causes of wind according to the video?",
"options": [
"A. The sun is like a lamp.",
"B. The sea is cooler than the ground, and thus the sea wind forms.",
"C. The earth's rotaion changes as it orbits, and thus the temperature in a certain place changes.",
"D. The sun generates solar winds."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is one of the causes of wind according to the video?\nOption:\nA. The sun is like a lamp.\nB. The sea is cooler than the ground, and thus the sea wind forms.\nC. The earth's rotaion changes as it orbits, and thus the temperature in a certain place changes.\nD. The sun generates solar winds.\nAnswer with the option's letter from the given choices directly.",
173,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "058-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 174,
"target": "A",
"doc": {
"video_id": "059",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=NEFG7YcIDcI",
"videoID": "NEFG7YcIDcI",
"question_id": "059-1",
"task_type": "Object Recognition",
"question": "Which ocean do the thunderstorms in the video need to cross before they reach America?",
"options": [
"A. The Atlantic ocean.",
"B. The Pacific Ocean.",
"C. The Indian Ocean.",
"D. The Arctic Ocean."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ocean do the thunderstorms in the video need to cross before they reach America?\nOption:\nA. The Atlantic ocean.\nB. The Pacific Ocean.\nC. The Indian Ocean.\nD. The Arctic Ocean.\nAnswer with the option's letter from the given choices directly.",
174,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "059-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 175,
"target": "B",
"doc": {
"video_id": "059",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=NEFG7YcIDcI",
"videoID": "NEFG7YcIDcI",
"question_id": "059-2",
"task_type": "Attribute Perception",
"question": "What does the speaker in the video dress like?",
"options": [
"A. Black jacket and blue trousers.",
"B. Blue jacket and black trousers.",
"C. Blue hat and black jacket.",
"D. Gray hat and black jacket."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the speaker in the video dress like?\nOption:\nA. Black jacket and blue trousers.\nB. Blue jacket and black trousers.\nC. Blue hat and black jacket.\nD. Gray hat and black jacket.\nAnswer with the option's letter from the given choices directly.",
175,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "059-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 176,
"target": "B",
"doc": {
"video_id": "059",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=NEFG7YcIDcI",
"videoID": "NEFG7YcIDcI",
"question_id": "059-3",
"task_type": "Object Reasoning",
"question": "Which of the following statement is true according to the video?",
"options": [
"A. The wind in eye wall is calmer than outside.",
"B. The eye of hurricane has the most violent wind.",
"C. The hurricane can't be seen form the outer space.",
"D. The hurricane won't grow when the temperature is too high."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statement is true according to the video?\nOption:\nA. The wind in eye wall is calmer than outside.\nB. The eye of hurricane has the most violent wind.\nC. The hurricane can't be seen form the outer space.\nD. The hurricane won't grow when the temperature is too high.\nAnswer with the option's letter from the given choices directly.",
176,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "059-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 177,
"target": "D",
"doc": {
"video_id": "060",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=8TNPeimqOO0",
"videoID": "8TNPeimqOO0",
"question_id": "060-1",
"task_type": "Object Recognition",
"question": "Which element doesn't show up in the video?",
"options": [
"A. Iceberg.",
"B. Orcas.",
"C. Sea bird.",
"D. Polar bear."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which element doesn't show up in the video?\nOption:\nA. Iceberg.\nB. Orcas.\nC. Sea bird.\nD. Polar bear.\nAnswer with the option's letter from the given choices directly.",
177,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "060-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 178,
"target": "A",
"doc": {
"video_id": "060",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=8TNPeimqOO0",
"videoID": "8TNPeimqOO0",
"question_id": "060-2",
"task_type": "Attribute Perception",
"question": "According to the video, which of the following statements is not correct?",
"options": [
"A. The report is about the biggest iceberg that forms recently.",
"B. The iceberg is much bigger than New York city.",
"C. The departure of the iceberg may block penguin colonies.",
"D. Dereck Mueller wears red plaid shirt in the report."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements is not correct?\nOption:\nA. The report is about the biggest iceberg that forms recently.\nB. The iceberg is much bigger than New York city.\nC. The departure of the iceberg may block penguin colonies.\nD. Dereck Mueller wears red plaid shirt in the report.\nAnswer with the option's letter from the given choices directly.",
178,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "060-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 179,
"target": "C",
"doc": {
"video_id": "060",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=8TNPeimqOO0",
"videoID": "8TNPeimqOO0",
"question_id": "060-3",
"task_type": "Spatial Reasoning",
"question": "Which of the following facts is not caused by the iceberg mentioned in the video?",
"options": [
"A. Orcas and sea birds around the iceberg become more active.",
"B. South Georgia Islands is affected.",
"C. The thick mist at sea near the melting iceberg.",
"D. Sea level may rises a little bit."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following facts is not caused by the iceberg mentioned in the video?\nOption:\nA. Orcas and sea birds around the iceberg become more active.\nB. South Georgia Islands is affected.\nC. The thick mist at sea near the melting iceberg.\nD. Sea level may rises a little bit.\nAnswer with the option's letter from the given choices directly.",
179,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "060-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 180,
"target": "B",
"doc": {
"video_id": "061",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=LCtOpCi5r2s",
"videoID": "LCtOpCi5r2s",
"question_id": "061-1",
"task_type": "Counting Problem",
"question": "How many red flags appear in the video?",
"options": [
"A. 7.",
"B. 6.",
"C. 8.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many red flags appear in the video?\nOption:\nA. 7.\nB. 6.\nC. 8.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
180,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "061-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 181,
"target": "C",
"doc": {
"video_id": "061",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=LCtOpCi5r2s",
"videoID": "LCtOpCi5r2s",
"question_id": "061-2",
"task_type": "Object Recognition",
"question": "Which item was not featured in the video?",
"options": [
"A. Balance scale.",
"B. Traffic light.",
"C. Gavel.",
"D. Magnifying glass."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item was not featured in the video?\nOption:\nA. Balance scale.\nB. Traffic light.\nC. Gavel.\nD. Magnifying glass.\nAnswer with the option's letter from the given choices directly.",
181,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "061-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 182,
"target": "A",
"doc": {
"video_id": "061",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=LCtOpCi5r2s",
"videoID": "LCtOpCi5r2s",
"question_id": "061-3",
"task_type": "OCR Problems",
"question": "What is the website that appears at the end of the video?",
"options": [
"A. LEXANIMATA.COM.",
"B. HESHAM.COM.",
"C. ELRAFEI.COM.",
"D. HESHAMELRAFEI.COM."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the website that appears at the end of the video?\nOption:\nA. LEXANIMATA.COM.\nB. HESHAM.COM.\nC. ELRAFEI.COM.\nD. HESHAMELRAFEI.COM.\nAnswer with the option's letter from the given choices directly.",
182,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "061-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 183,
"target": "D",
"doc": {
"video_id": "062",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=jTzKgI68VLc",
"videoID": "jTzKgI68VLc",
"question_id": "062-1",
"task_type": "Spatial Perception",
"question": "In what order did the movement occur in the focused image within the video?",
"options": [
"A. It moved from the bottom right to the top right, then to the center, then to the top left, then to the bottom left, and finally it moved to the center bottom.",
"B. It moved from the bottom left to the top left, then to the top right, then to the bottom right, and finally it moved to the center.",
"C. It moved from the bottom left to the top left, then to the center, then to the top right, then to the bottom right, and finally it moved to the center bottom.",
"D. It moved from the top left to the bottom left, then to the center, then to the top right, then to the bottom right, and finally it moved to the center bottom."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order did the movement occur in the focused image within the video?\nOption:\nA. It moved from the bottom right to the top right, then to the center, then to the top left, then to the bottom left, and finally it moved to the center bottom.\nB. It moved from the bottom left to the top left, then to the top right, then to the bottom right, and finally it moved to the center.\nC. It moved from the bottom left to the top left, then to the center, then to the top right, then to the bottom right, and finally it moved to the center bottom.\nD. It moved from the top left to the bottom left, then to the center, then to the top right, then to the bottom right, and finally it moved to the center bottom.\nAnswer with the option's letter from the given choices directly.",
183,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "062-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 184,
"target": "B",
"doc": {
"video_id": "062",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=jTzKgI68VLc",
"videoID": "jTzKgI68VLc",
"question_id": "062-2",
"task_type": "OCR Problems",
"question": "Which of the following keywords was not mentioned in the video?",
"options": [
"A. World Law.",
"B. World War.",
"C. World Police.",
"D. World Court."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following keywords was not mentioned in the video?\nOption:\nA. World Law.\nB. World War.\nC. World Police.\nD. World Court.\nAnswer with the option's letter from the given choices directly.",
184,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "062-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 185,
"target": "C",
"doc": {
"video_id": "062",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=jTzKgI68VLc",
"videoID": "jTzKgI68VLc",
"question_id": "062-3",
"task_type": "OCR Problems",
"question": "What information is shown in the video regarding Portugal's GDP?",
"options": [
"A. Portugal's GDP for the year 2020 is $350 billion USD.",
"B. Portugal's GDP for the year 2021 is $350 billion USD.",
"C. Portugal's GDP for the year 2020 is $230 billion USD.",
"D. Portugal's GDP for the year 2021 is $250 billion USD."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What information is shown in the video regarding Portugal's GDP?\nOption:\nA. Portugal's GDP for the year 2020 is $350 billion USD.\nB. Portugal's GDP for the year 2021 is $350 billion USD.\nC. Portugal's GDP for the year 2020 is $230 billion USD.\nD. Portugal's GDP for the year 2021 is $250 billion USD.\nAnswer with the option's letter from the given choices directly.",
185,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "062-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 186,
"target": "A",
"doc": {
"video_id": "063",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=eTLMWnsStuk",
"videoID": "eTLMWnsStuk",
"question_id": "063-1",
"task_type": "Counting Problem",
"question": "How many individuals are visible in the introductory shot of the video?",
"options": [
"A. 5.",
"B. 6.",
"C. 7.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individuals are visible in the introductory shot of the video?\nOption:\nA. 5.\nB. 6.\nC. 7.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
186,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "063-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 187,
"target": "D",
"doc": {
"video_id": "063",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=eTLMWnsStuk",
"videoID": "eTLMWnsStuk",
"question_id": "063-2",
"task_type": "Attribute Perception",
"question": "What is the fate of the double-story building showcased in the video?",
"options": [
"A. The building in the video is letting out air-conditioning.",
"B. The building is hit by artillery shells.",
"C. The building is normal and nothing is happening.",
"D. The building is on fire and smoke is pouring out."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the fate of the double-story building showcased in the video?\nOption:\nA. The building in the video is letting out air-conditioning.\nB. The building is hit by artillery shells.\nC. The building is normal and nothing is happening.\nD. The building is on fire and smoke is pouring out.\nAnswer with the option's letter from the given choices directly.",
187,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "063-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 188,
"target": "B",
"doc": {
"video_id": "063",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=eTLMWnsStuk",
"videoID": "eTLMWnsStuk",
"question_id": "063-3",
"task_type": "Information Synopsis",
"question": "What does this video introduce?",
"options": [
"A. American Civil War.",
"B. The Prize Cases.",
"C. Age of Discovery.",
"D. President election."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video introduce?\nOption:\nA. American Civil War.\nB. The Prize Cases.\nC. Age of Discovery.\nD. President election.\nAnswer with the option's letter from the given choices directly.",
188,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "063-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 189,
"target": "C",
"doc": {
"video_id": "064",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=9_M4bNOxsYs",
"videoID": "9_M4bNOxsYs",
"question_id": "064-1",
"task_type": "Information Synopsis",
"question": "Which option best represents the main topic of this video?",
"options": [
"A. Laws of different countries.",
"B. System differences in different countries.",
"C. Common law and civil law.",
"D. Legal development process."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which option best represents the main topic of this video?\nOption:\nA. Laws of different countries.\nB. System differences in different countries.\nC. Common law and civil law.\nD. Legal development process.\nAnswer with the option's letter from the given choices directly.",
189,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "064-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 190,
"target": "A",
"doc": {
"video_id": "064",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=9_M4bNOxsYs",
"videoID": "9_M4bNOxsYs",
"question_id": "064-2",
"task_type": "Action Recognition",
"question": "What are the people wearing a white shirt and a person wearing a black shirt doing together in the video?",
"options": [
"A. Confrontation.",
"B. Dancing.",
"C. Shake hands.",
"D. Running."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people wearing a white shirt and a person wearing a black shirt doing together in the video?\nOption:\nA. Confrontation.\nB. Dancing.\nC. Shake hands.\nD. Running.\nAnswer with the option's letter from the given choices directly.",
190,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "064-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 191,
"target": "D",
"doc": {
"video_id": "064",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=9_M4bNOxsYs",
"videoID": "9_M4bNOxsYs",
"question_id": "064-3",
"task_type": "Object Reasoning",
"question": "What is the identity of the individual sitting in the center of the video, wearing glasses, and holding a small hammer?",
"options": [
"A. Clerk.",
"B. Worker.",
"C. Defense counsel.",
"D. Judge."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the identity of the individual sitting in the center of the video, wearing glasses, and holding a small hammer?\nOption:\nA. Clerk.\nB. Worker.\nC. Defense counsel.\nD. Judge.\nAnswer with the option's letter from the given choices directly.",
191,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "064-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 192,
"target": "B",
"doc": {
"video_id": "065",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=fo-mVfOsC-E",
"videoID": "fo-mVfOsC-E",
"question_id": "065-1",
"task_type": "Information Synopsis",
"question": "What topic is introduced in the video?",
"options": [
"A. Types of cases handled by courts.",
"B. The roles of different people in court.",
"C. The process of courtroom trial.",
"D. A criminal case."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What topic is introduced in the video?\nOption:\nA. Types of cases handled by courts.\nB. The roles of different people in court.\nC. The process of courtroom trial.\nD. A criminal case.\nAnswer with the option's letter from the given choices directly.",
192,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "065-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 193,
"target": "C",
"doc": {
"video_id": "065",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=fo-mVfOsC-E",
"videoID": "fo-mVfOsC-E",
"question_id": "065-2",
"task_type": "Object Reasoning",
"question": "What is the role of the woman in the video with short hair, wearing a black top, and donning a white scarf?",
"options": [
"A. Clerk.",
"B. Defense lawyer.",
"C. Judge.",
"D. Accused."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the woman in the video with short hair, wearing a black top, and donning a white scarf?\nOption:\nA. Clerk.\nB. Defense lawyer.\nC. Judge.\nD. Accused.\nAnswer with the option's letter from the given choices directly.",
193,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "065-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 194,
"target": "A",
"doc": {
"video_id": "065",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=fo-mVfOsC-E",
"videoID": "fo-mVfOsC-E",
"question_id": "065-3",
"task_type": "Counting Problem",
"question": "How many people can be seen holding cameras in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people can be seen holding cameras in the video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
194,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "065-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 195,
"target": "D",
"doc": {
"video_id": "066",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=x6jPuXwtxCM",
"videoID": "x6jPuXwtxCM",
"question_id": "066-1",
"task_type": "Attribute Perception",
"question": "What are the colors of the three doors that appear in the video?",
"options": [
"A. Yellow, black, blue.",
"B. White, black, green.",
"C. Black, yellow, blue.",
"D. Red, black, blue."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the colors of the three doors that appear in the video?\nOption:\nA. Yellow, black, blue.\nB. White, black, green.\nC. Black, yellow, blue.\nD. Red, black, blue.\nAnswer with the option's letter from the given choices directly.",
195,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "066-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 196,
"target": "B",
"doc": {
"video_id": "066",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=x6jPuXwtxCM",
"videoID": "x6jPuXwtxCM",
"question_id": "066-2",
"task_type": "Action Recognition",
"question": "What is the man doing at the beginning of the video?",
"options": [
"A. Walking.",
"B. Running.",
"C. Dancing.",
"D. Not mentioned."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man doing at the beginning of the video?\nOption:\nA. Walking.\nB. Running.\nC. Dancing.\nD. Not mentioned.\nAnswer with the option's letter from the given choices directly.",
196,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "066-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 197,
"target": "C",
"doc": {
"video_id": "066",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=x6jPuXwtxCM",
"videoID": "x6jPuXwtxCM",
"question_id": "066-3",
"task_type": "Object Recognition",
"question": "What is the object that appears after the red door opens in the video?",
"options": [
"A. A bird.",
"B. A large building.",
"C. A helicopter.",
"D. An oil drum."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the object that appears after the red door opens in the video?\nOption:\nA. A bird.\nB. A large building.\nC. A helicopter.\nD. An oil drum.\nAnswer with the option's letter from the given choices directly.",
197,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "066-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 198,
"target": "A",
"doc": {
"video_id": "067",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=yXNShWnon4g",
"videoID": "yXNShWnon4g",
"question_id": "067-1",
"task_type": "Object Reasoning",
"question": "What is the name of the island that is shown at the start of the video?",
"options": [
"A. Ireland.",
"B. England.",
"C. Scotland.",
"D. Greenland."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the island that is shown at the start of the video?\nOption:\nA. Ireland.\nB. England.\nC. Scotland.\nD. Greenland.\nAnswer with the option's letter from the given choices directly.",
198,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "067-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 199,
"target": "D",
"doc": {
"video_id": "067",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=yXNShWnon4g",
"videoID": "yXNShWnon4g",
"question_id": "067-2",
"task_type": "Attribute Perception",
"question": "What is the color of the pencils that individuals are grasping for drawing in the video?",
"options": [
"A. Green.",
"B. Red.",
"C. Yellow.",
"D. Blue."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the pencils that individuals are grasping for drawing in the video?\nOption:\nA. Green.\nB. Red.\nC. Yellow.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
199,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "067-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 200,
"target": "B",
"doc": {
"video_id": "067",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=yXNShWnon4g",
"videoID": "yXNShWnon4g",
"question_id": "067-3",
"task_type": "Counting Problem",
"question": "How many adult men, aged 18 years or older, can be seen in the video?",
"options": [
"A. 2.",
"B. 1.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many adult men, aged 18 years or older, can be seen in the video?\nOption:\nA. 2.\nB. 1.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
200,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "067-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 201,
"target": "C",
"doc": {
"video_id": "068",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=UZvydHZKyww",
"videoID": "UZvydHZKyww",
"question_id": "068-1",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the main topic of the video below?",
"options": [
"A. An overview of labor laws in California and their implications for small businesses.",
"B. The impact of economic policies on labor laws in California in the past decade.",
"C. The release of the new California labor laws in 2024.",
"D. A historical analysis of labor laws in California from 1990 to 2020."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the main topic of the video below?\nOption:\nA. An overview of labor laws in California and their implications for small businesses.\nB. The impact of economic policies on labor laws in California in the past decade.\nC. The release of the new California labor laws in 2024.\nD. A historical analysis of labor laws in California from 1990 to 2020.\nAnswer with the option's letter from the given choices directly.",
201,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "068-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 202,
"target": "A",
"doc": {
"video_id": "068",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=UZvydHZKyww",
"videoID": "UZvydHZKyww",
"question_id": "068-2",
"task_type": "OCR Problems",
"question": "Which of the following is NOT mentioned in the new labor law?",
"options": [
"A. Maximum wage increases.",
"B. 5 days sick leave.",
"C. Cannabis off-hours.",
"D. Dicussing pay with co-workers."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT mentioned in the new labor law?\nOption:\nA. Maximum wage increases.\nB. 5 days sick leave.\nC. Cannabis off-hours.\nD. Dicussing pay with co-workers.\nAnswer with the option's letter from the given choices directly.",
202,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "068-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 203,
"target": "D",
"doc": {
"video_id": "068",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=UZvydHZKyww",
"videoID": "UZvydHZKyww",
"question_id": "068-3",
"task_type": "Action Recognition",
"question": "Which activity can be seen in the video?",
"options": [
"A. Play computer game.",
"B. Fitness.",
"C. Play tennis.",
"D. Phone conversations."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity can be seen in the video?\nOption:\nA. Play computer game.\nB. Fitness.\nC. Play tennis.\nD. Phone conversations.\nAnswer with the option's letter from the given choices directly.",
203,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "068-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 204,
"target": "B",
"doc": {
"video_id": "069",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=-O6mJ0VBTc4",
"videoID": "-O6mJ0VBTc4",
"question_id": "069-1",
"task_type": "Counting Problem",
"question": "How many areas are divided into the video?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many areas are divided into the video?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
204,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "069-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 205,
"target": "C",
"doc": {
"video_id": "069",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=-O6mJ0VBTc4",
"videoID": "-O6mJ0VBTc4",
"question_id": "069-2",
"task_type": "Object Recognition",
"question": "What is the object illustrated in the second point in the video?",
"options": [
"A. Briefcase.",
"B. Envelope.",
"C. Money bag.",
"D. Gift bag."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the object illustrated in the second point in the video?\nOption:\nA. Briefcase.\nB. Envelope.\nC. Money bag.\nD. Gift bag.\nAnswer with the option's letter from the given choices directly.",
205,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "069-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 206,
"target": "A",
"doc": {
"video_id": "069",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=-O6mJ0VBTc4",
"videoID": "-O6mJ0VBTc4",
"question_id": "069-3",
"task_type": "Action Recognition",
"question": "What is the first thing depicted in the video?",
"options": [
"A. Two hands holding a book.",
"B. A group of people.",
"C. An arrow.",
"D. A handshake."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first thing depicted in the video?\nOption:\nA. Two hands holding a book.\nB. A group of people.\nC. An arrow.\nD. A handshake.\nAnswer with the option's letter from the given choices directly.",
206,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "069-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 207,
"target": "D",
"doc": {
"video_id": "070",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=CezlmUwMXNo",
"videoID": "CezlmUwMXNo",
"question_id": "070-1",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the topic introduced in the video?",
"options": [
"A. A divorce case.",
"B. A financial fraud case.",
"C. A labor dispute case.",
"D. An abortion case."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the topic introduced in the video?\nOption:\nA. A divorce case.\nB. A financial fraud case.\nC. A labor dispute case.\nD. An abortion case.\nAnswer with the option's letter from the given choices directly.",
207,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "070-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 208,
"target": "B",
"doc": {
"video_id": "070",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=CezlmUwMXNo",
"videoID": "CezlmUwMXNo",
"question_id": "070-2",
"task_type": "Counting Problem",
"question": "Which of the following options correctly states the number of people visible in the video?",
"options": [
"A. 5.",
"B. 4.",
"C. 6.",
"D. 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly states the number of people visible in the video?\nOption:\nA. 5.\nB. 4.\nC. 6.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
208,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "070-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 209,
"target": "C",
"doc": {
"video_id": "070",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=CezlmUwMXNo",
"videoID": "CezlmUwMXNo",
"question_id": "070-3",
"task_type": "Object Reasoning",
"question": "Which statement accurately describes the job of the man shown in the center of the video who is wearing a black suit?",
"options": [
"A. He is the prosecutor.",
"B. He is the clerk.",
"C. He is the judge.",
"D. He is the defense lawyer."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement accurately describes the job of the man shown in the center of the video who is wearing a black suit?\nOption:\nA. He is the prosecutor.\nB. He is the clerk.\nC. He is the judge.\nD. He is the defense lawyer.\nAnswer with the option's letter from the given choices directly.",
209,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "070-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 210,
"target": "B",
"doc": {
"video_id": "071",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=IzQ2siryQrM",
"videoID": "IzQ2siryQrM",
"question_id": "071-1",
"task_type": "Temporal Perception",
"question": "How many hours of sleep does this video recommend adults aim to achieve each day, as depicted in the animation?",
"options": [
"A. 5-7.",
"B. 7-9.",
"C. 9-11.",
"D. 11-13."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many hours of sleep does this video recommend adults aim to achieve each day, as depicted in the animation?\nOption:\nA. 5-7.\nB. 7-9.\nC. 9-11.\nD. 11-13.\nAnswer with the option's letter from the given choices directly.",
210,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "071-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 211,
"target": "C",
"doc": {
"video_id": "071",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=IzQ2siryQrM",
"videoID": "IzQ2siryQrM",
"question_id": "071-2",
"task_type": "Action Recognition",
"question": "How does the animation demonstrate the impact of insufficient sleep on one's abilities?",
"options": [
"A. It features a time-lapse of a flower wilting and dying.",
"B. It portrays a listless woman who slowly hangs his head.",
"C. It draws a consuming battery turning from green to red.",
"D. It displays a graph showing declining stock prices."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the animation demonstrate the impact of insufficient sleep on one's abilities?\nOption:\nA. It features a time-lapse of a flower wilting and dying.\nB. It portrays a listless woman who slowly hangs his head.\nC. It draws a consuming battery turning from green to red.\nD. It displays a graph showing declining stock prices.\nAnswer with the option's letter from the given choices directly.",
211,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "071-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 212,
"target": "A",
"doc": {
"video_id": "071",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=IzQ2siryQrM",
"videoID": "IzQ2siryQrM",
"question_id": "071-3",
"task_type": "Object Recognition",
"question": "What is left of the wine on the animation when suggesting avoiding stimulants?",
"options": [
"A. Caffines.",
"B. Alcohol.",
"C. Heavy meals.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is left of the wine on the animation when suggesting avoiding stimulants?\nOption:\nA. Caffines.\nB. Alcohol.\nC. Heavy meals.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
212,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "071-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 213,
"target": "A",
"doc": {
"video_id": "072",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=VsuShNWghXk",
"videoID": "VsuShNWghXk",
"question_id": "072-1",
"task_type": "Object Recognition",
"question": "Which item does not the man wear in this video?",
"options": [
"A. Gloves.",
"B. A helmet.",
"C. Knee pads.",
"D. Elbow pads."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item does not the man wear in this video?\nOption:\nA. Gloves.\nB. A helmet.\nC. Knee pads.\nD. Elbow pads.\nAnswer with the option's letter from the given choices directly.",
213,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "072-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 214,
"target": "D",
"doc": {
"video_id": "072",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=VsuShNWghXk",
"videoID": "VsuShNWghXk",
"question_id": "072-2",
"task_type": "Action Recognition",
"question": "According to the video, what is the right pose demonstrated in the animation when sitting on the bike before starting to ride?",
"options": [
"A. Sitting backwards on the bike with both feet on the handlebars, ready to push off and start riding.",
"B. Sitting sideways on the bike with the left foot on the ground and the right foot resting on the pedal.",
"C. Sitting on the bike with both feet on the ground, not engaging with the pedals.",
"D. Sitting on the bike seat with the right foot on the ground and the left foot resting on the pedal."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the right pose demonstrated in the animation when sitting on the bike before starting to ride?\nOption:\nA. Sitting backwards on the bike with both feet on the handlebars, ready to push off and start riding.\nB. Sitting sideways on the bike with the left foot on the ground and the right foot resting on the pedal.\nC. Sitting on the bike with both feet on the ground, not engaging with the pedals.\nD. Sitting on the bike seat with the right foot on the ground and the left foot resting on the pedal.\nAnswer with the option's letter from the given choices directly.",
214,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "072-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 215,
"target": "B",
"doc": {
"video_id": "072",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=VsuShNWghXk",
"videoID": "VsuShNWghXk",
"question_id": "072-3",
"task_type": "Temporal Reasoning",
"question": "What is the sequence of steps introduced in this video?\n(a) Walk and practice using the brakes.\n(b) Glide to practice balancing.\n(c) Find a good space and safety considerations.\n(d) Start to pedal.\n(e) Adjust the seat.",
"options": [
"A. (c)(a)(d)(e)(b).",
"B. (c)(e)(a)(b)(d).",
"C. (a)(c)(e)(b)(d).",
"D. (a)(e)(b)(c)(d)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the sequence of steps introduced in this video?\n(a) Walk and practice using the brakes.\n(b) Glide to practice balancing.\n(c) Find a good space and safety considerations.\n(d) Start to pedal.\n(e) Adjust the seat.\nOption:\nA. (c)(a)(d)(e)(b).\nB. (c)(e)(a)(b)(d).\nC. (a)(c)(e)(b)(d).\nD. (a)(e)(b)(c)(d).\nAnswer with the option's letter from the given choices directly.",
215,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "072-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 216,
"target": "D",
"doc": {
"video_id": "073",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=7iXM5aq53Ts",
"videoID": "7iXM5aq53Ts",
"question_id": "073-1",
"task_type": "OCR Problems",
"question": "As depicted in the video, which smart phone is advertised on the screen of the laptop?",
"options": [
"A. iPhone 15 Pro Max.",
"B. iPhone 14 Pro.",
"C. iPhone 12.",
"D. iPhone 6s."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which smart phone is advertised on the screen of the laptop?\nOption:\nA. iPhone 15 Pro Max.\nB. iPhone 14 Pro.\nC. iPhone 12.\nD. iPhone 6s.\nAnswer with the option's letter from the given choices directly.",
216,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "073-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 217,
"target": "B",
"doc": {
"video_id": "073",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=7iXM5aq53Ts",
"videoID": "7iXM5aq53Ts",
"question_id": "073-2",
"task_type": "Object Recognition",
"question": "What is the icon of Keyborad Cleaner app shown in the video?",
"options": [
"A. The icon features a colorful image of a keyboard with sparkling keys.",
"B. The icon depicts two black letters of K and C.",
"C. The icon displays a cartoon character holding a broom and cleaning a keyboard.",
"D. The icon shows a magnifying glass zooming in on a dirty keyboard."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the icon of Keyborad Cleaner app shown in the video?\nOption:\nA. The icon features a colorful image of a keyboard with sparkling keys.\nB. The icon depicts two black letters of K and C.\nC. The icon displays a cartoon character holding a broom and cleaning a keyboard.\nD. The icon shows a magnifying glass zooming in on a dirty keyboard.\nAnswer with the option's letter from the given choices directly.",
217,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "073-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 218,
"target": "C",
"doc": {
"video_id": "073",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=7iXM5aq53Ts",
"videoID": "7iXM5aq53Ts",
"question_id": "073-3",
"task_type": "Object Recognition",
"question": "What does the man use to clean the keyboard in this video?",
"options": [
"A. The man uses a non-abrasive sponge soaked in water to clean the keyboard.",
"B. The man uses a soft-bristled brush to scrub the keyboard keys.",
"C. The man uses a microfiber cloth to wipe off the keyboard.",
"D. The man uses dry compressed air to remove dust from the keyboard."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the man use to clean the keyboard in this video?\nOption:\nA. The man uses a non-abrasive sponge soaked in water to clean the keyboard.\nB. The man uses a soft-bristled brush to scrub the keyboard keys.\nC. The man uses a microfiber cloth to wipe off the keyboard.\nD. The man uses dry compressed air to remove dust from the keyboard.\nAnswer with the option's letter from the given choices directly.",
218,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "073-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 219,
"target": "C",
"doc": {
"video_id": "074",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=HvjgQqNOq9A",
"videoID": "HvjgQqNOq9A",
"question_id": "074-1",
"task_type": "Counting Problem",
"question": "How many steps does the man walk in this video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many steps does the man walk in this video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
219,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "074-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 220,
"target": "A",
"doc": {
"video_id": "074",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=HvjgQqNOq9A",
"videoID": "HvjgQqNOq9A",
"question_id": "074-2",
"task_type": "Object Recognition",
"question": "As depicted in the video, which of the following items is not hanging on the walls?",
"options": [
"A. A square mirror.",
"B. Curtains.",
"C. A number of calligraphy paintings.",
"D. Several framed photos."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which of the following items is not hanging on the walls?\nOption:\nA. A square mirror.\nB. Curtains.\nC. A number of calligraphy paintings.\nD. Several framed photos.\nAnswer with the option's letter from the given choices directly.",
220,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "074-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 221,
"target": "B",
"doc": {
"video_id": "074",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=HvjgQqNOq9A",
"videoID": "HvjgQqNOq9A",
"question_id": "074-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus of this video?",
"options": [
"A. It teaches how to dance.",
"B. It teaches how to walk silently.",
"C. It teaches how to skate.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of this video?\nOption:\nA. It teaches how to dance.\nB. It teaches how to walk silently.\nC. It teaches how to skate.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
221,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "074-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 222,
"target": "B",
"doc": {
"video_id": "075",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=uz6rjbw0ZA0",
"videoID": "uz6rjbw0ZA0",
"question_id": "075-1",
"task_type": "Object Recognition",
"question": "On which shirt does the man show drawing imaginary lines?",
"options": [
"A. The shirt of white.",
"B. The shirt of black.",
"C. The shirt of blue.",
"D. The shirt of yellow."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: On which shirt does the man show drawing imaginary lines?\nOption:\nA. The shirt of white.\nB. The shirt of black.\nC. The shirt of blue.\nD. The shirt of yellow.\nAnswer with the option's letter from the given choices directly.",
222,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "075-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 223,
"target": "C",
"doc": {
"video_id": "075",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=uz6rjbw0ZA0",
"videoID": "uz6rjbw0ZA0",
"question_id": "075-2",
"task_type": "Object Recognition",
"question": "As can be seen in the video, where is the point B when drawing imaginary lines?",
"options": [
"A. On the center of the shirt.",
"B. On the bottom of the shirt.",
"C. On the shoulder.",
"D. No point B is marked in this video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, where is the point B when drawing imaginary lines?\nOption:\nA. On the center of the shirt.\nB. On the bottom of the shirt.\nC. On the shoulder.\nD. No point B is marked in this video.\nAnswer with the option's letter from the given choices directly.",
223,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "075-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 224,
"target": "A",
"doc": {
"video_id": "075",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=uz6rjbw0ZA0",
"videoID": "uz6rjbw0ZA0",
"question_id": "075-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It teaches how to fold a shirt.",
"B. It teaches how to iron a shirt.",
"C. It teaches how to wash a shirt.",
"D. It teaches how to dry a shirt."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It teaches how to fold a shirt.\nB. It teaches how to iron a shirt.\nC. It teaches how to wash a shirt.\nD. It teaches how to dry a shirt.\nAnswer with the option's letter from the given choices directly.",
224,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "075-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 225,
"target": "D",
"doc": {
"video_id": "076",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=KOwR0Ln46Ks",
"videoID": "KOwR0Ln46Ks",
"question_id": "076-1",
"task_type": "Object Recognition",
"question": "According to the video, what is the brown item pinned on the speaker's waist?",
"options": [
"A. Bag of photos.",
"B. Bag of make-ups.",
"C. Bag of money.",
"D. Bag of treats."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the brown item pinned on the speaker's waist?\nOption:\nA. Bag of photos.\nB. Bag of make-ups.\nC. Bag of money.\nD. Bag of treats.\nAnswer with the option's letter from the given choices directly.",
225,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "076-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 226,
"target": "B",
"doc": {
"video_id": "076",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=KOwR0Ln46Ks",
"videoID": "KOwR0Ln46Ks",
"question_id": "076-2",
"task_type": "Attribute Perception",
"question": "What is the color of rope to the dog of white?",
"options": [
"A. White.",
"B. Black.",
"C. Red.",
"D. Green."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of rope to the dog of white?\nOption:\nA. White.\nB. Black.\nC. Red.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
226,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "076-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 227,
"target": "C",
"doc": {
"video_id": "076",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=KOwR0Ln46Ks",
"videoID": "KOwR0Ln46Ks",
"question_id": "076-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus of this video?",
"options": [
"A. It teaches how to make food for dogs.",
"B. It teaches how to buy a beautiful dog.",
"C. It teaches how to properly walk a dog on a leash.",
"D. It teaches how to clean the kennel."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of this video?\nOption:\nA. It teaches how to make food for dogs.\nB. It teaches how to buy a beautiful dog.\nC. It teaches how to properly walk a dog on a leash.\nD. It teaches how to clean the kennel.\nAnswer with the option's letter from the given choices directly.",
227,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "076-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 228,
"target": "A",
"doc": {
"video_id": "077",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=540LkURTR7g",
"videoID": "540LkURTR7g",
"question_id": "077-1",
"task_type": "Attribute Perception",
"question": "What is the color of the tall drinking glass initially held by the woman?",
"options": [
"A. Blue.",
"B. Red.",
"C. Green.",
"D. Yellow."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the tall drinking glass initially held by the woman?\nOption:\nA. Blue.\nB. Red.\nC. Green.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
228,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "077-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 229,
"target": "D",
"doc": {
"video_id": "077",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=540LkURTR7g",
"videoID": "540LkURTR7g",
"question_id": "077-2",
"task_type": "Counting Problem",
"question": "How many cups appear in this video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cups appear in this video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
229,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "077-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 230,
"target": "B",
"doc": {
"video_id": "077",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=540LkURTR7g",
"videoID": "540LkURTR7g",
"question_id": "077-3",
"task_type": "Action Reasoning",
"question": "According to this video, what is the purpose of adding fresh produce to the water?",
"options": [
"A. To enhance the water's purification and cleansing properties.",
"B. To infuse the water with a refreshing and subtle flavor.",
"C. To visually enhance the appearance of the water.",
"D. To turn the water into a thick smoothie."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, what is the purpose of adding fresh produce to the water?\nOption:\nA. To enhance the water's purification and cleansing properties.\nB. To infuse the water with a refreshing and subtle flavor.\nC. To visually enhance the appearance of the water.\nD. To turn the water into a thick smoothie.\nAnswer with the option's letter from the given choices directly.",
230,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "077-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 231,
"target": "C",
"doc": {
"video_id": "078",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=SnOc5W0PgVE",
"videoID": "SnOc5W0PgVE",
"question_id": "078-1",
"task_type": "Spatial Perception",
"question": "Which side of the tie is shorter?",
"options": [
"A. The left and right side is of equal length.",
"B. Right.",
"C. Left.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which side of the tie is shorter?\nOption:\nA. The left and right side is of equal length.\nB. Right.\nC. Left.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
231,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "078-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 232,
"target": "A",
"doc": {
"video_id": "078",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=SnOc5W0PgVE",
"videoID": "SnOc5W0PgVE",
"question_id": "078-2",
"task_type": "Attribute Perception",
"question": "Which item is similar in size to the loop depicted in the video?",
"options": [
"A. His thumb.",
"B. The collar button.",
"C. A petal of the tie.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item is similar in size to the loop depicted in the video?\nOption:\nA. His thumb.\nB. The collar button.\nC. A petal of the tie.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
232,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "078-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 233,
"target": "D",
"doc": {
"video_id": "078",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=SnOc5W0PgVE",
"videoID": "SnOc5W0PgVE",
"question_id": "078-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It teaches how to tie a neck tie.",
"B. It advertises a bow tie.",
"C. It advertises a neck tie.",
"D. It teaches how to tie a bow tie."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It teaches how to tie a neck tie.\nB. It advertises a bow tie.\nC. It advertises a neck tie.\nD. It teaches how to tie a bow tie.\nAnswer with the option's letter from the given choices directly.",
233,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "078-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 234,
"target": "B",
"doc": {
"video_id": "079",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=WViSvPFUVd8",
"videoID": "WViSvPFUVd8",
"question_id": "079-1",
"task_type": "Object Recognition",
"question": "Which vegetable is not visible in this video when introducing the first tip?",
"options": [
"A. Pepper.",
"B. Potato.",
"C. Carrot.",
"D. Beans."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which vegetable is not visible in this video when introducing the first tip?\nOption:\nA. Pepper.\nB. Potato.\nC. Carrot.\nD. Beans.\nAnswer with the option's letter from the given choices directly.",
234,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "079-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 235,
"target": "C",
"doc": {
"video_id": "079",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=WViSvPFUVd8",
"videoID": "WViSvPFUVd8",
"question_id": "079-2",
"task_type": "Spatial Perception",
"question": "How are the tomatoes in the video placed when introducing the fourth tip?",
"options": [
"A. They are arranged in a circular formation.",
"B. They are arranged in a straight line.",
"C. They are arranged in four distinct rows.",
"D. They are arranged randomly."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How are the tomatoes in the video placed when introducing the fourth tip?\nOption:\nA. They are arranged in a circular formation.\nB. They are arranged in a straight line.\nC. They are arranged in four distinct rows.\nD. They are arranged randomly.\nAnswer with the option's letter from the given choices directly.",
235,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "079-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 236,
"target": "A",
"doc": {
"video_id": "079",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=WViSvPFUVd8",
"videoID": "WViSvPFUVd8",
"question_id": "079-3",
"task_type": "Temporal Reasoning",
"question": "In which order do the six tips are introduced in the video?\n(a) Clip coupons.\n(b) Eat at home.\n(c) Freeze leftovers.\n(d) Cook once, eat twice.\n(e) Meal plan.\n(f) Buy in bulk.",
"options": [
"A. (b)(e)(d)(f)(a)(c).",
"B. (b)(e)(f)(c)(a)(d).",
"C. (e)(c)(d)(a)(b)(f).",
"D. (c)(e)(b)(f)(a)(d)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order do the six tips are introduced in the video?\n(a) Clip coupons.\n(b) Eat at home.\n(c) Freeze leftovers.\n(d) Cook once, eat twice.\n(e) Meal plan.\n(f) Buy in bulk.\nOption:\nA. (b)(e)(d)(f)(a)(c).\nB. (b)(e)(f)(c)(a)(d).\nC. (e)(c)(d)(a)(b)(f).\nD. (c)(e)(b)(f)(a)(d).\nAnswer with the option's letter from the given choices directly.",
236,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "079-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 237,
"target": "D",
"doc": {
"video_id": "080",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=21q-lDikdBg",
"videoID": "21q-lDikdBg",
"question_id": "080-1",
"task_type": "Temporal Perception",
"question": "When does the man take a long exhaled breath in this video?",
"options": [
"A. When he is running a marathon.",
"B. When he is solving a difficult math problem.",
"C. When he is releasing the tension.",
"D. When he is pulling apart."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When does the man take a long exhaled breath in this video?\nOption:\nA. When he is running a marathon.\nB. When he is solving a difficult math problem.\nC. When he is releasing the tension.\nD. When he is pulling apart.\nAnswer with the option's letter from the given choices directly.",
237,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "080-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 238,
"target": "B",
"doc": {
"video_id": "080",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=21q-lDikdBg",
"videoID": "21q-lDikdBg",
"question_id": "080-2",
"task_type": "Spatial Perception",
"question": "As depicted in the video, what are the resistance bands parallel to when the man is pulling apart?",
"options": [
"A. Neck.",
"B. Chest.",
"C. Chin.",
"D. Nose."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what are the resistance bands parallel to when the man is pulling apart?\nOption:\nA. Neck.\nB. Chest.\nC. Chin.\nD. Nose.\nAnswer with the option's letter from the given choices directly.",
238,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "080-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 239,
"target": "C",
"doc": {
"video_id": "080",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=21q-lDikdBg",
"videoID": "21q-lDikdBg",
"question_id": "080-3",
"task_type": "Information Synopsis",
"question": "Which choice best summarizes the main topic of the video?",
"options": [
"A. It advertises resistance bands for better selling.",
"B. It shows the difference between resistance bands and dumbbells.",
"C. It teaches you to improve your posture using resistance bands.",
"D. It evaluates the quality of resistance bands."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which choice best summarizes the main topic of the video?\nOption:\nA. It advertises resistance bands for better selling.\nB. It shows the difference between resistance bands and dumbbells.\nC. It teaches you to improve your posture using resistance bands.\nD. It evaluates the quality of resistance bands.\nAnswer with the option's letter from the given choices directly.",
239,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "080-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 240,
"target": "B",
"doc": {
"video_id": "081",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=FjS2LzrHEO8",
"videoID": "FjS2LzrHEO8",
"question_id": "081-1",
"task_type": "Action Recognition",
"question": "What was the purpose of using a hammer to hit the car in the video?",
"options": [
"A. To show the hammer works well.",
"B. To show the solidity of the car.",
"C. To warn people not to hit cars with hammers.",
"D. To illustrate that a hammer is harder than a bullet."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the purpose of using a hammer to hit the car in the video?\nOption:\nA. To show the hammer works well.\nB. To show the solidity of the car.\nC. To warn people not to hit cars with hammers.\nD. To illustrate that a hammer is harder than a bullet.\nAnswer with the option's letter from the given choices directly.",
240,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "081-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 241,
"target": "B",
"doc": {
"video_id": "081",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=FjS2LzrHEO8",
"videoID": "FjS2LzrHEO8",
"question_id": "081-2",
"task_type": "OCR Problems",
"question": "What is the size of the back touchscreen display in this video?",
"options": [
"A. 15 inches.",
"B. 9.4 inches.",
"C. 17.4 inches.",
"D. 18.5 inches."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the size of the back touchscreen display in this video?\nOption:\nA. 15 inches.\nB. 9.4 inches.\nC. 17.4 inches.\nD. 18.5 inches.\nAnswer with the option's letter from the given choices directly.",
241,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "081-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 242,
"target": "A",
"doc": {
"video_id": "081",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=FjS2LzrHEO8",
"videoID": "FjS2LzrHEO8",
"question_id": "081-3",
"task_type": "Information Synopsis",
"question": "Which summarizes the main content of this video?",
"options": [
"A. Car features.",
"B. Transportation.",
"C. Wild scenery.",
"D. Construction work."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summarizes the main content of this video?\nOption:\nA. Car features.\nB. Transportation.\nC. Wild scenery.\nD. Construction work.\nAnswer with the option's letter from the given choices directly.",
242,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "081-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 243,
"target": "B",
"doc": {
"video_id": "082",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=m4qhFFdHTCc",
"videoID": "m4qhFFdHTCc",
"question_id": "082-1",
"task_type": "Attribute Perception",
"question": "What's wrong with this car?",
"options": [
"A. It doesn't have a left rear wheel.",
"B. It doesn't have a right front wheel.",
"C. Its headlamp is broken.",
"D. Its right door is broken."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's wrong with this car?\nOption:\nA. It doesn't have a left rear wheel.\nB. It doesn't have a right front wheel.\nC. Its headlamp is broken.\nD. Its right door is broken.\nAnswer with the option's letter from the given choices directly.",
243,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "082-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 244,
"target": "B",
"doc": {
"video_id": "082",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=m4qhFFdHTCc",
"videoID": "m4qhFFdHTCc",
"question_id": "082-2",
"task_type": "Attribute Perception",
"question": "Which colors are not present on the car, including the color of the lights, shown in the video?",
"options": [
"A. Black.",
"B. Blue.",
"C. Yellow.",
"D. White."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which colors are not present on the car, including the color of the lights, shown in the video?\nOption:\nA. Black.\nB. Blue.\nC. Yellow.\nD. White.\nAnswer with the option's letter from the given choices directly.",
244,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "082-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 245,
"target": "A",
"doc": {
"video_id": "082",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=m4qhFFdHTCc",
"videoID": "m4qhFFdHTCc",
"question_id": "082-3",
"task_type": "Spatial Reasoning",
"question": "In what location was the video most likely recorded?",
"options": [
"A. Indoor parking lot.",
"B. Outdoor park circuit.",
"C. Car assembly line.",
"D. Mountain road."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what location was the video most likely recorded?\nOption:\nA. Indoor parking lot.\nB. Outdoor park circuit.\nC. Car assembly line.\nD. Mountain road.\nAnswer with the option's letter from the given choices directly.",
245,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "082-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 246,
"target": "D",
"doc": {
"video_id": "083",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=lKoG2_zdoSA",
"videoID": "lKoG2_zdoSA",
"question_id": "083-1",
"task_type": "Attribute Perception",
"question": "What is the color of the pants worn by the young girl who converses with the elderly man on the subway?",
"options": [
"A. Light brown.",
"B. Black.",
"C. Pink.",
"D. Blue gray."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the pants worn by the young girl who converses with the elderly man on the subway?\nOption:\nA. Light brown.\nB. Black.\nC. Pink.\nD. Blue gray.\nAnswer with the option's letter from the given choices directly.",
246,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "083-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 247,
"target": "C",
"doc": {
"video_id": "083",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=lKoG2_zdoSA",
"videoID": "lKoG2_zdoSA",
"question_id": "083-2",
"task_type": "Information Synopsis",
"question": "What is the topic of this video?",
"options": [
"A. How to take a selfie.",
"B. How to make payments by mobile phone.",
"C. The development of mobile phones.",
"D. How to take the subway."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of this video?\nOption:\nA. How to take a selfie.\nB. How to make payments by mobile phone.\nC. The development of mobile phones.\nD. How to take the subway.\nAnswer with the option's letter from the given choices directly.",
247,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "083-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 248,
"target": "B",
"doc": {
"video_id": "083",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=lKoG2_zdoSA",
"videoID": "lKoG2_zdoSA",
"question_id": "083-3",
"task_type": "Counting Problem",
"question": "How many changes in the development of mobile phones are introduced in the video?",
"options": [
"A. 4.",
"B. 6.",
"C. 3.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many changes in the development of mobile phones are introduced in the video?\nOption:\nA. 4.\nB. 6.\nC. 3.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
248,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "083-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 249,
"target": "C",
"doc": {
"video_id": "084",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=aqTIB_q40bo",
"videoID": "aqTIB_q40bo",
"question_id": "084-1",
"task_type": "Object Reasoning",
"question": "Which of the following options describes a common trait shared by the characters in the video?",
"options": [
"A. They all love driving.",
"B. They all like electronic devices.",
"C. They all wear masks or glasses.",
"D. They are all real people."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options describes a common trait shared by the characters in the video?\nOption:\nA. They all love driving.\nB. They all like electronic devices.\nC. They all wear masks or glasses.\nD. They are all real people.\nAnswer with the option's letter from the given choices directly.",
249,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "084-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 250,
"target": "B",
"doc": {
"video_id": "084",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=aqTIB_q40bo",
"videoID": "aqTIB_q40bo",
"question_id": "084-2",
"task_type": "Information Synopsis",
"question": "Which summarizes the main content of this video?",
"options": [
"A. Movie commentary.",
"B. Product advertisement.",
"C. Motorsport.",
"D. Fashion tutorial."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summarizes the main content of this video?\nOption:\nA. Movie commentary.\nB. Product advertisement.\nC. Motorsport.\nD. Fashion tutorial.\nAnswer with the option's letter from the given choices directly.",
250,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "084-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 251,
"target": "B",
"doc": {
"video_id": "084",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=aqTIB_q40bo",
"videoID": "aqTIB_q40bo",
"question_id": "084-3",
"task_type": "Counting Problem",
"question": "At the end of the video, how many books can be seen resting on the living room table?",
"options": [
"A. 2.",
"B. 4.",
"C. 1.",
"D. 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the video, how many books can be seen resting on the living room table?\nOption:\nA. 2.\nB. 4.\nC. 1.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
251,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "084-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 252,
"target": "B",
"doc": {
"video_id": "085",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=hUrjmA0fhsc",
"videoID": "hUrjmA0fhsc",
"question_id": "085-1",
"task_type": "OCR Problems",
"question": "What speed is displayed on the car dashboard in the video?",
"options": [
"A. 66 MPH.",
"B. 55 MPH.",
"C. 32 MPH.",
"D. 22 MPH."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What speed is displayed on the car dashboard in the video?\nOption:\nA. 66 MPH.\nB. 55 MPH.\nC. 32 MPH.\nD. 22 MPH.\nAnswer with the option's letter from the given choices directly.",
252,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "085-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 253,
"target": "B",
"doc": {
"video_id": "085",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=hUrjmA0fhsc",
"videoID": "hUrjmA0fhsc",
"question_id": "085-2",
"task_type": "Counting Problem",
"question": "How many methods for disengaging Super Cruise are mentioned in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 1.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many methods for disengaging Super Cruise are mentioned in the video?\nOption:\nA. 2.\nB. 3.\nC. 1.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
253,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "085-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 254,
"target": "D",
"doc": {
"video_id": "085",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=hUrjmA0fhsc",
"videoID": "hUrjmA0fhsc",
"question_id": "085-3",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following statements is accurate?",
"options": [
"A. When using Super Cruise, you need to keep your hands firmly on the wheel.",
"B. The seat vibrates to make the driver more comfortable while using Super Cruise.",
"C. The driver can take a nap while using Super Cruise.",
"D. Lane changing can be done without human control while using Super Cruise."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements is accurate?\nOption:\nA. When using Super Cruise, you need to keep your hands firmly on the wheel.\nB. The seat vibrates to make the driver more comfortable while using Super Cruise.\nC. The driver can take a nap while using Super Cruise.\nD. Lane changing can be done without human control while using Super Cruise.\nAnswer with the option's letter from the given choices directly.",
254,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "085-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 255,
"target": "C",
"doc": {
"video_id": "086",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=aBdQQxgxDrY",
"videoID": "aBdQQxgxDrY",
"question_id": "086-1",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. There are 15 batteries in the battery swap station.",
"B. You need to manually park the car into the battery swap station.",
"C. The car will be moved by the machine to the accurate position in the battery swap station.",
"D. You need to get off the car when battery swapping."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. There are 15 batteries in the battery swap station.\nB. You need to manually park the car into the battery swap station.\nC. The car will be moved by the machine to the accurate position in the battery swap station.\nD. You need to get off the car when battery swapping.\nAnswer with the option's letter from the given choices directly.",
255,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "086-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 256,
"target": "A",
"doc": {
"video_id": "086",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=aBdQQxgxDrY",
"videoID": "aBdQQxgxDrY",
"question_id": "086-2",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. Automatic swap of auto parts.",
"B. Car Manufacturing Process.",
"C. Car-buying guidance.",
"D. Automatic car wash."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. Automatic swap of auto parts.\nB. Car Manufacturing Process.\nC. Car-buying guidance.\nD. Automatic car wash.\nAnswer with the option's letter from the given choices directly.",
256,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "086-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 257,
"target": "A",
"doc": {
"video_id": "086",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=aBdQQxgxDrY",
"videoID": "aBdQQxgxDrY",
"question_id": "086-3",
"task_type": "OCR Problems",
"question": "What is the car model written on the registration plate that can be seen in this video?",
"options": [
"A. es8.",
"B. G58.",
"C. GS8.",
"D. e58."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the car model written on the registration plate that can be seen in this video?\nOption:\nA. es8.\nB. G58.\nC. GS8.\nD. e58.\nAnswer with the option's letter from the given choices directly.",
257,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "086-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 258,
"target": "C",
"doc": {
"video_id": "087",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=tF4DML7FIWk",
"videoID": "tF4DML7FIWk",
"question_id": "087-1",
"task_type": "Counting Problem",
"question": "How many humanoid robots can be identified in the video?",
"options": [
"A. 1.",
"B. 3.",
"C. 2.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many humanoid robots can be identified in the video?\nOption:\nA. 1.\nB. 3.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
258,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "087-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 259,
"target": "B",
"doc": {
"video_id": "087",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=tF4DML7FIWk",
"videoID": "tF4DML7FIWk",
"question_id": "087-2",
"task_type": "Action Recognition",
"question": "Which task was not completed by the robots?",
"options": [
"A. Vault.",
"B. Split.",
"C. Balance beam.",
"D. Backflip."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which task was not completed by the robots?\nOption:\nA. Vault.\nB. Split.\nC. Balance beam.\nD. Backflip.\nAnswer with the option's letter from the given choices directly.",
259,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "087-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 260,
"target": "C",
"doc": {
"video_id": "087",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=tF4DML7FIWk",
"videoID": "tF4DML7FIWk",
"question_id": "087-3",
"task_type": "Attribute Perception",
"question": "What color are the fences on both sides of the room in this video?",
"options": [
"A. White.",
"B. Blue.",
"C. Orange.",
"D. Yellow."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the fences on both sides of the room in this video?\nOption:\nA. White.\nB. Blue.\nC. Orange.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
260,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "087-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 261,
"target": "A",
"doc": {
"video_id": "088",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=jRS9fVh7MUw",
"videoID": "jRS9fVh7MUw",
"question_id": "088-1",
"task_type": "Information Synopsis",
"question": "What is the main content of this video?",
"options": [
"A. How camera lenses work.",
"B. How to make a movie.",
"C. How to buy camera lenses.",
"D. How to polish lenses."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this video?\nOption:\nA. How camera lenses work.\nB. How to make a movie.\nC. How to buy camera lenses.\nD. How to polish lenses.\nAnswer with the option's letter from the given choices directly.",
261,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "088-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 262,
"target": "B",
"doc": {
"video_id": "088",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=jRS9fVh7MUw",
"videoID": "jRS9fVh7MUw",
"question_id": "088-2",
"task_type": "Counting Problem",
"question": "How many glass discs are there inside the disassembled lens in the video, at minimum?",
"options": [
"A. 4.",
"B. 5.",
"C. 3.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many glass discs are there inside the disassembled lens in the video, at minimum?\nOption:\nA. 4.\nB. 5.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
262,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "088-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 263,
"target": "D",
"doc": {
"video_id": "088",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=jRS9fVh7MUw",
"videoID": "jRS9fVh7MUw",
"question_id": "088-3",
"task_type": "Object Reasoning",
"question": "According to the video, what is the purpose of changing the size of the hole of the center formed by stainless steel leaves?",
"options": [
"A. To match the size of the lens.",
"B. To move the glass discs inside the camera lens.",
"C. To turn the inner barrel slides forwards and backwards.",
"D. To control the amount of light entering the camera."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the purpose of changing the size of the hole of the center formed by stainless steel leaves?\nOption:\nA. To match the size of the lens.\nB. To move the glass discs inside the camera lens.\nC. To turn the inner barrel slides forwards and backwards.\nD. To control the amount of light entering the camera.\nAnswer with the option's letter from the given choices directly.",
263,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "088-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 264,
"target": "C",
"doc": {
"video_id": "089",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=JxkJ-FwFeVI",
"videoID": "JxkJ-FwFeVI",
"question_id": "089-1",
"task_type": "Information Synopsis",
"question": "Based on the video, which of the following statements is true?",
"options": [
"A. The piston engines do not compress air.",
"B. Rolls-Royce engines are only used for applications on land and in the air.",
"C. Rolls-Royce engines include the gas turbines and the piston engines.",
"D. The gas turbines suck all the air surrounded into the engine."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following statements is true?\nOption:\nA. The piston engines do not compress air.\nB. Rolls-Royce engines are only used for applications on land and in the air.\nC. Rolls-Royce engines include the gas turbines and the piston engines.\nD. The gas turbines suck all the air surrounded into the engine.\nAnswer with the option's letter from the given choices directly.",
264,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "089-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 265,
"target": "A",
"doc": {
"video_id": "089",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=JxkJ-FwFeVI",
"videoID": "JxkJ-FwFeVI",
"question_id": "089-2",
"task_type": "OCR Problems",
"question": "What type of aircraft model is being used as an example when explaining gas turbines in the video?",
"options": [
"A. AIRBUS A350.",
"B. BOEING 747.",
"C. BOEING 777.",
"D. AIRBUS A330."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of aircraft model is being used as an example when explaining gas turbines in the video?\nOption:\nA. AIRBUS A350.\nB. BOEING 747.\nC. BOEING 777.\nD. AIRBUS A330.\nAnswer with the option's letter from the given choices directly.",
265,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "089-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 266,
"target": "D",
"doc": {
"video_id": "089",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=JxkJ-FwFeVI",
"videoID": "JxkJ-FwFeVI",
"question_id": "089-3",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the content of the video?",
"options": [
"A. Why gas turbines are better than piston engines.",
"B. How to purchase Rolls-Royce Motor Cars.",
"C. How to choose an engine.",
"D. How Engines Work."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the content of the video?\nOption:\nA. Why gas turbines are better than piston engines.\nB. How to purchase Rolls-Royce Motor Cars.\nC. How to choose an engine.\nD. How Engines Work.\nAnswer with the option's letter from the given choices directly.",
266,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "089-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 267,
"target": "B",
"doc": {
"video_id": "090",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=c9arR8T0Qts",
"videoID": "c9arR8T0Qts",
"question_id": "090-1",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following statements is correct?",
"options": [
"A. The main component of sand is pure silicon.",
"B. Chips are produced in clean rooms.",
"C. There is only one layer of chip in a microchip.",
"D. The color of pure silicon is yellow."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements is correct?\nOption:\nA. The main component of sand is pure silicon.\nB. Chips are produced in clean rooms.\nC. There is only one layer of chip in a microchip.\nD. The color of pure silicon is yellow.\nAnswer with the option's letter from the given choices directly.",
267,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "090-1",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 268,
"target": "D",
"doc": {
"video_id": "090",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=c9arR8T0Qts",
"videoID": "c9arR8T0Qts",
"question_id": "090-2",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. How chips work.",
"B. What chips can do.",
"C. What are chips.",
"D. How chips are produced."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. How chips work.\nB. What chips can do.\nC. What are chips.\nD. How chips are produced.\nAnswer with the option's letter from the given choices directly.",
268,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "090-2",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 269,
"target": "B",
"doc": {
"video_id": "090",
"duration": "short",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=c9arR8T0Qts",
"videoID": "c9arR8T0Qts",
"question_id": "090-3",
"task_type": "OCR Problems",
"question": "Based on the video, what is the total number of measurements involved in chip manufacturing?",
"options": [
"A. 27.",
"B. 200.",
"C. 700.",
"D. 26."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is the total number of measurements involved in chip manufacturing?\nOption:\nA. 27.\nB. 200.\nC. 700.\nD. 26.\nAnswer with the option's letter from the given choices directly.",
269,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "090-3",
"duration": "short",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 270,
"target": "D",
"doc": {
"video_id": "091",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=_aVHf_jmWk8",
"videoID": "_aVHf_jmWk8",
"question_id": "091-1",
"task_type": "Action Recognition",
"question": "How does the girl feel in this video?",
"options": [
"A. Positive.",
"B. Happy.",
"C. Angry.",
"D. Sad."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the girl feel in this video?\nOption:\nA. Positive.\nB. Happy.\nC. Angry.\nD. Sad.\nAnswer with the option's letter from the given choices directly.",
270,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "091-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 271,
"target": "B",
"doc": {
"video_id": "091",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=_aVHf_jmWk8",
"videoID": "_aVHf_jmWk8",
"question_id": "091-2",
"task_type": "Counting Problem",
"question": "How many times does the butterfly occur in this video?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the butterfly occur in this video?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
271,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "091-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 272,
"target": "B",
"doc": {
"video_id": "091",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=_aVHf_jmWk8",
"videoID": "_aVHf_jmWk8",
"question_id": "091-3",
"task_type": "Counting Problem",
"question": "What is the number of three-dimensional graphics that can be seen in this video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the number of three-dimensional graphics that can be seen in this video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
272,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "091-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 273,
"target": "C",
"doc": {
"video_id": "092",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=3rTsETO3s9U",
"videoID": "3rTsETO3s9U",
"question_id": "092-1",
"task_type": "Action Recognition",
"question": "What does the girl want to do in the video?",
"options": [
"A. Running.",
"B. Writing.",
"C. Painting.",
"D. Playing."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the girl want to do in the video?\nOption:\nA. Running.\nB. Writing.\nC. Painting.\nD. Playing.\nAnswer with the option's letter from the given choices directly.",
273,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "092-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 274,
"target": "D",
"doc": {
"video_id": "092",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=3rTsETO3s9U",
"videoID": "3rTsETO3s9U",
"question_id": "092-2",
"task_type": "Action Recognition",
"question": "What does the girl's mom want her to do in this video?",
"options": [
"A. Seeing a movie.",
"B. Exercising.",
"C. Playing with friends.",
"D. Studying math."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the girl's mom want her to do in this video?\nOption:\nA. Seeing a movie.\nB. Exercising.\nC. Playing with friends.\nD. Studying math.\nAnswer with the option's letter from the given choices directly.",
274,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "092-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 275,
"target": "C",
"doc": {
"video_id": "092",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=3rTsETO3s9U",
"videoID": "3rTsETO3s9U",
"question_id": "092-3",
"task_type": "Action Reasoning",
"question": "How does the girl perceive her future with regard to decision-making and control?",
"options": [
"A. She doesn't care about her future.",
"B. She wishes to make decision by herself.",
"C. She feels she will be controlled by her mom.",
"D. She has no hope for the future."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the girl perceive her future with regard to decision-making and control?\nOption:\nA. She doesn't care about her future.\nB. She wishes to make decision by herself.\nC. She feels she will be controlled by her mom.\nD. She has no hope for the future.\nAnswer with the option's letter from the given choices directly.",
275,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "092-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 276,
"target": "D",
"doc": {
"video_id": "093",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=drbi6HK1gSc",
"videoID": "drbi6HK1gSc",
"question_id": "093-1",
"task_type": "Action Reasoning",
"question": "How does the girl feel in this video?",
"options": [
"A. Scared.",
"B. Peaceful.",
"C. Relaxed.",
"D. Anxious."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the girl feel in this video?\nOption:\nA. Scared.\nB. Peaceful.\nC. Relaxed.\nD. Anxious.\nAnswer with the option's letter from the given choices directly.",
276,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "093-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 277,
"target": "B",
"doc": {
"video_id": "093",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=drbi6HK1gSc",
"videoID": "drbi6HK1gSc",
"question_id": "093-2",
"task_type": "Object Recognition",
"question": "What might be the age of the girl in the video?",
"options": [
"A. 14.",
"B. 16.",
"C. 18.",
"D. 12."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What might be the age of the girl in the video?\nOption:\nA. 14.\nB. 16.\nC. 18.\nD. 12.\nAnswer with the option's letter from the given choices directly.",
277,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "093-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 278,
"target": "C",
"doc": {
"video_id": "093",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=drbi6HK1gSc",
"videoID": "drbi6HK1gSc",
"question_id": "093-3",
"task_type": "Action Reasoning",
"question": "Which of the following does the girl desire?",
"options": [
"A. More people celebrate for her birthday.",
"B. Slower passage of time.",
"C. Faster passage of time.",
"D. A close friend playing with her."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following does the girl desire?\nOption:\nA. More people celebrate for her birthday.\nB. Slower passage of time.\nC. Faster passage of time.\nD. A close friend playing with her.\nAnswer with the option's letter from the given choices directly.",
278,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "093-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 279,
"target": "A",
"doc": {
"video_id": "094",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=M69Sn3OERZo",
"videoID": "M69Sn3OERZo",
"question_id": "094-1",
"task_type": "Attribute Perception",
"question": "What color is the monkey depicted in the video?",
"options": [
"A. Yellow.",
"B. Green.",
"C. Red.",
"D. Blue."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the monkey depicted in the video?\nOption:\nA. Yellow.\nB. Green.\nC. Red.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
279,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "094-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 280,
"target": "C",
"doc": {
"video_id": "094",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=M69Sn3OERZo",
"videoID": "M69Sn3OERZo",
"question_id": "094-2",
"task_type": "Object Recognition",
"question": "What animal saves the monkey in the video?",
"options": [
"A. Dog.",
"B. Cat.",
"C. Turtle.",
"D. Bird."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What animal saves the monkey in the video?\nOption:\nA. Dog.\nB. Cat.\nC. Turtle.\nD. Bird.\nAnswer with the option's letter from the given choices directly.",
280,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "094-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 281,
"target": "D",
"doc": {
"video_id": "094",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=M69Sn3OERZo",
"videoID": "M69Sn3OERZo",
"question_id": "094-3",
"task_type": "Action Recognition",
"question": "What action does the monkey do after being rescued?",
"options": [
"A. Handstanding.",
"B. Jumping.",
"C. Running.",
"D. Crying."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What action does the monkey do after being rescued?\nOption:\nA. Handstanding.\nB. Jumping.\nC. Running.\nD. Crying.\nAnswer with the option's letter from the given choices directly.",
281,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "094-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 282,
"target": "C",
"doc": {
"video_id": "095",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=djr6T2C8Gfs",
"videoID": "djr6T2C8Gfs",
"question_id": "095-1",
"task_type": "Action Recognition",
"question": "Which activity do the two ghost cowboys enjoy while horse riding as depicted in the video?",
"options": [
"A. Fighting.",
"B. Dancing.",
"C. Singing.",
"D. Quarrelling."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity do the two ghost cowboys enjoy while horse riding as depicted in the video?\nOption:\nA. Fighting.\nB. Dancing.\nC. Singing.\nD. Quarrelling.\nAnswer with the option's letter from the given choices directly.",
282,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "095-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 283,
"target": "D",
"doc": {
"video_id": "095",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=djr6T2C8Gfs",
"videoID": "djr6T2C8Gfs",
"question_id": "095-2",
"task_type": "Object Recognition",
"question": "As can be seen in the video, which animal is seeking to capture the two ghost cowboys?",
"options": [
"A. A horse.",
"B. A dog.",
"C. A panda.",
"D. A wolf."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, which animal is seeking to capture the two ghost cowboys?\nOption:\nA. A horse.\nB. A dog.\nC. A panda.\nD. A wolf.\nAnswer with the option's letter from the given choices directly.",
283,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "095-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 284,
"target": "B",
"doc": {
"video_id": "095",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=djr6T2C8Gfs",
"videoID": "djr6T2C8Gfs",
"question_id": "095-3",
"task_type": "Spatial Perception",
"question": "What do the two ghost cowboys like to do when they ride horses according to the video?",
"options": [
"A. Fighting.",
"B. Dancing.",
"C. Singing.",
"D. Quarrelling."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two ghost cowboys like to do when they ride horses according to the video?\nOption:\nA. Fighting.\nB. Dancing.\nC. Singing.\nD. Quarrelling.\nAnswer with the option's letter from the given choices directly.",
284,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "095-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 285,
"target": "B",
"doc": {
"video_id": "096",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Q0B5dLHDQ2w",
"videoID": "Q0B5dLHDQ2w",
"question_id": "096-1",
"task_type": "Object Recognition",
"question": "Which animal is the character depicted as in the video?",
"options": [
"A. A bat.",
"B. A dragon.",
"C. A bird.",
"D. A crocodile."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animal is the character depicted as in the video?\nOption:\nA. A bat.\nB. A dragon.\nC. A bird.\nD. A crocodile.\nAnswer with the option's letter from the given choices directly.",
285,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "096-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 286,
"target": "A",
"doc": {
"video_id": "096",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Q0B5dLHDQ2w",
"videoID": "Q0B5dLHDQ2w",
"question_id": "096-2",
"task_type": "Spatial Perception",
"question": "Which location is the current whereabouts of the creature?",
"options": [
"A. Mountain.",
"B. Forest.",
"C. Glacier.",
"D. House."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which location is the current whereabouts of the creature?\nOption:\nA. Mountain.\nB. Forest.\nC. Glacier.\nD. House.\nAnswer with the option's letter from the given choices directly.",
286,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "096-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 287,
"target": "D",
"doc": {
"video_id": "096",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Q0B5dLHDQ2w",
"videoID": "Q0B5dLHDQ2w",
"question_id": "096-3",
"task_type": "Object Recognition",
"question": "Which object does the creature hold in its hand?",
"options": [
"A. A rifle.",
"B. A carrot.",
"C. Sunglasses.",
"D. A pistol."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which object does the creature hold in its hand?\nOption:\nA. A rifle.\nB. A carrot.\nC. Sunglasses.\nD. A pistol.\nAnswer with the option's letter from the given choices directly.",
287,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "096-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 288,
"target": "C",
"doc": {
"video_id": "097",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=uoJDGnaVuTg",
"videoID": "uoJDGnaVuTg",
"question_id": "097-1",
"task_type": "Action Recognition",
"question": "What do the two spider-men do in this video?",
"options": [
"A. Eating cakes.",
"B. Fighting with monster.",
"C. Drinking tea.",
"D. Sleeping."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two spider-men do in this video?\nOption:\nA. Eating cakes.\nB. Fighting with monster.\nC. Drinking tea.\nD. Sleeping.\nAnswer with the option's letter from the given choices directly.",
288,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "097-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 289,
"target": "C",
"doc": {
"video_id": "097",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=uoJDGnaVuTg",
"videoID": "uoJDGnaVuTg",
"question_id": "097-2",
"task_type": "Object Recognition",
"question": "Which person placed their leg on the table?",
"options": [
"A. The one on top.",
"B. The one on the right.",
"C. The one on the left.",
"D. The one below."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which person placed their leg on the table?\nOption:\nA. The one on top.\nB. The one on the right.\nC. The one on the left.\nD. The one below.\nAnswer with the option's letter from the given choices directly.",
289,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "097-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 290,
"target": "A",
"doc": {
"video_id": "097",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=uoJDGnaVuTg",
"videoID": "uoJDGnaVuTg",
"question_id": "097-3",
"task_type": "Attribute Perception",
"question": "How does the right one become at the end?",
"options": [
"A. Small.",
"B. Big.",
"C. Slim.",
"D. Fat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the right one become at the end?\nOption:\nA. Small.\nB. Big.\nC. Slim.\nD. Fat.\nAnswer with the option's letter from the given choices directly.",
290,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "097-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 291,
"target": "A",
"doc": {
"video_id": "098",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=PU-XOFIJMlg",
"videoID": "PU-XOFIJMlg",
"question_id": "098-1",
"task_type": "Object Recognition",
"question": "What element is unavailable for the protagonist to use?",
"options": [
"A. Metal.",
"B. Water.",
"C. Fire.",
"D. Air."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What element is unavailable for the protagonist to use?\nOption:\nA. Metal.\nB. Water.\nC. Fire.\nD. Air.\nAnswer with the option's letter from the given choices directly.",
291,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "098-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 292,
"target": "D",
"doc": {
"video_id": "098",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=PU-XOFIJMlg",
"videoID": "PU-XOFIJMlg",
"question_id": "098-2",
"task_type": "Object Recognition",
"question": "Which character also appears in this video?",
"options": [
"A. Ariel.",
"B. Auraro.",
"C. Anna.",
"D. Elsa."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which character also appears in this video?\nOption:\nA. Ariel.\nB. Auraro.\nC. Anna.\nD. Elsa.\nAnswer with the option's letter from the given choices directly.",
292,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "098-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 293,
"target": "B",
"doc": {
"video_id": "098",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=PU-XOFIJMlg",
"videoID": "PU-XOFIJMlg",
"question_id": "098-3",
"task_type": "Object Recognition",
"question": "What weapon does the protagonist use in this video?",
"options": [
"A. A knife.",
"B. A stick.",
"C. A sword.",
"D. A shield."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What weapon does the protagonist use in this video?\nOption:\nA. A knife.\nB. A stick.\nC. A sword.\nD. A shield.\nAnswer with the option's letter from the given choices directly.",
293,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "098-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 294,
"target": "B",
"doc": {
"video_id": "099",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Ij-FYOrklFE",
"videoID": "Ij-FYOrklFE",
"question_id": "099-1",
"task_type": "Counting Problem",
"question": "How many Spider-Men are visible in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many Spider-Men are visible in the video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
294,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "099-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 295,
"target": "A",
"doc": {
"video_id": "099",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Ij-FYOrklFE",
"videoID": "Ij-FYOrklFE",
"question_id": "099-2",
"task_type": "Spatial Perception",
"question": "What is the traffic situation in the city?",
"options": [
"A. Congestion.",
"B. Empty.",
"C. Medium.",
"D. Only pedestrian."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the traffic situation in the city?\nOption:\nA. Congestion.\nB. Empty.\nC. Medium.\nD. Only pedestrian.\nAnswer with the option's letter from the given choices directly.",
295,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "099-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 296,
"target": "C",
"doc": {
"video_id": "099",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Ij-FYOrklFE",
"videoID": "Ij-FYOrklFE",
"question_id": "099-3",
"task_type": "Action Recognition",
"question": "As can be seen in the video, what happens when the black spider-man is blamed by the other spider-man?",
"options": [
"A. Someone comes into the room.",
"B. He eats some cookies.",
"C. The tea overflows.",
"D. He tips the waiter."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, what happens when the black spider-man is blamed by the other spider-man?\nOption:\nA. Someone comes into the room.\nB. He eats some cookies.\nC. The tea overflows.\nD. He tips the waiter.\nAnswer with the option's letter from the given choices directly.",
296,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "099-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 297,
"target": "D",
"doc": {
"video_id": "100",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=56yT3H_DjVE",
"videoID": "56yT3H_DjVE",
"question_id": "100-1",
"task_type": "Object Recognition",
"question": "Which animal appears in this video?",
"options": [
"A. Kangaroo.",
"B. Bird.",
"C. Deer.",
"D. Panda."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animal appears in this video?\nOption:\nA. Kangaroo.\nB. Bird.\nC. Deer.\nD. Panda.\nAnswer with the option's letter from the given choices directly.",
297,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "100-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 298,
"target": "B",
"doc": {
"video_id": "100",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=56yT3H_DjVE",
"videoID": "56yT3H_DjVE",
"question_id": "100-2",
"task_type": "Object Recognition",
"question": "What does the panda use to fight with enemies in the video?",
"options": [
"A. A bamboo.",
"B. A stick.",
"C. An axe.",
"D. A spear."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the panda use to fight with enemies in the video?\nOption:\nA. A bamboo.\nB. A stick.\nC. An axe.\nD. A spear.\nAnswer with the option's letter from the given choices directly.",
298,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "100-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 299,
"target": "A",
"doc": {
"video_id": "100",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=56yT3H_DjVE",
"videoID": "56yT3H_DjVE",
"question_id": "100-3",
"task_type": "Object Recognition",
"question": "Which type of tree is depicted behind the meditating panda?",
"options": [
"A. A peach tree.",
"B. A pear tree.",
"C. An apple tree.",
"D. A pomegranate tree."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which type of tree is depicted behind the meditating panda?\nOption:\nA. A peach tree.\nB. A pear tree.\nC. An apple tree.\nD. A pomegranate tree.\nAnswer with the option's letter from the given choices directly.",
299,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "100-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 300,
"target": "A",
"doc": {
"video_id": "101",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=tiMaUSvlzIU",
"videoID": "tiMaUSvlzIU",
"question_id": "101-1",
"task_type": "Spatial Perception",
"question": "Where is the man answering the phone in the video?",
"options": [
"A. On a train.",
"B. In a room.",
"C. In a movie theater.",
"D. Outdoors."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the man answering the phone in the video?\nOption:\nA. On a train.\nB. In a room.\nC. In a movie theater.\nD. Outdoors.\nAnswer with the option's letter from the given choices directly.",
300,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "101-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 301,
"target": "D",
"doc": {
"video_id": "101",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=tiMaUSvlzIU",
"videoID": "tiMaUSvlzIU",
"question_id": "101-2",
"task_type": "Action Reasoning",
"question": "Why is the man in the video angry?",
"options": [
"A. Because the woman he's talking to on the phone called him \"moon\".",
"B. Because he doesn't like his traveling companions.",
"C. Because his grandmother called him \"moon\".",
"D. Because the woman he's talking to on the phone called him \"moonpie\"."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is the man in the video angry?\nOption:\nA. Because the woman he's talking to on the phone called him \"moon\".\nB. Because he doesn't like his traveling companions.\nC. Because his grandmother called him \"moon\".\nD. Because the woman he's talking to on the phone called him \"moonpie\".\nAnswer with the option's letter from the given choices directly.",
301,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "101-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 302,
"target": "B",
"doc": {
"video_id": "101",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=tiMaUSvlzIU",
"videoID": "tiMaUSvlzIU",
"question_id": "101-3",
"task_type": "Action Recognition",
"question": "How many times does the person in the video transfer the phone to another person?",
"options": [
"A. 1.",
"B. 3.",
"C. 5.",
"D. 7."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the person in the video transfer the phone to another person?\nOption:\nA. 1.\nB. 3.\nC. 5.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
302,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "101-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 303,
"target": "B",
"doc": {
"video_id": "102",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=0IdYJGBmguM",
"videoID": "0IdYJGBmguM",
"question_id": "102-1",
"task_type": "Object Reasoning",
"question": "According to the video, who is about to get married?",
"options": [
"A. The man in short sleeves.",
"B. The man in the dark suit.",
"C. The man in the blue shirt.",
"D. It is unclear."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, who is about to get married?\nOption:\nA. The man in short sleeves.\nB. The man in the dark suit.\nC. The man in the blue shirt.\nD. It is unclear.\nAnswer with the option's letter from the given choices directly.",
303,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "102-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 304,
"target": "B",
"doc": {
"video_id": "102",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=0IdYJGBmguM",
"videoID": "0IdYJGBmguM",
"question_id": "102-2",
"task_type": "Action Recognition",
"question": "What are the people in the video arguing about?",
"options": [
"A. Whether the person in the blue shirt will ever get married.",
"B. Which person the man in the suit should choose as his best man.",
"C. Whether the man in the suit should get married.",
"D. How many times the man in the suit is getting married."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people in the video arguing about?\nOption:\nA. Whether the person in the blue shirt will ever get married.\nB. Which person the man in the suit should choose as his best man.\nC. Whether the man in the suit should get married.\nD. How many times the man in the suit is getting married.\nAnswer with the option's letter from the given choices directly.",
304,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "102-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 305,
"target": "A",
"doc": {
"video_id": "102",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=0IdYJGBmguM",
"videoID": "0IdYJGBmguM",
"question_id": "102-3",
"task_type": "Counting Problem",
"question": "How many people are wearing ties in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 0.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are wearing ties in the video?\nOption:\nA. 2.\nB. 3.\nC. 0.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
305,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "102-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 306,
"target": "A",
"doc": {
"video_id": "103",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=G267g0DpCVg",
"videoID": "G267g0DpCVg",
"question_id": "103-1",
"task_type": "Object Recognition",
"question": "Which item does the man throw into the trash at the beginning of the video?",
"options": [
"A. A fork.",
"B. A pair of chopsticks.",
"C. A box of noodles.",
"D. A spoon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item does the man throw into the trash at the beginning of the video?\nOption:\nA. A fork.\nB. A pair of chopsticks.\nC. A box of noodles.\nD. A spoon.\nAnswer with the option's letter from the given choices directly.",
306,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "103-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 307,
"target": "C",
"doc": {
"video_id": "103",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=G267g0DpCVg",
"videoID": "G267g0DpCVg",
"question_id": "103-2",
"task_type": "Object Recognition",
"question": "What is the man in the video eating?",
"options": [
"A. Fried chicken.",
"B. Hamburg.",
"C. Noodles.",
"D. Rice."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man in the video eating?\nOption:\nA. Fried chicken.\nB. Hamburg.\nC. Noodles.\nD. Rice.\nAnswer with the option's letter from the given choices directly.",
307,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "103-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 308,
"target": "A",
"doc": {
"video_id": "103",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=G267g0DpCVg",
"videoID": "G267g0DpCVg",
"question_id": "103-3",
"task_type": "Action Reasoning",
"question": "Why did the man in the video laugh at the end?",
"options": [
"A. Because he learned an alternative use for chopsticks.",
"B. Because the food was delicious.",
"C. Because he really likes using chopsticks.",
"D. It is impossible to determine why the man laughed based on the information provided."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the man in the video laugh at the end?\nOption:\nA. Because he learned an alternative use for chopsticks.\nB. Because the food was delicious.\nC. Because he really likes using chopsticks.\nD. It is impossible to determine why the man laughed based on the information provided.\nAnswer with the option's letter from the given choices directly.",
308,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "103-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 309,
"target": "A",
"doc": {
"video_id": "104",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=_cZXyj6rYVg",
"videoID": "_cZXyj6rYVg",
"question_id": "104-1",
"task_type": "Action Recognition",
"question": "What are the two people in the video doing?",
"options": [
"A. The man is trying to teach the woman knowledge.",
"B. The man is singing with the woman.",
"C. They are drawing pictures.",
"D. They are cooking."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the two people in the video doing?\nOption:\nA. The man is trying to teach the woman knowledge.\nB. The man is singing with the woman.\nC. They are drawing pictures.\nD. They are cooking.\nAnswer with the option's letter from the given choices directly.",
309,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "104-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 310,
"target": "C",
"doc": {
"video_id": "104",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=_cZXyj6rYVg",
"videoID": "_cZXyj6rYVg",
"question_id": "104-2",
"task_type": "Action Reasoning",
"question": "What was the reason for the woman in the video crying?",
"options": [
"A. She was unhappy with the man.",
"B. She found the man in the video to be foolish.",
"C. She felt foolish.",
"D. Insufficient information to determine."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the reason for the woman in the video crying?\nOption:\nA. She was unhappy with the man.\nB. She found the man in the video to be foolish.\nC. She felt foolish.\nD. Insufficient information to determine.\nAnswer with the option's letter from the given choices directly.",
310,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "104-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 311,
"target": "C",
"doc": {
"video_id": "104",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=_cZXyj6rYVg",
"videoID": "_cZXyj6rYVg",
"question_id": "104-3",
"task_type": "Attribute Perception",
"question": "What color is the lamp next to the sofa in the video?",
"options": [
"A. Blue.",
"B. Green.",
"C. Orange.",
"D. White."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the lamp next to the sofa in the video?\nOption:\nA. Blue.\nB. Green.\nC. Orange.\nD. White.\nAnswer with the option's letter from the given choices directly.",
311,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "104-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 312,
"target": "C",
"doc": {
"video_id": "105",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=rQhLWHtHyiM",
"videoID": "rQhLWHtHyiM",
"question_id": "105-1",
"task_type": "OCR Problems",
"question": "What number does the person type in at the beginning of the video?",
"options": [
"A. 123456.",
"B. 342432.",
"C. 322434.",
"D. 223344."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What number does the person type in at the beginning of the video?\nOption:\nA. 123456.\nB. 342432.\nC. 322434.\nD. 223344.\nAnswer with the option's letter from the given choices directly.",
312,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "105-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 313,
"target": "D",
"doc": {
"video_id": "105",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=rQhLWHtHyiM",
"videoID": "rQhLWHtHyiM",
"question_id": "105-2",
"task_type": "Object Recognition",
"question": "According to the video, which of the following items is in the safe?",
"options": [
"A. Gold.",
"B. Money.",
"C. A document.",
"D. A handgun."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following items is in the safe?\nOption:\nA. Gold.\nB. Money.\nC. A document.\nD. A handgun.\nAnswer with the option's letter from the given choices directly.",
313,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "105-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 314,
"target": "D",
"doc": {
"video_id": "105",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=rQhLWHtHyiM",
"videoID": "rQhLWHtHyiM",
"question_id": "105-3",
"task_type": "Counting Problem",
"question": "How many individuals are present in the room?",
"options": [
"A. 8.",
"B. 9.",
"C. 7.",
"D. 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individuals are present in the room?\nOption:\nA. 8.\nB. 9.\nC. 7.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
314,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "105-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 315,
"target": "A",
"doc": {
"video_id": "106",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=bYXhA8VG8Lw",
"videoID": "bYXhA8VG8Lw",
"question_id": "106-1",
"task_type": "Attribute Perception",
"question": "What is the main color of the mug in the room?",
"options": [
"A. White.",
"B. Blue.",
"C. Orange.",
"D. Black."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main color of the mug in the room?\nOption:\nA. White.\nB. Blue.\nC. Orange.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
315,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "106-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 316,
"target": "D",
"doc": {
"video_id": "106",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=bYXhA8VG8Lw",
"videoID": "bYXhA8VG8Lw",
"question_id": "106-2",
"task_type": "Action Reasoning",
"question": "As can be seen in the video, what does the woman do to the man next to her?",
"options": [
"A. She touches his lip.",
"B. She hits his head.",
"C. She holds him in the face.",
"D. She kisses his face."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, what does the woman do to the man next to her?\nOption:\nA. She touches his lip.\nB. She hits his head.\nC. She holds him in the face.\nD. She kisses his face.\nAnswer with the option's letter from the given choices directly.",
316,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "106-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 317,
"target": "D",
"doc": {
"video_id": "106",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=bYXhA8VG8Lw",
"videoID": "bYXhA8VG8Lw",
"question_id": "106-3",
"task_type": "Object Reasoning",
"question": "Who speaks the most in the room?",
"options": [
"A. Nobody speaks.",
"B. The woman with curly hair.",
"C. The man with white hair.",
"D. The man with curly hair."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who speaks the most in the room?\nOption:\nA. Nobody speaks.\nB. The woman with curly hair.\nC. The man with white hair.\nD. The man with curly hair.\nAnswer with the option's letter from the given choices directly.",
317,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "106-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 318,
"target": "D",
"doc": {
"video_id": "107",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=2Qno7H4BwAU",
"videoID": "2Qno7H4BwAU",
"question_id": "107-1",
"task_type": "Object Reasoning",
"question": "What object is the man in the gray T-shirt pulling?",
"options": [
"A. A kite.",
"B. A balloon.",
"C. A ladder.",
"D. A helicopter."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What object is the man in the gray T-shirt pulling?\nOption:\nA. A kite.\nB. A balloon.\nC. A ladder.\nD. A helicopter.\nAnswer with the option's letter from the given choices directly.",
318,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "107-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 319,
"target": "C",
"doc": {
"video_id": "107",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=2Qno7H4BwAU",
"videoID": "2Qno7H4BwAU",
"question_id": "107-2",
"task_type": "Attribute Perception",
"question": "Whar color is the helicopter in the video?",
"options": [
"A. Orange.",
"B. Pink.",
"C. Blue.",
"D. Black."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Whar color is the helicopter in the video?\nOption:\nA. Orange.\nB. Pink.\nC. Blue.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
319,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "107-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 320,
"target": "A",
"doc": {
"video_id": "107",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=2Qno7H4BwAU",
"videoID": "2Qno7H4BwAU",
"question_id": "107-3",
"task_type": "Action Reasoning",
"question": "What happens after the helicopter takes off in this video?",
"options": [
"A. It is pulled to ground and crash.",
"B. It flies to sky.",
"C. It falls into water.",
"D. It is hit by a missile."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens after the helicopter takes off in this video?\nOption:\nA. It is pulled to ground and crash.\nB. It flies to sky.\nC. It falls into water.\nD. It is hit by a missile.\nAnswer with the option's letter from the given choices directly.",
320,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "107-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 321,
"target": "C",
"doc": {
"video_id": "108",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=a0AGwUACt7E",
"videoID": "a0AGwUACt7E",
"question_id": "108-1",
"task_type": "Spatial Perception",
"question": "What is the weather like when the helicopter takes off?",
"options": [
"A. Sown.",
"B. Sunny.",
"C. Rainy.",
"D. Cloudless."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the weather like when the helicopter takes off?\nOption:\nA. Sown.\nB. Sunny.\nC. Rainy.\nD. Cloudless.\nAnswer with the option's letter from the given choices directly.",
321,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "108-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 322,
"target": "B",
"doc": {
"video_id": "108",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=a0AGwUACt7E",
"videoID": "a0AGwUACt7E",
"question_id": "108-2",
"task_type": "Action Recognition",
"question": "What does the man do in the helicopter?",
"options": [
"A. Eating breakfast.",
"B. Putting on a red mech.",
"C. Having a nap.",
"D. Reading a book."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the man do in the helicopter?\nOption:\nA. Eating breakfast.\nB. Putting on a red mech.\nC. Having a nap.\nD. Reading a book.\nAnswer with the option's letter from the given choices directly.",
322,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "108-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 323,
"target": "B",
"doc": {
"video_id": "108",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=a0AGwUACt7E",
"videoID": "a0AGwUACt7E",
"question_id": "108-3",
"task_type": "Object Recognition",
"question": "Which of the following shows up as Iron-Man flies away?",
"options": [
"A. A bird.",
"B. A flight.",
"C. An island.",
"D. Sun."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following shows up as Iron-Man flies away?\nOption:\nA. A bird.\nB. A flight.\nC. An island.\nD. Sun.\nAnswer with the option's letter from the given choices directly.",
323,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "108-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 324,
"target": "A",
"doc": {
"video_id": "109",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=qu9ImFMLYxw",
"videoID": "qu9ImFMLYxw",
"question_id": "109-1",
"task_type": "Action Recognition",
"question": "What is the man at the table doing while others speak to him?",
"options": [
"A. Writing.",
"B. Eating.",
"C. Drinking.",
"D. Sleeping."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man at the table doing while others speak to him?\nOption:\nA. Writing.\nB. Eating.\nC. Drinking.\nD. Sleeping.\nAnswer with the option's letter from the given choices directly.",
324,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "109-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 325,
"target": "D",
"doc": {
"video_id": "109",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=qu9ImFMLYxw",
"videoID": "qu9ImFMLYxw",
"question_id": "109-2",
"task_type": "Attribute Perception",
"question": "What is the hair color of the smoking man?",
"options": [
"A. Red.",
"B. Golden.",
"C. White.",
"D. Black."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the hair color of the smoking man?\nOption:\nA. Red.\nB. Golden.\nC. White.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
325,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "109-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 326,
"target": "C",
"doc": {
"video_id": "109",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=qu9ImFMLYxw",
"videoID": "qu9ImFMLYxw",
"question_id": "109-3",
"task_type": "Action Recognition",
"question": "What is the attitude of the man who sits at the table when others speak to him?",
"options": [
"A. Interested.",
"B. Excited.",
"C. Careless.",
"D. Angry."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the attitude of the man who sits at the table when others speak to him?\nOption:\nA. Interested.\nB. Excited.\nC. Careless.\nD. Angry.\nAnswer with the option's letter from the given choices directly.",
326,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "109-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 327,
"target": "B",
"doc": {
"video_id": "110",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=Qgr4dcsY-60",
"videoID": "Qgr4dcsY-60",
"question_id": "110-1",
"task_type": "Action Recognition",
"question": "What activities do students engage in within the room?",
"options": [
"A. Reading books.",
"B. Practicing spell.",
"C. Fighting with rods.",
"D. Making explosion."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What activities do students engage in within the room?\nOption:\nA. Reading books.\nB. Practicing spell.\nC. Fighting with rods.\nD. Making explosion.\nAnswer with the option's letter from the given choices directly.",
327,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "110-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 328,
"target": "C",
"doc": {
"video_id": "110",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=Qgr4dcsY-60",
"videoID": "Qgr4dcsY-60",
"question_id": "110-2",
"task_type": "Action Recognition",
"question": "Which of the following outcomes occurs when the boy with brown hair and no glasses indicates the feather?",
"options": [
"A. The boy floats up.",
"B. The boy explodes.",
"C. The feather explodes.",
"D. The feather floats up."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following outcomes occurs when the boy with brown hair and no glasses indicates the feather?\nOption:\nA. The boy floats up.\nB. The boy explodes.\nC. The feather explodes.\nD. The feather floats up.\nAnswer with the option's letter from the given choices directly.",
328,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "110-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 329,
"target": "B",
"doc": {
"video_id": "110",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=Qgr4dcsY-60",
"videoID": "Qgr4dcsY-60",
"question_id": "110-3",
"task_type": "Action Reasoning",
"question": "What is the attitude of the teacher towards the girl's appearance?",
"options": [
"A. Angry.",
"B. Positive.",
"C. Disappointed.",
"D. Peaceful."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the attitude of the teacher towards the girl's appearance?\nOption:\nA. Angry.\nB. Positive.\nC. Disappointed.\nD. Peaceful.\nAnswer with the option's letter from the given choices directly.",
329,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "110-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 330,
"target": "B",
"doc": {
"video_id": "111",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HmnKbBmYii8",
"videoID": "HmnKbBmYii8",
"question_id": "111-1",
"task_type": "Counting Problem",
"question": "How many polar bears are visible/seen in the video?",
"options": [
"A. 4.",
"B. 3.",
"C. 1.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many polar bears are visible/seen in the video?\nOption:\nA. 4.\nB. 3.\nC. 1.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
330,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "111-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 331,
"target": "D",
"doc": {
"video_id": "111",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HmnKbBmYii8",
"videoID": "HmnKbBmYii8",
"question_id": "111-2",
"task_type": "Action Reasoning",
"question": "In the video, what is the polar bear mother waiting for?",
"options": [
"A. A blizzard.",
"B. Little polar bears.",
"C. The polar bear father.",
"D. A seal to come to the breathing hole and hunt it."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what is the polar bear mother waiting for?\nOption:\nA. A blizzard.\nB. Little polar bears.\nC. The polar bear father.\nD. A seal to come to the breathing hole and hunt it.\nAnswer with the option's letter from the given choices directly.",
331,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "111-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 332,
"target": "C",
"doc": {
"video_id": "111",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HmnKbBmYii8",
"videoID": "HmnKbBmYii8",
"question_id": "111-3",
"task_type": "OCR Problems",
"question": "According to the video, what is the frequency at which a seal needs to surface for air?",
"options": [
"A. Every 3 minutes.",
"B. Every 40 minutes.",
"C. Every 30 minutes.",
"D. Every 3 hours."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the frequency at which a seal needs to surface for air?\nOption:\nA. Every 3 minutes.\nB. Every 40 minutes.\nC. Every 30 minutes.\nD. Every 3 hours.\nAnswer with the option's letter from the given choices directly.",
332,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "111-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 333,
"target": "B",
"doc": {
"video_id": "112",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=MPfuSfiZ96A",
"videoID": "MPfuSfiZ96A",
"question_id": "112-1",
"task_type": "Counting Problem",
"question": "How many birds can be observed perched on the street lamp in the first half of the video?",
"options": [
"A. 6.",
"B. 5.",
"C. 7.",
"D. 8."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many birds can be observed perched on the street lamp in the first half of the video?\nOption:\nA. 6.\nB. 5.\nC. 7.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
333,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "112-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 334,
"target": "C",
"doc": {
"video_id": "112",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=MPfuSfiZ96A",
"videoID": "MPfuSfiZ96A",
"question_id": "112-2",
"task_type": "Information Synopsis",
"question": "What do the people in the village show in the video used to illuminate the night?",
"options": [
"A. Flashlights.",
"B. Glow sticks.",
"C. Homemade lamps.",
"D. Electric lights."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the people in the village show in the video used to illuminate the night?\nOption:\nA. Flashlights.\nB. Glow sticks.\nC. Homemade lamps.\nD. Electric lights.\nAnswer with the option's letter from the given choices directly.",
334,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "112-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 335,
"target": "D",
"doc": {
"video_id": "112",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=MPfuSfiZ96A",
"videoID": "MPfuSfiZ96A",
"question_id": "112-3",
"task_type": "Action Reasoning",
"question": "What is the purpose of collecting sap from a rubber tree in the video?",
"options": [
"A. To use as cooking ingredients.",
"B. For medicinal purposes.",
"C. As decoration.",
"D. To use as fuel for self-made lamps."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of collecting sap from a rubber tree in the video?\nOption:\nA. To use as cooking ingredients.\nB. For medicinal purposes.\nC. As decoration.\nD. To use as fuel for self-made lamps.\nAnswer with the option's letter from the given choices directly.",
335,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "112-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 336,
"target": "A",
"doc": {
"video_id": "113",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HqAfzcJurMM",
"videoID": "HqAfzcJurMM",
"question_id": "113-1",
"task_type": "Attribute Perception",
"question": "Which material was used to create the perfect sphere in the video?",
"options": [
"A. Dirt.",
"B. Wood.",
"C. Metal.",
"D. Glass."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which material was used to create the perfect sphere in the video?\nOption:\nA. Dirt.\nB. Wood.\nC. Metal.\nD. Glass.\nAnswer with the option's letter from the given choices directly.",
336,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "113-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 337,
"target": "B",
"doc": {
"video_id": "113",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HqAfzcJurMM",
"videoID": "HqAfzcJurMM",
"question_id": "113-2",
"task_type": "Counting Problem",
"question": "In the second half of the video, the man held up several spheres with both hands. How many spheres did he hold up in total during this time?",
"options": [
"A. 7.",
"B. 9.",
"C. 6.",
"D. 8."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the second half of the video, the man held up several spheres with both hands. How many spheres did he hold up in total during this time?\nOption:\nA. 7.\nB. 9.\nC. 6.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
337,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "113-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 338,
"target": "B",
"doc": {
"video_id": "113",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HqAfzcJurMM",
"videoID": "HqAfzcJurMM",
"question_id": "113-3",
"task_type": "Object Recognition",
"question": "What is the man in a dark shirt and jeans holding in his hand at the beginning of the video?",
"options": [
"A. Paint bucket and paintbrush.",
"B. Bucket and shovel.",
"C. Broom and bucket.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man in a dark shirt and jeans holding in his hand at the beginning of the video?\nOption:\nA. Paint bucket and paintbrush.\nB. Bucket and shovel.\nC. Broom and bucket.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
338,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "113-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 339,
"target": "D",
"doc": {
"video_id": "114",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=pcO-alfiyEo",
"videoID": "pcO-alfiyEo",
"question_id": "114-1",
"task_type": "Information Synopsis",
"question": "Which disease is the subject of this video?",
"options": [
"A. High blood pressure.",
"B. Heart disease.",
"C. Cancer.",
"D. Diabetes Mellitus."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which disease is the subject of this video?\nOption:\nA. High blood pressure.\nB. Heart disease.\nC. Cancer.\nD. Diabetes Mellitus.\nAnswer with the option's letter from the given choices directly.",
339,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "114-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 340,
"target": "C",
"doc": {
"video_id": "114",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=pcO-alfiyEo",
"videoID": "pcO-alfiyEo",
"question_id": "114-2",
"task_type": "Attribute Perception",
"question": "How do the people in the video appear emotionally?",
"options": [
"A. Cheerful and jubilant.",
"B. Unable to determine.",
"C. Full of pressure.",
"D. Positive and upbeat."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do the people in the video appear emotionally?\nOption:\nA. Cheerful and jubilant.\nB. Unable to determine.\nC. Full of pressure.\nD. Positive and upbeat.\nAnswer with the option's letter from the given choices directly.",
340,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "114-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 341,
"target": "C",
"doc": {
"video_id": "114",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=pcO-alfiyEo",
"videoID": "pcO-alfiyEo",
"question_id": "114-3",
"task_type": "Counting Problem",
"question": "At the beginning of the video, what is the number of pills displayed on the board being held by an individual?",
"options": [
"A. 11.",
"B. 9.",
"C. 8.",
"D. 10."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning of the video, what is the number of pills displayed on the board being held by an individual?\nOption:\nA. 11.\nB. 9.\nC. 8.\nD. 10.\nAnswer with the option's letter from the given choices directly.",
341,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "114-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 342,
"target": "D",
"doc": {
"video_id": "115",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=W86cTIoMv2U",
"videoID": "W86cTIoMv2U",
"question_id": "115-1",
"task_type": "Attribute Perception",
"question": "What is the color pattern of the cat in the video?",
"options": [
"A. Black and white.",
"B. Calico.",
"C. Solid.",
"D. Spotted."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color pattern of the cat in the video?\nOption:\nA. Black and white.\nB. Calico.\nC. Solid.\nD. Spotted.\nAnswer with the option's letter from the given choices directly.",
342,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "115-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 343,
"target": "B",
"doc": {
"video_id": "115",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=W86cTIoMv2U",
"videoID": "W86cTIoMv2U",
"question_id": "115-2",
"task_type": "OCR Problems",
"question": "According to the video, what is the age range of the cat in the video?",
"options": [
"A. Near adulthood.",
"B. Kitten.",
"C. Indeterminable.",
"D. Senior cat."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the age range of the cat in the video?\nOption:\nA. Near adulthood.\nB. Kitten.\nC. Indeterminable.\nD. Senior cat.\nAnswer with the option's letter from the given choices directly.",
343,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "115-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 344,
"target": "A",
"doc": {
"video_id": "115",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=W86cTIoMv2U",
"videoID": "W86cTIoMv2U",
"question_id": "115-3",
"task_type": "Counting Problem",
"question": "How many streams did the cat cross in the video?",
"options": [
"A. 2.",
"B. 1.",
"C. 3.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many streams did the cat cross in the video?\nOption:\nA. 2.\nB. 1.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
344,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "115-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 345,
"target": "B",
"doc": {
"video_id": "116",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=qK4G2KpmqFU",
"videoID": "qK4G2KpmqFU",
"question_id": "116-1",
"task_type": "Attribute Perception",
"question": "What color are the protective gloves worn by the person in the video?",
"options": [
"A. Yellow and green.",
"B. Red and yellow.",
"C. White.",
"D. Blue."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the protective gloves worn by the person in the video?\nOption:\nA. Yellow and green.\nB. Red and yellow.\nC. White.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
345,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "116-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 346,
"target": "D",
"doc": {
"video_id": "116",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=qK4G2KpmqFU",
"videoID": "qK4G2KpmqFU",
"question_id": "116-2",
"task_type": "Information Synopsis",
"question": "According to the video, why is this plant called the most dangerous plant in the desert?",
"options": [
"A. Because it is very flammable.",
"B. Because there are often dangerous animals around it.",
"C. Because it is highly poisonous.",
"D. Because it is full of sharp spines with barbs."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why is this plant called the most dangerous plant in the desert?\nOption:\nA. Because it is very flammable.\nB. Because there are often dangerous animals around it.\nC. Because it is highly poisonous.\nD. Because it is full of sharp spines with barbs.\nAnswer with the option's letter from the given choices directly.",
346,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "116-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 347,
"target": "B",
"doc": {
"video_id": "116",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=qK4G2KpmqFU",
"videoID": "qK4G2KpmqFU",
"question_id": "116-3",
"task_type": "Spatial Perception",
"question": "Which hand did the person in the video wear a glove on?",
"options": [
"A. His left hand.",
"B. His right hand.",
"C. Neither hand.",
"D. Both hands."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which hand did the person in the video wear a glove on?\nOption:\nA. His left hand.\nB. His right hand.\nC. Neither hand.\nD. Both hands.\nAnswer with the option's letter from the given choices directly.",
347,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "116-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 348,
"target": "B",
"doc": {
"video_id": "117",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=Z7PlUGbsXlQ",
"videoID": "Z7PlUGbsXlQ",
"question_id": "117-1",
"task_type": "Counting Problem",
"question": "What is the total number of bird species that are visible in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 1.",
"D. 0."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the total number of bird species that are visible in the video?\nOption:\nA. 2.\nB. 3.\nC. 1.\nD. 0.\nAnswer with the option's letter from the given choices directly.",
348,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "117-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 349,
"target": "B",
"doc": {
"video_id": "117",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=Z7PlUGbsXlQ",
"videoID": "Z7PlUGbsXlQ",
"question_id": "117-2",
"task_type": "Information Synopsis",
"question": "According to the information provided in the video, approximately what is the height of these emperor penguin chicks?",
"options": [
"A. About 1 cm.",
"B. About one meter.",
"C. About 20 cm.",
"D. About 10 cm."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the information provided in the video, approximately what is the height of these emperor penguin chicks?\nOption:\nA. About 1 cm.\nB. About one meter.\nC. About 20 cm.\nD. About 10 cm.\nAnswer with the option's letter from the given choices directly.",
349,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "117-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 350,
"target": "D",
"doc": {
"video_id": "117",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=Z7PlUGbsXlQ",
"videoID": "Z7PlUGbsXlQ",
"question_id": "117-3",
"task_type": "Counting Problem",
"question": "How many giant petrels are seen in the video?",
"options": [
"A. 2.",
"B. 4.",
"C. 3.",
"D. 1."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many giant petrels are seen in the video?\nOption:\nA. 2.\nB. 4.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
350,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "117-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 351,
"target": "C",
"doc": {
"video_id": "118",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=tGdL-34L-GE",
"videoID": "tGdL-34L-GE",
"question_id": "118-1",
"task_type": "Counting Problem",
"question": "How many sand cats that can be observed in the video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many sand cats that can be observed in the video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
351,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "118-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 352,
"target": "C",
"doc": {
"video_id": "118",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=tGdL-34L-GE",
"videoID": "tGdL-34L-GE",
"question_id": "118-2",
"task_type": "Information Synopsis",
"question": "According to the video, why is it very difficult to capture footage of sand cats?",
"options": [
"A. Because sand cats are small.",
"B. Because sand cats are very fast.",
"C. Because the fur color of sand cats blends in with the environment, making them hard to spot.",
"D. Because there are too few sand cats."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why is it very difficult to capture footage of sand cats?\nOption:\nA. Because sand cats are small.\nB. Because sand cats are very fast.\nC. Because the fur color of sand cats blends in with the environment, making them hard to spot.\nD. Because there are too few sand cats.\nAnswer with the option's letter from the given choices directly.",
352,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "118-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 353,
"target": "B",
"doc": {
"video_id": "118",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=tGdL-34L-GE",
"videoID": "tGdL-34L-GE",
"question_id": "118-3",
"task_type": "OCR Problems",
"question": "What is the estimated age of the sand cats shown in the video according to the scientists?",
"options": [
"A. 6 to 8 months.",
"B. 6 to 8 weeks.",
"C. 2 to 4 months.",
"D. 2 to 4 weeks."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the estimated age of the sand cats shown in the video according to the scientists?\nOption:\nA. 6 to 8 months.\nB. 6 to 8 weeks.\nC. 2 to 4 months.\nD. 2 to 4 weeks.\nAnswer with the option's letter from the given choices directly.",
353,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "118-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 354,
"target": "B",
"doc": {
"video_id": "119",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=-ApL8d6tX5U",
"videoID": "-ApL8d6tX5U",
"question_id": "119-1",
"task_type": "Information Synopsis",
"question": "What is the subject of this video?",
"options": [
"A. The shortest man in the world.",
"B. The shortest man in the world.",
"C. Life in the United States.",
"D. Medical care in the United States."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject of this video?\nOption:\nA. The shortest man in the world.\nB. The shortest man in the world.\nC. Life in the United States.\nD. Medical care in the United States.\nAnswer with the option's letter from the given choices directly.",
354,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "119-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 355,
"target": "C",
"doc": {
"video_id": "119",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=-ApL8d6tX5U",
"videoID": "-ApL8d6tX5U",
"question_id": "119-2",
"task_type": "Attribute Perception",
"question": "As can be seen in the video, what color is the bowling ball next to the world's shortest woman?",
"options": [
"A. White.",
"B. Red.",
"C. Green.",
"D. Black."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, what color is the bowling ball next to the world's shortest woman?\nOption:\nA. White.\nB. Red.\nC. Green.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
355,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "119-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 356,
"target": "A",
"doc": {
"video_id": "119",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=-ApL8d6tX5U",
"videoID": "-ApL8d6tX5U",
"question_id": "119-3",
"task_type": "Information Synopsis",
"question": "According to the video, what problem does the world's shortest woman have?",
"options": [
"A. There is a problem with her legs.",
"B. There is a problem with her arms.",
"C. There is a problem with her heart.",
"D. There is a problem with her psychology."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what problem does the world's shortest woman have?\nOption:\nA. There is a problem with her legs.\nB. There is a problem with her arms.\nC. There is a problem with her heart.\nD. There is a problem with her psychology.\nAnswer with the option's letter from the given choices directly.",
356,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "119-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 357,
"target": "B",
"doc": {
"video_id": "120",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=s50vvwTystA",
"videoID": "s50vvwTystA",
"question_id": "120-1",
"task_type": "Attribute Perception",
"question": "What is the animal in the video?",
"options": [
"A. A cow.",
"B. A lamp.",
"C. A cat.",
"D. A dog."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the animal in the video?\nOption:\nA. A cow.\nB. A lamp.\nC. A cat.\nD. A dog.\nAnswer with the option's letter from the given choices directly.",
357,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "120-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 358,
"target": "D",
"doc": {
"video_id": "120",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=s50vvwTystA",
"videoID": "s50vvwTystA",
"question_id": "120-2",
"task_type": "Action Reasoning",
"question": "What is the reason for the lamp's action in the video, slapping the human portrayed?",
"options": [
"A. It is experiencing hunger.",
"B. It is expressing anger.",
"C. It is unwell.",
"D. It desires more physical contact."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reason for the lamp's action in the video, slapping the human portrayed?\nOption:\nA. It is experiencing hunger.\nB. It is expressing anger.\nC. It is unwell.\nD. It desires more physical contact.\nAnswer with the option's letter from the given choices directly.",
358,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "120-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 359,
"target": "B",
"doc": {
"video_id": "120",
"duration": "short",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=s50vvwTystA",
"videoID": "s50vvwTystA",
"question_id": "120-3",
"task_type": "Attribute Perception",
"question": "Which color of clothing is the person wearing in the video?",
"options": [
"A. White.",
"B. Red.",
"C. Green.",
"D. Blue."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color of clothing is the person wearing in the video?\nOption:\nA. White.\nB. Red.\nC. Green.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
359,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "120-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 360,
"target": "B",
"doc": {
"video_id": "121",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=43wqf_KhiUo",
"videoID": "43wqf_KhiUo",
"question_id": "121-1",
"task_type": "Object Reasoning",
"question": "Who was sentenced to have guilt in this video?",
"options": [
"A. The actor who triggered the gun.",
"B. The prop master who was in charge of this gun.",
"C. The film director.",
"D. The deceased himself."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was sentenced to have guilt in this video?\nOption:\nA. The actor who triggered the gun.\nB. The prop master who was in charge of this gun.\nC. The film director.\nD. The deceased himself.\nAnswer with the option's letter from the given choices directly.",
360,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "121-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 361,
"target": "D",
"doc": {
"video_id": "121",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=43wqf_KhiUo",
"videoID": "43wqf_KhiUo",
"question_id": "121-2",
"task_type": "Attribute Perception",
"question": "Which color is the coat worn by the news anchor at the beginning of the video?",
"options": [
"A. Blue.",
"B. Green.",
"C. Yellow.",
"D. Pink."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color is the coat worn by the news anchor at the beginning of the video?\nOption:\nA. Blue.\nB. Green.\nC. Yellow.\nD. Pink.\nAnswer with the option's letter from the given choices directly.",
361,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "121-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 362,
"target": "D",
"doc": {
"video_id": "121",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=43wqf_KhiUo",
"videoID": "43wqf_KhiUo",
"question_id": "121-3",
"task_type": "Spatial Perception",
"question": "Based on the information provided in the video, which of the following locations is where the shooting occurred?",
"options": [
"A. The court.",
"B. The actor's house.",
"C. The TV station.",
"D. A ranch in New Mexico."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided in the video, which of the following locations is where the shooting occurred?\nOption:\nA. The court.\nB. The actor's house.\nC. The TV station.\nD. A ranch in New Mexico.\nAnswer with the option's letter from the given choices directly.",
362,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "121-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 363,
"target": "B",
"doc": {
"video_id": "122",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=Z-rHofd6g2Q",
"videoID": "Z-rHofd6g2Q",
"question_id": "122-1",
"task_type": "OCR Problems",
"question": "Which license plate is shown on Obama's car in the video?",
"options": [
"A. OBAMA.",
"B. WJI9HLW.",
"C. WJLHHJW.",
"D. BBCNEWS."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which license plate is shown on Obama's car in the video?\nOption:\nA. OBAMA.\nB. WJI9HLW.\nC. WJLHHJW.\nD. BBCNEWS.\nAnswer with the option's letter from the given choices directly.",
363,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "122-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 364,
"target": "D",
"doc": {
"video_id": "122",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=Z-rHofd6g2Q",
"videoID": "Z-rHofd6g2Q",
"question_id": "122-2",
"task_type": "OCR Problems",
"question": "What brand of car does Obama have in the video?",
"options": [
"A. BMW.",
"B. Ford.",
"C. Mercedes.",
"D. Range Rover."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What brand of car does Obama have in the video?\nOption:\nA. BMW.\nB. Ford.\nC. Mercedes.\nD. Range Rover.\nAnswer with the option's letter from the given choices directly.",
364,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "122-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 365,
"target": "C",
"doc": {
"video_id": "122",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=Z-rHofd6g2Q",
"videoID": "Z-rHofd6g2Q",
"question_id": "122-3",
"task_type": "Attribute Perception",
"question": "Which color is the dress worn by the woman next to Obama in the video clip?",
"options": [
"A. Blue.",
"B. Black.",
"C. Yellow.",
"D. Gray."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color is the dress worn by the woman next to Obama in the video clip?\nOption:\nA. Blue.\nB. Black.\nC. Yellow.\nD. Gray.\nAnswer with the option's letter from the given choices directly.",
365,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "122-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 366,
"target": "C",
"doc": {
"video_id": "123",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=lmlw4NNZBbw",
"videoID": "lmlw4NNZBbw",
"question_id": "123-1",
"task_type": "OCR Problems",
"question": "According to the video, how many days in the year 2023 were there no recorded incidents of air pollution in Bangkok?",
"options": [
"A. 34.",
"B. 32.",
"C. 31.",
"D. 33."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many days in the year 2023 were there no recorded incidents of air pollution in Bangkok?\nOption:\nA. 34.\nB. 32.\nC. 31.\nD. 33.\nAnswer with the option's letter from the given choices directly.",
366,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "123-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 367,
"target": "C",
"doc": {
"video_id": "123",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=lmlw4NNZBbw",
"videoID": "lmlw4NNZBbw",
"question_id": "123-2",
"task_type": "Object Reasoning",
"question": "According to the video, which of the following statements about Thailand's agenda on air pollution is correct?",
"options": [
"A. The agenda worked.",
"B. The agenda has been modified.",
"C. The agenda failed.",
"D. The agenda was abolished."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements about Thailand's agenda on air pollution is correct?\nOption:\nA. The agenda worked.\nB. The agenda has been modified.\nC. The agenda failed.\nD. The agenda was abolished.\nAnswer with the option's letter from the given choices directly.",
367,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "123-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 368,
"target": "C",
"doc": {
"video_id": "123",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=lmlw4NNZBbw",
"videoID": "lmlw4NNZBbw",
"question_id": "123-3",
"task_type": "Counting Problem",
"question": "How many times does the interviewed girl appear in the video?",
"options": [
"A. 4.",
"B. 1.",
"C. 2.",
"D. 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the interviewed girl appear in the video?\nOption:\nA. 4.\nB. 1.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
368,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "123-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 369,
"target": "D",
"doc": {
"video_id": "124",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=c0siCya457M",
"videoID": "c0siCya457M",
"question_id": "124-1",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following events occurred in India?",
"options": [
"A. Making a movie about war.",
"B. A Bollywood movie broke box office records.",
"C. Crops are being burned.",
"D. The farmers staged A protest."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following events occurred in India?\nOption:\nA. Making a movie about war.\nB. A Bollywood movie broke box office records.\nC. Crops are being burned.\nD. The farmers staged A protest.\nAnswer with the option's letter from the given choices directly.",
369,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "124-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 370,
"target": "D",
"doc": {
"video_id": "124",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=c0siCya457M",
"videoID": "c0siCya457M",
"question_id": "124-2",
"task_type": "Information Synopsis",
"question": "According to the video, what is the primary demand of farmers?",
"options": [
"A. Overthrow the government.",
"B. More land for farming.",
"C. More fertilizer.",
"D. Government guarantees of a minimum support price for their crop."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the primary demand of farmers?\nOption:\nA. Overthrow the government.\nB. More land for farming.\nC. More fertilizer.\nD. Government guarantees of a minimum support price for their crop.\nAnswer with the option's letter from the given choices directly.",
370,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "124-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 371,
"target": "C",
"doc": {
"video_id": "124",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=c0siCya457M",
"videoID": "c0siCya457M",
"question_id": "124-3",
"task_type": "Attribute Perception",
"question": "What colour is the news presenter's coat?",
"options": [
"A. Purple.",
"B. Yellow.",
"C. Pink.",
"D. Black."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What colour is the news presenter's coat?\nOption:\nA. Purple.\nB. Yellow.\nC. Pink.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
371,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "124-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 372,
"target": "C",
"doc": {
"video_id": "125",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=nKtWkDNGNOw",
"videoID": "nKtWkDNGNOw",
"question_id": "125-1",
"task_type": "Attribute Perception",
"question": "According to the video, how many residents have been evacuated from the town?",
"options": [
"A. More than 10000.",
"B. More than 2000.",
"C. More than 3000.",
"D. More than 15000."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many residents have been evacuated from the town?\nOption:\nA. More than 10000.\nB. More than 2000.\nC. More than 3000.\nD. More than 15000.\nAnswer with the option's letter from the given choices directly.",
372,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "125-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 373,
"target": "A",
"doc": {
"video_id": "125",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=nKtWkDNGNOw",
"videoID": "nKtWkDNGNOw",
"question_id": "125-2",
"task_type": "Action Reasoning",
"question": "According to the video, what was the reason for the town's evacuation?",
"options": [
"A. Impending volcanic eruption.",
"B. Lack of food.",
"C. Bad weather.",
"D. Earthquakes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what was the reason for the town's evacuation?\nOption:\nA. Impending volcanic eruption.\nB. Lack of food.\nC. Bad weather.\nD. Earthquakes.\nAnswer with the option's letter from the given choices directly.",
373,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "125-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 374,
"target": "A",
"doc": {
"video_id": "125",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=nKtWkDNGNOw",
"videoID": "nKtWkDNGNOw",
"question_id": "125-3",
"task_type": "Object Reasoning",
"question": "According to the video, what was the overall mood of the people interviewed?",
"options": [
"A. Sad.",
"B. Happy.",
"C. Getting angry.",
"D. Cheer up."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what was the overall mood of the people interviewed?\nOption:\nA. Sad.\nB. Happy.\nC. Getting angry.\nD. Cheer up.\nAnswer with the option's letter from the given choices directly.",
374,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "125-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 375,
"target": "D",
"doc": {
"video_id": "126",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=y6ReUXtm_VE",
"videoID": "y6ReUXtm_VE",
"question_id": "126-1",
"task_type": "Spatial Reasoning",
"question": "What is the atmosphere portrayed in the video like?",
"options": [
"A. Sad.",
"B. Tense.",
"C. Joyful.",
"D. Solemn."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the atmosphere portrayed in the video like?\nOption:\nA. Sad.\nB. Tense.\nC. Joyful.\nD. Solemn.\nAnswer with the option's letter from the given choices directly.",
375,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "126-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 376,
"target": "B",
"doc": {
"video_id": "126",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=y6ReUXtm_VE",
"videoID": "y6ReUXtm_VE",
"question_id": "126-2",
"task_type": "Action Reasoning",
"question": "What is the reason for entering cold water in the video during the winter season?",
"options": [
"A. Not mentioned in the video.",
"B. It is a religious act.",
"C. To challenge themselves.",
"D. For health benefits."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reason for entering cold water in the video during the winter season?\nOption:\nA. Not mentioned in the video.\nB. It is a religious act.\nC. To challenge themselves.\nD. For health benefits.\nAnswer with the option's letter from the given choices directly.",
376,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "126-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 377,
"target": "B",
"doc": {
"video_id": "126",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=y6ReUXtm_VE",
"videoID": "y6ReUXtm_VE",
"question_id": "126-3",
"task_type": "OCR Problems",
"question": "What is the approximate temperature shown in the video?",
"options": [
"A. Around 0 degrees centigrade.",
"B. Around -5 degrees centigrade.",
"C. Around -10 degrees centigrade.",
"D. Around -15 degrees centigrade."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the approximate temperature shown in the video?\nOption:\nA. Around 0 degrees centigrade.\nB. Around -5 degrees centigrade.\nC. Around -10 degrees centigrade.\nD. Around -15 degrees centigrade.\nAnswer with the option's letter from the given choices directly.",
377,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "126-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 378,
"target": "C",
"doc": {
"video_id": "127",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=YlAefotZNAc",
"videoID": "YlAefotZNAc",
"question_id": "127-1",
"task_type": "Attribute Perception",
"question": "What clothes were the boys being interviewed wearing?",
"options": [
"A. Black leather jacket.",
"B. Blue suit and red tie.",
"C. Yellow plaid shirt.",
"D. Yellow rescue suit."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What clothes were the boys being interviewed wearing?\nOption:\nA. Black leather jacket.\nB. Blue suit and red tie.\nC. Yellow plaid shirt.\nD. Yellow rescue suit.\nAnswer with the option's letter from the given choices directly.",
378,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "127-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 379,
"target": "D",
"doc": {
"video_id": "127",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=YlAefotZNAc",
"videoID": "YlAefotZNAc",
"question_id": "127-2",
"task_type": "Attribute Perception",
"question": "What is the mood of the people interviewed in the video who experienced the blizzard?",
"options": [
"A. Terrified.",
"B. Sad.",
"C. Calm.",
"D. Positive."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the mood of the people interviewed in the video who experienced the blizzard?\nOption:\nA. Terrified.\nB. Sad.\nC. Calm.\nD. Positive.\nAnswer with the option's letter from the given choices directly.",
379,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "127-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 380,
"target": "A",
"doc": {
"video_id": "127",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=YlAefotZNAc",
"videoID": "YlAefotZNAc",
"question_id": "127-3",
"task_type": "Object Reasoning",
"question": "According to the video, what is the reason why the interviewed man frequently observes snow?",
"options": [
"A. Because he lives in the mountains at an altitude of about 3000 meters.",
"B. Because he lives in Florida.",
"C. Because he likes snow.",
"D. Not mentioned in the video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the reason why the interviewed man frequently observes snow?\nOption:\nA. Because he lives in the mountains at an altitude of about 3000 meters.\nB. Because he lives in Florida.\nC. Because he likes snow.\nD. Not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
380,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "127-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 381,
"target": "A",
"doc": {
"video_id": "128",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=EJ_yarUrgVQ",
"videoID": "EJ_yarUrgVQ",
"question_id": "128-1",
"task_type": "Object Reasoning",
"question": "According to the information presented in the video, what natural disaster did France experience that was identified as its worst ever?",
"options": [
"A. drought.",
"B. Sandstorms.",
"C. Floods.",
"D. Snowstorms."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the information presented in the video, what natural disaster did France experience that was identified as its worst ever?\nOption:\nA. drought.\nB. Sandstorms.\nC. Floods.\nD. Snowstorms.\nAnswer with the option's letter from the given choices directly.",
381,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "128-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 382,
"target": "A",
"doc": {
"video_id": "128",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=EJ_yarUrgVQ",
"videoID": "EJ_yarUrgVQ",
"question_id": "128-2",
"task_type": "Object Recognition",
"question": "What happened in the mountains captured in the video?",
"options": [
"A. Fire.",
"B. Landslides.",
"C. Bonfire Party.",
"D. Debris flow."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened in the mountains captured in the video?\nOption:\nA. Fire.\nB. Landslides.\nC. Bonfire Party.\nD. Debris flow.\nAnswer with the option's letter from the given choices directly.",
382,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "128-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 383,
"target": "C",
"doc": {
"video_id": "128",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=EJ_yarUrgVQ",
"videoID": "EJ_yarUrgVQ",
"question_id": "128-3",
"task_type": "Attribute Perception",
"question": "Which color is the news anchor's top at the start of the video?",
"options": [
"A. Gold.",
"B. Brown.",
"C. Black.",
"D. White."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color is the news anchor's top at the start of the video?\nOption:\nA. Gold.\nB. Brown.\nC. Black.\nD. White.\nAnswer with the option's letter from the given choices directly.",
383,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "128-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 384,
"target": "B",
"doc": {
"video_id": "129",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=tNQo2GTbP6s",
"videoID": "tNQo2GTbP6s",
"question_id": "129-1",
"task_type": "Object Reasoning",
"question": "According to the video, why is this fossil discovery special?",
"options": [
"A. Because the fossils found are very well preserved.",
"B. Because it is the largest sea dragon fossil ever found in Britain.",
"C. Because this is the first sea dragon fossil found.",
"D. Because the fossils were found in the reservoir."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why is this fossil discovery special?\nOption:\nA. Because the fossils found are very well preserved.\nB. Because it is the largest sea dragon fossil ever found in Britain.\nC. Because this is the first sea dragon fossil found.\nD. Because the fossils were found in the reservoir.\nAnswer with the option's letter from the given choices directly.",
384,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "129-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 385,
"target": "D",
"doc": {
"video_id": "129",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=tNQo2GTbP6s",
"videoID": "tNQo2GTbP6s",
"question_id": "129-2",
"task_type": "Object Recognition",
"question": "According to the video, which of the following options correctly identifies Joe Davis?",
"options": [
"A. Not mentioned in the video.",
"B. He is the news anchor.",
"C. It is the name of the Sea Dragon.",
"D. He is a bald interviewee."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following options correctly identifies Joe Davis?\nOption:\nA. Not mentioned in the video.\nB. He is the news anchor.\nC. It is the name of the Sea Dragon.\nD. He is a bald interviewee.\nAnswer with the option's letter from the given choices directly.",
385,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "129-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 386,
"target": "B",
"doc": {
"video_id": "129",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=tNQo2GTbP6s",
"videoID": "tNQo2GTbP6s",
"question_id": "129-3",
"task_type": "Spatial Reasoning",
"question": "According to the video, where were the ichthyosaur fossils found this time?",
"options": [
"A. In the sea.",
"B. In a reservoir.",
"C. On a mountain.",
"D. In a museum."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, where were the ichthyosaur fossils found this time?\nOption:\nA. In the sea.\nB. In a reservoir.\nC. On a mountain.\nD. In a museum.\nAnswer with the option's letter from the given choices directly.",
386,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "129-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 387,
"target": "D",
"doc": {
"video_id": "130",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=ykJ7pyr87Qc",
"videoID": "ykJ7pyr87Qc",
"question_id": "130-1",
"task_type": "Counting Problem",
"question": "According to the information presented in the video, what was the exact number of individuals present in the store at the time?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the information presented in the video, what was the exact number of individuals present in the store at the time?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
387,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "130-1",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 388,
"target": "B",
"doc": {
"video_id": "130",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=ykJ7pyr87Qc",
"videoID": "ykJ7pyr87Qc",
"question_id": "130-2",
"task_type": "Action Recognition",
"question": "According to the video, what happened in the supermarket?",
"options": [
"A. A criminal broke in to rob the place and then shot and injured a person.",
"B. A criminal broke in to rob the place and was subdued by a cowboy.",
"C. A criminal broke in to rob the place and successfully escaped.",
"D. Multiple criminals broke in to rob the place and fought with the crowd."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what happened in the supermarket?\nOption:\nA. A criminal broke in to rob the place and then shot and injured a person.\nB. A criminal broke in to rob the place and was subdued by a cowboy.\nC. A criminal broke in to rob the place and successfully escaped.\nD. Multiple criminals broke in to rob the place and fought with the crowd.\nAnswer with the option's letter from the given choices directly.",
388,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "130-2",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 389,
"target": "B",
"doc": {
"video_id": "130",
"duration": "short",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=ykJ7pyr87Qc",
"videoID": "ykJ7pyr87Qc",
"question_id": "130-3",
"task_type": "Action Recognition",
"question": "How did the cowboy subdue the criminal according to the video?",
"options": [
"A. The cowboy convinced the criminal to give up the robbery.",
"B. The cowboy subdued the criminal when the criminal turned around.",
"C. The cowboy shot the criminal.",
"D. The cowboy punched the criminal."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the cowboy subdue the criminal according to the video?\nOption:\nA. The cowboy convinced the criminal to give up the robbery.\nB. The cowboy subdued the criminal when the criminal turned around.\nC. The cowboy shot the criminal.\nD. The cowboy punched the criminal.\nAnswer with the option's letter from the given choices directly.",
389,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "130-3",
"duration": "short",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 390,
"target": "C",
"doc": {
"video_id": "131",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=HdBsD5VspVM",
"videoID": "HdBsD5VspVM",
"question_id": "131-1",
"task_type": "Object Reasoning",
"question": "Based on the video, who will benefit from the change buffs of armors?",
"options": [
"A. Junggle.",
"B. Assassin.",
"C. Adc.",
"D. Support."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, who will benefit from the change buffs of armors?\nOption:\nA. Junggle.\nB. Assassin.\nC. Adc.\nD. Support.\nAnswer with the option's letter from the given choices directly.",
390,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "131-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 391,
"target": "B",
"doc": {
"video_id": "131",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=HdBsD5VspVM",
"videoID": "HdBsD5VspVM",
"question_id": "131-2",
"task_type": "Object Reasoning",
"question": "What profession is the legend in this video?",
"options": [
"A. Assassin.",
"B. Shooter.",
"C. Tank.",
"D. Mage."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What profession is the legend in this video?\nOption:\nA. Assassin.\nB. Shooter.\nC. Tank.\nD. Mage.\nAnswer with the option's letter from the given choices directly.",
391,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "131-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 392,
"target": "A",
"doc": {
"video_id": "131",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=HdBsD5VspVM",
"videoID": "HdBsD5VspVM",
"question_id": "131-3",
"task_type": "Object Recognition",
"question": "What is the armor that first be described in the video?",
"options": [
"A. Sword.",
"B. Arrow.",
"C. Axe.",
"D. Stick."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the armor that first be described in the video?\nOption:\nA. Sword.\nB. Arrow.\nC. Axe.\nD. Stick.\nAnswer with the option's letter from the given choices directly.",
392,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "131-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 393,
"target": "D",
"doc": {
"video_id": "132",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=4V6G0qYVoBg",
"videoID": "4V6G0qYVoBg",
"question_id": "132-1",
"task_type": "Object Recognition",
"question": "What is on the legend when he is hit by the turret?",
"options": [
"A. Sand.",
"B. Ice.",
"C. Thunder.",
"D. Fire."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is on the legend when he is hit by the turret?\nOption:\nA. Sand.\nB. Ice.\nC. Thunder.\nD. Fire.\nAnswer with the option's letter from the given choices directly.",
393,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "132-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 394,
"target": "D",
"doc": {
"video_id": "132",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=4V6G0qYVoBg",
"videoID": "4V6G0qYVoBg",
"question_id": "132-2",
"task_type": "Action Recognition",
"question": "Which of the following events occur when the player presses R?",
"options": [
"A. Nothing happens.",
"B. The legend is flying.",
"C. The legend runs away.",
"D. The enemy is knocked up."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events occur when the player presses R?\nOption:\nA. Nothing happens.\nB. The legend is flying.\nC. The legend runs away.\nD. The enemy is knocked up.\nAnswer with the option's letter from the given choices directly.",
394,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "132-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 395,
"target": "D",
"doc": {
"video_id": "132",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=4V6G0qYVoBg",
"videoID": "4V6G0qYVoBg",
"question_id": "132-3",
"task_type": "Object Recognition",
"question": "What is thrown out when pressing Q on the keyboard?",
"options": [
"A. A brick.",
"B. A spear.",
"C. An axe.",
"D. A wheel."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is thrown out when pressing Q on the keyboard?\nOption:\nA. A brick.\nB. A spear.\nC. An axe.\nD. A wheel.\nAnswer with the option's letter from the given choices directly.",
395,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "132-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 396,
"target": "A",
"doc": {
"video_id": "133",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=uF3zNOthLAg",
"videoID": "uF3zNOthLAg",
"question_id": "133-1",
"task_type": "Counting Problem",
"question": "How many legends are featured in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many legends are featured in the video?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
396,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "133-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 397,
"target": "C",
"doc": {
"video_id": "133",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=uF3zNOthLAg",
"videoID": "uF3zNOthLAg",
"question_id": "133-2",
"task_type": "Object Recognition",
"question": "Which legend is knocked down in the video?",
"options": [
"A. The legend who is a wolf.",
"B. The legend with two weapons in hands.",
"C. The legend with red cloak.",
"D. No one is knocked down."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which legend is knocked down in the video?\nOption:\nA. The legend who is a wolf.\nB. The legend with two weapons in hands.\nC. The legend with red cloak.\nD. No one is knocked down.\nAnswer with the option's letter from the given choices directly.",
397,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "133-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 398,
"target": "D",
"doc": {
"video_id": "133",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=uF3zNOthLAg",
"videoID": "uF3zNOthLAg",
"question_id": "133-3",
"task_type": "Spatial Perception",
"question": "Which location serves as the battleground in the fight?",
"options": [
"A. Behind the nexus.",
"B. In the river.",
"C. On the mountain.",
"D. Near the turret."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which location serves as the battleground in the fight?\nOption:\nA. Behind the nexus.\nB. In the river.\nC. On the mountain.\nD. Near the turret.\nAnswer with the option's letter from the given choices directly.",
398,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "133-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 399,
"target": "B",
"doc": {
"video_id": "134",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=6Z_XNM_iT4g",
"videoID": "6Z_XNM_iT4g",
"question_id": "134-1",
"task_type": "Action Recognition",
"question": "In the legend, when the character with a fox tail goes left, what is her intention or goal?",
"options": [
"A. She wants to pick up gold.",
"B. She wants to ward grass.",
"C. She wants to slay an enemy.",
"D. She wants to meet her ally."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the legend, when the character with a fox tail goes left, what is her intention or goal?\nOption:\nA. She wants to pick up gold.\nB. She wants to ward grass.\nC. She wants to slay an enemy.\nD. She wants to meet her ally.\nAnswer with the option's letter from the given choices directly.",
399,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "134-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 400,
"target": "B",
"doc": {
"video_id": "134",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=6Z_XNM_iT4g",
"videoID": "6Z_XNM_iT4g",
"question_id": "134-2",
"task_type": "Action Recognition",
"question": "Which skill is used by the legend who slays the enemy?",
"options": [
"A. Haste.",
"B. Flash.",
"C. Exhaust.",
"D. Teleport."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which skill is used by the legend who slays the enemy?\nOption:\nA. Haste.\nB. Flash.\nC. Exhaust.\nD. Teleport.\nAnswer with the option's letter from the given choices directly.",
400,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "134-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 401,
"target": "C",
"doc": {
"video_id": "134",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=6Z_XNM_iT4g",
"videoID": "6Z_XNM_iT4g",
"question_id": "134-3",
"task_type": "Object Recognition",
"question": "What type of weapon does the slain legend retrieve?",
"options": [
"A. Sword.",
"B. Axe.",
"C. Gun.",
"D. Spear."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of weapon does the slain legend retrieve?\nOption:\nA. Sword.\nB. Axe.\nC. Gun.\nD. Spear.\nAnswer with the option's letter from the given choices directly.",
401,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "134-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 402,
"target": "C",
"doc": {
"video_id": "135",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=fJ-hp5Jlbv0",
"videoID": "fJ-hp5Jlbv0",
"question_id": "135-1",
"task_type": "Attribute Perception",
"question": "Which color is the pistol used by the player?",
"options": [
"A. Purple.",
"B. Blue.",
"C. Green.",
"D. Red."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color is the pistol used by the player?\nOption:\nA. Purple.\nB. Blue.\nC. Green.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
402,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "135-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 403,
"target": "B",
"doc": {
"video_id": "135",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=fJ-hp5Jlbv0",
"videoID": "fJ-hp5Jlbv0",
"question_id": "135-2",
"task_type": "Object Recognition",
"question": "What is on the wall that the player shoot at?",
"options": [
"A. Red circle.",
"B. Red letter \"A\".",
"C. Yellow arrow.",
"D. Black letter \"A\"."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is on the wall that the player shoot at?\nOption:\nA. Red circle.\nB. Red letter \"A\".\nC. Yellow arrow.\nD. Black letter \"A\".\nAnswer with the option's letter from the given choices directly.",
403,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "135-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 404,
"target": "B",
"doc": {
"video_id": "135",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=fJ-hp5Jlbv0",
"videoID": "fJ-hp5Jlbv0",
"question_id": "135-3",
"task_type": "Counting Problem",
"question": "How many people are in the video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are in the video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
404,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "135-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 405,
"target": "A",
"doc": {
"video_id": "136",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=LlajSKnbcGk",
"videoID": "LlajSKnbcGk",
"question_id": "136-1",
"task_type": "Counting Problem",
"question": "According to the video, how many persons are shot to death by the player before he is dead?",
"options": [
"A. 7.",
"B. 6.",
"C. 8.",
"D. 5."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many persons are shot to death by the player before he is dead?\nOption:\nA. 7.\nB. 6.\nC. 8.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
405,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "136-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 406,
"target": "A",
"doc": {
"video_id": "136",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=LlajSKnbcGk",
"videoID": "LlajSKnbcGk",
"question_id": "136-2",
"task_type": "Attribute Perception",
"question": "What color are his gloves in the video?",
"options": [
"A. Black.",
"B. Blue.",
"C. Purple.",
"D. Orange."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are his gloves in the video?\nOption:\nA. Black.\nB. Blue.\nC. Purple.\nD. Orange.\nAnswer with the option's letter from the given choices directly.",
406,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "136-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 407,
"target": "A",
"doc": {
"video_id": "136",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=LlajSKnbcGk",
"videoID": "LlajSKnbcGk",
"question_id": "136-3",
"task_type": "Spatial Perception",
"question": "Where is the player finally dead?",
"options": [
"A. Before a gate.",
"B. In a basement.",
"C. On top of a building.",
"D. In a pool."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the player finally dead?\nOption:\nA. Before a gate.\nB. In a basement.\nC. On top of a building.\nD. In a pool.\nAnswer with the option's letter from the given choices directly.",
407,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "136-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 408,
"target": "B",
"doc": {
"video_id": "137",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=dIPNQRVqJmY",
"videoID": "dIPNQRVqJmY",
"question_id": "137-1",
"task_type": "Action Recognition",
"question": "What happens when the player approaches the window first?",
"options": [
"A. He jumps out the window.",
"B. He is flashed.",
"C. He is shot to death.",
"D. He gets down on the ground."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when the player approaches the window first?\nOption:\nA. He jumps out the window.\nB. He is flashed.\nC. He is shot to death.\nD. He gets down on the ground.\nAnswer with the option's letter from the given choices directly.",
408,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "137-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 409,
"target": "C",
"doc": {
"video_id": "137",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=dIPNQRVqJmY",
"videoID": "dIPNQRVqJmY",
"question_id": "137-2",
"task_type": "Object Recognition",
"question": "What does the player see upon exiting the building?",
"options": [
"A. A rocket.",
"B. Fire.",
"C. Smoke.",
"D. His teammates."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the player see upon exiting the building?\nOption:\nA. A rocket.\nB. Fire.\nC. Smoke.\nD. His teammates.\nAnswer with the option's letter from the given choices directly.",
409,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "137-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 410,
"target": "C",
"doc": {
"video_id": "137",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=dIPNQRVqJmY",
"videoID": "dIPNQRVqJmY",
"question_id": "137-3",
"task_type": "Counting Problem",
"question": "How many shots does the player take in the video?",
"options": [
"A. 4.",
"B. 3.",
"C. 5.",
"D. 6."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many shots does the player take in the video?\nOption:\nA. 4.\nB. 3.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
410,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "137-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 411,
"target": "D",
"doc": {
"video_id": "138",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=of62s85uMMs",
"videoID": "of62s85uMMs",
"question_id": "138-1",
"task_type": "Spatial Perception",
"question": "What is the location of the scene being depicted in the video?",
"options": [
"A. In a river.",
"B. In a glacier.",
"C. In a forest.",
"D. In a desert."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the location of the scene being depicted in the video?\nOption:\nA. In a river.\nB. In a glacier.\nC. In a forest.\nD. In a desert.\nAnswer with the option's letter from the given choices directly.",
411,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "138-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 412,
"target": "B",
"doc": {
"video_id": "138",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=of62s85uMMs",
"videoID": "of62s85uMMs",
"question_id": "138-2",
"task_type": "Counting Problem",
"question": "How many craches occurs in this video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many craches occurs in this video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
412,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "138-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 413,
"target": "D",
"doc": {
"video_id": "138",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=of62s85uMMs",
"videoID": "of62s85uMMs",
"question_id": "138-3",
"task_type": "Counting Problem",
"question": "How many cars are overtaken in this video?",
"options": [
"A. 1.",
"B. 2.",
"C. 4.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cars are overtaken in this video?\nOption:\nA. 1.\nB. 2.\nC. 4.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
413,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "138-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 414,
"target": "C",
"doc": {
"video_id": "139",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=80p80ynsZ78",
"videoID": "80p80ynsZ78",
"question_id": "139-1",
"task_type": "Attribute Perception",
"question": "Which of the following is the main color of the racetrack?",
"options": [
"A. Purple.",
"B. White.",
"C. Pink.",
"D. Blue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the main color of the racetrack?\nOption:\nA. Purple.\nB. White.\nC. Pink.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
414,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "139-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 415,
"target": "D",
"doc": {
"video_id": "139",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=80p80ynsZ78",
"videoID": "80p80ynsZ78",
"question_id": "139-2",
"task_type": "Spatial Perception",
"question": "What the weather is like when the race begins?",
"options": [
"A. It is foggy.",
"B. It is rainy.",
"C. It is snowing.",
"D. It is cloudy."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What the weather is like when the race begins?\nOption:\nA. It is foggy.\nB. It is rainy.\nC. It is snowing.\nD. It is cloudy.\nAnswer with the option's letter from the given choices directly.",
415,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "139-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 416,
"target": "A",
"doc": {
"video_id": "139",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=80p80ynsZ78",
"videoID": "80p80ynsZ78",
"question_id": "139-3",
"task_type": "Action Recognition",
"question": "According to the video, what happens to the car when it is running?",
"options": [
"A. It soars into the sky.",
"B. It drills into the ground.",
"C. It floats on the water.",
"D. It hits a tree."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what happens to the car when it is running?\nOption:\nA. It soars into the sky.\nB. It drills into the ground.\nC. It floats on the water.\nD. It hits a tree.\nAnswer with the option's letter from the given choices directly.",
416,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "139-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 417,
"target": "B",
"doc": {
"video_id": "140",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Kv1JXuOkAfk",
"videoID": "Kv1JXuOkAfk",
"question_id": "140-1",
"task_type": "Attribute Perception",
"question": "What is the color of the car in this video?",
"options": [
"A. White.",
"B. Blue.",
"C. Purple.",
"D. Pink."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the car in this video?\nOption:\nA. White.\nB. Blue.\nC. Purple.\nD. Pink.\nAnswer with the option's letter from the given choices directly.",
417,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "140-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 418,
"target": "A",
"doc": {
"video_id": "140",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Kv1JXuOkAfk",
"videoID": "Kv1JXuOkAfk",
"question_id": "140-2",
"task_type": "Counting Problem",
"question": "What is the pattern on the car?",
"options": [
"A. A hand.",
"B. Lines.",
"C. A man.",
"D. An animal."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the pattern on the car?\nOption:\nA. A hand.\nB. Lines.\nC. A man.\nD. An animal.\nAnswer with the option's letter from the given choices directly.",
418,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "140-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 419,
"target": "D",
"doc": {
"video_id": "140",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Kv1JXuOkAfk",
"videoID": "Kv1JXuOkAfk",
"question_id": "140-3",
"task_type": "Action Recognition",
"question": "According to the video, what happens to the car when it is running?",
"options": [
"A. It soars into the sky.",
"B. It drills into the ground.",
"C. It floats on the water.",
"D. It hits a stone."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what happens to the car when it is running?\nOption:\nA. It soars into the sky.\nB. It drills into the ground.\nC. It floats on the water.\nD. It hits a stone.\nAnswer with the option's letter from the given choices directly.",
419,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "140-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 420,
"target": "A",
"doc": {
"video_id": "141",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=lsfEHOtYGyk",
"videoID": "lsfEHOtYGyk",
"question_id": "141-1",
"task_type": "Action Recognition",
"question": "Which player does the video call for to finally put the ball in the basket?",
"options": [
"A. Player number 2.",
"B. Player number 4.",
"C. Player number 1.",
"D. Player number 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player does the video call for to finally put the ball in the basket?\nOption:\nA. Player number 2.\nB. Player number 4.\nC. Player number 1.\nD. Player number 3.\nAnswer with the option's letter from the given choices directly.",
420,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "141-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 421,
"target": "B",
"doc": {
"video_id": "141",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=lsfEHOtYGyk",
"videoID": "lsfEHOtYGyk",
"question_id": "141-2",
"task_type": "Action Recognition",
"question": "Which player is running the pick-and-roll for the other players?",
"options": [
"A. Player number 5.",
"B. Player number 4.",
"C. Player number 2.",
"D. Player number 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player is running the pick-and-roll for the other players?\nOption:\nA. Player number 5.\nB. Player number 4.\nC. Player number 2.\nD. Player number 3.\nAnswer with the option's letter from the given choices directly.",
421,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "141-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 422,
"target": "A",
"doc": {
"video_id": "141",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=lsfEHOtYGyk",
"videoID": "lsfEHOtYGyk",
"question_id": "141-3",
"task_type": "Object Recognition",
"question": "Which sport's tactics are shown in the video?",
"options": [
"A. Basketball.",
"B. Soccer.",
"C. Rugby.",
"D. Baseball."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which sport's tactics are shown in the video?\nOption:\nA. Basketball.\nB. Soccer.\nC. Rugby.\nD. Baseball.\nAnswer with the option's letter from the given choices directly.",
422,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "141-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 423,
"target": "C",
"doc": {
"video_id": "142",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=rj6rJzs029A",
"videoID": "rj6rJzs029A",
"question_id": "142-1",
"task_type": "OCR Problems",
"question": "What is the remaining time in seconds when the player in the video takes a free throw during the game?",
"options": [
"A. 3 seconds.",
"B. 1 seconds.",
"C. 2 seconds.",
"D. No time left."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the remaining time in seconds when the player in the video takes a free throw during the game?\nOption:\nA. 3 seconds.\nB. 1 seconds.\nC. 2 seconds.\nD. No time left.\nAnswer with the option's letter from the given choices directly.",
423,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "142-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 424,
"target": "A",
"doc": {
"video_id": "142",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=rj6rJzs029A",
"videoID": "rj6rJzs029A",
"question_id": "142-2",
"task_type": "Action Reasoning",
"question": "What are the people in the video cheering for?",
"options": [
"A. Three-point shutout.",
"B. Layup shutout.",
"C. The center throws a shutout.",
"D. Isolation slam."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people in the video cheering for?\nOption:\nA. Three-point shutout.\nB. Layup shutout.\nC. The center throws a shutout.\nD. Isolation slam.\nAnswer with the option's letter from the given choices directly.",
424,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "142-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 425,
"target": "C",
"doc": {
"video_id": "142",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=rj6rJzs029A",
"videoID": "rj6rJzs029A",
"question_id": "142-3",
"task_type": "Object Recognition",
"question": "Who is the player that ends the game in the video?",
"options": [
"A. White number 6.",
"B. Black number 5.",
"C. Black number 4.",
"D. White number 8."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the player that ends the game in the video?\nOption:\nA. White number 6.\nB. Black number 5.\nC. Black number 4.\nD. White number 8.\nAnswer with the option's letter from the given choices directly.",
425,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "142-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 426,
"target": "C",
"doc": {
"video_id": "143",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=PJHkIJZGwKA",
"videoID": "PJHkIJZGwKA",
"question_id": "143-1",
"task_type": "Object Recognition",
"question": "What team currently has the lead in the video?",
"options": [
"A. The two teams are tied.",
"B. The blue team.",
"C. The white team.",
"D. Unable to inference."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What team currently has the lead in the video?\nOption:\nA. The two teams are tied.\nB. The blue team.\nC. The white team.\nD. Unable to inference.\nAnswer with the option's letter from the given choices directly.",
426,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "143-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 427,
"target": "B",
"doc": {
"video_id": "143",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=PJHkIJZGwKA",
"videoID": "PJHkIJZGwKA",
"question_id": "143-2",
"task_type": "Action Recognition",
"question": "Which player finished the dunk?",
"options": [
"A. Number 13.",
"B. Number 32.",
"C. Number 17.",
"D. Number 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player finished the dunk?\nOption:\nA. Number 13.\nB. Number 32.\nC. Number 17.\nD. Number 5.\nAnswer with the option's letter from the given choices directly.",
427,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "143-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 428,
"target": "D",
"doc": {
"video_id": "143",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=PJHkIJZGwKA",
"videoID": "PJHkIJZGwKA",
"question_id": "143-3",
"task_type": "Object Recognition",
"question": "Which type of basketball game is being played in the video?",
"options": [
"A. Women's basketball 3v3.",
"B. Men's basketball 3v3.",
"C. Women's basketball 5v5.",
"D. Men's basketball 5v5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which type of basketball game is being played in the video?\nOption:\nA. Women's basketball 3v3.\nB. Men's basketball 3v3.\nC. Women's basketball 5v5.\nD. Men's basketball 5v5.\nAnswer with the option's letter from the given choices directly.",
428,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "143-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 429,
"target": "A",
"doc": {
"video_id": "144",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=QUmkzhUQoEA",
"videoID": "QUmkzhUQoEA",
"question_id": "144-1",
"task_type": "Object Recognition",
"question": "Who is the first player in the video to take a shot?",
"options": [
"A. Player number 7.",
"B. Player number 15.",
"C. Player number 14.",
"D. Player number 12."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the first player in the video to take a shot?\nOption:\nA. Player number 7.\nB. Player number 15.\nC. Player number 14.\nD. Player number 12.\nAnswer with the option's letter from the given choices directly.",
429,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "144-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 430,
"target": "D",
"doc": {
"video_id": "144",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=QUmkzhUQoEA",
"videoID": "QUmkzhUQoEA",
"question_id": "144-2",
"task_type": "Object Recognition",
"question": "Which player in the video scores first?",
"options": [
"A. Player number 7.",
"B. Player number 12.",
"C. Player number 15.",
"D. Player number 14."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player in the video scores first?\nOption:\nA. Player number 7.\nB. Player number 12.\nC. Player number 15.\nD. Player number 14.\nAnswer with the option's letter from the given choices directly.",
430,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "144-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 431,
"target": "B",
"doc": {
"video_id": "144",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=QUmkzhUQoEA",
"videoID": "QUmkzhUQoEA",
"question_id": "144-3",
"task_type": "Object Reasoning",
"question": "Which of the following describes the unique aspect of this basketball game?",
"options": [
"A. It is a basketball charity show.",
"B. This is a Paralympic basketball game.",
"C. This is a special game in honor of a basketball star.",
"D. It is a basketball game that was replayed after many years."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following describes the unique aspect of this basketball game?\nOption:\nA. It is a basketball charity show.\nB. This is a Paralympic basketball game.\nC. This is a special game in honor of a basketball star.\nD. It is a basketball game that was replayed after many years.\nAnswer with the option's letter from the given choices directly.",
431,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "144-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 432,
"target": "D",
"doc": {
"video_id": "145",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=gvtvz-keDmE",
"videoID": "gvtvz-keDmE",
"question_id": "145-1",
"task_type": "Action Recognition",
"question": "What does the rabbit do while the man is flying up to dunk?",
"options": [
"A. Left the basketball court.",
"B. Making a stick.",
"C. Do wood carving.",
"D. Making a pencil."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the rabbit do while the man is flying up to dunk?\nOption:\nA. Left the basketball court.\nB. Making a stick.\nC. Do wood carving.\nD. Making a pencil.\nAnswer with the option's letter from the given choices directly.",
432,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "145-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 433,
"target": "A",
"doc": {
"video_id": "145",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=gvtvz-keDmE",
"videoID": "gvtvz-keDmE",
"question_id": "145-2",
"task_type": "Action Reasoning",
"question": "Which of the following options explains why the man is screaming at the beginning of the video?",
"options": [
"A. Because the rabbit is about to win.",
"B. Because he loses.",
"C. Because he is going to dunk.",
"D. Because he is angry."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options explains why the man is screaming at the beginning of the video?\nOption:\nA. Because the rabbit is about to win.\nB. Because he loses.\nC. Because he is going to dunk.\nD. Because he is angry.\nAnswer with the option's letter from the given choices directly.",
433,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "145-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 434,
"target": "B",
"doc": {
"video_id": "145",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=gvtvz-keDmE",
"videoID": "gvtvz-keDmE",
"question_id": "145-3",
"task_type": "Object Recognition",
"question": "What sport is being played in the video?",
"options": [
"A. High jump.",
"B. Basketball.",
"C. Volleyball.",
"D. Gymnastics."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sport is being played in the video?\nOption:\nA. High jump.\nB. Basketball.\nC. Volleyball.\nD. Gymnastics.\nAnswer with the option's letter from the given choices directly.",
434,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "145-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 435,
"target": "C",
"doc": {
"video_id": "146",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=V9fsOaa97F4",
"videoID": "V9fsOaa97F4",
"question_id": "146-1",
"task_type": "Attribute Perception",
"question": "Who dunks in the video?",
"options": [
"A. The basketball players in orange shoes.",
"B. The woman in the green hoodie.",
"C. The man wearing a black vest with the number 22.",
"D. The man wearing a yellow vest with the number 8."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who dunks in the video?\nOption:\nA. The basketball players in orange shoes.\nB. The woman in the green hoodie.\nC. The man wearing a black vest with the number 22.\nD. The man wearing a yellow vest with the number 8.\nAnswer with the option's letter from the given choices directly.",
435,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "146-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 436,
"target": "B",
"doc": {
"video_id": "146",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=V9fsOaa97F4",
"videoID": "V9fsOaa97F4",
"question_id": "146-2",
"task_type": "Counting Problem",
"question": "What is the total number of people guarding basketball players in orange shoes in the video?",
"options": [
"A. 6.",
"B. 5.",
"C. 4.",
"D. 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the total number of people guarding basketball players in orange shoes in the video?\nOption:\nA. 6.\nB. 5.\nC. 4.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
436,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "146-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 437,
"target": "D",
"doc": {
"video_id": "146",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=V9fsOaa97F4",
"videoID": "V9fsOaa97F4",
"question_id": "146-3",
"task_type": "Object Recognition",
"question": "What is this video mainly about?",
"options": [
"A. Basketball court publicity.",
"B. Basketball event publicity.",
"C. Basketball movie trailer.",
"D. Basketball brand publicity."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Basketball court publicity.\nB. Basketball event publicity.\nC. Basketball movie trailer.\nD. Basketball brand publicity.\nAnswer with the option's letter from the given choices directly.",
437,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "146-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 438,
"target": "A",
"doc": {
"video_id": "147",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=7R1eNHvfspk",
"videoID": "7R1eNHvfspk",
"question_id": "147-1",
"task_type": "Counting Problem",
"question": "How many challengers does the protagonist in the video take on?",
"options": [
"A. 7.",
"B. 6.",
"C. 8.",
"D. 5."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many challengers does the protagonist in the video take on?\nOption:\nA. 7.\nB. 6.\nC. 8.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
438,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "147-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 439,
"target": "C",
"doc": {
"video_id": "147",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=7R1eNHvfspk",
"videoID": "7R1eNHvfspk",
"question_id": "147-2",
"task_type": "Action Recognition",
"question": "What is the game in the video?",
"options": [
"A. Basketball three-point contest.",
"B. Basketball one-on-one.",
"C. Shooting Game Boy challenge.",
"D. Basketball dunk challenge."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the game in the video?\nOption:\nA. Basketball three-point contest.\nB. Basketball one-on-one.\nC. Shooting Game Boy challenge.\nD. Basketball dunk challenge.\nAnswer with the option's letter from the given choices directly.",
439,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "147-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 440,
"target": "B",
"doc": {
"video_id": "147",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=7R1eNHvfspk",
"videoID": "7R1eNHvfspk",
"question_id": "147-3",
"task_type": "Object Reasoning",
"question": "Which person in the video is shown to shoot the ball the highest?",
"options": [
"A. The first challenger.",
"B. The Challenge initiator.",
"C. The second challenger.",
"D. The last challenger."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which person in the video is shown to shoot the ball the highest?\nOption:\nA. The first challenger.\nB. The Challenge initiator.\nC. The second challenger.\nD. The last challenger.\nAnswer with the option's letter from the given choices directly.",
440,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "147-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 441,
"target": "D",
"doc": {
"video_id": "148",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=KZu5iE7yrPI",
"videoID": "KZu5iE7yrPI",
"question_id": "148-1",
"task_type": "Action Recognition",
"question": "Which player in the video dunks?",
"options": [
"A. Player number 11.",
"B. Player number 12.",
"C. Player number 4.",
"D. Player number 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player in the video dunks?\nOption:\nA. Player number 11.\nB. Player number 12.\nC. Player number 4.\nD. Player number 6.\nAnswer with the option's letter from the given choices directly.",
441,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "148-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 442,
"target": "A",
"doc": {
"video_id": "148",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=KZu5iE7yrPI",
"videoID": "KZu5iE7yrPI",
"question_id": "148-2",
"task_type": "Action Reasoning",
"question": "Which basketball rule is featured in the video?",
"options": [
"A. Travelling.",
"B. Palming.",
"C. Hacking.",
"D. Holding."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which basketball rule is featured in the video?\nOption:\nA. Travelling.\nB. Palming.\nC. Hacking.\nD. Holding.\nAnswer with the option's letter from the given choices directly.",
442,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "148-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 443,
"target": "B",
"doc": {
"video_id": "148",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=KZu5iE7yrPI",
"videoID": "KZu5iE7yrPI",
"question_id": "148-3",
"task_type": "Action Reasoning",
"question": "Why was the athlete in the video called for a traveling violation despite not being a foul?",
"options": [
"A. Because the referee made the wrong call.",
"B. Because this happened before the rules were changed.",
"C. Because the player trolled the referee.",
"D. The penalty was not a traveling violation."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why was the athlete in the video called for a traveling violation despite not being a foul?\nOption:\nA. Because the referee made the wrong call.\nB. Because this happened before the rules were changed.\nC. Because the player trolled the referee.\nD. The penalty was not a traveling violation.\nAnswer with the option's letter from the given choices directly.",
443,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "148-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 444,
"target": "C",
"doc": {
"video_id": "149",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=5fPtlxR3s3c",
"videoID": "5fPtlxR3s3c",
"question_id": "149-1",
"task_type": "Action Recognition",
"question": "What technical moves do the players perform in the third clip in the video?",
"options": [
"A. Dunk.",
"B. Pass.",
"C. Hit 3-pointers.",
"D. Hit the game-winning shot."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What technical moves do the players perform in the third clip in the video?\nOption:\nA. Dunk.\nB. Pass.\nC. Hit 3-pointers.\nD. Hit the game-winning shot.\nAnswer with the option's letter from the given choices directly.",
444,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "149-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 445,
"target": "B",
"doc": {
"video_id": "149",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=5fPtlxR3s3c",
"videoID": "5fPtlxR3s3c",
"question_id": "149-2",
"task_type": "Object Recognition",
"question": "From which country does the player who takes the last shot in the video hail?",
"options": [
"A. Canada.",
"B. America.",
"C. Spain.",
"D. Germany."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From which country does the player who takes the last shot in the video hail?\nOption:\nA. Canada.\nB. America.\nC. Spain.\nD. Germany.\nAnswer with the option's letter from the given choices directly.",
445,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "149-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 446,
"target": "D",
"doc": {
"video_id": "149",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=5fPtlxR3s3c",
"videoID": "5fPtlxR3s3c",
"question_id": "149-3",
"task_type": "Information Synopsis",
"question": "What is the main theme of this video?",
"options": [
"A. Highlights from Team USA during the FIBA Women's Basketball World Cup.",
"B. A compilation of the top five plays from the Men's Basketball World Cup.",
"C. A countdown of the top 10 moments from the Men's Basketball World Cup.",
"D. A countdown of the top 5 moments from the Women's Basketball World Cup."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main theme of this video?\nOption:\nA. Highlights from Team USA during the FIBA Women's Basketball World Cup.\nB. A compilation of the top five plays from the Men's Basketball World Cup.\nC. A countdown of the top 10 moments from the Men's Basketball World Cup.\nD. A countdown of the top 5 moments from the Women's Basketball World Cup.\nAnswer with the option's letter from the given choices directly.",
446,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "149-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 447,
"target": "A",
"doc": {
"video_id": "150",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=Tn70NxIMk2Q",
"videoID": "Tn70NxIMk2Q",
"question_id": "150-1",
"task_type": "Information Synopsis",
"question": "What story does the video tell?",
"options": [
"A. A boy's basketball growing-up story.",
"B. Development of basketball.",
"C. How does the NBA draft.",
"D. How do boys practice basketball."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What story does the video tell?\nOption:\nA. A boy's basketball growing-up story.\nB. Development of basketball.\nC. How does the NBA draft.\nD. How do boys practice basketball.\nAnswer with the option's letter from the given choices directly.",
447,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "150-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 448,
"target": "C",
"doc": {
"video_id": "150",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=Tn70NxIMk2Q",
"videoID": "Tn70NxIMk2Q",
"question_id": "150-2",
"task_type": "Object Recognition",
"question": "Which one of the following jersey sizes does the man wear when he enters the NBA at the end of the video?",
"options": [
"A. 23.",
"B. 35.",
"C. 10.",
"D. 28."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which one of the following jersey sizes does the man wear when he enters the NBA at the end of the video?\nOption:\nA. 23.\nB. 35.\nC. 10.\nD. 28.\nAnswer with the option's letter from the given choices directly.",
448,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "150-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 449,
"target": "A",
"doc": {
"video_id": "150",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=Tn70NxIMk2Q",
"videoID": "Tn70NxIMk2Q",
"question_id": "150-3",
"task_type": "Action Reasoning",
"question": "Which of the following best describes why the man appeared upset in the middle of the video?",
"options": [
"A. Because he chose to pass the ball at the crucial moment.",
"B. Because he missed the game-winning shot.",
"C. Because he was hurt.",
"D. Because he was substituted by the coach."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes why the man appeared upset in the middle of the video?\nOption:\nA. Because he chose to pass the ball at the crucial moment.\nB. Because he missed the game-winning shot.\nC. Because he was hurt.\nD. Because he was substituted by the coach.\nAnswer with the option's letter from the given choices directly.",
449,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "150-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 450,
"target": "A",
"doc": {
"video_id": "151",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=JP67IM1LX-M",
"videoID": "JP67IM1LX-M",
"question_id": "151-1",
"task_type": "Counting Problem",
"question": "How many players are in the beach volleyball game shown in the video?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many players are in the beach volleyball game shown in the video?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
450,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "151-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 451,
"target": "D",
"doc": {
"video_id": "151",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=JP67IM1LX-M",
"videoID": "JP67IM1LX-M",
"question_id": "151-2",
"task_type": "Attribute Perception",
"question": "What color of the T-shirt worn by the boy in black skin, who appears at the beginning and end of the video?",
"options": [
"A. White.",
"B. Blue.",
"C. Red.",
"D. Yellow."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color of the T-shirt worn by the boy in black skin, who appears at the beginning and end of the video?\nOption:\nA. White.\nB. Blue.\nC. Red.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
451,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "151-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 452,
"target": "C",
"doc": {
"video_id": "151",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=JP67IM1LX-M",
"videoID": "JP67IM1LX-M",
"question_id": "151-3",
"task_type": "Spatial Reasoning",
"question": "Which location is most likely to host the FIFA World Cup based on the given landmark and promotional event in. this video?",
"options": [
"A. France.",
"B. USA.",
"C. Brazil.",
"D. Italy."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which location is most likely to host the FIFA World Cup based on the given landmark and promotional event in. this video?\nOption:\nA. France.\nB. USA.\nC. Brazil.\nD. Italy.\nAnswer with the option's letter from the given choices directly.",
452,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "151-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 453,
"target": "B",
"doc": {
"video_id": "152",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=PYqmI0Hiaho",
"videoID": "PYqmI0Hiaho",
"question_id": "152-1",
"task_type": "Counting Problem",
"question": "How many times did the No. 7 in red score in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times did the No. 7 in red score in the video?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
453,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "152-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 454,
"target": "A",
"doc": {
"video_id": "152",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=PYqmI0Hiaho",
"videoID": "PYqmI0Hiaho",
"question_id": "152-2",
"task_type": "Object Recognition",
"question": "What event is this video most likely associated with?",
"options": [
"A. The FIFA World Cup.",
"B. The Olympic Games.",
"C. The UEFA European Championship.",
"D. The Super Bowl."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What event is this video most likely associated with?\nOption:\nA. The FIFA World Cup.\nB. The Olympic Games.\nC. The UEFA European Championship.\nD. The Super Bowl.\nAnswer with the option's letter from the given choices directly.",
454,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "152-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 455,
"target": "D",
"doc": {
"video_id": "152",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=PYqmI0Hiaho",
"videoID": "PYqmI0Hiaho",
"question_id": "152-3",
"task_type": "Object Recognition",
"question": "In the first goal, which player passed the ball to number 7 in red?",
"options": [
"A. Number 8.",
"B. Number 6.",
"C. Number 5.",
"D. Number 9."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the first goal, which player passed the ball to number 7 in red?\nOption:\nA. Number 8.\nB. Number 6.\nC. Number 5.\nD. Number 9.\nAnswer with the option's letter from the given choices directly.",
455,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "152-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 456,
"target": "A",
"doc": {
"video_id": "153",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=_KU4X06VNiE",
"videoID": "_KU4X06VNiE",
"question_id": "153-1",
"task_type": "OCR Problems",
"question": "Referring to ths video, which company sponsored the competition?",
"options": [
"A. Yahoo.",
"B. Google.",
"C. Microsoft.",
"D. Apple."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Referring to ths video, which company sponsored the competition?\nOption:\nA. Yahoo.\nB. Google.\nC. Microsoft.\nD. Apple.\nAnswer with the option's letter from the given choices directly.",
456,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "153-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 457,
"target": "C",
"doc": {
"video_id": "153",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=_KU4X06VNiE",
"videoID": "_KU4X06VNiE",
"question_id": "153-2",
"task_type": "OCR Problems",
"question": "As depicted in the video, what's the last name of the white team wearing number 13?",
"options": [
"A. Marry.",
"B. Lisa.",
"C. Lilly.",
"D. Karry."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what's the last name of the white team wearing number 13?\nOption:\nA. Marry.\nB. Lisa.\nC. Lilly.\nD. Karry.\nAnswer with the option's letter from the given choices directly.",
457,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "153-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 458,
"target": "B",
"doc": {
"video_id": "153",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=_KU4X06VNiE",
"videoID": "_KU4X06VNiE",
"question_id": "153-3",
"task_type": "Object Recognition",
"question": "Which player in the video ends the game?",
"options": [
"A. White number 13.",
"B. White number 16.",
"C. Red number 9.",
"D. Red number 11."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player in the video ends the game?\nOption:\nA. White number 13.\nB. White number 16.\nC. Red number 9.\nD. Red number 11.\nAnswer with the option's letter from the given choices directly.",
458,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "153-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 459,
"target": "C",
"doc": {
"video_id": "154",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=1tidNDIpKSY",
"videoID": "1tidNDIpKSY",
"question_id": "154-1",
"task_type": "Object Recognition",
"question": "At the beginning of the video, what is the man standing next to?",
"options": [
"A. The UEFA Champions League Trophy.",
"B. The Olympic Torch.",
"C. The FIFA World Cup Trophy.",
"D. The Super Bowl Trophy."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning of the video, what is the man standing next to?\nOption:\nA. The UEFA Champions League Trophy.\nB. The Olympic Torch.\nC. The FIFA World Cup Trophy.\nD. The Super Bowl Trophy.\nAnswer with the option's letter from the given choices directly.",
459,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "154-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 460,
"target": "A",
"doc": {
"video_id": "154",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=1tidNDIpKSY",
"videoID": "1tidNDIpKSY",
"question_id": "154-2",
"task_type": "Spatial Perception",
"question": "Which of the following options best describes the scene in the background behind the man?",
"options": [
"A. A soccer field.",
"B. A basketball court.",
"C. An athletic track.",
"D. A baseball diamond."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options best describes the scene in the background behind the man?\nOption:\nA. A soccer field.\nB. A basketball court.\nC. An athletic track.\nD. A baseball diamond.\nAnswer with the option's letter from the given choices directly.",
460,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "154-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 461,
"target": "B",
"doc": {
"video_id": "154",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=1tidNDIpKSY",
"videoID": "1tidNDIpKSY",
"question_id": "154-3",
"task_type": "OCR Problems",
"question": "What can be inferred from the video about the location of the FIFA World Cup 2026 Final?",
"options": [
"A. It will be held in Manchester.",
"B. It will be held in New York/New Jersey.",
"C. It will be held in Houston.",
"D. It will be held in Golden State."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred from the video about the location of the FIFA World Cup 2026 Final?\nOption:\nA. It will be held in Manchester.\nB. It will be held in New York/New Jersey.\nC. It will be held in Houston.\nD. It will be held in Golden State.\nAnswer with the option's letter from the given choices directly.",
461,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "154-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 462,
"target": "D",
"doc": {
"video_id": "155",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=ya2IXAREZho",
"videoID": "ya2IXAREZho",
"question_id": "155-1",
"task_type": "Object Recognition",
"question": "Which team does the player, seen wearing the number 10 shirt and scoring multiple goals in the video, belong to?",
"options": [
"A. England.",
"B. China.",
"C. America.",
"D. France."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team does the player, seen wearing the number 10 shirt and scoring multiple goals in the video, belong to?\nOption:\nA. England.\nB. China.\nC. America.\nD. France.\nAnswer with the option's letter from the given choices directly.",
462,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "155-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 463,
"target": "B",
"doc": {
"video_id": "155",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=ya2IXAREZho",
"videoID": "ya2IXAREZho",
"question_id": "155-2",
"task_type": "Object Recognition",
"question": "Which player's goal highlights video is being shown in the video?",
"options": [
"A. Lionel Andrés Messi.",
"B. Kylian Mbappé.",
"C. Cristiano Ronaldo.",
"D. Kobe Bryant."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player's goal highlights video is being shown in the video?\nOption:\nA. Lionel Andrés Messi.\nB. Kylian Mbappé.\nC. Cristiano Ronaldo.\nD. Kobe Bryant.\nAnswer with the option's letter from the given choices directly.",
463,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "155-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 464,
"target": "A",
"doc": {
"video_id": "155",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=ya2IXAREZho",
"videoID": "ya2IXAREZho",
"question_id": "155-3",
"task_type": "Counting Problem",
"question": "How many Mbappe's spectacular goals does the video show?",
"options": [
"A. 4.",
"B. 5.",
"C. 3.",
"D. 6."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many Mbappe's spectacular goals does the video show?\nOption:\nA. 4.\nB. 5.\nC. 3.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
464,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "155-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 465,
"target": "C",
"doc": {
"video_id": "156",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=BkiTScinYOQ",
"videoID": "BkiTScinYOQ",
"question_id": "156-1",
"task_type": "Action Recognition",
"question": "How do the players celebrate scoring the first goal in this video?",
"options": [
"A. Running on the field.",
"B. High-fiving with the teammates.",
"C. Kneeling on the field and screaming.",
"D. They do not celebrate the goal."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do the players celebrate scoring the first goal in this video?\nOption:\nA. Running on the field.\nB. High-fiving with the teammates.\nC. Kneeling on the field and screaming.\nD. They do not celebrate the goal.\nAnswer with the option's letter from the given choices directly.",
465,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "156-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 466,
"target": "B",
"doc": {
"video_id": "156",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=BkiTScinYOQ",
"videoID": "BkiTScinYOQ",
"question_id": "156-2",
"task_type": "Temporal Perception",
"question": "Which country did Portugal score against in the last goal of the video?",
"options": [
"A. Spanish.",
"B. Morocco.",
"C. Iran.",
"D. Mexico."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country did Portugal score against in the last goal of the video?\nOption:\nA. Spanish.\nB. Morocco.\nC. Iran.\nD. Mexico.\nAnswer with the option's letter from the given choices directly.",
466,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "156-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 467,
"target": "B",
"doc": {
"video_id": "156",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=BkiTScinYOQ",
"videoID": "BkiTScinYOQ",
"question_id": "156-3",
"task_type": "Attribute Perception",
"question": "What are the two main colors of the Portuguese team uniforms featured in the video?",
"options": [
"A. Yellow and black.",
"B. Red and white.",
"C. Blue and pink.",
"D. Can not be inferred in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the two main colors of the Portuguese team uniforms featured in the video?\nOption:\nA. Yellow and black.\nB. Red and white.\nC. Blue and pink.\nD. Can not be inferred in the video.\nAnswer with the option's letter from the given choices directly.",
467,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "156-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 468,
"target": "D",
"doc": {
"video_id": "157",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=9dQXiqS56Y4",
"videoID": "9dQXiqS56Y4",
"question_id": "157-1",
"task_type": "Object Recognition",
"question": "Which player is substituted with player 19?",
"options": [
"A. Number 15.",
"B. Number 17.",
"C. Number 16.",
"D. Number 18."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player is substituted with player 19?\nOption:\nA. Number 15.\nB. Number 17.\nC. Number 16.\nD. Number 18.\nAnswer with the option's letter from the given choices directly.",
468,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "157-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 469,
"target": "C",
"doc": {
"video_id": "157",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=9dQXiqS56Y4",
"videoID": "9dQXiqS56Y4",
"question_id": "157-2",
"task_type": "OCR Problems",
"question": "Which of the following is the name of player 18?",
"options": [
"A. Kylian Mbappé.",
"B. Lionel Andrés Messi.",
"C. Maxi Rodriguez.",
"D. Cristiano Ronaldo."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the name of player 18?\nOption:\nA. Kylian Mbappé.\nB. Lionel Andrés Messi.\nC. Maxi Rodriguez.\nD. Cristiano Ronaldo.\nAnswer with the option's letter from the given choices directly.",
469,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "157-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 470,
"target": "A",
"doc": {
"video_id": "157",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=9dQXiqS56Y4",
"videoID": "9dQXiqS56Y4",
"question_id": "157-3",
"task_type": "Counting Problem",
"question": "How many goals can be seen in the video?",
"options": [
"A. 0.",
"B. 1.",
"C. 2.",
"D. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many goals can be seen in the video?\nOption:\nA. 0.\nB. 1.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
470,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "157-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 471,
"target": "A",
"doc": {
"video_id": "158",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=uJFbOrNrL3w",
"videoID": "uJFbOrNrL3w",
"question_id": "158-1",
"task_type": "OCR Problems",
"question": "What is the score of the football match at the time 78:32?",
"options": [
"A. Qatar 1 - 2 Senegal.",
"B. Qatar 0 - 2 Senegal.",
"C. Qatar 2 - 0 Senegal.",
"D. Qatar 0 - 1 Senegal."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the score of the football match at the time 78:32?\nOption:\nA. Qatar 1 - 2 Senegal.\nB. Qatar 0 - 2 Senegal.\nC. Qatar 2 - 0 Senegal.\nD. Qatar 0 - 1 Senegal.\nAnswer with the option's letter from the given choices directly.",
471,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "158-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 472,
"target": "D",
"doc": {
"video_id": "158",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=uJFbOrNrL3w",
"videoID": "uJFbOrNrL3w",
"question_id": "158-2",
"task_type": "Object Recognition",
"question": "Which player delivers the pass to number 9 to score the goal?",
"options": [
"A. Number 14.",
"B. Number 16.",
"C. Number 15.",
"D. Number 17."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player delivers the pass to number 9 to score the goal?\nOption:\nA. Number 14.\nB. Number 16.\nC. Number 15.\nD. Number 17.\nAnswer with the option's letter from the given choices directly.",
472,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "158-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 473,
"target": "B",
"doc": {
"video_id": "158",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=uJFbOrNrL3w",
"videoID": "uJFbOrNrL3w",
"question_id": "158-3",
"task_type": "Object Reasoning",
"question": "Judging by the score and time, what can be inferred about Qatar's performance in this match so far?",
"options": [
"A. They are winning comfortably with plenty of time left.",
"B. They are losing and have little time to recover.",
"C. The game is tied and approaching the final moments.",
"D. They have a significant lead and are about to win."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Judging by the score and time, what can be inferred about Qatar's performance in this match so far?\nOption:\nA. They are winning comfortably with plenty of time left.\nB. They are losing and have little time to recover.\nC. The game is tied and approaching the final moments.\nD. They have a significant lead and are about to win.\nAnswer with the option's letter from the given choices directly.",
473,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "158-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 474,
"target": "B",
"doc": {
"video_id": "159",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=ByUONqA1IJY",
"videoID": "ByUONqA1IJY",
"question_id": "159-1",
"task_type": "Object Recognition",
"question": "Which of the following players serves from the backcourt?",
"options": [
"A. Forward.",
"B. Goalkeeper.",
"C. Midfielder.",
"D. Defender."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following players serves from the backcourt?\nOption:\nA. Forward.\nB. Goalkeeper.\nC. Midfielder.\nD. Defender.\nAnswer with the option's letter from the given choices directly.",
474,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "159-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 475,
"target": "C",
"doc": {
"video_id": "159",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=ByUONqA1IJY",
"videoID": "ByUONqA1IJY",
"question_id": "159-2",
"task_type": "Attribute Perception",
"question": "Which team is the player most likely representing based on the indication of his jersey in this video?",
"options": [
"A. Brazil.",
"B. Italy.",
"C. Uruguay.",
"D. Argentina."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team is the player most likely representing based on the indication of his jersey in this video?\nOption:\nA. Brazil.\nB. Italy.\nC. Uruguay.\nD. Argentina.\nAnswer with the option's letter from the given choices directly.",
475,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "159-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 476,
"target": "B",
"doc": {
"video_id": "159",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=ByUONqA1IJY",
"videoID": "ByUONqA1IJY",
"question_id": "159-3",
"task_type": "Attribute Perception",
"question": "What emotion is represented by the number 9 in blue after scoring a goal?",
"options": [
"A. Very happy after scoring the goal.",
"B. Tears of joy after scoring a goal.",
"C. Surprise at the referee's decision.",
"D. Sadness that the goal was not scored."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What emotion is represented by the number 9 in blue after scoring a goal?\nOption:\nA. Very happy after scoring the goal.\nB. Tears of joy after scoring a goal.\nC. Surprise at the referee's decision.\nD. Sadness that the goal was not scored.\nAnswer with the option's letter from the given choices directly.",
476,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "159-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 477,
"target": "C",
"doc": {
"video_id": "160",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=njc90ODIbLc",
"videoID": "njc90ODIbLc",
"question_id": "160-1",
"task_type": "OCR Problems",
"question": "What is the score following the goal being scored?",
"options": [
"A. 1:1.",
"B. 2:2.",
"C. 1:2.",
"D. 2:1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the score following the goal being scored?\nOption:\nA. 1:1.\nB. 2:2.\nC. 1:2.\nD. 2:1.\nAnswer with the option's letter from the given choices directly.",
477,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "160-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 478,
"target": "D",
"doc": {
"video_id": "160",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=njc90ODIbLc",
"videoID": "njc90ODIbLc",
"question_id": "160-2",
"task_type": "Object Reasoning",
"question": "Who could be the man in the white shirt clapping in the video?",
"options": [
"A. The commentator.",
"B. The coach of the Mexico team.",
"C. An ordinary member of the audience.",
"D. The coach of the Korean team."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who could be the man in the white shirt clapping in the video?\nOption:\nA. The commentator.\nB. The coach of the Mexico team.\nC. An ordinary member of the audience.\nD. The coach of the Korean team.\nAnswer with the option's letter from the given choices directly.",
478,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "160-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 479,
"target": "B",
"doc": {
"video_id": "160",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=njc90ODIbLc",
"videoID": "njc90ODIbLc",
"question_id": "160-3",
"task_type": "Object Recognition",
"question": "Which team scored in the video?",
"options": [
"A. Japan.",
"B. Korea.",
"C. China.",
"D. Singapore."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team scored in the video?\nOption:\nA. Japan.\nB. Korea.\nC. China.\nD. Singapore.\nAnswer with the option's letter from the given choices directly.",
479,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "160-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 480,
"target": "B",
"doc": {
"video_id": "161",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=-qTAeVGl_e8",
"videoID": "-qTAeVGl_e8",
"question_id": "161-1",
"task_type": "Counting Problem",
"question": "How many athletes are doing high jumps in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes are doing high jumps in the video?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
480,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "161-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 481,
"target": "C",
"doc": {
"video_id": "161",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=-qTAeVGl_e8",
"videoID": "-qTAeVGl_e8",
"question_id": "161-2",
"task_type": "Object Reasoning",
"question": "Who ultimately won the high jump competition in the video?",
"options": [
"A. Athlete wearing a white top and black trousers.",
"B. Athlete wearing a white top and white shorts.",
"C. Athlete wearing a yellow top and green shorts.",
"D. Athlete wearing a yellow top and black trousers."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who ultimately won the high jump competition in the video?\nOption:\nA. Athlete wearing a white top and black trousers.\nB. Athlete wearing a white top and white shorts.\nC. Athlete wearing a yellow top and green shorts.\nD. Athlete wearing a yellow top and black trousers.\nAnswer with the option's letter from the given choices directly.",
481,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "161-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 482,
"target": "A",
"doc": {
"video_id": "161",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=-qTAeVGl_e8",
"videoID": "-qTAeVGl_e8",
"question_id": "161-3",
"task_type": "Object Recognition",
"question": "What is the competition in this video?",
"options": [
"A. High jump competition.",
"B. Long jump competition.",
"C. Pole vault competition.",
"D. Triple jump competition."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the competition in this video?\nOption:\nA. High jump competition.\nB. Long jump competition.\nC. Pole vault competition.\nD. Triple jump competition.\nAnswer with the option's letter from the given choices directly.",
482,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "161-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 483,
"target": "D",
"doc": {
"video_id": "162",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=8y2Hj7GDcwI",
"videoID": "8y2Hj7GDcwI",
"question_id": "162-1",
"task_type": "Object Recognition",
"question": "Which country's athlete's high jump process is documented in the video?",
"options": [
"A. England.",
"B. USA.",
"C. Korea.",
"D. New Zealand."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country's athlete's high jump process is documented in the video?\nOption:\nA. England.\nB. USA.\nC. Korea.\nD. New Zealand.\nAnswer with the option's letter from the given choices directly.",
483,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "162-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 484,
"target": "B",
"doc": {
"video_id": "162",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=8y2Hj7GDcwI",
"videoID": "8y2Hj7GDcwI",
"question_id": "162-2",
"task_type": "OCR Problems",
"question": "What is the name of the athlete who won the second place in the video?",
"options": [
"A. Hamish Kerr.",
"B. Shelby McEwen.",
"C. Sanghyeok Woo.",
"D. Not mentioned in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the athlete who won the second place in the video?\nOption:\nA. Hamish Kerr.\nB. Shelby McEwen.\nC. Sanghyeok Woo.\nD. Not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
484,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "162-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 485,
"target": "C",
"doc": {
"video_id": "162",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=8y2Hj7GDcwI",
"videoID": "8y2Hj7GDcwI",
"question_id": "162-3",
"task_type": "OCR Problems",
"question": "Which company's advertisement did not appear in the video?",
"options": [
"A. SEIKO.",
"B. TDK.",
"C. TCL.",
"D. SONY."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which company's advertisement did not appear in the video?\nOption:\nA. SEIKO.\nB. TDK.\nC. TCL.\nD. SONY.\nAnswer with the option's letter from the given choices directly.",
485,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "162-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 486,
"target": "A",
"doc": {
"video_id": "163",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=zNxi2s36tS0",
"videoID": "zNxi2s36tS0",
"question_id": "163-1",
"task_type": "Counting Problem",
"question": "How many athletes are competing on the track in the video?",
"options": [
"A. 8.",
"B. 9.",
"C. 7.",
"D. 13."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes are competing on the track in the video?\nOption:\nA. 8.\nB. 9.\nC. 7.\nD. 13.\nAnswer with the option's letter from the given choices directly.",
486,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "163-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 487,
"target": "D",
"doc": {
"video_id": "163",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=zNxi2s36tS0",
"videoID": "zNxi2s36tS0",
"question_id": "163-2",
"task_type": "Object Recognition",
"question": "What is the competition event currently taking place in the video?",
"options": [
"A. Women's 200m open.",
"B. Men's 200m open.",
"C. Women's 100m open.",
"D. Men's 100m open."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the competition event currently taking place in the video?\nOption:\nA. Women's 200m open.\nB. Men's 200m open.\nC. Women's 100m open.\nD. Men's 100m open.\nAnswer with the option's letter from the given choices directly.",
487,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "163-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 488,
"target": "B",
"doc": {
"video_id": "163",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=zNxi2s36tS0",
"videoID": "zNxi2s36tS0",
"question_id": "163-3",
"task_type": "OCR Problems",
"question": "What is the shortest time to reach the finish line in the video?",
"options": [
"A. 9.5 seconds.",
"B. 10.06 seconds.",
"C. 8.7 seconds.",
"D. 8.5 seconds."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the shortest time to reach the finish line in the video?\nOption:\nA. 9.5 seconds.\nB. 10.06 seconds.\nC. 8.7 seconds.\nD. 8.5 seconds.\nAnswer with the option's letter from the given choices directly.",
488,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "163-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 489,
"target": "C",
"doc": {
"video_id": "164",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=YNV0pLQ3r5U",
"videoID": "YNV0pLQ3r5U",
"question_id": "164-1",
"task_type": "Counting Problem",
"question": "How many groups of athletes' competitions are recorded in the video?",
"options": [
"A. 2.",
"B. 1.",
"C. 3.",
"D. 6."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many groups of athletes' competitions are recorded in the video?\nOption:\nA. 2.\nB. 1.\nC. 3.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
489,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "164-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 490,
"target": "A",
"doc": {
"video_id": "164",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=YNV0pLQ3r5U",
"videoID": "YNV0pLQ3r5U",
"question_id": "164-2",
"task_type": "Object Recognition",
"question": "Assuming groups are numbered in chronological order of appearance in the video, with the first group being the one that appears first in the video. To which group does the athlete who fell in the video belong?",
"options": [
"A. Second group.",
"B. First group.",
"C. Third group.",
"D. It cannot be inferred from this video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Assuming groups are numbered in chronological order of appearance in the video, with the first group being the one that appears first in the video. To which group does the athlete who fell in the video belong?\nOption:\nA. Second group.\nB. First group.\nC. Third group.\nD. It cannot be inferred from this video.\nAnswer with the option's letter from the given choices directly.",
490,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "164-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 491,
"target": "D",
"doc": {
"video_id": "164",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=YNV0pLQ3r5U",
"videoID": "YNV0pLQ3r5U",
"question_id": "164-3",
"task_type": "Action Recognition",
"question": "What is the competition event currently taking place in the video?",
"options": [
"A. Men's 4x100m relay.",
"B. Women's 4x100m relay.",
"C. Men's 60m hurdles.",
"D. Women's 60m hurdles."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the competition event currently taking place in the video?\nOption:\nA. Men's 4x100m relay.\nB. Women's 4x100m relay.\nC. Men's 60m hurdles.\nD. Women's 60m hurdles.\nAnswer with the option's letter from the given choices directly.",
491,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "164-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 492,
"target": "B",
"doc": {
"video_id": "165",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=2yGaTOzaGIA",
"videoID": "2yGaTOzaGIA",
"question_id": "165-1",
"task_type": "Object Recognition",
"question": "Who was the first athlete to reach the finish line in the video?",
"options": [
"A. The athlete wearing a blue top and blue shorts.",
"B. The athlete wearing a white top and black shorts.",
"C. The athlete wearing a yellow top and green shorts.",
"D. The athlete wearing a white top and red shorts."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was the first athlete to reach the finish line in the video?\nOption:\nA. The athlete wearing a blue top and blue shorts.\nB. The athlete wearing a white top and black shorts.\nC. The athlete wearing a yellow top and green shorts.\nD. The athlete wearing a white top and red shorts.\nAnswer with the option's letter from the given choices directly.",
492,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "165-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 493,
"target": "C",
"doc": {
"video_id": "165",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=2yGaTOzaGIA",
"videoID": "2yGaTOzaGIA",
"question_id": "165-2",
"task_type": "Counting Problem",
"question": "How many athletes are competing in the video?",
"options": [
"A. 7.",
"B. 5.",
"C. 6.",
"D. 8."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes are competing in the video?\nOption:\nA. 7.\nB. 5.\nC. 6.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
493,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "165-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 494,
"target": "A",
"doc": {
"video_id": "165",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=2yGaTOzaGIA",
"videoID": "2yGaTOzaGIA",
"question_id": "165-3",
"task_type": "Action Recognition",
"question": "What is the competition event taking place in the video?",
"options": [
"A. Men's 400 meters.",
"B. Men's 200 meters.",
"C. Women's 400 meters.",
"D. Women's 200 meters."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the competition event taking place in the video?\nOption:\nA. Men's 400 meters.\nB. Men's 200 meters.\nC. Women's 400 meters.\nD. Women's 200 meters.\nAnswer with the option's letter from the given choices directly.",
494,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "165-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 495,
"target": "D",
"doc": {
"video_id": "166",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ejFVFtJdP3s",
"videoID": "ejFVFtJdP3s",
"question_id": "166-1",
"task_type": "Counting Problem",
"question": "How many times do the athletes in the event shown in the video need to touch the pool wall?",
"options": [
"A. 0.",
"B. 1.",
"C. 3.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times do the athletes in the event shown in the video need to touch the pool wall?\nOption:\nA. 0.\nB. 1.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
495,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "166-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 496,
"target": "B",
"doc": {
"video_id": "166",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ejFVFtJdP3s",
"videoID": "ejFVFtJdP3s",
"question_id": "166-2",
"task_type": "Attribute Perception",
"question": "Which country does the swimmer who was given a close-up in the video hail from?",
"options": [
"A. Australia.",
"B. America.",
"C. England.",
"D. Malaysia."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country does the swimmer who was given a close-up in the video hail from?\nOption:\nA. Australia.\nB. America.\nC. England.\nD. Malaysia.\nAnswer with the option's letter from the given choices directly.",
496,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "166-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 497,
"target": "C",
"doc": {
"video_id": "166",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ejFVFtJdP3s",
"videoID": "ejFVFtJdP3s",
"question_id": "166-3",
"task_type": "Action Recognition",
"question": "What swimming stroke does the athlete in the video use?",
"options": [
"A. Freestyle.",
"B. Breaststroke.",
"C. Butterfly.",
"D. Individual Medley."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What swimming stroke does the athlete in the video use?\nOption:\nA. Freestyle.\nB. Breaststroke.\nC. Butterfly.\nD. Individual Medley.\nAnswer with the option's letter from the given choices directly.",
497,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "166-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 498,
"target": "A",
"doc": {
"video_id": "167",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=913JhnW28U4",
"videoID": "913JhnW28U4",
"question_id": "167-1",
"task_type": "Attribute Perception",
"question": "From which country does the first-place athlete in the video originate?",
"options": [
"A. China.",
"B. Italy.",
"C. Hungary.",
"D. South Korea."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From which country does the first-place athlete in the video originate?\nOption:\nA. China.\nB. Italy.\nC. Hungary.\nD. South Korea.\nAnswer with the option's letter from the given choices directly.",
498,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "167-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 499,
"target": "D",
"doc": {
"video_id": "167",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=913JhnW28U4",
"videoID": "913JhnW28U4",
"question_id": "167-2",
"task_type": "Action Recognition",
"question": "What swimming stroke does the athlete in the video use?",
"options": [
"A. Individual Medley.",
"B. Breaststroke.",
"C. Butterfly.",
"D. Freestyle."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What swimming stroke does the athlete in the video use?\nOption:\nA. Individual Medley.\nB. Breaststroke.\nC. Butterfly.\nD. Freestyle.\nAnswer with the option's letter from the given choices directly.",
499,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "167-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 500,
"target": "B",
"doc": {
"video_id": "167",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=913JhnW28U4",
"videoID": "913JhnW28U4",
"question_id": "167-3",
"task_type": "Counting Problem",
"question": "What is the time difference in seconds between the second and third place in the competition results shown at the end of the video?",
"options": [
"A. 0.19s.",
"B. 0.06s.",
"C. 0.25s.",
"D. 1.06s."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the time difference in seconds between the second and third place in the competition results shown at the end of the video?\nOption:\nA. 0.19s.\nB. 0.06s.\nC. 0.25s.\nD. 1.06s.\nAnswer with the option's letter from the given choices directly.",
500,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "167-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 501,
"target": "C",
"doc": {
"video_id": "168",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=DK6bKtXUE1c",
"videoID": "DK6bKtXUE1c",
"question_id": "168-1",
"task_type": "Counting Problem",
"question": "How many athletes' long jumps are recorded in the video?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. 2."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes' long jumps are recorded in the video?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
501,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "168-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 502,
"target": "A",
"doc": {
"video_id": "168",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=DK6bKtXUE1c",
"videoID": "DK6bKtXUE1c",
"question_id": "168-2",
"task_type": "Counting Problem",
"question": "How many athletes in the video achieved a high jump distance of over 9 meters?",
"options": [
"A. 0.",
"B. 1.",
"C. 2.",
"D. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes in the video achieved a high jump distance of over 9 meters?\nOption:\nA. 0.\nB. 1.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
502,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "168-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 503,
"target": "D",
"doc": {
"video_id": "168",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=DK6bKtXUE1c",
"videoID": "DK6bKtXUE1c",
"question_id": "168-3",
"task_type": "Action Recognition",
"question": "What is the competition event taking place in the video?",
"options": [
"A. Men's high jump.",
"B. Women's high jump.",
"C. Women's long jump.",
"D. Men's long jump."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the competition event taking place in the video?\nOption:\nA. Men's high jump.\nB. Women's high jump.\nC. Women's long jump.\nD. Men's long jump.\nAnswer with the option's letter from the given choices directly.",
503,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "168-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 504,
"target": "B",
"doc": {
"video_id": "169",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=CHlJdMVLV2s",
"videoID": "CHlJdMVLV2s",
"question_id": "169-1",
"task_type": "Attribute Perception",
"question": "Which country do the athletes shown at the beginning of the video come from?",
"options": [
"A. Switzerland.",
"B. Germany.",
"C. Brazil.",
"D. Hungary."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country do the athletes shown at the beginning of the video come from?\nOption:\nA. Switzerland.\nB. Germany.\nC. Brazil.\nD. Hungary.\nAnswer with the option's letter from the given choices directly.",
504,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "169-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 505,
"target": "C",
"doc": {
"video_id": "169",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=CHlJdMVLV2s",
"videoID": "CHlJdMVLV2s",
"question_id": "169-2",
"task_type": "OCR Problems",
"question": "Where and when was the diving scene in the video taken?",
"options": [
"A. Beijing 2008.",
"B. Tokyo 2020.",
"C. Rio 2016.",
"D. Lodon 2012."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where and when was the diving scene in the video taken?\nOption:\nA. Beijing 2008.\nB. Tokyo 2020.\nC. Rio 2016.\nD. Lodon 2012.\nAnswer with the option's letter from the given choices directly.",
505,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "169-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 506,
"target": "A",
"doc": {
"video_id": "169",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=CHlJdMVLV2s",
"videoID": "CHlJdMVLV2s",
"question_id": "169-3",
"task_type": "Counting Problem",
"question": "How many times did the two athletes dive together at the same time in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 6.",
"D. 8."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times did the two athletes dive together at the same time in the video?\nOption:\nA. 3.\nB. 4.\nC. 6.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
506,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "169-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 507,
"target": "D",
"doc": {
"video_id": "170",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=E895PNqSgEI",
"videoID": "E895PNqSgEI",
"question_id": "170-1",
"task_type": "OCR Problems",
"question": "When and where did the match in the video take place?",
"options": [
"A. Lodon 2012.",
"B. Tokyo 2018.",
"C. Beijing 2008.",
"D. Tokyo 2020."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When and where did the match in the video take place?\nOption:\nA. Lodon 2012.\nB. Tokyo 2018.\nC. Beijing 2008.\nD. Tokyo 2020.\nAnswer with the option's letter from the given choices directly.",
507,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "170-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 508,
"target": "B",
"doc": {
"video_id": "170",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=E895PNqSgEI",
"videoID": "E895PNqSgEI",
"question_id": "170-2",
"task_type": "Counting Problem",
"question": "How many athletes can be seen crossing the finish line in the video?",
"options": [
"A. 7.",
"B. 6.",
"C. 8.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes can be seen crossing the finish line in the video?\nOption:\nA. 7.\nB. 6.\nC. 8.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
508,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "170-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 509,
"target": "C",
"doc": {
"video_id": "170",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=E895PNqSgEI",
"videoID": "E895PNqSgEI",
"question_id": "170-3",
"task_type": "Object Recognition",
"question": "Who was the first athlete to reach the finish line in the video?",
"options": [
"A. The athlete in green tops and green shorts.",
"B. The athlete in red tops and red shorts.",
"C. The athlete in blue tops and blue shorts.",
"D. The athlete in blue tops and green shorts."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was the first athlete to reach the finish line in the video?\nOption:\nA. The athlete in green tops and green shorts.\nB. The athlete in red tops and red shorts.\nC. The athlete in blue tops and blue shorts.\nD. The athlete in blue tops and green shorts.\nAnswer with the option's letter from the given choices directly.",
509,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "170-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 510,
"target": "B",
"doc": {
"video_id": "171",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=FsLaTZmP6Uw",
"videoID": "FsLaTZmP6Uw",
"question_id": "171-1",
"task_type": "Temporal Perception",
"question": "When was the game hosted?",
"options": [
"A. In 2012.",
"B. In 2016.",
"C. In 2020.",
"D. In 2024."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When was the game hosted?\nOption:\nA. In 2012.\nB. In 2016.\nC. In 2020.\nD. In 2024.\nAnswer with the option's letter from the given choices directly.",
510,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "171-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 511,
"target": "C",
"doc": {
"video_id": "171",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=FsLaTZmP6Uw",
"videoID": "FsLaTZmP6Uw",
"question_id": "171-2",
"task_type": "Object Reasoning",
"question": "Which player was the winner of the game?",
"options": [
"A. The man in black.",
"B. The game ended in a standoff.",
"C. The man in red.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player was the winner of the game?\nOption:\nA. The man in black.\nB. The game ended in a standoff.\nC. The man in red.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
511,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "171-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 512,
"target": "A",
"doc": {
"video_id": "171",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=FsLaTZmP6Uw",
"videoID": "FsLaTZmP6Uw",
"question_id": "171-3",
"task_type": "Action Reasoning",
"question": "Which sentence describes this rally according to this video?",
"options": [
"A. Two athletes make a series of consecutive successful hits.",
"B. The athlete in red can not easily make a good return.",
"C. The athlete in black often makes edge balls.",
"D. None of the above is correct."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which sentence describes this rally according to this video?\nOption:\nA. Two athletes make a series of consecutive successful hits.\nB. The athlete in red can not easily make a good return.\nC. The athlete in black often makes edge balls.\nD. None of the above is correct.\nAnswer with the option's letter from the given choices directly.",
512,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "171-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 513,
"target": "D",
"doc": {
"video_id": "172",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=8W4gbKKNI1E",
"videoID": "8W4gbKKNI1E",
"question_id": "172-1",
"task_type": "Object Recognition",
"question": "What is the logo on the pitcher's chest who wears a blue and red sports shirt and orange helmet?",
"options": [
"A. A flower.",
"B. A row of letters.",
"C. A plane.",
"D. A tick."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the logo on the pitcher's chest who wears a blue and red sports shirt and orange helmet?\nOption:\nA. A flower.\nB. A row of letters.\nC. A plane.\nD. A tick.\nAnswer with the option's letter from the given choices directly.",
513,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "172-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 514,
"target": "A",
"doc": {
"video_id": "172",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=8W4gbKKNI1E",
"videoID": "8W4gbKKNI1E",
"question_id": "172-2",
"task_type": "Action Recognition",
"question": "What did the baseball umpire do after the second ball?",
"options": [
"A. He got down on one knee.",
"B. He put his arms crossed over the chest.",
"C. He gave a thumbs up.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the baseball umpire do after the second ball?\nOption:\nA. He got down on one knee.\nB. He put his arms crossed over the chest.\nC. He gave a thumbs up.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
514,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "172-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 515,
"target": "B",
"doc": {
"video_id": "172",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=8W4gbKKNI1E",
"videoID": "8W4gbKKNI1E",
"question_id": "172-3",
"task_type": "Action Recognition",
"question": "What happened to the last ball in this video?",
"options": [
"A. It was caught by an opposing player.",
"B. It was hit out of the ballpark and reached the seats.",
"C. It was not striked by the player.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the last ball in this video?\nOption:\nA. It was caught by an opposing player.\nB. It was hit out of the ballpark and reached the seats.\nC. It was not striked by the player.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
515,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "172-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 516,
"target": "D",
"doc": {
"video_id": "173",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=tbKTfX5Az6w",
"videoID": "tbKTfX5Az6w",
"question_id": "173-1",
"task_type": "OCR Problems",
"question": "What is the final score of the game?",
"options": [
"A. 5-8.",
"B. 6-8.",
"C. 6-9.",
"D. 5-9."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the final score of the game?\nOption:\nA. 5-8.\nB. 6-8.\nC. 6-9.\nD. 5-9.\nAnswer with the option's letter from the given choices directly.",
516,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "173-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 517,
"target": "B",
"doc": {
"video_id": "173",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=tbKTfX5Az6w",
"videoID": "tbKTfX5Az6w",
"question_id": "173-2",
"task_type": "Object Reasoning",
"question": "Which player was the winner of the game?",
"options": [
"A. The man in deep blue.",
"B. The man in light blue.",
"C. The game ended in a standoff.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player was the winner of the game?\nOption:\nA. The man in deep blue.\nB. The man in light blue.\nC. The game ended in a standoff.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
517,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "173-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 518,
"target": "C",
"doc": {
"video_id": "173",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=tbKTfX5Az6w",
"videoID": "tbKTfX5Az6w",
"question_id": "173-3",
"task_type": "Action Recognition",
"question": "How did the winner win this rally?",
"options": [
"A. The opposing player made a foul hit.",
"B. The opposing hit the ball out of the bounds.",
"C. He made an edge ball and the opposing player missed the ball.",
"D. None of the above is correct."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the winner win this rally?\nOption:\nA. The opposing player made a foul hit.\nB. The opposing hit the ball out of the bounds.\nC. He made an edge ball and the opposing player missed the ball.\nD. None of the above is correct.\nAnswer with the option's letter from the given choices directly.",
518,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "173-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 519,
"target": "C",
"doc": {
"video_id": "174",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=84EpEwIVFdU",
"videoID": "84EpEwIVFdU",
"question_id": "174-1",
"task_type": "Attribute Perception",
"question": "Which color of balls is absent from the table?",
"options": [
"A. Green.",
"B. Red.",
"C. Blue.",
"D. Black."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color of balls is absent from the table?\nOption:\nA. Green.\nB. Red.\nC. Blue.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
519,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "174-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 520,
"target": "A",
"doc": {
"video_id": "174",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=84EpEwIVFdU",
"videoID": "84EpEwIVFdU",
"question_id": "174-2",
"task_type": "OCR Problems",
"question": "What Latin texts are inscribed on the man's chest?",
"options": [
"A. Liber Win and MrQ.",
"B. Only Liber Win.",
"C. Only MrQ.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What Latin texts are inscribed on the man's chest?\nOption:\nA. Liber Win and MrQ.\nB. Only Liber Win.\nC. Only MrQ.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
520,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "174-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 521,
"target": "B",
"doc": {
"video_id": "174",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=84EpEwIVFdU",
"videoID": "84EpEwIVFdU",
"question_id": "174-3",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. An actor is being interviewed for winning a prize.",
"B. A snooker player is being interviewed for entering the final competition.",
"C. A father is being interviewed for talking about his slow brain.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. An actor is being interviewed for winning a prize.\nB. A snooker player is being interviewed for entering the final competition.\nC. A father is being interviewed for talking about his slow brain.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
521,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "174-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 522,
"target": "B",
"doc": {
"video_id": "175",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=TblFD8H4j94",
"videoID": "TblFD8H4j94",
"question_id": "175-1",
"task_type": "Counting Problem",
"question": "How many rallies did the two athletes play?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many rallies did the two athletes play?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
522,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "175-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 523,
"target": "C",
"doc": {
"video_id": "175",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=TblFD8H4j94",
"videoID": "TblFD8H4j94",
"question_id": "175-2",
"task_type": "Action Recognition",
"question": "Which of the following occurred in the last rally?",
"options": [
"A. The opposing player missed the ball.",
"B. The opposing player hit the ball out of the bounds.",
"C. The opposing player made a return but the ball did not come over the net.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following occurred in the last rally?\nOption:\nA. The opposing player missed the ball.\nB. The opposing player hit the ball out of the bounds.\nC. The opposing player made a return but the ball did not come over the net.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
523,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "175-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 524,
"target": "A",
"doc": {
"video_id": "175",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=TblFD8H4j94",
"videoID": "TblFD8H4j94",
"question_id": "175-3",
"task_type": "Object Recognition",
"question": "What sport are the two athletes playing?",
"options": [
"A. Tennis.",
"B. Baseball.",
"C. Soccer.",
"D. Basketball."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sport are the two athletes playing?\nOption:\nA. Tennis.\nB. Baseball.\nC. Soccer.\nD. Basketball.\nAnswer with the option's letter from the given choices directly.",
524,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "175-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 525,
"target": "D",
"doc": {
"video_id": "176",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=wGfRSojH0PE",
"videoID": "wGfRSojH0PE",
"question_id": "176-1",
"task_type": "OCR Problems",
"question": "What is the score after this game?",
"options": [
"A. 20-22.",
"B. 23-20.",
"C. 20-23.",
"D. 23-22."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the score after this game?\nOption:\nA. 20-22.\nB. 23-20.\nC. 20-23.\nD. 23-22.\nAnswer with the option's letter from the given choices directly.",
525,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "176-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 526,
"target": "B",
"doc": {
"video_id": "176",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=wGfRSojH0PE",
"videoID": "wGfRSojH0PE",
"question_id": "176-2",
"task_type": "Object Recognition",
"question": "What is the man in black doing while standing beside the field and in front of the Dove board?",
"options": [
"A. He is the referee standing aside.",
"B. He is a photographer holding a camera to live broadcast.",
"C. He is the substitute player waiting to step on the court.",
"D. He is a spectator watching the game."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man in black doing while standing beside the field and in front of the Dove board?\nOption:\nA. He is the referee standing aside.\nB. He is a photographer holding a camera to live broadcast.\nC. He is the substitute player waiting to step on the court.\nD. He is a spectator watching the game.\nAnswer with the option's letter from the given choices directly.",
526,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "176-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 527,
"target": "C",
"doc": {
"video_id": "176",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=wGfRSojH0PE",
"videoID": "wGfRSojH0PE",
"question_id": "176-3",
"task_type": "Action Recognition",
"question": "How did the white team win this game?",
"options": [
"A. The opposing green team got a red card.",
"B. The kicker of the white team made a touchdown and won 6 points.",
"C. The kicker of the white team made a drop kick and won 3 points.",
"D. The white team won an extra point."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the white team win this game?\nOption:\nA. The opposing green team got a red card.\nB. The kicker of the white team made a touchdown and won 6 points.\nC. The kicker of the white team made a drop kick and won 3 points.\nD. The white team won an extra point.\nAnswer with the option's letter from the given choices directly.",
527,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "176-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 528,
"target": "A",
"doc": {
"video_id": "177",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=6DO8yOVYXr0",
"videoID": "6DO8yOVYXr0",
"question_id": "177-1",
"task_type": "OCR Problems",
"question": "What is the current score of the ongoing game?",
"options": [
"A. 3-2.",
"B. 2-3.",
"C. 2-2.",
"D. 3-3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the current score of the ongoing game?\nOption:\nA. 3-2.\nB. 2-3.\nC. 2-2.\nD. 3-3.\nAnswer with the option's letter from the given choices directly.",
528,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "177-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 529,
"target": "D",
"doc": {
"video_id": "177",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=6DO8yOVYXr0",
"videoID": "6DO8yOVYXr0",
"question_id": "177-2",
"task_type": "Counting Problem",
"question": "How many players are wearing black shirts in this video?",
"options": [
"A. 1.",
"B. 3.",
"C. 4.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many players are wearing black shirts in this video?\nOption:\nA. 1.\nB. 3.\nC. 4.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
529,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "177-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 530,
"target": "B",
"doc": {
"video_id": "177",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=6DO8yOVYXr0",
"videoID": "6DO8yOVYXr0",
"question_id": "177-3",
"task_type": "Object Recognition",
"question": "Who smashed the decisive ball?",
"options": [
"A. The woman in red with a number of 6 on her back.",
"B. The woman in red with a number of 2 on her back.",
"C. The woman in red with a number of 5 on her back.",
"D. The woman in black."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who smashed the decisive ball?\nOption:\nA. The woman in red with a number of 6 on her back.\nB. The woman in red with a number of 2 on her back.\nC. The woman in red with a number of 5 on her back.\nD. The woman in black.\nAnswer with the option's letter from the given choices directly.",
530,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "177-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 531,
"target": "C",
"doc": {
"video_id": "178",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=IPevXMMEVQ0",
"videoID": "IPevXMMEVQ0",
"question_id": "178-1",
"task_type": "Object Recognition",
"question": "Which flag is being held by the spectator?",
"options": [
"A. It contains five yellow stars on a red background.",
"B. It is a blue cross on a white background.",
"C. It is a yellow cross on a light blue background.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which flag is being held by the spectator?\nOption:\nA. It contains five yellow stars on a red background.\nB. It is a blue cross on a white background.\nC. It is a yellow cross on a light blue background.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
531,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "178-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 532,
"target": "A",
"doc": {
"video_id": "178",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=IPevXMMEVQ0",
"videoID": "IPevXMMEVQ0",
"question_id": "178-2",
"task_type": "Action Recognition",
"question": "What action did the man in blue take towards the goalkeeper in yellow?",
"options": [
"A. He hit on the goalkeeper's neck.",
"B. He greeted the goalkeeper.",
"C. He made faces to the goalkeeper.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What action did the man in blue take towards the goalkeeper in yellow?\nOption:\nA. He hit on the goalkeeper's neck.\nB. He greeted the goalkeeper.\nC. He made faces to the goalkeeper.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
532,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "178-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 533,
"target": "D",
"doc": {
"video_id": "178",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=IPevXMMEVQ0",
"videoID": "IPevXMMEVQ0",
"question_id": "178-3",
"task_type": "OCR Problems",
"question": "Which number is located on the back of the blue man?",
"options": [
"A. 40.",
"B. 50.",
"C. 60.",
"D. 70."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which number is located on the back of the blue man?\nOption:\nA. 40.\nB. 50.\nC. 60.\nD. 70.\nAnswer with the option's letter from the given choices directly.",
533,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "178-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 534,
"target": "B",
"doc": {
"video_id": "179",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=fkJv7LRa6Pc",
"videoID": "fkJv7LRa6Pc",
"question_id": "179-1",
"task_type": "Action Recognition",
"question": "What happened to the blue car with a decoration of fire on the hood?",
"options": [
"A. Its tire blew out and cannot move.",
"B. It ran out of fuel.",
"C. It was caught on fires.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the blue car with a decoration of fire on the hood?\nOption:\nA. Its tire blew out and cannot move.\nB. It ran out of fuel.\nC. It was caught on fires.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
534,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "179-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 535,
"target": "C",
"doc": {
"video_id": "179",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=fkJv7LRa6Pc",
"videoID": "fkJv7LRa6Pc",
"question_id": "179-2",
"task_type": "Action Reasoning",
"question": "Which of the following sentences is accurate based on the video?",
"options": [
"A. There are no cars on the road except these two cars.",
"B. The broken car restarted by itself.",
"C. The green and black car gives the front car a push to the start/finish line.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following sentences is accurate based on the video?\nOption:\nA. There are no cars on the road except these two cars.\nB. The broken car restarted by itself.\nC. The green and black car gives the front car a push to the start/finish line.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
535,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "179-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 536,
"target": "A",
"doc": {
"video_id": "179",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=fkJv7LRa6Pc",
"videoID": "fkJv7LRa6Pc",
"question_id": "179-3",
"task_type": "OCR Problems",
"question": "What are the red texts recognized on the white wall?",
"options": [
"A. GO BOWLING.",
"B. DO BOWLING.",
"C. GO DOWLING.",
"D. DO DOWLING."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the red texts recognized on the white wall?\nOption:\nA. GO BOWLING.\nB. DO BOWLING.\nC. GO DOWLING.\nD. DO DOWLING.\nAnswer with the option's letter from the given choices directly.",
536,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "179-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 537,
"target": "D",
"doc": {
"video_id": "180",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=54YD3Gv-3MI",
"videoID": "54YD3Gv-3MI",
"question_id": "180-1",
"task_type": "Counting Problem",
"question": "How many players are participating in the game?",
"options": [
"A. 6.",
"B. 4.",
"C. 3.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many players are participating in the game?\nOption:\nA. 6.\nB. 4.\nC. 3.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
537,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "180-1",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 538,
"target": "B",
"doc": {
"video_id": "180",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=54YD3Gv-3MI",
"videoID": "54YD3Gv-3MI",
"question_id": "180-2",
"task_type": "Action Recognition",
"question": "What happened to the blue player during the game?",
"options": [
"A. He won the first place.",
"B. He slided out of the slope.",
"C. He got injured during the game.",
"D. He asked the umpire to stop the game."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the blue player during the game?\nOption:\nA. He won the first place.\nB. He slided out of the slope.\nC. He got injured during the game.\nD. He asked the umpire to stop the game.\nAnswer with the option's letter from the given choices directly.",
538,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "180-2",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 539,
"target": "C",
"doc": {
"video_id": "180",
"duration": "short",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=54YD3Gv-3MI",
"videoID": "54YD3Gv-3MI",
"question_id": "180-3",
"task_type": "Object Reasoning",
"question": "What country won first place?",
"options": [
"A. KOR.",
"B. LAT.",
"C. CHN.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What country won first place?\nOption:\nA. KOR.\nB. LAT.\nC. CHN.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
539,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "180-3",
"duration": "short",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 540,
"target": "A",
"doc": {
"video_id": "181",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=8np5YKYx3sU",
"videoID": "8np5YKYx3sU",
"question_id": "181-1",
"task_type": "Counting Problem",
"question": "How many men and women are presenting on the stage?",
"options": [
"A. Six men and three women.",
"B. Five men and two women.",
"C. Four men and three women.",
"D. Four men and four women."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many men and women are presenting on the stage?\nOption:\nA. Six men and three women.\nB. Five men and two women.\nC. Four men and three women.\nD. Four men and four women.\nAnswer with the option's letter from the given choices directly.",
540,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "181-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 541,
"target": "D",
"doc": {
"video_id": "181",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=8np5YKYx3sU",
"videoID": "8np5YKYx3sU",
"question_id": "181-2",
"task_type": "Attribute Perception",
"question": "What clothes is the singer wearing?",
"options": [
"A. A yellow shirt.",
"B. A blue short-sleeve.",
"C. A purple vest.",
"D. A black skirt."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What clothes is the singer wearing?\nOption:\nA. A yellow shirt.\nB. A blue short-sleeve.\nC. A purple vest.\nD. A black skirt.\nAnswer with the option's letter from the given choices directly.",
541,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "181-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 542,
"target": "B",
"doc": {
"video_id": "181",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=8np5YKYx3sU",
"videoID": "8np5YKYx3sU",
"question_id": "181-3",
"task_type": "Action Recognition",
"question": "What are the people on the stage doing in this video?",
"options": [
"A. They are reciting poetry.",
"B. They are singing and dancing.",
"C. They are performing acrobatics.",
"D. They are performing a drama."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people on the stage doing in this video?\nOption:\nA. They are reciting poetry.\nB. They are singing and dancing.\nC. They are performing acrobatics.\nD. They are performing a drama.\nAnswer with the option's letter from the given choices directly.",
542,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "181-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 543,
"target": "C",
"doc": {
"video_id": "182",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=2QTiAmvygC4",
"videoID": "2QTiAmvygC4",
"question_id": "182-1",
"task_type": "Attribute Perception",
"question": "Which color of pants is the person wearing while playing the piano in the video?",
"options": [
"A. Black.",
"B. Purple.",
"C. Brown.",
"D. Blue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color of pants is the person wearing while playing the piano in the video?\nOption:\nA. Black.\nB. Purple.\nC. Brown.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
543,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "182-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 544,
"target": "A",
"doc": {
"video_id": "182",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=2QTiAmvygC4",
"videoID": "2QTiAmvygC4",
"question_id": "182-2",
"task_type": "Counting Problem",
"question": "What is the exact number of performers in the video?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. 2."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the exact number of performers in the video?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
544,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "182-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 545,
"target": "D",
"doc": {
"video_id": "182",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=2QTiAmvygC4",
"videoID": "2QTiAmvygC4",
"question_id": "182-3",
"task_type": "Information Synopsis",
"question": "What is happening in the video?",
"options": [
"A. They are performing a drama.",
"B. They are reciting poetry.",
"C. They are performing acrobatics.",
"D. They are singing a song."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is happening in the video?\nOption:\nA. They are performing a drama.\nB. They are reciting poetry.\nC. They are performing acrobatics.\nD. They are singing a song.\nAnswer with the option's letter from the given choices directly.",
545,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "182-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 546,
"target": "B",
"doc": {
"video_id": "183",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=NjxJY7P-Qpo",
"videoID": "NjxJY7P-Qpo",
"question_id": "183-1",
"task_type": "Counting Problem",
"question": "How many individuals are shown singing in the video?",
"options": [
"A. Two men.",
"B. Two women.",
"C. Three women.",
"D. One woman."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individuals are shown singing in the video?\nOption:\nA. Two men.\nB. Two women.\nC. Three women.\nD. One woman.\nAnswer with the option's letter from the given choices directly.",
546,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "183-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 547,
"target": "C",
"doc": {
"video_id": "183",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=NjxJY7P-Qpo",
"videoID": "NjxJY7P-Qpo",
"question_id": "183-2",
"task_type": "Object Recognition",
"question": "Which individual does the singer wearing a yellow dress embrace?",
"options": [
"A. A woman in a suit.",
"B. A woman in a dress.",
"C. A man with a white hat.",
"D. Another singer."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which individual does the singer wearing a yellow dress embrace?\nOption:\nA. A woman in a suit.\nB. A woman in a dress.\nC. A man with a white hat.\nD. Another singer.\nAnswer with the option's letter from the given choices directly.",
547,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "183-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 548,
"target": "A",
"doc": {
"video_id": "183",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=NjxJY7P-Qpo",
"videoID": "NjxJY7P-Qpo",
"question_id": "183-3",
"task_type": "Object Reasoning",
"question": "Which of the following best describes the audience's reaction to the singer's performance in the video?",
"options": [
"A. The audience applauded and thought the singer's performance was excellent.",
"B. There was no audience at the show because it was a live online broadcast.",
"C. The audience covered their faces and wept, thinking that the singer's singing had touched them deeply.",
"D. The audience looked very calm.."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the audience's reaction to the singer's performance in the video?\nOption:\nA. The audience applauded and thought the singer's performance was excellent.\nB. There was no audience at the show because it was a live online broadcast.\nC. The audience covered their faces and wept, thinking that the singer's singing had touched them deeply.\nD. The audience looked very calm..\nAnswer with the option's letter from the given choices directly.",
548,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "183-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 549,
"target": "A",
"doc": {
"video_id": "184",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=HiJb_2dvuHc",
"videoID": "HiJb_2dvuHc",
"question_id": "184-1",
"task_type": "Attribute Perception",
"question": "What instrument is the character in the middle playing in the video?",
"options": [
"A. Drum set.",
"B. Keytar.",
"C. Electric guitar.",
"D. Saxophone."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What instrument is the character in the middle playing in the video?\nOption:\nA. Drum set.\nB. Keytar.\nC. Electric guitar.\nD. Saxophone.\nAnswer with the option's letter from the given choices directly.",
549,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "184-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 550,
"target": "B",
"doc": {
"video_id": "184",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=HiJb_2dvuHc",
"videoID": "HiJb_2dvuHc",
"question_id": "184-2",
"task_type": "Attribute Perception",
"question": "What is the color of the wig worn by the character playing the guitar?",
"options": [
"A. Pink.",
"B. Blue.",
"C. Red.",
"D. Blonde."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the wig worn by the character playing the guitar?\nOption:\nA. Pink.\nB. Blue.\nC. Red.\nD. Blonde.\nAnswer with the option's letter from the given choices directly.",
550,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "184-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 551,
"target": "D",
"doc": {
"video_id": "184",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=HiJb_2dvuHc",
"videoID": "HiJb_2dvuHc",
"question_id": "184-3",
"task_type": "Action Reasoning",
"question": "Based on the backdrop and costumes, which demographic is the intended audience for the show in the video?",
"options": [
"A. Senior citizens.",
"B. Young adults.",
"C. Adolescents.",
"D. Children."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the backdrop and costumes, which demographic is the intended audience for the show in the video?\nOption:\nA. Senior citizens.\nB. Young adults.\nC. Adolescents.\nD. Children.\nAnswer with the option's letter from the given choices directly.",
551,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "184-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 552,
"target": "B",
"doc": {
"video_id": "185",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=mmKggCnGtA4",
"videoID": "mmKggCnGtA4",
"question_id": "185-1",
"task_type": "Action Recognition",
"question": "Which activity is the most likely one for the two main people in the image?",
"options": [
"A. Having a conversation.",
"B. Dancing.",
"C. Arguing.",
"D. Playing a game."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity is the most likely one for the two main people in the image?\nOption:\nA. Having a conversation.\nB. Dancing.\nC. Arguing.\nD. Playing a game.\nAnswer with the option's letter from the given choices directly.",
552,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "185-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 553,
"target": "A",
"doc": {
"video_id": "185",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=mmKggCnGtA4",
"videoID": "mmKggCnGtA4",
"question_id": "185-2",
"task_type": "Object Recognition",
"question": "What can be seen in the background directly above the heads of the main characters?",
"options": [
"A. Neon lights.",
"B. A basketball hoop.",
"C. A screen.",
"D. Floating balloons."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be seen in the background directly above the heads of the main characters?\nOption:\nA. Neon lights.\nB. A basketball hoop.\nC. A screen.\nD. Floating balloons.\nAnswer with the option's letter from the given choices directly.",
553,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "185-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 554,
"target": "C",
"doc": {
"video_id": "185",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=mmKggCnGtA4",
"videoID": "mmKggCnGtA4",
"question_id": "185-3",
"task_type": "Spatial Reasoning",
"question": "Which of the following venue types is most likely the setting of the scene, given its context?",
"options": [
"A. A waterfront piazza.",
"B. An outdoor festival.",
"C. A casual diner.",
"D. An ice cream parlor."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following venue types is most likely the setting of the scene, given its context?\nOption:\nA. A waterfront piazza.\nB. An outdoor festival.\nC. A casual diner.\nD. An ice cream parlor.\nAnswer with the option's letter from the given choices directly.",
554,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "185-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 555,
"target": "B",
"doc": {
"video_id": "186",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=QhL6ICNQ_So",
"videoID": "QhL6ICNQ_So",
"question_id": "186-1",
"task_type": "Counting Problem",
"question": "How many individuals are currently present on the stage?",
"options": [
"A. Six.",
"B. Seven.",
"C. Eight.",
"D. Five."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individuals are currently present on the stage?\nOption:\nA. Six.\nB. Seven.\nC. Eight.\nD. Five.\nAnswer with the option's letter from the given choices directly.",
555,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "186-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 556,
"target": "C",
"doc": {
"video_id": "186",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=QhL6ICNQ_So",
"videoID": "QhL6ICNQ_So",
"question_id": "186-2",
"task_type": "Attribute Perception",
"question": "What is the pattern on the backdrop of the stage?",
"options": [
"A. Stripes.",
"B. Polka dots.",
"C. Swirling colors.",
"D. Solid color."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the pattern on the backdrop of the stage?\nOption:\nA. Stripes.\nB. Polka dots.\nC. Swirling colors.\nD. Solid color.\nAnswer with the option's letter from the given choices directly.",
556,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "186-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 557,
"target": "A",
"doc": {
"video_id": "186",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=QhL6ICNQ_So",
"videoID": "QhL6ICNQ_So",
"question_id": "186-3",
"task_type": "Attribute Perception",
"question": "Which genre best describes the performance based on the activity and costumes?",
"options": [
"A. Musical theatre.",
"B. Situational comedy.",
"C. Ballet.",
"D. Can't be deduced."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which genre best describes the performance based on the activity and costumes?\nOption:\nA. Musical theatre.\nB. Situational comedy.\nC. Ballet.\nD. Can't be deduced.\nAnswer with the option's letter from the given choices directly.",
557,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "186-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 558,
"target": "A",
"doc": {
"video_id": "187",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=iOOseoPiw8E",
"videoID": "iOOseoPiw8E",
"question_id": "187-1",
"task_type": "Attribute Perception",
"question": "In this video, which stage setting is featured at the beginning of the performance?",
"options": [
"A. A garden.",
"B. An underwater scene.",
"C. A cityscape.",
"D. A desert."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In this video, which stage setting is featured at the beginning of the performance?\nOption:\nA. A garden.\nB. An underwater scene.\nC. A cityscape.\nD. A desert.\nAnswer with the option's letter from the given choices directly.",
558,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "187-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 559,
"target": "B",
"doc": {
"video_id": "187",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=iOOseoPiw8E",
"videoID": "iOOseoPiw8E",
"question_id": "187-2",
"task_type": "OCR Problems",
"question": "According to the video, which country will host this live stage event?",
"options": [
"A. America.",
"B. Australia.",
"C. Canada.",
"D. England."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which country will host this live stage event?\nOption:\nA. America.\nB. Australia.\nC. Canada.\nD. England.\nAnswer with the option's letter from the given choices directly.",
559,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "187-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 560,
"target": "D",
"doc": {
"video_id": "187",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=iOOseoPiw8E",
"videoID": "iOOseoPiw8E",
"question_id": "187-3",
"task_type": "Spatial Reasoning",
"question": "Based on the characters' costumes and the stage setup, which of the following types of performance is most likely being portrayed in this video?",
"options": [
"A. A modern dance performance.",
"B. A historical drama.",
"C. A science fiction play.",
"D. A children's fantasy show."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the characters' costumes and the stage setup, which of the following types of performance is most likely being portrayed in this video?\nOption:\nA. A modern dance performance.\nB. A historical drama.\nC. A science fiction play.\nD. A children's fantasy show.\nAnswer with the option's letter from the given choices directly.",
560,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "187-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 561,
"target": "C",
"doc": {
"video_id": "188",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=XK6ApJHa59I",
"videoID": "XK6ApJHa59I",
"question_id": "188-1",
"task_type": "Object Recognition",
"question": "Which person in the video is being slapped by the man?",
"options": [
"A. The woman without glasses.",
"B. Monica.",
"C. Himself.",
"D. One audience."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which person in the video is being slapped by the man?\nOption:\nA. The woman without glasses.\nB. Monica.\nC. Himself.\nD. One audience.\nAnswer with the option's letter from the given choices directly.",
561,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "188-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 562,
"target": "A",
"doc": {
"video_id": "188",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=XK6ApJHa59I",
"videoID": "XK6ApJHa59I",
"question_id": "188-2",
"task_type": "Object Recognition",
"question": "Which item is being held by the actor in the image on the poster displayed at the end of the video?",
"options": [
"A. A glass.",
"B. A book.",
"C. A hat.",
"D. A microphone."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item is being held by the actor in the image on the poster displayed at the end of the video?\nOption:\nA. A glass.\nB. A book.\nC. A hat.\nD. A microphone.\nAnswer with the option's letter from the given choices directly.",
562,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "188-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 563,
"target": "B",
"doc": {
"video_id": "188",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=XK6ApJHa59I",
"videoID": "XK6ApJHa59I",
"question_id": "188-3",
"task_type": "Attribute Perception",
"question": "What can be inferred about the tone of the play?",
"options": [
"A. It is a serious drama.",
"B. It is a provocative comedy.",
"C. It is a romantic story.",
"D. It is a science fiction adventure."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the tone of the play?\nOption:\nA. It is a serious drama.\nB. It is a provocative comedy.\nC. It is a romantic story.\nD. It is a science fiction adventure.\nAnswer with the option's letter from the given choices directly.",
563,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "188-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 564,
"target": "A",
"doc": {
"video_id": "189",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=eICN548E2TU",
"videoID": "eICN548E2TU",
"question_id": "189-1",
"task_type": "Action Recognition",
"question": "What is the purpose of the performers sitting down?",
"options": [
"A. To drink tea.",
"B. To eat hot pot.",
"C. To drink beer.",
"D. Because Standing is not allowed here."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the performers sitting down?\nOption:\nA. To drink tea.\nB. To eat hot pot.\nC. To drink beer.\nD. Because Standing is not allowed here.\nAnswer with the option's letter from the given choices directly.",
564,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "189-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 565,
"target": "D",
"doc": {
"video_id": "189",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=eICN548E2TU",
"videoID": "eICN548E2TU",
"question_id": "189-2",
"task_type": "Object Recognition",
"question": "Which item does the man display to the woman?",
"options": [
"A. A painting.",
"B. A picture on his phone.",
"C. A poem he had written.",
"D. A picture in a book."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item does the man display to the woman?\nOption:\nA. A painting.\nB. A picture on his phone.\nC. A poem he had written.\nD. A picture in a book.\nAnswer with the option's letter from the given choices directly.",
565,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "189-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 566,
"target": "C",
"doc": {
"video_id": "189",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=eICN548E2TU",
"videoID": "eICN548E2TU",
"question_id": "189-3",
"task_type": "Object Reasoning",
"question": "Which of the following statements best describes the relationship between the two characters in this scene based on the available information?",
"options": [
"A. They are friends having a casual conversation.",
"B. They are strangers meeting for the first time.",
"C. They are family members in a tense discussion.",
"D. They are teacher and student in a lecture."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements best describes the relationship between the two characters in this scene based on the available information?\nOption:\nA. They are friends having a casual conversation.\nB. They are strangers meeting for the first time.\nC. They are family members in a tense discussion.\nD. They are teacher and student in a lecture.\nAnswer with the option's letter from the given choices directly.",
566,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "189-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 567,
"target": "C",
"doc": {
"video_id": "190",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PSC_HUeqaUk",
"videoID": "PSC_HUeqaUk",
"question_id": "190-1",
"task_type": "Attribute Perception",
"question": "What is the main event taking place?",
"options": [
"A. A theater play.",
"B. A dance recital.",
"C. A live concert.",
"D. A film screening."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main event taking place?\nOption:\nA. A theater play.\nB. A dance recital.\nC. A live concert.\nD. A film screening.\nAnswer with the option's letter from the given choices directly.",
567,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "190-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 568,
"target": "D",
"doc": {
"video_id": "190",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PSC_HUeqaUk",
"videoID": "PSC_HUeqaUk",
"question_id": "190-2",
"task_type": "Object Recognition",
"question": "Which of the following is visible on the background screen?",
"options": [
"A. A tropical beach.",
"B. A plane flying in the sky.",
"C. A city skyline.",
"D. A boat on the water."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is visible on the background screen?\nOption:\nA. A tropical beach.\nB. A plane flying in the sky.\nC. A city skyline.\nD. A boat on the water.\nAnswer with the option's letter from the given choices directly.",
568,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "190-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 569,
"target": "A",
"doc": {
"video_id": "190",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PSC_HUeqaUk",
"videoID": "PSC_HUeqaUk",
"question_id": "190-3",
"task_type": "Attribute Perception",
"question": "Based on the audience's reaction, what is the most likely atmosphere in the venue?",
"options": [
"A. Excited and energetic.",
"B. Solemn and quiet.",
"C. Confused and curious.",
"D. Bored and uninterested."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the audience's reaction, what is the most likely atmosphere in the venue?\nOption:\nA. Excited and energetic.\nB. Solemn and quiet.\nC. Confused and curious.\nD. Bored and uninterested.\nAnswer with the option's letter from the given choices directly.",
569,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "190-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 570,
"target": "B",
"doc": {
"video_id": "191",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=vQTQJ1Mlzqo",
"videoID": "vQTQJ1Mlzqo",
"question_id": "191-1",
"task_type": "Counting Problem",
"question": "How many tricks are performed in this video?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 7."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many tricks are performed in this video?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
570,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "191-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 571,
"target": "C",
"doc": {
"video_id": "191",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=vQTQJ1Mlzqo",
"videoID": "vQTQJ1Mlzqo",
"question_id": "191-2",
"task_type": "Object Recognition",
"question": "What finger is wrapped with the third rubber band trick in this video?",
"options": [
"A. Ring finger.",
"B. Index finger.",
"C. Middle finger.",
"D. Little finger."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What finger is wrapped with the third rubber band trick in this video?\nOption:\nA. Ring finger.\nB. Index finger.\nC. Middle finger.\nD. Little finger.\nAnswer with the option's letter from the given choices directly.",
571,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "191-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 572,
"target": "A",
"doc": {
"video_id": "191",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=vQTQJ1Mlzqo",
"videoID": "vQTQJ1Mlzqo",
"question_id": "191-3",
"task_type": "Object Recognition",
"question": "Which sentence best describes the first rubber band trick shown in the video?",
"options": [
"A. A rubber band mysteriously jumps from being wrapped around the wrist to a smart phone.",
"B. A rubber band jumps from being wrapped around the pinkie and ring fingers to the first and middle fingers of the same hand and then back again.",
"C. A rubber band seems to switch places between the fingers, and the second rubber band appears to block it.",
"D. A thumb gets rid of being wrapped by a rubber band."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which sentence best describes the first rubber band trick shown in the video?\nOption:\nA. A rubber band mysteriously jumps from being wrapped around the wrist to a smart phone.\nB. A rubber band jumps from being wrapped around the pinkie and ring fingers to the first and middle fingers of the same hand and then back again.\nC. A rubber band seems to switch places between the fingers, and the second rubber band appears to block it.\nD. A thumb gets rid of being wrapped by a rubber band.\nAnswer with the option's letter from the given choices directly.",
572,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "191-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 573,
"target": "A",
"doc": {
"video_id": "192",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=gTBRQaBRiSs",
"videoID": "gTBRQaBRiSs",
"question_id": "192-1",
"task_type": "Counting Problem",
"question": "As can be seen in the video, how many judges are watching the show?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 7."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, how many judges are watching the show?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
573,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "192-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 574,
"target": "D",
"doc": {
"video_id": "192",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=gTBRQaBRiSs",
"videoID": "gTBRQaBRiSs",
"question_id": "192-2",
"task_type": "Action Reasoning",
"question": "According the video, what makes the audience shocked and rise to applaud?",
"options": [
"A. The performer sings a beautiful opera aria in multiple languages.",
"B. The performer showcases a mesmerizing display of fire-breathing skills.",
"C. The performer does a series of impressive acrobatic flips and somersaults.",
"D. The performer levitates with only a stick supporting him."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According the video, what makes the audience shocked and rise to applaud?\nOption:\nA. The performer sings a beautiful opera aria in multiple languages.\nB. The performer showcases a mesmerizing display of fire-breathing skills.\nC. The performer does a series of impressive acrobatic flips and somersaults.\nD. The performer levitates with only a stick supporting him.\nAnswer with the option's letter from the given choices directly.",
574,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "192-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 575,
"target": "B",
"doc": {
"video_id": "192",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=gTBRQaBRiSs",
"videoID": "gTBRQaBRiSs",
"question_id": "192-3",
"task_type": "Object Recognition",
"question": "Which regional clothes does the performer wear in the video?",
"options": [
"A. Chinese clothes.",
"B. Indian clothes.",
"C. Egyptian clothes.",
"D. Mexican clothes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which regional clothes does the performer wear in the video?\nOption:\nA. Chinese clothes.\nB. Indian clothes.\nC. Egyptian clothes.\nD. Mexican clothes.\nAnswer with the option's letter from the given choices directly.",
575,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "192-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 576,
"target": "D",
"doc": {
"video_id": "193",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=jznsxDcKSnE",
"videoID": "jznsxDcKSnE",
"question_id": "193-1",
"task_type": "Object Recognition",
"question": "What clothes does the performer wear in the video?",
"options": [
"A. Black shirts and black pants.",
"B. Pink shirts and pink pants.",
"C. Black shirts and pink pants.",
"D. Pink shirts and black pants."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What clothes does the performer wear in the video?\nOption:\nA. Black shirts and black pants.\nB. Pink shirts and pink pants.\nC. Black shirts and pink pants.\nD. Pink shirts and black pants.\nAnswer with the option's letter from the given choices directly.",
576,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "193-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 577,
"target": "B",
"doc": {
"video_id": "193",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=jznsxDcKSnE",
"videoID": "jznsxDcKSnE",
"question_id": "193-2",
"task_type": "Temporal Perception",
"question": "Based on the video, when are the lights turned on during the performance?",
"options": [
"A. When the woman walks around the lightning ball.",
"B. When the woman starts fanning to make snow.",
"C. When the woman puts a pink ball into her mouth.",
"D. When the woman bows to the audience."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, when are the lights turned on during the performance?\nOption:\nA. When the woman walks around the lightning ball.\nB. When the woman starts fanning to make snow.\nC. When the woman puts a pink ball into her mouth.\nD. When the woman bows to the audience.\nAnswer with the option's letter from the given choices directly.",
577,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "193-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 578,
"target": "C",
"doc": {
"video_id": "193",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=jznsxDcKSnE",
"videoID": "jznsxDcKSnE",
"question_id": "193-3",
"task_type": "Action Recognition",
"question": "What does she do after picking a pink ball from the thunder in the video?",
"options": [
"A. She throws it away.",
"B. She puts it into her pocket.",
"C. She puts it in her ear but fetches it from her mouth.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does she do after picking a pink ball from the thunder in the video?\nOption:\nA. She throws it away.\nB. She puts it into her pocket.\nC. She puts it in her ear but fetches it from her mouth.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
578,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "193-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 579,
"target": "C",
"doc": {
"video_id": "194",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=fRf2aYYPrkc",
"videoID": "fRf2aYYPrkc",
"question_id": "194-1",
"task_type": "Object Reasoning",
"question": "Based on the video, what most likely roles do the man and woman have alongside the magician?",
"options": [
"A. Audience.",
"B. Performers.",
"C. Volunteers.",
"D. Magician assistants."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what most likely roles do the man and woman have alongside the magician?\nOption:\nA. Audience.\nB. Performers.\nC. Volunteers.\nD. Magician assistants.\nAnswer with the option's letter from the given choices directly.",
579,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "194-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 580,
"target": "A",
"doc": {
"video_id": "194",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=fRf2aYYPrkc",
"videoID": "fRf2aYYPrkc",
"question_id": "194-2",
"task_type": "Action Recognition",
"question": "As depicted in the video, what occurs when the magician taps only the man on the shoulder?",
"options": [
"A. The woman also feels his touch.",
"B. The man does not feel his touch.",
"C. The man also touches the woman.",
"D. The woman is asked to touch the man."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what occurs when the magician taps only the man on the shoulder?\nOption:\nA. The woman also feels his touch.\nB. The man does not feel his touch.\nC. The man also touches the woman.\nD. The woman is asked to touch the man.\nAnswer with the option's letter from the given choices directly.",
580,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "194-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 581,
"target": "B",
"doc": {
"video_id": "194",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=fRf2aYYPrkc",
"videoID": "fRf2aYYPrkc",
"question_id": "194-3",
"task_type": "Action Recognition",
"question": "What does the magician not do according to the video?",
"options": [
"A. He taps the man on the shoulder.",
"B. He puts a coin in the man's mouth.",
"C. He rubs the woman's nose using a playing card.",
"D. He asks the woman to close her eyes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the magician not do according to the video?\nOption:\nA. He taps the man on the shoulder.\nB. He puts a coin in the man's mouth.\nC. He rubs the woman's nose using a playing card.\nD. He asks the woman to close her eyes.\nAnswer with the option's letter from the given choices directly.",
581,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "194-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 582,
"target": "B",
"doc": {
"video_id": "195",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=QGHVpr8FIN8",
"videoID": "QGHVpr8FIN8",
"question_id": "195-1",
"task_type": "Object Reasoning",
"question": "What happens to the balloon in the video?",
"options": [
"A. The balloon becomes colorful.",
"B. The balloon is popped and turns into a pigeon.",
"C. The balloon is taken away by a stage crew.",
"D. The balloon is released and travels into the air."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens to the balloon in the video?\nOption:\nA. The balloon becomes colorful.\nB. The balloon is popped and turns into a pigeon.\nC. The balloon is taken away by a stage crew.\nD. The balloon is released and travels into the air.\nAnswer with the option's letter from the given choices directly.",
582,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "195-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 583,
"target": "C",
"doc": {
"video_id": "195",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=QGHVpr8FIN8",
"videoID": "QGHVpr8FIN8",
"question_id": "195-2",
"task_type": "Object Recognition",
"question": "What does the performer have in his hand at the beginning of this video?",
"options": [
"A. A balloon.",
"B. A fan.",
"C. An umbrella.",
"D. A ribbon."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the performer have in his hand at the beginning of this video?\nOption:\nA. A balloon.\nB. A fan.\nC. An umbrella.\nD. A ribbon.\nAnswer with the option's letter from the given choices directly.",
583,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "195-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 584,
"target": "A",
"doc": {
"video_id": "195",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=QGHVpr8FIN8",
"videoID": "QGHVpr8FIN8",
"question_id": "195-3",
"task_type": "Attribute Perception",
"question": "What are the colors of the fan?",
"options": [
"A. Orange, white and green.",
"B. Pink, white and green.",
"C. White, pink and green.",
"D. Green, orange and pink."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the colors of the fan?\nOption:\nA. Orange, white and green.\nB. Pink, white and green.\nC. White, pink and green.\nD. Green, orange and pink.\nAnswer with the option's letter from the given choices directly.",
584,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "195-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 585,
"target": "D",
"doc": {
"video_id": "196",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=lNKtsi2Cu0E",
"videoID": "lNKtsi2Cu0E",
"question_id": "196-1",
"task_type": "Object Recognition",
"question": "What are the tools the man in this video used to perform magic?",
"options": [
"A. Dices.",
"B. Rubber bands.",
"C. Coins.",
"D. Cards."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the tools the man in this video used to perform magic?\nOption:\nA. Dices.\nB. Rubber bands.\nC. Coins.\nD. Cards.\nAnswer with the option's letter from the given choices directly.",
585,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "196-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 586,
"target": "D",
"doc": {
"video_id": "196",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=lNKtsi2Cu0E",
"videoID": "lNKtsi2Cu0E",
"question_id": "196-2",
"task_type": "Attribute Perception",
"question": "What is the color of the performer's hair in this video?",
"options": [
"A. Black.",
"B. White.",
"C. Grey.",
"D. None of the above because he does not have hair."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the performer's hair in this video?\nOption:\nA. Black.\nB. White.\nC. Grey.\nD. None of the above because he does not have hair.\nAnswer with the option's letter from the given choices directly.",
586,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "196-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 587,
"target": "C",
"doc": {
"video_id": "196",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=lNKtsi2Cu0E",
"videoID": "lNKtsi2Cu0E",
"question_id": "196-3",
"task_type": "Action Recognition",
"question": "As can be seen in the video, what does the performer do when he leaves the stage?",
"options": [
"A. He does a forward roll.",
"B. He claps his hands.",
"C. He sucks on a card.",
"D. He bows to the audience."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, what does the performer do when he leaves the stage?\nOption:\nA. He does a forward roll.\nB. He claps his hands.\nC. He sucks on a card.\nD. He bows to the audience.\nAnswer with the option's letter from the given choices directly.",
587,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "196-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 588,
"target": "A",
"doc": {
"video_id": "197",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=1q-5IIyZL20",
"videoID": "1q-5IIyZL20",
"question_id": "197-1",
"task_type": "Object Recognition",
"question": "According the video, what is the appearance of the clothes worn by the man holding a dog?",
"options": [
"A. A lovable magic dragon.",
"B. A tree with lots of leaves.",
"C. An elderly yellow cat.",
"D. A sweet candy."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According the video, what is the appearance of the clothes worn by the man holding a dog?\nOption:\nA. A lovable magic dragon.\nB. A tree with lots of leaves.\nC. An elderly yellow cat.\nD. A sweet candy.\nAnswer with the option's letter from the given choices directly.",
588,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "197-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 589,
"target": "D",
"doc": {
"video_id": "197",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=1q-5IIyZL20",
"videoID": "1q-5IIyZL20",
"question_id": "197-2",
"task_type": "Object Reasoning",
"question": "What is the function of the black plastic bag shown in the video?",
"options": [
"A. To store the dog's treats for later.",
"B. To carry the dog's toys during walks.",
"C. To collect water for the dog to drink.",
"D. To catch the dog's poo."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the black plastic bag shown in the video?\nOption:\nA. To store the dog's treats for later.\nB. To carry the dog's toys during walks.\nC. To collect water for the dog to drink.\nD. To catch the dog's poo.\nAnswer with the option's letter from the given choices directly.",
589,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "197-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 590,
"target": "A",
"doc": {
"video_id": "197",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=1q-5IIyZL20",
"videoID": "1q-5IIyZL20",
"question_id": "197-3",
"task_type": "Object Recognition",
"question": "Which of the following items is not placed on the judge's table who is speaking at the end of this video?",
"options": [
"A. A stuffed toy of a dragon.",
"B. A photo of a white dog.",
"C. Two mugs.",
"D. A microphone."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following items is not placed on the judge's table who is speaking at the end of this video?\nOption:\nA. A stuffed toy of a dragon.\nB. A photo of a white dog.\nC. Two mugs.\nD. A microphone.\nAnswer with the option's letter from the given choices directly.",
590,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "197-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 591,
"target": "C",
"doc": {
"video_id": "198",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=V_skpmEXebM",
"videoID": "V_skpmEXebM",
"question_id": "198-1",
"task_type": "Action Recognition",
"question": "What is the magic trick in this video?",
"options": [
"A. The performer turns 100 dollars to 50.",
"B. The performer makes 50 dollars disappear.",
"C. The performer turns 50 dollars to 100.",
"D. The performer creates 50 dollars."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the magic trick in this video?\nOption:\nA. The performer turns 100 dollars to 50.\nB. The performer makes 50 dollars disappear.\nC. The performer turns 50 dollars to 100.\nD. The performer creates 50 dollars.\nAnswer with the option's letter from the given choices directly.",
591,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "198-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 592,
"target": "A",
"doc": {
"video_id": "198",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=V_skpmEXebM",
"videoID": "V_skpmEXebM",
"question_id": "198-2",
"task_type": "Object Reasoning",
"question": "Which statement accurately describes a similarity between the two men in this video?",
"options": [
"A. They both wear sunglasses.",
"B. They both wear black shorts.",
"C. They are both in black.",
"D. They both wear beards."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement accurately describes a similarity between the two men in this video?\nOption:\nA. They both wear sunglasses.\nB. They both wear black shorts.\nC. They are both in black.\nD. They both wear beards.\nAnswer with the option's letter from the given choices directly.",
592,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "198-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 593,
"target": "D",
"doc": {
"video_id": "198",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=V_skpmEXebM",
"videoID": "V_skpmEXebM",
"question_id": "198-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. The man in black performs a mind-reading act with a volunteer from the audience.",
"B. The man in black showcases his skills in playing the guitar to a live audience.",
"C. The man in black teaches a group of students how to solve complex math problems.",
"D. The man in black performs a money magic to another man in blue in public."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. The man in black performs a mind-reading act with a volunteer from the audience.\nB. The man in black showcases his skills in playing the guitar to a live audience.\nC. The man in black teaches a group of students how to solve complex math problems.\nD. The man in black performs a money magic to another man in blue in public.\nAnswer with the option's letter from the given choices directly.",
593,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "198-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 594,
"target": "B",
"doc": {
"video_id": "199",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=sxsjigV9S-Y",
"videoID": "sxsjigV9S-Y",
"question_id": "199-1",
"task_type": "Object Reasoning",
"question": "Why do the men on the stage look similar to each other?",
"options": [
"A. Because they made up to look similar.",
"B. Because they are twins.",
"C. Because a man is looking at a mirror.",
"D. None of the above is correct."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do the men on the stage look similar to each other?\nOption:\nA. Because they made up to look similar.\nB. Because they are twins.\nC. Because a man is looking at a mirror.\nD. None of the above is correct.\nAnswer with the option's letter from the given choices directly.",
594,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "199-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 595,
"target": "C",
"doc": {
"video_id": "199",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=sxsjigV9S-Y",
"videoID": "sxsjigV9S-Y",
"question_id": "199-2",
"task_type": "Action Recognition",
"question": "What do the two men first perform on the stage?",
"options": [
"A. They draw curves on the screen.",
"B. They hug each other.",
"C. They run out of the screen.",
"D. They take a ladder from a bag."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two men first perform on the stage?\nOption:\nA. They draw curves on the screen.\nB. They hug each other.\nC. They run out of the screen.\nD. They take a ladder from a bag.\nAnswer with the option's letter from the given choices directly.",
595,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "199-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 596,
"target": "A",
"doc": {
"video_id": "199",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=sxsjigV9S-Y",
"videoID": "sxsjigV9S-Y",
"question_id": "199-3",
"task_type": "Action Recognition",
"question": "What happened to the curves they draw on the screen?",
"options": [
"A. They merged together and form a woman outline.",
"B. They dissipated and disappeared.",
"C. They merged together and disappeared.",
"D. None of the above is correct."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the curves they draw on the screen?\nOption:\nA. They merged together and form a woman outline.\nB. They dissipated and disappeared.\nC. They merged together and disappeared.\nD. None of the above is correct.\nAnswer with the option's letter from the given choices directly.",
596,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "199-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 597,
"target": "B",
"doc": {
"video_id": "200",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=LeBG88YT3rU",
"videoID": "LeBG88YT3rU",
"question_id": "200-1",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the subject matter of the video?",
"options": [
"A. A man is giving a virtual tour of a museum to a woman over FaceTime.",
"B. A man is performing an amazing card trick for a woman over FaceTime.",
"C. A woman is teaching a man how to bake a cake over FaceTime.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the subject matter of the video?\nOption:\nA. A man is giving a virtual tour of a museum to a woman over FaceTime.\nB. A man is performing an amazing card trick for a woman over FaceTime.\nC. A woman is teaching a man how to bake a cake over FaceTime.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
597,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "200-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 598,
"target": "D",
"doc": {
"video_id": "200",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=LeBG88YT3rU",
"videoID": "LeBG88YT3rU",
"question_id": "200-2",
"task_type": "Object Reasoning",
"question": "How did the woman respond to the performance in the video?",
"options": [
"A. She burst into laughters and could not stop.",
"B. She angrily threw the cards at the man.",
"C. She fainted from shock and had to be revived.",
"D. She was so surprised that she fell onto the sofa."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the woman respond to the performance in the video?\nOption:\nA. She burst into laughters and could not stop.\nB. She angrily threw the cards at the man.\nC. She fainted from shock and had to be revived.\nD. She was so surprised that she fell onto the sofa.\nAnswer with the option's letter from the given choices directly.",
598,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "200-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 599,
"target": "C",
"doc": {
"video_id": "200",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=LeBG88YT3rU",
"videoID": "LeBG88YT3rU",
"question_id": "200-3",
"task_type": "Object Recognition",
"question": "What is the background behind the woman in FaceTime?",
"options": [
"A. A bustling city street with skyscrapers and traffic.",
"B. A serene mountain landscape with snow-capped peaks.",
"C. A lot of windows and two sofa pillows.",
"D. A tropical beach with palm trees and clear blue water."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the background behind the woman in FaceTime?\nOption:\nA. A bustling city street with skyscrapers and traffic.\nB. A serene mountain landscape with snow-capped peaks.\nC. A lot of windows and two sofa pillows.\nD. A tropical beach with palm trees and clear blue water.\nAnswer with the option's letter from the given choices directly.",
599,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "200-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 600,
"target": "A",
"doc": {
"video_id": "201",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=k74LDvXSnHM",
"videoID": "k74LDvXSnHM",
"question_id": "201-1",
"task_type": "Spatial Reasoning",
"question": "Based on the beginning of the video, during which holiday was the video most likely recorded?",
"options": [
"A. Halloween.",
"B. Christmas.",
"C. Chinese New Year.",
"D. Thanksgiving."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the beginning of the video, during which holiday was the video most likely recorded?\nOption:\nA. Halloween.\nB. Christmas.\nC. Chinese New Year.\nD. Thanksgiving.\nAnswer with the option's letter from the given choices directly.",
600,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "201-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 601,
"target": "C",
"doc": {
"video_id": "201",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=k74LDvXSnHM",
"videoID": "k74LDvXSnHM",
"question_id": "201-2",
"task_type": "Action Reasoning",
"question": "What caused the sudden fall of the woman in the video?",
"options": [
"A. She twisted her ankle.",
"B. She was tripped.",
"C. She was scared by a prank.",
"D. Cannot be determined."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What caused the sudden fall of the woman in the video?\nOption:\nA. She twisted her ankle.\nB. She was tripped.\nC. She was scared by a prank.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
601,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "201-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 602,
"target": "A",
"doc": {
"video_id": "201",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=k74LDvXSnHM",
"videoID": "k74LDvXSnHM",
"question_id": "201-3",
"task_type": "Counting Problem",
"question": "According to the video, how many individuals are in the bathroom?",
"options": [
"A. 2.",
"B. 5.",
"C. 4.",
"D. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many individuals are in the bathroom?\nOption:\nA. 2.\nB. 5.\nC. 4.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
602,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "201-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 603,
"target": "B",
"doc": {
"video_id": "202",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=QtKT3q7xB4M",
"videoID": "QtKT3q7xB4M",
"question_id": "202-1",
"task_type": "OCR Problems",
"question": "What was the final score of the man in the video?",
"options": [
"A. 140.",
"B. 160.",
"C. 200.",
"D. 180."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the final score of the man in the video?\nOption:\nA. 140.\nB. 160.\nC. 200.\nD. 180.\nAnswer with the option's letter from the given choices directly.",
603,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "202-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 604,
"target": "C",
"doc": {
"video_id": "202",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=QtKT3q7xB4M",
"videoID": "QtKT3q7xB4M",
"question_id": "202-2",
"task_type": "Attribute Perception",
"question": "What color shoes is the man wearing in the video who is playing the obstacle course game?",
"options": [
"A. Blue.",
"B. Yellow.",
"C. Red.",
"D. Green."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color shoes is the man wearing in the video who is playing the obstacle course game?\nOption:\nA. Blue.\nB. Yellow.\nC. Red.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
604,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "202-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 605,
"target": "B",
"doc": {
"video_id": "202",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=QtKT3q7xB4M",
"videoID": "QtKT3q7xB4M",
"question_id": "202-3",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the topic of the video?",
"options": [
"A. A fight.",
"B. An obstacle course game.",
"C. A movie.",
"D. A war."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the topic of the video?\nOption:\nA. A fight.\nB. An obstacle course game.\nC. A movie.\nD. A war.\nAnswer with the option's letter from the given choices directly.",
605,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "202-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 606,
"target": "D",
"doc": {
"video_id": "203",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=mD_7mhnPS_Y",
"videoID": "mD_7mhnPS_Y",
"question_id": "203-1",
"task_type": "Counting Problem",
"question": "In the opening scene of the video, how many stars are present on the trophy situated on the left hand side of the screen?",
"options": [
"A. 6.",
"B. 8.",
"C. 9.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the opening scene of the video, how many stars are present on the trophy situated on the left hand side of the screen?\nOption:\nA. 6.\nB. 8.\nC. 9.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
606,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "203-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 607,
"target": "D",
"doc": {
"video_id": "203",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=mD_7mhnPS_Y",
"videoID": "mD_7mhnPS_Y",
"question_id": "203-2",
"task_type": "Action Reasoning",
"question": "Why were the two people so happy in the video?",
"options": [
"A. Because they practiced for eight hours a day.",
"B. Because they came to the United States.",
"C. Because they met their idols.",
"D. Because they won the championship of America's Got Talent."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why were the two people so happy in the video?\nOption:\nA. Because they practiced for eight hours a day.\nB. Because they came to the United States.\nC. Because they met their idols.\nD. Because they won the championship of America's Got Talent.\nAnswer with the option's letter from the given choices directly.",
607,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "203-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 608,
"target": "D",
"doc": {
"video_id": "203",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=mD_7mhnPS_Y",
"videoID": "mD_7mhnPS_Y",
"question_id": "203-3",
"task_type": "Attribute Perception",
"question": "Which hairstyle is being worn by the host on stage in the video?",
"options": [
"A. White short hair.",
"B. Golden long hair.",
"C. Spiky hair.",
"D. Bald."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which hairstyle is being worn by the host on stage in the video?\nOption:\nA. White short hair.\nB. Golden long hair.\nC. Spiky hair.\nD. Bald.\nAnswer with the option's letter from the given choices directly.",
608,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "203-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 609,
"target": "C",
"doc": {
"video_id": "204",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=pcm_E7UoKLo",
"videoID": "pcm_E7UoKLo",
"question_id": "204-1",
"task_type": "Attribute Perception",
"question": "According to the video, what country does the shortest woman in the world come from?",
"options": [
"A. Nepal.",
"B. China.",
"C. India.",
"D. United States."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what country does the shortest woman in the world come from?\nOption:\nA. Nepal.\nB. China.\nC. India.\nD. United States.\nAnswer with the option's letter from the given choices directly.",
609,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "204-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 610,
"target": "D",
"doc": {
"video_id": "204",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=pcm_E7UoKLo",
"videoID": "pcm_E7UoKLo",
"question_id": "204-2",
"task_type": "Attribute Perception",
"question": "According to the opening scene of the video, what color tie is the man wearing with his suit?",
"options": [
"A. Black.",
"B. Red.",
"C. White.",
"D. Yellow."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the opening scene of the video, what color tie is the man wearing with his suit?\nOption:\nA. Black.\nB. Red.\nC. White.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
610,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "204-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 611,
"target": "C",
"doc": {
"video_id": "204",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=pcm_E7UoKLo",
"videoID": "pcm_E7UoKLo",
"question_id": "204-3",
"task_type": "Attribute Perception",
"question": "Based on the video, which statement is true regarding the height comparison of the world's shortest man and shortest woman?",
"options": [
"A. They are of the same height.",
"B. It is unclear from the video.",
"C. The world's shortest woman is taller than the world's shortest man.",
"D. The world's shortest man is taller than the world's shortest woman."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which statement is true regarding the height comparison of the world's shortest man and shortest woman?\nOption:\nA. They are of the same height.\nB. It is unclear from the video.\nC. The world's shortest woman is taller than the world's shortest man.\nD. The world's shortest man is taller than the world's shortest woman.\nAnswer with the option's letter from the given choices directly.",
611,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "204-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 612,
"target": "A",
"doc": {
"video_id": "205",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=EVnzXA9b7Ww",
"videoID": "EVnzXA9b7Ww",
"question_id": "205-1",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. A dog broke the Guinness World Record for skateboarding.",
"B. A wedding.",
"C. A trip.",
"D. People are having parties on the lawn."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. A dog broke the Guinness World Record for skateboarding.\nB. A wedding.\nC. A trip.\nD. People are having parties on the lawn.\nAnswer with the option's letter from the given choices directly.",
612,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "205-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 613,
"target": "A",
"doc": {
"video_id": "205",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=EVnzXA9b7Ww",
"videoID": "EVnzXA9b7Ww",
"question_id": "205-2",
"task_type": "Attribute Perception",
"question": "What clothing is the man wearing while holding the dog in the second half of the video?",
"options": [
"A. Blue and black T-shirt.",
"B. Red T-shirt.",
"C. Black leather jacket.",
"D. Black suit."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What clothing is the man wearing while holding the dog in the second half of the video?\nOption:\nA. Blue and black T-shirt.\nB. Red T-shirt.\nC. Black leather jacket.\nD. Black suit.\nAnswer with the option's letter from the given choices directly.",
613,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "205-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 614,
"target": "B",
"doc": {
"video_id": "205",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=EVnzXA9b7Ww",
"videoID": "EVnzXA9b7Ww",
"question_id": "205-3",
"task_type": "Attribute Perception",
"question": "What color are the wheels of the skateboard used by the dog in the video?",
"options": [
"A. Black.",
"B. Orange.",
"C. White.",
"D. Green."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the wheels of the skateboard used by the dog in the video?\nOption:\nA. Black.\nB. Orange.\nC. White.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
614,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "205-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 615,
"target": "A",
"doc": {
"video_id": "206",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=4H8hcvNeWtg",
"videoID": "4H8hcvNeWtg",
"question_id": "206-1",
"task_type": "Counting Problem",
"question": "What is the total number of people in the video?",
"options": [
"A. 7.",
"B. 6.",
"C. 5.",
"D. 8."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the total number of people in the video?\nOption:\nA. 7.\nB. 6.\nC. 5.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
615,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "206-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 616,
"target": "B",
"doc": {
"video_id": "206",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=4H8hcvNeWtg",
"videoID": "4H8hcvNeWtg",
"question_id": "206-2",
"task_type": "Object Reasoning",
"question": "Did they successfully play the game of telephone in the video?",
"options": [
"A. Yes..",
"B. No.",
"C. They did not play this game in the video.",
"D. It is not possible to determine from the video if they successfully played the game of telephone."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Did they successfully play the game of telephone in the video?\nOption:\nA. Yes..\nB. No.\nC. They did not play this game in the video.\nD. It is not possible to determine from the video if they successfully played the game of telephone.\nAnswer with the option's letter from the given choices directly.",
616,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "206-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 617,
"target": "A",
"doc": {
"video_id": "206",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=4H8hcvNeWtg",
"videoID": "4H8hcvNeWtg",
"question_id": "206-3",
"task_type": "Counting Problem",
"question": "How many people are wearing ties in the video?",
"options": [
"A. 4.",
"B. 5.",
"C. 3.",
"D. 2."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are wearing ties in the video?\nOption:\nA. 4.\nB. 5.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
617,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "206-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 618,
"target": "D",
"doc": {
"video_id": "207",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=RqA3LAo75kw",
"videoID": "RqA3LAo75kw",
"question_id": "207-1",
"task_type": "Action Recognition",
"question": "What game are they playing in the video?",
"options": [
"A. Dancing game.",
"B. Video game.",
"C. Singing game.",
"D. Drawing game."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What game are they playing in the video?\nOption:\nA. Dancing game.\nB. Video game.\nC. Singing game.\nD. Drawing game.\nAnswer with the option's letter from the given choices directly.",
618,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "207-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 619,
"target": "D",
"doc": {
"video_id": "207",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=RqA3LAo75kw",
"videoID": "RqA3LAo75kw",
"question_id": "207-2",
"task_type": "Counting Problem",
"question": "How many people were shown in the video drawing on the stage?",
"options": [
"A. 6.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people were shown in the video drawing on the stage?\nOption:\nA. 6.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
619,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "207-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 620,
"target": "C",
"doc": {
"video_id": "207",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=RqA3LAo75kw",
"videoID": "RqA3LAo75kw",
"question_id": "207-3",
"task_type": "Counting Problem",
"question": "How many individuals depicted in the video are wearing glasses?",
"options": [
"A. 0.",
"B. 2.",
"C. 3.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individuals depicted in the video are wearing glasses?\nOption:\nA. 0.\nB. 2.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
620,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "207-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 621,
"target": "B",
"doc": {
"video_id": "208",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=44ivpEIcBhE",
"videoID": "44ivpEIcBhE",
"question_id": "208-1",
"task_type": "Object Recognition",
"question": "Which instrument is the performer on the stage holding in the video?",
"options": [
"A. Trumpet.",
"B. Saxophone.",
"C. Violin.",
"D. Guitar."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which instrument is the performer on the stage holding in the video?\nOption:\nA. Trumpet.\nB. Saxophone.\nC. Violin.\nD. Guitar.\nAnswer with the option's letter from the given choices directly.",
621,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "208-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 622,
"target": "A",
"doc": {
"video_id": "208",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=44ivpEIcBhE",
"videoID": "44ivpEIcBhE",
"question_id": "208-2",
"task_type": "Object Recognition",
"question": "Who was the first person to press the button and turn the chair around in the video?",
"options": [
"A. The woman in gold clothes.",
"B. Nobody.",
"C. The person performing on stage.",
"D. The man in a blue shirt."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was the first person to press the button and turn the chair around in the video?\nOption:\nA. The woman in gold clothes.\nB. Nobody.\nC. The person performing on stage.\nD. The man in a blue shirt.\nAnswer with the option's letter from the given choices directly.",
622,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "208-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 623,
"target": "C",
"doc": {
"video_id": "208",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=44ivpEIcBhE",
"videoID": "44ivpEIcBhE",
"question_id": "208-3",
"task_type": "Counting Problem",
"question": "Which option correctly indicates the number of microphones on the stage?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which option correctly indicates the number of microphones on the stage?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
623,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "208-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 624,
"target": "B",
"doc": {
"video_id": "209",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=n7q1yYzx2g8",
"videoID": "n7q1yYzx2g8",
"question_id": "209-1",
"task_type": "Counting Problem",
"question": "How many times do the two people in the video give different answers to the question?",
"options": [
"A. 4.",
"B. 3.",
"C. 6.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times do the two people in the video give different answers to the question?\nOption:\nA. 4.\nB. 3.\nC. 6.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
624,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "209-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 625,
"target": "A",
"doc": {
"video_id": "209",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=n7q1yYzx2g8",
"videoID": "n7q1yYzx2g8",
"question_id": "209-2",
"task_type": "Object Recognition",
"question": "Which country's flag is on the ribbon in the video?",
"options": [
"A. United Kingdom.",
"B. Germany.",
"C. China.",
"D. United States."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country's flag is on the ribbon in the video?\nOption:\nA. United Kingdom.\nB. Germany.\nC. China.\nD. United States.\nAnswer with the option's letter from the given choices directly.",
625,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "209-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 626,
"target": "A",
"doc": {
"video_id": "209",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=n7q1yYzx2g8",
"videoID": "n7q1yYzx2g8",
"question_id": "209-3",
"task_type": "Information Synopsis",
"question": "According to the video,who spends more time on their phone?",
"options": [
"A. The man.",
"B. They spend the same amount of time.",
"C. The woman.",
"D. Unable to determine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video,who spends more time on their phone?\nOption:\nA. The man.\nB. They spend the same amount of time.\nC. The woman.\nD. Unable to determine.\nAnswer with the option's letter from the given choices directly.",
626,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "209-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 627,
"target": "B",
"doc": {
"video_id": "210",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=XS6ysDFTbLU",
"videoID": "XS6ysDFTbLU",
"question_id": "210-1",
"task_type": "Counting Problem",
"question": "How many people were challenged by the person in the video?",
"options": [
"A. 1.",
"B. 3.",
"C. 2.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people were challenged by the person in the video?\nOption:\nA. 1.\nB. 3.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
627,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "210-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 628,
"target": "B",
"doc": {
"video_id": "210",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=XS6ysDFTbLU",
"videoID": "XS6ysDFTbLU",
"question_id": "210-2",
"task_type": "Attribute Perception",
"question": "What color is the rope pulled by the person during the ice bucket challenge in the video?",
"options": [
"A. Black.",
"B. Red and white.",
"C. Yellow and green.",
"D. Gray."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the rope pulled by the person during the ice bucket challenge in the video?\nOption:\nA. Black.\nB. Red and white.\nC. Yellow and green.\nD. Gray.\nAnswer with the option's letter from the given choices directly.",
628,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "210-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 629,
"target": "A",
"doc": {
"video_id": "210",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=XS6ysDFTbLU",
"videoID": "XS6ysDFTbLU",
"question_id": "210-3",
"task_type": "Action Reasoning",
"question": "Based on the words spoken by the person in the video, what is the main significance of the activity shown in the video?",
"options": [
"A. To raise awareness for Lou Gehrig's disease.",
"B. To encourage individuals to push their limits.",
"C. To invite others to take part in the challenge.",
"D. Not relevant to physical health."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the words spoken by the person in the video, what is the main significance of the activity shown in the video?\nOption:\nA. To raise awareness for Lou Gehrig's disease.\nB. To encourage individuals to push their limits.\nC. To invite others to take part in the challenge.\nD. Not relevant to physical health.\nAnswer with the option's letter from the given choices directly.",
629,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "210-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 630,
"target": "B",
"doc": {
"video_id": "211",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=lNO6cYxqMZk",
"videoID": "lNO6cYxqMZk",
"question_id": "211-1",
"task_type": "Action Recognition",
"question": "What did the little girl not do after the performance?",
"options": [
"A. She gave five to her mom.",
"B. She took a bow to the audience.",
"C. She jumped up and down.",
"D. She shared an emotional hug with her mom."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the little girl not do after the performance?\nOption:\nA. She gave five to her mom.\nB. She took a bow to the audience.\nC. She jumped up and down.\nD. She shared an emotional hug with her mom.\nAnswer with the option's letter from the given choices directly.",
630,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "211-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 631,
"target": "C",
"doc": {
"video_id": "211",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=lNO6cYxqMZk",
"videoID": "lNO6cYxqMZk",
"question_id": "211-2",
"task_type": "Object Recognition",
"question": "What is the background for the stage when the little girl and her mom jump on ropes in this video?",
"options": [
"A. Colorful backdrops and props.",
"B. A big visually horrible eye.",
"C. A huge full moon with light effects.",
"D. A plastic white ball with abstract designs."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the background for the stage when the little girl and her mom jump on ropes in this video?\nOption:\nA. Colorful backdrops and props.\nB. A big visually horrible eye.\nC. A huge full moon with light effects.\nD. A plastic white ball with abstract designs.\nAnswer with the option's letter from the given choices directly.",
631,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "211-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 632,
"target": "A",
"doc": {
"video_id": "211",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=lNO6cYxqMZk",
"videoID": "lNO6cYxqMZk",
"question_id": "211-3",
"task_type": "Attribute Perception",
"question": "What is the color of the woman's nails?",
"options": [
"A. Pink.",
"B. Blue.",
"C. White.",
"D. Black."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the woman's nails?\nOption:\nA. Pink.\nB. Blue.\nC. White.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
632,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "211-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 633,
"target": "A",
"doc": {
"video_id": "212",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=026dzf-vc5g",
"videoID": "026dzf-vc5g",
"question_id": "212-1",
"task_type": "Action Recognition",
"question": "Which skill is not included in the little girl's performance?",
"options": [
"A. Backflip.",
"B. Cartweel.",
"C. Handstand.",
"D. Full split."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which skill is not included in the little girl's performance?\nOption:\nA. Backflip.\nB. Cartweel.\nC. Handstand.\nD. Full split.\nAnswer with the option's letter from the given choices directly.",
633,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "212-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 634,
"target": "D",
"doc": {
"video_id": "212",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=026dzf-vc5g",
"videoID": "026dzf-vc5g",
"question_id": "212-2",
"task_type": "Object Reasoning",
"question": "According to the little girl's expression, how does she feel after finishing her performance?",
"options": [
"A. Depressed.",
"B. Ashamed.",
"C. Calm.",
"D. Excited."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the little girl's expression, how does she feel after finishing her performance?\nOption:\nA. Depressed.\nB. Ashamed.\nC. Calm.\nD. Excited.\nAnswer with the option's letter from the given choices directly.",
634,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "212-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 635,
"target": "B",
"doc": {
"video_id": "212",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=026dzf-vc5g",
"videoID": "026dzf-vc5g",
"question_id": "212-3",
"task_type": "Attribute Perception",
"question": "What is the color of the little girl wearing shorts?",
"options": [
"A. Pink.",
"B. Blue.",
"C. White.",
"D. Black."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the little girl wearing shorts?\nOption:\nA. Pink.\nB. Blue.\nC. White.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
635,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "212-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 636,
"target": "D",
"doc": {
"video_id": "213",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=KEy2iCzpce4",
"videoID": "KEy2iCzpce4",
"question_id": "213-1",
"task_type": "Action Recognition",
"question": "What is the ending pose in the performance?",
"options": [
"A. The ending pose in her performance is a backflip followed by a mid-air split.",
"B. The ending pose in her performance is a handstand on top of a partner's shoulders.",
"C. The ending pose in her performance is a handstand on top of a balancing ball.",
"D. The ending pose in her performance is balancing on a single leg with two hands holding the other leg."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the ending pose in the performance?\nOption:\nA. The ending pose in her performance is a backflip followed by a mid-air split.\nB. The ending pose in her performance is a handstand on top of a partner's shoulders.\nC. The ending pose in her performance is a handstand on top of a balancing ball.\nD. The ending pose in her performance is balancing on a single leg with two hands holding the other leg.\nAnswer with the option's letter from the given choices directly.",
636,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "213-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 637,
"target": "B",
"doc": {
"video_id": "213",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=KEy2iCzpce4",
"videoID": "KEy2iCzpce4",
"question_id": "213-2",
"task_type": "Object Recognition",
"question": "What can be seen on her chin when she is being interviewd in this video?",
"options": [
"A. A scarf.",
"B. Scars.",
"C. A tattoo of a butterfly.",
"D. A sparkling diamond piercing."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be seen on her chin when she is being interviewd in this video?\nOption:\nA. A scarf.\nB. Scars.\nC. A tattoo of a butterfly.\nD. A sparkling diamond piercing.\nAnswer with the option's letter from the given choices directly.",
637,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "213-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 638,
"target": "C",
"doc": {
"video_id": "213",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=KEy2iCzpce4",
"videoID": "KEy2iCzpce4",
"question_id": "213-3",
"task_type": "Object Recognition",
"question": "What does the girl wear during the performance?",
"options": [
"A. Shirts.",
"B. A necklace.",
"C. Dress.",
"D. A hat."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the girl wear during the performance?\nOption:\nA. Shirts.\nB. A necklace.\nC. Dress.\nD. A hat.\nAnswer with the option's letter from the given choices directly.",
638,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "213-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 639,
"target": "C",
"doc": {
"video_id": "214",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=dH8l--46j6s",
"videoID": "dH8l--46j6s",
"question_id": "214-1",
"task_type": "Action Recognition",
"question": "Which pose appears in this video?",
"options": [
"A. The woman balances on the man's shoulders.",
"B. The man and woman perform a synchronized backflip.",
"C. The man handstands on the woman's stomach.",
"D. The man lifts the woman with one hand."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which pose appears in this video?\nOption:\nA. The woman balances on the man's shoulders.\nB. The man and woman perform a synchronized backflip.\nC. The man handstands on the woman's stomach.\nD. The man lifts the woman with one hand.\nAnswer with the option's letter from the given choices directly.",
639,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "214-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 640,
"target": "A",
"doc": {
"video_id": "214",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=dH8l--46j6s",
"videoID": "dH8l--46j6s",
"question_id": "214-2",
"task_type": "Object Reasoning",
"question": "According to the video, what synchronized action do the two performers engage in simultaneously?",
"options": [
"A. A synchronized cartwheel.",
"B. A backflip with a twist.",
"C. A handstand followed by a somersault.",
"D. A high jump with a full split in mid-air."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what synchronized action do the two performers engage in simultaneously?\nOption:\nA. A synchronized cartwheel.\nB. A backflip with a twist.\nC. A handstand followed by a somersault.\nD. A high jump with a full split in mid-air.\nAnswer with the option's letter from the given choices directly.",
640,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "214-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 641,
"target": "B",
"doc": {
"video_id": "214",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=dH8l--46j6s",
"videoID": "dH8l--46j6s",
"question_id": "214-3",
"task_type": "Object Recognition",
"question": "What does the male performer wear in this video?",
"options": [
"A. Black pants and white shorts.",
"B. Black pants with a naked upper body.",
"C. Black pants with a naked upper body.",
"D. Black pants with a naked upper body."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the male performer wear in this video?\nOption:\nA. Black pants and white shorts.\nB. Black pants with a naked upper body.\nC. Black pants with a naked upper body.\nD. Black pants with a naked upper body.\nAnswer with the option's letter from the given choices directly.",
641,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "214-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 642,
"target": "B",
"doc": {
"video_id": "215",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=VmNBt1tzC6k",
"videoID": "VmNBt1tzC6k",
"question_id": "215-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. This video is mainly about a gymnastics team showcasing their skills in a high-flying trapeze performance.",
"B. This video is mainly about several gymnastics teams participating an acrobatic competition.",
"C. This video is mainly about a gymnastics team training for an upcoming synchronized swimming competition.",
"D. This video is mainly about a group of gymnasts attempting to break the world record for the longest handstand."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. This video is mainly about a gymnastics team showcasing their skills in a high-flying trapeze performance.\nB. This video is mainly about several gymnastics teams participating an acrobatic competition.\nC. This video is mainly about a gymnastics team training for an upcoming synchronized swimming competition.\nD. This video is mainly about a group of gymnasts attempting to break the world record for the longest handstand.\nAnswer with the option's letter from the given choices directly.",
642,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "215-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 643,
"target": "C",
"doc": {
"video_id": "215",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=VmNBt1tzC6k",
"videoID": "VmNBt1tzC6k",
"question_id": "215-2",
"task_type": "Action Recognition",
"question": "What is the ending pose of the team at the end of this video?",
"options": [
"A. The ending pose of the team is a human pyramid with five people stacked on top of each other.",
"B. The ending pose of the team is a synchronized backflip performed simultaneously by all members.",
"C. The ending pose of the team resembles a blossoming flower, with one person positioned in the center and two others on either side.",
"D. The ending pose of the team is a handstand pyramid with all members balancing on their hands."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the ending pose of the team at the end of this video?\nOption:\nA. The ending pose of the team is a human pyramid with five people stacked on top of each other.\nB. The ending pose of the team is a synchronized backflip performed simultaneously by all members.\nC. The ending pose of the team resembles a blossoming flower, with one person positioned in the center and two others on either side.\nD. The ending pose of the team is a handstand pyramid with all members balancing on their hands.\nAnswer with the option's letter from the given choices directly.",
643,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "215-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 644,
"target": "A",
"doc": {
"video_id": "215",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=VmNBt1tzC6k",
"videoID": "VmNBt1tzC6k",
"question_id": "215-3",
"task_type": "Counting Problem",
"question": "How many individuals are in the team, with each person dressed in yellow?",
"options": [
"A. 4.",
"B. 3.",
"C. 5.",
"D. 6."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individuals are in the team, with each person dressed in yellow?\nOption:\nA. 4.\nB. 3.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
644,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "215-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 645,
"target": "D",
"doc": {
"video_id": "216",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=5Knkqo-lYF0",
"videoID": "5Knkqo-lYF0",
"question_id": "216-1",
"task_type": "Object Recognition",
"question": "What does the girl wear above the head during her performance?",
"options": [
"A. A red hat.",
"B. A yellow headscarf.",
"C. A red headscarf.",
"D. A yellow hat."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the girl wear above the head during her performance?\nOption:\nA. A red hat.\nB. A yellow headscarf.\nC. A red headscarf.\nD. A yellow hat.\nAnswer with the option's letter from the given choices directly.",
645,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "216-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 646,
"target": "B",
"doc": {
"video_id": "216",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=5Knkqo-lYF0",
"videoID": "5Knkqo-lYF0",
"question_id": "216-2",
"task_type": "Attribute Perception",
"question": "Which sentence best describes the girl's dress?",
"options": [
"A. It is over-sized for her.",
"B. It is colorful.",
"C. It is very long.",
"D. None of the above is correct."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which sentence best describes the girl's dress?\nOption:\nA. It is over-sized for her.\nB. It is colorful.\nC. It is very long.\nD. None of the above is correct.\nAnswer with the option's letter from the given choices directly.",
646,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "216-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 647,
"target": "A",
"doc": {
"video_id": "216",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=5Knkqo-lYF0",
"videoID": "5Knkqo-lYF0",
"question_id": "216-3",
"task_type": "Counting Problem",
"question": "How many rolls does the girl do in her performance?",
"options": [
"A. More than 5 but less than or equal to 10.",
"B. Less than or equal to 5.",
"C. More than 10 but less than or equal to 15.",
"D. More than 15."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many rolls does the girl do in her performance?\nOption:\nA. More than 5 but less than or equal to 10.\nB. Less than or equal to 5.\nC. More than 10 but less than or equal to 15.\nD. More than 15.\nAnswer with the option's letter from the given choices directly.",
647,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "216-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 648,
"target": "A",
"doc": {
"video_id": "217",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=BrrDT_uYQsg",
"videoID": "BrrDT_uYQsg",
"question_id": "217-1",
"task_type": "Attribute Perception",
"question": "What is the physique of the athletes?",
"options": [
"A. Three of them have a strong build, while one is thin.",
"B. All of them have a strong build.",
"C. All of them have a thin physique.",
"D. None of the above options is correct."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the physique of the athletes?\nOption:\nA. Three of them have a strong build, while one is thin.\nB. All of them have a strong build.\nC. All of them have a thin physique.\nD. None of the above options is correct.\nAnswer with the option's letter from the given choices directly.",
648,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "217-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 649,
"target": "D",
"doc": {
"video_id": "217",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=BrrDT_uYQsg",
"videoID": "BrrDT_uYQsg",
"question_id": "217-2",
"task_type": "Action Recognition",
"question": "Which sports in this video involve athletes utilizing humans as tools?",
"options": [
"A. Hula hooping.",
"B. Nunchakus.",
"C. Horse-vaulting.",
"D. Rope jumping."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which sports in this video involve athletes utilizing humans as tools?\nOption:\nA. Hula hooping.\nB. Nunchakus.\nC. Horse-vaulting.\nD. Rope jumping.\nAnswer with the option's letter from the given choices directly.",
649,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "217-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 650,
"target": "B",
"doc": {
"video_id": "217",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=BrrDT_uYQsg",
"videoID": "BrrDT_uYQsg",
"question_id": "217-3",
"task_type": "Object Recognition",
"question": "What do the athletes wear during their performance?",
"options": [
"A. Red shorts and red pants.",
"B. Yellow shorts and red pants.",
"C. Yellow shorts and yellow pants.",
"D. Red shorts and yellow pants."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the athletes wear during their performance?\nOption:\nA. Red shorts and red pants.\nB. Yellow shorts and red pants.\nC. Yellow shorts and yellow pants.\nD. Red shorts and yellow pants.\nAnswer with the option's letter from the given choices directly.",
650,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "217-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 651,
"target": "C",
"doc": {
"video_id": "218",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=jmE9y0vv2aM",
"videoID": "jmE9y0vv2aM",
"question_id": "218-1",
"task_type": "Attribute Perception",
"question": "If the man is 180cm tall, what is the estimated diameter of the pilates ball which the man is playing with?",
"options": [
"A. About 200cm.",
"B. About 150cm.",
"C. About 100cm.",
"D. About 50cm."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: If the man is 180cm tall, what is the estimated diameter of the pilates ball which the man is playing with?\nOption:\nA. About 200cm.\nB. About 150cm.\nC. About 100cm.\nD. About 50cm.\nAnswer with the option's letter from the given choices directly.",
651,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "218-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 652,
"target": "C",
"doc": {
"video_id": "218",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=jmE9y0vv2aM",
"videoID": "jmE9y0vv2aM",
"question_id": "218-2",
"task_type": "Counting Problem",
"question": "How many different tricks does the man show inside the house?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different tricks does the man show inside the house?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
652,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "218-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 653,
"target": "D",
"doc": {
"video_id": "218",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=jmE9y0vv2aM",
"videoID": "jmE9y0vv2aM",
"question_id": "218-3",
"task_type": "Object Recognition",
"question": "What does the man wear outside the house?",
"options": [
"A. White shirts and black shorts.",
"B. Black shirts and white shorts.",
"C. White shirts and white shorts.",
"D. Black shirts and black shorts."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the man wear outside the house?\nOption:\nA. White shirts and black shorts.\nB. Black shirts and white shorts.\nC. White shirts and white shorts.\nD. Black shirts and black shorts.\nAnswer with the option's letter from the given choices directly.",
653,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "218-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 654,
"target": "B",
"doc": {
"video_id": "219",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=fSDq8CPXHQM",
"videoID": "fSDq8CPXHQM",
"question_id": "219-1",
"task_type": "Spatial Perception",
"question": "Where do the events in this video take place?",
"options": [
"A. In the gym.",
"B. In the theater stage.",
"C. In the rehearsal room.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where do the events in this video take place?\nOption:\nA. In the gym.\nB. In the theater stage.\nC. In the rehearsal room.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
654,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "219-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 655,
"target": "C",
"doc": {
"video_id": "219",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=fSDq8CPXHQM",
"videoID": "fSDq8CPXHQM",
"question_id": "219-2",
"task_type": "Action Recognition",
"question": "Which skill does not appear in her performance?",
"options": [
"A. Pole skills.",
"B. Loop skills.",
"C. Stilts skills.",
"D. Human pyramid skills."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which skill does not appear in her performance?\nOption:\nA. Pole skills.\nB. Loop skills.\nC. Stilts skills.\nD. Human pyramid skills.\nAnswer with the option's letter from the given choices directly.",
655,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "219-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 656,
"target": "A",
"doc": {
"video_id": "219",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=fSDq8CPXHQM",
"videoID": "fSDq8CPXHQM",
"question_id": "219-3",
"task_type": "Object Recognition",
"question": "What is the stage background where several male performers are holding long sticks?",
"options": [
"A. A sailboat.",
"B. A forest.",
"C. A moon.",
"D. A computer."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the stage background where several male performers are holding long sticks?\nOption:\nA. A sailboat.\nB. A forest.\nC. A moon.\nD. A computer.\nAnswer with the option's letter from the given choices directly.",
656,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "219-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 657,
"target": "D",
"doc": {
"video_id": "220",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=D61jenC5oBA",
"videoID": "D61jenC5oBA",
"question_id": "220-1",
"task_type": "Action Recognition",
"question": "Which skill does not appear in her performance?",
"options": [
"A. A human pyramid with the man standing on the shoulders of the woman.",
"B. A full split by the woman.",
"C. A forward roll by the woman.",
"D. A backflip by the man."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which skill does not appear in her performance?\nOption:\nA. A human pyramid with the man standing on the shoulders of the woman.\nB. A full split by the woman.\nC. A forward roll by the woman.\nD. A backflip by the man.\nAnswer with the option's letter from the given choices directly.",
657,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "220-1",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 658,
"target": "B",
"doc": {
"video_id": "220",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=D61jenC5oBA",
"videoID": "D61jenC5oBA",
"question_id": "220-2",
"task_type": "Attribute Perception",
"question": "What color do the clothes of the two players have in common?",
"options": [
"A. Grey.",
"B. Yellow.",
"C. Black.",
"D. Purple."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color do the clothes of the two players have in common?\nOption:\nA. Grey.\nB. Yellow.\nC. Black.\nD. Purple.\nAnswer with the option's letter from the given choices directly.",
658,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "220-2",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 659,
"target": "C",
"doc": {
"video_id": "220",
"duration": "short",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=D61jenC5oBA",
"videoID": "D61jenC5oBA",
"question_id": "220-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. A male and a female player are competing in a synchronized swimming routine.",
"B. A male and a female player are collaborating on a painting in an art studio.",
"C. A male and a female player are practicing partner acrobatics.",
"D. A male and a female player are rehearsing a romantic dance routine."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. A male and a female player are competing in a synchronized swimming routine.\nB. A male and a female player are collaborating on a painting in an art studio.\nC. A male and a female player are practicing partner acrobatics.\nD. A male and a female player are rehearsing a romantic dance routine.\nAnswer with the option's letter from the given choices directly.",
659,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "220-3",
"duration": "short",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 660,
"target": "B",
"doc": {
"video_id": "221",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=c9oXTCVcTSE",
"videoID": "c9oXTCVcTSE",
"question_id": "221-1",
"task_type": "Object Recognition",
"question": "As depicted in the video, which tool is not necessary to make a rubber band car?",
"options": [
"A. Straw.",
"B. Pencil.",
"C. Scissors.",
"D. Hammer."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which tool is not necessary to make a rubber band car?\nOption:\nA. Straw.\nB. Pencil.\nC. Scissors.\nD. Hammer.\nAnswer with the option's letter from the given choices directly.",
660,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "221-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 661,
"target": "C",
"doc": {
"video_id": "221",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=c9oXTCVcTSE",
"videoID": "c9oXTCVcTSE",
"question_id": "221-2",
"task_type": "Attribute Perception",
"question": "What is the color of the bottle cap?",
"options": [
"A. Blue.",
"B. Green.",
"C. Orange.",
"D. Red."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the bottle cap?\nOption:\nA. Blue.\nB. Green.\nC. Orange.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
661,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "221-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 662,
"target": "A",
"doc": {
"video_id": "221",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=c9oXTCVcTSE",
"videoID": "c9oXTCVcTSE",
"question_id": "221-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. Making a toy car with everyday objects.",
"B. Handcrafting a plastic airplane with pretty cool science.",
"C. Doing fun science experiment using cola.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Making a toy car with everyday objects.\nB. Handcrafting a plastic airplane with pretty cool science.\nC. Doing fun science experiment using cola.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
662,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "221-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 663,
"target": "A",
"doc": {
"video_id": "222",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=FlcVBKTSjRs",
"videoID": "FlcVBKTSjRs",
"question_id": "222-1",
"task_type": "Counting Problem",
"question": "How many spoons are used to make the holder in the video?",
"options": [
"A. 16.",
"B. 10.",
"C. 6.",
"D. 12."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many spoons are used to make the holder in the video?\nOption:\nA. 16.\nB. 10.\nC. 6.\nD. 12.\nAnswer with the option's letter from the given choices directly.",
663,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "222-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 664,
"target": "D",
"doc": {
"video_id": "222",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=FlcVBKTSjRs",
"videoID": "FlcVBKTSjRs",
"question_id": "222-2",
"task_type": "Object Recognition",
"question": "Which object does the holder made in this video visually resemble?",
"options": [
"A. Grass.",
"B. Animal.",
"C. Human's face.",
"D. Flower."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which object does the holder made in this video visually resemble?\nOption:\nA. Grass.\nB. Animal.\nC. Human's face.\nD. Flower.\nAnswer with the option's letter from the given choices directly.",
664,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "222-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 665,
"target": "B",
"doc": {
"video_id": "222",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=FlcVBKTSjRs",
"videoID": "FlcVBKTSjRs",
"question_id": "222-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus or main topic of this video?",
"options": [
"A. Making a toy car with everyday objects.",
"B. Learning how to make a spoon candle holder.",
"C. Doing fun science experiment using cola.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus or main topic of this video?\nOption:\nA. Making a toy car with everyday objects.\nB. Learning how to make a spoon candle holder.\nC. Doing fun science experiment using cola.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
665,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "222-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 666,
"target": "D",
"doc": {
"video_id": "223",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=OCTekmo_szs",
"videoID": "OCTekmo_szs",
"question_id": "223-1",
"task_type": "Counting Problem",
"question": "How many different kinds of snow globes are made in this video?",
"options": [
"A. 4.",
"B. 5.",
"C. 2.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different kinds of snow globes are made in this video?\nOption:\nA. 4.\nB. 5.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
666,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "223-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 667,
"target": "B",
"doc": {
"video_id": "223",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=OCTekmo_szs",
"videoID": "OCTekmo_szs",
"question_id": "223-2",
"task_type": "Object Recognition",
"question": "Which tool is not necessary to make a snow globe?",
"options": [
"A. Distilled water.",
"B. Scissors.",
"C. Glitter.",
"D. Super glue."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which tool is not necessary to make a snow globe?\nOption:\nA. Distilled water.\nB. Scissors.\nC. Glitter.\nD. Super glue.\nAnswer with the option's letter from the given choices directly.",
667,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "223-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 668,
"target": "C",
"doc": {
"video_id": "223",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=OCTekmo_szs",
"videoID": "OCTekmo_szs",
"question_id": "223-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. Learning how to make a spoon candle holder.",
"B. Teaching how to handcraft a plastic ball.",
"C. Learning how to DIY snow globes.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Learning how to make a spoon candle holder.\nB. Teaching how to handcraft a plastic ball.\nC. Learning how to DIY snow globes.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
668,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "223-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 669,
"target": "C",
"doc": {
"video_id": "224",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=g1MCVp5xICM",
"videoID": "g1MCVp5xICM",
"question_id": "224-1",
"task_type": "Attribute Perception",
"question": "What is the color of the pistil in this video?",
"options": [
"A. Pink.",
"B. Black and white.",
"C. Light yellow.",
"D. None of above is right."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the pistil in this video?\nOption:\nA. Pink.\nB. Black and white.\nC. Light yellow.\nD. None of above is right.\nAnswer with the option's letter from the given choices directly.",
669,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "224-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 670,
"target": "A",
"doc": {
"video_id": "224",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=g1MCVp5xICM",
"videoID": "g1MCVp5xICM",
"question_id": "224-2",
"task_type": "Spatial Perception",
"question": "As can be seen in the video, which hand is used to hold the glue gun?",
"options": [
"A. Right.",
"B. Left.",
"C. Both two hands.",
"D. None of the hands."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, which hand is used to hold the glue gun?\nOption:\nA. Right.\nB. Left.\nC. Both two hands.\nD. None of the hands.\nAnswer with the option's letter from the given choices directly.",
670,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "224-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 671,
"target": "B",
"doc": {
"video_id": "224",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=g1MCVp5xICM",
"videoID": "g1MCVp5xICM",
"question_id": "224-3",
"task_type": "Temporal Reasoning",
"question": "What is the right order of the tools appearing in the video when making a paper peony?\n(a) Glue gun.\n(b) Yellow paper.\n(c) Pink paper.\n(d) Scissors.",
"options": [
"A. (b)(a)(d)(c).",
"B. (b)(d)(a)(c).",
"C. (a)(b)(c)(d).",
"D. (d)(a)(c)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the right order of the tools appearing in the video when making a paper peony?\n(a) Glue gun.\n(b) Yellow paper.\n(c) Pink paper.\n(d) Scissors.\nOption:\nA. (b)(a)(d)(c).\nB. (b)(d)(a)(c).\nC. (a)(b)(c)(d).\nD. (d)(a)(c)(b).\nAnswer with the option's letter from the given choices directly.",
671,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "224-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 672,
"target": "B",
"doc": {
"video_id": "225",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=BiJtPU9uP0c",
"videoID": "BiJtPU9uP0c",
"question_id": "225-1",
"task_type": "Counting Problem",
"question": "How many different kinds of animal faces are made in this video?",
"options": [
"A. 4.",
"B. 3.",
"C. 5.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different kinds of animal faces are made in this video?\nOption:\nA. 4.\nB. 3.\nC. 5.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
672,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "225-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 673,
"target": "C",
"doc": {
"video_id": "225",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=BiJtPU9uP0c",
"videoID": "BiJtPU9uP0c",
"question_id": "225-2",
"task_type": "Attribute Perception",
"question": "What is the shape of the paper shown in the video?",
"options": [
"A. Circle.",
"B. Rectangle.",
"C. Square.",
"D. Triangle."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the shape of the paper shown in the video?\nOption:\nA. Circle.\nB. Rectangle.\nC. Square.\nD. Triangle.\nAnswer with the option's letter from the given choices directly.",
673,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "225-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 674,
"target": "A",
"doc": {
"video_id": "225",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=BiJtPU9uP0c",
"videoID": "BiJtPU9uP0c",
"question_id": "225-3",
"task_type": "Object Recognition",
"question": "What is the second paper animal made in this video?",
"options": [
"A. A pig.",
"B. A cat.",
"C. A dog.",
"D. A monkey."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second paper animal made in this video?\nOption:\nA. A pig.\nB. A cat.\nC. A dog.\nD. A monkey.\nAnswer with the option's letter from the given choices directly.",
674,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "225-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 675,
"target": "D",
"doc": {
"video_id": "226",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=OQAf_rTw7n8",
"videoID": "OQAf_rTw7n8",
"question_id": "226-1",
"task_type": "Counting Problem",
"question": "How many candle holders are made in this video?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many candle holders are made in this video?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
675,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "226-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 676,
"target": "B",
"doc": {
"video_id": "226",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=OQAf_rTw7n8",
"videoID": "OQAf_rTw7n8",
"question_id": "226-2",
"task_type": "Object Reasoning",
"question": "What common objects do the candle holders made in this video share?",
"options": [
"A. Spoon.",
"B. Plastic cup.",
"C. String.",
"D. Plastic pearl."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What common objects do the candle holders made in this video share?\nOption:\nA. Spoon.\nB. Plastic cup.\nC. String.\nD. Plastic pearl.\nAnswer with the option's letter from the given choices directly.",
676,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "226-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 677,
"target": "C",
"doc": {
"video_id": "226",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=OQAf_rTw7n8",
"videoID": "OQAf_rTw7n8",
"question_id": "226-3",
"task_type": "Spatial Perception",
"question": "As shown in the video, where are the candles placed?",
"options": [
"A. Next to the holder made of spoons.",
"B. On the bottom of the table.",
"C. In the middle of the cups.",
"D. There is no candle in this video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As shown in the video, where are the candles placed?\nOption:\nA. Next to the holder made of spoons.\nB. On the bottom of the table.\nC. In the middle of the cups.\nD. There is no candle in this video.\nAnswer with the option's letter from the given choices directly.",
677,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "226-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 678,
"target": "A",
"doc": {
"video_id": "227",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=qhW8L_GAodE",
"videoID": "qhW8L_GAodE",
"question_id": "227-1",
"task_type": "Spatial Perception",
"question": "How are the bottle caps placed at the beginning of this video?",
"options": [
"A. Eight small white caps around a big orange cap.",
"B. Eight small white caps are in a line and above a big orange cap.",
"C. Eight small white caps are in a circle and above a big orange cap..",
"D. Eight small white caps are in a square and around a big orange cap."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How are the bottle caps placed at the beginning of this video?\nOption:\nA. Eight small white caps around a big orange cap.\nB. Eight small white caps are in a line and above a big orange cap.\nC. Eight small white caps are in a circle and above a big orange cap..\nD. Eight small white caps are in a square and around a big orange cap.\nAnswer with the option's letter from the given choices directly.",
678,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "227-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 679,
"target": "D",
"doc": {
"video_id": "227",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=qhW8L_GAodE",
"videoID": "qhW8L_GAodE",
"question_id": "227-2",
"task_type": "Object Recognition",
"question": "What are the objects next to the girl on the sticker of the orange cap?",
"options": [
"A. A bloon and a bear.",
"B. An apple and a bear.",
"C. A zongzi and a bloon.",
"D. A bloon and an ice cream."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the objects next to the girl on the sticker of the orange cap?\nOption:\nA. A bloon and a bear.\nB. An apple and a bear.\nC. A zongzi and a bloon.\nD. A bloon and an ice cream.\nAnswer with the option's letter from the given choices directly.",
679,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "227-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 680,
"target": "B",
"doc": {
"video_id": "227",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=qhW8L_GAodE",
"videoID": "qhW8L_GAodE",
"question_id": "227-3",
"task_type": "Action Reasoning",
"question": "What is the objective or purpose of this do-it-yourself (DIY) project?",
"options": [
"A. It serves as a keychain.",
"B. It serves as a cup coaster.",
"C. It serves as a coin bank.",
"D. It serves as a storage box."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the objective or purpose of this do-it-yourself (DIY) project?\nOption:\nA. It serves as a keychain.\nB. It serves as a cup coaster.\nC. It serves as a coin bank.\nD. It serves as a storage box.\nAnswer with the option's letter from the given choices directly.",
680,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "227-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 681,
"target": "C",
"doc": {
"video_id": "228",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=ufZMrlZRn8o",
"videoID": "ufZMrlZRn8o",
"question_id": "228-1",
"task_type": "Attribute Perception",
"question": "What is the size of the rectangular paper?",
"options": [
"A. 3*20 cm.",
"B. 5*20 cm.",
"C. 3.5*20 cm.",
"D. 3.5*2 cm."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the size of the rectangular paper?\nOption:\nA. 3*20 cm.\nB. 5*20 cm.\nC. 3.5*20 cm.\nD. 3.5*2 cm.\nAnswer with the option's letter from the given choices directly.",
681,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "228-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 682,
"target": "D",
"doc": {
"video_id": "228",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=ufZMrlZRn8o",
"videoID": "ufZMrlZRn8o",
"question_id": "228-2",
"task_type": "Counting Problem",
"question": "As depicted in the video, how many times is the paper folded before cutting by scissors?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, how many times is the paper folded before cutting by scissors?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
682,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "228-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 683,
"target": "A",
"doc": {
"video_id": "228",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=ufZMrlZRn8o",
"videoID": "ufZMrlZRn8o",
"question_id": "228-3",
"task_type": "Object Reasoning",
"question": "What is the function of the scissors that are used in the video?",
"options": [
"A. Making the paper look more like a heart.",
"B. Destroying the paper.",
"C. Cutting the paper into pieces.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the scissors that are used in the video?\nOption:\nA. Making the paper look more like a heart.\nB. Destroying the paper.\nC. Cutting the paper into pieces.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
683,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "228-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 684,
"target": "B",
"doc": {
"video_id": "229",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=IKtnfFHjERg",
"videoID": "IKtnfFHjERg",
"question_id": "229-1",
"task_type": "Attribute Perception",
"question": "What is the color of the two butterflies made of paper?",
"options": [
"A. Pink for the left and left for the right.",
"B. Yellow for the left and pink for the right.",
"C. Both are pink.",
"D. Both are yellow."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the two butterflies made of paper?\nOption:\nA. Pink for the left and left for the right.\nB. Yellow for the left and pink for the right.\nC. Both are pink.\nD. Both are yellow.\nAnswer with the option's letter from the given choices directly.",
684,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "229-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 685,
"target": "C",
"doc": {
"video_id": "229",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=IKtnfFHjERg",
"videoID": "IKtnfFHjERg",
"question_id": "229-2",
"task_type": "Action Recognition",
"question": "According to the video, in what manner is the paper initially folded?",
"options": [
"A. Along the center line.",
"B. Off two corners to the center line.",
"C. Along the diagonal.",
"D. Off two corners to the diagonal."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, in what manner is the paper initially folded?\nOption:\nA. Along the center line.\nB. Off two corners to the center line.\nC. Along the diagonal.\nD. Off two corners to the diagonal.\nAnswer with the option's letter from the given choices directly.",
685,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "229-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 686,
"target": "A",
"doc": {
"video_id": "229",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=IKtnfFHjERg",
"videoID": "IKtnfFHjERg",
"question_id": "229-3",
"task_type": "Temporal Reasoning",
"question": "What is the right order of actions the video encourages to do after finishing making a paper butterfly?\n(a) Click Likes.\n(b) Subscribe.\n(c) Set Notifications.",
"options": [
"A. (a)(b)(c).",
"B. (b)(a)(c).",
"C. (a)(c)(b).",
"D. (c)(b)(a)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the right order of actions the video encourages to do after finishing making a paper butterfly?\n(a) Click Likes.\n(b) Subscribe.\n(c) Set Notifications.\nOption:\nA. (a)(b)(c).\nB. (b)(a)(c).\nC. (a)(c)(b).\nD. (c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
686,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "229-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 687,
"target": "D",
"doc": {
"video_id": "230",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=B_OL2TSrpKM",
"videoID": "B_OL2TSrpKM",
"question_id": "230-1",
"task_type": "Object Reasoning",
"question": "What is the intended use or function of the needle used in the video?",
"options": [
"A. To damage its functionality of the stick.",
"B. To sculpture a word on the stick.",
"C. To create loops and stitches on the stick.",
"D. To drill a keyring hole on the popsicle stick."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the intended use or function of the needle used in the video?\nOption:\nA. To damage its functionality of the stick.\nB. To sculpture a word on the stick.\nC. To create loops and stitches on the stick.\nD. To drill a keyring hole on the popsicle stick.\nAnswer with the option's letter from the given choices directly.",
687,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "230-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 688,
"target": "B",
"doc": {
"video_id": "230",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=B_OL2TSrpKM",
"videoID": "B_OL2TSrpKM",
"question_id": "230-2",
"task_type": "OCR Problems",
"question": "What is written on the first made keychains?",
"options": [
"A. Google.",
"B. YouTube.",
"C. Facebook.",
"D. Spotify."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is written on the first made keychains?\nOption:\nA. Google.\nB. YouTube.\nC. Facebook.\nD. Spotify.\nAnswer with the option's letter from the given choices directly.",
688,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "230-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 689,
"target": "C",
"doc": {
"video_id": "230",
"duration": "short",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=B_OL2TSrpKM",
"videoID": "B_OL2TSrpKM",
"question_id": "230-3",
"task_type": "Action Recognition",
"question": "What is the last step that this video showcases to DIY a keychain using popsicle sticks?",
"options": [
"A. Drill a keyring hole on the popsicle stick.",
"B. Decorate the sticks using colored pens.",
"C. Put the keychain through the hole.",
"D. Cut a small piece from the entire stick."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the last step that this video showcases to DIY a keychain using popsicle sticks?\nOption:\nA. Drill a keyring hole on the popsicle stick.\nB. Decorate the sticks using colored pens.\nC. Put the keychain through the hole.\nD. Cut a small piece from the entire stick.\nAnswer with the option's letter from the given choices directly.",
689,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "230-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 690,
"target": "A",
"doc": {
"video_id": "231",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=KfTtpoIovuk",
"videoID": "KfTtpoIovuk",
"question_id": "231-1",
"task_type": "Object Recognition",
"question": "What does the chef in the video end up cutting with a knife?",
"options": [
"A. Egg.",
"B. Beef.",
"C. Cake.",
"D. Cheese."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the chef in the video end up cutting with a knife?\nOption:\nA. Egg.\nB. Beef.\nC. Cake.\nD. Cheese.\nAnswer with the option's letter from the given choices directly.",
690,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "231-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 691,
"target": "C",
"doc": {
"video_id": "231",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=KfTtpoIovuk",
"videoID": "KfTtpoIovuk",
"question_id": "231-2",
"task_type": "Counting Problem",
"question": "How many customers can be seen in the video?",
"options": [
"A. 4.",
"B. 1.",
"C. 2.",
"D. 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many customers can be seen in the video?\nOption:\nA. 4.\nB. 1.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
691,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "231-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 692,
"target": "D",
"doc": {
"video_id": "231",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=KfTtpoIovuk",
"videoID": "KfTtpoIovuk",
"question_id": "231-3",
"task_type": "Information Synopsis",
"question": "What is the subject matter of the video?",
"options": [
"A. Chef recruitment.",
"B. Wine advertisement.",
"C. Tourism publicity.",
"D. Restaurant publicity."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject matter of the video?\nOption:\nA. Chef recruitment.\nB. Wine advertisement.\nC. Tourism publicity.\nD. Restaurant publicity.\nAnswer with the option's letter from the given choices directly.",
692,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "231-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 693,
"target": "B",
"doc": {
"video_id": "232",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=zPx3EibuO_w",
"videoID": "zPx3EibuO_w",
"question_id": "232-1",
"task_type": "Object Recognition",
"question": "Which food item is cut in half with a knife by the person in the video?",
"options": [
"A. Cucumber.",
"B. Tomato.",
"C. Tofu.",
"D. Almond."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which food item is cut in half with a knife by the person in the video?\nOption:\nA. Cucumber.\nB. Tomato.\nC. Tofu.\nD. Almond.\nAnswer with the option's letter from the given choices directly.",
693,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "232-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 694,
"target": "B",
"doc": {
"video_id": "232",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=zPx3EibuO_w",
"videoID": "zPx3EibuO_w",
"question_id": "232-2",
"task_type": "Object Reasoning",
"question": "Which ingredient in the video is not used as a decoration at the end of the video?",
"options": [
"A. Sugar.",
"B. Noodles.",
"C. Tomato.",
"D. Shredded cucumber."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient in the video is not used as a decoration at the end of the video?\nOption:\nA. Sugar.\nB. Noodles.\nC. Tomato.\nD. Shredded cucumber.\nAnswer with the option's letter from the given choices directly.",
694,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "232-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 695,
"target": "C",
"doc": {
"video_id": "232",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=zPx3EibuO_w",
"videoID": "zPx3EibuO_w",
"question_id": "232-3",
"task_type": "Information Synopsis",
"question": "What is the topic of this video?",
"options": [
"A. Food introduction.",
"B. Food advertising.",
"C. Food-making tutorial.",
"D. Handmade tutorial."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of this video?\nOption:\nA. Food introduction.\nB. Food advertising.\nC. Food-making tutorial.\nD. Handmade tutorial.\nAnswer with the option's letter from the given choices directly.",
695,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "232-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 696,
"target": "D",
"doc": {
"video_id": "233",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=J2rlJV7zKZw",
"videoID": "J2rlJV7zKZw",
"question_id": "233-1",
"task_type": "Object Recognition",
"question": "What tool is used in the video to stir ingredients?",
"options": [
"A. Little iron bars.",
"B. Scoop.",
"C. Stir by hand.",
"D. Mixer."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What tool is used in the video to stir ingredients?\nOption:\nA. Little iron bars.\nB. Scoop.\nC. Stir by hand.\nD. Mixer.\nAnswer with the option's letter from the given choices directly.",
696,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "233-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 697,
"target": "A",
"doc": {
"video_id": "233",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=J2rlJV7zKZw",
"videoID": "J2rlJV7zKZw",
"question_id": "233-2",
"task_type": "Object Recognition",
"question": "Which ingredient is used first in the video to make this dish?",
"options": [
"A. Egg.",
"B. Lime juice.",
"C. Vegetable oil.",
"D. Sugar."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient is used first in the video to make this dish?\nOption:\nA. Egg.\nB. Lime juice.\nC. Vegetable oil.\nD. Sugar.\nAnswer with the option's letter from the given choices directly.",
697,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "233-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 698,
"target": "A",
"doc": {
"video_id": "233",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=J2rlJV7zKZw",
"videoID": "J2rlJV7zKZw",
"question_id": "233-3",
"task_type": "Object Reasoning",
"question": "Which of the following could have caused the dark spots in the mayonnaise?",
"options": [
"A. Black truffle.",
"B. Black raisin.",
"C. Black olives.",
"D. Black sugar."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following could have caused the dark spots in the mayonnaise?\nOption:\nA. Black truffle.\nB. Black raisin.\nC. Black olives.\nD. Black sugar.\nAnswer with the option's letter from the given choices directly.",
698,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "233-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 699,
"target": "B",
"doc": {
"video_id": "234",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=v0m3E_uSyFg",
"videoID": "v0m3E_uSyFg",
"question_id": "234-1",
"task_type": "Attribute Perception",
"question": "What color is the spoon used for mixing in the video?",
"options": [
"A. Yellow.",
"B. Pink.",
"C. Red.",
"D. Orange."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the spoon used for mixing in the video?\nOption:\nA. Yellow.\nB. Pink.\nC. Red.\nD. Orange.\nAnswer with the option's letter from the given choices directly.",
699,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "234-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 700,
"target": "C",
"doc": {
"video_id": "234",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=v0m3E_uSyFg",
"videoID": "v0m3E_uSyFg",
"question_id": "234-2",
"task_type": "Object Reasoning",
"question": "Why does the food in the video swell?",
"options": [
"A. Fermentation.",
"B. Steep.",
"C. Baking in the oven.",
"D. Aeration."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the food in the video swell?\nOption:\nA. Fermentation.\nB. Steep.\nC. Baking in the oven.\nD. Aeration.\nAnswer with the option's letter from the given choices directly.",
700,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "234-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 701,
"target": "D",
"doc": {
"video_id": "234",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=v0m3E_uSyFg",
"videoID": "v0m3E_uSyFg",
"question_id": "234-3",
"task_type": "Object Recognition",
"question": "What kind of food is being prepared in the video?",
"options": [
"A. Pizza.",
"B. Cheese block.",
"C. Bread.",
"D. Scone."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of food is being prepared in the video?\nOption:\nA. Pizza.\nB. Cheese block.\nC. Bread.\nD. Scone.\nAnswer with the option's letter from the given choices directly.",
701,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "234-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 702,
"target": "B",
"doc": {
"video_id": "235",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=kij9GXWdmik",
"videoID": "kij9GXWdmik",
"question_id": "235-1",
"task_type": "Object Reasoning",
"question": "What is the reason for cutting out the center of the pineapple slice in the video?",
"options": [
"A. Cut off pineapple core for garnish.",
"B. Cut off the core of the pineapple so as not to affect the flavor.",
"C. For the beauty of the pineapple.",
"D. More favorable pineapple slices."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reason for cutting out the center of the pineapple slice in the video?\nOption:\nA. Cut off pineapple core for garnish.\nB. Cut off the core of the pineapple so as not to affect the flavor.\nC. For the beauty of the pineapple.\nD. More favorable pineapple slices.\nAnswer with the option's letter from the given choices directly.",
702,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "235-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 703,
"target": "A",
"doc": {
"video_id": "235",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=kij9GXWdmik",
"videoID": "kij9GXWdmik",
"question_id": "235-2",
"task_type": "Attribute Perception",
"question": "What is the color of the clothing worn by the individuals in the video?",
"options": [
"A. Black.",
"B. Gray.",
"C. Green.",
"D. Blown."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the clothing worn by the individuals in the video?\nOption:\nA. Black.\nB. Gray.\nC. Green.\nD. Blown.\nAnswer with the option's letter from the given choices directly.",
703,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "235-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 704,
"target": "C",
"doc": {
"video_id": "235",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=kij9GXWdmik",
"videoID": "kij9GXWdmik",
"question_id": "235-3",
"task_type": "Temporal Perception",
"question": "What is the second step in the production process?",
"options": [
"A. Remove the core.",
"B. Cut off the head.",
"C. Peel off the skin.",
"D. Clean the outer skin."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second step in the production process?\nOption:\nA. Remove the core.\nB. Cut off the head.\nC. Peel off the skin.\nD. Clean the outer skin.\nAnswer with the option's letter from the given choices directly.",
704,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "235-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 705,
"target": "D",
"doc": {
"video_id": "236",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=NAahAX62MZ0",
"videoID": "NAahAX62MZ0",
"question_id": "236-1",
"task_type": "Counting Problem",
"question": "How many peppers can be seen on the plate in the video's final shot?",
"options": [
"A. 9.",
"B. 6.",
"C. 8.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many peppers can be seen on the plate in the video's final shot?\nOption:\nA. 9.\nB. 6.\nC. 8.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
705,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "236-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 706,
"target": "B",
"doc": {
"video_id": "236",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=NAahAX62MZ0",
"videoID": "NAahAX62MZ0",
"question_id": "236-2",
"task_type": "Object Reasoning",
"question": "What are the main ingredients in the video?",
"options": [
"A. Eggplant.",
"B. Pepper.",
"C. Avocado.",
"D. Green grape."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the main ingredients in the video?\nOption:\nA. Eggplant.\nB. Pepper.\nC. Avocado.\nD. Green grape.\nAnswer with the option's letter from the given choices directly.",
706,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "236-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 707,
"target": "A",
"doc": {
"video_id": "236",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=NAahAX62MZ0",
"videoID": "NAahAX62MZ0",
"question_id": "236-3",
"task_type": "Object Reasoning",
"question": "Why does the surface of the pepper in the video turn yellow?",
"options": [
"A. Deep fry.",
"B. Go bad.",
"C. Color with paint.",
"D. It's ripe."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the surface of the pepper in the video turn yellow?\nOption:\nA. Deep fry.\nB. Go bad.\nC. Color with paint.\nD. It's ripe.\nAnswer with the option's letter from the given choices directly.",
707,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "236-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 708,
"target": "C",
"doc": {
"video_id": "237",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=YZ11C-U7S8I",
"videoID": "YZ11C-U7S8I",
"question_id": "237-1",
"task_type": "Object Recognition",
"question": "What animal doesn't appear in the video?",
"options": [
"A. Tiger.",
"B. Fox.",
"C. Chick.",
"D. Crocodile."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What animal doesn't appear in the video?\nOption:\nA. Tiger.\nB. Fox.\nC. Chick.\nD. Crocodile.\nAnswer with the option's letter from the given choices directly.",
708,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "237-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 709,
"target": "A",
"doc": {
"video_id": "237",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=YZ11C-U7S8I",
"videoID": "YZ11C-U7S8I",
"question_id": "237-2",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the topic of this video?",
"options": [
"A. An introduction to eating healthy food.",
"B. An introduction to food.",
"C. An introduction to animal eating habits.",
"D. Food advertising."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the topic of this video?\nOption:\nA. An introduction to eating healthy food.\nB. An introduction to food.\nC. An introduction to animal eating habits.\nD. Food advertising.\nAnswer with the option's letter from the given choices directly.",
709,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "237-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 710,
"target": "B",
"doc": {
"video_id": "237",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=YZ11C-U7S8I",
"videoID": "YZ11C-U7S8I",
"question_id": "237-3",
"task_type": "Object Reasoning",
"question": "What is the reason for the crocodile in the video to grow taller?",
"options": [
"A. It stands on the chair.",
"B. It eats healthy food.",
"C. It steps into its young adulthood.",
"D. It insists on daily exercise."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reason for the crocodile in the video to grow taller?\nOption:\nA. It stands on the chair.\nB. It eats healthy food.\nC. It steps into its young adulthood.\nD. It insists on daily exercise.\nAnswer with the option's letter from the given choices directly.",
710,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "237-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 711,
"target": "D",
"doc": {
"video_id": "238",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=NXeHeEHj9Xc",
"videoID": "NXeHeEHj9Xc",
"question_id": "238-1",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the topic of the video?",
"options": [
"A. How to eat potato chips.",
"B. The birth of French fries.",
"C. How to make French fries.",
"D. The birth of potato chips."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the topic of the video?\nOption:\nA. How to eat potato chips.\nB. The birth of French fries.\nC. How to make French fries.\nD. The birth of potato chips.\nAnswer with the option's letter from the given choices directly.",
711,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "238-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 712,
"target": "C",
"doc": {
"video_id": "238",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=NXeHeEHj9Xc",
"videoID": "NXeHeEHj9Xc",
"question_id": "238-2",
"task_type": "Action Recognition",
"question": "How do the people in the video eat potato chips?",
"options": [
"A. Left-handedly.",
"B. Right-handedly.",
"C. Using both hands.",
"D. Using a fork."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do the people in the video eat potato chips?\nOption:\nA. Left-handedly.\nB. Right-handedly.\nC. Using both hands.\nD. Using a fork.\nAnswer with the option's letter from the given choices directly.",
712,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "238-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 713,
"target": "A",
"doc": {
"video_id": "238",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=NXeHeEHj9Xc",
"videoID": "NXeHeEHj9Xc",
"question_id": "238-3",
"task_type": "Action Reasoning",
"question": "Which snack does the main character in the video prefer?",
"options": [
"A. Potato chips.",
"B. French fries.",
"C. He likes them both.",
"D. He doesn't like either."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which snack does the main character in the video prefer?\nOption:\nA. Potato chips.\nB. French fries.\nC. He likes them both.\nD. He doesn't like either.\nAnswer with the option's letter from the given choices directly.",
713,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "238-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 714,
"target": "B",
"doc": {
"video_id": "239",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=VnvG08masio",
"videoID": "VnvG08masio",
"question_id": "239-1",
"task_type": "Object Reasoning",
"question": "Which of the following cooking methods causes the smoke seen in the video?",
"options": [
"A. Making soup.",
"B. BBQ.",
"C. Stir fry.",
"D. Fire hazard."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following cooking methods causes the smoke seen in the video?\nOption:\nA. Making soup.\nB. BBQ.\nC. Stir fry.\nD. Fire hazard.\nAnswer with the option's letter from the given choices directly.",
714,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "239-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 715,
"target": "C",
"doc": {
"video_id": "239",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=VnvG08masio",
"videoID": "VnvG08masio",
"question_id": "239-2",
"task_type": "Counting Problem",
"question": "Which dish is served last in the video?",
"options": [
"A. 4.",
"B. 2.",
"C. 3.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which dish is served last in the video?\nOption:\nA. 4.\nB. 2.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
715,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "239-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 716,
"target": "D",
"doc": {
"video_id": "239",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=VnvG08masio",
"videoID": "VnvG08masio",
"question_id": "239-3",
"task_type": "Attribute Perception",
"question": "What is the color of the laughing grandmother's attire in the video?",
"options": [
"A. Light brown.",
"B. Light blue.",
"C. Light yellow.",
"D. Light green."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the laughing grandmother's attire in the video?\nOption:\nA. Light brown.\nB. Light blue.\nC. Light yellow.\nD. Light green.\nAnswer with the option's letter from the given choices directly.",
716,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "239-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 717,
"target": "A",
"doc": {
"video_id": "240",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=4twBzf6ngSc",
"videoID": "4twBzf6ngSc",
"question_id": "240-1",
"task_type": "Information Synopsis",
"question": "Which of the following best describes the topic of the video?",
"options": [
"A. Destruction of illegal food at port.",
"B. How is the juice sold at the port made.",
"C. What kind of food do people bring into the port.",
"D. What do people eat in the harbor."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the topic of the video?\nOption:\nA. Destruction of illegal food at port.\nB. How is the juice sold at the port made.\nC. What kind of food do people bring into the port.\nD. What do people eat in the harbor.\nAnswer with the option's letter from the given choices directly.",
717,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "240-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 718,
"target": "D",
"doc": {
"video_id": "240",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=4twBzf6ngSc",
"videoID": "4twBzf6ngSc",
"question_id": "240-2",
"task_type": "Object Reasoning",
"question": "How is the food in the video destroyed?",
"options": [
"A. Flush down the drain.",
"B. Burn up.",
"C. Freezing.",
"D. Grind."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the food in the video destroyed?\nOption:\nA. Flush down the drain.\nB. Burn up.\nC. Freezing.\nD. Grind.\nAnswer with the option's letter from the given choices directly.",
718,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "240-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 719,
"target": "B",
"doc": {
"video_id": "240",
"duration": "short",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=4twBzf6ngSc",
"videoID": "4twBzf6ngSc",
"question_id": "240-3",
"task_type": "OCR Problems",
"question": "What are the potential risks associated with exotic animal diseases according to the bulletin board in the video?",
"options": [
"A. Animal.",
"B. Agriculture.",
"C. Airport.",
"D. Human."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the potential risks associated with exotic animal diseases according to the bulletin board in the video?\nOption:\nA. Animal.\nB. Agriculture.\nC. Airport.\nD. Human.\nAnswer with the option's letter from the given choices directly.",
719,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "240-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Food",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 720,
"target": "C",
"doc": {
"video_id": "241",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=jb-To6qJcxU",
"videoID": "jb-To6qJcxU",
"question_id": "241-1",
"task_type": "Attribute Perception",
"question": "What color is the woman's manicure in the video?",
"options": [
"A. Green.",
"B. Red.",
"C. Blue.",
"D. Black."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the woman's manicure in the video?\nOption:\nA. Green.\nB. Red.\nC. Blue.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
720,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "241-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 721,
"target": "C",
"doc": {
"video_id": "241",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=jb-To6qJcxU",
"videoID": "jb-To6qJcxU",
"question_id": "241-2",
"task_type": "Counting Problem",
"question": "How many earrings does the woman in the video have on her left ear?",
"options": [
"A. 3.",
"B. 5.",
"C. 7.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many earrings does the woman in the video have on her left ear?\nOption:\nA. 3.\nB. 5.\nC. 7.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
721,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "241-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 722,
"target": "B",
"doc": {
"video_id": "241",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=jb-To6qJcxU",
"videoID": "jb-To6qJcxU",
"question_id": "241-3",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. How to paint.",
"B. How to makeup.",
"C. How to sing.",
"D. How to act."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. How to paint.\nB. How to makeup.\nC. How to sing.\nD. How to act.\nAnswer with the option's letter from the given choices directly.",
722,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "241-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 723,
"target": "D",
"doc": {
"video_id": "242",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=lokFoo_QD8c",
"videoID": "lokFoo_QD8c",
"question_id": "242-1",
"task_type": "Temporal Reasoning",
"question": "Which of the following is the correct order of cosmetics used in the video?",
"options": [
"A. Primer, BB cream, eyeshadow, mascara, highlighter, contour, blush, lipstick.",
"B. Primer, BB cream, highlighter, mascara, eyeshadow, contour, blush, lipstick.",
"C. BB cream, primer, eyeshadow, mascara, highlighter, blush, contour, lipstick.",
"D. BB cream, primer, eyeshadow, highlighter, mascara, contour, blush, lipstick."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the correct order of cosmetics used in the video?\nOption:\nA. Primer, BB cream, eyeshadow, mascara, highlighter, contour, blush, lipstick.\nB. Primer, BB cream, highlighter, mascara, eyeshadow, contour, blush, lipstick.\nC. BB cream, primer, eyeshadow, mascara, highlighter, blush, contour, lipstick.\nD. BB cream, primer, eyeshadow, highlighter, mascara, contour, blush, lipstick.\nAnswer with the option's letter from the given choices directly.",
723,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "242-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 724,
"target": "B",
"doc": {
"video_id": "242",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=lokFoo_QD8c",
"videoID": "lokFoo_QD8c",
"question_id": "242-2",
"task_type": "OCR Problems",
"question": "What is the number of the first lipstick she used?",
"options": [
"A. 959.",
"B. 656.",
"C. 858.",
"D. 666."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the number of the first lipstick she used?\nOption:\nA. 959.\nB. 656.\nC. 858.\nD. 666.\nAnswer with the option's letter from the given choices directly.",
724,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "242-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 725,
"target": "B",
"doc": {
"video_id": "242",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=lokFoo_QD8c",
"videoID": "lokFoo_QD8c",
"question_id": "242-3",
"task_type": "Spatial Reasoning",
"question": "Where was the video most likely shot?",
"options": [
"A. Dining room.",
"B. Bathroom.",
"C. Kitchen.",
"D. Bedroom."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where was the video most likely shot?\nOption:\nA. Dining room.\nB. Bathroom.\nC. Kitchen.\nD. Bedroom.\nAnswer with the option's letter from the given choices directly.",
725,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "242-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 726,
"target": "A",
"doc": {
"video_id": "243",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=pwVJizpCuDQ",
"videoID": "pwVJizpCuDQ",
"question_id": "243-1",
"task_type": "Attribute Perception",
"question": "What color is the bicycle parked against the wall?",
"options": [
"A. Green.",
"B. Yellow.",
"C. Red.",
"D. Blue."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the bicycle parked against the wall?\nOption:\nA. Green.\nB. Yellow.\nC. Red.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
726,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "243-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 727,
"target": "C",
"doc": {
"video_id": "243",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=pwVJizpCuDQ",
"videoID": "pwVJizpCuDQ",
"question_id": "243-2",
"task_type": "Counting Problem",
"question": "How many outfits did the female protagonist change in total in the video?",
"options": [
"A. 1.",
"B. 3.",
"C. 5.",
"D. 7."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many outfits did the female protagonist change in total in the video?\nOption:\nA. 1.\nB. 3.\nC. 5.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
727,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "243-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 728,
"target": "A",
"doc": {
"video_id": "243",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=pwVJizpCuDQ",
"videoID": "pwVJizpCuDQ",
"question_id": "243-3",
"task_type": "Object Recognition",
"question": "What is the woman wearing when standing in front of the mirror?",
"options": [
"A. A green suit with a purple blouse.",
"B. A sleeveless, bright pink dress.",
"C. A bright blue blouse with a long, red cape.",
"D. A bright yellow coat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the woman wearing when standing in front of the mirror?\nOption:\nA. A green suit with a purple blouse.\nB. A sleeveless, bright pink dress.\nC. A bright blue blouse with a long, red cape.\nD. A bright yellow coat.\nAnswer with the option's letter from the given choices directly.",
728,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "243-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 729,
"target": "A",
"doc": {
"video_id": "244",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=9v9NHoTXc-Q",
"videoID": "9v9NHoTXc-Q",
"question_id": "244-1",
"task_type": "Attribute Perception",
"question": "What color are the shoes worn by the boy with the white hat?",
"options": [
"A. Bright blue.",
"B. Black and white.",
"C. Dark purple.",
"D. Grey."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the shoes worn by the boy with the white hat?\nOption:\nA. Bright blue.\nB. Black and white.\nC. Dark purple.\nD. Grey.\nAnswer with the option's letter from the given choices directly.",
729,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "244-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 730,
"target": "C",
"doc": {
"video_id": "244",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=9v9NHoTXc-Q",
"videoID": "9v9NHoTXc-Q",
"question_id": "244-2",
"task_type": "Spatial Perception",
"question": "Which of the following objects is the pink satin high-heeled shoe placed on in the video?",
"options": [
"A. Chair.",
"B. Sofa.",
"C. Record player.",
"D. Stair."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following objects is the pink satin high-heeled shoe placed on in the video?\nOption:\nA. Chair.\nB. Sofa.\nC. Record player.\nD. Stair.\nAnswer with the option's letter from the given choices directly.",
730,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "244-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 731,
"target": "C",
"doc": {
"video_id": "244",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=9v9NHoTXc-Q",
"videoID": "9v9NHoTXc-Q",
"question_id": "244-3",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. House party.",
"B. Home decoration.",
"C. Fashion advertising.",
"D. Hide-and-seek."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. House party.\nB. Home decoration.\nC. Fashion advertising.\nD. Hide-and-seek.\nAnswer with the option's letter from the given choices directly.",
731,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "244-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 732,
"target": "C",
"doc": {
"video_id": "245",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=ZO12ZY38FEw",
"videoID": "ZO12ZY38FEw",
"question_id": "245-1",
"task_type": "Object Recognition",
"question": "Which outfit is the male protagonist wearing on the train in the video?",
"options": [
"A. A double-breasted beige coat.",
"B. A red and white varsity-style jacket.",
"C. A denim shirt with sparkles on the shoulders.",
"D. An open, red plaid shirt with a relaxed fit."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which outfit is the male protagonist wearing on the train in the video?\nOption:\nA. A double-breasted beige coat.\nB. A red and white varsity-style jacket.\nC. A denim shirt with sparkles on the shoulders.\nD. An open, red plaid shirt with a relaxed fit.\nAnswer with the option's letter from the given choices directly.",
732,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "245-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 733,
"target": "C",
"doc": {
"video_id": "245",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=ZO12ZY38FEw",
"videoID": "ZO12ZY38FEw",
"question_id": "245-2",
"task_type": "Object Recognition",
"question": "Which musical instrument did the man play upon entering the classroom?",
"options": [
"A. Trombone.",
"B. Flute.",
"C. Trumpet.",
"D. French horn."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which musical instrument did the man play upon entering the classroom?\nOption:\nA. Trombone.\nB. Flute.\nC. Trumpet.\nD. French horn.\nAnswer with the option's letter from the given choices directly.",
733,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "245-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 734,
"target": "C",
"doc": {
"video_id": "245",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=ZO12ZY38FEw",
"videoID": "ZO12ZY38FEw",
"question_id": "245-3",
"task_type": "Counting Problem",
"question": "How many outfits did the male protagonist change in total in the video?",
"options": [
"A. 1.",
"B. 3.",
"C. 5.",
"D. 7."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many outfits did the male protagonist change in total in the video?\nOption:\nA. 1.\nB. 3.\nC. 5.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
734,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "245-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 735,
"target": "B",
"doc": {
"video_id": "246",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=frX8GujpmkI",
"videoID": "frX8GujpmkI",
"question_id": "246-1",
"task_type": "Spatial Reasoning",
"question": "What is the emotion of the video?",
"options": [
"A. Solemn.",
"B. Enjoyable.",
"C. Heartbroken.",
"D. Anxious."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the emotion of the video?\nOption:\nA. Solemn.\nB. Enjoyable.\nC. Heartbroken.\nD. Anxious.\nAnswer with the option's letter from the given choices directly.",
735,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "246-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 736,
"target": "C",
"doc": {
"video_id": "246",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=frX8GujpmkI",
"videoID": "frX8GujpmkI",
"question_id": "246-2",
"task_type": "Object Recognition",
"question": "What is the woman wearing with a light green striped handbag in the video?",
"options": [
"A. A two-piece outfit with a blue, red and white pattern.",
"B. A sheer, flowing, yellow dress.",
"C. A dark-colored blouse and a floral print skirt.",
"D. A long, beige dress with a relaxed, flowing silhouette."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the woman wearing with a light green striped handbag in the video?\nOption:\nA. A two-piece outfit with a blue, red and white pattern.\nB. A sheer, flowing, yellow dress.\nC. A dark-colored blouse and a floral print skirt.\nD. A long, beige dress with a relaxed, flowing silhouette.\nAnswer with the option's letter from the given choices directly.",
736,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "246-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 737,
"target": "D",
"doc": {
"video_id": "246",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=frX8GujpmkI",
"videoID": "frX8GujpmkI",
"question_id": "246-3",
"task_type": "Counting Problem",
"question": "At the end of the video, how many people are there on the staircase along a coastal hillside?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the video, how many people are there on the staircase along a coastal hillside?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
737,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "246-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 738,
"target": "B",
"doc": {
"video_id": "247",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=uVR-DZQuRbM",
"videoID": "uVR-DZQuRbM",
"question_id": "247-1",
"task_type": "Object Reasoning",
"question": "What is the role of the man speaking in the video?",
"options": [
"A. Eye pencil designer.",
"B. Makeup artist.",
"C. Model agent.",
"D. Company manager."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the man speaking in the video?\nOption:\nA. Eye pencil designer.\nB. Makeup artist.\nC. Model agent.\nD. Company manager.\nAnswer with the option's letter from the given choices directly.",
738,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "247-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 739,
"target": "C",
"doc": {
"video_id": "247",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=uVR-DZQuRbM",
"videoID": "uVR-DZQuRbM",
"question_id": "247-2",
"task_type": "Object Recognition",
"question": "What are the characteristics of the makeup in the video?",
"options": [
"A. The eyebrow edges are very sharp.",
"B. This look is very suitable for models with special face shapes.",
"C. There is an eyeliner on each of the upper and lower eyelids, and they do not intersect each other.",
"D. This makeup style does not require a foundation and is very white."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the characteristics of the makeup in the video?\nOption:\nA. The eyebrow edges are very sharp.\nB. This look is very suitable for models with special face shapes.\nC. There is an eyeliner on each of the upper and lower eyelids, and they do not intersect each other.\nD. This makeup style does not require a foundation and is very white.\nAnswer with the option's letter from the given choices directly.",
739,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "247-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 740,
"target": "D",
"doc": {
"video_id": "247",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=uVR-DZQuRbM",
"videoID": "uVR-DZQuRbM",
"question_id": "247-3",
"task_type": "Attribute Perception",
"question": "What is the color of the sponge ball used for applying foundation in the video?",
"options": [
"A. Light yellow.",
"B. White.",
"C. Pink.",
"D. Black."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the sponge ball used for applying foundation in the video?\nOption:\nA. Light yellow.\nB. White.\nC. Pink.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
740,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "247-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 741,
"target": "C",
"doc": {
"video_id": "248",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=6NVr0cNiHPM",
"videoID": "6NVr0cNiHPM",
"question_id": "248-1",
"task_type": "Counting Problem",
"question": "How many items are stored in the box displayed at the beginning of the video?",
"options": [
"A. 7.",
"B. 8.",
"C. 10.",
"D. 9."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many items are stored in the box displayed at the beginning of the video?\nOption:\nA. 7.\nB. 8.\nC. 10.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
741,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "248-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 742,
"target": "D",
"doc": {
"video_id": "248",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=6NVr0cNiHPM",
"videoID": "6NVr0cNiHPM",
"question_id": "248-2",
"task_type": "Temporal Reasoning",
"question": "What is the order of skincare products used before makeup in the video?",
"options": [
"A. Face serum, toner, eye cream, face cream.",
"B. Toner, face serum, eye cream, face cream.",
"C. Face serum, toner, face cream, eye cream.",
"D. Toner, face serum, face cream, eye cream."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order of skincare products used before makeup in the video?\nOption:\nA. Face serum, toner, eye cream, face cream.\nB. Toner, face serum, eye cream, face cream.\nC. Face serum, toner, face cream, eye cream.\nD. Toner, face serum, face cream, eye cream.\nAnswer with the option's letter from the given choices directly.",
742,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "248-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 743,
"target": "B",
"doc": {
"video_id": "248",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=6NVr0cNiHPM",
"videoID": "6NVr0cNiHPM",
"question_id": "248-3",
"task_type": "Object Recognition",
"question": "What is used to apply eye cream in the video?",
"options": [
"A. Shadow brush.",
"B. Ring finger.",
"C. Cotton pad.",
"D. Face brush."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is used to apply eye cream in the video?\nOption:\nA. Shadow brush.\nB. Ring finger.\nC. Cotton pad.\nD. Face brush.\nAnswer with the option's letter from the given choices directly.",
743,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "248-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 744,
"target": "C",
"doc": {
"video_id": "249",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=OGQ7KwUXuBk",
"videoID": "OGQ7KwUXuBk",
"question_id": "249-1",
"task_type": "Object Recognition",
"question": "What is the pregnant woman wearing in the video?",
"options": [
"A. A strapless, form-fitting dress adorned with sequins.",
"B. A mint green dress with a strapless design.",
"C. An elegant black dress with long sleeves.",
"D. A striking green dress with a large, ruffled detail."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the pregnant woman wearing in the video?\nOption:\nA. A strapless, form-fitting dress adorned with sequins.\nB. A mint green dress with a strapless design.\nC. An elegant black dress with long sleeves.\nD. A striking green dress with a large, ruffled detail.\nAnswer with the option's letter from the given choices directly.",
744,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "249-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 745,
"target": "A",
"doc": {
"video_id": "249",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=OGQ7KwUXuBk",
"videoID": "OGQ7KwUXuBk",
"question_id": "249-2",
"task_type": "Spatial Reasoning",
"question": "Where did the video take place?",
"options": [
"A. Red Carpet at the Oscars.",
"B. Fashion show.",
"C. Press conference.",
"D. Film set."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where did the video take place?\nOption:\nA. Red Carpet at the Oscars.\nB. Fashion show.\nC. Press conference.\nD. Film set.\nAnswer with the option's letter from the given choices directly.",
745,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "249-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 746,
"target": "A",
"doc": {
"video_id": "249",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=OGQ7KwUXuBk",
"videoID": "OGQ7KwUXuBk",
"question_id": "249-3",
"task_type": "OCR Problems",
"question": "What is the total number of nominations received by the movie \"Oppenheimer\"?",
"options": [
"A. 13.",
"B. 12.",
"C. 8.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the total number of nominations received by the movie \"Oppenheimer\"?\nOption:\nA. 13.\nB. 12.\nC. 8.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
746,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "249-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 747,
"target": "C",
"doc": {
"video_id": "250",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=2iHmlBmLKUE",
"videoID": "2iHmlBmLKUE",
"question_id": "250-1",
"task_type": "Information Synopsis",
"question": "Which topic does the video cover?",
"options": [
"A. How to apply perfume.",
"B. How the perfume smells.",
"C. How to refill perfume.",
"D. How to choose a right perfume."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which topic does the video cover?\nOption:\nA. How to apply perfume.\nB. How the perfume smells.\nC. How to refill perfume.\nD. How to choose a right perfume.\nAnswer with the option's letter from the given choices directly.",
747,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "250-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 748,
"target": "B",
"doc": {
"video_id": "250",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=2iHmlBmLKUE",
"videoID": "2iHmlBmLKUE",
"question_id": "250-2",
"task_type": "Object Reasoning",
"question": "What is the capacity of the refill if the capacity of a bottle of perfume is 50mL?",
"options": [
"A. 150mL.",
"B. 100mL.",
"C. 50mL.",
"D. 25mL."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the capacity of the refill if the capacity of a bottle of perfume is 50mL?\nOption:\nA. 150mL.\nB. 100mL.\nC. 50mL.\nD. 25mL.\nAnswer with the option's letter from the given choices directly.",
748,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "250-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 749,
"target": "B",
"doc": {
"video_id": "250",
"duration": "short",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=2iHmlBmLKUE",
"videoID": "2iHmlBmLKUE",
"question_id": "250-3",
"task_type": "Attribute Perception",
"question": "Which of the following best describes the shape of the perfume bottle?",
"options": [
"A. Rounded.",
"B. Triangular.",
"C. Asymmetrical.",
"D. Rectangular."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best describes the shape of the perfume bottle?\nOption:\nA. Rounded.\nB. Triangular.\nC. Asymmetrical.\nD. Rectangular.\nAnswer with the option's letter from the given choices directly.",
749,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "250-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 750,
"target": "B",
"doc": {
"video_id": "251",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=1sTQOxXFO44",
"videoID": "1sTQOxXFO44",
"question_id": "251-1",
"task_type": "Action Recognition",
"question": "Which daily activity is not shown in the video?",
"options": [
"A. Cleaning the floor.",
"B. Eating a meal.",
"C. Ironing clothes.",
"D. Cooking."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which daily activity is not shown in the video?\nOption:\nA. Cleaning the floor.\nB. Eating a meal.\nC. Ironing clothes.\nD. Cooking.\nAnswer with the option's letter from the given choices directly.",
750,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "251-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 751,
"target": "C",
"doc": {
"video_id": "251",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=1sTQOxXFO44",
"videoID": "1sTQOxXFO44",
"question_id": "251-2",
"task_type": "Action Recognition",
"question": "What is the woman sitting on the floor ready to do at the end of the video?",
"options": [
"A. To play cell phone.",
"B. To play games.",
"C. To watch TV.",
"D. To read a book."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the woman sitting on the floor ready to do at the end of the video?\nOption:\nA. To play cell phone.\nB. To play games.\nC. To watch TV.\nD. To read a book.\nAnswer with the option's letter from the given choices directly.",
751,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "251-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 752,
"target": "A",
"doc": {
"video_id": "251",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=1sTQOxXFO44",
"videoID": "1sTQOxXFO44",
"question_id": "251-3",
"task_type": "Temporal Perception",
"question": "After cleaning her floor in the video, what did the woman do next?",
"options": [
"A. Frying an egg.",
"B. Using a cup to catch water.",
"C. Washing her clothes.",
"D. Cutting a watermelon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After cleaning her floor in the video, what did the woman do next?\nOption:\nA. Frying an egg.\nB. Using a cup to catch water.\nC. Washing her clothes.\nD. Cutting a watermelon.\nAnswer with the option's letter from the given choices directly.",
752,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "251-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 753,
"target": "D",
"doc": {
"video_id": "252",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=RP1AL2DU6vQ",
"videoID": "RP1AL2DU6vQ",
"question_id": "252-1",
"task_type": "Action Recognition",
"question": "In the video, what is the primary mode of transportation for the boys to get to school?",
"options": [
"A. By bus.",
"B. By driving a car.",
"C. By biking.",
"D. By walking."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what is the primary mode of transportation for the boys to get to school?\nOption:\nA. By bus.\nB. By driving a car.\nC. By biking.\nD. By walking.\nAnswer with the option's letter from the given choices directly.",
753,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "252-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 754,
"target": "B",
"doc": {
"video_id": "252",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=RP1AL2DU6vQ",
"videoID": "RP1AL2DU6vQ",
"question_id": "252-2",
"task_type": "Counting Problem",
"question": "How many people in total can be seen in the video sitting at the table eating breakfast?",
"options": [
"A. 2.",
"B. 3.",
"C. 1.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people in total can be seen in the video sitting at the table eating breakfast?\nOption:\nA. 2.\nB. 3.\nC. 1.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
754,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "252-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 755,
"target": "C",
"doc": {
"video_id": "252",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=RP1AL2DU6vQ",
"videoID": "RP1AL2DU6vQ",
"question_id": "252-3",
"task_type": "OCR Problems",
"question": "In the video, at what time do classes begin at the boys' school?",
"options": [
"A. 10:45.",
"B. 10:15.",
"C. 8:30.",
"D. 8:00."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, at what time do classes begin at the boys' school?\nOption:\nA. 10:45.\nB. 10:15.\nC. 8:30.\nD. 8:00.\nAnswer with the option's letter from the given choices directly.",
755,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "252-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 756,
"target": "A",
"doc": {
"video_id": "253",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=kePBvNotYy4",
"videoID": "kePBvNotYy4",
"question_id": "253-1",
"task_type": "Action Recognition",
"question": "What is the mode of transportation for the woman in the video to get to work?",
"options": [
"A. Drive a car.",
"B. Walk.",
"C. Biking.",
"D. By bus."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the mode of transportation for the woman in the video to get to work?\nOption:\nA. Drive a car.\nB. Walk.\nC. Biking.\nD. By bus.\nAnswer with the option's letter from the given choices directly.",
756,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "253-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 757,
"target": "C",
"doc": {
"video_id": "253",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=kePBvNotYy4",
"videoID": "kePBvNotYy4",
"question_id": "253-2",
"task_type": "Action Recognition",
"question": "What does the woman in the video do after brushing her teeth?",
"options": [
"A. Feed her dog.",
"B. Dry her hair.",
"C. Take a shower.",
"D. Eat breakfast."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the woman in the video do after brushing her teeth?\nOption:\nA. Feed her dog.\nB. Dry her hair.\nC. Take a shower.\nD. Eat breakfast.\nAnswer with the option's letter from the given choices directly.",
757,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "253-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 758,
"target": "B",
"doc": {
"video_id": "253",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=kePBvNotYy4",
"videoID": "kePBvNotYy4",
"question_id": "253-3",
"task_type": "Counting Problem",
"question": "How long does it take for the girl in the video to get from home to work?",
"options": [
"A. More than 30 minutes and less than 1 hour.",
"B. A little less than 30 minutes.",
"C. More than 1 hour.",
"D. Less than 10 minutes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How long does it take for the girl in the video to get from home to work?\nOption:\nA. More than 30 minutes and less than 1 hour.\nB. A little less than 30 minutes.\nC. More than 1 hour.\nD. Less than 10 minutes.\nAnswer with the option's letter from the given choices directly.",
758,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "253-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 759,
"target": "C",
"doc": {
"video_id": "254",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=lk2ics_YImU",
"videoID": "lk2ics_YImU",
"question_id": "254-1",
"task_type": "Action Recognition",
"question": "What kind of transportation does the boy in the video take to and from school?",
"options": [
"A. Car.",
"B. Bus.",
"C. Train.",
"D. Bike."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of transportation does the boy in the video take to and from school?\nOption:\nA. Car.\nB. Bus.\nC. Train.\nD. Bike.\nAnswer with the option's letter from the given choices directly.",
759,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "254-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 760,
"target": "A",
"doc": {
"video_id": "254",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=lk2ics_YImU",
"videoID": "lk2ics_YImU",
"question_id": "254-2",
"task_type": "Object Recognition",
"question": "What is the second class the boys are taking in the video?",
"options": [
"A. Math class.",
"B. Science class.",
"C. Music class.",
"D. English class."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second class the boys are taking in the video?\nOption:\nA. Math class.\nB. Science class.\nC. Music class.\nD. English class.\nAnswer with the option's letter from the given choices directly.",
760,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "254-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 761,
"target": "D",
"doc": {
"video_id": "254",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=lk2ics_YImU",
"videoID": "lk2ics_YImU",
"question_id": "254-3",
"task_type": "Object Recognition",
"question": "Which item did not appear in the video?",
"options": [
"A. Schoolbag.",
"B. Guitar.",
"C. Drum kit.",
"D. Laptop."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item did not appear in the video?\nOption:\nA. Schoolbag.\nB. Guitar.\nC. Drum kit.\nD. Laptop.\nAnswer with the option's letter from the given choices directly.",
761,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "254-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 762,
"target": "B",
"doc": {
"video_id": "255",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=CQUphYL0vY8",
"videoID": "CQUphYL0vY8",
"question_id": "255-1",
"task_type": "Action Recognition",
"question": "What is the first thing the heroine in the video does after going out in the morning?",
"options": [
"A. Drive to work.",
"B. Boxing exercise.",
"C. Have breakfast with colleagues.",
"D. Running exercise."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first thing the heroine in the video does after going out in the morning?\nOption:\nA. Drive to work.\nB. Boxing exercise.\nC. Have breakfast with colleagues.\nD. Running exercise.\nAnswer with the option's letter from the given choices directly.",
762,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "255-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 763,
"target": "C",
"doc": {
"video_id": "255",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=CQUphYL0vY8",
"videoID": "CQUphYL0vY8",
"question_id": "255-2",
"task_type": "Counting Problem",
"question": "How many meals did the woman in the video have with other people?",
"options": [
"A. 2.",
"B. 4.",
"C. 3.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many meals did the woman in the video have with other people?\nOption:\nA. 2.\nB. 4.\nC. 3.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
763,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "255-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 764,
"target": "A",
"doc": {
"video_id": "255",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=CQUphYL0vY8",
"videoID": "CQUphYL0vY8",
"question_id": "255-3",
"task_type": "Spatial Reasoning",
"question": "Where did the heroine in the video change into high heels?",
"options": [
"A. Curb.",
"B. Her home.",
"C. Her company.",
"D. Restaurant."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where did the heroine in the video change into high heels?\nOption:\nA. Curb.\nB. Her home.\nC. Her company.\nD. Restaurant.\nAnswer with the option's letter from the given choices directly.",
764,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "255-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 765,
"target": "D",
"doc": {
"video_id": "256",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=aFfMGy94sjE",
"videoID": "aFfMGy94sjE",
"question_id": "256-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. The daily work of a prosecutor.",
"B. The daily law enforcement process of the police.",
"C. The daily work of a prison guard.",
"D. The daily work of a police records specialist."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. The daily work of a prosecutor.\nB. The daily law enforcement process of the police.\nC. The daily work of a prison guard.\nD. The daily work of a police records specialist.\nAnswer with the option's letter from the given choices directly.",
765,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "256-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 766,
"target": "B",
"doc": {
"video_id": "256",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=aFfMGy94sjE",
"videoID": "aFfMGy94sjE",
"question_id": "256-2",
"task_type": "Attribute Perception",
"question": "What country is the lady in the video from?",
"options": [
"A. Britain.",
"B. Canada.",
"C. America.",
"D. Australia."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What country is the lady in the video from?\nOption:\nA. Britain.\nB. Canada.\nC. America.\nD. Australia.\nAnswer with the option's letter from the given choices directly.",
766,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "256-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 767,
"target": "C",
"doc": {
"video_id": "256",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=aFfMGy94sjE",
"videoID": "aFfMGy94sjE",
"question_id": "256-3",
"task_type": "Action Recognition",
"question": "Which activity is shown in the video?",
"options": [
"A. Arresting a suspect.",
"B. Having a lunch.",
"C. Working at the computer.",
"D. Shopping in the store."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity is shown in the video?\nOption:\nA. Arresting a suspect.\nB. Having a lunch.\nC. Working at the computer.\nD. Shopping in the store.\nAnswer with the option's letter from the given choices directly.",
767,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "256-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 768,
"target": "A",
"doc": {
"video_id": "257",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=s-lM2uwiwyQ",
"videoID": "s-lM2uwiwyQ",
"question_id": "257-1",
"task_type": "Action Recognition",
"question": "What is the lady's mode of transport to work in the video?",
"options": [
"A. By driving a car.",
"B. By bus.",
"C. By subway.",
"D. By walking."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the lady's mode of transport to work in the video?\nOption:\nA. By driving a car.\nB. By bus.\nC. By subway.\nD. By walking.\nAnswer with the option's letter from the given choices directly.",
768,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "257-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 769,
"target": "D",
"doc": {
"video_id": "257",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=s-lM2uwiwyQ",
"videoID": "s-lM2uwiwyQ",
"question_id": "257-2",
"task_type": "Action Recognition",
"question": "Which of these tests does the lady shown in the video perform on the patient NOT include?",
"options": [
"A. Taking pulse.",
"B. Taking body temperature.",
"C. Measuring blood pressure.",
"D. Testing blood sugar."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of these tests does the lady shown in the video perform on the patient NOT include?\nOption:\nA. Taking pulse.\nB. Taking body temperature.\nC. Measuring blood pressure.\nD. Testing blood sugar.\nAnswer with the option's letter from the given choices directly.",
769,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "257-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 770,
"target": "B",
"doc": {
"video_id": "257",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=s-lM2uwiwyQ",
"videoID": "s-lM2uwiwyQ",
"question_id": "257-3",
"task_type": "Counting Problem",
"question": "How many people are shown having lunch with the woman in the video?",
"options": [
"A. 2.",
"B. 1.",
"C. 3.",
"D. 0."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are shown having lunch with the woman in the video?\nOption:\nA. 2.\nB. 1.\nC. 3.\nD. 0.\nAnswer with the option's letter from the given choices directly.",
770,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "257-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 771,
"target": "C",
"doc": {
"video_id": "258",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=ZfNSxRiYfZQ",
"videoID": "ZfNSxRiYfZQ",
"question_id": "258-1",
"task_type": "Attribute Perception",
"question": "What color are the shoes the man in the video is wearing during his run?",
"options": [
"A. Black.",
"B. Blue.",
"C. White.",
"D. Yellow."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the shoes the man in the video is wearing during his run?\nOption:\nA. Black.\nB. Blue.\nC. White.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
771,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "258-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 772,
"target": "A",
"doc": {
"video_id": "258",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=ZfNSxRiYfZQ",
"videoID": "ZfNSxRiYfZQ",
"question_id": "258-2",
"task_type": "Action Recognition",
"question": "Which of the following items is the man in the video doing at the gym?",
"options": [
"A. Bench press.",
"B. Seated row.",
"C. Leg press.",
"D. Running."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following items is the man in the video doing at the gym?\nOption:\nA. Bench press.\nB. Seated row.\nC. Leg press.\nD. Running.\nAnswer with the option's letter from the given choices directly.",
772,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "258-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 773,
"target": "D",
"doc": {
"video_id": "258",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=ZfNSxRiYfZQ",
"videoID": "ZfNSxRiYfZQ",
"question_id": "258-3",
"task_type": "Action Recognition",
"question": "What mode of transportation does the man in the video use to go out?",
"options": [
"A. By bike.",
"B. By bus.",
"C. By subway.",
"D. By driving a car."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What mode of transportation does the man in the video use to go out?\nOption:\nA. By bike.\nB. By bus.\nC. By subway.\nD. By driving a car.\nAnswer with the option's letter from the given choices directly.",
773,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "258-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 774,
"target": "B",
"doc": {
"video_id": "259",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=RU8IfXp7LbE",
"videoID": "RU8IfXp7LbE",
"question_id": "259-1",
"task_type": "OCR Problems",
"question": "What time does the man in the video get up?",
"options": [
"A. 5:59 a.m.",
"B. 6:00 a.m.",
"C. 6:00 p.m.",
"D. 5:59 p.m."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What time does the man in the video get up?\nOption:\nA. 5:59 a.m.\nB. 6:00 a.m.\nC. 6:00 p.m.\nD. 5:59 p.m.\nAnswer with the option's letter from the given choices directly.",
774,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "259-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 775,
"target": "C",
"doc": {
"video_id": "259",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=RU8IfXp7LbE",
"videoID": "RU8IfXp7LbE",
"question_id": "259-2",
"task_type": "Action Recognition",
"question": "Which activity is not shown in the video?",
"options": [
"A. Using the bathroom.",
"B. Swimming.",
"C. Taking a shower.",
"D. Brushing teeth."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity is not shown in the video?\nOption:\nA. Using the bathroom.\nB. Swimming.\nC. Taking a shower.\nD. Brushing teeth.\nAnswer with the option's letter from the given choices directly.",
775,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "259-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 776,
"target": "A",
"doc": {
"video_id": "259",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=RU8IfXp7LbE",
"videoID": "RU8IfXp7LbE",
"question_id": "259-3",
"task_type": "Action Recognition",
"question": "What does the man shown in the video do after swimming?",
"options": [
"A. He refuels the car.",
"B. He performs a show for children.",
"C. He falls asleep.",
"D. He takes a shower."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the man shown in the video do after swimming?\nOption:\nA. He refuels the car.\nB. He performs a show for children.\nC. He falls asleep.\nD. He takes a shower.\nAnswer with the option's letter from the given choices directly.",
776,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "259-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 777,
"target": "D",
"doc": {
"video_id": "260",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3E4Zca1p64s",
"videoID": "3E4Zca1p64s",
"question_id": "260-1",
"task_type": "Object Recognition",
"question": "What kind of cell phone is the person in the video using?",
"options": [
"A. iPhone 13.",
"B. Google Pixel 8.",
"C. Google Pixel 2.",
"D. Google Pixel 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of cell phone is the person in the video using?\nOption:\nA. iPhone 13.\nB. Google Pixel 8.\nC. Google Pixel 2.\nD. Google Pixel 3.\nAnswer with the option's letter from the given choices directly.",
777,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "260-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 778,
"target": "B",
"doc": {
"video_id": "260",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3E4Zca1p64s",
"videoID": "3E4Zca1p64s",
"question_id": "260-2",
"task_type": "Action Recognition",
"question": "What does the person in the video do after brushing his teeth?",
"options": [
"A. Driving a car.",
"B. Spraying perfume.",
"C. Drinking water.",
"D. Charging the cell phone."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the person in the video do after brushing his teeth?\nOption:\nA. Driving a car.\nB. Spraying perfume.\nC. Drinking water.\nD. Charging the cell phone.\nAnswer with the option's letter from the given choices directly.",
778,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "260-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 779,
"target": "C",
"doc": {
"video_id": "260",
"duration": "short",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3E4Zca1p64s",
"videoID": "3E4Zca1p64s",
"question_id": "260-3",
"task_type": "Action Recognition",
"question": "What feature of the cell phone did the person in the video use?",
"options": [
"A. Phone call.",
"B. Video editing.",
"C. Photo shooting.",
"D. Online games."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What feature of the cell phone did the person in the video use?\nOption:\nA. Phone call.\nB. Video editing.\nC. Photo shooting.\nD. Online games.\nAnswer with the option's letter from the given choices directly.",
779,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "260-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 780,
"target": "B",
"doc": {
"video_id": "261",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=aWBRHk_-5as",
"videoID": "aWBRHk_-5as",
"question_id": "261-1",
"task_type": "Spatial Reasoning",
"question": "Which region is the most likely destination of the subject in the video?",
"options": [
"A. Europe.",
"B. Asia.",
"C. America.",
"D. Australia."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which region is the most likely destination of the subject in the video?\nOption:\nA. Europe.\nB. Asia.\nC. America.\nD. Australia.\nAnswer with the option's letter from the given choices directly.",
780,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "261-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 781,
"target": "C",
"doc": {
"video_id": "261",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=aWBRHk_-5as",
"videoID": "aWBRHk_-5as",
"question_id": "261-2",
"task_type": "Attribute Perception",
"question": "Which color shirt is the protagonist wearing when he speaks in the car?",
"options": [
"A. Black.",
"B. Yellow.",
"C. White.",
"D. Blue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color shirt is the protagonist wearing when he speaks in the car?\nOption:\nA. Black.\nB. Yellow.\nC. White.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
781,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "261-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 782,
"target": "A",
"doc": {
"video_id": "261",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=aWBRHk_-5as",
"videoID": "aWBRHk_-5as",
"question_id": "261-3",
"task_type": "Action Recognition",
"question": "Which activity does the protagonist not engage in?",
"options": [
"A. Dancing.",
"B. Worship.",
"C. Taste the food.",
"D. Take a picture."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which activity does the protagonist not engage in?\nOption:\nA. Dancing.\nB. Worship.\nC. Taste the food.\nD. Take a picture.\nAnswer with the option's letter from the given choices directly.",
782,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "261-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 783,
"target": "D",
"doc": {
"video_id": "262",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=PaC3CEkCD6k",
"videoID": "PaC3CEkCD6k",
"question_id": "262-1",
"task_type": "Object Recognition",
"question": "What kind of transport does the protagonist in the video take to set off?",
"options": [
"A. Train.",
"B. Airliner.",
"C. Bus.",
"D. Cruise liner."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of transport does the protagonist in the video take to set off?\nOption:\nA. Train.\nB. Airliner.\nC. Bus.\nD. Cruise liner.\nAnswer with the option's letter from the given choices directly.",
783,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "262-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 784,
"target": "A",
"doc": {
"video_id": "262",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=PaC3CEkCD6k",
"videoID": "PaC3CEkCD6k",
"question_id": "262-2",
"task_type": "Spatial Reasoning",
"question": "During which season does the protagonists in the video probably set out?",
"options": [
"A. Winter.",
"B. Summer.",
"C. Spring.",
"D. Autumn."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: During which season does the protagonists in the video probably set out?\nOption:\nA. Winter.\nB. Summer.\nC. Spring.\nD. Autumn.\nAnswer with the option's letter from the given choices directly.",
784,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "262-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 785,
"target": "B",
"doc": {
"video_id": "262",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=PaC3CEkCD6k",
"videoID": "PaC3CEkCD6k",
"question_id": "262-3",
"task_type": "Counting Problem",
"question": "How many people join in a toast at the end of the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people join in a toast at the end of the video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
785,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "262-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 786,
"target": "C",
"doc": {
"video_id": "263",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=AyIqI6P8eN8",
"videoID": "AyIqI6P8eN8",
"question_id": "263-1",
"task_type": "Action Recognition",
"question": "What type of street performance is showcased in the video?",
"options": [
"A. Martial art.",
"B. Musical instrument performance.",
"C. Street dance.",
"D. Streetball."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of street performance is showcased in the video?\nOption:\nA. Martial art.\nB. Musical instrument performance.\nC. Street dance.\nD. Streetball.\nAnswer with the option's letter from the given choices directly.",
786,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "263-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 787,
"target": "B",
"doc": {
"video_id": "263",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=AyIqI6P8eN8",
"videoID": "AyIqI6P8eN8",
"question_id": "263-2",
"task_type": "Object Recognition",
"question": "Which building does the protagonist capture on his phone in the video?",
"options": [
"A. Pantheon.",
"B. The Colosseum.",
"C. The Arc de Triomphe.",
"D. The temple of Baalbek."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which building does the protagonist capture on his phone in the video?\nOption:\nA. Pantheon.\nB. The Colosseum.\nC. The Arc de Triomphe.\nD. The temple of Baalbek.\nAnswer with the option's letter from the given choices directly.",
787,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "263-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 788,
"target": "D",
"doc": {
"video_id": "263",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=AyIqI6P8eN8",
"videoID": "AyIqI6P8eN8",
"question_id": "263-3",
"task_type": "Spatial Reasoning",
"question": "Where are the final scenes in the video most likely shot?",
"options": [
"A. Satellite photography.",
"B. Drone.",
"C. High-rise.",
"D. Airplane."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where are the final scenes in the video most likely shot?\nOption:\nA. Satellite photography.\nB. Drone.\nC. High-rise.\nD. Airplane.\nAnswer with the option's letter from the given choices directly.",
788,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "263-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 789,
"target": "D",
"doc": {
"video_id": "264",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=8e05W-wi38g",
"videoID": "8e05W-wi38g",
"question_id": "264-1",
"task_type": "Spatial Reasoning",
"question": "Which country is the main character in the video most likely to travel to?",
"options": [
"A. Britain.",
"B. New Zealand.",
"C. Canada.",
"D. Australia."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country is the main character in the video most likely to travel to?\nOption:\nA. Britain.\nB. New Zealand.\nC. Canada.\nD. Australia.\nAnswer with the option's letter from the given choices directly.",
789,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "264-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 790,
"target": "A",
"doc": {
"video_id": "264",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=8e05W-wi38g",
"videoID": "8e05W-wi38g",
"question_id": "264-2",
"task_type": "Object Recognition",
"question": "Which animal is shown in the video?",
"options": [
"A. Kangaroo.",
"B. Giraffe.",
"C. Donkey.",
"D. Horse."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animal is shown in the video?\nOption:\nA. Kangaroo.\nB. Giraffe.\nC. Donkey.\nD. Horse.\nAnswer with the option's letter from the given choices directly.",
790,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "264-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 791,
"target": "C",
"doc": {
"video_id": "264",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=8e05W-wi38g",
"videoID": "8e05W-wi38g",
"question_id": "264-3",
"task_type": "Action Recognition",
"question": "What activities does the protagonist do with his friends in the snow?",
"options": [
"A. Skiing.",
"B. Have a snowball fight.",
"C. Build a snowman.",
"D. Snow picnic."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What activities does the protagonist do with his friends in the snow?\nOption:\nA. Skiing.\nB. Have a snowball fight.\nC. Build a snowman.\nD. Snow picnic.\nAnswer with the option's letter from the given choices directly.",
791,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "264-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 792,
"target": "B",
"doc": {
"video_id": "265",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=5-4Od1qn53s",
"videoID": "5-4Od1qn53s",
"question_id": "265-1",
"task_type": "OCR Problems",
"question": "What is the lateral distance of the Buddha in the video?",
"options": [
"A. 48m.",
"B. 46m.",
"C. 45m.",
"D. 47m."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the lateral distance of the Buddha in the video?\nOption:\nA. 48m.\nB. 46m.\nC. 45m.\nD. 47m.\nAnswer with the option's letter from the given choices directly.",
792,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "265-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 793,
"target": "C",
"doc": {
"video_id": "265",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=5-4Od1qn53s",
"videoID": "5-4Od1qn53s",
"question_id": "265-2",
"task_type": "Spatial Reasoning",
"question": "Which city is the video showcasing?",
"options": [
"A. Malaysia.",
"B. New Delhi.",
"C. Bangkok.",
"D. Singapore."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which city is the video showcasing?\nOption:\nA. Malaysia.\nB. New Delhi.\nC. Bangkok.\nD. Singapore.\nAnswer with the option's letter from the given choices directly.",
793,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "265-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 794,
"target": "A",
"doc": {
"video_id": "265",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=5-4Od1qn53s",
"videoID": "5-4Od1qn53s",
"question_id": "265-3",
"task_type": "Attribute Perception",
"question": "In the video, what does the man's facial expression look like when he sees the scorpion skewer?",
"options": [
"A. Disgust.",
"B. Joy.",
"C. Sadness.",
"D. Calmness."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what does the man's facial expression look like when he sees the scorpion skewer?\nOption:\nA. Disgust.\nB. Joy.\nC. Sadness.\nD. Calmness.\nAnswer with the option's letter from the given choices directly.",
794,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "265-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 795,
"target": "D",
"doc": {
"video_id": "266",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=2gFxZmb0dMo",
"videoID": "2gFxZmb0dMo",
"question_id": "266-1",
"task_type": "Object Reasoning",
"question": "What is the most likely role of the blonde woman in the video, clad in a blue T-shirt and black shorts?",
"options": [
"A. Driver.",
"B. Teacher.",
"C. Tourist.",
"D. Tour guide."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most likely role of the blonde woman in the video, clad in a blue T-shirt and black shorts?\nOption:\nA. Driver.\nB. Teacher.\nC. Tourist.\nD. Tour guide.\nAnswer with the option's letter from the given choices directly.",
795,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "266-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 796,
"target": "C",
"doc": {
"video_id": "266",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=2gFxZmb0dMo",
"videoID": "2gFxZmb0dMo",
"question_id": "266-2",
"task_type": "Action Recognition",
"question": "What are the people in the video doing in the lake?",
"options": [
"A. Fishing.",
"B. Swimming.",
"C. Boating.",
"D. Diving."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people in the video doing in the lake?\nOption:\nA. Fishing.\nB. Swimming.\nC. Boating.\nD. Diving.\nAnswer with the option's letter from the given choices directly.",
796,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "266-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 797,
"target": "B",
"doc": {
"video_id": "266",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=2gFxZmb0dMo",
"videoID": "2gFxZmb0dMo",
"question_id": "266-3",
"task_type": "Attribute Perception",
"question": "Which hair color does the little boy have while playing the guitar in the video?",
"options": [
"A. Black.",
"B. Golden.",
"C. White.",
"D. Grey."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which hair color does the little boy have while playing the guitar in the video?\nOption:\nA. Black.\nB. Golden.\nC. White.\nD. Grey.\nAnswer with the option's letter from the given choices directly.",
797,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "266-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 798,
"target": "A",
"doc": {
"video_id": "267",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=Atf_Af1q_5w",
"videoID": "Atf_Af1q_5w",
"question_id": "267-1",
"task_type": "Attribute Perception",
"question": "What scenery does the video mainly show?",
"options": [
"A. Natural scenery.",
"B. Cityscape.",
"C. Rural scenery.",
"D. Architectural scenery."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What scenery does the video mainly show?\nOption:\nA. Natural scenery.\nB. Cityscape.\nC. Rural scenery.\nD. Architectural scenery.\nAnswer with the option's letter from the given choices directly.",
798,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "267-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 799,
"target": "D",
"doc": {
"video_id": "267",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=Atf_Af1q_5w",
"videoID": "Atf_Af1q_5w",
"question_id": "267-2",
"task_type": "Counting Problem",
"question": "How many people have their backs to the camera in the sunset scene?",
"options": [
"A. 0.",
"B. 1.",
"C. 3.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people have their backs to the camera in the sunset scene?\nOption:\nA. 0.\nB. 1.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
799,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "267-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 800,
"target": "B",
"doc": {
"video_id": "267",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=Atf_Af1q_5w",
"videoID": "Atf_Af1q_5w",
"question_id": "267-3",
"task_type": "Counting Problem",
"question": "How many main tourists are shown in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many main tourists are shown in the video?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
800,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "267-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 801,
"target": "C",
"doc": {
"video_id": "268",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=27YZLpw0M98",
"videoID": "27YZLpw0M98",
"question_id": "268-1",
"task_type": "OCR Problems",
"question": "What camera does the protagonist in the video use to shoot?",
"options": [
"A. Canon.",
"B. Leica.",
"C. Nikon.",
"D. SONY."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What camera does the protagonist in the video use to shoot?\nOption:\nA. Canon.\nB. Leica.\nC. Nikon.\nD. SONY.\nAnswer with the option's letter from the given choices directly.",
801,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "268-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 802,
"target": "B",
"doc": {
"video_id": "268",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=27YZLpw0M98",
"videoID": "27YZLpw0M98",
"question_id": "268-2",
"task_type": "Attribute Perception",
"question": "What color clothing is the protagonist wearing while carrying a tripod over his shoulder?",
"options": [
"A. Yellow.",
"B. Black.",
"C. Orange.",
"D. Blue."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color clothing is the protagonist wearing while carrying a tripod over his shoulder?\nOption:\nA. Yellow.\nB. Black.\nC. Orange.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
802,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "268-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 803,
"target": "A",
"doc": {
"video_id": "268",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=27YZLpw0M98",
"videoID": "27YZLpw0M98",
"question_id": "268-3",
"task_type": "Information Synopsis",
"question": "Which of the following topics is the most likely focus of this video?",
"options": [
"A. Camera advertising.",
"B. Photography course.",
"C. Promotional film.",
"D. Publicity of tourist attractions."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following topics is the most likely focus of this video?\nOption:\nA. Camera advertising.\nB. Photography course.\nC. Promotional film.\nD. Publicity of tourist attractions.\nAnswer with the option's letter from the given choices directly.",
803,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "268-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 804,
"target": "A",
"doc": {
"video_id": "269",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=ZXoaMa6jlO4",
"videoID": "ZXoaMa6jlO4",
"question_id": "269-1",
"task_type": "Action Recognition",
"question": "In what manner do the children in the video get transported by their parents?",
"options": [
"A. Stroller.",
"B. Car.",
"C. Bike.",
"D. Carry in the bosom."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what manner do the children in the video get transported by their parents?\nOption:\nA. Stroller.\nB. Car.\nC. Bike.\nD. Carry in the bosom.\nAnswer with the option's letter from the given choices directly.",
804,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "269-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 805,
"target": "D",
"doc": {
"video_id": "269",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=ZXoaMa6jlO4",
"videoID": "ZXoaMa6jlO4",
"question_id": "269-2",
"task_type": "Object Recognition",
"question": "What food is the child eating in the video when the family is dining at a restaurant?",
"options": [
"A. Fruits.",
"B. Vegetable.",
"C. Milk.",
"D. Biscuit."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What food is the child eating in the video when the family is dining at a restaurant?\nOption:\nA. Fruits.\nB. Vegetable.\nC. Milk.\nD. Biscuit.\nAnswer with the option's letter from the given choices directly.",
805,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "269-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 806,
"target": "D",
"doc": {
"video_id": "269",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=ZXoaMa6jlO4",
"videoID": "ZXoaMa6jlO4",
"question_id": "269-3",
"task_type": "Object Recognition",
"question": "What is the weather like when the family goes out?",
"options": [
"A. Snow.",
"B. Overcast sky.",
"C. Rainy day.",
"D. Clear Weather."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the weather like when the family goes out?\nOption:\nA. Snow.\nB. Overcast sky.\nC. Rainy day.\nD. Clear Weather.\nAnswer with the option's letter from the given choices directly.",
806,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "269-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 807,
"target": "C",
"doc": {
"video_id": "270",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=gR5YcIJhDcU",
"videoID": "gR5YcIJhDcU",
"question_id": "270-1",
"task_type": "OCR Problems",
"question": "To which location does the person in the video travel in Australia?",
"options": [
"A. New South Wales.",
"B. Hobart.",
"C. Hallstatt.",
"D. Queensland."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: To which location does the person in the video travel in Australia?\nOption:\nA. New South Wales.\nB. Hobart.\nC. Hallstatt.\nD. Queensland.\nAnswer with the option's letter from the given choices directly.",
807,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "270-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 808,
"target": "C",
"doc": {
"video_id": "270",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=gR5YcIJhDcU",
"videoID": "gR5YcIJhDcU",
"question_id": "270-2",
"task_type": "Object Reasoning",
"question": "How does the protagonist of the video most likely reach their destination?",
"options": [
"A. Airplane.",
"B. Bus.",
"C. Rail transport.",
"D. Boat."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the protagonist of the video most likely reach their destination?\nOption:\nA. Airplane.\nB. Bus.\nC. Rail transport.\nD. Boat.\nAnswer with the option's letter from the given choices directly.",
808,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "270-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 809,
"target": "B",
"doc": {
"video_id": "270",
"duration": "short",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=gR5YcIJhDcU",
"videoID": "gR5YcIJhDcU",
"question_id": "270-3",
"task_type": "Attribute Perception",
"question": "What color is the train in the video?",
"options": [
"A. Yellow.",
"B. Red.",
"C. Blue.",
"D. Green."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the train in the video?\nOption:\nA. Yellow.\nB. Red.\nC. Blue.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
809,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "270-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 810,
"target": "B",
"doc": {
"video_id": "271",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=yXl0Vrk5ssk",
"videoID": "yXl0Vrk5ssk",
"question_id": "271-1",
"task_type": "Counting Problem",
"question": "What is the total number of baby birds shown in the video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the total number of baby birds shown in the video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
810,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "271-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 811,
"target": "C",
"doc": {
"video_id": "271",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=yXl0Vrk5ssk",
"videoID": "yXl0Vrk5ssk",
"question_id": "271-2",
"task_type": "Spatial Perception",
"question": "Where is this video most likely shot?",
"options": [
"A. In a cafe.",
"B. In a room.",
"C. In a nest.",
"D. Upon the roof."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is this video most likely shot?\nOption:\nA. In a cafe.\nB. In a room.\nC. In a nest.\nD. Upon the roof.\nAnswer with the option's letter from the given choices directly.",
811,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "271-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 812,
"target": "A",
"doc": {
"video_id": "271",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=yXl0Vrk5ssk",
"videoID": "yXl0Vrk5ssk",
"question_id": "271-3",
"task_type": "Action Recognition",
"question": "Which of the following accurately describes the events depicted in the video?",
"options": [
"A. A yellow mechine scares baby birds.",
"B. A yellow mechine preys in the nest.",
"C. A yellow mechine brings baby birds out the nest.",
"D. A yellow mechine destroys the nest."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following accurately describes the events depicted in the video?\nOption:\nA. A yellow mechine scares baby birds.\nB. A yellow mechine preys in the nest.\nC. A yellow mechine brings baby birds out the nest.\nD. A yellow mechine destroys the nest.\nAnswer with the option's letter from the given choices directly.",
812,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "271-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 813,
"target": "D",
"doc": {
"video_id": "272",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=YsPwg3dUwjQ",
"videoID": "YsPwg3dUwjQ",
"question_id": "272-1",
"task_type": "Object Recognition",
"question": "What is the animal in this video?",
"options": [
"A. A mouse.",
"B. A rabbit.",
"C. A dog.",
"D. A echidna."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the animal in this video?\nOption:\nA. A mouse.\nB. A rabbit.\nC. A dog.\nD. A echidna.\nAnswer with the option's letter from the given choices directly.",
813,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "272-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 814,
"target": "A",
"doc": {
"video_id": "272",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=YsPwg3dUwjQ",
"videoID": "YsPwg3dUwjQ",
"question_id": "272-2",
"task_type": "Counting Problem",
"question": "How many people appear in this video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people appear in this video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
814,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "272-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 815,
"target": "C",
"doc": {
"video_id": "272",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=YsPwg3dUwjQ",
"videoID": "YsPwg3dUwjQ",
"question_id": "272-3",
"task_type": "Attribute Perception",
"question": "What is the shape of its nose?",
"options": [
"A. Triangle.",
"B. Sphere.",
"C. Long.",
"D. Cube."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the shape of its nose?\nOption:\nA. Triangle.\nB. Sphere.\nC. Long.\nD. Cube.\nAnswer with the option's letter from the given choices directly.",
815,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "272-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 816,
"target": "D",
"doc": {
"video_id": "273",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=Kn10Jf1x24Q",
"videoID": "Kn10Jf1x24Q",
"question_id": "273-1",
"task_type": "Spatial Perception",
"question": "Where does the first bunny hide?",
"options": [
"A. In a hole.",
"B. In grass.",
"C. Behind a rock.",
"D. Under a car."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the first bunny hide?\nOption:\nA. In a hole.\nB. In grass.\nC. Behind a rock.\nD. Under a car.\nAnswer with the option's letter from the given choices directly.",
816,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "273-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 817,
"target": "B",
"doc": {
"video_id": "273",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=Kn10Jf1x24Q",
"videoID": "Kn10Jf1x24Q",
"question_id": "273-2",
"task_type": "Action Recognition",
"question": "What is the bunny's behavior after being placed on the ground?",
"options": [
"A. Finding its mate.",
"B. Attacking the cameraman.",
"C. Hiding under the car.",
"D. Drilling into the hole."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the bunny's behavior after being placed on the ground?\nOption:\nA. Finding its mate.\nB. Attacking the cameraman.\nC. Hiding under the car.\nD. Drilling into the hole.\nAnswer with the option's letter from the given choices directly.",
817,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "273-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 818,
"target": "B",
"doc": {
"video_id": "273",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=Kn10Jf1x24Q",
"videoID": "Kn10Jf1x24Q",
"question_id": "273-3",
"task_type": "Attribute Perception",
"question": "What color is the bunny in the video?",
"options": [
"A. White.",
"B. Yellow.",
"C. Black.",
"D. Red."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the bunny in the video?\nOption:\nA. White.\nB. Yellow.\nC. Black.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
818,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "273-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 819,
"target": "C",
"doc": {
"video_id": "274",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=vszWsYOdnPg",
"videoID": "vszWsYOdnPg",
"question_id": "274-1",
"task_type": "Object Recognition",
"question": "What are the animals in this video?",
"options": [
"A. Cats.",
"B. Rabbits.",
"C. Skunks.",
"D. Foxes."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the animals in this video?\nOption:\nA. Cats.\nB. Rabbits.\nC. Skunks.\nD. Foxes.\nAnswer with the option's letter from the given choices directly.",
819,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "274-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 820,
"target": "B",
"doc": {
"video_id": "274",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=vszWsYOdnPg",
"videoID": "vszWsYOdnPg",
"question_id": "274-2",
"task_type": "Counting Problem",
"question": "How many animals appear in this video?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 7."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many animals appear in this video?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
820,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "274-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 821,
"target": "A",
"doc": {
"video_id": "274",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=vszWsYOdnPg",
"videoID": "vszWsYOdnPg",
"question_id": "274-3",
"task_type": "Action Recognition",
"question": "What do these animals do in this video?",
"options": [
"A. Sniffing the cyclist.",
"B. Eating some food.",
"C. Climbing trees.",
"D. Attacking people."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do these animals do in this video?\nOption:\nA. Sniffing the cyclist.\nB. Eating some food.\nC. Climbing trees.\nD. Attacking people.\nAnswer with the option's letter from the given choices directly.",
821,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "274-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 822,
"target": "A",
"doc": {
"video_id": "275",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZHWZf1Z4B5k",
"videoID": "ZHWZf1Z4B5k",
"question_id": "275-1",
"task_type": "Action Recognition",
"question": "What does the person do in the video?",
"options": [
"A. Help the bunny shower.",
"B. Take a shower.",
"C. Clean the tub.",
"D. Have a rest."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the person do in the video?\nOption:\nA. Help the bunny shower.\nB. Take a shower.\nC. Clean the tub.\nD. Have a rest.\nAnswer with the option's letter from the given choices directly.",
822,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "275-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 823,
"target": "C",
"doc": {
"video_id": "275",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZHWZf1Z4B5k",
"videoID": "ZHWZf1Z4B5k",
"question_id": "275-2",
"task_type": "Spatial Perception",
"question": "Where is this video most likely shot?",
"options": [
"A. In a swimming pool.",
"B. In a bedroom.",
"C. In a bathroom.",
"D. In a river."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is this video most likely shot?\nOption:\nA. In a swimming pool.\nB. In a bedroom.\nC. In a bathroom.\nD. In a river.\nAnswer with the option's letter from the given choices directly.",
823,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "275-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 824,
"target": "D",
"doc": {
"video_id": "275",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZHWZf1Z4B5k",
"videoID": "ZHWZf1Z4B5k",
"question_id": "275-3",
"task_type": "Action Recognition",
"question": "What is the posture of the bunny?",
"options": [
"A. Laying down.",
"B. Jumping.",
"C. Screaming.",
"D. Standing."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the posture of the bunny?\nOption:\nA. Laying down.\nB. Jumping.\nC. Screaming.\nD. Standing.\nAnswer with the option's letter from the given choices directly.",
824,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "275-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 825,
"target": "D",
"doc": {
"video_id": "276",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=6Cr_8tvvQ0k",
"videoID": "6Cr_8tvvQ0k",
"question_id": "276-1",
"task_type": "Object Recognition",
"question": "What is the animal in this video?",
"options": [
"A. A panda.",
"B. A monkey.",
"C. A cat.",
"D. A chimp."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the animal in this video?\nOption:\nA. A panda.\nB. A monkey.\nC. A cat.\nD. A chimp.\nAnswer with the option's letter from the given choices directly.",
825,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "276-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 826,
"target": "A",
"doc": {
"video_id": "276",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=6Cr_8tvvQ0k",
"videoID": "6Cr_8tvvQ0k",
"question_id": "276-2",
"task_type": "Action Recognition",
"question": "What does the animal do in the video?",
"options": [
"A. It is pulled in circles by a person.",
"B. It jumps on a person.",
"C. It pulls a tree.",
"D. It flies to the sky."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the animal do in the video?\nOption:\nA. It is pulled in circles by a person.\nB. It jumps on a person.\nC. It pulls a tree.\nD. It flies to the sky.\nAnswer with the option's letter from the given choices directly.",
826,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "276-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 827,
"target": "C",
"doc": {
"video_id": "276",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=6Cr_8tvvQ0k",
"videoID": "6Cr_8tvvQ0k",
"question_id": "276-3",
"task_type": "Attribute Perception",
"question": "What color is the animal in this video?",
"options": [
"A. Yellow.",
"B. White.",
"C. Black.",
"D. Red."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the animal in this video?\nOption:\nA. Yellow.\nB. White.\nC. Black.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
827,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "276-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 828,
"target": "A",
"doc": {
"video_id": "277",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=H8fbAFOMTp4",
"videoID": "H8fbAFOMTp4",
"question_id": "277-1",
"task_type": "Attribute Perception",
"question": "What color is the cat?",
"options": [
"A. Yellow.",
"B. White.",
"C. Black.",
"D. Blue."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the cat?\nOption:\nA. Yellow.\nB. White.\nC. Black.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
828,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "277-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 829,
"target": "B",
"doc": {
"video_id": "277",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=H8fbAFOMTp4",
"videoID": "H8fbAFOMTp4",
"question_id": "277-2",
"task_type": "Action Recognition",
"question": "What does the animal do to the glove in the video?",
"options": [
"A. Sleeping on it.",
"B. Biting and pulling it.",
"C. Drilling in it.",
"D. Not caring it."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the animal do to the glove in the video?\nOption:\nA. Sleeping on it.\nB. Biting and pulling it.\nC. Drilling in it.\nD. Not caring it.\nAnswer with the option's letter from the given choices directly.",
829,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "277-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 830,
"target": "B",
"doc": {
"video_id": "277",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=H8fbAFOMTp4",
"videoID": "H8fbAFOMTp4",
"question_id": "277-3",
"task_type": "Counting Problem",
"question": "In the video, how many airboxes can be observed?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how many airboxes can be observed?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
830,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "277-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 831,
"target": "B",
"doc": {
"video_id": "278",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=fQVhppRP4Wo",
"videoID": "fQVhppRP4Wo",
"question_id": "278-1",
"task_type": "Counting Problem",
"question": "How many foxes appear in this video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many foxes appear in this video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
831,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "278-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 832,
"target": "D",
"doc": {
"video_id": "278",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=fQVhppRP4Wo",
"videoID": "fQVhppRP4Wo",
"question_id": "278-2",
"task_type": "Attribute Perception",
"question": "What color are the foxes in the video?",
"options": [
"A. Blue.",
"B. Pink.",
"C. Black.",
"D. Yellow."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the foxes in the video?\nOption:\nA. Blue.\nB. Pink.\nC. Black.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
832,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "278-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 833,
"target": "D",
"doc": {
"video_id": "278",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=fQVhppRP4Wo",
"videoID": "fQVhppRP4Wo",
"question_id": "278-3",
"task_type": "Action Recognition",
"question": "What is the behavior of the fox when a person appears in the video?",
"options": [
"A. It eats food given by the person.",
"B. It lays down near the person.",
"C. It sleeps with the person.",
"D. It stays on the person's head."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the behavior of the fox when a person appears in the video?\nOption:\nA. It eats food given by the person.\nB. It lays down near the person.\nC. It sleeps with the person.\nD. It stays on the person's head.\nAnswer with the option's letter from the given choices directly.",
833,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "278-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 834,
"target": "C",
"doc": {
"video_id": "279",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=hsKmC7RwHvo",
"videoID": "hsKmC7RwHvo",
"question_id": "279-1",
"task_type": "Object Recognition",
"question": "What is the animal in this video?",
"options": [
"A. A cat.",
"B. A dog.",
"C. A raccoon.",
"D. A panda."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the animal in this video?\nOption:\nA. A cat.\nB. A dog.\nC. A raccoon.\nD. A panda.\nAnswer with the option's letter from the given choices directly.",
834,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "279-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 835,
"target": "A",
"doc": {
"video_id": "279",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=hsKmC7RwHvo",
"videoID": "hsKmC7RwHvo",
"question_id": "279-2",
"task_type": "Action Recognition",
"question": "What does the animal do in the video?",
"options": [
"A. It eats the food and puts out the door.",
"B. It finds its mates.",
"C. It climbs onto the table.",
"D. It knocks the door."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the animal do in the video?\nOption:\nA. It eats the food and puts out the door.\nB. It finds its mates.\nC. It climbs onto the table.\nD. It knocks the door.\nAnswer with the option's letter from the given choices directly.",
835,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "279-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 836,
"target": "B",
"doc": {
"video_id": "279",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=hsKmC7RwHvo",
"videoID": "hsKmC7RwHvo",
"question_id": "279-3",
"task_type": "Temporal Perception",
"question": "What time of day is depicted in this video?",
"options": [
"A. Afternoon.",
"B. Night.",
"C. Dusk.",
"D. Morning."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What time of day is depicted in this video?\nOption:\nA. Afternoon.\nB. Night.\nC. Dusk.\nD. Morning.\nAnswer with the option's letter from the given choices directly.",
836,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "279-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 837,
"target": "D",
"doc": {
"video_id": "280",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=S1nUMsPC1-0",
"videoID": "S1nUMsPC1-0",
"question_id": "280-1",
"task_type": "Object Recognition",
"question": "Which kind of pets is not introduced in this video?",
"options": [
"A. Dogs.",
"B. Monkeys.",
"C. Turtles.",
"D. None of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which kind of pets is not introduced in this video?\nOption:\nA. Dogs.\nB. Monkeys.\nC. Turtles.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
837,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "280-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 838,
"target": "B",
"doc": {
"video_id": "280",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=S1nUMsPC1-0",
"videoID": "S1nUMsPC1-0",
"question_id": "280-2",
"task_type": "Object Recognition",
"question": "How many people are in the family that is playing with a dog at the end of the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are in the family that is playing with a dog at the end of the video?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
838,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "280-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 839,
"target": "D",
"doc": {
"video_id": "280",
"duration": "short",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=S1nUMsPC1-0",
"videoID": "S1nUMsPC1-0",
"question_id": "280-3",
"task_type": "Attribute Perception",
"question": "Which color of fish is absent from the video?",
"options": [
"A. Orange.",
"B. Blue.",
"C. Green.",
"D. Red."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color of fish is absent from the video?\nOption:\nA. Orange.\nB. Blue.\nC. Green.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
839,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "280-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 840,
"target": "B",
"doc": {
"video_id": "281",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=VgtNv__qR0k",
"videoID": "VgtNv__qR0k",
"question_id": "281-1",
"task_type": "Action Recognition",
"question": "When receiving what instruction in the video, the movement needs to be stopped?",
"options": [
"A. Green light.",
"B. Red light.",
"C. Yellow light.",
"D. Not mentioned in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When receiving what instruction in the video, the movement needs to be stopped?\nOption:\nA. Green light.\nB. Red light.\nC. Yellow light.\nD. Not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
840,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "281-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 841,
"target": "C",
"doc": {
"video_id": "281",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=VgtNv__qR0k",
"videoID": "VgtNv__qR0k",
"question_id": "281-2",
"task_type": "Counting Problem",
"question": "How many people were exercising in the video?",
"options": [
"A. 6.",
"B. 8.",
"C. 7.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people were exercising in the video?\nOption:\nA. 6.\nB. 8.\nC. 7.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
841,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "281-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 842,
"target": "A",
"doc": {
"video_id": "281",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=VgtNv__qR0k",
"videoID": "VgtNv__qR0k",
"question_id": "281-3",
"task_type": "Action Recognition",
"question": "Who finishes training first in the video?",
"options": [
"A. Man in a grey top and black shorts.",
"B. Woman in a black top and black shorts.",
"C. Lady wearing a hat.",
"D. Woman wearing a grey top and black pants."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who finishes training first in the video?\nOption:\nA. Man in a grey top and black shorts.\nB. Woman in a black top and black shorts.\nC. Lady wearing a hat.\nD. Woman wearing a grey top and black pants.\nAnswer with the option's letter from the given choices directly.",
842,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "281-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 843,
"target": "D",
"doc": {
"video_id": "282",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=MiPw-RZMHCQ",
"videoID": "MiPw-RZMHCQ",
"question_id": "282-1",
"task_type": "Object Recognition",
"question": "Who of the two people competing in the second group in the video reaches the finish line first?",
"options": [
"A. Man in a purple top and black shorts.",
"B. Lady in a black top and black shorts.",
"C. Bare-chested man in black shorts.",
"D. Lady in a purple top and black shorts."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who of the two people competing in the second group in the video reaches the finish line first?\nOption:\nA. Man in a purple top and black shorts.\nB. Lady in a black top and black shorts.\nC. Bare-chested man in black shorts.\nD. Lady in a purple top and black shorts.\nAnswer with the option's letter from the given choices directly.",
843,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "282-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 844,
"target": "B",
"doc": {
"video_id": "282",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=MiPw-RZMHCQ",
"videoID": "MiPw-RZMHCQ",
"question_id": "282-2",
"task_type": "Counting Problem",
"question": "How many games are played in the video?",
"options": [
"A. 4.",
"B. 5.",
"C. 6.",
"D. 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many games are played in the video?\nOption:\nA. 4.\nB. 5.\nC. 6.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
844,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "282-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 845,
"target": "C",
"doc": {
"video_id": "282",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=MiPw-RZMHCQ",
"videoID": "MiPw-RZMHCQ",
"question_id": "282-3",
"task_type": "Counting Problem",
"question": "How many times did the man in the video with no shirt on his upper body race?",
"options": [
"A. 3.",
"B. 1.",
"C. 2.",
"D. 0."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times did the man in the video with no shirt on his upper body race?\nOption:\nA. 3.\nB. 1.\nC. 2.\nD. 0.\nAnswer with the option's letter from the given choices directly.",
845,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "282-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 846,
"target": "A",
"doc": {
"video_id": "283",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=UbIVqUDZOs0",
"videoID": "UbIVqUDZOs0",
"question_id": "283-1",
"task_type": "Action Recognition",
"question": "What exercise is the woman in black doing at the beginning of the video?",
"options": [
"A. Running.",
"B. Jump rope.",
"C. Play golf.",
"D. Yoga."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What exercise is the woman in black doing at the beginning of the video?\nOption:\nA. Running.\nB. Jump rope.\nC. Play golf.\nD. Yoga.\nAnswer with the option's letter from the given choices directly.",
846,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "283-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 847,
"target": "D",
"doc": {
"video_id": "283",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=UbIVqUDZOs0",
"videoID": "UbIVqUDZOs0",
"question_id": "283-2",
"task_type": "Action Recognition",
"question": "Which sport is not featured in the video?",
"options": [
"A. Running.",
"B. Jump rope.",
"C. Play golf.",
"D. Cycling."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which sport is not featured in the video?\nOption:\nA. Running.\nB. Jump rope.\nC. Play golf.\nD. Cycling.\nAnswer with the option's letter from the given choices directly.",
847,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "283-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 848,
"target": "B",
"doc": {
"video_id": "283",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=UbIVqUDZOs0",
"videoID": "UbIVqUDZOs0",
"question_id": "283-3",
"task_type": "Counting Problem",
"question": "How many times does the sport of skipping appear in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 1."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the sport of skipping appear in the video?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
848,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "283-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 849,
"target": "C",
"doc": {
"video_id": "284",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=6EIrArTyLVU",
"videoID": "6EIrArTyLVU",
"question_id": "284-1",
"task_type": "Action Recognition",
"question": "What is the person in the background, wearing a white top and black shorts, doing while the person in pink is speaking at the beginning of the video?",
"options": [
"A. Sit-ups.",
"B. Lying triceps extension.",
"C. Using the bar for pull-ups.",
"D. Push-up."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the person in the background, wearing a white top and black shorts, doing while the person in pink is speaking at the beginning of the video?\nOption:\nA. Sit-ups.\nB. Lying triceps extension.\nC. Using the bar for pull-ups.\nD. Push-up.\nAnswer with the option's letter from the given choices directly.",
849,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "284-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 850,
"target": "A",
"doc": {
"video_id": "284",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=6EIrArTyLVU",
"videoID": "6EIrArTyLVU",
"question_id": "284-2",
"task_type": "Action Recognition",
"question": "Which training exercise in the video does the man perform with the assistance of an elastic rope?",
"options": [
"A. Pull-ups with the assistance of an elastic rope.",
"B. Lying triceps extension with the assistance of an elastic rope.",
"C. Sit-ups with the assistance of an elastic rope.",
"D. Push-ups with the assistance of an elastic rope."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which training exercise in the video does the man perform with the assistance of an elastic rope?\nOption:\nA. Pull-ups with the assistance of an elastic rope.\nB. Lying triceps extension with the assistance of an elastic rope.\nC. Sit-ups with the assistance of an elastic rope.\nD. Push-ups with the assistance of an elastic rope.\nAnswer with the option's letter from the given choices directly.",
850,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "284-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 851,
"target": "C",
"doc": {
"video_id": "284",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=6EIrArTyLVU",
"videoID": "6EIrArTyLVU",
"question_id": "284-3",
"task_type": "Counting Problem",
"question": "How many sit-ups did the man in the video, wearing a pink top and black pants, perform?",
"options": [
"A. 0.",
"B. 1.",
"C. 2.",
"D. 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many sit-ups did the man in the video, wearing a pink top and black pants, perform?\nOption:\nA. 0.\nB. 1.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
851,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "284-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 852,
"target": "B",
"doc": {
"video_id": "285",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=p_-ocKiob4w",
"videoID": "p_-ocKiob4w",
"question_id": "285-1",
"task_type": "Object Recognition",
"question": "How many different types of sports equipment were demonstrated in detail by the people in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different types of sports equipment were demonstrated in detail by the people in the video?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
852,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "285-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 853,
"target": "C",
"doc": {
"video_id": "285",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=p_-ocKiob4w",
"videoID": "p_-ocKiob4w",
"question_id": "285-2",
"task_type": "Action Recognition",
"question": "Which training exercise was not demonstrated in the video?",
"options": [
"A. Tricep dips.",
"B. Push-ups.",
"C. Pull-ups.",
"D. Box jumps."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which training exercise was not demonstrated in the video?\nOption:\nA. Tricep dips.\nB. Push-ups.\nC. Pull-ups.\nD. Box jumps.\nAnswer with the option's letter from the given choices directly.",
853,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "285-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 854,
"target": "A",
"doc": {
"video_id": "285",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=p_-ocKiob4w",
"videoID": "p_-ocKiob4w",
"question_id": "285-3",
"task_type": "Action Recognition",
"question": "What is the subsequent move demonstrated in the video after completing the demonstration of box jumps?",
"options": [
"A. Step ups.",
"B. Single leg squats.",
"C. Calf raises.",
"D. Tricep dips."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subsequent move demonstrated in the video after completing the demonstration of box jumps?\nOption:\nA. Step ups.\nB. Single leg squats.\nC. Calf raises.\nD. Tricep dips.\nAnswer with the option's letter from the given choices directly.",
854,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "285-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 855,
"target": "D",
"doc": {
"video_id": "286",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=5kmnEgBSCfg",
"videoID": "5kmnEgBSCfg",
"question_id": "286-1",
"task_type": "Counting Problem",
"question": "How many sets of jumps did the two men do in total at the beginning of the video?",
"options": [
"A. 3.",
"B. 6.",
"C. 4.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many sets of jumps did the two men do in total at the beginning of the video?\nOption:\nA. 3.\nB. 6.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
855,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "286-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 856,
"target": "B",
"doc": {
"video_id": "286",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=5kmnEgBSCfg",
"videoID": "5kmnEgBSCfg",
"question_id": "286-2",
"task_type": "Object Recognition",
"question": "What is the primary item that the two men in the video utilize for their workout?",
"options": [
"A. Dumbbel.",
"B. Black bench.",
"C. Black sofa.",
"D. Black iron plate."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary item that the two men in the video utilize for their workout?\nOption:\nA. Dumbbel.\nB. Black bench.\nC. Black sofa.\nD. Black iron plate.\nAnswer with the option's letter from the given choices directly.",
856,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "286-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 857,
"target": "C",
"doc": {
"video_id": "286",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=5kmnEgBSCfg",
"videoID": "5kmnEgBSCfg",
"question_id": "286-3",
"task_type": "Object Recognition",
"question": "Who is the main speaker during the latter half of the video?",
"options": [
"A. Man wearing white top and white shorts.",
"B. Man wearing orange top and black shorts.",
"C. Man wearing white shirt and blue shorts.",
"D. Man wearing red shirt and black shorts."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the main speaker during the latter half of the video?\nOption:\nA. Man wearing white top and white shorts.\nB. Man wearing orange top and black shorts.\nC. Man wearing white shirt and blue shorts.\nD. Man wearing red shirt and black shorts.\nAnswer with the option's letter from the given choices directly.",
857,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "286-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 858,
"target": "A",
"doc": {
"video_id": "287",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=UcL6WlcyBcI",
"videoID": "UcL6WlcyBcI",
"question_id": "287-1",
"task_type": "Object Reasoning",
"question": "What is the main ability to compete in the game in the video?",
"options": [
"A. Reaction speed.",
"B. Physical strength.",
"C. Strategic planning.",
"D. Teamwork."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main ability to compete in the game in the video?\nOption:\nA. Reaction speed.\nB. Physical strength.\nC. Strategic planning.\nD. Teamwork.\nAnswer with the option's letter from the given choices directly.",
858,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "287-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 859,
"target": "D",
"doc": {
"video_id": "287",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=UcL6WlcyBcI",
"videoID": "UcL6WlcyBcI",
"question_id": "287-2",
"task_type": "Object Recognition",
"question": "Which part of their body do they not need to touch during the movement prior to grabbing the conical object?",
"options": [
"A. Knees.",
"B. Head.",
"C. Foot.",
"D. Buttocks."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which part of their body do they not need to touch during the movement prior to grabbing the conical object?\nOption:\nA. Knees.\nB. Head.\nC. Foot.\nD. Buttocks.\nAnswer with the option's letter from the given choices directly.",
859,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "287-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 860,
"target": "B",
"doc": {
"video_id": "287",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=UcL6WlcyBcI",
"videoID": "UcL6WlcyBcI",
"question_id": "287-3",
"task_type": "Object Recognition",
"question": "Who won the game in the group closest to the camera in the final game?",
"options": [
"A. The man wearing a gray top and black shorts.",
"B. The woman wearing a gray top and black shorts.",
"C. The man wearing a blue shirt and black shorts.",
"D. The woman wearing a white shirt and black shorts."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who won the game in the group closest to the camera in the final game?\nOption:\nA. The man wearing a gray top and black shorts.\nB. The woman wearing a gray top and black shorts.\nC. The man wearing a blue shirt and black shorts.\nD. The woman wearing a white shirt and black shorts.\nAnswer with the option's letter from the given choices directly.",
860,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "287-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 861,
"target": "C",
"doc": {
"video_id": "288",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=j7vUGiA-uIE",
"videoID": "j7vUGiA-uIE",
"question_id": "288-1",
"task_type": "Action Recognition",
"question": "What were the movements of the individuals in the video during the competition process?",
"options": [
"A. Lie flat.",
"B. Half squat.",
"C. Plank.",
"D. Stand."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What were the movements of the individuals in the video during the competition process?\nOption:\nA. Lie flat.\nB. Half squat.\nC. Plank.\nD. Stand.\nAnswer with the option's letter from the given choices directly.",
861,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "288-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 862,
"target": "A",
"doc": {
"video_id": "288",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=j7vUGiA-uIE",
"videoID": "j7vUGiA-uIE",
"question_id": "288-2",
"task_type": "Spatial Perception",
"question": "Where does the participant need to place his/her hand during the game in the video?",
"options": [
"A. Inside the circle.",
"B. Outside the circle.",
"C. On the edge of the circle.",
"D. Not mentioned."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the participant need to place his/her hand during the game in the video?\nOption:\nA. Inside the circle.\nB. Outside the circle.\nC. On the edge of the circle.\nD. Not mentioned.\nAnswer with the option's letter from the given choices directly.",
862,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "288-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 863,
"target": "D",
"doc": {
"video_id": "288",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=j7vUGiA-uIE",
"videoID": "j7vUGiA-uIE",
"question_id": "288-3",
"task_type": "Object Recognition",
"question": "Who is the last to reach the finish line at the end of the video?",
"options": [
"A. The man in red tops and blue shorts.",
"B. The woman wearing a black top and black shorts.",
"C. The man in black tops and blue shorts.",
"D. The woman in pink blouse and black trousers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the last to reach the finish line at the end of the video?\nOption:\nA. The man in red tops and blue shorts.\nB. The woman wearing a black top and black shorts.\nC. The man in black tops and blue shorts.\nD. The woman in pink blouse and black trousers.\nAnswer with the option's letter from the given choices directly.",
863,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "288-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 864,
"target": "B",
"doc": {
"video_id": "289",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=SFmocFLXg0Y",
"videoID": "SFmocFLXg0Y",
"question_id": "289-1",
"task_type": "Counting Problem",
"question": "How many squats did the lady in the video do?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many squats did the lady in the video do?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
864,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "289-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 865,
"target": "C",
"doc": {
"video_id": "289",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=SFmocFLXg0Y",
"videoID": "SFmocFLXg0Y",
"question_id": "289-2",
"task_type": "Object Recognition",
"question": "What was the exercise demonstrated after the squatting exercise in the video?",
"options": [
"A. Push-ups.",
"B. Plank.",
"C. Step-ups.",
"D. Pull-ups."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the exercise demonstrated after the squatting exercise in the video?\nOption:\nA. Push-ups.\nB. Plank.\nC. Step-ups.\nD. Pull-ups.\nAnswer with the option's letter from the given choices directly.",
865,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "289-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 866,
"target": "A",
"doc": {
"video_id": "289",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=SFmocFLXg0Y",
"videoID": "SFmocFLXg0Y",
"question_id": "289-3",
"task_type": "Temporal Perception",
"question": "How long does the video recommend performing the third introduced move at the beginning?",
"options": [
"A. 5-15 seconds.",
"B. 10-20 seconds.",
"C. 5-15 minutes.",
"D. 10-20 minutes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How long does the video recommend performing the third introduced move at the beginning?\nOption:\nA. 5-15 seconds.\nB. 10-20 seconds.\nC. 5-15 minutes.\nD. 10-20 minutes.\nAnswer with the option's letter from the given choices directly.",
866,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "289-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 867,
"target": "D",
"doc": {
"video_id": "290",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=IRDp5HcZyVA",
"videoID": "IRDp5HcZyVA",
"question_id": "290-1",
"task_type": "Counting Problem",
"question": "How many people are on each team in the video?",
"options": [
"A. 6.",
"B. 4.",
"C. 5.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are on each team in the video?\nOption:\nA. 6.\nB. 4.\nC. 5.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
867,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "290-1",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 868,
"target": "B",
"doc": {
"video_id": "290",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=IRDp5HcZyVA",
"videoID": "IRDp5HcZyVA",
"question_id": "290-2",
"task_type": "Object Recognition",
"question": "Which team won the last game in the video?",
"options": [
"A. The team playing with green cloth.",
"B. The team playing with red cloth.",
"C. The match between the two teams ended in a draw.",
"D. It is not mentioned in this video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team won the last game in the video?\nOption:\nA. The team playing with green cloth.\nB. The team playing with red cloth.\nC. The match between the two teams ended in a draw.\nD. It is not mentioned in this video.\nAnswer with the option's letter from the given choices directly.",
868,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "290-2",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 869,
"target": "C",
"doc": {
"video_id": "290",
"duration": "short",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=IRDp5HcZyVA",
"videoID": "IRDp5HcZyVA",
"question_id": "290-3",
"task_type": "Action Recognition",
"question": "What game are they playing in the video?",
"options": [
"A. Connect Four.",
"B. Gomoku.",
"C. TIC TAC TOE.",
"D. It is unable to determine."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What game are they playing in the video?\nOption:\nA. Connect Four.\nB. Gomoku.\nC. TIC TAC TOE.\nD. It is unable to determine.\nAnswer with the option's letter from the given choices directly.",
869,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "290-3",
"duration": "short",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 870,
"target": "D",
"doc": {
"video_id": "291",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nVtTVt9csBc",
"videoID": "nVtTVt9csBc",
"question_id": "291-1",
"task_type": "Information Synopsis",
"question": "What's the main content of this video?",
"options": [
"A. City tourism film.",
"B. City transport guide.",
"C. City job market promotion.",
"D. City tourism services ad."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the main content of this video?\nOption:\nA. City tourism film.\nB. City transport guide.\nC. City job market promotion.\nD. City tourism services ad.\nAnswer with the option's letter from the given choices directly.",
870,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "291-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 871,
"target": "B",
"doc": {
"video_id": "291",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nVtTVt9csBc",
"videoID": "nVtTVt9csBc",
"question_id": "291-2",
"task_type": "Object Recognition",
"question": "Which mode of transportation is not shown in the video?",
"options": [
"A. Train.",
"B. Bike.",
"C. Car.",
"D. Airplane."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which mode of transportation is not shown in the video?\nOption:\nA. Train.\nB. Bike.\nC. Car.\nD. Airplane.\nAnswer with the option's letter from the given choices directly.",
871,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "291-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 872,
"target": "C",
"doc": {
"video_id": "291",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nVtTVt9csBc",
"videoID": "nVtTVt9csBc",
"question_id": "291-3",
"task_type": "Attribute Perception",
"question": "Which color dress is the woman wearing, who is in the video and is also wearing brown heels?",
"options": [
"A. Black.",
"B. Yellow.",
"C. White.",
"D. Green."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which color dress is the woman wearing, who is in the video and is also wearing brown heels?\nOption:\nA. Black.\nB. Yellow.\nC. White.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
872,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "291-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 873,
"target": "A",
"doc": {
"video_id": "292",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=DoFmXBjRGmk",
"videoID": "DoFmXBjRGmk",
"question_id": "292-1",
"task_type": "Action Recognition",
"question": "What mode of transportation does the male protagonist use to get home from work according to the video?",
"options": [
"A. Riding an electric scooter.",
"B. Driving a car.",
"C. Cycling.",
"D. Riding an electric bike."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What mode of transportation does the male protagonist use to get home from work according to the video?\nOption:\nA. Riding an electric scooter.\nB. Driving a car.\nC. Cycling.\nD. Riding an electric bike.\nAnswer with the option's letter from the given choices directly.",
873,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "292-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 874,
"target": "D",
"doc": {
"video_id": "292",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=DoFmXBjRGmk",
"videoID": "DoFmXBjRGmk",
"question_id": "292-2",
"task_type": "Object Reasoning",
"question": "What is the job of the male protagonist in the video?",
"options": [
"A. Reporter.",
"B. Host.",
"C. Writer.",
"D. Programmer."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the job of the male protagonist in the video?\nOption:\nA. Reporter.\nB. Host.\nC. Writer.\nD. Programmer.\nAnswer with the option's letter from the given choices directly.",
874,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "292-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 875,
"target": "B",
"doc": {
"video_id": "292",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=DoFmXBjRGmk",
"videoID": "DoFmXBjRGmk",
"question_id": "292-3",
"task_type": "OCR Problems",
"question": "Which was the duration of the male protagonist's morning work in the video?",
"options": [
"A. Three hours.",
"B. Two hours and forty-five minutes.",
"C. Four hours.",
"D. Three hours and forty-five minutes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which was the duration of the male protagonist's morning work in the video?\nOption:\nA. Three hours.\nB. Two hours and forty-five minutes.\nC. Four hours.\nD. Three hours and forty-five minutes.\nAnswer with the option's letter from the given choices directly.",
875,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "292-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 876,
"target": "A",
"doc": {
"video_id": "293",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=PYZSjin_Pe8",
"videoID": "PYZSjin_Pe8",
"question_id": "293-1",
"task_type": "Action Recognition",
"question": "Which player scored the final goal of the match in the video?",
"options": [
"A. Singapore Number 5 Player.",
"B. Singapore Number 10 Player.",
"C. China Number 7 Player.",
"D. China Number 9 Player."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player scored the final goal of the match in the video?\nOption:\nA. Singapore Number 5 Player.\nB. Singapore Number 10 Player.\nC. China Number 7 Player.\nD. China Number 9 Player.\nAnswer with the option's letter from the given choices directly.",
876,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "293-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 877,
"target": "B",
"doc": {
"video_id": "293",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=PYZSjin_Pe8",
"videoID": "PYZSjin_Pe8",
"question_id": "293-2",
"task_type": "Action Recognition",
"question": "Based on the video, how did China score their goals in the match?",
"options": [
"A. Number nine scored a hat-trick.",
"B. Number seven scored a double.",
"C. Number three scored from a penalty.",
"D. Number fifteen scored with a header."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, how did China score their goals in the match?\nOption:\nA. Number nine scored a hat-trick.\nB. Number seven scored a double.\nC. Number three scored from a penalty.\nD. Number fifteen scored with a header.\nAnswer with the option's letter from the given choices directly.",
877,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "293-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 878,
"target": "C",
"doc": {
"video_id": "293",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=PYZSjin_Pe8",
"videoID": "PYZSjin_Pe8",
"question_id": "293-3",
"task_type": "Object Reasoning",
"question": "How did the team wearing green perform in the game shown in the video compared to their game against the team in red that took place 10 years ago?",
"options": [
"A. China replicated the previous 6:1 victory with a similar scoreline.",
"B. The match ended in a high-scoring draw, showing defensive vulnerabilities compared to the past.",
"C. China struggled to match their past performance, ending the game in a 2-2 draw.",
"D. China demonstrated the same dominant 'crushing victory' as they did a decade ago."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the team wearing green perform in the game shown in the video compared to their game against the team in red that took place 10 years ago?\nOption:\nA. China replicated the previous 6:1 victory with a similar scoreline.\nB. The match ended in a high-scoring draw, showing defensive vulnerabilities compared to the past.\nC. China struggled to match their past performance, ending the game in a 2-2 draw.\nD. China demonstrated the same dominant 'crushing victory' as they did a decade ago.\nAnswer with the option's letter from the given choices directly.",
878,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "293-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 879,
"target": "B",
"doc": {
"video_id": "294",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nbOXpuv7K4Q",
"videoID": "nbOXpuv7K4Q",
"question_id": "294-1",
"task_type": "Action Recognition",
"question": "According to the video, how does the magician demonstrate his skill with a Rubik's Cube?",
"options": [
"A. He performs sleight of hand tricks with the cube, making it disappear and reappear.",
"B. He throws the scrambled Rubik's Cube into the air, and it immediately restores itself upon catching it.",
"C. He recites a spell over the cube, causing it to change colors in an instant.",
"D. He uses a special device to manipulate the cube's movements without touching it."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how does the magician demonstrate his skill with a Rubik's Cube?\nOption:\nA. He performs sleight of hand tricks with the cube, making it disappear and reappear.\nB. He throws the scrambled Rubik's Cube into the air, and it immediately restores itself upon catching it.\nC. He recites a spell over the cube, causing it to change colors in an instant.\nD. He uses a special device to manipulate the cube's movements without touching it.\nAnswer with the option's letter from the given choices directly.",
879,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "294-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 880,
"target": "C",
"doc": {
"video_id": "294",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nbOXpuv7K4Q",
"videoID": "nbOXpuv7K4Q",
"question_id": "294-2",
"task_type": "Spatial Perception",
"question": "Based on the visual cues provided by the video, where did not the magician perform his magic?",
"options": [
"A. On the street.",
"B. In a church.",
"C. In a hospital.",
"D. On a small boat on the lake."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the visual cues provided by the video, where did not the magician perform his magic?\nOption:\nA. On the street.\nB. In a church.\nC. In a hospital.\nD. On a small boat on the lake.\nAnswer with the option's letter from the given choices directly.",
880,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "294-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 881,
"target": "A",
"doc": {
"video_id": "294",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nbOXpuv7K4Q",
"videoID": "nbOXpuv7K4Q",
"question_id": "294-3",
"task_type": "Object Recognition",
"question": "What clothes does the performer wear during the spider magic act?",
"options": [
"A. A blue denim jacket.",
"B. A red velvet cape.",
"C. A sequined vest.",
"D. A black top hat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What clothes does the performer wear during the spider magic act?\nOption:\nA. A blue denim jacket.\nB. A red velvet cape.\nC. A sequined vest.\nD. A black top hat.\nAnswer with the option's letter from the given choices directly.",
881,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "294-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 882,
"target": "D",
"doc": {
"video_id": "295",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=Lv7cYKTYE2g",
"videoID": "Lv7cYKTYE2g",
"question_id": "295-1",
"task_type": "Attribute Perception",
"question": "What color is Ms. Milani's boss' clothing in the video?",
"options": [
"A. Grey.",
"B. Blue.",
"C. Purple.",
"D. Green."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is Ms. Milani's boss' clothing in the video?\nOption:\nA. Grey.\nB. Blue.\nC. Purple.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
882,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "295-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 883,
"target": "D",
"doc": {
"video_id": "295",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=Lv7cYKTYE2g",
"videoID": "Lv7cYKTYE2g",
"question_id": "295-2",
"task_type": "Object Reasoning",
"question": "What is the primary difference between Arbeitslosengeld and Bürgergeld, as depicted in the video?",
"options": [
"A. The application process.",
"B. The duration of benefits.",
"C. The eligibility requirements.",
"D. The amount of money provided."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary difference between Arbeitslosengeld and Bürgergeld, as depicted in the video?\nOption:\nA. The application process.\nB. The duration of benefits.\nC. The eligibility requirements.\nD. The amount of money provided.\nAnswer with the option's letter from the given choices directly.",
883,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "295-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 884,
"target": "C",
"doc": {
"video_id": "295",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=Lv7cYKTYE2g",
"videoID": "Lv7cYKTYE2g",
"question_id": "295-3",
"task_type": "Information Synopsis",
"question": "What is the overall message of the video?",
"options": [
"A. Finding a new job is easy with the help of Arbeitslosengeld.",
"B. Bürgergeld is the best solution for long-term unemployment.",
"C. The German government provides financial assistance to those in need.",
"D. Unemployment is a major problem in Germany."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the overall message of the video?\nOption:\nA. Finding a new job is easy with the help of Arbeitslosengeld.\nB. Bürgergeld is the best solution for long-term unemployment.\nC. The German government provides financial assistance to those in need.\nD. Unemployment is a major problem in Germany.\nAnswer with the option's letter from the given choices directly.",
884,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "295-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 885,
"target": "C",
"doc": {
"video_id": "296",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=bHt0Riqz0qo",
"videoID": "bHt0Riqz0qo",
"question_id": "296-1",
"task_type": "Object Recognition",
"question": "Which animal is the man's pet depicted in the video?",
"options": [
"A. Pig.",
"B. Turtle.",
"C. Crocodile.",
"D. Dog."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animal is the man's pet depicted in the video?\nOption:\nA. Pig.\nB. Turtle.\nC. Crocodile.\nD. Dog.\nAnswer with the option's letter from the given choices directly.",
885,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "296-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 886,
"target": "A",
"doc": {
"video_id": "296",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=bHt0Riqz0qo",
"videoID": "bHt0Riqz0qo",
"question_id": "296-2",
"task_type": "Action Recognition",
"question": "Which of the statements is not in the video?",
"options": [
"A. The man sleeps with his pet.",
"B. The man helps his pet brush its teeth.",
"C. The man walks with his pet.",
"D. Kids ride on the pet."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the statements is not in the video?\nOption:\nA. The man sleeps with his pet.\nB. The man helps his pet brush its teeth.\nC. The man walks with his pet.\nD. Kids ride on the pet.\nAnswer with the option's letter from the given choices directly.",
886,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "296-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 887,
"target": "D",
"doc": {
"video_id": "296",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=bHt0Riqz0qo",
"videoID": "bHt0Riqz0qo",
"question_id": "296-3",
"task_type": "Action Reasoning",
"question": "What is the temperament of the pet in the video?",
"options": [
"A. Irascible.",
"B. Shy.",
"C. Impatient.",
"D. Mild."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the temperament of the pet in the video?\nOption:\nA. Irascible.\nB. Shy.\nC. Impatient.\nD. Mild.\nAnswer with the option's letter from the given choices directly.",
887,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "296-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 888,
"target": "A",
"doc": {
"video_id": "297",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=AIs3LoU4JUo",
"videoID": "AIs3LoU4JUo",
"question_id": "297-1",
"task_type": "Object Recognition",
"question": "What is the man eating in the video?",
"options": [
"A. Wonton.",
"B. Dumpling.",
"C. Steak.",
"D. Tempura."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man eating in the video?\nOption:\nA. Wonton.\nB. Dumpling.\nC. Steak.\nD. Tempura.\nAnswer with the option's letter from the given choices directly.",
888,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "297-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 889,
"target": "C",
"doc": {
"video_id": "297",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=AIs3LoU4JUo",
"videoID": "AIs3LoU4JUo",
"question_id": "297-2",
"task_type": "Object Recognition",
"question": "According to the video, which of the following is typically used as the filling in wontons?",
"options": [
"A. Mutton.",
"B. Beef.",
"C. Shrimp meat.",
"D. Pork."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following is typically used as the filling in wontons?\nOption:\nA. Mutton.\nB. Beef.\nC. Shrimp meat.\nD. Pork.\nAnswer with the option's letter from the given choices directly.",
889,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "297-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 890,
"target": "D",
"doc": {
"video_id": "297",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=AIs3LoU4JUo",
"videoID": "AIs3LoU4JUo",
"question_id": "297-3",
"task_type": "Action Reasoning",
"question": "How does the man feel about the food?",
"options": [
"A. Sweet.",
"B. Bland.",
"C. Disgusting.",
"D. Delicious."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the man feel about the food?\nOption:\nA. Sweet.\nB. Bland.\nC. Disgusting.\nD. Delicious.\nAnswer with the option's letter from the given choices directly.",
890,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "297-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 891,
"target": "B",
"doc": {
"video_id": "298",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=4ZK-m01XSQ8",
"videoID": "4ZK-m01XSQ8",
"question_id": "298-1",
"task_type": "Object Recognition",
"question": "Which track athlete knocked down the railing first in the video?",
"options": [
"A. Track No. 2.",
"B. Track No. 8.",
"C. Track No. 3.",
"D. Track No. 7."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which track athlete knocked down the railing first in the video?\nOption:\nA. Track No. 2.\nB. Track No. 8.\nC. Track No. 3.\nD. Track No. 7.\nAnswer with the option's letter from the given choices directly.",
891,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "298-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 892,
"target": "B",
"doc": {
"video_id": "298",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=4ZK-m01XSQ8",
"videoID": "4ZK-m01XSQ8",
"question_id": "298-2",
"task_type": "Counting Problem",
"question": "What was the total number of athletes who participated in the game in the video?",
"options": [
"A. 8.",
"B. 7.",
"C. 6.",
"D. 9."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the total number of athletes who participated in the game in the video?\nOption:\nA. 8.\nB. 7.\nC. 6.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
892,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "298-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 893,
"target": "A",
"doc": {
"video_id": "298",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=4ZK-m01XSQ8",
"videoID": "4ZK-m01XSQ8",
"question_id": "298-3",
"task_type": "Object Recognition",
"question": "Which of the following track athletes did not cause any disruption by knocking down railings during the competition?",
"options": [
"A. Track No. 4.",
"B. Track No. 8.",
"C. Track No. 1.",
"D. Track No. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following track athletes did not cause any disruption by knocking down railings during the competition?\nOption:\nA. Track No. 4.\nB. Track No. 8.\nC. Track No. 1.\nD. Track No. 3.\nAnswer with the option's letter from the given choices directly.",
893,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "298-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 894,
"target": "D",
"doc": {
"video_id": "299",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=g7PjwIzqyCQ",
"videoID": "g7PjwIzqyCQ",
"question_id": "299-1",
"task_type": "Information Synopsis",
"question": "What is the primary focus of the video?",
"options": [
"A. How China and Korea cooperate to mentain the beauty of nature in Baekdu Mountain.",
"B. The designatyion of Baekdu Mountain by UNESCO.",
"C. Whether Baekdu Mountain belongs to Korea or China.",
"D. The importance of Baekdu Mountain and some worries about its about its ownership."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of the video?\nOption:\nA. How China and Korea cooperate to mentain the beauty of nature in Baekdu Mountain.\nB. The designatyion of Baekdu Mountain by UNESCO.\nC. Whether Baekdu Mountain belongs to Korea or China.\nD. The importance of Baekdu Mountain and some worries about its about its ownership.\nAnswer with the option's letter from the given choices directly.",
894,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "299-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 895,
"target": "C",
"doc": {
"video_id": "299",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=g7PjwIzqyCQ",
"videoID": "g7PjwIzqyCQ",
"question_id": "299-2",
"task_type": "Spatial Reasoning",
"question": "Where were the video clips of the mountain tour taken from among the following options?",
"options": [
"A. In South Korea.",
"B. In UNESCO.",
"C. In China.",
"D. In North Korea."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where were the video clips of the mountain tour taken from among the following options?\nOption:\nA. In South Korea.\nB. In UNESCO.\nC. In China.\nD. In North Korea.\nAnswer with the option's letter from the given choices directly.",
895,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "299-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 896,
"target": "D",
"doc": {
"video_id": "299",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=g7PjwIzqyCQ",
"videoID": "g7PjwIzqyCQ",
"question_id": "299-3",
"task_type": "Attribute Perception",
"question": "What color of pants is the female commentator wearing in the video?",
"options": [
"A. Black.",
"B. White.",
"C. Purple.",
"D. Green."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color of pants is the female commentator wearing in the video?\nOption:\nA. Black.\nB. White.\nC. Purple.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
896,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "299-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 897,
"target": "A",
"doc": {
"video_id": "300",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=mBD-TPIHbUs",
"videoID": "mBD-TPIHbUs",
"question_id": "300-1",
"task_type": "Action Reasoning",
"question": "According to the early part of the video, when the man with the gray and white beard and black glasses sits at the table, what does the man in the burgundy shirt next to him say when he says 'no'?",
"options": [
"A. He cannot grab food with hands.",
"B. He cannot enjoy food.",
"C. He cannot sit here.",
"D. He cannot say hello."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the early part of the video, when the man with the gray and white beard and black glasses sits at the table, what does the man in the burgundy shirt next to him say when he says 'no'?\nOption:\nA. He cannot grab food with hands.\nB. He cannot enjoy food.\nC. He cannot sit here.\nD. He cannot say hello.\nAnswer with the option's letter from the given choices directly.",
897,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "300-1",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 898,
"target": "D",
"doc": {
"video_id": "300",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=mBD-TPIHbUs",
"videoID": "mBD-TPIHbUs",
"question_id": "300-2",
"task_type": "Action Reasoning",
"question": "In the middle part of the video, why does the man with gray and white beard and black glasses wave and greet the restaurant staff repeatedly when he enters the restaurant?",
"options": [
"A. To show the audience a positive example.",
"B. He wants to show his enthusiasm.",
"C. He knows the restaurant's service staff.",
"D. To show the audience a negative example."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle part of the video, why does the man with gray and white beard and black glasses wave and greet the restaurant staff repeatedly when he enters the restaurant?\nOption:\nA. To show the audience a positive example.\nB. He wants to show his enthusiasm.\nC. He knows the restaurant's service staff.\nD. To show the audience a negative example.\nAnswer with the option's letter from the given choices directly.",
898,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "300-2",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 899,
"target": "B",
"doc": {
"video_id": "300",
"duration": "short",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=mBD-TPIHbUs",
"videoID": "mBD-TPIHbUs",
"question_id": "300-3",
"task_type": "Action Reasoning",
"question": "According to the later part of the video, why is it not recommended for the man with gray and white beard and black glasses to say 'ciao'?",
"options": [
"A. Because 'ciao' meaning 'hello' is not applicable in Italy.",
"B. Because 'ciao' meaning 'hello' is not formal enough.",
"C. Because 'ciao' meaning 'hello' might offend people.",
"D. Unable to determine\"."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the later part of the video, why is it not recommended for the man with gray and white beard and black glasses to say 'ciao'?\nOption:\nA. Because 'ciao' meaning 'hello' is not applicable in Italy.\nB. Because 'ciao' meaning 'hello' is not formal enough.\nC. Because 'ciao' meaning 'hello' might offend people.\nD. Unable to determine\".\nAnswer with the option's letter from the given choices directly.",
899,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "300-3",
"duration": "short",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 900,
"target": "C",
"doc": {
"video_id": "301",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=dphq5X-rMew",
"videoID": "dphq5X-rMew",
"question_id": "301-1",
"task_type": "Counting Problem",
"question": "According to the video, how many man-made ditches are found recently?",
"options": [
"A. 4.",
"B. 2.",
"C. 3.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many man-made ditches are found recently?\nOption:\nA. 4.\nB. 2.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
900,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "301-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 901,
"target": "A",
"doc": {
"video_id": "301",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=dphq5X-rMew",
"videoID": "dphq5X-rMew",
"question_id": "301-2",
"task_type": "Object Reasoning",
"question": "Why Cusco and Tenochtitlan can be easily found while Amazonian settlements can not?",
"options": [
"A. Because Amazonian settlements were built with wood and earth .",
"B. Because Amazonian settlements were too small.",
"C. Because Amazonian settlements were in the center of Amazon forest.",
"D. Because local people all died from diseases."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why Cusco and Tenochtitlan can be easily found while Amazonian settlements can not?\nOption:\nA. Because Amazonian settlements were built with wood and earth .\nB. Because Amazonian settlements were too small.\nC. Because Amazonian settlements were in the center of Amazon forest.\nD. Because local people all died from diseases.\nAnswer with the option's letter from the given choices directly.",
901,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "301-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 902,
"target": "D",
"doc": {
"video_id": "301",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=dphq5X-rMew",
"videoID": "dphq5X-rMew",
"question_id": "301-3",
"task_type": "Object Recognition",
"question": "Who finally find the lost city?",
"options": [
"A. Terra preta.",
"B. Fawcett.",
"C. European expeditions.",
"D. Dr.Michael Heckenberger."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who finally find the lost city?\nOption:\nA. Terra preta.\nB. Fawcett.\nC. European expeditions.\nD. Dr.Michael Heckenberger.\nAnswer with the option's letter from the given choices directly.",
902,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "301-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 903,
"target": "C",
"doc": {
"video_id": "302",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=K64wRD8eaus",
"videoID": "K64wRD8eaus",
"question_id": "302-1",
"task_type": "Object Recognition",
"question": "Why does this video mention Victor Garber?",
"options": [
"A. Because he doesn't believe that the Titanic will sink.",
"B. Because he is the architect of the Titanic.",
"C. Because he played Tomas Andrew in the movie.",
"D. Because he konws the tragic end of the Titanic."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does this video mention Victor Garber?\nOption:\nA. Because he doesn't believe that the Titanic will sink.\nB. Because he is the architect of the Titanic.\nC. Because he played Tomas Andrew in the movie.\nD. Because he konws the tragic end of the Titanic.\nAnswer with the option's letter from the given choices directly.",
903,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "302-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 904,
"target": "C",
"doc": {
"video_id": "302",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=K64wRD8eaus",
"videoID": "K64wRD8eaus",
"question_id": "302-2",
"task_type": "Object Reasoning",
"question": "Which of the following statements is not correct?",
"options": [
"A. The Titanic finally sank because 5 adjacent compartments were breached.",
"B. Despite the lack of lifeboats, the Titanic met all the requirement.",
"C. People on the Titanic were not rescued in time because its operator was sleeping.",
"D. The Titanic was equipped with 20 lifeboats."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not correct?\nOption:\nA. The Titanic finally sank because 5 adjacent compartments were breached.\nB. Despite the lack of lifeboats, the Titanic met all the requirement.\nC. People on the Titanic were not rescued in time because its operator was sleeping.\nD. The Titanic was equipped with 20 lifeboats.\nAnswer with the option's letter from the given choices directly.",
904,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "302-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 905,
"target": "C",
"doc": {
"video_id": "302",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=K64wRD8eaus",
"videoID": "K64wRD8eaus",
"question_id": "302-3",
"task_type": "Counting Problem",
"question": "How many ships are shown in the map while the sinking ship sending out message?",
"options": [
"A. 3 .",
"B. 8.",
"C. 11.",
"D. 9."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many ships are shown in the map while the sinking ship sending out message?\nOption:\nA. 3 .\nB. 8.\nC. 11.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
905,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "302-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 906,
"target": "C",
"doc": {
"video_id": "303",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=kSBB5PsRV-k",
"videoID": "kSBB5PsRV-k",
"question_id": "303-1",
"task_type": "OCR Problems",
"question": "When the video talks about sea level rise, what is the time span?",
"options": [
"A. 1300 AD - 2016 AD.",
"B. 1200 BC - 2019 AD.",
"C. 1300 BC - 2016 AD.",
"D. 1200 AD - 2019 AD."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When the video talks about sea level rise, what is the time span?\nOption:\nA. 1300 AD - 2016 AD.\nB. 1200 BC - 2019 AD.\nC. 1300 BC - 2016 AD.\nD. 1200 AD - 2019 AD.\nAnswer with the option's letter from the given choices directly.",
906,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "303-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 907,
"target": "B",
"doc": {
"video_id": "303",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=kSBB5PsRV-k",
"videoID": "kSBB5PsRV-k",
"question_id": "303-2",
"task_type": "Temporal Reasoning",
"question": "What is the accurate sequence for the video to present its content?\n(a) Sunken cities.\n(b) Sunken ships.\n(c) Sunken things used to learn about ancient people.",
"options": [
"A. (c)(b)(a).",
"B. (a)(b)(c).",
"C. (a)(c)(b).",
"D. (b)(a)(c)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the accurate sequence for the video to present its content?\n(a) Sunken cities.\n(b) Sunken ships.\n(c) Sunken things used to learn about ancient people.\nOption:\nA. (c)(b)(a).\nB. (a)(b)(c).\nC. (a)(c)(b).\nD. (b)(a)(c).\nAnswer with the option's letter from the given choices directly.",
907,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "303-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 908,
"target": "D",
"doc": {
"video_id": "303",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=kSBB5PsRV-k",
"videoID": "kSBB5PsRV-k",
"question_id": "303-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus of this video for the audience?",
"options": [
"A. What is there 200 meters below the sea?",
"B. What have humans discovered from deep-sea archaeology?",
"C. What is the significance of deep-sea archaeology to the development of modern human society?",
"D. How much of human history is on the bottom of the ocean?"
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of this video for the audience?\nOption:\nA. What is there 200 meters below the sea?\nB. What have humans discovered from deep-sea archaeology?\nC. What is the significance of deep-sea archaeology to the development of modern human society?\nD. How much of human history is on the bottom of the ocean?\nAnswer with the option's letter from the given choices directly.",
908,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "303-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 909,
"target": "B",
"doc": {
"video_id": "304",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=QTA8j5wSTx4",
"videoID": "QTA8j5wSTx4",
"question_id": "304-1",
"task_type": "Action Reasoning",
"question": "Why did the main character in the video put his hand in the coat?",
"options": [
"A. Because his hand was deformed.",
"B. For public image.",
"C. Because such action can released stomach pain.",
"D. For warmth."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the main character in the video put his hand in the coat?\nOption:\nA. Because his hand was deformed.\nB. For public image.\nC. Because such action can released stomach pain.\nD. For warmth.\nAnswer with the option's letter from the given choices directly.",
909,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "304-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 910,
"target": "B",
"doc": {
"video_id": "304",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=QTA8j5wSTx4",
"videoID": "QTA8j5wSTx4",
"question_id": "304-2",
"task_type": "Object Reasoning",
"question": "Who started the custom of restraining hand activities in pulic?",
"options": [
"A. Demosthenes.",
"B. Aeschines.",
"C. Napoleon.",
"D. Pizarro."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who started the custom of restraining hand activities in pulic?\nOption:\nA. Demosthenes.\nB. Aeschines.\nC. Napoleon.\nD. Pizarro.\nAnswer with the option's letter from the given choices directly.",
910,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "304-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 911,
"target": "A",
"doc": {
"video_id": "304",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=QTA8j5wSTx4",
"videoID": "QTA8j5wSTx4",
"question_id": "304-3",
"task_type": "Spatial Perception",
"question": "Which of the following elements isn't mentioned in the painting \"The Emperor Napoleon in His Study at the Tuileries\"?",
"options": [
"A. A lamp on the desk.",
"B. A map on the floor.",
"C. A clock on the wall.",
"D. Napoleon Bonaparte with his hand tucked inside his coat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following elements isn't mentioned in the painting \"The Emperor Napoleon in His Study at the Tuileries\"?\nOption:\nA. A lamp on the desk.\nB. A map on the floor.\nC. A clock on the wall.\nD. Napoleon Bonaparte with his hand tucked inside his coat.\nAnswer with the option's letter from the given choices directly.",
911,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "304-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 912,
"target": "B",
"doc": {
"video_id": "305",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=tuZcS2Flabw",
"videoID": "tuZcS2Flabw",
"question_id": "305-1",
"task_type": "Object Recognition",
"question": "Who embraces the Parthenon Temple in the video?",
"options": [
"A. Pheidias.",
"B. Athena.",
"C. The glory of the Athenians.",
"D. Pericles."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who embraces the Parthenon Temple in the video?\nOption:\nA. Pheidias.\nB. Athena.\nC. The glory of the Athenians.\nD. Pericles.\nAnswer with the option's letter from the given choices directly.",
912,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "305-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 913,
"target": "C",
"doc": {
"video_id": "305",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=tuZcS2Flabw",
"videoID": "tuZcS2Flabw",
"question_id": "305-2",
"task_type": "Attribute Perception",
"question": "Which of the following ideas about the Temple's design is not innovative?",
"options": [
"A. Put human's and gods's sculptures side by side.",
"B. Combine Doric columns with Ionic elements.",
"C. Integrating traditional elements with modern ones in a harmonious manner..",
"D. Incorporate entasis in each colum."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following ideas about the Temple's design is not innovative?\nOption:\nA. Put human's and gods's sculptures side by side.\nB. Combine Doric columns with Ionic elements.\nC. Integrating traditional elements with modern ones in a harmonious manner..\nD. Incorporate entasis in each colum.\nAnswer with the option's letter from the given choices directly.",
913,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "305-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 914,
"target": "C",
"doc": {
"video_id": "305",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=tuZcS2Flabw",
"videoID": "tuZcS2Flabw",
"question_id": "305-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following events are in correct order?",
"options": [
"A. The Parthenon Temple was completed, Pheidias left thanks to Athena's bless, Pheidias was accused of embezzlement, Pheidias accounted spending to prove his innocence.",
"B. Pheidias was accused of embezzlement, Pheidias accounted spending to prove his innocence, the Parthenon Temple was completed, Pheidias left thanks to Athena's bless.",
"C. The Parthenon Temple was completed, Pheidias was accused of embezzlement, Pheidias accounted spending to prove his innocence, Pheidias left thanks to Athena's bless.",
"D. The Parthenon Temple was completed, Pheidias left thanks to Athena's bless, Pheidias accounted spending to prove his innocence, Pheidias was accused of embezzlement."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events are in correct order?\nOption:\nA. The Parthenon Temple was completed, Pheidias left thanks to Athena's bless, Pheidias was accused of embezzlement, Pheidias accounted spending to prove his innocence.\nB. Pheidias was accused of embezzlement, Pheidias accounted spending to prove his innocence, the Parthenon Temple was completed, Pheidias left thanks to Athena's bless.\nC. The Parthenon Temple was completed, Pheidias was accused of embezzlement, Pheidias accounted spending to prove his innocence, Pheidias left thanks to Athena's bless.\nD. The Parthenon Temple was completed, Pheidias left thanks to Athena's bless, Pheidias accounted spending to prove his innocence, Pheidias was accused of embezzlement.\nAnswer with the option's letter from the given choices directly.",
914,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "305-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 915,
"target": "D",
"doc": {
"video_id": "306",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=s_TGkDcjBqI",
"videoID": "s_TGkDcjBqI",
"question_id": "306-1",
"task_type": "Counting Problem",
"question": "How many princesses flew away when the cowherd approached?",
"options": [
"A. Eight.",
"B. Seven.",
"C. One.",
"D. Six."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many princesses flew away when the cowherd approached?\nOption:\nA. Eight.\nB. Seven.\nC. One.\nD. Six.\nAnswer with the option's letter from the given choices directly.",
915,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "306-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 916,
"target": "B",
"doc": {
"video_id": "306",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=s_TGkDcjBqI",
"videoID": "s_TGkDcjBqI",
"question_id": "306-2",
"task_type": "Action Recognition",
"question": "What was weaver doing when the cowherd saw her at the first time?",
"options": [
"A. Taking a shower with her sisters.",
"B. Combing her hair.",
"C. Exploring the countryside.",
"D. Teaching her skills to the villagers."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was weaver doing when the cowherd saw her at the first time?\nOption:\nA. Taking a shower with her sisters.\nB. Combing her hair.\nC. Exploring the countryside.\nD. Teaching her skills to the villagers.\nAnswer with the option's letter from the given choices directly.",
916,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "306-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 917,
"target": "B",
"doc": {
"video_id": "306",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=s_TGkDcjBqI",
"videoID": "s_TGkDcjBqI",
"question_id": "306-3",
"task_type": "Action Reasoning",
"question": "How did the cowherd use the bull's magic?",
"options": [
"A. Asked the magpies to form a bridge.",
"B. Hurtled upwards and tried to wade through the stars.",
"C. Plucked a golden hairpin and tore through the sky.",
"D. Hastily placed each child in a basket."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the cowherd use the bull's magic?\nOption:\nA. Asked the magpies to form a bridge.\nB. Hurtled upwards and tried to wade through the stars.\nC. Plucked a golden hairpin and tore through the sky.\nD. Hastily placed each child in a basket.\nAnswer with the option's letter from the given choices directly.",
917,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "306-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 918,
"target": "B",
"doc": {
"video_id": "307",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=8l1NyR6UvxU",
"videoID": "8l1NyR6UvxU",
"question_id": "307-1",
"task_type": "Temporal Reasoning",
"question": "Which of the following events happened before 72 BCE?",
"options": [
"A. The Senate retaliated with the full force of two legion.",
"B. Praetor Variniu was ambushed.",
"C. Crixus died in a battle.",
"D. Marcus Licinius Crassus had assumed control of the war."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events happened before 72 BCE?\nOption:\nA. The Senate retaliated with the full force of two legion.\nB. Praetor Variniu was ambushed.\nC. Crixus died in a battle.\nD. Marcus Licinius Crassus had assumed control of the war.\nAnswer with the option's letter from the given choices directly.",
918,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "307-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 919,
"target": "A",
"doc": {
"video_id": "307",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=8l1NyR6UvxU",
"videoID": "8l1NyR6UvxU",
"question_id": "307-2",
"task_type": "Attribute Perception",
"question": "Which of the following features can not describe Spartacus?",
"options": [
"A. Thick beard.",
"B. Curly hair.",
"C. Strong muscle.",
"D. Having a bold mind."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following features can not describe Spartacus?\nOption:\nA. Thick beard.\nB. Curly hair.\nC. Strong muscle.\nD. Having a bold mind.\nAnswer with the option's letter from the given choices directly.",
919,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "307-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 920,
"target": "D",
"doc": {
"video_id": "307",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=8l1NyR6UvxU",
"videoID": "8l1NyR6UvxU",
"question_id": "307-3",
"task_type": "Action Recognition",
"question": "What happened when the main character owned an army of about 120,000 soldiers?",
"options": [
"A. The army captured Rome.",
"B. The army marched beyond Rome's border.",
"C. The army climbed the Alp.",
"D. The army turned south."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened when the main character owned an army of about 120,000 soldiers?\nOption:\nA. The army captured Rome.\nB. The army marched beyond Rome's border.\nC. The army climbed the Alp.\nD. The army turned south.\nAnswer with the option's letter from the given choices directly.",
920,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "307-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 921,
"target": "C",
"doc": {
"video_id": "308",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=wgPymD-NBQU",
"videoID": "wgPymD-NBQU",
"question_id": "308-1",
"task_type": "Object Reasoning",
"question": "Who is the main character in the video?",
"options": [
"A. Mark Antony.",
"B. Caesar.",
"C. Brutus.",
"D. Gaius Cassius Longinus."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the main character in the video?\nOption:\nA. Mark Antony.\nB. Caesar.\nC. Brutus.\nD. Gaius Cassius Longinus.\nAnswer with the option's letter from the given choices directly.",
921,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "308-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 922,
"target": "C",
"doc": {
"video_id": "308",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=wgPymD-NBQU",
"videoID": "wgPymD-NBQU",
"question_id": "308-2",
"task_type": "Counting Problem",
"question": "How many people attended in the assasination of Caesar according to the video?",
"options": [
"A. 3.",
"B. Around 20 to 30.",
"C. About 60.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people attended in the assasination of Caesar according to the video?\nOption:\nA. 3.\nB. Around 20 to 30.\nC. About 60.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
922,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "308-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 923,
"target": "D",
"doc": {
"video_id": "308",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=wgPymD-NBQU",
"videoID": "wgPymD-NBQU",
"question_id": "308-3",
"task_type": "Attribute Perception",
"question": "Which of the following statements is true?",
"options": [
"A. Dante thought Brutus was a selfless fighter against dictators.",
"B. Caesar was stingy with his followers.",
"C. Brutus was Caesar's son.",
"D. Caesar owned the crowd's support from beginning to end."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is true?\nOption:\nA. Dante thought Brutus was a selfless fighter against dictators.\nB. Caesar was stingy with his followers.\nC. Brutus was Caesar's son.\nD. Caesar owned the crowd's support from beginning to end.\nAnswer with the option's letter from the given choices directly.",
923,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "308-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 924,
"target": "A",
"doc": {
"video_id": "309",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=tjrvKA4w9-Y",
"videoID": "tjrvKA4w9-Y",
"question_id": "309-1",
"task_type": "Spatial Reasoning",
"question": "How does the video organize the story of Che Guevara?",
"options": [
"A. By holding a trial trying to evaluating his merits and demerits.",
"B. By telling his life in flashback.",
"C. By asking the Judge to tell Che Guevara's story.",
"D. By debating on whether Che Guevara should be judged by his ideals or outcomes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the video organize the story of Che Guevara?\nOption:\nA. By holding a trial trying to evaluating his merits and demerits.\nB. By telling his life in flashback.\nC. By asking the Judge to tell Che Guevara's story.\nD. By debating on whether Che Guevara should be judged by his ideals or outcomes.\nAnswer with the option's letter from the given choices directly.",
924,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "309-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 925,
"target": "C",
"doc": {
"video_id": "309",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=tjrvKA4w9-Y",
"videoID": "tjrvKA4w9-Y",
"question_id": "309-2",
"task_type": "Attribute Perception",
"question": "Whose portrait is on the cup of the judge in the video?",
"options": [
"A. The lawyer's.",
"B. The judge's.",
"C. Che Guevara's.",
"D. Castro's."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Whose portrait is on the cup of the judge in the video?\nOption:\nA. The lawyer's.\nB. The judge's.\nC. Che Guevara's.\nD. Castro's.\nAnswer with the option's letter from the given choices directly.",
925,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "309-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 926,
"target": "D",
"doc": {
"video_id": "309",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=tjrvKA4w9-Y",
"videoID": "tjrvKA4w9-Y",
"question_id": "309-3",
"task_type": "Temporal Reasoning",
"question": "What happened when Che Guevara failed to rally rebels in the Congo?",
"options": [
"A. He went to Soviet to rebel once again.",
"B. He was captured and executed by Congo's government.",
"C. His action led to the Cuban Missile Crisis.",
"D. He went to Bolivia to rebel once again."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened when Che Guevara failed to rally rebels in the Congo?\nOption:\nA. He went to Soviet to rebel once again.\nB. He was captured and executed by Congo's government.\nC. His action led to the Cuban Missile Crisis.\nD. He went to Bolivia to rebel once again.\nAnswer with the option's letter from the given choices directly.",
926,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "309-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 927,
"target": "B",
"doc": {
"video_id": "310",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=A9fQPzZ1-hg",
"videoID": "A9fQPzZ1-hg",
"question_id": "310-1",
"task_type": "Counting Problem",
"question": "How many defending layers were there in the barrier in the east of the Berlin wall accoridng to the video?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many defending layers were there in the barrier in the east of the Berlin wall accoridng to the video?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
927,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "310-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 928,
"target": "B",
"doc": {
"video_id": "310",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=A9fQPzZ1-hg",
"videoID": "A9fQPzZ1-hg",
"question_id": "310-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following events happened after the fall of the Berlin Wall, according to the video?",
"options": [
"A. East Germany tired to defused tension by making travel permits easier to obtain.",
"B. The disintegration of the soviet union.",
"C. Mass demonstrations for free travel appeared.",
"D. Many people fled to West Germany by various methods."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events happened after the fall of the Berlin Wall, according to the video?\nOption:\nA. East Germany tired to defused tension by making travel permits easier to obtain.\nB. The disintegration of the soviet union.\nC. Mass demonstrations for free travel appeared.\nD. Many people fled to West Germany by various methods.\nAnswer with the option's letter from the given choices directly.",
928,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "310-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 929,
"target": "C",
"doc": {
"video_id": "310",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=A9fQPzZ1-hg",
"videoID": "A9fQPzZ1-hg",
"question_id": "310-3",
"task_type": "Attribute Perception",
"question": "How many people from East Germany had fled to West Germany by 1961?",
"options": [
"A. About 5 million.",
"B. About 4 million.",
"C. About 3.5 million.",
"D. About 2 million."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people from East Germany had fled to West Germany by 1961?\nOption:\nA. About 5 million.\nB. About 4 million.\nC. About 3.5 million.\nD. About 2 million.\nAnswer with the option's letter from the given choices directly.",
929,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "310-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 930,
"target": "B",
"doc": {
"video_id": "311",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=4jmMWohs1XM",
"videoID": "4jmMWohs1XM",
"question_id": "311-1",
"task_type": "Attribute Perception",
"question": "What are the colors of the famous statue of Augustus Caesar that is supposed to?",
"options": [
"A. Totally white.",
"B. Mainly white and red.",
"C. Mainly white and green.",
"D. Mainly red and green."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the colors of the famous statue of Augustus Caesar that is supposed to?\nOption:\nA. Totally white.\nB. Mainly white and red.\nC. Mainly white and green.\nD. Mainly red and green.\nAnswer with the option's letter from the given choices directly.",
930,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "311-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 931,
"target": "D",
"doc": {
"video_id": "311",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=4jmMWohs1XM",
"videoID": "4jmMWohs1XM",
"question_id": "311-2",
"task_type": "Object Reasoning",
"question": "Which of the following statements is not true?",
"options": [
"A. The colors on the Roman scultures faded in Renaissance.",
"B. \"Laocoön and His Sons\" is originally full of color.",
"C. Artists in the Renaissance liked white marble because they wanted to imitated the Roman sculptures.",
"D. \"David\" is originally full of color."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not true?\nOption:\nA. The colors on the Roman scultures faded in Renaissance.\nB. \"Laocoön and His Sons\" is originally full of color.\nC. Artists in the Renaissance liked white marble because they wanted to imitated the Roman sculptures.\nD. \"David\" is originally full of color.\nAnswer with the option's letter from the given choices directly.",
931,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "311-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 932,
"target": "A",
"doc": {
"video_id": "311",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=4jmMWohs1XM",
"videoID": "4jmMWohs1XM",
"question_id": "311-3",
"task_type": "Attribute Perception",
"question": "Which of the following methods of revealing the original color of the ancient art work is not mentioned in the video?",
"options": [
"A. By imagining the color that fits in well with the art work.",
"B. Ultraviolet light.",
"C. Pigment analysis.",
"D. Sampling some visible colors."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following methods of revealing the original color of the ancient art work is not mentioned in the video?\nOption:\nA. By imagining the color that fits in well with the art work.\nB. Ultraviolet light.\nC. Pigment analysis.\nD. Sampling some visible colors.\nAnswer with the option's letter from the given choices directly.",
932,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "311-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 933,
"target": "D",
"doc": {
"video_id": "312",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=RBTiTcHm_ac",
"videoID": "RBTiTcHm_ac",
"question_id": "312-1",
"task_type": "Attribute Perception",
"question": "What does Michael Bierut take as an example to illustrate pictoral logos?",
"options": [
"A. Nike's logo.",
"B. Google's logo.",
"C. Coca cola'logo.",
"D. Apple's logo."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Michael Bierut take as an example to illustrate pictoral logos?\nOption:\nA. Nike's logo.\nB. Google's logo.\nC. Coca cola'logo.\nD. Apple's logo.\nAnswer with the option's letter from the given choices directly.",
933,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "312-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 934,
"target": "C",
"doc": {
"video_id": "312",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=RBTiTcHm_ac",
"videoID": "RBTiTcHm_ac",
"question_id": "312-2",
"task_type": "Action Reasoning",
"question": "Why does Michael Bierut mention religious symbols?",
"options": [
"A. Because he thinks that logoss should take in religious elements.",
"B. Because he believes that religious symbols are logos too.",
"C. To illustrate that the logo's meaning is expressed by what people come to mind.",
"D. Because he believes in Christianity."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Michael Bierut mention religious symbols?\nOption:\nA. Because he thinks that logoss should take in religious elements.\nB. Because he believes that religious symbols are logos too.\nC. To illustrate that the logo's meaning is expressed by what people come to mind.\nD. Because he believes in Christianity.\nAnswer with the option's letter from the given choices directly.",
934,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "312-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 935,
"target": "C",
"doc": {
"video_id": "312",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=RBTiTcHm_ac",
"videoID": "RBTiTcHm_ac",
"question_id": "312-3",
"task_type": "Object Reasoning",
"question": "What did Nike do when a graphic design student submit her work in the video?",
"options": [
"A. They sent her a ring with a Nike swoosh.",
"B. They associated the product with the idea of athletic.",
"C. The Nike founders didn't like it.",
"D. They abandoned her work."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did Nike do when a graphic design student submit her work in the video?\nOption:\nA. They sent her a ring with a Nike swoosh.\nB. They associated the product with the idea of athletic.\nC. The Nike founders didn't like it.\nD. They abandoned her work.\nAnswer with the option's letter from the given choices directly.",
935,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "312-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 936,
"target": "D",
"doc": {
"video_id": "313",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=osRzbY_AdaM",
"videoID": "osRzbY_AdaM",
"question_id": "313-1",
"task_type": "Attribute Perception",
"question": "What do the two people at the beginning of the video have in common?",
"options": [
"A. The are both piano-learner.",
"B. They both wear glasses.",
"C. They are both good at playing the violin.",
"D. They both wear black shirt."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two people at the beginning of the video have in common?\nOption:\nA. The are both piano-learner.\nB. They both wear glasses.\nC. They are both good at playing the violin.\nD. They both wear black shirt.\nAnswer with the option's letter from the given choices directly.",
936,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "313-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 937,
"target": "B",
"doc": {
"video_id": "313",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=osRzbY_AdaM",
"videoID": "osRzbY_AdaM",
"question_id": "313-2",
"task_type": "Action Reasoning",
"question": "What is the teacher's attitude toward the student when the student attempts to play middle C on the violin?",
"options": [
"A. Inspired.",
"B. Disappointed.",
"C. Appreciated.",
"D. Angry."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the teacher's attitude toward the student when the student attempts to play middle C on the violin?\nOption:\nA. Inspired.\nB. Disappointed.\nC. Appreciated.\nD. Angry.\nAnswer with the option's letter from the given choices directly.",
937,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "313-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 938,
"target": "A",
"doc": {
"video_id": "313",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=osRzbY_AdaM",
"videoID": "osRzbY_AdaM",
"question_id": "313-3",
"task_type": "Action Reasoning",
"question": "Which of the following statements about the violin-learner in the video is true?",
"options": [
"A. He keeps improving himself as the video goes on.",
"B. He has lost confidence in his playing skills.",
"C. He learned the violin because his neck wasn't that good.",
"D. He is very enthusiastic when exposed to new playing techniques."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about the violin-learner in the video is true?\nOption:\nA. He keeps improving himself as the video goes on.\nB. He has lost confidence in his playing skills.\nC. He learned the violin because his neck wasn't that good.\nD. He is very enthusiastic when exposed to new playing techniques.\nAnswer with the option's letter from the given choices directly.",
938,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "313-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 939,
"target": "B",
"doc": {
"video_id": "314",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=1pHkv4KUiFY",
"videoID": "1pHkv4KUiFY",
"question_id": "314-1",
"task_type": "Counting Problem",
"question": "How many listeners gave busking donations to the girl before she started playing?",
"options": [
"A. One.",
"B. Two.",
"C. None.",
"D. Three."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many listeners gave busking donations to the girl before she started playing?\nOption:\nA. One.\nB. Two.\nC. None.\nD. Three.\nAnswer with the option's letter from the given choices directly.",
939,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "314-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 940,
"target": "A",
"doc": {
"video_id": "314",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=1pHkv4KUiFY",
"videoID": "1pHkv4KUiFY",
"question_id": "314-2",
"task_type": "Object Recognition",
"question": "What instrument is the little girl playing in the video?",
"options": [
"A. Violin.",
"B. Piano.",
"C. Cello.",
"D. Guitar."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What instrument is the little girl playing in the video?\nOption:\nA. Violin.\nB. Piano.\nC. Cello.\nD. Guitar.\nAnswer with the option's letter from the given choices directly.",
940,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "314-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 941,
"target": "D",
"doc": {
"video_id": "314",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=1pHkv4KUiFY",
"videoID": "1pHkv4KUiFY",
"question_id": "314-3",
"task_type": "Attribute Perception",
"question": "What color shoes does the little girl wear?",
"options": [
"A. Yellow.",
"B. Orange.",
"C. Pink.",
"D. Black."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color shoes does the little girl wear?\nOption:\nA. Yellow.\nB. Orange.\nC. Pink.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
941,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "314-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 942,
"target": "A",
"doc": {
"video_id": "315",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=B2zhLYz4pYo",
"videoID": "B2zhLYz4pYo",
"question_id": "315-1",
"task_type": "Attribute Perception",
"question": "What's the color of Gabriel García Márquez's car in the video?",
"options": [
"A. Green.",
"B. Black.",
"C. Orange.",
"D. Red."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the color of Gabriel García Márquez's car in the video?\nOption:\nA. Green.\nB. Black.\nC. Orange.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
942,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "315-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 943,
"target": "D",
"doc": {
"video_id": "315",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=B2zhLYz4pYo",
"videoID": "B2zhLYz4pYo",
"question_id": "315-2",
"task_type": "Object Reasoning",
"question": "Which of the following plots about the book was not mentioned in the video?",
"options": [
"A. Globe-trotting adventure.",
"B. Civil war.",
"C. Political intrigue.",
"D. The main character abruptly turned his car around and went home."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following plots about the book was not mentioned in the video?\nOption:\nA. Globe-trotting adventure.\nB. Civil war.\nC. Political intrigue.\nD. The main character abruptly turned his car around and went home.\nAnswer with the option's letter from the given choices directly.",
943,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "315-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 944,
"target": "D",
"doc": {
"video_id": "315",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=B2zhLYz4pYo",
"videoID": "B2zhLYz4pYo",
"question_id": "315-3",
"task_type": "Action Reasoning",
"question": "Which of the following statements about the author is not correct?",
"options": [
"A. He used to be a journalist.",
"B. Despite having seen the dark side of the world, he still believed a better world.",
"C. His writing style was influenced by his maternal grandparent.",
"D. He growed up in a colonial society."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about the author is not correct?\nOption:\nA. He used to be a journalist.\nB. Despite having seen the dark side of the world, he still believed a better world.\nC. His writing style was influenced by his maternal grandparent.\nD. He growed up in a colonial society.\nAnswer with the option's letter from the given choices directly.",
944,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "315-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 945,
"target": "D",
"doc": {
"video_id": "316",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=IaTaaNinolU",
"videoID": "IaTaaNinolU",
"question_id": "316-1",
"task_type": "Attribute Perception",
"question": "Which of the following features can describe Hamlet in the video?",
"options": [
"A. Wearing brown coat.",
"B. Blonde hair.",
"C. Thick beard.",
"D. Long hair."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following features can describe Hamlet in the video?\nOption:\nA. Wearing brown coat.\nB. Blonde hair.\nC. Thick beard.\nD. Long hair.\nAnswer with the option's letter from the given choices directly.",
945,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "316-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 946,
"target": "D",
"doc": {
"video_id": "316",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=IaTaaNinolU",
"videoID": "IaTaaNinolU",
"question_id": "316-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following events happened after Polonius's death?",
"options": [
"A. The norwegian army intended to invade a small area in polan.",
"B. Hamlet's mother screamed out for help.",
"C. Hamlet rebuked Ophelia's advances.",
"D. Rosencrantz and Guildenstern escorted hamlet on a diplomatic mission."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events happened after Polonius's death?\nOption:\nA. The norwegian army intended to invade a small area in polan.\nB. Hamlet's mother screamed out for help.\nC. Hamlet rebuked Ophelia's advances.\nD. Rosencrantz and Guildenstern escorted hamlet on a diplomatic mission.\nAnswer with the option's letter from the given choices directly.",
946,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "316-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 947,
"target": "A",
"doc": {
"video_id": "316",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=IaTaaNinolU",
"videoID": "IaTaaNinolU",
"question_id": "316-3",
"task_type": "Action Reasoning",
"question": "Why didn't the English king kill Hamlet?",
"options": [
"A. Because Hamlet changed the content in the letter.",
"B. Because Claudius asked the English king to protect Hamlet during their journey.",
"C. Because the English king was unwilling to kill Hamlet.",
"D. Because Hamlet's friends chose to be killed instead of Hamlet."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why didn't the English king kill Hamlet?\nOption:\nA. Because Hamlet changed the content in the letter.\nB. Because Claudius asked the English king to protect Hamlet during their journey.\nC. Because the English king was unwilling to kill Hamlet.\nD. Because Hamlet's friends chose to be killed instead of Hamlet.\nAnswer with the option's letter from the given choices directly.",
947,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "316-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 948,
"target": "D",
"doc": {
"video_id": "317",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=Bkheu99K5lY",
"videoID": "Bkheu99K5lY",
"question_id": "317-1",
"task_type": "Object Reasoning",
"question": "Which of the following materials used to recreate Monate's Water Lilies is not neccessary?",
"options": [
"A. Paintbrushes (flat and round).",
"B. Acrylic paints in green, blue, white, yellow, and purple.",
"C. Tissue paper.",
"D. Drawing board."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following materials used to recreate Monate's Water Lilies is not neccessary?\nOption:\nA. Paintbrushes (flat and round).\nB. Acrylic paints in green, blue, white, yellow, and purple.\nC. Tissue paper.\nD. Drawing board.\nAnswer with the option's letter from the given choices directly.",
948,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "317-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 949,
"target": "C",
"doc": {
"video_id": "317",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=Bkheu99K5lY",
"videoID": "Bkheu99K5lY",
"question_id": "317-2",
"task_type": "Attribute Perception",
"question": "Which style does the painting in the video belong to?",
"options": [
"A. Romanticism.",
"B. Realism.",
"C. Impressionism.",
"D. Minimalism."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which style does the painting in the video belong to?\nOption:\nA. Romanticism.\nB. Realism.\nC. Impressionism.\nD. Minimalism.\nAnswer with the option's letter from the given choices directly.",
949,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "317-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 950,
"target": "C",
"doc": {
"video_id": "317",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=Bkheu99K5lY",
"videoID": "Bkheu99K5lY",
"question_id": "317-3",
"task_type": "Temporal Reasoning",
"question": "What's the correct order of the following events?\n1. Creating basic background.\n2. Add a few flowers to the painting.\n3. Building up the texture of the painting even more.\n4. Drawing the pads of the water lilies.",
"options": [
"A. 4, 1, 2, 3.",
"B. 2, 4, 1, 3.",
"C. 1, 4, 2, 3.",
"D. 1, 4, 3, 2."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the correct order of the following events?\n1. Creating basic background.\n2. Add a few flowers to the painting.\n3. Building up the texture of the painting even more.\n4. Drawing the pads of the water lilies.\nOption:\nA. 4, 1, 2, 3.\nB. 2, 4, 1, 3.\nC. 1, 4, 2, 3.\nD. 1, 4, 3, 2.\nAnswer with the option's letter from the given choices directly.",
950,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "317-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 951,
"target": "C",
"doc": {
"video_id": "318",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=QdlP8ai8trw",
"videoID": "QdlP8ai8trw",
"question_id": "318-1",
"task_type": "Spatial Perception",
"question": "Where is the sculture supposed to stay?",
"options": [
"A. Inside the church.",
"B. In Italy museum.",
"C. On the cathedral.",
"D. On the main street of Florence."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the sculture supposed to stay?\nOption:\nA. Inside the church.\nB. In Italy museum.\nC. On the cathedral.\nD. On the main street of Florence.\nAnswer with the option's letter from the given choices directly.",
951,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "318-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 952,
"target": "A",
"doc": {
"video_id": "318",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=QdlP8ai8trw",
"videoID": "QdlP8ai8trw",
"question_id": "318-2",
"task_type": "Object Recognition",
"question": "Which of the following statements is not correct?",
"options": [
"A. David ousted the Medici family.",
"B. People in Florence identified with David because of his courage of defeating giant Goliath.",
"C. Florence had experienced political turmoil before becoming a republic country.",
"D. The very stance of the sculpture was taken directly from classical antiquity."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not correct?\nOption:\nA. David ousted the Medici family.\nB. People in Florence identified with David because of his courage of defeating giant Goliath.\nC. Florence had experienced political turmoil before becoming a republic country.\nD. The very stance of the sculpture was taken directly from classical antiquity.\nAnswer with the option's letter from the given choices directly.",
952,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "318-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 953,
"target": "B",
"doc": {
"video_id": "318",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=QdlP8ai8trw",
"videoID": "QdlP8ai8trw",
"question_id": "318-3",
"task_type": "Action Recognition",
"question": "What was David doing according to the sculpture?",
"options": [
"A. Completely relaxing.",
"B. Ready for a fight.",
"C. Overlooking at a battle in the distance.",
"D. Fighting with giant Goliath."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was David doing according to the sculpture?\nOption:\nA. Completely relaxing.\nB. Ready for a fight.\nC. Overlooking at a battle in the distance.\nD. Fighting with giant Goliath.\nAnswer with the option's letter from the given choices directly.",
953,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "318-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 954,
"target": "B",
"doc": {
"video_id": "319",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=MrRvX5I8PyY",
"videoID": "MrRvX5I8PyY",
"question_id": "319-1",
"task_type": "Attribute Perception",
"question": "What kind of clothes is the speaker wearing?",
"options": [
"A. Blue jacket.",
"B. Green plaid shirt.",
"C. Brown shirt.",
"D. Yellow T-shirt."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of clothes is the speaker wearing?\nOption:\nA. Blue jacket.\nB. Green plaid shirt.\nC. Brown shirt.\nD. Yellow T-shirt.\nAnswer with the option's letter from the given choices directly.",
954,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "319-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 955,
"target": "B",
"doc": {
"video_id": "319",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=MrRvX5I8PyY",
"videoID": "MrRvX5I8PyY",
"question_id": "319-2",
"task_type": "Attribute Perception",
"question": "Which one is listed top 2 according to the video?",
"options": [
"A. Wangjing Soho.",
"B. London Aquatics Centre.",
"C. Messner Mountain Museum.",
"D. 600 Collins Street."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which one is listed top 2 according to the video?\nOption:\nA. Wangjing Soho.\nB. London Aquatics Centre.\nC. Messner Mountain Museum.\nD. 600 Collins Street.\nAnswer with the option's letter from the given choices directly.",
955,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "319-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 956,
"target": "A",
"doc": {
"video_id": "319",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=MrRvX5I8PyY",
"videoID": "MrRvX5I8PyY",
"question_id": "319-3",
"task_type": "Object Reasoning",
"question": "Which of the following statements about Zaha Hadid's work is not correct?",
"options": [
"A. All of her works are built in the center of cities.",
"B. Wangjing Soho consists of three towers resembling interweaving mountains.",
"C. Heydar Aliyev Centre is named after the country's former president.",
"D. 600 Collins Street locates in Australia."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about Zaha Hadid's work is not correct?\nOption:\nA. All of her works are built in the center of cities.\nB. Wangjing Soho consists of three towers resembling interweaving mountains.\nC. Heydar Aliyev Centre is named after the country's former president.\nD. 600 Collins Street locates in Australia.\nAnswer with the option's letter from the given choices directly.",
956,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "319-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 957,
"target": "B",
"doc": {
"video_id": "320",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=z2APU5ob9Og",
"videoID": "z2APU5ob9Og",
"question_id": "320-1",
"task_type": "Object Reasoning",
"question": "What is the first artist's specialty in pottery?",
"options": [
"A. Making coral.",
"B. Making tiny pottery.",
"C. Sculpting faces.",
"D. Ceating symmetrical pierced pots."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first artist's specialty in pottery?\nOption:\nA. Making coral.\nB. Making tiny pottery.\nC. Sculpting faces.\nD. Ceating symmetrical pierced pots.\nAnswer with the option's letter from the given choices directly.",
957,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "320-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 958,
"target": "B",
"doc": {
"video_id": "320",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=z2APU5ob9Og",
"videoID": "z2APU5ob9Og",
"question_id": "320-2",
"task_type": "Object Reasoning",
"question": "Why can the tenth artist create potteries with marbled designs?",
"options": [
"A. Because her works are made of marble.",
"B. Because she glazes her pottery with bubbles.",
"C. Because she paints marbel's texture to her works.",
"D. Because she learns the perfect tempreture to generate such design."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why can the tenth artist create potteries with marbled designs?\nOption:\nA. Because her works are made of marble.\nB. Because she glazes her pottery with bubbles.\nC. Because she paints marbel's texture to her works.\nD. Because she learns the perfect tempreture to generate such design.\nAnswer with the option's letter from the given choices directly.",
958,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "320-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 959,
"target": "B",
"doc": {
"video_id": "320",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=z2APU5ob9Og",
"videoID": "z2APU5ob9Og",
"question_id": "320-3",
"task_type": "Attribute Perception",
"question": "What's the main color of Hugh Hope's work?",
"options": [
"A. Blue.",
"B. Red.",
"C. Green.",
"D. Black."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the main color of Hugh Hope's work?\nOption:\nA. Blue.\nB. Red.\nC. Green.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
959,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "320-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 960,
"target": "A",
"doc": {
"video_id": "321",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=7Hk9jct2ozY",
"videoID": "7Hk9jct2ozY",
"question_id": "321-1",
"task_type": "Action Recognition",
"question": "What is responsible for gene activation according to the video?",
"options": [
"A. Gene activation by Vitamin D.",
"B. Hormonal changes during adolescence.",
"C. Exposure to ultraviolet light.",
"D. Cellular metabolism during exercise."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is responsible for gene activation according to the video?\nOption:\nA. Gene activation by Vitamin D.\nB. Hormonal changes during adolescence.\nC. Exposure to ultraviolet light.\nD. Cellular metabolism during exercise.\nAnswer with the option's letter from the given choices directly.",
960,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "321-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 961,
"target": "D",
"doc": {
"video_id": "321",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=7Hk9jct2ozY",
"videoID": "7Hk9jct2ozY",
"question_id": "321-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the following events are presented in the video?\n(a) The nucleus is introduced.\n(b) Base pairing.\n(c) Cell division.",
"options": [
"A. (a)(c)(b).",
"B. (c)(a)(b).",
"C. (c)(b)(a).",
"D. (a)(c)(b)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the following events are presented in the video?\n(a) The nucleus is introduced.\n(b) Base pairing.\n(c) Cell division.\nOption:\nA. (a)(c)(b).\nB. (c)(a)(b).\nC. (c)(b)(a).\nD. (a)(c)(b).\nAnswer with the option's letter from the given choices directly.",
961,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "321-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 962,
"target": "B",
"doc": {
"video_id": "321",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=7Hk9jct2ozY",
"videoID": "7Hk9jct2ozY",
"question_id": "321-3",
"task_type": "Information Synopsis",
"question": "What does this video focus on?",
"options": [
"A. DNA variation.",
"B. Transcription and coding of DNA.",
"C. Technical principles of genetic engineering.",
"D. Differences between animal and plant DNA."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video focus on?\nOption:\nA. DNA variation.\nB. Transcription and coding of DNA.\nC. Technical principles of genetic engineering.\nD. Differences between animal and plant DNA.\nAnswer with the option's letter from the given choices directly.",
962,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "321-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 963,
"target": "C",
"doc": {
"video_id": "322",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=UZTf3OXJDWA",
"videoID": "UZTf3OXJDWA",
"question_id": "322-1",
"task_type": "Information Synopsis",
"question": "What is the context of the video?",
"options": [
"A. A documentary film on human evolutionary biology.",
"B. An overview of the human circulatory system.",
"C. A clip about the human immune system.",
"D. A detailed study of the human digestive system."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the context of the video?\nOption:\nA. A documentary film on human evolutionary biology.\nB. An overview of the human circulatory system.\nC. A clip about the human immune system.\nD. A detailed study of the human digestive system.\nAnswer with the option's letter from the given choices directly.",
963,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "322-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 964,
"target": "A",
"doc": {
"video_id": "322",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=UZTf3OXJDWA",
"videoID": "UZTf3OXJDWA",
"question_id": "322-2",
"task_type": "Temporal Reasoning",
"question": "What is the sequence of events in the video showing the following?\n(a) A rolling T-cell.\n(b) Human hair follicles ooze oil.\n(c) Plasma cell division.",
"options": [
"A. (b)(a)(c).",
"B. (a)(b)(c).",
"C. (b)(c)(a).",
"D. (c)(a)(b)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the sequence of events in the video showing the following?\n(a) A rolling T-cell.\n(b) Human hair follicles ooze oil.\n(c) Plasma cell division.\nOption:\nA. (b)(a)(c).\nB. (a)(b)(c).\nC. (b)(c)(a).\nD. (c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
964,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "322-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 965,
"target": "D",
"doc": {
"video_id": "322",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=UZTf3OXJDWA",
"videoID": "UZTf3OXJDWA",
"question_id": "322-3",
"task_type": "Object Reasoning",
"question": "Which of the following descriptions of the blue, bisected cell at the beginning of the video is correct?",
"options": [
"A. They recognize and destroy pathogens.",
"B. They carry oxygen to various tissues and organs of the body.",
"C. They are responsible for transmitting and processing nerve signals.",
"D. They can differentiate into many kinds of cells."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following descriptions of the blue, bisected cell at the beginning of the video is correct?\nOption:\nA. They recognize and destroy pathogens.\nB. They carry oxygen to various tissues and organs of the body.\nC. They are responsible for transmitting and processing nerve signals.\nD. They can differentiate into many kinds of cells.\nAnswer with the option's letter from the given choices directly.",
965,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "322-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 966,
"target": "A",
"doc": {
"video_id": "323",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=URUJD5NEXC8",
"videoID": "URUJD5NEXC8",
"question_id": "323-1",
"task_type": "Object Recognition",
"question": "Which cellular structure is responsible for receiving proteins according to the video?",
"options": [
"A. Golgi apparatus (Golgi body).",
"B. Nucleus.",
"C. Ribosome.",
"D. Mitochondrion."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which cellular structure is responsible for receiving proteins according to the video?\nOption:\nA. Golgi apparatus (Golgi body).\nB. Nucleus.\nC. Ribosome.\nD. Mitochondrion.\nAnswer with the option's letter from the given choices directly.",
966,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "323-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 967,
"target": "B",
"doc": {
"video_id": "323",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=URUJD5NEXC8",
"videoID": "URUJD5NEXC8",
"question_id": "323-2",
"task_type": "Attribute Perception",
"question": "Which components are part of the object described in the video?",
"options": [
"A. Ribosomes and Endoplasmic Reticulum.",
"B. Microfilaments and Microtubules.",
"C. Golgi Apparatus and Lysosomes.",
"D. Nucleolus and Peroxisomes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which components are part of the object described in the video?\nOption:\nA. Ribosomes and Endoplasmic Reticulum.\nB. Microfilaments and Microtubules.\nC. Golgi Apparatus and Lysosomes.\nD. Nucleolus and Peroxisomes.\nAnswer with the option's letter from the given choices directly.",
967,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "323-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 968,
"target": "C",
"doc": {
"video_id": "323",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=URUJD5NEXC8",
"videoID": "URUJD5NEXC8",
"question_id": "323-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct chronological order in which the following parts of the video appear?\n(a) Human lungs.\n(b) Protein folding distortion.\n(c) Mice, plants and cells.",
"options": [
"A. (a)(b)(c).",
"B. (c)(a)(b).",
"C. (c)(b)(a).",
"D. (b)(c)(a)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct chronological order in which the following parts of the video appear?\n(a) Human lungs.\n(b) Protein folding distortion.\n(c) Mice, plants and cells.\nOption:\nA. (a)(b)(c).\nB. (c)(a)(b).\nC. (c)(b)(a).\nD. (b)(c)(a).\nAnswer with the option's letter from the given choices directly.",
968,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "323-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 969,
"target": "B",
"doc": {
"video_id": "324",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=eWUAB194qhU",
"videoID": "eWUAB194qhU",
"question_id": "324-1",
"task_type": "Information Synopsis",
"question": "What is the context of the video?",
"options": [
"A. It details the circulatory system and blood flow in the human body.",
"B. It provides a simple explanation of the lymphatic system.",
"C. It explores the muscular system and human movement.",
"D. It focuses on the digestive system and nutrient absorption."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the context of the video?\nOption:\nA. It details the circulatory system and blood flow in the human body.\nB. It provides a simple explanation of the lymphatic system.\nC. It explores the muscular system and human movement.\nD. It focuses on the digestive system and nutrient absorption.\nAnswer with the option's letter from the given choices directly.",
969,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "324-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 970,
"target": "A",
"doc": {
"video_id": "324",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=eWUAB194qhU",
"videoID": "eWUAB194qhU",
"question_id": "324-2",
"task_type": "Object Recognition",
"question": "Which cells are contained within the lymph nodes, as depicted in the video?",
"options": [
"A. Macrophages, dendritic cells, B-cells, and T-cells.",
"B. Red blood cells and platelets.",
"C. Neurons and glial cells.",
"D. Muscle cells and fat cells."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which cells are contained within the lymph nodes, as depicted in the video?\nOption:\nA. Macrophages, dendritic cells, B-cells, and T-cells.\nB. Red blood cells and platelets.\nC. Neurons and glial cells.\nD. Muscle cells and fat cells.\nAnswer with the option's letter from the given choices directly.",
970,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "324-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 971,
"target": "D",
"doc": {
"video_id": "324",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=eWUAB194qhU",
"videoID": "eWUAB194qhU",
"question_id": "324-3",
"task_type": "Action Reasoning",
"question": "Why should exercise be included as a part of maintaining a healthy lymphatic system, according to the video?",
"options": [
"A. It increases body fat which is essential for lymphatic function.",
"B. It helps to reduce the oxygen levels in the blood.",
"C. It slows down the heart rate and reduces lymph flow.",
"D. It speeds up the rate at which lymph drains back into your blood."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why should exercise be included as a part of maintaining a healthy lymphatic system, according to the video?\nOption:\nA. It increases body fat which is essential for lymphatic function.\nB. It helps to reduce the oxygen levels in the blood.\nC. It slows down the heart rate and reduces lymph flow.\nD. It speeds up the rate at which lymph drains back into your blood.\nAnswer with the option's letter from the given choices directly.",
971,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "324-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 972,
"target": "D",
"doc": {
"video_id": "325",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=CxBbw8eT624",
"videoID": "CxBbw8eT624",
"question_id": "325-1",
"task_type": "Object Reasoning",
"question": "What is the role of Neuralink's first product, Telepathy according to the video?",
"options": [
"A. Telepathy is designed for communication between users through thought alone without controlling devices.",
"B. Telepathy enhances the brain's natural cognitive abilities without external interaction.",
"C. Telepathy enhances the brain's natural cognitive abilities without external interaction.",
"D. Telepathy allows the user to reach out and control pretty much everything interconnected."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of Neuralink's first product, Telepathy according to the video?\nOption:\nA. Telepathy is designed for communication between users through thought alone without controlling devices.\nB. Telepathy enhances the brain's natural cognitive abilities without external interaction.\nC. Telepathy enhances the brain's natural cognitive abilities without external interaction.\nD. Telepathy allows the user to reach out and control pretty much everything interconnected.\nAnswer with the option's letter from the given choices directly.",
972,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "325-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 973,
"target": "C",
"doc": {
"video_id": "325",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=CxBbw8eT624",
"videoID": "CxBbw8eT624",
"question_id": "325-2",
"task_type": "Counting Problem",
"question": "How many individual electrodes are distributed across the circular hole in Neuralink's device?",
"options": [
"A. 512 electrodes.",
"B. 256 electrodes.",
"C. 1024 electrodes.",
"D. 2048 electrodes."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many individual electrodes are distributed across the circular hole in Neuralink's device?\nOption:\nA. 512 electrodes.\nB. 256 electrodes.\nC. 1024 electrodes.\nD. 2048 electrodes.\nAnswer with the option's letter from the given choices directly.",
973,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "325-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 974,
"target": "A",
"doc": {
"video_id": "325",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=CxBbw8eT624",
"videoID": "CxBbw8eT624",
"question_id": "325-3",
"task_type": "Temporal Perception",
"question": "What is the name of the brain-computer interface company introduced at the end of the video?",
"options": [
"A. Blackrock Neurotech.",
"B. Vanguard.",
"C. Neuralink.",
"D. Synchron."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the brain-computer interface company introduced at the end of the video?\nOption:\nA. Blackrock Neurotech.\nB. Vanguard.\nC. Neuralink.\nD. Synchron.\nAnswer with the option's letter from the given choices directly.",
974,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "325-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 975,
"target": "A",
"doc": {
"video_id": "326",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ZGGDKC3GlrI",
"videoID": "ZGGDKC3GlrI",
"question_id": "326-1",
"task_type": "Information Synopsis",
"question": "What is the topic of the video?",
"options": [
"A. The future of medicine.",
"B. A historical overview of medical treatments.",
"C. The impact of traditional medicines on modern healthcare.",
"D. A discussion on the ethics of animal testing in medical research."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of the video?\nOption:\nA. The future of medicine.\nB. A historical overview of medical treatments.\nC. The impact of traditional medicines on modern healthcare.\nD. A discussion on the ethics of animal testing in medical research.\nAnswer with the option's letter from the given choices directly.",
975,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "326-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 976,
"target": "A",
"doc": {
"video_id": "326",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ZGGDKC3GlrI",
"videoID": "ZGGDKC3GlrI",
"question_id": "326-2",
"task_type": "Temporal Perception",
"question": "What is the third future topic in medicine mentioned in the video?",
"options": [
"A. Regenerative medicine.",
"B. Nanotechnology.",
"C. Gene editing.",
"D. Bioengineering."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the third future topic in medicine mentioned in the video?\nOption:\nA. Regenerative medicine.\nB. Nanotechnology.\nC. Gene editing.\nD. Bioengineering.\nAnswer with the option's letter from the given choices directly.",
976,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "326-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 977,
"target": "B",
"doc": {
"video_id": "326",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ZGGDKC3GlrI",
"videoID": "ZGGDKC3GlrI",
"question_id": "326-3",
"task_type": "Object Reasoning",
"question": "What is the medical technique that takes up the most space in the video?",
"options": [
"A. A novel drug delivery system using nanotechnology.",
"B. A gene-editing technology.",
"C. An artificial intelligence for developing vaccines.",
"D. An advanced magnetic resonance imaging (MRI)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the medical technique that takes up the most space in the video?\nOption:\nA. A novel drug delivery system using nanotechnology.\nB. A gene-editing technology.\nC. An artificial intelligence for developing vaccines.\nD. An advanced magnetic resonance imaging (MRI).\nAnswer with the option's letter from the given choices directly.",
977,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "326-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 978,
"target": "A",
"doc": {
"video_id": "327",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iPi-QNYgAa4",
"videoID": "iPi-QNYgAa4",
"question_id": "327-1",
"task_type": "Attribute Perception",
"question": "Which summarizes the main focus of the video?",
"options": [
"A. The role of nanorobotics in medical treatment.",
"B. R&D process and future application of nanorobotics.",
"C. The role of nanorobots in cancer treatment.",
"D. Cannot be inferred from the video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summarizes the main focus of the video?\nOption:\nA. The role of nanorobotics in medical treatment.\nB. R&D process and future application of nanorobotics.\nC. The role of nanorobots in cancer treatment.\nD. Cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
978,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "327-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 979,
"target": "A",
"doc": {
"video_id": "327",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iPi-QNYgAa4",
"videoID": "iPi-QNYgAa4",
"question_id": "327-2",
"task_type": "Temporal Reasoning",
"question": "What is the exact order in which the video presents the following events?\n(a) Application of respiratory cells to cardiovascular cure therapy.\n(b) The use of nanorobots in cancer treatment.\n(c) Nanorobots for haemostasis.",
"options": [
"A. (b)(a)(c).",
"B. (a)(b)(c).",
"C. (b)(c)(a).",
"D. (c)(a)(b)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the exact order in which the video presents the following events?\n(a) Application of respiratory cells to cardiovascular cure therapy.\n(b) The use of nanorobots in cancer treatment.\n(c) Nanorobots for haemostasis.\nOption:\nA. (b)(a)(c).\nB. (a)(b)(c).\nC. (b)(c)(a).\nD. (c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
979,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "327-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 980,
"target": "C",
"doc": {
"video_id": "327",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iPi-QNYgAa4",
"videoID": "iPi-QNYgAa4",
"question_id": "327-3",
"task_type": "Object Reasoning",
"question": "For which diseases will respiratory cells potentially be used in treatment?",
"options": [
"A. Respiratory diseases.",
"B. Cancer treatment.",
"C. Cardiovascular disease.",
"D. Alzheimer's disease."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: For which diseases will respiratory cells potentially be used in treatment?\nOption:\nA. Respiratory diseases.\nB. Cancer treatment.\nC. Cardiovascular disease.\nD. Alzheimer's disease.\nAnswer with the option's letter from the given choices directly.",
980,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "327-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 981,
"target": "D",
"doc": {
"video_id": "328",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ckn9zybpYZ8",
"videoID": "ckn9zybpYZ8",
"question_id": "328-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. Genetic factors contributing to multiple sclerosis.",
"B. Global statistics on neurological diseases.",
"C. Advances in the treatment of Alzheimer's disease.",
"D. Parkinson's disease and its impact on the nervous system."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. Genetic factors contributing to multiple sclerosis.\nB. Global statistics on neurological diseases.\nC. Advances in the treatment of Alzheimer's disease.\nD. Parkinson's disease and its impact on the nervous system.\nAnswer with the option's letter from the given choices directly.",
981,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "328-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 982,
"target": "C",
"doc": {
"video_id": "328",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ckn9zybpYZ8",
"videoID": "ckn9zybpYZ8",
"question_id": "328-2",
"task_type": "OCR Problems",
"question": "How many cases of Parkinson's disease are mentioned in the video?",
"options": [
"A. Over 4 million cases globally.",
"B. Over 2 million cases globally.",
"C. Over 6 million cases globally.",
"D. Over 8 million cases globally."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cases of Parkinson's disease are mentioned in the video?\nOption:\nA. Over 4 million cases globally.\nB. Over 2 million cases globally.\nC. Over 6 million cases globally.\nD. Over 8 million cases globally.\nAnswer with the option's letter from the given choices directly.",
982,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "328-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 983,
"target": "A",
"doc": {
"video_id": "328",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=ckn9zybpYZ8",
"videoID": "ckn9zybpYZ8",
"question_id": "328-3",
"task_type": "Attribute Perception",
"question": "What issue related to mitochondria is discussed in the context of Parkinson's disease, as depicted in the video?",
"options": [
"A. There is an abnormal increase in the number of mitochondria.",
"B. Mitochondria have an important role in neuronal connectivity.",
"C. Mitochondria are too inefficient, resulting in inactive neurons.",
"D. Mitochondria prevent the formation of new neuronal connections."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What issue related to mitochondria is discussed in the context of Parkinson's disease, as depicted in the video?\nOption:\nA. There is an abnormal increase in the number of mitochondria.\nB. Mitochondria have an important role in neuronal connectivity.\nC. Mitochondria are too inefficient, resulting in inactive neurons.\nD. Mitochondria prevent the formation of new neuronal connections.\nAnswer with the option's letter from the given choices directly.",
983,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "328-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 984,
"target": "B",
"doc": {
"video_id": "329",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=3uV8XZcIBbk",
"videoID": "3uV8XZcIBbk",
"question_id": "329-1",
"task_type": "Information Synopsis",
"question": "What is the main topic of the video?",
"options": [
"A. The history of heart transplantation.",
"B. The manufacture of durable artificial hearts.",
"C. Developments in the treatment of cardiac arrhythmias.",
"D. The development and production of new drugs for cardiovascular disease."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of the video?\nOption:\nA. The history of heart transplantation.\nB. The manufacture of durable artificial hearts.\nC. Developments in the treatment of cardiac arrhythmias.\nD. The development and production of new drugs for cardiovascular disease.\nAnswer with the option's letter from the given choices directly.",
984,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "329-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 985,
"target": "B",
"doc": {
"video_id": "329",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=3uV8XZcIBbk",
"videoID": "3uV8XZcIBbk",
"question_id": "329-2",
"task_type": "Counting Problem",
"question": "According to the video, how many heart transplants are performed annually, and how many are requested in the U.S. and worldwide?",
"options": [
"A. 10,000 transplants were performed, 500,000 were requested in the U.S., and 5,000,000 were requested worldwide.",
"B. 6,000 transplants were performed, 600,000 were requested in the U.S., and 1,000,000 were requested worldwide.",
"C. 20,000 transplants were performed, 200,000 were requested in the U.S., and 2,000,000 were requested worldwide.",
"D. 3,000 transplants were performed, 300,000 were requested in the U.S., and 3,000,000 were requested worldwide."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many heart transplants are performed annually, and how many are requested in the U.S. and worldwide?\nOption:\nA. 10,000 transplants were performed, 500,000 were requested in the U.S., and 5,000,000 were requested worldwide.\nB. 6,000 transplants were performed, 600,000 were requested in the U.S., and 1,000,000 were requested worldwide.\nC. 20,000 transplants were performed, 200,000 were requested in the U.S., and 2,000,000 were requested worldwide.\nD. 3,000 transplants were performed, 300,000 were requested in the U.S., and 3,000,000 were requested worldwide.\nAnswer with the option's letter from the given choices directly.",
985,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "329-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 986,
"target": "A",
"doc": {
"video_id": "329",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=3uV8XZcIBbk",
"videoID": "3uV8XZcIBbk",
"question_id": "329-3",
"task_type": "Attribute Perception",
"question": "What is the role of the company BiVACOR in the field of medical devices?",
"options": [
"A. BiVACOR is engineering a long-term device intended to completely replace the function of a patient's native heart.",
"B. BiVACOR is developing a temporary device for heart assistance during surgery.",
"C. BiVACOR is focused on creating diagnostic tools for heart disease.",
"D. BiVACOR designs pacemakers for regulating heart rhythm disorders."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the company BiVACOR in the field of medical devices?\nOption:\nA. BiVACOR is engineering a long-term device intended to completely replace the function of a patient's native heart.\nB. BiVACOR is developing a temporary device for heart assistance during surgery.\nC. BiVACOR is focused on creating diagnostic tools for heart disease.\nD. BiVACOR designs pacemakers for regulating heart rhythm disorders.\nAnswer with the option's letter from the given choices directly.",
986,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "329-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 987,
"target": "B",
"doc": {
"video_id": "330",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=cMim0uU1yzA",
"videoID": "cMim0uU1yzA",
"question_id": "330-1",
"task_type": "Information Synopsis",
"question": "Which summarizes the main focus of the video?",
"options": [
"A. The effects of neurological disorders on brain function.",
"B. What happens to the brain as we age.",
"C. The effects of social and emotional education on brain development.",
"D. Neurological development of the human brain from infancy to adulthood."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summarizes the main focus of the video?\nOption:\nA. The effects of neurological disorders on brain function.\nB. What happens to the brain as we age.\nC. The effects of social and emotional education on brain development.\nD. Neurological development of the human brain from infancy to adulthood.\nAnswer with the option's letter from the given choices directly.",
987,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "330-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 988,
"target": "A",
"doc": {
"video_id": "330",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=cMim0uU1yzA",
"videoID": "cMim0uU1yzA",
"question_id": "330-2",
"task_type": "Action Reasoning",
"question": "What is different from the other chapters in the video when introducing the children's chapter?",
"options": [
"A. No brain slice model demonstration.",
"B. No neurotransmitter models were demonstrated.",
"C. Brain waves of the human brain were demonstrated and analyzed.",
"D. Video recordings of real-life scenarios are cited to reflect the function of the human brain at different ages."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is different from the other chapters in the video when introducing the children's chapter?\nOption:\nA. No brain slice model demonstration.\nB. No neurotransmitter models were demonstrated.\nC. Brain waves of the human brain were demonstrated and analyzed.\nD. Video recordings of real-life scenarios are cited to reflect the function of the human brain at different ages.\nAnswer with the option's letter from the given choices directly.",
988,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "330-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 989,
"target": "C",
"doc": {
"video_id": "330",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=cMim0uU1yzA",
"videoID": "cMim0uU1yzA",
"question_id": "330-3",
"task_type": "Object Reasoning",
"question": "What happens to your brain when you're in your 70s according to the video?",
"options": [
"A. A greater tendency to be introverted, to avoid risk, or to have a stronger tendency towards social bonding.",
"B. Reorganisation of the brain's neural network resulting in weaker connections between certain areas.",
"C. The brain becomes smaller in size.",
"D. 'Natural death' or 'aging' of neurons in the brain."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens to your brain when you're in your 70s according to the video?\nOption:\nA. A greater tendency to be introverted, to avoid risk, or to have a stronger tendency towards social bonding.\nB. Reorganisation of the brain's neural network resulting in weaker connections between certain areas.\nC. The brain becomes smaller in size.\nD. 'Natural death' or 'aging' of neurons in the brain.\nAnswer with the option's letter from the given choices directly.",
989,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "330-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 990,
"target": "D",
"doc": {
"video_id": "331",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=t6V9i8fFADI",
"videoID": "t6V9i8fFADI",
"question_id": "331-1",
"task_type": "OCR Problems",
"question": "Which company is featured in the video but not mentioned in the audio?",
"options": [
"A. Coca-Cola.",
"B. Apple.",
"C. Dairy Queen.",
"D. American Express."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which company is featured in the video but not mentioned in the audio?\nOption:\nA. Coca-Cola.\nB. Apple.\nC. Dairy Queen.\nD. American Express.\nAnswer with the option's letter from the given choices directly.",
990,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "331-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 991,
"target": "A",
"doc": {
"video_id": "331",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=t6V9i8fFADI",
"videoID": "t6V9i8fFADI",
"question_id": "331-2",
"task_type": "Object Reasoning",
"question": "Suppose Emily purchased some company stock in 2005 at a price of $50 per share. She passed away in 2020, when the stock's market value was $150 per share. If her heir, John, sold the stock in 2023 for $180 per share. Assume the capital gains tax rate is 15%. According to the 'step-up in basis' rule, how much capital gains tax would John need to pay?",
"options": [
"A. $4.5 per share.",
"B. $27 per share.",
"C. $19.5 per share.",
"D. $15 per share."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Suppose Emily purchased some company stock in 2005 at a price of $50 per share. She passed away in 2020, when the stock's market value was $150 per share. If her heir, John, sold the stock in 2023 for $180 per share. Assume the capital gains tax rate is 15%. According to the 'step-up in basis' rule, how much capital gains tax would John need to pay?\nOption:\nA. $4.5 per share.\nB. $27 per share.\nC. $19.5 per share.\nD. $15 per share.\nAnswer with the option's letter from the given choices directly.",
991,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "331-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 992,
"target": "B",
"doc": {
"video_id": "331",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=t6V9i8fFADI",
"videoID": "t6V9i8fFADI",
"question_id": "331-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the billionaires appear in the video?",
"options": [
"A. Warren Buffett, Morris Pearl, Elon Musk.",
"B. Warren Buffett, Jeff Bezos, Elon Musk.",
"C. Warren Buffett, Elon Musk, Jeff Bezos.",
"D. Morris Pearl, Jeff Bezos, Elon Musk."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the billionaires appear in the video?\nOption:\nA. Warren Buffett, Morris Pearl, Elon Musk.\nB. Warren Buffett, Jeff Bezos, Elon Musk.\nC. Warren Buffett, Elon Musk, Jeff Bezos.\nD. Morris Pearl, Jeff Bezos, Elon Musk.\nAnswer with the option's letter from the given choices directly.",
992,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "331-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 993,
"target": "A",
"doc": {
"video_id": "332",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=zIbNJCSCEjk",
"videoID": "zIbNJCSCEjk",
"question_id": "332-1",
"task_type": "Object Recognition",
"question": "What event wasn't mentioned within the video?",
"options": [
"A. Witness the inflation in Venezuela.",
"B. Fed rate hike.",
"C. The epidemic is affecting global supply chains.",
"D. Bicycle factory closed."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What event wasn't mentioned within the video?\nOption:\nA. Witness the inflation in Venezuela.\nB. Fed rate hike.\nC. The epidemic is affecting global supply chains.\nD. Bicycle factory closed.\nAnswer with the option's letter from the given choices directly.",
993,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "332-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 994,
"target": "A",
"doc": {
"video_id": "332",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=zIbNJCSCEjk",
"videoID": "zIbNJCSCEjk",
"question_id": "332-2",
"task_type": "OCR Problems",
"question": "According to the video, how much did the central bank drop interest rates during the epidemic?",
"options": [
"A. 5%.",
"B. 6%.",
"C. 1%.",
"D. 10%."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how much did the central bank drop interest rates during the epidemic?\nOption:\nA. 5%.\nB. 6%.\nC. 1%.\nD. 10%.\nAnswer with the option's letter from the given choices directly.",
994,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "332-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 995,
"target": "A",
"doc": {
"video_id": "332",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=zIbNJCSCEjk",
"videoID": "zIbNJCSCEjk",
"question_id": "332-3",
"task_type": "Object Reasoning",
"question": "Why does the price of apples go up in the video?",
"options": [
"A. There is excess liquidity in the market.",
"B. Good sales.",
"C. For profiteering.",
"D. The apples don't sell."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the price of apples go up in the video?\nOption:\nA. There is excess liquidity in the market.\nB. Good sales.\nC. For profiteering.\nD. The apples don't sell.\nAnswer with the option's letter from the given choices directly.",
995,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "332-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 996,
"target": "D",
"doc": {
"video_id": "333",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=9XCKnN6o19k",
"videoID": "9XCKnN6o19k",
"question_id": "333-1",
"task_type": "Object Reasoning",
"question": "What is the role of the woman speaking in a white blouse in the video?",
"options": [
"A. Journalist.",
"B. Employee.",
"C. Company Executive.",
"D. Counselor."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the woman speaking in a white blouse in the video?\nOption:\nA. Journalist.\nB. Employee.\nC. Company Executive.\nD. Counselor.\nAnswer with the option's letter from the given choices directly.",
996,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "333-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 997,
"target": "C",
"doc": {
"video_id": "333",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=9XCKnN6o19k",
"videoID": "9XCKnN6o19k",
"question_id": "333-2",
"task_type": "Object Reasoning",
"question": "Extrapolating from the video, when was Shein founded?",
"options": [
"A. 2015.",
"B. 2008.",
"C. 2011.",
"D. 2010."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Extrapolating from the video, when was Shein founded?\nOption:\nA. 2015.\nB. 2008.\nC. 2011.\nD. 2010.\nAnswer with the option's letter from the given choices directly.",
997,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "333-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 998,
"target": "B",
"doc": {
"video_id": "333",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=9XCKnN6o19k",
"videoID": "9XCKnN6o19k",
"question_id": "333-3",
"task_type": "Information Synopsis",
"question": "Inferring from the video, what is the characteristic of Shein Corp?",
"options": [
"A. Open.",
"B. Mystery.",
"C. Expensive.",
"D. Innovation."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Inferring from the video, what is the characteristic of Shein Corp?\nOption:\nA. Open.\nB. Mystery.\nC. Expensive.\nD. Innovation.\nAnswer with the option's letter from the given choices directly.",
998,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "333-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 999,
"target": "C",
"doc": {
"video_id": "334",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=257wV-AbKaE",
"videoID": "257wV-AbKaE",
"question_id": "334-1",
"task_type": "Information Synopsis",
"question": "What is the main point of this video?",
"options": [
"A. Who is Al Capone.",
"B. Al Capone's estate.",
"C. The operational process of money laundering.",
"D. How to Start a Laundry Business."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main point of this video?\nOption:\nA. Who is Al Capone.\nB. Al Capone's estate.\nC. The operational process of money laundering.\nD. How to Start a Laundry Business.\nAnswer with the option's letter from the given choices directly.",
999,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "334-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1000,
"target": "B",
"doc": {
"video_id": "334",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=257wV-AbKaE",
"videoID": "257wV-AbKaE",
"question_id": "334-2",
"task_type": "Object Reasoning",
"question": "As shown in the video, if a money launderer is rigging a bet, what stage of money laundering might it be at?",
"options": [
"A. Phase one.",
"B. Phase two.",
"C. Phase three.",
"D. Cannot be inferred."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As shown in the video, if a money launderer is rigging a bet, what stage of money laundering might it be at?\nOption:\nA. Phase one.\nB. Phase two.\nC. Phase three.\nD. Cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
1000,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "334-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1001,
"target": "B",
"doc": {
"video_id": "334",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=257wV-AbKaE",
"videoID": "257wV-AbKaE",
"question_id": "334-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the following patterns appear in the video?",
"options": [
"A. Pizza parlors, the United Nations emblem, dice.",
"B. Dice, pizza parlors, the United Nations emblem.",
"C. Pizza parlors, dice, the United Nations emblem.",
"D. The United Nations emblem, pizza parlors, dice."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the following patterns appear in the video?\nOption:\nA. Pizza parlors, the United Nations emblem, dice.\nB. Dice, pizza parlors, the United Nations emblem.\nC. Pizza parlors, dice, the United Nations emblem.\nD. The United Nations emblem, pizza parlors, dice.\nAnswer with the option's letter from the given choices directly.",
1001,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "334-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1002,
"target": "D",
"doc": {
"video_id": "335",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GPOv72Awo68",
"videoID": "GPOv72Awo68",
"question_id": "335-1",
"task_type": "Counting Problem",
"question": "How many times does Jacob Clifford appear alone in the video?",
"options": [
"A. 4.",
"B. 10.",
"C. 8.",
"D. 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does Jacob Clifford appear alone in the video?\nOption:\nA. 4.\nB. 10.\nC. 8.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
1002,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "335-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1003,
"target": "D",
"doc": {
"video_id": "335",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GPOv72Awo68",
"videoID": "GPOv72Awo68",
"question_id": "335-2",
"task_type": "Object Reasoning",
"question": "As depicted in the video, how did investors respond to the availability of mortgage-backed securities and CDOs in the early 2000s?",
"options": [
"A. They were hesitant due to the perceived risk associated with these new financial instruments.",
"B. They preferred investing in US Treasury bonds due to their stability and low risk.",
"C. They focused on individual mortgages, finding them more manageable and transparent.",
"D. They eagerly invested in them, believing they were low-risk and offered high returns."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, how did investors respond to the availability of mortgage-backed securities and CDOs in the early 2000s?\nOption:\nA. They were hesitant due to the perceived risk associated with these new financial instruments.\nB. They preferred investing in US Treasury bonds due to their stability and low risk.\nC. They focused on individual mortgages, finding them more manageable and transparent.\nD. They eagerly invested in them, believing they were low-risk and offered high returns.\nAnswer with the option's letter from the given choices directly.",
1003,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "335-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1004,
"target": "A",
"doc": {
"video_id": "335",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GPOv72Awo68",
"videoID": "GPOv72Awo68",
"question_id": "335-3",
"task_type": "Information Synopsis",
"question": "According to what is shown in the video, what were the key factors that contributed to the 2008 financial crisis in the United States?",
"options": [
"A. Predatory lending practices and a housing market bubble.",
"B. High demand for US Treasury bonds and conservative investment strategies.",
"C. Low interest rates and strict lending standards.",
"D. Increased regulation of the financial industry and cautious credit rating agencies."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, what were the key factors that contributed to the 2008 financial crisis in the United States?\nOption:\nA. Predatory lending practices and a housing market bubble.\nB. High demand for US Treasury bonds and conservative investment strategies.\nC. Low interest rates and strict lending standards.\nD. Increased regulation of the financial industry and cautious credit rating agencies.\nAnswer with the option's letter from the given choices directly.",
1004,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "335-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1005,
"target": "B",
"doc": {
"video_id": "336",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=ZCFkWDdmXG8",
"videoID": "ZCFkWDdmXG8",
"question_id": "336-1",
"task_type": "Information Synopsis",
"question": "Which of the following investment strategies does Warren Buffet advocate for in the video?",
"options": [
"A. Picking individual stocks based on careful analysis of company financials.",
"B. Investing in a diverse range of companies through a low-cost index fund.",
"C. Engaging in short-term trading to capitalize on market fluctuations.",
"D. Investing in socially responsible companies that prioritize environmental sustainability."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following investment strategies does Warren Buffet advocate for in the video?\nOption:\nA. Picking individual stocks based on careful analysis of company financials.\nB. Investing in a diverse range of companies through a low-cost index fund.\nC. Engaging in short-term trading to capitalize on market fluctuations.\nD. Investing in socially responsible companies that prioritize environmental sustainability.\nAnswer with the option's letter from the given choices directly.",
1005,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "336-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1006,
"target": "D",
"doc": {
"video_id": "336",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=ZCFkWDdmXG8",
"videoID": "ZCFkWDdmXG8",
"question_id": "336-2",
"task_type": "Object Reasoning",
"question": "The video mentions a potential risk associated with focusing solely on short-term stock price performance. What is this risk?",
"options": [
"A. Technology companies may dominate the stock market, pushing out traditional businesses.",
"B. Investors may lose confidence in the stock market, leading to a crash.",
"C. The government may intervene with regulations to control corporate behavior.",
"D. Companies may neglect long-term investments that benefit the overall economy."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video mentions a potential risk associated with focusing solely on short-term stock price performance. What is this risk?\nOption:\nA. Technology companies may dominate the stock market, pushing out traditional businesses.\nB. Investors may lose confidence in the stock market, leading to a crash.\nC. The government may intervene with regulations to control corporate behavior.\nD. Companies may neglect long-term investments that benefit the overall economy.\nAnswer with the option's letter from the given choices directly.",
1006,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "336-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1007,
"target": "A",
"doc": {
"video_id": "336",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=ZCFkWDdmXG8",
"videoID": "ZCFkWDdmXG8",
"question_id": "336-3",
"task_type": "Object Reasoning",
"question": "The video highlights a disparity between a booming stock market and other economic indicators. Which of the following best exemplifies this disparity?",
"options": [
"A. The Dow Jones Industrial Average reaches record highs while wages remain stagnant.",
"B. Companies invest heavily in research and development, leading to rapid GDP growth.",
"C. The S&P 500 experiences a slight dip while unemployment rates significantly decrease.",
"D. Investors eagerly participate in IPOs, boosting overall family net worth."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video highlights a disparity between a booming stock market and other economic indicators. Which of the following best exemplifies this disparity?\nOption:\nA. The Dow Jones Industrial Average reaches record highs while wages remain stagnant.\nB. Companies invest heavily in research and development, leading to rapid GDP growth.\nC. The S&P 500 experiences a slight dip while unemployment rates significantly decrease.\nD. Investors eagerly participate in IPOs, boosting overall family net worth.\nAnswer with the option's letter from the given choices directly.",
1007,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "336-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1008,
"target": "C",
"doc": {
"video_id": "337",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=CC9VeHrI3Es",
"videoID": "CC9VeHrI3Es",
"question_id": "337-1",
"task_type": "Object Reasoning",
"question": "If a cereal company predicts a potential increase in corn prices in the coming months, what action might they take in the futures market?",
"options": [
"A. Sell futures contracts for corn.",
"B. Wait for the price to increase before taking any action.",
"C. Buy futures contracts for corn.",
"D. Decrease the size of their cereal boxes to maintain current prices."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: If a cereal company predicts a potential increase in corn prices in the coming months, what action might they take in the futures market?\nOption:\nA. Sell futures contracts for corn.\nB. Wait for the price to increase before taking any action.\nC. Buy futures contracts for corn.\nD. Decrease the size of their cereal boxes to maintain current prices.\nAnswer with the option's letter from the given choices directly.",
1008,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "337-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1009,
"target": "A",
"doc": {
"video_id": "337",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=CC9VeHrI3Es",
"videoID": "CC9VeHrI3Es",
"question_id": "337-2",
"task_type": "Information Synopsis",
"question": "Which statement best summarises the role of the economics term pointed to by the green arrow at the beginning of the video?",
"options": [
"A. It provides maize producers and users with tools to manage price risk and ensure stability.",
"B. It allows maize producers and users to speculate and make steady profits and reduce unexpected losses.",
"C. It ensures that consumers always buy maize grain at the lowest possible price.",
"D. It reduces the information gap between maize producers and users and weakens market price volatility."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement best summarises the role of the economics term pointed to by the green arrow at the beginning of the video?\nOption:\nA. It provides maize producers and users with tools to manage price risk and ensure stability.\nB. It allows maize producers and users to speculate and make steady profits and reduce unexpected losses.\nC. It ensures that consumers always buy maize grain at the lowest possible price.\nD. It reduces the information gap between maize producers and users and weakens market price volatility.\nAnswer with the option's letter from the given choices directly.",
1009,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "337-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1010,
"target": "C",
"doc": {
"video_id": "337",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=CC9VeHrI3Es",
"videoID": "CC9VeHrI3Es",
"question_id": "337-3",
"task_type": "Object Reasoning",
"question": "What is a potential disadvantage for a corn farmer who chooses NOT to participate in the futures market?",
"options": [
"A. He will miss out on government subsidies offered to farmers who use the futures market.",
"B. He will be unable to secure storage for their corn after harvest.",
"C. He might have to sell their corn at a lower price than anticipated during harvest.",
"D. He will be forced to sell their entire corn crop to a single buyer."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is a potential disadvantage for a corn farmer who chooses NOT to participate in the futures market?\nOption:\nA. He will miss out on government subsidies offered to farmers who use the futures market.\nB. He will be unable to secure storage for their corn after harvest.\nC. He might have to sell their corn at a lower price than anticipated during harvest.\nD. He will be forced to sell their entire corn crop to a single buyer.\nAnswer with the option's letter from the given choices directly.",
1010,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "337-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1011,
"target": "C",
"doc": {
"video_id": "338",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GFTKKyYSCKs",
"videoID": "GFTKKyYSCKs",
"question_id": "338-1",
"task_type": "Object Reasoning",
"question": "According to the video, what happened in early 2020 in response to the COVID-19 pandemic?",
"options": [
"A. The Federal Reserve drastically raised interest rates to combat potential inflation.",
"B. The US government implemented large-scale tax cuts to stimulate the economy.",
"C. The Federal Reserve committed to purchasing an unprecedented amount of US Treasury bonds.",
"D. The US government launched massive public infrastructure projects to create jobs."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what happened in early 2020 in response to the COVID-19 pandemic?\nOption:\nA. The Federal Reserve drastically raised interest rates to combat potential inflation.\nB. The US government implemented large-scale tax cuts to stimulate the economy.\nC. The Federal Reserve committed to purchasing an unprecedented amount of US Treasury bonds.\nD. The US government launched massive public infrastructure projects to create jobs.\nAnswer with the option's letter from the given choices directly.",
1011,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "338-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1012,
"target": "A",
"doc": {
"video_id": "338",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GFTKKyYSCKs",
"videoID": "GFTKKyYSCKs",
"question_id": "338-2",
"task_type": "Information Synopsis",
"question": "In line with the video evidence, what is the main reason why central banks avoid simply printing more money during an economic crisis?",
"options": [
"A. It can lead to severe inflation, devaluing the existing money supply and hurting the economy.",
"B. It is logistically difficult and time-consuming to print large amounts of money.",
"C. Governments prefer to use other economic policies like tax cuts and infrastructure projects.",
"D. Central banks lack the authority to print money, as that is a government responsibility."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what is the main reason why central banks avoid simply printing more money during an economic crisis?\nOption:\nA. It can lead to severe inflation, devaluing the existing money supply and hurting the economy.\nB. It is logistically difficult and time-consuming to print large amounts of money.\nC. Governments prefer to use other economic policies like tax cuts and infrastructure projects.\nD. Central banks lack the authority to print money, as that is a government responsibility.\nAnswer with the option's letter from the given choices directly.",
1012,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "338-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1013,
"target": "D",
"doc": {
"video_id": "338",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=GFTKKyYSCKs",
"videoID": "GFTKKyYSCKs",
"question_id": "338-3",
"task_type": "Object Recognition",
"question": "Which building appears in the video when the central bank is mentioned?",
"options": [
"A. The European Central Bank.",
"B. Bank of England.",
"C. Deutsche Bundesbank.",
"D. Federal Reserve Building."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which building appears in the video when the central bank is mentioned?\nOption:\nA. The European Central Bank.\nB. Bank of England.\nC. Deutsche Bundesbank.\nD. Federal Reserve Building.\nAnswer with the option's letter from the given choices directly.",
1013,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "338-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1014,
"target": "A",
"doc": {
"video_id": "339",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=qFUuUzZNznY",
"videoID": "qFUuUzZNznY",
"question_id": "339-1",
"task_type": "Object Reasoning",
"question": "What is a key advantage for hotel owners who choose to franchise with a major brand?",
"options": [
"A. Access to the brand's established customer base and loyalty program.",
"B. Lower upfront investment costs for building and operating the hotel.",
"C. Greater control over daily operations and brand standards.",
"D. Higher profit margins due to lower franchise fees."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is a key advantage for hotel owners who choose to franchise with a major brand?\nOption:\nA. Access to the brand's established customer base and loyalty program.\nB. Lower upfront investment costs for building and operating the hotel.\nC. Greater control over daily operations and brand standards.\nD. Higher profit margins due to lower franchise fees.\nAnswer with the option's letter from the given choices directly.",
1014,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "339-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1015,
"target": "D",
"doc": {
"video_id": "339",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=qFUuUzZNznY",
"videoID": "qFUuUzZNznY",
"question_id": "339-2",
"task_type": "Information Synopsis",
"question": "According to what is shown in the video, what is a major factor influencing hotel room prices in dynamic cities?",
"options": [
"A. The fixed cost of amenities and services offered by the hotel.",
"B. The franchise fees charged by the hotel brand.",
"C. The local real estate market and property values.",
"D. Fluctuations in demand based on events and day of the week."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, what is a major factor influencing hotel room prices in dynamic cities?\nOption:\nA. The fixed cost of amenities and services offered by the hotel.\nB. The franchise fees charged by the hotel brand.\nC. The local real estate market and property values.\nD. Fluctuations in demand based on events and day of the week.\nAnswer with the option's letter from the given choices directly.",
1015,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "339-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1016,
"target": "D",
"doc": {
"video_id": "339",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=qFUuUzZNznY",
"videoID": "qFUuUzZNznY",
"question_id": "339-3",
"task_type": "OCR Problems",
"question": "According to the video, in which tier of hotels do major brands like Marriott and Hyatt still predominantly own and operate the properties?",
"options": [
"A. Economy.",
"B. Midscale.",
"C. Upscale.",
"D. Luxury."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, in which tier of hotels do major brands like Marriott and Hyatt still predominantly own and operate the properties?\nOption:\nA. Economy.\nB. Midscale.\nC. Upscale.\nD. Luxury.\nAnswer with the option's letter from the given choices directly.",
1016,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "339-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1017,
"target": "A",
"doc": {
"video_id": "340",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=p7HKvqRI_Bo",
"videoID": "p7HKvqRI_Bo",
"question_id": "340-1",
"task_type": "Information Synopsis",
"question": "What is the intended educational purpose of this video for the viewer?",
"options": [
"A. How the stock market works.",
"B. How stock prices affect investors' investments.",
"C. The core factors that affect stock price fluctuations.",
"D. How novice investors can participate in stock investment."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the intended educational purpose of this video for the viewer?\nOption:\nA. How the stock market works.\nB. How stock prices affect investors' investments.\nC. The core factors that affect stock price fluctuations.\nD. How novice investors can participate in stock investment.\nAnswer with the option's letter from the given choices directly.",
1017,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "340-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1018,
"target": "C",
"doc": {
"video_id": "340",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=p7HKvqRI_Bo",
"videoID": "p7HKvqRI_Bo",
"question_id": "340-2",
"task_type": "Object Reasoning",
"question": "In the video, what event caused CCO's stock price to plummet?",
"options": [
"A. Selling expired beverages.",
"B. Selling illegal beverages.",
"C. A mouse emerged from the drink.",
"D. Deliberately raising the price of drinks."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what event caused CCO's stock price to plummet?\nOption:\nA. Selling expired beverages.\nB. Selling illegal beverages.\nC. A mouse emerged from the drink.\nD. Deliberately raising the price of drinks.\nAnswer with the option's letter from the given choices directly.",
1018,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "340-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1019,
"target": "B",
"doc": {
"video_id": "340",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=p7HKvqRI_Bo",
"videoID": "p7HKvqRI_Bo",
"question_id": "340-3",
"task_type": "Counting Problem",
"question": "As shown in the video, how many companies compete with the coffee company?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As shown in the video, how many companies compete with the coffee company?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1019,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "340-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1020,
"target": "A",
"doc": {
"video_id": "341",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=tBVUTFPate0",
"videoID": "tBVUTFPate0",
"question_id": "341-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. A guided tour of certain modules within the ISS.",
"B. A tutorial on the International Space Station's (ISS) scientific equipment.",
"C. A day in the life of an astronaut on the ISS.",
"D. The construction and assembly of the ISS."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. A guided tour of certain modules within the ISS.\nB. A tutorial on the International Space Station's (ISS) scientific equipment.\nC. A day in the life of an astronaut on the ISS.\nD. The construction and assembly of the ISS.\nAnswer with the option's letter from the given choices directly.",
1020,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "341-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1021,
"target": "C",
"doc": {
"video_id": "341",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=tBVUTFPate0",
"videoID": "tBVUTFPate0",
"question_id": "341-2",
"task_type": "Action Recognition",
"question": "What does the astronaut show us at the end of the video?",
"options": [
"A. Sleeping capsules.",
"B. Bathroom.",
"C. Food.",
"D. How to brush teeth."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the astronaut show us at the end of the video?\nOption:\nA. Sleeping capsules.\nB. Bathroom.\nC. Food.\nD. How to brush teeth.\nAnswer with the option's letter from the given choices directly.",
1021,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "341-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1022,
"target": "B",
"doc": {
"video_id": "341",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=tBVUTFPate0",
"videoID": "tBVUTFPate0",
"question_id": "341-3",
"task_type": "Counting Problem",
"question": "How many astronaut sleeping capsules are shown in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 1."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many astronaut sleeping capsules are shown in the video?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1022,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "341-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1023,
"target": "D",
"doc": {
"video_id": "342",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=AmrrSfiMxGA",
"videoID": "AmrrSfiMxGA",
"question_id": "342-1",
"task_type": "Action Recognition",
"question": "What is the astronaut in the video doing?",
"options": [
"A. Performing a scientific experiment in zero gravity.",
"B. Floating freely in space.",
"C. Preparing for re-entry into Earth's atmosphere.",
"D. Conducting a spacewalk outside the ISS."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the astronaut in the video doing?\nOption:\nA. Performing a scientific experiment in zero gravity.\nB. Floating freely in space.\nC. Preparing for re-entry into Earth's atmosphere.\nD. Conducting a spacewalk outside the ISS.\nAnswer with the option's letter from the given choices directly.",
1023,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "342-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1024,
"target": "B",
"doc": {
"video_id": "342",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=AmrrSfiMxGA",
"videoID": "AmrrSfiMxGA",
"question_id": "342-2",
"task_type": "Spatial Reasoning",
"question": "Which detail in the video indicates that the astronaut is currently outside the International Space Station (ISS)?",
"options": [
"A. The presence of stars in the background.",
"B. The Earth's surface is visible below.",
"C. The astronaut is using a computer.",
"D. The presence of microgravity affects the astronaut's hair."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which detail in the video indicates that the astronaut is currently outside the International Space Station (ISS)?\nOption:\nA. The presence of stars in the background.\nB. The Earth's surface is visible below.\nC. The astronaut is using a computer.\nD. The presence of microgravity affects the astronaut's hair.\nAnswer with the option's letter from the given choices directly.",
1024,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "342-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1025,
"target": "D",
"doc": {
"video_id": "342",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=AmrrSfiMxGA",
"videoID": "AmrrSfiMxGA",
"question_id": "342-3",
"task_type": "Object Reasoning",
"question": "What unfortunate event occurred during the middle of the video?",
"options": [
"A. There was an unexpected meteor shower.",
"B. The International Space Station lost power.",
"C. A tool was accidentally released into space.",
"D. One of the shields was inadvertently lost."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What unfortunate event occurred during the middle of the video?\nOption:\nA. There was an unexpected meteor shower.\nB. The International Space Station lost power.\nC. A tool was accidentally released into space.\nD. One of the shields was inadvertently lost.\nAnswer with the option's letter from the given choices directly.",
1025,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "342-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1026,
"target": "B",
"doc": {
"video_id": "343",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=sf4qRY3h_eo",
"videoID": "sf4qRY3h_eo",
"question_id": "343-1",
"task_type": "OCR Problems",
"question": "Approximately how much time does the rocket in the video take from launch to when it reaches 1000km/h?",
"options": [
"A. 1min5s.",
"B. 55s.",
"C. 57s.",
"D. 1min30s."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Approximately how much time does the rocket in the video take from launch to when it reaches 1000km/h?\nOption:\nA. 1min5s.\nB. 55s.\nC. 57s.\nD. 1min30s.\nAnswer with the option's letter from the given choices directly.",
1026,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "343-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1027,
"target": "C",
"doc": {
"video_id": "343",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=sf4qRY3h_eo",
"videoID": "sf4qRY3h_eo",
"question_id": "343-2",
"task_type": "Spatial Perception",
"question": "In the middle of the video, what is recorded on the shown below the video?",
"options": [
"A. Spaceship cruise.",
"B. Satellite cruise.",
"C. The process of the rocket's reverse thrust landing.",
"D. Rocket launching process."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle of the video, what is recorded on the shown below the video?\nOption:\nA. Spaceship cruise.\nB. Satellite cruise.\nC. The process of the rocket's reverse thrust landing.\nD. Rocket launching process.\nAnswer with the option's letter from the given choices directly.",
1027,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "343-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1028,
"target": "A",
"doc": {
"video_id": "343",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=sf4qRY3h_eo",
"videoID": "sf4qRY3h_eo",
"question_id": "343-3",
"task_type": "Counting Problem",
"question": "How many times are rocket part separations shown in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times are rocket part separations shown in the video?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1028,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "343-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1029,
"target": "C",
"doc": {
"video_id": "344",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=GQEYGAJIzAM",
"videoID": "GQEYGAJIzAM",
"question_id": "344-1",
"task_type": "Action Reasoning",
"question": "Why does the main character in the video dip a drop of coffee and drip it back?",
"options": [
"A. To showcase the flavor of this coffee.",
"B. To record it in slow motion.",
"C. To show how mountains form in craters on the moon's surface.",
"D. The author is bored while observing the moon."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the main character in the video dip a drop of coffee and drip it back?\nOption:\nA. To showcase the flavor of this coffee.\nB. To record it in slow motion.\nC. To show how mountains form in craters on the moon's surface.\nD. The author is bored while observing the moon.\nAnswer with the option's letter from the given choices directly.",
1029,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "344-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1030,
"target": "D",
"doc": {
"video_id": "344",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=GQEYGAJIzAM",
"videoID": "GQEYGAJIzAM",
"question_id": "344-2",
"task_type": "Object Recognition",
"question": "What is the second planet observed by the main character of the video?",
"options": [
"A. Phobos.",
"B. Moon.",
"C. Jupiter.",
"D. Mars."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second planet observed by the main character of the video?\nOption:\nA. Phobos.\nB. Moon.\nC. Jupiter.\nD. Mars.\nAnswer with the option's letter from the given choices directly.",
1030,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "344-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1031,
"target": "B",
"doc": {
"video_id": "344",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=GQEYGAJIzAM",
"videoID": "GQEYGAJIzAM",
"question_id": "344-3",
"task_type": "Object Recognition",
"question": "The main character of the video is observing the surface of the moon when he notices a straight line, what is it?",
"options": [
"A. Lunar Ridge.",
"B. Collapsed lava tubes.",
"C. Rift valley systems.",
"D. Scratch marks."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The main character of the video is observing the surface of the moon when he notices a straight line, what is it?\nOption:\nA. Lunar Ridge.\nB. Collapsed lava tubes.\nC. Rift valley systems.\nD. Scratch marks.\nAnswer with the option's letter from the given choices directly.",
1031,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "344-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1032,
"target": "A",
"doc": {
"video_id": "345",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=pSHVbLPWA28",
"videoID": "pSHVbLPWA28",
"question_id": "345-1",
"task_type": "Object Recognition",
"question": "The video shows how long it takes to drive from the Earth to the Moon?",
"options": [
"A. 160 days.",
"B. 50 days.",
"C. 180 days.",
"D. 19 days."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video shows how long it takes to drive from the Earth to the Moon?\nOption:\nA. 160 days.\nB. 50 days.\nC. 180 days.\nD. 19 days.\nAnswer with the option's letter from the given choices directly.",
1032,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "345-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1033,
"target": "B",
"doc": {
"video_id": "345",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=pSHVbLPWA28",
"videoID": "pSHVbLPWA28",
"question_id": "345-2",
"task_type": "Object Reasoning",
"question": "In the middle of the video a photo is shown, what are the dots on the photo?",
"options": [
"A. Mercury.",
"B. The Earth.",
"C. Uranus.",
"D. It is impossible to extrapolate."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle of the video a photo is shown, what are the dots on the photo?\nOption:\nA. Mercury.\nB. The Earth.\nC. Uranus.\nD. It is impossible to extrapolate.\nAnswer with the option's letter from the given choices directly.",
1033,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "345-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1034,
"target": "D",
"doc": {
"video_id": "345",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=pSHVbLPWA28",
"videoID": "pSHVbLPWA28",
"question_id": "345-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. Formation of the Universe.",
"B. Earth's position in the Milky Way.",
"C. Human exploration of the universe.",
"D. How big is the universe."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Formation of the Universe.\nB. Earth's position in the Milky Way.\nC. Human exploration of the universe.\nD. How big is the universe.\nAnswer with the option's letter from the given choices directly.",
1034,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "345-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1035,
"target": "A",
"doc": {
"video_id": "346",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=MaAv2IVzqhM",
"videoID": "MaAv2IVzqhM",
"question_id": "346-1",
"task_type": "Information Synopsis",
"question": "What is the main theme of the video?",
"options": [
"A. The history and achievements of LIGO.",
"B. The construction process of LIGO facilities.",
"C. The daily operations of LIGO.",
"D. The future goals and planned upgrades for LIGO."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main theme of the video?\nOption:\nA. The history and achievements of LIGO.\nB. The construction process of LIGO facilities.\nC. The daily operations of LIGO.\nD. The future goals and planned upgrades for LIGO.\nAnswer with the option's letter from the given choices directly.",
1035,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "346-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1036,
"target": "B",
"doc": {
"video_id": "346",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=MaAv2IVzqhM",
"videoID": "MaAv2IVzqhM",
"question_id": "346-2",
"task_type": "Spatial Reasoning",
"question": "What can be inferred about the location of LIGO Hanford in the video?",
"options": [
"A. It is located in a remote area, beside a small town.",
"B. it is located in an open area, beside a trail.",
"C. It is located in a plain area next to some cultivated land.",
"D. It cannot be inferred."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the location of LIGO Hanford in the video?\nOption:\nA. It is located in a remote area, beside a small town.\nB. it is located in an open area, beside a trail.\nC. It is located in a plain area next to some cultivated land.\nD. It cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
1036,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "346-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1037,
"target": "C",
"doc": {
"video_id": "346",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=MaAv2IVzqhM",
"videoID": "MaAv2IVzqhM",
"question_id": "346-3",
"task_type": "Object Reasoning",
"question": "Which of the following gravitational wave detectors was featured in the video?",
"options": [
"A. The Virgo Interferometer.",
"B. The Kamioka Gravitational Wave Detector.",
"C. The Laser Interferometer Gravitational-Wave Observatory.",
"D. The GEO600 Gravitational Wave Detector."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following gravitational wave detectors was featured in the video?\nOption:\nA. The Virgo Interferometer.\nB. The Kamioka Gravitational Wave Detector.\nC. The Laser Interferometer Gravitational-Wave Observatory.\nD. The GEO600 Gravitational Wave Detector.\nAnswer with the option's letter from the given choices directly.",
1037,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "346-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1038,
"target": "A",
"doc": {
"video_id": "347",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=VQKpMmBDtZo",
"videoID": "VQKpMmBDtZo",
"question_id": "347-1",
"task_type": "Information Synopsis",
"question": "What is the main issue addressed in the video?",
"options": [
"A. The challenge of space debris around Earth.",
"B. The history of space exploration.",
"C. The development of new satellite technology.",
"D. The study of distant galaxies."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main issue addressed in the video?\nOption:\nA. The challenge of space debris around Earth.\nB. The history of space exploration.\nC. The development of new satellite technology.\nD. The study of distant galaxies.\nAnswer with the option's letter from the given choices directly.",
1038,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "347-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1039,
"target": "C",
"doc": {
"video_id": "347",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=VQKpMmBDtZo",
"videoID": "VQKpMmBDtZo",
"question_id": "347-2",
"task_type": "Object Reasoning",
"question": "What significant event increased the number of objects in Earth's orbit in the video?",
"options": [
"A. A country conducted a large-scale launch of a satellite constellation.",
"B. The International Space Station discarded some obsolete modules.",
"C. The Iridium-Cosmos collision in 2009.",
"D. An old weather satellite disintegrated, creating a large amount of debris."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What significant event increased the number of objects in Earth's orbit in the video?\nOption:\nA. A country conducted a large-scale launch of a satellite constellation.\nB. The International Space Station discarded some obsolete modules.\nC. The Iridium-Cosmos collision in 2009.\nD. An old weather satellite disintegrated, creating a large amount of debris.\nAnswer with the option's letter from the given choices directly.",
1039,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "347-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1040,
"target": "D",
"doc": {
"video_id": "347",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=VQKpMmBDtZo",
"videoID": "VQKpMmBDtZo",
"question_id": "347-3",
"task_type": "Object Reasoning",
"question": "For the class of space debris measuring between 1 and 10 centimeters, which cleanup method is proposed in the video?",
"options": [
"A. Sending up specialized garbage collection satellites.",
"B. Capturing the debris with robotic arms.",
"C. Employing large nets to collect debris.",
"D. Using ground- and space-based lasers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: For the class of space debris measuring between 1 and 10 centimeters, which cleanup method is proposed in the video?\nOption:\nA. Sending up specialized garbage collection satellites.\nB. Capturing the debris with robotic arms.\nC. Employing large nets to collect debris.\nD. Using ground- and space-based lasers.\nAnswer with the option's letter from the given choices directly.",
1040,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "347-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1041,
"target": "B",
"doc": {
"video_id": "348",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=HLxvq_M4218",
"videoID": "HLxvq_M4218",
"question_id": "348-1",
"task_type": "Object Reasoning",
"question": "Which specific phenomenon during the video demonstrated the correctness of general relativity?",
"options": [
"A. The existence of gravitational waves.",
"B. Light bends when it passes through the gravitational field of the sun.",
"C. The gravitational redshift phenomenon.",
"D. The existence of black holes."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which specific phenomenon during the video demonstrated the correctness of general relativity?\nOption:\nA. The existence of gravitational waves.\nB. Light bends when it passes through the gravitational field of the sun.\nC. The gravitational redshift phenomenon.\nD. The existence of black holes.\nAnswer with the option's letter from the given choices directly.",
1041,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "348-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1042,
"target": "A",
"doc": {
"video_id": "348",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=HLxvq_M4218",
"videoID": "HLxvq_M4218",
"question_id": "348-2",
"task_type": "Information Synopsis",
"question": "What is the video about?",
"options": [
"A. The 1919 total solar eclipse that verified Einstein's theory of relativity.",
"B. Einstein's initial proposal of the theory that light bends under the influence of the sun's gravity.",
"C. Einstein's presentation of the complete theory of general relativity at the Prussian Academy of Sciences.",
"D. Einstein's first public presentation of the equations of general relativity in 1915."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video about?\nOption:\nA. The 1919 total solar eclipse that verified Einstein's theory of relativity.\nB. Einstein's initial proposal of the theory that light bends under the influence of the sun's gravity.\nC. Einstein's presentation of the complete theory of general relativity at the Prussian Academy of Sciences.\nD. Einstein's first public presentation of the equations of general relativity in 1915.\nAnswer with the option's letter from the given choices directly.",
1042,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "348-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1043,
"target": "D",
"doc": {
"video_id": "348",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=HLxvq_M4218",
"videoID": "HLxvq_M4218",
"question_id": "348-3",
"task_type": "Counting Problem",
"question": "At least how many total solar eclipses occurred between 1921 and 1970 according to the video?",
"options": [
"A. 7 total solar eclipses.",
"B. 6 total solar eclipses.",
"C. 5 total solar eclipses.",
"D. 4 total solar eclipses."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At least how many total solar eclipses occurred between 1921 and 1970 according to the video?\nOption:\nA. 7 total solar eclipses.\nB. 6 total solar eclipses.\nC. 5 total solar eclipses.\nD. 4 total solar eclipses.\nAnswer with the option's letter from the given choices directly.",
1043,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "348-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1044,
"target": "C",
"doc": {
"video_id": "349",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=V5M_grieetw",
"videoID": "V5M_grieetw",
"question_id": "349-1",
"task_type": "Information Synopsis",
"question": "What is the primary topic of the video?",
"options": [
"A. The life cycle of stars in our galaxy.",
"B. The history of space exploration missions.",
"C. The methods used to map the Milky Way galaxy.",
"D. The search for extraterrestrial life."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary topic of the video?\nOption:\nA. The life cycle of stars in our galaxy.\nB. The history of space exploration missions.\nC. The methods used to map the Milky Way galaxy.\nD. The search for extraterrestrial life.\nAnswer with the option's letter from the given choices directly.",
1044,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "349-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1045,
"target": "A",
"doc": {
"video_id": "349",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=V5M_grieetw",
"videoID": "V5M_grieetw",
"question_id": "349-2",
"task_type": "Object Recognition",
"question": "Who does the video focus on regarding their work with globular clusters?",
"options": [
"A. Harlow Shapley.",
"B. Walter Baade.",
"C. William Herschel.",
"D. Henrietta Swan Levitt."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who does the video focus on regarding their work with globular clusters?\nOption:\nA. Harlow Shapley.\nB. Walter Baade.\nC. William Herschel.\nD. Henrietta Swan Levitt.\nAnswer with the option's letter from the given choices directly.",
1045,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "349-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1046,
"target": "A",
"doc": {
"video_id": "349",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=V5M_grieetw",
"videoID": "V5M_grieetw",
"question_id": "349-3",
"task_type": "Temporal Perception",
"question": "In what order were the following mentioned in the video?",
"options": [
"A. William, Henrietta, Harlow, Gaia.",
"B. Henrietta, William, Gaia, Harlow.",
"C. Gaia, Harlow, Henrietta, William.",
"D. Harlow, Gaia, William, Henrietta."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order were the following mentioned in the video?\nOption:\nA. William, Henrietta, Harlow, Gaia.\nB. Henrietta, William, Gaia, Harlow.\nC. Gaia, Harlow, Henrietta, William.\nD. Harlow, Gaia, William, Henrietta.\nAnswer with the option's letter from the given choices directly.",
1046,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "349-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1047,
"target": "C",
"doc": {
"video_id": "350",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=e-P5IFTqB98",
"videoID": "e-P5IFTqB98",
"question_id": "350-1",
"task_type": "Information Synopsis",
"question": "What is the primary topic of the video?",
"options": [
"A. The history of telescopes.",
"B. The lifecycle of stars.",
"C. The formation and evolution of black holes.",
"D. The dangers of space travel."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary topic of the video?\nOption:\nA. The history of telescopes.\nB. The lifecycle of stars.\nC. The formation and evolution of black holes.\nD. The dangers of space travel.\nAnswer with the option's letter from the given choices directly.",
1047,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "350-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1048,
"target": "D",
"doc": {
"video_id": "350",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=e-P5IFTqB98",
"videoID": "e-P5IFTqB98",
"question_id": "350-2",
"task_type": "Object Reasoning",
"question": "How does the diameter of the black hole S5 0014+81 compare to other distances in space?",
"options": [
"A. It is roughly equal to the size of the Milky Way.",
"B. It is smaller than the distance from the Sun to Pluto.",
"C. It is about the same as the distance from Earth to the Sun.",
"D. The history of telescopes."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the diameter of the black hole S5 0014+81 compare to other distances in space?\nOption:\nA. It is roughly equal to the size of the Milky Way.\nB. It is smaller than the distance from the Sun to Pluto.\nC. It is about the same as the distance from Earth to the Sun.\nD. The history of telescopes.\nAnswer with the option's letter from the given choices directly.",
1048,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "350-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1049,
"target": "A",
"doc": {
"video_id": "350",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=e-P5IFTqB98",
"videoID": "e-P5IFTqB98",
"question_id": "350-3",
"task_type": "Object Reasoning",
"question": "What indicates if we travel inside a black hole for quite a while?",
"options": [
"A. The black hole has a super size massive.",
"B. The black hole is rotating rapidly.",
"C. The black hole is emitting visible light.",
"D. The black hole has a charged event horizon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What indicates if we travel inside a black hole for quite a while?\nOption:\nA. The black hole has a super size massive.\nB. The black hole is rotating rapidly.\nC. The black hole is emitting visible light.\nD. The black hole has a charged event horizon.\nAnswer with the option's letter from the given choices directly.",
1049,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "350-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1050,
"target": "C",
"doc": {
"video_id": "351",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=pmF41T52nJs",
"videoID": "pmF41T52nJs",
"question_id": "351-1",
"task_type": "Attribute Perception",
"question": "What is the color of the shirt that the speaker is wearing in the video?",
"options": [
"A. Green.",
"B. Brown.",
"C. Grey.",
"D. White."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the shirt that the speaker is wearing in the video?\nOption:\nA. Green.\nB. Brown.\nC. Grey.\nD. White.\nAnswer with the option's letter from the given choices directly.",
1050,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "351-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1051,
"target": "A",
"doc": {
"video_id": "351",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=pmF41T52nJs",
"videoID": "pmF41T52nJs",
"question_id": "351-2",
"task_type": "Action Reasoning",
"question": "Why does the speaker mention water pipes?",
"options": [
"A. Because water in the water pipes will expand when freezing.",
"B. Because water pipes will split when it's too cold.",
"C. Because some water pipes are made of rock.",
"D. Because lots of them are grey and looks quite similar to each other."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the speaker mention water pipes?\nOption:\nA. Because water in the water pipes will expand when freezing.\nB. Because water pipes will split when it's too cold.\nC. Because some water pipes are made of rock.\nD. Because lots of them are grey and looks quite similar to each other.\nAnswer with the option's letter from the given choices directly.",
1051,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "351-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1052,
"target": "A",
"doc": {
"video_id": "351",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=pmF41T52nJs",
"videoID": "pmF41T52nJs",
"question_id": "351-3",
"task_type": "Temporal Reasoning",
"question": "What's the order of the following types of weathering in the video?",
"options": [
"A. Mechanical weathering, chemical weathering and biological weathering.",
"B. Biological weathering, chemical weathering and mechanical weathering.",
"C. Chemical weathering, biological weathering and mechanical weathering.",
"D. Mechanical weathering, biological weathering and chemical whethering."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the order of the following types of weathering in the video?\nOption:\nA. Mechanical weathering, chemical weathering and biological weathering.\nB. Biological weathering, chemical weathering and mechanical weathering.\nC. Chemical weathering, biological weathering and mechanical weathering.\nD. Mechanical weathering, biological weathering and chemical whethering.\nAnswer with the option's letter from the given choices directly.",
1052,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "351-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1053,
"target": "B",
"doc": {
"video_id": "352",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=xa6SdvFA3w0",
"videoID": "xa6SdvFA3w0",
"question_id": "352-1",
"task_type": "Action Reasoning",
"question": "Why does the speaker mention students in a stressful class?",
"options": [
"A. Because she pities these students.",
"B. Because she wants to make this lesson not so serious.",
"C. Because students in a stressful class cause sea suface height anormalie.",
"D. Because their stress pushes the sea water."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the speaker mention students in a stressful class?\nOption:\nA. Because she pities these students.\nB. Because she wants to make this lesson not so serious.\nC. Because students in a stressful class cause sea suface height anormalie.\nD. Because their stress pushes the sea water.\nAnswer with the option's letter from the given choices directly.",
1053,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "352-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1054,
"target": "D",
"doc": {
"video_id": "352",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=xa6SdvFA3w0",
"videoID": "xa6SdvFA3w0",
"question_id": "352-2",
"task_type": "Object Recognition",
"question": "Which of the following causes is not mentioned in the video when it comes to ocean current?",
"options": [
"A. Coriolis effect.",
"B. Salinity.",
"C. Wind.",
"D. Global warming."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following causes is not mentioned in the video when it comes to ocean current?\nOption:\nA. Coriolis effect.\nB. Salinity.\nC. Wind.\nD. Global warming.\nAnswer with the option's letter from the given choices directly.",
1054,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "352-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1055,
"target": "A",
"doc": {
"video_id": "352",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=xa6SdvFA3w0",
"videoID": "xa6SdvFA3w0",
"question_id": "352-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following statements about the water bottle in imagination is true?",
"options": [
"A. It will firstly brought by equatorial current.",
"B. It will firstly moved eastwards.",
"C. It will move northwards after reaching Asia because the cold water piles up.",
"D. It will end up reaching Australia."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about the water bottle in imagination is true?\nOption:\nA. It will firstly brought by equatorial current.\nB. It will firstly moved eastwards.\nC. It will move northwards after reaching Asia because the cold water piles up.\nD. It will end up reaching Australia.\nAnswer with the option's letter from the given choices directly.",
1055,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "352-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1056,
"target": "D",
"doc": {
"video_id": "353",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=wdcXmerZWDc",
"videoID": "wdcXmerZWDc",
"question_id": "353-1",
"task_type": "Attribute Perception",
"question": "What's the weather like in the city at the beginning of the video?",
"options": [
"A. Stormy.",
"B. Rainy.",
"C. Cloudy.",
"D. Sunny."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the weather like in the city at the beginning of the video?\nOption:\nA. Stormy.\nB. Rainy.\nC. Cloudy.\nD. Sunny.\nAnswer with the option's letter from the given choices directly.",
1056,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "353-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1057,
"target": "A",
"doc": {
"video_id": "353",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=wdcXmerZWDc",
"videoID": "wdcXmerZWDc",
"question_id": "353-2",
"task_type": "Attribute Perception",
"question": "What is the color of the clothes that the speaker is wearing in the video?",
"options": [
"A. Navy blue.",
"B. Pink.",
"C. Green.",
"D. Brown."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the clothes that the speaker is wearing in the video?\nOption:\nA. Navy blue.\nB. Pink.\nC. Green.\nD. Brown.\nAnswer with the option's letter from the given choices directly.",
1057,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "353-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1058,
"target": "C",
"doc": {
"video_id": "353",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=wdcXmerZWDc",
"videoID": "wdcXmerZWDc",
"question_id": "353-3",
"task_type": "Object Reasoning",
"question": "What does the speaker explain before introducing MS4?",
"options": [
"A. The government has token several measures to handle storm water.",
"B. Developers are increasingly tasked with managing runoff from their projects.",
"C. It will be a difficult task for water treatment plants to deal with the storm water alone.",
"D. It's impossible to handle the storm water."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the speaker explain before introducing MS4?\nOption:\nA. The government has token several measures to handle storm water.\nB. Developers are increasingly tasked with managing runoff from their projects.\nC. It will be a difficult task for water treatment plants to deal with the storm water alone.\nD. It's impossible to handle the storm water.\nAnswer with the option's letter from the given choices directly.",
1058,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "353-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1059,
"target": "C",
"doc": {
"video_id": "354",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=HooZ84rpovQ",
"videoID": "HooZ84rpovQ",
"question_id": "354-1",
"task_type": "Object Reasoning",
"question": "How does the speaker introduce the fact that the Mediterranean might disappear for a long while?",
"options": [
"A. MSC.",
"B. A small elephant.",
"C. A huge rabit bone.",
"D. Insular gigantism."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the speaker introduce the fact that the Mediterranean might disappear for a long while?\nOption:\nA. MSC.\nB. A small elephant.\nC. A huge rabit bone.\nD. Insular gigantism.\nAnswer with the option's letter from the given choices directly.",
1059,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "354-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1060,
"target": "B",
"doc": {
"video_id": "354",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=HooZ84rpovQ",
"videoID": "HooZ84rpovQ",
"question_id": "354-2",
"task_type": "Object Recognition",
"question": "Which of the following hypotheses for the disappearance of Mediterranean isn't mentioned in the video?",
"options": [
"A. Global Cooling Event.",
"B. The rabbit get so huge because of insular gigantism.",
"C. Tectonic Events.",
"D. Tectonic and Climate Change."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following hypotheses for the disappearance of Mediterranean isn't mentioned in the video?\nOption:\nA. Global Cooling Event.\nB. The rabbit get so huge because of insular gigantism.\nC. Tectonic Events.\nD. Tectonic and Climate Change.\nAnswer with the option's letter from the given choices directly.",
1060,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "354-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1061,
"target": "D",
"doc": {
"video_id": "354",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=HooZ84rpovQ",
"videoID": "HooZ84rpovQ",
"question_id": "354-3",
"task_type": "Action Recognition",
"question": "What is Nuralagus rex doing in the video?",
"options": [
"A. Flying.",
"B. Eating.",
"C. Walking.",
"D. Hopping."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is Nuralagus rex doing in the video?\nOption:\nA. Flying.\nB. Eating.\nC. Walking.\nD. Hopping.\nAnswer with the option's letter from the given choices directly.",
1061,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "354-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1062,
"target": "D",
"doc": {
"video_id": "355",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=ZQP-7BPvvq0",
"videoID": "ZQP-7BPvvq0",
"question_id": "355-1",
"task_type": "Object Reasoning",
"question": "What does the speaker mean when mentioning images of livestock and grazing animals from ancient paintings?",
"options": [
"A. He loves these creatures.",
"B. The ancient artists shows more preference to these animal than others.",
"C. The ancient artists are familiar with these animal.",
"D. There used to be full of plants."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the speaker mean when mentioning images of livestock and grazing animals from ancient paintings?\nOption:\nA. He loves these creatures.\nB. The ancient artists shows more preference to these animal than others.\nC. The ancient artists are familiar with these animal.\nD. There used to be full of plants.\nAnswer with the option's letter from the given choices directly.",
1062,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "355-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1063,
"target": "B",
"doc": {
"video_id": "355",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=ZQP-7BPvvq0",
"videoID": "ZQP-7BPvvq0",
"question_id": "355-2",
"task_type": "Object Reasoning",
"question": "Which of the following elements is not an evidence for the green Sahara?",
"options": [
"A. More rainfall.",
"B. More Dust.",
"C. Ancient pollen.",
"D. Ancient paintings."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following elements is not an evidence for the green Sahara?\nOption:\nA. More rainfall.\nB. More Dust.\nC. Ancient pollen.\nD. Ancient paintings.\nAnswer with the option's letter from the given choices directly.",
1063,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "355-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1064,
"target": "A",
"doc": {
"video_id": "355",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=ZQP-7BPvvq0",
"videoID": "ZQP-7BPvvq0",
"question_id": "355-3",
"task_type": "Object Recognition",
"question": "What does the speaker explain after seeing the ancient paintings at the beginning of the video?",
"options": [
"A. The abnormal part of the paintings.",
"B. His favorite animal.",
"C. Why Sahara was once green.",
"D. How the ancient artists paint on rocks."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the speaker explain after seeing the ancient paintings at the beginning of the video?\nOption:\nA. The abnormal part of the paintings.\nB. His favorite animal.\nC. Why Sahara was once green.\nD. How the ancient artists paint on rocks.\nAnswer with the option's letter from the given choices directly.",
1064,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "355-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1065,
"target": "C",
"doc": {
"video_id": "356",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=rWp5ZpJAIAE",
"videoID": "rWp5ZpJAIAE",
"question_id": "356-1",
"task_type": "Attribute Perception",
"question": "What is the color of the clothes that William Smith is wearing in his protrait?",
"options": [
"A. White.",
"B. Blue.",
"C. Black.",
"D. Yellow."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the clothes that William Smith is wearing in his protrait?\nOption:\nA. White.\nB. Blue.\nC. Black.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
1065,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "356-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1066,
"target": "C",
"doc": {
"video_id": "356",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=rWp5ZpJAIAE",
"videoID": "rWp5ZpJAIAE",
"question_id": "356-2",
"task_type": "Counting Problem",
"question": "How many protraits at most are shown at one point of the video?",
"options": [
"A. Two.",
"B. One.",
"C. Three.",
"D. Four."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many protraits at most are shown at one point of the video?\nOption:\nA. Two.\nB. One.\nC. Three.\nD. Four.\nAnswer with the option's letter from the given choices directly.",
1066,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "356-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1067,
"target": "D",
"doc": {
"video_id": "356",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=rWp5ZpJAIAE",
"videoID": "rWp5ZpJAIAE",
"question_id": "356-3",
"task_type": "Temporal Perception",
"question": "What happens at the end of the Paleozoic Era?",
"options": [
"A. Rise of mammals and modern humans.",
"B. A Burst of diversity.",
"C. Reign of the reptiles.",
"D. A devastating catastraphy happens."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens at the end of the Paleozoic Era?\nOption:\nA. Rise of mammals and modern humans.\nB. A Burst of diversity.\nC. Reign of the reptiles.\nD. A devastating catastraphy happens.\nAnswer with the option's letter from the given choices directly.",
1067,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "356-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1068,
"target": "B",
"doc": {
"video_id": "357",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=DI6SemRT2iY",
"videoID": "DI6SemRT2iY",
"question_id": "357-1",
"task_type": "Attribute Perception",
"question": "What's the lichens possibly like according to its image in the video?",
"options": [
"A. It is black.",
"B. Strip-shaped single-cell aggregates.",
"C. Round single cell aggregate.",
"D. It can be very huge."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the lichens possibly like according to its image in the video?\nOption:\nA. It is black.\nB. Strip-shaped single-cell aggregates.\nC. Round single cell aggregate.\nD. It can be very huge.\nAnswer with the option's letter from the given choices directly.",
1068,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "357-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1069,
"target": "C",
"doc": {
"video_id": "357",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=DI6SemRT2iY",
"videoID": "DI6SemRT2iY",
"question_id": "357-2",
"task_type": "Object Reasoning",
"question": "What is the main reason that the period is called Boring Billion?",
"options": [
"A. Full of disasters.",
"B. Lack of species diversity.",
"C. Geological stability.",
"D. Fine weather."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main reason that the period is called Boring Billion?\nOption:\nA. Full of disasters.\nB. Lack of species diversity.\nC. Geological stability.\nD. Fine weather.\nAnswer with the option's letter from the given choices directly.",
1069,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "357-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1070,
"target": "D",
"doc": {
"video_id": "357",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=DI6SemRT2iY",
"videoID": "DI6SemRT2iY",
"question_id": "357-3",
"task_type": "Attribute Perception",
"question": "What's the main color of Nuna?",
"options": [
"A. Gray.",
"B. Blue.",
"C. Green.",
"D. Brown."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the main color of Nuna?\nOption:\nA. Gray.\nB. Blue.\nC. Green.\nD. Brown.\nAnswer with the option's letter from the given choices directly.",
1070,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "357-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1071,
"target": "A",
"doc": {
"video_id": "358",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=_qepWb_NVj4",
"videoID": "_qepWb_NVj4",
"question_id": "358-1",
"task_type": "Counting Problem",
"question": "How many different points of views about the amount of continents are mentioned in the video?",
"options": [
"A. Five.",
"B. Seven.",
"C. Four.",
"D. Six."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different points of views about the amount of continents are mentioned in the video?\nOption:\nA. Five.\nB. Seven.\nC. Four.\nD. Six.\nAnswer with the option's letter from the given choices directly.",
1071,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "358-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1072,
"target": "B",
"doc": {
"video_id": "358",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=_qepWb_NVj4",
"videoID": "_qepWb_NVj4",
"question_id": "358-2",
"task_type": "Object Reasoning",
"question": "Which of the following statements about the Eight-Continents theory is true according to the video?",
"options": [
"A. New Zealand belongs to Australia.",
"B. Some scholar agree on it because Zealandia is made up of continental crust.",
"C. Most scholars don't agree on the 8th continent because it's bigger than Greenland.",
"D. Zealandia fits all the qualifications for being a continent."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about the Eight-Continents theory is true according to the video?\nOption:\nA. New Zealand belongs to Australia.\nB. Some scholar agree on it because Zealandia is made up of continental crust.\nC. Most scholars don't agree on the 8th continent because it's bigger than Greenland.\nD. Zealandia fits all the qualifications for being a continent.\nAnswer with the option's letter from the given choices directly.",
1072,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "358-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1073,
"target": "B",
"doc": {
"video_id": "358",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=_qepWb_NVj4",
"videoID": "_qepWb_NVj4",
"question_id": "358-3",
"task_type": "OCR Problems",
"question": "What's the third criteria promoted by Nick Mortimer that a continent must meet?",
"options": [
"A. Elevated relative to the surrounding ocean floor.",
"B. Thicker and less dense than oceanic crust.",
"C. Have a well-defined extent and sufficient size.",
"D. Big enough."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the third criteria promoted by Nick Mortimer that a continent must meet?\nOption:\nA. Elevated relative to the surrounding ocean floor.\nB. Thicker and less dense than oceanic crust.\nC. Have a well-defined extent and sufficient size.\nD. Big enough.\nAnswer with the option's letter from the given choices directly.",
1073,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "358-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1074,
"target": "D",
"doc": {
"video_id": "359",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=QTK_bC00ilg",
"videoID": "QTK_bC00ilg",
"question_id": "359-1",
"task_type": "Object Recognition",
"question": "Which one does not exist on the islands west of the Wallace Line?",
"options": [
"A. Woodpecker.",
"B. Elephant.",
"C. Tiger.",
"D. Cockatoo."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which one does not exist on the islands west of the Wallace Line?\nOption:\nA. Woodpecker.\nB. Elephant.\nC. Tiger.\nD. Cockatoo.\nAnswer with the option's letter from the given choices directly.",
1074,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "359-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1075,
"target": "A",
"doc": {
"video_id": "359",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=QTK_bC00ilg",
"videoID": "QTK_bC00ilg",
"question_id": "359-2",
"task_type": "Temporal Perception",
"question": "When was Wallace firstly aware of the Wallace line?",
"options": [
"A. When he compared the species on two islands.",
"B. When he believed in Darwin's Theory.",
"C. When he saw the line by his eyes.",
"D. When the plate tectonics theory came into being."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When was Wallace firstly aware of the Wallace line?\nOption:\nA. When he compared the species on two islands.\nB. When he believed in Darwin's Theory.\nC. When he saw the line by his eyes.\nD. When the plate tectonics theory came into being.\nAnswer with the option's letter from the given choices directly.",
1075,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "359-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1076,
"target": "C",
"doc": {
"video_id": "359",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=QTK_bC00ilg",
"videoID": "QTK_bC00ilg",
"question_id": "359-3",
"task_type": "Temporal Reasoning",
"question": "What's the correct order of these following events?",
"options": [
"A. Asia moved southwards, creatures were seperated by intense current and live separatly.",
"B. Two continents collided, creatures were seperated by current and then moved to different islands.",
"C. Australia moved northwards, creatures moved to the islands were seperated by intense current.",
"D. Creatures were seperated by current, two continents collided and then the creatures moved to the islands."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the correct order of these following events?\nOption:\nA. Asia moved southwards, creatures were seperated by intense current and live separatly.\nB. Two continents collided, creatures were seperated by current and then moved to different islands.\nC. Australia moved northwards, creatures moved to the islands were seperated by intense current.\nD. Creatures were seperated by current, two continents collided and then the creatures moved to the islands.\nAnswer with the option's letter from the given choices directly.",
1076,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "359-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1077,
"target": "A",
"doc": {
"video_id": "360",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=7Ni4gcl4tpE",
"videoID": "7Ni4gcl4tpE",
"question_id": "360-1",
"task_type": "Attribute Perception",
"question": "What is the background color of the sign that marks the highest point of Kiribati?",
"options": [
"A. White.",
"B. Blue.",
"C. Red.",
"D. Black."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the background color of the sign that marks the highest point of Kiribati?\nOption:\nA. White.\nB. Blue.\nC. Red.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
1077,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "360-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1078,
"target": "C",
"doc": {
"video_id": "360",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=7Ni4gcl4tpE",
"videoID": "7Ni4gcl4tpE",
"question_id": "360-2",
"task_type": "Spatial Reasoning",
"question": "Why did the local people ask whether the youtuber brought his oxygen mask?",
"options": [
"A. Because the wind was intense.",
"B. Because there was very high in altitude.",
"C. Because he just made fun of the highest point in Kiribati.",
"D. Because his oxygen mask got lost."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the local people ask whether the youtuber brought his oxygen mask?\nOption:\nA. Because the wind was intense.\nB. Because there was very high in altitude.\nC. Because he just made fun of the highest point in Kiribati.\nD. Because his oxygen mask got lost.\nAnswer with the option's letter from the given choices directly.",
1078,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "360-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1079,
"target": "A",
"doc": {
"video_id": "360",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=7Ni4gcl4tpE",
"videoID": "7Ni4gcl4tpE",
"question_id": "360-3",
"task_type": "Object Reasoning",
"question": "What did the youtuber advise in order to save Kiribati?",
"options": [
"A. Bring tourists there and make them aware of the danger of climate change.",
"B. Build up the sea wall.",
"C. Bring tourists there and attract more money.",
"D. Buy other islands and ask local people to move to these higher islands."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the youtuber advise in order to save Kiribati?\nOption:\nA. Bring tourists there and make them aware of the danger of climate change.\nB. Build up the sea wall.\nC. Bring tourists there and attract more money.\nD. Buy other islands and ask local people to move to these higher islands.\nAnswer with the option's letter from the given choices directly.",
1079,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "360-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1080,
"target": "A",
"doc": {
"video_id": "361",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=gT2lNXqt2V0",
"videoID": "gT2lNXqt2V0",
"question_id": "361-1",
"task_type": "Object Reasoning",
"question": "Based on the video, what is the role of the girl in the yellow top and red trousers?",
"options": [
"A. Victim.",
"B. Suspect.",
"C. Police officer.",
"D. Defense lawyer."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is the role of the girl in the yellow top and red trousers?\nOption:\nA. Victim.\nB. Suspect.\nC. Police officer.\nD. Defense lawyer.\nAnswer with the option's letter from the given choices directly.",
1080,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "361-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1081,
"target": "D",
"doc": {
"video_id": "361",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=gT2lNXqt2V0",
"videoID": "gT2lNXqt2V0",
"question_id": "361-2",
"task_type": "Object Reasoning",
"question": "Who made the call received in the video?",
"options": [
"A. Defense lawyer.",
"B. Judge.",
"C. Police officer.",
"D. Victim assistance coordinator."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who made the call received in the video?\nOption:\nA. Defense lawyer.\nB. Judge.\nC. Police officer.\nD. Victim assistance coordinator.\nAnswer with the option's letter from the given choices directly.",
1081,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "361-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1082,
"target": "B",
"doc": {
"video_id": "361",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=gT2lNXqt2V0",
"videoID": "gT2lNXqt2V0",
"question_id": "361-3",
"task_type": "Attribute Perception",
"question": "What is the application submitted by the victim in the last part of the video?",
"options": [
"A. Application for a temporary or permanent injunction to prohibit certain actions by the offender.",
"B. Application for a victim's compensation program to seek financial assistance for medical expenses or other related costs.",
"C. Application for a victim impact statement to express the emotional and physical impact of the crime.",
"D. Application for a witness protection program to ensure safety and security."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the application submitted by the victim in the last part of the video?\nOption:\nA. Application for a temporary or permanent injunction to prohibit certain actions by the offender.\nB. Application for a victim's compensation program to seek financial assistance for medical expenses or other related costs.\nC. Application for a victim impact statement to express the emotional and physical impact of the crime.\nD. Application for a witness protection program to ensure safety and security.\nAnswer with the option's letter from the given choices directly.",
1082,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "361-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1083,
"target": "C",
"doc": {
"video_id": "362",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=TtJVJ9UP7M0",
"videoID": "TtJVJ9UP7M0",
"question_id": "362-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. How to understand the criminal justice system.",
"B. How to write a legal brief for a civil case.",
"C. How to be ready for Juvenile Court.",
"D. How to become a defense lawyer."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. How to understand the criminal justice system.\nB. How to write a legal brief for a civil case.\nC. How to be ready for Juvenile Court.\nD. How to become a defense lawyer.\nAnswer with the option's letter from the given choices directly.",
1083,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "362-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1084,
"target": "A",
"doc": {
"video_id": "362",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=TtJVJ9UP7M0",
"videoID": "TtJVJ9UP7M0",
"question_id": "362-2",
"task_type": "Action Recognition",
"question": "What is not mentioned in the video that need to pay attention to before entering the court?",
"options": [
"A. Brush teeth before hearing.",
"B. Turn off cell phone.",
"C. Throw gum away.",
"D. Eat before or after hearing."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is not mentioned in the video that need to pay attention to before entering the court?\nOption:\nA. Brush teeth before hearing.\nB. Turn off cell phone.\nC. Throw gum away.\nD. Eat before or after hearing.\nAnswer with the option's letter from the given choices directly.",
1084,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "362-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1085,
"target": "D",
"doc": {
"video_id": "362",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=TtJVJ9UP7M0",
"videoID": "TtJVJ9UP7M0",
"question_id": "362-3",
"task_type": "Object Reasoning",
"question": "What is the role of the man in the suit sitting next to the boy in the courtroom in the video?",
"options": [
"A. Not mentioned in the video.",
"B. Clerk.",
"C. Judge.",
"D. The boy's attorney."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the man in the suit sitting next to the boy in the courtroom in the video?\nOption:\nA. Not mentioned in the video.\nB. Clerk.\nC. Judge.\nD. The boy's attorney.\nAnswer with the option's letter from the given choices directly.",
1085,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "362-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1086,
"target": "B",
"doc": {
"video_id": "363",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=HwpzzAefx9M",
"videoID": "HwpzzAefx9M",
"question_id": "363-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. Social ethics.",
"B. The laws of War.",
"C. The laws of nature.",
"D. National regulations."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. Social ethics.\nB. The laws of War.\nC. The laws of nature.\nD. National regulations.\nAnswer with the option's letter from the given choices directly.",
1086,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "363-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1087,
"target": "C",
"doc": {
"video_id": "363",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=HwpzzAefx9M",
"videoID": "HwpzzAefx9M",
"question_id": "363-2",
"task_type": "Action Reasoning",
"question": "According to the content of the video, why does the little boy squatting on the ground playing with a toy car run away and hide?",
"options": [
"A. Because he hears a loud thunderclap, mistaking it for an explosion, and becomes terrified.",
"B. Because he sees several gangsters walking towards him and his mother with big knives, he is very scared.",
"C. Because he sees soldiers pointing guns at him and his mother, and he is very frightened.",
"D. Because he accidentally broke the toy car and doesn't want his mother to find out, fearing disappointment rather than punishment."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the content of the video, why does the little boy squatting on the ground playing with a toy car run away and hide?\nOption:\nA. Because he hears a loud thunderclap, mistaking it for an explosion, and becomes terrified.\nB. Because he sees several gangsters walking towards him and his mother with big knives, he is very scared.\nC. Because he sees soldiers pointing guns at him and his mother, and he is very frightened.\nD. Because he accidentally broke the toy car and doesn't want his mother to find out, fearing disappointment rather than punishment.\nAnswer with the option's letter from the given choices directly.",
1087,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "363-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1088,
"target": "A",
"doc": {
"video_id": "363",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=HwpzzAefx9M",
"videoID": "HwpzzAefx9M",
"question_id": "363-3",
"task_type": "Action Recognition",
"question": "What activities are the men in the video not allowed to do while in captivity?",
"options": [
"A. Use weapons.",
"B. Eat food.",
"C. Drink water.",
"D. Communicate with loved ones."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What activities are the men in the video not allowed to do while in captivity?\nOption:\nA. Use weapons.\nB. Eat food.\nC. Drink water.\nD. Communicate with loved ones.\nAnswer with the option's letter from the given choices directly.",
1088,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "363-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1089,
"target": "D",
"doc": {
"video_id": "364",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=3fQxvz4ENDg",
"videoID": "3fQxvz4ENDg",
"question_id": "364-1",
"task_type": "Information Synopsis",
"question": "Which summarizes the main content of the video?",
"options": [
"A. Introduction to existing AI technologies.",
"B. The evolution of artificial intelligence.",
"C. Impact of artificial intelligence on human society.",
"D. The world's first artificial intelligence law."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summarizes the main content of the video?\nOption:\nA. Introduction to existing AI technologies.\nB. The evolution of artificial intelligence.\nC. Impact of artificial intelligence on human society.\nD. The world's first artificial intelligence law.\nAnswer with the option's letter from the given choices directly.",
1089,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "364-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1090,
"target": "B",
"doc": {
"video_id": "364",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=3fQxvz4ENDg",
"videoID": "3fQxvz4ENDg",
"question_id": "364-2",
"task_type": "OCR Problems",
"question": "Which AI technology is not featured in the video?",
"options": [
"A. Chatbot.",
"B. Automatic driving.",
"C. Generative AI.",
"D. Deepfake."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which AI technology is not featured in the video?\nOption:\nA. Chatbot.\nB. Automatic driving.\nC. Generative AI.\nD. Deepfake.\nAnswer with the option's letter from the given choices directly.",
1090,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "364-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1091,
"target": "C",
"doc": {
"video_id": "364",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=3fQxvz4ENDg",
"videoID": "3fQxvz4ENDg",
"question_id": "364-3",
"task_type": "Attribute Perception",
"question": "What was discussed after the introduction to generative AI?",
"options": [
"A. How will it tackle risks posed by ai system.",
"B. How does the act define AI.",
"C. Punishment for violating legal rules.",
"D. What ai systems are exempted."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was discussed after the introduction to generative AI?\nOption:\nA. How will it tackle risks posed by ai system.\nB. How does the act define AI.\nC. Punishment for violating legal rules.\nD. What ai systems are exempted.\nAnswer with the option's letter from the given choices directly.",
1091,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "364-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1092,
"target": "A",
"doc": {
"video_id": "365",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=IGyx5UEwgtA",
"videoID": "IGyx5UEwgtA",
"question_id": "365-1",
"task_type": "Counting Problem",
"question": "How many judges hear the federal appeals shown in the video?",
"options": [
"A. 3.",
"B. 1.",
"C. 2.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many judges hear the federal appeals shown in the video?\nOption:\nA. 3.\nB. 1.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1092,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "365-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1093,
"target": "D",
"doc": {
"video_id": "365",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=IGyx5UEwgtA",
"videoID": "IGyx5UEwgtA",
"question_id": "365-2",
"task_type": "OCR Problems",
"question": "Which of the following is not mentioned in the video?",
"options": [
"A. Four original jurisdictions possessed by federal courts.",
"B. Cases that the Supreme Court will not hear.",
"C. Cases that the Supreme Court will hear.",
"D. What happens when a case goes to the Supreme Court."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not mentioned in the video?\nOption:\nA. Four original jurisdictions possessed by federal courts.\nB. Cases that the Supreme Court will not hear.\nC. Cases that the Supreme Court will hear.\nD. What happens when a case goes to the Supreme Court.\nAnswer with the option's letter from the given choices directly.",
1093,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "365-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1094,
"target": "B",
"doc": {
"video_id": "365",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=IGyx5UEwgtA",
"videoID": "IGyx5UEwgtA",
"question_id": "365-3",
"task_type": "Object Reasoning",
"question": "What is the example of avocados in the video meant to illustrate?",
"options": [
"A. The court only decides cases with certainty.",
"B. The courts are only deciding cases that are fully developed.",
"C. The courts only rule on cases that are standing.",
"D. The courts only rule on cases without controversy."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the example of avocados in the video meant to illustrate?\nOption:\nA. The court only decides cases with certainty.\nB. The courts are only deciding cases that are fully developed.\nC. The courts only rule on cases that are standing.\nD. The courts only rule on cases without controversy.\nAnswer with the option's letter from the given choices directly.",
1094,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "365-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1095,
"target": "C",
"doc": {
"video_id": "366",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=zOgYnntFl-k",
"videoID": "zOgYnntFl-k",
"question_id": "366-1",
"task_type": "Object Reasoning",
"question": "Which of the following professions does not appear at the beginning of the video?",
"options": [
"A. Worker.",
"B. Firemen.",
"C. Chef.",
"D. Judge."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following professions does not appear at the beginning of the video?\nOption:\nA. Worker.\nB. Firemen.\nC. Chef.\nD. Judge.\nAnswer with the option's letter from the given choices directly.",
1095,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "366-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1096,
"target": "A",
"doc": {
"video_id": "366",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=zOgYnntFl-k",
"videoID": "zOgYnntFl-k",
"question_id": "366-2",
"task_type": "Counting Problem",
"question": "How many members of the jury are shown in the first half of the video?",
"options": [
"A. 6.",
"B. 7.",
"C. 5.",
"D. 8."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many members of the jury are shown in the first half of the video?\nOption:\nA. 6.\nB. 7.\nC. 5.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1096,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "366-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1097,
"target": "D",
"doc": {
"video_id": "366",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=zOgYnntFl-k",
"videoID": "zOgYnntFl-k",
"question_id": "366-3",
"task_type": "Counting Problem",
"question": "In the real trial transcript in the second half of the video, how many people raised their right hands during the jury discussion?",
"options": [
"A. 8.",
"B. 6.",
"C. 9.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the real trial transcript in the second half of the video, how many people raised their right hands during the jury discussion?\nOption:\nA. 8.\nB. 6.\nC. 9.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1097,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "366-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1098,
"target": "B",
"doc": {
"video_id": "367",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=orVnyudsNuw",
"videoID": "orVnyudsNuw",
"question_id": "367-1",
"task_type": "OCR Problems",
"question": "What is the fifth sign that is introduced in the video?",
"options": [
"A. You Stand For What You Believe In.",
"B. You Exude Confidence.",
"C. You Have Integrity.",
"D. You Love Arguing To Prove A Point."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the fifth sign that is introduced in the video?\nOption:\nA. You Stand For What You Believe In.\nB. You Exude Confidence.\nC. You Have Integrity.\nD. You Love Arguing To Prove A Point.\nAnswer with the option's letter from the given choices directly.",
1098,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "367-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1099,
"target": "C",
"doc": {
"video_id": "367",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=orVnyudsNuw",
"videoID": "orVnyudsNuw",
"question_id": "367-2",
"task_type": "Temporal Reasoning",
"question": "When discussing which of the following points in this video, there was no page switching?\n(a) The first.\n(b) The fourth.\n(c) The fifth.\n(d) The eighth.\n(e) The ninth.",
"options": [
"A. (d).",
"B. (a)(c).",
"C. (b)(c)(d).",
"D. (a)(b)(c)(d)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When discussing which of the following points in this video, there was no page switching?\n(a) The first.\n(b) The fourth.\n(c) The fifth.\n(d) The eighth.\n(e) The ninth.\nOption:\nA. (d).\nB. (a)(c).\nC. (b)(c)(d).\nD. (a)(b)(c)(d).\nAnswer with the option's letter from the given choices directly.",
1099,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "367-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1100,
"target": "D",
"doc": {
"video_id": "367",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=orVnyudsNuw",
"videoID": "orVnyudsNuw",
"question_id": "367-3",
"task_type": "Counting Problem",
"question": "When introducing the eighth point, how many people finally appeared on the screen?",
"options": [
"A. 2.",
"B. 4.",
"C. 5.",
"D. 8."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When introducing the eighth point, how many people finally appeared on the screen?\nOption:\nA. 2.\nB. 4.\nC. 5.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1100,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "367-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1101,
"target": "D",
"doc": {
"video_id": "368",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=Hel5dqvImSw",
"videoID": "Hel5dqvImSw",
"question_id": "368-1",
"task_type": "Attribute Perception",
"question": "Which country is the destination of the plane at the beginning of the video?",
"options": [
"A. Australia.",
"B. America.",
"C. China.",
"D. Canada."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country is the destination of the plane at the beginning of the video?\nOption:\nA. Australia.\nB. America.\nC. China.\nD. Canada.\nAnswer with the option's letter from the given choices directly.",
1101,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "368-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1102,
"target": "B",
"doc": {
"video_id": "368",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=Hel5dqvImSw",
"videoID": "Hel5dqvImSw",
"question_id": "368-2",
"task_type": "Information Synopsis",
"question": "What is the second case in the video about?",
"options": [
"A. What should a lady do after being reported by her husband at the immigration office.",
"B. What should a lady do after her husband withdraws her application for permanent residency.",
"C. What a lady should do after divorcing her husband.",
"D. How a lady can obtain permanent residency."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second case in the video about?\nOption:\nA. What should a lady do after being reported by her husband at the immigration office.\nB. What should a lady do after her husband withdraws her application for permanent residency.\nC. What a lady should do after divorcing her husband.\nD. How a lady can obtain permanent residency.\nAnswer with the option's letter from the given choices directly.",
1102,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "368-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1103,
"target": "C",
"doc": {
"video_id": "368",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=Hel5dqvImSw",
"videoID": "Hel5dqvImSw",
"question_id": "368-3",
"task_type": "Spatial Reasoning",
"question": "In the last case of the video, what happens when the man divorces with the lady?",
"options": [
"A. Neither the woman nor her ex-husband can work in United States.",
"B. The lady will not be unable to work in the United States because of divorce.",
"C. The lady will not be unable to work in Canada because of divorce.",
"D. Neither the woman nor her ex-husband can work in Canada."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the last case of the video, what happens when the man divorces with the lady?\nOption:\nA. Neither the woman nor her ex-husband can work in United States.\nB. The lady will not be unable to work in the United States because of divorce.\nC. The lady will not be unable to work in Canada because of divorce.\nD. Neither the woman nor her ex-husband can work in Canada.\nAnswer with the option's letter from the given choices directly.",
1103,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "368-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1104,
"target": "A",
"doc": {
"video_id": "369",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=_rC5V5vEprs",
"videoID": "_rC5V5vEprs",
"question_id": "369-1",
"task_type": "Attribute Perception",
"question": "Which of the following options does not appear on the stone tablet carved by the two men at the beginning of the video?",
"options": [
"A. Book.",
"B. Shake hands.",
"C. Person on crutches.",
"D. Children."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options does not appear on the stone tablet carved by the two men at the beginning of the video?\nOption:\nA. Book.\nB. Shake hands.\nC. Person on crutches.\nD. Children.\nAnswer with the option's letter from the given choices directly.",
1104,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "369-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1105,
"target": "D",
"doc": {
"video_id": "369",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=_rC5V5vEprs",
"videoID": "_rC5V5vEprs",
"question_id": "369-2",
"task_type": "Object Recognition",
"question": "What type of cases are cited in the video?",
"options": [
"A. Robbery.",
"B. Domestic violence case.",
"C. Intentional injury case.",
"D. Divorce case."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of cases are cited in the video?\nOption:\nA. Robbery.\nB. Domestic violence case.\nC. Intentional injury case.\nD. Divorce case.\nAnswer with the option's letter from the given choices directly.",
1105,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "369-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1106,
"target": "B",
"doc": {
"video_id": "369",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=_rC5V5vEprs",
"videoID": "_rC5V5vEprs",
"question_id": "369-3",
"task_type": "Object Recognition",
"question": "Which codexes are compared in common in the second half of the video?",
"options": [
"A. Code of Hammurabi and the Laws of Manu.",
"B. Code of Hammurabi and Code of Deuteronomy.",
"C. Code of Ur-Nammu and Code of Deuteronomy.",
"D. Code of Hammurabi and Code of Deuteronomy and the Laws of Manu."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which codexes are compared in common in the second half of the video?\nOption:\nA. Code of Hammurabi and the Laws of Manu.\nB. Code of Hammurabi and Code of Deuteronomy.\nC. Code of Ur-Nammu and Code of Deuteronomy.\nD. Code of Hammurabi and Code of Deuteronomy and the Laws of Manu.\nAnswer with the option's letter from the given choices directly.",
1106,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "369-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1107,
"target": "C",
"doc": {
"video_id": "370",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=y3am4T5LuKU",
"videoID": "y3am4T5LuKU",
"question_id": "370-1",
"task_type": "Action Recognition",
"question": "What happens when a woman crosses the road in the beginning of the video when there are no legal rules?",
"options": [
"A. The lady gets hit by cars.",
"B. The road is very orderly, vehicles follow the rules, and it is very safe to cross the road.",
"C. The road is a mess, vehicles do not obey traffic rules, and it is very dangerous to cross the road.",
"D. The lady will be taken away by the police."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when a woman crosses the road in the beginning of the video when there are no legal rules?\nOption:\nA. The lady gets hit by cars.\nB. The road is very orderly, vehicles follow the rules, and it is very safe to cross the road.\nC. The road is a mess, vehicles do not obey traffic rules, and it is very dangerous to cross the road.\nD. The lady will be taken away by the police.\nAnswer with the option's letter from the given choices directly.",
1107,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "370-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1108,
"target": "A",
"doc": {
"video_id": "370",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=y3am4T5LuKU",
"videoID": "y3am4T5LuKU",
"question_id": "370-2",
"task_type": "Object Recognition",
"question": "What is the identity of the cartoon character wearing a hat and raising his right hand in salute in the video?",
"options": [
"A. Police.",
"B. Soldier.",
"C. Security guard.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the identity of the cartoon character wearing a hat and raising his right hand in salute in the video?\nOption:\nA. Police.\nB. Soldier.\nC. Security guard.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1108,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "370-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1109,
"target": "D",
"doc": {
"video_id": "370",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=y3am4T5LuKU",
"videoID": "y3am4T5LuKU",
"question_id": "370-3",
"task_type": "Counting Problem",
"question": "How many photos of the cartoon version of criminals appear in the video?",
"options": [
"A. 1.",
"B. 4.",
"C. 2.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many photos of the cartoon version of criminals appear in the video?\nOption:\nA. 1.\nB. 4.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
1109,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "370-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1110,
"target": "B",
"doc": {
"video_id": "371",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=7fnVcXfPe5c",
"videoID": "7fnVcXfPe5c",
"question_id": "371-1",
"task_type": "Spatial Perception",
"question": "Which option correctly indicates the location of the \"psychological tip\" written in white on the board when illustrating the third suggestion in the video?",
"options": [
"A. Between \"REMOVE THE FILTER\" and \"THREAD THE CONVERSATION\".",
"B. Below \"THE PRATFALL EFFECT\".",
"C. Between \"THREAD THE CONVERSATION\" and \"THE PRATFALL EFFECT\".",
"D. It can't be recognized in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which option correctly indicates the location of the \"psychological tip\" written in white on the board when illustrating the third suggestion in the video?\nOption:\nA. Between \"REMOVE THE FILTER\" and \"THREAD THE CONVERSATION\".\nB. Below \"THE PRATFALL EFFECT\".\nC. Between \"THREAD THE CONVERSATION\" and \"THE PRATFALL EFFECT\".\nD. It can't be recognized in the video.\nAnswer with the option's letter from the given choices directly.",
1110,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "371-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1111,
"target": "C",
"doc": {
"video_id": "371",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=7fnVcXfPe5c",
"videoID": "7fnVcXfPe5c",
"question_id": "371-2",
"task_type": "Object Recognition",
"question": "What does the visualization in the video most closely resemble?",
"options": [
"A. A holographic projection in a science museum.",
"B. A watercolor painting on a drawing board.",
"C. A chalk painting on a blackboard.",
"D. A digital artwork created using 3D modeling software."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the visualization in the video most closely resemble?\nOption:\nA. A holographic projection in a science museum.\nB. A watercolor painting on a drawing board.\nC. A chalk painting on a blackboard.\nD. A digital artwork created using 3D modeling software.\nAnswer with the option's letter from the given choices directly.",
1111,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "371-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1112,
"target": "A",
"doc": {
"video_id": "371",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=7fnVcXfPe5c",
"videoID": "7fnVcXfPe5c",
"question_id": "371-3",
"task_type": "Action Reasoning",
"question": "What is the purpose of drawing a bifurcated tree when introducing the threading?",
"options": [
"A. To illustrate multiple possible topics to continue the conversation.",
"B. To show various approaches to get yourself understood.",
"C. To demonstrate the consequences of removing the filter.",
"D. To clarify the importance of expressing yourself."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of drawing a bifurcated tree when introducing the threading?\nOption:\nA. To illustrate multiple possible topics to continue the conversation.\nB. To show various approaches to get yourself understood.\nC. To demonstrate the consequences of removing the filter.\nD. To clarify the importance of expressing yourself.\nAnswer with the option's letter from the given choices directly.",
1112,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "371-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1113,
"target": "A",
"doc": {
"video_id": "372",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=phM_SYH6aA0",
"videoID": "phM_SYH6aA0",
"question_id": "372-1",
"task_type": "Action Recognition",
"question": "How does the man demonstrate to do a tuck?",
"options": [
"A. He jumps up and drives his arms up and pushes off the floor.",
"B. He grabs his knees with his hands.",
"C. He brings his arms forward and over the head.",
"D. He closes his eyes during the jump."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the man demonstrate to do a tuck?\nOption:\nA. He jumps up and drives his arms up and pushes off the floor.\nB. He grabs his knees with his hands.\nC. He brings his arms forward and over the head.\nD. He closes his eyes during the jump.\nAnswer with the option's letter from the given choices directly.",
1113,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "372-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1114,
"target": "D",
"doc": {
"video_id": "372",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=phM_SYH6aA0",
"videoID": "phM_SYH6aA0",
"question_id": "372-2",
"task_type": "Object Reasoning",
"question": "What is the function of the phone display at the end of this video?",
"options": [
"A. Leading the user to download and install the application.",
"B. Showing the services available on the website.",
"C. Displaying the advertisements inserted on the web page to advertise the products.",
"D. Guiding users to log in to the website and complete the settings."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the phone display at the end of this video?\nOption:\nA. Leading the user to download and install the application.\nB. Showing the services available on the website.\nC. Displaying the advertisements inserted on the web page to advertise the products.\nD. Guiding users to log in to the website and complete the settings.\nAnswer with the option's letter from the given choices directly.",
1114,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "372-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1115,
"target": "B",
"doc": {
"video_id": "372",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=phM_SYH6aA0",
"videoID": "phM_SYH6aA0",
"question_id": "372-3",
"task_type": "Action Recognition",
"question": "What is the trick of evolving backward tuck into a backflip in the video?",
"options": [
"A. By adding a moment of rotation just before you enter the tuck.",
"B. By using the hands less on the mat.",
"C. By swinging the arms faster.",
"D. By landing with both feet pointed towards the ceiling."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the trick of evolving backward tuck into a backflip in the video?\nOption:\nA. By adding a moment of rotation just before you enter the tuck.\nB. By using the hands less on the mat.\nC. By swinging the arms faster.\nD. By landing with both feet pointed towards the ceiling.\nAnswer with the option's letter from the given choices directly.",
1115,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "372-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1116,
"target": "D",
"doc": {
"video_id": "373",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=ce_uHxXsUWA",
"videoID": "ce_uHxXsUWA",
"question_id": "373-1",
"task_type": "OCR Problems",
"question": "What is the name of the department on the website?",
"options": [
"A. Justice.Gov.",
"B. Education.Gov.",
"C. Defense.Gov.",
"D. Travel.State.Gov."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the department on the website?\nOption:\nA. Justice.Gov.\nB. Education.Gov.\nC. Defense.Gov.\nD. Travel.State.Gov.\nAnswer with the option's letter from the given choices directly.",
1116,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "373-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1117,
"target": "B",
"doc": {
"video_id": "373",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=ce_uHxXsUWA",
"videoID": "ce_uHxXsUWA",
"question_id": "373-2",
"task_type": "Action Reasoning",
"question": "How does the displaying link change after is historically clicked?",
"options": [
"A. Its size grows larger.",
"B. Its color turns from dark blue to purple.",
"C. It disappears from the website.",
"D. An underscore is added to the link."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the displaying link change after is historically clicked?\nOption:\nA. Its size grows larger.\nB. Its color turns from dark blue to purple.\nC. It disappears from the website.\nD. An underscore is added to the link.\nAnswer with the option's letter from the given choices directly.",
1117,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "373-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1118,
"target": "C",
"doc": {
"video_id": "373",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=ce_uHxXsUWA",
"videoID": "ce_uHxXsUWA",
"question_id": "373-3",
"task_type": "OCR Problems",
"question": "Which question is not included in the frequently asked questions displayed on the website?",
"options": [
"A. How do I create a MyTrabelGov account?",
"B. What does the \"Renew Passport\" button look like in my account?",
"C. What is the processing time for renewals?",
"D. Why don't I see an option to \"Renew Passport\" in my account?"
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which question is not included in the frequently asked questions displayed on the website?\nOption:\nA. How do I create a MyTrabelGov account?\nB. What does the \"Renew Passport\" button look like in my account?\nC. What is the processing time for renewals?\nD. Why don't I see an option to \"Renew Passport\" in my account?\nAnswer with the option's letter from the given choices directly.",
1118,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "373-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1119,
"target": "C",
"doc": {
"video_id": "374",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=us78iBEMihk",
"videoID": "us78iBEMihk",
"question_id": "374-1",
"task_type": "Object Reasoning",
"question": "What is the function of the red bucket shown in the video?",
"options": [
"A. To contain the water for irrigation.",
"B. To store gardening tools and equipment.",
"C. To accommodate the shoveled soil when digging a hole.",
"D. Can not be deduced."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the red bucket shown in the video?\nOption:\nA. To contain the water for irrigation.\nB. To store gardening tools and equipment.\nC. To accommodate the shoveled soil when digging a hole.\nD. Can not be deduced.\nAnswer with the option's letter from the given choices directly.",
1119,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "374-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1120,
"target": "A",
"doc": {
"video_id": "374",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=us78iBEMihk",
"videoID": "us78iBEMihk",
"question_id": "374-2",
"task_type": "Object Reasoning",
"question": "What is the difference between the plants before and after being placed in the hole?",
"options": [
"A. Its container is removed after being placed in the hole.",
"B. Its leaves are cut after being placed in the hole.",
"C. It is saturated before being placed in the hole.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the difference between the plants before and after being placed in the hole?\nOption:\nA. Its container is removed after being placed in the hole.\nB. Its leaves are cut after being placed in the hole.\nC. It is saturated before being placed in the hole.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1120,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "374-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1121,
"target": "B",
"doc": {
"video_id": "374",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=us78iBEMihk",
"videoID": "us78iBEMihk",
"question_id": "374-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It teaches how to prune trees.",
"B. It teaches how and when to plant trees and shrubs.",
"C. It teaches how to dig a hole.",
"D. It teaches how to escape from a forest."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It teaches how to prune trees.\nB. It teaches how and when to plant trees and shrubs.\nC. It teaches how to dig a hole.\nD. It teaches how to escape from a forest.\nAnswer with the option's letter from the given choices directly.",
1121,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "374-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1122,
"target": "B",
"doc": {
"video_id": "375",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=axGscDVdHWg",
"videoID": "axGscDVdHWg",
"question_id": "375-1",
"task_type": "Temporal Reasoning",
"question": "What are the steps to decorate a Christmas tree?\n(a) Add Ornaments.\n(b) Frost with Icicles to finish.\n(c) Assemble the tree, and add lights if necessary.\n(d) Fluff the branches according to your liking.\n(e) Add the nature.\n(f) Fix the area you want to camouflage.",
"options": [
"A. (c)(f)(a)(d)(e)(b).",
"B. (c)(d)(f)(e)(a)(b).",
"C. (a)(f)(c)(b)(e)(d).",
"D. (a)(d)(f)(e)(c)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the steps to decorate a Christmas tree?\n(a) Add Ornaments.\n(b) Frost with Icicles to finish.\n(c) Assemble the tree, and add lights if necessary.\n(d) Fluff the branches according to your liking.\n(e) Add the nature.\n(f) Fix the area you want to camouflage.\nOption:\nA. (c)(f)(a)(d)(e)(b).\nB. (c)(d)(f)(e)(a)(b).\nC. (a)(f)(c)(b)(e)(d).\nD. (a)(d)(f)(e)(c)(b).\nAnswer with the option's letter from the given choices directly.",
1122,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "375-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1123,
"target": "C",
"doc": {
"video_id": "375",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=axGscDVdHWg",
"videoID": "axGscDVdHWg",
"question_id": "375-2",
"task_type": "Object Recognition",
"question": "What item is not used to decorate the Christmas tree?",
"options": [
"A. Red balls.",
"B. Lights.",
"C. Green stars.",
"D. Icicles."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What item is not used to decorate the Christmas tree?\nOption:\nA. Red balls.\nB. Lights.\nC. Green stars.\nD. Icicles.\nAnswer with the option's letter from the given choices directly.",
1123,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "375-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1124,
"target": "A",
"doc": {
"video_id": "375",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=axGscDVdHWg",
"videoID": "axGscDVdHWg",
"question_id": "375-3",
"task_type": "Object Reasoning",
"question": "What is the difference of the woman when talking to the camera and when decorating the tree?",
"options": [
"A. She wears glasses when decorating but does not when talking.",
"B. She holds a ball when decorating but does not when talking.",
"C. She wears glasses when decorating but does not when talking.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the difference of the woman when talking to the camera and when decorating the tree?\nOption:\nA. She wears glasses when decorating but does not when talking.\nB. She holds a ball when decorating but does not when talking.\nC. She wears glasses when decorating but does not when talking.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1124,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "375-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1125,
"target": "A",
"doc": {
"video_id": "376",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=VBygMK8jRoI",
"videoID": "VBygMK8jRoI",
"question_id": "376-1",
"task_type": "Object Reasoning",
"question": "What is special about her skate?",
"options": [
"A. It has front and rear wheels of different colors.",
"B. It is newly bought.",
"C. It does not have shoelaces.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is special about her skate?\nOption:\nA. It has front and rear wheels of different colors.\nB. It is newly bought.\nC. It does not have shoelaces.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1125,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "376-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1126,
"target": "A",
"doc": {
"video_id": "376",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=VBygMK8jRoI",
"videoID": "VBygMK8jRoI",
"question_id": "376-2",
"task_type": "Object Recognition",
"question": "What is the final piece of equipment that the woman adds to her body?",
"options": [
"A. Gloves.",
"B. A Helmet.",
"C. Knee pads.",
"D. Elbow pads."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the final piece of equipment that the woman adds to her body?\nOption:\nA. Gloves.\nB. A Helmet.\nC. Knee pads.\nD. Elbow pads.\nAnswer with the option's letter from the given choices directly.",
1126,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "376-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1127,
"target": "C",
"doc": {
"video_id": "376",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=VBygMK8jRoI",
"videoID": "VBygMK8jRoI",
"question_id": "376-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It teaches rollerblading.",
"B. It teaches to ski.",
"C. It teaches roller skating.",
"D. It teaches ice skating."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It teaches rollerblading.\nB. It teaches to ski.\nC. It teaches roller skating.\nD. It teaches ice skating.\nAnswer with the option's letter from the given choices directly.",
1127,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "376-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1128,
"target": "A",
"doc": {
"video_id": "377",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=RF86oSB6_bc",
"videoID": "RF86oSB6_bc",
"question_id": "377-1",
"task_type": "Action Reasoning",
"question": "What is forbidden when there is a huge red cross on the screen?",
"options": [
"A. Throwing the sneakers into a washing machine.",
"B. Covering the workspace with newspaper.",
"C. Taking the cleaner and scrubbing it over the sneaker.",
"D. Using toothpaste to clean the sneakers."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is forbidden when there is a huge red cross on the screen?\nOption:\nA. Throwing the sneakers into a washing machine.\nB. Covering the workspace with newspaper.\nC. Taking the cleaner and scrubbing it over the sneaker.\nD. Using toothpaste to clean the sneakers.\nAnswer with the option's letter from the given choices directly.",
1128,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "377-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1129,
"target": "A",
"doc": {
"video_id": "377",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=RF86oSB6_bc",
"videoID": "RF86oSB6_bc",
"question_id": "377-2",
"task_type": "OCR Problems",
"question": "What is the recommended brand of the shoe cleaner in this video?",
"options": [
"A. SNEAKER LAB.",
"B. JASON MARKK.",
"C. LEATHER HONEY.",
"D. CREP PROTECT."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the recommended brand of the shoe cleaner in this video?\nOption:\nA. SNEAKER LAB.\nB. JASON MARKK.\nC. LEATHER HONEY.\nD. CREP PROTECT.\nAnswer with the option's letter from the given choices directly.",
1129,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "377-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1130,
"target": "B",
"doc": {
"video_id": "377",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=RF86oSB6_bc",
"videoID": "RF86oSB6_bc",
"question_id": "377-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It showcases the latest fashion trends for white sneakers.",
"B. It gives a quick tutorial on how you clean and restore your dirty white sneakers.",
"C. It demonstrates different ways to style white sneakers with various outfits.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It showcases the latest fashion trends for white sneakers.\nB. It gives a quick tutorial on how you clean and restore your dirty white sneakers.\nC. It demonstrates different ways to style white sneakers with various outfits.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1130,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "377-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1131,
"target": "C",
"doc": {
"video_id": "378",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=WOGaS19VZ8Q",
"videoID": "WOGaS19VZ8Q",
"question_id": "378-1",
"task_type": "Attribute Perception",
"question": "What is the color of the pneumatic air gun?",
"options": [
"A. Green.",
"B. Red.",
"C. Blue.",
"D. Orange."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the pneumatic air gun?\nOption:\nA. Green.\nB. Red.\nC. Blue.\nD. Orange.\nAnswer with the option's letter from the given choices directly.",
1131,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "378-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1132,
"target": "A",
"doc": {
"video_id": "378",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=WOGaS19VZ8Q",
"videoID": "WOGaS19VZ8Q",
"question_id": "378-2",
"task_type": "Object Reasoning",
"question": "What happens to the drill when it is taken out of the orange bucket?",
"options": [
"A. It is buried in ice.",
"B. It is distorted.",
"C. It has the color changed.",
"D. It is separated into two pieces."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens to the drill when it is taken out of the orange bucket?\nOption:\nA. It is buried in ice.\nB. It is distorted.\nC. It has the color changed.\nD. It is separated into two pieces.\nAnswer with the option's letter from the given choices directly.",
1132,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "378-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1133,
"target": "D",
"doc": {
"video_id": "378",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=WOGaS19VZ8Q",
"videoID": "WOGaS19VZ8Q",
"question_id": "378-3",
"task_type": "Object Recognition",
"question": "Which tool does not the man wearing green hat use?",
"options": [
"A. A hammer drill.",
"B. A joist hanger.",
"C. A pneumatic air gun.",
"D. A tape measure."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which tool does not the man wearing green hat use?\nOption:\nA. A hammer drill.\nB. A joist hanger.\nC. A pneumatic air gun.\nD. A tape measure.\nAnswer with the option's letter from the given choices directly.",
1133,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "378-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1134,
"target": "B",
"doc": {
"video_id": "379",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=BxrolAZXZFg",
"videoID": "BxrolAZXZFg",
"question_id": "379-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It gives a quick tutorial on how to walk a dog at home.",
"B. It shows tricks and helpful tips to clip your dog's nails.",
"C. It demonstrates how to wash the dog's hair without being stressful.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It gives a quick tutorial on how to walk a dog at home.\nB. It shows tricks and helpful tips to clip your dog's nails.\nC. It demonstrates how to wash the dog's hair without being stressful.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1134,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "379-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1135,
"target": "C",
"doc": {
"video_id": "379",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=BxrolAZXZFg",
"videoID": "BxrolAZXZFg",
"question_id": "379-2",
"task_type": "Object Recognition",
"question": "According to the video, what is the blue item most likely to be?",
"options": [
"A. A clipper.",
"B. A peanut butter jar.",
"C. A dremel.",
"D. A rag."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the blue item most likely to be?\nOption:\nA. A clipper.\nB. A peanut butter jar.\nC. A dremel.\nD. A rag.\nAnswer with the option's letter from the given choices directly.",
1135,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "379-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1136,
"target": "A",
"doc": {
"video_id": "379",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=BxrolAZXZFg",
"videoID": "BxrolAZXZFg",
"question_id": "379-3",
"task_type": "Object Recognition",
"question": "What sticker does it appear on the screen at the beginning of this video?",
"options": [
"A. A pair of eyes.",
"B. A flower.",
"C. A smiling face.",
"D. A raindrop."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sticker does it appear on the screen at the beginning of this video?\nOption:\nA. A pair of eyes.\nB. A flower.\nC. A smiling face.\nD. A raindrop.\nAnswer with the option's letter from the given choices directly.",
1136,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "379-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1137,
"target": "D",
"doc": {
"video_id": "380",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=2Bns2m5Bg4M",
"videoID": "2Bns2m5Bg4M",
"question_id": "380-1",
"task_type": "Object Recognition",
"question": "As can be seen in the video, which food is not used to practice holding chopsticks?",
"options": [
"A. Blueberries.",
"B. Popcoin.",
"C. Peanuts.",
"D. Soybeans."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, which food is not used to practice holding chopsticks?\nOption:\nA. Blueberries.\nB. Popcoin.\nC. Peanuts.\nD. Soybeans.\nAnswer with the option's letter from the given choices directly.",
1137,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "380-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1138,
"target": "C",
"doc": {
"video_id": "380",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=2Bns2m5Bg4M",
"videoID": "2Bns2m5Bg4M",
"question_id": "380-2",
"task_type": "Object Recognition",
"question": "As depicted in the video, which finger touches the chopsticks?",
"options": [
"A. Little finger on the right hand.",
"B. Little finger on the left hand.",
"C. Index finger on the right hand.",
"D. Index finger on the left hand."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which finger touches the chopsticks?\nOption:\nA. Little finger on the right hand.\nB. Little finger on the left hand.\nC. Index finger on the right hand.\nD. Index finger on the left hand.\nAnswer with the option's letter from the given choices directly.",
1138,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "380-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1139,
"target": "C",
"doc": {
"video_id": "380",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=2Bns2m5Bg4M",
"videoID": "2Bns2m5Bg4M",
"question_id": "380-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It discusses the cultural significance of chopsticks in Western cuisine.",
"B. It demonstrates advanced techniques for playing musical instruments using chopsticks.",
"C. It gives a quick tutorial on how to use chopsticks.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It discusses the cultural significance of chopsticks in Western cuisine.\nB. It demonstrates advanced techniques for playing musical instruments using chopsticks.\nC. It gives a quick tutorial on how to use chopsticks.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1139,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "380-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1140,
"target": "D",
"doc": {
"video_id": "381",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=GWNzRtKzIRA",
"videoID": "GWNzRtKzIRA",
"question_id": "381-1",
"task_type": "Counting Problem",
"question": "How many clips of Mad Men where the typewriter made appearances are shown in the video?",
"options": [
"A. 3.",
"B. 5.",
"C. 6.",
"D. 4."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many clips of Mad Men where the typewriter made appearances are shown in the video?\nOption:\nA. 3.\nB. 5.\nC. 6.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1140,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "381-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1141,
"target": "B",
"doc": {
"video_id": "381",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=GWNzRtKzIRA",
"videoID": "GWNzRtKzIRA",
"question_id": "381-2",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct about the typewriter introduced in the video?",
"options": [
"A. It has a simpler structure than other typewriters.",
"B. It can change fonts or character sets by changing the type ball.",
"C. It can play symphonies.",
"D. The type element in it is mushroom shaped."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct about the typewriter introduced in the video?\nOption:\nA. It has a simpler structure than other typewriters.\nB. It can change fonts or character sets by changing the type ball.\nC. It can play symphonies.\nD. The type element in it is mushroom shaped.\nAnswer with the option's letter from the given choices directly.",
1141,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "381-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1142,
"target": "B",
"doc": {
"video_id": "381",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=GWNzRtKzIRA",
"videoID": "GWNzRtKzIRA",
"question_id": "381-3",
"task_type": "Object Reasoning",
"question": "Which year marked the debut of the typewriter introduced in the video?",
"options": [
"A. 1961.",
"B. 1971.",
"C. 1981.",
"D. 1986."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which year marked the debut of the typewriter introduced in the video?\nOption:\nA. 1961.\nB. 1971.\nC. 1981.\nD. 1986.\nAnswer with the option's letter from the given choices directly.",
1142,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "381-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1143,
"target": "D",
"doc": {
"video_id": "382",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=6DAFkaGUiT4",
"videoID": "6DAFkaGUiT4",
"question_id": "382-1",
"task_type": "Object Reasoning",
"question": "According to the video, why is IMAX not suitable for filming movies?",
"options": [
"A. The motors make too much noise and distracte the actors.",
"B. The cameras are so heavy, and require specific training to operate them.",
"C. IMAX movies are expensive to produce.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why is IMAX not suitable for filming movies?\nOption:\nA. The motors make too much noise and distracte the actors.\nB. The cameras are so heavy, and require specific training to operate them.\nC. IMAX movies are expensive to produce.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
1143,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "382-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1144,
"target": "D",
"doc": {
"video_id": "382",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=6DAFkaGUiT4",
"videoID": "6DAFkaGUiT4",
"question_id": "382-2",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. IMAX cameras use 35mm film.",
"B. 1000 feet of film is enough for an imax camera to shoot for a whole day.",
"C. IMAX digital projectors use prisms to split xenon lights to produce different colors of light.",
"D. IMAX only has nine film cameras."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. IMAX cameras use 35mm film.\nB. 1000 feet of film is enough for an imax camera to shoot for a whole day.\nC. IMAX digital projectors use prisms to split xenon lights to produce different colors of light.\nD. IMAX only has nine film cameras.\nAnswer with the option's letter from the given choices directly.",
1144,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "382-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1145,
"target": "D",
"doc": {
"video_id": "382",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=6DAFkaGUiT4",
"videoID": "6DAFkaGUiT4",
"question_id": "382-3",
"task_type": "Object Recognition",
"question": "Which IMAX movie isn't in the video?",
"options": [
"A. The Hunger Games: Catching Fire.",
"B. The Dark Knight.",
"C. Oppenheimer.",
"D. Dune."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which IMAX movie isn't in the video?\nOption:\nA. The Hunger Games: Catching Fire.\nB. The Dark Knight.\nC. Oppenheimer.\nD. Dune.\nAnswer with the option's letter from the given choices directly.",
1145,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "382-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1146,
"target": "B",
"doc": {
"video_id": "383",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=e3fz3dqhN44",
"videoID": "e3fz3dqhN44",
"question_id": "383-1",
"task_type": "Object Recognition",
"question": "What is the ad in the video about?",
"options": [
"A. Quantum computers.",
"B. VPN.",
"C. Airline flights.",
"D. Formula E."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the ad in the video about?\nOption:\nA. Quantum computers.\nB. VPN.\nC. Airline flights.\nD. Formula E.\nAnswer with the option's letter from the given choices directly.",
1146,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "383-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1147,
"target": "B",
"doc": {
"video_id": "383",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=e3fz3dqhN44",
"videoID": "e3fz3dqhN44",
"question_id": "383-2",
"task_type": "Object Reasoning",
"question": "Which transportation is the quantum computer compared to in the video?",
"options": [
"A. Riding horses.",
"B. Boats.",
"C. Walking.",
"D. Faster cars."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which transportation is the quantum computer compared to in the video?\nOption:\nA. Riding horses.\nB. Boats.\nC. Walking.\nD. Faster cars.\nAnswer with the option's letter from the given choices directly.",
1147,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "383-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1148,
"target": "B",
"doc": {
"video_id": "383",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=e3fz3dqhN44",
"videoID": "e3fz3dqhN44",
"question_id": "383-3",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is incorrect?",
"options": [
"A. Quantum computers can speed up calculations of prime factorization.",
"B. Quantum computers work at room temperature.",
"C. A quantum computer is run on qubits.",
"D. Quantum computers are not good at addition."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is incorrect?\nOption:\nA. Quantum computers can speed up calculations of prime factorization.\nB. Quantum computers work at room temperature.\nC. A quantum computer is run on qubits.\nD. Quantum computers are not good at addition.\nAnswer with the option's letter from the given choices directly.",
1148,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "383-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1149,
"target": "C",
"doc": {
"video_id": "384",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=ECgVLceb_LE",
"videoID": "ECgVLceb_LE",
"question_id": "384-1",
"task_type": "Spatial Perception",
"question": "Where are the propellers of the flying car in the video stored?",
"options": [
"A. Within the car's massive front air intakes.",
"B. Folded and tucked underneath the chassis, visible through the transparent floor panels.",
"C. Retracted into the car's roof, concealed by a large panel.",
"D. Stored separately in a detachable trailer that can be uncoupled for flight."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where are the propellers of the flying car in the video stored?\nOption:\nA. Within the car's massive front air intakes.\nB. Folded and tucked underneath the chassis, visible through the transparent floor panels.\nC. Retracted into the car's roof, concealed by a large panel.\nD. Stored separately in a detachable trailer that can be uncoupled for flight.\nAnswer with the option's letter from the given choices directly.",
1149,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "384-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1150,
"target": "C",
"doc": {
"video_id": "384",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=ECgVLceb_LE",
"videoID": "ECgVLceb_LE",
"question_id": "384-2",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is incorrect?",
"options": [
"A. The flying car uses digital rearview mirrors.",
"B. The flying car is shaped like a drone.",
"C. This is XPENG's first flying car.",
"D. The steering wheel is different in drive mode and fly mode."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is incorrect?\nOption:\nA. The flying car uses digital rearview mirrors.\nB. The flying car is shaped like a drone.\nC. This is XPENG's first flying car.\nD. The steering wheel is different in drive mode and fly mode.\nAnswer with the option's letter from the given choices directly.",
1150,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "384-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1151,
"target": "C",
"doc": {
"video_id": "384",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=ECgVLceb_LE",
"videoID": "ECgVLceb_LE",
"question_id": "384-3",
"task_type": "Attribute Perception",
"question": "What color are the seats in the car?",
"options": [
"A. Black.",
"B. White.",
"C. Dark blue.",
"D. Light blue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the seats in the car?\nOption:\nA. Black.\nB. White.\nC. Dark blue.\nD. Light blue.\nAnswer with the option's letter from the given choices directly.",
1151,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "384-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1152,
"target": "C",
"doc": {
"video_id": "385",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=bepwr1-CNRU",
"videoID": "bepwr1-CNRU",
"question_id": "385-1",
"task_type": "Temporal Reasoning",
"question": "Which of the following is correct for ranking Tesla models from shortest to longest 0-60 mph acceleration time based on the video?",
"options": [
"A. Roaster (2008), Model X (Plaid), Model S (Whitestar), Model Y (Performance).",
"B. Model X (Plaid), Model Y (Long Range), Model S (Plaid), Model 3 (Standard Range Plus).",
"C. Roaster (2nd Generation), Model X (Plaid), Model S (Long Range), Model Y (Performance).",
"D. Model Y (Performance), Roaster (2008), Model S (Long Range), Model 3 (Standard Range Plus)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is correct for ranking Tesla models from shortest to longest 0-60 mph acceleration time based on the video?\nOption:\nA. Roaster (2008), Model X (Plaid), Model S (Whitestar), Model Y (Performance).\nB. Model X (Plaid), Model Y (Long Range), Model S (Plaid), Model 3 (Standard Range Plus).\nC. Roaster (2nd Generation), Model X (Plaid), Model S (Long Range), Model Y (Performance).\nD. Model Y (Performance), Roaster (2008), Model S (Long Range), Model 3 (Standard Range Plus).\nAnswer with the option's letter from the given choices directly.",
1152,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "385-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1153,
"target": "A",
"doc": {
"video_id": "385",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=bepwr1-CNRU",
"videoID": "bepwr1-CNRU",
"question_id": "385-2",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. Evolution of Tesla.",
"B. Tesla business model.",
"C. Tesla master plan.",
"D. Introduction to Tesla vehicle features."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. Evolution of Tesla.\nB. Tesla business model.\nC. Tesla master plan.\nD. Introduction to Tesla vehicle features.\nAnswer with the option's letter from the given choices directly.",
1153,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "385-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1154,
"target": "A",
"doc": {
"video_id": "385",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=bepwr1-CNRU",
"videoID": "bepwr1-CNRU",
"question_id": "385-3",
"task_type": "OCR Problems",
"question": "What is the driving range of Tesla Model X Long Range version on a single charge?",
"options": [
"A. 360 miles.",
"B. 353 miles.",
"C. 405 miles.",
"D. 340 milse."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the driving range of Tesla Model X Long Range version on a single charge?\nOption:\nA. 360 miles.\nB. 353 miles.\nC. 405 miles.\nD. 340 milse.\nAnswer with the option's letter from the given choices directly.",
1154,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "385-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1155,
"target": "D",
"doc": {
"video_id": "386",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=yU5kPoc7sL4",
"videoID": "yU5kPoc7sL4",
"question_id": "386-1",
"task_type": "Object Reasoning",
"question": "Based on the video, guess what device the touch screen was first used on?",
"options": [
"A. Cellphone.",
"B. PDA.",
"C. ATM.",
"D. Computer,."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, guess what device the touch screen was first used on?\nOption:\nA. Cellphone.\nB. PDA.\nC. ATM.\nD. Computer,.\nAnswer with the option's letter from the given choices directly.",
1155,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "386-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1156,
"target": "C",
"doc": {
"video_id": "386",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=yU5kPoc7sL4",
"videoID": "yU5kPoc7sL4",
"question_id": "386-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly orders the types of touch screens according to their order in the video?",
"options": [
"A. Infrared touchscreen, surface acoustic wave touchscreen, resistive touchscreen, capacitive touchscreen.",
"B. Resistive touchscreen, infrared touchscreen, surface acoustic wave touchscreen, capacitive touchscreen.",
"C. Resistive touchscreen, capacitive touchscreen, infrared touchscreen, surface acoustic wave touchscreen.",
"D. Surface acoustic wave touchscreen, resistive touchscreen, infrared touchscreen, capacitive touchscreen."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly orders the types of touch screens according to their order in the video?\nOption:\nA. Infrared touchscreen, surface acoustic wave touchscreen, resistive touchscreen, capacitive touchscreen.\nB. Resistive touchscreen, infrared touchscreen, surface acoustic wave touchscreen, capacitive touchscreen.\nC. Resistive touchscreen, capacitive touchscreen, infrared touchscreen, surface acoustic wave touchscreen.\nD. Surface acoustic wave touchscreen, resistive touchscreen, infrared touchscreen, capacitive touchscreen.\nAnswer with the option's letter from the given choices directly.",
1156,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "386-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1157,
"target": "C",
"doc": {
"video_id": "386",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=yU5kPoc7sL4",
"videoID": "yU5kPoc7sL4",
"question_id": "386-3",
"task_type": "Object Reasoning",
"question": "Why doesn't it work to click on a cell phone while wearing gloves?",
"options": [
"A. The contact area between the gloves and the screen is large, making it impossible to press the screen down.",
"B. Gloves cannot absorb the infrared rays emitted by the screen.",
"C. Clothes don't conduct electricity and cannot complete a circuit between the capacitive screen.",
"D. The gloves cause multi-touch and the screen cannot tell where the finger is touching."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why doesn't it work to click on a cell phone while wearing gloves?\nOption:\nA. The contact area between the gloves and the screen is large, making it impossible to press the screen down.\nB. Gloves cannot absorb the infrared rays emitted by the screen.\nC. Clothes don't conduct electricity and cannot complete a circuit between the capacitive screen.\nD. The gloves cause multi-touch and the screen cannot tell where the finger is touching.\nAnswer with the option's letter from the given choices directly.",
1157,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "386-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1158,
"target": "B",
"doc": {
"video_id": "387",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=Lyv_2usFQA0",
"videoID": "Lyv_2usFQA0",
"question_id": "387-1",
"task_type": "Temporal Reasoning",
"question": "According to the order in the video, what is the fourth GPT introduced (except GPT Finder)?",
"options": [
"A. Diagrams.",
"B. Logo Creator.",
"C. Prompt Perfect.",
"D. GPT Finder."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the order in the video, what is the fourth GPT introduced (except GPT Finder)?\nOption:\nA. Diagrams.\nB. Logo Creator.\nC. Prompt Perfect.\nD. GPT Finder.\nAnswer with the option's letter from the given choices directly.",
1158,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "387-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1159,
"target": "B",
"doc": {
"video_id": "387",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=Lyv_2usFQA0",
"videoID": "Lyv_2usFQA0",
"question_id": "387-2",
"task_type": "Temporal Reasoning",
"question": "Which GPT is introduced after Convert Anything?",
"options": [
"A. Creative Writing Coach.",
"B. Diagrams.",
"C. ScholarAI.",
"D. Consensus."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which GPT is introduced after Convert Anything?\nOption:\nA. Creative Writing Coach.\nB. Diagrams.\nC. ScholarAI.\nD. Consensus.\nAnswer with the option's letter from the given choices directly.",
1159,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "387-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1160,
"target": "C",
"doc": {
"video_id": "387",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=Lyv_2usFQA0",
"videoID": "Lyv_2usFQA0",
"question_id": "387-3",
"task_type": "Action Recognition",
"question": "Which GPT can generate a prompt based on the image so that DALLE can generate a similar image?",
"options": [
"A. Convert Anything.",
"B. Individ AI.",
"C. Super Description.",
"D. Prompt Perfect."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which GPT can generate a prompt based on the image so that DALLE can generate a similar image?\nOption:\nA. Convert Anything.\nB. Individ AI.\nC. Super Description.\nD. Prompt Perfect.\nAnswer with the option's letter from the given choices directly.",
1160,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "387-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1161,
"target": "C",
"doc": {
"video_id": "388",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=t61Wl2HVwFo",
"videoID": "t61Wl2HVwFo",
"question_id": "388-1",
"task_type": "Counting Problem",
"question": "How many groups of people were interviewed in the promotional video at the beginning of the video?",
"options": [
"A. 11.",
"B. 8.",
"C. 9.",
"D. 10."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many groups of people were interviewed in the promotional video at the beginning of the video?\nOption:\nA. 11.\nB. 8.\nC. 9.\nD. 10.\nAnswer with the option's letter from the given choices directly.",
1161,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "388-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1162,
"target": "D",
"doc": {
"video_id": "388",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=t61Wl2HVwFo",
"videoID": "t61Wl2HVwFo",
"question_id": "388-2",
"task_type": "Object Reasoning",
"question": "Which of the following is not the meaning of the On-device AI technology introduced in the video?",
"options": [
"A. Fast data processing.",
"B. High level of privacy protection.",
"C. Reduced power consumption.",
"D. Cheap cost of using."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not the meaning of the On-device AI technology introduced in the video?\nOption:\nA. Fast data processing.\nB. High level of privacy protection.\nC. Reduced power consumption.\nD. Cheap cost of using.\nAnswer with the option's letter from the given choices directly.",
1162,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "388-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1163,
"target": "B",
"doc": {
"video_id": "388",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=t61Wl2HVwFo",
"videoID": "t61Wl2HVwFo",
"question_id": "388-3",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. Samsung Gauss can only be used to process video images.",
"B. NQ8 AI Gen3 is capable of achieving super-resolution for video imagery.",
"C. The video was taken at the Samsung Developer Conference.",
"D. Only with a camera installed can the screen sense home information."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. Samsung Gauss can only be used to process video images.\nB. NQ8 AI Gen3 is capable of achieving super-resolution for video imagery.\nC. The video was taken at the Samsung Developer Conference.\nD. Only with a camera installed can the screen sense home information.\nAnswer with the option's letter from the given choices directly.",
1163,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "388-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1164,
"target": "A",
"doc": {
"video_id": "389",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=0rCbfsuKdYw",
"videoID": "0rCbfsuKdYw",
"question_id": "389-1",
"task_type": "OCR Problems",
"question": "What is the brand of the TV used with PS2 in the video?",
"options": [
"A. TOSHIBA.",
"B. BenQ.",
"C. SAMSUNG.",
"D. HITACHI."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the brand of the TV used with PS2 in the video?\nOption:\nA. TOSHIBA.\nB. BenQ.\nC. SAMSUNG.\nD. HITACHI.\nAnswer with the option's letter from the given choices directly.",
1164,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "389-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1165,
"target": "A",
"doc": {
"video_id": "389",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=0rCbfsuKdYw",
"videoID": "0rCbfsuKdYw",
"question_id": "389-2",
"task_type": "Object Recognition",
"question": "Which game is played on each generation of Play Stations in the video?",
"options": [
"A. Gran Turismo.",
"B. The Last of Us.",
"C. God of War.",
"D. Grand Theft Auto V."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which game is played on each generation of Play Stations in the video?\nOption:\nA. Gran Turismo.\nB. The Last of Us.\nC. God of War.\nD. Grand Theft Auto V.\nAnswer with the option's letter from the given choices directly.",
1165,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "389-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1166,
"target": "A",
"doc": {
"video_id": "389",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=0rCbfsuKdYw",
"videoID": "0rCbfsuKdYw",
"question_id": "389-3",
"task_type": "Object Reasoning",
"question": "According to the video, guess how many times more powerful is the PS5 than the PS2?",
"options": [
"A. 2800.",
"B. 3800.",
"C. 380.",
"D. 80."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, guess how many times more powerful is the PS5 than the PS2?\nOption:\nA. 2800.\nB. 3800.\nC. 380.\nD. 80.\nAnswer with the option's letter from the given choices directly.",
1166,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "389-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1167,
"target": "B",
"doc": {
"video_id": "390",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=tVcEEw6qbBQ",
"videoID": "tVcEEw6qbBQ",
"question_id": "390-1",
"task_type": "Object Reasoning",
"question": "Who owns the bottle containing protists at the beginning of the video?",
"options": [
"A. The purple one.",
"B. The pink one.",
"C. Both of them.",
"D. None of them."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who owns the bottle containing protists at the beginning of the video?\nOption:\nA. The purple one.\nB. The pink one.\nC. Both of them.\nD. None of them.\nAnswer with the option's letter from the given choices directly.",
1167,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "390-1",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1168,
"target": "D",
"doc": {
"video_id": "390",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=tVcEEw6qbBQ",
"videoID": "tVcEEw6qbBQ",
"question_id": "390-2",
"task_type": "Object Reasoning",
"question": "Among the examples of using microscopes to help learning life sciences at the end of the video, what are being observed in the example following the demonstration of studying mitosis?",
"options": [
"A. Euglena and paramecia.",
"B. Aquatic plant cells.",
"C. The cross section of an onion root tip.",
"D. Stomata from a thin sample underneath a leaf."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Among the examples of using microscopes to help learning life sciences at the end of the video, what are being observed in the example following the demonstration of studying mitosis?\nOption:\nA. Euglena and paramecia.\nB. Aquatic plant cells.\nC. The cross section of an onion root tip.\nD. Stomata from a thin sample underneath a leaf.\nAnswer with the option's letter from the given choices directly.",
1168,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "390-2",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1169,
"target": "B",
"doc": {
"video_id": "390",
"duration": "medium",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=tVcEEw6qbBQ",
"videoID": "tVcEEw6qbBQ",
"question_id": "390-3",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. Viruses can be seen with a light microscope.",
"B. The longer the objective lens, the higher the magnification.",
"C. Light microscopes do not require electricity.",
"D. Use tissue paper to clean the lens when it is difficult to see the image clearly even when you are focusing."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. Viruses can be seen with a light microscope.\nB. The longer the objective lens, the higher the magnification.\nC. Light microscopes do not require electricity.\nD. Use tissue paper to clean the lens when it is difficult to see the image clearly even when you are focusing.\nAnswer with the option's letter from the given choices directly.",
1169,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "390-3",
"duration": "medium",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1170,
"target": "D",
"doc": {
"video_id": "391",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=jIzGzPzdLwM",
"videoID": "jIzGzPzdLwM",
"question_id": "391-1",
"task_type": "Action Recognition",
"question": "What does Shen do for changing his fate?",
"options": [
"A. Make team with others.",
"B. Stops what he is doing.",
"C. Change his mind.",
"D. Destroys pandas."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Shen do for changing his fate?\nOption:\nA. Make team with others.\nB. Stops what he is doing.\nC. Change his mind.\nD. Destroys pandas.\nAnswer with the option's letter from the given choices directly.",
1170,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "391-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1171,
"target": "C",
"doc": {
"video_id": "391",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=jIzGzPzdLwM",
"videoID": "jIzGzPzdLwM",
"question_id": "391-2",
"task_type": "Action Reasoning",
"question": "How does Shen feel when he is going to see a panda?",
"options": [
"A. Angry.",
"B. Happy.",
"C. Nervous.",
"D. Sad."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Shen feel when he is going to see a panda?\nOption:\nA. Angry.\nB. Happy.\nC. Nervous.\nD. Sad.\nAnswer with the option's letter from the given choices directly.",
1171,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "391-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1172,
"target": "B",
"doc": {
"video_id": "391",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=jIzGzPzdLwM",
"videoID": "jIzGzPzdLwM",
"question_id": "391-3",
"task_type": "Action Recognition",
"question": "What happens to Shen at last?",
"options": [
"A. Shen escapes to a hidden fortress, plotting his revenge against the panda.",
"B. Shen is defeated by the panda and crashed by a cannon.",
"C. Shen undergoes a dramatic transformation, redeeming himself and becoming a beloved hero.",
"D. Shen's defeat leads to a revelation that he was merely a pawn in a larger scheme orchestrated by a more sinister villain."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens to Shen at last?\nOption:\nA. Shen escapes to a hidden fortress, plotting his revenge against the panda.\nB. Shen is defeated by the panda and crashed by a cannon.\nC. Shen undergoes a dramatic transformation, redeeming himself and becoming a beloved hero.\nD. Shen's defeat leads to a revelation that he was merely a pawn in a larger scheme orchestrated by a more sinister villain.\nAnswer with the option's letter from the given choices directly.",
1172,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "391-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1173,
"target": "A",
"doc": {
"video_id": "392",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=edAu5_O4C54",
"videoID": "edAu5_O4C54",
"question_id": "392-1",
"task_type": "Object Recognition",
"question": "Which does not appear in this video?",
"options": [
"A. Spider-kangaroo.",
"B. Spider-cat.",
"C. Spider-dinosaur.",
"D. Spider-horse."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which does not appear in this video?\nOption:\nA. Spider-kangaroo.\nB. Spider-cat.\nC. Spider-dinosaur.\nD. Spider-horse.\nAnswer with the option's letter from the given choices directly.",
1173,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "392-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1174,
"target": "D",
"doc": {
"video_id": "392",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=edAu5_O4C54",
"videoID": "edAu5_O4C54",
"question_id": "392-2",
"task_type": "Object Reasoning",
"question": "Who is the mentor of the Chased black spider-man?",
"options": [
"A. The spider-man with hang gliders.",
"B. The spider-woman with golden hair.",
"C. The spider-woman with motorcycle.",
"D. The spider-man with a baby."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the mentor of the Chased black spider-man?\nOption:\nA. The spider-man with hang gliders.\nB. The spider-woman with golden hair.\nC. The spider-woman with motorcycle.\nD. The spider-man with a baby.\nAnswer with the option's letter from the given choices directly.",
1174,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "392-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1175,
"target": "D",
"doc": {
"video_id": "392",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=edAu5_O4C54",
"videoID": "edAu5_O4C54",
"question_id": "392-3",
"task_type": "Action Recognition",
"question": "Where is the chased black spider-man hide at first?",
"options": [
"A. Under a car.",
"B. Outside the building.",
"C. In a factory.",
"D. On the back of a spider-man who has four robotic arms."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the chased black spider-man hide at first?\nOption:\nA. Under a car.\nB. Outside the building.\nC. In a factory.\nD. On the back of a spider-man who has four robotic arms.\nAnswer with the option's letter from the given choices directly.",
1175,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "392-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1176,
"target": "B",
"doc": {
"video_id": "393",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=dcYgBU4t98E",
"videoID": "dcYgBU4t98E",
"question_id": "393-1",
"task_type": "Action Recognition",
"question": "What happens when the chicken is bumping fists with a penguin?",
"options": [
"A. The chicken is throwed into sea by the penguin.",
"B. The chicken is pushed into sky by water.",
"C. The chicken flies into sky.",
"D. The penguin hits the chicken."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when the chicken is bumping fists with a penguin?\nOption:\nA. The chicken is throwed into sea by the penguin.\nB. The chicken is pushed into sky by water.\nC. The chicken flies into sky.\nD. The penguin hits the chicken.\nAnswer with the option's letter from the given choices directly.",
1176,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "393-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1177,
"target": "C",
"doc": {
"video_id": "393",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=dcYgBU4t98E",
"videoID": "dcYgBU4t98E",
"question_id": "393-2",
"task_type": "Counting Problem",
"question": "How many penguins play torches when they have a bonfire party?",
"options": [
"A. 4.",
"B. 2.",
"C. 3.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many penguins play torches when they have a bonfire party?\nOption:\nA. 4.\nB. 2.\nC. 3.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1177,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "393-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1178,
"target": "A",
"doc": {
"video_id": "393",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=dcYgBU4t98E",
"videoID": "dcYgBU4t98E",
"question_id": "393-3",
"task_type": "Action Recognition",
"question": "Which skill does the chicken have?",
"options": [
"A. Surf.",
"B. Fly.",
"C. Sing.",
"D. Cook."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which skill does the chicken have?\nOption:\nA. Surf.\nB. Fly.\nC. Sing.\nD. Cook.\nAnswer with the option's letter from the given choices directly.",
1178,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "393-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1179,
"target": "D",
"doc": {
"video_id": "394",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=V6ui161NyTg",
"videoID": "V6ui161NyTg",
"question_id": "394-1",
"task_type": "Object Reasoning",
"question": "What is the relationship between the two girls in the video?",
"options": [
"A. They are sisters who care about each other.",
"B. They are strangers with the same experience.",
"C. They are best friends to each other.",
"D. They are different periods of the same person."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the two girls in the video?\nOption:\nA. They are sisters who care about each other.\nB. They are strangers with the same experience.\nC. They are best friends to each other.\nD. They are different periods of the same person.\nAnswer with the option's letter from the given choices directly.",
1179,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "394-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1180,
"target": "C",
"doc": {
"video_id": "394",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=V6ui161NyTg",
"videoID": "V6ui161NyTg",
"question_id": "394-2",
"task_type": "Information Synopsis",
"question": "What kind of story does this video record about the little girl?",
"options": [
"A. The little girl overcame other people's doubts through continuous efforts and finally realized her dream of painting.",
"B. The little girl accepted help from a stranger and embraced life again.",
"C. The little girl healed herself and walked out of the shadow of school violence.",
"D. The little girl helps others overcome autism through painting."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of story does this video record about the little girl?\nOption:\nA. The little girl overcame other people's doubts through continuous efforts and finally realized her dream of painting.\nB. The little girl accepted help from a stranger and embraced life again.\nC. The little girl healed herself and walked out of the shadow of school violence.\nD. The little girl helps others overcome autism through painting.\nAnswer with the option's letter from the given choices directly.",
1180,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "394-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1181,
"target": "B",
"doc": {
"video_id": "394",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=V6ui161NyTg",
"videoID": "V6ui161NyTg",
"question_id": "394-3",
"task_type": "Temporal Reasoning",
"question": "In the video, in the little girl's recollection, what is the correct order of the events she experienced?\n(1) Her paintings were disliked by others.\n(2) She was pushed down by others.\n(3) She was forgotten by others when playing hide and seek.\n(4) She was talked about by others while swinging.",
"options": [
"A. (2)(1)(3)(4).",
"B. (1)(2)(3)(4).",
"C. (3)(2)(1)(4).",
"D. (1)(4)(3)(2)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, in the little girl's recollection, what is the correct order of the events she experienced?\n(1) Her paintings were disliked by others.\n(2) She was pushed down by others.\n(3) She was forgotten by others when playing hide and seek.\n(4) She was talked about by others while swinging.\nOption:\nA. (2)(1)(3)(4).\nB. (1)(2)(3)(4).\nC. (3)(2)(1)(4).\nD. (1)(4)(3)(2).\nAnswer with the option's letter from the given choices directly.",
1181,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "394-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1182,
"target": "A",
"doc": {
"video_id": "395",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=-XpJeDGh8No",
"videoID": "-XpJeDGh8No",
"question_id": "395-1",
"task_type": "Action Recognition",
"question": "What happens when a monster teach math?",
"options": [
"A. Two students is fighting.",
"B. A student gets answer quickly.",
"C. All student study quietly.",
"D. No one in the room."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when a monster teach math?\nOption:\nA. Two students is fighting.\nB. A student gets answer quickly.\nC. All student study quietly.\nD. No one in the room.\nAnswer with the option's letter from the given choices directly.",
1182,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "395-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1183,
"target": "B",
"doc": {
"video_id": "395",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=-XpJeDGh8No",
"videoID": "-XpJeDGh8No",
"question_id": "395-2",
"task_type": "Action Recognition",
"question": "What does the yellow turtle monster do after receiving a red book?",
"options": [
"A. Engaging in business.",
"B. Taking a vacation.",
"C. Reading additional newspapers.",
"D. Seeking food."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the yellow turtle monster do after receiving a red book?\nOption:\nA. Engaging in business.\nB. Taking a vacation.\nC. Reading additional newspapers.\nD. Seeking food.\nAnswer with the option's letter from the given choices directly.",
1183,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "395-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1184,
"target": "D",
"doc": {
"video_id": "395",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=-XpJeDGh8No",
"videoID": "-XpJeDGh8No",
"question_id": "395-3",
"task_type": "Object Recognition",
"question": "Who fight versus the black dinosaur at last?",
"options": [
"A. A spider.",
"B. A snake.",
"C. A King Kong.",
"D. A dragon."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who fight versus the black dinosaur at last?\nOption:\nA. A spider.\nB. A snake.\nC. A King Kong.\nD. A dragon.\nAnswer with the option's letter from the given choices directly.",
1184,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "395-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1185,
"target": "C",
"doc": {
"video_id": "396",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=QI9VIulqTCA",
"videoID": "QI9VIulqTCA",
"question_id": "396-1",
"task_type": "Object Recognition",
"question": "Who is the little iceberg meet first?",
"options": [
"A. A little dolphin.",
"B. A little shark.",
"C. A little killer whale.",
"D. A little turtle."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the little iceberg meet first?\nOption:\nA. A little dolphin.\nB. A little shark.\nC. A little killer whale.\nD. A little turtle.\nAnswer with the option's letter from the given choices directly.",
1185,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "396-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1186,
"target": "D",
"doc": {
"video_id": "396",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=QI9VIulqTCA",
"videoID": "QI9VIulqTCA",
"question_id": "396-2",
"task_type": "Action Recognition",
"question": "What is the purpose of the ship coming here?",
"options": [
"A. Finding someone lost.",
"B. Having a sightseeing.",
"C. Making friends with iceberg.",
"D. Capturing killer whales."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the ship coming here?\nOption:\nA. Finding someone lost.\nB. Having a sightseeing.\nC. Making friends with iceberg.\nD. Capturing killer whales.\nAnswer with the option's letter from the given choices directly.",
1186,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "396-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1187,
"target": "B",
"doc": {
"video_id": "396",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=QI9VIulqTCA",
"videoID": "QI9VIulqTCA",
"question_id": "396-3",
"task_type": "Action Reasoning",
"question": "Why is the small iceberg shrinking?",
"options": [
"A. It is exposed to the sun.",
"B. It crashes the ship for protecting the little killer whale.",
"C. It is captured by the ship.",
"D. It wants to sleep."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is the small iceberg shrinking?\nOption:\nA. It is exposed to the sun.\nB. It crashes the ship for protecting the little killer whale.\nC. It is captured by the ship.\nD. It wants to sleep.\nAnswer with the option's letter from the given choices directly.",
1187,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "396-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1188,
"target": "D",
"doc": {
"video_id": "397",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=2za5RwplXdI",
"videoID": "2za5RwplXdI",
"question_id": "397-1",
"task_type": "Action Recognition",
"question": "What happens when a star falls on the earth?",
"options": [
"A. It smashes a big hole in the ground.",
"B. It brings elements that never shows on earth.",
"C. It makes a beautiful star sky.",
"D. It causes a disaster and destroys dinosaurs."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when a star falls on the earth?\nOption:\nA. It smashes a big hole in the ground.\nB. It brings elements that never shows on earth.\nC. It makes a beautiful star sky.\nD. It causes a disaster and destroys dinosaurs.\nAnswer with the option's letter from the given choices directly.",
1188,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "397-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1189,
"target": "B",
"doc": {
"video_id": "397",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=2za5RwplXdI",
"videoID": "2za5RwplXdI",
"question_id": "397-2",
"task_type": "Object Recognition",
"question": "According to the video, what do the dead dinosaurs transform into?",
"options": [
"A. Star and sun.",
"B. Birds.",
"C. Human.",
"D. Stone."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what do the dead dinosaurs transform into?\nOption:\nA. Star and sun.\nB. Birds.\nC. Human.\nD. Stone.\nAnswer with the option's letter from the given choices directly.",
1189,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "397-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1190,
"target": "D",
"doc": {
"video_id": "397",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=2za5RwplXdI",
"videoID": "2za5RwplXdI",
"question_id": "397-3",
"task_type": "Information Synopsis",
"question": "What story does the video tell?",
"options": [
"A. Who is the ancestor of birds.",
"B. How fossil comes about.",
"C. We should visit the dinosaur museum.",
"D. Dinosaurs went where."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What story does the video tell?\nOption:\nA. Who is the ancestor of birds.\nB. How fossil comes about.\nC. We should visit the dinosaur museum.\nD. Dinosaurs went where.\nAnswer with the option's letter from the given choices directly.",
1190,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "397-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1191,
"target": "B",
"doc": {
"video_id": "398",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Jm0MLlE4x0U",
"videoID": "Jm0MLlE4x0U",
"question_id": "398-1",
"task_type": "Action Recognition",
"question": "What does the fox do after meeting the blue baby bird?",
"options": [
"A. It leaves the baby bird alone.",
"B. It wants to ignore the bird.",
"C. It steals fishes from the trolls.",
"D. It takes the bird for food."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the fox do after meeting the blue baby bird?\nOption:\nA. It leaves the baby bird alone.\nB. It wants to ignore the bird.\nC. It steals fishes from the trolls.\nD. It takes the bird for food.\nAnswer with the option's letter from the given choices directly.",
1191,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "398-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1192,
"target": "A",
"doc": {
"video_id": "398",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Jm0MLlE4x0U",
"videoID": "Jm0MLlE4x0U",
"question_id": "398-2",
"task_type": "Action Reasoning",
"question": "Why does the mother bird bring a fish to the fox?",
"options": [
"A. To thank the fox for raising its own offspring.",
"B. Because the fox is extremely hungry.",
"C. To fulfill its baby's wish to help the fox.",
"D. Because the fox is a friend of its child."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the mother bird bring a fish to the fox?\nOption:\nA. To thank the fox for raising its own offspring.\nB. Because the fox is extremely hungry.\nC. To fulfill its baby's wish to help the fox.\nD. Because the fox is a friend of its child.\nAnswer with the option's letter from the given choices directly.",
1192,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "398-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1193,
"target": "C",
"doc": {
"video_id": "398",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Jm0MLlE4x0U",
"videoID": "Jm0MLlE4x0U",
"question_id": "398-3",
"task_type": "Spatial Perception",
"question": "Where does the baby bird go at last?",
"options": [
"A. Flying away alone.",
"B. Staying with the fox.",
"C. Flying with its mom to its herd.",
"D. Going to find its nest."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the baby bird go at last?\nOption:\nA. Flying away alone.\nB. Staying with the fox.\nC. Flying with its mom to its herd.\nD. Going to find its nest.\nAnswer with the option's letter from the given choices directly.",
1193,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "398-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1194,
"target": "C",
"doc": {
"video_id": "399",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I5cFBi02O34",
"videoID": "I5cFBi02O34",
"question_id": "399-1",
"task_type": "Action Recognition",
"question": "What is the attitude of the king towards the hero in front of him?",
"options": [
"A. Angry.",
"B. Trusted.",
"C. Disappointed.",
"D. Positive."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the attitude of the king towards the hero in front of him?\nOption:\nA. Angry.\nB. Trusted.\nC. Disappointed.\nD. Positive.\nAnswer with the option's letter from the given choices directly.",
1194,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "399-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1195,
"target": "A",
"doc": {
"video_id": "399",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I5cFBi02O34",
"videoID": "I5cFBi02O34",
"question_id": "399-2",
"task_type": "Action Reasoning",
"question": "Why does the hero wait until the last moment to save the princess?",
"options": [
"A. She is not the princess he desires.",
"B. He is unable to overcome the monsters.",
"C. He mistakenly follows the wrong path.",
"D. He seeks to encounter more princesses."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the hero wait until the last moment to save the princess?\nOption:\nA. She is not the princess he desires.\nB. He is unable to overcome the monsters.\nC. He mistakenly follows the wrong path.\nD. He seeks to encounter more princesses.\nAnswer with the option's letter from the given choices directly.",
1195,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "399-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1196,
"target": "B",
"doc": {
"video_id": "399",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I5cFBi02O34",
"videoID": "I5cFBi02O34",
"question_id": "399-3",
"task_type": "Action Reasoning",
"question": "Why does the hero take the last monster back?",
"options": [
"A. He falls in love with the monster princess.",
"B. The monster wears the same crown as the princess.",
"C. The king wants the monster.",
"D. He promises to reveal the location of a hidden treasure."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the hero take the last monster back?\nOption:\nA. He falls in love with the monster princess.\nB. The monster wears the same crown as the princess.\nC. The king wants the monster.\nD. He promises to reveal the location of a hidden treasure.\nAnswer with the option's letter from the given choices directly.",
1196,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "399-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1197,
"target": "B",
"doc": {
"video_id": "400",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=ZN6kyy2SXnk",
"videoID": "ZN6kyy2SXnk",
"question_id": "400-1",
"task_type": "Spatial Perception",
"question": "Where is Puss in Boots meet Death?",
"options": [
"A. A kitchen.",
"B. An office.",
"C. A hole.",
"D. A bar."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is Puss in Boots meet Death?\nOption:\nA. A kitchen.\nB. An office.\nC. A hole.\nD. A bar.\nAnswer with the option's letter from the given choices directly.",
1197,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "400-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1198,
"target": "C",
"doc": {
"video_id": "400",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=ZN6kyy2SXnk",
"videoID": "ZN6kyy2SXnk",
"question_id": "400-2",
"task_type": "Action Recognition",
"question": "How does Puss in Boots feel when he hear the whistle of the man with hat?",
"options": [
"A. Anxious.",
"B. Exited.",
"C. Fearfull.",
"D. Happy."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Puss in Boots feel when he hear the whistle of the man with hat?\nOption:\nA. Anxious.\nB. Exited.\nC. Fearfull.\nD. Happy.\nAnswer with the option's letter from the given choices directly.",
1198,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "400-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1199,
"target": "A",
"doc": {
"video_id": "400",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=ZN6kyy2SXnk",
"videoID": "ZN6kyy2SXnk",
"question_id": "400-3",
"task_type": "Counting Problem",
"question": "How many crystals with cat in boots does Death break?",
"options": [
"A. 8.",
"B. 6.",
"C. 7.",
"D. 9."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many crystals with cat in boots does Death break?\nOption:\nA. 8.\nB. 6.\nC. 7.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
1199,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "400-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1200,
"target": "B",
"doc": {
"video_id": "401",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=TIK01MpwWGg",
"videoID": "TIK01MpwWGg",
"question_id": "401-1",
"task_type": "Action Reasoning",
"question": "Why does the woman in the blue and black shirt want to do makeup for the man in the yellow shirt in the video?",
"options": [
"A. Because the man likes wearing makeup.",
"B. Because the woman thinks that a whiter complexion from makeup will make the man's teeth look not so white.",
"C. Because they are both only 13 years old.",
"D. Because the man is a celebrity."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the woman in the blue and black shirt want to do makeup for the man in the yellow shirt in the video?\nOption:\nA. Because the man likes wearing makeup.\nB. Because the woman thinks that a whiter complexion from makeup will make the man's teeth look not so white.\nC. Because they are both only 13 years old.\nD. Because the man is a celebrity.\nAnswer with the option's letter from the given choices directly.",
1200,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "401-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1201,
"target": "C",
"doc": {
"video_id": "401",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=TIK01MpwWGg",
"videoID": "TIK01MpwWGg",
"question_id": "401-2",
"task_type": "Action Reasoning",
"question": "Why does the man always speak in a strange way during his date in the latter half of the video?",
"options": [
"A. Because he doesn't like his date and wants to make a bad impression.",
"B. Because he has a toothache.",
"C. Because he doesn't want to show his teeth.",
"D. Because his throat is uncomfortable."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the man always speak in a strange way during his date in the latter half of the video?\nOption:\nA. Because he doesn't like his date and wants to make a bad impression.\nB. Because he has a toothache.\nC. Because he doesn't want to show his teeth.\nD. Because his throat is uncomfortable.\nAnswer with the option's letter from the given choices directly.",
1201,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "401-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1202,
"target": "B",
"doc": {
"video_id": "401",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=TIK01MpwWGg",
"videoID": "TIK01MpwWGg",
"question_id": "401-3",
"task_type": "Attribute Perception",
"question": "In the latter half of the video, what are the man and woman drinking during their date?",
"options": [
"A. Water.",
"B. Red wine.",
"C. Juice.",
"D. Beer."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter half of the video, what are the man and woman drinking during their date?\nOption:\nA. Water.\nB. Red wine.\nC. Juice.\nD. Beer.\nAnswer with the option's letter from the given choices directly.",
1202,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "401-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1203,
"target": "A",
"doc": {
"video_id": "402",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=44G3dPME330",
"videoID": "44G3dPME330",
"question_id": "402-1",
"task_type": "Counting Problem",
"question": "If a cue appears in the bottom left corner when a splice occurs, then how many video clips were used to compose this video?",
"options": [
"A. 7.",
"B. 10.",
"C. 3.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: If a cue appears in the bottom left corner when a splice occurs, then how many video clips were used to compose this video?\nOption:\nA. 7.\nB. 10.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1203,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "402-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1204,
"target": "D",
"doc": {
"video_id": "402",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=44G3dPME330",
"videoID": "44G3dPME330",
"question_id": "402-2",
"task_type": "Object Reasoning",
"question": "What common element do these clips in the video share?",
"options": [
"A. There are four people in the room.",
"B. Everyone is eating something.",
"C. Everyone is sitting on the couch.",
"D. Sheldon changed the Wi-Fi password."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What common element do these clips in the video share?\nOption:\nA. There are four people in the room.\nB. Everyone is eating something.\nC. Everyone is sitting on the couch.\nD. Sheldon changed the Wi-Fi password.\nAnswer with the option's letter from the given choices directly.",
1204,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "402-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1205,
"target": "B",
"doc": {
"video_id": "402",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=44G3dPME330",
"videoID": "44G3dPME330",
"question_id": "402-3",
"task_type": "Attribute Perception",
"question": "What color is the laptop the woman is holding at the beginning of the video?",
"options": [
"A. Black.",
"B. Pink.",
"C. Silver.",
"D. Green."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the laptop the woman is holding at the beginning of the video?\nOption:\nA. Black.\nB. Pink.\nC. Silver.\nD. Green.\nAnswer with the option's letter from the given choices directly.",
1205,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "402-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1206,
"target": "C",
"doc": {
"video_id": "403",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=iC54QQSe4_I",
"videoID": "iC54QQSe4_I",
"question_id": "403-1",
"task_type": "Object Reasoning",
"question": "Based on the video, what might the content of the letter be?",
"options": [
"A. You're pretty cranky for a princess rodeo clown.",
"B. The California Institute of Technology wants young Sheldon to teach there.",
"C. The California Institute of Technology wants young Sheldon to attend school there.",
"D. The California Institute of Technology wants young Sheldon to drop out."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what might the content of the letter be?\nOption:\nA. You're pretty cranky for a princess rodeo clown.\nB. The California Institute of Technology wants young Sheldon to teach there.\nC. The California Institute of Technology wants young Sheldon to attend school there.\nD. The California Institute of Technology wants young Sheldon to drop out.\nAnswer with the option's letter from the given choices directly.",
1206,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "403-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1207,
"target": "D",
"doc": {
"video_id": "403",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=iC54QQSe4_I",
"videoID": "iC54QQSe4_I",
"question_id": "403-2",
"task_type": "Spatial Reasoning",
"question": "Based on the video, where does young Sheldon find his father?",
"options": [
"A. In a church.",
"B. At the California Institute of Technology.",
"C. In a gun shop.",
"D. In a bar."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, where does young Sheldon find his father?\nOption:\nA. In a church.\nB. At the California Institute of Technology.\nC. In a gun shop.\nD. In a bar.\nAnswer with the option's letter from the given choices directly.",
1207,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "403-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1208,
"target": "B",
"doc": {
"video_id": "403",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=iC54QQSe4_I",
"videoID": "iC54QQSe4_I",
"question_id": "403-3",
"task_type": "Action Reasoning",
"question": "Why did young Sheldon's mother not tell him about the contents of the letter?",
"options": [
"A. Because she thinks the California Institute of Technology is not good.",
"B. Because she does not want young Sheldon to go to California alone for school.",
"C. Because she had an argument with young Sheldon.",
"D. Because the letter was addressed to young Sheldon's parents."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did young Sheldon's mother not tell him about the contents of the letter?\nOption:\nA. Because she thinks the California Institute of Technology is not good.\nB. Because she does not want young Sheldon to go to California alone for school.\nC. Because she had an argument with young Sheldon.\nD. Because the letter was addressed to young Sheldon's parents.\nAnswer with the option's letter from the given choices directly.",
1208,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "403-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1209,
"target": "A",
"doc": {
"video_id": "404",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=4aPxxE345OM",
"videoID": "4aPxxE345OM",
"question_id": "404-1",
"task_type": "Object Reasoning",
"question": "Based on the video, who is Griffin more interested in?",
"options": [
"A. Manny's mom.",
"B. Manny.",
"C. Manny's dad.",
"D. Cannot determine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, who is Griffin more interested in?\nOption:\nA. Manny's mom.\nB. Manny.\nC. Manny's dad.\nD. Cannot determine.\nAnswer with the option's letter from the given choices directly.",
1209,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "404-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1210,
"target": "B",
"doc": {
"video_id": "404",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=4aPxxE345OM",
"videoID": "4aPxxE345OM",
"question_id": "404-2",
"task_type": "Object Recognition",
"question": "Based on the video, what is Manny's family's pet?",
"options": [
"A. Griffin.",
"B. A dog.",
"C. A rabbit.",
"D. A cat."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is Manny's family's pet?\nOption:\nA. Griffin.\nB. A dog.\nC. A rabbit.\nD. A cat.\nAnswer with the option's letter from the given choices directly.",
1210,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "404-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1211,
"target": "A",
"doc": {
"video_id": "404",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=4aPxxE345OM",
"videoID": "4aPxxE345OM",
"question_id": "404-3",
"task_type": "Action Reasoning",
"question": "Based on the video, why does Manny want to be friends with Griffin?",
"options": [
"A. Because he has a crush on Griffin's sister.",
"B. Because Griffin is good at basketball.",
"C. Because Griffin admires Manny's dad.",
"D. Because he thinks Griffin is cool."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, why does Manny want to be friends with Griffin?\nOption:\nA. Because he has a crush on Griffin's sister.\nB. Because Griffin is good at basketball.\nC. Because Griffin admires Manny's dad.\nD. Because he thinks Griffin is cool.\nAnswer with the option's letter from the given choices directly.",
1211,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "404-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1212,
"target": "B",
"doc": {
"video_id": "405",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=hjNX33W-99I",
"videoID": "hjNX33W-99I",
"question_id": "405-1",
"task_type": "Object Reasoning",
"question": "Based on the video, the man who leaves with the female protagonist is actually?",
"options": [
"A. A pilot.",
"B. A millionaire.",
"C. An actor.",
"D. A bum."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, the man who leaves with the female protagonist is actually?\nOption:\nA. A pilot.\nB. A millionaire.\nC. An actor.\nD. A bum.\nAnswer with the option's letter from the given choices directly.",
1212,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "405-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1213,
"target": "D",
"doc": {
"video_id": "405",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=hjNX33W-99I",
"videoID": "hjNX33W-99I",
"question_id": "405-2",
"task_type": "Temporal Perception",
"question": "When in the video does the female protagonist wear a green dress?",
"options": [
"A. She never wears a green dress.",
"B. In the beginning of the video.",
"C. In the middle of the video.",
"D. In the latter part of the video."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When in the video does the female protagonist wear a green dress?\nOption:\nA. She never wears a green dress.\nB. In the beginning of the video.\nC. In the middle of the video.\nD. In the latter part of the video.\nAnswer with the option's letter from the given choices directly.",
1213,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "405-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1214,
"target": "C",
"doc": {
"video_id": "405",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=hjNX33W-99I",
"videoID": "hjNX33W-99I",
"question_id": "405-3",
"task_type": "Action Reasoning",
"question": "Why does the female protagonist pay a bum to accompany her for a day at the beginning of the video?",
"options": [
"A. For fun.",
"B. Because she fell in love with the bum.",
"C. she wants to teach her parents a lesson.",
"D. Because she recognizes the bum is actually a millionaire."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the female protagonist pay a bum to accompany her for a day at the beginning of the video?\nOption:\nA. For fun.\nB. Because she fell in love with the bum.\nC. she wants to teach her parents a lesson.\nD. Because she recognizes the bum is actually a millionaire.\nAnswer with the option's letter from the given choices directly.",
1214,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "405-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1215,
"target": "C",
"doc": {
"video_id": "406",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=r-ySDrO4uf4",
"videoID": "r-ySDrO4uf4",
"question_id": "406-1",
"task_type": "Object Reasoning",
"question": "Who is the smarter one according to the hippo god?",
"options": [
"A. Both.",
"B. The man in white T-shirt.",
"C. The man in black T-shirt.",
"D. No one."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the smarter one according to the hippo god?\nOption:\nA. Both.\nB. The man in white T-shirt.\nC. The man in black T-shirt.\nD. No one.\nAnswer with the option's letter from the given choices directly.",
1215,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "406-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1216,
"target": "A",
"doc": {
"video_id": "406",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=r-ySDrO4uf4",
"videoID": "r-ySDrO4uf4",
"question_id": "406-2",
"task_type": "Action Recognition",
"question": "What does the hippo god do with the two men's hearts?",
"options": [
"A. Puts their hearts on the one side of balance.",
"B. Puts their hearts on the two sides of balance respectively.",
"C. Takes their hearts away.",
"D. Puts their hearts on the boat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the hippo god do with the two men's hearts?\nOption:\nA. Puts their hearts on the one side of balance.\nB. Puts their hearts on the two sides of balance respectively.\nC. Takes their hearts away.\nD. Puts their hearts on the boat.\nAnswer with the option's letter from the given choices directly.",
1216,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "406-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1217,
"target": "C",
"doc": {
"video_id": "406",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=r-ySDrO4uf4",
"videoID": "r-ySDrO4uf4",
"question_id": "406-3",
"task_type": "Action Reasoning",
"question": "What does the hippo god decide to do?",
"options": [
"A. Defeat the two men.",
"B. Send the two men home.",
"C. Help the two men.",
"D. Go back home itself."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the hippo god decide to do?\nOption:\nA. Defeat the two men.\nB. Send the two men home.\nC. Help the two men.\nD. Go back home itself.\nAnswer with the option's letter from the given choices directly.",
1217,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "406-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1218,
"target": "D",
"doc": {
"video_id": "407",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=QxYHkKLaFmY",
"videoID": "QxYHkKLaFmY",
"question_id": "407-1",
"task_type": "Object Recognition",
"question": "What is covered on the ground of the castle?",
"options": [
"A. Stones.",
"B. Food.",
"C. Clothes.",
"D. Glod."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is covered on the ground of the castle?\nOption:\nA. Stones.\nB. Food.\nC. Clothes.\nD. Glod.\nAnswer with the option's letter from the given choices directly.",
1218,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "407-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1219,
"target": "B",
"doc": {
"video_id": "407",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=QxYHkKLaFmY",
"videoID": "QxYHkKLaFmY",
"question_id": "407-2",
"task_type": "Object Recognition",
"question": "What color is the dragon?",
"options": [
"A. Red.",
"B. Black.",
"C. Blue.",
"D. Yellow."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the dragon?\nOption:\nA. Red.\nB. Black.\nC. Blue.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
1219,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "407-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1220,
"target": "D",
"doc": {
"video_id": "407",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=QxYHkKLaFmY",
"videoID": "QxYHkKLaFmY",
"question_id": "407-3",
"task_type": "Object Recognition",
"question": "What is the creature do at last?",
"options": [
"A. Hides and seeks with others.",
"B. Waits for its master.",
"C. Prevents others taking gold from the castle.",
"D. Destroys the villege."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the creature do at last?\nOption:\nA. Hides and seeks with others.\nB. Waits for its master.\nC. Prevents others taking gold from the castle.\nD. Destroys the villege.\nAnswer with the option's letter from the given choices directly.",
1220,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "407-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1221,
"target": "A",
"doc": {
"video_id": "408",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=weOrNMHcDTI",
"videoID": "weOrNMHcDTI",
"question_id": "408-1",
"task_type": "Action Reasoning",
"question": "Which animal doesn't appear in the video?",
"options": [
"A. Panda.",
"B. Hawk.",
"C. lizard.",
"D. Snake."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animal doesn't appear in the video?\nOption:\nA. Panda.\nB. Hawk.\nC. lizard.\nD. Snake.\nAnswer with the option's letter from the given choices directly.",
1221,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "408-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1222,
"target": "C",
"doc": {
"video_id": "408",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=weOrNMHcDTI",
"videoID": "weOrNMHcDTI",
"question_id": "408-2",
"task_type": "Counting Problem",
"question": "How many bullets does Rango use to kill the hawl in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 1.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many bullets does Rango use to kill the hawl in the video?\nOption:\nA. 3.\nB. 2.\nC. 1.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1222,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "408-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1223,
"target": "A",
"doc": {
"video_id": "408",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=weOrNMHcDTI",
"videoID": "weOrNMHcDTI",
"question_id": "408-3",
"task_type": "Object Reasoning",
"question": "Who is the opponent of Rango in the second matchup in the video?",
"options": [
"A. Snake.",
"B. Tiger.",
"C. Hawk.",
"D. Mouse."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the opponent of Rango in the second matchup in the video?\nOption:\nA. Snake.\nB. Tiger.\nC. Hawk.\nD. Mouse.\nAnswer with the option's letter from the given choices directly.",
1223,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "408-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1224,
"target": "B",
"doc": {
"video_id": "409",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=uMIohuKRq58",
"videoID": "uMIohuKRq58",
"question_id": "409-1",
"task_type": "Information Synopsis",
"question": "What is the video about?",
"options": [
"A. A motorcycle race.",
"B. A sence of a terminator chasing a boy.",
"C. A police chases a convict.",
"D. A gangster fight."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video about?\nOption:\nA. A motorcycle race.\nB. A sence of a terminator chasing a boy.\nC. A police chases a convict.\nD. A gangster fight.\nAnswer with the option's letter from the given choices directly.",
1224,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "409-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1225,
"target": "D",
"doc": {
"video_id": "409",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=uMIohuKRq58",
"videoID": "uMIohuKRq58",
"question_id": "409-2",
"task_type": "Object Reasoning",
"question": "Who is protecting the boy?",
"options": [
"A. Nobody.",
"B. The terminate with police suit.",
"C. Both.",
"D. The terminator with black leather jacket."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is protecting the boy?\nOption:\nA. Nobody.\nB. The terminate with police suit.\nC. Both.\nD. The terminator with black leather jacket.\nAnswer with the option's letter from the given choices directly.",
1225,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "409-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1226,
"target": "C",
"doc": {
"video_id": "409",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=uMIohuKRq58",
"videoID": "uMIohuKRq58",
"question_id": "409-3",
"task_type": "Object Recognition",
"question": "Where does the terminator with sunglasses hide his gun?",
"options": [
"A. His motorcycle.",
"B. His jacket.",
"C. A box of roses.",
"D. The body of others."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the terminator with sunglasses hide his gun?\nOption:\nA. His motorcycle.\nB. His jacket.\nC. A box of roses.\nD. The body of others.\nAnswer with the option's letter from the given choices directly.",
1226,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "409-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1227,
"target": "C",
"doc": {
"video_id": "410",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=uC9qU3X1JgM",
"videoID": "uC9qU3X1JgM",
"question_id": "410-1",
"task_type": "Counting Problem",
"question": "How many persons does superman fight versus?",
"options": [
"A. 6.",
"B. 4.",
"C. 5.",
"D. 7."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many persons does superman fight versus?\nOption:\nA. 6.\nB. 4.\nC. 5.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1227,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "410-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1228,
"target": "A",
"doc": {
"video_id": "410",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=uC9qU3X1JgM",
"videoID": "uC9qU3X1JgM",
"question_id": "410-2",
"task_type": "Object Recognition",
"question": "Which is not the ability of superman?",
"options": [
"A. Stealth.",
"B. Laser eyes.",
"C. Fast movement.",
"D. Flight."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is not the ability of superman?\nOption:\nA. Stealth.\nB. Laser eyes.\nC. Fast movement.\nD. Flight.\nAnswer with the option's letter from the given choices directly.",
1228,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "410-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1229,
"target": "B",
"doc": {
"video_id": "410",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=uC9qU3X1JgM",
"videoID": "uC9qU3X1JgM",
"question_id": "410-3",
"task_type": "Object Recognition",
"question": "Who is caught in the air by Superman?",
"options": [
"A. The Flash.",
"B. Batman.",
"C. Wonder Woman.",
"D. Aquaman."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is caught in the air by Superman?\nOption:\nA. The Flash.\nB. Batman.\nC. Wonder Woman.\nD. Aquaman.\nAnswer with the option's letter from the given choices directly.",
1229,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "410-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1230,
"target": "D",
"doc": {
"video_id": "411",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=K759eXmaMTY",
"videoID": "K759eXmaMTY",
"question_id": "411-1",
"task_type": "Attribute Perception",
"question": "What caused the dilapidated scene at the beginning of the video?",
"options": [
"A. Typhoon.",
"B. War.",
"C. Fire.",
"D. Earthquake."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What caused the dilapidated scene at the beginning of the video?\nOption:\nA. Typhoon.\nB. War.\nC. Fire.\nD. Earthquake.\nAnswer with the option's letter from the given choices directly.",
1230,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "411-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1231,
"target": "C",
"doc": {
"video_id": "411",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=K759eXmaMTY",
"videoID": "K759eXmaMTY",
"question_id": "411-2",
"task_type": "Action Recognition",
"question": "What are the children wearing pink outfits doing at the beginning of the video?",
"options": [
"A. Praying.",
"B. Having a party.",
"C. Learning ballet.",
"D. Receiving treatment."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the children wearing pink outfits doing at the beginning of the video?\nOption:\nA. Praying.\nB. Having a party.\nC. Learning ballet.\nD. Receiving treatment.\nAnswer with the option's letter from the given choices directly.",
1231,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "411-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1232,
"target": "A",
"doc": {
"video_id": "411",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=K759eXmaMTY",
"videoID": "K759eXmaMTY",
"question_id": "411-3",
"task_type": "Information Synopsis",
"question": "In the middle of the video, what are the difficulties of rebuilding after the earthquake?",
"options": [
"A. The weather and the unstable economy.",
"B. Serious casualties.",
"C. Constant aftershocks.",
"D. There are not enough supplies."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle of the video, what are the difficulties of rebuilding after the earthquake?\nOption:\nA. The weather and the unstable economy.\nB. Serious casualties.\nC. Constant aftershocks.\nD. There are not enough supplies.\nAnswer with the option's letter from the given choices directly.",
1232,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "411-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1233,
"target": "A",
"doc": {
"video_id": "412",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=xr_nln2ZQw8",
"videoID": "xr_nln2ZQw8",
"question_id": "412-1",
"task_type": "Action Recognition",
"question": "In the video, how does Vitória make money?",
"options": [
"A. Doing makeup for people in other slums.",
"B. Drug trafficking.",
"C. Working part-time in a store.",
"D. Opening a cosmetics store."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how does Vitória make money?\nOption:\nA. Doing makeup for people in other slums.\nB. Drug trafficking.\nC. Working part-time in a store.\nD. Opening a cosmetics store.\nAnswer with the option's letter from the given choices directly.",
1233,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "412-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1234,
"target": "B",
"doc": {
"video_id": "412",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=xr_nln2ZQw8",
"videoID": "xr_nln2ZQw8",
"question_id": "412-2",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT a reason why there are many teenage mothers in Brazil every year?",
"options": [
"A. Abortion is not allowed.",
"B. The Brazilian tradition is to have children at a young age.",
"C. Lack of sex education.",
"D. Contraceptive use is not advocated."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT a reason why there are many teenage mothers in Brazil every year?\nOption:\nA. Abortion is not allowed.\nB. The Brazilian tradition is to have children at a young age.\nC. Lack of sex education.\nD. Contraceptive use is not advocated.\nAnswer with the option's letter from the given choices directly.",
1234,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "412-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1235,
"target": "C",
"doc": {
"video_id": "412",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=xr_nln2ZQw8",
"videoID": "xr_nln2ZQw8",
"question_id": "412-3",
"task_type": "Attribute Perception",
"question": "At the end of the video, what kind of attitude is reflected in the interview with the young parents?",
"options": [
"A. Depressed.",
"B. Confused.",
"C. Optimistic.",
"D. Angry."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the video, what kind of attitude is reflected in the interview with the young parents?\nOption:\nA. Depressed.\nB. Confused.\nC. Optimistic.\nD. Angry.\nAnswer with the option's letter from the given choices directly.",
1235,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "412-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1236,
"target": "D",
"doc": {
"video_id": "413",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=tXb_zrHp4H8",
"videoID": "tXb_zrHp4H8",
"question_id": "413-1",
"task_type": "Attribute Perception",
"question": "What is \"Avenues for Justice\" mentioned in the video?",
"options": [
"A. A travel agency.",
"B. A prison.",
"C. A detective agency.",
"D. A law firm."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is \"Avenues for Justice\" mentioned in the video?\nOption:\nA. A travel agency.\nB. A prison.\nC. A detective agency.\nD. A law firm.\nAnswer with the option's letter from the given choices directly.",
1236,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "413-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1237,
"target": "D",
"doc": {
"video_id": "413",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=tXb_zrHp4H8",
"videoID": "tXb_zrHp4H8",
"question_id": "413-2",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. The development of New York.",
"B. The life of a family in New York.",
"C. A charitable organization in New York.",
"D. Young and homeless in New York."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. The development of New York.\nB. The life of a family in New York.\nC. A charitable organization in New York.\nD. Young and homeless in New York.\nAnswer with the option's letter from the given choices directly.",
1237,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "413-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1238,
"target": "D",
"doc": {
"video_id": "413",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=tXb_zrHp4H8",
"videoID": "tXb_zrHp4H8",
"question_id": "413-3",
"task_type": "Temporal Perception",
"question": "In which part of the video is the woman in the blue top interviewed?",
"options": [
"A. Cannot be determined.",
"B. The beginning of the video.",
"C. The middle part of the video.",
"D. The latter part of the video."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which part of the video is the woman in the blue top interviewed?\nOption:\nA. Cannot be determined.\nB. The beginning of the video.\nC. The middle part of the video.\nD. The latter part of the video.\nAnswer with the option's letter from the given choices directly.",
1238,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "413-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1239,
"target": "D",
"doc": {
"video_id": "414",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=NrtwBO_nyFA",
"videoID": "NrtwBO_nyFA",
"question_id": "414-1",
"task_type": "Object Recognition",
"question": "Which step introduced in this video does not require the collaboration of two excavators?",
"options": [
"A. Detachment phase of the marble blocks.",
"B. Loading the marble blocks into a dumper.",
"C. Carrying the marble blocks in the narrow streets.",
"D. Mining for cutting the marble blocks."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which step introduced in this video does not require the collaboration of two excavators?\nOption:\nA. Detachment phase of the marble blocks.\nB. Loading the marble blocks into a dumper.\nC. Carrying the marble blocks in the narrow streets.\nD. Mining for cutting the marble blocks.\nAnswer with the option's letter from the given choices directly.",
1239,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "414-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1240,
"target": "C",
"doc": {
"video_id": "414",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=NrtwBO_nyFA",
"videoID": "NrtwBO_nyFA",
"question_id": "414-2",
"task_type": "Information Synopsis",
"question": "What is the main content of this video?",
"options": [
"A. The cultural significance of marble in architecture and art.",
"B. The economic impact of the marble industry on local communities.",
"C. The extraction and processing of marble.",
"D. The challenges faced by marble salespeople in a competitive market."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this video?\nOption:\nA. The cultural significance of marble in architecture and art.\nB. The economic impact of the marble industry on local communities.\nC. The extraction and processing of marble.\nD. The challenges faced by marble salespeople in a competitive market.\nAnswer with the option's letter from the given choices directly.",
1240,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "414-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1241,
"target": "A",
"doc": {
"video_id": "414",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=NrtwBO_nyFA",
"videoID": "NrtwBO_nyFA",
"question_id": "414-3",
"task_type": "Object Recognition",
"question": "Which of the five black-text lines penned by the man in blue is replicated in red spray paint during the sorting and counting phase?",
"options": [
"A. The first line.",
"B. The second line.",
"C. The third line.",
"D. The fourth line."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the five black-text lines penned by the man in blue is replicated in red spray paint during the sorting and counting phase?\nOption:\nA. The first line.\nB. The second line.\nC. The third line.\nD. The fourth line.\nAnswer with the option's letter from the given choices directly.",
1241,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "414-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1242,
"target": "D",
"doc": {
"video_id": "415",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HeEyJo838PA",
"videoID": "HeEyJo838PA",
"question_id": "415-1",
"task_type": "Counting Problem",
"question": "In the video, how many cubs is the mother bear with when the filmer encounters her downstream of the river?",
"options": [
"A. 4.",
"B. 1.",
"C. 3.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how many cubs is the mother bear with when the filmer encounters her downstream of the river?\nOption:\nA. 4.\nB. 1.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1242,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "415-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1243,
"target": "C",
"doc": {
"video_id": "415",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HeEyJo838PA",
"videoID": "HeEyJo838PA",
"question_id": "415-2",
"task_type": "Object Recognition",
"question": "What is the main food of the brown bear in the video?",
"options": [
"A. Berries.",
"B. Insects.",
"C. Salmon.",
"D. Small mammals."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main food of the brown bear in the video?\nOption:\nA. Berries.\nB. Insects.\nC. Salmon.\nD. Small mammals.\nAnswer with the option's letter from the given choices directly.",
1243,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "415-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1244,
"target": "C",
"doc": {
"video_id": "415",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=HeEyJo838PA",
"videoID": "HeEyJo838PA",
"question_id": "415-3",
"task_type": "Action Reasoning",
"question": "Why does the mother brown bear in the video make her cubs run to higher ground?",
"options": [
"A. Because you can see further from the high ground.",
"B. Because she spotted two tiger that might kill the cubs.",
"C. Because she spotted two adult male brown bears that might kill the cubs.",
"D. Because it's easier to catch salmon from the high ground."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the mother brown bear in the video make her cubs run to higher ground?\nOption:\nA. Because you can see further from the high ground.\nB. Because she spotted two tiger that might kill the cubs.\nC. Because she spotted two adult male brown bears that might kill the cubs.\nD. Because it's easier to catch salmon from the high ground.\nAnswer with the option's letter from the given choices directly.",
1244,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "415-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1245,
"target": "B",
"doc": {
"video_id": "416",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B7-6lRo2m4Y",
"videoID": "B7-6lRo2m4Y",
"question_id": "416-1",
"task_type": "Spatial Perception",
"question": "At the beginning of the video, in which direction is the little penguin moving?",
"options": [
"A. Staying in place.",
"B. From left to right.",
"C. From right to left.",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning of the video, in which direction is the little penguin moving?\nOption:\nA. Staying in place.\nB. From left to right.\nC. From right to left.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
1245,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "416-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1246,
"target": "C",
"doc": {
"video_id": "416",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B7-6lRo2m4Y",
"videoID": "B7-6lRo2m4Y",
"question_id": "416-2",
"task_type": "Action Recognition",
"question": "What ultimately happened to the little penguin that got lost from its parents in the video?",
"options": [
"A. It was adopted by other penguins.",
"B. It did not find its parents and froze to death.",
"C. It found its mother.",
"D. Cannot be determined."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What ultimately happened to the little penguin that got lost from its parents in the video?\nOption:\nA. It was adopted by other penguins.\nB. It did not find its parents and froze to death.\nC. It found its mother.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
1246,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "416-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1247,
"target": "A",
"doc": {
"video_id": "416",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B7-6lRo2m4Y",
"videoID": "B7-6lRo2m4Y",
"question_id": "416-3",
"task_type": "Counting Problem",
"question": "The video mentions that when penguins huddle together, the temperature in the middle can reach up to how many degrees Celsius?",
"options": [
"A. 37.",
"B. 27.",
"C. -27.",
"D. 0."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video mentions that when penguins huddle together, the temperature in the middle can reach up to how many degrees Celsius?\nOption:\nA. 37.\nB. 27.\nC. -27.\nD. 0.\nAnswer with the option's letter from the given choices directly.",
1247,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "416-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1248,
"target": "D",
"doc": {
"video_id": "417",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=QrmxMNBhH2Q",
"videoID": "QrmxMNBhH2Q",
"question_id": "417-1",
"task_type": "Information Synopsis",
"question": "What does this video end with?",
"options": [
"A. An interview about how one of the members grows up in 2010.",
"B. A behind-the-scenes footage of a music video shoot in 2010.",
"C. A compilation of the band's funniest moments on tour in 2010.",
"D. The audition of one of the members in 2010."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video end with?\nOption:\nA. An interview about how one of the members grows up in 2010.\nB. A behind-the-scenes footage of a music video shoot in 2010.\nC. A compilation of the band's funniest moments on tour in 2010.\nD. The audition of one of the members in 2010.\nAnswer with the option's letter from the given choices directly.",
1248,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "417-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1249,
"target": "A",
"doc": {
"video_id": "417",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=QrmxMNBhH2Q",
"videoID": "QrmxMNBhH2Q",
"question_id": "417-2",
"task_type": "Attribute Perception",
"question": "In the live show featured in the first half of this video, what hairstyle does the second member who starts to sing have?",
"options": [
"A. Curly hair.",
"B. Braided hair.",
"C. Bald head.",
"D. Mohawk hairstyle."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the live show featured in the first half of this video, what hairstyle does the second member who starts to sing have?\nOption:\nA. Curly hair.\nB. Braided hair.\nC. Bald head.\nD. Mohawk hairstyle.\nAnswer with the option's letter from the given choices directly.",
1249,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "417-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1250,
"target": "C",
"doc": {
"video_id": "417",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=QrmxMNBhH2Q",
"videoID": "QrmxMNBhH2Q",
"question_id": "417-3",
"task_type": "Object Reasoning",
"question": "Which of the following sections is not included in the video?",
"options": [
"A. A live performance by One Direction.",
"B. A rehearsal for the upcoming music concert.",
"C. An interview discussing Zayn's departure from One Direction.",
"D. Multiple news reports highlighting the success of One Direction."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following sections is not included in the video?\nOption:\nA. A live performance by One Direction.\nB. A rehearsal for the upcoming music concert.\nC. An interview discussing Zayn's departure from One Direction.\nD. Multiple news reports highlighting the success of One Direction.\nAnswer with the option's letter from the given choices directly.",
1250,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "417-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1251,
"target": "C",
"doc": {
"video_id": "418",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=9Y-YJEtxHeo",
"videoID": "9Y-YJEtxHeo",
"question_id": "418-1",
"task_type": "Spatial Perception",
"question": "What is the direction of movement of the man speaking to the camera at the beginning of the video?",
"options": [
"A. He is standing still.",
"B. Cannot be determined.",
"C. From left to right.",
"D. From right to left."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the direction of movement of the man speaking to the camera at the beginning of the video?\nOption:\nA. He is standing still.\nB. Cannot be determined.\nC. From left to right.\nD. From right to left.\nAnswer with the option's letter from the given choices directly.",
1251,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "418-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1252,
"target": "C",
"doc": {
"video_id": "418",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=9Y-YJEtxHeo",
"videoID": "9Y-YJEtxHeo",
"question_id": "418-2",
"task_type": "Temporal Perception",
"question": "What is the main content of this video?",
"options": [
"A. Japan's economic boom.",
"B. Japanese law.",
"C. Japan's overtime culture.",
"D. Japanese street scenes."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this video?\nOption:\nA. Japan's economic boom.\nB. Japanese law.\nC. Japan's overtime culture.\nD. Japanese street scenes.\nAnswer with the option's letter from the given choices directly.",
1252,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "418-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1253,
"target": "C",
"doc": {
"video_id": "418",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=9Y-YJEtxHeo",
"videoID": "9Y-YJEtxHeo",
"question_id": "418-3",
"task_type": "OCR Problems",
"question": "How many Japanese people would feel ashamed for taking paid leave according to the video?",
"options": [
"A. 4%.",
"B. 36%.",
"C. 63%.",
"D. 42%."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many Japanese people would feel ashamed for taking paid leave according to the video?\nOption:\nA. 4%.\nB. 36%.\nC. 63%.\nD. 42%.\nAnswer with the option's letter from the given choices directly.",
1253,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "418-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1254,
"target": "B",
"doc": {
"video_id": "419",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=3EQLFHRHpag",
"videoID": "3EQLFHRHpag",
"question_id": "419-1",
"task_type": "Information Synopsis",
"question": "The problems people encounter in the video are caused by what?",
"options": [
"A. Catastrophic weather.",
"B. Global warming.",
"C. Financial crisis.",
"D. Oil crisis."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The problems people encounter in the video are caused by what?\nOption:\nA. Catastrophic weather.\nB. Global warming.\nC. Financial crisis.\nD. Oil crisis.\nAnswer with the option's letter from the given choices directly.",
1254,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "419-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1255,
"target": "C",
"doc": {
"video_id": "419",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=3EQLFHRHpag",
"videoID": "3EQLFHRHpag",
"question_id": "419-2",
"task_type": "Temporal Perception",
"question": "In which part of the video is a man wearing a red jersey interviewed?",
"options": [
"A. There is no interview with a man in red clothing.",
"B. End of the video.",
"C. Beginning of the video.",
"D. Middle of the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which part of the video is a man wearing a red jersey interviewed?\nOption:\nA. There is no interview with a man in red clothing.\nB. End of the video.\nC. Beginning of the video.\nD. Middle of the video.\nAnswer with the option's letter from the given choices directly.",
1255,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "419-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1256,
"target": "B",
"doc": {
"video_id": "419",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=3EQLFHRHpag",
"videoID": "3EQLFHRHpag",
"question_id": "419-3",
"task_type": "Attribute Perception",
"question": "What emotion is the song sung by the woman at the end of the video expressing?",
"options": [
"A. Respect for leaders.",
"B. Concern about climate change.",
"C. Yearning for a beautiful love.",
"D. Longing for the outside world."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What emotion is the song sung by the woman at the end of the video expressing?\nOption:\nA. Respect for leaders.\nB. Concern about climate change.\nC. Yearning for a beautiful love.\nD. Longing for the outside world.\nAnswer with the option's letter from the given choices directly.",
1256,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "419-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1257,
"target": "A",
"doc": {
"video_id": "420",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=FEzcqbzTwvA",
"videoID": "FEzcqbzTwvA",
"question_id": "420-1",
"task_type": "Attribute Perception",
"question": "Which of the following views is expressed by the interviewee in the green shirt with a white beard in the video?",
"options": [
"A. He gradually forgot about the Parkinson's tremors after playing table tennis.",
"B. Playing table tennis has a scientifically proven therapeutic effect on Parkinson's.",
"C. There is no scientific basis for the therapeutic effect of table tennis on Parkinson's.",
"D. He can't do anything without dopamine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following views is expressed by the interviewee in the green shirt with a white beard in the video?\nOption:\nA. He gradually forgot about the Parkinson's tremors after playing table tennis.\nB. Playing table tennis has a scientifically proven therapeutic effect on Parkinson's.\nC. There is no scientific basis for the therapeutic effect of table tennis on Parkinson's.\nD. He can't do anything without dopamine.\nAnswer with the option's letter from the given choices directly.",
1257,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "420-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1258,
"target": "B",
"doc": {
"video_id": "420",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=FEzcqbzTwvA",
"videoID": "FEzcqbzTwvA",
"question_id": "420-2",
"task_type": "Temporal Perception",
"question": "Where in the video is the interview with the woman wearing a mask and a green shirt located?",
"options": [
"A. Beginning of the video.",
"B. Middle of the video.",
"C. End of the video.",
"D. Cannot determine."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where in the video is the interview with the woman wearing a mask and a green shirt located?\nOption:\nA. Beginning of the video.\nB. Middle of the video.\nC. End of the video.\nD. Cannot determine.\nAnswer with the option's letter from the given choices directly.",
1258,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "420-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1259,
"target": "D",
"doc": {
"video_id": "420",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=FEzcqbzTwvA",
"videoID": "FEzcqbzTwvA",
"question_id": "420-3",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT a significance of the Ping Pong Parkinson event according to the video?",
"options": [
"A. Alleviating the conditions of Parkinson's patients participating in the activity.",
"B. Enhancing the understanding among Parkinson's patients participating in the activity.",
"C. Improving the mental state of Parkinson's patients participating in the activity.",
"D. Research on the impact of table tennis on Parkinson's patients."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT a significance of the Ping Pong Parkinson event according to the video?\nOption:\nA. Alleviating the conditions of Parkinson's patients participating in the activity.\nB. Enhancing the understanding among Parkinson's patients participating in the activity.\nC. Improving the mental state of Parkinson's patients participating in the activity.\nD. Research on the impact of table tennis on Parkinson's patients.\nAnswer with the option's letter from the given choices directly.",
1259,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "420-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1260,
"target": "A",
"doc": {
"video_id": "421",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=9SEMSQDO-pg",
"videoID": "9SEMSQDO-pg",
"question_id": "421-1",
"task_type": "Action Recognition",
"question": "What are the people holding candles doing at the beginning of the video?",
"options": [
"A. Praying for the missing persons of MH370.",
"B. Celebrating a birthday.",
"C. Conducting a religious ceremony.",
"D. Cannot determine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the people holding candles doing at the beginning of the video?\nOption:\nA. Praying for the missing persons of MH370.\nB. Celebrating a birthday.\nC. Conducting a religious ceremony.\nD. Cannot determine.\nAnswer with the option's letter from the given choices directly.",
1260,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "421-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1261,
"target": "D",
"doc": {
"video_id": "421",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=9SEMSQDO-pg",
"videoID": "9SEMSQDO-pg",
"question_id": "421-2",
"task_type": "OCR Problems",
"question": "How many years has it been since the disappearance of flight MH370 according to the video?",
"options": [
"A. 100.",
"B. 15.",
"C. 5.",
"D. 10."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many years has it been since the disappearance of flight MH370 according to the video?\nOption:\nA. 100.\nB. 15.\nC. 5.\nD. 10.\nAnswer with the option's letter from the given choices directly.",
1261,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "421-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1262,
"target": "A",
"doc": {
"video_id": "421",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=9SEMSQDO-pg",
"videoID": "9SEMSQDO-pg",
"question_id": "421-3",
"task_type": "Object Recognition",
"question": "Who is the person with white hair wearing a black top that appears in the middle of the video?",
"options": [
"A. News reporter.",
"B. Malaysian official.",
"C. News anchor.",
"D. Family member of an MH370 victim."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the person with white hair wearing a black top that appears in the middle of the video?\nOption:\nA. News reporter.\nB. Malaysian official.\nC. News anchor.\nD. Family member of an MH370 victim.\nAnswer with the option's letter from the given choices directly.",
1262,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "421-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1263,
"target": "C",
"doc": {
"video_id": "422",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=FkZn5u3MIiY",
"videoID": "FkZn5u3MIiY",
"question_id": "422-1",
"task_type": "Object Recognition",
"question": "Who could the woman sitting on the bench at the beginning of the video, wearing black and white stripes and speaking, possibly be?",
"options": [
"A. The Queen of England.",
"B. A news anchor.",
"C. The Princess of Wales.",
"D. A news reporter."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who could the woman sitting on the bench at the beginning of the video, wearing black and white stripes and speaking, possibly be?\nOption:\nA. The Queen of England.\nB. A news anchor.\nC. The Princess of Wales.\nD. A news reporter.\nAnswer with the option's letter from the given choices directly.",
1263,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "422-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1264,
"target": "C",
"doc": {
"video_id": "422",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=FkZn5u3MIiY",
"videoID": "FkZn5u3MIiY",
"question_id": "422-2",
"task_type": "Attribute Perception",
"question": "In the video, what kind of emotion does the Princess of Wales display?",
"options": [
"A. Negative.",
"B. Terrified.",
"C. Positive.",
"D. Pessimistic."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what kind of emotion does the Princess of Wales display?\nOption:\nA. Negative.\nB. Terrified.\nC. Positive.\nD. Pessimistic.\nAnswer with the option's letter from the given choices directly.",
1264,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "422-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1265,
"target": "A",
"doc": {
"video_id": "422",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=FkZn5u3MIiY",
"videoID": "FkZn5u3MIiY",
"question_id": "422-3",
"task_type": "Temporal Perception",
"question": "During which part of the video is a black-and-white photo shown?",
"options": [
"A. Later.",
"B. Middle.",
"C. Beginning.",
"D. No black-and-white photo is shown in the video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: During which part of the video is a black-and-white photo shown?\nOption:\nA. Later.\nB. Middle.\nC. Beginning.\nD. No black-and-white photo is shown in the video.\nAnswer with the option's letter from the given choices directly.",
1265,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "422-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1266,
"target": "C",
"doc": {
"video_id": "423",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=S6hF2ffyz2c",
"videoID": "S6hF2ffyz2c",
"question_id": "423-1",
"task_type": "Action Reasoning",
"question": "What was the cause of the chaos depicted in the video?",
"options": [
"A. Car explosion.",
"B. Terrorist attack.",
"C. Farmers' protest.",
"D. War."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the cause of the chaos depicted in the video?\nOption:\nA. Car explosion.\nB. Terrorist attack.\nC. Farmers' protest.\nD. War.\nAnswer with the option's letter from the given choices directly.",
1266,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "423-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1267,
"target": "C",
"doc": {
"video_id": "423",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=S6hF2ffyz2c",
"videoID": "S6hF2ffyz2c",
"question_id": "423-2",
"task_type": "Action Reasoning",
"question": "Based on the information provided by the video, which of the following is NOT a reason for the outbreak of protests by European farmers?",
"options": [
"A. Overwhelming environmental regulations.",
"B. Influx of inexpensive imported goods.",
"C. Escalating costs of chemical fertilizers.",
"D. Decline in income."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, which of the following is NOT a reason for the outbreak of protests by European farmers?\nOption:\nA. Overwhelming environmental regulations.\nB. Influx of inexpensive imported goods.\nC. Escalating costs of chemical fertilizers.\nD. Decline in income.\nAnswer with the option's letter from the given choices directly.",
1267,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "423-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1268,
"target": "D",
"doc": {
"video_id": "423",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=S6hF2ffyz2c",
"videoID": "S6hF2ffyz2c",
"question_id": "423-3",
"task_type": "Object Recognition",
"question": "In the video, the man speaking to the camera, wearing glasses and a brown coat, is most likely in which role?",
"options": [
"A. One of the protesting farmers.",
"B. A news anchor.",
"C. A representative of the police.",
"D. A news reporter."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, the man speaking to the camera, wearing glasses and a brown coat, is most likely in which role?\nOption:\nA. One of the protesting farmers.\nB. A news anchor.\nC. A representative of the police.\nD. A news reporter.\nAnswer with the option's letter from the given choices directly.",
1268,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "423-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1269,
"target": "A",
"doc": {
"video_id": "424",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=9QXXL96y3A0",
"videoID": "9QXXL96y3A0",
"question_id": "424-1",
"task_type": "Object Reasoning",
"question": "According to the video, which of the following countries has the lowest birth rate?",
"options": [
"A. South Korea.",
"B. United States.",
"C. United Kingdom.",
"D. Japan."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following countries has the lowest birth rate?\nOption:\nA. South Korea.\nB. United States.\nC. United Kingdom.\nD. Japan.\nAnswer with the option's letter from the given choices directly.",
1269,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "424-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1270,
"target": "C",
"doc": {
"video_id": "424",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=9QXXL96y3A0",
"videoID": "9QXXL96y3A0",
"question_id": "424-2",
"task_type": "Information Synopsis",
"question": "What is the key focus of the video?",
"options": [
"A. Population structure issues.",
"B. Immigration concerns.",
"C. Implications of declining birth rates on the global stage.",
"D. Uncertain."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the key focus of the video?\nOption:\nA. Population structure issues.\nB. Immigration concerns.\nC. Implications of declining birth rates on the global stage.\nD. Uncertain.\nAnswer with the option's letter from the given choices directly.",
1270,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "424-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1271,
"target": "B",
"doc": {
"video_id": "424",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=9QXXL96y3A0",
"videoID": "9QXXL96y3A0",
"question_id": "424-3",
"task_type": "Information Synopsis",
"question": "According to the narration in the latter part of the video, what is the current trend of the global population?",
"options": [
"A. Remaining unchanged.",
"B. Increasing.",
"C. Decreasing.",
"D. Cannot determine."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the narration in the latter part of the video, what is the current trend of the global population?\nOption:\nA. Remaining unchanged.\nB. Increasing.\nC. Decreasing.\nD. Cannot determine.\nAnswer with the option's letter from the given choices directly.",
1271,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "424-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1272,
"target": "B",
"doc": {
"video_id": "425",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=HK13nVxBYxI",
"videoID": "HK13nVxBYxI",
"question_id": "425-1",
"task_type": "Counting Problem",
"question": "How many people were interviewed in the video?",
"options": [
"A. 4.",
"B. 2.",
"C. 1.",
"D. 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people were interviewed in the video?\nOption:\nA. 4.\nB. 2.\nC. 1.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
1272,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "425-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1273,
"target": "A",
"doc": {
"video_id": "425",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=HK13nVxBYxI",
"videoID": "HK13nVxBYxI",
"question_id": "425-2",
"task_type": "Information Synopsis",
"question": "What is the interviewed man's opinion on cryptocurrencies in the video?",
"options": [
"A. He thinks it's an unregulated market and very dangerous.",
"B. He believes Ethereum is better than Bitcoin.",
"C. He thinks cryptocurrencies are very valuable.",
"D. He believes the price of cryptocurrencies will continue to rise."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the interviewed man's opinion on cryptocurrencies in the video?\nOption:\nA. He thinks it's an unregulated market and very dangerous.\nB. He believes Ethereum is better than Bitcoin.\nC. He thinks cryptocurrencies are very valuable.\nD. He believes the price of cryptocurrencies will continue to rise.\nAnswer with the option's letter from the given choices directly.",
1273,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "425-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1274,
"target": "B",
"doc": {
"video_id": "425",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=HK13nVxBYxI",
"videoID": "HK13nVxBYxI",
"question_id": "425-3",
"task_type": "OCR Problems",
"question": "According to the video, what has the price of Bitcoin reached in dollars?",
"options": [
"A. 692.",
"B. 69202.",
"C. 692020.",
"D. 6920."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what has the price of Bitcoin reached in dollars?\nOption:\nA. 692.\nB. 69202.\nC. 692020.\nD. 6920.\nAnswer with the option's letter from the given choices directly.",
1274,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "425-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1275,
"target": "C",
"doc": {
"video_id": "426",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=vv6ATRPUjrI",
"videoID": "vv6ATRPUjrI",
"question_id": "426-1",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following is NOT a difficulty in growing crops in space?",
"options": [
"A. Lack of gravity.",
"B. Environmental radiation.",
"C. Lack of sunlight.",
"D. Water adsorption at the plant roots affects plant respiration."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following is NOT a difficulty in growing crops in space?\nOption:\nA. Lack of gravity.\nB. Environmental radiation.\nC. Lack of sunlight.\nD. Water adsorption at the plant roots affects plant respiration.\nAnswer with the option's letter from the given choices directly.",
1275,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "426-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1276,
"target": "A",
"doc": {
"video_id": "426",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=vv6ATRPUjrI",
"videoID": "vv6ATRPUjrI",
"question_id": "426-2",
"task_type": "Temporal Perception",
"question": "When does the interview with the woman in the green shirt take place in the video?",
"options": [
"A. Middle of the video.",
"B. End of the video.",
"C. Beginning of the video.",
"D. There is no interview with a woman in a green shirt."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When does the interview with the woman in the green shirt take place in the video?\nOption:\nA. Middle of the video.\nB. End of the video.\nC. Beginning of the video.\nD. There is no interview with a woman in a green shirt.\nAnswer with the option's letter from the given choices directly.",
1276,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "426-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1277,
"target": "D",
"doc": {
"video_id": "426",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=vv6ATRPUjrI",
"videoID": "vv6ATRPUjrI",
"question_id": "426-3",
"task_type": "Spatial Perception",
"question": "Where was the scene shown at the beginning of the video filmed?",
"options": [
"A. Studio.",
"B. Airplane.",
"C. School.",
"D. Space station."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where was the scene shown at the beginning of the video filmed?\nOption:\nA. Studio.\nB. Airplane.\nC. School.\nD. Space station.\nAnswer with the option's letter from the given choices directly.",
1277,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "426-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1278,
"target": "D",
"doc": {
"video_id": "427",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=dTUaWnvIOp4",
"videoID": "dTUaWnvIOp4",
"question_id": "427-1",
"task_type": "Spatial Perception",
"question": "In the early part of the video, the interview with the man could likely have been filmed in what location?",
"options": [
"A. Inside a shopping mall.",
"B. On a sports field.",
"C. Inside a park.",
"D. Inside a laboratory."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the early part of the video, the interview with the man could likely have been filmed in what location?\nOption:\nA. Inside a shopping mall.\nB. On a sports field.\nC. Inside a park.\nD. Inside a laboratory.\nAnswer with the option's letter from the given choices directly.",
1278,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "427-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Spatial Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1279,
"target": "C",
"doc": {
"video_id": "427",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=dTUaWnvIOp4",
"videoID": "dTUaWnvIOp4",
"question_id": "427-2",
"task_type": "Object Reasoning",
"question": "Based on the video, what are people's concerns about artificially manufactured diamonds?",
"options": [
"A. Not high quality.",
"B. Low production capacity.",
"C. Too energy-intensive and not sustainable.",
"D. Lack of consumer acceptance."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what are people's concerns about artificially manufactured diamonds?\nOption:\nA. Not high quality.\nB. Low production capacity.\nC. Too energy-intensive and not sustainable.\nD. Lack of consumer acceptance.\nAnswer with the option's letter from the given choices directly.",
1279,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "427-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1280,
"target": "B",
"doc": {
"video_id": "427",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=dTUaWnvIOp4",
"videoID": "dTUaWnvIOp4",
"question_id": "427-3",
"task_type": "Object Reasoning",
"question": "In the early to middle part of the video, there are two women wearing blue clothes. What is their relationship?",
"options": [
"A. Diamond salesperson and customer.",
"B. Sisters.",
"C. Reporter and interviewee.",
"D. There are no two women wearing blue clothes in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the early to middle part of the video, there are two women wearing blue clothes. What is their relationship?\nOption:\nA. Diamond salesperson and customer.\nB. Sisters.\nC. Reporter and interviewee.\nD. There are no two women wearing blue clothes in the video.\nAnswer with the option's letter from the given choices directly.",
1280,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "427-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1281,
"target": "A",
"doc": {
"video_id": "428",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=20gUIdXGuAs",
"videoID": "20gUIdXGuAs",
"question_id": "428-1",
"task_type": "Information Synopsis",
"question": "Based on the video, why was Google sued?",
"options": [
"A. Google was sued for unlawfully using copyrighted content to train AI models.",
"B. Google was not the subject of a lawsuit.",
"C. Google was sued for violating antitrust laws.",
"D. Google was sued due to the negative societal effects caused by its AI robots."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, why was Google sued?\nOption:\nA. Google was sued for unlawfully using copyrighted content to train AI models.\nB. Google was not the subject of a lawsuit.\nC. Google was sued for violating antitrust laws.\nD. Google was sued due to the negative societal effects caused by its AI robots.\nAnswer with the option's letter from the given choices directly.",
1281,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "428-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1282,
"target": "B",
"doc": {
"video_id": "428",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=20gUIdXGuAs",
"videoID": "20gUIdXGuAs",
"question_id": "428-2",
"task_type": "Attribute Perception",
"question": "What is the likely liquid contained in the cup on the desk of the news studio?",
"options": [
"A. Juice.",
"B. Water.",
"C. Red wine.",
"D. Cola."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the likely liquid contained in the cup on the desk of the news studio?\nOption:\nA. Juice.\nB. Water.\nC. Red wine.\nD. Cola.\nAnswer with the option's letter from the given choices directly.",
1282,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "428-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1283,
"target": "C",
"doc": {
"video_id": "428",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=20gUIdXGuAs",
"videoID": "20gUIdXGuAs",
"question_id": "428-3",
"task_type": "Object Reasoning",
"question": "How is the game content that appears in the video produced?",
"options": [
"A. Pre-set by the game company.",
"B. Performed by real people.",
"C. Generated by AI.",
"D. Undeterminable."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the game content that appears in the video produced?\nOption:\nA. Pre-set by the game company.\nB. Performed by real people.\nC. Generated by AI.\nD. Undeterminable.\nAnswer with the option's letter from the given choices directly.",
1283,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "428-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1284,
"target": "A",
"doc": {
"video_id": "429",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=uw30kZ_z0Q8",
"videoID": "uw30kZ_z0Q8",
"question_id": "429-1",
"task_type": "Information Synopsis",
"question": "What is the genre or category of this video?",
"options": [
"A. This is a piece of news.",
"B. This is a live game commentary record.",
"C. This is an advertisement for the launch of a game.",
"D. This is a video introducing a game manufacturer."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the genre or category of this video?\nOption:\nA. This is a piece of news.\nB. This is a live game commentary record.\nC. This is an advertisement for the launch of a game.\nD. This is a video introducing a game manufacturer.\nAnswer with the option's letter from the given choices directly.",
1284,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "429-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1285,
"target": "D",
"doc": {
"video_id": "429",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=uw30kZ_z0Q8",
"videoID": "uw30kZ_z0Q8",
"question_id": "429-2",
"task_type": "Object Recognition",
"question": "What was the host talking about when the screenshots of the webpage and comments were shown at the beginning of the video?",
"options": [
"A. Pokémon Go recruits character modelers.",
"B. Pokémon is coming to Switch.",
"C. A new Pokémon game is coming soon.",
"D. Pokémon is recruiting 3DCG designers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the host talking about when the screenshots of the webpage and comments were shown at the beginning of the video?\nOption:\nA. Pokémon Go recruits character modelers.\nB. Pokémon is coming to Switch.\nC. A new Pokémon game is coming soon.\nD. Pokémon is recruiting 3DCG designers.\nAnswer with the option's letter from the given choices directly.",
1285,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "429-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1286,
"target": "B",
"doc": {
"video_id": "429",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=uw30kZ_z0Q8",
"videoID": "uw30kZ_z0Q8",
"question_id": "429-3",
"task_type": "Counting Problem",
"question": "How many different humanoid characters appear in the game shown in this video?",
"options": [
"A. 7.",
"B. 6.",
"C. 5.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different humanoid characters appear in the game shown in this video?\nOption:\nA. 7.\nB. 6.\nC. 5.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1286,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "429-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1287,
"target": "B",
"doc": {
"video_id": "430",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=p86ZOmxiTx4",
"videoID": "p86ZOmxiTx4",
"question_id": "430-1",
"task_type": "Object Recognition",
"question": "What caused the death of the people reported in the video?",
"options": [
"A. A car accident.",
"B. A cancer.",
"C. A cerebral hemorrhage.",
"D. A medical malpractice."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What caused the death of the people reported in the video?\nOption:\nA. A car accident.\nB. A cancer.\nC. A cerebral hemorrhage.\nD. A medical malpractice.\nAnswer with the option's letter from the given choices directly.",
1287,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "430-1",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1288,
"target": "D",
"doc": {
"video_id": "430",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=p86ZOmxiTx4",
"videoID": "p86ZOmxiTx4",
"question_id": "430-2",
"task_type": "Information Synopsis",
"question": "What is the main content of this news?",
"options": [
"A. Alex Trebek's life experiences and major events.",
"B. Alex Trebek is interviewed by the guest.",
"C. Alex Trebek's programs throughout his life.",
"D. Alex Trebek's experience of life after cancer."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this news?\nOption:\nA. Alex Trebek's life experiences and major events.\nB. Alex Trebek is interviewed by the guest.\nC. Alex Trebek's programs throughout his life.\nD. Alex Trebek's experience of life after cancer.\nAnswer with the option's letter from the given choices directly.",
1288,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "430-2",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1289,
"target": "A",
"doc": {
"video_id": "430",
"duration": "medium",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=p86ZOmxiTx4",
"videoID": "p86ZOmxiTx4",
"question_id": "430-3",
"task_type": "OCR Problems",
"question": "Which news media outlet reported on this video?",
"options": [
"A. ABC News.",
"B. BBC News.",
"C. CNBC.",
"D. NBC News."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which news media outlet reported on this video?\nOption:\nA. ABC News.\nB. BBC News.\nC. CNBC.\nD. NBC News.\nAnswer with the option's letter from the given choices directly.",
1289,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "430-3",
"duration": "medium",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1290,
"target": "D",
"doc": {
"video_id": "431",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=RN1ZC7Tb2Pg",
"videoID": "RN1ZC7Tb2Pg",
"question_id": "431-1",
"task_type": "Action Recognition",
"question": "What happens when Gragas and Akali fight versus Sylas?",
"options": [
"A. Sylas is dead but slays Gragas only.",
"B. They slay Sylas without cost.",
"C. Sylas is dead but slays Akali only.",
"D. Sylas is dead and slays them both."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when Gragas and Akali fight versus Sylas?\nOption:\nA. Sylas is dead but slays Gragas only.\nB. They slay Sylas without cost.\nC. Sylas is dead but slays Akali only.\nD. Sylas is dead and slays them both.\nAnswer with the option's letter from the given choices directly.",
1290,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "431-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1291,
"target": "A",
"doc": {
"video_id": "431",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=RN1ZC7Tb2Pg",
"videoID": "RN1ZC7Tb2Pg",
"question_id": "431-2",
"task_type": "Counting Problem",
"question": "How many legends does Rengar slay?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many legends does Rengar slay?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1291,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "431-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1292,
"target": "C",
"doc": {
"video_id": "431",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=RN1ZC7Tb2Pg",
"videoID": "RN1ZC7Tb2Pg",
"question_id": "431-3",
"task_type": "Object Recognition",
"question": "Who slays Azir as shown in the video?",
"options": [
"A. Jinx.",
"B. Nami.",
"C. Ahri.",
"D. Ashe."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who slays Azir as shown in the video?\nOption:\nA. Jinx.\nB. Nami.\nC. Ahri.\nD. Ashe.\nAnswer with the option's letter from the given choices directly.",
1292,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "431-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1293,
"target": "B",
"doc": {
"video_id": "432",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=OApAF--FqLA",
"videoID": "OApAF--FqLA",
"question_id": "432-1",
"task_type": "Object Recognition",
"question": "Who was the first to be slain by Vi?",
"options": [
"A. Jayce.",
"B. Lucian.",
"C. Milio.",
"D. Twist Fate."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was the first to be slain by Vi?\nOption:\nA. Jayce.\nB. Lucian.\nC. Milio.\nD. Twist Fate.\nAnswer with the option's letter from the given choices directly.",
1293,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "432-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1294,
"target": "A",
"doc": {
"video_id": "432",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=OApAF--FqLA",
"videoID": "OApAF--FqLA",
"question_id": "432-2",
"task_type": "Counting Problem",
"question": "How many times is Vi slain in this video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times is Vi slain in this video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1294,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "432-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1295,
"target": "C",
"doc": {
"video_id": "432",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=OApAF--FqLA",
"videoID": "OApAF--FqLA",
"question_id": "432-3",
"task_type": "Counting Problem",
"question": "How many skins does Vi have in this video?",
"options": [
"A. 7.",
"B. 5.",
"C. 6.",
"D. 8."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many skins does Vi have in this video?\nOption:\nA. 7.\nB. 5.\nC. 6.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1295,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "432-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1296,
"target": "D",
"doc": {
"video_id": "433",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Gokw6n0qf-w",
"videoID": "Gokw6n0qf-w",
"question_id": "433-1",
"task_type": "Temporal Perception",
"question": "When does Sion start proxying?",
"options": [
"A. Teammates arrive.",
"B. Opponent reaches tower.",
"C. Slaining enemy.",
"D. A big wave of minions gets under the tower."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When does Sion start proxying?\nOption:\nA. Teammates arrive.\nB. Opponent reaches tower.\nC. Slaining enemy.\nD. A big wave of minions gets under the tower.\nAnswer with the option's letter from the given choices directly.",
1296,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "433-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1297,
"target": "B",
"doc": {
"video_id": "433",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Gokw6n0qf-w",
"videoID": "Gokw6n0qf-w",
"question_id": "433-2",
"task_type": "Action Recognition",
"question": "What is Sion's passive skill?",
"options": [
"A. Cleaning the minions.",
"B. Keeping alive for a while.",
"C. Preventing Sion from death.",
"D. Saving teammates."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is Sion's passive skill?\nOption:\nA. Cleaning the minions.\nB. Keeping alive for a while.\nC. Preventing Sion from death.\nD. Saving teammates.\nAnswer with the option's letter from the given choices directly.",
1297,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "433-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1298,
"target": "C",
"doc": {
"video_id": "433",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Gokw6n0qf-w",
"videoID": "Gokw6n0qf-w",
"question_id": "433-3",
"task_type": "Action Recognition",
"question": "Which summoner skill does Sion use in this video?",
"options": [
"A. Heal.",
"B. Exhaust.",
"C. TP.",
"D. Smite."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summoner skill does Sion use in this video?\nOption:\nA. Heal.\nB. Exhaust.\nC. TP.\nD. Smite.\nAnswer with the option's letter from the given choices directly.",
1298,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "433-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1299,
"target": "C",
"doc": {
"video_id": "434",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=0FM64MrRuZE",
"videoID": "0FM64MrRuZE",
"question_id": "434-1",
"task_type": "Action Recognition",
"question": "What does Vayne do at the beginning of the game?",
"options": [
"A. Helping teammate jungle.",
"B. Attacking the top of the enemy.",
"C. Attacking the jungle of the enemy.",
"D. Staying at the fountain."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Vayne do at the beginning of the game?\nOption:\nA. Helping teammate jungle.\nB. Attacking the top of the enemy.\nC. Attacking the jungle of the enemy.\nD. Staying at the fountain.\nAnswer with the option's letter from the given choices directly.",
1299,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "434-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1300,
"target": "A",
"doc": {
"video_id": "434",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=0FM64MrRuZE",
"videoID": "0FM64MrRuZE",
"question_id": "434-2",
"task_type": "Action Recognition",
"question": "Which skill is used first by Vayne?",
"options": [
"A. Haste.",
"B. Flash.",
"C. Exhaust.",
"D. Teleport."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which skill is used first by Vayne?\nOption:\nA. Haste.\nB. Flash.\nC. Exhaust.\nD. Teleport.\nAnswer with the option's letter from the given choices directly.",
1300,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "434-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1301,
"target": "D",
"doc": {
"video_id": "434",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=0FM64MrRuZE",
"videoID": "0FM64MrRuZE",
"question_id": "434-3",
"task_type": "Object Recognition",
"question": "Which position is Vayne in this video?",
"options": [
"A. Jungle.",
"B. Support.",
"C. Mid.",
"D. Top."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which position is Vayne in this video?\nOption:\nA. Jungle.\nB. Support.\nC. Mid.\nD. Top.\nAnswer with the option's letter from the given choices directly.",
1301,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "434-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1302,
"target": "B",
"doc": {
"video_id": "435",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Rh8sz6ZnXM4",
"videoID": "Rh8sz6ZnXM4",
"question_id": "435-1",
"task_type": "Spatial Perception",
"question": "Where is the location of the race as shown in the video?",
"options": [
"A. Desert.",
"B. Mountain.",
"C. Sea side.",
"D. City."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the location of the race as shown in the video?\nOption:\nA. Desert.\nB. Mountain.\nC. Sea side.\nD. City.\nAnswer with the option's letter from the given choices directly.",
1302,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "435-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1303,
"target": "D",
"doc": {
"video_id": "435",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Rh8sz6ZnXM4",
"videoID": "Rh8sz6ZnXM4",
"question_id": "435-2",
"task_type": "Object Recognition",
"question": "At the beginning, what is the player's rank?",
"options": [
"A. Third.",
"B. First.",
"C. Second.",
"D. Last."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning, what is the player's rank?\nOption:\nA. Third.\nB. First.\nC. Second.\nD. Last.\nAnswer with the option's letter from the given choices directly.",
1303,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "435-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1304,
"target": "B",
"doc": {
"video_id": "435",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=Rh8sz6ZnXM4",
"videoID": "Rh8sz6ZnXM4",
"question_id": "435-3",
"task_type": "Object Recognition",
"question": "What rank is the player at the end?",
"options": [
"A. Last.",
"B. First.",
"C. Second.",
"D. Third."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What rank is the player at the end?\nOption:\nA. Last.\nB. First.\nC. Second.\nD. Third.\nAnswer with the option's letter from the given choices directly.",
1304,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "435-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1305,
"target": "C",
"doc": {
"video_id": "436",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=erjwCQ-UZyw",
"videoID": "erjwCQ-UZyw",
"question_id": "436-1",
"task_type": "Object Recognition",
"question": "What can be seen in the sky during the race?",
"options": [
"A. Stars.",
"B. Birds.",
"C. Planes.",
"D. Rockets."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be seen in the sky during the race?\nOption:\nA. Stars.\nB. Birds.\nC. Planes.\nD. Rockets.\nAnswer with the option's letter from the given choices directly.",
1305,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "436-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1306,
"target": "D",
"doc": {
"video_id": "436",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=erjwCQ-UZyw",
"videoID": "erjwCQ-UZyw",
"question_id": "436-2",
"task_type": "Counting Problem",
"question": "How many cars does player use?",
"options": [
"A. 1.",
"B. 3.",
"C. 4.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cars does player use?\nOption:\nA. 1.\nB. 3.\nC. 4.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1306,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "436-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1307,
"target": "B",
"doc": {
"video_id": "436",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=erjwCQ-UZyw",
"videoID": "erjwCQ-UZyw",
"question_id": "436-3",
"task_type": "Spatial Perception",
"question": "What is the weather like?",
"options": [
"A. Rainy.",
"B. Sunny.",
"C. Snow.",
"D. Gloomy."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the weather like?\nOption:\nA. Rainy.\nB. Sunny.\nC. Snow.\nD. Gloomy.\nAnswer with the option's letter from the given choices directly.",
1307,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "436-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1308,
"target": "C",
"doc": {
"video_id": "437",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=H54zMD-9Q-8",
"videoID": "H54zMD-9Q-8",
"question_id": "437-1",
"task_type": "Object Recognition",
"question": "What is in the sky?",
"options": [
"A. Stars.",
"B. Birds.",
"C. Balloons.",
"D. Rockets."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is in the sky?\nOption:\nA. Stars.\nB. Birds.\nC. Balloons.\nD. Rockets.\nAnswer with the option's letter from the given choices directly.",
1308,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "437-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1309,
"target": "C",
"doc": {
"video_id": "437",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=H54zMD-9Q-8",
"videoID": "H54zMD-9Q-8",
"question_id": "437-2",
"task_type": "Object Recognition",
"question": "Which brand of car does the player use?",
"options": [
"A. Aston Martin.",
"B. Toyota.",
"C. Nissan.",
"D. Mercedes Benz."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which brand of car does the player use?\nOption:\nA. Aston Martin.\nB. Toyota.\nC. Nissan.\nD. Mercedes Benz.\nAnswer with the option's letter from the given choices directly.",
1309,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "437-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1310,
"target": "A",
"doc": {
"video_id": "437",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=H54zMD-9Q-8",
"videoID": "H54zMD-9Q-8",
"question_id": "437-3",
"task_type": "Counting Problem",
"question": "How many taillights does the player's car have?",
"options": [
"A. 4.",
"B. 2.",
"C. 6.",
"D. 5."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many taillights does the player's car have?\nOption:\nA. 4.\nB. 2.\nC. 6.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1310,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "437-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1311,
"target": "D",
"doc": {
"video_id": "438",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=UIvP4xyw9H8",
"videoID": "UIvP4xyw9H8",
"question_id": "438-1",
"task_type": "Counting Problem",
"question": "How many grenades does the player throw?",
"options": [
"A. 8.",
"B. 5.",
"C. 6.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many grenades does the player throw?\nOption:\nA. 8.\nB. 5.\nC. 6.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1311,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "438-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1312,
"target": "B",
"doc": {
"video_id": "438",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=UIvP4xyw9H8",
"videoID": "UIvP4xyw9H8",
"question_id": "438-2",
"task_type": "Counting Problem",
"question": "How many times does the play die?",
"options": [
"A. 5.",
"B. 6.",
"C. 7.",
"D. 8."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the play die?\nOption:\nA. 5.\nB. 6.\nC. 7.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1312,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "438-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1313,
"target": "A",
"doc": {
"video_id": "438",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=UIvP4xyw9H8",
"videoID": "UIvP4xyw9H8",
"question_id": "438-3",
"task_type": "Attribute Perception",
"question": "What color is the player's sniper rifle?",
"options": [
"A. Green.",
"B. Brown.",
"C. Black.",
"D. Yellow."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the player's sniper rifle?\nOption:\nA. Green.\nB. Brown.\nC. Black.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
1313,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "438-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1314,
"target": "B",
"doc": {
"video_id": "439",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=kGR4r4Av1ig",
"videoID": "kGR4r4Av1ig",
"question_id": "439-1",
"task_type": "Counting Problem",
"question": "How many sports does the player play?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many sports does the player play?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1314,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "439-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1315,
"target": "C",
"doc": {
"video_id": "439",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=kGR4r4Av1ig",
"videoID": "kGR4r4Av1ig",
"question_id": "439-2",
"task_type": "Counting Problem",
"question": "How many times does the player crash?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the player crash?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1315,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "439-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1316,
"target": "D",
"doc": {
"video_id": "439",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=kGR4r4Av1ig",
"videoID": "kGR4r4Av1ig",
"question_id": "439-3",
"task_type": "Attribute Perception",
"question": "What color is the hat of player's figure?",
"options": [
"A. Yellow.",
"B. Brown.",
"C. Green.",
"D. Blue."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the hat of player's figure?\nOption:\nA. Yellow.\nB. Brown.\nC. Green.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
1316,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "439-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1317,
"target": "A",
"doc": {
"video_id": "440",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=IN8t1d3pRMY",
"videoID": "IN8t1d3pRMY",
"question_id": "440-1",
"task_type": "Counting Problem",
"question": "How many players hold a weapon in their left hand?",
"options": [
"A. 6.",
"B. 5.",
"C. 7.",
"D. 8."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many players hold a weapon in their left hand?\nOption:\nA. 6.\nB. 5.\nC. 7.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1317,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "440-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1318,
"target": "A",
"doc": {
"video_id": "440",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=IN8t1d3pRMY",
"videoID": "IN8t1d3pRMY",
"question_id": "440-2",
"task_type": "Counting Problem",
"question": "How many players jump from a building and die?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many players jump from a building and die?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1318,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "440-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1319,
"target": "A",
"doc": {
"video_id": "440",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=IN8t1d3pRMY",
"videoID": "IN8t1d3pRMY",
"question_id": "440-3",
"task_type": "Counting Problem",
"question": "What style is the video?",
"options": [
"A. Funny clip.",
"B. Tutorial.",
"C. Vlog.",
"D. Match video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What style is the video?\nOption:\nA. Funny clip.\nB. Tutorial.\nC. Vlog.\nD. Match video.\nAnswer with the option's letter from the given choices directly.",
1319,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "440-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1320,
"target": "D",
"doc": {
"video_id": "441",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=QnGwAUTM1fY",
"videoID": "QnGwAUTM1fY",
"question_id": "441-1",
"task_type": "Object Recognition",
"question": "Who hit the third ball in the video?",
"options": [
"A. Player in red shirts and pants.",
"B. Player wearing white shirts and pants.",
"C. Player wearing white shirts and black pants.",
"D. Player in black shirts and pants."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who hit the third ball in the video?\nOption:\nA. Player in red shirts and pants.\nB. Player wearing white shirts and pants.\nC. Player wearing white shirts and black pants.\nD. Player in black shirts and pants.\nAnswer with the option's letter from the given choices directly.",
1320,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "441-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1321,
"target": "B",
"doc": {
"video_id": "441",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=QnGwAUTM1fY",
"videoID": "QnGwAUTM1fY",
"question_id": "441-2",
"task_type": "Action Reasoning",
"question": "What does the narrator's \"1, 2, 3\" stand for when the player in the white undershirt is attacking in the video?",
"options": [
"A. Time limit.",
"B. Limit of dribbles.",
"C. Number of players on the field.",
"D. Cannot be inferred from the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the narrator's \"1, 2, 3\" stand for when the player in the white undershirt is attacking in the video?\nOption:\nA. Time limit.\nB. Limit of dribbles.\nC. Number of players on the field.\nD. Cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
1321,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "441-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1322,
"target": "C",
"doc": {
"video_id": "441",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=QnGwAUTM1fY",
"videoID": "QnGwAUTM1fY",
"question_id": "441-3",
"task_type": "Counting Problem",
"question": "How many players completed dunks in the video?",
"options": [
"A. 0.",
"B. 1.",
"C. 2.",
"D. 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many players completed dunks in the video?\nOption:\nA. 0.\nB. 1.\nC. 2.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
1322,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "441-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1323,
"target": "A",
"doc": {
"video_id": "442",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=kjwjDjQupLc",
"videoID": "kjwjDjQupLc",
"question_id": "442-1",
"task_type": "Action Reasoning",
"question": "In what way is the penultimate goal in the video accomplished?",
"options": [
"A. Layup.",
"B. Dunk.",
"C. Three-point shot.",
"D. Free shot."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what way is the penultimate goal in the video accomplished?\nOption:\nA. Layup.\nB. Dunk.\nC. Three-point shot.\nD. Free shot.\nAnswer with the option's letter from the given choices directly.",
1323,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "442-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1324,
"target": "C",
"doc": {
"video_id": "442",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=kjwjDjQupLc",
"videoID": "kjwjDjQupLc",
"question_id": "442-2",
"task_type": "Counting Problem",
"question": "How many times has LOUISVILLE been ahead in the video?",
"options": [
"A. 2.",
"B. 0.",
"C. 1.",
"D. 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times has LOUISVILLE been ahead in the video?\nOption:\nA. 2.\nB. 0.\nC. 1.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
1324,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "442-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1325,
"target": "B",
"doc": {
"video_id": "442",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=kjwjDjQupLc",
"videoID": "kjwjDjQupLc",
"question_id": "442-3",
"task_type": "OCR Problems",
"question": "What is the score at the end of the half?",
"options": [
"A. 38 - 31.",
"B. 38 - 34.",
"C. 67 - 61.",
"D. 67 - 60."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the score at the end of the half?\nOption:\nA. 38 - 31.\nB. 38 - 34.\nC. 67 - 61.\nD. 67 - 60.\nAnswer with the option's letter from the given choices directly.",
1325,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "442-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1326,
"target": "D",
"doc": {
"video_id": "443",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=cPlpZdGLa5A",
"videoID": "cPlpZdGLa5A",
"question_id": "443-1",
"task_type": "Counting Problem",
"question": "How many attempts do it take for Phonzy to hit a 3-pointer in the first game?",
"options": [
"A. 2.",
"B. 4.",
"C. 3.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many attempts do it take for Phonzy to hit a 3-pointer in the first game?\nOption:\nA. 2.\nB. 4.\nC. 3.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1326,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "443-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1327,
"target": "D",
"doc": {
"video_id": "443",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=cPlpZdGLa5A",
"videoID": "cPlpZdGLa5A",
"question_id": "443-2",
"task_type": "Counting Problem",
"question": "How many free throws does Jamal hit in game 2?",
"options": [
"A. 7.",
"B. 8.",
"C. 10.",
"D. 9."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many free throws does Jamal hit in game 2?\nOption:\nA. 7.\nB. 8.\nC. 10.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
1327,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "443-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1328,
"target": "B",
"doc": {
"video_id": "443",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=cPlpZdGLa5A",
"videoID": "cPlpZdGLa5A",
"question_id": "443-3",
"task_type": "Action Reasoning",
"question": "Why is there no winner in Game 3?",
"options": [
"A. Neither of them completed their assigned basketball moves.",
"B. They all complete their assigned basketball moves.",
"C. Game 3 isn't decisive.",
"D. Game 3 is poorly designed."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is there no winner in Game 3?\nOption:\nA. Neither of them completed their assigned basketball moves.\nB. They all complete their assigned basketball moves.\nC. Game 3 isn't decisive.\nD. Game 3 is poorly designed.\nAnswer with the option's letter from the given choices directly.",
1328,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "443-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1329,
"target": "C",
"doc": {
"video_id": "444",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=EavQ0LtEICY",
"videoID": "EavQ0LtEICY",
"question_id": "444-1",
"task_type": "Action Reasoning",
"question": "Why do they replace the other side of the layup in the second game?",
"options": [
"A. Because of the long layups, their right hands are too tired.",
"B. They are left-handed will.",
"C. Layups from the right side are too easy.",
"D. This is the rule of the game, which requires a right side layup before finishing with a left side layup."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do they replace the other side of the layup in the second game?\nOption:\nA. Because of the long layups, their right hands are too tired.\nB. They are left-handed will.\nC. Layups from the right side are too easy.\nD. This is the rule of the game, which requires a right side layup before finishing with a left side layup.\nAnswer with the option's letter from the given choices directly.",
1329,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "444-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1330,
"target": "A",
"doc": {
"video_id": "444",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=EavQ0LtEICY",
"videoID": "EavQ0LtEICY",
"question_id": "444-2",
"task_type": "Action Recognition",
"question": "What are the conditions under which an offensive player can continue to attack when he scores a goal against someone else's defense in the fourth game?",
"options": [
"A. Hit three free throws in a row.",
"B. The shot is a three-pointer.",
"C. Breaking through two defenders in a row.",
"D. Getting the referee's approval."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the conditions under which an offensive player can continue to attack when he scores a goal against someone else's defense in the fourth game?\nOption:\nA. Hit three free throws in a row.\nB. The shot is a three-pointer.\nC. Breaking through two defenders in a row.\nD. Getting the referee's approval.\nAnswer with the option's letter from the given choices directly.",
1330,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "444-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1331,
"target": "C",
"doc": {
"video_id": "444",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=EavQ0LtEICY",
"videoID": "EavQ0LtEICY",
"question_id": "444-3",
"task_type": "Attribute Perception",
"question": "Who wins the game in the last match?",
"options": [
"A. Players in white shirts and black shorts.",
"B. Player #2.",
"C. Players in black shirts and white shorts.",
"D. No winner."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who wins the game in the last match?\nOption:\nA. Players in white shirts and black shorts.\nB. Player #2.\nC. Players in black shirts and white shorts.\nD. No winner.\nAnswer with the option's letter from the given choices directly.",
1331,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "444-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1332,
"target": "B",
"doc": {
"video_id": "445",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=0Y9-MQ44MdU",
"videoID": "0Y9-MQ44MdU",
"question_id": "445-1",
"task_type": "Object Reasoning",
"question": "Extrapolating from the video, what team is the commentator on?",
"options": [
"A. Team 1.",
"B. Team 2.",
"C. He's not on either team.",
"D. Cannot be inferred from the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Extrapolating from the video, what team is the commentator on?\nOption:\nA. Team 1.\nB. Team 2.\nC. He's not on either team.\nD. Cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
1332,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "445-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1333,
"target": "A",
"doc": {
"video_id": "445",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=0Y9-MQ44MdU",
"videoID": "0Y9-MQ44MdU",
"question_id": "445-2",
"task_type": "Attribute Perception",
"question": "What is the largest run differential in the first inning of the game in the video?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the largest run differential in the first inning of the game in the video?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1333,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "445-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1334,
"target": "D",
"doc": {
"video_id": "445",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=0Y9-MQ44MdU",
"videoID": "0Y9-MQ44MdU",
"question_id": "445-3",
"task_type": "Counting Problem",
"question": "How many points does a three-pointer count in the video?",
"options": [
"A. 4.",
"B. 3.",
"C. 1.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many points does a three-pointer count in the video?\nOption:\nA. 4.\nB. 3.\nC. 1.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1334,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "445-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1335,
"target": "C",
"doc": {
"video_id": "446",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=ACvLYz9nHvg",
"videoID": "ACvLYz9nHvg",
"question_id": "446-1",
"task_type": "Counting Problem",
"question": "How many consecutive hits did player #39 in the video have in his first start?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many consecutive hits did player #39 in the video have in his first start?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1335,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "446-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1336,
"target": "B",
"doc": {
"video_id": "446",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=ACvLYz9nHvg",
"videoID": "ACvLYz9nHvg",
"question_id": "446-2",
"task_type": "Object Recognition",
"question": "What country's practice game is this?",
"options": [
"A. UK.",
"B. USA.",
"C. Canada.",
"D. Australia."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What country's practice game is this?\nOption:\nA. UK.\nB. USA.\nC. Canada.\nD. Australia.\nAnswer with the option's letter from the given choices directly.",
1336,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "446-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1337,
"target": "A",
"doc": {
"video_id": "446",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=ACvLYz9nHvg",
"videoID": "ACvLYz9nHvg",
"question_id": "446-3",
"task_type": "Action Recognition",
"question": "What are most of the technical moves completed by the players in the video?",
"options": [
"A. Mid-range shot.",
"B. Three-point shot.",
"C. Dunk.",
"D. Layup."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are most of the technical moves completed by the players in the video?\nOption:\nA. Mid-range shot.\nB. Three-point shot.\nC. Dunk.\nD. Layup.\nAnswer with the option's letter from the given choices directly.",
1337,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "446-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1338,
"target": "D",
"doc": {
"video_id": "447",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=2NOcM1HoRRg",
"videoID": "2NOcM1HoRRg",
"question_id": "447-1",
"task_type": "Action Recognition",
"question": "What are the two doing in the video?",
"options": [
"A. Commentating on a basketball game.",
"B. Shooting basketball commercials.",
"C. Watch a basketball game.",
"D. Have a basketball video game."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the two doing in the video?\nOption:\nA. Commentating on a basketball game.\nB. Shooting basketball commercials.\nC. Watch a basketball game.\nD. Have a basketball video game.\nAnswer with the option's letter from the given choices directly.",
1338,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "447-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1339,
"target": "C",
"doc": {
"video_id": "447",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=2NOcM1HoRRg",
"videoID": "2NOcM1HoRRg",
"question_id": "447-2",
"task_type": "OCR Problems",
"question": "What is the team's LAC hitting percentage when the score is 38-29?",
"options": [
"A. 48%.",
"B. 41%.",
"C. 56%.",
"D. 54%."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the team's LAC hitting percentage when the score is 38-29?\nOption:\nA. 48%.\nB. 41%.\nC. 56%.\nD. 54%.\nAnswer with the option's letter from the given choices directly.",
1339,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "447-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1340,
"target": "D",
"doc": {
"video_id": "447",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=2NOcM1HoRRg",
"videoID": "2NOcM1HoRRg",
"question_id": "447-3",
"task_type": "OCR Problems",
"question": "LAL completed the comeback with how much time left in the third quarter?",
"options": [
"A. 9.8s.",
"B. 5min32s.",
"C. 4min44s.",
"D. 6.3s."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: LAL completed the comeback with how much time left in the third quarter?\nOption:\nA. 9.8s.\nB. 5min32s.\nC. 4min44s.\nD. 6.3s.\nAnswer with the option's letter from the given choices directly.",
1340,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "447-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1341,
"target": "A",
"doc": {
"video_id": "448",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=YPvxuoNLgPE",
"videoID": "YPvxuoNLgPE",
"question_id": "448-1",
"task_type": "Attribute Perception",
"question": "In the first game, what color does the second losing player represent?",
"options": [
"A. Blue.",
"B. Green.",
"C. Red.",
"D. Yellow."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the first game, what color does the second losing player represent?\nOption:\nA. Blue.\nB. Green.\nC. Red.\nD. Yellow.\nAnswer with the option's letter from the given choices directly.",
1341,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "448-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1342,
"target": "B",
"doc": {
"video_id": "448",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=YPvxuoNLgPE",
"videoID": "YPvxuoNLgPE",
"question_id": "448-2",
"task_type": "Action Reasoning",
"question": "In the second game, why is the yellow player docked a point when he gets 12?",
"options": [
"A. Because he stepped into the three-point line.",
"B. Because he used a green basketball.",
"C. Because the referee made a mistake.",
"D. Because he's way ahead on points."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the second game, why is the yellow player docked a point when he gets 12?\nOption:\nA. Because he stepped into the three-point line.\nB. Because he used a green basketball.\nC. Because the referee made a mistake.\nD. Because he's way ahead on points.\nAnswer with the option's letter from the given choices directly.",
1342,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "448-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1343,
"target": "C",
"doc": {
"video_id": "448",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=YPvxuoNLgPE",
"videoID": "YPvxuoNLgPE",
"question_id": "448-3",
"task_type": "Temporal Reasoning",
"question": "In the fourth game, what is the order of appearance for the colors represented by the players?",
"options": [
"A. Yellow, blue, red, green.",
"B. Red, green, yellow, blue.",
"C. Green, yellow, blue, red.",
"D. Blue, red, green, yellow."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the fourth game, what is the order of appearance for the colors represented by the players?\nOption:\nA. Yellow, blue, red, green.\nB. Red, green, yellow, blue.\nC. Green, yellow, blue, red.\nD. Blue, red, green, yellow.\nAnswer with the option's letter from the given choices directly.",
1343,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "448-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1344,
"target": "C",
"doc": {
"video_id": "449",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=uKIk-QbbcO8",
"videoID": "uKIk-QbbcO8",
"question_id": "449-1",
"task_type": "Attribute Perception",
"question": "In the first game, when the challenger scores for the first time, how many points has Bugs & Daffy scored?",
"options": [
"A. 4.",
"B. 2.",
"C. 7.",
"D. 0."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the first game, when the challenger scores for the first time, how many points has Bugs & Daffy scored?\nOption:\nA. 4.\nB. 2.\nC. 7.\nD. 0.\nAnswer with the option's letter from the given choices directly.",
1344,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "449-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1345,
"target": "B",
"doc": {
"video_id": "449",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=uKIk-QbbcO8",
"videoID": "uKIk-QbbcO8",
"question_id": "449-2",
"task_type": "Counting Problem",
"question": "How many games are played in the video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many games are played in the video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1345,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "449-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1346,
"target": "D",
"doc": {
"video_id": "449",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=uKIk-QbbcO8",
"videoID": "uKIk-QbbcO8",
"question_id": "449-3",
"task_type": "Counting Problem",
"question": "Who gets the most points in the video?",
"options": [
"A. Challenger in Gray.",
"B. Daffy.",
"C. Challenger in Black.",
"D. Bugs."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who gets the most points in the video?\nOption:\nA. Challenger in Gray.\nB. Daffy.\nC. Challenger in Black.\nD. Bugs.\nAnswer with the option's letter from the given choices directly.",
1346,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "449-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1347,
"target": "A",
"doc": {
"video_id": "450",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=7E6i3E-fsj4",
"videoID": "7E6i3E-fsj4",
"question_id": "450-1",
"task_type": "Counting Problem",
"question": "How many fouls are blown on players in black jerseys in the first game?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many fouls are blown on players in black jerseys in the first game?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1347,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "450-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1348,
"target": "C",
"doc": {
"video_id": "450",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=7E6i3E-fsj4",
"videoID": "7E6i3E-fsj4",
"question_id": "450-2",
"task_type": "Counting Problem",
"question": "Who hit more threes in Game 1?",
"options": [
"A. Players in white.",
"B. Players in black.",
"C. Same.",
"D. Cannot be inferred from the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who hit more threes in Game 1?\nOption:\nA. Players in white.\nB. Players in black.\nC. Same.\nD. Cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
1348,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "450-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1349,
"target": "B",
"doc": {
"video_id": "450",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=7E6i3E-fsj4",
"videoID": "7E6i3E-fsj4",
"question_id": "450-3",
"task_type": "Object Reasoning",
"question": "Who gained the ultimate victory?",
"options": [
"A. Players in white.",
"B. The game is still open and there's no way to know.",
"C. Players in black.",
"D. Can't be deduced."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who gained the ultimate victory?\nOption:\nA. Players in white.\nB. The game is still open and there's no way to know.\nC. Players in black.\nD. Can't be deduced.\nAnswer with the option's letter from the given choices directly.",
1349,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "450-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1350,
"target": "A",
"doc": {
"video_id": "451",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=Mxkg3qLIPC8",
"videoID": "Mxkg3qLIPC8",
"question_id": "451-1",
"task_type": "Object Recognition",
"question": "Which significant football event is depicted in the video?",
"options": [
"A. FIFA World Cup 2022 Final.",
"B. 2023 UEFA Champions League final.",
"C. FIFA World Cup 2018 Final.",
"D. 2018 UEFA European Championship Final."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which significant football event is depicted in the video?\nOption:\nA. FIFA World Cup 2022 Final.\nB. 2023 UEFA Champions League final.\nC. FIFA World Cup 2018 Final.\nD. 2018 UEFA European Championship Final.\nAnswer with the option's letter from the given choices directly.",
1350,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "451-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1351,
"target": "B",
"doc": {
"video_id": "451",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=Mxkg3qLIPC8",
"videoID": "Mxkg3qLIPC8",
"question_id": "451-2",
"task_type": "Temporal Perception",
"question": "What happens next when the score is 2-1?",
"options": [
"A. The match will proceed without any significant events.",
"B. Kylian Mbappé will score again shortly after.",
"C. Argentina will increase their lead with another goal.",
"D. The game will end with Argentina winning in regular time."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens next when the score is 2-1?\nOption:\nA. The match will proceed without any significant events.\nB. Kylian Mbappé will score again shortly after.\nC. Argentina will increase their lead with another goal.\nD. The game will end with Argentina winning in regular time.\nAnswer with the option's letter from the given choices directly.",
1351,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "451-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1352,
"target": "C",
"doc": {
"video_id": "451",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=Mxkg3qLIPC8",
"videoID": "Mxkg3qLIPC8",
"question_id": "451-3",
"task_type": "Action Recognition",
"question": "How did the winning team in the video win the FIFA World Cup final?",
"options": [
"A. By outscoring their opponent in regular time.",
"B. Through a golden goal during extra time.",
"C. With a decisive triumph in a penalty shootout.",
"D. By default following a forfeit from France."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the winning team in the video win the FIFA World Cup final?\nOption:\nA. By outscoring their opponent in regular time.\nB. Through a golden goal during extra time.\nC. With a decisive triumph in a penalty shootout.\nD. By default following a forfeit from France.\nAnswer with the option's letter from the given choices directly.",
1352,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "451-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1353,
"target": "C",
"doc": {
"video_id": "452",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=Xim8-lUZnsA",
"videoID": "Xim8-lUZnsA",
"question_id": "452-1",
"task_type": "Information Synopsis",
"question": "What is the topic of this video?",
"options": [
"A. Ranking the top 15 football players for beginners to watch and learn from.",
"B. Discussing the 15 most complex football tactics used in professional matches.",
"C. Demonstrating the 15 easiest and most effective skill moves for beginner football players, ordered by difficulty.",
"D. Analyzing the 15 worst mistakes beginners make in football and how to avoid them."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of this video?\nOption:\nA. Ranking the top 15 football players for beginners to watch and learn from.\nB. Discussing the 15 most complex football tactics used in professional matches.\nC. Demonstrating the 15 easiest and most effective skill moves for beginner football players, ordered by difficulty.\nD. Analyzing the 15 worst mistakes beginners make in football and how to avoid them.\nAnswer with the option's letter from the given choices directly.",
1353,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "452-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1354,
"target": "B",
"doc": {
"video_id": "452",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=Xim8-lUZnsA",
"videoID": "Xim8-lUZnsA",
"question_id": "452-2",
"task_type": "Action Reasoning",
"question": "If you want to confuse your opponent to get rid of him, which move should you use?",
"options": [
"A. No.2 skill.",
"B. No.6 skill.",
"C. No.8 skill.",
"D. No.15 skill."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: If you want to confuse your opponent to get rid of him, which move should you use?\nOption:\nA. No.2 skill.\nB. No.6 skill.\nC. No.8 skill.\nD. No.15 skill.\nAnswer with the option's letter from the given choices directly.",
1354,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "452-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1355,
"target": "D",
"doc": {
"video_id": "452",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=Xim8-lUZnsA",
"videoID": "Xim8-lUZnsA",
"question_id": "452-3",
"task_type": "Counting Problem",
"question": "In which skill are there only three people on the field?",
"options": [
"A. No.10 skill.",
"B. No.7 skill.",
"C. No.13 skill.",
"D. No.3 skill."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which skill are there only three people on the field?\nOption:\nA. No.10 skill.\nB. No.7 skill.\nC. No.13 skill.\nD. No.3 skill.\nAnswer with the option's letter from the given choices directly.",
1355,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "452-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1356,
"target": "A",
"doc": {
"video_id": "453",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=PYOyMK2Jw7g",
"videoID": "PYOyMK2Jw7g",
"question_id": "453-1",
"task_type": "OCR Problems",
"question": "Which competition is being referred to in this video?",
"options": [
"A. Emirates FA Cup.",
"B. Copa del Rey.",
"C. Copa Italia.",
"D. Carabao Cup."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which competition is being referred to in this video?\nOption:\nA. Emirates FA Cup.\nB. Copa del Rey.\nC. Copa Italia.\nD. Carabao Cup.\nAnswer with the option's letter from the given choices directly.",
1356,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "453-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1357,
"target": "B",
"doc": {
"video_id": "453",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=PYOyMK2Jw7g",
"videoID": "PYOyMK2Jw7g",
"question_id": "453-2",
"task_type": "Object Recognition",
"question": "Who scored the third goal in the video?",
"options": [
"A. The number 9 in the white jersey.",
"B. The number 10 in the white jersey.",
"C. The number 10 in the red jersey.",
"D. The number 25 in the white jersey."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who scored the third goal in the video?\nOption:\nA. The number 9 in the white jersey.\nB. The number 10 in the white jersey.\nC. The number 10 in the red jersey.\nD. The number 25 in the white jersey.\nAnswer with the option's letter from the given choices directly.",
1357,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "453-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1358,
"target": "A",
"doc": {
"video_id": "453",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=PYOyMK2Jw7g",
"videoID": "PYOyMK2Jw7g",
"question_id": "453-3",
"task_type": "OCR Problems",
"question": "What is the result according to this video?",
"options": [
"A. Red team United won 4-3 after extra time.",
"B. Red team United won 3-2 in regular time.",
"C. White team won 4-3 after extra time.",
"D. White team won 3-2 after extra time."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the result according to this video?\nOption:\nA. Red team United won 4-3 after extra time.\nB. Red team United won 3-2 in regular time.\nC. White team won 4-3 after extra time.\nD. White team won 3-2 after extra time.\nAnswer with the option's letter from the given choices directly.",
1358,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "453-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1359,
"target": "A",
"doc": {
"video_id": "454",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=_8lBR0E_Tx8",
"videoID": "_8lBR0E_Tx8",
"question_id": "454-1",
"task_type": "Information Synopsis",
"question": "What is the theme of the video?",
"options": [
"A. Best goals of the 2023 FIFA Women's World Cup.",
"B. Best goals of the 2023 FIFA Men's World Cup.",
"C. Men's football team in-play tactical drills and scrimmages.",
"D. Women's international friendly and training matches."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the theme of the video?\nOption:\nA. Best goals of the 2023 FIFA Women's World Cup.\nB. Best goals of the 2023 FIFA Men's World Cup.\nC. Men's football team in-play tactical drills and scrimmages.\nD. Women's international friendly and training matches.\nAnswer with the option's letter from the given choices directly.",
1359,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "454-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1360,
"target": "C",
"doc": {
"video_id": "454",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=_8lBR0E_Tx8",
"videoID": "_8lBR0E_Tx8",
"question_id": "454-2",
"task_type": "Object Recognition",
"question": "Which player scored the 7th goal?",
"options": [
"A. Vietnam player number 7.",
"B. Netherlands player number 1.",
"C. Netherlands player number 22.",
"D. Vietnam player number 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player scored the 7th goal?\nOption:\nA. Vietnam player number 7.\nB. Netherlands player number 1.\nC. Netherlands player number 22.\nD. Vietnam player number 3.\nAnswer with the option's letter from the given choices directly.",
1360,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "454-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1361,
"target": "D",
"doc": {
"video_id": "454",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=_8lBR0E_Tx8",
"videoID": "_8lBR0E_Tx8",
"question_id": "454-3",
"task_type": "Action Recognition",
"question": "How does Brazil score their 5th goal?",
"options": [
"A. Number 5 heads the ball into the net from a cross.",
"B. Number 10 dribbles past the goalkeeper for a solo goal.",
"C. Number 9 scores directly from a corner kick.",
"D. Number 17 passes the ball back to number 16 for the goal."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Brazil score their 5th goal?\nOption:\nA. Number 5 heads the ball into the net from a cross.\nB. Number 10 dribbles past the goalkeeper for a solo goal.\nC. Number 9 scores directly from a corner kick.\nD. Number 17 passes the ball back to number 16 for the goal.\nAnswer with the option's letter from the given choices directly.",
1361,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "454-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1362,
"target": "D",
"doc": {
"video_id": "455",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=KlbDo9o5SC8",
"videoID": "KlbDo9o5SC8",
"question_id": "455-1",
"task_type": "Action Recognition",
"question": "What are the players doing before the match formally begins, according to the video?",
"options": [
"A. Exchanging team flags.",
"B. Participating in the coin toss.",
"C. Warming up on the field.",
"D. Singing their national anthems."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the players doing before the match formally begins, according to the video?\nOption:\nA. Exchanging team flags.\nB. Participating in the coin toss.\nC. Warming up on the field.\nD. Singing their national anthems.\nAnswer with the option's letter from the given choices directly.",
1362,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "455-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1363,
"target": "A",
"doc": {
"video_id": "455",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=KlbDo9o5SC8",
"videoID": "KlbDo9o5SC8",
"question_id": "455-2",
"task_type": "Action Recognition",
"question": "In the match, which player commits the first foul?",
"options": [
"A. Thailand player number 21.",
"B. Thailand player number 20.",
"C. USA player number 5.",
"D. USA player number 11."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the match, which player commits the first foul?\nOption:\nA. Thailand player number 21.\nB. Thailand player number 20.\nC. USA player number 5.\nD. USA player number 11.\nAnswer with the option's letter from the given choices directly.",
1363,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "455-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1364,
"target": "D",
"doc": {
"video_id": "455",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=KlbDo9o5SC8",
"videoID": "KlbDo9o5SC8",
"question_id": "455-3",
"task_type": "Object Reasoning",
"question": "What do the match statistics in the bottom left corner of the video indicate at the match time 89:17?",
"options": [
"A. The game was postponed due to inclement weather, resulting in skewed statistics.",
"B. The game was evenly matched in terms of possession and shots taken.",
"C. Thailand was leading the USA with a higher number of passes and possession.",
"D. The USA dominated the match with a significant lead in goals, possession, passes, and shots."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the match statistics in the bottom left corner of the video indicate at the match time 89:17?\nOption:\nA. The game was postponed due to inclement weather, resulting in skewed statistics.\nB. The game was evenly matched in terms of possession and shots taken.\nC. Thailand was leading the USA with a higher number of passes and possession.\nD. The USA dominated the match with a significant lead in goals, possession, passes, and shots.\nAnswer with the option's letter from the given choices directly.",
1364,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "455-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1365,
"target": "B",
"doc": {
"video_id": "456",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=2Gg4OQo7-zA",
"videoID": "2Gg4OQo7-zA",
"question_id": "456-1",
"task_type": "Information Synopsis",
"question": "What is the video showcasing?",
"options": [
"A. Highlights of a single World Cup final match.",
"B. Highlights of all World Cup finals since 1998.",
"C. A documentary on the history of football.",
"D. The best goals in World Cup history."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video showcasing?\nOption:\nA. Highlights of a single World Cup final match.\nB. Highlights of all World Cup finals since 1998.\nC. A documentary on the history of football.\nD. The best goals in World Cup history.\nAnswer with the option's letter from the given choices directly.",
1365,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "456-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1366,
"target": "A",
"doc": {
"video_id": "456",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=2Gg4OQo7-zA",
"videoID": "2Gg4OQo7-zA",
"question_id": "456-2",
"task_type": "Object Recognition",
"question": "In Germany vs Argentina, Which player scores the first goal?",
"options": [
"A. Germany's player number 19.",
"B. Germany's player number 9.",
"C. Argentina's player number 10.",
"D. Argentina's player number 19."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In Germany vs Argentina, Which player scores the first goal?\nOption:\nA. Germany's player number 19.\nB. Germany's player number 9.\nC. Argentina's player number 10.\nD. Argentina's player number 19.\nAnswer with the option's letter from the given choices directly.",
1366,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "456-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1367,
"target": "D",
"doc": {
"video_id": "456",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=2Gg4OQo7-zA",
"videoID": "2Gg4OQo7-zA",
"question_id": "456-3",
"task_type": "OCR Problems",
"question": "What was the final score set at in the first game?",
"options": [
"A. 3-0.",
"B. 2-1.",
"C. 1-2.",
"D. 0-3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the final score set at in the first game?\nOption:\nA. 3-0.\nB. 2-1.\nC. 1-2.\nD. 0-3.\nAnswer with the option's letter from the given choices directly.",
1367,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "456-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1368,
"target": "B",
"doc": {
"video_id": "457",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=hwloS0JY6Ms",
"videoID": "hwloS0JY6Ms",
"question_id": "457-1",
"task_type": "OCR Problems",
"question": "What is the name of the coach of Manchester City described in the video?",
"options": [
"A. Mark Warburton.",
"B. Josep Guardiola.",
"C. Roberto Mancini.",
"D. Manuel Pellegrini."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the coach of Manchester City described in the video?\nOption:\nA. Mark Warburton.\nB. Josep Guardiola.\nC. Roberto Mancini.\nD. Manuel Pellegrini.\nAnswer with the option's letter from the given choices directly.",
1368,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "457-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1369,
"target": "C",
"doc": {
"video_id": "457",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=hwloS0JY6Ms",
"videoID": "hwloS0JY6Ms",
"question_id": "457-2",
"task_type": "Counting Problem",
"question": "How many goals do the player Number 7 of Tottenham Hotspur score?",
"options": [
"A. 3.",
"B. 1.",
"C. 2.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many goals do the player Number 7 of Tottenham Hotspur score?\nOption:\nA. 3.\nB. 1.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1369,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "457-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1370,
"target": "A",
"doc": {
"video_id": "457",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=hwloS0JY6Ms",
"videoID": "hwloS0JY6Ms",
"question_id": "457-3",
"task_type": "Object Recognition",
"question": "Which player is replaced in the middle of the game?",
"options": [
"A. Fernando Llorente.",
"B. Moussa Sissoko.",
"C. Fernando Sissoko.",
"D. Moussa Llorente."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player is replaced in the middle of the game?\nOption:\nA. Fernando Llorente.\nB. Moussa Sissoko.\nC. Fernando Sissoko.\nD. Moussa Llorente.\nAnswer with the option's letter from the given choices directly.",
1370,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "457-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1371,
"target": "A",
"doc": {
"video_id": "458",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=6ksaVHtfBG8",
"videoID": "6ksaVHtfBG8",
"question_id": "458-1",
"task_type": "OCR Problems",
"question": "Which competition is the video related to?",
"options": [
"A. The AFC Asian Cup 2023.",
"B. The FIFA World Cup 2022.",
"C. The UEFA European Championship 2024.",
"D. The CONCACAF Gold Cup 2023."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which competition is the video related to?\nOption:\nA. The AFC Asian Cup 2023.\nB. The FIFA World Cup 2022.\nC. The UEFA European Championship 2024.\nD. The CONCACAF Gold Cup 2023.\nAnswer with the option's letter from the given choices directly.",
1371,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "458-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1372,
"target": "C",
"doc": {
"video_id": "458",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=6ksaVHtfBG8",
"videoID": "6ksaVHtfBG8",
"question_id": "458-2",
"task_type": "Object Recognition",
"question": "Which player is injured in the middle of the game?",
"options": [
"A. Korea's player number 12.",
"B. Malaysia's player number 10.",
"C. Malaysia's player number 12.",
"D. Korea's player number 9."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player is injured in the middle of the game?\nOption:\nA. Korea's player number 12.\nB. Malaysia's player number 10.\nC. Malaysia's player number 12.\nD. Korea's player number 9.\nAnswer with the option's letter from the given choices directly.",
1372,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "458-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1373,
"target": "B",
"doc": {
"video_id": "458",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=6ksaVHtfBG8",
"videoID": "6ksaVHtfBG8",
"question_id": "458-3",
"task_type": "OCR Problems",
"question": "What is the result of the match?",
"options": [
"A. Korea Republic won 3-0.",
"B. The match ended in a 3-3 draw.",
"C. The match ended in a 2-2 draw.",
"D. Malaysia won 3-3 with a victory in extra time."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the result of the match?\nOption:\nA. Korea Republic won 3-0.\nB. The match ended in a 3-3 draw.\nC. The match ended in a 2-2 draw.\nD. Malaysia won 3-3 with a victory in extra time.\nAnswer with the option's letter from the given choices directly.",
1373,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "458-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1374,
"target": "B",
"doc": {
"video_id": "459",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=bOUSkh5PInA",
"videoID": "bOUSkh5PInA",
"question_id": "459-1",
"task_type": "OCR Problems",
"question": "When was the first goal scored in the game?",
"options": [
"A. About 42:00 in the game.",
"B. About 68:00 in the game.",
"C. About 57:00 in the game.",
"D. About 32:00 in the game."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When was the first goal scored in the game?\nOption:\nA. About 42:00 in the game.\nB. About 68:00 in the game.\nC. About 57:00 in the game.\nD. About 32:00 in the game.\nAnswer with the option's letter from the given choices directly.",
1374,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "459-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1375,
"target": "C",
"doc": {
"video_id": "459",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=bOUSkh5PInA",
"videoID": "bOUSkh5PInA",
"question_id": "459-2",
"task_type": "Object Reasoning",
"question": "Why were the first two goals in this video invalid?",
"options": [
"A. Because the ball went out of play before the goal was scored.",
"B. Because there was a foul committed by the scoring team.",
"C. Because the player who scored was offside.",
"D. Because the goal was scored with the use of a handball."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why were the first two goals in this video invalid?\nOption:\nA. Because the ball went out of play before the goal was scored.\nB. Because there was a foul committed by the scoring team.\nC. Because the player who scored was offside.\nD. Because the goal was scored with the use of a handball.\nAnswer with the option's letter from the given choices directly.",
1375,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "459-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1376,
"target": "C",
"doc": {
"video_id": "459",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=bOUSkh5PInA",
"videoID": "bOUSkh5PInA",
"question_id": "459-3",
"task_type": "OCR Problems",
"question": "What is the result of the match?",
"options": [
"A. Al Ahli vs Al Nassr 2-1.",
"B. Al Ahli vs Al Nassr 1-1.",
"C. Al Ahli vs Al Nassr 0-1.",
"D. Al Ahli vs Al Nassr 1-0."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the result of the match?\nOption:\nA. Al Ahli vs Al Nassr 2-1.\nB. Al Ahli vs Al Nassr 1-1.\nC. Al Ahli vs Al Nassr 0-1.\nD. Al Ahli vs Al Nassr 1-0.\nAnswer with the option's letter from the given choices directly.",
1376,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "459-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1377,
"target": "A",
"doc": {
"video_id": "460",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=rL9JCz7S8Jg",
"videoID": "rL9JCz7S8Jg",
"question_id": "460-1",
"task_type": "Temporal Perception",
"question": "When is the second goal scored in the game?",
"options": [
"A. At 79 minutes into the match.",
"B. In the first half of the match.",
"C. Right after halftime.",
"D. In the closing minutes of the match."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When is the second goal scored in the game?\nOption:\nA. At 79 minutes into the match.\nB. In the first half of the match.\nC. Right after halftime.\nD. In the closing minutes of the match.\nAnswer with the option's letter from the given choices directly.",
1377,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "460-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1378,
"target": "B",
"doc": {
"video_id": "460",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=rL9JCz7S8Jg",
"videoID": "rL9JCz7S8Jg",
"question_id": "460-2",
"task_type": "OCR Problems",
"question": "What is the result of the match?",
"options": [
"A. Brazil won 3-0 against Germany.",
"B. Brazil won 2-0 against Germany.",
"C. Germany won 2-0 against Brazil.",
"D. Germany won 2-1 against Brazil."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the result of the match?\nOption:\nA. Brazil won 3-0 against Germany.\nB. Brazil won 2-0 against Germany.\nC. Germany won 2-0 against Brazil.\nD. Germany won 2-1 against Brazil.\nAnswer with the option's letter from the given choices directly.",
1378,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "460-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1379,
"target": "D",
"doc": {
"video_id": "460",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=rL9JCz7S8Jg",
"videoID": "rL9JCz7S8Jg",
"question_id": "460-3",
"task_type": "Action Recognition",
"question": "What is the performance of Brazil's player number 9 in the match?",
"options": [
"A. He saved a penalty as the team's goalkeeper.",
"B. He provided an assist for the winning goal.",
"C. He received a red card and was sent off.",
"D. He scored two goals and led the team to victory."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the performance of Brazil's player number 9 in the match?\nOption:\nA. He saved a penalty as the team's goalkeeper.\nB. He provided an assist for the winning goal.\nC. He received a red card and was sent off.\nD. He scored two goals and led the team to victory.\nAnswer with the option's letter from the given choices directly.",
1379,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "460-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1380,
"target": "A",
"doc": {
"video_id": "461",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UmIYanq5gH8",
"videoID": "UmIYanq5gH8",
"question_id": "461-1",
"task_type": "OCR Problems",
"question": "In which lane are the Australian athletes competing in the race in the video?",
"options": [
"A. Lane 6.",
"B. Lane 4.",
"C. Lane 5.",
"D. Lane 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which lane are the Australian athletes competing in the race in the video?\nOption:\nA. Lane 6.\nB. Lane 4.\nC. Lane 5.\nD. Lane 1.\nAnswer with the option's letter from the given choices directly.",
1380,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "461-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1381,
"target": "D",
"doc": {
"video_id": "461",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UmIYanq5gH8",
"videoID": "UmIYanq5gH8",
"question_id": "461-2",
"task_type": "Action Recognition",
"question": "What swimming stroke was used for the second leg of the relay race in the video?",
"options": [
"A. Freestyle.",
"B. Butterfly stroke.",
"C. Backstroke.",
"D. Breaststroke."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What swimming stroke was used for the second leg of the relay race in the video?\nOption:\nA. Freestyle.\nB. Butterfly stroke.\nC. Backstroke.\nD. Breaststroke.\nAnswer with the option's letter from the given choices directly.",
1381,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "461-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1382,
"target": "B",
"doc": {
"video_id": "461",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UmIYanq5gH8",
"videoID": "UmIYanq5gH8",
"question_id": "461-3",
"task_type": "Temporal Reasoning",
"question": "What was the order of countries that finished first for each leg of the relay race in the video?",
"options": [
"A. United States, United Kingdom, United Kingdom, United States.",
"B. United States, United Kingdom, United States, United States.",
"C. United Kingdom, United States, United States, United States.",
"D. United States, United Kingdom, United Kingdom, United Kingdom."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the order of countries that finished first for each leg of the relay race in the video?\nOption:\nA. United States, United Kingdom, United Kingdom, United States.\nB. United States, United Kingdom, United States, United States.\nC. United Kingdom, United States, United States, United States.\nD. United States, United Kingdom, United Kingdom, United Kingdom.\nAnswer with the option's letter from the given choices directly.",
1382,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "461-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1383,
"target": "C",
"doc": {
"video_id": "462",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SBssmtYTJpM",
"videoID": "SBssmtYTJpM",
"question_id": "462-1",
"task_type": "Action Recognition",
"question": "What is the first diving event shown in the video?",
"options": [
"A. Men's 3m springboard.",
"B. Women's 10m platform.",
"C. Men's 10m platform.",
"D. Women's 3m springboard."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first diving event shown in the video?\nOption:\nA. Men's 3m springboard.\nB. Women's 10m platform.\nC. Men's 10m platform.\nD. Women's 3m springboard.\nAnswer with the option's letter from the given choices directly.",
1383,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "462-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1384,
"target": "A",
"doc": {
"video_id": "462",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SBssmtYTJpM",
"videoID": "SBssmtYTJpM",
"question_id": "462-2",
"task_type": "OCR Problems",
"question": "Where and when was the diving scene in the video taken?",
"options": [
"A. Tokyo 2020.",
"B. Rio 2016.",
"C. Beijing 2008.",
"D. Lodon 2012."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where and when was the diving scene in the video taken?\nOption:\nA. Tokyo 2020.\nB. Rio 2016.\nC. Beijing 2008.\nD. Lodon 2012.\nAnswer with the option's letter from the given choices directly.",
1384,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "462-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1385,
"target": "D",
"doc": {
"video_id": "462",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SBssmtYTJpM",
"videoID": "SBssmtYTJpM",
"question_id": "462-3",
"task_type": "Action Recognition",
"question": "What is the last set of diving moves shown by the two female athletes in the video?",
"options": [
"A. Diving backwards facing the pool.",
"B. Diving backwards facing the board (table).",
"C. Diving forward facing the pool.",
"D. Diving inward facing the board (table)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the last set of diving moves shown by the two female athletes in the video?\nOption:\nA. Diving backwards facing the pool.\nB. Diving backwards facing the board (table).\nC. Diving forward facing the pool.\nD. Diving inward facing the board (table).\nAnswer with the option's letter from the given choices directly.",
1385,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "462-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1386,
"target": "B",
"doc": {
"video_id": "463",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=u_f0697yLBI",
"videoID": "u_f0697yLBI",
"question_id": "463-1",
"task_type": "OCR Problems",
"question": "What was the winning time of the men's 100-meter race that was shown first in the video?",
"options": [
"A. 9.58s.",
"B. 9.80s.",
"C. 9.63s.",
"D. 10.75s."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the winning time of the men's 100-meter race that was shown first in the video?\nOption:\nA. 9.58s.\nB. 9.80s.\nC. 9.63s.\nD. 10.75s.\nAnswer with the option's letter from the given choices directly.",
1386,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "463-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1387,
"target": "C",
"doc": {
"video_id": "463",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=u_f0697yLBI",
"videoID": "u_f0697yLBI",
"question_id": "463-2",
"task_type": "Temporal Reasoning",
"question": "In what order do the 2012 London men's 100m final, the 2018 Beijing men's 100m final, and the 2012 London women's 100m final appear in the video?",
"options": [
"A. The 2008 men's 100m final in Beijing appeared first, then the 2012 London women's 100m final, and the 2012 London men's 100m final appeared last in the video.",
"B. The 2012 London men's 100m final appears first, then the 2012 London women's 100m final, and the 2008 Beijing men's 100m final appears last in the video.",
"C. The 2012 London women's 100m final appears first, then the 2012 London men's 100m final, and the 2008 Beijing men's 100m final appears last in the video.",
"D. Neither."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order do the 2012 London men's 100m final, the 2018 Beijing men's 100m final, and the 2012 London women's 100m final appear in the video?\nOption:\nA. The 2008 men's 100m final in Beijing appeared first, then the 2012 London women's 100m final, and the 2012 London men's 100m final appeared last in the video.\nB. The 2012 London men's 100m final appears first, then the 2012 London women's 100m final, and the 2008 Beijing men's 100m final appears last in the video.\nC. The 2012 London women's 100m final appears first, then the 2012 London men's 100m final, and the 2008 Beijing men's 100m final appears last in the video.\nD. Neither.\nAnswer with the option's letter from the given choices directly.",
1387,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "463-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1388,
"target": "A",
"doc": {
"video_id": "463",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=u_f0697yLBI",
"videoID": "u_f0697yLBI",
"question_id": "463-3",
"task_type": "Object Recognition",
"question": "In the 100m final shown at the end of the video, which country's athletes reached the finish line first?",
"options": [
"A. Jamaica.",
"B. USA.",
"C. Kenya.",
"D. Ethiopia."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the 100m final shown at the end of the video, which country's athletes reached the finish line first?\nOption:\nA. Jamaica.\nB. USA.\nC. Kenya.\nD. Ethiopia.\nAnswer with the option's letter from the given choices directly.",
1388,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "463-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1389,
"target": "D",
"doc": {
"video_id": "464",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ZQIZx5Oqw88",
"videoID": "ZQIZx5Oqw88",
"question_id": "464-1",
"task_type": "Temporal Reasoning",
"question": "In the introductions of the athletes at the beginning of the video, what is the sequence of the French athlete, the American athlete, and the South African athlete?",
"options": [
"A. The American athlete, the South African athlete, the French athlete.",
"B. The French athlete, the American athlete, the South African athlete.",
"C. The American athlete, the French athlete, the South African athlete.",
"D. The South African athlete, the French athlete, the American athlete."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the introductions of the athletes at the beginning of the video, what is the sequence of the French athlete, the American athlete, and the South African athlete?\nOption:\nA. The American athlete, the South African athlete, the French athlete.\nB. The French athlete, the American athlete, the South African athlete.\nC. The American athlete, the French athlete, the South African athlete.\nD. The South African athlete, the French athlete, the American athlete.\nAnswer with the option's letter from the given choices directly.",
1389,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "464-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1390,
"target": "B",
"doc": {
"video_id": "464",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ZQIZx5Oqw88",
"videoID": "ZQIZx5Oqw88",
"question_id": "464-2",
"task_type": "Object Recognition",
"question": "Which team in the video reached the finish line first?",
"options": [
"A. USA team.",
"B. Canadian team.",
"C. Ghana team.",
"D. South Africa team."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team in the video reached the finish line first?\nOption:\nA. USA team.\nB. Canadian team.\nC. Ghana team.\nD. South Africa team.\nAnswer with the option's letter from the given choices directly.",
1390,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "464-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1391,
"target": "C",
"doc": {
"video_id": "464",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ZQIZx5Oqw88",
"videoID": "ZQIZx5Oqw88",
"question_id": "464-3",
"task_type": "Temporal Perception",
"question": "In which baton handover process in the video did a player fall?",
"options": [
"A. The handover process between the second and third rods.",
"B. The handover process between the first and second rods.",
"C. The handover process between the third and fourth sticks.",
"D. Not mentioned in the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which baton handover process in the video did a player fall?\nOption:\nA. The handover process between the second and third rods.\nB. The handover process between the first and second rods.\nC. The handover process between the third and fourth sticks.\nD. Not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
1391,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "464-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1392,
"target": "A",
"doc": {
"video_id": "465",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=zW87tVnDKIU",
"videoID": "zW87tVnDKIU",
"question_id": "465-1",
"task_type": "OCR Problems",
"question": "What is the name of the second athlete performing the high jump at the beginning of the video?",
"options": [
"A. MUTAZ ESSA BARSHIM.",
"B. DEREK DROUIN.",
"C. BOHDAN BONDARENKO.",
"D. ERIK KYNARD."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the second athlete performing the high jump at the beginning of the video?\nOption:\nA. MUTAZ ESSA BARSHIM.\nB. DEREK DROUIN.\nC. BOHDAN BONDARENKO.\nD. ERIK KYNARD.\nAnswer with the option's letter from the given choices directly.",
1392,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "465-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1393,
"target": "D",
"doc": {
"video_id": "465",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=zW87tVnDKIU",
"videoID": "zW87tVnDKIU",
"question_id": "465-2",
"task_type": "Object Recognition",
"question": "Which athlete in the video was the first to touch off the crossbar?",
"options": [
"A. Athlete from Russia.",
"B. Athlete from Qatar.",
"C. Athlete from Canada.",
"D. Athlete from Ukraine."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which athlete in the video was the first to touch off the crossbar?\nOption:\nA. Athlete from Russia.\nB. Athlete from Qatar.\nC. Athlete from Canada.\nD. Athlete from Ukraine.\nAnswer with the option's letter from the given choices directly.",
1393,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "465-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1394,
"target": "B",
"doc": {
"video_id": "465",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=zW87tVnDKIU",
"videoID": "zW87tVnDKIU",
"question_id": "465-3",
"task_type": "Temporal Reasoning",
"question": "What were the final placings of the three high jumpers recorded in the video?",
"options": [
"A. Qatari athlete first, Canadian athlete second, Ukrainian athlete third.",
"B. Canadian athlete first, Qatari athlete second, Ukrainian athlete third.",
"C. Ukrainian athletes first, Canadian athletes second, Qatari athletes third.",
"D. Ukrainian athletes first, Qatari athletes second, Canadian athletes third."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What were the final placings of the three high jumpers recorded in the video?\nOption:\nA. Qatari athlete first, Canadian athlete second, Ukrainian athlete third.\nB. Canadian athlete first, Qatari athlete second, Ukrainian athlete third.\nC. Ukrainian athletes first, Canadian athletes second, Qatari athletes third.\nD. Ukrainian athletes first, Qatari athletes second, Canadian athletes third.\nAnswer with the option's letter from the given choices directly.",
1394,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "465-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1395,
"target": "C",
"doc": {
"video_id": "466",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UIX6f0xMQ2U",
"videoID": "UIX6f0xMQ2U",
"question_id": "466-1",
"task_type": "Counting Problem",
"question": "How many times does the video show a long jump with a distance of 8 metres 53?",
"options": [
"A. 2.",
"B. 4.",
"C. 3.",
"D. 6."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the video show a long jump with a distance of 8 metres 53?\nOption:\nA. 2.\nB. 4.\nC. 3.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
1395,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "466-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1396,
"target": "A",
"doc": {
"video_id": "466",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UIX6f0xMQ2U",
"videoID": "UIX6f0xMQ2U",
"question_id": "466-2",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following long jump processes appear in the video?\n① 8.53m Rome 2018\n② 8.53m London 2018\n③ 8.56m Shanghai 2018\n④ 8.53m Birmingham 2018",
"options": [
"A. ②①④③.",
"B. ④①②③.",
"C. ②③①④.",
"D. ①②③④."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following long jump processes appear in the video?\n① 8.53m Rome 2018\n② 8.53m London 2018\n③ 8.56m Shanghai 2018\n④ 8.53m Birmingham 2018\nOption:\nA. ②①④③.\nB. ④①②③.\nC. ②③①④.\nD. ①②③④.\nAnswer with the option's letter from the given choices directly.",
1396,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "466-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1397,
"target": "D",
"doc": {
"video_id": "466",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UIX6f0xMQ2U",
"videoID": "UIX6f0xMQ2U",
"question_id": "466-3",
"task_type": "OCR Problems",
"question": "Where and when was the longest jump in the video set?",
"options": [
"A. Rome 2018.",
"B. Shanghai 2018.",
"C. Lodon 2017.",
"D. Shanghai 2017."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where and when was the longest jump in the video set?\nOption:\nA. Rome 2018.\nB. Shanghai 2018.\nC. Lodon 2017.\nD. Shanghai 2017.\nAnswer with the option's letter from the given choices directly.",
1397,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "466-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1398,
"target": "B",
"doc": {
"video_id": "467",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=vf3sPa2W0sA",
"videoID": "vf3sPa2W0sA",
"question_id": "467-1",
"task_type": "Object Recognition",
"question": "What are the medals won by the athletes recorded in the video?",
"options": [
"A. Four gold medals and one silver medals.",
"B. Three gold medals. and two silver medals.",
"C. Five gold medals.",
"D. Two gold medals and three silver medals."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the medals won by the athletes recorded in the video?\nOption:\nA. Four gold medals and one silver medals.\nB. Three gold medals. and two silver medals.\nC. Five gold medals.\nD. Two gold medals and three silver medals.\nAnswer with the option's letter from the given choices directly.",
1398,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "467-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1399,
"target": "C",
"doc": {
"video_id": "467",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=vf3sPa2W0sA",
"videoID": "vf3sPa2W0sA",
"question_id": "467-2",
"task_type": "OCR Problems",
"question": "Which of the following times and places does not appear in the video?",
"options": [
"A. 2016 Rio.",
"B. 1992 Barcelona.",
"C. 1998 Soeul.",
"D. 1988 Soeul."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following times and places does not appear in the video?\nOption:\nA. 2016 Rio.\nB. 1992 Barcelona.\nC. 1998 Soeul.\nD. 1988 Soeul.\nAnswer with the option's letter from the given choices directly.",
1399,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "467-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1400,
"target": "A",
"doc": {
"video_id": "467",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=vf3sPa2W0sA",
"videoID": "vf3sPa2W0sA",
"question_id": "467-3",
"task_type": "OCR Problems",
"question": "What was the score of the silver medalist in the last match shown in the video?",
"options": [
"A. 7.22m.",
"B. 7.14m.",
"C. 7.17m.",
"D. 7.15m."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the score of the silver medalist in the last match shown in the video?\nOption:\nA. 7.22m.\nB. 7.14m.\nC. 7.17m.\nD. 7.15m.\nAnswer with the option's letter from the given choices directly.",
1400,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "467-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1401,
"target": "D",
"doc": {
"video_id": "468",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=e-XGSYnhUjg",
"videoID": "e-XGSYnhUjg",
"question_id": "468-1",
"task_type": "Counting Problem",
"question": "How many strokes did the athlete in the video use during the race?",
"options": [
"A. 5.",
"B. 3.",
"C. 2.",
"D. 4."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many strokes did the athlete in the video use during the race?\nOption:\nA. 5.\nB. 3.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1401,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "468-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1402,
"target": "B",
"doc": {
"video_id": "468",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=e-XGSYnhUjg",
"videoID": "e-XGSYnhUjg",
"question_id": "468-2",
"task_type": "Object Recognition",
"question": "Which country does the athlete, who was in first place after touching the wall for the third time in the video, represent?",
"options": [
"A. Brazil.",
"B. USA.",
"C. Japan.",
"D. China."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country does the athlete, who was in first place after touching the wall for the third time in the video, represent?\nOption:\nA. Brazil.\nB. USA.\nC. Japan.\nD. China.\nAnswer with the option's letter from the given choices directly.",
1402,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "468-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1403,
"target": "C",
"doc": {
"video_id": "468",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=e-XGSYnhUjg",
"videoID": "e-XGSYnhUjg",
"question_id": "468-3",
"task_type": "Temporal Reasoning",
"question": "What is the order of the swimming strokes used by the athletes in the video?",
"options": [
"A. Backstroke, Breaststroke, Butterfly, Freestyle.",
"B. Breaststroke, Butterfly, Backstroke, Freestyle.",
"C. Butterfly, Backstroke, Breaststroke, Freestyle.",
"D. Butterfly, freestyle, backstroke, breaststroke."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order of the swimming strokes used by the athletes in the video?\nOption:\nA. Backstroke, Breaststroke, Butterfly, Freestyle.\nB. Breaststroke, Butterfly, Backstroke, Freestyle.\nC. Butterfly, Backstroke, Breaststroke, Freestyle.\nD. Butterfly, freestyle, backstroke, breaststroke.\nAnswer with the option's letter from the given choices directly.",
1403,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "468-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1404,
"target": "A",
"doc": {
"video_id": "469",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=IQcTd4xZ44o",
"videoID": "IQcTd4xZ44o",
"question_id": "469-1",
"task_type": "Spatial Perception",
"question": "In what direction does the camera move during the introduction of the athletes at the beginning of the video?",
"options": [
"A. Move from the inside to the outside of the runway.",
"B. Move from the outside to the inside of the runway.",
"C. Move from front to back along the runway.",
"D. Move from back to front along the runway."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what direction does the camera move during the introduction of the athletes at the beginning of the video?\nOption:\nA. Move from the inside to the outside of the runway.\nB. Move from the outside to the inside of the runway.\nC. Move from front to back along the runway.\nD. Move from back to front along the runway.\nAnswer with the option's letter from the given choices directly.",
1404,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "469-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1405,
"target": "D",
"doc": {
"video_id": "469",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=IQcTd4xZ44o",
"videoID": "IQcTd4xZ44o",
"question_id": "469-2",
"task_type": "OCR Problems",
"question": "What is the time of the third athlete to reach the finish line in the video?",
"options": [
"A. 3:29.26.",
"B. 3:31.38.",
"C. 3:31.70.",
"D. 3:31.46."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the time of the third athlete to reach the finish line in the video?\nOption:\nA. 3:29.26.\nB. 3:31.38.\nC. 3:31.70.\nD. 3:31.46.\nAnswer with the option's letter from the given choices directly.",
1405,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "469-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1406,
"target": "B",
"doc": {
"video_id": "469",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=IQcTd4xZ44o",
"videoID": "IQcTd4xZ44o",
"question_id": "469-3",
"task_type": "Object Recognition",
"question": "In the video, some athletes celebrated by waving their national flags. Which country does the second athlete in the video, who takes out the national flag to celebrate, represent?",
"options": [
"A. Poland.",
"B. Kenya.",
"C. Algeria.",
"D. Norway."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, some athletes celebrated by waving their national flags. Which country does the second athlete in the video, who takes out the national flag to celebrate, represent?\nOption:\nA. Poland.\nB. Kenya.\nC. Algeria.\nD. Norway.\nAnswer with the option's letter from the given choices directly.",
1406,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "469-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1407,
"target": "C",
"doc": {
"video_id": "470",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ZFtxvf72NxA",
"videoID": "ZFtxvf72NxA",
"question_id": "470-1",
"task_type": "Object Recognition",
"question": "In the video, which country does the athlete positioned on the innermost track represent?",
"options": [
"A. Nigeria.",
"B. Canada.",
"C. United States.",
"D. United Kindom."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, which country does the athlete positioned on the innermost track represent?\nOption:\nA. Nigeria.\nB. Canada.\nC. United States.\nD. United Kindom.\nAnswer with the option's letter from the given choices directly.",
1407,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "470-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1408,
"target": "A",
"doc": {
"video_id": "470",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ZFtxvf72NxA",
"videoID": "ZFtxvf72NxA",
"question_id": "470-2",
"task_type": "Object Recognition",
"question": "Which athlete is the second to reach the finish line in the video?",
"options": [
"A. The athlete wearing a yellow top and yellow shorts.",
"B. The athlete wearing a white top and yellow shorts.",
"C. The athlete wearing a blue top and blue shorts.",
"D. The athlete wearing a white top and blue shorts."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which athlete is the second to reach the finish line in the video?\nOption:\nA. The athlete wearing a yellow top and yellow shorts.\nB. The athlete wearing a white top and yellow shorts.\nC. The athlete wearing a blue top and blue shorts.\nD. The athlete wearing a white top and blue shorts.\nAnswer with the option's letter from the given choices directly.",
1408,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "470-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1409,
"target": "D",
"doc": {
"video_id": "470",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=ZFtxvf72NxA",
"videoID": "ZFtxvf72NxA",
"question_id": "470-3",
"task_type": "Action Reasoning",
"question": "In the video, what did the staff wearing red hats and red shirts prompt the four American athletes to do?",
"options": [
"A. Neither.",
"B. Prompting four athletes to hold each other up.",
"C. Prompting four athletes to come off the field for a break.",
"D. Prompting the four athletes to take a photo with the screen that recorded their results."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what did the staff wearing red hats and red shirts prompt the four American athletes to do?\nOption:\nA. Neither.\nB. Prompting four athletes to hold each other up.\nC. Prompting four athletes to come off the field for a break.\nD. Prompting the four athletes to take a photo with the screen that recorded their results.\nAnswer with the option's letter from the given choices directly.",
1409,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "470-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1410,
"target": "B",
"doc": {
"video_id": "471",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=vgWlRdNOGx4",
"videoID": "vgWlRdNOGx4",
"question_id": "471-1",
"task_type": "OCR Problems",
"question": "How many points does Ma Long win on serve in Game 1?",
"options": [
"A. 12.",
"B. 4.",
"C. 14.",
"D. 9."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many points does Ma Long win on serve in Game 1?\nOption:\nA. 12.\nB. 4.\nC. 14.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
1410,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "471-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1411,
"target": "C",
"doc": {
"video_id": "471",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=vgWlRdNOGx4",
"videoID": "vgWlRdNOGx4",
"question_id": "471-2",
"task_type": "Temporal Reasoning",
"question": "Among the four matches, in which order do the two athletes consecutively appear back to the Rio2016 board?",
"options": [
"A. Zhang-Ma-Zhang-Ma.",
"B. Ma-Ma-Zhang-Zhang.",
"C. Ma-Zhang-Ma-Zhang.",
"D. Zhang-Zhang-Ma-Ma."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Among the four matches, in which order do the two athletes consecutively appear back to the Rio2016 board?\nOption:\nA. Zhang-Ma-Zhang-Ma.\nB. Ma-Ma-Zhang-Zhang.\nC. Ma-Zhang-Ma-Zhang.\nD. Zhang-Zhang-Ma-Ma.\nAnswer with the option's letter from the given choices directly.",
1411,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "471-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1412,
"target": "A",
"doc": {
"video_id": "471",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=vgWlRdNOGx4",
"videoID": "vgWlRdNOGx4",
"question_id": "471-3",
"task_type": "Object Reasoning",
"question": "Who is the silver medalist?",
"options": [
"A. Zhang Jike.",
"B. Ma Long.",
"C. Jun Mizutani.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the silver medalist?\nOption:\nA. Zhang Jike.\nB. Ma Long.\nC. Jun Mizutani.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1412,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "471-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1413,
"target": "A",
"doc": {
"video_id": "472",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=qkRUcYHJ1JI",
"videoID": "qkRUcYHJ1JI",
"question_id": "472-1",
"task_type": "Object Recognition",
"question": "What sport are the two athletes playing?",
"options": [
"A. Baseball.",
"B. Tennis.",
"C. Soccer.",
"D. Basketball."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sport are the two athletes playing?\nOption:\nA. Baseball.\nB. Tennis.\nC. Soccer.\nD. Basketball.\nAnswer with the option's letter from the given choices directly.",
1413,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "472-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1414,
"target": "D",
"doc": {
"video_id": "472",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=qkRUcYHJ1JI",
"videoID": "qkRUcYHJ1JI",
"question_id": "472-2",
"task_type": "Action Recognition",
"question": "How does Japan gets the decisive score to win the championship?",
"options": [
"A. Home run.",
"B. Foul ball.",
"C. Wild pitches.",
"D. Strikes out swinging."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Japan gets the decisive score to win the championship?\nOption:\nA. Home run.\nB. Foul ball.\nC. Wild pitches.\nD. Strikes out swinging.\nAnswer with the option's letter from the given choices directly.",
1414,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "472-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1415,
"target": "B",
"doc": {
"video_id": "472",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=qkRUcYHJ1JI",
"videoID": "qkRUcYHJ1JI",
"question_id": "472-3",
"task_type": "Temporal Reasoning",
"question": "What is the sequence of appearance for the following throwers on the court?\n(a) Dark blue 21.\n(b) White 21.\n(c) Dark blue 29.\n(d) Dark blue 26.",
"options": [
"A. (b)(d)(a)(c).",
"B. (b)(c)(d)(a).",
"C. (a)(b)(c)(d).",
"D. (a)(c)(d)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the sequence of appearance for the following throwers on the court?\n(a) Dark blue 21.\n(b) White 21.\n(c) Dark blue 29.\n(d) Dark blue 26.\nOption:\nA. (b)(d)(a)(c).\nB. (b)(c)(d)(a).\nC. (a)(b)(c)(d).\nD. (a)(c)(d)(b).\nAnswer with the option's letter from the given choices directly.",
1415,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "472-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1416,
"target": "D",
"doc": {
"video_id": "473",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=G-mmtUxSt5k",
"videoID": "G-mmtUxSt5k",
"question_id": "473-1",
"task_type": "Counting Problem",
"question": "How many men's single matches are included in this video?",
"options": [
"A. 10.",
"B. 9.",
"C. 8.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many men's single matches are included in this video?\nOption:\nA. 10.\nB. 9.\nC. 8.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1416,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "473-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1417,
"target": "B",
"doc": {
"video_id": "473",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=G-mmtUxSt5k",
"videoID": "G-mmtUxSt5k",
"question_id": "473-2",
"task_type": "Spatial Perception",
"question": "Where is the first match held?",
"options": [
"A. SINGAPORE.",
"B. SUZHOU.",
"C. Sudirman.",
"D. HANGZHOU."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the first match held?\nOption:\nA. SINGAPORE.\nB. SUZHOU.\nC. Sudirman.\nD. HANGZHOU.\nAnswer with the option's letter from the given choices directly.",
1417,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "473-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1418,
"target": "C",
"doc": {
"video_id": "473",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=G-mmtUxSt5k",
"videoID": "G-mmtUxSt5k",
"question_id": "473-3",
"task_type": "Object Reasoning",
"question": "Which team won the first men's doubles?",
"options": [
"A. The red team.",
"B. The orange team.",
"C. The blue team.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team won the first men's doubles?\nOption:\nA. The red team.\nB. The orange team.\nC. The blue team.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1418,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "473-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1419,
"target": "C",
"doc": {
"video_id": "474",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=FYqSATdr62Q",
"videoID": "FYqSATdr62Q",
"question_id": "474-1",
"task_type": "OCR Problems",
"question": "What is the final score of the player DING?",
"options": [
"A. 129.",
"B. 147.",
"C. 0.",
"D. 112."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the final score of the player DING?\nOption:\nA. 129.\nB. 147.\nC. 0.\nD. 112.\nAnswer with the option's letter from the given choices directly.",
1419,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "474-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1420,
"target": "A",
"doc": {
"video_id": "474",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=FYqSATdr62Q",
"videoID": "FYqSATdr62Q",
"question_id": "474-2",
"task_type": "Action Recognition",
"question": "What action does the judge take when the player pockets the black ball while other balls are still on the table?",
"options": [
"A. He re-racks the black ball.",
"B. He removes the black ball.",
"C. He adds an extra black ball.",
"D. He replaces the black ball with a red one."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What action does the judge take when the player pockets the black ball while other balls are still on the table?\nOption:\nA. He re-racks the black ball.\nB. He removes the black ball.\nC. He adds an extra black ball.\nD. He replaces the black ball with a red one.\nAnswer with the option's letter from the given choices directly.",
1420,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "474-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1421,
"target": "B",
"doc": {
"video_id": "474",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=FYqSATdr62Q",
"videoID": "FYqSATdr62Q",
"question_id": "474-3",
"task_type": "Action Recognition",
"question": "How does the player pocket the green ball?",
"options": [
"A. He throws the cue stick at the ball, making it spin and fall into the pocket.",
"B. He uses the rest and takes a shot.",
"C. He lies down on the table and blows on the ball.",
"D. He uses a slingshot to launch the ball into the pocket."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the player pocket the green ball?\nOption:\nA. He throws the cue stick at the ball, making it spin and fall into the pocket.\nB. He uses the rest and takes a shot.\nC. He lies down on the table and blows on the ball.\nD. He uses a slingshot to launch the ball into the pocket.\nAnswer with the option's letter from the given choices directly.",
1421,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "474-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1422,
"target": "B",
"doc": {
"video_id": "475",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Hw5qjcnhrW4",
"videoID": "Hw5qjcnhrW4",
"question_id": "475-1",
"task_type": "Object Recognition",
"question": "What sport are the two athletes playing?",
"options": [
"A. Baseball.",
"B. Tennis.",
"C. Soccer.",
"D. Basketball."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sport are the two athletes playing?\nOption:\nA. Baseball.\nB. Tennis.\nC. Soccer.\nD. Basketball.\nAnswer with the option's letter from the given choices directly.",
1422,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "475-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1423,
"target": "C",
"doc": {
"video_id": "475",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Hw5qjcnhrW4",
"videoID": "Hw5qjcnhrW4",
"question_id": "475-2",
"task_type": "Action Reasoning",
"question": "How many points does the winner get enough to win the match?",
"options": [
"A. 5.",
"B. 7.",
"C. 6.",
"D. 11."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many points does the winner get enough to win the match?\nOption:\nA. 5.\nB. 7.\nC. 6.\nD. 11.\nAnswer with the option's letter from the given choices directly.",
1423,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "475-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1424,
"target": "A",
"doc": {
"video_id": "475",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Hw5qjcnhrW4",
"videoID": "Hw5qjcnhrW4",
"question_id": "475-3",
"task_type": "Object Reasoning",
"question": "What do the two players have in common?",
"options": [
"A. Both of them are wearing hats with a tick logo.",
"B. Both of them are wearing black shorts.",
"C. Both of them are from Spain.",
"D. They have the same height."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two players have in common?\nOption:\nA. Both of them are wearing hats with a tick logo.\nB. Both of them are wearing black shorts.\nC. Both of them are from Spain.\nD. They have the same height.\nAnswer with the option's letter from the given choices directly.",
1424,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "475-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1425,
"target": "D",
"doc": {
"video_id": "476",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=DfZIv2FKWj4",
"videoID": "DfZIv2FKWj4",
"question_id": "476-1",
"task_type": "OCR Problems",
"question": "What is the score at the half time?",
"options": [
"A. England 3 v.s. Samoa 43.",
"B. England 34 v.s. Samoa 3.",
"C. England 3 v.s. Samoa 34.",
"D. England 43 v.s. Samoa 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the score at the half time?\nOption:\nA. England 3 v.s. Samoa 43.\nB. England 34 v.s. Samoa 3.\nC. England 3 v.s. Samoa 34.\nD. England 43 v.s. Samoa 3.\nAnswer with the option's letter from the given choices directly.",
1425,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "476-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1426,
"target": "B",
"doc": {
"video_id": "476",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=DfZIv2FKWj4",
"videoID": "DfZIv2FKWj4",
"question_id": "476-2",
"task_type": "Object Reasoning",
"question": "Which team gets more scores at the second half time?",
"options": [
"A. Samoa.",
"B. England.",
"C. The two teams get the same scores.",
"D. It cannot be inferred based on this video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team gets more scores at the second half time?\nOption:\nA. Samoa.\nB. England.\nC. The two teams get the same scores.\nD. It cannot be inferred based on this video.\nAnswer with the option's letter from the given choices directly.",
1426,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "476-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1427,
"target": "C",
"doc": {
"video_id": "476",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=DfZIv2FKWj4",
"videoID": "DfZIv2FKWj4",
"question_id": "476-3",
"task_type": "Object Recognition",
"question": "What sport are the two teams of athletes playing?",
"options": [
"A. Ice hockey.",
"B. Soccer.",
"C. Rugby.",
"D. Basketball."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sport are the two teams of athletes playing?\nOption:\nA. Ice hockey.\nB. Soccer.\nC. Rugby.\nD. Basketball.\nAnswer with the option's letter from the given choices directly.",
1427,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "476-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1428,
"target": "A",
"doc": {
"video_id": "477",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=c66RN87YjNE",
"videoID": "c66RN87YjNE",
"question_id": "477-1",
"task_type": "Object Recognition",
"question": "Which team participates in the first five matches in this video?",
"options": [
"A. JPN.",
"B. GER.",
"C. CAN.",
"D. IRI."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team participates in the first five matches in this video?\nOption:\nA. JPN.\nB. GER.\nC. CAN.\nD. IRI.\nAnswer with the option's letter from the given choices directly.",
1428,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "477-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1429,
"target": "D",
"doc": {
"video_id": "477",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=c66RN87YjNE",
"videoID": "c66RN87YjNE",
"question_id": "477-2",
"task_type": "OCR Problems",
"question": "What is the number written on the back of Nishida?",
"options": [
"A. 13.",
"B. 22.",
"C. 20.",
"D. 11."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the number written on the back of Nishida?\nOption:\nA. 13.\nB. 22.\nC. 20.\nD. 11.\nAnswer with the option's letter from the given choices directly.",
1429,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "477-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1430,
"target": "B",
"doc": {
"video_id": "477",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=c66RN87YjNE",
"videoID": "c66RN87YjNE",
"question_id": "477-3",
"task_type": "Action Recognition",
"question": "How does JPN team usually win a point in this video?",
"options": [
"A. By drive serve.",
"B. By quick smash.",
"C. By dink spike.",
"D. By switch attack."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does JPN team usually win a point in this video?\nOption:\nA. By drive serve.\nB. By quick smash.\nC. By dink spike.\nD. By switch attack.\nAnswer with the option's letter from the given choices directly.",
1430,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "477-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1431,
"target": "C",
"doc": {
"video_id": "478",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=PAI-gQ19pbs",
"videoID": "PAI-gQ19pbs",
"question_id": "478-1",
"task_type": "Object Recognition",
"question": "What sport are the two teams playing?",
"options": [
"A. Baseball.",
"B. Soccer.",
"C. Ice hockey.",
"D. Basketball."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sport are the two teams playing?\nOption:\nA. Baseball.\nB. Soccer.\nC. Ice hockey.\nD. Basketball.\nAnswer with the option's letter from the given choices directly.",
1431,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "478-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1432,
"target": "A",
"doc": {
"video_id": "478",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=PAI-gQ19pbs",
"videoID": "PAI-gQ19pbs",
"question_id": "478-2",
"task_type": "Temporal Reasoning",
"question": "Which period is the most time-consuming in this video?",
"options": [
"A. The second.",
"B. The first.",
"C. The third.",
"D. Three periods take the same time."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which period is the most time-consuming in this video?\nOption:\nA. The second.\nB. The first.\nC. The third.\nD. Three periods take the same time.\nAnswer with the option's letter from the given choices directly.",
1432,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "478-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1433,
"target": "D",
"doc": {
"video_id": "478",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=PAI-gQ19pbs",
"videoID": "PAI-gQ19pbs",
"question_id": "478-3",
"task_type": "Action Recognition",
"question": "As depicted in this video, what happened to the last shot in this baseball game?",
"options": [
"A. It hits the goalpost and ricochets back onto the field.",
"B. It is blocked by the goalkeeper.",
"C. It bounces off the referee's head and goes out of bounds.",
"D. It goes past the goalkeeper's shoulder and into the net."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in this video, what happened to the last shot in this baseball game?\nOption:\nA. It hits the goalpost and ricochets back onto the field.\nB. It is blocked by the goalkeeper.\nC. It bounces off the referee's head and goes out of bounds.\nD. It goes past the goalkeeper's shoulder and into the net.\nAnswer with the option's letter from the given choices directly.",
1433,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "478-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1434,
"target": "B",
"doc": {
"video_id": "479",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=hQYRDNl-lGI",
"videoID": "hQYRDNl-lGI",
"question_id": "479-1",
"task_type": "Temporal Reasoning",
"question": "In which order are the first four cars driven out of the garage in this video?\n(a) The yellow car.\n(b) The black car.\n(c) The silver car.\n(d) The white car.",
"options": [
"A. (a)(b)(c)(d).",
"B. (a)(c)(b)(d).",
"C. (b)(c)(d)(a).",
"D. (b)(d)(a)(c)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the first four cars driven out of the garage in this video?\n(a) The yellow car.\n(b) The black car.\n(c) The silver car.\n(d) The white car.\nOption:\nA. (a)(b)(c)(d).\nB. (a)(c)(b)(d).\nC. (b)(c)(d)(a).\nD. (b)(d)(a)(c).\nAnswer with the option's letter from the given choices directly.",
1434,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "479-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1435,
"target": "C",
"doc": {
"video_id": "479",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=hQYRDNl-lGI",
"videoID": "hQYRDNl-lGI",
"question_id": "479-2",
"task_type": "Attribute Perception",
"question": "What is the color of the car that is first to start in the race?",
"options": [
"A. Red.",
"B. Black.",
"C. Yellow.",
"D. White."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the car that is first to start in the race?\nOption:\nA. Red.\nB. Black.\nC. Yellow.\nD. White.\nAnswer with the option's letter from the given choices directly.",
1435,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "479-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1436,
"target": "A",
"doc": {
"video_id": "479",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=hQYRDNl-lGI",
"videoID": "hQYRDNl-lGI",
"question_id": "479-3",
"task_type": "Object Reasoning",
"question": "According to this video, who is the winner of the race?",
"options": [
"A. Based on this video, we cannot get to know the winner.",
"B. The yellow car.",
"C. The black car.",
"D. The red car."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, who is the winner of the race?\nOption:\nA. Based on this video, we cannot get to know the winner.\nB. The yellow car.\nC. The black car.\nD. The red car.\nAnswer with the option's letter from the given choices directly.",
1436,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "479-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1437,
"target": "D",
"doc": {
"video_id": "480",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=awHc8IySU-I",
"videoID": "awHc8IySU-I",
"question_id": "480-1",
"task_type": "Object Reasoning",
"question": "Among the games, which winner has the quickest finishing time?",
"options": [
"A. The first one.",
"B. The second one.",
"C. The third one.",
"D. The last one."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Among the games, which winner has the quickest finishing time?\nOption:\nA. The first one.\nB. The second one.\nC. The third one.\nD. The last one.\nAnswer with the option's letter from the given choices directly.",
1437,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "480-1",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1438,
"target": "B",
"doc": {
"video_id": "480",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=awHc8IySU-I",
"videoID": "awHc8IySU-I",
"question_id": "480-2",
"task_type": "Counting Problem",
"question": "How many athletes successfully finished the game in 2010?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes successfully finished the game in 2010?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1438,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "480-2",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1439,
"target": "C",
"doc": {
"video_id": "480",
"duration": "medium",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=awHc8IySU-I",
"videoID": "awHc8IySU-I",
"question_id": "480-3",
"task_type": "Temporal Perception",
"question": "How are the five moments of champion organized in this video?",
"options": [
"A. Randomly shuffled throughout the video.",
"B. In reverse chronological order.",
"C. In chronological order.",
"D. Based on the players' jersey numbers."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How are the five moments of champion organized in this video?\nOption:\nA. Randomly shuffled throughout the video.\nB. In reverse chronological order.\nC. In chronological order.\nD. Based on the players' jersey numbers.\nAnswer with the option's letter from the given choices directly.",
1439,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "480-3",
"duration": "medium",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1440,
"target": "A",
"doc": {
"video_id": "481",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=CwzjlmBLfrQ",
"videoID": "CwzjlmBLfrQ",
"question_id": "481-1",
"task_type": "Action Recognition",
"question": "Why does the man playing the keyboard leave his seat during the performance?",
"options": [
"A. Because he sneezes and tries to get a tissue to wipe his nose.",
"B. To adjust the sound settings on his instrument.",
"C. Because he forgets to play his part in the music.",
"D. To take a brief intermission break from the performance."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the man playing the keyboard leave his seat during the performance?\nOption:\nA. Because he sneezes and tries to get a tissue to wipe his nose.\nB. To adjust the sound settings on his instrument.\nC. Because he forgets to play his part in the music.\nD. To take a brief intermission break from the performance.\nAnswer with the option's letter from the given choices directly.",
1440,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "481-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1441,
"target": "B",
"doc": {
"video_id": "481",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=CwzjlmBLfrQ",
"videoID": "CwzjlmBLfrQ",
"question_id": "481-2",
"task_type": "Action Reasoning",
"question": "In the middle of the video, why are there scenes of men running by the sea?",
"options": [
"A. The men are participating in a seaside marathon event.",
"B. The man playing the electronic keyboard is dreaming in his head.",
"C. The scenes are part of a documentary about coastal life.",
"D. It's a commercial for a new fitness program."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle of the video, why are there scenes of men running by the sea?\nOption:\nA. The men are participating in a seaside marathon event.\nB. The man playing the electronic keyboard is dreaming in his head.\nC. The scenes are part of a documentary about coastal life.\nD. It's a commercial for a new fitness program.\nAnswer with the option's letter from the given choices directly.",
1441,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "481-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1442,
"target": "C",
"doc": {
"video_id": "481",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=CwzjlmBLfrQ",
"videoID": "CwzjlmBLfrQ",
"question_id": "481-3",
"task_type": "OCR Problems",
"question": "What is the nationality of the man playing the electronic keyboard?",
"options": [
"A. France.",
"B. USA.",
"C. UK.",
"D. Germany."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the nationality of the man playing the electronic keyboard?\nOption:\nA. France.\nB. USA.\nC. UK.\nD. Germany.\nAnswer with the option's letter from the given choices directly.",
1442,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "481-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1443,
"target": "B",
"doc": {
"video_id": "482",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=-3t1rj8g6yg",
"videoID": "-3t1rj8g6yg",
"question_id": "482-1",
"task_type": "Object Reasoning",
"question": "What does this video document?",
"options": [
"A. What technology has been adopted for the National Theatre's live broadcast.",
"B. Behind the scenes at the National Theatre.",
"C. The popularity of the art of opera today.",
"D. Can't extrapolate from the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video document?\nOption:\nA. What technology has been adopted for the National Theatre's live broadcast.\nB. Behind the scenes at the National Theatre.\nC. The popularity of the art of opera today.\nD. Can't extrapolate from the video.\nAnswer with the option's letter from the given choices directly.",
1443,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "482-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1444,
"target": "C",
"doc": {
"video_id": "482",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=-3t1rj8g6yg",
"videoID": "-3t1rj8g6yg",
"question_id": "482-2",
"task_type": "OCR Problems",
"question": "Which play is featured at the end of the video?",
"options": [
"A. Macbeth.",
"B. Hamlet.",
"C. The Lehman Trilogy.",
"D. War Horse."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which play is featured at the end of the video?\nOption:\nA. Macbeth.\nB. Hamlet.\nC. The Lehman Trilogy.\nD. War Horse.\nAnswer with the option's letter from the given choices directly.",
1444,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "482-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1445,
"target": "A",
"doc": {
"video_id": "482",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=-3t1rj8g6yg",
"videoID": "-3t1rj8g6yg",
"question_id": "482-3",
"task_type": "Object Reasoning",
"question": "What can be inferred about broadcast quality from the video description?",
"options": [
"A. It utilizes state-of-the-art filming tailored to each play.",
"B. It is broadcast in standard definition.",
"C. The filming techniques are outdated.",
"D. The plays are edited before being broadcast."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about broadcast quality from the video description?\nOption:\nA. It utilizes state-of-the-art filming tailored to each play.\nB. It is broadcast in standard definition.\nC. The filming techniques are outdated.\nD. The plays are edited before being broadcast.\nAnswer with the option's letter from the given choices directly.",
1445,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "482-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1446,
"target": "A",
"doc": {
"video_id": "483",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=323v_FtWqvo",
"videoID": "323v_FtWqvo",
"question_id": "483-1",
"task_type": "Spatial Reasoning",
"question": "What is the primary theme of the high school theatre show featured in the video?",
"options": [
"A. Social justice issues.",
"B. Classic literature adaptations.",
"C. The celebration of school spirit.",
"D. The exploration of modern art."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary theme of the high school theatre show featured in the video?\nOption:\nA. Social justice issues.\nB. Classic literature adaptations.\nC. The celebration of school spirit.\nD. The exploration of modern art.\nAnswer with the option's letter from the given choices directly.",
1446,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "483-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1447,
"target": "D",
"doc": {
"video_id": "483",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=323v_FtWqvo",
"videoID": "323v_FtWqvo",
"question_id": "483-2",
"task_type": "Attribute Perception",
"question": "How are the students dressed in the video?",
"options": [
"A. In period costumes.",
"B. In formal attire.",
"C. In casual streetwear.",
"D. In matching thematic outfits."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How are the students dressed in the video?\nOption:\nA. In period costumes.\nB. In formal attire.\nC. In casual streetwear.\nD. In matching thematic outfits.\nAnswer with the option's letter from the given choices directly.",
1447,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "483-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1448,
"target": "B",
"doc": {
"video_id": "483",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=323v_FtWqvo",
"videoID": "323v_FtWqvo",
"question_id": "483-3",
"task_type": "Attribute Perception",
"question": "Based on the video, what setting is the high school theatre show trying to replicate?",
"options": [
"A. A professional Broadway stage.",
"B. An experimental black box theatre.",
"C. A historical setting.",
"D. A classic Shakespearean stage."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what setting is the high school theatre show trying to replicate?\nOption:\nA. A professional Broadway stage.\nB. An experimental black box theatre.\nC. A historical setting.\nD. A classic Shakespearean stage.\nAnswer with the option's letter from the given choices directly.",
1448,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "483-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1449,
"target": "C",
"doc": {
"video_id": "484",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PSyedQl9IiQ",
"videoID": "PSyedQl9IiQ",
"question_id": "484-1",
"task_type": "Spatial Reasoning",
"question": "What is the theme of the performance in the video?",
"options": [
"A. The joys of childhood.",
"B. The importance of education.",
"C. Rebellion against authority.",
"D. The history of literature."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the theme of the performance in the video?\nOption:\nA. The joys of childhood.\nB. The importance of education.\nC. Rebellion against authority.\nD. The history of literature.\nAnswer with the option's letter from the given choices directly.",
1449,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "484-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1450,
"target": "A",
"doc": {
"video_id": "484",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PSyedQl9IiQ",
"videoID": "PSyedQl9IiQ",
"question_id": "484-2",
"task_type": "Counting Problem",
"question": "How many seats are on stage in the video?",
"options": [
"A. Nine.",
"B. Three.",
"C. Six.",
"D. Five."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many seats are on stage in the video?\nOption:\nA. Nine.\nB. Three.\nC. Six.\nD. Five.\nAnswer with the option's letter from the given choices directly.",
1450,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "484-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1451,
"target": "D",
"doc": {
"video_id": "484",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PSyedQl9IiQ",
"videoID": "PSyedQl9IiQ",
"question_id": "484-3",
"task_type": "Temporal Perception",
"question": "What can be inferred about the story at the beginning of the video?",
"options": [
"A. Students are engaged in a group reading session.",
"B. Students are preparing for an exam.",
"C. Students are in the midst of a classroom lesson.",
"D. Students are facing off against a figure of authority."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the story at the beginning of the video?\nOption:\nA. Students are engaged in a group reading session.\nB. Students are preparing for an exam.\nC. Students are in the midst of a classroom lesson.\nD. Students are facing off against a figure of authority.\nAnswer with the option's letter from the given choices directly.",
1451,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "484-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1452,
"target": "C",
"doc": {
"video_id": "485",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=BEIVOKz4zXw",
"videoID": "BEIVOKz4zXw",
"question_id": "485-1",
"task_type": "OCR Problems",
"question": "What is the setting for \"Grease The Musical\" as mentioned in the video?",
"options": [
"A. Sydney Opera House.",
"B. Broadway.",
"C. West End.",
"D. Rydell High."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the setting for \"Grease The Musical\" as mentioned in the video?\nOption:\nA. Sydney Opera House.\nB. Broadway.\nC. West End.\nD. Rydell High.\nAnswer with the option's letter from the given choices directly.",
1452,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "485-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1453,
"target": "A",
"doc": {
"video_id": "485",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=BEIVOKz4zXw",
"videoID": "BEIVOKz4zXw",
"question_id": "485-2",
"task_type": "Counting Problem",
"question": "How many buckets are on the stage?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 5."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many buckets are on the stage?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1453,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "485-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1454,
"target": "D",
"doc": {
"video_id": "485",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=BEIVOKz4zXw",
"videoID": "BEIVOKz4zXw",
"question_id": "485-3",
"task_type": "Temporal Perception",
"question": "When do the performers begin a dance routine without microphones?",
"options": [
"A. Right before the final act of the show.",
"B. Immediately as the show opens.",
"C. During the male lead's solo performance.",
"D. After the female singer finished her solo performance."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When do the performers begin a dance routine without microphones?\nOption:\nA. Right before the final act of the show.\nB. Immediately as the show opens.\nC. During the male lead's solo performance.\nD. After the female singer finished her solo performance.\nAnswer with the option's letter from the given choices directly.",
1454,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "485-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1455,
"target": "A",
"doc": {
"video_id": "486",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=WO86oRJt8lM",
"videoID": "WO86oRJt8lM",
"question_id": "486-1",
"task_type": "Object Recognition",
"question": "Which animals appear in the performance from the video?",
"options": [
"A. Zebra, Monkey, Tiger.",
"B. Elephant, Giraffe, Lion.",
"C. Bear, Rabbit, Eagle.",
"D. Shark, Dolphin, Whale."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animals appear in the performance from the video?\nOption:\nA. Zebra, Monkey, Tiger.\nB. Elephant, Giraffe, Lion.\nC. Bear, Rabbit, Eagle.\nD. Shark, Dolphin, Whale.\nAnswer with the option's letter from the given choices directly.",
1455,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "486-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1456,
"target": "C",
"doc": {
"video_id": "486",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=WO86oRJt8lM",
"videoID": "WO86oRJt8lM",
"question_id": "486-2",
"task_type": "Spatial Reasoning",
"question": "What can be inferred about the setting from the lighting and background?",
"options": [
"A. The scene depicts a nighttime setting.",
"B. It is set in a bright outdoor environment.",
"C. It's set indoors with stage lighting to simulate a natural environment.",
"D. It is taking place in an urban cityscape."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the setting from the lighting and background?\nOption:\nA. The scene depicts a nighttime setting.\nB. It is set in a bright outdoor environment.\nC. It's set indoors with stage lighting to simulate a natural environment.\nD. It is taking place in an urban cityscape.\nAnswer with the option's letter from the given choices directly.",
1456,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "486-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1457,
"target": "B",
"doc": {
"video_id": "486",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=WO86oRJt8lM",
"videoID": "WO86oRJt8lM",
"question_id": "486-3",
"task_type": "Action Reasoning",
"question": "What is the likely purpose of the scene being performed?",
"options": [
"A. Provide comedic entertainment for the after-dinner amusement of the audience.",
"B. Demonstrate the storytelling and creative elements of theatre.",
"C. Recreates historical scenes and remembers history with the audience.",
"D. Cannot be inferred from the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the likely purpose of the scene being performed?\nOption:\nA. Provide comedic entertainment for the after-dinner amusement of the audience.\nB. Demonstrate the storytelling and creative elements of theatre.\nC. Recreates historical scenes and remembers history with the audience.\nD. Cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
1457,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "486-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1458,
"target": "B",
"doc": {
"video_id": "487",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=UeOy1WKeTRY",
"videoID": "UeOy1WKeTRY",
"question_id": "487-1",
"task_type": "Attribute Perception",
"question": "What is the setting of the video?",
"options": [
"A. A movie premiere.",
"B. The Tony Awards.",
"C. A television talk show.",
"D. A music video shoot."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the setting of the video?\nOption:\nA. A movie premiere.\nB. The Tony Awards.\nC. A television talk show.\nD. A music video shoot.\nAnswer with the option's letter from the given choices directly.",
1458,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "487-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1459,
"target": "B",
"doc": {
"video_id": "487",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=UeOy1WKeTRY",
"videoID": "UeOy1WKeTRY",
"question_id": "487-2",
"task_type": "Action Reasoning",
"question": "What aspect of Broadway is being saluted by the man's opening number at the activity?",
"options": [
"A. The choreography of Broadway musicals.",
"B. The magic of live performances.",
"C. Famous Broadway actors and their classic scenes.",
"D. Can't be deduced from the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What aspect of Broadway is being saluted by the man's opening number at the activity?\nOption:\nA. The choreography of Broadway musicals.\nB. The magic of live performances.\nC. Famous Broadway actors and their classic scenes.\nD. Can't be deduced from the video.\nAnswer with the option's letter from the given choices directly.",
1459,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "487-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1460,
"target": "A",
"doc": {
"video_id": "487",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=UeOy1WKeTRY",
"videoID": "UeOy1WKeTRY",
"question_id": "487-3",
"task_type": "Spatial Reasoning",
"question": "From the staging and setting in the video, what type of ambiance is suggested for the activity?",
"options": [
"A. Grandiose and theatrical.",
"B. Intimate and minimalist.",
"C. Casual and impromptu.",
"D. Modern and high-tech."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From the staging and setting in the video, what type of ambiance is suggested for the activity?\nOption:\nA. Grandiose and theatrical.\nB. Intimate and minimalist.\nC. Casual and impromptu.\nD. Modern and high-tech.\nAnswer with the option's letter from the given choices directly.",
1460,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "487-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1461,
"target": "D",
"doc": {
"video_id": "488",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PB06auioy0Y",
"videoID": "PB06auioy0Y",
"question_id": "488-1",
"task_type": "Spatial Reasoning",
"question": "What type of event is being depicted in the video?",
"options": [
"A. A theater play.",
"B. A classical opera performance.",
"C. A movie premiere.",
"D. A rock concert."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of event is being depicted in the video?\nOption:\nA. A theater play.\nB. A classical opera performance.\nC. A movie premiere.\nD. A rock concert.\nAnswer with the option's letter from the given choices directly.",
1461,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "488-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1462,
"target": "B",
"doc": {
"video_id": "488",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PB06auioy0Y",
"videoID": "PB06auioy0Y",
"question_id": "488-2",
"task_type": "Action Recognition",
"question": "What does the band do at the end of this song?",
"options": [
"A. They make a quick exit so as to naturally segue into the next performance.",
"B. They make a bowed.",
"C. They exchange instruments and perform a new act.",
"D. They begin the meet-and-greet and meet the audience face-to-face."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the band do at the end of this song?\nOption:\nA. They make a quick exit so as to naturally segue into the next performance.\nB. They make a bowed.\nC. They exchange instruments and perform a new act.\nD. They begin the meet-and-greet and meet the audience face-to-face.\nAnswer with the option's letter from the given choices directly.",
1462,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "488-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1463,
"target": "D",
"doc": {
"video_id": "488",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=PB06auioy0Y",
"videoID": "PB06auioy0Y",
"question_id": "488-3",
"task_type": "Attribute Perception",
"question": "What is likely the emotional tone of the song being performed?",
"options": [
"A. Mysterious and intriguing.",
"B. Joyful and uplifting.",
"C. Calm and relaxing.",
"D. Somber and reflective."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is likely the emotional tone of the song being performed?\nOption:\nA. Mysterious and intriguing.\nB. Joyful and uplifting.\nC. Calm and relaxing.\nD. Somber and reflective.\nAnswer with the option's letter from the given choices directly.",
1463,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "488-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1464,
"target": "C",
"doc": {
"video_id": "489",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=_w4XRiUVfY4",
"videoID": "_w4XRiUVfY4",
"question_id": "489-1",
"task_type": "Counting Problem",
"question": "How many people appear in the video?",
"options": [
"A. 13.",
"B. 14.",
"C. 15.",
"D. 12."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people appear in the video?\nOption:\nA. 13.\nB. 14.\nC. 15.\nD. 12.\nAnswer with the option's letter from the given choices directly.",
1464,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "489-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1465,
"target": "A",
"doc": {
"video_id": "489",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=_w4XRiUVfY4",
"videoID": "_w4XRiUVfY4",
"question_id": "489-2",
"task_type": "OCR Problems",
"question": "Which institution is associated with Pitch Slapped in the video?",
"options": [
"A. Berklee College of Music.",
"B. Juilliard School.",
"C. Harvard University.",
"D. Oxford University."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which institution is associated with Pitch Slapped in the video?\nOption:\nA. Berklee College of Music.\nB. Juilliard School.\nC. Harvard University.\nD. Oxford University.\nAnswer with the option's letter from the given choices directly.",
1465,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "489-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1466,
"target": "C",
"doc": {
"video_id": "489",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=_w4XRiUVfY4",
"videoID": "_w4XRiUVfY4",
"question_id": "489-3",
"task_type": "Attribute Perception",
"question": "Based on the video, how is the group dressed for their performance?",
"options": [
"A. Costumes from different musical eras.",
"B. Casual streetwear.",
"C. Formal attire with black and white colors.",
"D. Matching school uniforms."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, how is the group dressed for their performance?\nOption:\nA. Costumes from different musical eras.\nB. Casual streetwear.\nC. Formal attire with black and white colors.\nD. Matching school uniforms.\nAnswer with the option's letter from the given choices directly.",
1466,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "489-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1467,
"target": "B",
"doc": {
"video_id": "490",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://m.youtube.com/watch?v=azZZZbSwLQght",
"videoID": "azZZZbSwLQght",
"question_id": "490-1",
"task_type": "Spatial Reasoning",
"question": "What is the event in the video?",
"options": [
"A. The Grammy Awards.",
"B. The London 2012 Olympic Games.",
"C. The Super Bowl Halftime Show.",
"D. The Brit Awards."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the event in the video?\nOption:\nA. The Grammy Awards.\nB. The London 2012 Olympic Games.\nC. The Super Bowl Halftime Show.\nD. The Brit Awards.\nAnswer with the option's letter from the given choices directly.",
1467,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "490-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1468,
"target": "D",
"doc": {
"video_id": "490",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://m.youtube.com/watch?v=azZZZbSwLQght",
"videoID": "azZZZbSwLQght",
"question_id": "490-2",
"task_type": "Object Recognition",
"question": "What significant object is present on the stage?",
"options": [
"A. An oversized guitar.",
"B. A large drum set.",
"C. A grand piano.",
"D. A giant bell."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What significant object is present on the stage?\nOption:\nA. An oversized guitar.\nB. A large drum set.\nC. A grand piano.\nD. A giant bell.\nAnswer with the option's letter from the given choices directly.",
1468,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "490-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1469,
"target": "A",
"doc": {
"video_id": "490",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://m.youtube.com/watch?v=azZZZbSwLQght",
"videoID": "azZZZbSwLQght",
"question_id": "490-3",
"task_type": "Spatial Reasoning",
"question": "What type of venue is the man performing according to the video?",
"options": [
"A. A large stadium.",
"B. A small, intimate club.",
"C. A private event space.",
"D. An open-air festival."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of venue is the man performing according to the video?\nOption:\nA. A large stadium.\nB. A small, intimate club.\nC. A private event space.\nD. An open-air festival.\nAnswer with the option's letter from the given choices directly.",
1469,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "490-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1470,
"target": "B",
"doc": {
"video_id": "491",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=dez7DLniEJA",
"videoID": "dez7DLniEJA",
"question_id": "491-1",
"task_type": "Action Recognition",
"question": "How does the man demonstrate the shake change?",
"options": [
"A. He turns 4 of spades into 7 of diamonds.",
"B. He turns 4 of spades into 7 of hearts.",
"C. He turns 7 of diamonds into 4 of spades.",
"D. He turns 7 of hearts into 4 of spades."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the man demonstrate the shake change?\nOption:\nA. He turns 4 of spades into 7 of diamonds.\nB. He turns 4 of spades into 7 of hearts.\nC. He turns 7 of diamonds into 4 of spades.\nD. He turns 7 of hearts into 4 of spades.\nAnswer with the option's letter from the given choices directly.",
1470,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "491-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1471,
"target": "C",
"doc": {
"video_id": "491",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=dez7DLniEJA",
"videoID": "dez7DLniEJA",
"question_id": "491-2",
"task_type": "Temporal Reasoning",
"question": "What is the chronological order of the following events according to this video?\n(a) The Pinkie Change.\n(b) The Shake Change.\n(c) The Shake Change.",
"options": [
"A. (a)(b)(c).",
"B. (b)(c)(a).",
"C. (b)(a)(c).",
"D. (c)(b)(a)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the chronological order of the following events according to this video?\n(a) The Pinkie Change.\n(b) The Shake Change.\n(c) The Shake Change.\nOption:\nA. (a)(b)(c).\nB. (b)(c)(a).\nC. (b)(a)(c).\nD. (c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
1471,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "491-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1472,
"target": "A",
"doc": {
"video_id": "491",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=dez7DLniEJA",
"videoID": "dez7DLniEJA",
"question_id": "491-3",
"task_type": "Object Reasoning",
"question": "What is not common among the three card tricks?",
"options": [
"A. They all require a special card.",
"B. They do not need extra tools.",
"C. They are easy to learn.",
"D. They are visually shocking."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is not common among the three card tricks?\nOption:\nA. They all require a special card.\nB. They do not need extra tools.\nC. They are easy to learn.\nD. They are visually shocking.\nAnswer with the option's letter from the given choices directly.",
1472,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "491-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1473,
"target": "A",
"doc": {
"video_id": "492",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=qNWL4S5bDt4",
"videoID": "qNWL4S5bDt4",
"question_id": "492-1",
"task_type": "Action Recognition",
"question": "How does the performer produce the fifth pigeon?",
"options": [
"A. From an egg.",
"B. From feather.",
"C. From another pigeon.",
"D. From a hat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the performer produce the fifth pigeon?\nOption:\nA. From an egg.\nB. From feather.\nC. From another pigeon.\nD. From a hat.\nAnswer with the option's letter from the given choices directly.",
1473,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "492-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1474,
"target": "D",
"doc": {
"video_id": "492",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=qNWL4S5bDt4",
"videoID": "qNWL4S5bDt4",
"question_id": "492-2",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happened when the performer made the first two pigeons?\n(a) He duplicated the pigeon and produced another.\n(b) He produced a pigeon from a burning leather.\n(c) He put the pigeons into a cage.",
"options": [
"A. (a)(b)(c).",
"B. (b)(c)(a).",
"C. (c)(b)(a).",
"D. (b)(a)(c)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happened when the performer made the first two pigeons?\n(a) He duplicated the pigeon and produced another.\n(b) He produced a pigeon from a burning leather.\n(c) He put the pigeons into a cage.\nOption:\nA. (a)(b)(c).\nB. (b)(c)(a).\nC. (c)(b)(a).\nD. (b)(a)(c).\nAnswer with the option's letter from the given choices directly.",
1474,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "492-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1475,
"target": "B",
"doc": {
"video_id": "492",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=qNWL4S5bDt4",
"videoID": "qNWL4S5bDt4",
"question_id": "492-3",
"task_type": "Action Recognition",
"question": "What is the last magic during the performance?",
"options": [
"A. The magician produces several pigeons.",
"B. The magician gets into the cage, secretly escapes from it, and disguises as a photographer.",
"C. The magician produces a woman from a white pigeon.",
"D. The magician asks a woman to get into the cage and makes the woman disappear."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the last magic during the performance?\nOption:\nA. The magician produces several pigeons.\nB. The magician gets into the cage, secretly escapes from it, and disguises as a photographer.\nC. The magician produces a woman from a white pigeon.\nD. The magician asks a woman to get into the cage and makes the woman disappear.\nAnswer with the option's letter from the given choices directly.",
1475,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "492-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1476,
"target": "D",
"doc": {
"video_id": "493",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=F4v0BZtrSbY",
"videoID": "F4v0BZtrSbY",
"question_id": "493-1",
"task_type": "Object Reasoning",
"question": "What is the function of the bike on the stage?",
"options": [
"A. It serves as a tool for magic.",
"B. It helps the magician leave the stage quickly.",
"C. It is a sponsored advertisement.",
"D. It is just for stage background."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the bike on the stage?\nOption:\nA. It serves as a tool for magic.\nB. It helps the magician leave the stage quickly.\nC. It is a sponsored advertisement.\nD. It is just for stage background.\nAnswer with the option's letter from the given choices directly.",
1476,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "493-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1477,
"target": "B",
"doc": {
"video_id": "493",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=F4v0BZtrSbY",
"videoID": "F4v0BZtrSbY",
"question_id": "493-2",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order are the following items used for magic?\n(a) Cards.\n(b) A stick.\n(c) A ring.\n(d) Paper money.",
"options": [
"A. (a)(d)(b)(c).",
"B. (b)(a)(d)(c).",
"C. (b)(c)(d)(a).",
"D. (c)(b)(a)(d)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order are the following items used for magic?\n(a) Cards.\n(b) A stick.\n(c) A ring.\n(d) Paper money.\nOption:\nA. (a)(d)(b)(c).\nB. (b)(a)(d)(c).\nC. (b)(c)(d)(a).\nD. (c)(b)(a)(d).\nAnswer with the option's letter from the given choices directly.",
1477,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "493-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1478,
"target": "C",
"doc": {
"video_id": "493",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=F4v0BZtrSbY",
"videoID": "F4v0BZtrSbY",
"question_id": "493-3",
"task_type": "Action Recognition",
"question": "What happened to the ring?",
"options": [
"A. It was levitating while spinning, and then dropped on the ground.",
"B. It was taken away by the female judge.",
"C. It was levitating while spinning, and then put on the finger of the female judge.",
"D. It disappeared mysteriously from the stage."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the ring?\nOption:\nA. It was levitating while spinning, and then dropped on the ground.\nB. It was taken away by the female judge.\nC. It was levitating while spinning, and then put on the finger of the female judge.\nD. It disappeared mysteriously from the stage.\nAnswer with the option's letter from the given choices directly.",
1478,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "493-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1479,
"target": "C",
"doc": {
"video_id": "494",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=a8IGDMrohnY",
"videoID": "a8IGDMrohnY",
"question_id": "494-1",
"task_type": "Object Recognition",
"question": "What does the little magician wear?",
"options": [
"A. A necklace.",
"B. A white sockpuppet.",
"C. A pair of glasses.",
"D. A pair of earings."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the little magician wear?\nOption:\nA. A necklace.\nB. A white sockpuppet.\nC. A pair of glasses.\nD. A pair of earings.\nAnswer with the option's letter from the given choices directly.",
1479,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "494-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1480,
"target": "A",
"doc": {
"video_id": "494",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=a8IGDMrohnY",
"videoID": "a8IGDMrohnY",
"question_id": "494-2",
"task_type": "Object Recognition",
"question": "What card does the male judge pick?",
"options": [
"A. 2 of spades.",
"B. 2 of diamonds.",
"C. 2 of hearts.",
"D. 2 of clubs."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What card does the male judge pick?\nOption:\nA. 2 of spades.\nB. 2 of diamonds.\nC. 2 of hearts.\nD. 2 of clubs.\nAnswer with the option's letter from the given choices directly.",
1480,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "494-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1481,
"target": "B",
"doc": {
"video_id": "494",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=a8IGDMrohnY",
"videoID": "a8IGDMrohnY",
"question_id": "494-3",
"task_type": "Action Recognition",
"question": "What is the magic the magician playing?",
"options": [
"A. He steals the card which the male judge randomly picked.",
"B. He picks the same card that the male judge randomly picked.",
"C. He makes disappear the card that the male judge randomly picked.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the magic the magician playing?\nOption:\nA. He steals the card which the male judge randomly picked.\nB. He picks the same card that the male judge randomly picked.\nC. He makes disappear the card that the male judge randomly picked.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1481,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "494-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1482,
"target": "B",
"doc": {
"video_id": "495",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=WNYi8LUGIZc",
"videoID": "WNYi8LUGIZc",
"question_id": "495-1",
"task_type": "Attribute Perception",
"question": "What type of cards does the magician ask the female judge to pick one from?",
"options": [
"A. Drinks.",
"B. Celebrities.",
"C. Animals.",
"D. Transportations."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of cards does the magician ask the female judge to pick one from?\nOption:\nA. Drinks.\nB. Celebrities.\nC. Animals.\nD. Transportations.\nAnswer with the option's letter from the given choices directly.",
1482,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "495-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1483,
"target": "C",
"doc": {
"video_id": "495",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=WNYi8LUGIZc",
"videoID": "WNYi8LUGIZc",
"question_id": "495-2",
"task_type": "Action Recognition",
"question": "What magic does the magician first perform on the stage?",
"options": [
"A. He cuts paper to make the drink the female judge is thinking.",
"B. He cuts paper to make the avatar of the celebrity the male judge is thinking of.",
"C. He cuts paper to make the avatar of the celebrity the female judge is thinking of.",
"D. He cuts paper to make the drink the male judge is thinking."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What magic does the magician first perform on the stage?\nOption:\nA. He cuts paper to make the drink the female judge is thinking.\nB. He cuts paper to make the avatar of the celebrity the male judge is thinking of.\nC. He cuts paper to make the avatar of the celebrity the female judge is thinking of.\nD. He cuts paper to make the drink the male judge is thinking.\nAnswer with the option's letter from the given choices directly.",
1483,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "495-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1484,
"target": "A",
"doc": {
"video_id": "495",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=WNYi8LUGIZc",
"videoID": "WNYi8LUGIZc",
"question_id": "495-3",
"task_type": "Object Reasoning",
"question": "What do the two performances have in common?",
"options": [
"A. They both need interactions with judges.",
"B. The magicians in the two performances both wear a tie.",
"C. They both need playing cards as performance props.",
"D. They both have purple stage backgrounds."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two performances have in common?\nOption:\nA. They both need interactions with judges.\nB. The magicians in the two performances both wear a tie.\nC. They both need playing cards as performance props.\nD. They both have purple stage backgrounds.\nAnswer with the option's letter from the given choices directly.",
1484,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "495-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1485,
"target": "D",
"doc": {
"video_id": "496",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=1uqupftxFOM",
"videoID": "1uqupftxFOM",
"question_id": "496-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happen?\n(a) The magician took the white cloth away from the photo frame.\n(b) The magician led a self introduction.\n(c) The magician produced a feather.",
"options": [
"A. (a)(b)(c).",
"B. (b)(c)(a).",
"C. (c)(b)(a).",
"D. (b)(a)(c)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happen?\n(a) The magician took the white cloth away from the photo frame.\n(b) The magician led a self introduction.\n(c) The magician produced a feather.\nOption:\nA. (a)(b)(c).\nB. (b)(c)(a).\nC. (c)(b)(a).\nD. (b)(a)(c).\nAnswer with the option's letter from the given choices directly.",
1485,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "496-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1486,
"target": "B",
"doc": {
"video_id": "496",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=1uqupftxFOM",
"videoID": "1uqupftxFOM",
"question_id": "496-2",
"task_type": "Action Recognition",
"question": "What is the last magic the magician played?",
"options": [
"A. He produced a small leather.",
"B. He produced a large leather from a levitating small leather.",
"C. His hands waved in the air.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the last magic the magician played?\nOption:\nA. He produced a small leather.\nB. He produced a large leather from a levitating small leather.\nC. His hands waved in the air.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1486,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "496-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1487,
"target": "C",
"doc": {
"video_id": "496",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=1uqupftxFOM",
"videoID": "1uqupftxFOM",
"question_id": "496-3",
"task_type": "Action Reasoning",
"question": "What sentence best describes the performance?",
"options": [
"A. The magic is not wonderful enough to make the audiences cheer up.",
"B. One of the judges stops the performance.",
"C. The magician does the magic with his eyes closed.",
"D. The magician needs other tools besides leathers."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sentence best describes the performance?\nOption:\nA. The magic is not wonderful enough to make the audiences cheer up.\nB. One of the judges stops the performance.\nC. The magician does the magic with his eyes closed.\nD. The magician needs other tools besides leathers.\nAnswer with the option's letter from the given choices directly.",
1487,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "496-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1488,
"target": "A",
"doc": {
"video_id": "497",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=9HPFkUhOp1Q",
"videoID": "9HPFkUhOp1Q",
"question_id": "497-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happen?\n(a) A ring jumps into the finger.\n(b) A coin jumps from hand to hand.\n(c) The cigarette disappers and reappears.\n(d) A coin changes into a card.",
"options": [
"A. (a)(c)(d)(b).",
"B. (a)(b)(d)(c).",
"C. (b)(c)(a)(d).",
"D. (c)(d)(b)(a)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happen?\n(a) A ring jumps into the finger.\n(b) A coin jumps from hand to hand.\n(c) The cigarette disappers and reappears.\n(d) A coin changes into a card.\nOption:\nA. (a)(c)(d)(b).\nB. (a)(b)(d)(c).\nC. (b)(c)(a)(d).\nD. (c)(d)(b)(a).\nAnswer with the option's letter from the given choices directly.",
1488,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "497-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1489,
"target": "D",
"doc": {
"video_id": "497",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=9HPFkUhOp1Q",
"videoID": "9HPFkUhOp1Q",
"question_id": "497-2",
"task_type": "Spatial Perception",
"question": "Where is the disappearing coin gone in the fifth magic?",
"options": [
"A. It is thrown under the table.",
"B. It is hidden in sleeves.",
"C. It is hidden behind the playing card.",
"D. It is hidden in the palm of the magician's hand."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the disappearing coin gone in the fifth magic?\nOption:\nA. It is thrown under the table.\nB. It is hidden in sleeves.\nC. It is hidden behind the playing card.\nD. It is hidden in the palm of the magician's hand.\nAnswer with the option's letter from the given choices directly.",
1489,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "497-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1490,
"target": "B",
"doc": {
"video_id": "497",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=9HPFkUhOp1Q",
"videoID": "9HPFkUhOp1Q",
"question_id": "497-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It distinguishs card magics from coin magics.",
"B. It teaches eight impossible magic tricks anyone can do.",
"C. It introduces the most difficult images around the world.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It distinguishs card magics from coin magics.\nB. It teaches eight impossible magic tricks anyone can do.\nC. It introduces the most difficult images around the world.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1490,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "497-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1491,
"target": "C",
"doc": {
"video_id": "498",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=z2I5J1ArmeM",
"videoID": "z2I5J1ArmeM",
"question_id": "498-1",
"task_type": "Object Reasoning",
"question": "What is the key to the illusion of the basic levitating magic that is performed at home?",
"options": [
"A. A pair of shoes is equipped with an engine.",
"B. A pair of shoes are of different sizes.",
"C. A pair of shoes are stuck together.",
"D. A pair of shoes are made of special materials."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the key to the illusion of the basic levitating magic that is performed at home?\nOption:\nA. A pair of shoes is equipped with an engine.\nB. A pair of shoes are of different sizes.\nC. A pair of shoes are stuck together.\nD. A pair of shoes are made of special materials.\nAnswer with the option's letter from the given choices directly.",
1491,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "498-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1492,
"target": "A",
"doc": {
"video_id": "498",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=z2I5J1ArmeM",
"videoID": "z2I5J1ArmeM",
"question_id": "498-2",
"task_type": "Action Recognition",
"question": "What is the advanced professional magic performed on television?",
"options": [
"A. The performer levitates with only a stick supporting him.",
"B. The performer levitates without any support.",
"C. The performer levitates a woman.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the advanced professional magic performed on television?\nOption:\nA. The performer levitates with only a stick supporting him.\nB. The performer levitates without any support.\nC. The performer levitates a woman.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1492,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "498-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1493,
"target": "A",
"doc": {
"video_id": "498",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=z2I5J1ArmeM",
"videoID": "z2I5J1ArmeM",
"question_id": "498-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It covers a few of the many methods magicians use to levitate.",
"B. It highlights several levitating performances on television.",
"C. It shows how levitating affects our daily lives.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It covers a few of the many methods magicians use to levitate.\nB. It highlights several levitating performances on television.\nC. It shows how levitating affects our daily lives.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1493,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "498-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1494,
"target": "B",
"doc": {
"video_id": "499",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=umfD-K4OTKk",
"videoID": "umfD-K4OTKk",
"question_id": "499-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happen?\n(a) The flower in the mouth turns into a butterfly.\n(b) The turban turns into a real snake.\n(c) The spider painting on the arm turns into a real spider.",
"options": [
"A. (a)(b)(c).",
"B. (b)(a)(c).",
"C. (b)(c)(a).",
"D. (c)(b)(a)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happen?\n(a) The flower in the mouth turns into a butterfly.\n(b) The turban turns into a real snake.\n(c) The spider painting on the arm turns into a real spider.\nOption:\nA. (a)(b)(c).\nB. (b)(a)(c).\nC. (b)(c)(a).\nD. (c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
1494,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "499-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1495,
"target": "C",
"doc": {
"video_id": "499",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=umfD-K4OTKk",
"videoID": "umfD-K4OTKk",
"question_id": "499-2",
"task_type": "Spatial Perception",
"question": "Where are most of the magics in this video performed?",
"options": [
"A. On the stage.",
"B. Beside the swimming pool.",
"C. Inner room such as cafe or bar.",
"D. At home."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where are most of the magics in this video performed?\nOption:\nA. On the stage.\nB. Beside the swimming pool.\nC. Inner room such as cafe or bar.\nD. At home.\nAnswer with the option's letter from the given choices directly.",
1495,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "499-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1496,
"target": "A",
"doc": {
"video_id": "499",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=umfD-K4OTKk",
"videoID": "umfD-K4OTKk",
"question_id": "499-3",
"task_type": "Counting Problem",
"question": "How many magics does the magician play?",
"options": [
"A. 7.",
"B. 6.",
"C. 8.",
"D. 9."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many magics does the magician play?\nOption:\nA. 7.\nB. 6.\nC. 8.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
1496,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "499-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1497,
"target": "D",
"doc": {
"video_id": "500",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=MZdvZBy1bE8",
"videoID": "MZdvZBy1bE8",
"question_id": "500-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happen?\n(a) The magician pulls a card out of his phone.\n(b) The magician pulls the silk through the middle of the screen.\n(c) The phone goes inside the bottle.\n(d) The magician visually removes the torchlight from his phone.\n(e) The icons fall off the screen onto the magician's hand.",
"options": [
"A. (a)(b)(c)(d)(e).",
"B. (a)(c)(d)(b)(e).",
"C. (a)(e)(c)(b)(d).",
"D. (a)(b)(e)(c)(d)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happen?\n(a) The magician pulls a card out of his phone.\n(b) The magician pulls the silk through the middle of the screen.\n(c) The phone goes inside the bottle.\n(d) The magician visually removes the torchlight from his phone.\n(e) The icons fall off the screen onto the magician's hand.\nOption:\nA. (a)(b)(c)(d)(e).\nB. (a)(c)(d)(b)(e).\nC. (a)(e)(c)(b)(d).\nD. (a)(b)(e)(c)(d).\nAnswer with the option's letter from the given choices directly.",
1497,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "500-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1498,
"target": "B",
"doc": {
"video_id": "500",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=MZdvZBy1bE8",
"videoID": "MZdvZBy1bE8",
"question_id": "500-2",
"task_type": "Action Recognition",
"question": "What is the bonus trick?",
"options": [
"A. The magician pulls the silk through the middle of the screen.",
"B. The magician visually removes the torchlight from his phone.",
"C. The magician pulls a card out of his phone.",
"D. The icons fall off the screen onto the magician's hand."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the bonus trick?\nOption:\nA. The magician pulls the silk through the middle of the screen.\nB. The magician visually removes the torchlight from his phone.\nC. The magician pulls a card out of his phone.\nD. The icons fall off the screen onto the magician's hand.\nAnswer with the option's letter from the given choices directly.",
1498,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "500-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1499,
"target": "C",
"doc": {
"video_id": "500",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=MZdvZBy1bE8",
"videoID": "MZdvZBy1bE8",
"question_id": "500-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It teaches five visual cards magic tricks anyone can do.",
"B. It teaches six visual dice magic tricks anyone can do.",
"C. It teaches nine visual phone magic tricks anyone can do.",
"D. It teaches three visual rubber magic tricks anyone can do."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It teaches five visual cards magic tricks anyone can do.\nB. It teaches six visual dice magic tricks anyone can do.\nC. It teaches nine visual phone magic tricks anyone can do.\nD. It teaches three visual rubber magic tricks anyone can do.\nAnswer with the option's letter from the given choices directly.",
1499,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "500-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1500,
"target": "A",
"doc": {
"video_id": "501",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=ad_higXixRA",
"videoID": "ad_higXixRA",
"question_id": "501-1",
"task_type": "Object Reasoning",
"question": "The person wearing a red shirt on stage is what role?",
"options": [
"A. Guest of the show.",
"B. Audience member of the show.",
"C. Host of the show.",
"D. Janitor."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The person wearing a red shirt on stage is what role?\nOption:\nA. Guest of the show.\nB. Audience member of the show.\nC. Host of the show.\nD. Janitor.\nAnswer with the option's letter from the given choices directly.",
1500,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "501-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1501,
"target": "B",
"doc": {
"video_id": "501",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=ad_higXixRA",
"videoID": "ad_higXixRA",
"question_id": "501-2",
"task_type": "Counting Problem",
"question": "How many items did the guest accurately guess the prices of, with a margin of error within $1?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many items did the guest accurately guess the prices of, with a margin of error within $1?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1501,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "501-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1502,
"target": "A",
"doc": {
"video_id": "501",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=ad_higXixRA",
"videoID": "ad_higXixRA",
"question_id": "501-3",
"task_type": "OCR Problems",
"question": "In the video, how much was the guest off by when guessing the price of the second item?",
"options": [
"A. 9.97.",
"B. 5.",
"C. 19.97.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how much was the guest off by when guessing the price of the second item?\nOption:\nA. 9.97.\nB. 5.\nC. 19.97.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1502,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "501-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1503,
"target": "D",
"doc": {
"video_id": "502",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=XoaWHHNiw1w",
"videoID": "XoaWHHNiw1w",
"question_id": "502-1",
"task_type": "Object Recognition",
"question": "What type of flowers are placed on the table between the host and the guest in the video?",
"options": [
"A. Lilies.",
"B. Roses.",
"C. Chrysanthemums.",
"D. Sunflowers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of flowers are placed on the table between the host and the guest in the video?\nOption:\nA. Lilies.\nB. Roses.\nC. Chrysanthemums.\nD. Sunflowers.\nAnswer with the option's letter from the given choices directly.",
1503,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "502-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1504,
"target": "D",
"doc": {
"video_id": "502",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=XoaWHHNiw1w",
"videoID": "XoaWHHNiw1w",
"question_id": "502-2",
"task_type": "Object Recognition",
"question": "What is the first tattoo about the female guest written by her boyfriend in the video?",
"options": [
"A. kim brand.",
"B. my girl is a doctor.",
"C. kim.",
"D. my girl is a lawyer."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first tattoo about the female guest written by her boyfriend in the video?\nOption:\nA. kim brand.\nB. my girl is a doctor.\nC. kim.\nD. my girl is a lawyer.\nAnswer with the option's letter from the given choices directly.",
1504,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "502-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1505,
"target": "A",
"doc": {
"video_id": "502",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=XoaWHHNiw1w",
"videoID": "XoaWHHNiw1w",
"question_id": "502-3",
"task_type": "Object Reasoning",
"question": "According to the video, who is less likely to get scared?",
"options": [
"A. The show host.",
"B. Porsha.",
"C. Kim Kardashian.",
"D. Pete."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, who is less likely to get scared?\nOption:\nA. The show host.\nB. Porsha.\nC. Kim Kardashian.\nD. Pete.\nAnswer with the option's letter from the given choices directly.",
1505,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "502-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1506,
"target": "C",
"doc": {
"video_id": "503",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=S_MAesZsnMk",
"videoID": "S_MAesZsnMk",
"question_id": "503-1",
"task_type": "OCR Problems",
"question": "According to the video, what is written on the judge's table?",
"options": [
"A. Monster.",
"B. Hollywood.",
"C. America Idol.",
"D. I am Tongi."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is written on the judge's table?\nOption:\nA. Monster.\nB. Hollywood.\nC. America Idol.\nD. I am Tongi.\nAnswer with the option's letter from the given choices directly.",
1506,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "503-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1507,
"target": "B",
"doc": {
"video_id": "503",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=S_MAesZsnMk",
"videoID": "S_MAesZsnMk",
"question_id": "503-2",
"task_type": "Action Reasoning",
"question": "According to the video, why did the judges on stage cry?",
"options": [
"A. Tongi's singing was too awful.",
"B. Moved by Tongi's singing.",
"C. One of the judges just lost their father.",
"D. Unable to determine."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why did the judges on stage cry?\nOption:\nA. Tongi's singing was too awful.\nB. Moved by Tongi's singing.\nC. One of the judges just lost their father.\nD. Unable to determine.\nAnswer with the option's letter from the given choices directly.",
1507,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "503-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1508,
"target": "C",
"doc": {
"video_id": "503",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=S_MAesZsnMk",
"videoID": "S_MAesZsnMk",
"question_id": "503-3",
"task_type": "Information Synopsis",
"question": "What is the video about?",
"options": [
"A. A storytelling program.",
"B. A movie shooting scene.",
"C. A talent show.",
"D. Singing tutorial."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video about?\nOption:\nA. A storytelling program.\nB. A movie shooting scene.\nC. A talent show.\nD. Singing tutorial.\nAnswer with the option's letter from the given choices directly.",
1508,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "503-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1509,
"target": "A",
"doc": {
"video_id": "504",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=1sTNqJVrqx8",
"videoID": "1sTNqJVrqx8",
"question_id": "504-1",
"task_type": "Object Recognition",
"question": "What other instrument is there on the stage besides the piano?",
"options": [
"A. Guitar.",
"B. Trumpet.",
"C. Saxophone.",
"D. Violin."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What other instrument is there on the stage besides the piano?\nOption:\nA. Guitar.\nB. Trumpet.\nC. Saxophone.\nD. Violin.\nAnswer with the option's letter from the given choices directly.",
1509,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "504-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1510,
"target": "C",
"doc": {
"video_id": "504",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=1sTNqJVrqx8",
"videoID": "1sTNqJVrqx8",
"question_id": "504-2",
"task_type": "Attribute Perception",
"question": "Whose song did the performer sing in the video?",
"options": [
"A. Michael Jackson.",
"B. Adele.",
"C. Lady Gaga.",
"D. Taylor Swift."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Whose song did the performer sing in the video?\nOption:\nA. Michael Jackson.\nB. Adele.\nC. Lady Gaga.\nD. Taylor Swift.\nAnswer with the option's letter from the given choices directly.",
1510,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "504-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1511,
"target": "C",
"doc": {
"video_id": "504",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=1sTNqJVrqx8",
"videoID": "1sTNqJVrqx8",
"question_id": "504-3",
"task_type": "Spatial Perception",
"question": "From the perspective of the performer, is the judge wearing black clothes on her left or right side?",
"options": [
"A. Right.",
"B. Middle.",
"C. Left.",
"D. There are no judges wearing black clothes."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From the perspective of the performer, is the judge wearing black clothes on her left or right side?\nOption:\nA. Right.\nB. Middle.\nC. Left.\nD. There are no judges wearing black clothes.\nAnswer with the option's letter from the given choices directly.",
1511,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "504-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1512,
"target": "A",
"doc": {
"video_id": "505",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=kq9Q9-U0vrc",
"videoID": "kq9Q9-U0vrc",
"question_id": "505-1",
"task_type": "Attribute Perception",
"question": "What name did the man wearing the black shirt use when booking the room according to the video?",
"options": [
"A. Adams.",
"B. Danny.",
"C. Star.",
"D. Lounge."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What name did the man wearing the black shirt use when booking the room according to the video?\nOption:\nA. Adams.\nB. Danny.\nC. Star.\nD. Lounge.\nAnswer with the option's letter from the given choices directly.",
1512,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "505-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1513,
"target": "D",
"doc": {
"video_id": "505",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=kq9Q9-U0vrc",
"videoID": "kq9Q9-U0vrc",
"question_id": "505-2",
"task_type": "Object Reasoning",
"question": "The receptionist in the video has another identity, which is most likely?",
"options": [
"A. A guest of the inn.",
"B. A robot.",
"C. The inn's chef.",
"D. A member of The Danny Band."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The receptionist in the video has another identity, which is most likely?\nOption:\nA. A guest of the inn.\nB. A robot.\nC. The inn's chef.\nD. A member of The Danny Band.\nAnswer with the option's letter from the given choices directly.",
1513,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "505-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1514,
"target": "C",
"doc": {
"video_id": "505",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=kq9Q9-U0vrc",
"videoID": "kq9Q9-U0vrc",
"question_id": "505-3",
"task_type": "Temporal Perception",
"question": "At what time does a man in military uniform appear in the video?",
"options": [
"A. The end of the video.",
"B. There is no man in military uniform appearing.",
"C. The beginning of the video.",
"D. The middle of the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At what time does a man in military uniform appear in the video?\nOption:\nA. The end of the video.\nB. There is no man in military uniform appearing.\nC. The beginning of the video.\nD. The middle of the video.\nAnswer with the option's letter from the given choices directly.",
1514,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "505-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1515,
"target": "D",
"doc": {
"video_id": "506",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=xYdeRoM78h4",
"videoID": "xYdeRoM78h4",
"question_id": "506-1",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. A live broadcast of an obstacle course show.",
"B. Unable to determine.",
"C. A critique of the dangers associated with obstacle courses shows.",
"D. An analysis of each obstacle in an obstacle course show and the performance of the contestants."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. A live broadcast of an obstacle course show.\nB. Unable to determine.\nC. A critique of the dangers associated with obstacle courses shows.\nD. An analysis of each obstacle in an obstacle course show and the performance of the contestants.\nAnswer with the option's letter from the given choices directly.",
1515,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "506-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1516,
"target": "D",
"doc": {
"video_id": "506",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=xYdeRoM78h4",
"videoID": "xYdeRoM78h4",
"question_id": "506-2",
"task_type": "Temporal Perception",
"question": "Based on the video, which of the following obstacles appears last?",
"options": [
"A. Grinders.",
"B. Face platform.",
"C. Energy coils.",
"D. Digestive tract."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following obstacles appears last?\nOption:\nA. Grinders.\nB. Face platform.\nC. Energy coils.\nD. Digestive tract.\nAnswer with the option's letter from the given choices directly.",
1516,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "506-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1517,
"target": "A",
"doc": {
"video_id": "506",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=xYdeRoM78h4",
"videoID": "xYdeRoM78h4",
"question_id": "506-3",
"task_type": "Object Recognition",
"question": "According to the video, which team ultimately won?",
"options": [
"A. China.",
"B. Italy.",
"C. USA.",
"D. France."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which team ultimately won?\nOption:\nA. China.\nB. Italy.\nC. USA.\nD. France.\nAnswer with the option's letter from the given choices directly.",
1517,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "506-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1518,
"target": "A",
"doc": {
"video_id": "507",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=UTzl0lLoAJE",
"videoID": "UTzl0lLoAJE",
"question_id": "507-1",
"task_type": "Information Synopsis",
"question": "What is the content of this video about?",
"options": [
"A. Various world records related to the human body.",
"B. Introduction to racial differences among countries.",
"C. World records of various sports.",
"D. Customs and cultures of different countries."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the content of this video about?\nOption:\nA. Various world records related to the human body.\nB. Introduction to racial differences among countries.\nC. World records of various sports.\nD. Customs and cultures of different countries.\nAnswer with the option's letter from the given choices directly.",
1518,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "507-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1519,
"target": "A",
"doc": {
"video_id": "507",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=UTzl0lLoAJE",
"videoID": "UTzl0lLoAJE",
"question_id": "507-2",
"task_type": "Temporal Perception",
"question": "Based on the order of introduction in the video, which of the following options is introduced last?",
"options": [
"A. Largest mouth gape (female).",
"B. Longest legs.",
"C. Tallest family: 203.29 cm (6 ft 8.03 in).",
"D. Largest mouth gape (male)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the order of introduction in the video, which of the following options is introduced last?\nOption:\nA. Largest mouth gape (female).\nB. Longest legs.\nC. Tallest family: 203.29 cm (6 ft 8.03 in).\nD. Largest mouth gape (male).\nAnswer with the option's letter from the given choices directly.",
1519,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "507-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1520,
"target": "B",
"doc": {
"video_id": "507",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=UTzl0lLoAJE",
"videoID": "UTzl0lLoAJE",
"question_id": "507-3",
"task_type": "Counting Problem",
"question": "How many people are there in the \"tallest family\" as shown in the video?",
"options": [
"A. 2.",
"B. 5.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people are there in the \"tallest family\" as shown in the video?\nOption:\nA. 2.\nB. 5.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1520,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "507-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1521,
"target": "B",
"doc": {
"video_id": "508",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=QopYbLq-zIQ",
"videoID": "QopYbLq-zIQ",
"question_id": "508-1",
"task_type": "Temporal Reasoning",
"question": "Based on the video, what is the order in which the host tastes the sandwiches made by the contestants?\n(1) Owen; (2) Salt Hank; (3) Albert",
"options": [
"A. (1)(3)(2).",
"B. (2)(3)(1).",
"C. (3)(2)(1).",
"D. (1)(2)(3)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is the order in which the host tastes the sandwiches made by the contestants?\n(1) Owen; (2) Salt Hank; (3) Albert\nOption:\nA. (1)(3)(2).\nB. (2)(3)(1).\nC. (3)(2)(1).\nD. (1)(2)(3).\nAnswer with the option's letter from the given choices directly.",
1521,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "508-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1522,
"target": "B",
"doc": {
"video_id": "508",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=QopYbLq-zIQ",
"videoID": "QopYbLq-zIQ",
"question_id": "508-2",
"task_type": "Object Recognition",
"question": "Based on the video, who won the sandwich competition?",
"options": [
"A. Salt Hank.",
"B. Owen.",
"C. Cannot determine.",
"D. Albert."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, who won the sandwich competition?\nOption:\nA. Salt Hank.\nB. Owen.\nC. Cannot determine.\nD. Albert.\nAnswer with the option's letter from the given choices directly.",
1522,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "508-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1523,
"target": "D",
"doc": {
"video_id": "508",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=QopYbLq-zIQ",
"videoID": "QopYbLq-zIQ",
"question_id": "508-3",
"task_type": "OCR Problems",
"question": "What is the name of the program in the video?",
"options": [
"A. Ultimate Sandwich.",
"B. Gordon Ramsay.",
"C. Best Sandwich.",
"D. Idiot Sandwich."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the name of the program in the video?\nOption:\nA. Ultimate Sandwich.\nB. Gordon Ramsay.\nC. Best Sandwich.\nD. Idiot Sandwich.\nAnswer with the option's letter from the given choices directly.",
1523,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "508-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1524,
"target": "C",
"doc": {
"video_id": "509",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=MQEcUPeAdFs",
"videoID": "MQEcUPeAdFs",
"question_id": "509-1",
"task_type": "OCR Problems",
"question": "What is the first competition conducted in the video?",
"options": [
"A. Boat Carry.",
"B. Arm Wrestling.",
"C. Deadlift.",
"D. Bench Press."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first competition conducted in the video?\nOption:\nA. Boat Carry.\nB. Arm Wrestling.\nC. Deadlift.\nD. Bench Press.\nAnswer with the option's letter from the given choices directly.",
1524,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "509-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1525,
"target": "A",
"doc": {
"video_id": "509",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=MQEcUPeAdFs",
"videoID": "MQEcUPeAdFs",
"question_id": "509-2",
"task_type": "Information Synopsis",
"question": "Based on the video, which team won in the end?",
"options": [
"A. The team wearing black clothes.",
"B. The team wearing white clothes.",
"C. Cannot determine.",
"D. The two teams ended in a tie."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which team won in the end?\nOption:\nA. The team wearing black clothes.\nB. The team wearing white clothes.\nC. Cannot determine.\nD. The two teams ended in a tie.\nAnswer with the option's letter from the given choices directly.",
1525,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "509-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1526,
"target": "A",
"doc": {
"video_id": "509",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=MQEcUPeAdFs",
"videoID": "MQEcUPeAdFs",
"question_id": "509-3",
"task_type": "Counting Problem",
"question": "In the Bench Press segment, how many times did the team in white shirts complete the bench press?",
"options": [
"A. 75.",
"B. 52.",
"C. 57.",
"D. 42."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the Bench Press segment, how many times did the team in white shirts complete the bench press?\nOption:\nA. 75.\nB. 52.\nC. 57.\nD. 42.\nAnswer with the option's letter from the given choices directly.",
1526,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "509-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1527,
"target": "B",
"doc": {
"video_id": "510",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=UGMDV82m3Ro",
"videoID": "UGMDV82m3Ro",
"question_id": "510-1",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. Parenting.",
"B. Home renovation.",
"C. Family conflict.",
"D. Charity."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. Parenting.\nB. Home renovation.\nC. Family conflict.\nD. Charity.\nAnswer with the option's letter from the given choices directly.",
1527,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "510-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1528,
"target": "C",
"doc": {
"video_id": "510",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=UGMDV82m3Ro",
"videoID": "UGMDV82m3Ro",
"question_id": "510-2",
"task_type": "Temporal Perception",
"question": "In which part of the video does the red parrot appear?",
"options": [
"A. The parrot does not appear.",
"B. End of the video.",
"C. Beginning of the video.",
"D. Middle of the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which part of the video does the red parrot appear?\nOption:\nA. The parrot does not appear.\nB. End of the video.\nC. Beginning of the video.\nD. Middle of the video.\nAnswer with the option's letter from the given choices directly.",
1528,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "510-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1529,
"target": "A",
"doc": {
"video_id": "510",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=UGMDV82m3Ro",
"videoID": "UGMDV82m3Ro",
"question_id": "510-3",
"task_type": "Attribute Perception",
"question": "Based on the video, is the homeowner satisfied with the renovation results?",
"options": [
"A. Very satisfied.",
"B. Unsatisfied.",
"C. So-so.",
"D. Cannot determine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, is the homeowner satisfied with the renovation results?\nOption:\nA. Very satisfied.\nB. Unsatisfied.\nC. So-so.\nD. Cannot determine.\nAnswer with the option's letter from the given choices directly.",
1529,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "510-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1530,
"target": "B",
"doc": {
"video_id": "511",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=zbvamKv81o0",
"videoID": "zbvamKv81o0",
"question_id": "511-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. This video is mainly about a group of talented musicians performing in a symphony orchestra on stage.",
"B. This video is mainly about several incredible acrobats creating show stopping performances on stage.",
"C. This video is mainly about a group of skilled magicians performing mind-bending illusions on stage.",
"D. This video is mainly about a group of professional dancers showcasing their choreography on stage."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. This video is mainly about a group of talented musicians performing in a symphony orchestra on stage.\nB. This video is mainly about several incredible acrobats creating show stopping performances on stage.\nC. This video is mainly about a group of skilled magicians performing mind-bending illusions on stage.\nD. This video is mainly about a group of professional dancers showcasing their choreography on stage.\nAnswer with the option's letter from the given choices directly.",
1530,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "511-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1531,
"target": "C",
"doc": {
"video_id": "511",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=zbvamKv81o0",
"videoID": "zbvamKv81o0",
"question_id": "511-2",
"task_type": "Action Recognition",
"question": "From which performance is the opening of this video taken?",
"options": [
"A. The first.",
"B. The second.",
"C. The third.",
"D. The fourth."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From which performance is the opening of this video taken?\nOption:\nA. The first.\nB. The second.\nC. The third.\nD. The fourth.\nAnswer with the option's letter from the given choices directly.",
1531,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "511-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1532,
"target": "A",
"doc": {
"video_id": "511",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=zbvamKv81o0",
"videoID": "zbvamKv81o0",
"question_id": "511-3",
"task_type": "Object Recognition",
"question": "Which performance in this video has the least acrobats?",
"options": [
"A. The third.",
"B. The first.",
"C. The second.",
"D. The fourth."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which performance in this video has the least acrobats?\nOption:\nA. The third.\nB. The first.\nC. The second.\nD. The fourth.\nAnswer with the option's letter from the given choices directly.",
1532,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "511-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1533,
"target": "A",
"doc": {
"video_id": "512",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=KD1yxRyFAZQ",
"videoID": "KD1yxRyFAZQ",
"question_id": "512-1",
"task_type": "Action Recognition",
"question": "What do the judges do after the little girl finishes the performance?",
"options": [
"A. They stand up and applaud for her.",
"B. They sit on the seats and applaud for her.",
"C. They blow the little kisses.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the judges do after the little girl finishes the performance?\nOption:\nA. They stand up and applaud for her.\nB. They sit on the seats and applaud for her.\nC. They blow the little kisses.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1533,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "512-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1534,
"target": "D",
"doc": {
"video_id": "512",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=KD1yxRyFAZQ",
"videoID": "KD1yxRyFAZQ",
"question_id": "512-2",
"task_type": "Counting Problem",
"question": "How many yeses does the little girl get?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many yeses does the little girl get?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1534,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "512-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1535,
"target": "B",
"doc": {
"video_id": "512",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=KD1yxRyFAZQ",
"videoID": "KD1yxRyFAZQ",
"question_id": "512-3",
"task_type": "Action Recognition",
"question": "Which of the following skills is not included in the performance?",
"options": [
"A. Spotting.",
"B. Backward flip.",
"C. Flexibility.",
"D. Jumps."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following skills is not included in the performance?\nOption:\nA. Spotting.\nB. Backward flip.\nC. Flexibility.\nD. Jumps.\nAnswer with the option's letter from the given choices directly.",
1535,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "512-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1536,
"target": "D",
"doc": {
"video_id": "513",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=VfSIqKSiguc",
"videoID": "VfSIqKSiguc",
"question_id": "513-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. Awesome calisthenics.",
"B. Awesome wingsuit flying.",
"C. Awesome skateboarding.",
"D. Awesome gymnastics and acrobatics."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Awesome calisthenics.\nB. Awesome wingsuit flying.\nC. Awesome skateboarding.\nD. Awesome gymnastics and acrobatics.\nAnswer with the option's letter from the given choices directly.",
1536,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "513-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1537,
"target": "B",
"doc": {
"video_id": "513",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=VfSIqKSiguc",
"videoID": "VfSIqKSiguc",
"question_id": "513-2",
"task_type": "Attribute Perception",
"question": "What is the type of the last trick?",
"options": [
"A. Rope trick.",
"B. Trampoline wall trick.",
"C. Tumbling trick.",
"D. Flip trick."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the type of the last trick?\nOption:\nA. Rope trick.\nB. Trampoline wall trick.\nC. Tumbling trick.\nD. Flip trick.\nAnswer with the option's letter from the given choices directly.",
1537,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "513-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1538,
"target": "C",
"doc": {
"video_id": "513",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=VfSIqKSiguc",
"videoID": "VfSIqKSiguc",
"question_id": "513-3",
"task_type": "Spatial Perception",
"question": "Where does the first moment occur?",
"options": [
"A. On a badminton court.",
"B. Outside the house.",
"C. Inside a stadium.",
"D. In the Olympic Games."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the first moment occur?\nOption:\nA. On a badminton court.\nB. Outside the house.\nC. Inside a stadium.\nD. In the Olympic Games.\nAnswer with the option's letter from the given choices directly.",
1538,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "513-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1539,
"target": "C",
"doc": {
"video_id": "514",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=iC56xtbUOnM",
"videoID": "iC56xtbUOnM",
"question_id": "514-1",
"task_type": "Action Recognition",
"question": "What does the female performer do after she takes off her white shirt?",
"options": [
"A. She walks to the judges.",
"B. She lies down on the stage.",
"C. She grabs the ropes.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the female performer do after she takes off her white shirt?\nOption:\nA. She walks to the judges.\nB. She lies down on the stage.\nC. She grabs the ropes.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1539,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "514-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1540,
"target": "A",
"doc": {
"video_id": "514",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=iC56xtbUOnM",
"videoID": "iC56xtbUOnM",
"question_id": "514-2",
"task_type": "Action Recognition",
"question": "How do the two performers end their performance?",
"options": [
"A. The two performers end their performance with the actress jumping from the rope and being caught by the actor.",
"B. The two performers end their performance with a synchronized backflip off the stage.",
"C. The two performers end their performance with a dramatic dance routine.",
"D. The two performers end their performance with a grand finale of fire-breathing tricks."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do the two performers end their performance?\nOption:\nA. The two performers end their performance with the actress jumping from the rope and being caught by the actor.\nB. The two performers end their performance with a synchronized backflip off the stage.\nC. The two performers end their performance with a dramatic dance routine.\nD. The two performers end their performance with a grand finale of fire-breathing tricks.\nAnswer with the option's letter from the given choices directly.",
1540,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "514-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1541,
"target": "B",
"doc": {
"video_id": "514",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=iC56xtbUOnM",
"videoID": "iC56xtbUOnM",
"question_id": "514-3",
"task_type": "Action Recognition",
"question": "What does the performers do after they get four yeses?",
"options": [
"A. They jumped upside and down.",
"B. They kissed each other.",
"C. They gave a high five to each other.",
"D. They spun in circles on the stage."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the performers do after they get four yeses?\nOption:\nA. They jumped upside and down.\nB. They kissed each other.\nC. They gave a high five to each other.\nD. They spun in circles on the stage.\nAnswer with the option's letter from the given choices directly.",
1541,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "514-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1542,
"target": "B",
"doc": {
"video_id": "515",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=a7jZszvFpTY",
"videoID": "a7jZszvFpTY",
"question_id": "515-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happened?\n(a) Forming a human pole.\n(b) Jumping through white hoops.\n(c) Forming a human cross.",
"options": [
"A. (a)(b)(c).",
"B. (b)(c)(a).",
"C. (a)(c)(b).",
"D. (c)(b)(a)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happened?\n(a) Forming a human pole.\n(b) Jumping through white hoops.\n(c) Forming a human cross.\nOption:\nA. (a)(b)(c).\nB. (b)(c)(a).\nC. (a)(c)(b).\nD. (c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
1542,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "515-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1543,
"target": "C",
"doc": {
"video_id": "515",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=a7jZszvFpTY",
"videoID": "a7jZszvFpTY",
"question_id": "515-2",
"task_type": "Action Recognition",
"question": "How do the performers end their performance?",
"options": [
"A. The performers end their performance by disappearing into thin air, leaving the audience in awe.",
"B. The performers end their performance by transforming into animals and running off into the wilderness.",
"C. The performers end their performance by forming a human pole by standing on each other's shoulders and leaping together onto the stage.",
"D. The performers end their performance with a dramatic explosion of confetti and fireworks."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do the performers end their performance?\nOption:\nA. The performers end their performance by disappearing into thin air, leaving the audience in awe.\nB. The performers end their performance by transforming into animals and running off into the wilderness.\nC. The performers end their performance by forming a human pole by standing on each other's shoulders and leaping together onto the stage.\nD. The performers end their performance with a dramatic explosion of confetti and fireworks.\nAnswer with the option's letter from the given choices directly.",
1543,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "515-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1544,
"target": "A",
"doc": {
"video_id": "515",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=a7jZszvFpTY",
"videoID": "a7jZszvFpTY",
"question_id": "515-3",
"task_type": "Object Recognition",
"question": "What do the performers wear?",
"options": [
"A. Red scarfs.",
"B. White belts.",
"C. Black hats.",
"D. Rimless glasses."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the performers wear?\nOption:\nA. Red scarfs.\nB. White belts.\nC. Black hats.\nD. Rimless glasses.\nAnswer with the option's letter from the given choices directly.",
1544,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "515-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1545,
"target": "D",
"doc": {
"video_id": "516",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=hYTYb5sGznw",
"videoID": "hYTYb5sGznw",
"question_id": "516-1",
"task_type": "Attribute Perception",
"question": "What is the little girl's expression after the performance?",
"options": [
"A. She looks nervous.",
"B. She looks shocked.",
"C. She looks depressed.",
"D. She looks calm."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the little girl's expression after the performance?\nOption:\nA. She looks nervous.\nB. She looks shocked.\nC. She looks depressed.\nD. She looks calm.\nAnswer with the option's letter from the given choices directly.",
1545,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "516-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1546,
"target": "B",
"doc": {
"video_id": "516",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=hYTYb5sGznw",
"videoID": "hYTYb5sGznw",
"question_id": "516-2",
"task_type": "Action Recognition",
"question": "What is the final movement of the performance?",
"options": [
"A. The final movement of the performance is the man and woman performing a synchronized cartwheel.",
"B. The final movement of the performance is the woman handstanding on top of the man while the man balances on the pole.",
"C. The final movement of the performance is the woman balancing on one hand while the man balancing on one foot.",
"D. The final movement of the performance is the man lifting the woman above his head."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the final movement of the performance?\nOption:\nA. The final movement of the performance is the man and woman performing a synchronized cartwheel.\nB. The final movement of the performance is the woman handstanding on top of the man while the man balances on the pole.\nC. The final movement of the performance is the woman balancing on one hand while the man balancing on one foot.\nD. The final movement of the performance is the man lifting the woman above his head.\nAnswer with the option's letter from the given choices directly.",
1546,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "516-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1547,
"target": "C",
"doc": {
"video_id": "516",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=hYTYb5sGznw",
"videoID": "hYTYb5sGznw",
"question_id": "516-3",
"task_type": "Counting Problem",
"question": "How many yeses do the performers get?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many yeses do the performers get?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1547,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "516-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1548,
"target": "A",
"doc": {
"video_id": "517",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=dy8AvyRlXpQ",
"videoID": "dy8AvyRlXpQ",
"question_id": "517-1",
"task_type": "Counting Problem",
"question": "How many female performers are doing this show?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many female performers are doing this show?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1548,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "517-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1549,
"target": "D",
"doc": {
"video_id": "517",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=dy8AvyRlXpQ",
"videoID": "dy8AvyRlXpQ",
"question_id": "517-2",
"task_type": "Action Recognition",
"question": "What happened after the two actresses climbed onto the high board?",
"options": [
"A. Two actresses jumped to the swing.",
"B. They performed a series of acrobatic flips and twists in mid-air.",
"C. They started a choreographed dance routine on the platform.",
"D. One of the actresses jumped to the swing."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened after the two actresses climbed onto the high board?\nOption:\nA. Two actresses jumped to the swing.\nB. They performed a series of acrobatic flips and twists in mid-air.\nC. They started a choreographed dance routine on the platform.\nD. One of the actresses jumped to the swing.\nAnswer with the option's letter from the given choices directly.",
1549,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "517-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1550,
"target": "B",
"doc": {
"video_id": "517",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=dy8AvyRlXpQ",
"videoID": "dy8AvyRlXpQ",
"question_id": "517-3",
"task_type": "Object Reasoning",
"question": "What is the function of the actor hanging on the left swing?",
"options": [
"A. The actor hanging on the left swing is responsible for swinging higher than the performer on the right swing.",
"B. The actor hanging on the left swing is there to catch the performer who jumps out of the right swing.",
"C. The actor hanging on the left swing is there to distract the audience with his acrobatic tricks.",
"D. The actor hanging on the left swing is there to balance the weight distribution of the swings."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the actor hanging on the left swing?\nOption:\nA. The actor hanging on the left swing is responsible for swinging higher than the performer on the right swing.\nB. The actor hanging on the left swing is there to catch the performer who jumps out of the right swing.\nC. The actor hanging on the left swing is there to distract the audience with his acrobatic tricks.\nD. The actor hanging on the left swing is there to balance the weight distribution of the swings.\nAnswer with the option's letter from the given choices directly.",
1550,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "517-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1551,
"target": "C",
"doc": {
"video_id": "518",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=hZ8DmOpRTVc",
"videoID": "hZ8DmOpRTVc",
"question_id": "518-1",
"task_type": "Object Recognition",
"question": "What does the performer ride?",
"options": [
"A. A bicycle.",
"B. A tricycle.",
"C. A unicycle.",
"D. A quadracycle."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the performer ride?\nOption:\nA. A bicycle.\nB. A tricycle.\nC. A unicycle.\nD. A quadracycle.\nAnswer with the option's letter from the given choices directly.",
1551,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "518-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1552,
"target": "A",
"doc": {
"video_id": "518",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=hZ8DmOpRTVc",
"videoID": "hZ8DmOpRTVc",
"question_id": "518-2",
"task_type": "Object Reasoning",
"question": "What distinguishes the second unicycle from the others?",
"options": [
"A. Its frame is not straight.",
"B. It does not have a saddle.",
"C. Its wheel is much smaller.",
"D. It has three pedals."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What distinguishes the second unicycle from the others?\nOption:\nA. Its frame is not straight.\nB. It does not have a saddle.\nC. Its wheel is much smaller.\nD. It has three pedals.\nAnswer with the option's letter from the given choices directly.",
1552,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "518-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1553,
"target": "D",
"doc": {
"video_id": "518",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=hZ8DmOpRTVc",
"videoID": "hZ8DmOpRTVc",
"question_id": "518-3",
"task_type": "Object Reasoning",
"question": "Among the three unicycles the actor rides in the performance, which one has the highest frame?",
"options": [
"A. Their frames are of the same height.",
"B. The first.",
"C. The second.",
"D. The third."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Among the three unicycles the actor rides in the performance, which one has the highest frame?\nOption:\nA. Their frames are of the same height.\nB. The first.\nC. The second.\nD. The third.\nAnswer with the option's letter from the given choices directly.",
1553,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "518-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1554,
"target": "B",
"doc": {
"video_id": "519",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=I0xrbzYT4Sc",
"videoID": "I0xrbzYT4Sc",
"question_id": "519-1",
"task_type": "Counting Problem",
"question": "How many moments are in the opening of this video?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 1."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many moments are in the opening of this video?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1554,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "519-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1555,
"target": "C",
"doc": {
"video_id": "519",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=I0xrbzYT4Sc",
"videoID": "I0xrbzYT4Sc",
"question_id": "519-2",
"task_type": "Action Recognition",
"question": "What is the main content of the last moment?",
"options": [
"A. Motorbike rallycross.",
"B. Freestyle motocross.",
"C. Car rallycross.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of the last moment?\nOption:\nA. Motorbike rallycross.\nB. Freestyle motocross.\nC. Car rallycross.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1555,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "519-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1556,
"target": "A",
"doc": {
"video_id": "519",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=I0xrbzYT4Sc",
"videoID": "I0xrbzYT4Sc",
"question_id": "519-3",
"task_type": "Temporal Perception",
"question": "From which moment does the sport turn from freestyle motocross into a car race?",
"options": [
"A. The eighth.",
"B. The ninth.",
"C. The seventh.",
"D. The tenth."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From which moment does the sport turn from freestyle motocross into a car race?\nOption:\nA. The eighth.\nB. The ninth.\nC. The seventh.\nD. The tenth.\nAnswer with the option's letter from the given choices directly.",
1556,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "519-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1557,
"target": "D",
"doc": {
"video_id": "520",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=EFD9BLgMVK8",
"videoID": "EFD9BLgMVK8",
"question_id": "520-1",
"task_type": "Object Recognition",
"question": "Where is the performance held?",
"options": [
"A. On a stage of the talent show.",
"B. On a football court.",
"C. On a stage of the circus.",
"D. On a basketball court."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the performance held?\nOption:\nA. On a stage of the talent show.\nB. On a football court.\nC. On a stage of the circus.\nD. On a basketball court.\nAnswer with the option's letter from the given choices directly.",
1557,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "520-1",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1558,
"target": "B",
"doc": {
"video_id": "520",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=EFD9BLgMVK8",
"videoID": "EFD9BLgMVK8",
"question_id": "520-2",
"task_type": "Action Recognition",
"question": "How does the actress fall into a ball?",
"options": [
"A. She stands beside the bicycle and does a three-point shot.",
"B. She stands on the moving bicycle and taps on the ball.",
"C. She leaves the bicycle and does a slam dunk.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the actress fall into a ball?\nOption:\nA. She stands beside the bicycle and does a three-point shot.\nB. She stands on the moving bicycle and taps on the ball.\nC. She leaves the bicycle and does a slam dunk.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1558,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "520-2",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1559,
"target": "C",
"doc": {
"video_id": "520",
"duration": "medium",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=EFD9BLgMVK8",
"videoID": "EFD9BLgMVK8",
"question_id": "520-3",
"task_type": "Object Recognition",
"question": "Which character is the actress cosplayed?",
"options": [
"A. The Iron Man.",
"B. The Ultraman.",
"C. The Superman.",
"D. The Spider-Man."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which character is the actress cosplayed?\nOption:\nA. The Iron Man.\nB. The Ultraman.\nC. The Superman.\nD. The Spider-Man.\nAnswer with the option's letter from the given choices directly.",
1559,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "520-3",
"duration": "medium",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1560,
"target": "B",
"doc": {
"video_id": "521",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=qJGqZ_g__So",
"videoID": "qJGqZ_g__So",
"question_id": "521-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following events happen?\n(a) Showcasing the work.\n(b) Starting from the center.\n(c) Mastering curves.\n(d) Navigating small areas and mistakes.\n(e) Sharing the creations.",
"options": [
"A. (a)(b)(c)(d)(e).",
"B. (b)(c)(d)(a)(e).",
"C. (e)(b)(c)(a)(d).",
"D. (c)(d)(b)(a)(d)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following events happen?\n(a) Showcasing the work.\n(b) Starting from the center.\n(c) Mastering curves.\n(d) Navigating small areas and mistakes.\n(e) Sharing the creations.\nOption:\nA. (a)(b)(c)(d)(e).\nB. (b)(c)(d)(a)(e).\nC. (e)(b)(c)(a)(d).\nD. (c)(d)(b)(a)(d).\nAnswer with the option's letter from the given choices directly.",
1560,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "521-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1561,
"target": "C",
"doc": {
"video_id": "521",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=qJGqZ_g__So",
"videoID": "qJGqZ_g__So",
"question_id": "521-2",
"task_type": "Spatial Perception",
"question": "What is the position of the ring which the woman is wearing?",
"options": [
"A. On the ring finger of the right hand.",
"B. On the index finger of the left hand.",
"C. On the ring finger of the left hand.",
"D. On the index finger of the right hand."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the position of the ring which the woman is wearing?\nOption:\nA. On the ring finger of the right hand.\nB. On the index finger of the left hand.\nC. On the ring finger of the left hand.\nD. On the index finger of the right hand.\nAnswer with the option's letter from the given choices directly.",
1561,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "521-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1562,
"target": "A",
"doc": {
"video_id": "521",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=qJGqZ_g__So",
"videoID": "qJGqZ_g__So",
"question_id": "521-3",
"task_type": "Action Recognition",
"question": "In this video, what is the technique or method used by the woman to perform her paper cutting?",
"options": [
"A. To use a scalpel to cut along the curves on the craft pad.",
"B. To fold the paper multiple times.",
"C. To use scissors to trim the paper.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In this video, what is the technique or method used by the woman to perform her paper cutting?\nOption:\nA. To use a scalpel to cut along the curves on the craft pad.\nB. To fold the paper multiple times.\nC. To use scissors to trim the paper.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1562,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "521-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1563,
"target": "A",
"doc": {
"video_id": "522",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=0ag_Qi5OEd0",
"videoID": "0ag_Qi5OEd0",
"question_id": "522-1",
"task_type": "Object Recognition",
"question": "Which ingredient is not used in the video?",
"options": [
"A. Flower sticker.",
"B. Can lids.",
"C. Wool thread.",
"D. Glue gun."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient is not used in the video?\nOption:\nA. Flower sticker.\nB. Can lids.\nC. Wool thread.\nD. Glue gun.\nAnswer with the option's letter from the given choices directly.",
1563,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "522-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1564,
"target": "D",
"doc": {
"video_id": "522",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=0ag_Qi5OEd0",
"videoID": "0ag_Qi5OEd0",
"question_id": "522-2",
"task_type": "Object Reasoning",
"question": "What does the finished handcraft look like?",
"options": [
"A. A burning fire.",
"B. A sunflower.",
"C. A rattle.",
"D. An octopus."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the finished handcraft look like?\nOption:\nA. A burning fire.\nB. A sunflower.\nC. A rattle.\nD. An octopus.\nAnswer with the option's letter from the given choices directly.",
1564,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "522-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1565,
"target": "B",
"doc": {
"video_id": "522",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=0ag_Qi5OEd0",
"videoID": "0ag_Qi5OEd0",
"question_id": "522-3",
"task_type": "Temporal Reasoning",
"question": "According to the video, what is the sequential order in which the following tools are used?\n(a) Glue gun.\n(b) Needle.\n(c) Scissors.",
"options": [
"A. (a)(b)(c).",
"B. (b)(c)(a).",
"C. (b)(a)(c).",
"D. (c)(a)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the sequential order in which the following tools are used?\n(a) Glue gun.\n(b) Needle.\n(c) Scissors.\nOption:\nA. (a)(b)(c).\nB. (b)(c)(a).\nC. (b)(a)(c).\nD. (c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
1565,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "522-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1566,
"target": "D",
"doc": {
"video_id": "523",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=aAxGTnVNJiE",
"videoID": "aAxGTnVNJiE",
"question_id": "523-1",
"task_type": "Temporal Reasoning",
"question": "According to the video, what is the chronological order in which the following actions occur?\n(a) Weaving in the ends.\n(b) Crocheting a single crochet.\n(c) Finishing the handcraft.\n(d) Making a slip knot.\n(e) Crocheting a chain.",
"options": [
"A. (a)(b)(c)(d)(e).",
"B. (e)(b)(c)(a)(d).",
"C. (c)(d)(b)(a)(d).",
"D. (d)(e)(b)(a)(c)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the chronological order in which the following actions occur?\n(a) Weaving in the ends.\n(b) Crocheting a single crochet.\n(c) Finishing the handcraft.\n(d) Making a slip knot.\n(e) Crocheting a chain.\nOption:\nA. (a)(b)(c)(d)(e).\nB. (e)(b)(c)(a)(d).\nC. (c)(d)(b)(a)(d).\nD. (d)(e)(b)(a)(c).\nAnswer with the option's letter from the given choices directly.",
1566,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "523-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1567,
"target": "B",
"doc": {
"video_id": "523",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=aAxGTnVNJiE",
"videoID": "aAxGTnVNJiE",
"question_id": "523-2",
"task_type": "Attribute Perception",
"question": "What is the color of the hook?",
"options": [
"A. Purple.",
"B. Red.",
"C. Blue.",
"D. Orange."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the color of the hook?\nOption:\nA. Purple.\nB. Red.\nC. Blue.\nD. Orange.\nAnswer with the option's letter from the given choices directly.",
1567,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "523-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1568,
"target": "C",
"doc": {
"video_id": "523",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=aAxGTnVNJiE",
"videoID": "aAxGTnVNJiE",
"question_id": "523-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus or main topic of this video?",
"options": [
"A. It teaches how to make a plastic ball for absolute beginners.",
"B. It teaches how to make a toy car for absolute beginners.",
"C. It teaches how to crochet for absolute beginners.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus or main topic of this video?\nOption:\nA. It teaches how to make a plastic ball for absolute beginners.\nB. It teaches how to make a toy car for absolute beginners.\nC. It teaches how to crochet for absolute beginners.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1568,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "523-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1569,
"target": "C",
"doc": {
"video_id": "524",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=IqejFRAcys8",
"videoID": "IqejFRAcys8",
"question_id": "524-1",
"task_type": "Object Recognition",
"question": "Which ingredient is not used in the video?",
"options": [
"A. Paper.",
"B. Wool thread.",
"C. Leftover cloth.",
"D. Glue gun."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient is not used in the video?\nOption:\nA. Paper.\nB. Wool thread.\nC. Leftover cloth.\nD. Glue gun.\nAnswer with the option's letter from the given choices directly.",
1569,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "524-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1570,
"target": "A",
"doc": {
"video_id": "524",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=IqejFRAcys8",
"videoID": "IqejFRAcys8",
"question_id": "524-2",
"task_type": "Action Recognition",
"question": "What is the first step to make the handcraft?",
"options": [
"A. Cut the threads into pieces by scissors.",
"B. Glue the threads.",
"C. Put the threads onto heart paper.",
"D. Iron the threads."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first step to make the handcraft?\nOption:\nA. Cut the threads into pieces by scissors.\nB. Glue the threads.\nC. Put the threads onto heart paper.\nD. Iron the threads.\nAnswer with the option's letter from the given choices directly.",
1570,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "524-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1571,
"target": "B",
"doc": {
"video_id": "524",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=IqejFRAcys8",
"videoID": "IqejFRAcys8",
"question_id": "524-3",
"task_type": "Object Reasoning",
"question": "What is the intended use of the pink crochet?",
"options": [
"A. To lift the price of the handicraft.",
"B. To decorate the borders of the handcraft.",
"C. To prevent the handcraft from dropping.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the intended use of the pink crochet?\nOption:\nA. To lift the price of the handicraft.\nB. To decorate the borders of the handcraft.\nC. To prevent the handcraft from dropping.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1571,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "524-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1572,
"target": "B",
"doc": {
"video_id": "525",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=HZJXPm0nhgY",
"videoID": "HZJXPm0nhgY",
"question_id": "525-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order are the following tools used?\n(a) Cotton swabs.\n(b) Brush.\n(c) Iron scrubber.\n(d) Card.",
"options": [
"A. (a)(b)(d)(c).",
"B. (b)(c)(a)(d).",
"C. (b)(d)(a)(c).",
"D. (d)(c)(a)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order are the following tools used?\n(a) Cotton swabs.\n(b) Brush.\n(c) Iron scrubber.\n(d) Card.\nOption:\nA. (a)(b)(d)(c).\nB. (b)(c)(a)(d).\nC. (b)(d)(a)(c).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
1572,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "525-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1573,
"target": "C",
"doc": {
"video_id": "525",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=HZJXPm0nhgY",
"videoID": "HZJXPm0nhgY",
"question_id": "525-2",
"task_type": "Counting Problem",
"question": "At the beginning of this video, how many different colors of acrylic are being squeezed onto the board?",
"options": [
"A. 5.",
"B. 4.",
"C. 6.",
"D. 3."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning of this video, how many different colors of acrylic are being squeezed onto the board?\nOption:\nA. 5.\nB. 4.\nC. 6.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
1573,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "525-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1574,
"target": "A",
"doc": {
"video_id": "525",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=HZJXPm0nhgY",
"videoID": "HZJXPm0nhgY",
"question_id": "525-3",
"task_type": "Object Recognition",
"question": "What is the subject or image that the artist is drawing on the board in the video?",
"options": [
"A. A man walking a dog in the woods.",
"B. A beach scene.",
"C. A green tree at full moon.",
"D. A sunflower field."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject or image that the artist is drawing on the board in the video?\nOption:\nA. A man walking a dog in the woods.\nB. A beach scene.\nC. A green tree at full moon.\nD. A sunflower field.\nAnswer with the option's letter from the given choices directly.",
1574,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "525-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1575,
"target": "D",
"doc": {
"video_id": "526",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=3EQzRz-V7uE",
"videoID": "3EQzRz-V7uE",
"question_id": "526-1",
"task_type": "Counting Problem",
"question": "At the beginning of this video, how many rounds of yellow thread are wrapped around the fingers?",
"options": [
"A. 210.",
"B. 20.",
"C. 10.",
"D. 120."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning of this video, how many rounds of yellow thread are wrapped around the fingers?\nOption:\nA. 210.\nB. 20.\nC. 10.\nD. 120.\nAnswer with the option's letter from the given choices directly.",
1575,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "526-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1576,
"target": "B",
"doc": {
"video_id": "526",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=3EQzRz-V7uE",
"videoID": "3EQzRz-V7uE",
"question_id": "526-2",
"task_type": "Object Reasoning",
"question": "Which of the following options is NOT one of the purposes of red threads?",
"options": [
"A. To make the left foot.",
"B. To make the tail.",
"C. To make the right foot.",
"D. To make the nose."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is NOT one of the purposes of red threads?\nOption:\nA. To make the left foot.\nB. To make the tail.\nC. To make the right foot.\nD. To make the nose.\nAnswer with the option's letter from the given choices directly.",
1576,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "526-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1577,
"target": "C",
"doc": {
"video_id": "526",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=3EQzRz-V7uE",
"videoID": "3EQzRz-V7uE",
"question_id": "526-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It demonstrates how to clean a yarn chick.",
"B. It shows how to bring a yarn chick to real life.",
"C. It teaches how to DIY a yarn chick.",
"D. It compares several different kinds of chicks."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It demonstrates how to clean a yarn chick.\nB. It shows how to bring a yarn chick to real life.\nC. It teaches how to DIY a yarn chick.\nD. It compares several different kinds of chicks.\nAnswer with the option's letter from the given choices directly.",
1577,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "526-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1578,
"target": "A",
"doc": {
"video_id": "527",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=-gnfUTPmnNU",
"videoID": "-gnfUTPmnNU",
"question_id": "527-1",
"task_type": "Temporal Reasoning",
"question": "According to this video, in which order do the following business ideas appear in the video?\n(a) Wedding invitations.\n(b) Clothing labels.\n(c) Book marks.\n(d) Dried flowers.",
"options": [
"A. (c)(a)(b)(d).",
"B. (a)(b)(d)(c).",
"C. (b)(d)(a)(c).",
"D. (d)(c)(a)(b)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, in which order do the following business ideas appear in the video?\n(a) Wedding invitations.\n(b) Clothing labels.\n(c) Book marks.\n(d) Dried flowers.\nOption:\nA. (c)(a)(b)(d).\nB. (a)(b)(d)(c).\nC. (b)(d)(a)(c).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
1578,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "527-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1579,
"target": "D",
"doc": {
"video_id": "527",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=-gnfUTPmnNU",
"videoID": "-gnfUTPmnNU",
"question_id": "527-2",
"task_type": "Counting Problem",
"question": "How many candles are shown in this video?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many candles are shown in this video?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
1579,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "527-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1580,
"target": "B",
"doc": {
"video_id": "527",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=-gnfUTPmnNU",
"videoID": "-gnfUTPmnNU",
"question_id": "527-3",
"task_type": "OCR Problems",
"question": "What is the specific text displayed on the neon signs?",
"options": [
"A. Hi gorgeous.",
"B. Hello gorgeous.",
"C. Hello.",
"D. gorgeous."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the specific text displayed on the neon signs?\nOption:\nA. Hi gorgeous.\nB. Hello gorgeous.\nC. Hello.\nD. gorgeous.\nAnswer with the option's letter from the given choices directly.",
1580,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "527-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1581,
"target": "C",
"doc": {
"video_id": "528",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=biiLScgI3u4",
"videoID": "biiLScgI3u4",
"question_id": "528-1",
"task_type": "Object Recognition",
"question": "Which of the following ingredients is not used in the video?",
"options": [
"A. Ladle.",
"B. Wool thread.",
"C. Paper.",
"D. Lighter."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following ingredients is not used in the video?\nOption:\nA. Ladle.\nB. Wool thread.\nC. Paper.\nD. Lighter.\nAnswer with the option's letter from the given choices directly.",
1581,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "528-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1582,
"target": "A",
"doc": {
"video_id": "528",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=biiLScgI3u4",
"videoID": "biiLScgI3u4",
"question_id": "528-2",
"task_type": "Action Recognition",
"question": "How does the person weave the same colored thread?",
"options": [
"A. Diagonally.",
"B. Randomly.",
"C. Parallelly.",
"D. Vertically."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the person weave the same colored thread?\nOption:\nA. Diagonally.\nB. Randomly.\nC. Parallelly.\nD. Vertically.\nAnswer with the option's letter from the given choices directly.",
1582,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "528-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1583,
"target": "D",
"doc": {
"video_id": "528",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=biiLScgI3u4",
"videoID": "biiLScgI3u4",
"question_id": "528-3",
"task_type": "Attribute Perception",
"question": "What kind of handicraft is made in this video?",
"options": [
"A. A necklace.",
"B. A ring.",
"C. A watch.",
"D. A bracelet."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of handicraft is made in this video?\nOption:\nA. A necklace.\nB. A ring.\nC. A watch.\nD. A bracelet.\nAnswer with the option's letter from the given choices directly.",
1583,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "528-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1584,
"target": "B",
"doc": {
"video_id": "529",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=jWa2yrLGxCM",
"videoID": "jWa2yrLGxCM",
"question_id": "529-1",
"task_type": "Temporal Reasoning",
"question": "According to the video, what is the sequential order in which the following types of jewelry are made?\n(a) Fairy Wing Earrings.\n(b) Pastel Necklace.\n(c) Pearl Earrings.\n(d) Tiered Necklace.",
"options": [
"A. (a)(b)(d)(c).",
"B. (b)(c)(a)(d).",
"C. (b)(d)(a)(c).",
"D. (d)(c)(a)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the sequential order in which the following types of jewelry are made?\n(a) Fairy Wing Earrings.\n(b) Pastel Necklace.\n(c) Pearl Earrings.\n(d) Tiered Necklace.\nOption:\nA. (a)(b)(d)(c).\nB. (b)(c)(a)(d).\nC. (b)(d)(a)(c).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
1584,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "529-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1585,
"target": "C",
"doc": {
"video_id": "529",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=jWa2yrLGxCM",
"videoID": "jWa2yrLGxCM",
"question_id": "529-2",
"task_type": "Counting Problem",
"question": "How many kinds of earrings are made in this video?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many kinds of earrings are made in this video?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1585,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "529-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1586,
"target": "A",
"doc": {
"video_id": "529",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=jWa2yrLGxCM",
"videoID": "jWa2yrLGxCM",
"question_id": "529-3",
"task_type": "Object Reasoning",
"question": "According to the video, which of the following statements is true?",
"options": [
"A. The woman is wearing a white dress.",
"B. The woman only makes jewelry for humans.",
"C. The woman has demonstrated how to make a ring.",
"D. None of the other three statements is true."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements is true?\nOption:\nA. The woman is wearing a white dress.\nB. The woman only makes jewelry for humans.\nC. The woman has demonstrated how to make a ring.\nD. None of the other three statements is true.\nAnswer with the option's letter from the given choices directly.",
1586,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "529-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1587,
"target": "D",
"doc": {
"video_id": "530",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=k7iaAhz2md0",
"videoID": "k7iaAhz2md0",
"question_id": "530-1",
"task_type": "Object Recognition",
"question": "Which of the following materials is not used in this video?",
"options": [
"A. Old CD's.",
"B. Glue or liquid silicone.",
"C. Satin Ribbon.",
"D. Wool threads."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following materials is not used in this video?\nOption:\nA. Old CD's.\nB. Glue or liquid silicone.\nC. Satin Ribbon.\nD. Wool threads.\nAnswer with the option's letter from the given choices directly.",
1587,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "530-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1588,
"target": "B",
"doc": {
"video_id": "530",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=k7iaAhz2md0",
"videoID": "k7iaAhz2md0",
"question_id": "530-2",
"task_type": "Counting Problem",
"question": "How many stars can be extracted from one CD?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many stars can be extracted from one CD?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1588,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "530-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1589,
"target": "C",
"doc": {
"video_id": "530",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=k7iaAhz2md0",
"videoID": "k7iaAhz2md0",
"question_id": "530-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It demonstrates how to clean old CDs.",
"B. It shows how to extract audio from old CDs.",
"C. It teaches how to make beautiful decorations from old CDs.",
"D. It compares several different kinds of CDs."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It demonstrates how to clean old CDs.\nB. It shows how to extract audio from old CDs.\nC. It teaches how to make beautiful decorations from old CDs.\nD. It compares several different kinds of CDs.\nAnswer with the option's letter from the given choices directly.",
1589,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "530-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1590,
"target": "C",
"doc": {
"video_id": "531",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=xk8PdL_PdSo",
"videoID": "xk8PdL_PdSo",
"question_id": "531-1",
"task_type": "Object Reasoning",
"question": "Which two people appear at the beginning of the video?",
"options": [
"A. Cavalli and Preston.",
"B. Linda and GG.",
"C. Ernie and Clara.",
"D. Canon and Dell."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which two people appear at the beginning of the video?\nOption:\nA. Cavalli and Preston.\nB. Linda and GG.\nC. Ernie and Clara.\nD. Canon and Dell.\nAnswer with the option's letter from the given choices directly.",
1590,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "531-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1591,
"target": "D",
"doc": {
"video_id": "531",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=xk8PdL_PdSo",
"videoID": "xk8PdL_PdSo",
"question_id": "531-2",
"task_type": "Object Recognition",
"question": "Which food prepared by her grandmother does the granddaughter in the video attempt to try?",
"options": [
"A. Fruit Salad.",
"B. Hamburger Pie.",
"C. Bruncholoney Sandwich.",
"D. Chicken & Dumplings."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which food prepared by her grandmother does the granddaughter in the video attempt to try?\nOption:\nA. Fruit Salad.\nB. Hamburger Pie.\nC. Bruncholoney Sandwich.\nD. Chicken & Dumplings.\nAnswer with the option's letter from the given choices directly.",
1591,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "531-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1592,
"target": "A",
"doc": {
"video_id": "531",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=xk8PdL_PdSo",
"videoID": "xk8PdL_PdSo",
"question_id": "531-3",
"task_type": "Temporal Reasoning",
"question": "What does Cavalli do in the video?",
"options": [
"A. Clap hands, eat chips, eat sandwiches.",
"B. Clap hands, eat sandwiches, eat chips.",
"C. Eat sandwiches, eat chips, clap hands.",
"D. Eat sandwiches, clap hands, eat chips."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Cavalli do in the video?\nOption:\nA. Clap hands, eat chips, eat sandwiches.\nB. Clap hands, eat sandwiches, eat chips.\nC. Eat sandwiches, eat chips, clap hands.\nD. Eat sandwiches, clap hands, eat chips.\nAnswer with the option's letter from the given choices directly.",
1592,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "531-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1593,
"target": "B",
"doc": {
"video_id": "532",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=tdAVvYsU8q8",
"videoID": "tdAVvYsU8q8",
"question_id": "532-1",
"task_type": "Temporal Perception",
"question": "What is the third restaurant featured in the video?",
"options": [
"A. Burpin' Burger Aloha.",
"B. COMRADE.",
"C. LYNN's Table.",
"D. Jean Juan' FRENCH MEX."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the third restaurant featured in the video?\nOption:\nA. Burpin' Burger Aloha.\nB. COMRADE.\nC. LYNN's Table.\nD. Jean Juan' FRENCH MEX.\nAnswer with the option's letter from the given choices directly.",
1593,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "532-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1594,
"target": "D",
"doc": {
"video_id": "532",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=tdAVvYsU8q8",
"videoID": "tdAVvYsU8q8",
"question_id": "532-2",
"task_type": "Action Recognition",
"question": "How does the waiter in LYNN's Table wash the dishes?",
"options": [
"A. Brush the dishes.",
"B. Wash with a rag.",
"C. Use a Dishwasher.",
"D. Lick with their tongue."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the waiter in LYNN's Table wash the dishes?\nOption:\nA. Brush the dishes.\nB. Wash with a rag.\nC. Use a Dishwasher.\nD. Lick with their tongue.\nAnswer with the option's letter from the given choices directly.",
1594,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "532-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1595,
"target": "C",
"doc": {
"video_id": "532",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=tdAVvYsU8q8",
"videoID": "tdAVvYsU8q8",
"question_id": "532-3",
"task_type": "Action Reasoning",
"question": "Why does the Chinese restaurant last seen in the clip of the restaurant Burpin' Burger close its doors?",
"options": [
"A. The owner of the restaurant is going out.",
"B. This restaurant is closed because it's time.",
"C. A White-haired man empties this restaurant while eating at the buffet.",
"D. There's been a complaint about the restaurant."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the Chinese restaurant last seen in the clip of the restaurant Burpin' Burger close its doors?\nOption:\nA. The owner of the restaurant is going out.\nB. This restaurant is closed because it's time.\nC. A White-haired man empties this restaurant while eating at the buffet.\nD. There's been a complaint about the restaurant.\nAnswer with the option's letter from the given choices directly.",
1595,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "532-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1596,
"target": "A",
"doc": {
"video_id": "533",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=5xi1iD4fkhc",
"videoID": "5xi1iD4fkhc",
"question_id": "533-1",
"task_type": "Counting Problem",
"question": "How many cats in the video like mackerel the best?",
"options": [
"A. 3.",
"B. 1.",
"C. 2.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cats in the video like mackerel the best?\nOption:\nA. 3.\nB. 1.\nC. 2.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1596,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "533-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1597,
"target": "C",
"doc": {
"video_id": "533",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=5xi1iD4fkhc",
"videoID": "5xi1iD4fkhc",
"question_id": "533-2",
"task_type": "Action Recognition",
"question": "Which cat in the video took the salmon directly from the plate?",
"options": [
"A. Joaquin.",
"B. Dali.",
"C. Rothko.",
"D. Summer."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which cat in the video took the salmon directly from the plate?\nOption:\nA. Joaquin.\nB. Dali.\nC. Rothko.\nD. Summer.\nAnswer with the option's letter from the given choices directly.",
1597,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "533-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1598,
"target": "B",
"doc": {
"video_id": "533",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=5xi1iD4fkhc",
"videoID": "5xi1iD4fkhc",
"question_id": "533-3",
"task_type": "Action Reasoning",
"question": "What would most likely cause Mimi to eat on the floor?",
"options": [
"A. It can't climb the table.",
"B. It's very timid.",
"C. It fell off the table.",
"D. It likes to eat on the floor."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What would most likely cause Mimi to eat on the floor?\nOption:\nA. It can't climb the table.\nB. It's very timid.\nC. It fell off the table.\nD. It likes to eat on the floor.\nAnswer with the option's letter from the given choices directly.",
1598,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "533-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1599,
"target": "D",
"doc": {
"video_id": "534",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=ZBKUqc_ICpg",
"videoID": "ZBKUqc_ICpg",
"question_id": "534-1",
"task_type": "Temporal Reasoning",
"question": "What kind of food does the main character in the video make successively?",
"options": [
"A. Chickpea frittata, ricotta toast, chocolate porridge.",
"B. Chocolate porridge, chickpea frittata, ricotta toast.",
"C. Ricotta toast, chickpea frittata, chocolate porridge.",
"D. Chickpea frittata, chocolate porridge, ricotta toast."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of food does the main character in the video make successively?\nOption:\nA. Chickpea frittata, ricotta toast, chocolate porridge.\nB. Chocolate porridge, chickpea frittata, ricotta toast.\nC. Ricotta toast, chickpea frittata, chocolate porridge.\nD. Chickpea frittata, chocolate porridge, ricotta toast.\nAnswer with the option's letter from the given choices directly.",
1599,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "534-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1600,
"target": "B",
"doc": {
"video_id": "534",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=ZBKUqc_ICpg",
"videoID": "ZBKUqc_ICpg",
"question_id": "534-2",
"task_type": "Object Reasoning",
"question": "When making Chickpea frittata, what condiments are added to give it an egg-like smell and taste?",
"options": [
"A. Chickpea flour.",
"B. Kala Namak.",
"C. Egg wash.",
"D. Egg-flavored baking powder."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When making Chickpea frittata, what condiments are added to give it an egg-like smell and taste?\nOption:\nA. Chickpea flour.\nB. Kala Namak.\nC. Egg wash.\nD. Egg-flavored baking powder.\nAnswer with the option's letter from the given choices directly.",
1600,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "534-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1601,
"target": "C",
"doc": {
"video_id": "534",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=ZBKUqc_ICpg",
"videoID": "ZBKUqc_ICpg",
"question_id": "534-3",
"task_type": "Object Reasoning",
"question": "When is the food made in the video eaten?",
"options": [
"A. Dinner.",
"B. Lunch.",
"C. Breakfast.",
"D. Afternoon tea."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When is the food made in the video eaten?\nOption:\nA. Dinner.\nB. Lunch.\nC. Breakfast.\nD. Afternoon tea.\nAnswer with the option's letter from the given choices directly.",
1601,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "534-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1602,
"target": "A",
"doc": {
"video_id": "535",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=ooqdg9Wr-mo",
"videoID": "ooqdg9Wr-mo",
"question_id": "535-1",
"task_type": "OCR Problems",
"question": "What food in the video is worth $750?",
"options": [
"A. Barbecue.",
"B. Crayfish.",
"C. Gold Flake Ice Cream.",
"D. Roast chicken."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What food in the video is worth $750?\nOption:\nA. Barbecue.\nB. Crayfish.\nC. Gold Flake Ice Cream.\nD. Roast chicken.\nAnswer with the option's letter from the given choices directly.",
1602,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "535-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1603,
"target": "D",
"doc": {
"video_id": "535",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=ooqdg9Wr-mo",
"videoID": "ooqdg9Wr-mo",
"question_id": "535-2",
"task_type": "Object Reasoning",
"question": "Which of the following food in the video is cheapest?",
"options": [
"A. Taco.",
"B. Caviar.",
"C. Japanese honeydew melon.",
"D. Doughnuts."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following food in the video is cheapest?\nOption:\nA. Taco.\nB. Caviar.\nC. Japanese honeydew melon.\nD. Doughnuts.\nAnswer with the option's letter from the given choices directly.",
1603,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "535-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1604,
"target": "B",
"doc": {
"video_id": "535",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=ooqdg9Wr-mo",
"videoID": "ooqdg9Wr-mo",
"question_id": "535-3",
"task_type": "Action Reasoning",
"question": "Inferring from the video, why does the man in the video choose the latter between caviar and roasted butter beans?",
"options": [
"A. He doesn't like caviar.",
"B. Caviar is too expensive.",
"C. He likes roasted butter beans.",
"D. He chooses it at random."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Inferring from the video, why does the man in the video choose the latter between caviar and roasted butter beans?\nOption:\nA. He doesn't like caviar.\nB. Caviar is too expensive.\nC. He likes roasted butter beans.\nD. He chooses it at random.\nAnswer with the option's letter from the given choices directly.",
1604,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "535-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1605,
"target": "C",
"doc": {
"video_id": "536",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=-c8eATXUui8",
"videoID": "-c8eATXUui8",
"question_id": "536-1",
"task_type": "Object Recognition",
"question": "What is the first food the main character in the video tried?",
"options": [
"A. Pizza.",
"B. Hot dogs.",
"C. Bagles.",
"D. Barbecue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first food the main character in the video tried?\nOption:\nA. Pizza.\nB. Hot dogs.\nC. Bagles.\nD. Barbecue.\nAnswer with the option's letter from the given choices directly.",
1605,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "536-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1606,
"target": "B",
"doc": {
"video_id": "536",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=-c8eATXUui8",
"videoID": "-c8eATXUui8",
"question_id": "536-2",
"task_type": "Action Reasoning",
"question": "Of the following foods, what does the protagonist in the video prefer?",
"options": [
"A. Rice noodle roll.",
"B. Bacon sandwich.",
"C. Bagles.",
"D. Pizza."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Of the following foods, what does the protagonist in the video prefer?\nOption:\nA. Rice noodle roll.\nB. Bacon sandwich.\nC. Bagles.\nD. Pizza.\nAnswer with the option's letter from the given choices directly.",
1606,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "536-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1607,
"target": "C",
"doc": {
"video_id": "536",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=-c8eATXUui8",
"videoID": "-c8eATXUui8",
"question_id": "536-3",
"task_type": "Object Recognition",
"question": "The ad in the video is inserted while the main character is eating what?",
"options": [
"A. Pickle.",
"B. Taco.",
"C. Cookie.",
"D. Ice cream."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The ad in the video is inserted while the main character is eating what?\nOption:\nA. Pickle.\nB. Taco.\nC. Cookie.\nD. Ice cream.\nAnswer with the option's letter from the given choices directly.",
1607,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "536-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1608,
"target": "D",
"doc": {
"video_id": "537",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=Pf1hNFndi60",
"videoID": "Pf1hNFndi60",
"question_id": "537-1",
"task_type": "Action Reasoning",
"question": "What are the three people in the video most likely discussing as they eat their greens?",
"options": [
"A. What do women like to eat.",
"B. How to become healthy.",
"C. What do they plan to do next.",
"D. Why do they eat this stuff."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the three people in the video most likely discussing as they eat their greens?\nOption:\nA. What do women like to eat.\nB. How to become healthy.\nC. What do they plan to do next.\nD. Why do they eat this stuff.\nAnswer with the option's letter from the given choices directly.",
1608,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "537-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1609,
"target": "A",
"doc": {
"video_id": "537",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=Pf1hNFndi60",
"videoID": "Pf1hNFndi60",
"question_id": "537-2",
"task_type": "Counting Problem",
"question": "How many people in the video eat insects?",
"options": [
"A. 2.",
"B. 1.",
"C. 3.",
"D. 0."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people in the video eat insects?\nOption:\nA. 2.\nB. 1.\nC. 3.\nD. 0.\nAnswer with the option's letter from the given choices directly.",
1609,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "537-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1610,
"target": "A",
"doc": {
"video_id": "537",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=Pf1hNFndi60",
"videoID": "Pf1hNFndi60",
"question_id": "537-3",
"task_type": "Object Recognition",
"question": "What food do the people in the video end up eating?",
"options": [
"A. Banana drink.",
"B. Insect Feast.",
"C. Fruits.",
"D. Grilled fish."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What food do the people in the video end up eating?\nOption:\nA. Banana drink.\nB. Insect Feast.\nC. Fruits.\nD. Grilled fish.\nAnswer with the option's letter from the given choices directly.",
1610,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "537-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1611,
"target": "B",
"doc": {
"video_id": "538",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=uhYiRmGURwE",
"videoID": "uhYiRmGURwE",
"question_id": "538-1",
"task_type": "Action Reasoning",
"question": "Why are people in the video buying food at the dollar store?",
"options": [
"A. He doesn't have much money on him.",
"B. Compliance with the rules of the game.",
"C. He's trying to save money.",
"D. He likes dollar-store food."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why are people in the video buying food at the dollar store?\nOption:\nA. He doesn't have much money on him.\nB. Compliance with the rules of the game.\nC. He's trying to save money.\nD. He likes dollar-store food.\nAnswer with the option's letter from the given choices directly.",
1611,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "538-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1612,
"target": "B",
"doc": {
"video_id": "538",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=uhYiRmGURwE",
"videoID": "uhYiRmGURwE",
"question_id": "538-2",
"task_type": "Temporal Perception",
"question": "When making the first dish in the video, what is the second food that is plated?",
"options": [
"A. Onion slice.",
"B. Seaweed snack.",
"C. Pork rinds.",
"D. Chorizo."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When making the first dish in the video, what is the second food that is plated?\nOption:\nA. Onion slice.\nB. Seaweed snack.\nC. Pork rinds.\nD. Chorizo.\nAnswer with the option's letter from the given choices directly.",
1612,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "538-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1613,
"target": "C",
"doc": {
"video_id": "538",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=uhYiRmGURwE",
"videoID": "uhYiRmGURwE",
"question_id": "538-3",
"task_type": "Object Reasoning",
"question": "Which dish in the video is made mainly of pork?",
"options": [
"A. Last course.",
"B. Second course.",
"C. First course.",
"D. None of them."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which dish in the video is made mainly of pork?\nOption:\nA. Last course.\nB. Second course.\nC. First course.\nD. None of them.\nAnswer with the option's letter from the given choices directly.",
1613,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "538-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1614,
"target": "C",
"doc": {
"video_id": "539",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=FcP0mzWFCQU",
"videoID": "FcP0mzWFCQU",
"question_id": "539-1",
"task_type": "OCR Problems",
"question": "How many new McDonald's have opened in France from 2018-2023 in the video?",
"options": [
"A. 95.",
"B. 221.",
"C. 83.",
"D. 92."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many new McDonald's have opened in France from 2018-2023 in the video?\nOption:\nA. 95.\nB. 221.\nC. 83.\nD. 92.\nAnswer with the option's letter from the given choices directly.",
1614,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "539-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1615,
"target": "D",
"doc": {
"video_id": "539",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=FcP0mzWFCQU",
"videoID": "FcP0mzWFCQU",
"question_id": "539-2",
"task_type": "Object Recognition",
"question": "Which section of the video shows customers uploading videos of themselves making or tasting food on video media?",
"options": [
"A. Opportunities.",
"B. Challenges.",
"C. France's fast-food market.",
"D. Winning over the French."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which section of the video shows customers uploading videos of themselves making or tasting food on video media?\nOption:\nA. Opportunities.\nB. Challenges.\nC. France's fast-food market.\nD. Winning over the French.\nAnswer with the option's letter from the given choices directly.",
1615,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "539-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1616,
"target": "B",
"doc": {
"video_id": "539",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=FcP0mzWFCQU",
"videoID": "FcP0mzWFCQU",
"question_id": "539-3",
"task_type": "Object Reasoning",
"question": "The video mentions that about 100 SUBWAYS have been closed in France, what is the most likely reason?",
"options": [
"A. People are less enthusiastic about fast food.",
"B. Affected by the European market economy.",
"C. People have less spending power.",
"D. French boycott of SUBWAY."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video mentions that about 100 SUBWAYS have been closed in France, what is the most likely reason?\nOption:\nA. People are less enthusiastic about fast food.\nB. Affected by the European market economy.\nC. People have less spending power.\nD. French boycott of SUBWAY.\nAnswer with the option's letter from the given choices directly.",
1616,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "539-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1617,
"target": "B",
"doc": {
"video_id": "540",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=6DbsOZU8mBM",
"videoID": "6DbsOZU8mBM",
"question_id": "540-1",
"task_type": "Object Reasoning",
"question": "Extrapolating from the video, what does the brand OSMO in the video sell?",
"options": [
"A. Kobe beef.",
"B. Salt.",
"C. Molasses.",
"D. Black truffle."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Extrapolating from the video, what does the brand OSMO in the video sell?\nOption:\nA. Kobe beef.\nB. Salt.\nC. Molasses.\nD. Black truffle.\nAnswer with the option's letter from the given choices directly.",
1617,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "540-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1618,
"target": "A",
"doc": {
"video_id": "540",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=6DbsOZU8mBM",
"videoID": "6DbsOZU8mBM",
"question_id": "540-2",
"task_type": "Temporal Perception",
"question": "What is the third baked food in the video?",
"options": [
"A. Scallop.",
"B. Kobe beef.",
"C. Bacon.",
"D. Salmon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the third baked food in the video?\nOption:\nA. Scallop.\nB. Kobe beef.\nC. Bacon.\nD. Salmon.\nAnswer with the option's letter from the given choices directly.",
1618,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "540-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1619,
"target": "D",
"doc": {
"video_id": "540",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=6DbsOZU8mBM",
"videoID": "6DbsOZU8mBM",
"question_id": "540-3",
"task_type": "Object Recognition",
"question": "What treat did the little hamster savor at the end of the video?",
"options": [
"A. Salmon.",
"B. Bacon.",
"C. Scallop.",
"D. Kobe beef."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What treat did the little hamster savor at the end of the video?\nOption:\nA. Salmon.\nB. Bacon.\nC. Scallop.\nD. Kobe beef.\nAnswer with the option's letter from the given choices directly.",
1619,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "540-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1620,
"target": "B",
"doc": {
"video_id": "541",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=yA79KYMLUpI",
"videoID": "yA79KYMLUpI",
"question_id": "541-1",
"task_type": "Attribute Perception",
"question": "What color is the makeup bag in the video?",
"options": [
"A. Black.",
"B. Brown.",
"C. Bright blue.",
"D. Pink."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the makeup bag in the video?\nOption:\nA. Black.\nB. Brown.\nC. Bright blue.\nD. Pink.\nAnswer with the option's letter from the given choices directly.",
1620,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "541-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1621,
"target": "B",
"doc": {
"video_id": "541",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=yA79KYMLUpI",
"videoID": "yA79KYMLUpI",
"question_id": "541-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly orders the items according to their order in the video?",
"options": [
"A. Gin, hand wipes, AirTag.",
"B. EarPods, Sleep Master mask, water bottle.",
"C. Passport, freckle pen, Earpods.",
"D. Tarot card, lip balm, sponge."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly orders the items according to their order in the video?\nOption:\nA. Gin, hand wipes, AirTag.\nB. EarPods, Sleep Master mask, water bottle.\nC. Passport, freckle pen, Earpods.\nD. Tarot card, lip balm, sponge.\nAnswer with the option's letter from the given choices directly.",
1621,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "541-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1622,
"target": "B",
"doc": {
"video_id": "541",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=yA79KYMLUpI",
"videoID": "yA79KYMLUpI",
"question_id": "541-3",
"task_type": "Object Recognition",
"question": "Which item is not in the makeup bag?",
"options": [
"A. Freckle pen.",
"B. Sleep Master mask.",
"C. Sponge.",
"D. Lip balm."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item is not in the makeup bag?\nOption:\nA. Freckle pen.\nB. Sleep Master mask.\nC. Sponge.\nD. Lip balm.\nAnswer with the option's letter from the given choices directly.",
1622,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "541-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1623,
"target": "D",
"doc": {
"video_id": "542",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=3x8pvm0g5AU",
"videoID": "3x8pvm0g5AU",
"question_id": "542-1",
"task_type": "Object Recognition",
"question": "In which city did the video take place?",
"options": [
"A. Paris.",
"B. London.",
"C. New York.",
"D. Hong Kong."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which city did the video take place?\nOption:\nA. Paris.\nB. London.\nC. New York.\nD. Hong Kong.\nAnswer with the option's letter from the given choices directly.",
1623,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "542-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1624,
"target": "B",
"doc": {
"video_id": "542",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=3x8pvm0g5AU",
"videoID": "3x8pvm0g5AU",
"question_id": "542-2",
"task_type": "OCR Problems",
"question": "What pattern does the UAV form at the end of the video?",
"options": [
"A. Louis Vuitton.",
"B. LV Lovers.",
"C. LVMH.",
"D. Louis Vuitton Monogram."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What pattern does the UAV form at the end of the video?\nOption:\nA. Louis Vuitton.\nB. LV Lovers.\nC. LVMH.\nD. Louis Vuitton Monogram.\nAnswer with the option's letter from the given choices directly.",
1624,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "542-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1625,
"target": "D",
"doc": {
"video_id": "542",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=3x8pvm0g5AU",
"videoID": "3x8pvm0g5AU",
"question_id": "542-3",
"task_type": "Action Reasoning",
"question": "What is the model walking behind the one in a casual suit characterized by a rich print in vibrant shades of pink and white wearing?",
"options": [
"A. An oversized, short-sleeved shirt in a glossy, brown hue that suggests a leather-like material.",
"B. A black wetsuit with a blue surfboard fin, reinforcing the aquatic theme of the outfit.",
"C. A striking ensemble with a floral theme against a dark background, which gives a nocturnal, tropical vibe.",
"D. An oversized, striped shirt jacket featuring a mix of blue tones and a pattern of classic denim patchwork."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the model walking behind the one in a casual suit characterized by a rich print in vibrant shades of pink and white wearing?\nOption:\nA. An oversized, short-sleeved shirt in a glossy, brown hue that suggests a leather-like material.\nB. A black wetsuit with a blue surfboard fin, reinforcing the aquatic theme of the outfit.\nC. A striking ensemble with a floral theme against a dark background, which gives a nocturnal, tropical vibe.\nD. An oversized, striped shirt jacket featuring a mix of blue tones and a pattern of classic denim patchwork.\nAnswer with the option's letter from the given choices directly.",
1625,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "542-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1626,
"target": "B",
"doc": {
"video_id": "543",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=MZEdnBcLcfg",
"videoID": "MZEdnBcLcfg",
"question_id": "543-1",
"task_type": "Object Reasoning",
"question": "What character is the woman with white hair in the video most likely to be?",
"options": [
"A. Model.",
"B. Designer.",
"C. Director.",
"D. Agent."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What character is the woman with white hair in the video most likely to be?\nOption:\nA. Model.\nB. Designer.\nC. Director.\nD. Agent.\nAnswer with the option's letter from the given choices directly.",
1626,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "543-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1627,
"target": "A",
"doc": {
"video_id": "543",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=MZEdnBcLcfg",
"videoID": "MZEdnBcLcfg",
"question_id": "543-2",
"task_type": "Information Synopsis",
"question": "What is the main story of this video?",
"options": [
"A. A behind-the-scenes account of a fashion show.",
"B. Fashion brand advertising.",
"C. Designer promotional video.",
"D. Modeling casting call."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main story of this video?\nOption:\nA. A behind-the-scenes account of a fashion show.\nB. Fashion brand advertising.\nC. Designer promotional video.\nD. Modeling casting call.\nAnswer with the option's letter from the given choices directly.",
1627,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "543-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1628,
"target": "B",
"doc": {
"video_id": "543",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=MZEdnBcLcfg",
"videoID": "MZEdnBcLcfg",
"question_id": "543-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct order of the countries involved in the video?",
"options": [
"A. United States, Italy, France.",
"B. United States, France, United Kingdom.",
"C. France, United States, Italy.",
"D. Australia, UK, USA."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order of the countries involved in the video?\nOption:\nA. United States, Italy, France.\nB. United States, France, United Kingdom.\nC. France, United States, Italy.\nD. Australia, UK, USA.\nAnswer with the option's letter from the given choices directly.",
1628,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "543-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1629,
"target": "A",
"doc": {
"video_id": "544",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=_CqKv0Y1FB0",
"videoID": "_CqKv0Y1FB0",
"question_id": "544-1",
"task_type": "Counting Problem",
"question": "When the motorcycle in the video leaves, how many people come out of the door?",
"options": [
"A. 5.",
"B. 7.",
"C. 9.",
"D. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When the motorcycle in the video leaves, how many people come out of the door?\nOption:\nA. 5.\nB. 7.\nC. 9.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
1629,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "544-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1630,
"target": "B",
"doc": {
"video_id": "544",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=_CqKv0Y1FB0",
"videoID": "_CqKv0Y1FB0",
"question_id": "544-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the characters appear in the video?",
"options": [
"A. Dog walkers, motorcycle riders, skateboarders.",
"B. The man in the car, the man carrying the boxes through the door, the two women in red.",
"C. People with violins, people on motorcycles, people with umbrellas.",
"D. Motorcycle riders, skateboarders, dog walkers."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the characters appear in the video?\nOption:\nA. Dog walkers, motorcycle riders, skateboarders.\nB. The man in the car, the man carrying the boxes through the door, the two women in red.\nC. People with violins, people on motorcycles, people with umbrellas.\nD. Motorcycle riders, skateboarders, dog walkers.\nAnswer with the option's letter from the given choices directly.",
1630,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "544-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1631,
"target": "B",
"doc": {
"video_id": "544",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=_CqKv0Y1FB0",
"videoID": "_CqKv0Y1FB0",
"question_id": "544-3",
"task_type": "Information Synopsis",
"question": "What does this video depict?",
"options": [
"A. Celebrity Outfit Breakdown.",
"B. Street fashion Industry Documentary.",
"C. Street theater performance.",
"D. Movie Promo."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video depict?\nOption:\nA. Celebrity Outfit Breakdown.\nB. Street fashion Industry Documentary.\nC. Street theater performance.\nD. Movie Promo.\nAnswer with the option's letter from the given choices directly.",
1631,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "544-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1632,
"target": "C",
"doc": {
"video_id": "545",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=ntdzrNKRH4g",
"videoID": "ntdzrNKRH4g",
"question_id": "545-1",
"task_type": "Temporal Reasoning",
"question": "What is the correct sequence of different products appearing in the video?",
"options": [
"A. Scrunch socks, Earrings, Dress pants.",
"B. Scrunch socks, Dress pants, Earrings.",
"C. Dress pants, Scrunch socks, Earrings.",
"D. Earrings, Scrunch socks, Dress pants."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct sequence of different products appearing in the video?\nOption:\nA. Scrunch socks, Earrings, Dress pants.\nB. Scrunch socks, Dress pants, Earrings.\nC. Dress pants, Scrunch socks, Earrings.\nD. Earrings, Scrunch socks, Dress pants.\nAnswer with the option's letter from the given choices directly.",
1632,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "545-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1633,
"target": "B",
"doc": {
"video_id": "545",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=ntdzrNKRH4g",
"videoID": "ntdzrNKRH4g",
"question_id": "545-2",
"task_type": "Object Recognition",
"question": "What does the woman in the video recommend wearing while running?",
"options": [
"A. Waist Length Tank Tops.",
"B. Satina Leggings.",
"C. Lounge Set.",
"D. Faux Leather Jacket."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the woman in the video recommend wearing while running?\nOption:\nA. Waist Length Tank Tops.\nB. Satina Leggings.\nC. Lounge Set.\nD. Faux Leather Jacket.\nAnswer with the option's letter from the given choices directly.",
1633,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "545-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1634,
"target": "A",
"doc": {
"video_id": "545",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=ntdzrNKRH4g",
"videoID": "ntdzrNKRH4g",
"question_id": "545-3",
"task_type": "Object Recognition",
"question": "When the woman in the video introduces Scrunch socks, which two other pieces of clothing she recommends are mixed in the little video she's talking about?",
"options": [
"A. 3 and 6.",
"B. 10 and 5.",
"C. 9 and 12.",
"D. 11 and 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When the woman in the video introduces Scrunch socks, which two other pieces of clothing she recommends are mixed in the little video she's talking about?\nOption:\nA. 3 and 6.\nB. 10 and 5.\nC. 9 and 12.\nD. 11 and 1.\nAnswer with the option's letter from the given choices directly.",
1634,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "545-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1635,
"target": "A",
"doc": {
"video_id": "546",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=x8zGsWuu-e0",
"videoID": "x8zGsWuu-e0",
"question_id": "546-1",
"task_type": "Counting Problem",
"question": "How many days a week does the actress in the video wear sunglasses?",
"options": [
"A. 6.",
"B. 7.",
"C. 5.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many days a week does the actress in the video wear sunglasses?\nOption:\nA. 6.\nB. 7.\nC. 5.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1635,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "546-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1636,
"target": "D",
"doc": {
"video_id": "546",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=x8zGsWuu-e0",
"videoID": "x8zGsWuu-e0",
"question_id": "546-2",
"task_type": "Object Reasoning",
"question": "On which day is the actress in the video best dressed for sports?",
"options": [
"A. Wednesday.",
"B. Friday.",
"C. Monday.",
"D. Sunday."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: On which day is the actress in the video best dressed for sports?\nOption:\nA. Wednesday.\nB. Friday.\nC. Monday.\nD. Sunday.\nAnswer with the option's letter from the given choices directly.",
1636,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "546-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1637,
"target": "B",
"doc": {
"video_id": "546",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=x8zGsWuu-e0",
"videoID": "x8zGsWuu-e0",
"question_id": "546-3",
"task_type": "Object Reasoning",
"question": "What makes Thursday's outfit different from the rest of the year?",
"options": [
"A. Wear a lot.",
"B. No bag.",
"C. With sunglasses.",
"D. No boots."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What makes Thursday's outfit different from the rest of the year?\nOption:\nA. Wear a lot.\nB. No bag.\nC. With sunglasses.\nD. No boots.\nAnswer with the option's letter from the given choices directly.",
1637,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "546-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1638,
"target": "B",
"doc": {
"video_id": "547",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=x-ccfWcbht4",
"videoID": "x-ccfWcbht4",
"question_id": "547-1",
"task_type": "Object Reasoning",
"question": "In which year did the main character in the video win his second Emmy?",
"options": [
"A. 2021.",
"B. 2022.",
"C. 2019.",
"D. 2024."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which year did the main character in the video win his second Emmy?\nOption:\nA. 2021.\nB. 2022.\nC. 2019.\nD. 2024.\nAnswer with the option's letter from the given choices directly.",
1638,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "547-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1639,
"target": "A",
"doc": {
"video_id": "547",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=x-ccfWcbht4",
"videoID": "x-ccfWcbht4",
"question_id": "547-2",
"task_type": "Object Reasoning",
"question": "How old is the woman in the video in 2019?",
"options": [
"A. 22.",
"B. 20.",
"C. 25.",
"D. 18."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How old is the woman in the video in 2019?\nOption:\nA. 22.\nB. 20.\nC. 25.\nD. 18.\nAnswer with the option's letter from the given choices directly.",
1639,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "547-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1640,
"target": "D",
"doc": {
"video_id": "547",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=x-ccfWcbht4",
"videoID": "x-ccfWcbht4",
"question_id": "547-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct chronological order in which the following looks appeared?",
"options": [
"A. Academy Awards, Teen Vogue Young Hollywood, CFDA Fashion Awards.",
"B. Met Gala, Vogue US, Emmy Awards.",
"C. Vanity Fair's Women in Hollywood, Dream Halloween, Critic's Choice Awards.",
"D. Met Gala, The Greatest Showman, Dune London Screening."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct chronological order in which the following looks appeared?\nOption:\nA. Academy Awards, Teen Vogue Young Hollywood, CFDA Fashion Awards.\nB. Met Gala, Vogue US, Emmy Awards.\nC. Vanity Fair's Women in Hollywood, Dream Halloween, Critic's Choice Awards.\nD. Met Gala, The Greatest Showman, Dune London Screening.\nAnswer with the option's letter from the given choices directly.",
1640,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "547-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1641,
"target": "A",
"doc": {
"video_id": "548",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=-fbRaGwfqRw",
"videoID": "-fbRaGwfqRw",
"question_id": "548-1",
"task_type": "Action Recognition",
"question": "What was the person's action in number nine GlamBOT in the video?",
"options": [
"A. He threw his jacket in the air.",
"B. He blew a kiss to the camera.",
"C. He ran and jumped over the Glambot marker.",
"D. He did a spin and hair toss."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the person's action in number nine GlamBOT in the video?\nOption:\nA. He threw his jacket in the air.\nB. He blew a kiss to the camera.\nC. He ran and jumped over the Glambot marker.\nD. He did a spin and hair toss.\nAnswer with the option's letter from the given choices directly.",
1641,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "548-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1642,
"target": "D",
"doc": {
"video_id": "548",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=-fbRaGwfqRw",
"videoID": "-fbRaGwfqRw",
"question_id": "548-2",
"task_type": "Attribute Perception",
"question": "What outfit was Cardi B wearing?",
"options": [
"A. A soft, pastel blue gown with a strapless design that fits snugly, highlighting her shoulders and neckline.",
"B. A striking red dress with a black underdress or lining.",
"C. A gorgeous orange sequined dress with a form-fitting silhouette.",
"D. A blue dress with a unique, sculptural design that resembles a \"waterfall of fabric\" that flowed around her."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What outfit was Cardi B wearing?\nOption:\nA. A soft, pastel blue gown with a strapless design that fits snugly, highlighting her shoulders and neckline.\nB. A striking red dress with a black underdress or lining.\nC. A gorgeous orange sequined dress with a form-fitting silhouette.\nD. A blue dress with a unique, sculptural design that resembles a \"waterfall of fabric\" that flowed around her.\nAnswer with the option's letter from the given choices directly.",
1642,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "548-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1643,
"target": "A",
"doc": {
"video_id": "548",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=-fbRaGwfqRw",
"videoID": "-fbRaGwfqRw",
"question_id": "548-3",
"task_type": "Object Recognition",
"question": "Which Glambot did Cole Walliser almost miss capturing due to an unexpected eyeroll and movement from the celebrity?",
"options": [
"A. Cara Delevingne.",
"B. Charlie D'Amelio.",
"C. Salma Hayek.",
"D. Anita."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which Glambot did Cole Walliser almost miss capturing due to an unexpected eyeroll and movement from the celebrity?\nOption:\nA. Cara Delevingne.\nB. Charlie D'Amelio.\nC. Salma Hayek.\nD. Anita.\nAnswer with the option's letter from the given choices directly.",
1643,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "548-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1644,
"target": "D",
"doc": {
"video_id": "549",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=skbELjWHyXA",
"videoID": "skbELjWHyXA",
"question_id": "549-1",
"task_type": "Object Recognition",
"question": "What brand is the wallet in the video?",
"options": [
"A. Louis Vuitton.",
"B. Nike.",
"C. Prada.",
"D. Goyard."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What brand is the wallet in the video?\nOption:\nA. Louis Vuitton.\nB. Nike.\nC. Prada.\nD. Goyard.\nAnswer with the option's letter from the given choices directly.",
1644,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "549-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1645,
"target": "C",
"doc": {
"video_id": "549",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=skbELjWHyXA",
"videoID": "skbELjWHyXA",
"question_id": "549-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly orders the items according to their order in the video?",
"options": [
"A. Phone, Jewelry, Sprite, Sneakers.",
"B. Sneakers, Phone, Jewelry, Sprite.",
"C. Jewelry, Sprite, Phone, Sneakers.",
"D. Sprite, Phone, Jewelry, Sneakers."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly orders the items according to their order in the video?\nOption:\nA. Phone, Jewelry, Sprite, Sneakers.\nB. Sneakers, Phone, Jewelry, Sprite.\nC. Jewelry, Sprite, Phone, Sneakers.\nD. Sprite, Phone, Jewelry, Sneakers.\nAnswer with the option's letter from the given choices directly.",
1645,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "549-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1646,
"target": "D",
"doc": {
"video_id": "549",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=skbELjWHyXA",
"videoID": "skbELjWHyXA",
"question_id": "549-3",
"task_type": "Action Reasoning",
"question": "Which music artist does Lil Tjay mention enjoying in the video?",
"options": [
"A. Pop Smoke.",
"B. Ed Sheeran.",
"C. SZA.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which music artist does Lil Tjay mention enjoying in the video?\nOption:\nA. Pop Smoke.\nB. Ed Sheeran.\nC. SZA.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
1646,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "549-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1647,
"target": "D",
"doc": {
"video_id": "550",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=Z7LJrU4443Q",
"videoID": "Z7LJrU4443Q",
"question_id": "550-1",
"task_type": "Temporal Reasoning",
"question": "The video highlights different stages of the design and manufacturing process at Hermès. Which sequence best reflects the order of these stages as presented?",
"options": [
"A. Sketching and 3D mockups -> Hardware development -> Hand-stitching and assembly -> Leather selection.",
"B. Hardware development -> Sketching and 3D mockups -> Leather selection -> Hand-stitching and assembly.",
"C. Leather selection -> Hardware development -> Sketching and 3D mockups -> Hand-stitching and assembly.",
"D. Sketching and 3D mockups -> Leather selection -> Hardware development -> Hand-stitching and assembly."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video highlights different stages of the design and manufacturing process at Hermès. Which sequence best reflects the order of these stages as presented?\nOption:\nA. Sketching and 3D mockups -> Hardware development -> Hand-stitching and assembly -> Leather selection.\nB. Hardware development -> Sketching and 3D mockups -> Leather selection -> Hand-stitching and assembly.\nC. Leather selection -> Hardware development -> Sketching and 3D mockups -> Hand-stitching and assembly.\nD. Sketching and 3D mockups -> Leather selection -> Hardware development -> Hand-stitching and assembly.\nAnswer with the option's letter from the given choices directly.",
1647,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "550-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1648,
"target": "D",
"doc": {
"video_id": "550",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=Z7LJrU4443Q",
"videoID": "Z7LJrU4443Q",
"question_id": "550-2",
"task_type": "Object Reasoning",
"question": "The video emphasizes the importance of craftsmanship and hand-made techniques at Hermès. Which statement best reflects the brand's dedication to craftsmanship?",
"options": [
"A. The use of Salpa allows for precise design iterations before committing to real leather.",
"B. The \"bureau d'études\" ensures the quality and functionality of hardware elements.",
"C. The vast selection of leather colors and materials allows for personalized customer choices.",
"D. The extensive training and skill required to create each bag highlight the value of human touch."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video emphasizes the importance of craftsmanship and hand-made techniques at Hermès. Which statement best reflects the brand's dedication to craftsmanship?\nOption:\nA. The use of Salpa allows for precise design iterations before committing to real leather.\nB. The \"bureau d'études\" ensures the quality and functionality of hardware elements.\nC. The vast selection of leather colors and materials allows for personalized customer choices.\nD. The extensive training and skill required to create each bag highlight the value of human touch.\nAnswer with the option's letter from the given choices directly.",
1648,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "550-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1649,
"target": "C",
"doc": {
"video_id": "550",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=Z7LJrU4443Q",
"videoID": "Z7LJrU4443Q",
"question_id": "550-3",
"task_type": "Object Reasoning",
"question": "Priscilla Alexandra Spring mentions the importance of both \"love at first sight\" and longevity when purchasing a Hermès bag. What aspect of the video best showcases the brand's focus on longevity?",
"options": [
"A. The use of Salpa, a fabric simulating leather, to create multiple mockups before using real leather.",
"B. The meticulous hand-stitching process demonstrated by Aurélie, the craftswoman.",
"C. The explanation of the extensive training required for artisans to manufacture a Kelly bag.",
"D. The emphasis on the \"noise\" and feel of metal closures, ensuring a satisfying experience."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Priscilla Alexandra Spring mentions the importance of both \"love at first sight\" and longevity when purchasing a Hermès bag. What aspect of the video best showcases the brand's focus on longevity?\nOption:\nA. The use of Salpa, a fabric simulating leather, to create multiple mockups before using real leather.\nB. The meticulous hand-stitching process demonstrated by Aurélie, the craftswoman.\nC. The explanation of the extensive training required for artisans to manufacture a Kelly bag.\nD. The emphasis on the \"noise\" and feel of metal closures, ensuring a satisfying experience.\nAnswer with the option's letter from the given choices directly.",
1649,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "550-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1650,
"target": "A",
"doc": {
"video_id": "551",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=yfbeNtST6Fs",
"videoID": "yfbeNtST6Fs",
"question_id": "551-1",
"task_type": "Object Recognition",
"question": "What is the main subject matter of the advertisement featured in the video?",
"options": [
"A. Audible app.",
"B. Music listening app.",
"C. Shopping app.",
"D. Video online playing app."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main subject matter of the advertisement featured in the video?\nOption:\nA. Audible app.\nB. Music listening app.\nC. Shopping app.\nD. Video online playing app.\nAnswer with the option's letter from the given choices directly.",
1650,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "551-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1651,
"target": "D",
"doc": {
"video_id": "551",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=yfbeNtST6Fs",
"videoID": "yfbeNtST6Fs",
"question_id": "551-2",
"task_type": "Action Recognition",
"question": "What is the first thing the heroine does after the exam in the video?",
"options": [
"A. Playing video games with friends.",
"B. Cleaning the floor.",
"C. Tearing down the Christmas tree.",
"D. Sorting out clothes."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first thing the heroine does after the exam in the video?\nOption:\nA. Playing video games with friends.\nB. Cleaning the floor.\nC. Tearing down the Christmas tree.\nD. Sorting out clothes.\nAnswer with the option's letter from the given choices directly.",
1651,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "551-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1652,
"target": "B",
"doc": {
"video_id": "551",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=yfbeNtST6Fs",
"videoID": "yfbeNtST6Fs",
"question_id": "551-3",
"task_type": "Temporal Reasoning",
"question": "In what order do the following events appear in the video?\n①Receiving a parcel\n②Making bibimbap\n③Cleaning the floor",
"options": [
"A. ①②③.",
"B. ②①③.",
"C. ②③①.",
"D. ③②①."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order do the following events appear in the video?\n①Receiving a parcel\n②Making bibimbap\n③Cleaning the floor\nOption:\nA. ①②③.\nB. ②①③.\nC. ②③①.\nD. ③②①.\nAnswer with the option's letter from the given choices directly.",
1652,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "551-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1653,
"target": "C",
"doc": {
"video_id": "552",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3m29MQ-qPfg",
"videoID": "3m29MQ-qPfg",
"question_id": "552-1",
"task_type": "OCR Problems",
"question": "When and where did the athlete and student first meet in the video?",
"options": [
"A. 12:30 PM in the car.",
"B. 12:30 PM in the canteen.",
"C. 12:30 PM in the classroom.",
"D. 12:00 PM in the classroom."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When and where did the athlete and student first meet in the video?\nOption:\nA. 12:30 PM in the car.\nB. 12:30 PM in the canteen.\nC. 12:30 PM in the classroom.\nD. 12:00 PM in the classroom.\nAnswer with the option's letter from the given choices directly.",
1653,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "552-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1654,
"target": "A",
"doc": {
"video_id": "552",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3m29MQ-qPfg",
"videoID": "3m29MQ-qPfg",
"question_id": "552-2",
"task_type": "Action Recognition",
"question": "What is the student doing while the athlete is competing in the indoor stadium in the video?",
"options": [
"A. Shopping.",
"B. Bathing.",
"C. Studying.",
"D. Preparing a meal."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the student doing while the athlete is competing in the indoor stadium in the video?\nOption:\nA. Shopping.\nB. Bathing.\nC. Studying.\nD. Preparing a meal.\nAnswer with the option's letter from the given choices directly.",
1654,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "552-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1655,
"target": "D",
"doc": {
"video_id": "552",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3m29MQ-qPfg",
"videoID": "3m29MQ-qPfg",
"question_id": "552-3",
"task_type": "Action Recognition",
"question": "What does the hero in the video go to with his roommate when he gets home?",
"options": [
"A. Attending a party together.",
"B. Playing computer games.",
"C. Learning together.",
"D. Eating food."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the hero in the video go to with his roommate when he gets home?\nOption:\nA. Attending a party together.\nB. Playing computer games.\nC. Learning together.\nD. Eating food.\nAnswer with the option's letter from the given choices directly.",
1655,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "552-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1656,
"target": "B",
"doc": {
"video_id": "553",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=tCRpDpDgBcE",
"videoID": "tCRpDpDgBcE",
"question_id": "553-1",
"task_type": "Action Recognition",
"question": "What is the first thing the hero does after breakfast in the video?",
"options": [
"A. Attending a class.",
"B. Buying a cup of coffee.",
"C. Communicating with classmates.",
"D. Using a computer for learning."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first thing the hero does after breakfast in the video?\nOption:\nA. Attending a class.\nB. Buying a cup of coffee.\nC. Communicating with classmates.\nD. Using a computer for learning.\nAnswer with the option's letter from the given choices directly.",
1656,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "553-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1657,
"target": "C",
"doc": {
"video_id": "553",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=tCRpDpDgBcE",
"videoID": "tCRpDpDgBcE",
"question_id": "553-2",
"task_type": "Action Recognition",
"question": "What did the hero and the other football players do before entering the stadium locker room?",
"options": [
"A. They communicated with opponents before the game.",
"B. The ate food to replenish energy.",
"C. They had a meeting and listen to the coach talk about football team tactics.",
"D. They did a warm-up with teammates on the field."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the hero and the other football players do before entering the stadium locker room?\nOption:\nA. They communicated with opponents before the game.\nB. The ate food to replenish energy.\nC. They had a meeting and listen to the coach talk about football team tactics.\nD. They did a warm-up with teammates on the field.\nAnswer with the option's letter from the given choices directly.",
1657,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "553-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1658,
"target": "A",
"doc": {
"video_id": "553",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=tCRpDpDgBcE",
"videoID": "tCRpDpDgBcE",
"question_id": "553-3",
"task_type": "OCR Problems",
"question": "At what time does the protagonist in the video typically visit the gym for their workout?",
"options": [
"A. 9:00 PM.",
"B. 8:30 PM.",
"C. 10:00 PM.",
"D. 9:30 PM."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At what time does the protagonist in the video typically visit the gym for their workout?\nOption:\nA. 9:00 PM.\nB. 8:30 PM.\nC. 10:00 PM.\nD. 9:30 PM.\nAnswer with the option's letter from the given choices directly.",
1658,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "553-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1659,
"target": "D",
"doc": {
"video_id": "554",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=QPzmsQ86_HM",
"videoID": "QPzmsQ86_HM",
"question_id": "554-1",
"task_type": "Object Reasoning",
"question": "What is the main job of the hero in the video?",
"options": [
"A. Data analysis.",
"B. Backend development.",
"C. Artificial intelligence.",
"D. Frontend development."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main job of the hero in the video?\nOption:\nA. Data analysis.\nB. Backend development.\nC. Artificial intelligence.\nD. Frontend development.\nAnswer with the option's letter from the given choices directly.",
1659,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "554-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1660,
"target": "B",
"doc": {
"video_id": "554",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=QPzmsQ86_HM",
"videoID": "QPzmsQ86_HM",
"question_id": "554-2",
"task_type": "Spatial Reasoning",
"question": "Where does the hero in the video work?",
"options": [
"A. In the company.",
"B. At home.",
"C. At a cafe.",
"D. At school."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the hero in the video work?\nOption:\nA. In the company.\nB. At home.\nC. At a cafe.\nD. At school.\nAnswer with the option's letter from the given choices directly.",
1660,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "554-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1661,
"target": "C",
"doc": {
"video_id": "554",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=QPzmsQ86_HM",
"videoID": "QPzmsQ86_HM",
"question_id": "554-3",
"task_type": "Temporal Reasoning",
"question": "In what order do the following events appear in the video?\n①Participation in online meeting\n②Searching with Google\n③Go to the rooftop to eat",
"options": [
"A. ③①②.",
"B. ①②③.",
"C. ②①③.",
"D. ③②①."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order do the following events appear in the video?\n①Participation in online meeting\n②Searching with Google\n③Go to the rooftop to eat\nOption:\nA. ③①②.\nB. ①②③.\nC. ②①③.\nD. ③②①.\nAnswer with the option's letter from the given choices directly.",
1661,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "554-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1662,
"target": "A",
"doc": {
"video_id": "555",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=Zjl2vmy02As",
"videoID": "Zjl2vmy02As",
"question_id": "555-1",
"task_type": "Action Recognition",
"question": "When expectation having finished brushing teeth and getting dressed, what is the male protagonist doing in reality?",
"options": [
"A. Sleeping.",
"B. Ready to go out.",
"C. Drinking coffee.",
"D. Getting up."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When expectation having finished brushing teeth and getting dressed, what is the male protagonist doing in reality?\nOption:\nA. Sleeping.\nB. Ready to go out.\nC. Drinking coffee.\nD. Getting up.\nAnswer with the option's letter from the given choices directly.",
1662,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "555-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1663,
"target": "D",
"doc": {
"video_id": "555",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=Zjl2vmy02As",
"videoID": "Zjl2vmy02As",
"question_id": "555-2",
"task_type": "OCR Problems",
"question": "What time does the hero in the video get up in real life?",
"options": [
"A. 11:00 AM.",
"B. 7:00 AM.",
"C. 10:30 AM.",
"D. 10:00 AM."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What time does the hero in the video get up in real life?\nOption:\nA. 11:00 AM.\nB. 7:00 AM.\nC. 10:30 AM.\nD. 10:00 AM.\nAnswer with the option's letter from the given choices directly.",
1663,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "555-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1664,
"target": "B",
"doc": {
"video_id": "555",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=Zjl2vmy02As",
"videoID": "Zjl2vmy02As",
"question_id": "555-3",
"task_type": "Action Recognition",
"question": "What did the male protagonist in the video not do in reality?",
"options": [
"A. Attending online meeting.",
"B. Cooking.",
"C. Coding.",
"D. Playing games."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the male protagonist in the video not do in reality?\nOption:\nA. Attending online meeting.\nB. Cooking.\nC. Coding.\nD. Playing games.\nAnswer with the option's letter from the given choices directly.",
1664,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "555-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1665,
"target": "C",
"doc": {
"video_id": "556",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=XkFulQe9EVE",
"videoID": "XkFulQe9EVE",
"question_id": "556-1",
"task_type": "Action Recognition",
"question": "In the video, what mode of transportation does the protagonist use to commute to school?",
"options": [
"A. Walking.",
"B. Biking.",
"C. Skateboarding.",
"D. Driving."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what mode of transportation does the protagonist use to commute to school?\nOption:\nA. Walking.\nB. Biking.\nC. Skateboarding.\nD. Driving.\nAnswer with the option's letter from the given choices directly.",
1665,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "556-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1666,
"target": "A",
"doc": {
"video_id": "556",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=XkFulQe9EVE",
"videoID": "XkFulQe9EVE",
"question_id": "556-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options has the correct sequence of events sort of appearing in the video?",
"options": [
"A. Study in the library, eat lunch, work out at the gym, do interviews.",
"B. Eat lunch, study in the library, work out in the gym, do interviews.",
"C. Study in the library, eat lunch, work out in the gym, do interviews.",
"D. Do interviews, eat lunch, study at the library, work out at the gym."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options has the correct sequence of events sort of appearing in the video?\nOption:\nA. Study in the library, eat lunch, work out at the gym, do interviews.\nB. Eat lunch, study in the library, work out in the gym, do interviews.\nC. Study in the library, eat lunch, work out in the gym, do interviews.\nD. Do interviews, eat lunch, study at the library, work out at the gym.\nAnswer with the option's letter from the given choices directly.",
1666,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "556-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1667,
"target": "D",
"doc": {
"video_id": "556",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=XkFulQe9EVE",
"videoID": "XkFulQe9EVE",
"question_id": "556-3",
"task_type": "Temporal Perception",
"question": "Which area of the video does the male protagonist stay in for the longest time in the afternoon?",
"options": [
"A. Home.",
"B. Library.",
"C. Classroom.",
"D. Harvard innovation lab."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which area of the video does the male protagonist stay in for the longest time in the afternoon?\nOption:\nA. Home.\nB. Library.\nC. Classroom.\nD. Harvard innovation lab.\nAnswer with the option's letter from the given choices directly.",
1667,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "556-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1668,
"target": "B",
"doc": {
"video_id": "557",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=FtIF8VafVZY",
"videoID": "FtIF8VafVZY",
"question_id": "557-1",
"task_type": "Temporal Reasoning",
"question": "What is the chronological sequence of the following events depicted in the video?\n①eat breakfast\n②gym workout\n③present in class",
"options": [
"A. ①②③.",
"B. ②①③.",
"C. ②③①.",
"D. ③②①."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the chronological sequence of the following events depicted in the video?\n①eat breakfast\n②gym workout\n③present in class\nOption:\nA. ①②③.\nB. ②①③.\nC. ②③①.\nD. ③②①.\nAnswer with the option's letter from the given choices directly.",
1668,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "557-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1669,
"target": "C",
"doc": {
"video_id": "557",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=FtIF8VafVZY",
"videoID": "FtIF8VafVZY",
"question_id": "557-2",
"task_type": "Action Recognition",
"question": "What did the male protagonist do during the time period between returning to the dormitory after studying outside at night and going out to study again?",
"options": [
"A. Brushing teeth and communicating with roommates.",
"B. Sleeping.",
"C. Eating dinner and tidying up clothes.",
"D. Making and eating dinner."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the male protagonist do during the time period between returning to the dormitory after studying outside at night and going out to study again?\nOption:\nA. Brushing teeth and communicating with roommates.\nB. Sleeping.\nC. Eating dinner and tidying up clothes.\nD. Making and eating dinner.\nAnswer with the option's letter from the given choices directly.",
1669,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "557-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1670,
"target": "A",
"doc": {
"video_id": "557",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=FtIF8VafVZY",
"videoID": "FtIF8VafVZY",
"question_id": "557-3",
"task_type": "Action Recognition",
"question": "In the video, what did the male protagonist do after giving the presentation?",
"options": [
"A. Heading to an event to meet up with friends.",
"B. Cycling back to the dormitory.",
"C. Studying in the student center.",
"D. Eating dinner."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what did the male protagonist do after giving the presentation?\nOption:\nA. Heading to an event to meet up with friends.\nB. Cycling back to the dormitory.\nC. Studying in the student center.\nD. Eating dinner.\nAnswer with the option's letter from the given choices directly.",
1670,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "557-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1671,
"target": "D",
"doc": {
"video_id": "558",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=pQIGOTs6XWg",
"videoID": "pQIGOTs6XWg",
"question_id": "558-1",
"task_type": "Object Reasoning",
"question": "What is the purpose of the high table placed at the back of the classroom in the video?",
"options": [
"A. Used to display trophies won by the class.",
"B. Used to place mobile phones handed in by classmates.",
"C. Used to hold snacks and refreshments for the students.",
"D. Used as a designated area for students to doze off to stand."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the high table placed at the back of the classroom in the video?\nOption:\nA. Used to display trophies won by the class.\nB. Used to place mobile phones handed in by classmates.\nC. Used to hold snacks and refreshments for the students.\nD. Used as a designated area for students to doze off to stand.\nAnswer with the option's letter from the given choices directly.",
1671,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "558-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1672,
"target": "B",
"doc": {
"video_id": "558",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=pQIGOTs6XWg",
"videoID": "pQIGOTs6XWg",
"question_id": "558-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly orders the sequence of the boy's daily itinerary?",
"options": [
"A. Go to class, go home, go to study room and continue studying.",
"B. Attend classes, go to private tutoring facilities for counselling, go to study halls to continue studying.",
"C. Go to class, go home, go to private tutoring facility to continue learning.",
"D. Neither."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly orders the sequence of the boy's daily itinerary?\nOption:\nA. Go to class, go home, go to study room and continue studying.\nB. Attend classes, go to private tutoring facilities for counselling, go to study halls to continue studying.\nC. Go to class, go home, go to private tutoring facility to continue learning.\nD. Neither.\nAnswer with the option's letter from the given choices directly.",
1672,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "558-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1673,
"target": "C",
"doc": {
"video_id": "558",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=pQIGOTs6XWg",
"videoID": "pQIGOTs6XWg",
"question_id": "558-3",
"task_type": "Temporal Reasoning",
"question": "When did the boy finally return home ready to rest?",
"options": [
"A. About 4:00 pm.",
"B. About 10:00 pm.",
"C. About 12:20 am at midnight.",
"D. About 10:20 pm."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When did the boy finally return home ready to rest?\nOption:\nA. About 4:00 pm.\nB. About 10:00 pm.\nC. About 12:20 am at midnight.\nD. About 10:20 pm.\nAnswer with the option's letter from the given choices directly.",
1673,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "558-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1674,
"target": "A",
"doc": {
"video_id": "559",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=wUnEDnAYTpE",
"videoID": "wUnEDnAYTpE",
"question_id": "559-1",
"task_type": "Temporal Perception",
"question": "What are the two time periods in the video when the male protagonist is training for swimming?",
"options": [
"A. 7:00AM-9:00AM and 3:00PM-5:00PM.",
"B. 7:00PM-9:00PM and 3:00AM-5:00AM.",
"C. 7:30AM-9:30AM and 3:30PM-5:30PM.",
"D. 6:00AM-9:00AM and 3:00PM-5:00PM."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the two time periods in the video when the male protagonist is training for swimming?\nOption:\nA. 7:00AM-9:00AM and 3:00PM-5:00PM.\nB. 7:00PM-9:00PM and 3:00AM-5:00AM.\nC. 7:30AM-9:30AM and 3:30PM-5:30PM.\nD. 6:00AM-9:00AM and 3:00PM-5:00PM.\nAnswer with the option's letter from the given choices directly.",
1674,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "559-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1675,
"target": "D",
"doc": {
"video_id": "559",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=wUnEDnAYTpE",
"videoID": "wUnEDnAYTpE",
"question_id": "559-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options is the correct ordering of events for the male lead?",
"options": [
"A. Swimming training,gym training,lunch.",
"B. Breakfast,swimming training,gym training.",
"C. Breakfast,swimming training,lunch.",
"D. Swimming training, breakfast, gym training."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is the correct ordering of events for the male lead?\nOption:\nA. Swimming training,gym training,lunch.\nB. Breakfast,swimming training,gym training.\nC. Breakfast,swimming training,lunch.\nD. Swimming training, breakfast, gym training.\nAnswer with the option's letter from the given choices directly.",
1675,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "559-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1676,
"target": "B",
"doc": {
"video_id": "559",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=wUnEDnAYTpE",
"videoID": "wUnEDnAYTpE",
"question_id": "559-3",
"task_type": "Action Recognition",
"question": "What did the man in the video do next after finishing his second swim session?",
"options": [
"A. Having a rest at room.",
"B. Eating dinner.",
"C. Recoverying.",
"D. Doing some work at room."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the man in the video do next after finishing his second swim session?\nOption:\nA. Having a rest at room.\nB. Eating dinner.\nC. Recoverying.\nD. Doing some work at room.\nAnswer with the option's letter from the given choices directly.",
1676,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "559-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1677,
"target": "C",
"doc": {
"video_id": "560",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=NeLj-gLWgug",
"videoID": "NeLj-gLWgug",
"question_id": "560-1",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly sorts the sequence of events in the video?",
"options": [
"A. Eating, having a rest, sweeping the floor.",
"B. Taking a shower, having lunch, studying.",
"C. Working out in the gym, taking a shower, eating.",
"D. Cleaning up the house, having dinner, hanging out."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly sorts the sequence of events in the video?\nOption:\nA. Eating, having a rest, sweeping the floor.\nB. Taking a shower, having lunch, studying.\nC. Working out in the gym, taking a shower, eating.\nD. Cleaning up the house, having dinner, hanging out.\nAnswer with the option's letter from the given choices directly.",
1677,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "560-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1678,
"target": "A",
"doc": {
"video_id": "560",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=NeLj-gLWgug",
"videoID": "NeLj-gLWgug",
"question_id": "560-2",
"task_type": "OCR Problems",
"question": "What time does the male protagonist in the video go out to relax?",
"options": [
"A. 7:30 PM.",
"B. 5:30 PM.",
"C. 4:30 PM.",
"D. 6:30 PM."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What time does the male protagonist in the video go out to relax?\nOption:\nA. 7:30 PM.\nB. 5:30 PM.\nC. 4:30 PM.\nD. 6:30 PM.\nAnswer with the option's letter from the given choices directly.",
1678,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "560-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1679,
"target": "D",
"doc": {
"video_id": "560",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=NeLj-gLWgug",
"videoID": "NeLj-gLWgug",
"question_id": "560-3",
"task_type": "Object Recognition",
"question": "Which food was absent from the hero's last meal in the video?",
"options": [
"A. Cucumbers.",
"B. Chicken.",
"C. Tomatoes.",
"D. Eggs."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which food was absent from the hero's last meal in the video?\nOption:\nA. Cucumbers.\nB. Chicken.\nC. Tomatoes.\nD. Eggs.\nAnswer with the option's letter from the given choices directly.",
1679,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "560-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1680,
"target": "A",
"doc": {
"video_id": "561",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=A30IuIjQYYg",
"videoID": "A30IuIjQYYg",
"question_id": "561-1",
"task_type": "Spatial Reasoning",
"question": "Which country is recorded by the video?",
"options": [
"A. Thailand.",
"B. Cambodia.",
"C. Laos.",
"D. Myanmar."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which country is recorded by the video?\nOption:\nA. Thailand.\nB. Cambodia.\nC. Laos.\nD. Myanmar.\nAnswer with the option's letter from the given choices directly.",
1680,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "561-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1681,
"target": "A",
"doc": {
"video_id": "561",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=A30IuIjQYYg",
"videoID": "A30IuIjQYYg",
"question_id": "561-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the activities appear in the video?",
"options": [
"A. Fire dance, marine photography, cave adventure.",
"B. Fire dance, cave adventure, marine photography.",
"C. Cave adventure, marine photography, fire dance.",
"D. Marine photography, fire dance, cave adventure."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the activities appear in the video?\nOption:\nA. Fire dance, marine photography, cave adventure.\nB. Fire dance, cave adventure, marine photography.\nC. Cave adventure, marine photography, fire dance.\nD. Marine photography, fire dance, cave adventure.\nAnswer with the option's letter from the given choices directly.",
1681,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "561-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1682,
"target": "B",
"doc": {
"video_id": "561",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=A30IuIjQYYg",
"videoID": "A30IuIjQYYg",
"question_id": "561-3",
"task_type": "Counting Problem",
"question": "How many times does underwater photography appear in the video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does underwater photography appear in the video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1682,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "561-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1683,
"target": "C",
"doc": {
"video_id": "562",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=nhpAc7lUgis",
"videoID": "nhpAc7lUgis",
"question_id": "562-1",
"task_type": "Spatial Perception",
"question": "What is the direction of the person in the video relative to the camera when sliding with the rope?",
"options": [
"A. From left to right.",
"B. From distant to close.",
"C. From near to far.",
"D. From right to left."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the direction of the person in the video relative to the camera when sliding with the rope?\nOption:\nA. From left to right.\nB. From distant to close.\nC. From near to far.\nD. From right to left.\nAnswer with the option's letter from the given choices directly.",
1683,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "562-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Perception",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1684,
"target": "C",
"doc": {
"video_id": "562",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=nhpAc7lUgis",
"videoID": "nhpAc7lUgis",
"question_id": "562-2",
"task_type": "Spatial Reasoning",
"question": "Where is the main character in the video on holiday?",
"options": [
"A. Sandcastle.",
"B. Private mansion.",
"C. Beachside.",
"D. Music festival."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the main character in the video on holiday?\nOption:\nA. Sandcastle.\nB. Private mansion.\nC. Beachside.\nD. Music festival.\nAnswer with the option's letter from the given choices directly.",
1684,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "562-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1685,
"target": "D",
"doc": {
"video_id": "562",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=nhpAc7lUgis",
"videoID": "nhpAc7lUgis",
"question_id": "562-3",
"task_type": "Object Reasoning",
"question": "What is the relationship between the man and woman in the video?",
"options": [
"A. Friends.",
"B. Brother and sister.",
"C. Father and daughter.",
"D. Couples."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the man and woman in the video?\nOption:\nA. Friends.\nB. Brother and sister.\nC. Father and daughter.\nD. Couples.\nAnswer with the option's letter from the given choices directly.",
1685,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "562-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1686,
"target": "D",
"doc": {
"video_id": "563",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=MIYDzEYfI6U",
"videoID": "MIYDzEYfI6U",
"question_id": "563-1",
"task_type": "Object Recognition",
"question": "Where did the author of the video learn that \"people can adapt and find happiness\"?",
"options": [
"A. North Africa.",
"B. East Asian.",
"C. South Africa.",
"D. Southeast Asia."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where did the author of the video learn that \"people can adapt and find happiness\"?\nOption:\nA. North Africa.\nB. East Asian.\nC. South Africa.\nD. Southeast Asia.\nAnswer with the option's letter from the given choices directly.",
1686,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "563-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1687,
"target": "B",
"doc": {
"video_id": "563",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=MIYDzEYfI6U",
"videoID": "MIYDzEYfI6U",
"question_id": "563-2",
"task_type": "Object Reasoning",
"question": "Why does the author of the video feel the need to \"appreciate your own nation\"?",
"options": [
"A. Because New Zealand has its own unique culture.",
"B. Because she has been exposed to diverse world cultures through her travels.",
"C. Because she is a volunteer in the promotion of New Zealand's national culture.",
"D. Because her travel experiences in other countries made her uncomfortable."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the author of the video feel the need to \"appreciate your own nation\"?\nOption:\nA. Because New Zealand has its own unique culture.\nB. Because she has been exposed to diverse world cultures through her travels.\nC. Because she is a volunteer in the promotion of New Zealand's national culture.\nD. Because her travel experiences in other countries made her uncomfortable.\nAnswer with the option's letter from the given choices directly.",
1687,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "563-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1688,
"target": "D",
"doc": {
"video_id": "563",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=MIYDzEYfI6U",
"videoID": "MIYDzEYfI6U",
"question_id": "563-3",
"task_type": "Counting Problem",
"question": "How many lessons does the author of the video illustrate that she has learned?",
"options": [
"A. 9.",
"B. 10.",
"C. 12.",
"D. 11."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many lessons does the author of the video illustrate that she has learned?\nOption:\nA. 9.\nB. 10.\nC. 12.\nD. 11.\nAnswer with the option's letter from the given choices directly.",
1688,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "563-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1689,
"target": "A",
"doc": {
"video_id": "564",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=tPQX0lMk6U0",
"videoID": "tPQX0lMk6U0",
"question_id": "564-1",
"task_type": "Action Reasoning",
"question": "What is the underlying reason why the main character in the video takes off his shoes when he gets into the car?",
"options": [
"A. She is too sleepy.",
"B. She doesn't like to wear shoes in the car.",
"C. She thinks she is in a hotel room.",
"D. Her mum makes her take off her shoes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the underlying reason why the main character in the video takes off his shoes when he gets into the car?\nOption:\nA. She is too sleepy.\nB. She doesn't like to wear shoes in the car.\nC. She thinks she is in a hotel room.\nD. Her mum makes her take off her shoes.\nAnswer with the option's letter from the given choices directly.",
1689,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "564-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1690,
"target": "B",
"doc": {
"video_id": "564",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=tPQX0lMk6U0",
"videoID": "tPQX0lMk6U0",
"question_id": "564-2",
"task_type": "Action Reasoning",
"question": "What does the protagonist in the video think is the worst decision ever?",
"options": [
"A. She takes a double-decker bus.",
"B. She only eats one egg before she leaves.",
"C. She forgets to take her pills.",
"D. She go to Burger King."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the protagonist in the video think is the worst decision ever?\nOption:\nA. She takes a double-decker bus.\nB. She only eats one egg before she leaves.\nC. She forgets to take her pills.\nD. She go to Burger King.\nAnswer with the option's letter from the given choices directly.",
1690,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "564-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1691,
"target": "C",
"doc": {
"video_id": "564",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=tPQX0lMk6U0",
"videoID": "tPQX0lMk6U0",
"question_id": "564-3",
"task_type": "Object Reasoning",
"question": "Why does the main character cry at the end of the video?",
"options": [
"A. Because she had nightmares.",
"B. Because her dad spills water on the quilt.",
"C. Because she wet the bed.",
"D. Because she is woken up by her brother."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the main character cry at the end of the video?\nOption:\nA. Because she had nightmares.\nB. Because her dad spills water on the quilt.\nC. Because she wet the bed.\nD. Because she is woken up by her brother.\nAnswer with the option's letter from the given choices directly.",
1691,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "564-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1692,
"target": "D",
"doc": {
"video_id": "565",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=QHFy-nWNJYk",
"videoID": "QHFy-nWNJYk",
"question_id": "565-1",
"task_type": "OCR Problems",
"question": "Where does the video show them spending the night in the car?",
"options": [
"A. Seven Magic Mountains.",
"B. Las Vegas.",
"C. Death Valley.",
"D. Bryce Canyon."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the video show them spending the night in the car?\nOption:\nA. Seven Magic Mountains.\nB. Las Vegas.\nC. Death Valley.\nD. Bryce Canyon.\nAnswer with the option's letter from the given choices directly.",
1692,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "565-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1693,
"target": "B",
"doc": {
"video_id": "565",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=QHFy-nWNJYk",
"videoID": "QHFy-nWNJYk",
"question_id": "565-2",
"task_type": "Counting Problem",
"question": "How many attractions are shown in the video?",
"options": [
"A. 9.",
"B. 10.",
"C. 11.",
"D. 8."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many attractions are shown in the video?\nOption:\nA. 9.\nB. 10.\nC. 11.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1693,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "565-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1694,
"target": "A",
"doc": {
"video_id": "565",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=QHFy-nWNJYk",
"videoID": "QHFy-nWNJYk",
"question_id": "565-3",
"task_type": "Object Recognition",
"question": "What mode of transport do the people in the video take to get to Horseshoe Bay?",
"options": [
"A. Helicopter.",
"B. Car.",
"C. Bus.",
"D. Boat."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What mode of transport do the people in the video take to get to Horseshoe Bay?\nOption:\nA. Helicopter.\nB. Car.\nC. Bus.\nD. Boat.\nAnswer with the option's letter from the given choices directly.",
1694,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "565-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1695,
"target": "B",
"doc": {
"video_id": "566",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=4ME5ZfDg2aw",
"videoID": "4ME5ZfDg2aw",
"question_id": "566-1",
"task_type": "Object Reasoning",
"question": "Why does the author feel betrayed at the beginning of the video?",
"options": [
"A. Greenland is deserted.",
"B. Greenland is not green.",
"C. Greenland is very cold.",
"D. Greenland is very expensive."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the author feel betrayed at the beginning of the video?\nOption:\nA. Greenland is deserted.\nB. Greenland is not green.\nC. Greenland is very cold.\nD. Greenland is very expensive.\nAnswer with the option's letter from the given choices directly.",
1695,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "566-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1696,
"target": "C",
"doc": {
"video_id": "566",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=4ME5ZfDg2aw",
"videoID": "4ME5ZfDg2aw",
"question_id": "566-2",
"task_type": "Object Reasoning",
"question": "What is the role of the dog in the video?",
"options": [
"A. They are sheepdogs.",
"B. They are household pets.",
"C. They were used as transportation.",
"D. They are used to interact with visitors."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the dog in the video?\nOption:\nA. They are sheepdogs.\nB. They are household pets.\nC. They were used as transportation.\nD. They are used to interact with visitors.\nAnswer with the option's letter from the given choices directly.",
1696,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "566-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1697,
"target": "A",
"doc": {
"video_id": "566",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=4ME5ZfDg2aw",
"videoID": "4ME5ZfDg2aw",
"question_id": "566-3",
"task_type": "Temporal Reasoning",
"question": "What activities are recorded sequentially in the video?",
"options": [
"A. Explore ice sheet, take sled dog rides, visit residents.",
"B. Explore ice sheet, visit residents, take sled dog rides.",
"C. Take sled dog rides, visit residents, explore ice sheet.",
"D. Visit residents, explore ice sheet, take sled dog rides."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What activities are recorded sequentially in the video?\nOption:\nA. Explore ice sheet, take sled dog rides, visit residents.\nB. Explore ice sheet, visit residents, take sled dog rides.\nC. Take sled dog rides, visit residents, explore ice sheet.\nD. Visit residents, explore ice sheet, take sled dog rides.\nAnswer with the option's letter from the given choices directly.",
1697,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "566-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1698,
"target": "D",
"doc": {
"video_id": "567",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=sr284c-q8oY",
"videoID": "sr284c-q8oY",
"question_id": "567-1",
"task_type": "Object Recognition",
"question": "Where is the main character at the beginning of the video when she talks to the camera?",
"options": [
"A. Helicopter.",
"B. Hotel.",
"C. Restaurant.",
"D. Cruise ship."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the main character at the beginning of the video when she talks to the camera?\nOption:\nA. Helicopter.\nB. Hotel.\nC. Restaurant.\nD. Cruise ship.\nAnswer with the option's letter from the given choices directly.",
1698,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "567-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1699,
"target": "A",
"doc": {
"video_id": "567",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=sr284c-q8oY",
"videoID": "sr284c-q8oY",
"question_id": "567-2",
"task_type": "Object Recognition",
"question": "What animals did the main character not see on the trip?",
"options": [
"A. Cat.",
"B. Eagle.",
"C. Brown bear.",
"D. Seal."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What animals did the main character not see on the trip?\nOption:\nA. Cat.\nB. Eagle.\nC. Brown bear.\nD. Seal.\nAnswer with the option's letter from the given choices directly.",
1699,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "567-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1700,
"target": "C",
"doc": {
"video_id": "567",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=sr284c-q8oY",
"videoID": "sr284c-q8oY",
"question_id": "567-3",
"task_type": "Spatial Reasoning",
"question": "At the end of the video, what time is it roughly when the main character faces the camera?",
"options": [
"A. 4 p.m.",
"B. 11:40 p.m.",
"C. 10 p.m.",
"D. 10 a.m."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the video, what time is it roughly when the main character faces the camera?\nOption:\nA. 4 p.m.\nB. 11:40 p.m.\nC. 10 p.m.\nD. 10 a.m.\nAnswer with the option's letter from the given choices directly.",
1700,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "567-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1701,
"target": "B",
"doc": {
"video_id": "568",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=von_IMi97-w",
"videoID": "von_IMi97-w",
"question_id": "568-1",
"task_type": "OCR Problems",
"question": "How far does the main character in the video travel in total?",
"options": [
"A. 254,000 miles.",
"B. 3,224 miles.",
"C. 37 miles.",
"D. 21,000 miles."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How far does the main character in the video travel in total?\nOption:\nA. 254,000 miles.\nB. 3,224 miles.\nC. 37 miles.\nD. 21,000 miles.\nAnswer with the option's letter from the given choices directly.",
1701,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "568-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1702,
"target": "C",
"doc": {
"video_id": "568",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=von_IMi97-w",
"videoID": "von_IMi97-w",
"question_id": "568-2",
"task_type": "Object Reasoning",
"question": "What is the main reason for Amtrak's failure as shown by the video?",
"options": [
"A. Because Amtrak's service is not good enough.",
"B. Travelers prefer to travel via federal highways, causing it to lose money.",
"C. Too little national financial support.",
"D. Cannot be inferred."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main reason for Amtrak's failure as shown by the video?\nOption:\nA. Because Amtrak's service is not good enough.\nB. Travelers prefer to travel via federal highways, causing it to lose money.\nC. Too little national financial support.\nD. Cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
1702,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "568-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1703,
"target": "C",
"doc": {
"video_id": "568",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=von_IMi97-w",
"videoID": "von_IMi97-w",
"question_id": "568-3",
"task_type": "Object Recognition",
"question": "After telling the story about the railroad, where does the main character in the video reach?",
"options": [
"A. Los Angeles.",
"B. New York City.",
"C. Chicago.",
"D. Ohio."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After telling the story about the railroad, where does the main character in the video reach?\nOption:\nA. Los Angeles.\nB. New York City.\nC. Chicago.\nD. Ohio.\nAnswer with the option's letter from the given choices directly.",
1703,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "568-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1704,
"target": "A",
"doc": {
"video_id": "569",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=VMJEY6uCX3k",
"videoID": "VMJEY6uCX3k",
"question_id": "569-1",
"task_type": "Attribute Perception",
"question": "When does the construction of the highway start, as described by the man in his first viewpoint at the beginning of the video?",
"options": [
"A. 1919.",
"B. 1922.",
"C. 1999.",
"D. 1990."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When does the construction of the highway start, as described by the man in his first viewpoint at the beginning of the video?\nOption:\nA. 1919.\nB. 1922.\nC. 1999.\nD. 1990.\nAnswer with the option's letter from the given choices directly.",
1704,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "569-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1705,
"target": "B",
"doc": {
"video_id": "569",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=VMJEY6uCX3k",
"videoID": "VMJEY6uCX3k",
"question_id": "569-2",
"task_type": "Object Reasoning",
"question": "Extrapolating from the video, what is most likely carved on the stone next to the highway?",
"options": [
"A. Telephone number.",
"B. Prisoner's id number.",
"C. Highway number.",
"D. Decorative pattern."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Extrapolating from the video, what is most likely carved on the stone next to the highway?\nOption:\nA. Telephone number.\nB. Prisoner's id number.\nC. Highway number.\nD. Decorative pattern.\nAnswer with the option's letter from the given choices directly.",
1705,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "569-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1706,
"target": "D",
"doc": {
"video_id": "569",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=VMJEY6uCX3k",
"videoID": "VMJEY6uCX3k",
"question_id": "569-3",
"task_type": "Temporal Perception",
"question": "Where does the main character arrive at the end of the video?",
"options": [
"A. A mine.",
"B. Queen creek bridge.",
"C. Claypool Tunnel.",
"D. Town of Superior."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where does the main character arrive at the end of the video?\nOption:\nA. A mine.\nB. Queen creek bridge.\nC. Claypool Tunnel.\nD. Town of Superior.\nAnswer with the option's letter from the given choices directly.",
1706,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "569-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1707,
"target": "B",
"doc": {
"video_id": "570",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=M_lwC5zPkyo",
"videoID": "M_lwC5zPkyo",
"question_id": "570-1",
"task_type": "Object Recognition",
"question": "The man in the video ate pancakes after visiting which attraction?",
"options": [
"A. French Concession area.",
"B. People's Square.",
"C. Park.",
"D. Temples."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The man in the video ate pancakes after visiting which attraction?\nOption:\nA. French Concession area.\nB. People's Square.\nC. Park.\nD. Temples.\nAnswer with the option's letter from the given choices directly.",
1707,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "570-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1708,
"target": "C",
"doc": {
"video_id": "570",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=M_lwC5zPkyo",
"videoID": "M_lwC5zPkyo",
"question_id": "570-2",
"task_type": "Object Recognition",
"question": "What kind of chess are the old people in the video playing?",
"options": [
"A. Mahjong.",
"B. Go.",
"C. Chinese chess.",
"D. Five-in-a-row."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of chess are the old people in the video playing?\nOption:\nA. Mahjong.\nB. Go.\nC. Chinese chess.\nD. Five-in-a-row.\nAnswer with the option's letter from the given choices directly.",
1708,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "570-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1709,
"target": "A",
"doc": {
"video_id": "570",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=M_lwC5zPkyo",
"videoID": "M_lwC5zPkyo",
"question_id": "570-3",
"task_type": "Object Reasoning",
"question": "Why are Shanghai temples so crowded, as inferred by the main character in the video?",
"options": [
"A. Coming up to Chinese New Year.",
"B. There will be free wine tasting.",
"C. It's peak tourist season.",
"D. Lantern festival was organized."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why are Shanghai temples so crowded, as inferred by the main character in the video?\nOption:\nA. Coming up to Chinese New Year.\nB. There will be free wine tasting.\nC. It's peak tourist season.\nD. Lantern festival was organized.\nAnswer with the option's letter from the given choices directly.",
1709,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "570-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1710,
"target": "A",
"doc": {
"video_id": "571",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=Huj-zXv4DEw",
"videoID": "Huj-zXv4DEw",
"question_id": "571-1",
"task_type": "Action Recognition",
"question": "What actions does the red-black parrot perform in this video?",
"options": [
"A. Putting the ball in the basket.",
"B. Throwing the ball.",
"C. Eating the ball.",
"D. Breaking the ball."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What actions does the red-black parrot perform in this video?\nOption:\nA. Putting the ball in the basket.\nB. Throwing the ball.\nC. Eating the ball.\nD. Breaking the ball.\nAnswer with the option's letter from the given choices directly.",
1710,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "571-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1711,
"target": "B",
"doc": {
"video_id": "571",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=Huj-zXv4DEw",
"videoID": "Huj-zXv4DEw",
"question_id": "571-2",
"task_type": "Action Reasoning",
"question": "How does the yellow dog feel when a girl wants to cut its hair?",
"options": [
"A. Easy.",
"B. Angry.",
"C. Excited.",
"D. Happy."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the yellow dog feel when a girl wants to cut its hair?\nOption:\nA. Easy.\nB. Angry.\nC. Excited.\nD. Happy.\nAnswer with the option's letter from the given choices directly.",
1711,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "571-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1712,
"target": "C",
"doc": {
"video_id": "571",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=Huj-zXv4DEw",
"videoID": "Huj-zXv4DEw",
"question_id": "571-3",
"task_type": "Attribute Perception",
"question": "What color are the parrots in the video when three of them appear?",
"options": [
"A. Blue,black,red.",
"B. Yellow,red,green.",
"C. Yellow,blue,green.",
"D. Red,blue,green."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color are the parrots in the video when three of them appear?\nOption:\nA. Blue,black,red.\nB. Yellow,red,green.\nC. Yellow,blue,green.\nD. Red,blue,green.\nAnswer with the option's letter from the given choices directly.",
1712,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "571-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1713,
"target": "C",
"doc": {
"video_id": "572",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=IjSISLZCKPs",
"videoID": "IjSISLZCKPs",
"question_id": "572-1",
"task_type": "Action Recognition",
"question": "What is the black-white cat doing in the toilet?",
"options": [
"A. Drinking the water.",
"B. Taking a shower.",
"C. Shredding paper.",
"D. Using the litter box."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the black-white cat doing in the toilet?\nOption:\nA. Drinking the water.\nB. Taking a shower.\nC. Shredding paper.\nD. Using the litter box.\nAnswer with the option's letter from the given choices directly.",
1713,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "572-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1714,
"target": "C",
"doc": {
"video_id": "572",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=IjSISLZCKPs",
"videoID": "IjSISLZCKPs",
"question_id": "572-2",
"task_type": "Action Recognition",
"question": "What is the reaction of the dog when a person walks towards a black-white dog holding a broken black slipper?",
"options": [
"A. Getting closer to the man.",
"B. Taking the slipper away.",
"C. Running to sofa.",
"D. Howling and crying."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reaction of the dog when a person walks towards a black-white dog holding a broken black slipper?\nOption:\nA. Getting closer to the man.\nB. Taking the slipper away.\nC. Running to sofa.\nD. Howling and crying.\nAnswer with the option's letter from the given choices directly.",
1714,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "572-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1715,
"target": "D",
"doc": {
"video_id": "572",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=IjSISLZCKPs",
"videoID": "IjSISLZCKPs",
"question_id": "572-3",
"task_type": "Counting Problem",
"question": "How many dogs are there when a black-white dog emerges from a black and white pillow?",
"options": [
"A. 4.",
"B. 1.",
"C. 3.",
"D. 2."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many dogs are there when a black-white dog emerges from a black and white pillow?\nOption:\nA. 4.\nB. 1.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1715,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "572-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1716,
"target": "D",
"doc": {
"video_id": "573",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=1NcGHbFWBFA",
"videoID": "1NcGHbFWBFA",
"question_id": "573-1",
"task_type": "Attribute Perception",
"question": "What color does the fish turn when fed a sea urchin by a man?",
"options": [
"A. Yellow-black.",
"B. Black-white.",
"C. Green-yellow.",
"D. Blue-yellow."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color does the fish turn when fed a sea urchin by a man?\nOption:\nA. Yellow-black.\nB. Black-white.\nC. Green-yellow.\nD. Blue-yellow.\nAnswer with the option's letter from the given choices directly.",
1716,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "573-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1717,
"target": "B",
"doc": {
"video_id": "573",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=1NcGHbFWBFA",
"videoID": "1NcGHbFWBFA",
"question_id": "573-2",
"task_type": "Counting Problem",
"question": "How many hats does the hamster change in this video?",
"options": [
"A. 4.",
"B. 6.",
"C. 5.",
"D. 7."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many hats does the hamster change in this video?\nOption:\nA. 4.\nB. 6.\nC. 5.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1717,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "573-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1718,
"target": "A",
"doc": {
"video_id": "573",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=1NcGHbFWBFA",
"videoID": "1NcGHbFWBFA",
"question_id": "573-3",
"task_type": "Action Recognition",
"question": "What does the dog do when a person walks on the floor in socks?",
"options": [
"A. Laying on the floor.",
"B. Sitting and waits for food.",
"C. Sitting on the floor.",
"D. Rolling on the floor."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the dog do when a person walks on the floor in socks?\nOption:\nA. Laying on the floor.\nB. Sitting and waits for food.\nC. Sitting on the floor.\nD. Rolling on the floor.\nAnswer with the option's letter from the given choices directly.",
1718,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "573-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1719,
"target": "C",
"doc": {
"video_id": "574",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=BVJGvKaFG7U",
"videoID": "BVJGvKaFG7U",
"question_id": "574-1",
"task_type": "Object Recognition",
"question": "What is the first animal that appears in this video?",
"options": [
"A. Dog.",
"B. Rhino.",
"C. Giraffe.",
"D. Cat."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first animal that appears in this video?\nOption:\nA. Dog.\nB. Rhino.\nC. Giraffe.\nD. Cat.\nAnswer with the option's letter from the given choices directly.",
1719,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "574-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1720,
"target": "A",
"doc": {
"video_id": "574",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=BVJGvKaFG7U",
"videoID": "BVJGvKaFG7U",
"question_id": "574-2",
"task_type": "Action Recognition",
"question": "What happens when a baby elephant chases a group of ducks?",
"options": [
"A. The baby elephant falls to the ground.",
"B. The ducks fly away.",
"C. The baby elephant jumps in water.",
"D. The baby elephant crashes into a tree."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when a baby elephant chases a group of ducks?\nOption:\nA. The baby elephant falls to the ground.\nB. The ducks fly away.\nC. The baby elephant jumps in water.\nD. The baby elephant crashes into a tree.\nAnswer with the option's letter from the given choices directly.",
1720,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "574-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1721,
"target": "B",
"doc": {
"video_id": "574",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=BVJGvKaFG7U",
"videoID": "BVJGvKaFG7U",
"question_id": "574-3",
"task_type": "Counting Problem",
"question": "How many babies does the lion mother have in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many babies does the lion mother have in the video?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
1721,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "574-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1722,
"target": "D",
"doc": {
"video_id": "575",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=cFqLEwAvaHI",
"videoID": "cFqLEwAvaHI",
"question_id": "575-1",
"task_type": "Object Recognition",
"question": "What are the animals featured in the video?",
"options": [
"A. Pandas.",
"B. Dogs.",
"C. Cats.",
"D. Foxes."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the animals featured in the video?\nOption:\nA. Pandas.\nB. Dogs.\nC. Cats.\nD. Foxes.\nAnswer with the option's letter from the given choices directly.",
1722,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "575-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1723,
"target": "D",
"doc": {
"video_id": "575",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=cFqLEwAvaHI",
"videoID": "cFqLEwAvaHI",
"question_id": "575-2",
"task_type": "Action Reasoning",
"question": "What color is the house?",
"options": [
"A. Blue.",
"B. Green.",
"C. White.",
"D. Red."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the house?\nOption:\nA. Blue.\nB. Green.\nC. White.\nD. Red.\nAnswer with the option's letter from the given choices directly.",
1723,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "575-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1724,
"target": "A",
"doc": {
"video_id": "575",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=cFqLEwAvaHI",
"videoID": "cFqLEwAvaHI",
"question_id": "575-3",
"task_type": "Object Recognition",
"question": "What is between the yellow fox and the white fox?",
"options": [
"A. Fence.",
"B. Stone.",
"C. Tree.",
"D. House."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is between the yellow fox and the white fox?\nOption:\nA. Fence.\nB. Stone.\nC. Tree.\nD. House.\nAnswer with the option's letter from the given choices directly.",
1724,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "575-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1725,
"target": "B",
"doc": {
"video_id": "576",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=LFGE_8SoYVA",
"videoID": "LFGE_8SoYVA",
"question_id": "576-1",
"task_type": "Action Recognition",
"question": "What does the squirrel do in this video?",
"options": [
"A. Eating food.",
"B. Wrestling with a doll.",
"C. Climbing on a tree.",
"D. Lying down and sleeping."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the squirrel do in this video?\nOption:\nA. Eating food.\nB. Wrestling with a doll.\nC. Climbing on a tree.\nD. Lying down and sleeping.\nAnswer with the option's letter from the given choices directly.",
1725,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "576-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1726,
"target": "D",
"doc": {
"video_id": "576",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=LFGE_8SoYVA",
"videoID": "LFGE_8SoYVA",
"question_id": "576-2",
"task_type": "Action Recognition",
"question": "What does the yellow cat with black spots do in this video?",
"options": [
"A. Fights with people.",
"B. Climbs on wall.",
"C. Explores a hole.",
"D. Eats foods in a microwave oven."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the yellow cat with black spots do in this video?\nOption:\nA. Fights with people.\nB. Climbs on wall.\nC. Explores a hole.\nD. Eats foods in a microwave oven.\nAnswer with the option's letter from the given choices directly.",
1726,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "576-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1727,
"target": "B",
"doc": {
"video_id": "576",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=LFGE_8SoYVA",
"videoID": "LFGE_8SoYVA",
"question_id": "576-3",
"task_type": "Action Recognition",
"question": "What does the white parrot do in this video?",
"options": [
"A. Drinking water.",
"B. Shouting to a mug.",
"C. Finding food.",
"D. Quarreling with others."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the white parrot do in this video?\nOption:\nA. Drinking water.\nB. Shouting to a mug.\nC. Finding food.\nD. Quarreling with others.\nAnswer with the option's letter from the given choices directly.",
1727,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "576-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1728,
"target": "A",
"doc": {
"video_id": "577",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=VoJ-Ey6q8uM",
"videoID": "VoJ-Ey6q8uM",
"question_id": "577-1",
"task_type": "Action Recognition",
"question": "What do the animals in this video do?",
"options": [
"A. They encounter their reflections.",
"B. They engage in battles with similar animals.",
"C. They seek to form connections with other animals.",
"D. They come across their doppelgangers."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the animals in this video do?\nOption:\nA. They encounter their reflections.\nB. They engage in battles with similar animals.\nC. They seek to form connections with other animals.\nD. They come across their doppelgangers.\nAnswer with the option's letter from the given choices directly.",
1728,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "577-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1729,
"target": "D",
"doc": {
"video_id": "577",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=VoJ-Ey6q8uM",
"videoID": "VoJ-Ey6q8uM",
"question_id": "577-2",
"task_type": "Action Recognition",
"question": "What happens when a boy picks up a mirror?",
"options": [
"A. He runs away.",
"B. He dances on floor.",
"C. He wants to sleep.",
"D. A bird attacks him."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when a boy picks up a mirror?\nOption:\nA. He runs away.\nB. He dances on floor.\nC. He wants to sleep.\nD. A bird attacks him.\nAnswer with the option's letter from the given choices directly.",
1729,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "577-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1730,
"target": "B",
"doc": {
"video_id": "577",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=VoJ-Ey6q8uM",
"videoID": "VoJ-Ey6q8uM",
"question_id": "577-3",
"task_type": "Attribute Perception",
"question": "What the color is the parrot that stands on wood?",
"options": [
"A. Blue.",
"B. Green.",
"C. White.",
"D. Black."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What the color is the parrot that stands on wood?\nOption:\nA. Blue.\nB. Green.\nC. White.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
1730,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "577-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1731,
"target": "B",
"doc": {
"video_id": "578",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=l73rmrLTHQc",
"videoID": "l73rmrLTHQc",
"question_id": "578-1",
"task_type": "Action Recognition",
"question": "What does the first panda do in this video?",
"options": [
"A. Eating an apple.",
"B. Chasing an apple.",
"C. Exercising in the room.",
"D. Playing with a ball."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the first panda do in this video?\nOption:\nA. Eating an apple.\nB. Chasing an apple.\nC. Exercising in the room.\nD. Playing with a ball.\nAnswer with the option's letter from the given choices directly.",
1731,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "578-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1732,
"target": "B",
"doc": {
"video_id": "578",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=l73rmrLTHQc",
"videoID": "l73rmrLTHQc",
"question_id": "578-2",
"task_type": "Counting Problem",
"question": "How many pandas are on the swing?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many pandas are on the swing?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1732,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "578-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1733,
"target": "D",
"doc": {
"video_id": "578",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=l73rmrLTHQc",
"videoID": "l73rmrLTHQc",
"question_id": "578-3",
"task_type": "Action Recognition",
"question": "What does the panda that takes a blue bucket do in this video?",
"options": [
"A. Throwing the bucket away.",
"B. Putting the bucket on other pandas.",
"C. Eating food in the bucket.",
"D. Putting itself in the bucket."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the panda that takes a blue bucket do in this video?\nOption:\nA. Throwing the bucket away.\nB. Putting the bucket on other pandas.\nC. Eating food in the bucket.\nD. Putting itself in the bucket.\nAnswer with the option's letter from the given choices directly.",
1733,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "578-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1734,
"target": "C",
"doc": {
"video_id": "579",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=a3o9a6pBKz8",
"videoID": "a3o9a6pBKz8",
"question_id": "579-1",
"task_type": "Action Recognition",
"question": "What does the polar bear do in this video?",
"options": [
"A. Attempting to escape.",
"B. Searching for food.",
"C. Swinging with a cub.",
"D. Running on the glacier."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the polar bear do in this video?\nOption:\nA. Attempting to escape.\nB. Searching for food.\nC. Swinging with a cub.\nD. Running on the glacier.\nAnswer with the option's letter from the given choices directly.",
1734,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "579-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1735,
"target": "A",
"doc": {
"video_id": "579",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=a3o9a6pBKz8",
"videoID": "a3o9a6pBKz8",
"question_id": "579-2",
"task_type": "Action Recognition",
"question": "What does the monkey do to the woman?",
"options": [
"A. Pulling her hair.",
"B. Begging for food.",
"C. Climbing on her.",
"D. Stealing things from her."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the monkey do to the woman?\nOption:\nA. Pulling her hair.\nB. Begging for food.\nC. Climbing on her.\nD. Stealing things from her.\nAnswer with the option's letter from the given choices directly.",
1735,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "579-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1736,
"target": "C",
"doc": {
"video_id": "579",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=a3o9a6pBKz8",
"videoID": "a3o9a6pBKz8",
"question_id": "579-3",
"task_type": "Action Recognition",
"question": "What does the brown horse do to the boy with blue costume?",
"options": [
"A. Allowing the boy ride on it.",
"B. Eating food in the boy's hand.",
"C. Pulling his costume hat.",
"D. Ignoring the boy."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the brown horse do to the boy with blue costume?\nOption:\nA. Allowing the boy ride on it.\nB. Eating food in the boy's hand.\nC. Pulling his costume hat.\nD. Ignoring the boy.\nAnswer with the option's letter from the given choices directly.",
1736,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "579-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1737,
"target": "A",
"doc": {
"video_id": "580",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=LVRcD_-ht3g",
"videoID": "LVRcD_-ht3g",
"question_id": "580-1",
"task_type": "Action Recognition",
"question": "What does a black dog do in water?",
"options": [
"A. Assisting a boy in water.",
"B. Submerging a boy in water.",
"C. Engaging with a boy.",
"D. Searching for fish."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does a black dog do in water?\nOption:\nA. Assisting a boy in water.\nB. Submerging a boy in water.\nC. Engaging with a boy.\nD. Searching for fish.\nAnswer with the option's letter from the given choices directly.",
1737,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "580-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1738,
"target": "A",
"doc": {
"video_id": "580",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=LVRcD_-ht3g",
"videoID": "LVRcD_-ht3g",
"question_id": "580-2",
"task_type": "Action Reasoning",
"question": "Why does the black dog hold the goose?",
"options": [
"A. Protecting the cameraman from attacks of the goose.",
"B. Consume the goose as food.",
"C. Playing with the goose.",
"D. Letting the goose go elsewhere."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the black dog hold the goose?\nOption:\nA. Protecting the cameraman from attacks of the goose.\nB. Consume the goose as food.\nC. Playing with the goose.\nD. Letting the goose go elsewhere.\nAnswer with the option's letter from the given choices directly.",
1738,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "580-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1739,
"target": "A",
"doc": {
"video_id": "580",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=LVRcD_-ht3g",
"videoID": "LVRcD_-ht3g",
"question_id": "580-3",
"task_type": "Action Recognition",
"question": "What does a black elephant do in this video?",
"options": [
"A. Helping a man in river.",
"B. Finding fishes for food.",
"C. Going across the river.",
"D. Washing itself."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does a black elephant do in this video?\nOption:\nA. Helping a man in river.\nB. Finding fishes for food.\nC. Going across the river.\nD. Washing itself.\nAnswer with the option's letter from the given choices directly.",
1739,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "580-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1740,
"target": "A",
"doc": {
"video_id": "581",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=kDo4y8-bZko",
"videoID": "kDo4y8-bZko",
"question_id": "581-1",
"task_type": "Object Reasoning",
"question": "What does the coach wearing a white top and black trousers and a hat at the beginning of the video mainly explain to the athlete wearing a black top and grey and white shorts?",
"options": [
"A. Maintaining relaxation as well as leg and arm movement essentials.",
"B. How to breathe while running.",
"C. Maintaining tightness and arm movement essentials.",
"D. How to warm up before a match."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the coach wearing a white top and black trousers and a hat at the beginning of the video mainly explain to the athlete wearing a black top and grey and white shorts?\nOption:\nA. Maintaining relaxation as well as leg and arm movement essentials.\nB. How to breathe while running.\nC. Maintaining tightness and arm movement essentials.\nD. How to warm up before a match.\nAnswer with the option's letter from the given choices directly.",
1740,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "581-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1741,
"target": "D",
"doc": {
"video_id": "581",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=kDo4y8-bZko",
"videoID": "kDo4y8-bZko",
"question_id": "581-2",
"task_type": "Object Recognition",
"question": "Which athlete's run is being recorded on a mobile phone by the coach wearing a white top and black shorts in the video?",
"options": [
"A. Female athlete in orange trousers.",
"B. Male athlete wearing a black top with grey and white trousers.",
"C. Male athlete in orange trousers.",
"D. Female athlete in black top and grey trousers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which athlete's run is being recorded on a mobile phone by the coach wearing a white top and black shorts in the video?\nOption:\nA. Female athlete in orange trousers.\nB. Male athlete wearing a black top with grey and white trousers.\nC. Male athlete in orange trousers.\nD. Female athlete in black top and grey trousers.\nAnswer with the option's letter from the given choices directly.",
1741,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "581-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1742,
"target": "B",
"doc": {
"video_id": "581",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=kDo4y8-bZko",
"videoID": "kDo4y8-bZko",
"question_id": "581-3",
"task_type": "Action Recognition",
"question": "How many running sessions did the athlete who recorded the shoe change in the video perform in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 4.",
"D. 1."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many running sessions did the athlete who recorded the shoe change in the video perform in the video?\nOption:\nA. 3.\nB. 2.\nC. 4.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
1742,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "581-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1743,
"target": "C",
"doc": {
"video_id": "582",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=8-aI8Fp2bPU",
"videoID": "8-aI8Fp2bPU",
"question_id": "582-1",
"task_type": "Action Recognition",
"question": "What is the next exercise after the burpee training in the video?",
"options": [
"A. Matrix pushups.",
"B. Jump suqats.",
"C. Shadowboxing.",
"D. Sit-ups."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the next exercise after the burpee training in the video?\nOption:\nA. Matrix pushups.\nB. Jump suqats.\nC. Shadowboxing.\nD. Sit-ups.\nAnswer with the option's letter from the given choices directly.",
1743,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "582-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1744,
"target": "A",
"doc": {
"video_id": "582",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=8-aI8Fp2bPU",
"videoID": "8-aI8Fp2bPU",
"question_id": "582-2",
"task_type": "Counting Problem",
"question": "How many sets of shadowboxing training were performed in the video?",
"options": [
"A. 3.",
"B. 2.",
"C. 1.",
"D. 4."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many sets of shadowboxing training were performed in the video?\nOption:\nA. 3.\nB. 2.\nC. 1.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
1744,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "582-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1745,
"target": "D",
"doc": {
"video_id": "582",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=8-aI8Fp2bPU",
"videoID": "8-aI8Fp2bPU",
"question_id": "582-3",
"task_type": "Temporal Reasoning",
"question": "In what order do burpees, jump squats and push-ups appear in the video?",
"options": [
"A. First jump squats, then push-ups, then burpees.",
"B. First jump squats, then burpees, then push-ups, then jump squats again.",
"C. First burpees, then push-ups, then jump squats.",
"D. First burpees, then jump squats, then push-ups."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order do burpees, jump squats and push-ups appear in the video?\nOption:\nA. First jump squats, then push-ups, then burpees.\nB. First jump squats, then burpees, then push-ups, then jump squats again.\nC. First burpees, then push-ups, then jump squats.\nD. First burpees, then jump squats, then push-ups.\nAnswer with the option's letter from the given choices directly.",
1745,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "582-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1746,
"target": "B",
"doc": {
"video_id": "583",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=JwxALozuKfY",
"videoID": "JwxALozuKfY",
"question_id": "583-1",
"task_type": "Action Recognition",
"question": "In the first stage of the workout, what training did the male protagonist in the video not undergo?",
"options": [
"A. Front plank.",
"B. Dumbbell pullover.",
"C. Weighted russian twist.",
"D. Leg raise."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the first stage of the workout, what training did the male protagonist in the video not undergo?\nOption:\nA. Front plank.\nB. Dumbbell pullover.\nC. Weighted russian twist.\nD. Leg raise.\nAnswer with the option's letter from the given choices directly.",
1746,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "583-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1747,
"target": "C",
"doc": {
"video_id": "583",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=JwxALozuKfY",
"videoID": "JwxALozuKfY",
"question_id": "583-2",
"task_type": "Counting Problem",
"question": "In the video, how many times does the male protagonist do hanging leg raises per set in the first phase of training?",
"options": [
"A. 4.",
"B. 6.",
"C. 3.",
"D. 7."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how many times does the male protagonist do hanging leg raises per set in the first phase of training?\nOption:\nA. 4.\nB. 6.\nC. 3.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1747,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "583-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1748,
"target": "A",
"doc": {
"video_id": "583",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=JwxALozuKfY",
"videoID": "JwxALozuKfY",
"question_id": "583-3",
"task_type": "Action Recognition",
"question": "What does the second super set in the weight training introduced in the video mainly consist of?",
"options": [
"A. Dumbell bent row, pushups, plank hold.",
"B. Pushups, plank hold.",
"C. Incline bench press, lat pulldown.",
"D. Flat bench press, pull up."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the second super set in the weight training introduced in the video mainly consist of?\nOption:\nA. Dumbell bent row, pushups, plank hold.\nB. Pushups, plank hold.\nC. Incline bench press, lat pulldown.\nD. Flat bench press, pull up.\nAnswer with the option's letter from the given choices directly.",
1748,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "583-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1749,
"target": "D",
"doc": {
"video_id": "584",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=lNvBXLbInXE",
"videoID": "lNvBXLbInXE",
"question_id": "584-1",
"task_type": "OCR Problems",
"question": "What is the second yoga pranayama method demonstrated by the man in the video?",
"options": [
"A. Markatasana.",
"B. Bhastrika.",
"C. Anulom vilom.",
"D. Kapalbhati."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second yoga pranayama method demonstrated by the man in the video?\nOption:\nA. Markatasana.\nB. Bhastrika.\nC. Anulom vilom.\nD. Kapalbhati.\nAnswer with the option's letter from the given choices directly.",
1749,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "584-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1750,
"target": "B",
"doc": {
"video_id": "584",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=lNvBXLbInXE",
"videoID": "lNvBXLbInXE",
"question_id": "584-2",
"task_type": "Action Recognition",
"question": "Which exercise did the man in the video do that required him to block his nose?",
"options": [
"A. The forth.",
"B. The third.",
"C. The second.",
"D. The first."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which exercise did the man in the video do that required him to block his nose?\nOption:\nA. The forth.\nB. The third.\nC. The second.\nD. The first.\nAnswer with the option's letter from the given choices directly.",
1750,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "584-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1751,
"target": "C",
"doc": {
"video_id": "584",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=lNvBXLbInXE",
"videoID": "lNvBXLbInXE",
"question_id": "584-3",
"task_type": "OCR Problems",
"question": "What is the last move the man makes in the video?",
"options": [
"A. Bhramari.",
"B. Markatasana.",
"C. Shavasana.",
"D. Anulom Vilom."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the last move the man makes in the video?\nOption:\nA. Bhramari.\nB. Markatasana.\nC. Shavasana.\nD. Anulom Vilom.\nAnswer with the option's letter from the given choices directly.",
1751,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "584-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1752,
"target": "A",
"doc": {
"video_id": "585",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=U8WCRz0Yh4Q",
"videoID": "U8WCRz0Yh4Q",
"question_id": "585-1",
"task_type": "Counting Problem",
"question": "How many of the tricks presented in the video involve drilling the ball through a defender's crotch?",
"options": [
"A. 3.",
"B. 4.",
"C. 6.",
"D. 2."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many of the tricks presented in the video involve drilling the ball through a defender's crotch?\nOption:\nA. 3.\nB. 4.\nC. 6.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1752,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "585-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1753,
"target": "D",
"doc": {
"video_id": "585",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=U8WCRz0Yh4Q",
"videoID": "U8WCRz0Yh4Q",
"question_id": "585-2",
"task_type": "Action Recognition",
"question": "Which technique described in the video is done by kicking the ball into the air over a defender?",
"options": [
"A. Half volley meg.",
"B. Scoop turn.",
"C. Push and go.",
"D. The lift."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which technique described in the video is done by kicking the ball into the air over a defender?\nOption:\nA. Half volley meg.\nB. Scoop turn.\nC. Push and go.\nD. The lift.\nAnswer with the option's letter from the given choices directly.",
1753,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "585-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1754,
"target": "B",
"doc": {
"video_id": "585",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=U8WCRz0Yh4Q",
"videoID": "U8WCRz0Yh4Q",
"question_id": "585-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct ordering of the following tips in the order they are presented in the video?\n①fake cut back\n②double touch meg\n③the lift\n④stop and meg",
"options": [
"A. ①②③④.",
"B. ②①④③.",
"C. ②③①④.",
"D. ③①②④."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct ordering of the following tips in the order they are presented in the video?\n①fake cut back\n②double touch meg\n③the lift\n④stop and meg\nOption:\nA. ①②③④.\nB. ②①④③.\nC. ②③①④.\nD. ③①②④.\nAnswer with the option's letter from the given choices directly.",
1754,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "585-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1755,
"target": "C",
"doc": {
"video_id": "586",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=2ka5AleJ48I",
"videoID": "2ka5AleJ48I",
"question_id": "586-1",
"task_type": "Object Recognition",
"question": "In the video, which track's runners reached the finish line first in the second group of races?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. 2."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, which track's runners reached the finish line first in the second group of races?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1755,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "586-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1756,
"target": "A",
"doc": {
"video_id": "586",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=2ka5AleJ48I",
"videoID": "2ka5AleJ48I",
"question_id": "586-2",
"task_type": "Object Recognition",
"question": "What is the main product introduced by the male protagonist in the video?",
"options": [
"A. An open run headphones.",
"B. A turban with built-in speakers.",
"C. A sweatshirt with noise-canceling capabilities.",
"D. A hearing aid with wireless charging."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main product introduced by the male protagonist in the video?\nOption:\nA. An open run headphones.\nB. A turban with built-in speakers.\nC. A sweatshirt with noise-canceling capabilities.\nD. A hearing aid with wireless charging.\nAnswer with the option's letter from the given choices directly.",
1756,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "586-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1757,
"target": "D",
"doc": {
"video_id": "586",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=2ka5AleJ48I",
"videoID": "2ka5AleJ48I",
"question_id": "586-3",
"task_type": "Object Recognition",
"question": "Who won the championship in the women's competition?",
"options": [
"A. The girl wearing a gray top and gray shorts.",
"B. The girl wearing a black top and black shorts.",
"C. The girl wearing a white top and blue shorts.",
"D. The girl wearing a yellow top and black shorts."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who won the championship in the women's competition?\nOption:\nA. The girl wearing a gray top and gray shorts.\nB. The girl wearing a black top and black shorts.\nC. The girl wearing a white top and blue shorts.\nD. The girl wearing a yellow top and black shorts.\nAnswer with the option's letter from the given choices directly.",
1757,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "586-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1758,
"target": "B",
"doc": {
"video_id": "587",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=zPyWOtToo6E",
"videoID": "zPyWOtToo6E",
"question_id": "587-1",
"task_type": "Action Reasoning",
"question": "Which player will be eliminated at the end of each lap of the first and second games in the video?",
"options": [
"A. Player in head position.",
"B. Player in the last position.",
"C. Player in the middle position.",
"D. Random selection."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player will be eliminated at the end of each lap of the first and second games in the video?\nOption:\nA. Player in head position.\nB. Player in the last position.\nC. Player in the middle position.\nD. Random selection.\nAnswer with the option's letter from the given choices directly.",
1758,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "587-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1759,
"target": "C",
"doc": {
"video_id": "587",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=zPyWOtToo6E",
"videoID": "zPyWOtToo6E",
"question_id": "587-2",
"task_type": "OCR Problems",
"question": "How long did it take the first runner to reach the finish line in the third race in the video?",
"options": [
"A. 55 seconds.",
"B. 54 seconds.",
"C. 53 seconds.",
"D. 2 minutes and 29 seconds."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How long did it take the first runner to reach the finish line in the third race in the video?\nOption:\nA. 55 seconds.\nB. 54 seconds.\nC. 53 seconds.\nD. 2 minutes and 29 seconds.\nAnswer with the option's letter from the given choices directly.",
1759,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "587-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1760,
"target": "A",
"doc": {
"video_id": "587",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=zPyWOtToo6E",
"videoID": "zPyWOtToo6E",
"question_id": "587-3",
"task_type": "Object Recognition",
"question": "Who was the second contestant eliminated during the final match in the video?",
"options": [
"A. The boy wearing a red shirt and black trousers.",
"B. The boy wearing a black top and black shorts.",
"C. The boy wearing a gray shirt and black shorts.",
"D. The boy wearing a black top and black shorts."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was the second contestant eliminated during the final match in the video?\nOption:\nA. The boy wearing a red shirt and black trousers.\nB. The boy wearing a black top and black shorts.\nC. The boy wearing a gray shirt and black shorts.\nD. The boy wearing a black top and black shorts.\nAnswer with the option's letter from the given choices directly.",
1760,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "587-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1761,
"target": "D",
"doc": {
"video_id": "588",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=xre_3dxdhkQ",
"videoID": "xre_3dxdhkQ",
"question_id": "588-1",
"task_type": "OCR Problems",
"question": "What weight did the first girl in the video's female group attempt to bench press before failing?",
"options": [
"A. 150 lbs.",
"B. 140 lbs.",
"C. 160 lbs.",
"D. 155 lbs."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What weight did the first girl in the video's female group attempt to bench press before failing?\nOption:\nA. 150 lbs.\nB. 140 lbs.\nC. 160 lbs.\nD. 155 lbs.\nAnswer with the option's letter from the given choices directly.",
1761,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "588-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1762,
"target": "B",
"doc": {
"video_id": "588",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=xre_3dxdhkQ",
"videoID": "xre_3dxdhkQ",
"question_id": "588-2",
"task_type": "Counting Problem",
"question": "How many ladies participated in the women's bench press competition in the video?",
"options": [
"A. 7.",
"B. 6.",
"C. 8.",
"D. 9."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many ladies participated in the women's bench press competition in the video?\nOption:\nA. 7.\nB. 6.\nC. 8.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
1762,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "588-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1763,
"target": "C",
"doc": {
"video_id": "588",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=xre_3dxdhkQ",
"videoID": "xre_3dxdhkQ",
"question_id": "588-3",
"task_type": "Counting Problem",
"question": "What is the highest weight that the boys' group can successfully bench press in the video?",
"options": [
"A. 405 lbs.",
"B. 415 lbs.",
"C. 420 lbs.",
"D. 385 lbs."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the highest weight that the boys' group can successfully bench press in the video?\nOption:\nA. 405 lbs.\nB. 415 lbs.\nC. 420 lbs.\nD. 385 lbs.\nAnswer with the option's letter from the given choices directly.",
1763,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "588-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1764,
"target": "A",
"doc": {
"video_id": "589",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=FYJJbwG_i8U",
"videoID": "FYJJbwG_i8U",
"question_id": "589-1",
"task_type": "Action Recognition",
"question": "Which of the following options describes the first exercise shown in the video?",
"options": [
"A. Cross hands on chest, squat and then jump up.",
"B. Start by doing a lunge and then jump up, then switch legs and repeat.",
"C. Jump up and lift legs.",
"D. Cross hands on chest and then jump up."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options describes the first exercise shown in the video?\nOption:\nA. Cross hands on chest, squat and then jump up.\nB. Start by doing a lunge and then jump up, then switch legs and repeat.\nC. Jump up and lift legs.\nD. Cross hands on chest and then jump up.\nAnswer with the option's letter from the given choices directly.",
1764,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "589-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1765,
"target": "D",
"doc": {
"video_id": "589",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=FYJJbwG_i8U",
"videoID": "FYJJbwG_i8U",
"question_id": "589-2",
"task_type": "Action Recognition",
"question": "What is the workout that follows after the tuck jumps workout in the video?",
"options": [
"A. Reverse lunge knee drive.",
"B. Heel flicks.",
"C. Kneeling jumps.",
"D. High knees."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the workout that follows after the tuck jumps workout in the video?\nOption:\nA. Reverse lunge knee drive.\nB. Heel flicks.\nC. Kneeling jumps.\nD. High knees.\nAnswer with the option's letter from the given choices directly.",
1765,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "589-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1766,
"target": "B",
"doc": {
"video_id": "589",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=FYJJbwG_i8U",
"videoID": "FYJJbwG_i8U",
"question_id": "589-3",
"task_type": "Temporal Reasoning",
"question": "In the video, what is the order in which high knees, jump lunges, A skips, and kneeling jumps appear?",
"options": [
"A. High knees, jump lunges, A skips, kneeling jumps.",
"B. Jump lunges, A skips, high knees, kneeling jumps.",
"C. Jump lunges, kneeling jumps, a skips, high knees.",
"D. Kneeling jumps, jump lunges, a skips, high knees."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what is the order in which high knees, jump lunges, A skips, and kneeling jumps appear?\nOption:\nA. High knees, jump lunges, A skips, kneeling jumps.\nB. Jump lunges, A skips, high knees, kneeling jumps.\nC. Jump lunges, kneeling jumps, a skips, high knees.\nD. Kneeling jumps, jump lunges, a skips, high knees.\nAnswer with the option's letter from the given choices directly.",
1766,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "589-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1767,
"target": "C",
"doc": {
"video_id": "590",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=VP4GtrEsefk",
"videoID": "VP4GtrEsefk",
"question_id": "590-1",
"task_type": "Attribute Perception",
"question": "What are the rules of the game in the video?",
"options": [
"A. Just the goalkeeper saves the penalty kick.",
"B. Players just need to score the penalty kick.",
"C. Players need to copy the penalties from the demo video.",
"D. Players need to copy the passing process in the demo video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the rules of the game in the video?\nOption:\nA. Just the goalkeeper saves the penalty kick.\nB. Players just need to score the penalty kick.\nC. Players need to copy the penalties from the demo video.\nD. Players need to copy the passing process in the demo video.\nAnswer with the option's letter from the given choices directly.",
1767,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "590-1",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1768,
"target": "A",
"doc": {
"video_id": "590",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=VP4GtrEsefk",
"videoID": "VP4GtrEsefk",
"question_id": "590-2",
"task_type": "Object Recognition",
"question": "Which player in the video successfully copied a penalty kick?",
"options": [
"A. The first player of the team wearing black.",
"B. The first player of the team wearing red.",
"C. The third player of the team wearing black.",
"D. The second player of the team wearing black."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player in the video successfully copied a penalty kick?\nOption:\nA. The first player of the team wearing black.\nB. The first player of the team wearing red.\nC. The third player of the team wearing black.\nD. The second player of the team wearing black.\nAnswer with the option's letter from the given choices directly.",
1768,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "590-2",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1769,
"target": "D",
"doc": {
"video_id": "590",
"duration": "medium",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=VP4GtrEsefk",
"videoID": "VP4GtrEsefk",
"question_id": "590-3",
"task_type": "Action Recognition",
"question": "How to replicate the penalty kick of the person wearing a teddy bear headgear in the video?",
"options": [
"A. Run forward and kick the ball to the edge of the goal, letting it bounce to the other edge and into the goal.",
"B. Run forward and kick the ball with the right foot, aiming for the bottom corner of the goal.",
"C. Run forward and kick the ball high into the air, allowing it to spin before landing in the goal.",
"D. Run to the football then stop and kick the ball into the goal with the left foot."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How to replicate the penalty kick of the person wearing a teddy bear headgear in the video?\nOption:\nA. Run forward and kick the ball to the edge of the goal, letting it bounce to the other edge and into the goal.\nB. Run forward and kick the ball with the right foot, aiming for the bottom corner of the goal.\nC. Run forward and kick the ball high into the air, allowing it to spin before landing in the goal.\nD. Run to the football then stop and kick the ball into the goal with the left foot.\nAnswer with the option's letter from the given choices directly.",
1769,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "590-3",
"duration": "medium",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1770,
"target": "B",
"doc": {
"video_id": "591",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=FMdEkj5HEqg",
"videoID": "FMdEkj5HEqg",
"question_id": "591-1",
"task_type": "Counting Problem",
"question": "How many times is the little boy on the right side of the planet when he visited the planet in his rocket?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times is the little boy on the right side of the planet when he visited the planet in his rocket?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
1770,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "591-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1771,
"target": "C",
"doc": {
"video_id": "591",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=FMdEkj5HEqg",
"videoID": "FMdEkj5HEqg",
"question_id": "591-2",
"task_type": "Temporal Perception",
"question": "Which planet does the boy spend the most time visiting?",
"options": [
"A. Jupiter.",
"B. Mars.",
"C. Venus.",
"D. Earth."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which planet does the boy spend the most time visiting?\nOption:\nA. Jupiter.\nB. Mars.\nC. Venus.\nD. Earth.\nAnswer with the option's letter from the given choices directly.",
1771,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "591-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1772,
"target": "A",
"doc": {
"video_id": "591",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=FMdEkj5HEqg",
"videoID": "FMdEkj5HEqg",
"question_id": "591-3",
"task_type": "Object Reasoning",
"question": "What is the most likely role of the woman in yellow in the video?",
"options": [
"A. Astronomy teacher.",
"B. The little boy's mom.",
"C. Rocket Driver.",
"D. Teacher of French."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most likely role of the woman in yellow in the video?\nOption:\nA. Astronomy teacher.\nB. The little boy's mom.\nC. Rocket Driver.\nD. Teacher of French.\nAnswer with the option's letter from the given choices directly.",
1772,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "591-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1773,
"target": "D",
"doc": {
"video_id": "592",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=InHaW59CmDw",
"videoID": "InHaW59CmDw",
"question_id": "592-1",
"task_type": "Temporal Reasoning",
"question": "Which animals are introduced in sequence in the video?",
"options": [
"A. Argentinosaurus, Giant Rhinoceros, Titanoboa, Leedsichthys.",
"B. Leedsichthys, Argentinosaurus, Giant Rhinoceros, Titanoboa.",
"C. Titanoboa, Leedsichthys, Argentinosaurus, Giant Rhinoceros.",
"D. Giant Rhinoceros, Titanoboa, Leedsichthys, Argentinosaurus."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animals are introduced in sequence in the video?\nOption:\nA. Argentinosaurus, Giant Rhinoceros, Titanoboa, Leedsichthys.\nB. Leedsichthys, Argentinosaurus, Giant Rhinoceros, Titanoboa.\nC. Titanoboa, Leedsichthys, Argentinosaurus, Giant Rhinoceros.\nD. Giant Rhinoceros, Titanoboa, Leedsichthys, Argentinosaurus.\nAnswer with the option's letter from the given choices directly.",
1773,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "592-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1774,
"target": "A",
"doc": {
"video_id": "592",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=InHaW59CmDw",
"videoID": "InHaW59CmDw",
"question_id": "592-2",
"task_type": "Object Recognition",
"question": "Which animals detailedly introduced in the video are not extinct?",
"options": [
"A. Whale Shark, Blue Whale.",
"B. African Elephant, Giraffe.",
"C. Whale Shark, Saltwater Crocodile.",
"D. Kronosaurus, Spiny Dragon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animals detailedly introduced in the video are not extinct?\nOption:\nA. Whale Shark, Blue Whale.\nB. African Elephant, Giraffe.\nC. Whale Shark, Saltwater Crocodile.\nD. Kronosaurus, Spiny Dragon.\nAnswer with the option's letter from the given choices directly.",
1774,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "592-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1775,
"target": "B",
"doc": {
"video_id": "592",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=InHaW59CmDw",
"videoID": "InHaW59CmDw",
"question_id": "592-3",
"task_type": "Attribute Perception",
"question": "Which animal is a herbivore mentioned in the video?",
"options": [
"A. African Elephant.",
"B. Argentinosaurus.",
"C. Blue Whale.",
"D. Giraffe."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animal is a herbivore mentioned in the video?\nOption:\nA. African Elephant.\nB. Argentinosaurus.\nC. Blue Whale.\nD. Giraffe.\nAnswer with the option's letter from the given choices directly.",
1775,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "592-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1776,
"target": "C",
"doc": {
"video_id": "593",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=p9uIBCDhyr0",
"videoID": "p9uIBCDhyr0",
"question_id": "593-1",
"task_type": "Action Recognition",
"question": "In the performance, what captivating element draws the audience's attention?",
"options": [
"A. The intricate set design and elaborate props.",
"B. The live band providing a dynamic musical backdrop.",
"C. The actress's mesmerizing spin as her hair is tied to a rope.",
"D. The audience's enthusiastic participation in interactive segments."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the performance, what captivating element draws the audience's attention?\nOption:\nA. The intricate set design and elaborate props.\nB. The live band providing a dynamic musical backdrop.\nC. The actress's mesmerizing spin as her hair is tied to a rope.\nD. The audience's enthusiastic participation in interactive segments.\nAnswer with the option's letter from the given choices directly.",
1776,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "593-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1777,
"target": "A",
"doc": {
"video_id": "593",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=p9uIBCDhyr0",
"videoID": "p9uIBCDhyr0",
"question_id": "593-2",
"task_type": "Action Recognition",
"question": "What decision did the judges reach?",
"options": [
"A. They unanimously gave her four yeses.",
"B. They awarded her the first place trophy.",
"C. They offered constructive criticism for her performance.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What decision did the judges reach?\nOption:\nA. They unanimously gave her four yeses.\nB. They awarded her the first place trophy.\nC. They offered constructive criticism for her performance.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1777,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "593-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1778,
"target": "D",
"doc": {
"video_id": "593",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=p9uIBCDhyr0",
"videoID": "p9uIBCDhyr0",
"question_id": "593-3",
"task_type": "Object Reasoning",
"question": "What do the two male judges share in common?",
"options": [
"A. Both of them are speaking English.",
"B. Both of them are wearing big beards.",
"C. Both of them are wearing black shirts.",
"D. Both of them are wearing glasses."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two male judges share in common?\nOption:\nA. Both of them are speaking English.\nB. Both of them are wearing big beards.\nC. Both of them are wearing black shirts.\nD. Both of them are wearing glasses.\nAnswer with the option's letter from the given choices directly.",
1778,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "593-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1779,
"target": "C",
"doc": {
"video_id": "594",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=GrO5sxp3n0E",
"videoID": "GrO5sxp3n0E",
"question_id": "594-1",
"task_type": "Attribute Perception",
"question": "What color is the woman's phone for personal use?",
"options": [
"A. Deep Purple.",
"B. Sierra Blue.",
"C. Olive green.",
"D. Graphite Grey."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color is the woman's phone for personal use?\nOption:\nA. Deep Purple.\nB. Sierra Blue.\nC. Olive green.\nD. Graphite Grey.\nAnswer with the option's letter from the given choices directly.",
1779,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "594-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1780,
"target": "D",
"doc": {
"video_id": "594",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=GrO5sxp3n0E",
"videoID": "GrO5sxp3n0E",
"question_id": "594-2",
"task_type": "Object Reasoning",
"question": "What is the woman's favorite drink, and why?",
"options": [
"A. Coffee, because it helps her wake up.",
"B. Tea, because it's calming.",
"C. Juice, because it's healthy.",
"D. Water, because it's readily available and tastes good."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the woman's favorite drink, and why?\nOption:\nA. Coffee, because it helps her wake up.\nB. Tea, because it's calming.\nC. Juice, because it's healthy.\nD. Water, because it's readily available and tastes good.\nAnswer with the option's letter from the given choices directly.",
1780,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "594-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1781,
"target": "A",
"doc": {
"video_id": "594",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=GrO5sxp3n0E",
"videoID": "GrO5sxp3n0E",
"question_id": "594-3",
"task_type": "Counting Problem",
"question": "How many items did the woman take out of her bag?",
"options": [
"A. 14.",
"B. 17.",
"C. 11.",
"D. 20."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many items did the woman take out of her bag?\nOption:\nA. 14.\nB. 17.\nC. 11.\nD. 20.\nAnswer with the option's letter from the given choices directly.",
1781,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "594-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1782,
"target": "A",
"doc": {
"video_id": "595",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=cVIfe0Gxa64",
"videoID": "cVIfe0Gxa64",
"question_id": "595-1",
"task_type": "OCR Problems",
"question": "Which of the following is not a sponsor of the race?",
"options": [
"A. Lenovo.",
"B. BWT.",
"C. DEKRA.",
"D. IG."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not a sponsor of the race?\nOption:\nA. Lenovo.\nB. BWT.\nC. DEKRA.\nD. IG.\nAnswer with the option's letter from the given choices directly.",
1782,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "595-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1783,
"target": "C",
"doc": {
"video_id": "595",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=cVIfe0Gxa64",
"videoID": "cVIfe0Gxa64",
"question_id": "595-2",
"task_type": "Counting Problem",
"question": "In the video, how many finishing cars were on the same team as the winning driver?",
"options": [
"A. 4.",
"B. 6.",
"C. 3.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how many finishing cars were on the same team as the winning driver?\nOption:\nA. 4.\nB. 6.\nC. 3.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
1783,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "595-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1784,
"target": "B",
"doc": {
"video_id": "595",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=cVIfe0Gxa64",
"videoID": "cVIfe0Gxa64",
"question_id": "595-3",
"task_type": "Counting Problem",
"question": "How many cars finished the race in the video?",
"options": [
"A. 23.",
"B. 21.",
"C. 25.",
"D. 29."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cars finished the race in the video?\nOption:\nA. 23.\nB. 21.\nC. 25.\nD. 29.\nAnswer with the option's letter from the given choices directly.",
1784,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "595-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1785,
"target": "B",
"doc": {
"video_id": "596",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=I48vtTxdaeQ",
"videoID": "I48vtTxdaeQ",
"question_id": "596-1",
"task_type": "Information Synopsis",
"question": "What is the proposed solution presented in the video to minimize moiré during the Taylor Swift concert?",
"options": [
"A. Using specialized camera filters that block specific light wavelengths causing moiré.",
"B. Employing LED screens with a larger pixel pitch, thereby decreasing the spatial frequency of the displayed image.",
"C. Implementing real-time image processing algorithms within the camera to detect and remove moiré patterns.",
"D. Reducing the overall brightness and contrast of the LED screens to minimize interference patterns."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the proposed solution presented in the video to minimize moiré during the Taylor Swift concert?\nOption:\nA. Using specialized camera filters that block specific light wavelengths causing moiré.\nB. Employing LED screens with a larger pixel pitch, thereby decreasing the spatial frequency of the displayed image.\nC. Implementing real-time image processing algorithms within the camera to detect and remove moiré patterns.\nD. Reducing the overall brightness and contrast of the LED screens to minimize interference patterns.\nAnswer with the option's letter from the given choices directly.",
1785,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "596-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1786,
"target": "C",
"doc": {
"video_id": "596",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=I48vtTxdaeQ",
"videoID": "I48vtTxdaeQ",
"question_id": "596-2",
"task_type": "Information Synopsis",
"question": "How does the video demonstrate the concept of \"aliasing\" in relation to the formation of moiré?",
"options": [
"A. By demonstrating how light waves bend when passing through different mediums, creating distortions in the image.",
"B. By explaining how the compression of digital images can lead to the loss of information and visual artifacts.",
"C. By illustrating how the camera's sensor captures a continuous scene as a series of discrete points, potentially leading to misrepresentation of the original signal.",
"D. By showing how low-frequency sounds can create the illusion of higher-pitched tones."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the video demonstrate the concept of \"aliasing\" in relation to the formation of moiré?\nOption:\nA. By demonstrating how light waves bend when passing through different mediums, creating distortions in the image.\nB. By explaining how the compression of digital images can lead to the loss of information and visual artifacts.\nC. By illustrating how the camera's sensor captures a continuous scene as a series of discrete points, potentially leading to misrepresentation of the original signal.\nD. By showing how low-frequency sounds can create the illusion of higher-pitched tones.\nAnswer with the option's letter from the given choices directly.",
1786,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "596-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1787,
"target": "D",
"doc": {
"video_id": "596",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=I48vtTxdaeQ",
"videoID": "I48vtTxdaeQ",
"question_id": "596-3",
"task_type": "Object Reasoning",
"question": "Imagine you are a videographer tasked with filming a corporate presentation that will involve displaying detailed graphs and charts on a large LED screen. What factors would you consider to minimize the risk of moiré during the recording?",
"options": [
"A. Screen Technology: Choosing an LED screen with a higher resolution and smaller pixel pitch to increase the spatial frequency of the displayed content.",
"B. Camera Placement: Positioning the camera at a distance and angle where the spatial frequency of the screen's pixels is significantly different from the camera sensor's sampling frequency.",
"C. Camera Settings: Using a camera with a high frame rate to avoid temporal aliasing and adjusting the aperture and focus to potentially blur any moiré patterns.",
"D. All of the above factors are important considerations for minimizing moiré in this scenario."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Imagine you are a videographer tasked with filming a corporate presentation that will involve displaying detailed graphs and charts on a large LED screen. What factors would you consider to minimize the risk of moiré during the recording?\nOption:\nA. Screen Technology: Choosing an LED screen with a higher resolution and smaller pixel pitch to increase the spatial frequency of the displayed content.\nB. Camera Placement: Positioning the camera at a distance and angle where the spatial frequency of the screen's pixels is significantly different from the camera sensor's sampling frequency.\nC. Camera Settings: Using a camera with a high frame rate to avoid temporal aliasing and adjusting the aperture and focus to potentially blur any moiré patterns.\nD. All of the above factors are important considerations for minimizing moiré in this scenario.\nAnswer with the option's letter from the given choices directly.",
1787,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "596-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1788,
"target": "D",
"doc": {
"video_id": "597",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=5R64pAPUZmk",
"videoID": "5R64pAPUZmk",
"question_id": "597-1",
"task_type": "Object Recognition",
"question": "What does the girl take out from the envelop?",
"options": [
"A. A mirror.",
"B. A book.",
"C. A paper.",
"D. A card."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the girl take out from the envelop?\nOption:\nA. A mirror.\nB. A book.\nC. A paper.\nD. A card.\nAnswer with the option's letter from the given choices directly.",
1788,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "597-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1789,
"target": "C",
"doc": {
"video_id": "597",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=5R64pAPUZmk",
"videoID": "5R64pAPUZmk",
"question_id": "597-2",
"task_type": "Action Recognition",
"question": "What is the usage of the thing that the girl take out from envelop?",
"options": [
"A. Identify herself.",
"B. Buy food.",
"C. Start elevator.",
"D. Take money from bank."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the usage of the thing that the girl take out from envelop?\nOption:\nA. Identify herself.\nB. Buy food.\nC. Start elevator.\nD. Take money from bank.\nAnswer with the option's letter from the given choices directly.",
1789,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "597-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1790,
"target": "B",
"doc": {
"video_id": "597",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=5R64pAPUZmk",
"videoID": "5R64pAPUZmk",
"question_id": "597-3",
"task_type": "Object Recognition",
"question": "What is the animal flying in sky?",
"options": [
"A. Seagull.",
"B. Whale.",
"C. Turtle.",
"D. Horse."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the animal flying in sky?\nOption:\nA. Seagull.\nB. Whale.\nC. Turtle.\nD. Horse.\nAnswer with the option's letter from the given choices directly.",
1790,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "597-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1791,
"target": "D",
"doc": {
"video_id": "598",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=fFYNmVb3NCQ",
"videoID": "fFYNmVb3NCQ",
"question_id": "598-1",
"task_type": "Information Synopsis",
"question": "What is the main idea of the video?",
"options": [
"A. The exchange and collision of Silla culture and Goryeo culture.",
"B. How the Silla people integrated into the dynasty established by the Koryo people.",
"C. Silla people's different attitudes towards Koryo people.",
"D. The exchange of culture and the reaction of Silla people after the establishment of Koryo dynasty."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main idea of the video?\nOption:\nA. The exchange and collision of Silla culture and Goryeo culture.\nB. How the Silla people integrated into the dynasty established by the Koryo people.\nC. Silla people's different attitudes towards Koryo people.\nD. The exchange of culture and the reaction of Silla people after the establishment of Koryo dynasty.\nAnswer with the option's letter from the given choices directly.",
1791,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "598-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1792,
"target": "A",
"doc": {
"video_id": "598",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=fFYNmVb3NCQ",
"videoID": "fFYNmVb3NCQ",
"question_id": "598-2",
"task_type": "Object Reasoning",
"question": "Which of the following statements is a fact?",
"options": [
"A. King Gyeongsun's tombstone is the last tombstone of the Silla Dynasty.",
"B. Prince Mai became the leader of the Jurchens.",
"C. Hanboga and Prince Mai are the same person.",
"D. Rice cakes will grow everywhere on New Year's Eve."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is a fact?\nOption:\nA. King Gyeongsun's tombstone is the last tombstone of the Silla Dynasty.\nB. Prince Mai became the leader of the Jurchens.\nC. Hanboga and Prince Mai are the same person.\nD. Rice cakes will grow everywhere on New Year's Eve.\nAnswer with the option's letter from the given choices directly.",
1792,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "598-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1793,
"target": "C",
"doc": {
"video_id": "598",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=fFYNmVb3NCQ",
"videoID": "fFYNmVb3NCQ",
"question_id": "598-3",
"task_type": "Spatial Reasoning",
"question": "Where may the images in the video mainly come from(except for map)?",
"options": [
"A. Nature photography.",
"B. History book.",
"C. TV drama.",
"D. Photo token in Koryo dynasty."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where may the images in the video mainly come from(except for map)?\nOption:\nA. Nature photography.\nB. History book.\nC. TV drama.\nD. Photo token in Koryo dynasty.\nAnswer with the option's letter from the given choices directly.",
1793,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "598-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1794,
"target": "B",
"doc": {
"video_id": "599",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=0JjgoicpYkU",
"videoID": "0JjgoicpYkU",
"question_id": "599-1",
"task_type": "OCR Problems",
"question": "What time did the heroine in the video turn off her alarm clock?",
"options": [
"A. 8:24.",
"B. 9:24.",
"C. 6:24.",
"D. Not mentioned in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What time did the heroine in the video turn off her alarm clock?\nOption:\nA. 8:24.\nB. 9:24.\nC. 6:24.\nD. Not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
1794,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "599-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1795,
"target": "C",
"doc": {
"video_id": "599",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=0JjgoicpYkU",
"videoID": "0JjgoicpYkU",
"question_id": "599-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly sorts the events of the following heroines in chronological order?\n① Study\n② Make the bed\n③ Eat breakfast\n④ Exercise at home",
"options": [
"A. ②①③④.",
"B. ②④③①.",
"C. ②③①④.",
"D. ③②①④."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly sorts the events of the following heroines in chronological order?\n① Study\n② Make the bed\n③ Eat breakfast\n④ Exercise at home\nOption:\nA. ②①③④.\nB. ②④③①.\nC. ②③①④.\nD. ③②①④.\nAnswer with the option's letter from the given choices directly.",
1795,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "599-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1796,
"target": "A",
"doc": {
"video_id": "599",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=0JjgoicpYkU",
"videoID": "0JjgoicpYkU",
"question_id": "599-3",
"task_type": "OCR Problems",
"question": "How much money can you save by purchasing the course through the link and promo code shared by the heroine in the video?",
"options": [
"A. 11550 rubles.",
"B. 16500 rubles.",
"C. 4950 rubles.",
"D. 21,450 rubles."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How much money can you save by purchasing the course through the link and promo code shared by the heroine in the video?\nOption:\nA. 11550 rubles.\nB. 16500 rubles.\nC. 4950 rubles.\nD. 21,450 rubles.\nAnswer with the option's letter from the given choices directly.",
1796,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "599-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1797,
"target": "A",
"doc": {
"video_id": "600",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=91fi6egFsyA",
"videoID": "91fi6egFsyA",
"question_id": "600-1",
"task_type": "Object Recognition",
"question": "Which woman works as a chef?",
"options": [
"A. Diamante.",
"B. Carola Ordenes.",
"C. Amina.",
"D. Ghizlane."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which woman works as a chef?\nOption:\nA. Diamante.\nB. Carola Ordenes.\nC. Amina.\nD. Ghizlane.\nAnswer with the option's letter from the given choices directly.",
1797,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "600-1",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1798,
"target": "C",
"doc": {
"video_id": "600",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=91fi6egFsyA",
"videoID": "91fi6egFsyA",
"question_id": "600-2",
"task_type": "Information Synopsis",
"question": "What aspect of Italy does Carola Ordenes dislike?",
"options": [
"A. Difficulty in obtaining residence permits.",
"B. Lack of job opportunities.",
"C. High taxes.",
"D. Cannot be determined."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What aspect of Italy does Carola Ordenes dislike?\nOption:\nA. Difficulty in obtaining residence permits.\nB. Lack of job opportunities.\nC. High taxes.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
1798,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "600-2",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1799,
"target": "D",
"doc": {
"video_id": "600",
"duration": "medium",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=91fi6egFsyA",
"videoID": "91fi6egFsyA",
"question_id": "600-3",
"task_type": "Attribute Perception",
"question": "Which of the following is not the home country of the four women in the video?",
"options": [
"A. Morocco.",
"B. Chile.",
"C. Lithuania.",
"D. Italy."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not the home country of the four women in the video?\nOption:\nA. Morocco.\nB. Chile.\nC. Lithuania.\nD. Italy.\nAnswer with the option's letter from the given choices directly.",
1799,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "600-3",
"duration": "medium",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1800,
"target": "A",
"doc": {
"video_id": "601",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=TGom0uiW130",
"videoID": "TGom0uiW130",
"question_id": "601-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. Planes invented by the Wright Brothers.",
"B. The structural difference between the planes created by Whitehead and planes created by the Wright Brothers.",
"C. Who invented the first plane.",
"D. How Whitehead and the Wright Brothers cooperated to invent the first motorized flight."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. Planes invented by the Wright Brothers.\nB. The structural difference between the planes created by Whitehead and planes created by the Wright Brothers.\nC. Who invented the first plane.\nD. How Whitehead and the Wright Brothers cooperated to invent the first motorized flight.\nAnswer with the option's letter from the given choices directly.",
1800,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "601-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1801,
"target": "D",
"doc": {
"video_id": "601",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=TGom0uiW130",
"videoID": "TGom0uiW130",
"question_id": "601-2",
"task_type": "Object Reasoning",
"question": "According to the video, which of the folllowing persons or institudes believes that the first motorized flight was invented by Whitehead?",
"options": [
"A. The author.",
"B. Orville Wright.",
"C. The Smithsonian Air and Space Museum.",
"D. John Brown."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the folllowing persons or institudes believes that the first motorized flight was invented by Whitehead?\nOption:\nA. The author.\nB. Orville Wright.\nC. The Smithsonian Air and Space Museum.\nD. John Brown.\nAnswer with the option's letter from the given choices directly.",
1801,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "601-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1802,
"target": "A",
"doc": {
"video_id": "601",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=TGom0uiW130",
"videoID": "TGom0uiW130",
"question_id": "601-3",
"task_type": "Attribute Perception",
"question": "Which of the following statements is not correct according to the video?",
"options": [
"A. Gustave Whitehead succeeded in his first try of motorized flight.",
"B. Gustave Whitehead developed a passion for flight at a young age.",
"C. The Smithsonian Air and Space Museum claims that there is no enough evidence to support Brown's claims.",
"D. The Wright Brothers made their first successful powered flight in Kitty Hawk."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not correct according to the video?\nOption:\nA. Gustave Whitehead succeeded in his first try of motorized flight.\nB. Gustave Whitehead developed a passion for flight at a young age.\nC. The Smithsonian Air and Space Museum claims that there is no enough evidence to support Brown's claims.\nD. The Wright Brothers made their first successful powered flight in Kitty Hawk.\nAnswer with the option's letter from the given choices directly.",
1802,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "601-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1803,
"target": "D",
"doc": {
"video_id": "602",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=w0Wmc8C0Eq0",
"videoID": "w0Wmc8C0Eq0",
"question_id": "602-1",
"task_type": "Action Reasoning",
"question": "Which of the following statements is not correct according to the video?",
"options": [
"A. Prince Dmitri Donskoi defeated the Mongols at the Battle of Kulikovo Field in 1380.",
"B. Before 2000BC, the Russia is inhabited by nomadic tribes and Bronze Age Culture.",
"C. The Grand Principality of Moscow emerged as a powerful rival to the Golden Horde.",
"D. We can guess nomadic tribes and Bronze Age Culture by the murals."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not correct according to the video?\nOption:\nA. Prince Dmitri Donskoi defeated the Mongols at the Battle of Kulikovo Field in 1380.\nB. Before 2000BC, the Russia is inhabited by nomadic tribes and Bronze Age Culture.\nC. The Grand Principality of Moscow emerged as a powerful rival to the Golden Horde.\nD. We can guess nomadic tribes and Bronze Age Culture by the murals.\nAnswer with the option's letter from the given choices directly.",
1803,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "602-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1804,
"target": "A",
"doc": {
"video_id": "602",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=w0Wmc8C0Eq0",
"videoID": "w0Wmc8C0Eq0",
"question_id": "602-2",
"task_type": "Temporal Reasoning",
"question": "According to what is shown in the video, which of the following events happened before 1613?",
"options": [
"A. The fall of Teutonic Knights.",
"B. The Zemsky Sobor elected Mikhail Romanov as Tsar.",
"C. Catherine the Great ascends the throne.",
"D. Peter the Great transformed Russia into a modern European power."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, which of the following events happened before 1613?\nOption:\nA. The fall of Teutonic Knights.\nB. The Zemsky Sobor elected Mikhail Romanov as Tsar.\nC. Catherine the Great ascends the throne.\nD. Peter the Great transformed Russia into a modern European power.\nAnswer with the option's letter from the given choices directly.",
1804,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "602-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1805,
"target": "A",
"doc": {
"video_id": "602",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=w0Wmc8C0Eq0",
"videoID": "w0Wmc8C0Eq0",
"question_id": "602-3",
"task_type": "Attribute Perception",
"question": "In line with the video evidence, which of the following statements about the world's longest railway line is not correct?",
"options": [
"A. It was built by Russia and China.",
"B. The length of it is 9289km.",
"C. It was completed in 1916.",
"D. French loans helped a lot in the process of building it."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, which of the following statements about the world's longest railway line is not correct?\nOption:\nA. It was built by Russia and China.\nB. The length of it is 9289km.\nC. It was completed in 1916.\nD. French loans helped a lot in the process of building it.\nAnswer with the option's letter from the given choices directly.",
1805,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "602-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1806,
"target": "A",
"doc": {
"video_id": "603",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=7D-gxaie6UI",
"videoID": "7D-gxaie6UI",
"question_id": "603-1",
"task_type": "Object Reasoning",
"question": "In accordance with the video footage, which of the following diseases causes the most deaths?",
"options": [
"A. Tuberculosis.",
"B. Malaria.",
"C. Cholera.",
"D. Typhoid."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, which of the following diseases causes the most deaths?\nOption:\nA. Tuberculosis.\nB. Malaria.\nC. Cholera.\nD. Typhoid.\nAnswer with the option's letter from the given choices directly.",
1806,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "603-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1807,
"target": "B",
"doc": {
"video_id": "603",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=7D-gxaie6UI",
"videoID": "7D-gxaie6UI",
"question_id": "603-2",
"task_type": "Object Recognition",
"question": "What was the event that put an end to the romanticization of TB in the video?",
"options": [
"A. The prevalence of TB in colonial territories.",
"B. TB spread to the working class.",
"C. Mycobacterium tuberculosis has a thick cell wall that makes it resistant to infection-fighting cells.",
"D. The course of the disease can be unpredictable, causing death within a few weeks or over many years."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the event that put an end to the romanticization of TB in the video?\nOption:\nA. The prevalence of TB in colonial territories.\nB. TB spread to the working class.\nC. Mycobacterium tuberculosis has a thick cell wall that makes it resistant to infection-fighting cells.\nD. The course of the disease can be unpredictable, causing death within a few weeks or over many years.\nAnswer with the option's letter from the given choices directly.",
1807,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "603-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1808,
"target": "B",
"doc": {
"video_id": "603",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=7D-gxaie6UI",
"videoID": "7D-gxaie6UI",
"question_id": "603-3",
"task_type": "Object Reasoning",
"question": "Based on the video, which of the following statements is correct?",
"options": [
"A. Roughly 30% of all northern Europeans were dying of TB, most of them were artists and writers.",
"B. The consumption was thought to be only prevelant in white people.",
"C. TB was incurable.",
"D. Southern California came to be known as \"the land of new lungs\" for the reason that the air there can heal TB diseases."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following statements is correct?\nOption:\nA. Roughly 30% of all northern Europeans were dying of TB, most of them were artists and writers.\nB. The consumption was thought to be only prevelant in white people.\nC. TB was incurable.\nD. Southern California came to be known as \"the land of new lungs\" for the reason that the air there can heal TB diseases.\nAnswer with the option's letter from the given choices directly.",
1808,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "603-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1809,
"target": "C",
"doc": {
"video_id": "604",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=0RxMZBLeqRI",
"videoID": "0RxMZBLeqRI",
"question_id": "604-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. The rise and fall of Gypsies.",
"B. How the Gypsies developed into the Roma.",
"C. The history of the Gypsies.",
"D. How the Roma became so prosperous."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. The rise and fall of Gypsies.\nB. How the Gypsies developed into the Roma.\nC. The history of the Gypsies.\nD. How the Roma became so prosperous.\nAnswer with the option's letter from the given choices directly.",
1809,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "604-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1810,
"target": "C",
"doc": {
"video_id": "604",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=0RxMZBLeqRI",
"videoID": "0RxMZBLeqRI",
"question_id": "604-2",
"task_type": "Temporal Reasoning",
"question": "As depicted in the video, what happened when the Gypsies migrated to Europe?",
"options": [
"A. Slavery was abolished.",
"B. They became increasingly enslaved in the Balkans as the Ottomans expanded their territory.",
"C. They fought with Selic.",
"D. They separated from the Turks."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what happened when the Gypsies migrated to Europe?\nOption:\nA. Slavery was abolished.\nB. They became increasingly enslaved in the Balkans as the Ottomans expanded their territory.\nC. They fought with Selic.\nD. They separated from the Turks.\nAnswer with the option's letter from the given choices directly.",
1810,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "604-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1811,
"target": "B",
"doc": {
"video_id": "604",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=0RxMZBLeqRI",
"videoID": "0RxMZBLeqRI",
"question_id": "604-3",
"task_type": "Object Reasoning",
"question": "Which of the following statements is not correct according to what is shown in the video?",
"options": [
"A. The Gypsies didn't know their origin because of the lack of written language.",
"B. In England, the character 'V' on the skin of a Roma meant that he/she had tried to escape.",
"C. The Gypsies escaped from enslavement by living in the countryside.",
"D. Several massacres and programs of gypsies took place in Germany during WW2."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is not correct according to what is shown in the video?\nOption:\nA. The Gypsies didn't know their origin because of the lack of written language.\nB. In England, the character 'V' on the skin of a Roma meant that he/she had tried to escape.\nC. The Gypsies escaped from enslavement by living in the countryside.\nD. Several massacres and programs of gypsies took place in Germany during WW2.\nAnswer with the option's letter from the given choices directly.",
1811,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "604-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1812,
"target": "D",
"doc": {
"video_id": "605",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=xKiRmesHWIA",
"videoID": "xKiRmesHWIA",
"question_id": "605-1",
"task_type": "Information Synopsis",
"question": "What's the main idea of the video?",
"options": [
"A. What did the French gain from World War One.",
"B. Why the Austro-Hungarian Empire was divided.",
"C. The process of World War One.",
"D. How the Austro-Hungarian Empire rises and falls."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the main idea of the video?\nOption:\nA. What did the French gain from World War One.\nB. Why the Austro-Hungarian Empire was divided.\nC. The process of World War One.\nD. How the Austro-Hungarian Empire rises and falls.\nAnswer with the option's letter from the given choices directly.",
1812,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "605-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1813,
"target": "A",
"doc": {
"video_id": "605",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=xKiRmesHWIA",
"videoID": "xKiRmesHWIA",
"question_id": "605-2",
"task_type": "Object Recognition",
"question": "Which of the following things mentioned in the video was not the cause of the split of the country?",
"options": [
"A. The tolerance towards its minorities that makes them strong enough.",
"B. The WW1.",
"C. The communist revolution in Russia.",
"D. The uneven economic growth between the two regions."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following things mentioned in the video was not the cause of the split of the country?\nOption:\nA. The tolerance towards its minorities that makes them strong enough.\nB. The WW1.\nC. The communist revolution in Russia.\nD. The uneven economic growth between the two regions.\nAnswer with the option's letter from the given choices directly.",
1813,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "605-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1814,
"target": "D",
"doc": {
"video_id": "605",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=xKiRmesHWIA",
"videoID": "xKiRmesHWIA",
"question_id": "605-3",
"task_type": "Object Reasoning",
"question": "In the middle of the video a shield appears on a mountain in Central Europe, what does this stand for?",
"options": [
"A. The impassable Alps.",
"B. The defense forces of Western Europe and Russia.",
"C. The battlefield between Western Europe and Russia.",
"D. A buffer zone between Western Europe and Russia."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle of the video a shield appears on a mountain in Central Europe, what does this stand for?\nOption:\nA. The impassable Alps.\nB. The defense forces of Western Europe and Russia.\nC. The battlefield between Western Europe and Russia.\nD. A buffer zone between Western Europe and Russia.\nAnswer with the option's letter from the given choices directly.",
1814,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "605-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1815,
"target": "B",
"doc": {
"video_id": "606",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=uC8TK7GH85o",
"videoID": "uC8TK7GH85o",
"question_id": "606-1",
"task_type": "Temporal Perception",
"question": "What is the first minute of the video about?",
"options": [
"A. Summary of battles waged by Napoleon.",
"B. Comments on Napoleon's achievements.",
"C. Criticism of Napoleon.",
"D. How Napoleon became so successful."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the first minute of the video about?\nOption:\nA. Summary of battles waged by Napoleon.\nB. Comments on Napoleon's achievements.\nC. Criticism of Napoleon.\nD. How Napoleon became so successful.\nAnswer with the option's letter from the given choices directly.",
1815,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "606-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Perception",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1816,
"target": "A",
"doc": {
"video_id": "606",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=uC8TK7GH85o",
"videoID": "uC8TK7GH85o",
"question_id": "606-2",
"task_type": "Object Reasoning",
"question": "In line with the video evidence, which of the following statements about Napoleon is correct?",
"options": [
"A. In the code made by him, a wife had no right to decide on where to live.",
"B. He was born on an island that was under the control of Corsica at that time.",
"C. He attained the rank of brigadier general at the age of 23.",
"D. His career was stalled because he disagreed with the Jacobin faction."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, which of the following statements about Napoleon is correct?\nOption:\nA. In the code made by him, a wife had no right to decide on where to live.\nB. He was born on an island that was under the control of Corsica at that time.\nC. He attained the rank of brigadier general at the age of 23.\nD. His career was stalled because he disagreed with the Jacobin faction.\nAnswer with the option's letter from the given choices directly.",
1816,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "606-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1817,
"target": "B",
"doc": {
"video_id": "606",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=uC8TK7GH85o",
"videoID": "uC8TK7GH85o",
"question_id": "606-3",
"task_type": "Counting Problem",
"question": "Throughout the video, how many scholars in total show up in the video and comment on Napoleon?",
"options": [
"A. Two.",
"B. Three.",
"C. One.",
"D. Four."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Throughout the video, how many scholars in total show up in the video and comment on Napoleon?\nOption:\nA. Two.\nB. Three.\nC. One.\nD. Four.\nAnswer with the option's letter from the given choices directly.",
1817,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "606-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1818,
"target": "A",
"doc": {
"video_id": "607",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=zxKPjD8urG4",
"videoID": "zxKPjD8urG4",
"question_id": "607-1",
"task_type": "Information Synopsis",
"question": "Which are the primary themes explored in the video?",
"options": [
"A. Secrets about Rome found underwater.",
"B. How Rome grew into an empire.",
"C. The Roman Maritime's prosperity.",
"D. The high technologies used to detect sank ships."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which are the primary themes explored in the video?\nOption:\nA. Secrets about Rome found underwater.\nB. How Rome grew into an empire.\nC. The Roman Maritime's prosperity.\nD. The high technologies used to detect sank ships.\nAnswer with the option's letter from the given choices directly.",
1818,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "607-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1819,
"target": "A",
"doc": {
"video_id": "607",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=zxKPjD8urG4",
"videoID": "zxKPjD8urG4",
"question_id": "607-2",
"task_type": "Action Reasoning",
"question": "In accordance with the video footage, which of the following statements is not correct?",
"options": [
"A. The cargo in the lost port might be transported to Rome by the natural channel of the Tiber River.",
"B. At its peak, the Roman Empire's territory encompassed parts of Asia, most of Europe, and a portion of North Africa.",
"C. Experts guess there may be a port near Rome because there are records about it and the vast food need of Rome.",
"D. The hexagonal lake was probably used for unloading cargo."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, which of the following statements is not correct?\nOption:\nA. The cargo in the lost port might be transported to Rome by the natural channel of the Tiber River.\nB. At its peak, the Roman Empire's territory encompassed parts of Asia, most of Europe, and a portion of North Africa.\nC. Experts guess there may be a port near Rome because there are records about it and the vast food need of Rome.\nD. The hexagonal lake was probably used for unloading cargo.\nAnswer with the option's letter from the given choices directly.",
1819,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "607-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1820,
"target": "B",
"doc": {
"video_id": "607",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=zxKPjD8urG4",
"videoID": "zxKPjD8urG4",
"question_id": "607-3",
"task_type": "Attribute Perception",
"question": "Based on the video, which of the statements is not correct about the the metal ingots in the drained wreck?",
"options": [
"A. They were stored on either side of the keel.",
"B. It's mainly made of silver and gold.",
"C. The team found 22 of them.",
"D. They were stamped with the letters meaning emperor."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the statements is not correct about the the metal ingots in the drained wreck?\nOption:\nA. They were stored on either side of the keel.\nB. It's mainly made of silver and gold.\nC. The team found 22 of them.\nD. They were stamped with the letters meaning emperor.\nAnswer with the option's letter from the given choices directly.",
1820,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "607-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1821,
"target": "C",
"doc": {
"video_id": "608",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=BoBapnk6TB8",
"videoID": "BoBapnk6TB8",
"question_id": "608-1",
"task_type": "Attribute Perception",
"question": "As depicted in the video, which of the following statements about the seven wonders of the ancient world is not correct?",
"options": [
"A. They include a huge statue on an island.",
"B. Only the Giza pyramid survives.",
"C. Two of them in Egypt have not been found yet.",
"D. There are two of them in Egypt."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which of the following statements about the seven wonders of the ancient world is not correct?\nOption:\nA. They include a huge statue on an island.\nB. Only the Giza pyramid survives.\nC. Two of them in Egypt have not been found yet.\nD. There are two of them in Egypt.\nAnswer with the option's letter from the given choices directly.",
1821,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "608-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1822,
"target": "C",
"doc": {
"video_id": "608",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=BoBapnk6TB8",
"videoID": "BoBapnk6TB8",
"question_id": "608-2",
"task_type": "Object Reasoning",
"question": "As depicted in the video, which of the following statements about the Lighthouse of Alexandria is not correct?",
"options": [
"A. It acted as a landmark.",
"B. It stood in a beautiful city.",
"C. It is estimated to be as high as a 30-story building.",
"D. A door can be accurately reconstructed in 3D images."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which of the following statements about the Lighthouse of Alexandria is not correct?\nOption:\nA. It acted as a landmark.\nB. It stood in a beautiful city.\nC. It is estimated to be as high as a 30-story building.\nD. A door can be accurately reconstructed in 3D images.\nAnswer with the option's letter from the given choices directly.",
1822,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "608-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1823,
"target": "C",
"doc": {
"video_id": "608",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=BoBapnk6TB8",
"videoID": "BoBapnk6TB8",
"question_id": "608-3",
"task_type": "Object Reasoning",
"question": "According to what is shown in the video, which is the most important element that makes it difficult to dive into Lake Nasser to find the lost fort?",
"options": [
"A. Sediment that makes the waters turbid.",
"B. The depth of the lake.",
"C. The Crocodiles.",
"D. Its accurate location is unknown."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, which is the most important element that makes it difficult to dive into Lake Nasser to find the lost fort?\nOption:\nA. Sediment that makes the waters turbid.\nB. The depth of the lake.\nC. The Crocodiles.\nD. Its accurate location is unknown.\nAnswer with the option's letter from the given choices directly.",
1823,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "608-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1824,
"target": "C",
"doc": {
"video_id": "609",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HProiNnmGwI",
"videoID": "HProiNnmGwI",
"question_id": "609-1",
"task_type": "Information Synopsis",
"question": "What is the primary focus of the video?",
"options": [
"A. The Crusades' successive expeditions to the East.",
"B. The Fourth Crusade and the Fall of Constantinople.",
"C. The First Crusade and the Conquest of Jerusalem.",
"D. The foundation of the Kingdom of Jerusalem."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of the video?\nOption:\nA. The Crusades' successive expeditions to the East.\nB. The Fourth Crusade and the Fall of Constantinople.\nC. The First Crusade and the Conquest of Jerusalem.\nD. The foundation of the Kingdom of Jerusalem.\nAnswer with the option's letter from the given choices directly.",
1824,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "609-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1825,
"target": "D",
"doc": {
"video_id": "609",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HProiNnmGwI",
"videoID": "HProiNnmGwI",
"question_id": "609-2",
"task_type": "Attribute Perception",
"question": "Which of the following statements is correct according to the video?",
"options": [
"A. The Crusaders won the first battle after they landed in Asia Minor.",
"B. Pope Gregory waged a campaign against the Muslim world.",
"C. The Third Crusade was held to conquer Russia.",
"D. The Muslim empires experienced divisions before the Crusade arrived."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements is correct according to the video?\nOption:\nA. The Crusaders won the first battle after they landed in Asia Minor.\nB. Pope Gregory waged a campaign against the Muslim world.\nC. The Third Crusade was held to conquer Russia.\nD. The Muslim empires experienced divisions before the Crusade arrived.\nAnswer with the option's letter from the given choices directly.",
1825,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "609-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1826,
"target": "D",
"doc": {
"video_id": "609",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=HProiNnmGwI",
"videoID": "HProiNnmGwI",
"question_id": "609-3",
"task_type": "Temporal Reasoning",
"question": "In line with the video evidence, what happened after the Crusaders conques Antioch?",
"options": [
"A. They came back to Europe.",
"B. They stayed in Constantinople for the rest of the year.",
"C. They marched on Constantinople.",
"D. They besieged Jerusalem."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what happened after the Crusaders conques Antioch?\nOption:\nA. They came back to Europe.\nB. They stayed in Constantinople for the rest of the year.\nC. They marched on Constantinople.\nD. They besieged Jerusalem.\nAnswer with the option's letter from the given choices directly.",
1826,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "609-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1827,
"target": "A",
"doc": {
"video_id": "610",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=O7zcPvQDIMU",
"videoID": "O7zcPvQDIMU",
"question_id": "610-1",
"task_type": "Information Synopsis",
"question": "What is the best title for this video?",
"options": [
"A. The terrifying reality of Medieval life during the Norman invasion.",
"B. The Battle of Stamford Bridge and Hastings.",
"C. The battle between Harold and William.",
"D. Anglo-Saxon's fall in England."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the best title for this video?\nOption:\nA. The terrifying reality of Medieval life during the Norman invasion.\nB. The Battle of Stamford Bridge and Hastings.\nC. The battle between Harold and William.\nD. Anglo-Saxon's fall in England.\nAnswer with the option's letter from the given choices directly.",
1827,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "610-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1828,
"target": "C",
"doc": {
"video_id": "610",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=O7zcPvQDIMU",
"videoID": "O7zcPvQDIMU",
"question_id": "610-2",
"task_type": "Temporal Reasoning",
"question": "In accordance with the video footage, which of the following events happened after the battle at Standford Bridge?",
"options": [
"A. Harold's coronation.",
"B. The marriage of William.",
"C. The Norman army landed on Hasting.",
"D. Edward's demise."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, which of the following events happened after the battle at Standford Bridge?\nOption:\nA. Harold's coronation.\nB. The marriage of William.\nC. The Norman army landed on Hasting.\nD. Edward's demise.\nAnswer with the option's letter from the given choices directly.",
1828,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "610-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1829,
"target": "B",
"doc": {
"video_id": "610",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Humanity & History",
"url": "https://www.youtube.com/watch?v=O7zcPvQDIMU",
"videoID": "O7zcPvQDIMU",
"question_id": "610-3",
"task_type": "Object Reasoning",
"question": "Which of the following statements about William the Conquer is true based on the video?",
"options": [
"A. His eldest son succeeded him as King of England.",
"B. He died on Sept. 9th.",
"C. His army killed Harold at Standford Bridge.",
"D. He lose the battle of Hastings."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about William the Conquer is true based on the video?\nOption:\nA. His eldest son succeeded him as King of England.\nB. He died on Sept. 9th.\nC. His army killed Harold at Standford Bridge.\nD. He lose the battle of Hastings.\nAnswer with the option's letter from the given choices directly.",
1829,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "610-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Humanity & History",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1830,
"target": "D",
"doc": {
"video_id": "611",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=H8fGd3fCJbg",
"videoID": "H8fGd3fCJbg",
"question_id": "611-1",
"task_type": "Action Recognition",
"question": "Based on the video, what is not true about the artwork \"Apollo and Daphne\"?",
"options": [
"A. Apollo didn't touch Daphne's skin physically.",
"B. Apollo tried to grab Daphne.",
"C. Apollo had curly hair in the sculpture.",
"D. Daphne's fingertips took root."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is not true about the artwork \"Apollo and Daphne\"?\nOption:\nA. Apollo didn't touch Daphne's skin physically.\nB. Apollo tried to grab Daphne.\nC. Apollo had curly hair in the sculpture.\nD. Daphne's fingertips took root.\nAnswer with the option's letter from the given choices directly.",
1830,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "611-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1831,
"target": "D",
"doc": {
"video_id": "611",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=H8fGd3fCJbg",
"videoID": "H8fGd3fCJbg",
"question_id": "611-2",
"task_type": "Temporal Reasoning",
"question": "As depicted in the video, in what order does the author present Bernini's four masterpieces created for Borghese in a single scene?",
"options": [
"A. \"The rape of Persephone\", \"Apollo and Daphne\", \"David\" and \"Aeneas, Anchises, and Ascanius fleeing Troy\".",
"B. \"David\", \"Aeneas, Anchises, and Ascanius fleeing Troy\", \"Apollo and Daphne\" and \"The rape of Persephone\".",
"C. \"Apollo and Daphne\", \"Aeneas, Anchises, and Ascanius fleeing Troy\", \"David\" and \"The rape of Persephone\".",
"D. \"Aeneas, Anchises, and Ascanius fleeing Troy\", \"David\", \"The rape of Persephone\" and \"Apollo and Daphne\"."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, in what order does the author present Bernini's four masterpieces created for Borghese in a single scene?\nOption:\nA. \"The rape of Persephone\", \"Apollo and Daphne\", \"David\" and \"Aeneas, Anchises, and Ascanius fleeing Troy\".\nB. \"David\", \"Aeneas, Anchises, and Ascanius fleeing Troy\", \"Apollo and Daphne\" and \"The rape of Persephone\".\nC. \"Apollo and Daphne\", \"Aeneas, Anchises, and Ascanius fleeing Troy\", \"David\" and \"The rape of Persephone\".\nD. \"Aeneas, Anchises, and Ascanius fleeing Troy\", \"David\", \"The rape of Persephone\" and \"Apollo and Daphne\".\nAnswer with the option's letter from the given choices directly.",
1831,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "611-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1832,
"target": "D",
"doc": {
"video_id": "611",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=H8fGd3fCJbg",
"videoID": "H8fGd3fCJbg",
"question_id": "611-3",
"task_type": "Object Reasoning",
"question": "What is true about Bernini's David as described in the video?",
"options": [
"A. David opened his mouth slightly to breathe better.",
"B. David tried to kill the youth Acis in the statue.",
"C. Bernini's David portrayed the biblical hero in contemplation.",
"D. The contrast between the two famous \"David\" showed the move from Renaissance to Baroque."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is true about Bernini's David as described in the video?\nOption:\nA. David opened his mouth slightly to breathe better.\nB. David tried to kill the youth Acis in the statue.\nC. Bernini's David portrayed the biblical hero in contemplation.\nD. The contrast between the two famous \"David\" showed the move from Renaissance to Baroque.\nAnswer with the option's letter from the given choices directly.",
1832,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "611-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1833,
"target": "B",
"doc": {
"video_id": "612",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=GLW9omJfAdk",
"videoID": "GLW9omJfAdk",
"question_id": "612-1",
"task_type": "Temporal Reasoning",
"question": "How was his life journey according to the video?",
"options": [
"A. Borned with humble background and lived in seclusion in a farmhouse.",
"B. Borned with a humble background, entered the upper class and then lived in seclusion in a farmhouse.",
"C. Borned with a humble background, lived in seclusion in a farmhouse and then entered the upper class.",
"D. Borned in the upper class and lived in seclusion in a farmhouse."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How was his life journey according to the video?\nOption:\nA. Borned with humble background and lived in seclusion in a farmhouse.\nB. Borned with a humble background, entered the upper class and then lived in seclusion in a farmhouse.\nC. Borned with a humble background, lived in seclusion in a farmhouse and then entered the upper class.\nD. Borned in the upper class and lived in seclusion in a farmhouse.\nAnswer with the option's letter from the given choices directly.",
1833,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "612-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1834,
"target": "B",
"doc": {
"video_id": "612",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=GLW9omJfAdk",
"videoID": "GLW9omJfAdk",
"question_id": "612-2",
"task_type": "Object Reasoning",
"question": "As depicted in the video, which of the following statements about Goya and the historical background is correct?",
"options": [
"A. After regaining his crown, Ferdinand VII became modest and support the principles of the Enlightenment.",
"B. Goya was very dissatisfied with Ferdinand VII at first.",
"C. After the French invasion, Ferdinand VII died soon.",
"D. Goya had always supported revolutionaries from France."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which of the following statements about Goya and the historical background is correct?\nOption:\nA. After regaining his crown, Ferdinand VII became modest and support the principles of the Enlightenment.\nB. Goya was very dissatisfied with Ferdinand VII at first.\nC. After the French invasion, Ferdinand VII died soon.\nD. Goya had always supported revolutionaries from France.\nAnswer with the option's letter from the given choices directly.",
1834,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "612-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1835,
"target": "B",
"doc": {
"video_id": "612",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=GLW9omJfAdk",
"videoID": "GLW9omJfAdk",
"question_id": "612-3",
"task_type": "Action Reasoning",
"question": "According to what is shown in the video, which of the following statements about the painting \"Altropos (the Fates)\" is not correct?",
"options": [
"A. The extra man sitting among the Fates might be Goya himself.",
"B. Clotho was holding the thread of life in the painting.",
"C. The painting expresses Goya's helplessness and unwillingness towards his children's death.",
"D. Atropos was carrying scissors, deciding whether cutting the thread of life."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, which of the following statements about the painting \"Altropos (the Fates)\" is not correct?\nOption:\nA. The extra man sitting among the Fates might be Goya himself.\nB. Clotho was holding the thread of life in the painting.\nC. The painting expresses Goya's helplessness and unwillingness towards his children's death.\nD. Atropos was carrying scissors, deciding whether cutting the thread of life.\nAnswer with the option's letter from the given choices directly.",
1835,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "612-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1836,
"target": "D",
"doc": {
"video_id": "613",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=ElWG0_kjy_Y",
"videoID": "ElWG0_kjy_Y",
"question_id": "613-1",
"task_type": "Information Synopsis",
"question": "What is the main idea of the video?",
"options": [
"A. Da Vinci's tragic life journey.",
"B. Da Vinci's talent in both art and science.",
"C. Whose portrait is Mona Lisa?",
"D. Da Vinci's masterpiece-Mona Lisa."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main idea of the video?\nOption:\nA. Da Vinci's tragic life journey.\nB. Da Vinci's talent in both art and science.\nC. Whose portrait is Mona Lisa?\nD. Da Vinci's masterpiece-Mona Lisa.\nAnswer with the option's letter from the given choices directly.",
1836,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "613-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1837,
"target": "D",
"doc": {
"video_id": "613",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=ElWG0_kjy_Y",
"videoID": "ElWG0_kjy_Y",
"question_id": "613-2",
"task_type": "Object Reasoning",
"question": "What is not true about the painting based on the video?",
"options": [
"A. It lacks drama.",
"B. It might be Lisa del Giacondo's portrait.",
"C. The expression appears to change depending on the angle from which it is viewed.",
"D. There is no jewelry in the painting because the protagonist was bankrupt."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is not true about the painting based on the video?\nOption:\nA. It lacks drama.\nB. It might be Lisa del Giacondo's portrait.\nC. The expression appears to change depending on the angle from which it is viewed.\nD. There is no jewelry in the painting because the protagonist was bankrupt.\nAnswer with the option's letter from the given choices directly.",
1837,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "613-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1838,
"target": "B",
"doc": {
"video_id": "613",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=ElWG0_kjy_Y",
"videoID": "ElWG0_kjy_Y",
"question_id": "613-3",
"task_type": "Object Reasoning",
"question": "As depicted in the video, which of the following knowledge and techniques doesn't contribute to Mona Lisa's mysterious smile?",
"options": [
"A. Psychology of visual perception.",
"B. The 'Spolvaro' Technique.",
"C. Facial anatomy.",
"D. Sfumato and Chiaroscuro techniques."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which of the following knowledge and techniques doesn't contribute to Mona Lisa's mysterious smile?\nOption:\nA. Psychology of visual perception.\nB. The 'Spolvaro' Technique.\nC. Facial anatomy.\nD. Sfumato and Chiaroscuro techniques.\nAnswer with the option's letter from the given choices directly.",
1838,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "613-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1839,
"target": "D",
"doc": {
"video_id": "614",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=wCkQ138sg6M",
"videoID": "wCkQ138sg6M",
"question_id": "614-1",
"task_type": "Action Recognition",
"question": "What is the standing man in the black shirt doing in the first minute of the video?",
"options": [
"A. Dancing.",
"B. Playing the violin.",
"C. Playing the piano.",
"D. Conducting the orchestra."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the standing man in the black shirt doing in the first minute of the video?\nOption:\nA. Dancing.\nB. Playing the violin.\nC. Playing the piano.\nD. Conducting the orchestra.\nAnswer with the option's letter from the given choices directly.",
1839,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "614-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1840,
"target": "D",
"doc": {
"video_id": "614",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=wCkQ138sg6M",
"videoID": "wCkQ138sg6M",
"question_id": "614-2",
"task_type": "Action Reasoning",
"question": "According to what is shown in the video, what might be the relationship between the old man in a white shirt and the standing man in a black shirt?",
"options": [
"A. Father and son.",
"B. Grandfather and grandson.",
"C. Employer and employee.",
"D. Teacher and student."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, what might be the relationship between the old man in a white shirt and the standing man in a black shirt?\nOption:\nA. Father and son.\nB. Grandfather and grandson.\nC. Employer and employee.\nD. Teacher and student.\nAnswer with the option's letter from the given choices directly.",
1840,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "614-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1841,
"target": "A",
"doc": {
"video_id": "614",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=wCkQ138sg6M",
"videoID": "wCkQ138sg6M",
"question_id": "614-3",
"task_type": "Attribute Perception",
"question": "In accordance with the video footage, which of the following suggestions is not given by the old man?",
"options": [
"A. Using as many gestures as possible to make the players well-conducted.",
"B. Maintaining a high level of energy and focus throughout the performance.",
"C. Effective Facial Expressions.",
"D. The young conductor shouldn't allowed the tempo to get very slow in one section which was different from the author's idea."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, which of the following suggestions is not given by the old man?\nOption:\nA. Using as many gestures as possible to make the players well-conducted.\nB. Maintaining a high level of energy and focus throughout the performance.\nC. Effective Facial Expressions.\nD. The young conductor shouldn't allowed the tempo to get very slow in one section which was different from the author's idea.\nAnswer with the option's letter from the given choices directly.",
1841,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "614-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1842,
"target": "C",
"doc": {
"video_id": "615",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=LCfBYE97rFk",
"videoID": "LCfBYE97rFk",
"question_id": "615-1",
"task_type": "Temporal Reasoning",
"question": "What does Paula Scher often do when she sits down at her desk as depicted in the video?",
"options": [
"A. Talking with colleagues.",
"B. Designing.",
"C. Reading her e-mail.",
"D. Accomplishing nothing."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Paula Scher often do when she sits down at her desk as depicted in the video?\nOption:\nA. Talking with colleagues.\nB. Designing.\nC. Reading her e-mail.\nD. Accomplishing nothing.\nAnswer with the option's letter from the given choices directly.",
1842,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "615-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1843,
"target": "C",
"doc": {
"video_id": "615",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=LCfBYE97rFk",
"videoID": "LCfBYE97rFk",
"question_id": "615-2",
"task_type": "Attribute Perception",
"question": "According to what is shown in the video, what is not true about the design for local people after the disaster?",
"options": [
"A. It gives local people a sense of identity.",
"B. It is called the emotional sign system.",
"C. Posters standing at every spot look similar in the scenery.",
"D. The design contains the name of the street."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, what is not true about the design for local people after the disaster?\nOption:\nA. It gives local people a sense of identity.\nB. It is called the emotional sign system.\nC. Posters standing at every spot look similar in the scenery.\nD. The design contains the name of the street.\nAnswer with the option's letter from the given choices directly.",
1843,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "615-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1844,
"target": "A",
"doc": {
"video_id": "615",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=LCfBYE97rFk",
"videoID": "LCfBYE97rFk",
"question_id": "615-3",
"task_type": "Object Reasoning",
"question": "In line with the video evidence, which of the following features can not describe Paula Scher?",
"options": [
"A. Passionate for feminism.",
"B. Hard-working.",
"C. Patient.",
"D. Imaginative."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, which of the following features can not describe Paula Scher?\nOption:\nA. Passionate for feminism.\nB. Hard-working.\nC. Patient.\nD. Imaginative.\nAnswer with the option's letter from the given choices directly.",
1844,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "615-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1845,
"target": "A",
"doc": {
"video_id": "616",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=aansXcMqnNk",
"videoID": "aansXcMqnNk",
"question_id": "616-1",
"task_type": "Attribute Perception",
"question": "What is not true about Kyo-yuzen technique based on the video?",
"options": [
"A. It contains 10 stages.",
"B. The artworks contain very fine lines in some places.",
"C. It includes the steaming process.",
"D. Artists paint on the cloth."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is not true about Kyo-yuzen technique based on the video?\nOption:\nA. It contains 10 stages.\nB. The artworks contain very fine lines in some places.\nC. It includes the steaming process.\nD. Artists paint on the cloth.\nAnswer with the option's letter from the given choices directly.",
1845,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "616-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1846,
"target": "D",
"doc": {
"video_id": "616",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=aansXcMqnNk",
"videoID": "aansXcMqnNk",
"question_id": "616-2",
"task_type": "Spatial Reasoning",
"question": "To introduce Kyoto's traditional culture, where has the youtuber not been in the video?",
"options": [
"A. A Buddhist temple.",
"B. A Restaurant.",
"C. A workshop.",
"D. An opera."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: To introduce Kyoto's traditional culture, where has the youtuber not been in the video?\nOption:\nA. A Buddhist temple.\nB. A Restaurant.\nC. A workshop.\nD. An opera.\nAnswer with the option's letter from the given choices directly.",
1846,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "616-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1847,
"target": "A",
"doc": {
"video_id": "616",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=aansXcMqnNk",
"videoID": "aansXcMqnNk",
"question_id": "616-3",
"task_type": "Temporal Reasoning",
"question": "After introducing Tofu making, what kind of traditional technique or scenic spot did the youtuber introduce according to what is shown in the video?",
"options": [
"A. A Buddhist temple.",
"B. Kyoto Museum.",
"C. Nishiki Market.",
"D. Folding fan workshop."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After introducing Tofu making, what kind of traditional technique or scenic spot did the youtuber introduce according to what is shown in the video?\nOption:\nA. A Buddhist temple.\nB. Kyoto Museum.\nC. Nishiki Market.\nD. Folding fan workshop.\nAnswer with the option's letter from the given choices directly.",
1847,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "616-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1848,
"target": "A",
"doc": {
"video_id": "617",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=k1SE25mURhc",
"videoID": "k1SE25mURhc",
"question_id": "617-1",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, why does the author mention Cambodia Angkor Wat?",
"options": [
"A. To illustrate his point that Ellora Caves might not have been built by its occupiers.",
"B. Because it is as remarkable as Ellora Caves.",
"C. Because Cambodia Angkor Wat and Ellora Cave are both built by monks.",
"D. Because it is very close to Ellora Caves."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, why does the author mention Cambodia Angkor Wat?\nOption:\nA. To illustrate his point that Ellora Caves might not have been built by its occupiers.\nB. Because it is as remarkable as Ellora Caves.\nC. Because Cambodia Angkor Wat and Ellora Cave are both built by monks.\nD. Because it is very close to Ellora Caves.\nAnswer with the option's letter from the given choices directly.",
1848,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "617-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1849,
"target": "B",
"doc": {
"video_id": "617",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=k1SE25mURhc",
"videoID": "k1SE25mURhc",
"question_id": "617-2",
"task_type": "Object Reasoning",
"question": "In accordance with the video footage, which of the following statements about the caves is not correct?",
"options": [
"A. Only one-third of it can be seen by visitors.",
"B. It was built with stones from other places.",
"C. It is the largest single monolithic rock excavation.",
"D. It reveals the remarkable engineering knowledge owned by ancient Indians."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, which of the following statements about the caves is not correct?\nOption:\nA. Only one-third of it can be seen by visitors.\nB. It was built with stones from other places.\nC. It is the largest single monolithic rock excavation.\nD. It reveals the remarkable engineering knowledge owned by ancient Indians.\nAnswer with the option's letter from the given choices directly.",
1849,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "617-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1850,
"target": "D",
"doc": {
"video_id": "617",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=k1SE25mURhc",
"videoID": "k1SE25mURhc",
"question_id": "617-3",
"task_type": "Object Recognition",
"question": "Which of the following things can not be found in Ellora Caves?",
"options": [
"A. A shiva statue.",
"B. An elephant statue.",
"C. A monument about Ramayana.",
"D. Piles of stone."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following things can not be found in Ellora Caves?\nOption:\nA. A shiva statue.\nB. An elephant statue.\nC. A monument about Ramayana.\nD. Piles of stone.\nAnswer with the option's letter from the given choices directly.",
1850,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "617-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1851,
"target": "B",
"doc": {
"video_id": "618",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=Sm0lziolZyM",
"videoID": "Sm0lziolZyM",
"question_id": "618-1",
"task_type": "Object Reasoning",
"question": "What do the expanding red lines on the map in the first few minutes of the video stand for?",
"options": [
"A. The Yellow River.",
"B. The Silk Road.",
"C. Du Fu's route to Xi'an.",
"D. The Yangtze River."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the expanding red lines on the map in the first few minutes of the video stand for?\nOption:\nA. The Yellow River.\nB. The Silk Road.\nC. Du Fu's route to Xi'an.\nD. The Yangtze River.\nAnswer with the option's letter from the given choices directly.",
1851,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "618-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1852,
"target": "B",
"doc": {
"video_id": "618",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=Sm0lziolZyM",
"videoID": "Sm0lziolZyM",
"question_id": "618-2",
"task_type": "Object Reasoning",
"question": "What does the verse \"The red lacquered gates wine is left to sour meat to rot, outside the gates lie the bones of the frozen and starved\" imply according to what is shown in the video?",
"options": [
"A. The red gate is defiled by rotten wine and meat.",
"B. The luxury of the rich and the poverty of the common people.",
"C. People's lives were very prosperous.",
"D. The common people disdained to submit to the powerful."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the verse \"The red lacquered gates wine is left to sour meat to rot, outside the gates lie the bones of the frozen and starved\" imply according to what is shown in the video?\nOption:\nA. The red gate is defiled by rotten wine and meat.\nB. The luxury of the rich and the poverty of the common people.\nC. People's lives were very prosperous.\nD. The common people disdained to submit to the powerful.\nAnswer with the option's letter from the given choices directly.",
1852,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "618-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1853,
"target": "C",
"doc": {
"video_id": "618",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=Sm0lziolZyM",
"videoID": "Sm0lziolZyM",
"question_id": "618-3",
"task_type": "Temporal Reasoning",
"question": "Based on the video, when did Du Fu write the verse \"In the city in spring grass and weeds grow everywhere grieving for the times, even the blossom sheds tears\"?",
"options": [
"A. Before he went to the imperial court.",
"B. After the fall of the Tang Dynasty.",
"C. Before his wife died.",
"D. Before his son's birth."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, when did Du Fu write the verse \"In the city in spring grass and weeds grow everywhere grieving for the times, even the blossom sheds tears\"?\nOption:\nA. Before he went to the imperial court.\nB. After the fall of the Tang Dynasty.\nC. Before his wife died.\nD. Before his son's birth.\nAnswer with the option's letter from the given choices directly.",
1853,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "618-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1854,
"target": "C",
"doc": {
"video_id": "619",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=B6tQyCH5hQM",
"videoID": "B6tQyCH5hQM",
"question_id": "619-1",
"task_type": "Counting Problem",
"question": "How many colors of glaze are used in the video?",
"options": [
"A. Five.",
"B. Four.",
"C. Three.",
"D. Two."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many colors of glaze are used in the video?\nOption:\nA. Five.\nB. Four.\nC. Three.\nD. Two.\nAnswer with the option's letter from the given choices directly.",
1854,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "619-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1855,
"target": "C",
"doc": {
"video_id": "619",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=B6tQyCH5hQM",
"videoID": "B6tQyCH5hQM",
"question_id": "619-2",
"task_type": "Attribute Perception",
"question": "As depicted in the video, what color does a ceramic object look like before it is burned in a furnace?",
"options": [
"A. Yellow.",
"B. White.",
"C. Brown.",
"D. Blue."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what color does a ceramic object look like before it is burned in a furnace?\nOption:\nA. Yellow.\nB. White.\nC. Brown.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
1855,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "619-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1856,
"target": "A",
"doc": {
"video_id": "619",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=B6tQyCH5hQM",
"videoID": "B6tQyCH5hQM",
"question_id": "619-3",
"task_type": "Temporal Reasoning",
"question": "According to what is shown in the video, what did the blogger do to these ceramic products after they were put into the furnace and burned?",
"options": [
"A. She painted on them.",
"B. She dyed them three colors.",
"C. She shaped the clay.",
"D. She added some decorations to them."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, what did the blogger do to these ceramic products after they were put into the furnace and burned?\nOption:\nA. She painted on them.\nB. She dyed them three colors.\nC. She shaped the clay.\nD. She added some decorations to them.\nAnswer with the option's letter from the given choices directly.",
1856,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "619-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1857,
"target": "B",
"doc": {
"video_id": "620",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=mOiEOs3ZlT8",
"videoID": "mOiEOs3ZlT8",
"question_id": "620-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. Ancient artists' techniques or styles.",
"B. An exhibition.",
"C. Masterpieces that emerged in ancient Rome.",
"D. A artistic journey in Italy Museum."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. Ancient artists' techniques or styles.\nB. An exhibition.\nC. Masterpieces that emerged in ancient Rome.\nD. A artistic journey in Italy Museum.\nAnswer with the option's letter from the given choices directly.",
1857,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "620-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1858,
"target": "B",
"doc": {
"video_id": "620",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=mOiEOs3ZlT8",
"videoID": "mOiEOs3ZlT8",
"question_id": "620-2",
"task_type": "Object Reasoning",
"question": "In accordance with the video footage, what is not true about the portraits?",
"options": [
"A. They were painted by Hans Memling.",
"B. They are portraits of aristocracy and leaders of the church.",
"C. They reflected the rising upper class's preference for portraiture.",
"D. Lots of details of the physiognomy are included."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, what is not true about the portraits?\nOption:\nA. They were painted by Hans Memling.\nB. They are portraits of aristocracy and leaders of the church.\nC. They reflected the rising upper class's preference for portraiture.\nD. Lots of details of the physiognomy are included.\nAnswer with the option's letter from the given choices directly.",
1858,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "620-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1859,
"target": "D",
"doc": {
"video_id": "620",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Literature & Art",
"url": "https://www.youtube.com/watch?v=mOiEOs3ZlT8",
"videoID": "mOiEOs3ZlT8",
"question_id": "620-3",
"task_type": "Action Reasoning",
"question": "As depicted in the video, what does the cocked elbow in Copley's work reveal?",
"options": [
"A. Abnormal aesthetics of the time.",
"B. He was not good at depicting elbow.",
"C. Its style can be traced back to ancient Rome.",
"D. Swagger and confidence."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what does the cocked elbow in Copley's work reveal?\nOption:\nA. Abnormal aesthetics of the time.\nB. He was not good at depicting elbow.\nC. Its style can be traced back to ancient Rome.\nD. Swagger and confidence.\nAnswer with the option's letter from the given choices directly.",
1859,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "620-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Literature & Art",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1860,
"target": "A",
"doc": {
"video_id": "621",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=5LU_XY0z2ZY",
"videoID": "5LU_XY0z2ZY",
"question_id": "621-1",
"task_type": "Information Synopsis",
"question": "What is the theme of the video?",
"options": [
"A. Enhancing metabolic health as a strategy to combat Alzheimer's disease.",
"B. Research progress and application prospects of beta-amyloid as a therapeutic target for Alzheimer's disease.",
"C. Failed attempts to eradicate Alzheimer's disease through current medical treatments.",
"D. Research on how human behavioural science can be used to improve cognitive performance and its potential as a strategy to combat Alzheimer's disease."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the theme of the video?\nOption:\nA. Enhancing metabolic health as a strategy to combat Alzheimer's disease.\nB. Research progress and application prospects of beta-amyloid as a therapeutic target for Alzheimer's disease.\nC. Failed attempts to eradicate Alzheimer's disease through current medical treatments.\nD. Research on how human behavioural science can be used to improve cognitive performance and its potential as a strategy to combat Alzheimer's disease.\nAnswer with the option's letter from the given choices directly.",
1860,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "621-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1861,
"target": "B",
"doc": {
"video_id": "621",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=5LU_XY0z2ZY",
"videoID": "5LU_XY0z2ZY",
"question_id": "621-2",
"task_type": "Object Reasoning",
"question": "What could be the key to truly combating and even preventing the disease that the video mainly talks about?",
"options": [
"A. Enhanced research into drugs that specifically target beta-amyloid proteins may hold the key to treating Alzheimer's disease.",
"B. Enhance metabolic health.",
"C. Increased investment in research and development of new brain surgeries.",
"D. Innovate on traditional cognitive therapy exercises."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What could be the key to truly combating and even preventing the disease that the video mainly talks about?\nOption:\nA. Enhanced research into drugs that specifically target beta-amyloid proteins may hold the key to treating Alzheimer's disease.\nB. Enhance metabolic health.\nC. Increased investment in research and development of new brain surgeries.\nD. Innovate on traditional cognitive therapy exercises.\nAnswer with the option's letter from the given choices directly.",
1861,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "621-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1862,
"target": "A",
"doc": {
"video_id": "621",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=5LU_XY0z2ZY",
"videoID": "5LU_XY0z2ZY",
"question_id": "621-3",
"task_type": "Object Reasoning",
"question": "What could we do in our daily life to help prevent the disease that the video mainly talks about?",
"options": [
"A. Adopting a low-carb diet and incorporating daily exercise.",
"B. Cultivating a positive mindset and scheduling regular health checkups.",
"C. Ensuring proper intake of healthcare products and maintaining a balanced diet.",
"D. Optimizing sleep patterns to enhance brain activity."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What could we do in our daily life to help prevent the disease that the video mainly talks about?\nOption:\nA. Adopting a low-carb diet and incorporating daily exercise.\nB. Cultivating a positive mindset and scheduling regular health checkups.\nC. Ensuring proper intake of healthcare products and maintaining a balanced diet.\nD. Optimizing sleep patterns to enhance brain activity.\nAnswer with the option's letter from the given choices directly.",
1862,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "621-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1863,
"target": "A",
"doc": {
"video_id": "622",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=K3ksKkCOgTw",
"videoID": "K3ksKkCOgTw",
"question_id": "622-1",
"task_type": "Information Synopsis",
"question": "What is the video primarily about?",
"options": [
"A. The examination of sugar's impact on health and the food industry's response to this issue.",
"B. The benefits of incorporating more sugar into your diet for better health.",
"C. Cooking recipes that maximize sugar content in meals.",
"D. The role of government in regulating sugar production."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video primarily about?\nOption:\nA. The examination of sugar's impact on health and the food industry's response to this issue.\nB. The benefits of incorporating more sugar into your diet for better health.\nC. Cooking recipes that maximize sugar content in meals.\nD. The role of government in regulating sugar production.\nAnswer with the option's letter from the given choices directly.",
1863,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "622-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1864,
"target": "B",
"doc": {
"video_id": "622",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=K3ksKkCOgTw",
"videoID": "K3ksKkCOgTw",
"question_id": "622-2",
"task_type": "Object Recognition",
"question": "What are some of the health issues that can arise from consuming too much sugar according to the video?",
"options": [
"A. Obesity, Diabetes, Tooth Decay, Heart Disease.",
"B. Obesity, Diabetes, Cancer, Heart Disease.",
"C. Obesity, Diabetes, Cancer, Cognitive Decline.",
"D. Obesity, Diabetes, Heart Disease, Non-Alcoholic Fatty Liver Disease (NAFLD)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are some of the health issues that can arise from consuming too much sugar according to the video?\nOption:\nA. Obesity, Diabetes, Tooth Decay, Heart Disease.\nB. Obesity, Diabetes, Cancer, Heart Disease.\nC. Obesity, Diabetes, Cancer, Cognitive Decline.\nD. Obesity, Diabetes, Heart Disease, Non-Alcoholic Fatty Liver Disease (NAFLD).\nAnswer with the option's letter from the given choices directly.",
1864,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "622-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1865,
"target": "D",
"doc": {
"video_id": "622",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=K3ksKkCOgTw",
"videoID": "K3ksKkCOgTw",
"question_id": "622-3",
"task_type": "Object Reasoning",
"question": "What approach does the video suggest is needed to combat the rising numbers of obesity and diabetes?",
"options": [
"A. Teach people about healthy eating and exercise to encourage better lifestyle choices..",
"B. Increased marketing of sugar-free products to promote consumer choice.",
"C. Businesses and non-profits work together to make the food system healthier.",
"D. The government should step in and become an advocate for consumers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What approach does the video suggest is needed to combat the rising numbers of obesity and diabetes?\nOption:\nA. Teach people about healthy eating and exercise to encourage better lifestyle choices..\nB. Increased marketing of sugar-free products to promote consumer choice.\nC. Businesses and non-profits work together to make the food system healthier.\nD. The government should step in and become an advocate for consumers.\nAnswer with the option's letter from the given choices directly.",
1865,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "622-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1866,
"target": "C",
"doc": {
"video_id": "623",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=gfikT_O4v9A",
"videoID": "gfikT_O4v9A",
"question_id": "623-1",
"task_type": "Information Synopsis",
"question": "What is the primary focus of the speaker's presentation in the video?",
"options": [
"A. The impact of sleep quality on cognitive function and daily productivity.",
"B. Exploring the connection between gut microbiota diversity and mental health.",
"C. The relationship between body fat, insulin resistance, and chronic diseases like obesity and diabetes.",
"D. The role of insulin in regulating obesity and diabetes, and its direct correlation with maintaining fat mass."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of the speaker's presentation in the video?\nOption:\nA. The impact of sleep quality on cognitive function and daily productivity.\nB. Exploring the connection between gut microbiota diversity and mental health.\nC. The relationship between body fat, insulin resistance, and chronic diseases like obesity and diabetes.\nD. The role of insulin in regulating obesity and diabetes, and its direct correlation with maintaining fat mass.\nAnswer with the option's letter from the given choices directly.",
1866,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "623-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1867,
"target": "A",
"doc": {
"video_id": "623",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=gfikT_O4v9A",
"videoID": "gfikT_O4v9A",
"question_id": "623-2",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① The concept of the personal fat threshold.\n② How to grow fat cells.\n③ The relationship between insulin and diet.",
"options": [
"A. ①②③.",
"B. ①③②.",
"C. ③①②.",
"D. ②①③."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① The concept of the personal fat threshold.\n② How to grow fat cells.\n③ The relationship between insulin and diet.\nOption:\nA. ①②③.\nB. ①③②.\nC. ③①②.\nD. ②①③.\nAnswer with the option's letter from the given choices directly.",
1867,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "623-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1868,
"target": "D",
"doc": {
"video_id": "623",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=gfikT_O4v9A",
"videoID": "gfikT_O4v9A",
"question_id": "623-3",
"task_type": "OCR Problems",
"question": "Which scientists are mentioned in the video?",
"options": [
"A. L. Pasteur and R. Koch.",
"B. J. Watson and F. Crick.",
"C. G. Mendel and C. Darwin.",
"D. E.P. Joslin and F.G. Benedict."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which scientists are mentioned in the video?\nOption:\nA. L. Pasteur and R. Koch.\nB. J. Watson and F. Crick.\nC. G. Mendel and C. Darwin.\nD. E.P. Joslin and F.G. Benedict.\nAnswer with the option's letter from the given choices directly.",
1868,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "623-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1869,
"target": "C",
"doc": {
"video_id": "624",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=mAwgdX5VxGc",
"videoID": "mAwgdX5VxGc",
"question_id": "624-1",
"task_type": "Object Reasoning",
"question": "What is the speaker's opinion on reversing type 2 diabetes?",
"options": [
"A. It can be treated through insulin therapy.",
"B. It can be treated through surgical interventions.",
"C. It can be treated naturally.",
"D. Cannot be inferred."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the speaker's opinion on reversing type 2 diabetes?\nOption:\nA. It can be treated through insulin therapy.\nB. It can be treated through surgical interventions.\nC. It can be treated naturally.\nD. Cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
1869,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "624-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1870,
"target": "D",
"doc": {
"video_id": "624",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=mAwgdX5VxGc",
"videoID": "mAwgdX5VxGc",
"question_id": "624-2",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① Insulin Resistance Explained.\n② Case studies.\n③ How to Reverse Type 2 Diabetes.",
"options": [
"A. ①②③.",
"B. ①③②.",
"C. ③①②.",
"D. ①③②."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① Insulin Resistance Explained.\n② Case studies.\n③ How to Reverse Type 2 Diabetes.\nOption:\nA. ①②③.\nB. ①③②.\nC. ③①②.\nD. ①③②.\nAnswer with the option's letter from the given choices directly.",
1870,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "624-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1871,
"target": "C",
"doc": {
"video_id": "624",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=mAwgdX5VxGc",
"videoID": "mAwgdX5VxGc",
"question_id": "624-3",
"task_type": "Object Reasoning",
"question": "What treatment is universally implemented in all the case studies featured in the video?",
"options": [
"A. Use medication therapy.",
"B. Undergo surgical treatment.",
"C. Take dietary management.",
"D. Implement physical therapy."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What treatment is universally implemented in all the case studies featured in the video?\nOption:\nA. Use medication therapy.\nB. Undergo surgical treatment.\nC. Take dietary management.\nD. Implement physical therapy.\nAnswer with the option's letter from the given choices directly.",
1871,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "624-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1872,
"target": "B",
"doc": {
"video_id": "625",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iCQmfRMwHfA",
"videoID": "iCQmfRMwHfA",
"question_id": "625-1",
"task_type": "Information Synopsis",
"question": "What is the subject of the video?",
"options": [
"A. How to eat well to improve bone health.",
"B. The relationship between nutrition and inflammation.",
"C. The benefits of high-carbohydrate diets for post-surgery recovery.",
"D. Implementing a sugar-rich diet to reduce chronic pain."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject of the video?\nOption:\nA. How to eat well to improve bone health.\nB. The relationship between nutrition and inflammation.\nC. The benefits of high-carbohydrate diets for post-surgery recovery.\nD. Implementing a sugar-rich diet to reduce chronic pain.\nAnswer with the option's letter from the given choices directly.",
1872,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "625-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1873,
"target": "D",
"doc": {
"video_id": "625",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iCQmfRMwHfA",
"videoID": "iCQmfRMwHfA",
"question_id": "625-2",
"task_type": "Object Reasoning",
"question": "What is the speaker's opinion on preventing inflammation and sickness?",
"options": [
"A. The government should step in and make systemic changes to prevent sickness.",
"B. Health professionals must strengthen the promotion of health awareness.",
"C. Businesses and non-profits work together to make the food system healthier.",
"D. Individuals must take matters into their own hands and act now."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the speaker's opinion on preventing inflammation and sickness?\nOption:\nA. The government should step in and make systemic changes to prevent sickness.\nB. Health professionals must strengthen the promotion of health awareness.\nC. Businesses and non-profits work together to make the food system healthier.\nD. Individuals must take matters into their own hands and act now.\nAnswer with the option's letter from the given choices directly.",
1873,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "625-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1874,
"target": "D",
"doc": {
"video_id": "625",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=iCQmfRMwHfA",
"videoID": "iCQmfRMwHfA",
"question_id": "625-3",
"task_type": "Object Recognition",
"question": "What components are the cause of inflammation in the nutritional model developed by the speaker?",
"options": [
"A. Polyunsaturated oils, carbohydrates, and processed foods.",
"B. Sugar, alcohol, and trans fats.",
"C. Refined grains, trans fats, and processed foods.",
"D. Sugar, carbohydrates, and polyunsaturated oils."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What components are the cause of inflammation in the nutritional model developed by the speaker?\nOption:\nA. Polyunsaturated oils, carbohydrates, and processed foods.\nB. Sugar, alcohol, and trans fats.\nC. Refined grains, trans fats, and processed foods.\nD. Sugar, carbohydrates, and polyunsaturated oils.\nAnswer with the option's letter from the given choices directly.",
1874,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "625-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1875,
"target": "A",
"doc": {
"video_id": "626",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=Z4Ug2ONxgYw",
"videoID": "Z4Ug2ONxgYw",
"question_id": "626-1",
"task_type": "Object Reasoning",
"question": "What is the focus of standard thyroid treatment according to the video?",
"options": [
"A. The standard treatment is to only check TSH.",
"B. The standard treatment is to monitor T3 levels exclusively.",
"C. The standard treatment involves using a combination of T3 and T4 medications.",
"D. The standard treatment is to adjust medication based solely on symptoms without testing."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the focus of standard thyroid treatment according to the video?\nOption:\nA. The standard treatment is to only check TSH.\nB. The standard treatment is to monitor T3 levels exclusively.\nC. The standard treatment involves using a combination of T3 and T4 medications.\nD. The standard treatment is to adjust medication based solely on symptoms without testing.\nAnswer with the option's letter from the given choices directly.",
1875,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "626-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1876,
"target": "C",
"doc": {
"video_id": "626",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=Z4Ug2ONxgYw",
"videoID": "Z4Ug2ONxgYw",
"question_id": "626-2",
"task_type": "Object Recognition",
"question": "What is the suggested treatment for goiter as mentioned in the video?",
"options": [
"A. Surgical removal of the thyroid gland.",
"B. High-intensity exercise regimen.",
"C. Provide iodine.",
"D. Long-term use of synthetic thyroid hormones."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the suggested treatment for goiter as mentioned in the video?\nOption:\nA. Surgical removal of the thyroid gland.\nB. High-intensity exercise regimen.\nC. Provide iodine.\nD. Long-term use of synthetic thyroid hormones.\nAnswer with the option's letter from the given choices directly.",
1876,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "626-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1877,
"target": "B",
"doc": {
"video_id": "626",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=Z4Ug2ONxgYw",
"videoID": "Z4Ug2ONxgYw",
"question_id": "626-3",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① How to cure thyroid when you have auto-immune.\n② Stress influence on the production of TSH.\n③ Standard Thyroid Treatment.",
"options": [
"A. ①②③.",
"B. ③②①.",
"C. ③①②.",
"D. ①③②."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① How to cure thyroid when you have auto-immune.\n② Stress influence on the production of TSH.\n③ Standard Thyroid Treatment.\nOption:\nA. ①②③.\nB. ③②①.\nC. ③①②.\nD. ①③②.\nAnswer with the option's letter from the given choices directly.",
1877,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "626-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1878,
"target": "A",
"doc": {
"video_id": "627",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=yQ6VOOd73MA",
"videoID": "yQ6VOOd73MA",
"question_id": "627-1",
"task_type": "Object Reasoning",
"question": "Which subject is NOT involved in the video?",
"options": [
"A. The impact of diet on brain function.",
"B. Anesthesia and the Brain.",
"C. How trauma affects the brain.",
"D. The role of emotions in decision-making."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which subject is NOT involved in the video?\nOption:\nA. The impact of diet on brain function.\nB. Anesthesia and the Brain.\nC. How trauma affects the brain.\nD. The role of emotions in decision-making.\nAnswer with the option's letter from the given choices directly.",
1878,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "627-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1879,
"target": "B",
"doc": {
"video_id": "627",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=yQ6VOOd73MA",
"videoID": "yQ6VOOd73MA",
"question_id": "627-2",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① Sleepwalking and the Brain.\n② How Much Control Do We Have of Our Brain?\n③ Emotions and the Brain.\n④ Creativity and the Brain.",
"options": [
"A. ①②③④.",
"B. ①③②④.",
"C. ②①④③.",
"D. ①③②④."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① Sleepwalking and the Brain.\n② How Much Control Do We Have of Our Brain?\n③ Emotions and the Brain.\n④ Creativity and the Brain.\nOption:\nA. ①②③④.\nB. ①③②④.\nC. ②①④③.\nD. ①③②④.\nAnswer with the option's letter from the given choices directly.",
1879,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "627-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1880,
"target": "D",
"doc": {
"video_id": "627",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=yQ6VOOd73MA",
"videoID": "yQ6VOOd73MA",
"question_id": "627-3",
"task_type": "Object Reasoning",
"question": "Who's in control of our brains in the video?",
"options": [
"A. Genetics, neurochemistry, and inherent brain structure.",
"B. Conscious decision-making and the influence of the subconscious mind.",
"C. Environmental factors such as social networks, culture, and education.",
"D. The answer cannot be inferred."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who's in control of our brains in the video?\nOption:\nA. Genetics, neurochemistry, and inherent brain structure.\nB. Conscious decision-making and the influence of the subconscious mind.\nC. Environmental factors such as social networks, culture, and education.\nD. The answer cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
1880,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "627-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1881,
"target": "D",
"doc": {
"video_id": "628",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=YdMCL9_UTE4",
"videoID": "YdMCL9_UTE4",
"question_id": "628-1",
"task_type": "Object Reasoning",
"question": "Which of the following options is not mentioned in the video as influencing their mental resilience?",
"options": [
"A. Their experiences.",
"B. Their environment.",
"C. Their genetic makeup.",
"D. Their social status."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is not mentioned in the video as influencing their mental resilience?\nOption:\nA. Their experiences.\nB. Their environment.\nC. Their genetic makeup.\nD. Their social status.\nAnswer with the option's letter from the given choices directly.",
1881,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "628-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1882,
"target": "B",
"doc": {
"video_id": "628",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=YdMCL9_UTE4",
"videoID": "YdMCL9_UTE4",
"question_id": "628-2",
"task_type": "Temporal Reasoning",
"question": "In what order are the following introduced in the video?\n① Resilience training program.\n② The relationship between genetics and stress.\n③ The stress's influence on the brain.\n④ What does resilient behavior look like.",
"options": [
"A. ①②③④.",
"B. ④②③①.",
"C. ④③②①.",
"D. ①④③②."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order are the following introduced in the video?\n① Resilience training program.\n② The relationship between genetics and stress.\n③ The stress's influence on the brain.\n④ What does resilient behavior look like.\nOption:\nA. ①②③④.\nB. ④②③①.\nC. ④③②①.\nD. ①④③②.\nAnswer with the option's letter from the given choices directly.",
1882,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "628-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1883,
"target": "B",
"doc": {
"video_id": "628",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=YdMCL9_UTE4",
"videoID": "YdMCL9_UTE4",
"question_id": "628-3",
"task_type": "Object Reasoning",
"question": "What event is shared by two families as depicted in the video?",
"options": [
"A. Both families suffered the loss of their homes due to a fire.",
"B. Both families lost their own sons.",
"C. Both families experienced the passing of a beloved family member due to an illness.",
"D. Both families lost their mutual loved ones."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What event is shared by two families as depicted in the video?\nOption:\nA. Both families suffered the loss of their homes due to a fire.\nB. Both families lost their own sons.\nC. Both families experienced the passing of a beloved family member due to an illness.\nD. Both families lost their mutual loved ones.\nAnswer with the option's letter from the given choices directly.",
1883,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "628-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1884,
"target": "B",
"doc": {
"video_id": "629",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=Z9mOrNcX4j0",
"videoID": "Z9mOrNcX4j0",
"question_id": "629-1",
"task_type": "Temporal Reasoning",
"question": "What is the order for introducing content in the video?\n(a) The excretion process of plants.\n(b) Plant adaptability.\n(c) The structure of a plant.\n(d) The special way of eating of plants.\n(e) Photosynthesis.",
"options": [
"A. (a)(d)(e)(c)(b).",
"B. (c)(d)(e)(a)(b).",
"C. (d)(c)(b)(a)(e).",
"D. (c)(e)(d)(b)(a)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order for introducing content in the video?\n(a) The excretion process of plants.\n(b) Plant adaptability.\n(c) The structure of a plant.\n(d) The special way of eating of plants.\n(e) Photosynthesis.\nOption:\nA. (a)(d)(e)(c)(b).\nB. (c)(d)(e)(a)(b).\nC. (d)(c)(b)(a)(e).\nD. (c)(e)(d)(b)(a).\nAnswer with the option's letter from the given choices directly.",
1884,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "629-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1885,
"target": "D",
"doc": {
"video_id": "629",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=Z9mOrNcX4j0",
"videoID": "Z9mOrNcX4j0",
"question_id": "629-2",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. It is a cartoon, and the characters are all in the shape of plants.",
"B. It is a documentary about plants.",
"C. It is a popular science video about the Piranha Plant.",
"D. It is an educational science video about plants."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. It is a cartoon, and the characters are all in the shape of plants.\nB. It is a documentary about plants.\nC. It is a popular science video about the Piranha Plant.\nD. It is an educational science video about plants.\nAnswer with the option's letter from the given choices directly.",
1885,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "629-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1886,
"target": "A",
"doc": {
"video_id": "629",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=Z9mOrNcX4j0",
"videoID": "Z9mOrNcX4j0",
"question_id": "629-3",
"task_type": "Object Recognition",
"question": "Which small animal appears most frequently in the video?",
"options": [
"A. Cat.",
"B. Little bird.",
"C. Bat.",
"D. Bee."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which small animal appears most frequently in the video?\nOption:\nA. Cat.\nB. Little bird.\nC. Bat.\nD. Bee.\nAnswer with the option's letter from the given choices directly.",
1886,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "629-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1887,
"target": "B",
"doc": {
"video_id": "630",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=BjmPvovl-V4",
"videoID": "BjmPvovl-V4",
"question_id": "630-1",
"task_type": "Information Synopsis",
"question": "What central theme is explored in the video?",
"options": [
"A. The role of artificial intelligence in modern warfare.",
"B. The philosophical and scientific exploration of consciousness.",
"C. The economic impact of consciousness on global markets.",
"D. The history of consciousness in ancient civilizations."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What central theme is explored in the video?\nOption:\nA. The role of artificial intelligence in modern warfare.\nB. The philosophical and scientific exploration of consciousness.\nC. The economic impact of consciousness on global markets.\nD. The history of consciousness in ancient civilizations.\nAnswer with the option's letter from the given choices directly.",
1887,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "630-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1888,
"target": "C",
"doc": {
"video_id": "630",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=BjmPvovl-V4",
"videoID": "BjmPvovl-V4",
"question_id": "630-2",
"task_type": "Temporal Reasoning",
"question": "What kind of experts does Kmele consult in sequence in the video?",
"options": [
"A. Spiritual leaders, neuroscientists, entrepreneurs, physicists.",
"B. Physicists, neuroscientists, entrepreneurs, neuroscientists.",
"C. Neuroscientists, spiritual leaders, entrepreneurs, physicists.",
"D. Entrepreneurs, spiritual leaders, physicists, neuroscientists."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of experts does Kmele consult in sequence in the video?\nOption:\nA. Spiritual leaders, neuroscientists, entrepreneurs, physicists.\nB. Physicists, neuroscientists, entrepreneurs, neuroscientists.\nC. Neuroscientists, spiritual leaders, entrepreneurs, physicists.\nD. Entrepreneurs, spiritual leaders, physicists, neuroscientists.\nAnswer with the option's letter from the given choices directly.",
1888,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "630-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1889,
"target": "D",
"doc": {
"video_id": "630",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Biology & Medicine",
"url": "https://www.youtube.com/watch?v=BjmPvovl-V4",
"videoID": "BjmPvovl-V4",
"question_id": "630-3",
"task_type": "Object Reasoning",
"question": "What is Kemle's attitude towards the explanation of consciousness by the experts?",
"options": [
"A. He agrees with the neuroscientist.",
"B. He agrees with the spiritual leader.",
"C. He agrees with the entrepreneur.",
"D. Cannot be inferred."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is Kemle's attitude towards the explanation of consciousness by the experts?\nOption:\nA. He agrees with the neuroscientist.\nB. He agrees with the spiritual leader.\nC. He agrees with the entrepreneur.\nD. Cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
1889,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "630-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Biology & Medicine",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1890,
"target": "D",
"doc": {
"video_id": "631",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=pU_yyadYgG8",
"videoID": "pU_yyadYgG8",
"question_id": "631-1",
"task_type": "Object Reasoning",
"question": "How did Yen's appreciation against the US Dollar after the Plaza Accord impact the Japanese economy in the video?",
"options": [
"A. It led to higher inflation due to increased demand for domestic products.",
"B. It stimulated investment by lowering borrowing costs for Japanese companies.",
"C. It boosted exports by making Japanese goods cheaper for foreign consumers.",
"D. It increased domestic consumption by making imported goods more affordable."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did Yen's appreciation against the US Dollar after the Plaza Accord impact the Japanese economy in the video?\nOption:\nA. It led to higher inflation due to increased demand for domestic products.\nB. It stimulated investment by lowering borrowing costs for Japanese companies.\nC. It boosted exports by making Japanese goods cheaper for foreign consumers.\nD. It increased domestic consumption by making imported goods more affordable.\nAnswer with the option's letter from the given choices directly.",
1890,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "631-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1891,
"target": "A",
"doc": {
"video_id": "631",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=pU_yyadYgG8",
"videoID": "pU_yyadYgG8",
"question_id": "631-2",
"task_type": "Counting Problem",
"question": "How many times did Joeri appear when explaining the situation in the 1990s as described in the video?",
"options": [
"A. 8.",
"B. 9.",
"C. 6.",
"D. 7."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times did Joeri appear when explaining the situation in the 1990s as described in the video?\nOption:\nA. 8.\nB. 9.\nC. 6.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
1891,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "631-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1892,
"target": "B",
"doc": {
"video_id": "631",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=pU_yyadYgG8",
"videoID": "pU_yyadYgG8",
"question_id": "631-3",
"task_type": "Information Synopsis",
"question": "What explanation is given for the fact that the country mainly featured in the video continued to experience deflation in the early 2000s?",
"options": [
"A. The quantitative easing program was too small and had a limited impact on the economy.",
"B. The deflationary mindset had become entrenched, leading to a self-fulfilling prophecy of low inflation expectations.",
"C. The global economy experienced a period of low growth, dragging down Japan's export-dependent economy.",
"D. The aging population and shrinking workforce reduced demand and put downward pressure on prices."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What explanation is given for the fact that the country mainly featured in the video continued to experience deflation in the early 2000s?\nOption:\nA. The quantitative easing program was too small and had a limited impact on the economy.\nB. The deflationary mindset had become entrenched, leading to a self-fulfilling prophecy of low inflation expectations.\nC. The global economy experienced a period of low growth, dragging down Japan's export-dependent economy.\nD. The aging population and shrinking workforce reduced demand and put downward pressure on prices.\nAnswer with the option's letter from the given choices directly.",
1892,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "631-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1893,
"target": "B",
"doc": {
"video_id": "632",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=FQd5bo9nIZs",
"videoID": "FQd5bo9nIZs",
"question_id": "632-1",
"task_type": "Temporal Reasoning",
"question": "Here are a few sub-titles about this video, what should be the correct order in which they appear in the video?\n(1) The money problem.\n(2) The Acqusition.\n(3) Intermission.\n(4) Present day and future.",
"options": [
"A. (2)(1)(4)(3).",
"B. (1)(2)(4)(3).",
"C. (1)(4)(3)(2).",
"D. (4)(2)(1)(3)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Here are a few sub-titles about this video, what should be the correct order in which they appear in the video?\n(1) The money problem.\n(2) The Acqusition.\n(3) Intermission.\n(4) Present day and future.\nOption:\nA. (2)(1)(4)(3).\nB. (1)(2)(4)(3).\nC. (1)(4)(3)(2).\nD. (4)(2)(1)(3).\nAnswer with the option's letter from the given choices directly.",
1893,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "632-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1894,
"target": "B",
"doc": {
"video_id": "632",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=FQd5bo9nIZs",
"videoID": "FQd5bo9nIZs",
"question_id": "632-2",
"task_type": "Information Synopsis",
"question": "What does the video focus on in the \"Free?\" chapter?",
"options": [
"A. The free service offers premium features at no cost to attract users.",
"B. The free service allows the app to expand and thus benefit in other ways.",
"C. The free service guarantees user privacy and data protection.",
"D. The free service helps build a community around the app for future monetization."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video focus on in the \"Free?\" chapter?\nOption:\nA. The free service offers premium features at no cost to attract users.\nB. The free service allows the app to expand and thus benefit in other ways.\nC. The free service guarantees user privacy and data protection.\nD. The free service helps build a community around the app for future monetization.\nAnswer with the option's letter from the given choices directly.",
1894,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "632-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1895,
"target": "D",
"doc": {
"video_id": "632",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=FQd5bo9nIZs",
"videoID": "FQd5bo9nIZs",
"question_id": "632-3",
"task_type": "Object Reasoning",
"question": "Which of the following sections most closely relates to the theme of the video?",
"options": [
"A. Intermission.",
"B. The money problme.",
"C. #deletefacebook.",
"D. Present day and future."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following sections most closely relates to the theme of the video?\nOption:\nA. Intermission.\nB. The money problme.\nC. #deletefacebook.\nD. Present day and future.\nAnswer with the option's letter from the given choices directly.",
1895,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "632-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1896,
"target": "D",
"doc": {
"video_id": "633",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=QTyzyP2Afys",
"videoID": "QTyzyP2Afys",
"question_id": "633-1",
"task_type": "Information Synopsis",
"question": "What is the central conflict regarding the main theme of the video presentation?",
"options": [
"A. The conflict between the environmental impact of Bitcoin mining and the need for sustainable energy solutions.",
"B. The debate over the regulation of cryptocurrencies and the balance between innovation and consumer protection.",
"C. The ethical concerns surrounding the use of cryptocurrencies in illegal activities and the need for greater transparency.",
"D. The struggle between established financial institutions and the disruptive potential of decentralized digital currencies."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central conflict regarding the main theme of the video presentation?\nOption:\nA. The conflict between the environmental impact of Bitcoin mining and the need for sustainable energy solutions.\nB. The debate over the regulation of cryptocurrencies and the balance between innovation and consumer protection.\nC. The ethical concerns surrounding the use of cryptocurrencies in illegal activities and the need for greater transparency.\nD. The struggle between established financial institutions and the disruptive potential of decentralized digital currencies.\nAnswer with the option's letter from the given choices directly.",
1896,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "633-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1897,
"target": "C",
"doc": {
"video_id": "633",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=QTyzyP2Afys",
"videoID": "QTyzyP2Afys",
"question_id": "633-2",
"task_type": "Information Synopsis",
"question": "Which of the following best summarizes the core difference between the contrasting viewpoints on the future of Bitcoin in the video?",
"options": [
"A. Whether Bitcoin should prioritize stability as a currency or potential for high returns as an investment.",
"B. Whether the focus should be on technological innovation or addressing the environmental impact of Bitcoin mining.",
"C. Whether Bitcoin should be integrated into existing financial systems or remain independent as a decentralized alternative.",
"D. Whether Bitcoin primarily benefits wealthy investors or has the potential to empower underbanked communities."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best summarizes the core difference between the contrasting viewpoints on the future of Bitcoin in the video?\nOption:\nA. Whether Bitcoin should prioritize stability as a currency or potential for high returns as an investment.\nB. Whether the focus should be on technological innovation or addressing the environmental impact of Bitcoin mining.\nC. Whether Bitcoin should be integrated into existing financial systems or remain independent as a decentralized alternative.\nD. Whether Bitcoin primarily benefits wealthy investors or has the potential to empower underbanked communities.\nAnswer with the option's letter from the given choices directly.",
1897,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "633-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1898,
"target": "B",
"doc": {
"video_id": "633",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=QTyzyP2Afys",
"videoID": "QTyzyP2Afys",
"question_id": "633-3",
"task_type": "Information Synopsis",
"question": "What is the main argument presented by Jordi Visser regarding the volatility of Bitcoin in the video?",
"options": [
"A. The volatility of Bitcoin is irrelevant as its value will continue to rise in the long term.",
"B. The volatility of Bitcoin is declining as it becomes a more established and accepted asset class.",
"C. The volatility of Bitcoin is significantly higher than other asset classes like technology stocks.",
"D. The volatility of Bitcoin is a major concern that discourages investors and hinders its adoption."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main argument presented by Jordi Visser regarding the volatility of Bitcoin in the video?\nOption:\nA. The volatility of Bitcoin is irrelevant as its value will continue to rise in the long term.\nB. The volatility of Bitcoin is declining as it becomes a more established and accepted asset class.\nC. The volatility of Bitcoin is significantly higher than other asset classes like technology stocks.\nD. The volatility of Bitcoin is a major concern that discourages investors and hinders its adoption.\nAnswer with the option's letter from the given choices directly.",
1898,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "633-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1899,
"target": "C",
"doc": {
"video_id": "634",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=0ncLRRLvdfk",
"videoID": "0ncLRRLvdfk",
"question_id": "634-1",
"task_type": "Information Synopsis",
"question": "How does a thorough comprehension of the first method introduced in the video aid in navigating economic downturns?",
"options": [
"A. It allows individuals to predict the exact timing and severity of economic downturns, enabling them to liquidate assets before losses occur.",
"B. It fosters a \"doom and gloom\" mentality, encouraging individuals to hoard cash and avoid all investments during economic hardship.",
"C. It equips individuals with knowledge of macroeconomics and the cyclical nature of the economy, allowing them to adapt their financial strategies and identify potential opportunities amidst turmoil.",
"D. It guarantees financial success during economic downturns by providing a foolproof formula for picking winning investments."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does a thorough comprehension of the first method introduced in the video aid in navigating economic downturns?\nOption:\nA. It allows individuals to predict the exact timing and severity of economic downturns, enabling them to liquidate assets before losses occur.\nB. It fosters a \"doom and gloom\" mentality, encouraging individuals to hoard cash and avoid all investments during economic hardship.\nC. It equips individuals with knowledge of macroeconomics and the cyclical nature of the economy, allowing them to adapt their financial strategies and identify potential opportunities amidst turmoil.\nD. It guarantees financial success during economic downturns by providing a foolproof formula for picking winning investments.\nAnswer with the option's letter from the given choices directly.",
1899,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "634-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1900,
"target": "B",
"doc": {
"video_id": "634",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=0ncLRRLvdfk",
"videoID": "0ncLRRLvdfk",
"question_id": "634-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options is NOT introduced between the way \"learn to read financial news critically\" and \"utilize gamified learning apps\" in the video?",
"options": [
"A. Challenge yourself with financial experiments.",
"B. Experiment with dollar-cost averaging.",
"C. Analyze your favourite brands' financials.",
"D. Read influential personal finance books."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is NOT introduced between the way \"learn to read financial news critically\" and \"utilize gamified learning apps\" in the video?\nOption:\nA. Challenge yourself with financial experiments.\nB. Experiment with dollar-cost averaging.\nC. Analyze your favourite brands' financials.\nD. Read influential personal finance books.\nAnswer with the option's letter from the given choices directly.",
1900,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "634-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1901,
"target": "B",
"doc": {
"video_id": "634",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=0ncLRRLvdfk",
"videoID": "0ncLRRLvdfk",
"question_id": "634-3",
"task_type": "Information Synopsis",
"question": "What makes influential financial podcasts a valuable tool for financial education in the video?",
"options": [
"A. All podcasts offer unbiased, expert advice with guaranteed accuracy and relevance.",
"B. Podcasts offer convenient and accessible information on various financial topics.",
"C. They consistently feature interviews with financial advisors, guaranteeing personalized investment recommendations.",
"D. They provide access to market-moving news before traditional media outlets, enabling listeners to capitalize on short-term opportunities."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What makes influential financial podcasts a valuable tool for financial education in the video?\nOption:\nA. All podcasts offer unbiased, expert advice with guaranteed accuracy and relevance.\nB. Podcasts offer convenient and accessible information on various financial topics.\nC. They consistently feature interviews with financial advisors, guaranteeing personalized investment recommendations.\nD. They provide access to market-moving news before traditional media outlets, enabling listeners to capitalize on short-term opportunities.\nAnswer with the option's letter from the given choices directly.",
1901,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "634-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1902,
"target": "D",
"doc": {
"video_id": "635",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=oue5A-7Hpx4",
"videoID": "oue5A-7Hpx4",
"question_id": "635-1",
"task_type": "Object Reasoning",
"question": "What is the core distinction between American Express and companies like Visa and Mastercard as described in the video?",
"options": [
"A. Amex focuses on a closed-loop system, while Visa and Mastercard operate on an open-loop system.",
"B. Amex primarily targets affluent customers, whereas Visa and Mastercard focus on a broader customer base.",
"C. AmEx offers superior rewards programs compared to Visa and Mastercard, attracting high-spending customers.",
"D. AmEx functions as both a card network and a lender, while Visa and Mastercard primarily act as card networks."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the core distinction between American Express and companies like Visa and Mastercard as described in the video?\nOption:\nA. Amex focuses on a closed-loop system, while Visa and Mastercard operate on an open-loop system.\nB. Amex primarily targets affluent customers, whereas Visa and Mastercard focus on a broader customer base.\nC. AmEx offers superior rewards programs compared to Visa and Mastercard, attracting high-spending customers.\nD. AmEx functions as both a card network and a lender, while Visa and Mastercard primarily act as card networks.\nAnswer with the option's letter from the given choices directly.",
1902,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "635-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1903,
"target": "D",
"doc": {
"video_id": "635",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=oue5A-7Hpx4",
"videoID": "oue5A-7Hpx4",
"question_id": "635-2",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT a criticism of the current credit scoring system in the US mentioned in the video?",
"options": [
"A. The system disproportionately disadvantages Black and Hispanic communities due to historical inequalities.",
"B. Errors in credit reports are frequent and often difficult for consumers to rectify.",
"C. The focus on past financial behavior unfairly penalizes individuals who have experienced hardship.",
"D. The system lacks transparency and sufficient government oversight, leading to potential consumer exploitation."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT a criticism of the current credit scoring system in the US mentioned in the video?\nOption:\nA. The system disproportionately disadvantages Black and Hispanic communities due to historical inequalities.\nB. Errors in credit reports are frequent and often difficult for consumers to rectify.\nC. The focus on past financial behavior unfairly penalizes individuals who have experienced hardship.\nD. The system lacks transparency and sufficient government oversight, leading to potential consumer exploitation.\nAnswer with the option's letter from the given choices directly.",
1903,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "635-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1904,
"target": "A",
"doc": {
"video_id": "635",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=oue5A-7Hpx4",
"videoID": "oue5A-7Hpx4",
"question_id": "635-3",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT a significant factor that cause Discover's success in the video?",
"options": [
"A. Offering premium travel rewards and airport lounge access to attract affluent customers.",
"B. Focusing on a \"spend-centric\" model that encourages card usage through cashback rewards.",
"C. Targeting middle-class consumers with no annual fees and simple rewards structures.",
"D. Prioritizing online business operations and 24/7 customer service."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT a significant factor that cause Discover's success in the video?\nOption:\nA. Offering premium travel rewards and airport lounge access to attract affluent customers.\nB. Focusing on a \"spend-centric\" model that encourages card usage through cashback rewards.\nC. Targeting middle-class consumers with no annual fees and simple rewards structures.\nD. Prioritizing online business operations and 24/7 customer service.\nAnswer with the option's letter from the given choices directly.",
1904,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "635-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1905,
"target": "B",
"doc": {
"video_id": "636",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=9bbWYVrQgZ8",
"videoID": "9bbWYVrQgZ8",
"question_id": "636-1",
"task_type": "Information Synopsis",
"question": "Which of the following best summarizes the main concern of those who advocate for a reduction in the balance sheet in the video?",
"options": [
"A. The risk of deflation and economic stagnation due to reduced liquidity in the financial system.",
"B. The potential for asset bubbles and financial instability driven by excessive risk-taking.",
"C. The limitations it imposes on the Fed's ability to respond effectively to future economic crises.",
"D. The negative impact on smaller businesses and income inequality due to the focus on supporting large corporations."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following best summarizes the main concern of those who advocate for a reduction in the balance sheet in the video?\nOption:\nA. The risk of deflation and economic stagnation due to reduced liquidity in the financial system.\nB. The potential for asset bubbles and financial instability driven by excessive risk-taking.\nC. The limitations it imposes on the Fed's ability to respond effectively to future economic crises.\nD. The negative impact on smaller businesses and income inequality due to the focus on supporting large corporations.\nAnswer with the option's letter from the given choices directly.",
1905,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "636-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1906,
"target": "D",
"doc": {
"video_id": "636",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=9bbWYVrQgZ8",
"videoID": "9bbWYVrQgZ8",
"question_id": "636-2",
"task_type": "Object Reasoning",
"question": "According to the video, what is one reason why the relationship between inflation and wage inflation has weakened in recent decades?",
"options": [
"A. Increased automation has led to a decline in the demand for labor.",
"B. Government policies have encouraged companies to suppress wages.",
"C. Technological advances have increased worker productivity, reducing the need for wage increases.",
"D. The rise of globalization and outsourcing has reduced the bargaining power of workers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is one reason why the relationship between inflation and wage inflation has weakened in recent decades?\nOption:\nA. Increased automation has led to a decline in the demand for labor.\nB. Government policies have encouraged companies to suppress wages.\nC. Technological advances have increased worker productivity, reducing the need for wage increases.\nD. The rise of globalization and outsourcing has reduced the bargaining power of workers.\nAnswer with the option's letter from the given choices directly.",
1906,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "636-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1907,
"target": "A",
"doc": {
"video_id": "636",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=9bbWYVrQgZ8",
"videoID": "9bbWYVrQgZ8",
"question_id": "636-3",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. The potential consequences of the Federal Reserve's expansionary monetary policy, particularly on inflation.",
"B. The effectiveness of various economic models, such as the Phillips Curve, in predicting and managing inflation.",
"C. The history and evolution of the Federal Reserve and its role in the US economy.",
"D. The impact of the COVID-19 pandemic on global supply chains and consumer spending habits."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. The potential consequences of the Federal Reserve's expansionary monetary policy, particularly on inflation.\nB. The effectiveness of various economic models, such as the Phillips Curve, in predicting and managing inflation.\nC. The history and evolution of the Federal Reserve and its role in the US economy.\nD. The impact of the COVID-19 pandemic on global supply chains and consumer spending habits.\nAnswer with the option's letter from the given choices directly.",
1907,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "636-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1908,
"target": "C",
"doc": {
"video_id": "637",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=J5Npf2xJpag",
"videoID": "J5Npf2xJpag",
"question_id": "637-1",
"task_type": "Information Synopsis",
"question": "According to the given video, what is the main focus of it?",
"options": [
"A. A comparison of different investment banking firms and their approaches to risk management and client relationships.",
"B. The rise of retail investors and the growing importance of consumer finance in the 21st century.",
"C. Highlighting the key leaders, business strategies, and navigation of financial crises of a special history period.",
"D. The evolution of financial regulations and their impact on investment banks like Goldman Sachs."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the given video, what is the main focus of it?\nOption:\nA. A comparison of different investment banking firms and their approaches to risk management and client relationships.\nB. The rise of retail investors and the growing importance of consumer finance in the 21st century.\nC. Highlighting the key leaders, business strategies, and navigation of financial crises of a special history period.\nD. The evolution of financial regulations and their impact on investment banks like Goldman Sachs.\nAnswer with the option's letter from the given choices directly.",
1908,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "637-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1909,
"target": "D",
"doc": {
"video_id": "637",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=J5Npf2xJpag",
"videoID": "J5Npf2xJpag",
"question_id": "637-2",
"task_type": "Information Synopsis",
"question": "Why do the companies featured in the video face challenges when entering the IPO underwriting business?",
"options": [
"A. The firm lacked the necessary capital to compete with established players.",
"B. The commercial paper market was saturated and offered limited growth opportunities.",
"C. Regulations at the time heavily restricted new entrants in the IPO market.",
"D. The firm's special background positioned it as an outsider in the elite-dominated industry."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do the companies featured in the video face challenges when entering the IPO underwriting business?\nOption:\nA. The firm lacked the necessary capital to compete with established players.\nB. The commercial paper market was saturated and offered limited growth opportunities.\nC. Regulations at the time heavily restricted new entrants in the IPO market.\nD. The firm's special background positioned it as an outsider in the elite-dominated industry.\nAnswer with the option's letter from the given choices directly.",
1909,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "637-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1910,
"target": "C",
"doc": {
"video_id": "637",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=J5Npf2xJpag",
"videoID": "J5Npf2xJpag",
"question_id": "637-3",
"task_type": "Object Reasoning",
"question": "How did the collapse of Long Term Capital Management (LTCM) affect Goldman Sachs in the video?",
"options": [
"A. Led to the immediate downfall of Goldman Sachs due to its large exposure to LTCM.",
"B. Resulted in the government bailout of Goldman Sachs and other major Wall Street firms.",
"C. Forced Goldman Sachs to adapt its business model and seek a more stable base of capital.",
"D. Prompted Goldman Sachs to shift its focus from trading to traditional investment banking activities."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the collapse of Long Term Capital Management (LTCM) affect Goldman Sachs in the video?\nOption:\nA. Led to the immediate downfall of Goldman Sachs due to its large exposure to LTCM.\nB. Resulted in the government bailout of Goldman Sachs and other major Wall Street firms.\nC. Forced Goldman Sachs to adapt its business model and seek a more stable base of capital.\nD. Prompted Goldman Sachs to shift its focus from trading to traditional investment banking activities.\nAnswer with the option's letter from the given choices directly.",
1910,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "637-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1911,
"target": "C",
"doc": {
"video_id": "638",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=mzoX7zEZ6h4",
"videoID": "mzoX7zEZ6h4",
"question_id": "638-1",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT one of the primary methods to reach the theme introduced in the video?",
"options": [
"A. The minting of physical coins and printing of paper money by government entities.",
"B. The issuance of loans and creation of digital debt records by private banks.",
"C. Direct investment by central banks into research and development of technological innovations.",
"D. Quantitative easing measures undertaken by central banks to inject liquidity into the economy."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT one of the primary methods to reach the theme introduced in the video?\nOption:\nA. The minting of physical coins and printing of paper money by government entities.\nB. The issuance of loans and creation of digital debt records by private banks.\nC. Direct investment by central banks into research and development of technological innovations.\nD. Quantitative easing measures undertaken by central banks to inject liquidity into the economy.\nAnswer with the option's letter from the given choices directly.",
1911,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "638-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1912,
"target": "A",
"doc": {
"video_id": "638",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=mzoX7zEZ6h4",
"videoID": "mzoX7zEZ6h4",
"question_id": "638-2",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. The mechanics of money creation by central banks, private banks, and governments, and the resulting implications.",
"B. The historical evolution of the US monetary system from the gold standard to the current debt-based system.",
"C. The role of financial instruments like derivatives in the 2008 financial crisis and their ongoing impact on the global economy.",
"D. The issue of wealth inequality, its connection to central bank policies, and potential solutions for a more equitable distribution of wealth."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. The mechanics of money creation by central banks, private banks, and governments, and the resulting implications.\nB. The historical evolution of the US monetary system from the gold standard to the current debt-based system.\nC. The role of financial instruments like derivatives in the 2008 financial crisis and their ongoing impact on the global economy.\nD. The issue of wealth inequality, its connection to central bank policies, and potential solutions for a more equitable distribution of wealth.\nAnswer with the option's letter from the given choices directly.",
1912,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "638-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1913,
"target": "D",
"doc": {
"video_id": "638",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=mzoX7zEZ6h4",
"videoID": "mzoX7zEZ6h4",
"question_id": "638-3",
"task_type": "Object Reasoning",
"question": "What does the video identify as a significant consequence of central bank interventions like quantitative easing?",
"options": [
"A. A decrease in the value of the US dollar compared to other currencies.",
"B. A reduction in the overall national debt through bond purchases.",
"C. An increase in the velocity of money, leading to rapid circulation within the real economy.",
"D. A distortion of market realities, leading to inflated asset prices like stocks and real estate."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video identify as a significant consequence of central bank interventions like quantitative easing?\nOption:\nA. A decrease in the value of the US dollar compared to other currencies.\nB. A reduction in the overall national debt through bond purchases.\nC. An increase in the velocity of money, leading to rapid circulation within the real economy.\nD. A distortion of market realities, leading to inflated asset prices like stocks and real estate.\nAnswer with the option's letter from the given choices directly.",
1913,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "638-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1914,
"target": "A",
"doc": {
"video_id": "639",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=PHe0bXAIuk0",
"videoID": "PHe0bXAIuk0",
"question_id": "639-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. Providing a simplified framework for understanding the core mechanics driving economic fluctuations and cycles.",
"B. Exploring the intricate details of various economic markets and their interactions.",
"C. Analyzing the role of government policies and central bank interventions in managing economic crises.",
"D. Examining the impact of individual spending habits and consumer choices on overall economic performance."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. Providing a simplified framework for understanding the core mechanics driving economic fluctuations and cycles.\nB. Exploring the intricate details of various economic markets and their interactions.\nC. Analyzing the role of government policies and central bank interventions in managing economic crises.\nD. Examining the impact of individual spending habits and consumer choices on overall economic performance.\nAnswer with the option's letter from the given choices directly.",
1914,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "639-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1915,
"target": "D",
"doc": {
"video_id": "639",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=PHe0bXAIuk0",
"videoID": "PHe0bXAIuk0",
"question_id": "639-2",
"task_type": "Object Reasoning",
"question": "How does the phenomenon mainly introduced in the video happen?",
"options": [
"A. Fluctuations in income levels directly determine the availability of credit, which in turn impacts spending patterns and overall economic activity.",
"B. Credit availability remains constant, while changes in spending habits and income levels create short-term fluctuations within a stable economic environment.",
"C. Government policies and central bank interventions are the primary drivers of economic cycles, with spending, income, and credit playing a secondary role.",
"D. Increased spending leads to higher incomes, encouraging further borrowing. Conversely, decreased spending reduces incomes resulting in a contraction."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the phenomenon mainly introduced in the video happen?\nOption:\nA. Fluctuations in income levels directly determine the availability of credit, which in turn impacts spending patterns and overall economic activity.\nB. Credit availability remains constant, while changes in spending habits and income levels create short-term fluctuations within a stable economic environment.\nC. Government policies and central bank interventions are the primary drivers of economic cycles, with spending, income, and credit playing a secondary role.\nD. Increased spending leads to higher incomes, encouraging further borrowing. Conversely, decreased spending reduces incomes resulting in a contraction.\nAnswer with the option's letter from the given choices directly.",
1915,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "639-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1916,
"target": "D",
"doc": {
"video_id": "639",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=PHe0bXAIuk0",
"videoID": "PHe0bXAIuk0",
"question_id": "639-3",
"task_type": "Object Reasoning",
"question": "According to the video, what is the primary factor that distinguishes a recession from a deleveraging?",
"options": [
"A. The severity of unemployment and economic contraction.",
"B. The level of government debt and budget deficits.",
"C. The rate of inflation and deflation.",
"D. The effectiveness of lowering interest rates to stimulate borrowing."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the primary factor that distinguishes a recession from a deleveraging?\nOption:\nA. The severity of unemployment and economic contraction.\nB. The level of government debt and budget deficits.\nC. The rate of inflation and deflation.\nD. The effectiveness of lowering interest rates to stimulate borrowing.\nAnswer with the option's letter from the given choices directly.",
1916,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "639-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1917,
"target": "B",
"doc": {
"video_id": "640",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=W-Q9AOp2FW8",
"videoID": "W-Q9AOp2FW8",
"question_id": "640-1",
"task_type": "Object Reasoning",
"question": "What was the primary purpose of the gathering in Boca Raton, Florida in 1994, highlighted at the beginning of the video?",
"options": [
"A. To explore strategic mergers and acquisitions for expanding market share and diversifying investment portfolios.",
"B. To brainstorm innovative financial products aimed at reducing risk for financial institutions.",
"C. To discuss and implement strategies for exploiting loopholes in existing financial regulations.",
"D. To investigate international investment opportunities and optimize tax strategies within the bounds of legal compliance."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the primary purpose of the gathering in Boca Raton, Florida in 1994, highlighted at the beginning of the video?\nOption:\nA. To explore strategic mergers and acquisitions for expanding market share and diversifying investment portfolios.\nB. To brainstorm innovative financial products aimed at reducing risk for financial institutions.\nC. To discuss and implement strategies for exploiting loopholes in existing financial regulations.\nD. To investigate international investment opportunities and optimize tax strategies within the bounds of legal compliance.\nAnswer with the option's letter from the given choices directly.",
1917,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "640-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1918,
"target": "B",
"doc": {
"video_id": "640",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=W-Q9AOp2FW8",
"videoID": "W-Q9AOp2FW8",
"question_id": "640-2",
"task_type": "Object Reasoning",
"question": "What concern did Terri Duhon and her team have regarding the expanding credit derivatives market, particularly in the context of mortgages in the video?",
"options": [
"A. They felt the complexity of synthetic CDOs would make them difficult to sell to investors.",
"B. They lacked sufficient historical data on the performance of retail mortgages during economic fluctuations.",
"C. They believed the market was becoming oversaturated with similar products, leading to decreased profitability.",
"D. They were concerned that increased regulation would hinder the growth and innovation within the derivatives market."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What concern did Terri Duhon and her team have regarding the expanding credit derivatives market, particularly in the context of mortgages in the video?\nOption:\nA. They felt the complexity of synthetic CDOs would make them difficult to sell to investors.\nB. They lacked sufficient historical data on the performance of retail mortgages during economic fluctuations.\nC. They believed the market was becoming oversaturated with similar products, leading to decreased profitability.\nD. They were concerned that increased regulation would hinder the growth and innovation within the derivatives market.\nAnswer with the option's letter from the given choices directly.",
1918,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "640-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1919,
"target": "B",
"doc": {
"video_id": "640",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Finance & Commerce",
"url": "https://www.youtube.com/watch?v=W-Q9AOp2FW8",
"videoID": "W-Q9AOp2FW8",
"question_id": "640-3",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. The Occupy Wall Street movement and its impact on the financial crisis.",
"B. The development and growth of credit default swaps and their role in the financial crisis.",
"C. The deregulation of the financial industry and its contribution to the housing bubble.",
"D. The rise and fall of specific banks, such as JP Morgan and Goldman Sachs, during the financial crisis."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. The Occupy Wall Street movement and its impact on the financial crisis.\nB. The development and growth of credit default swaps and their role in the financial crisis.\nC. The deregulation of the financial industry and its contribution to the housing bubble.\nD. The rise and fall of specific banks, such as JP Morgan and Goldman Sachs, during the financial crisis.\nAnswer with the option's letter from the given choices directly.",
1919,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "640-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Finance & Commerce",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1920,
"target": "A",
"doc": {
"video_id": "641",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=rD5oDjk3IOU",
"videoID": "rD5oDjk3IOU",
"question_id": "641-1",
"task_type": "Object Reasoning",
"question": "What future human activity does the video discuss?",
"options": [
"A. Initiating experimental colonies on habitable exoplanets.",
"B. Beginning large-scale terraforming projects on Mars.",
"C. Launching commercial expeditions to potentially habitable exoplanets.",
"D. The possibility of sending humans to settle on other planets."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What future human activity does the video discuss?\nOption:\nA. Initiating experimental colonies on habitable exoplanets.\nB. Beginning large-scale terraforming projects on Mars.\nC. Launching commercial expeditions to potentially habitable exoplanets.\nD. The possibility of sending humans to settle on other planets.\nAnswer with the option's letter from the given choices directly.",
1920,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "641-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1921,
"target": "A",
"doc": {
"video_id": "641",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=rD5oDjk3IOU",
"videoID": "rD5oDjk3IOU",
"question_id": "641-2",
"task_type": "Temporal Reasoning",
"question": "In what order are the following planets introduced in the video?",
"options": [
"A. Venus, Jupiter, Neptune.",
"B. Mercury, Jupiter, Mars.",
"C. Venus, Neptune, Jupiter.",
"D. Jupiter, Mercury, Neptune."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what order are the following planets introduced in the video?\nOption:\nA. Venus, Jupiter, Neptune.\nB. Mercury, Jupiter, Mars.\nC. Venus, Neptune, Jupiter.\nD. Jupiter, Mercury, Neptune.\nAnswer with the option's letter from the given choices directly.",
1921,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "641-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1922,
"target": "B",
"doc": {
"video_id": "641",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=rD5oDjk3IOU",
"videoID": "rD5oDjk3IOU",
"question_id": "641-3",
"task_type": "Object Reasoning",
"question": "What are the planets that Dr.David Grinspoon and Dr. Heidi B. Hammel research?",
"options": [
"A. Mars and Jupiter.",
"B. Venus and Neptune.",
"C. Mars and Neptune.",
"D. Jupiter and Neptune."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the planets that Dr.David Grinspoon and Dr. Heidi B. Hammel research?\nOption:\nA. Mars and Jupiter.\nB. Venus and Neptune.\nC. Mars and Neptune.\nD. Jupiter and Neptune.\nAnswer with the option's letter from the given choices directly.",
1922,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "641-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 1923,
"target": "B",
"doc": {
"video_id": "642",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=lAgwMGRvJrM",
"videoID": "lAgwMGRvJrM",
"question_id": "642-1",
"task_type": "Information Synopsis",
"question": "What subject is central to the content of the video?",
"options": [
"A. The formation and evolution of stellar nurseries.",
"B. An exploration of the varied subclasses of supernovae phenomena.",
"C. The impact of dark matter on galactic rotation curves.",
"D. The mechanisms behind pulsar emission."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What subject is central to the content of the video?\nOption:\nA. The formation and evolution of stellar nurseries.\nB. An exploration of the varied subclasses of supernovae phenomena.\nC. The impact of dark matter on galactic rotation curves.\nD. The mechanisms behind pulsar emission.\nAnswer with the option's letter from the given choices directly.",
1923,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "642-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1924,
"target": "C",
"doc": {
"video_id": "642",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=lAgwMGRvJrM",
"videoID": "lAgwMGRvJrM",
"question_id": "642-2",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① SN-1987A\n② Type 1a Supernova\n③ Kilonova",
"options": [
"A. ①③②.",
"B. ②③①.",
"C. ①②③.",
"D. ③①②."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① SN-1987A\n② Type 1a Supernova\n③ Kilonova\nOption:\nA. ①③②.\nB. ②③①.\nC. ①②③.\nD. ③①②.\nAnswer with the option's letter from the given choices directly.",
1924,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "642-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1925,
"target": "D",
"doc": {
"video_id": "642",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=lAgwMGRvJrM",
"videoID": "lAgwMGRvJrM",
"question_id": "642-3",
"task_type": "Object Reasoning",
"question": "What was Fritz Zwicky's significant contribution as highlighted in the video?",
"options": [
"A. He discovered the existence of dark matter through observations of galaxy clusters.",
"B. He invented the Schmidt telescope, which revolutionized astronomical observation.",
"C. He was the first to propose the theory of neutron stars as remnants of supernova explosions.",
"D. He developed the system for classifying supernovae based on their spectral characteristics."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was Fritz Zwicky's significant contribution as highlighted in the video?\nOption:\nA. He discovered the existence of dark matter through observations of galaxy clusters.\nB. He invented the Schmidt telescope, which revolutionized astronomical observation.\nC. He was the first to propose the theory of neutron stars as remnants of supernova explosions.\nD. He developed the system for classifying supernovae based on their spectral characteristics.\nAnswer with the option's letter from the given choices directly.",
1925,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "642-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 1926,
"target": "A",
"doc": {
"video_id": "643",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=Ei-TcECJVXU",
"videoID": "Ei-TcECJVXU",
"question_id": "643-1",
"task_type": "Information Synopsis",
"question": "What is the core focus of the video?",
"options": [
"A. An investigative exploration into the ISS's advanced engineering.",
"B. A historical overview of the space race and its culmination in the ISS.",
"C. The daily life and routines of astronauts aboard the ISS.",
"D. The geopolitical implications of international cooperation on the ISS."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the core focus of the video?\nOption:\nA. An investigative exploration into the ISS's advanced engineering.\nB. A historical overview of the space race and its culmination in the ISS.\nC. The daily life and routines of astronauts aboard the ISS.\nD. The geopolitical implications of international cooperation on the ISS.\nAnswer with the option's letter from the given choices directly.",
1926,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "643-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1927,
"target": "D",
"doc": {
"video_id": "643",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=Ei-TcECJVXU",
"videoID": "Ei-TcECJVXU",
"question_id": "643-2",
"task_type": "Object Recognition",
"question": "Which of the following subjects is not discussed in the video?",
"options": [
"A. The advanced water recycling systems that purify and reuse wastewater.",
"B. The utilization of solar panels for generating power.",
"C. The development of communication systems for maintaining contact with Earth.",
"D. The implementation of radiation shielding to protect astronauts."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following subjects is not discussed in the video?\nOption:\nA. The advanced water recycling systems that purify and reuse wastewater.\nB. The utilization of solar panels for generating power.\nC. The development of communication systems for maintaining contact with Earth.\nD. The implementation of radiation shielding to protect astronauts.\nAnswer with the option's letter from the given choices directly.",
1927,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "643-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1928,
"target": "C",
"doc": {
"video_id": "643",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=Ei-TcECJVXU",
"videoID": "Ei-TcECJVXU",
"question_id": "643-3",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① How to supply oxygen\n② The motion of ISS\n③ Fireflies on ISS\n④ The communication between ISS and The Earth",
"options": [
"A. ②①④③.",
"B. ④①②③.",
"C. ①②③④.",
"D. ①④②③."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① How to supply oxygen\n② The motion of ISS\n③ Fireflies on ISS\n④ The communication between ISS and The Earth\nOption:\nA. ②①④③.\nB. ④①②③.\nC. ①②③④.\nD. ①④②③.\nAnswer with the option's letter from the given choices directly.",
1928,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "643-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1929,
"target": "C",
"doc": {
"video_id": "644",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=EVBCLAaflaw",
"videoID": "EVBCLAaflaw",
"question_id": "644-1",
"task_type": "Object Recognition",
"question": "What topic is discussed in the video?",
"options": [
"A. The video explains the latest breakthroughs in using quantum entanglement for teleportation experiments on the ISS.",
"B. It reveals decoded evidence of past alien visitations hidden in ancient human texts and structures.",
"C. How to create and send messages that could be understood by aliens.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What topic is discussed in the video?\nOption:\nA. The video explains the latest breakthroughs in using quantum entanglement for teleportation experiments on the ISS.\nB. It reveals decoded evidence of past alien visitations hidden in ancient human texts and structures.\nC. How to create and send messages that could be understood by aliens.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1929,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "644-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1930,
"target": "B",
"doc": {
"video_id": "644",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=EVBCLAaflaw",
"videoID": "EVBCLAaflaw",
"question_id": "644-2",
"task_type": "Object Reasoning",
"question": "What are the views of the individuals interviewed regarding the potential existence of extraterrestrial life in the universe?",
"options": [
"A. They both believe there is no credible evidence supporting the existence of extraterrestrial life.",
"B. They both believe in the existence of extraterrestrial life in the universe.",
"C. They are unsure and believe it is too soon to determine if extraterrestrial life exists.",
"D. Indeterminate."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the views of the individuals interviewed regarding the potential existence of extraterrestrial life in the universe?\nOption:\nA. They both believe there is no credible evidence supporting the existence of extraterrestrial life.\nB. They both believe in the existence of extraterrestrial life in the universe.\nC. They are unsure and believe it is too soon to determine if extraterrestrial life exists.\nD. Indeterminate.\nAnswer with the option's letter from the given choices directly.",
1930,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "644-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1931,
"target": "B",
"doc": {
"video_id": "644",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=EVBCLAaflaw",
"videoID": "EVBCLAaflaw",
"question_id": "644-3",
"task_type": "Object Reasoning",
"question": "When will we likely discover alien life, as anticipated by the white-haired speaker in the video's opening and closing?",
"options": [
"A. Within 10 years.",
"B. Within 20 years.",
"C. Within 30 years.",
"D. Within 40 years."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When will we likely discover alien life, as anticipated by the white-haired speaker in the video's opening and closing?\nOption:\nA. Within 10 years.\nB. Within 20 years.\nC. Within 30 years.\nD. Within 40 years.\nAnswer with the option's letter from the given choices directly.",
1931,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "644-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1932,
"target": "A",
"doc": {
"video_id": "645",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=c08Zzc0xepI",
"videoID": "c08Zzc0xepI",
"question_id": "645-1",
"task_type": "Information Synopsis",
"question": "What is the central theme of the video?",
"options": [
"A. The potential discovery of evidence supporting the existence of alternate universes.",
"B. The examination of cosmic microwave background radiation supporting the Big Bang model.",
"C. A discussion on the hypothetical concept of white holes and their role in space-time.",
"D. An in-depth exploration of time dilation effects around neutron stars."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central theme of the video?\nOption:\nA. The potential discovery of evidence supporting the existence of alternate universes.\nB. The examination of cosmic microwave background radiation supporting the Big Bang model.\nC. A discussion on the hypothetical concept of white holes and their role in space-time.\nD. An in-depth exploration of time dilation effects around neutron stars.\nAnswer with the option's letter from the given choices directly.",
1932,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "645-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1933,
"target": "A",
"doc": {
"video_id": "645",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=c08Zzc0xepI",
"videoID": "c08Zzc0xepI",
"question_id": "645-2",
"task_type": "Object Recognition",
"question": "How does the video describe the methods being used by physicists to support the theory of parallel universes?",
"options": [
"A. Through the utilization of particle accelerators to search for gravitons as proof of other universes.",
"B. By employing quantum entanglement to link particles across multiple universes.",
"C. Via observation of cosmic microwave background radiation anomalies.",
"D. By mapping the distribution of dark matter in our universe to infer the structure of other universes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the video describe the methods being used by physicists to support the theory of parallel universes?\nOption:\nA. Through the utilization of particle accelerators to search for gravitons as proof of other universes.\nB. By employing quantum entanglement to link particles across multiple universes.\nC. Via observation of cosmic microwave background radiation anomalies.\nD. By mapping the distribution of dark matter in our universe to infer the structure of other universes.\nAnswer with the option's letter from the given choices directly.",
1933,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "645-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1934,
"target": "C",
"doc": {
"video_id": "645",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=c08Zzc0xepI",
"videoID": "c08Zzc0xepI",
"question_id": "645-3",
"task_type": "Object Recognition",
"question": "What is the view of the white interviewee in the video wearing sunglasses on whether there are parallel universes?",
"options": [
"A. He enthusiastically supports the idea of parallel universes, suggesting we may interact with them soon.",
"B. He questions the plausibility of parallel universes due to the absence of tangible proof.",
"C. He asserts that while some parallel universes might exist theoretically, our understanding is limited.",
"D. He finds the notion of parallel universes fascinating but speculates they may be beyond scientific reach."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the view of the white interviewee in the video wearing sunglasses on whether there are parallel universes?\nOption:\nA. He enthusiastically supports the idea of parallel universes, suggesting we may interact with them soon.\nB. He questions the plausibility of parallel universes due to the absence of tangible proof.\nC. He asserts that while some parallel universes might exist theoretically, our understanding is limited.\nD. He finds the notion of parallel universes fascinating but speculates they may be beyond scientific reach.\nAnswer with the option's letter from the given choices directly.",
1934,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "645-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1935,
"target": "B",
"doc": {
"video_id": "646",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=5iA7wZfxglE",
"videoID": "5iA7wZfxglE",
"question_id": "646-1",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① Lambda-Cold Dark Matter Model\n② The Bullet Cluster\n③ Annihilation Detection\n④ Dark Energy",
"options": [
"A. ③①②④.",
"B. ①②③④.",
"C. ②①④③.",
"D. ①③②④."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① Lambda-Cold Dark Matter Model\n② The Bullet Cluster\n③ Annihilation Detection\n④ Dark Energy\nOption:\nA. ③①②④.\nB. ①②③④.\nC. ②①④③.\nD. ①③②④.\nAnswer with the option's letter from the given choices directly.",
1935,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "646-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1936,
"target": "C",
"doc": {
"video_id": "646",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=5iA7wZfxglE",
"videoID": "5iA7wZfxglE",
"question_id": "646-2",
"task_type": "Object Reasoning",
"question": "Which is mentioned in the video as the second line of evidence for dark matter?",
"options": [
"A. Cluster Collisions.",
"B. Cosmic Microwave Background (CMB).",
"C. Galactic Rotation Curves.",
"D. Gravitational lensing effects."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is mentioned in the video as the second line of evidence for dark matter?\nOption:\nA. Cluster Collisions.\nB. Cosmic Microwave Background (CMB).\nC. Galactic Rotation Curves.\nD. Gravitational lensing effects.\nAnswer with the option's letter from the given choices directly.",
1936,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "646-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1937,
"target": "D",
"doc": {
"video_id": "646",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=5iA7wZfxglE",
"videoID": "5iA7wZfxglE",
"question_id": "646-3",
"task_type": "Object Recognition",
"question": "Which of the following topics is not discussed in detail in the video?",
"options": [
"A. Signs of Missing Mass.",
"B. Dark Matter in the Early Universe.",
"C. Gravitational Lensing.",
"D. Detection of Dark Energy."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following topics is not discussed in detail in the video?\nOption:\nA. Signs of Missing Mass.\nB. Dark Matter in the Early Universe.\nC. Gravitational Lensing.\nD. Detection of Dark Energy.\nAnswer with the option's letter from the given choices directly.",
1937,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "646-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1938,
"target": "D",
"doc": {
"video_id": "647",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=0w4OTD4L0GQ",
"videoID": "0w4OTD4L0GQ",
"question_id": "647-1",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the following is introduced in the video?\n① Our Local Universe\n② The Zone of Avoidance\n③ Dark Flow\n④ The Vela Supercluster",
"options": [
"A. ③①②④.",
"B. ①③②④.",
"C. ②①④③.",
"D. ①②③④."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the following is introduced in the video?\n① Our Local Universe\n② The Zone of Avoidance\n③ Dark Flow\n④ The Vela Supercluster\nOption:\nA. ③①②④.\nB. ①③②④.\nC. ②①④③.\nD. ①②③④.\nAnswer with the option's letter from the given choices directly.",
1938,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "647-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1939,
"target": "B",
"doc": {
"video_id": "647",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=0w4OTD4L0GQ",
"videoID": "0w4OTD4L0GQ",
"question_id": "647-2",
"task_type": "Object Reasoning",
"question": "What is the \"Great Attractor\" according to the video?",
"options": [
"A. An immense region of dark matter influencing the motion of galaxies across billions of light-years.",
"B. A phenomenon to the outskirts of the Milky Way, attracting the contents of the cluster towards it.",
"C. The central supermassive black hole of the Milky Way, Sagittarius A*.",
"D. A celestial object akin to a quasar, emitting immense amounts of energy and affecting local space."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the \"Great Attractor\" according to the video?\nOption:\nA. An immense region of dark matter influencing the motion of galaxies across billions of light-years.\nB. A phenomenon to the outskirts of the Milky Way, attracting the contents of the cluster towards it.\nC. The central supermassive black hole of the Milky Way, Sagittarius A*.\nD. A celestial object akin to a quasar, emitting immense amounts of energy and affecting local space.\nAnswer with the option's letter from the given choices directly.",
1939,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "647-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1940,
"target": "A",
"doc": {
"video_id": "647",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=0w4OTD4L0GQ",
"videoID": "0w4OTD4L0GQ",
"question_id": "647-3",
"task_type": "Object Reasoning",
"question": "What aspect is NOT described in the video?",
"options": [
"A. The role of dark matter in the formation of the Great Attractor.",
"B. Its gravitational influence on the local galaxy supercluster, Laniakea.",
"C. The challenges posed by the Zone of Avoidance in studying the Great Attractor.",
"D. Observational history and methods used to investigate the Great Attractor region."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What aspect is NOT described in the video?\nOption:\nA. The role of dark matter in the formation of the Great Attractor.\nB. Its gravitational influence on the local galaxy supercluster, Laniakea.\nC. The challenges posed by the Zone of Avoidance in studying the Great Attractor.\nD. Observational history and methods used to investigate the Great Attractor region.\nAnswer with the option's letter from the given choices directly.",
1940,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "647-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1941,
"target": "D",
"doc": {
"video_id": "648",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=4iC9Qi3y9q8",
"videoID": "4iC9Qi3y9q8",
"question_id": "648-1",
"task_type": "OCR Problems",
"question": "Which concept is not described in the video?",
"options": [
"A. The interplay of dark energy and the expansion of the universe.",
"B. The theoretical implications of the Cosmic Horizon on observations.",
"C. The Big Bang as a hypothetical explanation for the universe's origin.",
"D. The role of dark matter in shaping the early universe's structure."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which concept is not described in the video?\nOption:\nA. The interplay of dark energy and the expansion of the universe.\nB. The theoretical implications of the Cosmic Horizon on observations.\nC. The Big Bang as a hypothetical explanation for the universe's origin.\nD. The role of dark matter in shaping the early universe's structure.\nAnswer with the option's letter from the given choices directly.",
1941,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "648-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1942,
"target": "B",
"doc": {
"video_id": "648",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=4iC9Qi3y9q8",
"videoID": "4iC9Qi3y9q8",
"question_id": "648-2",
"task_type": "Temporal Reasoning",
"question": "In which sequence are the following introduced in the video?\n① The Big Bang\n② The Hubble Telescope\n③ Redshift\n④ Dark Energy",
"options": [
"A. ①③②④.",
"B. ①②③④.",
"C. ②①④③.",
"D. ①③②④."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which sequence are the following introduced in the video?\n① The Big Bang\n② The Hubble Telescope\n③ Redshift\n④ Dark Energy\nOption:\nA. ①③②④.\nB. ①②③④.\nC. ②①④③.\nD. ①③②④.\nAnswer with the option's letter from the given choices directly.",
1942,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "648-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1943,
"target": "C",
"doc": {
"video_id": "648",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=4iC9Qi3y9q8",
"videoID": "4iC9Qi3y9q8",
"question_id": "648-3",
"task_type": "OCR Problems",
"question": "What does the video state about the total area of the universe?",
"options": [
"A. It estimates the total diameter at about 100 trillion lightyears.",
"B. It calculates the total diameter to be roughly 75 trillion lightyears.",
"C. It puts the total diameter at over 25 trillion lightyears.",
"D. It suggests the total diameter is approximately 50 trillion lightyears."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video state about the total area of the universe?\nOption:\nA. It estimates the total diameter at about 100 trillion lightyears.\nB. It calculates the total diameter to be roughly 75 trillion lightyears.\nC. It puts the total diameter at over 25 trillion lightyears.\nD. It suggests the total diameter is approximately 50 trillion lightyears.\nAnswer with the option's letter from the given choices directly.",
1943,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "648-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1944,
"target": "A",
"doc": {
"video_id": "649",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=mvpbZGwZ6_4",
"videoID": "mvpbZGwZ6_4",
"question_id": "649-1",
"task_type": "Information Synopsis",
"question": "What event is the video primarily discussing?",
"options": [
"A. A fundamental challenge to the Big Bang theory posed by JWST findings.",
"B. Unveiling of an Earth-like planet's detailed atmosphere by the James Webb Space Telescope (JWST).",
"C. The announcement of the deepest view into space ever captured by the Hubble Space Telescope.",
"D. A surprising confirmation of the Big Bang theory through JWST observations."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What event is the video primarily discussing?\nOption:\nA. A fundamental challenge to the Big Bang theory posed by JWST findings.\nB. Unveiling of an Earth-like planet's detailed atmosphere by the James Webb Space Telescope (JWST).\nC. The announcement of the deepest view into space ever captured by the Hubble Space Telescope.\nD. A surprising confirmation of the Big Bang theory through JWST observations.\nAnswer with the option's letter from the given choices directly.",
1944,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "649-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1945,
"target": "C",
"doc": {
"video_id": "649",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=mvpbZGwZ6_4",
"videoID": "mvpbZGwZ6_4",
"question_id": "649-2",
"task_type": "Object Reasoning",
"question": "What conclusions can be made about the special machine's impact on astronomical measurements from the video?",
"options": [
"A. Its observations align with established cosmic distance indicators within expected margins of error.",
"B. Its contributions are seen as complementary to existing instruments, providing incremental improvements.",
"C. It's referenced as a pivotal tool in refining and recalibrating cosmic distance ladders.",
"D. Its primary function is to support the maintenance of the cosmic microwave background radiation map."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What conclusions can be made about the special machine's impact on astronomical measurements from the video?\nOption:\nA. Its observations align with established cosmic distance indicators within expected margins of error.\nB. Its contributions are seen as complementary to existing instruments, providing incremental improvements.\nC. It's referenced as a pivotal tool in refining and recalibrating cosmic distance ladders.\nD. Its primary function is to support the maintenance of the cosmic microwave background radiation map.\nAnswer with the option's letter from the given choices directly.",
1945,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "649-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1946,
"target": "B",
"doc": {
"video_id": "649",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=mvpbZGwZ6_4",
"videoID": "mvpbZGwZ6_4",
"question_id": "649-3",
"task_type": "Action Reasoning",
"question": "Based on the video, what can be inferred about astronomers' reactions to the observations?",
"options": [
"A. They expected the findings and feel it confirms current models.",
"B. They are surprised and questioning existing cosmological theories.",
"C. They are reassured about the viability of the ΛCDM model.",
"D. They believe the JWST is producing inaccurate results."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what can be inferred about astronomers' reactions to the observations?\nOption:\nA. They expected the findings and feel it confirms current models.\nB. They are surprised and questioning existing cosmological theories.\nC. They are reassured about the viability of the ΛCDM model.\nD. They believe the JWST is producing inaccurate results.\nAnswer with the option's letter from the given choices directly.",
1946,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "649-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1947,
"target": "C",
"doc": {
"video_id": "650",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=55Jas5HrzcQ",
"videoID": "55Jas5HrzcQ",
"question_id": "650-1",
"task_type": "Object Reasoning",
"question": "What milestone was described in the video?",
"options": [
"A. The construction of the International Space Station.",
"B. The first American satellite in space.",
"C. A manned lunar landing by the end of the 1960s.",
"D. The development of the Space Shuttle."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What milestone was described in the video?\nOption:\nA. The construction of the International Space Station.\nB. The first American satellite in space.\nC. A manned lunar landing by the end of the 1960s.\nD. The development of the Space Shuttle.\nAnswer with the option's letter from the given choices directly.",
1947,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "650-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1948,
"target": "A",
"doc": {
"video_id": "650",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=55Jas5HrzcQ",
"videoID": "55Jas5HrzcQ",
"question_id": "650-2",
"task_type": "OCR Problems",
"question": "What missions have encountered the failures detailed in the video?",
"options": [
"A. Mission 1 and Mission 13.",
"B. Mission 11 and Mission 2.",
"C. Mission 8 and Mission 12.",
"D. Mission 10 and Mission 14."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What missions have encountered the failures detailed in the video?\nOption:\nA. Mission 1 and Mission 13.\nB. Mission 11 and Mission 2.\nC. Mission 8 and Mission 12.\nD. Mission 10 and Mission 14.\nAnswer with the option's letter from the given choices directly.",
1948,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "650-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 1949,
"target": "A",
"doc": {
"video_id": "650",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Astronomy",
"url": "https://www.youtube.com/watch?v=55Jas5HrzcQ",
"videoID": "55Jas5HrzcQ",
"question_id": "650-3",
"task_type": "Object Reasoning",
"question": "After the first successful moon landing, what challenge did NASA's Apollo Program face?",
"options": [
"A. Declining public interest and reduced government funding.",
"B. Technological limitations in space suit design.",
"C. Soviet competition in space exploration.",
"D. The inability to return samples from the lunar surface."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After the first successful moon landing, what challenge did NASA's Apollo Program face?\nOption:\nA. Declining public interest and reduced government funding.\nB. Technological limitations in space suit design.\nC. Soviet competition in space exploration.\nD. The inability to return samples from the lunar surface.\nAnswer with the option's letter from the given choices directly.",
1949,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "650-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Astronomy",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1950,
"target": "C",
"doc": {
"video_id": "651",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Pfyx5bEZK80",
"videoID": "Pfyx5bEZK80",
"question_id": "651-1",
"task_type": "Object Reasoning",
"question": "In line with the video evidence, which of the following reasons does not lead humpback whales to come to Hawaii to breed and nurse the young?",
"options": [
"A. The water is shallow and clear.",
"B. The water is warmer than the water in Alaska.",
"C. There is a unique setting where ocean and lava collide.",
"D. It's easier for them to avoid predators."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, which of the following reasons does not lead humpback whales to come to Hawaii to breed and nurse the young?\nOption:\nA. The water is shallow and clear.\nB. The water is warmer than the water in Alaska.\nC. There is a unique setting where ocean and lava collide.\nD. It's easier for them to avoid predators.\nAnswer with the option's letter from the given choices directly.",
1950,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "651-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 1951,
"target": "A",
"doc": {
"video_id": "651",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Pfyx5bEZK80",
"videoID": "Pfyx5bEZK80",
"question_id": "651-2",
"task_type": "Action Reasoning",
"question": "What is the main idea of Dr. Sam Ohhugan's words?",
"options": [
"A. How Hawaiians treat nature in Hawaii.",
"B. Why there are so many volcanoes in Hawaii.",
"C. Why Hawaiians' ancestors called their gods aumakua.",
"D. Traditional Values of Hawaiians."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main idea of Dr. Sam Ohhugan's words?\nOption:\nA. How Hawaiians treat nature in Hawaii.\nB. Why there are so many volcanoes in Hawaii.\nC. Why Hawaiians' ancestors called their gods aumakua.\nD. Traditional Values of Hawaiians.\nAnswer with the option's letter from the given choices directly.",
1951,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "651-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1952,
"target": "B",
"doc": {
"video_id": "651",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Pfyx5bEZK80",
"videoID": "Pfyx5bEZK80",
"question_id": "651-3",
"task_type": "Object Reasoning",
"question": "How was Nahuku formed according to the video?",
"options": [
"A. It was created by lava tubes.",
"B. It was created by a river of molten lava.",
"C. It was formed by the workers in the park.",
"D. It was formed due to the function of wind."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How was Nahuku formed according to the video?\nOption:\nA. It was created by lava tubes.\nB. It was created by a river of molten lava.\nC. It was formed by the workers in the park.\nD. It was formed due to the function of wind.\nAnswer with the option's letter from the given choices directly.",
1952,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "651-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1953,
"target": "C",
"doc": {
"video_id": "652",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K0m4xAfTVwM",
"videoID": "K0m4xAfTVwM",
"question_id": "652-1",
"task_type": "Action Reasoning",
"question": "According to the video, why did the scientist mention the way a star explodes?",
"options": [
"A. Because there are lots of features between star explosion and lightning, for example, they both release light.",
"B. Because it is very similar to the principle of lightning generation in many ways.",
"C. Because he wants to illustrate the complexity of the lightning's generation.",
"D. Because the stars in the galaxy is similar to the shape of lightning."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why did the scientist mention the way a star explodes?\nOption:\nA. Because there are lots of features between star explosion and lightning, for example, they both release light.\nB. Because it is very similar to the principle of lightning generation in many ways.\nC. Because he wants to illustrate the complexity of the lightning's generation.\nD. Because the stars in the galaxy is similar to the shape of lightning.\nAnswer with the option's letter from the given choices directly.",
1953,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "652-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1954,
"target": "A",
"doc": {
"video_id": "652",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K0m4xAfTVwM",
"videoID": "K0m4xAfTVwM",
"question_id": "652-2",
"task_type": "Object Reasoning",
"question": "What is the rocket's main function according to the video?",
"options": [
"A. Triggering the lightning.",
"B. Observing the lightning.",
"C. Looking for an answer in outer space.",
"D. Simulating the formation of lightning."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the rocket's main function according to the video?\nOption:\nA. Triggering the lightning.\nB. Observing the lightning.\nC. Looking for an answer in outer space.\nD. Simulating the formation of lightning.\nAnswer with the option's letter from the given choices directly.",
1954,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "652-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1955,
"target": "C",
"doc": {
"video_id": "652",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K0m4xAfTVwM",
"videoID": "K0m4xAfTVwM",
"question_id": "652-3",
"task_type": "Information Synopsis",
"question": "According to Green's research, what is the benefit that lightning brings to us?",
"options": [
"A. Lightning storms often bring heavy rainfall, which can help replenish groundwater levels.",
"B. Lightning-induced wildfires can help maintain ecological balance, making way for new plant growth.",
"C. Lightning on the ground can clear the radiation on save slots in outer space, contributing to satellite communications.",
"D. Lightning can cause nitrogen in the air to combine with oxygen, forming nitrogen oxides."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to Green's research, what is the benefit that lightning brings to us?\nOption:\nA. Lightning storms often bring heavy rainfall, which can help replenish groundwater levels.\nB. Lightning-induced wildfires can help maintain ecological balance, making way for new plant growth.\nC. Lightning on the ground can clear the radiation on save slots in outer space, contributing to satellite communications.\nD. Lightning can cause nitrogen in the air to combine with oxygen, forming nitrogen oxides.\nAnswer with the option's letter from the given choices directly.",
1955,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "652-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1956,
"target": "C",
"doc": {
"video_id": "653",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=YtAL8y2lACs",
"videoID": "YtAL8y2lACs",
"question_id": "653-1",
"task_type": "Counting Problem",
"question": "How many people does the Ross Ice Shelf team consist of?",
"options": [
"A. 2.",
"B. 23.",
"C. 12.",
"D. 8."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people does the Ross Ice Shelf team consist of?\nOption:\nA. 2.\nB. 23.\nC. 12.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
1956,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "653-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1957,
"target": "C",
"doc": {
"video_id": "653",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=YtAL8y2lACs",
"videoID": "YtAL8y2lACs",
"question_id": "653-2",
"task_type": "Action Reasoning",
"question": "Why was the team so cautious upon the helicopter's landing?",
"options": [
"A. Because the helicopter steps were too high.",
"B. Because one of the team members died of falling into a deep crevasse.",
"C. Because there might be a deep crevasse.",
"D. Because the wind was very strong."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why was the team so cautious upon the helicopter's landing?\nOption:\nA. Because the helicopter steps were too high.\nB. Because one of the team members died of falling into a deep crevasse.\nC. Because there might be a deep crevasse.\nD. Because the wind was very strong.\nAnswer with the option's letter from the given choices directly.",
1957,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "653-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1958,
"target": "D",
"doc": {
"video_id": "653",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=YtAL8y2lACs",
"videoID": "YtAL8y2lACs",
"question_id": "653-3",
"task_type": "Object Reasoning",
"question": "In line with the video evidence, how do the fractures form?",
"options": [
"A. Due to rocks experiencing mutual stress.",
"B. Resulting from the force of two rivers against the bank.",
"C. Chemicals such as acids erode rocks, leading to fractures.",
"D. Due to the shearing effect."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, how do the fractures form?\nOption:\nA. Due to rocks experiencing mutual stress.\nB. Resulting from the force of two rivers against the bank.\nC. Chemicals such as acids erode rocks, leading to fractures.\nD. Due to the shearing effect.\nAnswer with the option's letter from the given choices directly.",
1958,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "653-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1959,
"target": "C",
"doc": {
"video_id": "654",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=HvKpnaXYUPU",
"videoID": "HvKpnaXYUPU",
"question_id": "654-1",
"task_type": "Object Reasoning",
"question": "Based on the information provided by the video, which of the following elements takes part in the formation of the Yamal Crater?",
"options": [
"A. Lava explosion.",
"B. Water or erosion weakening the ground beneath.",
"C. Microbes that break down organic matter.",
"D. Meteorite striking earth."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, which of the following elements takes part in the formation of the Yamal Crater?\nOption:\nA. Lava explosion.\nB. Water or erosion weakening the ground beneath.\nC. Microbes that break down organic matter.\nD. Meteorite striking earth.\nAnswer with the option's letter from the given choices directly.",
1959,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "654-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1960,
"target": "C",
"doc": {
"video_id": "654",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=HvKpnaXYUPU",
"videoID": "HvKpnaXYUPU",
"question_id": "654-2",
"task_type": "Counting Problem",
"question": "How many cases are cited in the video to illustrate the craters caused by excessive methane in permafrost?",
"options": [
"A. One.",
"B. Three.",
"C. Two.",
"D. Four."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many cases are cited in the video to illustrate the craters caused by excessive methane in permafrost?\nOption:\nA. One.\nB. Three.\nC. Two.\nD. Four.\nAnswer with the option's letter from the given choices directly.",
1960,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "654-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1961,
"target": "B",
"doc": {
"video_id": "654",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=HvKpnaXYUPU",
"videoID": "HvKpnaXYUPU",
"question_id": "654-3",
"task_type": "Object Recognition",
"question": "Which of the following techniques was not used in the research of Esieh Lake?",
"options": [
"A. VLF.",
"B. Radar.",
"C. Sonar scan.",
"D. The technique that used to trace the chemical fingerprints of leaking methane."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following techniques was not used in the research of Esieh Lake?\nOption:\nA. VLF.\nB. Radar.\nC. Sonar scan.\nD. The technique that used to trace the chemical fingerprints of leaking methane.\nAnswer with the option's letter from the given choices directly.",
1961,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "654-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1962,
"target": "B",
"doc": {
"video_id": "655",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=PXxscnWG8QA",
"videoID": "PXxscnWG8QA",
"question_id": "655-1",
"task_type": "Action Reasoning",
"question": "What did Masu misunderstand according to the video?",
"options": [
"A. The Tsunamis always move fast.",
"B. The Tsunamis are not single waves.",
"C. The Tsunamis are dangerous.",
"D. The first wave is the largest."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did Masu misunderstand according to the video?\nOption:\nA. The Tsunamis always move fast.\nB. The Tsunamis are not single waves.\nC. The Tsunamis are dangerous.\nD. The first wave is the largest.\nAnswer with the option's letter from the given choices directly.",
1962,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "655-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1963,
"target": "C",
"doc": {
"video_id": "655",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=PXxscnWG8QA",
"videoID": "PXxscnWG8QA",
"question_id": "655-2",
"task_type": "Object Reasoning",
"question": "As depicted in the video, which of the following elements was not one of the causes of such a great loss in Hilo island in 1960?",
"options": [
"A. Natural terrain of the bay.",
"B. Local people's curiosity.",
"C. The breaking down of the alarm bell.",
"D. Underestimation of the impact of waves on the other side of the island."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which of the following elements was not one of the causes of such a great loss in Hilo island in 1960?\nOption:\nA. Natural terrain of the bay.\nB. Local people's curiosity.\nC. The breaking down of the alarm bell.\nD. Underestimation of the impact of waves on the other side of the island.\nAnswer with the option's letter from the given choices directly.",
1963,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "655-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1964,
"target": "B",
"doc": {
"video_id": "655",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=PXxscnWG8QA",
"videoID": "PXxscnWG8QA",
"question_id": "655-3",
"task_type": "Object Reasoning",
"question": "What triggered the 1958 tsunami in Lituya Bay according to the video?",
"options": [
"A. volcano eruption.",
"B. Landslide.",
"C. Earthquake.",
"D. Meteorite Impacts."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What triggered the 1958 tsunami in Lituya Bay according to the video?\nOption:\nA. volcano eruption.\nB. Landslide.\nC. Earthquake.\nD. Meteorite Impacts.\nAnswer with the option's letter from the given choices directly.",
1964,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "655-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1965,
"target": "D",
"doc": {
"video_id": "656",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=iOjmy8zkDGQ",
"videoID": "iOjmy8zkDGQ",
"question_id": "656-1",
"task_type": "Attribute Perception",
"question": "What is true about the theory of the formation of the earth that NASA believes in?",
"options": [
"A. It can explain the time span of the earth formation.",
"B. It cannot explain the formation of gas giants.",
"C. It can explain the creation of giant worlds.",
"D. Solar wind takes part in it."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is true about the theory of the formation of the earth that NASA believes in?\nOption:\nA. It can explain the time span of the earth formation.\nB. It cannot explain the formation of gas giants.\nC. It can explain the creation of giant worlds.\nD. Solar wind takes part in it.\nAnswer with the option's letter from the given choices directly.",
1965,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "656-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1966,
"target": "A",
"doc": {
"video_id": "656",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=iOjmy8zkDGQ",
"videoID": "iOjmy8zkDGQ",
"question_id": "656-2",
"task_type": "Object Reasoning",
"question": "What is the metaphor for a double-barreled shotgun according to the video?",
"options": [
"A. The asteroid impact and volcano activity.",
"B. The extinction of the dinosaurs caused both benefits and harms.",
"C. Human hunting.",
"D. Intensive volcano activity."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the metaphor for a double-barreled shotgun according to the video?\nOption:\nA. The asteroid impact and volcano activity.\nB. The extinction of the dinosaurs caused both benefits and harms.\nC. Human hunting.\nD. Intensive volcano activity.\nAnswer with the option's letter from the given choices directly.",
1966,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "656-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1967,
"target": "B",
"doc": {
"video_id": "656",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=iOjmy8zkDGQ",
"videoID": "iOjmy8zkDGQ",
"question_id": "656-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus of the video?",
"options": [
"A. The impact of creatures on the earth.",
"B. Earth's formation and the evolution of life.",
"C. Prehistoric disasters.",
"D. Geological evolution on earth."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of the video?\nOption:\nA. The impact of creatures on the earth.\nB. Earth's formation and the evolution of life.\nC. Prehistoric disasters.\nD. Geological evolution on earth.\nAnswer with the option's letter from the given choices directly.",
1967,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "656-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1968,
"target": "D",
"doc": {
"video_id": "657",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=74TEQfw6L60",
"videoID": "74TEQfw6L60",
"question_id": "657-1",
"task_type": "Action Recognition",
"question": "How were the Sawtooth ranges formed?",
"options": [
"A. Through frequent earthquakes.",
"B. Due to glacier impacts on the mountains.",
"C. As a result of rain flooding the terrain.",
"D. From the pressure of another tectonic plate."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How were the Sawtooth ranges formed?\nOption:\nA. Through frequent earthquakes.\nB. Due to glacier impacts on the mountains.\nC. As a result of rain flooding the terrain.\nD. From the pressure of another tectonic plate.\nAnswer with the option's letter from the given choices directly.",
1968,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "657-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1969,
"target": "D",
"doc": {
"video_id": "657",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=74TEQfw6L60",
"videoID": "74TEQfw6L60",
"question_id": "657-2",
"task_type": "Attribute Perception",
"question": "What evidence indicates that the mountain is still growing?",
"options": [
"A. The increasingly frequent Tsunamis.",
"B. The growth of the rift.",
"C. Measurement of the height of the Rocky Mountains every year.",
"D. The bubbles in water."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What evidence indicates that the mountain is still growing?\nOption:\nA. The increasingly frequent Tsunamis.\nB. The growth of the rift.\nC. Measurement of the height of the Rocky Mountains every year.\nD. The bubbles in water.\nAnswer with the option's letter from the given choices directly.",
1969,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "657-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 1970,
"target": "B",
"doc": {
"video_id": "657",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=74TEQfw6L60",
"videoID": "74TEQfw6L60",
"question_id": "657-3",
"task_type": "Object Reasoning",
"question": "According to the video, what potential changes might occur over tens of millions of years?",
"options": [
"A. The land may expand.",
"B. The land could undergo rifts.",
"C. Mountains may stretch across the entire continent.",
"D. Tectonic activity may intensify."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what potential changes might occur over tens of millions of years?\nOption:\nA. The land may expand.\nB. The land could undergo rifts.\nC. Mountains may stretch across the entire continent.\nD. Tectonic activity may intensify.\nAnswer with the option's letter from the given choices directly.",
1970,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "657-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1971,
"target": "A",
"doc": {
"video_id": "658",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Ry2dJuJ-9UE",
"videoID": "Ry2dJuJ-9UE",
"question_id": "658-1",
"task_type": "Action Recognition",
"question": "In line with the video evidence, what occurred within the lava as its outer layer solidified into rock?",
"options": [
"A. The solidified outer layer could function as a conduit.",
"B. The gas trapped within the lava began to escape.",
"C. The viscosity of the lava decreases over time as solidification quickens.",
"D. The ice melted, giving rise to a lake."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what occurred within the lava as its outer layer solidified into rock?\nOption:\nA. The solidified outer layer could function as a conduit.\nB. The gas trapped within the lava began to escape.\nC. The viscosity of the lava decreases over time as solidification quickens.\nD. The ice melted, giving rise to a lake.\nAnswer with the option's letter from the given choices directly.",
1971,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "658-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1972,
"target": "B",
"doc": {
"video_id": "658",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Ry2dJuJ-9UE",
"videoID": "Ry2dJuJ-9UE",
"question_id": "658-2",
"task_type": "Attribute Perception",
"question": "Which element has the highest peak in the spectra of sample token by Dr. Rudy Reimer?",
"options": [
"A. Iron.",
"B. Zirconium.",
"C. Helium.",
"D. Aluminum."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which element has the highest peak in the spectra of sample token by Dr. Rudy Reimer?\nOption:\nA. Iron.\nB. Zirconium.\nC. Helium.\nD. Aluminum.\nAnswer with the option's letter from the given choices directly.",
1972,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "658-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1973,
"target": "D",
"doc": {
"video_id": "658",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=Ry2dJuJ-9UE",
"videoID": "Ry2dJuJ-9UE",
"question_id": "658-3",
"task_type": "Object Reasoning",
"question": "Based on the video, which of the following statements regarding the algae sample token near a volcanic hot spring is false?",
"options": [
"A. The algae might come to the hot spring by air.",
"B. It is analyzed by Dr. John Stockner.",
"C. The algae species can survive in hot water.",
"D. It reveals that algae can survive in seawater."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following statements regarding the algae sample token near a volcanic hot spring is false?\nOption:\nA. The algae might come to the hot spring by air.\nB. It is analyzed by Dr. John Stockner.\nC. The algae species can survive in hot water.\nD. It reveals that algae can survive in seawater.\nAnswer with the option's letter from the given choices directly.",
1973,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "658-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1974,
"target": "C",
"doc": {
"video_id": "659",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K946mptgzTU",
"videoID": "K946mptgzTU",
"question_id": "659-1",
"task_type": "Object Reasoning",
"question": "According to the video, what is Chimeya famous for?",
"options": [
"A. Rainfall.",
"B. Drought.",
"C. Camel.",
"D. Tannery."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is Chimeya famous for?\nOption:\nA. Rainfall.\nB. Drought.\nC. Camel.\nD. Tannery.\nAnswer with the option's letter from the given choices directly.",
1974,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "659-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1975,
"target": "C",
"doc": {
"video_id": "659",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K946mptgzTU",
"videoID": "K946mptgzTU",
"question_id": "659-2",
"task_type": "Object Reasoning",
"question": "According to the video, what is the primary reason for the appearance of giraffe images on the rocks in the deep desert?",
"options": [
"A. Migration.",
"B. The creativity of the ancients in the desert.",
"C. Climate change.",
"D. The imagination of the ancients in the desert."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what is the primary reason for the appearance of giraffe images on the rocks in the deep desert?\nOption:\nA. Migration.\nB. The creativity of the ancients in the desert.\nC. Climate change.\nD. The imagination of the ancients in the desert.\nAnswer with the option's letter from the given choices directly.",
1975,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "659-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 1976,
"target": "B",
"doc": {
"video_id": "659",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=K946mptgzTU",
"videoID": "K946mptgzTU",
"question_id": "659-3",
"task_type": "Action Reasoning",
"question": "According to the video, why does the caravan drop off hay along the way?",
"options": [
"A. Because it can prevent the caravan from getting lost.",
"B. Because they can find it when coming back.",
"C. Because the camels hate eating hay.",
"D. Because it is cheap."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, why does the caravan drop off hay along the way?\nOption:\nA. Because it can prevent the caravan from getting lost.\nB. Because they can find it when coming back.\nC. Because the camels hate eating hay.\nD. Because it is cheap.\nAnswer with the option's letter from the given choices directly.",
1976,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "659-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1977,
"target": "D",
"doc": {
"video_id": "660",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=7MFKy7DJsCY",
"videoID": "7MFKy7DJsCY",
"question_id": "660-1",
"task_type": "Object Reasoning",
"question": "What specific evidence is presented to support the claim that the Maya in the Yucatan were more advanced than previously believed?",
"options": [
"A. The discovery of a massive, previously unknown pyramid complex.",
"B. The intricate carvings on the vault stones found in the pyramid.",
"C. The detailed analysis of dental plaque revealing a diverse diet.",
"D. The construction of the \"Stairway to Heaven\" estate with its elaborate architecture."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What specific evidence is presented to support the claim that the Maya in the Yucatan were more advanced than previously believed?\nOption:\nA. The discovery of a massive, previously unknown pyramid complex.\nB. The intricate carvings on the vault stones found in the pyramid.\nC. The detailed analysis of dental plaque revealing a diverse diet.\nD. The construction of the \"Stairway to Heaven\" estate with its elaborate architecture.\nAnswer with the option's letter from the given choices directly.",
1977,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "660-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1978,
"target": "D",
"doc": {
"video_id": "660",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=7MFKy7DJsCY",
"videoID": "7MFKy7DJsCY",
"question_id": "660-2",
"task_type": "Object Reasoning",
"question": "What evidence suggests that the Maya at \"Stairway to Heaven\" were wealthy plantation owners?",
"options": [
"A. The elaborate architecture of their homes.",
"B. The presence of gold and silver artifacts in their tombs.",
"C. The presence of sophisticated agricultural tools found at the site.",
"D. The analysis of food particles in dental plaque reveals a diverse diet."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What evidence suggests that the Maya at \"Stairway to Heaven\" were wealthy plantation owners?\nOption:\nA. The elaborate architecture of their homes.\nB. The presence of gold and silver artifacts in their tombs.\nC. The presence of sophisticated agricultural tools found at the site.\nD. The analysis of food particles in dental plaque reveals a diverse diet.\nAnswer with the option's letter from the given choices directly.",
1978,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "660-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1979,
"target": "B",
"doc": {
"video_id": "660",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Geography",
"url": "https://www.youtube.com/watch?v=7MFKy7DJsCY",
"videoID": "7MFKy7DJsCY",
"question_id": "660-3",
"task_type": "Object Reasoning",
"question": "What does the discovery of the feathered serpent carvings at the city of Uxmal suggest about the Maya civilization of the Yucatan?",
"options": [
"A. The Maya practiced human sacrifice rituals.",
"B. A new political ideology involving religious cults emerged in the region.",
"C. The Maya were adept at advanced engineering and architecture.",
"D. The Maya were influenced by other cultures from the south."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the discovery of the feathered serpent carvings at the city of Uxmal suggest about the Maya civilization of the Yucatan?\nOption:\nA. The Maya practiced human sacrifice rituals.\nB. A new political ideology involving religious cults emerged in the region.\nC. The Maya were adept at advanced engineering and architecture.\nD. The Maya were influenced by other cultures from the south.\nAnswer with the option's letter from the given choices directly.",
1979,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "660-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Geography",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1980,
"target": "B",
"doc": {
"video_id": "661",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=CoPfYGgzCkM",
"videoID": "CoPfYGgzCkM",
"question_id": "661-1",
"task_type": "Information Synopsis",
"question": "What is the main content introduced in the video?",
"options": [
"A. The tips for soldiers to survive in the war.",
"B. The rules of war that all soldiers must abide by.",
"C. The historical background of famous battles in military history.",
"D. The history of human war."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content introduced in the video?\nOption:\nA. The tips for soldiers to survive in the war.\nB. The rules of war that all soldiers must abide by.\nC. The historical background of famous battles in military history.\nD. The history of human war.\nAnswer with the option's letter from the given choices directly.",
1980,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "661-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1981,
"target": "C",
"doc": {
"video_id": "661",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=CoPfYGgzCkM",
"videoID": "CoPfYGgzCkM",
"question_id": "661-2",
"task_type": "Spatial Reasoning",
"question": "In what circumstances was the museum depicted in the video bombed?",
"options": [
"A. Museum was used to store fighter jets.",
"B. Museum was used to store a large number of tanks.",
"C. Museum was used to store many weapons.",
"D. The museum was used as a combat headquarters."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what circumstances was the museum depicted in the video bombed?\nOption:\nA. Museum was used to store fighter jets.\nB. Museum was used to store a large number of tanks.\nC. Museum was used to store many weapons.\nD. The museum was used as a combat headquarters.\nAnswer with the option's letter from the given choices directly.",
1981,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "661-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1982,
"target": "A",
"doc": {
"video_id": "661",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=CoPfYGgzCkM",
"videoID": "CoPfYGgzCkM",
"question_id": "661-3",
"task_type": "Temporal Reasoning",
"question": "What is the order in which the restrictions on the use of weapons under international law are introduced in the video?",
"options": [
"A. Chemical weapons and poisons, biological weapons, landmines, laser weapons, expanding and exploding bullets, weapons that spawn non-detectable fragments, flamethrowers and flame weapons, and cluster bombs.",
"B. Chemical weapons and poisons, biological weapons, landmines, laser weapons, expanding and exploding bullets, flamethrowers and flame weapons, cluster bombs.",
"C. Chemical weapons and poisons, biological weapons, laser weapons, weapons that spawn non-detectable fragments, flamethrowers and flame weapons, cluster bombs.",
"D. Chemical weapons and poisons, biological weapons, laser weapons, landmines, expanding and exploding bullets, flamethrowers and flame weapons, weapons that spawn non-detectable fragments, cluster bombs."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order in which the restrictions on the use of weapons under international law are introduced in the video?\nOption:\nA. Chemical weapons and poisons, biological weapons, landmines, laser weapons, expanding and exploding bullets, weapons that spawn non-detectable fragments, flamethrowers and flame weapons, and cluster bombs.\nB. Chemical weapons and poisons, biological weapons, landmines, laser weapons, expanding and exploding bullets, flamethrowers and flame weapons, cluster bombs.\nC. Chemical weapons and poisons, biological weapons, laser weapons, weapons that spawn non-detectable fragments, flamethrowers and flame weapons, cluster bombs.\nD. Chemical weapons and poisons, biological weapons, laser weapons, landmines, expanding and exploding bullets, flamethrowers and flame weapons, weapons that spawn non-detectable fragments, cluster bombs.\nAnswer with the option's letter from the given choices directly.",
1982,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "661-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1983,
"target": "D",
"doc": {
"video_id": "662",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=oT1irsG3wE0",
"videoID": "oT1irsG3wE0",
"question_id": "662-1",
"task_type": "Information Synopsis",
"question": "What is the main content introduced in the video?",
"options": [
"A. Chad Doerman accused of killing his three daughters appears in trial.",
"B. Chad Doerman accused of abusing his three sons appears in trial.",
"C. Chad Doerman accused of abusing his three daughters appears in trial.",
"D. Chad Doerman accused of killing his three sons appears in trial."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content introduced in the video?\nOption:\nA. Chad Doerman accused of killing his three daughters appears in trial.\nB. Chad Doerman accused of abusing his three sons appears in trial.\nC. Chad Doerman accused of abusing his three daughters appears in trial.\nD. Chad Doerman accused of killing his three sons appears in trial.\nAnswer with the option's letter from the given choices directly.",
1983,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "662-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 1984,
"target": "B",
"doc": {
"video_id": "662",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=oT1irsG3wE0",
"videoID": "oT1irsG3wE0",
"question_id": "662-2",
"task_type": "Object Reasoning",
"question": "What is the main message of the caller in the 911 recordings that appear in the video?",
"options": [
"A. The caller's stepfather is killing everyone in the house. The caller also states that the stepfather has shot the caller's brothers.",
"B. The callers see at least two children have been shot and a little girl running down the street as her stepfather is killing people in her house.",
"C. The callers see at least two children being shot, and a little girl running down the street while her family is being shot.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main message of the caller in the 911 recordings that appear in the video?\nOption:\nA. The caller's stepfather is killing everyone in the house. The caller also states that the stepfather has shot the caller's brothers.\nB. The callers see at least two children have been shot and a little girl running down the street as her stepfather is killing people in her house.\nC. The callers see at least two children being shot, and a little girl running down the street while her family is being shot.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
1984,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "662-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 1985,
"target": "C",
"doc": {
"video_id": "662",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=oT1irsG3wE0",
"videoID": "oT1irsG3wE0",
"question_id": "662-3",
"task_type": "Spatial Reasoning",
"question": "Under what circumstances can Chad Doerman be sentenced to death?",
"options": [
"A. He is convicted of killing his three young daughters.",
"B. He is convicted of killing his neighbor's three daughters.",
"C. He is convicted of killing his three young sons.",
"D. He is convicted of killing his neighbor's three sons."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Under what circumstances can Chad Doerman be sentenced to death?\nOption:\nA. He is convicted of killing his three young daughters.\nB. He is convicted of killing his neighbor's three daughters.\nC. He is convicted of killing his three young sons.\nD. He is convicted of killing his neighbor's three sons.\nAnswer with the option's letter from the given choices directly.",
1985,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "662-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1986,
"target": "A",
"doc": {
"video_id": "663",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=eqJPDJr_irE",
"videoID": "eqJPDJr_irE",
"question_id": "663-1",
"task_type": "Action Reasoning",
"question": "What does the man in the first case in the video go to jail for?",
"options": [
"A. Failing to pay child support.",
"B. Failing to pay family maintenance.",
"C. Family violence against wife and child.",
"D. Failing to pay children's school fees."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the man in the first case in the video go to jail for?\nOption:\nA. Failing to pay child support.\nB. Failing to pay family maintenance.\nC. Family violence against wife and child.\nD. Failing to pay children's school fees.\nAnswer with the option's letter from the given choices directly.",
1986,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "663-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 1987,
"target": "D",
"doc": {
"video_id": "663",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=eqJPDJr_irE",
"videoID": "eqJPDJr_irE",
"question_id": "663-2",
"task_type": "Action Reasoning",
"question": "What does the lady in the first case in the video do?",
"options": [
"A. She knows that the man is the biological father of her two sons and sends him to prison for failing to pay child support for them.",
"B. She gives her two children to the man to raise and sends him to jail for not paying child support.",
"C. She mistakenly believed that the man was her son's biological father, so when she discovered that he had not paid child support, she chose to call the police, which led to his arrest.",
"D. She knows the man isn't her son's real father, but still put him in jail for not paying child support."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the lady in the first case in the video do?\nOption:\nA. She knows that the man is the biological father of her two sons and sends him to prison for failing to pay child support for them.\nB. She gives her two children to the man to raise and sends him to jail for not paying child support.\nC. She mistakenly believed that the man was her son's biological father, so when she discovered that he had not paid child support, she chose to call the police, which led to his arrest.\nD. She knows the man isn't her son's real father, but still put him in jail for not paying child support.\nAnswer with the option's letter from the given choices directly.",
1987,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "663-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1988,
"target": "B",
"doc": {
"video_id": "663",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=eqJPDJr_irE",
"videoID": "eqJPDJr_irE",
"question_id": "663-3",
"task_type": "Information Synopsis",
"question": "What is the main focus of the argument in the second case based on the content of the video?",
"options": [
"A. Whether the man is the biological father of his son.",
"B. Whether the man is the biological father of twins.",
"C. Whether the man is the unfaithful to his wife in marriage.",
"D. Whether the man is the biological father of his daughter."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the argument in the second case based on the content of the video?\nOption:\nA. Whether the man is the biological father of his son.\nB. Whether the man is the biological father of twins.\nC. Whether the man is the unfaithful to his wife in marriage.\nD. Whether the man is the biological father of his daughter.\nAnswer with the option's letter from the given choices directly.",
1988,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "663-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1989,
"target": "C",
"doc": {
"video_id": "664",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=x1qEBVBxE10",
"videoID": "x1qEBVBxE10",
"question_id": "664-1",
"task_type": "Information Synopsis",
"question": "What is the video mainly about?",
"options": [
"A. Disputes over custody in divorce cases.",
"B. Determining custodial rights between genders.",
"C. Establishing paternity of a child.",
"D. Addressing allegations of child abuse by men."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video mainly about?\nOption:\nA. Disputes over custody in divorce cases.\nB. Determining custodial rights between genders.\nC. Establishing paternity of a child.\nD. Addressing allegations of child abuse by men.\nAnswer with the option's letter from the given choices directly.",
1989,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "664-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1990,
"target": "A",
"doc": {
"video_id": "664",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=x1qEBVBxE10",
"videoID": "x1qEBVBxE10",
"question_id": "664-2",
"task_type": "Object Reasoning",
"question": "Who is the biological father of the girl in the second case in the video?",
"options": [
"A. Mr. Bryant.",
"B. Mr. Jennings.",
"C. Not mentioned in the video.",
"D. Mr. Hardy."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the biological father of the girl in the second case in the video?\nOption:\nA. Mr. Bryant.\nB. Mr. Jennings.\nC. Not mentioned in the video.\nD. Mr. Hardy.\nAnswer with the option's letter from the given choices directly.",
1990,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "664-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 1991,
"target": "D",
"doc": {
"video_id": "664",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=x1qEBVBxE10",
"videoID": "x1qEBVBxE10",
"question_id": "664-3",
"task_type": "Object Reasoning",
"question": "What is the view held by the female witness present in the first case in the video?",
"options": [
"A. She testifies that the man is the girl's biological father.",
"B. She suspects the boy is not the man's biological child.",
"C. She testifies that the man is the boy's biological father.",
"D. She suspects the girl is not the man's biological child."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the view held by the female witness present in the first case in the video?\nOption:\nA. She testifies that the man is the girl's biological father.\nB. She suspects the boy is not the man's biological child.\nC. She testifies that the man is the boy's biological father.\nD. She suspects the girl is not the man's biological child.\nAnswer with the option's letter from the given choices directly.",
1991,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "664-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1992,
"target": "B",
"doc": {
"video_id": "665",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=SfZrnoo1GPM",
"videoID": "SfZrnoo1GPM",
"question_id": "665-1",
"task_type": "Information Synopsis",
"question": "What is the central focus of the main plot discussed by the two hosts in the video about the film?",
"options": [
"A. The movie tells the story of a group of veterans who uncovered an international arms smuggling case covered up by high-level government officials. Two of the protagonists were wrongly convicted of murder. In the process of tracing the truth, they discovered that the case was related to terrorist organizations.",
"B. The film's main plot revolves around the court-martial of two U.S. Marines accused of murdering a fellow Marine at Guantanamo Bay.",
"C. The main plot of the film revolves around a defence lawyer suing the government for allegedly organising the murder of two Marines.",
"D. The main plot of the film revolves around how two Marines sue their superiors for bullying them."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central focus of the main plot discussed by the two hosts in the video about the film?\nOption:\nA. The movie tells the story of a group of veterans who uncovered an international arms smuggling case covered up by high-level government officials. Two of the protagonists were wrongly convicted of murder. In the process of tracing the truth, they discovered that the case was related to terrorist organizations.\nB. The film's main plot revolves around the court-martial of two U.S. Marines accused of murdering a fellow Marine at Guantanamo Bay.\nC. The main plot of the film revolves around a defence lawyer suing the government for allegedly organising the murder of two Marines.\nD. The main plot of the film revolves around how two Marines sue their superiors for bullying them.\nAnswer with the option's letter from the given choices directly.",
1992,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "665-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 1993,
"target": "C",
"doc": {
"video_id": "665",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=SfZrnoo1GPM",
"videoID": "SfZrnoo1GPM",
"question_id": "665-2",
"task_type": "Object Reasoning",
"question": "What does code red refer to in the film plot in the video?",
"options": [
"A. Code red refers to an emergency evacuation plan used in fire or other emergencies to ensure the rapid and safe evacuation of people.",
"B. Code red is a warning mechanism for students in educational institutions, indicating that students have violated school rules and need to receive counseling or minor punishment.",
"C. Code red refers to informal disciplinary intervention within an organization, carried out under instructions from superiors, and designed to correct member behavior privately without following official procedures.",
"D. Code red is one of the hospital emergency codes, which means patients with highly contagious diseases need immediate isolation."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does code red refer to in the film plot in the video?\nOption:\nA. Code red refers to an emergency evacuation plan used in fire or other emergencies to ensure the rapid and safe evacuation of people.\nB. Code red is a warning mechanism for students in educational institutions, indicating that students have violated school rules and need to receive counseling or minor punishment.\nC. Code red refers to informal disciplinary intervention within an organization, carried out under instructions from superiors, and designed to correct member behavior privately without following official procedures.\nD. Code red is one of the hospital emergency codes, which means patients with highly contagious diseases need immediate isolation.\nAnswer with the option's letter from the given choices directly.",
1993,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "665-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 1994,
"target": "A",
"doc": {
"video_id": "665",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=SfZrnoo1GPM",
"videoID": "SfZrnoo1GPM",
"question_id": "665-3",
"task_type": "Object Reasoning",
"question": "Which two people are the main focus of conflict in the film's plot in the second half of the video?",
"options": [
"A. Male defense counsel and male colonel sitting in the witness box.",
"B. Male defense counsel and male prosecutor.",
"C. Female defense lawyer and male colonel sitting in the witness box.",
"D. Female defense counsel and male prosecutor."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which two people are the main focus of conflict in the film's plot in the second half of the video?\nOption:\nA. Male defense counsel and male colonel sitting in the witness box.\nB. Male defense counsel and male prosecutor.\nC. Female defense lawyer and male colonel sitting in the witness box.\nD. Female defense counsel and male prosecutor.\nAnswer with the option's letter from the given choices directly.",
1994,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "665-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1995,
"target": "D",
"doc": {
"video_id": "666",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=mzvHYBgD-tI",
"videoID": "mzvHYBgD-tI",
"question_id": "666-1",
"task_type": "Action Recognition",
"question": "Which of the following is not discussed by the presenter in the video?",
"options": [
"A. Introduction to pro bono work for lawyers.",
"B. Introduction to jury trials.",
"C. Give a review of the show.",
"D. Introducing how to be a prosecutor."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not discussed by the presenter in the video?\nOption:\nA. Introduction to pro bono work for lawyers.\nB. Introduction to jury trials.\nC. Give a review of the show.\nD. Introducing how to be a prosecutor.\nAnswer with the option's letter from the given choices directly.",
1995,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "666-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 1996,
"target": "B",
"doc": {
"video_id": "666",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=mzvHYBgD-tI",
"videoID": "mzvHYBgD-tI",
"question_id": "666-2",
"task_type": "Action Reasoning",
"question": "In the video, how did the defense lawyer prove that there was something wrong with the surveillance video in the movie plot?",
"options": [
"A. By presenting a frame-by-frame comparison showing the identical plastic bag at the same timestamp each night, along with timestamp inconsistencies with known weather patterns or ambient conditions, conclusively demonstrating a flaw or tampering with the surveillance footage.",
"B. By stating that a plastic bag appears in the same way at the same point in time every night in the surveillance video.",
"C. By stating that there's a man on the surveillance video every night.",
"D. By stating that surveillance video is checked every night."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, how did the defense lawyer prove that there was something wrong with the surveillance video in the movie plot?\nOption:\nA. By presenting a frame-by-frame comparison showing the identical plastic bag at the same timestamp each night, along with timestamp inconsistencies with known weather patterns or ambient conditions, conclusively demonstrating a flaw or tampering with the surveillance footage.\nB. By stating that a plastic bag appears in the same way at the same point in time every night in the surveillance video.\nC. By stating that there's a man on the surveillance video every night.\nD. By stating that surveillance video is checked every night.\nAnswer with the option's letter from the given choices directly.",
1996,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "666-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 1997,
"target": "C",
"doc": {
"video_id": "666",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=mzvHYBgD-tI",
"videoID": "mzvHYBgD-tI",
"question_id": "666-3",
"task_type": "Object Reasoning",
"question": "What efforts were made by the defence lawyers in the last trial in the video?",
"options": [
"A. She pointed out that the police had leading questions during the interrogation process, which may have affected the authenticity of the defendant's confession.",
"B. She reveals that a key witness had an alibi at the time of the crime, questioning the integrity of his statement.",
"C. She proved that the key witness's vital evidence was not credible, citing new evidence to lead the suspect to another person.",
"D. She cited details from the forensic medical report and questioned possible contamination issues in the handling of physical evidence, which would affect the validity of the evidence."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What efforts were made by the defence lawyers in the last trial in the video?\nOption:\nA. She pointed out that the police had leading questions during the interrogation process, which may have affected the authenticity of the defendant's confession.\nB. She reveals that a key witness had an alibi at the time of the crime, questioning the integrity of his statement.\nC. She proved that the key witness's vital evidence was not credible, citing new evidence to lead the suspect to another person.\nD. She cited details from the forensic medical report and questioned possible contamination issues in the handling of physical evidence, which would affect the validity of the evidence.\nAnswer with the option's letter from the given choices directly.",
1997,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "666-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 1998,
"target": "A",
"doc": {
"video_id": "667",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=AFFZjzuAHMg",
"videoID": "AFFZjzuAHMg",
"question_id": "667-1",
"task_type": "Spatial Reasoning",
"question": "In the bar scene of the movie, what item does the policeman present as evidence of the arrested person's suspicion, found in the car?",
"options": [
"A. A canvas shoe found in the arrested man's car.",
"B. A pair of canvas shoes found in the arrested man's car.",
"C. A schoolbag found in the arrested man's car.",
"D. A hat found in an arrested man's car."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the bar scene of the movie, what item does the policeman present as evidence of the arrested person's suspicion, found in the car?\nOption:\nA. A canvas shoe found in the arrested man's car.\nB. A pair of canvas shoes found in the arrested man's car.\nC. A schoolbag found in the arrested man's car.\nD. A hat found in an arrested man's car.\nAnswer with the option's letter from the given choices directly.",
1998,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "667-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 1999,
"target": "D",
"doc": {
"video_id": "667",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=AFFZjzuAHMg",
"videoID": "AFFZjzuAHMg",
"question_id": "667-2",
"task_type": "Spatial Reasoning",
"question": "Which case is the main story about in the plot of the movie explained by the host in the video?",
"options": [
"A. Jake Tyler Brigance.",
"B. Daniel Kaffee.",
"C. Andy Dufresne.",
"D. Carl Lee Haley."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which case is the main story about in the plot of the movie explained by the host in the video?\nOption:\nA. Jake Tyler Brigance.\nB. Daniel Kaffee.\nC. Andy Dufresne.\nD. Carl Lee Haley.\nAnswer with the option's letter from the given choices directly.",
1999,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "667-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2000,
"target": "B",
"doc": {
"video_id": "667",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=AFFZjzuAHMg",
"videoID": "AFFZjzuAHMg",
"question_id": "667-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following is the correct order of events in the video?",
"options": [
"A. The court receives case materials, defense attorneys meet with clients, key witnesses testify, and prosecutors compile evidence lists.",
"B. The police arrested two suspects, the prosecutor discussed the case together, the defense lawyer discussed the composition of the jury, and the prosecutor produced a gun as evidence.",
"C. Detectives collect evidence from the scene, prosecutors and police meet to discuss the case, and during court arguments, the defense raises objections to the evidence.",
"D. Psychology experts testify in court, the prosecutor takes out a gun as evidence, the defense lawyer discusses the composition of the jury, and the prosecutor discusses the case together."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the correct order of events in the video?\nOption:\nA. The court receives case materials, defense attorneys meet with clients, key witnesses testify, and prosecutors compile evidence lists.\nB. The police arrested two suspects, the prosecutor discussed the case together, the defense lawyer discussed the composition of the jury, and the prosecutor produced a gun as evidence.\nC. Detectives collect evidence from the scene, prosecutors and police meet to discuss the case, and during court arguments, the defense raises objections to the evidence.\nD. Psychology experts testify in court, the prosecutor takes out a gun as evidence, the defense lawyer discusses the composition of the jury, and the prosecutor discusses the case together.\nAnswer with the option's letter from the given choices directly.",
2000,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "667-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2001,
"target": "C",
"doc": {
"video_id": "668",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=C6aeL83z_9Y",
"videoID": "C6aeL83z_9Y",
"question_id": "668-1",
"task_type": "Information Synopsis",
"question": "What content does the video mainly focus on?",
"options": [
"A. Detailed examination of NFT's influence on art market dynamics.",
"B. Exploration of NFT technology and its role in transforming digital ownership.",
"C. Some legal issues related to NFT.",
"D. Some non-legal issues related to NFT."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What content does the video mainly focus on?\nOption:\nA. Detailed examination of NFT's influence on art market dynamics.\nB. Exploration of NFT technology and its role in transforming digital ownership.\nC. Some legal issues related to NFT.\nD. Some non-legal issues related to NFT.\nAnswer with the option's letter from the given choices directly.",
2001,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "668-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2002,
"target": "A",
"doc": {
"video_id": "668",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=C6aeL83z_9Y",
"videoID": "C6aeL83z_9Y",
"question_id": "668-2",
"task_type": "Object Reasoning",
"question": "What is the point of the video trying to discuss about the NFT by purchasing the example of Ford signing a purchase contract with an athlete?",
"options": [
"A. Contractual relationship issues between NFT issuers and buyers and secondary buyers.",
"B. The issue of the difference between the value of NFT's investments and actual items such as vehicles.",
"C. The question of NFT's investment value.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the point of the video trying to discuss about the NFT by purchasing the example of Ford signing a purchase contract with an athlete?\nOption:\nA. Contractual relationship issues between NFT issuers and buyers and secondary buyers.\nB. The issue of the difference between the value of NFT's investments and actual items such as vehicles.\nC. The question of NFT's investment value.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2002,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "668-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2003,
"target": "D",
"doc": {
"video_id": "668",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=C6aeL83z_9Y",
"videoID": "C6aeL83z_9Y",
"question_id": "668-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly sorts the order in which events appear in the video?",
"options": [
"A. Discussion of Are NFT securities, examples of NBA Top shots, an introduction to what NFT is, followed by examples of Bored Ape.",
"B. An introduction to the revolutionary technology behind blockchain, illustrating with cases from Cryptokitties, proceeding to examine the innovative use of blockchain in music via CryptoPunks, and ultimately engaging in a dialogue on the implications of blockchain's transparency for financial privacy.",
"C. An introduction to what NFT is, examples of Bored Ape, examples of NBA Top shots, discussion of Are NFT securities.",
"D. An introduction to what NFT is, examples of NBA Top shots, examples of Bored Ape, discussion of Are NFT securities."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly sorts the order in which events appear in the video?\nOption:\nA. Discussion of Are NFT securities, examples of NBA Top shots, an introduction to what NFT is, followed by examples of Bored Ape.\nB. An introduction to the revolutionary technology behind blockchain, illustrating with cases from Cryptokitties, proceeding to examine the innovative use of blockchain in music via CryptoPunks, and ultimately engaging in a dialogue on the implications of blockchain's transparency for financial privacy.\nC. An introduction to what NFT is, examples of Bored Ape, examples of NBA Top shots, discussion of Are NFT securities.\nD. An introduction to what NFT is, examples of NBA Top shots, examples of Bored Ape, discussion of Are NFT securities.\nAnswer with the option's letter from the given choices directly.",
2003,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "668-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2004,
"target": "B",
"doc": {
"video_id": "669",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=g1VFfVsZt7w",
"videoID": "g1VFfVsZt7w",
"question_id": "669-1",
"task_type": "Spatial Reasoning",
"question": "How is the main content of the video presented?",
"options": [
"A. Moderator explains content through interactive dialogue.",
"B. The host watches movie plot clips and explains the content involved in each clip.",
"C. The host explains simultaneously while the video content is playing.",
"D. The host watches an entire movie and then comments on it at the end."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the main content of the video presented?\nOption:\nA. Moderator explains content through interactive dialogue.\nB. The host watches movie plot clips and explains the content involved in each clip.\nC. The host explains simultaneously while the video content is playing.\nD. The host watches an entire movie and then comments on it at the end.\nAnswer with the option's letter from the given choices directly.",
2004,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "669-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2005,
"target": "C",
"doc": {
"video_id": "669",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=g1VFfVsZt7w",
"videoID": "g1VFfVsZt7w",
"question_id": "669-2",
"task_type": "Object Reasoning",
"question": "What character in the courtroom is the man with the gun trying to assassinate the prosecutor in the film clip at the very beginning of the video?",
"options": [
"A. Member of the jury.",
"B. Defence counsel.",
"C. Witness.",
"D. Judge."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What character in the courtroom is the man with the gun trying to assassinate the prosecutor in the film clip at the very beginning of the video?\nOption:\nA. Member of the jury.\nB. Defence counsel.\nC. Witness.\nD. Judge.\nAnswer with the option's letter from the given choices directly.",
2005,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "669-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2006,
"target": "A",
"doc": {
"video_id": "669",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=g1VFfVsZt7w",
"videoID": "g1VFfVsZt7w",
"question_id": "669-3",
"task_type": "Object Reasoning",
"question": "What is the issue that the prosecution and defence sit down to discuss in the episode of the film at the end of the video?",
"options": [
"A. The issue of the defendant's damages to the plaintiff.",
"B. The question of the number of years of imprisonment to which the accused should be sentenced.",
"C. The question of whether the defendant is liable to pay compensation.",
"D. The question of whether the defendant and the plaintiff were required to settle privately."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the issue that the prosecution and defence sit down to discuss in the episode of the film at the end of the video?\nOption:\nA. The issue of the defendant's damages to the plaintiff.\nB. The question of the number of years of imprisonment to which the accused should be sentenced.\nC. The question of whether the defendant is liable to pay compensation.\nD. The question of whether the defendant and the plaintiff were required to settle privately.\nAnswer with the option's letter from the given choices directly.",
2006,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "669-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2007,
"target": "D",
"doc": {
"video_id": "670",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=P6Q7bQTnrjk",
"videoID": "P6Q7bQTnrjk",
"question_id": "670-1",
"task_type": "Information Synopsis",
"question": "What is the main case presented in the video?",
"options": [
"A. Popular YouTube home video blogger scams children's cases.",
"B. The Cheating Case of Popular YouTube Family Vlogger.",
"C. Domestic violence case of popular YouTube family vlogger.",
"D. Popular YouTube home video blogger's child abuse case."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main case presented in the video?\nOption:\nA. Popular YouTube home video blogger scams children's cases.\nB. The Cheating Case of Popular YouTube Family Vlogger.\nC. Domestic violence case of popular YouTube family vlogger.\nD. Popular YouTube home video blogger's child abuse case.\nAnswer with the option's letter from the given choices directly.",
2007,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "670-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2008,
"target": "B",
"doc": {
"video_id": "670",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=P6Q7bQTnrjk",
"videoID": "P6Q7bQTnrjk",
"question_id": "670-2",
"task_type": "Object Reasoning",
"question": "Who were the individuals in the video who were subsequently arrested for child abuse?",
"options": [
"A. Ruby Franke.",
"B. Ruby Franke and Jodi Hildebrandt.",
"C. Kevin Frankie and Ruby Franke.",
"D. Ruby Franke , Jodi Hildebrandt and Pam Bacher."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who were the individuals in the video who were subsequently arrested for child abuse?\nOption:\nA. Ruby Franke.\nB. Ruby Franke and Jodi Hildebrandt.\nC. Kevin Frankie and Ruby Franke.\nD. Ruby Franke , Jodi Hildebrandt and Pam Bacher.\nAnswer with the option's letter from the given choices directly.",
2008,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "670-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2009,
"target": "C",
"doc": {
"video_id": "670",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Law",
"url": "https://www.youtube.com/watch?v=P6Q7bQTnrjk",
"videoID": "P6Q7bQTnrjk",
"question_id": "670-3",
"task_type": "Object Reasoning",
"question": "What is the motivation behind Ruby Frank's unsolicited plea as mentioned in the video?",
"options": [
"A. She believes that by voluntarily pleading guilty, she can obtain a more lenient sentence, thereby reducing the sentence, so that she can return to her family as soon as possible, take care of her children and make up for her mistakes.",
"B. In order to protect herself from further public pressure and legal proceedings, and to restore family relations as soon as possible, she chose to plead guilty to quickly resolve the case.",
"C. Doesn't want to put her children through a long process and wants to reconcile with them by communicating that she recognises her mistakes and will take responsibility for them.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the motivation behind Ruby Frank's unsolicited plea as mentioned in the video?\nOption:\nA. She believes that by voluntarily pleading guilty, she can obtain a more lenient sentence, thereby reducing the sentence, so that she can return to her family as soon as possible, take care of her children and make up for her mistakes.\nB. In order to protect herself from further public pressure and legal proceedings, and to restore family relations as soon as possible, she chose to plead guilty to quickly resolve the case.\nC. Doesn't want to put her children through a long process and wants to reconcile with them by communicating that she recognises her mistakes and will take responsibility for them.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2009,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "670-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Law",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2010,
"target": "B",
"doc": {
"video_id": "671",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=HnS87Tw4amM",
"videoID": "HnS87Tw4amM",
"question_id": "671-1",
"task_type": "Information Synopsis",
"question": "What is the central theme of the video?",
"options": [
"A. How to install a washing machine without water leakage.",
"B. How to remove mold grime and bad smells from washing machines.",
"C. How to repair a washing machine when the drum cubes are broken.",
"D. How to deal with hard water that leaves calcification problems on washing machines."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central theme of the video?\nOption:\nA. How to install a washing machine without water leakage.\nB. How to remove mold grime and bad smells from washing machines.\nC. How to repair a washing machine when the drum cubes are broken.\nD. How to deal with hard water that leaves calcification problems on washing machines.\nAnswer with the option's letter from the given choices directly.",
2010,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "671-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2011,
"target": "C",
"doc": {
"video_id": "671",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=HnS87Tw4amM",
"videoID": "HnS87Tw4amM",
"question_id": "671-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following steps introduced in this video?\n(a) Removing a washing machine door seal.\n(b) Astonishing Results of the beach on the washing machine door seal.\n(c) Washing machine hoses to clean to remove Mold Mildew fungus.\n(d) Cleaning mold out of soap draw.",
"options": [
"A. (b)(c)(a)(d).",
"B. (a)(b)(c)(d).",
"C. (a)(c)(b)(d).",
"D. (b)(d)(a)(c)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following steps introduced in this video?\n(a) Removing a washing machine door seal.\n(b) Astonishing Results of the beach on the washing machine door seal.\n(c) Washing machine hoses to clean to remove Mold Mildew fungus.\n(d) Cleaning mold out of soap draw.\nOption:\nA. (b)(c)(a)(d).\nB. (a)(b)(c)(d).\nC. (a)(c)(b)(d).\nD. (b)(d)(a)(c).\nAnswer with the option's letter from the given choices directly.",
2011,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "671-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2012,
"target": "A",
"doc": {
"video_id": "671",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=HnS87Tw4amM",
"videoID": "HnS87Tw4amM",
"question_id": "671-3",
"task_type": "Object Reasoning",
"question": "How and why does the water filled in the drum change color?",
"options": [
"A. It turns dirty because it soaks the rubbish in the washing machine.",
"B. It turns clean due to the water purification of the drum.",
"C. It turns clean because it soaks the rubbish in the washing machine.",
"D. It turns dirty due to the water purification of the drum."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How and why does the water filled in the drum change color?\nOption:\nA. It turns dirty because it soaks the rubbish in the washing machine.\nB. It turns clean due to the water purification of the drum.\nC. It turns clean because it soaks the rubbish in the washing machine.\nD. It turns dirty due to the water purification of the drum.\nAnswer with the option's letter from the given choices directly.",
2012,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "671-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2013,
"target": "A",
"doc": {
"video_id": "672",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=Vu_NnDWxKY4",
"videoID": "Vu_NnDWxKY4",
"question_id": "672-1",
"task_type": "Information Synopsis",
"question": "What is the primary subject matter of the video?",
"options": [
"A. How to do yogo for weight loss.",
"B. How to get rid of chocolates.",
"C. How to stretch muscles after exercise.",
"D. How to warmup before exercise."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary subject matter of the video?\nOption:\nA. How to do yogo for weight loss.\nB. How to get rid of chocolates.\nC. How to stretch muscles after exercise.\nD. How to warmup before exercise.\nAnswer with the option's letter from the given choices directly.",
2013,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "672-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2014,
"target": "D",
"doc": {
"video_id": "672",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=Vu_NnDWxKY4",
"videoID": "Vu_NnDWxKY4",
"question_id": "672-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following poses introduced in this video?\n(a) Reclining butterfly pose.\n(b) Downward facing dog.\n(c) Camel pose.\n(d) Seated chair twist.",
"options": [
"A. (b)(d)(a)(c).",
"B. (d)(b)(c)(a).",
"C. (c)(a)(d)(b).",
"D. (b)(d)(c)(a)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following poses introduced in this video?\n(a) Reclining butterfly pose.\n(b) Downward facing dog.\n(c) Camel pose.\n(d) Seated chair twist.\nOption:\nA. (b)(d)(a)(c).\nB. (d)(b)(c)(a).\nC. (c)(a)(d)(b).\nD. (b)(d)(c)(a).\nAnswer with the option's letter from the given choices directly.",
2014,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "672-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2015,
"target": "B",
"doc": {
"video_id": "672",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=Vu_NnDWxKY4",
"videoID": "Vu_NnDWxKY4",
"question_id": "672-3",
"task_type": "Temporal Perception",
"question": "From which pose does there come a cat?",
"options": [
"A. Triangle forward fold.",
"B. Seated char squat.",
"C. Hip circles.",
"D. Ragdoll squeeze."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From which pose does there come a cat?\nOption:\nA. Triangle forward fold.\nB. Seated char squat.\nC. Hip circles.\nD. Ragdoll squeeze.\nAnswer with the option's letter from the given choices directly.",
2015,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "672-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Perception",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2016,
"target": "D",
"doc": {
"video_id": "673",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=0Jbc3Ah4EIc",
"videoID": "0Jbc3Ah4EIc",
"question_id": "673-1",
"task_type": "Information Synopsis",
"question": "What is the main topic of the video?",
"options": [
"A. How to inspect the calories that daily food may contain.",
"B. How to use a fitness app scientifically to help you get in shape.",
"C. How skinny people gain weight through scientific diet and exercise.",
"D. How to lose belly fat if one finally makes determination."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of the video?\nOption:\nA. How to inspect the calories that daily food may contain.\nB. How to use a fitness app scientifically to help you get in shape.\nC. How skinny people gain weight through scientific diet and exercise.\nD. How to lose belly fat if one finally makes determination.\nAnswer with the option's letter from the given choices directly.",
2016,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "673-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2017,
"target": "B",
"doc": {
"video_id": "673",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=0Jbc3Ah4EIc",
"videoID": "0Jbc3Ah4EIc",
"question_id": "673-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following tips introduced in this video?\n(a) Tracking daily data of the body\n(b) Emphasizing the importance of time and priority.\n(c) Building a personalised meal plan.\n(d) Creating a caloric deficit.",
"options": [
"A. (b)(d)(a)(c).",
"B. (b)(d)(c)(a).",
"C. (c)(a)(d)(b).",
"D. (d)(c)(a)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following tips introduced in this video?\n(a) Tracking daily data of the body\n(b) Emphasizing the importance of time and priority.\n(c) Building a personalised meal plan.\n(d) Creating a caloric deficit.\nOption:\nA. (b)(d)(a)(c).\nB. (b)(d)(c)(a).\nC. (c)(a)(d)(b).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
2017,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "673-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2018,
"target": "C",
"doc": {
"video_id": "673",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=0Jbc3Ah4EIc",
"videoID": "0Jbc3Ah4EIc",
"question_id": "673-3",
"task_type": "Object Reasoning",
"question": "Which of the following functions can not the app provide according to this video?",
"options": [
"A. Tracking food grams and calories.",
"B. Providing a summary of your body statistics.",
"C. Offering consultation with a doctor or health professional.",
"D. Setting up a diet plan."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following functions can not the app provide according to this video?\nOption:\nA. Tracking food grams and calories.\nB. Providing a summary of your body statistics.\nC. Offering consultation with a doctor or health professional.\nD. Setting up a diet plan.\nAnswer with the option's letter from the given choices directly.",
2018,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "673-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2019,
"target": "C",
"doc": {
"video_id": "674",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=gKjpYd1Ogwg",
"videoID": "gKjpYd1Ogwg",
"question_id": "674-1",
"task_type": "Information Synopsis",
"question": "What is the main topic of the video?",
"options": [
"A. It livestreams a famous mahjong compettion.",
"B. It clarifies several differences in kinds of mahjong around the world.",
"C. It teaches the basics of how to play mahjong for beginners.",
"D. It elaborates on the possible negative impacts mahjong may bring."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of the video?\nOption:\nA. It livestreams a famous mahjong compettion.\nB. It clarifies several differences in kinds of mahjong around the world.\nC. It teaches the basics of how to play mahjong for beginners.\nD. It elaborates on the possible negative impacts mahjong may bring.\nAnswer with the option's letter from the given choices directly.",
2019,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "674-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2020,
"target": "A",
"doc": {
"video_id": "674",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=gKjpYd1Ogwg",
"videoID": "gKjpYd1Ogwg",
"question_id": "674-2",
"task_type": "Temporal Reasoning",
"question": "In what sequence are the following contents introduced in this video?\n(a) Different types of yaku.\n(b) Different types of tiles.\n(c) The strategies to complete a winning hand quickly.\n(d) How to create a winning hand.",
"options": [
"A. (b)(d)(c)(a).",
"B. (b)(d)(a)(c).",
"C. (c)(a)(d)(b).",
"D. (d)(c)(a)(b)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what sequence are the following contents introduced in this video?\n(a) Different types of yaku.\n(b) Different types of tiles.\n(c) The strategies to complete a winning hand quickly.\n(d) How to create a winning hand.\nOption:\nA. (b)(d)(c)(a).\nB. (b)(d)(a)(c).\nC. (c)(a)(d)(b).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
2020,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "674-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2021,
"target": "B",
"doc": {
"video_id": "674",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=gKjpYd1Ogwg",
"videoID": "gKjpYd1Ogwg",
"question_id": "674-3",
"task_type": "Object Recognition",
"question": "Which of the following yaku does the example round one have according to this video?",
"options": [
"A. Yakuhai.",
"B. Tanyao.",
"C. Riichi.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following yaku does the example round one have according to this video?\nOption:\nA. Yakuhai.\nB. Tanyao.\nC. Riichi.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2021,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "674-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2022,
"target": "B",
"doc": {
"video_id": "675",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=lVeNTofDB2k",
"videoID": "lVeNTofDB2k",
"question_id": "675-1",
"task_type": "Information Synopsis",
"question": "What is the main topic about the video?",
"options": [
"A. It evaluates the quality of several kinds of best-selling coffee.",
"B. It teaches how to make multiple kinds of coffee drink.",
"C. It demonstrates various functions what a coffee maker can provide.",
"D. It encourages people to drink coffee for their health."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic about the video?\nOption:\nA. It evaluates the quality of several kinds of best-selling coffee.\nB. It teaches how to make multiple kinds of coffee drink.\nC. It demonstrates various functions what a coffee maker can provide.\nD. It encourages people to drink coffee for their health.\nAnswer with the option's letter from the given choices directly.",
2022,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "675-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2023,
"target": "C",
"doc": {
"video_id": "675",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=lVeNTofDB2k",
"videoID": "lVeNTofDB2k",
"question_id": "675-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following types of coffee made in this video?\n(a) Espresso, Manual.\n(b) Piccolo Latte.\n(c) Dirty Chai.\n(d) Vienna Coffee.",
"options": [
"A. (b)(d)(c)(a).",
"B. (b)(d)(a)(c).",
"C. (a)(b)(c)(d).",
"D. (d)(c)(a)(b)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following types of coffee made in this video?\n(a) Espresso, Manual.\n(b) Piccolo Latte.\n(c) Dirty Chai.\n(d) Vienna Coffee.\nOption:\nA. (b)(d)(c)(a).\nB. (b)(d)(a)(c).\nC. (a)(b)(c)(d).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
2023,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "675-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2024,
"target": "A",
"doc": {
"video_id": "675",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=lVeNTofDB2k",
"videoID": "lVeNTofDB2k",
"question_id": "675-3",
"task_type": "Object Reasoning",
"question": "Which of the following item is not the difference when making Macchiato and Latte Macchiato according to the video?",
"options": [
"A. Different types of coffee.",
"B. Different kinds of cups.",
"C. Different tainting strategies.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following item is not the difference when making Macchiato and Latte Macchiato according to the video?\nOption:\nA. Different types of coffee.\nB. Different kinds of cups.\nC. Different tainting strategies.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2024,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "675-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2025,
"target": "D",
"doc": {
"video_id": "676",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=YPzzGPXW8fs",
"videoID": "YPzzGPXW8fs",
"question_id": "676-1",
"task_type": "Information Synopsis",
"question": "What is the primary subject matter of the video?",
"options": [
"A. How to solve the 4x4 Rubik's cube step by step.",
"B. How to solve the 5x5 Rubik's cube step by step.",
"C. How to solve the 2x2 Rubik's cube step by step.",
"D. How to solve the 3x3 Rubik's cube step by step."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary subject matter of the video?\nOption:\nA. How to solve the 4x4 Rubik's cube step by step.\nB. How to solve the 5x5 Rubik's cube step by step.\nC. How to solve the 2x2 Rubik's cube step by step.\nD. How to solve the 3x3 Rubik's cube step by step.\nAnswer with the option's letter from the given choices directly.",
2025,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "676-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2026,
"target": "B",
"doc": {
"video_id": "676",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=YPzzGPXW8fs",
"videoID": "YPzzGPXW8fs",
"question_id": "676-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following steps introduced in this video?\n(a) Solving the middle layer.\n(b) Matching corner pieces with the center.\n(c) Solving the yellow.\n(d) Solving the white.\n(e) Solving the cube.",
"options": [
"A. (b)(c)(a)(d)(d).",
"B. (d)(a)(c)(b)(e).",
"C. (a)(c)(b)(d)(e).",
"D. (b)(d)(a)(c)(e)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following steps introduced in this video?\n(a) Solving the middle layer.\n(b) Matching corner pieces with the center.\n(c) Solving the yellow.\n(d) Solving the white.\n(e) Solving the cube.\nOption:\nA. (b)(c)(a)(d)(d).\nB. (d)(a)(c)(b)(e).\nC. (a)(c)(b)(d)(e).\nD. (b)(d)(a)(c)(e).\nAnswer with the option's letter from the given choices directly.",
2026,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "676-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2027,
"target": "D",
"doc": {
"video_id": "676",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=YPzzGPXW8fs",
"videoID": "YPzzGPXW8fs",
"question_id": "676-3",
"task_type": "Object Reasoning",
"question": "Which of the following steps has the most scenarios during the demonstration according to the video?",
"options": [
"A. Solving the White Corners.",
"B. Solving the Corners.",
"C. Solving the Middle Layer.",
"D. Solving the Yellow."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following steps has the most scenarios during the demonstration according to the video?\nOption:\nA. Solving the White Corners.\nB. Solving the Corners.\nC. Solving the Middle Layer.\nD. Solving the Yellow.\nAnswer with the option's letter from the given choices directly.",
2027,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "676-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2028,
"target": "A",
"doc": {
"video_id": "677",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=MYxL_JLseC8",
"videoID": "MYxL_JLseC8",
"question_id": "677-1",
"task_type": "Information Synopsis",
"question": "What is the central content of the video?",
"options": [
"A. A professor is giving a speech online on how to design and build your outdoor kitchen.",
"B. A professor is discussing in the office with his students how to design and build your outdoor kitchen.",
"C. A professor is teaching a class in the classroom on how to design and build your outdoor kitchen.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central content of the video?\nOption:\nA. A professor is giving a speech online on how to design and build your outdoor kitchen.\nB. A professor is discussing in the office with his students how to design and build your outdoor kitchen.\nC. A professor is teaching a class in the classroom on how to design and build your outdoor kitchen.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2028,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "677-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2029,
"target": "D",
"doc": {
"video_id": "677",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=MYxL_JLseC8",
"videoID": "MYxL_JLseC8",
"question_id": "677-2",
"task_type": "Temporal Reasoning",
"question": "In what sequence are the following topics introduced in this video?\n(a) Different types of grills.\n(b) Venting.\n(c) Lighting considerations.\n(d) Cooking options.\n(e) Outdoor kitchen configurations.",
"options": [
"A. (b)(c)(a)(d)(d).",
"B. (d)(a)(c)(b)(e).",
"C. (a)(c)(b)(d)(e).",
"D. (c)(b)(a)(d)(a)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what sequence are the following topics introduced in this video?\n(a) Different types of grills.\n(b) Venting.\n(c) Lighting considerations.\n(d) Cooking options.\n(e) Outdoor kitchen configurations.\nOption:\nA. (b)(c)(a)(d)(d).\nB. (d)(a)(c)(b)(e).\nC. (a)(c)(b)(d)(e).\nD. (c)(b)(a)(d)(a).\nAnswer with the option's letter from the given choices directly.",
2029,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "677-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2030,
"target": "B",
"doc": {
"video_id": "677",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=MYxL_JLseC8",
"videoID": "MYxL_JLseC8",
"question_id": "677-3",
"task_type": "Object Reasoning",
"question": "What configuration type is the demonstration of Decton after the professor introduces the topic of materials?",
"options": [
"A. U shape style.",
"B. L shape style.",
"C. Island style.",
"D. Galley style."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What configuration type is the demonstration of Decton after the professor introduces the topic of materials?\nOption:\nA. U shape style.\nB. L shape style.\nC. Island style.\nD. Galley style.\nAnswer with the option's letter from the given choices directly.",
2030,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "677-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2031,
"target": "C",
"doc": {
"video_id": "678",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=B0-7MeWA2aI",
"videoID": "B0-7MeWA2aI",
"question_id": "678-1",
"task_type": "Information Synopsis",
"question": "What is the main topic about the video?",
"options": [
"A. How to design clothing.",
"B. How to create a successful social media marketing campaign for offline retail.",
"C. How to start print on demand step by step.",
"D. How to start an online store or website for free."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic about the video?\nOption:\nA. How to design clothing.\nB. How to create a successful social media marketing campaign for offline retail.\nC. How to start print on demand step by step.\nD. How to start an online store or website for free.\nAnswer with the option's letter from the given choices directly.",
2031,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "678-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2032,
"target": "A",
"doc": {
"video_id": "678",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=B0-7MeWA2aI",
"videoID": "B0-7MeWA2aI",
"question_id": "678-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following topics introduced in this video?\n(a) How to design clothing.\n(b) How to create a logo.\n(c) How to start an online store or website for free.\n(d) How to market and get sales.",
"options": [
"A. (a)(c)(b)(d).",
"B. (a)(b)(c)(d).",
"C. (a)(d)(b)(c).",
"D. (a)(b)(d)(c)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following topics introduced in this video?\n(a) How to design clothing.\n(b) How to create a logo.\n(c) How to start an online store or website for free.\n(d) How to market and get sales.\nOption:\nA. (a)(c)(b)(d).\nB. (a)(b)(c)(d).\nC. (a)(d)(b)(c).\nD. (a)(b)(d)(c).\nAnswer with the option's letter from the given choices directly.",
2032,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "678-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2033,
"target": "A",
"doc": {
"video_id": "678",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=B0-7MeWA2aI",
"videoID": "B0-7MeWA2aI",
"question_id": "678-3",
"task_type": "OCR Problems",
"question": "As depicted in this video, which of the following websites can help promote selling?",
"options": [
"A. Pinterest.",
"B. Facebook.",
"C. Shopify.",
"D. All of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in this video, which of the following websites can help promote selling?\nOption:\nA. Pinterest.\nB. Facebook.\nC. Shopify.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2033,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "678-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2034,
"target": "B",
"doc": {
"video_id": "679",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=z7SoAIq20lA",
"videoID": "z7SoAIq20lA",
"question_id": "679-1",
"task_type": "Information Synopsis",
"question": "What is the central theme of the video?",
"options": [
"A. It teaches to edit an image as simple as a few clicks of your mouse using Photoshop.",
"B. It teaches to freely and easily manipulate color in an image using Nik Viveza.",
"C. It teaches how to photograph outdoors without color distortion.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central theme of the video?\nOption:\nA. It teaches to edit an image as simple as a few clicks of your mouse using Photoshop.\nB. It teaches to freely and easily manipulate color in an image using Nik Viveza.\nC. It teaches how to photograph outdoors without color distortion.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2034,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "679-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2035,
"target": "C",
"doc": {
"video_id": "679",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=z7SoAIq20lA",
"videoID": "z7SoAIq20lA",
"question_id": "679-2",
"task_type": "Temporal Reasoning",
"question": "In which order do the following steps are introduced when manipulating the first image in this video?\n(a) Adding more control points.\n(b) Global adjustments\n(c) Launching Nik Viveza.\n(d) Adjusting green",
"options": [
"A. (a)(c)(d)(b).",
"B. (b)(a)(d)(c).",
"C. (c)(b)(d)(a).",
"D. (d)(c)(a)(b)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order do the following steps are introduced when manipulating the first image in this video?\n(a) Adding more control points.\n(b) Global adjustments\n(c) Launching Nik Viveza.\n(d) Adjusting green\nOption:\nA. (a)(c)(d)(b).\nB. (b)(a)(d)(c).\nC. (c)(b)(d)(a).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
2035,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "679-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2036,
"target": "A",
"doc": {
"video_id": "679",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=z7SoAIq20lA",
"videoID": "z7SoAIq20lA",
"question_id": "679-3",
"task_type": "Counting Problem",
"question": "How many different pictures are edited in this video?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many different pictures are edited in this video?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
2036,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "679-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2037,
"target": "D",
"doc": {
"video_id": "680",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=y2kg3MOk1sY",
"videoID": "y2kg3MOk1sY",
"question_id": "680-1",
"task_type": "Information Synopsis",
"question": "What is the main topic about the video?",
"options": [
"A. It is an animation demonstration of computer systems.",
"B. It is a computer course for people new to working with computers.",
"C. It teaches technology fundamentals of computer.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic about the video?\nOption:\nA. It is an animation demonstration of computer systems.\nB. It is a computer course for people new to working with computers.\nC. It teaches technology fundamentals of computer.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2037,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "680-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2038,
"target": "B",
"doc": {
"video_id": "680",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=y2kg3MOk1sY",
"videoID": "y2kg3MOk1sY",
"question_id": "680-2",
"task_type": "Temporal Reasoning",
"question": "In what sequence are the following topics introduced in this video?\n(a) Internet Safety: Your Browser's Security Features.\n(b) Basic Parts of a Computer.\n(c) Mac OS X Basics: Getting Started with the Desktop\n(d) Understanding Operating Systems",
"options": [
"A. (a)(b)(c)(d).",
"B. (b)(d)(a)(c).",
"C. (a)(d)(c)(b).",
"D. (b)(a)(c)(d)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what sequence are the following topics introduced in this video?\n(a) Internet Safety: Your Browser's Security Features.\n(b) Basic Parts of a Computer.\n(c) Mac OS X Basics: Getting Started with the Desktop\n(d) Understanding Operating Systems\nOption:\nA. (a)(b)(c)(d).\nB. (b)(d)(a)(c).\nC. (a)(d)(c)(b).\nD. (b)(a)(c)(d).\nAnswer with the option's letter from the given choices directly.",
2038,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "680-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2039,
"target": "C",
"doc": {
"video_id": "680",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Life Tip",
"url": "https://www.youtube.com/watch?v=y2kg3MOk1sY",
"videoID": "y2kg3MOk1sY",
"question_id": "680-3",
"task_type": "Action Reasoning",
"question": "According to this video, what can people do if they lose their important files on the computers?",
"options": [
"A. Use backup software such as Time Machine when the computer breaks down.",
"B. Ask after-sales service for help if something happens to the computer and backup drive.",
"C. Download what they have backed up in the cloud if they have access to cloud service.",
"D. Recover from the backup hard drives even if the drives get burnt."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, what can people do if they lose their important files on the computers?\nOption:\nA. Use backup software such as Time Machine when the computer breaks down.\nB. Ask after-sales service for help if something happens to the computer and backup drive.\nC. Download what they have backed up in the cloud if they have access to cloud service.\nD. Recover from the backup hard drives even if the drives get burnt.\nAnswer with the option's letter from the given choices directly.",
2039,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "680-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Life Tip",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2040,
"target": "A",
"doc": {
"video_id": "681",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=AF8d72mA41M",
"videoID": "AF8d72mA41M",
"question_id": "681-1",
"task_type": "Object Reasoning",
"question": "What can't you learn from the video?",
"options": [
"A. Winner of the 2016 Nobel Prize in Physics.",
"B. How an LED works.",
"C. The invention process of blue LED.",
"D. Nakamura's current research interest."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can't you learn from the video?\nOption:\nA. Winner of the 2016 Nobel Prize in Physics.\nB. How an LED works.\nC. The invention process of blue LED.\nD. Nakamura's current research interest.\nAnswer with the option's letter from the given choices directly.",
2040,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "681-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2041,
"target": "C",
"doc": {
"video_id": "681",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=AF8d72mA41M",
"videoID": "AF8d72mA41M",
"question_id": "681-2",
"task_type": "Information Synopsis",
"question": "According to the video, which statement is correct?",
"options": [
"A. Nakamura is respected in the company for his creation of blue LEDs.",
"B. Nakamura did not choose ZnSe because he believed the material would not meet the requirements.",
"C. An LED primarily emits light at the junction of the PN interface, where electron-hole recombination occurs.",
"D. Nakamura studied p-type GaN first, and then n-type GaN."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which statement is correct?\nOption:\nA. Nakamura is respected in the company for his creation of blue LEDs.\nB. Nakamura did not choose ZnSe because he believed the material would not meet the requirements.\nC. An LED primarily emits light at the junction of the PN interface, where electron-hole recombination occurs.\nD. Nakamura studied p-type GaN first, and then n-type GaN.\nAnswer with the option's letter from the given choices directly.",
2041,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "681-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2042,
"target": "B",
"doc": {
"video_id": "681",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=AF8d72mA41M",
"videoID": "AF8d72mA41M",
"question_id": "681-3",
"task_type": "Temporal Reasoning",
"question": "When was the first blue LED created in the video?",
"options": [
"A. 1962.",
"B. 1972.",
"C. 1982.",
"D. 1992."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When was the first blue LED created in the video?\nOption:\nA. 1962.\nB. 1972.\nC. 1982.\nD. 1992.\nAnswer with the option's letter from the given choices directly.",
2042,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "681-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2043,
"target": "B",
"doc": {
"video_id": "682",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=eKVTFXQPAhs",
"videoID": "eKVTFXQPAhs",
"question_id": "682-1",
"task_type": "Counting Problem",
"question": "Assuming different models such as the S6 and S6 Edge are considered one generation, how many generations of Samsung Galaxy S series phones are shown in the video?",
"options": [
"A. 10.",
"B. 15.",
"C. 20.",
"D. 24."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Assuming different models such as the S6 and S6 Edge are considered one generation, how many generations of Samsung Galaxy S series phones are shown in the video?\nOption:\nA. 10.\nB. 15.\nC. 20.\nD. 24.\nAnswer with the option's letter from the given choices directly.",
2043,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "682-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2044,
"target": "C",
"doc": {
"video_id": "682",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=eKVTFXQPAhs",
"videoID": "eKVTFXQPAhs",
"question_id": "682-2",
"task_type": "Object Reasoning",
"question": "Which phone characteristic and model pairing is incorrect from the following options?",
"options": [
"A. S Voice, S3.",
"B. Head tracking, S4.",
"C. Bixby, S5.",
"D. Fully glass back, S6."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which phone characteristic and model pairing is incorrect from the following options?\nOption:\nA. S Voice, S3.\nB. Head tracking, S4.\nC. Bixby, S5.\nD. Fully glass back, S6.\nAnswer with the option's letter from the given choices directly.",
2044,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "682-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2045,
"target": "B",
"doc": {
"video_id": "682",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=eKVTFXQPAhs",
"videoID": "eKVTFXQPAhs",
"question_id": "682-3",
"task_type": "Temporal Reasoning",
"question": "What color was the back cover of the previous generation Galaxy S phone of the phone with the most expensive starting price shown in the video?",
"options": [
"A. Black.",
"B. White.",
"C. Purple.",
"D. Gold."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What color was the back cover of the previous generation Galaxy S phone of the phone with the most expensive starting price shown in the video?\nOption:\nA. Black.\nB. White.\nC. Purple.\nD. Gold.\nAnswer with the option's letter from the given choices directly.",
2045,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "682-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2046,
"target": "B",
"doc": {
"video_id": "683",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=U9mJuUkhUzk",
"videoID": "U9mJuUkhUzk",
"question_id": "683-1",
"task_type": "Information Synopsis",
"question": "Which of the following statements about the first cilp of the video presented by the speaker is incorrect?",
"options": [
"A. The first woman asks ChatGPT to write a sentence in Tagalog.",
"B. A woman uses ChatGPT to tutor her four kids.",
"C. The fourth person thinks that ChatGPT contribute more to the work than himself.",
"D. Someone mentions the usefulness of ChatGPT in code-writing."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following statements about the first cilp of the video presented by the speaker is incorrect?\nOption:\nA. The first woman asks ChatGPT to write a sentence in Tagalog.\nB. A woman uses ChatGPT to tutor her four kids.\nC. The fourth person thinks that ChatGPT contribute more to the work than himself.\nD. Someone mentions the usefulness of ChatGPT in code-writing.\nAnswer with the option's letter from the given choices directly.",
2046,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "683-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2047,
"target": "A",
"doc": {
"video_id": "683",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=U9mJuUkhUzk",
"videoID": "U9mJuUkhUzk",
"question_id": "683-2",
"task_type": "Object Reasoning",
"question": "What can you do with GPT-4 Turbo as described in the video?",
"options": [
"A. Enter 32K context length.",
"B. Ask about news happening in June 2023.",
"C. Ask 6 different people to read the text aloud to you.",
"D. Using xlsx mode."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can you do with GPT-4 Turbo as described in the video?\nOption:\nA. Enter 32K context length.\nB. Ask about news happening in June 2023.\nC. Ask 6 different people to read the text aloud to you.\nD. Using xlsx mode.\nAnswer with the option's letter from the given choices directly.",
2047,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "683-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2048,
"target": "A",
"doc": {
"video_id": "683",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=U9mJuUkhUzk",
"videoID": "U9mJuUkhUzk",
"question_id": "683-3",
"task_type": "Information Synopsis",
"question": "Which of the following can best summarize the video?",
"options": [
"A. The launch and explaination of GPT-4 turbo and GPTs.",
"B. Announcing the establishment of OpenAI.",
"C. Explaination of the usage of already launched modules.",
"D. GPTs and some relevant API."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following can best summarize the video?\nOption:\nA. The launch and explaination of GPT-4 turbo and GPTs.\nB. Announcing the establishment of OpenAI.\nC. Explaination of the usage of already launched modules.\nD. GPTs and some relevant API.\nAnswer with the option's letter from the given choices directly.",
2048,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "683-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2049,
"target": "C",
"doc": {
"video_id": "684",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=NCNeFSJexSY",
"videoID": "NCNeFSJexSY",
"question_id": "684-1",
"task_type": "Information Synopsis",
"question": "What does this video about?",
"options": [
"A. Video producers on their experiences with Vision Pro.",
"B. Video interview with Zucks.",
"C. Reviewing Zucks Review of the Apple Vision Pro.",
"D. Apple Vision Pro product sale."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video about?\nOption:\nA. Video producers on their experiences with Vision Pro.\nB. Video interview with Zucks.\nC. Reviewing Zucks Review of the Apple Vision Pro.\nD. Apple Vision Pro product sale.\nAnswer with the option's letter from the given choices directly.",
2049,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "684-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2050,
"target": "A",
"doc": {
"video_id": "684",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=NCNeFSJexSY",
"videoID": "NCNeFSJexSY",
"question_id": "684-2",
"task_type": "Spatial Perception",
"question": "Which direction is the narrator in the red facing in relation to the narrator in green?",
"options": [
"A. Right front.",
"B. Left front.",
"C. Directly in front.",
"D. It is impossible to extrapolate."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which direction is the narrator in the red facing in relation to the narrator in green?\nOption:\nA. Right front.\nB. Left front.\nC. Directly in front.\nD. It is impossible to extrapolate.\nAnswer with the option's letter from the given choices directly.",
2050,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "684-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2051,
"target": "A",
"doc": {
"video_id": "684",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=NCNeFSJexSY",
"videoID": "NCNeFSJexSY",
"question_id": "684-3",
"task_type": "Action Reasoning",
"question": "How were the recurring short video clips captured in the video?",
"options": [
"A. Using Vision Pro.",
"B. Using a cell phone.",
"C. Using a video camera.",
"D. Not determinable."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How were the recurring short video clips captured in the video?\nOption:\nA. Using Vision Pro.\nB. Using a cell phone.\nC. Using a video camera.\nD. Not determinable.\nAnswer with the option's letter from the given choices directly.",
2051,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "684-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2052,
"target": "C",
"doc": {
"video_id": "685",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=nVtb2vNUOdU",
"videoID": "nVtb2vNUOdU",
"question_id": "685-1",
"task_type": "Information Synopsis",
"question": "What's the main thing this video documents?",
"options": [
"A. The history of NVIDIA's founding.",
"B. NVIDIA's Core Technology Report.",
"C. The Meteoric Rise of Nvidia.",
"D. NVIDIA's IPO process."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the main thing this video documents?\nOption:\nA. The history of NVIDIA's founding.\nB. NVIDIA's Core Technology Report.\nC. The Meteoric Rise of Nvidia.\nD. NVIDIA's IPO process.\nAnswer with the option's letter from the given choices directly.",
2052,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "685-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2053,
"target": "D",
"doc": {
"video_id": "685",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=nVtb2vNUOdU",
"videoID": "nVtb2vNUOdU",
"question_id": "685-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct order of the video to explain the following events?\n(a) History.\n(b) Alexnet.\n(c) CUDA.\n(d) Team Management.",
"options": [
"A. (b)(c)(a)(d).",
"B. (b)(c)(d)(a).",
"C. (d)(c)(b)(a).",
"D. (a)(c)(b)(d)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order of the video to explain the following events?\n(a) History.\n(b) Alexnet.\n(c) CUDA.\n(d) Team Management.\nOption:\nA. (b)(c)(a)(d).\nB. (b)(c)(d)(a).\nC. (d)(c)(b)(a).\nD. (a)(c)(b)(d).\nAnswer with the option's letter from the given choices directly.",
2053,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "685-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2054,
"target": "A",
"doc": {
"video_id": "685",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=nVtb2vNUOdU",
"videoID": "nVtb2vNUOdU",
"question_id": "685-3",
"task_type": "Action Recognition",
"question": "What type of video game was not mentioned in the introduction of the NVIDIA V1?",
"options": [
"A. Basketball Game.",
"B. Racing Game.",
"C. Boxing game.",
"D. Shooting Game."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of video game was not mentioned in the introduction of the NVIDIA V1?\nOption:\nA. Basketball Game.\nB. Racing Game.\nC. Boxing game.\nD. Shooting Game.\nAnswer with the option's letter from the given choices directly.",
2054,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "685-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2055,
"target": "B",
"doc": {
"video_id": "686",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=8IZvEF4Ui10",
"videoID": "8IZvEF4Ui10",
"question_id": "686-1",
"task_type": "Object Reasoning",
"question": "Which component in the chassis has the most distinct color from the others?",
"options": [
"A. Graphics card.",
"B. Power supply.",
"C. Motherboard.",
"D. Memory Stick."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which component in the chassis has the most distinct color from the others?\nOption:\nA. Graphics card.\nB. Power supply.\nC. Motherboard.\nD. Memory Stick.\nAnswer with the option's letter from the given choices directly.",
2055,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "686-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2056,
"target": "D",
"doc": {
"video_id": "686",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=8IZvEF4Ui10",
"videoID": "8IZvEF4Ui10",
"question_id": "686-2",
"task_type": "Temporal Reasoning",
"question": "What is the installation order for the following components in the video?\n(a) Motherboard.\n(b) Water cooling.\n(c) Power supply.\n(d) Bus.",
"options": [
"A. (c)(a)(d)(b).",
"B. (b)(a)(d)(c).",
"C. (d)(c)(b)(a).",
"D. (a)(c)(d)(b)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the installation order for the following components in the video?\n(a) Motherboard.\n(b) Water cooling.\n(c) Power supply.\n(d) Bus.\nOption:\nA. (c)(a)(d)(b).\nB. (b)(a)(d)(c).\nC. (d)(c)(b)(a).\nD. (a)(c)(d)(b).\nAnswer with the option's letter from the given choices directly.",
2056,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "686-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2057,
"target": "A",
"doc": {
"video_id": "686",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=8IZvEF4Ui10",
"videoID": "8IZvEF4Ui10",
"question_id": "686-3",
"task_type": "Action Reasoning",
"question": "Why is the author playing a video game at the end of the video?",
"options": [
"A. To evaluate computer performance.",
"B. As part of a video game advertisement placement.",
"C. The author is engaging in live gaming.",
"D. To produce a new game review."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is the author playing a video game at the end of the video?\nOption:\nA. To evaluate computer performance.\nB. As part of a video game advertisement placement.\nC. The author is engaging in live gaming.\nD. To produce a new game review.\nAnswer with the option's letter from the given choices directly.",
2057,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "686-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2058,
"target": "A",
"doc": {
"video_id": "687",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=yGiM3kD12cU",
"videoID": "yGiM3kD12cU",
"question_id": "687-1",
"task_type": "Attribute Perception",
"question": "Which car was the first to be designed with an automobile roof?",
"options": [
"A. Type 44.",
"B. Type 46.",
"C. Type 41.",
"D. Type 55."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which car was the first to be designed with an automobile roof?\nOption:\nA. Type 44.\nB. Type 46.\nC. Type 41.\nD. Type 55.\nAnswer with the option's letter from the given choices directly.",
2058,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "687-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2059,
"target": "B",
"doc": {
"video_id": "687",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=yGiM3kD12cU",
"videoID": "yGiM3kD12cU",
"question_id": "687-2",
"task_type": "Temporal Perception",
"question": "What type of Bugatti does the video devote the most space to?",
"options": [
"A. Bugatti Veyron.",
"B. Bugatti Chiron.",
"C. Type 57.",
"D. Bugatti EB."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of Bugatti does the video devote the most space to?\nOption:\nA. Bugatti Veyron.\nB. Bugatti Chiron.\nC. Type 57.\nD. Bugatti EB.\nAnswer with the option's letter from the given choices directly.",
2059,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "687-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2060,
"target": "D",
"doc": {
"video_id": "687",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=yGiM3kD12cU",
"videoID": "yGiM3kD12cU",
"question_id": "687-3",
"task_type": "Counting Problem",
"question": "According to the video, how many cars have broken 300 mph top speed?",
"options": [
"A. 0.",
"B. 2.",
"C. 3.",
"D. 1."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many cars have broken 300 mph top speed?\nOption:\nA. 0.\nB. 2.\nC. 3.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
2060,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "687-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2061,
"target": "B",
"doc": {
"video_id": "688",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=j0J-favyUeQ",
"videoID": "j0J-favyUeQ",
"question_id": "688-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. How mainstream apps work.",
"B. How some tech products work.",
"C. Technology products in music.",
"D. Strong tech company."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. How mainstream apps work.\nB. How some tech products work.\nC. Technology products in music.\nD. Strong tech company.\nAnswer with the option's letter from the given choices directly.",
2061,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "688-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2062,
"target": "D",
"doc": {
"video_id": "688",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=j0J-favyUeQ",
"videoID": "j0J-favyUeQ",
"question_id": "688-2",
"task_type": "Object Reasoning",
"question": "Which tech product in the video is not music-related?",
"options": [
"A. Spotify.",
"B. Shazam.",
"C. Bose.",
"D. MSG Shpere."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which tech product in the video is not music-related?\nOption:\nA. Spotify.\nB. Shazam.\nC. Bose.\nD. MSG Shpere.\nAnswer with the option's letter from the given choices directly.",
2062,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "688-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2063,
"target": "A",
"doc": {
"video_id": "688",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=j0J-favyUeQ",
"videoID": "j0J-favyUeQ",
"question_id": "688-3",
"task_type": "Object Reasoning",
"question": "What tech product presented in the video is not a combination of software and hardware?",
"options": [
"A. Spotify.",
"B. MSG Shpere.",
"C. LED wristbands.",
"D. Tap-to-pay."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What tech product presented in the video is not a combination of software and hardware?\nOption:\nA. Spotify.\nB. MSG Shpere.\nC. LED wristbands.\nD. Tap-to-pay.\nAnswer with the option's letter from the given choices directly.",
2063,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "688-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2064,
"target": "D",
"doc": {
"video_id": "689",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=KTjeh5QPL0o",
"videoID": "KTjeh5QPL0o",
"question_id": "689-1",
"task_type": "Information Synopsis",
"question": "What does this video tell?",
"options": [
"A. The process of building a starship.",
"B. Why Starship is the holy grail for SpaceX.",
"C. Why Starlink is crucial to SpaceX's success.",
"D. How SpaceX could Win The Space Race."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video tell?\nOption:\nA. The process of building a starship.\nB. Why Starship is the holy grail for SpaceX.\nC. Why Starlink is crucial to SpaceX's success.\nD. How SpaceX could Win The Space Race.\nAnswer with the option's letter from the given choices directly.",
2064,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "689-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2065,
"target": "B",
"doc": {
"video_id": "689",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=KTjeh5QPL0o",
"videoID": "KTjeh5QPL0o",
"question_id": "689-2",
"task_type": "Object Reasoning",
"question": "What is the role of the man who wear a T-shirt with the words \"BLUE GHOST\" on it?",
"options": [
"A. Executive.",
"B. Engineer.",
"C. Journalist.",
"D. Visitor."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the man who wear a T-shirt with the words \"BLUE GHOST\" on it?\nOption:\nA. Executive.\nB. Engineer.\nC. Journalist.\nD. Visitor.\nAnswer with the option's letter from the given choices directly.",
2065,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "689-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2066,
"target": "C",
"doc": {
"video_id": "689",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=KTjeh5QPL0o",
"videoID": "KTjeh5QPL0o",
"question_id": "689-3",
"task_type": "Temporal Reasoning",
"question": "In which section of the video is the rocket recovery told?",
"options": [
"A. The part \"Why Starlink is crucial to SpaceX's success\".",
"B. The part \"Why Starship is the holy grail for SpaceX\".",
"C. The part \"How ex-SpaceX engineers are fueling the space race with Firefly\".",
"D. It is impossible to extrapolate."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which section of the video is the rocket recovery told?\nOption:\nA. The part \"Why Starlink is crucial to SpaceX's success\".\nB. The part \"Why Starship is the holy grail for SpaceX\".\nC. The part \"How ex-SpaceX engineers are fueling the space race with Firefly\".\nD. It is impossible to extrapolate.\nAnswer with the option's letter from the given choices directly.",
2066,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "689-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2067,
"target": "C",
"doc": {
"video_id": "690",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=b0HfmY64eSE",
"videoID": "b0HfmY64eSE",
"question_id": "690-1",
"task_type": "Object Reasoning",
"question": "What's a video writer's least favorite product?",
"options": [
"A. Cell phone holder.",
"B. Rotating LED lights.",
"C. Headphones that can open bottle caps.",
"D. Tracking luggage."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's a video writer's least favorite product?\nOption:\nA. Cell phone holder.\nB. Rotating LED lights.\nC. Headphones that can open bottle caps.\nD. Tracking luggage.\nAnswer with the option's letter from the given choices directly.",
2067,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "690-1",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2068,
"target": "A",
"doc": {
"video_id": "690",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=b0HfmY64eSE",
"videoID": "b0HfmY64eSE",
"question_id": "690-2",
"task_type": "Object Reasoning",
"question": "What factors may not be pertinent to an author's assessment of a technology product?",
"options": [
"A. Price.",
"B. Futuristic.",
"C. Degree of practicality.",
"D. Appearance."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What factors may not be pertinent to an author's assessment of a technology product?\nOption:\nA. Price.\nB. Futuristic.\nC. Degree of practicality.\nD. Appearance.\nAnswer with the option's letter from the given choices directly.",
2068,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "690-2",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2069,
"target": "A",
"doc": {
"video_id": "690",
"duration": "long",
"domain": "Knowledge",
"sub_category": "Technology",
"url": "https://www.youtube.com/watch?v=b0HfmY64eSE",
"videoID": "b0HfmY64eSE",
"question_id": "690-3",
"task_type": "Counting Problem",
"question": "How many audio-like products are shown in the video?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 5."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many audio-like products are shown in the video?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
2069,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "690-3",
"duration": "long",
"category": "Knowledge",
"sub_category": "Technology",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2070,
"target": "A",
"doc": {
"video_id": "691",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=lMxFbRc3Luk",
"videoID": "lMxFbRc3Luk",
"question_id": "691-1",
"task_type": "Action Reasoning",
"question": "As depicted in the video, why is the teacher still in the museum after the security alarm?",
"options": [
"A. She wants to steal the crown.",
"B. She checks the security.",
"C. She comes to find her students.",
"D. She has a talk with the girl and the boy."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, why is the teacher still in the museum after the security alarm?\nOption:\nA. She wants to steal the crown.\nB. She checks the security.\nC. She comes to find her students.\nD. She has a talk with the girl and the boy.\nAnswer with the option's letter from the given choices directly.",
2070,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "691-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2071,
"target": "D",
"doc": {
"video_id": "691",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=lMxFbRc3Luk",
"videoID": "lMxFbRc3Luk",
"question_id": "691-2",
"task_type": "Action Reasoning",
"question": "What is the relationship between Dante and his family according to what is shown in the video?",
"options": [
"A. They help each other.",
"B. His family understands him.",
"C. He hates his family.",
"D. His family doesn't support him."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between Dante and his family according to what is shown in the video?\nOption:\nA. They help each other.\nB. His family understands him.\nC. He hates his family.\nD. His family doesn't support him.\nAnswer with the option's letter from the given choices directly.",
2071,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "691-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2072,
"target": "D",
"doc": {
"video_id": "691",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=lMxFbRc3Luk",
"videoID": "lMxFbRc3Luk",
"question_id": "691-3",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, what is the function of the necklace that the girl stole from the museum?",
"options": [
"A. Generate fire.",
"B. Make people invisible.",
"C. Generate electricity.",
"D. Seize superpowers from others."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what is the function of the necklace that the girl stole from the museum?\nOption:\nA. Generate fire.\nB. Make people invisible.\nC. Generate electricity.\nD. Seize superpowers from others.\nAnswer with the option's letter from the given choices directly.",
2072,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "691-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2073,
"target": "B",
"doc": {
"video_id": "692",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=TF9I1GxNdJQ",
"videoID": "TF9I1GxNdJQ",
"question_id": "692-1",
"task_type": "Action Recognition",
"question": "What happens to the computer in the video?",
"options": [
"A. It is controlled by others.",
"B. Its software is broken.",
"C. Its hardware is broken.",
"D. It generates virus."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens to the computer in the video?\nOption:\nA. It is controlled by others.\nB. Its software is broken.\nC. Its hardware is broken.\nD. It generates virus.\nAnswer with the option's letter from the given choices directly.",
2073,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "692-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2074,
"target": "B",
"doc": {
"video_id": "692",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=TF9I1GxNdJQ",
"videoID": "TF9I1GxNdJQ",
"question_id": "692-2",
"task_type": "Action Reasoning",
"question": "What is the relationship between the black stickman and the red stickman in the video?",
"options": [
"A. Friends.",
"B. Initially friends, but then become enemies.",
"C. Enemies.",
"D. They don't know each other."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the black stickman and the red stickman in the video?\nOption:\nA. Friends.\nB. Initially friends, but then become enemies.\nC. Enemies.\nD. They don't know each other.\nAnswer with the option's letter from the given choices directly.",
2074,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "692-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2075,
"target": "D",
"doc": {
"video_id": "692",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=TF9I1GxNdJQ",
"videoID": "TF9I1GxNdJQ",
"question_id": "692-3",
"task_type": "Information Synopsis",
"question": "In accordance with the video footage, what does the story entail?",
"options": [
"A. People's computers are intruded by viruses.",
"B. Computer virus protection.",
"C. Stickmen with Gongfu performance.",
"D. Fight with virus maker."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, what does the story entail?\nOption:\nA. People's computers are intruded by viruses.\nB. Computer virus protection.\nC. Stickmen with Gongfu performance.\nD. Fight with virus maker.\nAnswer with the option's letter from the given choices directly.",
2075,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "692-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2076,
"target": "C",
"doc": {
"video_id": "693",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=dXUXfyiX2lA",
"videoID": "dXUXfyiX2lA",
"question_id": "693-1",
"task_type": "Action Recognition",
"question": "What do the people in the village do according to the video?",
"options": [
"A. Build their village.",
"B. Rob something from people in the sky.",
"C. Protect their village from robbers.",
"D. Participate in the war."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the people in the village do according to the video?\nOption:\nA. Build their village.\nB. Rob something from people in the sky.\nC. Protect their village from robbers.\nD. Participate in the war.\nAnswer with the option's letter from the given choices directly.",
2076,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "693-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2077,
"target": "A",
"doc": {
"video_id": "693",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=dXUXfyiX2lA",
"videoID": "dXUXfyiX2lA",
"question_id": "693-2",
"task_type": "Action Reasoning",
"question": "Based on the video, why do the two people in the moving giant castle get out?",
"options": [
"A. To get more energy for the giant castle.",
"B. To argue with each other.",
"C. To fight with enemies.",
"D. To have a break."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, why do the two people in the moving giant castle get out?\nOption:\nA. To get more energy for the giant castle.\nB. To argue with each other.\nC. To fight with enemies.\nD. To have a break.\nAnswer with the option's letter from the given choices directly.",
2077,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "693-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2078,
"target": "C",
"doc": {
"video_id": "693",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=dXUXfyiX2lA",
"videoID": "dXUXfyiX2lA",
"question_id": "693-3",
"task_type": "Information Synopsis",
"question": "As depicted in the video, what is the final story's theme?",
"options": [
"A. A man fights against zombies.",
"B. A man saves his village intruded by zombies.",
"C. A father saves his child.",
"D. A village building story."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what is the final story's theme?\nOption:\nA. A man fights against zombies.\nB. A man saves his village intruded by zombies.\nC. A father saves his child.\nD. A village building story.\nAnswer with the option's letter from the given choices directly.",
2078,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "693-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2079,
"target": "D",
"doc": {
"video_id": "694",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Sp2nxlrQ89w",
"videoID": "Sp2nxlrQ89w",
"question_id": "694-1",
"task_type": "Information Synopsis",
"question": "According to what is shown in the video, what is the stickman animation about?",
"options": [
"A. A king rules the Minecraft world.",
"B. A Minecraft video game played by stickmen.",
"C. War in Minecraft world.",
"D. A father revenges on his son in a Minecraft world."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, what is the stickman animation about?\nOption:\nA. A king rules the Minecraft world.\nB. A Minecraft video game played by stickmen.\nC. War in Minecraft world.\nD. A father revenges on his son in a Minecraft world.\nAnswer with the option's letter from the given choices directly.",
2079,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "694-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2080,
"target": "C",
"doc": {
"video_id": "694",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Sp2nxlrQ89w",
"videoID": "Sp2nxlrQ89w",
"question_id": "694-2",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, why does the orange stickman want to destroy the Minecraft world?",
"options": [
"A. He wants to save his son.",
"B. He is too sad.",
"C. He loses his son.",
"D. He does like the world."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, why does the orange stickman want to destroy the Minecraft world?\nOption:\nA. He wants to save his son.\nB. He is too sad.\nC. He loses his son.\nD. He does like the world.\nAnswer with the option's letter from the given choices directly.",
2080,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "694-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2081,
"target": "A",
"doc": {
"video_id": "694",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Sp2nxlrQ89w",
"videoID": "Sp2nxlrQ89w",
"question_id": "694-3",
"task_type": "Action Reasoning",
"question": "In accordance with the video footage, why does the orange stick man stop destroying?",
"options": [
"A. He realizes he has hurt his son.",
"B. He has no energy.",
"C. He is too sad to do this.",
"D. He forgives people in the world."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, why does the orange stick man stop destroying?\nOption:\nA. He realizes he has hurt his son.\nB. He has no energy.\nC. He is too sad to do this.\nD. He forgives people in the world.\nAnswer with the option's letter from the given choices directly.",
2081,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "694-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2082,
"target": "C",
"doc": {
"video_id": "695",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=AnWKOvjFl8k",
"videoID": "AnWKOvjFl8k",
"question_id": "695-1",
"task_type": "Action Recognition",
"question": "What does the man with a laughing face do at the beginning of the video?",
"options": [
"A. Destroys the city.",
"B. Fights with Batman.",
"C. Hurt people.",
"D. Hosts a charity event."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the man with a laughing face do at the beginning of the video?\nOption:\nA. Destroys the city.\nB. Fights with Batman.\nC. Hurt people.\nD. Hosts a charity event.\nAnswer with the option's letter from the given choices directly.",
2082,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "695-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2083,
"target": "B",
"doc": {
"video_id": "695",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=AnWKOvjFl8k",
"videoID": "AnWKOvjFl8k",
"question_id": "695-2",
"task_type": "Action Recognition",
"question": "How do Joker and Harley escape when their car is caught by Batman's car in the video?",
"options": [
"A. Run to the hill.",
"B. Split their car.",
"C. Hide in the grass.",
"D. Make their car fly."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do Joker and Harley escape when their car is caught by Batman's car in the video?\nOption:\nA. Run to the hill.\nB. Split their car.\nC. Hide in the grass.\nD. Make their car fly.\nAnswer with the option's letter from the given choices directly.",
2083,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "695-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2084,
"target": "B",
"doc": {
"video_id": "695",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=AnWKOvjFl8k",
"videoID": "AnWKOvjFl8k",
"question_id": "695-3",
"task_type": "Information Synopsis",
"question": "What does the video entail?",
"options": [
"A. Joker's trick.",
"B. Battle between Batman and Joker.",
"C. How Joker commits crimes.",
"D. The thing Joker likes to do."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video entail?\nOption:\nA. Joker's trick.\nB. Battle between Batman and Joker.\nC. How Joker commits crimes.\nD. The thing Joker likes to do.\nAnswer with the option's letter from the given choices directly.",
2084,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "695-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2085,
"target": "D",
"doc": {
"video_id": "696",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I4yye8mUzWg",
"videoID": "I4yye8mUzWg",
"question_id": "696-1",
"task_type": "Action Recognition",
"question": "What is the power of the song according to the video?",
"options": [
"A. Inspires positive emotions.",
"B. Bestows gifts upon listeners.",
"C. Creates excitement in people.",
"D. Transports listeners to another realm."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the power of the song according to the video?\nOption:\nA. Inspires positive emotions.\nB. Bestows gifts upon listeners.\nC. Creates excitement in people.\nD. Transports listeners to another realm.\nAnswer with the option's letter from the given choices directly.",
2085,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "696-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2086,
"target": "D",
"doc": {
"video_id": "696",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I4yye8mUzWg",
"videoID": "I4yye8mUzWg",
"question_id": "696-2",
"task_type": "Action Reasoning",
"question": "What do heroes of legend use to defeat the enemy based on the video?",
"options": [
"A. Their wisdom.",
"B. A big robot.",
"C. Their superpower.",
"D. Power of music."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do heroes of legend use to defeat the enemy based on the video?\nOption:\nA. Their wisdom.\nB. A big robot.\nC. Their superpower.\nD. Power of music.\nAnswer with the option's letter from the given choices directly.",
2086,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "696-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2087,
"target": "A",
"doc": {
"video_id": "696",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I4yye8mUzWg",
"videoID": "I4yye8mUzWg",
"question_id": "696-3",
"task_type": "Information Synopsis",
"question": "What is the topic of this video?",
"options": [
"A. Music.",
"B. Friendship.",
"C. Power.",
"D. Multi-dimension."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of this video?\nOption:\nA. Music.\nB. Friendship.\nC. Power.\nD. Multi-dimension.\nAnswer with the option's letter from the given choices directly.",
2087,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "696-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2088,
"target": "A",
"doc": {
"video_id": "697",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Q8AZ16uBhr8",
"videoID": "Q8AZ16uBhr8",
"question_id": "697-1",
"task_type": "Action Reasoning",
"question": "As depicted in the video, how is the relationship between the rabbit and human?",
"options": [
"A. Hostile.",
"B. Friend.",
"C. Cooperator.",
"D. No one is correct above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, how is the relationship between the rabbit and human?\nOption:\nA. Hostile.\nB. Friend.\nC. Cooperator.\nD. No one is correct above.\nAnswer with the option's letter from the given choices directly.",
2088,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "697-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2089,
"target": "B",
"doc": {
"video_id": "697",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Q8AZ16uBhr8",
"videoID": "Q8AZ16uBhr8",
"question_id": "697-2",
"task_type": "Action Reasoning",
"question": "What is the impression of the video?",
"options": [
"A. Sad.",
"B. Funny.",
"C. Horrible.",
"D. Silent."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the impression of the video?\nOption:\nA. Sad.\nB. Funny.\nC. Horrible.\nD. Silent.\nAnswer with the option's letter from the given choices directly.",
2089,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "697-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2090,
"target": "C",
"doc": {
"video_id": "697",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=Q8AZ16uBhr8",
"videoID": "Q8AZ16uBhr8",
"question_id": "697-3",
"task_type": "Information Synopsis",
"question": "What is the subject of the video?",
"options": [
"A. Rabbit likes to eat carrots.",
"B. How to raise a rabbit.",
"C. A rabbit gives people trouble.",
"D. A rabbit performs for food."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject of the video?\nOption:\nA. Rabbit likes to eat carrots.\nB. How to raise a rabbit.\nC. A rabbit gives people trouble.\nD. A rabbit performs for food.\nAnswer with the option's letter from the given choices directly.",
2090,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "697-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2091,
"target": "B",
"doc": {
"video_id": "698",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=KIg4rprmO9Y",
"videoID": "KIg4rprmO9Y",
"question_id": "698-1",
"task_type": "Object Recognition",
"question": "In the meeting scene at the beginning of the video, who didn't quit the Justice League?",
"options": [
"A. Oliver.",
"B. Aquaman.",
"C. Jefferson.",
"D. Batman."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the meeting scene at the beginning of the video, who didn't quit the Justice League?\nOption:\nA. Oliver.\nB. Aquaman.\nC. Jefferson.\nD. Batman.\nAnswer with the option's letter from the given choices directly.",
2091,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "698-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2092,
"target": "C",
"doc": {
"video_id": "698",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=KIg4rprmO9Y",
"videoID": "KIg4rprmO9Y",
"question_id": "698-2",
"task_type": "Object Reasoning",
"question": "According to the video, how was Victor ultimately healed and able to control his new abilities?",
"options": [
"A. By receiving medical treatment from S.T.A.R. Labs doctors.",
"B. By using advanced Earth technology developed by his father.",
"C. By receiving help from Halo, who used her healing abilities.",
"D. By training intensely and mastering his new mechanical body."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how was Victor ultimately healed and able to control his new abilities?\nOption:\nA. By receiving medical treatment from S.T.A.R. Labs doctors.\nB. By using advanced Earth technology developed by his father.\nC. By receiving help from Halo, who used her healing abilities.\nD. By training intensely and mastering his new mechanical body.\nAnswer with the option's letter from the given choices directly.",
2092,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "698-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2093,
"target": "C",
"doc": {
"video_id": "698",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=KIg4rprmO9Y",
"videoID": "KIg4rprmO9Y",
"question_id": "698-3",
"task_type": "Counting Problem",
"question": "How many times do news segments appear in this video?",
"options": [
"A. 2.",
"B. 4.",
"C. 6.",
"D. 8."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times do news segments appear in this video?\nOption:\nA. 2.\nB. 4.\nC. 6.\nD. 8.\nAnswer with the option's letter from the given choices directly.",
2093,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "698-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2094,
"target": "C",
"doc": {
"video_id": "699",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I2dv1jNM7gM",
"videoID": "I2dv1jNM7gM",
"question_id": "699-1",
"task_type": "Action Recognition",
"question": "In line with the video evidence, what does the team do in the first story?",
"options": [
"A. They are fighting with each other.",
"B. They are watching a movie.",
"C. They are filming.",
"D. They are saving a mermaid."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what does the team do in the first story?\nOption:\nA. They are fighting with each other.\nB. They are watching a movie.\nC. They are filming.\nD. They are saving a mermaid.\nAnswer with the option's letter from the given choices directly.",
2094,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "699-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2095,
"target": "A",
"doc": {
"video_id": "699",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I2dv1jNM7gM",
"videoID": "I2dv1jNM7gM",
"question_id": "699-2",
"task_type": "Object Recognition",
"question": "In accordance with the video footage, who protects the mermaid?",
"options": [
"A. The shark.",
"B. The dog.",
"C. The panda.",
"D. Herself."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, who protects the mermaid?\nOption:\nA. The shark.\nB. The dog.\nC. The panda.\nD. Herself.\nAnswer with the option's letter from the given choices directly.",
2095,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "699-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2096,
"target": "B",
"doc": {
"video_id": "699",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=I2dv1jNM7gM",
"videoID": "I2dv1jNM7gM",
"question_id": "699-3",
"task_type": "Action Recognition",
"question": "What happens when the figures in the video are electrocuted?",
"options": [
"A. They become abnormal.",
"B. They lose their memory.",
"C. They become friendly.",
"D. They yearn for a family."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when the figures in the video are electrocuted?\nOption:\nA. They become abnormal.\nB. They lose their memory.\nC. They become friendly.\nD. They yearn for a family.\nAnswer with the option's letter from the given choices directly.",
2096,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "699-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2097,
"target": "D",
"doc": {
"video_id": "700",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=1NYQf_OXDqI",
"videoID": "1NYQf_OXDqI",
"question_id": "700-1",
"task_type": "Action Reasoning",
"question": "Based on the video, what is the relationship of Tom and Jerry when they meet the witch?",
"options": [
"A. Competition.",
"B. Cooperation.",
"C. Hostile.",
"D. No one is correct above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what is the relationship of Tom and Jerry when they meet the witch?\nOption:\nA. Competition.\nB. Cooperation.\nC. Hostile.\nD. No one is correct above.\nAnswer with the option's letter from the given choices directly.",
2097,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "700-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2098,
"target": "C",
"doc": {
"video_id": "700",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=1NYQf_OXDqI",
"videoID": "1NYQf_OXDqI",
"question_id": "700-2",
"task_type": "Action Reasoning",
"question": "How do Tom and Jerry share a similarity as depicted in the video?",
"options": [
"A. Sleeping.",
"B. Liking eating cheese.",
"C. Protecting Dorothy.",
"D. Liking eating chicken."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do Tom and Jerry share a similarity as depicted in the video?\nOption:\nA. Sleeping.\nB. Liking eating cheese.\nC. Protecting Dorothy.\nD. Liking eating chicken.\nAnswer with the option's letter from the given choices directly.",
2098,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "700-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2099,
"target": "A",
"doc": {
"video_id": "700",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Animation",
"url": "https://www.youtube.com/watch?v=1NYQf_OXDqI",
"videoID": "1NYQf_OXDqI",
"question_id": "700-3",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, what is the weakness of the villain?",
"options": [
"A. Water.",
"B. Dark.",
"C. Overconfidence.",
"D. Attachment."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what is the weakness of the villain?\nOption:\nA. Water.\nB. Dark.\nC. Overconfidence.\nD. Attachment.\nAnswer with the option's letter from the given choices directly.",
2099,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "700-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Animation",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2100,
"target": "A",
"doc": {
"video_id": "701",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=-dfvdKf-KR0",
"videoID": "-dfvdKf-KR0",
"question_id": "701-1",
"task_type": "Action Reasoning",
"question": "Why did Jessica Pearson send Harvey a text message saying \"I need you\" in the first six minutes of the video?",
"options": [
"A. Because Jessica is unable to handle the dispute with Gerard Tate.",
"B. Because Cooper signed the agreement.",
"C. Because Gerald has been maliciously trading.",
"D. Because Cooper will not continue to serve as Honorary Vice President."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did Jessica Pearson send Harvey a text message saying \"I need you\" in the first six minutes of the video?\nOption:\nA. Because Jessica is unable to handle the dispute with Gerard Tate.\nB. Because Cooper signed the agreement.\nC. Because Gerald has been maliciously trading.\nD. Because Cooper will not continue to serve as Honorary Vice President.\nAnswer with the option's letter from the given choices directly.",
2100,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "701-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2101,
"target": "C",
"doc": {
"video_id": "701",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=-dfvdKf-KR0",
"videoID": "-dfvdKf-KR0",
"question_id": "701-2",
"task_type": "Action Reasoning",
"question": "Between the 6th and 13th minutes of the video, why does Jessica Pearson specifically invite Mike Ross to have dinner together?",
"options": [
"A. Because Jessica Pearson wants to know what makes Mike Ross special.",
"B. Because Mike Ross has never attended Harvard or any other law school.",
"C. Because Jessica Pearson wants to understand Harvey through Mike Ross.",
"D. Because Jessica Pearson wants to fire Mike Ross."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Between the 6th and 13th minutes of the video, why does Jessica Pearson specifically invite Mike Ross to have dinner together?\nOption:\nA. Because Jessica Pearson wants to know what makes Mike Ross special.\nB. Because Mike Ross has never attended Harvard or any other law school.\nC. Because Jessica Pearson wants to understand Harvey through Mike Ross.\nD. Because Jessica Pearson wants to fire Mike Ross.\nAnswer with the option's letter from the given choices directly.",
2101,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "701-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2102,
"target": "B",
"doc": {
"video_id": "701",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=-dfvdKf-KR0",
"videoID": "-dfvdKf-KR0",
"question_id": "701-3",
"task_type": "Action Reasoning",
"question": "During the 39th to 43rd minute of the film, why does Mike Ross give Nathan a check?",
"options": [
"A. Because Mike Ross wants to collaborate with Nathan.",
"B. Because Mike Ross wants to return to Harvey's team.",
"C. Because Mike Ross hopes to get an opportunity to become a lawyer.",
"D. Because Mike Ross hopes to help many people with this check."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: During the 39th to 43rd minute of the film, why does Mike Ross give Nathan a check?\nOption:\nA. Because Mike Ross wants to collaborate with Nathan.\nB. Because Mike Ross wants to return to Harvey's team.\nC. Because Mike Ross hopes to get an opportunity to become a lawyer.\nD. Because Mike Ross hopes to help many people with this check.\nAnswer with the option's letter from the given choices directly.",
2102,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "701-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2103,
"target": "D",
"doc": {
"video_id": "702",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=p84O3JAp_IM",
"videoID": "p84O3JAp_IM",
"question_id": "702-1",
"task_type": "Action Recognition",
"question": "In the latter part of the video, how did the man, who was wearing a bandage and holding an envelope, sustain his injury?",
"options": [
"A. One of his hands was hit by a firework while he was setting it off.",
"B. His arms got injured while he was attempting to put out the fire at a burning house.",
"C. His hands were injured from falling down to the ground while he was chasing Wayne's motorcycle.",
"D. One of his arms was dragged down by a dog lured with food by Wayne, while he was insulting Wayne's father."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, how did the man, who was wearing a bandage and holding an envelope, sustain his injury?\nOption:\nA. One of his hands was hit by a firework while he was setting it off.\nB. His arms got injured while he was attempting to put out the fire at a burning house.\nC. His hands were injured from falling down to the ground while he was chasing Wayne's motorcycle.\nD. One of his arms was dragged down by a dog lured with food by Wayne, while he was insulting Wayne's father.\nAnswer with the option's letter from the given choices directly.",
2103,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "702-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2104,
"target": "B",
"doc": {
"video_id": "702",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=p84O3JAp_IM",
"videoID": "p84O3JAp_IM",
"question_id": "702-2",
"task_type": "Action Reasoning",
"question": "Why does the man in the suit and tie caution Wayne against following in his family's footsteps early in the video?",
"options": [
"A. Because he dislikes kids who fight.",
"B. Because he wants Wayne to learn to deal with conflicts in an appropriate manner.",
"C. Because he doesn't want people who do wrong to escape punishment.",
"D. Because Wayne's father was not a good man."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the man in the suit and tie caution Wayne against following in his family's footsteps early in the video?\nOption:\nA. Because he dislikes kids who fight.\nB. Because he wants Wayne to learn to deal with conflicts in an appropriate manner.\nC. Because he doesn't want people who do wrong to escape punishment.\nD. Because Wayne's father was not a good man.\nAnswer with the option's letter from the given choices directly.",
2104,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "702-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2105,
"target": "C",
"doc": {
"video_id": "702",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=p84O3JAp_IM",
"videoID": "p84O3JAp_IM",
"question_id": "702-3",
"task_type": "Object Reasoning",
"question": "In the early part of the video, the man whose window was smashed by Wayne, and the woman wearing a pattern with white triangles, who is chatting with Wayne's father in the middle part of the video, what is their relationship?",
"options": [
"A. Siblings.",
"B. Friends.",
"C. Ex-Couple.",
"D. Couple."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the early part of the video, the man whose window was smashed by Wayne, and the woman wearing a pattern with white triangles, who is chatting with Wayne's father in the middle part of the video, what is their relationship?\nOption:\nA. Siblings.\nB. Friends.\nC. Ex-Couple.\nD. Couple.\nAnswer with the option's letter from the given choices directly.",
2105,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "702-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2106,
"target": "D",
"doc": {
"video_id": "703",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=HP_J8x9lsUg",
"videoID": "HP_J8x9lsUg",
"question_id": "703-1",
"task_type": "Action Reasoning",
"question": "In the first half of the video, why did the big boy wearing a striped shirt refuse the request of the little boy wearing swimming goggles to watch TV together?",
"options": [
"A. Because the big boy wearing a striped shirt is going to watch the meteor shower, he doesn't have time to watch TV with the little boy.",
"B. Because the big boy wearing a striped shirt needs a little boy with swimming goggles to take care of his dog instead of watching TV.",
"C. Because the big boy wearing a striped shirt wants the little boy wearing swimming goggles to help him with his work instead of watching TV.",
"D. Because the big boy wearing a striped shirt wants to see his friends, he doesn't have time to watch TV with the little boy."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the first half of the video, why did the big boy wearing a striped shirt refuse the request of the little boy wearing swimming goggles to watch TV together?\nOption:\nA. Because the big boy wearing a striped shirt is going to watch the meteor shower, he doesn't have time to watch TV with the little boy.\nB. Because the big boy wearing a striped shirt needs a little boy with swimming goggles to take care of his dog instead of watching TV.\nC. Because the big boy wearing a striped shirt wants the little boy wearing swimming goggles to help him with his work instead of watching TV.\nD. Because the big boy wearing a striped shirt wants to see his friends, he doesn't have time to watch TV with the little boy.\nAnswer with the option's letter from the given choices directly.",
2106,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "703-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2107,
"target": "B",
"doc": {
"video_id": "703",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=HP_J8x9lsUg",
"videoID": "HP_J8x9lsUg",
"question_id": "703-2",
"task_type": "Object Reasoning",
"question": "Why does Kevin hug and apologize to the boy named Holden in the middle of the video?",
"options": [
"A. Because Kevin did not greet the boy named Holden in the green hoodie.",
"B. Because the boy named Holden in the green hoodie was once injured while saving Kevin.",
"C. Because Kevin got married.",
"D. Because Kevin once failed to save the boy named Holden in the green hoodie."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Kevin hug and apologize to the boy named Holden in the middle of the video?\nOption:\nA. Because Kevin did not greet the boy named Holden in the green hoodie.\nB. Because the boy named Holden in the green hoodie was once injured while saving Kevin.\nC. Because Kevin got married.\nD. Because Kevin once failed to save the boy named Holden in the green hoodie.\nAnswer with the option's letter from the given choices directly.",
2107,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "703-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2108,
"target": "D",
"doc": {
"video_id": "703",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=HP_J8x9lsUg",
"videoID": "HP_J8x9lsUg",
"question_id": "703-3",
"task_type": "Action Reasoning",
"question": "Why does Kevin show a nervous state when answering the phone in the second half of the video?",
"options": [
"A. Kevin is concerned about the consequences of not completing the task requested by the caller on time.",
"B. Because the presence of the caller poses a threat to the relationship between Kevin and Holden.",
"C. Because the caller will threaten Kevin, Kevin is afraid of this danger.",
"D. Because Kevin doesn't want to tell the people on the phone about Holden's situation, he feels uneasy about the consequences of the lie being exposed."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Kevin show a nervous state when answering the phone in the second half of the video?\nOption:\nA. Kevin is concerned about the consequences of not completing the task requested by the caller on time.\nB. Because the presence of the caller poses a threat to the relationship between Kevin and Holden.\nC. Because the caller will threaten Kevin, Kevin is afraid of this danger.\nD. Because Kevin doesn't want to tell the people on the phone about Holden's situation, he feels uneasy about the consequences of the lie being exposed.\nAnswer with the option's letter from the given choices directly.",
2108,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "703-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2109,
"target": "D",
"doc": {
"video_id": "704",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=HTv4z899xgA",
"videoID": "HTv4z899xgA",
"question_id": "704-1",
"task_type": "Action Reasoning",
"question": "Before Jane spills the nail polish, why does Grayson, who is wearing a black suit and a red tie, sit on the couch with a grim expression?",
"options": [
"A. Because Grayson discovered that a stranger had broken into the house.",
"B. Because Grayson was facing trouble at work.",
"C. Because Grayson couldn't accept the fact that his girlfriend Deb had gained weight.",
"D. Because Grayson's girlfriend Deb died in a car accident."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Before Jane spills the nail polish, why does Grayson, who is wearing a black suit and a red tie, sit on the couch with a grim expression?\nOption:\nA. Because Grayson discovered that a stranger had broken into the house.\nB. Because Grayson was facing trouble at work.\nC. Because Grayson couldn't accept the fact that his girlfriend Deb had gained weight.\nD. Because Grayson's girlfriend Deb died in a car accident.\nAnswer with the option's letter from the given choices directly.",
2109,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "704-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2110,
"target": "D",
"doc": {
"video_id": "704",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=HTv4z899xgA",
"videoID": "HTv4z899xgA",
"question_id": "704-2",
"task_type": "Attribute Perception",
"question": "When Jane saw Grayson in the office, she shed tears. What was Jane's mood at this moment?",
"options": [
"A. Crying with joy.",
"B. Touched.",
"C. Regretful.",
"D. Sad."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When Jane saw Grayson in the office, she shed tears. What was Jane's mood at this moment?\nOption:\nA. Crying with joy.\nB. Touched.\nC. Regretful.\nD. Sad.\nAnswer with the option's letter from the given choices directly.",
2110,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "704-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2111,
"target": "A",
"doc": {
"video_id": "704",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=HTv4z899xgA",
"videoID": "HTv4z899xgA",
"question_id": "704-3",
"task_type": "Action Reasoning",
"question": "Throughout the video, why does the woman named Jane undergo a personality change?",
"options": [
"A. Because Jane's body contains the consciousness of the deceased Deb.",
"B. Because difficulties at work prompt Jane to change herself.",
"C. Because Jane wants to make herself more likable to clients.",
"D. Because Jane misses Deb and wants to pay tribute to Deb."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Throughout the video, why does the woman named Jane undergo a personality change?\nOption:\nA. Because Jane's body contains the consciousness of the deceased Deb.\nB. Because difficulties at work prompt Jane to change herself.\nC. Because Jane wants to make herself more likable to clients.\nD. Because Jane misses Deb and wants to pay tribute to Deb.\nAnswer with the option's letter from the given choices directly.",
2111,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "704-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2112,
"target": "B",
"doc": {
"video_id": "705",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=X6UqTmRAApw",
"videoID": "X6UqTmRAApw",
"question_id": "705-1",
"task_type": "Attribute Perception",
"question": "What is Theresa Woo's profession?",
"options": [
"A. Police Officer.",
"B. Journalist.",
"C. Doctor.",
"D. Lawyer."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is Theresa Woo's profession?\nOption:\nA. Police Officer.\nB. Journalist.\nC. Doctor.\nD. Lawyer.\nAnswer with the option's letter from the given choices directly.",
2112,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "705-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2113,
"target": "D",
"doc": {
"video_id": "705",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=X6UqTmRAApw",
"videoID": "X6UqTmRAApw",
"question_id": "705-2",
"task_type": "Object Reasoning",
"question": "Why was Theresa Woo murdered?",
"options": [
"A. Because Theresa Woo had a boyfriend, a man named Walton became jealous and killed her.",
"B. Because Theresa Woo offended many people in her career, she was retaliated against.",
"C. Because Theresa Woo was the mistress of the husband of a woman named Raine, Raine hated Theresa Woo and killed her.",
"D. Because Theresa Woo was preparing to expose the criminal behavior of a woman named Raine, Raine hated Theresa Woo and killed her."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why was Theresa Woo murdered?\nOption:\nA. Because Theresa Woo had a boyfriend, a man named Walton became jealous and killed her.\nB. Because Theresa Woo offended many people in her career, she was retaliated against.\nC. Because Theresa Woo was the mistress of the husband of a woman named Raine, Raine hated Theresa Woo and killed her.\nD. Because Theresa Woo was preparing to expose the criminal behavior of a woman named Raine, Raine hated Theresa Woo and killed her.\nAnswer with the option's letter from the given choices directly.",
2113,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "705-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2114,
"target": "B",
"doc": {
"video_id": "705",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=X6UqTmRAApw",
"videoID": "X6UqTmRAApw",
"question_id": "705-3",
"task_type": "Action Reasoning",
"question": "In the video, when Claire says \"This means\" to Luke, the man in a suit, what does this statement imply?",
"options": [
"A. It implies Claire has decided to start a romantic relationship with Luke.",
"B. It implies Claire has decided to move in with Luke.",
"C. It implies Claire has decided to have a beer with Luke.",
"D. It implies Claire has decided to celebrate the completion of a work project."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, when Claire says \"This means\" to Luke, the man in a suit, what does this statement imply?\nOption:\nA. It implies Claire has decided to start a romantic relationship with Luke.\nB. It implies Claire has decided to move in with Luke.\nC. It implies Claire has decided to have a beer with Luke.\nD. It implies Claire has decided to celebrate the completion of a work project.\nAnswer with the option's letter from the given choices directly.",
2114,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "705-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2115,
"target": "B",
"doc": {
"video_id": "706",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=ctqn-UJGX5M",
"videoID": "ctqn-UJGX5M",
"question_id": "706-1",
"task_type": "Action Reasoning",
"question": "How is Arya Stark when she was a child?",
"options": [
"A. Quiet.",
"B. Naught.",
"C. Bad.",
"D. Troublesome."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is Arya Stark when she was a child?\nOption:\nA. Quiet.\nB. Naught.\nC. Bad.\nD. Troublesome.\nAnswer with the option's letter from the given choices directly.",
2115,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "706-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2116,
"target": "D",
"doc": {
"video_id": "706",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=ctqn-UJGX5M",
"videoID": "ctqn-UJGX5M",
"question_id": "706-2",
"task_type": "Object Reasoning",
"question": "What is the girl most important quality?",
"options": [
"A. Loyal.",
"B. Smart.",
"C. Honest.",
"D. Tough."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the girl most important quality?\nOption:\nA. Loyal.\nB. Smart.\nC. Honest.\nD. Tough.\nAnswer with the option's letter from the given choices directly.",
2116,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "706-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2117,
"target": "C",
"doc": {
"video_id": "706",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=ctqn-UJGX5M",
"videoID": "ctqn-UJGX5M",
"question_id": "706-3",
"task_type": "Information Synopsis",
"question": "What does the video tell us?",
"options": [
"A. The challenge of the girl.",
"B. The identity of the girl.",
"C. The transformation of the girl.",
"D. The suffer of the girl."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video tell us?\nOption:\nA. The challenge of the girl.\nB. The identity of the girl.\nC. The transformation of the girl.\nD. The suffer of the girl.\nAnswer with the option's letter from the given choices directly.",
2117,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "706-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2118,
"target": "C",
"doc": {
"video_id": "707",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=fuDNhWIa6KQ",
"videoID": "fuDNhWIa6KQ",
"question_id": "707-1",
"task_type": "Information Synopsis",
"question": "What is the video about?",
"options": [
"A. The skills of fight.",
"B. How to get the throne.",
"C. Rank of fighters.",
"D. No one above is correct."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video about?\nOption:\nA. The skills of fight.\nB. How to get the throne.\nC. Rank of fighters.\nD. No one above is correct.\nAnswer with the option's letter from the given choices directly.",
2118,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "707-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2119,
"target": "B",
"doc": {
"video_id": "707",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=fuDNhWIa6KQ",
"videoID": "fuDNhWIa6KQ",
"question_id": "707-2",
"task_type": "Object Recognition",
"question": "Who is the greatest fighter according to this video?",
"options": [
"A. Grey Worm.",
"B. Jon Snow.",
"C. Ser Brienne of Tarth.",
"D. No one above is correct."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the greatest fighter according to this video?\nOption:\nA. Grey Worm.\nB. Jon Snow.\nC. Ser Brienne of Tarth.\nD. No one above is correct.\nAnswer with the option's letter from the given choices directly.",
2119,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "707-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2120,
"target": "A",
"doc": {
"video_id": "707",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=fuDNhWIa6KQ",
"videoID": "fuDNhWIa6KQ",
"question_id": "707-3",
"task_type": "Object Recognition",
"question": "What does Jon Snow use to fight with Ramsay Bolton?",
"options": [
"A. A shield.",
"B. A sword.",
"C. An Axe.",
"D. A spear."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Jon Snow use to fight with Ramsay Bolton?\nOption:\nA. A shield.\nB. A sword.\nC. An Axe.\nD. A spear.\nAnswer with the option's letter from the given choices directly.",
2120,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "707-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2121,
"target": "D",
"doc": {
"video_id": "708",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=xGcfBRkJSWQ",
"videoID": "xGcfBRkJSWQ",
"question_id": "708-1",
"task_type": "Action Reasoning",
"question": "Why does Joker kill his accomplices?",
"options": [
"A. Other robbers betray him.",
"B. He hurries to escape.",
"C. A robber shows his face.",
"D. He doesn't want to share the money."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Joker kill his accomplices?\nOption:\nA. Other robbers betray him.\nB. He hurries to escape.\nC. A robber shows his face.\nD. He doesn't want to share the money.\nAnswer with the option's letter from the given choices directly.",
2121,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "708-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2122,
"target": "C",
"doc": {
"video_id": "708",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=xGcfBRkJSWQ",
"videoID": "xGcfBRkJSWQ",
"question_id": "708-2",
"task_type": "Action Recognition",
"question": "How does bat-man stop the truck Joker drives?",
"options": [
"A. Lifts the truck.",
"B. Crashes the truck.",
"C. Stumbles the truck.",
"D. No one above is correct."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does bat-man stop the truck Joker drives?\nOption:\nA. Lifts the truck.\nB. Crashes the truck.\nC. Stumbles the truck.\nD. No one above is correct.\nAnswer with the option's letter from the given choices directly.",
2122,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "708-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2123,
"target": "A",
"doc": {
"video_id": "708",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=xGcfBRkJSWQ",
"videoID": "xGcfBRkJSWQ",
"question_id": "708-3",
"task_type": "Action Reasoning",
"question": "Why does Joker give bomb controler to both people of boats respectively?",
"options": [
"A. He wants to defeat bat-man.",
"B. He wants to kill bat-man.",
"C. He dislike the people on the boats.",
"D. People on the boat are all convicts."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Joker give bomb controler to both people of boats respectively?\nOption:\nA. He wants to defeat bat-man.\nB. He wants to kill bat-man.\nC. He dislike the people on the boats.\nD. People on the boat are all convicts.\nAnswer with the option's letter from the given choices directly.",
2123,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "708-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2124,
"target": "C",
"doc": {
"video_id": "709",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=XRQ2L8BaI1A",
"videoID": "XRQ2L8BaI1A",
"question_id": "709-1",
"task_type": "OCR Problems",
"question": "In which episode do Penny and Leonard get married in this video?",
"options": [
"A. Season 7, Episode 23.",
"B. Season 9, Episode 1.",
"C. Season 10, Episode 1.",
"D. Season 12, Episode 6."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which episode do Penny and Leonard get married in this video?\nOption:\nA. Season 7, Episode 23.\nB. Season 9, Episode 1.\nC. Season 10, Episode 1.\nD. Season 12, Episode 6.\nAnswer with the option's letter from the given choices directly.",
2124,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "709-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2125,
"target": "C",
"doc": {
"video_id": "709",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=XRQ2L8BaI1A",
"videoID": "XRQ2L8BaI1A",
"question_id": "709-2",
"task_type": "Action Reasoning",
"question": "How does Leonard convince Penny to go on a date with him?",
"options": [
"A. He promises to be more attentive and romantic.",
"B. He uses humor and flattery to make her laugh.",
"C. He appeals to her desire for a \"nice\" and \"honest\" partner.",
"D. He tells her that he has changed and wants to make things right."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Leonard convince Penny to go on a date with him?\nOption:\nA. He promises to be more attentive and romantic.\nB. He uses humor and flattery to make her laugh.\nC. He appeals to her desire for a \"nice\" and \"honest\" partner.\nD. He tells her that he has changed and wants to make things right.\nAnswer with the option's letter from the given choices directly.",
2125,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "709-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2126,
"target": "A",
"doc": {
"video_id": "709",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=XRQ2L8BaI1A",
"videoID": "XRQ2L8BaI1A",
"question_id": "709-3",
"task_type": "Attribute Perception",
"question": "When Penny enters the laundry room, what color clothes is Leonard folding?",
"options": [
"A. Green.",
"B. Blue.",
"C. White.",
"D. Black."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When Penny enters the laundry room, what color clothes is Leonard folding?\nOption:\nA. Green.\nB. Blue.\nC. White.\nD. Black.\nAnswer with the option's letter from the given choices directly.",
2126,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "709-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2127,
"target": "D",
"doc": {
"video_id": "710",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=IPqb_oDlGBU",
"videoID": "IPqb_oDlGBU",
"question_id": "710-1",
"task_type": "Attribute Perception",
"question": "What is the speaker's attitude to the new version movie?",
"options": [
"A. Positive.",
"B. Approbation.",
"C. Encouraged.",
"D. Negative."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the speaker's attitude to the new version movie?\nOption:\nA. Positive.\nB. Approbation.\nC. Encouraged.\nD. Negative.\nAnswer with the option's letter from the given choices directly.",
2127,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "710-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2128,
"target": "A",
"doc": {
"video_id": "710",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=IPqb_oDlGBU",
"videoID": "IPqb_oDlGBU",
"question_id": "710-2",
"task_type": "Action Reasoning",
"question": "How does the speaker think about the David Suchet's version movie?",
"options": [
"A. Perfect.",
"B. Need to get progress.",
"C. Is not appropriate for movie.",
"D. Can't be worst."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the speaker think about the David Suchet's version movie?\nOption:\nA. Perfect.\nB. Need to get progress.\nC. Is not appropriate for movie.\nD. Can't be worst.\nAnswer with the option's letter from the given choices directly.",
2128,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "710-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2129,
"target": "D",
"doc": {
"video_id": "710",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Movie & TV Show",
"url": "https://www.youtube.com/watch?v=IPqb_oDlGBU",
"videoID": "IPqb_oDlGBU",
"question_id": "710-3",
"task_type": "Object Reasoning",
"question": "Why does Poirot lead person to death in David Suchet's version?",
"options": [
"A. Because he doesn't like argue with people about reason of crime.",
"B. Because he is stupid.",
"C. Because he doesn't have empathy.",
"D. Because he think things are just two side."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Poirot lead person to death in David Suchet's version?\nOption:\nA. Because he doesn't like argue with people about reason of crime.\nB. Because he is stupid.\nC. Because he doesn't have empathy.\nD. Because he think things are just two side.\nAnswer with the option's letter from the given choices directly.",
2129,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "710-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Movie & TV Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2130,
"target": "C",
"doc": {
"video_id": "711",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=XOfeqRgaF1A",
"videoID": "XOfeqRgaF1A",
"question_id": "711-1",
"task_type": "Object Reasoning",
"question": "What was the experiment in which a man with a ponytail invited volunteers to assess the accuracy of determining COVID-19 infection based on voice, aiming to explore?",
"options": [
"A. To explore whether everyone hears the same thing when talking to someone.",
"B. To explore whether someone actually has COVID-19.",
"C. To explore whether the method of judging someone's trustworthiness through their voice is reliable.",
"D. To explore whether someone has difficulty breathing."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the experiment in which a man with a ponytail invited volunteers to assess the accuracy of determining COVID-19 infection based on voice, aiming to explore?\nOption:\nA. To explore whether everyone hears the same thing when talking to someone.\nB. To explore whether someone actually has COVID-19.\nC. To explore whether the method of judging someone's trustworthiness through their voice is reliable.\nD. To explore whether someone has difficulty breathing.\nAnswer with the option's letter from the given choices directly.",
2130,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "711-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2131,
"target": "C",
"doc": {
"video_id": "711",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=XOfeqRgaF1A",
"videoID": "XOfeqRgaF1A",
"question_id": "711-2",
"task_type": "Object Reasoning",
"question": "What research did Oliver Niebuhr conduct?",
"options": [
"A. Whether one's abilities can be judged based on facial features.",
"B. Whether someone's reliability can be judged through their voice.",
"C. What factors make a voice more attractive.",
"D. Whether personality traits can be deciphered through facial and vocal cues."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What research did Oliver Niebuhr conduct?\nOption:\nA. Whether one's abilities can be judged based on facial features.\nB. Whether someone's reliability can be judged through their voice.\nC. What factors make a voice more attractive.\nD. Whether personality traits can be deciphered through facial and vocal cues.\nAnswer with the option's letter from the given choices directly.",
2131,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "711-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2132,
"target": "D",
"doc": {
"video_id": "711",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=XOfeqRgaF1A",
"videoID": "XOfeqRgaF1A",
"question_id": "711-3",
"task_type": "Action Reasoning",
"question": "In the later segment of the video, it was noted that \"Jon is currently investigating a training method for enduring effectiveness.\" What is the focus of Jon Freeman's research?",
"options": [
"A. To enable the brain to evaluate facial features in the same way.",
"B. To explore how stereotypes work.",
"C. Cannot be determined.",
"D. To enable the brain to evaluate facial features in a different way."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the later segment of the video, it was noted that \"Jon is currently investigating a training method for enduring effectiveness.\" What is the focus of Jon Freeman's research?\nOption:\nA. To enable the brain to evaluate facial features in the same way.\nB. To explore how stereotypes work.\nC. Cannot be determined.\nD. To enable the brain to evaluate facial features in a different way.\nAnswer with the option's letter from the given choices directly.",
2132,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "711-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2133,
"target": "A",
"doc": {
"video_id": "712",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=GDzuMTvHnJE",
"videoID": "GDzuMTvHnJE",
"question_id": "712-1",
"task_type": "Action Reasoning",
"question": "After receiving a one-time benefit of £26,000, what transformation did the three families experience?",
"options": [
"A. Became more motivated in life.",
"B. Became less willing to spend money.",
"C. Became more trustworthy.",
"D. Became more stressed."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After receiving a one-time benefit of £26,000, what transformation did the three families experience?\nOption:\nA. Became more motivated in life.\nB. Became less willing to spend money.\nC. Became more trustworthy.\nD. Became more stressed.\nAnswer with the option's letter from the given choices directly.",
2133,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "712-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2134,
"target": "B",
"doc": {
"video_id": "712",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=GDzuMTvHnJE",
"videoID": "GDzuMTvHnJE",
"question_id": "712-2",
"task_type": "Attribute Perception",
"question": "What kind of work did the man named Tony engage in?",
"options": [
"A. Unemployed.",
"B. Second-hand goods business.",
"C. Driver.",
"D. Cleaner."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of work did the man named Tony engage in?\nOption:\nA. Unemployed.\nB. Second-hand goods business.\nC. Driver.\nD. Cleaner.\nAnswer with the option's letter from the given choices directly.",
2134,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "712-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2135,
"target": "C",
"doc": {
"video_id": "712",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=GDzuMTvHnJE",
"videoID": "GDzuMTvHnJE",
"question_id": "712-3",
"task_type": "Action Reasoning",
"question": "In the latter part of the video, why does Tony say, \"We had a great day today\"?",
"options": [
"A. Tony has learned how to drive.",
"B. Diana's family has finally reunited.",
"C. Diana found a job, Tony's business potential opening.",
"D. They received £ 26000 in benefits and improved their living standards."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, why does Tony say, \"We had a great day today\"?\nOption:\nA. Tony has learned how to drive.\nB. Diana's family has finally reunited.\nC. Diana found a job, Tony's business potential opening.\nD. They received £ 26000 in benefits and improved their living standards.\nAnswer with the option's letter from the given choices directly.",
2135,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "712-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2136,
"target": "B",
"doc": {
"video_id": "713",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=y4-CnqfK3pk",
"videoID": "y4-CnqfK3pk",
"question_id": "713-1",
"task_type": "Object Reasoning",
"question": "According to the video, the price at which Jonas and Max sell drugs is approximately how many times the price in the origin country, Colombia?",
"options": [
"A. 2.",
"B. 33.",
"C. 10.",
"D. 333."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, the price at which Jonas and Max sell drugs is approximately how many times the price in the origin country, Colombia?\nOption:\nA. 2.\nB. 33.\nC. 10.\nD. 333.\nAnswer with the option's letter from the given choices directly.",
2136,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "713-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2137,
"target": "C",
"doc": {
"video_id": "713",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=y4-CnqfK3pk",
"videoID": "y4-CnqfK3pk",
"question_id": "713-2",
"task_type": "Counting Problem",
"question": "How many people interviewed in the video are unwilling to show their faces?",
"options": [
"A. 8.",
"B. 1.",
"C. 5.",
"D. 2."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many people interviewed in the video are unwilling to show their faces?\nOption:\nA. 8.\nB. 1.\nC. 5.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
2137,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "713-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2138,
"target": "D",
"doc": {
"video_id": "713",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=y4-CnqfK3pk",
"videoID": "y4-CnqfK3pk",
"question_id": "713-3",
"task_type": "Object Recognition",
"question": "According to the video, which country's drug trafficking group was responsible for the death of crime journalist De Vries?",
"options": [
"A. United Kingdom.",
"B. It's not mentioned in the video.",
"C. Germany.",
"D. Netherlands."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which country's drug trafficking group was responsible for the death of crime journalist De Vries?\nOption:\nA. United Kingdom.\nB. It's not mentioned in the video.\nC. Germany.\nD. Netherlands.\nAnswer with the option's letter from the given choices directly.",
2138,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "713-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2139,
"target": "A",
"doc": {
"video_id": "714",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=g_1oiJqE3OI",
"videoID": "g_1oiJqE3OI",
"question_id": "714-1",
"task_type": "Object Reasoning",
"question": "Why does the video introduce a family who raises two children, showcasing conflicting points within the family? And what problem does it elaborate on?",
"options": [
"A. The family is worried about high fuel costs, which allocates demands for new fuel, and more intelligent utilization of transportation resources.",
"B. The family is struggling with conflicting work schedules and lack of quality time together, which allocates for more flexible working schedule.",
"C. The family is divided over their preferred vacation destination, which emphasizes the need for better travel planning and budgeting.",
"D. The family is experiencing disagreements about household chores, which emphasizes the importance of effective communication and division of responsibilities."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the video introduce a family who raises two children, showcasing conflicting points within the family? And what problem does it elaborate on?\nOption:\nA. The family is worried about high fuel costs, which allocates demands for new fuel, and more intelligent utilization of transportation resources.\nB. The family is struggling with conflicting work schedules and lack of quality time together, which allocates for more flexible working schedule.\nC. The family is divided over their preferred vacation destination, which emphasizes the need for better travel planning and budgeting.\nD. The family is experiencing disagreements about household chores, which emphasizes the importance of effective communication and division of responsibilities.\nAnswer with the option's letter from the given choices directly.",
2139,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "714-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2140,
"target": "D",
"doc": {
"video_id": "714",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=g_1oiJqE3OI",
"videoID": "g_1oiJqE3OI",
"question_id": "714-2",
"task_type": "Information Synopsis",
"question": "What is the focus of the imagined world in 2050 that is introduced in this video?",
"options": [
"A. Sustainable agriculture and food production.",
"B. Advanced healthcare technologies and treatments.",
"C. Ocean exploration and conservation efforts.",
"D. Renewable energy and artificial intelligence design for driving."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the focus of the imagined world in 2050 that is introduced in this video?\nOption:\nA. Sustainable agriculture and food production.\nB. Advanced healthcare technologies and treatments.\nC. Ocean exploration and conservation efforts.\nD. Renewable energy and artificial intelligence design for driving.\nAnswer with the option's letter from the given choices directly.",
2140,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "714-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2141,
"target": "A",
"doc": {
"video_id": "714",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=g_1oiJqE3OI",
"videoID": "g_1oiJqE3OI",
"question_id": "714-3",
"task_type": "Information Synopsis",
"question": "What is the topic of the first section in the video, imagine the sections are separated by an opening frame?",
"options": [
"A. Fueling the future.",
"B. Autonomous vehicles.",
"C. Artificial intelligence design.",
"D. Searching for Utopia."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of the first section in the video, imagine the sections are separated by an opening frame?\nOption:\nA. Fueling the future.\nB. Autonomous vehicles.\nC. Artificial intelligence design.\nD. Searching for Utopia.\nAnswer with the option's letter from the given choices directly.",
2141,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "714-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2142,
"target": "A",
"doc": {
"video_id": "715",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=9edWX7TTsLw",
"videoID": "9edWX7TTsLw",
"question_id": "715-1",
"task_type": "Temporal Reasoning",
"question": "What is the order of companies mentioned in the first half of the video related to water scarcity?",
"options": [
"A. Tesla, Coca-Cola, Intel.",
"B. Coca-Cola, Tesla, Intel.",
"C. Intel, Tesla, Coca-Cola.",
"D. Tesla, Intel, Coca-Cola."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the order of companies mentioned in the first half of the video related to water scarcity?\nOption:\nA. Tesla, Coca-Cola, Intel.\nB. Coca-Cola, Tesla, Intel.\nC. Intel, Tesla, Coca-Cola.\nD. Tesla, Intel, Coca-Cola.\nAnswer with the option's letter from the given choices directly.",
2142,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "715-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2143,
"target": "A",
"doc": {
"video_id": "715",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=9edWX7TTsLw",
"videoID": "9edWX7TTsLw",
"question_id": "715-2",
"task_type": "Information Synopsis",
"question": "Which of the following impacts of water scarcity was NOT mentioned in the video?",
"options": [
"A. Impact on climate.",
"B. Impact on agriculture.",
"C. Impact on livestock farming.",
"D. Impact on industry."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following impacts of water scarcity was NOT mentioned in the video?\nOption:\nA. Impact on climate.\nB. Impact on agriculture.\nC. Impact on livestock farming.\nD. Impact on industry.\nAnswer with the option's letter from the given choices directly.",
2143,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "715-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2144,
"target": "C",
"doc": {
"video_id": "715",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=9edWX7TTsLw",
"videoID": "9edWX7TTsLw",
"question_id": "715-3",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT a reason for water scarcity mentioned in the video?",
"options": [
"A. Climate change.",
"B. High consumption by industrial production.",
"C. Population growth leading to increased consumption of drinking water.",
"D. High consumption by agricultural production."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT a reason for water scarcity mentioned in the video?\nOption:\nA. Climate change.\nB. High consumption by industrial production.\nC. Population growth leading to increased consumption of drinking water.\nD. High consumption by agricultural production.\nAnswer with the option's letter from the given choices directly.",
2144,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "715-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2145,
"target": "A",
"doc": {
"video_id": "716",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B9He5DVePvk",
"videoID": "B9He5DVePvk",
"question_id": "716-1",
"task_type": "Attribute Perception",
"question": "What religion can we infer from the video that the Amish practice?",
"options": [
"A. Christianity.",
"B. Islam.",
"C. Buddhism.",
"D. It's not mentioned in the video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What religion can we infer from the video that the Amish practice?\nOption:\nA. Christianity.\nB. Islam.\nC. Buddhism.\nD. It's not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
2145,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "716-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2146,
"target": "B",
"doc": {
"video_id": "716",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B9He5DVePvk",
"videoID": "B9He5DVePvk",
"question_id": "716-2",
"task_type": "Action Reasoning",
"question": "Based on the video, what mode of transportation do you think the Amish use?",
"options": [
"A. Only horse-drawn carriages.",
"B. Horse-drawn carriages and cars.",
"C. Only cars.",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, what mode of transportation do you think the Amish use?\nOption:\nA. Only horse-drawn carriages.\nB. Horse-drawn carriages and cars.\nC. Only cars.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2146,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "716-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2147,
"target": "C",
"doc": {
"video_id": "716",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B9He5DVePvk",
"videoID": "B9He5DVePvk",
"question_id": "716-3",
"task_type": "Object Reasoning",
"question": "What can't be inferred from the interview with an Amish woman is shown in the latter part of the video?",
"options": [
"A. Not every Amish person is very devout.",
"B. She has used contraceptive methods.",
"C. Amish usually don't use electricity.",
"D. She has six children."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can't be inferred from the interview with an Amish woman is shown in the latter part of the video?\nOption:\nA. Not every Amish person is very devout.\nB. She has used contraceptive methods.\nC. Amish usually don't use electricity.\nD. She has six children.\nAnswer with the option's letter from the given choices directly.",
2147,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "716-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2148,
"target": "C",
"doc": {
"video_id": "717",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=068rdc75mHM",
"videoID": "068rdc75mHM",
"question_id": "717-1",
"task_type": "Information Synopsis",
"question": "According to the video, what inconsistency in quantum mechanics was discovered by Einstein and published in a paper?",
"options": [
"A. The trajectory of an object will be described as a probability wave and is no longer certain.",
"B. There are problems with its mathematical equations.",
"C. Two particles at a great distance can instantly influence each other, i.e., quantum entanglement.",
"D. It's not mentioned in the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what inconsistency in quantum mechanics was discovered by Einstein and published in a paper?\nOption:\nA. The trajectory of an object will be described as a probability wave and is no longer certain.\nB. There are problems with its mathematical equations.\nC. Two particles at a great distance can instantly influence each other, i.e., quantum entanglement.\nD. It's not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
2148,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "717-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2149,
"target": "A",
"doc": {
"video_id": "717",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=068rdc75mHM",
"videoID": "068rdc75mHM",
"question_id": "717-2",
"task_type": "Action Reasoning",
"question": "The video repeatedly shows a person at a table, lifting a cup and another cup also floating up, with two light balls under each cup. What is this segment trying to show?",
"options": [
"A. Quantum entanglement effect.",
"B. This is a magic trick.",
"C. Objects propagate in the form of probability waves.",
"D. Einstein's photoelectric effect."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video repeatedly shows a person at a table, lifting a cup and another cup also floating up, with two light balls under each cup. What is this segment trying to show?\nOption:\nA. Quantum entanglement effect.\nB. This is a magic trick.\nC. Objects propagate in the form of probability waves.\nD. Einstein's photoelectric effect.\nAnswer with the option's letter from the given choices directly.",
2149,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "717-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2150,
"target": "A",
"doc": {
"video_id": "717",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=068rdc75mHM",
"videoID": "068rdc75mHM",
"question_id": "717-3",
"task_type": "Information Synopsis",
"question": "According to the video, how did the debate between Einstein and Bohr about quantum mechanics turn out?",
"options": [
"A. The experiment by the team at the high-altitude observatory further confirmed that Bohr was right.",
"B. Bell's experiment completely confirmed that Bohr was correct.",
"C. Bell's experiment completely confirmed that Einstein was correct.",
"D. Pan Jianwei's team confirmed that Bohr was correct."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how did the debate between Einstein and Bohr about quantum mechanics turn out?\nOption:\nA. The experiment by the team at the high-altitude observatory further confirmed that Bohr was right.\nB. Bell's experiment completely confirmed that Bohr was correct.\nC. Bell's experiment completely confirmed that Einstein was correct.\nD. Pan Jianwei's team confirmed that Bohr was correct.\nAnswer with the option's letter from the given choices directly.",
2150,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "717-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2151,
"target": "C",
"doc": {
"video_id": "718",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=eAIKvD_gLJo",
"videoID": "eAIKvD_gLJo",
"question_id": "718-1",
"task_type": "Object Reasoning",
"question": "Based on the early part of the video, what can we infer as the reason for the sudden surge in popularity of crystals and gemstones?",
"options": [
"A. A whole new variety of crystals and gemstones has emerged.",
"B. Enhanced accessibility and affordability of gemstones.",
"C. Social media promotion.",
"D. Growing interest in holistic wellness and alternative spirituality."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the early part of the video, what can we infer as the reason for the sudden surge in popularity of crystals and gemstones?\nOption:\nA. A whole new variety of crystals and gemstones has emerged.\nB. Enhanced accessibility and affordability of gemstones.\nC. Social media promotion.\nD. Growing interest in holistic wellness and alternative spirituality.\nAnswer with the option's letter from the given choices directly.",
2151,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "718-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2152,
"target": "B",
"doc": {
"video_id": "718",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=eAIKvD_gLJo",
"videoID": "eAIKvD_gLJo",
"question_id": "718-2",
"task_type": "Action Reasoning",
"question": "Based on the content of the video, why did the local government send an entourage to follow the reporter?",
"options": [
"A. To guide them.",
"B. To influence what they hear from the miners.",
"C. To help them with licensing issues.",
"D. To ensure their safety."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the content of the video, why did the local government send an entourage to follow the reporter?\nOption:\nA. To guide them.\nB. To influence what they hear from the miners.\nC. To help them with licensing issues.\nD. To ensure their safety.\nAnswer with the option's letter from the given choices directly.",
2152,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "718-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2153,
"target": "A",
"doc": {
"video_id": "718",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=eAIKvD_gLJo",
"videoID": "eAIKvD_gLJo",
"question_id": "718-3",
"task_type": "Object Reasoning",
"question": "Based on the content of the video, what does the word \"dirty\" most likely refer to in the video title \"The dirty business of beauty\"?",
"options": [
"A. The gemstone and crystal mining industry is filled with child labor abuse, exploitation, and dangers, making it morally \"dirty\".",
"B. Gemstone and crystal mining pollute the environment.",
"C. The physical environment of gemstone and crystal mining is dirty.",
"D. Newly mined gemstones and crystals are physically dirty."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the content of the video, what does the word \"dirty\" most likely refer to in the video title \"The dirty business of beauty\"?\nOption:\nA. The gemstone and crystal mining industry is filled with child labor abuse, exploitation, and dangers, making it morally \"dirty\".\nB. Gemstone and crystal mining pollute the environment.\nC. The physical environment of gemstone and crystal mining is dirty.\nD. Newly mined gemstones and crystals are physically dirty.\nAnswer with the option's letter from the given choices directly.",
2153,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "718-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2154,
"target": "B",
"doc": {
"video_id": "719",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B7Hh0PY1kks",
"videoID": "B7Hh0PY1kks",
"question_id": "719-1",
"task_type": "Action Reasoning",
"question": "What are the differing motivations for buying unhealthy food in the early part of the video between Cuba's family and Michael's family?",
"options": [
"A. Cuba's family does it to satisfy Cuba's demands, while all members of Michael's family like to eat unhealthy food.",
"B. All members of Cuba's family like to eat unhealthy food, while Michael's family does it just to satisfy Michael's demands.",
"C. All members of both Cuba's and Michael's families like to eat unhealthy food.",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the differing motivations for buying unhealthy food in the early part of the video between Cuba's family and Michael's family?\nOption:\nA. Cuba's family does it to satisfy Cuba's demands, while all members of Michael's family like to eat unhealthy food.\nB. All members of Cuba's family like to eat unhealthy food, while Michael's family does it just to satisfy Michael's demands.\nC. All members of both Cuba's and Michael's families like to eat unhealthy food.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2154,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "719-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2155,
"target": "C",
"doc": {
"video_id": "719",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B7Hh0PY1kks",
"videoID": "B7Hh0PY1kks",
"question_id": "719-2",
"task_type": "Action Reasoning",
"question": "What motivated Taylor to make up her mind to change her and Harry's dietary habits?",
"options": [
"A. Taylor's family and friends hope she can make a change.",
"B. Harry's long-term unhealthy eating habits led him to be diagnosed with diabetes.",
"C. Harry's long-term unhealthy eating habits have led to tooth decay.",
"D. Harry suffered from anemia due to long-term unhealthy dietary habits."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What motivated Taylor to make up her mind to change her and Harry's dietary habits?\nOption:\nA. Taylor's family and friends hope she can make a change.\nB. Harry's long-term unhealthy eating habits led him to be diagnosed with diabetes.\nC. Harry's long-term unhealthy eating habits have led to tooth decay.\nD. Harry suffered from anemia due to long-term unhealthy dietary habits.\nAnswer with the option's letter from the given choices directly.",
2155,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "719-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2156,
"target": "A",
"doc": {
"video_id": "719",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=B7Hh0PY1kks",
"videoID": "B7Hh0PY1kks",
"question_id": "719-3",
"task_type": "Information Synopsis",
"question": "Based on the entire video, what kind of transformation occurred with the child named Cuba?",
"options": [
"A. Healthy eating made Cuba go from being anemic to being healthy.",
"B. Healthy eating made Cuba go from having meningitis to being healthy.",
"C. Healthy eating made Cuba go from having diabetes to being healthy.",
"D. Healthy eating made Cuba go from having cavities to being healthy."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the entire video, what kind of transformation occurred with the child named Cuba?\nOption:\nA. Healthy eating made Cuba go from being anemic to being healthy.\nB. Healthy eating made Cuba go from having meningitis to being healthy.\nC. Healthy eating made Cuba go from having diabetes to being healthy.\nD. Healthy eating made Cuba go from having cavities to being healthy.\nAnswer with the option's letter from the given choices directly.",
2156,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "719-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2157,
"target": "D",
"doc": {
"video_id": "720",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=AkHbB9GuCkE",
"videoID": "AkHbB9GuCkE",
"question_id": "720-1",
"task_type": "Attribute Perception",
"question": "Why did Jamie Dimon turn down numerous job offers from famous Wall Street investment banks to work for Sandy Weill?",
"options": [
"A. Because Sandy Weill was prestigious.",
"B. Because Sandy Weill was a family friend.",
"C. Because Jamie Dimon believed Sandy Weill had stronger abilities.",
"D. Because Sandy Weill could offer him the opportunity to deeply understand how companies operate."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did Jamie Dimon turn down numerous job offers from famous Wall Street investment banks to work for Sandy Weill?\nOption:\nA. Because Sandy Weill was prestigious.\nB. Because Sandy Weill was a family friend.\nC. Because Jamie Dimon believed Sandy Weill had stronger abilities.\nD. Because Sandy Weill could offer him the opportunity to deeply understand how companies operate.\nAnswer with the option's letter from the given choices directly.",
2157,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "720-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2158,
"target": "D",
"doc": {
"video_id": "720",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=AkHbB9GuCkE",
"videoID": "AkHbB9GuCkE",
"question_id": "720-2",
"task_type": "Action Reasoning",
"question": "Which of the following was not a factor in Sandy Weill's decision to dismiss Jamie Dimon?",
"options": [
"A. Jamie Dimon's inability to treat employees equally.",
"B. Differences in vision for leadership between Jamie Dimon and Sandy Weill.",
"C. Jamie Dimon alienating Sandy Weill's daughter.",
"D. Differences in vision for leadership between Jamie Dimon and Sandy Weill leading to disagreements over investment plans."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following was not a factor in Sandy Weill's decision to dismiss Jamie Dimon?\nOption:\nA. Jamie Dimon's inability to treat employees equally.\nB. Differences in vision for leadership between Jamie Dimon and Sandy Weill.\nC. Jamie Dimon alienating Sandy Weill's daughter.\nD. Differences in vision for leadership between Jamie Dimon and Sandy Weill leading to disagreements over investment plans.\nAnswer with the option's letter from the given choices directly.",
2158,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "720-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2159,
"target": "B",
"doc": {
"video_id": "720",
"duration": "long",
"domain": "Film & Television",
"sub_category": "Documentary",
"url": "https://www.youtube.com/watch?v=AkHbB9GuCkE",
"videoID": "AkHbB9GuCkE",
"question_id": "720-3",
"task_type": "Information Synopsis",
"question": "Based on the video, which company's management did Jamie Dimon not serve on?",
"options": [
"A. Citigroup.",
"B. American Express.",
"C. Commercial Credit.",
"D. JP Morgan Chase."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which company's management did Jamie Dimon not serve on?\nOption:\nA. Citigroup.\nB. American Express.\nC. Commercial Credit.\nD. JP Morgan Chase.\nAnswer with the option's letter from the given choices directly.",
2159,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "720-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "Documentary",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2160,
"target": "B",
"doc": {
"video_id": "721",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=qd2ivr-5oEM",
"videoID": "qd2ivr-5oEM",
"question_id": "721-1",
"task_type": "Information Synopsis",
"question": "Which of the following contents was NOT mentioned in the video?",
"options": [
"A. NBA investigates Raptors player for gambling.",
"B. The price of Bitcoin plummeting.",
"C. Breakdown of Israel-Hamas ceasefire talks.",
"D. Journalist describes the chaos in Haiti."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following contents was NOT mentioned in the video?\nOption:\nA. NBA investigates Raptors player for gambling.\nB. The price of Bitcoin plummeting.\nC. Breakdown of Israel-Hamas ceasefire talks.\nD. Journalist describes the chaos in Haiti.\nAnswer with the option's letter from the given choices directly.",
2160,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "721-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2161,
"target": "D",
"doc": {
"video_id": "721",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=qd2ivr-5oEM",
"videoID": "qd2ivr-5oEM",
"question_id": "721-2",
"task_type": "Action Reasoning",
"question": "According to the content of the video, what position does Marineland's statement aim to express?",
"options": [
"A. Calls for Marineland to be closed.",
"B. Marineland has been investigated by the government multiple times and should not be investigated again.",
"C. There should be more veterinarians to help Marineland.",
"D. They have made a lot of effort to save the belugas but still failed."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the content of the video, what position does Marineland's statement aim to express?\nOption:\nA. Calls for Marineland to be closed.\nB. Marineland has been investigated by the government multiple times and should not be investigated again.\nC. There should be more veterinarians to help Marineland.\nD. They have made a lot of effort to save the belugas but still failed.\nAnswer with the option's letter from the given choices directly.",
2161,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "721-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2162,
"target": "A",
"doc": {
"video_id": "721",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=qd2ivr-5oEM",
"videoID": "qd2ivr-5oEM",
"question_id": "721-3",
"task_type": "Action Reasoning",
"question": "In the middle section of the video, regarding the news about a ship colliding with a bridge, how do the news anchor and the guest judge when the first power outage occurred on the ship in the video?",
"options": [
"A. Because the lights on the ship suddenly went out.",
"B. Because the ship suddenly stopped.",
"C. Because the ship hit the bridge.",
"D. Because the crew sent out a distress signal."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle section of the video, regarding the news about a ship colliding with a bridge, how do the news anchor and the guest judge when the first power outage occurred on the ship in the video?\nOption:\nA. Because the lights on the ship suddenly went out.\nB. Because the ship suddenly stopped.\nC. Because the ship hit the bridge.\nD. Because the crew sent out a distress signal.\nAnswer with the option's letter from the given choices directly.",
2162,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "721-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2163,
"target": "B",
"doc": {
"video_id": "722",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=54H8ppxnp8I",
"videoID": "54H8ppxnp8I",
"question_id": "722-1",
"task_type": "Object Reasoning",
"question": "According to the video, does the male interviewee wearing a black suit have a stance that is more in favor of men or women?",
"options": [
"A. Neutral.",
"B. In favor of men.",
"C. In favor of women.",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, does the male interviewee wearing a black suit have a stance that is more in favor of men or women?\nOption:\nA. Neutral.\nB. In favor of men.\nC. In favor of women.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2163,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "722-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2164,
"target": "A",
"doc": {
"video_id": "722",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=54H8ppxnp8I",
"videoID": "54H8ppxnp8I",
"question_id": "722-2",
"task_type": "Action Reasoning",
"question": "According to the video, the male interviewee wearing a tight black shirt told a joke about his experience in a popular culture class. He said if there's a movie that more than half of the class has seen, he would buy everyone pizza. Many years have passed, and he hasn't found such a movie yet. What does he imply with this joke?",
"options": [
"A. Consensus culture is becoming less common.",
"B. People no longer like watching movies.",
"C. The pizza from the place he orders is not good.",
"D. There's less popular culture."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, the male interviewee wearing a tight black shirt told a joke about his experience in a popular culture class. He said if there's a movie that more than half of the class has seen, he would buy everyone pizza. Many years have passed, and he hasn't found such a movie yet. What does he imply with this joke?\nOption:\nA. Consensus culture is becoming less common.\nB. People no longer like watching movies.\nC. The pizza from the place he orders is not good.\nD. There's less popular culture.\nAnswer with the option's letter from the given choices directly.",
2164,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "722-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2165,
"target": "D",
"doc": {
"video_id": "722",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=54H8ppxnp8I",
"videoID": "54H8ppxnp8I",
"question_id": "722-3",
"task_type": "Object Reasoning",
"question": "Based on the content of the video, whose viewpoint is relatively the most radical?",
"options": [
"A. The female interviewee.",
"B. The male interviewee wearing a shirt.",
"C. The show host.",
"D. The male interviewee wearing a black suit."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the content of the video, whose viewpoint is relatively the most radical?\nOption:\nA. The female interviewee.\nB. The male interviewee wearing a shirt.\nC. The show host.\nD. The male interviewee wearing a black suit.\nAnswer with the option's letter from the given choices directly.",
2165,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "722-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2166,
"target": "A",
"doc": {
"video_id": "723",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=n5DXniZGkME",
"videoID": "n5DXniZGkME",
"question_id": "723-1",
"task_type": "Object Reasoning",
"question": "At the beginning of the video, what do the red-marked countries on the world map represent?",
"options": [
"A. The population of these countries is declining.",
"B. The population of these countries is growing.",
"C. These countries are experiencing famine.",
"D. These countries are experiencing war."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the beginning of the video, what do the red-marked countries on the world map represent?\nOption:\nA. The population of these countries is declining.\nB. The population of these countries is growing.\nC. These countries are experiencing famine.\nD. These countries are experiencing war.\nAnswer with the option's letter from the given choices directly.",
2166,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "723-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2167,
"target": "C",
"doc": {
"video_id": "723",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=n5DXniZGkME",
"videoID": "n5DXniZGkME",
"question_id": "723-2",
"task_type": "Information Synopsis",
"question": "Which conclusion cannot be drawn in the middle section, which features interviews with Australian women in the video?",
"options": [
"A. More government support policies for pregnant women and expectant fathers could increase fertility rates.",
"B. Raising everyone's wage levels and reducing living costs could increase fertility rates.",
"C. The increase in older mothers is one of the reasons for the decline in fertility rates.",
"D. There is an increase in older mothers in Australia."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which conclusion cannot be drawn in the middle section, which features interviews with Australian women in the video?\nOption:\nA. More government support policies for pregnant women and expectant fathers could increase fertility rates.\nB. Raising everyone's wage levels and reducing living costs could increase fertility rates.\nC. The increase in older mothers is one of the reasons for the decline in fertility rates.\nD. There is an increase in older mothers in Australia.\nAnswer with the option's letter from the given choices directly.",
2167,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "723-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2168,
"target": "A",
"doc": {
"video_id": "723",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=n5DXniZGkME",
"videoID": "n5DXniZGkME",
"question_id": "723-3",
"task_type": "Action Reasoning",
"question": "In the final part of the video, an interview was conducted with a white woman and a black man. What is the relevance of this interview to the theme of the video?",
"options": [
"A. Using automation and robots to perform repetitive tasks is one way to address aging.",
"B. We are experiencing the fourth industrial revolution and should seize the opportunity.",
"C. Automation and robots are taking away people's jobs.",
"D. People are becoming increasingly lazy."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the final part of the video, an interview was conducted with a white woman and a black man. What is the relevance of this interview to the theme of the video?\nOption:\nA. Using automation and robots to perform repetitive tasks is one way to address aging.\nB. We are experiencing the fourth industrial revolution and should seize the opportunity.\nC. Automation and robots are taking away people's jobs.\nD. People are becoming increasingly lazy.\nAnswer with the option's letter from the given choices directly.",
2168,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "723-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2169,
"target": "A",
"doc": {
"video_id": "724",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=54k2g9ddy4M",
"videoID": "54k2g9ddy4M",
"question_id": "724-1",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following news items did NOT appear in this news segment?",
"options": [
"A. An earthquake occurred in the United States.",
"B. Potential community spread of measles in Canada.",
"C. Fire destroys First Nation's only school.",
"D. Josh Liendo swims into the history books."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following news items did NOT appear in this news segment?\nOption:\nA. An earthquake occurred in the United States.\nB. Potential community spread of measles in Canada.\nC. Fire destroys First Nation's only school.\nD. Josh Liendo swims into the history books.\nAnswer with the option's letter from the given choices directly.",
2169,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "724-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2170,
"target": "C",
"doc": {
"video_id": "724",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=54k2g9ddy4M",
"videoID": "54k2g9ddy4M",
"question_id": "724-2",
"task_type": "Object Reasoning",
"question": "In the video, a woman believed her son had died due to a hospital's mistake with information. What impact did this incident have on her family relationships?",
"options": [
"A. It led to a rift between her husband and her son.",
"B. It led to a rift between her husband and herself.",
"C. It strengthened the relationship between her and her son.",
"D. It led to a rift between her and her son."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, a woman believed her son had died due to a hospital's mistake with information. What impact did this incident have on her family relationships?\nOption:\nA. It led to a rift between her husband and her son.\nB. It led to a rift between her husband and herself.\nC. It strengthened the relationship between her and her son.\nD. It led to a rift between her and her son.\nAnswer with the option's letter from the given choices directly.",
2170,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "724-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2171,
"target": "A",
"doc": {
"video_id": "724",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=54k2g9ddy4M",
"videoID": "54k2g9ddy4M",
"question_id": "724-3",
"task_type": "Temporal Reasoning",
"question": "What is the fourth-to-last news item in this news video?",
"options": [
"A. Josh Liendo swims into the history books.",
"B. Coming soon | Rising demand for pet psychics.",
"C. U.S. vice president calls for Gaza ceasefire.",
"D. California storm drops 2 meters of snow."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the fourth-to-last news item in this news video?\nOption:\nA. Josh Liendo swims into the history books.\nB. Coming soon | Rising demand for pet psychics.\nC. U.S. vice president calls for Gaza ceasefire.\nD. California storm drops 2 meters of snow.\nAnswer with the option's letter from the given choices directly.",
2171,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "724-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2172,
"target": "C",
"doc": {
"video_id": "725",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=p_4UPdFqgIQ",
"videoID": "p_4UPdFqgIQ",
"question_id": "725-1",
"task_type": "Object Reasoning",
"question": "Which person's anti-aging research did not mention biological clinical trials?",
"options": [
"A. Cynthia Kenyon.",
"B. Nathaniel David.",
"C. Steve Horvath.",
"D. Liz Parrish."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which person's anti-aging research did not mention biological clinical trials?\nOption:\nA. Cynthia Kenyon.\nB. Nathaniel David.\nC. Steve Horvath.\nD. Liz Parrish.\nAnswer with the option's letter from the given choices directly.",
2172,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "725-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2173,
"target": "B",
"doc": {
"video_id": "725",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=p_4UPdFqgIQ",
"videoID": "p_4UPdFqgIQ",
"question_id": "725-2",
"task_type": "Information Synopsis",
"question": "What does Steve Horvath consider a viable solution for reversing aging?",
"options": [
"A. Extending telomere length.",
"B. Reversing epigenetic changes.",
"C. Exercise and fitness.",
"D. Preventing age-related diseases."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Steve Horvath consider a viable solution for reversing aging?\nOption:\nA. Extending telomere length.\nB. Reversing epigenetic changes.\nC. Exercise and fitness.\nD. Preventing age-related diseases.\nAnswer with the option's letter from the given choices directly.",
2173,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "725-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2174,
"target": "C",
"doc": {
"video_id": "725",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=p_4UPdFqgIQ",
"videoID": "p_4UPdFqgIQ",
"question_id": "725-3",
"task_type": "Object Reasoning",
"question": "According to the interviewees in the video, which of the following is not a benefit of delaying aging?",
"options": [
"A. Living longer.",
"B. Delaying the onset of age-related diseases.",
"C. Being more appreciative of life.",
"D. Becoming healthier."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the interviewees in the video, which of the following is not a benefit of delaying aging?\nOption:\nA. Living longer.\nB. Delaying the onset of age-related diseases.\nC. Being more appreciative of life.\nD. Becoming healthier.\nAnswer with the option's letter from the given choices directly.",
2174,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "725-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2175,
"target": "D",
"doc": {
"video_id": "726",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=UyaKyS2wTQk",
"videoID": "UyaKyS2wTQk",
"question_id": "726-1",
"task_type": "Action Reasoning",
"question": "What does the use of spinal cord stimulators to treat Teresa Burbery and Gordon Myers indicate?",
"options": [
"A. Spinal cord stimulators can effectively reduce Teresa Burbery's pain.",
"B. Medical accidents are frequent.",
"C. The spinal cord stimulator treatment plan is very safe.",
"D. Some doctors prioritize profit over patients' lives."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the use of spinal cord stimulators to treat Teresa Burbery and Gordon Myers indicate?\nOption:\nA. Spinal cord stimulators can effectively reduce Teresa Burbery's pain.\nB. Medical accidents are frequent.\nC. The spinal cord stimulator treatment plan is very safe.\nD. Some doctors prioritize profit over patients' lives.\nAnswer with the option's letter from the given choices directly.",
2175,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "726-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2176,
"target": "C",
"doc": {
"video_id": "726",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=UyaKyS2wTQk",
"videoID": "UyaKyS2wTQk",
"question_id": "726-2",
"task_type": "Counting Problem",
"question": "How many patients were harmed by medical accidents in the video?",
"options": [
"A. 6.",
"B. 7.",
"C. 5.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many patients were harmed by medical accidents in the video?\nOption:\nA. 6.\nB. 7.\nC. 5.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2176,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "726-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2177,
"target": "A",
"doc": {
"video_id": "726",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=UyaKyS2wTQk",
"videoID": "UyaKyS2wTQk",
"question_id": "726-3",
"task_type": "Object Reasoning",
"question": "Which of the following descriptions of neurosurgeon Tom Morris mentioned in the video is correct?",
"options": [
"A. Tom Morris disregards the wishes of patients.",
"B. Tom Morris performs well in his work.",
"C. Tom Morris can help Gail change her pain situation.",
"D. Tom Morris has extensive experience in neurosurgical operations."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following descriptions of neurosurgeon Tom Morris mentioned in the video is correct?\nOption:\nA. Tom Morris disregards the wishes of patients.\nB. Tom Morris performs well in his work.\nC. Tom Morris can help Gail change her pain situation.\nD. Tom Morris has extensive experience in neurosurgical operations.\nAnswer with the option's letter from the given choices directly.",
2177,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "726-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2178,
"target": "B",
"doc": {
"video_id": "727",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=yk4eWjYVNYg",
"videoID": "yk4eWjYVNYg",
"question_id": "727-1",
"task_type": "Object Reasoning",
"question": "From the initial section of the video, which of the following is NOT a factor contributing to Jessica's unstable housing situation?",
"options": [
"A. Jessica doesn't have stable housing due to a lack of available houses in town.",
"B. Jessica doesn't have stable housing because she doesn't have a job.",
"C. Jessica doesn't have stable housing because of high rental prices.",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: From the initial section of the video, which of the following is NOT a factor contributing to Jessica's unstable housing situation?\nOption:\nA. Jessica doesn't have stable housing due to a lack of available houses in town.\nB. Jessica doesn't have stable housing because she doesn't have a job.\nC. Jessica doesn't have stable housing because of high rental prices.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2178,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "727-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2179,
"target": "D",
"doc": {
"video_id": "727",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=yk4eWjYVNYg",
"videoID": "yk4eWjYVNYg",
"question_id": "727-2",
"task_type": "Counting Problem",
"question": "Based on the video, how many low-income individuals affected by housing issues were interviewed in total?",
"options": [
"A. 5.",
"B. 4.",
"C. 7.",
"D. 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, how many low-income individuals affected by housing issues were interviewed in total?\nOption:\nA. 5.\nB. 4.\nC. 7.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2179,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "727-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2180,
"target": "D",
"doc": {
"video_id": "727",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=yk4eWjYVNYg",
"videoID": "yk4eWjYVNYg",
"question_id": "727-3",
"task_type": "Object Reasoning",
"question": "Which one is a condition for successfully applying for housing?",
"options": [
"A. Having a recommendation letter.",
"B. Having a good rental history.",
"C. Having an income.",
"D. A successful person in urgent need of accommodation."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which one is a condition for successfully applying for housing?\nOption:\nA. Having a recommendation letter.\nB. Having a good rental history.\nC. Having an income.\nD. A successful person in urgent need of accommodation.\nAnswer with the option's letter from the given choices directly.",
2180,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "727-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2181,
"target": "C",
"doc": {
"video_id": "728",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=E-YnmCB6yjo",
"videoID": "E-YnmCB6yjo",
"question_id": "728-1",
"task_type": "Action Reasoning",
"question": "What does the successful appeal by a group of Swiss women on health and climate as a fundamental human right signify?",
"options": [
"A. It signifies that climate change will no longer lead to human rights claims.",
"B. It signifies that Swiss women will no longer be as vulnerable to the impacts of climate change.",
"C. It signifies that the government will take more action to reduce greenhouse gas emissions.",
"D. Cannot be determined."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the successful appeal by a group of Swiss women on health and climate as a fundamental human right signify?\nOption:\nA. It signifies that climate change will no longer lead to human rights claims.\nB. It signifies that Swiss women will no longer be as vulnerable to the impacts of climate change.\nC. It signifies that the government will take more action to reduce greenhouse gas emissions.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2181,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "728-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2182,
"target": "A",
"doc": {
"video_id": "728",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=E-YnmCB6yjo",
"videoID": "E-YnmCB6yjo",
"question_id": "728-2",
"task_type": "Object Reasoning",
"question": "Based on the video, which of the following is NOT a reason for the surge in the number of private commercial DNA laboratories?",
"options": [
"A. High accuracy of prenatal paternity testing.",
"B. High revenue.",
"C. Operation without the need for special permits.",
"D. Lack of government regulation."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following is NOT a reason for the surge in the number of private commercial DNA laboratories?\nOption:\nA. High accuracy of prenatal paternity testing.\nB. High revenue.\nC. Operation without the need for special permits.\nD. Lack of government regulation.\nAnswer with the option's letter from the given choices directly.",
2182,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "728-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2183,
"target": "A",
"doc": {
"video_id": "728",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=E-YnmCB6yjo",
"videoID": "E-YnmCB6yjo",
"question_id": "728-3",
"task_type": "Object Reasoning",
"question": "At the end of the video, what is the sentiment of the four Russian defectors towards a woman named Judy Tulk?",
"options": [
"A. Gratitude.",
"B. Missing.",
"C. Admiration.",
"D. Cannot be determined."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the video, what is the sentiment of the four Russian defectors towards a woman named Judy Tulk?\nOption:\nA. Gratitude.\nB. Missing.\nC. Admiration.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2183,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "728-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2184,
"target": "B",
"doc": {
"video_id": "729",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=bBazRZOxl-k",
"videoID": "bBazRZOxl-k",
"question_id": "729-1",
"task_type": "Information Synopsis",
"question": "Which of the following events did not appear in the news?",
"options": [
"A. Entrepreneur explains why local radio isn't dead.",
"B. A U.S. ship collides with a bridge.",
"C. Big wins and best moments at the 96th Oscars.",
"D. Newfoundland digs out after a snowstorm."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events did not appear in the news?\nOption:\nA. Entrepreneur explains why local radio isn't dead.\nB. A U.S. ship collides with a bridge.\nC. Big wins and best moments at the 96th Oscars.\nD. Newfoundland digs out after a snowstorm.\nAnswer with the option's letter from the given choices directly.",
2184,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "729-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2185,
"target": "C",
"doc": {
"video_id": "729",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=bBazRZOxl-k",
"videoID": "bBazRZOxl-k",
"question_id": "729-2",
"task_type": "Temporal Reasoning",
"question": "What is the second to last news item in this segment?",
"options": [
"A. CBC agrees to meeting on safer-supply drugs.",
"B. Kate Middleton photo manipulation concern.",
"C. Entrepreneur explains why local radio isn't dead.",
"D. Canadians stranded in Haiti as violence escalates."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second to last news item in this segment?\nOption:\nA. CBC agrees to meeting on safer-supply drugs.\nB. Kate Middleton photo manipulation concern.\nC. Entrepreneur explains why local radio isn't dead.\nD. Canadians stranded in Haiti as violence escalates.\nAnswer with the option's letter from the given choices directly.",
2185,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "729-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2186,
"target": "B",
"doc": {
"video_id": "729",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=bBazRZOxl-k",
"videoID": "bBazRZOxl-k",
"question_id": "729-3",
"task_type": "Action Reasoning",
"question": "Based on the content of the video, what is the initial intention of B.C. having safer-supply drugs?",
"options": [
"A. Government collusion with drug dealers.",
"B. To avoid street drugs and save lives.",
"C. The profits from drugs are very high.",
"D. The government wants to promote drugs."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the content of the video, what is the initial intention of B.C. having safer-supply drugs?\nOption:\nA. Government collusion with drug dealers.\nB. To avoid street drugs and save lives.\nC. The profits from drugs are very high.\nD. The government wants to promote drugs.\nAnswer with the option's letter from the given choices directly.",
2186,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "729-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2187,
"target": "B",
"doc": {
"video_id": "730",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=1NH5dJ9VRvU",
"videoID": "1NH5dJ9VRvU",
"question_id": "730-1",
"task_type": "Object Recognition",
"question": "Which item is not used in the video where the guest demonstrates making a solar eclipse observing tool?",
"options": [
"A. White paper.",
"B. Ruler.",
"C. Small nail.",
"D. Shoebox."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item is not used in the video where the guest demonstrates making a solar eclipse observing tool?\nOption:\nA. White paper.\nB. Ruler.\nC. Small nail.\nD. Shoebox.\nAnswer with the option's letter from the given choices directly.",
2187,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "730-1",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2188,
"target": "B",
"doc": {
"video_id": "730",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=1NH5dJ9VRvU",
"videoID": "1NH5dJ9VRvU",
"question_id": "730-2",
"task_type": "Information Synopsis",
"question": "Which of the following events was not reported in this news video?",
"options": [
"A. Boxer lends brain to trauma research.",
"B. Russia experienced a terrorist attack.",
"C. Gaza's Al-Shifa Hospital in ruins.",
"D. Protests over carbon tax increase."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following events was not reported in this news video?\nOption:\nA. Boxer lends brain to trauma research.\nB. Russia experienced a terrorist attack.\nC. Gaza's Al-Shifa Hospital in ruins.\nD. Protests over carbon tax increase.\nAnswer with the option's letter from the given choices directly.",
2188,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "730-2",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2189,
"target": "A",
"doc": {
"video_id": "730",
"duration": "long",
"domain": "Film & Television",
"sub_category": "News Report",
"url": "https://www.youtube.com/watch?v=1NH5dJ9VRvU",
"videoID": "1NH5dJ9VRvU",
"question_id": "730-3",
"task_type": "Temporal Reasoning",
"question": "What is the second news item reported in this news video?",
"options": [
"A. Federal politicians get a 4.4% raise.",
"B. Boxer lends brain to trauma research.",
"C. Trudeau pledges $1B for school food program.",
"D. Protests over carbon tax increase."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the second news item reported in this news video?\nOption:\nA. Federal politicians get a 4.4% raise.\nB. Boxer lends brain to trauma research.\nC. Trudeau pledges $1B for school food program.\nD. Protests over carbon tax increase.\nAnswer with the option's letter from the given choices directly.",
2189,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "730-3",
"duration": "long",
"category": "Film & Television",
"sub_category": "News Report",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2190,
"target": "C",
"doc": {
"video_id": "731",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=S_5vlPXLCRc",
"videoID": "S_5vlPXLCRc",
"question_id": "731-1",
"task_type": "Action Reasoning",
"question": "How about the pick strategy of team T1 as shown in the video?",
"options": [
"A. It is difficult for them to get an advantage at the beginning of the game.",
"B. They should avoid battle at the beginning of the game.",
"C. They are formidable at the beginning of this game.",
"D. It is easy for them to win the game when they step into the late stage of the game."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How about the pick strategy of team T1 as shown in the video?\nOption:\nA. It is difficult for them to get an advantage at the beginning of the game.\nB. They should avoid battle at the beginning of the game.\nC. They are formidable at the beginning of this game.\nD. It is easy for them to win the game when they step into the late stage of the game.\nAnswer with the option's letter from the given choices directly.",
2190,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "731-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2191,
"target": "D",
"doc": {
"video_id": "731",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=S_5vlPXLCRc",
"videoID": "S_5vlPXLCRc",
"question_id": "731-2",
"task_type": "Action Reasoning",
"question": "What should team T1 do if they want to win this game?",
"options": [
"A. They should accumulate gold as long as possible.",
"B. They should defend.",
"C. They should endure when the enemy fights with them.",
"D. They should attack competitors positively."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What should team T1 do if they want to win this game?\nOption:\nA. They should accumulate gold as long as possible.\nB. They should defend.\nC. They should endure when the enemy fights with them.\nD. They should attack competitors positively.\nAnswer with the option's letter from the given choices directly.",
2191,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "731-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2192,
"target": "C",
"doc": {
"video_id": "731",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=S_5vlPXLCRc",
"videoID": "S_5vlPXLCRc",
"question_id": "731-3",
"task_type": "Action Reasoning",
"question": "Why does Weiwei go to the tower of enemy at the time of 15:00?",
"options": [
"A. He does not want to play this game.",
"B. He chases the enemy to the tower.",
"C. He does not want the enemy to get more gold.",
"D. He pretends to surrender."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Weiwei go to the tower of enemy at the time of 15:00?\nOption:\nA. He does not want to play this game.\nB. He chases the enemy to the tower.\nC. He does not want the enemy to get more gold.\nD. He pretends to surrender.\nAnswer with the option's letter from the given choices directly.",
2192,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "731-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2193,
"target": "D",
"doc": {
"video_id": "732",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=xvQOEdwAgj0",
"videoID": "xvQOEdwAgj0",
"question_id": "732-1",
"task_type": "Action Recognition",
"question": "Which is the position that is banned mostly from the game?",
"options": [
"A. Top.",
"B. Jungle.",
"C. Mid.",
"D. Support."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is the position that is banned mostly from the game?\nOption:\nA. Top.\nB. Jungle.\nC. Mid.\nD. Support.\nAnswer with the option's letter from the given choices directly.",
2193,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "732-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2194,
"target": "B",
"doc": {
"video_id": "732",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=xvQOEdwAgj0",
"videoID": "xvQOEdwAgj0",
"question_id": "732-2",
"task_type": "Action Recognition",
"question": "As depicted in the video, how do the jungles of two teams start?",
"options": [
"A. One loses his lane.",
"B. At different sides.",
"C. At the same side.",
"D. At both top sides."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, how do the jungles of two teams start?\nOption:\nA. One loses his lane.\nB. At different sides.\nC. At the same side.\nD. At both top sides.\nAnswer with the option's letter from the given choices directly.",
2194,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "732-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2195,
"target": "B",
"doc": {
"video_id": "732",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=xvQOEdwAgj0",
"videoID": "xvQOEdwAgj0",
"question_id": "732-3",
"task_type": "Action Recognition",
"question": "How did the final battle unfold?",
"options": [
"A. Team JDG surrenders.",
"B. Team JDG defeats at the high ground.",
"C. Team TES pushes down all the towers of JDG.",
"D. Team TES intrudes the lane of JDG."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the final battle unfold?\nOption:\nA. Team JDG surrenders.\nB. Team JDG defeats at the high ground.\nC. Team TES pushes down all the towers of JDG.\nD. Team TES intrudes the lane of JDG.\nAnswer with the option's letter from the given choices directly.",
2195,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "732-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2196,
"target": "A",
"doc": {
"video_id": "733",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=EFQtbZQ74Eg",
"videoID": "EFQtbZQ74Eg",
"question_id": "733-1",
"task_type": "Action Reasoning",
"question": "Why did Rookie flash over the wall at the beginning of the game?",
"options": [
"A. He is surrounded by enemies.",
"B. He is in a hurry to go.",
"C. The cost of flash is low.",
"D. He wants to attack the enemy."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did Rookie flash over the wall at the beginning of the game?\nOption:\nA. He is surrounded by enemies.\nB. He is in a hurry to go.\nC. The cost of flash is low.\nD. He wants to attack the enemy.\nAnswer with the option's letter from the given choices directly.",
2196,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "733-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2197,
"target": "D",
"doc": {
"video_id": "733",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=EFQtbZQ74Eg",
"videoID": "EFQtbZQ74Eg",
"question_id": "733-2",
"task_type": "Action Reasoning",
"question": "How about the bottom lane at an early time?",
"options": [
"A. The blue side giveups attack.",
"B. The red side can farm easily.",
"C. It is peaceful for both sides.",
"D. The blue side is formidable."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How about the bottom lane at an early time?\nOption:\nA. The blue side giveups attack.\nB. The red side can farm easily.\nC. It is peaceful for both sides.\nD. The blue side is formidable.\nAnswer with the option's letter from the given choices directly.",
2197,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "733-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2198,
"target": "C",
"doc": {
"video_id": "733",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=EFQtbZQ74Eg",
"videoID": "EFQtbZQ74Eg",
"question_id": "733-3",
"task_type": "Object Recognition",
"question": "Which team has the most significant advantage in this video?",
"options": [
"A. Both.",
"B. NIP.",
"C. BLG.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team has the most significant advantage in this video?\nOption:\nA. Both.\nB. NIP.\nC. BLG.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2198,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "733-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2199,
"target": "B",
"doc": {
"video_id": "734",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=rj45Omrq6Pk",
"videoID": "rj45Omrq6Pk",
"question_id": "734-1",
"task_type": "Action Reasoning",
"question": "Why do the hosts talk about lemon?",
"options": [
"A. They want to eat lemon for a break.",
"B. A person wants a host to eat lemon.",
"C. The audience wants to eat lemon.",
"D. A host eats lemon every night for training."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do the hosts talk about lemon?\nOption:\nA. They want to eat lemon for a break.\nB. A person wants a host to eat lemon.\nC. The audience wants to eat lemon.\nD. A host eats lemon every night for training.\nAnswer with the option's letter from the given choices directly.",
2199,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "734-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2200,
"target": "C",
"doc": {
"video_id": "734",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=rj45Omrq6Pk",
"videoID": "rj45Omrq6Pk",
"question_id": "734-2",
"task_type": "Action Reasoning",
"question": "Which lane of the blue team has the biggest advantage in the first ten minutes?",
"options": [
"A. Top.",
"B. Jungle.",
"C. Bottom.",
"D. Mid."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which lane of the blue team has the biggest advantage in the first ten minutes?\nOption:\nA. Top.\nB. Jungle.\nC. Bottom.\nD. Mid.\nAnswer with the option's letter from the given choices directly.",
2200,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "734-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2201,
"target": "D",
"doc": {
"video_id": "734",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=rj45Omrq6Pk",
"videoID": "rj45Omrq6Pk",
"question_id": "734-3",
"task_type": "Counting Problem",
"question": "Who has the maximum kill streak according to the video?",
"options": [
"A. Buslo.",
"B. Massu.",
"C. Berserker.",
"D. Jensen."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who has the maximum kill streak according to the video?\nOption:\nA. Buslo.\nB. Massu.\nC. Berserker.\nD. Jensen.\nAnswer with the option's letter from the given choices directly.",
2201,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "734-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2202,
"target": "A",
"doc": {
"video_id": "735",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=ot4P2G1MRLw",
"videoID": "ot4P2G1MRLw",
"question_id": "735-1",
"task_type": "Object Recognition",
"question": "Which position of team WBG is stronger at the beginning of the game?",
"options": [
"A. Top and jungle.",
"B. Mid and ADC.",
"C. ADC and support.",
"D. Support and top."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which position of team WBG is stronger at the beginning of the game?\nOption:\nA. Top and jungle.\nB. Mid and ADC.\nC. ADC and support.\nD. Support and top.\nAnswer with the option's letter from the given choices directly.",
2202,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "735-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2203,
"target": "A",
"doc": {
"video_id": "735",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=ot4P2G1MRLw",
"videoID": "ot4P2G1MRLw",
"question_id": "735-2",
"task_type": "Action Recognition",
"question": "How does team WBG win the battle at the time of 17:10?",
"options": [
"A. Zdz uses TP and kills Zika with teammates.",
"B. Team LNG loses the view of enemies.",
"C. Team LNG doesn't use TP.",
"D. Team LNG lacks teammates."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does team WBG win the battle at the time of 17:10?\nOption:\nA. Zdz uses TP and kills Zika with teammates.\nB. Team LNG loses the view of enemies.\nC. Team LNG doesn't use TP.\nD. Team LNG lacks teammates.\nAnswer with the option's letter from the given choices directly.",
2203,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "735-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2204,
"target": "A",
"doc": {
"video_id": "735",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=ot4P2G1MRLw",
"videoID": "ot4P2G1MRLw",
"question_id": "735-3",
"task_type": "Action Reasoning",
"question": "How is the atmosphere in team WBG according to this video?",
"options": [
"A. They trust each other.",
"B. Their cooperation is not well.",
"C. They are nervous.",
"D. They are passive."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the atmosphere in team WBG according to this video?\nOption:\nA. They trust each other.\nB. Their cooperation is not well.\nC. They are nervous.\nD. They are passive.\nAnswer with the option's letter from the given choices directly.",
2204,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "735-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2205,
"target": "C",
"doc": {
"video_id": "736",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=-y1U0qk1vDs",
"videoID": "-y1U0qk1vDs",
"question_id": "736-1",
"task_type": "Action Recognition",
"question": "What happens when it is at ban/pick time?",
"options": [
"A. Team TL is banned.",
"B. Team FLY gains a more ban.",
"C. Team FLY doesn't ban a hero at its turn.",
"D. Team TL misses a ban."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happens when it is at ban/pick time?\nOption:\nA. Team TL is banned.\nB. Team FLY gains a more ban.\nC. Team FLY doesn't ban a hero at its turn.\nD. Team TL misses a ban.\nAnswer with the option's letter from the given choices directly.",
2205,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "736-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2206,
"target": "B",
"doc": {
"video_id": "736",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=-y1U0qk1vDs",
"videoID": "-y1U0qk1vDs",
"question_id": "736-2",
"task_type": "Action Reasoning",
"question": "What strategy should team FLY take?",
"options": [
"A. Intruding the jungle lane of enemies at the beginning of the game.",
"B. Avoiding battle until late stage.",
"C. Attacking enemies positively for more resources.",
"D. Attacking enemies as early as possible."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What strategy should team FLY take?\nOption:\nA. Intruding the jungle lane of enemies at the beginning of the game.\nB. Avoiding battle until late stage.\nC. Attacking enemies positively for more resources.\nD. Attacking enemies as early as possible.\nAnswer with the option's letter from the given choices directly.",
2206,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "736-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2207,
"target": "D",
"doc": {
"video_id": "736",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=-y1U0qk1vDs",
"videoID": "-y1U0qk1vDs",
"question_id": "736-3",
"task_type": "Action Reasoning",
"question": "Who will lead their team to victory in this game?",
"options": [
"A. Inspired by team FLY.",
"B. Massu from team FLY.",
"C. Yeon from team TL.",
"D. APA from team TL."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who will lead their team to victory in this game?\nOption:\nA. Inspired by team FLY.\nB. Massu from team FLY.\nC. Yeon from team TL.\nD. APA from team TL.\nAnswer with the option's letter from the given choices directly.",
2207,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "736-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2208,
"target": "A",
"doc": {
"video_id": "737",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=HT2iYl6E_Vo",
"videoID": "HT2iYl6E_Vo",
"question_id": "737-1",
"task_type": "Action Reasoning",
"question": "What are the comparative strengths of each team?",
"options": [
"A. The level is comparable.",
"B. Team TL has a higher level.",
"C. Team G2 has a higher level.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the comparative strengths of each team?\nOption:\nA. The level is comparable.\nB. Team TL has a higher level.\nC. Team G2 has a higher level.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2208,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "737-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2209,
"target": "D",
"doc": {
"video_id": "737",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=HT2iYl6E_Vo",
"videoID": "HT2iYl6E_Vo",
"question_id": "737-2",
"task_type": "Action Recognition",
"question": "Which type of grenade was not featured in the video?",
"options": [
"A. Fire.",
"B. Smoke.",
"C. Flash.",
"D. Gas."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which type of grenade was not featured in the video?\nOption:\nA. Fire.\nB. Smoke.\nC. Flash.\nD. Gas.\nAnswer with the option's letter from the given choices directly.",
2209,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "737-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2210,
"target": "A",
"doc": {
"video_id": "737",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=HT2iYl6E_Vo",
"videoID": "HT2iYl6E_Vo",
"question_id": "737-3",
"task_type": "Object Recognition",
"question": "As depicted in the video, which team gets the final victory?",
"options": [
"A. Team TL.",
"B. Team G2.",
"C. Both.",
"D. No one."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, which team gets the final victory?\nOption:\nA. Team TL.\nB. Team G2.\nC. Both.\nD. No one.\nAnswer with the option's letter from the given choices directly.",
2210,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "737-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2211,
"target": "D",
"doc": {
"video_id": "738",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=S8w44Tl5atk",
"videoID": "S8w44Tl5atk",
"question_id": "738-1",
"task_type": "Action Reasoning",
"question": "Why does the player take the five cards at the beginning of this video?",
"options": [
"A. These cards are powerful.",
"B. These cards are useless.",
"C. These cards are all pirates.",
"D. These cards cooperate with each other."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the player take the five cards at the beginning of this video?\nOption:\nA. These cards are powerful.\nB. These cards are useless.\nC. These cards are all pirates.\nD. These cards cooperate with each other.\nAnswer with the option's letter from the given choices directly.",
2211,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "738-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2212,
"target": "C",
"doc": {
"video_id": "738",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=S8w44Tl5atk",
"videoID": "S8w44Tl5atk",
"question_id": "738-2",
"task_type": "Object Recognition",
"question": "What type of cards does the play require at the start of the game?",
"options": [
"A. Coin card.",
"B. Pirate cards.",
"C. Weapon cards.",
"D. No one is correct above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What type of cards does the play require at the start of the game?\nOption:\nA. Coin card.\nB. Pirate cards.\nC. Weapon cards.\nD. No one is correct above.\nAnswer with the option's letter from the given choices directly.",
2212,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "738-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2213,
"target": "B",
"doc": {
"video_id": "738",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=S8w44Tl5atk",
"videoID": "S8w44Tl5atk",
"question_id": "738-3",
"task_type": "Object Recognition",
"question": "What is the occupation of the character that the player uses?",
"options": [
"A. Warrior.",
"B. Thief.",
"C. Priest.",
"D. Druid."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the occupation of the character that the player uses?\nOption:\nA. Warrior.\nB. Thief.\nC. Priest.\nD. Druid.\nAnswer with the option's letter from the given choices directly.",
2213,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "738-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2214,
"target": "B",
"doc": {
"video_id": "739",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=QdHBJ-NiaHM",
"videoID": "QdHBJ-NiaHM",
"question_id": "739-1",
"task_type": "Action Reasoning",
"question": "Why does the player send a message of well met?",
"options": [
"A. He is very courteous.",
"B. Opponent drags a powerful card of him onto the chessboard.",
"C. He doesn't want this outcome.",
"D. He sends the wrong message."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the player send a message of well met?\nOption:\nA. He is very courteous.\nB. Opponent drags a powerful card of him onto the chessboard.\nC. He doesn't want this outcome.\nD. He sends the wrong message.\nAnswer with the option's letter from the given choices directly.",
2214,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "739-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2215,
"target": "B",
"doc": {
"video_id": "739",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=QdHBJ-NiaHM",
"videoID": "QdHBJ-NiaHM",
"question_id": "739-2",
"task_type": "Object Recognition",
"question": "What is the occupation of the character the player controls?",
"options": [
"A. Thief.",
"B. Paladin.",
"C. Priest.",
"D. Warrior."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the occupation of the character the player controls?\nOption:\nA. Thief.\nB. Paladin.\nC. Priest.\nD. Warrior.\nAnswer with the option's letter from the given choices directly.",
2215,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "739-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2216,
"target": "D",
"doc": {
"video_id": "739",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=QdHBJ-NiaHM",
"videoID": "QdHBJ-NiaHM",
"question_id": "739-3",
"task_type": "Object Recognition",
"question": "Which cost of card has the fewest number?",
"options": [
"A. 2.",
"B. 3.",
"C. 5.",
"D. 6."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which cost of card has the fewest number?\nOption:\nA. 2.\nB. 3.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2216,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "739-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2217,
"target": "C",
"doc": {
"video_id": "740",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=YcU9d-RPEog",
"videoID": "YcU9d-RPEog",
"question_id": "740-1",
"task_type": "Action Recognition",
"question": "Which lane does the player start in at the beginning of the game?",
"options": [
"A. Top.",
"B. Jungle.",
"C. Mid.",
"D. Bottom."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which lane does the player start in at the beginning of the game?\nOption:\nA. Top.\nB. Jungle.\nC. Mid.\nD. Bottom.\nAnswer with the option's letter from the given choices directly.",
2217,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "740-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2218,
"target": "D",
"doc": {
"video_id": "740",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=YcU9d-RPEog",
"videoID": "YcU9d-RPEog",
"question_id": "740-2",
"task_type": "Action Recognition",
"question": "How does the player first die?",
"options": [
"A. He is greedy for gold.",
"B. He leaves the towers too far.",
"C. He can't hold the pressure of the opponent.",
"D. He helps teammates in the jungle."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the player first die?\nOption:\nA. He is greedy for gold.\nB. He leaves the towers too far.\nC. He can't hold the pressure of the opponent.\nD. He helps teammates in the jungle.\nAnswer with the option's letter from the given choices directly.",
2218,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "740-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2219,
"target": "A",
"doc": {
"video_id": "740",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Esports",
"url": "https://www.youtube.com/watch?v=YcU9d-RPEog",
"videoID": "YcU9d-RPEog",
"question_id": "740-3",
"task_type": "Action Reasoning",
"question": "What is the strategy of the player?",
"options": [
"A. Becoming more powerful with the growth of AP.",
"B. Carrying the game at the beginning.",
"C. Supporting teammates as more as possible.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the strategy of the player?\nOption:\nA. Becoming more powerful with the growth of AP.\nB. Carrying the game at the beginning.\nC. Supporting teammates as more as possible.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2219,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "740-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Esports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2220,
"target": "B",
"doc": {
"video_id": "741",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=U9-Qy3H_UBY",
"videoID": "U9-Qy3H_UBY",
"question_id": "741-1",
"task_type": "Information Synopsis",
"question": "What does this video document?",
"options": [
"A. The first day of moving with friends.",
"B. The first day of being a college basketball player.",
"C. Attending a basketball game with the team.",
"D. The first day of coaching the basketball team."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video document?\nOption:\nA. The first day of moving with friends.\nB. The first day of being a college basketball player.\nC. Attending a basketball game with the team.\nD. The first day of coaching the basketball team.\nAnswer with the option's letter from the given choices directly.",
2220,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "741-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2221,
"target": "A",
"doc": {
"video_id": "741",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=U9-Qy3H_UBY",
"videoID": "U9-Qy3H_UBY",
"question_id": "741-2",
"task_type": "Object Reasoning",
"question": "What character is the person speaking to the group at the beginning of the video?",
"options": [
"A. Basketball Coach.",
"B. Referee.",
"C. University Professor.",
"D. Director."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What character is the person speaking to the group at the beginning of the video?\nOption:\nA. Basketball Coach.\nB. Referee.\nC. University Professor.\nD. Director.\nAnswer with the option's letter from the given choices directly.",
2221,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "741-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2222,
"target": "C",
"doc": {
"video_id": "741",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=U9-Qy3H_UBY",
"videoID": "U9-Qy3H_UBY",
"question_id": "741-3",
"task_type": "Counting Problem",
"question": "How many 3-pointers does the college freshman protagonist in the video hit in a practice game?",
"options": [
"A. 3.",
"B. 4.",
"C. 2.",
"D. 1."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many 3-pointers does the college freshman protagonist in the video hit in a practice game?\nOption:\nA. 3.\nB. 4.\nC. 2.\nD. 1.\nAnswer with the option's letter from the given choices directly.",
2222,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "741-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2223,
"target": "A",
"doc": {
"video_id": "742",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=cy40DIzOUow",
"videoID": "cy40DIzOUow",
"question_id": "742-1",
"task_type": "Object Reasoning",
"question": "What is the role of the man on the sidelines wearing a gray top with studs talking to the camera after the game starts?",
"options": [
"A. The home team's coach.",
"B. Coach of the purple team.",
"C. Substitutes for the white team.",
"D. Scorekeeper."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the man on the sidelines wearing a gray top with studs talking to the camera after the game starts?\nOption:\nA. The home team's coach.\nB. Coach of the purple team.\nC. Substitutes for the white team.\nD. Scorekeeper.\nAnswer with the option's letter from the given choices directly.",
2223,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "742-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2224,
"target": "C",
"doc": {
"video_id": "742",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=cy40DIzOUow",
"videoID": "cy40DIzOUow",
"question_id": "742-2",
"task_type": "Object Reasoning",
"question": "Who is the home team's top scorer?",
"options": [
"A. Player number 34.",
"B. Player number 22.",
"C. Player number 15.",
"D. Player number 10."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the home team's top scorer?\nOption:\nA. Player number 34.\nB. Player number 22.\nC. Player number 15.\nD. Player number 10.\nAnswer with the option's letter from the given choices directly.",
2224,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "742-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2225,
"target": "B",
"doc": {
"video_id": "742",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=cy40DIzOUow",
"videoID": "cy40DIzOUow",
"question_id": "742-3",
"task_type": "Temporal Perception",
"question": "In which period does the home team overtake the guest team?",
"options": [
"A. 12:56 - 8:13.",
"B. 5:58 - 2:57.",
"C. 8:13 - 5:58.",
"D. 13:10 - 10:37."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which period does the home team overtake the guest team?\nOption:\nA. 12:56 - 8:13.\nB. 5:58 - 2:57.\nC. 8:13 - 5:58.\nD. 13:10 - 10:37.\nAnswer with the option's letter from the given choices directly.",
2225,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "742-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2226,
"target": "D",
"doc": {
"video_id": "743",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=-QuCz7kxBr8",
"videoID": "-QuCz7kxBr8",
"question_id": "743-1",
"task_type": "Object Reasoning",
"question": "What's the least flattering of the gadgets featured in the video?",
"options": [
"A. Flick glove.",
"B. Dribble sleeve.",
"C. Straight shooter strap.",
"D. Pack of 3 basketball shooting aid."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the least flattering of the gadgets featured in the video?\nOption:\nA. Flick glove.\nB. Dribble sleeve.\nC. Straight shooter strap.\nD. Pack of 3 basketball shooting aid.\nAnswer with the option's letter from the given choices directly.",
2226,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "743-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2227,
"target": "A",
"doc": {
"video_id": "743",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=-QuCz7kxBr8",
"videoID": "-QuCz7kxBr8",
"question_id": "743-2",
"task_type": "Counting Problem",
"question": "How many valid goals does Jesse score in the final one-two session?",
"options": [
"A. 1.",
"B. 2.",
"C. 0.",
"D. 3."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many valid goals does Jesse score in the final one-two session?\nOption:\nA. 1.\nB. 2.\nC. 0.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
2227,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "743-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2228,
"target": "B",
"doc": {
"video_id": "743",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=-QuCz7kxBr8",
"videoID": "-QuCz7kxBr8",
"question_id": "743-3",
"task_type": "Action Reasoning",
"question": "What tool in the video had the least impact on the player's ability to complete the challenge?",
"options": [
"A. Dribble stick.",
"B. Power hands.",
"C. Flick glove.",
"D. Dribble sleeve."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What tool in the video had the least impact on the player's ability to complete the challenge?\nOption:\nA. Dribble stick.\nB. Power hands.\nC. Flick glove.\nD. Dribble sleeve.\nAnswer with the option's letter from the given choices directly.",
2228,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "743-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2229,
"target": "C",
"doc": {
"video_id": "744",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=4od5mdtEp-E",
"videoID": "4od5mdtEp-E",
"question_id": "744-1",
"task_type": "Object Reasoning",
"question": "Why is Jesse absent from the first game?",
"options": [
"A. The game does not go to garbage time.",
"B. He's not a player on the team.",
"C. He is not up to the coach's standards.",
"D. He's afraid to play."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is Jesse absent from the first game?\nOption:\nA. The game does not go to garbage time.\nB. He's not a player on the team.\nC. He is not up to the coach's standards.\nD. He's afraid to play.\nAnswer with the option's letter from the given choices directly.",
2229,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "744-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2230,
"target": "B",
"doc": {
"video_id": "744",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=4od5mdtEp-E",
"videoID": "4od5mdtEp-E",
"question_id": "744-2",
"task_type": "Object Reasoning",
"question": "Why doesn't Jesse make it past the Ignite team to the tryouts?",
"options": [
"A. He doesn't shoot as well as John Jenkins.",
"B. He doesn't have the physical talent or basketball skills to be a pro.",
"C. He doesn't have the right attitude.",
"D. He ranks dead last in upper body strength on the Cavaliers."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why doesn't Jesse make it past the Ignite team to the tryouts?\nOption:\nA. He doesn't shoot as well as John Jenkins.\nB. He doesn't have the physical talent or basketball skills to be a pro.\nC. He doesn't have the right attitude.\nD. He ranks dead last in upper body strength on the Cavaliers.\nAnswer with the option's letter from the given choices directly.",
2230,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "744-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2231,
"target": "C",
"doc": {
"video_id": "744",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=4od5mdtEp-E",
"videoID": "4od5mdtEp-E",
"question_id": "744-3",
"task_type": "Temporal Reasoning",
"question": "On which day after Jesse's tryout does the Ignite suffer a crushing defeat?",
"options": [
"A. Day 5.",
"B. Last day.",
"C. Day 3.",
"D. First day."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: On which day after Jesse's tryout does the Ignite suffer a crushing defeat?\nOption:\nA. Day 5.\nB. Last day.\nC. Day 3.\nD. First day.\nAnswer with the option's letter from the given choices directly.",
2231,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "744-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2232,
"target": "B",
"doc": {
"video_id": "745",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=mRgstQ4X5d4",
"videoID": "mRgstQ4X5d4",
"question_id": "745-1",
"task_type": "Action Recognition",
"question": "What does Cameron achieve in his debut practice game in Las Vegas?",
"options": [
"A. Breaking the three-point record.",
"B. Hit the shutout ball.",
"C. Pull down a lot of rebounds.",
"D. Giving away a fatal cap to the opponent."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does Cameron achieve in his debut practice game in Las Vegas?\nOption:\nA. Breaking the three-point record.\nB. Hit the shutout ball.\nC. Pull down a lot of rebounds.\nD. Giving away a fatal cap to the opponent.\nAnswer with the option's letter from the given choices directly.",
2232,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "745-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2233,
"target": "C",
"doc": {
"video_id": "745",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=mRgstQ4X5d4",
"videoID": "mRgstQ4X5d4",
"question_id": "745-2",
"task_type": "Information Synopsis",
"question": "What story does this documentary tell?",
"options": [
"A. Marginal pro basketball players rebound through basketball exposure camps.",
"B. How marginalized professional basketball players make a living.",
"C. Fringe pros hope to buck the trend with basketball exposure camps but are unsuccessful.",
"D. Pros Experience Basketball Exposure Camp."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What story does this documentary tell?\nOption:\nA. Marginal pro basketball players rebound through basketball exposure camps.\nB. How marginalized professional basketball players make a living.\nC. Fringe pros hope to buck the trend with basketball exposure camps but are unsuccessful.\nD. Pros Experience Basketball Exposure Camp.\nAnswer with the option's letter from the given choices directly.",
2233,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "745-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2234,
"target": "A",
"doc": {
"video_id": "745",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=mRgstQ4X5d4",
"videoID": "mRgstQ4X5d4",
"question_id": "745-3",
"task_type": "Object Reasoning",
"question": "Which of the following descriptions best summarizes Cameron?",
"options": [
"A. Prioritizing family while harboring dreams.",
"B. Doing whatever it takes to make his dreams come true.",
"C. An unstable mind on the road to his dreams leads to failure.",
"D. Going through the motions and getting by."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following descriptions best summarizes Cameron?\nOption:\nA. Prioritizing family while harboring dreams.\nB. Doing whatever it takes to make his dreams come true.\nC. An unstable mind on the road to his dreams leads to failure.\nD. Going through the motions and getting by.\nAnswer with the option's letter from the given choices directly.",
2234,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "745-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2235,
"target": "C",
"doc": {
"video_id": "746",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=kg6L6LL523Q",
"videoID": "kg6L6LL523Q",
"question_id": "746-1",
"task_type": "Counting Problem",
"question": "After buying a cup of coffee, how many challenges does the man complete?",
"options": [
"A. 5.",
"B. 7.",
"C. 6.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After buying a cup of coffee, how many challenges does the man complete?\nOption:\nA. 5.\nB. 7.\nC. 6.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2235,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "746-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2236,
"target": "B",
"doc": {
"video_id": "746",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=kg6L6LL523Q",
"videoID": "kg6L6LL523Q",
"question_id": "746-2",
"task_type": "Action Reasoning",
"question": "What are the rules for the initial 2v2 basketball mini-game?",
"options": [
"A. The first person to hit 7 balls in a row wins.",
"B. If a person misses a hit, the total number of previous consecutive hits is added to his score, and they are out if their score reaches 7.",
"C. The first to 7 hits wins.",
"D. This is the famous H.O.R.S.E. game."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the rules for the initial 2v2 basketball mini-game?\nOption:\nA. The first person to hit 7 balls in a row wins.\nB. If a person misses a hit, the total number of previous consecutive hits is added to his score, and they are out if their score reaches 7.\nC. The first to 7 hits wins.\nD. This is the famous H.O.R.S.E. game.\nAnswer with the option's letter from the given choices directly.",
2236,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "746-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2237,
"target": "B",
"doc": {
"video_id": "746",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=kg6L6LL523Q",
"videoID": "kg6L6LL523Q",
"question_id": "746-3",
"task_type": "Action Reasoning",
"question": "Which task does the man continue to do from day one to day two?",
"options": [
"A. Get a girl's phone number.",
"B. Hit 500 shots.",
"C. Hit five half-court shots for the second straight day.",
"D. Do a lift."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which task does the man continue to do from day one to day two?\nOption:\nA. Get a girl's phone number.\nB. Hit 500 shots.\nC. Hit five half-court shots for the second straight day.\nD. Do a lift.\nAnswer with the option's letter from the given choices directly.",
2237,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "746-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2238,
"target": "D",
"doc": {
"video_id": "747",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=4qYqPmIO0v0",
"videoID": "4qYqPmIO0v0",
"question_id": "747-1",
"task_type": "Action Recognition",
"question": "What happened to the team on the counterattack after Sabonis' first steal?",
"options": [
"A. Sabonis finishes the pick-and-roll.",
"B. A teammate hit a three.",
"C. Sabonis pass is broken up.",
"D. Teammate singles and fails."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the team on the counterattack after Sabonis' first steal?\nOption:\nA. Sabonis finishes the pick-and-roll.\nB. A teammate hit a three.\nC. Sabonis pass is broken up.\nD. Teammate singles and fails.\nAnswer with the option's letter from the given choices directly.",
2238,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "747-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2239,
"target": "A",
"doc": {
"video_id": "747",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=4qYqPmIO0v0",
"videoID": "4qYqPmIO0v0",
"question_id": "747-2",
"task_type": "Counting Problem",
"question": "How many athletic goals did LIT score from the last technical statistic on shooting percentage before halftime to the end of the half?",
"options": [
"A. 4.",
"B. 2.",
"C. 1.",
"D. 6."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletic goals did LIT score from the last technical statistic on shooting percentage before halftime to the end of the half?\nOption:\nA. 4.\nB. 2.\nC. 1.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2239,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "747-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2240,
"target": "D",
"doc": {
"video_id": "747",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=4qYqPmIO0v0",
"videoID": "4qYqPmIO0v0",
"question_id": "747-3",
"task_type": "Counting Problem",
"question": "How many timeouts does the HUN actively consume throughout the game?",
"options": [
"A. 2.",
"B. 5.",
"C. 7.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many timeouts does the HUN actively consume throughout the game?\nOption:\nA. 2.\nB. 5.\nC. 7.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
2240,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "747-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2241,
"target": "A",
"doc": {
"video_id": "748",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=mDxrdWZOTpM",
"videoID": "mDxrdWZOTpM",
"question_id": "748-1",
"task_type": "Information Synopsis",
"question": "What is the primary focus of this video?",
"options": [
"A. The history of the FIBA basketball world cup.",
"B. Team USA's road to the FIBA championship.",
"C. Countries' views on FIBA.",
"D. A history of the development of basketball in the world."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of this video?\nOption:\nA. The history of the FIBA basketball world cup.\nB. Team USA's road to the FIBA championship.\nC. Countries' views on FIBA.\nD. A history of the development of basketball in the world.\nAnswer with the option's letter from the given choices directly.",
2241,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "748-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2242,
"target": "C",
"doc": {
"video_id": "748",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=mDxrdWZOTpM",
"videoID": "mDxrdWZOTpM",
"question_id": "748-2",
"task_type": "Object Reasoning",
"question": "What led to the decline of the Soviet men's basketball team's dominance after 1990?",
"options": [
"A. The Rise of the Yugoslavian Men's Basketball Team.",
"B. The United States defeated the Soviet Union in 1988.",
"C. Sabonis could not represent the Soviet Union.",
"D. Coach K coaches the U.S. men's basketball team."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What led to the decline of the Soviet men's basketball team's dominance after 1990?\nOption:\nA. The Rise of the Yugoslavian Men's Basketball Team.\nB. The United States defeated the Soviet Union in 1988.\nC. Sabonis could not represent the Soviet Union.\nD. Coach K coaches the U.S. men's basketball team.\nAnswer with the option's letter from the given choices directly.",
2242,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "748-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2243,
"target": "C",
"doc": {
"video_id": "748",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=mDxrdWZOTpM",
"videoID": "mDxrdWZOTpM",
"question_id": "748-3",
"task_type": "Object Reasoning",
"question": "Why did Pau Gasol say he was willing to break his leg?",
"options": [
"A. He is physically and mentally exhausted as the Spanish talisman.",
"B. Because in doing so, opponents will slack off because he's unavailable.",
"C. Because the whole Spanish team will go all out for it and win the title.",
"D. Because his teammates can then take the field in name only."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did Pau Gasol say he was willing to break his leg?\nOption:\nA. He is physically and mentally exhausted as the Spanish talisman.\nB. Because in doing so, opponents will slack off because he's unavailable.\nC. Because the whole Spanish team will go all out for it and win the title.\nD. Because his teammates can then take the field in name only.\nAnswer with the option's letter from the given choices directly.",
2243,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "748-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2244,
"target": "B",
"doc": {
"video_id": "749",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=EeTmrsVW8qE",
"videoID": "EeTmrsVW8qE",
"question_id": "749-1",
"task_type": "Object Reasoning",
"question": "Why is Rakija introduced at the beginning of the video?",
"options": [
"A. Rakaji's boss is the sponsor of the documentary.",
"B. To map Serbian basketball culture.",
"C. Rakaji is the national drink of Serbia.",
"D. Serbs like to drink alcohol while playing basketball."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is Rakija introduced at the beginning of the video?\nOption:\nA. Rakaji's boss is the sponsor of the documentary.\nB. To map Serbian basketball culture.\nC. Rakaji is the national drink of Serbia.\nD. Serbs like to drink alcohol while playing basketball.\nAnswer with the option's letter from the given choices directly.",
2244,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "749-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2245,
"target": "A",
"doc": {
"video_id": "749",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=EeTmrsVW8qE",
"videoID": "EeTmrsVW8qE",
"question_id": "749-2",
"task_type": "Object Reasoning",
"question": "Why does the video narrator think Kresimir Cosic's traits sound familiar?",
"options": [
"A. Jokic also has center size and guard skills.",
"B. He is as legendary as Radiovoj Korac.",
"C. Because his skill moves are closer to modern basketball.",
"D. Because he led the establishment of the golden generation of Yugoslavian basketball."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the video narrator think Kresimir Cosic's traits sound familiar?\nOption:\nA. Jokic also has center size and guard skills.\nB. He is as legendary as Radiovoj Korac.\nC. Because his skill moves are closer to modern basketball.\nD. Because he led the establishment of the golden generation of Yugoslavian basketball.\nAnswer with the option's letter from the given choices directly.",
2245,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "749-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2246,
"target": "D",
"doc": {
"video_id": "749",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=EeTmrsVW8qE",
"videoID": "EeTmrsVW8qE",
"question_id": "749-3",
"task_type": "Temporal Reasoning",
"question": "What is presented successively in the video?",
"options": [
"A. Golden Generation, Derby Battle, New Generation.",
"B. New Generation, Derby Battle, Golden Generation.",
"C. The New Generation, the Golden Generation, and the Derby Battle.",
"D. Golden Generation, New Generation, Derby Battle."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is presented successively in the video?\nOption:\nA. Golden Generation, Derby Battle, New Generation.\nB. New Generation, Derby Battle, Golden Generation.\nC. The New Generation, the Golden Generation, and the Derby Battle.\nD. Golden Generation, New Generation, Derby Battle.\nAnswer with the option's letter from the given choices directly.",
2246,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "749-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2247,
"target": "B",
"doc": {
"video_id": "750",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=sxrx7oCrb3A",
"videoID": "sxrx7oCrb3A",
"question_id": "750-1",
"task_type": "Action Recognition",
"question": "What game do they play to discover that Wiggins shoots 44.8% from the field in the game of Wiggins guessing?",
"options": [
"A. Singles.",
"B. Blindfolded shooting.",
"C. Three-point matchup.",
"D. Timed shot."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What game do they play to discover that Wiggins shoots 44.8% from the field in the game of Wiggins guessing?\nOption:\nA. Singles.\nB. Blindfolded shooting.\nC. Three-point matchup.\nD. Timed shot.\nAnswer with the option's letter from the given choices directly.",
2247,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "750-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2248,
"target": "A",
"doc": {
"video_id": "750",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=sxrx7oCrb3A",
"videoID": "sxrx7oCrb3A",
"question_id": "750-2",
"task_type": "Action Recognition",
"question": "In the Guess Nash matchup, which game has similar rules to the third game in Guess Wiggins?",
"options": [
"A. Third game.",
"B. First game.",
"C. Second game.",
"D. Fourth game."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the Guess Nash matchup, which game has similar rules to the third game in Guess Wiggins?\nOption:\nA. Third game.\nB. First game.\nC. Second game.\nD. Fourth game.\nAnswer with the option's letter from the given choices directly.",
2248,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "750-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2249,
"target": "C",
"doc": {
"video_id": "750",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Basketball",
"url": "https://www.youtube.com/watch?v=sxrx7oCrb3A",
"videoID": "sxrx7oCrb3A",
"question_id": "750-3",
"task_type": "Action Reasoning",
"question": "What's the difference between guessing Dončić's game and guessing other players?",
"options": [
"A. Winning jerseys without autographs.",
"B. This player is not from the NBA.",
"C. Use NBA2K as a rules demo.",
"D. The game is more complex."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's the difference between guessing Dončić's game and guessing other players?\nOption:\nA. Winning jerseys without autographs.\nB. This player is not from the NBA.\nC. Use NBA2K as a rules demo.\nD. The game is more complex.\nAnswer with the option's letter from the given choices directly.",
2249,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "750-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Basketball",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2250,
"target": "B",
"doc": {
"video_id": "751",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=-yWE13LbFd4",
"videoID": "-yWE13LbFd4",
"question_id": "751-1",
"task_type": "Counting Problem",
"question": "How many goals did the number 9 of the blue team score in the match?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many goals did the number 9 of the blue team score in the match?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2250,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "751-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2251,
"target": "B",
"doc": {
"video_id": "751",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=-yWE13LbFd4",
"videoID": "-yWE13LbFd4",
"question_id": "751-2",
"task_type": "Counting Problem",
"question": "How many timeout substitutions were there during the game in the video?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many timeout substitutions were there during the game in the video?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2251,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "751-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2252,
"target": "C",
"doc": {
"video_id": "751",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=-yWE13LbFd4",
"videoID": "-yWE13LbFd4",
"question_id": "751-3",
"task_type": "Object Reasoning",
"question": "How did the match in the video progress in terms of scoring, leading up to its conclusion?",
"options": [
"A. The match was dominated by the blue team throughout, ending in a straightforward win without going to extra time.",
"B. The red team took an early lead, but the blue team caught up and maintained a steady lead, winning the match without needing extra time.",
"C. The match was closely contested, with the score tied at halftime and again during the second half, ultimately going to extra time where the red team won by penalties.",
"D. The game saw an unexpected comeback by the blue team in the extra time, securing a win through a last-minute goal."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did the match in the video progress in terms of scoring, leading up to its conclusion?\nOption:\nA. The match was dominated by the blue team throughout, ending in a straightforward win without going to extra time.\nB. The red team took an early lead, but the blue team caught up and maintained a steady lead, winning the match without needing extra time.\nC. The match was closely contested, with the score tied at halftime and again during the second half, ultimately going to extra time where the red team won by penalties.\nD. The game saw an unexpected comeback by the blue team in the extra time, securing a win through a last-minute goal.\nAnswer with the option's letter from the given choices directly.",
2252,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "751-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2253,
"target": "C",
"doc": {
"video_id": "752",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=bFuVjFacj_E",
"videoID": "bFuVjFacj_E",
"question_id": "752-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. The rivalry between Real Madrid and Barcelona during the early 2000s, including memorable matches and iconic moments.",
"B. An in-depth analysis of Barcelona's La Masia youth academy and its role in shaping the club's distinctive playing style and success during the early 2000s, emphasizing the importance of homegrown talent.",
"C. The transformation of Real Madrid into a global brand and the balance between business ambitions and footballing goals during the Galacticos era.",
"D. A chronological overview of Real Madrid's tactical evolution throughout the 21st century, focusing on the managers' strategies that led to their successes."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. The rivalry between Real Madrid and Barcelona during the early 2000s, including memorable matches and iconic moments.\nB. An in-depth analysis of Barcelona's La Masia youth academy and its role in shaping the club's distinctive playing style and success during the early 2000s, emphasizing the importance of homegrown talent.\nC. The transformation of Real Madrid into a global brand and the balance between business ambitions and footballing goals during the Galacticos era.\nD. A chronological overview of Real Madrid's tactical evolution throughout the 21st century, focusing on the managers' strategies that led to their successes.\nAnswer with the option's letter from the given choices directly.",
2253,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "752-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2254,
"target": "C",
"doc": {
"video_id": "752",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=bFuVjFacj_E",
"videoID": "bFuVjFacj_E",
"question_id": "752-2",
"task_type": "Object Reasoning",
"question": "Why was the man with glass selected as the president of the club according to the video?",
"options": [
"A. His promise to build a new stadium for the club.",
"B. His commitment to maintaining the club's traditional playing style.",
"C. His promise to bring famous football stars and financial assurances to the club.",
"D. His proven track record in successfully managing other sports organizations and his vision for international expansion of the club's brand."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why was the man with glass selected as the president of the club according to the video?\nOption:\nA. His promise to build a new stadium for the club.\nB. His commitment to maintaining the club's traditional playing style.\nC. His promise to bring famous football stars and financial assurances to the club.\nD. His proven track record in successfully managing other sports organizations and his vision for international expansion of the club's brand.\nAnswer with the option's letter from the given choices directly.",
2254,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "752-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2255,
"target": "D",
"doc": {
"video_id": "752",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=bFuVjFacj_E",
"videoID": "bFuVjFacj_E",
"question_id": "752-3",
"task_type": "Object Reasoning",
"question": "Why did Florentino Perez fail to create a well-rounded team capable of competing?",
"options": [
"A. An overemphasis on a defensive strategy, neglecting the offensive capabilities.",
"B. A strict adherence to playing homegrown talent at the expense of experienced players.",
"C. The decision to prioritize youth development over signing experienced stars.",
"D. The team's imbalance is due to the lack of a true defensive-minded midfielder."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did Florentino Perez fail to create a well-rounded team capable of competing?\nOption:\nA. An overemphasis on a defensive strategy, neglecting the offensive capabilities.\nB. A strict adherence to playing homegrown talent at the expense of experienced players.\nC. The decision to prioritize youth development over signing experienced stars.\nD. The team's imbalance is due to the lack of a true defensive-minded midfielder.\nAnswer with the option's letter from the given choices directly.",
2255,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "752-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2256,
"target": "C",
"doc": {
"video_id": "753",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=RwdvZLz-aSk",
"videoID": "RwdvZLz-aSk",
"question_id": "753-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. A compilation of José Mourinho's most controversial press conference moments.",
"B. A retrospective of José Mourinho's childhood and early life influences on his coaching style.",
"C. An exploration of the comprehensive career achievements and unique incidents in José Mourinho's football career.",
"D. A discussion on the influence of José Mourinho's managerial style on modern football tactics."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. A compilation of José Mourinho's most controversial press conference moments.\nB. A retrospective of José Mourinho's childhood and early life influences on his coaching style.\nC. An exploration of the comprehensive career achievements and unique incidents in José Mourinho's football career.\nD. A discussion on the influence of José Mourinho's managerial style on modern football tactics.\nAnswer with the option's letter from the given choices directly.",
2256,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "753-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2257,
"target": "A",
"doc": {
"video_id": "753",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=RwdvZLz-aSk",
"videoID": "RwdvZLz-aSk",
"question_id": "753-2",
"task_type": "Object Reasoning",
"question": "What is the video's evaluation of the main character's performance as a football coach?",
"options": [
"A. Highly successful, having won major European competitions with multiple clubs and league titles in four different countries.",
"B. Unsuccessful, with a career marked by frequent losses and lack of titles.",
"C. Mediocre, having little impact on the clubs he managed.",
"D. Innovative but without any significant silverware to his name."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video's evaluation of the main character's performance as a football coach?\nOption:\nA. Highly successful, having won major European competitions with multiple clubs and league titles in four different countries.\nB. Unsuccessful, with a career marked by frequent losses and lack of titles.\nC. Mediocre, having little impact on the clubs he managed.\nD. Innovative but without any significant silverware to his name.\nAnswer with the option's letter from the given choices directly.",
2257,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "753-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2258,
"target": "B",
"doc": {
"video_id": "753",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=RwdvZLz-aSk",
"videoID": "RwdvZLz-aSk",
"question_id": "753-3",
"task_type": "Object Reasoning",
"question": "What is the video's assessment of the main character based on the details provided in the video?",
"options": [
"A. Timid and reserved, lacking assertiveness and emotional depth in his approach.",
"B. Charismatic and confident, with a blend of emotional depth and tough love in his approach.",
"C. Soft-spoken and gentle, showing a nurturing attitude towards rivals and players.",
"D. Highly concerned about public perception, often seeking validation and approval from others."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video's assessment of the main character based on the details provided in the video?\nOption:\nA. Timid and reserved, lacking assertiveness and emotional depth in his approach.\nB. Charismatic and confident, with a blend of emotional depth and tough love in his approach.\nC. Soft-spoken and gentle, showing a nurturing attitude towards rivals and players.\nD. Highly concerned about public perception, often seeking validation and approval from others.\nAnswer with the option's letter from the given choices directly.",
2258,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "753-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2259,
"target": "B",
"doc": {
"video_id": "754",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=0k2ey_okQ4E",
"videoID": "0k2ey_okQ4E",
"question_id": "754-1",
"task_type": "Information Synopsis",
"question": "What is the primary focus of the video featuring Ronaldo?",
"options": [
"A. The personal life of Cristiano Ronaldo.",
"B. The skills and strengths that make Ronaldo one of the best footballers.",
"C. The training routine that Cristiano Ronaldo follows.",
"D. Ronaldo's journey from his childhood to becoming a professional footballer."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of the video featuring Ronaldo?\nOption:\nA. The personal life of Cristiano Ronaldo.\nB. The skills and strengths that make Ronaldo one of the best footballers.\nC. The training routine that Cristiano Ronaldo follows.\nD. Ronaldo's journey from his childhood to becoming a professional footballer.\nAnswer with the option's letter from the given choices directly.",
2259,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "754-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2260,
"target": "D",
"doc": {
"video_id": "754",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=0k2ey_okQ4E",
"videoID": "0k2ey_okQ4E",
"question_id": "754-2",
"task_type": "Action Reasoning",
"question": "Why is the main character's 25m straight-line sprint ability considered inferior to that of a professional sprinter in the video?",
"options": [
"A. Because his training primarily revolves around developing skills for basketball that emphasizes vertical leap and technique over linear speed, resulting in less dedicated practice for pure sprinting performance.",
"B. Due to his natural inclination towards endurance over explosive power, which is beneficial for long-duration sports but not sprinting.",
"C. His genetic makeup predisposes him to slower twitch muscle fibers, inherently limiting his top speed potential.",
"D. Because his training prioritizes football skills rather than pure speed, leading to a disparity in specialized speed training routines."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is the main character's 25m straight-line sprint ability considered inferior to that of a professional sprinter in the video?\nOption:\nA. Because his training primarily revolves around developing skills for basketball that emphasizes vertical leap and technique over linear speed, resulting in less dedicated practice for pure sprinting performance.\nB. Due to his natural inclination towards endurance over explosive power, which is beneficial for long-duration sports but not sprinting.\nC. His genetic makeup predisposes him to slower twitch muscle fibers, inherently limiting his top speed potential.\nD. Because his training prioritizes football skills rather than pure speed, leading to a disparity in specialized speed training routines.\nAnswer with the option's letter from the given choices directly.",
2260,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "754-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2261,
"target": "A",
"doc": {
"video_id": "754",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=0k2ey_okQ4E",
"videoID": "0k2ey_okQ4E",
"question_id": "754-3",
"task_type": "Object Reasoning",
"question": "What underlies the main character's ability to make rapid decisions on the field, according to the video?",
"options": [
"A. Mental ability and vast experience.",
"B. Extensive training and situational awareness.",
"C. Confidence and self-belief.",
"D. All of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What underlies the main character's ability to make rapid decisions on the field, according to the video?\nOption:\nA. Mental ability and vast experience.\nB. Extensive training and situational awareness.\nC. Confidence and self-belief.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2261,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "754-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2262,
"target": "A",
"doc": {
"video_id": "755",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=HGv87QeBYnc",
"videoID": "HGv87QeBYnc",
"question_id": "755-1",
"task_type": "Information Synopsis",
"question": "What is the subject of the video?",
"options": [
"A. Liverpool Road to PL Victory 2019/20.",
"B. Manchester City's 2018/19 Premier League triumph.",
"C. Chelsea FC's historical title wins.",
"D. Tottenham Hotspur's 2019/20 season highlights."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject of the video?\nOption:\nA. Liverpool Road to PL Victory 2019/20.\nB. Manchester City's 2018/19 Premier League triumph.\nC. Chelsea FC's historical title wins.\nD. Tottenham Hotspur's 2019/20 season highlights.\nAnswer with the option's letter from the given choices directly.",
2262,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "755-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2263,
"target": "B",
"doc": {
"video_id": "755",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=HGv87QeBYnc",
"videoID": "HGv87QeBYnc",
"question_id": "755-2",
"task_type": "Object Reasoning",
"question": "What evaluation does the video provide of the introduced football club's performance in the 2019/20 Premier League season?",
"options": [
"A. Dominant, winning every away game in the season.",
"B. Outstanding, setting a new record for consecutive league home wins.",
"C. Competitive but without setting any new records.",
"D. Inconsistent, with equal wins and losses throughout the season."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What evaluation does the video provide of the introduced football club's performance in the 2019/20 Premier League season?\nOption:\nA. Dominant, winning every away game in the season.\nB. Outstanding, setting a new record for consecutive league home wins.\nC. Competitive but without setting any new records.\nD. Inconsistent, with equal wins and losses throughout the season.\nAnswer with the option's letter from the given choices directly.",
2263,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "755-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2264,
"target": "A",
"doc": {
"video_id": "755",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=HGv87QeBYnc",
"videoID": "HGv87QeBYnc",
"question_id": "755-3",
"task_type": "Object Reasoning",
"question": "What is the impact of player number 10 on the red team according to the video?",
"options": [
"A. Crucial, scoring many important goals for the club.",
"B. Limited, only excelling in specific areas but lacking overall impact.",
"C. Defensive, contributing more to preventing goals than scoring them.",
"D. Cannot be inferred."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the impact of player number 10 on the red team according to the video?\nOption:\nA. Crucial, scoring many important goals for the club.\nB. Limited, only excelling in specific areas but lacking overall impact.\nC. Defensive, contributing more to preventing goals than scoring them.\nD. Cannot be inferred.\nAnswer with the option's letter from the given choices directly.",
2264,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "755-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2265,
"target": "A",
"doc": {
"video_id": "756",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=i1p5PAgNvR0",
"videoID": "i1p5PAgNvR0",
"question_id": "756-1",
"task_type": "Information Synopsis",
"question": "What is the content of the video?",
"options": [
"A. A comprehensive showcase of all goals scored by number 11 Salah for Liverpool.",
"B. Highlights of number 11 Salah's first 100 goals for Liverpool.",
"C. A compilation of number 11 Salah's goals for the Egyptian national team.",
"D. An analysis of number 11 Salah's performance in the Premier League."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the content of the video?\nOption:\nA. A comprehensive showcase of all goals scored by number 11 Salah for Liverpool.\nB. Highlights of number 11 Salah's first 100 goals for Liverpool.\nC. A compilation of number 11 Salah's goals for the Egyptian national team.\nD. An analysis of number 11 Salah's performance in the Premier League.\nAnswer with the option's letter from the given choices directly.",
2265,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "756-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2266,
"target": "C",
"doc": {
"video_id": "756",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=i1p5PAgNvR0",
"videoID": "i1p5PAgNvR0",
"question_id": "756-2",
"task_type": "Counting Problem",
"question": "How many of Liverpool's number 11, Salah's goals are featured in the video?",
"options": [
"A. 10.",
"B. 25.",
"C. 200.",
"D. 50."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many of Liverpool's number 11, Salah's goals are featured in the video?\nOption:\nA. 10.\nB. 25.\nC. 200.\nD. 50.\nAnswer with the option's letter from the given choices directly.",
2266,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "756-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2267,
"target": "B",
"doc": {
"video_id": "756",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=i1p5PAgNvR0",
"videoID": "i1p5PAgNvR0",
"question_id": "756-3",
"task_type": "Object Reasoning",
"question": "How would you describe the number 11 in the red shirt's performance in the video?",
"options": [
"A. Mediocre display with flashes of potential, showing some football skills yet lacking consistency in scoring decisive goals.",
"B. Remarkable, demonstrating great football skills, agility, and quickness, scoring many important goals.",
"C. Solid all-round performance, contributing effectively in both defense and attack without standing out in goal scoring.",
"D. Inconsistent, with moments of brilliance overshadowed by frequent mistakes and missed opportunities in front of the goal."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How would you describe the number 11 in the red shirt's performance in the video?\nOption:\nA. Mediocre display with flashes of potential, showing some football skills yet lacking consistency in scoring decisive goals.\nB. Remarkable, demonstrating great football skills, agility, and quickness, scoring many important goals.\nC. Solid all-round performance, contributing effectively in both defense and attack without standing out in goal scoring.\nD. Inconsistent, with moments of brilliance overshadowed by frequent mistakes and missed opportunities in front of the goal.\nAnswer with the option's letter from the given choices directly.",
2267,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "756-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2268,
"target": "C",
"doc": {
"video_id": "757",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=jdQ-20JEmgc",
"videoID": "jdQ-20JEmgc",
"question_id": "757-1",
"task_type": "Information Synopsis",
"question": "What is the topic of the video?",
"options": [
"A. An in-depth exploration of tactics used in the 2005 Champions League final match.",
"B. A play-by-play breakdown and tactical analysis of Liverpool's miraculous comeback victory in the 2005 Champions League final.",
"C. The journey and events leading up to Milan and Liverpool's meeting in the 2005 Champions League final.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the topic of the video?\nOption:\nA. An in-depth exploration of tactics used in the 2005 Champions League final match.\nB. A play-by-play breakdown and tactical analysis of Liverpool's miraculous comeback victory in the 2005 Champions League final.\nC. The journey and events leading up to Milan and Liverpool's meeting in the 2005 Champions League final.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2268,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "757-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2269,
"target": "A",
"doc": {
"video_id": "757",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=jdQ-20JEmgc",
"videoID": "jdQ-20JEmgc",
"question_id": "757-2",
"task_type": "Object Reasoning",
"question": "What was the main reason for the success of the club represented by the white team at the beginning of the video?",
"options": [
"A. A clever management team that strengthens the squad in the middle of success.",
"B. An exclusive focus on developing a strong defensive lineup.",
"C. The introduction of a completely new tactical system.",
"D. Relying on the talent of a single star player to carry the team forward."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the main reason for the success of the club represented by the white team at the beginning of the video?\nOption:\nA. A clever management team that strengthens the squad in the middle of success.\nB. An exclusive focus on developing a strong defensive lineup.\nC. The introduction of a completely new tactical system.\nD. Relying on the talent of a single star player to carry the team forward.\nAnswer with the option's letter from the given choices directly.",
2269,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "757-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2270,
"target": "C",
"doc": {
"video_id": "757",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=jdQ-20JEmgc",
"videoID": "jdQ-20JEmgc",
"question_id": "757-3",
"task_type": "Temporal Perception",
"question": "How would you describe the team in red's performance in the game at the beginning of the video?",
"options": [
"A. Dominant throughout the match, with a consistent lead from the start.",
"B. Struggling to find form, ultimately resulting in a defeat.",
"C. A tale of two halves, with a historic comeback from 3-0 down to win the final.",
"D. Defensive, with an emphasis on counter-attacks and no goals scored."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How would you describe the team in red's performance in the game at the beginning of the video?\nOption:\nA. Dominant throughout the match, with a consistent lead from the start.\nB. Struggling to find form, ultimately resulting in a defeat.\nC. A tale of two halves, with a historic comeback from 3-0 down to win the final.\nD. Defensive, with an emphasis on counter-attacks and no goals scored.\nAnswer with the option's letter from the given choices directly.",
2270,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "757-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Temporal Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2271,
"target": "B",
"doc": {
"video_id": "758",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=rhDdA-7gEhs",
"videoID": "rhDdA-7gEhs",
"question_id": "758-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. Ronaldo's journey through his career as he perfected his footballing skills with the changing times.",
"B. Ronaldo's journey from Madeira to winning prestigious titles and awards.",
"C. A detailed account of Cristiano Ronaldo's complete career as well as his off-field experiences and activities.",
"D. A summary and analysis of the best goals of Ronaldo's career."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. Ronaldo's journey through his career as he perfected his footballing skills with the changing times.\nB. Ronaldo's journey from Madeira to winning prestigious titles and awards.\nC. A detailed account of Cristiano Ronaldo's complete career as well as his off-field experiences and activities.\nD. A summary and analysis of the best goals of Ronaldo's career.\nAnswer with the option's letter from the given choices directly.",
2271,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "758-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2272,
"target": "B",
"doc": {
"video_id": "758",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=rhDdA-7gEhs",
"videoID": "rhDdA-7gEhs",
"question_id": "758-2",
"task_type": "Object Reasoning",
"question": "What are the critical factors contributing to the success of the player featured in the video?",
"options": [
"A. He was a late bloomer but relied on strong talent and hard work.",
"B. Self-confidence and his great personality.",
"C. A comprehensive training program, guidance from top coaches, and generous financial support.",
"D. Enjoying excellent resources with the support of fans and the attention of the media."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the critical factors contributing to the success of the player featured in the video?\nOption:\nA. He was a late bloomer but relied on strong talent and hard work.\nB. Self-confidence and his great personality.\nC. A comprehensive training program, guidance from top coaches, and generous financial support.\nD. Enjoying excellent resources with the support of fans and the attention of the media.\nAnswer with the option's letter from the given choices directly.",
2272,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "758-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2273,
"target": "A",
"doc": {
"video_id": "758",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=rhDdA-7gEhs",
"videoID": "rhDdA-7gEhs",
"question_id": "758-3",
"task_type": "Action Recognition",
"question": "How has the player's playing style evolved during his tenure at Manchester United in the video?",
"options": [
"A. He seeks simplicity and efficiency.",
"B. He pays more attention to defence while maintaining a high level of attack.",
"C. His artistry in playing football has increased due to the continuous improvement of his technique.",
"D. As he takes over more of the ball, he increases the proportion of his attack."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How has the player's playing style evolved during his tenure at Manchester United in the video?\nOption:\nA. He seeks simplicity and efficiency.\nB. He pays more attention to defence while maintaining a high level of attack.\nC. His artistry in playing football has increased due to the continuous improvement of his technique.\nD. As he takes over more of the ball, he increases the proportion of his attack.\nAnswer with the option's letter from the given choices directly.",
2273,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "758-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2274,
"target": "C",
"doc": {
"video_id": "759",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=aKLobzbLxuY",
"videoID": "aKLobzbLxuY",
"question_id": "759-1",
"task_type": "Information Synopsis",
"question": "What does the video featuring?",
"options": [
"A. Chelsea winning the FA Cup after a decisive match.",
"B. A regular season Premier League match between Chelsea and Liverpool.",
"C. Liverpool winning the FA Cup against Chelsea in a penalty shootout.",
"D. A pre-match analysis of the FA Cup final between Chelsea and Liverpool."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video featuring?\nOption:\nA. Chelsea winning the FA Cup after a decisive match.\nB. A regular season Premier League match between Chelsea and Liverpool.\nC. Liverpool winning the FA Cup against Chelsea in a penalty shootout.\nD. A pre-match analysis of the FA Cup final between Chelsea and Liverpool.\nAnswer with the option's letter from the given choices directly.",
2274,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "759-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2275,
"target": "C",
"doc": {
"video_id": "759",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=aKLobzbLxuY",
"videoID": "aKLobzbLxuY",
"question_id": "759-2",
"task_type": "Action Recognition",
"question": "Why did the winner win the game in the video?",
"options": [
"A. More goals were scored in extra time.",
"B. Defended all of the opponent's penalty kicks.",
"C. Won the penalty shootout with the last penalty kick.",
"D. Scored a decisive goal in the 90th minute of the match."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the winner win the game in the video?\nOption:\nA. More goals were scored in extra time.\nB. Defended all of the opponent's penalty kicks.\nC. Won the penalty shootout with the last penalty kick.\nD. Scored a decisive goal in the 90th minute of the match.\nAnswer with the option's letter from the given choices directly.",
2275,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "759-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2276,
"target": "B",
"doc": {
"video_id": "759",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=aKLobzbLxuY",
"videoID": "aKLobzbLxuY",
"question_id": "759-3",
"task_type": "Action Reasoning",
"question": "How would you evaluate the FA Cup final match between Chelsea and Liverpool according to the video?",
"options": [
"A. A high-scoring game with many goals during regular play.",
"B. A defensive game that was decided by a penalty shoot-out.",
"C. A match dominated by Chelsea but eventually lost in extra time.",
"D. An easy victory for Liverpool with minimal challenge from Chelsea."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How would you evaluate the FA Cup final match between Chelsea and Liverpool according to the video?\nOption:\nA. A high-scoring game with many goals during regular play.\nB. A defensive game that was decided by a penalty shoot-out.\nC. A match dominated by Chelsea but eventually lost in extra time.\nD. An easy victory for Liverpool with minimal challenge from Chelsea.\nAnswer with the option's letter from the given choices directly.",
2276,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "759-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2277,
"target": "C",
"doc": {
"video_id": "760",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=md6mZDhuR9s",
"videoID": "md6mZDhuR9s",
"question_id": "760-1",
"task_type": "Information Synopsis",
"question": "What is the central theme of the video?",
"options": [
"A. The construction of Barcelona's attacking system from 2014 to the present and the path to success in the Champions League.",
"B. Barcelona's three midfield tactics during Luis Enrique's tenure and highlights of the goals.",
"C. The path of the 'Attacking Trio' in Barcelona's Champions League campaign.",
"D. Highlights of Barcelona goalkeepers' great defenses in the 2014/2015 season with the records behind them."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central theme of the video?\nOption:\nA. The construction of Barcelona's attacking system from 2014 to the present and the path to success in the Champions League.\nB. Barcelona's three midfield tactics during Luis Enrique's tenure and highlights of the goals.\nC. The path of the 'Attacking Trio' in Barcelona's Champions League campaign.\nD. Highlights of Barcelona goalkeepers' great defenses in the 2014/2015 season with the records behind them.\nAnswer with the option's letter from the given choices directly.",
2277,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "760-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2278,
"target": "A",
"doc": {
"video_id": "760",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=md6mZDhuR9s",
"videoID": "md6mZDhuR9s",
"question_id": "760-2",
"task_type": "Object Recognition",
"question": "The video mentions the goal-scoring contributions of which players?",
"options": [
"A. Ronaldinho, Eto'o, Messi.",
"B. Neymar, Messi, Piqué.",
"C. Messi, Neymar.",
"D. Piqué, Neymar."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The video mentions the goal-scoring contributions of which players?\nOption:\nA. Ronaldinho, Eto'o, Messi.\nB. Neymar, Messi, Piqué.\nC. Messi, Neymar.\nD. Piqué, Neymar.\nAnswer with the option's letter from the given choices directly.",
2278,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "760-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2279,
"target": "C",
"doc": {
"video_id": "760",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Football",
"url": "https://www.youtube.com/watch?v=md6mZDhuR9s",
"videoID": "md6mZDhuR9s",
"question_id": "760-3",
"task_type": "Action Reasoning",
"question": "How would you assess the passing skills as suggested in the video?",
"options": [
"A. Imaginative passing and high spectator analysis, but also high turnover rate.",
"B. Very competent with few passes of high quality and accuracy.",
"C. High-quality and patience passing.",
"D. Cannot be inferred from video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How would you assess the passing skills as suggested in the video?\nOption:\nA. Imaginative passing and high spectator analysis, but also high turnover rate.\nB. Very competent with few passes of high quality and accuracy.\nC. High-quality and patience passing.\nD. Cannot be inferred from video.\nAnswer with the option's letter from the given choices directly.",
2279,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "760-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Football",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2280,
"target": "B",
"doc": {
"video_id": "761",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UZhe_SVdXRc",
"videoID": "UZhe_SVdXRc",
"question_id": "761-1",
"task_type": "Counting Problem",
"question": "How many athletes in the video were eliminated while jumping a height of 2 metres 27?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many athletes in the video were eliminated while jumping a height of 2 metres 27?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2280,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "761-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2281,
"target": "C",
"doc": {
"video_id": "761",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UZhe_SVdXRc",
"videoID": "UZhe_SVdXRc",
"question_id": "761-2",
"task_type": "Action Reasoning",
"question": "What caused the German athlete in the video to encounter difficulties when attempting to clear the height of 2 meters 27?",
"options": [
"A. The athlete slipped on the takeoff and lost balance, causing him to miss the jump.",
"B. The athlete's coach gave him incorrect instructions, leading to a failed attempt.",
"C. The athlete suddenly stopped when he was approaching the crossbar frame, leaving little time for another high jump attempt.",
"D. The athlete misjudged the distance and jumped too early, resulting in a failed attempt."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What caused the German athlete in the video to encounter difficulties when attempting to clear the height of 2 meters 27?\nOption:\nA. The athlete slipped on the takeoff and lost balance, causing him to miss the jump.\nB. The athlete's coach gave him incorrect instructions, leading to a failed attempt.\nC. The athlete suddenly stopped when he was approaching the crossbar frame, leaving little time for another high jump attempt.\nD. The athlete misjudged the distance and jumped too early, resulting in a failed attempt.\nAnswer with the option's letter from the given choices directly.",
2281,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "761-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2282,
"target": "A",
"doc": {
"video_id": "761",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=UZhe_SVdXRc",
"videoID": "UZhe_SVdXRc",
"question_id": "761-3",
"task_type": "Object Reasoning",
"question": "Which countries do the top three in the competition come from?",
"options": [
"A. Qatar, South Korea, Ukraine.",
"B. United States, South Korea, Italy.",
"C. Ukraine, Qatar, United States.",
"D. Italy, Qatar, Ukraine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which countries do the top three in the competition come from?\nOption:\nA. Qatar, South Korea, Ukraine.\nB. United States, South Korea, Italy.\nC. Ukraine, Qatar, United States.\nD. Italy, Qatar, Ukraine.\nAnswer with the option's letter from the given choices directly.",
2282,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "761-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2283,
"target": "D",
"doc": {
"video_id": "762",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=gTPHH880Cyc",
"videoID": "gTPHH880Cyc",
"question_id": "762-1",
"task_type": "Spatial Reasoning",
"question": "At the start of the race, why do the athletes line up as depicted in the video?",
"options": [
"A. The athletes line up in one row to create a visual spectacle for the audience, enhancing the overall experience of the race.",
"B. The athletes are positioned in three rows based on their height, with taller athletes in the back row and shorter athletes in the front row.",
"C. The athletes line up in three rows to allow for easier identification and tracking by the race officials and cameras.",
"D. The athletes line up in two rows to provide a clear path for the athletes to move along the track during the race."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the start of the race, why do the athletes line up as depicted in the video?\nOption:\nA. The athletes line up in one row to create a visual spectacle for the audience, enhancing the overall experience of the race.\nB. The athletes are positioned in three rows based on their height, with taller athletes in the back row and shorter athletes in the front row.\nC. The athletes line up in three rows to allow for easier identification and tracking by the race officials and cameras.\nD. The athletes line up in two rows to provide a clear path for the athletes to move along the track during the race.\nAnswer with the option's letter from the given choices directly.",
2283,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "762-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2284,
"target": "B",
"doc": {
"video_id": "762",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=gTPHH880Cyc",
"videoID": "gTPHH880Cyc",
"question_id": "762-2",
"task_type": "Object Reasoning",
"question": "In the video, why does a person suddenly appear in front of the leading team of runners when it is already 7,600 meters?",
"options": [
"A. Because this person is from another competition.",
"B. Because this guy is already a full lap behind the leading team.",
"C. Because this guy is two laps behind the leader.",
"D. Because this person has given up the game."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, why does a person suddenly appear in front of the leading team of runners when it is already 7,600 meters?\nOption:\nA. Because this person is from another competition.\nB. Because this guy is already a full lap behind the leading team.\nC. Because this guy is two laps behind the leader.\nD. Because this person has given up the game.\nAnswer with the option's letter from the given choices directly.",
2284,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "762-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2285,
"target": "C",
"doc": {
"video_id": "762",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=gTPHH880Cyc",
"videoID": "gTPHH880Cyc",
"question_id": "762-3",
"task_type": "Action Reasoning",
"question": "What happened to the championship-winning athlete in the video during the last 400 meters of the race?",
"options": [
"A. He tripped another player.",
"B. He pushed another player into the infield.",
"C. He got jostled and stepped on the infield.",
"D. He was knocked down and stood up again."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened to the championship-winning athlete in the video during the last 400 meters of the race?\nOption:\nA. He tripped another player.\nB. He pushed another player into the infield.\nC. He got jostled and stepped on the infield.\nD. He was knocked down and stood up again.\nAnswer with the option's letter from the given choices directly.",
2285,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "762-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2286,
"target": "A",
"doc": {
"video_id": "763",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=sDWOsWawxPc",
"videoID": "sDWOsWawxPc",
"question_id": "763-1",
"task_type": "Object Reasoning",
"question": "Assuming the order of the matches is arranged in the order of appearance in the video, with the first match being the first one played in the video. Which of the following description of the champions' performances in several matches in the video is inappropriate?",
"options": [
"A. The person who won the championship in the first match consistently held the lead throughout the competition.",
"B. The person who won the championship in the second match consistently held the lead throughout the competition.",
"C. The person who won the championship in the third match consistently held the lead throughout the competition.",
"D. The person who won the championship in the first match came from behind to take the lead during the competition."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Assuming the order of the matches is arranged in the order of appearance in the video, with the first match being the first one played in the video. Which of the following description of the champions' performances in several matches in the video is inappropriate?\nOption:\nA. The person who won the championship in the first match consistently held the lead throughout the competition.\nB. The person who won the championship in the second match consistently held the lead throughout the competition.\nC. The person who won the championship in the third match consistently held the lead throughout the competition.\nD. The person who won the championship in the first match came from behind to take the lead during the competition.\nAnswer with the option's letter from the given choices directly.",
2286,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "763-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2287,
"target": "D",
"doc": {
"video_id": "763",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=sDWOsWawxPc",
"videoID": "sDWOsWawxPc",
"question_id": "763-2",
"task_type": "Counting Problem",
"question": "How many times did champions break world records in the matches shown in the video?",
"options": [
"A. 6.",
"B. 5.",
"C. 4.",
"D. 3."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times did champions break world records in the matches shown in the video?\nOption:\nA. 6.\nB. 5.\nC. 4.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
2287,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "763-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2288,
"target": "B",
"doc": {
"video_id": "763",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=sDWOsWawxPc",
"videoID": "sDWOsWawxPc",
"question_id": "763-3",
"task_type": "Object Reasoning",
"question": "In which event did the oldest individual Olympic swimming gold medallist in the video win gold?",
"options": [
"A. Men's 100m Butterfly.",
"B. Men's 50m freestyle.",
"C. Men's 200m Butterfly.",
"D. Women's 50m freestyle."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which event did the oldest individual Olympic swimming gold medallist in the video win gold?\nOption:\nA. Men's 100m Butterfly.\nB. Men's 50m freestyle.\nC. Men's 200m Butterfly.\nD. Women's 50m freestyle.\nAnswer with the option's letter from the given choices directly.",
2288,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "763-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2289,
"target": "C",
"doc": {
"video_id": "764",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=8UxGzDeRIJk",
"videoID": "8UxGzDeRIJk",
"question_id": "764-1",
"task_type": "Object Reasoning",
"question": "Which statement about this video is true?",
"options": [
"A. The player from the USA ranked first.",
"B. The total distance in this competition is lower than 5000.",
"C. The first and second place simultaneously reached the finish line.",
"D. This video features a men's long-distance running competition."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement about this video is true?\nOption:\nA. The player from the USA ranked first.\nB. The total distance in this competition is lower than 5000.\nC. The first and second place simultaneously reached the finish line.\nD. This video features a men's long-distance running competition.\nAnswer with the option's letter from the given choices directly.",
2289,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "764-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2290,
"target": "C",
"doc": {
"video_id": "764",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=8UxGzDeRIJk",
"videoID": "8UxGzDeRIJk",
"question_id": "764-2",
"task_type": "Temporal Reasoning",
"question": "After how many minutes is it no longer possible to see two athletes side by side?",
"options": [
"A. 9.",
"B. 18.",
"C. 27.",
"D. 36."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After how many minutes is it no longer possible to see two athletes side by side?\nOption:\nA. 9.\nB. 18.\nC. 27.\nD. 36.\nAnswer with the option's letter from the given choices directly.",
2290,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "764-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2291,
"target": "B",
"doc": {
"video_id": "764",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=8UxGzDeRIJk",
"videoID": "8UxGzDeRIJk",
"question_id": "764-3",
"task_type": "Spatial Reasoning",
"question": "How many meters did they approximately complete in 25 minutes?",
"options": [
"A. 5000m.",
"B. 6000m.",
"C. 7000m.",
"D. 8000m."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many meters did they approximately complete in 25 minutes?\nOption:\nA. 5000m.\nB. 6000m.\nC. 7000m.\nD. 8000m.\nAnswer with the option's letter from the given choices directly.",
2291,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "764-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2292,
"target": "B",
"doc": {
"video_id": "765",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=wNpA02SNgUg",
"videoID": "wNpA02SNgUg",
"question_id": "765-1",
"task_type": "Action Reasoning",
"question": "What is the reason why the referee raised the red flag during the game in the video?",
"options": [
"A. The athlete commits a foul by not landing in the designated area at the time of landing.",
"B. The athlete commits a foul by stepping over the starting line with his or her foot at the start.",
"C. The athlete commits a foul by using an unauthorized equipment during the game.",
"D. The athlete commits a foul by waving to the crowd instead of focusing on the race."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reason why the referee raised the red flag during the game in the video?\nOption:\nA. The athlete commits a foul by not landing in the designated area at the time of landing.\nB. The athlete commits a foul by stepping over the starting line with his or her foot at the start.\nC. The athlete commits a foul by using an unauthorized equipment during the game.\nD. The athlete commits a foul by waving to the crowd instead of focusing on the race.\nAnswer with the option's letter from the given choices directly.",
2292,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "765-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2293,
"target": "C",
"doc": {
"video_id": "765",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=wNpA02SNgUg",
"videoID": "wNpA02SNgUg",
"question_id": "765-2",
"task_type": "Object Recognition",
"question": "What is the identity of the athlete in the video who committed fouls on all attempts except the first one?",
"options": [
"A. He is an athlete of the Chinese team.",
"B. He is an athlete of the Jamaican team.",
"C. He is a neutral individual athlete.",
"D. It is not mentioned in the video."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the identity of the athlete in the video who committed fouls on all attempts except the first one?\nOption:\nA. He is an athlete of the Chinese team.\nB. He is an athlete of the Jamaican team.\nC. He is a neutral individual athlete.\nD. It is not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
2293,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "765-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2294,
"target": "A",
"doc": {
"video_id": "765",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=wNpA02SNgUg",
"videoID": "wNpA02SNgUg",
"question_id": "765-3",
"task_type": "OCR Problems",
"question": "Which jump did the silver medalist in the video perform best?",
"options": [
"A. The last jump.",
"B. The second jump.",
"C. The first jump.",
"D. The third jump."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which jump did the silver medalist in the video perform best?\nOption:\nA. The last jump.\nB. The second jump.\nC. The first jump.\nD. The third jump.\nAnswer with the option's letter from the given choices directly.",
2294,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "765-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2295,
"target": "D",
"doc": {
"video_id": "766",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SB4NLADmT_k",
"videoID": "SB4NLADmT_k",
"question_id": "766-1",
"task_type": "Object Recognition",
"question": "Which player is lying on the ground receiving treatment in the first half of the video?",
"options": [
"A. 283 Jonathan Drack.",
"B. 295 Tosin OKE.",
"C. 150 Bing Dong.",
"D. 377 Omar Craddock."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player is lying on the ground receiving treatment in the first half of the video?\nOption:\nA. 283 Jonathan Drack.\nB. 295 Tosin OKE.\nC. 150 Bing Dong.\nD. 377 Omar Craddock.\nAnswer with the option's letter from the given choices directly.",
2295,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "766-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2296,
"target": "B",
"doc": {
"video_id": "766",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SB4NLADmT_k",
"videoID": "SB4NLADmT_k",
"question_id": "766-2",
"task_type": "Action Reasoning",
"question": "What is the problem with Tosin OKE's presence in the game in the video?",
"options": [
"A. He performed an illegal dance routine during the landing phase, violating the rules.",
"B. He veered to one side of the track during the jump phase and always stepped on the sideline.",
"C. He constantly high-fived the spectators during the jump phase, causing a delay in the event.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the problem with Tosin OKE's presence in the game in the video?\nOption:\nA. He performed an illegal dance routine during the landing phase, violating the rules.\nB. He veered to one side of the track during the jump phase and always stepped on the sideline.\nC. He constantly high-fived the spectators during the jump phase, causing a delay in the event.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2296,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "766-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2297,
"target": "C",
"doc": {
"video_id": "766",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SB4NLADmT_k",
"videoID": "SB4NLADmT_k",
"question_id": "766-3",
"task_type": "Object Reasoning",
"question": "Which player in the video is always in the lead?",
"options": [
"A. 283 Benjamin Compaore.",
"B. 222 Max Hess.",
"C. 150 Bing Dong.",
"D. 377 Omar Craddock."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which player in the video is always in the lead?\nOption:\nA. 283 Benjamin Compaore.\nB. 222 Max Hess.\nC. 150 Bing Dong.\nD. 377 Omar Craddock.\nAnswer with the option's letter from the given choices directly.",
2297,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "766-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2298,
"target": "A",
"doc": {
"video_id": "767",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=LqaSTy0swNE",
"videoID": "LqaSTy0swNE",
"question_id": "767-1",
"task_type": "Object Reasoning",
"question": "Why did the players of the Polish team feel down after the first set of games in the video?",
"options": [
"A. They failed in the last handover.",
"B. They finished the game but didn't break the record.",
"C. They were disqualified due to a rule violation.",
"D. They were injured during the games and couldn't continue."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the players of the Polish team feel down after the first set of games in the video?\nOption:\nA. They failed in the last handover.\nB. They finished the game but didn't break the record.\nC. They were disqualified due to a rule violation.\nD. They were injured during the games and couldn't continue.\nAnswer with the option's letter from the given choices directly.",
2298,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "767-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2299,
"target": "D",
"doc": {
"video_id": "767",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=LqaSTy0swNE",
"videoID": "LqaSTy0swNE",
"question_id": "767-2",
"task_type": "Object Reasoning",
"question": "Why did the French team have no results in the second group of games in the video?",
"options": [
"A. Because they were disqualified for starting the race before the official signal.",
"B. Because they were unable to participate in the second group of games due to a scheduling conflict.",
"C. Because they used performance-enhancing substances in the second group of games.",
"D. Because they dropped the baton during their final handover."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the French team have no results in the second group of games in the video?\nOption:\nA. Because they were disqualified for starting the race before the official signal.\nB. Because they were unable to participate in the second group of games due to a scheduling conflict.\nC. Because they used performance-enhancing substances in the second group of games.\nD. Because they dropped the baton during their final handover.\nAnswer with the option's letter from the given choices directly.",
2299,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "767-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2300,
"target": "B",
"doc": {
"video_id": "767",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=LqaSTy0swNE",
"videoID": "LqaSTy0swNE",
"question_id": "767-3",
"task_type": "Object Recognition",
"question": "Which team in the video has the best performance?",
"options": [
"A. Team Denmark.",
"B. Team USA.",
"C. Team Germany.",
"D. Team Jamaica."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which team in the video has the best performance?\nOption:\nA. Team Denmark.\nB. Team USA.\nC. Team Germany.\nD. Team Jamaica.\nAnswer with the option's letter from the given choices directly.",
2300,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "767-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2301,
"target": "C",
"doc": {
"video_id": "768",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=g8UiXbc19og",
"videoID": "g8UiXbc19og",
"question_id": "768-1",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly sequences the order in which the competition items appear in the video?",
"options": [
"A. 400m run, shot put, 110m hurdles.",
"B. 100m run, shot put, long jump.",
"C. 100m run, long jump, shot put.",
"D. 400m run, 110m hurdles, shot put."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly sequences the order in which the competition items appear in the video?\nOption:\nA. 400m run, shot put, 110m hurdles.\nB. 100m run, shot put, long jump.\nC. 100m run, long jump, shot put.\nD. 400m run, 110m hurdles, shot put.\nAnswer with the option's letter from the given choices directly.",
2301,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "768-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2302,
"target": "A",
"doc": {
"video_id": "768",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=g8UiXbc19og",
"videoID": "g8UiXbc19og",
"question_id": "768-2",
"task_type": "Action Reasoning",
"question": "Why did the Canadian athlete suddenly collapse during the game?",
"options": [
"A. His thigh suddenly hurt possibly due to a muscle strain.",
"B. He suffered a sudden cramp in his hand, causing him to lose balance.",
"C. He experienced a sudden heart attack.",
"D. His ankle got injured and caused a sudden loss of balance."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the Canadian athlete suddenly collapse during the game?\nOption:\nA. His thigh suddenly hurt possibly due to a muscle strain.\nB. He suffered a sudden cramp in his hand, causing him to lose balance.\nC. He experienced a sudden heart attack.\nD. His ankle got injured and caused a sudden loss of balance.\nAnswer with the option's letter from the given choices directly.",
2302,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "768-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2303,
"target": "D",
"doc": {
"video_id": "768",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=g8UiXbc19og",
"videoID": "g8UiXbc19og",
"question_id": "768-3",
"task_type": "Object Reasoning",
"question": "What is the medal awarded to the first runner to reach the finish line in the 1500m run in the video?",
"options": [
"A. Gold medal.",
"B. Silver medal.",
"C. Bronze medal.",
"D. He did not receive a medal."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the medal awarded to the first runner to reach the finish line in the 1500m run in the video?\nOption:\nA. Gold medal.\nB. Silver medal.\nC. Bronze medal.\nD. He did not receive a medal.\nAnswer with the option's letter from the given choices directly.",
2303,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "768-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2304,
"target": "B",
"doc": {
"video_id": "769",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=qeefjE74SXI",
"videoID": "qeefjE74SXI",
"question_id": "769-1",
"task_type": "OCR Problems",
"question": "What was the height of the first attempt of the championship-winning athlete in the video?",
"options": [
"A. 5.45m.",
"B. 5.60m.",
"C. 5.75m.",
"D. 6.00m."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the height of the first attempt of the championship-winning athlete in the video?\nOption:\nA. 5.45m.\nB. 5.60m.\nC. 5.75m.\nD. 6.00m.\nAnswer with the option's letter from the given choices directly.",
2304,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "769-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2305,
"target": "C",
"doc": {
"video_id": "769",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=qeefjE74SXI",
"videoID": "qeefjE74SXI",
"question_id": "769-2",
"task_type": "Action Reasoning",
"question": "Why did the American Nelson athlete in the video fail in his second trial jump of 5m95?",
"options": [
"A. His hand hit the bar and it fell.",
"B. He failed to clear the bar during takeoff.",
"C. His chest hit the bar, causing it to fall.",
"D. His legs hit the bar, causing it to fall."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the American Nelson athlete in the video fail in his second trial jump of 5m95?\nOption:\nA. His hand hit the bar and it fell.\nB. He failed to clear the bar during takeoff.\nC. His chest hit the bar, causing it to fall.\nD. His legs hit the bar, causing it to fall.\nAnswer with the option's letter from the given choices directly.",
2305,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "769-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2306,
"target": "A",
"doc": {
"video_id": "769",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=qeefjE74SXI",
"videoID": "qeefjE74SXI",
"question_id": "769-3",
"task_type": "Object Reasoning",
"question": "Why is the Swiss athlete in the yellow top so happy near the end of the video?",
"options": [
"A. He won the gold medal and broke the world record.",
"B. He won a special award for sportsmanship.",
"C. He won the silver medal and made great progress.",
"D. He won the bronze medal and received unexpected recognition."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is the Swiss athlete in the yellow top so happy near the end of the video?\nOption:\nA. He won the gold medal and broke the world record.\nB. He won a special award for sportsmanship.\nC. He won the silver medal and made great progress.\nD. He won the bronze medal and received unexpected recognition.\nAnswer with the option's letter from the given choices directly.",
2306,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "769-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2307,
"target": "D",
"doc": {
"video_id": "770",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SrqTAg6n798",
"videoID": "SrqTAg6n798",
"question_id": "770-1",
"task_type": "Temporal Reasoning",
"question": "In the opening section of the video, what is the sequence in which the pairs of men's 10m platform divers appear?",
"options": [
"A. China, UK, Korea, Ukraine, Brazil.",
"B. China, Ukraine, Korea, UK.",
"C. China, UK, Germany, Ukraine.",
"D. China, Ukraine, Korea, UK, Brazil."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the opening section of the video, what is the sequence in which the pairs of men's 10m platform divers appear?\nOption:\nA. China, UK, Korea, Ukraine, Brazil.\nB. China, Ukraine, Korea, UK.\nC. China, UK, Germany, Ukraine.\nD. China, Ukraine, Korea, UK, Brazil.\nAnswer with the option's letter from the given choices directly.",
2307,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "770-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2308,
"target": "B",
"doc": {
"video_id": "770",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SrqTAg6n798",
"videoID": "SrqTAg6n798",
"question_id": "770-2",
"task_type": "Action Reasoning",
"question": "In the video, what is the problem with the German team's third jump in the women's synchronized 10m high platform diving competition?",
"options": [
"A. They were missing a turn in the air.",
"B. They made bigger splashes when entering the water.",
"C. They entered the water feet first..",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what is the problem with the German team's third jump in the women's synchronized 10m high platform diving competition?\nOption:\nA. They were missing a turn in the air.\nB. They made bigger splashes when entering the water.\nC. They entered the water feet first..\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2308,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "770-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2309,
"target": "C",
"doc": {
"video_id": "770",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Athletics",
"url": "https://www.youtube.com/watch?v=SrqTAg6n798",
"videoID": "SrqTAg6n798",
"question_id": "770-3",
"task_type": "Object Reasoning",
"question": "In the video, which jump earned the highest score for the champion pair in the women's synchronized ten-meter high platform diving competition?",
"options": [
"A. The third jump.",
"B. The fourth jump.",
"C. The fifth jump.",
"D. The first jump."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, which jump earned the highest score for the champion pair in the women's synchronized ten-meter high platform diving competition?\nOption:\nA. The third jump.\nB. The fourth jump.\nC. The fifth jump.\nD. The first jump.\nAnswer with the option's letter from the given choices directly.",
2309,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "770-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Athletics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2310,
"target": "B",
"doc": {
"video_id": "771",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Bgh-2z6P0ao",
"videoID": "Bgh-2z6P0ao",
"question_id": "771-1",
"task_type": "Object Reasoning",
"question": "What is the relationship between Fan Z.D. and Wang C.Q. that can be inferred from this video?",
"options": [
"A. They are teammates in this match.",
"B. They are rivals on the playing field.",
"C. They are very good friends in life.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between Fan Z.D. and Wang C.Q. that can be inferred from this video?\nOption:\nA. They are teammates in this match.\nB. They are rivals on the playing field.\nC. They are very good friends in life.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2310,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "771-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2311,
"target": "C",
"doc": {
"video_id": "771",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Bgh-2z6P0ao",
"videoID": "Bgh-2z6P0ao",
"question_id": "771-2",
"task_type": "Action Reasoning",
"question": "Which game is Fan Z.D. closest to winning?",
"options": [
"A. The first game.",
"B. The second game.",
"C. The third game.",
"D. The fourth game."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which game is Fan Z.D. closest to winning?\nOption:\nA. The first game.\nB. The second game.\nC. The third game.\nD. The fourth game.\nAnswer with the option's letter from the given choices directly.",
2311,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "771-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2312,
"target": "A",
"doc": {
"video_id": "771",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Bgh-2z6P0ao",
"videoID": "Bgh-2z6P0ao",
"question_id": "771-3",
"task_type": "Object Reasoning",
"question": "Based on the video, which of the following sentences best describes this match?",
"options": [
"A. Although Wang wins the match by a large margin, the score in each of the innings is particularly tight.",
"B. Wang wins each of the innings by a large margin.",
"C. Although Fan wins the match by a large margin, the score in each of the innings is particularly tight.",
"D. Fan wins each of the innings by a large margin."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, which of the following sentences best describes this match?\nOption:\nA. Although Wang wins the match by a large margin, the score in each of the innings is particularly tight.\nB. Wang wins each of the innings by a large margin.\nC. Although Fan wins the match by a large margin, the score in each of the innings is particularly tight.\nD. Fan wins each of the innings by a large margin.\nAnswer with the option's letter from the given choices directly.",
2312,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "771-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2313,
"target": "A",
"doc": {
"video_id": "772",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=HKbJtgUrHGk",
"videoID": "HKbJtgUrHGk",
"question_id": "772-1",
"task_type": "Object Reasoning",
"question": "What is special to their homemade baseball game when compared to a usual baseball game?",
"options": [
"A. The ball is not caught by real persons after it is pinched.",
"B. The ball is served by an automatic tee machine.",
"C. The ball is much bigger than the usual baseball game.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is special to their homemade baseball game when compared to a usual baseball game?\nOption:\nA. The ball is not caught by real persons after it is pinched.\nB. The ball is served by an automatic tee machine.\nC. The ball is much bigger than the usual baseball game.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2313,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "772-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2314,
"target": "D",
"doc": {
"video_id": "772",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=HKbJtgUrHGk",
"videoID": "HKbJtgUrHGk",
"question_id": "772-2",
"task_type": "Counting Problem",
"question": "How many balls does the woman in pink pitch?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. She is just sitting and does not pitch a ball."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many balls does the woman in pink pitch?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. She is just sitting and does not pitch a ball.\nAnswer with the option's letter from the given choices directly.",
2314,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "772-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2315,
"target": "B",
"doc": {
"video_id": "772",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=HKbJtgUrHGk",
"videoID": "HKbJtgUrHGk",
"question_id": "772-3",
"task_type": "Object Reasoning",
"question": "What is the function of the whiteboard?",
"options": [
"A. It is used for presentations.",
"B. It serves as a scoreboard.",
"C. It is used for teaching.",
"D. It serves as a billboard."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the function of the whiteboard?\nOption:\nA. It is used for presentations.\nB. It serves as a scoreboard.\nC. It is used for teaching.\nD. It serves as a billboard.\nAnswer with the option's letter from the given choices directly.",
2315,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "772-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2316,
"target": "D",
"doc": {
"video_id": "773",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=BrjwEoPXyxA",
"videoID": "BrjwEoPXyxA",
"question_id": "773-1",
"task_type": "Action Reasoning",
"question": "As depicted in the video, what does it mean when there appears a badminton icon near the team name on the scoreboard, leftmost of the screen?",
"options": [
"A. It means the team has broken the rules.",
"B. It means the team is about to lose.",
"C. It means the team has raised a challenge.",
"D. It means the team has got a point."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what does it mean when there appears a badminton icon near the team name on the scoreboard, leftmost of the screen?\nOption:\nA. It means the team has broken the rules.\nB. It means the team is about to lose.\nC. It means the team has raised a challenge.\nD. It means the team has got a point.\nAnswer with the option's letter from the given choices directly.",
2316,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "773-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2317,
"target": "B",
"doc": {
"video_id": "773",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=BrjwEoPXyxA",
"videoID": "BrjwEoPXyxA",
"question_id": "773-2",
"task_type": "Action Recognition",
"question": "How does the blue team win the match point?",
"options": [
"A. The opposite team gives up the match.",
"B. The player on the blue team with hairbands smashed the ball.",
"C. The opposite team hit the ball out of the court.",
"D. The opposite team hit the ball but it does not cross the middle line."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the blue team win the match point?\nOption:\nA. The opposite team gives up the match.\nB. The player on the blue team with hairbands smashed the ball.\nC. The opposite team hit the ball out of the court.\nD. The opposite team hit the ball but it does not cross the middle line.\nAnswer with the option's letter from the given choices directly.",
2317,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "773-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2318,
"target": "C",
"doc": {
"video_id": "773",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=BrjwEoPXyxA",
"videoID": "BrjwEoPXyxA",
"question_id": "773-3",
"task_type": "Information Synopsis",
"question": "What are the key events or topics covered in the video?",
"options": [
"A. Men's tennis doubles match.",
"B. Mixed badminton doubles match.",
"C. Men's badminton doubles match match.",
"D. Men's badminton singles match."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the key events or topics covered in the video?\nOption:\nA. Men's tennis doubles match.\nB. Mixed badminton doubles match.\nC. Men's badminton doubles match match.\nD. Men's badminton singles match.\nAnswer with the option's letter from the given choices directly.",
2318,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "773-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2319,
"target": "C",
"doc": {
"video_id": "774",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=GV5CuB4zPTY",
"videoID": "GV5CuB4zPTY",
"question_id": "774-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. Two players are playing 8-ball just for fun.",
"B. Two players are playing snooker just for fun.",
"C. Two players are having an 8-ball pool tournament competition.",
"D. Two players are having a snooker competition."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Two players are playing 8-ball just for fun.\nB. Two players are playing snooker just for fun.\nC. Two players are having an 8-ball pool tournament competition.\nD. Two players are having a snooker competition.\nAnswer with the option's letter from the given choices directly.",
2319,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "774-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2320,
"target": "A",
"doc": {
"video_id": "774",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=GV5CuB4zPTY",
"videoID": "GV5CuB4zPTY",
"question_id": "774-2",
"task_type": "Action Reasoning",
"question": "What can be inferred from this video?",
"options": [
"A. PAGULAYAN gives up this match.",
"B. PAGULAYAN wins this match.",
"C. The red ball should be shot at last.",
"D. They have played 7 complete games in total."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred from this video?\nOption:\nA. PAGULAYAN gives up this match.\nB. PAGULAYAN wins this match.\nC. The red ball should be shot at last.\nD. They have played 7 complete games in total.\nAnswer with the option's letter from the given choices directly.",
2320,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "774-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2321,
"target": "B",
"doc": {
"video_id": "774",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=GV5CuB4zPTY",
"videoID": "GV5CuB4zPTY",
"question_id": "774-3",
"task_type": "Object Reasoning",
"question": "Which is the correct characteristic of Shaw, the player from Britain who is wearing black shirts and pants?",
"options": [
"A. He is right-handed.",
"B. He has a beer belly.",
"C. He wins the second round.",
"D. He has black hair."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is the correct characteristic of Shaw, the player from Britain who is wearing black shirts and pants?\nOption:\nA. He is right-handed.\nB. He has a beer belly.\nC. He wins the second round.\nD. He has black hair.\nAnswer with the option's letter from the given choices directly.",
2321,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "774-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2322,
"target": "B",
"doc": {
"video_id": "775",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Xjf5N9S3jAA",
"videoID": "Xjf5N9S3jAA",
"question_id": "775-1",
"task_type": "Counting Problem",
"question": "According to what is shown in the video, how many match points and set points does the winner have before the last rally to win this match?",
"options": [
"A. 2 match points and 1 set points.",
"B. 2 match points and 2 set points.",
"C. 1 match points and 1 set points.",
"D. 1 match points and 2 set points."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, how many match points and set points does the winner have before the last rally to win this match?\nOption:\nA. 2 match points and 1 set points.\nB. 2 match points and 2 set points.\nC. 1 match points and 1 set points.\nD. 1 match points and 2 set points.\nAnswer with the option's letter from the given choices directly.",
2322,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "775-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2323,
"target": "C",
"doc": {
"video_id": "775",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Xjf5N9S3jAA",
"videoID": "Xjf5N9S3jAA",
"question_id": "775-2",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. This video is about a men's tennis doubles match, and the team wearing blue won the game.",
"B. This video is about a table tennis match, and the man wearing blue won the match.",
"C. This video is about a men's tennis singles match, and the man wearing blue won the game.",
"D. This video is about a men's tennis singles match, and the man wearing white won the game.."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. This video is about a men's tennis doubles match, and the team wearing blue won the game.\nB. This video is about a table tennis match, and the man wearing blue won the match.\nC. This video is about a men's tennis singles match, and the man wearing blue won the game.\nD. This video is about a men's tennis singles match, and the man wearing white won the game..\nAnswer with the option's letter from the given choices directly.",
2323,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "775-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2324,
"target": "A",
"doc": {
"video_id": "775",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=Xjf5N9S3jAA",
"videoID": "Xjf5N9S3jAA",
"question_id": "775-3",
"task_type": "Action Recognition",
"question": "What does the volunteer squatting next to the middle line do after each rally in the video?",
"options": [
"A. She runs onto the court and picks up the ball.",
"B. She signals the referee when the ball goes out of bounds.",
"C. She throws a ball to the player.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the volunteer squatting next to the middle line do after each rally in the video?\nOption:\nA. She runs onto the court and picks up the ball.\nB. She signals the referee when the ball goes out of bounds.\nC. She throws a ball to the player.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2324,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "775-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2325,
"target": "D",
"doc": {
"video_id": "776",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=7tLtu3QT2OI",
"videoID": "7tLtu3QT2OI",
"question_id": "776-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. Fiji wins the men's rugby 7's final from the Olympics between Fiji and New Zealand.",
"B. New Zealand wins the men's rugby 7's final from the Olympics between Fiji and New Zealand.",
"C. Fiji wins the men's rugby 7's final from Hong Kong Sevens between Fiji and New Zealand.",
"D. New Zealand wins the men's rugby 7's final from Hong Kong Sevens between Fiji and New Zealand."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. Fiji wins the men's rugby 7's final from the Olympics between Fiji and New Zealand.\nB. New Zealand wins the men's rugby 7's final from the Olympics between Fiji and New Zealand.\nC. Fiji wins the men's rugby 7's final from Hong Kong Sevens between Fiji and New Zealand.\nD. New Zealand wins the men's rugby 7's final from Hong Kong Sevens between Fiji and New Zealand.\nAnswer with the option's letter from the given choices directly.",
2325,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "776-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2326,
"target": "B",
"doc": {
"video_id": "776",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=7tLtu3QT2OI",
"videoID": "7tLtu3QT2OI",
"question_id": "776-2",
"task_type": "Action Recognition",
"question": "In line with the video evidence, what do the sportsmen do before the match gets started?",
"options": [
"A. They do the opening dance.",
"B. They sing their national anthem.",
"C. They change their clothes.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what do the sportsmen do before the match gets started?\nOption:\nA. They do the opening dance.\nB. They sing their national anthem.\nC. They change their clothes.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2326,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "776-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2327,
"target": "D",
"doc": {
"video_id": "776",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=7tLtu3QT2OI",
"videoID": "7tLtu3QT2OI",
"question_id": "776-3",
"task_type": "OCR Problems",
"question": "How many scores does Fiji get at the very end of the first half?",
"options": [
"A. 1.",
"B. 3.",
"C. 5.",
"D. 7."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many scores does Fiji get at the very end of the first half?\nOption:\nA. 1.\nB. 3.\nC. 5.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
2327,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "776-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2328,
"target": "A",
"doc": {
"video_id": "777",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=z9Wfy8rzwQ4",
"videoID": "z9Wfy8rzwQ4",
"question_id": "777-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. This video is about a volleyball match and Oregon wins the semifinals and advances to the finals.",
"B. This video is about a volleyball match and Oregon wins the finals.",
"C. This video is about a baseball match and Oregon wins the semifinals and advances to the finals.",
"D. This video is about a baseball match and Oregon wins the finals."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. This video is about a volleyball match and Oregon wins the semifinals and advances to the finals.\nB. This video is about a volleyball match and Oregon wins the finals.\nC. This video is about a baseball match and Oregon wins the semifinals and advances to the finals.\nD. This video is about a baseball match and Oregon wins the finals.\nAnswer with the option's letter from the given choices directly.",
2328,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "777-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2329,
"target": "C",
"doc": {
"video_id": "777",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=z9Wfy8rzwQ4",
"videoID": "z9Wfy8rzwQ4",
"question_id": "777-2",
"task_type": "Object Reasoning",
"question": "What would happen if Nebraska won the fourth set?",
"options": [
"A. Nebraska would tie the score and start the fifth set.",
"B. Nebraska would lead the opponent by two points.",
"C. Nebraska would win the game directly.",
"D. Nebraska would lose the game eventually."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What would happen if Nebraska won the fourth set?\nOption:\nA. Nebraska would tie the score and start the fifth set.\nB. Nebraska would lead the opponent by two points.\nC. Nebraska would win the game directly.\nD. Nebraska would lose the game eventually.\nAnswer with the option's letter from the given choices directly.",
2329,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "777-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2330,
"target": "B",
"doc": {
"video_id": "777",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=z9Wfy8rzwQ4",
"videoID": "z9Wfy8rzwQ4",
"question_id": "777-3",
"task_type": "Object Reasoning",
"question": "What can be read from the scoreboard before halftime in the fourth set?",
"options": [
"A. Nebraska leaves one point behind.",
"B. Oregon gets 24 points.",
"C. Nebraska has a team logo of a red capital letter N.",
"D. Oregon has a match point."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be read from the scoreboard before halftime in the fourth set?\nOption:\nA. Nebraska leaves one point behind.\nB. Oregon gets 24 points.\nC. Nebraska has a team logo of a red capital letter N.\nD. Oregon has a match point.\nAnswer with the option's letter from the given choices directly.",
2330,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "777-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2331,
"target": "C",
"doc": {
"video_id": "778",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=rVPDIRSyS34",
"videoID": "rVPDIRSyS34",
"question_id": "778-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. Football.",
"B. Ice hockey.",
"C. Street hockey.",
"D. Baseball."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. Football.\nB. Ice hockey.\nC. Street hockey.\nD. Baseball.\nAnswer with the option's letter from the given choices directly.",
2331,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "778-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2332,
"target": "A",
"doc": {
"video_id": "778",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=rVPDIRSyS34",
"videoID": "rVPDIRSyS34",
"question_id": "778-2",
"task_type": "Action Reasoning",
"question": "Why is there a penalty shootout in this game in the video?",
"options": [
"A. After the third period, The two teams are tied with 11 points.",
"B. After the third period, one of the teams has won the game, but they reach a consensus to play for fun.",
"C. The rules require a penalty shootout whether two teams are tied or not at the end of the third period.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is there a penalty shootout in this game in the video?\nOption:\nA. After the third period, The two teams are tied with 11 points.\nB. After the third period, one of the teams has won the game, but they reach a consensus to play for fun.\nC. The rules require a penalty shootout whether two teams are tied or not at the end of the third period.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2332,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "778-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2333,
"target": "D",
"doc": {
"video_id": "778",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=rVPDIRSyS34",
"videoID": "rVPDIRSyS34",
"question_id": "778-3",
"task_type": "Counting Problem",
"question": "How many periods do they play for and how many photos are taken?",
"options": [
"A. 2 periods and 3 photos.",
"B. 3 periods and 3 photos.",
"C. 2 periods and 2 photos.",
"D. 3 periods and 2 photos."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many periods do they play for and how many photos are taken?\nOption:\nA. 2 periods and 3 photos.\nB. 3 periods and 3 photos.\nC. 2 periods and 2 photos.\nD. 3 periods and 2 photos.\nAnswer with the option's letter from the given choices directly.",
2333,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "778-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2334,
"target": "B",
"doc": {
"video_id": "779",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=1wzgMHrkrys",
"videoID": "1wzgMHrkrys",
"question_id": "779-1",
"task_type": "Temporal Reasoning",
"question": "According to the timer after the race, in which lag does Jake Gagne drive fastest?",
"options": [
"A. Lag 1.",
"B. Lag 2.",
"C. Lag 3.",
"D. Lag 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the timer after the race, in which lag does Jake Gagne drive fastest?\nOption:\nA. Lag 1.\nB. Lag 2.\nC. Lag 3.\nD. Lag 4.\nAnswer with the option's letter from the given choices directly.",
2334,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "779-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2335,
"target": "C",
"doc": {
"video_id": "779",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=1wzgMHrkrys",
"videoID": "1wzgMHrkrys",
"question_id": "779-2",
"task_type": "Temporal Reasoning",
"question": "Who is interviewed both before and after the race based on the video?",
"options": [
"A. Corey Alexander.",
"B. Ashton Yates.",
"C. Jake Gagne.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is interviewed both before and after the race based on the video?\nOption:\nA. Corey Alexander.\nB. Ashton Yates.\nC. Jake Gagne.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2335,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "779-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2336,
"target": "A",
"doc": {
"video_id": "779",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=1wzgMHrkrys",
"videoID": "1wzgMHrkrys",
"question_id": "779-3",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. Medallia superbike race.",
"B. Medallia car race.",
"C. Medallia bike race.",
"D. Medallia running race."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. Medallia superbike race.\nB. Medallia car race.\nC. Medallia bike race.\nD. Medallia running race.\nAnswer with the option's letter from the given choices directly.",
2336,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "779-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2337,
"target": "D",
"doc": {
"video_id": "780",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=0iED5BojTMI",
"videoID": "0iED5BojTMI",
"question_id": "780-1",
"task_type": "Counting Problem",
"question": "How many games are hosted in this video?",
"options": [
"A. 6.",
"B. 7.",
"C. 8.",
"D. 9."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many games are hosted in this video?\nOption:\nA. 6.\nB. 7.\nC. 8.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
2337,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "780-1",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2338,
"target": "B",
"doc": {
"video_id": "780",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=0iED5BojTMI",
"videoID": "0iED5BojTMI",
"question_id": "780-2",
"task_type": "Object Recognition",
"question": "Who wins the women's A final?",
"options": [
"A. Zoe Candelier.",
"B. Hee Won Son.",
"C. Melina Tremblay.",
"D. Caitlin Pelkey."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who wins the women's A final?\nOption:\nA. Zoe Candelier.\nB. Hee Won Son.\nC. Melina Tremblay.\nD. Caitlin Pelkey.\nAnswer with the option's letter from the given choices directly.",
2338,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "780-2",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2339,
"target": "C",
"doc": {
"video_id": "780",
"duration": "long",
"domain": "Sports Competition",
"sub_category": "Other Sports",
"url": "https://www.youtube.com/watch?v=0iED5BojTMI",
"videoID": "0iED5BojTMI",
"question_id": "780-3",
"task_type": "Information Synopsis",
"question": "What is the main focus of the video?",
"options": [
"A. Women's group finals for Canadian junior short track selections.",
"B. Men's group finals for Canadian junior skiing selections.",
"C. Women's and men's group finals for Canadian junior short track selections.",
"D. Women's and men's finals for Canadian junior short track selections."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of the video?\nOption:\nA. Women's group finals for Canadian junior short track selections.\nB. Men's group finals for Canadian junior skiing selections.\nC. Women's and men's group finals for Canadian junior short track selections.\nD. Women's and men's finals for Canadian junior short track selections.\nAnswer with the option's letter from the given choices directly.",
2339,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "780-3",
"duration": "long",
"category": "Sports Competition",
"sub_category": "Other Sports",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2340,
"target": "A",
"doc": {
"video_id": "781",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=kDXerubF4I4",
"videoID": "kDXerubF4I4",
"question_id": "781-1",
"task_type": "Counting Problem",
"question": "Which of the performances in the video is a solo?",
"options": [
"A. I wanna be ready.",
"B. Wade in the water.",
"C. Fix Me, Jesus.",
"D. Sinner Man."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the performances in the video is a solo?\nOption:\nA. I wanna be ready.\nB. Wade in the water.\nC. Fix Me, Jesus.\nD. Sinner Man.\nAnswer with the option's letter from the given choices directly.",
2340,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "781-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2341,
"target": "B",
"doc": {
"video_id": "781",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=kDXerubF4I4",
"videoID": "kDXerubF4I4",
"question_id": "781-2",
"task_type": "Object Reasoning",
"question": "What can be inferred about the performance from the video?",
"options": [
"A. It is a beautifully designed operatic work that is regarded as a cultural treasure.",
"B. It is an abstract dance piece with high appreciation value.",
"C. It is an art exhibition performance in which the artist communicates a vision with exquisite costumes and staging.",
"D. It is a stage performance that transcends modernity and is difficult for most people to appreciate."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the performance from the video?\nOption:\nA. It is a beautifully designed operatic work that is regarded as a cultural treasure.\nB. It is an abstract dance piece with high appreciation value.\nC. It is an art exhibition performance in which the artist communicates a vision with exquisite costumes and staging.\nD. It is a stage performance that transcends modernity and is difficult for most people to appreciate.\nAnswer with the option's letter from the given choices directly.",
2341,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "781-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2342,
"target": "C",
"doc": {
"video_id": "781",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=kDXerubF4I4",
"videoID": "kDXerubF4I4",
"question_id": "781-3",
"task_type": "Information Synopsis",
"question": "What is the video primarily about?",
"options": [
"A. A contemporary dance performance depicting the struggles and triumphs of indigenous communities.",
"B. A choreographed dance piece inspired by the history and resilience of LGBTQ+ individuals.",
"C. A dance interpretation of the African-American struggle and resilience.",
"D. A dance production exploring the themes of immigration, identity, and cultural integration."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video primarily about?\nOption:\nA. A contemporary dance performance depicting the struggles and triumphs of indigenous communities.\nB. A choreographed dance piece inspired by the history and resilience of LGBTQ+ individuals.\nC. A dance interpretation of the African-American struggle and resilience.\nD. A dance production exploring the themes of immigration, identity, and cultural integration.\nAnswer with the option's letter from the given choices directly.",
2342,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "781-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2343,
"target": "C",
"doc": {
"video_id": "782",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=-xZglYQSJ3Q",
"videoID": "-xZglYQSJ3Q",
"question_id": "782-1",
"task_type": "Information Synopsis",
"question": "What is the video primarily about?",
"options": [
"A. The history of costume design in cinema.",
"B. The personal lives of famous costume designers.",
"C. The process and artistry behind costume design in theatre.",
"D. The financial aspects of working in the theatre industry."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video primarily about?\nOption:\nA. The history of costume design in cinema.\nB. The personal lives of famous costume designers.\nC. The process and artistry behind costume design in theatre.\nD. The financial aspects of working in the theatre industry.\nAnswer with the option's letter from the given choices directly.",
2343,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "782-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2344,
"target": "B",
"doc": {
"video_id": "782",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=-xZglYQSJ3Q",
"videoID": "-xZglYQSJ3Q",
"question_id": "782-2",
"task_type": "Object Reasoning",
"question": "What can be inferred about the series \"Working In The Theatre\" from the video?",
"options": [
"A. It documents the entire process of a day's work for theatre company employees, providing guidance to the industry.",
"B. It provides an exclusive view of the behind-the-scenes process of theatre, including costume design.",
"C. It is a variety show that invites celebrities to experience theatre work.",
"D. It provides aspiring actors with a whole process of education, including acting enhancement and character understanding."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the series \"Working In The Theatre\" from the video?\nOption:\nA. It documents the entire process of a day's work for theatre company employees, providing guidance to the industry.\nB. It provides an exclusive view of the behind-the-scenes process of theatre, including costume design.\nC. It is a variety show that invites celebrities to experience theatre work.\nD. It provides aspiring actors with a whole process of education, including acting enhancement and character understanding.\nAnswer with the option's letter from the given choices directly.",
2344,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "782-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2345,
"target": "D",
"doc": {
"video_id": "782",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=-xZglYQSJ3Q",
"videoID": "-xZglYQSJ3Q",
"question_id": "782-3",
"task_type": "Action Reasoning",
"question": "What is the professional relationship between William Ivey Long and Willa Kim as described in the video?",
"options": [
"A. Classmates from design school.",
"B. Distant relatives.",
"C. Business partners.",
"D. Teacher and student."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the professional relationship between William Ivey Long and Willa Kim as described in the video?\nOption:\nA. Classmates from design school.\nB. Distant relatives.\nC. Business partners.\nD. Teacher and student.\nAnswer with the option's letter from the given choices directly.",
2345,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "782-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2346,
"target": "A",
"doc": {
"video_id": "783",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=liDQtpusSmY",
"videoID": "liDQtpusSmY",
"question_id": "783-1",
"task_type": "Action Reasoning",
"question": "What is the king endeavoring to accomplish?",
"options": [
"A. Search for his identity.",
"B. Conquer neighboring lands.",
"C. Attain immortality.",
"D. Win the gods' favor."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the king endeavoring to accomplish?\nOption:\nA. Search for his identity.\nB. Conquer neighboring lands.\nC. Attain immortality.\nD. Win the gods' favor.\nAnswer with the option's letter from the given choices directly.",
2346,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "783-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2347,
"target": "B",
"doc": {
"video_id": "783",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=liDQtpusSmY",
"videoID": "liDQtpusSmY",
"question_id": "783-2",
"task_type": "Action Reasoning",
"question": "According to the video, what grave actions does the king unknowingly commit?",
"options": [
"A. Steal from the temple and deceive the oracle.",
"B. Kill his father, and marry his mother.",
"C. Betray his country and flee to exile.",
"D. Overthrow the government and seize power unlawfully."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what grave actions does the king unknowingly commit?\nOption:\nA. Steal from the temple and deceive the oracle.\nB. Kill his father, and marry his mother.\nC. Betray his country and flee to exile.\nD. Overthrow the government and seize power unlawfully.\nAnswer with the option's letter from the given choices directly.",
2347,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "783-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2348,
"target": "D",
"doc": {
"video_id": "783",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=liDQtpusSmY",
"videoID": "liDQtpusSmY",
"question_id": "783-3",
"task_type": "Action Recognition",
"question": "What is the king's response upon discovering the truth about his actions in the video?",
"options": [
"A. He leaves Thebes to seek forgiveness from the gods.",
"B. He ascends to the heavens to join the deities.",
"C. He claims the throne and rules with greater wisdom.",
"D. He gouges out his own eyes in despair."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the king's response upon discovering the truth about his actions in the video?\nOption:\nA. He leaves Thebes to seek forgiveness from the gods.\nB. He ascends to the heavens to join the deities.\nC. He claims the throne and rules with greater wisdom.\nD. He gouges out his own eyes in despair.\nAnswer with the option's letter from the given choices directly.",
2348,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "783-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2349,
"target": "D",
"doc": {
"video_id": "784",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=FvZFBZB10LA",
"videoID": "FvZFBZB10LA",
"question_id": "784-1",
"task_type": "Information Synopsis",
"question": "What is the central theme of the video?",
"options": [
"A. The pursuit of knowledge and scientific discovery.",
"B. The importance of wealth and material possessions.",
"C. The adventures of a hero in a mythical land.",
"D. The journey of life and facing judgment after death."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central theme of the video?\nOption:\nA. The pursuit of knowledge and scientific discovery.\nB. The importance of wealth and material possessions.\nC. The adventures of a hero in a mythical land.\nD. The journey of life and facing judgment after death.\nAnswer with the option's letter from the given choices directly.",
2349,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "784-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2350,
"target": "D",
"doc": {
"video_id": "784",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=FvZFBZB10LA",
"videoID": "FvZFBZB10LA",
"question_id": "784-2",
"task_type": "Action Reasoning",
"question": "How do Everyman's friend Fellowship and his Kindred respond to his request for them to accompany him on his journey to face judgment?",
"options": [
"A. They immediately agree to join him and provide support.",
"B. They refuse because they are busy with their own affairs.",
"C. They immediately refuse to accompany him, highlighting the solitary nature of Everyman's journey.",
"D. They promise to follow but abandon him at the last moment."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do Everyman's friend Fellowship and his Kindred respond to his request for them to accompany him on his journey to face judgment?\nOption:\nA. They immediately agree to join him and provide support.\nB. They refuse because they are busy with their own affairs.\nC. They immediately refuse to accompany him, highlighting the solitary nature of Everyman's journey.\nD. They promise to follow but abandon him at the last moment.\nAnswer with the option's letter from the given choices directly.",
2350,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "784-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2351,
"target": "B",
"doc": {
"video_id": "784",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=FvZFBZB10LA",
"videoID": "FvZFBZB10LA",
"question_id": "784-3",
"task_type": "Object Reasoning",
"question": "What inference can be made from the video about life after death?",
"options": [
"A. All earthly attributes and relations will have significance in the afterlife.",
"B. Good Deeds are the only attribute that accompanies Everyman to the afterlife.",
"C. Knowledge and earthly pleasures are what truly matter at the end of life.",
"D. Everyman's journey is futile and without redemption."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What inference can be made from the video about life after death?\nOption:\nA. All earthly attributes and relations will have significance in the afterlife.\nB. Good Deeds are the only attribute that accompanies Everyman to the afterlife.\nC. Knowledge and earthly pleasures are what truly matter at the end of life.\nD. Everyman's journey is futile and without redemption.\nAnswer with the option's letter from the given choices directly.",
2351,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "784-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2352,
"target": "B",
"doc": {
"video_id": "785",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=jIx5Zi84Z3Q",
"videoID": "jIx5Zi84Z3Q",
"question_id": "785-1",
"task_type": "Action Reasoning",
"question": "What is the central conflict involving the woman dressed in a black top and white skirt in the video?",
"options": [
"A. Her efforts to seek artistic recognition were met with opposition from family and friends.",
"B. Her secret financial dealings and the consequent threat of exposure.",
"C. Her ambition to take over her husband's business, which creates an irreconcilable conflict with him.",
"D. Her efforts to restore the family's lost wealth to no avail."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central conflict involving the woman dressed in a black top and white skirt in the video?\nOption:\nA. Her efforts to seek artistic recognition were met with opposition from family and friends.\nB. Her secret financial dealings and the consequent threat of exposure.\nC. Her ambition to take over her husband's business, which creates an irreconcilable conflict with him.\nD. Her efforts to restore the family's lost wealth to no avail.\nAnswer with the option's letter from the given choices directly.",
2352,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "785-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2353,
"target": "A",
"doc": {
"video_id": "785",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=jIx5Zi84Z3Q",
"videoID": "jIx5Zi84Z3Q",
"question_id": "785-2",
"task_type": "Object Reasoning",
"question": "How is the character of the woman dressed in a white skirt and black top perceived at the beginning of the play?",
"options": [
"A. As light-hearted and imprudent with money.",
"B. As an entrepreneurial spirit with a knack for business.",
"C. As the stern matriarch ruling over her household.",
"D. As a dutiful and compliant offspring."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the character of the woman dressed in a white skirt and black top perceived at the beginning of the play?\nOption:\nA. As light-hearted and imprudent with money.\nB. As an entrepreneurial spirit with a knack for business.\nC. As the stern matriarch ruling over her household.\nD. As a dutiful and compliant offspring.\nAnswer with the option's letter from the given choices directly.",
2353,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "785-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2354,
"target": "D",
"doc": {
"video_id": "785",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=jIx5Zi84Z3Q",
"videoID": "jIx5Zi84Z3Q",
"question_id": "785-3",
"task_type": "Object Reasoning",
"question": "What can be inferred about the man dressed in a formal suit and dark hat's attitude towards the woman in the white skirt and his employees?",
"options": [
"A. He is generous and lax with both his family and employees.",
"B. He encourages the woman's autonomy and treats his workers equitably.",
"C. He is unaware of domestic matters and avoids involvement in employee disputes.",
"D. He is authoritarian and patronizing in his personal and professional relations."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the man dressed in a formal suit and dark hat's attitude towards the woman in the white skirt and his employees?\nOption:\nA. He is generous and lax with both his family and employees.\nB. He encourages the woman's autonomy and treats his workers equitably.\nC. He is unaware of domestic matters and avoids involvement in employee disputes.\nD. He is authoritarian and patronizing in his personal and professional relations.\nAnswer with the option's letter from the given choices directly.",
2354,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "785-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2355,
"target": "C",
"doc": {
"video_id": "786",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=P69idA8JO98",
"videoID": "P69idA8JO98",
"question_id": "786-1",
"task_type": "Action Reasoning",
"question": "How does Snow White come to meet the seven miners?",
"options": [
"A. She is on a quest to find a hidden treasure.",
"B. She is kidnapped by the miners from her castle.",
"C. She is sent to the forest by her wicked stepmother and finds their home.",
"D. She wanders into the forest out of curiosity and stumbles upon their cottage."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Snow White come to meet the seven miners?\nOption:\nA. She is on a quest to find a hidden treasure.\nB. She is kidnapped by the miners from her castle.\nC. She is sent to the forest by her wicked stepmother and finds their home.\nD. She wanders into the forest out of curiosity and stumbles upon their cottage.\nAnswer with the option's letter from the given choices directly.",
2355,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "786-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2356,
"target": "A",
"doc": {
"video_id": "786",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=P69idA8JO98",
"videoID": "P69idA8JO98",
"question_id": "786-2",
"task_type": "Action Recognition",
"question": "What is the Queen's fate in the video?",
"options": [
"A. She is trapped on a cliff and meets her end by a falling boulder.",
"B. She is overthrown by the miners and banished from the kingdom.",
"C. She is forgiven by Snow White and reforms her ways.",
"D. She flees the kingdom to seek power elsewhere."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the Queen's fate in the video?\nOption:\nA. She is trapped on a cliff and meets her end by a falling boulder.\nB. She is overthrown by the miners and banished from the kingdom.\nC. She is forgiven by Snow White and reforms her ways.\nD. She flees the kingdom to seek power elsewhere.\nAnswer with the option's letter from the given choices directly.",
2356,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "786-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2357,
"target": "A",
"doc": {
"video_id": "786",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=P69idA8JO98",
"videoID": "P69idA8JO98",
"question_id": "786-3",
"task_type": "Action Reasoning",
"question": "What inference can be made about the actress on her interactions in the video?",
"options": [
"A. She won the friendship of the miners and the creatures of the forest.",
"B. She was a skilled princess who fights for justice.",
"C. She has magic that helps her control nature.",
"D. She was a musician known throughout the country."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What inference can be made about the actress on her interactions in the video?\nOption:\nA. She won the friendship of the miners and the creatures of the forest.\nB. She was a skilled princess who fights for justice.\nC. She has magic that helps her control nature.\nD. She was a musician known throughout the country.\nAnswer with the option's letter from the given choices directly.",
2357,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "786-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2358,
"target": "B",
"doc": {
"video_id": "787",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=FkrIcmXvv18",
"videoID": "FkrIcmXvv18",
"question_id": "787-1",
"task_type": "Information Synopsis",
"question": "What is the main plot of the video?",
"options": [
"A. The rivalry and eventual alliance between two noble houses.",
"B. The forbidden romance between a young man and a woman from feuding families.",
"C. The comedic mishaps of two young lovers from different cities.",
"D. The quest of a knight to win the love of a princess."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main plot of the video?\nOption:\nA. The rivalry and eventual alliance between two noble houses.\nB. The forbidden romance between a young man and a woman from feuding families.\nC. The comedic mishaps of two young lovers from different cities.\nD. The quest of a knight to win the love of a princess.\nAnswer with the option's letter from the given choices directly.",
2358,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "787-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2359,
"target": "D",
"doc": {
"video_id": "787",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=FkrIcmXvv18",
"videoID": "FkrIcmXvv18",
"question_id": "787-2",
"task_type": "Object Reasoning",
"question": "Based on the actions of the friar in the video, what can be inferred about his beliefs?",
"options": [
"A. He believes in the Goddess because no one but the Goddess could know and admire his good deeds.",
"B. He believes in and strictly abides by the social order.",
"C. He hopes that love would reconcile feuding families.",
"D. It cannot be inferred from the video."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the actions of the friar in the video, what can be inferred about his beliefs?\nOption:\nA. He believes in the Goddess because no one but the Goddess could know and admire his good deeds.\nB. He believes in and strictly abides by the social order.\nC. He hopes that love would reconcile feuding families.\nD. It cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
2359,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "787-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2360,
"target": "C",
"doc": {
"video_id": "787",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=FkrIcmXvv18",
"videoID": "FkrIcmXvv18",
"question_id": "787-3",
"task_type": "Action Reasoning",
"question": "What does the decision of the two main characters in the video to get married tell us about their characters?",
"options": [
"A. They are cautious and thoughtful about their relationship.",
"B. They are rebellious and seek to defy their families' wishes at every turn.",
"C. They are impulsive and driven by strong emotions.",
"D. They are manipulative and use their love to gain social standing."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the decision of the two main characters in the video to get married tell us about their characters?\nOption:\nA. They are cautious and thoughtful about their relationship.\nB. They are rebellious and seek to defy their families' wishes at every turn.\nC. They are impulsive and driven by strong emotions.\nD. They are manipulative and use their love to gain social standing.\nAnswer with the option's letter from the given choices directly.",
2360,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "787-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2361,
"target": "C",
"doc": {
"video_id": "788",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=6dmTmP4Dsw8",
"videoID": "6dmTmP4Dsw8",
"question_id": "788-1",
"task_type": "Information Synopsis",
"question": "What overarching story is told in the video?",
"options": [
"A. The story of a princess who searches for her long-lost family and is finally reunited.",
"B. The story of a knight who fights the evil witch to save the kingdom.",
"C. The story of a princess who discovers her magical powers and their consequences.",
"D. The story of a knight who saves and awakens a princess from a dragon."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What overarching story is told in the video?\nOption:\nA. The story of a princess who searches for her long-lost family and is finally reunited.\nB. The story of a knight who fights the evil witch to save the kingdom.\nC. The story of a princess who discovers her magical powers and their consequences.\nD. The story of a knight who saves and awakens a princess from a dragon.\nAnswer with the option's letter from the given choices directly.",
2361,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "788-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2362,
"target": "C",
"doc": {
"video_id": "788",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=6dmTmP4Dsw8",
"videoID": "6dmTmP4Dsw8",
"question_id": "788-2",
"task_type": "Action Reasoning",
"question": "What motivates the main character to flee her hometown?",
"options": [
"A. she was banished from the country.",
"B. she was greatly oppressed by life and wished to go for freedom.",
"C. she exposed her superpowers, causing her to panic.",
"D. she was pursuing a prophecy about her fate."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What motivates the main character to flee her hometown?\nOption:\nA. she was banished from the country.\nB. she was greatly oppressed by life and wished to go for freedom.\nC. she exposed her superpowers, causing her to panic.\nD. she was pursuing a prophecy about her fate.\nAnswer with the option's letter from the given choices directly.",
2362,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "788-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2363,
"target": "A",
"doc": {
"video_id": "788",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=6dmTmP4Dsw8",
"videoID": "6dmTmP4Dsw8",
"question_id": "788-3",
"task_type": "Action Reasoning",
"question": "What can be inferred about how the video's protagonist initially treated her powers and her kingdom?",
"options": [
"A. She sees her superpowers as a threat and isolates herself, unintentionally causing harm to her kingdom.",
"B. She sees her superpowers as an asset and wishes to convince the past to use them to create value for her kingdom and benefit her people.",
"C. Her superpowers inspire ambition; she desires to use her superpowers to earn a place for herself in her kingdom and begins to formulate meticulous plans.",
"D. She is indifferent to her kingdom and abandons it without a second thought."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about how the video's protagonist initially treated her powers and her kingdom?\nOption:\nA. She sees her superpowers as a threat and isolates herself, unintentionally causing harm to her kingdom.\nB. She sees her superpowers as an asset and wishes to convince the past to use them to create value for her kingdom and benefit her people.\nC. Her superpowers inspire ambition; she desires to use her superpowers to earn a place for herself in her kingdom and begins to formulate meticulous plans.\nD. She is indifferent to her kingdom and abandons it without a second thought.\nAnswer with the option's letter from the given choices directly.",
2363,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "788-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2364,
"target": "D",
"doc": {
"video_id": "789",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=i7ivyMFsw-w",
"videoID": "i7ivyMFsw-w",
"question_id": "789-1",
"task_type": "Information Synopsis",
"question": "What is the video primarily focused on?",
"options": [
"A. The historical influence of Shakespeare's works.",
"B. The comparison between different adaptations of \"Hamlet\".",
"C. The exploration of Hamlet's character in modern cinema.",
"D. Kenneth Branagh's journey in playing Hamlet."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video primarily focused on?\nOption:\nA. The historical influence of Shakespeare's works.\nB. The comparison between different adaptations of \"Hamlet\".\nC. The exploration of Hamlet's character in modern cinema.\nD. Kenneth Branagh's journey in playing Hamlet.\nAnswer with the option's letter from the given choices directly.",
2364,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "789-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2365,
"target": "C",
"doc": {
"video_id": "789",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=i7ivyMFsw-w",
"videoID": "i7ivyMFsw-w",
"question_id": "789-2",
"task_type": "Attribute Perception",
"question": "What aspect of theatre production is highlighted according to the video?",
"options": [
"A. Set design and construction.",
"B. Costume design and wardrobe malfunctions.",
"C. Actor's preparation and character development.",
"D. Financial struggles of putting on a Shakespearean play."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What aspect of theatre production is highlighted according to the video?\nOption:\nA. Set design and construction.\nB. Costume design and wardrobe malfunctions.\nC. Actor's preparation and character development.\nD. Financial struggles of putting on a Shakespearean play.\nAnswer with the option's letter from the given choices directly.",
2365,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "789-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2366,
"target": "B",
"doc": {
"video_id": "789",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=i7ivyMFsw-w",
"videoID": "i7ivyMFsw-w",
"question_id": "789-3",
"task_type": "Object Reasoning",
"question": "What can be inferred about the overall quality of the play's production?",
"options": [
"A. The play is likely to be an amateur production, but the overall production is well done.",
"B. The play promises to be a high-quality production due to rigorous casting and a deep understanding of the material.",
"C. With inexperienced but well-mannered actors, the play may break through expectations and achieve great results.",
"D. Due to the complexity of the script and the very high demand for props, the production of the play will encounter difficulties."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the overall quality of the play's production?\nOption:\nA. The play is likely to be an amateur production, but the overall production is well done.\nB. The play promises to be a high-quality production due to rigorous casting and a deep understanding of the material.\nC. With inexperienced but well-mannered actors, the play may break through expectations and achieve great results.\nD. Due to the complexity of the script and the very high demand for props, the production of the play will encounter difficulties.\nAnswer with the option's letter from the given choices directly.",
2366,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "789-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2367,
"target": "A",
"doc": {
"video_id": "790",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=RgJVug6iNs8",
"videoID": "RgJVug6iNs8",
"question_id": "790-1",
"task_type": "Information Synopsis",
"question": "What is the central theme of the performance?",
"options": [
"A. The adventures of the three little pigs as they search for wealth and build houses.",
"B. A cooking competition for the three little pigs.",
"C. A suspenseful story about the disappearance of the three little pigs.",
"D. A story about a battle of wits between the three little pigs and a cunning wolf."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the central theme of the performance?\nOption:\nA. The adventures of the three little pigs as they search for wealth and build houses.\nB. A cooking competition for the three little pigs.\nC. A suspenseful story about the disappearance of the three little pigs.\nD. A story about a battle of wits between the three little pigs and a cunning wolf.\nAnswer with the option's letter from the given choices directly.",
2367,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "790-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2368,
"target": "B",
"doc": {
"video_id": "790",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=RgJVug6iNs8",
"videoID": "RgJVug6iNs8",
"question_id": "790-2",
"task_type": "Object Recognition",
"question": "How does the story of the wolf conclude in the video?",
"options": [
"A. The wolf becomes friends with the pigs and they live harmoniously.",
"B. The wolf falls into a pot of boiling water and is fatally boiled.",
"C. The wolf is captured by a hunter passing by.",
"D. The wolf runs away and never returns to bother the pigs again."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the story of the wolf conclude in the video?\nOption:\nA. The wolf becomes friends with the pigs and they live harmoniously.\nB. The wolf falls into a pot of boiling water and is fatally boiled.\nC. The wolf is captured by a hunter passing by.\nD. The wolf runs away and never returns to bother the pigs again.\nAnswer with the option's letter from the given choices directly.",
2368,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "790-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2369,
"target": "A",
"doc": {
"video_id": "790",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Stage Play",
"url": "https://www.youtube.com/watch?v=RgJVug6iNs8",
"videoID": "RgJVug6iNs8",
"question_id": "790-3",
"task_type": "Object Reasoning",
"question": "What can be inferred about the storyline of the video?",
"options": [
"A. The story is about the rewards of hard work and ingenuity in building a better life.",
"B. The story focuses on the wolf using his resourcefulness to make a living by invading the pigs to get food.",
"C. The story follows the pigs and the wolf from hostility to becoming friends in the end.",
"D. Can't be inferred from the video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What can be inferred about the storyline of the video?\nOption:\nA. The story is about the rewards of hard work and ingenuity in building a better life.\nB. The story focuses on the wolf using his resourcefulness to make a living by invading the pigs to get food.\nC. The story follows the pigs and the wolf from hostility to becoming friends in the end.\nD. Can't be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
2369,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "790-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Stage Play",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2370,
"target": "B",
"doc": {
"video_id": "791",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=9tBsMSDoDqk",
"videoID": "9tBsMSDoDqk",
"question_id": "791-1",
"task_type": "Temporal Reasoning",
"question": "In which order are the following magics performed in this video?\n(a) Finding words in a dictionary.\n(b) Drawing a diamond using a match.\n(c) Producing a popcoin from eyes.\n(d) Seeing things through a bearded dragon.\n(e) Inducing judges to choose the cards that they get.",
"options": [
"A. (c)(a)(e)(d)(b).",
"B. (c)(d)(a)(e)(b).",
"C. (a)(e)(b)(c)(d).",
"D. (a)(b)(d)(c)(e)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following magics performed in this video?\n(a) Finding words in a dictionary.\n(b) Drawing a diamond using a match.\n(c) Producing a popcoin from eyes.\n(d) Seeing things through a bearded dragon.\n(e) Inducing judges to choose the cards that they get.\nOption:\nA. (c)(a)(e)(d)(b).\nB. (c)(d)(a)(e)(b).\nC. (a)(e)(b)(c)(d).\nD. (a)(b)(d)(c)(e).\nAnswer with the option's letter from the given choices directly.",
2370,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "791-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2371,
"target": "C",
"doc": {
"video_id": "791",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=9tBsMSDoDqk",
"videoID": "9tBsMSDoDqk",
"question_id": "791-2",
"task_type": "Action Reasoning",
"question": "What is the most possible reason why a bearded dragon can speak in the second magic performance?",
"options": [
"A. By using miniature microphone vocalization.",
"B. By using a hidden voice modulator device.",
"C. By human ventriloquism.",
"D. By employing a highly trained animal mimicry specialist."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most possible reason why a bearded dragon can speak in the second magic performance?\nOption:\nA. By using miniature microphone vocalization.\nB. By using a hidden voice modulator device.\nC. By human ventriloquism.\nD. By employing a highly trained animal mimicry specialist.\nAnswer with the option's letter from the given choices directly.",
2371,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "791-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2372,
"target": "A",
"doc": {
"video_id": "791",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=9tBsMSDoDqk",
"videoID": "9tBsMSDoDqk",
"question_id": "791-3",
"task_type": "Information Synopsis",
"question": "Which summarizes the main content of the video?",
"options": [
"A. A collection of 5 magicians that shocked the judges.",
"B. A talent show with various acts, including magic performances.",
"C. A comedy skit featuring magicians and their funny mishaps.",
"D. A tutorial on how to perform magic tricks."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which summarizes the main content of the video?\nOption:\nA. A collection of 5 magicians that shocked the judges.\nB. A talent show with various acts, including magic performances.\nC. A comedy skit featuring magicians and their funny mishaps.\nD. A tutorial on how to perform magic tricks.\nAnswer with the option's letter from the given choices directly.",
2372,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "791-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2373,
"target": "A",
"doc": {
"video_id": "792",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=0bwyCF0HteE",
"videoID": "0bwyCF0HteE",
"question_id": "792-1",
"task_type": "Counting Problem",
"question": "How many magic shows are included in this video?",
"options": [
"A. 10.",
"B. 9.",
"C. 8.",
"D. 7."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many magic shows are included in this video?\nOption:\nA. 10.\nB. 9.\nC. 8.\nD. 7.\nAnswer with the option's letter from the given choices directly.",
2373,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "792-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2374,
"target": "D",
"doc": {
"video_id": "792",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=0bwyCF0HteE",
"videoID": "0bwyCF0HteE",
"question_id": "792-2",
"task_type": "Action Reasoning",
"question": "What is special about the second magic?",
"options": [
"A. It incorporates fire manipulation and pyrotechnics.",
"B. It includes a disappearing act with a grand illusion.",
"C. It features mind-reading and psychic abilities.",
"D. It is about dogs' quick change, while others are about humans' quick change."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is special about the second magic?\nOption:\nA. It incorporates fire manipulation and pyrotechnics.\nB. It includes a disappearing act with a grand illusion.\nC. It features mind-reading and psychic abilities.\nD. It is about dogs' quick change, while others are about humans' quick change.\nAnswer with the option's letter from the given choices directly.",
2374,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "792-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2375,
"target": "B",
"doc": {
"video_id": "792",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=0bwyCF0HteE",
"videoID": "0bwyCF0HteE",
"question_id": "792-3",
"task_type": "Action Recognition",
"question": "What are the magic tricks about?",
"options": [
"A. Cards trick magics.",
"B. Quick change magics.",
"C. Street magics.",
"D. Coins trick magics."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the magic tricks about?\nOption:\nA. Cards trick magics.\nB. Quick change magics.\nC. Street magics.\nD. Coins trick magics.\nAnswer with the option's letter from the given choices directly.",
2375,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "792-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2376,
"target": "D",
"doc": {
"video_id": "793",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=hIJtZN8w03A",
"videoID": "hIJtZN8w03A",
"question_id": "793-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. This video showcases the incredible skills of magicians who can perform mind-boggling sleight of hand tricks, leaving the audience in awe of their abilities.",
"B. The video explores the deep-rooted connection between magic and supernatural forces, providing evidence of the existence of mystical powers and their influence on magic tricks.",
"C. This video takes a historical approach, tracing the origins of magic back to ancient mystical rituals and delving into the rich history of magical practices across different cultures.",
"D. This video exposes the untold truths about the world's most famous magicians and discovers the science and methods behind magic tricks that appear to be pure sorcery."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. This video showcases the incredible skills of magicians who can perform mind-boggling sleight of hand tricks, leaving the audience in awe of their abilities.\nB. The video explores the deep-rooted connection between magic and supernatural forces, providing evidence of the existence of mystical powers and their influence on magic tricks.\nC. This video takes a historical approach, tracing the origins of magic back to ancient mystical rituals and delving into the rich history of magical practices across different cultures.\nD. This video exposes the untold truths about the world's most famous magicians and discovers the science and methods behind magic tricks that appear to be pure sorcery.\nAnswer with the option's letter from the given choices directly.",
2376,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "793-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2377,
"target": "B",
"doc": {
"video_id": "793",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=hIJtZN8w03A",
"videoID": "hIJtZN8w03A",
"question_id": "793-2",
"task_type": "Action Recognition",
"question": "How does this video reveal each magic trick?",
"options": [
"A. The video reveals each magic trick by employing a team of professional magicians who analyze the performances and provide insights into the techniques used.",
"B. The video reveals each magic trick by first quickly recapping each performance and then illustrating the illusion.",
"C. The video reveals each magic trick by showcasing interviews with audience members who witnessed the tricks firsthand and share their observations and theories.",
"D. The video reveals each magic trick by utilizing cutting-edge technology that captures and analyzes the magician's movements, allowing viewers to see the precise mechanics behind each illusion."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does this video reveal each magic trick?\nOption:\nA. The video reveals each magic trick by employing a team of professional magicians who analyze the performances and provide insights into the techniques used.\nB. The video reveals each magic trick by first quickly recapping each performance and then illustrating the illusion.\nC. The video reveals each magic trick by showcasing interviews with audience members who witnessed the tricks firsthand and share their observations and theories.\nD. The video reveals each magic trick by utilizing cutting-edge technology that captures and analyzes the magician's movements, allowing viewers to see the precise mechanics behind each illusion.\nAnswer with the option's letter from the given choices directly.",
2377,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "793-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2378,
"target": "C",
"doc": {
"video_id": "793",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=hIJtZN8w03A",
"videoID": "hIJtZN8w03A",
"question_id": "793-3",
"task_type": "Counting Problem",
"question": "How many tricks are revealed in this video?",
"options": [
"A. 7.",
"B. 9.",
"C. 11.",
"D. 13."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many tricks are revealed in this video?\nOption:\nA. 7.\nB. 9.\nC. 11.\nD. 13.\nAnswer with the option's letter from the given choices directly.",
2378,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "793-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2379,
"target": "C",
"doc": {
"video_id": "794",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=XDq08lk5GnQ",
"videoID": "XDq08lk5GnQ",
"question_id": "794-1",
"task_type": "Information Synopsis",
"question": "What is the primary focus of this video?",
"options": [
"A. The video features a magician who can control the elements and manipulate fire, water, and wind.",
"B. The video highlights a magician who performs several astonishing magics on the stage.",
"C. The video features a magician who reveals the secrets behind famous magic tricks.",
"D. The video showcases a magician who can transform ordinary objects into gold with his magic."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of this video?\nOption:\nA. The video features a magician who can control the elements and manipulate fire, water, and wind.\nB. The video highlights a magician who performs several astonishing magics on the stage.\nC. The video features a magician who reveals the secrets behind famous magic tricks.\nD. The video showcases a magician who can transform ordinary objects into gold with his magic.\nAnswer with the option's letter from the given choices directly.",
2379,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "794-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2380,
"target": "A",
"doc": {
"video_id": "794",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=XDq08lk5GnQ",
"videoID": "XDq08lk5GnQ",
"question_id": "794-2",
"task_type": "Action Reasoning",
"question": "What is the key to the illusion of bending a teaspoon with his mind?",
"options": [
"A. The key to the illusion is a spring at the center to stretch a special teaspoon.",
"B. The key to the illusion is using a hidden magnet to manipulate the teaspoon.",
"C. The key to the illusion is a special chemical coating on the teaspoon that reacts to the magician's touch.",
"D. The key to the illusion is a miniature robotic arm hidden inside the teaspoon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the key to the illusion of bending a teaspoon with his mind?\nOption:\nA. The key to the illusion is a spring at the center to stretch a special teaspoon.\nB. The key to the illusion is using a hidden magnet to manipulate the teaspoon.\nC. The key to the illusion is a special chemical coating on the teaspoon that reacts to the magician's touch.\nD. The key to the illusion is a miniature robotic arm hidden inside the teaspoon.\nAnswer with the option's letter from the given choices directly.",
2380,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "794-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2381,
"target": "B",
"doc": {
"video_id": "794",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=XDq08lk5GnQ",
"videoID": "XDq08lk5GnQ",
"question_id": "794-3",
"task_type": "Object Reasoning",
"question": "What is the purpose of the mirrors in the last magic?",
"options": [
"A. To create an optical illusion that makes it appear as though the magician is in two places at once.",
"B. To prevent the audience from seeing the magician's escape route.",
"C. To enhance the visual aesthetics of the performance and create a mesmerizing atmosphere for the audience.",
"D. To serve as a symbolic representation of the magician's ability to manipulate reality and bend the laws of physics."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the mirrors in the last magic?\nOption:\nA. To create an optical illusion that makes it appear as though the magician is in two places at once.\nB. To prevent the audience from seeing the magician's escape route.\nC. To enhance the visual aesthetics of the performance and create a mesmerizing atmosphere for the audience.\nD. To serve as a symbolic representation of the magician's ability to manipulate reality and bend the laws of physics.\nAnswer with the option's letter from the given choices directly.",
2381,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "794-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2382,
"target": "B",
"doc": {
"video_id": "795",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=D97vMwfWxvI",
"videoID": "D97vMwfWxvI",
"question_id": "795-1",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. This video shows several magic tricks and introduces the magicians involved and their past journey.",
"B. The magician in the video reveals the secrets behind several magic tricks.",
"C. The magician in the video was able to show the audience a few magic tricks that he is good at.",
"D. The magician in the video performed a series of dangerous stunts."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. This video shows several magic tricks and introduces the magicians involved and their past journey.\nB. The magician in the video reveals the secrets behind several magic tricks.\nC. The magician in the video was able to show the audience a few magic tricks that he is good at.\nD. The magician in the video performed a series of dangerous stunts.\nAnswer with the option's letter from the given choices directly.",
2382,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "795-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2383,
"target": "C",
"doc": {
"video_id": "795",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=D97vMwfWxvI",
"videoID": "D97vMwfWxvI",
"question_id": "795-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following magics performed in this video?\n(a) Bend the straw with spirits.\n(b) Two kinds of snacks are poured out of the packaging bag.\n(c) Two coins on the plate turn into seven.\n(d) Coins drilled into the cup from the bottom.",
"options": [
"A. (d)(a)(b)(c).",
"B. (d)(c)(a)(b).",
"C. (c)(d)(a)(b).",
"D. (c)(a)(d)(b)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following magics performed in this video?\n(a) Bend the straw with spirits.\n(b) Two kinds of snacks are poured out of the packaging bag.\n(c) Two coins on the plate turn into seven.\n(d) Coins drilled into the cup from the bottom.\nOption:\nA. (d)(a)(b)(c).\nB. (d)(c)(a)(b).\nC. (c)(d)(a)(b).\nD. (c)(a)(d)(b).\nAnswer with the option's letter from the given choices directly.",
2383,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "795-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2384,
"target": "A",
"doc": {
"video_id": "795",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=D97vMwfWxvI",
"videoID": "D97vMwfWxvI",
"question_id": "795-3",
"task_type": "Object Recognition",
"question": "Which tool is used in the first magic?",
"options": [
"A. A string.",
"B. A coin.",
"C. A hat.",
"D. A pair of glasses."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which tool is used in the first magic?\nOption:\nA. A string.\nB. A coin.\nC. A hat.\nD. A pair of glasses.\nAnswer with the option's letter from the given choices directly.",
2384,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "795-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2385,
"target": "D",
"doc": {
"video_id": "796",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=_eJsOYC8SVU",
"videoID": "_eJsOYC8SVU",
"question_id": "796-1",
"task_type": "Information Synopsis",
"question": "What is the main content of this video?",
"options": [
"A. The video shows the daily routine of a magician and his daughter training in magic.",
"B. Video shows the amazing pattern talent of a magician's daughter.",
"C. The video shows a father teaching his daughter six magic tricks.",
"D. The video shows a father teaching his daughter eight magic tricks."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this video?\nOption:\nA. The video shows the daily routine of a magician and his daughter training in magic.\nB. Video shows the amazing pattern talent of a magician's daughter.\nC. The video shows a father teaching his daughter six magic tricks.\nD. The video shows a father teaching his daughter eight magic tricks.\nAnswer with the option's letter from the given choices directly.",
2385,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "796-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2386,
"target": "B",
"doc": {
"video_id": "796",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=_eJsOYC8SVU",
"videoID": "_eJsOYC8SVU",
"question_id": "796-2",
"task_type": "Object Recognition",
"question": "Which of the following tools is not used in this video?",
"options": [
"A. Cards.",
"B. A string.",
"C. Coins.",
"D. A calculator."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following tools is not used in this video?\nOption:\nA. Cards.\nB. A string.\nC. Coins.\nD. A calculator.\nAnswer with the option's letter from the given choices directly.",
2386,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "796-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2387,
"target": "C",
"doc": {
"video_id": "796",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=_eJsOYC8SVU",
"videoID": "_eJsOYC8SVU",
"question_id": "796-3",
"task_type": "Object Reasoning",
"question": "What do the first and the fifth segments of magic in this video have in common?",
"options": [
"A. They include the same tools for magic.",
"B. They both need an assistant.",
"C. They happen in the same place.",
"D. They both require several demonstrations."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the first and the fifth segments of magic in this video have in common?\nOption:\nA. They include the same tools for magic.\nB. They both need an assistant.\nC. They happen in the same place.\nD. They both require several demonstrations.\nAnswer with the option's letter from the given choices directly.",
2387,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "796-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2388,
"target": "A",
"doc": {
"video_id": "797",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=H6Y2KRfpRVY",
"videoID": "H6Y2KRfpRVY",
"question_id": "797-1",
"task_type": "Counting Problem",
"question": "How many magics are performed in this video?",
"options": [
"A. 5.",
"B. 4.",
"C. 3.",
"D. 2."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many magics are performed in this video?\nOption:\nA. 5.\nB. 4.\nC. 3.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
2388,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "797-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2389,
"target": "D",
"doc": {
"video_id": "797",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=H6Y2KRfpRVY",
"videoID": "H6Y2KRfpRVY",
"question_id": "797-2",
"task_type": "Object Reasoning",
"question": "What do the first two magics performed in the video have in common?",
"options": [
"A. A videotape is played during the performance.",
"B. The magician does not wear a mask.",
"C. The performance too amazing to win multiple rounds of applause.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the first two magics performed in the video have in common?\nOption:\nA. A videotape is played during the performance.\nB. The magician does not wear a mask.\nC. The performance too amazing to win multiple rounds of applause.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2389,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "797-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2390,
"target": "B",
"doc": {
"video_id": "797",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=H6Y2KRfpRVY",
"videoID": "H6Y2KRfpRVY",
"question_id": "797-3",
"task_type": "Action Recognition",
"question": "Which action is not included in the third magic?",
"options": [
"A. Tapping on the top of head.",
"B. Ten fingers crossed with volunteers.",
"C. Touching on the shoulder for three times.",
"D. Two volunteers drawing the same curve."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which action is not included in the third magic?\nOption:\nA. Tapping on the top of head.\nB. Ten fingers crossed with volunteers.\nC. Touching on the shoulder for three times.\nD. Two volunteers drawing the same curve.\nAnswer with the option's letter from the given choices directly.",
2390,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "797-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2391,
"target": "C",
"doc": {
"video_id": "798",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=k2FIFQIYBvA",
"videoID": "k2FIFQIYBvA",
"question_id": "798-1",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. Magician transforms objects into living animals and performs several card and coin tricks in front of the host, leaving the audience jaw-dropped.",
"B. A magician performs a telepathic trick to find out what the spectator across the room is thinking and what he has been through in the past three days.",
"C. A magician performs shocking magic tricks by sewing his lips together, frightening the audience, and performing scary card and nail tricks.",
"D. A magician can accurately predict the future and can accurately name a person he doesn't know."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. Magician transforms objects into living animals and performs several card and coin tricks in front of the host, leaving the audience jaw-dropped.\nB. A magician performs a telepathic trick to find out what the spectator across the room is thinking and what he has been through in the past three days.\nC. A magician performs shocking magic tricks by sewing his lips together, frightening the audience, and performing scary card and nail tricks.\nD. A magician can accurately predict the future and can accurately name a person he doesn't know.\nAnswer with the option's letter from the given choices directly.",
2391,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "798-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2392,
"target": "A",
"doc": {
"video_id": "798",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=k2FIFQIYBvA",
"videoID": "k2FIFQIYBvA",
"question_id": "798-2",
"task_type": "Action Reasoning",
"question": "What similarities exist between the first two magic tricks?",
"options": [
"A. Both of them are performed in the same place.",
"B. Both of them require 5 volunteers.",
"C. Neither of them needs deck cards.",
"D. All of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What similarities exist between the first two magic tricks?\nOption:\nA. Both of them are performed in the same place.\nB. Both of them require 5 volunteers.\nC. Neither of them needs deck cards.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2392,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "798-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2393,
"target": "A",
"doc": {
"video_id": "798",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=k2FIFQIYBvA",
"videoID": "k2FIFQIYBvA",
"question_id": "798-3",
"task_type": "Object Reasoning",
"question": "What is the relationship between the magician and the audience with striped shirts and brown hair?",
"options": [
"A. The audience is a volunteer for the magician's performance.",
"B. The audience is the magician's trust.",
"C. The audience is the magician's specially invited guest, a very good friend of his and a magic lover.",
"D. The audience is another highly skilled magician, which enhances the magic's incredulity."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the magician and the audience with striped shirts and brown hair?\nOption:\nA. The audience is a volunteer for the magician's performance.\nB. The audience is the magician's trust.\nC. The audience is the magician's specially invited guest, a very good friend of his and a magic lover.\nD. The audience is another highly skilled magician, which enhances the magic's incredulity.\nAnswer with the option's letter from the given choices directly.",
2393,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "798-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2394,
"target": "B",
"doc": {
"video_id": "799",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=D0ZMmQjKn2Q",
"videoID": "D0ZMmQjKn2Q",
"question_id": "799-1",
"task_type": "Action Reasoning",
"question": "Why does the magician fail in the first episode?",
"options": [
"A. Because the magician forgot the magic words.",
"B. Because the magic box is taken by mistake.",
"C. Because the magician's assistant sabotaged the performance.",
"D. Because the magician was cursed by a rival magician."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the magician fail in the first episode?\nOption:\nA. Because the magician forgot the magic words.\nB. Because the magic box is taken by mistake.\nC. Because the magician's assistant sabotaged the performance.\nD. Because the magician was cursed by a rival magician.\nAnswer with the option's letter from the given choices directly.",
2394,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "799-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2395,
"target": "C",
"doc": {
"video_id": "799",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=D0ZMmQjKn2Q",
"videoID": "D0ZMmQjKn2Q",
"question_id": "799-2",
"task_type": "Temporal Reasoning",
"question": "In which order do the following events happen in this video?\n(a) Mr. Bean falls in love with the beautiful singer Roxy and has a wish to get an autograph from her. His attempts are mostly foiled by the bodyguard until he manages to get a kiss mark from Roxy using her handkerchief and Bean is happy.\n(b) Whilst digging for treasure, Mr. Bean builds his own metal detector and goes to hunt for treasure but fails. When he manages to get the treasure and bring it to his flat.\n(c) Mr. Bean and Irma are off for a day at the seaside, where his trunk gets accidentally swapped with that of a stage magician.\n(d) Bean attends a hypnotism show. He unwittingly volunteers to be hypnotised; when the hypnotist makes him think he's a dog, he runs away and then back at home chases Scrapper around the garden and the house.",
"options": [
"A. (b)(a)(c)(d).",
"B. (a)(c)(d)(b).",
"C. (c)(d)(a)(b).",
"D. (d)(c)(b)(a)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order do the following events happen in this video?\n(a) Mr. Bean falls in love with the beautiful singer Roxy and has a wish to get an autograph from her. His attempts are mostly foiled by the bodyguard until he manages to get a kiss mark from Roxy using her handkerchief and Bean is happy.\n(b) Whilst digging for treasure, Mr. Bean builds his own metal detector and goes to hunt for treasure but fails. When he manages to get the treasure and bring it to his flat.\n(c) Mr. Bean and Irma are off for a day at the seaside, where his trunk gets accidentally swapped with that of a stage magician.\n(d) Bean attends a hypnotism show. He unwittingly volunteers to be hypnotised; when the hypnotist makes him think he's a dog, he runs away and then back at home chases Scrapper around the garden and the house.\nOption:\nA. (b)(a)(c)(d).\nB. (a)(c)(d)(b).\nC. (c)(d)(a)(b).\nD. (d)(c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
2395,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "799-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2396,
"target": "A",
"doc": {
"video_id": "799",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=D0ZMmQjKn2Q",
"videoID": "D0ZMmQjKn2Q",
"question_id": "799-3",
"task_type": "Action Recognition",
"question": "How does Mr. Bean get off the plane when he finds the island is right down?",
"options": [
"A. He puts a lipstick on his face to fake a rash.",
"B. He jumps out of the plane using a parachute made of toilet paper.",
"C. He uses a jetpack hidden in his suitcase to fly off the plane.",
"D. He disguises himself as a flight attendant and exits through the emergency exit."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does Mr. Bean get off the plane when he finds the island is right down?\nOption:\nA. He puts a lipstick on his face to fake a rash.\nB. He jumps out of the plane using a parachute made of toilet paper.\nC. He uses a jetpack hidden in his suitcase to fly off the plane.\nD. He disguises himself as a flight attendant and exits through the emergency exit.\nAnswer with the option's letter from the given choices directly.",
2396,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "799-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2397,
"target": "D",
"doc": {
"video_id": "800",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=Cb91B0wZYLE",
"videoID": "Cb91B0wZYLE",
"question_id": "800-1",
"task_type": "Temporal Reasoning",
"question": "What is the sequence in which the following magic tricks are performed in this video?\n(a) Joker inside the black Jacks sandwich is mysteriously replaced with the randomly chosen card.\n(b) The positions of Aces and random cards are swapped.\n(c) After a full circle of mixing up, the piles go back to where they started.\n(d) The order of cards is recovered from chaos.",
"options": [
"A. (c)(b)(a)(d).",
"B. (b)(a)(d)(c).",
"C. (d)(a)(c)(b).",
"D. (a)(c)(d)(b)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the sequence in which the following magic tricks are performed in this video?\n(a) Joker inside the black Jacks sandwich is mysteriously replaced with the randomly chosen card.\n(b) The positions of Aces and random cards are swapped.\n(c) After a full circle of mixing up, the piles go back to where they started.\n(d) The order of cards is recovered from chaos.\nOption:\nA. (c)(b)(a)(d).\nB. (b)(a)(d)(c).\nC. (d)(a)(c)(b).\nD. (a)(c)(d)(b).\nAnswer with the option's letter from the given choices directly.",
2397,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "800-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2398,
"target": "B",
"doc": {
"video_id": "800",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=Cb91B0wZYLE",
"videoID": "Cb91B0wZYLE",
"question_id": "800-2",
"task_type": "Information Synopsis",
"question": "What is the main focus of this video?",
"options": [
"A. Experience the supernatural with this magician's ability to communicate with spirits!.",
"B. The card trick and magic in tonight's video will fool you 99.99% of the time.",
"C. Witness the incredible levitation powers of this magician!.",
"D. Discover the secrets of time travel through this mind-bending magic performance!."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main focus of this video?\nOption:\nA. Experience the supernatural with this magician's ability to communicate with spirits!.\nB. The card trick and magic in tonight's video will fool you 99.99% of the time.\nC. Witness the incredible levitation powers of this magician!.\nD. Discover the secrets of time travel through this mind-bending magic performance!.\nAnswer with the option's letter from the given choices directly.",
2398,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "800-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2399,
"target": "C",
"doc": {
"video_id": "800",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Magic Show",
"url": "https://www.youtube.com/watch?v=Cb91B0wZYLE",
"videoID": "Cb91B0wZYLE",
"question_id": "800-3",
"task_type": "Attribute Perception",
"question": "According to the video, which magic trick has a tutorial?",
"options": [
"A. Full circle.",
"B. Jack Sandwich.",
"C. Order from chaos.",
"D. Random or Ace."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which magic trick has a tutorial?\nOption:\nA. Full circle.\nB. Jack Sandwich.\nC. Order from chaos.\nD. Random or Ace.\nAnswer with the option's letter from the given choices directly.",
2399,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "800-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Magic Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2400,
"target": "D",
"doc": {
"video_id": "801",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=v4YhsooE5xY",
"videoID": "v4YhsooE5xY",
"question_id": "801-1",
"task_type": "Object Reasoning",
"question": "According to the beginning of the video, who used a tool to open the watermelon?",
"options": [
"A. Frank Skinner.",
"B. Nathan.",
"C. Tim K.",
"D. Josh Whitaker."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the beginning of the video, who used a tool to open the watermelon?\nOption:\nA. Frank Skinner.\nB. Nathan.\nC. Tim K.\nD. Josh Whitaker.\nAnswer with the option's letter from the given choices directly.",
2400,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "801-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2401,
"target": "D",
"doc": {
"video_id": "801",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=v4YhsooE5xY",
"videoID": "v4YhsooE5xY",
"question_id": "801-2",
"task_type": "Object Reasoning",
"question": "In the video, while evaluating five paintings, the man with platinum blonde hair and a pure black top contends that the highest-scoring painting was created by whom?",
"options": [
"A. Romesh Nathan.",
"B. Tim K.",
"C. Josh Whitaker.",
"D. Frank Skinner."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, while evaluating five paintings, the man with platinum blonde hair and a pure black top contends that the highest-scoring painting was created by whom?\nOption:\nA. Romesh Nathan.\nB. Tim K.\nC. Josh Whitaker.\nD. Frank Skinner.\nAnswer with the option's letter from the given choices directly.",
2401,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "801-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2402,
"target": "D",
"doc": {
"video_id": "801",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=v4YhsooE5xY",
"videoID": "v4YhsooE5xY",
"question_id": "801-3",
"task_type": "Action Reasoning",
"question": "What is the main purpose of Tim K bending down from the grass, standing up, walking back to the bathtub, and bending down again?",
"options": [
"A. To wash the item he picked up in the bathtub.",
"B. To put the item he picked up into the bathtub.",
"C. To check on something inside the bathtub, possibly related to the competition.",
"D. To put the bathtub plug back in, which he might have removed earlier."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main purpose of Tim K bending down from the grass, standing up, walking back to the bathtub, and bending down again?\nOption:\nA. To wash the item he picked up in the bathtub.\nB. To put the item he picked up into the bathtub.\nC. To check on something inside the bathtub, possibly related to the competition.\nD. To put the bathtub plug back in, which he might have removed earlier.\nAnswer with the option's letter from the given choices directly.",
2402,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "801-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2403,
"target": "C",
"doc": {
"video_id": "802",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=TYmHe20p_zU",
"videoID": "TYmHe20p_zU",
"question_id": "802-1",
"task_type": "Information Synopsis",
"question": "What kind of transformation does William Holmes undergo throughout the video?",
"options": [
"A. From being unkempt to tidy.",
"B. From being poor to wealthy.",
"C. From being stressed to enjoying life.",
"D. From feeling inferior to being confident."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What kind of transformation does William Holmes undergo throughout the video?\nOption:\nA. From being unkempt to tidy.\nB. From being poor to wealthy.\nC. From being stressed to enjoying life.\nD. From feeling inferior to being confident.\nAnswer with the option's letter from the given choices directly.",
2403,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "802-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2404,
"target": "D",
"doc": {
"video_id": "802",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=TYmHe20p_zU",
"videoID": "TYmHe20p_zU",
"question_id": "802-2",
"task_type": "Action Reasoning",
"question": "Why do the five members of the Feb 5 team want to help William Holmes change?",
"options": [
"A. They hope that William Holmes can become better and reunite with his girlfriend.",
"B. They want to help William Holmes and become friends with him.",
"C. They are enthusiastic about teaching their skills to others to gain a sense of achievement.",
"D. They hope William Holmes achieves a work-life balance."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do the five members of the Feb 5 team want to help William Holmes change?\nOption:\nA. They hope that William Holmes can become better and reunite with his girlfriend.\nB. They want to help William Holmes and become friends with him.\nC. They are enthusiastic about teaching their skills to others to gain a sense of achievement.\nD. They hope William Holmes achieves a work-life balance.\nAnswer with the option's letter from the given choices directly.",
2404,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "802-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2405,
"target": "C",
"doc": {
"video_id": "802",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=TYmHe20p_zU",
"videoID": "TYmHe20p_zU",
"question_id": "802-3",
"task_type": "Information Synopsis",
"question": "In the video, which aspect of William Holmes do the five members of the Feb 5 team not change?",
"options": [
"A. Cooking skills.",
"B. Clothing.",
"C. Work.",
"D. Hairstyle."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, which aspect of William Holmes do the five members of the Feb 5 team not change?\nOption:\nA. Cooking skills.\nB. Clothing.\nC. Work.\nD. Hairstyle.\nAnswer with the option's letter from the given choices directly.",
2405,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "802-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2406,
"target": "C",
"doc": {
"video_id": "803",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=eQGSbBANfVg",
"videoID": "eQGSbBANfVg",
"question_id": "803-1",
"task_type": "Action Recognition",
"question": "In the early part of the video, how was the flip-flop that traveled the farthest thrown?",
"options": [
"A. Slid on a rope.",
"B. Thrown directly by hand.",
"C. The flip-flop was carried on the body.",
"D. Hit with a bat."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the early part of the video, how was the flip-flop that traveled the farthest thrown?\nOption:\nA. Slid on a rope.\nB. Thrown directly by hand.\nC. The flip-flop was carried on the body.\nD. Hit with a bat.\nAnswer with the option's letter from the given choices directly.",
2406,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "803-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2407,
"target": "B",
"doc": {
"video_id": "803",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=eQGSbBANfVg",
"videoID": "eQGSbBANfVg",
"question_id": "803-2",
"task_type": "Action Reasoning",
"question": "In the middle part of the video, why does the girl named Danielle Walker, with golden hair and blue clothes, use a huge bat to hit the ball?",
"options": [
"A. To increase the probability of hitting the ball.",
"B. To carry out the \"cricket appeal\" task.",
"C. That's how real cricket is played.",
"D. To make the process of hitting the ball more interesting."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle part of the video, why does the girl named Danielle Walker, with golden hair and blue clothes, use a huge bat to hit the ball?\nOption:\nA. To increase the probability of hitting the ball.\nB. To carry out the \"cricket appeal\" task.\nC. That's how real cricket is played.\nD. To make the process of hitting the ball more interesting.\nAnswer with the option's letter from the given choices directly.",
2407,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "803-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2408,
"target": "D",
"doc": {
"video_id": "803",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=eQGSbBANfVg",
"videoID": "eQGSbBANfVg",
"question_id": "803-3",
"task_type": "Object Reasoning",
"question": "In the latter part of the video, when Luke says, \"We probably have no time.\", what does this statement express about Luke's thoughts?",
"options": [
"A. Luke wants to save time.",
"B. Luke didn't have time to finish what he wanted to do.",
"C. Luke wants to showcase a shorter video.",
"D. Luke doesn't want to play the upcoming video."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, when Luke says, \"We probably have no time.\", what does this statement express about Luke's thoughts?\nOption:\nA. Luke wants to save time.\nB. Luke didn't have time to finish what he wanted to do.\nC. Luke wants to showcase a shorter video.\nD. Luke doesn't want to play the upcoming video.\nAnswer with the option's letter from the given choices directly.",
2408,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "803-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2409,
"target": "C",
"doc": {
"video_id": "804",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=AeEYQ62t8hA",
"videoID": "AeEYQ62t8hA",
"question_id": "804-1",
"task_type": "Counting Problem",
"question": "How many guests were invited to participate in the interview in this segment of the video?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 5."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many guests were invited to participate in the interview in this segment of the video?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
2409,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "804-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2410,
"target": "B",
"doc": {
"video_id": "804",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=AeEYQ62t8hA",
"videoID": "AeEYQ62t8hA",
"question_id": "804-2",
"task_type": "Object Reasoning",
"question": "In James Corden's dream, what does it mean when the man sitting to his left in a black suit and black tie and the man sitting to his right in a dark blue top suddenly burst into laughter?",
"options": [
"A. This means that James Corden's story is very interesting, and these two men were amused.",
"B. This means that James Corden's story is not interesting, and these two men are fake laughing.",
"C. This means that James Cordon decided not to leave the show, and these two men felt very happy.",
"D. This means that James Cordon decided to leave the show, and these two men felt very happy."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In James Corden's dream, what does it mean when the man sitting to his left in a black suit and black tie and the man sitting to his right in a dark blue top suddenly burst into laughter?\nOption:\nA. This means that James Corden's story is very interesting, and these two men were amused.\nB. This means that James Corden's story is not interesting, and these two men are fake laughing.\nC. This means that James Cordon decided not to leave the show, and these two men felt very happy.\nD. This means that James Cordon decided to leave the show, and these two men felt very happy.\nAnswer with the option's letter from the given choices directly.",
2410,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "804-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2411,
"target": "A",
"doc": {
"video_id": "804",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=AeEYQ62t8hA",
"videoID": "AeEYQ62t8hA",
"question_id": "804-3",
"task_type": "Action Reasoning",
"question": "At the end of the video, why does James Corden sing a song with tears in his eyes?",
"options": [
"A. He was reluctant to leave this job.",
"B. He expressed his satisfaction with the hard work of his colleagues.",
"C. He felt touched by the work he had done over the years.",
"D. He wept with joy at the achievements of the performance."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: At the end of the video, why does James Corden sing a song with tears in his eyes?\nOption:\nA. He was reluctant to leave this job.\nB. He expressed his satisfaction with the hard work of his colleagues.\nC. He felt touched by the work he had done over the years.\nD. He wept with joy at the achievements of the performance.\nAnswer with the option's letter from the given choices directly.",
2411,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "804-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2412,
"target": "C",
"doc": {
"video_id": "805",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=C8Rk__S06_M",
"videoID": "C8Rk__S06_M",
"question_id": "805-1",
"task_type": "Counting Problem",
"question": "In the early part of the video, during the One-Eyed Monster quiz task, how many questions did the team with the boy with dark skin answer correctly?",
"options": [
"A. 1.",
"B. 3.",
"C. 0.",
"D. 2."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the early part of the video, during the One-Eyed Monster quiz task, how many questions did the team with the boy with dark skin answer correctly?\nOption:\nA. 1.\nB. 3.\nC. 0.\nD. 2.\nAnswer with the option's letter from the given choices directly.",
2412,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "805-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2413,
"target": "A",
"doc": {
"video_id": "805",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=C8Rk__S06_M",
"videoID": "C8Rk__S06_M",
"question_id": "805-2",
"task_type": "Object Recognition",
"question": "Which of the five items, guessed by the man in the red top during the price-guessing segment in the middle of the video, deviated the least from its actual price?",
"options": [
"A. The Dental Floss.",
"B. The Tide.",
"C. The TGI Friday's frozen spinach and artichoke dip.",
"D. The Rice-A-Roni."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the five items, guessed by the man in the red top during the price-guessing segment in the middle of the video, deviated the least from its actual price?\nOption:\nA. The Dental Floss.\nB. The Tide.\nC. The TGI Friday's frozen spinach and artichoke dip.\nD. The Rice-A-Roni.\nAnswer with the option's letter from the given choices directly.",
2413,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "805-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2414,
"target": "B",
"doc": {
"video_id": "805",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=C8Rk__S06_M",
"videoID": "C8Rk__S06_M",
"question_id": "805-3",
"task_type": "Object Reasoning",
"question": "In the latter part of the video, during the \"Who'd you rather\" segment, what is the relationship between the man wearing a black jacket and Rihanna?",
"options": [
"A. Sibling relationship.",
"B. Strangers.",
"C. Romantic relationship.",
"D. Idol and fan relationship."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, during the \"Who'd you rather\" segment, what is the relationship between the man wearing a black jacket and Rihanna?\nOption:\nA. Sibling relationship.\nB. Strangers.\nC. Romantic relationship.\nD. Idol and fan relationship.\nAnswer with the option's letter from the given choices directly.",
2414,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "805-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2415,
"target": "B",
"doc": {
"video_id": "806",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=r1PoSdFWvQg",
"videoID": "r1PoSdFWvQg",
"question_id": "806-1",
"task_type": "Attribute Perception",
"question": "In the first game of the video's early part, what is the key to winning this game of building the tallest tower?",
"options": [
"A. Contestants use their bodies to support the cans, achieving a taller tower.",
"B. Contestants do not knock down the already stacked can tower and use this tower to achieve a greater height.",
"C. Contestants knock down the already stacked can tower and use a new method to build a taller tower.",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the first game of the video's early part, what is the key to winning this game of building the tallest tower?\nOption:\nA. Contestants use their bodies to support the cans, achieving a taller tower.\nB. Contestants do not knock down the already stacked can tower and use this tower to achieve a greater height.\nC. Contestants knock down the already stacked can tower and use a new method to build a taller tower.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2415,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "806-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Attribute Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2416,
"target": "A",
"doc": {
"video_id": "806",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=r1PoSdFWvQg",
"videoID": "r1PoSdFWvQg",
"question_id": "806-2",
"task_type": "Action Recognition",
"question": "In the latter part of the video, which method did the contestant with the shortest time used to send inflatable ducks into the lake?",
"options": [
"A. Dragging the duck by its tail.",
"B. Dragging the duck by its head.",
"C. Rolling the duck.",
"D. Pushing the duck."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, which method did the contestant with the shortest time used to send inflatable ducks into the lake?\nOption:\nA. Dragging the duck by its tail.\nB. Dragging the duck by its head.\nC. Rolling the duck.\nD. Pushing the duck.\nAnswer with the option's letter from the given choices directly.",
2416,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "806-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2417,
"target": "C",
"doc": {
"video_id": "806",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=r1PoSdFWvQg",
"videoID": "r1PoSdFWvQg",
"question_id": "806-3",
"task_type": "Object Recognition",
"question": "In the latter part of the video, which prize can Sam Campbell not obtain?",
"options": [
"A. A wooden table.",
"B. A wooden Pinocchio.",
"C. A stick.",
"D. A wooden painting."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, which prize can Sam Campbell not obtain?\nOption:\nA. A wooden table.\nB. A wooden Pinocchio.\nC. A stick.\nD. A wooden painting.\nAnswer with the option's letter from the given choices directly.",
2417,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "806-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2418,
"target": "C",
"doc": {
"video_id": "807",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=dLjOLXmu68M",
"videoID": "dLjOLXmu68M",
"question_id": "807-1",
"task_type": "Action Reasoning",
"question": "When the three hosts discuss the ranking of musical competition shows, why do they think \"X Factor\" ranks the highest?",
"options": [
"A. Because the cast of X Factor is larger.",
"B. Because the X factor is original, the program quality is higher.",
"C. Because the judges of X Factor will fight for opportunities for the contestants.",
"D. Because X Factor was the first music competition program, it opened the curtain on the music competition."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When the three hosts discuss the ranking of musical competition shows, why do they think \"X Factor\" ranks the highest?\nOption:\nA. Because the cast of X Factor is larger.\nB. Because the X factor is original, the program quality is higher.\nC. Because the judges of X Factor will fight for opportunities for the contestants.\nD. Because X Factor was the first music competition program, it opened the curtain on the music competition.\nAnswer with the option's letter from the given choices directly.",
2418,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "807-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2419,
"target": "A",
"doc": {
"video_id": "807",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=dLjOLXmu68M",
"videoID": "dLjOLXmu68M",
"question_id": "807-2",
"task_type": "Action Reasoning",
"question": "Which reality show is NOT ranked first in its respective category based on the order of the three guests in the video?",
"options": [
"A. Vanderpump Rules.",
"B. The Great British Bake Off.",
"C. Real Housewives.",
"D. America's Next Top Model."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which reality show is NOT ranked first in its respective category based on the order of the three guests in the video?\nOption:\nA. Vanderpump Rules.\nB. The Great British Bake Off.\nC. Real Housewives.\nD. America's Next Top Model.\nAnswer with the option's letter from the given choices directly.",
2419,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "807-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2420,
"target": "B",
"doc": {
"video_id": "807",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=dLjOLXmu68M",
"videoID": "dLjOLXmu68M",
"question_id": "807-3",
"task_type": "Object Reasoning",
"question": "In the latter part of the video, what is the relationship between the shows \"Flavor of Love\" and \"To Be Her Best Friend\" as discussed by the three hosts?",
"options": [
"A. Both shows are fantastic performances.",
"B. Both shows lay the groundwork for the fun of flavor, and participants will be made fun of.",
"C. Both shows are a response to \"The Bachelor.\".",
"D. Cannot be determined."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, what is the relationship between the shows \"Flavor of Love\" and \"To Be Her Best Friend\" as discussed by the three hosts?\nOption:\nA. Both shows are fantastic performances.\nB. Both shows lay the groundwork for the fun of flavor, and participants will be made fun of.\nC. Both shows are a response to \"The Bachelor.\".\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2420,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "807-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2421,
"target": "A",
"doc": {
"video_id": "808",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=leG0U5wzGlA",
"videoID": "leG0U5wzGlA",
"question_id": "808-1",
"task_type": "Object Reasoning",
"question": "In the early part of the video, when Sal, the man wearing a black suit and a yellow tie, gives a speech, what is the relationship between Crystal and Gary he mentioned?",
"options": [
"A. Spouses.",
"B. Business partners.",
"C. Friends.",
"D. Unclear."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the early part of the video, when Sal, the man wearing a black suit and a yellow tie, gives a speech, what is the relationship between Crystal and Gary he mentioned?\nOption:\nA. Spouses.\nB. Business partners.\nC. Friends.\nD. Unclear.\nAnswer with the option's letter from the given choices directly.",
2421,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "808-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2422,
"target": "A",
"doc": {
"video_id": "808",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=leG0U5wzGlA",
"videoID": "leG0U5wzGlA",
"question_id": "808-2",
"task_type": "Action Reasoning",
"question": "In the middle part of the video, why does Joe, the man wearing a blue shirt and yellow pants, pounce on the customer's table?",
"options": [
"A. Because it was a task given by his friends to find all the broken tables.",
"B. Because it was a task given by his friends for Joe to jump on all the tables.",
"C. Because Joe is the restaurant manager and needs to check if the tables are sturdy.",
"D. Because Joe wants to destroy these tables."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle part of the video, why does Joe, the man wearing a blue shirt and yellow pants, pounce on the customer's table?\nOption:\nA. Because it was a task given by his friends to find all the broken tables.\nB. Because it was a task given by his friends for Joe to jump on all the tables.\nC. Because Joe is the restaurant manager and needs to check if the tables are sturdy.\nD. Because Joe wants to destroy these tables.\nAnswer with the option's letter from the given choices directly.",
2422,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "808-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2423,
"target": "D",
"doc": {
"video_id": "808",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=leG0U5wzGlA",
"videoID": "leG0U5wzGlA",
"question_id": "808-3",
"task_type": "Action Reasoning",
"question": "In the latter part of the video, why is Brian Quinn, a man wearing a baseball cap, booed by the audience at the band's live performance?",
"options": [
"A. Because the audience was angry about being pranked by Brian Quinn.",
"B. Because Brian Quinn lost the funds raised by the audience for the band.",
"C. Because Brian Quinn intentionally interrupted the band's performance.",
"D. Because Brian Quinn charged the audience at a free performance."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, why is Brian Quinn, a man wearing a baseball cap, booed by the audience at the band's live performance?\nOption:\nA. Because the audience was angry about being pranked by Brian Quinn.\nB. Because Brian Quinn lost the funds raised by the audience for the band.\nC. Because Brian Quinn intentionally interrupted the band's performance.\nD. Because Brian Quinn charged the audience at a free performance.\nAnswer with the option's letter from the given choices directly.",
2423,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "808-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2424,
"target": "A",
"doc": {
"video_id": "809",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=o6gD9_akew0",
"videoID": "o6gD9_akew0",
"question_id": "809-1",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT a reason why 100 women think Steve Harvey is a good kisser?",
"options": [
"A. He is handsome.",
"B. He is kind.",
"C. His lips.",
"D. He has charisma."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT a reason why 100 women think Steve Harvey is a good kisser?\nOption:\nA. He is handsome.\nB. He is kind.\nC. His lips.\nD. He has charisma.\nAnswer with the option's letter from the given choices directly.",
2424,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "809-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2425,
"target": "A",
"doc": {
"video_id": "809",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=o6gD9_akew0",
"videoID": "o6gD9_akew0",
"question_id": "809-2",
"task_type": "Action Reasoning",
"question": "In the middle part of the video, when Kim Kardashian and Khloé Kardashian come on stage, Khloé Kardashian shakes her head and says \"No\" to Kim Kardashian. What does this indicate?",
"options": [
"A. Khloé refuses to shake hands with Kim.",
"B. There is a feud between Khloé and Kim.",
"C. Khloé refuses to compete with Kim.",
"D. Khloé dislikes Kim."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle part of the video, when Kim Kardashian and Khloé Kardashian come on stage, Khloé Kardashian shakes her head and says \"No\" to Kim Kardashian. What does this indicate?\nOption:\nA. Khloé refuses to shake hands with Kim.\nB. There is a feud between Khloé and Kim.\nC. Khloé refuses to compete with Kim.\nD. Khloé dislikes Kim.\nAnswer with the option's letter from the given choices directly.",
2425,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "809-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2426,
"target": "C",
"doc": {
"video_id": "809",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=o6gD9_akew0",
"videoID": "o6gD9_akew0",
"question_id": "809-3",
"task_type": "Action Recognition",
"question": "In the latter part of the video, how is the show scoring the answers given by Kim Kardashian?",
"options": [
"A. Based on the panel of judges deciding on accuracy or entertainment value.",
"B. Based on the matching answers with a celebrity guest's spouse/friend.",
"C. Based on the pre-recorded survey results.",
"D. Based on real-time voting from the on-site audience."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the latter part of the video, how is the show scoring the answers given by Kim Kardashian?\nOption:\nA. Based on the panel of judges deciding on accuracy or entertainment value.\nB. Based on the matching answers with a celebrity guest's spouse/friend.\nC. Based on the pre-recorded survey results.\nD. Based on real-time voting from the on-site audience.\nAnswer with the option's letter from the given choices directly.",
2426,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "809-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2427,
"target": "D",
"doc": {
"video_id": "810",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=N62mrplIygc",
"videoID": "N62mrplIygc",
"question_id": "810-1",
"task_type": "Object Recognition",
"question": "In the middle part of the video, which rule did Bailey, the girl, break on her first day of school?",
"options": [
"A. Running from the classroom to the cafeteria.",
"B. Wasting food.",
"C. Smoking.",
"D. Leaving the classroom early."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the middle part of the video, which rule did Bailey, the girl, break on her first day of school?\nOption:\nA. Running from the classroom to the cafeteria.\nB. Wasting food.\nC. Smoking.\nD. Leaving the classroom early.\nAnswer with the option's letter from the given choices directly.",
2427,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "810-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2428,
"target": "C",
"doc": {
"video_id": "810",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=N62mrplIygc",
"videoID": "N62mrplIygc",
"question_id": "810-2",
"task_type": "Object Reasoning",
"question": "What is the relationship between Jono and Portia, who comforting the sad Jono?",
"options": [
"A. Portia is Jono's teacher.",
"B. Jono and Portia have a mother-son relationship.",
"C. Jono is boarding at Portia's house.",
"D. Cannot be determined."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between Jono and Portia, who comforting the sad Jono?\nOption:\nA. Portia is Jono's teacher.\nB. Jono and Portia have a mother-son relationship.\nC. Jono is boarding at Portia's house.\nD. Cannot be determined.\nAnswer with the option's letter from the given choices directly.",
2428,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "810-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2429,
"target": "C",
"doc": {
"video_id": "810",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Variety Show",
"url": "https://www.youtube.com/watch?v=N62mrplIygc",
"videoID": "N62mrplIygc",
"question_id": "810-3",
"task_type": "Object Reasoning",
"question": "Based on the entire video, what kind of transformation occurs between Jono and Bailey?",
"options": [
"A. From being unruly to following rules.",
"B. From spending carelessly to being more frugal.",
"C. From being selfish and willful to understanding gratitude.",
"D. From being lazy to becoming more diligent."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the entire video, what kind of transformation occurs between Jono and Bailey?\nOption:\nA. From being unruly to following rules.\nB. From spending carelessly to being more frugal.\nC. From being selfish and willful to understanding gratitude.\nD. From being lazy to becoming more diligent.\nAnswer with the option's letter from the given choices directly.",
2429,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "810-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Variety Show",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2430,
"target": "B",
"doc": {
"video_id": "811",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=B6fvT2LKEDI",
"videoID": "B6fvT2LKEDI",
"question_id": "811-1",
"task_type": "Counting Problem",
"question": "In how many instances does archery make an appearance during these performances?",
"options": [
"A. 2.",
"B. 3.",
"C. 4.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In how many instances does archery make an appearance during these performances?\nOption:\nA. 2.\nB. 3.\nC. 4.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
2430,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "811-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2431,
"target": "C",
"doc": {
"video_id": "811",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=B6fvT2LKEDI",
"videoID": "B6fvT2LKEDI",
"question_id": "811-2",
"task_type": "Object Reasoning",
"question": "What do not the two male performers have in common?",
"options": [
"A. They are both black.",
"B. Their performances are scary.",
"C. They both wear bow ties.",
"D. They both have good flexibility."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do not the two male performers have in common?\nOption:\nA. They are both black.\nB. Their performances are scary.\nC. They both wear bow ties.\nD. They both have good flexibility.\nAnswer with the option's letter from the given choices directly.",
2431,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "811-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2432,
"target": "A",
"doc": {
"video_id": "811",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=B6fvT2LKEDI",
"videoID": "B6fvT2LKEDI",
"question_id": "811-3",
"task_type": "Action Reasoning",
"question": "The result of which performance remains unknown based on this video?",
"options": [
"A. The first and the sixth.",
"B. The second and the sixth.",
"C. The first and the last.",
"D. The second and the last."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The result of which performance remains unknown based on this video?\nOption:\nA. The first and the sixth.\nB. The second and the sixth.\nC. The first and the last.\nD. The second and the last.\nAnswer with the option's letter from the given choices directly.",
2432,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "811-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2433,
"target": "A",
"doc": {
"video_id": "812",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=yh-EHgkFci4",
"videoID": "yh-EHgkFci4",
"question_id": "812-1",
"task_type": "Action Reasoning",
"question": "How does the actor in the video achieve the jump upwards in the first clip?",
"options": [
"A. Trampoline.",
"B. Hanging.",
"C. Special effects editing.",
"D. Using spring-loaded shoes."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the actor in the video achieve the jump upwards in the first clip?\nOption:\nA. Trampoline.\nB. Hanging.\nC. Special effects editing.\nD. Using spring-loaded shoes.\nAnswer with the option's letter from the given choices directly.",
2433,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "812-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2434,
"target": "D",
"doc": {
"video_id": "812",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=yh-EHgkFci4",
"videoID": "yh-EHgkFci4",
"question_id": "812-2",
"task_type": "Object Reasoning",
"question": "What do the third and fourth programs have in common?",
"options": [
"A. The same number of persons.",
"B. The actors are of the same gender.",
"C. All have background screens.",
"D. There is water involved."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the third and fourth programs have in common?\nOption:\nA. The same number of persons.\nB. The actors are of the same gender.\nC. All have background screens.\nD. There is water involved.\nAnswer with the option's letter from the given choices directly.",
2434,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "812-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2435,
"target": "B",
"doc": {
"video_id": "812",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=yh-EHgkFci4",
"videoID": "yh-EHgkFci4",
"question_id": "812-3",
"task_type": "Counting Problem",
"question": "How many of the shows in the video are solo acts?",
"options": [
"A. 2.",
"B. 1.",
"C. 4.",
"D. 3."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many of the shows in the video are solo acts?\nOption:\nA. 2.\nB. 1.\nC. 4.\nD. 3.\nAnswer with the option's letter from the given choices directly.",
2435,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "812-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2436,
"target": "D",
"doc": {
"video_id": "813",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=5y5_VyEwZhc",
"videoID": "5y5_VyEwZhc",
"question_id": "813-1",
"task_type": "Action Recognition",
"question": "Which acrobatic skill is absent from this video?",
"options": [
"A. Trampoline.",
"B. Hula hoops.",
"C. Ropes.",
"D. Stilts."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which acrobatic skill is absent from this video?\nOption:\nA. Trampoline.\nB. Hula hoops.\nC. Ropes.\nD. Stilts.\nAnswer with the option's letter from the given choices directly.",
2436,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "813-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2437,
"target": "B",
"doc": {
"video_id": "813",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=5y5_VyEwZhc",
"videoID": "5y5_VyEwZhc",
"question_id": "813-2",
"task_type": "Object Reasoning",
"question": "Which option correctly describes the second performance?",
"options": [
"A. The performance features a red background with fires.",
"B. The second performance involves a male actor who is bare-chested.",
"C. Throughout the performance, only two main actors perform using ropes.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which option correctly describes the second performance?\nOption:\nA. The performance features a red background with fires.\nB. The second performance involves a male actor who is bare-chested.\nC. Throughout the performance, only two main actors perform using ropes.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2437,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "813-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2438,
"target": "C",
"doc": {
"video_id": "813",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=5y5_VyEwZhc",
"videoID": "5y5_VyEwZhc",
"question_id": "813-3",
"task_type": "Action Reasoning",
"question": "What is unique about the last performance?",
"options": [
"A. The last performance features background dancers, distinguishing it from the others.",
"B. The stage lighting is noticeably dimmer in the last performance compared to the others.",
"C. Musical instruments are incorporated into the last performance, unlike the others.",
"D. The last performance showcases a collective group of performers, while the others consist of individual acts."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is unique about the last performance?\nOption:\nA. The last performance features background dancers, distinguishing it from the others.\nB. The stage lighting is noticeably dimmer in the last performance compared to the others.\nC. Musical instruments are incorporated into the last performance, unlike the others.\nD. The last performance showcases a collective group of performers, while the others consist of individual acts.\nAnswer with the option's letter from the given choices directly.",
2438,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "813-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2439,
"target": "C",
"doc": {
"video_id": "814",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=BlTCkMMJM6o",
"videoID": "BlTCkMMJM6o",
"question_id": "814-1",
"task_type": "Action Recognition",
"question": "Which acrobatics skill is absent from this video?",
"options": [
"A. Pommel horse.",
"B. Ropes.",
"C. Trampoline.",
"D. Unicycles."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which acrobatics skill is absent from this video?\nOption:\nA. Pommel horse.\nB. Ropes.\nC. Trampoline.\nD. Unicycles.\nAnswer with the option's letter from the given choices directly.",
2439,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "814-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2440,
"target": "A",
"doc": {
"video_id": "814",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=BlTCkMMJM6o",
"videoID": "BlTCkMMJM6o",
"question_id": "814-2",
"task_type": "Object Reasoning",
"question": "What is true about this video?",
"options": [
"A. The video showcases more than just stage performances.",
"B. There is no audience interaction depicted in the video.",
"C. All the performances in the video take place within a stadium.",
"D. Towards the end of the video, a female actor wearing a red hat is seen with gears around her waist."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is true about this video?\nOption:\nA. The video showcases more than just stage performances.\nB. There is no audience interaction depicted in the video.\nC. All the performances in the video take place within a stadium.\nD. Towards the end of the video, a female actor wearing a red hat is seen with gears around her waist.\nAnswer with the option's letter from the given choices directly.",
2440,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "814-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2441,
"target": "B",
"doc": {
"video_id": "814",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=BlTCkMMJM6o",
"videoID": "BlTCkMMJM6o",
"question_id": "814-3",
"task_type": "Information Synopsis",
"question": "What is this video about?",
"options": [
"A. A documentary about the history of circus performances.",
"B. A compilation that highlights exceptional performances from two spectacular shows.",
"C. A promotional video for a circus-themed amusement park.",
"D. A behind-the-scenes look at the training of circus performers."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video about?\nOption:\nA. A documentary about the history of circus performances.\nB. A compilation that highlights exceptional performances from two spectacular shows.\nC. A promotional video for a circus-themed amusement park.\nD. A behind-the-scenes look at the training of circus performers.\nAnswer with the option's letter from the given choices directly.",
2441,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "814-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2442,
"target": "B",
"doc": {
"video_id": "815",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=7TydWUguPRU",
"videoID": "7TydWUguPRU",
"question_id": "815-1",
"task_type": "Object Reasoning",
"question": "What do the first five programs have in common?",
"options": [
"A. It's all a multi-person show.",
"B. Both use rings as props.",
"C. It's all a mix of male and female actors.",
"D. All with jumping action."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the first five programs have in common?\nOption:\nA. It's all a multi-person show.\nB. Both use rings as props.\nC. It's all a mix of male and female actors.\nD. All with jumping action.\nAnswer with the option's letter from the given choices directly.",
2442,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "815-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2443,
"target": "C",
"doc": {
"video_id": "815",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=7TydWUguPRU",
"videoID": "7TydWUguPRU",
"question_id": "815-2",
"task_type": "Object Reasoning",
"question": "Which two programs have similarities in the use of the ring?",
"options": [
"A. The first and fourth programs.",
"B. Second and fourth programs.",
"C. Third and fifth programs.",
"D. Third and seventh programs."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which two programs have similarities in the use of the ring?\nOption:\nA. The first and fourth programs.\nB. Second and fourth programs.\nC. Third and fifth programs.\nD. Third and seventh programs.\nAnswer with the option's letter from the given choices directly.",
2443,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "815-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2444,
"target": "C",
"doc": {
"video_id": "815",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=7TydWUguPRU",
"videoID": "7TydWUguPRU",
"question_id": "815-3",
"task_type": "Action Reasoning",
"question": "How is the seventh performance different from the others?",
"options": [
"A. There are male and female performances.",
"B. There are performers singing.",
"C. Use the musical score.",
"D. Multiple hula hoops were used."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How is the seventh performance different from the others?\nOption:\nA. There are male and female performances.\nB. There are performers singing.\nC. Use the musical score.\nD. Multiple hula hoops were used.\nAnswer with the option's letter from the given choices directly.",
2444,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "815-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2445,
"target": "A",
"doc": {
"video_id": "816",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=Z2G9bTvffAw",
"videoID": "Z2G9bTvffAw",
"question_id": "816-1",
"task_type": "Object Reasoning",
"question": "What distinguishes the yellow lion from the red lion?",
"options": [
"A. No obvious difference apart from the color can be told from this video.",
"B. The red lion is so much bigger.",
"C. The red lion always stands in the middle of yellow lions.",
"D. It requires extra two performers to manipulate the red lion."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What distinguishes the yellow lion from the red lion?\nOption:\nA. No obvious difference apart from the color can be told from this video.\nB. The red lion is so much bigger.\nC. The red lion always stands in the middle of yellow lions.\nD. It requires extra two performers to manipulate the red lion.\nAnswer with the option's letter from the given choices directly.",
2445,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "816-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2446,
"target": "B",
"doc": {
"video_id": "816",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=Z2G9bTvffAw",
"videoID": "Z2G9bTvffAw",
"question_id": "816-2",
"task_type": "Temporal Reasoning",
"question": "What sequence of actions followed after they stepped on the stilts?\n(a) They unfurled a banner from their mouths.\n(b) They tumbled down from the high platform.\n(c) They caught the balls thrown by the ground staff.",
"options": [
"A. (a)(b)(c).",
"B. (a)(c)(b).",
"C. (b)(c)(a).",
"D. (b)(a)(c)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sequence of actions followed after they stepped on the stilts?\n(a) They unfurled a banner from their mouths.\n(b) They tumbled down from the high platform.\n(c) They caught the balls thrown by the ground staff.\nOption:\nA. (a)(b)(c).\nB. (a)(c)(b).\nC. (b)(c)(a).\nD. (b)(a)(c).\nAnswer with the option's letter from the given choices directly.",
2446,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "816-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2447,
"target": "D",
"doc": {
"video_id": "816",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=Z2G9bTvffAw",
"videoID": "Z2G9bTvffAw",
"question_id": "816-3",
"task_type": "Action Recognition",
"question": "What do the spectators do after the performance?",
"options": [
"A. Take photos with the lion.",
"B. Pat the lion's head.",
"C. Attempt to play the drum.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the spectators do after the performance?\nOption:\nA. Take photos with the lion.\nB. Pat the lion's head.\nC. Attempt to play the drum.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2447,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "816-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2448,
"target": "D",
"doc": {
"video_id": "817",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=EuBhksmj0sQ",
"videoID": "EuBhksmj0sQ",
"question_id": "817-1",
"task_type": "Action Recognition",
"question": "Which acrobatic skill is featured in this video?",
"options": [
"A. Unicycle.",
"B. Bottle juggling.",
"C. Rope jumping.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which acrobatic skill is featured in this video?\nOption:\nA. Unicycle.\nB. Bottle juggling.\nC. Rope jumping.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2448,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "817-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2449,
"target": "D",
"doc": {
"video_id": "817",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=EuBhksmj0sQ",
"videoID": "EuBhksmj0sQ",
"question_id": "817-2",
"task_type": "Action Reasoning",
"question": "Which statement is true about this video?",
"options": [
"A. The costumes worn during the performance remain consistent throughout.",
"B. The performance includes intermittent pauses.",
"C. The performance is conducted during daylight hours.",
"D. All individuals performing in the video are male."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement is true about this video?\nOption:\nA. The costumes worn during the performance remain consistent throughout.\nB. The performance includes intermittent pauses.\nC. The performance is conducted during daylight hours.\nD. All individuals performing in the video are male.\nAnswer with the option's letter from the given choices directly.",
2449,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "817-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2450,
"target": "B",
"doc": {
"video_id": "817",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=EuBhksmj0sQ",
"videoID": "EuBhksmj0sQ",
"question_id": "817-3",
"task_type": "Information Synopsis",
"question": "What is the video about?",
"options": [
"A. A tutorial on how to perform magic tricks.",
"B. An acrobatics dance show.",
"C. A documentary about the history of ballet.",
"D. A compilation of funny human videos."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video about?\nOption:\nA. A tutorial on how to perform magic tricks.\nB. An acrobatics dance show.\nC. A documentary about the history of ballet.\nD. A compilation of funny human videos.\nAnswer with the option's letter from the given choices directly.",
2450,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "817-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2451,
"target": "C",
"doc": {
"video_id": "818",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=Fqptx5fNXR4",
"videoID": "Fqptx5fNXR4",
"question_id": "818-1",
"task_type": "Object Reasoning",
"question": "What is special about the circus?",
"options": [
"A. It is a circus that primarily focuses on magic tricks.",
"B. It is a circus that operates only during the winter season.",
"C. It is a traveling circus without a fixed performance location.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is special about the circus?\nOption:\nA. It is a circus that primarily focuses on magic tricks.\nB. It is a circus that operates only during the winter season.\nC. It is a traveling circus without a fixed performance location.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2451,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "818-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2452,
"target": "A",
"doc": {
"video_id": "818",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=Fqptx5fNXR4",
"videoID": "Fqptx5fNXR4",
"question_id": "818-2",
"task_type": "Action Recognition",
"question": "Which of the performances does the woman participate in?",
"options": [
"A. Globe of death and flying trapeze.",
"B. Globe of death and juggling.",
"C. Flying trapeze and rola bola.",
"D. Flying trapeze and wheel of destiny."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the performances does the woman participate in?\nOption:\nA. Globe of death and flying trapeze.\nB. Globe of death and juggling.\nC. Flying trapeze and rola bola.\nD. Flying trapeze and wheel of destiny.\nAnswer with the option's letter from the given choices directly.",
2452,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "818-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2453,
"target": "D",
"doc": {
"video_id": "818",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=Fqptx5fNXR4",
"videoID": "Fqptx5fNXR4",
"question_id": "818-3",
"task_type": "Information Synopsis",
"question": "What is the main content of the video about?",
"options": [
"A. Three days in the life of a college student.",
"B. Three days in the life of a software engineer working from home.",
"C. Three days in the life of a content creator.",
"D. Three days in the life of a circus actress."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of the video about?\nOption:\nA. Three days in the life of a college student.\nB. Three days in the life of a software engineer working from home.\nC. Three days in the life of a content creator.\nD. Three days in the life of a circus actress.\nAnswer with the option's letter from the given choices directly.",
2453,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "818-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2454,
"target": "B",
"doc": {
"video_id": "819",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=GsYi-l5PEEs",
"videoID": "GsYi-l5PEEs",
"question_id": "819-1",
"task_type": "Action Recognition",
"question": "What is the man doing when the carousel starts spinning for the first time?",
"options": [
"A. He is trying to stand on the balance board on the third level.",
"B. He is jumping from the ground onto the platform.",
"C. He is standing on the balance board on the fourth level.",
"D. He is having a conversation with the little boy on the ground."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the man doing when the carousel starts spinning for the first time?\nOption:\nA. He is trying to stand on the balance board on the third level.\nB. He is jumping from the ground onto the platform.\nC. He is standing on the balance board on the fourth level.\nD. He is having a conversation with the little boy on the ground.\nAnswer with the option's letter from the given choices directly.",
2454,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "819-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2455,
"target": "D",
"doc": {
"video_id": "819",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=GsYi-l5PEEs",
"videoID": "GsYi-l5PEEs",
"question_id": "819-2",
"task_type": "Object Reasoning",
"question": "Which statement about this video is true?",
"options": [
"A. The man eventually reached the top of the 5-layer balance board.",
"B. The entire performance took place in bright daylight, without the sky turning dim.",
"C. At the end of the performance, the man failed to do a handstand.",
"D. There are zoom shots in the entire video."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement about this video is true?\nOption:\nA. The man eventually reached the top of the 5-layer balance board.\nB. The entire performance took place in bright daylight, without the sky turning dim.\nC. At the end of the performance, the man failed to do a handstand.\nD. There are zoom shots in the entire video.\nAnswer with the option's letter from the given choices directly.",
2455,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "819-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2456,
"target": "C",
"doc": {
"video_id": "819",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=GsYi-l5PEEs",
"videoID": "GsYi-l5PEEs",
"question_id": "819-3",
"task_type": "Object Reasoning",
"question": "Who is most likely to have filmed this video?",
"options": [
"A. A professional videographer hired for the event.",
"B. The event organizer.",
"C. An audience sitting and watching the show.",
"D. A random passerby."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is most likely to have filmed this video?\nOption:\nA. A professional videographer hired for the event.\nB. The event organizer.\nC. An audience sitting and watching the show.\nD. A random passerby.\nAnswer with the option's letter from the given choices directly.",
2456,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "819-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2457,
"target": "D",
"doc": {
"video_id": "820",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=SeDGkRfpTqQ",
"videoID": "SeDGkRfpTqQ",
"question_id": "820-1",
"task_type": "Counting Problem",
"question": "How many dance group auditions are included in this video?",
"options": [
"A. 6.",
"B. 8.",
"C. 10.",
"D. 12."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many dance group auditions are included in this video?\nOption:\nA. 6.\nB. 8.\nC. 10.\nD. 12.\nAnswer with the option's letter from the given choices directly.",
2457,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "820-1",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2458,
"target": "B",
"doc": {
"video_id": "820",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=SeDGkRfpTqQ",
"videoID": "SeDGkRfpTqQ",
"question_id": "820-2",
"task_type": "Object Reasoning",
"question": "What is a common characteristic among these auditions?",
"options": [
"A. All auditions only involve male actors.",
"B. They are group auditions.",
"C. The judges are the same for all auditions.",
"D. None of the auditions require the use of props."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is a common characteristic among these auditions?\nOption:\nA. All auditions only involve male actors.\nB. They are group auditions.\nC. The judges are the same for all auditions.\nD. None of the auditions require the use of props.\nAnswer with the option's letter from the given choices directly.",
2458,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "820-2",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2459,
"target": "C",
"doc": {
"video_id": "820",
"duration": "long",
"domain": "Artistic Performance",
"sub_category": "Acrobatics",
"url": "https://www.youtube.com/watch?v=SeDGkRfpTqQ",
"videoID": "SeDGkRfpTqQ",
"question_id": "820-3",
"task_type": "Action Reasoning",
"question": "Which two performances incorporate the use of light strips?",
"options": [
"A. The second from the beginning and the third to last.",
"B. The third from the beginning and the third to last.",
"C. The third from the beginning and the second to last.",
"D. The second from the beginning and the second to last."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which two performances incorporate the use of light strips?\nOption:\nA. The second from the beginning and the third to last.\nB. The third from the beginning and the third to last.\nC. The third from the beginning and the second to last.\nD. The second from the beginning and the second to last.\nAnswer with the option's letter from the given choices directly.",
2459,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "820-3",
"duration": "long",
"category": "Artistic Performance",
"sub_category": "Acrobatics",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2460,
"target": "B",
"doc": {
"video_id": "821",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=c03YTDzmsK8",
"videoID": "c03YTDzmsK8",
"question_id": "821-1",
"task_type": "Temporal Reasoning",
"question": "What are the correct steps to make the adorable rabbit?\n(a) Making a white tail from a microfiber duster mop pad.\n(b) Cutting off the tinsel from the original rabbit.\n(c) Wrapping the brown yarn across the rabbit.\n(d) Making a cute little lace bow and tie it to the rabbit.",
"options": [
"A. (a)(c)(b)(d).",
"B. (b)(c)(a)(d).",
"C. (b)(a)(d)(c).",
"D. (c)(b)(a)(d)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the correct steps to make the adorable rabbit?\n(a) Making a white tail from a microfiber duster mop pad.\n(b) Cutting off the tinsel from the original rabbit.\n(c) Wrapping the brown yarn across the rabbit.\n(d) Making a cute little lace bow and tie it to the rabbit.\nOption:\nA. (a)(c)(b)(d).\nB. (b)(c)(a)(d).\nC. (b)(a)(d)(c).\nD. (c)(b)(a)(d).\nAnswer with the option's letter from the given choices directly.",
2460,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "821-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2461,
"target": "C",
"doc": {
"video_id": "821",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=c03YTDzmsK8",
"videoID": "c03YTDzmsK8",
"question_id": "821-2",
"task_type": "Object Reasoning",
"question": "What do the two handmade crafts have in common?",
"options": [
"A. They are in similar colors.",
"B. Both of them are related to carrots.",
"C. Both items need a glue gun.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the two handmade crafts have in common?\nOption:\nA. They are in similar colors.\nB. Both of them are related to carrots.\nC. Both items need a glue gun.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2461,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "821-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2462,
"target": "A",
"doc": {
"video_id": "821",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=c03YTDzmsK8",
"videoID": "c03YTDzmsK8",
"question_id": "821-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of this video?",
"options": [
"A. DIY simple and cute crafts on a budget.",
"B. DIY small toys for kids on a budget.",
"C. How to get started with DIY and gradually become a DIY master.",
"D. A review of the latest high-end crafting tools and materials."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of this video?\nOption:\nA. DIY simple and cute crafts on a budget.\nB. DIY small toys for kids on a budget.\nC. How to get started with DIY and gradually become a DIY master.\nD. A review of the latest high-end crafting tools and materials.\nAnswer with the option's letter from the given choices directly.",
2462,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "821-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2463,
"target": "A",
"doc": {
"video_id": "822",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=3MJNKd10kxM",
"videoID": "3MJNKd10kxM",
"question_id": "822-1",
"task_type": "Temporal Reasoning",
"question": "In which order do the following topics are introduced in this video ?\n(a) Spring pocket DIY.\n(b) Easter bucket floral DIY.\n(c) Farmhouse bunny in a bucket DIY.\n(d) Spring tin bucket floral DIY.\n(e) Bunny hop decor.",
"options": [
"A. (a)(b)(c)(d)(e).",
"B. (a)(c)(b)(e)(d).",
"C. (b)(e)(a)(d)(c).",
"D. (b)(a)(d)(c)(e)."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order do the following topics are introduced in this video ?\n(a) Spring pocket DIY.\n(b) Easter bucket floral DIY.\n(c) Farmhouse bunny in a bucket DIY.\n(d) Spring tin bucket floral DIY.\n(e) Bunny hop decor.\nOption:\nA. (a)(b)(c)(d)(e).\nB. (a)(c)(b)(e)(d).\nC. (b)(e)(a)(d)(c).\nD. (b)(a)(d)(c)(e).\nAnswer with the option's letter from the given choices directly.",
2463,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "822-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2464,
"target": "D",
"doc": {
"video_id": "822",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=3MJNKd10kxM",
"videoID": "3MJNKd10kxM",
"question_id": "822-2",
"task_type": "Object Recognition",
"question": "Which ingredient is not used in the video?",
"options": [
"A. Hot glue.",
"B. Pieces of burlap.",
"C. Florals.",
"D. Plastic bottles."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient is not used in the video?\nOption:\nA. Hot glue.\nB. Pieces of burlap.\nC. Florals.\nD. Plastic bottles.\nAnswer with the option's letter from the given choices directly.",
2464,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "822-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2465,
"target": "B",
"doc": {
"video_id": "822",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=3MJNKd10kxM",
"videoID": "3MJNKd10kxM",
"question_id": "822-3",
"task_type": "Information Synopsis",
"question": "What is the best caption for this video?",
"options": [
"A. Winter Holiday Decorations: Festive DIY Crafts for Christmas and New Year.",
"B. Spring Easter Collection #4 DIY Crafts Farmhouse Whimsical Rustic Crafts using Florals and More.",
"C. Gardening Tips and Tricks: Growing Your Own Organic Vegetables.",
"D. A Travel Vlog: Exploring Exotic Destinations Around the World."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the best caption for this video?\nOption:\nA. Winter Holiday Decorations: Festive DIY Crafts for Christmas and New Year.\nB. Spring Easter Collection #4 DIY Crafts Farmhouse Whimsical Rustic Crafts using Florals and More.\nC. Gardening Tips and Tricks: Growing Your Own Organic Vegetables.\nD. A Travel Vlog: Exploring Exotic Destinations Around the World.\nAnswer with the option's letter from the given choices directly.",
2465,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "822-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2466,
"target": "D",
"doc": {
"video_id": "823",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=vDzNVHXc66U",
"videoID": "vDzNVHXc66U",
"question_id": "823-1",
"task_type": "Object Reasoning",
"question": "Which is the characteristic of the first wool craft with spoons?",
"options": [
"A. It is decorated with yellow flowers out of wool threads.",
"B. It has spoons in both outer and inner circles.",
"C. It has five wool pendants.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is the characteristic of the first wool craft with spoons?\nOption:\nA. It is decorated with yellow flowers out of wool threads.\nB. It has spoons in both outer and inner circles.\nC. It has five wool pendants.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2466,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "823-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2467,
"target": "B",
"doc": {
"video_id": "823",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=vDzNVHXc66U",
"videoID": "vDzNVHXc66U",
"question_id": "823-2",
"task_type": "Object Recognition",
"question": "Which ingredient is not used in the video?",
"options": [
"A. Paper.",
"B. Leftover cloth.",
"C. Wool thread.",
"D. Spoons."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient is not used in the video?\nOption:\nA. Paper.\nB. Leftover cloth.\nC. Wool thread.\nD. Spoons.\nAnswer with the option's letter from the given choices directly.",
2467,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "823-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2468,
"target": "C",
"doc": {
"video_id": "823",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=vDzNVHXc66U",
"videoID": "vDzNVHXc66U",
"question_id": "823-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of this video?",
"options": [
"A. A documentary on the history of ancient handcrafts.",
"B. A video showcasing the latest handcrafts made by popular artists.",
"C. A tutorial on different types of easy handmade handcrafts of daily accessories.",
"D. A travel vlog exploring exotic handicrafts around the world."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of this video?\nOption:\nA. A documentary on the history of ancient handcrafts.\nB. A video showcasing the latest handcrafts made by popular artists.\nC. A tutorial on different types of easy handmade handcrafts of daily accessories.\nD. A travel vlog exploring exotic handicrafts around the world.\nAnswer with the option's letter from the given choices directly.",
2468,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "823-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2469,
"target": "C",
"doc": {
"video_id": "824",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=p0qf48lXD4M",
"videoID": "p0qf48lXD4M",
"question_id": "824-1",
"task_type": "Counting Problem",
"question": "How many DIYs have been made in this video?",
"options": [
"A. 6.",
"B. 8.",
"C. 10.",
"D. 12."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many DIYs have been made in this video?\nOption:\nA. 6.\nB. 8.\nC. 10.\nD. 12.\nAnswer with the option's letter from the given choices directly.",
2469,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "824-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2470,
"target": "D",
"doc": {
"video_id": "824",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=p0qf48lXD4M",
"videoID": "p0qf48lXD4M",
"question_id": "824-2",
"task_type": "Object Reasoning",
"question": "What does the two wall sconces made in this video have in common?",
"options": [
"A. Each of them has a light in it.",
"B. Both of them are symmetric.",
"C. Each of them is in golden and white.",
"D. All of the above."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the two wall sconces made in this video have in common?\nOption:\nA. Each of them has a light in it.\nB. Both of them are symmetric.\nC. Each of them is in golden and white.\nD. All of the above.\nAnswer with the option's letter from the given choices directly.",
2470,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "824-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2471,
"target": "B",
"doc": {
"video_id": "824",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=p0qf48lXD4M",
"videoID": "p0qf48lXD4M",
"question_id": "824-3",
"task_type": "Information Synopsis",
"question": "What is the best caption for this video?",
"options": [
"A. TOP 10 BEST DOLLAR TREE DIY To Try to decorate your Christmas tree.",
"B. TOP 10 BEST DOLLAR TREE DIY To Try to decorate your home.",
"C. TOP 8 BEST DOLLAR TREE DIY To Try to decorate your home.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the best caption for this video?\nOption:\nA. TOP 10 BEST DOLLAR TREE DIY To Try to decorate your Christmas tree.\nB. TOP 10 BEST DOLLAR TREE DIY To Try to decorate your home.\nC. TOP 8 BEST DOLLAR TREE DIY To Try to decorate your home.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2471,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "824-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2472,
"target": "B",
"doc": {
"video_id": "825",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=wxff_4tDauo",
"videoID": "wxff_4tDauo",
"question_id": "825-1",
"task_type": "Object Reasoning",
"question": "According to this video, which handmade electronic device is the most commonly manufactured among the following options?",
"options": [
"A. iPad.",
"B. Smartphone.",
"C. Personal laptop.",
"D. Game console."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, which handmade electronic device is the most commonly manufactured among the following options?\nOption:\nA. iPad.\nB. Smartphone.\nC. Personal laptop.\nD. Game console.\nAnswer with the option's letter from the given choices directly.",
2472,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "825-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2473,
"target": "C",
"doc": {
"video_id": "825",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=wxff_4tDauo",
"videoID": "wxff_4tDauo",
"question_id": "825-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following steps introduced in this video?\n(a) Cardboard Apple MacBook.\n(b) Cardboard Apple Watch.\n(c) Apple iPad.\n(d) Apple iPhone 13.",
"options": [
"A. (b)(c)(a)(d).",
"B. (a)(b)(c)(d).",
"C. (a)(c)(b)(d).",
"D. (b)(d)(a)(c)."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following steps introduced in this video?\n(a) Cardboard Apple MacBook.\n(b) Cardboard Apple Watch.\n(c) Apple iPad.\n(d) Apple iPhone 13.\nOption:\nA. (b)(c)(a)(d).\nB. (a)(b)(c)(d).\nC. (a)(c)(b)(d).\nD. (b)(d)(a)(c).\nAnswer with the option's letter from the given choices directly.",
2473,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "825-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2474,
"target": "A",
"doc": {
"video_id": "825",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=wxff_4tDauo",
"videoID": "wxff_4tDauo",
"question_id": "825-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of this video?",
"options": [
"A. Handmade working cardboard gadgets for the year.",
"B. Advertisements for new gadgets released during the year.",
"C. A quality review of the year's best-selling electronic products.",
"D. Showcase of must-buy tech products for the year."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of this video?\nOption:\nA. Handmade working cardboard gadgets for the year.\nB. Advertisements for new gadgets released during the year.\nC. A quality review of the year's best-selling electronic products.\nD. Showcase of must-buy tech products for the year.\nAnswer with the option's letter from the given choices directly.",
2474,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "825-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2475,
"target": "D",
"doc": {
"video_id": "826",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=dXBREPP1hfo",
"videoID": "dXBREPP1hfo",
"question_id": "826-1",
"task_type": "Action Reasoning",
"question": "If one lives in this house, what cannot he do according to this video?",
"options": [
"A. Play computer games.",
"B. Swim in the pool.",
"C. Feed chicken.",
"D. Lie on the sofa and watch TV."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: If one lives in this house, what cannot he do according to this video?\nOption:\nA. Play computer games.\nB. Swim in the pool.\nC. Feed chicken.\nD. Lie on the sofa and watch TV.\nAnswer with the option's letter from the given choices directly.",
2475,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "826-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2476,
"target": "B",
"doc": {
"video_id": "826",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=dXBREPP1hfo",
"videoID": "dXBREPP1hfo",
"question_id": "826-2",
"task_type": "Temporal Reasoning",
"question": "In which order are the following steps introduced in this video?\n(a) A pair of slippers.\n(b) A pair of pillows.\n(c) A Starbucks cup.\n(d) A Windows computer.",
"options": [
"A. (b)(c)(a)(d).",
"B. (a)(b)(c)(d).",
"C. (a)(c)(b)(d).",
"D. (b)(d)(a)(c)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In which order are the following steps introduced in this video?\n(a) A pair of slippers.\n(b) A pair of pillows.\n(c) A Starbucks cup.\n(d) A Windows computer.\nOption:\nA. (b)(c)(a)(d).\nB. (a)(b)(c)(d).\nC. (a)(c)(b)(d).\nD. (b)(d)(a)(c).\nAnswer with the option's letter from the given choices directly.",
2476,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "826-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2477,
"target": "C",
"doc": {
"video_id": "826",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=dXBREPP1hfo",
"videoID": "dXBREPP1hfo",
"question_id": "826-3",
"task_type": "Information Synopsis",
"question": "What is the best caption for this video?",
"options": [
"A. Super Easy Pom Pom Chicken Making Idea with Fingers - DIY Pom Pom Chick - How to Make Yarn Chicken.",
"B. Spring Easter Collection #4 DIY Crafts Farmhouse Whimsical Rustic Crafts using Florals and More.",
"C. Building the Cutest Bear Pink Miniature House from Cardboard for two DIY Miniature House.",
"D. The BEST DOLLAR TREE PLANTER DIY IDEAS! | Dollar Tree DIY | SUMMER DIY HOME DECOR."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the best caption for this video?\nOption:\nA. Super Easy Pom Pom Chicken Making Idea with Fingers - DIY Pom Pom Chick - How to Make Yarn Chicken.\nB. Spring Easter Collection #4 DIY Crafts Farmhouse Whimsical Rustic Crafts using Florals and More.\nC. Building the Cutest Bear Pink Miniature House from Cardboard for two DIY Miniature House.\nD. The BEST DOLLAR TREE PLANTER DIY IDEAS! | Dollar Tree DIY | SUMMER DIY HOME DECOR.\nAnswer with the option's letter from the given choices directly.",
2477,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "826-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2478,
"target": "A",
"doc": {
"video_id": "827",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=k3zNTrWrbOU",
"videoID": "k3zNTrWrbOU",
"question_id": "827-1",
"task_type": "Action Reasoning",
"question": "Which of the following steps does not the woman do on her own?",
"options": [
"A. Painting the wall.",
"B. Installing countertops.",
"C. Designing the kitchen.",
"D. Decorating the table in the kitchen."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following steps does not the woman do on her own?\nOption:\nA. Painting the wall.\nB. Installing countertops.\nC. Designing the kitchen.\nD. Decorating the table in the kitchen.\nAnswer with the option's letter from the given choices directly.",
2478,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "827-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2479,
"target": "D",
"doc": {
"video_id": "827",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=k3zNTrWrbOU",
"videoID": "k3zNTrWrbOU",
"question_id": "827-2",
"task_type": "Temporal Reasoning",
"question": "If the woman in this video wears and changes one piece of clothes every day, then at least how many days is the video shot for?",
"options": [
"A. 11.",
"B. 9.",
"C. 7.",
"D. 5."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: If the woman in this video wears and changes one piece of clothes every day, then at least how many days is the video shot for?\nOption:\nA. 11.\nB. 9.\nC. 7.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
2479,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "827-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2480,
"target": "B",
"doc": {
"video_id": "827",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=k3zNTrWrbOU",
"videoID": "k3zNTrWrbOU",
"question_id": "827-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of this video?",
"options": [
"A. GUEST BATHROOM MAKEOVER Ep. 4 | DIY Storage Cabinet & Decorating Ideas *BEFORE & AFTER*.",
"B. DIY KITCHEN ISLAND *Installing Countertops & Appliances*.",
"C. DIY SOFT ARCHES (Entryway Transformation) + COST Breakdown | XO, MaCenna.",
"D. EXTREME BEDROOM MAKEOVER | Full Bedroom Transformation 2020."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of this video?\nOption:\nA. GUEST BATHROOM MAKEOVER Ep. 4 | DIY Storage Cabinet & Decorating Ideas *BEFORE & AFTER*.\nB. DIY KITCHEN ISLAND *Installing Countertops & Appliances*.\nC. DIY SOFT ARCHES (Entryway Transformation) + COST Breakdown | XO, MaCenna.\nD. EXTREME BEDROOM MAKEOVER | Full Bedroom Transformation 2020.\nAnswer with the option's letter from the given choices directly.",
2480,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "827-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2481,
"target": "C",
"doc": {
"video_id": "828",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=vv2EiRK0Lmg",
"videoID": "vv2EiRK0Lmg",
"question_id": "828-1",
"task_type": "Object Recognition",
"question": "What is the handcraft made seventh in this video?",
"options": [
"A. A gold base vase.",
"B. Cotton twine wreath.",
"C. Wooden letters.",
"D. A home letter board."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the handcraft made seventh in this video?\nOption:\nA. A gold base vase.\nB. Cotton twine wreath.\nC. Wooden letters.\nD. A home letter board.\nAnswer with the option's letter from the given choices directly.",
2481,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "828-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2482,
"target": "A",
"doc": {
"video_id": "828",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=vv2EiRK0Lmg",
"videoID": "vv2EiRK0Lmg",
"question_id": "828-2",
"task_type": "Object Recognition",
"question": "Which ingredient is most used in this video?",
"options": [
"A. Wood.",
"B. Ropes.",
"C. Florals.",
"D. Leftover cloth."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which ingredient is most used in this video?\nOption:\nA. Wood.\nB. Ropes.\nC. Florals.\nD. Leftover cloth.\nAnswer with the option's letter from the given choices directly.",
2482,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "828-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2483,
"target": "D",
"doc": {
"video_id": "828",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=vv2EiRK0Lmg",
"videoID": "vv2EiRK0Lmg",
"question_id": "828-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of this video?",
"options": [
"A. The subject of this video is a debate about the ethical significance of DIY culture in modern society.",
"B. This is a tutorial on how to make complex sculptures using only items found in nature.",
"C. This video explores the psychological impact of affordable DIY projects on individuals, as well as its social value and significance.",
"D. This video shares a $20 tree DIY that creates high-end looking projects for just a few dollars."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of this video?\nOption:\nA. The subject of this video is a debate about the ethical significance of DIY culture in modern society.\nB. This is a tutorial on how to make complex sculptures using only items found in nature.\nC. This video explores the psychological impact of affordable DIY projects on individuals, as well as its social value and significance.\nD. This video shares a $20 tree DIY that creates high-end looking projects for just a few dollars.\nAnswer with the option's letter from the given choices directly.",
2483,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "828-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2484,
"target": "B",
"doc": {
"video_id": "829",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=h5Jx12SYKzk",
"videoID": "h5Jx12SYKzk",
"question_id": "829-1",
"task_type": "Object Reasoning",
"question": "What do the first and last made planters have in common?",
"options": [
"A. They are both in a circular shape.",
"B. They both have black and brown bodies.",
"C. They both have four feet.",
"D. The plants inside them stretch to the ground."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What do the first and last made planters have in common?\nOption:\nA. They are both in a circular shape.\nB. They both have black and brown bodies.\nC. They both have four feet.\nD. The plants inside them stretch to the ground.\nAnswer with the option's letter from the given choices directly.",
2484,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "829-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2485,
"target": "C",
"doc": {
"video_id": "829",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=h5Jx12SYKzk",
"videoID": "h5Jx12SYKzk",
"question_id": "829-2",
"task_type": "Counting Problem",
"question": "How many kinds of planters are made in this video?",
"options": [
"A. Less than 10.",
"B. More than or equal to 10 but less than 15.",
"C. More than or equal to 15 but less than 20.",
"D. More than or equal to 20."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many kinds of planters are made in this video?\nOption:\nA. Less than 10.\nB. More than or equal to 10 but less than 15.\nC. More than or equal to 15 but less than 20.\nD. More than or equal to 20.\nAnswer with the option's letter from the given choices directly.",
2485,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "829-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Counting Problem",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2486,
"target": "A",
"doc": {
"video_id": "829",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=h5Jx12SYKzk",
"videoID": "h5Jx12SYKzk",
"question_id": "829-3",
"task_type": "Information Synopsis",
"question": "What is the main topic of this video?",
"options": [
"A. This video showcases several brilliant Dollar Tree DIYS using Dollar Tree planters, flowerpots and more.",
"B. This video is a tutorial on how to use Dollar Tree planters to build a fully functional hydroponic system.",
"C. This video explores the scientific process of creating genetically modified plants using Dollar Tree flowerpots.",
"D. The main topic of this video is a philosophical discussion on the symbolism of Dollar Tree flowerpots in art and literature."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic of this video?\nOption:\nA. This video showcases several brilliant Dollar Tree DIYS using Dollar Tree planters, flowerpots and more.\nB. This video is a tutorial on how to use Dollar Tree planters to build a fully functional hydroponic system.\nC. This video explores the scientific process of creating genetically modified plants using Dollar Tree flowerpots.\nD. The main topic of this video is a philosophical discussion on the symbolism of Dollar Tree flowerpots in art and literature.\nAnswer with the option's letter from the given choices directly.",
2486,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "829-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2487,
"target": "D",
"doc": {
"video_id": "830",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=yG0W82PGAcQ",
"videoID": "yG0W82PGAcQ",
"question_id": "830-1",
"task_type": "Action Reasoning",
"question": "According to this video, what can you do if wanting to draw multiple neat lines using only one marker pen?",
"options": [
"A. Use a ruler to create parallel lines.",
"B. Place strips of masking tape on the paper.",
"C. Draw a grid of evenly spaced horizontal and vertical lines.",
"D. Bind several marker pens with adhesive tapes."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to this video, what can you do if wanting to draw multiple neat lines using only one marker pen?\nOption:\nA. Use a ruler to create parallel lines.\nB. Place strips of masking tape on the paper.\nC. Draw a grid of evenly spaced horizontal and vertical lines.\nD. Bind several marker pens with adhesive tapes.\nAnswer with the option's letter from the given choices directly.",
2487,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "830-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2488,
"target": "B",
"doc": {
"video_id": "830",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=yG0W82PGAcQ",
"videoID": "yG0W82PGAcQ",
"question_id": "830-2",
"task_type": "Object Reasoning",
"question": "Which of the following options is not the function a clip can provide in this video?",
"options": [
"A. To organize educational materials by grouping them together.",
"B. To hold the pencil on a piece of paper, helping children to draw a perfect circle.",
"C. To hang or display accessories on a mirror, preventing them from getting misplaced.",
"D. To secure items such as pens, keeping them neatly organized and preventing tangling."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is not the function a clip can provide in this video?\nOption:\nA. To organize educational materials by grouping them together.\nB. To hold the pencil on a piece of paper, helping children to draw a perfect circle.\nC. To hang or display accessories on a mirror, preventing them from getting misplaced.\nD. To secure items such as pens, keeping them neatly organized and preventing tangling.\nAnswer with the option's letter from the given choices directly.",
2488,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "830-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2489,
"target": "C",
"doc": {
"video_id": "830",
"duration": "long",
"domain": "Life Record",
"sub_category": "Handicraft",
"url": "https://www.youtube.com/watch?v=yG0W82PGAcQ",
"videoID": "yG0W82PGAcQ",
"question_id": "830-3",
"task_type": "Information Synopsis",
"question": "What is the main topic about this video?",
"options": [
"A. This video is a tutorial on advanced calculus and mathematical proofs.",
"B. This video is a travel vlog showcasing beautiful destinations around the school.",
"C. This video is a school compilation full of awesome and useful hacks for any occasion, including at the university and in the office.",
"D. This video is a comedy sketch featuring pranks and funny moments in the classroom."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main topic about this video?\nOption:\nA. This video is a tutorial on advanced calculus and mathematical proofs.\nB. This video is a travel vlog showcasing beautiful destinations around the school.\nC. This video is a school compilation full of awesome and useful hacks for any occasion, including at the university and in the office.\nD. This video is a comedy sketch featuring pranks and funny moments in the classroom.\nAnswer with the option's letter from the given choices directly.",
2489,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "830-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Handicraft",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2490,
"target": "C",
"doc": {
"video_id": "831",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=6Pt3-NDxzRk",
"videoID": "6Pt3-NDxzRk",
"question_id": "831-1",
"task_type": "Information Synopsis",
"question": "What does this video show?",
"options": [
"A. A woman documents her daily meals and exercise routine for a week.",
"B. A woman prepares for a big event by following a strict diet and exercise regimen.",
"C. A woman shows off a weekly diet that goes with her fitness plan.",
"D. A woman shows off her cooking skills by keeping a diary of her meals for a week."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video show?\nOption:\nA. A woman documents her daily meals and exercise routine for a week.\nB. A woman prepares for a big event by following a strict diet and exercise regimen.\nC. A woman shows off a weekly diet that goes with her fitness plan.\nD. A woman shows off her cooking skills by keeping a diary of her meals for a week.\nAnswer with the option's letter from the given choices directly.",
2490,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "831-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2491,
"target": "B",
"doc": {
"video_id": "831",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=6Pt3-NDxzRk",
"videoID": "6Pt3-NDxzRk",
"question_id": "831-2",
"task_type": "Temporal Perception",
"question": "Could you please indicate the specific time in the video when the minor kitchen mishap occurs while cooking?",
"options": [
"A. Wednesday breakfast.",
"B. Tuesday lunch.",
"C. Friday dinner.",
"D. Tuesday dinner."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Could you please indicate the specific time in the video when the minor kitchen mishap occurs while cooking?\nOption:\nA. Wednesday breakfast.\nB. Tuesday lunch.\nC. Friday dinner.\nD. Tuesday dinner.\nAnswer with the option's letter from the given choices directly.",
2491,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "831-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Perception",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2492,
"target": "A",
"doc": {
"video_id": "831",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=6Pt3-NDxzRk",
"videoID": "6Pt3-NDxzRk",
"question_id": "831-3",
"task_type": "Object Reasoning",
"question": "In the video, the main character delivers food to someone on Wednesday. What is the nature of their relationship?",
"options": [
"A. They are working partners.",
"B. They are family members.",
"C. They are friends.",
"D. lt is unclear what their relationship is."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, the main character delivers food to someone on Wednesday. What is the nature of their relationship?\nOption:\nA. They are working partners.\nB. They are family members.\nC. They are friends.\nD. lt is unclear what their relationship is.\nAnswer with the option's letter from the given choices directly.",
2492,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "831-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2493,
"target": "A",
"doc": {
"video_id": "832",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=iWT0kl1k32M",
"videoID": "iWT0kl1k32M",
"question_id": "832-1",
"task_type": "Action Reasoning",
"question": "Which state's cuisine is consistently highly praised by the three people in the video?",
"options": [
"A. Oklahoma.",
"B. Wisconsin.",
"C. North Carolina.",
"D. Hawaii."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which state's cuisine is consistently highly praised by the three people in the video?\nOption:\nA. Oklahoma.\nB. Wisconsin.\nC. North Carolina.\nD. Hawaii.\nAnswer with the option's letter from the given choices directly.",
2493,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "832-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2494,
"target": "D",
"doc": {
"video_id": "832",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=iWT0kl1k32M",
"videoID": "iWT0kl1k32M",
"question_id": "832-2",
"task_type": "Action Reasoning",
"question": "How do they typically determine the final food rating in the video?",
"options": [
"A. Top marks.",
"B. Take the average.",
"C. The score that given by the person in the middle of the video.",
"D. Majority vote."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How do they typically determine the final food rating in the video?\nOption:\nA. Top marks.\nB. Take the average.\nC. The score that given by the person in the middle of the video.\nD. Majority vote.\nAnswer with the option's letter from the given choices directly.",
2494,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "832-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2495,
"target": "C",
"doc": {
"video_id": "832",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=iWT0kl1k32M",
"videoID": "iWT0kl1k32M",
"question_id": "832-3",
"task_type": "Action Reasoning",
"question": "Why do the Wisconsin cheese cubes receive an A rating?",
"options": [
"A. By majority vote.",
"B. When deciding which of the S-rated foods is the best, drop it to an A rating.",
"C. After tasting the smoked salmon spread, one person lowered the score of the grilled cheese cubes.",
"D. They think smoked salmon spread is better."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do the Wisconsin cheese cubes receive an A rating?\nOption:\nA. By majority vote.\nB. When deciding which of the S-rated foods is the best, drop it to an A rating.\nC. After tasting the smoked salmon spread, one person lowered the score of the grilled cheese cubes.\nD. They think smoked salmon spread is better.\nAnswer with the option's letter from the given choices directly.",
2495,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "832-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2496,
"target": "B",
"doc": {
"video_id": "833",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=lw3_W5X1t54",
"videoID": "lw3_W5X1t54",
"question_id": "833-1",
"task_type": "Object Reasoning",
"question": "Which dish at Restaurant Kalamansi best embodies the rich history and essence of Filipino cuisine?",
"options": [
"A. Kinilaw.",
"B. Pinakbet.",
"C. Adobong Baboy.",
"D. Kare Kare."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which dish at Restaurant Kalamansi best embodies the rich history and essence of Filipino cuisine?\nOption:\nA. Kinilaw.\nB. Pinakbet.\nC. Adobong Baboy.\nD. Kare Kare.\nAnswer with the option's letter from the given choices directly.",
2496,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "833-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2497,
"target": "D",
"doc": {
"video_id": "833",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=lw3_W5X1t54",
"videoID": "lw3_W5X1t54",
"question_id": "833-2",
"task_type": "Object Reasoning",
"question": "What is the relationship between the video's main character and the lady sitting next to him at the table?",
"options": [
"A. Owner of the restaurant Kalamansi.",
"B. His wife.",
"C. Delia.",
"D. His companion."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the video's main character and the lady sitting next to him at the table?\nOption:\nA. Owner of the restaurant Kalamansi.\nB. His wife.\nC. Delia.\nD. His companion.\nAnswer with the option's letter from the given choices directly.",
2497,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "833-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2498,
"target": "A",
"doc": {
"video_id": "833",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=lw3_W5X1t54",
"videoID": "lw3_W5X1t54",
"question_id": "833-3",
"task_type": "Information Synopsis",
"question": "What does this video focus on?",
"options": [
"A. Why Filipino food is unappreciated.",
"B. Why is Filipino food not good.",
"C. Different forms of Filipino restaurants.",
"D. What are the types of Filipino cuisine."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does this video focus on?\nOption:\nA. Why Filipino food is unappreciated.\nB. Why is Filipino food not good.\nC. Different forms of Filipino restaurants.\nD. What are the types of Filipino cuisine.\nAnswer with the option's letter from the given choices directly.",
2498,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "833-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2499,
"target": "C",
"doc": {
"video_id": "834",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=aqFfjJrLkBA",
"videoID": "aqFfjJrLkBA",
"question_id": "834-1",
"task_type": "Action Reasoning",
"question": "What was the primary factor that determined the sequence in which the children appeared in the video?",
"options": [
"A. Based on ascending age.",
"B. In alternating gender order.",
"C. Youngest to oldest in age.",
"D. According to different preferences."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the primary factor that determined the sequence in which the children appeared in the video?\nOption:\nA. Based on ascending age.\nB. In alternating gender order.\nC. Youngest to oldest in age.\nD. According to different preferences.\nAnswer with the option's letter from the given choices directly.",
2499,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "834-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2500,
"target": "B",
"doc": {
"video_id": "834",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=aqFfjJrLkBA",
"videoID": "aqFfjJrLkBA",
"question_id": "834-2",
"task_type": "Action Reasoning",
"question": "Why does Sadie have a countdown?",
"options": [
"A. The food preparer is prepared to restrict her diet.",
"B. The food preparer thinks she is better at choosing her food than other children.",
"C. Sadie loves to challenge herself.",
"D. Because Sadie is the oldest child."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does Sadie have a countdown?\nOption:\nA. The food preparer is prepared to restrict her diet.\nB. The food preparer thinks she is better at choosing her food than other children.\nC. Sadie loves to challenge herself.\nD. Because Sadie is the oldest child.\nAnswer with the option's letter from the given choices directly.",
2500,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "834-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2501,
"target": "A",
"doc": {
"video_id": "834",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=aqFfjJrLkBA",
"videoID": "aqFfjJrLkBA",
"question_id": "834-3",
"task_type": "Temporal Reasoning",
"question": "Could you please specify the order for food preparation as shown in the video? Which of the following sequences is accurate?",
"options": [
"A. Rice balls, cheese slices, bread slices.",
"B. Sliced bread, sliced cheese, rice balls.",
"C. Rice balls, bread slices, cheese slices.",
"D. Cheese slices, rice balls, bread slices."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Could you please specify the order for food preparation as shown in the video? Which of the following sequences is accurate?\nOption:\nA. Rice balls, cheese slices, bread slices.\nB. Sliced bread, sliced cheese, rice balls.\nC. Rice balls, bread slices, cheese slices.\nD. Cheese slices, rice balls, bread slices.\nAnswer with the option's letter from the given choices directly.",
2501,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "834-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2502,
"target": "C",
"doc": {
"video_id": "835",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=IXjkbgrdAkw",
"videoID": "IXjkbgrdAkw",
"question_id": "835-1",
"task_type": "Object Reasoning",
"question": "In the video the main character goes to five different restaurants, which restaurant is frequented by the main character's brother-in-law?",
"options": [
"A. Third restaurant.",
"B. The first restaurant.",
"C. Fourth restaurant.",
"D. The last restaurant."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video the main character goes to five different restaurants, which restaurant is frequented by the main character's brother-in-law?\nOption:\nA. Third restaurant.\nB. The first restaurant.\nC. Fourth restaurant.\nD. The last restaurant.\nAnswer with the option's letter from the given choices directly.",
2502,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "835-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2503,
"target": "B",
"doc": {
"video_id": "835",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=IXjkbgrdAkw",
"videoID": "IXjkbgrdAkw",
"question_id": "835-2",
"task_type": "Action Reasoning",
"question": "What is the main food sold at the restaurant that the main character in the video thinks has the most historical flavor?",
"options": [
"A. Hotdog.",
"B. BBQ.",
"C. Fried chicken.",
"D. Sandwiches."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main food sold at the restaurant that the main character in the video thinks has the most historical flavor?\nOption:\nA. Hotdog.\nB. BBQ.\nC. Fried chicken.\nD. Sandwiches.\nAnswer with the option's letter from the given choices directly.",
2503,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "835-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2504,
"target": "C",
"doc": {
"video_id": "835",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=IXjkbgrdAkw",
"videoID": "IXjkbgrdAkw",
"question_id": "835-3",
"task_type": "Attribute Perception",
"question": "What is the flavor that the main character tastes the most in the video?",
"options": [
"A. Sour.",
"B. Sweet.",
"C. Spicy.",
"D. Smoky flavor."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the flavor that the main character tastes the most in the video?\nOption:\nA. Sour.\nB. Sweet.\nC. Spicy.\nD. Smoky flavor.\nAnswer with the option's letter from the given choices directly.",
2504,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "835-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Attribute Perception",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2505,
"target": "A",
"doc": {
"video_id": "836",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=wTlERUE8LVw",
"videoID": "wTlERUE8LVw",
"question_id": "836-1",
"task_type": "Object Recognition",
"question": "What food is sampled by the main character when meeting a modeling agent for the first time in the video?",
"options": [
"A. Some uncommon foods.",
"B. High cholesterol foods.",
"C. Historic food.",
"D. Roast suckling pig."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What food is sampled by the main character when meeting a modeling agent for the first time in the video?\nOption:\nA. Some uncommon foods.\nB. High cholesterol foods.\nC. Historic food.\nD. Roast suckling pig.\nAnswer with the option's letter from the given choices directly.",
2505,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "836-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2506,
"target": "D",
"doc": {
"video_id": "836",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=wTlERUE8LVw",
"videoID": "wTlERUE8LVw",
"question_id": "836-2",
"task_type": "Object Reasoning",
"question": "What is the video hero's favorite food before he eats the roast suckling pig?",
"options": [
"A. Frogs.",
"B. Beef stew.",
"C. Roasted rabbit.",
"D. Deep fried pork chunks."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video hero's favorite food before he eats the roast suckling pig?\nOption:\nA. Frogs.\nB. Beef stew.\nC. Roasted rabbit.\nD. Deep fried pork chunks.\nAnswer with the option's letter from the given choices directly.",
2506,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "836-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2507,
"target": "B",
"doc": {
"video_id": "836",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=wTlERUE8LVw",
"videoID": "wTlERUE8LVw",
"question_id": "836-3",
"task_type": "Action Reasoning",
"question": "As can be inferred from the video, why does the video dedicates a significant amount of time to the roast suckling pig?",
"options": [
"A. Because the author of the video really liked it.",
"B. Because it's the most popular spotlight cuisine in the Philippines.",
"C. Because there's a lot of roasted suckling pig in the Philippines.",
"D. Because it's very expensive."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be inferred from the video, why does the video dedicates a significant amount of time to the roast suckling pig?\nOption:\nA. Because the author of the video really liked it.\nB. Because it's the most popular spotlight cuisine in the Philippines.\nC. Because there's a lot of roasted suckling pig in the Philippines.\nD. Because it's very expensive.\nAnswer with the option's letter from the given choices directly.",
2507,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "836-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2508,
"target": "B",
"doc": {
"video_id": "837",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=4VniCsSusJs",
"videoID": "4VniCsSusJs",
"question_id": "837-1",
"task_type": "Action Reasoning",
"question": "Which of the following food combinations is the favorite of the two people in the video?",
"options": [
"A. Hot sauce and bananas.",
"B. Spaghetti with tacos.",
"C. Crazy Sandwich.",
"D. Cookies and Cheese."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following food combinations is the favorite of the two people in the video?\nOption:\nA. Hot sauce and bananas.\nB. Spaghetti with tacos.\nC. Crazy Sandwich.\nD. Cookies and Cheese.\nAnswer with the option's letter from the given choices directly.",
2508,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "837-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2509,
"target": "C",
"doc": {
"video_id": "837",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=4VniCsSusJs",
"videoID": "4VniCsSusJs",
"question_id": "837-2",
"task_type": "Action Reasoning",
"question": "Why do they throw food in a big bowl?",
"options": [
"A. Prepare to throw them away.",
"B. Serve food they hate.",
"C. Prepare to mix and taste.",
"D. Ready for others to try."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do they throw food in a big bowl?\nOption:\nA. Prepare to throw them away.\nB. Serve food they hate.\nC. Prepare to mix and taste.\nD. Ready for others to try.\nAnswer with the option's letter from the given choices directly.",
2509,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "837-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2510,
"target": "A",
"doc": {
"video_id": "837",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=4VniCsSusJs",
"videoID": "4VniCsSusJs",
"question_id": "837-3",
"task_type": "Object Reasoning",
"question": "What is the most significant difference between the first clip in the video and the other two?",
"options": [
"A. The food combinations in the first clip were all recommended by friends online.",
"B. The first clip is all about weird food combinations.",
"C. The food in the first clip is not additionally processed.",
"D. The food in the first clip is not mixed together."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most significant difference between the first clip in the video and the other two?\nOption:\nA. The food combinations in the first clip were all recommended by friends online.\nB. The first clip is all about weird food combinations.\nC. The food in the first clip is not additionally processed.\nD. The food in the first clip is not mixed together.\nAnswer with the option's letter from the given choices directly.",
2510,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "837-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2511,
"target": "B",
"doc": {
"video_id": "838",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=txdfCHpxzVg",
"videoID": "txdfCHpxzVg",
"question_id": "838-1",
"task_type": "Object Reasoning",
"question": "Extrapolating from the video, what does it suggest if chunky fries are served?",
"options": [
"A. The fries are processed in foreign factories.",
"B. The fries are hand-cut strips.",
"C. The fries absorb enough grease.",
"D. Square fries imply a lack of processing."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Extrapolating from the video, what does it suggest if chunky fries are served?\nOption:\nA. The fries are processed in foreign factories.\nB. The fries are hand-cut strips.\nC. The fries absorb enough grease.\nD. Square fries imply a lack of processing.\nAnswer with the option's letter from the given choices directly.",
2511,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "838-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2512,
"target": "A",
"doc": {
"video_id": "838",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=txdfCHpxzVg",
"videoID": "txdfCHpxzVg",
"question_id": "838-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct order for processing cod in the video?",
"options": [
"A. Cubed, inspected, cut into strips, breaded and fried.",
"B. Inspected, cubed, cut into strips, breaded and fried.",
"C. Cubed, breaded, cut into strips, fried and frozen.",
"D. Inspected, cut into strips, breaded, fried and cubed."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order for processing cod in the video?\nOption:\nA. Cubed, inspected, cut into strips, breaded and fried.\nB. Inspected, cubed, cut into strips, breaded and fried.\nC. Cubed, breaded, cut into strips, fried and frozen.\nD. Inspected, cut into strips, breaded, fried and cubed.\nAnswer with the option's letter from the given choices directly.",
2512,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "838-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2513,
"target": "B",
"doc": {
"video_id": "838",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=txdfCHpxzVg",
"videoID": "txdfCHpxzVg",
"question_id": "838-3",
"task_type": "Information Synopsis",
"question": "Which aspect does this video emphasize?",
"options": [
"A. Difference in processing procedures between meat products and plant-based products.",
"B. Documentary about several foods from raw materials to processing into processed foods.",
"C. Emphasize the important role that machines play in food processing.",
"D. The development of food processing lines."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which aspect does this video emphasize?\nOption:\nA. Difference in processing procedures between meat products and plant-based products.\nB. Documentary about several foods from raw materials to processing into processed foods.\nC. Emphasize the important role that machines play in food processing.\nD. The development of food processing lines.\nAnswer with the option's letter from the given choices directly.",
2513,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "838-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2514,
"target": "C",
"doc": {
"video_id": "839",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=azJ5pk5reX0",
"videoID": "azJ5pk5reX0",
"question_id": "839-1",
"task_type": "OCR Problems",
"question": "How many calories has the person in the video, who consumed a $100 golden burger, already eaten when he meets his teammate?",
"options": [
"A. 4660.",
"B. 5050.",
"C. 6070.",
"D. 8420."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many calories has the person in the video, who consumed a $100 golden burger, already eaten when he meets his teammate?\nOption:\nA. 4660.\nB. 5050.\nC. 6070.\nD. 8420.\nAnswer with the option's letter from the given choices directly.",
2514,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "839-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "OCR Problems",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2515,
"target": "B",
"doc": {
"video_id": "839",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=azJ5pk5reX0",
"videoID": "azJ5pk5reX0",
"question_id": "839-2",
"task_type": "Object Reasoning",
"question": "What is the relationship between the person documenting his fitness at the beginning of the video and the main curator of this video?",
"options": [
"A. They're teammates.",
"B. They're competitors.",
"C. They do not know each other.",
"D. Can't extrapolate from the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the person documenting his fitness at the beginning of the video and the main curator of this video?\nOption:\nA. They're teammates.\nB. They're competitors.\nC. They do not know each other.\nD. Can't extrapolate from the video.\nAnswer with the option's letter from the given choices directly.",
2515,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "839-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2516,
"target": "B",
"doc": {
"video_id": "839",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=azJ5pk5reX0",
"videoID": "azJ5pk5reX0",
"question_id": "839-3",
"task_type": "Action Reasoning",
"question": "Who in the video accomplished the calorie goal set?",
"options": [
"A. All have accomplished their goals.",
"B. No one has accomplished the goal.",
"C. Members of Team UK.",
"D. Members of Team USA."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who in the video accomplished the calorie goal set?\nOption:\nA. All have accomplished their goals.\nB. No one has accomplished the goal.\nC. Members of Team UK.\nD. Members of Team USA.\nAnswer with the option's letter from the given choices directly.",
2516,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "839-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2517,
"target": "D",
"doc": {
"video_id": "840",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=y9Fne3oUwX4",
"videoID": "y9Fne3oUwX4",
"question_id": "840-1",
"task_type": "Object Recognition",
"question": "Which food did the team that chose bananas also select from the following options?",
"options": [
"A. Bread.",
"B. Doughnut.",
"C. Hamburg.",
"D. Marshmallow."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which food did the team that chose bananas also select from the following options?\nOption:\nA. Bread.\nB. Doughnut.\nC. Hamburg.\nD. Marshmallow.\nAnswer with the option's letter from the given choices directly.",
2517,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "840-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2518,
"target": "C",
"doc": {
"video_id": "840",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=y9Fne3oUwX4",
"videoID": "y9Fne3oUwX4",
"question_id": "840-2",
"task_type": "Object Recognition",
"question": "What letter of the alphabet does the two teams choose to eat differently when it comes to food?",
"options": [
"A. A.",
"B. F.",
"C. B.",
"D. K."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What letter of the alphabet does the two teams choose to eat differently when it comes to food?\nOption:\nA. A.\nB. F.\nC. B.\nD. K.\nAnswer with the option's letter from the given choices directly.",
2518,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "840-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2519,
"target": "A",
"doc": {
"video_id": "840",
"duration": "long",
"domain": "Life Record",
"sub_category": "Food",
"url": "https://www.youtube.com/watch?v=y9Fne3oUwX4",
"videoID": "y9Fne3oUwX4",
"question_id": "840-3",
"task_type": "OCR Problems",
"question": "As one team enjoys pizza, how many tasks has the other team completed?",
"options": [
"A. 11.",
"B. 15.",
"C. 12.",
"D. 26."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As one team enjoys pizza, how many tasks has the other team completed?\nOption:\nA. 11.\nB. 15.\nC. 12.\nD. 26.\nAnswer with the option's letter from the given choices directly.",
2519,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "840-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Food",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2520,
"target": "A",
"doc": {
"video_id": "841",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=WB4giHwiulE",
"videoID": "WB4giHwiulE",
"question_id": "841-1",
"task_type": "Object Reasoning",
"question": "How did World War I affect the businesses of the people profiled in the video?",
"options": [
"A. It had little impact as her clientele remained wealthy.",
"B. It forced her to close all her shops due to lack of resources.",
"C. It caused her to shift focus to military uniforms.",
"D. It led to bankruptcy and closure of the fashion house."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How did World War I affect the businesses of the people profiled in the video?\nOption:\nA. It had little impact as her clientele remained wealthy.\nB. It forced her to close all her shops due to lack of resources.\nC. It caused her to shift focus to military uniforms.\nD. It led to bankruptcy and closure of the fashion house.\nAnswer with the option's letter from the given choices directly.",
2520,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "841-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2521,
"target": "C",
"doc": {
"video_id": "841",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=WB4giHwiulE",
"videoID": "WB4giHwiulE",
"question_id": "841-2",
"task_type": "Object Reasoning",
"question": "What factors contributed to the success of Chanel No. 5 and the little black dress?",
"options": [
"A. Chanel's ability to stay ahead of trends by constantly reinventing her designs.",
"B. The economic boom of the 1920s that allowed for increased spending on luxury goods.",
"C. A combination of high-quality products, innovative marketing strategies, and understanding the desires of her target audience.",
"D. Primarily the financial backing of her wealthy patrons and lovers."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What factors contributed to the success of Chanel No. 5 and the little black dress?\nOption:\nA. Chanel's ability to stay ahead of trends by constantly reinventing her designs.\nB. The economic boom of the 1920s that allowed for increased spending on luxury goods.\nC. A combination of high-quality products, innovative marketing strategies, and understanding the desires of her target audience.\nD. Primarily the financial backing of her wealthy patrons and lovers.\nAnswer with the option's letter from the given choices directly.",
2521,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "841-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2522,
"target": "D",
"doc": {
"video_id": "841",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=WB4giHwiulE",
"videoID": "WB4giHwiulE",
"question_id": "841-3",
"task_type": "Object Reasoning",
"question": "What was significant about Chanel's embrace of tanned skin?",
"options": [
"A. It aligned with the popular Victorian ideal of beauty.",
"B. It promoted the health benefits of limited sun exposure.",
"C. It was inspired by her travels to Hollywood and exposure to American culture.",
"D. It challenged the prevailing preference for pale skin."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was significant about Chanel's embrace of tanned skin?\nOption:\nA. It aligned with the popular Victorian ideal of beauty.\nB. It promoted the health benefits of limited sun exposure.\nC. It was inspired by her travels to Hollywood and exposure to American culture.\nD. It challenged the prevailing preference for pale skin.\nAnswer with the option's letter from the given choices directly.",
2522,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "841-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2523,
"target": "B",
"doc": {
"video_id": "842",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=CKYw_RG7i7w",
"videoID": "CKYw_RG7i7w",
"question_id": "842-1",
"task_type": "Counting Problem",
"question": "According to the video, how many grandsons does the founder of Gucci have?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many grandsons does the founder of Gucci have?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2523,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "842-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2524,
"target": "B",
"doc": {
"video_id": "842",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=CKYw_RG7i7w",
"videoID": "CKYw_RG7i7w",
"question_id": "842-2",
"task_type": "Information Synopsis",
"question": "According to the video, which of the following statements is accurate?",
"options": [
"A. Maurizio Gucci fulfilled his vision.",
"B. The video features an interview with Domenico De Sole.",
"C. Paolo Gucci held the highest stake among the third-generation descendants.",
"D. Patrizia Reggiani was sentenced to 10 years in prison."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, which of the following statements is accurate?\nOption:\nA. Maurizio Gucci fulfilled his vision.\nB. The video features an interview with Domenico De Sole.\nC. Paolo Gucci held the highest stake among the third-generation descendants.\nD. Patrizia Reggiani was sentenced to 10 years in prison.\nAnswer with the option's letter from the given choices directly.",
2524,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "842-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2525,
"target": "C",
"doc": {
"video_id": "842",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=CKYw_RG7i7w",
"videoID": "CKYw_RG7i7w",
"question_id": "842-3",
"task_type": "Information Synopsis",
"question": "What is the main content of this video?",
"options": [
"A. The history of Chanel's fashion empire and its iconic designs.",
"B. The evolution of Louis Vuitton as a luxury brand in the fashion industry.",
"C. Gucci's rise and fall and rise again story recounted by those who lived through it.",
"D. The history of Prada's iconic fashion designs and their influence on the industry."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of this video?\nOption:\nA. The history of Chanel's fashion empire and its iconic designs.\nB. The evolution of Louis Vuitton as a luxury brand in the fashion industry.\nC. Gucci's rise and fall and rise again story recounted by those who lived through it.\nD. The history of Prada's iconic fashion designs and their influence on the industry.\nAnswer with the option's letter from the given choices directly.",
2525,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "842-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2526,
"target": "D",
"doc": {
"video_id": "843",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=9aGgBzToUz4",
"videoID": "9aGgBzToUz4",
"question_id": "843-1",
"task_type": "Counting Problem",
"question": "Which category has the highest number of outfits?",
"options": [
"A. Nothing new.",
"B. God Tier.",
"C. Holy ground.",
"D. Gorgeous."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which category has the highest number of outfits?\nOption:\nA. Nothing new.\nB. God Tier.\nC. Holy ground.\nD. Gorgeous.\nAnswer with the option's letter from the given choices directly.",
2526,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "843-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2527,
"target": "A",
"doc": {
"video_id": "843",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=9aGgBzToUz4",
"videoID": "9aGgBzToUz4",
"question_id": "843-2",
"task_type": "Object Reasoning",
"question": "What is the most distinctive feature that sets \"Nothing new\" apart from the other categories?",
"options": [
"A. It includes an outfit with a hat, unlike any of the other categories.",
"B. It focuses on vintage-inspired designs, unlike any of the other categories.",
"C. It showcases outfits with bold and vibrant colors, unlike any of the other categories.",
"D. It emphasizes minimalist and monochromatic styles, unlike any of the other categories."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most distinctive feature that sets \"Nothing new\" apart from the other categories?\nOption:\nA. It includes an outfit with a hat, unlike any of the other categories.\nB. It focuses on vintage-inspired designs, unlike any of the other categories.\nC. It showcases outfits with bold and vibrant colors, unlike any of the other categories.\nD. It emphasizes minimalist and monochromatic styles, unlike any of the other categories.\nAnswer with the option's letter from the given choices directly.",
2527,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "843-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2528,
"target": "A",
"doc": {
"video_id": "843",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=9aGgBzToUz4",
"videoID": "9aGgBzToUz4",
"question_id": "843-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. A fearless ranking of Taylor Swift's style, of her eras tour costumes.",
"B. A review of Taylor Swift's discography and the evolution of her songwriting.",
"C. A behind-the-scenes look at Taylor Swift's personal life and relationships.",
"D. A deep dive into the history of rock music and its influence on Taylor Swift's musical style."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. A fearless ranking of Taylor Swift's style, of her eras tour costumes.\nB. A review of Taylor Swift's discography and the evolution of her songwriting.\nC. A behind-the-scenes look at Taylor Swift's personal life and relationships.\nD. A deep dive into the history of rock music and its influence on Taylor Swift's musical style.\nAnswer with the option's letter from the given choices directly.",
2528,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "843-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2529,
"target": "B",
"doc": {
"video_id": "844",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=K7YNoRZMv2k",
"videoID": "K7YNoRZMv2k",
"question_id": "844-1",
"task_type": "Temporal Reasoning",
"question": "In what chronological order did the following events take place?\n(a) The Prada brand was almost ceasing to exist.\n(b) The launch of the label Miu Miu.\n(c) Prada gained connections to royal families.\n(d) The arrival of Miuccia Prada.",
"options": [
"A. (c)(b)(a)(d).",
"B. (c)(a)(d)(b).",
"C. (d)(c)(a)(b).",
"D. (d)(a)(b)(c)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what chronological order did the following events take place?\n(a) The Prada brand was almost ceasing to exist.\n(b) The launch of the label Miu Miu.\n(c) Prada gained connections to royal families.\n(d) The arrival of Miuccia Prada.\nOption:\nA. (c)(b)(a)(d).\nB. (c)(a)(d)(b).\nC. (d)(c)(a)(b).\nD. (d)(a)(b)(c).\nAnswer with the option's letter from the given choices directly.",
2529,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "844-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2530,
"target": "C",
"doc": {
"video_id": "844",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=K7YNoRZMv2k",
"videoID": "K7YNoRZMv2k",
"question_id": "844-2",
"task_type": "Action Recognition",
"question": "What would happen if the daughter didn't assume control of Prada?",
"options": [
"A. The company would merge with a competitor and become a dominant force in the fashion industry.",
"B. The company would transition into a non-profit organization and focus on philanthropic endeavors.",
"C. The company would go bankrupt.",
"D. The company would experience a significant increase in profits and expansion."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What would happen if the daughter didn't assume control of Prada?\nOption:\nA. The company would merge with a competitor and become a dominant force in the fashion industry.\nB. The company would transition into a non-profit organization and focus on philanthropic endeavors.\nC. The company would go bankrupt.\nD. The company would experience a significant increase in profits and expansion.\nAnswer with the option's letter from the given choices directly.",
2530,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "844-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2531,
"target": "D",
"doc": {
"video_id": "844",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=K7YNoRZMv2k",
"videoID": "K7YNoRZMv2k",
"question_id": "844-3",
"task_type": "Object Recognition",
"question": "Which item is not being displayed on the product showcase interface?",
"options": [
"A. Prada high heels.",
"B. Nylon handbags.",
"C. America's World Cup sneaker.",
"D. Down jacket with a little green line."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which item is not being displayed on the product showcase interface?\nOption:\nA. Prada high heels.\nB. Nylon handbags.\nC. America's World Cup sneaker.\nD. Down jacket with a little green line.\nAnswer with the option's letter from the given choices directly.",
2531,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "844-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2532,
"target": "B",
"doc": {
"video_id": "845",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=1evyOuQz-jM",
"videoID": "1evyOuQz-jM",
"question_id": "845-1",
"task_type": "Temporal Reasoning",
"question": "In what sequence are the following jewelry pieces arranged to be displayed in this video?\n(a) Engagement ring.\n(b) Cartier bracelets.\n(c) Tabayer earrings.\n(d) Missoma hoops.",
"options": [
"A. (b)(a)(d)(c).",
"B. (a)(b)(c)(d).",
"C. (c)(d)(b)(a).",
"D. (d)(c)(a)(b)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In what sequence are the following jewelry pieces arranged to be displayed in this video?\n(a) Engagement ring.\n(b) Cartier bracelets.\n(c) Tabayer earrings.\n(d) Missoma hoops.\nOption:\nA. (b)(a)(d)(c).\nB. (a)(b)(c)(d).\nC. (c)(d)(b)(a).\nD. (d)(c)(a)(b).\nAnswer with the option's letter from the given choices directly.",
2532,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "845-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2533,
"target": "A",
"doc": {
"video_id": "845",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=1evyOuQz-jM",
"videoID": "1evyOuQz-jM",
"question_id": "845-2",
"task_type": "Object Reasoning",
"question": "What makes the dual diamond ring from Leon Diamond unique or special?",
"options": [
"A. It is worn on two fingers.",
"B. It is made from rare and expensive materials.",
"C. It can change colors based on the wearer's mood..",
"D. It has a hidden compartment for storing small items.."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What makes the dual diamond ring from Leon Diamond unique or special?\nOption:\nA. It is worn on two fingers.\nB. It is made from rare and expensive materials.\nC. It can change colors based on the wearer's mood..\nD. It has a hidden compartment for storing small items..\nAnswer with the option's letter from the given choices directly.",
2533,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "845-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2534,
"target": "C",
"doc": {
"video_id": "845",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=1evyOuQz-jM",
"videoID": "1evyOuQz-jM",
"question_id": "845-3",
"task_type": "Temporal Reasoning",
"question": "Which jewellery piece is being introduced when she takes off her coat?",
"options": [
"A. A bracelet.",
"B. A ring.",
"C. A necklace.",
"D. A pair of earrings."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which jewellery piece is being introduced when she takes off her coat?\nOption:\nA. A bracelet.\nB. A ring.\nC. A necklace.\nD. A pair of earrings.\nAnswer with the option's letter from the given choices directly.",
2534,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "845-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2535,
"target": "D",
"doc": {
"video_id": "846",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=fLO5Ww0V2qU",
"videoID": "fLO5Ww0V2qU",
"question_id": "846-1",
"task_type": "Attribute Perception",
"question": "What is the woman's outfit today in the video?",
"options": [
"A. Oversized tote bag, black pantsuit, and sneakers.",
"B. Clutch purse, denim shorts, and flip-flops.",
"C. Backpack, red sweater, and knee-high boots.",
"D. Mini Kelly bag, purple and white floral dress, pink sandals."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the woman's outfit today in the video?\nOption:\nA. Oversized tote bag, black pantsuit, and sneakers.\nB. Clutch purse, denim shorts, and flip-flops.\nC. Backpack, red sweater, and knee-high boots.\nD. Mini Kelly bag, purple and white floral dress, pink sandals.\nAnswer with the option's letter from the given choices directly.",
2535,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "846-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2536,
"target": "B",
"doc": {
"video_id": "846",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=fLO5Ww0V2qU",
"videoID": "fLO5Ww0V2qU",
"question_id": "846-2",
"task_type": "Counting Problem",
"question": "How many stores has she visited in this video?",
"options": [
"A. 3.",
"B. 4.",
"C. 5.",
"D. 6."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many stores has she visited in this video?\nOption:\nA. 3.\nB. 4.\nC. 5.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2536,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "846-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2537,
"target": "A",
"doc": {
"video_id": "846",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=fLO5Ww0V2qU",
"videoID": "fLO5Ww0V2qU",
"question_id": "846-3",
"task_type": "Information Synopsis",
"question": "Which statement is true according to this video?",
"options": [
"A. The woman takes a look at the men's shirts in a store.",
"B. The woman does not buy anything at the end of this video.",
"C. The woman receives a discount on her purchases at the end of the video.",
"D. The woman returns an item she bought earlier in the video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which statement is true according to this video?\nOption:\nA. The woman takes a look at the men's shirts in a store.\nB. The woman does not buy anything at the end of this video.\nC. The woman receives a discount on her purchases at the end of the video.\nD. The woman returns an item she bought earlier in the video.\nAnswer with the option's letter from the given choices directly.",
2537,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "846-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2538,
"target": "C",
"doc": {
"video_id": "847",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=0eJvnKwGThw",
"videoID": "0eJvnKwGThw",
"question_id": "847-1",
"task_type": "Information Synopsis",
"question": "What is this video documenting?",
"options": [
"A. A concert.",
"B. Clothing ads.",
"C. A clothing fashion walk.",
"D. Designer Workshop."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video documenting?\nOption:\nA. A concert.\nB. Clothing ads.\nC. A clothing fashion walk.\nD. Designer Workshop.\nAnswer with the option's letter from the given choices directly.",
2538,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "847-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2539,
"target": "A",
"doc": {
"video_id": "847",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=0eJvnKwGThw",
"videoID": "0eJvnKwGThw",
"question_id": "847-2",
"task_type": "Spatial Reasoning",
"question": "Where might this walkout take place?",
"options": [
"A. Italy.",
"B. New Zealand.",
"C. United States.",
"D. South Africa."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where might this walkout take place?\nOption:\nA. Italy.\nB. New Zealand.\nC. United States.\nD. South Africa.\nAnswer with the option's letter from the given choices directly.",
2539,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "847-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Spatial Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2540,
"target": "B",
"doc": {
"video_id": "847",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=0eJvnKwGThw",
"videoID": "0eJvnKwGThw",
"question_id": "847-3",
"task_type": "Spatial Perception",
"question": "In the last model gathering, at what time does the green model in the lower left corner of the square appear in the video?",
"options": [
"A. Front of video.",
"B. Middle of the video.",
"C. The back end of the video.",
"D. Not featured."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the last model gathering, at what time does the green model in the lower left corner of the square appear in the video?\nOption:\nA. Front of video.\nB. Middle of the video.\nC. The back end of the video.\nD. Not featured.\nAnswer with the option's letter from the given choices directly.",
2540,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "847-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Spatial Perception",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2541,
"target": "D",
"doc": {
"video_id": "848",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=gHXaUDx7P0Y",
"videoID": "gHXaUDx7P0Y",
"question_id": "848-1",
"task_type": "Temporal Reasoning",
"question": "Here are the names of some of the chapters, and what is the order in which they occur?\n(a) Scaling Up: One Plus One.\n(b) The Public Eye.\n(c) Broken Trust.\n(d) Revenge.",
"options": [
"A. (b)(a)(d)(c).",
"B. (d)(b)(c)(a).",
"C. (a)(b)(c)(d).",
"D. (a)(c)(d)(b)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Here are the names of some of the chapters, and what is the order in which they occur?\n(a) Scaling Up: One Plus One.\n(b) The Public Eye.\n(c) Broken Trust.\n(d) Revenge.\nOption:\nA. (b)(a)(d)(c).\nB. (d)(b)(c)(a).\nC. (a)(b)(c)(d).\nD. (a)(c)(d)(b).\nAnswer with the option's letter from the given choices directly.",
2541,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "848-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2542,
"target": "C",
"doc": {
"video_id": "848",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=gHXaUDx7P0Y",
"videoID": "gHXaUDx7P0Y",
"question_id": "848-2",
"task_type": "Counting Problem",
"question": "How many major setbacks in Nike's development were mentioned in the video?",
"options": [
"A. 0.",
"B. 2.",
"C. 4.",
"D. 6."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many major setbacks in Nike's development were mentioned in the video?\nOption:\nA. 0.\nB. 2.\nC. 4.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2542,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "848-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2543,
"target": "A",
"doc": {
"video_id": "848",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=gHXaUDx7P0Y",
"videoID": "gHXaUDx7P0Y",
"question_id": "848-3",
"task_type": "Object Reasoning",
"question": "What was the trigger for the company's name change to Nike in 1971?",
"options": [
"A. Negotiations with Onitsuka failed.",
"B. Blue Ribbon went bankrupt.",
"C. The success of the Olympic Games.",
"D. Phil didn't like the original name."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What was the trigger for the company's name change to Nike in 1971?\nOption:\nA. Negotiations with Onitsuka failed.\nB. Blue Ribbon went bankrupt.\nC. The success of the Olympic Games.\nD. Phil didn't like the original name.\nAnswer with the option's letter from the given choices directly.",
2543,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "848-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2544,
"target": "B",
"doc": {
"video_id": "849",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=j27UP4zz_6U",
"videoID": "j27UP4zz_6U",
"question_id": "849-1",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the following fragrances appear in the video?\n(a) OUD LEATHER.\n(b) OUD WOOD.\n(c) GREY VETIVER.\n(d) LAVENDRE EXTREME.",
"options": [
"A. (a)(c)(d)(b).",
"B. (b)(a)(d)(c).",
"C. (b)(a)(c)(d).",
"D. (c)(d)(b)(a)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the following fragrances appear in the video?\n(a) OUD LEATHER.\n(b) OUD WOOD.\n(c) GREY VETIVER.\n(d) LAVENDRE EXTREME.\nOption:\nA. (a)(c)(d)(b).\nB. (b)(a)(d)(c).\nC. (b)(a)(c)(d).\nD. (c)(d)(b)(a).\nAnswer with the option's letter from the given choices directly.",
2544,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "849-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2545,
"target": "C",
"doc": {
"video_id": "849",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=j27UP4zz_6U",
"videoID": "j27UP4zz_6U",
"question_id": "849-2",
"task_type": "Object Reasoning",
"question": "What makes ROSE PRICK different from other perfumes?",
"options": [
"A. It's not black.",
"B. It has a lavender scent.",
"C. It is the only perfume that has a pink appearance.",
"D. It's very expensive."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What makes ROSE PRICK different from other perfumes?\nOption:\nA. It's not black.\nB. It has a lavender scent.\nC. It is the only perfume that has a pink appearance.\nD. It's very expensive.\nAnswer with the option's letter from the given choices directly.",
2545,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "849-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2546,
"target": "D",
"doc": {
"video_id": "849",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=j27UP4zz_6U",
"videoID": "j27UP4zz_6U",
"question_id": "849-3",
"task_type": "Object Reasoning",
"question": "Which of the following is not a common characteristic of these perfumes?",
"options": [
"A. All are owned by the author.",
"B. They are all Tom Ford perfumes.",
"C. All are designed for men.",
"D. They are all expensive fragrances."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not a common characteristic of these perfumes?\nOption:\nA. All are owned by the author.\nB. They are all Tom Ford perfumes.\nC. All are designed for men.\nD. They are all expensive fragrances.\nAnswer with the option's letter from the given choices directly.",
2546,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "849-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2547,
"target": "A",
"doc": {
"video_id": "850",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=5Z2_uRf7ckY",
"videoID": "5Z2_uRf7ckY",
"question_id": "850-1",
"task_type": "Object Reasoning",
"question": "What is the place with the green tin door and the sign that says \"GAYOSO\"?",
"options": [
"A. The first workshop of Zara.",
"B. Zara's first factory.",
"C. A tailor store.",
"D. A clothing mall."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the place with the green tin door and the sign that says \"GAYOSO\"?\nOption:\nA. The first workshop of Zara.\nB. Zara's first factory.\nC. A tailor store.\nD. A clothing mall.\nAnswer with the option's letter from the given choices directly.",
2547,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "850-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2548,
"target": "D",
"doc": {
"video_id": "850",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=5Z2_uRf7ckY",
"videoID": "5Z2_uRf7ckY",
"question_id": "850-2",
"task_type": "Object Recognition",
"question": "What clothing was not copied in the clip showing the plagiarism against Zara?",
"options": [
"A. High heels.",
"B. Floral Shirt.",
"C. Dress.",
"D. Jeans."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What clothing was not copied in the clip showing the plagiarism against Zara?\nOption:\nA. High heels.\nB. Floral Shirt.\nC. Dress.\nD. Jeans.\nAnswer with the option's letter from the given choices directly.",
2548,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "850-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2549,
"target": "B",
"doc": {
"video_id": "850",
"duration": "long",
"domain": "Life Record",
"sub_category": "Fashion",
"url": "https://www.youtube.com/watch?v=5Z2_uRf7ckY",
"videoID": "5Z2_uRf7ckY",
"question_id": "850-3",
"task_type": "Temporal Reasoning",
"question": "What is the correct order of presentation of the following events in the video?\n(a) Zara was plagiarized.\n(b) Toxic dyes.\n(c) Zara owner's birthday.",
"options": [
"A. (b)(c)(a).",
"B. (a)(b)(c).",
"C. (c)(a)(b).",
"D. (c)(b)(a)."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order of presentation of the following events in the video?\n(a) Zara was plagiarized.\n(b) Toxic dyes.\n(c) Zara owner's birthday.\nOption:\nA. (b)(c)(a).\nB. (a)(b)(c).\nC. (c)(a)(b).\nD. (c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
2549,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "850-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Fashion",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2550,
"target": "B",
"doc": {
"video_id": "851",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=Lo-ewzf-Q1k",
"videoID": "Lo-ewzf-Q1k",
"question_id": "851-1",
"task_type": "Action Reasoning",
"question": "Why does the heroine in the video like to wear makeup?",
"options": [
"A. Because the heroine's skin is slightly yellow.",
"B. Because the heroine's eyelids are dark.",
"C. Because the heroine has freckles on her face.",
"D. Because the heroine enjoys the artistic aspect of makeup application."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the heroine in the video like to wear makeup?\nOption:\nA. Because the heroine's skin is slightly yellow.\nB. Because the heroine's eyelids are dark.\nC. Because the heroine has freckles on her face.\nD. Because the heroine enjoys the artistic aspect of makeup application.\nAnswer with the option's letter from the given choices directly.",
2550,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "851-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2551,
"target": "C",
"doc": {
"video_id": "851",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=Lo-ewzf-Q1k",
"videoID": "Lo-ewzf-Q1k",
"question_id": "851-2",
"task_type": "Action Reasoning",
"question": "In the video, when introducing her bedroom, why did the author say that she wanted to take a large photo and hang it on the wall?",
"options": [
"A. Place it above the clothes rack to increase the atmosphere.",
"B. There are two small photos of the bedroom but the big one is missing.",
"C. Hung on the wall to cover appliances.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, when introducing her bedroom, why did the author say that she wanted to take a large photo and hang it on the wall?\nOption:\nA. Place it above the clothes rack to increase the atmosphere.\nB. There are two small photos of the bedroom but the big one is missing.\nC. Hung on the wall to cover appliances.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2551,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "851-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2552,
"target": "A",
"doc": {
"video_id": "851",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=Lo-ewzf-Q1k",
"videoID": "Lo-ewzf-Q1k",
"question_id": "851-3",
"task_type": "Spatial Perception",
"question": "In the video, where will the lamp be hung in the package that the heroine takes home from the gym?",
"options": [
"A. It will be placed above the hangers.",
"B. It'll be on the wall above the kitchen cabinets.",
"C. It will be placed on the wall in front of the door.",
"D. It will be placed on the bedside."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, where will the lamp be hung in the package that the heroine takes home from the gym?\nOption:\nA. It will be placed above the hangers.\nB. It'll be on the wall above the kitchen cabinets.\nC. It will be placed on the wall in front of the door.\nD. It will be placed on the bedside.\nAnswer with the option's letter from the given choices directly.",
2552,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "851-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Spatial Perception",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2553,
"target": "D",
"doc": {
"video_id": "852",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=eno3UMEMQJI",
"videoID": "eno3UMEMQJI",
"question_id": "852-1",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly sorts the sequence of events in the video?",
"options": [
"A. Walking, eating, hanging laundry, watching friends unbox watches.",
"B. Eating, hanging laundry, walking, watching friends unboxing watches.",
"C. Hanging laundry, eating, walking, watching friends unboxing watches.",
"D. Hanging laundry, eating, watching friends unboxing watches, taking a walk."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly sorts the sequence of events in the video?\nOption:\nA. Walking, eating, hanging laundry, watching friends unbox watches.\nB. Eating, hanging laundry, walking, watching friends unboxing watches.\nC. Hanging laundry, eating, walking, watching friends unboxing watches.\nD. Hanging laundry, eating, watching friends unboxing watches, taking a walk.\nAnswer with the option's letter from the given choices directly.",
2553,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "852-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2554,
"target": "B",
"doc": {
"video_id": "852",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=eno3UMEMQJI",
"videoID": "eno3UMEMQJI",
"question_id": "852-2",
"task_type": "Action Recognition",
"question": "Which of the following descriptions about the author's use of Bluetooth headsets is correct?",
"options": [
"A. He only wore the left earphone.",
"B. He has headphones on both ears.",
"C. He only wore the right earphone.",
"D. He was not wearing headphones."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following descriptions about the author's use of Bluetooth headsets is correct?\nOption:\nA. He only wore the left earphone.\nB. He has headphones on both ears.\nC. He only wore the right earphone.\nD. He was not wearing headphones.\nAnswer with the option's letter from the given choices directly.",
2554,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "852-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2555,
"target": "C",
"doc": {
"video_id": "852",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=eno3UMEMQJI",
"videoID": "eno3UMEMQJI",
"question_id": "852-3",
"task_type": "Action Recognition",
"question": "What did the male protagonist in the video do immediately after finishing his personal report at the meeting?",
"options": [
"A. Play games.",
"B. Watch a friend unbox a watch.",
"C. Have lunch.",
"D. Chat with many friends."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the male protagonist in the video do immediately after finishing his personal report at the meeting?\nOption:\nA. Play games.\nB. Watch a friend unbox a watch.\nC. Have lunch.\nD. Chat with many friends.\nAnswer with the option's letter from the given choices directly.",
2555,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "852-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2556,
"target": "A",
"doc": {
"video_id": "853",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=pKaPg9ZJ88Q",
"videoID": "pKaPg9ZJ88Q",
"question_id": "853-1",
"task_type": "Object Reasoning",
"question": "Which of the following things is not what the host and hostess are talking about in the car at the beginning of the video?",
"options": [
"A. What will the heroine do in the park.",
"B. When did the heroine buy the car.",
"C. What did the heroine do after graduating.",
"D. The heroine's experience of saving money since college."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following things is not what the host and hostess are talking about in the car at the beginning of the video?\nOption:\nA. What will the heroine do in the park.\nB. When did the heroine buy the car.\nC. What did the heroine do after graduating.\nD. The heroine's experience of saving money since college.\nAnswer with the option's letter from the given choices directly.",
2556,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "853-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2557,
"target": "D",
"doc": {
"video_id": "853",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=pKaPg9ZJ88Q",
"videoID": "pKaPg9ZJ88Q",
"question_id": "853-2",
"task_type": "Object Reasoning",
"question": "Why is there no record of the heroine's daily life from 5:00 to 7:20 in the video?",
"options": [
"A. Due to a miscommunication between the photographer and the heroine, the filming schedule was accidentally overlooked during that time period.",
"B. Due to the heroine's gym schedule from 5:00 to 7:30 and the presence of a male photographer, it becomes inconvenient to proceed with the shooting.",
"C. Due to a sudden power outage in the area, filming was interrupted and could not resume until after 7:20.",
"D. Due to the heroine's commitment to teaching two classes from 5:00 to 7:30, it becomes unfeasible to carry out the filming during that time period."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is there no record of the heroine's daily life from 5:00 to 7:20 in the video?\nOption:\nA. Due to a miscommunication between the photographer and the heroine, the filming schedule was accidentally overlooked during that time period.\nB. Due to the heroine's gym schedule from 5:00 to 7:30 and the presence of a male photographer, it becomes inconvenient to proceed with the shooting.\nC. Due to a sudden power outage in the area, filming was interrupted and could not resume until after 7:20.\nD. Due to the heroine's commitment to teaching two classes from 5:00 to 7:30, it becomes unfeasible to carry out the filming during that time period.\nAnswer with the option's letter from the given choices directly.",
2557,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "853-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2558,
"target": "B",
"doc": {
"video_id": "853",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=pKaPg9ZJ88Q",
"videoID": "pKaPg9ZJ88Q",
"question_id": "853-3",
"task_type": "Object Reasoning",
"question": "What is the role of the heroine in the dancing sequence in the second half of the video?",
"options": [
"A. The heroine is a dance teacher. She sits and watches other students dance.",
"B. The heroine is the lead dancer, she teaches the others dance moves.",
"C. The heroine is a student, she learns dance moves from the lead dancer.",
"D. The heroine is the audience, she watches others dance."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the heroine in the dancing sequence in the second half of the video?\nOption:\nA. The heroine is a dance teacher. She sits and watches other students dance.\nB. The heroine is the lead dancer, she teaches the others dance moves.\nC. The heroine is a student, she learns dance moves from the lead dancer.\nD. The heroine is the audience, she watches others dance.\nAnswer with the option's letter from the given choices directly.",
2558,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "853-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2559,
"target": "C",
"doc": {
"video_id": "854",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=E83XxGvEPtI",
"videoID": "E83XxGvEPtI",
"question_id": "854-1",
"task_type": "Information Synopsis",
"question": "What is the main content of the video?",
"options": [
"A. A day in the life of a college student.",
"B. A day in the life of a software engineer working from home.",
"C. A day in the life of a content creator.",
"D. A day in the life of a music creator."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the main content of the video?\nOption:\nA. A day in the life of a college student.\nB. A day in the life of a software engineer working from home.\nC. A day in the life of a content creator.\nD. A day in the life of a music creator.\nAnswer with the option's letter from the given choices directly.",
2559,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "854-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2560,
"target": "A",
"doc": {
"video_id": "854",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=E83XxGvEPtI",
"videoID": "E83XxGvEPtI",
"question_id": "854-2",
"task_type": "Object Reasoning",
"question": "What food items are common in both the hero's breakfast and lunch?",
"options": [
"A. Milk.",
"B. Chicken breast.",
"C. Bread.",
"D. Bacon."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What food items are common in both the hero's breakfast and lunch?\nOption:\nA. Milk.\nB. Chicken breast.\nC. Bread.\nD. Bacon.\nAnswer with the option's letter from the given choices directly.",
2560,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "854-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2561,
"target": "D",
"doc": {
"video_id": "854",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=E83XxGvEPtI",
"videoID": "E83XxGvEPtI",
"question_id": "854-3",
"task_type": "Action Reasoning",
"question": "What is the theme of the video that the male protagonist made after finishing his workout?",
"options": [
"A. Work out.",
"B. Daily life.",
"C. Science popularization.",
"D. Game."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the theme of the video that the male protagonist made after finishing his workout?\nOption:\nA. Work out.\nB. Daily life.\nC. Science popularization.\nD. Game.\nAnswer with the option's letter from the given choices directly.",
2561,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "854-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2562,
"target": "B",
"doc": {
"video_id": "855",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=vzfTpidE5wg",
"videoID": "vzfTpidE5wg",
"question_id": "855-1",
"task_type": "Action Recognition",
"question": "What is the process of boiling leaves in a pot that the girl introduces at the beginning of the video?",
"options": [
"A. Make lemonade to boost immunity.",
"B. Making rosemary perfume, spraying it at the roots to care for the hair.",
"C. Making shampoo, applying it to the roots while washing hair for hair care.",
"D. Making rosemary water, spraying it on the body for moisturizing."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the process of boiling leaves in a pot that the girl introduces at the beginning of the video?\nOption:\nA. Make lemonade to boost immunity.\nB. Making rosemary perfume, spraying it at the roots to care for the hair.\nC. Making shampoo, applying it to the roots while washing hair for hair care.\nD. Making rosemary water, spraying it on the body for moisturizing.\nAnswer with the option's letter from the given choices directly.",
2562,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "855-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2563,
"target": "C",
"doc": {
"video_id": "855",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=vzfTpidE5wg",
"videoID": "vzfTpidE5wg",
"question_id": "855-2",
"task_type": "Information Synopsis",
"question": "After organizing the clutter in the cabinet in the video, what is the main topic that the heroine discusses while sitting in front of the camera?",
"options": [
"A. Strategies for maintaining a healthy work-life balance.",
"B. Tips for effective budgeting and saving money.",
"C. Advice on dealing with procrastinating tasks.",
"D. Tips for maintaining a clutter-free home."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After organizing the clutter in the cabinet in the video, what is the main topic that the heroine discusses while sitting in front of the camera?\nOption:\nA. Strategies for maintaining a healthy work-life balance.\nB. Tips for effective budgeting and saving money.\nC. Advice on dealing with procrastinating tasks.\nD. Tips for maintaining a clutter-free home.\nAnswer with the option's letter from the given choices directly.",
2563,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "855-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2564,
"target": "A",
"doc": {
"video_id": "855",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=vzfTpidE5wg",
"videoID": "vzfTpidE5wg",
"question_id": "855-3",
"task_type": "Object Reasoning",
"question": "What is inside the Amazon package received by the female protagonist in the video used for?",
"options": [
"A. The package contains bottles to store the ginger juice drink she made.",
"B. The package contains a new set of makeup brushes.",
"C. The package contains a book on herbal remedies.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is inside the Amazon package received by the female protagonist in the video used for?\nOption:\nA. The package contains bottles to store the ginger juice drink she made.\nB. The package contains a new set of makeup brushes.\nC. The package contains a book on herbal remedies.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2564,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "855-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2565,
"target": "D",
"doc": {
"video_id": "856",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3HEYqZgC4ko",
"videoID": "3HEYqZgC4ko",
"question_id": "856-1",
"task_type": "Action Recognition",
"question": "In the video, what happened in the car when the heroine came home from shopping in the supermarket?",
"options": [
"A. She changed her baby into a new set of clothes in the car.",
"B. Her baby slept peacefully in the car.",
"C. Her baby was happy and she fed her baby in the car.",
"D. Her baby was unhappy in the car."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what happened in the car when the heroine came home from shopping in the supermarket?\nOption:\nA. She changed her baby into a new set of clothes in the car.\nB. Her baby slept peacefully in the car.\nC. Her baby was happy and she fed her baby in the car.\nD. Her baby was unhappy in the car.\nAnswer with the option's letter from the given choices directly.",
2565,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "856-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2566,
"target": "B",
"doc": {
"video_id": "856",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3HEYqZgC4ko",
"videoID": "3HEYqZgC4ko",
"question_id": "856-2",
"task_type": "Action Recognition",
"question": "Which of the following tasks did the heroine not complete while her baby was sleeping?",
"options": [
"A. Rearranging the furniture in the living room.",
"B. Making a coconut milk and orange juice drink.",
"C. Baking a homemade cake from scratch.",
"D. Writing a detailed report for work."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following tasks did the heroine not complete while her baby was sleeping?\nOption:\nA. Rearranging the furniture in the living room.\nB. Making a coconut milk and orange juice drink.\nC. Baking a homemade cake from scratch.\nD. Writing a detailed report for work.\nAnswer with the option's letter from the given choices directly.",
2566,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "856-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2567,
"target": "C",
"doc": {
"video_id": "856",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=3HEYqZgC4ko",
"videoID": "3HEYqZgC4ko",
"question_id": "856-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following descriptions of the heroine's daily activities in a day is correct in chronological order?",
"options": [
"A. Shopping, doing laundry, making the bed.",
"B. Online shopping, making the bed, doing laundry.",
"C. Buy coffee, shop, make her bed.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following descriptions of the heroine's daily activities in a day is correct in chronological order?\nOption:\nA. Shopping, doing laundry, making the bed.\nB. Online shopping, making the bed, doing laundry.\nC. Buy coffee, shop, make her bed.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2567,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "856-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2568,
"target": "A",
"doc": {
"video_id": "857",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=25Pt1AZO9EM",
"videoID": "25Pt1AZO9EM",
"question_id": "857-1",
"task_type": "Action Recognition",
"question": "In the video after feeding the ducks, what did the male protagonist do after riding his bike?",
"options": [
"A. Check on the cattle herd.",
"B. Pick up duck eggs.",
"C. Check the growth of apple and chestnut trees.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video after feeding the ducks, what did the male protagonist do after riding his bike?\nOption:\nA. Check on the cattle herd.\nB. Pick up duck eggs.\nC. Check the growth of apple and chestnut trees.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2568,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "857-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2569,
"target": "D",
"doc": {
"video_id": "857",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=25Pt1AZO9EM",
"videoID": "25Pt1AZO9EM",
"question_id": "857-2",
"task_type": "Action Reasoning",
"question": "What is the purpose of the man in the video processing the blue plastic barrel?",
"options": [
"A. Processing the blue plastic barrel into troughs for feeding cattle.",
"B. Processing the blue plastic barrel into troughs for feeding ducks.",
"C. Processing the blue plastic barrel into troughs for feeding chicken.",
"D. Processing the blue plastic barrel into troughs for feeding pigs."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the purpose of the man in the video processing the blue plastic barrel?\nOption:\nA. Processing the blue plastic barrel into troughs for feeding cattle.\nB. Processing the blue plastic barrel into troughs for feeding ducks.\nC. Processing the blue plastic barrel into troughs for feeding chicken.\nD. Processing the blue plastic barrel into troughs for feeding pigs.\nAnswer with the option's letter from the given choices directly.",
2569,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "857-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2570,
"target": "B",
"doc": {
"video_id": "857",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=25Pt1AZO9EM",
"videoID": "25Pt1AZO9EM",
"question_id": "857-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following options is correct in chronological order of the male protagonist's daily life trajectory?",
"options": [
"A. Feed the ducks, feed the chickens, check on the cattle, feed the pigs.",
"B. Feed the ducks, check the cattle, feed the chickens, feed the pigs.",
"C. Feed the ducks, check on the cattle, have dinner, feed the pigs.",
"D. Feed the ducks, check on the cattle, play games, feed the pigs."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is correct in chronological order of the male protagonist's daily life trajectory?\nOption:\nA. Feed the ducks, feed the chickens, check on the cattle, feed the pigs.\nB. Feed the ducks, check the cattle, feed the chickens, feed the pigs.\nC. Feed the ducks, check on the cattle, have dinner, feed the pigs.\nD. Feed the ducks, check on the cattle, play games, feed the pigs.\nAnswer with the option's letter from the given choices directly.",
2570,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "857-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2571,
"target": "C",
"doc": {
"video_id": "858",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=5i5Be73ESUc",
"videoID": "5i5Be73ESUc",
"question_id": "858-1",
"task_type": "Information Synopsis",
"question": "Which of the following things was not discussed by the heroine while driving to work?",
"options": [
"A. Preview of the main content of this video.",
"B. Her occupation and place of work.",
"C. Her job search process.",
"D. Some of her experiences at work."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following things was not discussed by the heroine while driving to work?\nOption:\nA. Preview of the main content of this video.\nB. Her occupation and place of work.\nC. Her job search process.\nD. Some of her experiences at work.\nAnswer with the option's letter from the given choices directly.",
2571,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "858-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2572,
"target": "A",
"doc": {
"video_id": "858",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=5i5Be73ESUc",
"videoID": "5i5Be73ESUc",
"question_id": "858-2",
"task_type": "Temporal Reasoning",
"question": "What time does the heroine in the video get home from get off work?",
"options": [
"A. Morning.",
"B. Noon.",
"C. Afternoon.",
"D. Night."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What time does the heroine in the video get home from get off work?\nOption:\nA. Morning.\nB. Noon.\nC. Afternoon.\nD. Night.\nAnswer with the option's letter from the given choices directly.",
2572,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "858-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2573,
"target": "D",
"doc": {
"video_id": "858",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=5i5Be73ESUc",
"videoID": "5i5Be73ESUc",
"question_id": "858-3",
"task_type": "Action Reasoning",
"question": "Why did the heroine in the video go shopping at Trader joe's?",
"options": [
"A. Because she saw a viral TikTok video recommending their frozen pizza.",
"B. Because she heard they have a special promotion on toilet paper.",
"C. Because she believes their shopping bags are more stylish and trendy compared to other supermarkets.",
"D. because she believes the food in Trader Joe's is of higher quality."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the heroine in the video go shopping at Trader joe's?\nOption:\nA. Because she saw a viral TikTok video recommending their frozen pizza.\nB. Because she heard they have a special promotion on toilet paper.\nC. Because she believes their shopping bags are more stylish and trendy compared to other supermarkets.\nD. because she believes the food in Trader Joe's is of higher quality.\nAnswer with the option's letter from the given choices directly.",
2573,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "858-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2574,
"target": "B",
"doc": {
"video_id": "859",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=u4TU2A7eVj8",
"videoID": "u4TU2A7eVj8",
"question_id": "859-1",
"task_type": "Temporal Reasoning",
"question": "On which day in the video did the female lead barely speak during daytime working hours?",
"options": [
"A. Monday.",
"B. Thursday.",
"C. Tuesday.",
"D. Wednesday."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: On which day in the video did the female lead barely speak during daytime working hours?\nOption:\nA. Monday.\nB. Thursday.\nC. Tuesday.\nD. Wednesday.\nAnswer with the option's letter from the given choices directly.",
2574,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "859-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2575,
"target": "C",
"doc": {
"video_id": "859",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=u4TU2A7eVj8",
"videoID": "u4TU2A7eVj8",
"question_id": "859-2",
"task_type": "Information Synopsis",
"question": "Which of the following is not what the heroine said while lying in bed on Friday morning in the video?",
"options": [
"A. How she feels about her experiences these days.",
"B. The reason why she feels tired recently.",
"C. Dissatisfaction with recent heavy workloads.",
"D. Previous experience in New York."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not what the heroine said while lying in bed on Friday morning in the video?\nOption:\nA. How she feels about her experiences these days.\nB. The reason why she feels tired recently.\nC. Dissatisfaction with recent heavy workloads.\nD. Previous experience in New York.\nAnswer with the option's letter from the given choices directly.",
2575,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "859-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2576,
"target": "A",
"doc": {
"video_id": "859",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=u4TU2A7eVj8",
"videoID": "u4TU2A7eVj8",
"question_id": "859-3",
"task_type": "Action Reasoning",
"question": "In the video, does the heroine show the panoramic view outside the window of her rented place? If not, why?",
"options": [
"A. No, she felt that showing the content outside the window would reveal her location information.",
"B. No, she intentionally avoided showing the panoramic view outside the window to create suspense for a future video.",
"C. No, the window was facing a construction site, and the noise and visual disturbance made it undesirable to showcase.",
"D. Yes, she briefly showed the panoramic view outside the window but edited it out of the final video."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, does the heroine show the panoramic view outside the window of her rented place? If not, why?\nOption:\nA. No, she felt that showing the content outside the window would reveal her location information.\nB. No, she intentionally avoided showing the panoramic view outside the window to create suspense for a future video.\nC. No, the window was facing a construction site, and the noise and visual disturbance made it undesirable to showcase.\nD. Yes, she briefly showed the panoramic view outside the window but edited it out of the final video.\nAnswer with the option's letter from the given choices directly.",
2576,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "859-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2577,
"target": "D",
"doc": {
"video_id": "860",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=K37YKK9h9pA",
"videoID": "K37YKK9h9pA",
"question_id": "860-1",
"task_type": "Action Reasoning",
"question": "Which of the following is NOT one of the things the video talks about during the autopilot phase of the plane during the flight to Salt Lake City?",
"options": [
"A. The tasks of the three pilots and how they work their shifts.",
"B. Basic information about the instrumentation installed on the aircraft.",
"C. The work and life balance of pilots.",
"D. The impact of weather conditions on aircraft performance."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT one of the things the video talks about during the autopilot phase of the plane during the flight to Salt Lake City?\nOption:\nA. The tasks of the three pilots and how they work their shifts.\nB. Basic information about the instrumentation installed on the aircraft.\nC. The work and life balance of pilots.\nD. The impact of weather conditions on aircraft performance.\nAnswer with the option's letter from the given choices directly.",
2577,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "860-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2578,
"target": "B",
"doc": {
"video_id": "860",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=K37YKK9h9pA",
"videoID": "K37YKK9h9pA",
"question_id": "860-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following is the correct chronological order of what Bart does on his first day in the video after arriving in Salt Lake City?",
"options": [
"A. Gym workout, outdoor bison watching together, breakfast.",
"B. Gym workout, breakfast, outdoor bison watching together.",
"C. Breakfast, gym workout, outdoor bison watching.",
"D. Breakfast, outdoor bison watching, breakfast with hostesses."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the correct chronological order of what Bart does on his first day in the video after arriving in Salt Lake City?\nOption:\nA. Gym workout, outdoor bison watching together, breakfast.\nB. Gym workout, breakfast, outdoor bison watching together.\nC. Breakfast, gym workout, outdoor bison watching.\nD. Breakfast, outdoor bison watching, breakfast with hostesses.\nAnswer with the option's letter from the given choices directly.",
2578,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "860-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2579,
"target": "C",
"doc": {
"video_id": "860",
"duration": "long",
"domain": "Life Record",
"sub_category": "Daily Life",
"url": "https://www.youtube.com/watch?v=K37YKK9h9pA",
"videoID": "K37YKK9h9pA",
"question_id": "860-3",
"task_type": "Object Reasoning",
"question": "Who was in control during the landing process of the plane in the video's return journey?",
"options": [
"A. Barry,Captain.",
"B. Bart,Second officer.",
"C. Barry,First officer.",
"D. Wim,Captain."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who was in control during the landing process of the plane in the video's return journey?\nOption:\nA. Barry,Captain.\nB. Bart,Second officer.\nC. Barry,First officer.\nD. Wim,Captain.\nAnswer with the option's letter from the given choices directly.",
2579,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "860-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Daily Life",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2580,
"target": "D",
"doc": {
"video_id": "861",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=_T2Avd3tFHc",
"videoID": "_T2Avd3tFHc",
"question_id": "861-1",
"task_type": "Object Reasoning",
"question": "Who is the grandfather in the video with the injured right hand?",
"options": [
"A. Money scammers.",
"B. The grandfather of one of the travelers.",
"C. Traveler's guide.",
"D. A native of Moldova."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who is the grandfather in the video with the injured right hand?\nOption:\nA. Money scammers.\nB. The grandfather of one of the travelers.\nC. Traveler's guide.\nD. A native of Moldova.\nAnswer with the option's letter from the given choices directly.",
2580,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "861-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2581,
"target": "B",
"doc": {
"video_id": "861",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=_T2Avd3tFHc",
"videoID": "_T2Avd3tFHc",
"question_id": "861-2",
"task_type": "Counting Problem",
"question": "How many times does the traveler in the video reunite with the old grandpa?",
"options": [
"A. 4.",
"B. 3.",
"C. 2.",
"D. 5."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many times does the traveler in the video reunite with the old grandpa?\nOption:\nA. 4.\nB. 3.\nC. 2.\nD. 5.\nAnswer with the option's letter from the given choices directly.",
2581,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "861-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2582,
"target": "C",
"doc": {
"video_id": "861",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=_T2Avd3tFHc",
"videoID": "_T2Avd3tFHc",
"question_id": "861-3",
"task_type": "Object Reasoning",
"question": "Why do travelers think Moldova isn't the worst city to visit?",
"options": [
"A. Because they celebrated a birthday with their best friend here.",
"B. Because they explored a lot of interesting places together.",
"C. Because they experience a meaningful few days with an old grandpa they never know.",
"D. Because they help the local people and do something worthwhile."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do travelers think Moldova isn't the worst city to visit?\nOption:\nA. Because they celebrated a birthday with their best friend here.\nB. Because they explored a lot of interesting places together.\nC. Because they experience a meaningful few days with an old grandpa they never know.\nD. Because they help the local people and do something worthwhile.\nAnswer with the option's letter from the given choices directly.",
2582,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "861-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2583,
"target": "A",
"doc": {
"video_id": "862",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=uzyYKAPcfmE",
"videoID": "uzyYKAPcfmE",
"question_id": "862-1",
"task_type": "Object Reasoning",
"question": "What is the relationship between the two travelers in the video?",
"options": [
"A. They are couples.",
"B. They are colleagues.",
"C. They are traveling friends.",
"D. It is impossible to extrapolate."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the two travelers in the video?\nOption:\nA. They are couples.\nB. They are colleagues.\nC. They are traveling friends.\nD. It is impossible to extrapolate.\nAnswer with the option's letter from the given choices directly.",
2583,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "862-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2584,
"target": "B",
"doc": {
"video_id": "862",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=uzyYKAPcfmE",
"videoID": "uzyYKAPcfmE",
"question_id": "862-2",
"task_type": "Temporal Reasoning",
"question": "What did the people in the video do on the third day of their trip?",
"options": [
"A. They visited Lake Titicaca.",
"B. They went on a spelunking expedition.",
"C. They did their shopping in the town of Daloria.",
"D. They upgraded their room to a twin room."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What did the people in the video do on the third day of their trip?\nOption:\nA. They visited Lake Titicaca.\nB. They went on a spelunking expedition.\nC. They did their shopping in the town of Daloria.\nD. They upgraded their room to a twin room.\nAnswer with the option's letter from the given choices directly.",
2584,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "862-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2585,
"target": "C",
"doc": {
"video_id": "862",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=uzyYKAPcfmE",
"videoID": "uzyYKAPcfmE",
"question_id": "862-3",
"task_type": "Object Reasoning",
"question": "What's so special about the trip in the video?",
"options": [
"A. This trip is a high-altitude trip.",
"B. During this trip, they visited uninhabited areas and explored the beautiful natural scenery.",
"C. Most time of the trip was spent on the train.",
"D. This is a honeymoon trip for two people."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What's so special about the trip in the video?\nOption:\nA. This trip is a high-altitude trip.\nB. During this trip, they visited uninhabited areas and explored the beautiful natural scenery.\nC. Most time of the trip was spent on the train.\nD. This is a honeymoon trip for two people.\nAnswer with the option's letter from the given choices directly.",
2585,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "862-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2586,
"target": "C",
"doc": {
"video_id": "863",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=rSnQR7vGqMY",
"videoID": "rSnQR7vGqMY",
"question_id": "863-1",
"task_type": "Object Reasoning",
"question": "What is the role of the man named AJ in the video?",
"options": [
"A. Tourist guide.",
"B. Video blogger.",
"C. Game Planner.",
"D. Sponsor."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the role of the man named AJ in the video?\nOption:\nA. Tourist guide.\nB. Video blogger.\nC. Game Planner.\nD. Sponsor.\nAnswer with the option's letter from the given choices directly.",
2586,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "863-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2587,
"target": "A",
"doc": {
"video_id": "863",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=rSnQR7vGqMY",
"videoID": "rSnQR7vGqMY",
"question_id": "863-2",
"task_type": "Action Reasoning",
"question": "Why is the man named Niko yelling at the mountain?",
"options": [
"A. Playing with his echo.",
"B. Calling out to his teammates.",
"C. To complete the planner's task.",
"D. Scaring away the kangaroos."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is the man named Niko yelling at the mountain?\nOption:\nA. Playing with his echo.\nB. Calling out to his teammates.\nC. To complete the planner's task.\nD. Scaring away the kangaroos.\nAnswer with the option's letter from the given choices directly.",
2587,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "863-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2588,
"target": "A",
"doc": {
"video_id": "863",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=rSnQR7vGqMY",
"videoID": "rSnQR7vGqMY",
"question_id": "863-3",
"task_type": "Information Synopsis",
"question": "What is this video mainly about?",
"options": [
"A. Two teams complete the mission and survive in the jungle for 24 hours.",
"B. Two teams work together to accomplish assigned tasks.",
"C. Two teams battled the game planners.",
"D. Two squads search for the man in the black suit via VCR."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is this video mainly about?\nOption:\nA. Two teams complete the mission and survive in the jungle for 24 hours.\nB. Two teams work together to accomplish assigned tasks.\nC. Two teams battled the game planners.\nD. Two squads search for the man in the black suit via VCR.\nAnswer with the option's letter from the given choices directly.",
2588,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "863-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2589,
"target": "B",
"doc": {
"video_id": "864",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=VFntoBRGF1A",
"videoID": "VFntoBRGF1A",
"question_id": "864-1",
"task_type": "Temporal Reasoning",
"question": "What date did the individual in the video leave a place that Simon thought was very important to him?",
"options": [
"A. May 31, 2023.",
"B. June 9, 2022.",
"C. May 9, 2022.",
"D. June 31, 2022."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What date did the individual in the video leave a place that Simon thought was very important to him?\nOption:\nA. May 31, 2023.\nB. June 9, 2022.\nC. May 9, 2022.\nD. June 31, 2022.\nAnswer with the option's letter from the given choices directly.",
2589,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "864-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2590,
"target": "C",
"doc": {
"video_id": "864",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=VFntoBRGF1A",
"videoID": "VFntoBRGF1A",
"question_id": "864-2",
"task_type": "Action Reasoning",
"question": "Which of the following options does the video not give an explanation for?",
"options": [
"A. Why is Noah called \"the climber\".",
"B. Why Yosemite means a lot to Simon.",
"C. Why they buy a pineapple at Walmart.",
"D. It is impossible to deduce."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options does the video not give an explanation for?\nOption:\nA. Why is Noah called \"the climber\".\nB. Why Yosemite means a lot to Simon.\nC. Why they buy a pineapple at Walmart.\nD. It is impossible to deduce.\nAnswer with the option's letter from the given choices directly.",
2590,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "864-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2591,
"target": "B",
"doc": {
"video_id": "864",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=VFntoBRGF1A",
"videoID": "VFntoBRGF1A",
"question_id": "864-3",
"task_type": "Temporal Reasoning",
"question": "What is the accurate sequence of their destinations?",
"options": [
"A. Grand Prismatic, Phelps Lake, HOH National Forest.",
"B. Jolly Green Giant, Grand Prismatic, Yosemite.",
"C. 7-mile hole trail Yellowstone, Phelps Lake, Jolly Green Giant.",
"D. Badlands National Park, Jolly Green Giant, Yosemite."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the accurate sequence of their destinations?\nOption:\nA. Grand Prismatic, Phelps Lake, HOH National Forest.\nB. Jolly Green Giant, Grand Prismatic, Yosemite.\nC. 7-mile hole trail Yellowstone, Phelps Lake, Jolly Green Giant.\nD. Badlands National Park, Jolly Green Giant, Yosemite.\nAnswer with the option's letter from the given choices directly.",
2591,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "864-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2592,
"target": "D",
"doc": {
"video_id": "865",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=YEok4Ykw204",
"videoID": "YEok4Ykw204",
"question_id": "865-1",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the following places are introduced in the video?\n(a) Minaki Aman Temple.\n(b) Alhamra.\n(c) Burj Khalifa.\n(d) Machu Picchu.\n(e) Coliseum.",
"options": [
"A. (a)(c)(b)(e)(d).",
"B. (e)(b)(c)(d)(a).",
"C. (b)(c)(a)(d)(e).",
"D. (a)(b)(c)(d)(e)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the following places are introduced in the video?\n(a) Minaki Aman Temple.\n(b) Alhamra.\n(c) Burj Khalifa.\n(d) Machu Picchu.\n(e) Coliseum.\nOption:\nA. (a)(c)(b)(e)(d).\nB. (e)(b)(c)(d)(a).\nC. (b)(c)(a)(d)(e).\nD. (a)(b)(c)(d)(e).\nAnswer with the option's letter from the given choices directly.",
2592,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "865-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2593,
"target": "B",
"doc": {
"video_id": "865",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=YEok4Ykw204",
"videoID": "YEok4Ykw204",
"question_id": "865-2",
"task_type": "Object Reasoning",
"question": "Which of the following is the characteristics of the 16th location described in the video?",
"options": [
"A. It is built from 20,000 volcanic stones.",
"B. Its construction began in 1506.",
"C. It is one of the most popular tourist attractions in Lebanon.",
"D. There is a restaurant and bar on the 122nd floor."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is the characteristics of the 16th location described in the video?\nOption:\nA. It is built from 20,000 volcanic stones.\nB. Its construction began in 1506.\nC. It is one of the most popular tourist attractions in Lebanon.\nD. There is a restaurant and bar on the 122nd floor.\nAnswer with the option's letter from the given choices directly.",
2593,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "865-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2594,
"target": "C",
"doc": {
"video_id": "865",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=YEok4Ykw204",
"videoID": "YEok4Ykw204",
"question_id": "865-3",
"task_type": "Counting Problem",
"question": "According to the video, how many buildings were built in less than 110 years?",
"options": [
"A. 1.",
"B. 2.",
"C. 3.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, how many buildings were built in less than 110 years?\nOption:\nA. 1.\nB. 2.\nC. 3.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2594,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "865-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2595,
"target": "B",
"doc": {
"video_id": "866",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=xNgVeznQmXI",
"videoID": "xNgVeznQmXI",
"question_id": "866-1",
"task_type": "Action Reasoning",
"question": "The main character of the video gets the money in a plastic bag, where does it come from?",
"options": [
"A. The protagonist puts Martabak's change in a plastic bag.",
"B. The change from the owner who sells Martabak.",
"C. The change from the owner who sells beers and snacks.",
"D. A gift from the owner who sells Martabak."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The main character of the video gets the money in a plastic bag, where does it come from?\nOption:\nA. The protagonist puts Martabak's change in a plastic bag.\nB. The change from the owner who sells Martabak.\nC. The change from the owner who sells beers and snacks.\nD. A gift from the owner who sells Martabak.\nAnswer with the option's letter from the given choices directly.",
2595,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "866-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2596,
"target": "A",
"doc": {
"video_id": "866",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=xNgVeznQmXI",
"videoID": "xNgVeznQmXI",
"question_id": "866-2",
"task_type": "Object Reasoning",
"question": "What does the main character of the video think of the people of Bandung?",
"options": [
"A. They are kind.",
"B. They are poorer.",
"C. They are overzealous.",
"D. They're greedy for small change."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the main character of the video think of the people of Bandung?\nOption:\nA. They are kind.\nB. They are poorer.\nC. They are overzealous.\nD. They're greedy for small change.\nAnswer with the option's letter from the given choices directly.",
2596,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "866-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2597,
"target": "D",
"doc": {
"video_id": "866",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=xNgVeznQmXI",
"videoID": "xNgVeznQmXI",
"question_id": "866-3",
"task_type": "Object Reasoning",
"question": "Which description of the video is inaccurate?",
"options": [
"A. The red light on the road has a pattern of a woman dancing.",
"B. Sudirman Street is empty because it is morning.",
"C. He visited a historic mosque.",
"D. People on the streets dressed up as ghosts for a special day."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which description of the video is inaccurate?\nOption:\nA. The red light on the road has a pattern of a woman dancing.\nB. Sudirman Street is empty because it is morning.\nC. He visited a historic mosque.\nD. People on the streets dressed up as ghosts for a special day.\nAnswer with the option's letter from the given choices directly.",
2597,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "866-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2598,
"target": "C",
"doc": {
"video_id": "867",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=ZLRyTFIZOIs",
"videoID": "ZLRyTFIZOIs",
"question_id": "867-1",
"task_type": "Object Recognition",
"question": "What is the final mode of transportation used by the team that successfully finishes the second challenge?",
"options": [
"A. Coach.",
"B. Taxi.",
"C. Van.",
"D. Scooter."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the final mode of transportation used by the team that successfully finishes the second challenge?\nOption:\nA. Coach.\nB. Taxi.\nC. Van.\nD. Scooter.\nAnswer with the option's letter from the given choices directly.",
2598,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "867-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2599,
"target": "D",
"doc": {
"video_id": "867",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=ZLRyTFIZOIs",
"videoID": "ZLRyTFIZOIs",
"question_id": "867-2",
"task_type": "Object Reasoning",
"question": "What is the relationship between the team in the SUV in the video and the owner of the van company?",
"options": [
"A. They are friends.",
"B. They are business partners.",
"C. They are buyers and sellers.",
"D. Cannot be inferred from the video."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the team in the SUV in the video and the owner of the van company?\nOption:\nA. They are friends.\nB. They are business partners.\nC. They are buyers and sellers.\nD. Cannot be inferred from the video.\nAnswer with the option's letter from the given choices directly.",
2599,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "867-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2600,
"target": "B",
"doc": {
"video_id": "867",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=ZLRyTFIZOIs",
"videoID": "ZLRyTFIZOIs",
"question_id": "867-3",
"task_type": "Object Reasoning",
"question": "What is the initial budget for each person at the start of the challenge?",
"options": [
"A. £71.",
"B. £69.",
"C. £207.",
"D. £138."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the initial budget for each person at the start of the challenge?\nOption:\nA. £71.\nB. £69.\nC. £207.\nD. £138.\nAnswer with the option's letter from the given choices directly.",
2600,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "867-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2601,
"target": "A",
"doc": {
"video_id": "868",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=eZ5lFBPpRkM",
"videoID": "eZ5lFBPpRkM",
"question_id": "868-1",
"task_type": "Object Reasoning",
"question": "What is the relationship of the woman standing next to the main character in the video when he visits Jatilui?",
"options": [
"A. His wife.",
"B. His daughter.",
"C. Strange passersby.",
"D. Modeling for photo shoots."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship of the woman standing next to the main character in the video when he visits Jatilui?\nOption:\nA. His wife.\nB. His daughter.\nC. Strange passersby.\nD. Modeling for photo shoots.\nAnswer with the option's letter from the given choices directly.",
2601,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "868-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2602,
"target": "B",
"doc": {
"video_id": "868",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=eZ5lFBPpRkM",
"videoID": "eZ5lFBPpRkM",
"question_id": "868-2",
"task_type": "Temporal Reasoning",
"question": "When the people in the video saw the turtles for the first time, how many days into their journey was it?",
"options": [
"A. Day 2.",
"B. Day 3.",
"C. Day 1.",
"D. Day 4."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: When the people in the video saw the turtles for the first time, how many days into their journey was it?\nOption:\nA. Day 2.\nB. Day 3.\nC. Day 1.\nD. Day 4.\nAnswer with the option's letter from the given choices directly.",
2602,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "868-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2603,
"target": "C",
"doc": {
"video_id": "868",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=eZ5lFBPpRkM",
"videoID": "eZ5lFBPpRkM",
"question_id": "868-3",
"task_type": "Information Synopsis",
"question": "What does the protagonist of the video record?",
"options": [
"A. Travel logs with friends.",
"B. Bali attractions review.",
"C. The complete guide to traveling in Bali.",
"D. Bali Diving Guide."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the protagonist of the video record?\nOption:\nA. Travel logs with friends.\nB. Bali attractions review.\nC. The complete guide to traveling in Bali.\nD. Bali Diving Guide.\nAnswer with the option's letter from the given choices directly.",
2603,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "868-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2604,
"target": "C",
"doc": {
"video_id": "869",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=NagvWwLvRik",
"videoID": "NagvWwLvRik",
"question_id": "869-1",
"task_type": "Action Reasoning",
"question": "Where is the man's family most likely to be when he crosses the canyon in the video?",
"options": [
"A. Waiting in the car.",
"B. Camping in the canyon.",
"C. Staying in the hotel.",
"D. At home."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Where is the man's family most likely to be when he crosses the canyon in the video?\nOption:\nA. Waiting in the car.\nB. Camping in the canyon.\nC. Staying in the hotel.\nD. At home.\nAnswer with the option's letter from the given choices directly.",
2604,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "869-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2605,
"target": "D",
"doc": {
"video_id": "869",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=NagvWwLvRik",
"videoID": "NagvWwLvRik",
"question_id": "869-2",
"task_type": "Object Reasoning",
"question": "What makes the fourth day of the family's arrival at their destination stand out in the video?",
"options": [
"A. The men arrive at the national park for the second time.",
"B. It's New Year's Eve.",
"C. The men explore the canyon with one of the children.",
"D. Jack's birthday."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What makes the fourth day of the family's arrival at their destination stand out in the video?\nOption:\nA. The men arrive at the national park for the second time.\nB. It's New Year's Eve.\nC. The men explore the canyon with one of the children.\nD. Jack's birthday.\nAnswer with the option's letter from the given choices directly.",
2605,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "869-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2606,
"target": "A",
"doc": {
"video_id": "869",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=NagvWwLvRik",
"videoID": "NagvWwLvRik",
"question_id": "869-3",
"task_type": "Object Recognition",
"question": "Which of the following activities does the man in the video perform only once?",
"options": [
"A. Getting together with close friends.",
"B. Wild camping.",
"C. Hiking the canyon.",
"D. Explore the caves."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following activities does the man in the video perform only once?\nOption:\nA. Getting together with close friends.\nB. Wild camping.\nC. Hiking the canyon.\nD. Explore the caves.\nAnswer with the option's letter from the given choices directly.",
2606,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "869-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2607,
"target": "B",
"doc": {
"video_id": "870",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=Fibq4aA-GVc",
"videoID": "Fibq4aA-GVc",
"question_id": "870-1",
"task_type": "Action Recognition",
"question": "Which challenges do they not complete during the trip?",
"options": [
"A. Find the crocodile.",
"B. Eat ice cream until see a red light.",
"C. Mini golf.",
"D. Tip the waiter."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which challenges do they not complete during the trip?\nOption:\nA. Find the crocodile.\nB. Eat ice cream until see a red light.\nC. Mini golf.\nD. Tip the waiter.\nAnswer with the option's letter from the given choices directly.",
2607,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "870-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2608,
"target": "C",
"doc": {
"video_id": "870",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=Fibq4aA-GVc",
"videoID": "Fibq4aA-GVc",
"question_id": "870-2",
"task_type": "Temporal Reasoning",
"question": "What successive activities do the people in the video engage in while in the car?",
"options": [
"A. They play mini-ping pong, observe bridge heights, and play PlayStation.",
"B. They add coolant to vehicles, move the microwave, and play mini pool.",
"C. They play mini-basketball, make macaroni, and paint.",
"D. They shave their heads, eat ice-cream, and play hide-and-seek."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What successive activities do the people in the video engage in while in the car?\nOption:\nA. They play mini-ping pong, observe bridge heights, and play PlayStation.\nB. They add coolant to vehicles, move the microwave, and play mini pool.\nC. They play mini-basketball, make macaroni, and paint.\nD. They shave their heads, eat ice-cream, and play hide-and-seek.\nAnswer with the option's letter from the given choices directly.",
2608,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "870-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2609,
"target": "D",
"doc": {
"video_id": "870",
"duration": "long",
"domain": "Life Record",
"sub_category": "Travel",
"url": "https://www.youtube.com/watch?v=Fibq4aA-GVc",
"videoID": "Fibq4aA-GVc",
"question_id": "870-3",
"task_type": "Object Reasoning",
"question": "What is wrong with the car when there is a woman in the vehicle?",
"options": [
"A. Flat tire.",
"B. Overheating.",
"C. Shattered rearview mirror.",
"D. Unusual rattling noise."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is wrong with the car when there is a woman in the vehicle?\nOption:\nA. Flat tire.\nB. Overheating.\nC. Shattered rearview mirror.\nD. Unusual rattling noise.\nAnswer with the option's letter from the given choices directly.",
2609,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "870-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Travel",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2610,
"target": "D",
"doc": {
"video_id": "871",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZkhAbUGoD18",
"videoID": "ZkhAbUGoD18",
"question_id": "871-1",
"task_type": "Action Reasoning",
"question": "Why are there many toys in the room?",
"options": [
"A. For entertaining guests.",
"B. The host wants to move.",
"C. For baby to play.",
"D. For playing with cats."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why are there many toys in the room?\nOption:\nA. For entertaining guests.\nB. The host wants to move.\nC. For baby to play.\nD. For playing with cats.\nAnswer with the option's letter from the given choices directly.",
2610,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "871-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2611,
"target": "C",
"doc": {
"video_id": "871",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZkhAbUGoD18",
"videoID": "ZkhAbUGoD18",
"question_id": "871-2",
"task_type": "Object Reasoning",
"question": "Which is the most interesting toy for these cats based on the video?",
"options": [
"A. Car.",
"B. Interactive Laser.",
"C. Feather.",
"D. Ball."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is the most interesting toy for these cats based on the video?\nOption:\nA. Car.\nB. Interactive Laser.\nC. Feather.\nD. Ball.\nAnswer with the option's letter from the given choices directly.",
2611,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "871-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2612,
"target": "B",
"doc": {
"video_id": "871",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZkhAbUGoD18",
"videoID": "ZkhAbUGoD18",
"question_id": "871-3",
"task_type": "Information Synopsis",
"question": "What is the video's main focus?",
"options": [
"A. How to raise cats.",
"B. Playing with cats.",
"C. Feeding cats.",
"D. What cats like most."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video's main focus?\nOption:\nA. How to raise cats.\nB. Playing with cats.\nC. Feeding cats.\nD. What cats like most.\nAnswer with the option's letter from the given choices directly.",
2612,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "871-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2613,
"target": "D",
"doc": {
"video_id": "872",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZP-Y2S750KY",
"videoID": "ZP-Y2S750KY",
"question_id": "872-1",
"task_type": "Information Synopsis",
"question": "What is the video regarding?",
"options": [
"A. How to take care of animals.",
"B. Saving animals.",
"C. People play with animals.",
"D. Performance of baby animals."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video regarding?\nOption:\nA. How to take care of animals.\nB. Saving animals.\nC. People play with animals.\nD. Performance of baby animals.\nAnswer with the option's letter from the given choices directly.",
2613,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "872-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2614,
"target": "A",
"doc": {
"video_id": "872",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZP-Y2S750KY",
"videoID": "ZP-Y2S750KY",
"question_id": "872-2",
"task_type": "Object Recognition",
"question": "As depicted in the video, what are the characteristics of these animals?",
"options": [
"A. Cute.",
"B. Savage.",
"C. Dangerous.",
"D. Fool."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what are the characteristics of these animals?\nOption:\nA. Cute.\nB. Savage.\nC. Dangerous.\nD. Fool.\nAnswer with the option's letter from the given choices directly.",
2614,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "872-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Object Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2615,
"target": "C",
"doc": {
"video_id": "872",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=ZP-Y2S750KY",
"videoID": "ZP-Y2S750KY",
"question_id": "872-3",
"task_type": "Action Recognition",
"question": "What does the panda do in the video?",
"options": [
"A. Eating some food.",
"B. Playing with people.",
"C. Rolling down.",
"D. Sleeping."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the panda do in the video?\nOption:\nA. Eating some food.\nB. Playing with people.\nC. Rolling down.\nD. Sleeping.\nAnswer with the option's letter from the given choices directly.",
2615,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "872-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2616,
"target": "A",
"doc": {
"video_id": "873",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=t23Zi0DBSiI",
"videoID": "t23Zi0DBSiI",
"question_id": "873-1",
"task_type": "Action Reasoning",
"question": "What is the reason for people visiting this place according to what is shown in the video?",
"options": [
"A. Enjoy playing with dogs.",
"B. Taking a break.",
"C. Talking with each other.",
"D. Having lunch."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the reason for people visiting this place according to what is shown in the video?\nOption:\nA. Enjoy playing with dogs.\nB. Taking a break.\nC. Talking with each other.\nD. Having lunch.\nAnswer with the option's letter from the given choices directly.",
2616,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "873-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2617,
"target": "D",
"doc": {
"video_id": "873",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=t23Zi0DBSiI",
"videoID": "t23Zi0DBSiI",
"question_id": "873-2",
"task_type": "Spatial Reasoning",
"question": "In line with the video evidence, what the atmosphere is in the house?",
"options": [
"A. Deserted.",
"B. Noisy.",
"C. Nervous.",
"D. Relaxed."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what the atmosphere is in the house?\nOption:\nA. Deserted.\nB. Noisy.\nC. Nervous.\nD. Relaxed.\nAnswer with the option's letter from the given choices directly.",
2617,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "873-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Spatial Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2618,
"target": "C",
"doc": {
"video_id": "873",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=t23Zi0DBSiI",
"videoID": "t23Zi0DBSiI",
"question_id": "873-3",
"task_type": "Action Reasoning",
"question": "Why do dogs lie in a row by a window as depicted in the video?",
"options": [
"A. Begging for food.",
"B. Preparing to sleep.",
"C. Taking a photo with a person.",
"D. Ready for water."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do dogs lie in a row by a window as depicted in the video?\nOption:\nA. Begging for food.\nB. Preparing to sleep.\nC. Taking a photo with a person.\nD. Ready for water.\nAnswer with the option's letter from the given choices directly.",
2618,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "873-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2619,
"target": "C",
"doc": {
"video_id": "874",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=h1cv8Fsn4eU",
"videoID": "h1cv8Fsn4eU",
"question_id": "874-1",
"task_type": "Action Reasoning",
"question": "According to what is shown in the video, why does the cameraman drive a car?",
"options": [
"A. Taking the cat to play.",
"B. Purchasing food for the cat.",
"C. Transporting the cat to the hospital.",
"D. Searching for the lost cat."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to what is shown in the video, why does the cameraman drive a car?\nOption:\nA. Taking the cat to play.\nB. Purchasing food for the cat.\nC. Transporting the cat to the hospital.\nD. Searching for the lost cat.\nAnswer with the option's letter from the given choices directly.",
2619,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "874-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2620,
"target": "B",
"doc": {
"video_id": "874",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=h1cv8Fsn4eU",
"videoID": "h1cv8Fsn4eU",
"question_id": "874-2",
"task_type": "Action Reasoning",
"question": "Why do people shed tears when the big sea turtle goes to sea in the video?",
"options": [
"A. They feel sad.",
"B. They are friends.",
"C. They don't want the turtle to go back to sea.",
"D. They dislike the turtle."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do people shed tears when the big sea turtle goes to sea in the video?\nOption:\nA. They feel sad.\nB. They are friends.\nC. They don't want the turtle to go back to sea.\nD. They dislike the turtle.\nAnswer with the option's letter from the given choices directly.",
2620,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "874-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2621,
"target": "A",
"doc": {
"video_id": "874",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=h1cv8Fsn4eU",
"videoID": "h1cv8Fsn4eU",
"question_id": "874-3",
"task_type": "Information Synopsis",
"question": "What is the subject matter of the video?",
"options": [
"A. People save animals.",
"B. Cute animals performance.",
"C. Funny animals.",
"D. Animals that suffer."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the subject matter of the video?\nOption:\nA. People save animals.\nB. Cute animals performance.\nC. Funny animals.\nD. Animals that suffer.\nAnswer with the option's letter from the given choices directly.",
2621,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "874-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2622,
"target": "B",
"doc": {
"video_id": "875",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=4IenX7OHumk",
"videoID": "4IenX7OHumk",
"question_id": "875-1",
"task_type": "Spatial Reasoning",
"question": "Which location appears in the video?",
"options": [
"A. Aisa.",
"B. Africa.",
"C. Europe.",
"D. Antarctica."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which location appears in the video?\nOption:\nA. Aisa.\nB. Africa.\nC. Europe.\nD. Antarctica.\nAnswer with the option's letter from the given choices directly.",
2622,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "875-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Spatial Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2623,
"target": "A",
"doc": {
"video_id": "875",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=4IenX7OHumk",
"videoID": "4IenX7OHumk",
"question_id": "875-2",
"task_type": "Action Reasoning",
"question": "Why does a leopard climb on trees as depicted in the video?",
"options": [
"A. To prey on birds for food.",
"B. To keep an eye on the grass below.",
"C. To search for competitors.",
"D. To evade natural enemies."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does a leopard climb on trees as depicted in the video?\nOption:\nA. To prey on birds for food.\nB. To keep an eye on the grass below.\nC. To search for competitors.\nD. To evade natural enemies.\nAnswer with the option's letter from the given choices directly.",
2623,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "875-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2624,
"target": "D",
"doc": {
"video_id": "875",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=4IenX7OHumk",
"videoID": "4IenX7OHumk",
"question_id": "875-3",
"task_type": "Action Reasoning",
"question": "Which of the following options is not depicted in the video?",
"options": [
"A. Leopard climbs on trees.",
"B. Hyenas prey animals.",
"C. A lion is attacked by a group of lions.",
"D. Eagles eat the bodies of animals."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options is not depicted in the video?\nOption:\nA. Leopard climbs on trees.\nB. Hyenas prey animals.\nC. A lion is attacked by a group of lions.\nD. Eagles eat the bodies of animals.\nAnswer with the option's letter from the given choices directly.",
2624,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "875-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2625,
"target": "D",
"doc": {
"video_id": "876",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=whksDmTR9YE",
"videoID": "whksDmTR9YE",
"question_id": "876-1",
"task_type": "Attribute Perception",
"question": "According to the video, what does the cougar fight with?",
"options": [
"A. A tiger.",
"B. A lion.",
"C. A hippo.",
"D. A grizzly."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what does the cougar fight with?\nOption:\nA. A tiger.\nB. A lion.\nC. A hippo.\nD. A grizzly.\nAnswer with the option's letter from the given choices directly.",
2625,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "876-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2626,
"target": "B",
"doc": {
"video_id": "876",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=whksDmTR9YE",
"videoID": "whksDmTR9YE",
"question_id": "876-2",
"task_type": "Action Reasoning",
"question": "Which appears in the video?",
"options": [
"A. A bear fights with lions.",
"B. Two foxes fight.",
"C. A shark preys fishes.",
"D. A eagle preys on rats."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which appears in the video?\nOption:\nA. A bear fights with lions.\nB. Two foxes fight.\nC. A shark preys fishes.\nD. A eagle preys on rats.\nAnswer with the option's letter from the given choices directly.",
2626,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "876-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2627,
"target": "C",
"doc": {
"video_id": "876",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=whksDmTR9YE",
"videoID": "whksDmTR9YE",
"question_id": "876-3",
"task_type": "Action Reasoning",
"question": "Why do hermit crabs fight with each other?",
"options": [
"A. To avoid the attack of conch.",
"B. They fight for food.",
"C. They need new houses.",
"D. To protect themselves."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do hermit crabs fight with each other?\nOption:\nA. To avoid the attack of conch.\nB. They fight for food.\nC. They need new houses.\nD. To protect themselves.\nAnswer with the option's letter from the given choices directly.",
2627,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "876-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2628,
"target": "A",
"doc": {
"video_id": "877",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=5WIdIs3A9Ok",
"videoID": "5WIdIs3A9Ok",
"question_id": "877-1",
"task_type": "Action Reasoning",
"question": "Why is there no male jaguar near the cub?",
"options": [
"A. Male and female jaguars part after mating.",
"B. Male jugar is dead.",
"C. Female jugar brings its cub out to find food.",
"D. Male jugar goes out to find food."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why is there no male jaguar near the cub?\nOption:\nA. Male and female jaguars part after mating.\nB. Male jugar is dead.\nC. Female jugar brings its cub out to find food.\nD. Male jugar goes out to find food.\nAnswer with the option's letter from the given choices directly.",
2628,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "877-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2629,
"target": "D",
"doc": {
"video_id": "877",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=5WIdIs3A9Ok",
"videoID": "5WIdIs3A9Ok",
"question_id": "877-2",
"task_type": "Action Reasoning",
"question": "Why do capybaras dive into water in the video?",
"options": [
"A. They need to move to other places to live.",
"B. They are scared.",
"C. They have a habit of swim.",
"D. They hear signals from peers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why do capybaras dive into water in the video?\nOption:\nA. They need to move to other places to live.\nB. They are scared.\nC. They have a habit of swim.\nD. They hear signals from peers.\nAnswer with the option's letter from the given choices directly.",
2629,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "877-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2630,
"target": "B",
"doc": {
"video_id": "877",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=5WIdIs3A9Ok",
"videoID": "5WIdIs3A9Ok",
"question_id": "877-3",
"task_type": "Information Synopsis",
"question": "What story does the photographers record based on the video?",
"options": [
"A. Wild scene of Africa.",
"B. Hunting life of jaguars.",
"C. Fight between animals.",
"D. Growing of jaguar cub."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What story does the photographers record based on the video?\nOption:\nA. Wild scene of Africa.\nB. Hunting life of jaguars.\nC. Fight between animals.\nD. Growing of jaguar cub.\nAnswer with the option's letter from the given choices directly.",
2630,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "877-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2631,
"target": "C",
"doc": {
"video_id": "878",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=p9CCSG3-dhI",
"videoID": "p9CCSG3-dhI",
"question_id": "878-1",
"task_type": "Action Reasoning",
"question": "Which animals compete for territory as shown in the video?",
"options": [
"A. Bears.",
"B. Eagles.",
"C. Hippos.",
"D. Wolves."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which animals compete for territory as shown in the video?\nOption:\nA. Bears.\nB. Eagles.\nC. Hippos.\nD. Wolves.\nAnswer with the option's letter from the given choices directly.",
2631,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "878-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2632,
"target": "B",
"doc": {
"video_id": "878",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=p9CCSG3-dhI",
"videoID": "p9CCSG3-dhI",
"question_id": "878-2",
"task_type": "Action Recognition",
"question": "Which appears according to the video?",
"options": [
"A. Lions prey on animals for food.",
"B. Crocodiles catch bats flying into water.",
"C. Eagles eat the bodies of dead animals.",
"D. Jaguars climb on trees."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which appears according to the video?\nOption:\nA. Lions prey on animals for food.\nB. Crocodiles catch bats flying into water.\nC. Eagles eat the bodies of dead animals.\nD. Jaguars climb on trees.\nAnswer with the option's letter from the given choices directly.",
2632,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "878-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2633,
"target": "D",
"doc": {
"video_id": "878",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=p9CCSG3-dhI",
"videoID": "p9CCSG3-dhI",
"question_id": "878-3",
"task_type": "Information Synopsis",
"question": "What does the video depict?",
"options": [
"A. The animals in the wild.",
"B. Savage of nature.",
"C. How to survive in the wild.",
"D. Wild animals in rivers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What does the video depict?\nOption:\nA. The animals in the wild.\nB. Savage of nature.\nC. How to survive in the wild.\nD. Wild animals in rivers.\nAnswer with the option's letter from the given choices directly.",
2633,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "878-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2634,
"target": "A",
"doc": {
"video_id": "879",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=zAXbdzvCeV8",
"videoID": "zAXbdzvCeV8",
"question_id": "879-1",
"task_type": "Action Reasoning",
"question": "Why are the three dogs fighting with another dog in the video?",
"options": [
"A. They are not a group.",
"B. They are too hungry.",
"C. They don't like it.",
"D. They are playing."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why are the three dogs fighting with another dog in the video?\nOption:\nA. They are not a group.\nB. They are too hungry.\nC. They don't like it.\nD. They are playing.\nAnswer with the option's letter from the given choices directly.",
2634,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "879-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2635,
"target": "C",
"doc": {
"video_id": "879",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=zAXbdzvCeV8",
"videoID": "zAXbdzvCeV8",
"question_id": "879-2",
"task_type": "Action Recognition",
"question": "In line with the video evidence, what does the younger tiger do when it is found by the dominant tiger?",
"options": [
"A. Escapes.",
"B. Engages in a fight.",
"C. Submits.",
"D. Playfully interacts."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what does the younger tiger do when it is found by the dominant tiger?\nOption:\nA. Escapes.\nB. Engages in a fight.\nC. Submits.\nD. Playfully interacts.\nAnswer with the option's letter from the given choices directly.",
2635,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "879-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2636,
"target": "D",
"doc": {
"video_id": "879",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=zAXbdzvCeV8",
"videoID": "zAXbdzvCeV8",
"question_id": "879-3",
"task_type": "Information Synopsis",
"question": "What is the video about?",
"options": [
"A. Nature scene.",
"B. Animals protection.",
"C. Connection with animals.",
"D. Wild animals."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the video about?\nOption:\nA. Nature scene.\nB. Animals protection.\nC. Connection with animals.\nD. Wild animals.\nAnswer with the option's letter from the given choices directly.",
2636,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "879-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2637,
"target": "B",
"doc": {
"video_id": "880",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=eZ2Rt2DVGdU",
"videoID": "eZ2Rt2DVGdU",
"question_id": "880-1",
"task_type": "Action Reasoning",
"question": "In accordance with the video footage, which may be the reason for the death of the diver?",
"options": [
"A. Equipment failure.",
"B. Being electrocuted by an electric ray.",
"C. Bleed with a big trauma.",
"D. Bited by fish."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In accordance with the video footage, which may be the reason for the death of the diver?\nOption:\nA. Equipment failure.\nB. Being electrocuted by an electric ray.\nC. Bleed with a big trauma.\nD. Bited by fish.\nAnswer with the option's letter from the given choices directly.",
2637,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "880-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2638,
"target": "A",
"doc": {
"video_id": "880",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=eZ2Rt2DVGdU",
"videoID": "eZ2Rt2DVGdU",
"question_id": "880-2",
"task_type": "Action Recognition",
"question": "Which shark species is featured in the video?",
"options": [
"A. Hammerhead shark.",
"B. Great white shark.",
"C. Tiger shark.",
"D. Mako shark."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which shark species is featured in the video?\nOption:\nA. Hammerhead shark.\nB. Great white shark.\nC. Tiger shark.\nD. Mako shark.\nAnswer with the option's letter from the given choices directly.",
2638,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "880-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2639,
"target": "B",
"doc": {
"video_id": "880",
"duration": "long",
"domain": "Life Record",
"sub_category": "Pet & Animal",
"url": "https://www.youtube.com/watch?v=eZ2Rt2DVGdU",
"videoID": "eZ2Rt2DVGdU",
"question_id": "880-3",
"task_type": "Information Synopsis",
"question": "Which is the best title of the video?",
"options": [
"A. Wild animals.",
"B. Ocean animals.",
"C. Diverse fishes.",
"D. Protect the sea."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which is the best title of the video?\nOption:\nA. Wild animals.\nB. Ocean animals.\nC. Diverse fishes.\nD. Protect the sea.\nAnswer with the option's letter from the given choices directly.",
2639,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "880-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Pet & Animal",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2640,
"target": "B",
"doc": {
"video_id": "881",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=goyWFUzCqF4",
"videoID": "goyWFUzCqF4",
"question_id": "881-1",
"task_type": "Object Recognition",
"question": "What is the sports tracking device used by the male protagonist in the video for running?",
"options": [
"A. A pair of sensor-equipped shoes that sync with a computer.",
"B. Apple watch.",
"C. iPhone.",
"D. Not mentioned in the video."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the sports tracking device used by the male protagonist in the video for running?\nOption:\nA. A pair of sensor-equipped shoes that sync with a computer.\nB. Apple watch.\nC. iPhone.\nD. Not mentioned in the video.\nAnswer with the option's letter from the given choices directly.",
2640,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "881-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2641,
"target": "C",
"doc": {
"video_id": "881",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=goyWFUzCqF4",
"videoID": "goyWFUzCqF4",
"question_id": "881-2",
"task_type": "Action Reasoning",
"question": "What happened when the male protagonist attempted his second 21-kilometer long run in the video?",
"options": [
"A. The male protagonist felt some pain in his legs while running and had to give up.",
"B. During the running process, there was a lot of water accumulation on the road surface, but it wasn't raining.",
"C. It rained during the running process, and the male protagonist completed the long run despite the rain.",
"D. It rained during the running process, and the male protagonist had to abandon the long run."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened when the male protagonist attempted his second 21-kilometer long run in the video?\nOption:\nA. The male protagonist felt some pain in his legs while running and had to give up.\nB. During the running process, there was a lot of water accumulation on the road surface, but it wasn't raining.\nC. It rained during the running process, and the male protagonist completed the long run despite the rain.\nD. It rained during the running process, and the male protagonist had to abandon the long run.\nAnswer with the option's letter from the given choices directly.",
2641,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "881-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2642,
"target": "A",
"doc": {
"video_id": "881",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=goyWFUzCqF4",
"videoID": "goyWFUzCqF4",
"question_id": "881-3",
"task_type": "Action Recognition",
"question": "Which of the following descriptions of the male protagonist's process from preparing for the marathon to successfully completing the marathon in the video is correct?",
"options": [
"A. The male protagonist did not manage to secure a ticket for the marathon. Then, at the 30-kilometer mark, his legs suddenly started hurting, forcing him to stop training. After arriving in South Africa, his legs recovered, and he also got another chance to participate in the race, ultimately successfully completing the marathon.",
"B. The male protagonist experienced sudden leg pain and had to stop training when he reached the 30-kilometer mark. Following this, he failed to secure a ticket for the marathon. After arriving in South Africa, his legs recovered, and he also received another opportunity to participate in the race. Eventually, he successfully completed the marathon.",
"C. The male protagonist failed to secure a ticket for the marathon. Subsequently, when he was running at the 30-kilometer mark, he experienced sudden leg pain and had to stop training. Upon reaching South Africa, his legs did not recover fully, but he still received an opportunity to participate in the race. Eventually, he endured the pain and successfully completed the marathon.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following descriptions of the male protagonist's process from preparing for the marathon to successfully completing the marathon in the video is correct?\nOption:\nA. The male protagonist did not manage to secure a ticket for the marathon. Then, at the 30-kilometer mark, his legs suddenly started hurting, forcing him to stop training. After arriving in South Africa, his legs recovered, and he also got another chance to participate in the race, ultimately successfully completing the marathon.\nB. The male protagonist experienced sudden leg pain and had to stop training when he reached the 30-kilometer mark. Following this, he failed to secure a ticket for the marathon. After arriving in South Africa, his legs recovered, and he also received another opportunity to participate in the race. Eventually, he successfully completed the marathon.\nC. The male protagonist failed to secure a ticket for the marathon. Subsequently, when he was running at the 30-kilometer mark, he experienced sudden leg pain and had to stop training. Upon reaching South Africa, his legs did not recover fully, but he still received an opportunity to participate in the race. Eventually, he endured the pain and successfully completed the marathon.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2642,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "881-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2643,
"target": "D",
"doc": {
"video_id": "882",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=qVZOKel-gpE",
"videoID": "qVZOKel-gpE",
"question_id": "882-1",
"task_type": "Action Reasoning",
"question": "In the video on the third day of training why is the man in the green top so far behind the contestant in the white top?",
"options": [
"A. The guy in the green top is slacking off on the ride and riding slowly.",
"B. The hero in the green top was riding in a very windy and difficult ride.",
"C. The hero with the green top fell and got hurt.",
"D. The green-topped hero's bike broke down and the repairs took time."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video on the third day of training why is the man in the green top so far behind the contestant in the white top?\nOption:\nA. The guy in the green top is slacking off on the ride and riding slowly.\nB. The hero in the green top was riding in a very windy and difficult ride.\nC. The hero with the green top fell and got hurt.\nD. The green-topped hero's bike broke down and the repairs took time.\nAnswer with the option's letter from the given choices directly.",
2643,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "882-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2644,
"target": "B",
"doc": {
"video_id": "882",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=qVZOKel-gpE",
"videoID": "qVZOKel-gpE",
"question_id": "882-2",
"task_type": "Action Reasoning",
"question": "Why did the hero in the red top stop to give a close-up of a palm tree on the fourth day of the training ride?",
"options": [
"A. Emphasis on the topographical features of the riding environment.",
"B. Use palm trees to reflect how windy the scene was.",
"C. Emphasis on beautiful views of the ride's surroundings.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the hero in the red top stop to give a close-up of a palm tree on the fourth day of the training ride?\nOption:\nA. Emphasis on the topographical features of the riding environment.\nB. Use palm trees to reflect how windy the scene was.\nC. Emphasis on beautiful views of the ride's surroundings.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2644,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "882-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2645,
"target": "C",
"doc": {
"video_id": "882",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=qVZOKel-gpE",
"videoID": "qVZOKel-gpE",
"question_id": "882-3",
"task_type": "Information Synopsis",
"question": "Which of the following options does not match the description in the video?",
"options": [
"A. On the first day of training, the two male protagonists ran 8km and rested most of the rest of the time.",
"B. On the second day of training, the two male protagonists swam three kilometers in open water and then cycled 60 kilometers.",
"C. On the third day of training, the two male protagonists rode a total of 160km, after which only one of them proceeded to run an additional 4km.",
"D. On the last day of training, the two male protagonists conducted a simulated triathlon training, swimming 3000m, cycling 40km, and running 16km. Both of them successfully completed."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options does not match the description in the video?\nOption:\nA. On the first day of training, the two male protagonists ran 8km and rested most of the rest of the time.\nB. On the second day of training, the two male protagonists swam three kilometers in open water and then cycled 60 kilometers.\nC. On the third day of training, the two male protagonists rode a total of 160km, after which only one of them proceeded to run an additional 4km.\nD. On the last day of training, the two male protagonists conducted a simulated triathlon training, swimming 3000m, cycling 40km, and running 16km. Both of them successfully completed.\nAnswer with the option's letter from the given choices directly.",
2645,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "882-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "C"
}
},
{
"doc_id": 2646,
"target": "A",
"doc": {
"video_id": "883",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=2LDriAWltwM",
"videoID": "2LDriAWltwM",
"question_id": "883-1",
"task_type": "Action Reasoning",
"question": "Which move in the video was done in both the warm-up phase and the official workout phase of Monday's workout?",
"options": [
"A. Glute drive.",
"B. Barbell rdls.",
"C. Single leg dumbbell rdl.",
"D. Seated hamstring curl."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which move in the video was done in both the warm-up phase and the official workout phase of Monday's workout?\nOption:\nA. Glute drive.\nB. Barbell rdls.\nC. Single leg dumbbell rdl.\nD. Seated hamstring curl.\nAnswer with the option's letter from the given choices directly.",
2646,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "883-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2647,
"target": "D",
"doc": {
"video_id": "883",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=2LDriAWltwM",
"videoID": "2LDriAWltwM",
"question_id": "883-2",
"task_type": "Object Reasoning",
"question": "What is the difference in appearance between the female lead in the video, Friday and Wednesday?",
"options": [
"A. On Friday her hair was shorter compared to when it was on Wednesday, and on Friday she wore different trousers than she did on Wednesday.",
"B. Friday she didn't wear a coat, Wednesday she wore a coat.",
"C. On Friday, her hair was longer and straighter and she wore different trousers than on Wednesday.",
"D. On Friday, her hair turned curly and she wore different trousers than she did on Wednesday."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the difference in appearance between the female lead in the video, Friday and Wednesday?\nOption:\nA. On Friday her hair was shorter compared to when it was on Wednesday, and on Friday she wore different trousers than she did on Wednesday.\nB. Friday she didn't wear a coat, Wednesday she wore a coat.\nC. On Friday, her hair was longer and straighter and she wore different trousers than on Wednesday.\nD. On Friday, her hair turned curly and she wore different trousers than she did on Wednesday.\nAnswer with the option's letter from the given choices directly.",
2647,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "883-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2648,
"target": "B",
"doc": {
"video_id": "883",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=2LDriAWltwM",
"videoID": "2LDriAWltwM",
"question_id": "883-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following describes the heroine's weekly fitness programme correctly?",
"options": [
"A. Glutes and hamstrings on Monday, pulls on Tuesday, Quads and calves on Wednesday, rest on Thursday when the lady goes to the hairdresser's and gets her hair done, push on Friday and Cardio and Core on Saturday.",
"B. Glutes and hamstrings on Monday, push on Tuesday, Quads and calves on Wednesday, rest on Thursday when the lady goes to the hairdresser's and gets her hair done, pulls on Friday and Cardio and Core on Saturday.",
"C. Glutes and hamstrings on Monday, pulls on Tuesday, Wednesday off while the lady goes to the hairdresser's and gets her hair done, Quads and calves on Thursday, push on Friday and Cardio and Core on Saturday.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following describes the heroine's weekly fitness programme correctly?\nOption:\nA. Glutes and hamstrings on Monday, pulls on Tuesday, Quads and calves on Wednesday, rest on Thursday when the lady goes to the hairdresser's and gets her hair done, push on Friday and Cardio and Core on Saturday.\nB. Glutes and hamstrings on Monday, push on Tuesday, Quads and calves on Wednesday, rest on Thursday when the lady goes to the hairdresser's and gets her hair done, pulls on Friday and Cardio and Core on Saturday.\nC. Glutes and hamstrings on Monday, pulls on Tuesday, Wednesday off while the lady goes to the hairdresser's and gets her hair done, Quads and calves on Thursday, push on Friday and Cardio and Core on Saturday.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2648,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "883-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2649,
"target": "C",
"doc": {
"video_id": "884",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=0kRsiSdDFYg",
"videoID": "0kRsiSdDFYg",
"question_id": "884-1",
"task_type": "Counting Problem",
"question": "How many goals did the player interviewed by the male host at the beginning of the video wearing a top with the word QARAR score in his first match?",
"options": [
"A. 3.",
"B. 2.",
"C. 1.",
"D. 4."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many goals did the player interviewed by the male host at the beginning of the video wearing a top with the word QARAR score in his first match?\nOption:\nA. 3.\nB. 2.\nC. 1.\nD. 4.\nAnswer with the option's letter from the given choices directly.",
2649,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "884-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2650,
"target": "A",
"doc": {
"video_id": "884",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=0kRsiSdDFYg",
"videoID": "0kRsiSdDFYg",
"question_id": "884-2",
"task_type": "Action Reasoning",
"question": "What is the score of the first game of the semi-finals in the video?",
"options": [
"A. 2:1.",
"B. 1:1.",
"C. 1:0.",
"D. 3:1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the score of the first game of the semi-finals in the video?\nOption:\nA. 2:1.\nB. 1:1.\nC. 1:0.\nD. 3:1.\nAnswer with the option's letter from the given choices directly.",
2650,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "884-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2651,
"target": "D",
"doc": {
"video_id": "884",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=0kRsiSdDFYg",
"videoID": "0kRsiSdDFYg",
"question_id": "884-3",
"task_type": "Action Reasoning",
"question": "In the video after all the matches are over why is everyone cheering and celebrating around the guy wearing a shirt with the word Qatar on it?",
"options": [
"A. Because he received a special recognition for fair play and sportsmanship.",
"B. Because he scored the most goals throughout the tournament.",
"C. Because he broke a record for the fastest goal in the competition.",
"D. Because he won the championship in the competition."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video after all the matches are over why is everyone cheering and celebrating around the guy wearing a shirt with the word Qatar on it?\nOption:\nA. Because he received a special recognition for fair play and sportsmanship.\nB. Because he scored the most goals throughout the tournament.\nC. Because he broke a record for the fastest goal in the competition.\nD. Because he won the championship in the competition.\nAnswer with the option's letter from the given choices directly.",
2651,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "884-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2652,
"target": "B",
"doc": {
"video_id": "885",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=EhLOdDfaC7Y",
"videoID": "EhLOdDfaC7Y",
"question_id": "885-1",
"task_type": "Action Recognition",
"question": "What happened in the first game of the second round of 16 in the video?",
"options": [
"A. During the game, the attacking side committed a foul while attacking and the defence turned into the attacking side and successfully scored.",
"B. During the game, the defending side committed a foul during defense and the attacking side resumed the ball attack and scored successfully.",
"C. During the game, the attacking side committed a foul while attacking, but the defense successfully intercepted the ball and prevented any scoring opportunity.",
"D. During the game, the defending side committed a foul during defense, but the attacking side missed the subsequent penalty kick, failing to score."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What happened in the first game of the second round of 16 in the video?\nOption:\nA. During the game, the attacking side committed a foul while attacking and the defence turned into the attacking side and successfully scored.\nB. During the game, the defending side committed a foul during defense and the attacking side resumed the ball attack and scored successfully.\nC. During the game, the attacking side committed a foul while attacking, but the defense successfully intercepted the ball and prevented any scoring opportunity.\nD. During the game, the defending side committed a foul during defense, but the attacking side missed the subsequent penalty kick, failing to score.\nAnswer with the option's letter from the given choices directly.",
2652,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "885-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2653,
"target": "C",
"doc": {
"video_id": "885",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=EhLOdDfaC7Y",
"videoID": "EhLOdDfaC7Y",
"question_id": "885-2",
"task_type": "Action Reasoning",
"question": "Why did the video focus on the content behind the player wearing a black short-sleeved shirt and black trousers?",
"options": [
"A. Because his personal historical achievements are printed on the back of the clothes.",
"B. Because the player wearing a black top and black pants has the best skills.",
"C. Because the player wears clothes with the logo of the winner of the previous competition printed on the back.",
"D. Because there are very exquisite patterns printed on the back of the clothes."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why did the video focus on the content behind the player wearing a black short-sleeved shirt and black trousers?\nOption:\nA. Because his personal historical achievements are printed on the back of the clothes.\nB. Because the player wearing a black top and black pants has the best skills.\nC. Because the player wears clothes with the logo of the winner of the previous competition printed on the back.\nD. Because there are very exquisite patterns printed on the back of the clothes.\nAnswer with the option's letter from the given choices directly.",
2653,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "885-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2654,
"target": "A",
"doc": {
"video_id": "885",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=EhLOdDfaC7Y",
"videoID": "EhLOdDfaC7Y",
"question_id": "885-3",
"task_type": "Object Reasoning",
"question": "What are the scoring rules for the finals and semifinals in the video?",
"options": [
"A. Attacking side goals, attacking side scores, attacking side no goals, defence scores.",
"B. Attacking side goals, attacking side scores, if attacking side doesn't score, neither side scores.",
"C. If the defending team doesn't score, the attacking team gets a point, if the attacking team doesn't score, neither team gets a point.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What are the scoring rules for the finals and semifinals in the video?\nOption:\nA. Attacking side goals, attacking side scores, attacking side no goals, defence scores.\nB. Attacking side goals, attacking side scores, if attacking side doesn't score, neither side scores.\nC. If the defending team doesn't score, the attacking team gets a point, if the attacking team doesn't score, neither team gets a point.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2654,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "885-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2655,
"target": "D",
"doc": {
"video_id": "886",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=elprD1hnDyU",
"videoID": "elprD1hnDyU",
"question_id": "886-1",
"task_type": "OCR Problems",
"question": "What is the maximum weight that the woman in the video can successfully bench press?",
"options": [
"A. 50 kilograms.",
"B. 55 kilograms.",
"C. 60 kilograms.",
"D. 57.5 kilograms."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the maximum weight that the woman in the video can successfully bench press?\nOption:\nA. 50 kilograms.\nB. 55 kilograms.\nC. 60 kilograms.\nD. 57.5 kilograms.\nAnswer with the option's letter from the given choices directly.",
2655,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "886-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "OCR Problems",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2656,
"target": "B",
"doc": {
"video_id": "886",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=elprD1hnDyU",
"videoID": "elprD1hnDyU",
"question_id": "886-2",
"task_type": "Temporal Reasoning",
"question": "Which of the following options correctly describes the heroine's Tuesday exercise items and their order in the video?",
"options": [
"A. Hack squat, weighted walking lunges, standing calf raises, superset, drop set.",
"B. Hack squat, weighted walking lunges, superset, dropset, standing calf raises.",
"C. Barbell bench press, dumbbell shoulder press, dumbbell lateral raise drop set, incline machine press, single arm cable lat raises, cable tricep extensions.",
"D. None of the above."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options correctly describes the heroine's Tuesday exercise items and their order in the video?\nOption:\nA. Hack squat, weighted walking lunges, standing calf raises, superset, drop set.\nB. Hack squat, weighted walking lunges, superset, dropset, standing calf raises.\nC. Barbell bench press, dumbbell shoulder press, dumbbell lateral raise drop set, incline machine press, single arm cable lat raises, cable tricep extensions.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2656,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "886-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2657,
"target": "C",
"doc": {
"video_id": "886",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=elprD1hnDyU",
"videoID": "elprD1hnDyU",
"question_id": "886-3",
"task_type": "Temporal Reasoning",
"question": "On which days of the week does the woman in the video take a rest?",
"options": [
"A. Saturday, Monday.",
"B. Thursday, Tuesday.",
"C. Thursday, Saturday.",
"D. Saturday, Sunday."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: On which days of the week does the woman in the video take a rest?\nOption:\nA. Saturday, Monday.\nB. Thursday, Tuesday.\nC. Thursday, Saturday.\nD. Saturday, Sunday.\nAnswer with the option's letter from the given choices directly.",
2657,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "886-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2658,
"target": "A",
"doc": {
"video_id": "887",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=proINILL4X4",
"videoID": "proINILL4X4",
"question_id": "887-1",
"task_type": "Action Reasoning",
"question": "What is the difference between the actions of the lady on the left hand side of the male presenter at the beginning of the video wearing a black top and blue shorts and the others?",
"options": [
"A. She doesn't jump.",
"B. She is moving more.",
"C. She moves at a faster pace than the others.",
"D. Her movements are the most standard."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the difference between the actions of the lady on the left hand side of the male presenter at the beginning of the video wearing a black top and blue shorts and the others?\nOption:\nA. She doesn't jump.\nB. She is moving more.\nC. She moves at a faster pace than the others.\nD. Her movements are the most standard.\nAnswer with the option's letter from the given choices directly.",
2658,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "887-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2659,
"target": "D",
"doc": {
"video_id": "887",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=proINILL4X4",
"videoID": "proINILL4X4",
"question_id": "887-2",
"task_type": "Action Reasoning",
"question": "What is the meaning of what the person in the video wrote on the blackboard?",
"options": [
"A. Their names and how long they usually exercise.",
"B. Not mentioned in the video.",
"C. Their name and how long they need to rest.",
"D. Their name and the time they first needed a break."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the meaning of what the person in the video wrote on the blackboard?\nOption:\nA. Their names and how long they usually exercise.\nB. Not mentioned in the video.\nC. Their name and how long they need to rest.\nD. Their name and the time they first needed a break.\nAnswer with the option's letter from the given choices directly.",
2659,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "887-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2660,
"target": "B",
"doc": {
"video_id": "887",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=proINILL4X4",
"videoID": "proINILL4X4",
"question_id": "887-3",
"task_type": "Temporal Reasoning",
"question": "How often do the people in the video take water breaks per workout?",
"options": [
"A. 10 minutes.",
"B. 5 minutes.",
"C. 15 minutes.",
"D. No regular breaks."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How often do the people in the video take water breaks per workout?\nOption:\nA. 10 minutes.\nB. 5 minutes.\nC. 15 minutes.\nD. No regular breaks.\nAnswer with the option's letter from the given choices directly.",
2660,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "887-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2661,
"target": "C",
"doc": {
"video_id": "888",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=N1ZRYx46ye4",
"videoID": "N1ZRYx46ye4",
"question_id": "888-1",
"task_type": "Object Recognition",
"question": "In the video, which fitness equipment is used by the heroine on push day, leg day, pull day and full body day?",
"options": [
"A. Rotary torso machine.",
"B. Seated row machine.",
"C. Dumbbell.",
"D. Hach squat machine."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, which fitness equipment is used by the heroine on push day, leg day, pull day and full body day?\nOption:\nA. Rotary torso machine.\nB. Seated row machine.\nC. Dumbbell.\nD. Hach squat machine.\nAnswer with the option's letter from the given choices directly.",
2661,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "888-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Object Recognition",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2662,
"target": "A",
"doc": {
"video_id": "888",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=N1ZRYx46ye4",
"videoID": "N1ZRYx46ye4",
"question_id": "888-2",
"task_type": "Action Reasoning",
"question": "In the video, after using the leg press machine, which part of the body mainly relies on the strength of the next training movements?",
"options": [
"A. Buttock.",
"B. Thigh.",
"C. Calf.",
"D. Back."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, after using the leg press machine, which part of the body mainly relies on the strength of the next training movements?\nOption:\nA. Buttock.\nB. Thigh.\nC. Calf.\nD. Back.\nAnswer with the option's letter from the given choices directly.",
2662,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "888-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2663,
"target": "D",
"doc": {
"video_id": "888",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=N1ZRYx46ye4",
"videoID": "N1ZRYx46ye4",
"question_id": "888-3",
"task_type": "Information Synopsis",
"question": "Which of the following things is not mentioned in the introduction before the full body training formally begins?",
"options": [
"A. Benefits of full body training.",
"B. Training plan for full body training.",
"C. Reasons to love fitness.",
"D. Duration of each full body training session."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following things is not mentioned in the introduction before the full body training formally begins?\nOption:\nA. Benefits of full body training.\nB. Training plan for full body training.\nC. Reasons to love fitness.\nD. Duration of each full body training session.\nAnswer with the option's letter from the given choices directly.",
2663,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "888-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2664,
"target": "B",
"doc": {
"video_id": "889",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=EP2rD6lr2Xk",
"videoID": "EP2rD6lr2Xk",
"question_id": "889-1",
"task_type": "Information Synopsis",
"question": "In the video, what advice does the friend give to the male protagonist during their remote video call on the phone?",
"options": [
"A. Make sure to rest well at night.",
"B. Don't exert too much force while riding a bike, control your pace.",
"C. Don't eat cheese Danish.",
"D. Don't run too fast."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In the video, what advice does the friend give to the male protagonist during their remote video call on the phone?\nOption:\nA. Make sure to rest well at night.\nB. Don't exert too much force while riding a bike, control your pace.\nC. Don't eat cheese Danish.\nD. Don't run too fast.\nAnswer with the option's letter from the given choices directly.",
2664,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "889-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Information Synopsis",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2665,
"target": "C",
"doc": {
"video_id": "889",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=EP2rD6lr2Xk",
"videoID": "EP2rD6lr2Xk",
"question_id": "889-2",
"task_type": "Action Reasoning",
"question": "Why does the male protagonist, wearing a white shirt and black shorts in the video, take painkillers during the race?",
"options": [
"A. He suffered a stress fracture in his foot during the competition.",
"B. Because his knee was suddenly injured during the competition.",
"C. Because he injured his foot during previous training, running during the competition would be very painful.",
"D. Because he suffered a stress fracture in his foot during previous training."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why does the male protagonist, wearing a white shirt and black shorts in the video, take painkillers during the race?\nOption:\nA. He suffered a stress fracture in his foot during the competition.\nB. Because his knee was suddenly injured during the competition.\nC. Because he injured his foot during previous training, running during the competition would be very painful.\nD. Because he suffered a stress fracture in his foot during previous training.\nAnswer with the option's letter from the given choices directly.",
2665,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "889-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2666,
"target": "A",
"doc": {
"video_id": "889",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=EP2rD6lr2Xk",
"videoID": "EP2rD6lr2Xk",
"question_id": "889-3",
"task_type": "Action Reasoning",
"question": "Why were the two male leads in the video, as well as the people around them, so excited when they reached the finish line?",
"options": [
"A. Because they had only been involved in this sport for eight months, after months of perseverance, they overcame difficulties and successfully reached the finish line. They felt proud, and the people around them were also deeply moved.",
"B. Because they have been involved in this sport for a long time, during the competition, they overcame difficulties and successfully reached the finish line. They felt proud, and the people around them were also deeply moved.",
"C. Because they have been involved in this sport for a long time, and they were the first to reach the finish line in the competition, they felt proud, and the people around them were also deeply moved.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Why were the two male leads in the video, as well as the people around them, so excited when they reached the finish line?\nOption:\nA. Because they had only been involved in this sport for eight months, after months of perseverance, they overcame difficulties and successfully reached the finish line. They felt proud, and the people around them were also deeply moved.\nB. Because they have been involved in this sport for a long time, during the competition, they overcame difficulties and successfully reached the finish line. They felt proud, and the people around them were also deeply moved.\nC. Because they have been involved in this sport for a long time, and they were the first to reach the finish line in the competition, they felt proud, and the people around them were also deeply moved.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2666,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "889-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2667,
"target": "D",
"doc": {
"video_id": "890",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=bIQf1HwEqHU",
"videoID": "bIQf1HwEqHU",
"question_id": "890-1",
"task_type": "Information Synopsis",
"question": "Which of the following is NOT included in what the woman in the video says during the drive home from track day practice?",
"options": [
"A. Feelings after completing the workout.",
"B. Appreciation for the coach and his team.",
"C. The special significance of this training.",
"D. Discussion of the next training plan."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is NOT included in what the woman in the video says during the drive home from track day practice?\nOption:\nA. Feelings after completing the workout.\nB. Appreciation for the coach and his team.\nC. The special significance of this training.\nD. Discussion of the next training plan.\nAnswer with the option's letter from the given choices directly.",
2667,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "890-1",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2668,
"target": "B",
"doc": {
"video_id": "890",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=bIQf1HwEqHU",
"videoID": "bIQf1HwEqHU",
"question_id": "890-2",
"task_type": "Action Reasoning",
"question": "The man and woman who had dinner with the female lead before the marathon officially started in the video, did they participate in the marathon, and what were their results?",
"options": [
"A. No, the man and woman were spectators who accompanied the female lead to the marathon; they did not participate.",
"B. Yes, the man is the female lead's father, and he completed the full marathon. The woman is the female lead's mother, and she ran 38 kilometers.",
"C. Not mentioned in the video.",
"D. Yes, the man, who is the female lead's father, completed a half marathon, and the woman, who is the female lead's mother, ran 38 kilometers."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: The man and woman who had dinner with the female lead before the marathon officially started in the video, did they participate in the marathon, and what were their results?\nOption:\nA. No, the man and woman were spectators who accompanied the female lead to the marathon; they did not participate.\nB. Yes, the man is the female lead's father, and he completed the full marathon. The woman is the female lead's mother, and she ran 38 kilometers.\nC. Not mentioned in the video.\nD. Yes, the man, who is the female lead's father, completed a half marathon, and the woman, who is the female lead's mother, ran 38 kilometers.\nAnswer with the option's letter from the given choices directly.",
2668,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "890-2",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Action Reasoning",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2669,
"target": "C",
"doc": {
"video_id": "890",
"duration": "long",
"domain": "Life Record",
"sub_category": "Exercise",
"url": "https://www.youtube.com/watch?v=bIQf1HwEqHU",
"videoID": "bIQf1HwEqHU",
"question_id": "890-3",
"task_type": "Temporal Reasoning",
"question": "Which of the following options describes the video's heroine's description of the giveaway in the correct order of presentation?",
"options": [
"A. Gatorade merch, adidas runners shoes, garmin watch, shokz run belt, rudy project yonder and casey shades, miam book.",
"B. Gatorade merch, adidas runners shoes, miam book, garmin watch, shokz run belt, rudy project yonder and casey shades.",
"C. Gatorade merch, miam book, garmin watch, shokz run belt, rudy project yonder and casey shades, adidas runners shoes.",
"D. Gatorade merch, miam book, garmin watch, shokz run belt, adidas runners shoes, rudy project yonder and casey shades."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following options describes the video's heroine's description of the giveaway in the correct order of presentation?\nOption:\nA. Gatorade merch, adidas runners shoes, garmin watch, shokz run belt, rudy project yonder and casey shades, miam book.\nB. Gatorade merch, adidas runners shoes, miam book, garmin watch, shokz run belt, rudy project yonder and casey shades.\nC. Gatorade merch, miam book, garmin watch, shokz run belt, rudy project yonder and casey shades, adidas runners shoes.\nD. Gatorade merch, miam book, garmin watch, shokz run belt, adidas runners shoes, rudy project yonder and casey shades.\nAnswer with the option's letter from the given choices directly.",
2669,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "890-3",
"duration": "long",
"category": "Life Record",
"sub_category": "Exercise",
"task_category": "Temporal Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2670,
"target": "B",
"doc": {
"video_id": "891",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=uuCVnqV4cNc",
"videoID": "uuCVnqV4cNc",
"question_id": "891-1",
"task_type": "Object Reasoning",
"question": "Based on the information provided by the video, what is unique about the Factory 56 production line compared to other automotive factories?",
"options": [
"A. It utilizes a traditional assembly line with human workers.",
"B. It can produce vehicles of different powertrains (combustion, hybrid, electric) on the same line.",
"C. It exclusively manufactures electric vehicles.",
"D. It solely manufactures Mercedes-Benz's top-of-the-line Maybach vehicles."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the information provided by the video, what is unique about the Factory 56 production line compared to other automotive factories?\nOption:\nA. It utilizes a traditional assembly line with human workers.\nB. It can produce vehicles of different powertrains (combustion, hybrid, electric) on the same line.\nC. It exclusively manufactures electric vehicles.\nD. It solely manufactures Mercedes-Benz's top-of-the-line Maybach vehicles.\nAnswer with the option's letter from the given choices directly.",
2670,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "891-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2671,
"target": "B",
"doc": {
"video_id": "891",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=uuCVnqV4cNc",
"videoID": "uuCVnqV4cNc",
"question_id": "891-2",
"task_type": "Object Reasoning",
"question": "In line with the video evidence, what is the primary challenge for developers of autonomous driving technology, as highlighted by the testing conducted in China?",
"options": [
"A. Adapting the car's software and sensors to recognize diverse road markings and signage.",
"B. Accounting for the unpredictable traffic flow and behavior of other road users.",
"C. Developing high-resolution cameras and LiDAR systems for accurate environment perception.",
"D. Building an extensive 5G network infrastructure for reliable communication between vehicles."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, what is the primary challenge for developers of autonomous driving technology, as highlighted by the testing conducted in China?\nOption:\nA. Adapting the car's software and sensors to recognize diverse road markings and signage.\nB. Accounting for the unpredictable traffic flow and behavior of other road users.\nC. Developing high-resolution cameras and LiDAR systems for accurate environment perception.\nD. Building an extensive 5G network infrastructure for reliable communication between vehicles.\nAnswer with the option's letter from the given choices directly.",
2671,
"videomme",
"test"
]
],
"resps": [
[
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "891-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "B"
}
},
{
"doc_id": 2672,
"target": "A",
"doc": {
"video_id": "891",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=uuCVnqV4cNc",
"videoID": "uuCVnqV4cNc",
"question_id": "891-3",
"task_type": "Object Reasoning",
"question": "How does the Factory 56's \"TecLine\" concept future-proof the production process for potential new car models?",
"options": [
"A. By utilizing modular workstations that can be easily reconfigured for different car sizes and shapes.",
"B. By employing a universal robotic workforce capable of performing any task on any vehicle.",
"C. By implementing a cloud-based manufacturing system that can be remotely updated with new production protocols.",
"D. By focusing on standardized parts and components across all car models to minimize variations."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How does the Factory 56's \"TecLine\" concept future-proof the production process for potential new car models?\nOption:\nA. By utilizing modular workstations that can be easily reconfigured for different car sizes and shapes.\nB. By employing a universal robotic workforce capable of performing any task on any vehicle.\nC. By implementing a cloud-based manufacturing system that can be remotely updated with new production protocols.\nD. By focusing on standardized parts and components across all car models to minimize variations.\nAnswer with the option's letter from the given choices directly.",
2672,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "891-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2673,
"target": "B",
"doc": {
"video_id": "892",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=8UQFaVCV8S8",
"videoID": "8UQFaVCV8S8",
"question_id": "892-1",
"task_type": "OCR Problems",
"question": "After #41 completed his free throws, how many points was the team the author was rooting for ahead or behind the other team?",
"options": [
"A. 6 points ahead of the other team.",
"B. Trailing the other team by 12 points.",
"C. 12 points ahead of the other team.",
"D. Trailing the other team by 13 points."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: After #41 completed his free throws, how many points was the team the author was rooting for ahead or behind the other team?\nOption:\nA. 6 points ahead of the other team.\nB. Trailing the other team by 12 points.\nC. 12 points ahead of the other team.\nD. Trailing the other team by 13 points.\nAnswer with the option's letter from the given choices directly.",
2673,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "892-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "OCR Problems",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2674,
"target": "A",
"doc": {
"video_id": "892",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=8UQFaVCV8S8",
"videoID": "8UQFaVCV8S8",
"question_id": "892-2",
"task_type": "OCR Problems",
"question": "As depicted in the video, what was the halftime score between the two teams?",
"options": [
"A. 32 - 23.",
"B. 27 - 16.",
"C. 18 - 6.",
"D. 37 - 27."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, what was the halftime score between the two teams?\nOption:\nA. 32 - 23.\nB. 27 - 16.\nC. 18 - 6.\nD. 37 - 27.\nAnswer with the option's letter from the given choices directly.",
2674,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "892-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "OCR Problems",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2675,
"target": "C",
"doc": {
"video_id": "892",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=8UQFaVCV8S8",
"videoID": "8UQFaVCV8S8",
"question_id": "892-3",
"task_type": "Object Reasoning",
"question": "What is the most likely role of the person shooting the video?",
"options": [
"A. The team's coach.",
"B. Student volunteers.",
"C. Players' families.",
"D. Replacements."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the most likely role of the person shooting the video?\nOption:\nA. The team's coach.\nB. Student volunteers.\nC. Players' families.\nD. Replacements.\nAnswer with the option's letter from the given choices directly.",
2675,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "892-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2676,
"target": "D",
"doc": {
"video_id": "893",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=WQn-c_4dVWs",
"videoID": "WQn-c_4dVWs",
"question_id": "893-1",
"task_type": "Information Synopsis",
"question": "What title best summarizes this video?",
"options": [
"A. French frozen food production process.",
"B. French frozen food chefs.",
"C. French frozen food advantages and disadvantages.",
"D. The daily diet of French people and why French people like to eat frozen food."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What title best summarizes this video?\nOption:\nA. French frozen food production process.\nB. French frozen food chefs.\nC. French frozen food advantages and disadvantages.\nD. The daily diet of French people and why French people like to eat frozen food.\nAnswer with the option's letter from the given choices directly.",
2676,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "893-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "D"
}
},
{
"doc_id": 2677,
"target": "D",
"doc": {
"video_id": "893",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=WQn-c_4dVWs",
"videoID": "WQn-c_4dVWs",
"question_id": "893-2",
"task_type": "Temporal Reasoning",
"question": "What is the correct order in which the following events are told in the video?\n(a) Desserts in restaurants.\n(b) Pinguin frozen vegetables.\n(c) Lab test for vitamin C.",
"options": [
"A. (a)(b)(c).",
"B. (b)(c)(a).",
"C. (a)(c)(b).",
"D. (c)(b)(a)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the correct order in which the following events are told in the video?\n(a) Desserts in restaurants.\n(b) Pinguin frozen vegetables.\n(c) Lab test for vitamin C.\nOption:\nA. (a)(b)(c).\nB. (b)(c)(a).\nC. (a)(c)(b).\nD. (c)(b)(a).\nAnswer with the option's letter from the given choices directly.",
2677,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "893-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Temporal Reasoning",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2678,
"target": "A",
"doc": {
"video_id": "893",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=WQn-c_4dVWs",
"videoID": "WQn-c_4dVWs",
"question_id": "893-3",
"task_type": "Object Reasoning",
"question": "Which character in the middle of the video most closely resembles the man cutting the scallops?",
"options": [
"A. The man in dark clothes and glasses at the end of the video in a restaurant.",
"B. The man with the red helmet at the beginning of the video.",
"C. The man in the center of the video wearing a blue protective suit with a red hat.",
"D. Man dressed in green, driving to the factory."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which character in the middle of the video most closely resembles the man cutting the scallops?\nOption:\nA. The man in dark clothes and glasses at the end of the video in a restaurant.\nB. The man with the red helmet at the beginning of the video.\nC. The man in the center of the video wearing a blue protective suit with a red hat.\nD. Man dressed in green, driving to the factory.\nAnswer with the option's letter from the given choices directly.",
2678,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "893-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2679,
"target": "C",
"doc": {
"video_id": "894",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=tslKtm6Le1s",
"videoID": "tslKtm6Le1s",
"question_id": "894-1",
"task_type": "Information Synopsis",
"question": "What is the plot of the opera in the video?",
"options": [
"A. A tale of young lovers from rival families ends with their tragic double suicide.",
"B. A love story between a courtesan and a young man ends tragically when she sacrifices herself for his family's sake.",
"C. A tragic romance where a ghost seeks to reunite with her lover.",
"D. A love story between a courtesan and a young man ends tragically when she sacrifices herself for his family's sake."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the plot of the opera in the video?\nOption:\nA. A tale of young lovers from rival families ends with their tragic double suicide.\nB. A love story between a courtesan and a young man ends tragically when she sacrifices herself for his family's sake.\nC. A tragic romance where a ghost seeks to reunite with her lover.\nD. A love story between a courtesan and a young man ends tragically when she sacrifices herself for his family's sake.\nAnswer with the option's letter from the given choices directly.",
2679,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "894-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2680,
"target": "D",
"doc": {
"video_id": "894",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=tslKtm6Le1s",
"videoID": "tslKtm6Le1s",
"question_id": "894-2",
"task_type": "Object Reasoning",
"question": "What is the relationship between the two characters in the video?",
"options": [
"A. Strangers who met by chance.",
"B. Friends.",
"C. Family.",
"D. Lovers."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the relationship between the two characters in the video?\nOption:\nA. Strangers who met by chance.\nB. Friends.\nC. Family.\nD. Lovers.\nAnswer with the option's letter from the given choices directly.",
2680,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "894-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "C",
"answer": "D"
}
},
{
"doc_id": 2681,
"target": "A",
"doc": {
"video_id": "894",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=tslKtm6Le1s",
"videoID": "tslKtm6Le1s",
"question_id": "894-3",
"task_type": "Action Recognition",
"question": "As depicted in the video, how does the story end?",
"options": [
"A. Sanlang jumps onto the table and lies there dead.",
"B. Sanlang awakens Yan Poxi from the underworld.",
"C. Yan Poxi kills Sanlang.",
"D. Sanlang drives away Yan Poxi."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As depicted in the video, how does the story end?\nOption:\nA. Sanlang jumps onto the table and lies there dead.\nB. Sanlang awakens Yan Poxi from the underworld.\nC. Yan Poxi kills Sanlang.\nD. Sanlang drives away Yan Poxi.\nAnswer with the option's letter from the given choices directly.",
2681,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "894-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2682,
"target": "B",
"doc": {
"video_id": "895",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=1O1TfTrEnss",
"videoID": "1O1TfTrEnss",
"question_id": "895-1",
"task_type": "Object Recognition",
"question": "As can be seen in the video, which of the following is excluded from the power value on the red or blue small card?",
"options": [
"A. Value 3.",
"B. Value 4.",
"C. Value 5.",
"D. Value 1."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, which of the following is excluded from the power value on the red or blue small card?\nOption:\nA. Value 3.\nB. Value 4.\nC. Value 5.\nD. Value 1.\nAnswer with the option's letter from the given choices directly.",
2682,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "895-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "B"
}
},
{
"doc_id": 2683,
"target": "C",
"doc": {
"video_id": "895",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=1O1TfTrEnss",
"videoID": "1O1TfTrEnss",
"question_id": "895-2",
"task_type": "Object Reasoning",
"question": "What is special about the board on the lower right corner of the video that keeps track of the locations controlled by each of the two sides?",
"options": [
"A. The quantities of locations on both sides always stay identical.",
"B. The number of locations on each side must be greater than 6.",
"C. The sum of locations of each side is 12.",
"D. None of the above."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is special about the board on the lower right corner of the video that keeps track of the locations controlled by each of the two sides?\nOption:\nA. The quantities of locations on both sides always stay identical.\nB. The number of locations on each side must be greater than 6.\nC. The sum of locations of each side is 12.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2683,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "895-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "D",
"answer": "C"
}
},
{
"doc_id": 2684,
"target": "A",
"doc": {
"video_id": "895",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=1O1TfTrEnss",
"videoID": "1O1TfTrEnss",
"question_id": "895-3",
"task_type": "Object Recognition",
"question": "As can be seen in the video, which of the items is not placed on the table?",
"options": [
"A. A red card marked with 1937 and 1938.",
"B. Several blue dice.",
"C. A game map.",
"D. A green square card."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: As can be seen in the video, which of the items is not placed on the table?\nOption:\nA. A red card marked with 1937 and 1938.\nB. Several blue dice.\nC. A game map.\nD. A green square card.\nAnswer with the option's letter from the given choices directly.",
2684,
"videomme",
"test"
]
],
"resps": [
[
"D"
]
],
"filtered_resps": [
"D"
],
"videomme_percetion_score": {
"question_id": "895-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Recognition",
"pred_answer": "D",
"answer": "A"
}
},
{
"doc_id": 2685,
"target": "C",
"doc": {
"video_id": "896",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=xIWaK92gRlo",
"videoID": "xIWaK92gRlo",
"question_id": "896-1",
"task_type": "Object Reasoning",
"question": "What sets apart the third set of stickers from the other two?",
"options": [
"A. The third set of stickers are animal prints.",
"B. The third set of stickers is with botanical patterns.",
"C. The third set of stickers has no color.",
"D. The third set of stickers is Inexpensive."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What sets apart the third set of stickers from the other two?\nOption:\nA. The third set of stickers are animal prints.\nB. The third set of stickers is with botanical patterns.\nC. The third set of stickers has no color.\nD. The third set of stickers is Inexpensive.\nAnswer with the option's letter from the given choices directly.",
2685,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "896-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Object Reasoning",
"pred_answer": "B",
"answer": "C"
}
},
{
"doc_id": 2686,
"target": "D",
"doc": {
"video_id": "896",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=xIWaK92gRlo",
"videoID": "xIWaK92gRlo",
"question_id": "896-2",
"task_type": "Temporal Reasoning",
"question": "What is the proper sequence for the following items used in this video?\n(a) Stickers.\n(b) Watercolor pencils.\n(c) Gems.\n(d) Glue paper.",
"options": [
"A. (a)(c)(b)(d).",
"B. (d)(b)(a)(c).",
"C. (b)(a)(d)(c).",
"D. (d)(a)(c)(b)."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the proper sequence for the following items used in this video?\n(a) Stickers.\n(b) Watercolor pencils.\n(c) Gems.\n(d) Glue paper.\nOption:\nA. (a)(c)(b)(d).\nB. (d)(b)(a)(c).\nC. (b)(a)(d)(c).\nD. (d)(a)(c)(b).\nAnswer with the option's letter from the given choices directly.",
2686,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "896-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Temporal Reasoning",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2687,
"target": "B",
"doc": {
"video_id": "896",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=xIWaK92gRlo",
"videoID": "xIWaK92gRlo",
"question_id": "896-3",
"task_type": "Information Synopsis",
"question": "What is captured in this video?",
"options": [
"A. Crafting Tutorial.",
"B. Introduction to shopping for handmade products.",
"C. Sticker Share.",
"D. Individual handmade works are sold online."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is captured in this video?\nOption:\nA. Crafting Tutorial.\nB. Introduction to shopping for handmade products.\nC. Sticker Share.\nD. Individual handmade works are sold online.\nAnswer with the option's letter from the given choices directly.",
2687,
"videomme",
"test"
]
],
"resps": [
[
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "896-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2688,
"target": "B",
"doc": {
"video_id": "897",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nO2B4haj2BQ",
"videoID": "nO2B4haj2BQ",
"question_id": "897-1",
"task_type": "Action Reasoning",
"question": "According to the video, what are the tactics of team DK?",
"options": [
"A. Swiftly attack the enemy.",
"B. Safeguard the ADC.",
"C. Proactively engage in skirmishes.",
"D. Disperse the opposing forces in battle."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: According to the video, what are the tactics of team DK?\nOption:\nA. Swiftly attack the enemy.\nB. Safeguard the ADC.\nC. Proactively engage in skirmishes.\nD. Disperse the opposing forces in battle.\nAnswer with the option's letter from the given choices directly.",
2688,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "897-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2689,
"target": "A",
"doc": {
"video_id": "897",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nO2B4haj2BQ",
"videoID": "nO2B4haj2BQ",
"question_id": "897-2",
"task_type": "Action Reasoning",
"question": "Who gets advantage before 10:00 according to the video?",
"options": [
"A. Kingen of team DK.",
"B. Zeus of team T1.",
"C. Lucid of team DK.",
"D. Keria of team T1."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Who gets advantage before 10:00 according to the video?\nOption:\nA. Kingen of team DK.\nB. Zeus of team T1.\nC. Lucid of team DK.\nD. Keria of team T1.\nAnswer with the option's letter from the given choices directly.",
2689,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "897-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "A"
}
},
{
"doc_id": 2690,
"target": "C",
"doc": {
"video_id": "897",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=nO2B4haj2BQ",
"videoID": "nO2B4haj2BQ",
"question_id": "897-3",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, which is not true about the failure of team DK in the last battle?",
"options": [
"A. They are all killed.",
"B. They lose the dragon's soul.",
"C. They hit enemies a lot for revenge.",
"D. They are split into two groups."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, which is not true about the failure of team DK in the last battle?\nOption:\nA. They are all killed.\nB. They lose the dragon's soul.\nC. They hit enemies a lot for revenge.\nD. They are split into two groups.\nAnswer with the option's letter from the given choices directly.",
2690,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "897-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2691,
"target": "D",
"doc": {
"video_id": "898",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=K9MQATj3894",
"videoID": "K9MQATj3894",
"question_id": "898-1",
"task_type": "Attribute Perception",
"question": "Once the embroidery in the video is finished, which color of stitches is concealed beneath the pattern?",
"options": [
"A. White.",
"B. Green.",
"C. Yellow.",
"D. Blue."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Once the embroidery in the video is finished, which color of stitches is concealed beneath the pattern?\nOption:\nA. White.\nB. Green.\nC. Yellow.\nD. Blue.\nAnswer with the option's letter from the given choices directly.",
2691,
"videomme",
"test"
]
],
"resps": [
[
"D."
]
],
"filtered_resps": [
"D."
],
"videomme_percetion_score": {
"question_id": "898-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Attribute Perception",
"pred_answer": "D",
"answer": "D"
}
},
{
"doc_id": 2692,
"target": "D",
"doc": {
"video_id": "898",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=K9MQATj3894",
"videoID": "K9MQATj3894",
"question_id": "898-2",
"task_type": "Counting Problem",
"question": "How many embroidery techniques does the author teach in the video?",
"options": [
"A. 8.",
"B. 10.",
"C. 12.",
"D. 9."
],
"answer": "D"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many embroidery techniques does the author teach in the video?\nOption:\nA. 8.\nB. 10.\nC. 12.\nD. 9.\nAnswer with the option's letter from the given choices directly.",
2692,
"videomme",
"test"
]
],
"resps": [
[
"B.",
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "898-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "D"
}
},
{
"doc_id": 2693,
"target": "C",
"doc": {
"video_id": "898",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=K9MQATj3894",
"videoID": "K9MQATj3894",
"question_id": "898-3",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, which of the following is the key step in bullion stitch?",
"options": [
"A. Weaving needle and thread back and forth.",
"B. Threading the thread up from below.",
"C. Wrapping the thread around the needle.",
"D. Inserting the needle one stitch forward."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, which of the following is the key step in bullion stitch?\nOption:\nA. Weaving needle and thread back and forth.\nB. Threading the thread up from below.\nC. Wrapping the thread around the needle.\nD. Inserting the needle one stitch forward.\nAnswer with the option's letter from the given choices directly.",
2693,
"videomme",
"test"
]
],
"resps": [
[
"C",
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "898-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2694,
"target": "B",
"doc": {
"video_id": "899",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=f-1xE_5uXDg",
"videoID": "f-1xE_5uXDg",
"question_id": "899-1",
"task_type": "Action Reasoning",
"question": "Based on the video, why did Miranda mention many things related to Sicily?",
"options": [
"A. Because Sicily is Miranda's father's hometown.",
"B. Because Sicily is Miranda's birthplace.",
"C. Because Sicily is Miranda's mother's hometown.",
"D. Because Miranda likes the sea of Sicily."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Based on the video, why did Miranda mention many things related to Sicily?\nOption:\nA. Because Sicily is Miranda's father's hometown.\nB. Because Sicily is Miranda's birthplace.\nC. Because Sicily is Miranda's mother's hometown.\nD. Because Miranda likes the sea of Sicily.\nAnswer with the option's letter from the given choices directly.",
2694,
"videomme",
"test"
]
],
"resps": [
[
"C.",
"C."
]
],
"filtered_resps": [
"C."
],
"videomme_percetion_score": {
"question_id": "899-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "C",
"answer": "B"
}
},
{
"doc_id": 2695,
"target": "A",
"doc": {
"video_id": "899",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=f-1xE_5uXDg",
"videoID": "f-1xE_5uXDg",
"question_id": "899-2",
"task_type": "Action Recognition",
"question": "Which of the following is not exemplified by Miranda's experience of learning new languages in Seville?",
"options": [
"A. The process of learning a language requires judging others.",
"B. The process of learning a language should not be afraid of making mistakes.",
"C. The process of learning a language should not feel ashamed.",
"D. The process of learning a language requires daring to speak."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not exemplified by Miranda's experience of learning new languages in Seville?\nOption:\nA. The process of learning a language requires judging others.\nB. The process of learning a language should not be afraid of making mistakes.\nC. The process of learning a language should not feel ashamed.\nD. The process of learning a language requires daring to speak.\nAnswer with the option's letter from the given choices directly.",
2695,
"videomme",
"test"
]
],
"resps": [
[
"A",
"A"
]
],
"filtered_resps": [
"A"
],
"videomme_percetion_score": {
"question_id": "899-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Recognition",
"pred_answer": "A",
"answer": "A"
}
},
{
"doc_id": 2696,
"target": "A",
"doc": {
"video_id": "899",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=f-1xE_5uXDg",
"videoID": "f-1xE_5uXDg",
"question_id": "899-3",
"task_type": "Counting Problem",
"question": "How many questions did the man in the pink shirt ask Miranda in the video?",
"options": [
"A. 8.",
"B. 9.",
"C. 7.",
"D. 6."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: How many questions did the man in the pink shirt ask Miranda in the video?\nOption:\nA. 8.\nB. 9.\nC. 7.\nD. 6.\nAnswer with the option's letter from the given choices directly.",
2696,
"videomme",
"test"
]
],
"resps": [
[
"B."
]
],
"filtered_resps": [
"B."
],
"videomme_percetion_score": {
"question_id": "899-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Counting Problem",
"pred_answer": "B",
"answer": "A"
}
},
{
"doc_id": 2697,
"target": "B",
"doc": {
"video_id": "900",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=5KlS-p5eYH8",
"videoID": "5KlS-p5eYH8",
"question_id": "900-1",
"task_type": "Action Reasoning",
"question": "In line with the video evidence, why does the heroine in the video only do butt training once a week?",
"options": [
"A. Because the heroine's buttocks are already strong enough.",
"B. Because the heroine thinks it takes time for the buttock muscles to recover.",
"C. Because the heroine doesn't have time to do butt training.",
"D. Because the heroine wants to achieve the best result in the marathon."
],
"answer": "B"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: In line with the video evidence, why does the heroine in the video only do butt training once a week?\nOption:\nA. Because the heroine's buttocks are already strong enough.\nB. Because the heroine thinks it takes time for the buttock muscles to recover.\nC. Because the heroine doesn't have time to do butt training.\nD. Because the heroine wants to achieve the best result in the marathon.\nAnswer with the option's letter from the given choices directly.",
2697,
"videomme",
"test"
]
],
"resps": [
[
"B"
]
],
"filtered_resps": [
"B"
],
"videomme_percetion_score": {
"question_id": "900-1",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Action Reasoning",
"pred_answer": "B",
"answer": "B"
}
},
{
"doc_id": 2698,
"target": "C",
"doc": {
"video_id": "900",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=5KlS-p5eYH8",
"videoID": "5KlS-p5eYH8",
"question_id": "900-2",
"task_type": "Information Synopsis",
"question": "Which of the following is not discussed in the video when the heroine holds a cell phone in one hand and faces the mirror to record herself?",
"options": [
"A. Her motivation to exercise.",
"B. Changes in physical condition after exercise.",
"C. How she feels about today's workout.",
"D. Her exercise goals."
],
"answer": "C"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: Which of the following is not discussed in the video when the heroine holds a cell phone in one hand and faces the mirror to record herself?\nOption:\nA. Her motivation to exercise.\nB. Changes in physical condition after exercise.\nC. How she feels about today's workout.\nD. Her exercise goals.\nAnswer with the option's letter from the given choices directly.",
2698,
"videomme",
"test"
]
],
"resps": [
[
"C"
]
],
"filtered_resps": [
"C"
],
"videomme_percetion_score": {
"question_id": "900-2",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "C",
"answer": "C"
}
},
{
"doc_id": 2699,
"target": "A",
"doc": {
"video_id": "900",
"duration": "long",
"domain": "Multilingual",
"sub_category": "Multilingual",
"url": "https://www.youtube.com/watch?v=5KlS-p5eYH8",
"videoID": "5KlS-p5eYH8",
"question_id": "900-3",
"task_type": "Information Synopsis",
"question": "What is the primary focus of the video?",
"options": [
"A. How does the heroine do fitness exercises and gain muscle every day.",
"B. The heroine's daily eating and shopping routine.",
"C. The home life of the heroine and her boyfriend.",
"D. None of the above."
],
"answer": "A"
},
"arguments": [
[
"These are the frames of a video. Select the best answer to the following multiple-choice question based on the video. Respond with only the letter (A, B, C, or D) of the correct option.\nQuestion: What is the primary focus of the video?\nOption:\nA. How does the heroine do fitness exercises and gain muscle every day.\nB. The heroine's daily eating and shopping routine.\nC. The home life of the heroine and her boyfriend.\nD. None of the above.\nAnswer with the option's letter from the given choices directly.",
2699,
"videomme",
"test"
]
],
"resps": [
[
"A."
]
],
"filtered_resps": [
"A."
],
"videomme_percetion_score": {
"question_id": "900-3",
"duration": "long",
"category": "Multilingual",
"sub_category": "Multilingual",
"task_category": "Information Synopsis",
"pred_answer": "A",
"answer": "A"
}
}
],
"time": "0108_2331"
}
================================================
FILE: xtuner-eval_niah/README.md
================================================
## 📖 Single-Hop & Multi-Hop NIAH
This page provides specific evaluation methods for single-hop and multi-hop needle-in-a-haystack tasks.
---
### Preparation
**Environment**
- Please use an environment with **transformers >= 4.45.1 (required for YaRN)**. See [niah_requirements.txt](niah_requirements.txt).
- To avoid conflicts, **please don't install XTuner in your environment**; all the files needed have been copied into [xtuner](xtuner).
**Model Parameters**
- Please place the pretrained parameters of the model to be tested into `./xtuner/vision_niah/model_weights/`.
- Be sure to change `"rope_scaling": null` in `VideoChat-Flash-Qwen2-7B_res448/config.json` to:
```
"rope_scaling": {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768
}
```
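
The edit above can also be scripted rather than made by hand. The following is a minimal sketch, not part of the repository: the helper name `enable_yarn` is hypothetical, and it assumes `config.json` is plain JSON that can be rewritten in place. The `rope_scaling` values are the ones given above.

```python
import json
from pathlib import Path

# Values copied from the README snippet above.
YARN_ROPE_SCALING = {
    "type": "yarn",
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
}


def enable_yarn(config_path):
    """Hypothetical helper: set `rope_scaling` in a model config.json.

    Reads the JSON file, overwrites only the `rope_scaling` key, writes the
    file back, and returns the updated config dict.
    """
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8"))
    config["rope_scaling"] = dict(YARN_ROPE_SCALING)
    path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

Assuming the weights were placed as described above, a call might look like `enable_yarn("./xtuner/vision_niah/model_weights/VideoChat-Flash-Qwen2-7B_res448/config.json")`; adjust the path to wherever the checkpoint actually lives.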
**Eval Datasets**
- Please download the image and video data for testing from the [Hugging Face link](https://huggingface.co/datasets/OpenGVLab/NIAH-Video).
- The NIAH annotations are in `vision_niah/data` and `vision_niah/data_multi`.
**Haystack Video**
- Please place the haystack video used for testing into `./xtuner/vision_niah/data/haystack_videos`. The `video_haystack.mkv` file used in our testing is available via [this link](https://huggingface.co/datasets/OpenGVLab/NIAH-Video).
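
Before launching an evaluation, it may help to confirm that the directories named in this section exist. The sketch below is an illustration only: `missing_paths` and `REQUIRED_DIRS` are hypothetical names, and the relative paths are taken verbatim from the instructions above, interpreted relative to a workspace root you pass in.

```python
from pathlib import Path

# Directories mentioned in the Preparation section, relative to the workspace root.
REQUIRED_DIRS = (
    "xtuner/vision_niah/model_weights",
    "xtuner/vision_niah/data/haystack_videos",
)


def missing_paths(root, required=REQUIRED_DIRS):
    """Return the entries of `required` that are not directories under `root`."""
    root = Path(root)
    return [rel for rel in required if not (root / rel).is_dir()]
```

An empty list from `missing_paths(".")` means both directories were found; any returned entries still need to be created and populated.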
---
### Eval
- When running the needle-in-a-haystack test, please set the `xtuner-eval_niah` folder as the root directory of the workspace.
**single-hop niah**
- Please run the following command.
```
bash vision_niah/flash_eval_xtuner_single.sh
```
**multi-hop niah**
- Please run the following command.
```
bash vision_niah/flash_eval_xtuner_multi.sh
```
================================================
FILE: xtuner-eval_niah/llava/__init__.py
================================================
from .model import LlavaQwenForCausalLM
from .train.train import LazySupervisedDataset, DataCollatorForSupervisedDataset
================================================
FILE: xtuner-eval_niah/llava/constants.py
================================================
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
# Model Constants
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
================================================
FILE: xtuner-eval_niah/llava/conversation.py
================================================
import dataclasses
from enum import auto, Enum
from typing import List, Any, Dict, Union, Tuple
import re
import base64
from io import BytesIO
from PIL import Image
from transformers import AutoTokenizer
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
TWO = auto()
MPT = auto()
PLAIN = auto()
CHATML = auto()
LLAMA_2 = auto()
LLAMA_3 = auto()
QWEN = auto()
GEMMA = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
version: str = "Unknown"
tokenizer_id: str = ""
tokenizer: Any = None
# Stop criteria (the default one is EOS token)
stop_str: Union[str, List[str]] = None
# Stops generation if meeting any token in this list
stop_token_ids: List[int] = None
skip_next: bool = False
def get_prompt(self):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0]
if "mmtag" in self.version:
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
elif not init_msg.startswith("<image>"):
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, "<image>\n" + init_msg)
else:
messages[0] = (init_role, init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.CHATML:
ret = "" if self.system == "" else self.system + self.sep + "\n"
for role, message in messages:
if message:
if type(message) is tuple:
message, images, _ = message
message = "<image>" * len(images) + message
ret += role + "\n" + message + self.sep + "\n"
else:
ret += role + "\n"
return ret
elif self.sep_style == SeparatorStyle.LLAMA_3:
chat_template_messages = [{"role": "system", "content": self.system}]
for role, message in messages:
if message:
if type(message) is tuple:
message, images = message
message = "<image>" * len(images) + message
chat_template_messages.append({"role": role, "content": message})
# print(chat_template_messages)
return self.tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=True)
# ret = "" if self.system == "" else self.system + self.sep + "\n"
# for role, message in messages:
# if message:
# if type(message) is tuple:
# message, images = message
# message = "" * len(images) + message
# ret += role + "\n" + message + self.sep + "\n"
# else:
# ret += role + "\n"
# return ret
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.GEMMA:
ret = ""
for i, (role, message) in enumerate(messages):
assert role == self.roles[i % 2], "Conversation should alternate user/assistant/user/assistant/..."
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.LLAMA_2:
wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg
wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
ret = ""
for i, (role, message) in enumerate(messages):
if i == 0:
assert message, "first message should not be none"
assert role == self.roles[0], "first message should come from user"
if message:
if type(message) is tuple:
message, _, _ = message
if i == 0:
message = wrap_sys(self.system) + message
if i % 2 == 0:
message = wrap_inst(message)
ret += self.sep + message
else:
ret += " " + message + " " + self.sep2
else:
ret += ""
ret = ret.lstrip(self.sep)
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def process_image(self, image, image_process_mode, return_pil=False, image_format="PNG"):
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if type(image) is not Image.Image:
image = Image.open(image).convert("RGB")
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 672, 448
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
if return_pil:
return image
else:
buffered = BytesIO()
image.save(buffered, format=image_format)
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
return img_b64_str
def get_images(self, return_pil=False, return_path=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
for img in image:
if not return_path and self.is_image_file(img):
img = self.process_image(img, image_process_mode, return_pil=return_pil)
else:
images.append(img)
return images
def is_image_file(self, filename):
image_extensions = [".png", ".jpg", ".jpeg", ".gif", ".bmp", ".tiff", ".webp"]
return any(filename.lower().endswith(ext) for ext in image_extensions)
def is_video_file(self, filename):
video_extensions = [".mp4", ".mov", ".avi", ".mkv", ".wmv", ".flv", ".mpeg", ".mpg"]
return any(filename.lower().endswith(ext) for ext in video_extensions)
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
if len(image) == 1:
msg = "<image>\n" + msg.replace("<image>", "").strip()
else:
msg = re.sub(r"(<image>)\n(?=<image>)", r"\1 ", msg)
img_str_list = []
for img in image:
if self.is_image_file(img):
img_b64_str = self.process_image(img, "Default", return_pil=False, image_format="JPEG")
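img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}"/>'  # tag reconstructed; the original markup was lost in extraction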
img_str_list.append(img_str)
elif self.is_video_file(img):
ret.append(((img,), None))
msg = msg.strip()
img_place_holder = ""
for img_str in img_str_list:
img_place_holder += f"{img_str}\n\n"
if len(img_str_list) > 0:
msg = f"{img_place_holder}\n\n{msg}"
if len(msg) > 0:
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(system=self.system, roles=self.roles, messages=[[x, y] for x, y in self.messages], offset=self.offset, sep_style=self.sep_style, sep=self.sep, sep2=self.sep2, version=self.version)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
conv_vicuna_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[
["Human", "What are the key differences between renewable and non-renewable energy sources?"],
[
"Assistant",
"Renewable energy sources are those that can be replenished naturally in a relatively "
"short amount of time, such as solar, wind, hydro, geothermal, and biomass. "
"Non-renewable energy sources, on the other hand, are finite and will eventually be "
"depleted, such as coal, oil, and natural gas. Here are some key differences between "
"renewable and non-renewable energy sources:\n"
"1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable "
"energy sources are finite and will eventually run out.\n"
"2. Environmental impact: Renewable energy sources have a much lower environmental impact "
"than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, "
"and other negative effects.\n"
"3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically "
"have lower operational costs than non-renewable sources.\n"
"4. Reliability: Renewable energy sources are often more reliable and can be used in more remote "
"locations than non-renewable sources.\n"
"5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different "
"situations and needs, while non-renewable sources are more rigid and inflexible.\n"
"6. Sustainability: Renewable energy sources are more sustainable over the long term, while "
"non-renewable sources are not, and their depletion can lead to economic and social instability.\n",
],
],
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_vicuna_v1 = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="",
)
conv_llama_2 = Conversation(
system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="",
)
conv_llava_llama_2 = Conversation(
system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="",
)
# conv_llava_llama_3 = Conversation(
# system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
# roles=("user", "assistant"),
# version="llama_v3",
# messages=[],
# offset=0,
# sep="<|eot_id|>",
# sep_style=SeparatorStyle.LLAMA_3,
# tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
# tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
# stop_token_ids=[128009],
# )
conv_mistral_instruct = Conversation(
system="",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="",
)
conv_llava_llama_2_simple = Conversation(
system="Answer the questions about the visual content that the user provides.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="",
)
conv_llava_llama_2_mmtag = Conversation(
system="Answer the questions about the visual content that the user provides." "The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
version="llama_v2_mmtag",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="",
)
conv_mpt = Conversation(
system="""<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_qwen = Conversation(
system="""<|im_start|>system
You are a helpful assistant.""",
roles=("<|im_start|>user", "<|im_start|>assistant"),
version="qwen",
messages=[],
offset=0,
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
conv_internlm_2 = Conversation(
system="""<|im_start|>system
You are a helpful assistant.""",
roles=("<|im_start|>user", "<|im_start|>assistant"),
version="internlm_2",
messages=[],
offset=0,
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
conv_gemma_instruct = Conversation(system="", roles=("user\n", "model\n"), version="gemma", messages=[], offset=0, sep_style=SeparatorStyle.GEMMA, sep="\n")
conv_llava_plain = Conversation(
system="",
roles=("", ""),
messages=[],
offset=0,
sep_style=SeparatorStyle.PLAIN,
sep="\n",
)
conv_llava_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_llava_v0_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
version="v0_mmtag",
)
conv_llava_v1 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="",
)
conv_llava_v1_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="",
version="v1_mmtag",
)
conv_mistral_orca = Conversation(
system="""<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_mistral_zephyr = Conversation(
system="""<|system|>
You are a helpful AI assistant.""",
roles=("<|user|>\n", "<|assistant|>\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="",
)
conv_mistral_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_chatml_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
default_conversation = conv_vicuna_v0
conv_templates = {
"default": conv_vicuna_v0,
"v0": conv_vicuna_v0,
"v1": conv_vicuna_v1,
"vicuna_v1": conv_vicuna_v1,
"llama_2": conv_llama_2,
"mistral_instruct": conv_mistral_instruct,
"mistral_orca": conv_mistral_orca,
"mistral_zephyr": conv_mistral_zephyr,
"mistral_direct": conv_mistral_direct,
"plain": conv_llava_plain,
"v0_plain": conv_llava_plain,
"chatml_direct": conv_chatml_direct,
"llava_v0": conv_llava_v0,
"llava_v0_mmtag": conv_llava_v0_mmtag,
"llava_v1": conv_llava_v1,
"llava_v1_mmtag": conv_llava_v1_mmtag,
"llava_llama_2": conv_llava_llama_2,
# "llava_llama_3": conv_llava_llama_3,
"llava_llama_2_simple": conv_llava_llama_2_simple,
"llava_llama_2_mmtag": conv_llava_llama_2_mmtag,
"llava_mistral_instruct": conv_mistral_instruct,
"mpt": conv_mpt,
"qwen_1_5": conv_qwen,
"qwen_2": conv_qwen,
"internlm_2": conv_internlm_2,
"gemma_instruct": conv_gemma_instruct,
}
if __name__ == "__main__":
print(default_conversation.get_prompt())
print(default_conversation)
================================================
FILE: xtuner-eval_niah/llava/dist_utils.py
================================================
import json
import os
import builtins
import datetime
import time
import subprocess
import torch
import torch.distributed as dist
def get_rank() -> int:
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
return dist.get_rank()
def get_world_size() -> int:
if not dist.is_available():
return 1
if not dist.is_initialized():
return 1
return dist.get_world_size()
def setup_for_distributed(is_master):
builtin_print = builtins.print
def print(*args, **kwargs):
force = kwargs.pop("force", False)
# force = force or (get_world_size() > 8)
if is_master or force:
now = datetime.datetime.now().time()
builtin_print("[{}] ".format(now), end="") # print with time stamp
builtin_print(*args, **kwargs)
builtins.print = print
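# Illustration only (not part of the original file): a minimal, self-contained
# sketch of the rank-gated print used by setup_for_distributed above. Unlike the
# original, it returns a new function instead of patching builtins.print, so it
# can be tried safely in isolation. The names make_gated_print and capture are
# hypothetical helpers introduced for this example.

```python
import builtins
import datetime
import io
from contextlib import redirect_stdout


def make_gated_print(is_master):
    """Return a print-like function that stays silent on non-master ranks."""
    builtin_print = builtins.print

    def gated_print(*args, force=False, **kwargs):
        # Print only on the master rank, or when the caller forces output.
        if is_master or force:
            now = datetime.datetime.now().time()
            builtin_print("[{}] ".format(now), end="")  # time stamp prefix
            builtin_print(*args, **kwargs)

    return gated_print


def capture(print_fn, *args, **kwargs):
    """Collect whatever print_fn writes to stdout (test helper)."""
    buffer = io.StringIO()
    with redirect_stdout(buffer):
        print_fn(*args, **kwargs)
    return buffer.getvalue()


worker_print = make_gated_print(is_master=False)
silent = capture(worker_print, "hidden on worker ranks")
forced = capture(worker_print, "shown everywhere", force=True)
```

# Here `silent` is empty because the worker rank is not the master, while
# `forced` carries a time stamp followed by the message.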
def init_distributed_mode(use_dynamic_port: bool = True):
if "SLURM_PROCID" in os.environ:
rank = int(os.environ["SLURM_PROCID"])
local_rank = rank % torch.cuda.device_count()
world_size = int(os.environ["SLURM_NTASKS"])
try:
local_size = int(os.environ["SLURM_NTASKS_PER_NODE"])
except (KeyError, ValueError):
local_size = int(os.environ.get("LOCAL_SIZE", 1))
if "MASTER_PORT" not in os.environ:
port = 10023 # + random.randint(0, 20)
# if use_dynamic_port:
# for i in range(10042, 65535):
# cmd = f"netstat -aon|grep {i}"
# with os.popen(cmd, "r") as file:
# if file.read() == "":
# port = i
# break
print(f"MASTER_PORT = {port}")
os.environ["MASTER_PORT"] = str(port)
time.sleep(3)
node_list = os.environ["SLURM_STEP_NODELIST"]
addr = subprocess.getoutput(f"scontrol show hostname {node_list} | head -n1")
if "MASTER_ADDR" not in os.environ:
os.environ["MASTER_ADDR"] = addr
os.environ["RANK"] = str(rank)
os.environ["LOCAL_RANK"] = str(local_rank)
os.environ["LOCAL_WORLD_SIZE"] = str(local_size)
os.environ["WORLD_SIZE"] = str(world_size)
else:
rank = int(os.environ["RANK"])
setup_for_distributed(rank == 0)
print(
f"Rank {os.environ['RANK']} | Local Rank {os.environ['LOCAL_RANK']} | "
f"World Size {os.environ['WORLD_SIZE']} | Local World Size {os.environ['LOCAL_WORLD_SIZE']} |",
force=True
)
================================================
FILE: xtuner-eval_niah/llava/mm_utils.py
================================================
from PIL import Image
from io import BytesIO
import base64
import math
import ast
import re
import torch
from transformers import StoppingCriteria
from llava.constants import IMAGE_TOKEN_INDEX
def resize_and_center_crop(image, shortest_edge_length):
# Calculate new dimensions and resize
aspect_ratio = float(image.width) / float(image.height)
if aspect_ratio > 1:
new_width = int(shortest_edge_length * aspect_ratio)
new_height = shortest_edge_length
else:
new_width = shortest_edge_length
new_height = int(shortest_edge_length / aspect_ratio)
resized_image = image.resize((new_width, new_height), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
# Calculate the position and perform the center crop
left = (new_width - shortest_edge_length) / 2
top = (new_height - shortest_edge_length) / 2
right = (new_width + shortest_edge_length) / 2
bottom = (new_height + shortest_edge_length) / 2
cropped_image = resized_image.crop((left, top, right, bottom))
return cropped_image
def auto_pad_images(image, grid_params):
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert len(grid_params) > 0, "Grid parameters should not be empty"
# Step 1: Calculate and find the closest aspect ratio
input_width, input_height = image.size
input_aspect_ratio = input_width / input_height
candidate_resolutions = [(w / h, w, h) for w in grid_params for h in grid_params]
closest_aspect_ratio = min(candidate_resolutions, key=lambda x: abs(input_aspect_ratio - x[0]))
candidate_resolutions = [(x[1], x[2]) for x in candidate_resolutions if abs(x[0] - closest_aspect_ratio[0]) < 1e-3]
target_resolution = min(candidate_resolutions, key=lambda res: abs(max(input_width, input_height) / max(res) - 1))
resize_width, resize_height = target_resolution
if input_width > input_height:
resize_height = int(resize_width / input_aspect_ratio)
else:
resize_width = int(resize_height * input_aspect_ratio)
resized_image = image.resize((resize_width, resize_height), Image.LANCZOS)  # Image.ANTIALIAS was removed in Pillow 10; LANCZOS is the same filter
# Step 5: Pad the resized image if necessary to match the target resolution
pad_width = target_resolution[0] - resize_width
pad_height = target_resolution[1] - resize_height
padded_image = Image.new("RGB", target_resolution, color=(0, 0, 0))
padded_image.paste(resized_image, (pad_width // 2, pad_height // 2))
return padded_image
def extract_patches(image, patch_size, overlap_ratio):
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert patch_size > 0, "Patch size should be greater than 0"
assert 0 <= overlap_ratio < 1, "Overlap ratio should be between 0 and 1"
W, H = image.size
patches = []
stride = int(patch_size * (1 - overlap_ratio))
num_patches_y = (H - patch_size) // stride + 1
num_patches_x = (W - patch_size) // stride + 1
y_start = (H - (num_patches_y - 1) * stride - patch_size) // 2
x_start = (W - (num_patches_x - 1) * stride - patch_size) // 2
for y in range(y_start, y_start + num_patches_y * stride, stride):
for x in range(x_start, x_start + num_patches_x * stride, stride):
patch = image.crop((x, y, x + patch_size, y + patch_size))
patches.append(patch)
return patches
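# Illustration only (not part of the original file): a sketch of the stride and
# centering arithmetic in extract_patches above, without Pillow. It returns the
# crop boxes instead of cropped images. The name patch_boxes is hypothetical.

```python
def patch_boxes(width, height, patch_size, overlap_ratio=0.0):
    """Return (left, top, right, bottom) boxes for a centered patch grid."""
    stride = int(patch_size * (1 - overlap_ratio))
    num_patches_y = (height - patch_size) // stride + 1
    num_patches_x = (width - patch_size) // stride + 1
    # Leftover pixels are split evenly so the grid is centered in the image.
    y_start = (height - (num_patches_y - 1) * stride - patch_size) // 2
    x_start = (width - (num_patches_x - 1) * stride - patch_size) // 2
    boxes = []
    for y in range(y_start, y_start + num_patches_y * stride, stride):
        for x in range(x_start, x_start + num_patches_x * stride, stride):
            boxes.append((x, y, x + patch_size, y + patch_size))
    return boxes


# A 448x448 image splits into a 2x2 grid of 224-pixel patches.
boxes = patch_boxes(448, 448, patch_size=224)
# A 500x224 image fits two patches per row; the 52 spare pixels are split 26/26.
centered = patch_boxes(500, 224, patch_size=224)
```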
def process_highres_image_crop_split(image, data_args, processor=None):
crop_resolution = data_args.image_crop_resolution
split_resolution = data_args.image_split_resolution
if processor is None:
processor = data_args.image_processor
image_crop = resize_and_center_crop(image, crop_resolution)
image_patches = extract_patches(image_crop, patch_size=split_resolution, overlap_ratio=0)
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def process_highres_image(image, processor, grid_pinpoints):
grid_params = [int(x) for x in grid_pinpoints.split(",")]
width_height = max(image.size)
fit_grid_params = [x for x in grid_params if x >= width_height]
if len(fit_grid_params) == 0:
select_size = max(grid_params)
else:
select_size = min(fit_grid_params)
# FIXME: always select the 448
select_size = max(grid_params)
image_padded = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
# FIXME: this seems to be a bug that it always resizes instead of padding
image_original_resize = image.resize((processor.size["shortest_edge"], processor.size["shortest_edge"]))
image_padded = image_padded.resize((select_size, select_size))
image_patches = extract_patches(image_padded, patch_size=processor.size["shortest_edge"], overlap_ratio=0)
image_patches = [image_original_resize] + image_patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def select_best_resolution(original_size, possible_resolutions, max_resolutions, patch_size):
"""
Selects the best resolution from a list of possible resolutions based on the original size.
Args:
original_size (tuple): The original size of the image in the format (width, height).
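possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
max_resolutions (int or None): Optional pixel budget. A candidate is skipped when its area plus one patch_size * patch_size global view exceeds this budget.
patch_size (int): Side length of one patch. The single-patch resolution (patch_size * patch_size) is always allowed.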
Returns:
tuple: The best fit resolution in the format (width, height).
"""
original_width, original_height = original_size
best_fit = None
max_effective_resolution = 0
min_wasted_resolution = float("inf")
for width, height in possible_resolutions:
if max_resolutions is not None and (width * height != patch_size * patch_size):
if (width * height + patch_size * patch_size) > max_resolutions: # NOTE: the budget also has to cover one global view
continue
# Calculate the downscaled size to keep the aspect ratio
scale = min(width / original_width, height / original_height)
downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
# Calculate effective and wasted resolutions
effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
wasted_resolution = (width * height) - effective_resolution
if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
max_effective_resolution = effective_resolution
min_wasted_resolution = wasted_resolution
best_fit = (width, height)
# print(f"original_size={original_size}, possible_resolutions={possible_resolutions}, max_resolutions={max_resolutions}, best_fit={best_fit}")
assert best_fit is not None, f"Can't find suitable fit in {possible_resolutions} at max:{max_resolutions}"
return best_fit
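# Illustration only (not part of the original file): a worked example of the
# selection rule in select_best_resolution above, with the max_resolutions
# filter left out for brevity. The name pick_resolution and the sample sizes are
# assumptions made for this sketch.

```python
def pick_resolution(original_size, candidates):
    """Prefer the largest effective area, then the least wasted area."""
    original_width, original_height = original_size
    best_fit, best_effective, least_wasted = None, 0, float("inf")
    for width, height in candidates:
        # Downscale (or upscale) while keeping the aspect ratio.
        scale = min(width / original_width, height / original_height)
        scaled_w, scaled_h = int(original_width * scale), int(original_height * scale)
        # Upscaling never adds information, so cap at the original area.
        effective = min(scaled_w * scaled_h, original_width * original_height)
        wasted = width * height - effective
        if effective > best_effective or (effective == best_effective and wasted < least_wasted):
            best_fit, best_effective, least_wasted = (width, height), effective, wasted
    return best_fit


# A 2:1 landscape image matches the 896x448 canvas with no wasted area.
best = pick_resolution((1024, 512), [(448, 448), (896, 448), (448, 896)])
# On a tie in effective area, the candidate with less padding wins.
tie = pick_resolution((448, 448), [(896, 896), (448, 448)])
```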
def resize_and_pad_image(image, target_resolution):
"""
Resize and pad an image to a target resolution while maintaining aspect ratio.
Args:
image (PIL.Image.Image): The input image.
target_resolution (tuple): The target resolution (width, height) of the image.
Returns:
PIL.Image.Image: The resized and padded image.
"""
original_width, original_height = image.size
target_width, target_height = target_resolution
# Determine which dimension (width or height) to fill
scale_w = target_width / original_width
scale_h = target_height / original_height
if scale_w < scale_h:
# Width will be filled completely
new_width = target_width
new_height = min(math.ceil(original_height * scale_w), target_height)
else:
# Height will be filled completely
new_height = target_height
new_width = min(math.ceil(original_width * scale_h), target_width)
# Resize the image
resized_image = image.resize((new_width, new_height))
# Create a new image with the target size and paste the resized image onto it
new_image = Image.new("RGB", (target_width, target_height), (0, 0, 0))
paste_x = (target_width - new_width) // 2
paste_y = (target_height - new_height) // 2
new_image.paste(resized_image, (paste_x, paste_y))
return new_image
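# Illustration only (not part of the original file): the size and offset
# arithmetic of resize_and_pad_image above, without Pillow. The name
# fit_and_center is hypothetical; it returns the resized size and the paste
# offset instead of a padded image.

```python
import math


def fit_and_center(original_size, target_size):
    """Return ((new_width, new_height), (paste_x, paste_y)) for a letterbox fit."""
    original_width, original_height = original_size
    target_width, target_height = target_size
    scale_w = target_width / original_width
    scale_h = target_height / original_height
    if scale_w < scale_h:
        # Width is the limiting dimension and is filled completely.
        new_width = target_width
        new_height = min(math.ceil(original_height * scale_w), target_height)
    else:
        # Height is the limiting dimension and is filled completely.
        new_height = target_height
        new_width = min(math.ceil(original_width * scale_h), target_width)
    paste_x = (target_width - new_width) // 2
    paste_y = (target_height - new_height) // 2
    return (new_width, new_height), (paste_x, paste_y)


# A 1024x512 image on a 448x448 canvas becomes 448x224 with 112 rows of padding
# above and below.
layout = fit_and_center((1024, 512), (448, 448))
```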
def divide_to_patches(image, patch_size):
"""
Divides an image into patches of a specified size.
Args:
image (PIL.Image.Image): The input image.
patch_size (int): The size of each patch.
Returns:
list: A list of PIL.Image.Image objects representing the patches.
"""
patches = []
width, height = image.size
for i in range(0, height, patch_size):
for j in range(0, width, patch_size):
box = (j, i, j + patch_size, i + patch_size)
patch = image.crop(box)
patches.append(patch)
return patches
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size, max_resolutions=None):
"""
Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
Args:
image_size (tuple): The size of the input image in the format (width, height).
grid_pinpoints (str): A string representation of a list of possible resolutions.
patch_size (int): The size of each image patch.
Returns:
tuple: The shape of the image patch grid in the format (width, height).
"""
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
width, height = select_best_resolution(image_size, possible_resolutions, max_resolutions=max_resolutions, patch_size=patch_size)
# print("get width/patch size", width, patch_size, flush=True)
return width // patch_size, height // patch_size
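# Illustration only (not part of the original file): how the range form of
# grid_pinpoints is expanded in get_anyres_image_grid_shape above. The name
# expand_grid_range and the sample string "(1x1),...,(2x2)" are assumptions
# made for this sketch.

```python
import re


def expand_grid_range(grid_pinpoints, patch_size):
    """Expand a '(AxB),...,(CxD)' range string into pixel resolutions."""
    matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
    range_start = tuple(map(int, matches[0]))
    range_end = tuple(map(int, matches[-1]))
    # Every grid cell count between the first and last pair, inclusive.
    cells = [
        (i, j)
        for i in range(range_start[0], range_end[0] + 1)
        for j in range(range_start[1], range_end[1] + 1)
    ]
    # Convert cell counts into pixel sizes.
    return [[dim * patch_size for dim in pair] for pair in cells]


resolutions = expand_grid_range("(1x1),...,(2x2)", patch_size=224)
```

# Only the first and last pairs in the string matter; anything between them is
# ignored by the regex-based expansion.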
def process_anyres_image(image, processor, grid_pinpoints):
"""
Process an image with variable resolutions.
Args:
image (PIL.Image.Image): The input image to be processed.
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
Returns:
torch.Tensor: A tensor containing the processed image patches.
"""
raise NotImplementedError
# Convert grid_pinpoints from string to list
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(image.size, possible_resolutions)
image_padded = resize_and_pad_image(image, best_resolution)
patches = divide_to_patches(image_padded, processor.crop_size["height"])
# FIXME: this seems to be a bug that it resizes instead of pad.
# but to keep it consistent with previous, i will keep it as it is
# TODO: uncomment below to ablate with the padding
if isinstance(processor.size, dict):
shortest_edge = processor.size["shortest_edge"]
else:
shortest_edge = min(processor.size)
image_original_resize = image.resize((shortest_edge, shortest_edge))
# image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
# image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
image_patches = [image_original_resize] + patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
# print("image.size", image.size, "len(image_patches):", len(image_patches), "patch_size:", image_patches[0].shape)
return torch.stack(image_patches, dim=0)
def process_anyres_image_nopad(image, processor, grid_pinpoints):
"""
Process an image with variable resolutions.
Args:
image (PIL.Image.Image): The input image to be processed.
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
Returns:
torch.Tensor: A tensor containing the processed image patches.
"""
# Convert grid_pinpoints from string to list
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(image.size, possible_resolutions, max_resolutions=None, patch_size=patch_size) # images currently have no resolution limit
# image_padded = resize_and_pad_image(image, best_resolution)
patches = divide_to_patches(image.resize(best_resolution), patch_size)
# FIXME: this seems to be a bug that it resizes instead of pad.
# but to keep it consistent with previous, i will keep it as it is
# TODO: uncomment below to ablate with the padding
if isinstance(processor.size, dict):
shortest_edge = processor.size["shortest_edge"]
else:
shortest_edge = min(processor.size)
image_original_resize = image.resize((shortest_edge, shortest_edge))
# image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
# image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
image_patches = [image_original_resize] + patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
# raise ValueError(f"image.size: {image.size} len(image_patches): {len(image_patches)}, patch_size:, {image_patches[0].shape}, possible_resolutions:, {possible_resolutions}, best: {best_resolution}")
return torch.stack(image_patches, dim=0)
def process_anyres_video_nopad(video, processor, grid_pinpoints, max_resolutions):
"""
Process a video with variable resolutions.
Args:
video (numpy.ndarray): The input video frames with shape (T, H, W, C).
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
max_resolutions (int or None): Optional pixel budget passed to select_best_resolution.
Returns:
torch.Tensor: The preprocessed video frames at the selected resolution.
"""
# Convert grid_pinpoints from string to list
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(video[0].shape[0:2], possible_resolutions, max_resolutions=max_resolutions, patch_size=patch_size)
video = processor.preprocess(video, return_tensors="pt", target_size=best_resolution)["pixel_values"]
print("data: new_video.shape:", video.shape, "best_resolution:", best_resolution)
return video
def load_image_from_base64(image):
return Image.open(BytesIO(base64.b64decode(image)))
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
def process_images(images, image_processor, model_cfg):
image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None)
new_images = []
if image_aspect_ratio == "highres":
raise NotImplementedError
for image in images:
image = process_highres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio is not None and "anyres" in image_aspect_ratio:
for image in images:
if "nopad" in image_aspect_ratio:
image = process_anyres_image_nopad(image, image_processor, model_cfg.image_grid_pinpoints)
else:
image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio == "crop_split":
raise NotImplementedError
for image in images:
image = process_highres_image_crop_split(image, model_cfg, image_processor)
new_images.append(image)
elif image_aspect_ratio == "pad":
for image in images:
image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))
image = image_processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
new_images.append(image)
else:
return image_processor.preprocess(images, return_tensors="pt")["pixel_values"]
if all(x.shape == new_images[0].shape for x in new_images):
new_images = torch.stack(new_images, dim=0)
return new_images
def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == "pt":
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f"Unsupported tensor type: {return_tensors}")
return input_ids
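# Illustration only (not part of the original file): the interleaving step of
# tokenizer_image_token above, run on already-tokenized chunks so no tokenizer
# is needed. The name interleave_image_token, the token ids, and the values
# -200 (image placeholder) and 1 (BOS) are assumptions made for this sketch.

```python
def interleave_image_token(prompt_chunks, image_token_index, bos_token_id):
    """Join tokenized chunks with one image placeholder between each pair."""

    def insert_separator(chunks, sep):
        # [a, b] -> [a, sep, b]
        return [ele for pair in zip(chunks, [sep] * len(chunks)) for ele in pair][:-1]

    input_ids = []
    offset = 0
    if prompt_chunks and prompt_chunks[0] and prompt_chunks[0][0] == bos_token_id:
        # Keep a single leading BOS and strip it from every chunk below.
        offset = 1
        input_ids.append(prompt_chunks[0][0])
    # The separator is padded to offset + 1 entries so that slicing with
    # [offset:] leaves exactly one placeholder.
    for piece in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
        input_ids.extend(piece[offset:])
    return input_ids


with_bos = interleave_image_token([[1, 10, 11], [1, 12]], image_token_index=-200, bos_token_id=1)
without_bos = interleave_image_token([[10, 11], [12]], image_token_index=-200, bos_token_id=1)
```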
def get_model_name_from_path(model_path):
model_path = model_path.strip("/")
model_paths = model_path.split("/")
if model_paths[-1].startswith("checkpoint-"):
return model_paths[-2] + "_" + model_paths[-1]
else:
return model_paths[-1]
class KeywordsStoppingCriteria(StoppingCriteria):
def __init__(self, keywords, tokenizer, input_ids):
self.keywords = keywords
self.keyword_ids = []
for keyword in keywords:
cur_keyword_ids = tokenizer(keyword).input_ids
if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
cur_keyword_ids = cur_keyword_ids[1:]
self.keyword_ids.append(torch.tensor(cur_keyword_ids))
self.tokenizer = tokenizer
self.start_len = input_ids.shape[1]
def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO
offset = min(output_ids.shape[1] - self.start_len, 3)
self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
for keyword_id in self.keyword_ids:
if torch.equal(output_ids[0, -keyword_id.shape[0] :], keyword_id):
return True
outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
for keyword in self.keywords:
if keyword in outputs:
return True
return False
================================================
FILE: xtuner-eval_niah/llava/model/__init__.py
================================================
import os
AVAILABLE_MODELS = {
"llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
"llava_qwen_flash": "LlavaQwenForCausalLM_Flash, LlavaQwenConfig_Flash"
}
for model_name, model_classes in AVAILABLE_MODELS.items():
try:
exec(f"from .language_model.{model_name} import {model_classes}")
except Exception as e:
print(f"Failed to import {model_name} from llava.model.language_model.{model_name}. Error: {e}")
================================================
FILE: xtuner-eval_niah/llava/model/apply_delta.py
================================================
"""
Usage:
python3 -m llava.model.apply_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/vicuna-7b --delta-path lmsys/vicuna-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava import LlavaLlamaForCausalLM
def apply_delta(base_model_path, target_model_path, delta_path):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading delta")
delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
print("Applying delta")
for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
if name not in base.state_dict():
assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
continue
if param.data.shape == base.state_dict()[name].shape:
param.data += base.state_dict()[name]
else:
assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
bparam = base.state_dict()[name]
param.data[: bparam.shape[0], : bparam.shape[1]] += bparam
print("Saving target model")
delta.save_pretrained(target_model_path)
delta_tokenizer.save_pretrained(target_model_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
args = parser.parse_args()
apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
================================================
FILE: xtuner-eval_niah/llava/model/builder.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import warnings
import shutil
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from llava.model import *
from llava.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.utils import rank0_print
def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", attn_implementation="flash_attention_2", customized_config=None, overwrite_config=None, **kwargs):
kwargs["device_map"] = device_map
if load_8bit:
kwargs["load_in_8bit"] = True
elif load_4bit:
kwargs["load_in_4bit"] = True
kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
else:
kwargs["torch_dtype"] = torch.float16
if customized_config is not None:
kwargs["config"] = customized_config
if "multimodal" in kwargs:
if kwargs["multimodal"] is True:
is_multimodal = True
kwargs.pop("multimodal")
else:
is_multimodal = False
else:
is_multimodal = False
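is_multimodal = True  # this loader always treats the checkpoint as multimodal
assert is_multimodal, "Only multimodal checkpoints are supported by this loader."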
if "llava" in model_name.lower() or is_multimodal:
# Load LLaVA model
if "lora" in model_name.lower() and model_base is None:
raise NotImplementedError("LoRA checkpoints are not supported by this loader.")
warnings.warn(
"There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged."
)
if "lora" in model_name.lower() and model_base is not None:
raise NotImplementedError("LoRA checkpoints are not supported by this loader.")
lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
rank0_print("Loading LLaVA from base model...")
if "mixtral" in model_name.lower():
from llava.model.language_model.llava_mixtral import LlavaMixtralConfig
lora_cfg_pretrained = LlavaMixtralConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "mistral" in model_name.lower():
from llava.model.language_model.llava_mistral import LlavaMistralConfig
lora_cfg_pretrained = LlavaMistralConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
from llava.model.language_model.llava_gemma import LlavaGemmaConfig
lora_cfg_pretrained = LlavaGemmaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
else:
from llava.model.language_model.llava_llama import LlavaConfig
lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
if model.lm_head.weight.shape[0] != token_num:
model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
rank0_print("Loading additional LLaVA weights...")
if os.path.exists(os.path.join(model_path, "non_lora_trainables.bin")):
non_lora_trainables = torch.load(os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu")
else:
# this is probably from HF Hub
from huggingface_hub import hf_hub_download
def load_from_hf(repo_id, filename, subfolder=None):
cache_file = hf_hub_download(repo_id=repo_id, filename=filename, subfolder=subfolder)
return torch.load(cache_file, map_location="cpu")
non_lora_trainables = load_from_hf(model_path, "non_lora_trainables.bin")
non_lora_trainables = {(k[11:] if k.startswith("base_model.") else k): v for k, v in non_lora_trainables.items()}
if any(k.startswith("model.model.") for k in non_lora_trainables):
non_lora_trainables = {(k[6:] if k.startswith("model.") else k): v for k, v in non_lora_trainables.items()}
model.load_state_dict(non_lora_trainables, strict=False)
from peft import PeftModel
rank0_print("Loading LoRA weights...")
model = PeftModel.from_pretrained(model, model_path)
rank0_print("Merging LoRA weights...")
model = model.merge_and_unload()
rank0_print("Model is loaded...")
elif model_base is not None: # this may be mm projector only, loading projector with preset language model
rank0_print(f"Loading LLaVA from base model {model_base}...")
if "mixtral" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif (
"wizardlm-2" in model_name.lower()
and "vicuna" in model_name.lower()
or "llama" in model_name.lower()
or "yi" in model_name.lower()
or "nous-hermes" in model_name.lower()
or "llava-v1.6-34b" in model_name.lower()
or "llava-v1.5" in model_name.lower()
):
from llava.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_name.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
llava_cfg = LlavaConfig.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=llava_cfg, **kwargs)
else:
raise ValueError(f"Model {model_name} not supported")
mm_projector_weights = torch.load(os.path.join(model_path, "mm_projector.bin"), map_location="cpu")
mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
model.load_state_dict(mm_projector_weights, strict=False)
else:
rank0_print(f"Loading LLaVA model: {model_path}")
if "mixtral" in model_name.lower():
raise NotImplementedError("Mixtral-based LLaVA checkpoints are not supported.")
from llava.model.language_model.llava_mixtral import LlavaMixtralConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaMixtralConfig.from_pretrained(model_path)
else:
llava_cfg = customized_config
if overwrite_config is not None:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif (
"wizardlm-2" in model_name.lower()
and "vicuna" in model_name.lower()
or "llama" in model_name.lower()
# or "yi" in model_name.lower() # disabled: "yi" matches unrelated model names too easily
or "nous-hermes" in model_name.lower()
or "llava-v1.6-34b" in model_name.lower()
or "llava-v1.5" in model_name.lower()
):
raise NotImplementedError("LLaMA-family LLaVA checkpoints are not supported.")
from llava.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_name.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
if overwrite_config is not None:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
elif "qwen" in model_name.lower() or "quyen" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path)
if "moe" in model_name.lower() or "A14B" in model_name.lower():
from llava.model.language_model.llava_qwen_moe import LlavaQwenMoeConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenMoeConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "flash" in model_name.lower():
from llava.model.language_model.llava_qwen_flash import LlavaQwenConfig_Flash
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig_Flash.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
else:
from llava.model.language_model.llava_qwen import LlavaQwenConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "internlm2" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
from llava.model.language_model.llava_internlm2 import LlavaInternLM2Config
if overwrite_config is not None:
llava_cfg = LlavaInternLM2Config.from_pretrained(model_path, trust_remote_code=True)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaInternLM2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, trust_remote_code=True, **kwargs)
else:
model = LlavaInternLM2ForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
raise NotImplementedError("Gemma-based LLaVA checkpoints are not supported.")
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaGemmaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
else:
# Default to Qwen
try:
tokenizer = AutoTokenizer.from_pretrained(model_path)
if "moe" in model_name.lower() or "A14B" in model_name.lower():
from llava.model.language_model.llava_qwen_moe import LlavaQwenMoeConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenMoeConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenMoeForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "flash" in model_name.lower():
from llava.model.language_model.llava_qwen_flash import LlavaQwenConfig_Flash
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig_Flash.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM_Flash.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "fastv" in model_name.lower():
from llava.model.language_model.llava_qwen_fastv import LlavaQwenConfig_FastV
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig_FastV.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM_FastV.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM_FastV.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
else:
from llava.model.language_model.llava_qwen import LlavaQwenConfig
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
except Exception as e:
raise ValueError(f"Model {model_name} not supported") from e
# try:
# from llava.model.language_model.llava_llama import LlavaConfig
# tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
# if customized_config is None:
# llava_cfg = LlavaConfig.from_pretrained(model_path)
# if "v1.5" in model_path.lower():
# llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
# else:
# llava_cfg = customized_config
# if overwrite_config is not None:
# rank0_print(f"Overwriting config with {overwrite_config}")
# for k, v in overwrite_config.items():
# setattr(llava_cfg, k, v)
# model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
# except:
# raise ValueError(f"Model {model_name} not supported")
else:
raise NotImplementedError("Language-model-only loading is not supported.")
# Load language model
if model_base is not None:
# PEFT model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
print(f"Loading LoRA weights from {model_path}")
model = PeftModel.from_pretrained(model, model_path)
print("Merging weights")
model = model.merge_and_unload()
print("Convert to FP16...")
model.to(torch.float16)
else:
use_fast = False
if "mpt" in model_name.lower().replace("prompt", ""):
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
rank0_print(f"Model Class: {model.__class__.__name__}")
image_processor = None
if "llava" in model_name.lower() or is_multimodal:
mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
if mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
if mm_use_im_start_end:
tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
vision_tower.load_model(device_map=device_map)
if device_map != "auto":
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
if hasattr(model.config, "max_sequence_length"):
context_len = model.config.max_sequence_length
elif hasattr(model.config, "max_position_embeddings"):
context_len = model.config.max_position_embeddings
elif hasattr(model.config, "tokenizer_model_max_length"):
context_len = model.config.tokenizer_model_max_length
else:
context_len = 2048
return tokenizer, model, image_processor, context_len
================================================
FILE: xtuner-eval_niah/llava/model/consolidate.py
================================================
"""
Usage:
python3 -m llava.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate
"""
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava.model import *
from llava.model.utils import auto_upgrade
def consolidate_ckpt(src_path, dst_path):
print("Loading model")
auto_upgrade(src_path)
src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False)
src_model.save_pretrained(dst_path)
src_tokenizer.save_pretrained(dst_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--src", type=str, required=True)
parser.add_argument("--dst", type=str, required=True)
args = parser.parse_args()
consolidate_ckpt(args.src, args.dst)
================================================
FILE: xtuner-eval_niah/llava/model/language_model/llava_qwen.py
================================================
# Copyright 2024 Hao Zhang
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union, Dict
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
# from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
from transformers import Qwen2Config, Qwen2Model, Qwen2ForCausalLM
# from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel
# from .qwen.configuration_qwen import QWenConfig
class LlavaQwenConfig(Qwen2Config):
model_type = "llava_qwen"
class LlavaQwenModel(LlavaMetaModel, Qwen2Model):
config_class = LlavaQwenConfig
def __init__(self, config: Qwen2Config):
super(LlavaQwenModel, self).__init__(config)
class LlavaQwenForCausalLM(Qwen2ForCausalLM, LlavaMetaForCausalLM):
config_class = LlavaQwenConfig
def __init__(self, config):
# super(Qwen2ForCausalLM, self).__init__(config)
Qwen2ForCausalLM.__init__(self, config)
config.model_type = "llava_qwen"
# config.rope_scaling = None
self.model = LlavaQwenModel(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
modalities: Optional[List[str]] = ["image"],
dpo_forward: Optional[bool] = False,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
# print("images[0].shape:", images[0].shape)
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
# print("inputs_embeds.shape:", inputs_embeds.shape)
if dpo_forward:
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
return logits, labels
else:
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
modalities: Optional[List[str]] = ["image"],
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes)
else:
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_qwen", LlavaQwenConfig)
AutoModelForCausalLM.register(LlavaQwenConfig, LlavaQwenForCausalLM)
================================================
FILE: xtuner-eval_niah/llava/model/language_model/llava_qwen_flash.py
================================================
# Copyright 2024 Hao Zhang
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union, Dict
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
# from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
from transformers import Qwen2Config
# from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel
# from .qwen.configuration_qwen import QWenConfig
from .modeling_qwen2_flash import Qwen2Model_Flash, Qwen2ForCausalLM_Flash
class LlavaQwenConfig_Flash(Qwen2Config):
model_type = "llava_qwen_flash"
class LlavaQwenModel_Flash(LlavaMetaModel, Qwen2Model_Flash):
config_class = LlavaQwenConfig_Flash
def __init__(self, config: Qwen2Config):
super(LlavaQwenModel_Flash, self).__init__(config)
class LlavaQwenForCausalLM_Flash(Qwen2ForCausalLM_Flash, LlavaMetaForCausalLM):
config_class = LlavaQwenConfig_Flash
def __init__(self, config):
# super(Qwen2ForCausalLM, self).__init__(config)
Qwen2ForCausalLM_Flash.__init__(self, config)
config.model_type = "llava_qwen_flash"
# config.rope_scaling = None
self.model = LlavaQwenModel_Flash(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
modalities: Optional[List[str]] = ["image"],
dpo_forward: Optional[bool] = False,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
# print("inputs_embeds.shape:", inputs_embeds.shape)
if dpo_forward:
outputs, labels = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
labels=labels
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
return logits, labels
else:
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
modalities: Optional[List[str]] = ["image"],
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes)
else:
self.model.image_token_posi = [-1]
self.model.prompt_len = None
self.model.image_tokens = [0]
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_qwen_flash", LlavaQwenConfig_Flash)
AutoModelForCausalLM.register(LlavaQwenConfig_Flash, LlavaQwenForCausalLM_Flash)
================================================
FILE: xtuner-eval_niah/llava/model/language_model/modeling_qwen2_flash.py
================================================
# coding=utf-8
# NOTE: written against transformers==4.39.2 or 4.40.1
# Copyright 2024 The Qwen team, Alibaba Group and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch Qwen2 model."""
import inspect
import math
import warnings
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache
from transformers.modeling_attn_mask_utils import _prepare_4d_causal_attention_mask, _prepare_4d_causal_attention_mask_for_sdpa
from transformers.modeling_outputs import BaseModelOutputWithPast, CausalLMOutputWithPast, SequenceClassifierOutputWithPast
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10,
logging,
replace_return_docstrings,
)
from transformers.models.qwen2.configuration_qwen2 import Qwen2Config
from llava.constants import IGNORE_INDEX
if is_flash_attn_2_available():
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
_flash_supports_window_size = "window_size" in list(inspect.signature(flash_attn_func).parameters)
logger = logging.get_logger(__name__)
_CHECKPOINT_FOR_DOC = "Qwen/Qwen2-7B-beta"
_CONFIG_FOR_DOC = "Qwen2Config"
QWEN2_PRETRAINED_MODEL_ARCHIVE_LIST = [
"Qwen/Qwen2-7B-beta",
# See all Qwen2 models at https://huggingface.co/models?filter=qwen2
]
# Copied from transformers.models.llama.modeling_llama._get_unpad_data
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Qwen2
class Qwen2RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""
Qwen2RMSNorm is equivalent to T5LayerNorm
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Qwen2
class Qwen2RotaryEmbedding(nn.Module):
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
freqs = torch.outer(t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
def forward(self, x, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if seq_len > self.max_seq_len_cached:
self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=x.dtype)
return (
self.cos_cached[:seq_len].to(dtype=x.dtype),
self.sin_cached[:seq_len].to(dtype=x.dtype),
)
# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`):
The position indices of the tokens corresponding to the query and key tensors. For example, this can be
used to pass offset position ids when working with a KV-cache.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
# Copied from transformers.models.mistral.modeling_mistral.MistralMLP with Mistral->Qwen2
class Qwen2MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
return self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
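# --- Illustrative sketch, not part of the upstream module ---
# `repeat_kv` places the copies of each key/value head next to each other, so
# query head `h` ends up paired with key/value head `h // n_rep`. The helper
# below (a hypothetical name) reproduces that layout on a plain list of head
# labels instead of tensors.
def _repeat_kv_layout_sketch(kv_heads, n_rep):
    """Return the key/value head used by each query head after repetition."""
    return [head for head in kv_heads for _ in range(n_rep)]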
class Qwen2Attention(nn.Module):
"""
Multi-headed attention from 'Attention Is All You Need' paper. Modified to use sliding window attention: Longformer
and "Generating Long Sequences with Sparse Transformers".
"""
def __init__(self, config: Qwen2Config, layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f"Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will "
"lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
"when creating this class."
)
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.is_causal = True
self.attention_dropout = config.attention_dropout
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(
f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
f" and `num_heads`: {self.num_heads})."
)
self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=True)
self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=True)
self.o_proj = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=False)
self.rotary_emb = Qwen2RotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
)
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
"with a layer index."
)
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
raise ValueError(
f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
f" {attn_weights.size()}"
)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
)
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
f" {attn_output.size()}"
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
class Qwen2FlashAttention2(Qwen2Attention):
"""
Qwen2 flash attention module, following the Qwen2 attention module. This module inherits from `Qwen2Attention`
because the weights of the module stay untouched. The only required change is in the forward pass,
which must call the public API of flash attention correctly and handle padding tokens
if the input contains any. Additionally, sliding window attention (SWA) is applied only to the bottom
config.max_window_layers layers.
"""
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which became the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead."
)
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop("padding_mask")
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f"The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} "
"for auto-regressive decoding with k/v caching, please make sure to initialize the attention class "
"with a layer index."
)
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
# Because the input can be padded, the absolute sequence length depends on the max position id.
rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, "sliding_window", None) is not None
and kv_seq_len > self.config.sliding_window
and self.config.use_sliding_window
)
if not _flash_supports_window_size:
logger.warning_once(
"The current flash attention version does not support sliding window attention. For a more memory-efficient implementation,"
" make sure to upgrade the flash-attn library."
)
if past_key_value is not None:
# Activate cache slicing only if the config sets a `sliding_window` value
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (
getattr(self.config, "sliding_window", None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents
):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
f"past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got"
f" {past_key.shape}"
)
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat([attention_mask, torch.ones_like(attention_mask[:, -1:])], dim=-1)
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, the layer norms are usually cast to float32 for training stability,
# so the input hidden states get silently cast to float32 as well. Hence, we need to
# cast them back to the expected dtype (e.g. float16) so that everything works as expected.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
logger.warning_once(
f"The input hidden states seem to have been silently cast to float32; this might be related to"
f" the fact that you have upcast the embedding or layer norm layers to float32. The input will be cast back to"
f" {target_dtype}."
)
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the layout expected by Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate,
use_sliding_windows=use_sliding_windows,
)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
# Flash Attention does not materialize the attention matrix, so no weights are available to return.
attn_weights = None
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(
self,
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None,
use_sliding_windows=False,
):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention scores and pads the final attention scores.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
use_sliding_windows (`bool`, *optional*):
Whether to activate sliding window attention.
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in `__init__`.
causal = self.is_causal and query_length != 1
# Decide whether to use SWA or not by layer index.
if use_sliding_windows and self.layer_idx >= self.config.max_window_layers:
use_sliding_windows = False
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
query_states, key_states, value_states, attention_mask, query_length
)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
if not use_sliding_windows:
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
else:
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
window_size=(self.config.sliding_window, self.config.sliding_window),
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
else:
if not use_sliding_windows:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
)
else:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
window_size=(self.config.sliding_window, self.config.sliding_window),
)
return attn_output
# Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2._upad_input
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
# On the first iteration we need to properly re-create the padding mask
# by slicing it on the proper place
if kv_seq_len != attention_mask.shape[-1]:
attention_mask_num_tokens = attention_mask.shape[-1]
attention_mask = attention_mask[:, attention_mask_num_tokens - kv_seq_len :]
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim), indices_k
)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
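# --- Illustrative sketch, not part of the upstream module ---
# `_upad_input` relies on `_get_unpad_data` to turn a 0/1 padding mask into the
# metadata that the variable-length Flash Attention kernel consumes. Assuming
# that helper returns the flat indices of real tokens, the cumulative sequence
# lengths (with a leading 0), and the longest sequence length, the same
# bookkeeping on plain Python lists looks like this (hypothetical helper name):
def _unpad_metadata_sketch(mask):
    """`mask` is a list of rows of 0/1 ints, where 1 marks a real token."""
    flat = [value for row in mask for value in row]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [i for i, value in enumerate(flat) if value]
    seqlens = [sum(row) for row in mask]
    # cu_seqlens[b] .. cu_seqlens[b + 1] delimits sequence b in the packed tensor.
    cu_seqlens = [0]
    for length in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + length)
    return indices, cu_seqlens, max(seqlens)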
# Copied from transformers.models.mistral.modeling_mistral.MistralSdpaAttention with Mistral->Qwen2
class Qwen2SdpaAttention(Qwen2Attention):
"""
Qwen2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
`Qwen2Attention` because the weights of the module stay untouched. The only changes are in the forward pass, to adapt to
the SDPA API.
"""
# Adapted from Qwen2Attention.forward
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if output_attentions:
# TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
logger.warning_once(
"Qwen2Model is using Qwen2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
return super().forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
)
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
# Reference: https://github.com/pytorch/pytorch/issues/112577.
if query_states.device.type == "cuda" and attention_mask is not None:
query_states = query_states.contiguous()
key_states = key_states.contiguous()
value_states = value_states.contiguous()
attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states,
key_states,
value_states,
attn_mask=attention_mask,
dropout_p=self.attention_dropout if self.training else 0.0,
# The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
is_causal=self.is_causal and attention_mask is None and q_len > 1,
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
QWEN2_ATTENTION_CLASSES = {
"eager": Qwen2Attention,
"flash_attention_2": Qwen2FlashAttention2,
"sdpa": Qwen2SdpaAttention,
}
class Qwen2DecoderLayer(nn.Module):
def __init__(self, config: Qwen2Config, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
if config.use_sliding_window and config._attn_implementation != "flash_attention_2":
logger.warning_once(
f"Sliding Window Attention is enabled but not implemented for `{config._attn_implementation}`; "
"unexpected results may be encountered."
)
self.self_attn = QWEN2_ATTENTION_CLASSES[config._attn_implementation](config, layer_idx)
self.mlp = Qwen2MLP(config)
self.input_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
**kwargs,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. "
"Please make sure to use `attention_mask` instead."
)
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
`(batch, sequence_length)` where padding elements are indicated by 0.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights,)
if use_cache:
outputs += (present_key_value,)
return outputs
QWEN2_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`Qwen2Config`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
@add_start_docstrings(
"The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
QWEN2_START_DOCSTRING,
)
class Qwen2PreTrainedModel(PreTrainedModel):
config_class = Qwen2Config
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["Qwen2DecoderLayer"]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
QWEN2_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
Two formats are allowed:
- a [`~cache_utils.Cache`] instance;
- Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
legacy cache format will be returned.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
"The bare Qwen2 Model outputting raw hidden-states without any specific head on top.",
QWEN2_START_DOCSTRING,
)
class Qwen2Model_Flash(Qwen2PreTrainedModel):
"""
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`Qwen2DecoderLayer`]
Args:
config: Qwen2Config
"""
def __init__(self, config: Qwen2Config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.layers = nn.ModuleList(
[Qwen2DecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)]
)
self._attn_implementation = config._attn_implementation
self.norm = Qwen2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
@add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
labels: Optional[torch.Tensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# retrieve input_ids and inputs_embeds
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
batch_size, seq_length = input_ids.shape
elif inputs_embeds is not None:
batch_size, seq_length, _ = inputs_embeds.shape
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
if self.gradient_checkpointing and self.training:
if use_cache:
logger.warning_once(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
)
use_cache = False
past_key_values_length = 0
if use_cache:
use_legacy_cache = not isinstance(past_key_values, Cache)
if use_legacy_cache:
past_key_values = DynamicCache.from_legacy_cache(past_key_values)
past_key_values_length = past_key_values.get_usable_length(seq_length)
if position_ids is None:
device = input_ids.device if input_ids is not None else inputs_embeds.device
position_ids = torch.arange(
past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
)
position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
else:
position_ids = position_ids.view(-1, seq_length).long()
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
if attention_mask is not None and self._attn_implementation == "flash_attention_2" and use_cache:
is_padding_right = attention_mask[:, -1].sum().item() != batch_size
if is_padding_right:
raise ValueError(
"You are attempting to perform batched generation with padding_side='right'; "
"this may lead to unexpected behaviour for the Flash Attention version of Qwen2. Make sure to "
"call `tokenizer.padding_side = 'left'` before tokenizing the input."
)
if self._attn_implementation == "flash_attention_2":
# 2d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == "sdpa" and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
)
else:
# 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
sliding_window=self.config.sliding_window,
)
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None
for layer_idx, decoder_layer in enumerate(self.layers):
if output_hidden_states:
all_hidden_states += (hidden_states,)
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
attention_mask,
position_ids,
past_key_values,
output_attentions,
use_cache,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1],)
###### copy from pdrop #########
# rank & drop visual tokens after the layers listed in llm_compress_layer_list
# at inference time, tokens are dropped only in the prefill stage
rank_layer = layer_idx+1
if rank_layer in self.llm_compress_layer_list:
if hidden_states.shape[1] != 1: # prefill stage or training
stage = self.llm_compress_layer_list.index(rank_layer) # determine current stage
(
position_ids,
attention_mask,
hidden_states,
labels # update labels and return
) = self.flash_rank_drop(
cur_num = stage,
rank_layer = rank_layer,
features = hidden_states,
position_ids=position_ids,
attention_mask=attention_mask,
labels = labels
)
# process attention_mask again after updating
if self._attn_implementation == "flash_attention_2":
# 2d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == "sdpa" and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
attention_mask,
(batch_size, hidden_states.shape[1]),
hidden_states,
past_key_values_length,
)
else:
# 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, hidden_states.shape[1]),
hidden_states,
past_key_values_length,
sliding_window=self.config.sliding_window,
)
else:
# at inference time, shift position_ids in the decoding stage by the number of visual tokens dropped at this stage
stage = self.llm_compress_layer_list.index(rank_layer) # determine current stage
cur_visual_length = [int(cur_image_token * self.llm_image_token_ratio_list[stage]) for cur_image_token in self.num_image_token_lens]
next_visual_length = [int(cur_image_token * self.llm_image_token_ratio_list[stage + 1]) for cur_image_token in self.num_image_token_lens]
new_position_ids = []
for idx, cur_position_ids in enumerate(position_ids):
cur_position_ids = cur_position_ids - (cur_visual_length[idx] - next_visual_length[idx])
new_position_ids.append(cur_position_ids)
assert idx == 0, idx
position_ids = torch.tensor(new_position_ids, dtype=torch.long).unsqueeze(0)
# raise ValueError(f"{type(position_ids)}, haha, I'm going crazy")
#################
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states,)
next_cache = None
if use_cache:
next_cache = next_decoder_cache.to_legacy_cache() if use_legacy_cache else next_decoder_cache
if not return_dict:
return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None), labels
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
), labels
# flash_rank_drop: rank the visual tokens and keep only a subset after a compression layer
def flash_rank_drop(
self, cur_num, rank_layer, features,
position_ids, attention_mask, labels
):
if self.llm_compress_type == 'uniform0_attention':
if cur_num == 0:
llm_compress_type = 'uniform'
else:
llm_compress_type = 'attention'
elif self.llm_compress_type == 'uniform1_attention':
if cur_num <= 1:
llm_compress_type = 'uniform'
else:
llm_compress_type = 'attention'
else:
llm_compress_type = self.llm_compress_type
_labels = labels
_position_ids = position_ids
_attention_mask = attention_mask
if position_ids is None:
position_ids = torch.arange(0, features.shape[1], dtype=torch.long, device=features.device).unsqueeze(0)
if getattr(self.config, 'tokenizer_padding_side', 'right') == "right":
batch_size = features.shape[0]
image_tokens = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num]) for cur_image_token in self.num_image_token_lens]
keep_length = [int(cur_image_token * self.llm_image_token_ratio_list[cur_num + 1]) for cur_image_token in self.num_image_token_lens]
features_list = []
attention_mask_list = []
labels_list = []
if attention_mask is None:
attention_mask = torch.ones((batch_size,features.shape[1]), dtype=torch.bool, device=features.device)
else:
attention_mask = attention_mask.bool()
if labels is None:
labels = torch.full((batch_size,features.shape[1]), IGNORE_INDEX, device=features.device)
if 'attention' in llm_compress_type:
# obtain query_states and key_states to calculate attention map
hidden_states= features.clone().detach()
# print(f"hidden_states.shape: {hidden_states.shape}")
self_attn = self.layers[rank_layer].self_attn
hidden_states = self.layers[rank_layer].input_layernorm(hidden_states)
# print(f"new hidden_states.shape: {hidden_states.shape}")
num_heads = self_attn.num_heads
num_key_value_heads = self_attn.num_key_value_heads
head_dim = self_attn.head_dim
bsz, q_len, _ = hidden_states.size()
# print(self_attn.k_proj)
query_states = self_attn.q_proj(hidden_states)
key_states = self_attn.k_proj(hidden_states)
value_states = self_attn.v_proj(hidden_states)
# print("old key_states.shape:", key_states.shape)
query_states = query_states.view(bsz, q_len, num_heads, head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, num_key_value_heads, head_dim).transpose(1, 2)
# print("key_states.shape:", key_states.shape)
kv_seq_len = key_states.shape[-2]
cos, sin = self_attn.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
key_states = repeat_kv(key_states, self_attn.num_key_value_groups)
# attention_mask
eager_attention_mask = _prepare_4d_causal_attention_mask(
attention_mask, (batch_size, q_len), hidden_states, past_key_values_length=0
).to(device=query_states.device)
# take valid features
features = [cur_features[cur_attention_mask] for cur_features, cur_attention_mask in zip(features, attention_mask)]
labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
attention_mask = [cur_attention_mask[cur_attention_mask] for cur_attention_mask in attention_mask]
# rank & drop
for i in range(batch_size):
image_index = self.first_image_token_position[i]
if image_index == -1:
cur_input_embeds = features[i]
features_list.append(cur_input_embeds)
attention_mask_list.append(attention_mask[i])
labels_list.append(labels[i])
continue
if 'attention' in llm_compress_type:
# obtain current states
cur_key_states = key_states[i]
cur_query_states = query_states[i]
cur_eager_attention_mask = eager_attention_mask[i]
# choose last instruction token as query
if self.training:
answer_index = torch.where(labels[i] != -100)[0].tolist()
index_before_answer = []
for index in answer_index:
if index > 0 and labels[i][index-1] == -100: # index 0 has no preceding token; avoid wrapping around to labels[i][-1]
index_before_answer.append(index-1)
if index_before_answer == []:
# print("========index_before_answer is []")
cur_input_embeds = features[i]
features_list.append(cur_input_embeds)
attention_mask_list.append(attention_mask[i])
labels_list.append(labels[i])
continue
index_before_answer=torch.tensor(index_before_answer,device=labels[0].device)
text_query_states = cur_query_states[:,index_before_answer,:]
text_eager_attention_mask = cur_eager_attention_mask[:,index_before_answer,:]
else:
prompt_total_len = self.text_prompt_lens[i] + image_tokens[i]
text_query_states = cur_query_states[:,prompt_total_len-1,:].unsqueeze(1)
text_eager_attention_mask = cur_eager_attention_mask[:,prompt_total_len-1,:].unsqueeze(1)
# print(f"text_query_states.shape: {text_query_states.shape}")
# print(f"cur_key_states.shape: {cur_key_states.shape}")
# calculate attention map
attn_weights = torch.matmul(text_query_states, cur_key_states.transpose(1, 2)) / math.sqrt(head_dim) #(num_head, text_token,seq_len)
attn_weights = attn_weights + text_eager_attention_mask
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype) #(num_head, text_token,seq_len)
attention_avg_head = torch.mean(attn_weights, dim=0) # average across heads
attention_avg_head = attention_avg_head[:,image_index:image_index+image_tokens[i]] # select image token as keys
attention_avg_text = torch.mean(attention_avg_head, dim=0) # (576)
# print("attention_avg_text.shape:", attention_avg_text.shape)
if llm_compress_type == 'attention':
top_rank_index = attention_avg_text.topk(keep_length[i]).indices
else:
raise NotImplementedError(llm_compress_type)
elif llm_compress_type == 'uniform':
top_rank_index = torch.linspace(0, image_tokens[i]-1, keep_length[i], dtype=torch.long)
else:
raise NotImplementedError(llm_compress_type)
top_rank_index = top_rank_index + image_index
top_rank_index= top_rank_index.sort().values
start_index = image_index + image_tokens[i]
new_input_embeds = torch.cat([features[i][ :image_index, :] ,features[i][ top_rank_index, :], features[i][start_index:, :]], dim=0)
# print("origin labels:", len(labels))
# print(labels[i])
# print(top_rank_index)
new_labels = torch.cat([labels[i][ :image_index],labels[i][ top_rank_index], labels[i][start_index:]], dim=0)
new_attention_mask = torch.cat([attention_mask[i][:image_index], attention_mask[i][top_rank_index], attention_mask[i][start_index:]], dim=0)
features_list.append(new_input_embeds)
attention_mask_list.append(new_attention_mask)
labels_list.append(new_labels)
# Truncate sequences to max length as image embeddings can make the sequence longer
tokenizer_model_max_length = getattr(self.config, 'tokenizer_model_max_length', None)
# Slicing with None keeps the full sequence, so these lists are defined even when no max length is configured
new_input_embeds = [x[:tokenizer_model_max_length] for x in features_list]
new_attention_mask = [x[:tokenizer_model_max_length] for x in attention_mask_list]
new_labels = [x[:tokenizer_model_max_length] for x in labels_list]
max_len = max(x.shape[0] for x in new_input_embeds)
# padding the sequences to form batch
embeds_padded=[]
labels_paded=[]
attention_mask_padded=[]
position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
cur_len_emb=cur_new_embed.shape[0]
dif=max_len - cur_len_emb # padding to longest seq
cur_new_embed = torch.cat([cur_new_embed,torch.zeros((dif, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)],dim=0)
cur_new_labels = torch.cat([cur_new_labels,torch.full((dif,),IGNORE_INDEX,dtype=cur_new_labels.dtype, device=cur_new_labels.device)],dim=0)
cur_attention_mask = new_attention_mask[i]
cur_attention_mask = torch.cat([cur_attention_mask,torch.full((dif,),False, dtype=cur_attention_mask.dtype, device=cur_attention_mask.device)],dim=0)
embeds_padded.append(cur_new_embed)
labels_paded.append(cur_new_labels)
attention_mask_padded.append(cur_attention_mask)
cur_len = new_attention_mask[i].sum().item()
position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
new_input_embeds = torch.stack(embeds_padded,dim=0)
new_input_embeds = new_input_embeds.to(features[0].dtype)
new_attention_mask = torch.stack(attention_mask_padded,dim=0)
new_labels = torch.stack(labels_paded,dim=0)
if _position_ids is None:
position_ids = None
if _labels is None:
new_labels = None
if _attention_mask is None:
new_attention_mask = None
else:
new_attention_mask = new_attention_mask.to(dtype=_attention_mask.dtype)
return position_ids, new_attention_mask, new_input_embeds, new_labels
else:
raise ValueError(f"Unexpected tokenizer_padding_side: {self.config.tokenizer_padding_side}")
class Qwen2ForCausalLM_Flash(Qwen2PreTrainedModel):
_tied_weights_keys = ["lm_head.weight"]
def __init__(self, config):
super().__init__(config)
self.model = Qwen2Model_Flash(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(QWEN2_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, Qwen2ForCausalLM
>>> model = Qwen2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consists of (dec_features, layer_state, dec_hidden, dec_attn)
outputs, labels = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
labels=labels
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
):
# Omit tokens covered by past_key_values
if past_key_values is not None:
if isinstance(past_key_values, Cache):
cache_length = past_key_values.get_seq_length()
past_length = past_key_values.seen_tokens
max_cache_length = past_key_values.get_max_length()
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (
max_cache_length is not None
and attention_mask is not None
and cache_length + input_ids.shape[1] > max_cache_length
):
attention_mask = attention_mask[:, -max_cache_length:]
position_ids = kwargs.get("position_ids", None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (
tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
)
return reordered_past
================================================
FILE: xtuner-eval_niah/llava/model/llava_arch.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
import psutil
import math
import re
import time
import torch
import torch.nn as nn
import torch.nn.functional as F
from .multimodal_encoder.builder import build_vision_tower
from .multimodal_projector.builder import build_vision_projector
from llava.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from llava.mm_utils import get_anyres_image_grid_shape
from llava.utils import rank0_print
import random
class LlavaMetaModel:
def __init__(self, config):
super(LlavaMetaModel, self).__init__(config)
if hasattr(config, "mm_vision_tower"):
delay_load = getattr(config, "delay_load", False)
self.vision_tower = build_vision_tower(config, delay_load=delay_load)
self.mm_projector = build_vision_projector(config, vision_cfg=self.vision_tower.config)
if "unpad" in getattr(config, "mm_patch_merge_type", ""):
self.image_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))
if "nopad" in getattr(config, "mm_patch_merge_type", "") and getattr(self.config, "mm_newline_position", "nothing") != "nothing":
self.frame_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))
def get_vision_tower(self):
vision_tower = getattr(self, "vision_tower", None)
if type(vision_tower) is list:
vision_tower = vision_tower[0]
return vision_tower
def initialize_vision_modules(self, model_args, fsdp=None):
vision_tower = model_args.vision_tower
mm_vision_select_layer = model_args.mm_vision_select_layer
mm_vision_select_feature = model_args.mm_vision_select_feature
pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
mm_patch_merge_type = model_args.mm_patch_merge_type
self.config.mm_vision_tower = vision_tower
self.config.vision_tower_pretrained = getattr(model_args, "vision_tower_pretrained", "")
if self.get_vision_tower() is None:
vision_tower = build_vision_tower(model_args)
if fsdp is not None and len(fsdp) > 0:
self.vision_tower = [vision_tower]
else:
self.vision_tower = vision_tower
else:
if fsdp is not None and len(fsdp) > 0:
vision_tower = self.vision_tower[0]
else:
vision_tower = self.vision_tower
vision_tower.load_model()
self.config.use_mm_proj = True
self.config.mm_projector_type = getattr(model_args, "mm_projector_type", "linear")
self.config.mm_hidden_size = vision_tower.hidden_size
self.config.mm_vision_select_layer = mm_vision_select_layer
self.config.mm_vision_select_feature = mm_vision_select_feature
self.config.mm_patch_merge_type = mm_patch_merge_type
if getattr(self, "mm_projector", None) is None:
self.mm_projector = build_vision_projector(self.config, vision_cfg=vision_tower.config)
if "unpad" in mm_patch_merge_type:
embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
self.image_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
if "nopad" in getattr(self.config, "mm_patch_merge_type", "") and getattr(self.config, "mm_newline_position", "nothing") != "nothing":
embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
self.frame_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
else:
# In case it is frozen by LoRA
for p in self.mm_projector.parameters():
p.requires_grad = True
if pretrain_mm_mlp_adapter is not None:
mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
def get_w(weights, keyword):
return {k.split(keyword + ".")[1]: v for k, v in weights.items() if keyword in k}
if self.config.mm_projector_type =='lxh_qformer':
incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"), strict=False)
else:
incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
rank0_print(f"Loaded mm projector weights from {pretrain_mm_mlp_adapter}. Incompatible keys: {incompatible_keys}")
def unpad_image(tensor, original_size, is_frame=False):
"""
Unpads a PyTorch tensor of a padded and resized image.
Args:
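tensor (torch.Tensor): The image tensor, assumed to be in CxHxW format (or a 4-D tensor whose last two dimensions are height and width when is_frame is True).
original_size (tuple): The original size of the image (width, height).
is_frame (bool): If True, read height and width from dimensions 2 and 3 of a 4-D tensor.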
Returns:
torch.Tensor: The unpadded image tensor.
"""
original_width, original_height = original_size
if is_frame:
current_height, current_width = tensor.shape[2:]
else:
current_height, current_width = tensor.shape[1:]
# Compute aspect ratios
original_aspect_ratio = original_width / original_height
current_aspect_ratio = current_width / current_height
# Determine padding size and direction
if original_aspect_ratio > current_aspect_ratio:
# Padding was added to the height
scale_factor = current_width / original_width
new_height = int(original_height * scale_factor)
padding = (current_height - new_height) // 2
if is_frame:
unpadded_tensor = tensor[:, :, padding : current_height - padding, :]
else:
unpadded_tensor = tensor[:, padding : current_height - padding, :]
else:
# Padding was added to the width
scale_factor = current_height / original_height
new_width = int(original_width * scale_factor)
padding = (current_width - new_width) // 2
if is_frame:
unpadded_tensor = tensor[:, :, :, padding : current_width - padding]
else:
unpadded_tensor = tensor[:, :, padding : current_width - padding]
return unpadded_tensor
class LlavaMetaForCausalLM(ABC):
@abstractmethod
def get_model(self):
pass
def get_vision_tower(self):
return self.get_model().get_vision_tower()
def get_4dPool(self, image_feature):
num_frames, num_tokens, num_dim = image_feature.shape
height = width = int(math.sqrt(num_tokens))
assert num_tokens == height * width, image_feature.shape
image_feature = image_feature.view(num_frames, height, width, -1)
image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
# image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
if self.config.mm_spatial_pool_mode == "average":
raise NotImplementedError
image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "max":
raise NotImplementedError
image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "bilinear":
height, width = image_feature.shape[2:]
scaled_shape = [math.ceil(height / 4), math.ceil(width / 4)]
image_feature = nn.functional.interpolate(image_feature, size=scaled_shape, mode='bilinear')
else:
raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
image_feature = image_feature.permute(0, 2, 3, 1)
image_feature = image_feature.view(num_frames, -1, num_dim)
return image_feature
def get_2dPool(self, image_feature):
num_frames, num_tokens, num_dim = image_feature.shape
height = width = int(math.sqrt(num_tokens))
assert num_tokens == height * width, image_feature.shape
image_feature = image_feature.view(num_frames, height, width, -1)
image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
# image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
if self.config.mm_spatial_pool_mode == "average":
raise NotImplementedError
image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "max":
raise NotImplementedError
image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "bilinear":
height, width = image_feature.shape[2:]
scaled_shape = [math.ceil(height / 2), math.ceil(width / 2)]
image_feature = nn.functional.interpolate(image_feature, size=scaled_shape, mode='bilinear')
else:
raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
image_feature = image_feature.permute(0, 2, 3, 1)
image_feature = image_feature.view(num_frames, -1, num_dim)
return image_feature
def encode_image(self, images_list):
concat_images = torch.cat([image for image in images_list], dim=0)
split_sizes = [image.shape[0] for image in images_list]
image_features = self.get_model().get_vision_tower()(concat_images)
image_features = self.get_model().mm_projector(image_features)
image_features = torch.split(image_features, split_sizes)
return image_features
def encode_image_video(self, images_list, video_idx_in_batch):
concat_images = torch.cat([image for image in images_list], dim=0)
split_sizes = [image.shape[0] for image in images_list]
videos_or_images_features = self.get_model().get_vision_tower()(concat_images)
per_videos_or_images_features = torch.split(videos_or_images_features, split_sizes, dim=0) # tuple, (dim_1, 576, 4096)
all_videos_or_images_features = []
for idx, feat in enumerate(per_videos_or_images_features):
if idx in video_idx_in_batch:
feat = self.get_model().mm_projector(feat, compress=True, local_num_frames=getattr(self.config, "mm_local_num_frames", -1))
else:
feat = self.get_model().mm_projector(feat, compress=False)
all_videos_or_images_features.append(feat)
return all_videos_or_images_features
def encode_video(self, images_list, video_idx_in_batch):
bs = len(images_list)
concat_images = []
concat_videos = []
for idx, image in enumerate(images_list):
if idx in video_idx_in_batch:
concat_videos.append(image)
else:
concat_images.append(image)
# print(concat_videos[0].shape)
has_image = len(concat_images) > 0
has_video = len(concat_videos) > 0
mm_local_num_frames = getattr(self.config, "mm_local_num_frames", -1)
assert mm_local_num_frames != -1
if has_image:
image_split_sizes = [image.shape[0] for image in concat_images]
concat_images = torch.cat([image.unsqueeze(1) for image in concat_images], dim=0)
images_features = self.get_model().get_vision_tower()(concat_images) # B_i, N, D
images_features = self.get_model().mm_projector(images_features, compress=False, local_num_frames=1)
images_features = torch.split(images_features, image_split_sizes)
if has_video:
video_split_sizes = [video.shape[0] // mm_local_num_frames for video in concat_videos]
concat_videos = torch.cat([video.reshape(video.shape[0] // mm_local_num_frames, mm_local_num_frames, video.shape[1], video.shape[2], video.shape[3]) for video in concat_videos], dim=0) # B T C H W
videos_features = self.get_model().get_vision_tower()(concat_videos) # B_v, N, D
videos_features = self.get_model().mm_projector(videos_features, compress=True, local_num_frames=mm_local_num_frames)
videos_features = [v.reshape(-1, v.shape[-2] // mm_local_num_frames, v.shape[-1]) for v in torch.split(videos_features, video_split_sizes)]
all_videos_or_images_features = []
img_idx = 0
vid_idx = 0
for idx in range(bs):
if idx in video_idx_in_batch:
feat = videos_features[vid_idx]
vid_idx += 1
else:
feat = images_features[img_idx]
img_idx += 1
all_videos_or_images_features.append(feat)
if has_video:
assert vid_idx == len(videos_features), f"vid: {vid_idx} != {len(videos_features)}"
if has_image:
assert img_idx == len(images_features), f"img: {img_idx} != {len(images_features)}"
return all_videos_or_images_features
def encode_video_image(self, images_list, video_idx_in_batch):
bs = len(images_list)
concat_images = []
concat_videos = []
for idx, image in enumerate(images_list):
if idx in video_idx_in_batch:
concat_videos.append(image)
else:
concat_images.append(image)
# print(concat_videos[0].shape)
has_image = len(concat_images) > 0
has_video = len(concat_videos) > 0
mm_local_num_frames = getattr(self.config, "mm_local_num_frames", -1)
assert mm_local_num_frames != -1
if has_image:
image_split_sizes = [image.shape[0] for image in concat_images]
concat_images = torch.cat([image.unsqueeze(1) for image in concat_images], dim=0)
# print("input vit image.shape:", concat_images.shape)
images_features = self.get_model().get_vision_tower()(concat_images) # B_i, N, D
images_features = torch.split(images_features, image_split_sizes)
if has_video:
video_split_sizes = [video.shape[0] // mm_local_num_frames for video in concat_videos]
concat_videos = torch.cat([video.reshape(video.shape[0] // mm_local_num_frames, mm_local_num_frames, video.shape[1], video.shape[2], video.shape[3]) for video in concat_videos], dim=0)
# print("input vit video.shape:", concat_videos.shape)
videos_features = self.get_model().get_vision_tower()(concat_videos) # B_v, N, D
videos_features = [v.reshape(-1, v.shape[-2] // mm_local_num_frames, v.shape[-1]) for v in torch.split(videos_features, video_split_sizes)]
all_videos_or_images_features = []
img_idx = 0
vid_idx = 0
for idx in range(bs):
if idx in video_idx_in_batch:
feat = self.get_model().mm_projector(videos_features[vid_idx], compress=True, local_num_frames=getattr(self.config, "mm_local_num_frames", -1))
vid_idx += 1
else:
feat = self.get_model().mm_projector(images_features[img_idx], compress=False)
img_idx += 1
all_videos_or_images_features.append(feat)
if has_video:
assert vid_idx == len(videos_features), f"vid: {vid_idx} != {len(videos_features)}"
if has_image:
assert img_idx == len(images_features), f"img: {img_idx} != {len(images_features)}"
return all_videos_or_images_features
def add_token_per_frame(self, image_feature):
image_feature = image_feature.permute(2, 0, 1).contiguous()
if hasattr(self.model, "frame_newline"):
image_feature = torch.cat((image_feature, self.model.frame_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)), dim=-1)
else:
image_feature = torch.cat((image_feature, self.model.image_newline[:, None, None].expand(*image_feature.shape[:-1], 1).to(image_feature.device)), dim=-1)
image_feature = image_feature.permute(1, 2, 0).contiguous()
return image_feature
def add_different_token_per_frame(self, image_feature):
raise NotImplementedError("No")
def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities=["image"], image_sizes=None):
assert type(modalities) is list, modalities
vision_tower = self.get_vision_tower()
# rank_print(modalities)
if vision_tower is None or images is None or input_ids.shape[1] == 1:
return input_ids, position_ids, attention_mask, past_key_values, None, labels
if type(images) is list or images.ndim == 5:
if type(images) is list:
images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
video_idx_in_batch = []
for _ in range(len(modalities)):
if modalities[_] == "video":
video_idx_in_batch.append(_)
images_list = []
for image in images:
if image.ndim == 4:
images_list.append(image)
else:
images_list.append(image.unsqueeze(0))
vision_encode_type = getattr(self.config, "vision_encode_type", "image")
mm_patch_merge_type = getattr(self.config, "mm_patch_merge_type", "flat")
image_aspect_ratio = getattr(self.config, "image_aspect_ratio", "square")
frame_aspect_ratio = getattr(self.config, "frame_aspect_ratio", "square")
mm_newline_position = getattr(self.config, "mm_newline_position", "nothing")
if "anyres" in frame_aspect_ratio:
new_images_list = []
num_frames_list = []
for idx, image in enumerate(images_list):
if idx in video_idx_in_batch:
T, C, H, W = image.shape
num_frames_list.append(T)
# print("origin video.shape:", image.shape) # T C H W
patch_size = self.get_vision_tower().image_size
if H * W != patch_size * patch_size:
global_image = F.interpolate(
image.float(), size=(patch_size, patch_size), mode='bicubic', align_corners=False
).to(image.dtype).unsqueeze(0)
sub_image = image.reshape(
1, T, C, H//patch_size, patch_size, W//patch_size, patch_size
).permute(0, 3, 5, 1, 2, 4, 6).reshape(-1, T, C, patch_size, patch_size).contiguous()
new_image = torch.concat([global_image, sub_image], dim=0).flatten(0, 1)
else:
new_image = image
# print("new video shape:", new_image.shape)
new_images_list.append(new_image)
else:
num_frames_list.append(1)
new_images_list.append(image)
images_list = new_images_list
# rank0_print(self.config)
# TODO image: share vit&connector for image/video, image_video:, video
if vision_encode_type == "image": # image backbone, process video by frame
image_features = self.encode_image(images_list)
elif vision_encode_type == "video": # video backbone, process video with compress
image_features = self.encode_video(images_list, video_idx_in_batch=video_idx_in_batch)
elif vision_encode_type == "image_video": # image backbone, process video with compress
image_features = self.encode_image_video(images_list, video_idx_in_batch=video_idx_in_batch)
elif vision_encode_type == "image_video_new":
image_features = self.encode_image_video_new(images_list, video_idx_in_batch=video_idx_in_batch)
elif vision_encode_type == "video_image": # image backbone, process video with compress
image_features = self.encode_video_image(images_list, video_idx_in_batch=video_idx_in_batch)
else:
raise NotImplementedError(vision_encode_type)
if 'llava_ov' in getattr(self.config, "transformers_version", ""):
new_image_features = []
# print("I am llava ov!!!!!!!")
for idx, image_feat in enumerate(image_features):
if idx in video_idx_in_batch: # NOTE lxh: I don't want it.
new_image_features.append(self.get_2dPool(image_feat))
else:
new_image_features.append(image_feat)
image_features = new_image_features
if mm_patch_merge_type == "flat":
image_features = [x.flatten(0, 1) for x in image_features]
elif mm_patch_merge_type.startswith("spatial"):
new_image_features = []
for image_idx, image_feature in enumerate(image_features):
# FIXME: now assume the image is square, and split to 2x2 patches
# num_patches = h * w, where h = w = sqrt(num_patches)
# currently image_feature is a tensor of shape (4, num_patches, hidden_size)
# we want to first unflatten it to (2, 2, h, w, hidden_size)
# rank0_print("At least we are reaching here")
if image_idx in video_idx_in_batch: # video operations
# rank0_print("Video")
# rank0_print(f"video image_feature.shape: {image_feature.shape}")
if "anyres" in frame_aspect_ratio:
if "anyres_max" in frame_aspect_ratio:
matched_anyres_max_num_patches = re.match(r"anyres_max_(\d+)", frame_aspect_ratio)
if matched_anyres_max_num_patches:
max_num_patches = int(matched_anyres_max_num_patches.group(1))
num_frames = num_frames_list[image_idx]
if hasattr(self.get_vision_tower(), "image_size"):
vision_tower_image_size = self.get_vision_tower().image_size
else:
raise ValueError("vision_tower_image_size is not found in the vision tower.")
try:
num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.frame_grid_pinpoints, vision_tower_image_size, max_resolutions=self.config.max_num_pixels // num_frames) # TODO: num_frames needs to be passed in for this computation
except Exception as e:
rank0_print(f"Error: {e}, self.config:{self.config}")
raise e
height = width = self.get_model().mm_projector.num_frame_patches_per_side
if "maxpool2x2" in mm_patch_merge_type:
raise NotImplementedError
elif "unpad" in mm_patch_merge_type and "anyres_max" in frame_aspect_ratio and matched_anyres_max_num_patches:
raise NotImplementedError
elif "unpad" in mm_patch_merge_type and "anyres" in frame_aspect_ratio:
raise NotImplementedError
else:
# rank0_print(f"652 num_frames={num_frames}")
if num_patch_height * num_patch_width != 1: # has global
image_feature = image_feature.view(num_patch_height * num_patch_width + 1, -1, height, width, image_feature.shape[-1])
assert num_frames == image_feature.shape[1], f"{num_frames} != {image_feature.shape[1]}"
base_frame_feature = image_feature[0].view(num_frames, -1, image_feature[0].shape[-1]) # T 4*4 C
# rank0_print(f"base_frame_feature.shape: {base_frame_feature.shape}")
image_feature = image_feature[1:].permute(1, 0, 2, 3, 4) # T P 4 4 C
frame_feature = image_feature.view(num_frames, num_patch_height, num_patch_width, height, width, -1)
frame_feature = frame_feature.permute(0, 1, 3, 2, 4, 5).contiguous()
frame_feature = frame_feature.flatten(1, 4)
frame_feature = torch.cat((base_frame_feature, frame_feature), dim=1)
# rank0_print(f"two_frame_feature.shape: {frame_feature.shape}")
else: # no global
frame_feature = image_feature.view(num_frames, -1, image_feature[0].shape[-1]) # T 4*4 C
# rank0_print(f"only_frame_feature.shape: {frame_feature.shape}")
if "nobase" in mm_patch_merge_type:
raise NotImplementedError
else:
frame_feature = image_feature
if "pad" in mm_patch_merge_type: # matches both "unpad" and "nopad"
if mm_newline_position == 'one_token':
frame_feature = frame_feature.flatten(0, 1)
if "unpad" in mm_patch_merge_type:
frame_feature = torch.cat((frame_feature, self.model.image_newline[None].to(frame_feature.device)), dim=0)
else:
frame_feature = torch.cat((frame_feature, self.model.frame_newline[None].to(frame_feature.device)), dim=0)
elif mm_newline_position == 'frame':
# Frame-wise
frame_feature = self.add_token_per_frame(frame_feature)
frame_feature = frame_feature.flatten(0, 1)
elif mm_newline_position == 'frame2':
# Frame-wise
raise NotImplementedError
elif mm_newline_position == 'nothing':
frame_feature = frame_feature.flatten(0, 1)
else:
raise NotImplementedError("add pad please!!")
else:
frame_feature = frame_feature.flatten(0, 1)
# rank0_print(f"final video frame_feature.shape: {frame_feature.shape}")
image_feature = frame_feature
elif image_feature.shape[0] > 1: # multi patches and multi images operations
# rank0_print("Single-images") NOTE: multi-image inputs do not actually reach this branch, because they are split into multiple padded single images
base_image_feature = image_feature[0]
image_feature = image_feature[1:]
origin_size = image_feature.shape
height = width = self.get_model().mm_projector.num_image_patches_per_side # NOTE: hard-coded, 49 tokens per image
assert height * width == base_image_feature.shape[0], f"height:{height}, width: {width}, base_image_feature: {base_image_feature.shape}"
if "anyres_max" in image_aspect_ratio:
matched_anyres_max_num_patches = re.match(r"anyres_max_(\d+)", image_aspect_ratio)
if matched_anyres_max_num_patches:
max_num_patches = int(matched_anyres_max_num_patches.group(1))
if "anyres" in image_aspect_ratio:
if hasattr(self.get_vision_tower(), "image_size"):
vision_tower_image_size = self.get_vision_tower().image_size
else:
raise ValueError("vision_tower_image_size is not found in the vision tower.")
try:
num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.image_grid_pinpoints, vision_tower_image_size, max_resolutions=None)
except Exception as e:
rank0_print(f"Error: {e}")
raise e
# num_patch_width, num_patch_height = 2, 2
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
else:
raise NotImplementedError(image_aspect_ratio)
image_feature = image_feature.view(2, 2, height, width, -1)
if "maxpool2x2" in mm_patch_merge_type:
raise NotImplementedError
image_feature = image_feature.permute(4, 0, 2, 1, 3).contiguous()
image_feature = image_feature.flatten(1, 2).flatten(2, 3)
image_feature = nn.functional.max_pool2d(image_feature, 2)
image_feature = image_feature.flatten(1, 2).transpose(0, 1)
elif "unpad" in mm_patch_merge_type and "anyres_max" in image_aspect_ratio and matched_anyres_max_num_patches:
raise NotImplementedError
elif "unpad" in mm_patch_merge_type:
raise NotImplementedError
else:
image_feature = image_feature.permute(0, 2, 1, 3, 4).contiguous()
image_feature = image_feature.flatten(0, 3)
if "nobase" in mm_patch_merge_type:
pass
else:
try:
image_feature = torch.cat((base_image_feature, image_feature), dim=0)
except Exception as e:
raise ValueError(f"{num_patch_width} {num_patch_height} now: base_image_feature: {base_image_feature.shape}, {image_feature.shape}, image_sizes[image_idx]: {image_sizes[image_idx]}, origin_size: {origin_size}, {image_sizes[image_idx]}, {self.config.image_grid_pinpoints}, {vision_tower_image_size}")
else: # single image operations
image_feature = image_feature[0]
if "unpad" in mm_patch_merge_type:
image_feature = torch.cat((image_feature, self.model.image_newline[None]), dim=0)
# rank0_print(f"image/video_feature.shape: {image_feature.shape}")
new_image_features.append(image_feature)
image_features = new_image_features
else:
raise ValueError(f"Unexpected mm_patch_merge_type: {self.config.mm_patch_merge_type}")
else:
# raise NotImplementedError(f"images.shape={images.shape}, modalities={modalities}")
image_features = self.encode_image(images)
# TODO: image start / end is not implemented here to support pretraining.
if getattr(self.config, "tune_mm_mlp_adapter", False) and getattr(self.config, "mm_use_im_start_end", False):
raise NotImplementedError
# rank0_print(f"Total images len(image_features: {len(image_features)}")
# Let's just add dummy tensors if they do not exist,
# it is a headache to deal with None all the time.
# But it is not ideal, and if you have a better idea,
# please open an issue / submit a PR, thanks.
_labels = labels
_position_ids = position_ids
_attention_mask = attention_mask
if attention_mask is None:
attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
else:
attention_mask = attention_mask.bool()
if position_ids is None:
position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
if labels is None:
labels = torch.full_like(input_ids, IGNORE_INDEX)
# remove the padding using attention_mask -- FIXME
_input_ids = input_ids
input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
new_input_embeds = []
new_labels = []
cur_image_idx = 0
mm_llm_compress = getattr(self.config, "mm_llm_compress", False)
if mm_llm_compress:
self.model.llm_compress_type = getattr(self.config, "llm_compress_type", "attention")
self.model.llm_compress_layer_list = getattr(self.config, "llm_compress_layer_list", [8, 16, 24])
self.model.llm_image_token_ratio_list = getattr(self.config, "llm_image_token_ratio_list", [1.0, 0.5, 0.25, 0.125])
first_image_token_position = []
text_prompt_lens = []
else:
self.model.llm_compress_type = "attention"
self.model.llm_compress_layer_list = []
self.model.llm_image_token_ratio_list = []
# rank_print("Inserting Images embedding")
for batch_idx, cur_input_ids in enumerate(input_ids):
num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
if mm_llm_compress:
####### copy from pdrop, only support single image/video NOTE ##################
# record image position for further dropping
image_index = torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist()
assert len(image_index) == 1, f"Only a single image/video is supported: {image_index}"
if image_index == []:
first_image_token_position.append(-1)
else:
first_image_token_position.append(image_index[0])
# record input instruction length in inference mode
if not self.training:
if image_index == []:
assert num_images == 0, num_images
else:
assert num_images == 1, f"num_images={num_images} is not supported"
text_prompt_lens.append(cur_input_ids.shape[0] - num_images) # consider image place holder
###############################################
# rank0_print(f"num_images={num_images}")
if num_images == 0:
cur_image_features = image_features[cur_image_idx]
cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
new_input_embeds.append(cur_input_embeds)
new_labels.append(labels[batch_idx])
cur_image_idx += 1
continue
image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
cur_input_ids_noim = []
cur_labels = labels[batch_idx]
cur_labels_noim = []
for i in range(len(image_token_indices) - 1):
cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] + 1 : image_token_indices[i + 1]])
cur_labels_noim.append(cur_labels[image_token_indices[i] + 1 : image_token_indices[i + 1]])
split_sizes = [x.shape[0] for x in cur_labels_noim]
cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
cur_new_input_embeds = []
cur_new_labels = []
for i in range(num_images + 1):
cur_new_input_embeds.append(cur_input_embeds_no_im[i])
cur_new_labels.append(cur_labels_noim[i])
if i < num_images:
try:
cur_image_features = image_features[cur_image_idx]
except IndexError:
rank0_print(f"cur_image_idx={cur_image_idx} is not ok")
cur_image_features = image_features[cur_image_idx - 1]
cur_image_idx += 1
cur_new_input_embeds.append(cur_image_features)
cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))
cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
# import pdb; pdb.set_trace()
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
cur_new_labels = torch.cat(cur_new_labels)
new_input_embeds.append(cur_new_input_embeds)
new_labels.append(cur_new_labels)
if mm_llm_compress:
self.model.first_image_token_position = first_image_token_position
self.model.text_prompt_lens = text_prompt_lens
self.model.num_image_token_lens = [image_feature.shape[0] for image_feature in image_features]
# Truncate sequences to max length as image embeddings can make the sequence longer
tokenizer_model_max_length = getattr(self.config, "tokenizer_model_max_length", None)
# rank_print("Finishing Inserting")
new_input_embeds = [x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
new_labels = [x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]
# Combine them
max_len = max(x.shape[0] for x in new_input_embeds)
batch_size = len(new_input_embeds)
new_input_embeds_padded = []
new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
# rank0_print("Prepare pos id")
for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
cur_len = cur_new_embed.shape[0]
if getattr(self.config, "tokenizer_padding_side", "right") == "left":
new_input_embeds_padded.append(torch.cat((torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device), cur_new_embed), dim=0))
if cur_len > 0:
new_labels_padded[i, -cur_len:] = cur_new_labels
attention_mask[i, -cur_len:] = True
position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
else:
new_input_embeds_padded.append(torch.cat((cur_new_embed, torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0))
if cur_len > 0:
new_labels_padded[i, :cur_len] = cur_new_labels
attention_mask[i, :cur_len] = True
position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
# rank0_print("tokenizer padding")
if _labels is None:
new_labels = None
else:
new_labels = new_labels_padded
if _attention_mask is None:
attention_mask = None
else:
attention_mask = attention_mask.to(dtype=_attention_mask.dtype)
if _position_ids is None:
position_ids = None
if getattr(self.config, "use_pos_skipping", False) and self.training:
position_ids = torch.arange(new_input_embeds.size(1), device=new_input_embeds.device).unsqueeze(0)
split_position = random.randint(0, new_input_embeds.size(1))
left_add = random.randint(0, self.config.pos_skipping_range)
right_add = random.randint(left_add, self.config.pos_skipping_range)
position_ids[:, :split_position] += left_add
position_ids[:, split_position:] += right_add
# import pdb; pdb.set_trace()
# print("Finish preparing")
return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
def initialize_vision_tokenizer(self, model_args, tokenizer):
if model_args.mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if model_args.mm_use_im_start_end:
num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = self.get_input_embeddings().weight.data
output_embeddings = self.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = True
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
if model_args.pretrain_mm_mlp_adapter:
mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location="cpu")
embed_tokens_weight = mm_projector_weights["model.embed_tokens.weight"]
assert num_new_tokens == 2
if input_embeddings.shape == embed_tokens_weight.shape:
input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
elif embed_tokens_weight.shape[0] == num_new_tokens:
input_embeddings[-num_new_tokens:] = embed_tokens_weight
else:
raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Number of new tokens: {num_new_tokens}.")
elif model_args.mm_use_im_patch_token:
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = False
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
================================================
FILE: xtuner-eval_niah/llava/model/make_delta.py
================================================
"""
Usage:
python3 -m llava.model.make_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/llava-7b --delta-path ~/model_weights/llava-7b-delta --hub-repo-id liuhaotian/llava-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from llava.model.utils import auto_upgrade
def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading target model")
auto_upgrade(target_model_path)
target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Calculating delta")
for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"):
if name not in base.state_dict():
assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
continue
if param.data.shape == base.state_dict()[name].shape:
param.data -= base.state_dict()[name]
else:
assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
bparam = base.state_dict()[name]
param.data[: bparam.shape[0], : bparam.shape[1]] -= bparam
print("Saving delta")
if hub_repo_id:
kwargs = {"push_to_hub": True, "repo_id": hub_repo_id}
else:
kwargs = {}
target.save_pretrained(delta_path, **kwargs)
target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
target_tokenizer.save_pretrained(delta_path, **kwargs)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
parser.add_argument("--hub-repo-id", type=str, default=None)
args = parser.parse_args()
make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id)
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/builder.py
================================================
import os
from .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2
from .siglip_encoder import SigLipVisionTower
from .umt_encoder import UMTVisionTower
from .internvideo2_encoder import InternVideo2VisionTower
# from .eva_clip.eva_clip_encoder import EvaClipVisionTower
# from .dev_eva_clip.eva_vit import EvaViTWrapper
def build_vision_tower(vision_tower_cfg, **kwargs):
vision_tower = getattr(vision_tower_cfg, "mm_vision_tower", getattr(vision_tower_cfg, "vision_tower", None))
# is_absolute_path_exists = os.path.exists(vision_tower) # NOTE sb code!
use_s2 = getattr(vision_tower_cfg, "s2", False)
if 'clip-vit' in vision_tower or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
if use_s2:
return CLIPVisionTowerS2(vision_tower, args=vision_tower_cfg, **kwargs)
else:
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
elif "siglip" in vision_tower:
return SigLipVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
elif "internvideo2" in vision_tower:
return InternVideo2VisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, image_size=224, **kwargs)
elif "umt-hd" in vision_tower:
return UMTVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, image_size=448, **kwargs)
elif "umt" in vision_tower:
return UMTVisionTower(vision_tower, vision_tower_cfg=vision_tower_cfg, **kwargs)
else:
raise ValueError(f"Unknown vision tower: {vision_tower}")
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/clip_encoder.py
================================================
import torch
import torch.nn as nn
from llava.utils import rank0_print
from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig
try:
from s2wrapper import forward as multiscale_forward
except ImportError:
pass
class CLIPVisionTower(nn.Module):
def __init__(self, vision_tower, args, delay_load=False):
super().__init__()
self.is_loaded = False
self.vision_tower_name = vision_tower
self.select_layer = args.mm_vision_select_layer
self.select_feature = getattr(args, "mm_vision_select_feature", "patch")
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(args, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(args, "mm_tunable_parts") and "mm_vision_tower" in args.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def feature_select(self, image_forward_outs):
select_feature_type = self.select_feature
if self.select_feature in ["slicefour_patch", "slicefour_cls_patch"]:
select_every_k_layer = len(image_forward_outs.hidden_states) // 4
image_features = torch.cat([image_forward_outs.hidden_states[i] for i in range(select_every_k_layer + self.select_layer, len(image_forward_outs.hidden_states), select_every_k_layer)], dim=-1)
select_feature_type = select_feature_type.replace("slicefour_", "")
elif self.select_feature in ["slice_m25811_f6_patch", "slice_m25811_f6_cls_patch"]:
select_layers = [-2, -5, -8, -11, 6]
image_features = torch.cat([image_forward_outs.hidden_states[i] for i in select_layers], dim=-1)
select_feature_type = select_feature_type.replace("slice_m25811_f6_", "")
else:
image_features = image_forward_outs.hidden_states[self.select_layer]
if select_feature_type == "patch":
image_features = image_features[:, 1:]
elif select_feature_type == "cls_patch":
image_features = image_features
else:
raise ValueError(f"Unexpected select feature: {select_feature_type}")
return image_features
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
image_feature = self.feature_select(image_forward_out).to(image.dtype)
image_features.append(image_feature)
else:
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
return self.vision_tower.dtype
@property
def device(self):
return self.vision_tower.device
@property
def config(self):
if self.is_loaded:
return self.vision_tower.config
else:
return self.cfg_only
@property
def hidden_size(self):
_hidden_size = self.config.hidden_size
if "slicefour" in self.select_feature:
_hidden_size *= 4
if "slice_m25811_f6" in self.select_feature:
_hidden_size *= 5
return _hidden_size
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
@property
def num_patches(self):
_num_patches = (self.config.image_size // self.config.patch_size) ** 2
if "cls_patch" in self.select_feature:
_num_patches += 1
return _num_patches
@property
def image_size(self):
return self.config.image_size
class CLIPVisionTowerS2(CLIPVisionTower):
def __init__(self, vision_tower, args, delay_load=False):
self.s2_scales = getattr(args, "s2_scales", "336,672,1008")
self.s2_scales = list(map(int, self.s2_scales.split(",")))
self.s2_scales.sort()
self.s2_split_size = self.s2_scales[0]
self.s2_image_size = self.s2_scales[-1]
super().__init__(vision_tower, args, delay_load)
# change resize/crop size in preprocessing to the largest image size in s2_scale
if not delay_load or getattr(args, "unfreeze_mm_vision_tower", False):
self.image_processor.size["shortest_edge"] = self.s2_image_size
self.image_processor.crop_size["height"] = self.image_processor.crop_size["width"] = self.s2_image_size
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
self.vision_tower.requires_grad_(False)
self.image_processor.size["shortest_edge"] = self.s2_image_size
self.image_processor.crop_size["height"] = self.image_processor.crop_size["width"] = self.s2_image_size
self.is_loaded = True
def forward_feature(self, images):
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_feature = multiscale_forward(self.forward_feature, image.unsqueeze(0), img_sizes=self.s2_scales, max_split_size=self.s2_split_size, split_forward=True)
image_features.append(image_feature)
else:
image_features = multiscale_forward(self.forward_feature, images, img_sizes=self.s2_scales, max_split_size=self.s2_split_size, split_forward=True)
return image_features
@property
def hidden_size(self):
return self.config.hidden_size * len(self.s2_scales)
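# Illustrative sketch (not part of the original file): how the `s2_scales`
# string is parsed in CLIPVisionTowerS2.__init__ and why `hidden_size` grows
# with the number of scales. The helper names and the base hidden size of 1024
# are assumptions chosen for the example only.

```python
def parse_s2_scales(spec="336,672,1008"):
    """Parse a comma-separated scale string the same way the constructor does."""
    scales = sorted(int(s) for s in spec.split(","))
    return {
        "scales": scales,
        "split_size": scales[0],   # smallest scale: tile size for split forward
        "image_size": scales[-1],  # largest scale: preprocessing resize/crop size
    }


def s2_hidden_size(base_hidden_size, scales):
    """Features from every scale are concatenated along the channel axis."""
    return base_hidden_size * len(scales)


info = parse_s2_scales("672,336,1008")  # order in the string does not matter
print(info)
print(s2_hidden_size(1024, info["scales"]))  # 1024 is an assumed base width
```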
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/internvideo2/__init__.py
================================================
from .vit_scale_clean import PretrainVisionTransformer_clean
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/internvideo2/flash_attention_class.py
================================================
import torch
import torch.nn as nn
from einops import rearrange
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func
from flash_attn.bert_padding import unpad_input, pad_input
class FlashAttention(nn.Module):
"""Implement the scaled dot product attention with softmax.
Arguments
---------
softmax_scale: The temperature to use for the softmax attention.
(default: 1/sqrt(d_keys) where d_keys is computed at
runtime)
attention_dropout: The dropout rate to apply to the attention
(default: 0.0)
"""
def __init__(self, softmax_scale=None, attention_dropout=0.0, device=None, dtype=None):
super().__init__()
self.softmax_scale = softmax_scale
self.dropout_p = attention_dropout
def forward(self, qkv, key_padding_mask=None, causal=False, cu_seqlens=None,
max_s=None, need_weights=False):
"""Implements the multihead softmax attention.
Arguments
---------
qkv: The tensor containing the query, key, and value. (B, S, 3, H, D) if key_padding_mask is None
if unpadded: (nnz, 3, h, d)
key_padding_mask: a bool tensor of shape (B, S)
"""
assert not need_weights
assert qkv.dtype in [torch.float16, torch.bfloat16]
assert qkv.is_cuda
if cu_seqlens is None:
batch_size = qkv.shape[0]
seqlen = qkv.shape[1]
if key_padding_mask is None:
qkv = rearrange(qkv, 'b s ... -> (b s) ...')
max_s = seqlen
cu_seqlens = torch.arange(0, (batch_size + 1) * seqlen, step=seqlen, dtype=torch.int32,
device=qkv.device)
output = flash_attn_varlen_qkvpacked_func(
qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
output = rearrange(output, '(b s) ... -> b s ...', b=batch_size)
else:
nheads = qkv.shape[-2]
x = rearrange(qkv, 'b s three h d -> b s (three h d)')
x_unpad, indices, cu_seqlens, max_s = unpad_input(x, key_padding_mask)
x_unpad = rearrange(x_unpad, 'nnz (three h d) -> nnz three h d', three=3, h=nheads)
output_unpad = flash_attn_varlen_qkvpacked_func(
x_unpad, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
output = rearrange(pad_input(rearrange(output_unpad, 'nnz h d -> nnz (h d)'),
indices, batch_size, seqlen),
'b s (h d) -> b s h d', h=nheads)
else:
assert max_s is not None
output = flash_attn_varlen_qkvpacked_func(
qkv, cu_seqlens, max_s, self.dropout_p if self.training else 0.0,
softmax_scale=self.softmax_scale, causal=causal
)
return output, None
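# Illustrative sketch (not part of the original file): the cumulative sequence
# lengths (`cu_seqlens`) that the variable-length kernel consumes. The first
# helper mirrors the `torch.arange` call used above when no padding mask is
# given. The second assumes the mask convention "True means a real token",
# which is what the unpadding helper appears to expect. Both helper names are
# hypothetical and use plain Python lists instead of tensors.

```python
from itertools import accumulate


def cu_seqlens_fixed(batch_size, seqlen):
    """Equal-length sequences: boundaries at 0, seqlen, 2*seqlen, ..."""
    return list(range(0, (batch_size + 1) * seqlen, seqlen))


def cu_seqlens_from_mask(key_padding_mask):
    """Padded batch: boundaries are the running sum of per-row valid tokens."""
    lengths = [sum(bool(flag) for flag in row) for row in key_padding_mask]
    return [0] + list(accumulate(lengths))


fixed = cu_seqlens_fixed(batch_size=3, seqlen=4)
ragged = cu_seqlens_from_mask([[True, True, False], [True, False, False]])
print(fixed, ragged)
```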
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/internvideo2/pos_embed.py
================================================
import numpy as np
import torch
import logging
logger = logging.getLogger(__name__)
# --------------------------------------------------------
# 3D sine-cosine position embedding
# References:
# MVD: https://github.com/ruiwang2021/mvd/blob/main/modeling_finetune.py
# --------------------------------------------------------
def get_3d_sincos_pos_embed(embed_dim, grid_size, t_size, cls_token=False):
"""
grid_size: int of the grid height and width
t_size: int of the temporal size
return:
pos_embed: [t_size*grid_size*grid_size, embed_dim] or [1+t_size*grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
"""
assert embed_dim % 4 == 0
embed_dim_spatial = embed_dim // 4 * 3
embed_dim_temporal = embed_dim // 4
# spatial
grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h) # here w goes first
grid = np.stack(grid, axis=0)
grid = grid.reshape([2, 1, grid_size, grid_size])
pos_embed_spatial = get_2d_sincos_pos_embed_from_grid(
embed_dim_spatial, grid
)
# temporal
grid_t = np.arange(t_size, dtype=np.float32)
pos_embed_temporal = get_1d_sincos_pos_embed_from_grid(
embed_dim_temporal, grid_t
)
# concatenate: [T, H, W] order
pos_embed_temporal = pos_embed_temporal[:, np.newaxis, :]
pos_embed_temporal = np.repeat(
pos_embed_temporal, grid_size**2, axis=1
) # [T, H*W, D // 4]
pos_embed_spatial = pos_embed_spatial[np.newaxis, :, :]
pos_embed_spatial = np.repeat(
pos_embed_spatial, t_size, axis=0
) # [T, H*W, D // 4 * 3]
pos_embed = np.concatenate([pos_embed_temporal, pos_embed_spatial], axis=-1)
pos_embed = pos_embed.reshape([-1, embed_dim]) # [T*H*W, D]
if cls_token:
pos_embed = np.concatenate(
[np.zeros([1, embed_dim]), pos_embed], axis=0
)
return pos_embed
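# Illustrative sketch (not part of the original file): how
# get_3d_sincos_pos_embed splits the channel budget, and which widths pass all
# of its nested assertions. The spatial part is halved again for height and
# width, and each 1D encoder needs an even width, so `embed_dim % 4 == 0`
# alone is not enough in practice. The helper names are hypothetical.

```python
def split_3d_embed_dim(embed_dim):
    """Return (temporal_dim, spatial_dim) as computed in the function above."""
    assert embed_dim % 4 == 0
    return embed_dim // 4, embed_dim // 4 * 3


def passes_all_asserts(embed_dim):
    """Check every evenness assertion reached by the 3D embedding builder."""
    if embed_dim % 4 != 0:
        return False
    temporal, spatial = split_3d_embed_dim(embed_dim)
    per_axis = spatial // 2  # width handed to each 1D encoder for H and W
    return temporal % 2 == 0 and spatial % 2 == 0 and per_axis % 2 == 0


print(split_3d_embed_dim(1408))  # width used by the giant model in this repo
print([d for d in range(4, 65, 4) if passes_all_asserts(d)])
```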
# --------------------------------------------------------
# 2D sine-cosine position embedding
# References:
# Transformer: https://github.com/tensorflow/models/blob/master/official/nlp/transformer/model_utils.py
# MoCo v3: https://github.com/facebookresearch/moco-v3
# --------------------------------------------------------
def get_2d_sincos_pos_embed(embed_dim, grid_size, cls_token=False):
"""
grid_size: int of the grid height and width
return:
pos_embed: [grid_size*grid_size, embed_dim] or [1+grid_size*grid_size, embed_dim] (w/ or w/o cls_token)
"""
grid_h = np.arange(grid_size, dtype=np.float32)
grid_w = np.arange(grid_size, dtype=np.float32)
grid = np.meshgrid(grid_w, grid_h) # here w goes first
grid = np.stack(grid, axis=0)
grid = grid.reshape([2, 1, grid_size, grid_size])
pos_embed = get_2d_sincos_pos_embed_from_grid(embed_dim, grid)
if cls_token:
pos_embed = np.concatenate(
[np.zeros([1, embed_dim]), pos_embed], axis=0
)
return pos_embed
def get_1d_sincos_pos_embed(embed_dim, t_size, cls_token=False):
"""
t_size: int of the temporal size
return:
pos_embed: [t_size, embed_dim] or [1+t_size, embed_dim] (w/ or w/o cls_token)
"""
grid_t = np.arange(t_size, dtype=np.float32)
pos_embed = get_1d_sincos_pos_embed_from_grid(embed_dim, grid_t)
if cls_token:
pos_embed = np.concatenate(
[np.zeros([1, embed_dim]), pos_embed], axis=0
)
return pos_embed
def get_2d_sincos_pos_embed_from_grid(embed_dim, grid):
assert embed_dim % 2 == 0
# use half of dimensions to encode grid_h
emb_h = get_1d_sincos_pos_embed_from_grid(
embed_dim // 2, grid[0]
) # (H*W, D/2)
emb_w = get_1d_sincos_pos_embed_from_grid(
embed_dim // 2, grid[1]
) # (H*W, D/2)
emb = np.concatenate([emb_h, emb_w], axis=1) # (H*W, D)
return emb
def get_1d_sincos_pos_embed_from_grid(embed_dim, pos):
"""
embed_dim: output dimension for each position
pos: a list of positions to be encoded: size (M,)
out: (M, D)
"""
assert embed_dim % 2 == 0
omega = np.arange(embed_dim // 2, dtype=np.float32)
omega /= embed_dim / 2.0
omega = 1.0 / 10000**omega # (D/2,)
pos = pos.reshape(-1) # (M,)
out = np.einsum("m,d->md", pos, omega) # (M, D/2), outer product
emb_sin = np.sin(out) # (M, D/2)
emb_cos = np.cos(out) # (M, D/2)
emb = np.concatenate([emb_sin, emb_cos], axis=1) # (M, D)
return emb
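# Illustrative sketch (not part of the original file): a dependency-free
# version of the 1D sine-cosine encoding above, useful for checking the
# layout by hand. Each row is [sin(p * w_0), ..., sin(p * w_k), cos(p * w_0),
# ..., cos(p * w_k)] with w_i = 1 / 10000 ** (i / (D / 2)). The function name
# `sincos_1d` is hypothetical.

```python
import math


def sincos_1d(embed_dim, positions):
    """Pure-Python counterpart of get_1d_sincos_pos_embed_from_grid."""
    assert embed_dim % 2 == 0
    half = embed_dim // 2
    omega = [1.0 / 10000 ** (i / half) for i in range(half)]
    rows = []
    for p in positions:
        angles = [p * w for w in omega]
        # sine half first, cosine half second, as in the NumPy version
        rows.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return rows


emb = sincos_1d(4, [0, 1])
print(emb[0])  # position 0: all sines are 0, all cosines are 1
```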
def interpolate_pos_embed_internvideo2(checkpoint_model, model, orig_t_size = 8):
# interpolate position embedding
for pos_name in ['pos_embed', 'clip_pos_embed']:
if pos_name in checkpoint_model:
pos_embed_checkpoint = checkpoint_model[pos_name]
embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
num_patches = model.patch_embed.num_patches #
num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1
# we use 8 frames for pretraining
# new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size
new_t_size = model.num_frames // model.tubelet_size
# height (== width) for the checkpoint position embedding
orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(orig_t_size)) ** 0.5)
# height (== width) for the new position embedding
new_size = int((num_patches // (new_t_size))** 0.5)
# class_token and dist_token are kept unchanged
if orig_t_size != new_t_size:
logger.info(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})")
print(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> B, T, HW, C -> BHW, C, T (B = 1)
pos_tokens = pos_tokens.view(1, orig_t_size, -1, embedding_size)
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, embedding_size, orig_t_size)
pos_tokens = torch.nn.functional.interpolate(pos_tokens, size=new_t_size, mode='linear')
pos_tokens = pos_tokens.view(1, -1, embedding_size, new_t_size)
pos_tokens = pos_tokens.permute(0, 3, 1, 2).reshape(1, -1, embedding_size)
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
pos_embed_checkpoint = new_pos_embed
# class_token and dist_token are kept unchanged
if orig_size != new_size:
logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
print(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> BT, H, W, C -> BT, C, H, W
pos_tokens = pos_tokens.reshape(-1, new_t_size, orig_size, orig_size, embedding_size)
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
pos_tokens = torch.nn.functional.interpolate(
pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, new_t_size, new_size, new_size, embedding_size)
pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
for pos_name in ['img_pos_embed']:
if pos_name in checkpoint_model:
pos_embed_checkpoint = checkpoint_model[pos_name]
embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
num_patches = model.patch_embed.num_img_patches #
num_extra_tokens = model.pos_embed.shape[-2] - model.patch_embed.num_patches # 0/1
# we use 8 frames for pretraining
# new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size
# height (== width) for the checkpoint position embedding
orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)) ** 0.5)
# height (== width) for the new position embedding
new_size = int((num_patches)** 0.5)
# class_token and dist_token are kept unchanged
if orig_size != new_size:
logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
print(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> B, H, W, C -> B, C, H, W
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
pos_tokens = torch.nn.functional.interpolate(
pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, 1, new_size, new_size, embedding_size)
pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
if 'pos_embed_spatial' in checkpoint_model or 'pos_embed_temporal' in checkpoint_model:
raise NotImplementedError
def interpolate_pos_embed_internvideo2_new(checkpoint_model, model, orig_t_size = 8):
pos_names = []
for k in checkpoint_model.keys():
if ('pos_embed' in k or 'clip_pos_embed' in k) and 'img' not in k:
pos_names.append(k)
assert len(pos_names) > 0, checkpoint_model.keys()
if 'pos_embed_spatial' in checkpoint_model.keys() or 'pos_embed_temporal' in checkpoint_model.keys():
raise NotImplementedError
# interpolate position embedding
for pos_name in pos_names:
pos_embed_checkpoint = checkpoint_model[pos_name]
embedding_size = pos_embed_checkpoint.shape[-1] # channel dim
num_patches = model.patch_embed.num_patches #
num_extra_tokens = model.pos_embed.shape[-2] - num_patches # 0/1
# we use 8 frames for pretraining
# new_t_size = args.num_frames * args.num_segments // model.patch_embed.tubelet_size
new_t_size = model.num_frames // model.tubelet_size
# height (== width) for the checkpoint position embedding
orig_size = int(((pos_embed_checkpoint.shape[-2] - num_extra_tokens)//(orig_t_size)) ** 0.5)
# height (== width) for the new position embedding
new_size = int((num_patches // (new_t_size))** 0.5)
# class_token and dist_token are kept unchanged
if orig_t_size != new_t_size:
logger.info(f"Temporal interpolate from {orig_t_size} to {new_t_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> B, T, HW, C -> BHW, C, T (B = 1)
pos_tokens = pos_tokens.view(1, orig_t_size, -1, embedding_size)
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, embedding_size, orig_t_size)
pos_tokens = torch.nn.functional.interpolate(pos_tokens, size=new_t_size, mode='linear')
pos_tokens = pos_tokens.view(1, -1, embedding_size, new_t_size)
pos_tokens = pos_tokens.permute(0, 3, 1, 2).reshape(1, -1, embedding_size)
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
pos_embed_checkpoint = new_pos_embed
# class_token and dist_token are kept unchanged
if orig_size != new_size:
logger.info(f"Position interpolate from {orig_size}x{orig_size} to {new_size}x{new_size} ({pos_name})")
extra_tokens = pos_embed_checkpoint[:, :num_extra_tokens]
# only the position tokens are interpolated
pos_tokens = pos_embed_checkpoint[:, num_extra_tokens:]
# B, L, C -> BT, H, W, C -> BT, C, H, W
pos_tokens = pos_tokens.reshape(-1, new_t_size, orig_size, orig_size, embedding_size)
pos_tokens = pos_tokens.reshape(-1, orig_size, orig_size, embedding_size).permute(0, 3, 1, 2)
pos_tokens = torch.nn.functional.interpolate(
pos_tokens, size=(new_size, new_size), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
pos_tokens = pos_tokens.permute(0, 2, 3, 1).reshape(-1, new_t_size, new_size, new_size, embedding_size)
pos_tokens = pos_tokens.flatten(1, 3) # B, L, C
new_pos_embed = torch.cat((extra_tokens, pos_tokens), dim=1)
checkpoint_model[pos_name] = new_pos_embed
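# Illustrative sketch (not part of the original file): the size bookkeeping
# that decides whether temporal and/or spatial interpolation runs. The token
# counts below (8 frames on a 16x16 grid in the checkpoint, 16 frames on a
# 32x32 grid in the target model, one extra cls token) are assumed example
# values, and `infer_grid_sizes` is a hypothetical helper.

```python
def infer_grid_sizes(ckpt_tokens, num_patches, num_extra_tokens, orig_t_size, new_t_size):
    """Recover the square grid side of the checkpoint and of the target model."""
    orig_size = int(((ckpt_tokens - num_extra_tokens) // orig_t_size) ** 0.5)
    new_size = int((num_patches // new_t_size) ** 0.5)
    return orig_size, new_size


ckpt_tokens = 1 + 8 * 16 * 16   # cls token + T * H * W in the checkpoint
num_patches = 16 * 32 * 32      # T * H * W in the target model
orig_size, new_size = infer_grid_sizes(ckpt_tokens, num_patches, 1, 8, 16)

needs_temporal = 8 != 16                # linear interpolation along T
needs_spatial = orig_size != new_size   # bicubic interpolation over (H, W)
print(orig_size, new_size, needs_temporal, needs_spatial)
```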
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/internvideo2/vit_scale_clean.py
================================================
import math
import logging
import torch
import torch.nn.functional as F
from timm.models.layers import DropPath, to_2tuple, trunc_normal_
from torch import nn
import torch.utils.checkpoint as checkpoint
from functools import partial
from einops import rearrange
from .pos_embed import get_3d_sincos_pos_embed, get_2d_sincos_pos_embed, get_1d_sincos_pos_embed, interpolate_pos_embed_internvideo2
from .flash_attention_class import FlashAttention
logger = logging.getLogger(__name__)
class CrossAttention(nn.Module):
def __init__(
self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
proj_drop=0., attn_head_dim=None, out_dim=None):
super().__init__()
if out_dim is None:
out_dim = dim
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
assert all_head_dim == dim
self.q = nn.Linear(dim, all_head_dim, bias=False)
self.k = nn.Linear(dim, all_head_dim, bias=False)
self.v = nn.Linear(dim, all_head_dim, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.k_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.k_bias = None
self.v_bias = None
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, out_dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x, k=None, v=None):
B, N, C = x.shape
N_k = k.shape[1]
N_v = v.shape[1]
q_bias, k_bias, v_bias = None, None, None
if self.q_bias is not None:
q_bias = self.q_bias
k_bias = self.k_bias
v_bias = self.v_bias
q = F.linear(input=x, weight=self.q.weight, bias=q_bias)
q = q.reshape(B, N, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0) # (B, N_head, N_q, dim)
k = F.linear(input=k, weight=self.k.weight, bias=k_bias)
k = k.reshape(B, N_k, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0)
v = F.linear(input=v, weight=self.v.weight, bias=v_bias)
v = v.reshape(B, N_v, 1, self.num_heads, -1).permute(2, 0, 3, 1, 4).squeeze(0)
q = q * self.scale
attn = (q @ k.transpose(-2, -1)) # (B, N_head, N_q, N_k)
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class AttentiveBlock(nn.Module):
def __init__(self, dim, num_heads, qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
drop_path=0., norm_layer=nn.LayerNorm, attn_head_dim=None, out_dim=None):
super().__init__()
self.norm1_q = norm_layer(dim)
self.norm1_k = norm_layer(dim)
self.norm1_v = norm_layer(dim)
self.cross_attn = CrossAttention(
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale, attn_drop=attn_drop,
proj_drop=drop, attn_head_dim=attn_head_dim, out_dim=out_dim)
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
def forward(self, x_q, x_kv, pos_q, pos_k, bool_masked_pos, rel_pos_bias=None):
x_q = self.norm1_q(x_q + pos_q)
x_k = self.norm1_k(x_kv + pos_k)
x_v = self.norm1_v(x_kv)
x = self.cross_attn(x_q, k=x_k, v=x_v)
return x
class AttentionPoolingBlock(AttentiveBlock):
def forward(self, x):
x_q = x.mean(1, keepdim=True)
x_kv, pos_q, pos_k = x, 0, 0
x = super().forward(x_q, x_kv, pos_q, pos_k, bool_masked_pos=None, rel_pos_bias=None)
x = x.squeeze(1)
return x
class RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
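# Illustrative sketch (not part of the original file): the arithmetic of
# RMSNorm on a single vector, written with plain Python floats. Unlike
# LayerNorm, no mean is subtracted and there is no bias; the vector is only
# rescaled so that its mean square is approximately 1. The function name
# `rms_norm` is hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by 1 / sqrt(mean(x^2) + eps), then by a per-channel weight."""
    variance = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(variance + eps)
    return [w * v * inv_rms for v, w in zip(x, weight)]


y = rms_norm([3.0, 4.0], [1.0, 1.0])
mean_square = sum(v * v for v in y) / len(y)
print(y, mean_square)
```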
class LayerScale(nn.Module):
def __init__(self, dim, init_values=1e-5, inplace=False, force_fp32=False):
super().__init__()
self.inplace = inplace
self.weight = nn.Parameter(init_values * torch.ones(dim))
self.force_fp32 = force_fp32
@torch.cuda.amp.autocast(enabled=False)
def forward(self, x):
if self.force_fp32:
output_type = x.dtype
out = x.float().mul_(self.weight.float()) if self.inplace else x.float() * self.weight.float()
return out.to(dtype=output_type)
else:
out = x.mul_(self.weight) if self.inplace else x * self.weight
return out
class Attention(nn.Module):
def __init__(self, dim, num_heads=8, qkv_bias=False, attn_drop=0., proj_drop=0., use_flash_attn=False,
causal=False, norm_layer=nn.LayerNorm, qk_normalization=False, use_fused_rmsnorm=False):
super().__init__()
assert dim % num_heads == 0, 'dim should be divisible by num_heads'
self.num_heads = num_heads
head_dim = dim // num_heads
self.scale = head_dim ** -0.5
self.qkv = nn.Linear(dim, dim * 3, bias=qkv_bias)
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
self.use_flash_attn = use_flash_attn
if use_flash_attn:
self.causal = causal
self.inner_attn = FlashAttention(attention_dropout=attn_drop)
self.qk_normalization = qk_normalization
self.q_norm = norm_layer(dim) if qk_normalization else nn.Identity()
self.k_norm = norm_layer(dim) if qk_normalization else nn.Identity()
self.use_fused_rmsnorm = use_fused_rmsnorm
def _naive_attn(self, x):
B, N, C = x.shape
# print(x.shape, torch.cuda.memory_allocated(), torch.cuda.memory_allocated())
qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
q, k, v = qkv.unbind(0) # make torchscript happy (cannot use tensor as tuple)
if self.qk_normalization:
B_, H_, N_, D_ = q.shape
q = self.q_norm(q.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
k = self.k_norm(k.transpose(1, 2).flatten(-2, -1)).view(B_, N_, H_, D_).transpose(1, 2)
attn = ((q * self.scale) @ k.transpose(-2, -1))
# attn = attn - attn.max(-1)[0].unsqueeze(-1) # in case of overflow for fp16
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
# print(torch.cuda.memory_allocated(), torch.cuda.memory_allocated())
x = (attn @ v).transpose(1, 2).reshape(B, N, C)
x = self.proj(x)
x = self.proj_drop(x)
return x
def _flash_attn(self, x, key_padding_mask=None, need_weights=False):
qkv = self.qkv(x)
qkv = rearrange(qkv, "b s (three h d) -> b s three h d", three=3, h=self.num_heads)
if self.qk_normalization:
q, k, v = qkv.unbind(2)
if self.use_fused_rmsnorm:
q = self.q_norm(q.flatten(-2, -1))[0].view(q.shape)
k = self.k_norm(k.flatten(-2, -1))[0].view(k.shape)
else:
q = self.q_norm(q.flatten(-2, -1)).view(q.shape)
k = self.k_norm(k.flatten(-2, -1)).view(k.shape)
qkv = torch.stack([q, k, v], dim=2)
context, _ = self.inner_attn(
qkv, key_padding_mask=key_padding_mask, need_weights=need_weights, causal=self.causal
)
outs = self.proj(rearrange(context, "b s h d -> b s (h d)"))
outs = self.proj_drop(outs)
return outs
def forward(self, x):
x = self._naive_attn(x) if not self.use_flash_attn else self._flash_attn(x)
return x
class Mlp(nn.Module):
""" MLP as used in Vision Transformer, MLP-Mixer and related networks
"""
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU,
bias=True, drop=0.):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
bias = to_2tuple(bias)
drop_probs = to_2tuple(drop)
self.fc1 = nn.Linear(in_features, hidden_features, bias=bias[0])
self.act = act_layer()
self.drop1 = nn.Dropout(drop_probs[0])
self.fc2 = nn.Linear(hidden_features, out_features, bias=bias[1])
self.drop2 = nn.Dropout(drop_probs[1])
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop1(x)
x = self.fc2(x)
x = self.drop2(x)
return x
class Block(nn.Module):
def __init__(
self, dim, num_heads, mlp_ratio=4., qkv_bias=False, drop=0., attn_drop=0., init_values=None,
drop_path=0., act_layer=nn.GELU, norm_layer=nn.LayerNorm, use_flash_attn=False, use_fused_mlp=False,
fused_mlp_heuristic=1, with_cp=False, qk_normalization=False, layerscale_no_force_fp32=False,
use_fused_rmsnorm=False):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(dim, num_heads=num_heads, qkv_bias=qkv_bias, attn_drop=attn_drop, proj_drop=drop,
use_flash_attn=use_flash_attn, causal=False, norm_layer=norm_layer,
qk_normalization=qk_normalization,
use_fused_rmsnorm=use_fused_rmsnorm)
self.ls1 = LayerScale(dim, init_values=init_values,
force_fp32=(not layerscale_no_force_fp32)) if init_values else nn.Identity()
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path1 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
if use_fused_mlp:
raise NotImplementedError("use_fused_mlp=True is not supported in this file: FusedMLP is not imported here")
self.mlp = FusedMLP(in_features=dim, hidden_features=mlp_hidden_dim, heuristic=fused_mlp_heuristic)
else:
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
self.ls2 = LayerScale(dim, init_values=init_values,
force_fp32=(not layerscale_no_force_fp32)) if init_values else nn.Identity()
self.drop_path2 = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.with_cp = with_cp
self.use_fused_rmsnorm = use_fused_rmsnorm
def forward(self, x, residual=None):
def _inner_forward(x, residual=None):
if self.use_fused_rmsnorm:
x, residual = self.norm1(x, residual)
x = self.drop_path1(self.ls1(self.attn(x)))
x, residual = self.norm2(x, residual)
x = self.drop_path2(self.ls2(self.mlp(x)))
return x, residual
else:
assert residual is None
x = x + self.drop_path1(self.ls1(self.attn(self.norm1(x))))
x = x + self.drop_path2(self.ls2(self.mlp(self.norm2(x))))
return x
if self.with_cp:
# print(f"\033[31m use_checkpoint \033[0m")
return checkpoint.checkpoint(_inner_forward, x, residual)
else:
return _inner_forward(x, residual=residual)
class PatchEmbed(nn.Module):
""" 3D Image to Patch Embedding
"""
def __init__(
self, img_size=224, patch_size=16, in_chans=3, embed_dim=768,
num_frames=8, tubelet_size=1, norm_layer=None
):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
self.img_size = img_size
self.patch_size = patch_size
self.grid_size = (
num_frames // tubelet_size,
img_size[0] // patch_size[0],
img_size[1] // patch_size[1]
) # (T, H, W)
self.num_patches = self.grid_size[0] * self.grid_size[1] * self.grid_size[2]
self.num_img_patches = self.grid_size[1] * self.grid_size[2]
self.proj = nn.Conv3d(
in_channels=in_chans, out_channels=embed_dim,
kernel_size=(tubelet_size, patch_size[0], patch_size[1]),
stride=(tubelet_size, patch_size[0], patch_size[1])
)
self.norm = norm_layer(embed_dim) if norm_layer else nn.Identity()
def forward(self, x):
x = self.proj(x)
x = x.flatten(3).permute(0, 2, 3, 1) # B x C x T x HW => B x T x HW x C
x = self.norm(x)
return x
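# Illustrative sketch (not part of the original file): the token-count
# arithmetic of PatchEmbed. The defaults below are the values passed by the
# giant model in this file (224 px input, patch size 14, 8 frames, tubelet
# size 1); the helper name `patch_grid` is hypothetical.

```python
def patch_grid(img_size=224, patch_size=14, num_frames=8, tubelet_size=1):
    """Return ((T, H, W), num_patches, num_img_patches) for square inputs."""
    grid = (num_frames // tubelet_size, img_size // patch_size, img_size // patch_size)
    num_patches = grid[0] * grid[1] * grid[2]   # tokens for a video clip
    num_img_patches = grid[1] * grid[2]         # tokens for a single image
    return grid, num_patches, num_img_patches


grid, num_patches, num_img_patches = patch_grid()
pos_embed_len = num_patches + 1  # one extra slot for the cls token
print(grid, num_patches, num_img_patches, pos_embed_len)
```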
class PretrainVisionTransformer_clean(nn.Module):
def __init__(
self,
in_chans: int = 3,
patch_size: int = 14,
img_size: int = 224,
qkv_bias: bool = False, # follow internvl_clip to set False
drop_path_rate: float = 0.25, # may need ablation
embed_dim: int = 1408,
num_heads: int = 16,
mlp_ratio: float = 48/11,
init_values: float = 1e-5, # may need ablation
qk_normalization: bool = True,
depth: int = 40,
use_flash_attn: bool = True,
use_fused_rmsnorm: bool = True,
use_fused_mlp: bool = True,
fused_mlp_heuristic: int = 1,
attn_pool_num_heads: int = 16,
clip_embed_dim: int = 768,
layerscale_no_force_fp32: bool = False, # whether True for training?
num_frames: int = 8,
tubelet_size: int = 1,
sep_pos_embed: bool = False,
sep_image_video_pos_embed: bool = False,
use_checkpoint: bool = False,
checkpoint_num: int = 0,
# for unmasked teacher
x_vis_return_idx=-1,
x_vis_only=False
):
super().__init__()
self.num_frames = num_frames
self.tubelet_size = tubelet_size
# assert use_flash_attn == use_fused_rmsnorm == use_fused_mlp, f'use_flash_attn:{use_flash_attn}, use_fused_rmsnorm{use_fused_rmsnorm} and use_fused_mlp{use_fused_mlp} should be consistent'
self.use_flash_attn = use_flash_attn
self.embed_dim = embed_dim
logger.info(f"Origin depth: {depth}")
depth = depth + x_vis_return_idx + 1
logger.info(f"New depth: {depth}")
self.depth = depth
self.x_vis_only = x_vis_only
if use_fused_rmsnorm:
raise NotImplementedError("use_fused_rmsnorm=True is not supported in this file: DropoutAddRMSNorm is not imported here")
norm_layer_for_blocks = partial(DropoutAddRMSNorm, eps=1e-6, prenorm=True)
else:
norm_layer_for_blocks = partial(RMSNorm, eps=1e-6)
self.norm_layer_for_blocks = norm_layer_for_blocks
self.patch_embed = PatchEmbed(
img_size, patch_size, in_chans, embed_dim,
num_frames=num_frames, tubelet_size=tubelet_size,
)
num_patches = self.patch_embed.num_patches
num_img_patches = self.patch_embed.num_img_patches
# print(f"num_patches: {num_patches}, num_img_patches: {num_img_patches}")
self.cls_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
# stolen from https://github.com/facebookresearch/mae_st/blob/dc072aaaf640d06892e23a33b42223a994efe272/models_vit.py#L65-L73C17
self.sep_pos_embed = sep_pos_embed
self.sep_image_video_pos_embed = sep_image_video_pos_embed
if sep_pos_embed:
raise NotImplementedError
else:
if sep_image_video_pos_embed:
logger.info("Use separate position embedding, for image and video we use different pos_embed.")
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.img_pos_embed = nn.Parameter(torch.zeros(1, num_img_patches + 1, embed_dim))
else:
logger.info("Use joint position embedding, for image and video we use same pos_embed.")
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)]
# choose which layer to use checkpoint
with_cp_list = [False] * depth
if use_checkpoint:
for idx in range(depth):
if idx < checkpoint_num:
with_cp_list[idx] = True
logger.info(f"Droppath rate: {dpr}")
logger.info(f"Checkpoint list: {with_cp_list}")
self.blocks = nn.ModuleList([
Block(embed_dim, num_heads, mlp_ratio, qkv_bias=qkv_bias,
norm_layer=norm_layer_for_blocks,
drop_path=dpr[i], init_values=init_values, attn_drop=0.,
use_flash_attn=use_flash_attn, use_fused_mlp=use_fused_mlp,
fused_mlp_heuristic=fused_mlp_heuristic,
with_cp=with_cp_list[i],
qk_normalization=qk_normalization,
layerscale_no_force_fp32=layerscale_no_force_fp32,
use_fused_rmsnorm=use_fused_rmsnorm)
for i in range(depth)])
if not self.x_vis_only:
self.clip_projector = AttentionPoolingBlock(
dim=embed_dim, num_heads=attn_pool_num_heads, qkv_bias=True, qk_scale=None,
drop=0., attn_drop=0., norm_layer=partial(nn.LayerNorm, eps=1e-5), out_dim=clip_embed_dim)
self.init_pos_embed()
trunc_normal_(self.cls_token, std=.02) # NOTE: not needed for chat; pretrained weights are always loaded
self.apply(self._init_weights)
self.fix_init_weight()
def init_pos_embed(self):
logger.info("Init pos_embed from sincos pos_embed")
if self.sep_pos_embed:
raise NotImplementedError
else:
pos_embed = get_3d_sincos_pos_embed(
self.pos_embed.shape[-1],
self.patch_embed.grid_size[1], # height & width
self.patch_embed.grid_size[0], # t_size
cls_token=True
)
self.pos_embed.data.copy_(torch.from_numpy(pos_embed).float().unsqueeze(0))
if self.sep_image_video_pos_embed:
img_pos_embed = get_3d_sincos_pos_embed(
self.pos_embed.shape[-1],
self.patch_embed.grid_size[1], # height & width
1,
cls_token=True
)
self.img_pos_embed.data.copy_(torch.from_numpy(img_pos_embed).float().unsqueeze(0))
def _init_weights(self, m):
if isinstance(m, nn.Linear):
trunc_normal_(m.weight, std=.02)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
def fix_init_weight(self):
def rescale(param, layer_id):
param.div_(math.sqrt(2.0 * layer_id))
for layer_id, layer in enumerate(self.blocks):
rescale(layer.attn.proj.weight.data, layer_id + 1)
rescale(layer.mlp.fc2.weight.data, layer_id + 1)
@property
def dtype(self):
return self.patch_embed.proj.weight.dtype
def get_num_layers(self):
return len(self.blocks)
@torch.jit.ignore
def no_weight_decay(self):
return {
'pos_embed',
'pos_embed_spatial',
'pos_embed_temporal',
'pos_embed_cls',
'img_pos_embed',
'cls_token'
}
# @torch.cuda.amp.autocast(enabled=False)
def forward(self, x, mask=None, use_image=False):
x = self.patch_embed(x.type(self.dtype))
# print(f"x.shape: {x.shape} x.dtype: {x.dtype}, model.dtype: {self.dtype}")
B, T, L, C = x.shape # T: temporal; L: spatial
x = x.view([B, T * L, C])
# append cls token
cls_tokens = self.cls_token.expand(B, -1, -1)
x = torch.cat((cls_tokens, x), dim=1)
# add pos_embed
if self.sep_pos_embed:
raise NotImplementedError
else:
if use_image:
if self.sep_image_video_pos_embed:
pos_embed = self.img_pos_embed
else:
# (1, num_img_patches + 1, embed_dim)
# print('origin pos_embed.shape:', self.pos_embed.shape)
cls_pos_embed = self.pos_embed[:, 0:1, :]
# print('cls_pos_embed.shape:', cls_pos_embed.shape)
img_pos_embed = self.pos_embed[:, 1:, :].view(1, self.num_frames, self.patch_embed.num_patches // self.num_frames, self.embed_dim).mean(dim=1)
# print('img_pos_embed.shape:', img_pos_embed.shape)
pos_embed = torch.cat([cls_pos_embed, img_pos_embed], dim=1)
# print('final img_pos_embed.shape:', pos_embed.shape)
else:
pos_embed = self.pos_embed
# print("pos_embed.shape:", pos_embed.shape)
x = x + pos_embed
# mask tokens, ~mask means visible
if mask is not None:
x = x[~mask].reshape(B, -1, C)
else:
x = x.reshape(B, -1, C)
residual = None
for idx, blk in enumerate(self.blocks):
if isinstance(x, tuple) and len(x) == 2:
x, residual = x
x = blk(x, residual=residual)
if isinstance(x, tuple) and len(x) == 2:
x, residual = x
if residual is not None:
x = x + residual
x_vis = x
if self.x_vis_only:
return x_vis
else:
x_pool_vis = self.clip_projector(x_vis)
return x_vis, x_pool_vis, None, None
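# Note on the `use_image` branch of `forward` above: when no separate image
# positional embedding is kept, the image embedding is derived from the video
# one by reshaping the non-cls rows to (T, L, C) and averaging over T.
# The block below is a minimal sketch of that averaging only. Nested lists
# stand in for tensors, and `average_over_frames` is a hypothetical helper
# that does not exist in this module.

```python
def average_over_frames(video_pos, num_frames):
    """Average a frame-major (T*L, C) table over its T frames, giving (L, C)."""
    if not video_pos or num_frames <= 0 or len(video_pos) % num_frames != 0:
        raise ValueError("row count must be a positive multiple of num_frames")
    patches_per_frame = len(video_pos) // num_frames
    channels = len(video_pos[0])
    averaged = []
    for patch in range(patches_per_frame):
        # Row t * L + l holds frame t, spatial patch l, which is the layout
        # implied by viewing the flat table as (T, L, C).
        rows = [video_pos[t * patches_per_frame + patch] for t in range(num_frames)]
        averaged.append([sum(r[c] for r in rows) / num_frames for c in range(channels)])
    return averaged


# Two frames, two patches per frame, one channel.
video_pos = [[0.0], [1.0], [2.0], [3.0]]
image_pos = average_over_frames(video_pos, num_frames=2)
```

# In the model itself the cls row is split off first and concatenated back in
# front of the averaged rows, so the image table has L + 1 rows.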
def pretrain_internvideo2_giant_patch14_224_clean(config):
model = PretrainVisionTransformer_clean(
in_chans=3, img_size=224, patch_size=14,
embed_dim=1408, depth=40, num_heads=16, mlp_ratio=48/11,
clip_embed_dim=config.vision_encoder.clip_embed_dim,
attn_pool_num_heads=16, qkv_bias=False,
drop_path_rate=0.25,
init_values=0.00001,
qk_normalization=True,
use_flash_attn=config.vision_encoder.get('use_flash_attn', True),
use_fused_rmsnorm=config.vision_encoder.get('use_fused_rmsnorm', True),
use_fused_mlp=config.vision_encoder.get('use_fused_mlp', True),
fused_mlp_heuristic=1,
layerscale_no_force_fp32=False,
num_frames=config.vision_encoder.num_frames,
tubelet_size=config.vision_encoder.tubelet_size,
sep_pos_embed=False,
sep_image_video_pos_embed=config.vision_encoder.sep_image_video_pos_embed,
use_checkpoint=config.vision_encoder.use_checkpoint,
checkpoint_num=config.vision_encoder.checkpoint_num,
x_vis_return_idx=config.vision_encoder.x_vis_return_idx,
x_vis_only=config.vision_encoder.x_vis_only,
)
if config.vision_encoder.pretrained is not None:
logger.info(f"Loading pretrained weights from {config.vision_encoder.pretrained}")
state_dict = torch.load(config.vision_encoder.pretrained, map_location='cpu')
interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=8) # NOTE 8f for stage1
message = model.load_state_dict(state_dict, strict=False)
logger.info(message)
else:
logger.info("No pretrained weights!!!")
return model
def pretrain_internvideo2_6b_patch14_224_clean(config):
model = PretrainVisionTransformer_clean(
in_chans=3, img_size=224, patch_size=14,
embed_dim=3200, depth=48, num_heads=25, mlp_ratio=4,
clip_embed_dim=config.vision_encoder.clip_embed_dim,
attn_pool_num_heads=16, qkv_bias=False,
drop_path_rate=0.3,
init_values=0.00001,
qk_normalization=True,
use_flash_attn=config.vision_encoder.get('use_flash_attn', True),
use_fused_rmsnorm=config.vision_encoder.get('use_fused_rmsnorm', True),
use_fused_mlp=config.vision_encoder.get('use_fused_mlp', True),
fused_mlp_heuristic=1,
layerscale_no_force_fp32=False,
num_frames=config.vision_encoder.num_frames,
tubelet_size=config.vision_encoder.tubelet_size,
sep_pos_embed=False,
sep_image_video_pos_embed=config.vision_encoder.sep_image_video_pos_embed,
use_checkpoint=config.vision_encoder.use_checkpoint,
checkpoint_num=config.vision_encoder.checkpoint_num,
x_vis_return_idx=config.vision_encoder.x_vis_return_idx,
x_vis_only=config.vision_encoder.x_vis_only
)
if config.vision_encoder.pretrained is not None:
logger.info(f"Loading pretrained weights from {config.vision_encoder.pretrained}")
state_dict = torch.load(config.vision_encoder.pretrained, map_location='cpu')
interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=8) # NOTE 8f for stage1
msg = model.load_state_dict(state_dict, strict=False)
logger.info(msg)
else:
logger.info("No pretrained weights!!!")
return model
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/internvideo2_encoder.py
================================================
"""
# Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py
"""
from typing import Optional, Tuple, Union, Dict
from dataclasses import dataclass
from functools import partial, reduce
from PIL import Image
import torch
import torch.utils.checkpoint
from torch import nn
import os
from transformers.image_processing_utils import BatchFeature, get_size_dict
from transformers.image_transforms import (
convert_to_rgb,
normalize,
rescale,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
ChannelDimension,
PILImageResampling,
to_numpy_array,
)
from llava.utils import rank0_print
from .internvideo2.vit_scale_clean import PretrainVisionTransformer_clean
from .internvideo2.vit_scale_clean import interpolate_pos_embed_internvideo2
class InternVideo2ImageProcessor:
def __init__(self, image_mean=(0.485, 0.456, 0.406), image_std=(0.229, 0.224, 0.225), size=(224, 224), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST):
crop_size = crop_size if crop_size is not None else {"height": size[0], "width": size[1]}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.image_mean = image_mean
self.image_std = image_std
self.size = size
self.resample = resample
self.rescale_factor = rescale_factor
self.data_format = data_format
self.crop_size = crop_size
def preprocess(self, images, return_tensors, target_size=None):
if isinstance(images, Image.Image):
images = [images]
else:
# to adapt video data
images = [to_numpy_array(image) for image in images]
assert isinstance(images, list)
if target_size is None:
target_size = self.size
transforms = [
convert_to_rgb,
to_numpy_array,
partial(resize, size=target_size, resample=self.resample, data_format=self.data_format),
partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
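# Note on `preprocess` above: `reduce(lambda x, f: [*map(f, x)], transforms, images)`
# applies each transform, in order, to every item of the list. The block below
# is a small standard-library sketch of that pattern. The `scale` and `shift`
# functions and the `apply_pipeline` name are illustrative stand-ins, not the
# `transformers` image transforms used by the real pipeline.

```python
from functools import partial, reduce


def scale(value, factor):
    """Stand-in for a rescale step."""
    return value * factor


def shift(value, offset):
    """Stand-in for a normalization step."""
    return value - offset


def apply_pipeline(items, transforms):
    """Apply each transform in order to every item, as `preprocess` does."""
    return reduce(lambda xs, f: [*map(f, xs)], transforms, items)


# `partial` fixes the keyword arguments so each step takes a single item.
pipeline = [partial(scale, factor=0.5), partial(shift, offset=1.0)]
result = apply_pipeline([2, 4], pipeline)
```

# The order of the list matters: swapping the two steps here would shift
# before scaling and give different values.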
class InternVideo2VisionConfig:
model_type = "internvideo2_vision_model"
def __init__(
self,
num_frames=4,
hidden_size=1408,
num_hidden_layers=40,
num_attention_heads=16,
num_channels=3,
image_size=224,
patch_size=14,
x_vis_return_idx=-2,
sep_image_video_pos_embed=True,
use_checkpoint=True,
checkpoint_num=40,
# **kwargs,
):
# super().__init__(**kwargs)
self.num_frames = num_frames
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.x_vis_return_idx = x_vis_return_idx
self.sep_image_video_pos_embed = sep_image_video_pos_embed
self.use_checkpoint = use_checkpoint
self.checkpoint_num = checkpoint_num
def build_vit(config, pt_type='origin'):
model = PretrainVisionTransformer_clean(
in_chans=config.num_channels, img_size=config.image_size, patch_size=config.patch_size,
embed_dim=config.hidden_size, depth=config.num_hidden_layers, num_heads=config.num_attention_heads, mlp_ratio=48/11,
# clip_embed_dim=config.vision_encoder.clip_embed_dim,
attn_pool_num_heads=16, qkv_bias=False,
drop_path_rate=0.25,
init_values=0.00001,
qk_normalization=True,
use_flash_attn=True,
use_fused_rmsnorm=False,
use_fused_mlp=False,
fused_mlp_heuristic=1,
layerscale_no_force_fp32=False,
num_frames=config.num_frames,
tubelet_size=1,
sep_pos_embed=False,
sep_image_video_pos_embed=config.sep_image_video_pos_embed,
use_checkpoint=config.use_checkpoint,
checkpoint_num=config.checkpoint_num,
x_vis_return_idx=config.x_vis_return_idx,
x_vis_only=True
)
ckpt_path = "OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/InternVideo2-1B_f4_vision.pt"
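    if not os.path.isfile(ckpt_path):
        # A missing checkpoint is a file problem, not an unimplemented feature.
        raise FileNotFoundError(f"Checkpoint not found at {ckpt_path}. Please download https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/InternVideo2-1B_f4_vision.pt")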
state_dict = torch.load(ckpt_path, map_location='cpu')
if config.num_frames != 4:
raise NotImplementedError
# make deepspeed zero3 happy
if config.image_size != 224:
interpolate_pos_embed_internvideo2(state_dict, model, orig_t_size=4)
message = model.load_state_dict(state_dict, strict=False)
rank0_print(message)
return model
class InternVideo2VisionTower(nn.Module):
def __init__(self, vision_tower, vision_tower_cfg, delay_load=False, pt_type='origin', image_size=224):
super().__init__()
self.is_loaded = False
self.pt_type = pt_type
self.config = InternVideo2VisionConfig(num_frames=vision_tower_cfg.mm_local_num_frames, x_vis_return_idx=vision_tower_cfg.mm_vision_select_layer, image_size=image_size)
self.vision_tower_name = vision_tower
self.image_processor = InternVideo2ImageProcessor(size=(image_size, image_size))
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
raise NotImplementedError
self.cfg_only = self.config
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.vision_tower = build_vit(self.config, pt_type=self.pt_type)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def forward(self, images):
if type(images) is list:
raise NotImplementedError
else:
# input: B T C H W
# output: B T*L C
T = images.shape[1]
images = images.permute(0, 2, 1, 3, 4)
image_embeds = self.vision_tower(images, use_image=(T == 1))
return image_embeds[:, 1:, :]
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
for p in self.vision_tower.parameters():
return p.dtype
@property
def device(self):
for p in self.vision_tower.parameters():
return p.device
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
# return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"]
@property
def image_size(self):
return self.config.image_size
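# Note on `InternVideo2VisionTower.forward` above: the input is laid out as
# (B, T, C, H, W), it is permuted to (B, C, T, H, W) for the encoder, and the
# cls token is dropped from the output, leaving T * L visual tokens.
# The block below is a shape-only sketch of that bookkeeping. Both helper
# names are hypothetical, and the token count assumes one temporal token per
# frame, which matches `tubelet_size=1` in `build_vit`.

```python
def permute_shape(shape, dims):
    """Return the shape a tensor would have after `permute(*dims)`."""
    return tuple(shape[d] for d in dims)


def visual_token_count(num_frames, image_size=224, patch_size=14):
    """Tokens left after the cls token is dropped (assumes tubelet_size=1)."""
    per_side = image_size // patch_size
    return num_frames * per_side * per_side


clip_shape = (2, 4, 3, 224, 224)  # B, T, C, H, W
encoder_input = permute_shape(clip_shape, (0, 2, 1, 3, 4))  # B, C, T, H, W
tokens = visual_token_count(num_frames=clip_shape[1])
```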
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/siglip_encoder.py
================================================
"""
# Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py
"""
from typing import Optional, Tuple, Union, Dict
from dataclasses import dataclass
from functools import partial, reduce
from PIL import Image
import torch
import torch.utils.checkpoint
from torch import nn
import os
from transformers.image_processing_utils import BatchFeature, get_size_dict
from transformers.image_transforms import (
convert_to_rgb,
normalize,
rescale,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
ChannelDimension,
PILImageResampling,
to_numpy_array,
)
from transformers.activations import ACT2FN
from transformers.modeling_outputs import BaseModelOutput, BaseModelOutputWithPooling
from transformers.modeling_utils import PreTrainedModel
from transformers import PretrainedConfig
from transformers.utils import ModelOutput
from llava.utils import rank0_print
class SigLipImageProcessor:
def __init__(self, image_mean=(0.5, 0.5, 0.5), image_std=(0.5, 0.5, 0.5), size=(384, 384), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST):
crop_size = crop_size if crop_size is not None else {"height": 384, "width": 384}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.image_mean = image_mean
self.image_std = image_std
self.size = size
self.resample = resample
self.rescale_factor = rescale_factor
self.data_format = data_format
self.crop_size = crop_size
def preprocess(self, images, return_tensors):
if isinstance(images, Image.Image):
images = [images]
else:
# to adapt video data
images = [to_numpy_array(image) for image in images]
assert isinstance(images, list)
transforms = [
convert_to_rgb,
to_numpy_array,
partial(resize, size=self.size, resample=self.resample, data_format=self.data_format),
partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
class SigLipVisionConfig(PretrainedConfig):
model_type = "siglip_vision_model"
def __init__(
self,
hidden_size=1152,
image_mean=(0.5, 0.5, 0.5),
intermediate_size=4304,
num_hidden_layers=27,
num_attention_heads=16,
num_channels=3,
image_size=384,
patch_size=14,
hidden_act="gelu_pytorch_tanh",
layer_norm_eps=1e-6,
attention_dropout=0.0,
**kwargs,
):
super().__init__(**kwargs)
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.attention_dropout = attention_dropout
self.layer_norm_eps = layer_norm_eps
self.hidden_act = hidden_act
self.image_mean = image_mean
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path: Union[str, os.PathLike], **kwargs) -> "PretrainedConfig":
cls._set_token_in_kwargs(kwargs)
config_dict, kwargs = cls.get_config_dict(pretrained_model_name_or_path, **kwargs)
# get the vision config dict if we are loading from SigLipConfig
if config_dict.get("model_type") == "siglip":
config_dict = config_dict["vision_config"]
if "model_type" in config_dict and hasattr(cls, "model_type") and config_dict["model_type"] != cls.model_type:
print(f"You are using a model of type {config_dict['model_type']} to instantiate a model of type " f"{cls.model_type}. This is not supported for all configurations of models and can yield errors.")
return cls.from_dict(config_dict, **kwargs)
@dataclass
# Copied from transformers.models.clip.modeling_clip.CLIPVisionModelOutput with CLIP->SigLip
class SigLipVisionModelOutput(ModelOutput):
"""
Base class for vision model's outputs that also contains image embeddings of the pooling of the last hidden states.
Args:
image_embeds (`torch.FloatTensor` of shape `(batch_size, output_dim)` *optional* returned when model is initialized with `with_projection=True`):
The image embeddings obtained by applying the projection layer to the pooler_output.
last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Sequence of hidden-states at the output of the last layer of the model.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
            Attention weights after the attention softmax, used to compute the weighted average in the self-attention
            heads.
"""
image_embeds: Optional[torch.FloatTensor] = None
last_hidden_state: torch.FloatTensor = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
class SigLipVisionEmbeddings(nn.Module):
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.config = config
self.embed_dim = config.hidden_size
self.image_size = config.image_size
self.patch_size = config.patch_size
self.patch_embedding = nn.Conv2d(
in_channels=config.num_channels,
out_channels=self.embed_dim,
kernel_size=self.patch_size,
stride=self.patch_size,
padding="valid",
)
self.num_patches = (self.image_size // self.patch_size) ** 2
self.num_positions = self.num_patches
self.position_embedding = nn.Embedding(self.num_positions, self.embed_dim)
self.register_buffer("position_ids", torch.arange(self.num_positions).expand((1, -1)), persistent=False)
def forward(self, pixel_values: torch.FloatTensor) -> torch.Tensor:
patch_embeds = self.patch_embedding(pixel_values) # shape = [*, width, grid, grid]
embeddings = patch_embeds.flatten(2).transpose(1, 2)
embeddings = embeddings + self.position_embedding(self.position_ids)
return embeddings
class SigLipAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
# Copied from transformers.models.clip.modeling_clip.CLIPAttention.__init__
def __init__(self, config):
super().__init__()
self.config = config
self.embed_dim = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.embed_dim // self.num_heads
if self.head_dim * self.num_heads != self.embed_dim:
raise ValueError(f"embed_dim must be divisible by num_heads (got `embed_dim`: {self.embed_dim} and `num_heads`:" f" {self.num_heads}).")
self.scale = self.head_dim**-0.5
self.dropout = config.attention_dropout
self.k_proj = nn.Linear(self.embed_dim, self.embed_dim)
self.v_proj = nn.Linear(self.embed_dim, self.embed_dim)
self.q_proj = nn.Linear(self.embed_dim, self.embed_dim)
self.out_proj = nn.Linear(self.embed_dim, self.embed_dim)
    def forward(
        self,
        hidden_states: torch.Tensor,
        attention_mask: Optional[torch.Tensor] = None,
        output_attentions: Optional[bool] = False,
    ) -> Tuple[torch.Tensor, torch.Tensor]:
"""Input shape: Batch x Time x Channel"""
batch_size, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(batch_size, q_len, self.num_heads, self.head_dim).transpose(1, 2)
k_v_seq_len = key_states.shape[-2]
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) * self.scale
if attn_weights.size() != (batch_size, self.num_heads, q_len, k_v_seq_len):
raise ValueError(f"Attention weights should be of size {(batch_size, self.num_heads, q_len, k_v_seq_len)}, but is" f" {attn_weights.size()}")
if attention_mask is not None:
if attention_mask.size() != (batch_size, 1, q_len, k_v_seq_len):
raise ValueError(f"Attention mask should be of size {(batch_size, 1, q_len, k_v_seq_len)}, but is {attention_mask.size()}")
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (batch_size, self.num_heads, q_len, self.head_dim):
raise ValueError(f"`attn_output` should be of size {(batch_size, self.num_heads, q_len, self.head_dim)}, but is" f" {attn_output.size()}")
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(batch_size, q_len, self.embed_dim)
attn_output = self.out_proj(attn_output)
return attn_output, attn_weights
# Copied from transformers.models.clip.modeling_clip.CLIPMLP with CLIP->SigLip
class SigLipMLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.activation_fn = ACT2FN[config.hidden_act]
self.fc1 = nn.Linear(config.hidden_size, config.intermediate_size)
self.fc2 = nn.Linear(config.intermediate_size, config.hidden_size)
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
hidden_states = self.fc1(hidden_states)
hidden_states = self.activation_fn(hidden_states)
hidden_states = self.fc2(hidden_states)
return hidden_states
# Copied from transformers.models.clip.modeling_clip.CLIPEncoderLayer with CLIP->SigLip
class SigLipEncoderLayer(nn.Module):
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.embed_dim = config.hidden_size
self.self_attn = SigLipAttention(config)
self.layer_norm1 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
self.mlp = SigLipMLP(config)
self.layer_norm2 = nn.LayerNorm(self.embed_dim, eps=config.layer_norm_eps)
# Ignore copy
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: torch.Tensor,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.FloatTensor]:
"""
Args:
hidden_states (`torch.FloatTensor`):
Input to the layer of shape `(batch, seq_len, embed_dim)`.
attention_mask (`torch.FloatTensor`):
Attention mask of shape `(batch, 1, q_len, k_v_seq_len)` where padding elements are indicated by very large negative values.
output_attentions (`bool`, *optional*, defaults to `False`):
                Whether or not to return the attention tensors of all attention layers. See `attentions` under
returned tensors for more detail.
"""
residual = hidden_states
hidden_states = self.layer_norm1(hidden_states)
hidden_states, attn_weights = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
output_attentions=output_attentions,
)
hidden_states = residual + hidden_states
residual = hidden_states
hidden_states = self.layer_norm2(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
outputs += (attn_weights,)
return outputs
class SigLipPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = SigLipVisionConfig
base_model_prefix = "siglip"
supports_gradient_checkpointing = True
def _init_weights(self, module):
"""Initialize the weights"""
pass
# Copied from transformers.models.clip.modeling_clip.CLIPEncoder with CLIP->SigLip
class SigLipEncoder(nn.Module):
"""
Transformer encoder consisting of `config.num_hidden_layers` self attention layers. Each layer is a
[`SigLipEncoderLayer`].
Args:
config: SigLipVisionConfig
"""
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.config = config
self.layers = nn.ModuleList([SigLipEncoderLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False
# Ignore copy
def forward(
self,
inputs_embeds,
attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutput]:
r"""
Args:
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation.
This is useful if you want more control over how to convert `input_ids` indices into associated vectors
than the model's internal embedding lookup matrix.
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
output_attentions (`bool`, *optional*):
                Whether or not to return the attention tensors of all attention layers. See `attentions` under
returned tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors
for more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
encoder_states = () if output_hidden_states else None
all_attentions = () if output_attentions else None
hidden_states = inputs_embeds
for encoder_layer in self.layers:
if output_hidden_states:
encoder_states = encoder_states + (hidden_states,)
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
encoder_layer.__call__,
hidden_states,
attention_mask,
output_attentions,
)
else:
layer_outputs = encoder_layer(
hidden_states,
attention_mask,
output_attentions=output_attentions,
)
hidden_states = layer_outputs[0]
if output_attentions:
all_attentions = all_attentions + (layer_outputs[1],)
if output_hidden_states:
encoder_states = encoder_states + (hidden_states,)
if not return_dict:
return tuple(v for v in [hidden_states, encoder_states, all_attentions] if v is not None)
return BaseModelOutput(last_hidden_state=hidden_states, hidden_states=encoder_states, attentions=all_attentions)
class SigLipVisionTransformer(nn.Module):
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.config = config
embed_dim = config.hidden_size
self.embeddings = SigLipVisionEmbeddings(config)
self.encoder = SigLipEncoder(config)
self.post_layernorm = nn.LayerNorm(embed_dim, eps=config.layer_norm_eps)
self.head = SigLipMultiheadAttentionPoolingHead(config)
def forward(
self,
pixel_values,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
r"""
Returns:
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
hidden_states = self.embeddings(pixel_values)
encoder_outputs = self.encoder(
inputs_embeds=hidden_states,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
last_hidden_state = encoder_outputs[0]
last_hidden_state = self.post_layernorm(last_hidden_state)
pooled_output = self.head(last_hidden_state)
if not return_dict:
return (last_hidden_state, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPooling(
last_hidden_state=last_hidden_state,
pooler_output=pooled_output,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
)
class SigLipMultiheadAttentionPoolingHead(nn.Module):
"""Multihead Attention Pooling."""
def __init__(self, config: SigLipVisionConfig):
super().__init__()
self.probe = nn.Parameter(torch.randn(1, 1, config.hidden_size))
self.attention = torch.nn.MultiheadAttention(config.hidden_size, config.num_attention_heads, batch_first=True)
self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.mlp = SigLipMLP(config)
def forward(self, hidden_state):
batch_size = hidden_state.shape[0]
probe = self.probe.repeat(batch_size, 1, 1)
hidden_state = self.attention(probe, hidden_state, hidden_state)[0]
residual = hidden_state
hidden_state = self.layernorm(hidden_state)
hidden_state = residual + self.mlp(hidden_state)
return hidden_state[:, 0]
class SigLipVisionModel(SigLipPreTrainedModel):
config_class = SigLipVisionConfig
main_input_name = "pixel_values"
_no_split_modules = ["SigLipEncoderLayer"]
def __init__(self, config: SigLipVisionConfig):
super().__init__(config)
self.vision_model = SigLipVisionTransformer(config)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self) -> nn.Module:
return self.vision_model.embeddings.patch_embedding
def forward(
self,
pixel_values,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPooling]:
r"""
Returns:
Examples:
```python
>>> from PIL import Image
>>> import requests
>>> from transformers import AutoProcessor, SigLipVisionModel
>>> model = SigLipVisionModel.from_pretrained("google/siglip-base-patch16-224")
>>> processor = AutoProcessor.from_pretrained("google/siglip-base-patch16-224")
>>> url = "http://images.cocodataset.org/val2017/000000039769.jpg"
>>> image = Image.open(requests.get(url, stream=True).raw)
>>> inputs = processor(images=image, return_tensors="pt")
>>> outputs = model(**inputs)
>>> last_hidden_state = outputs.last_hidden_state
>>> pooled_output = outputs.pooler_output # pooled features
```"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
return self.vision_model(
pixel_values=pixel_values,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
class SigLipVisionTower(nn.Module):
def __init__(self, vision_tower, vision_tower_cfg, delay_load=False):
super().__init__()
self.is_loaded = False
self.config = SigLipVisionConfig()
self.vision_tower_name = vision_tower
self.image_processor = SigLipImageProcessor()
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = self.config
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.vision_tower = SigLipVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
del self.vision_tower.vision_model.encoder.layers[-1:]
self.vision_tower.vision_model.head = nn.Identity()
self.vision_tower.requires_grad_(False)
self.is_loaded = True
    def forward(self, images):
        if type(images) is list:
            image_features = []
            for image in images:
                image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
                image_feature = image_forward_out.hidden_states[-1].to(image.dtype)
                # Check the per-image tensor; `image_features` is the list being built.
                assert image_feature.shape[-2] == 729
                image_features.append(image_feature)
        else:
            image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
            image_features = image_forward_outs.hidden_states[-1].to(images.dtype)
            assert image_features.shape[-2] == 729
        return image_features
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
for p in self.vision_tower.parameters():
return p.dtype
@property
def device(self):
for p in self.vision_tower.parameters():
return p.device
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
# return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"]
@property
def image_size(self):
return self.config.image_size
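# Illustrative sketch, not part of the original module. The value 729 asserted in
# `forward` is what the `num_patches` property yields for a 384-pixel input with
# 14-pixel patches (27 x 27). Those two numbers are assumptions about the SigLIP
# checkpoint in use; the actual config values are not shown in this file excerpt.

```python
def num_patches(image_size: int, patch_size: int) -> int:
    """Mirror of the `num_patches` property: square of the patches per side."""
    per_side = image_size // patch_size  # integer division, as in the property
    return per_side * per_side


# Assumed values (384 / 14); substitute the real config numbers if they differ.
patches = num_patches(384, 14)
```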
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/umt/vit.py
================================================
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.utils.checkpoint as checkpoint
from functools import partial
try:
from flash_attn import flash_attn_qkvpacked_func
except ImportError:
print("You need to install flash_attn")
from timm.models.layers import drop_path, to_2tuple, trunc_normal_
# logger = logging.getLogger(__name__)
class DropPath(nn.Module):
"""Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).
"""
def __init__(self, drop_prob=None):
super(DropPath, self).__init__()
self.drop_prob = drop_prob
def forward(self, x):
return drop_path(x, self.drop_prob, self.training)
def extra_repr(self) -> str:
return 'p={}'.format(self.drop_prob)
class Mlp(nn.Module):
def __init__(self, in_features, hidden_features=None, out_features=None, act_layer=nn.GELU, drop=0.):
super().__init__()
out_features = out_features or in_features
hidden_features = hidden_features or in_features
self.fc1 = nn.Linear(in_features, hidden_features)
self.act = act_layer()
self.fc2 = nn.Linear(hidden_features, out_features)
self.drop = nn.Dropout(drop)
def forward(self, x):
x = self.fc1(x)
x = self.act(x)
x = self.drop(x)
x = self.fc2(x)
x = self.drop(x)
return x
class Attention(nn.Module):
def __init__(
self, dim, num_heads=8, qkv_bias=False, qk_scale=None, attn_drop=0.,
proj_drop=0., attn_head_dim=None,
attn_type='flash_v2'):
super().__init__()
self.num_heads = num_heads
head_dim = dim // num_heads
if attn_head_dim is not None:
head_dim = attn_head_dim
all_head_dim = head_dim * self.num_heads
self.scale = qk_scale or head_dim ** -0.5
self.qkv = nn.Linear(dim, all_head_dim * 3, bias=False)
if qkv_bias:
self.q_bias = nn.Parameter(torch.zeros(all_head_dim))
self.v_bias = nn.Parameter(torch.zeros(all_head_dim))
else:
self.q_bias = None
self.v_bias = None
if attn_type not in ['origin', 'flash_v2']:
raise NotImplementedError(f"Not support attn_type: {attn_type}")
print('umt:', f'attn_type: {attn_type}')
self.attn_type = attn_type
if attn_type == 'flash_v2':
self.attn_drop = attn_drop
else:
self.attn_drop = nn.Dropout(attn_drop)
self.proj = nn.Linear(all_head_dim, dim)
self.proj_drop = nn.Dropout(proj_drop)
def forward(self, x):
B, N, C = x.shape
qkv_bias = None
if self.q_bias is not None:
qkv_bias = torch.cat((self.q_bias, torch.zeros_like(self.v_bias, requires_grad=False), self.v_bias))
# qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
qkv = F.linear(input=x, weight=self.qkv.weight, bias=qkv_bias)
if self.attn_type == 'flash_v2':
qkv = qkv.reshape(B, N, 3, self.num_heads, -1)
x = flash_attn_qkvpacked_func(qkv, dropout_p=self.attn_drop, softmax_scale=self.scale, causal=False).reshape(B, N, -1)
else:
qkv = qkv.reshape(B, N, 3, self.num_heads, -1).permute(2, 0, 3, 1, 4)
q, k, v = qkv[0], qkv[1], qkv[
2] # make torchscript happy (cannot use tensor as tuple)
# B num_heads N head_dim
q = q * self.scale
attn = (q @ k.transpose(-2, -1))
attn = attn.softmax(dim=-1)
attn = self.attn_drop(attn)
x = (attn @ v).transpose(1, 2).reshape(B, N, -1)
x = self.proj(x)
x = self.proj_drop(x)
return x
class Block(nn.Module):
def __init__(self, dim, num_heads, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop=0., attn_drop=0.,
drop_path=0., init_values=None, act_layer=nn.GELU, norm_layer=nn.LayerNorm,
attn_head_dim=None):
super().__init__()
self.norm1 = norm_layer(dim)
self.attn = Attention(
dim, num_heads=num_heads, qkv_bias=qkv_bias, qk_scale=qk_scale,
attn_drop=attn_drop, proj_drop=drop, attn_head_dim=attn_head_dim)
# NOTE: drop path for stochastic depth, we shall see if this is better than dropout here
self.drop_path = DropPath(drop_path) if drop_path > 0. else nn.Identity()
self.norm2 = norm_layer(dim)
mlp_hidden_dim = int(dim * mlp_ratio)
self.mlp = Mlp(in_features=dim, hidden_features=mlp_hidden_dim, act_layer=act_layer, drop=drop)
if init_values is not None and init_values > 0:
self.gamma_1 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True)
self.gamma_2 = nn.Parameter(init_values * torch.ones((dim)),requires_grad=True)
else:
self.gamma_1, self.gamma_2 = None, None
def forward(self, x):
if self.gamma_1 is None:
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
else:
x = x + self.drop_path(self.gamma_1 * self.attn(self.norm1(x)))
x = x + self.drop_path(self.gamma_2 * self.mlp(self.norm2(x)))
return x
class PatchEmbed(nn.Module):
""" Image to Patch Embedding
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, num_frames=16, tubelet_size=2):
super().__init__()
img_size = to_2tuple(img_size)
patch_size = to_2tuple(patch_size)
self.tubelet_size = int(tubelet_size)
num_patches = (img_size[1] // patch_size[1]) * (img_size[0] // patch_size[0]) * (num_frames // self.tubelet_size)
self.img_size = img_size
self.patch_size = patch_size
self.num_patches = num_patches
self.proj = nn.Conv3d(
in_channels=in_chans, out_channels=embed_dim,
kernel_size=(self.tubelet_size, patch_size[0], patch_size[1]),
stride=(self.tubelet_size, patch_size[0], patch_size[1])
)
print('umt:', f'Num of patches: {num_patches}')
def forward(self, x, **kwargs):
B, C, T, H, W = x.shape
x = x.to(self.proj.weight.device)
# FIXME look at relaxing size constraints
# assert H == self.img_size[0] and W == self.img_size[1], \
# f"Input image size ({H}*{W}) doesn't match model ({self.img_size[0]}*{self.img_size[1]})."
x = self.proj(x).flatten(2).transpose(1, 2)
return x
# sin-cos position encoding
# https://github.com/jadore801120/attention-is-all-you-need-pytorch/blob/master/transformer/Models.py#L31
def get_sinusoid_encoding_table(n_position, d_hid, ckpt_num_frame=-1, cur_frame=12):
''' Sinusoid position encoding table '''
# TODO: make it with torch instead of numpy
def get_position_angle_vec(position):
return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]
if ckpt_num_frame != -1 and ckpt_num_frame != cur_frame:
print('umt:', f"Interpolate position embedding")
print('umt:', f"Testing frame: {cur_frame}")
print('umt:', f"Checkpoint frame: {ckpt_num_frame}")
T = ckpt_num_frame # checkpoint frame
new_T = cur_frame # testing frame
n_position = n_position // new_T * T # generate checkpoint position embedding
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
sinusoid_table = torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
# interpolate
P = int((n_position // T) ** 0.5)
C = d_hid
sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(-1, C, T) # BHW, C, T
sinusoid_table = torch.nn.functional.interpolate(sinusoid_table, size=new_T, mode='linear')
sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(0, 4, 1, 2, 3) # B, T, H, W, C
sinusoid_table = sinusoid_table.flatten(1, 3)
return sinusoid_table
else:
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
return torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
def get_sinusoid_encoding_table2(n_position=784, d_hid=1024, cur_frame=8, ckpt_num_frame=4, pre_n_position=784):
''' Sinusoid position encoding table '''
# TODO: make it with torch instead of numpy
def get_position_angle_vec(position):
return [position / np.power(10000, 2 * (hid_j // 2) / d_hid) for hid_j in range(d_hid)]
# generate checkpoint position embedding
sinusoid_table = np.array([get_position_angle_vec(pos_i) for pos_i in range(pre_n_position)])
sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) # dim 2i
sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) # dim 2i+1
sinusoid_table = torch.tensor(sinusoid_table, dtype=torch.float, requires_grad=False).unsqueeze(0)
print(f"n_position: {n_position}")
print(f"pre_n_position: {pre_n_position}")
if n_position != pre_n_position:
T = ckpt_num_frame # checkpoint frame
P = 14 # checkpoint size
C = d_hid
new_P = int((n_position // cur_frame) ** 0.5) # testing size
print(f'Pretraining uses 14x14, but current version is {new_P}x{new_P}')
print(f'Interpolate the position embedding')
sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
sinusoid_table = sinusoid_table.reshape(-1, P, P, C).permute(0, 3, 1, 2)
sinusoid_table = torch.nn.functional.interpolate(
sinusoid_table, size=(new_P, new_P), mode='bicubic', align_corners=False)
# BT, C, H, W -> BT, H, W, C -> B, T, H, W, C
sinusoid_table = sinusoid_table.permute(0, 2, 3, 1).reshape(-1, T, new_P, new_P, C)
sinusoid_table = sinusoid_table.flatten(1, 3) # B, THW, C
if cur_frame != ckpt_num_frame:
print(f'Pretraining uses 4 frames, but current frame is {cur_frame}')
print(f'Interpolate the position embedding')
T = ckpt_num_frame # checkpoint frame
new_T = cur_frame # testing frame
# interpolate
P = int((n_position // cur_frame) ** 0.5) # testing size
C = d_hid
sinusoid_table = sinusoid_table.reshape(-1, T, P, P, C)
sinusoid_table = sinusoid_table.permute(0, 2, 3, 4, 1).reshape(-1, C, T) # BHW, C, T
sinusoid_table = torch.nn.functional.interpolate(sinusoid_table, size=new_T, mode='linear')
sinusoid_table = sinusoid_table.reshape(1, P, P, C, new_T).permute(0, 4, 1, 2, 3) # B, T, H, W, C
sinusoid_table = sinusoid_table.flatten(1, 3) # B, THW, C
return sinusoid_table
class PretrainVisionTransformerEncoder(nn.Module):
""" Vision Transformer with support for patch or hybrid CNN input stage
"""
def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768, depth=12,
num_heads=12, mlp_ratio=4., qkv_bias=False, qk_scale=None, drop_rate=0., attn_drop_rate=0.,
drop_path_rate=0., norm_layer=nn.LayerNorm, init_values=None, num_frames=8, tubelet_size=1,
use_learnable_pos_emb=False,
use_checkpoint=False, checkpoint_num=0,
ckpt_num_frame=-1, with_ln=True, return_index=-1
):
super().__init__()
self.num_features = self.embed_dim = embed_dim # num_features for consistency with other models
self.patch_embed = PatchEmbed(
img_size=img_size, patch_size=patch_size, in_chans=in_chans, embed_dim=embed_dim,
num_frames=num_frames, tubelet_size=tubelet_size
)
num_patches = self.patch_embed.num_patches
self.depth = depth + return_index + 1
self.use_checkpoint = use_checkpoint
self.checkpoint_num = checkpoint_num
print('umt:', f"Use checkpoint: {use_checkpoint}")
print('umt:', f"Checkpoint number: {checkpoint_num}")
print('umt:', f"Real running depth: {self.depth}")
# TODO: Add the cls token
if use_learnable_pos_emb:
self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, embed_dim))
self.img_pos_embed = nn.Parameter(torch.zeros(1, num_patches//(num_frames//tubelet_size) + 1, embed_dim))
else:
# sine-cosine positional embeddings
if img_size != 224:
self.pos_embed = get_sinusoid_encoding_table2(num_patches, embed_dim, ckpt_num_frame=ckpt_num_frame, cur_frame=num_frames//tubelet_size)
self.img_pos_embed = get_sinusoid_encoding_table2(num_patches//(num_frames//tubelet_size), embed_dim, cur_frame=1, ckpt_num_frame=1, pre_n_position=14*14)
else:
self.pos_embed = get_sinusoid_encoding_table(num_patches, embed_dim, ckpt_num_frame=ckpt_num_frame, cur_frame=num_frames//tubelet_size)
self.img_pos_embed = get_sinusoid_encoding_table(num_patches//(num_frames//tubelet_size), embed_dim)
dpr = [x.item() for x in torch.linspace(0, drop_path_rate, depth)] # stochastic depth decay rule
self.blocks = nn.ModuleList([
Block(
dim=embed_dim, num_heads=num_heads, mlp_ratio=mlp_ratio, qkv_bias=qkv_bias, qk_scale=qk_scale,
drop=drop_rate, attn_drop=attn_drop_rate, drop_path=dpr[i], norm_layer=norm_layer,
init_values=init_values)
for i in range(self.depth)])
if with_ln:
self.vision_layernorm = nn.LayerNorm(embed_dim, eps=1e-12)
else:
self.vision_layernorm = nn.Identity()
if use_learnable_pos_emb:
trunc_normal_(self.pos_embed, std=.02)
@torch.jit.ignore
def no_weight_decay(self):
return {'pos_embed', 'cls_token'}
def forward_features(self, x, use_image=False):
x = self.patch_embed(x)
if use_image:
x = x + self.img_pos_embed.type_as(x).to(x.device).clone().detach()
else:
x = x + self.pos_embed.type_as(x).to(x.device).clone().detach()
B, _, C = x.shape
x_vis = x
for idx, blk in enumerate(self.blocks):
if self.use_checkpoint and idx < self.checkpoint_num:
x_vis = checkpoint.checkpoint(blk, x_vis)
else:
x_vis = blk(x_vis)
# with ln or not
x_vis = self.vision_layernorm(x_vis)
return x_vis
def forward(self, x, use_image=False):
x_vis = self.forward_features(x, use_image)
return x_vis
class PretrainVisionTransformer(nn.Module):
""" Vision Transformer with support for patch or hybrid CNN input stage
"""
def __init__(self,
img_size=224,
patch_size=16,
encoder_in_chans=3,
encoder_embed_dim=768,
encoder_depth=12,
encoder_num_heads=12,
mlp_ratio=4.,
qkv_bias=True,
qk_scale=None,
drop_rate=0.,
attn_drop_rate=0.,
drop_path_rate=0.,
norm_layer=partial(nn.LayerNorm, eps=1e-6),
init_values=0.,
use_learnable_pos_emb=False,
num_frames=8,
tubelet_size=1,
use_checkpoint=False,
checkpoint_num=0,
ckpt_num_frame=4, # the pretrained model uses 4 frames
return_index=-1,
with_ln=False
):
super().__init__()
self.encoder = PretrainVisionTransformerEncoder(
img_size=img_size,
patch_size=patch_size,
in_chans=encoder_in_chans,
embed_dim=encoder_embed_dim,
depth=encoder_depth,
num_heads=encoder_num_heads,
mlp_ratio=mlp_ratio,
qkv_bias=qkv_bias,
qk_scale=qk_scale,
drop_rate=drop_rate,
attn_drop_rate=attn_drop_rate,
drop_path_rate=drop_path_rate,
norm_layer=norm_layer,
init_values=init_values,
num_frames=num_frames,
tubelet_size=tubelet_size,
use_learnable_pos_emb=use_learnable_pos_emb,
use_checkpoint=use_checkpoint,
checkpoint_num=checkpoint_num,
ckpt_num_frame=ckpt_num_frame,
with_ln=with_ln,
return_index=return_index
)
print('umt:', f'With LN: {with_ln}')
print('umt:', f'Total {encoder_depth} layer')
print('umt:', f'Return {encoder_depth+return_index+1}-th layer')
self.apply(self._init_weights)
def _init_weights(self, m):
if isinstance(m, nn.Linear):
nn.init.xavier_uniform_(m.weight)
if isinstance(m, nn.Linear) and m.bias is not None:
nn.init.constant_(m.bias, 0)
elif isinstance(m, nn.LayerNorm):
nn.init.constant_(m.bias, 0)
nn.init.constant_(m.weight, 1.0)
@torch.jit.ignore
def no_weight_decay(self):
return {'pos_embed', 'cls_token', 'clip_pos_embed'}
def forward(self, x, use_image=False):
T = x.shape[2]
x_vis = self.encoder(x, use_image) # [B, N_vis, C_e]
B, TL, C = x_vis.shape
x_vis = x_vis.view(B, T, TL // T, C)
return x_vis
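# Illustrative sketch, not part of the original module. It restates the formula in
# `get_position_angle_vec` with the standard library only: even channels take the
# sine and odd channels the cosine of position / 10000^(2*(j//2)/d_hid). The helper
# names below are hypothetical, and the interpolation branches are not reproduced.

```python
import math


def sinusoid_row(position: int, d_hid: int) -> list:
    """One row of the table: sin on even channels, cos on odd channels."""
    angles = [position / math.pow(10000, 2 * (j // 2) / d_hid) for j in range(d_hid)]
    return [math.sin(a) if j % 2 == 0 else math.cos(a) for j, a in enumerate(angles)]


def sinusoid_table(n_position: int, d_hid: int) -> list:
    """Full table of shape [n_position][d_hid], before the leading unsqueeze(0)."""
    return [sinusoid_row(p, d_hid) for p in range(n_position)]


table = sinusoid_table(n_position=4, d_hid=6)
```

# Position 0 always encodes to alternating 0 and 1, because sin(0) = 0 and cos(0) = 1.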
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_encoder/umt_encoder.py
================================================
"""
# Adapted from https://huggingface.co/MILVLG/imp-v1-3b/blob/main/vision_encoder.py
"""
from typing import Optional, Tuple, Union, Dict
from dataclasses import dataclass
from functools import partial, reduce
from PIL import Image
import torch
import torch.utils.checkpoint
from torch import nn
import os
from transformers.image_processing_utils import BatchFeature, get_size_dict
from transformers.image_transforms import (
convert_to_rgb,
normalize,
rescale,
resize,
to_channel_dimension_format,
)
from transformers.image_utils import (
ChannelDimension,
PILImageResampling,
to_numpy_array,
)
from llava.utils import rank0_print
from .umt.vit import PretrainVisionTransformer
class UMTImageProcessor:
def __init__(self, image_mean=(0.485, 0.456, 0.406), image_std=(0.229, 0.224, 0.225), size=(224, 224), crop_size: Dict[str, int] = None, resample=PILImageResampling.BICUBIC, rescale_factor=1 / 255, data_format=ChannelDimension.FIRST):
crop_size = crop_size if crop_size is not None else {"height": size[0], "width": size[1]}
crop_size = get_size_dict(crop_size, default_to_square=True, param_name="crop_size")
self.image_mean = image_mean
self.image_std = image_std
self.size = size
self.resample = resample
self.rescale_factor = rescale_factor
self.data_format = data_format
self.crop_size = crop_size
def preprocess(self, images, return_tensors, target_size=None):
if isinstance(images, Image.Image):
images = [images]
else:
# to adapt video data
images = [to_numpy_array(image) for image in images]
assert isinstance(images, list)
if target_size is None:
target_size = self.size
transforms = [
convert_to_rgb,
to_numpy_array,
partial(resize, size=target_size, resample=self.resample, data_format=self.data_format),
partial(rescale, scale=self.rescale_factor, data_format=self.data_format),
partial(normalize, mean=self.image_mean, std=self.image_std, data_format=self.data_format),
partial(to_channel_dimension_format, channel_dim=self.data_format, input_channel_dim=self.data_format),
]
images = reduce(lambda x, f: [*map(f, x)], transforms, images)
data = {"pixel_values": images}
return BatchFeature(data=data, tensor_type=return_tensors)
class UMTVisionConfig:
model_type = "umt_vision_model"
def __init__(
self,
num_frames=4,
hidden_size=1024,
num_hidden_layers=24,
num_attention_heads=16,
num_channels=3,
image_size=224,
patch_size=16,
return_idx=-2
# **kwargs,
):
# super().__init__(**kwargs)
self.num_frames = num_frames
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.num_channels = num_channels
self.patch_size = patch_size
self.image_size = image_size
self.return_idx = return_idx
def build_vit(config, pt_type='origin'):
model = PretrainVisionTransformer(
img_size=config.image_size,
patch_size=16,
encoder_embed_dim=1024,
encoder_depth=24,
encoder_num_heads=16,
drop_path_rate=0.,
num_frames=config.num_frames,
tubelet_size=1,
use_checkpoint=True,
checkpoint_num=24,
return_index=config.return_idx,
with_ln=True, # merge vision_layernorm in it
)
ckpt_path = "OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/UMT-L_f4_vision.pt"
if not os.path.isfile(ckpt_path):
raise NotImplementedError("Please download https://huggingface.co/OpenGVLab/Video_Encoders_for_Training_VideoChat-Flash/UMT-L_f4_vision.pt")
old_state_dict = torch.load(ckpt_path, map_location='cpu')
state_dict = {}
for k in old_state_dict:
if k.startswith("encoder."):
if k.startswith("encoder.norm"):
state_dict[k.replace('encoder.norm', 'encoder.vision_layernorm')] = old_state_dict[k]
else:
state_dict[k] = old_state_dict[k]
del old_state_dict
msg = model.load_state_dict(state_dict, strict=False)
print('umt:', f"Loading pretrained weights from {ckpt_path}", msg)
return model
class UMTVisionTower(nn.Module):
def __init__(self, vision_tower, vision_tower_cfg, delay_load=False, pt_type='origin', image_size=224):
super().__init__()
self.is_loaded = False
self.pt_type = pt_type
self.config = UMTVisionConfig(num_frames=vision_tower_cfg.mm_local_num_frames, return_idx=vision_tower_cfg.mm_vision_select_layer, image_size=image_size)
self.vision_tower_name = vision_tower
self.image_processor = UMTImageProcessor(size=(image_size, image_size))
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(vision_tower_cfg, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(vision_tower_cfg, "mm_tunable_parts") and "mm_vision_tower" in vision_tower_cfg.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = self.config
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.vision_tower = build_vit(self.config, pt_type=self.pt_type)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def forward(self, images):
if type(images) is list:
raise NotImplementedError
else:
# input: B T C H W
# output: B T*L C
T = images.shape[1]
images = images.permute(0, 2, 1, 3, 4)
image_embeds = self.vision_tower(images, use_image=(T == 1))
B, T, L, C = image_embeds.shape
image_embeds = image_embeds.reshape(B, -1, C)
return image_embeds
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
for p in self.vision_tower.parameters():
return p.dtype
@property
def device(self):
for p in self.vision_tower.parameters():
return p.device
@property
def hidden_size(self):
return self.config.hidden_size
@property
def num_patches(self):
return (self.config.image_size // self.config.patch_size) ** 2
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
# return self.model_config["vision_cfg"]["image_size"] // self.model_config["vision_cfg"]["patch_size"]
@property
def image_size(self):
return self.config.image_size
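# Illustrative sketch, not part of the original module. It traces the shape
# bookkeeping of `UMTVisionTower.forward` and the block count implied by
# `return_idx`, assuming `tubelet_size=1` as set in `build_vit`. The function names
# are hypothetical.

```python
def umt_output_shape(batch, frames, image_size=224, patch_size=16, hidden_size=1024):
    """Shape returned by forward: (B, T*L, C) with L patches per frame."""
    per_side = image_size // patch_size
    tokens_per_frame = per_side * per_side  # L
    return (batch, frames * tokens_per_frame, hidden_size)


def running_depth(depth=24, return_index=-2):
    """Number of transformer blocks actually built: depth + return_index + 1."""
    return depth + return_index + 1


shape = umt_output_shape(batch=2, frames=4)
blocks = running_depth()
```

# With the defaults of `UMTVisionConfig`, 4 frames of 14 x 14 patches give 784 tokens
# per clip, and only the first 23 of the 24 blocks are instantiated.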
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_projector/builder.py
================================================
import torch
import torch.nn as nn
import re
from .tome16_mlp_hd64 import ToMe16_mlp_hd64
class IdentityMap(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x, *args, **kwargs):
return x
@property
def config(self):
return {"mm_projector_type": "identity"}
class SimpleResBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.pre_norm = nn.LayerNorm(channels)
self.proj = nn.Sequential(nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))
def forward(self, x):
x = self.pre_norm(x)
return x + self.proj(x)
def build_vision_projector(config, delay_load=False, **kwargs):
projector_type = getattr(config, "mm_projector_type", "linear")
if projector_type == 'tome16_mlp_hd64':
return ToMe16_mlp_hd64(config, kwargs["vision_cfg"])
if projector_type == "linear":
return nn.Linear(config.mm_hidden_size, config.hidden_size)
mlp_gelu_match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
if mlp_gelu_match:
mlp_depth = int(mlp_gelu_match.group(1))
modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
for _ in range(1, mlp_depth):
modules.append(nn.GELU())
modules.append(nn.Linear(config.hidden_size, config.hidden_size))
return nn.Sequential(*modules)
mlp_gelu_resnet_match = re.match(r"^mlp(\d+)x_res(\d+)x_gelu$", projector_type)
if mlp_gelu_resnet_match:
mlp_depth = int(mlp_gelu_resnet_match.group(1))
res_depth = int(mlp_gelu_resnet_match.group(2))
modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
for _ in range(1, mlp_depth):
modules.append(nn.GELU())
modules.append(nn.Linear(config.hidden_size, config.hidden_size))
for _ in range(res_depth):
modules.append(SimpleResBlock(config.hidden_size))
return nn.Sequential(*modules)
if projector_type == "identity":
return IdentityMap()
raise ValueError(f"Unknown projector type: {projector_type}")
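# Illustrative sketch, not part of the original module. It shows how the
# `mm_projector_type` string is dispatched in `build_vision_projector`, returning a
# plain description instead of building `nn.Module` objects. `describe_projector`
# is a hypothetical helper used only for this example.

```python
import re


def describe_projector(projector_type: str) -> dict:
    """Parse a projector type string the same way build_vision_projector does."""
    if projector_type in ("tome16_mlp_hd64", "linear", "identity"):
        return {"kind": projector_type}
    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        return {"kind": "mlp", "mlp_depth": int(match.group(1))}
    match = re.match(r"^mlp(\d+)x_res(\d+)x_gelu$", projector_type)
    if match:
        return {
            "kind": "mlp_res",
            "mlp_depth": int(match.group(1)),
            "res_depth": int(match.group(2)),
        }
    raise ValueError(f"Unknown projector type: {projector_type}")


plain = describe_projector("mlp2x_gelu")
residual = describe_projector("mlp2x_res3x_gelu")
```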
================================================
FILE: xtuner-eval_niah/llava/model/multimodal_projector/tome16_mlp_hd64.py
================================================
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
# This source code is licensed under the license found in the
# LICENSE file in the root directory of this source tree.
# --------------------------------------------------------
import torch
import torch.nn as nn
from typing import Callable, Tuple
import torch.nn.functional as F
def bipartite_soft_matching(
metric: torch.Tensor,
r: int,
) -> Tuple[Callable, Callable]:
"""
Applies ToMe with a balanced matching set (50%, 50%).
Input size is [batch, tokens, channels].
r indicates the number of tokens to remove (max 50% of tokens).
"""
protected = 0
t = metric.shape[1]
r = min(r, (t - protected) // 2)
assert r > 0, r
with torch.no_grad():
metric = metric / metric.norm(dim=-1, keepdim=True)
a, b = metric[..., ::2, :], metric[..., 1::2, :]
scores = a @ b.transpose(-1, -2)
node_max, node_idx = scores.max(dim=-1)
edge_idx = node_max.argsort(dim=-1, descending=True)[..., None]
unm_idx = edge_idx[..., r:, :] # Unmerged Tokens
src_idx = edge_idx[..., :r, :] # Merged Tokens
dst_idx = node_idx[..., None].gather(dim=-2, index=src_idx)
def merge(x: torch.Tensor, mode="mean") -> torch.Tensor:
src, dst = x[..., ::2, :], x[..., 1::2, :]
n, t1, c = src.shape
unm = src.gather(dim=-2, index=unm_idx.expand(n, t1 - r, c))
src = src.gather(dim=-2, index=src_idx.expand(n, r, c))
dst = dst.scatter_add(-2, dst_idx.expand(n, r, c), src) # , reduce=mode)
return torch.cat([unm, dst], dim=1)
def unmerge(x: torch.Tensor) -> torch.Tensor:
unm_len = unm_idx.shape[1]
unm, dst = x[..., :unm_len, :], x[..., unm_len:, :]
n, _, c = unm.shape
src = dst.gather(dim=-2, index=dst_idx.expand(n, r, c))
out = torch.zeros(n, metric.shape[1], c, device=x.device, dtype=x.dtype)
out[..., 1::2, :] = dst
out.scatter_(dim=-2, index=(2 * unm_idx).expand(n, unm_len, c), src=unm)
out.scatter_(dim=-2, index=(2 * src_idx).expand(n, r, c), src=src)
return out
return merge, unmerge
def merge_wavg(
merge: Callable, x: torch.Tensor, size: torch.Tensor = None
) -> Tuple[torch.Tensor, torch.Tensor]:
"""
Applies the merge function by taking a weighted average based on token size.
Returns the merged tensor and the new token sizes.
"""
if size is None:
size = torch.ones_like(x[..., 0, None])
x = merge(x * size, mode="sum")
size = merge(size, mode="sum")
x = x / size
return x, size
class ToMe16_mlp_hd64(nn.Module):
def __init__(self, config, vision_cfg):
super().__init__()
self._config = config
self.mm_hidden_size = config.mm_hidden_size
self.hw = vision_cfg.image_size // vision_cfg.patch_size
self.num_attention_heads = vision_cfg.num_attention_heads
self.mlp = nn.Sequential(nn.Linear(config.mm_hidden_size, config.hidden_size),
nn.GELU(),
nn.Linear(config.hidden_size, config.hidden_size))
self.max_pos_hw = self.hw
self.max_pos_num_frames = config.mm_pos_num_frames
# self._set_3d_pos_cache(max_grid_size=self.max_pos_hw, max_t_size=self.max_pos_num_frames)
self.num_image_patches_per_side = 8
self.num_frame_patches_per_side = 4
def merge_tokens(self, x, target_num_token):
r"""
x = torch.randn(10, 2560, c)
x = self.merge_tokens(x, target_num_token=1280)
"""
size = None
b, p, c = x.shape
tmp_p = p
r_merge_list = []
assert tmp_p > target_num_token, f"{tmp_p} should be greater than {target_num_token}"
while tmp_p != target_num_token:
if tmp_p - target_num_token <= (tmp_p // 2):
r_merge_list.append(tmp_p - target_num_token)
break
else:
r_merge_list.append(tmp_p // 2)
tmp_p = tmp_p - (tmp_p // 2)
head = self.num_attention_heads
dim = c // head
for r in r_merge_list:
metric = x.reshape(b, p, head, dim).mean(2) # [b, p, c//head]
merge, _ = bipartite_soft_matching(
metric,
r
)
x, size = merge_wavg(merge, x, size)
_, p, _ = x.shape
# x = x.reshape(-1, c) # 300, 1024
return x
def forward(self, x, compress=False, local_num_frames=-1):
height = width = self.hw
assert height * width == x.shape[1]
dtype = x.dtype
device = x.device
if local_num_frames != -1 and local_num_frames != 1:
assert compress is True
if compress:
if local_num_frames != -1:
num_frames = local_num_frames
x = x.reshape(x.shape[0] // local_num_frames, -1, x.shape[-1])
else:
num_frames = x.shape[0]
x = x.reshape(1, -1, x.shape[-1])
num_tome_tokens = 16 * num_frames
else:
num_tome_tokens = 64
x = self.merge_tokens(x, target_num_token=num_tome_tokens)
x = self.mlp(x)
return x
@property
def config(self):
return {"mm_projector_type": "tome16_mlp_hd64"}
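# Illustrative sketch, not part of the original module. It isolates the loop in
# `merge_tokens` that builds `r_merge_list`: each bipartite matching round can remove
# at most half of the tokens, so the count is halved until the remaining gap fits in
# one round. `merge_schedule` is a hypothetical name for this example.

```python
def merge_schedule(num_tokens: int, target: int) -> list:
    """Tokens to remove in each ToMe round to go from num_tokens to target."""
    assert num_tokens > target, f"{num_tokens} should be greater than {target}"
    schedule = []
    remaining = num_tokens
    while remaining != target:
        if remaining - target <= remaining // 2:
            # The rest can be removed in a single round.
            schedule.append(remaining - target)
            break
        # Otherwise remove the maximum allowed (half) and continue.
        schedule.append(remaining // 2)
        remaining -= remaining // 2
    return schedule


# Four frames of 14 x 14 patches compressed to 16 tokens per frame.
clip_schedule = merge_schedule(4 * 196, 16 * 4)
```

# The rounds always remove exactly `num_tokens - target` tokens in total.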
================================================
FILE: xtuner-eval_niah/llava/model/utils.py
================================================
from transformers import AutoConfig
def auto_upgrade(config):
cfg = AutoConfig.from_pretrained(config)
if "llava" in config and "llava" not in cfg.model_type:
assert cfg.model_type == "llama"
print("You are using a newer LLaVA code base, while the v0 checkpoint is from an older code base.")
print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
if confirm.lower() in ["y", "yes"]:
print("Upgrading checkpoint...")
assert len(cfg.architectures) == 1
setattr(cfg.__class__, "model_type", "llava")
cfg.architectures[0] = "LlavaLlamaForCausalLM"
cfg.save_pretrained(config)
print("Checkpoint upgraded.")
else:
print("Checkpoint upgrade aborted.")
exit(1)
================================================
FILE: xtuner-eval_niah/llava/serialize_utils.py
================================================
# Description: This file contains the code for serializing the dataset.
# From https://github.com/ppwwyyxx/RAM-multiprocess-dataloader/blob/795868a37446d61412b9a58dbb1b7c76e75d39c4/serialize.py
# Copyright (c) Facebook, Inc. and its affiliates.
"""
List serialization code adopted from
https://github.com/facebookresearch/detectron2/blob/main/detectron2/data/common.py
"""
import multiprocessing as mp
from typing import List, Any, Optional
import pickle
import numpy as np
import torch
import torch.distributed as dist
import functools
import os
from datetime import timedelta
def get_world_size() -> int:
if not dist.is_available():
return 1
if not dist.is_initialized():
return 1
return dist.get_world_size()
def get_rank() -> int:
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
return dist.get_rank()
def get_local_rank() -> int:
if not dist.is_available():
return 0
if not dist.is_initialized():
return 0
# this is not guaranteed to be set
if 'RANK' in os.environ and 'WORLD_SIZE' in os.environ:
return int(os.environ['LOCAL_RANK'])
elif 'SLURM_PROCID' in os.environ:
return int(os.environ['SLURM_LOCALID'])
else:
raise RuntimeError("Unable to get local rank")
def get_local_size() -> int:
return torch.cuda.device_count()
@functools.lru_cache()
def _get_global_gloo_group():
"""
Return a process group based on gloo backend, containing all the ranks
The result is cached.
"""
if dist.get_backend() == "nccl":
return dist.new_group(backend="gloo", timeout=timedelta(minutes=60))
else:
return dist.group.WORLD
def all_gather(data, group=None):
"""
Run all_gather on arbitrary picklable data (not necessarily tensors).
Args:
data: any picklable object
group: a torch process group. By default, will use a group which
contains all ranks on gloo backend.
Returns:
list[data]: list of data gathered from each rank
"""
if get_world_size() == 1:
return [data]
if group is None:
group = (
_get_global_gloo_group()
) # use CPU group by default, to reduce GPU RAM usage.
world_size = dist.get_world_size(group)
if world_size == 1:
return [data]
output = [None for _ in range(world_size)]
dist.all_gather_object(output, data, group=group)
return output
class NumpySerializedList:
def __init__(self, lst: list):
def _serialize(data):
buffer = pickle.dumps(data, protocol=-1)
return np.frombuffer(buffer, dtype=np.uint8)
print(
"Serializing {} elements to byte tensors and concatenating them all ...".format(
len(lst)
)
)
self._lst = [_serialize(x) for x in lst]
self._addr = np.asarray([len(x) for x in self._lst], dtype=np.int64)
self._addr = np.cumsum(self._addr)
self._lst = np.concatenate(self._lst)
print("Serialized dataset takes {:.2f} MiB".format(len(self._lst) / 1024**2))
def __len__(self):
return len(self._addr)
def __getitem__(self, idx):
start_addr = 0 if idx == 0 else self._addr[idx - 1].item()
end_addr = self._addr[idx].item()
buf = memoryview(self._lst[start_addr:end_addr])
return pickle.loads(buf)
class TorchSerializedList(NumpySerializedList):
def __init__(self, lst: list):
super().__init__(lst)
self._addr = torch.from_numpy(self._addr)
self._lst = torch.from_numpy(self._lst)
def __getitem__(self, idx):
start_addr = 0 if idx == 0 else self._addr[idx - 1].item()
end_addr = self._addr[idx].item()
bytes = memoryview(self._lst[start_addr:end_addr].numpy())
return pickle.loads(bytes)
def local_scatter(array: Optional[List[Any]]):
"""
Scatter an array from local leader to all local workers.
The i-th local worker gets array[i].
Args:
array: Array with same size of #local workers.
"""
if get_local_size() <= 1:
# Just one worker. Do nothing.
return array[0]
if get_local_rank() == 0:
assert len(array) == get_local_size()
all_gather(array)
else:
all_data = all_gather(None)
array = all_data[get_rank() - get_local_rank()]
return array[get_local_rank()]
# NOTE: https://github.com/facebookresearch/mobile-vision/pull/120
# has another implementation that does not use tensors.
class TorchShmSerializedList(TorchSerializedList):
def __init__(self, lst: list):
if get_local_rank() == 0:
super().__init__(lst)
if get_local_rank() == 0:
# Move data to shared memory, obtain a handle to send to each local worker.
# This is cheap because a tensor will only be moved to shared memory once.
handles = [None] + [
bytes(mp.reduction.ForkingPickler.dumps((self._addr, self._lst)))
for _ in range(get_local_size() - 1)
]
else:
handles = None
# Each worker receives the handle from local leader.
handle = local_scatter(handles)
if get_local_rank() > 0:
# Materialize the tensor from shared memory.
self._addr, self._lst = mp.reduction.ForkingPickler.loads(handle)
print(
f"Worker {get_rank()} obtains a dataset of length="
f"{len(self)} from its local leader."
)
# From https://github.com/ppwwyyxx/RAM-multiprocess-dataloader/issues/5#issuecomment-1510676170
def local_broadcast_process_authkey():
if int(os.environ['LOCAL_WORLD_SIZE']) == 1:
return
local_rank = int(os.environ['LOCAL_RANK'])
authkey = bytes(mp.current_process().authkey)
all_keys = all_gather(authkey)
local_leader_key = all_keys[get_rank() - local_rank]
if authkey != local_leader_key:
print("Process authkey is different from the key of local leader. This might happen when "
"workers are launched independently.")
print("Overwriting local authkey ...")
mp.current_process().authkey = local_leader_key
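# Illustrative sketch (not part of the original module): the serialized-list classes above
# store every pickled item in one flat uint8 buffer plus an array of cumulative end offsets.
# The helper names `pack` and `unpack` are hypothetical and exist only for this example.

```python
import pickle

import numpy as np


def pack(items):
    """Pickle each item and concatenate the bytes into one uint8 array."""
    chunks = [np.frombuffer(pickle.dumps(x, protocol=-1), dtype=np.uint8) for x in items]
    # ends[i] is the exclusive end offset of item i inside the flat buffer.
    ends = np.cumsum([len(c) for c in chunks], dtype=np.int64)
    return np.concatenate(chunks), ends


def unpack(buffer, ends, idx):
    """Slice item idx out of the flat buffer and unpickle it."""
    start = 0 if idx == 0 else int(ends[idx - 1])
    end = int(ends[idx])
    return pickle.loads(memoryview(buffer[start:end]))


samples = [{"id": 0, "text": "a"}, {"id": 1, "frames": [1, 2, 3]}, "plain string"]
buffer, ends = pack(samples)
restored = [unpack(buffer, ends, i) for i in range(len(ends))]
```

# Keeping one large array instead of many small Python objects is what lets the Torch
# variants above move the whole dataset into shared memory as a single tensor.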
================================================
FILE: xtuner-eval_niah/llava/train/llava_trainer.py
================================================
import os
import torch
import torch.nn as nn
import datetime
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs, GradientAccumulationPlugin
from torch.utils.data import Dataset, Sampler, DataLoader
from transformers import Trainer
from transformers.trainer import is_sagemaker_mp_enabled, get_parameter_names, has_length, ALL_LAYERNORM_LAYERS, logger, is_accelerate_available, is_datasets_available, GradientAccumulationPlugin
from transformers.trainer_utils import seed_worker
from transformers.trainer_pt_utils import get_length_grouped_indices as get_length_grouped_indices_hf
from transformers.trainer_pt_utils import AcceleratorConfig
from typing import List, Optional
from datetime import timedelta
if is_accelerate_available():
from accelerate import Accelerator, skip_first_batches, InitProcessGroupKwargs
if is_datasets_available():
import datasets
from llava.utils import rank0_print
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
print(name, "no ignore status")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()}
return to_return
def split_to_even_chunks(indices, lengths, num_chunks):
    """
    Split a list of indices into `num_chunks` chunks of roughly equal total length.
    """
    # If the indices cannot be divided evenly, fall back to a strided split.
    if len(indices) % num_chunks != 0:
        return [indices[i::num_chunks] for i in range(num_chunks)]
    num_indices_per_chunk = len(indices) // num_chunks
    chunks = [[] for _ in range(num_chunks)]
    chunks_lengths = [0 for _ in range(num_chunks)]
    for index in indices:
        # Greedily add the next index to the chunk with the smallest total length.
        shortest_chunk = chunks_lengths.index(min(chunks_lengths))
        chunks[shortest_chunk].append(index)
        chunks_lengths[shortest_chunk] += lengths[index]
        if len(chunks[shortest_chunk]) == num_indices_per_chunk:
            # A full chunk is marked with infinite length so it is never picked again.
            chunks_lengths[shortest_chunk] = float("inf")
    return chunks
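# Illustrative sketch (not part of the original module): a worked example of the greedy
# balancing done by split_to_even_chunks. The function is repeated here so the example is
# self-contained; the sample lengths are made up for illustration.

```python
def split_to_even_chunks(indices, lengths, num_chunks):
    """Greedy split of indices into num_chunks chunks of similar total length."""
    if len(indices) % num_chunks != 0:
        return [indices[i::num_chunks] for i in range(num_chunks)]
    num_indices_per_chunk = len(indices) // num_chunks
    chunks = [[] for _ in range(num_chunks)]
    chunks_lengths = [0 for _ in range(num_chunks)]
    for index in indices:
        shortest_chunk = chunks_lengths.index(min(chunks_lengths))
        chunks[shortest_chunk].append(index)
        chunks_lengths[shortest_chunk] += lengths[index]
        if len(chunks[shortest_chunk]) == num_indices_per_chunk:
            chunks_lengths[shortest_chunk] = float("inf")
    return chunks


# Hypothetical sample lengths, already sorted longest first as the callers do.
lengths = [10, 9, 8, 7, 2, 1]
even = split_to_even_chunks([0, 1, 2, 3, 4, 5], lengths, num_chunks=2)
totals = [sum(lengths[i] for i in chunk) for chunk in even]

# Five indices cannot be split evenly across two chunks, so the strided fallback is used.
uneven = split_to_even_chunks([0, 1, 2, 3, 4], lengths, num_chunks=2)
```

# With these inputs the two balanced chunks carry total lengths of 19 and 18, while the
# uneven case simply interleaves the indices.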
def get_variable_length_grouped_indices(lengths, batch_size, world_size, megabatch_mult=8, generator=None):
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
indices = torch.randperm(len(lengths), generator=generator)
sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
megabatch_size = world_size * batch_size * megabatch_mult
megabatches = [sorted_indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: indices[i], reverse=True) for megabatch in megabatches]
shuffled_indices = [i for megabatch in megabatches for i in megabatch]
world_batch_size = world_size * batch_size
batches = [shuffled_indices[i : i + world_batch_size] for i in range(0, len(lengths), world_batch_size)]
batch_indices = torch.randperm(len(batches), generator=generator)
batches = [batches[i] for i in batch_indices]
return [i for batch in batches for i in batch]
def get_modality_length_grouped_indices(lengths, batch_size, world_size, generator=None):
"""
Return a list of indices grouped by modality and by length. Positive lengths mark multimodal samples and negative
lengths mark language-only samples. To do this, the indices are:
- split into a multimodal group and a language-only group by the sign of their length
- length-grouped within each modality via `get_length_grouped_indices`
- cut into mega-batches of size `world_size * batch_size`, which are then shuffled across modalities
The last (possibly partial) mega-batch of each modality is merged into one final mixed batch appended at the end.
"""
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
assert all(l != 0 for l in lengths), "Should not have zero length."
if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
# all samples are in the same modality
return get_length_grouped_indices(lengths, batch_size, world_size, generator=generator)
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices(mm_lengths, batch_size, world_size, generator=None)]
lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices(lang_lengths, batch_size, world_size, generator=None)]
megabatch_size = world_size * batch_size
mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
additional_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
if len(additional_batch) > 0:
megabatches.append(sorted(additional_batch))
return [i for megabatch in megabatches for i in megabatch]
def get_length_grouped_indices(lengths, batch_size, world_size, generator=None, merge=True):
"""
Return a list of indices so that each slice of `batch_size` consecutive indices corresponds to elements of similar
lengths. To do this, the indices are:
- randomly permuted
- grouped in mega-batches of size `world_size * batch_size`
- sorted by length (longest first) within each mega-batch
- split into `world_size` chunks of roughly equal total length per mega-batch
The result is the concatenation of all mega-batches. The `merge` argument is currently unused.
"""
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
indices = torch.randperm(len(lengths), generator=generator)
megabatch_size = world_size * batch_size
megabatches = [indices[i : i + megabatch_size].tolist() for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
return [i for megabatch in megabatches for batch in megabatch for i in batch]
def get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=None):
indices = get_length_grouped_indices_hf(lengths, batch_size * world_size, generator=generator)
megabatch_size = world_size * batch_size
megabatches = [indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
batch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in batch_indices]
return [i for megabatch in megabatches for batch in megabatch for i in batch]
def get_modality_length_grouped_indices_auto(lengths, batch_size, world_size, generator=None):
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
assert all(l != 0 for l in lengths), "Should not have zero length."
if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
# all samples are in the same modality
return get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=generator)
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices_auto_single(mm_lengths, batch_size, world_size, generator=None)]
lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices_auto_single(lang_lengths, batch_size, world_size, generator=None)]
megabatch_size = world_size * batch_size
mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
additional_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
# FIXME: Hard code to avoid last batch mixed with different modalities
# if len(additional_batch) > 0:
# megabatches.append(sorted(additional_batch))
return [i for megabatch in megabatches for i in megabatch]
class LengthGroupedSampler(Sampler):
r"""
Sampler that samples indices in a way that groups together features of the dataset of roughly the same length while
keeping a bit of randomness.
"""
def __init__(
self,
batch_size: int,
world_size: int,
lengths: Optional[List[int]] = None,
generator=None,
variable_length: bool = False,
group_by_modality: bool = False,
group_by_modality_auto: bool = False,
):
if lengths is None:
raise ValueError("Lengths must be provided.")
self.batch_size = batch_size
self.world_size = world_size
self.lengths = lengths
self.generator = generator
self.variable_length = variable_length
self.group_by_modality = group_by_modality
self.group_by_modality_auto = group_by_modality_auto
def __len__(self):
return len(self.lengths)
def __iter__(self):
if self.variable_length:
assert not self.group_by_modality, "Variable length grouping is not supported with modality grouping."
indices = get_variable_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
else:
if self.group_by_modality:
indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
elif self.group_by_modality_auto:
indices = get_modality_length_grouped_indices_auto(self.lengths, self.batch_size, self.world_size, generator=self.generator)
else:
indices = get_length_grouped_indices_auto_single(self.lengths, self.batch_size, self.world_size, generator=self.generator)
return iter(indices)
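# Illustrative sketch (not part of the original module): how the modality-grouping helpers
# above interpret signed lengths. Positive lengths are treated as multimodal samples and
# negative lengths as language-only samples. Shuffling and length sorting are left out so
# the result is deterministic; `chunk` and `group_by_modality` are hypothetical names, and
# the sketch assumes both modalities are present.

```python
def chunk(seq, size):
    """Cut seq into consecutive pieces of at most `size` elements."""
    return [seq[i : i + size] for i in range(0, len(seq), size)]


def group_by_modality(lengths, megabatch_size):
    assert all(l != 0 for l in lengths), "Should not have zero length."
    mm = [i for i, l in enumerate(lengths) if l > 0]
    lang = [i for i, l in enumerate(lengths) if l < 0]
    mm_batches = chunk(mm, megabatch_size)
    lang_batches = chunk(lang, megabatch_size)
    # The last mega-batch of each modality is always pulled out and merged,
    # even when it happens to be full.
    leftover = sorted(mm_batches[-1] + lang_batches[-1])
    full = mm_batches[:-1] + lang_batches[:-1]
    return full, leftover


# Hypothetical dataset: four multimodal samples and three text-only samples.
lengths = [120, -30, 200, -45, 80, -10, 150]
full, leftover = group_by_modality(lengths, megabatch_size=2)
```

# Every mega-batch in `full` holds a single modality; only `leftover` mixes the two, which
# is the batch that get_modality_length_grouped_indices_auto drops.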
class LLaVATrainer(Trainer):
def create_accelerator_and_postprocess(self):
grad_acc_kwargs = {"num_steps": self.args.gradient_accumulation_steps}
grad_acc_kwargs["sync_with_dataloader"] = False
gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52))
rank0_print("Setting the process-group (NCCL) timeout to 52 weeks to avoid timeout errors.")
# create accelerator object
self.accelerator = Accelerator(
dispatch_batches=self.args.dispatch_batches, split_batches=self.args.split_batches, deepspeed_plugin=self.args.deepspeed_plugin, gradient_accumulation_plugin=gradient_accumulation_plugin, kwargs_handlers=[accelerator_kwargs]
)
# some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
self.gather_function = self.accelerator.gather_for_metrics
# deepspeed and accelerate flags covering both trainer args and accelerate launcher
self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None
self.is_fsdp_enabled = getattr(self.accelerator.state, "fsdp_plugin", None) is not None
# post accelerator creation setup
if self.is_fsdp_enabled:
fsdp_plugin = self.accelerator.state.fsdp_plugin
fsdp_plugin.limit_all_gathers = self.args.fsdp_config.get("limit_all_gathers", fsdp_plugin.limit_all_gathers)
if is_accelerate_available("0.23.0"):
fsdp_plugin.activation_checkpointing = self.args.fsdp_config.get("activation_checkpointing", fsdp_plugin.activation_checkpointing)
if fsdp_plugin.activation_checkpointing and self.args.gradient_checkpointing:
raise ValueError("The activation_checkpointing in FSDP config and the gradient_checkpointing in training arg " "can't be set to True simultaneously. Please use FSDP's activation_checkpointing logic " "when using FSDP.")
if self.is_deepspeed_enabled and getattr(self.args, "hf_deepspeed_config", None) is None:
self.propagate_args_to_deepspeed()
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
if self.train_dataset is None or not has_length(self.train_dataset):
return None
if self.args.group_by_length:
lengths = self.train_dataset.lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
)
elif self.args.group_by_modality_length:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
group_by_modality=True,
)
elif self.args.group_by_modality_length_auto:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
group_by_modality_auto=True,
)
elif self.args.group_by_varlen:
lengths = self.train_dataset.lengths
return LengthGroupedSampler(
self.args.train_batch_size * self.args.gradient_accumulation_steps,
# self.args.train_batch_size, # TODO: seems that we should have gradient_accumulation_steps
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
variable_length=True,
)
else:
return super()._get_train_sampler()
def get_train_dataloader(self) -> DataLoader:
"""
Returns the training [`~torch.utils.data.DataLoader`].
Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed
training if necessary) otherwise.
Subclass and override this method if you want to inject some custom behavior.
"""
if self.train_dataset is None:
raise ValueError("Trainer: training requires a train_dataset.")
train_dataset = self.train_dataset
data_collator = self.data_collator
if is_datasets_available() and isinstance(train_dataset, datasets.Dataset):
train_dataset = self._remove_unused_columns(train_dataset, description="training")
else:
data_collator = self._get_collator_with_removed_columns(data_collator, description="training")
dataloader_params = {
"batch_size": self._train_batch_size,
"collate_fn": data_collator,
"num_workers": self.args.dataloader_num_workers,
"pin_memory": self.args.dataloader_pin_memory,
"persistent_workers": self.args.dataloader_persistent_workers,
}
if not isinstance(train_dataset, torch.utils.data.IterableDataset):
dataloader_params["sampler"] = self._get_train_sampler()
dataloader_params["drop_last"] = self.args.dataloader_drop_last
dataloader_params["worker_init_fn"] = seed_worker
dataloader_params["prefetch_factor"] = self.args.dataloader_num_workers * 2 if self.args.dataloader_num_workers != 0 else None
dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
return dataloader
def create_optimizer(self):
"""
Setup the optimizer.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
Trainer's init through `optimizers`, or subclass and override this method in a subclass.
"""
if is_sagemaker_mp_enabled():
return super().create_optimizer()
opt_model = self.model
if self.optimizer is None:
decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS)
decay_parameters = [name for name in decay_parameters if "bias" not in name]
lr_mapper = {}
if self.args.mm_projector_lr is not None:
lr_mapper["mm_projector"] = self.args.mm_projector_lr
if self.args.mm_vision_tower_lr is not None:
lr_mapper["vision_tower"] = self.args.mm_vision_tower_lr
if len(lr_mapper) > 0:
special_lr_parameters = [name for name, _ in opt_model.named_parameters() if any(module_keyword in name for module_keyword in lr_mapper)]
optimizer_grouped_parameters = [
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n not in special_lr_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n not in special_lr_parameters and p.requires_grad)],
"weight_decay": 0.0,
},
]
for module_keyword, lr in lr_mapper.items():
module_parameters = [name for name, _ in opt_model.named_parameters() if module_keyword in name]
optimizer_grouped_parameters.extend(
[
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n in module_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
"lr": lr,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n in module_parameters and p.requires_grad)],
"weight_decay": 0.0,
"lr": lr,
},
]
)
else:
optimizer_grouped_parameters = [
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
"weight_decay": 0.0,
},
]
optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
if optimizer_cls.__name__ == "Adam8bit":
import bitsandbytes
manager = bitsandbytes.optim.GlobalOptimManager.get_instance()
skipped = 0
for module in opt_model.modules():
if isinstance(module, nn.Embedding):
skipped += sum({p.data_ptr(): p.numel() for p in module.parameters()}.values())
logger.info(f"skipped {module}: {skipped/2**20}M params")
manager.register_module_override(module, "weight", {"optim_bits": 32})
logger.debug(f"bitsandbytes: will optimize {module} in fp32")
logger.info(f"skipped: {skipped/2**20}M params")
return self.optimizer
def _save_checkpoint(self, model, trial, metrics=None):
if getattr(self.args, "tune_mm_mlp_adapter", False) or (
hasattr(self.args, "mm_tunable_parts") and (len(self.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in self.args.mm_tunable_parts or "mm_vision_resampler" in self.args.mm_tunable_parts))
):
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(self.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
if self.args.local_rank == 0 or self.args.local_rank == -1:
self.model.config.save_pretrained(output_dir)
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
else:
super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)
def _save(self, output_dir: Optional[str] = None, state_dict=None):
if getattr(self.args, "tune_mm_mlp_adapter", False):
pass
else:
super(LLaVATrainer, self)._save(output_dir, state_dict)
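# Illustrative sketch (not part of the original module): how create_optimizer above buckets
# parameters into groups by weight decay and by per-module learning rate. It works on
# parameter names only, so it runs without torch; `group_parameter_names` and the example
# parameter names are hypothetical.

```python
def group_parameter_names(names, decay_names, lr_mapper, weight_decay):
    """Mirror the grouping logic: two default groups plus two groups per lr keyword."""
    special = [n for n in names if any(keyword in n for keyword in lr_mapper)]
    groups = [
        {"names": [n for n in names if n in decay_names and n not in special], "weight_decay": weight_decay},
        {"names": [n for n in names if n not in decay_names and n not in special], "weight_decay": 0.0},
    ]
    for keyword, lr in lr_mapper.items():
        module_names = [n for n in names if keyword in n]
        groups.append({"names": [n for n in module_names if n in decay_names], "weight_decay": weight_decay, "lr": lr})
        groups.append({"names": [n for n in module_names if n not in decay_names], "weight_decay": 0.0, "lr": lr})
    return groups


names = [
    "model.layers.0.mlp.weight",
    "model.layers.0.mlp.bias",
    "model.norm.weight",
    "mm_projector.0.weight",
    "mm_projector.0.bias",
    "vision_tower.blocks.0.attn.weight",
]
# Biases and normalization weights are excluded from weight decay.
decay_names = [n for n in names if "bias" not in n and "norm" not in n]
groups = group_parameter_names(names, decay_names, {"mm_projector": 2e-5}, weight_decay=0.1)
```

# Groups without an "lr" entry fall back to the optimizer's default learning rate, which is
# how mm_projector_lr and mm_vision_tower_lr override only the modules they name.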
================================================
FILE: xtuner-eval_niah/llava/train/llava_trainer_eval.py
================================================
import json
import subprocess
from llava.train.llava_trainer import LLaVATrainer
class LLaVAEvalTrainer(LLaVATrainer):
    def evaluate(self, evaluate_args):
        # Build the lmms_eval command line from the evaluation arguments.
        cmd = (
            f"accelerate launch --num_processes {evaluate_args.eval_num_processes} -m lmms_eval"
            f" --model {evaluate_args.model}"
            f" --model_args {evaluate_args.model_args}"
            f" --tasks {evaluate_args.task_names}"
            f" --batch_size {evaluate_args.batch_size}"
            f" --log_samples_suffix {evaluate_args.log_samples_suffix}"
            f" --output_path {evaluate_args.output_path}"
        )
        if evaluate_args.limit:
            cmd += f" --limit {evaluate_args.limit}"
        if evaluate_args.num_fewshot:
            cmd += f" --num_fewshot {evaluate_args.num_fewshot}"
        if evaluate_args.gen_kwargs != "":
            cmd += f" --gen_kwargs {evaluate_args.gen_kwargs}"
        if evaluate_args.log_samples:
            cmd += " --log_samples"
        else:
            assert False, "Please log samples so that the result can be parsed"
        # With shell=True the command is passed as a single string.
        results = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        # The samples path may be printed on stdout or stderr; try stdout first.
        marker = "Saved samples to "
        try:
            log = results.stdout
            start = log.index(marker) + len(marker)
            # Search for ".json" only after the marker so an earlier match is ignored.
            end = log.index(".json", start)
        except ValueError:
            log = results.stderr
            start = log.index(marker) + len(marker)
            end = log.index(".json", start)
        file = log[start:end]
        # results.json sits in the same directory as the samples file.
        file = file.split("/")[:-1]
        file = "/".join(file) + "/results.json"
        with open(file, "r") as f:
            lmms_eval_results = json.load(f)
        result_dict = {}
        tasks_list = evaluate_args.task_names.split(",")
        for task in tasks_list:
            task_results = lmms_eval_results["results"][task]
            for k, v in task_results.items():
                if k != "alias" and "stderr" not in k:
                    metric = k.split(",")[0]
                    result_dict[f"{task}_{metric}"] = v
        return result_dict
"""def evaluate(self, evaluate_args):
initialize_tasks()
tasks_list = evaluate_args.task_names.split(",")
result_dict = {}
results = evaluator.simple_evaluate(
model=evaluate_args.model,
model_args=evaluate_args.model_args,
tasks=tasks_list,
num_fewshot=evaluate_args.num_fewshot,
batch_size=evaluate_args.batch_size,
device=evaluate_args.device,
limit=evaluate_args.limit,
check_integrity=evaluate_args.check_integrity,
show_task_to_terminal=evaluate_args.show_task_to_terminal,
log_samples=evaluate_args.log_samples,
gen_kwargs=evaluate_args.gen_kwargs,
cli_args=evaluate_args,
)
for task in tasks_list:
task_results = results["results"][task]
for k,v in task_results.items():
if k != "alias" and "stderr" not in k:
metric = k.split(",")[0]
result_dict[f"{task}_{metric}"] = v
return result_dict"""
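# Illustrative sketch (not part of the original module): the two parsing steps that
# evaluate() performs on the lmms_eval output, pulled out as plain functions. The helper
# names, the example log line, and the example metric keys are hypothetical; only the
# "Saved samples to " marker and the key filtering come from the code above.

```python
def results_path_from_log(log_text, marker="Saved samples to "):
    """Derive the results.json path from the log line naming the samples file."""
    start = log_text.index(marker) + len(marker)
    end = log_text.index(".json", start)
    sample_file = log_text[start:end]
    return "/".join(sample_file.split("/")[:-1]) + "/results.json"


def flatten_metrics(results, tasks):
    """Keep one `<task>_<metric>` entry per metric, skipping aliases and stderr values."""
    flat = {}
    for task in tasks:
        for key, value in results["results"][task].items():
            if key != "alias" and "stderr" not in key:
                flat[f"{task}_{key.split(',')[0]}"] = value
    return flat


log = "INFO Saved samples to /tmp/out/run1/samples_mme.json\n"
path = results_path_from_log(log)

fake_results = {"results": {"mme": {"alias": "mme", "score,none": 1.5, "score_stderr,none": 0.1}}}
metrics = flatten_metrics(fake_results, ["mme"])
```

# Separating the parsing from the subprocess call makes it possible to test the log
# handling without launching an evaluation run.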
================================================
FILE: xtuner-eval_niah/llava/train/train.py
================================================
# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
# Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
# Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import ast
import os
import copy
from dataclasses import dataclass, field
import json
import logging
import pathlib
from typing import Dict, Optional, Sequence, List
from PIL import Image, ImageFile
from packaging import version
import numpy as np
import gc
import io
import time
import random
import yaml
import math
import re
import torch
import transformers
import tokenizers
import deepspeed
from transformers import AutoConfig
from torch.utils.data import Dataset
from llava.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX
from llava.train.llava_trainer import LLaVATrainer
from llava import conversation as conversation_lib
from llava.model import *
from llava.mm_utils import process_highres_image, process_anyres_image, process_anyres_image_nopad, process_highres_image_crop_split, tokenizer_image_token, process_anyres_video_nopad
from llava.utils import rank0_print
from llava.video_utils import VIDEO_READER_FUNCS
# from llava.serialize_utils import TorchShmSerializedList, get_rank, get_local_rank, local_broadcast_process_authkey
# import wandb
torch.multiprocessing.set_sharing_strategy("file_system")
ImageFile.LOAD_TRUNCATED_IMAGES = True
local_rank = None
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse("0.14")
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
model_class_name: Optional[str] = field(default=None, metadata={"help": "Used to init model class, format is XXXXForCausalLM. e.g. currently XXXX is chosen from LlavaLlama, LlavaMixtral, LlavaMistral, Llama"})
mm_tunable_parts: Optional[str] = field(
default=None, metadata={"help": 'Could be "mm_mlp_adapter", "mm_vision_resampler", "mm_vision_tower,mm_mlp_adapter,mm_language_model", "mm_mlp_adapter,mm_language_model"'}
)
# deciding which part of the multimodal model to tune, will overwrite other previous settings
version: Optional[str] = field(default="v0")
freeze_backbone: bool = field(default=False)
tune_mm_mlp_adapter: bool = field(default=False)
tune_mm_vision_resampler: bool = field(default=False)
vision_tower: Optional[str] = field(default=None)
vision_tower_pretrained: Optional[str] = field(default=None) # default to the last layer
vision_encode_type: Optional[str] = field(default="image")
unfreeze_mm_vision_tower: bool = field(default=False)
unfreeze_language_model: bool = field(default=False)
mm_vision_select_layer: Optional[int] = field(default=-1) # default to the last layer
pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
mm_projector_type: Optional[str] = field(default="linear")
mm_use_im_start_end: bool = field(default=False)
mm_use_im_patch_token: bool = field(default=True)
mm_patch_merge_type: Optional[str] = field(default="flat")
mm_vision_select_feature: Optional[str] = field(default="patch")
mm_resampler_type: Optional[str] = field(default=None)
mm_mask_drop_mode: str = field(default="fixed")
mm_mask_drop_skip_percentage: float = field(default=0.0)
mm_mask_drop_ratio: float = field(default=0.25)
mm_mask_drop_ratio_upper: Optional[float] = field(default=None)
mm_mask_drop_ratio_lower: Optional[float] = field(default=None)
mm_spatial_pool_stride: Optional[int] = field(default=None)
mm_spatial_pool_mode: str = field(default="bilinear")
mm_spatial_pool_out_channels: Optional[int] = field(default=None)
mm_num_compress_latents: Optional[int] = field(default=128)
mm_num_compress_query_type: Optional[str] = field(default='learnable')
mm_pos_num_frames: Optional[int] = field(default=8)
mm_close_init: Optional[bool] = field(default=False)
min_slow_num_frames: Optional[int] = field(default=4)
mm_perceiver_depth: Optional[int] = field(default=3)
mm_perceiver_latents: Optional[int] = field(default=32)
mm_perceiver_ff_mult: Optional[float] = field(default=4)
mm_perceiver_pretrained: Optional[str] = field(default=None)
mm_qformer_depth: Optional[int] = field(default=3)
mm_qformer_latents: Optional[int] = field(default=32)
mm_qformer_pretrained: Optional[str] = field(default=None)
rope_scaling_factor: Optional[float] = field(default=None)
rope_scaling_type: Optional[str] = field(default=None)
s2: Optional[bool] = field(default=False)
s2_scales: Optional[str] = field(default="336,672,1008")
use_pos_skipping: Optional[bool] = field(default=False)
pos_skipping_range: Optional[int] = field(default=4096)
mm_newline_position: Optional[str] = field(default="one_token") # for frame separate
mm_local_num_frames: Optional[int] = field(default=-1) # controls whether the video encoder and projector process the temporal sequence in segments
mm_llm_compress: Optional[bool] = field(default=False)
llm_compress_type: Optional[str] = field(default="attention")
llm_compress_layer_list: Optional[str] = field(default="8,16,24")
llm_image_token_ratio_list: Optional[str] = field(default="1.0,0.5,0.25,0.125")
# When adding new model arguments, remember to register them in overwrite_config below
@dataclass
class DataArguments:
data_path: str = field(default=None, metadata={"help": "Path to the training data, in llava's instruction.json format. Supporting multiple json files via /path/to/{a,b,c}.json"})
lazy_preprocess: bool = False
is_multimodal: bool = False
early_mix_text: bool = False
# image_folder: Optional[str] = field(default=None)
image_aspect_ratio: str = "square"
image_grid_pinpoints: Optional[str] = field(default=None)
image_crop_resolution: Optional[int] = field(default=None) # seems to be unused
image_split_resolution: Optional[int] = field(default=None) # seems to be unused
frame_aspect_ratio: str = "square"
frame_grid_pinpoints: Optional[str] = field(default=None)
max_num_pixels: int = 14745600000 # 384*384*100000
# video_folder: Optional[str] = field(default=None)
# video_fps: Optional[int] = field(default=1)
frames_upbound: Optional[int] = field(default=8)
frames_lowbound: Optional[int] = field(default=1) # note: a video that has fewer frames than this will still fall below the lower bound
time_msg: Optional[str] = field(default=None)
local_num_frames: Optional[int] = field(default=8)
sample_type: Optional[str] = field(default='middle')
@dataclass
class TrainingArguments(transformers.TrainingArguments):
cache_dir: Optional[str] = field(default=None)
optim: str = field(default="adamw_torch")
remove_unused_columns: bool = field(default=False)
freeze_mm_mlp_adapter: bool = field(default=False)
freeze_mm_vision_resampler: bool = field(default=False)
mpt_attn_impl: Optional[str] = field(default="triton")
model_max_length: int = field(
default=4096,
metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
)
double_quant: bool = field(default=True, metadata={"help": "Compress the quantization statistics through double quantization."})
quant_type: str = field(default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."})
bits: int = field(default=16, metadata={"help": "How many bits to use."})
lora_enable: bool = False
lora_r: int = 64
lora_alpha: int = 16
lora_dropout: float = 0.05
lora_weight_path: str = ""
lora_bias: str = "none"
mm_projector_lr: Optional[float] = None
mm_vision_tower_lr: Optional[float] = None
group_by_varlen: bool = field(default=False)
group_by_modality_length: bool = field(default=False)
group_by_modality_length_auto: bool = field(default=False)
auto_find_batch_size: bool = field(default=False)
gradient_checkpointing: bool = field(default=True)
verbose_logging: bool = field(default=True)
attn_implementation: str = field(default="flash_attention_2", metadata={"help": "Use transformers attention implementation."})
# @dataclass
# class EvaluationArguments:
# eval_num_processes: int = field(default=1)
# task_names: str = field(default=None)
# model: str = field(default="llava")
# model_args: Optional[str] = field(default=None)
# num_fewshot: Optional[int] = field(default=None)
# batch_size: int = field(default=1)
# device: Optional[str] = field(default=None)
# limit: Optional[int] = field(default=None)
# check_integrity: Optional[bool] = field(default=False)
# show_task_to_terminal: Optional[bool] = field(default=False)
# log_samples: Optional[bool] = field(default=True)
# gen_kwargs: Optional[str] = field(default="")
# log_samples_suffix: Optional[str] = field(default="")
# output_path: Optional[str] = field(default="./logs/")
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
logging.warning(f"{name}: param.ds_status != ZeroParamStatus.NOT_AVAILABLE: {param.ds_status}")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
if bias == "none":
to_return = {k: t for k, t in named_params if "lora_" in k}
elif bias == "all":
to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
elif bias == "lora_only":
to_return = {}
maybe_lora_bias = {}
lora_bias_names = set()
for k, t in named_params:
if "lora_" in k:
to_return[k] = t
bias_name = k.split("lora_")[0] + "bias"
lora_bias_names.add(bias_name)
elif "bias" in k:
maybe_lora_bias[k] = t
# Iterate over (name, tensor) pairs and keep only biases that belong to a LoRA-adapted module
for k, t in maybe_lora_bias.items():
if k in lora_bias_names:
to_return[k] = t
else:
raise NotImplementedError
to_return = {k: maybe_zero_3(v, ignore_status=True) for k, v in to_return.items()}
return to_return
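# The "lora_only" rule above can be sketched on a plain dict of parameter
# names, without DeepSpeed or tensors. This is a minimal illustration, not the
# repository's implementation: select_lora_only and the sample parameter names
# are hypothetical, and the values are stand-in integers rather than tensors.

```python
def select_lora_only(named_params):
    """Keep LoRA weights plus the biases of modules that carry a LoRA adapter."""
    selected = {}
    candidate_biases = {}
    lora_bias_names = set()
    for name, value in named_params.items():
        if "lora_" in name:
            selected[name] = value
            # "...q_proj.lora_A.weight" -> "...q_proj.bias"
            lora_bias_names.add(name.split("lora_")[0] + "bias")
        elif "bias" in name:
            candidate_biases[name] = value
    # Only biases whose module also has a LoRA adapter are kept.
    for name, value in candidate_biases.items():
        if name in lora_bias_names:
            selected[name] = value
    return selected


# Hypothetical parameter names, for illustration only.
params = {
    "layers.0.q_proj.weight": 0,
    "layers.0.q_proj.bias": 1,
    "layers.0.q_proj.lora_A.weight": 2,
    "layers.0.q_proj.lora_B.weight": 3,
    "layers.0.mlp.bias": 4,
}
kept = select_lora_only(params)
```

# Here the q_proj bias is kept because q_proj has LoRA weights, while the mlp
# bias and the frozen q_proj weight are dropped.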
def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
to_return = {k: t for k, t in named_params if "lora_" not in k}
if require_grad_only:
to_return = {k: t for k, t in to_return.items() if t.requires_grad}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def find_all_linear_names(model):
cls = torch.nn.Linear
lora_module_names = set()
multimodal_keywords = ["mm_projector", "vision_tower", "vision_resampler"]
for name, module in model.named_modules():
if any(mm_keyword in name for mm_keyword in multimodal_keywords):
continue
if isinstance(module, cls):
names = name.split(".")
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if "lm_head" in lora_module_names: # needed for 16-bit
lora_module_names.remove("lm_head")
return list(lora_module_names)
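# The name filtering in find_all_linear_names can be illustrated without
# torch. The sketch below assumes the input list already contains only the
# names of Linear modules; collect_lora_targets and the sample module names
# are hypothetical.

```python
def collect_lora_targets(
    linear_module_names,
    skip_keywords=("mm_projector", "vision_tower", "vision_resampler"),
):
    """Return the leaf names of language-model Linear layers, excluding lm_head."""
    targets = set()
    for name in linear_module_names:
        # Multimodal components are never wrapped with LoRA.
        if any(keyword in name for keyword in skip_keywords):
            continue
        # Only the last path component is used as the LoRA target name.
        targets.add(name.split(".")[-1])
    targets.discard("lm_head")
    return sorted(targets)


# Hypothetical module names, for illustration only.
names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
    "model.mm_projector.0",
    "model.vision_tower.blocks.0.fc1",
    "lm_head",
]
targets = collect_lora_targets(names)
```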
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
"""Collects the state dict and dump to disk."""
if hasattr(trainer.args, "tune_mm_mlp_adapter") and trainer.args.tune_mm_mlp_adapter:
check_only_save_mm_adapter_tunnable = True
# only has mm_mlp_adapter and mm_vision_resampler in the tuneable parts
elif hasattr(trainer.args, "mm_tunable_parts") and (len(trainer.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in trainer.args.mm_tunable_parts or "mm_vision_resampler" in trainer.args.mm_tunable_parts)):
check_only_save_mm_adapter_tunnable = True
else:
check_only_save_mm_adapter_tunnable = False
trainer.accelerator.wait_for_everyone()
torch.cuda.synchronize()
rank0_print(f"Only save projectors: {check_only_save_mm_adapter_tunnable}")
if check_only_save_mm_adapter_tunnable:
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(trainer.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
trainer.model.config.save_pretrained(output_dir)
current_folder = output_dir.split("/")[-1]
parent_folder = os.path.dirname(output_dir)
if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
if current_folder.startswith("checkpoint-"):
mm_projector_folder = os.path.join(parent_folder, "mm_projector")
os.makedirs(mm_projector_folder, exist_ok=True)
torch.save(weight_to_save, os.path.join(mm_projector_folder, f"{current_folder}.bin"))
else:
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
return
if trainer.deepspeed:
trainer.save_model(output_dir)
return
state_dict = trainer.model.state_dict()
if trainer.args.should_save:
cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
del state_dict
trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
def smart_tokenizer_and_embedding_resize(
special_tokens_dict: Dict,
tokenizer: transformers.PreTrainedTokenizer,
model: transformers.PreTrainedModel,
):
"""Resize tokenizer and embedding.
Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
"""
num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
"""Tokenize a list of strings."""
tokenized_list = [
tokenizer(
text,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
)
for text in strings
]
input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
input_ids_lens = labels_lens = [tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list]
return dict(
input_ids=input_ids,
labels=labels,
input_ids_lens=input_ids_lens,
labels_lens=labels_lens,
)
def _mask_targets(target, tokenized_lens, speakers):
# cur_idx = 0
cur_idx = tokenized_lens[0]
tokenized_lens = tokenized_lens[1:]
target[:cur_idx] = IGNORE_INDEX
for tokenized_len, speaker in zip(tokenized_lens, speakers):
if speaker == "human":
target[cur_idx + 2 : cur_idx + tokenized_len] = IGNORE_INDEX
cur_idx += tokenized_len
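# A list-based sketch of the masking performed by _mask_targets: the header is
# always ignored, and each human turn is ignored except for its first two
# tokens. mask_human_turns is a hypothetical stand-in that works on Python
# lists instead of tensors, and IGNORE_INDEX = -100 is assumed here rather
# than imported from the repository's constants.

```python
IGNORE_INDEX = -100  # assumed value, normally imported from the constants module


def mask_human_turns(target, tokenized_lens, speakers):
    """Mask the header and human turns of a label list in place."""
    cur = tokenized_lens[0]
    target[:cur] = [IGNORE_INDEX] * cur  # the header is never supervised
    for length, speaker in zip(tokenized_lens[1:], speakers):
        if speaker == "human":
            start, end = cur + 2, cur + length
            target[start:end] = [IGNORE_INDEX] * max(0, end - start)
        cur += length
    return target


# 12 tokens: a 3-token header, a 4-token human turn, a 5-token gpt turn.
labels = list(range(10, 22))
masked = mask_human_turns(labels, [3, 4, 5], ["human", "gpt"])
```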
def _add_speaker_and_signal(header, source, get_conversation=True):
"""Add speaker and start/end signal on each round."""
BEGIN_SIGNAL = "### "
END_SIGNAL = "\n"
conversation = header
for sentence in source:
from_str = sentence["from"]
if from_str.lower() == "human":
from_str = conversation_lib.default_conversation.roles[0]
elif from_str.lower() == "gpt":
from_str = conversation_lib.default_conversation.roles[1]
else:
from_str = "unknown"
sentence["value"] = BEGIN_SIGNAL + from_str + ": " + sentence["value"] + END_SIGNAL
if get_conversation:
conversation += sentence["value"]
conversation += BEGIN_SIGNAL
return conversation
def preprocess_multimodal(sources: Sequence[str], data_args: DataArguments, msg="") -> Dict:
is_multimodal = data_args.is_multimodal
if not is_multimodal:
return sources
for source in sources:
for sentence in source:
# TODO maybe this should be changed for interleaved data?
# if DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
# only check for num_im=1
num_im = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
if num_im == 1 and DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "").strip()
sentence["value"] = DEFAULT_IMAGE_TOKEN + "\n" + sentence["value"]
sentence["value"] = sentence["value"].strip()
if "mmtag" in conversation_lib.default_conversation.version:
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "<Image>" + DEFAULT_IMAGE_TOKEN + "</Image>")
replace_token = DEFAULT_IMAGE_TOKEN
if data_args.mm_use_im_start_end:
replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
if msg.rstrip() != "":
replace_token = replace_token + msg.rstrip() + " " # NOTE for time msg of video
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
# For videoInstruct-100k noisy_data. TODO: Ask Yuanhan to clean the data instead of leaving the noise code here.
sentence["value"] = sentence["value"].replace("QA_GT_caption_based_noisy", "")
return sources
def preprocess_llama_2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2
# Mask targets
sep = "[/INST] "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_gemma(sources: List[List[Dict[str, str]]], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv: conversation_lib.Conversation = conversation_lib.default_conversation.copy()
roles: Dict[str, str] = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations: List[str] = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source: List[Dict[str, str]] = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role: str = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids: torch.Tensor = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids: torch.Tensor = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets: torch.Tensor = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.GEMMA
# Mask target
sep: str = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len: int = int(target.ne(tokenizer.pad_token_id).sum())
rounds: List[str] = conversation.split(conv.sep)
re_rounds = []
for conv_idx in range(0, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2]))
cur_len = 1 # Ignore
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep # Re-append sep because split on this
# Now "".join(parts)==rou
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer)) - 1 # Ignore
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1 # Ignore
else:
round_len = len(tokenizer(rou).input_ids) - 1 # Ignore
instruction_len = len(tokenizer(parts[0]).input_ids) - 1 # Ignore
round_len += 2 # sep: \n takes 2 tokens
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"warning: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
# roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
# Use a deepcopy of the tokenizer so that we don't modify the caller's tokenizer
tokenizer = copy.deepcopy(tokenizer)
# When there is actually an image, we add the image token as a special token
if has_image:
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
im_start, im_end = tokenizer.additional_special_tokens_ids[0:2] # for qwen2_5
# unmask_tokens = ["\n", "<|im_start|>", "<|im_end|>"]
unmask_tokens_idx = [198, im_start, im_end]
nl_tokens = tokenizer("\n").input_ids
# Reset Qwen chat templates so that it won't include system message every time we apply
chat_template = "{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
tokenizer.chat_template = chat_template
# _system = tokenizer("system").input_ids + nl_tokens
# _user = tokenizer("user").input_ids + nl_tokens
# _assistant = tokenizer("assistant").input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
encode_id = tokenizer.apply_chat_template(conv)
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
del tokenizer
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
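# The label construction in preprocess_qwen can be sketched with a toy
# encoder: user and system turns are masked, assistant turns are supervised,
# and the chat-structure tokens are then unmasked everywhere. build_labels,
# toy_encode and VOCAB are hypothetical stand-ins for the real tokenizer and
# its chat template (the newline after the role is omitted for brevity), and
# IGNORE_INDEX = -100 is assumed.

```python
IGNORE_INDEX = -100  # assumed value, normally imported from the constants module

# Toy vocabulary: only the chat-structure tokens are fixed in advance.
VOCAB = {"<|im_start|>": 1, "<|im_end|>": 2, "\n": 3}


def toy_encode(role, content):
    """Stand-in for applying the chat template to a single message."""
    def word_id(word):
        return VOCAB.setdefault(word, len(VOCAB) + 1)

    body = [word_id(w) for w in [role] + content.split()]
    return [VOCAB["<|im_start|>"]] + body + [VOCAB["<|im_end|>"], VOCAB["\n"]]


def build_labels(turns, encode, always_supervised):
    """Return (input_ids, labels) with non-assistant turns masked."""
    input_ids, labels = [], []
    for role, content in turns:
        ids = encode(role, content)
        input_ids += ids
        if role in ("user", "system"):
            labels += [IGNORE_INDEX] * len(ids)
        else:
            labels += ids
    # Structure tokens are supervised regardless of which turn they occur in.
    for i, tok in enumerate(input_ids):
        if tok in always_supervised:
            labels[i] = tok
    return input_ids, labels


turns = [("system", "be helpful"), ("user", "hi"), ("assistant", "hello there")]
input_ids, labels = build_labels(turns, toy_encode, always_supervised={1, 2, 3})
```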
def preprocess_internlm2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
# roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
# Use a deepcopy of the tokenizer so that we don't modify the caller's tokenizer
tokenizer = copy.deepcopy(tokenizer)
# When there is actually an image, we add the image token as a special token
if has_image:
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
unmask_tokens = ["<|im_start|>", "<|im_end|>", "<|action_start|>", "<|action_end|>", "<|interpreter|>", "<|plugin|>"]
unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]
# nl_tokens = tokenizer("\n").input_ids
# chat_template = "{{ bos_token }}{% for message in messages %}{{'<|im_start|>' + message['role'] + '\n' + message['content'] + '<|im_end|>' + '\n'}}{% endfor %}{% if add_generation_prompt %}{{ '<|im_start|>assistant\n' }}{% endif %}"
# tokenizer.chat_template = chat_template
# _system = tokenizer("system").input_ids + nl_tokens
# _user = tokenizer("user").input_ids + nl_tokens
# _assistant = tokenizer("assistant").input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
encode_id = tokenizer.apply_chat_template(conv)
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
del tokenizer
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
def preprocess_llama3(
sources,
tokenizer: transformers.PreTrainedTokenizer,
has_image: bool = False,
max_len=2048,
system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
# roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
# Use a deepcopy of the tokenizer so that we don't modify the caller's tokenizer
tokenizer = copy.deepcopy(tokenizer)
# When there is actually an image, we add the image token as a special token
if has_image:
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>")
start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"]
unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]
# After update, calling tokenizer of llama3 will
# auto add bos id for the tokens. ヽ(`⌒´)ノ
def safe_tokenizer_llama3(text):
input_ids = tokenizer(text).input_ids
if input_ids[0] == bos_token_id:
input_ids = input_ids[1:]
return input_ids
nl_tokens = tokenizer.convert_tokens_to_ids("\n\n")
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
# First is bos token we don't need here
encode_id = tokenizer.apply_chat_template(conv)[1:]
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
def preprocess_v1(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.TWO
# Mask targets
sep = conv.sep + conv.roles[1] + ": "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
round_len -= 1
instruction_len -= 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_mpt(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.MPT
# Mask targets
sep = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep)
re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt
for conv_idx in range(3, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2])) # user + gpt
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 1
if i != 0 and getattr(tokenizer, "legacy", False) and IS_TOKENIZER_GREATER_THAN_0_14:
round_len += 1
instruction_len += 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f"(#turns={len(re_rounds)} ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_plain(
sources: Sequence[str],
tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
# add end signal and concatenate together
conversations = []
for source in sources:
assert len(source) == 2
assert DEFAULT_IMAGE_TOKEN in source[0]["value"]
source[0]["value"] = DEFAULT_IMAGE_TOKEN
conversation = source[0]["value"] + source[1]["value"] + conversation_lib.default_conversation.sep
conversations.append(conversation)
# tokenize conversations
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
tokenized_len = len(tokenizer_image_token(source[0]["value"], tokenizer))
target[:tokenized_len] = IGNORE_INDEX
return dict(input_ids=input_ids, labels=targets)
def preprocess(sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
"""
Given a list of sources, each is a conversation list. This transform:
1. Add signal '### ' at the beginning of each sentence, with end signal '\n';
2. Concatenate conversations together;
3. Tokenize the concatenated conversation;
4. Make a deepcopy as the target. Mask human words with IGNORE_INDEX.
"""
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
return preprocess_plain(sources, tokenizer)
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
return preprocess_llama_2(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version.startswith("v1"):
return preprocess_v1(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "mpt":
return preprocess_mpt(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "qwen":
return preprocess_qwen(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "internlm_2":
return preprocess_internlm2(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "gemma":
return preprocess_gemma(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "llama_v3":
return preprocess_llama3(sources, tokenizer, has_image=has_image)
# add end signal and concatenate together
conversations = []
for source in sources:
header = f"{conversation_lib.default_conversation.system}\n\n"
conversation = _add_speaker_and_signal(header, source)
conversations.append(conversation)
# tokenize conversations
def get_tokenize_len(prompts):
return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts]
if has_image:
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
else:
conversations_tokenized = _tokenize_fn(conversations, tokenizer)
input_ids = conversations_tokenized["input_ids"]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
if has_image:
tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source])
else:
tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"]
speakers = [sentence["from"] for sentence in source]
_mask_targets(target, tokenized_lens, speakers)
return dict(input_ids=input_ids, labels=targets)
class LazySupervisedDataset(Dataset):
def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer, data_args: DataArguments):
super(LazySupervisedDataset, self).__init__()
self.tokenizer = tokenizer
self.num_video_tokens = max(8, data_args.frames_upbound) * 128 // 8 # rough estimate of the number of video tokens
try:
from petrel_client.client import Client
has_client = True
except ImportError:
has_client = False
if has_client:
self.client = Client('~/petreloss.conf')
else:
self.client = None
# if get_local_rank() == 0:
list_data_dict = []
# Handle multiple JSON files specified in the data_path
if "{" in data_path and "}" in data_path:
raise NotImplementedError("Please use .yaml!!!")
base_path, file_pattern = re.match(r"^(.*)\{(.*)\}\.json$", data_path).groups()
file_names = file_pattern.split(",")
rank0_print(f"Loading {file_names} from {base_path}")
data_args.dataset_paths = []
for file_name in file_names:
data_args.dataset_paths.append(f"{base_path}{file_name}.json")
full_path = f"{base_path}{file_name}.json"
rank0_print(f"Loading {full_path}")
with open(full_path, "r") as file:
cur_data_dict = json.load(file)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {full_path}")
list_data_dict.extend(cur_data_dict)
elif data_path.endswith(".yaml"):
with open(data_path, "r") as file:
yaml_data = yaml.safe_load(file)
datasets = yaml_data.get("datasets")
# file should be in the format of:
# datasets:
# - json_path: xxxx1.json
# sampling_strategy: first:1000
# - json_path: xxxx2.json
# sampling_strategy: end:3000
# - json_path: xxxx3.json
# sampling_strategy: random:999
# data_args.dataset_paths = [dataset.get("json_path") for dataset in datasets] # NOTE
for dataset in datasets:
json_path = dataset.get("json_path")
sampling_strategy = dataset.get("sampling_strategy", "all")
sampling_number = None
rank0_print(f"Loading {json_path} with {sampling_strategy} sampling strategy")
if json_path.endswith(".jsonl"):
cur_data_dict = []
if "s3://" in json_path:
with io.BytesIO(self.client.get(json_path)) as json_file:
for line in json_file:
cur_data_dict.append(json.loads(line.strip()))
else:
with open(json_path, "r") as json_file:
for line in json_file:
cur_data_dict.append(json.loads(line.strip()))
elif json_path.endswith(".json"):
if "s3://" in json_path:
with io.BytesIO(self.client.get(json_path)) as json_file:
cur_data_dict = json.load(json_file)
else:
with open(json_path, "r") as json_file:
cur_data_dict = json.load(json_file)
else:
raise ValueError(f"Unsupported file type: {json_path}")
assert len(cur_data_dict) > 0, cur_data_dict
media_type = dataset.get("media_type", None)
if media_type is None:
if 'image' in cur_data_dict[0].keys(): # NOTE: may go wrong on mixed-modality data
media_type = 'image'
elif 'video' in cur_data_dict[0].keys():
media_type = 'video'
else:
media_type = 'text'
if ":" in sampling_strategy:
sampling_strategy, sampling_number = sampling_strategy.split(":")
if "%" in sampling_number:
sampling_number = math.ceil(int(sampling_number.split("%")[0]) * len(cur_data_dict) / 100)
else:
sampling_number = int(sampling_number)
# Apply the sampling strategy
if sampling_strategy == "first" and sampling_number is not None:
cur_data_dict = cur_data_dict[:sampling_number]
rank0_print(f"sampling_strategy={sampling_strategy}, {0}:{sampling_number}")
elif sampling_strategy == "first2" and sampling_number is not None:
cur_data_dict = cur_data_dict[sampling_number:sampling_number*2]
rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number}:{sampling_number*2}")
elif sampling_strategy == "first3" and sampling_number is not None:
cur_data_dict = cur_data_dict[sampling_number*2:sampling_number*3]
rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number*2}:{sampling_number*3}")
elif sampling_strategy == "first4" and sampling_number is not None:
cur_data_dict = cur_data_dict[sampling_number*3:sampling_number*4]
rank0_print(f"sampling_strategy={sampling_strategy}, {sampling_number*3}:{sampling_number*4}")
elif sampling_strategy == "end" and sampling_number is not None:
cur_data_dict = cur_data_dict[-sampling_number:]
rank0_print(f"sampling_strategy={sampling_strategy}, {-sampling_number}:-")
elif sampling_strategy == "random" and sampling_number is not None:
raise NotImplementedError("Don't use random")
random.shuffle(cur_data_dict)
cur_data_dict = cur_data_dict[:sampling_number]
video_read_type = dataset.get("video_read_type", None)
data_root = dataset.get("data_root", "")
# try:
# post-process meta info
if media_type not in ['text', 'mix']:
def check_pnorm2(ori_path): # TODO ugly code, remove it after clean anno file
if ori_path.startswith("pnorm2:s3://") or ori_path.startswith("p2:s3://") or ori_path.startswith("pssd:s3://"):
old_bucket_name = ori_path.split('://')[1].split('/')[0]
data_prefix = ori_path.split('://')[0]
data_path = '/'.join(ori_path.split('://')[1].split('/')[1:])
# new_bucket_name = old_bucket_name.replace('-', '_').lower()
new_bucket_name = old_bucket_name.lower()
return data_prefix + '://' + new_bucket_name + '/' + data_path
else:
return ori_path
for i in range(len(cur_data_dict)):
if video_read_type is not None:
cur_data_dict[i]['video_read_type'] = video_read_type
if type(cur_data_dict[i][media_type]) is list:
new_data_path = []
for old_data_path in cur_data_dict[i][media_type]:
new_data_path.append(os.path.join(data_root, old_data_path))
# new_data_path.append(check_pnorm2(os.path.join(data_root, old_data_path)))
cur_data_dict[i][media_type] = new_data_path
else:
cur_data_dict[i][media_type] = os.path.join(data_root, cur_data_dict[i][media_type])
# cur_data_dict[i][media_type] = check_pnorm2(os.path.join(data_root, cur_data_dict[i][media_type]))
rank0_print(f"Check samples from {json_path}, media_type={media_type}, video_read_type={video_read_type}, data_root={data_root}")
if media_type not in ['text', 'mix'] and video_read_type != 'fake':
ok = False
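for i in range(min(3, len(cur_data_dict))): # probe up to three samples; guard against datasets with fewer than three entries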
data_path = cur_data_dict[i][media_type]
if type(data_path) is list:
data_path = data_path[0]
rank0_print(f"Checking: {data_path}")
if 's3://' in data_path:
if media_type == 'video' and video_read_type in ['img', 'frame']:
for path in self.client.list(data_path):
ok = True
break
else:
tmp_data = self.client.get(data_path)
if tmp_data is not None and len(tmp_data) > 0:
ok = True
else:
if os.path.exists(data_path):
ok = True
if ok:
break
assert ok, f"Data in {data_path} can't be read!"
rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}, media_type={media_type}, video_read_type={video_read_type}, data_root={data_root}")
# except Exception as e:
# rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}, data_root={data_root}, something maybe wrong {e}!!!")
list_data_dict.extend(cur_data_dict)
else:
raise NotImplementedError("Please use .yaml!!!")
data_args.dataset_paths = [data_path]
rank0_print(f"Loading {data_path}")
with open(data_path, "r") as file:
cur_data_dict = json.load(file)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {data_path}")
list_data_dict.extend(cur_data_dict)
# else:
# list_data_dict = []
self.list_data_dict = list_data_dict
# self.list_data_dict = TorchShmSerializedList(list_data_dict)
rank0_print(f"Loaded {len(self.list_data_dict)} samples from {data_path}")
rank0_print("Formatting inputs...Skip in lazy mode")
self.tokenizer = tokenizer
self.data_args = data_args
def __len__(self):
return len(self.list_data_dict)
@property
def lengths(self):
length_list = []
for sample in self.list_data_dict:
if "image" in sample:
img_tokens = 128
elif "video" in sample:
img_tokens = self.num_video_tokens
else:
img_tokens = 0
length_list.append(sum(len(conv["value"].split()) for conv in sample["conversations"]) + img_tokens)
return length_list
@property
def modality_lengths(self):
length_list = []
for sample in self.list_data_dict:
cur_len = sum(len(conv["value"].split()) for conv in sample["conversations"])
assert cur_len > 0, f"Conversation length is 0 for {sample}"
if "image" in sample or "video" in sample or self.data_args.early_mix_text:
length_list.append(cur_len)
else:
length_list.append(-cur_len)
return length_list
def process_image(self, image_file, overwrite_image_aspect_ratio=None):
# image_folder = self.data_args.image_folder
# start_time = time.time()
processor = self.data_args.image_processor
# print(f"\n\nInspecting the image path, image_file={image_file}")
try:
if 's3://' in image_file:
value = self.client.Get(image_file)
img_bytes = np.frombuffer(value, dtype=np.uint8)
with io.BytesIO(img_bytes) as buff:
image = Image.open(buff).convert('RGB')
else:
image = Image.open(image_file).convert('RGB') # PIL Image
except Exception as exn:
print(f"Failed to open image {image_file}. Exception:", exn)
raise exn
image_size = image.size
image_aspect_ratio = self.data_args.image_aspect_ratio
if overwrite_image_aspect_ratio is not None:
image_aspect_ratio = overwrite_image_aspect_ratio
if image_aspect_ratio == "highres":
raise NotImplementedError
image = process_highres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
# elif image_aspect_ratio == "anyres" or "anyres_max" in image_aspect_ratio:
elif "anyres" in image_aspect_ratio:
if 'nopad' in image_aspect_ratio:
image = process_anyres_image_nopad(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
else:
raise NotImplementedError
image = process_anyres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
elif image_aspect_ratio == "crop_split":
raise NotImplementedError
image = process_highres_image_crop_split(image, self.data_args)
elif image_aspect_ratio == "pad":
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
else:
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
# end_time = time.time()
# print(image_file, end_time - start_time)
# print(f"OK, image_file={image_file}\n\n")
return image, image_size, "image"
def process_video(self, video_file, data_anno, data_args):
# print(f"\n\nInspecting the video path, video_file={video_file}\n\n", flush=True)
# logging.info(f"\n\nInspecting the video path, video_file={video_file}\n\n")
# start_time = time.time()
local_num_frames = data_args.local_num_frames
max_num_frames = data_args.frames_upbound
min_num_frames = data_args.frames_lowbound
sample_type = data_args.sample_type
video_reader_type = data_anno.get("video_read_type", "decord")
if "start" in data_anno and "end" in data_anno:
clip = [float(data_anno["start"]), float(data_anno["end"])]
else:
clip = None
if clip is None or video_reader_type == "img":
video_reader = VIDEO_READER_FUNCS[video_reader_type]
frames, frame_indices, fps, duration = video_reader(
video_file, max_num_frames, sample_type,
min_num_frames=min_num_frames,
max_num_frames=max_num_frames, client=self.client, clip=clip,
local_num_frames=local_num_frames
)
# if sample_type in ['rand', 'middle'] and len(frames) < local_num_frames and len(frames) != max_num_frames:
# raise ValueError(f"{video_file} only have {len(frames)} frames!!!")
# logger.info(f"{data_path} is OK!!!!")
else:
video_reader = VIDEO_READER_FUNCS['lazy']
start, end = clip
duration = end - start
if min_num_frames > duration:
min_num_frames = (duration // local_num_frames) * local_num_frames
if sample_type == 'dynamic_fps1':
num_segments = int(duration // local_num_frames)
if num_segments == 0:
num_frames = local_num_frames
else:
num_frames = local_num_frames * num_segments
num_frames = min(num_frames, max_num_frames)
num_frames = max(num_frames, min_num_frames)
else:
num_frames = max_num_frames
frames, frame_indices, fps = video_reader(video_file, num_frames=num_frames, video_start=start, video_end=end, client=self.client)
# logger.info(f"{data_path} is OK, duration={end-start} num_frames={num_frames}!!!!")
if sample_type == 'dynamic_fps1' and len(frames) % local_num_frames != 0:
raise ValueError(f"min_num_frames={min_num_frames}, max_num_frames={max_num_frames}, local_num_frames={local_num_frames}, len(frames)={len(frames)}, is wrong!!!")
sec = [str(round(f / fps, 1)) for f in frame_indices]
if data_args.time_msg is not None and sec is not None:
if data_args.time_msg == 'short':
msg = f"\nThe video lasts for {duration:.2f} seconds, and {len(sec)} frames are uniformly sampled from it. "
else:
# " " should be added at the start and end
msg = f"\nThe video lasts for {duration:.2f} seconds, and {len(sec)} frames are uniformly sampled at {', '.join(sec)} seconds. "
else:
msg = ""
# logging.info(f"OK, video_file={video_file}\n\n")
# print(f"OK, video_file={video_file}\n\n", flush=True)
# end_time = time.time()
# print(video_file, end_time - start_time)
return frames, msg
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
# TODO: define number of retries somewhere else
num_base_retries = 2
num_final_retries = 300
# try the current sample first
for attempt_idx in range(num_base_retries):
try:
sample = self._get_item(i)
return sample
except Exception as e:
# sleep 1s in case it is a cloud disk issue
print(f"[Try #{attempt_idx}] Failed to fetch sample {i}. Exception:", e)
if attempt_idx != (num_base_retries -1):
time.sleep(1)
retry_step = 5
# try other samples, in case it is file corruption issue
for attempt_idx in range(num_base_retries+3):
try:
next_index = min(i + retry_step, len(self.list_data_dict) - 1)
# sample_idx = random.choice(range(len(self)))
sample = self._get_item(next_index)
return sample
except Exception as e:
# no need to sleep
print(f"[Try other #{attempt_idx}] Failed to fetch sample {next_index}. Exception:", e)
retry_step *= 2
try:
sample = self._get_item(i)
return sample
except Exception as e:
raise e
def _get_item(self, i) -> Dict[str, torch.Tensor]:
sources = self.list_data_dict[i]
if isinstance(i, int):
sources = [sources]
else:
raise NotImplementedError(i)
assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME
if "image" in sources[0]:
image_file = self.list_data_dict[i]["image"]
if type(image_file) is list:
# Handling multi images
# overwrite to process with simple pad
if len(image_file) > 1:
image = [self.process_image(f, "pad") for f in image_file]
image = [[im[0], im[1], "image"] for im in image]
else:
image = [self.process_image(f) for f in image_file]
else:
image = [self.process_image(image_file)]
# sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args)
sources = preprocess_multimodal([e["conversations"] for e in sources], self.data_args)
elif "video" in sources[0]:
video_file = self.list_data_dict[i]["video"]
try:
video, time_msg = self.process_video(video_file, data_anno=self.list_data_dict[i], data_args=self.data_args)
# print(video_file, time_msg)
processor = self.data_args.image_processor
frame_aspect_ratio = self.data_args.frame_aspect_ratio
# if frame_aspect_ratio == "anyres" or "anyres_max" in frame_aspect_ratio:
if "anyres" in frame_aspect_ratio:
if 'nopad' in frame_aspect_ratio:
image = process_anyres_video_nopad(video, self.data_args.image_processor, self.data_args.frame_grid_pinpoints, max_resolutions=self.data_args.max_num_pixels // len(video))
else:
raise NotImplementedError
# image = process_anyres_video(video, self.data_args.image_processor, self.data_args.frame_grid_pinpoints)
else:
image = processor.preprocess(video, return_tensors="pt")["pixel_values"]
image = [(image, video[0].shape[0:2], "video")]
# sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args, msg=time_msg)
sources = preprocess_multimodal([e["conversations"] for e in sources], self.data_args, msg=time_msg)
except Exception as e:
print(f"Error: {e}")
print(f"Failed to read video file: {video_file}")
raise e
else:
# sources = copy.deepcopy([e["conversations"] for e in sources]) # NOTE: this causes problems when epoch > 1; better to preprocess the data ahead of time
sources = [e["conversations"] for e in sources]
has_image = ("image" in self.list_data_dict[i]) or ("video" in self.list_data_dict[i])
data_dict = preprocess(sources, self.tokenizer, has_image=has_image)
if "prompt" in data_dict:
prompt = data_dict["prompt"]
else:
prompt = None
if isinstance(i, int):
data_dict = dict(input_ids=data_dict["input_ids"][0], labels=data_dict["labels"][0])
# image exist in the data
if "image" in self.list_data_dict[i]:
data_dict["image"] = image
elif "video" in self.list_data_dict[i]:
data_dict["image"] = image
elif self.data_args.is_multimodal:
# image does not exist in the data, but the model is multimodal
crop_size = self.data_args.image_processor.crop_size
data_dict["image"] = [
(torch.zeros(1, 3, crop_size["height"], crop_size["width"]), (crop_size["width"], crop_size["height"]), "text"),
]
# prompt exist in the data
if prompt is not None:
data_dict["prompt"] = prompt
data_dict["id"] = self.list_data_dict[i].get("id", i)
# gc.collect() # NOTE
return data_dict
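# Illustrative sketch, not part of the original training code: the frame-count
# arithmetic that process_video applies to a clipped video (one with "start"
# and "end" annotations) when sample_type == 'dynamic_fps1'. The function name
# `_example_dynamic_fps1_num_frames` is hypothetical; the body restates the
# branch above so the rounding behaviour can be checked in isolation.
def _example_dynamic_fps1_num_frames(duration, local_num_frames, min_num_frames, max_num_frames):
    """Pick a frame count that is a multiple of `local_num_frames`, about one frame per second."""
    if min_num_frames > duration:
        # Very short clip: lower the floor to the largest multiple that fits
        min_num_frames = (duration // local_num_frames) * local_num_frames
    num_segments = int(duration // local_num_frames)
    if num_segments == 0:
        num_frames = local_num_frames  # always sample at least one local group
    else:
        num_frames = local_num_frames * num_segments
    num_frames = min(num_frames, max_num_frames)
    num_frames = max(num_frames, min_num_frames)
    return num_frames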
@dataclass
class DataCollatorForSupervisedDataset(object):
"""Collate examples for supervised fine-tuning."""
tokenizer: transformers.PreTrainedTokenizer
def pad_sequence(self, input_ids, batch_first, padding_value):
if self.tokenizer.padding_side == "left":
input_ids = [torch.flip(_input_ids, [0]) for _input_ids in input_ids]
input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value)
if self.tokenizer.padding_side == "left":
input_ids = torch.flip(input_ids, [1])
return input_ids
def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
# input_ids, labels, ids = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels", "id"))
input_ids = [_input_ids[: self.tokenizer.model_max_length] for _input_ids in input_ids]
labels = [_labels[: self.tokenizer.model_max_length] for _labels in labels]
if self.tokenizer.pad_token_id is None:
# self.tokenizer.pad_token_id = self.tokenizer.eos_token_id # FIXME: this could only be triggered for llama3 model.
self.tokenizer.pad_token_id = 0 # This gets the best result. Don't know why.
input_ids = self.pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
labels = self.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
batch = dict(input_ids=input_ids, labels=labels.long() if labels.dtype == torch.int32 else labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id))
# batch = dict(input_ids=input_ids, labels=labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id), ids=ids)
if "image" in instances[0]:
images = [instance["image"] for instance in instances]
# data_format: [image/video, spatial_size, media_type]
batch["image_sizes"] = [im[1] for im_list in images for im in im_list]
batch["modalities"] = [im[2] for im_list in images for im in im_list]
images = [im[0] for im_list in images for im in im_list] # flatten multi-images
# Flattening multi-image samples should be harmless as long as the order still matches downstream
# use list for input of different lengths
# if all(x is not None and x.shape == images[0].shape for x in images):
# Image: (N, P, C, H, W)
# Video: (N, F, C, H, W)
# batch["images"] = torch.stack(images)
# else:
batch["images"] = images
else:
# Text-only samples are also given a placeholder "image" entry, so this branch should be unreachable
raise NotImplementedError(instances[0])
if "prompt" in instances[0]:
batch["prompts"] = [instance["prompt"] for instance in instances]
return batch
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict:
"""Make dataset and collator for supervised fine-tuning."""
train_dataset = LazySupervisedDataset(tokenizer=tokenizer, data_path=data_args.data_path, data_args=data_args)
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)
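# Illustrative sketch, not part of the original training code: the
# flip / right-pad / flip trick that DataCollatorForSupervisedDataset.pad_sequence
# uses to obtain left padding, restated on plain Python lists instead of
# torch tensors. The helper name `_example_left_pad` is hypothetical.
def _example_left_pad(sequences, padding_value):
    """Left-pad every sequence to the longest length by reversing, right-padding, and reversing back."""
    flipped = [seq[::-1] for seq in sequences]  # step 1: reverse each sequence
    max_len = max(len(seq) for seq in flipped)
    # step 2: ordinary right padding on the reversed sequences
    padded = [seq + [padding_value] * (max_len - len(seq)) for seq in flipped]
    # step 3: reverse each row again so the padding ends up on the left
    return [row[::-1] for row in padded]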
def get_model(model_args, training_args, bnb_model_from_pretrained_args):
assert training_args.attn_implementation
if training_args.attn_implementation == "sdpa" and torch.__version__ < "2.1.2":
raise ValueError("The 'sdpa' attention implementation requires torch version 2.1.2 or higher.")
customized_kwargs = dict()
customized_kwargs.update(bnb_model_from_pretrained_args)
cfg_pretrained = None
if ',' in model_args.llm_compress_layer_list:
llm_compress_layer_list = [int(i) for i in model_args.llm_compress_layer_list.split(',')]
else:
llm_compress_layer_list = [int(model_args.llm_compress_layer_list)]
llm_image_token_ratio_list = [float(i) for i in model_args.llm_image_token_ratio_list.split(',')]
overwrite_config = {"vision_encode_type": model_args.vision_encode_type,
"mm_num_compress_latents": model_args.mm_num_compress_latents,
"mm_num_compress_query_type": model_args.mm_num_compress_query_type,
"mm_pos_num_frames": model_args.mm_pos_num_frames,
"mm_local_num_frames": model_args.mm_local_num_frames,
"mm_close_init": model_args.mm_close_init,
"min_slow_num_frames": model_args.min_slow_num_frames,
"mm_llm_compress": model_args.mm_llm_compress,
"llm_compress_layer_list": llm_compress_layer_list,
"llm_image_token_ratio_list": llm_image_token_ratio_list,
"llm_compress_type": model_args.llm_compress_type,
"mm_projector_type": model_args.mm_projector_type,
"mm_patch_merge_type": model_args.mm_patch_merge_type,
"mm_newline_position": model_args.mm_newline_position
}
if any(
[
model_args.rope_scaling_factor is not None,
model_args.rope_scaling_type is not None,
model_args.mm_spatial_pool_stride is not None,
model_args.mm_spatial_pool_out_channels is not None,
model_args.mm_spatial_pool_mode is not None,
model_args.mm_resampler_type is not None,
]
):
if "internlm2" in model_args.model_name_or_path.lower():
cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path, trust_remote_code=True)
else:
cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path)
else:
raise NotImplementedError(model_args)
if model_args.use_pos_skipping is not None and model_args.pos_skipping_range is not None:
overwrite_config["use_pos_skipping"] = model_args.use_pos_skipping
overwrite_config["pos_skipping_range"] = model_args.pos_skipping_range
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None:
overwrite_config["rope_scaling"] = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if training_args.model_max_length is None:
training_args.model_max_length = cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor
overwrite_config["max_sequence_length"] = training_args.model_max_length
assert training_args.model_max_length == int(cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor), (
f"model_max_length: {training_args.model_max_length}, max_position_embeddings: {cfg_pretrained.max_position_embeddings}, rope_scaling_factor: {model_args.rope_scaling_factor}"
)
# overwrite_config["max_sequence_length"] = model_args.max_sequence_length
# overwrite_config["tokenizer_model_max_length"] = model_args.tokenizer_model_max_length
if model_args.mm_spatial_pool_stride is not None and model_args.mm_spatial_pool_out_channels is not None and model_args.mm_spatial_pool_mode is not None and model_args.mm_resampler_type is not None:
overwrite_config["mm_resampler_type"] = model_args.mm_resampler_type
overwrite_config["mm_spatial_pool_stride"] = model_args.mm_spatial_pool_stride
overwrite_config["mm_spatial_pool_out_channels"] = model_args.mm_spatial_pool_out_channels
overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode
if model_args.mm_spatial_pool_mode is not None:
overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode
if overwrite_config:
assert cfg_pretrained is not None, "cfg_pretrained is None"
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(cfg_pretrained, k, v)
customized_kwargs["config"] = cfg_pretrained
if model_args.model_class_name is not None:
raise NotImplementedError(model_args)
actual_model_class_name = f"{model_args.model_class_name}ForCausalLM"
model_class = getattr(transformers, actual_model_class_name)
rank0_print(f"Using model class {model_class} from {model_args.model_class_name}")
model = model_class.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif model_args.vision_tower is not None:
if "mixtral" in model_args.model_name_or_path.lower():
raise ValueError(f"I don't want model class {model_args}")
model = LlavaMixtralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
elif "mistral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
model = LlavaMistralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
# or "yi" in model_args.model_name_or_path.lower()
# or "nous-hermes" in model_args.model_name_or_path.lower()
# and "wizard-2" in model_args.model_name_or_path.lower()
):
raise ValueError(f"I don't want model class {model_args}")
model = LlavaLlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif "qwen" in model_args.model_name_or_path.lower():
if "moe" in model_args.model_name_or_path.lower() or "A14B" in model_args.model_name_or_path:
model = LlavaQwenMoeForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock])
elif overwrite_config['mm_llm_compress']:
model = LlavaQwenForCausalLM_Pdrop.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
else:
model = LlavaQwenForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif "internlm2" in model_args.model_name_or_path.lower():
if overwrite_config['mm_llm_compress']:
raise NotImplementedError
# model = LlavaInternLM2ForCausalLM_Pdrop.from_pretrained(
# model_args.model_name_or_path,
# cache_dir=training_args.cache_dir,
# attn_implementation=training_args.attn_implementation,
# torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
# low_cpu_mem_usage=False,
# **customized_kwargs,
# )
else:
model = LlavaInternLM2ForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
elif "gemma" in model_args.model_name_or_path.lower():
raise ValueError(f"I don't want model class {model_args}")
model = LlavaGemmaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
else:
raise ValueError(f"Unknown model class {model_args}")
else:
raise NotImplementedError
model = transformers.LlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=False,
**customized_kwargs,
)
rank0_print(f"Model config: {model.config}.")
return model
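# Illustrative sketch, not part of the original training code: how get_model
# turns the comma-separated `llm_compress_layer_list` and
# `llm_image_token_ratio_list` arguments into Python lists. The helper name
# `_example_parse_compress_args` is hypothetical. Note that str.split(",")
# already returns a one-element list for a value without commas, so a single
# code path covers both branches used above.
def _example_parse_compress_args(layer_list, ratio_list):
    """Parse e.g. ("8,16", "1.0,0.75,0.5") into ([8, 16], [1.0, 0.75, 0.5])."""
    layers = [int(i) for i in layer_list.split(",")]
    ratios = [float(i) for i in ratio_list.split(",")]
    return layers, ratios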
def train(attn_implementation=None):
# global local_rank
parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
# wandb.init(project="mllm", entity="likunchang", name=os.path.basename(training_args.output_dir), reinit=True)
# local_broadcast_process_authkey() # NOTE
if training_args.verbose_logging:
rank0_print(f"Inspecting experiment hyperparameters:\n")
rank0_print(f"model_args = {vars(model_args)}\n\n")
rank0_print(f"data_args = {vars(data_args)}\n\n")
rank0_print(f"training_args = {vars(training_args)}\n\n")
# rank0_print(f"evaluation_args = {vars(evaluation_args)}\n\n")
# local_rank = training_args.local_rank
compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
bnb_model_from_pretrained_args = {}
if training_args.bits in [4, 8]:
from transformers import BitsAndBytesConfig
bnb_model_from_pretrained_args.update(
dict(
device_map={"": training_args.device},
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
quantization_config=BitsAndBytesConfig(
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=training_args.double_quant,
bnb_4bit_quant_type=training_args.quant_type, # {'fp4', 'nf4'}
),
)
)
model = get_model(model_args, training_args, bnb_model_from_pretrained_args)
model.config.use_cache = False
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None:
model.config.rope_scaling = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if model_args.freeze_backbone:
model.model.requires_grad_(False)
if training_args.bits in [4, 8]:
from peft import prepare_model_for_kbit_training
model.config.torch_dtype = torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
if training_args.gradient_checkpointing:
if hasattr(model, "enable_input_require_grads"):
model.enable_input_require_grads()
else:
def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
if training_args.lora_enable:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=training_args.lora_r,
lora_alpha=training_args.lora_alpha,
target_modules=find_all_linear_names(model),
lora_dropout=training_args.lora_dropout,
bias=training_args.lora_bias,
task_type="CAUSAL_LM",
)
if training_args.bits == 16:
if training_args.bf16:
model.to(torch.bfloat16)
if training_args.fp16:
model.to(torch.float16)
rank0_print("Adding LoRA adapters...")
model = get_peft_model(model, lora_config)
if "mistral" in model_args.model_name_or_path.lower() or "mixtral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="left")
elif "qwen" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right")
elif "internlm2" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right", trust_remote_code=True)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
or "yi" in model_args.model_name_or_path.lower()
or "nous-hermes" in model_args.model_name_or_path.lower()
and "wizard-2" in model_args.model_name_or_path.lower()
):
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
model_max_length=training_args.model_max_length,
padding_side="right",
use_fast=False,
)
rank0_print(f"Prompt version: {model_args.version}")
if model_args.version == "v0":
if tokenizer.pad_token is None:
smart_tokenizer_and_embedding_resize(
special_tokens_dict=dict(pad_token="[PAD]"),
tokenizer=tokenizer,
model=model,
)
elif model_args.version == "v0.5":
tokenizer.pad_token = tokenizer.unk_token
else:
if tokenizer.unk_token is not None:
tokenizer.pad_token = tokenizer.unk_token
if model_args.version in conversation_lib.conv_templates:
conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
else:
raise NotImplementedError(f"Can't find your conv_templates: {model_args.version}")
if model_args.vision_tower is not None:
model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
vision_tower = model.get_vision_tower()
vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
# NOTE hard code
data_args.image_processor = vision_tower.image_processor
data_args.is_multimodal = True
model.config.image_aspect_ratio = data_args.image_aspect_ratio
model.config.frame_aspect_ratio = data_args.frame_aspect_ratio
if data_args.image_grid_pinpoints is not None:
if isinstance(data_args.image_grid_pinpoints, str) and "x" in data_args.image_grid_pinpoints:
try:
patch_size = data_args.image_processor.size[0]
except Exception as e:
patch_size = data_args.image_processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", data_args.image_grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
data_args.image_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
elif isinstance(data_args.image_grid_pinpoints, str):
data_args.image_grid_pinpoints = ast.literal_eval(data_args.image_grid_pinpoints)
if data_args.frame_grid_pinpoints is not None:
if isinstance(data_args.frame_grid_pinpoints, str) and "x" in data_args.frame_grid_pinpoints:
try:
patch_size = data_args.image_processor.size[0]
except Exception as e:
patch_size = data_args.image_processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", data_args.frame_grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
data_args.frame_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
elif isinstance(data_args.frame_grid_pinpoints, str):
data_args.frame_grid_pinpoints = ast.literal_eval(data_args.frame_grid_pinpoints)
model.config.max_num_pixels = data_args.max_num_pixels
model.config.frame_grid_pinpoints = data_args.frame_grid_pinpoints
model.config.image_grid_pinpoints = data_args.image_grid_pinpoints
model.config.image_crop_resolution = data_args.image_crop_resolution
model.config.image_split_resolution = data_args.image_split_resolution
model.config.tokenizer_padding_side = tokenizer.padding_side
model.config.tokenizer_model_max_length = tokenizer.model_max_length
model.config.mm_newline_position = model_args.mm_newline_position
### Deciding train which part of the model
if model_args.mm_tunable_parts is None: # traditional way of deciding which part to train
model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
model.config.tune_mm_vision_resampler = training_args.tune_mm_vision_resampler = model_args.tune_mm_vision_resampler
if model_args.tune_mm_mlp_adapter or model_args.tune_mm_vision_resampler:
model.requires_grad_(False)
if model_args.tune_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
if training_args.freeze_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = False
model.config.freeze_mm_vision_resampler = training_args.freeze_mm_vision_resampler
model.config.unfreeze_mm_vision_tower = model_args.unfreeze_mm_vision_tower
if model_args.unfreeze_mm_vision_tower:
vision_tower.requires_grad_(True)
else:
vision_tower.requires_grad_(False)
else:
rank0_print(f"Using mm_tunable_parts: {model_args.mm_tunable_parts}")
model.config.mm_tunable_parts = training_args.mm_tunable_parts = model_args.mm_tunable_parts
# Set the entire model to not require gradients by default
model.requires_grad_(False)
vision_tower.requires_grad_(False)
model.get_model().mm_projector.requires_grad_(False)
# Parse the mm_tunable_parts to decide which parts to unfreeze
tunable_parts = model_args.mm_tunable_parts.split(",")
if "mm_mlp_adapter" in tunable_parts:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
if "mm_vision_tower" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" in name:
param.requires_grad_(True)
if "mm_language_model" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" not in name and "mm_projector" not in name and "vision_resampler" not in name:
param.requires_grad_(True)
total_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters())
trainable_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters() if p.requires_grad)
rank0_print(f"Total parameters: ~{total_params/1e6:.2f}M")
rank0_print(f"Trainable parameters: ~{trainable_params/1e6:.2f}M")
if training_args.bits in [4, 8]:
model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device)
model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_projector_lr = training_args.mm_projector_lr
model.config.mm_vision_tower_lr = training_args.mm_vision_tower_lr
training_args.use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
if training_args.bits in [4, 8]:
from peft.tuners.lora import LoraLayer
for name, module in model.named_modules():
if isinstance(module, LoraLayer):
if training_args.bf16:
module = module.to(torch.bfloat16)
if "norm" in name:
module = module.to(torch.float32)
if "lm_head" in name or "embed_tokens" in name:
if hasattr(module, "weight"):
if training_args.bf16 and module.weight.dtype == torch.float32:
module = module.to(torch.bfloat16)
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = LLaVATrainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
rank0_print(f"model_config before train: {model.config}")
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
trainer.train(resume_from_checkpoint=True)
else:
trainer.train()
trainer.save_state()
model.config.use_cache = True
if training_args.lora_enable:
state_dict = get_peft_state_maybe_zero_3(model.named_parameters(), training_args.lora_bias)
non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters())
if training_args.local_rank == 0 or training_args.local_rank == -1:
if hasattr(model, "config"):
model.config.save_pretrained(training_args.output_dir)
if hasattr(model, "generation_config"):
model.generation_config.save_pretrained(training_args.output_dir)
model.save_pretrained(training_args.output_dir, state_dict=state_dict)
torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, "non_lora_trainables.bin"))
else:
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
rank0_print(f"Model saved to {training_args.output_dir}")
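The `mm_tunable_parts` branch above freezes everything and then re-enables gradients by matching substrings of parameter names. The following is a minimal sketch of that selection rule on plain strings, with no model involved. The function name `select_trainable` and the sample parameter names are hypothetical, and matching the projector by the substring `mm_projector` is an approximation of `model.get_model().mm_projector.parameters()`.

```python
def select_trainable(param_names, mm_tunable_parts):
    """Return the parameter names that would be unfrozen (illustrative only)."""
    parts = mm_tunable_parts.split(",")
    frozen_for_lm = ("vision_tower", "mm_projector", "vision_resampler")
    trainable = []
    for name in param_names:
        # Projector weights are unfrozen when "mm_mlp_adapter" is requested.
        if "mm_mlp_adapter" in parts and "mm_projector" in name:
            trainable.append(name)
        # Vision tower weights are matched by substring, as in the training script.
        elif "mm_vision_tower" in parts and "vision_tower" in name:
            trainable.append(name)
        # The language model is everything that is not a multimodal component.
        elif "mm_language_model" in parts and not any(k in name for k in frozen_for_lm):
            trainable.append(name)
    return trainable


# Hypothetical parameter names, for illustration only.
names = [
    "model.layers.0.self_attn.q_proj.weight",
    "model.mm_projector.0.weight",
    "model.vision_tower.encoder.weight",
    "lm_head.weight",
]
picked = select_trainable(names, "mm_mlp_adapter,mm_language_model")
```

With these inputs only the vision tower weight stays frozen.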
if __name__ == "__main__":
train()
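The grid-pinpoint parsing in `train()` expands a range string such as `"(1x1),...,(2x2)"` into every grid shape between the first and last pair, then scales each by the processor's patch size. Below is a self-contained sketch of that expansion. The helper name `expand_grid_pinpoints` is hypothetical, and the explicit `ValueError` for a string without pairs is an addition not present in the training script.

```python
import re


def expand_grid_pinpoints(spec, patch_size):
    """Expand '(AxB),...,(CxD)' into pixel resolutions (illustrative sketch)."""
    matches = re.findall(r"\((\d+)x(\d+)\)", spec)
    if not matches:
        raise ValueError(f"No '(AxB)' pairs found in: {spec!r}")
    range_start = tuple(map(int, matches[0]))
    range_end = tuple(map(int, matches[-1]))
    # Every (i, j) between the first and last pair, inclusive on both axes.
    grid = [
        (i, j)
        for i in range(range_start[0], range_end[0] + 1)
        for j in range(range_start[1], range_end[1] + 1)
    ]
    # Convert grid counts to pixel sizes.
    return [[dim * patch_size for dim in pair] for pair in grid]


pinpoints = expand_grid_pinpoints("(1x1),...,(2x2)", 224)
```

Only the first and last pairs matter; anything between them in the string is ignored.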
================================================
FILE: xtuner-eval_niah/llava/train/train_mem.py
================================================
from llava.train.train import train
from llava.dist_utils import init_distributed_mode
if __name__ == "__main__":
init_distributed_mode()
train()
================================================
FILE: xtuner-eval_niah/llava/utils.py
================================================
import datetime
import logging
import logging.handlers
import os
import sys
import numpy as np
import requests
from llava.constants import LOGDIR
server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
moderation_msg = "I am sorry. Your input may violate our content moderation guidelines. Please avoid using harmful or offensive content."
handler = None
import torch.distributed as dist
try:
import av
from decord import VideoReader, cpu
except ImportError:
print("Please install pyav and decord to use video processing functions.")
def process_video_with_decord(video_file, data_args):
vr = VideoReader(video_file, ctx=cpu(0), num_threads=1)
total_frame_num = len(vr)
avg_fps = max(1, round(vr.get_avg_fps() / data_args.video_fps))  # stride in frames; clamp so range() never gets a zero step
frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
if data_args.frames_upbound > 0:
if len(frame_idx) > data_args.frames_upbound:
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
video = vr.get_batch(frame_idx).asnumpy()
# https://github.com/dmlc/decord/issues/208
vr.seek(0)
return video
def process_video_with_pyav(video_file, data_args):
container = av.open(video_file)
# !!! This is the only difference. Using auto threading
container.streams.video[0].thread_type = "AUTO"
video_frames = []
for packet in container.demux():
if packet.stream.type == 'video':
for frame in packet.decode():
video_frames.append(frame)
total_frame_num = len(video_frames)
video_time = video_frames[-1].time
avg_fps = max(1, round(total_frame_num / video_time / data_args.video_fps))  # stride in frames; clamp so range() never gets a zero step
frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
if data_args.frames_upbound > 0:
if len(frame_idx) > data_args.frames_upbound:
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
frames = [video_frames[i] for i in frame_idx]
return np.stack([x.to_ndarray(format="rgb24") for x in frames])
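Both `process_video_with_decord` and `process_video_with_pyav` pick frames the same way: take one frame every `native_fps / target_fps` frames, and fall back to uniform sampling over the whole video when that yields more than `frames_upbound` frames. Here is a sketch of just the index arithmetic, without any video decoding. The function name `sample_frame_indices` and its parameter names are hypothetical, and clamping the stride to at least 1 is an assumption added here to avoid a zero `range()` step.

```python
import numpy as np


def sample_frame_indices(total_frames, native_fps, target_fps, frames_upbound=0):
    """Pick frame indices at roughly target_fps, capped at frames_upbound."""
    # Stride in frames between two sampled frames (clamped to at least 1).
    stride = max(1, round(native_fps / target_fps))
    frame_idx = list(range(0, total_frames, stride))
    # Too many frames: resample uniformly across the full video instead.
    if frames_upbound > 0 and len(frame_idx) > frames_upbound:
        frame_idx = np.linspace(0, total_frames - 1, frames_upbound, dtype=int).tolist()
    return frame_idx


strided = sample_frame_indices(total_frames=10, native_fps=30, target_fps=10)
capped = sample_frame_indices(total_frames=300, native_fps=30, target_fps=1, frames_upbound=4)
```

Note that the capped case spans the first and last frame, while the strided case always starts at frame 0.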
def rank0_print(*args):
if dist.is_initialized():
if dist.get_rank() == 0:
print(f"Rank {dist.get_rank()}: ", *args)
else:
print(*args)
def rank_print(*args):
if dist.is_initialized():
print(f"Rank {dist.get_rank()}: ", *args)
else:
print(*args)
def build_logger(logger_name, logger_filename):
global handler
formatter = logging.Formatter(
fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
# Set the format of root handlers
if not logging.getLogger().handlers:
logging.basicConfig(level=logging.INFO)
logging.getLogger().handlers[0].setFormatter(formatter)
# Redirect stdout and stderr to loggers
stdout_logger = logging.getLogger("stdout")
stdout_logger.setLevel(logging.INFO)
sl = StreamToLogger(stdout_logger, logging.INFO)
sys.stdout = sl
stderr_logger = logging.getLogger("stderr")
stderr_logger.setLevel(logging.ERROR)
sl = StreamToLogger(stderr_logger, logging.ERROR)
sys.stderr = sl
# Get logger
logger = logging.getLogger(logger_name)
logger.setLevel(logging.INFO)
# Add a file handler for all loggers
if handler is None:
os.makedirs(LOGDIR, exist_ok=True)
filename = os.path.join(LOGDIR, logger_filename)
handler = logging.handlers.TimedRotatingFileHandler(filename, when="D", utc=True)
handler.setFormatter(formatter)
for name, item in logging.root.manager.loggerDict.items():
if isinstance(item, logging.Logger):
item.addHandler(handler)
return logger
class StreamToLogger(object):
"""
Fake file-like stream object that redirects writes to a logger instance.
"""
def __init__(self, logger, log_level=logging.INFO):
self.terminal = sys.stdout
self.logger = logger
self.log_level = log_level
self.linebuf = ""
def __getattr__(self, attr):
return getattr(self.terminal, attr)
def write(self, buf):
temp_linebuf = self.linebuf + buf
self.linebuf = ""
for line in temp_linebuf.splitlines(True):
# From the io.TextIOWrapper docs:
# On output, if newline is None, any '\n' characters written
# are translated to the system default line separator.
# By default sys.stdout.write() expects '\n' newlines and then
# translates them so this is still cross platform.
if line[-1] == "\n":
self.logger.log(self.log_level, line.rstrip())
else:
self.linebuf += line
def flush(self):
if self.linebuf != "":
self.logger.log(self.log_level, self.linebuf.rstrip())
self.linebuf = ""
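`StreamToLogger.write` holds back any trailing partial line until a newline arrives, and `flush` emits whatever remains. The sketch below shows that buffering on its own, collecting completed lines into a list rather than calling a logger. The class name `LineBuffer` and its `emitted` attribute are hypothetical.

```python
class LineBuffer:
    """Collects complete lines; keeps an unfinished tail until more text arrives."""

    def __init__(self):
        self.linebuf = ""
        self.emitted = []

    def write(self, buf):
        pending = self.linebuf + buf
        self.linebuf = ""
        # keepends=True so we can tell complete lines from a partial tail.
        for line in pending.splitlines(True):
            if line.endswith("\n"):
                self.emitted.append(line.rstrip())
            else:
                self.linebuf += line

    def flush(self):
        if self.linebuf != "":
            self.emitted.append(self.linebuf.rstrip())
        self.linebuf = ""


buf = LineBuffer()
buf.write("hel")
buf.write("lo\nwor")
before_flush = list(buf.emitted)
buf.flush()
after_flush = list(buf.emitted)
```

The first two writes produce a single record, and the unfinished tail only appears after `flush`.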
def disable_torch_init():
"""
Disable the redundant torch default initialization to accelerate model creation.
"""
import torch
setattr(torch.nn.Linear, "reset_parameters", lambda self: None)
setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None)
def violates_moderation(text):
"""
Check whether the text violates OpenAI moderation API.
"""
url = "https://api.openai.com/v1/moderations"
headers = {"Content-Type": "application/json", "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]}
import json
text = text.replace("\n", "")
# Build the body with json.dumps so quotes and backslashes in `text` are escaped correctly
data = json.dumps({"input": text}).encode("utf-8")
try:
ret = requests.post(url, headers=headers, data=data, timeout=5)
flagged = ret.json()["results"][0]["flagged"]
except requests.exceptions.RequestException as e:
print(f"######################### Moderation Error: {e} #########################")
flagged = False
except KeyError as e:
print(f"######################### Moderation Error: {e} #########################")
flagged = False
return flagged
def pretty_print_semaphore(semaphore):
if semaphore is None:
return "None"
return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})"
================================================
FILE: xtuner-eval_niah/llava/video_utils.py
================================================
import random
import os
import io
import av
import cv2
import decord
import imageio
from decord import VideoReader
import torch
import numpy as np
import math
import gc
from torchvision.transforms.functional import pil_to_tensor
import re
def get_index(num_frames, num_segments):
seg_size = float(num_frames - 1) / num_segments
start = int(seg_size / 2)
offsets = np.array([
start + int(np.round(seg_size * idx)) for idx in range(num_segments)
])
return offsets
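A worked example may help with `get_index`: it splits the frame range into `num_segments` equal segments and returns an offset near the centre of each. The block repeats the function so it runs on its own; the input values are chosen purely for illustration.

```python
import numpy as np


def get_index(num_frames, num_segments):
    # Length of one segment, in frames.
    seg_size = float(num_frames - 1) / num_segments
    # Offset to the middle of the first segment.
    start = int(seg_size / 2)
    return np.array([start + int(np.round(seg_size * idx)) for idx in range(num_segments)])


# 101 frames, 4 segments: seg_size = 25.0, start = 12
offsets = get_index(101, 4).tolist()
```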
def lazy_load_s3video(s3path_video, num_frames, video_start, video_end, client):
# load video from ceph
video_bytes_stream = client.get(s3path_video, enable_stream_lazyloding=True)
container = av.open(video_bytes_stream)
stream = container.streams.video[0]
# duration = stream.duration
real_fps = container.streams.video[0].average_rate
time_base = container.streams.video[0].time_base
start, end = video_start, video_end
# Convert time to pts
duration_frams = int(end - start) * real_fps
frames_index = get_index(duration_frams, num_frames)
pts_list = []
start_pts = int((start) / time_base)
end_pts = int((end) / time_base)
for frame_index in frames_index:
pts_list.append(int((frame_index / real_fps)) / time_base)
# Seek to nearest key frame from the start
container.seek(max(start_pts, 0), stream=stream)
frames = []
for frame in container.decode(**{"video":0}):
if frame.pts < start_pts:
continue
# if frame.pts <= end_pts:
if len(pts_list) >0:
if frame.pts >= pts_list[0]:
frames.append(frame)
pts_list.pop(0)
else:
break
container.close()
frames = [np.array(frames[idx].to_rgb().to_image()) for idx in range(len(frames))]
final_frames = np.stack(frames)
del frames
del video_bytes_stream # T C H W
gc.collect()
return final_frames, frames_index, float(real_fps)
def pts_to_secs(pts: int, time_base: float, start_pts: int) -> float:
"""
Converts a presentation timestamp (pts) with the given time base and start_pts offset to seconds.
Returns:
time_in_seconds (float): The corresponding time in seconds.
https://github.com/facebookresearch/pytorchvideo/blob/main/pytorchvideo/data/utils.py#L54-L64
"""
if pts == math.inf:
return math.inf
return int(pts - start_pts) * time_base
def get_pyav_video_duration(video_reader):
video_stream = video_reader.streams.video[0]
video_duration = pts_to_secs(
video_stream.duration,
video_stream.time_base,
video_stream.start_time
)
return float(video_duration)
def get_frame_indices(num_frames, vlen, sample='middle', fix_start=None, input_fps=1, min_num_frames=1, max_num_frames=-1, local_num_frames=8):
if min_num_frames > vlen:
if sample == 'dynamic_fps1':
min_num_frames = (vlen // local_num_frames) * local_num_frames
else:
min_num_frames = vlen
if sample == 'dynamic_fps1':
duration = float(vlen) / input_fps
num_segments = int(duration // local_num_frames)
if num_segments == 0:
num_frames = local_num_frames
else:
num_frames = local_num_frames * num_segments
if max_num_frames > 0:
num_frames = min(num_frames, max_num_frames)
sample = "middle" # NOTE
# logger.info(f"? is OK (img), duation={duration} frames={num_frames}!!!!")
num_frames = max(min_num_frames, num_frames)
# print(f"\033[0;31m vlen={vlen}, input_fps={input_fps} num_frames={num_frames} \033[0m")
if sample in ["rand", "middle"]: # uniform sampling
acc_samples = min(num_frames, vlen)
# split the video into `acc_samples` intervals, and sample from each interval.
intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
ranges = []
for idx, interv in enumerate(intervals[:-1]):
ranges.append((interv, intervals[idx + 1] - 1))
if sample == 'rand':
try:
frame_indices = [random.choice(range(x[0], x[1])) for x in ranges]
except:
frame_indices = np.random.permutation(vlen)[:acc_samples]
frame_indices.sort()
frame_indices = list(frame_indices)
elif fix_start is not None:
frame_indices = [x[0] + fix_start for x in ranges]
elif sample == 'middle':
frame_indices = [(x[0] + x[1]) // 2 for x in ranges]
else:
raise NotImplementedError
if len(frame_indices) < num_frames: # padded with last frame
padded_frame_indices = [frame_indices[-1]] * num_frames
padded_frame_indices[:len(frame_indices)] = frame_indices
frame_indices = padded_frame_indices
elif "fps" in sample: # fps0.5, sequentially sample frames at 0.5 fps
output_fps = float(sample[3:])
duration = float(vlen) / input_fps
delta = 1 / output_fps # gap between frames, this is also the clip length each frame represents
frame_seconds = np.arange(0 + delta / 2, duration + delta / 2, delta)
frame_indices = np.around(frame_seconds * input_fps).astype(int)
frame_indices = [e for e in frame_indices if e < vlen]
if max_num_frames > 0 and len(frame_indices) > max_num_frames:
frame_indices = frame_indices[:max_num_frames]
# frame_indices = np.linspace(0 + delta / 2, duration + delta / 2, endpoint=False, num=max_num_frames)
else:
raise ValueError(f"Not support sample type: {sample}")
return frame_indices
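Two of the sampling modes in `get_frame_indices` are easier to follow in isolation. The sketch below reimplements only the `"middle"` branch (centre of each of `num_frames` equal intervals) and the `"fpsX"` branch (one frame every `1 / output_fps` seconds). The helper names `middle_indices` and `fps_indices` are hypothetical, and the padding, `rand`, and `dynamic_fps1` handling of the original are left out.

```python
import numpy as np


def middle_indices(vlen, num_frames):
    """Centre frame of each of `num_frames` equal intervals over `vlen` frames."""
    acc_samples = min(num_frames, vlen)
    intervals = np.linspace(start=0, stop=vlen, num=acc_samples + 1).astype(int)
    # Each range is (first frame, last frame) of one interval.
    ranges = [(lo, hi - 1) for lo, hi in zip(intervals[:-1], intervals[1:])]
    return [int((lo + hi) // 2) for lo, hi in ranges]


def fps_indices(vlen, input_fps, output_fps, max_num_frames=-1):
    """Frames sampled sequentially at `output_fps` from a video at `input_fps`."""
    duration = float(vlen) / input_fps
    delta = 1 / output_fps  # seconds between sampled frames
    frame_seconds = np.arange(delta / 2, duration + delta / 2, delta)
    indices = np.around(frame_seconds * input_fps).astype(int)
    indices = [int(i) for i in indices if i < vlen]
    if max_num_frames > 0:
        indices = indices[:max_num_frames]
    return indices


mid = middle_indices(vlen=10, num_frames=4)
half_fps = fps_indices(vlen=100, input_fps=25, output_fps=0.5)
```

In the second call the 4-second clip is sampled at 1 s and 3 s; the candidate at 5 s falls past the last frame and is dropped.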
def read_frames_av(video_path, num_frames, sample='rand', client=None, fix_start=None, min_num_frames=1, max_num_frames=-1, clip=None, local_num_frames=8):
if clip is not None:
raise NotImplementedError("read_frames_av does not support clip")
if 's3://' in video_path:
video_bytes = client.get(video_path)
byteio = io.BytesIO(video_bytes)
byteio.seek(0)
reader = av.open(byteio)
else:
byteio = None
reader = av.open(video_path)
frames = [f.to_rgb().to_ndarray() for f in reader.decode(video=0)]
vlen = len(frames)
duration = get_pyav_video_duration(reader)
fps = vlen / float(duration)
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
input_fps=fps, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
frames = np.stack([frames[idx] for idx in frame_indices]) # (T, H, W, C), torch.uint8
# frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
if byteio != None:
byteio.close()
reader.close()
return frames, frame_indices, float(fps), duration
def read_frames_gif(
video_path, num_frames, sample='rand', fix_start=None,
min_num_frames=1, max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
if clip is not None:
raise NotImplementedError("read_frames_gif does not support clip")
if 's3://' in video_path:
video_bytes = client.get(video_path)
byteio = io.BytesIO(video_bytes)
gif = imageio.get_reader(byteio)
else:
byteio = None
gif = imageio.get_reader(video_path)
vlen = len(gif)
fps = 1.
duration = vlen / fps
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
min_num_frames=min_num_frames,
max_num_frames=max_num_frames, local_num_frames=local_num_frames,
input_fps=fps # NOTE: hard-coded for now
)
frames = []
min_h = min_w = 100000
hw_set = set()
for index, frame in enumerate(gif):
# for index in frame_idxs:
if index in frame_indices:
frame = cv2.cvtColor(frame, cv2.COLOR_RGBA2RGB)
frame = frame.astype(np.uint8)
# # (H x W x C) to (C x H x W)
# frame = frame.permute(2, 0, 1)
frames.append(frame)
hw_set.add(frame.shape)
if frame.shape[0] < min_h:
min_h = frame.shape[0]
if frame.shape[1] < min_w:
min_w = frame.shape[1]
# print(hw_set, min_h, min_w)
if len(hw_set) > 1:
frames = [i[:min_h, :min_w] for i in frames]
frames = np.stack(frames) # .float() / 255
if byteio != None:
byteio.close()
return frames, frame_indices, float(fps), duration # for tgif
def read_frames_decord(
video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1,
max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
if video_path.endswith('.avi'):
return read_frames_av(video_path=video_path, num_frames=num_frames, sample=sample,
fix_start=fix_start, min_num_frames=min_num_frames, max_num_frames=max_num_frames,
client=client, clip=clip, local_num_frames=local_num_frames)
if 's3://' in video_path:
video_bytes = client.get(video_path)
if video_bytes is None or len(video_bytes) == 0:
raise ValueError(f"Can't read byte from {video_path}!")
byteio = io.BytesIO(video_bytes)
video_reader = VideoReader(byteio, num_threads=1)
else:
byteio = None
video_reader = VideoReader(video_path, num_threads=1)
vlen = len(video_reader)
fps = video_reader.get_avg_fps()
duration = vlen / float(fps)
if clip:
start, end = clip
start = max(0, start)
end = min(duration - 0.1, end) # keep end from running past the end of the video
duration = end - start
vlen = int(duration * fps)
start_index = int(start * fps)
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
input_fps=fps, min_num_frames=min_num_frames, max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
if clip:
frame_indices = [f + start_index for f in frame_indices]
# print(fps, frame_indices)
frames = video_reader.get_batch(frame_indices).asnumpy() # (T, H, W, C), torch.uint8
# https://github.com/dmlc/decord/issues/208
video_reader.seek(0)
if byteio != None:
byteio.close()
# frames = frames.permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
return frames, frame_indices, float(fps), duration
def read_frames_img(
video_path, num_frames, sample='rand', fix_start=None, min_num_frames=1,
max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
def extract_frame_number(filename):
# Extract the numeric part from the filename using regular expressions
if filename.endswith('.jpg'):
match = re.search(r'_(\d+)\.jpg$', filename)
elif filename.endswith('.jpeg'):
match = re.search(r'_(\d+)\.jpeg$', filename)
elif filename.endswith('.png'):
match = re.search(r'_(\d+)\.png$', filename)
else:
raise NotImplementedError(f"Wrong filename: {filename}")
return int(match.group(1)) if match else -1
def sort_frames(frame_paths):
# Extract filenames from each path and sort by their numeric part
return sorted(frame_paths, key=lambda x: extract_frame_number(os.path.basename(x)))
# img_list=[]
if "s3://" in video_path:
img_list = sort_frames(client.list(video_path))
else:
img_list = sort_frames(list(os.listdir(video_path)))
if 'tvqa' in video_path.lower():
fps = 3.0 # TVQA frames are extracted at 3 fps
else:
fps = 1.0 # NOTE: data of unknown origin is treated as 1 fps
if clip is not None:
start = float(clip[0])
end = float(clip[1])
start = max(0, start)
end = min(len(img_list) / fps, end) # keep end from running past the end of the video
vlen = (end - start) * fps
else:
vlen = len(img_list)
duration = vlen / fps
if min_num_frames > vlen:
if sample == 'dynamic_fps1':
min_num_frames = (vlen // local_num_frames) * local_num_frames
else:
min_num_frames = vlen
if sample == 'dynamic_fps1':
num_segments = int(duration // local_num_frames)
if num_segments == 0:
num_frames = local_num_frames
else:
num_frames = local_num_frames * num_segments
num_frames = min(num_frames, max_num_frames) if max_num_frames > 0 else num_frames  # max_num_frames <= 0 means no cap
num_frames = max(min_num_frames, num_frames)
num_frames = int(num_frames)
if clip is not None:
def _get_index_by_time(start_sec, end_sec, num_segments=8, fps=1., max_frame=9999):
start_idx = max(1, round(start_sec * fps))
end_idx = min(round(end_sec * fps), max_frame)
seg_size = float(end_idx - start_idx) / (num_segments - 1)
offsets = np.array([start_idx + int(np.round(seg_size * idx)) for idx in range(num_segments)])
return offsets
frame_indices = _get_index_by_time(float(clip[0]), float(clip[1]), num_segments=num_frames, fps=fps, max_frame=len(img_list)-1)
else:
frame_indices = get_frame_indices(
num_frames, vlen, sample=sample, fix_start=fix_start,
min_num_frames=min_num_frames,
max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
imgs = []
for idx in frame_indices:
frame_fname = os.path.join(video_path, img_list[idx])
if "s3://" in video_path:
img_bytes = client.get(frame_fname)
else:
with open(frame_fname, 'rb') as f:
img_bytes = f.read()
img_np = np.frombuffer(img_bytes, np.uint8)
img = cv2.imdecode(img_np, cv2.IMREAD_COLOR)
cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img)
imgs.append(img)
# print(f"\033[0;31m img_list={len(img_list)} video_path={video_path}, len(imgs)={len(imgs)}, frame_indices={frame_indices} num_frames={num_frames} \033[0m")
frames = np.array(imgs, dtype=np.uint8)
# frames = torch.tensor(np.array(imgs), dtype=torch.uint8).permute(0, 3, 1, 2) # (T, C, H, W), torch.uint8
# logger.info(f"{video_path} is OK (img), duation={vlen}!!!!")
return frames, frame_indices, fps, duration # NOTE: image sequences are treated as 1 fps unless stated above
def read_frames_fake(
video_path, num_frames, sample='rand', fix_start=None,
max_num_frames=-1, client=None, clip=None, local_num_frames=8
):
print("I am fake!!!!!!")
frame_indices = get_frame_indices(
num_frames, 100, sample=sample, fix_start=fix_start,
input_fps=1, max_num_frames=max_num_frames, local_num_frames=local_num_frames
)
frames = np.random.randint(0, 255, size=(len(frame_indices), 224, 224, 3), dtype=np.uint8) # (T, H, W, C), np.uint8
return frames, frame_indices, 1.0, 100
VIDEO_READER_FUNCS = {
'av': read_frames_av,
'decord': read_frames_decord,
'gif': read_frames_gif,
'img': read_frames_img,
'frame': read_frames_img,
'lazy': lazy_load_s3video,
'fake': read_frames_fake
}
================================================
FILE: xtuner-eval_niah/longva/__init__.py
================================================
from .model import LlavaLlamaForCausalLM
from .constants import *
from .conversation import *
from .mm_utils import *
from .utils import *
from .model import *
================================================
FILE: xtuner-eval_niah/longva/constants.py
================================================
CONTROLLER_HEART_BEAT_EXPIRATION = 30
WORKER_HEART_BEAT_INTERVAL = 15
LOGDIR = "."
# Model Constants
IGNORE_INDEX = -100
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = "<image>"
DEFAULT_IMAGE_PATCH_TOKEN = "<im_patch>"
DEFAULT_IM_START_TOKEN = "<im_start>"
DEFAULT_IM_END_TOKEN = "<im_end>"
================================================
FILE: xtuner-eval_niah/longva/conversation.py
================================================
import dataclasses
from enum import auto, Enum
from typing import List, Any, Dict, Union, Tuple
import re
import base64
from io import BytesIO
from PIL import Image
from transformers import AutoTokenizer
class SeparatorStyle(Enum):
"""Different separator style."""
SINGLE = auto()
TWO = auto()
MPT = auto()
PLAIN = auto()
CHATML = auto()
LLAMA_2 = auto()
LLAMA_3 = auto()
QWEN = auto()
GEMMA = auto()
@dataclasses.dataclass
class Conversation:
"""A class that keeps all conversation history."""
system: str
roles: List[str]
messages: List[List[str]]
offset: int
sep_style: SeparatorStyle = SeparatorStyle.SINGLE
sep: str = "###"
sep2: str = None
version: str = "Unknown"
tokenizer_id: str = ""
tokenizer: Any = None
# Stop criteria (the default one is EOS token)
stop_str: Union[str, List[str]] = None
# Stops generation if meeting any token in this list
stop_token_ids: List[int] = None
skip_next: bool = False
def get_prompt(self):
messages = self.messages
if len(messages) > 0 and type(messages[0][1]) is tuple:
messages = self.messages.copy()
init_role, init_msg = messages[0].copy()
init_msg = init_msg[0]
if "mmtag" in self.version:
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, init_msg)
messages.insert(0, (self.roles[0], "<Image><image></Image>"))
messages.insert(1, (self.roles[1], "Received."))
elif not init_msg.startswith("<image>"):
init_msg = init_msg.replace("<image>", "").strip()
messages[0] = (init_role, "<image>\n" + init_msg)
else:
messages[0] = (init_role, init_msg)
if self.sep_style == SeparatorStyle.SINGLE:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + self.sep
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.TWO:
seps = [self.sep, self.sep2]
ret = self.system + seps[0]
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + ": " + message + seps[i % 2]
else:
ret += role + ":"
elif self.sep_style == SeparatorStyle.CHATML:
ret = "" if self.system == "" else self.system + self.sep + "\n"
for role, message in messages:
if message:
if type(message) is tuple:
message, images = message
message = "<image>" * len(images) + message
ret += role + "\n" + message + self.sep + "\n"
else:
ret += role + "\n"
return ret
elif self.sep_style == SeparatorStyle.LLAMA_3:
chat_template_messages = [{"role": "system", "content": self.system}]
for role, message in messages:
if message:
if type(message) is tuple:
message, images = message
message = "<image>" * len(images) + message
chat_template_messages.append({"role": role, "content": message})
# print(chat_template_messages)
return self.tokenizer.apply_chat_template(chat_template_messages, tokenize=False, add_generation_prompt=True)
# ret = "" if self.system == "" else self.system + self.sep + "\n"
# for role, message in messages:
# if message:
# if type(message) is tuple:
# message, images = message
# message = "" * len(images) + message
# ret += role + "\n" + message + self.sep + "\n"
# else:
# ret += role + "\n"
# return ret
elif self.sep_style == SeparatorStyle.MPT:
ret = self.system + self.sep
for role, message in messages:
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.GEMMA:
ret = ""
for i, (role, message) in enumerate(messages):
assert role == self.roles[i % 2], "Conversation should alternate user/assistant/user/assistant/..."
if message:
if type(message) is tuple:
message, _, _ = message
ret += role + message + self.sep
else:
ret += role
elif self.sep_style == SeparatorStyle.LLAMA_2:
wrap_sys = lambda msg: f"<<SYS>>\n{msg}\n<</SYS>>\n\n" if len(msg) > 0 else msg
wrap_inst = lambda msg: f"[INST] {msg} [/INST]"
ret = ""
for i, (role, message) in enumerate(messages):
if i == 0:
assert message, "first message should not be none"
assert role == self.roles[0], "first message should come from user"
if message:
if type(message) is tuple:
message, _, _ = message
if i == 0:
message = wrap_sys(self.system) + message
if i % 2 == 0:
message = wrap_inst(message)
ret += self.sep + message
else:
ret += " " + message + " " + self.sep2
else:
ret += ""
ret = ret.lstrip(self.sep)
elif self.sep_style == SeparatorStyle.PLAIN:
seps = [self.sep, self.sep2]
ret = self.system
for i, (role, message) in enumerate(messages):
if message:
if type(message) is tuple:
message, _, _ = message
ret += message + seps[i % 2]
else:
ret += ""
else:
raise ValueError(f"Invalid style: {self.sep_style}")
return ret
def append_message(self, role, message):
self.messages.append([role, message])
def process_image(self, image, image_process_mode, return_pil=False, image_format="PNG"):
if image_process_mode == "Pad":
def expand2square(pil_img, background_color=(122, 116, 104)):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image)
elif image_process_mode in ["Default", "Crop"]:
pass
elif image_process_mode == "Resize":
image = image.resize((336, 336))
else:
raise ValueError(f"Invalid image_process_mode: {image_process_mode}")
if type(image) is not Image.Image:
image = Image.open(image).convert("RGB")
max_hw, min_hw = max(image.size), min(image.size)
aspect_ratio = max_hw / min_hw
max_len, min_len = 672, 448
shortest_edge = int(min(max_len / aspect_ratio, min_len, min_hw))
longest_edge = int(shortest_edge * aspect_ratio)
W, H = image.size
if H > W:
H, W = longest_edge, shortest_edge
else:
H, W = shortest_edge, longest_edge
image = image.resize((W, H))
if return_pil:
return image
else:
buffered = BytesIO()
image.save(buffered, format=image_format)
img_b64_str = base64.b64encode(buffered.getvalue()).decode()
return img_b64_str
def get_images(self, return_pil=False, return_path=False):
images = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
for img in image:
if not return_path:
img = self.process_image(img, image_process_mode, return_pil=return_pil)
# Append in both modes; previously a processed image was computed and then dropped.
images.append(img)
return images
def to_gradio_chatbot(self):
ret = []
for i, (role, msg) in enumerate(self.messages[self.offset :]):
if i % 2 == 0:
if type(msg) is tuple:
msg, image, image_process_mode = msg
if type(image) != list:
image = [image]
if len(image) == 1:
msg = "<image>\n" + msg.replace("<image>", "").strip()
else:
msg = re.sub(r"(<image>)\n(?=<image>)", r"\1 ", msg)
for img in image:
img_b64_str = self.process_image(img, "Default", return_pil=False, image_format="JPEG")
img_str = f'<img src="data:image/jpeg;base64,{img_b64_str}"/>'
msg = msg.replace("<image>", img_str, 1).strip()
if len(msg) > 0:
ret.append([msg, None])
else:
ret.append([msg, None])
else:
ret[-1][-1] = msg
return ret
def copy(self):
return Conversation(system=self.system, roles=self.roles, messages=[[x, y] for x, y in self.messages], offset=self.offset, sep_style=self.sep_style, sep=self.sep, sep2=self.sep2, version=self.version)
def dict(self):
if len(self.get_images()) > 0:
return {
"system": self.system,
"roles": self.roles,
"messages": [[x, y[0] if type(y) is tuple else y] for x, y in self.messages],
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
return {
"system": self.system,
"roles": self.roles,
"messages": self.messages,
"offset": self.offset,
"sep": self.sep,
"sep2": self.sep2,
}
conv_vicuna_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[
["Human", "What are the key differences between renewable and non-renewable energy sources?"],
[
"Assistant",
"Renewable energy sources are those that can be replenished naturally in a relatively "
"short amount of time, such as solar, wind, hydro, geothermal, and biomass. "
"Non-renewable energy sources, on the other hand, are finite and will eventually be "
"depleted, such as coal, oil, and natural gas. Here are some key differences between "
"renewable and non-renewable energy sources:\n"
"1. Availability: Renewable energy sources are virtually inexhaustible, while non-renewable "
"energy sources are finite and will eventually run out.\n"
"2. Environmental impact: Renewable energy sources have a much lower environmental impact "
"than non-renewable sources, which can lead to air and water pollution, greenhouse gas emissions, "
"and other negative effects.\n"
"3. Cost: Renewable energy sources can be more expensive to initially set up, but they typically "
"have lower operational costs than non-renewable sources.\n"
"4. Reliability: Renewable energy sources are often more reliable and can be used in more remote "
"locations than non-renewable sources.\n"
"5. Flexibility: Renewable energy sources are often more flexible and can be adapted to different "
"situations and needs, while non-renewable sources are more rigid and inflexible.\n"
"6. Sustainability: Renewable energy sources are more sustainable over the long term, while "
"non-renewable sources are not, and their depletion can lead to economic and social instability.\n",
],
],
offset=2,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_vicuna_v1 = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the user's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llama_2 = Conversation(
system="""You are a helpful, respectful and honest assistant. Always answer as helpfully as possible, while being safe. Your answers should not include any harmful, unethical, racist, sexist, toxic, dangerous, or illegal content. Please ensure that your responses are socially unbiased and positive in nature.
If a question does not make any sense, or is not factually coherent, explain why instead of answering something not correct. If you don't know the answer to a question, please don't share false information.""",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_llava_llama_2 = Conversation(
system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
# Disabled: building this template fails at import time when the user cannot access Meta-Llama-3-8B-Instruct
# conv_llava_llama_3 = Conversation(
# system="You are a helpful language and vision assistant. " "You are able to understand the visual content that the user provides, " "and assist the user with a variety of tasks using natural language.",
# roles=("user", "assistant"),
# version="llama_v3",
# messages=[],
# offset=0,
# sep="<|eot_id|>",
# sep_style=SeparatorStyle.LLAMA_3,
# tokenizer_id="meta-llama/Meta-Llama-3-8B-Instruct",
# tokenizer=AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct"),
# stop_token_ids=[128009],
# )
conv_mistral_instruct = Conversation(
system="",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="",
sep2="</s>",
)
conv_llava_llama_2_simple = Conversation(
system="Answer the questions about the visual content that the user provides.",
roles=("USER", "ASSISTANT"),
version="llama_v2",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_llava_llama_2_mmtag = Conversation(
system="Answer the questions about the visual content that the user provides." "The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
version="llama_v2_mmtag",
messages=[],
offset=0,
sep_style=SeparatorStyle.LLAMA_2,
sep="<s>",
sep2="</s>",
)
conv_mpt = Conversation(
system="""<|im_start|>system
A conversation between a user and an LLM-based AI assistant. The assistant gives helpful and honest answers.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_qwen = Conversation(
system="""<|im_start|>system
You are a helpful assistant.""",
roles=("<|im_start|>user", "<|im_start|>assistant"),
version="qwen",
messages=[],
offset=0,
sep_style=SeparatorStyle.CHATML,
sep="<|im_end|>",
)
conv_gemma_instruct = Conversation(system="", roles=("<start_of_turn>user\n", "<start_of_turn>model\n"), version="gemma", messages=[], offset=0, sep_style=SeparatorStyle.GEMMA, sep="<end_of_turn>\n")
conv_llava_plain = Conversation(
system="",
roles=("", ""),
messages=[],
offset=0,
sep_style=SeparatorStyle.PLAIN,
sep="\n",
)
conv_llava_v0 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
)
conv_llava_v0_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("Human", "Assistant"),
messages=[],
offset=0,
sep_style=SeparatorStyle.SINGLE,
sep="###",
version="v0_mmtag",
)
conv_llava_v1 = Conversation(
system="A chat between a curious human and an artificial intelligence assistant. " "The assistant gives helpful, detailed, and polite answers to the human's questions.",
roles=("USER", "ASSISTANT"),
version="v1",
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
)
conv_llava_v1_mmtag = Conversation(
system="A chat between a curious user and an artificial intelligence assistant. "
"The assistant is able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language."
"The visual content will be provided with the following format: <Image>visual content</Image>.",
roles=("USER", "ASSISTANT"),
messages=[],
offset=0,
sep_style=SeparatorStyle.TWO,
sep=" ",
sep2="</s>",
version="v1_mmtag",
)
conv_mistral_orca = Conversation(
system="""<|im_start|>system
You are MistralOrca, a large language model trained by Alignment Lab AI. Write out your reasoning step-by-step to be sure you get the right answers!""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_mistral_zephyr = Conversation(
system="""<|system|>
You are a helpful AI assistant.""",
roles=("<|user|>\n", "<|assistant|>\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="</s>",
)
conv_mistral_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
conv_chatml_direct = Conversation(
system="""<|im_start|>system
Answer the questions.""",
roles=("<|im_start|>user\n", "<|im_start|>assistant\n"),
version="mpt",
messages=[],
offset=0,
sep_style=SeparatorStyle.MPT,
sep="<|im_end|>",
)
default_conversation = conv_vicuna_v0
conv_templates = {
"default": conv_vicuna_v0,
"v0": conv_vicuna_v0,
"v1": conv_vicuna_v1,
"vicuna_v1": conv_vicuna_v1,
"llama_2": conv_llama_2,
"mistral_instruct": conv_mistral_instruct,
"mistral_orca": conv_mistral_orca,
"mistral_zephyr": conv_mistral_zephyr,
"mistral_direct": conv_mistral_direct,
"plain": conv_llava_plain,
"v0_plain": conv_llava_plain,
"chatml_direct": conv_chatml_direct,
"llava_v0": conv_llava_v0,
"llava_v0_mmtag": conv_llava_v0_mmtag,
"llava_v1": conv_llava_v1,
"llava_v1_mmtag": conv_llava_v1_mmtag,
"llava_llama_2": conv_llava_llama_2,
"llava_llama_2_simple": conv_llava_llama_2_simple,
"llava_llama_2_mmtag": conv_llava_llama_2_mmtag,
"llava_mistral_instruct": conv_mistral_instruct,
"mpt": conv_mpt,
"qwen_1_5": conv_qwen,
"gemma_instruct": conv_gemma_instruct,
}
if __name__ == "__main__":
print(default_conversation.get_prompt())
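# Illustrative sketch (not part of the upstream module): how the CHATML branch of
# get_prompt() appears to assemble a prompt, using the role strings and separator
# defined for conv_qwen above. The helper name build_chatml_prompt is hypothetical,
# and image tuples are left out to keep the example short.

```python
def build_chatml_prompt(system, messages, sep="<|im_end|>"):
    """Each turn is rendered as: role + newline + message + sep + newline."""
    ret = "" if system == "" else system + sep + "\n"
    for role, message in messages:
        if message:
            ret += role + "\n" + message + sep + "\n"
        else:
            # An empty message leaves the turn open for the model to complete.
            ret += role + "\n"
    return ret


prompt = build_chatml_prompt(
    "<|im_start|>system\nYou are a helpful assistant.",
    [("<|im_start|>user", "Describe the video."), ("<|im_start|>assistant", None)],
)
print(prompt)
```

# The trailing open assistant turn is what makes the string usable as a generation prompt.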
================================================
FILE: xtuner-eval_niah/longva/mm_utils.py
================================================
from PIL import Image
from io import BytesIO
import base64
import math
import ast
import re
import torch
from transformers import StoppingCriteria
from longva.constants import IMAGE_TOKEN_INDEX
def resize_and_center_crop(image, shortest_edge_length):
# Calculate new dimensions and resize
aspect_ratio = float(image.width) / float(image.height)
if aspect_ratio > 1:
new_width = int(shortest_edge_length * aspect_ratio)
new_height = shortest_edge_length
else:
new_width = shortest_edge_length
new_height = int(shortest_edge_length / aspect_ratio)
# Image.ANTIALIAS was removed in Pillow 10; Image.LANCZOS is the same filter.
resized_image = image.resize((new_width, new_height), Image.LANCZOS)
# Calculate the position and perform the center crop
left = (new_width - shortest_edge_length) / 2
top = (new_height - shortest_edge_length) / 2
right = (new_width + shortest_edge_length) / 2
bottom = (new_height + shortest_edge_length) / 2
cropped_image = resized_image.crop((left, top, right, bottom))
return cropped_image
def auto_pad_images(image, grid_params):
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert len(grid_params) > 0, "Grid parameters should not be empty"
# Find the candidate resolutions whose aspect ratio is closest to the input
input_width, input_height = image.size
input_aspect_ratio = input_width / input_height
candidate_resolutions = [(w / h, w, h) for w in grid_params for h in grid_params]
closest_aspect_ratio = min(candidate_resolutions, key=lambda x: abs(input_aspect_ratio - x[0]))
candidate_resolutions = [(x[1], x[2]) for x in candidate_resolutions if abs(x[0] - closest_aspect_ratio[0]) < 1e-3]
target_resolution = min(candidate_resolutions, key=lambda res: abs(max(input_width, input_height) / max(res) - 1))
resize_width, resize_height = target_resolution
if input_width > input_height:
resize_height = int(resize_width / input_aspect_ratio)
else:
resize_width = int(resize_height * input_aspect_ratio)
resized_image = image.resize((resize_width, resize_height), Image.LANCZOS)
# Pad the resized image if necessary to match the target resolution
pad_width = target_resolution[0] - resize_width
pad_height = target_resolution[1] - resize_height
padded_image = Image.new("RGB", target_resolution, color=(0, 0, 0))
padded_image.paste(resized_image, (pad_width // 2, pad_height // 2))
return padded_image
def extract_patches(image, patch_size, overlap_ratio):
assert isinstance(image, Image.Image), "Input should be a Pillow Image"
assert patch_size > 0, "Patch size should be greater than 0"
assert 0 <= overlap_ratio < 1, "Overlap ratio should be between 0 and 1"
W, H = image.size
patches = []
stride = int(patch_size * (1 - overlap_ratio))
num_patches_y = (H - patch_size) // stride + 1
num_patches_x = (W - patch_size) // stride + 1
y_start = (H - (num_patches_y - 1) * stride - patch_size) // 2
x_start = (W - (num_patches_x - 1) * stride - patch_size) // 2
for y in range(y_start, y_start + num_patches_y * stride, stride):
for x in range(x_start, x_start + num_patches_x * stride, stride):
patch = image.crop((x, y, x + patch_size, y + patch_size))
patches.append(patch)
return patches
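# Illustrative sketch: the crop boxes that extract_patches() would visit, computed
# without Pillow so the arithmetic can be checked in isolation. The helper name
# patch_boxes is hypothetical; the stride and centering formulas are copied from
# the function above.

```python
def patch_boxes(width, height, patch_size, overlap_ratio=0.0):
    """Return (left, top, right, bottom) boxes in row-major order."""
    stride = int(patch_size * (1 - overlap_ratio))
    num_y = (height - patch_size) // stride + 1
    num_x = (width - patch_size) // stride + 1
    # Center the patch grid inside the image when it does not tile exactly.
    y_start = (height - (num_y - 1) * stride - patch_size) // 2
    x_start = (width - (num_x - 1) * stride - patch_size) // 2
    return [
        (x, y, x + patch_size, y + patch_size)
        for y in range(y_start, y_start + num_y * stride, stride)
        for x in range(x_start, x_start + num_x * stride, stride)
    ]


square = patch_boxes(448, 448, 224)  # exact 2 x 2 tiling
wide = patch_boxes(500, 224, 224)    # 2 patches, centered with a 26 px margin
print(square)
print(wide)
```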
def process_highres_image_crop_split(image, data_args, processor=None):
crop_resolution = data_args.image_crop_resolution
split_resolution = data_args.image_split_resolution
if processor is None:
processor = data_args.image_processor
image_crop = resize_and_center_crop(image, crop_resolution)
image_patches = extract_patches(image_crop, patch_size=split_resolution, overlap_ratio=0)
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def process_highres_image(image, processor, grid_pinpoints):
grid_params = [int(x) for x in grid_pinpoints.split(",")]
width_height = max(image.size)
fit_grid_params = [x for x in grid_params if x >= width_height]
if len(fit_grid_params) == 0:
select_size = max(grid_params)
else:
select_size = min(fit_grid_params)
# FIXME: always select the 448
select_size = max(grid_params)
image_padded = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
# FIXME: this seems to be a bug that it always resizes instead of padding
image_original_resize = image.resize((processor.size["shortest_edge"], processor.size["shortest_edge"]))
image_padded = image_padded.resize((select_size, select_size))
image_patches = extract_patches(image_padded, patch_size=processor.size["shortest_edge"], overlap_ratio=0)
image_patches = [image_original_resize] + image_patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def select_best_resolution(original_size, possible_resolutions):
"""
Selects the best resolution from a list of possible resolutions based on the original size.
Args:
original_size (tuple): The original size of the image in the format (width, height).
possible_resolutions (list): A list of possible resolutions in the format [(width1, height1), (width2, height2), ...].
Returns:
tuple: The best fit resolution in the format (width, height).
"""
original_width, original_height = original_size
best_fit = None
max_effective_resolution = 0
min_wasted_resolution = float("inf")
for width, height in possible_resolutions:
# Calculate the downscaled size to keep the aspect ratio
scale = min(width / original_width, height / original_height)
downscaled_width, downscaled_height = int(original_width * scale), int(original_height * scale)
# Calculate effective and wasted resolutions
effective_resolution = min(downscaled_width * downscaled_height, original_width * original_height)
wasted_resolution = (width * height) - effective_resolution
if effective_resolution > max_effective_resolution or (effective_resolution == max_effective_resolution and wasted_resolution < min_wasted_resolution):
max_effective_resolution = effective_resolution
min_wasted_resolution = wasted_resolution
best_fit = (width, height)
return best_fit
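# Worked example for the selection rule above: maximize the effective (non-upscaled)
# pixel count, then break ties by the least wasted canvas area. pick_resolution is a
# hypothetical, compact restatement of select_best_resolution, and the candidate grid
# is an assumed set of 448-pixel multiples chosen only for illustration.

```python
def pick_resolution(original_size, candidates):
    ow, oh = original_size
    best, best_eff, best_waste = None, 0, float("inf")
    for w, h in candidates:
        scale = min(w / ow, h / oh)
        dw, dh = int(ow * scale), int(oh * scale)
        eff = min(dw * dh, ow * oh)   # upscaling adds no real information
        waste = w * h - eff           # canvas area that ends up as padding
        if eff > best_eff or (eff == best_eff and waste < best_waste):
            best, best_eff, best_waste = (w, h), eff, waste
    return best


grid = [(448, 448), (896, 448), (448, 896), (896, 896)]
wide = pick_resolution((1000, 400), grid)   # tie on effective area, less waste wins
large = pick_resolution((800, 600), grid)   # only the largest canvas keeps every pixel
print(wide, large)
```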
def resize_and_pad_image(image, target_resolution):
"""
Resize and pad an image to a target resolution while maintaining aspect ratio.
Args:
image (PIL.Image.Image): The input image.
target_resolution (tuple): The target resolution (width, height) of the image.
Returns:
PIL.Image.Image: The resized and padded image.
"""
original_width, original_height = image.size
target_width, target_height = target_resolution
# Determine which dimension (width or height) to fill
scale_w = target_width / original_width
scale_h = target_height / original_height
if scale_w < scale_h:
# Width will be filled completely
new_width = target_width
new_height = min(math.ceil(original_height * scale_w), target_height)
else:
# Height will be filled completely
new_height = target_height
new_width = min(math.ceil(original_width * scale_h), target_width)
# Resize the image
resized_image = image.resize((new_width, new_height))
# Create a new image with the target size and paste the resized image onto it
new_image = Image.new("RGB", (target_width, target_height), (0, 0, 0))
paste_x = (target_width - new_width) // 2
paste_y = (target_height - new_height) // 2
new_image.paste(resized_image, (paste_x, paste_y))
return new_image
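# Illustrative sketch of the geometry in resize_and_pad_image(): which side is filled,
# the resized size, and where the resized image is pasted on the black canvas.
# fit_and_center is a hypothetical helper that returns numbers only and does not use Pillow.

```python
import math


def fit_and_center(original_size, target_size):
    ow, oh = original_size
    tw, th = target_size
    scale_w, scale_h = tw / ow, th / oh
    if scale_w < scale_h:
        # Width is the limiting side, so it is filled completely.
        new_w, new_h = tw, min(math.ceil(oh * scale_w), th)
    else:
        # Height is the limiting side.
        new_h, new_w = th, min(math.ceil(ow * scale_h), tw)
    paste_xy = ((tw - new_w) // 2, (th - new_h) // 2)
    return (new_w, new_h), paste_xy


size, offset = fit_and_center((400, 200), (800, 800))
print(size, offset)
```

# Here the 400x200 image is scaled by 2 to 800x400 and pasted 200 px from the top.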
def divide_to_patches(image, patch_size):
"""
Divides an image into patches of a specified size.
Args:
image (PIL.Image.Image): The input image.
patch_size (int): The size of each patch.
Returns:
list: A list of PIL.Image.Image objects representing the patches.
"""
patches = []
width, height = image.size
for i in range(0, height, patch_size):
for j in range(0, width, patch_size):
box = (j, i, j + patch_size, i + patch_size)
patch = image.crop(box)
patches.append(patch)
return patches
def get_anyres_image_grid_shape(image_size, grid_pinpoints, patch_size):
"""
Calculate the shape of the image patch grid after the preprocessing for images of any resolution.
Args:
image_size (tuple): The size of the input image in the format (width, height).
grid_pinpoints (str): A string representation of a list of possible resolutions.
patch_size (int): The size of each image patch.
Returns:
tuple: The shape of the image patch grid in the format (width, height).
"""
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
width, height = select_best_resolution(image_size, possible_resolutions)
return width // patch_size, height // patch_size
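# Illustrative sketch of the range-string form of grid_pinpoints handled above:
# a string such as "(1x1),...,(2x2)" is expanded into every grid between the first
# and last pair, then scaled by patch_size. expand_grid_range is a hypothetical
# helper name, and the patch size 336 is just one of the values the assert allows.

```python
import re


def expand_grid_range(grid_pinpoints, patch_size):
    matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
    start = tuple(map(int, matches[0]))
    end = tuple(map(int, matches[-1]))
    return [
        (i * patch_size, j * patch_size)
        for i in range(start[0], end[0] + 1)
        for j in range(start[1], end[1] + 1)
    ]


resolutions = expand_grid_range("(1x1),...,(2x2)", 336)
print(resolutions)
```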
def process_anyres_image(image, processor, grid_pinpoints):
"""
Process an image with variable resolutions.
Args:
image (PIL.Image.Image): The input image to be processed.
processor: The image processor object.
grid_pinpoints (str): A string representation of a list of possible resolutions.
Returns:
torch.Tensor: A tensor containing the processed image patches.
"""
# Convert grid_pinpoints from string to list
if isinstance(grid_pinpoints, str) and "x" in grid_pinpoints:
try:
patch_size = processor.size[0]
except Exception as e:
patch_size = processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
if type(grid_pinpoints) is list:
possible_resolutions = grid_pinpoints
else:
possible_resolutions = ast.literal_eval(grid_pinpoints)
best_resolution = select_best_resolution(image.size, possible_resolutions)
image_padded = resize_and_pad_image(image, best_resolution)
patches = divide_to_patches(image_padded, processor.crop_size["height"])
# FIXME: this seems to be a bug that it resizes instead of pad.
# but to keep it consistent with previous, i will keep it as it is
# TODO: uncomment below to ablate with the padding
if isinstance(processor.size, dict):
shortest_edge = processor.size["shortest_edge"]
else:
shortest_edge = min(processor.size)
image_original_resize = image.resize((shortest_edge, shortest_edge))
# image_padded_square = expand2square(image, tuple(int(x*255) for x in processor.image_mean))
# image_original_resize = image_padded_square.resize((processor.size['shortest_edge'], processor.size['shortest_edge']))
image_patches = [image_original_resize] + patches
image_patches = [processor.preprocess(image_patch, return_tensors="pt")["pixel_values"][0] for image_patch in image_patches]
return torch.stack(image_patches, dim=0)
def load_image_from_base64(image):
return Image.open(BytesIO(base64.b64decode(image)))
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
def process_images(images, image_processor, model_cfg):
image_aspect_ratio = getattr(model_cfg, "image_aspect_ratio", None)
new_images = []
if image_aspect_ratio == "highres":
for image in images:
image = process_highres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio is not None and (image_aspect_ratio == "anyres" or "anyres_max" in image_aspect_ratio):
for image in images:
image = process_anyres_image(image, image_processor, model_cfg.image_grid_pinpoints)
new_images.append(image)
elif image_aspect_ratio == "crop_split":
for image in images:
image = process_highres_image_crop_split(image, model_cfg, image_processor)
new_images.append(image)
elif image_aspect_ratio == "pad":
for image in images:
image = expand2square(image, tuple(int(x * 255) for x in image_processor.image_mean))
image = image_processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
new_images.append(image)
else:
return image_processor(images, return_tensors="pt")["pixel_values"]
if all(x.shape == new_images[0].shape for x in new_images):
new_images = torch.stack(new_images, dim=0)
return new_images
def tokenizer_image_token(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX, return_tensors=None):
prompt_chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
def insert_separator(X, sep):
return [ele for sublist in zip(X, [sep] * len(X)) for ele in sublist][:-1]
input_ids = []
offset = 0
if len(prompt_chunks) > 0 and len(prompt_chunks[0]) > 0 and prompt_chunks[0][0] == tokenizer.bos_token_id:
offset = 1
input_ids.append(prompt_chunks[0][0])
for x in insert_separator(prompt_chunks, [image_token_index] * (offset + 1)):
input_ids.extend(x[offset:])
if return_tensors is not None:
if return_tensors == "pt":
return torch.tensor(input_ids, dtype=torch.long)
raise ValueError(f"Unsupported tensor type: {return_tensors}")
return input_ids
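# Illustrative sketch of what tokenizer_image_token() produces: text chunks are
# tokenized separately and one image placeholder id is spliced between them, with the
# BOS token kept only once. ToyTokenizer and interleave are hypothetical, the "<image>"
# placeholder string is assumed, and the value -200 for IMAGE_TOKEN_INDEX is an
# assumption (the real value lives in longva.constants).

```python
from types import SimpleNamespace

IMAGE_TOKEN_INDEX = -200  # assumed value, for illustration only


class ToyTokenizer:
    """Stand-in tokenizer: BOS followed by one id per character."""

    bos_token_id = 1

    def __call__(self, text):
        return SimpleNamespace(input_ids=[self.bos_token_id] + [ord(c) for c in text])


def interleave(prompt, tokenizer, image_token_index=IMAGE_TOKEN_INDEX):
    chunks = [tokenizer(chunk).input_ids for chunk in prompt.split("<image>")]
    input_ids, offset = [], 0
    if chunks and chunks[0] and chunks[0][0] == tokenizer.bos_token_id:
        offset = 1
        input_ids.append(chunks[0][0])  # keep a single leading BOS
    for idx, chunk in enumerate(chunks):
        if idx > 0:
            input_ids.append(image_token_index)
        input_ids.extend(chunk[offset:])  # drop the per-chunk BOS
    return input_ids


ids = interleave("a<image>b", ToyTokenizer())
print(ids)
```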
def get_model_name_from_path(model_path):
model_path = model_path.strip("/")
model_paths = model_path.split("/")
if model_paths[-1].startswith("checkpoint-"):
return model_paths[-2] + "_" + model_paths[-1]
else:
return model_paths[-1]
class KeywordsStoppingCriteria(StoppingCriteria):
def __init__(self, keywords, tokenizer, input_ids):
self.keywords = keywords
self.keyword_ids = []
for keyword in keywords:
cur_keyword_ids = tokenizer(keyword).input_ids
if len(cur_keyword_ids) > 1 and cur_keyword_ids[0] == tokenizer.bos_token_id:
cur_keyword_ids = cur_keyword_ids[1:]
self.keyword_ids.append(torch.tensor(cur_keyword_ids))
self.tokenizer = tokenizer
self.start_len = input_ids.shape[1]
def __call__(self, output_ids: torch.LongTensor, scores: torch.FloatTensor, **kwargs) -> bool:
assert output_ids.shape[0] == 1, "Only support batch size 1 (yet)" # TODO
offset = min(output_ids.shape[1] - self.start_len, 3)
self.keyword_ids = [keyword_id.to(output_ids.device) for keyword_id in self.keyword_ids]
for keyword_id in self.keyword_ids:
# torch.equal yields a single bool and is False when the shapes differ.
if torch.equal(output_ids[0, -keyword_id.shape[0] :], keyword_id):
return True
outputs = self.tokenizer.batch_decode(output_ids[:, -offset:], skip_special_tokens=True)[0]
for keyword in self.keywords:
if keyword in outputs:
return True
return False
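# Illustrative sketch of the first check in KeywordsStoppingCriteria.__call__:
# generation stops when the generated ids end with any keyword's ids. Plain lists
# are used instead of tensors, and ends_with_keyword is a hypothetical helper name.

```python
def ends_with_keyword(output_ids, keyword_ids_list):
    for keyword_ids in keyword_ids_list:
        n = len(keyword_ids)
        # Skip empty keywords and keywords longer than what has been generated.
        if 0 < n <= len(output_ids) and output_ids[-n:] == keyword_ids:
            return True
    return False


print(ends_with_keyword([5, 6, 7, 8], [[7, 8]]))
print(ends_with_keyword([5, 6, 7, 8], [[6, 7]]))
```

# The class above additionally decodes the last few tokens and searches for the
# keyword text, which covers keywords that tokenize differently in context.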
================================================
FILE: xtuner-eval_niah/longva/model/__init__.py
================================================
import os
AVAILABLE_MODELS = {
"llava_llama": "LlavaLlamaForCausalLM, LlavaConfig",
"llava_qwen": "LlavaQwenForCausalLM, LlavaQwenConfig",
"llava_mistral": "LlavaMistralForCausalLM, LlavaMistralConfig",
# Add other models as needed
}
for model_name, model_classes in AVAILABLE_MODELS.items():
try:
exec(f"from .language_model.{model_name} import {model_classes}")
except Exception as e:
# Report which model failed before re-raising; the print was unreachable after `raise`.
print(f"Failed to import {model_name} from longva.model.language_model.{model_name}")
raise e
================================================
FILE: xtuner-eval_niah/longva/model/apply_delta.py
================================================
"""
Usage:
python3 -m longva.model.apply_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/vicuna-7b --delta-path lmsys/vicuna-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from longva import LlavaLlamaForCausalLM
def apply_delta(base_model_path, target_model_path, delta_path):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading delta")
delta = LlavaLlamaForCausalLM.from_pretrained(delta_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
delta_tokenizer = AutoTokenizer.from_pretrained(delta_path)
print("Applying delta")
for name, param in tqdm(delta.state_dict().items(), desc="Applying delta"):
if name not in base.state_dict():
assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
continue
if param.data.shape == base.state_dict()[name].shape:
param.data += base.state_dict()[name]
else:
assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
bparam = base.state_dict()[name]
param.data[: bparam.shape[0], : bparam.shape[1]] += bparam
print("Saving target model")
delta.save_pretrained(target_model_path)
delta_tokenizer.save_pretrained(target_model_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
args = parser.parse_args()
apply_delta(args.base_model_path, args.target_model_path, args.delta_path)
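# Illustrative sketch of the merge rule in apply_delta(): parameters missing from the
# base are kept as-is, and when the delta has extra rows (for example added tokens),
# the base is added only to the overlapping top-left block. Nested lists stand in for
# tensors; apply_delta_sketch and the parameter names "w" and "proj" are hypothetical.

```python
def apply_delta_sketch(base, delta):
    """base and delta map parameter names to 2D lists of floats."""
    target = {}
    for name, delta_param in delta.items():
        merged = [row[:] for row in delta_param]
        if name in base:
            base_param = base[name]
            # Equivalent to param[:rows, :cols] += base_param on tensors.
            for r, base_row in enumerate(base_param):
                for c, value in enumerate(base_row):
                    merged[r][c] += value
        target[name] = merged
    return target


base = {"w": [[1.0, 2.0]]}
delta = {"w": [[0.5, 0.5], [9.0, 9.0]], "proj": [[3.0]]}
merged = apply_delta_sketch(base, delta)
print(merged)
```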
================================================
FILE: xtuner-eval_niah/longva/model/builder.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import warnings
import shutil
from transformers import AutoTokenizer, AutoModelForCausalLM, AutoConfig, BitsAndBytesConfig
import torch
from longva.model import *
from longva.constants import DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from longva.utils import rank0_print
def load_pretrained_model(model_path, model_base, model_name, load_8bit=False, load_4bit=False, device_map="auto", attn_implementation="flash_attention_2", customized_config=None, overwrite_config=None, **kwargs):
kwargs["device_map"] = device_map
if load_8bit:
kwargs["load_in_8bit"] = True
elif load_4bit:
kwargs["load_in_4bit"] = True
kwargs["quantization_config"] = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.float16, bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
else:
kwargs["torch_dtype"] = torch.float16
if customized_config is not None:
kwargs["config"] = customized_config
is_multimodal = kwargs.pop("multimodal", False) is True  # always defined, and never forwarded to from_pretrained
if "llava" in model_name.lower() or "longva" in model_name.lower() or is_multimodal:
# Load LLaVA model
if "lora" in model_name.lower() and model_base is None:
warnings.warn(
"There is `lora` in model name but no `model_base` is provided. If you are loading a LoRA model, please provide the `model_base` argument. Detailed instruction: https://github.com/haotian-liu/LLaVA#launch-a-model-worker-lora-weights-unmerged."
)
if "lora" in model_name.lower() and model_base is not None:
lora_cfg_pretrained = AutoConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
rank0_print("Loading LLaVA from base model...")
if "mixtral" in model_name.lower():
from longva.model.language_model.llava_mixtral import LlavaMixtralConfig
lora_cfg_pretrained = LlavaMixtralConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "mistral" in model_name.lower():
from longva.model.language_model.llava_mistral import LlavaMistralConfig
lora_cfg_pretrained = LlavaMistralConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
from longva.model.language_model.llava_gemma import LlavaGemmaConfig
lora_cfg_pretrained = LlavaGemmaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
else:
from longva.model.language_model.llava_llama import LlavaConfig
lora_cfg_pretrained = LlavaConfig.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=lora_cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
token_num, token_dim = model.lm_head.out_features, model.lm_head.in_features
if model.lm_head.weight.shape[0] != token_num:
model.lm_head.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
model.model.embed_tokens.weight = torch.nn.Parameter(torch.empty(token_num, token_dim, device=model.device, dtype=model.dtype))
rank0_print("Loading additional LLaVA weights...")
if os.path.exists(os.path.join(model_path, "non_lora_trainables.bin")):
non_lora_trainables = torch.load(os.path.join(model_path, "non_lora_trainables.bin"), map_location="cpu")
else:
# this is probably from HF Hub
from huggingface_hub import hf_hub_download
def load_from_hf(repo_id, filename, subfolder=None):
cache_file = hf_hub_download(repo_id=repo_id, filename=filename, subfolder=subfolder)
return torch.load(cache_file, map_location="cpu")
non_lora_trainables = load_from_hf(model_path, "non_lora_trainables.bin")
non_lora_trainables = {(k[11:] if k.startswith("base_model.") else k): v for k, v in non_lora_trainables.items()}
if any(k.startswith("model.model.") for k in non_lora_trainables):
non_lora_trainables = {(k[6:] if k.startswith("model.") else k): v for k, v in non_lora_trainables.items()}
model.load_state_dict(non_lora_trainables, strict=False)
from peft import PeftModel
rank0_print("Loading LoRA weights...")
model = PeftModel.from_pretrained(model, model_path)
rank0_print("Merging LoRA weights...")
model = model.merge_and_unload()
rank0_print("Model is loaded...")
elif model_base is not None: # this may be the mm projector only; load the projector on top of a preset language model
rank0_print(f"Loading LLaVA from base model {model_base}...")
if "mixtral" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaMixtralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaGemmaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
elif (
"wizardlm-2" in model_name.lower()
and "vicuna" in model_name.lower()
or "llama" in model_name.lower()
or "yi" in model_name.lower()
or "nous-hermes" in model_name.lower()
or "llava-v1.6-34b" in model_name.lower()
or "llava-v1.5" in model_name.lower()
):
from longva.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_name.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
llava_cfg = LlavaConfig.from_pretrained(model_path)
model = LlavaLlamaForCausalLM.from_pretrained(model_base, low_cpu_mem_usage=True, config=llava_cfg, **kwargs)
else:
raise ValueError(f"Model {model_name} not supported")
mm_projector_weights = torch.load(os.path.join(model_path, "mm_projector.bin"), map_location="cpu")
mm_projector_weights = {k: v.to(torch.float16) for k, v in mm_projector_weights.items()}
model.load_state_dict(mm_projector_weights, strict=False)
else:
rank0_print(f"Loading LLaVA model: {model_path}")
if "mixtral" in model_name.lower():
from longva.model.language_model.llava_mixtral import LlavaMixtralConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaMixtralConfig.from_pretrained(model_path)
else:
llava_cfg = customized_config
if overwrite_config is not None:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMixtralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
elif "mistral" in model_name.lower() or "zephyr" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = LlavaMistralForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif (
"wizardlm-2" in model_name.lower()
and "vicuna" in model_name.lower()
or "llama" in model_name.lower()
or "yi" in model_name.lower()
or "nous-hermes" in model_name.lower()
or "llava-v1.6-34b" in model_name.lower()
or "llava-v1.5" in model_name.lower()
):
from longva.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_name.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
if overwrite_config is not None:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
elif "qwen" in model_name.lower() or "quyen" in model_name.lower():
from longva.model.language_model.llava_qwen import LlavaQwenConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if overwrite_config is not None:
llava_cfg = LlavaQwenConfig.from_pretrained(model_path)
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(llava_cfg, k, v)
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
else:
model = LlavaQwenForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, **kwargs)
elif "gemma" in model_name.lower():
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
cfg_pretrained = AutoConfig.from_pretrained(model_path)
model = LlavaGemmaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, config=cfg_pretrained, attn_implementation=attn_implementation, **kwargs)
else:
print("model name: ", model_name.lower())
# try:
from longva.model.language_model.llava_llama import LlavaConfig
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
if customized_config is None:
llava_cfg = LlavaConfig.from_pretrained(model_path)
if "v1.5" in model_path.lower():
llava_cfg.delay_load = True # a workaround for correctly loading v1.5 models
else:
llava_cfg = customized_config
model = LlavaLlamaForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, attn_implementation=attn_implementation, config=llava_cfg, **kwargs)
# except:
# raise ValueError(f"Model {model_name} not supported")
else:
# Load language model
if model_base is not None:
# PEFT model
from peft import PeftModel
tokenizer = AutoTokenizer.from_pretrained(model_base, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_base, torch_dtype=torch.float16, low_cpu_mem_usage=True, device_map="auto")
print(f"Loading LoRA weights from {model_path}")
model = PeftModel.from_pretrained(model, model_path)
print("Merging weights")
model = model.merge_and_unload()
print("Convert to FP16...")
model.to(torch.float16)
else:
use_fast = False
if "mpt" in model_name.lower().replace("prompt", ""):
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, trust_remote_code=True, **kwargs)
else:
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=False)
model = AutoModelForCausalLM.from_pretrained(model_path, low_cpu_mem_usage=True, **kwargs)
rank0_print(f"Model Class: {model.__class__.__name__}")
image_processor = None
if "llava" in model_name.lower() or "longva" in model_name.lower() or is_multimodal:
mm_use_im_start_end = getattr(model.config, "mm_use_im_start_end", False)
mm_use_im_patch_token = getattr(model.config, "mm_use_im_patch_token", True)
if mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
if mm_use_im_start_end:
tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
model.resize_token_embeddings(len(tokenizer))
vision_tower = model.get_vision_tower()
if not vision_tower.is_loaded:
vision_tower.load_model(device_map=device_map)
if device_map != "auto":
vision_tower.to(device="cuda", dtype=torch.float16)
image_processor = vision_tower.image_processor
if hasattr(model.config, "max_sequence_length"):
context_len = model.config.max_sequence_length
elif hasattr(model.config, "max_position_embeddings"):
context_len = model.config.max_position_embeddings
elif hasattr(model.config, "tokenizer_model_max_length"):
context_len = model.config.tokenizer_model_max_length
else:
context_len = 2048
return tokenizer, model, image_processor, context_len
================================================
FILE: xtuner-eval_niah/longva/model/consolidate.py
================================================
"""
Usage:
python3 -m longva.model.consolidate --src ~/model_weights/llava-7b --dst ~/model_weights/llava-7b_consolidate
"""
import argparse
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from longva.model import *
from longva.model.utils import auto_upgrade
def consolidate_ckpt(src_path, dst_path):
print("Loading model")
auto_upgrade(src_path)
src_model = AutoModelForCausalLM.from_pretrained(src_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
src_tokenizer = AutoTokenizer.from_pretrained(src_path, use_fast=False)
src_model.save_pretrained(dst_path)
src_tokenizer.save_pretrained(dst_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--src", type=str, required=True)
parser.add_argument("--dst", type=str, required=True)
args = parser.parse_args()
consolidate_ckpt(args.src, args.dst)
================================================
FILE: xtuner-eval_niah/longva/model/language_model/llava_llama.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig
from torch.nn import CrossEntropyLoss
# , LlamaModel, LlamaForCausalLM, GenerationConfig
# from .modeling_llama import LlamaModel, LlamaForCausalLM
from transformers import LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
from longva.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
class LlavaConfig(LlamaConfig):
model_type = "llava_llama"
temperature: float = 0.0 # reset to 0.0, previously 0.9 for Vicuna
max_new_tokens: int = 1024
do_sample: bool = False
top_p: Optional[float] = None
# rope_scaling: Optional[dict] = {}
class LlavaLlamaModel(LlavaMetaModel, LlamaModel):
config_class = LlavaConfig
def __init__(self, config: LlamaConfig):
super(LlavaLlamaModel, self).__init__(config)
class LlavaLlamaForCausalLM(LlamaForCausalLM, LlavaMetaForCausalLM):
config_class = LlavaConfig
def __init__(self, config):
LlamaForCausalLM.__init__(self, config)
# configure default generation settings
config.model_type = "llava_llama"
# config.rope_scaling = None
self.model = LlavaLlamaModel(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
modalities: Optional[List[str]] = ["image"],
dpo_forward: Optional[bool] = None,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
if dpo_forward:
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
return logits, labels
else:
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
modalities: Optional[List[str]] = ["image"],
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
modalities = kwargs.pop("modalities", None) if "modalities" in kwargs and modalities is None else modalities
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes)
else:
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_llama", LlavaConfig)
AutoModelForCausalLM.register(LlavaConfig, LlavaLlamaForCausalLM)
================================================
FILE: xtuner-eval_niah/longva/model/language_model/llava_mistral.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
from transformers import AutoConfig, AutoModelForCausalLM, MistralConfig, MistralModel, MistralForCausalLM, GenerationConfig
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
from ..llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
class LlavaMistralConfig(MistralConfig):
model_type = "llava_mistral"
temperature: float = 0.0 # reset to 0.0, previously 0.9 for Vicuna
max_new_tokens: int = 1024
do_sample: bool = False
top_p: Optional[float] = None
class LlavaMistralModel(LlavaMetaModel, MistralModel):
config_class = LlavaMistralConfig
def __init__(self, config: MistralConfig):
super(LlavaMistralModel, self).__init__(config)
class LlavaMistralForCausalLM(MistralForCausalLM, LlavaMetaForCausalLM):
config_class = LlavaMistralConfig
def __init__(self, config):
super(MistralForCausalLM, self).__init__(config)
config.model_type = "llava_mistral"
# config.rope_scaling = None
self.model = LlavaMistralModel(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, image_sizes=image_sizes)  # keyword avoids binding image_sizes to the `modalities` slot
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, image_sizes=image_sizes)
else:
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_mistral", LlavaMistralConfig)
AutoModelForCausalLM.register(LlavaMistralConfig, LlavaMistralForCausalLM)
================================================
FILE: xtuner-eval_niah/longva/model/language_model/llava_mpt.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Optional, Tuple
import torch
from transformers import AutoConfig, AutoModelForCausalLM, MptConfig, MptForCausalLM, MptModel, GenerationConfig
from longva.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
class LlavaMptConfig(MptConfig):
model_type = "llava_mpt"
class LlavaMptModel(LlavaMetaModel, MptModel):
config_class = LlavaMptConfig
def __init__(self, config: MptConfig):
config.hidden_size = config.d_model
super(LlavaMptModel, self).__init__(config)
def embed_tokens(self, x):
return self.wte(x)
class LlavaMptForCausalLM(MptForCausalLM, LlavaMetaForCausalLM):
config_class = LlavaMptConfig
supports_gradient_checkpointing = True
def __init__(self, config):
super(MptForCausalLM, self).__init__(config)
config.model_type = "llava_mpt"
# config.rope_scaling = None
self.generation_config = GenerationConfig(
temperature=0.0,
max_new_tokens=1024,
do_sample=False,
top_p=None,
)
self.transformer = LlavaMptModel(config)
self.lm_head = torch.nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.transformer
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, LlavaMptModel):
module.gradient_checkpointing = value
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Tuple[Tuple[torch.Tensor, torch.Tensor], ...]] = None,
attention_mask: Optional[torch.Tensor] = None,
inputs_embeds: Optional[torch.Tensor] = None,
labels: Optional[torch.Tensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position=None,
images=None,
):
input_ids, _, attention_mask, past_key_values, inputs_embeds, labels = self.prepare_inputs_labels_for_multimodal(input_ids, None, attention_mask, past_key_values, labels, images)  # MPT has no position_ids; pass None and discard the returned slot
return super().forward(
input_ids,
past_key_values=past_key_values,
attention_mask=attention_mask,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
_inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
_inputs["images"] = images
return _inputs
AutoConfig.register("llava_mpt", LlavaMptConfig)
AutoModelForCausalLM.register(LlavaMptConfig, LlavaMptForCausalLM)
================================================
FILE: xtuner-eval_niah/longva/model/language_model/llava_qwen.py
================================================
# Copyright 2024 Hao Zhang
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import List, Optional, Tuple, Union, Dict
import torch
import torch.nn as nn
from torch.nn import CrossEntropyLoss
import transformers
from transformers import AutoConfig, AutoModelForCausalLM, LlamaConfig, LlamaModel, LlamaForCausalLM
from transformers.modeling_outputs import CausalLMOutputWithPast
from transformers.generation.utils import GenerateOutput
# from ...constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from longva.model.llava_arch import LlavaMetaModel, LlavaMetaForCausalLM
from transformers import Qwen2Config, Qwen2Model, Qwen2ForCausalLM
# from .qwen.modeling_qwen import QWenLMHeadModel, QWenModel
# from .qwen.configuration_qwen import QWenConfig
class LlavaQwenConfig(Qwen2Config):
model_type = "llava_qwen"
class LlavaQwenModel(LlavaMetaModel, Qwen2Model):
config_class = LlavaQwenConfig
def __init__(self, config: Qwen2Config):
super(LlavaQwenModel, self).__init__(config)
class LlavaQwenForCausalLM(Qwen2ForCausalLM, LlavaMetaForCausalLM):
config_class = LlavaQwenConfig
def __init__(self, config):
# super(Qwen2ForCausalLM, self).__init__(config)
Qwen2ForCausalLM.__init__(self, config)
config.model_type = "llava_qwen"
# config.rope_scaling = None
self.model = LlavaQwenModel(config)
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_model(self):
return self.model
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
images: Optional[torch.FloatTensor] = None,
image_sizes: Optional[List[List[int]]] = None,
return_dict: Optional[bool] = None,
modalities: Optional[List[str]] = ["image"],
dpo_forward: Optional[bool] = False,
cache_position=None,
) -> Union[Tuple, CausalLMOutputWithPast]:
if inputs_embeds is None:
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
if dpo_forward:
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
return logits, labels
else:
return super().forward(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
labels=labels,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
@torch.no_grad()
def generate(
self,
inputs: Optional[torch.Tensor] = None,
images: Optional[torch.Tensor] = None,
image_sizes: Optional[torch.Tensor] = None,
modalities: Optional[List[str]] = ["image"],
**kwargs,
) -> Union[GenerateOutput, torch.LongTensor]:
position_ids = kwargs.pop("position_ids", None)
attention_mask = kwargs.pop("attention_mask", None)
if "inputs_embeds" in kwargs:
raise NotImplementedError("`inputs_embeds` is not supported")
if images is not None:
(inputs, position_ids, attention_mask, _, inputs_embeds, _) = self.prepare_inputs_labels_for_multimodal(inputs, position_ids, attention_mask, None, None, images, modalities, image_sizes=image_sizes)
else:
inputs_embeds = self.get_model().embed_tokens(inputs)
return super().generate(position_ids=position_ids, attention_mask=attention_mask, inputs_embeds=inputs_embeds, **kwargs)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, inputs_embeds=None, **kwargs):
images = kwargs.pop("images", None)
image_sizes = kwargs.pop("image_sizes", None)
inputs = super().prepare_inputs_for_generation(input_ids, past_key_values=past_key_values, inputs_embeds=inputs_embeds, **kwargs)
if images is not None:
inputs["images"] = images
if image_sizes is not None:
inputs["image_sizes"] = image_sizes
return inputs
AutoConfig.register("llava_qwen", LlavaQwenConfig)
AutoModelForCausalLM.register(LlavaQwenConfig, LlavaQwenForCausalLM)
================================================
FILE: xtuner-eval_niah/longva/model/language_model/modeling_llama.py
================================================
# coding=utf-8
# Copyright 2022 EleutherAI and the HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch LLaMA model."""
import math
import warnings
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache, StaticCache
from transformers.modeling_outputs import (
BaseModelOutputWithPast,
CausalLMOutputWithPast,
QuestionAnsweringModelOutput,
SequenceClassifierOutputWithPast,
)
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10,
logging,
replace_return_docstrings,
)
from transformers.models.llama.configuration_llama import LlamaConfig
if is_flash_attn_2_available():
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import index_first_axis, pad_input, unpad_input # noqa
try:
    # Used by LlamaRingFlashAttention2 below. Assumed to come from the optional `ring_flash_attn` package.
    from ring_flash_attn import zigzag_ring_flash_attn_qkvpacked_func, zigzag_ring_flash_attn_varlen_func
except ImportError:
    # Ring attention is unavailable; LlamaRingFlashAttention2 cannot be used in this case.
    zigzag_ring_flash_attn_qkvpacked_func = zigzag_ring_flash_attn_varlen_func = None
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "LlamaConfig"
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
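# Illustrative sketch (not part of the original module): a framework-free version of the
# bookkeeping that `_get_unpad_data` performs, assuming the padding mask is a rectangular
# list of 0/1 rows. The helper name `unpad_metadata` is hypothetical.

```python
from itertools import accumulate


def unpad_metadata(attention_mask):
    """Return (indices, cu_seqlens, max_seqlen) for a rectangular 0/1 padding mask."""
    width = len(attention_mask[0])
    # Number of real (non-padding) tokens in each sequence.
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [
        r * width + c
        for r, row in enumerate(attention_mask)
        for c, value in enumerate(row)
        if value
    ]
    # Cumulative sequence lengths with a leading zero, as varlen kernels expect.
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [[1, 1, 1, 0], [1, 1, 0, 0]]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
```

# Sequence i occupies the slice cu_seqlens[i]:cu_seqlens[i + 1] of the unpadded token stream.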
class LlamaRMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""
LlamaRMSNorm is equivalent to T5LayerNorm
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
ALL_LAYERNORM_LAYERS.append(LlamaRMSNorm)
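# Illustrative sketch (not part of the original module): the arithmetic of `LlamaRMSNorm.forward`
# on a single vector, using plain Python floats. The function name `rms_norm` is hypothetical.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """Scale x by the reciprocal of its root mean square, then apply a per-element weight."""
    variance = sum(v * v for v in x) / len(x)  # mean of squares; no mean subtraction
    scale = 1.0 / math.sqrt(variance + eps)
    return [w * v * scale for v, w in zip(x, weight)]


out = rms_norm([3.0, 4.0], [1.0, 1.0])
mean_square = sum(v * v for v in out) / len(out)
```

# With unit weights, the mean square of the output is approximately 1 (up to the eps term).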
class LlamaRotaryEmbedding(nn.Module):
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
super().__init__()
self.scaling_factor = scaling_factor
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# For backward compatibility, register the cached cos and sin buffers
self.max_seq_len_cached = max_position_embeddings
t = torch.arange(self.max_seq_len_cached, device=device, dtype=torch.int64).type_as(self.inv_freq)
t = t / self.scaling_factor
freqs = torch.outer(t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("_cos_cached", emb.cos().to(torch.get_default_dtype()), persistent=False)
self.register_buffer("_sin_cached", emb.sin().to(torch.get_default_dtype()), persistent=False)
@property
def sin_cached(self):
logger.warning_once("The sin_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use " "the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class")
return self._sin_cached
@property
def cos_cached(self):
logger.warning_once("The cos_cached attribute will be removed in 4.39. Bear in mind that its contents changed in v4.38. Use " "the forward method of RoPE from now on instead. It is not used in the `LlamaAttention` class")
return self._cos_cached
@torch.no_grad()
def forward(self, x, position_ids, seq_len=None):
if seq_len is not None:
logger.warning_once("The `seq_len` argument is deprecated and unused. It will be removed in v4.39.")
# x: [bs, num_attention_heads, seq_len, head_size]
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
# Force float32 since bfloat16 loses precision on long contexts
# See https://github.com/huggingface/transformers/pull/29285
device_type = x.device.type
device_type = device_type if isinstance(device_type, str) else "cpu"
with torch.autocast(device_type=device_type, enabled=False):
freqs = (inv_freq_expanded.float() @ position_ids_expanded.float()).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
cos = emb.cos()
sin = emb.sin()
return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
class LlamaLinearScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
def forward(self, x, position_ids, seq_len=None):
# Difference from the original RoPE: a scaling factor is applied to the position ids
position_ids = position_ids.float() / self.scaling_factor
cos, sin = super().forward(x, position_ids, seq_len)
return cos, sin
class LlamaDynamicNTKScalingRotaryEmbedding(LlamaRotaryEmbedding):
"""LlamaRotaryEmbedding extended with Dynamic NTK scaling. Credits to the Reddit users /u/bloc97 and /u/emozilla"""
def forward(self, x, position_ids, seq_len=None):
# difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
seq_len = torch.max(position_ids) + 1
if seq_len > self.max_position_embeddings:
base = self.base * ((self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)) ** (self.dim / (self.dim - 2))
inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(x.device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False) # TODO joao: this may break with compilation
cos, sin = super().forward(x, position_ids, seq_len)
return cos, sin
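# Illustrative sketch (not part of the original module): the base adjustment used by
# `LlamaDynamicNTKScalingRotaryEmbedding.forward`, written as a standalone function.
# The name `dynamic_ntk_base` and the example values are assumptions for illustration.

```python
def dynamic_ntk_base(base, scaling_factor, seq_len, max_position_embeddings, dim):
    """Return the RoPE base, enlarged only when seq_len exceeds the trained context length."""
    if seq_len <= max_position_embeddings:
        return base
    growth = (scaling_factor * seq_len / max_position_embeddings) - (scaling_factor - 1)
    return base * growth ** (dim / (dim - 2))


unchanged = dynamic_ntk_base(10000.0, 2.0, 2048, 2048, 128)
enlarged = dynamic_ntk_base(10000.0, 2.0, 4096, 2048, 128)
```

# A larger base lowers the rotation frequencies, so longer sequences map onto angles closer to those seen in training.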
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`, *optional*):
Deprecated and unused.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
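# Illustrative sketch (not part of the original module): the same rotate-half formulation of RoPE
# on plain Python lists, showing that the query-key score depends only on the relative position.
# The helper names `rope_angles`, `apply_rope`, and `dot` are hypothetical.

```python
import math


def rope_angles(position, dim, base=10000.0):
    """cos/sin tables for one position, duplicated across both halves of the head dimension."""
    inv_freq = [1.0 / (base ** (i / dim)) for i in range(0, dim, 2)]
    freqs = [position * f for f in inv_freq]
    emb = freqs + freqs  # mirrors torch.cat((freqs, freqs), dim=-1)
    return [math.cos(a) for a in emb], [math.sin(a) for a in emb]


def rotate_half(x):
    """Swap the two halves of x and negate the half that moves to the front."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    cos, sin = rope_angles(position, len(x), base)
    rotated = rotate_half(x)
    return [a * c + b * s for a, b, c, s in zip(x, rotated, cos, sin)]


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Both pairs of positions are 3 apart, so the attention scores should match.
score_a = dot(apply_rope(q, 5), apply_rope(k, 2))
score_b = dot(apply_rope(q, 13), apply_rope(k, 10))
```

# Each (x[i], x[i + dim // 2]) pair is rotated by a position-dependent angle, which also preserves the vector norm.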
class LlamaMLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.gate_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
if self.config.pretraining_tp > 1:
slice = self.intermediate_size // self.config.pretraining_tp
gate_proj_slices = self.gate_proj.weight.split(slice, dim=0)
up_proj_slices = self.up_proj.weight.split(slice, dim=0)
down_proj_slices = self.down_proj.weight.split(slice, dim=1)
gate_proj = torch.cat([F.linear(x, gate_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
up_proj = torch.cat([F.linear(x, up_proj_slices[i]) for i in range(self.config.pretraining_tp)], dim=-1)
intermediate_states = (self.act_fn(gate_proj) * up_proj).split(slice, dim=2)
down_proj = [F.linear(intermediate_states[i], down_proj_slices[i]) for i in range(self.config.pretraining_tp)]
down_proj = sum(down_proj)
else:
down_proj = self.down_proj(self.act_fn(self.gate_proj(x)) * self.up_proj(x))
return down_proj
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
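# Illustrative sketch (not part of the original module): the head ordering produced by `repeat_kv`,
# shown with labels instead of tensors. The helper name `repeat_kv_heads` is hypothetical.

```python
def repeat_kv_heads(kv_heads, n_rep):
    """Repeat each key/value head n_rep times in place (repeat_interleave ordering)."""
    return [head for head in kv_heads for _ in range(n_rep)]


expanded = repeat_kv_heads(["kv0", "kv1"], 3)
# Query head h attends with key/value head h // n_rep.
kv_for_query_head_4 = expanded[4]
```

# The copies of one key/value head are adjacent, so each group of n_rep consecutive query heads shares it.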
class LlamaAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(self, config: LlamaConfig, layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f"Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will "
"lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` "
"when creating this class."
)
self.attention_dropout = config.attention_dropout
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.is_causal = True
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}" f" and `num_heads`: {self.num_heads}).")
self.q_proj = nn.Linear(self.hidden_size, self.num_heads * self.head_dim, bias=config.attention_bias)
self.k_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
self.v_proj = nn.Linear(self.hidden_size, self.num_key_value_heads * self.head_dim, bias=config.attention_bias)
self.o_proj = nn.Linear(self.hidden_size, self.hidden_size, bias=config.attention_bias)
self._init_rope()
def _init_rope(self):
if self.config.rope_scaling is None:
self.rotary_emb = LlamaRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
else:
scaling_type = self.config.rope_scaling["type"]
scaling_factor = self.config.rope_scaling["factor"]
if scaling_type == "linear":
self.rotary_emb = LlamaLinearScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
elif scaling_type == "dynamic":
self.rotary_emb = LlamaDynamicNTKScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
else:
raise ValueError(f"Unknown RoPE scaling type {scaling_type}")
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
if self.config.pretraining_tp > 1:
key_value_slicing = (self.num_key_value_heads * self.head_dim) // self.config.pretraining_tp
query_slices = self.q_proj.weight.split((self.num_heads * self.head_dim) // self.config.pretraining_tp, dim=0)
key_slices = self.k_proj.weight.split(key_value_slicing, dim=0)
value_slices = self.v_proj.weight.split(key_value_slicing, dim=0)
query_states = [F.linear(hidden_states, query_slices[i]) for i in range(self.config.pretraining_tp)]
query_states = torch.cat(query_states, dim=-1)
key_states = [F.linear(hidden_states, key_slices[i]) for i in range(self.config.pretraining_tp)]
key_states = torch.cat(key_states, dim=-1)
value_states = [F.linear(hidden_states, value_slices[i]) for i in range(self.config.pretraining_tp)]
value_states = torch.cat(value_states, dim=-1)
else:
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
past_key_value = getattr(self, "past_key_value", past_key_value)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position is needed for the static cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
if attention_mask is not None: # no matter the length, we just slice it
causal_mask = attention_mask
if cache_position is not None:
causal_mask = attention_mask[:, :, cache_position, : key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is" f" {attn_output.size()}")
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
if self.config.pretraining_tp > 1:
attn_output = attn_output.split(self.hidden_size // self.config.pretraining_tp, dim=2)
o_proj_slices = self.o_proj.weight.split(self.hidden_size // self.config.pretraining_tp, dim=1)
attn_output = sum([F.linear(attn_output[i], o_proj_slices[i]) for i in range(self.config.pretraining_tp)])
else:
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
class LlamaRingFlashAttention2(LlamaAttention):
"""
Llama ring flash attention module. This module inherits from `LlamaAttention` as the weights of the module stay
untouched. The only required change is in the forward pass, which calls the zigzag ring flash attention functions
(`zigzag_ring_flash_attn_varlen_func` and `zigzag_ring_flash_attn_qkvpacked_func`) and deals with padding tokens in case the input contains any of them.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
output_attentions = False
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim
# therefore we just need to keep the original shape
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
past_key_value = getattr(self, "past_key_value", past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position is needed for the static cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
# to be able to avoid many of these transpose/reshape/view operations.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, the layer norms are usually cast to float32 for training stability,
# so the input hidden states get silently cast to float32. Hence, we need to
# cast them back to the correct dtype to be sure everything works as expected.
# This might slow down training & inference, so it is recommended not to cast the LayerNorms
# to fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
logger.warning_once(
f"The input hidden states seems to be silently casted in float32, this might be related to" f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in" f" {target_dtype}."
)
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
attn_output = self._flash_attention_forward(query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None):
"""
Calls the zigzag ring Flash Attention functions. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention output and pads the final attention output.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
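query_length (`int`):
Length of the query sequence; used to unpad the queries and re-pad the output.
dropout (`float`, *optional*, defaults to 0.0):
Attention dropout probability.
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim).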
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaRingFlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(query_states, key_states, value_states, attention_mask, query_length)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = zigzag_ring_flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
else:
# pack qkv
# query_states: (batch_size, seqlen, nheads, headdim)
# qkv: (batch_size, seqlen, 3, nheads, headdim)
qkv = torch.stack([query_states, key_states, value_states], dim=2)
attn_output = zigzag_ring_flash_attn_qkvpacked_func(qkv, dropout, softmax_scale, causal=causal)
return attn_output
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k)
value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis(query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(batch_size + 1, dtype=torch.int32, device=query_layer.device) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
class LlamaFlashAttention2(LlamaAttention):
"""
Llama flash attention module. This module inherits from `LlamaAttention` as the weights of the module stay
untouched. The only required change is in the forward pass, which needs to correctly call the public API of
flash attention and deal with padding tokens in case the input contains any of them.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment, which was made the default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
output_attentions = False
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim
# therefore we just need to keep the original shape
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
past_key_value = getattr(self, "past_key_value", past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position is needed for the static cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
# to be able to avoid many of these transpose/reshape/view operations.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, the layer norms are usually cast to float32 for training stability,
# so the input hidden states get silently cast to float32. Hence, we need to
# cast them back to the correct dtype to be sure everything works as expected.
# This might slow down training & inference, so it is recommended not to cast the LayerNorms
# to fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
logger.warning_once(
f"The input hidden states seems to be silently casted in float32, this might be related to" f" the fact you have upcasted embedding or layer norm layers in float32. We will cast back the input in" f" {target_dtype}."
)
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
attn_output = self._flash_attention_forward(query_states, key_states, value_states, attention_mask, q_len, dropout=dropout_rate)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention output and pads the final attention output.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
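query_length (`int`):
Length of the query sequence; used to unpad the queries and re-pad the output.
dropout (`float`, *optional*, defaults to 0.0):
Attention dropout probability.
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim).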
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(query_states, key_states, value_states, attention_mask, query_length)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
else:
attn_output = flash_attn_func(query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal)
return attn_output
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k)
value_layer = index_first_axis(value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis(query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(batch_size + 1, dtype=torch.int32, device=query_layer.device) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
class LlamaSdpaAttention(LlamaAttention):
"""
Llama attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
`LlamaAttention` as the weights of the module stay untouched. The only changes are in the forward pass, to adapt to
the SDPA API.
"""
# Adapted from LlamaAttention.forward
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if output_attentions:
# TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
logger.warning_once(
"LlamaModel is using LlamaSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, "
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
return super().forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
# In case static cache is used, it is an instance attribute.
past_key_value = getattr(self, "past_key_value", past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position is needed for the static cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position}
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
causal_mask = attention_mask
if attention_mask is not None and cache_position is not None:
causal_mask = causal_mask[:, :, cache_position, : key_states.shape[-2]]
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
# Reference: https://github.com/pytorch/pytorch/issues/112577.
if query_states.device.type == "cuda" and causal_mask is not None:
query_states = query_states.contiguous()
key_states = key_states.contiguous()
value_states = value_states.contiguous()
attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states,
key_states,
value_states,
attn_mask=causal_mask,
dropout_p=self.attention_dropout if self.training else 0.0,
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
try:
from ring_flash_attn import zigzag_ring_flash_attn_qkvpacked_func, zigzag_ring_flash_attn_varlen_func
except ImportError:
print("Please install the ring-flash-attn package")
LLAMA_ATTENTION_CLASSES = {
"eager": LlamaAttention,
"flash_attention_2": LlamaFlashAttention2,
"ring_flash_attention_2": LlamaRingFlashAttention2,
"sdpa": LlamaSdpaAttention,
}
class LlamaDecoderLayer(nn.Module):
def __init__(self, config: LlamaConfig, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
self.self_attn = LLAMA_ATTENTION_CLASSES[config._attn_implementation](config=config, layer_idx=layer_idx)
self.mlp = LlamaMLP(config)
self.input_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*):
attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
query_sequence_length, key_sequence_length)` if default attention is used.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
if "padding_mask" in kwargs:
warnings.warn("Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.")
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
**kwargs,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights,)
if use_cache:
outputs += (present_key_value,)
return outputs
LLAMA_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`LlamaConfig`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
@add_start_docstrings(
"The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
LLAMA_START_DOCSTRING,
)
class LlamaPreTrainedModel(PreTrainedModel):
config_class = LlamaConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["LlamaDecoderLayer"]
_skip_keys_device_placement = ["past_key_values", "causal_mask"]
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
def _setup_cache(self, cache_cls, max_batch_size, max_cache_len: Optional[int] = None):
if self.config._attn_implementation == "flash_attention_2" and cache_cls == StaticCache:
raise ValueError("`static` cache implementation is not compatible with `attn_implementation==flash_attention_2` " "make sure to use `sdpa` in the meantime, and open an issue at https://github.com/huggingface/transformers")
if max_cache_len > self.model.causal_mask.shape[-1] or self.device != self.model.causal_mask.device:
causal_mask = torch.full((max_cache_len, max_cache_len), fill_value=True, device=self.device, dtype=torch.bool)
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
for layer in self.model.layers:
device = layer.input_layernorm.weight.device
if hasattr(self.config, "_pre_quantization_dtype"):
dtype = self.config._pre_quantization_dtype
else:
dtype = layer.self_attn.o_proj.weight.dtype
layer.self_attn.past_key_value = cache_cls(self.config, max_batch_size, max_cache_len, device=device, dtype=dtype)
def _reset_cache(self):
for layer in self.model.layers:
layer.self_attn.past_key_value = None
LLAMA_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
Two formats are allowed:
- a [`~cache_utils.Cache`] instance;
- Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
legacy cache format will be returned.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
"The bare LLaMA Model outputting raw hidden-states without any specific head on top.",
LLAMA_START_DOCSTRING,
)
class LlamaModel(LlamaPreTrainedModel):
"""
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is a [`LlamaDecoderLayer`]
Args:
config: LlamaConfig
"""
def __init__(self, config: LlamaConfig):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.layers = nn.ModuleList([LlamaDecoderLayer(config, layer_idx) for layer_idx in range(config.num_hidden_layers)])
self.norm = LlamaRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Register a causal mask to separate causal and padding mask creation. Merging happens in the attention class.
# NOTE: This is not friendly with TorchScript, ONNX, ExportedProgram serialization for very large `max_position_embeddings`.
causal_mask = torch.full((config.max_position_embeddings, config.max_position_embeddings), fill_value=True, dtype=torch.bool)
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
@add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one")
if self.gradient_checkpointing and self.training and use_cache:
logger.warning_once("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.")
use_cache = False
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
past_seen_tokens = 0
if use_cache: # kept for BC (cache positions)
if not isinstance(past_key_values, StaticCache):
past_key_values = DynamicCache.from_legacy_cache(past_key_values)
past_seen_tokens = past_key_values.get_seq_length()
if cache_position is None:
if isinstance(past_key_values, StaticCache):
raise ValueError("cache_position is a required argument when using StaticCache.")
cache_position = torch.arange(past_seen_tokens, past_seen_tokens + inputs_embeds.shape[1], device=inputs_embeds.device)
if position_ids is None:
position_ids = cache_position.unsqueeze(0)
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds)
# embed positions
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None
for decoder_layer in self.layers:
if output_hidden_states:
all_hidden_states += (hidden_states,)
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
causal_mask,
position_ids,
past_key_values,
output_attentions,
use_cache,
cache_position,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=causal_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1],)
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states,)
next_cache = None
if use_cache:
next_cache = next_decoder_cache.to_legacy_cache() if isinstance(next_decoder_cache, Cache) else next_decoder_cache
if not return_dict:
return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
)
# TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length even when the static
# KV cache is used. This is an issue for torch.compile, which then recaptures cudagraphs at each decode step due to the dynamic shapes
# (`recording cudagraph tree for symint key 13`, etc.), which is VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using
# `fullgraph=True`. See more context in https://github.com/huggingface/transformers/pull/29114
def _update_causal_mask(self, attention_mask, input_tensor):
if self.config._attn_implementation == "flash_attention_2":
if attention_mask is not None and 0.0 in attention_mask:
return attention_mask
return None
batch_size, seq_length = input_tensor.shape[:2]
dtype = input_tensor.dtype
device = input_tensor.device
# support going beyond cached `max_position_embedding`
if seq_length > self.causal_mask.shape[-1]:
causal_mask = torch.full((2 * self.causal_mask.shape[-1], 2 * self.causal_mask.shape[-1]), fill_value=1)
self.register_buffer("causal_mask", torch.triu(causal_mask, diagonal=1), persistent=False)
# We use the current dtype to avoid any overflows
min_dtype = torch.finfo(dtype).min
causal_mask = self.causal_mask[None, None, :, :].repeat(batch_size, 1, 1, 1).to(dtype) * min_dtype
causal_mask = causal_mask.to(dtype=dtype, device=device)
if attention_mask is not None and attention_mask.dim() == 2:
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[..., :mask_length].eq(0.0) * attention_mask[:, None, None, :].eq(0.0)
causal_mask[..., :mask_length] = causal_mask[..., :mask_length].masked_fill(padding_mask, min_dtype)
if self.config._attn_implementation == "sdpa" and attention_mask is not None and attention_mask.device.type == "cuda":
# TODO: For dynamo, rather use a check on fullgraph=True once this is possible (https://github.com/pytorch/pytorch/pull/120400).
is_tracing = torch.jit.is_tracing() or isinstance(input_tensor, torch.fx.Proxy) or (hasattr(torch, "_dynamo") and torch._dynamo.is_compiling())
if not is_tracing and torch.any(attention_mask != 1):
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
causal_mask = AttentionMaskConverter._unmask_unattended(causal_mask, min_dtype)
return causal_mask
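# Illustrative sketch, not part of the upstream module: a pure-Python rendering of how
# `_update_causal_mask` above merges the causal mask with a 2D padding mask, shown for a
# single sequence. `_sketch_additive_mask` and its `neg` parameter are hypothetical names;
# `neg` stands in for `torch.finfo(dtype).min`. The SDPA-specific unmasking of fully masked
# rows and the `max_position_embeddings`-sized buffer are deliberately not modelled here.
def _sketch_additive_mask(attention_mask_row, neg=float("-inf")):
    """Return an additive mask: 0.0 where attention is allowed, `neg` where it is blocked."""
    n = len(attention_mask_row)
    mask = []
    for q in range(n):
        row = []
        for k in range(n):
            is_future = k > q  # strict upper triangle (diagonal=1) is blocked by causality
            is_padding = attention_mask_row[k] == 0  # padded key positions are blocked
            row.append(neg if (is_future or is_padding) else 0.0)
        mask.append(row)
    return mask


# Usage note: with left padding, e.g. `_sketch_additive_mask([0, 1, 1], neg=-1.0)`, the first
# row is blocked everywhere. That is the "fully masked row" case which the SDPA branch above
# handles through `AttentionMaskConverter._unmask_unattended`.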
class LlamaForCausalLM(LlamaPreTrainedModel):
_tied_weights_keys = ["lm_head.weight"]
def __init__(self, config):
super().__init__(config)
self.model = LlamaModel(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, LlamaForCausalLM
>>> model = LlamaForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
cache_position=cache_position,
)
hidden_states = outputs[0]
if self.config.pretraining_tp > 1:
lm_head_slices = self.lm_head.weight.split(self.vocab_size // self.config.pretraining_tp, dim=0)
logits = [F.linear(hidden_states, lm_head_slices[i]) for i in range(self.config.pretraining_tp)]
logits = torch.cat(logits, dim=-1)
else:
logits = self.lm_head(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs):
past_length = 0
if past_key_values is not None:
if isinstance(past_key_values, Cache):
cache_length = past_key_values.get_seq_length()
past_length = past_key_values.seen_tokens
max_cache_length = past_key_values.get_max_length()
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if max_cache_length is not None and attention_mask is not None and cache_length + input_ids.shape[1] > max_cache_length:
attention_mask = attention_mask[:, -max_cache_length:]
position_ids = kwargs.get("position_ids", None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
if self.generation_config.cache_implementation == "static":
# generation with static cache
cache_position = kwargs.get("cache_position", None)
if cache_position is None:
past_length = 0
else:
past_length = cache_position[-1] + 1
input_ids = input_ids[:, past_length:]
position_ids = position_ids[:, past_length:]
# TODO @gante we should only keep a `cache_position` in generate, and do +=1.
# same goes for position ids. Could also help with continued generation.
input_length = position_ids.shape[-1] if position_ids is not None else input_ids.shape[-1]
cache_position = torch.arange(past_length, past_length + input_length, device=input_ids.device)
position_ids = position_ids.contiguous() if position_ids is not None else None
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
# The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
# recompiles graphs as the stride of the inputs is a guard. Ref: https://github.com/huggingface/transformers/pull/29114
# TODO: use `next_tokens` directly instead.
model_inputs = {"input_ids": input_ids.contiguous()}
model_inputs.update(
{
"position_ids": position_ids,
"cache_position": cache_position,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),)
return reordered_past
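# Illustrative sketch, not part of the upstream module: `prepare_inputs_for_generation` above
# builds `position_ids` as `attention_mask.long().cumsum(-1) - 1` and then writes 1 into the
# padded slots. The hypothetical helper `_sketch_position_ids` mirrors that arithmetic for one
# row using plain Python lists, assuming the mask holds only 0s and 1s.
def _sketch_position_ids(attention_mask_row):
    """Return per-token position ids for one row of a 0/1 attention mask."""
    position_ids = []
    running = 0  # cumulative count of attended tokens seen so far
    for m in attention_mask_row:
        running += m
        # Attended tokens get (cumsum - 1); padded tokens get the placeholder value 1.
        position_ids.append(running - 1 if m != 0 else 1)
    return position_ids


# Usage note: for a left-padded row such as `[0, 0, 1, 1, 1]`, the first real token receives
# position 0, so positions do not depend on how much padding precedes the prompt.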
@add_start_docstrings(
"""
The LLaMa Model transformer with a sequence classification head on top (linear layer).
[`LlamaForSequenceClassification`] uses the last token in order to do the classification, as other causal models
(e.g. GPT-2) do.
Since it does classification on the last token, it requires to know the position of the last token. If a
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
each row of the batch).
""",
LLAMA_START_DOCSTRING,
)
class LlamaForSequenceClassification(LlamaPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = LlamaModel(config)
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
@add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss). If
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
transformer_outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
logits = self.score(hidden_states)
if input_ids is not None:
batch_size = input_ids.shape[0]
else:
batch_size = inputs_embeds.shape[0]
if self.config.pad_token_id is None and batch_size != 1:
raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
if self.config.pad_token_id is None:
sequence_lengths = -1
else:
if input_ids is not None:
# if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
sequence_lengths = torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
sequence_lengths = sequence_lengths.to(logits.device)
else:
sequence_lengths = -1
pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
loss = None
if labels is not None:
labels = labels.to(logits.device)
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(pooled_logits, labels)
if not return_dict:
output = (pooled_logits,) + transformer_outputs[1:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=pooled_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
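# Illustrative sketch (not part of the original module): the pooling step above selects, for each
# row, the token just before the first pad token, and wraps with a modulo when the row has no pad.
# The helper below mirrors that arithmetic on plain Python lists. `last_non_pad_index` is a
# hypothetical name introduced only for this example.

```python
def last_non_pad_index(input_ids, pad_token_id):
    """Return, per row, the index of the token just before the first pad token."""
    indices = []
    for row in input_ids:
        # Index of the first pad token; 0 when the row has no pad at all,
        # which is what argmax over an all-zero mask would give.
        first_pad = row.index(pad_token_id) if pad_token_id in row else 0
        # Subtract one, then wrap: "no pad" becomes -1, which maps to the last index.
        indices.append((first_pad - 1) % len(row))
    return indices


batch = [[5, 6, 7, 0, 0], [8, 9, 10, 11, 12]]
result = last_non_pad_index(batch, pad_token_id=0)
print(result)
```

# Note: a row that starts with a pad token (left padding) also wraps to the last index, which is
# the same behaviour as the tensor version above.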
@add_start_docstrings(
"""
The Llama Model transformer with a span classification head on top for extractive question-answering tasks like
SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
LLAMA_START_DOCSTRING,
)
class LlamaForQuestionAnswering(LlamaPreTrainedModel):
base_model_prefix = "transformer"
# Copied from transformers.models.bloom.modeling_bloom.BloomForQuestionAnswering.__init__ with Bloom->Llama
def __init__(self, config):
super().__init__(config)
self.transformer = LlamaModel(config)
self.qa_outputs = nn.Linear(config.hidden_size, 2)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.transformer.embed_tokens
def set_input_embeddings(self, value):
self.transformer.embed_tokens = value
@add_start_docstrings_to_model_forward(LLAMA_INPUTS_DOCSTRING)
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.FloatTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
start_positions: Optional[torch.LongTensor] = None,
end_positions: Optional[torch.LongTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, QuestionAnsweringModelOutput]:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()
total_loss = None
if start_positions is not None and end_positions is not None:
# On multi-GPU, splitting the batch adds a dimension; squeeze it away
if len(start_positions.size()) > 1:
start_positions = start_positions.squeeze(-1).to(start_logits.device)
if len(end_positions.size()) > 1:
end_positions = end_positions.squeeze(-1).to(end_logits.device)
# sometimes the start/end positions are outside our model inputs, we ignore these terms
ignored_index = start_logits.size(1)
start_positions = start_positions.clamp(0, ignored_index)
end_positions = end_positions.clamp(0, ignored_index)
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
if not return_dict:
output = (start_logits, end_logits) + outputs[2:]
return ((total_loss,) + output) if total_loss is not None else output
return QuestionAnsweringModelOutput(
loss=total_loss,
start_logits=start_logits,
end_logits=end_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
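# Illustrative sketch (not part of the original module): the span loss above clamps out-of-range
# start/end positions to `sequence_length` and passes that value as `ignore_index`, so such
# examples contribute nothing to the loss. The pure-Python version below reproduces the idea for
# one set of logits. `span_loss` is a hypothetical helper, and it assumes mean reduction over
# the non-ignored examples.

```python
import math


def span_loss(logits, positions):
    """Mean cross-entropy over rows whose target position lies inside the sequence."""
    ignored_index = len(logits[0])  # one past the last valid position
    total, count = 0.0, 0
    for row, pos in zip(logits, positions):
        pos = min(max(pos, 0), ignored_index)  # clamp to [0, sequence_length]
        if pos == ignored_index:
            continue  # target fell outside the inputs: skip this example
        log_norm = math.log(sum(math.exp(x) for x in row))
        total += log_norm - row[pos]
        count += 1
    return total / count if count else float("nan")


uniform = [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
loss = span_loss(uniform, positions=[1, 99])  # second target is out of range
```

# With uniform logits over four positions the surviving example contributes log(4); the
# out-of-range one is dropped instead of being counted as an error.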
================================================
FILE: xtuner-eval_niah/longva/model/llava_arch.py
================================================
# Copyright 2023 Haotian Liu
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from abc import ABC, abstractmethod
import math
import re
import time
import torch
import torch.nn as nn
from .multimodal_encoder.builder import build_vision_tower
from .multimodal_resampler.builder import build_vision_resampler
from .multimodal_projector.builder import build_vision_projector
from longva.constants import IGNORE_INDEX, IMAGE_TOKEN_INDEX, DEFAULT_IMAGE_PATCH_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN
from longva.mm_utils import get_anyres_image_grid_shape
from longva.utils import rank0_print
import random
class LlavaMetaModel:
def __init__(self, config):
super(LlavaMetaModel, self).__init__(config)
if hasattr(config, "mm_vision_tower"):
delay_load = getattr(config, "delay_load", False)
self.vision_tower = build_vision_tower(config, delay_load=delay_load)
self.vision_resampler = build_vision_resampler(config, vision_tower=self.vision_tower)
self.mm_projector = build_vision_projector(config, vision_cfg=self.vision_tower.config)
if "unpad" in getattr(config, "mm_patch_merge_type", ""):
self.image_newline = nn.Parameter(torch.empty(config.hidden_size, dtype=self.dtype))
def get_vision_tower(self):
vision_tower = getattr(self, "vision_tower", None)
if type(vision_tower) is list:
vision_tower = vision_tower[0]
return vision_tower
def initialize_vision_modules(self, model_args, fsdp=None):
vision_tower = model_args.vision_tower
mm_vision_select_layer = model_args.mm_vision_select_layer
mm_vision_select_feature = model_args.mm_vision_select_feature
pretrain_mm_mlp_adapter = model_args.pretrain_mm_mlp_adapter
mm_patch_merge_type = model_args.mm_patch_merge_type
self.config.mm_vision_tower = vision_tower
self.config.vision_tower_pretrained = getattr(model_args, "vision_tower_pretrained", "")
if self.get_vision_tower() is None:
vision_tower = build_vision_tower(model_args)
vision_resampler = build_vision_resampler(model_args, vision_tower=vision_tower)
for k, v in vision_resampler.config.items():
setattr(self.config, k, v)
if fsdp is not None and len(fsdp) > 0:
self.vision_tower = [vision_tower]
self.vision_resampler = [vision_resampler]
else:
self.vision_tower = vision_tower
self.vision_resampler = vision_resampler
else:
if fsdp is not None and len(fsdp) > 0:
vision_resampler = self.vision_resampler[0]
vision_tower = self.vision_tower[0]
else:
vision_resampler = self.vision_resampler
vision_tower = self.vision_tower
vision_tower.load_model()
# In case it is frozen by LoRA
for p in vision_resampler.parameters():
p.requires_grad = True
self.config.use_mm_proj = True
self.config.mm_projector_type = getattr(model_args, "mm_projector_type", "linear")
self.config.mm_hidden_size = getattr(vision_resampler, "hidden_size", vision_tower.hidden_size)
self.config.mm_vision_select_layer = mm_vision_select_layer
self.config.mm_vision_select_feature = mm_vision_select_feature
self.config.mm_patch_merge_type = mm_patch_merge_type
if getattr(self, "mm_projector", None) is None:
self.mm_projector = build_vision_projector(self.config, vision_cfg=vision_tower.config)
if "unpad" in mm_patch_merge_type:
embed_std = 1 / torch.sqrt(torch.tensor(self.config.hidden_size, dtype=self.dtype))
self.image_newline = nn.Parameter(torch.randn(self.config.hidden_size, dtype=self.dtype) * embed_std)
else:
# In case it is frozen by LoRA
for p in self.mm_projector.parameters():
p.requires_grad = True
if pretrain_mm_mlp_adapter is not None:
mm_projector_weights = torch.load(pretrain_mm_mlp_adapter, map_location="cpu")
def get_w(weights, keyword):
return {k.split(keyword + ".")[1]: v for k, v in weights.items() if keyword in k}
incompatible_keys = self.mm_projector.load_state_dict(get_w(mm_projector_weights, "mm_projector"))
rank0_print(f"Loaded mm projector weights from {pretrain_mm_mlp_adapter}. Incompatible keys: {incompatible_keys}")
incompatible_keys = self.vision_resampler.load_state_dict(get_w(mm_projector_weights, "vision_resampler"), strict=False)
rank0_print(f"Loaded vision resampler weights from {pretrain_mm_mlp_adapter}. Incompatible keys: {incompatible_keys}")
def unpad_image(tensor, original_size):
"""
Unpads a PyTorch tensor of a padded and resized image.
Args:
tensor (torch.Tensor): The image tensor, assumed to be in CxHxW format.
original_size (tuple): The original size of the image (width, height).
Returns:
torch.Tensor: The unpadded image tensor.
"""
original_width, original_height = original_size
current_height, current_width = tensor.shape[1:]
# Compute aspect ratios
original_aspect_ratio = original_width / original_height
current_aspect_ratio = current_width / current_height
# Determine padding size and direction
if original_aspect_ratio > current_aspect_ratio:
# Padding was added to the height
scale_factor = current_width / original_width
new_height = int(original_height * scale_factor)
padding = (current_height - new_height) // 2
unpadded_tensor = tensor[:, padding : current_height - padding, :]
else:
# Padding was added to the width
scale_factor = current_height / original_height
new_width = int(original_width * scale_factor)
padding = (current_width - new_width) // 2
unpadded_tensor = tensor[:, :, padding : current_width - padding]
return unpadded_tensor
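# Illustrative sketch (not part of the original module): the crop computed by `unpad_image` can
# be checked without tensors. The helper below returns the (row_start, row_end, col_start,
# col_end) bounds that the slicing above would use. `unpad_bounds` is a hypothetical name, and
# `original_size` is taken as (width, height), matching the unpacking order in the function.

```python
def unpad_bounds(original_size, current_hw):
    """Return (row_start, row_end, col_start, col_end) of the unpadded region."""
    original_width, original_height = original_size
    current_height, current_width = current_hw
    if original_width / original_height > current_width / current_height:
        # Image is wider than the canvas: padding sits above and below.
        new_height = int(original_height * (current_width / original_width))
        padding = (current_height - new_height) // 2
        return (padding, current_height - padding, 0, current_width)
    # Image is taller than (or matches) the canvas: padding sits left and right.
    new_width = int(original_width * (current_height / original_height))
    padding = (current_width - new_width) // 2
    return (0, current_height, padding, current_width - padding)


landscape = unpad_bounds((1024, 512), (32, 32))
portrait = unpad_bounds((512, 1024), (32, 32))
```

# A 2:1 landscape image on a 32x32 feature grid keeps rows 8..24 and all columns; the portrait
# case keeps all rows and columns 8..24.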
class LlavaMetaForCausalLM(ABC):
@abstractmethod
def get_model(self):
pass
def get_vision_tower(self):
return self.get_model().get_vision_tower()
def get_2dPool(self, image_feature):
height = width = self.get_vision_tower().num_patches_per_side
num_frames, num_tokens, num_dim = image_feature.shape
image_feature = image_feature.view(num_frames, height, width, -1)
image_feature = image_feature.permute(0, 3, 1, 2).contiguous()
# image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
if self.config.mm_spatial_pool_mode == "average":
image_feature = nn.functional.avg_pool2d(image_feature, self.config.mm_spatial_pool_stride)
elif self.config.mm_spatial_pool_mode == "max":
image_feature = nn.functional.max_pool2d(image_feature, self.config.mm_spatial_pool_stride)
else:
raise ValueError(f"Unexpected mm_spatial_pool_mode: {self.config.mm_spatial_pool_mode}")
image_feature = image_feature.permute(0, 2, 3, 1)
image_feature = image_feature.view(num_frames, -1, num_dim)
return image_feature
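# Illustrative sketch (not part of the original module): `get_2dPool` reshapes each frame's
# tokens into a square grid and pools non-overlapping windows, which divides the token count by
# stride squared. The list-based helper below shows the "average" mode on a single-channel 4x4
# grid. `avg_pool_grid` is a hypothetical name used only here.

```python
def avg_pool_grid(grid, stride):
    """Average non-overlapping stride x stride windows of a 2D list."""
    out_h = len(grid) // stride
    out_w = len(grid[0]) // stride
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            window = [
                grid[i * stride + di][j * stride + dj]
                for di in range(stride)
                for dj in range(stride)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


grid = [[r * 4 + c for c in range(4)] for r in range(4)]  # values 0..15
pooled = avg_pool_grid(grid, stride=2)
tokens_before = len(grid) * len(grid[0])
tokens_after = len(pooled) * len(pooled[0])
```

# Here 16 tokens become 4. By the same arithmetic, a 24x24 grid pooled with stride 2 would go
# from 576 to 144 tokens per frame.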
def encode_images(self, images):
image_features = self.get_model().get_vision_tower()(images)
#image_features = self.get_model().vision_resampler(image_features, images=images)
image_features = self.get_model().mm_projector(image_features)
image_features = self.get_model().vision_resampler(image_features, images=images)
return image_features
def encode_multimodals(self, videos_or_images, video_idx_in_batch, split_sizes=None):
videos_or_images_features = self.get_model().get_vision_tower()(videos_or_images)
per_videos_or_images_features = torch.split(videos_or_images_features, split_sizes, dim=0) # tuple, (dim_1, 576, 4096)
all_videos_or_images_features = []
for idx, feat in enumerate(per_videos_or_images_features):
feat = self.get_model().mm_projector(feat)
# Post pooling
if idx in video_idx_in_batch:
feat = self.get_2dPool(feat)
all_videos_or_images_features.append(feat)
return all_videos_or_images_features
def prepare_inputs_labels_for_multimodal(self, input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities=["image"], image_sizes=None):
vision_tower = self.get_vision_tower()
if vision_tower is None or images is None or input_ids.shape[1] == 1:
return input_ids, position_ids, attention_mask, past_key_values, None, labels
if type(images) is list or images.ndim == 5:
if type(images) is list:
images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
video_idx_in_batch = []
for _ in range(len(modalities)):
if modalities[_] == "video":
video_idx_in_batch.append(_)
images_list = []
for image in images:
if image.ndim == 4:
images_list.append(image)
else:
images_list.append(image.unsqueeze(0))
concat_images = torch.cat([image for image in images_list], dim=0)
split_sizes = [image.shape[0] for image in images_list]
image_features = self.encode_multimodals(concat_images, video_idx_in_batch, split_sizes)
# image_features = torch.split(image_features, split_sizes, dim=0)
mm_patch_merge_type = getattr(self.config, "mm_patch_merge_type", "flat")
image_aspect_ratio = getattr(self.config, "image_aspect_ratio", "square")
if mm_patch_merge_type == "flat":
image_features = [x.flatten(0, 1) for x in image_features]
elif mm_patch_merge_type== "unires":
new_image_features = []
for image_idx, image_feature in enumerate(image_features):
# rank0_print(f"Initial feature size : {image_feature.shape}")
if image_idx in video_idx_in_batch: # video operations
image_feature = image_feature.flatten(0, 1)
elif image_feature.shape[0] > 1:
# base image feature is never used in unires
base_image_feature = image_feature[0]
image_feature = image_feature[1:]
# rank0_print(f"Before pool : {image_feature.shape}")
height = width = self.get_vision_tower().num_patches_per_side
assert height * width == base_image_feature.shape[0]
if hasattr(self.get_vision_tower(), "image_size"):
vision_tower_image_size = self.get_vision_tower().image_size
else:
raise ValueError("vision_tower_image_size is not found in the vision tower.")
num_patch_width, num_patch_height = get_anyres_image_grid_shape(image_sizes[image_idx], self.config.image_grid_pinpoints, vision_tower_image_size)
image_feature = image_feature.view(num_patch_height, num_patch_width, height, width, -1)
# Assume 2*2 patches
# After this, [2,2, 24,24, 4096]
# The pooling kernel is fixed at 2; the value previously parsed from `mm_patch_merge_type` was always overwritten.
kernel_size = 2
image_feature = image_feature.view(num_patch_height * num_patch_width, height, width, -1) # [4, 24, 24, 4096]
image_feature = image_feature.permute(0, 3, 1, 2).contiguous() # [4, 4096, 24, 24]
image_feature = nn.functional.avg_pool2d(image_feature, kernel_size) # [4, 4096, 12, 12]
image_feature = image_feature.flatten(2, 3) # [4, 4096, 144]
image_feature = image_feature.permute(0, 2, 1).contiguous() # [4, 144, 4096]
image_feature = image_feature.flatten(0, 1) # [576, 4096]
# rank0_print(f"After pool : {image_feature.shape}")
else:
# for text only data, there is a placeholder image feature that is actually never used.
image_feature = image_feature[0]
# rank0_print(f"After here : {image_feature.shape}")
new_image_features.append(image_feature)
image_features = new_image_features
else:
raise ValueError(f"Unexpected mm_patch_merge_type: {self.config.mm_patch_merge_type}")
else:
error_message = """
Something is wrong with the input shape. Most likely, you did not wrap the video input in a list:
This is correct:
model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs)
This is wrong:
model.generate(input_ids, images=video_tensor, modalities=["video"], **gen_kwargs)
"""
raise ValueError(error_message)
# image_features = self.encode_images(images)
# TODO: image start / end is not implemented here to support pretraining.
if getattr(self.config, "tune_mm_mlp_adapter", False) and getattr(self.config, "mm_use_im_start_end", False):
raise NotImplementedError
# Let's just add dummy tensors if they do not exist,
# it is a headache to deal with None all the time.
# But it is not ideal, and if you have a better idea,
# please open an issue / submit a PR, thanks.
_labels = labels
_position_ids = position_ids
_attention_mask = attention_mask
if attention_mask is None:
attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
else:
attention_mask = attention_mask.bool()
if position_ids is None:
position_ids = torch.arange(0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
if labels is None:
labels = torch.full_like(input_ids, IGNORE_INDEX)
# remove the padding using attention_mask -- FIXME
_input_ids = input_ids
input_ids = [cur_input_ids[cur_attention_mask] for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)]
labels = [cur_labels[cur_attention_mask] for cur_labels, cur_attention_mask in zip(labels, attention_mask)]
new_input_embeds = []
new_labels = []
cur_image_idx = 0
for batch_idx, cur_input_ids in enumerate(input_ids):
num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
if num_images == 0:
cur_image_features = image_features[cur_image_idx]
cur_input_embeds_1 = self.get_model().embed_tokens(cur_input_ids)
cur_input_embeds = torch.cat([cur_input_embeds_1, cur_image_features[0:0]], dim=0)
new_input_embeds.append(cur_input_embeds)
new_labels.append(labels[batch_idx])
cur_image_idx += 1
continue
image_token_indices = [-1] + torch.where(cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [cur_input_ids.shape[0]]
cur_input_ids_noim = []
cur_labels = labels[batch_idx]
cur_labels_noim = []
for i in range(len(image_token_indices) - 1):
cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] + 1 : image_token_indices[i + 1]])
cur_labels_noim.append(cur_labels[image_token_indices[i] + 1 : image_token_indices[i + 1]])
split_sizes = [x.shape[0] for x in cur_labels_noim]
cur_input_embeds = self.get_model().embed_tokens(torch.cat(cur_input_ids_noim))
cur_input_embeds_no_im = torch.split(cur_input_embeds, split_sizes, dim=0)
cur_new_input_embeds = []
cur_new_labels = []
for i in range(num_images + 1):
cur_new_input_embeds.append(cur_input_embeds_no_im[i])
cur_new_labels.append(cur_labels_noim[i])
if i < num_images:
cur_image_features = image_features[cur_image_idx]
cur_image_idx += 1
cur_new_input_embeds.append(cur_image_features)
cur_new_labels.append(torch.full((cur_image_features.shape[0],), IGNORE_INDEX, device=cur_labels.device, dtype=cur_labels.dtype))
cur_new_input_embeds = [x.to(self.device) for x in cur_new_input_embeds]
# import pdb; pdb.set_trace()
cur_new_input_embeds = torch.cat(cur_new_input_embeds)
cur_new_labels = torch.cat(cur_new_labels)
new_input_embeds.append(cur_new_input_embeds)
new_labels.append(cur_new_labels)
# Truncate sequences to max length as image embeddings can make the sequence longer
tokenizer_model_max_length = getattr(self.config, "tokenizer_model_max_length", None)
new_input_embeds = [x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
new_labels = [x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]
# TODO: hard-coded truncation lengths to control loss spikes (currently disabled)
# if tokenizer_model_max_length is not None:
# new_input_embeds = [x[:4096] if modality != "video" else x[:tokenizer_model_max_length] for x, modality in zip(new_input_embeds, modalities)]
# new_labels = [x[:4096] if modality != "video" else x[:tokenizer_model_max_length] for x, modality in zip(new_labels, modalities)]
# Combine them
max_len = max(x.shape[0] for x in new_input_embeds)
batch_size = len(new_input_embeds)
new_input_embeds_padded = []
new_labels_padded = torch.full((batch_size, max_len), IGNORE_INDEX, dtype=new_labels[0].dtype, device=new_labels[0].device)
attention_mask = torch.zeros((batch_size, max_len), dtype=attention_mask.dtype, device=attention_mask.device)
position_ids = torch.zeros((batch_size, max_len), dtype=position_ids.dtype, device=position_ids.device)
for i, (cur_new_embed, cur_new_labels) in enumerate(zip(new_input_embeds, new_labels)):
cur_len = cur_new_embed.shape[0]
if getattr(self.config, "tokenizer_padding_side", "right") == "left":
new_input_embeds_padded.append(torch.cat((torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device), cur_new_embed), dim=0))
if cur_len > 0:
new_labels_padded[i, -cur_len:] = cur_new_labels
attention_mask[i, -cur_len:] = True
position_ids[i, -cur_len:] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
else:
new_input_embeds_padded.append(torch.cat((cur_new_embed, torch.zeros((max_len - cur_len, cur_new_embed.shape[1]), dtype=cur_new_embed.dtype, device=cur_new_embed.device)), dim=0))
if cur_len > 0:
new_labels_padded[i, :cur_len] = cur_new_labels
attention_mask[i, :cur_len] = True
position_ids[i, :cur_len] = torch.arange(0, cur_len, dtype=position_ids.dtype, device=position_ids.device)
new_input_embeds = torch.stack(new_input_embeds_padded, dim=0)
if _labels is None:
new_labels = None
else:
new_labels = new_labels_padded
if _attention_mask is None:
attention_mask = None
else:
attention_mask = attention_mask.to(dtype=_attention_mask.dtype)
if _position_ids is None:
position_ids = None
if getattr(self.config, "use_pos_skipping", False) and self.training:
position_ids = torch.arange(new_input_embeds.size(1), device=new_input_embeds.device).unsqueeze(0).to(new_input_embeds.device)
split_position = random.randint(0, new_input_embeds.size(1))
left_add = random.randint(0, self.config.pos_skipping_range)
right_add = random.randint(left_add, self.config.pos_skipping_range)
position_ids[:, :split_position] += left_add
position_ids[:, split_position:] += right_add
# import pdb; pdb.set_trace()
return None, position_ids, attention_mask, past_key_values, new_input_embeds, new_labels
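# Illustrative sketch (not part of the original module): the core of
# `prepare_inputs_labels_for_multimodal` is a splice. Each image placeholder in `input_ids` is
# replaced by that image's feature rows, and the matching label positions are filled with the
# ignore value so they carry no loss. The version below works on plain lists. The constant
# values (-200 and -100) are assumptions standing in for the ones imported from
# `longva.constants`, and `splice_image_features` is a hypothetical name.

```python
IMAGE_TOKEN_INDEX = -200  # assumption: placeholder id for an image
IGNORE_INDEX = -100       # assumption: label value excluded from the loss


def splice_image_features(input_ids, labels, image_features, embed):
    """Replace each image placeholder with its features; mask the labels there."""
    new_embeds, new_labels = [], []
    features = iter(image_features)
    for token, label in zip(input_ids, labels):
        if token == IMAGE_TOKEN_INDEX:
            feats = next(features)  # consume images in order of appearance
            new_embeds.extend(feats)
            new_labels.extend([IGNORE_INDEX] * len(feats))
        else:
            new_embeds.append(embed(token))
            new_labels.append(label)
    return new_embeds, new_labels


embeds, labels = splice_image_features(
    input_ids=[1, IMAGE_TOKEN_INDEX, 2],
    labels=[1, IMAGE_TOKEN_INDEX, 2],
    image_features=[["img_a", "img_b", "img_c"]],
    embed=lambda token: f"emb_{token}",
)
```

# One placeholder expands to three feature rows, so the sequence grows from 3 to 5 positions.
# This growth is why the method truncates and re-pads the batch afterwards.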
def initialize_vision_tokenizer(self, model_args, tokenizer):
if model_args.mm_use_im_patch_token:
tokenizer.add_tokens([DEFAULT_IMAGE_PATCH_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if model_args.mm_use_im_start_end:
num_new_tokens = tokenizer.add_tokens([DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN], special_tokens=True)
self.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = self.get_input_embeddings().weight.data
output_embeddings = self.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = True
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
if model_args.pretrain_mm_mlp_adapter:
mm_projector_weights = torch.load(model_args.pretrain_mm_mlp_adapter, map_location="cpu")
embed_tokens_weight = mm_projector_weights["model.embed_tokens.weight"]
assert num_new_tokens == 2
if input_embeddings.shape == embed_tokens_weight.shape:
input_embeddings[-num_new_tokens:] = embed_tokens_weight[-num_new_tokens:]
elif embed_tokens_weight.shape[0] == num_new_tokens:
input_embeddings[-num_new_tokens:] = embed_tokens_weight
else:
raise ValueError(f"Unexpected embed_tokens_weight shape. Pretrained: {embed_tokens_weight.shape}. Current: {input_embeddings.shape}. Number of new tokens: {num_new_tokens}.")
elif model_args.mm_use_im_patch_token:
if model_args.tune_mm_mlp_adapter:
for p in self.get_input_embeddings().parameters():
p.requires_grad = False
for p in self.get_output_embeddings().parameters():
p.requires_grad = False
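# Illustrative sketch (not part of the original module): when `initialize_vision_tokenizer`
# adds the image start/end tokens, it fills the new embedding rows with the mean of the
# existing rows. The list-based helper below shows that initialisation.
# `init_new_rows_with_mean` is a hypothetical name used only for this example.

```python
def init_new_rows_with_mean(embeddings, num_new_tokens):
    """Overwrite the last `num_new_tokens` rows with the mean of the earlier rows."""
    old_rows = embeddings[:-num_new_tokens]
    dim = len(old_rows[0])
    mean_row = [sum(row[d] for row in old_rows) / len(old_rows) for d in range(dim)]
    for i in range(len(embeddings) - num_new_tokens, len(embeddings)):
        embeddings[i] = list(mean_row)  # copy so the new rows stay independent
    return embeddings


table = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0], [0.0, 0.0]]
table = init_new_rows_with_mean(table, num_new_tokens=2)
```

# Starting the new tokens at the average embedding keeps them inside the range of values the
# model already handles, instead of leaving them at whatever the resize step produced.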
================================================
FILE: xtuner-eval_niah/longva/model/make_delta.py
================================================
"""
Usage:
python3 -m longva.model.make_delta --base-model-path ~/model_weights/llama-7b --target-model-path ~/model_weights/llava-7b --delta-path ~/model_weights/llava-7b-delta --hub-repo-id liuhaotian/llava-7b-delta
"""
import argparse
import torch
from tqdm import tqdm
from transformers import AutoTokenizer, AutoModelForCausalLM
from longva.model.utils import auto_upgrade
def make_delta(base_model_path, target_model_path, delta_path, hub_repo_id):
print("Loading base model")
base = AutoModelForCausalLM.from_pretrained(base_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Loading target model")
auto_upgrade(target_model_path)
target = AutoModelForCausalLM.from_pretrained(target_model_path, torch_dtype=torch.float16, low_cpu_mem_usage=True)
print("Calculating delta")
for name, param in tqdm(target.state_dict().items(), desc="Calculating delta"):
if name not in base.state_dict():
assert name in ["model.mm_projector.weight", "model.mm_projector.bias"], f"{name} not in base model"
continue
if param.data.shape == base.state_dict()[name].shape:
param.data -= base.state_dict()[name]
else:
assert name in ["model.embed_tokens.weight", "lm_head.weight"], f"{name} dimension mismatch: {param.data.shape} vs {base.state_dict()[name].shape}"
bparam = base.state_dict()[name]
param.data[: bparam.shape[0], : bparam.shape[1]] -= bparam
print("Saving delta")
if hub_repo_id:
kwargs = {"push_to_hub": True, "repo_id": hub_repo_id}
else:
kwargs = {}
target.save_pretrained(delta_path, **kwargs)
target_tokenizer = AutoTokenizer.from_pretrained(target_model_path)
target_tokenizer.save_pretrained(delta_path, **kwargs)
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--base-model-path", type=str, required=True)
parser.add_argument("--target-model-path", type=str, required=True)
parser.add_argument("--delta-path", type=str, required=True)
parser.add_argument("--hub-repo-id", type=str, default=None)
args = parser.parse_args()
make_delta(args.base_model_path, args.target_model_path, args.delta_path, args.hub_repo_id)
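# Illustrative sketch (not part of the original script): the delta is the target weights minus
# the base weights on the region where both exist. Parameters missing from the base are kept
# whole, and rows added to a larger target matrix (for example new token embeddings) are left
# untouched. The helpers below show the round trip on nested lists. Both function names are
# hypothetical, and the sketch omits the name checks that the script enforces with asserts.

```python
def make_delta_rows(base, target):
    """Subtract base from target on the overlapping region of each parameter."""
    delta = {}
    for name, target_rows in target.items():
        rows = [list(row) for row in target_rows]
        for i, base_row in enumerate(base.get(name, [])):
            for j, base_value in enumerate(base_row):
                rows[i][j] -= base_value
        delta[name] = rows
    return delta


def apply_delta_rows(base, delta):
    """Inverse of make_delta_rows: add base back onto the delta."""
    restored = {}
    for name, delta_rows in delta.items():
        rows = [list(row) for row in delta_rows]
        for i, base_row in enumerate(base.get(name, [])):
            for j, base_value in enumerate(base_row):
                rows[i][j] += base_value
        restored[name] = rows
    return restored


base = {"w": [[1, 2], [3, 4]]}
target = {"w": [[2, 4], [6, 8], [9, 9]], "proj": [[5]]}
delta = make_delta_rows(base, target)
restored = apply_delta_rows(base, delta)
```

# The extra third row of "w" and the new "proj" parameter pass through unchanged, and applying
# the delta to the base recovers the target exactly.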
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_encoder/builder.py
================================================
import os
from .clip_encoder import CLIPVisionTower, CLIPVisionTowerS2
# from .eva_clip.eva_clip_encoder import EvaClipVisionTower
# from .dev_eva_clip.eva_vit import EvaViTWrapper
def build_vision_tower(vision_tower_cfg, **kwargs):
vision_tower = getattr(vision_tower_cfg, "mm_vision_tower", getattr(vision_tower_cfg, "vision_tower", None))
is_absolute_path_exists = os.path.exists(vision_tower)
use_s2 = getattr(vision_tower_cfg, "s2", False)
if is_absolute_path_exists or vision_tower.startswith("openai") or vision_tower.startswith("laion") or "ShareGPT4V" in vision_tower:
if use_s2:
return CLIPVisionTowerS2(vision_tower, args=vision_tower_cfg, **kwargs)
else:
return CLIPVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
# elif "internal-eva" in vision_tower.lower() or "eva02" in vision_tower.lower():
# return EvaClipVisionTower(vision_tower, args=vision_tower_cfg, **kwargs)
# elif vision_tower in ["EVA-CLIP-8B", "EVA-CLIP-8B-plus"]:
# return EvaViTWrapper(vision_tower, args=vision_tower_cfg, **kwargs)
raise ValueError(f"Unknown vision tower: {vision_tower}")
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_encoder/clip_encoder.py
================================================
import torch
import torch.nn as nn
from longva.utils import rank0_print
from transformers import CLIPVisionModel, CLIPImageProcessor, CLIPVisionConfig
try:
from s2wrapper import forward as multiscale_forward
except ImportError:
pass
class CLIPVisionTower(nn.Module):
def __init__(self, vision_tower, args, delay_load=False):
super().__init__()
self.is_loaded = False
self.vision_tower_name = vision_tower
self.select_layer = args.mm_vision_select_layer
self.select_feature = getattr(args, "mm_vision_select_feature", "patch")
if not delay_load:
rank0_print(f"Loading vision tower: {vision_tower}")
self.load_model()
elif getattr(args, "unfreeze_mm_vision_tower", False):
# TODO: better detector is needed.
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `unfreeze_mm_vision_tower`: True.")
self.load_model()
elif hasattr(args, "mm_tunable_parts") and "mm_vision_tower" in args.mm_tunable_parts:
rank0_print(f"The checkpoint seems to contain `vision_tower` weights: `mm_tunable_parts` contains `mm_vision_tower`.")
self.load_model()
else:
self.cfg_only = CLIPVisionConfig.from_pretrained(self.vision_tower_name)
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
self.vision_tower.requires_grad_(False)
self.is_loaded = True
def feature_select(self, image_forward_outs):
select_feature_type = self.select_feature
if self.select_feature in ["slicefour_patch", "slicefour_cls_patch"]:
select_every_k_layer = len(image_forward_outs.hidden_states) // 4
image_features = torch.cat([image_forward_outs.hidden_states[i] for i in range(select_every_k_layer + self.select_layer, len(image_forward_outs.hidden_states), select_every_k_layer)], dim=-1)
select_feature_type = select_feature_type.replace("slicefour_", "")
elif self.select_feature in ["slice_m25811_f6_patch", "slice_m25811_f6_cls_patch"]:
select_layers = [-2, -5, -8, -11, 6]
image_features = torch.cat([image_forward_outs.hidden_states[i] for i in select_layers], dim=-1)
select_feature_type = select_feature_type.replace("slice_m25811_f6_", "")
else:
image_features = image_forward_outs.hidden_states[self.select_layer]
if select_feature_type == "patch":
image_features = image_features[:, 1:]
elif select_feature_type == "cls_patch":
image_features = image_features
else:
raise ValueError(f"Unexpected select feature: {select_feature_type}")
return image_features
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_forward_out = self.vision_tower(image.to(device=self.device, dtype=self.dtype).unsqueeze(0), output_hidden_states=True)
image_feature = self.feature_select(image_forward_out).to(image.dtype)
image_features.append(image_feature)
else:
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
@property
def dummy_feature(self):
return torch.zeros(1, self.hidden_size, device=self.device, dtype=self.dtype)
@property
def dtype(self):
return self.vision_tower.dtype
@property
def device(self):
return self.vision_tower.device
@property
def config(self):
if self.is_loaded:
return self.vision_tower.config
else:
return self.cfg_only
@property
def hidden_size(self):
_hidden_size = self.config.hidden_size
if "slicefour" in self.select_feature:
_hidden_size *= 4
if "slice_m25811_f6" in self.select_feature:
_hidden_size *= 5
return _hidden_size
@property
def num_patches_per_side(self):
return self.config.image_size // self.config.patch_size
@property
def num_patches(self):
_num_patches = (self.config.image_size // self.config.patch_size) ** 2
if "cls_patch" in self.select_feature:
_num_patches += 1
return _num_patches
@property
def image_size(self):
return self.config.image_size
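# Illustrative sketch (not part of the original module): the multi-layer options in
# `feature_select` concatenate several hidden states along the channel axis, which is why
# `hidden_size` is multiplied by 4 or 5. The helper below lists which hidden-state indices the
# "slicefour" option picks. It assumes 25 hidden states (embedding output plus 24 layers) and
# `select_layer = -2`; both are example values, not read from any config. `slicefour_layers`
# is a hypothetical name.

```python
def slicefour_layers(num_hidden_states, select_layer):
    """Indices of the hidden states concatenated by the 'slicefour' option."""
    step = num_hidden_states // 4
    return list(range(step + select_layer, num_hidden_states, step))


layers = slicefour_layers(num_hidden_states=25, select_layer=-2)

# The "slice_m25811_f6" option uses a fixed list; resolve negative indices for comparison.
fixed = [i % 25 for i in [-2, -5, -8, -11, 6]]
```

# Under these assumptions "slicefour" selects four evenly spaced layers, and the fixed list
# selects five, matching the 4x and 5x multipliers in `hidden_size`. Other values of
# `select_layer` may not yield exactly four indices.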
class CLIPVisionTowerS2(CLIPVisionTower):
def __init__(self, vision_tower, args, delay_load=False):
self.s2_scales = getattr(args, "s2_scales", "336,672,1008")
self.s2_scales = list(map(int, self.s2_scales.split(",")))
self.s2_scales.sort()
self.s2_split_size = self.s2_scales[0]
self.s2_image_size = self.s2_scales[-1]
super().__init__(vision_tower, args, delay_load)
# change resize/crop size in preprocessing to the largest image size in s2_scale
if not delay_load or getattr(args, "unfreeze_mm_vision_tower", False):
self.image_processor.size["shortest_edge"] = self.s2_image_size
self.image_processor.crop_size["height"] = self.image_processor.crop_size["width"] = self.s2_image_size
def load_model(self, device_map=None):
if self.is_loaded:
rank0_print("{} is already loaded, `load_model` called again, skipping.".format(self.vision_tower_name))
return
self.image_processor = CLIPImageProcessor.from_pretrained(self.vision_tower_name)
self.vision_tower = CLIPVisionModel.from_pretrained(self.vision_tower_name, device_map=device_map)
self.vision_tower.requires_grad_(False)
self.image_processor.size["shortest_edge"] = self.s2_image_size
self.image_processor.crop_size["height"] = self.image_processor.crop_size["width"] = self.s2_image_size
self.is_loaded = True
@torch.no_grad()
def forward_feature(self, images):
image_forward_outs = self.vision_tower(images.to(device=self.device, dtype=self.dtype), output_hidden_states=True)
image_features = self.feature_select(image_forward_outs).to(images.dtype)
return image_features
@torch.no_grad()
def forward(self, images):
if type(images) is list:
image_features = []
for image in images:
image_feature = multiscale_forward(self.forward_feature, image.unsqueeze(0), img_sizes=self.s2_scales, max_split_size=self.s2_split_size, split_forward=True)
image_features.append(image_feature)
else:
image_features = multiscale_forward(self.forward_feature, images, img_sizes=self.s2_scales, max_split_size=self.s2_split_size, split_forward=True)
return image_features
@property
def hidden_size(self):
return self.config.hidden_size * len(self.s2_scales)
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_projector/builder.py
================================================
import torch
import torch.nn as nn
import re
from .pooler_projector import PoolerProjector
class IdentityMap(nn.Module):
def __init__(self):
super().__init__()
def forward(self, x, *args, **kwargs):
return x
@property
def config(self):
return {"mm_projector_type": "identity"}
class SimpleResBlock(nn.Module):
def __init__(self, channels):
super().__init__()
self.pre_norm = nn.LayerNorm(channels)
self.proj = nn.Sequential(nn.Linear(channels, channels), nn.GELU(), nn.Linear(channels, channels))
def forward(self, x):
x = self.pre_norm(x)
return x + self.proj(x)
def build_vision_projector(config, delay_load=False, **kwargs):
projector_type = getattr(config, "mm_projector_type", "linear")
if projector_type == "linear":
return nn.Linear(config.mm_hidden_size, config.hidden_size)
if projector_type == "pooler":
return PoolerProjector(config, kwargs["vision_cfg"])
mlp_gelu_match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
if mlp_gelu_match:
mlp_depth = int(mlp_gelu_match.group(1))
modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
for _ in range(1, mlp_depth):
modules.append(nn.GELU())
modules.append(nn.Linear(config.hidden_size, config.hidden_size))
return nn.Sequential(*modules)
mlp_gelu_resnet_match = re.match(r"^mlp(\d+)x_res(\d+)x_gelu$", projector_type)
if mlp_gelu_resnet_match:
mlp_depth = int(mlp_gelu_resnet_match.group(1))
res_depth = int(mlp_gelu_resnet_match.group(2))
modules = [nn.Linear(config.mm_hidden_size, config.hidden_size)]
for _ in range(1, mlp_depth):
modules.append(nn.GELU())
modules.append(nn.Linear(config.hidden_size, config.hidden_size))
for _ in range(res_depth):
modules.append(SimpleResBlock(config.hidden_size))
return nn.Sequential(*modules)
if projector_type == "identity":
return IdentityMap()
raise ValueError(f"Unknown projector type: {projector_type}")
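# Illustrative sketch (not part of the original module): how a projector type
# string such as "mlp2x_gelu" maps to a layer layout in build_vision_projector.
# The helper name `describe_projector` and the sizes 1024/4096 are assumptions
# chosen for the example; it returns plain strings instead of torch modules so
# it runs with the standard library only, and it covers only the "linear",
# "identity" and "mlpNx_gelu" cases.

```python
import re


def describe_projector(projector_type, mm_hidden_size, hidden_size):
    """Hypothetical helper: list the layers a projector type string implies."""
    if projector_type == "linear":
        return [f"Linear({mm_hidden_size}->{hidden_size})"]
    if projector_type == "identity":
        return ["Identity"]
    match = re.match(r"^mlp(\d+)x_gelu$", projector_type)
    if match:
        depth = int(match.group(1))
        # The first layer maps vision features into the language-model width.
        layers = [f"Linear({mm_hidden_size}->{hidden_size})"]
        # Each extra level of depth adds a GELU followed by a square Linear.
        for _ in range(1, depth):
            layers.append("GELU")
            layers.append(f"Linear({hidden_size}->{hidden_size})")
        return layers
    raise ValueError(f"Unknown projector type: {projector_type}")


layout = describe_projector("mlp2x_gelu", 1024, 4096)
print(layout)
```

# Under these assumptions, "mlpNx_gelu" yields N Linear layers with N - 1 GELU
# activations between them, so the layout has 2N - 1 entries.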
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_projector/pooler_projector.py
================================================
import torch
import torch.nn as nn
import math
from transformers.models.clip.modeling_clip import CLIPVisionModel
class PoolerProjector(nn.Module):
def __init__(self, config, vision_cfg):
super().__init__()
self._config = config
self.hw = vision_cfg.image_size // vision_cfg.patch_size
self.conv_pool = nn.Conv2d(config.mm_hidden_size, config.hidden_size, kernel_size=2, stride=2)
self.proj = nn.Sequential(
nn.GELU(),
nn.Linear(config.hidden_size, config.hidden_size),
)
def forward(self, x, *args, **kwargs):
height = width = self.hw
assert height * width == x.shape[1]
x = x.view(x.shape[0], height, width, -1).permute(0, 3, 1, 2)
x = self.conv_pool(x)
x = x.flatten(2).transpose(1, 2)
x = self.proj(x)
return x
@property
def config(self):
return {"mm_projector_type": "pooler"}
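# Illustrative sketch (not part of the original module): the token-count
# arithmetic behind PoolerProjector. The helper name `pooled_token_count` and
# the 336-pixel image / 14-pixel patch values are assumptions for the example.
# It applies the standard output-size formula for an unpadded convolution to
# the kernel_size=2, stride=2 pooling convolution defined above.

```python
def pooled_token_count(image_size, patch_size, kernel_size=2, stride=2):
    """Hypothetical helper: tokens before and after the strided conv pooling."""
    hw = image_size // patch_size  # patches per side, as in self.hw
    out_hw = (hw - kernel_size) // stride + 1  # unpadded convolution output size
    return hw * hw, out_hw * out_hw


tokens_in, tokens_out = pooled_token_count(336, 14)
print(tokens_in, tokens_out)
```

# With these assumed sizes, a 24 x 24 patch grid (576 tokens) is pooled to a
# 12 x 12 grid (144 tokens), a 4x reduction in sequence length.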
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_resampler/builder.py
================================================
import torch
from .masked_drop import MaskedDrop
from .spatial_pool import SpatialPool
from .perceiver import PerceiverResampler
from .qformer import Qformer
class IdentityMap(torch.nn.Module):
def __init__(self):
super().__init__()
def forward(self, x, *args, **kwargs):
return x
@property
def config(self):
return {"mm_resampler_type": None}
def build_vision_resampler(model_args, delay_load=False, **kwargs):
resampler_type = getattr(model_args, "mm_resampler_type", None)
if resampler_type == "masked_drop":
return MaskedDrop(model_args)
elif resampler_type == "spatial_pool":
return SpatialPool(model_args, **kwargs)
elif resampler_type == "perceiver":
return PerceiverResampler(model_args, **kwargs)
elif resampler_type == "qformer":
return Qformer(model_args, **kwargs)
elif resampler_type is None:
return IdentityMap()
raise ValueError(f"Unknown resampler type: {resampler_type}")
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_resampler/masked_drop.py
================================================
import torch
import torch.nn as nn
import random
class MaskedDrop(nn.Module):
def __init__(self, model_args):
super().__init__()
self.mode = model_args.mm_mask_drop_mode
self.skip_percentage = model_args.mm_mask_drop_skip_percentage
self.ratio = model_args.mm_mask_drop_ratio
self.ratio_upper = model_args.mm_mask_drop_ratio_upper
self.ratio_lower = model_args.mm_mask_drop_ratio_lower
def forward(self, image_features, *args, **kwargs):
if not self.training:
return image_features
if self.skip_percentage > random.random():
return image_features
masked_features = []
for image_feature in image_features:
num_tokens = image_feature.shape[0]
if self.mode == "fixed":
num_keep = int(num_tokens * self.ratio)
masked_features.append(self.random_masking(image_feature.unsqueeze(0), num_keep)[0][0])
elif self.mode == "range":
num_keep = int(num_tokens * random.uniform(self.ratio_lower, self.ratio_upper))
masked_features.append(self.random_masking(image_feature.unsqueeze(0), num_keep)[0])
elif self.mode == "cls_only":
masked_features.append(image_feature[0:1])
else:
raise ValueError(f"Unexpected masked drop mode: {self.mode}")
if self.mode not in ["range"] and (type(image_features) is not list or self.mode in ["cls_only"]):
masked_features = torch.stack(masked_features, dim=0)
return masked_features
@property
def config(self):
return {
"mm_resampler_type": "masked_drop",
"mm_mask_drop_mode": self.mode,
"mm_mask_drop_skip_percentage": self.skip_percentage,
"mm_mask_drop_ratio": self.ratio,
"mm_mask_drop_ratio_upper": self.ratio_upper,
"mm_mask_drop_ratio_lower": self.ratio_lower,
}
def random_masking(self, x, len_keep):
"""
Perform per-sample random masking by per-sample shuffling.
Per-sample shuffling is done by argsort random noise.
x: [N, L, D], sequence
"""
N, L, D = x.shape # batch, length, dim
noise = torch.rand(N, L, device=x.device) # noise in [0, 1]
# sort noise for each sample
ids_shuffle = torch.argsort(noise, dim=1) # ascend: small is keep, large is remove
ids_restore = torch.argsort(ids_shuffle, dim=1)
# keep the first subset
ids_keep = ids_shuffle[:, :len_keep]
x_masked = torch.gather(x, dim=1, index=ids_keep.unsqueeze(-1).repeat(1, 1, D))
# generate the binary mask: 0 is keep, 1 is remove
mask = torch.ones([N, L], device=x.device)
mask[:, :len_keep] = 0
# unshuffle to get the binary mask
mask = torch.gather(mask, dim=1, index=ids_restore)
return x_masked, mask, ids_restore
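# Illustrative sketch (not part of the original module): the argsort-based
# selection used by random_masking, written with the standard library only.
# The helper name `random_keep_indices` and the seeded `random.Random` instance
# are assumptions for the example; the original operates on batched torch
# tensors instead of Python lists.

```python
import random


def random_keep_indices(length, len_keep, rng):
    """Hypothetical helper: choose len_keep positions by sorting random noise."""
    noise = [rng.random() for _ in range(length)]
    # argsort in ascending order: positions with the smallest noise are kept
    ids_shuffle = sorted(range(length), key=lambda i: noise[i])
    ids_keep = ids_shuffle[:len_keep]
    keep_set = set(ids_keep)
    # binary mask in the original token order: 0 is keep, 1 is remove
    mask = [0 if i in keep_set else 1 for i in range(length)]
    return ids_keep, mask


rng = random.Random(0)
ids_keep, mask = random_keep_indices(10, 4, rng)
print(ids_keep, mask)
```

# Sorting independent uniform noise gives a uniformly random permutation, so
# taking the first len_keep entries samples positions without replacement.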
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_resampler/perceiver.py
================================================
"""
Taken from https://github.com/lucidrains/flamingo-pytorch
"""
import torch
from einops import rearrange, repeat
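try:
    from einops_exts import rearrange_many
except ImportError:
    # Fallback so PerceiverAttention still works when einops_exts is not installed.
    def rearrange_many(tensors, pattern, **kwargs):
        return (rearrange(t, pattern, **kwargs) for t in tensors)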
from torch import einsum, nn
def exists(val):
return val is not None
def FeedForward(dim, mult=4):
inner_dim = int(dim * mult)
return nn.Sequential(
nn.LayerNorm(dim),
nn.Linear(dim, inner_dim, bias=False),
nn.GELU(),
nn.Linear(inner_dim, dim, bias=False),
)
class PerceiverAttention(nn.Module):
def __init__(self, *, dim, dim_head=64, heads=8):
super().__init__()
self.scale = dim_head**-0.5
self.heads = heads
inner_dim = dim_head * heads
self.norm_media = nn.LayerNorm(dim)
self.norm_latents = nn.LayerNorm(dim)
self.to_q = nn.Linear(dim, inner_dim, bias=False)
self.to_kv = nn.Linear(dim, inner_dim * 2, bias=False)
self.to_out = nn.Linear(inner_dim, dim, bias=False)
def forward(self, x, latents):
"""
Args:
x (torch.Tensor): image features
shape (b, T, n1, D)
            latents (torch.Tensor): latent features
shape (b, T, n2, D)
"""
x = self.norm_media(x)
latents = self.norm_latents(latents)
h = self.heads
q = self.to_q(latents)
kv_input = torch.cat((x, latents), dim=-2)
k, v = self.to_kv(kv_input).chunk(2, dim=-1)
q, k, v = rearrange_many((q, k, v), "b t n (h d) -> b h t n d", h=h)
q = q * self.scale
# attention
sim = einsum("... i d, ... j d -> ... i j", q, k)
sim = sim - sim.amax(dim=-1, keepdim=True).detach()
attn = sim.softmax(dim=-1)
out = einsum("... i j, ... j d -> ... i d", attn, v)
out = rearrange(out, "b h t n d -> b t n (h d)", h=h)
return self.to_out(out)
class PerceiverResamplerModule(nn.Module):
def __init__(
self,
*,
dim,
depth=6,
dim_head=64,
heads=8,
num_latents=64,
max_num_media=None,
max_num_frames=None,
ff_mult=4,
):
super().__init__()
self.latents = nn.Parameter(torch.randn(num_latents, dim))
self.frame_embs = nn.Parameter(torch.randn(max_num_frames, dim)) if exists(max_num_frames) else None
self.media_time_embs = nn.Parameter(torch.randn(max_num_media, 1, dim)) if exists(max_num_media) else None
self.layers = nn.ModuleList([])
for _ in range(depth):
self.layers.append(
nn.ModuleList(
[
PerceiverAttention(dim=dim, dim_head=dim_head, heads=heads),
FeedForward(dim=dim, mult=ff_mult) if ff_mult > 0 else nn.Identity(),
]
)
)
self.norm = nn.LayerNorm(dim)
def forward(self, x):
"""
Args:
x (torch.Tensor): image features
shape (b, T, F, v, D)
Returns:
            shape (b, T, n, D) where n is the number of latents (num_latents)
"""
b, T, F, v = x.shape[:4]
# frame and media time embeddings
if exists(self.frame_embs):
frame_embs = repeat(self.frame_embs[:F], "F d -> b T F v d", b=b, T=T, v=v)
x = x + frame_embs
x = rearrange(x, "b T F v d -> b T (F v) d") # flatten the frame and spatial dimensions
if exists(self.media_time_embs):
x = x + self.media_time_embs[:T]
# blocks
latents = repeat(self.latents, "n d -> b T n d", b=b, T=T)
for attn, ff in self.layers:
latents = attn(x, latents) + latents
latents = ff(latents) + latents
return self.norm(latents)
class PerceiverResampler(nn.Module):
def __init__(self, model_args, vision_tower):
super().__init__()
self.depth = model_args.mm_perceiver_depth
self.num_latents = model_args.mm_perceiver_latents
self.ff_mult = model_args.mm_perceiver_ff_mult
self.pretrained = model_args.mm_perceiver_pretrained
self.perceiver = PerceiverResamplerModule(dim=vision_tower.hidden_size, depth=self.depth, num_latents=self.num_latents, ff_mult=self.ff_mult)
if self.pretrained is not None:
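            # Load onto CPU first so the checkpoint does not require the device it was saved from.
            self.load_state_dict(torch.load(self.pretrained, map_location="cpu"))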
def forward(self, image_features, *args, **kwargs):
return self.perceiver(image_features[:, None, None]).squeeze(1)
@property
def config(self):
return {
"mm_resampler_type": "perceiver",
"mm_perceiver_depth": self.depth,
"mm_perceiver_latents": self.num_latents,
"mm_perceiver_ff_mult": self.ff_mult,
"mm_perceiver_pretrained": self.pretrained,
}
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_resampler/qformer.py
================================================
"""
* Copyright (c) 2023, salesforce.com, inc.
* All rights reserved.
* SPDX-License-Identifier: BSD-3-Clause
* For full license text, see LICENSE.txt file in the repo root or https://opensource.org/licenses/BSD-3-Clause
* By Junnan Li
* Based on huggingface code base
* https://github.com/huggingface/transformers/blob/v4.15.0/src/transformers/models/bert
"""
import math
import os
import warnings
from dataclasses import dataclass
from typing import Optional, Tuple, Dict, Any
import torch
from torch import Tensor, device, dtype, nn
import torch.utils.checkpoint
from torch import nn
from torch.nn import CrossEntropyLoss
import torch.nn.functional as F
from transformers.activations import ACT2FN
from transformers.file_utils import (
ModelOutput,
)
from transformers.modeling_outputs import (
BaseModelOutputWithPastAndCrossAttentions,
BaseModelOutputWithPoolingAndCrossAttentions,
CausalLMOutputWithCrossAttentions,
MaskedLMOutput,
MultipleChoiceModelOutput,
NextSentencePredictorOutput,
QuestionAnsweringModelOutput,
SequenceClassifierOutput,
TokenClassifierOutput,
)
from transformers.modeling_utils import (
PreTrainedModel,
apply_chunking_to_forward,
find_pruneable_heads_and_indices,
prune_linear_layer,
)
from transformers.utils import logging
from transformers.models.bert.configuration_bert import BertConfig
logger = logging.get_logger(__name__)
def disabled_train(self, mode=True):
"""Overwrite model.train with this function to make sure train/eval mode
does not change anymore."""
return self
class BertEmbeddings(nn.Module):
"""Construct the embeddings from word and position embeddings."""
def __init__(self, config):
super().__init__()
self.word_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, padding_idx=config.pad_token_id)
self.position_embeddings = nn.Embedding(config.max_position_embeddings, config.hidden_size)
# self.LayerNorm is not snake-cased to stick with TensorFlow model variable name and be able to load
# any TensorFlow checkpoint file
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
# position_ids (1, len position emb) is contiguous in memory and exported when serialized
self.register_buffer("position_ids", torch.arange(config.max_position_embeddings).expand((1, -1)))
self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
self.config = config
def forward(
self,
input_ids=None,
position_ids=None,
query_embeds=None,
past_key_values_length=0,
):
if input_ids is not None:
seq_length = input_ids.size()[1]
else:
seq_length = 0
if position_ids is None:
position_ids = self.position_ids[:, past_key_values_length : seq_length + past_key_values_length].clone()
if input_ids is not None:
embeddings = self.word_embeddings(input_ids)
if self.position_embedding_type == "absolute":
position_embeddings = self.position_embeddings(position_ids)
embeddings = embeddings + position_embeddings
if query_embeds is not None:
embeddings = torch.cat((query_embeds, embeddings), dim=1)
else:
embeddings = query_embeds
embeddings = self.LayerNorm(embeddings)
embeddings = self.dropout(embeddings)
return embeddings
class BertSelfAttention(nn.Module):
def __init__(self, config, is_cross_attention):
super().__init__()
self.config = config
if config.hidden_size % config.num_attention_heads != 0 and not hasattr(config, "embedding_size"):
raise ValueError("The hidden size (%d) is not a multiple of the number of attention " "heads (%d)" % (config.hidden_size, config.num_attention_heads))
self.num_attention_heads = config.num_attention_heads
self.attention_head_size = int(config.hidden_size / config.num_attention_heads)
self.all_head_size = self.num_attention_heads * self.attention_head_size
self.query = nn.Linear(config.hidden_size, self.all_head_size)
if is_cross_attention:
self.key = nn.Linear(config.encoder_width, self.all_head_size)
self.value = nn.Linear(config.encoder_width, self.all_head_size)
else:
self.key = nn.Linear(config.hidden_size, self.all_head_size)
self.value = nn.Linear(config.hidden_size, self.all_head_size)
self.dropout = nn.Dropout(config.attention_probs_dropout_prob)
self.position_embedding_type = getattr(config, "position_embedding_type", "absolute")
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
self.max_position_embeddings = config.max_position_embeddings
self.distance_embedding = nn.Embedding(2 * config.max_position_embeddings - 1, self.attention_head_size)
self.save_attention = False
def save_attn_gradients(self, attn_gradients):
self.attn_gradients = attn_gradients
def get_attn_gradients(self):
return self.attn_gradients
def save_attention_map(self, attention_map):
self.attention_map = attention_map
def get_attention_map(self):
return self.attention_map
def transpose_for_scores(self, x):
new_x_shape = x.size()[:-1] + (
self.num_attention_heads,
self.attention_head_size,
)
x = x.view(*new_x_shape)
return x.permute(0, 2, 1, 3)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
):
# If this is instantiated as a cross-attention module, the keys
# and values come from an encoder; the attention mask needs to be
# such that the encoder's padding tokens are not attended to.
is_cross_attention = encoder_hidden_states is not None
if is_cross_attention:
key_layer = self.transpose_for_scores(self.key(encoder_hidden_states))
value_layer = self.transpose_for_scores(self.value(encoder_hidden_states))
attention_mask = encoder_attention_mask
elif past_key_value is not None:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
key_layer = torch.cat([past_key_value[0], key_layer], dim=2)
value_layer = torch.cat([past_key_value[1], value_layer], dim=2)
else:
key_layer = self.transpose_for_scores(self.key(hidden_states))
value_layer = self.transpose_for_scores(self.value(hidden_states))
mixed_query_layer = self.query(hidden_states)
query_layer = self.transpose_for_scores(mixed_query_layer)
past_key_value = (key_layer, value_layer)
# Take the dot product between "query" and "key" to get the raw attention scores.
attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2))
if self.position_embedding_type == "relative_key" or self.position_embedding_type == "relative_key_query":
seq_length = hidden_states.size()[1]
position_ids_l = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(-1, 1)
position_ids_r = torch.arange(seq_length, dtype=torch.long, device=hidden_states.device).view(1, -1)
distance = position_ids_l - position_ids_r
positional_embedding = self.distance_embedding(distance + self.max_position_embeddings - 1)
positional_embedding = positional_embedding.to(dtype=query_layer.dtype) # fp16 compatibility
if self.position_embedding_type == "relative_key":
relative_position_scores = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores
elif self.position_embedding_type == "relative_key_query":
relative_position_scores_query = torch.einsum("bhld,lrd->bhlr", query_layer, positional_embedding)
relative_position_scores_key = torch.einsum("bhrd,lrd->bhlr", key_layer, positional_embedding)
attention_scores = attention_scores + relative_position_scores_query + relative_position_scores_key
attention_scores = attention_scores / math.sqrt(self.attention_head_size)
if attention_mask is not None:
            # Apply the attention mask (precomputed for all layers in BertModel's forward() function)
attention_scores = attention_scores + attention_mask
# Normalize the attention scores to probabilities.
attention_probs = nn.Softmax(dim=-1)(attention_scores)
if is_cross_attention and self.save_attention:
self.save_attention_map(attention_probs)
attention_probs.register_hook(self.save_attn_gradients)
# This is actually dropping out entire tokens to attend to, which might
# seem a bit unusual, but is taken from the original Transformer paper.
attention_probs_dropped = self.dropout(attention_probs)
# Mask heads if we want to
if head_mask is not None:
attention_probs_dropped = attention_probs_dropped * head_mask
context_layer = torch.matmul(attention_probs_dropped, value_layer)
context_layer = context_layer.permute(0, 2, 1, 3).contiguous()
new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,)
context_layer = context_layer.view(*new_context_layer_shape)
outputs = (context_layer, attention_probs) if output_attentions else (context_layer,)
outputs = outputs + (past_key_value,)
return outputs
class BertSelfOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
class BertAttention(nn.Module):
def __init__(self, config, is_cross_attention=False):
super().__init__()
self.self = BertSelfAttention(config, is_cross_attention)
self.output = BertSelfOutput(config)
self.pruned_heads = set()
def prune_heads(self, heads):
if len(heads) == 0:
return
heads, index = find_pruneable_heads_and_indices(
heads,
self.self.num_attention_heads,
self.self.attention_head_size,
self.pruned_heads,
)
# Prune linear layers
self.self.query = prune_linear_layer(self.self.query, index)
self.self.key = prune_linear_layer(self.self.key, index)
self.self.value = prune_linear_layer(self.self.value, index)
self.output.dense = prune_linear_layer(self.output.dense, index, dim=1)
# Update hyper params and store pruned heads
self.self.num_attention_heads = self.self.num_attention_heads - len(heads)
self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads
self.pruned_heads = self.pruned_heads.union(heads)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
):
self_outputs = self.self(
hidden_states,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
)
attention_output = self.output(self_outputs[0], hidden_states)
outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them
return outputs
class BertIntermediate(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.intermediate_size)
if isinstance(config.hidden_act, str):
self.intermediate_act_fn = ACT2FN[config.hidden_act]
else:
self.intermediate_act_fn = config.hidden_act
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.intermediate_act_fn(hidden_states)
return hidden_states
class BertOutput(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.intermediate_size, config.hidden_size)
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.dropout = nn.Dropout(config.hidden_dropout_prob)
def forward(self, hidden_states, input_tensor):
hidden_states = self.dense(hidden_states)
hidden_states = self.dropout(hidden_states)
hidden_states = self.LayerNorm(hidden_states + input_tensor)
return hidden_states
class BertLayer(nn.Module):
def __init__(self, config, layer_num):
super().__init__()
self.config = config
self.chunk_size_feed_forward = config.chunk_size_feed_forward
self.seq_len_dim = 1
self.attention = BertAttention(config)
self.layer_num = layer_num
if self.config.add_cross_attention and layer_num % self.config.cross_attention_freq == 0:
self.crossattention = BertAttention(config, is_cross_attention=self.config.add_cross_attention)
self.has_cross_attention = True
else:
self.has_cross_attention = False
self.intermediate = BertIntermediate(config)
self.output = BertOutput(config)
self.intermediate_query = BertIntermediate(config)
self.output_query = BertOutput(config)
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_value=None,
output_attentions=False,
query_length=0,
):
# decoder uni-directional self-attention cached key/values tuple is at positions 1,2
self_attn_past_key_value = past_key_value[:2] if past_key_value is not None else None
self_attention_outputs = self.attention(
hidden_states,
attention_mask,
head_mask,
output_attentions=output_attentions,
past_key_value=self_attn_past_key_value,
)
attention_output = self_attention_outputs[0]
outputs = self_attention_outputs[1:-1]
present_key_value = self_attention_outputs[-1]
if query_length > 0:
query_attention_output = attention_output[:, :query_length, :]
if self.has_cross_attention:
assert encoder_hidden_states is not None, "encoder_hidden_states must be given for cross-attention layers"
cross_attention_outputs = self.crossattention(
query_attention_output,
attention_mask,
head_mask,
encoder_hidden_states,
encoder_attention_mask,
output_attentions=output_attentions,
)
query_attention_output = cross_attention_outputs[0]
outputs = outputs + cross_attention_outputs[1:-1] # add cross attentions if we output attention weights
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk_query,
self.chunk_size_feed_forward,
self.seq_len_dim,
query_attention_output,
)
if attention_output.shape[1] > query_length:
layer_output_text = apply_chunking_to_forward(
self.feed_forward_chunk,
self.chunk_size_feed_forward,
self.seq_len_dim,
attention_output[:, query_length:, :],
)
layer_output = torch.cat([layer_output, layer_output_text], dim=1)
else:
layer_output = apply_chunking_to_forward(
self.feed_forward_chunk,
self.chunk_size_feed_forward,
self.seq_len_dim,
attention_output,
)
outputs = (layer_output,) + outputs
outputs = outputs + (present_key_value,)
return outputs
def feed_forward_chunk(self, attention_output):
intermediate_output = self.intermediate(attention_output)
layer_output = self.output(intermediate_output, attention_output)
return layer_output
def feed_forward_chunk_query(self, attention_output):
intermediate_output = self.intermediate_query(attention_output)
layer_output = self.output_query(intermediate_output, attention_output)
return layer_output
class BertEncoder(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.layer = nn.ModuleList([BertLayer(config, i) for i in range(config.num_hidden_layers)])
def forward(
self,
hidden_states,
attention_mask=None,
head_mask=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_values=None,
use_cache=None,
output_attentions=False,
output_hidden_states=False,
return_dict=True,
query_length=0,
):
all_hidden_states = () if output_hidden_states else None
all_self_attentions = () if output_attentions else None
all_cross_attentions = () if output_attentions and self.config.add_cross_attention else None
next_decoder_cache = () if use_cache else None
for i in range(self.config.num_hidden_layers):
layer_module = self.layer[i]
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
layer_head_mask = head_mask[i] if head_mask is not None else None
past_key_value = past_key_values[i] if past_key_values is not None else None
if getattr(self.config, "gradient_checkpointing", False) and self.training:
if use_cache:
                    logger.warning("`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...")
use_cache = False
def create_custom_forward(module):
def custom_forward(*inputs):
return module(*inputs, past_key_value, output_attentions, query_length)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(layer_module),
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
)
else:
layer_outputs = layer_module(
hidden_states,
attention_mask,
layer_head_mask,
encoder_hidden_states,
encoder_attention_mask,
past_key_value,
output_attentions,
query_length,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache += (layer_outputs[-1],)
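            if output_attentions:
                all_self_attentions = all_self_attentions + (layer_outputs[1],)
                # Only layers that actually ran cross-attention return cross-attention weights at index 2.
                if all_cross_attentions is not None and layer_module.has_cross_attention and query_length > 0:
                    all_cross_attentions = all_cross_attentions + (layer_outputs[2],)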
if output_hidden_states:
all_hidden_states = all_hidden_states + (hidden_states,)
if not return_dict:
return tuple(
v
for v in [
hidden_states,
next_decoder_cache,
all_hidden_states,
all_self_attentions,
all_cross_attentions,
]
if v is not None
)
return BaseModelOutputWithPastAndCrossAttentions(
last_hidden_state=hidden_states,
past_key_values=next_decoder_cache,
hidden_states=all_hidden_states,
attentions=all_self_attentions,
cross_attentions=all_cross_attentions,
)
class BertPooler(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
self.activation = nn.Tanh()
def forward(self, hidden_states):
# We "pool" the model by simply taking the hidden state corresponding
# to the first token.
first_token_tensor = hidden_states[:, 0]
pooled_output = self.dense(first_token_tensor)
pooled_output = self.activation(pooled_output)
return pooled_output
class BertPredictionHeadTransform(nn.Module):
def __init__(self, config):
super().__init__()
self.dense = nn.Linear(config.hidden_size, config.hidden_size)
if isinstance(config.hidden_act, str):
self.transform_act_fn = ACT2FN[config.hidden_act]
else:
self.transform_act_fn = config.hidden_act
self.LayerNorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
def forward(self, hidden_states):
hidden_states = self.dense(hidden_states)
hidden_states = self.transform_act_fn(hidden_states)
hidden_states = self.LayerNorm(hidden_states)
return hidden_states
class BertLMPredictionHead(nn.Module):
def __init__(self, config):
super().__init__()
self.transform = BertPredictionHeadTransform(config)
# The output weights are the same as the input embeddings, but there is
# an output-only bias for each token.
self.decoder = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
self.bias = nn.Parameter(torch.zeros(config.vocab_size))
# Need a link between the two variables so that the bias is correctly resized with `resize_token_embeddings`
self.decoder.bias = self.bias
def forward(self, hidden_states):
hidden_states = self.transform(hidden_states)
hidden_states = self.decoder(hidden_states)
return hidden_states
class BertOnlyMLMHead(nn.Module):
def __init__(self, config):
super().__init__()
self.predictions = BertLMPredictionHead(config)
def forward(self, sequence_output):
prediction_scores = self.predictions(sequence_output)
return prediction_scores
class BertPreTrainedModel(PreTrainedModel):
"""
An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained
models.
"""
config_class = BertConfig
base_model_prefix = "bert"
_keys_to_ignore_on_load_missing = [r"position_ids"]
def _init_weights(self, module):
"""Initialize the weights"""
if isinstance(module, (nn.Linear, nn.Embedding)):
# Slightly different from the TF version which uses truncated_normal for initialization
# cf https://github.com/pytorch/pytorch/pull/5617
module.weight.data.normal_(mean=0.0, std=self.config.initializer_range)
elif isinstance(module, nn.LayerNorm):
module.bias.data.zero_()
module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
module.bias.data.zero_()
class BertModel(BertPreTrainedModel):
"""
    The model can behave as an encoder (with only self-attention) as well as a decoder, in which case a layer of
    cross-attention is added between the self-attention layers, following the architecture described in `Attention is
    all you need <https://arxiv.org/abs/1706.03762>`__ by Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit,
    Llion Jones, Aidan N. Gomez, Lukasz Kaiser and Illia Polosukhin.

    When the configuration sets :obj:`add_cross_attention` to :obj:`True`, an :obj:`encoder_hidden_states` tensor is
    expected as an input to the forward pass.
"""
def __init__(self, config, add_pooling_layer=False):
super().__init__(config)
self.config = config
self.embeddings = BertEmbeddings(config)
self.encoder = BertEncoder(config)
self.pooler = BertPooler(config) if add_pooling_layer else None
self.init_weights()
def get_input_embeddings(self):
return self.embeddings.word_embeddings
def set_input_embeddings(self, value):
self.embeddings.word_embeddings = value
def _prune_heads(self, heads_to_prune):
"""
Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base
class PreTrainedModel
"""
for layer, heads in heads_to_prune.items():
self.encoder.layer[layer].attention.prune_heads(heads)
def get_extended_attention_mask(
self,
attention_mask: Tensor,
input_shape: Tuple[int],
device: device,
is_decoder: bool,
has_query: bool = False,
) -> Tensor:
"""
Makes broadcastable attention and causal masks so that future and masked tokens are ignored.
Arguments:
attention_mask (:obj:`torch.Tensor`):
Mask with ones indicating tokens to attend to, zeros for tokens to ignore.
input_shape (:obj:`Tuple[int]`):
The shape of the input to the model.
device: (:obj:`torch.device`):
The device of the input to the model.
Returns:
:obj:`torch.Tensor` The extended attention mask, with the same dtype as :obj:`attention_mask.dtype`.
"""
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
if attention_mask.dim() == 3:
extended_attention_mask = attention_mask[:, None, :, :]
elif attention_mask.dim() == 2:
# Provided a padding mask of dimensions [batch_size, seq_length]
# - if the model is a decoder, apply a causal mask in addition to the padding mask
# - if the model is an encoder, make the mask broadcastable to [batch_size, num_heads, seq_length, seq_length]
if is_decoder:
batch_size, seq_length = input_shape
seq_ids = torch.arange(seq_length, device=device)
causal_mask = seq_ids[None, None, :].repeat(batch_size, seq_length, 1) <= seq_ids[None, :, None]
# add a prefix ones mask to the causal mask
# causal and attention masks must have same type with pytorch version < 1.3
causal_mask = causal_mask.to(attention_mask.dtype)
if causal_mask.shape[1] < attention_mask.shape[1]:
prefix_seq_len = attention_mask.shape[1] - causal_mask.shape[1]
if has_query: # UniLM style attention mask
causal_mask = torch.cat(
[
torch.zeros(
(batch_size, prefix_seq_len, seq_length),
device=device,
dtype=causal_mask.dtype,
),
causal_mask,
],
axis=1,
)
causal_mask = torch.cat(
[
torch.ones(
(batch_size, causal_mask.shape[1], prefix_seq_len),
device=device,
dtype=causal_mask.dtype,
),
causal_mask,
],
axis=-1,
)
extended_attention_mask = causal_mask[:, None, :, :] * attention_mask[:, None, None, :]
else:
extended_attention_mask = attention_mask[:, None, None, :]
else:
raise ValueError("Wrong shape for input_ids (shape {}) or attention_mask (shape {})".format(input_shape, attention_mask.shape))
# Since attention_mask is 1.0 for positions we want to attend and 0.0 for
# masked positions, this operation will create a tensor which is 0.0 for
# positions we want to attend and -10000.0 for masked positions.
# Since we are adding it to the raw scores before the softmax, this is
# effectively the same as removing these entirely.
extended_attention_mask = extended_attention_mask.to(dtype=self.dtype) # fp16 compatibility
extended_attention_mask = (1.0 - extended_attention_mask) * -10000.0
return extended_attention_mask
def forward(
self,
input_ids=None,
attention_mask=None,
position_ids=None,
head_mask=None,
query_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
past_key_values=None,
use_cache=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
is_decoder=False,
):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# use_cache = use_cache if use_cache is not None else self.config.use_cache
if input_ids is None:
assert query_embeds is not None, "You have to specify query_embeds when input_ids is None"
# past_key_values_length
past_key_values_length = past_key_values[0][0].shape[2] - self.config.query_length if past_key_values is not None else 0
query_length = query_embeds.shape[1] if query_embeds is not None else 0
embedding_output = self.embeddings(
input_ids=input_ids,
position_ids=position_ids,
query_embeds=query_embeds,
past_key_values_length=past_key_values_length,
)
input_shape = embedding_output.size()[:-1]
batch_size, seq_length = input_shape
device = embedding_output.device
if attention_mask is None:
attention_mask = torch.ones(((batch_size, seq_length + past_key_values_length)), device=device)
# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
# ourselves in which case we just need to make it broadcastable to all heads.
if is_decoder:
extended_attention_mask = self.get_extended_attention_mask(
attention_mask,
input_ids.shape,
device,
is_decoder,
has_query=(query_embeds is not None),
)
else:
extended_attention_mask = self.get_extended_attention_mask(attention_mask, input_shape, device, is_decoder)
# If a 2D or 3D attention mask is provided for the cross-attention
# we need to make broadcastable to [batch_size, num_heads, seq_length, seq_length]
if encoder_hidden_states is not None:
if type(encoder_hidden_states) == list:
encoder_batch_size, encoder_sequence_length, _ = encoder_hidden_states[0].size()
else:
(
encoder_batch_size,
encoder_sequence_length,
_,
) = encoder_hidden_states.size()
encoder_hidden_shape = (encoder_batch_size, encoder_sequence_length)
if type(encoder_attention_mask) == list:
encoder_extended_attention_mask = [self.invert_attention_mask(mask) for mask in encoder_attention_mask]
elif encoder_attention_mask is None:
encoder_attention_mask = torch.ones(encoder_hidden_shape, device=device)
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
else:
encoder_extended_attention_mask = self.invert_attention_mask(encoder_attention_mask)
else:
encoder_extended_attention_mask = None
# Prepare head mask if needed
# 1.0 in head_mask indicate we keep the head
# attention_probs has shape bsz x n_heads x N x N
# input head_mask has shape [num_heads] or [num_hidden_layers x num_heads]
# and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length]
head_mask = self.get_head_mask(head_mask, self.config.num_hidden_layers)
encoder_outputs = self.encoder(
embedding_output,
attention_mask=extended_attention_mask,
head_mask=head_mask,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_extended_attention_mask,
past_key_values=past_key_values,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
query_length=query_length,
)
sequence_output = encoder_outputs[0]
pooled_output = self.pooler(sequence_output) if self.pooler is not None else None
if not return_dict:
return (sequence_output, pooled_output) + encoder_outputs[1:]
return BaseModelOutputWithPoolingAndCrossAttentions(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
past_key_values=encoder_outputs.past_key_values,
hidden_states=encoder_outputs.hidden_states,
attentions=encoder_outputs.attentions,
cross_attentions=encoder_outputs.cross_attentions,
)
class BertLMHeadModel(BertPreTrainedModel):
_keys_to_ignore_on_load_unexpected = [r"pooler"]
_keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
def __init__(self, config):
super().__init__(config)
self.bert = BertModel(config, add_pooling_layer=False)
self.cls = BertOnlyMLMHead(config)
self.init_weights()
def get_output_embeddings(self):
return self.cls.predictions.decoder
def set_output_embeddings(self, new_embeddings):
self.cls.predictions.decoder = new_embeddings
def forward(
self,
input_ids=None,
attention_mask=None,
position_ids=None,
head_mask=None,
query_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
labels=None,
past_key_values=None,
use_cache=True,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
return_logits=False,
is_decoder=True,
reduction="mean",
):
r"""
encoder_hidden_states (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length, hidden_size)`, `optional`):
Sequence of hidden-states at the output of the last layer of the encoder. Used in the cross-attention if
the model is configured as a decoder.
encoder_attention_mask (:obj:`torch.FloatTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Mask to avoid performing attention on the padding token indices of the encoder input. This mask is used in
the cross-attention if the model is configured as a decoder. Mask values selected in ``[0, 1]``:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the left-to-right language modeling loss (next word prediction). Indices should be in
``[-100, 0, ..., config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are
ignored (masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
past_key_values (:obj:`tuple(tuple(torch.FloatTensor))` of length :obj:`config.n_layers` with each tuple having 4 tensors of shape :obj:`(batch_size, num_heads, sequence_length - 1, embed_size_per_head)`):
Contains precomputed key and value hidden states of the attention blocks. Can be used to speed up decoding.
If :obj:`past_key_values` are used, the user can optionally input only the last :obj:`decoder_input_ids`
(those that don't have their past key value states given to this model) of shape :obj:`(batch_size, 1)`
instead of all :obj:`decoder_input_ids` of shape :obj:`(batch_size, sequence_length)`.
use_cache (:obj:`bool`, `optional`):
If set to :obj:`True`, :obj:`past_key_values` key value states are returned and can be used to speed up
decoding (see :obj:`past_key_values`).
Returns:
Example::
>>> from transformers import BertTokenizer, BertLMHeadModel, BertConfig
>>> import torch
>>> tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
>>> config = BertConfig.from_pretrained("bert-base-cased")
>>> model = BertLMHeadModel.from_pretrained('bert-base-cased', config=config)
>>> inputs = tokenizer("Hello, my dog is cute", return_tensors="pt")
>>> outputs = model(**inputs)
>>> prediction_logits = outputs.logits
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if labels is not None:
use_cache = False
if past_key_values is not None:
query_embeds = None
outputs = self.bert(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
head_mask=head_mask,
query_embeds=query_embeds,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
past_key_values=past_key_values,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
is_decoder=is_decoder,
)
sequence_output = outputs[0]
if query_embeds is not None:
sequence_output = outputs[0][:, query_embeds.shape[1] :, :]
prediction_scores = self.cls(sequence_output)
if return_logits:
return prediction_scores[:, :-1, :].contiguous()
lm_loss = None
if labels is not None:
# we are doing next-token prediction; shift prediction scores and input ids by one
shifted_prediction_scores = prediction_scores[:, :-1, :].contiguous()
labels = labels[:, 1:].contiguous()
loss_fct = CrossEntropyLoss(reduction=reduction, label_smoothing=0.1)
lm_loss = loss_fct(
shifted_prediction_scores.view(-1, self.config.vocab_size),
labels.view(-1),
)
if reduction == "none":
lm_loss = lm_loss.view(prediction_scores.size(0), -1).sum(1)
if not return_dict:
output = (prediction_scores,) + outputs[2:]
return ((lm_loss,) + output) if lm_loss is not None else output
return CausalLMOutputWithCrossAttentions(
loss=lm_loss,
logits=prediction_scores,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
cross_attentions=outputs.cross_attentions,
)
def prepare_inputs_for_generation(self, input_ids, query_embeds, past=None, attention_mask=None, **model_kwargs):
# if model is used as a decoder in encoder-decoder model, the decoder attention mask is created on the fly
if attention_mask is None:
attention_mask = input_ids.new_ones(input_ids.shape)
query_mask = input_ids.new_ones(query_embeds.shape[:-1])
attention_mask = torch.cat([query_mask, attention_mask], dim=-1)
# cut decoder_input_ids if past is used
if past is not None:
input_ids = input_ids[:, -1:]
return {
"input_ids": input_ids,
"query_embeds": query_embeds,
"attention_mask": attention_mask,
"past_key_values": past,
"encoder_hidden_states": model_kwargs.get("encoder_hidden_states", None),
"encoder_attention_mask": model_kwargs.get("encoder_attention_mask", None),
"is_decoder": True,
}
def _reorder_cache(self, past, beam_idx):
reordered_past = ()
for layer_past in past:
reordered_past += (tuple(past_state.index_select(0, beam_idx) for past_state in layer_past),)
return reordered_past
class BertForMaskedLM(BertPreTrainedModel):
_keys_to_ignore_on_load_unexpected = [r"pooler"]
_keys_to_ignore_on_load_missing = [r"position_ids", r"predictions.decoder.bias"]
def __init__(self, config):
super().__init__(config)
self.bert = BertModel(config, add_pooling_layer=False)
self.cls = BertOnlyMLMHead(config)
self.init_weights()
def get_output_embeddings(self):
return self.cls.predictions.decoder
def set_output_embeddings(self, new_embeddings):
self.cls.predictions.decoder = new_embeddings
def forward(
self,
input_ids=None,
attention_mask=None,
position_ids=None,
head_mask=None,
query_embeds=None,
encoder_hidden_states=None,
encoder_attention_mask=None,
labels=None,
output_attentions=None,
output_hidden_states=None,
return_dict=None,
return_logits=False,
is_decoder=False,
):
r"""
labels (:obj:`torch.LongTensor` of shape :obj:`(batch_size, sequence_length)`, `optional`):
Labels for computing the masked language modeling loss. Indices should be in ``[-100, 0, ...,
config.vocab_size]`` (see ``input_ids`` docstring) Tokens with indices set to ``-100`` are ignored
(masked), the loss is only computed for the tokens with labels in ``[0, ..., config.vocab_size]``
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.bert(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
head_mask=head_mask,
query_embeds=query_embeds,
encoder_hidden_states=encoder_hidden_states,
encoder_attention_mask=encoder_attention_mask,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
is_decoder=is_decoder,
)
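# Default to the full sequence so `sequence_output` is defined when no query embeddings are passed
sequence_output = outputs[0]
if query_embeds is not None: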
sequence_output = outputs[0][:, query_embeds.shape[1] :, :]
prediction_scores = self.cls(sequence_output)
if return_logits:
return prediction_scores
masked_lm_loss = None
if labels is not None:
loss_fct = CrossEntropyLoss() # -100 index = padding token
masked_lm_loss = loss_fct(prediction_scores.view(-1, self.config.vocab_size), labels.view(-1))
if not return_dict:
output = (prediction_scores,) + outputs[2:]
return ((masked_lm_loss,) + output) if masked_lm_loss is not None else output
return MaskedLMOutput(
loss=masked_lm_loss,
logits=prediction_scores,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
class Qformer(nn.Module):
def __init__(self, model_args, vision_tower):
super().__init__()
self.depth = model_args.mm_qformer_depth
self.num_latents = model_args.mm_qformer_latents
self.pretrained = model_args.mm_qformer_pretrained
self.Qformer, self.query_tokens, self.ln_vision = self.build_Qformer(vision_tower.hidden_size, self.depth, self.num_latents)
if self.pretrained is not None:
pretrained_dict = torch.load(self.pretrained, map_location="cpu")["model"]
pretrained_dict = {k: v for k, v in pretrained_dict.items() if not k.startswith("t5_proj")}
self.load_state_dict(pretrained_dict)
def build_Qformer(self, vision_width, cross_attention_freq, num_query_token):
encoder_config = BertConfig.from_pretrained("bert-base-uncased")
encoder_config.encoder_width = vision_width
# insert cross-attention layer every other block
encoder_config.add_cross_attention = True
encoder_config.cross_attention_freq = cross_attention_freq
encoder_config.query_length = num_query_token
Qformer = BertLMHeadModel(config=encoder_config)
query_tokens = nn.Parameter(torch.zeros(1, num_query_token, encoder_config.hidden_size))
query_tokens.data.normal_(mean=0.0, std=encoder_config.initializer_range)
Qformer.cls = None
Qformer.bert.embeddings.word_embeddings = None
Qformer.bert.embeddings.position_embeddings = None
for layer in Qformer.bert.encoder.layer:
layer.output = None
layer.intermediate = None
return Qformer, query_tokens, nn.LayerNorm(vision_width)
def forward(self, image_features, *args, **kwargs):
x = self.ln_vision(image_features)
image_atts = torch.ones(x.size()[:-1], dtype=torch.long).to(x.device)
query_tokens = self.query_tokens.expand(x.shape[0], -1, -1)
query_output = self.Qformer.bert(
query_embeds=query_tokens,
encoder_hidden_states=x,
encoder_attention_mask=image_atts,
return_dict=True,
)
return query_output.last_hidden_state
@property
def hidden_size(self):
return 768
@property
def config(self):
return {
"mm_resampler_type": "qformer",
"mm_qformer_depth": self.depth,
"mm_qformer_latents": self.num_latents,
"mm_qformer_pretrained": self.pretrained,
}
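# Illustrative sketch, not part of the original module: the mask logic of
# `BertModel.get_extended_attention_mask` above, re-expressed with plain Python lists
# so it can be checked by hand. The helper names `causal_mask`, `apply_padding`, and
# `to_additive` are hypothetical; the original works on batched torch tensors and
# broadcasts over heads, which this sketch leaves out.

```python
def causal_mask(seq_length):
    """Row q may attend to column k only when k <= q (lower-triangular ones)."""
    return [[1 if k <= q else 0 for k in range(seq_length)] for q in range(seq_length)]


def apply_padding(causal, padding):
    """Zero out every column whose key position is padding (padding[k] == 0)."""
    return [[c * p for c, p in zip(row, padding)] for row in causal]


def to_additive(mask, masked_value=-10000.0):
    """Map 1 -> 0.0 (keep) and 0 -> masked_value; the result is added to raw scores."""
    return [[(1.0 - m) * masked_value for m in row] for row in mask]


# Three tokens, the last one is padding.
mask = apply_padding(causal_mask(3), padding=[1, 1, 0])
additive = to_additive(mask)
```

# Adding a large negative value before the softmax drives the masked probabilities
# to (almost) zero, which is why the source describes it as effectively removing them.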
================================================
FILE: xtuner-eval_niah/longva/model/multimodal_resampler/spatial_pool.py
================================================
import torch
import torch.nn as nn
import math
class SpatialPool(nn.Module):
def __init__(self, model_args, vision_tower):
super().__init__()
self.mode = model_args.mm_spatial_pool_mode
self.stride = model_args.mm_spatial_pool_stride
self.out_channels = getattr(model_args, "mm_spatial_pool_out_channels", vision_tower.hidden_size)
if self.mode == "average":
self.pool = nn.AvgPool2d(kernel_size=self.stride, stride=self.stride)
elif self.mode == "max":
self.pool = nn.MaxPool2d(kernel_size=self.stride, stride=self.stride)
elif self.mode == "conv":
self.pool = nn.Conv2d(in_channels=vision_tower.hidden_size, out_channels=self.out_channels, kernel_size=self.stride, stride=self.stride)
else:
raise ValueError(f"Unknown pooling mode: {self.mode}.")
def forward(self, image_features, images, *args, **kwargs):
ori_W = int(math.sqrt(image_features.shape[1] * images.shape[3] // images.shape[2]))
ori_H = int(ori_W * images.shape[2] // images.shape[3])
B, _, F = image_features.shape
# Reshape to (B, H, W, F); using ori_W for the width keeps non-square grids valid
image_features_spatial = image_features.view(B, ori_H, ori_W, F).permute(0, 3, 1, 2)
image_features_spatial_pool = self.pool(image_features_spatial)
return image_features_spatial_pool.flatten(2).transpose(1, 2).contiguous()
@property
def config(self):
return {
"mm_resampler_type": "spatial_pool",
"mm_spatial_pool_stride": self.stride,
"mm_spatial_pool_mode": self.mode,
"mm_spatial_pool_out_channels": self.out_channels,
}
@property
def hidden_size(self):
return self.out_channels
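# Illustrative sketch, not part of the original module: the grid arithmetic used in
# `SpatialPool.forward`, written with plain integers. `recover_grid` and
# `pooled_token_count` are hypothetical helper names. It assumes NCHW images (so
# `images.shape[2]` is the height and `images.shape[3]` the width) and a pooling
# layer with `kernel_size == stride` and no padding, as configured above.

```python
import math


def recover_grid(num_tokens, image_height, image_width):
    """Recover (rows, cols) of the patch grid from the flattened token count."""
    cols = int(math.sqrt(num_tokens * image_width // image_height))
    rows = int(cols * image_height // image_width)
    return rows, cols


def pooled_token_count(rows, cols, stride):
    """Tokens left after non-overlapping pooling: each axis shrinks by floor division."""
    return (rows // stride) * (cols // stride)


square = recover_grid(576, 336, 336)  # square input
wide = recover_grid(288, 224, 448)    # input twice as wide as it is tall
```

# For the wide input the grid has different row and column counts, which is the
# case where the reshape must use both the recovered height and width.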
================================================
FILE: xtuner-eval_niah/longva/model/utils.py
================================================
from transformers import AutoConfig
def auto_upgrade(config):
cfg = AutoConfig.from_pretrained(config)
if "llava" in config and "llava" not in cfg.model_type:
assert cfg.model_type == "llama"
print("You are using newer LLaVA code base, while the checkpoint of v0 is from older code base.")
print("You must upgrade the checkpoint to the new code base (this can be done automatically).")
confirm = input("Please confirm that you want to upgrade the checkpoint. [Y/N]")
if confirm.lower() in ["y", "yes"]:
print("Upgrading checkpoint...")
assert len(cfg.architectures) == 1
setattr(cfg.__class__, "model_type", "llava")
cfg.architectures[0] = "LlavaLlamaForCausalLM"
cfg.save_pretrained(config)
print("Checkpoint upgraded.")
else:
print("Checkpoint upgrade aborted.")
exit(1)
================================================
FILE: xtuner-eval_niah/longva/train/llama_flash_attn_monkey_patch.py
================================================
from typing import Optional, Tuple
import warnings
import torch
import transformers
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb, repeat_kv
try:
from flash_attn.flash_attn_interface import flash_attn_unpadded_qkvpacked_func
except ImportError:
from flash_attn.flash_attn_interface import flash_attn_varlen_qkvpacked_func as flash_attn_unpadded_qkvpacked_func
from flash_attn.bert_padding import unpad_input, pad_input
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
padding_mask: Optional[torch.Tensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if output_attentions:
warnings.warn("Output attentions is not supported for patched `LlamaAttention`, returning `None` instead.")
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2) # shape: (b, num_heads, s, head_dim)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
# reuse k, v
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
# Transform the data into the format required by flash attention
qkv = torch.stack([query_states, key_states, value_states], dim=2)
qkv = qkv.transpose(1, 3) # shape: [b, s, 3, num_heads, head_dim]
key_padding_mask = attention_mask
if key_padding_mask is None:
qkv = qkv.reshape(-1, 3, self.num_heads, self.head_dim)
cu_q_lens = torch.arange(0, (bsz + 1) * q_len, step=q_len, dtype=torch.int32, device=qkv.device)
max_s = q_len
output = flash_attn_unpadded_qkvpacked_func(qkv, cu_q_lens, max_s, 0.0, softmax_scale=None, causal=True)
output = output.view(bsz, q_len, -1)
else:
qkv = qkv.reshape(bsz, q_len, -1)
qkv, indices, cu_q_lens, max_s = unpad_input(qkv, key_padding_mask)
qkv = qkv.view(-1, 3, self.num_heads, self.head_dim)
output_unpad = flash_attn_unpadded_qkvpacked_func(qkv, cu_q_lens, max_s, 0.0, softmax_scale=None, causal=True)
output_unpad = output_unpad.reshape(-1, self.num_heads * self.head_dim)
output = pad_input(output_unpad, indices, bsz, q_len)
return self.o_proj(output), None, past_key_value
# Disable the transformation of the attention mask in LlamaModel as the flash attention
# requires the attention mask to be the same as the key_padding_mask
def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
# [bsz, seq_len]
return attention_mask
def replace_llama_attn_with_flash_attn():
cuda_major, cuda_minor = torch.cuda.get_device_capability()
if cuda_major < 8:
warnings.warn("Flash attention is only supported on A100 or H100 GPU during training due to head dim > 64 backward. " "ref: https://github.com/HazyResearch/flash-attention/issues/190#issuecomment-1523359593")
transformers.models.llama.modeling_llama.LlamaModel._prepare_decoder_attention_mask = _prepare_decoder_attention_mask
transformers.models.llama.modeling_llama.LlamaAttention.forward = forward
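# Illustrative sketch, not part of the original module: what the cumulative
# sequence-length vector (`cu_q_lens`) passed to the variable-length flash-attention
# kernel looks like. `cu_seqlens_uniform` mirrors the `torch.arange` call in the
# unpadded branch above; `cu_seqlens_from_mask` is an assumption about what
# `unpad_input` returns for a padding mask (a running sum of per-sample real
# lengths), shown here with plain lists. Both helper names are hypothetical.

```python
from itertools import accumulate


def cu_seqlens_uniform(batch_size, seq_len):
    """Boundaries when every sample has the same length: 0, L, 2L, ..., B*L."""
    return list(range(0, (batch_size + 1) * seq_len, seq_len))


def cu_seqlens_from_mask(padding_mask):
    """Boundaries when samples have different real lengths (1 = token, 0 = pad)."""
    lengths = [sum(row) for row in padding_mask]
    return [0] + list(accumulate(lengths))


uniform = cu_seqlens_uniform(batch_size=3, seq_len=4)
ragged = cu_seqlens_from_mask([[1, 1, 1, 0], [1, 1, 0, 0]])
```

# Sample i occupies rows cu[i]:cu[i+1] of the packed (total_tokens, ...) tensor, so
# the vector always has batch_size + 1 entries.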
================================================
FILE: xtuner-eval_niah/longva/train/llava_trainer.py
================================================
import os
import torch
import torch.nn as nn
import datetime
from accelerate import Accelerator
from accelerate.utils import InitProcessGroupKwargs, GradientAccumulationPlugin
from torch.utils.data import Dataset, Sampler, DataLoader
from trl.trainer import DPOTrainer
from trl.trainer.utils import DPODataCollatorWithPadding
from transformers import Trainer
from transformers.trainer import is_sagemaker_mp_enabled, get_parameter_names, has_length, ALL_LAYERNORM_LAYERS, logger, is_accelerate_available, is_datasets_available, GradientAccumulationPlugin
from transformers.trainer_utils import seed_worker
from transformers.trainer_pt_utils import get_length_grouped_indices as get_length_grouped_indices_hf
from transformers.trainer_pt_utils import AcceleratorConfig
from typing import List, Optional
from datetime import timedelta
if is_accelerate_available():
from accelerate import Accelerator, skip_first_batches, InitProcessGroupKwargs
if is_datasets_available():
import datasets
from longva.utils import rank0_print
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
print(name, "no ignore status")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True, name=k).cpu() for k, v in to_return.items()}
return to_return
def split_to_even_chunks(indices, lengths, num_chunks):
"""
Split a list of indices into `num_chunks` chunks whose total sample lengths are roughly equal.
"""
if len(indices) % num_chunks != 0:
return [indices[i::num_chunks] for i in range(num_chunks)]
num_indices_per_chunk = len(indices) // num_chunks
chunks = [[] for _ in range(num_chunks)]
chunks_lengths = [0 for _ in range(num_chunks)]
for index in indices:
shortest_chunk = chunks_lengths.index(min(chunks_lengths))
chunks[shortest_chunk].append(index)
chunks_lengths[shortest_chunk] += lengths[index]
if len(chunks[shortest_chunk]) == num_indices_per_chunk:
chunks_lengths[shortest_chunk] = float("inf")
return chunks
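# Illustrative sketch, not part of the original module: a standalone, indented
# restatement of `split_to_even_chunks` above with a small worked input, to show
# the greedy "append to the currently lightest chunk" behaviour. The sample
# `lengths` are made up for illustration.

```python
def split_to_even_chunks(indices, lengths, num_chunks):
    # Fall back to round-robin slicing when the indices do not divide evenly.
    if len(indices) % num_chunks != 0:
        return [indices[i::num_chunks] for i in range(num_chunks)]

    num_indices_per_chunk = len(indices) // num_chunks
    chunks = [[] for _ in range(num_chunks)]
    chunks_lengths = [0 for _ in range(num_chunks)]
    for index in indices:
        # Greedy step: give the next sample to the chunk with the smallest total.
        shortest_chunk = chunks_lengths.index(min(chunks_lengths))
        chunks[shortest_chunk].append(index)
        chunks_lengths[shortest_chunk] += lengths[index]
        # A full chunk is marked infinitely long so it is never picked again.
        if len(chunks[shortest_chunk]) == num_indices_per_chunk:
            chunks_lengths[shortest_chunk] = float("inf")
    return chunks


lengths = [9, 8, 7, 3, 2, 1]  # hypothetical per-sample token counts
even = split_to_even_chunks([0, 1, 2, 3, 4, 5], lengths, num_chunks=2)
uneven = split_to_even_chunks([0, 1, 2, 3, 4], lengths, num_chunks=2)
```

# In the even case the two chunks carry 14 and 16 tokens respectively, which is
# closer than a plain first-half/second-half split (24 versus 6).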
def get_variable_length_grouped_indices(lengths, batch_size, world_size, megabatch_mult=8, generator=None):
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
indices = torch.randperm(len(lengths), generator=generator)
sorted_indices = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
megabatch_size = world_size * batch_size * megabatch_mult
megabatches = [sorted_indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: indices[i], reverse=True) for megabatch in megabatches]
shuffled_indices = [i for megabatch in megabatches for i in megabatch]
world_batch_size = world_size * batch_size
batches = [shuffled_indices[i : i + world_batch_size] for i in range(0, len(lengths), world_batch_size)]
batch_indices = torch.randperm(len(batches), generator=generator)
batches = [batches[i] for i in batch_indices]
return [i for batch in batches for i in batch]
def get_modality_length_grouped_indices(lengths, batch_size, world_size, generator=None):
"""
Return a list of indices so that each slice of `batch_size` consecutive indices correspond to elements of similar
lengths. To do this, the indices are:
- randomly permuted
- grouped in mega-batches of size `world_size * batch_size`, separately for each modality
- reorder by length in each mega-batch
The result is the concatenation of all mega-batches, with the batch of `batch_size` containing the element of
maximum length placed first, so that an OOM happens sooner rather than later.
"""
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
assert all(l != 0 for l in lengths), "Should not have zero length."
if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
# all samples are in the same modality
return get_length_grouped_indices(lengths, batch_size, world_size, generator=generator)
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices(mm_lengths, batch_size, world_size, generator=None)]
lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices(lang_lengths, batch_size, world_size, generator=None)]
megabatch_size = world_size * batch_size
mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
additional_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
if len(additional_batch) > 0:
megabatches.append(sorted(additional_batch))
return [i for megabatch in megabatches for i in megabatch]
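# Illustrative sketch, not part of the original module: the sign convention that
# `get_modality_length_grouped_indices` relies on. A positive length marks a
# multimodal sample, a negative length marks a language-only sample, and zero is
# rejected. `split_by_modality` is a hypothetical helper and the sample lengths
# are made up.

```python
def split_by_modality(lengths):
    """Return (multimodal, language_only) lists of (dataset_index, token_length)."""
    assert all(l != 0 for l in lengths), "Should not have zero length."
    multimodal = [(i, l) for i, l in enumerate(lengths) if l > 0]
    language_only = [(i, -l) for i, l in enumerate(lengths) if l < 0]
    return multimodal, language_only


mm, lang = split_by_modality([120, -35, 640, -12])
```

# Each group is then length-grouped on its own, so a mega-batch never mixes
# modalities except for the final leftover batch.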
def get_length_grouped_indices(lengths, batch_size, world_size, generator=None, merge=True):
"""
Return a list of indices so that each slice of `batch_size` consecutive indices correspond to elements of similar
lengths. To do this, the indices are:
- randomly permuted
- grouped in mega-batches of size `world_size * batch_size`
- reorder by length in each mega-batch
The result is the concatenation of all mega-batches, with the batch of `batch_size` containing the element of
maximum length placed first, so that an OOM happens sooner rather than later.
"""
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
indices = torch.randperm(len(lengths), generator=generator)
megabatch_size = world_size * batch_size
megabatches = [indices[i : i + megabatch_size].tolist() for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
return [i for megabatch in megabatches for batch in megabatch for i in batch]
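# Illustrative sketch only (not called by the trainer): a stdlib-only rendering of the
# mega-batch grouping described in the get_length_grouped_indices docstring. As assumptions
# for the sake of a dependency-free example, it swaps torch.randperm for random.Random and
# omits the split_to_even_chunks balancing step. The name _sketch_length_grouped_indices
# and the `seed` parameter are hypothetical.
import random


def _sketch_length_grouped_indices(lengths, batch_size, world_size, seed=0):
    # Shuffle all indices, then cut them into mega-batches of world_size * batch_size.
    indices = list(range(len(lengths)))
    random.Random(seed).shuffle(indices)
    megabatch_size = world_size * batch_size
    megabatches = [indices[i : i + megabatch_size] for i in range(0, len(indices), megabatch_size)]
    # Within each mega-batch, the longest samples come first.
    megabatches = [sorted(mb, key=lambda i: lengths[i], reverse=True) for mb in megabatches]
    return [i for mb in megabatches for i in mb]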
def get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=None):
indices = get_length_grouped_indices_hf(lengths, batch_size * world_size, generator=generator)
megabatch_size = world_size * batch_size
megabatches = [indices[i : i + megabatch_size] for i in range(0, len(lengths), megabatch_size)]
megabatches = [sorted(megabatch, key=lambda i: lengths[i], reverse=True) for megabatch in megabatches]
megabatches = [split_to_even_chunks(megabatch, lengths, world_size) for megabatch in megabatches]
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
batch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in batch_indices]
return [i for megabatch in megabatches for batch in megabatch for i in batch]
def get_modality_length_grouped_indices_auto(lengths, batch_size, world_size, generator=None):
# We need to use torch for the random part as a distributed sampler will set the random seed for torch.
assert all(l != 0 for l in lengths), "Should not have zero length."
if all(l > 0 for l in lengths) or all(l < 0 for l in lengths):
# all samples are in the same modality
return get_length_grouped_indices_auto_single(lengths, batch_size, world_size, generator=generator)
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths) if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l) for i, l in enumerate(lengths) if l < 0])
mm_shuffle = [mm_indices[i] for i in get_length_grouped_indices_auto_single(mm_lengths, batch_size, world_size, generator=None)]
lang_shuffle = [lang_indices[i] for i in get_length_grouped_indices_auto_single(lang_lengths, batch_size, world_size, generator=None)]
megabatch_size = world_size * batch_size
mm_megabatches = [mm_shuffle[i : i + megabatch_size] for i in range(0, len(mm_shuffle), megabatch_size)]
lang_megabatches = [lang_shuffle[i : i + megabatch_size] for i in range(0, len(lang_shuffle), megabatch_size)]
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
additional_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
# FIXME: Hard code to avoid last batch mixed with different modalities
# if len(additional_batch) > 0:
# megabatches.append(sorted(additional_batch))
return [i for megabatch in megabatches for i in megabatch]
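# Illustrative sketch only: the sign convention that the modality-aware grouping functions
# rely on. Positive lengths mark multimodal samples, negative lengths mark language-only
# samples (stored negated), and zero is rejected. The helper name _sketch_split_by_modality
# is hypothetical and is not used by the sampler.
def _sketch_split_by_modality(lengths):
    if any(l == 0 for l in lengths):
        raise ValueError("Should not have zero length.")
    # Keep the original dataset index next to each (now positive) length.
    mm = [(i, l) for i, l in enumerate(lengths) if l > 0]
    lang = [(i, -l) for i, l in enumerate(lengths) if l < 0]
    return mm, lang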
class LengthGroupedSampler(Sampler):
r"""
Sampler that samples indices in a way that groups together features of the dataset of roughly the same length while
keeping a bit of randomness.
"""
def __init__(
self,
batch_size: int,
world_size: int,
lengths: Optional[List[int]] = None,
generator=None,
variable_length: bool = False,
group_by_modality: bool = False,
group_by_modality_auto: bool = False,
):
if lengths is None:
raise ValueError("Lengths must be provided.")
self.batch_size = batch_size
self.world_size = world_size
self.lengths = lengths
self.generator = generator
self.variable_length = variable_length
self.group_by_modality = group_by_modality
self.group_by_modality_auto = group_by_modality_auto
def __len__(self):
return len(self.lengths)
def __iter__(self):
if self.variable_length:
assert not self.group_by_modality, "Variable length grouping is not supported with modality grouping."
indices = get_variable_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
else:
if self.group_by_modality:
indices = get_modality_length_grouped_indices(self.lengths, self.batch_size, self.world_size, generator=self.generator)
elif self.group_by_modality_auto:
indices = get_modality_length_grouped_indices_auto(self.lengths, self.batch_size, self.world_size, generator=self.generator)
else:
indices = get_length_grouped_indices_auto_single(self.lengths, self.batch_size, self.world_size, generator=self.generator)
return iter(indices)
class LLaVATrainer(Trainer):
def create_accelerator_and_postprocess(self):
grad_acc_kwargs = {"num_steps": self.args.gradient_accumulation_steps}
grad_acc_kwargs["sync_with_dataloader"] = False
gradient_accumulation_plugin = GradientAccumulationPlugin(**grad_acc_kwargs)
accelerator_kwargs = InitProcessGroupKwargs(timeout=timedelta(weeks=52))
rank0_print("Setting NCCL timeout to 52 weeks (effectively infinite) to avoid timeout errors.")
# create accelerator object
self.accelerator = Accelerator(
dispatch_batches=self.args.dispatch_batches, split_batches=self.args.split_batches, deepspeed_plugin=self.args.deepspeed_plugin, gradient_accumulation_plugin=gradient_accumulation_plugin, kwargs_handlers=[accelerator_kwargs]
)
# some Trainer classes need to use `gather` instead of `gather_for_metrics`, thus we store a flag
self.gather_function = self.accelerator.gather_for_metrics
# deepspeed and accelerate flags covering both trainer args and accelerate launcher
self.is_deepspeed_enabled = getattr(self.accelerator.state, "deepspeed_plugin", None) is not None
self.is_fsdp_enabled = getattr(self.accelerator.state, "fsdp_plugin", None) is not None
# post accelerator creation setup
if self.is_fsdp_enabled:
fsdp_plugin = self.accelerator.state.fsdp_plugin
fsdp_plugin.limit_all_gathers = self.args.fsdp_config.get("limit_all_gathers", fsdp_plugin.limit_all_gathers)
if is_accelerate_available("0.23.0"):
fsdp_plugin.activation_checkpointing = self.args.fsdp_config.get("activation_checkpointing", fsdp_plugin.activation_checkpointing)
if fsdp_plugin.activation_checkpointing and self.args.gradient_checkpointing:
raise ValueError("The activation_checkpointing in FSDP config and the gradient_checkpointing in training arg " "can't be set to True simultaneously. Please use FSDP's activation_checkpointing logic " "when using FSDP.")
if self.is_deepspeed_enabled and getattr(self.args, "hf_deepspeed_config", None) is None:
self.propagate_args_to_deepspeed()
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
if self.train_dataset is None or not has_length(self.train_dataset):
return None
if self.args.group_by_length:
lengths = self.train_dataset.lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
)
elif self.args.group_by_modality_length:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
group_by_modality=True,
)
elif self.args.group_by_modality_length_auto:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
group_by_modality_auto=True,
)
elif self.args.group_by_varlen:
lengths = self.train_dataset.lengths
return LengthGroupedSampler(
self.args.train_batch_size * self.args.gradient_accumulation_steps,
# self.args.train_batch_size, # TODO: seems that we should have gradient_accumulation_steps
# world_size=self.args.world_size,
world_size=self.args.world_size * self.args.gradient_accumulation_steps, # TODO: seems that this may work?
lengths=lengths,
variable_length=True,
)
else:
return super()._get_train_sampler()
def get_train_dataloader(self) -> DataLoader:
"""
Returns the training [`~torch.utils.data.DataLoader`].
Will use no sampler if `train_dataset` does not implement `__len__`, a random sampler (adapted to distributed
training if necessary) otherwise.
Subclass and override this method if you want to inject some custom behavior.
"""
if self.train_dataset is None:
raise ValueError("Trainer: training requires a train_dataset.")
train_dataset = self.train_dataset
data_collator = self.data_collator
if is_datasets_available() and isinstance(train_dataset, datasets.Dataset):
train_dataset = self._remove_unused_columns(train_dataset, description="training")
else:
data_collator = self._get_collator_with_removed_columns(data_collator, description="training")
dataloader_params = {
"batch_size": self._train_batch_size,
"collate_fn": data_collator,
"num_workers": self.args.dataloader_num_workers,
"pin_memory": self.args.dataloader_pin_memory,
"persistent_workers": self.args.dataloader_persistent_workers,
}
if not isinstance(train_dataset, torch.utils.data.IterableDataset):
dataloader_params["sampler"] = self._get_train_sampler()
dataloader_params["drop_last"] = self.args.dataloader_drop_last
dataloader_params["worker_init_fn"] = seed_worker
dataloader_params["prefetch_factor"] = self.args.dataloader_num_workers * 2 if self.args.dataloader_num_workers != 0 else None
dataloader = self.accelerator.prepare(DataLoader(train_dataset, **dataloader_params))
return dataloader
def create_optimizer(self):
"""
Setup the optimizer.
We provide a reasonable default that works well. If you want to use something else, you can pass a tuple in the
Trainer's init through `optimizers`, or subclass and override this method in a subclass.
"""
if is_sagemaker_mp_enabled():
return super().create_optimizer()
opt_model = self.model
if self.optimizer is None:
decay_parameters = get_parameter_names(opt_model, ALL_LAYERNORM_LAYERS)
decay_parameters = [name for name in decay_parameters if "bias" not in name]
lr_mapper = {}
if self.args.mm_projector_lr is not None:
lr_mapper["mm_projector"] = self.args.mm_projector_lr
if self.args.mm_vision_tower_lr is not None:
lr_mapper["vision_tower"] = self.args.mm_vision_tower_lr
if len(lr_mapper) > 0:
special_lr_parameters = [name for name, _ in opt_model.named_parameters() if any(module_keyword in name for module_keyword in lr_mapper)]
optimizer_grouped_parameters = [
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n not in special_lr_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n not in special_lr_parameters and p.requires_grad)],
"weight_decay": 0.0,
},
]
for module_keyword, lr in lr_mapper.items():
module_parameters = [name for name, _ in opt_model.named_parameters() if module_keyword in name]
optimizer_grouped_parameters.extend(
[
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and n in module_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
"lr": lr,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and n in module_parameters and p.requires_grad)],
"weight_decay": 0.0,
"lr": lr,
},
]
)
else:
optimizer_grouped_parameters = [
{
"params": [p for n, p in opt_model.named_parameters() if (n in decay_parameters and p.requires_grad)],
"weight_decay": self.args.weight_decay,
},
{
"params": [p for n, p in opt_model.named_parameters() if (n not in decay_parameters and p.requires_grad)],
"weight_decay": 0.0,
},
]
optimizer_cls, optimizer_kwargs = Trainer.get_optimizer_cls_and_kwargs(self.args)
self.optimizer = optimizer_cls(optimizer_grouped_parameters, **optimizer_kwargs)
if optimizer_cls.__name__ == "Adam8bit":
import bitsandbytes
manager = bitsandbytes.optim.GlobalOptimManager.get_instance()
skipped = 0
for module in opt_model.modules():
if isinstance(module, nn.Embedding):
skipped += sum({p.data_ptr(): p.numel() for p in module.parameters()}.values())
logger.info(f"skipped {module}: {skipped/2**20}M params")
manager.register_module_override(module, "weight", {"optim_bits": 32})
logger.debug(f"bitsandbytes: will optimize {module} in fp32")
logger.info(f"skipped: {skipped/2**20}M params")
return self.optimizer
def _save_checkpoint(self, model, trial, metrics=None):
if getattr(self.args, "tune_mm_mlp_adapter", False) or (
hasattr(self.args, "mm_tunable_parts") and (len(self.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in self.args.mm_tunable_parts or "mm_vision_resampler" in self.args.mm_tunable_parts))
):
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(self.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
if self.args.local_rank == 0 or self.args.local_rank == -1:
self.model.config.save_pretrained(output_dir)
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
else:
super(LLaVATrainer, self)._save_checkpoint(model, trial, metrics)
def _save(self, output_dir: Optional[str] = None, state_dict=None):
if getattr(self.args, "tune_mm_mlp_adapter", False):
pass
else:
super(LLaVATrainer, self)._save(output_dir, state_dict)
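# Illustrative sketch only: how LLaVATrainer.create_optimizer partitions parameters into
# optimizer groups, shown on plain parameter names instead of tensors and without the
# requires_grad filter. The helper name _sketch_param_groups and its arguments are
# hypothetical; `lr_mapper` maps a module keyword (e.g. "mm_projector") to its own
# learning rate, as in the method itself.
def _sketch_param_groups(param_names, decay_names, lr_mapper, weight_decay):
    # Parameters matching any keyword get their own groups with a custom learning rate.
    special = {n for n in param_names if any(k in n for k in lr_mapper)}
    groups = [
        {"params": [n for n in param_names if n in decay_names and n not in special], "weight_decay": weight_decay},
        {"params": [n for n in param_names if n not in decay_names and n not in special], "weight_decay": 0.0},
    ]
    for keyword, lr in lr_mapper.items():
        module = [n for n in param_names if keyword in n]
        groups.append({"params": [n for n in module if n in decay_names], "weight_decay": weight_decay, "lr": lr})
        groups.append({"params": [n for n in module if n not in decay_names], "weight_decay": 0.0, "lr": lr})
    return groups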
class LLaVADPOTrainer(DPOTrainer):
def _get_train_sampler(self) -> Optional[torch.utils.data.Sampler]:
if self.train_dataset is None or not has_length(self.train_dataset):
return None
if self.args.group_by_modality_length:
lengths = self.train_dataset.modality_lengths
return LengthGroupedSampler(
# self.args.train_batch_size * self.args.gradient_accumulation_steps, # TODO: seems that we should not have gradient_accumulation_steps
self.args.train_batch_size,
world_size=self.args.world_size,
lengths=lengths,
group_by_modality=True,
)
else:
return super()._get_train_sampler()
def _save_checkpoint(self, model, trial, metrics=None):
if getattr(self.args, "tune_mm_mlp_adapter", False) or (
hasattr(self.args, "mm_tunable_parts") and (len(self.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in self.args.mm_tunable_parts or "mm_vision_resampler" in self.args.mm_tunable_parts))
):
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(self.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(self.model.named_parameters(), keys_to_match)
if self.args.local_rank == 0 or self.args.local_rank == -1:
self.model.config.save_pretrained(output_dir)
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
else:
# super(LLaVADPOTrainer, self)._save_checkpoint(model, trial, metrics)
# print(type(model))
# from transformers.modeling_utils import unwrap_model
# print(type(unwrap_model(model)))
# print(unwrap_model(model).config)
if self.args.lora_enable:
from transformers.trainer_utils import PREFIX_CHECKPOINT_DIR
checkpoint_folder = f"{PREFIX_CHECKPOINT_DIR}-{self.state.global_step}"
run_dir = self._get_output_dir(trial=trial)
output_dir = os.path.join(run_dir, checkpoint_folder)
from transformers.modeling_utils import unwrap_model
unwrapped_model = unwrap_model(model)
self.save_my_lora_ckpt(output_dir, self.args, unwrapped_model)
else:
super(LLaVADPOTrainer, self)._save_checkpoint(model, trial, metrics)
def _save(self, output_dir: Optional[str] = None, state_dict=None):
if getattr(self.args, "tune_mm_mlp_adapter", False):
pass
else:
super(LLaVADPOTrainer, self)._save(output_dir, state_dict)
================================================
FILE: xtuner-eval_niah/longva/train/train.py
================================================
# Adopted from https://github.com/lm-sys/FastChat. Below is the original copyright:
# Adopted from tatsu-lab@stanford_alpaca. Below is the original copyright:
# Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import ast
import os
import copy
from dataclasses import dataclass, field
import json
import logging
import pathlib
from typing import Dict, Optional, Sequence, List
from PIL import Image, ImageFile
from packaging import version
import numpy as np
import time
import random
import yaml
import math
import re
import torch
import transformers
import tokenizers
import deepspeed
from transformers import AutoConfig
from torch.utils.data import Dataset
from longva.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX
from longva.train.llava_trainer import LLaVATrainer
from longva import conversation as conversation_lib
from longva.model import *
from longva.mm_utils import process_highres_image, process_anyres_image, process_highres_image_crop_split, tokenizer_image_token
from longva.utils import rank0_print, process_video_with_pyav
torch.multiprocessing.set_sharing_strategy("file_system")
ImageFile.LOAD_TRUNCATED_IMAGES = True
local_rank = None
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse("0.14")
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
model_class_name: Optional[str] = field(default=None, metadata={"help": "Used to init model class, format is XXXXForCausalLM. e.g. currently XXXX is chosen from LlavaLlama, LlavaMixtral, LlavaMistral, Llama"})
mm_tunable_parts: Optional[str] = field(
default=None, metadata={"help": 'Could be "mm_mlp_adapter", "mm_vision_resampler", "mm_vision_tower,mm_mlp_adapter,mm_language_model", "mm_mlp_adapter,mm_language_model"'}
)
# deciding which part of the multimodal model to tune, will overwrite other previous settings
version: Optional[str] = field(default="v0")
freeze_backbone: bool = field(default=False)
tune_mm_mlp_adapter: bool = field(default=False)
tune_mm_vision_resampler: bool = field(default=False)
vision_tower: Optional[str] = field(default=None)
vision_tower_pretrained: Optional[str] = field(default=None)
unfreeze_mm_vision_tower: bool = field(default=False)
unfreeze_language_model: bool = field(default=False)
mm_vision_select_layer: Optional[int] = field(default=-1) # default to the last layer
pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
mm_projector_type: Optional[str] = field(default="linear")
mm_use_im_start_end: bool = field(default=False)
mm_use_im_patch_token: bool = field(default=True)
mm_patch_merge_type: Optional[str] = field(default="flat")
mm_vision_select_feature: Optional[str] = field(default="patch")
mm_resampler_type: Optional[str] = field(default=None)
mm_mask_drop_mode: str = field(default="fixed")
mm_mask_drop_skip_percentage: float = field(default=0.0)
mm_mask_drop_ratio: float = field(default=0.25)
mm_mask_drop_ratio_upper: Optional[float] = field(default=None)
mm_mask_drop_ratio_lower: Optional[float] = field(default=None)
mm_spatial_pool_stride: Optional[int] = field(default=None)
mm_spatial_pool_mode: str = field(default="average")
mm_spatial_pool_out_channels: Optional[int] = field(default=None)
mm_perceiver_depth: Optional[int] = field(default=3)
mm_perceiver_latents: Optional[int] = field(default=32)
mm_perceiver_ff_mult: Optional[float] = field(default=4)
mm_perceiver_pretrained: Optional[str] = field(default=None)
mm_qformer_depth: Optional[int] = field(default=3)
mm_qformer_latents: Optional[int] = field(default=32)
mm_qformer_pretrained: Optional[str] = field(default=None)
rope_scaling_factor: Optional[float] = field(default=None)
rope_scaling_type: Optional[str] = field(default=None)
s2: Optional[bool] = field(default=False)
s2_scales: Optional[str] = field(default="336,672,1008")
use_pos_skipping: Optional[bool] = field(default=False)
pos_skipping_range: Optional[int] = field(default=4096)
@dataclass
class DataArguments:
data_path: str = field(default=None, metadata={"help": "Path to the training data, in llava's instruction.json format. Supporting multiple json files via /path/to/{a,b,c}.json"})
lazy_preprocess: bool = False
is_multimodal: bool = False
image_folder: Optional[str] = field(default=None)
image_aspect_ratio: str = "square"
image_grid_pinpoints: Optional[str] = field(default=None)
image_crop_resolution: Optional[int] = field(default=None)
image_split_resolution: Optional[int] = field(default=None)
video_folder: Optional[str] = field(default=None)
video_fps: Optional[int] = field(default=1)
frames_upbound: Optional[int] = field(default=0)
@dataclass
class TrainingArguments(transformers.TrainingArguments):
cache_dir: Optional[str] = field(default=None)
optim: str = field(default="adamw_torch")
remove_unused_columns: bool = field(default=False)
freeze_mm_mlp_adapter: bool = field(default=False)
freeze_mm_vision_resampler: bool = field(default=False)
mpt_attn_impl: Optional[str] = field(default="triton")
model_max_length: int = field(
default=4096,
metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
)
double_quant: bool = field(default=True, metadata={"help": "Compress the quantization statistics through double quantization."})
quant_type: str = field(default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."})
bits: int = field(default=16, metadata={"help": "How many bits to use."})
lora_enable: bool = False
lora_r: int = 64
lora_alpha: int = 16
lora_dropout: float = 0.05
lora_weight_path: str = ""
lora_bias: str = "none"
mm_projector_lr: Optional[float] = None
mm_vision_tower_lr: Optional[float] = None
group_by_varlen: bool = field(default=False)
group_by_modality_length: bool = field(default=False)
group_by_modality_length_auto: bool = field(default=False)
auto_find_batch_size: bool = field(default=False)
gradient_checkpointing: bool = field(default=True)
verbose_logging: bool = field(default=False)
attn_implementation: str = field(default="flash_attention_2", metadata={"help": "Use transformers attention implementation."})
# @dataclass
# class EvaluationArguments:
# eval_num_processes: int = field(default=1)
# task_names: str = field(default=None)
# model: str = field(default="llava")
# model_args: Optional[str] = field(default=None)
# num_fewshot: Optional[int] = field(default=None)
# batch_size: int = field(default=1)
# device: Optional[str] = field(default=None)
# limit: Optional[int] = field(default=None)
# check_integrity: Optional[bool] = field(default=False)
# show_task_to_terminal: Optional[bool] = field(default=False)
# log_samples: Optional[bool] = field(default=True)
# gen_kwargs: Optional[str] = field(default="")
# log_samples_suffix: Optional[str] = field(default="")
# output_path: Optional[str] = field(default="./logs/")
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
logging.warning(f"{name}: param.ds_status is ZeroParamStatus.NOT_AVAILABLE ({param.ds_status}); gathering it before copying.")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
if bias == "none":
to_return = {k: t for k, t in named_params if "lora_" in k}
elif bias == "all":
to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
elif bias == "lora_only":
to_return = {}
maybe_lora_bias = {}
lora_bias_names = set()
for k, t in named_params:
if "lora_" in k:
to_return[k] = t
bias_name = k.split("lora_")[0] + "bias"
lora_bias_names.add(bias_name)
elif "bias" in k:
maybe_lora_bias[k] = t
# Keep only the biases that belong to a module with LoRA weights.
for k, t in maybe_lora_bias.items():
if k in lora_bias_names:
to_return[k] = t
else:
raise NotImplementedError
to_return = {k: maybe_zero_3(v, ignore_status=True) for k, v in to_return.items()}
return to_return
def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
to_return = {k: t for k, t in named_params if "lora_" not in k}
if require_grad_only:
to_return = {k: t for k, t in to_return.items() if t.requires_grad}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def find_all_linear_names(model):
cls = torch.nn.Linear
lora_module_names = set()
multimodal_keywords = ["mm_projector", "vision_tower", "vision_resampler"]
for name, module in model.named_modules():
if any(mm_keyword in name for mm_keyword in multimodal_keywords):
continue
if isinstance(module, cls):
names = name.split(".")
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if "lm_head" in lora_module_names: # needed for 16-bit
lora_module_names.remove("lm_head")
return list(lora_module_names)
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
"""Collects the state dict and dumps it to disk."""
if hasattr(trainer.args, "tune_mm_mlp_adapter") and trainer.args.tune_mm_mlp_adapter:
check_only_save_mm_adapter_tunnable = True
# only has mm_mlp_adapter and mm_vision_resampler in the tuneable parts
elif hasattr(trainer.args, "mm_tunable_parts") and (len(trainer.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in trainer.args.mm_tunable_parts or "mm_vision_resampler" in trainer.args.mm_tunable_parts)):
check_only_save_mm_adapter_tunnable = True
else:
check_only_save_mm_adapter_tunnable = False
trainer.accelerator.wait_for_everyone()
torch.cuda.synchronize()
rank0_print(f"Only save projectors: {check_only_save_mm_adapter_tunnable}")
if check_only_save_mm_adapter_tunnable:
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(trainer.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
trainer.model.config.save_pretrained(output_dir)
current_folder = output_dir.split("/")[-1]
parent_folder = os.path.dirname(output_dir)
if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
if current_folder.startswith("checkpoint-"):
mm_projector_folder = os.path.join(parent_folder, "mm_projector")
os.makedirs(mm_projector_folder, exist_ok=True)
torch.save(weight_to_save, os.path.join(mm_projector_folder, f"{current_folder}.bin"))
else:
torch.save(weight_to_save, os.path.join(output_dir, "mm_projector.bin"))
return
if trainer.deepspeed:
trainer.save_model(output_dir)
return
state_dict = trainer.model.state_dict()
if trainer.args.should_save:
cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
del state_dict
trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
def smart_tokenizer_and_embedding_resize(
special_tokens_dict: Dict,
tokenizer: transformers.PreTrainedTokenizer,
model: transformers.PreTrainedModel,
):
"""Resize tokenizer and embedding.
Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
"""
num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
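# Illustrative sketch only: the mean-initialization used by
# smart_tokenizer_and_embedding_resize, shown on plain nested lists instead of tensors.
# Rows for newly added tokens are set to the column-wise average of the existing rows.
# The helper name _sketch_mean_init is hypothetical.
def _sketch_mean_init(rows, num_new_tokens):
    if num_new_tokens <= 0:
        return [list(r) for r in rows]
    old = [list(r) for r in rows[:-num_new_tokens]]
    dim = len(old[0])
    # Column-wise mean over the pre-existing embedding rows.
    avg = [sum(r[d] for r in old) / len(old) for d in range(dim)]
    return old + [list(avg) for _ in range(num_new_tokens)]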
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
"""Tokenize a list of strings."""
tokenized_list = [
tokenizer(
text,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
)
for text in strings
]
input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
input_ids_lens = labels_lens = [tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list]
return dict(
input_ids=input_ids,
labels=labels,
input_ids_lens=input_ids_lens,
labels_lens=labels_lens,
)
def _mask_targets(target, tokenized_lens, speakers):
# cur_idx = 0
cur_idx = tokenized_lens[0]
tokenized_lens = tokenized_lens[1:]
target[:cur_idx] = IGNORE_INDEX
for tokenized_len, speaker in zip(tokenized_lens, speakers):
if speaker == "human":
target[cur_idx + 2 : cur_idx + tokenized_len] = IGNORE_INDEX
cur_idx += tokenized_len
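# Illustrative sketch only: the masking done by _mask_targets, shown on a plain list.
# The header and every human turn are replaced by the ignore index, except for the first
# two tokens of each human turn (mirroring the `cur_idx + 2` offset). The default
# ignore_index=-100 is an assumption about the value of IGNORE_INDEX; the helper name
# _sketch_mask_targets is hypothetical.
def _sketch_mask_targets(target, tokenized_lens, speakers, ignore_index=-100):
    target = list(target)
    cur_idx = tokenized_lens[0]
    # Mask the header.
    for j in range(min(cur_idx, len(target))):
        target[j] = ignore_index
    for tokenized_len, speaker in zip(tokenized_lens[1:], speakers):
        if speaker == "human":
            for j in range(cur_idx + 2, min(cur_idx + tokenized_len, len(target))):
                target[j] = ignore_index
        cur_idx += tokenized_len
    return target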
def _add_speaker_and_signal(header, source, get_conversation=True):
"""Add speaker and start/end signal on each round."""
BEGIN_SIGNAL = "### "
END_SIGNAL = "\n"
conversation = header
for sentence in source:
from_str = sentence["from"]
if from_str.lower() == "human":
from_str = conversation_lib.default_conversation.roles[0]
elif from_str.lower() == "gpt":
from_str = conversation_lib.default_conversation.roles[1]
else:
from_str = "unknown"
sentence["value"] = BEGIN_SIGNAL + from_str + ": " + sentence["value"] + END_SIGNAL
if get_conversation:
conversation += sentence["value"]
conversation += BEGIN_SIGNAL
return conversation
def preprocess_multimodal(sources: Sequence[str], data_args: DataArguments) -> Dict:
is_multimodal = data_args.is_multimodal
if not is_multimodal:
return sources
for source in sources:
for sentence in source:
# TODO maybe this should be changed for interleaved data?
# if DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
# only check for num_im=1
num_im = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
if num_im == 1 and DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "").strip()
sentence["value"] = DEFAULT_IMAGE_TOKEN + "\n" + sentence["value"]
sentence["value"] = sentence["value"].strip()
if "mmtag" in conversation_lib.default_conversation.version:
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "<Image>" + DEFAULT_IMAGE_TOKEN + "</Image>")
replace_token = DEFAULT_IMAGE_TOKEN
if data_args.mm_use_im_start_end:
replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
# For videoInstruct-100k noisy_data. TODO: Ask Yuanhan to clean the data instead of leaving the noise code here.
sentence["value"] = sentence["value"].replace("QA_GT_caption_based_noisy", "")
return sources
def preprocess_llama_2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2
# Mask targets
sep = "[/INST] "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_gemma(sources: List[List[Dict[str, str]]], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv: conversation_lib.Conversation = conversation_lib.default_conversation.copy()
roles: Dict[str, str] = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations: List[str] = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source: List[Dict[str, str]] = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role: str = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids: torch.Tensor = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids: torch.Tensor = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets: torch.Tensor = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.GEMMA
# Mask target
sep: str = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len: int = int(target.ne(tokenizer.pad_token_id).sum())
rounds: List[str] = conversation.split(conv.sep)
re_rounds = []
for conv_idx in range(0, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2]))
cur_len = 1 # Ignore
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep # Re-append sep because split on this
# Now "".join(parts)==rou
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer)) - 1 # Ignore
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1 # Ignore
else:
round_len = len(tokenizer(rou).input_ids) - 1 # Ignore
instruction_len = len(tokenizer(parts[0]).input_ids) - 1 # Ignore
round_len += 2 # sep: \n takes 2 tokens
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"warning: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
im_start, im_end = tokenizer.additional_special_tokens_ids
nl_tokens = tokenizer("\n").input_ids
_system = tokenizer("system").input_ids + nl_tokens
_user = tokenizer("user").input_ids + nl_tokens
_assistant = tokenizer("assistant").input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
input_id += system
target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
assert len(input_id) == len(target)
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
if has_image and "<image>" in sentence["value"]:
num_im = len(re.findall(DEFAULT_IMAGE_TOKEN, sentence["value"]))
# multi image can start with text first
if num_im == 1:
assert sentence["value"].startswith("<image>"), sentence["value"]
matches = re.findall("<image>", sentence["value"])
num_image = len(matches)
_input_id = tokenizer(role).input_ids + nl_tokens + [IMAGE_TOKEN_INDEX] * num_image + nl_tokens + tokenizer(sentence["value"][len("<image>") * num_image :]).input_ids + [im_end] + nl_tokens
else:
_input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
input_id += _input_id
if role == "<|im_start|>user":
_target = [im_start] + [IGNORE_INDEX] * (len(_input_id) - 3) + [im_end] + nl_tokens
elif role == "<|im_start|>assistant":
_target = [im_start] + [IGNORE_INDEX] * len(tokenizer(role).input_ids) + _input_id[len(tokenizer(role).input_ids) + 1 : -2] + [im_end] + nl_tokens
else:
raise NotImplementedError
target += _target
assert len(input_id) == len(target)
# input_id += [tokenizer.pad_token_id] * (max_len - len(input_id))
# target += [IGNORE_INDEX] * (max_len - len(target))
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
# attention_mask=input_ids.ne(tokenizer.pad_token_id), # tensor(bs x seq_len)
)
def preprocess_llama3(
sources,
tokenizer: transformers.PreTrainedTokenizer,
has_image: bool = False,
max_len=2048,
system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
# roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
roles = {"human": "user", "gpt": "assistant"}
# Add the image token to the tokenizer as a special token
tokenizer.add_tokens(["<image>"], special_tokens=True)
image_token_index = tokenizer.convert_tokens_to_ids("<image>")
bos_token_id = tokenizer.convert_tokens_to_ids("<|begin_of_text|>")
start_header_id = tokenizer.convert_tokens_to_ids("<|start_header_id|>")
end_header_id = tokenizer.convert_tokens_to_ids("<|end_header_id|>")
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
unmask_tokens = ["<|begin_of_text|>", "<|start_header_id|>", "<|end_header_id|>", "<|eot_id|>", "\n\n"]
unmask_tokens_idx = [tokenizer.convert_tokens_to_ids(tok) for tok in unmask_tokens]
# After a tokenizer update, calling the llama3 tokenizer
# automatically prepends the bos id, so strip it here. ヽ(`⌒´)ノ
def safe_tokenizer_llama3(text):
input_ids = tokenizer(text).input_ids
if input_ids[0] == bos_token_id:
input_ids = input_ids[1:]
return input_ids
nl_tokens = tokenizer.convert_tokens_to_ids("\n\n")
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
# New version, use apply chat template
# Build system message for each sentence
input_id += tokenizer.apply_chat_template([{"role" : "system", "content" : system_message}])
target += [IGNORE_INDEX] * len(input_id)
for conv in source:
# Make sure llava data can load
try:
role = conv["role"]
content = conv["content"]
except KeyError:
role = conv["from"]
content = conv["value"]
role = roles.get(role, role)
conv = [{"role" : role, "content" : content}]
# First is bos token we don't need here
encode_id = tokenizer.apply_chat_template(conv)[1:]
input_id += encode_id
if role in ["user", "system"]:
target += [IGNORE_INDEX] * len(encode_id)
else:
target += encode_id
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
for idx, encode_id in enumerate(input_id):
if encode_id in unmask_tokens_idx:
target[idx] = encode_id
if encode_id == image_token_index:
input_id[idx] = IMAGE_TOKEN_INDEX
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
def preprocess_v1(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.TWO
# Mask targets
sep = conv.sep + conv.roles[1] + ": "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
round_len -= 1
instruction_len -= 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_mpt(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.MPT
# Mask targets
sep = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep)
re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt
for conv_idx in range(3, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2])) # user + gpt
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 1
if i != 0 and getattr(tokenizer, "legacy", False) and IS_TOKENIZER_GREATER_THAN_0_14:
round_len += 1
instruction_len += 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (#turns={len(re_rounds)} ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_plain(
sources: Sequence[str],
tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
# add end signal and concatenate together
conversations = []
for source in sources:
assert len(source) == 2
assert DEFAULT_IMAGE_TOKEN in source[0]["value"]
source[0]["value"] = DEFAULT_IMAGE_TOKEN
conversation = source[0]["value"] + source[1]["value"] + conversation_lib.default_conversation.sep
conversations.append(conversation)
# tokenize conversations
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
tokenized_len = len(tokenizer_image_token(source[0]["value"], tokenizer))
target[:tokenized_len] = IGNORE_INDEX
return dict(input_ids=input_ids, labels=targets)
def preprocess(sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
"""
Given a list of sources, each is a conversation list. This transform:
1. Add signal '### ' at the beginning of each sentence, with end signal '\n';
2. Concatenate conversations together;
3. Tokenize the concatenated conversation;
4. Make a deepcopy as the target. Mask human words with IGNORE_INDEX.
"""
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
return preprocess_plain(sources, tokenizer)
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
return preprocess_llama_2(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version.startswith("v1"):
return preprocess_v1(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "mpt":
return preprocess_mpt(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "qwen":
return preprocess_qwen(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "gemma":
return preprocess_gemma(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "llama_v3":
return preprocess_llama3(sources, tokenizer, has_image=has_image)
# add end signal and concatenate together
conversations = []
for source in sources:
header = f"{conversation_lib.default_conversation.system}\n\n"
conversation = _add_speaker_and_signal(header, source)
conversations.append(conversation)
# tokenize conversations
def get_tokenize_len(prompts):
return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts]
if has_image:
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
else:
conversations_tokenized = _tokenize_fn(conversations, tokenizer)
input_ids = conversations_tokenized["input_ids"]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
if has_image:
tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source])
else:
tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"]
speakers = [sentence["from"] for sentence in source]
_mask_targets(target, tokenized_lens, speakers)
return dict(input_ids=input_ids, labels=targets)
class LazySupervisedDataset(Dataset):
def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer, data_args: DataArguments):
super(LazySupervisedDataset, self).__init__()
self.tokenizer = tokenizer
self.list_data_dict = []
# Handle multiple JSON files specified in the data_path
if "{" in data_path and "}" in data_path:
base_path, file_pattern = re.match(r"^(.*)\{(.*)\}\.json$", data_path).groups()
file_names = file_pattern.split(",")
rank0_print(f"Loading {file_names} from {base_path}")
data_args.dataset_paths = []
for file_name in file_names:
data_args.dataset_paths.append(f"{base_path}{file_name}.json")
full_path = f"{base_path}{file_name}.json"
rank0_print(f"Loading {full_path}")
with open(full_path, "r") as file:
cur_data_dict = json.load(file)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {full_path}")
self.list_data_dict.extend(cur_data_dict)
elif data_path.endswith(".yaml"):
with open(data_path, "r") as file:
yaml_data = yaml.safe_load(file)
datasets = yaml_data.get("datasets")
# file should be in the format of:
# datasets:
# - json_path: xxxx1.json
# sampling_strategy: first:1000
# - json_path: xxxx2.json
# sampling_strategy: end:3000
# - json_path: xxxx3.json
# sampling_strategy: random:999
data_args.dataset_paths = [dataset.get("json_path") for dataset in datasets]
for dataset in datasets:
json_path = dataset.get("json_path")
sampling_strategy = dataset.get("sampling_strategy", "all")
sampling_number = None
rank0_print(f"Loading {json_path} with {sampling_strategy} sampling strategy")
with open(json_path, "r") as json_file:
cur_data_dict = json.load(json_file)
if ":" in sampling_strategy:
sampling_strategy, sampling_number = sampling_strategy.split(":")
if "%" in sampling_number:
sampling_number = math.ceil(int(sampling_number.split("%")[0]) * len(cur_data_dict) / 100)
else:
sampling_number = int(sampling_number)
# Apply the sampling strategy
if sampling_strategy == "first" and sampling_number is not None:
cur_data_dict = cur_data_dict[:sampling_number]
elif sampling_strategy == "end" and sampling_number is not None:
cur_data_dict = cur_data_dict[-sampling_number:]
elif sampling_strategy == "random" and sampling_number is not None:
random.shuffle(cur_data_dict)
cur_data_dict = cur_data_dict[:sampling_number]
rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}")
self.list_data_dict.extend(cur_data_dict)
else:
data_args.dataset_paths = [data_path]
rank0_print(f"Loading {data_path}")
with open(data_path, "r") as file:
cur_data_dict = json.load(file)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {data_path}")
self.list_data_dict.extend(cur_data_dict)
rank0_print("Formatting inputs...Skip in lazy mode")
self.tokenizer = tokenizer
self.data_args = data_args
def __len__(self):
return len(self.list_data_dict)
@property
def lengths(self):
length_list = []
for sample in self.list_data_dict:
img_tokens = 128 if "image" in sample else 0
length_list.append(sum(len(conv["value"].split()) for conv in sample["conversations"]) + img_tokens)
return length_list
@property
def modality_lengths(self):
length_list = []
for sample in self.list_data_dict:
cur_len = sum(len(conv["value"].split()) for conv in sample["conversations"])
cur_len = cur_len if ("image" in sample) or ("video" in sample) else -cur_len
length_list.append(cur_len)
return length_list
def process_image(self, image_file):
image_folder = self.data_args.image_folder
processor = self.data_args.image_processor
# print(f"\n\nInspecting the image path, folder = {image_folder}, image={image_file}\n\n")
try:
image = Image.open(os.path.join(image_folder, image_file)).convert("RGB")
except Exception as exn:
print(f"Failed to open image {image_file}. Exception:", exn)
raise exn
image_size = image.size
if self.data_args.image_aspect_ratio == "highres":
image = process_highres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
elif self.data_args.image_aspect_ratio == "anyres" or "anyres_max" in self.data_args.image_aspect_ratio:
image = process_anyres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
elif self.data_args.image_aspect_ratio == "crop_split":
image = process_highres_image_crop_split(image, self.data_args)
elif self.data_args.image_aspect_ratio == "pad":
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
else:
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
return image, image_size, "image"
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
# TODO: define number of retries somewhere else
num_base_retries = 3
num_final_retries = 300
# try the current sample first
for attempt_idx in range(num_base_retries):
try:
sample = self._get_item(i)
return sample
except Exception as e:
# sleep 1s in case it is a cloud disk issue
print(f"[Try #{attempt_idx}] Failed to fetch sample {i}. Exception:", e)
time.sleep(1)
# try other samples, in case it is file corruption issue
for attempt_idx in range(num_base_retries):
try:
next_index = min(i + 1, len(self.list_data_dict) - 1)
# sample_idx = random.choice(range(len(self)))
sample = self._get_item(next_index)
return sample
except Exception as e:
# no need to sleep
print(f"[Try other #{attempt_idx}] Failed to fetch sample {next_index}. Exception:", e)
pass
try:
sample = self._get_item(i)
return sample
except Exception as e:
raise e
def _get_item(self, i) -> Dict[str, torch.Tensor]:
sources = self.list_data_dict[i]
if isinstance(i, int):
sources = [sources]
assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME
if "image" in sources[0]:
image_file = self.list_data_dict[i]["image"]
if type(image_file) is list:
image = [self.process_image(f) for f in image_file]
else:
image = [self.process_image(image_file)]
sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args)
elif "video" in sources[0]:
video_file = self.list_data_dict[i]["video"]
video_folder = self.data_args.video_folder
video_file = os.path.join(video_folder, video_file)
suffix = video_file.split(".")[-1]
if not os.path.exists(video_file):
print("File {} does not exist!".format(video_file))
try:
video = process_video_with_pyav(video_file, self.data_args)
# using videoreader
# if "shareVideoGPTV" not in video_file and "liangke" not in video_file:
# vr = VideoReader(video_file, ctx=cpu(0))
# total_frame_num = len(vr)
# avg_fps = round(vr.get_avg_fps() / self.data_args.video_fps)
# frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
# if self.data_args.frames_upbound > 0:
# if len(frame_idx) > self.data_args.frames_upbound:
# uniform_sampled_frames = np.linspace(0, total_frame_num - 1, self.data_args.frames_upbound, dtype=int)
# frame_idx = uniform_sampled_frames.tolist()
# video = vr.get_batch(frame_idx).asnumpy()
# video = np.array(video)
# else:
# if "liangke" in video_file:
# video_file = self.list_data_dict[i]["video"]
# frame_files = [os.path.join(video_file, f) for f in os.listdir(video_file) if os.path.isfile(os.path.join(video_file, f))]
# frame_files.sort() # Ensure the frames are sorted if they are named sequentially
# # TODO: Hard CODE: Determine the indices for uniformly sampling 10 frames
# num_frames_to_sample = 10
# total_frames = len(frame_files)
# sampled_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)
# # Read and store the sampled frames
# video = []
# for idx in sampled_indices:
# frame_path = frame_files[idx]
# try:
# with Image.open(frame_path) as img:
# frame = img.convert("RGB")
# video.append(frame)
# except IOError:
# print(f"Failed to read frame at path: {frame_path}")
processor = self.data_args.image_processor
image = processor.preprocess(video, return_tensors="pt")["pixel_values"]
image = [(image, video[0].size, "video")]
sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args)
except Exception as e:
print(f"Error: {e}")
print(f"Failed to read video file: {video_file}")
return self._get_item((i + 1) % len(self.list_data_dict))  # wrap around so the last index does not run past the end
else:
sources = copy.deepcopy([e["conversations"] for e in sources])
has_image = ("image" in self.list_data_dict[i]) or ("video" in self.list_data_dict[i])
data_dict = preprocess(sources, self.tokenizer, has_image=has_image)
if "prompt" in data_dict:
prompt = data_dict["prompt"]
else:
prompt = None
if isinstance(i, int):
data_dict = dict(input_ids=data_dict["input_ids"][0], labels=data_dict["labels"][0])
# image exists in the data
if "image" in self.list_data_dict[i]:
data_dict["image"] = image
elif "video" in self.list_data_dict[i]:
data_dict["image"] = image
elif self.data_args.is_multimodal:
# image does not exist in the data, but the model is multimodal
crop_size = self.data_args.image_processor.crop_size
data_dict["image"] = [
(torch.zeros(1, 3, crop_size["height"], crop_size["width"]), (crop_size["width"], crop_size["height"]), "text"),
]
# prompt exists in the data
if prompt is not None:
data_dict["prompt"] = prompt
data_dict["id"] = self.list_data_dict[i].get("id", i)
return data_dict
@dataclass
class DataCollatorForSupervisedDataset(object):
"""Collate examples for supervised fine-tuning."""
tokenizer: transformers.PreTrainedTokenizer
def pad_sequence(self, input_ids, batch_first, padding_value):
if self.tokenizer.padding_side == "left":
input_ids = [torch.flip(_input_ids, [0]) for _input_ids in input_ids]
input_ids = torch.nn.utils.rnn.pad_sequence(input_ids, batch_first=batch_first, padding_value=padding_value)
if self.tokenizer.padding_side == "left":
input_ids = torch.flip(input_ids, [1])
return input_ids
def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
input_ids, labels = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels"))
# input_ids, labels, ids = tuple([instance[key] for instance in instances] for key in ("input_ids", "labels", "id"))
input_ids = [_input_ids[: self.tokenizer.model_max_length] for _input_ids in input_ids]
labels = [_labels[: self.tokenizer.model_max_length] for _labels in labels]
if self.tokenizer.pad_token_id is None:
# self.tokenizer.pad_token_id = self.tokenizer.eos_token_id # FIXME: this could only be triggered for llama3 model.
self.tokenizer.pad_token_id = 0 # This gets the best result. Don't know why.
input_ids = self.pad_sequence(input_ids, batch_first=True, padding_value=self.tokenizer.pad_token_id)
labels = self.pad_sequence(labels, batch_first=True, padding_value=IGNORE_INDEX)
batch = dict(input_ids=input_ids, labels=labels.long() if labels.dtype == torch.int32 else labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id))
# batch = dict(input_ids=input_ids, labels=labels, attention_mask=input_ids.ne(self.tokenizer.pad_token_id), ids=ids)
if "image" in instances[0]:
images = [instance["image"] for instance in instances]
batch["image_sizes"] = [im[1] for im_list in images for im in im_list]
batch["modalities"] = [im[2] for im_list in images for im in im_list]
images = [im[0] for im_list in images for im in im_list]
if all(x is not None and x.shape == images[0].shape for x in images):
# Image: (N, P, C, H, W)
# Video: (N, F, C, H, W)
batch["images"] = torch.stack(images)
else:
batch["images"] = images
if "prompt" in instances[0]:
batch["prompts"] = [instance["prompt"] for instance in instances]
return batch
def make_supervised_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict:
"""Make dataset and collator for supervised fine-tuning."""
train_dataset = LazySupervisedDataset(tokenizer=tokenizer, data_path=data_args.data_path, data_args=data_args)
data_collator = DataCollatorForSupervisedDataset(tokenizer=tokenizer)
return dict(train_dataset=train_dataset, eval_dataset=None, data_collator=data_collator)
def get_model(model_args, training_args, bnb_model_from_pretrained_args):
assert training_args.attn_implementation
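from packaging import version as _pkg_version  # local import: compare versions numerically, not as strings
if training_args.attn_implementation == "sdpa" and _pkg_version.parse(torch.__version__) < _pkg_version.parse("2.1.2"):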
raise ValueError("The 'sdpa' attention implementation requires torch version 2.1.2 or higher.")
customized_kwargs = dict()
customized_kwargs.update(bnb_model_from_pretrained_args)
cfg_pretrained = None
overwrite_config = {}
if any(
[
model_args.rope_scaling_factor is not None,
model_args.rope_scaling_type is not None,
model_args.mm_spatial_pool_stride is not None,
model_args.mm_spatial_pool_out_channels is not None,
model_args.mm_spatial_pool_mode is not None,
model_args.mm_resampler_type is not None,
]
):
cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path)
if model_args.use_pos_skipping is not None and model_args.pos_skipping_range is not None:
overwrite_config["use_pos_skipping"] = model_args.use_pos_skipping
overwrite_config["pos_skipping_range"] = model_args.pos_skipping_range
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None:
overwrite_config["rope_scaling"] = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if training_args.model_max_length is None:
training_args.model_max_length = cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor
overwrite_config["max_sequence_length"] = training_args.model_max_length
assert training_args.model_max_length == int(cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor), (
f"model_max_length: {training_args.model_max_length}, max_position_embeddings: {cfg_pretrained.max_position_embeddings}, rope_scaling_factor: {model_args.rope_scaling_factor}"
)
# overwrite_config["max_sequence_length"] = model_args.max_sequence_length
# overwrite_config["tokenizer_model_max_length"] = model_args.tokenizer_model_max_length
if model_args.mm_spatial_pool_stride is not None and model_args.mm_spatial_pool_out_channels is not None and model_args.mm_spatial_pool_mode is not None and model_args.mm_resampler_type is not None:
overwrite_config["mm_resampler_type"] = model_args.mm_resampler_type
overwrite_config["mm_spatial_pool_stride"] = model_args.mm_spatial_pool_stride
overwrite_config["mm_spatial_pool_out_channels"] = model_args.mm_spatial_pool_out_channels
overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode
if overwrite_config:
assert cfg_pretrained is not None, "cfg_pretrained is None"
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(cfg_pretrained, k, v)
customized_kwargs["config"] = cfg_pretrained
if model_args.model_class_name is not None:
actual_model_class_name = f"{model_args.model_class_name}ForCausalLM"
model_class = getattr(transformers, actual_model_class_name)
rank0_print(f"Using model class {model_class} from {model_args.model_class_name}")
model = model_class.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif model_args.vision_tower is not None:
if "mixtral" in model_args.model_name_or_path.lower():
model = LlavaMixtralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
elif "mistral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
model = LlavaMistralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
or "yi" in model_args.model_name_or_path.lower()
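# NOTE: `and` binds tighter than `or`, so this clause only matches names containing both substrings.
or ("nous-hermes" in model_args.model_name_or_path.lower() and "wizard-2" in model_args.model_name_or_path.lower())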
):
model = LlavaLlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif "qwen" in model_args.model_name_or_path.lower() or "longva" in model_args.model_name_or_path.lower():
if "moe" in model_args.model_name_or_path.lower() or "A14B" in model_args.model_name_or_path:
model = LlavaQwenMoeForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock])
else:
model = LlavaQwenForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif "gemma" in model_args.model_name_or_path.lower():
model = LlavaGemmaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
else:
raise ValueError(f"Unknown model class {model_args}")
else:
model = transformers.LlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
return model
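# Illustrative sketch, not part of the original training script: the RoPE-scaling
# branch of get_model() above derives the context length as
# max_position_embeddings * rope_scaling_factor and asserts that model_max_length
# agrees with it. The helper name below and any numbers passed to it are
# hypothetical and only restate that arithmetic.
def _example_scaled_max_length(max_position_embeddings, rope_scaling_factor, model_max_length=None):
    """Return the context length implied by RoPE scaling, or validate a given one."""
    expected = int(max_position_embeddings * rope_scaling_factor)
    if model_max_length is None:
        # No explicit length was requested, so fall back to the scaled value.
        return expected
    if model_max_length != expected:
        raise ValueError(f"model_max_length {model_max_length} does not match the scaled length {expected}")
    return model_max_length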
def train(attn_implementation=None):
global local_rank
parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if training_args.verbose_logging:
rank0_print(f"Inspecting experiment hyperparameters:\n")
rank0_print(f"model_args = {vars(model_args)}\n\n")
rank0_print(f"data_args = {vars(data_args)}\n\n")
rank0_print(f"training_args = {vars(training_args)}\n\n")
# rank0_print(f"evaluation_args = {vars(evaluation_args)}\n\n")
local_rank = training_args.local_rank
compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
bnb_model_from_pretrained_args = {}
if training_args.bits in [4, 8]:
from transformers import BitsAndBytesConfig
bnb_model_from_pretrained_args.update(
dict(
device_map={"": training_args.device},
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
quantization_config=BitsAndBytesConfig(
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=training_args.double_quant,
bnb_4bit_quant_type=training_args.quant_type, # {'fp4', 'nf4'}
),
)
)
model = get_model(model_args, training_args, bnb_model_from_pretrained_args)
model.config.use_cache = False
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None:
model.config.rope_scaling = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if model_args.freeze_backbone:
model.model.requires_grad_(False)
if training_args.bits in [4, 8]:
from peft import prepare_model_for_kbit_training
model.config.torch_dtype = torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
if training_args.gradient_checkpointing:
if hasattr(model, "enable_input_require_grads"):
model.enable_input_require_grads()
else:
def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
if training_args.lora_enable:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=training_args.lora_r,
lora_alpha=training_args.lora_alpha,
target_modules=find_all_linear_names(model),
lora_dropout=training_args.lora_dropout,
bias=training_args.lora_bias,
task_type="CAUSAL_LM",
)
if training_args.bits == 16:
if training_args.bf16:
model.to(torch.bfloat16)
if training_args.fp16:
model.to(torch.float16)
rank0_print("Adding LoRA adapters...")
model = get_peft_model(model, lora_config)
if "mistral" in model_args.model_name_or_path.lower() or "mixtral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="left")
elif "qwen" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right")
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
or "yi" in model_args.model_name_or_path.lower()
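# NOTE: `and` binds tighter than `or`, so this clause only matches names containing both substrings.
or ("nous-hermes" in model_args.model_name_or_path.lower() and "wizard-2" in model_args.model_name_or_path.lower())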
):
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
model_max_length=training_args.model_max_length,
padding_side="right",
use_fast=False,
)
else:
raise ValueError(f"No tokenizer configuration for model {model_args.model_name_or_path}")
rank0_print(f"Prompt version: {model_args.version}")
if model_args.version == "v0":
if tokenizer.pad_token is None:
smart_tokenizer_and_embedding_resize(
special_tokens_dict=dict(pad_token="[PAD]"),
tokenizer=tokenizer,
model=model,
)
elif model_args.version == "v0.5":
tokenizer.pad_token = tokenizer.unk_token
else:
if tokenizer.unk_token is not None:
tokenizer.pad_token = tokenizer.unk_token
if model_args.version in conversation_lib.conv_templates:
conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
else:
conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]
if model_args.vision_tower is not None:
model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
vision_tower = model.get_vision_tower()
vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
data_args.image_processor = vision_tower.image_processor
data_args.is_multimodal = True
model.config.image_aspect_ratio = data_args.image_aspect_ratio
if data_args.image_grid_pinpoints is not None:
if isinstance(data_args.image_grid_pinpoints, str) and "x" in data_args.image_grid_pinpoints:
try:
patch_size = data_args.image_processor.size[0]
except Exception as e:
patch_size = data_args.image_processor.size["shortest_edge"]
assert patch_size in [224, 336, 384, 448, 512], "patch_size should be in [224, 336, 384, 448, 512]"
# Use regex to extract the range from the input string
matches = re.findall(r"\((\d+)x(\d+)\)", data_args.image_grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
# Generate a matrix of tuples from (range_start[0], range_start[1]) to (range_end[0], range_end[1])
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
# Multiply all elements by patch_size
data_args.image_grid_pinpoints = [[dim * patch_size for dim in pair] for pair in grid_pinpoints]
elif isinstance(data_args.image_grid_pinpoints, str):
data_args.image_grid_pinpoints = ast.literal_eval(data_args.image_grid_pinpoints)
model.config.image_grid_pinpoints = data_args.image_grid_pinpoints
model.config.image_crop_resolution = data_args.image_crop_resolution
model.config.image_split_resolution = data_args.image_split_resolution
model.config.tokenizer_padding_side = tokenizer.padding_side
model.config.tokenizer_model_max_length = tokenizer.model_max_length
# Decide which parts of the model to train
if model_args.mm_tunable_parts is None: # traditional way of deciding which part to train
model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
model.config.tune_mm_vision_resampler = training_args.tune_mm_vision_resampler = model_args.tune_mm_vision_resampler
if model_args.tune_mm_mlp_adapter or model_args.tune_mm_vision_resampler:
model.requires_grad_(False)
if model_args.tune_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
if model_args.tune_mm_vision_resampler:
for p in model.get_model().vision_resampler.parameters():
p.requires_grad = True
model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
if training_args.freeze_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = False
model.config.freeze_mm_vision_resampler = training_args.freeze_mm_vision_resampler
if training_args.freeze_mm_vision_resampler:
for p in model.get_model().vision_resampler.parameters():
p.requires_grad = False
model.config.unfreeze_mm_vision_tower = model_args.unfreeze_mm_vision_tower
if model_args.unfreeze_mm_vision_tower:
vision_tower.requires_grad_(True)
else:
vision_tower.requires_grad_(False)
else:
rank0_print(f"Using mm_tunable_parts: {model_args.mm_tunable_parts}")
model.config.mm_tunable_parts = training_args.mm_tunable_parts = model_args.mm_tunable_parts
# Set the entire model to not require gradients by default
model.requires_grad_(False)
vision_tower.requires_grad_(False)
model.get_model().mm_projector.requires_grad_(False)
model.get_model().vision_resampler.requires_grad_(False)
# Parse the mm_tunable_parts to decide which parts to unfreeze
tunable_parts = model_args.mm_tunable_parts.split(",")
if "mm_mlp_adapter" in tunable_parts:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
if "mm_vision_resampler" in tunable_parts:
for p in model.get_model().vision_resampler.parameters():
p.requires_grad = True
if "mm_vision_tower" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" in name:
param.requires_grad_(True)
if "mm_language_model" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" not in name and "mm_projector" not in name and "vision_resampler" not in name:
param.requires_grad_(True)
total_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters())
trainable_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters() if p.requires_grad)
rank0_print(f"Total parameters: ~{total_params/1e6:.2f} M")
rank0_print(f"Trainable parameters: ~{trainable_params/1e6:.2f} M")
if training_args.bits in [4, 8]:
model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device)
model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_projector_lr = training_args.mm_projector_lr
model.config.mm_vision_tower_lr = training_args.mm_vision_tower_lr
training_args.use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
if training_args.bits in [4, 8]:
from peft.tuners.lora import LoraLayer
for name, module in model.named_modules():
if isinstance(module, LoraLayer):
if training_args.bf16:
module = module.to(torch.bfloat16)
if "norm" in name:
module = module.to(torch.float32)
if "lm_head" in name or "embed_tokens" in name:
if hasattr(module, "weight"):
if training_args.bf16 and module.weight.dtype == torch.float32:
module = module.to(torch.bfloat16)
data_module = make_supervised_data_module(tokenizer=tokenizer, data_args=data_args)
trainer = LLaVATrainer(model=model, tokenizer=tokenizer, args=training_args, **data_module)
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
trainer.train(resume_from_checkpoint=True)
else:
trainer.train()
trainer.save_state()
model.config.use_cache = True
if training_args.lora_enable:
state_dict = get_peft_state_maybe_zero_3(model.named_parameters(), training_args.lora_bias)
non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters())
if training_args.local_rank == 0 or training_args.local_rank == -1:
if hasattr(model, "config"):
model.config.save_pretrained(training_args.output_dir)
if hasattr(model, "generation_config"):
model.generation_config.save_pretrained(training_args.output_dir)
model.save_pretrained(training_args.output_dir, state_dict=state_dict)
torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, "non_lora_trainables.bin"))
else:
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
rank0_print(f"Model saved to {training_args.output_dir}")
if __name__ == "__main__":
train()
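# Illustrative sketch, not part of the original script: train() above expands an
# `image_grid_pinpoints` string such as "(1x1),...,(2x2)" into a list of pixel
# resolutions. The helper name is hypothetical; the regex and the expansion rule
# follow the code in train(), and the sample spec is an assumption.
import re


def _example_expand_grid_pinpoints(spec, patch_size):
    """Expand a "(AxB),...,(CxD)" range into [[rows * patch_size, cols * patch_size], ...]."""
    matches = re.findall(r"\((\d+)x(\d+)\)", spec)
    if not matches:
        raise ValueError(f"no (AxB) pairs found in {spec!r}")
    # The first and last pairs bound the grid; anything in between is ignored.
    start = tuple(map(int, matches[0]))
    end = tuple(map(int, matches[-1]))
    return [
        [i * patch_size, j * patch_size]
        for i in range(start[0], end[0] + 1)
        for j in range(start[1], end[1] + 1)
    ]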
================================================
FILE: xtuner-eval_niah/longva/train/train_dpo.py
================================================
# Adapted from https://github.com/lm-sys/FastChat. Below is the original copyright:
# Adapted from tatsu-lab@stanford_alpaca. Below is the original copyright:
# Copyright 2023 Rohan Taori, Ishaan Gulrajani, Tianyi Zhang, Yann Dubois, Xuechen Li
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import copy
import deepspeed
from dataclasses import dataclass, field
import json
import logging
import pathlib
from typing import Dict, Optional, Sequence, List
import ast
import yaml
import time
import random
import math
import re
import torch
import transformers
import tokenizers
from longva.constants import IGNORE_INDEX, DEFAULT_IMAGE_TOKEN, DEFAULT_IM_START_TOKEN, DEFAULT_IM_END_TOKEN, IMAGE_TOKEN_INDEX
from torch.utils.data import Dataset
from longva.train.llava_trainer import LLaVADPOTrainer
from data_processing.utils import load_jsonl, load_json
from longva import conversation as conversation_lib
from longva.model import *
from longva.model.language_model.llava_qwen import LlavaQwenConfig
from longva.model.language_model.llava_llama import LlavaConfig
from longva.model.language_model.llava_mistral import LlavaMistralConfig
from longva.mm_utils import process_highres_image, process_anyres_image, process_highres_image_crop_split, tokenizer_image_token
from longva.utils import rank0_print
from transformers import AutoConfig
import pickle
from trl.trainer.utils import DPODataCollatorWithPadding
from PIL import Image, ImageFile
from decord import VideoReader, cpu
ImageFile.LOAD_TRUNCATED_IMAGES = True
from packaging import version
from typing import Any
local_rank = None
import numpy as np
IS_TOKENIZER_GREATER_THAN_0_14 = version.parse(tokenizers.__version__) >= version.parse("0.14")
@dataclass
class ModelArguments:
model_name_or_path: Optional[str] = field(default="facebook/opt-125m")
model_class_name: Optional[str] = field(default=None, metadata={"help": "Used to init model class, format is XXXXForCausalLM. e.g. currently XXXX is chosen from LlavaLlama, LlavaMixtral, LlavaMistral, Llama"})
mm_tunable_parts: Optional[str] = field(
default=None, metadata={"help": 'Could be "mm_mlp_adapter", "mm_vision_resampler", "mm_vision_tower,mm_mlp_adapter,mm_language_model", "mm_mlp_adapter,mm_language_model"'}
)
# deciding which part of the multimodal model to tune, will overwrite other previous settings
version: Optional[str] = field(default="v0")
freeze_backbone: bool = field(default=False)
tune_mm_mlp_adapter: bool = field(default=False)
tune_mm_vision_resampler: bool = field(default=False)
vision_tower: Optional[str] = field(default=None)
vision_tower_pretrained: Optional[str] = field(default=None)
unfreeze_mm_vision_tower: bool = field(default=False)
unfreeze_language_model: bool = field(default=False)
mm_vision_select_layer: Optional[int] = field(default=-1) # default to the last layer
pretrain_mm_mlp_adapter: Optional[str] = field(default=None)
mm_projector_type: Optional[str] = field(default="linear")
mm_use_im_start_end: bool = field(default=False)
mm_use_im_patch_token: bool = field(default=True)
mm_patch_merge_type: Optional[str] = field(default="flat")
mm_vision_select_feature: Optional[str] = field(default="patch")
mm_resampler_type: Optional[str] = field(default=None)
mm_mask_drop_mode: str = field(default="fixed")
mm_mask_drop_skip_percentage: float = field(default=0.0)
mm_mask_drop_ratio: float = field(default=0.25)
mm_mask_drop_ratio_upper: Optional[float] = field(default=None)
mm_mask_drop_ratio_lower: Optional[float] = field(default=None)
mm_spatial_pool_stride: Optional[int] = field(default=None)
mm_spatial_pool_mode: str = field(default="average")
mm_spatial_pool_out_channels: Optional[int] = field(default=None)
mm_perceiver_depth: Optional[int] = field(default=3)
mm_perceiver_latents: Optional[int] = field(default=32)
mm_perceiver_ff_mult: Optional[float] = field(default=4)
mm_perceiver_pretrained: Optional[str] = field(default=None)
mm_qformer_depth: Optional[int] = field(default=3)
mm_qformer_latents: Optional[int] = field(default=32)
mm_qformer_pretrained: Optional[str] = field(default=None)
rope_scaling_factor: Optional[float] = field(default=None)
rope_scaling_type: Optional[str] = field(default=None)
s2: Optional[bool] = field(default=False)
s2_scales: Optional[str] = field(default="336,672,1008")
@dataclass
class DataArguments:
data_path: str = field(default=None, metadata={"help": "Path to the training data, in llava's instruction.json format. Supporting multiple json files via /path/to/{a,b,c}.json"})
lazy_preprocess: bool = False
is_multimodal: bool = False
image_folder: Optional[str] = field(default=None)
video_folder: Optional[str] = field(default=None)
video_fps: Optional[int] = field(default=1)
image_aspect_ratio: str = "square"
image_grid_pinpoints: Optional[str] = field(default=None)
image_crop_resolution: int = 384
image_split_resolution: int = 384
input_prompt: Optional[str] = field(default=None)
refine_prompt: Optional[bool] = field(default=False)
frames_upbound: Optional[int] = field(default=0)
num_sample: Optional[int] = field(default=None)
@dataclass
class TrainingArguments(transformers.TrainingArguments):
cache_dir: Optional[str] = field(default=None)
optim: str = field(default="adamw_torch")
remove_unused_columns: bool = field(default=False)
freeze_mm_mlp_adapter: bool = field(default=False)
freeze_mm_vision_resampler: bool = field(default=False)
mpt_attn_impl: Optional[str] = field(default="triton")
model_max_length: int = field(
default=4096,
metadata={"help": "Maximum sequence length. Sequences will be right padded (and possibly truncated)."},
)
double_quant: bool = field(default=True, metadata={"help": "Compress the quantization statistics through double quantization."})
quant_type: str = field(default="nf4", metadata={"help": "Quantization data type to use. Should be one of `fp4` or `nf4`."})
bits: int = field(default=16, metadata={"help": "How many bits to use."})
lora_enable: bool = False
lora_r: int = 64
lora_alpha: int = 16
lora_dropout: float = 0.05
lora_weight_path: str = ""
lora_bias: str = "none"
mm_projector_lr: Optional[float] = None
mm_vision_tower_lr: Optional[float] = None
group_by_varlen: bool = field(default=False)
group_by_modality_length: bool = field(default=False)
group_by_modality_length_auto: bool = field(default=False)
auto_find_batch_size: bool = field(default=False)
gradient_checkpointing: bool = field(default=True)
verbose_logging: bool = field(default=False)
attn_implementation: str = field(default="flash_attention_2", metadata={"help": "Use transformers attention implementation."})
dpo_alpha: float = field(default=1.0)
beta: float = field(default=0.1)
gamma: float = field(default=1.0)
generate_during_eval: bool = field(default=False)
precompute_ref_log_probs: bool = field(default=False)
def maybe_zero_3(param, ignore_status=False, name=None):
from deepspeed import zero
from deepspeed.runtime.zero.partition_parameters import ZeroParamStatus
if hasattr(param, "ds_id"):
if param.ds_status == ZeroParamStatus.NOT_AVAILABLE:
if not ignore_status:
logging.warning(f"{name}: param.ds_status == ZeroParamStatus.NOT_AVAILABLE: {param.ds_status}")
with zero.GatheredParameters([param]):
param = param.data.detach().cpu().clone()
else:
param = param.detach().cpu().clone()
return param
# Borrowed from peft.utils.get_peft_model_state_dict
def get_peft_state_maybe_zero_3(named_params, bias):
if bias == "none":
to_return = {k: t for k, t in named_params if "lora_" in k}
elif bias == "all":
to_return = {k: t for k, t in named_params if "lora_" in k or "bias" in k}
elif bias == "lora_only":
to_return = {}
maybe_lora_bias = {}
lora_bias_names = set()
for k, t in named_params:
if "lora_" in k:
to_return[k] = t
bias_name = k.split("lora_")[0] + "bias"
lora_bias_names.add(bias_name)
elif "bias" in k:
maybe_lora_bias[k] = t
for k, t in maybe_lora_bias.items():
if k in lora_bias_names:
to_return[k] = t
else:
raise NotImplementedError
to_return = {k: maybe_zero_3(v, ignore_status=True) for k, v in to_return.items()}
return to_return
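# Illustrative sketch, not part of the original script: the intended behaviour of
# the "lora_only" bias mode above, shown on plain (name, value) pairs instead of
# real parameters. The helper name and the sample parameter names used with it are
# hypothetical; no DeepSpeed gathering is performed here.
def _example_select_lora_only(named_params):
    """Keep LoRA weights plus only those biases that belong to a LoRA-wrapped module."""
    selected = {}
    candidate_biases = {}
    lora_bias_names = set()
    for name, value in named_params:
        if "lora_" in name:
            selected[name] = value
            # "q.lora_A.weight" -> "q." + "bias" == "q.bias"
            lora_bias_names.add(name.split("lora_")[0] + "bias")
        elif "bias" in name:
            candidate_biases[name] = value
    for name, value in candidate_biases.items():
        if name in lora_bias_names:
            selected[name] = value
    return selected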
def get_peft_state_non_lora_maybe_zero_3(named_params, require_grad_only=True):
to_return = {k: t for k, t in named_params if "lora_" not in k}
if require_grad_only:
to_return = {k: t for k, t in to_return.items() if t.requires_grad}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def get_mm_adapter_state_maybe_zero_3(named_params, keys_to_match):
to_return = {k: t for k, t in named_params if any(key_match in k for key_match in keys_to_match)}
to_return = {k: maybe_zero_3(v, ignore_status=True).cpu() for k, v in to_return.items()}
return to_return
def find_all_linear_names(model):
cls = torch.nn.Linear
lora_module_names = set()
multimodal_keywords = ["mm_projector", "vision_tower", "vision_resampler"]
for name, module in model.named_modules():
if any(mm_keyword in name for mm_keyword in multimodal_keywords):
continue
if isinstance(module, cls):
names = name.split(".")
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if "lm_head" in lora_module_names: # needed for 16-bit
lora_module_names.remove("lm_head")
return list(lora_module_names)
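# Illustrative sketch, not part of the original script: the name filtering done by
# find_all_linear_names() above, applied to a plain list of module names that are
# assumed to be torch.nn.Linear layers. The helper name and the sample names are
# hypothetical; the real function inspects model.named_modules().
def _example_lora_target_names(linear_module_names):
    """Return the leaf names LoRA would target, skipping multimodal parts and lm_head."""
    multimodal_keywords = ("mm_projector", "vision_tower", "vision_resampler")
    targets = set()
    for name in linear_module_names:
        if any(keyword in name for keyword in multimodal_keywords):
            continue
        # Only the last path component is kept, e.g. "q_proj".
        targets.add(name.split(".")[-1])
    targets.discard("lm_head")
    return sorted(targets)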
def safe_save_model_for_hf_trainer(trainer: transformers.Trainer, output_dir: str):
"""Collect the state dict and dump it to disk."""
if hasattr(trainer.args, "tune_mm_mlp_adapter") and trainer.args.tune_mm_mlp_adapter:
check_only_save_mm_adapter_tunnable = True
# only has mm_mlp_adapter and mm_vision_resampler in the tuneable parts
elif hasattr(trainer.args, "mm_tunable_parts") and (len(trainer.args.mm_tunable_parts.split(",")) == 1 and ("mm_mlp_adapter" in trainer.args.mm_tunable_parts or "mm_vision_resampler" in trainer.args.mm_tunable_parts)):
check_only_save_mm_adapter_tunnable = True
else:
check_only_save_mm_adapter_tunnable = False
trainer.accelerator.wait_for_everyone()
torch.cuda.synchronize()
rank0_print(f"Only save projectors: {check_only_save_mm_adapter_tunnable}")
if check_only_save_mm_adapter_tunnable:
# Only save Adapter
keys_to_match = ["mm_projector", "vision_resampler"]
if getattr(trainer.args, "use_im_start_end", False):
keys_to_match.extend(["embed_tokens", "embed_in"])
weight_to_save = get_mm_adapter_state_maybe_zero_3(trainer.model.named_parameters(), keys_to_match)
trainer.model.config.save_pretrained(output_dir)
current_folder = output_dir.split("/")[-1]
parent_folder = os.path.dirname(output_dir)
if trainer.args.local_rank == 0 or trainer.args.local_rank == -1:
if current_folder.startswith("checkpoint-"):
mm_projector_folder = os.path.join(parent_folder, "mm_projector")
os.makedirs(mm_projector_folder, exist_ok=True)
torch.save(weight_to_save, os.path.join(mm_projector_folder, f"{current_folder}.bin"))
else:
torch.save(weight_to_save, os.path.join(output_dir, f"mm_projector.bin"))
return
if trainer.deepspeed:
trainer.save_model(output_dir)
return
state_dict = trainer.model.state_dict()
if trainer.args.should_save:
cpu_state_dict = {key: value.cpu() for key, value in state_dict.items()}
del state_dict
trainer._save(output_dir, state_dict=cpu_state_dict) # noqa
def smart_tokenizer_and_embedding_resize(
special_tokens_dict: Dict,
tokenizer: transformers.PreTrainedTokenizer,
model: transformers.PreTrainedModel,
):
"""Resize tokenizer and embedding.
Note: This is the unoptimized version that may make your embedding size not be divisible by 64.
"""
num_new_tokens = tokenizer.add_special_tokens(special_tokens_dict)
model.resize_token_embeddings(len(tokenizer))
if num_new_tokens > 0:
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
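# Illustrative sketch, not part of the original script: the mean-initialisation
# used by smart_tokenizer_and_embedding_resize() above, written for nested Python
# lists instead of tensors. The helper name is hypothetical.
def _example_mean_init_new_rows(rows, num_new_tokens):
    """Overwrite the last num_new_tokens rows with the mean of the preceding rows."""
    if num_new_tokens <= 0:
        return [list(row) for row in rows]
    old_rows = rows[:-num_new_tokens]
    dim = len(old_rows[0])
    mean_row = [sum(row[d] for row in old_rows) / len(old_rows) for d in range(dim)]
    return [list(row) for row in old_rows] + [list(mean_row) for _ in range(num_new_tokens)]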
def _tokenize_fn(strings: Sequence[str], tokenizer: transformers.PreTrainedTokenizer) -> Dict:
"""Tokenize a list of strings."""
tokenized_list = [
tokenizer(
text,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
)
for text in strings
]
input_ids = labels = [tokenized.input_ids[0] for tokenized in tokenized_list]
input_ids_lens = labels_lens = [tokenized.input_ids.ne(tokenizer.pad_token_id).sum().item() for tokenized in tokenized_list]
return dict(
input_ids=input_ids,
labels=labels,
input_ids_lens=input_ids_lens,
labels_lens=labels_lens,
)
def _mask_targets(target, tokenized_lens, speakers):
# cur_idx = 0
cur_idx = tokenized_lens[0]
tokenized_lens = tokenized_lens[1:]
target[:cur_idx] = IGNORE_INDEX
for tokenized_len, speaker in zip(tokenized_lens, speakers):
if speaker == "human":
target[cur_idx + 2 : cur_idx + tokenized_len] = IGNORE_INDEX
cur_idx += tokenized_len
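# Illustrative sketch, not part of the original script: the masking rule of
# _mask_targets() above, applied to a plain list. The header and every human turn
# (except its first two tokens) are replaced by the ignore value. The helper name
# is hypothetical, and -100 is an assumed value for IGNORE_INDEX.
_EXAMPLE_IGNORE_INDEX = -100


def _example_mask_targets(target, tokenized_lens, speakers):
    """Return a masked copy of target; tokenized_lens[0] is the header length."""
    masked = list(target)
    cur_idx = tokenized_lens[0]
    for i in range(min(cur_idx, len(masked))):
        masked[i] = _EXAMPLE_IGNORE_INDEX
    for tokenized_len, speaker in zip(tokenized_lens[1:], speakers):
        if speaker == "human":
            for i in range(cur_idx + 2, min(cur_idx + tokenized_len, len(masked))):
                masked[i] = _EXAMPLE_IGNORE_INDEX
        cur_idx += tokenized_len
    return masked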
def _add_speaker_and_signal(header, source, get_conversation=True):
"""Add speaker and start/end signal on each round."""
BEGIN_SIGNAL = "### "
END_SIGNAL = "\n"
conversation = header
for sentence in source:
from_str = sentence["from"]
if from_str.lower() == "human":
from_str = conversation_lib.default_conversation.roles[0]
elif from_str.lower() == "gpt":
from_str = conversation_lib.default_conversation.roles[1]
else:
from_str = "unknown"
sentence["value"] = BEGIN_SIGNAL + from_str + ": " + sentence["value"] + END_SIGNAL
if get_conversation:
conversation += sentence["value"]
conversation += BEGIN_SIGNAL
return conversation
def preprocess_multimodal(sources: Sequence[str], data_args: DataArguments) -> Dict:
is_multimodal = data_args.is_multimodal
if not is_multimodal:
return sources
for source in sources:
for sentence in source:
if DEFAULT_IMAGE_TOKEN in sentence["value"] and not sentence["value"].startswith(DEFAULT_IMAGE_TOKEN):
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "").strip()
sentence["value"] = DEFAULT_IMAGE_TOKEN + "\n" + sentence["value"]
sentence["value"] = sentence["value"].strip()
if "mmtag" in conversation_lib.default_conversation.version:
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "<Image>" + DEFAULT_IMAGE_TOKEN + "</Image>")
replace_token = DEFAULT_IMAGE_TOKEN
if data_args.mm_use_im_start_end:
replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
return sources
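# Illustrative sketch, not part of the original script: the image-token
# repositioning step of preprocess_multimodal() above, applied to one string.
# The helper name is hypothetical, and "<image>" is an assumed value for
# DEFAULT_IMAGE_TOKEN.
def _example_move_image_token_first(text, image_token="<image>"):
    """Move the image token to the front of the text, followed by a newline."""
    if image_token in text and not text.startswith(image_token):
        text = text.replace(image_token, "").strip()
        text = image_token + "\n" + text
        text = text.strip()
    return text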
def preprocess_multimodal_movie(sources: Sequence[str], data_args: DataArguments, video_inputs: str) -> Dict:
is_multimodal = data_args.is_multimodal
prompt = None  # stays None when no sentence contains DEFAULT_IMAGE_TOKEN
if not is_multimodal:
return sources
for source in sources:
for sentence in source:
if DEFAULT_IMAGE_TOKEN in sentence["value"]:
prompt = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, "").strip()
replace_token = video_inputs
if data_args.mm_use_im_start_end:
replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)
return sources, prompt
def preprocess_llama_2(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.LLAMA_2
# Mask targets
sep = "[/INST] "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
rank0_print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def make_conv(prompt, answer):
return [
{
"from": "human",
"value": prompt,
},
{
"from": "gpt",
"value": answer,
},
]
def preprocess_gemma(sources: List[List[Dict[str, str]]], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv: conversation_lib.Conversation = conversation_lib.default_conversation.copy()
roles: Dict[str, str] = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations: List[str] = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source: List[Dict[str, str]] = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role: str = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids: torch.Tensor = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids: torch.Tensor = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets: torch.Tensor = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.GEMMA
# Mask target
sep: str = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len: int = int(target.ne(tokenizer.pad_token_id).sum())
rounds: List[str] = conversation.split(conv.sep)
re_rounds = []
for conv_idx in range(0, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2]))
cur_len = 1 # Ignore <bos>
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep # Re-append sep because split on this
# Now "".join(parts)==rou
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer)) - 1 # Ignore <bos>
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1 # Ignore <bos>
else:
round_len = len(tokenizer(rou).input_ids) - 1 # Ignore <bos>
instruction_len = len(tokenizer(parts[0]).input_ids) - 1 # Ignore <bos>
round_len += 2 # sep: <end_of_turn>\n takes 2 tokens
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
rank0_print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_qwen(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False, max_len=2048, system_message: str = "You are a helpful assistant.") -> Dict:
roles = {"human": "<|im_start|>user", "gpt": "<|im_start|>assistant"}
im_start, im_end = tokenizer.additional_special_tokens_ids
nl_tokens = tokenizer("\n").input_ids
_system = tokenizer("system").input_ids + nl_tokens
_user = tokenizer("user").input_ids + nl_tokens
_assistant = tokenizer("assistant").input_ids + nl_tokens
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
system = [im_start] + _system + tokenizer(system_message).input_ids + [im_end] + nl_tokens
input_id += system
target += [im_start] + [IGNORE_INDEX] * (len(system) - 3) + [im_end] + nl_tokens
assert len(input_id) == len(target)
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
if has_image and "<image>" in sentence["value"]:
assert sentence["value"].startswith("<image>"), sentence["value"]
_input_id = tokenizer(role).input_ids + nl_tokens + [IMAGE_TOKEN_INDEX] + nl_tokens + tokenizer(sentence["value"][len("<image>") :]).input_ids + [im_end] + nl_tokens
else:
_input_id = tokenizer(role).input_ids + nl_tokens + tokenizer(sentence["value"]).input_ids + [im_end] + nl_tokens
input_id += _input_id
if role == "<|im_start|>user":
_target = [im_start] + [IGNORE_INDEX] * (len(_input_id) - 3) + [im_end] + nl_tokens
elif role == "<|im_start|>assistant":
_target = [im_start] + [IGNORE_INDEX] * len(tokenizer(role).input_ids) + _input_id[len(tokenizer(role).input_ids) + 1 : -2] + [im_end] + nl_tokens
else:
raise NotImplementedError
target += _target
assert len(input_id) == len(target)
# input_id += [tokenizer.pad_token_id] * (max_len - len(input_id))
# target += [IGNORE_INDEX] * (max_len - len(target))
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
# attention_mask=input_ids.ne(tokenizer.pad_token_id), # tensor(bs x seq_len)
)
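# Illustrative sketch (not part of the original training code): the label
# masking that preprocess_qwen applies to each ChatML-style turn, shown on a
# toy example. `toy_tokenize`, `mask_chatml_turn`, and the integer ids used
# for im_start / im_end / newline are hypothetical stand-ins for the real
# tokenizer and its special tokens.

```python
IGNORE_INDEX = -100  # sentinel that PyTorch's cross-entropy loss ignores by default


def toy_tokenize(text):
    """Hypothetical stand-in tokenizer: one integer id per character."""
    return [ord(ch) for ch in text]


def mask_chatml_turn(role, text, im_start=1, im_end=2, nl=10):
    """Build (input_ids, labels) for a single ChatML-style turn.

    User turns are fully masked apart from the framing tokens; assistant
    turns keep the reply tokens as labels, so only the reply is supervised.
    """
    role_ids = toy_tokenize(role)
    body_ids = toy_tokenize(text)
    input_ids = [im_start] + role_ids + [nl] + body_ids + [im_end, nl]
    if role == "user":
        # Everything between <im_start> and <im_end> is ignored by the loss.
        labels = [im_start] + [IGNORE_INDEX] * (len(input_ids) - 3) + [im_end, nl]
    elif role == "assistant":
        # Mask the role name and the newline after it, keep the reply itself.
        labels = [im_start] + [IGNORE_INDEX] * (len(role_ids) + 1) + body_ids + [im_end, nl]
    else:
        raise NotImplementedError(role)
    assert len(input_ids) == len(labels)
    return input_ids, labels
```

# Under these assumptions, a user turn contributes no supervised tokens other
# than the framing ids, while an assistant turn contributes exactly its reply.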
def preprocess_llama3(
sources,
tokenizer: transformers.PreTrainedTokenizer,
has_image: bool = False,
max_len=2048,
system_message: str = "You are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.",
) -> Dict:
roles = {"human": "<|start_header_id|>user<|end_header_id|>", "gpt": "<|start_header_id|>assistant<|end_header_id|>"}
eot_id = tokenizer.convert_tokens_to_ids("<|eot_id|>")
nl_tokens = tokenizer("\n").input_ids
# Apply prompt templates
input_ids, targets = [], []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != roles["human"]:
source = source[1:]
input_id, target = [], []
system = tokenizer("<|begin_of_text|>").input_ids + tokenizer("<|start_header_id|>system<|end_header_id|>").input_ids + nl_tokens * 2 + tokenizer(system_message).input_ids + [eot_id]
input_id += system
target += [IGNORE_INDEX] * len(system)
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
if has_image and "<image>" in sentence["value"]:
assert sentence["value"].startswith("<image>"), sentence["value"]
_input_id = tokenizer(role).input_ids + nl_tokens * 2 + [IMAGE_TOKEN_INDEX] + tokenizer(sentence["value"][len("<image>") :]).input_ids + [eot_id]
else:
_input_id = tokenizer(role).input_ids + nl_tokens * 2 + tokenizer(sentence["value"]).input_ids + [eot_id]
input_id += _input_id
if role == "<|start_header_id|>user<|end_header_id|>":
_target = [IGNORE_INDEX] * len(_input_id)
elif role == "<|start_header_id|>assistant<|end_header_id|>":
_target = [IGNORE_INDEX] * (len(tokenizer(role).input_ids) + 2) + _input_id[len(tokenizer(role).input_ids) + 2 : -1] + [eot_id]
else:
raise NotImplementedError
target += _target
assert len(input_id) == len(target), f"{len(input_id)} != {len(target)}"
input_ids.append(input_id)
targets.append(target)
input_ids = torch.tensor(input_ids, dtype=torch.long)
targets = torch.tensor(targets, dtype=torch.long)
return dict(
input_ids=input_ids, # tensor(bs x seq_len)
labels=targets, # tensor(bs x seq_len)
)
def preprocess_v1(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.TWO
# Mask targets
sep = conv.sep + conv.roles[1] + ": "
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep2)
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 2
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 2
if i != 0 and not tokenizer.legacy and IS_TOKENIZER_GREATER_THAN_0_14:
round_len -= 1
instruction_len -= 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f" (ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_mpt(sources, tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
conv = conversation_lib.default_conversation.copy()
roles = {"human": conv.roles[0], "gpt": conv.roles[1]}
# Apply prompt templates
conversations = []
for i, source in enumerate(sources):
if roles[source[0]["from"]] != conv.roles[0]:
# Skip the first one if it is not from human
source = source[1:]
conv.messages = []
for j, sentence in enumerate(source):
role = roles[sentence["from"]]
assert role == conv.roles[j % 2], f"{i}"
conv.append_message(role, sentence["value"])
conversations.append(conv.get_prompt())
# Tokenize conversations
if has_image:
input_ids = torch.stack([tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations], dim=0)
else:
input_ids = tokenizer(
conversations,
return_tensors="pt",
padding="longest",
max_length=tokenizer.model_max_length,
truncation=True,
).input_ids
targets = input_ids.clone()
assert conv.sep_style == conversation_lib.SeparatorStyle.MPT
# Mask targets
sep = conv.sep + conv.roles[1]
for conversation, target in zip(conversations, targets):
total_len = int(target.ne(tokenizer.pad_token_id).sum())
rounds = conversation.split(conv.sep)
re_rounds = [conv.sep.join(rounds[:3])] # system + user + gpt
for conv_idx in range(3, len(rounds), 2):
re_rounds.append(conv.sep.join(rounds[conv_idx : conv_idx + 2])) # user + gpt
cur_len = 1
target[:cur_len] = IGNORE_INDEX
for i, rou in enumerate(re_rounds):
if rou == "":
break
parts = rou.split(sep)
if len(parts) != 2:
break
parts[0] += sep
if has_image:
round_len = len(tokenizer_image_token(rou, tokenizer))
instruction_len = len(tokenizer_image_token(parts[0], tokenizer)) - 1
else:
round_len = len(tokenizer(rou).input_ids)
instruction_len = len(tokenizer(parts[0]).input_ids) - 1
if i != 0 and getattr(tokenizer, "legacy", False) and IS_TOKENIZER_GREATER_THAN_0_14:
round_len += 1
instruction_len += 1
target[cur_len : cur_len + instruction_len] = IGNORE_INDEX
cur_len += round_len
target[cur_len:] = IGNORE_INDEX
if cur_len < tokenizer.model_max_length:
if cur_len != total_len:
target[:] = IGNORE_INDEX
print(f"WARNING: tokenization mismatch: {cur_len} vs. {total_len}." f"(#turns={len(re_rounds)} ignored)")
return dict(
input_ids=input_ids,
labels=targets,
)
def preprocess_plain(
sources: Sequence[str],
tokenizer: transformers.PreTrainedTokenizer,
) -> Dict:
# add end signal and concatenate together
conversations = []
for source in sources:
assert len(source) == 2
assert DEFAULT_IMAGE_TOKEN in source[0]["value"]
source[0]["value"] = DEFAULT_IMAGE_TOKEN
conversation = source[0]["value"] + source[1]["value"] + conversation_lib.default_conversation.sep
conversations.append(conversation)
# tokenize conversations
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
tokenized_len = len(tokenizer_image_token(source[0]["value"], tokenizer))
target[:tokenized_len] = IGNORE_INDEX
return dict(input_ids=input_ids, labels=targets)
def preprocess(sources: Sequence[str], tokenizer: transformers.PreTrainedTokenizer, has_image: bool = False) -> Dict:
"""
Given a list of sources, each is a conversation list. This transform:
1. Add signal '### ' at the beginning of each sentence, with end signal '\n';
2. Concatenate conversations together;
3. Tokenize the concatenated conversation;
4. Make a deepcopy as the target. Mask human words with IGNORE_INDEX.
"""
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.PLAIN:
return preprocess_plain(sources, tokenizer)
if conversation_lib.default_conversation.sep_style == conversation_lib.SeparatorStyle.LLAMA_2:
return preprocess_llama_2(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version.startswith("v1"):
return preprocess_v1(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "mpt":
return preprocess_mpt(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "qwen":
return preprocess_qwen(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "gemma":
return preprocess_gemma(sources, tokenizer, has_image=has_image)
if conversation_lib.default_conversation.version == "llama_v3":
return preprocess_llama3(sources, tokenizer, has_image=has_image)
# add end signal and concatenate together
conversations = []
for source in sources:
header = f"{conversation_lib.default_conversation.system}\n\n"
conversation = _add_speaker_and_signal(header, source)
conversations.append(conversation)
# tokenize conversations
def get_tokenize_len(prompts):
return [len(tokenizer_image_token(prompt, tokenizer)) for prompt in prompts]
if has_image:
input_ids = [tokenizer_image_token(prompt, tokenizer, return_tensors="pt") for prompt in conversations]
else:
conversations_tokenized = _tokenize_fn(conversations, tokenizer)
input_ids = conversations_tokenized["input_ids"]
targets = copy.deepcopy(input_ids)
for target, source in zip(targets, sources):
if has_image:
tokenized_lens = get_tokenize_len([header] + [s["value"] for s in source])
else:
tokenized_lens = _tokenize_fn([header] + [s["value"] for s in source], tokenizer)["input_ids_lens"]
speakers = [sentence["from"] for sentence in source]
_mask_targets(target, tokenized_lens, speakers)
return dict(input_ids=input_ids, labels=targets)
def load_data(data_path):
if "jsonl" in data_path:
data_list = load_jsonl(data_path)
else:
data_list = load_json(data_path)
return data_list
class DPODataset(Dataset):
"""Dataset for DPO fine-tuning."""
def __init__(self, data_path: str, tokenizer: transformers.PreTrainedTokenizer, data_args: DataArguments):
super(DPODataset, self).__init__()
# Handle multiple JSON files specified in the data_path
self.list_data_dict = []
if "{" in data_path and "}" in data_path:
base_path, file_pattern = re.match(r"^(.*)\{(.*)\}\.json$", data_path).groups()
file_names = file_pattern.split(",")
rank0_print(f"Loading {file_names} from {base_path}")
data_args.dataset_paths = []
for file_name in file_names:
data_args.dataset_paths.append(f"{base_path}{file_name}.json")
full_path = f"{base_path}{file_name}.json"
rank0_print(f"Loading {full_path}")
cur_data_dict = load_data(full_path)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {full_path}")
self.list_data_dict.extend(cur_data_dict)
elif data_path.endswith(".yaml"):
with open(data_path, "r") as file:
yaml_data = yaml.safe_load(file)
datasets = yaml_data.get("datasets")
# file should be in the format of:
# datasets:
# - json_path: xxxx1.json
# sampling_strategy: first:1000
# - json_path: xxxx2.json
# sampling_strategy: end:3000
# - json_path: xxxx3.json
# sampling_strategy: random:999
data_args.dataset_paths = [dataset.get("json_path") for dataset in datasets]
for dataset in datasets:
json_path = dataset.get("json_path")
sampling_strategy = dataset.get("sampling_strategy", "all")
sampling_number = None
rank0_print(f"Loading {json_path} with {sampling_strategy} sampling strategy")
cur_data_dict = load_data(json_path)
if ":" in sampling_strategy:
sampling_strategy, sampling_number = sampling_strategy.split(":")
if "%" in sampling_number:
sampling_number = math.ceil(int(sampling_number.split("%")[0]) * len(cur_data_dict) / 100)
else:
sampling_number = int(sampling_number)
# Apply the sampling strategy
if sampling_strategy == "first" and sampling_number is not None:
cur_data_dict = cur_data_dict[:sampling_number]
elif sampling_strategy == "end" and sampling_number is not None:
cur_data_dict = cur_data_dict[-sampling_number:]
elif sampling_strategy == "random" and sampling_number is not None:
random.shuffle(cur_data_dict)
cur_data_dict = cur_data_dict[:sampling_number]
rank0_print(f"Loaded {len(cur_data_dict)} samples from {json_path}")
self.list_data_dict.extend(cur_data_dict)
else:
data_args.dataset_paths = [data_path]
rank0_print(f"Loading {data_path}")
cur_data_dict = load_data(data_path)
rank0_print(f"Loaded {len(cur_data_dict)} samples from {data_path}")
self.list_data_dict.extend(cur_data_dict)
rank0_print("Formatting inputs...Skip in lazy mode")
self.tokenizer = tokenizer
self.data_args = data_args
def __len__(self):
return len(self.list_data_dict)
@property
def lengths(self):
length_list = []
for sample in self.list_data_dict:
# Calculate the length of the prompt, answer, chosen, and rejected text
cur_len = len(sample["prompt"].split()) + len(sample["answer"].split()) + len(sample["chosen"].split()) + len(sample["rejected"].split())
# Add additional tokens if an image is present
img_tokens = 128 if "image" in sample else 0
length_list.append(cur_len + img_tokens)
return length_list
@property
def modality_lengths(self):
length_list = []
for sample in self.list_data_dict:
# Calculate the length of the prompt, answer, chosen, and rejected text
cur_len = len(sample["prompt"].split()) + len(sample["answer"].split()) + len(sample["chosen"].split()) + len(sample["rejected"].split())
# If the sample includes a video or an image, the length is positive; otherwise, it is negative
cur_len = cur_len if ("video" in sample or "image" in sample) else -cur_len
length_list.append(cur_len)
return length_list
def process_image(self, image_file):
image_folder = self.data_args.image_folder
processor = self.data_args.image_processor
# print(f"\n\nInspecting the image path, folder = {image_folder}, image={image_file}\n\n")
try:
image = Image.open(os.path.join(image_folder, image_file)).convert("RGB")
except Exception as exn:
print(f"Failed to open image {image_file}. Exception:", exn)
raise exn
image_size = image.size
if self.data_args.image_aspect_ratio == "highres":
image = process_highres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
elif self.data_args.image_aspect_ratio == "anyres" or "anyres" in self.data_args.image_aspect_ratio:
image = process_anyres_image(image, self.data_args.image_processor, self.data_args.image_grid_pinpoints)
elif self.data_args.image_aspect_ratio == "crop_split":
image = process_highres_image_crop_split(image, self.data_args)
elif self.data_args.image_aspect_ratio == "pad":
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
image = expand2square(image, tuple(int(x * 255) for x in processor.image_mean))
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
else:
image = processor.preprocess(image, return_tensors="pt")["pixel_values"][0]
return image, image_size, "image"
def __getitem__(self, i) -> Dict[str, torch.Tensor]:
# TODO: define number of retries somewhere else
num_base_retries = 3
num_final_retries = 300
# try the current sample first
for attempt_idx in range(num_base_retries):
try:
sample = self._get_item(i)
return sample
except Exception as e:
# sleep 1s in case it is a cloud disk issue
print(f"[Try #{attempt_idx}] Failed to fetch sample {i}. Exception:", e)
time.sleep(1)
# try other samples, in case it is a file corruption issue
for attempt_idx in range(num_base_retries):
try:
next_index = min(i + 1, len(self.list_data_dict) - 1)
# sample_idx = random.choice(range(len(self)))
sample = self._get_item(next_index)
return sample
except Exception as e:
# no need to sleep
print(f"[Try other #{attempt_idx}] Failed to fetch sample {next_index}. Exception:", e)
pass
# still fail, most likely to be path issue or cloud disk issue, retry the same sample for longer
# for attempt_idx in range(num_final_retries):
# try:
# sample = self._get_item(i)
# return sample
# except Exception as e:
# # sleep 1s in case it is a cloud disk issue
# print(f"[Final try #{attempt_idx}] Failed to fetch sample {i}. Exception:", e)
# time.sleep(1)
# Finally raise exception on failing.
assert False, "Failed to fetch sample."
def _get_item(self, i) -> Dict[str, torch.Tensor]:
sources = self.list_data_dict[i]
if isinstance(i, int):
sources = [sources]
assert len(sources) == 1, "Don't know why it is wrapped to a list" # FIXME
suffix = None
if "image" in sources[0]:
image_file = self.list_data_dict[i]["image"]
if type(image_file) is list:
image = [self.process_image(f) for f in image_file]
else:
image = [self.process_image(image_file)]
# sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args)
elif "video" in sources[0]: # FIXME: This logic should be largely improved by Yuanhan. It's too messy now.
video_file = self.list_data_dict[i]["video"]
video_folder = self.data_args.video_folder
video_file = os.path.join(video_folder, video_file)
suffix = video_file.split(".")[-1]
if not os.path.exists(video_file):
print("File {} does not exist!".format(video_file))
if suffix == "pkl":
video_info = pickle.load(open(video_file, "rb"))
image = torch.from_numpy(video_info["feats"][:, 1:])
input_prompt = video_info["inputs"].replace("...", "")
# replace the default image token with multiple tokens
input_prompt = input_prompt.replace(DEFAULT_IMAGE_TOKEN, DEFAULT_IMAGE_TOKEN * self.data_args.video_token)
sources, query_prompt = preprocess_multimodal_movie(copy.deepcopy([e["conversations"] for e in sources]), self.data_args, input_prompt)
else: # using videoreader
if "shareVideoGPTV" not in video_file and "liangke" not in video_file:
vr = VideoReader(video_file, ctx=cpu(0))
total_frame_num = len(vr)
avg_fps = round(vr.get_avg_fps() / self.data_args.video_fps)
frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
if self.data_args.frames_upbound > 0:
if len(frame_idx) > self.data_args.frames_upbound:
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, self.data_args.frames_upbound, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
video = vr.get_batch(frame_idx).asnumpy()
video = np.array(video)
else:
if "liangke" in video_file:
video_file = self.list_data_dict[i]["video"]
frame_files = [os.path.join(video_file, f) for f in os.listdir(video_file) if os.path.isfile(os.path.join(video_file, f))]
frame_files.sort() # Ensure the frames are sorted if they are named sequentially
# TODO: Hard CODE: Determine the indices for uniformly sampling 10 frames
num_frames_to_sample = 10
total_frames = len(frame_files)
sampled_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)
# Read and store the sampled frames
video = []
for idx in sampled_indices:
frame_path = frame_files[idx]
try:
with Image.open(frame_path) as img:
frame = img.convert("RGB")
video.append(frame)
except IOError:
print(f"Failed to read frame at path: {frame_path}")
processor = self.data_args.image_processor
image = processor.preprocess(video, return_tensors="pt")["pixel_values"]
image = [(image, video[0].size, "video")]
# sources = preprocess_multimodal(copy.deepcopy([e["conversations"] for e in sources]), self.data_args)
else:
sources = copy.deepcopy([e["conversations"] for e in sources])
has_image = ("image" in self.list_data_dict[i]) or ("video" in self.list_data_dict[i])
# data_dict = preprocess(sources, self.tokenizer, has_image=has_image)
data_dict = copy.deepcopy(self.list_data_dict[i]) # inplace modification following
if "prompt" in data_dict:
prompt = data_dict["prompt"]
prompt = prompt.replace("<image>", "").strip()
prompt = "<image>\n" + prompt
data_dict["prompt"] = prompt
else:
prompt = None
if suffix == "pkl":
prompt = [query_prompt]
# image exists in the data
if "image" in self.list_data_dict[i]:
data_dict["image"] = image
elif "video" in self.list_data_dict[i]:
data_dict["image"] = image
elif self.data_args.is_multimodal:
# image does not exist in the data, but the model is multimodal
crop_size = self.data_args.image_processor.crop_size
data_dict["image"] = [
(torch.zeros(1, 3, crop_size["height"], crop_size["width"]), (crop_size["width"], crop_size["height"]), "text"),
]
# prompt exist in the data
data_dict["has_image"] = has_image
return data_dict
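# Illustrative sketch (not part of the original training code): how the
# `sampling_strategy` strings read from the YAML file in DPODataset.__init__
# ("first:1000", "end:3000", "random:999", or a percentage such as
# "random:10%") select a subset of records. The helper name
# `apply_sampling_strategy` and its `seed` argument are hypothetical; unlike
# the loader above, this sketch shuffles a copy with a private RNG instead of
# shuffling the loaded list in place.

```python
import math
import random


def apply_sampling_strategy(records, strategy="all", seed=None):
    """Return a subset of `records` according to a 'kind:count' strategy string.

    Kinds handled: 'first', 'end', 'random'. Any other value (such as 'all')
    returns every record. The count is either an absolute number or a
    percentage of the dataset size, rounded up.
    """
    if ":" not in strategy:
        return list(records)
    kind, amount = strategy.split(":")
    if "%" in amount:
        count = math.ceil(int(amount.split("%")[0]) * len(records) / 100)
    else:
        count = int(amount)
    if kind == "first":
        return list(records[:count])
    if kind == "end":
        return list(records[-count:])
    if kind == "random":
        shuffled = list(records)  # copy so the caller's list is left untouched
        random.Random(seed).shuffle(shuffled)
        return shuffled[:count]
    return list(records)
```

# Example: apply_sampling_strategy(list(range(10)), "first:25%") keeps the
# first ceil(2.5) = 3 records.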
@dataclass
class DPODataCollator(DPODataCollatorWithPadding):
"""Collate examples for DPO fine-tuning."""
# tokenizer: transformers.PreTrainedTokenizer
def collate(self, batch):
# first, pad everything to the same length
# input_ids, labels = tuple([instance[key] for instance in instances]
# for key in ("input_ids", "labels"))
# input_ids = torch.nn.utils.rnn.pad_sequence(
# input_ids,
# batch_first=True,
# padding_value=self.tokenizer.pad_token_id)
# labels = torch.nn.utils.rnn.pad_sequence(labels,
# batch_first=True,
# padding_value=IGNORE_INDEX)
# input_ids = input_ids[:, :self.tokenizer.model_max_length]
# labels = labels[:, :self.tokenizer.model_max_length]
# batch = dict(
# input_ids=input_ids,
# labels=labels,
# attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
# )
padded_batch = {}
for k in batch[0].keys():
if k.endswith("_input_ids") or k.endswith("_attention_mask") or k.endswith("_labels"):
# if "prompt" in k:
# to_pad = [torch.LongTensor(ex[k][::-1]) for ex in batch]
# else:
to_pad = [torch.LongTensor(ex[k]) for ex in batch]
if k.endswith("_input_ids"):
padding_value = self.tokenizer.pad_token_id
elif k.endswith("_labels"):
padding_value = self.label_pad_token_id
else:
continue
# elif k.endswith("_attention_mask"):
# padding_value = self.padding_value
# else:
# raise ValueError(f"Unexpected key in batch '{k}'")
padded_batch[k] = torch.nn.utils.rnn.pad_sequence(to_pad, batch_first=True, padding_value=padding_value)
# for the prompt, flip back so padding is on left side
# if "prompt" in k:
# padded_batch[k] = padded_batch[k].flip(dims=[1])
else:
padded_batch[k] = [ex[k] for ex in batch]
for k in ["chosen_input_ids", "rejected_input_ids"]:
attn_k = k.replace("input_ids", "attention_mask")
padded_batch[attn_k] = padded_batch[k].ne(self.tokenizer.pad_token_id)
return padded_batch
def tokenize_batch_element(self, prompt: str, chosen: str, rejected: str, has_image: bool = True) -> Dict:
"""Tokenize a single batch element.
At this stage, we don't convert to PyTorch tensors yet; we just handle the truncation
in case the prompt + chosen or prompt + rejected responses is/are too long. First
we truncate the prompt; if we're still too long, we truncate the chosen/rejected.
We also create the labels for the chosen/rejected responses, which are of length equal to
the sum of the length of the prompt and the chosen/rejected response, with
label_pad_token_id for the prompt tokens.
"""
# import pdb; pdb.set_trace()
batch = {}
chosen_sources = make_conv(prompt, chosen)
rejected_sources = make_conv(prompt, rejected)
chosen_data_dict = preprocess([chosen_sources], self.tokenizer, has_image=has_image)
# chosen_data_dict['attention_mask'] = chosen_data_dict["input_ids"].ne(self.tokenizer.pad_token_id)
rejected_data_dict = preprocess([rejected_sources], self.tokenizer, has_image=has_image)
# rejected_data_dict['attention_mask'] = rejected_data_dict["input_ids"].ne(self.tokenizer.pad_token_id)
chosen_data_dict = {k: v[0] for k, v in chosen_data_dict.items()}
rejected_data_dict = {k: v[0] for k, v in rejected_data_dict.items()}
for k, toks in {
"chosen": chosen_data_dict,
"rejected": rejected_data_dict,
}.items():
for type_key, tokens in toks.items():
if type_key == "token_type_ids":
continue
batch[f"{k}_{type_key}"] = tokens
return batch
def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, Any]:
tokenized_batch = []
Xs, keys = [], []
for feature in features:
prompt = feature["prompt"]
chosen = feature["chosen"]
rejected = feature["rejected"]
has_image = feature["has_image"]
# Xs.append(feature[has_X])
# keys.append(has_X)
batch_element = self.tokenize_batch_element(prompt, chosen, rejected, has_image=has_image)
tokenized_batch.append(batch_element)
# return collated batch
padded_batch = self.collate(tokenized_batch)
# import pdb;pdb.set_trace()
if "image" in features[0]:
# instances[1]['image'][0][0].shape
# torch.Size([5, 3, 224, 224])
images = [instance["image"] for instance in features]
padded_batch["image_sizes"] = [im[1] for im_list in images for im in im_list]
padded_batch["modalities"] = [im[2] for im_list in images for im in im_list]
images = [im[0] for im_list in images for im in im_list]
# import pdb;pdb.set_trace()
padded_batch["images"] = images
# padded_batch["images"] =[padded_batch["modalities"], images]
return padded_batch
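# Illustrative sketch (not part of the original training code): the padding
# scheme used by DPODataCollator.collate, written with plain Python lists so
# it runs without torch. `pad_right` and `collate_pairs` are hypothetical
# names; the real collator uses torch.nn.utils.rnn.pad_sequence and takes the
# pad ids from the tokenizer and from label_pad_token_id.

```python
def pad_right(sequences, padding_value):
    """Right-pad a list of integer lists to the length of the longest one."""
    max_len = max(len(seq) for seq in sequences)
    return [list(seq) + [padding_value] * (max_len - len(seq)) for seq in sequences]


def collate_pairs(batch, pad_token_id=0, label_pad_token_id=-100):
    """Pad *_input_ids and *_labels fields, then derive attention masks."""
    padded = {}
    for key in batch[0]:
        column = [example[key] for example in batch]
        if key.endswith("_input_ids"):
            padded[key] = pad_right(column, pad_token_id)
        elif key.endswith("_labels"):
            padded[key] = pad_right(column, label_pad_token_id)
        else:
            padded[key] = column  # other fields are passed through as lists
    # Masks are rebuilt from the padded ids: True wherever the id is not padding.
    for key in ("chosen_input_ids", "rejected_input_ids"):
        mask_key = key.replace("input_ids", "attention_mask")
        padded[mask_key] = [[tok != pad_token_id for tok in row] for row in padded[key]]
    return padded
```

# Because the mask is derived by comparing against the pad id, any genuine
# token that shares that id would also be masked; this sketch assumes the pad
# id never appears inside real text.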
def make_dpo_data_module(tokenizer: transformers.PreTrainedTokenizer, data_args) -> Dict:
"""Make the dataset for DPO fine-tuning."""
train_dataset = DPODataset(tokenizer=tokenizer, data_path=data_args.data_path, data_args=data_args)
return train_dataset
def get_model(model_args, training_args, bnb_model_from_pretrained_args):
assert training_args.attn_implementation
if training_args.attn_implementation == "sdpa" and torch.__version__ < "2.1.2":
raise ValueError("The 'sdpa' attention implementation requires torch version 2.1.2 or higher.")
######################### Overwrite config #########################
customized_kwargs = dict()
customized_kwargs.update(bnb_model_from_pretrained_args)
overwrite_config = {}
cfg_pretrained = None
if "qwen" in model_args.model_name_or_path.lower():
cfg_pretrained = LlavaQwenConfig.from_pretrained(model_args.model_name_or_path)
elif "mistral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
cfg_pretrained = LlavaMistralConfig.from_pretrained(model_args.model_name_or_path)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
or "yi" in model_args.model_name_or_path.lower()
or "nous-hermes" in model_args.model_name_or_path.lower()
and "wizard-2" in model_args.model_name_or_path.lower()
):
cfg_pretrained = LlavaConfig.from_pretrained(model_args.model_name_or_path)
else:
cfg_pretrained = AutoConfig.from_pretrained(model_args.model_name_or_path)
if model_args.rope_scaling_factor is not None and model_args.rope_scaling_type is not None and cfg_pretrained is not None:
overwrite_config["rope_scaling"] = {
"factor": model_args.rope_scaling_factor,
"type": model_args.rope_scaling_type,
}
if training_args.model_max_length is None:
training_args.model_max_length = cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor
overwrite_config["max_sequence_length"] = training_args.model_max_length
assert training_args.model_max_length == int(cfg_pretrained.max_position_embeddings * model_args.rope_scaling_factor), (
f"model_max_length: {training_args.model_max_length}, max_position_embeddings: {cfg_pretrained.max_position_embeddings}, rope_scaling_factor: {model_args.rope_scaling_factor}"
)
# overwrite_config["max_sequence_length"] = model_args.max_sequence_length
# overwrite_config["tokenizer_model_max_length"] = model_args.tokenizer_model_max_length
if model_args.mm_spatial_pool_stride is not None and model_args.mm_spatial_pool_out_channels is not None and model_args.mm_spatial_pool_mode is not None and model_args.mm_resampler_type is not None and cfg_pretrained is not None:
overwrite_config["mm_resampler_type"] = model_args.mm_resampler_type
overwrite_config["mm_spatial_pool_stride"] = model_args.mm_spatial_pool_stride
overwrite_config["mm_spatial_pool_out_channels"] = model_args.mm_spatial_pool_out_channels
overwrite_config["mm_spatial_pool_mode"] = model_args.mm_spatial_pool_mode
if overwrite_config:
rank0_print(f"Overwriting config with {overwrite_config}")
for k, v in overwrite_config.items():
setattr(cfg_pretrained, k, v)
customized_kwargs["config"] = cfg_pretrained
######################### Finish Overwrite ###########################
ref_model = None
if model_args.model_class_name is not None:
actual_model_class_name = f"{model_args.model_class_name}ForCausalLM"
model_class = getattr(transformers, actual_model_class_name)
rank0_print(f"Using model class {model_class} from {model_args.model_class_name}")
model = model_class.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif model_args.vision_tower is not None:
if "mixtral" in model_args.model_name_or_path.lower():
model = LlavaMixtralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
from transformers.models.mixtral.modeling_mixtral import MixtralSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [MixtralSparseMoeBlock])
elif "mistral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
model = LlavaMistralForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif (
"wizardlm-2" in model_args.model_name_or_path.lower()
or "vicuna" in model_args.model_name_or_path.lower()
or "llama" in model_args.model_name_or_path.lower()
or "yi" in model_args.model_name_or_path.lower()
or "nous-hermes" in model_args.model_name_or_path.lower()
and "wizard-2" in model_args.model_name_or_path.lower()
):
model = LlavaLlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
if "zero3" in training_args.deepspeed:
rank0_print("#### Initialize reference model #####")
ref_model = LlavaLlamaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif "qwen" in model_args.model_name_or_path.lower() or "quyen" in model_args.model_name_or_path.lower() or "longva" in model_args.model_name_or_path.lower():
if "moe" in model_args.model_name_or_path.lower():
model = LlavaQwenMoeForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
from transformers.models.qwen2_moe.modeling_qwen2_moe import Qwen2MoeSparseMoeBlock
deepspeed.utils.set_z3_leaf_modules(model, [Qwen2MoeSparseMoeBlock])
else:
model = LlavaQwenForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
if "zero3" in training_args.deepspeed:
rank0_print("#### Initialize reference model #####")
ref_model = LlavaQwenForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
elif "gemma" in model_args.model_name_or_path.lower():
model = LlavaGemmaForCausalLM.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
attn_implementation=training_args.attn_implementation,
torch_dtype=(torch.bfloat16 if training_args.bf16 else None),
low_cpu_mem_usage=True,
**customized_kwargs,
)
else:
raise ValueError(f"Unknown model class {model_args}")
else:
model = transformers.LlamaForCausalLM.from_pretrained(
model_args.model_name_or_path, cache_dir=training_args.cache_dir, attn_implementation=training_args.attn_implementation, torch_dtype=(torch.bfloat16 if training_args.bf16 else None), **customized_kwargs
)
return model, ref_model
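# Illustrative sketch (not part of the original training script): train() below
# expands an image_grid_pinpoints range such as "(1x1)...(3x3)" into pixel
# resolutions. The standalone function name `expand_grid_pinpoints` is
# hypothetical; the logic mirrors the range branch in train().

```python
import re


def expand_grid_pinpoints(spec: str, encoder_size: int) -> list:
    """Expand a "(AxB)...(CxD)" range into [rows * size, cols * size] pairs."""
    matches = re.findall(r"\((\d+)x(\d+)\)", spec)
    if not matches:
        raise ValueError(f"No (AxB) entries found in {spec!r}")
    start = tuple(map(int, matches[0]))
    end = tuple(map(int, matches[-1]))
    # The first dimension varies slowest: (1, 1), (1, 2), ..., (2, 1), ...
    grid = [
        (i, j)
        for i in range(start[0], end[0] + 1)
        for j in range(start[1], end[1] + 1)
    ]
    # Scale every grid cell count by the vision encoder's input resolution.
    return [[dim * encoder_size for dim in pair] for pair in grid]


pinpoints = expand_grid_pinpoints("(1x1)...(2x2)", 224)
```

# With a 224-pixel encoder, the 2x2 range above yields four resolutions, from
# [224, 224] up to [448, 448].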
def train(attn_implementation=None):
global local_rank
parser = transformers.HfArgumentParser((ModelArguments, DataArguments, TrainingArguments))
model_args, data_args, training_args = parser.parse_args_into_dataclasses()
if training_args.verbose_logging:
rank0_print("Inspecting experiment hyperparameters:\n")
rank0_print(f"model_args = {vars(model_args)}\n\n")
rank0_print(f"data_args = {vars(data_args)}\n\n")
rank0_print(f"training_args = {vars(training_args)}\n\n")
# rank0_print(f"evaluation_args = {vars(evaluation_args)}\n\n")
local_rank = training_args.local_rank
compute_dtype = torch.float16 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
bnb_model_from_pretrained_args = {}
if training_args.bits in [4, 8]:
from transformers import BitsAndBytesConfig
bnb_model_from_pretrained_args.update(
dict(
device_map={"": training_args.device},
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
quantization_config=BitsAndBytesConfig(
load_in_4bit=training_args.bits == 4,
load_in_8bit=training_args.bits == 8,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=compute_dtype,
bnb_4bit_use_double_quant=training_args.double_quant,
bnb_4bit_quant_type=training_args.quant_type, # {'fp4', 'nf4'}
),
)
)
model, ref_model = get_model(model_args, training_args, bnb_model_from_pretrained_args)
model.config.use_cache = False
if model_args.freeze_backbone:
model.model.requires_grad_(False)
if training_args.bits in [4, 8]:
from peft import prepare_model_for_kbit_training
model.config.torch_dtype = torch.float32 if training_args.fp16 else (torch.bfloat16 if training_args.bf16 else torch.float32)
model = prepare_model_for_kbit_training(model, use_gradient_checkpointing=training_args.gradient_checkpointing)
if training_args.gradient_checkpointing:
if hasattr(model, "enable_input_require_grads"):
model.enable_input_require_grads()
if ref_model is not None:
ref_model.enable_input_require_grads()
else:
def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
if ref_model is not None:
ref_model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
if training_args.lora_enable:
from peft import LoraConfig, get_peft_model
lora_config = LoraConfig(
r=training_args.lora_r,
lora_alpha=training_args.lora_alpha,
target_modules=find_all_linear_names(model),
lora_dropout=training_args.lora_dropout,
bias=training_args.lora_bias,
task_type="CAUSAL_LM",
)
if training_args.bits == 16:
if training_args.bf16:
model.to(torch.bfloat16)
if training_args.fp16:
model.to(torch.float16)
rank0_print("Adding LoRA adapters...")
model = get_peft_model(model, lora_config)
if "mpt" in model_args.model_name_or_path:
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right")
elif "mistral" in model_args.model_name_or_path.lower() or "mixtral" in model_args.model_name_or_path.lower() or "zephyr" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="left")
elif "qwen" in model_args.model_name_or_path.lower():
tokenizer = transformers.AutoTokenizer.from_pretrained(model_args.model_name_or_path, cache_dir=training_args.cache_dir, model_max_length=training_args.model_max_length, padding_side="right")
else: # for all other models
tokenizer = transformers.AutoTokenizer.from_pretrained(
model_args.model_name_or_path,
cache_dir=training_args.cache_dir,
model_max_length=training_args.model_max_length,
padding_side="right",
use_fast=False,
)
rank0_print(f"Prompt version: {model_args.version}")
if model_args.version == "v0":
if tokenizer.pad_token is None:
smart_tokenizer_and_embedding_resize(
special_tokens_dict=dict(pad_token="[PAD]"),
tokenizer=tokenizer,
model=model,
)
elif model_args.version == "v0.5":
tokenizer.pad_token = tokenizer.unk_token
else:
if tokenizer.unk_token is not None:
tokenizer.pad_token = tokenizer.unk_token
if model_args.version in conversation_lib.conv_templates:
conversation_lib.default_conversation = conversation_lib.conv_templates[model_args.version]
else:
conversation_lib.default_conversation = conversation_lib.conv_templates["vicuna_v1"]
if model_args.vision_tower is not None:
model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
vision_tower = model.get_vision_tower()
vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
data_args.image_processor = vision_tower.image_processor
data_args.is_multimodal = True
model.config.image_aspect_ratio = data_args.image_aspect_ratio
if data_args.image_grid_pinpoints is not None:
# for input like "(1x1)...(3x3)", expand to [(1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (2, 3), (3, 1), (3, 2), (3, 3)], then scale each entry by the vision encoder size
if "x" in data_args.image_grid_pinpoints and "..." in data_args.image_grid_pinpoints:
vis_encoder_size = data_args.image_processor.size[0]
matches = re.findall(r"\((\d+)x(\d+)\)", data_args.image_grid_pinpoints)
range_start = tuple(map(int, matches[0]))
range_end = tuple(map(int, matches[-1]))
grid_pinpoints = [(i, j) for i in range(range_start[0], range_end[0] + 1) for j in range(range_start[1], range_end[1] + 1)]
grid_pinpoints = [[dim * vis_encoder_size for dim in pair] for pair in grid_pinpoints]
data_args.image_grid_pinpoints = grid_pinpoints
elif "x" in data_args.image_grid_pinpoints:
vis_encoder_size = data_args.image_processor.size[0]
assert vis_encoder_size in [224, 336, 384, 448, 512], "vis_encoder_size should be in [224, 336, 384, 448, 512]"
grid_pinpoints = data_args.image_grid_pinpoints.replace(" ", "").replace("x", ",")[1:-1].split("),(")
data_args.image_grid_pinpoints = [[int(x) * vis_encoder_size for x in item.split(",")] for item in grid_pinpoints]
else:
data_args.image_grid_pinpoints = ast.literal_eval(data_args.image_grid_pinpoints) # for backward compatibility
model.config.image_grid_pinpoints = data_args.image_grid_pinpoints
model.config.image_crop_resolution = data_args.image_crop_resolution
model.config.image_split_resolution = data_args.image_split_resolution
model.config.tokenizer_padding_side = tokenizer.padding_side
model.config.tokenizer_model_max_length = tokenizer.model_max_length
### Decide which parts of the model to train
if model_args.mm_tunable_parts is None: # traditional way of deciding which part to train
model.config.tune_mm_mlp_adapter = training_args.tune_mm_mlp_adapter = model_args.tune_mm_mlp_adapter
model.config.tune_mm_vision_resampler = training_args.tune_mm_vision_resampler = model_args.tune_mm_vision_resampler
if model_args.tune_mm_mlp_adapter or model_args.tune_mm_vision_resampler:
model.requires_grad_(False)
if model_args.tune_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
if model_args.tune_mm_vision_resampler:
for p in model.get_model().vision_resampler.parameters():
p.requires_grad = True
model.config.freeze_mm_mlp_adapter = training_args.freeze_mm_mlp_adapter
if training_args.freeze_mm_mlp_adapter:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = False
model.config.freeze_mm_vision_resampler = training_args.freeze_mm_vision_resampler
if training_args.freeze_mm_vision_resampler:
for p in model.get_model().vision_resampler.parameters():
p.requires_grad = False
model.config.unfreeze_mm_vision_tower = model_args.unfreeze_mm_vision_tower
if model_args.unfreeze_mm_vision_tower:
vision_tower.requires_grad_(True)
else:
vision_tower.requires_grad_(False)
else:
rank0_print(f"Using mm_tunable_parts: {model_args.mm_tunable_parts}")
model.config.mm_tunable_parts = training_args.mm_tunable_parts = model_args.mm_tunable_parts
# Set the entire model to not require gradients by default
model.requires_grad_(False)
vision_tower.requires_grad_(False)
model.get_model().mm_projector.requires_grad_(False)
model.get_model().vision_resampler.requires_grad_(False)
# Parse the mm_tunable_parts to decide which parts to unfreeze
tunable_parts = model_args.mm_tunable_parts.split(",")
if "mm_mlp_adapter" in tunable_parts:
for p in model.get_model().mm_projector.parameters():
p.requires_grad = True
if "mm_vision_resampler" in tunable_parts:
for p in model.get_model().vision_resampler.parameters():
p.requires_grad = True
if "mm_vision_tower" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" in name:
param.requires_grad_(True)
if "mm_language_model" in tunable_parts:
for name, param in model.named_parameters():
if "vision_tower" not in name and "mm_projector" not in name and "vision_resampler" not in name:
param.requires_grad_(True)
total_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters())
trainable_params = sum(p.ds_numel if hasattr(p, "ds_numel") else p.numel() for p in model.parameters() if p.requires_grad)
rank0_print(f"Total parameters: ~{total_params/1e6:.2f}M")
rank0_print(f"Trainable parameters: ~{trainable_params/1e6:.2f}M")
if training_args.bits in [4, 8]:
model.get_model().mm_projector.to(dtype=compute_dtype, device=training_args.device)
model.config.mm_use_im_start_end = data_args.mm_use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_projector_lr = training_args.mm_projector_lr
model.config.mm_vision_tower_lr = training_args.mm_vision_tower_lr
training_args.use_im_start_end = model_args.mm_use_im_start_end
model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
if ref_model is not None:
ref_model.get_model().initialize_vision_modules(model_args=model_args, fsdp=training_args.fsdp)
ref_vision_tower = ref_model.get_vision_tower()
ref_vision_tower.to(dtype=torch.bfloat16 if training_args.bf16 else torch.float16, device=training_args.device)
ref_model.config.image_aspect_ratio = data_args.image_aspect_ratio
ref_model.config.image_grid_pinpoints = data_args.image_grid_pinpoints
ref_model.config.image_crop_resolution = data_args.image_crop_resolution
ref_model.config.image_split_resolution = data_args.image_split_resolution
ref_model.config.tokenizer_padding_side = tokenizer.padding_side
ref_model.config.tokenizer_model_max_length = tokenizer.model_max_length
ref_model.config.mm_use_im_start_end = data_args.mm_use_im_start_end
ref_model.config.mm_use_im_patch_token = model_args.mm_use_im_patch_token
ref_model.initialize_vision_tokenizer(model_args, tokenizer=tokenizer)
parameter_names = [n for n, _ in ref_model.named_parameters()]
for param_name in parameter_names:
param = ref_model.get_parameter(param_name)
param.requires_grad = False
ref_model.eval()
if training_args.bits in [4, 8]:
from peft.tuners.lora import LoraLayer
for name, module in model.named_modules():
if isinstance(module, LoraLayer):
if training_args.bf16:
module = module.to(torch.bfloat16)
if "norm" in name:
module = module.to(torch.float32)
if "lm_head" in name or "embed_tokens" in name:
if hasattr(module, "weight"):
if training_args.bf16 and module.weight.dtype == torch.float32:
module = module.to(torch.bfloat16)
train_dataset = make_dpo_data_module(tokenizer=tokenizer, data_args=data_args)
data_collator = DPODataCollator(
tokenizer,
label_pad_token_id=IGNORE_INDEX,
pad_token_id=tokenizer.pad_token_id,
)
trainer = LLaVADPOTrainer(
model,
ref_model,
args=training_args,
dpo_alpha=training_args.dpo_alpha,
beta=training_args.beta,
gamma=training_args.gamma,
train_dataset=train_dataset,
eval_dataset=None,
data_collator=data_collator,
tokenizer=tokenizer,
max_length=training_args.model_max_length,
generate_during_eval=False, # training_args.generate_during_eval,
precompute_ref_log_probs=training_args.precompute_ref_log_probs,
)
if list(pathlib.Path(training_args.output_dir).glob("checkpoint-*")):
trainer.train(resume_from_checkpoint=True)
else:
trainer.train()
trainer.save_state()
model.config.use_cache = True
if training_args.lora_enable:
state_dict = get_peft_state_maybe_zero_3(model.named_parameters(), training_args.lora_bias)
non_lora_state_dict = get_peft_state_non_lora_maybe_zero_3(model.named_parameters())
if training_args.local_rank == 0 or training_args.local_rank == -1:
if hasattr(model, "config"):
model.config.save_pretrained(training_args.output_dir)
if hasattr(model, "generation_config"):
model.generation_config.save_pretrained(training_args.output_dir)
model.save_pretrained(training_args.output_dir, state_dict=state_dict)
torch.save(non_lora_state_dict, os.path.join(training_args.output_dir, "non_lora_trainables.bin"))
else:
safe_save_model_for_hf_trainer(trainer=trainer, output_dir=training_args.output_dir)
rank0_print(f"Model saved to {training_args.output_dir}")
if __name__ == "__main__":
train()
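# Illustrative sketch (not part of the original training script): how a
# comma-separated mm_tunable_parts string maps to trainable parameters.
# Assumption: the projector and resampler are matched by name substring here,
# whereas train() unfreezes them through their module objects. The function
# name `select_trainable` and the parameter names below are hypothetical.

```python
def select_trainable(param_names, mm_tunable_parts):
    """Return the set of parameter names that would be unfrozen."""
    parts = mm_tunable_parts.split(",")
    multimodal_keys = ("vision_tower", "mm_projector", "vision_resampler")
    trainable = set()
    for name in param_names:
        if "mm_mlp_adapter" in parts and "mm_projector" in name:
            trainable.add(name)
        if "mm_vision_resampler" in parts and "vision_resampler" in name:
            trainable.add(name)
        if "mm_vision_tower" in parts and "vision_tower" in name:
            trainable.add(name)
        # The language model is everything that is not a multimodal component.
        if "mm_language_model" in parts and not any(key in name for key in multimodal_keys):
            trainable.add(name)
    return trainable


example_names = [
    "model.vision_tower.blocks.0.weight",
    "model.mm_projector.0.weight",
    "model.vision_resampler.query",
    "model.layers.0.self_attn.q_proj.weight",
    "lm_head.weight",
]
selected = select_trainable(example_names, "mm_mlp_adapter,mm_language_model")
```

# Everything starts frozen; only the parts named in the string are re-enabled,
# so the vision tower and resampler stay frozen in the example above.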
================================================
FILE: xtuner-eval_niah/longva/train/train_mem.py
================================================
from longva.train.train import train
if __name__ == "__main__":
train()
================================================
FILE: xtuner-eval_niah/longva/utils.py
================================================
import datetime
import logging
import logging.handlers
import os
import sys
import numpy as np
import requests
from longva.constants import LOGDIR
server_error_msg = "**NETWORK ERROR DUE TO HIGH TRAFFIC. PLEASE REGENERATE OR REFRESH THIS PAGE.**"
moderation_msg = "I am sorry. Your input may violate our content moderation guidelines. Please avoid using harmful or offensive content."
handler = None
import torch.distributed as dist
try:
import av
except ImportError:
print("Please install pyav to use video processing functions.")
def process_video_with_pyav(video_file, data_args):
container = av.open(video_file)
stream = container.streams.video[0]
# stream.codec_context.skip_frame = "NONKEY" # Optional: Skip non-key frames to speed up
total_frame_num = stream.frames
avg_fps = max(1, round(stream.average_rate / data_args.video_fps))  # sampling stride in frames; clamp to 1 so range() never receives a zero step
frame_idx = [i for i in range(0, total_frame_num, avg_fps)]
if data_args.frames_upbound > 0:
if len(frame_idx) > data_args.frames_upbound:
uniform_sampled_frames = np.linspace(0, total_frame_num - 1, data_args.frames_upbound, dtype=int)
frame_idx = uniform_sampled_frames.tolist()
video_frames = []
for index, frame in enumerate(container.decode(video=0)):
if index in frame_idx:
video_frames.append(frame.to_rgb().to_ndarray())
if len(video_frames) == len(frame_idx): # Stop decoding once we have all needed frames
break
video = np.stack(video_frames)
return video
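# Illustrative sketch (not part of the original module): the frame-index
# selection used by process_video_with_pyav, isolated from decoding. The name
# `sample_frame_indices` is hypothetical. Assumptions: a pure-Python formula
# stands in for np.linspace(..., dtype=int), and the stride is clamped to at
# least 1.

```python
def sample_frame_indices(total_frames, native_fps, target_fps, frames_upbound=0):
    """Pick frame indices at roughly target_fps, capped at frames_upbound frames."""
    # Stride between sampled frames; never below 1.
    step = max(1, round(native_fps / target_fps))
    indices = list(range(0, total_frames, step))
    if frames_upbound > 0 and len(indices) > frames_upbound:
        if frames_upbound == 1:
            return [0]
        last = total_frames - 1
        # Evenly spaced positions from the first to the last frame, truncated to int.
        indices = [int(i * last / (frames_upbound - 1)) for i in range(frames_upbound)]
    return indices


strided = sample_frame_indices(total_frames=10, native_fps=30, target_fps=10)
capped = sample_frame_indices(total_frames=10, native_fps=30, target_fps=10, frames_upbound=3)
```

# When the cap applies, the strided indices are discarded and replaced by a
# uniform spread over the whole video, so the last frame is always included.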
def rank0_print(*args):
if dist.is_initialized():
if dist.get_rank() == 0:
print(f"Rank {dist.get_rank()}: ", *args)
else:
print(*args)
def build_logger(logger_name, logger_filename):
global handler
formatter = logging.Formatter(
fmt="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
datefmt="%Y-%m-%d %H:%M:%S",
)
# Set the format of root handlers
if not logging.getLogger().handlers:
logging.basicConfig(level=logging.INFO)
logging.getLogger().handlers[0].setFormatter(formatter)
# Redirect stdout and stderr to loggers
stdout_logger = logging.getLogger("stdout")
stdout_logger.setLevel(logging.INFO)
sl = StreamToLogger(stdout_logger, logging.INFO)
sys.stdout = sl
stderr_logger = logging.getLogger("stderr")
stderr_logger.setLevel(logging.ERROR)
sl = StreamToLogger(stderr_logger, logging.ERROR)
sys.stderr = sl
# Get logger
logger = logging.getLogger(logger_name)
logger.setLevel(logging.INFO)
# Add a file handler for all loggers
if handler is None:
os.makedirs(LOGDIR, exist_ok=True)
filename = os.path.join(LOGDIR, logger_filename)
handler = logging.handlers.TimedRotatingFileHandler(filename, when="D", utc=True)
handler.setFormatter(formatter)
for name, item in logging.root.manager.loggerDict.items():
if isinstance(item, logging.Logger):
item.addHandler(handler)
return logger
class StreamToLogger(object):
"""
Fake file-like stream object that redirects writes to a logger instance.
"""
def __init__(self, logger, log_level=logging.INFO):
self.terminal = sys.stdout
self.logger = logger
self.log_level = log_level
self.linebuf = ""
def __getattr__(self, attr):
return getattr(self.terminal, attr)
def write(self, buf):
temp_linebuf = self.linebuf + buf
self.linebuf = ""
for line in temp_linebuf.splitlines(True):
# From the io.TextIOWrapper docs:
# On output, if newline is None, any '\n' characters written
# are translated to the system default line separator.
# By default sys.stdout.write() expects '\n' newlines and then
# translates them so this is still cross platform.
if line[-1] == "\n":
self.logger.log(self.log_level, line.rstrip())
else:
self.linebuf += line
def flush(self):
if self.linebuf != "":
self.logger.log(self.log_level, self.linebuf.rstrip())
self.linebuf = ""
def disable_torch_init():
"""
Disable the redundant torch default initialization to accelerate model creation.
"""
import torch
setattr(torch.nn.Linear, "reset_parameters", lambda self: None)
setattr(torch.nn.LayerNorm, "reset_parameters", lambda self: None)
def violates_moderation(text):
"""
Check whether the text is flagged by the OpenAI moderation API.
"""
url = "https://api.openai.com/v1/moderations"
headers = {"Content-Type": "application/json", "Authorization": "Bearer " + os.environ["OPENAI_API_KEY"]}
text = text.replace("\n", "")
# Let requests serialize the body so quotes and backslashes in `text` are escaped correctly.
try:
ret = requests.post(url, headers=headers, json={"input": text}, timeout=5)
flagged = ret.json()["results"][0]["flagged"]
except requests.exceptions.RequestException as e:
print(f"######################### Moderation Error: {e} #########################")
flagged = False
except KeyError as e:
print(f"######################### Moderation Error: {e} #########################")
flagged = False
return flagged
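# Illustrative sketch (not part of the original module): why a JSON request body
# should be built with a serializer rather than string concatenation. The helper
# names are hypothetical; only the standard-library json module is used.

```python
import json


def build_payload_with_json(text: str) -> bytes:
    """Serialize the moderation request body; json.dumps escapes quotes."""
    return json.dumps({"input": text.replace("\n", "")}).encode("utf-8")


def build_payload_by_concatenation(text: str) -> bytes:
    """Hand-assembled body, shown only to demonstrate the failure mode."""
    text = text.replace("\n", "")
    return ("{" + '"input": ' + f'"{text}"' + "}").encode("utf-8")


def is_valid_json(payload: bytes) -> bool:
    try:
        json.loads(payload)
        return True
    except json.JSONDecodeError:
        return False


sample = 'say "hi"'
serialized_ok = is_valid_json(build_payload_with_json(sample))
concatenated_ok = is_valid_json(build_payload_by_concatenation(sample))
```

# Any user text containing a double quote or a backslash makes the concatenated
# body unparseable, while the serialized body round-trips unchanged.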
def pretty_print_semaphore(semaphore):
if semaphore is None:
return "None"
return f"Semaphore(value={semaphore._value}, locked={semaphore.locked()})"
================================================
FILE: xtuner-eval_niah/niah_requirements.txt
================================================
accelerate==1.0.1
addict==2.4.0
aiofiles==23.2.1
aiohappyeyeballs==2.4.3
aiohttp==3.10.10
aiosignal==1.3.1
altair==5.4.1
annotated-types==0.7.0
anyio==4.6.2.post1
appdirs==1.4.4
argon2-cffi==23.1.0
argon2-cffi-bindings==21.2.0
arrow==1.3.0
arxiv==2.1.3
asttokens==2.4.1
async-lru==2.0.4
async-timeout==4.0.3
attrs==24.2.0
av==13.1.0
babel==2.16.0
beautifulsoup4==4.12.3
bitsandbytes==0.41.0
bleach==6.2.0
blinker==1.8.2
boto3==1.35.39
botocore==1.35.39
Brotli==1.1.0
cachetools==5.5.0
certifi==2022.12.7
cffi==1.17.1
charset-normalizer==2.1.1
click==8.1.7
colorama==0.4.6
coloredlogs==15.0.1
comm==0.2.2
contourpy==1.3.0
cycler==0.12.1
datasets==2.16.1
debugpy==1.8.7
decorator==5.1.1
decord==0.6.0
deepspeed==0.14.2
defusedxml==0.7.1
dill==0.3.7
distro==1.9.0
docker-pycreds==0.4.0
docstring_parser==0.16
duckduckgo_search==5.3.1b1
einops==0.6.1
einops-exts==0.0.4
environs==11.0.0
et_xmlfile==2.0.0
evaluate==0.4.3
exceptiongroup==1.2.2
executing==2.1.0
fastapi==0.115.4
fastjsonschema==2.20.0
feedparser==6.0.11
ffmpy==0.4.0
filelock==3.13.1
flash_attn==2.6.3
fonttools==4.54.1
fqdn==1.5.1
frozenlist==1.5.0
fsspec==2023.10.0
ftfy==6.3.1
func_timeout==4.3.5
gitdb==4.0.11
GitPython==3.1.43
gradio==4.29.0
gradio_client==0.16.1
griffe==0.49.0
h11==0.14.0
h2==4.1.0
hf_transfer==0.1.8
hjson==3.1.0
hpack==4.0.0
httpcore==1.0.6
httpx==0.27.2
huggingface-hub==0.26.1
humanfriendly==10.0
humanize==4.11.0
hyperframe==6.0.1
idna==3.4
imageio==2.36.0
importlib_metadata==8.5.0
importlib_resources==6.4.5
iniconfig==2.0.0
ipykernel==6.29.5
ipython==8.29.0
ipywidgets==8.1.5
isoduration==20.11.0
jedi==0.19.1
Jinja2==3.1.3
jmespath==1.0.1
joblib==1.4.2
json5==0.9.25
jsonpointer==3.0.0
jsonschema==4.23.0
jsonschema-specifications==2024.10.1
jupyter==1.1.1
jupyter-console==6.6.3
jupyter-events==0.10.0
jupyter-lsp==2.2.5
jupyter_client==8.6.3
jupyter_core==5.7.2
jupyter_server==2.14.2
jupyter_server_terminals==0.5.3
jupyterlab==4.2.5
jupyterlab_pygments==0.3.0
jupyterlab_server==2.27.3
jupyterlab_widgets==3.0.13
kiwisolver==1.4.7
lagent==0.2.4
latex2mathml==3.77.0
lazy_loader==0.4
loguru==0.7.2
markdown-it-py==3.0.0
markdown2==2.5.1
MarkupSafe==2.1.5
marshmallow==3.22.0
matplotlib==3.9.2
matplotlib-inline==0.1.7
mdurl==0.1.2
mistune==3.0.2
mmengine==0.10.5
modelscope==1.19.2
mpi4py_mpich==3.1.5
mpmath==1.3.0
multidict==6.1.0
multiprocess==0.70.15
multiprocessing-logging==0.3.4
narwhals==1.11.1
nbclient==0.10.0
nbconvert==7.16.4
nbformat==5.10.4
nest-asyncio==1.6.0
networkx==3.2.1
ninja==1.11.1.1
notebook==7.2.2
notebook_shim==0.2.4
numpy==1.26.4
nvidia-cublas-cu11==11.11.3.6
nvidia-cuda-cupti-cu11==11.8.87
nvidia-cuda-nvrtc-cu11==11.8.89
nvidia-cuda-runtime-cu11==11.8.89
nvidia-cudnn-cu11==8.7.0.84
nvidia-cufft-cu11==10.9.0.58
nvidia-curand-cu11==10.3.0.86
nvidia-cusolver-cu11==11.4.1.48
nvidia-cusparse-cu11==11.7.5.86
nvidia-nccl-cu11==2.19.3
nvidia-nvtx-cu11==11.8.86
open_clip_torch==2.28.0
opencv-python==4.10.0.84
openpyxl==3.1.5
orjson==3.10.10
overrides==7.7.0
packaging==24.1
pandas==2.2.3
pandocfilters==1.5.1
parso==0.8.4
peft==0.4.0
pexpect==4.9.0
phx-class-registry==4.1.0
pillow==10.2.0
platformdirs==4.3.6
pluggy==1.5.0
prometheus_client==0.21.0
prompt_toolkit==3.0.48
propcache==0.2.0
protobuf==4.25.5
psutil==6.1.0
ptyprocess==0.7.0
pure_eval==0.2.3
py-cpuinfo==9.0.0
pyarrow==17.0.0
pyarrow-hotfix==0.6
pycparser==2.22
pydantic==2.9.2
pydantic_core==2.23.4
pydeck==0.9.1
pydub==0.25.1
Pygments==2.18.0
pynvml==11.5.3
pyparsing==3.2.0
pytest==8.3.3
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-json-logger==2.0.7
python-multipart==0.0.16
pytz==2024.2
PyYAML==6.0.2
pyzmq==26.2.0
referencing==0.35.1
regex==2024.9.11
requests==2.32.3
rfc3339-validator==0.1.4
rfc3986-validator==0.1.1
rich==13.9.3
rpds-py==0.20.0
ruff==0.7.1
s3transfer==0.10.3
safetensors==0.4.5
scikit-image==0.24.0
scikit-learn==1.2.2
scipy==1.14.1
seaborn==0.13.2
semantic-version==2.10.0
Send2Trash==1.8.3
sentencepiece==0.1.99
sentry-sdk==2.17.0
setproctitle==1.3.3
sgmllib3k==1.0.0
shellingham==1.5.4
shortuuid==1.0.13
shtab==1.7.1
six==1.16.0
smmap==5.0.1
sniffio==1.3.1
socksio==1.0.0
soupsieve==2.6
stack-data==0.6.3
starlette==0.41.2
streamlit==1.40.0
svgwrite==1.4.3
sympy==1.13.1
tenacity==9.0.0
termcolor==2.5.0
terminado==0.18.1
threadpoolctl==3.5.0
tifffile==2024.9.20
tiktoken==0.8.0
timeout-decorator==0.5.0
timm==1.0.11
tinycss2==1.4.0
tokenizers==0.20.1
toml==0.10.2
tomli==2.0.2
tomlkit==0.12.0
torch==2.2.0+cu118
torchaudio==2.2.0+cu118
torchvision==0.17.0+cu118
tornado==6.4.1
tqdm==4.66.5
traitlets==5.14.3
transformers==4.46.0
transformers-stream-generator==0.0.5
triton==2.2.0
typer==0.12.5
types-python-dateutil==2.9.0.20241003
typing_extensions==4.12.2
tyro==0.8.14
tzdata==2024.2
uri-template==1.3.0
urllib3==2.2.3
uvicorn==0.32.0
wandb==0.16.5
watchdog==5.0.3
wavedrom==2.0.3.post3
wcwidth==0.2.13
webcolors==24.8.0
webencodings==0.5.1
websocket-client==1.8.0
websockets==11.0.3
widgetsnbextension==4.0.13
xxhash==3.5.0
yapf==0.40.2
yarl==1.16.0
zipp==3.20.2
================================================
FILE: xtuner-eval_niah/tmp/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data/haystack_embeddings/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data/haystack_videos/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data/needle_embeddings/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data/source_data/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data/source_data/niah-coco-singlehop_20.json
================================================
[
{
"question": "Find the frame of a plate of food with some salad on bread. what is the color of the plate\nOptions: (A) white\n(B) purple\n(C) green\n(D) orange\nOnly give the best option. Best option: (",
"answer": "A",
"image": "single/COCO_val2014_000000152808.jpg",
"id": 0
},
{
"question": "Find the frame of a MALE IN THE PROCESS OF HITTING A TENNIS BALL ON THE TENNIS COURT what is the man swinging at a ball on a tennis court\nOptions: (A) racquet\n(B) machine\n(C) device\n(D) knife\nOnly give the best option. Best option: (",
"answer": "A",
"image": "single/COCO_val2014_000000306186.jpg",
"id": 1
},
{
"question": "Find the frame of a man being pulled along in the ocean on a small boat with an animal running beside him. what is running next to a man parasurfing in the water\nOptions: (A) cats\n(B) panda\n(C) dog\n(D) cows\nOnly give the best option. Best option: (",
"answer": "C",
"image": "single/COCO_val2014_000000115912.jpg",
"id": 2
},
{
"question": "Find the frame of a partially eaten piece of quiche by a keyboard what is the color of the plate\nOptions: (A) gray\n(B) purple\n(C) white\n(D) black\nOnly give the best option. Best option: (",
"answer": "C",
"image": "single/COCO_val2014_000000430002.jpg",
"id": 3
},
{
"question": "Find the frame of a red bike with white handles is locked to a black pole on a sidewalk. what parked in front of a parking meter\nOptions: (A) truck\n(B) surfboard\n(C) bicycle\n(D) engine\nOnly give the best option. Best option: (",
"answer": "C",
"image": "single/COCO_val2014_000000370208.jpg",
"id": 4
},
{
"question": "Find the frame of a woman standing on a cobbled street holding a umbrella. what is the color of the umbrella\nOptions: (A) green\n(B) purple\n(C) yellow\n(D) gray\nOnly give the best option. Best option: (",
"answer": "B",
"image": "single/COCO_val2014_000000313588.jpg",
"id": 5
},
{
"question": "Find the frame of there is a metal and wood chair in a garden what is in an overgrown yard by a hose\nOptions: (A) shelf\n(B) sofa\n(C) bench\n(D) bed\nOnly give the best option. Best option: (",
"answer": "C",
"image": "single/COCO_val2014_000000125476.jpg",
"id": 6
},
{
"question": "Find the frame of a very tall and skinny building with a clock. what is the color of the trees\nOptions: (A) green\n(B) blue\n(C) white\n(D) black\nOnly give the best option. Best option: (",
"answer": "A",
"image": "single/COCO_val2014_000000083086.jpg",
"id": 7
},
{
"question": "Find the frame of a bear sits in a hole inside of a large, green tree. what is the color of the bear\nOptions: (A) red\n(B) black\n(C) gray\n(D) white\nOnly give the best option. Best option: (",
"answer": "B",
"image": "single/COCO_val2014_000000361341.jpg",
"id": 8
},
{
"question": "Find the frame of several people lined up to board something on a dirt airstrip. what is being loaded for a trip\nOptions: (A) skateboard\n(B) motorcycle\n(C) surfboard\n(D) airplane\nOnly give the best option. Best option: (",
"answer": "D",
"image": "single/COCO_val2014_000000497878.jpg",
"id": 9
},
{
"question": "Find the frame of two cows, a motorcycle and a man in front of a church. what is the color of the church\nOptions: (A) blue\n(B) black\n(C) brown\n(D) white\nOnly give the best option. Best option: (",
"answer": "D",
"image": "single/COCO_val2014_000000442223.jpg",
"id": 10
},
{
"question": "Find the frame of a man holding a white umbrella over another white umbrella. what is the photographer setting up\nOptions: (A) machine\n(B) equipment\n(C) racquet\n(D) device\nOnly give the best option. Best option: (",
"answer": "B",
"image": "single/COCO_val2014_000000093089.jpg",
"id": 11
},
{
"question": "Find the frame of the person is taking a picture of herself. where does the woman take a photo of herself\nOptions: (A) tub\n(B) floor\n(C) mirror\n(D) refrigerator\nOnly give the best option. Best option: (",
"answer": "C",
"image": "single/COCO_val2014_000000032510.jpg",
"id": 12
},
{
"question": "Find the frame of a seagull walking a muddy beach front area what is the color of the bird\nOptions: (A) purple\n(B) white\n(C) yellow\n(D) red\nOnly give the best option. Best option: (",
"answer": "B",
"image": "single/COCO_val2014_000000019404.jpg",
"id": 13
},
{
"question": "Find the frame of a fire engine traveling through a busy street. what does rescue make a turn on a city street\nOptions: (A) surfboards\n(B) ship\n(C) boats\n(D) truck\nOnly give the best option. Best option: (",
"answer": "D",
"image": "single/COCO_val2014_000000246057.jpg",
"id": 14
},
{
"question": "Find the frame of a fancy desert on a table with a number of drinking glasses. what is the color of the flowers\nOptions: (A) purple\n(B) red\n(C) black\n(D) brown\nOnly give the best option. Best option: (",
"answer": "B",
"image": "single/COCO_val2014_000000277689.jpg",
"id": 15
},
{
"question": "Find the frame of two men in black coats stand next to white chairs. how many man looking frightened , leaning on an umbrella while the other man is smiling\nOptions: (A) one\n(B) three\n(C) six\n(D) five\nOnly give the best option. Best option: (",
"answer": "A",
"image": "single/COCO_val2014_000000186218.jpg",
"id": 16
},
{
"question": "Find the frame of a group of oranges on plates in front of Oriental Statues. how many pedestal plates holding oranges on a table\nOptions: (A) seven\n(B) one\n(C) ten\n(D) three\nOnly give the best option. Best option: (",
"answer": "D",
"image": "single/COCO_val2014_000000137085.jpg",
"id": 17
},
{
"question": "Find the frame of a group of girls stand around with umbrellas. what is the color of the skirts\nOptions: (A) red\n(B) orange\n(C) white\n(D) yellow\nOnly give the best option. Best option: (",
"answer": "C",
"image": "single/COCO_val2014_000000403995.jpg",
"id": 18
},
{
"question": "Find the frame of a delicious looking salad with tuna and bread. what fills the plate with another salad sitting alongside\nOptions: (A) drink\n(B) cake\n(C) drinks\n(D) sandwich\nOnly give the best option. Best option: (",
"answer": "D",
"image": "single/COCO_val2014_000000578418.jpg",
"id": 19
}
]
================================================
FILE: xtuner-eval_niah/vision_niah/data_multi/needle_embeddings/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data_multi/source_data/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/data_multi/source_data/niah-coco-multihop-100.json
================================================
[
{
"images": [
"multi/niah_651_5_COCO_val2014_000000353320.jpg",
"multi/niah_651_6_COCO_val2014_000000448236.jpg",
"multi/niah_651_3_COCO_val2014_000000034749.jpg",
"multi/niah_651_2_COCO_val2014_000000537827.jpg",
"multi/niah_651_1_COCO_val2014_000000474034.jpg",
"multi/COCO_val2014_000000446603.jpg",
"multi/niah_651_7_COCO_val2014_000000458861.jpg"
],
"question1": "Find the frame of a person skateboarding on a sidewalk at night. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) two black suitcases one is open and a white bag\n(B) two people riding skate board down a steep road\n(C) A guitar case sits among luggage on a metal bench.\n(D) A frisbee and a guitar lay on a bed.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a person skateboarding on a sidewalk at night. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what lay on the ground with one open\nOptions: (A) ties\n(B) suitcases\n(C) ski\n(D) suits\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_70_5_COCO_val2014_000000220871.jpg",
"multi/niah_70_4_COCO_val2014_000000233888.jpg",
"multi/COCO_val2014_000000037814.jpg",
"multi/niah_70_6_COCO_val2014_000000037988.jpg",
"multi/niah_70_2_COCO_val2014_000000350505.jpg",
"multi/niah_70_1_COCO_val2014_000000198775.jpg"
],
"question1": "Find the frame of a tiny red and black car. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a man that is holding some kind of surfboard\n(B) A pair of scissors, thread, construction paper and a cell phone.\n(C) Boy in the motion of swinging at an incoming baseball. \n(D) A woman tennis player standing at the service line and throwing a tennis ball up in the air to serve it.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a tiny red and black car. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the scissors\nOptions: (A) yellow\n(B) brown\n(C) purple\n(D) red\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_722_3_COCO_val2014_000000428175.jpg",
"multi/niah_722_7_COCO_val2014_000000122251.jpg",
"multi/niah_722_2_COCO_val2014_000000436605.jpg",
"multi/niah_722_5_COCO_val2014_000000150639.jpg",
"multi/niah_722_6_COCO_val2014_000000488664.jpg",
"multi/niah_722_1_COCO_val2014_000000440043.jpg",
"multi/COCO_val2014_000000377723.jpg"
],
"question1": "Find the frame of a double decker red bus parked next to another double decker bus. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A boat that is in a body of water.\n(B) A man with glasses is talking on the phone.\n(C) group of people waiting to get onto a bus in the city\n(D) A blue and white training passing houses while emitting smoke.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a double decker red bus parked next to another double decker bus. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the bus\nOptions: (A) orange\n(B) red\n(C) purple\n(D) blue\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_672_7_COCO_val2014_000000383621.jpg",
"multi/niah_672_2_COCO_val2014_000000322610.jpg",
"multi/niah_672_5_COCO_val2014_000000555012.jpg",
"multi/niah_672_3_COCO_val2014_000000370475.jpg",
"multi/niah_672_6_COCO_val2014_000000031601.jpg",
"multi/niah_672_1_COCO_val2014_000000254653.jpg",
"multi/COCO_val2014_000000468345.jpg"
],
"question1": "Find the frame of a red white and black bus and some bushes and trees Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A white plane flies in the cloudy sky.\n(B) THERE IS AN IMAGE OF A TRAIN OVER A BRIDGE\n(C) a bathroom done in almost total white \n(D) A man skateboards down a road in front of a tractor. \nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a red white and black bus and some bushes and trees Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the young man riding past a construction zone\nOptions: (A) airliner\n(B) skateboard\n(C) airplane\n(D) vehicle\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_454_2_COCO_val2014_000000306855.jpg",
"multi/niah_454_7_COCO_val2014_000000518203.jpg",
"multi/niah_454_5_COCO_val2014_000000081469.jpg",
"multi/niah_454_3_COCO_val2014_000000503144.jpg",
"multi/niah_454_1_COCO_val2014_000000523957.jpg",
"multi/niah_454_6_COCO_val2014_000000290828.jpg",
"multi/COCO_val2014_000000386589.jpg"
],
"question1": "Find the frame of a Skateboarder weaving in and out of cones Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) The bed is located between the two windows.\n(B) A girl on a horse riding around the practice course.\n(C) a red and white jet is taking off and some buildings\n(D) a big giraffe eats some tree leafs \nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a Skateboarder weaving in and out of cones Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the horse\nOptions: (A) red\n(B) brown\n(C) blue\n(D) green\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_429_6_COCO_val2014_000000359066.jpg",
"multi/COCO_val2014_000000111619.jpg",
"multi/niah_429_1_COCO_val2014_000000319721.jpg",
"multi/niah_429_2_COCO_val2014_000000509589.jpg",
"multi/niah_429_5_COCO_val2014_000000120199.jpg",
"multi/niah_429_4_COCO_val2014_000000355860.jpg"
],
"question1": "Find the frame of a woman hold a brown horse while a woman watches. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Giraffe standing next to a tree near a pond.\n(B) A tour bus belonging to the Marines on a street with other buses\n(C) A mushroom, pepperoni, pepper pizza on a cooling rack.\n(D) a woman in red is riding a horse\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a woman hold a brown horse while a woman watches. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the bus\nOptions: (A) orange\n(B) blue\n(C) purple\n(D) gray\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_804_4_COCO_val2014_000000090626.jpg",
"multi/niah_804_2_COCO_val2014_000000367848.jpg",
"multi/COCO_val2014_000000167898.jpg",
"multi/niah_804_5_COCO_val2014_000000433993.jpg",
"multi/niah_804_1_COCO_val2014_000000345456.jpg",
"multi/niah_804_6_COCO_val2014_000000346138.jpg"
],
"question1": "Find the frame of a yellow bus is coming down the tracks Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a woman and her kids sitting on a brown couch \n(B) a bath room with a sink a toilet and a mirror\n(C) some pasta dishes are in plates and bowls\n(D) Empty Asian city bus stopped on a hill.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a yellow bus is coming down the tracks Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the porcelain\nOptions: (A) purple\n(B) gray\n(C) black\n(D) white\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_119_1_COCO_val2014_000000050392.jpg",
"multi/niah_119_2_COCO_val2014_000000475387.jpg",
"multi/niah_119_7_COCO_val2014_000000205222.jpg",
"multi/niah_119_6_COCO_val2014_000000560427.jpg",
"multi/niah_119_3_COCO_val2014_000000369082.jpg",
"multi/niah_119_5_COCO_val2014_000000400062.jpg",
"multi/COCO_val2014_000000409500.jpg"
],
"question1": "Find the frame of a kitchen that has a stove and counters in it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A partially eaten pastry is on a plate.\n(B) A bacon, eggs, and toast breakfast complete with coffee and orange juice.\n(C) an image of a poorly lit old bathroom\n(D) A man is bending forward to catch a Frisbee.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a kitchen that has a stove and counters in it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is showing only the toilet , toilet paper and the seat covers\nOptions: (A) bathroom\n(B) restroom\n(C) room\n(D) garage\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_447_2_COCO_val2014_000000159260.jpg",
"multi/COCO_val2014_000000323496.jpg",
"multi/niah_447_4_COCO_val2014_000000538792.jpg",
"multi/niah_447_1_COCO_val2014_000000157378.jpg",
"multi/niah_447_6_COCO_val2014_000000095808.jpg",
"multi/niah_447_5_COCO_val2014_000000430756.jpg"
],
"question1": "Find the frame of a vase of flowers is pictured in this image. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a person and a brown and white horse\n(B) A man rides a skateboard while people are running a foot race.\n(C) a motorcycle parked between two poles and one other motorcycle \n(D) A LITTLE BOY PLAYING A GAME OF TENNIS \nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a vase of flowers is pictured in this image. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the young lad swings what\nOptions: (A) racquet\n(B) device\n(C) blender\n(D) knife\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_595_4_COCO_val2014_000000491216.jpg",
"multi/niah_595_3_COCO_val2014_000000537506.jpg",
"multi/COCO_val2014_000000150616.jpg",
"multi/niah_595_1_COCO_val2014_000000220871.jpg",
"multi/niah_595_5_COCO_val2014_000000248395.jpg"
],
"question1": "Find the frame of a man that is holding some kind of surfboard Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A man holds a small animal in front of a camera.\n(B) A group of people wait near a wedding procession of cars, one of them holding two apples.\n(C) A cat is walking across a rug on a kitchen floor.\n(D) A look at a wallpapered bathroom, with a ships window.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a man that is holding some kind of surfboard Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what did white claw under a circle window\nOptions: (A) mirror\n(B) tub\n(C) fridge\n(D) toilet\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_369_7_COCO_val2014_000000387244.jpg",
"multi/niah_369_2_COCO_val2014_000000370765.jpg",
"multi/niah_369_3_COCO_val2014_000000122203.jpg",
"multi/niah_369_6_COCO_val2014_000000026465.jpg",
"multi/niah_369_5_COCO_val2014_000000138492.jpg",
"multi/COCO_val2014_000000274651.jpg",
"multi/niah_369_1_COCO_val2014_000000386581.jpg"
],
"question1": "Find the frame of a police horse tied to a parking meter as the cop writes a ticket Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A Philly cheese steak sandwich is sitting on a plate.\n(B) A black and white dog jumping up catching a Frisbee.\n(C) Water is slowly flowing from a fire hydrant.\n(D) A laptop computer is sitting on a desk.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a police horse tied to a parking meter as the cop writes a ticket Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is sitting on the plate\nOptions: (A) pizza\n(B) wine\n(C) sandwich\n(D) beverage\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_464_5_COCO_val2014_000000108904.jpg",
"multi/niah_464_1_COCO_val2014_000000473936.jpg",
"multi/COCO_val2014_000000119120.jpg",
"multi/niah_464_7_COCO_val2014_000000094185.jpg",
"multi/niah_464_3_COCO_val2014_000000326542.jpg",
"multi/niah_464_6_COCO_val2014_000000318238.jpg",
"multi/niah_464_2_COCO_val2014_000000096769.jpg"
],
"question1": "Find the frame of motorcycle parked perpendicular on side of city street at night. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A man and his white dog playing Frisbee on a grassy hill.\n(B) Three dogs are sleeping across the length of the bed.\n(C) A large teddy bear sits in front of the store.\n(D) A counter showing two hot dogs and a side of fries with a drink\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of motorcycle parked perpendicular on side of city street at night. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what does the man throw to his wild dog\nOptions: (A) frisbee\n(B) monkey\n(C) kite\n(D) toys\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_745_1_COCO_val2014_000000352093.jpg",
"multi/niah_745_4_COCO_val2014_000000482100.jpg",
"multi/COCO_val2014_000000460683.jpg",
"multi/niah_745_3_COCO_val2014_000000032812.jpg",
"multi/niah_745_5_COCO_val2014_000000268371.jpg"
],
"question1": "Find the frame of two big soup bowls with a large spoon hanging . Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A man that has the ten commandments on his tie\n(B) The jets are flying very close to each other in the sky.\n(C) Two toilets sit outside on the pavement next to a yard with many decorations.\n(D) A girl swings a net a tennis ball. \nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of two big soup bowls with a large spoon hanging . Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the man that has how many commandments on his tie\nOptions: (A) two\n(B) five\n(C) seven\n(D) ten\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/COCO_val2014_000000085787.jpg",
"multi/niah_735_4_COCO_val2014_000000553667.jpg",
"multi/niah_735_5_COCO_val2014_000000489920.jpg",
"multi/niah_735_1_COCO_val2014_000000561009.jpg",
"multi/niah_735_3_COCO_val2014_000000416827.jpg"
],
"question1": "Find the frame of a bird sits perched on a tree branch. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A bench sitting in front of a brick wall on a patio.\n(B) A kid hanging out a window while holding an umbrella.\n(C) A train travels down the tracks through a rural valley with mountains in the background.\n(D) an african american child wearing a green shirt eating a sandwich\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a bird sits perched on a tree branch. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the kid hanging out what while holding an umbrella\nOptions: (A) toilet\n(B) stove\n(C) towel\n(D) window\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_206_1_COCO_val2014_000000559270.jpg",
"multi/COCO_val2014_000000225014.jpg",
"multi/niah_206_4_COCO_val2014_000000332923.jpg",
"multi/niah_206_5_COCO_val2014_000000388267.jpg",
"multi/niah_206_3_COCO_val2014_000000224222.jpg"
],
"question1": "Find the frame of a classic car with a lady inside sitting in a parking lot. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) three men out on the ocean water two of the men watching their friend surf waves blue oceans water a man wearing a green shirt and yellow shorts\n(B) A white clock tower is a nice day.\n(C) A person standing with a umbrella standing by a rail.\n(D) A large bear mascot is riding in a red Pontiac.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a classic car with a lady inside sitting in a parking lot. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the mascot\nOptions: (A) blue\n(B) orange\n(C) purple\n(D) red\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_111_1_COCO_val2014_000000032364.jpg",
"multi/COCO_val2014_000000199387.jpg",
"multi/niah_111_2_COCO_val2014_000000545924.jpg",
"multi/niah_111_4_COCO_val2014_000000308856.jpg",
"multi/niah_111_6_COCO_val2014_000000139740.jpg",
"multi/niah_111_5_COCO_val2014_000000557510.jpg"
],
"question1": "Find the frame of a woman flies a kite in the blue sky. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a white cup is sitting on a blue thing\n(B) A woman looks carefully as she smacks the tennis ball.\n(C) a person is adjusting skis on the slopes\n(D) a close up of a public transit train parked near other vehicles\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a woman flies a kite in the blue sky. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the skier repairing on the slope\nOptions: (A) ski\n(B) ties\n(C) apron\n(D) suits\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/COCO_val2014_000000297444.jpg",
"multi/niah_490_4_COCO_val2014_000000008647.jpg",
"multi/niah_490_1_COCO_val2014_000000398567.jpg",
"multi/niah_490_5_COCO_val2014_000000509750.jpg",
"multi/niah_490_3_COCO_val2014_000000265153.jpg"
],
"question1": "Find the frame of a computer desk with various items around it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) many different buildings near one another near trees\n(B) A chocolate cake for a ninety second birthday.\n(C) One lama in front of another lama standing outdoors.\n(D) A statue of an elephant is painted many colors\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a computer desk with various items around it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: chocolate what with lettering on the top of it on blue table\nOptions: (A) drink\n(B) sandwich\n(C) pastry\n(D) cake\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_428_1_COCO_val2014_000000074743.jpg",
"multi/niah_428_6_COCO_val2014_000000101715.jpg",
"multi/niah_428_2_COCO_val2014_000000352496.jpg",
"multi/niah_428_4_COCO_val2014_000000183500.jpg",
"multi/COCO_val2014_000000273083.jpg",
"multi/niah_428_5_COCO_val2014_000000441969.jpg"
],
"question1": "Find the frame of some sort of a blue sofa sits in a small room; a mirror and a picture hang above it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) There are plants on a table on an apartment balcony.\n(B) Food items next to beverage displayed on wooden surface.\n(C) White and red biplane flying through the air. \n(D) A dog is very interested in a pizza that is on the table.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of some sort of a blue sofa sits in a small room; a mirror and a picture hang above it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what sniffs the box of pepperoni pizza\nOptions: (A) cats\n(B) bear\n(C) dog\n(D) birds\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_38_6_COCO_val2014_000000437497.jpg",
"multi/niah_38_2_COCO_val2014_000000073696.jpg",
"multi/COCO_val2014_000000426835.jpg",
"multi/niah_38_1_COCO_val2014_000000451598.jpg",
"multi/niah_38_5_COCO_val2014_000000553588.jpg",
"multi/niah_38_7_COCO_val2014_000000156102.jpg",
"multi/niah_38_3_COCO_val2014_000000356241.jpg"
],
"question1": "Find the frame of the hot dog is longer than the bun and has nacho cheese squeezed onto it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Batter took a swing at the ball during the baseball game\n(B) a white dog in an orange vest and a man in a blue shirt and water\n(C) Two boys carrying hot dogs and other snacks at an outdoor sporting event.\n(D) Looking up at a building with a large face clock near the top.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of the hot dog is longer than the bun and has nacho cheese squeezed onto it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many boys in yellow polos are about to eat hot dogs\nOptions: (A) five\n(B) two\n(C) four\n(D) one\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_102_4_COCO_val2014_000000304706.jpg",
"multi/niah_102_1_COCO_val2014_000000231140.jpg",
"multi/niah_102_6_COCO_val2014_000000563702.jpg",
"multi/niah_102_5_COCO_val2014_000000537132.jpg",
"multi/COCO_val2014_000000462845.jpg",
"multi/niah_102_2_COCO_val2014_000000461758.jpg"
],
"question1": "Find the frame of several zebras in front of a building with steeples. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two large pizzas sitting on wire cooling racks.\n(B) A group of people gathering around a table displaying wine.\n(C) There is a woman playing a game of tennis.\n(D) Firefighters and three firetrucks are parked on a street.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of several zebras in front of a building with steeples. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the returns\nOptions: (A) black\n(B) yellow\n(C) red\n(D) orange\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_520_5_COCO_val2014_000000229067.jpg",
"multi/niah_520_6_COCO_val2014_000000342515.jpg",
"multi/niah_520_2_COCO_val2014_000000429710.jpg",
"multi/COCO_val2014_000000015596.jpg",
"multi/niah_520_3_COCO_val2014_000000520489.jpg",
"multi/niah_520_7_COCO_val2014_000000440184.jpg",
"multi/niah_520_1_COCO_val2014_000000277543.jpg"
],
"question1": "Find the frame of a bunch of identical bear key chains with green clovers hand on a turn rack. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A group of people ski down the mountain slope.\n(B) two people riding motorcycles near one another \n(C) 6 motorcycles are sitting outside a shop in the grass.\n(D) A group of people playing tennis on a sunny day\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a bunch of identical bear key chains with green clovers hand on a turn rack. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many people on white and red motorbikes race very quickly\nOptions: (A) ten\n(B) three\n(C) four\n(D) two\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_929_5_COCO_val2014_000000461759.jpg",
"multi/niah_929_2_COCO_val2014_000000420490.jpg",
"multi/COCO_val2014_000000288874.jpg",
"multi/niah_929_7_COCO_val2014_000000122745.jpg",
"multi/niah_929_3_COCO_val2014_000000545735.jpg",
"multi/niah_929_6_COCO_val2014_000000269344.jpg",
"multi/niah_929_1_COCO_val2014_000000552302.jpg"
],
"question1": "Find the frame of an air traffic controller stands in front of a large plane. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a dog jumping in the air with a frisbee in its mouth\n(B) A shelf with dishes and a clock in a mirror reflection.\n(C) A large lit clock with a Colgate sign.\n(D) a vandalized stop sign in the dark with a sky background\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of an air traffic controller stands in front of a large plane. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what shows dishes on a shelf , pots and pans , and a clock\nOptions: (A) mirror\n(B) floor\n(C) fridge\n(D) tub\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_746_5_COCO_val2014_000000061693.jpg",
"multi/niah_746_1_COCO_val2014_000000535643.jpg",
"multi/COCO_val2014_000000482233.jpg",
"multi/niah_746_7_COCO_val2014_000000488997.jpg",
"multi/niah_746_2_COCO_val2014_000000555066.jpg",
"multi/niah_746_6_COCO_val2014_000000014380.jpg",
"multi/niah_746_3_COCO_val2014_000000271759.jpg"
],
"question1": "Find the frame of a toilet and urinal against a pink tiled wall. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Stop sign and street signs in a residential neighborhood.\n(B) A car bridge going over a commuter train.\n(C) White vase holding holding an assortment of flowers \n(D) A man uses the computer while sitting at a table.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a toilet and urinal against a pink tiled wall. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the sign\nOptions: (A) green\n(B) brown\n(C) orange\n(D) black\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_731_1_COCO_val2014_000000497344.jpg",
"multi/niah_731_5_COCO_val2014_000000552824.jpg",
"multi/niah_731_6_COCO_val2014_000000290957.jpg",
"multi/COCO_val2014_000000411559.jpg",
"multi/niah_731_2_COCO_val2014_000000223090.jpg",
"multi/niah_731_4_COCO_val2014_000000259976.jpg"
],
"question1": "Find the frame of a blue-eyed toddler talks on the cellphone while he types something on the laptop. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A tray of several chocolate frosted donuts sitting on a doily.\n(B) a horse grazes on some grass in front of a tree\n(C) A man flying a kite with two hands\n(D) A garage door leading out to a fancy car.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a blue-eyed toddler talks on the cellphone while he types something on the laptop. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the man flying with two strings\nOptions: (A) kite\n(B) frisbee\n(C) hat\n(D) toys\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_701_2_COCO_val2014_000000467742.jpg",
"multi/niah_701_6_COCO_val2014_000000287714.jpg",
"multi/niah_701_3_COCO_val2014_000000263355.jpg",
"multi/COCO_val2014_000000245534.jpg",
"multi/niah_701_7_COCO_val2014_000000309889.jpg",
"multi/niah_701_1_COCO_val2014_000000003466.jpg",
"multi/niah_701_5_COCO_val2014_000000509590.jpg"
],
"question1": "Find the frame of a person with orange and blue sneakers stands on a brick floor. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A bed with red sheets and a large wooden headboard.\n(B) A clean hotel bathroom is pictured in this image.\n(C) A tall brick tower with a massive clock on it's side.\n(D) A young Asian girl holds a red umbrella.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a person with orange and blue sneakers stands on a brick floor. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what the time at twelve o'clock noon\nOptions: (A) frame\n(B) slice\n(C) clock\n(D) hay\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_776_5_COCO_val2014_000000349098.jpg",
"multi/COCO_val2014_000000127120.jpg",
"multi/niah_776_6_COCO_val2014_000000344762.jpg",
"multi/niah_776_1_COCO_val2014_000000184401.jpg",
"multi/niah_776_4_COCO_val2014_000000032375.jpg",
"multi/niah_776_2_COCO_val2014_000000244550.jpg"
],
"question1": "Find the frame of stuffed teddy bear covered with blanket in bed. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a man is talking in to a microphone on stage\n(B) People walk and stand near an approaching train.\n(C) The two eddy bears are leaning on the purplish pillows\n(D) Multiple views of graffiti underneath a sign on a brick wall.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of stuffed teddy bear covered with blanket in bed. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what does the fan give as a gift to the singer\nOptions: (A) monkey\n(B) sheep\n(C) bear\n(D) horse\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_725_6_COCO_val2014_000000235131.jpg",
"multi/niah_725_5_COCO_val2014_000000464737.jpg",
"multi/COCO_val2014_000000245102.jpg",
"multi/niah_725_7_COCO_val2014_000000174091.jpg",
"multi/niah_725_2_COCO_val2014_000000393258.jpg",
"multi/niah_725_1_COCO_val2014_000000365511.jpg",
"multi/niah_725_3_COCO_val2014_000000099342.jpg"
],
"question1": "Find the frame of a road sign for uplands over the back of a stop sign. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) it is snowing near a staircase and some walls with a light\n(B) Two zebras share a tender moment as an ostrich hides his head in the background. \n(C) A pizza, with a saucy and leafy topping, on a plate on a dark brown table.\n(D) An image of boats, and ferries on water.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a road sign for uplands over the back of a stop sign. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what did the snow cover in a parking garage\nOptions: (A) towel\n(B) fabric\n(C) ramp\n(D) surface\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_614_4_COCO_val2014_000000038816.jpg",
"multi/niah_614_5_COCO_val2014_000000406681.jpg",
"multi/COCO_val2014_000000261185.jpg",
"multi/niah_614_6_COCO_val2014_000000040317.jpg",
"multi/niah_614_1_COCO_val2014_000000367582.jpg",
"multi/niah_614_2_COCO_val2014_000000122787.jpg"
],
"question1": "Find the frame of a man is looking at a piece of paper Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) The skier is coming down the snowy mountain.\n(B) A couple of brown dogs sitting on top of a couch.\n(C) A decorated Christmas tree stands in a very tidy room.\n(D) A bathroom sits off the side of a kitchen in close proximity to the refrigerator.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a man is looking at a piece of paper Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: where does the decorated christmas tree stand\nOptions: (A) restroom\n(B) office\n(C) room\n(D) bedroom\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/COCO_val2014_000000324135.jpg",
"multi/niah_708_3_COCO_val2014_000000091080.jpg",
"multi/niah_708_1_COCO_val2014_000000498228.jpg",
"multi/niah_708_5_COCO_val2014_000000574392.jpg",
"multi/niah_708_4_COCO_val2014_000000115222.jpg"
],
"question1": "Find the frame of two pelicans sit on rocks along the water. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) The small bench is nestled under the tree.\n(B) City scene with stop sign and street cars.\n(C) A kitchen that has a brick wall as a background. \n(D) a child with a hat and sunglasses and is holding a stick\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of two pelicans sit on rocks along the water. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the child with and is holding what\nOptions: (A) slice\n(B) track\n(C) surface\n(D) stick\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_557_3_COCO_val2014_000000421833.jpg",
"multi/niah_557_2_COCO_val2014_000000393569.jpg",
"multi/niah_557_1_COCO_val2014_000000251508.jpg",
"multi/niah_557_7_COCO_val2014_000000533003.jpg",
"multi/niah_557_6_COCO_val2014_000000203865.jpg",
"multi/niah_557_5_COCO_val2014_000000521282.jpg",
"multi/COCO_val2014_000000351903.jpg"
],
"question1": "Find the frame of a girl shows a banana to the camera. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) there is an adult giraffe looking at something\n(B) clear vase filled with white and yellow flowers with water\n(C) The counter holds many small appliances and tools for the kitchen.\n(D) A transit bus parked at a stop on a street side.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a girl shows a banana to the camera. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is there looking at something\nOptions: (A) giraffe\n(B) panda\n(C) horse\n(D) elephant\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_703_3_COCO_val2014_000000295574.jpg",
"multi/niah_703_5_COCO_val2014_000000095191.jpg",
"multi/niah_703_1_COCO_val2014_000000314074.jpg",
"multi/COCO_val2014_000000058089.jpg",
"multi/niah_703_4_COCO_val2014_000000020979.jpg"
],
"question1": "Find the frame of people are watching a man cut a wedding cake with a puppet. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A man taking a selfie in a bathroom with a toilet, sink and shower.\n(B) A train station with three trains near a building\n(C) There is a motorcycle parked in front of a small blue and white bus.\n(D) a parking meter with a red zone 3 tag\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of people are watching a man cut a wedding cake with a puppet. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the parking meter with a zone how many sticker on it\nOptions: (A) three\n(B) four\n(C) ten\n(D) six\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/COCO_val2014_000000535643.jpg",
"multi/niah_692_4_COCO_val2014_000000384823.jpg",
"multi/niah_692_1_COCO_val2014_000000231349.jpg",
"multi/niah_692_5_COCO_val2014_000000386210.jpg",
"multi/niah_692_3_COCO_val2014_000000568398.jpg"
],
"question1": "Find the frame of a train has an out of order sign on it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A pizza with multiple topping ready to be cut.\n(B) A toilet and urinal against a pink tiled wall. \n(C) Tube of orange and pears, surrounded by boxes of grass and one pineapple\n(D) A desk cluttered with compute equipment and office supplies and a a chair.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a train has an out of order sign on it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what mounted to the pink wall next to a toilet\nOptions: (A) urinal\n(B) refrigerator\n(C) window\n(D) sink\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_775_6_COCO_val2014_000000021688.jpg",
"multi/niah_775_5_COCO_val2014_000000492609.jpg",
"multi/niah_775_2_COCO_val2014_000000337987.jpg",
"multi/niah_775_3_COCO_val2014_000000494139.jpg",
"multi/COCO_val2014_000000497365.jpg",
"multi/niah_775_1_COCO_val2014_000000389655.jpg",
"multi/niah_775_7_COCO_val2014_000000522315.jpg"
],
"question1": "Find the frame of a military jet that has its loading compartment open for viewing. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A garden with a bench sits outside a red building.\n(B) Two pizzas with green toppings on top of it.\n(C) Two giraffes are walking on a dirt road looking in opposite directions.\n(D) A man laughing at a redheaded woman. \nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a military jet that has its loading compartment open for viewing. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the man\nOptions: (A) blue\n(B) green\n(C) black\n(D) brown\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/COCO_val2014_000000325410.jpg",
"multi/niah_66_1_COCO_val2014_000000495613.jpg",
"multi/niah_66_5_COCO_val2014_000000296996.jpg",
"multi/niah_66_6_COCO_val2014_000000280560.jpg",
"multi/niah_66_2_COCO_val2014_000000265313.jpg",
"multi/niah_66_3_COCO_val2014_000000536274.jpg",
"multi/niah_66_7_COCO_val2014_000000393282.jpg"
],
"question1": "Find the frame of two motorcycles are parked on an area with flowers. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two passenger jets sit in wait at the airport.\n(B) Two giraffes standing in a desert like area. \n(C) A purple banana sprout hanging from a tree.\n(D) A person holding a tennis racket on a tennis court.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of two motorcycles are parked on an area with flowers. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the banana\nOptions: (A) purple\n(B) orange\n(C) white\n(D) green\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_982_5_COCO_val2014_000000513098.jpg",
"multi/niah_982_6_COCO_val2014_000000043851.jpg",
"multi/COCO_val2014_000000531796.jpg",
"multi/niah_982_2_COCO_val2014_000000470885.jpg",
"multi/niah_982_1_COCO_val2014_000000201305.jpg",
"multi/niah_982_3_COCO_val2014_000000050786.jpg",
"multi/niah_982_7_COCO_val2014_000000201477.jpg"
],
"question1": "Find the frame of an asian girl is posing with a little umbrella. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A small sign by a big hill filled with umbrellas.\n(B) some buses parked in a bus depot in fall\n(C) A group of older propeller planes flying in formation.\n(D) A cat is seen in a mirror as it lays on a bed near a dresser.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of an asian girl is posing with a little umbrella. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what parked in the bus depot in fall\nOptions: (A) jet\n(B) bicycle\n(C) buses\n(D) boats\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_272_4_COCO_val2014_000000434693.jpg",
"multi/niah_272_5_COCO_val2014_000000450217.jpg",
"multi/niah_272_1_COCO_val2014_000000002290.jpg",
"multi/niah_272_2_COCO_val2014_000000237566.jpg",
"multi/niah_272_6_COCO_val2014_000000566103.jpg",
"multi/COCO_val2014_000000568979.jpg"
],
"question1": "Find the frame of a boat sitting in the water near a dock. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Elephants crossing a deep river near a dry landscape.\n(B) A man that is standing next to an elephant.\n(C) three signs sit on top of a street pole \n(D) A fire hydrant that is painted white stands in front of a pink house.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a boat sitting in the water near a dock. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what holds the tail of another elephant and walks near a man\nOptions: (A) cats\n(B) birds\n(C) calf\n(D) elephant\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_668_3_COCO_val2014_000000238291.jpg",
"multi/niah_668_1_COCO_val2014_000000172342.jpg",
"multi/niah_668_4_COCO_val2014_000000025715.jpg",
"multi/COCO_val2014_000000150320.jpg",
"multi/niah_668_5_COCO_val2014_000000282724.jpg"
],
"question1": "Find the frame of a couple of people at a deli restaurant. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A boat is sitting overturned on the shore near the water. \n(B) Several empty boats floating on the river on a cloudy day.\n(C) A couple of black cats sitting in a window sill looking out a window.\n(D) A computer desk with keyboard, monitor, mouse and headphones on it.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a couple of people at a deli restaurant. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what set up with the variety of electronics\nOptions: (A) counter\n(B) desk\n(C) shelf\n(D) sofa\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_575_1_COCO_val2014_000000390348.jpg",
"multi/niah_575_2_COCO_val2014_000000128992.jpg",
"multi/niah_575_4_COCO_val2014_000000469232.jpg",
"multi/niah_575_6_COCO_val2014_000000219524.jpg",
"multi/niah_575_5_COCO_val2014_000000336113.jpg",
"multi/COCO_val2014_000000288579.jpg"
],
"question1": "Find the frame of a bike parked in the middle of a street covered up with a cloth .\n Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A dog looks up at a flying disk.\n(B) A man in a suit poses for pictures\n(C) A large black bear moving toward a vehicle.\n(D) A all white bath room with blue tape on the floor . \nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a bike parked in the middle of a street covered up with a cloth .\n Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the dog watching what fly over his head in a park\nOptions: (A) ball\n(B) frisbee\n(C) kite\n(D) monkey\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_586_5_COCO_val2014_000000459117.jpg",
"multi/niah_586_1_COCO_val2014_000000191691.jpg",
"multi/niah_586_3_COCO_val2014_000000104758.jpg",
"multi/niah_586_2_COCO_val2014_000000025743.jpg",
"multi/COCO_val2014_000000225784.jpg",
"multi/niah_586_7_COCO_val2014_000000068434.jpg",
"multi/niah_586_6_COCO_val2014_000000486172.jpg"
],
"question1": "Find the frame of a man shaving in a bathroom while looking in the mirror Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two horses that are pulling a trailer with a bunch of wooden barrels on it.\n(B) A person is skiing down a snowy slope.\n(C) A vase is holding an arrangement of flowers.\n(D) A bicycle with headlights is photographed in a room.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a man shaving in a bathroom while looking in the mirror Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what parked inside the room on top of tiled floor\nOptions: (A) airplane\n(B) truck\n(C) bicycle\n(D) trains\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_42_1_COCO_val2014_000000453731.jpg",
"multi/niah_42_2_COCO_val2014_000000334743.jpg",
"multi/niah_42_7_COCO_val2014_000000434723.jpg",
"multi/niah_42_5_COCO_val2014_000000307847.jpg",
"multi/niah_42_6_COCO_val2014_000000219344.jpg",
"multi/COCO_val2014_000000236419.jpg",
"multi/niah_42_3_COCO_val2014_000000068204.jpg"
],
"question1": "Find the frame of a dog relaxing on a deck with wine bottles Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A yellow umbrella sticking out of a vase near a brick wall.\n(B) an old dairy truck is rotting away in a field\n(C) A train is sitting in a train station. \n(D) A teddy bear with a blue scarf and eyes tilted to its left.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a dog relaxing on a deck with wine bottles Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the umbrella\nOptions: (A) yellow\n(B) black\n(C) red\n(D) brown\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_644_2_COCO_val2014_000000565929.jpg",
"multi/niah_644_1_COCO_val2014_000000183701.jpg",
"multi/COCO_val2014_000000311928.jpg",
"multi/niah_644_5_COCO_val2014_000000170278.jpg",
"multi/niah_644_6_COCO_val2014_000000461572.jpg",
"multi/niah_644_3_COCO_val2014_000000136740.jpg",
"multi/niah_644_7_COCO_val2014_000000020371.jpg"
],
"question1": "Find the frame of three large giraffes feeding from buckets on a chain linked fence. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a small boy with a bat and ball on some grass\n(B) A close up shot of a pigeon on a rain gutter.\n(C) A large dog sleeps on a duvet comforter. \n(D) Someone lifting a slice of deep dish pizza.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of three large giraffes feeding from buckets on a chain linked fence. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the little boy hitting with a bat\nOptions: (A) ball\n(B) monkey\n(C) duck\n(D) hat\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_469_5_COCO_val2014_000000221000.jpg",
"multi/COCO_val2014_000000482574.jpg",
"multi/niah_469_4_COCO_val2014_000000554767.jpg",
"multi/niah_469_2_COCO_val2014_000000512938.jpg",
"multi/niah_469_1_COCO_val2014_000000507042.jpg",
"multi/niah_469_6_COCO_val2014_000000225177.jpg"
],
"question1": "Find the frame of these are three giraffes standing in a pen Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A passenger train with several passenger cars moving along.\n(B) a man standing and checking out a bunch of hanging fruit on display\n(C) A woman in hospital gown lays in bed with a child by her bedside and another woman sweeps in the background\n(D) A city bus is traveling down the empty street.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of these are three giraffes standing in a pen Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the man touching hanging from the tree\nOptions: (A) cake\n(B) drink\n(C) pastry\n(D) fruit\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_822_2_COCO_val2014_000000250301.jpg",
"multi/COCO_val2014_000000578485.jpg",
"multi/niah_822_1_COCO_val2014_000000500829.jpg",
"multi/niah_822_6_COCO_val2014_000000240817.jpg",
"multi/niah_822_3_COCO_val2014_000000242757.jpg",
"multi/niah_822_7_COCO_val2014_000000560675.jpg",
"multi/niah_822_5_COCO_val2014_000000401163.jpg"
],
"question1": "Find the frame of an old truck parked in the woods. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A train that is passing by a very large city.\n(B) Small kitchen with white appliances and white cabinets. \n(C) A old man sits on a bench next to a lying dog.\n(D) A person cross country skiing on a winters day.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of an old truck parked in the woods. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is in the kitchen next to an empty room\nOptions: (A) stove\n(B) fridge\n(C) tub\n(D) door\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_382_3_COCO_val2014_000000245651.jpg",
"multi/COCO_val2014_000000538792.jpg",
"multi/niah_382_5_COCO_val2014_000000176007.jpg",
"multi/niah_382_1_COCO_val2014_000000347772.jpg",
"multi/niah_382_4_COCO_val2014_000000288579.jpg"
],
"question1": "Find the frame of a baseball player is getting ready to hit a ball Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Several different cell phones are laying on the sofa,\n(B) a motorcycle parked between two poles and one other motorcycle \n(C) 2 tier cake with multicolored stars attached to it. \n(D) A dog looks up at a flying disk.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a baseball player is getting ready to hit a ball Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what parked near another bike\nOptions: (A) buses\n(B) airplane\n(C) ship\n(D) motorcycle\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/COCO_val2014_000000450182.jpg",
"multi/niah_743_1_COCO_val2014_000000090753.jpg",
"multi/niah_743_4_COCO_val2014_000000342814.jpg",
"multi/niah_743_3_COCO_val2014_000000486034.jpg",
"multi/niah_743_5_COCO_val2014_000000022870.jpg"
],
"question1": "Find the frame of some white flowers are in a glass vase Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A pirate them bedroom with children in the bed is pictured.\n(B) A motorcyclist performs a wheelie on the street. \n(C) Three dogs are tied up with leash ropes.\n(D) A line of buses waiting outside an amusement park. \nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of some white flowers are in a glass vase Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the man with a helmet riding up the street with a streetcar behind him in mid afternoon\nOptions: (A) motorcycle\n(B) truck\n(C) surfboards\n(D) scooter\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_492_2_COCO_val2014_000000571808.jpg",
"multi/COCO_val2014_000000564058.jpg",
"multi/niah_492_5_COCO_val2014_000000395701.jpg",
"multi/niah_492_6_COCO_val2014_000000102240.jpg",
"multi/niah_492_1_COCO_val2014_000000530600.jpg",
"multi/niah_492_3_COCO_val2014_000000303418.jpg",
"multi/niah_492_7_COCO_val2014_000000069646.jpg"
],
"question1": "Find the frame of two grey teddy bears wearing bright knit hats and sweaters Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A chair and a television in a room.\n(B) A bride and groom are cutting a wedding cake\n(C) a person taking a picture of the back end of their big truck \n(D) A man with a tie has his feet bent up in the air. \nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of two grey teddy bears wearing bright knit hats and sweaters Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the bride and groom cutting what\nOptions: (A) cake\n(B) turkey\n(C) banana\n(D) broccoli\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/COCO_val2014_000000579043.jpg",
"multi/niah_825_5_COCO_val2014_000000132219.jpg",
"multi/niah_825_4_COCO_val2014_000000350794.jpg",
"multi/niah_825_3_COCO_val2014_000000433134.jpg",
"multi/niah_825_1_COCO_val2014_000000195561.jpg"
],
"question1": "Find the frame of a statue under an umbrella pointing to a fir tree on a red-tiled city square. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A high statue with a clock inside on a very nice day.\n(B) a panda bear relaxing on a ledge in the shade\n(C) a black and white cat on some grass\n(D) The baby girl sits with a red ball at her feet.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a statue under an umbrella pointing to a fir tree on a red-tiled city square. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the mohawk\nOptions: (A) green\n(B) black\n(C) purple\n(D) yellow\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_412_6_COCO_val2014_000000045830.jpg",
"multi/niah_412_2_COCO_val2014_000000088895.jpg",
"multi/niah_412_1_COCO_val2014_000000135037.jpg",
"multi/COCO_val2014_000000304099.jpg",
"multi/niah_412_5_COCO_val2014_000000268641.jpg",
"multi/niah_412_4_COCO_val2014_000000064610.jpg"
],
"question1": "Find the frame of a warthog and a zebra running in a grassland. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) An orange and black fire hydrant sits on a sidewalk next to some black pipes. \n(B) A black cat laying down in a piece of gray luggage.\n(C) The train is stopped, and the doors are open.\n(D) A number of suitcases ride on a luggage conveyor belt.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a warthog and a zebra running in a grassland. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what are going down the conveyor belt in the airport\nOptions: (A) ties\n(B) suitcases\n(C) hat\n(D) shoes\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_164_6_COCO_val2014_000000439132.jpg",
"multi/niah_164_7_COCO_val2014_000000562870.jpg",
"multi/niah_164_5_COCO_val2014_000000528062.jpg",
"multi/niah_164_1_COCO_val2014_000000180510.jpg",
"multi/COCO_val2014_000000226020.jpg",
"multi/niah_164_2_COCO_val2014_000000049115.jpg",
"multi/niah_164_3_COCO_val2014_000000292988.jpg"
],
"question1": "Find the frame of a yellow and black train speeding along the rails. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Closeup of a white and black cellphone on a wooden table.\n(B) A boy doing a trick on a skate board at a park.\n(C) A black cat is sitting in a pot.\n(D) Three adults sitting in a dark living room talking.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a yellow and black train speeding along the rails. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what does the boy in a green shirt stand with one foot on a skateboard\nOptions: (A) eagle\n(B) dog\n(C) sheep\n(D) cattle\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_876_6_COCO_val2014_000000162939.jpg",
"multi/niah_876_2_COCO_val2014_000000344314.jpg",
"multi/niah_876_4_COCO_val2014_000000535226.jpg",
"multi/niah_876_1_COCO_val2014_000000388014.jpg",
"multi/niah_876_5_COCO_val2014_000000326919.jpg",
"multi/COCO_val2014_000000329307.jpg"
],
"question1": "Find the frame of a bunch of small birds and rodents in a small box with straw in it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Large jet airplanes parked at an airport waiting for passengers.\n(B) A cat curled up in a shelter made of printer boxes.\n(C) A man standing next to a bat of luggage.\n(D) A plane sitting on the tarmac at an airport\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a bunch of small birds and rodents in a small box with straw in it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the suitcase\nOptions: (A) brown\n(B) gray\n(C) green\n(D) black\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_107_3_COCO_val2014_000000487151.jpg",
"multi/COCO_val2014_000000225784.jpg",
"multi/niah_107_1_COCO_val2014_000000215693.jpg",
"multi/niah_107_5_COCO_val2014_000000189193.jpg",
"multi/niah_107_4_COCO_val2014_000000405621.jpg"
],
"question1": "Find the frame of a bench with a red cover sitting next to a tower with writing on it. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A baseball player standing on a baseball field.\n(B) A young lady is enjoying her time with her skateboard. \n(C) A bicycle with headlights is photographed in a room.\n(D) Many stop signs are lined up in a yard.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a bench with a red cover sitting next to a tower with writing on it. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: where is the bicycle with headlights photographed\nOptions: (A) restroom\n(B) room\n(C) office\n(D) hallway\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_88_5_COCO_val2014_000000519479.jpg",
"multi/niah_88_2_COCO_val2014_000000135673.jpg",
"multi/niah_88_1_COCO_val2014_000000198530.jpg",
"multi/COCO_val2014_000000360986.jpg",
"multi/niah_88_7_COCO_val2014_000000275744.jpg",
"multi/niah_88_3_COCO_val2014_000000056068.jpg",
"multi/niah_88_6_COCO_val2014_000000397512.jpg"
],
"question1": "Find the frame of three people posing on a mountain on skis and snowboards. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A brown cow standing on top of a smaller brown cow next to a building.\n(B) A painting of a man sitting at a computer in his library.\n(C) A white and yellow bedroom equipped with two large beds.\n(D) A couple of large trains on a steel track.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of three people posing on a mountain on skis and snowboards. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the cow\nOptions: (A) red\n(B) brown\n(C) yellow\n(D) orange\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_64_5_COCO_val2014_000000509402.jpg",
"multi/COCO_val2014_000000185530.jpg",
"multi/niah_64_1_COCO_val2014_000000576080.jpg",
"multi/niah_64_4_COCO_val2014_000000361621.jpg",
"multi/niah_64_3_COCO_val2014_000000031269.jpg"
],
"question1": "Find the frame of a toilet is next to two toilet paper rolls and a telephone. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Four sheep graze in a hilly field while the sky above them is blue with wispy clouds.\n(B) a zebra is standing outside in a field\n(C) three zebras grazing in a grassy area near shrubs\n(D) A cat sits in a bathroom sink while looking outward.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a toilet is next to two toilet paper rolls and a telephone. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what are grazing on the green hillside\nOptions: (A) zebras\n(B) giraffe\n(C) sheep\n(D) monkey\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_908_1_COCO_val2014_000000288161.jpg",
"multi/niah_908_3_COCO_val2014_000000569767.jpg",
"multi/COCO_val2014_000000476417.jpg",
"multi/niah_908_4_COCO_val2014_000000349418.jpg",
"multi/niah_908_5_COCO_val2014_000000439907.jpg"
],
"question1": "Find the frame of a kitchen sink is shown with a marble finish. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two people are riding horses down a trail.\n(B) The female tennis player is about to hit the ball.\n(C) A high end bathroom with a shower stall\n(D) A panda bear eating bamboo in a wooded area.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a kitchen sink is shown with a marble finish. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what sits on the ground eating bamboo\nOptions: (A) duck\n(B) panda\n(C) horse\n(D) giraffe\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_182_6_COCO_val2014_000000381368.jpg",
"multi/niah_182_1_COCO_val2014_000000236714.jpg",
"multi/niah_182_5_COCO_val2014_000000364299.jpg",
"multi/niah_182_3_COCO_val2014_000000323525.jpg",
"multi/niah_182_7_COCO_val2014_000000346138.jpg",
"multi/COCO_val2014_000000404237.jpg",
"multi/niah_182_2_COCO_val2014_000000022870.jpg"
],
"question1": "Find the frame of a lone coffee cup sits on an empty park bench. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a black bull hiding behind some grass while what looks like a white crane is riding the bulls back.\n(B) The boat makes a big splash in the water.\n(C) A pizza on a pizza pan with two pieces removed by a serving ladle.\n(D) some pasta dishes are in plates and bowls\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a lone coffee cup sits on an empty park bench. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many fourths of a pizza with meats and vegetables on a pizza pan\nOptions: (A) three\n(B) one\n(C) five\n(D) ten\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/COCO_val2014_000000457131.jpg",
"multi/niah_615_1_COCO_val2014_000000485972.jpg",
"multi/niah_615_2_COCO_val2014_000000327314.jpg",
"multi/niah_615_7_COCO_val2014_000000515465.jpg",
"multi/niah_615_3_COCO_val2014_000000247609.jpg",
"multi/niah_615_6_COCO_val2014_000000161169.jpg",
"multi/niah_615_5_COCO_val2014_000000363652.jpg"
],
"question1": "Find the frame of a dessert cake is on a plate with a fork. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Four black cell phones sit on a table next to a white cell phone.\n(B) A person touching the top of a fire hydrant.\n(C) a craft is flying behind a tree line\n(D) A dog is chasing sheep across an area in full speed.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a dessert cake is on a plate with a fork. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what makes contrails behind a tree branch silhouette\nOptions: (A) engine\n(B) scooter\n(C) airplane\n(D) surfboards\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_308_5_COCO_val2014_000000274591.jpg",
"multi/niah_308_7_COCO_val2014_000000103257.jpg",
"multi/niah_308_1_COCO_val2014_000000228506.jpg",
"multi/COCO_val2014_000000152737.jpg",
"multi/niah_308_2_COCO_val2014_000000098683.jpg",
"multi/niah_308_6_COCO_val2014_000000214293.jpg",
"multi/niah_308_3_COCO_val2014_000000370170.jpg"
],
"question1": "Find the frame of a cat on a bed partially inside a canvas bag. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A small yellow bird sitting on top of a green bird feeder\n(B) A black bear surrounded by logs, rocks, and grass.\n(C) A small child at a kitchen table eats a carrot.\n(D) There is a large black bull with long hair\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a cat on a bed partially inside a canvas bag. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the feeder\nOptions: (A) yellow\n(B) red\n(C) orange\n(D) green\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_259_3_COCO_val2014_000000133412.jpg",
"multi/niah_259_5_COCO_val2014_000000338826.jpg",
"multi/niah_259_1_COCO_val2014_000000553912.jpg",
"multi/niah_259_7_COCO_val2014_000000399567.jpg",
"multi/niah_259_2_COCO_val2014_000000177167.jpg",
"multi/niah_259_6_COCO_val2014_000000008708.jpg",
"multi/COCO_val2014_000000207486.jpg"
],
"question1": "Find the frame of person in gray hooded jacket attempting to cross busy street. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) The train sits on the track in a covered bay.\n(B) The exterior of the Maplewood Fire Station Number 4 building.\n(C) Two bulls in the middle of a street in a town.\n(D) A bear sits next to a book and a chair.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of person in gray hooded jacket attempting to cross busy street. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many black bulls in a roadway at night\nOptions: (A) four\n(B) seven\n(C) ten\n(D) three\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_405_2_COCO_val2014_000000261783.jpg",
"multi/niah_405_5_COCO_val2014_000000198495.jpg",
"multi/niah_405_4_COCO_val2014_000000355860.jpg",
"multi/niah_405_1_COCO_val2014_000000523096.jpg",
"multi/niah_405_6_COCO_val2014_000000140420.jpg",
"multi/COCO_val2014_000000510254.jpg"
],
"question1": "Find the frame of a laptop is sitting on a small table. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Giraffe standing next to a tree near a pond.\n(B) A young lady wearing a yellow dress setting on a large bed\n(C) a fancy black motorcycle sitting on the sidewalk\n(D) A motorcycle is parked on a gravel road in a forest by a stream.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a laptop is sitting on a small table. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the setting\nOptions: (A) yellow\n(B) gray\n(C) blue\n(D) white\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_974_2_COCO_val2014_000000061596.jpg",
"multi/COCO_val2014_000000029059.jpg",
"multi/niah_974_1_COCO_val2014_000000092145.jpg",
"multi/niah_974_5_COCO_val2014_000000308678.jpg",
"multi/niah_974_6_COCO_val2014_000000566274.jpg",
"multi/niah_974_4_COCO_val2014_000000233236.jpg"
],
"question1": "Find the frame of a herd of cows is eating grass by the fence. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a long blue couch sitting by a white wall \n(B) This little girl is sitting in a seat and eating her favorite dessert.\n(C) a couple of people are riding in a small white boat\n(D) A woman is riding a horse with several women doing the same.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a herd of cows is eating grass by the fence. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the young girl eating\nOptions: (A) vegetables\n(B) sandwich\n(C) drinks\n(D) apple\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/COCO_val2014_000000117079.jpg",
"multi/niah_878_4_COCO_val2014_000000091460.jpg",
"multi/niah_878_6_COCO_val2014_000000164515.jpg",
"multi/niah_878_2_COCO_val2014_000000361621.jpg",
"multi/niah_878_1_COCO_val2014_000000543528.jpg",
"multi/niah_878_5_COCO_val2014_000000316842.jpg"
],
"question1": "Find the frame of a plane soars hundreds of feet in the air. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) The pizza is on the plate and ready to be eaten\n(B) A desk that has three computers sitting on it.\n(C) Street scene of black fire hydrant in front a dumpster with graffiti on its side.\n(D) a trio of motorcycles parked next to an r.v.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a plane soars hundreds of feet in the air. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what covered dumpster sits on a sidewalk\nOptions: (A) graffiti\n(B) photograph\n(C) flower\n(D) flowers\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/COCO_val2014_000000467468.jpg",
"multi/niah_535_6_COCO_val2014_000000008688.jpg",
"multi/niah_535_5_COCO_val2014_000000042406.jpg",
"multi/niah_535_2_COCO_val2014_000000114586.jpg",
"multi/niah_535_4_COCO_val2014_000000034689.jpg",
"multi/niah_535_1_COCO_val2014_000000383413.jpg"
],
"question1": "Find the frame of an adult bear and a cub bear nibbling on a piece of wood. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A black and white toothbrush holder holding a toothbrush.\n(B) a wooden bench sitting on the sidewalk in front of some plants \n(C) A mattress sitting above the mattress frame in a small room\n(D) An airplane sitting at a cold runway being unloaded.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of an adult bear and a cub bear nibbling on a piece of wood. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is by the bushes and the building\nOptions: (A) rack\n(B) bench\n(C) counter\n(D) bed\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_158_6_COCO_val2014_000000530741.jpg",
"multi/niah_158_5_COCO_val2014_000000082821.jpg",
"multi/niah_158_7_COCO_val2014_000000528980.jpg",
"multi/niah_158_1_COCO_val2014_000000213035.jpg",
"multi/COCO_val2014_000000371365.jpg",
"multi/niah_158_3_COCO_val2014_000000431323.jpg",
"multi/niah_158_2_COCO_val2014_000000270893.jpg"
],
"question1": "Find the frame of two people bump their cell phones together at a party. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Group of paddle boats tied up alongside a pear on the water. \n(B) An asian salad resting on a bed of rice.\n(C) An inside out umbrella sculpture on a city street.\n(D) A person riding a motorcycle down the street with groups of people behind a fenced area.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of two people bump their cell phones together at a party. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what do people behind a barricade watch a man ride\nOptions: (A) bicycle\n(B) buses\n(C) skateboard\n(D) motorcycle\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_425_1_COCO_val2014_000000085481.jpg",
"multi/COCO_val2014_000000206433.jpg",
"multi/niah_425_5_COCO_val2014_000000239625.jpg",
"multi/niah_425_3_COCO_val2014_000000579693.jpg",
"multi/niah_425_4_COCO_val2014_000000538589.jpg"
],
"question1": "Find the frame of a bowl of soup with a drink and a plate of bread. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two persons in formal dress posing for a photograph.\n(B) There is a toilet next to the shelf with towels.\n(C) A large square pizza being cut up by a person.\n(D) A toilet is placed in between the two walls in what seems to be a bathroom. \nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a bowl of soup with a drink and a plate of bread. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: two men wearing what pose for a picture\nOptions: (A) suitcases\n(B) shoes\n(C) suits\n(D) ties\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_829_1_COCO_val2014_000000464462.jpg",
"multi/niah_829_6_COCO_val2014_000000414201.jpg",
"multi/COCO_val2014_000000558350.jpg",
"multi/niah_829_7_COCO_val2014_000000187055.jpg",
"multi/niah_829_5_COCO_val2014_000000061259.jpg",
"multi/niah_829_2_COCO_val2014_000000323418.jpg",
"multi/niah_829_3_COCO_val2014_000000153832.jpg"
],
"question1": "Find the frame of two apple products with two people lounging behind them. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) One airplane is taking off from an airport while others are parked.\n(B) People are playing with a tennis ball in a gym.\n(C) A dog lying on a couch next to a computer.\n(D) A man and woman holding a knife together cutting a cake that sits on a table.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of two apple products with two people lounging behind them. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is napping in front of a computer\nOptions: (A) dog\n(B) calf\n(C) cats\n(D) monkey\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_50_4_COCO_val2014_000000348496.jpg",
"multi/niah_50_3_COCO_val2014_000000178939.jpg",
"multi/niah_50_5_COCO_val2014_000000126638.jpg",
"multi/niah_50_1_COCO_val2014_000000056736.jpg",
"multi/COCO_val2014_000000487159.jpg"
],
"question1": "Find the frame of a jet that is flying in the sky. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A dog laying on the floor of a tiled hallway.\n(B) Fresh and cooked vegetable meal on a cluttered table.\n(C) A river by a road with several boats in the water\n(D) Three giraffes reaching for leaves on a tall tree.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a jet that is flying in the sky. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what consists of salad , water , vegetables and dip\nOptions: (A) donuts\n(B) meal\n(C) sandwich\n(D) pastry\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_720_5_COCO_val2014_000000104137.jpg",
"multi/niah_720_3_COCO_val2014_000000357529.jpg",
"multi/niah_720_4_COCO_val2014_000000303219.jpg",
"multi/niah_720_1_COCO_val2014_000000220981.jpg",
"multi/COCO_val2014_000000112128.jpg"
],
"question1": "Find the frame of an iPhone with a sad face and arms and legs attached. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two lambs eating hay from ground of a field\n(B) A yellow dog lays on a stack of white chairs.\n(C) A baseball player pitching a baseball on top of a field.\n(D) Two people stand in a living room playing with Wii remotes.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of an iPhone with a sad face and arms and legs attached. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what graze side by side in a fenced enclosure\nOptions: (A) calf\n(B) duck\n(C) sheep\n(D) dog\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_413_6_COCO_val2014_000000255482.jpg",
"multi/niah_413_5_COCO_val2014_000000531512.jpg",
"multi/COCO_val2014_000000334884.jpg",
"multi/niah_413_7_COCO_val2014_000000287545.jpg",
"multi/niah_413_3_COCO_val2014_000000410023.jpg",
"multi/niah_413_1_COCO_val2014_000000229067.jpg",
"multi/niah_413_2_COCO_val2014_000000449259.jpg"
],
"question1": "Find the frame of a group of people ski down the mountain slope. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A large and small giraffe are looking at something while standing next to a log.\n(B) A bowl of bean and fresh vegetable delight\n(C) A blue airplane in a blue, cloudless sky\n(D) A woman in blue jacket shaving a cows utters.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a group of people ski down the mountain slope. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what filled with many different vegetables\nOptions: (A) bowl\n(B) shelf\n(C) rack\n(D) basket\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/COCO_val2014_000000516414.jpg",
"multi/niah_925_1_COCO_val2014_000000089432.jpg",
"multi/niah_925_4_COCO_val2014_000000053893.jpg",
"multi/niah_925_3_COCO_val2014_000000129587.jpg",
"multi/niah_925_5_COCO_val2014_000000095132.jpg"
],
"question1": "Find the frame of a man riding a skateboard on top of a metal rail. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) On an unmade bed sits a blue book.\n(B) Three zebra who are standing next to a wooden trough. \n(C) two boats sitting in the water near a boating dock \n(D) THE ELEPHANT IS IN THE BARN EATING GRASS.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a man riding a skateboard on top of a metal rail. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what are sitting in the water near a boating dock\nOptions: (A) surfboards\n(B) boats\n(C) motorcycle\n(D) trains\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_681_6_COCO_val2014_000000087574.jpg",
"multi/niah_681_3_COCO_val2014_000000044575.jpg",
"multi/niah_681_1_COCO_val2014_000000190447.jpg",
"multi/niah_681_7_COCO_val2014_000000187240.jpg",
"multi/niah_681_5_COCO_val2014_000000033554.jpg",
"multi/niah_681_2_COCO_val2014_000000277793.jpg",
"multi/COCO_val2014_000000192970.jpg"
],
"question1": "Find the frame of an old bike sitting outside an antique store. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Street signs point people in traffic in the right direction\n(B) yellow daffodils sitting in a vase on a table\n(C) A red bus next to a white van on a street.\n(D) The giraffes are loitering in their manmade habitat.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of an old bike sitting outside an antique store. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many giraffes walking around a large outdoor exhibit\nOptions: (A) four\n(B) ten\n(C) seven\n(D) three\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_34_5_COCO_val2014_000000129897.jpg",
"multi/niah_34_1_COCO_val2014_000000139108.jpg",
"multi/niah_34_3_COCO_val2014_000000526044.jpg",
"multi/niah_34_4_COCO_val2014_000000200710.jpg",
"multi/COCO_val2014_000000047149.jpg"
],
"question1": "Find the frame of the horses are grazing on the grassy field. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a cow walking on a beach near water\n(B) A white bowl filled with a salad on top of a plate.\n(C) Two teddy bears that are set up to look like they got married.\n(D) A group of men sitting at a table in a cafe.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of the horses are grazing on the grassy field. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what filled with the salad on top of a plate\nOptions: (A) shelf\n(B) basket\n(C) bowl\n(D) box\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_40_6_COCO_val2014_000000554340.jpg",
"multi/niah_40_4_COCO_val2014_000000323418.jpg",
"multi/niah_40_1_COCO_val2014_000000407355.jpg",
"multi/COCO_val2014_000000518773.jpg",
"multi/niah_40_2_COCO_val2014_000000062154.jpg",
"multi/niah_40_5_COCO_val2014_000000360208.jpg"
],
"question1": "Find the frame of a brown ottoman sits near a black counter in a vacant room. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Blond haired woman wearing a black top hat.\n(B) Eight jets in the sky in a flying formation.\n(C) A white bread sandwich has ham and jalapenos in it. \n(D) Half a donut lying on the yellow line of a street.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a brown ottoman sits near a black counter in a vacant room. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what filled with meat and peppers\nOptions: (A) drinks\n(B) pastry\n(C) wine\n(D) sandwich\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_498_1_COCO_val2014_000000379055.jpg",
"multi/niah_498_6_COCO_val2014_000000067315.jpg",
"multi/niah_498_2_COCO_val2014_000000216497.jpg",
"multi/niah_498_5_COCO_val2014_000000007952.jpg",
"multi/niah_498_7_COCO_val2014_000000125107.jpg",
"multi/COCO_val2014_000000161840.jpg",
"multi/niah_498_3_COCO_val2014_000000246105.jpg"
],
"question1": "Find the frame of an apple and a metal object sitting on blocks of wood with mirrors over them. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) three cats standing in a window looking out at the street \n(B) A forlorn looking pug rests its head on an empty plastic bottle\n(C) A train on a train track running through a city.\n(D) A man's striped dress shirt and patterned tie.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of an apple and a metal object sitting on blocks of wood with mirrors over them. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many cats is sitting in a window waiting for their master to return\nOptions: (A) four\n(B) ten\n(C) three\n(D) five\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_116_3_COCO_val2014_000000121112.jpg",
"multi/niah_116_1_COCO_val2014_000000212112.jpg",
"multi/niah_116_5_COCO_val2014_000000306952.jpg",
"multi/COCO_val2014_000000530220.jpg",
"multi/niah_116_4_COCO_val2014_000000246125.jpg"
],
"question1": "Find the frame of a classic green truck is parked at a show. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A green and silver motorcycle on display on a red carpet.\n(B) Two cats lying on desk facing camera with a keyboard in foreground.\n(C) A giraffe is indoors behind a tall gate\n(D) Someone needs to show the baby the right end of a toothbrush.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a classic green truck is parked at a show. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the carpet\nOptions: (A) red\n(B) green\n(C) yellow\n(D) black\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_135_4_COCO_val2014_000000531334.jpg",
"multi/niah_135_5_COCO_val2014_000000116405.jpg",
"multi/niah_135_1_COCO_val2014_000000171685.jpg",
"multi/COCO_val2014_000000269344.jpg",
"multi/niah_135_3_COCO_val2014_000000199669.jpg"
],
"question1": "Find the frame of a red motorcycle parked on the street while people pass. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) people working with their laptops on a wooden table\n(B) a dog jumping in the air with a frisbee in its mouth\n(C) Two large and furry foxes looking at something together.\n(D) Citrus heights water district building with large orange sculptures and a flagpole.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a red motorcycle parked on the street while people pass. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what jumps to catch the frisbee in mid-air\nOptions: (A) dog\n(B) cats\n(C) zebras\n(D) elephant\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_223_3_COCO_val2014_000000401124.jpg",
"multi/niah_223_5_COCO_val2014_000000141634.jpg",
"multi/niah_223_4_COCO_val2014_000000022423.jpg",
"multi/niah_223_1_COCO_val2014_000000511307.jpg",
"multi/COCO_val2014_000000176993.jpg"
],
"question1": "Find the frame of a dog snuggled under covers in a bed. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A snowy bench is sitting in the woods.\n(B) this is a train coming down the rails\n(C) black and white photo of 2 women touching a horse\n(D) an elephant uses his tusk to lift a mans hat \nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a dog snuggled under covers in a bed. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the train\nOptions: (A) brown\n(B) purple\n(C) blue\n(D) orange\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_895_3_COCO_val2014_000000322887.jpg",
"multi/niah_895_1_COCO_val2014_000000126253.jpg",
"multi/niah_895_5_COCO_val2014_000000153307.jpg",
"multi/COCO_val2014_000000080213.jpg",
"multi/niah_895_4_COCO_val2014_000000491755.jpg"
],
"question1": "Find the frame of people are walking down a street in the rain. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Small living room with a white fire place in the center. \n(B) A vase filled with yellow flowers on a table.\n(C) A black cat with a beanie on it's head.\n(D) The sign warns of the impending curve in the road.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of people are walking down a street in the rain. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the hat\nOptions: (A) orange\n(B) black\n(C) brown\n(D) purple\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_274_1_COCO_val2014_000000163444.jpg",
"multi/COCO_val2014_000000105951.jpg",
"multi/niah_274_2_COCO_val2014_000000535950.jpg",
"multi/niah_274_3_COCO_val2014_000000499388.jpg",
"multi/niah_274_6_COCO_val2014_000000254976.jpg",
"multi/niah_274_7_COCO_val2014_000000132375.jpg",
"multi/niah_274_5_COCO_val2014_000000297425.jpg"
],
"question1": "Find the frame of a stainless steel refrigerator is between cabinets in a kitchen. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A dog jumping and catching a frisbee in the park.\n(B) A sheet cake covered in blue frosting and two number 2 candles.\n(C) Someone eating what looks like a jelly roll.\n(D) a vase with a flower and a jar full of juice\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a stainless steel refrigerator is between cabinets in a kitchen. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the hand holding a piece of meat filled\nOptions: (A) beverage\n(B) pastry\n(C) wine\n(D) drink\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/COCO_val2014_000000216785.jpg",
"multi/niah_55_5_COCO_val2014_000000420589.jpg",
"multi/niah_55_4_COCO_val2014_000000555366.jpg",
"multi/niah_55_3_COCO_val2014_000000550729.jpg",
"multi/niah_55_1_COCO_val2014_000000389699.jpg"
],
"question1": "Find the frame of there are a couple of kangaroos out in a pasture and one of them is eating. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A zebra standing in a bright green field of grass.\n(B) a kite stuck up high in a tree\n(C) Several trains are parked next to a platform underneath a ceiling.\n(D) a rainbow colored kite is laying on the grass\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of there are a couple of kangaroos out in a pasture and one of them is eating. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what caught in the tall tree\nOptions: (A) toys\n(B) kite\n(C) hat\n(D) ball\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/COCO_val2014_000000205513.jpg",
"multi/niah_372_6_COCO_val2014_000000455943.jpg",
"multi/niah_372_1_COCO_val2014_000000085292.jpg",
"multi/niah_372_5_COCO_val2014_000000352936.jpg",
"multi/niah_372_3_COCO_val2014_000000290261.jpg",
"multi/niah_372_7_COCO_val2014_000000398507.jpg",
"multi/niah_372_2_COCO_val2014_000000230262.jpg"
],
"question1": "Find the frame of a long train traveling along train tracks in a train yard. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A plane is flying through the sky on a fairly clear day.\n(B) A culinary display of a food dish is shown.\n(C) The bird is sitting on the abandoned bench.\n(D) Two army vehicles sit on a patch of grass. \nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a long train traveling along train tracks in a train yard. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what filled with the cake topped with vegetables\nOptions: (A) cabinet\n(B) bowl\n(C) container\n(D) box\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/COCO_val2014_000000020070.jpg",
"multi/niah_521_6_COCO_val2014_000000141821.jpg",
"multi/niah_521_2_COCO_val2014_000000201764.jpg",
"multi/niah_521_5_COCO_val2014_000000220511.jpg",
"multi/niah_521_3_COCO_val2014_000000326410.jpg",
"multi/niah_521_7_COCO_val2014_000000272961.jpg",
"multi/niah_521_1_COCO_val2014_000000250592.jpg"
],
"question1": "Find the frame of a large red and white boat docked next to a walkway. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Horses walking through the yard toward a barn.\n(B) Gourmet food on a picnic table with drinks\n(C) This is a man riding an off road skateboard with people flying kites in the background.\n(D) A man grimaces playfully while blow-drying his hair. \nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a large red and white boat docked next to a walkway. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: where is man using a hair dryer\nOptions: (A) bathroom\n(B) bedroom\n(C) office\n(D) hallway\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_2_5_COCO_val2014_000000153832.jpg",
"multi/niah_2_3_COCO_val2014_000000454143.jpg",
"multi/niah_2_2_COCO_val2014_000000529348.jpg",
"multi/niah_2_6_COCO_val2014_000000504024.jpg",
"multi/COCO_val2014_000000152823.jpg",
"multi/niah_2_7_COCO_val2014_000000498511.jpg",
"multi/niah_2_1_COCO_val2014_000000485613.jpg"
],
"question1": "Find the frame of a baby kitten is being feed milk from a bottle. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Horses grazing in a green meadow at the foot of a mountain\n(B) Two giraffe stand near each other eating from a feeder. \n(C) A few birds sitting together in a nest.\n(D) A miniature railroad train going through a yard.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a baby kitten is being feed milk from a bottle. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what sit leaning out of the bird 's nest\nOptions: (A) bear\n(B) cattle\n(C) birds\n(D) cats\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_620_6_COCO_val2014_000000571636.jpg",
"multi/niah_620_5_COCO_val2014_000000173553.jpg",
"multi/niah_620_1_COCO_val2014_000000200365.jpg",
"multi/COCO_val2014_000000358617.jpg",
"multi/niah_620_4_COCO_val2014_000000379161.jpg",
"multi/niah_620_2_COCO_val2014_000000514797.jpg"
],
"question1": "Find the frame of table near car with a bicycle along side and a plate with two hot dogs and a coke. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) To keep up with the action, the media people need to be as fast as the players.\n(B) A man in a truck filled with bottled water.\n(C) A vase with red berries and green stems.\n(D) The man in the wetsuit carries a red and white striped surfboard.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of table near car with a bicycle along side and a plate with two hot dogs and a coke. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what does the man in the wetsuit carry\nOptions: (A) motorcycle\n(B) trains\n(C) surfboard\n(D) jet\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_60_3_COCO_val2014_000000387250.jpg",
"multi/COCO_val2014_000000579276.jpg",
"multi/niah_60_5_COCO_val2014_000000509531.jpg",
"multi/niah_60_4_COCO_val2014_000000257657.jpg",
"multi/niah_60_1_COCO_val2014_000000163309.jpg"
],
"question1": "Find the frame of the sandwich is cut into four small sections. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a small child is biting in to some food\n(B) Yellow chrysanthemums are arranged in two white holders.\n(C) A smiling man wearing a business suit sits at his desk by the window. \n(D) Two loaves of bread in pans inside an oven.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of the sandwich is cut into four small sections. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many loaves of bread baking in an oven\nOptions: (A) ten\n(B) four\n(C) six\n(D) two\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_494_4_COCO_val2014_000000142774.jpg",
"multi/niah_494_1_COCO_val2014_000000152281.jpg",
"multi/niah_494_6_COCO_val2014_000000402967.jpg",
"multi/COCO_val2014_000000035049.jpg",
"multi/niah_494_2_COCO_val2014_000000068381.jpg",
"multi/niah_494_5_COCO_val2014_000000412431.jpg"
],
"question1": "Find the frame of tagged animals are grazing on grass in a field Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a young baseball player with a bat on the field\n(B) The city bus is traveling down the road with its passengers.\n(C) a clock sitting on a mantel in front of a mirror \n(D) Several bicycles and a bus parked outside a large building.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of tagged animals are grazing on grass in a field Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: how many kid player stands with bat as others sit on bench during baseball game\nOptions: (A) six\n(B) seven\n(C) two\n(D) one\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_868_1_COCO_val2014_000000175831.jpg",
"multi/niah_868_4_COCO_val2014_000000492196.jpg",
"multi/COCO_val2014_000000257336.jpg",
"multi/niah_868_6_COCO_val2014_000000323602.jpg",
"multi/niah_868_5_COCO_val2014_000000194381.jpg",
"multi/niah_868_2_COCO_val2014_000000551602.jpg"
],
"question1": "Find the frame of an image of a boy on his bike riding past a building Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) An elephant standing at the edge of a river in a lushly green landscape\n(B) A woman baking a pizza pie in the kitchen\n(C) A tug boat pulling a barge with two yellow containers.\n(D) A grey and white puppy wearing a bandana stands on a skateboard.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of an image of a boy on his bike riding past a building Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: where does the woman take a pizza out of the oven\nOptions: (A) hallway\n(B) office\n(C) kitchen\n(D) bathroom\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_784_4_COCO_val2014_000000469051.jpg",
"multi/COCO_val2014_000000313623.jpg",
"multi/niah_784_1_COCO_val2014_000000073470.jpg",
"multi/niah_784_5_COCO_val2014_000000164464.jpg",
"multi/niah_784_3_COCO_val2014_000000388157.jpg"
],
"question1": "Find the frame of a man preparing donuts at a donut shop. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a woman holding a colorful kite in the shape of a plane\n(B) A bench with a couple of skateboard sparked underneath it.\n(C) A large white construction truck parked in a lot area in front of a building supply store.\n(D) Bananas hanging on tree in their natural form.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a man preparing donuts at a donut shop. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what butted up against the large white piece of equipment\nOptions: (A) locomotive\n(B) truck\n(C) ship\n(D) airliner\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_963_6_COCO_val2014_000000019890.jpg",
"multi/niah_963_2_COCO_val2014_000000036331.jpg",
"multi/niah_963_1_COCO_val2014_000000079024.jpg",
"multi/COCO_val2014_000000397327.jpg",
"multi/niah_963_3_COCO_val2014_000000100974.jpg",
"multi/niah_963_5_COCO_val2014_000000271900.jpg",
"multi/niah_963_7_COCO_val2014_000000466531.jpg"
],
"question1": "Find the frame of a motorcycle sits near a door on a walkway Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Two zebras standing in the sand, eating hay.\n(B) The bathroom has very white appliances in it.\n(C) Four lambs on a farm enjoy eating their hay.\n(D) A small bird sits on a corn plant.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a motorcycle sits near a door on a walkway Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: the toilet what a sink and a towel and toilet paper\nOptions: (A) bench\n(B) chair\n(C) desk\n(D) cabinet\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_16_5_COCO_val2014_000000246270.jpg",
"multi/niah_16_3_COCO_val2014_000000529636.jpg",
"multi/COCO_val2014_000000144250.jpg",
"multi/niah_16_7_COCO_val2014_000000548498.jpg",
"multi/niah_16_2_COCO_val2014_000000188817.jpg",
"multi/niah_16_6_COCO_val2014_000000372220.jpg",
"multi/niah_16_1_COCO_val2014_000000118739.jpg"
],
"question1": "Find the frame of large pizza pie with one slice missing on the table Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) An old man dressed in black on his phone.\n(B) A kitten is sitting in a bowl on a table.\n(C) A giraffe and its young on an open prairie.\n(D) Three red buses are parked on opposite sides of the road.\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of large pizza pie with one slice missing on the table Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is standing in the middle of the field with its baby\nOptions: (A) duck\n(B) monkey\n(C) giraffe\n(D) birds\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/niah_459_5_COCO_val2014_000000344928.jpg",
"multi/niah_459_3_COCO_val2014_000000522782.jpg",
"multi/niah_459_1_COCO_val2014_000000392905.jpg",
"multi/niah_459_4_COCO_val2014_000000075051.jpg",
"multi/COCO_val2014_000000333998.jpg"
],
"question1": "Find the frame of an elephant swinging its trunk inside of a pen. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A kitchen that has a refrigerator, oven, and sink in it.\n(B) Three desktop screens that are on a table.\n(C) a toilet bowl and a trash can in a bathroom\n(D) A small black kitten digs into a plastic mug.\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of an elephant swinging its trunk inside of a pen. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: where is the white refrigerator freezer sitting\nOptions: (A) bedroom\n(B) garden\n(C) office\n(D) kitchen\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_834_3_COCO_val2014_000000198120.jpg",
"multi/niah_834_1_COCO_val2014_000000143901.jpg",
"multi/niah_834_4_COCO_val2014_000000361621.jpg",
"multi/niah_834_5_COCO_val2014_000000150212.jpg",
"multi/COCO_val2014_000000219798.jpg"
],
"question1": "Find the frame of a woman holding a lit birthday cake at the base of the stairs. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A cat sits in a bathroom sink while looking outward.\n(B) Two big boxes of donuts and drinks displayed in an office.\n(C) Looking past a snowboard in the snow to a city beyond\n(D) A dog is walking while wearing brightly colored clothing.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a woman holding a lit birthday cake at the base of the stairs. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is walking while wearing brightly colored clothing\nOptions: (A) dog\n(B) cattle\n(C) sheep\n(D) birds\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_924_7_COCO_val2014_000000221521.jpg",
"multi/niah_924_3_COCO_val2014_000000291528.jpg",
"multi/niah_924_5_COCO_val2014_000000130291.jpg",
"multi/niah_924_6_COCO_val2014_000000015954.jpg",
"multi/niah_924_2_COCO_val2014_000000131882.jpg",
"multi/COCO_val2014_000000010039.jpg",
"multi/niah_924_1_COCO_val2014_000000427518.jpg"
],
"question1": "Find the frame of a baseball player talking to a coach on the field. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) An antique clock tower rises above into the cloudy skies.\n(B) Three wood carvings of humans hold up a bench\n(C) A woman is fixing a man's red tie.\n(D) Two giraffes standing outside while people watch them\nOnly give the best option. Best option: (",
"answer1": "A",
"question2": "Find the frame of a baseball player talking to a coach on the field. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is at the base of a tall spire\nOptions: (A) towel\n(B) track\n(C) fabric\n(D) clock\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_24_1_COCO_val2014_000000032375.jpg",
"multi/niah_24_4_COCO_val2014_000000066926.jpg",
"multi/COCO_val2014_000000156458.jpg",
"multi/niah_24_6_COCO_val2014_000000095191.jpg",
"multi/niah_24_5_COCO_val2014_000000014990.jpg",
"multi/niah_24_2_COCO_val2014_000000537132.jpg"
],
"question1": "Find the frame of the two eddy bears are leaning on the purplish pillows Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A bunch of glazed donuts sitting on a table.\n(B) A dog has came up from underneath a picnic table and rests his arms.\n(C) Two elephants following an individual on a dirt path\n(D) A train station with three trains near a building\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of the two eddy bears are leaning on the purplish pillows Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is leaning against the picnic table\nOptions: (A) eagle\n(B) dog\n(C) duck\n(D) sheep\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/COCO_val2014_000000527064.jpg",
"multi/niah_674_3_COCO_val2014_000000402967.jpg",
"multi/niah_674_1_COCO_val2014_000000361656.jpg",
"multi/niah_674_5_COCO_val2014_000000205196.jpg",
"multi/niah_674_4_COCO_val2014_000000124684.jpg"
],
"question1": "Find the frame of a close up of a bunch of ripened bananas. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) Several bicycles and a bus parked outside a large building.\n(B) An oblong pizza with mushrooms is on a cutting board.\n(C) A woman watching a frisbee fly just above her head.\n(D) A bunch of doughnuts that are on a table.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a close up of a bunch of ripened bananas. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what placed on top of table next to a wineglass\nOptions: (A) cake\n(B) pizza\n(C) banana\n(D) sandwich\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_201_6_COCO_val2014_000000279024.jpg",
"multi/niah_201_5_COCO_val2014_000000181816.jpg",
"multi/niah_201_4_COCO_val2014_000000437859.jpg",
"multi/niah_201_2_COCO_val2014_000000477534.jpg",
"multi/niah_201_1_COCO_val2014_000000189773.jpg",
"multi/COCO_val2014_000000509811.jpg"
],
"question1": "Find the frame of a double high bus that is sitting on the street. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a person in an outfit is sitting on a bench\n(B) Those cars are all parked along the curb.\n(C) Library with table and closed laptops with one person studying.\n(D) A bundle of trash with a toilet basin, polythenes and bottles\nOnly give the best option. Best option: (",
"answer1": "C",
"question2": "Find the frame of a double high bus that is sitting on the street. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what filled with wooden tables with apple laptop computers on them\nOptions: (A) store\n(B) street\n(C) library\n(D) airport\nOnly give the best option. Best option: (",
"answer2": "C"
},
{
"images": [
"multi/COCO_val2014_000000451150.jpg",
"multi/niah_939_1_COCO_val2014_000000041837.jpg",
"multi/niah_939_3_COCO_val2014_000000566687.jpg",
"multi/niah_939_5_COCO_val2014_000000574200.jpg",
"multi/niah_939_4_COCO_val2014_000000249757.jpg"
],
"question1": "Find the frame of baggage carriers at an airport next to a jet liner Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) The red and blue sign wants to stop the war on workers.\n(B) The six donuts are made with various seasonings.\n(C) A colorful vegetable salad is in a green bowl. \n(D) A bus yard filled with lots of yellow school buses.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of baggage carriers at an airport next to a jet liner Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what are made with various seasonings\nOptions: (A) donuts\n(B) drink\n(C) pizza\n(D) fruit\nOnly give the best option. Best option: (",
"answer2": "A"
},
{
"images": [
"multi/niah_130_6_COCO_val2014_000000075481.jpg",
"multi/COCO_val2014_000000415949.jpg",
"multi/niah_130_2_COCO_val2014_000000156832.jpg",
"multi/niah_130_4_COCO_val2014_000000036196.jpg",
"multi/niah_130_1_COCO_val2014_000000169700.jpg",
"multi/niah_130_5_COCO_val2014_000000522959.jpg"
],
"question1": "Find the frame of tall clock tower made of grey brick stands tall Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A street corner has a sign, and across the street is a store.\n(B) A blue and white airplane with nose pointing out on a runway while a person on a bike rides by.\n(C) A wheel control zone sign posted on a sidewalk\n(D) A couch and table in a small room.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of tall clock tower made of grey brick stands tall Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what filled with exotic looking furniture\nOptions: (A) garage\n(B) room\n(C) hallway\n(D) garden\nOnly give the best option. Best option: (",
"answer2": "B"
},
{
"images": [
"multi/niah_375_4_COCO_val2014_000000216962.jpg",
"multi/COCO_val2014_000000217039.jpg",
"multi/niah_375_1_COCO_val2014_000000296724.jpg",
"multi/niah_375_2_COCO_val2014_000000237207.jpg",
"multi/niah_375_5_COCO_val2014_000000576822.jpg",
"multi/niah_375_6_COCO_val2014_000000135167.jpg"
],
"question1": "Find the frame of a room that has a brick wall and wood floorings with throw carpets on the floor. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) A white plated topped with different types of food.\n(B) A bedroom that appears to be very white.\n(C) If you have some animals that are eating something up a tree.\n(D) A Kenya Airways plane with red engine cover on the ground.\nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of a room that has a brick wall and wood floorings with throw carpets on the floor. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what sits on the runway\nOptions: (A) bicycle\n(B) engine\n(C) truck\n(D) airplane\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/COCO_val2014_000000470553.jpg",
"multi/niah_972_5_COCO_val2014_000000172924.jpg",
"multi/niah_972_6_COCO_val2014_000000148957.jpg",
"multi/niah_972_2_COCO_val2014_000000080213.jpg",
"multi/niah_972_4_COCO_val2014_000000062808.jpg",
"multi/niah_972_1_COCO_val2014_000000224702.jpg"
],
"question1": "Find the frame of a man is holding a tennis ball and tennis racket while a woman is looking at him from behind. Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a train that is on a road next to some buildings\n(B) A half eaten cake sitting on a pan with candy.\n(C) The box contains six donuts, but only two are chocolate covered.\n(D) Two people sitting at a table with a pizza pie.\nOnly give the best option. Best option: (",
"answer1": "B",
"question2": "Find the frame of a man is holding a tennis ball and tennis racket while a woman is looking at him from behind. Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the icing\nOptions: (A) red\n(B) white\n(C) green\n(D) blue\nOnly give the best option. Best option: (",
"answer2": "D"
},
{
"images": [
"multi/niah_54_1_COCO_val2014_000000520812.jpg",
"multi/niah_54_2_COCO_val2014_000000540820.jpg",
"multi/niah_54_5_COCO_val2014_000000486397.jpg",
"multi/niah_54_3_COCO_val2014_000000076310.jpg",
"multi/niah_54_7_COCO_val2014_000000435359.jpg",
"multi/COCO_val2014_000000005617.jpg",
"multi/niah_54_6_COCO_val2014_000000407019.jpg"
],
"question1": "Find the frame of baby sheep lie in a grassy field with their mother standing over them Locate the final frame based on the instructions in the image, and choose the captions that best describe the frame: \nOptions: (A) a polar bear standing in a rocky habitat\n(B) A computer is left on in front of a bookshelf\n(C) a woman walks with some sheep down a walk way \n(D) Two cats are laying on separate pillows on a bed. \nOnly give the best option. Best option: (",
"answer1": "D",
"question2": "Find the frame of baby sheep lie in a grassy field with their mother standing over them Locate the final frame based on the instructions in the image, and answer the following question based on the final frame: what is the color of the cats\nOptions: (A) red\n(B) blue\n(C) green\n(D) brown\nOnly give the best option. Best option: (",
"answer2": "D"
}
]
================================================
FILE: xtuner-eval_niah/vision_niah/flash_eval_xtuner_multi.sh
================================================
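# Start from a clean Triton cache directory; -f and -p keep a first run (no ./tmp yet) from failing.
rm -rf ./tmp/tmp2
mkdir -p ./tmp/tmp2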
export TRITON_CACHE_DIR="./tmp/tmp2"
export PYTHONPATH="./"
for MODEL_NAME in VideoChat-Flash-Qwen2-7B_res448
do
# -p makes reruns safe when the output directories already exist.
mkdir -p "vision_niah/data/haystack_embeddings/$MODEL_NAME"
mkdir -p "vision_niah/data_multi/needle_embeddings/$MODEL_NAME"
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPUS=8
# Step 1: embed sampled frames of the haystack video (log in vision_niah/log/s1).
python vision_niah/produce_haystack_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--video_path vision_niah/data/haystack_videos/video_haystack.mkv \
--output_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--sampled_frames_num 10000 \
--pooling_size 0 \
2>&1 | tee vision_niah/log/s1/eval_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
# Step 2: embed the needles from the multi-hop dataset (log in vision_niah/log/s2).
python vision_niah/multi_produce_needle_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--needle_dataset vision_niah/data_multi/source_data/niah-coco-multihop-100.json \
--output_dir vision_niah/data_multi/needle_embeddings/$MODEL_NAME \
--pooling_size 0 \
2>&1 | tee vision_niah/log/s2/eval_multihop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
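# Step 3: run the multi-hop evaluation on ${NUM_GPUS} GPUs using the embeddings produced above (log in vision_niah/log/s3).
torchrun --nproc-per-node=${NUM_GPUS} vision_niah/multi_eval_vision_niah.py \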
--model vision_niah/model_weights/$MODEL_NAME \
--needle_dataset vision_niah/data_multi/source_data/niah-coco-multihop-100.json \
--needle_embedding_dir vision_niah/data_multi/needle_embeddings/$MODEL_NAME \
--haystack_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--prompt_template qwen2 \
--max_frame_num 10000 \
--min_frame_num 1000 \
--frame_interval 1000 \
2>&1 | tee vision_niah/log/s3/eval_multihop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
done
================================================
FILE: xtuner-eval_niah/vision_niah/flash_eval_xtuner_single.sh
================================================
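# Start from a clean Triton cache directory; -f and -p keep a first run (no ./tmp yet) from failing.
rm -rf ./tmp/tmp1
mkdir -p ./tmp/tmp1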
export TRITON_CACHE_DIR="./tmp/tmp1"
export PYTHONPATH="./"
for MODEL_NAME in VideoChat-Flash-Qwen2-7B_res448
do
# -p makes reruns safe when the output directories already exist.
mkdir -p "vision_niah/data/haystack_embeddings/$MODEL_NAME"
mkdir -p "vision_niah/data/needle_embeddings/$MODEL_NAME"
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPUS=8
# Step 1: embed sampled frames of the haystack video (log in vision_niah/log/s1).
python vision_niah/produce_haystack_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--video_path vision_niah/data/haystack_videos/video_haystack.mkv \
--output_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--sampled_frames_num 10000 \
--pooling_size 0 \
2>&1 | tee vision_niah/log/s1/eval_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
# Step 2: embed the needles from the single-hop dataset (log in vision_niah/log/s2).
python vision_niah/single_produce_needle_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--needle_dataset vision_niah/data/source_data/niah-coco-singlehop_20.json \
--output_dir vision_niah/data/needle_embeddings/$MODEL_NAME \
--pooling_size 0 \
2>&1 | tee vision_niah/log/s2/eval_singlehop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
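# Step 3: run the single-hop evaluation on ${NUM_GPUS} GPUs using the embeddings produced above (log in vision_niah/log/s3).
torchrun --nproc-per-node=${NUM_GPUS} vision_niah/single_eval_vision_niah.py \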
--model vision_niah/model_weights/$MODEL_NAME \
--needle_embedding_dir vision_niah/data/needle_embeddings/$MODEL_NAME \
--haystack_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--needle_dataset vision_niah/data/source_data/niah-coco-singlehop_20.json \
--prompt_template qwen2 \
--max_frame_num 10000 \
--min_frame_num 1000 \
--frame_interval 1000 \
--depth_interval 0.2 \
2>&1 | tee vision_niah/log/s3/eval_singlehop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
done
================================================
FILE: xtuner-eval_niah/vision_niah/log/s1/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/log/s2/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/log/s3/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/longva_eval_xtuner_multi.sh
================================================
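# Start from a clean Triton cache directory; -f and -p keep a first run (no ./tmp yet) from failing.
rm -rf ./tmp/tmp2
mkdir -p ./tmp/tmp2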
export TRITON_CACHE_DIR="./tmp/tmp2"
export PYTHONPATH="./"
for MODEL_NAME in LongVA-7B
do
# -p makes reruns safe when the output directories already exist.
mkdir -p "vision_niah/data/haystack_embeddings/$MODEL_NAME"
mkdir -p "vision_niah/data_multi/needle_embeddings/$MODEL_NAME"
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPUS=8
# Step 1: embed sampled frames of the haystack video (log in vision_niah/log/s1).
python vision_niah/produce_haystack_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--video_path vision_niah/data/haystack_videos/video_haystack.mkv \
--output_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--sampled_frames_num 3000 \
--pooling_size 2 \
2>&1 | tee vision_niah/log/s1/eval_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
# Step 2: embed the needles from the multi-hop dataset (log in vision_niah/log/s2).
python vision_niah/multi_produce_needle_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--needle_dataset vision_niah/data_multi/source_data/niah-coco-multihop-100.json \
--output_dir vision_niah/data_multi/needle_embeddings/$MODEL_NAME \
--pooling_size 2 \
2>&1 | tee vision_niah/log/s2/eval_multihop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
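# Step 3: run the multi-hop evaluation on ${NUM_GPUS} GPUs using the embeddings produced above (log in vision_niah/log/s3).
torchrun --nproc-per-node=${NUM_GPUS} vision_niah/multi_eval_vision_niah.py \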
--model vision_niah/model_weights/$MODEL_NAME \
--needle_dataset vision_niah/data_multi/source_data/niah-coco-multihop-100.json \
--needle_embedding_dir vision_niah/data_multi/needle_embeddings/$MODEL_NAME \
--haystack_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--prompt_template qwen2 \
--max_frame_num 3000 \
--min_frame_num 500 \
--frame_interval 500 \
2>&1 | tee vision_niah/log/s3/eval_multihop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
done
================================================
FILE: xtuner-eval_niah/vision_niah/longva_eval_xtuner_single.sh
================================================
rm -rf ./tmp/tmp1
mkdir -p ./tmp/tmp1
export TRITON_CACHE_DIR="./tmp/tmp1"
export PYTHONPATH="./"
for MODEL_NAME in LongVA-7B
do
mkdir -p vision_niah/data/haystack_embeddings/$MODEL_NAME
mkdir -p vision_niah/data/needle_embeddings/$MODEL_NAME
JOB_NAME=$(basename $0)_$(date +"%Y%m%d_%H%M%S")
NUM_GPUS=8
python vision_niah/produce_haystack_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--video_path vision_niah/data/haystack_videos/video_haystack.mkv \
--output_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--sampled_frames_num 3000 \
--pooling_size 2 \
2>&1 | tee vision_niah/log/s1/eval_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
python vision_niah/single_produce_needle_embedding.py \
--model vision_niah/model_weights/$MODEL_NAME \
--needle_dataset vision_niah/data/source_data/niah-coco-singlehop_20.json \
--output_dir vision_niah/data/needle_embeddings/$MODEL_NAME \
--pooling_size 2 \
2>&1 | tee vision_niah/log/s2/eval_singlehop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
torchrun --nproc-per-node=${NUM_GPUS} vision_niah/single_eval_vision_niah.py \
--model vision_niah/model_weights/$MODEL_NAME \
--needle_embedding_dir vision_niah/data/needle_embeddings/$MODEL_NAME \
--haystack_dir vision_niah/data/haystack_embeddings/$MODEL_NAME \
--needle_dataset vision_niah/data/source_data/niah-coco-singlehop_20.json \
--prompt_template qwen2 \
--max_frame_num 3000 \
--min_frame_num 500 \
--frame_interval 500 \
--depth_interval 0.2 \
2>&1 | tee vision_niah/log/s3/eval_singlehop_${MODEL_NAME}_$(date +"%Y%m%d_%H%M").log
done
================================================
FILE: xtuner-eval_niah/vision_niah/model_weights/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/multi_eval_vision_niah.py
================================================
import argparse
import gc
import glob
import json
import os
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.distributed as dist
from datasets import load_dataset
from matplotlib.colors import LinearSegmentedColormap
from mmengine.dist import infer_launcher, init_dist
from mmengine.runner import set_random_seed
from tqdm import tqdm
from transformers import AutoConfig, AutoTokenizer, Qwen2ForCausalLM
from transformers.cache_utils import DynamicCache
from xtuner._lite.accelerate.dispatches import dispatch_modules
from xtuner._lite.parallel import split_for_sequence_parallel
from xtuner._lite.parallel.setup import get_sp_group, setup_parallel
SEED = 24242424
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
prompt_templates = {
"mistral": {
"preprompt": "[INST]",
"postprompt": " [/INST]"
},
"vicuna": {
"preprompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER:",
"postprompt": "ASSISTANT:"
},
"llama3": {
"preprompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
"postprompt": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
},
"qwen2": {
"preprompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n",
"postprompt": "<|im_end|>\n<|im_start|>assistant\n",
},
"yi": {
"preprompt": "<|im_start|>system\nAnswer the questions.<|im_end|>\n<|im_start|>user\n",
"postprompt": "<|im_end|>\n<|im_start|>assistant\n",
},
}
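def safe_tokenize(tokenizer, text):
    """Tokenize `text` and strip a leading BOS token if the tokenizer added one."""
    tokenized = tokenizer.encode(text, return_tensors="pt")
    # `tokenized` has shape (1, seq_len), so check the sequence dimension for emptiness.
    if tokenizer.bos_token is not None and tokenized.shape[1] > 0 and tokenized[0, 0] == tokenizer.bos_token_id:
        tokenized = tokenized[:, 1:]
    return tokenized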
# answer = "more bet"
def eval_forward(model, input_embeds, answer_embeds, pad_id, answer_ids, tokenizer, sp_size, config, type_str, idx):
world_size = dist.get_world_size(get_sp_group())
rank = dist.get_rank(get_sp_group())
prompt_length = input_embeds.shape[1]
labels_length = answer_embeds.shape[1]
input_embeds = torch.cat([input_embeds, answer_embeds], dim=1)
pad_token_num = (sp_size * 2) - input_embeds.shape[1] % (sp_size * 2)
pad_tensor = torch.tensor([pad_id] * pad_token_num).unsqueeze(0).unsqueeze(-1).expand(-1, -1, input_embeds.shape[-1]).to("cuda")
input_embeds = torch.cat([input_embeds, pad_tensor], dim=1)
position_ids = (
torch.arange(input_embeds.shape[1]).unsqueeze(0).expand(input_embeds.shape[0], -1)
).to("cuda")
seq_len_per_gpu = int(input_embeds.shape[1] // sp_size)
# print("seq_len_per_gpu: ", seq_len_per_gpu)
# if rank == 0 :
# print("input_embeds: ", input_embeds.shape)
assert input_embeds.shape[1] % sp_size == 0
sp_group = get_sp_group()
input_embeds = split_for_sequence_parallel(input_embeds, dim=1, sp_group=sp_group)  # Originally this took input_ids; can input_embeds also be split with this function?
position_ids = split_for_sequence_parallel(position_ids, dim=1, sp_group=sp_group)
past_key_values = DynamicCache(config.num_hidden_layers)
with torch.inference_mode():
output = model(
inputs_embeds=input_embeds,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=True,
)
logits = output[0]
# print("rank: ", rank, "logits shape: ", logits.shape)
pred = logits.argmax(dim=-1)
dist.broadcast(pred, src=world_size - 1)
pred = pred[:, (prompt_length - 1)%seq_len_per_gpu : (prompt_length + labels_length - 1)%seq_len_per_gpu]
# print("rank: ", rank, "pred shape: ", pred.shape)
# check if the logits are correct, extract argmax id # compare the predicted_ids with the labels
predict_str = str(tokenizer.decode(pred.squeeze().tolist())).lower()
answer_str = str(tokenizer.decode(answer_ids.to("cuda").squeeze().tolist())).lower()
correct = (predict_str.replace("(","").replace(")","") == answer_str)
if rank == 0 :
print(
" Idx:",
idx,
" Type:",
type_str,
" Predicted: ",
tokenizer.decode(pred.squeeze().tolist()),
" Answer: ",
tokenizer.decode(answer_ids.squeeze().tolist()),
" Correct:",
correct
)
return int(correct)
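# --- Illustrative sketch (not part of the original evaluation pipeline) ---
# eval_forward above pads the sequence to a multiple of 2 * sp_size, splits it
# evenly across the sequence-parallel ranks, and slices the answer predictions
# with indices taken modulo the per-rank length. The helper below reproduces
# only that index arithmetic in plain Python so it can be checked without GPUs.
# The name `sp_answer_slice` is hypothetical.
def sp_answer_slice(prompt_length, labels_length, sp_size):
    total = prompt_length + labels_length
    pad = (sp_size * 2) - total % (sp_size * 2)  # always pads at least one token
    seq_len_per_gpu = (total + pad) // sp_size
    start = (prompt_length - 1) % seq_len_per_gpu
    end = (prompt_length + labels_length - 1) % seq_len_per_gpu
    # The logits that predict the answer sit at global positions
    # prompt_length - 1 .. prompt_length + labels_length - 2. The slice above is
    # only meaningful when all of them fall on the last rank, which is the rank
    # whose predictions eval_forward broadcasts.
    first_rank = (prompt_length - 1) // seq_len_per_gpu
    last_rank = (prompt_length + labels_length - 2) // seq_len_per_gpu
    on_last_rank = first_rank == last_rank == sp_size - 1
    return pad, seq_len_per_gpu, start, end, on_last_rank
# Example: sp_answer_slice(1000, 3, 8) pads 5 tokens and gives 126 tokens per
# rank, with the answer at local indices 117:120 on the last rank.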
def load_haystack(args):
haystack_embeddings = torch.load(f"{args.haystack_dir}/video_embeddings.pt").to(torch.bfloat16)
# for file_path in tqdm(sorted(Path(args.haystack_dir).glob("*.pt"))[:args.max_frame_num], desc="Loading Haystack Embeddings...", disable=not accelerator.is_main_process):
# embeddings = torch.load(file_path, map_location="cuda").to(torch.bfloat16).unsqueeze(0)
# haystack_embeddings = embeddings if haystack_embeddings is None else torch.cat(
# [haystack_embeddings, embeddings], dim=0
# )
return haystack_embeddings
def load_text_embeddings(text, tokenizer, model, replace_double_newline=False):
    token_ids = safe_tokenize(tokenizer, text)

    def replace_double_newline_func(token_ids):
        # Substitute each token id 271 with two token ids 198.
        # For example:
        # from: tensor([[128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 4221, 323, 11376, 18328, 13]])
        # to: tensor([[128000, 128006, 9125, 128007, 198, 198, 2675, 527, 264, 11190, 4221, 323, 11376, 18328, 13]])
        # The length increases by the number of 271 tokens.
        double_newline_loc = (token_ids == 271).nonzero()[:, 1]
        # Each earlier replacement shifts the later positions right by one.
        double_newline_loc += torch.arange(len(double_newline_loc))
        if len(double_newline_loc) > 0:
            for loc in double_newline_loc:
                token_ids = torch.cat([token_ids[:, :loc], torch.tensor([[198, 198]]), token_ids[:, loc+1:]], dim=1)
        return token_ids

    if replace_double_newline:
        token_ids = replace_double_newline_func(token_ids)
    token_ids = token_ids.to("cuda")
    with torch.inference_mode():
        embeddings = model.model.embed_tokens(token_ids)
    return embeddings.to(torch.bfloat16)
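# --- Illustrative sketch (not part of the original evaluation pipeline) ---
# The double-newline replacement above can be expressed on a plain Python list,
# which makes the expected result easy to check. The ids 271 and 198 come from
# the comment in the function above; which tokenizer they belong to is not
# stated there. The name `expand_double_newline` is hypothetical.
def expand_double_newline(token_ids, double_id=271, single_id=198):
    expanded = []
    for token_id in token_ids:
        if token_id == double_id:
            expanded.extend([single_id, single_id])  # one "\n\n" token becomes two "\n" tokens
        else:
            expanded.append(token_id)
    return expanded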
def inference(args):
dist_launcher = infer_launcher()
init_dist(dist_launcher)
set_random_seed(42)
sp_size = dist.get_world_size()
setup_parallel(sp_size, ring_size=sp_size)
tokenizer = AutoTokenizer.from_pretrained(
args.model,
model_max_length=sys.maxsize,
trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
kwargs = {"rope_theta": args.rope_theta} if args.rope_theta is not None else {}
config = AutoConfig.from_pretrained(args.model)
model = Qwen2ForCausalLM.from_pretrained(
args.model,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
**kwargs,
).cuda()
rank = dist.get_rank(get_sp_group())
if rank == 0:
print("Preparing Haystack...")
haystack_embeddings = load_haystack(args)
assert len(haystack_embeddings) >= args.max_frame_num, "Haystack embeddings are not enough. Max frame {} is not found. Currently only {} frames.".format(args.max_frame_num, len(haystack_embeddings))
haystack_embeddings = haystack_embeddings[:args.max_frame_num].to("cuda")
prompt = prompt_templates[args.prompt_template]
preprompt_embeddings = load_text_embeddings(prompt["preprompt"], tokenizer, model, args.replace_double_newline)
postprompt_embeddings = load_text_embeddings(prompt["postprompt"], tokenizer, model, args.replace_double_newline)
with open(args.needle_dataset, 'r', encoding='utf-8') as file:
needle_dataset = json.load(file)
needle_embedding_list = []
cap_answer_embedding_list = []
cap_answer_id_list = []
cap_question_embeding_list = []
qa_answer_embedding_list = []
qa_answer_id_list = []
qa_question_embeding_list = []
for index, instance in enumerate(needle_dataset):
needle_embedding_list.append(torch.load(args.needle_embedding_dir + f"/{index}.pt", map_location="cpu").to(torch.bfloat16).to("cuda"))
cap_answer = instance["answer1"]
cap_question = instance["question1"]
if rank == 0:
print("index:",index,"\nCaption question:",cap_question,"\nCaption answer:",cap_answer,"\n")
cap_answer_embedding_list.append(load_text_embeddings(cap_answer, tokenizer, model))
cap_answer_id_list.append(safe_tokenize(tokenizer, cap_answer))
cap_question_embeding_list.append(load_text_embeddings(cap_question, tokenizer, model))
qa_answer = instance["answer2"]
qa_question = instance["question2"]
if rank == 0:
print("index:",index,"\nQA question:",qa_question,"\nQA answer:",qa_answer,"\n")
qa_answer_embedding_list.append(load_text_embeddings(qa_answer, tokenizer, model))
qa_answer_id_list.append(safe_tokenize(tokenizer, qa_answer))
qa_question_embeding_list.append(load_text_embeddings(qa_question, tokenizer, model))
if rank == 0:
print("Starting Evaluation...")
model.eval()
dispatch_modules(model)
all_accuries = []
for num_frames in tqdm(
range(
args.min_frame_num, args.max_frame_num + 1, args.frame_interval
)
):
cap_accuracies = []
qa_accuracies = []
for idx, (needle_embedding, cap_question_embedding, cap_answer_embedding, cap_answer_id, qa_question_embedding, qa_answer_embedding, qa_answer_id) in enumerate(zip(needle_embedding_list, cap_question_embeding_list, cap_answer_embedding_list, cap_answer_id_list, qa_question_embeding_list, qa_answer_embedding_list, qa_answer_id_list)):
needle_num = needle_embedding.shape[0]
needle_interval = num_frames / max(needle_num - 1, 1)  # avoid division by zero when an instance has a single needle
for needle_id in range(needle_num):
haystack_left = int((needle_id-1) * needle_interval)
haystack_right = int(needle_id * needle_interval)
if needle_id == 0:
input_frames = needle_embedding[needle_id:(needle_id+1)].view(1, -1, haystack_embeddings.shape[-1])
else:
haystack_embeddings_seg = haystack_embeddings[haystack_left:haystack_right].view(1, -1, haystack_embeddings.shape[-1])
needle_embedding_single = needle_embedding[needle_id:(needle_id+1)].view(1, -1, haystack_embeddings.shape[-1])
input_frames = torch.cat([input_frames, haystack_embeddings_seg, needle_embedding_single], dim=1)
if rank == 0:
print("\ninput_frames:",input_frames.shape)
input_frames = input_frames.view(-1, haystack_embeddings.shape[-1]).unsqueeze(0)
input_emebds = torch.cat([preprompt_embeddings, input_frames, cap_question_embedding, postprompt_embeddings], dim=1)
cap_correct = eval_forward(model, input_emebds, cap_answer_embedding, tokenizer.pad_token_id, cap_answer_id, tokenizer , sp_size, config, "Cap", idx)
input_emebds = torch.cat([preprompt_embeddings, input_frames, qa_question_embedding, postprompt_embeddings], dim=1)
qa_correct = eval_forward(model, input_emebds, qa_answer_embedding, tokenizer.pad_token_id, qa_answer_id, tokenizer , sp_size, config, "QA ", idx)
gc.collect()
torch.cuda.empty_cache()
if rank == 0:
cap_accuracies.append(cap_correct)
qa_accuracies.append(qa_correct)
if rank == 0:
both_correct = 0
only_cap_correct = 0
only_qa_correct = 0
both_wrong = 0
for cap, qa in zip(cap_accuracies, qa_accuracies):
if cap == 1 and qa == 1:
both_correct += 1
elif cap == 1 and qa == 0:
only_cap_correct += 1
elif cap == 0 and qa == 1:
only_qa_correct += 1
elif cap == 0 and qa == 0:
both_wrong += 1
result = {
"Num. Frame": num_frames,
"Score": (sum(cap_accuracies) + sum(qa_accuracies)) / (len(cap_accuracies) + len(qa_accuracies)),
"Caption Score": sum(cap_accuracies) / len(cap_accuracies),
"QA Score": sum(qa_accuracies) / len(qa_accuracies),
"Total question pair":len(qa_accuracies),
"Both correct": both_correct,
"Only cap correct":only_cap_correct,
"Only qa correct":only_qa_correct,
"Both wrong": both_wrong,
}
print(result)
all_accuries.append(result)
if rank == 0:
model_name = args.model.split("/")[-1]
os.makedirs(f"{args.output_path}/{model_name}", exist_ok=True)
# save all_accuries as json
with open(f"{args.output_path}/{model_name}/all_accuracies.json", "w") as f:
json.dump(all_accuries, f, indent=4)
return all_accuries
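# --- Illustrative sketch (not part of the original evaluation pipeline) ---
# The multi-hop loop in inference() interleaves needles with equal-length
# haystack segments: needle 0 first, then for each later needle the haystack
# frames [left, right) followed by that needle. The helper below reproduces only
# the boundary arithmetic so the layout can be inspected without loading any
# embeddings. The name `needle_layout` and the tuple format are hypothetical.
def needle_layout(num_frames, needle_num):
    layout = [("needle", 0)]
    if needle_num < 2:
        return layout  # a single needle has no haystack segments
    interval = num_frames / (needle_num - 1)
    for needle_id in range(1, needle_num):
        left = int((needle_id - 1) * interval)
        right = int(needle_id * interval)
        layout.append(("haystack", left, right))
        layout.append(("needle", needle_id))
    return layout
# Example: with 500 haystack frames and 3 needles, the segments are frames
# 0:250 and 250:500, each followed by a needle.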
def plot(args, all_accuries):
    df = pd.DataFrame(all_accuries)
    cmap = LinearSegmentedColormap.from_list(
        "custom_cmap", ["#F0496E", "#EBB839", "#9ad5b3"]
    )
    # The multi-hop results built in inference() have no "Frame Depth" column
    # (needles are spread evenly over the haystack), so pivoting on depth raises
    # a KeyError. Plot one row per score type against the number of frames instead.
    pivot_table = df.set_index("Num. Frame")[["Score", "Caption Score", "QA Score"]].T
    # Create the heatmap with better aesthetics
    plt.figure(figsize=(17.5, 8))  # Can adjust these dimensions as needed
    ax = sns.heatmap(
        pivot_table,
        # annot=True,
        fmt="g",
        vmin=0,
        vmax=1,
        linecolor='white',
        linewidths=1.5,
        cmap=cmap,
        cbar_kws={"label": "Score"},
    )
    # Set the color bar label font size
    cbar = ax.collections[0].colorbar
    cbar.ax.yaxis.label.set_size(14)
    cbar.ax.tick_params(labelsize=14)

    # Define the formatter function
    def thousands_formatter(x, pos):
        if x >= 1000:
            return f'{x/1000:.1f}K'
        return f'{x}'

    context_lengths = pivot_table.columns
    formatted_context_lengths = [thousands_formatter(x, None) for x in context_lengths]
    # More aesthetics
    plt.xlabel("Num. of Frames", fontsize=14)  # X-axis label
    plt.ylabel("Score Type", fontsize=14)  # Y-axis label
    plt.xticks(ticks=[i + 0.5 for i in range(len(context_lengths))], labels=formatted_context_lengths, rotation=45, fontsize=14)
    plt.yticks(rotation=0, fontsize=14)  # Ensures the y-axis labels are horizontal
    plt.tight_layout()  # Fits everything neatly into the figure area
    # Save the heatmap next to the JSON results
    model_name = args.model.split("/")[-1]
    plt.savefig(f"{args.output_path}/{model_name}/heatmap.png")
    # Calculate the average accuracy over all frame counts
    average_accuracy = df["Score"].mean()
    print(f"Average Accuracy: {average_accuracy}")
    # Save it as text
    with open(f"{args.output_path}/{model_name}/avg_accuracy.txt", "w") as f:
        f.write(f"Average Accuracy: {average_accuracy}\n")
def main(args):
if args.plot_only:
# load all_accuracies from json
model_name = args.model.split("/")[-1]
with open(f"{args.output_path}/{model_name}/all_accuracies.json", "r") as f:
all_accuracies = json.load(f)
plot(args, all_accuracies)
else:
all_accuracies = inference(args)
if dist.get_rank(get_sp_group()) == 0:
plot(args, all_accuracies)
if __name__ == "__main__":
args = argparse.ArgumentParser()
args.add_argument("--model", type=str, default="output/LLaVA-NeXT-Video-7B-32K")
args.add_argument("--max_frame_num", type=int, default=256)
args.add_argument("--needle_dataset", type=str, default="lmms-lab/v_niah_needles")
args.add_argument("--min_frame_num", type=int, default=20)
args.add_argument("--frame_interval", type=int, default=20)
args.add_argument("--output_path", type=str, default="vision_niah/niah_output_multi")
args.add_argument("--num_samples", type=int, default=1)
args.add_argument("--rope_theta", type=float, default=None)
args.add_argument("--haystack_dir", type=str, default="video_needle_haystack/data/haystack_embeddings")
args.add_argument("--needle_embedding_dir", type=str, default="vision_niah/data/needle_embeddings")
args.add_argument("--prompt_template", type=str)
args.add_argument("--replace_double_newline", action="store_true")
args.add_argument("--plot_only", action="store_true")
main(args.parse_args())
================================================
FILE: xtuner-eval_niah/vision_niah/multi_produce_needle_embedding.py
================================================
import sys
import os
import json
import argparse
import numpy as np
from tqdm import tqdm
import torch
from pathlib import Path
from PIL import Image
from datasets import load_dataset
import math
import cv2
import io
data_root_path = "path_to/niah_data/"
def main(args):
if "videochat-flash" in args.model.lower():
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.mm_utils import process_images
elif "longva" in args.model.lower():
from longva.model.builder import load_pretrained_model
from longva.mm_utils import tokenizer_image_token, get_model_name_from_path
from longva.mm_utils import process_images
else:
raise ValueError("This model version is not currently supported. Please adapt the code manually.")
model_name = "llava_qwen"
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model, None, model_name, load_8bit=False,device_map="cuda:0")
model.config.image_aspect_ratio = "pad"
model.config.mm_patch_merge_type="flat"
with open(args.needle_dataset, 'r', encoding='utf-8') as file:
data = json.load(file)
for index, instance in enumerate(tqdm(data, desc="Processing")):
images_list = []
for image_path in instance["images"]:
frame_path = os.path.join(data_root_path, image_path)
with open(frame_path, 'rb') as f:
img_bytes = f.read()
# img_np = np.frombuffer(img_bytes, np.uint8)
# img = cv2.imdecode(img_np, cv2.IMREAD_COLOR)
# cv2.cvtColor(img, cv2.COLOR_BGR2RGB, img)
# images_list.append(img)
# Use PIL to process the image
img = Image.open(io.BytesIO(img_bytes))
img = img.convert("RGB")
images_list.append(img)
# print(images_list.shape)
images_list = process_images(images_list, image_processor, model.config).half()
# print("image before encode:", images_list.shape)
if "videochat-flash" in args.model.lower():
batch_size = 4
image_features_list=[]
for idx in range(images_list.shape[0]):
processed_images = images_list[idx:(idx+1)].repeat(batch_size, 1, 1, 1)
# print("processed_images:", processed_images.shape)
image_features = model.encode_video_image([processed_images], video_idx_in_batch = [0])[0]
# print("image_features:", image_features.shape)
image_features_list.append(image_features)
image_features=torch.cat(image_features_list,dim=0)
else:
image_features = model.encode_images(images_list)
# print("needle embedding shape: ", image_features.shape)
if args.pooling_size != 0:
B, N, F = image_features.shape
side = int(math.sqrt(N))  # assumes the tokens of each image form a square grid
image_features_spatial = image_features.view(B, side, side, F).permute(0, 3, 1, 2) # B, F, side, side (e.g. 24, 24)
image_features_spatial_pool = torch.nn.functional.avg_pool2d(image_features_spatial, args.pooling_size, args.pooling_size) # B, F, side / pooling_size, side / pooling_size (e.g. 12, 12)
image_features = image_features_spatial_pool.flatten(2).transpose(1, 2).contiguous() # B, (side / pooling_size) ** 2, F (e.g. 144 tokens)
image_features = image_features.squeeze(0)
# print("needle shape after pooling: ", image_features.shape)
if "videochat-flash" not in args.model.lower():
image_features = image_features.repeat(1, 4, 1)
print("final save needle shape: ", image_features.shape)
torch.save(image_features, f"{args.output_dir}/{index}.pt")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="output/LLaVA-NeXT-Video-7B-Vicuna")
parser.add_argument("--needle_dataset", type=str, default="vision_niah/data_multi/source_data/niah-coco-multihop_mc.json")
parser.add_argument("--output_dir", type=str, default="video_needle_haystack/data/needle_vicuna_embeddings")
parser.add_argument("--pooling_size", type=int, default=0)
args = parser.parse_args()
main(args)
================================================
FILE: xtuner-eval_niah/vision_niah/niah_output_multi/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/niah_output_single/git_placeholder
================================================
================================================
FILE: xtuner-eval_niah/vision_niah/produce_haystack_embedding.py
================================================
import sys
import os
from decord import VideoReader, cpu
import argparse
import numpy as np
from tqdm import tqdm
import torch
import math
from PIL import Image
def load_video_batches(video_path, batch_size):
    """Yield batches of frames sampled at roughly one frame per second."""
    vr = VideoReader(video_path, ctx=cpu(0))
    fps = max(1, round(vr.get_avg_fps()))  # guard against a zero step for very low frame rates
    frame_idx = list(range(0, len(vr), fps))
    for start_idx in range(0, len(frame_idx), batch_size):
        # Slicing clamps at the end of the list, so the last batch may be shorter.
        frame_indices = frame_idx[start_idx:start_idx + batch_size]
        batch_frames = vr.get_batch(frame_indices).asnumpy()
        yield batch_frames
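# --- Illustrative sketch (not part of the original script) ---
# load_video_batches above keeps one frame per second of video (every `fps`-th
# frame) and groups the kept indices into batches. The helper below shows the
# same index arithmetic without decord, so the sampling can be checked on plain
# integers. The name `sample_frame_batches` is hypothetical.
def sample_frame_batches(total_frames, fps, batch_size):
    step = max(1, round(fps))  # one sampled frame per second of video
    frame_idx = list(range(0, total_frames, step))
    return [frame_idx[start:start + batch_size] for start in range(0, len(frame_idx), batch_size)]
# Example: a 100-frame clip at 30 fps keeps frames 0, 30, 60 and 90; with
# batch_size=3 the last batch holds only frame 90.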
def main(args):
if "videochat-flash" in args.model.lower():
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.mm_utils import process_images
elif "longva" in args.model.lower():
from longva.model.builder import load_pretrained_model
from longva.mm_utils import tokenizer_image_token, get_model_name_from_path
from longva.mm_utils import process_images
else:
raise ValueError("This model version is not currently supported. Please adapt the code manually.")
video_path = args.video_path
model_path = args.model
model_name = "llava_qwen"
if args.sampled_frames_num > 7500:
sample_frame_num = args.sampled_frames_num // 2
double_flag = True
else:
sample_frame_num = args.sampled_frames_num
double_flag = False
tokenizer, model, image_processor, context_len = load_pretrained_model(model_path, None, model_name, load_8bit=False,device_map="cuda:0")
del model.model.layers
model.config.image_aspect_ratio = "pad"
model.config.mm_patch_merge_type="flat"
# Process video in batches
batch_size = 8
total_batches = (sample_frame_num + batch_size - 1) // batch_size
image_feature_list = []
if args.add_newline_token:
newline_token_embeddong = model.model.image_newline
with torch.inference_mode():
for i, video_batch in tqdm(enumerate(load_video_batches(video_path, batch_size)), total=total_batches, desc="Processing Video Batches"):
images = [Image.fromarray(frame).convert("RGB") for frame in video_batch]
processed_images = process_images(images, image_processor,model.config).half()
# print("processed_images.shape",processed_images.shape)
if "videochat-flash" in model_path.lower():
image_features = model.encode_video_image([processed_images], video_idx_in_batch = [0] )[0]
image_features = image_features.reshape(processed_images.shape[0], -1, image_features.shape[-1])  # the last batch may hold fewer than batch_size frames
else:
image_features = model.encode_images(processed_images)
# print(image_features.shape)
if args.pooling_size != 0:
B, N, F = image_features.shape
side = int(math.sqrt(N))  # assumes the tokens of each frame form a square grid
image_features_spatial = image_features.view(B, side, side, F).permute(0, 3, 1, 2) # B, F, side, side (e.g. 24, 24)
image_features_spatial_pool = torch.nn.functional.avg_pool2d(image_features_spatial, args.pooling_size, args.pooling_size) # B, F, side / pooling_size, side / pooling_size (e.g. 12, 12)
image_features = image_features_spatial_pool.flatten(2).transpose(1, 2).contiguous() # B, (side / pooling_size) ** 2, F (e.g. 144 tokens)
if args.add_newline_token:
image_features = torch.cat([image_features, newline_token_embeddong.unsqueeze(0).expand(image_features.shape[0], 1, -1)], dim=1)
image_feature_list.append(image_features.to(torch.bfloat16).to("cpu"))
if i + 1 >= total_batches:  # stop once total_batches batches have been collected
break
image_feature_list = torch.cat(image_feature_list, dim=0)
if double_flag:
image_feature_list = torch.cat((image_feature_list, image_feature_list), dim=0)
print("final save haystack shape: ", image_feature_list.shape)
torch.save(image_feature_list, f"{args.output_dir}/video_embeddings.pt")
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="output/LLaVA-NeXT-Video-7B-Vicuna")
parser.add_argument("--video_path", type=str, default="vision_niah/data/haystack_videos")
parser.add_argument("--sampled_frames_num", type=int, default=7200)
parser.add_argument("--output_dir", type=str, default="video_needle_haystack/data/haystack_vicuna_embeddings")
parser.add_argument("--pooling_size", type=int, default=0)
parser.add_argument("--add_newline_token", action="store_true")
args = parser.parse_args()
main(args)
================================================
FILE: xtuner-eval_niah/vision_niah/single_eval_vision_niah.py
================================================
import argparse
import gc
import glob
import json
import os
import random
import sys

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
import torch
import torch.distributed as dist
from datasets import load_dataset
from matplotlib.colors import LinearSegmentedColormap
from mmengine.dist import infer_launcher, init_dist
from mmengine.runner import set_random_seed
from tqdm import tqdm
from transformers import AutoConfig, AutoTokenizer, Qwen2ForCausalLM
from transformers.cache_utils import DynamicCache
from xtuner._lite.accelerate.dispatches import dispatch_modules
from xtuner._lite.parallel import split_for_sequence_parallel
from xtuner._lite.parallel.setup import get_sp_group, setup_parallel
SEED = 24242424
torch.manual_seed(SEED)
random.seed(SEED)
np.random.seed(SEED)
prompt_templates = {
"mistral": {
"preprompt": "[INST]",
"postprompt": " [/INST]"
},
"vicuna": {
"preprompt": "A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions. USER:",
"postprompt": "ASSISTANT:"
},
"llama3": {
"preprompt": "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nYou are a helpful language and vision assistant. You are able to understand the visual content that the user provides, and assist the user with a variety of tasks using natural language.<|eot_id|><|start_header_id|>user<|end_header_id|>\n\n",
"postprompt": "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
},
"qwen2": {
"preprompt": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n",
"postprompt": "<|im_end|>\n<|im_start|>assistant\n",
},
"yi": {
"preprompt": "<|im_start|>system\nAnswer the questions.<|im_end|>\n<|im_start|>user\n",
"postprompt": "<|im_end|>\n<|im_start|>assistant\n",
},
}
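def safe_tokenize(tokenizer, text):
    """Tokenize `text` and strip a leading BOS token if the tokenizer added one."""
    tokenized = tokenizer.encode(text, return_tensors="pt")
    # `tokenized` has shape (1, seq_len), so check the sequence dimension for emptiness.
    if tokenizer.bos_token is not None and tokenized.shape[1] > 0 and tokenized[0, 0] == tokenizer.bos_token_id:
        tokenized = tokenized[:, 1:]
    return tokenized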
# answer = "more bet"
def eval_forward( model, input_embeds, answer_embeds, pad_id, answer_ids, tokenizer, sp_size, config, idx):
world_size = dist.get_world_size(get_sp_group())
rank = dist.get_rank(get_sp_group())
prompt_length = input_embeds.shape[1]
labels_length = answer_embeds.shape[1]
input_embeds = torch.cat([input_embeds, answer_embeds], dim=1)
pad_token_num = (sp_size * 2) - input_embeds.shape[1] % (sp_size * 2)
pad_tensor = torch.tensor([pad_id] * pad_token_num).unsqueeze(0).unsqueeze(-1).expand(-1, -1, input_embeds.shape[-1]).to("cuda")
input_embeds = torch.cat([input_embeds, pad_tensor], dim=1)
position_ids = (
torch.arange(input_embeds.shape[1]).unsqueeze(0).expand(input_embeds.shape[0], -1)
).to("cuda")
seq_len_per_gpu = int(input_embeds.shape[1] // sp_size)
# print("seq_len_per_gpu: ", seq_len_per_gpu)
if rank == 0 :
print("input_embeds: ", input_embeds.shape)
assert input_embeds.shape[1] % sp_size == 0
sp_group = get_sp_group()
input_embeds = split_for_sequence_parallel(input_embeds, dim=1, sp_group=sp_group)  # Originally this took input_ids; can input_embeds also be split with this function?
position_ids = split_for_sequence_parallel(position_ids, dim=1, sp_group=sp_group)
past_key_values = DynamicCache(config.num_hidden_layers)
with torch.inference_mode():
output = model(
inputs_embeds=input_embeds,
position_ids=position_ids,
past_key_values=past_key_values,
use_cache=True,
)
logits = output[0]
# print("rank: ", rank, "logits shape: ", logits.shape)
pred = logits.argmax(dim=-1)
dist.broadcast(pred, src=world_size - 1)
pred = pred[:, (prompt_length - 1)%seq_len_per_gpu : (prompt_length + labels_length - 1)%seq_len_per_gpu]
# print("rank: ", rank, "pred shape: ", pred.shape)
# check if the logits are correct, extract argmax id # compare the predicted_ids with the labels
predict_str = str(tokenizer.decode(pred.squeeze().tolist())).lower()
answer_str = str(tokenizer.decode(answer_ids.to("cuda").squeeze().tolist())).lower()
correct = (predict_str.replace("(","").replace(")","") == answer_str)
if rank == 0 :
print(
" Idx:",
idx,
" Predicted:",
tokenizer.decode(pred.squeeze().tolist()),
" Answer:",
tokenizer.decode(answer_ids.squeeze().tolist()),
" Correct:",
correct
)
return int(correct)
def load_haystack(args):
haystack_embeddings = torch.load(f"{args.haystack_dir}/video_embeddings.pt").to(torch.bfloat16)
# for file_path in tqdm(sorted(Path(args.haystack_dir).glob("*.pt"))[:args.max_frame_num], desc="Loading Haystack Embeddings...", disable=not accelerator.is_main_process):
# embeddings = torch.load(file_path, map_location="cuda").to(torch.bfloat16).unsqueeze(0)
# haystack_embeddings = embeddings if haystack_embeddings is None else torch.cat(
# [haystack_embeddings, embeddings], dim=0
# )
return haystack_embeddings
def load_text_embeddings(text, tokenizer, model, replace_double_newline=False):
    token_ids = safe_tokenize(tokenizer, text)

    def replace_double_newline_func(token_ids):
        # Substitute each token id 271 with two token ids 198.
        # For example:
        # from: tensor([[128000, 128006, 9125, 128007, 271, 2675, 527, 264, 11190, 4221, 323, 11376, 18328, 13]])
        # to: tensor([[128000, 128006, 9125, 128007, 198, 198, 2675, 527, 264, 11190, 4221, 323, 11376, 18328, 13]])
        # The length increases by the number of 271 tokens.
        double_newline_loc = (token_ids == 271).nonzero()[:, 1]
        # Each earlier replacement shifts the later positions right by one.
        double_newline_loc += torch.arange(len(double_newline_loc))
        if len(double_newline_loc) > 0:
            for loc in double_newline_loc:
                token_ids = torch.cat([token_ids[:, :loc], torch.tensor([[198, 198]]), token_ids[:, loc+1:]], dim=1)
        return token_ids

    if replace_double_newline:
        token_ids = replace_double_newline_func(token_ids)
    token_ids = token_ids.to("cuda")
    with torch.inference_mode():
        embeddings = model.model.embed_tokens(token_ids)
    return embeddings.to(torch.bfloat16)
def inference(args):
dist_launcher = infer_launcher()
init_dist(dist_launcher)
set_random_seed(42)
sp_size = dist.get_world_size()
setup_parallel(sp_size, ring_size=sp_size)
tokenizer = AutoTokenizer.from_pretrained(
args.model,
model_max_length=sys.maxsize,
trust_remote_code=True,
)
tokenizer.pad_token = tokenizer.eos_token
kwargs = {"rope_theta": args.rope_theta} if args.rope_theta is not None else {}
config = AutoConfig.from_pretrained(args.model)
model = Qwen2ForCausalLM.from_pretrained(
args.model,
torch_dtype=torch.bfloat16,
attn_implementation="flash_attention_2",
**kwargs,
).cuda()
print("Preparing Haystack...")
haystack_embeddings = load_haystack(args)
assert len(haystack_embeddings) >= args.max_frame_num, "Haystack embeddings are not enough. Max frame {} is not found. Currently only {} frames.".format(args.max_frame_num, len(haystack_embeddings))
haystack_embeddings = haystack_embeddings[:args.max_frame_num].to("cuda")
prompt = prompt_templates[args.prompt_template]
preprompt_embeddings = load_text_embeddings(prompt["preprompt"], tokenizer, model, args.replace_double_newline)
postprompt_embeddings = load_text_embeddings(prompt["postprompt"], tokenizer, model, args.replace_double_newline)
# needle_dataset = load_dataset(args.needle_dataset)["test"]
with open(args.needle_dataset, 'r', encoding='utf-8') as file:
needle_dataset = json.load(file)
answer_embedding_list = []
answer_id_list = []
needle_embedding_list = []
question_embeding_list = []
for index, instance in enumerate(needle_dataset):
answer = instance["answer"]
question = instance["question"]
print("index",index,"\nquestion",question,"\nanswer",answer,"\n")
needle_embedding_list.append(torch.load(args.needle_embedding_dir + f"/{index}.pt", map_location="cuda").to(torch.bfloat16).to("cuda"))
answer_embedding_list.append(load_text_embeddings(answer, tokenizer, model))
answer_id_list.append(safe_tokenize(tokenizer, answer))
question_embeding_list.append(load_text_embeddings(question, tokenizer, model))
print("Starting Evaluation...")
model.eval()
dispatch_modules(model)
rank = dist.get_rank(get_sp_group())
all_accuries = []
for num_frames in tqdm(
range(
args.min_frame_num, args.max_frame_num + 1, args.frame_interval
)
):
for depth in np.arange(0, 1 + args.depth_interval, args.depth_interval):
accuracies = []
for idx, (question_embedding, needle_embedding, answer_embedding, answer_id) in enumerate(zip(question_embeding_list, needle_embedding_list, answer_embedding_list, answer_id_list)):
query_frame_idx = int(depth * num_frames)
# if rank==0:
# print(idx)
haystack_embeddings_left = haystack_embeddings[:query_frame_idx].view(1, -1, haystack_embeddings.shape[-1])
needle_embedding = needle_embedding.unsqueeze(0).view(1, -1, haystack_embeddings.shape[-1])
haystack_embeddings_right = haystack_embeddings[query_frame_idx:num_frames].view(1, -1, haystack_embeddings.shape[-1])
input_frames = torch.cat([haystack_embeddings_left,needle_embedding, haystack_embeddings_right], dim=1)
# input_frames = torch.cat([haystack_embeddings[:query_frame_idx],needle_embedding.unsqueeze(0), haystack_embeddings[query_frame_idx:num_frames]], dim=0).view(-1, haystack_embeddings.shape[-1]).unsqueeze(0)
input_emebds = torch.cat([preprompt_embeddings, input_frames,question_embedding, postprompt_embeddings], dim=1)
# >>> eval forward start >>>
correct = eval_forward(
model, input_emebds, answer_embedding, tokenizer.pad_token_id, answer_id, tokenizer, sp_size, config, idx
)
# <<< eval forward end <<<
gc.collect()
torch.cuda.empty_cache()
accuracies.append(correct)
if rank == 0:
result = {
"Num. Frame": num_frames,
"Frame Depth": round(depth * 100, -1),
"Score": sum(accuracies) / len(accuracies),
}
print(result)
all_accuries.append(result)
if rank == 0:
model_name = args.model.split("/")[-1]
os.makedirs(f"{args.output_path}/{model_name}", exist_ok=True)
# save all_accuries as json
with open(f"{args.output_path}/{model_name}/all_accuracies.json", "w") as f:
json.dump(all_accuries, f, indent=4)
return all_accuries
def plot(args, all_accuries):
df = pd.DataFrame(all_accuries)
cmap = LinearSegmentedColormap.from_list(
"custom_cmap", ["#F0496E", "#EBB839", "#9ad5b3"]
)
pivot_table = pd.pivot_table(
df,
values="Score",
index=["Frame Depth", "Num. Frame"],
aggfunc="mean",
).reset_index() # This will aggregate
pivot_table = pivot_table.pivot(
index="Frame Depth", columns="Num. Frame", values="Score"
)
# Create the heatmap with better aesthetics
plt.figure(figsize=(17.5, 8)) # Can adjust these dimensions as needed
ax = sns.heatmap(
pivot_table,
# annot=True,
fmt="g",
vmin=0,
vmax=1,
linecolor='white',
linewidths=1.5,
cmap=cmap,
cbar_kws={"label": "Score"},
)
# Set the color bar label font size
cbar = ax.collections[0].colorbar
cbar.ax.yaxis.label.set_size(14)
cbar.ax.tick_params(labelsize=14)
# Define the formatter function
def thousands_formatter(x, pos):
if x >= 1000:
return f'{x/1000:.1f}K'
return f'{x}'
context_lengths = pivot_table.columns
formatted_context_lengths = [thousands_formatter(x, None) for x in context_lengths]
# More aesthetics
plt.xlabel("Num. of Frames", fontsize=14) # X-axis label
plt.ylabel("Depth Percent", fontsize=14) # Y-axis label
plt.xticks(ticks=[i + 0.5 for i in range(len(context_lengths))], labels=formatted_context_lengths, rotation=45, fontsize=14)
# plt.xticks(rotation=45, fontsize=14) # Rotates the x-axis labels to prevent overlap
plt.yticks(rotation=0, fontsize=14) # Ensures the y-axis labels are horizontal
plt.tight_layout() # Fits everything neatly into the figure area
# save
model_name = args.model.split("/")[-1]
plt.savefig(f"{args.output_path}/{model_name}/heatmap.png")
# calculate average accuracy
average_accuracy = df["Score"].mean()
print(f"Average Accuracy: {average_accuracy}")
# save as txt
with open(f"{args.output_path}/{model_name}/avg_accuracy.txt", "w") as f:
f.write(f"Average Accuracy: {average_accuracy}\n")
def main(args):
if args.plot_only:
# load all_accuracies from json
model_name = args.model.split("/")[-1]
with open(f"{args.output_path}/{model_name}/all_accuracies.json", "r") as f:
all_accuracies = json.load(f)
plot(args, all_accuracies)
else:
all_accuracies = inference(args)
if dist.get_rank(get_sp_group()) == 0:
plot(args, all_accuracies)
if __name__ == "__main__":
args = argparse.ArgumentParser()
args.add_argument("--model", type=str, default="output/LLaVA-NeXT-Video-7B-32K")
args.add_argument("--max_frame_num", type=int, default=256)
args.add_argument("--needle_dataset", type=str, default="lmms-lab/v_niah_needles")
args.add_argument("--min_frame_num", type=int, default=20)
args.add_argument("--frame_interval", type=int, default=20)
args.add_argument("--output_path", type=str, default="vision_niah/niah_output_single")
args.add_argument("--depth_interval", type=float, default=0.1)
args.add_argument("--num_samples", type=int, default=1)
args.add_argument("--rope_theta", type=float, default=None)
args.add_argument("--haystack_dir", type=str, default="video_needle_haystack/data/haystack_embeddings")
args.add_argument("--needle_embedding_dir", type=str, default="vision_niah/data/needle_embeddings")
args.add_argument("--prompt_template", type=str)
args.add_argument("--replace_double_newline", action="store_true")
args.add_argument("--plot_only", action="store_true")
main(args.parse_args())
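# Illustrative sketch (not part of the original script): the evaluation loop
# above places the needle at `int(depth * num_frames)` inside the first
# `num_frames` haystack frames. The helper name `insert_needle` and the use of
# plain Python lists instead of embedding tensors are assumptions made only to
# keep the example self-contained.

```python
def insert_needle(haystack, needle, depth, num_frames):
    """Insert `needle` at relative `depth` within the first `num_frames` haystack items."""
    query_frame_idx = int(depth * num_frames)
    left = haystack[:query_frame_idx]
    right = haystack[query_frame_idx:num_frames]
    return left + needle + right


if __name__ == "__main__":
    haystack = [f"frame{i}" for i in range(10)]
    for depth in (0.0, 0.5, 1.0):
        sequence = insert_needle(haystack, ["NEEDLE"], depth, num_frames=8)
        print(depth, sequence.index("NEEDLE"), len(sequence))
```

# With depth 0.0 the needle precedes every haystack frame; with depth 1.0 it
# follows all `num_frames` frames. The total length is always num_frames + 1.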
================================================
FILE: xtuner-eval_niah/vision_niah/single_produce_needle_embedding.py
================================================
import sys
import os
import json
import argparse
import numpy as np
from tqdm import tqdm
import torch
from pathlib import Path
from PIL import Image
from datasets import load_dataset
import math
import io
data_root_path = "path_to/niah_data/"
def main(args):
if "videochat-flash" in args.model.lower():
from llava.model.builder import load_pretrained_model
from llava.mm_utils import tokenizer_image_token, get_model_name_from_path
from llava.mm_utils import process_images
elif "longva" in args.model.lower():
from longva.model.builder import load_pretrained_model
from longva.mm_utils import tokenizer_image_token, get_model_name_from_path
from longva.mm_utils import process_images
else:
raise NotImplementedError("This model version is not currently supported. Please manually adjust the code to adapt.")
model_name = "llava_qwen"
tokenizer, model, image_processor, context_len = load_pretrained_model(args.model, None, model_name, load_8bit=False,device_map="cuda:0")
model.config.image_aspect_ratio = "pad"
model.config.mm_patch_merge_type="flat"
# dataset = load_dataset(args.needle_dataset)["test"]
with open(args.needle_dataset, 'r', encoding='utf-8') as file:
data = json.load(file)
for index, instance in enumerate(tqdm(data, desc="Processing")):
# image = instance["image"].convert("RGB")
image_path = instance["image"]
frame_path = os.path.join(data_root_path, image_path)
with open(frame_path, 'rb') as f:
img_bytes = f.read()
img = Image.open(io.BytesIO(img_bytes))
image = img.convert("RGB")
image = process_images([image], image_processor, model.config).half()
# print("image shape:", image.shape)
if "videochat-flash" in args.model.lower():
mm_local_num_frames = 4
processed_images=image.repeat(mm_local_num_frames, 1, 1, 1)
# print("processed_images shape:", processed_images.shape)
image_features = model.encode_video_image([processed_images], video_idx_in_batch = [0])[0]
else:
image_features = model.encode_images(image)
# print("needle embedding shape: ", image_features.shape)
if args.pooling_size != 0:
B, num_tokens, F = image_features.shape
side = int(math.sqrt(num_tokens))  # assumes the tokens form a square grid
image_features_spatial = image_features.view(B, side, side, F).permute(0, 3, 1, 2) # B, F, 24, 24
image_features_spatial_pool = torch.nn.functional.avg_pool2d(image_features_spatial, args.pooling_size, args.pooling_size) # B, F, 12, 12
image_features = image_features_spatial_pool.flatten(2).transpose(1, 2).contiguous() # B, 144, F
image_features = image_features.squeeze(0)
# print("needle shape after pooling: ", image_features.shape)
if "videochat-flash" not in args.model.lower():
image_features = image_features.repeat(4,1)
print("final save needle shape: ", image_features.shape)
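# Create the output directory if needed so that torch.save does not fail.
os.makedirs(args.output_dir, exist_ok=True)
torch.save(image_features, f"{args.output_dir}/{index}.pt")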
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument("--model", type=str, default="output/LLaVA-NeXT-Video-7B-Vicuna")
parser.add_argument("--needle_dataset", type=str, default="Please input data dir")
parser.add_argument("--output_dir", type=str, default="video_needle_haystack/data/needle_vicuna_embeddings")
parser.add_argument("--pooling_size", type=int, default=0)
args = parser.parse_args()
main(args)
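# Illustrative sketch (not part of the original script): the `--pooling_size`
# branch above reshapes N tokens into a sqrt(N) x sqrt(N) grid and average-pools
# it with kernel and stride equal to `pooling_size`. The version below is a
# simplification that assumes one scalar per token (the real features are
# F-dimensional vectors); `avg_pool_tokens` is a hypothetical helper name.

```python
import math


def avg_pool_tokens(tokens, pooling_size):
    """Average-pool a flat list of N scalars laid out row-major as a square grid."""
    side = math.isqrt(len(tokens))
    if side * side != len(tokens):
        raise ValueError("token count must be a perfect square")
    # Rows/columns that do not fill a complete window are dropped.
    out_side = side // pooling_size
    pooled = []
    for row in range(out_side):
        for col in range(out_side):
            window = [
                tokens[(row * pooling_size + dr) * side + (col * pooling_size + dc)]
                for dr in range(pooling_size)
                for dc in range(pooling_size)
            ]
            pooled.append(sum(window) / len(window))
    return pooled


if __name__ == "__main__":
    # A 24 x 24 grid pooled by 2 yields 12 x 12 = 144 tokens.
    print(len(avg_pool_tokens(list(range(576)), 2)))
```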
================================================
FILE: xtuner-eval_niah/xtuner/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os
from mmengine.utils import digit_version
from .entry_point import cli
from .version import __version__, version_info
HF_CEPH_HUB = os.getenv('HF_CEPH_HUB', '')
HF_USE_CEPH = os.getenv('HF_USE_CEPH', 0) or HF_CEPH_HUB != ''
DS_CEPH_DIR = os.getenv('DS_CEPH_DIR', None)
if HF_USE_CEPH:
from .utils.fileio import (patch_hf_auto_from_pretrained,
patch_hf_save_pretrained)
patch_hf_auto_from_pretrained(HF_CEPH_HUB)
patch_hf_save_pretrained()
if DS_CEPH_DIR:
from .utils.fileio import patch_deepspeed_engine
patch_deepspeed_engine()
__all__ = [
'__version__', 'version_info', 'digit_version', 'cli', 'HF_USE_CEPH',
'DS_CEPH_DIR'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/__init__.py
================================================
from loguru import logger
from .auto import AutoConfig, AutoModelForCausalLM, AutoTokenizer
# Remove the original logger in Python to prevent duplicate printing.
logger.remove()
_LOGGER = logger
def get_logger():
return _LOGGER
__all__ = ['AutoConfig', 'AutoModelForCausalLM', 'AutoTokenizer']
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/__init__.py
================================================
from .dispatches import dispatch_modules
from .lora import LORA_TARGET_MAP
from .packed import packed_sequence
__all__ = ['dispatch_modules', 'LORA_TARGET_MAP', 'packed_sequence']
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import types
from xtuner._lite import get_logger
logger = get_logger()
def _dispatch_forward_fn(module, dispatch_fn):
module.forward = types.MethodType(dispatch_fn, module)
def dispatch_internlm_varlen_attn_forward(module):
assert module.__class__.__name__ == 'InternLM2FlashAttention2'
from .internlm2 import internlm2_varlen_attn_forward
_dispatch_forward_fn(module, internlm2_varlen_attn_forward)
return internlm2_varlen_attn_forward.__name__
def dispatch_llama_varlen_attn_forward(module):
assert module.__class__.__name__ == 'LlamaFlashAttention2'
from .llama import llama_varlen_attn_forward
_dispatch_forward_fn(module, llama_varlen_attn_forward)
return llama_varlen_attn_forward.__name__
def dispatch_qwen2_varlen_attn_forward(module):
assert module.__class__.__name__ == 'Qwen2FlashAttention2'
from .qwen2 import qwen2_varlen_attn_forward
_dispatch_forward_fn(module, qwen2_varlen_attn_forward)
return qwen2_varlen_attn_forward.__name__
def dispatch_clip_attn_forward(module):
assert module.__class__.__name__ == 'CLIPAttention'
from .clip import clip_flash_attn_forward
_dispatch_forward_fn(module, clip_flash_attn_forward)
return clip_flash_attn_forward.__name__
def dispatch_rms_norm_forward(module):
from ._fused import rms_norm_forward
_dispatch_forward_fn(module, rms_norm_forward)
return rms_norm_forward.__name__
def dispatch_internlm_mla_varlen_attn_forward(module):
assert module.__class__.__name__ == 'InternLM2MLAFlashAttention2'
from .internlm2 import internlm2_mla_varlen_attn_forward
_dispatch_forward_fn(module, internlm2_mla_varlen_attn_forward)
return internlm2_mla_varlen_attn_forward.__name__
DISPATCH_MAP = {
'InternLM2FlashAttention2': dispatch_internlm_varlen_attn_forward,
'CLIPAttention': dispatch_clip_attn_forward,
'InternLM2RMSNorm': dispatch_rms_norm_forward,
'LlamaFlashAttention2': dispatch_llama_varlen_attn_forward,
'LlamaRMSNorm': dispatch_rms_norm_forward,
'Qwen2FlashAttention2': dispatch_qwen2_varlen_attn_forward,
'Qwen2RMSNorm': dispatch_rms_norm_forward,
'InternLM2MLAFlashAttention2': dispatch_internlm_mla_varlen_attn_forward,
}
def dispatch_modules(model, use_varlen_attn=False):
from xtuner._lite import get_logger
logger = get_logger()
for name, module in model.named_modules():
module_cls = module.__class__.__name__
if module_cls in DISPATCH_MAP:
dispatched = DISPATCH_MAP[module_cls](module)
logger.info(
f'Dispatch {name}({module_cls}) forward to `{dispatched}`')
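# Illustrative sketch (not part of xtuner): the dispatch mechanism above rebinds
# `forward` on individual module instances, selected by class name. The toy
# classes and the names `TOY_DISPATCH_MAP` / `dispatch_toy_modules` below are
# hypothetical stand-ins so the idea runs without torch.

```python
import types


class ToyRMSNorm:
    def forward(self, x):
        return f"slow({x})"


class ToyLinear:
    def forward(self, x):
        return f"linear({x})"


def fast_rms_norm_forward(self, x):
    return f"fast({x})"


def dispatch_toy_rms_norm(module):
    # Bind the replacement as a method of this one instance only.
    module.forward = types.MethodType(fast_rms_norm_forward, module)
    return fast_rms_norm_forward.__name__


TOY_DISPATCH_MAP = {"ToyRMSNorm": dispatch_toy_rms_norm}


def dispatch_toy_modules(modules):
    """Patch every module whose class name appears in the map; return the patch names."""
    dispatched = []
    for module in modules:
        dispatch_fn = TOY_DISPATCH_MAP.get(module.__class__.__name__)
        if dispatch_fn is not None:
            dispatched.append(dispatch_fn(module))
    return dispatched


if __name__ == "__main__":
    norm, linear = ToyRMSNorm(), ToyLinear()
    print(dispatch_toy_modules([norm, linear]))
    print(norm.forward("x"), linear.forward("x"))
```

# Because the patch is applied per instance, other instances of the same class
# keep their original `forward` until they are dispatched too.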
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/_attention.py
================================================
import torch
from torch.nn import functional as F
from xtuner._lite.parallel import sequence_parallel_wrapper
SUPPORT_FLASH2 = False
try:
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import (index_first_axis, pad_input,
unpad_input)
SUPPORT_FLASH2 = True
except ImportError:
pass
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(
torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
def upad_qkv(query_layer, key_layer, value_layer, attention_mask,
query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(
attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
value_layer = index_first_axis(
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
if query_length == kv_seq_len:
# Differs from the original version because sequence parallelism
# changes the number of attention heads.
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, -1, head_dim),
indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = \
unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
@sequence_parallel_wrapper
def flash_attn_wo_mask(
query_states,
key_states,
value_states,
dropout_p=0.0,
softmax_scale=None,
causal=True,
window_size=(-1, -1), # -1 means infinite context window
):
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout_p=dropout_p,
softmax_scale=softmax_scale,
causal=causal,
window_size=window_size)
return attn_output
@sequence_parallel_wrapper
def flash_attn_w_mask(
query_states, # bs, q_len, nhead, h_dim
key_states,
value_states,
attention_mask,
causal=True,
dropout_p=0.0,
window_size=(-1, -1), # -1 means infinite context window
):
batch_size, q_len = query_states.shape[:2]
query_states, key_states, value_states, indices_q, \
cu_seq_lens, max_seq_lens = upad_qkv(
query_states, key_states, value_states, attention_mask, q_len)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout_p,
causal=causal,
window_size=window_size)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, q_len)
return attn_output
@sequence_parallel_wrapper
def varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
dropout_p=0.,
causal=True,
softmax_scale=None,
window_size=(-1, -1), # -1 means infinite context window
):
q_unpad, k_unpad, v_unpad = query_states.flatten(0, 1), key_states.flatten(
0, 1), value_states.flatten(0, 1)
attn_output = flash_attn_varlen_func(
q_unpad,
k_unpad,
v_unpad,
cumulative_len,
cumulative_len,
max_seqlen,
max_seqlen,
dropout_p=dropout_p,
return_attn_probs=False,
causal=causal,
softmax_scale=softmax_scale,
window_size=window_size)
attn_output = attn_output.unsqueeze(0)
return attn_output
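# Illustrative sketch (not part of xtuner): `_get_unpad_data` above derives the
# flat indices of kept tokens, the cumulative sequence lengths, and the longest
# sequence from a 0/1 attention mask. The pure-Python version below mirrors that
# computation on nested lists; `get_unpad_data_reference` is a hypothetical name.

```python
def get_unpad_data_reference(attention_mask):
    """attention_mask: list of equal-length rows of 0/1 flags, one row per sequence."""
    seqlens = [sum(row) for row in attention_mask]
    width = len(attention_mask[0])
    # Positions of kept tokens in the flattened (batch * seq_len) layout.
    indices = [
        row_idx * width + col_idx
        for row_idx, row in enumerate(attention_mask)
        for col_idx, keep in enumerate(row)
        if keep
    ]
    # Prefix sums with a leading zero: sequence i spans cu[i]:cu[i + 1].
    cu_seqlens = [0]
    for length in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + length)
    return indices, cu_seqlens, max(seqlens)


if __name__ == "__main__":
    mask = [[0, 1, 1], [1, 1, 1]]  # first sequence is left-padded by one token
    print(get_unpad_data_reference(mask))
```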
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/_fused/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .layer_norm import layer_norm_forward
from .rms_norm import rms_norm_forward
from .rotary import apply_rotary_emb
__all__ = ['rms_norm_forward', 'layer_norm_forward', 'apply_rotary_emb']
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/_fused/layer_norm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.nn.functional as F
def layer_norm_forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
hidden_states = F.layer_norm(
hidden_states, (hidden_states.shape[-1], ), eps=self.variance_epsilon)
hidden_states = self.weight.to(torch.float32) * hidden_states
return hidden_states.to(input_dtype)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/_fused/rms_norm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
try:
from flash_attn.ops.triton.layernorm import rms_norm_fn
except ImportError:
try:
from flash_attn.ops.triton.layer_norm import rms_norm_fn
except ImportError:
import flash_attn
raise ImportError(
'Cannot import `rms_norm_fn` from flash_attn '
f'(installed version: {flash_attn.__version__}).')
def rms_norm_forward(self, hidden_states):
from torch.distributed._functional_collectives import AsyncCollectiveTensor
if isinstance(hidden_states, AsyncCollectiveTensor):
hidden_states = hidden_states.wait()
if (hidden_states.device == torch.device('cpu')
or self.weight.device == torch.device('cpu')):
raise RuntimeError(
'Cannot use triton kernels on cpu. Please set `USE_TRITON_KERNEL`'
' environment variable to 0 before training.')
ret = rms_norm_fn(
hidden_states, self.weight, None, eps=self.variance_epsilon)
return ret
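# Illustrative sketch (not the fused kernel): the standard RMSNorm computation
# for a single hidden vector is x * weight / sqrt(mean(x^2) + eps). The helper
# `rms_norm_reference` and its default `eps` are hypothetical; the dispatched
# forward above delegates to flash_attn's Triton `rms_norm_fn` instead.

```python
import math


def rms_norm_reference(x, weight, eps=1e-6):
    """Normalize one vector by its root mean square, then scale elementwise."""
    mean_square = sum(value * value for value in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * value * inv_rms for value, w in zip(x, weight)]


if __name__ == "__main__":
    print(rms_norm_reference([3.0, 4.0], [1.0, 1.0], eps=0.0))
```

# With unit weights and eps = 0, the output has a mean square of exactly 1.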
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/_fused/rotary.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
# Modified from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/rotary.py # noqa:E501
from typing import Optional, Union
import torch
import triton
import triton.language as tl
@triton.jit
def rotary_kernel(
OUT, # Pointers to matrices
X,
COS,
SIN,
CU_SEQLENS,
SEQLEN_OFFSETS, # this could be int or a pointer
# Matrix dimensions
seqlen,
rotary_dim,
seqlen_ro,
# strides
stride_out_batch,
stride_out_seqlen,
stride_out_nheads,
stride_out_headdim,
stride_x_batch,
stride_x_seqlen,
stride_x_nheads,
stride_x_headdim,
# Meta-parameters
BLOCK_K: tl.constexpr,
IS_SEQLEN_OFFSETS_TENSOR: tl.constexpr,
IS_VARLEN: tl.constexpr,
INTERLEAVED: tl.constexpr,
CONJUGATE: tl.constexpr,
BLOCK_M: tl.constexpr,
):
pid_m = tl.program_id(axis=0)
pid_batch = tl.program_id(axis=1)
pid_head = tl.program_id(axis=2)
rotary_dim_half = rotary_dim // 2
if not IS_VARLEN:
X = X + pid_batch * stride_x_batch + pid_head * stride_x_nheads
OUT = OUT + pid_batch * stride_out_batch + pid_head * stride_out_nheads
else:
start_idx = tl.load(CU_SEQLENS + pid_batch)
seqlen = tl.load(CU_SEQLENS + pid_batch + 1) - start_idx
X = X + start_idx * stride_x_seqlen + pid_head * stride_x_nheads
OUT = OUT + start_idx * stride_out_seqlen + \
pid_head * stride_out_nheads
if pid_m * BLOCK_M >= seqlen:
return
rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
if not IS_SEQLEN_OFFSETS_TENSOR:
rm_cs = rm + SEQLEN_OFFSETS
else:
rm_cs = rm + tl.load(SEQLEN_OFFSETS + pid_batch)
rk = tl.arange(0, BLOCK_K)
rk_half = tl.arange(0, BLOCK_K // 2)
if not INTERLEAVED:
# Load the 1st and 2nd halves of X, do calculation,
# then store to 1st and 2nd halves of OUT
X = X + (
rm[:, None] * stride_x_seqlen +
rk_half[None, :] * stride_x_headdim)
# This is different from the official implementation as the shapes of
# the two tensors cos and sin are (seqlen_ro, rotary_dim) instead of
# (seqlen_ro, rotary_dim // 2).
COS = COS + (rm_cs[:, None] * rotary_dim + rk_half[None, :])
SIN = SIN + (rm_cs[:, None] * rotary_dim + rk_half[None, :])
cos = tl.load(
COS,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_half[None, :] < rotary_dim_half),
other=1.0).to(tl.float32)
sin = tl.load(
SIN,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_half[None, :] < rotary_dim_half),
other=0.0).to(tl.float32)
x0 = tl.load(
X,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half),
other=0.0).to(tl.float32)
x1 = tl.load(
X + rotary_dim_half * stride_x_headdim,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half),
other=0.0,
).to(tl.float32)
if CONJUGATE:
sin = -sin
o0 = x0 * cos - x1 * sin
o1 = x0 * sin + x1 * cos
# write back result
OUT = OUT + (
rm[:, None] * stride_out_seqlen +
rk_half[None, :] * stride_out_headdim)
tl.store(
OUT,
o0,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half))
tl.store(
OUT + rotary_dim_half * stride_out_headdim,
o1,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half),
)
else:
# We don't want to load X[0, 2, 4, ...] and X[1, 3, 5, ...] separately
# since both are slow.
# Instead, we load x0 = X[0, 1, 2, 3, ...] and x1 = X[1, 0, 3, 2, ...].
# Loading x0 will be fast but x1 will be slow.
# Then we load cos = COS[0, 0, 1, 1, ...] and
# sin = SIN[0, 0, 1, 1, ...].
# Then we do the calculation and use tl.where to pick out the right
# outputs for the even and for the odd indices.
rk_swap = rk + ((rk + 1) % 2) * 2 - 1 # 1, 0, 3, 2, 5, 4, ...
rk_repeat = tl.arange(0, BLOCK_K) // 2
# This is different from the official implementation as the shapes of
# the two tensors cos and sin are (seqlen_ro, rotary_dim) instead of
# (seqlen_ro, rotary_dim // 2).
X0 = X + (
rm[:, None] * stride_x_seqlen + rk[None, :] * stride_x_headdim)
X1 = X + (
rm[:, None] * stride_x_seqlen +
rk_swap[None, :] * stride_x_headdim)
COS = COS + (rm_cs[:, None] * rotary_dim + rk_repeat[None, :])
SIN = SIN + (rm_cs[:, None] * rotary_dim + rk_repeat[None, :])
cos = tl.load(
COS,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_repeat[None, :] < rotary_dim_half),
other=1.0,
).to(tl.float32)
sin = tl.load(
SIN,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_repeat[None, :] < rotary_dim_half),
other=0.0,
).to(tl.float32)
x0 = tl.load(
X0,
mask=(rm[:, None] < seqlen) & (rk[None, :] < rotary_dim),
other=0.0).to(tl.float32)
x1 = tl.load(
X1,
mask=(rm[:, None] < seqlen) & (rk_swap[None, :] < rotary_dim),
other=0.0).to(tl.float32)
if CONJUGATE:
sin = -sin
x0_cos = x0 * cos
x1_sin = x1 * sin
out = tl.where(rk[None, :] % 2 == 0, x0_cos - x1_sin, x0_cos + x1_sin)
OUT = OUT + (
rm[:, None] * stride_out_seqlen + rk[None, :] * stride_out_headdim)
tl.store(
OUT, out, mask=(rm[:, None] < seqlen) & (rk[None, :] < rotary_dim))
def apply_rotary(
x: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
seqlen_offsets: Union[int, torch.Tensor] = 0,
cu_seqlens: Optional[torch.Tensor] = None,
max_seqlen: Optional[int] = None,
interleaved=False,
inplace=False,
conjugate=False,
) -> torch.Tensor:
"""
Arguments:
x: (batch, seqlen, nheads, headdim) if cu_seqlens is None
else (total_seqlen, nheads, headdim).
cos: (seqlen_ro, rotary_dim)
sin: (seqlen_ro, rotary_dim)
seqlen_offsets: integer or integer tensor of size (batch,)
cu_seqlens: (batch + 1,) or None
max_seqlen: int
Returns:
y: (batch, seqlen, nheads, headdim)
"""
is_varlen = cu_seqlens is not None
if not is_varlen:
batch, seqlen, nheads, headdim = x.shape
else:
assert max_seqlen is not None, ('If cu_seqlens is passed in, '
'then max_seqlen must be passed')
total_seqlen, nheads, headdim = x.shape
batch_p_1 = cu_seqlens.shape[0]
batch = batch_p_1 - 1
seqlen = max_seqlen
seqlen_ro, rotary_dim = cos.shape
assert sin.shape == cos.shape
# rotary_dim *= 2
assert rotary_dim <= headdim, 'rotary_dim must be <= headdim'
assert headdim <= 256, 'Only support headdim <= 256'
assert seqlen_ro >= seqlen, 'seqlen_ro must be >= seqlen'
assert (
cos.dtype == sin.dtype
), f'cos and sin must have the same dtype, got {cos.dtype} and {sin.dtype}'
assert (x.dtype == cos.dtype), (
f'Input and cos/sin must have the same dtype, '
f'got {x.dtype} and {cos.dtype}')
cos, sin = cos.contiguous(), sin.contiguous()
if isinstance(seqlen_offsets, torch.Tensor):
assert seqlen_offsets.shape == (batch, )
assert seqlen_offsets.dtype in [torch.int32, torch.int64]
seqlen_offsets = seqlen_offsets.contiguous()
else:
assert seqlen_offsets + seqlen <= seqlen_ro
output = torch.empty_like(x) if not inplace else x
if rotary_dim < headdim and not inplace:
output[..., rotary_dim:].copy_(x[..., rotary_dim:])
BLOCK_K = (32 if rotary_dim <= 32 else
(64 if rotary_dim <= 64 else
(128 if rotary_dim <= 128 else 256)))
def grid(META):
return (triton.cdiv(seqlen, META['BLOCK_M']), batch, nheads)
BLOCK_M = 4 if interleaved else (8 if rotary_dim <= 64 else 4)
# Need this, otherwise Triton tries to launch from cuda:0 and we get
# ValueError: Pointer argument (at 0) cannot be accessed from Triton
# (cpu tensor?)
with torch.cuda.device(x.device.index):
rotary_kernel[grid](
output, # data ptrs
x,
cos,
sin,
cu_seqlens,
seqlen_offsets,
seqlen, # shapes
rotary_dim,
seqlen_ro,
output.stride(0)
if not is_varlen else 0, # batch_strides if not varlen else 0
output.stride(-3), # seqlen_stride or total_seqlen_stride
output.stride(-2), # nheads_stride
output.stride(-1), # headdim_stride
x.stride(0)
if not is_varlen else 0, # batch_strides if not varlen else 0
x.stride(-3), # seqlen stride or total_seqlen_stride
x.stride(-2), # nheads stride
x.stride(-1), # headdim stride
BLOCK_K,
isinstance(seqlen_offsets, torch.Tensor),
is_varlen,
interleaved,
conjugate,
BLOCK_M,
)
return output
class ApplyRotaryEmb(torch.autograd.Function):
@staticmethod
def forward(
ctx,
x,
cos,
sin,
interleaved=False,
inplace=False,
seqlen_offsets: Union[int, torch.Tensor] = 0,
cu_seqlens: Optional[torch.Tensor] = None,
max_seqlen: Optional[int] = None,
):
out = apply_rotary(
x,
cos,
sin,
seqlen_offsets=seqlen_offsets,
cu_seqlens=cu_seqlens,
max_seqlen=max_seqlen,
interleaved=interleaved,
inplace=inplace,
)
if isinstance(seqlen_offsets, int):
ctx.save_for_backward(
cos, sin, cu_seqlens) # Can't save int with save_for_backward
ctx.seqlen_offsets = seqlen_offsets
else:
ctx.save_for_backward(cos, sin, cu_seqlens, seqlen_offsets)
ctx.seqlen_offsets = None
ctx.interleaved = interleaved
ctx.inplace = inplace
ctx.max_seqlen = max_seqlen
return out if not inplace else x
@staticmethod
def backward(ctx, do):
seqlen_offsets = ctx.seqlen_offsets
if seqlen_offsets is None:
cos, sin, cu_seqlens, seqlen_offsets = ctx.saved_tensors
else:
cos, sin, cu_seqlens = ctx.saved_tensors
# TD [2023-09-02]: For some reason Triton (2.0.0.post1) errors with
# "[CUDA]: invalid device context", and cloning makes it work. Idk why.
# Triton 2.1.0 works.
if not ctx.interleaved and not ctx.inplace:
do = do.clone()
dx = apply_rotary(
do,
cos,
sin,
seqlen_offsets=seqlen_offsets,
cu_seqlens=cu_seqlens,
max_seqlen=ctx.max_seqlen,
interleaved=ctx.interleaved,
inplace=ctx.inplace,
conjugate=True,
)
return dx, None, None, None, None, None, None, None
apply_rotary_emb = ApplyRotaryEmb.apply
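# Illustrative sketch (not the Triton kernel): the non-interleaved branch above
# pairs dimension k with dimension k + rotary_dim // 2 and reads only the first
# half of each (seqlen_ro, rotary_dim) cos/sin row. The pure-Python reference
# below handles one position of one head; `apply_rotary_reference` is a
# hypothetical name. Passing conjugate=True negates sin, which undoes the
# rotation (this is what the backward pass relies on).

```python
import math


def apply_rotary_reference(x, cos, sin, conjugate=False):
    """x: head_dim floats; cos/sin: rotary_dim floats for the same position."""
    rotary_dim = len(cos)
    half = rotary_dim // 2
    out = list(x)  # dimensions beyond rotary_dim pass through unchanged
    for k in range(half):
        c = cos[k]
        s = -sin[k] if conjugate else sin[k]
        x0, x1 = x[k], x[k + half]
        out[k] = x0 * c - x1 * s
        out[k + half] = x0 * s + x1 * c
    return out


if __name__ == "__main__":
    angles = [0.3, 1.1]
    cos = [math.cos(a) for a in angles] * 2  # duplicated halves, rotary_dim = 4
    sin = [math.sin(a) for a in angles] * 2
    x = [1.0, 2.0, 3.0, 4.0, 5.0]  # head_dim = 5, last dim is not rotated
    y = apply_rotary_reference(x, cos, sin)
    print(apply_rotary_reference(y, cos, sin, conjugate=True))
```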
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/clip.py
================================================
from typing import Optional, Tuple
import torch
from torch import nn
from torch.nn import functional as F
from transformers import CLIPVisionModel
from ._attention import flash_attn_wo_mask
def clip_flash_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
causal_attention_mask: Optional[torch.Tensor] = None,
output_attentions: Optional[bool] = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor]]:
"""Input shape: Batch x Time x Channel."""
bsz, tgt_len, embed_dim = hidden_states.size()
# get query proj
query_states = self.q_proj(hidden_states).view(bsz, tgt_len,
self.num_heads, -1)
key_states = self.k_proj(hidden_states).view(bsz, tgt_len, self.num_heads,
-1)
value_states = self.v_proj(hidden_states).view(bsz, tgt_len,
self.num_heads, -1)
# proj_shape = (bsz * self.num_heads, -1, self.head_dim)
# query_states = self._shape(query_states, tgt_len, bsz).view(*proj_shape)
# key_states = key_states.view(*proj_shape)
# value_states = value_states.view(*proj_shape)
# src_len = key_states.size(1)
# attn_weights = torch.bmm(query_states, key_states.transpose(1, 2))
# if attn_weights.size() != (bsz * self.num_heads, tgt_len, src_len):
# raise ValueError(
# f"Attention weights should be of size {(bsz * self.num_heads, tgt_len, src_len)}, but is"
# f" {attn_weights.size()}"
# )
# # apply the causal_attention_mask first
# if causal_attention_mask is not None:
# if causal_attention_mask.size() != (bsz, 1, tgt_len, src_len):
# raise ValueError(
# f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is"
# f" {causal_attention_mask.size()}"
# )
# attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + causal_attention_mask
# attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
# if attention_mask is not None:
# if attention_mask.size() != (bsz, 1, tgt_len, src_len):
# raise ValueError(
# f"Attention mask should be of size {(bsz, 1, tgt_len, src_len)}, but is {attention_mask.size()}"
# )
# attn_weights = attn_weights.view(bsz, self.num_heads, tgt_len, src_len) + attention_mask
# attn_weights = attn_weights.view(bsz * self.num_heads, tgt_len, src_len)
# attn_weights = nn.functional.softmax(attn_weights, dim=-1)
# if output_attentions:
# # this operation is a bit awkward, but it's required to
# # make sure that attn_weights keeps its gradient.
# # In order to do so, attn_weights have to be reshaped
# # twice and have to be reused in the following
# attn_weights_reshaped = attn_weights.view(bsz, self.num_heads, tgt_len, src_len)
# attn_weights = attn_weights_reshaped.view(bsz * self.num_heads, tgt_len, src_len)
# else:
# attn_weights_reshaped = None
# attn_probs = nn.functional.dropout(attn_weights, p=self.dropout, training=self.training)
# attn_output = torch.bmm(attn_probs, value_states)
# if attn_output.size() != (bsz * self.num_heads, tgt_len, self.head_dim):
# raise ValueError(
# f"`attn_output` should be of size {(bsz, self.num_heads, tgt_len, self.head_dim)}, but is"
# f" {attn_output.size()}"
# )
# attn_output = attn_output.view(bsz, self.num_heads, tgt_len, self.head_dim)
# attn_output = attn_output.transpose(1, 2)
# attn_output = attn_output.reshape(bsz, tgt_len, embed_dim)
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
self.dropout if self.training else 0,
causal=causal_attention_mask is not None).view(bsz, tgt_len, embed_dim)
attn_output = self.out_proj(attn_output)
return attn_output, None
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/internlm2.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Tuple
import os
import torch
from einops import rearrange
from mmengine import MessageHub
from transformers.cache_utils import StaticCache, Cache
from ._attention import SUPPORT_FLASH2, flash_attn_wo_mask, varlen_flash_attn
import torch.nn.functional as F
from flash_attn import flash_attn_with_kvcache
import torch.distributed as dist
from xtuner._lite.parallel.setup import (get_ring_group, get_ring_world_size,
                                         get_sp_group, get_sp_world_size,
                                         get_ulysess_group)
from xtuner._lite.yunchang import (attention_sp_ulysses_ring,
                                   llama3_varlen_attention_sp_ulysses_ring,
                                   ring_flash_attn_inference_func)
from mmengine.dist import is_distributed, all_gather_object, all_gather
class InternLM2RotaryEmbedding(torch.nn.Module):
def __init__(self,
dim,
max_position_embeddings=2048,
base=1000000,
device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
self.inv_freq = 1.0 / (
base ** (torch.arange(0, dim, 2).float().to(device) / dim))
# Build here to make `torch.jit.trace` work.
self.max_seq_len_cached = max_position_embeddings
t = torch.arange(
self.max_seq_len_cached,
device=self.inv_freq.device,
dtype=self.inv_freq.dtype)
freqs = torch.einsum('i,j->ij', t, self.inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
self.cos_cached = emb.cos()
self.sin_cached = emb.sin()
def forward(self, x, seq_len):
# x: [bs, num_attention_heads, seq_len, head_size]
if (seq_len > self.max_seq_len_cached
or self.cos_cached.device != x.device
or self.cos_cached.dtype != x.dtype):
self.max_seq_len_cached = seq_len
assert self.inv_freq.dtype == torch.float32
t = torch.arange(
self.max_seq_len_cached,
device=x.device,
dtype=self.inv_freq.dtype)
freqs = torch.einsum('i,j->ij', t, self.inv_freq.to(t.device))
emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
self.cos_cached = emb.cos().to(x.dtype)
self.sin_cached = emb.sin().to(x.dtype)
return (
self.cos_cached[:seq_len, ...],
self.sin_cached[:seq_len, ...],
)
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
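# --- Illustrative sketch (not part of the original module) -------------------
# A minimal check, assuming cos/sin are built as in `InternLM2RotaryEmbedding`
# above (frequencies duplicated across both halves), that the update
# `x * cos + rotate_half(x) * sin` used by `apply_rotary_pos_emb` is a pure
# rotation: it keeps per-token vector norms and is the identity at position 0.
# The helper name `_sketch_rope_is_rotation`, the toy sizes, and the base value
# are assumptions chosen for illustration only.
def _sketch_rope_is_rotation(seq_len=4, dim=8, base=10000.0):
    import torch

    # Same construction as the cached cos/sin tables above.
    inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))
    t = torch.arange(seq_len, dtype=inv_freq.dtype)
    freqs = torch.einsum('i,j->ij', t, inv_freq)
    emb = torch.cat((freqs, freqs), dim=-1)  # [seq_len, dim]
    cos, sin = emb.cos(), emb.sin()

    def _rotate_half(x):
        x1 = x[..., :x.shape[-1] // 2]
        x2 = x[..., x.shape[-1] // 2:]
        return torch.cat((-x2, x1), dim=-1)

    torch.manual_seed(0)
    q = torch.randn(1, 2, seq_len, dim)  # [bs, heads, seq_len, head_dim]
    # cos/sin of shape [seq_len, dim] broadcast over batch and heads.
    q_embed = (q * cos) + (_rotate_half(q) * sin)

    norms_kept = torch.allclose(
        q_embed.norm(dim=-1), q.norm(dim=-1), atol=1e-4)
    identity_at_zero = torch.allclose(q_embed[..., 0, :], q[..., 0, :])
    return norms_kept, identity_at_zero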
def apply_rotary_pos_emb_mla(q, k, cos, sin, position_ids=None, unsqueeze_dim=1): # pylint: disable=unused-argument
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`, *optional*):
Indices used to select the rows of `cos` and `sin` for each token before they are broadcast.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
orig_dtype = k.dtype
cos = cos[position_ids].unsqueeze(unsqueeze_dim) # [bs, 1, seq_len, dim]
sin = sin[position_ids].unsqueeze(unsqueeze_dim) # [bs, 1, seq_len, dim]
q_fp32 = q.to(dtype=torch.float32, device=q.device)
k_fp32 = k.to(dtype=torch.float32, device=k.device)
q_embed = (q_fp32 * cos) + (rotate_half(q_fp32) * sin)
k_embed = (k_fp32 * cos) + (rotate_half(k_fp32) * sin)
return q_embed.to(dtype=orig_dtype), k_embed.to(dtype=orig_dtype)
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""This is the equivalent of torch.repeat_interleave(x, dim=1,
repeats=n_rep).
The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to
(batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
def repeat_kv_bshd(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""The hidden states go from (batch, seqlen, num_key_value_heads, head_dim)
to (batch, seqlen, num_attention_heads, head_dim)"""
batch, slen, num_key_value_heads, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, :,
None, :].expand(batch, slen,
num_key_value_heads, n_rep,
head_dim)
return hidden_states.reshape(batch, slen, num_key_value_heads * n_rep,
head_dim)
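# --- Illustrative sketch (not part of the original module) -------------------
# The docstring of `repeat_kv` states that it matches
# `torch.repeat_interleave(x, dim=1, repeats=n_rep)`. The helper below is a
# small sanity check of that claim for both layouts used in this file
# ([batch, heads, seq, dim] and [batch, seq, heads, dim]). It re-implements the
# expand/reshape steps inline so it is self-contained; the name
# `_sketch_repeat_kv_equivalence` and the toy shapes are assumptions.
def _sketch_repeat_kv_equivalence(n_rep=3):
    import torch

    x_bhsd = torch.arange(2 * 2 * 3 * 4, dtype=torch.float32).view(2, 2, 3, 4)
    batch, kv_heads, slen, head_dim = x_bhsd.shape

    # Layout [batch, kv_heads, seq, dim]: repeat along dim=1.
    out_bhsd = x_bhsd[:, :, None, :, :].expand(
        batch, kv_heads, n_rep, slen, head_dim).reshape(
            batch, kv_heads * n_rep, slen, head_dim)
    same_bhsd = torch.equal(
        out_bhsd, torch.repeat_interleave(x_bhsd, repeats=n_rep, dim=1))

    # Layout [batch, seq, kv_heads, dim]: repeat along dim=2.
    x_bshd = x_bhsd.transpose(1, 2)
    out_bshd = x_bshd[:, :, :, None, :].expand(
        batch, slen, kv_heads, n_rep, head_dim).reshape(
            batch, slen, kv_heads * n_rep, head_dim)
    same_bshd = torch.equal(
        out_bshd, torch.repeat_interleave(x_bshd, repeats=n_rep, dim=2))
    return same_bhsd, same_bshd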
def _internlm2_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
# Modified from https://huggingface.co/internlm/internlm-7b/blob/939a68c0dc1bd5f35b63c87d44af05ce33379061/modeling_internlm.py#L161 # noqa:E501
if isinstance(past_key_value, StaticCache):
raise ValueError(
'`static` cache implementation is not compatible with '
'`attn_implementation==flash_attention_2`; make sure to use `sdpa` '
'in the meantime, and open an issue at '
'https://github.com/huggingface/transformers')
bsz, q_len, _ = hidden_states.size()
attn_context = MessageHub.get_instance('packed_sequence')
position_ids = attn_context.get_info('position_ids')
assert position_ids.size(1) == q_len, f'{position_ids.size(1)} {q_len}'
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
try:
cos, sin = self.rotary_emb(value_states, position_ids)
except RuntimeError as err:
raise RuntimeError(
'You are using an old version of the InternLM2 model. The '
'`modeling_internlm2.py` is outdated. Please update the InternLM2 '
'model.') from err
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models;
# cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# In PEFT, layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32
# as well. We therefore cast them back to the expected dtype
# to make sure everything works as expected.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (InternLM2RMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.wqkv.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# repeat kv for sequence parallel
key_states = repeat_kv_bshd(key_states, self.num_key_value_groups)
value_states = repeat_kv_bshd(value_states, self.num_key_value_groups)
assert SUPPORT_FLASH2
cumulative_lengths = attn_context.get_info('cumulative_lengths')
if cumulative_lengths is not None and SUPPORT_FLASH2 and bsz == 1:
reset_mask = os.environ.get('RESET_MASK')
force_use_ring = os.environ.get('FORCE_USE_RING')
ring_impl_type = os.environ.get('RING_IMPL_TYPE')
# Only used to test whether SP Ulysses is correct under this branch
force_to_new_sp = os.environ.get('FORCE_TO_NEW_SP')
if reset_mask:
if get_ring_world_size() > 1 or force_to_new_sp:
if force_use_ring:
attn_output = attention_sp_ulysses_ring(query_states,
key_states,
value_states,
ulysses_pg=get_ulysess_group(),
ring_pg=get_ring_group(),
ring_impl_type=ring_impl_type)
else:
# TODO: for now this still calls llama3_varlen_attention_sp_ulysses_ring
# TODO: the most reasonable choice would be llama3_attention_sp_ulysses_ring
# Runs only when ring attention is enabled; with plain SP the original logic still runs
assert cumulative_lengths[-1] % get_sp_world_size() == 0, f'==={cumulative_lengths[-1]}===='
q_unpad, k_unpad, v_unpad = query_states.flatten(0, 1), key_states.flatten(
0, 1), value_states.flatten(0, 1)
attn_output = llama3_varlen_attention_sp_ulysses_ring(
q_unpad,
k_unpad,
v_unpad,
cumulative_lengths,
ulysses_pg=get_ulysess_group(),
ring_pg=get_ring_group(),
causal=True,
# Set to 1 to save more GPU memory; -1 means no splitting
heads_k_stride=-1
)
attn_output = attn_output.unsqueeze(0)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
training=self.training)
else:
if get_ring_world_size() > 1 or force_to_new_sp:
# Runs only when ring attention is enabled; with plain SP the original logic still runs
assert cumulative_lengths[-1] % get_sp_world_size() == 0, f'==={cumulative_lengths[-1]}===='
q_unpad, k_unpad, v_unpad = query_states.flatten(0, 1), key_states.flatten(
0, 1), value_states.flatten(0, 1)
if force_use_ring:
raise NotImplementedError
else:
attn_output = llama3_varlen_attention_sp_ulysses_ring(
q_unpad,
k_unpad,
v_unpad,
cumulative_lengths,
ulysses_pg=get_ulysess_group(),
ring_pg=get_ring_group(),
causal=True,
# Set to 1 to save more GPU memory; -1 means no splitting
heads_k_stride=-1
)
attn_output = attn_output.unsqueeze(0)
else:
max_seqlen = attn_context.get_info('max_seqlen')
attn_output = varlen_flash_attn(query_states, key_states, value_states,
cumulative_lengths, max_seqlen)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.wo(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
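# --- Illustrative sketch (not part of the original module) -------------------
# The forward above unpacks the fused `wqkv` projection with
# `rearrange(..., 'b q (h gs d) -> b q h gs d', gs=2 + num_key_value_groups)`:
# for each KV head, the first `num_key_value_groups` slots hold query heads and
# the last two slots hold the key and the value. The helper below reproduces
# that split with plain `view`/`reshape` to show the resulting shapes. The name
# `_sketch_split_fused_qkv` and the toy sizes are assumptions for illustration.
def _sketch_split_fused_qkv(num_heads=8, num_kv_heads=2, head_dim=16,
                            bsz=1, q_len=5):
    import torch

    groups = num_heads // num_kv_heads  # query heads per KV head
    fused = torch.randn(bsz, q_len, num_kv_heads * (2 + groups) * head_dim)

    # Equivalent to rearrange(fused, 'b q (h gs d) -> b q h gs d').
    qkv = fused.view(bsz, q_len, num_kv_heads, 2 + groups, head_dim)

    query = qkv[..., :groups, :].reshape(bsz, q_len, num_heads, head_dim)
    key = qkv[..., -2, :]    # [bsz, q_len, num_kv_heads, head_dim]
    value = qkv[..., -1, :]  # [bsz, q_len, num_kv_heads, head_dim]
    return tuple(query.shape), tuple(key.shape), tuple(value.shape)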
def internlm2_attn_forward_inference(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache=False,
cache_position=None,
position_embeddings=None, # will become mandatory in v4.46
):
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
kv_seq_len = key_states.shape[-2]
is_prefilled = past_key_value.get_seq_length(self.layer_idx) == 0
if is_distributed():
# Single-GPU (non-distributed) runs are left unchanged
rank = dist.get_rank(get_sp_group())
world_size = dist.get_world_size(get_sp_group())
else:
rank = 0
world_size = 1
if past_key_value is not None:
# Total KV length after this step: new tokens plus the cache offset
kv_seq_len = key_states.shape[-2] + cache_position[0]
if world_size > 1:
if is_prefilled:
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
else:
if rank == world_size - 1:
# During decoding, only the last rank needs to update the cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx,
cache_kwargs)
else:
kv_seq_len = cache_position[0]
key_states, value_states = past_key_value.key_cache[self.layer_idx], past_key_value.value_cache[
self.layer_idx]
else:
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.wqkv.weight.dtype
print(
"The input hidden states seem to have been silently cast to float32; this might be because"
" embedding or layer norm layers were upcast to float32. Casting the input back to"
f" {target_dtype}."
)
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
if world_size == 1:
attn_output = flash_attn_with_kvcache(
query_states,
key_states,
value_states,
causal=True,
cache_seqlens=kv_seq_len.item())
else:
if query_states.shape[1] > 1 and is_prefilled:
key_states = key_states[:, :kv_seq_len, ...]
value_states = value_states[:, :kv_seq_len, ...]
assert key_states.shape[1] == query_states.shape[1]
attn_output = attention_sp_ulysses_ring(query_states,
key_states,
value_states,
ulysses_pg=get_ulysess_group(),
ring_pg=get_ring_group(),
ring_impl_type='basic')
else:
attn_output = ring_flash_attn_inference_func(query_states,
key_states,
value_states,
causal=True,
group=get_ring_group(),
cache_seqlens=kv_seq_len.item())
# ---------------- flash attention forward end ------------------- #
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.wo(attn_output)
return attn_output, None, past_key_value
def internlm2_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
is_training = self.training
if not is_training:
attn_output, attn_weights, past_key_value = internlm2_attn_forward_inference(self,
hidden_states,
attention_mask,
position_ids,
past_key_value,
output_attentions,
use_cache,
cache_position)
return attn_output, attn_weights, past_key_value
return _internlm2_varlen_attn_forward(self, hidden_states, attention_mask,
position_ids, past_key_value,
output_attentions, use_cache,
cache_position)
def _internlm2_mla_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
output_attentions = False
bsz, q_len, _ = hidden_states.size()
attn_context = MessageHub.get_instance('packed_sequence')
position_ids = attn_context.get_info('position_ids')
assert position_ids.size(1) == q_len, f'{position_ids.size(1)} {q_len}'
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
q_nope, q_pe = torch.split(
q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1
)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim
# therefore we just need to keep the original shape
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(
compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1
)
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = (
self.kv_b_proj(self.kv_a_layernorm(compressed_kv))
.view(bsz, q_len, self.num_heads, self.qk_nope_head_dim + self.v_head_dim)
.transpose(1, 2)
)
k_nope, value_states = torch.split(
kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1
)
kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb_mla(q_pe, k_pe, cos, sin, position_ids)
query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
query_states[:, :, :, : self.qk_nope_head_dim] = q_nope
query_states[:, :, :, self.qk_nope_head_dim:] = q_pe
key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
key_states[:, :, :, : self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim:] = k_pe
if self.q_head_dim != self.v_head_dim:
value_states = F.pad(value_states, [0, self.q_head_dim - self.v_head_dim])
if past_key_value is not None:
cache_kwargs = {"sin": sin, "cos": cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs
)
# TODO: These transpose are quite inefficient but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
# to be able to avoid many of these transpose/reshape/view.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = 0.0
input_dtype = query_states.dtype
if input_dtype == torch.float32:
# Handle the case where the model is quantized
if hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
elif torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
else:
target_dtype = self.q_a_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
assert SUPPORT_FLASH2
cumulative_lengths = attn_context.get_info('cumulative_lengths')
if cumulative_lengths is not None and SUPPORT_FLASH2 and bsz == 1:
max_seqlen = attn_context.get_info('max_seqlen')
attn_output = varlen_flash_attn(query_states, key_states, value_states,
cumulative_lengths, max_seqlen, softmax_scale=self.softmax_scale)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
softmax_scale=self.softmax_scale,
training=self.training)
if self.q_head_dim != self.v_head_dim:
attn_output = attn_output[:, :, :, : self.v_head_dim]
attn_output = attn_output.reshape(
bsz, q_len, self.num_heads * self.v_head_dim
).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
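# --- Illustrative sketch (not part of the original module) -------------------
# The MLA forward above zero-pads `value_states` from `v_head_dim` up to
# `q_head_dim` before calling flash attention and then slices the output back
# to `v_head_dim`. The helper below checks, with a plain softmax attention
# written in PyTorch (an assumption standing in for the flash kernel), that
# this pad-then-slice trick gives the same result as attending over the
# unpadded values. `_sketch_mla_value_padding` and its sizes are hypothetical.
def _sketch_mla_value_padding(q_head_dim=6, v_head_dim=4, seq_len=5):
    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)
    q = torch.randn(1, seq_len, 2, q_head_dim)  # [b, s, h, d]
    k = torch.randn(1, seq_len, 2, q_head_dim)
    v = torch.randn(1, seq_len, 2, v_head_dim)

    def _causal_attn(q, k, v):
        # [b, s, h, d] -> [b, h, s, d] for the matmuls, then back again.
        q, k, v = (t.transpose(1, 2) for t in (q, k, v))
        scores = q @ k.transpose(-1, -2) / q.shape[-1] ** 0.5
        mask = torch.ones(seq_len, seq_len, dtype=torch.bool).tril()
        scores = scores.masked_fill(~mask, float('-inf'))
        return (scores.softmax(dim=-1) @ v).transpose(1, 2)

    reference = _causal_attn(q, k, v)
    v_padded = F.pad(v, [0, q_head_dim - v_head_dim])
    padded_out = _causal_attn(q, k, v_padded)

    matches = torch.allclose(
        padded_out[:, :, :, :v_head_dim], reference, atol=1e-5)
    pad_is_zero = bool((padded_out[:, :, :, v_head_dim:] == 0).all())
    return matches, pad_is_zero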
def internlm2_mla_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
return _internlm2_mla_varlen_attn_forward(self, hidden_states, attention_mask,
position_ids, past_key_value,
output_attentions, use_cache,
cache_position)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/llama.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from typing import Optional, Tuple
import torch
from mmengine import MessageHub
from transformers.models.llama.modeling_llama import apply_rotary_pos_emb
from ._attention import SUPPORT_FLASH2, flash_attn_wo_mask, varlen_flash_attn
try:
from transformers.cache_utils import Cache
except ImportError:
class Cache:
pass
def repeat_kv_bshd(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""The hidden states go from (batch, seqlen, num_key_value_heads, head_dim)
to (batch, seqlen, num_attention_heads, head_dim)"""
batch, slen, num_key_value_heads, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, :,
None, :].expand(batch, slen,
num_key_value_heads, n_rep,
head_dim)
return hidden_states.reshape(batch, slen, num_key_value_heads * n_rep,
head_dim)
def llama_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if 'padding_mask' in kwargs:
warnings.warn('Passing `padding_mask` is deprecated and will be '
'removed in v4.37. Please make sure to use '
'`attention_mask` instead.')
bsz, q_len, _ = hidden_states.size()
attn_context = MessageHub.get_instance('packed_sequence')
position_ids = attn_context.get_info('position_ids')
assert position_ids.size(1) == q_len, f'{position_ids.size(1)} {q_len}'
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
past_key_value = getattr(self, 'past_key_value', past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models;
# cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# repeat kv for sequence parallel
key_states = repeat_kv_bshd(key_states, self.num_key_value_groups)
value_states = repeat_kv_bshd(value_states, self.num_key_value_groups)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32
# as well. We therefore cast them back to the expected dtype
# to make sure everything works as expected.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
assert SUPPORT_FLASH2
cumulative_lengths = attn_context.get_info('cumulative_lengths')
if cumulative_lengths is not None and bsz == 1:
max_seqlen = attn_context.get_info('max_seqlen')
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_lengths,
max_seqlen,
causal=True,
dropout_p=dropout_rate,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
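# --- Illustrative sketch (not part of the original module) -------------------
# The forward above reads `position_ids`, `cumulative_lengths` and `max_seqlen`
# from the 'packed_sequence' MessageHub when several samples are packed into a
# single row (bsz == 1). How those values are produced is outside this file;
# the helper below is a minimal sketch under the assumption that they follow
# the usual packed-sequence convention: cumulative lengths start at 0 and end
# at the total token count, and position ids restart at 0 for every sample.
# `_sketch_packed_sequence_metadata` is a hypothetical name.
def _sketch_packed_sequence_metadata(seq_lens):
    from itertools import accumulate

    # Boundaries of each sample inside the packed row.
    cumulative_lengths = [0] + list(accumulate(seq_lens))
    # Positions restart for each sample so RoPE treats samples independently.
    position_ids = [pos for n in seq_lens for pos in range(n)]
    max_seqlen = max(seq_lens)
    return cumulative_lengths, position_ids, max_seqlen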
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/dispatches/qwen2.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import inspect
import warnings
from typing import Optional
import torch
from mmengine import MessageHub
from transformers.cache_utils import Cache
from transformers.models.qwen2.modeling_qwen2 import (apply_rotary_pos_emb,
repeat_kv)
from ._attention import flash_attn_wo_mask, varlen_flash_attn
from flash_attn import flash_attn_with_kvcache
import torch.distributed as dist
from xtuner._lite.parallel.setup import get_sp_group, get_ring_group, get_ring_world_size, get_sp_world_size, \
get_ulysess_group
from xtuner._lite.yunchang import attention_sp_ulysses_ring, ring_flash_attn_inference_func
from mmengine.dist import is_distributed, all_gather_object, all_gather
SUPPORT_FLASH2 = False
try:
from flash_attn import flash_attn_func
_flash_supports_window_size = 'window_size' in list(
inspect.signature(flash_attn_func).parameters)
SUPPORT_FLASH2 = True
except ImportError:
pass
def qwen2_attn_forward_inference(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache=False,
cache_position=None,
position_embeddings=None, # will become mandatory in v4.46
):
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads, self.head_dim).transpose(1, 2)
cos, sin = position_embeddings
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin)
kv_seq_len = key_states.shape[-2]
is_prefilled = past_key_value.get_seq_length(self.layer_idx) == 0
if is_distributed():
# Single-GPU (non-distributed) runs are left unchanged
rank = dist.get_rank(get_sp_group())
world_size = dist.get_world_size(get_sp_group())
else:
rank = 0
world_size = 1
if past_key_value is not None:
# Total KV length after this step: new tokens plus the cache offset
kv_seq_len = key_states.shape[-2] + cache_position[0]
if world_size > 1:
if is_prefilled:
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
else:
if rank == world_size - 1:
# During decoding, only the last rank needs to update the cache
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx,
cache_kwargs)
else:
kv_seq_len = cache_position[0]
key_states, value_states = past_key_value.key_cache[self.layer_idx], past_key_value.value_cache[
self.layer_idx]
else:
cache_kwargs = {"sin": sin, "cos": cos, "cache_position": cache_position} # Specific to RoPE models
key_states, value_states = past_key_value.update(key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, "_pre_quantization_dtype"):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
print(
"The input hidden states seem to have been silently cast to float32; this might be because"
" embedding or layer norm layers were upcast to float32. Casting the input back to"
f" {target_dtype}."
)
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
if world_size == 1:
attn_output = flash_attn_with_kvcache(
query_states,
key_states,
value_states,
causal=True,
cache_seqlens=kv_seq_len.item())
else:
if query_states.shape[1] > 1 and is_prefilled:
key_states = key_states[:, :kv_seq_len, ...]
value_states = value_states[:, :kv_seq_len, ...]
assert key_states.shape[1] == query_states.shape[1]
attn_output = attention_sp_ulysses_ring(query_states,
key_states,
value_states,
ulysses_pg=get_ulysess_group(),
ring_pg=get_ring_group(),
ring_impl_type='basic')
else:
attn_output = ring_flash_attn_inference_func(query_states,
key_states,
value_states,
causal=True,
group=get_ring_group(),
cache_seqlens=kv_seq_len.item())
# ---------------- flash attention forward end ------------------- #
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
def qwen2_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
is_training = self.training
if not is_training:
attn_output, attn_weights, past_key_value = qwen2_attn_forward_inference(self,
hidden_states,
attention_mask,
position_ids,
past_key_value,
output_attentions,
use_cache, **kwargs)
return attn_output, attn_weights, past_key_value
assert is_training == (past_key_value is None)
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37.'
' Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
attn_context = MessageHub.get_instance('packed_sequence')
position_ids = attn_context.get_info('position_ids')
assert position_ids.size(1) == q_len, f'{position_ids.size(1)} {q_len}'
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None
rotary_seq_len = max(kv_seq_len, position_ids.max().item() + 1)
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
if past_key_value is not None:
# Activate cache slicing only if the config has a
# `sliding_window` attribute
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads for sequence parallel
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, layer norms are usually cast to float32 for training
# stability, so the input hidden states get silently cast to
# float32. Hence, we need to cast them back to the expected
# half-precision dtype so that Flash Attention works as expected.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# ----------------- flash attention forward ------------------------#
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
causal = self.is_causal and q_len != 1
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and self.layer_idx < self.config.max_window_layers
and self.config.use_sliding_window)
window_size = (self.config.sliding_window,
self.config.sliding_window) if use_sliding_windows else (-1,
-1)
assert SUPPORT_FLASH2
cumulative_lengths = attn_context.get_info('cumulative_lengths')
if cumulative_lengths is not None and SUPPORT_FLASH2 and bsz == 1:
max_seqlen = attn_context.get_info('max_seqlen')
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_lengths,
max_seqlen,
causal=causal,
dropout_p=dropout_rate,
window_size=window_size,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=dropout_rate,
window_size=window_size,
training=self.training)
# ---------------- flash attention forward end ------------------- #
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/generate.py
================================================
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/lora.py
================================================
LORA_TARGET_MAP = {
'InternLM2ForCausalLM': ['wqkv', 'wo', 'w1', 'w2', 'w3'],
'CLIPVisionModel':
['q_proj', 'k_proj', 'v_proj', 'out_proj', 'fc1', 'fc2']
}
================================================
FILE: xtuner-eval_niah/xtuner/_lite/accelerate/packed.py
================================================
from contextlib import contextmanager
import torch
from xtuner._lite.parallel import get_sp_group, split_for_sequence_parallel
@contextmanager
def packed_sequence(num_tokens, enable=False, sp_size=1):
from mmengine import MessageHub
ctx = MessageHub.get_instance('packed_sequence')
if enable:
device = num_tokens.device
_zero_length = torch.zeros(1, device=device)
_pad_length = torch.cat([_zero_length, num_tokens]).int()
cumulative_lengths = torch.cumsum(_pad_length, 0).int()
position_ids = [torch.arange(num.item()) for num in num_tokens]
position_ids = torch.cat(position_ids, dim=0).to(device)
position_ids = position_ids.unsqueeze(0)
if sp_size > 1:
sp_group = get_sp_group()
# `dim` is 1 as the shape of tensor is (bs, seq_len)
position_ids = split_for_sequence_parallel(
position_ids, dim=1, sp_group=sp_group)
# ctx.update_info('num_tokens', num_tokens)
ctx.update_info('position_ids', position_ids)
ctx.update_info('cumulative_lengths', cumulative_lengths)
ctx.update_info('max_seqlen', num_tokens.max())
else:
# ctx.update_info('num_tokens', None)
ctx.update_info('position_ids', None)
ctx.update_info('cumulative_lengths', None)
ctx.update_info('max_seqlen', None)
yield
# ctx.update_info('num_tokens', None)
ctx.update_info('position_ids', None)
ctx.update_info('cumulative_lengths', None)
ctx.update_info('max_seqlen', None)
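
The metadata that `packed_sequence` publishes can be illustrated without `torch` or `mmengine`. The following is a minimal sketch in plain Python, assuming the per-sample token counts are given as a list of ints; the helper name `packed_metadata` is hypothetical and not part of xtuner.

```python
from itertools import accumulate


def packed_metadata(num_tokens):
    """Return (cumulative_lengths, position_ids, max_seqlen) for packed samples.

    Hypothetical helper: mirrors the quantities stored by `packed_sequence`,
    but uses plain lists instead of tensors.
    """
    # Sample boundaries inside the packed sequence, starting at 0.
    cumulative_lengths = [0] + list(accumulate(num_tokens))
    # Position ids restart from 0 at the beginning of every sample.
    position_ids = [pos for n in num_tokens for pos in range(n)]
    # The longest sample bounds the attention kernel's workspace.
    max_seqlen = max(num_tokens)
    return cumulative_lengths, position_ids, max_seqlen


cu, pos, max_len = packed_metadata([3, 2, 4])
```

For three samples of length 3, 2 and 4, the boundaries are `[0, 3, 5, 9]`, which is the form a variable-length attention kernel uses to keep samples from attending to each other.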
================================================
FILE: xtuner-eval_niah/xtuner/_lite/auto.py
================================================
import math
import os
from typing import Literal, Optional
import torch
from transformers import BitsAndBytesConfig, PretrainedConfig
from xtuner.model.modules.dispatch import SUPPORT_FLASH1, SUPPORT_FLASH2
if os.environ.get('XTUNER_USE_MODELSCOPE'):
from modelscope import AutoTokenizer # noqa: F401
from modelscope import AutoConfig
from modelscope import AutoModelForCausalLM as OriAutoModelForCausalLM
else:
from transformers import AutoTokenizer # noqa: F401
from transformers import AutoConfig
from transformers import AutoModelForCausalLM as OriAutoModelForCausalLM
def download_model_from_hub(
model_name_or_path: str,
from_hub: Literal['huggingface', 'modelscope'] = 'huggingface',
cache_dir: Optional[str] = None,
) -> str:
"""Automatically download model from the HUB.
Note:
If `model_name_or_path` is a local path, it will return the path
directly without downloading it again.
Args:
model_name_or_path (str): The model name, model path or repo id.
from_hub (str): The model hosting hub, modelscope, or huggingface.
Default is huggingface.
cache_dir (str | None):
The save path when downloading the model. If it is None, it
will be stored in the default location of the HUB. For
Huggingface, it's ~/.cache/huggingface/hub, for ModelScope,
it's ~/.cache/modelscope/hub.
Returns:
str: The local path of the model.
"""
if os.path.isdir(model_name_or_path):
model_path = model_name_or_path
elif from_hub == 'huggingface':
from huggingface_hub import snapshot_download
model_path = snapshot_download(
repo_id=model_name_or_path, cache_dir=cache_dir)
elif from_hub == 'modelscope':
from modelscope import snapshot_download
model_path = snapshot_download(
model_id=model_name_or_path, cache_dir=cache_dir)
else:
# TODO support openxlab
raise NotImplementedError('The model does not support downloading '
f'from {from_hub}, it only supports '
'`huggingface` and `modelscope`.')
return model_path
class AutoModelForCausalLM:
"""Enhanced version of Huggingface's `AutoModelForCausalLM`.
Compared to HuggingFace's `AutoModelForCausalLM`, the following three
features have been added:
1. Load the model from either HuggingFace or ModelScope based on the
environment variable `XTUNER_USE_MODELSCOPE` (bool).
2. Automatically enables Flash Attention. If `flash-attn` is already
installed, Flash Attention 2 will be used. If there is no
`flash-attn`, use Flash Attention 1 when torch version is less than
2.2. When torch version is greater than or equal to 2.2, use Flash
Attention 2.
3. When the length of the target sequence during training exceeds the
maximum length of the original model, the rope scaling is
automatically set to the `linear` type with a factor of
`ceil(max_position_embeddings / original_context_length)`.
Note:
If the model is built through `from_config`, it will not automatically
enable flash attention or modify rope scaling.
Note:
If you want to load the model on ModelScope, please set the
environment variable `XTUNER_USE_MODELSCOPE=1`.
"""
@classmethod
def from_config(cls,
pretrained_model_name_or_path: str,
trust_remote_code: bool = True,
**kwargs):
"""Load the model config with `AutoConfig.from_pretrained`."""
return AutoConfig.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=trust_remote_code,
**kwargs)
@classmethod
def from_pretrained(
cls,
pretrained_model_name_or_path: str,
trust_remote_code: bool = True,
quantization_config: Optional[BitsAndBytesConfig] = None,
max_position_embeddings: Optional[int] = None,
**kwargs):
"""Consistent with the usage of HuggingFace's AutoModelForCausalLM."""
config = cls.from_config(
pretrained_model_name_or_path, trust_remote_code=trust_remote_code)
attn_kwargs = cls._flash_attn_kwargs(config)
kwargs.update(attn_kwargs)
if max_position_embeddings:
long_ctx_kwargs = cls._long_ctx_kwargs(config,
max_position_embeddings)
kwargs.update(long_ctx_kwargs)
if 'torch_dtype' not in kwargs:
if torch.cuda.is_bf16_supported():
kwargs.update(torch_dtype=torch.bfloat16)
else:
kwargs.update(torch_dtype=torch.float16)
model = OriAutoModelForCausalLM.from_pretrained(
pretrained_model_name_or_path,
trust_remote_code=trust_remote_code,
quantization_config=quantization_config,
**kwargs)
from xtuner._lite.accelerate import dispatch_modules
dispatch_modules(model, use_varlen_attn=True)
return model
@staticmethod
def _flash_attn_kwargs(config: PretrainedConfig) -> dict:
"""Arguments Required to Enable Flash Attention."""
cls_name = type(config).__name__
_built_in_flash_attn_1 = ('LlamaConfig', 'GemmaConfig',
'MistralConfig', 'MixtralConfig',
'Qwen2Config', 'Starcoder2Config')
_built_in_flash_attn_2 = ('InternLMConfig', 'InternLM2Config',
'LlamaConfig', 'GemmaConfig',
'MistralConfig', 'MixtralConfig',
'Qwen2Config', 'Starcoder2Config')
attn_kwargs = {}
if SUPPORT_FLASH2 and cls_name in _built_in_flash_attn_2:
attn_kwargs.update(attn_implementation='flash_attention_2')
elif SUPPORT_FLASH1 and cls_name in _built_in_flash_attn_1:
attn_kwargs.update(attn_implementation='sdpa')
return attn_kwargs
@staticmethod
def _long_ctx_kwargs(config: PretrainedConfig,
max_position_embeddings: int) -> dict:
"""Arguments Required for Long Context Training."""
ori_rope_scaling = getattr(config, 'rope_scaling', None)
if ori_rope_scaling is None:
ori_rope_scaling = {'factor': 1}
if 'factor' in ori_rope_scaling.keys():
ori_rope_scaling_factor = ori_rope_scaling['factor']
else:
ori_rope_scaling_factor = 1
ori_ctx_len = getattr(config, 'max_position_embeddings', None)
long_text_kwargs = {}
if ori_ctx_len:
ori_ctx_len *= ori_rope_scaling_factor
if max_position_embeddings > ori_ctx_len:
scaling_factor = float(
math.ceil(max_position_embeddings / ori_ctx_len))
new_rope_scaling = {'type': 'linear', 'factor': scaling_factor}
long_text_kwargs.update(dict(rope_scaling=new_rope_scaling))
return long_text_kwargs
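
The scaling rule in `_long_ctx_kwargs` can be checked in isolation. The sketch below restates it as a standalone function over plain numbers, assuming no config object; the function name `linear_rope_scaling` is hypothetical.

```python
import math


def linear_rope_scaling(ori_ctx_len, target_ctx_len, ori_factor=1):
    """Return the extra kwargs needed to reach `target_ctx_len`.

    Hypothetical standalone version of the rule above: an empty dict means
    the original context window is already long enough.
    """
    # The effective window already includes any existing rope scaling factor.
    effective_len = ori_ctx_len * ori_factor
    if target_ctx_len <= effective_len:
        return {}
    # Round up so the scaled window always covers the target length.
    factor = float(math.ceil(target_ctx_len / effective_len))
    return {'rope_scaling': {'type': 'linear', 'factor': factor}}


unchanged = linear_rope_scaling(32768, 32768)
extended = linear_rope_scaling(32768, 100000)
```

Because the factor is rounded up to an integer, a target of 100000 tokens on a 32768-token model yields a factor of 4 rather than roughly 3.05.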
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/__init__.py
================================================
from .messages import ChatMessages
from .templates import CHAT_TEMPLATE_MAP, ChatTemplate, HybridChatTemplate
__all__ = [
'ChatMessages', 'CHAT_TEMPLATE_MAP', 'ChatTemplate', 'HybridChatTemplate'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/backends/__init__.py
================================================
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/messages/__init__.py
================================================
from .base import BaseMessages
from .chat import ChatMessages
__all__ = ['BaseMessages', 'ChatMessages']
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/messages/base.py
================================================
from abc import abstractclassmethod, abstractmethod
from typing import Dict
from pydantic import BaseModel
from transformers import PreTrainedTokenizer
from ..templates import ChatTemplate
class BaseMessages(BaseModel):
@abstractmethod
def add(self, role: str, content):
pass
@abstractmethod
def pop(self):
pass
@abstractmethod
def get_prompt(self, chat_template: ChatTemplate) -> str:
pass
@abstractmethod
def tokenize(self, tokenizer: PreTrainedTokenizer,
chat_template: ChatTemplate) -> Dict:
pass
@abstractclassmethod
def from_dict(cls, item: Dict) -> 'BaseMessages':
pass
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/messages/chat.py
================================================
import copy
from typing import Dict, List, Literal, Optional, Union
from pydantic import BaseModel
from transformers import PreTrainedTokenizer
from xtuner._lite import get_logger
from xtuner.utils import IGNORE_INDEX
from ..templates import ChatTemplate, HybridChatTemplate
from .base import BaseMessages
logger = get_logger()
class TextContentItem(BaseModel):
type: Literal['text'] = 'text'
text: str
def apply_chat_template(self, chat_template: HybridChatTemplate) -> str:
return self.text
class ImageContentItem(BaseModel):
type: Literal['image_url'] = 'image_url'
image_url: str
def apply_chat_template(self, chat_template: HybridChatTemplate) -> str:
return chat_template.image_token
MultModalContentType = Union[TextContentItem, ImageContentItem]
ContentType = Union[str, List[MultModalContentType]]
class ChatMsg(BaseModel):
role: Literal['assistant', 'user', 'system']
content: ContentType
loss: Optional[bool] = None
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
if self.loss is None:
if self.role == 'system':
self.loss = False
elif self.role == 'user':
self.loss = False
elif self.role == 'assistant':
self.loss = True
else:
raise NotImplementedError
def collect_img_urls(self) -> List[str]:
img_urls = []
if isinstance(self.content, list):
for item in self.content:
if isinstance(item, ImageContentItem):
img_urls.append(item.image_url)
return img_urls
def get_prompt(self, chat_template: ChatTemplate) -> str:
if isinstance(self.content, str):
text = self.content
elif isinstance(self.content, list):
text = ''
for i, item in enumerate(self.content):
if i == 0:
text += item.apply_chat_template(chat_template)
else:
text += '\n' + item.apply_chat_template(chat_template)
else:
raise NotImplementedError
if self.role == 'system':
prompt = chat_template.decorate_system(text)
elif self.role == 'user':
prompt = chat_template.decorate_user(text)
elif self.role == 'assistant':
prompt = chat_template.decorate_assistant(text)
else:
raise NotImplementedError
return prompt
def tokenize(
self,
tokenizer: PreTrainedTokenizer,
chat_template: ChatTemplate,
):
decorated = self.get_prompt(chat_template)
token_ids = tokenizer.encode(decorated, add_special_tokens=False)
if self.loss:
label_ids = copy.deepcopy(token_ids)
else:
label_ids = [IGNORE_INDEX] * len(token_ids)
return {
'input_ids': token_ids,
'labels': label_ids,
}
class ChatMessages(BaseMessages):
messages: List[ChatMsg]
def add(self, role, content, loss=False):
self.messages.append(ChatMsg(role=role, content=content, loss=loss))
def pop(self):
return self.messages.pop()
def get_prompt(self, chat_template: ChatTemplate) -> str:
prompt = ''
for msg in self.messages:
prompt += msg.get_prompt(chat_template)
if msg.role == 'assistant':
prompt += chat_template.sep
return prompt
def tokenize(self, tokenizer: PreTrainedTokenizer,
chat_template: ChatTemplate) -> Dict:
input_ids = []
labels = []
image_urls = []
for msg in self.messages:
res = msg.tokenize(tokenizer, chat_template)
token_ids, label_ids = res['input_ids'], res['labels']
input_ids.extend(token_ids)
labels.extend(label_ids)
image_urls.extend(msg.collect_img_urls())
if msg.role == 'assistant':
sep = chat_template.sep
sep_tokens = tokenizer.encode(sep, add_special_tokens=False)
input_ids.extend(sep_tokens)
labels.extend([IGNORE_INDEX] * len(sep_tokens))
if len(input_ids) != len(labels):
logger.error(f'[messages] {self.messages}')
logger.error(f'[input_ids] {input_ids}')
logger.error(f'[labels] {labels}')
raise RuntimeError('The lengths of input_ids and labels must be '
f'equal, but found {len(input_ids)} and '
f'{len(labels)}.')
training_data = {
'input_ids': input_ids,
'labels': labels,
'num_tokens': len(input_ids),
}
if len(image_urls) > 0:
training_data['image_urls'] = image_urls
return training_data
@classmethod
def from_str(cls, prompt: str) -> 'ChatMessages':
msg = ChatMsg(role='user', content=prompt)
return cls(messages=[msg])
@classmethod
def from_dict(cls, item: dict) -> 'ChatMessages':
'''
item
{
'messages':[
{'role':'user', 'content':'hello'},
{'role':'assistant', 'content':'hello!'},
],
}
'''
return cls(**item)
if __name__ == '__main__':
data = {
'messages': [
{
'role': 'user',
'content': 'hello'
},
{
'role': 'assistant',
'content': 'hello!'
},
]
}
messages = ChatMessages.from_dict(data)
chat_template = ChatTemplate(
system='<|im_start|>system\n{system}<|im_end|>\n',
user='<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n',
assistant='{assistant}<|im_end|>\n',
stop_words=['<|im_end|>'],
)
print(messages.get_prompt(chat_template))
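
The loss masking performed by `ChatMsg.tokenize` can be sketched without a real tokenizer. This example assumes `IGNORE_INDEX` is -100 (the value is not shown in this file) and uses a hypothetical character-level `toy_encode` in place of `tokenizer.encode`; chat-template decoration and separators are left out.

```python
IGNORE_INDEX = -100  # assumption: the usual ignore label for cross-entropy loss


def toy_encode(text):
    """Hypothetical stand-in for `tokenizer.encode`: one id per character."""
    return [ord(ch) for ch in text]


def build_labels(messages):
    """Concatenate token ids and mask every non-assistant token in the labels."""
    input_ids, labels = [], []
    for role, text in messages:
        ids = toy_encode(text)
        input_ids.extend(ids)
        if role == 'assistant':
            # Assistant tokens contribute to the loss.
            labels.extend(ids)
        else:
            # System and user tokens are ignored by the loss.
            labels.extend([IGNORE_INDEX] * len(ids))
    return input_ids, labels


ids, labels = build_labels([('user', 'hi'), ('assistant', 'ok')])
```

The two lists always have the same length, which is the invariant `ChatMessages.tokenize` checks before returning the training sample.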
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/templates/__init__.py
================================================
from .chat import ChatTemplate
from .hybrid import HybridChatTemplate
CHAT_TEMPLATE_MAP = {
'internlm2':
HybridChatTemplate(
system='<|im_start|>system\n{system}<|im_end|>\n',
user='<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n',
assistant='{assistant}<|im_end|>',
stop_words=['<|im_end|>']),
'llama3_chat':
HybridChatTemplate(
system=('<|start_header_id|>system<|end_header_id|>\n\n{system}'
'<|eot_id|>'),
user=('<|start_header_id|>user<|end_header_id|>\n\n{user}<|eot_id|>'
'<|start_header_id|>assistant<|end_header_id|>\n\n'),
assistant='{assistant}<|eot_id|>',
sep='',
stop_words=['<|eot_id|>']),
'qwen':
HybridChatTemplate(
system='<|im_start|>system\n{system}<|im_end|>\n',
user='<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n',
assistant='{assistant}<|im_end|>',
stop_words=['<|im_end|>', '<|endoftext|>']),
}
__all__ = ['ChatTemplate', 'HybridChatTemplate']
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/templates/chat.py
================================================
from typing import List
from pydantic import BaseModel, field_validator
class ChatTemplate(BaseModel):
"""Define a Pydantic data model for a plain chat template with format
strings for system, user and assistant messages."""
# Normal Chat
system: str # System message format
user: str # User message format
assistant: str # Assistant message format
stop_words: List[str] # List of stop words
sep: str = '\n'
def decorate_system(self, text: str) -> str:
"""Decorate text with the `system` template."""
return self.system.format(system=text)
def decorate_assistant(self, text: str) -> str:
"""Decorate text with the `assistant` template."""
return self.assistant.format(assistant=text)
def decorate_user(self, text: str) -> str:
"""Decorate text with the `user` template."""
return self.user.format(user=text)
@field_validator('system')
def check_system(cls, v: str) -> str:
"""Validate that `system` contains '{system}'.
If not, raises a ValueError.
"""
if v is not None and '{system}' not in v:
raise ValueError("system must contain the keyword '{system}'")
return v
@field_validator('user')
def check_user(cls, v: str) -> str:
"""Validate that `user` contains '{user}'.
If not, raises a ValueError.
"""
if v is not None and '{user}' not in v:
raise ValueError("user must contain the keyword '{user}'")
return v
@field_validator('assistant')
def check_assistant(cls, v: str) -> str:
"""Validate that `assistant` contains '{assistant}'.
If not, raises a ValueError.
"""
if v is not None and '{assistant}' not in v:
raise ValueError(
"assistant must contain the keyword '{assistant}'")
return v
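
The `decorate_*` methods are thin wrappers around `str.format`. The sketch below shows how a ChatML-style prompt is assembled from the same three format strings, assuming plain strings instead of the Pydantic model and ignoring the `sep` field; the `render` helper is hypothetical.

```python
SYSTEM = '<|im_start|>system\n{system}<|im_end|>\n'
USER = '<|im_start|>user\n{user}<|im_end|>\n<|im_start|>assistant\n'
ASSISTANT = '{assistant}<|im_end|>\n'


def render(messages):
    """Format each (role, text) pair with its template and join the results."""
    templates = {'system': SYSTEM, 'user': USER, 'assistant': ASSISTANT}
    # Each template has exactly one placeholder, named after its role.
    return ''.join(
        templates[role].format(**{role: text}) for role, text in messages)


prompt = render([('user', 'hello'), ('assistant', 'hello!')])
```

The user template already ends with the assistant header, so the assistant template only needs to append the reply and the end-of-turn marker.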
================================================
FILE: xtuner-eval_niah/xtuner/_lite/chat/templates/hybrid.py
================================================
from typing import Dict, List, Optional
from pydantic import BaseModel, field_validator
class HybridChatTemplate(BaseModel):
"""Define a Pydantic data model for a hybrid chat with attributes for
system, user and assistant chat as well as function and interpreter calls
and results."""
# Normal Chat
system: str # System message format
user: str # User message format
assistant: str # Assistant message format
stop_words: List[str] # List of stop words
sep: str = '\n'
# Multimodal Chat
# Predefined token and index for images
image_token: str = '<image>'
image_token_index: int = -100
# Agent Chat
# Interpreter and function related strings
files: Optional[str] = None
functions: Optional[str] = None # Function description format
function_call: Optional[str] = None # Function call format
function_result: Optional[str] = None # Function result format
code_interpreter: Optional[str] = None
code_interpreter_call: Optional[str] = None # Interpreter call format
code_interpreter_result: Optional[str] = None # Interpreter result format
function_token: Optional[str] = None
code_interpreter_token: Optional[str] = None
action_start_token: Optional[str] = None
action_end_token: Optional[str] = None
@property
def mm_token_maps(self) -> Dict[str, int]:
"""Return a dictionary that maps multimodal tokens to corresponding
token indexes."""
return {self.image_token: self.image_token_index}
def decorate_system(self, text: str) -> str:
"""Decorate text with the `system` template."""
return self.system.format(system=text)
def decorate_assistant(self, text: str) -> str:
"""Decorate text with the `assistant` template."""
return self.assistant.format(assistant=text)
def decorate_user(self, text: str) -> str:
"""Decorate text with the `user` template."""
return self.user.format(user=text)
def decorate_files(self, text: str) -> str:
"""Decorate text with the `files` template."""
return self.files.format(files=text)
def decorate_functions(self, text: str) -> str:
"""Decorate text with the `functions` template."""
return self.functions.format(functions=text)
def decorate_function_call(self, text: str, func: str) -> str:
"""Decorate text with the `function_call` template."""
return self.function_call.format(assistant=text, function_call=func)
def decorate_function_result(self, text: str) -> str:
"""Decorate text with the `function_result` template."""
return self.function_result.format(function_result=text)
def decorate_code_interpreter(self, text: str) -> str:
"""Decorate text with the `code_interpreter` template."""
return self.code_interpreter.format(code_interpreter=text)
def decorate_code_interpreter_call(self, text: str, func: str) -> str:
"""Decorate text with the `code_interpreter_call` template."""
return self.code_interpreter_call.format(
assistant=text, code_interpreter_call=func)
def decorate_code_interpreter_result(self, text: str) -> str:
"""Decorate text with the `code_interpreter_result` template."""
return self.code_interpreter_result.format(
code_interpreter_result=text)
@field_validator('system')
def check_system(cls, v: str) -> str:
"""Validate that `system` contains '{system}'.
If not, raises a ValueError.
"""
if v is not None and '{system}' not in v:
raise ValueError("system must contain the keyword '{system}'")
return v
@field_validator('user')
def check_user(cls, v: str) -> str:
"""Validate that `user` contains '{user}'.
If not, raises a ValueError.
"""
if v is not None and '{user}' not in v:
raise ValueError("user must contain the keyword '{user}'")
return v
@field_validator('assistant')
def check_assistant(cls, v: str) -> str:
"""Validate that `assistant` contains '{assistant}'.
If not, raises a ValueError.
"""
if v is not None and '{assistant}' not in v:
raise ValueError(
"assistant must contain the keyword '{assistant}'")
return v
@field_validator('function_call')
def check_function_call(cls, v: str) -> str:
"""Validate that `function_call` contains '{function_call}'.
If not, raises a ValueError.
"""
# Both placeholders are required by `decorate_function_call`.
if v is not None and ('{function_call}' not in v
or '{assistant}' not in v):
raise ValueError(
"function_call must contain the keywords '{assistant}' and "
"'{function_call}'")
return v
@field_validator('function_result')
def check_function_result(cls, v: str) -> str:
"""Validate that `function_result` contains '{function_result}'.
If not, raises a ValueError.
"""
if v is not None and '{function_result}' not in v:
raise ValueError(
"function_result must contain the keyword '{function_result}'")
return v
@field_validator('functions')
def check_functions(cls, v: str) -> str:
"""Validate that `functions` contains '{functions}'.
If not, raises a ValueError.
"""
if v is not None and '{functions}' not in v:
raise ValueError(
"functions must contain the keyword '{functions}'")
return v
@field_validator('code_interpreter')
def check_code_interpreter(cls, v: str) -> str:
"""Validate that `code_interpreter` contains '{code_interpreter}'.
If not, raises a ValueError.
"""
if v is not None and '{code_interpreter}' not in v:
raise ValueError('code_interpreter must contain the keyword '
"'{code_interpreter}'")
return v
@field_validator('code_interpreter_call')
def check_code_interpreter_call(cls, v: str) -> str:
"""Validate that `code_interpreter_call` contains
'{code_interpreter_call}'.
If not, raises a ValueError.
"""
# Both placeholders are required by `decorate_code_interpreter_call`.
if v is not None and ('{code_interpreter_call}' not in v
or '{assistant}' not in v):
raise ValueError('code_interpreter_call must contain the keywords '
"'{assistant}' and '{code_interpreter_call}'")
return v
@field_validator('code_interpreter_result')
def check_code_interpreter_result(cls, v: str) -> str:
"""Validate that `code_interpreter_result` contains
'{code_interpreter_result}'.
If not, raises a ValueError.
"""
if v is not None and '{code_interpreter_result}' not in v:
raise ValueError(
'code_interpreter_result must contain the keyword '
"'{code_interpreter_result}'")
return v
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .format import OPENAI_FORMAT_MAP
from .llava import (LlavaCollator, LlavaRawDataset, LlavaTokenizedDataset,
LlavaTokenizeFunction, SoftPackerForLlava)
from .load import load_datasets
from .text import (HardPackerForText, SoftPackerForText, TextCollator,
TextOnlineTokenizeDataset, TextTokenizedDataset,
TextTokenizeFunction)
__all__ = [
'OPENAI_FORMAT_MAP', 'LlavaCollator', 'LlavaRawDataset',
'LlavaTokenizedDataset', 'LlavaTokenizeFunction', 'SoftPackerForLlava',
'load_datasets', 'HardPackerForText', 'SoftPackerForText', 'TextCollator',
'TextOnlineTokenizeDataset', 'TextTokenizedDataset', 'TextTokenizeFunction'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/cache.py
================================================
import torch
class CacheDataset(torch.utils.data.Dataset):
@property
def cached_dir(self):
pass
@property
def cached(self):
pass
def cache(self, cache_dir):
pass
def load_cache(self):
pass
@classmethod
def from_cache(cls, cache_dir):
pass
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/format.py
================================================
import re
class Alpaca2Openai():
@classmethod
def source_format(cls):
data = {
'instruction': 'INSTRUCTION',
'input': 'INPUT',
'output': 'OUTPUT',
}
return data
@classmethod
def target_format(cls):
data = {
'messages': [
{
'role': 'user',
'content': 'INSTRUCTION\nINPUT'
},
{
'role': 'assistant',
'content': 'OUTPUT'
},
]
}
return data
@staticmethod
def convert(data):
if data.get('output') == '':
return {'messages': []}
else:
return {
'messages': [
{
'role': 'user',
'content': f"{data['instruction']}\n{data['input']}"
},
{
'role': 'assistant',
'content': f"{data['output']}"
},
]
}
def llava_to_openai(data):
image_token = '<image>'
conversations = data['conversations']
messages = []
if 'image' in data:
image_urls = data['image']
if isinstance(image_urls, str):
image_urls = [image_urls]
else:
image_urls = None
while conversations and conversations[0]['from'] == 'gpt':
# Skip any leading turns that come from gpt
conversations = conversations[1:]
image_id = 0
for convs in conversations:
if convs['from'] == 'human':
pattern = f'({image_token})'
chunks = re.split(pattern, convs['value'])
text_content = []
img_content = []
for chunk in chunks:
if chunk == image_token:
url = image_urls[image_id]
if not isinstance(url, str):
raise TypeError(data)
# assert , image_url
item = dict(type='image_url', image_url=url)
img_content.append(item)
image_id += 1
elif len(chunk.strip()):
item = dict(type='text', text=chunk.strip())
text_content.append(item)
msg = {'role': 'user', 'content': img_content + text_content}
messages.append(msg)
elif convs['from'] == 'gpt':
msg = {'role': 'assistant', 'content': convs['value']}
messages.append(msg)
else:
raise NotImplementedError
return {'messages': messages}
def llava_to_openai_interleave(data):
image_token = '<image>'
conversations = data['conversations']
messages = []
if 'image' in data:
image_urls = data['image']
if isinstance(image_urls, str):
image_urls = [image_urls]
else:
image_urls = None
while conversations and conversations[0]['from'] == 'gpt':
# Skip the first one if it is from gpt
conversations = conversations[1:]
image_id = 0
for convs in conversations:
if convs['from'] == 'human':
pattern = f'({image_token})'
chunks = re.split(pattern, convs['value'])
content = []
for chunk in chunks:
if chunk == image_token:
url = image_urls[image_id]
if not isinstance(url, str):
raise TypeError(data)
# assert , image_url
item = dict(type='image_url', image_url=url)
content.append(item)
image_id += 1
elif len(chunk.strip()):
item = dict(type='text', text=chunk.strip())
content.append(item)
msg = {'role': 'user', 'content': content}
messages.append(msg)
elif convs['from'] == 'gpt':
msg = {'role': 'assistant', 'content': convs['value']}
messages.append(msg)
else:
raise NotImplementedError
return {'messages': messages}
OPENAI_FORMAT_MAP = {
'llava': llava_to_openai,
'llava_interleave': llava_to_openai_interleave,
'alpaca': Alpaca2Openai.convert,
'openai': lambda x: x,
}
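# Illustrative sketch, not part of the original module: how a single LLaVA
# "human" turn could be split into OpenAI-style content parts, mirroring the
# non-interleaved `llava_to_openai` path (images first, then text). The
# helper name `split_user_turn`, the sample record and the `<image>`
# placeholder value are assumptions made for this example only.

```python
import re

IMAGE_TOKEN = '<image>'  # assumed LLaVA-style image placeholder


def split_user_turn(text, image_urls):
    """Split one human turn into content parts, images before text."""
    images, texts = [], []
    urls = iter(image_urls)
    # The capturing group keeps the placeholder itself in the split result.
    for chunk in re.split(f'({re.escape(IMAGE_TOKEN)})', text):
        if chunk == IMAGE_TOKEN:
            images.append({'type': 'image_url', 'image_url': next(urls)})
        elif chunk.strip():
            texts.append({'type': 'text', 'text': chunk.strip()})
    return images + texts


# Hypothetical raw record in LLaVA format.
sample = {
    'image': 'cat.jpg',
    'conversations': [
        {'from': 'human', 'value': '<image>\nWhat is in the picture?'},
        {'from': 'gpt', 'value': 'A cat.'},
    ],
}
content = split_user_turn(sample['conversations'][0]['value'],
                          [sample['image']])
print(content)
```

# The interleaved variant would instead append image and text parts to one
# list in the order they occur in the turn.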
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/llava.py
================================================
import os
from io import BytesIO
import torch
from datasets import load_from_disk
from mmengine import fileio
from PIL import Image
from torch.nn.utils.rnn import pad_sequence
from xtuner._lite.chat import ChatMessages
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX
from .format import OPENAI_FORMAT_MAP
from .text import SoftPackerForText, TextTokenizedDataset
class LlavaTokenizeFunction():
def __init__(self,
tokenizer,
chat_template,
per_img_tokens,
image_dir=None,
raw_format='llava'):
self.tokenizer = tokenizer
self.chat_template = chat_template
self.image_dir = image_dir
self.raw_format = raw_format
self.per_img_tokens = per_img_tokens
def __call__(self, item):
formatter = OPENAI_FORMAT_MAP[self.raw_format]
msg = ChatMessages.from_dict(formatter(item))
tokenized = msg.tokenize(self.tokenizer, self.chat_template)
tokenized['num_img_tokens'] = 0
if 'image_urls' in tokenized:
image_urls = []
for url in tokenized['image_urls']:
if self.image_dir:
image_urls.append(os.path.join(self.image_dir, url))
else:
image_urls.append(url)
num_images = len(image_urls)
num_img_tokens = [self.per_img_tokens for url in image_urls]
tokenized['num_tokens'] += sum(num_img_tokens) - num_images
tokenized['num_img_tokens'] = sum(num_img_tokens)
tokenized['image_urls'] = image_urls
return tokenized
class LlavaTokenizedDataset(TextTokenizedDataset):
def __init__(self, dataset, image_processor, max_length):
super().__init__(dataset, max_length)
self.image_processor = image_processor
def process_tokenized_data(self, tokenized_data):
images = []
for url in tokenized_data['image_urls']:
img = Image.open(BytesIO(fileio.get(url)))
images.append(img)
if len(images):
outputs = self.image_processor(images, return_tensors='pt')
pixel_values = outputs['pixel_values']
else:
pixel_values = None
data = {
'input_ids': tokenized_data['input_ids'],
'labels': tokenized_data['labels'],
'pixel_values': pixel_values,
'num_tokens': [tokenized_data['num_tokens']],
'num_img_tokens': [tokenized_data['num_img_tokens']],
}
return data
@classmethod
def from_cache(cls, cache_dir, image_processor, max_length):
dataset = load_from_disk(os.path.join(cache_dir, 'dataset'))
ret = cls(dataset, image_processor, max_length)
ret.cache(cache_dir)
return ret
def __getitem__(self, item):
"""Returns the tokenized sample at the given index with its images.
Args:
item: An index into the tokenized dataset.
Returns:
A dict including input_ids, labels, pixel_values, num_tokens and
num_img_tokens.
"""
if self.cached:
self.load_cache()
tokenized_data = self.dataset[item]
if self.cached:
self._free()
return self.process_tokenized_data(tokenized_data)
class LlavaRawDataset(torch.utils.data.Dataset):
def __init__(self, dataset, image_processor, max_length, tokenize_fn):
super().__init__()
self.dataset = dataset
self.image_processor = image_processor
self.max_length = max_length
self.tokenize_fn = tokenize_fn
def process_tokenized_data(self, tokenized_data):
images = []
for url in tokenized_data['image_urls']:
img = Image.open(url)
images.append(img)
if len(images):
outputs = self.image_processor(images, return_tensors='pt')
pixel_values = outputs['pixel_values']
else:
pixel_values = None
data = {
'input_ids': tokenized_data['input_ids'],
'labels': tokenized_data['labels'],
'pixel_values': pixel_values,
'num_tokens': [tokenized_data['num_tokens']],
'num_img_tokens': [tokenized_data['num_img_tokens']],
}
return data
def __getitem__(self, item):
raw_data = self.dataset[item]
tokenized_data = self.tokenize_fn(raw_data)
return self.process_tokenized_data(tokenized_data)
class SoftPackerForLlava(SoftPackerForText):
def __init__(self,
dataset,
image_processor,
max_length=2048,
pack_info=None):
super().__init__(dataset, max_length, pack_info)
self.image_processor = image_processor
def __getitem__(self, item):
"""Returns a dict containing packed data in the given item.
Args:
item: An index to retrieve packed data.
Returns:
A dict including packed input_ids, labels, pixel_values,
num_tokens and num_img_tokens.
"""
if self.cached:
self.load_cache()
dataset = self.dataset
pack_info = self.pack_info
packed_items = pack_info[item]['indices']
assert len(packed_items) > 0
packed_input_ids = []
packed_labels = []
packed_img_urls = []
packed_num_tokens = []
packed_num_img_tokens = []
for i in packed_items:
packed_input_ids.extend(dataset[i]['input_ids'])
packed_labels.extend(dataset[i]['labels'])
_num_tokens = dataset[i]['num_tokens']
packed_num_tokens.append(_num_tokens)
# Index with `i` (the sample inside the pack), not `item` (the pack id).
if 'image_urls' in dataset[i]:
packed_img_urls.extend(dataset[i]['image_urls'])
if 'num_img_tokens' in dataset[i]:
_num_img_tokens = dataset[i]['num_img_tokens']
packed_num_img_tokens.append(_num_img_tokens)
images = []
for url in packed_img_urls:
img = Image.open(BytesIO(fileio.get(url)))
images.append(img)
if len(images):
outputs = self.image_processor(images, return_tensors='pt')
pixel_values = outputs['pixel_values']
else:
pixel_values = None
if sum(packed_num_tokens) < self.max_length:
num_pad_tokens = self.max_length - sum(packed_num_tokens)
packed_input_ids.extend([DEFAULT_PAD_TOKEN_INDEX] * num_pad_tokens)
packed_labels.extend([IGNORE_INDEX] * num_pad_tokens)
packed_num_tokens.append(num_pad_tokens)
else:
packed_num_tokens.append(0)
packed = {
'input_ids': packed_input_ids,
'labels': packed_labels,
'pixel_values': pixel_values,
'num_tokens': packed_num_tokens,
'num_img_tokens': packed_num_img_tokens
}
if self.cached:
self._free()
return packed
@classmethod
def from_cache(cls, cache_dir, image_processor, max_length):
dataset = load_from_disk(os.path.join(cache_dir, 'dataset'))
pack_info_dir = os.path.join(cache_dir, f'pack-info-soft-{max_length}')
if os.path.exists(pack_info_dir):
pack_info = load_from_disk(pack_info_dir)
else:
pack_info = cls.get_pack_info(dataset, max_length)
ret = cls(dataset, image_processor, max_length, pack_info)
ret.cache(cache_dir)
return ret
class LlavaCollator():
def __init__(self, pack_batch=False):
self.pack_batch = pack_batch
def __call__(self, instances):
pad_index = DEFAULT_PAD_TOKEN_INDEX
input_ids = []
labels = []
attention_mask = []
pixel_values = []
num_tokens = []
num_img_tokens = []
for data in instances:
input_ids.append(torch.LongTensor(data['input_ids']))
labels.append(torch.LongTensor(data['labels']))
num_tokens.extend(data['num_tokens'])
num_img_tokens.extend(data['num_img_tokens'])
if data['pixel_values'] is not None:
pixel_values.append(data['pixel_values'])
attention_mask = [torch.ones_like(ids) for ids in input_ids]
num_tokens = torch.IntTensor(num_tokens)
num_img_tokens = torch.IntTensor(num_img_tokens)
if len(instances) > 1 and self.pack_batch:
input_ids = torch.cat(input_ids, dim=0).unsqueeze(0)
labels = torch.cat(labels, dim=0).unsqueeze(0)
attention_mask = torch.cat(attention_mask, dim=0).unsqueeze(0)
elif len(instances) > 1 and not self.pack_batch:
input_ids = pad_sequence(
input_ids, batch_first=True, padding_value=pad_index)
labels = pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX)
attention_mask = pad_sequence(
attention_mask, batch_first=True, padding_value=0)
else:
input_ids = torch.stack(input_ids)
labels = torch.stack(labels)
attention_mask = torch.stack(attention_mask)
if len(pixel_values) > 0:
pixel_values = torch.cat(pixel_values, dim=0)
else:
pixel_values = None
# TODO support sp
data_dict = {
'input_ids': input_ids,
'labels': labels,
'pixel_values': pixel_values,
'num_tokens': num_tokens,
'num_img_tokens': num_img_tokens,
'attention_mask': attention_mask.bool()
}
return data_dict
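# Illustrative sketch, not part of the original module: the token
# bookkeeping performed by `LlavaTokenizeFunction.__call__`. It assumes each
# image placeholder tokenizes to exactly one token, which is what the
# `- num_images` correction in that method implies. The function name and
# the numbers below are hypothetical.

```python
def count_tokens_with_images(num_text_tokens, num_images, per_img_tokens):
    """Return (total tokens, image tokens) after expanding placeholders."""
    # Every image contributes `per_img_tokens` visual tokens ...
    num_img_tokens = num_images * per_img_tokens
    # ... and replaces the single placeholder token already counted in the
    # text, hence the subtraction of `num_images`.
    num_tokens = num_text_tokens + num_img_tokens - num_images
    return num_tokens, num_img_tokens


total, img = count_tokens_with_images(
    num_text_tokens=20, num_images=2, per_img_tokens=576)
print(total, img)
```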
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/load.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import functools
import json
import math
import os
import random
import re
from concurrent.futures import ThreadPoolExecutor
from datetime import timedelta
from datasets import Dataset, concatenate_datasets
from torch import distributed as dist
from tqdm import tqdm
from xtuner._lite import get_logger
from xtuner._lite.parallel import all_to_all_list
from .cache import CacheDataset
logger = get_logger()
def load_json(file):
with open(file) as f:
dset = json.load(f)
return dset
def load_jsonl(file):
try:
dset = []
with open(file) as f:
for line in f:
dset.append(json.loads(line))
except json.JSONDecodeError:
dset = load_json(file)
return dset
def load_bin(file):
return load_jsonl(file)
LOAD_FN_MAP = {'.json': load_json, '.jsonl': load_jsonl, '.bin': load_bin}
def master_only_load(load_fn):
@functools.wraps(load_fn)
def wrapper(*args, **kwargs):
if not (dist.is_available() and dist.is_initialized()):
return load_fn(*args, **kwargs)
timeout = timedelta(
minutes=int(os.getenv('XTUNER_DATASET_TIMEOUT', default=30)))
logger.info(f'xtuner_dataset_timeout = {timeout}', logger='current')
gloo_group = dist.new_group(backend='gloo', timeout=timeout)
if dist.get_rank() == 0:
dataset = load_fn(*args, **kwargs)
objects = [dataset]
else:
objects = [None]
dist.monitored_barrier(group=gloo_group, timeout=timeout)
dist.broadcast_object_list(objects, src=0)
return objects[0]
return wrapper
def multi_thread_map(map_fns, dataset, desc, num_proc=8):
if not isinstance(map_fns, (tuple, list)):
map_fns = [map_fns]
def sequential_map(item):
for fn in map_fns:
item = fn(item)
return item
with ThreadPoolExecutor(max_workers=num_proc) as executor:
results = list(
tqdm(
executor.map(sequential_map, dataset),
desc=desc,
total=len(dataset)))
return results
def openai_format(item):
item['messages'] = item['instruction']
return item
@master_only_load
def load_hf_dataset(path,
split='train',
sample_ratio=1.0,
num_proc=8,
cache_dir=None,
map_fn=None,
init_fn=None):
from datasets import load_dataset
dataset = load_dataset(path)[split]
if map_fn:
dataset = dataset.map(map_fn, num_proc=num_proc)
if sample_ratio != 1:
ori_samples = len(dataset)
target_samples = int(sample_ratio * ori_samples)
indices = random.choices([i for i in range(ori_samples)],
k=target_samples)
dataset = dataset.select(indices)
dataset = dataset.to_list()
if init_fn:
dataset = init_fn(dataset)
if cache_dir and isinstance(dataset, CacheDataset):
dataset.cache(cache_dir)
return dataset
def load_from_cache(cache_dir, init_fn):
if dist.is_available() and dist.is_initialized():
world_size = dist.get_world_size()
rank = dist.get_rank()
else:
world_size = 1
rank = 0
sub_cache_dirs = []
for _path in tqdm(os.listdir(cache_dir)):
path = os.path.join(cache_dir, _path)
if os.path.isdir(path):
sub_cache_dirs.append(path)
num_dsets = len(sub_cache_dirs)
avg_num = math.ceil(num_dsets / world_size)
start = rank * avg_num
end = min((rank + 1) * avg_num, num_dsets)
desc = f'[Rank {rank}] Loading Cached Dataset'
rank_datasets = []
for ind in tqdm(range(start, end), desc=desc):
dset = init_fn(sub_cache_dirs[ind])
rank_datasets.append(dset)
if dist.is_available() and world_size > 1:
dist.barrier()
buffers = [None] * world_size
dist.all_gather_object(buffers, rank_datasets)
world_datasets = []
for dsets_per_rank in buffers:
world_datasets.extend(dsets_per_rank)
assert len(world_datasets) == num_dsets
else:
world_datasets = rank_datasets
return world_datasets
def _gpu_parallel_load_local_datasets(paths,
file_types,
file_pattern=None,
cache_dir=None,
sample_ratios=1.0,
num_proc=8,
map_fns=None,
init_fns=Dataset.from_list):
if isinstance(paths, str):
paths = [paths]
if isinstance(sample_ratios, (tuple, list)):
if len(sample_ratios) == 1:
sample_ratios = list(sample_ratios) * len(paths)
if len(sample_ratios) != len(paths):
raise RuntimeError(f'There are {len(paths)} paths, but only '
f'{len(sample_ratios)} sample ratios were set.')
if map_fns is None:
map_fns = [None] * len(paths)
if isinstance(map_fns, (tuple, list)):
if len(map_fns) == 1:
map_fns = list(map_fns) * len(paths)
if len(map_fns) != len(paths):
raise RuntimeError(f'There are {len(paths)} paths, but only '
f'{len(map_fns)} map fns were set.')
if init_fns is None:
init_fns = [None] * len(paths)
if isinstance(init_fns, (tuple, list)):
if len(init_fns) == 1:
init_fns = list(init_fns) * len(paths)
if len(init_fns) != len(paths):
raise RuntimeError(f'There are {len(paths)} paths, but only '
f'{len(init_fns)} init fns were set.')
files = []
file_sample_ratios = []
file_map_fns = []
file_init_fns = []
for pid, path in enumerate(paths):
if os.path.isdir(path):
dir_files = []
for root, dirs, _files in os.walk(path, followlinks=True):
dirs.sort()
for relative_path in sorted(_files):
suffix = os.path.splitext(relative_path)[-1]
absolute_path = os.path.join(root, relative_path)
if file_pattern is not None:
if bool(re.match(file_pattern, absolute_path)):
dir_files.append(absolute_path)
elif suffix in file_types:
dir_files.append(absolute_path)
_num_dir_files = len(dir_files)
if _num_dir_files == 0:
raise RuntimeError(
f'There are no files with the suffix {file_types} '
f'in `{path}`.')
logger.info(f'Found {len(dir_files)} files in {path}')
files.extend(dir_files)
file_sample_ratios.extend([sample_ratios[pid]] * _num_dir_files)
file_map_fns.extend([map_fns[pid]] * _num_dir_files)
file_init_fns.extend([init_fns[pid]] * _num_dir_files)
elif os.path.isfile(path):
files.append(path)
file_sample_ratios.append(sample_ratios[pid])
file_map_fns.append(map_fns[pid])
file_init_fns.append(init_fns[pid])
else:
raise RuntimeError(f'`{path}` not found.')
num_files = len(files)
if dist.is_available() and dist.is_initialized():
world_size = dist.get_world_size()
rank = dist.get_rank()
else:
world_size = 1
rank = 0
datasets = []
cached_infos = {}
for i in range(math.ceil(num_files / world_size)):
start = i * world_size
end = min((i + 1) * world_size, num_files)
dset_list = []
for ind in range(start, end):
file = files[ind]
suffix = os.path.splitext(file)[-1]
dset = LOAD_FN_MAP[suffix](file)
map_fn = file_map_fns[ind]
# fixme
# assert map_fn is not None
if map_fn:
num_per_shard = math.ceil(len(dset) / world_size)
shard_start = rank * num_per_shard
shard_end = min((rank + 1) * num_per_shard, len(dset))
dset = dset[shard_start:shard_end]
logger.debug(
f'[File {ind}] Raw Sample:\n{dset[0] if dset else None}')
try:
desc = f'[File {ind}][Shard {rank}] Map'
dset = multi_thread_map(map_fn, dset, desc, num_proc)
logger.debug(f'[File {ind}][Shard {rank}] Mapped Sample:\n'
f'{dset[0] if dset else None}')
except Exception as exc:
raise RuntimeError(
f'[File {ind}][Shard {rank}] Map failed.') from exc
dset_list.append(dset)
for _ in range(end, (i + 1) * world_size):
dset_list.append(None)
dset_list = all_to_all_list(dset_list)
if None in dset_list:
assert start + rank >= num_files
assert all([item is None for item in dset_list])
continue
file_ind = start + rank
assert file_ind < num_files
whole_dset = concatenate_datasets(
[Dataset.from_list(_list) for _list in dset_list])
init_fn = file_init_fns[file_ind]
if init_fn:
whole_dset = init_fn(whole_dset)
# fixme
# assert isinstance(whole_dset, CacheDataset)
if cache_dir and isinstance(whole_dset, CacheDataset):
digits = len(str(abs(num_files)))
cache_id = (f'cache-local-{file_ind+1:0{digits}}-of-'
f'{num_files:0{digits}}')
sub_cache_dir = os.path.join(cache_dir, cache_id)
whole_dset.cache(sub_cache_dir)
infos = {
'path': files[file_ind],
'num_samples': whole_dset.num_samples,
'num_tokens': whole_dset.total_tokens
}
cached_infos[cache_id] = infos
datasets.append(whole_dset)
if dist.is_available() and world_size > 1:
timeout = timedelta(
minutes=int(os.getenv('XTUNER_DATASET_TIMEOUT', default=30)))
if rank == 0:
logger.info(f'The default timeout is {timeout}. The environment '
'variable `XTUNER_DATASET_TIMEOUT` can be adjusted. '
'For example, setting `XTUNER_DATASET_TIMEOUT=120` '
'means the timeout is set to two hours.')
group = dist.new_group(backend='gloo', timeout=timeout)
dist.monitored_barrier(group=group, timeout=timeout)
logger.info('All gather datasets... ')
all_dataset = [None] * world_size
dist.all_gather_object(all_dataset, datasets, group=group)
all_dataset = sum(all_dataset, [])
else:
all_dataset = datasets
return all_dataset
def _cpu_parallel_load_local_datasets(paths,
file_types,
file_pattern=None,
cache_dir=None,
sample_ratios=1.0,
num_proc=8,
map_fns=None,
init_fns=Dataset.from_list):
if isinstance(paths, str):
paths = [paths]
if isinstance(sample_ratios, (tuple, list)):
if len(sample_ratios) == 1:
sample_ratios = list(sample_ratios) * len(paths)
if len(sample_ratios) != len(paths):
raise RuntimeError(f'There are {len(paths)} paths, but only '
f'{len(sample_ratios)} sample ratios were set.')
if map_fns is None:
map_fns = [None] * len(paths)
if isinstance(map_fns, (tuple, list)):
if len(map_fns) == 1:
map_fns = list(map_fns) * len(paths)
if len(map_fns) != len(paths):
raise RuntimeError(f'There are {len(paths)} paths, but only '
f'{len(map_fns)} map fns were set.')
if init_fns is None:
init_fns = [None] * len(paths)
if isinstance(init_fns, (tuple, list)):
if len(init_fns) == 1:
init_fns = list(init_fns) * len(paths)
if len(init_fns) != len(paths):
raise RuntimeError(f'There are {len(paths)} paths, but only '
f'{len(init_fns)} init fns were set.')
files = []
file_sample_ratios = []
file_map_fns = []
file_init_fns = []
for pid, path in enumerate(paths):
if os.path.isdir(path):
dir_files = []
for root, dirs, _files in os.walk(path, followlinks=True):
dirs.sort()
for relative_path in sorted(_files):
suffix = os.path.splitext(relative_path)[-1]
absolute_path = os.path.join(root, relative_path)
if file_pattern is not None:
if bool(re.match(file_pattern, absolute_path)):
dir_files.append(absolute_path)
elif suffix in file_types:
dir_files.append(absolute_path)
_num_dir_files = len(dir_files)
if _num_dir_files == 0:
raise RuntimeError(
f'There are no files with the suffix {file_types} '
f'in `{path}`.')
logger.info(f'Found {len(dir_files)} files in {path}')
files.extend(dir_files)
file_sample_ratios.extend([sample_ratios[pid]] * _num_dir_files)
file_map_fns.extend([map_fns[pid]] * _num_dir_files)
file_init_fns.extend([init_fns[pid]] * _num_dir_files)
elif os.path.isfile(path):
files.append(path)
file_sample_ratios.append(sample_ratios[pid])
file_map_fns.append(map_fns[pid])
file_init_fns.append(init_fns[pid])
else:
raise RuntimeError(f'`{path}` not found.')
num_files = len(files)
if dist.is_available() and dist.is_initialized():
world_size = dist.get_world_size()
rank = dist.get_rank()
# Assign files to ranks greedily by size: largest file first, each one
# going to the rank with the smallest total so far.
file_sizes = []
for file in files:
if os.path.islink(file):
real_path = os.path.realpath(file)
file_sizes.append(os.path.getsize(real_path))
else:
file_sizes.append(os.path.getsize(file))
size_order = sorted(
enumerate(file_sizes), key=lambda x: x[1], reverse=True)
sorted_indices = [ind_and_size[0] for ind_and_size in size_order]
per_rank_files = [[] for _ in range(world_size)]
per_rank_sizes = [0 for _ in range(world_size)]
for ind in sorted_indices:
min_size = min(per_rank_sizes)
target = per_rank_sizes.index(min_size)
per_rank_files[target].append(ind)
per_rank_sizes[target] += file_sizes[ind]
else:
world_size = 1
rank = 0
per_rank_files = [[i for i in range(num_files)]]
if rank == 0:
str_files = '\n\t'.join(files)
logger.debug(f'All files:\n\t{str_files}')
logger.debug(f'Assigned Files: {per_rank_files[rank]}')
rank_datasets = []
rank_cached_infos = {}
for ind in per_rank_files[rank]:
file = files[ind]
suffix = os.path.splitext(file)[-1]
dset = LOAD_FN_MAP[suffix](file)
logger.debug(f'[File {ind}] Raw Sample:\n{dset[0]}')
map_fn = file_map_fns[ind]
if map_fn:
try:
desc = f'[RANK {rank}] Map local file {ind}'
dset = multi_thread_map(map_fn, dset, desc, num_proc)
logger.debug(f'[File {ind}] Mapped Sample:\n{dset[0]}')
except Exception as exc:
raise RuntimeError(f'[RANK {rank}] Map {file} failed.') from exc
init_fn = file_init_fns[ind]
if init_fn:
dset = init_fn(dset)
if cache_dir and isinstance(dset, CacheDataset):
digits = len(str(abs(num_files)))
cache_id = (f'cache-local-{ind+1:0{digits}}-of-'
f'{num_files:0{digits}}')
sub_cache_dir = os.path.join(cache_dir, cache_id)
dset.cache(sub_cache_dir)
infos = {
'path': file,
'num_samples': dset.num_samples,
'num_tokens': dset.total_tokens
}
rank_cached_infos[cache_id] = infos
elif cache_dir and not isinstance(dset, CacheDataset):
dset_cls = dset.__class__.__name__
logger.warning(
f'[File {ind}] {dset_cls} does not support caching.')
rank_datasets.append(dset)
if dist.is_available() and world_size > 1:
logger.info('Waiting for other ranks...... ')
timeout = timedelta(
minutes=int(os.getenv('XTUNER_DATASET_TIMEOUT', default=30)))
if rank == 0:
logger.info(f'The default timeout is {timeout}. The environment '
'variable `XTUNER_DATASET_TIMEOUT` can be adjusted. '
'For example, setting `XTUNER_DATASET_TIMEOUT=120` '
'means the timeout is set to two hours.')
group = dist.new_group(backend='gloo', timeout=timeout)
dist.monitored_barrier(group=group, timeout=timeout)
logger.info('All gather datasets... ')
buffers = [None] * world_size
dist.all_gather_object(buffers, rank_datasets, group=group)
# Restore the dataset order according to the original file order
world_datasets = [None] * num_files
for _rank, per_rank_dsets in enumerate(buffers):
for _dset_ind, _file_ind in enumerate(per_rank_files[_rank]):
_dset = per_rank_dsets[_dset_ind]
world_datasets[_file_ind] = _dset
assert all([dset is not None for dset in world_datasets])
buffers = [None] * world_size
dist.all_gather_object(buffers, rank_cached_infos, group=group)
world_cached_infos = {}
for per_rank_cached_infos in buffers:
world_cached_infos.update(per_rank_cached_infos)
else:
world_datasets = []
for dset in rank_datasets:
world_datasets.append(dset)
world_cached_infos = rank_cached_infos
if cache_dir and rank == 0:
_path = os.path.join(cache_dir, 'local_infos.json')
with open(_path, 'w') as f:
json.dump(world_cached_infos, f)
return world_datasets
def load_local_datasets(paths,
file_types,
file_pattern=None,
cache_dir=None,
sample_ratios=1.0,
num_proc=8,
map_fns=None,
init_fns=None):
_parallel = os.getenv('XTUNER_LOAD_PARALLEL', default='CPU')
if _parallel == 'CPU':
return _cpu_parallel_load_local_datasets(paths, file_types,
file_pattern, cache_dir,
sample_ratios, num_proc,
map_fns, init_fns)
else:
return _gpu_parallel_load_local_datasets(paths, file_types,
file_pattern, cache_dir,
sample_ratios, num_proc,
map_fns, init_fns)
def load_datasets(paths,
sources,
sample_ratios=1.0,
file_types=LOAD_FN_MAP.keys(),
file_pattern=None,
cache_dir=None,
map_fns=None,
init_fns=None,
num_proc=8):
if isinstance(paths, str):
paths = [paths]
num_paths = len(paths)
if isinstance(sample_ratios, (float, int)):
sample_ratios = [sample_ratios] * num_paths
if isinstance(sample_ratios, (tuple, list)):
if len(sample_ratios) == 1:
sample_ratios = list(sample_ratios) * num_paths
if len(sample_ratios) != num_paths:
raise RuntimeError(f'There are {num_paths} paths, but only '
f'{len(sample_ratios)} sample ratios were set.')
if isinstance(sources, str):
sources = [sources]
if isinstance(sources, (tuple, list)):
if len(sources) == 1:
sources = list(sources) * num_paths
if len(sources) != num_paths:
raise RuntimeError(f'There are {num_paths} paths, but only '
f'{len(sources)} sources were set.')
if not isinstance(map_fns, (tuple, list)):
map_fns = [map_fns] * num_paths
if isinstance(map_fns, (tuple, list)):
if len(map_fns) == 1:
map_fns = list(map_fns) * num_paths
if len(map_fns) != num_paths:
raise RuntimeError(f'There are {num_paths} paths, but only '
f'{len(map_fns)} map fns were set.')
if not isinstance(init_fns, (tuple, list)):
init_fns = [init_fns] * num_paths
if isinstance(init_fns, (tuple, list)):
if len(init_fns) == 1:
init_fns = list(init_fns) * num_paths
if len(init_fns) != num_paths:
raise RuntimeError(f'There are {num_paths} paths, but only '
f'{len(init_fns)} init fns were set.')
local_inds = [i for i, src in enumerate(sources) if src == 'local']
local_paths = [paths[ind] for ind in local_inds]
local_map_fns = [map_fns[ind] for ind in local_inds]
local_init_fns = [init_fns[ind] for ind in local_inds]
local_sample_ratios = [sample_ratios[ind] for ind in local_inds]
hf_inds = [i for i, src in enumerate(sources) if src == 'huggingface']
hf_paths = [paths[ind] for ind in hf_inds]
hf_map_fns = [map_fns[ind] for ind in hf_inds]
hf_init_fns = [init_fns[ind] for ind in hf_inds]
hf_sample_ratios = [sample_ratios[ind] for ind in hf_inds]
datasets = []
if len(local_inds):
local_datasets = load_local_datasets(local_paths, file_types,
file_pattern, cache_dir,
local_sample_ratios, num_proc,
local_map_fns, local_init_fns)
datasets.extend(local_datasets)
if len(hf_inds):
cached_infos = {}
for i in range(len(hf_inds)):
if cache_dir:
digits = len(str(abs(len(hf_inds))))
cache_id = (f'cache-hf-{i+1:0{digits}}-of-'
f'{len(hf_inds):0{digits}}')
sub_cache_dir = os.path.join(cache_dir, cache_id)
else:
sub_cache_dir = None
dset = load_hf_dataset(
hf_paths[i],
sample_ratio=hf_sample_ratios[i],
num_proc=num_proc,
map_fn=hf_map_fns[i],
init_fn=hf_init_fns[i],
cache_dir=sub_cache_dir)
datasets.append(dset)
if cache_dir and isinstance(dset, CacheDataset):
infos = {
'path': hf_paths[i],
'num_samples': dset.num_samples,
'num_tokens': dset.total_tokens
}
cached_infos[cache_id] = infos
if cache_dir:
_path = os.path.join(cache_dir, 'hf_infos.json')
with open(_path, 'w') as f:
json.dump(cached_infos, f)
return datasets
@master_only_load
def load_ms_dataset():
pass
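# Illustrative sketch, not part of the original module: the greedy
# size-balanced file assignment used by `_cpu_parallel_load_local_datasets`,
# isolated so it can be run without a distributed setup. The function name
# `assign_files_by_size` and the sizes below are assumptions made for this
# example.

```python
def assign_files_by_size(file_sizes, world_size):
    """Assign file indices to ranks, largest file first."""
    order = sorted(
        range(len(file_sizes)), key=lambda i: file_sizes[i], reverse=True)
    per_rank_files = [[] for _ in range(world_size)]
    per_rank_sizes = [0] * world_size
    for ind in order:
        # Give the next-largest file to the currently lightest rank.
        target = per_rank_sizes.index(min(per_rank_sizes))
        per_rank_files[target].append(ind)
        per_rank_sizes[target] += file_sizes[ind]
    return per_rank_files, per_rank_sizes


files, sizes = assign_files_by_size([10, 7, 5, 3, 2], world_size=2)
print(files, sizes)
```

# Because every rank runs the same deterministic assignment, each rank can
# later map the gathered datasets back to the original file order.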
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/pretrain.py
================================================
from xtuner._lite import get_logger
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX
from .text import SoftPackerForText
logger = get_logger()
class SoftPackerForPretrain(SoftPackerForText):
def __getitem__(self, item):
"""Returns a dict containing packed data in the given item.
Args:
item: An index to retrieve packed data.
Returns:
A dict including packed input_ids, labels, and num_tokens.
"""
if self._cached:
self.load_cache()
dataset = self.dataset
pack_info = self.pack_info
packed_items = pack_info[item]['indices']
assert len(packed_items) > 0
input_ids = []
num_tokens = []
for i in packed_items:
input_ids.extend(dataset[i]['input_ids'])
_num_tokens = dataset[i]['num_tokens']
num_tokens.append(_num_tokens)
if len(input_ids) < self.max_length:
num_pad_tokens = self.max_length - len(input_ids)
input_ids.extend([DEFAULT_PAD_TOKEN_INDEX] * num_pad_tokens)
num_tokens.append(num_pad_tokens)
else:
num_tokens.append(0)
# For pretraining every token is a target, so labels alias input_ids.
labels = input_ids
packed = {
'input_ids': input_ids,
'labels': labels,
'num_tokens': num_tokens,
}
if len(input_ids) != len(labels):
logger.error(f'[packed_items] {packed_items}')
logger.error(f'[input_ids] {input_ids}')
logger.error(f'[labels] {labels}')
logger.error(f'[num_tokens] {num_tokens}')
raise RuntimeError('The lengths of input_ids and labels must be '
f'equal, but found {len(input_ids)} and '
f'{len(labels)}.')
if self.cached:
self._free()
return packed
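# Illustrative sketch, not part of the original module: how a soft pack is
# padded up to `max_length` and how the padding is recorded as a final entry
# in `num_tokens`. `PAD_ID` is a hypothetical stand-in for
# `DEFAULT_PAD_TOKEN_INDEX`, and `pad_pack` is a name chosen for this
# example.

```python
PAD_ID = 0  # hypothetical pad id, not the library constant


def pad_pack(samples, max_length, pad_id=PAD_ID):
    """Concatenate samples, then pad the result to `max_length`."""
    input_ids, num_tokens = [], []
    for ids in samples:
        input_ids.extend(ids)
        num_tokens.append(len(ids))
    # The last entry is the padding length (0 when the pack is already full).
    num_pad = max(max_length - len(input_ids), 0)
    input_ids.extend([pad_id] * num_pad)
    num_tokens.append(num_pad)
    return input_ids, num_tokens


ids, lens = pad_pack([[5, 6, 7], [8, 9]], max_length=8)
print(ids, lens)
```

# Keeping the per-sample lengths lets downstream code rebuild sequence
# boundaries inside the packed row.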
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/text.py
================================================
import bisect
import itertools
import os
import random
import torch
from datasets import Dataset, load_from_disk
from torch import distributed as dist
from torch.nn.utils.rnn import pad_sequence
from xtuner._lite import get_logger
from xtuner._lite.chat import ChatMessages
from xtuner._lite.parallel import get_sp_world_size, pad_for_sequence_parallel
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX
from .cache import CacheDataset
from .format import OPENAI_FORMAT_MAP
logger = get_logger()
def sort_and_return_indices(lst):
return [i[0] for i in sorted(enumerate(lst), key=lambda x: x[1])]
class TextTokenizeFunction():
def __init__(self, tokenizer, chat_template, raw_format='openai'):
self.tokenizer = tokenizer
self.chat_template = chat_template
self.raw_format = raw_format
def __call__(self, item):
formatter = OPENAI_FORMAT_MAP[self.raw_format]
msg = ChatMessages.from_dict(formatter(item))
tokenized = msg.tokenize(self.tokenizer, self.chat_template)
return tokenized
class TextTokenizedDataset(CacheDataset):
def __init__(self, dataset, max_length):
super().__init__()
if isinstance(dataset, list):
dataset = Dataset.from_list(dataset)
self.dataset = dataset
self.num_samples = len(dataset)
self.total_tokens = sum(dataset['num_tokens'])
self.max_length = max_length
self._cached = False
self._cached_dir = None
@property
def cached(self):
return self._cached
@property
def cached_dir(self):
return self._cached_dir
def cache(self, cache_dir):
dset_dir = os.path.join(cache_dir, 'dataset')
if len(self.dataset.cache_files) == 0 and dist.is_available(
) and dist.get_rank() == 0:
self.dataset.save_to_disk(dset_dir)
self._cached_dir = cache_dir
self.dataset = None
self._cached = True
def load_cache(self):
assert self.cached
dset_dir = os.path.join(self.cached_dir, 'dataset')
self.dataset = load_from_disk(dset_dir)
@classmethod
def from_cache(cls, cache_dir, max_length):
dataset = load_from_disk(os.path.join(cache_dir, 'dataset'))
ret = cls(dataset, max_length)
ret.cache(cache_dir)
return ret
def _free(self):
self.dataset = None
def __len__(self):
return self.num_samples
def __getitem__(self, item):
"""Returns the tokenized sample at the given index.
Args:
item: An index into the tokenized dataset.
Returns:
A dict including input_ids, labels, and num_tokens.
"""
if self.cached:
self.load_cache()
data = {
'input_ids': self.dataset[item]['input_ids'],
'labels': self.dataset[item]['labels'],
'num_tokens': [self.dataset[item]['num_tokens']]
}
if self.cached:
self._free()
return data
class TextOnlineTokenizeDataset(torch.utils.data.Dataset):
def __init__(self, dataset, tokenize_fn):
super().__init__()
self.dataset = dataset
self.tokenize_fn = tokenize_fn
def __len__(self):
return len(self.dataset)
def __getitem__(self, item):
"""Tokenizes and returns the raw sample at the given index.
Args:
item: An index into the raw dataset.
Returns:
A dict including input_ids, labels, and num_tokens.
"""
raw_data = self.dataset[item]
tokenized_data = self.tokenize_fn(raw_data)
data = {
'input_ids': tokenized_data['input_ids'],
'labels': tokenized_data['labels'],
'num_tokens': [tokenized_data['num_tokens']]
}
return data
class SoftPackerForText(CacheDataset):
def __init__(self, dataset, max_length=2048, pack_info=None, seed=None):
super().__init__()
self.max_length = max_length
if isinstance(dataset, list):
self.dataset = Dataset.from_list(dataset)
else:
self.dataset = dataset
if pack_info is None:
pack_info = self.get_pack_info(self.dataset, max_length, seed)
self.pack_info = pack_info
# The number of data items after packing
self.num_packed_samples = len(self.pack_info)
self.num_samples = len(self.dataset)
self.total_tokens = sum(self.dataset['num_tokens'])
self._cached = False
self._cached_dir = None
@property
def max_length_per_pack(self):
if self.cached:
pack_info = load_from_disk(self._cached_pack_info)
else:
pack_info = self.pack_info
return pack_info['max_length']
@property
def cached(self):
return self._cached
@property
def cached_dir(self):
return self._cached_dir
def cache(self, cache_dir):
dset_dir = os.path.join(cache_dir, 'dataset')
pack_info_dir = os.path.join(cache_dir,
f'pack-info-soft-{self.max_length}')
if len(self.dataset.cache_files) == 0:
self.dataset.save_to_disk(dset_dir)
if len(self.pack_info.cache_files) == 0:
self.pack_info.save_to_disk(pack_info_dir)
self._cached = True
self._cached_dir = cache_dir
self._cached_dset = dset_dir
self._cached_pack_info = pack_info_dir
self._free()
def load_cache(self):
assert self._cached
dset_dir = os.path.join(self.cached_dir, 'dataset')
pack_info_dir = os.path.join(self.cached_dir,
f'pack-info-soft-{self.max_length}')
self.dataset = load_from_disk(dset_dir)
self.pack_info = load_from_disk(pack_info_dir)
def _free(self):
self.dataset = None
self.pack_info = None
def __len__(self):
return self.num_packed_samples
def __getitem__(self, item):
"""Returns a dict containing packed data in the given item.
Args:
item: An index to retrieve packed data.
Returns:
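A dict including packed input_ids, labels, and num_tokens, where
num_tokens lists the length of each packed sample followed by the
number of padding tokens (0 if no padding was needed).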
"""
if self._cached:
self.load_cache()
dataset = self.dataset
pack_info = self.pack_info
packed_items = pack_info[item]['indices']
assert len(packed_items) > 0
input_ids = []
labels = []
num_tokens = []
for i in packed_items:
ids = dataset[i]['input_ids']
label = dataset[i]['labels']
_num_tokens = dataset[i]['num_tokens']
if len(ids) > self.max_length:
ids = ids[:self.max_length]
label = label[:self.max_length]
_num_tokens = self.max_length
input_ids.extend(ids)
labels.extend(label)
num_tokens.append(_num_tokens)
if len(input_ids) < self.max_length:
num_pad_tokens = self.max_length - len(input_ids)
input_ids.extend([DEFAULT_PAD_TOKEN_INDEX] * num_pad_tokens)
labels.extend([IGNORE_INDEX] * num_pad_tokens)
num_tokens.append(num_pad_tokens)
else:
num_tokens.append(0)
packed = {
'input_ids': input_ids,
'labels': labels,
'num_tokens': num_tokens,
}
if len(input_ids) != len(labels):
logger.error(f'[packed_items] {packed_items}')
logger.error(f'[input_ids] {input_ids}')
logger.error(f'[labels] {labels}')
logger.error(f'[num_tokens] {num_tokens}')
raise RuntimeError('The lengths of input_ids and labels must be '
f'equal, but found {len(input_ids)} and '
f'{len(labels)}.')
if self.cached:
self._free()
return packed
@classmethod
def get_pack_info(cls, dataset, max_length, seed=None):
_ori_lens = dataset['num_tokens']
inds = [i for i in range(len(dataset))]
if seed is not None:
random.seed(seed)
random.shuffle(inds)
item_buffer = []
length_buffer = []
max_length_one_pack = 0
pack_infos = []
for shfl_i in inds:
if _ori_lens[shfl_i] + sum(length_buffer) <= max_length:
item_buffer.append(shfl_i)
length_buffer.append(_ori_lens[shfl_i])
max_length_one_pack = max(max_length_one_pack,
_ori_lens[shfl_i])
else:
if len(item_buffer) > 0:
info = {
'indices': item_buffer,
'max_length': max_length_one_pack
}
pack_infos.append(info)
item_buffer = [shfl_i]
length_buffer = [_ori_lens[shfl_i]]
max_length_one_pack = _ori_lens[shfl_i]
if len(item_buffer) > 0:
info = {'indices': item_buffer, 'max_length': max_length_one_pack}
pack_infos.append(info)
pack_infos = Dataset.from_list(pack_infos)
return pack_infos
@classmethod
def from_cache(cls, cache_dir, max_length, seed=None):
dataset = load_from_disk(os.path.join(cache_dir, 'dataset'))
pack_info_dir = os.path.join(cache_dir, f'pack-info-soft-{max_length}')
if os.path.exists(pack_info_dir):
pack_info = load_from_disk(pack_info_dir)
else:
pack_info = cls.get_pack_info(dataset, max_length, seed)
ret = cls(dataset, max_length, pack_info)
ret.cache(cache_dir)
return ret
class HardPackerForText(SoftPackerForText):
"""The new dataset obtained by concatenating multiple raw data.
Args:
dataset (datasets.Dataset): The tokenized dataset.
max_length (int): The length of each data after concatenation.
pack_info (dict, optional): Precomputed packing info as returned by
`get_pack_info`. It is computed from `dataset` when omitted.
Note:
The original dataset's type must be `datasets.Dataset`, others will be
very slow.
Note:
The data in the original dataset must have the `num_tokens` key,
recording the number of tokens for each piece of data.
"""
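def __init__(self, dataset, max_length=2048, pack_info=None):
super().__init__(dataset, max_length, pack_info)
self.num_packed_samples = self.total_tokens // max_length
# Expose the shuffled ranges and indices that
# `_pack_ids_and_labels_in_range` reads.
self._shfl_item_rngs_left = self.pack_info['ranges_left']
self._shfl_item_rngs_right = self.pack_info['ranges_right']
self.shfl_inds = self.pack_info['indices']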
@classmethod
def _cal_max_length(cls, begin, end, shfl_item_rngs_left,
shfl_item_rngs_right):
left = bisect.bisect(shfl_item_rngs_right, begin)
right = bisect.bisect(shfl_item_rngs_left, end)
max_length = 0
for i in range(left, right):
item_begin = shfl_item_rngs_left[i]
item_end = shfl_item_rngs_right[i]
inner_l = max(begin, item_begin) - item_begin
inner_r = min(end, item_end) - item_begin
trunc_size = inner_r - inner_l
max_length = max(max_length, trunc_size)
return max_length
@classmethod
def get_pack_info(cls, dataset, max_length, seed=None):
_ori_lens = dataset['num_tokens']
# The number of data items after packing
num_packed_samples = sum(_ori_lens) // max_length
# Shuffle the order of the original dataset
# The packing will proceed according to the order after shuffle.
# Assume the following conditions hold:
# (1) shfl_inds = [3, 1, 2, 0]
# (2) self._ori_lens[3] + self._ori_lens[1] = max_length
# (3) self._ori_lens[2] + self._ori_lens[0] = max_length
# Ultimately, dataset[3] and dataset[1] will be combined into a new
# data, and dataset[2] and dataset[0] will be combined into a new data.
inds = [i for i in range(len(dataset))]
if seed is not None:
random.seed(seed)
random.shuffle(inds)
shfl_inds = inds
# shuffled cumulative lengths
shfl_lens = [_ori_lens[i] for i in shfl_inds]
shfl_acc_lens = list(itertools.accumulate(shfl_lens))
shfl_item_rngs_left = [0] + shfl_acc_lens[:-1]
shfl_item_rngs_right = shfl_acc_lens
max_length_per_pack = []
for i in range(num_packed_samples):
begin = i * max_length
end = (i + 1) * max_length
max_length_per_pack.append(
cls._cal_max_length(begin, end, shfl_item_rngs_left,
shfl_item_rngs_right))
return {
'ranges_left': shfl_item_rngs_left,
'ranges_right': shfl_item_rngs_right,
'num_packed_samples': num_packed_samples,
'indices': shfl_inds,
'max_length_per_pack': max_length_per_pack
}
def _pack_ids_and_labels_in_range(self, begin: int, end: int):
"""Packs ids and labels in a given range using bisection method.
Args:
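begin: Token offset in the shuffled, concatenated stream where the
range starts (inclusive).
end: Token offset in the shuffled, concatenated stream where the
range ends (exclusive).
Returns:
A tuple containing the packed ids, the packed labels, and the number
of tokens taken from each original item.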
"""
# Use binary search to find dataset positions that fall within begin
# and end range
left = bisect.bisect(self._shfl_item_rngs_right, begin)
right = bisect.bisect(self._shfl_item_rngs_left, end)
trunc_input_ids = []
trunc_labels = []
trunc_sizes = []
for i in range(left, right):
# Determine the real range we will cut in current original item
item_begin = self._shfl_item_rngs_left[i]
item_end = self._shfl_item_rngs_right[i]
# Calculate exact positions within current dataset item
inner_l = max(begin, item_begin) - item_begin
inner_r = min(end, item_end) - item_begin
# Get original data and labels
ori_idx = self.shfl_inds[i]
ori_input_ids = self.dataset[ori_idx]['input_ids']
ori_labels = self.dataset[ori_idx]['labels']
# Add original data and labels from calculated positions
# to trunc_input_ids and trunc_labels
trunc_input_ids.extend(ori_input_ids[inner_l:inner_r])
trunc_labels.extend(ori_labels[inner_l:inner_r])
trunc_sizes.append(inner_r - inner_l)
# Return the truncated ids and labels, plus the number of tokens
# taken from each original item
return trunc_input_ids, trunc_labels, trunc_sizes
def __len__(self):
return self.num_packed_samples
def __getitem__(self, item):
"""Returns a dict containing packed data in the given item.
Args:
item: An index to retrieve packed data.
Returns:
A dict including packed input_ids, labels, and num_tokens.
"""
# The cumulative length from the start position of this data
begin = item * self.max_length
# The cumulative length from the end position of this data
end = (item + 1) * self.max_length
# Extract data within the range from the shuffled original dataset.
_res = self._pack_ids_and_labels_in_range(begin, end)
packed_input_ids, packed_labels, num_tokens = _res
assert self.max_length == len(packed_input_ids) == len(packed_labels)
packed = {
'input_ids': packed_input_ids,
'labels': packed_labels,
'num_tokens': num_tokens,
}
return packed
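# Illustrative sketch only, not part of the library. It restates the window
# slicing that `HardPackerForText` performs with `bisect`: every original item
# overlapping the token range [begin, end) contributes a slice. The name
# `slice_window` is hypothetical. One deliberate departure from the class above:
# the right boundary uses `bisect_left`, so an item that starts exactly at `end`
# is skipped rather than contributing a zero-length slice.

```python
import bisect
import itertools
from typing import List, Tuple


def slice_window(lengths: List[int], begin: int,
                 end: int) -> List[Tuple[int, int, int]]:
    """Return (item, inner_l, inner_r) for each item overlapping [begin, end)."""
    rights = list(itertools.accumulate(lengths))  # exclusive end offset per item
    lefts = [0] + rights[:-1]                     # start offset per item
    first = bisect.bisect(rights, begin)   # first item ending after `begin`
    last = bisect.bisect_left(lefts, end)  # one past last item starting before `end`
    slices = []
    for i in range(first, last):
        inner_l = max(begin, lefts[i]) - lefts[i]
        inner_r = min(end, rights[i]) - lefts[i]
        slices.append((i, inner_l, inner_r))
    return slices


lengths = [5, 3, 4, 9, 2]
window0 = slice_window(lengths, 0, 8)    # items 0 and 1, taken whole
window1 = slice_window(lengths, 8, 16)   # item 2 whole, first 4 tokens of item 3
```

# Each window covers exactly `end - begin` tokens as long as `end` does not
# exceed the total token count, which is why the hard packer can assert a
# fixed packed length.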
class TextCollator:
def __init__(self, pack_batch=False, force_div_ring=False, ring_size=1, max_pack_len=-1):
self.pack_batch = pack_batch
if force_div_ring:
assert ring_size > 1
self.ring_size = ring_size
self.max_pack_len = max_pack_len
def __call__(self, instances):
pad_index = DEFAULT_PAD_TOKEN_INDEX
input_ids = []
labels = []
num_tokens = []
for data in instances:
input_ids.append(torch.LongTensor(data['input_ids']))
labels.append(torch.LongTensor(data['labels']))
num_tokens.extend(data['num_tokens'])
# attention_mask = [torch.ones_like(ids) for ids in input_ids]
num_tokens = torch.IntTensor(num_tokens)
if len(instances) > 1 and self.pack_batch:
input_ids = torch.cat(input_ids, dim=0).unsqueeze(0)
labels = torch.cat(labels, dim=0).unsqueeze(0)
# attention_mask = torch.cat(attention_mask, dim=0).unsqueeze(0)
elif len(instances) > 1 and not self.pack_batch:
input_ids = pad_sequence(
input_ids, batch_first=True, padding_value=pad_index)
labels = pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX)
# attention_mask = pad_sequence(
# attention_mask, batch_first=True, padding_value=0)
else:
input_ids = torch.stack(input_ids)
labels = torch.stack(labels)
# attention_mask = torch.stack(attention_mask)
if get_sp_world_size() > 1:
ori_seq_len = input_ids.shape[1]
input_ids = pad_for_sequence_parallel(input_ids, pad_index)
labels = pad_for_sequence_parallel(labels, IGNORE_INDEX)
# attention_mask = pad_for_sequence_parallel(attention_mask, 0)
pad_seq_len = input_ids.shape[1] - ori_seq_len
if pad_seq_len > 0:
pad_num_token = torch.tensor([pad_seq_len]).int()
num_tokens = torch.cat([num_tokens, pad_num_token])
if input_ids.shape != labels.shape:
logger.error(f'[instances] {instances}')
logger.error(f'[num_tokens] {num_tokens}')
logger.error(f'[input_ids] {input_ids}')
logger.error(f'[labels] {labels}')
raise RuntimeError('The shape of input_ids and labels must be '
f'equal, but found {input_ids.shape} and '
f'{labels.shape}.')
# TODO support sp
data_dict = {
'input_ids': input_ids,
'labels': labels,
'num_tokens': num_tokens,
# 'attention_mask': attention_mask.bool()
}
return data_dict
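# Illustrative sketch only, not part of the library. It mirrors the greedy
# grouping in `SoftPackerForText.get_pack_info` on a plain list of lengths,
# without shuffling and without `datasets.Dataset`. The name `greedy_pack` is
# hypothetical.

```python
from typing import Dict, List


def greedy_pack(lengths: List[int], max_length: int) -> List[Dict]:
    """Group consecutive indices so each pack holds at most max_length tokens."""
    packs, buffer, used, longest = [], [], 0, 0
    for idx, n in enumerate(lengths):
        if used + n <= max_length:
            buffer.append(idx)
            used += n
            longest = max(longest, n)
        else:
            if buffer:
                packs.append({'indices': buffer, 'max_length': longest})
            # Start a new pack; an over-long sample ends up in a pack of its own
            # and would be truncated later, when the pack is materialized.
            buffer, used, longest = [idx], n, n
    if buffer:
        packs.append({'indices': buffer, 'max_length': longest})
    return packs


packs = greedy_pack([5, 3, 4, 9, 2], max_length=8)
pack_indices = [p['indices'] for p in packs]
```

# Samples are never split across packs here, which is the difference from the
# hard packer: short packs are padded up to `max_length` instead.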
================================================
FILE: xtuner-eval_niah/xtuner/_lite/datasets/tokenize.py
================================================
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/__init__.py
================================================
from .internlm2 import InternLM2Config, InternLM2ForCausalLM
from .llava.modeling_llava import LlavaForConditionalGeneration
from .llava.configuration_llava import EnhancedLlavaConfig
from .llava.processing_llava import LlavaProcessor
def register_remote_code():
from transformers import AutoConfig, AutoModelForCausalLM
AutoConfig.register('internlm2', InternLM2Config, exist_ok=True)
AutoModelForCausalLM.register(
InternLM2Config, InternLM2ForCausalLM, exist_ok=True)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/internlm2/__init__.py
================================================
from .configuration_internlm2 import InternLM2Config
from .modeling_internlm2 import InternLM2ForCausalLM
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/internlm2/configuration_internlm2.py
================================================
# Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on transformers/src/transformers/models/llama/configuration_llama.py
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" InternLM2 model configuration"""
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
INTERNLM2_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
# Modified from transformers.model.llama.configuration_llama.LlamaConfig
class InternLM2Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`InternLM2Model`]. It is used to instantiate
an InternLM2 model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the InternLM2-7B.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 103168):
Vocabulary size of the InternLM2 model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`InternLM2Model`]
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 11008):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (`int`, *optional*):
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if
`num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by mean-pooling all the original heads within that group. For more details, check out [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, it defaults to
`num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. InternLM2 supports up to 32768 tokens.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
pad_token_id (`int`, *optional*, defaults to 0):
Padding token id.
bos_token_id (`int`, *optional*, defaults to 1):
Beginning of stream token id.
eos_token_id (`int`, *optional*, defaults to 2):
End of stream token id.
pretraining_tp (`int`, *optional*, defaults to 1):
Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism)
to understand more about it. This value is necessary to ensure exact reproducibility
of the pretraining results. Please refer to [this
issue](https://github.com/pytorch/pytorch/issues/76232).
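tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie the input and output word embeddings.
bias (`bool`, *optional*, defaults to `True`):
Whether to use a bias in the attention projection layers (`wqkv` and `wo`).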
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
rope_scaling (`Dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
strategies: linear and dynamic. Their scaling factor must be a number greater than or equal to 1. The expected format is
`{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
`max_position_embeddings` to the expected new maximum. See the following thread for more information on how
these scaling strategies behave:
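https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
experimental feature, subject to breaking API changes in future versions.
attn_implementation (`str`, *optional*):
The attention implementation to use. Falls back to `"eager"` when not specified.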
"""
_auto_class = 'AutoConfig'
model_type = 'internlm2'
keys_to_ignore_at_inference = ['past_key_values']
def __init__( # pylint: disable=W0102
self,
vocab_size=103168,
hidden_size=4096,
intermediate_size=11008,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
hidden_act='silu',
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
pretraining_tp=1,
tie_word_embeddings=False,
bias=True,
rope_theta=10000,
rope_scaling=None,
attn_implementation=None,
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.bias = bias
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.pretraining_tp = pretraining_tp
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self._rope_scaling_validation()
self.attn_implementation = attn_implementation
if self.attn_implementation is None:
self.attn_implementation = 'eager'
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
def _rope_scaling_validation(self):
"""
Validate the `rope_scaling` configuration.
"""
if self.rope_scaling is None:
return
if not isinstance(self.rope_scaling,
dict) or len(self.rope_scaling) != 2:
raise ValueError(
'`rope_scaling` must be a dictionary with two fields, `type` and `factor`, '
f'got {self.rope_scaling}')
rope_scaling_type = self.rope_scaling.get('type', None)
rope_scaling_factor = self.rope_scaling.get('factor', None)
if rope_scaling_type is None or rope_scaling_type not in [
'linear', 'dynamic'
]:
raise ValueError(
f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
)
if (rope_scaling_factor is None
or not isinstance(rope_scaling_factor,
(float, int)) or rope_scaling_factor < 1.0):
raise ValueError(
f"`rope_scaling`'s factor field must be a number >= 1, got {rope_scaling_factor} "
f'of type {type(rope_scaling_factor)}')
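# Illustrative sketch only, not part of the library. It restates the checks in
# `_rope_scaling_validation` as a standalone function so the accepted and
# rejected shapes of `rope_scaling` can be tried without `transformers`. The
# names `check_rope_scaling` and `is_valid` are hypothetical.

```python
from typing import Optional


def check_rope_scaling(rope_scaling: Optional[dict]) -> None:
    """Raise ValueError unless rope_scaling is None or a valid two-field dict."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(
            f'expected a dict with `type` and `factor`, got {rope_scaling}')
    kind = rope_scaling.get('type')
    factor = rope_scaling.get('factor')
    if kind not in ('linear', 'dynamic'):
        raise ValueError(f'unsupported scaling type: {kind}')
    if not isinstance(factor, (float, int)) or factor < 1.0:
        raise ValueError(f'factor must be a number >= 1, got {factor}')


def is_valid(rope_scaling: Optional[dict]) -> bool:
    """Convenience wrapper that turns the ValueError into a boolean."""
    try:
        check_rope_scaling(rope_scaling)
        return True
    except ValueError:
        return False


accepted = is_valid({'type': 'linear', 'factor': 2.0})
rejected = is_valid({'type': 'dynamic', 'factor': 0.5})
```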
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/internlm2/modeling_internlm2.py
================================================
# Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on transformers/src/transformers/models/llama/modeling_llama.py
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch InternLM2.5 model."""
import math
import queue
import threading
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from einops import rearrange
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache, StaticCache
from transformers.modeling_attn_mask_utils import AttentionMaskConverter
from transformers.modeling_outputs import (BaseModelOutputWithPast,
CausalLMOutputWithPast,
QuestionAnsweringModelOutput,
SequenceClassifierOutputWithPast,
TokenClassifierOutput)
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.utils import (add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_greater_or_equal_2_10, logging,
replace_return_docstrings)
try:
from transformers.generation.streamers import BaseStreamer
except Exception:
BaseStreamer = None
from .configuration_internlm2 import InternLM2Config
try:
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import (index_first_axis, pad_input,
unpad_input)
except Exception:
pass
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = 'InternLM2Config'
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(
torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)) # pylint: disable=E1102
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
class InternLM2RMSNorm(nn.Module):
"""InternLM2RMSNorm is equivalent to T5LayerNorm."""
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance +
self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
ALL_LAYERNORM_LAYERS.append(InternLM2RMSNorm)
class InternLM2RotaryEmbedding(nn.Module):
"""Rotary Position Embedding for the InternLM2 model. Credits to the Reddit user /u/lucidrains."""
def __init__(self,
dim,
max_position_embeddings=2048,
base=10000,
device=None,
scaling_factor=1.0):
super().__init__()
self.scaling_factor = scaling_factor
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (
self.base
**(torch.arange(0, self.dim, 2,
dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer('inv_freq', inv_freq, persistent=False)
# For BC we register cos and sin cached
self.max_seq_len_cached = max_position_embeddings
@torch.no_grad()
def forward(self, x, position_ids):
# x: [bs, num_attention_heads, seq_len, head_size]
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(
position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
# Force float32 since bfloat16 loses precision on long contexts
# See https://github.com/huggingface/transformers/pull/29285
device_type = x.device.type
device_type = device_type if isinstance(
device_type, str) and device_type != 'mps' else 'cpu'
with torch.autocast(device_type=device_type, enabled=False):
freqs = (inv_freq_expanded.float()
@ position_ids_expanded.float()).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
cos = emb.cos()
sin = emb.sin()
return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
"""InternLM2RotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
def forward(self, x, position_ids):
# difference to the original RoPE: a scaling factor is applied to the position ids
position_ids = position_ids.float() / self.scaling_factor
cos, sin = super().forward(x, position_ids)
return cos, sin
class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
"""InternLM2RotaryEmbedding extended with Dynamic NTK scaling.
Credits to the Reddit users /u/bloc97 and /u/emozilla"""
def forward(self, x, position_ids):
# difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
seq_len = torch.max(position_ids) + 1
if seq_len > self.max_position_embeddings:
base = self.base * ((self.scaling_factor * seq_len /
self.max_position_embeddings) -
(self.scaling_factor - 1))**(
self.dim / (self.dim - 2))
inv_freq = 1.0 / (
base
**(torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(
x.device) / self.dim))
self.register_buffer(
'inv_freq', inv_freq,
persistent=False) # TODO joao: this may break with compilation
cos, sin = super().forward(x, position_ids)
return cos, sin
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1): # pylint: disable=unused-argument
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`, *optional*):
Deprecated and unused.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
class InternLM2MLP(nn.Module):
"""MLP for InternLM2 model."""
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.w1 = nn.Linear(
self.hidden_size, self.intermediate_size, bias=False)
self.w3 = nn.Linear(
self.hidden_size, self.intermediate_size, bias=False)
self.w2 = nn.Linear(
self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.w2(self.act_fn(self.w1(x)) * self.w3(x))
return down_proj
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
class InternLM2Attention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(self,
config: InternLM2Config,
layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f'Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will '
'lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` '
'when creating this class.')
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.is_causal = True
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(
f'hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}'
f' and `num_heads`: {self.num_heads}).')
self.wqkv = nn.Linear(
self.hidden_size,
(self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
bias=config.bias,
)
self.wo = nn.Linear(
self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
self._init_rope()
def _init_rope(self):
if self.config.rope_scaling is None:
self.rotary_emb = InternLM2RotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
else:
scaling_type = self.config.rope_scaling['type']
scaling_factor = self.config.rope_scaling['factor']
if scaling_type == 'linear':
self.rotary_emb = InternLM2LinearScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
elif scaling_type == 'dynamic':
self.rotary_emb = InternLM2DynamicNTKScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
else:
raise ValueError(f'Unknown RoPE scaling type {scaling_type}')
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False, # pylint: disable=unused-argument
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
if self.config.pretraining_tp > 1:
# split qkv_states by tp size
key_value_slicing = (self.num_key_value_heads *
self.head_dim) // self.config.pretraining_tp
qkv_slices = self.wqkv.weight.split(key_value_slicing, dim=0)
qkv_states = torch.cat(
[
F.linear(hidden_states, qkv_slice)
for qkv_slice in qkv_slices
],
dim=-1 # pylint: disable=E1102
)
else:
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states,
'b q h gs d -> b q (h gs) d').transpose(1, 2)
key_states = qkv_states[..., -2, :].transpose(1, 2)
value_states = qkv_states[..., -1, :].transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(
2, 3)) / math.sqrt(self.head_dim)
if attention_mask is not None: # no matter the length, we just slice it
causal_mask = attention_mask[:, :, :, :key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(
attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is'
f' {attn_output.size()}')
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
if self.config.pretraining_tp > 1:
attn_output = attn_output.split(
self.hidden_size // self.config.pretraining_tp, dim=2)
o_proj_slices = self.wo.weight.split(
self.hidden_size // self.config.pretraining_tp, dim=1)
attn_output = sum([
F.linear(attn_output[i], o_proj_slices[i]) # pylint: disable=E1102
for i in range(self.config.pretraining_tp)
])
else:
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
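# Illustrative sketch (not part of the upstream module): how the fused `wqkv` output
# dimension is laid out under the 'b q (h gs d) -> b q h gs d' rearrange used above.
# The helper name `_sketch_fused_qkv_columns` and its arguments are hypothetical; it
# uses plain Python lists so the index arithmetic can be checked without tensors.
def _sketch_fused_qkv_columns(num_kv_heads, num_kv_groups, head_dim):
    """Return (query, key, value) column indices into the fused projection output."""
    gs = 2 + num_kv_groups  # per KV head: `num_kv_groups` query slots, then key, then value
    query_cols, key_cols, value_cols = [], [], []
    for h in range(num_kv_heads):
        for g in range(gs):
            start = (h * gs + g) * head_dim
            cols = list(range(start, start + head_dim))
            if g < num_kv_groups:
                query_cols.append(cols)  # flattened query head index is h * num_kv_groups + g
            elif g == gs - 2:
                key_cols.append(cols)  # second-to-last slot, matching qkv_states[..., -2, :]
            else:
                value_cols.append(cols)  # last slot, matching qkv_states[..., -1, :]
    return query_cols, key_cols, value_cols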
class InternLM2FlashAttention2(InternLM2Attention):
"""
InternLM2 flash attention module. This module inherits from `InternLM2Attention` as the weights of the module stay
untouched. The only required change is in the forward pass, which needs to correctly call the public API of
flash attention and deal with padding tokens in case the input contains any of them.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates a top-left aligned causal mask, while what is needed here is bottom-right alignment,
# which was made the default for flash_attn>=2.1. This attribute is used to handle this difference.
# Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1)
# produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10(
)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if isinstance(past_key_value, StaticCache):
raise ValueError(
'`static` cache implementation is not compatible with `attn_implementation==flash_attention_2`; '
'make sure to use `sdpa` in the meantime, and open an issue at '
'https://github.com/huggingface/transformers')
output_attentions = False
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# TODO: These transpose are quite inefficient but Flash Attention requires the layout
# [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
# to be able to avoid many of these transpose/reshape/view.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# dropout_rate = self.attention_dropout if self.training else 0.0
dropout_rate = 0.0
# In PEFT, the layer norms are usually cast to float32 for training stability,
# so the input hidden states get silently cast to float32. Hence, we need to
# cast them back to the correct dtype to make sure everything works as expected.
# This might slow down training and inference, so it is recommended not to cast the LayerNorms
# to fp32. (InternLM2RMSNorm handles it correctly.)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.wqkv.weight.dtype
logger.warning_once(
f'The input hidden states seem to have been silently cast to float32; this might be related to'
f' the fact that you have upcast the embedding or layer norm layers to float32. We will cast the input back to'
f' {target_dtype}.')
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate)
attn_output = attn_output.reshape(bsz, q_len,
self.hidden_size).contiguous()
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value # pylint: disable=E0606
def _flash_attention_forward(self,
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention output, and finally re-pads that output.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
query_length (`int`):
Length of the query sequence; used to choose the causal flag and to re-pad the output.
dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1.
# For details, please see the comment in InternLM2FlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
query_states, key_states, value_states, attention_mask,
query_length)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func( # pylint: disable=E0606
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size,
query_length) # pylint: disable=E0606
else:
attn_output = flash_attn_func( # pylint: disable=E0606
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal)
return attn_output
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask,
query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(
attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis( # pylint: disable=E0606
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
value_layer = index_first_axis( # pylint: disable=E0606
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis( # pylint: disable=E0606
query_layer.reshape(batch_size * kv_seq_len, self.num_heads,
head_dim), indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input( # pylint: disable=E0606
query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
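# Illustrative sketch (not part of the upstream module): the bookkeeping that the unpadding
# path above relies on, written with plain Python lists. It assumes `_get_unpad_data`
# returns the flat indices of non-padding tokens, the cumulative sequence lengths with a
# leading 0, and the longest sequence length; the helper name below is hypothetical.
def _sketch_unpad_metadata(padding_mask):
    """padding_mask: list of rows with 1 for real tokens and 0 for padding tokens."""
    seq_len = len(padding_mask[0])
    # Positions of real tokens in the flattened (batch_size * seq_len) layout.
    indices = [
        row * seq_len + col
        for row, mask_row in enumerate(padding_mask)
        for col, keep in enumerate(mask_row)
        if keep
    ]
    seqlens = [sum(mask_row) for mask_row in padding_mask]
    # Cumulative lengths mark where each sequence starts and ends once packed.
    cu_seqlens = [0]
    for length in seqlens:
        cu_seqlens.append(cu_seqlens[-1] + length)
    return indices, cu_seqlens, max(seqlens)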
# Copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->InternLM2
class InternLM2SdpaAttention(InternLM2Attention):
"""
InternLM2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
`InternLM2Attention` as the weights of the module stay untouched. The only changes are in the forward pass,
to adapt to the SDPA API.
"""
# Adapted from InternLM2Attention.forward
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if output_attentions:
# TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"`
# once this is implemented.
logger.warning_once(
'InternLM2Model uses InternLM2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` '
'does not support `output_attentions=True`. Falling back to the manual attention implementation, '
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. '
'This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
return super().forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
causal_mask = attention_mask
if attention_mask is not None:
causal_mask = causal_mask[:, :, :, :key_states.shape[-2]]
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with
# custom attn_mask, Reference: https://github.com/pytorch/pytorch/issues/112577.
if query_states.device.type == 'cuda' and causal_mask is not None:
query_states = query_states.contiguous()
key_states = key_states.contiguous()
value_states = value_states.contiguous()
# We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of
# an inline conditional assignment in SDPA to support both torch.compile's dynamic shapes and full graph
# options. An inline conditional prevents dynamic shapes from compiling.
is_causal = bool(causal_mask is None and q_len > 1)
attn_output = torch.nn.functional.scaled_dot_product_attention( # pylint: disable=E1102
query_states,
key_states,
value_states,
attn_mask=causal_mask,
dropout_p=0.0,
is_causal=is_causal,
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(bsz, q_len, self.hidden_size)
attn_output = self.wo(attn_output)
return attn_output, None, past_key_value
INTERNLM2_ATTENTION_CLASSES = {
'eager': InternLM2Attention,
'flash_attention_2': InternLM2FlashAttention2,
'sdpa': InternLM2SdpaAttention,
}
# Modified from transformers.models.llama.modeling_llama.LlamaDecoderLayer with Llama->InternLM2
class InternLM2DecoderLayer(nn.Module):
"""InternLM2 Decoder Layer. This module is a single layer of the InternLM2 model."""
def __init__(self, config: InternLM2Config, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
self.layer_idx = layer_idx
self.attention = INTERNLM2_ATTENTION_CLASSES[
config.attn_implementation](
config=config, layer_idx=layer_idx)
self.feed_forward = InternLM2MLP(config)
self.attention_norm = InternLM2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.ffn_norm = InternLM2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor,
torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*):
attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
query_sequence_length, key_sequence_length)` if default attention is used.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
residual = hidden_states
hidden_states = self.attention_norm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.attention(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.ffn_norm(hidden_states)
hidden_states = self.feed_forward(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states, )
if output_attentions:
outputs += (self_attn_weights, )
if use_cache:
outputs += (present_key_value, )
return outputs
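# Illustrative sketch (not part of the upstream module): the pre-norm residual pattern that
# `InternLM2DecoderLayer.forward` follows, reduced to plain callables. The helper name and
# its parameters are hypothetical, and cache handling and attention outputs are omitted.
def _sketch_pre_norm_block(x, attention, feed_forward, attention_norm, ffn_norm):
    """Apply `x + attention(norm(x))` followed by `x + feed_forward(norm(x))`."""
    x = x + attention(attention_norm(x))  # normalize, attend, add the residual back
    x = x + feed_forward(ffn_norm(x))  # normalize, run the MLP, add the residual back
    return x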
InternLM2_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`InternLM2Config`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
# Copied from transformers.models.llama.modeling_llama.LlamaPreTrainedModel with Llama->InternLM2
@add_start_docstrings(
'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
InternLM2_START_DOCSTRING,
)
class InternLM2PreTrainedModel(PreTrainedModel):
"""
InternLM2 pretrained model's base class.
"""
config_class = InternLM2Config
base_model_prefix = 'model'
supports_gradient_checkpointing = True
_no_split_modules = ['InternLM2DecoderLayer']
_skip_keys_device_placement = ['past_key_values']
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
_supports_quantized_cache = True
_supports_static_cache = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
InternLM2_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of the position of each input sequence token in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
Two formats are allowed:
- a [`~cache_utils.Cache`] instance;
- Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
legacy cache format will be returned.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
the complete sequence length.
"""
# Modified from transformers.models.llama.modeling_llama.LlamaModel with Llama->InternLM2
@add_start_docstrings(
'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
InternLM2_START_DOCSTRING,
)
class InternLM2Model(InternLM2PreTrainedModel):
"""
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is an [`InternLM2DecoderLayer`]
Args:
config: InternLM2Config
"""
_auto_class = 'AutoModel'
def __init__(self, config: InternLM2Config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.config = config
self.tok_embeddings = nn.Embedding(config.vocab_size,
config.hidden_size,
self.padding_idx)
self.layers = nn.ModuleList([
InternLM2DecoderLayer(config, layer_idx)
for layer_idx in range(config.num_hidden_layers)
])
self.norm = InternLM2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.tok_embeddings
def set_input_embeddings(self, value):
self.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
'You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one'
)
if self.gradient_checkpointing and self.training and use_cache:
logger.warning_once(
'`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.'
)
use_cache = False
if inputs_embeds is None:
inputs_embeds = self.tok_embeddings(input_ids)
return_legacy_cache = False
if use_cache and not isinstance(
past_key_values,
Cache): # kept for BC (non `Cache` `past_key_values` inputs)
return_legacy_cache = True
past_key_values = DynamicCache.from_legacy_cache(past_key_values)
if cache_position is None:
past_seen_tokens = past_key_values.get_seq_length(
) if past_key_values is not None else 0
cache_position = torch.arange(
past_seen_tokens,
past_seen_tokens + inputs_embeds.shape[1],
device=inputs_embeds.device)
if position_ids is None:
position_ids = cache_position.unsqueeze(0)
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds,
cache_position, past_key_values,
output_attentions)
# embed positions
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None
for decoder_layer in self.layers:
if output_hidden_states:
all_hidden_states += (hidden_states, )
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
causal_mask,
position_ids,
past_key_values,
output_attentions,
use_cache,
cache_position,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=causal_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[
2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1], )
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states, )
next_cache = next_decoder_cache if use_cache else None
if return_legacy_cache:
next_cache = next_cache.to_legacy_cache()
if not return_dict:
return tuple(
v for v in
[hidden_states, next_cache, all_hidden_states, all_self_attns]
if v is not None)
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
)
def _update_causal_mask(
self,
attention_mask: torch.Tensor,
input_tensor: torch.Tensor,
cache_position: torch.Tensor,
past_key_values: Cache,
output_attentions: bool,
):
# TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length
# even when the static KV cache is used. This is an issue for torch.compile, which then recaptures cudagraphs at
# each decode step due to the dynamic shapes (`recording cudagraph tree for symint key 13`, etc.), which is
# VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using `fullgraph=True`.
# See more context in https://github.com/huggingface/transformers/pull/29114
if self.config.attn_implementation == 'flash_attention_2':
if attention_mask is not None and 0.0 in attention_mask:
return attention_mask
return None
# For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
# order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
# to infer the attention mask.
past_seen_tokens = past_key_values.get_seq_length(
) if past_key_values is not None else 0
using_static_cache = isinstance(past_key_values, StaticCache)
# When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
if self.config.attn_implementation == 'sdpa' and not using_static_cache and not output_attentions:
if AttentionMaskConverter._ignore_causal_mask_sdpa(
attention_mask,
inputs_embeds=input_tensor,
past_key_values_length=past_seen_tokens,
is_training=self.training,
):
return None
dtype, device = input_tensor.dtype, input_tensor.device
min_dtype = torch.finfo(dtype).min
sequence_length = input_tensor.shape[1]
if using_static_cache:
target_length = past_key_values.get_max_length()
else:
target_length = (
attention_mask.shape[-1] if isinstance(
attention_mask, torch.Tensor) else past_seen_tokens +
sequence_length + 1)
if attention_mask is not None and attention_mask.dim() == 4:
# in this case we assume that the mask comes already in inverted form and requires no inversion or slicing
if attention_mask.max() != 0:
raise ValueError(
'Custom 4D attention mask should be passed in inverted form with max==0'
)
causal_mask = attention_mask
else:
causal_mask = torch.full((sequence_length, target_length),
fill_value=min_dtype,
dtype=dtype,
device=device)
if sequence_length != 1:
causal_mask = torch.triu(causal_mask, diagonal=1)
causal_mask *= torch.arange(
target_length, device=device) > cache_position.reshape(-1, 1)
causal_mask = causal_mask[None, None, :, :].expand(
input_tensor.shape[0], 1, -1, -1)
if attention_mask is not None:
causal_mask = causal_mask.clone(
) # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[:, :, :, :
mask_length] + attention_mask[:,
None,
None, :]
padding_mask = padding_mask == 0
causal_mask[:, :, :, :
mask_length] = causal_mask[:, :, :, :
mask_length].masked_fill(
padding_mask,
min_dtype)
if (self.config.attn_implementation == 'sdpa'
and attention_mask is not None
and attention_mask.device.type == 'cuda'
and not output_attentions):
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
causal_mask = AttentionMaskConverter._unmask_unattended(
causal_mask, min_dtype) # pylint: disable=E1120
return causal_mask
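# Illustrative sketch (not part of the upstream module): the additive mask that
# `_update_causal_mask` builds for the non-flash path, for a single batch element and with
# plain Python lists. A key position is blocked when it lies after the query's cache
# position or when the 2D padding mask marks it as padding. The helper name, its parameters
# and the `min_value` stand-in for `torch.finfo(dtype).min` are hypothetical, and the SDPA
# `_unmask_unattended` adjustment is not modelled.
def _sketch_additive_causal_mask(cache_position, target_length, padding_mask=None, min_value=-1e30):
    """Return one row per query position; 0.0 means attend, `min_value` means blocked."""
    rows = []
    for query_pos in cache_position:
        row = []
        for key_pos in range(target_length):
            blocked = key_pos > query_pos  # causal rule: no attention to future keys
            if padding_mask is not None and key_pos < len(padding_mask):
                blocked = blocked or padding_mask[key_pos] == 0  # padding keys are blocked
            row.append(min_value if blocked else 0.0)
        rows.append(row)
    return rows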
# Modified from transformers.models.llama.modeling_llama.LlamaForCausalLM
class InternLM2ForCausalLM(InternLM2PreTrainedModel):
"""Causal language model (CLM) for InternLM2."""
_auto_class = 'AutoModelForCausalLM'
_tied_weights_keys = ['output.weight']
def __init__(self, config):
super().__init__(config)
self.model = InternLM2Model(config)
self.vocab_size = config.vocab_size
self.output = nn.Linear(
config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
def get_output_embeddings(self):
return self.output
def set_output_embeddings(self, new_embeddings):
self.output = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
@replace_return_docstrings(
output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the causal (next-token) language modeling loss. Indices should either be in `[0, ...,
config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
Returns:
Example:
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> # The checkpoint name is illustrative. InternLM2 ships custom modeling code, hence `trust_remote_code=True`.
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> # Decoding returns the prompt followed by the model's continuation.
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
cache_position=cache_position,
)
hidden_states = outputs[0]
if self.config.pretraining_tp > 1:
output_slices = self.output.weight.split(
self.vocab_size // self.config.pretraining_tp, dim=0)
logits = [
F.linear(hidden_states, output_slices[i]) # pylint: disable=not-callable
for i in range(self.config.pretraining_tp)
]
logits = torch.cat(logits, dim=-1)
else:
logits = self.output(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits, ) + outputs[1:]
return (loss, ) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
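# Illustrative sketch (not part of the model code): how the label shift in `forward` above lines up
# predictions and targets. Position t predicts token t + 1, so the last logit row and the first label
# are dropped before the loss is averaged. `shifted_cross_entropy` is a hypothetical pure-Python helper
# written for this note; -100 is assumed as the ignore index, matching the default of `CrossEntropyLoss`.

```python
import math


def shifted_cross_entropy(logits, labels, ignore_index=-100):
    """Mean next-token cross-entropy for a single sequence.

    logits: list of per-position score lists, shape (seq_len, vocab_size)
    labels: list of token ids, shape (seq_len,)
    """
    shift_logits = logits[:-1]  # predictions for positions 0 .. n-2
    shift_labels = labels[1:]   # targets are the tokens at positions 1 .. n-1
    total, count = 0.0, 0
    for scores, target in zip(shift_logits, shift_labels):
        if target == ignore_index:
            continue  # masked targets do not contribute to the mean
        peak = max(scores)
        log_z = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_z - scores[target]  # -log softmax(scores)[target]
        count += 1
    return total / count if count else float("nan")
```

# With uniform logits over a vocabulary of 4 tokens, every counted position contributes log(4).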
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
cache_position=None,
use_cache=True,
**kwargs,
):
past_length = 0
if past_key_values is not None:
if isinstance(past_key_values, Cache):
past_length = cache_position[
0] if cache_position is not None else past_key_values.get_seq_length(
)
max_cache_length = (
torch.tensor(
past_key_values.get_max_length(),
device=input_ids.device)
if past_key_values.get_max_length() is not None else None)
cache_length = past_length if max_cache_length is None else torch.min(
max_cache_length, past_length)
# TODO joao: remove this `else` after `generate` prioritizes `Cache` objects
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as input)
if attention_mask is not None and attention_mask.shape[
1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] -
past_length):]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (max_cache_length is not None and attention_mask is not None
and cache_length + input_ids.shape[1] > max_cache_length):
attention_mask = attention_mask[:, -max_cache_length:] # pylint: disable=E1130
position_ids = kwargs.get('position_ids', None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1]:]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {'inputs_embeds': inputs_embeds}
else:
# The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
# recompiles graphs as the stride of the inputs is a guard.
# Ref: https://github.com/huggingface/transformers/pull/29114
# TODO: use `next_tokens` directly instead.
model_inputs = {'input_ids': input_ids.contiguous()}
input_length = position_ids.shape[
-1] if position_ids is not None else input_ids.shape[-1]
if cache_position is None:
cache_position = torch.arange(
past_length,
past_length + input_length,
device=input_ids.device)
elif use_cache:
cache_position = cache_position[-input_length:]
model_inputs.update({
'position_ids': position_ids,
'cache_position': cache_position,
'past_key_values': past_key_values,
'use_cache': use_cache,
'attention_mask': attention_mask,
})
return model_inputs
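# Illustrative sketch (not part of the model code): the on-the-fly `position_ids` computed in
# `prepare_inputs_for_generation` above, i.e. `attention_mask.cumsum(-1) - 1` with padded slots filled
# with 1. `position_ids_from_mask` is a hypothetical pure-Python stand-in that uses nested lists
# instead of tensors.

```python
def position_ids_from_mask(attention_mask):
    """Mirror `mask.cumsum(-1) - 1` followed by `masked_fill_(mask == 0, 1)`."""
    position_ids = []
    for row in attention_mask:
        running = 0
        ids = []
        for m in row:
            running += m
            # Real tokens get their 0-based rank among attended tokens;
            # padding slots get the placeholder value 1.
            ids.append(running - 1 if m else 1)
        position_ids.append(ids)
    return position_ids
```

# With left padding, the first real token still receives position 0.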
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (tuple(
past_state.index_select(0, beam_idx.to(past_state.device))
for past_state in layer_past), )
return reordered_past
def build_inputs(self,
tokenizer,
query: str,
history: List[Tuple[str, str]] = None,
meta_instruction=''):
if history is None:
history = []
if tokenizer.add_bos_token:
prompt = ''
else:
prompt = tokenizer.bos_token
if meta_instruction:
prompt += f"""<|im_start|>system\n{meta_instruction}<|im_end|>\n"""
for record in history:
prompt += f"""<|im_start|>user\n{record[0]}<|im_end|>\n<|im_start|>assistant\n{record[1]}<|im_end|>\n"""
prompt += f"""<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"""
return tokenizer([prompt], return_tensors='pt')
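# Illustrative sketch (not part of the model code): the prompt string that `build_inputs` above
# assembles before tokenization. `build_prompt` is a hypothetical helper that leaves out the
# tokenizer; `bos_token` stands in for `tokenizer.bos_token` and stays empty when the tokenizer
# adds it itself.

```python
def build_prompt(query, history=None, meta_instruction="", bos_token=""):
    """Assemble the <|im_start|>/<|im_end|> chat prompt as a plain string."""
    prompt = bos_token
    if meta_instruction:
        prompt += f"<|im_start|>system\n{meta_instruction}<|im_end|>\n"
    for user_msg, assistant_msg in history or []:
        prompt += (f"<|im_start|>user\n{user_msg}<|im_end|>\n"
                   f"<|im_start|>assistant\n{assistant_msg}<|im_end|>\n")
    # The prompt ends with an open assistant turn for the model to complete.
    prompt += f"<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"
    return prompt
```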
@torch.no_grad()
def chat(
self,
tokenizer,
query: str,
history: Optional[List[Tuple[str, str]]] = None,
streamer: Optional[BaseStreamer] = None,
max_new_tokens: int = 1024,
do_sample: bool = True,
temperature: float = 0.8,
top_p: float = 0.8,
meta_instruction:
str = 'You are an AI assistant whose name is InternLM (书生·浦语).\n'
'- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory '
'(上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n'
'- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such '
'as English and 中文.',
**kwargs,
):
if history is None:
history = []
inputs = self.build_inputs(tokenizer, query, history, meta_instruction)
inputs = {
k: v.to(self.device)
for k, v in inputs.items() if torch.is_tensor(v)
}
# also add end-of-assistant token in eos token id to avoid unnecessary generation
eos_token_id = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids(['<|im_end|>'])[0]
]
outputs = self.generate(
**inputs,
streamer=streamer,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
top_p=top_p,
eos_token_id=eos_token_id,
**kwargs,
)
outputs = outputs[0].cpu().tolist()[len(inputs['input_ids'][0]):]
response = tokenizer.decode(outputs, skip_special_tokens=True)
response = response.split('<|im_end|>')[0]
history = history + [(query, response)]
return response, history
@torch.no_grad()
def stream_chat(
self,
tokenizer,
query: str,
history: List[Tuple[str, str]] = None,
max_new_tokens: int = 1024,
do_sample: bool = True,
temperature: float = 0.8,
top_p: float = 0.8,
**kwargs,
):
"""
Return a generator that yields `(response, history)` tuples as generation progresses.
E.g.
('你好,有什么可以帮助您的吗', [('你好', '你好,有什么可以帮助您的吗')])
('你好,有什么可以帮助您的吗?', [('你好', '你好,有什么可以帮助您的吗?')])
"""
if history is None:
history = []
if BaseStreamer is None:
raise ModuleNotFoundError(
'The version of `transformers` is too low. Please make sure '
'that you have installed `transformers>=4.28.0`.')
response_queue = queue.Queue(maxsize=20)
class ChatStreamer(BaseStreamer):
"""
Streamer used in generate to print words one by one.
"""
def __init__(self, tokenizer) -> None:
super().__init__()
self.tokenizer = tokenizer
self.queue = response_queue
self.query = query
self.history = history
self.response = ''
self.cache = []
self.received_inputs = False
self.queue.put(
(self.response, history + [(self.query, self.response)]))
def put(self, value):
if len(value.shape) > 1 and value.shape[0] > 1:
raise ValueError('ChatStreamer only supports batch size 1')
elif len(value.shape) > 1:
value = value[0]
if not self.received_inputs:
# The first received value is input_ids, ignore here
self.received_inputs = True
return
self.cache.extend(value.tolist())
token = self.tokenizer.decode(
self.cache, skip_special_tokens=True)
if token.strip() != '<|im_end|>':
self.response = self.response + token
history = self.history + [(self.query, self.response)]
self.queue.put((self.response, history))
self.cache = []
else:
self.end()
def end(self):
self.queue.put(None)
def stream_producer():
return self.chat(
tokenizer=tokenizer,
query=query,
streamer=ChatStreamer(tokenizer=tokenizer),
history=history,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
top_p=top_p,
**kwargs,
)
def consumer():
producer = threading.Thread(target=stream_producer)
producer.start()
while True:
res = response_queue.get()
if res is None:
return
yield res
return consumer()
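# Illustrative sketch (not part of the model code): the producer/consumer pattern used by
# `stream_chat` above, reduced to the standard library. A background thread pushes the growing
# response into a bounded queue, `None` marks the end of the stream, and a generator drains the
# queue. `stream_partial_responses` and `chunks` are hypothetical stand-ins for the streamer and
# for the decoded tokens.

```python
import queue
import threading


def stream_partial_responses(chunks, maxsize=20):
    """Yield the accumulated text after each chunk, produced on a worker thread."""
    response_queue = queue.Queue(maxsize=maxsize)

    def producer():
        response = ""
        for chunk in chunks:
            response += chunk
            response_queue.put(response)  # blocks while the queue is full
        response_queue.put(None)  # sentinel: generation has finished

    def consumer():
        worker = threading.Thread(target=producer)
        worker.start()
        while True:
            item = response_queue.get()
            if item is None:
                break
            yield item
        worker.join()

    return consumer()
```

# Because the queue is bounded, the producer pauses whenever the consumer falls behind.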
# Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a sequence classification head on top (linear layer).
[`InternLM2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
(e.g. GPT-2) do.
Since it does classification on the last token, it needs to know the position of the last token. If a
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
each row of the batch).
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
"""Sequence Classification Head for InternLM2 Model."""
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = InternLM2Model(config)
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
transformer_outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
logits = self.score(hidden_states)
if input_ids is not None:
batch_size = input_ids.shape[0]
else:
batch_size = inputs_embeds.shape[0]
if self.config.pad_token_id is None and batch_size != 1:
raise ValueError(
'Cannot handle batch sizes > 1 if no padding token is defined.'
)
if self.config.pad_token_id is None:
sequence_lengths = -1
else:
if input_ids is not None:
# if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
sequence_lengths = torch.eq(
input_ids, self.config.pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
sequence_lengths = sequence_lengths.to(logits.device)
else:
sequence_lengths = -1
pooled_logits = logits[torch.arange(batch_size, device=logits.device),
sequence_lengths]
loss = None
if labels is not None:
labels = labels.to(logits.device)
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = 'regression'
elif self.num_labels > 1 and (labels.dtype
in (torch.long, torch.int)):
self.config.problem_type = 'single_label_classification'
else:
self.config.problem_type = 'multi_label_classification'
if self.config.problem_type == 'regression':
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == 'single_label_classification':
loss_fct = CrossEntropyLoss()
loss = loss_fct(
pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == 'multi_label_classification':
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(pooled_logits, labels)
if not return_dict:
output = (pooled_logits, ) + transformer_outputs[1:]
return ((loss, ) + output) if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=pooled_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
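# Illustrative sketch (not part of the model code): how `InternLM2ForSequenceClassification.forward`
# above picks the token whose logits are pooled. It takes the index of the first pad token minus one,
# wrapped with a modulo so that a row without any padding maps to its last position.
# `last_token_index` is a hypothetical pure-Python equivalent for a single row.

```python
def last_token_index(input_ids, pad_token_id):
    """Mirror `eq(input_ids, pad).int().argmax(-1) - 1` followed by `% seq_len`."""
    seq_len = len(input_ids)
    # argmax over an all-zero row returns 0, which is emulated here for rows without padding.
    first_pad = input_ids.index(pad_token_id) if pad_token_id in input_ids else 0
    return (first_pad - 1) % seq_len
```

# A row that starts with padding (left padding) also wraps around to the final position.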
# Copied from transformers.models.llama.modeling_llama.LlamaForQuestionAnswering with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a span classification head on top for extractive question-answering tasks like
SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForQuestionAnswering(InternLM2PreTrainedModel):
"""Question Answering model for InternLM2."""
base_model_prefix = 'transformer'
def __init__(self, config):
super().__init__(config)
self.transformer = InternLM2Model(config)
self.qa_outputs = nn.Linear(config.hidden_size, 2)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.transformer.tok_embeddings
def set_input_embeddings(self, value):
self.transformer.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.FloatTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
start_positions: Optional[torch.LongTensor] = None,
end_positions: Optional[torch.LongTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, QuestionAnsweringModelOutput]:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Position outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()
total_loss = None
if start_positions is not None and end_positions is not None:
# On multi-GPU, the split adds an extra dimension; squeeze it away
if len(start_positions.size()) > 1:
start_positions = start_positions.squeeze(-1).to(
start_logits.device)
if len(end_positions.size()) > 1:
end_positions = end_positions.squeeze(-1).to(end_logits.device)
# sometimes the start/end positions are outside our model inputs, we ignore these terms
ignored_index = start_logits.size(1)
start_positions = start_positions.clamp(0, ignored_index)
end_positions = end_positions.clamp(0, ignored_index)
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
if not return_dict:
output = (start_logits, end_logits) + outputs[2:]
return ((total_loss, ) +
output) if total_loss is not None else output
return QuestionAnsweringModelOutput(
loss=total_loss,
start_logits=start_logits,
end_logits=end_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
# Copied from transformers.models.llama.modeling_llama.LlamaForTokenClassification with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a token classification head on top (a linear layer on top of the hidden-states
output) e.g. for Named-Entity-Recognition (NER) tasks.
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForTokenClassification(InternLM2PreTrainedModel):
"""Token classification model for InternLM2."""
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = InternLM2Model(config)
if getattr(config, 'classifier_dropout', None) is not None:
classifier_dropout = config.classifier_dropout
elif getattr(config, 'hidden_dropout', None) is not None:
classifier_dropout = config.hidden_dropout
else:
classifier_dropout = 0.1
self.dropout = nn.Dropout(classifier_dropout)
self.score = nn.Linear(config.hidden_size, config.num_labels)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, TokenClassifierOutput]:
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the token classification loss. Indices should be in `[0, ...,
config.num_labels - 1]`.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
sequence_output = self.dropout(sequence_output)
logits = self.score(sequence_output)
loss = None
if labels is not None:
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
if not return_dict:
output = (logits, ) + outputs[2:]
return ((loss, ) + output) if loss is not None else output
return TokenClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/llava/__init__.py
================================================
from .configuration_llava import EnhancedLlavaConfig
from .modeling_llava import LlavaForConditionalGeneration
from .processing_llava import LlavaProcessor
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/llava/configuration_internlm2.py
================================================
# Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on transformers/src/transformers/models/llama/configuration_llama.py
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" InternLM2 model configuration"""
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
INTERNLM2_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
# Modified from transformers.models.llama.configuration_llama.LlamaConfig
class InternLM2Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`InternLM2Model`]. It is used to instantiate
an InternLM2 model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the InternLM2-7B.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 103168):
Vocabulary size of the InternLM2 model. Defines the number of different tokens that can be represented by the
`input_ids` passed when calling [`InternLM2Model`].
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 11008):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (`int`, *optional*):
The number of key/value heads used to implement Grouped Query Attention (GQA). If
`num_key_value_heads=num_attention_heads`, the model uses Multi-Head Attention (MHA); if
`num_key_value_heads=1`, the model uses Multi-Query Attention (MQA); otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group's key and value head should be constructed
by mean-pooling all the original heads within that group. For more details, see [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If not specified, defaults to
`num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. InternLM2 supports up to 32768 tokens.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
pad_token_id (`int`, *optional*, defaults to 0):
Padding token id.
bos_token_id (`int`, *optional*, defaults to 1):
Beginning of stream token id.
eos_token_id (`int`, *optional*, defaults to 2):
End of stream token id.
pretraining_tp (`int`, *optional*, defaults to 1):
Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
document](https://huggingface.co/docs/transformers/main/perf_train_gpu_many#tensor-parallelism)
to understand more about it. This value is necessary to ensure exact reproducibility
of the pretraining results. Please refer to [this
issue](https://github.com/pytorch/pytorch/issues/76232).
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie the input and output word embeddings.
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
rope_scaling (`Dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
strategies: linear and dynamic. The scaling factor must be a number (`int` or `float`) greater than or equal
to 1. The expected format is `{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
`max_position_embeddings` to the expected new maximum. See the following thread for more information on how
these scaling strategies behave:
https://www.reddit.com/r/LocalLLaMA/comments/14mrgpr/dynamically_scaled_rope_further_increases/. This is an
experimental feature, subject to breaking API changes in future versions.
attn_implementation (`str`, *optional*):
The attention implementation to use. If `None`, it falls back to `"eager"`.
"""
_auto_class = 'AutoConfig'
model_type = 'internlm2'
keys_to_ignore_at_inference = ['past_key_values']
def __init__( # pylint: disable=W0102
self,
vocab_size=103168,
hidden_size=4096,
intermediate_size=11008,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
hidden_act='silu',
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
pretraining_tp=1,
tie_word_embeddings=False,
bias=True,
rope_theta=10000,
rope_scaling=None,
attn_implementation=None,
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.bias = bias
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.pretraining_tp = pretraining_tp
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self._rope_scaling_validation()
self.attn_implementation = attn_implementation
if self.attn_implementation is None:
self.attn_implementation = 'eager'
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
def _rope_scaling_validation(self):
"""
Validate the `rope_scaling` configuration.
"""
if self.rope_scaling is None:
return
if not isinstance(self.rope_scaling,
dict) or len(self.rope_scaling) != 2:
raise ValueError(
'`rope_scaling` must be a dictionary with two fields, `type` and `factor`, '
f'got {self.rope_scaling}')
rope_scaling_type = self.rope_scaling.get('type', None)
rope_scaling_factor = self.rope_scaling.get('factor', None)
if rope_scaling_type is None or rope_scaling_type not in [
'linear', 'dynamic'
]:
raise ValueError(
f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
)
if (rope_scaling_factor is None
or not isinstance(rope_scaling_factor,
(float, int)) or rope_scaling_factor < 1.0):
raise ValueError(
f"`rope_scaling`'s factor field must be a number >= 1, got {rope_scaling_factor} "
f'of type {type(rope_scaling_factor)}')
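# Illustrative sketch (not part of the configuration code): which `rope_scaling` values pass the
# checks in `_rope_scaling_validation` above. `is_valid_rope_scaling` is a hypothetical helper that
# returns a boolean instead of raising `ValueError`; it follows the same three checks.

```python
def is_valid_rope_scaling(rope_scaling):
    """Return True when `rope_scaling` would pass the validation rules."""
    if rope_scaling is None:
        return True  # scaling is optional
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        return False  # exactly two fields are expected: `type` and `factor`
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        return False
    if factor is None or not isinstance(factor, (float, int)) or factor < 1.0:
        return False
    return True
```

# For example, `{"type": "linear", "factor": 2.0}` is accepted, while a factor below 1 is rejected.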
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/llava/configuration_llava.py
================================================
# coding=utf-8
# Copyright 2023 Microsoft Research & University of Wisconsin-Madison and the HuggingFace Inc. team. All rights reserved.
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Llava model configuration"""
import os
from typing import Union
from transformers.configuration_utils import PretrainedConfig, custom_object_save
from transformers.utils import logging
from transformers import CONFIG_MAPPING, AutoModelForCausalLM, AutoConfig
logger = logging.get_logger(__name__)
class EnhancedLlavaConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`LlavaForConditionalGeneration`]. It is used to instantiate an
Llava model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the Llava-9B.
e.g. [llava-hf/llava-9b](https://huggingface.co/llava-hf/llava-9b)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vision_config (`Union[AutoConfig, dict]`, *optional*, defaults to `CLIPVisionConfig`):
The config object or dictionary of the vision backbone.
text_config (`Union[AutoConfig, dict]`, *optional*, defaults to `LlamaConfig`):
The config object or dictionary of the text backbone.
ignore_index (`int`, *optional*, defaults to -100):
The ignore index for the loss function.
image_token_index (`int`, *optional*, defaults to 32000):
The image token index to encode the image prompt.
projector_hidden_act (`str`, *optional*, defaults to `"gelu"`):
The activation function used by the multimodal projector.
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
The feature selection strategy used to select the vision feature from the vision backbone.
Can be one of `"default"` or `"full"`.
vision_feature_layer (`int`, *optional*, defaults to -2):
The index of the layer to select the vision feature.
Example:
```python
>>> from transformers import LlavaForConditionalGeneration, LlavaConfig, CLIPVisionConfig, LlamaConfig
>>> # Initializing a CLIP-vision config
>>> vision_config = CLIPVisionConfig()
>>> # Initializing a Llama config
>>> text_config = LlamaConfig()
>>> # Initializing a Llava llava-1.5-7b style configuration
>>> configuration = LlavaConfig(vision_config, text_config)
>>> # Initializing a model from the llava-1.5-7b style configuration
>>> model = LlavaForConditionalGeneration(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
_auto_class = 'AutoConfig'
model_type = "enhanced_llava"
is_composition = False
def __init__(
self,
vision_config=None,
text_config=None,
ignore_index=-100,
image_token_index=32000,
projector_hidden_act="gelu",
vision_feature_select_strategy="default",
vision_feature_layer=-2,
**kwargs,
):
self.ignore_index = ignore_index
self.image_token_index = image_token_index
self.projector_hidden_act = projector_hidden_act
if vision_feature_select_strategy not in ["default", "full"]:
raise ValueError(
"vision_feature_select_strategy should be one of 'default', 'full'. "
f"Got: {vision_feature_select_strategy}"
)
self.vision_feature_select_strategy = vision_feature_select_strategy
self.vision_feature_layer = vision_feature_layer
if isinstance(vision_config, dict):
vision_config["model_type"] = (
vision_config["model_type"] if "model_type" in vision_config else "clip_vision_model"
)
vision_config = CONFIG_MAPPING[vision_config["model_type"]](**vision_config)
elif vision_config is None:
vision_config = CONFIG_MAPPING["clip_vision_model"](
intermediate_size=4096,
hidden_size=1024,
patch_size=14,
image_size=336,
num_hidden_layers=24,
num_attention_heads=16,
vocab_size=32000,
projection_dim=768,
)
self.vision_config = vision_config
if isinstance(text_config, dict):
text_config["model_type"] = text_config["model_type"] if "model_type" in text_config else "llama"
if text_config["model_type"] == 'internlm2':
from .configuration_internlm2 import InternLM2Config
from .modeling_internlm2 import InternLM2ForCausalLM
AutoConfig.register('internlm2', InternLM2Config)
AutoModelForCausalLM.register(
InternLM2Config, InternLM2ForCausalLM)
text_config['auto_map']['AutoConfig'] = 'configuration_internlm2.InternLM2Config'
text_config['auto_map']['AutoModel'] = 'modeling_internlm2.InternLM2ForCausalLM'
text_config['auto_map']['AutoModelForCausalLM'] = 'modeling_internlm2.InternLM2ForCausalLM'
text_config = InternLM2Config(**text_config)
else:
text_config = CONFIG_MAPPING[text_config["model_type"]](**text_config)
elif text_config is None:
text_config = CONFIG_MAPPING["llama"]()
self.text_config = text_config
super().__init__(**kwargs)
def save_pretrained(self, save_directory: Union[str, os.PathLike], push_to_hub: bool = False, **kwargs):
"""
Save a configuration object to the directory `save_directory`, so that it can be re-loaded using the
[`~PretrainedConfig.from_pretrained`] class method.
Args:
save_directory (`str` or `os.PathLike`):
Directory where the configuration JSON file will be saved (will be created if it does not exist).
push_to_hub (`bool`, *optional*, defaults to `False`):
Whether or not to push your model to the Hugging Face model hub after saving it. You can specify the
repository you want to push to with `repo_id` (will default to the name of `save_directory` in your
namespace).
kwargs (`Dict[str, Any]`, *optional*):
Additional key word arguments passed along to the [`~utils.PushToHubMixin.push_to_hub`] method.
"""
super().save_pretrained(save_directory, push_to_hub, **kwargs)
if self.text_config._auto_class is not None:
custom_object_save(self.text_config, save_directory, config=self.text_config)
AutoConfig.register('enhanced_llava', EnhancedLlavaConfig, exist_ok=True)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/llava/modeling_internlm2.py
================================================
# Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on transformers/src/transformers/models/llama/modeling_llama.py
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch InternLM2.5 model."""
import math
import queue
import threading
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from einops import rearrange
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache, StaticCache
from transformers.modeling_attn_mask_utils import AttentionMaskConverter
from transformers.modeling_outputs import (BaseModelOutputWithPast,
CausalLMOutputWithPast,
QuestionAnsweringModelOutput,
SequenceClassifierOutputWithPast,
TokenClassifierOutput)
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import ALL_LAYERNORM_LAYERS
from transformers.utils import (add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_greater_or_equal_2_10, logging,
replace_return_docstrings)
try:
from transformers.generation.streamers import BaseStreamer
except Exception:
BaseStreamer = None
from .configuration_internlm2 import InternLM2Config
try:
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import (index_first_axis, pad_input,
unpad_input)
except Exception:  # flash_attn is optional; the eager and SDPA attention paths work without it
pass
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = 'InternLM2Config'
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(
torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0)) # pylint: disable=E1102
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
class InternLM2RMSNorm(nn.Module):
"""InternLM2RMSNorm is equivalent to T5LayerNorm."""
def __init__(self, hidden_size, eps=1e-6):
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance +
self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
ALL_LAYERNORM_LAYERS.append(InternLM2RMSNorm)
class InternLM2RotaryEmbedding(nn.Module):
"""Rotary Position Embedding for the InternLM2 model. Credits to the Reddit user /u/lucidrains."""
def __init__(self,
dim,
max_position_embeddings=2048,
base=10000,
device=None,
scaling_factor=1.0):
super().__init__()
self.scaling_factor = scaling_factor
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (
self.base
**(torch.arange(0, self.dim, 2,
dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer('inv_freq', inv_freq, persistent=False)
# Kept for backward compatibility; cos and sin are recomputed on every forward call rather than cached
self.max_seq_len_cached = max_position_embeddings
@torch.no_grad()
def forward(self, x, position_ids):
# x: [bs, num_attention_heads, seq_len, head_size]
inv_freq_expanded = self.inv_freq[None, :, None].float().expand(
position_ids.shape[0], -1, 1)
position_ids_expanded = position_ids[:, None, :].float()
# Force float32 since bfloat16 loses precision on long contexts
# See https://github.com/huggingface/transformers/pull/29285
device_type = x.device.type
device_type = device_type if isinstance(
device_type, str) and device_type != 'mps' else 'cpu'
with torch.autocast(device_type=device_type, enabled=False):
freqs = (inv_freq_expanded.float()
@ position_ids_expanded.float()).transpose(1, 2)
emb = torch.cat((freqs, freqs), dim=-1)
cos = emb.cos()
sin = emb.sin()
return cos.to(dtype=x.dtype), sin.to(dtype=x.dtype)
class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
"""InternLM2RotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
def forward(self, x, position_ids):
# difference to the original RoPE: a scaling factor is applied to the position ids
position_ids = position_ids.float() / self.scaling_factor
cos, sin = super().forward(x, position_ids)
return cos, sin
class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
"""InternLM2RotaryEmbedding extended with Dynamic NTK scaling.
Credits to the Reddit users /u/bloc97 and /u/emozilla"""
def forward(self, x, position_ids):
# difference to the original RoPE: inv_freq is recomputed when the sequence length > original length
seq_len = torch.max(position_ids) + 1
if seq_len > self.max_position_embeddings:
base = self.base * ((self.scaling_factor * seq_len /
self.max_position_embeddings) -
(self.scaling_factor - 1))**(
self.dim / (self.dim - 2))
inv_freq = 1.0 / (
base
**(torch.arange(0, self.dim, 2, dtype=torch.int64).float().to(
x.device) / self.dim))
self.register_buffer(
'inv_freq', inv_freq,
persistent=False) # TODO joao: this may break with compilation
cos, sin = super().forward(x, position_ids)
return cos, sin
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1): # pylint: disable=unused-argument
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`, *optional*):
Deprecated and unused.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
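Example (a dependency-free sketch in plain Python rather than torch; `rope_angles`, `rotate` and `dot` are
illustrative helpers that are not part of this module, and base 10000 is simply the default of the rotary
embedding class above). It applies the same `x * cos + rotate_half(x) * sin` formula to a single head vector and
checks two properties expected of RoPE: the rotation preserves the vector norm, and the query-key score depends
only on the offset between the two positions.

```python
import math

def rotate_half(x):
    # Mirror of this module's rotate_half for a flat list: (x1, x2) -> (-x2, x1).
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]

def rope_angles(position, dim, base=10000.0):
    # inv_freq[i] = 1 / base**(2i / dim); the angles are duplicated for both halves.
    freqs = [position / (base ** (2 * i / dim)) for i in range(dim // 2)]
    return freqs + freqs

def rotate(x, position):
    angles = rope_angles(position, len(x))
    rot = rotate_half(x)
    return [v * math.cos(a) + r * math.sin(a) for v, r, a in zip(x, rot, angles)]

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]
k = [1.5, 0.5, -0.75, 1.0]

# Rotation leaves the norm unchanged.
norm_before = math.sqrt(dot(q, q))
norm_after = math.sqrt(dot(rotate(q, 7), rotate(q, 7)))

# Scores depend only on the offset between query and key positions (3 in both cases).
score_a = dot(rotate(q, 5), rotate(k, 2))
score_b = dot(rotate(q, 105), rotate(k, 102))
```

`norm_before` and `norm_after`, and likewise `score_a` and `score_b`, agree up to floating-point rounding.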
"""
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
class InternLM2MLP(nn.Module):
"""MLP for InternLM2 model."""
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.w1 = nn.Linear(
self.hidden_size, self.intermediate_size, bias=False)
self.w3 = nn.Linear(
self.hidden_size, self.intermediate_size, bias=False)
self.w2 = nn.Linear(
self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.w2(self.act_fn(self.w1(x)) * self.w3(x))
return down_proj
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
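Example (a plain-Python sketch on lists, not the tensor code path; `repeat_kv_lists` is an illustrative name).
Each key/value head is repeated `n_rep` times in place, so the copies of one head stay adjacent. This is the
`repeat_interleave` ordering rather than a tiled one, and it means query head `i` reads key/value head
`i // n_rep`.

```python
def repeat_kv_lists(heads, n_rep):
    # heads: list over num_key_value_heads; each entry stands for one (seqlen, head_dim) block.
    return [head for head in heads for _ in range(n_rep)]

kv_heads = ['kv0', 'kv1']
n_rep = 3
expanded = repeat_kv_lists(kv_heads, n_rep)
# expanded == ['kv0', 'kv0', 'kv0', 'kv1', 'kv1', 'kv1']

# Query head i is paired with key/value head i // n_rep.
pairing_ok = all(expanded[i] == kv_heads[i // n_rep] for i in range(len(expanded)))
```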
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
class InternLM2Attention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(self,
config: InternLM2Config,
layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f'Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will '
'lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` '
'when creating this class.')
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.is_causal = True
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(
f'hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}'
f' and `num_heads`: {self.num_heads}).')
self.wqkv = nn.Linear(
self.hidden_size,
(self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
bias=config.bias,
)
self.wo = nn.Linear(
self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
self._init_rope()
def _init_rope(self):
if self.config.rope_scaling is None:
self.rotary_emb = InternLM2RotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
else:
scaling_type = self.config.rope_scaling['type']
scaling_factor = self.config.rope_scaling['factor']
if scaling_type == 'linear':
self.rotary_emb = InternLM2LinearScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
elif scaling_type == 'dynamic':
self.rotary_emb = InternLM2DynamicNTKScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
else:
raise ValueError(f'Unknown RoPE scaling type {scaling_type}')
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False, # pylint: disable=unused-argument
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
if self.config.pretraining_tp > 1:
# split qkv_states by tp size
key_value_slicing = (self.num_key_value_heads *
self.head_dim) // self.config.pretraining_tp
qkv_slices = self.wqkv.weight.split(key_value_slicing, dim=0)
qkv_states = torch.cat(
[
F.linear(hidden_states, qkv_slice)
for qkv_slice in qkv_slices
],
dim=-1 # pylint: disable=E1102
)
else:
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states,
'b q h gs d -> b q (h gs) d').transpose(1, 2)
key_states = qkv_states[..., -2, :].transpose(1, 2)
value_states = qkv_states[..., -1, :].transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(
2, 3)) / math.sqrt(self.head_dim)
if attention_mask is not None: # no matter the length, we just slice it
causal_mask = attention_mask[:, :, :, :key_states.shape[-2]]
attn_weights = attn_weights + causal_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(
attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is'
f' {attn_output.size()}')
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
if self.config.pretraining_tp > 1:
attn_output = attn_output.split(
self.hidden_size // self.config.pretraining_tp, dim=2)
o_proj_slices = self.wo.weight.split(
self.hidden_size // self.config.pretraining_tp, dim=1)
attn_output = sum([
F.linear(attn_output[i], o_proj_slices[i]) # pylint: disable=E1102
for i in range(self.config.pretraining_tp)
])
else:
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
class InternLM2FlashAttention2(InternLM2Attention):
"""
InternLM2 flash attention module. This module inherits from `InternLM2Attention` as the weights of the module stay
untouched. The only required change is in the forward pass, which must correctly call the public API of
flash attention and deal with padding tokens in case the input contains any of them.
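Example of the padding bookkeeping (a plain-Python sketch of what `_get_unpad_data` derives from a
`(batch_size, seq_len)` mask; `unpad_metadata` is an illustrative name, not part of this module). The
variable-length flash attention call receives the real tokens packed into one flat sequence, together with
cumulative sequence lengths that mark where each sample starts and ends.

```python
from itertools import accumulate

def unpad_metadata(attention_mask):
    # attention_mask: list of rows, 1 for real tokens and 0 for padding.
    seqlens = [sum(row) for row in attention_mask]
    flat = [value for row in attention_mask for value in row]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    indices = [i for i, value in enumerate(flat) if value]
    # Prefix sums with a leading 0: sample b occupies cu_seqlens[b]:cu_seqlens[b + 1].
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)

mask = [
    [0, 0, 1, 1],  # left-padded sample with 2 real tokens
    [1, 1, 1, 1],  # full-length sample
]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
# indices == [2, 3, 4, 5, 6, 7], cu_seqlens == [0, 2, 6], max_seqlen == 4
```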
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment,
# that was made default for flash_attn>=2.1. This attribute is used to handle this difference.
# Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1)
# produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10(
)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if isinstance(past_key_value, StaticCache):
raise ValueError(
'`static` cache implementation is not compatible with `attn_implementation==flash_attention_2`; '
'make sure to use `sdpa` in the meantime, and open an issue at '
'https://github.com/huggingface/transformers')
output_attentions = False
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# TODO: These transpose are quite inefficient but Flash Attention requires the layout
# [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
# to be able to avoid many of these transpose/reshape/view.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# dropout_rate = self.attention_dropout if self.training else 0.0
dropout_rate = 0.0
# In PEFT, the layer norms are usually cast to float32 for training stability,
# so the input hidden states are silently upcast to float32. Cast them back to
# the expected dtype here to make sure everything works as intended.
# This might slow down training and inference, so casting the LayerNorms
# to fp32 is not recommended. (InternLM2RMSNorm handles it correctly.)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.wqkv.weight.dtype
logger.warning_once(
'The input hidden states seem to have been silently cast to float32; this might be because'
' embedding or layer norm layers were upcast to float32. The input will be cast back to'
f' {target_dtype}.')
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate)
attn_output = attn_output.reshape(bsz, q_len,
self.hidden_size).contiguous()
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value # pylint: disable=E0606
def _flash_attention_forward(self,
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
the input is first unpadded, the attention output is computed, and the result is padded back to the original shape.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
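position of padding tokens and 1 for the position of non-padding tokens.
query_length (`int`):
Length of the query sequence, used to unpad the queries and to pad the output back.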
dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
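Example of the mask-alignment issue behind the `query_length != 1` special case below (a plain-Python sketch;
`causal_mask` is an illustrative helper, and the two alignments follow the description in the comment in
`__init__`). With a KV cache the query is shorter than the key sequence, and only a bottom-right aligned mask
lets the newest query token see every cached key. With a top-left aligned kernel, turning `causal` off for
single-token queries gives the same full visibility.

```python
def causal_mask(q_len, k_len, bottom_right=True):
    # Entry [i][j] is 1 when query i may attend to key j.
    offset = k_len - q_len if bottom_right else 0
    return [[1 if j <= i + offset else 0 for j in range(k_len)] for i in range(q_len)]

# Decoding one new token against 4 cached keys.
bottom_right = causal_mask(1, 4, bottom_right=True)   # [[1, 1, 1, 1]]
top_left = causal_mask(1, 4, bottom_right=False)      # [[1, 0, 0, 0]]

# When query and key lengths match, the two alignments coincide.
same_when_square = causal_mask(3, 3, True) == causal_mask(3, 3, False)
```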
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1.
# For details, please see the comment in InternLM2FlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
query_states, key_states, value_states, attention_mask,
query_length)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func( # pylint: disable=E0606
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size,
query_length) # pylint: disable=E0606
else:
attn_output = flash_attn_func( # pylint: disable=E0606
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal)
return attn_output
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask,
query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(
attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis( # pylint: disable=E0606
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
value_layer = index_first_axis( # pylint: disable=E0606
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis( # pylint: disable=E0606
query_layer.reshape(batch_size * kv_seq_len, self.num_heads,
head_dim), indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input( # pylint: disable=E0606
query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
# Copied from transformers.models.llama.modeling_llama.LlamaSdpaAttention with Llama->InternLM2
class InternLM2SdpaAttention(InternLM2Attention):
"""
InternLM2 attention module using torch.nn.functional.scaled_dot_product_attention. This module inherits from
`InternLM2Attention` as the weights of the module stay untouched. The only changes are in the forward pass
to adapt to SDPA API.
"""
# Adapted from InternLM2Attention.forward
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if output_attentions:
# TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"`
# once this is implemented.
logger.warning_once(
'InternLM2Model uses InternLM2SdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` '
'does not support `output_attentions=True`. Falling back to the manual attention implementation, '
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. '
'This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
return super().forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models; cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
causal_mask = attention_mask
if attention_mask is not None:
causal_mask = causal_mask[:, :, :, :key_states.shape[-2]]
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with
# custom attn_mask, Reference: https://github.com/pytorch/pytorch/issues/112577.
if query_states.device.type == 'cuda' and causal_mask is not None:
query_states = query_states.contiguous()
key_states = key_states.contiguous()
value_states = value_states.contiguous()
# We dispatch to SDPA's Flash Attention or Efficient kernels via this `is_causal` if statement instead of
# an inline conditional assignment in SDPA to support both torch.compile's dynamic shapes and full graph
# options. An inline conditional prevents dynamic shapes from compiling.
is_causal = bool(causal_mask is None and q_len > 1)
attn_output = torch.nn.functional.scaled_dot_product_attention( # pylint: disable=E1102
query_states,
key_states,
value_states,
attn_mask=causal_mask,
dropout_p=0.0,
is_causal=is_causal,
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(bsz, q_len, self.hidden_size)
attn_output = self.wo(attn_output)
return attn_output, None, past_key_value
INTERNLM2_ATTENTION_CLASSES = {
'eager': InternLM2Attention,
'flash_attention_2': InternLM2FlashAttention2,
'sdpa': InternLM2SdpaAttention,
}
# Modified from transformers.models.llama.modeling_llama.LlamaDecoderLayer with Llama->InternLM2
class InternLM2DecoderLayer(nn.Module):
"""InternLM2 Decoder Layer. This module is a single layer of the InternLM2 model."""
def __init__(self, config: InternLM2Config, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
self.layer_idx = layer_idx
self.attention = INTERNLM2_ATTENTION_CLASSES[
config.attn_implementation](
config=config, layer_idx=layer_idx)
self.feed_forward = InternLM2MLP(config)
self.attention_norm = InternLM2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.ffn_norm = InternLM2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor,
torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*):
attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
query_sequence_length, key_sequence_length)` if default attention is used.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Cache`, *optional*): cached past key and value projection states
"""
residual = hidden_states
hidden_states = self.attention_norm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.attention(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.ffn_norm(hidden_states)
hidden_states = self.feed_forward(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states, )
if output_attentions:
outputs += (self_attn_weights, )
if use_cache:
outputs += (present_key_value, )
return outputs
InternLM2_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`InternLM2Config`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
# Copied from transformers.models.llama.modeling_llama.LlamaPreTrainedModel with Llama->InternLM2
@add_start_docstrings(
'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
InternLM2_START_DOCSTRING,
)
class InternLM2PreTrainedModel(PreTrainedModel):
"""
InternLM2 pretrained model's base class.
"""
config_class = InternLM2Config
base_model_prefix = 'model'
supports_gradient_checkpointing = True
_no_split_modules = ['InternLM2DecoderLayer']
_skip_keys_device_placement = ['past_key_values']
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
_supports_quantized_cache = True
_supports_static_cache = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
InternLM2_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
Two formats are allowed:
- a [`~cache_utils.Cache`] instance;
- Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`. This is also known as the legacy
cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
legacy cache format will be returned.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
Indices depicting the position of the input sequence tokens in the sequence. Contrary to `position_ids`,
this tensor is not affected by padding. It is used to update the cache in the correct position and to infer
the complete sequence length.
"""
# Modified from transformers.models.llama.modeling_llama.LlamaModel with Llama->InternLM2
@add_start_docstrings(
'The bare InternLM2 Model outputting raw hidden-states without any specific head on top.',
InternLM2_START_DOCSTRING,
)
class InternLM2Model(InternLM2PreTrainedModel):
"""
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is an [`InternLM2DecoderLayer`].
Args:
config: InternLM2Config
"""
_auto_class = 'AutoModel'
def __init__(self, config: InternLM2Config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.config = config
self.tok_embeddings = nn.Embedding(config.vocab_size,
config.hidden_size,
self.padding_idx)
self.layers = nn.ModuleList([
InternLM2DecoderLayer(config, layer_idx)
for layer_idx in range(config.num_hidden_layers)
])
self.norm = InternLM2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.tok_embeddings
def set_input_embeddings(self, value):
self.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if (input_ids is None) ^ (inputs_embeds is not None):
raise ValueError(
'You cannot specify both input_ids and inputs_embeds at the same time, and must specify either one'
)
if self.gradient_checkpointing and self.training and use_cache:
logger.warning_once(
'`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.'
)
use_cache = False
if inputs_embeds is None:
inputs_embeds = self.tok_embeddings(input_ids)
return_legacy_cache = False
if use_cache and not isinstance(
past_key_values,
Cache): # kept for BC (non `Cache` `past_key_values` inputs)
return_legacy_cache = True
past_key_values = DynamicCache.from_legacy_cache(past_key_values)
if cache_position is None:
past_seen_tokens = past_key_values.get_seq_length(
) if past_key_values is not None else 0
cache_position = torch.arange(
past_seen_tokens,
past_seen_tokens + inputs_embeds.shape[1],
device=inputs_embeds.device)
if position_ids is None:
position_ids = cache_position.unsqueeze(0)
causal_mask = self._update_causal_mask(attention_mask, inputs_embeds,
cache_position, past_key_values,
output_attentions)
# embed positions
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None
for decoder_layer in self.layers:
if output_hidden_states:
all_hidden_states += (hidden_states, )
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
causal_mask,
position_ids,
past_key_values,
output_attentions,
use_cache,
cache_position,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=causal_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
use_cache=use_cache,
cache_position=cache_position,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[
2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1], )
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states, )
next_cache = next_decoder_cache if use_cache else None
if return_legacy_cache:
next_cache = next_cache.to_legacy_cache()
if not return_dict:
return tuple(
v for v in
[hidden_states, next_cache, all_hidden_states, all_self_attns]
if v is not None)
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
)
def _update_causal_mask(
self,
attention_mask: torch.Tensor,
input_tensor: torch.Tensor,
cache_position: torch.Tensor,
past_key_values: Cache,
output_attentions: bool,
):
# TODO: As of torch==2.2.0, the `attention_mask` passed to the model in `generate` is 2D and of dynamic length
# even when the static KV cache is used. This is an issue for torch.compile which then recaptures cudagraphs at
# each decode steps due to the dynamic shapes. (`recording cudagraph tree for symint key 13`, etc.), which is
# VERY slow. A workaround is `@torch.compiler.disable`, but this prevents using `fullgraph=True`.
# See more context in https://github.com/huggingface/transformers/pull/29114
if self.config.attn_implementation == 'flash_attention_2':
if attention_mask is not None and 0.0 in attention_mask:
return attention_mask
return None
# For SDPA, when possible, we will rely on its `is_causal` argument instead of its `attn_mask` argument, in
# order to dispatch on Flash Attention 2. This feature is not compatible with static cache, as SDPA will fail
# to infer the attention mask.
past_seen_tokens = past_key_values.get_seq_length(
) if past_key_values is not None else 0
using_static_cache = isinstance(past_key_values, StaticCache)
# When output attentions is True, sdpa implementation's forward method calls the eager implementation's forward
if self.config.attn_implementation == 'sdpa' and not using_static_cache and not output_attentions:
if AttentionMaskConverter._ignore_causal_mask_sdpa(
attention_mask,
inputs_embeds=input_tensor,
past_key_values_length=past_seen_tokens,
is_training=self.training,
):
return None
dtype, device = input_tensor.dtype, input_tensor.device
min_dtype = torch.finfo(dtype).min
sequence_length = input_tensor.shape[1]
if using_static_cache:
target_length = past_key_values.get_max_length()
else:
target_length = (
attention_mask.shape[-1] if isinstance(
attention_mask, torch.Tensor) else past_seen_tokens +
sequence_length + 1)
if attention_mask is not None and attention_mask.dim() == 4:
# in this case we assume that the mask comes already in inverted form and requires no inversion or slicing
if attention_mask.max() != 0:
raise ValueError(
'Custom 4D attention mask should be passed in inverted form with max==0'
)
causal_mask = attention_mask
else:
causal_mask = torch.full((sequence_length, target_length),
fill_value=min_dtype,
dtype=dtype,
device=device)
if sequence_length != 1:
causal_mask = torch.triu(causal_mask, diagonal=1)
causal_mask *= torch.arange(
target_length, device=device) > cache_position.reshape(-1, 1)
causal_mask = causal_mask[None, None, :, :].expand(
input_tensor.shape[0], 1, -1, -1)
if attention_mask is not None:
causal_mask = causal_mask.clone(
) # copy to contiguous memory for in-place edit
mask_length = attention_mask.shape[-1]
padding_mask = causal_mask[:, :, :, :
mask_length] + attention_mask[:,
None,
None, :]
padding_mask = padding_mask == 0
causal_mask[:, :, :, :
mask_length] = causal_mask[:, :, :, :
mask_length].masked_fill(
padding_mask,
min_dtype)
if (self.config.attn_implementation == 'sdpa'
and attention_mask is not None
and attention_mask.device.type == 'cuda'
and not output_attentions):
# Attend to all tokens in fully masked rows in the causal_mask, for example the relevant first rows when
# using left padding. This is required by F.scaled_dot_product_attention memory-efficient attention path.
# Details: https://github.com/pytorch/pytorch/issues/110213
causal_mask = AttentionMaskConverter._unmask_unattended(
causal_mask, min_dtype) # pylint: disable=E1120
return causal_mask
# Modified from transformers.models.llama.modeling_llama.LlamaForCausalLM
class InternLM2ForCausalLM(InternLM2PreTrainedModel):
"""Causal language model (CLM) for InternLM2."""
_auto_class = 'AutoModelForCausalLM'
_tied_weights_keys = ['output.weight']
def __init__(self, config):
super().__init__(config)
self.model = InternLM2Model(config)
self.vocab_size = config.vocab_size
self.output = nn.Linear(
config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
def get_output_embeddings(self):
return self.output
def set_output_embeddings(self, new_embeddings):
self.output = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
@replace_return_docstrings(
output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
cache_position: Optional[torch.LongTensor] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the causal language modeling loss. Indices should either be in `[0, ...,
config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked); the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
Returns:
Example:
```python
>>> from transformers import AutoModelForCausalLM, AutoTokenizer
>>> model = AutoModelForCausalLM.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained("internlm/internlm2-7b", trust_remote_code=True)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
cache_position=cache_position,
)
hidden_states = outputs[0]
if self.config.pretraining_tp > 1:
output_slices = self.output.weight.split(
self.vocab_size // self.config.pretraining_tp, dim=0)
logits = [
F.linear(hidden_states, output_slices[i]) # pylint: disable=not-callable
for i in range(self.config.pretraining_tp)
]
logits = torch.cat(logits, dim=-1)
else:
logits = self.output(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits, ) + outputs[1:]
return (loss, ) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
cache_position=None,
use_cache=True,
**kwargs,
):
past_length = 0
if past_key_values is not None:
if isinstance(past_key_values, Cache):
past_length = cache_position[
0] if cache_position is not None else past_key_values.get_seq_length(
)
max_cache_length = (
torch.tensor(
past_key_values.get_max_length(),
device=input_ids.device)
if past_key_values.get_max_length() is not None else None)
cache_length = past_length if max_cache_length is None else torch.min(
max_cache_length, past_length)
# TODO joao: remove this `else` after `generate` prioritizes `Cache` objects
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing `inputs_embeds` as input)
if attention_mask is not None and attention_mask.shape[
1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] -
past_length):]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (max_cache_length is not None and attention_mask is not None
and cache_length + input_ids.shape[1] > max_cache_length):
attention_mask = attention_mask[:, -max_cache_length:] # pylint: disable=E1130
position_ids = kwargs.get('position_ids', None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1]:]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {'inputs_embeds': inputs_embeds}
else:
# The `contiguous()` here is necessary to have a static stride during decoding. torchdynamo otherwise
# recompiles graphs as the stride of the inputs is a guard.
# Ref: https://github.com/huggingface/transformers/pull/29114
# TODO: use `next_tokens` directly instead.
model_inputs = {'input_ids': input_ids.contiguous()}
input_length = position_ids.shape[
-1] if position_ids is not None else input_ids.shape[-1]
if cache_position is None:
cache_position = torch.arange(
past_length,
past_length + input_length,
device=input_ids.device)
elif use_cache:
cache_position = cache_position[-input_length:]
model_inputs.update({
'position_ids': position_ids,
'cache_position': cache_position,
'past_key_values': past_key_values,
'use_cache': use_cache,
'attention_mask': attention_mask,
})
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (tuple(
past_state.index_select(0, beam_idx.to(past_state.device))
for past_state in layer_past), )
return reordered_past
def build_inputs(self,
tokenizer,
query: str,
history: Optional[List[Tuple[str, str]]] = None,
meta_instruction=''):
if history is None:
history = []
if tokenizer.add_bos_token:
prompt = ''
else:
prompt = tokenizer.bos_token
if meta_instruction:
prompt += f"""<|im_start|>system\n{meta_instruction}<|im_end|>\n"""
for record in history:
prompt += f"""<|im_start|>user\n{record[0]}<|im_end|>\n<|im_start|>assistant\n{record[1]}<|im_end|>\n"""
prompt += f"""<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"""
return tokenizer([prompt], return_tensors='pt')
@torch.no_grad()
def chat(
self,
tokenizer,
query: str,
history: Optional[List[Tuple[str, str]]] = None,
streamer: Optional[BaseStreamer] = None,
max_new_tokens: int = 1024,
do_sample: bool = True,
temperature: float = 0.8,
top_p: float = 0.8,
meta_instruction:
str = 'You are an AI assistant whose name is InternLM (书生·浦语).\n'
'- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory '
'(上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n'
'- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such '
'as English and 中文.',
**kwargs,
):
if history is None:
history = []
inputs = self.build_inputs(tokenizer, query, history, meta_instruction)
inputs = {
k: v.to(self.device)
for k, v in inputs.items() if torch.is_tensor(v)
}
# also add end-of-assistant token in eos token id to avoid unnecessary generation
eos_token_id = [
tokenizer.eos_token_id,
tokenizer.convert_tokens_to_ids(['<|im_end|>'])[0]
]
outputs = self.generate(
**inputs,
streamer=streamer,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
top_p=top_p,
eos_token_id=eos_token_id,
**kwargs,
)
outputs = outputs[0].cpu().tolist()[len(inputs['input_ids'][0]):]
response = tokenizer.decode(outputs, skip_special_tokens=True)
response = response.split('<|im_end|>')[0]
history = history + [(query, response)]
return response, history
@torch.no_grad()
def stream_chat(
self,
tokenizer,
query: str,
history: Optional[List[Tuple[str, str]]] = None,
max_new_tokens: int = 1024,
do_sample: bool = True,
temperature: float = 0.8,
top_p: float = 0.8,
**kwargs,
):
"""
Return a generator in format: (response, history)
E.g.
('你好,有什么可以帮助您的吗', [('你好', '你好,有什么可以帮助您的吗')])
('你好,有什么可以帮助您的吗?', [('你好', '你好,有什么可以帮助您的吗?')])
"""
if history is None:
history = []
if BaseStreamer is None:
raise ModuleNotFoundError(
'The version of `transformers` is too low. Please make sure '
'that you have installed `transformers>=4.28.0`.')
response_queue = queue.Queue(maxsize=20)
class ChatStreamer(BaseStreamer):
"""
Streamer used in generate to print words one by one.
"""
def __init__(self, tokenizer) -> None:
super().__init__()
self.tokenizer = tokenizer
self.queue = response_queue
self.query = query
self.history = history
self.response = ''
self.cache = []
self.received_inputs = False
self.queue.put(
(self.response, history + [(self.query, self.response)]))
def put(self, value):
if len(value.shape) > 1 and value.shape[0] > 1:
raise ValueError('ChatStreamer only supports batch size 1')
elif len(value.shape) > 1:
value = value[0]
if not self.received_inputs:
# The first received value is input_ids, ignore here
self.received_inputs = True
return
self.cache.extend(value.tolist())
token = self.tokenizer.decode(
self.cache, skip_special_tokens=True)
if token.strip() != '<|im_end|>':
self.response = self.response + token
history = self.history + [(self.query, self.response)]
self.queue.put((self.response, history))
self.cache = []
else:
self.end()
def end(self):
self.queue.put(None)
def stream_producer():
return self.chat(
tokenizer=tokenizer,
query=query,
streamer=ChatStreamer(tokenizer=tokenizer),
history=history,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
top_p=top_p,
**kwargs,
)
def consumer():
producer = threading.Thread(target=stream_producer)
producer.start()
while True:
res = response_queue.get()
if res is None:
return
yield res
return consumer()
# Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a sequence classification head on top (linear layer).
[`InternLM2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
(e.g. GPT-2) do.
Since it does classification on the last token, it needs to know the position of the last token. If a
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
each row of the batch).
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
"""Sequence Classification Head for InternLM2 Model."""
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = InternLM2Model(config)
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
transformer_outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
logits = self.score(hidden_states)
if input_ids is not None:
batch_size = input_ids.shape[0]
else:
batch_size = inputs_embeds.shape[0]
if self.config.pad_token_id is None and batch_size != 1:
raise ValueError(
'Cannot handle batch sizes > 1 if no padding token is defined.'
)
if self.config.pad_token_id is None:
sequence_lengths = -1
else:
if input_ids is not None:
# if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
sequence_lengths = torch.eq(
input_ids, self.config.pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
sequence_lengths = sequence_lengths.to(logits.device)
else:
sequence_lengths = -1
pooled_logits = logits[torch.arange(batch_size, device=logits.device),
sequence_lengths]
loss = None
if labels is not None:
labels = labels.to(logits.device)
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = 'regression'
elif self.num_labels > 1 and (labels.dtype
in (torch.long, torch.int)):
self.config.problem_type = 'single_label_classification'
else:
self.config.problem_type = 'multi_label_classification'
if self.config.problem_type == 'regression':
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == 'single_label_classification':
loss_fct = CrossEntropyLoss()
loss = loss_fct(
pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == 'multi_label_classification':
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(pooled_logits, labels)
if not return_dict:
output = (pooled_logits, ) + transformer_outputs[1:]
return ((loss, ) + output) if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=pooled_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
# Copied from transformers.models.llama.modeling_llama.LlamaForQuestionAnswering with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a span classification head on top for extractive question-answering tasks like
SQuAD (a linear layer on top of the hidden-states output to compute `span start logits` and `span end logits`).
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForQuestionAnswering(InternLM2PreTrainedModel):
"""Question Answering model for InternLM2."""
base_model_prefix = 'transformer'
def __init__(self, config):
super().__init__(config)
self.transformer = InternLM2Model(config)
self.qa_outputs = nn.Linear(config.hidden_size, 2)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.transformer.tok_embeddings
def set_input_embeddings(self, value):
self.transformer.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.FloatTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
start_positions: Optional[torch.LongTensor] = None,
end_positions: Optional[torch.LongTensor] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, QuestionAnsweringModelOutput]:
r"""
start_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the start of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
end_positions (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for position (index) of the end of the labelled span for computing the token classification loss.
Positions are clamped to the length of the sequence (`sequence_length`). Positions outside of the sequence
are not taken into account for computing the loss.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.transformer(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
logits = self.qa_outputs(sequence_output)
start_logits, end_logits = logits.split(1, dim=-1)
start_logits = start_logits.squeeze(-1).contiguous()
end_logits = end_logits.squeeze(-1).contiguous()
total_loss = None
if start_positions is not None and end_positions is not None:
# If we are on multi-GPU, the split adds a dimension; squeeze it away
if len(start_positions.size()) > 1:
start_positions = start_positions.squeeze(-1).to(
start_logits.device)
if len(end_positions.size()) > 1:
end_positions = end_positions.squeeze(-1).to(end_logits.device)
# sometimes the start/end positions are outside our model inputs, we ignore these terms
ignored_index = start_logits.size(1)
start_positions = start_positions.clamp(0, ignored_index)
end_positions = end_positions.clamp(0, ignored_index)
loss_fct = CrossEntropyLoss(ignore_index=ignored_index)
start_loss = loss_fct(start_logits, start_positions)
end_loss = loss_fct(end_logits, end_positions)
total_loss = (start_loss + end_loss) / 2
if not return_dict:
output = (start_logits, end_logits) + outputs[2:]
return ((total_loss, ) +
output) if total_loss is not None else output
return QuestionAnsweringModelOutput(
loss=total_loss,
start_logits=start_logits,
end_logits=end_logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
# Copied from transformers.models.llama.modeling_llama.LlamaForTokenClassification with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a token classification head on top (a linear layer on top of the hidden-states
output) e.g. for Named-Entity-Recognition (NER) tasks.
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForTokenClassification(InternLM2PreTrainedModel):
"""Token classification model for InternLM2."""
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = InternLM2Model(config)
if getattr(config, 'classifier_dropout', None) is not None:
classifier_dropout = config.classifier_dropout
elif getattr(config, 'hidden_dropout', None) is not None:
classifier_dropout = config.hidden_dropout
else:
classifier_dropout = 0.1
self.dropout = nn.Dropout(classifier_dropout)
self.score = nn.Linear(config.hidden_size, config.num_labels)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, TokenClassifierOutput]:
r"""
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the token classification loss. Indices should be in `[0, ...,
config.num_labels - 1]`.
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
sequence_output = outputs[0]
sequence_output = self.dropout(sequence_output)
logits = self.score(sequence_output)
loss = None
if labels is not None:
loss_fct = CrossEntropyLoss()
loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))
if not return_dict:
output = (logits, ) + outputs[2:]
return ((loss, ) + output) if loss is not None else output
return TokenClassifierOutput(
loss=loss,
logits=logits,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/llava/modeling_llava.py
================================================
# coding=utf-8
# Copyright 2023 the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch Llava model."""
from dataclasses import dataclass
from typing import List, Optional, Tuple, Union
import torch
import torch.utils.checkpoint
from torch import nn
from transformers import PreTrainedModel
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache
from transformers.modeling_outputs import ModelOutput
from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
logging,
replace_return_docstrings,
)
from transformers import AutoModel, AutoModelForCausalLM
from .configuration_llava import EnhancedLlavaConfig
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "LlavaConfig"
@dataclass
# Copied from transformers.models.idefics.modeling_idefics.IdeficsCausalLMOutputWithPast with Idefics->Llava
class LlavaCausalLMOutputWithPast(ModelOutput):
"""
Base class for Llava causal language model (or autoregressive) outputs.
Args:
loss (`torch.FloatTensor` of shape `(1,)`, *optional*, returned when `labels` is provided):
Language modeling loss (for next-token prediction).
logits (`torch.FloatTensor` of shape `(batch_size, sequence_length, config.vocab_size)`):
Prediction scores of the language modeling head (scores for each vocabulary token before SoftMax).
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`):
Tuple of `torch.FloatTensor` (one for the output of the embeddings, if the model has an embedding layer, +
one for the output of each layer) of shape `(batch_size, sequence_length, hidden_size)`.
Hidden-states of the model at the output of each layer plus the optional initial embedding outputs.
attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`):
Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length,
sequence_length)`.
Attentions weights after the attention softmax, used to compute the weighted average in the self-attention
heads.
image_hidden_states (`tuple(torch.FloatTensor)`, *optional*):
Tuple of `torch.FloatTensor` (one for the output of the image embeddings) of shape `(batch_size, num_images,
sequence_length, hidden_size)`.
image_hidden_states of the model produced by the vision encoder, and optionally by the perceiver.
"""
loss: Optional[torch.FloatTensor] = None
logits: torch.FloatTensor = None
past_key_values: Optional[List[torch.FloatTensor]] = None
hidden_states: Optional[Tuple[torch.FloatTensor]] = None
attentions: Optional[Tuple[torch.FloatTensor]] = None
image_hidden_states: Optional[Tuple[torch.FloatTensor]] = None
class LlavaMultiModalProjector(nn.Module):
def __init__(self, config: EnhancedLlavaConfig):
super().__init__()
self.linear_1 = nn.Linear(config.vision_config.hidden_size, config.text_config.hidden_size, bias=True)
self.act = ACT2FN[config.projector_hidden_act]
self.linear_2 = nn.Linear(config.text_config.hidden_size, config.text_config.hidden_size, bias=True)
def forward(self, image_features):
hidden_states = self.linear_1(image_features)
hidden_states = self.act(hidden_states)
hidden_states = self.linear_2(hidden_states)
return hidden_states
LLAVA_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`LlavaConfig`] or [`LlavaVisionConfig`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
@add_start_docstrings(
"The bare Llava Model outputting raw hidden-states without any specific head on top.",
LLAVA_START_DOCSTRING,
)
class LlavaPreTrainedModel(PreTrainedModel):
config_class = EnhancedLlavaConfig
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["LlavaVisionAttention"]
_skip_keys_device_placement = "past_key_values"
_supports_flash_attn_2 = True
def _init_weights(self, module):
# important: this ported version of Llava isn't meant for training from scratch - only
# inference and fine-tuning - so the proper init weights code has been removed - the original codebase
# https://github.com/haotian-liu/LLaVA/tree/main/llava should serve for that purpose
std = (
self.config.initializer_range
if hasattr(self.config, "initializer_range")
else self.config.text_config.initializer_range
)
if hasattr(module, "class_embedding"):
module.class_embedding.data.normal_(mean=0.0, std=std)
if isinstance(module, (nn.Linear, nn.Conv2d)):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
@property
def _supports_sdpa(self):
"""
Retrieve language_model's attribute to check whether the model supports
SDPA or not.
"""
return self.language_model._supports_sdpa
LLAVA_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
pixel_values (`torch.FloatTensor` of shape `(batch_size, num_channels, image_size, image_size)`):
The tensors corresponding to the input images. Pixel values can be obtained using
[`AutoImageProcessor`]. See [`CLIPImageProcessor.__call__`] for details ([`LlavaProcessor`] uses
[`CLIPImageProcessor`] for processing images).
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `decoder_input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence tokens in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`. [What are position IDs?](../glossary#position-ids)
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`) and 2 additional tensors of shape
`(batch_size, num_heads, encoder_sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used (see `past_key_values` input) to speed up sequential decoding.
If `past_key_values` are used, the user can optionally input only the last `decoder_input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`decoder_input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
vision_feature_layer (`int`, *optional*, defaults to -2):
The index of the layer to select the vision feature.
vision_feature_select_strategy (`str`, *optional*, defaults to `"default"`):
The feature selection strategy used to select the vision feature from the vision backbone.
Can be one of `"default"` or `"full"`.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
"""The LLAVA model which consists of a vision backbone and a language model.""",
LLAVA_START_DOCSTRING,
)
class LlavaForConditionalGeneration(LlavaPreTrainedModel):
_auto_class = 'AutoModel'
def __init__(self, config: EnhancedLlavaConfig):
super().__init__(config)
self.vision_tower = AutoModel.from_config(config.vision_config)
self.multi_modal_projector = LlavaMultiModalProjector(config)
self.vocab_size = config.text_config.vocab_size
self.language_model = AutoModelForCausalLM.from_config(
config.text_config,
attn_implementation=config._attn_implementation)
self.pad_token_id = self.config.pad_token_id if self.config.pad_token_id is not None else -1
self.post_init()
def get_input_embeddings(self):
return self.language_model.get_input_embeddings()
def set_input_embeddings(self, value):
self.language_model.set_input_embeddings(value)
def get_output_embeddings(self):
return self.language_model.get_output_embeddings()
def set_output_embeddings(self, new_embeddings):
self.language_model.set_output_embeddings(new_embeddings)
def set_decoder(self, decoder):
self.language_model.set_decoder(decoder)
def get_decoder(self):
return self.language_model.get_decoder()
def tie_weights(self):
return self.language_model.tie_weights()
def resize_token_embeddings(self, new_num_tokens: Optional[int] = None, pad_to_multiple_of=None) -> nn.Embedding:
model_embeds = self.language_model.resize_token_embeddings(new_num_tokens, pad_to_multiple_of)
# update vocab size
self.config.text_config.vocab_size = model_embeds.num_embeddings
self.vocab_size = model_embeds.num_embeddings
return model_embeds
def _merge_input_ids_with_image_features(self, image_features, inputs_embeds, input_ids, attention_mask, labels):
num_images, num_image_patches, embed_dim = image_features.shape
batch_size, sequence_length = input_ids.shape
left_padding = not torch.sum(input_ids[:, -1] == torch.tensor(self.pad_token_id))
# 1. Create a mask to know where special image tokens are
special_image_token_mask = input_ids == self.config.image_token_index
num_special_image_tokens = torch.sum(special_image_token_mask, dim=-1)
# Compute the maximum embed dimension
max_embed_dim = (num_special_image_tokens.max() * (num_image_patches - 1)) + sequence_length
batch_indices, non_image_indices = torch.where(input_ids != self.config.image_token_index)
# 2. Compute the positions where text should be written
# Calculate new positions for text tokens in merged image-text sequence.
# `special_image_token_mask` identifies image tokens. Each image token expands into `num_image_patches` positions, so it shifts every later token by `num_image_patches - 1`.
# `torch.cumsum` computes how each image token shifts subsequent text token positions.
# - 1 to adjust for zero-based indexing, as `cumsum` inherently increases indices by one.
new_token_positions = torch.cumsum((special_image_token_mask * (num_image_patches - 1) + 1), -1) - 1
nb_image_pad = max_embed_dim - 1 - new_token_positions[:, -1]
if left_padding:
new_token_positions += nb_image_pad[:, None] # offset for left padding
text_to_overwrite = new_token_positions[batch_indices, non_image_indices]
# 3. Create the full embedding, already padded to the maximum position
final_embedding = torch.zeros(
batch_size, max_embed_dim, embed_dim, dtype=inputs_embeds.dtype, device=inputs_embeds.device
)
final_attention_mask = torch.zeros(
batch_size, max_embed_dim, dtype=attention_mask.dtype, device=inputs_embeds.device
)
if labels is not None:
final_labels = torch.full(
(batch_size, max_embed_dim), self.config.ignore_index, dtype=input_ids.dtype, device=input_ids.device
)
# In case the Vision model or the Language model has been offloaded to CPU, we need to manually
# set the corresponding tensors into their correct target device.
target_device = inputs_embeds.device
batch_indices, non_image_indices, text_to_overwrite = (
batch_indices.to(target_device),
non_image_indices.to(target_device),
text_to_overwrite.to(target_device),
)
attention_mask = attention_mask.to(target_device)
# 4. Fill the embeddings based on the mask. If we have ["hey", "<image>", "how", "are"]
# we need to index copy on [0, 577, 578, 579] for the text and [1:576] for the image features
final_embedding[batch_indices, text_to_overwrite] = inputs_embeds[batch_indices, non_image_indices]
final_attention_mask[batch_indices, text_to_overwrite] = attention_mask[batch_indices, non_image_indices]
if labels is not None:
final_labels[batch_indices, text_to_overwrite] = labels[batch_indices, non_image_indices]
# 5. Fill the embeddings corresponding to the images. Anything that is not `text_positions` needs filling (#29835)
image_to_overwrite = torch.full(
(batch_size, max_embed_dim), True, dtype=torch.bool, device=inputs_embeds.device
)
image_to_overwrite[batch_indices, text_to_overwrite] = False
image_to_overwrite &= image_to_overwrite.cumsum(-1) - 1 >= nb_image_pad[:, None].to(target_device)
if image_to_overwrite.sum() != image_features.shape[:-1].numel():
raise ValueError(
f"The inputs provided to the model are wrong. The number of image tokens is {torch.sum(special_image_token_mask)} while"
f" the number of images given to the model is {num_images}. This prevents correct indexing and breaks batch generation."
)
final_embedding[image_to_overwrite] = image_features.contiguous().reshape(-1, embed_dim).to(target_device)
final_attention_mask |= image_to_overwrite
position_ids = (final_attention_mask.cumsum(-1) - 1).masked_fill_((final_attention_mask == 0), 1)
# 6. Mask out the embedding at padding positions, as we later use the past_key_value value to determine the non-attended tokens.
batch_indices, pad_indices = torch.where(input_ids == self.pad_token_id)
indices_to_mask = new_token_positions[batch_indices, pad_indices]
final_embedding[batch_indices, indices_to_mask] = 0
if labels is None:
final_labels = None
return final_embedding, final_attention_mask, final_labels, position_ids
def forward(
self,
input_ids: torch.LongTensor = None,
pixel_values: torch.FloatTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
vision_feature_layer: Optional[int] = None,
vision_feature_select_strategy: Optional[str] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, LlavaCausalLMOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
vision_feature_layer = (
vision_feature_layer if vision_feature_layer is not None else self.config.vision_feature_layer
)
vision_feature_select_strategy = (
vision_feature_select_strategy
if vision_feature_select_strategy is not None
else self.config.vision_feature_select_strategy
)
if inputs_embeds is None:
# 1. Extract the input embeddings
inputs_embeds = self.get_input_embeddings()(input_ids)
# ------------- start add this ----------------
if pixel_values is None and self.training:
# The whole batch is text-only, so build a dummy all-zero image and still run the
# vision tower and projector. If this case is not handled, training can deadlock.
# print('===================all of the input is text==============')
image_size = self.config.vision_config.image_size
pixel_values = torch.zeros(input_ids.shape[0], 3, image_size, image_size,
dtype=torch.float32,
device=input_ids.device)
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
# This is not memory efficient at all: output_hidden_states=True keeps all the hidden states.
selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
selected_image_feature = selected_image_feature
else:
raise ValueError(
f"Unexpected select feature strategy: {vision_feature_select_strategy}"
)
image_features = self.multi_modal_projector(selected_image_feature)
inputs_embeds = inputs_embeds.to(image_features.dtype)
inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
image_features[0:0], inputs_embeds, input_ids, attention_mask, labels
)
# ------------- end add this ----------------
# 2. Merge text and images
elif pixel_values is not None and input_ids.shape[1] != 1:
image_outputs = self.vision_tower(pixel_values, output_hidden_states=True)
# This is not memory efficient at all: output_hidden_states=True keeps all the hidden states.
selected_image_feature = image_outputs.hidden_states[vision_feature_layer]
if vision_feature_select_strategy == "default":
selected_image_feature = selected_image_feature[:, 1:]
elif vision_feature_select_strategy == "full":
selected_image_feature = selected_image_feature
else:
raise ValueError(
f"Unexpected select feature strategy: {vision_feature_select_strategy}"
)
image_features = self.multi_modal_projector(selected_image_feature)
inputs_embeds = inputs_embeds.to(image_features.dtype)
inputs_embeds, attention_mask, labels, position_ids = self._merge_input_ids_with_image_features(
image_features, inputs_embeds, input_ids, attention_mask, labels
)
# In case input_ids.shape[1] == 1, pixel_values is not None and past_key_values is not None, we are in the case of
# generation with cache
elif past_key_values is not None and pixel_values is not None and input_ids.shape[1] == 1:
# Retrieve the first layer to inspect the logits and mask out the hidden states
# that are set to 0
first_layer_past_key_value = past_key_values[0][0][:, :, :, 0]
# Sum all dimensions of head_dim (-2) to avoid random errors such as: https://github.com/huggingface/transformers/pull/28032#issuecomment-1863691941
batch_index, non_attended_tokens = torch.where(first_layer_past_key_value.float().sum(-2) == 0)
# Get the target length
target_length = input_ids.shape[1]
past_length = first_layer_past_key_value.shape[-1]
extended_attention_mask = torch.ones(
(attention_mask.shape[0], past_length),
dtype=attention_mask.dtype,
device=attention_mask.device,
)
# Filter out only the tokens that can be un-attended, this can happen
# if one uses Llava + Fused modules where the cache on the
# first iteration is already big enough, or if one passes custom cache
valid_indices = non_attended_tokens < extended_attention_mask.size(-1)
new_batch_index = batch_index[valid_indices]
new_non_attended_tokens = non_attended_tokens[valid_indices]
# Zero-out the places where we don't need to attend
extended_attention_mask[new_batch_index, new_non_attended_tokens] = 0
attention_mask = torch.cat((extended_attention_mask, attention_mask[:, -target_length:]), dim=1)
position_ids = torch.sum(attention_mask, dim=1).unsqueeze(-1) - 1
outputs = self.language_model(
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
logits = outputs[0]
loss = None
if labels is not None:
# Shift so that tokens < n predict n
if attention_mask is not None:
shift_attention_mask = attention_mask[..., 1:]
shift_logits = logits[..., :-1, :][shift_attention_mask.to(logits.device) != 0].contiguous()
shift_labels = labels[..., 1:][shift_attention_mask.to(labels.device) != 0].contiguous()
else:
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = nn.CrossEntropyLoss()
loss = loss_fct(
shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1).to(shift_logits.device)
)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return LlavaCausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self, input_ids, past_key_values=None, inputs_embeds=None, pixel_values=None, attention_mask=None, **kwargs
):
if past_key_values is not None:
if isinstance(past_key_values, Cache):
cache_length = past_key_values.get_seq_length()
past_length = past_key_values.seen_tokens
else:
cache_length = past_length = past_key_values[0][0].shape[2]
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if attention_mask is not None and attention_mask.shape[1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] - past_length) :]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
elif self.config.image_token_index in input_ids:
input_ids = input_ids[:, input_ids.shape[1] - 1 :]
# If the cache has seen more tokens than it can hold, then the cache has a size limit. Let's discard the
# older attention values, as their corresponding values are not part of the input.
if cache_length < past_length and attention_mask is not None:
attention_mask = attention_mask[:, -(cache_length + input_ids.shape[1]) :]
position_ids = kwargs.get("position_ids", None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
"pixel_values": pixel_values,
}
)
return model_inputs
def _reorder_cache(self, *args, **kwargs):
return self.language_model._reorder_cache(*args, **kwargs)
AutoModel.register(EnhancedLlavaConfig, LlavaForConditionalGeneration, exist_ok=True)
AutoModelForCausalLM.register(EnhancedLlavaConfig, LlavaForConditionalGeneration, exist_ok=True)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/modelings/llava/processing_llava.py
================================================
# coding=utf-8
# Copyright 2023 The HuggingFace Inc. team.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Processor class for Llava.
"""
from typing import List, Optional, Union
from transformers.feature_extraction_utils import BatchFeature
from transformers.image_utils import ImageInput
from transformers.processing_utils import ProcessorMixin
from transformers.tokenization_utils_base import PaddingStrategy, PreTokenizedInput, TextInput, TruncationStrategy
from transformers.utils import TensorType
class LlavaProcessor(ProcessorMixin):
r"""
Constructs a Llava processor which wraps a Llava image processor and a Llava tokenizer into a single processor.
[`LlavaProcessor`] offers all the functionalities of [`CLIPImageProcessor`] and [`LlamaTokenizerFast`]. See the
[`~LlavaProcessor.__call__`] and [`~LlavaProcessor.decode`] for more information.
Args:
image_processor ([`CLIPImageProcessor`], *optional*):
The image processor is a required input.
tokenizer ([`LlamaTokenizerFast`], *optional*):
The tokenizer is a required input.
chat_template (`str`, *optional*): A Jinja template which will be used to convert lists of messages
in a chat into a tokenizable string.
"""
attributes = ["image_processor", "tokenizer"]
valid_kwargs = ["chat_template"]
image_processor_class = "AutoImageProcessor"
tokenizer_class = "AutoTokenizer"
def __init__(self, image_processor=None, tokenizer=None, chat_template=None, **kwargs):
super().__init__(image_processor, tokenizer, chat_template=chat_template)
def __call__(
self,
text: Union[TextInput, PreTokenizedInput, List[TextInput], List[PreTokenizedInput]] = None,
images: ImageInput = None,
padding: Union[bool, str, PaddingStrategy] = False,
truncation: Union[bool, str, TruncationStrategy] = None,
max_length=None,
return_tensors: Optional[Union[str, TensorType]] = TensorType.PYTORCH,
) -> BatchFeature:
"""
Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
and `kwargs` arguments to LlamaTokenizerFast's [`~LlamaTokenizerFast.__call__`] if `text` is not `None` to encode
the text. To prepare the image(s), this method forwards the `images` and `kwargs` arguments to
CLIPImageProcessor's [`~CLIPImageProcessor.__call__`] if `images` is not `None`. Please refer to the docstring
of the above two methods for more information.
Args:
text (`str`, `List[str]`, `List[List[str]]`):
The sequence or batch of sequences to be encoded. Each sequence can be a string or a list of strings
(pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
`is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
images (`PIL.Image.Image`, `np.ndarray`, `torch.Tensor`, `List[PIL.Image.Image]`, `List[np.ndarray]`, `List[torch.Tensor]`):
The image or batch of images to be prepared. Each image can be a PIL image, NumPy array or PyTorch
tensor. Both channels-first and channels-last formats are supported.
padding (`bool`, `str` or [`~utils.PaddingStrategy`], *optional*, defaults to `False`):
Select a strategy to pad the returned sequences (according to the model's padding side and padding
index) among:
- `True` or `'longest'`: Pad to the longest sequence in the batch (or no padding if only a single
sequence is provided).
- `'max_length'`: Pad to a maximum length specified with the argument `max_length` or to the maximum
acceptable input length for the model if that argument is not provided.
- `False` or `'do_not_pad'` (default): No padding (i.e., can output a batch with sequences of different
lengths).
max_length (`int`, *optional*):
Maximum length of the returned list and optionally padding length (see above).
truncation (`bool`, *optional*):
Activates truncation to cut input sequences longer than `max_length` to `max_length`.
return_tensors (`str` or [`~utils.TensorType`], *optional*):
If set, will return tensors of a particular framework. Acceptable values are:
- `'tf'`: Return TensorFlow `tf.constant` objects.
- `'pt'`: Return PyTorch `torch.Tensor` objects.
- `'np'`: Return NumPy `np.ndarray` objects.
- `'jax'`: Return JAX `jnp.ndarray` objects.
Returns:
[`BatchFeature`]: A [`BatchFeature`] with the following fields:
- **input_ids** -- List of token ids to be fed to a model. Returned when `text` is not `None`.
- **attention_mask** -- List of indices specifying which tokens should be attended to by the model (when
`return_attention_mask=True` or if *"attention_mask"* is in `self.model_input_names` and if `text` is not
`None`).
- **pixel_values** -- Pixel values to be fed to a model. Returned when `images` is not `None`.
"""
if images is not None:
image_inputs = self.image_processor(images, return_tensors=return_tensors)
else:
image_inputs = {}
text_inputs = self.tokenizer(
text, return_tensors=return_tensors, padding=padding, truncation=truncation, max_length=max_length
)
return BatchFeature(data={**text_inputs, **image_inputs})
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.batch_decode with CLIP->Llama
def batch_decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.batch_decode`]. Please
refer to the docstring of this method for more information.
"""
return self.tokenizer.batch_decode(*args, **kwargs)
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.decode with CLIP->Llama
def decode(self, *args, **kwargs):
"""
This method forwards all its arguments to LlamaTokenizerFast's [`~PreTrainedTokenizer.decode`]. Please refer to
the docstring of this method for more information.
"""
return self.tokenizer.decode(*args, **kwargs)
@property
# Copied from transformers.models.clip.processing_clip.CLIPProcessor.model_input_names
def model_input_names(self):
tokenizer_input_names = self.tokenizer.model_input_names
image_processor_input_names = self.image_processor.model_input_names
return list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
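# Illustrative sketch (not part of the upstream file): `model_input_names`
# above merges the tokenizer and image-processor input names with
# `dict.fromkeys`, which drops duplicates while keeping first-seen order.
# The overlapping "attention_mask" entry below is a made-up example value.

```python
tokenizer_input_names = ["input_ids", "attention_mask"]
# Hypothetical overlap, used only to show the de-duplication.
image_processor_input_names = ["pixel_values", "attention_mask"]

# dict keys are unique and preserve insertion order, so this removes the
# second "attention_mask" without reordering the remaining names.
merged = list(dict.fromkeys(tokenizer_input_names + image_processor_input_names))
print(merged)
```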
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .comm import all_to_all, all_to_all_list
from .fsdp import LoadWoInit
from .sampler import LengthGroupedSampler, ParallelSampler
from .sequence import * # noqa: F401, F403
from .setup import (get_dp_group, get_dp_mesh, get_dp_world_size, get_sp_group,
get_sp_mesh, get_sp_world_size, get_tp_group, get_tp_mesh,
get_tp_world_size, setup_parallel)
__all__ = [
'ParallelSampler', 'LengthGroupedSampler', 'all_to_all', 'all_to_all_list',
'setup_parallel', 'get_dp_mesh', 'get_dp_group', 'get_dp_world_size',
'get_sp_mesh', 'get_sp_group', 'get_sp_world_size', 'get_tp_mesh',
'get_tp_group', 'get_tp_world_size', 'LoadWoInit'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/comm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Any, Tuple
import torch
import torch.distributed as dist
from torch import Tensor
from torch.distributed.distributed_c10d import (_get_pg_default_device,
_object_to_tensor,
_tensor_to_object)
# Modified from https://github.com/microsoft/DeepSpeed/blob/ffd0a0e3ef24bfd00c2e5f35019d2674cc01ec14/deepspeed/sequence/layer.py#L15 # noqa: E501
def _all_to_all(
input: Tensor,
world_size: int,
group: dist.ProcessGroup,
scatter_dim: int,
gather_dim: int,
):
input_list = [
t.contiguous()
for t in torch.tensor_split(input, world_size, scatter_dim)
]
output_list = [torch.empty_like(input_list[0]) for _ in range(world_size)]
dist.all_to_all(output_list, input_list, group=group)
return torch.cat(output_list, dim=gather_dim).contiguous()
class _AllToAll(torch.autograd.Function):
"""All-to-all communication.
Args:
input: Input tensor
sp_group: Sequence parallel process group
scatter_dim: Scatter dimension
gather_dim: Gather dimension
"""
@staticmethod
def forward(ctx: Any, input: Tensor, sp_group: dist.ProcessGroup,
scatter_dim: int, gather_dim: int):
ctx.sp_group = sp_group
ctx.scatter_dim = scatter_dim
ctx.gather_dim = gather_dim
ctx.world_size = dist.get_world_size(sp_group)
output = _all_to_all(input, ctx.world_size, sp_group, scatter_dim,
gather_dim)
return output
@staticmethod
def backward(ctx: Any, grad_output: Tensor) -> Tuple:
grad_output = _all_to_all(
grad_output,
ctx.world_size,
ctx.sp_group,
ctx.gather_dim,
ctx.scatter_dim,
)
return (
grad_output,
None,
None,
None,
)
def all_to_all(
input: Tensor,
sp_group: dist.ProcessGroup,
scatter_dim: int = 2,
gather_dim: int = 1,
):
"""Convenience function to apply the all-to-all operation with scatter and
gather dimensions.
Notes:
We have wrapped the `torch.distributed.all_to_all` function to
enable automatic differentiation of the all-to-all operation.
Args:
input: The input tensor for which all-to-all communication is performed
sp_group: The sequence parallel process group.
scatter_dim: The dimension along which the input tensor is scattered
(default: 2).
gather_dim: The dimension along which the output tensor is gathered
(default: 1).
Returns:
The output tensor after the all-to-all communication.
"""
return _AllToAll.apply(input, sp_group, scatter_dim, gather_dim)
def all_to_all_list(object_list, group=None):
current_device = _get_pg_default_device(group)
rank = dist.get_rank(group)
world_size = dist.get_world_size(group)
tensor_list, size_list = zip(
*
[_object_to_tensor(obj, current_device, group) for obj in object_list])
tensor_list = list(tensor_list)
size_list = torch.cat(size_list)
buffer = [None] * world_size
dist.all_gather_object(buffer, size_list, group=group)
size_this_rank = []
for size_list in buffer:
size_this_rank.append(size_list[rank])
target_tensor_list = [
torch.empty(size.item(), dtype=torch.uint8, device=current_device)
for size in size_this_rank
]
dist.all_to_all(target_tensor_list, tensor_list, group=group)
for i in range(len(target_tensor_list)):
obj_view = target_tensor_list[i].type(torch.uint8)
target_tensor_list[i] = _tensor_to_object(obj_view, size_this_rank[i],
group)
return target_tensor_list
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/fsdp/__init__.py
================================================
from .checkpointing import RECOMPUTE_MODULES, checkpoint_check_fn
from .lazy import LoadWoInit, dp_lazy_init, dp_sp_lazy_init, dp_tp_lazy_init
from .wrap import (all_required_grad_wrap_policy, layer_and_emb_wrap_policy,
layer_auto_wrap_policy, token_embedding_wrap_policy)
__all__ = [
'RECOMPUTE_MODULES', 'checkpoint_check_fn', 'LoadWoInit', 'dp_lazy_init',
'all_required_grad_wrap_policy', 'layer_auto_wrap_policy',
'token_embedding_wrap_policy', 'dp_tp_lazy_init', 'dp_sp_lazy_init',
'layer_and_emb_wrap_policy'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/fsdp/checkpointing.py
================================================
import random
RECOMPUTE_MODULES = ('InternLM2DecoderLayer', 'CLIPEncoderLayer',
'LlamaDecoderLayer', 'Qwen2DecoderLayer')
def checkpoint_check_fn(submodule, target=RECOMPUTE_MODULES, selective=1.0):
ret = False
if type(submodule).__name__ in target:
if random.uniform(0, 1) < selective:
ret = True
return ret
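# Illustrative sketch (not part of the upstream file): how the `selective`
# argument of `checkpoint_check_fn` behaves. The classes below are
# hypothetical stand-ins; only their class names matter, because the check
# compares `type(submodule).__name__` against the target tuple.

```python
import random

# Hypothetical stand-ins for real layers.
class Qwen2DecoderLayer: ...
class Embedding: ...

TARGETS = ("Qwen2DecoderLayer",)

def should_recompute(submodule, target=TARGETS, selective=1.0):
    """Return True for roughly a `selective` fraction of matching layers."""
    name_matches = type(submodule).__name__ in target
    return name_matches and random.uniform(0, 1) < selective

layers = [Qwen2DecoderLayer() for _ in range(1000)]
random.seed(0)
picked = sum(should_recompute(m, selective=0.25) for m in layers)
print(f"{picked} of {len(layers)} layers selected for recomputation")
```

# Because the decision is random per call, which layers get checkpointed
# can differ between runs unless the `random` module is seeded.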
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/fsdp/lazy.py
================================================
import torch
import torch.distributed as dist
from torch.distributed._tensor import DTensor, distribute_tensor
@torch.no_grad
def dp_lazy_init(module, module_map, dp_mesh):
device = torch.cuda.current_device()
module.to_empty(device=device, recurse=False)
torch.cuda.empty_cache()
if dp_mesh.get_local_rank() == 0:
master_module = module_map[module]
master_params = {
name: param
for name, param in master_module.named_parameters(recurse=False)
}
master_buffers = {
name: buffer
for name, buffer in master_module.named_buffers(recurse=False)
}
for name, param in module.named_parameters(recurse=False):
p_copy = master_params[name].to(device).to(param.dtype)
param.data.copy_(p_copy)
for name, buffer in module.named_buffers(recurse=False):
b_copy = master_buffers[name].to(device).to(buffer.dtype)
buffer.data.copy_(b_copy)
@torch.no_grad
def dp_sp_lazy_init(module, module_map, dp_mesh, sp_mesh):
device = torch.cuda.current_device()
module.to_empty(device=device, recurse=False)
torch.cuda.empty_cache()
if dp_mesh.get_local_rank() == 0 and sp_mesh.get_local_rank() == 0:
master_module = module_map[module]
master_params = {
name: param
for name, param in master_module.named_parameters(recurse=False)
}
master_buffers = {
name: buffer
for name, buffer in master_module.named_buffers(recurse=False)
}
for name, param in module.named_parameters(recurse=False):
p_copy = master_params[name].to(device).to(param.dtype)
param.data.copy_(p_copy)
for name, buffer in module.named_buffers(recurse=False):
b_copy = master_buffers[name].to(device).to(buffer.dtype)
buffer.data.copy_(b_copy)
@torch.no_grad
def dp_tp_lazy_init(module, module_map, dp_mesh, tp_mesh):
device = torch.cuda.current_device()
module.to_empty(device=device, recurse=False)
torch.cuda.empty_cache()
if dp_mesh.get_local_rank() != 0:
return
if tp_mesh.get_local_rank() == 0:
master_module = module_map[module]
master_params = {
name: param
for name, param in master_module.named_parameters(recurse=False)
}
master_buffers = {
name: buffer
for name, buffer in master_module.named_buffers(recurse=False)
}
else:
master_params = None
master_buffers = None
for name, param in module.named_parameters(recurse=False):
if isinstance(param, DTensor):
p_full = param.full_tensor()
if tp_mesh.get_local_rank() == 0:
p_copy = master_params[name]
p_copy = p_copy.to(device).to(param.dtype)
else:
p_copy = torch.empty_like(p_full)
mesh = param.device_mesh
placements = param.placements
p_dtensor = distribute_tensor(p_copy, mesh, placements)
param.data.copy_(p_dtensor)
else:
if tp_mesh.get_local_rank() == 0:
p_copy = master_params[name]
p_copy = p_copy.to(device).to(param.dtype)
else:
p_copy = torch.empty_like(param)
tp_group = tp_mesh.get_group()
dist.broadcast(p_copy, 0, tp_group)
param.data.copy_(p_copy)
for name, buffer in module.named_buffers(recurse=False):
if isinstance(buffer, DTensor):
b_full = buffer.full_tensor()
if tp_mesh.get_local_rank() == 0:
b_copy = master_buffers[name]
b_copy = b_copy.to(device).to(buffer.dtype)
else:
b_copy = torch.empty_like(b_full)
mesh = buffer.device_mesh
placements = buffer.placements
b_dtensor = distribute_tensor(b_copy, mesh, placements)
buffer.data.copy_(b_dtensor)
else:
if tp_mesh.get_local_rank() == 0:
b_copy = master_buffers[name]
b_copy = b_copy.to(device).to(buffer.dtype)
else:
b_copy = torch.empty_like(buffer)
tp_group = tp_mesh.get_group()
dist.broadcast(b_copy, 0, tp_group)
buffer.data.copy_(b_copy)
class LoadWoInit:
"""Context manager that disables parameter initialization."""
def __init__(self):
self.constant_ = torch.nn.init.constant_
self.zeros_ = torch.nn.init.zeros_
self.ones_ = torch.nn.init.ones_
self.uniform_ = torch.nn.init.uniform_
self.normal_ = torch.nn.init.normal_
self.kaiming_uniform_ = torch.nn.init.kaiming_uniform_
self.kaiming_normal_ = torch.nn.init.kaiming_normal_
def __enter__(self, *args, **kwargs):
torch.nn.init.constant_ = lambda *args, **kwargs: None
torch.nn.init.zeros_ = lambda *args, **kwargs: None
torch.nn.init.ones_ = lambda *args, **kwargs: None
torch.nn.init.uniform_ = lambda *args, **kwargs: None
torch.nn.init.normal_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_uniform_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_normal_ = lambda *args, **kwargs: None
def __exit__(self, *args, **kwargs):
torch.nn.init.constant_ = self.constant_
torch.nn.init.zeros_ = self.zeros_
torch.nn.init.ones_ = self.ones_
torch.nn.init.uniform_ = self.uniform_
torch.nn.init.normal_ = self.normal_
torch.nn.init.kaiming_uniform_ = self.kaiming_uniform_
torch.nn.init.kaiming_normal_ = self.kaiming_normal_
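# Illustrative sketch (not part of the upstream file): the patch-and-restore
# pattern used by `LoadWoInit`, shown on a stand-in namespace so it runs
# without PyTorch. `init`, `SkipInit`, and the toy `zeros_` are hypothetical
# names introduced only for this example.

```python
from types import SimpleNamespace

# Hypothetical stand-in for `torch.nn.init`.
init = SimpleNamespace(zeros_=lambda values: [0 for _ in values])

class SkipInit:
    """Temporarily replace the named functions with no-ops."""

    def __init__(self, namespace, names):
        self.namespace = namespace
        # Remember the originals so they can be restored on exit.
        self.saved = {name: getattr(namespace, name) for name in names}

    def __enter__(self):
        for name in self.saved:
            setattr(self.namespace, name, lambda *args, **kwargs: None)
        return self

    def __exit__(self, *exc_info):
        for name, fn in self.saved.items():
            setattr(self.namespace, name, fn)
        return False  # do not swallow exceptions

with SkipInit(init, ["zeros_"]):
    inside = init.zeros_([1, 2])   # patched: does nothing
outside = init.zeros_([1, 2])      # restored: original behavior
```

# Restoring in `__exit__` matters: the patch is global, so leaving it in
# place would silently disable initialization for unrelated code.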
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/fsdp/precision.py
================================================
import torch
from torch import nn
def set_require_grad_param_to_fp32(model: nn.Module):
def traverse(module: nn.Module):
for m_name, child in module.named_children():
all_require_grad = True
for p_name, param in child.named_parameters():
if not param.requires_grad:
all_require_grad = False
break
if all_require_grad:
child.to(torch.float32)
traverse(child)
traverse(model)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/fsdp/wrap.py
================================================
from torch import nn
from xtuner._lite import get_logger
logger = get_logger()
_LAYERS = [
'InternLM2DecoderLayer', 'CLIPVisionModel', 'LlavaMultiModalProjector',
'LlamaDecoderLayer', 'Qwen2DecoderLayer'
]
def layer_auto_wrap_policy(
module,
recurse: bool,
nonwrapped_numel: int,
layer_cls=_LAYERS,
) -> bool:
if recurse:
# always recurse
return True
else:
# if not recursing, decide whether we should wrap for
# the leaf node or remainder
return module.__class__.__name__ in layer_cls
def layer_and_emb_wrap_policy(
module,
recurse: bool,
nonwrapped_numel: int,
vocab_size,
layer_cls=_LAYERS,
) -> bool:
if recurse:
# always recurse
return True
else:
# if not recursing, decide whether we should wrap for
# the leaf node or remainder
if module.__class__.__name__ in layer_cls or isinstance(
module, nn.Embedding):
return True
elif isinstance(module, nn.Linear):
return module.weight.size(0) == vocab_size
else:
return False
def token_embedding_wrap_policy(
module,
recurse: bool,
nonwrapped_numel: int,
vocab_size: int,
) -> bool:
if recurse:
# always recurse
return True
if isinstance(module, (nn.Embedding, nn.Linear)):
if module.weight.size(0) == vocab_size:
return True
return False
def all_required_grad_wrap_policy(
module,
recurse: bool,
nonwrapped_numel: int,
) -> bool:
if recurse:
# always recurse
return True
requires_grads = [p.requires_grad for p in module.parameters()]
if len(requires_grads) and all(requires_grads):
logger.debug(module)
return True
return False
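# Illustrative sketch (not part of the upstream file): how a name-based wrap
# policy such as `layer_auto_wrap_policy` answers the two questions it is
# asked. The classes are hypothetical stand-ins, and the calls below imitate
# the callback convention; they do not invoke FSDP itself.

```python
LAYERS = ["Qwen2DecoderLayer", "LlamaDecoderLayer"]

# Hypothetical stand-ins; only the class names are inspected.
class Qwen2DecoderLayer: ...
class Dropout: ...

def layer_policy(module, recurse, nonwrapped_numel, layer_cls=LAYERS):
    if recurse:
        # Always keep descending into children.
        return True
    # Wrap only modules whose class name is listed.
    return module.__class__.__name__ in layer_cls

wrap_decoder = layer_policy(Qwen2DecoderLayer(), recurse=False, nonwrapped_numel=0)
wrap_dropout = layer_policy(Dropout(), recurse=False, nonwrapped_numel=0)
descend = layer_policy(Dropout(), recurse=True, nonwrapped_numel=0)
```

# Matching on class names rather than classes avoids importing every model
# implementation, at the cost of also matching unrelated classes that happen
# to share a name.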
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/logger.py
================================================
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/plans/internlm2.py
================================================
from torch.distributed.tensor.parallel import ColwiseParallel, RowwiseParallel
layer_tp_plan = {
# By default, the input layout of ColwiseParallel is replicated
# and the output layout of RowwiseParallel is replicated.
'attention.wqkv': ColwiseParallel(),
'attention.wo': RowwiseParallel(),
'feed_forward.w1': ColwiseParallel(),
'feed_forward.w2': RowwiseParallel(),
'feed_forward.w3': ColwiseParallel(),
}
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/sampler.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
import random
from typing import Iterator, Optional, Sized
import torch
from mmengine.dist import sync_random_seed
from torch.distributed.device_mesh import DeviceMesh
from torch.utils.data import ConcatDataset as TorchConcatDataset
from torch.utils.data import Sampler
class ParallelSampler(Sampler):
"""The default data sampler for both distributed and non-distributed
environment.
It has several differences from the PyTorch ``DistributedSampler`` as
below:
1. This sampler supports non-distributed environment.
2. The round up behaviors are a little different.
- If ``round_up=True``, this sampler will add extra samples to make the
number of samples evenly divisible by the world size. This
behavior is the same as the ``DistributedSampler`` with
``drop_last=False``.
- If ``round_up=False``, this sampler won't remove or add any samples
while the ``DistributedSampler`` with ``drop_last=True`` will remove
tail samples.
Args:
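dataset (Sized): The dataset.
dp_mesh (DeviceMesh): The data-parallel device mesh. It provides the
rank and world size used to shard the indices.
global_batch_size (int): The global batch size. It must be divisible
by the data-parallel world size.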
shuffle (bool): Whether shuffle the dataset or not. Defaults to True.
seed (int, optional): Random seed used to shuffle the sampler if
:attr:`shuffle=True`. This number should be identical across all
processes in the distributed group. Defaults to None.
round_up (bool): Whether to add extra samples to make the number of
samples evenly divisible by the world size. Defaults to True.
"""
def __init__(
self,
dataset: Sized,
dp_mesh: DeviceMesh,
global_batch_size: int,
shuffle: bool = True,
seed: Optional[int] = None,
round_up: bool = True,
) -> None:
rank = dp_mesh.get_local_rank()
world_size = dp_mesh.size()
assert global_batch_size % world_size == 0
self.global_batch_size = global_batch_size
self.rank = rank
self.world_size = world_size
self.dataset = dataset
self.shuffle = shuffle
if seed is None:
seed = sync_random_seed()
self.seed = seed
self.epoch = 0
self.step = 0
self.round_up = round_up
if self.round_up:
self.num_samples = math.ceil(
len(self.dataset) /
global_batch_size) * global_batch_size // world_size
self.total_size = self.num_samples * self.world_size
else:
self.num_samples = math.ceil(
(len(self.dataset) - rank) / world_size)
self.total_size = len(self.dataset)
def __iter__(self) -> Iterator[int]:
"""Iterate the indices."""
# deterministically shuffle based on epoch and seed
if self.shuffle:
g = torch.Generator()
g.manual_seed(self.seed + self.epoch)
indices = torch.randperm(len(self.dataset), generator=g).tolist()
else:
indices = torch.arange(len(self.dataset)).tolist()
# add extra samples to make it evenly divisible
if self.round_up:
indices = (
indices *
int(self.total_size / len(indices) + 1))[:self.total_size]
# subsample
indices = indices[self.rank:self.total_size:self.world_size]
return iter(indices[self.step:])
def __len__(self) -> int:
"""The number of samples in this rank."""
return self.num_samples - self.step
def set_epoch(self, epoch: int, step=0) -> None:
"""Sets the epoch for this sampler.
When :attr:`shuffle=True`, this ensures all replicas use a different
random ordering for each epoch. Otherwise, the next iteration of this
sampler will yield the same ordering.
Args:
epoch (int): Epoch number.
step (int): Number of leading indices of this rank to skip.
Defaults to 0.
"""
self.epoch = epoch
self.step = step
def get_length_grouped_indices(max_lengths,
group_batch_size,
dp_size,
seed=None):
if seed is not None:
torch.manual_seed(seed)
random.seed(seed)
assert all(leng != 0
for leng in max_lengths), 'Should not have zero length.'
indices = torch.randperm(len(max_lengths))
megabatches = [
indices[i:i + group_batch_size].tolist()
for i in range(0, len(max_lengths), group_batch_size)
]
output = []
for megabatch in megabatches:
megabatch = sorted(
megabatch, key=lambda i: max_lengths[i], reverse=True)
grouped_megabatch = [
megabatch[i:i + dp_size] for i in range(0, len(megabatch), dp_size)
]
random.shuffle(grouped_megabatch)
for group in grouped_megabatch:
output.extend(group)
return output
class LengthGroupedSampler(Sampler):
def __init__(self,
dataset: Sized,
dp_mesh: DeviceMesh,
global_batch_size: int,
mega_batch_mult: Optional[int] = None,
seed: Optional[int] = None,
round_up: bool = True) -> None:
rank = dp_mesh.get_local_rank()
world_size = dp_mesh.size()
self.rank = rank
self.world_size = world_size
assert global_batch_size % world_size == 0
self.dataset = dataset
if seed is None:
seed = sync_random_seed()
self.seed = seed
self.epoch = 0
self.step = 0
self.round_up = round_up
if self.round_up:
self.num_samples = math.ceil(
len(self.dataset) /
global_batch_size) * global_batch_size // world_size
self.total_size = self.num_samples * self.world_size
else:
self.num_samples = math.ceil(
(len(self.dataset) - rank) / world_size)
self.total_size = len(self.dataset)
if mega_batch_mult is None:
# Default for mega_batch_mult: 50 or the number to get 4
# megabatches, whichever is smaller.
mega_batch_mult = min(
len(self.dataset) // (global_batch_size * 4), 50)
# Just in case, for tiny datasets
if mega_batch_mult == 0:
mega_batch_mult = 1
self.group_batch_size = mega_batch_mult * global_batch_size
if isinstance(self.dataset, TorchConcatDataset):
max_lengths = []
for sub_dataset in self.dataset.datasets:
max_lengths.extend(sub_dataset.max_length_per_pack)
self.max_lengths = max_lengths
else:
self.max_lengths = self.dataset.max_length_per_pack
assert isinstance(self.max_lengths, (list, tuple))
self.global_batch_size = global_batch_size
def __iter__(self) -> Iterator[int]:
"""Iterate the indices."""
generator = torch.Generator()
generator.manual_seed(self.seed + self.epoch)
seed = self.seed + self.epoch
indices = get_length_grouped_indices(
max_lengths=self.max_lengths,
group_batch_size=self.group_batch_size,
dp_size=self.world_size,
seed=seed)
assert len(set(indices)) == len(indices)
# add extra samples to make it evenly divisible
if self.round_up:
indices = (
indices *
int(self.total_size / len(indices) + 1))[:self.total_size]
# subsample
assert len(indices) == self.total_size
indices = indices[self.rank:self.total_size:self.world_size]
assert len(indices) == self.num_samples
return iter(indices[self.step:])
def __len__(self) -> int:
"""The number of samples in this rank."""
return self.num_samples - self.step
def set_epoch(self, epoch: int, step=0) -> None:
"""Sets the epoch for this sampler.
The epoch is added to the seed, so all replicas use a different
length-grouped ordering for each epoch. Otherwise, the next iteration
of this sampler will yield the same ordering.
Args:
epoch (int): Epoch number.
step (int): Number of leading indices of this rank to skip.
Defaults to 0.
"""
self.epoch = epoch
self.step = step
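# Illustrative sketch (not part of the upstream file): the round-up and
# striding arithmetic shared by both samplers above, in plain Python.
# Shuffling is left out so the result is easy to follow; the helper name
# `per_rank_indices` is hypothetical.

```python
import math

def per_rank_indices(dataset_len, global_batch_size, world_size, rank):
    # Round the sample count up to a whole number of global batches,
    # then divide it evenly across ranks.
    num_samples = (math.ceil(dataset_len / global_batch_size)
                   * global_batch_size // world_size)
    total_size = num_samples * world_size
    indices = list(range(dataset_len))
    # Repeat the index list and truncate, so early samples pad the tail.
    indices = (indices * int(total_size / len(indices) + 1))[:total_size]
    # Each rank takes every `world_size`-th index, starting at its rank.
    return indices[rank:total_size:world_size]

# 10 samples, global batch of 8, 4 ranks: padded to 16 indices, 4 per rank.
shards = [per_rank_indices(10, 8, 4, rank) for rank in range(4)]
for rank, shard in enumerate(shards):
    print(rank, shard)
```

# With these numbers, samples 0 to 5 appear twice across the ranks because
# they are reused as padding.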
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/sequence/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dist import init_dist
from .attention import (post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn,
sequence_parallel_wrapper)
from .data_collate import (pad_cumulative_len_for_sequence_parallel,
pad_for_sequence_parallel)
from .ops import (gather_for_sequence_parallel, gather_forward_split_backward,
split_for_sequence_parallel, split_forward_gather_backward)
from .reduce_loss import reduce_sequence_parallel_loss
__all__ = [
'sequence_parallel_wrapper', 'pre_process_for_sequence_parallel_attn',
'post_process_for_sequence_parallel_attn', 'split_for_sequence_parallel',
'init_dist', 'gather_for_sequence_parallel',
'split_forward_gather_backward', 'gather_forward_split_backward',
'pad_cumulative_len_for_sequence_parallel', 'pad_for_sequence_parallel',
'reduce_sequence_parallel_loss'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/sequence/attention.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch.distributed as dist
from ..comm import all_to_all
from ..setup import get_sp_group, get_sp_world_size
def pre_process_for_sequence_parallel_attn(query_states,
key_states,
value_states,
scatter_dim=2,
gather_dim=1):
sequence_parallel_world_size = get_sp_world_size()
n_head = query_states.shape[2]
assert n_head % sequence_parallel_world_size == 0, \
('The number of attention heads should be divisible by '
f'sequence_parallel_world_size. But got n_head = {n_head} and '
f'sequence_parallel_world_size = {sequence_parallel_world_size}.')
# (b, s // sp_world_size, nd, dim) -> (b, s, nd // sp_world_size, dim)
sequence_parallel_group = get_sp_group()
query_states = all_to_all(
query_states,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
key_states = all_to_all(
key_states,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
value_states = all_to_all(
value_states,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
return query_states, key_states, value_states
def post_process_for_sequence_parallel_attn(attn_output,
scatter_dim=1,
gather_dim=2):
# (b, s, nd // sp_world_size, dim) -> (b, s // sp_world_size, nd, dim)
sequence_parallel_group = get_sp_group()
output = all_to_all(
attn_output,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
return output
def sequence_parallel_wrapper(local_attn):
def sequence_parallel_attn(query_states, key_states, value_states, *args,
**kwargs):
training = kwargs.pop('training', True)
enable_sequence_parallel = (
dist.is_initialized() and get_sp_world_size() > 1 and training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
out = local_attn(query_states, key_states, value_states, *args,
**kwargs)
if enable_sequence_parallel:
out = post_process_for_sequence_parallel_attn(out).contiguous()
return out
return sequence_parallel_attn
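# Illustrative sketch (not part of the upstream file): the layout change
# performed by the all-to-all in `pre_process_for_sequence_parallel_attn`,
# simulated in one process with nested lists instead of tensors. Each rank
# starts with a slice of the sequence and all heads, and ends with the full
# sequence and a slice of the heads. `simulate_all_to_all` is a hypothetical
# helper; no real communication takes place.

```python
def simulate_all_to_all(per_rank, world_size):
    """per_rank[r] is a list of local sequence positions; each position is
    a list with one entry per attention head."""
    outputs = []
    for dst in range(world_size):
        gathered = []
        # Concatenate along the sequence axis in source-rank order.
        for src in range(world_size):
            for row in per_rank[src]:
                heads_per_rank = len(row) // world_size
                # Rank `dst` receives only its slice of the heads.
                start = dst * heads_per_rank
                gathered.append(row[start:start + heads_per_rank])
        outputs.append(gathered)
    return outputs

seq_len, n_head, sp = 4, 4, 2
local_len = seq_len // sp
# Entry (s, h) labels sequence position s and head h.
per_rank = [
    [[(s, h) for h in range(n_head)]
     for s in range(r * local_len, (r + 1) * local_len)]
    for r in range(sp)
]
out = simulate_all_to_all(per_rank, sp)
# Before: each rank holds 2 positions x 4 heads.
# After: each rank holds 4 positions x 2 heads.
```

# This is why the number of heads must be divisible by the sequence-parallel
# world size, as the assertion in the pre-processing function requires.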
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/sequence/data_collate.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from ..setup import get_sp_world_size
def pad_for_sequence_parallel(tensor, padding_value, dim=-1):
length = tensor.shape[dim]
seq_parallel_world_size = get_sp_world_size()
if length % seq_parallel_world_size == 0:
return tensor
pad_num = seq_parallel_world_size - (length % seq_parallel_world_size)
pad_shape = (*tensor.shape[:dim], pad_num,
*tensor.shape[dim + 1:]) if dim != -1 else (
*tensor.shape[:dim], pad_num)
pad = torch.full(
pad_shape, padding_value, dtype=tensor.dtype, device=tensor.device)
tensor = torch.cat([tensor, pad], dim=dim)
return tensor
# This function is only needed when the following two conditions are met:
# 1. use_varlen_attn = True
# 2. pack_to_max_length = True and the lengths of each sequence are different
def pad_cumulative_len_for_sequence_parallel(cumulative_len):
assert len(cumulative_len) == 1
seqlen = cumulative_len[0][-1]
seq_parallel_world_size = get_sp_world_size()
if seqlen % seq_parallel_world_size == 0:
return cumulative_len, None
bs = len(cumulative_len)
pad_len = seq_parallel_world_size - (seqlen % seq_parallel_world_size)
seqlen_new = seqlen + pad_len
attention_mask = torch.zeros(
bs, seqlen_new, dtype=torch.bool, device=cumulative_len[0].device)
attention_mask[:, :seqlen] = True
for i, cu_len in enumerate(cumulative_len):
pad = torch.tensor([seqlen_new],
device=cu_len.device,
dtype=cu_len.dtype)
cumulative_len[i] = torch.cat([cu_len, pad], dim=0)
return cumulative_len, attention_mask
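# Illustrative sketch (not part of the upstream file): the padding arithmetic
# of the two helpers above, on plain lists instead of tensors. The function
# names and the explicit `sp_world_size` argument are hypothetical; the real
# helpers read the world size from `get_sp_world_size()`.

```python
def pad_to_multiple(tokens, padding_value, sp_world_size):
    remainder = len(tokens) % sp_world_size
    if remainder == 0:
        return tokens
    # Append just enough padding to reach the next multiple.
    return tokens + [padding_value] * (sp_world_size - remainder)

def pad_cumulative_len(cu_len, sp_world_size):
    seqlen = cu_len[-1]
    remainder = seqlen % sp_world_size
    if remainder == 0:
        return cu_len
    # The padding is recorded as one extra segment ending at the new length.
    return cu_len + [seqlen + sp_world_size - remainder]

padded = pad_to_multiple(list(range(10)), 0, 4)   # length 10 -> 12
cu = pad_cumulative_len([0, 3, 10], 4)            # boundaries gain a 12
```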
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/sequence/ops.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.distributed as dist
def split_for_sequence_parallel(input, dim: int, sp_group: dist.ProcessGroup):
"""Splits the input tensor along a given dimension for sequence parallel.
Args:
input: The input tensor to be split.
dim: The dimension along which the tensor should be split.
sp_group: The sequence parallel process group.
Returns:
The split tensor corresponding to the current rank's chunk.
"""
world_size = dist.get_world_size(sp_group)
if world_size == 1:
return input
rank = dist.get_rank(sp_group)
dim_size = input.size(dim)
assert dim_size % world_size == 0, (
f'The dimension to split ({dim_size}) is not a multiple of '
f'world size ({world_size}), cannot split tensor evenly')
tensor_list = torch.split(input, dim_size // world_size, dim=dim)
output = tensor_list[rank].contiguous()
return output
def gather_for_sequence_parallel(input, dim: int, sp_group: dist.ProcessGroup):
"""Gathers the input tensor along a given dimension for sequence parallel.
Args:
input: The input tensor to be gathered.
dim: The dimension along which the tensor should be gathered.
sp_group: The sequence parallel process group.
Returns:
The gathered tensor concatenated along the specified dimension.
"""
input = input.contiguous()
world_size = dist.get_world_size(sp_group)
if world_size == 1:
return input
tensor_list = [torch.empty_like(input) for _ in range(world_size)]
assert input.device.type == 'cuda'
dist.all_gather(tensor_list, input, group=sp_group)
output = torch.cat(tensor_list, dim=dim).contiguous()
return output
class _GatherForwardSplitBackward(torch.autograd.Function):
"""Gather the input during forward.
Scale and split the grad, keeping only the chunk that corresponds to the rank
during backward.
"""
@staticmethod
def forward(ctx, input, dim, sp_group, grad_scale):
ctx.dim = dim
ctx.sp_group = sp_group
ctx.grad_scale = grad_scale
return gather_for_sequence_parallel(input, dim, sp_group)
@staticmethod
def backward(ctx, grad_output):
if ctx.grad_scale == 'up':
grad_output = grad_output * dist.get_world_size(ctx.sp_group)
elif ctx.grad_scale == 'down':
grad_output = grad_output / dist.get_world_size(ctx.sp_group)
return (split_for_sequence_parallel(grad_output, ctx.dim,
ctx.sp_group), None, None, None)
class _SplitForwardGatherBackward(torch.autograd.Function):
"""Split the input and keep only the chunk corresponding to the rank during
forward.
Scale and gather the grad during backward.
"""
@staticmethod
def forward(ctx, input, dim, sp_group, grad_scale):
ctx.dim = dim
ctx.sp_group = sp_group
ctx.grad_scale = grad_scale
return split_for_sequence_parallel(input, dim, sp_group)
@staticmethod
def backward(ctx, grad_output):
if ctx.grad_scale == 'up':
grad_output = grad_output * dist.get_world_size(ctx.sp_group)
elif ctx.grad_scale == 'down':
grad_output = grad_output / dist.get_world_size(ctx.sp_group)
return (gather_for_sequence_parallel(grad_output, ctx.dim,
ctx.sp_group), None, None, None)
def split_forward_gather_backward(input, dim, sp_group, grad_scale=None):
"""Split tensors according to the sp rank during forward propagation and
gather the grad from the whole sp group during backward propagation.
1. When do we need this? When input.requires_grad = True.
2. Why do we need grad scale?
We have to scale down the grads because `gather_forward_split_backward`
scales up the grads.
"""
return _SplitForwardGatherBackward.apply(input, dim, sp_group, grad_scale)
def gather_forward_split_backward(input, dim, sp_group, grad_scale=None):
"""Gather tensors from the whole sp group during forward propagation and
split the grad according to the sp rank during backward propagation.
1. When do we need this?
When sp is greater than 1, we need to slice the input `x` along
sequence length dimension before it is passed into the model and get
`sub_seq_x`. We then pass `sub_seq_x` into model and get output
`sub_seq_out`. If the loss calculation process needs to use the complete
output, we have to gather the `sub_seq_out` in all sp ranks during forward
propagation and split the grad during backward propagation.
2. Why do we need grad scale?
Here is a simple case.
-------- SP 1 -----------
Suppose here is a toy model with only one linear module
(in_features = 2, out_features = 1) and the input x has shape(2, 2).
Y = [[y1], = [[w11x11 + w21x12], = [[x11, x12], dot [[w11],
[y2]] [w11x21 + w21x22]] [x21, x22]] [w21]]
z = mean(Y) = (y1 + y2) / 2
Here is the partial derivative of z with respect to w11:
∂z / ∂w11 = ∂z / ∂y1 * ∂y1 / ∂w11 + ∂z / ∂y2 * ∂y2 / ∂w11
= 1/2 * x11 + 1/2 * x21 = (x11 + x21) / 2
-------- SP 2 -----------
When sequence parallel world size is set to 2, we will split the input x
and scatter the halves to the two ranks in the same sequence parallel group.
```Step 1
Y_rank0 = [[y1]] = [[w11x11 + w21x12]] = [[x11, x12]] dot [[w11, w21]]^T
Y_rank1 = [[y2]] = [[w11x21 + w21x22]] = [[x21, x22]] dot [[w11, w21]]^T
```
Then, we have to gather them:
```Step 2
Y_rank0 = [[y1],
detach([y2])]
Y_rank1 = [detach([y1]),
[y2]]
```
Note that y2 in Y_rank0 does not have grad, neither does y1 in Y_rank1.
Similarly, we calculate the loss in each rank:
```Step 3
z_rank0 = mean(Y_rank0) = (y1 + detach(y2)) / 2
z_rank1 = mean(Y_rank1) = (detach(y1) + y2) / 2
```
So the partial derivative of loss_rank0 with respect to w11:
```∂z / ∂w11 = ∂z / ∂y1 * ∂y1 / ∂w11 = x11 / 2```
The same for rank1:
```∂z / ∂w11 = ∂z / ∂y2 * ∂y2 / ∂w11 = x21 / 2```
Finally, we need to all_reduce them:
```Step 4
On both ranks:
∂z / ∂w11 = (x11 / 2 + x21 / 2) / 2 = (x11 + x21) / 4
```
In SP2, the gradient of each param is only half of that in SP1.
So we should scale up the grad during the backward process in Step 2.
""" # noqa: E501
return _GatherForwardSplitBackward.apply(input, dim, sp_group, grad_scale)
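
# The derivation in the docstring above can be checked on a single process.
# This is a sketch under stated assumptions: the two SP ranks are simulated in
# a loop (no process group is created), torch.where with a detached copy stands
# in for the gather of Step 2, and the final mean stands in for the all_reduce
# of Step 4. The tensor values and the names grad_sp1 / grad_sp2 are
# illustrative only.

```python
import torch

x = torch.tensor([[1.0, 2.0], [3.0, 4.0]])
w_init = torch.tensor([[0.5], [-1.0]])

# SP 1: the whole input lives on one rank.
w = w_init.clone().requires_grad_(True)
(x @ w).mean().backward()
grad_sp1 = w.grad.clone()

# SP 2: each simulated rank only back-propagates through its own row of Y;
# the row received from the other rank is detached.
rank_grads = []
for rank in range(2):
    w_rank = w_init.clone().requires_grad_(True)
    y = x @ w_rank
    keep = torch.zeros(2, 1, dtype=torch.bool)
    keep[rank] = True
    y_gathered = torch.where(keep, y, y.detach())
    y_gathered.mean().backward()
    rank_grads.append(w_rank.grad.clone())

# Averaging across ranks mimics the gradient all_reduce.
grad_sp2 = torch.stack(rank_grads).mean(dim=0)
```

# Here grad_sp2 is half of grad_sp1, which is why the backward pass of the
# gather is scaled up by the sequence parallel world size.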
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/sequence/reduce_loss.py
================================================
import torch
import torch.distributed as dist
from ..setup import get_sp_group
class _ReduceLoss(torch.autograd.Function):
@staticmethod
def forward(ctx, mean_loss, loss_scale, process_group):
ctx.mode = process_group
if loss_scale == 0:
# convert nan to 0 just for logging
mean_loss = torch.nan_to_num(mean_loss)
loss_sum = mean_loss * loss_scale
dist.all_reduce(loss_sum, group=process_group)
dist.all_reduce(loss_scale, group=process_group)
loss = loss_sum / loss_scale
return loss
@staticmethod
def backward(ctx, grad_output):
return grad_output, None, None
def reduce_sequence_parallel_loss(mean_loss,
loss_scale,
sp_group: dist.ProcessGroup = None):
if dist.get_world_size(sp_group) == 1:
return mean_loss
if sp_group is None:
# avoid bc breaking
sp_group = get_sp_group()
return _ReduceLoss.apply(mean_loss, loss_scale, sp_group)
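
# A plain-Python sketch of the arithmetic in _ReduceLoss.forward, assuming
# loss_scale is the number of tokens that contributed to each rank's mean loss.
# The two sums below stand in for the two dist.all_reduce calls; the helper
# name and the sample numbers are illustrative.

```python
def reduce_loss(per_rank):
    """per_rank: list of (mean_loss, loss_scale) pairs, one per SP rank."""
    loss_sum = sum(mean_loss * scale for mean_loss, scale in per_rank)
    total_scale = sum(scale for _, scale in per_rank)
    return loss_sum / total_scale


per_rank = [(2.0, 30), (4.0, 10)]
weighted = reduce_loss(per_rank)
naive = sum(mean_loss for mean_loss, _ in per_rank) / len(per_rank)
```

# The weighted result matches the mean over all 40 tokens, whereas the naive
# mean of per-rank means over-weights the rank that holds fewer tokens.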
================================================
FILE: xtuner-eval_niah/xtuner/_lite/parallel/setup.py
================================================
import torch.distributed as dist
from torch.distributed.device_mesh import init_device_mesh
_SP_MESH = None
_DP_MESH = None
_TP_MESH = None
_SP_GROUP = None
_DP_GROUP = None
_TP_GROUP = None
_SP_WORLD_SIZE = None
_DP_WORLD_SIZE = None
_TP_WORLD_SIZE = None
def setup_sp(sp_size):
world_size = dist.get_world_size()
assert world_size % sp_size == 0
dp_size = world_size // sp_size
device_mesh = init_device_mesh(
'cuda', (dp_size, sp_size), mesh_dim_names=('dp', 'sp'))
global _SP_MESH, _DP_MESH
_SP_MESH = device_mesh['sp']
_DP_MESH = device_mesh['dp']
global _SP_GROUP, _DP_GROUP
_SP_GROUP = device_mesh.get_group('sp')
_DP_GROUP = device_mesh.get_group('dp')
def setup_tp(tp_size):
world_size = dist.get_world_size()
assert world_size % tp_size == 0
dp_size = world_size // tp_size
device_mesh = init_device_mesh(
'cuda', (dp_size, tp_size), mesh_dim_names=('dp', 'tp'))
global _TP_MESH, _DP_MESH
_TP_MESH = device_mesh['tp']
_DP_MESH = device_mesh['dp']
global _TP_GROUP, _DP_GROUP
_TP_GROUP = device_mesh.get_group('tp')
_DP_GROUP = device_mesh.get_group('dp')
def setup_dp():
world_size = dist.get_world_size()
device_mesh = init_device_mesh(
'cuda', (world_size, ), mesh_dim_names=('dp', ))
global _DP_MESH
_DP_MESH = device_mesh['dp']
global _DP_GROUP
_DP_GROUP = device_mesh.get_group('dp')
_SP_ULYESS_MESH = None
_SP_RING_MESH = None
_SP_ULYESS_GROUP = None
_SP_RING_GROUP = None
_SP_ULYESS_WORLD_SIZE = None
_SP_RING_WORLD_SIZE = None
def set_seq_parallel_pg(sp_ulysses_degree, sp_ring_degree):
"""
sp_ulysses_degree x sp_ring_degree = seq_parallel_degree
(ulysses_degree, dp_degree)
"""
world_size = dist.get_world_size()
sp_size = sp_ulysses_degree * sp_ring_degree
dp_size = world_size // sp_size
global_device_mesh = init_device_mesh(
'cuda', (dp_size, sp_size), mesh_dim_names=('dp', 'sp'))
# TODO HHA: Critical: the dimension order must not be changed; it cannot be (dp_size, sp_ulysses_degree, sp_ring_degree)
device_mesh = init_device_mesh(
'cuda', (dp_size, sp_ring_degree, sp_ulysses_degree), mesh_dim_names=('dp', 'sp_ring', 'sp_ulysses'))
global _DP_MESH, _SP_ULYESS_MESH, _SP_RING_MESH, _SP_MESH
_DP_MESH = device_mesh['dp']
_SP_ULYESS_MESH = device_mesh['sp_ulysses']
_SP_RING_MESH = device_mesh['sp_ring']
_SP_MESH = global_device_mesh['sp']
global _DP_GROUP, _SP_ULYESS_GROUP, _SP_RING_GROUP, _SP_GROUP
_DP_GROUP = device_mesh.get_group('dp')
_SP_ULYESS_GROUP = device_mesh.get_group('sp_ulysses')
_SP_RING_GROUP = device_mesh.get_group('sp_ring')
_SP_GROUP = global_device_mesh.get_group('sp')
def setup_parallel(sp_size=1, tp_size=1, ring_size=1):
assert not (sp_size > 1 and tp_size > 1), \
('DeepSpeed Sequence Parallel can not be used with '
'Megatron-LM Tensor Parallel')
if sp_size > 1:
assert sp_size % ring_size == 0
sp_ulysses_size = sp_size // ring_size
set_seq_parallel_pg(sp_ulysses_size, ring_size)
elif tp_size > 1:
setup_tp(tp_size)
else:
setup_dp()
def get_ulysess_mesh():
return _SP_ULYESS_MESH
def get_ring_mesh():
return _SP_RING_MESH
def get_ulysess_group():
return _SP_ULYESS_GROUP
def get_ring_group():
return _SP_RING_GROUP
def get_ulysess_world_size():
global _SP_ULYESS_WORLD_SIZE
if _SP_ULYESS_WORLD_SIZE is not None:
return _SP_ULYESS_WORLD_SIZE
if not dist.is_initialized() or (_SP_ULYESS_GROUP is None):
_SP_ULYESS_WORLD_SIZE = 1
else:
_SP_ULYESS_WORLD_SIZE = dist.get_world_size(_SP_ULYESS_GROUP)
return _SP_ULYESS_WORLD_SIZE
def get_ring_world_size():
global _SP_RING_WORLD_SIZE
if _SP_RING_WORLD_SIZE is not None:
return _SP_RING_WORLD_SIZE
if not dist.is_initialized() or (_SP_RING_GROUP is None):
_SP_RING_WORLD_SIZE = 1
else:
_SP_RING_WORLD_SIZE = dist.get_world_size(_SP_RING_GROUP)
return _SP_RING_WORLD_SIZE
def get_dp_mesh():
return _DP_MESH
def get_dp_group():
return _DP_GROUP
def get_dp_world_size():
global _DP_WORLD_SIZE
if _DP_WORLD_SIZE is not None:
return _DP_WORLD_SIZE
if not dist.is_initialized() or (_DP_GROUP is None):
_DP_WORLD_SIZE = 1
else:
_DP_WORLD_SIZE = dist.get_world_size(_DP_GROUP)
return _DP_WORLD_SIZE
def get_sp_mesh():
return _SP_MESH
def get_sp_group():
return _SP_GROUP
def get_sp_world_size():
global _SP_WORLD_SIZE
if _SP_WORLD_SIZE is not None:
return _SP_WORLD_SIZE
if not dist.is_initialized() or (_SP_GROUP is None):
_SP_WORLD_SIZE = 1
else:
_SP_WORLD_SIZE = dist.get_world_size(_SP_GROUP)
return _SP_WORLD_SIZE
def get_tp_mesh():
return _TP_MESH
def get_tp_group():
return _TP_GROUP
def get_tp_world_size():
global _TP_WORLD_SIZE
if _TP_WORLD_SIZE is not None:
return _TP_WORLD_SIZE
if not dist.is_initialized() or (_TP_GROUP is None):
_TP_WORLD_SIZE = 1
else:
_TP_WORLD_SIZE = dist.get_world_size(_TP_GROUP)
return _TP_WORLD_SIZE
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/__init__.py
================================================
from .hybrid import *
from .ring import *
from .ulysses import *
from .globals import set_seq_parallel_pg
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/comm/__init__.py
================================================
from .all_to_all import *
from .extract_local import *
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/comm/all_to_all.py
================================================
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
import torch
from typing import Any, Tuple
from torch import Tensor
from torch.nn import Module
import torch.distributed as dist
def all_to_all_4D(
input: torch.tensor, scatter_idx: int = 2, gather_idx: int = 1, group=None
) -> torch.tensor:
"""
all-to-all for QKV
Args:
input (torch.tensor): a tensor sharded along dim scatter dim
scatter_idx (int): default 2
gather_idx (int): default 1
group : torch process group
Returns:
torch.tensor: resharded tensor, (bs, seqlen, hc/P, hs) for scatter_idx=2 and
gather_idx=1, or (bs, seqlen/P, hc, hs) for scatter_idx=1 and gather_idx=2
"""
assert (
input.dim() == 4
), f"input must be 4D tensor, got {input.dim()} and shape {input.shape}"
seq_world_size = dist.get_world_size(group)
if scatter_idx == 2 and gather_idx == 1:
# input (torch.tensor): a tensor sharded along dim 1 (bs, seqlen/P, hc, hs) output: (bs, seqlen, hc/P, hs)
bs, shard_seqlen, hc, hs = input.shape
seqlen = shard_seqlen * seq_world_size
shard_hc = hc // seq_world_size
# transpose groups of heads with the seq-len parallel dimension, so that we can scatter them!
# (bs, seqlen/P, hc, hs) -reshape-> (bs, seq_len/P, P, hc/P, hs) -transpose(0,2)-> (P, seq_len/P, bs, hc/P, hs)
input_t = (
input.reshape(bs, shard_seqlen, seq_world_size, shard_hc, hs)
.transpose(0, 2)
.contiguous()
)
output = torch.empty_like(input_t)
# https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_to_all_single
# (P, seq_len/P, bs, hc/P, hs) scatter seqlen -all2all-> (P, seq_len/P, bs, hc/P, hs) scatter head
dist.all_to_all_single(output, input_t, group=group)
# if scattering the seq-dim, transpose the heads back to the original dimension
output = output.reshape(seqlen, bs, shard_hc, hs)
# (seq_len, bs, hc/P, hs) -reshape-> (bs, seq_len, hc/P, hs)
output = output.transpose(0, 1).contiguous().reshape(bs, seqlen, shard_hc, hs)
return output
elif scatter_idx == 1 and gather_idx == 2:
# input (torch.tensor): a tensor sharded along dim 2 (bs, seqlen, hc/P, hs) output: (bs, seqlen/P, hc, hs)
bs, seqlen, shard_hc, hs = input.shape
hc = shard_hc * seq_world_size
shard_seqlen = seqlen // seq_world_size
# transpose groups of heads with the seq-len parallel dimension, so that we can scatter them!
# (bs, seqlen, hc/P, hs) -reshape-> (bs, P, seq_len/P, hc/P, hs) -transpose(0, 3)-> (hc/P, P, seqlen/P, bs, hs) -transpose(0, 1) -> (P, hc/P, seqlen/P, bs, hs)
input_t = (
input.reshape(bs, seq_world_size, shard_seqlen, shard_hc, hs)
.transpose(0, 3)
.transpose(0, 1)
.contiguous()
.reshape(seq_world_size, shard_hc, shard_seqlen, bs, hs)
)
output = torch.empty_like(input_t)
# https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_to_all_single
# (P, hc/P, seqlen/P, bs, hs) scatter head -all2all-> (P, hc/P, seqlen/P, bs, hs) scatter seqlen
dist.all_to_all_single(output, input_t, group=group)
# if scattering the seq-dim, transpose the heads back to the original dimension
output = output.reshape(hc, shard_seqlen, bs, hs)
# (hc, seqlen/N, bs, hs) -transpose(0,2)-> (bs, seqlen/N, hc, hs)
output = output.transpose(0, 2).contiguous().reshape(bs, shard_seqlen, hc, hs)
return output
else:
raise RuntimeError("scatter_idx must be 1 or 2 and gather_idx must be 1 or 2")
class SeqAllToAll4D(torch.autograd.Function):
@staticmethod
def forward(
ctx: Any,
group: dist.ProcessGroup,
input: Tensor,
scatter_idx: int,
gather_idx: int,
) -> Tensor:
ctx.group = group
ctx.scatter_idx = scatter_idx
ctx.gather_idx = gather_idx
return all_to_all_4D(input, scatter_idx, gather_idx, group=group)
@staticmethod
def backward(ctx: Any, *grad_output: Tensor) -> Tuple[None, Tensor, None, None]:
return (
None,
SeqAllToAll4D.apply(
ctx.group, *grad_output, ctx.gather_idx, ctx.scatter_idx
),
None,
None,
)
def all_to_all_5D(
input: torch.tensor, scatter_idx: int = 3, gather_idx: int = 1, group=None
) -> torch.tensor:
"""
all-to-all for QKV
forward (bs, seqlen/N, 3, hc, hs) -> (bs, seqlen, 3, hc/N, hs)
Args:
input (torch.tensor): a tensor sharded along dim scatter dim
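scatter_idx (int): default 3
gather_idx (int): default 1
group : torch process group
Returns:
torch.tensor: resharded tensor, (bs, seqlen, 3, hc/P, hs) for scatter_idx=3 and
gather_idx=1, or (bs, seqlen/P, 3, hc, hs) for scatter_idx=1 and gather_idx=3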
"""
assert (
input.dim() == 5
), f"input must be 5D tensor, got {input.dim()} and shape {input.shape}"
seq_world_size = dist.get_world_size(group)
if scatter_idx == 3 and gather_idx == 1:
# input (torch.tensor): a tensor sharded along dim 1 (bs, seqlen/P, 3, hc, hs) output: (bs, seqlen, 3, hc/P, hs)
bs, shard_seqlen, t_cnt, hc, hs = input.shape
assert t_cnt == 3
seqlen = shard_seqlen * seq_world_size
shard_hc = hc // seq_world_size
# transpose groups of heads with the seq-len parallel dimension, so that we can scatter them!
# (bs, seqlen/P, 3, hc, hs) -reshape-> (bs, seq_len/P, 3, P, hc/P, hs) -transpose(0,3)-> (P, seq_len/P, 3, bs, hc/P, hs)
input_t = (
input.reshape(bs, shard_seqlen, 3, seq_world_size, shard_hc, hs)
.transpose(0, 3)
.contiguous()
)
output = torch.empty_like(input_t)
# https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_to_all_single
# (P, seq_len/P, 3, bs, hc/P, hs) scatter seqlen -all2all-> (P, seq_len/P, 3, bs, hc/P, hs) scatter head
dist.all_to_all_single(output, input_t, group=group)
# if scattering the seq-dim, transpose the heads back to the original dimension
output = output.reshape(seqlen, 3, bs, shard_hc, hs)
# (seq_len, 3, bs, hc/P, hs) -trans-> (bs, seq_len, 3, hc/P, hs)
output = output.transpose(0, 2).transpose(1, 2).contiguous()
return output.reshape(bs, seqlen, 3, shard_hc, hs).contiguous()
elif scatter_idx == 1 and gather_idx == 3:
# input (torch.tensor): a tensor sharded along dim 3 (bs, seqlen, 3, hc/P, hs) output: (bs, seqlen/P, 3, hc, hs)
bs, seqlen, _, shard_hc, hs = input.shape
hc = shard_hc * seq_world_size
shard_seqlen = seqlen // seq_world_size
# transpose groups of heads with the seq-len parallel dimension, so that we can scatter them!
# (bs, seqlen, 3, hc/P, hs) -reshape-> (bs, P, seq_len/P, 3, hc/P, hs) -transpose(0, 4)-> (hc/P, P, seqlen/P, 3, bs, hs) -transpose(0, 1) -> (P, hc/P, seqlen/P, 3, bs, hs)
input_t = (
input.reshape(bs, seq_world_size, shard_seqlen, 3, shard_hc, hs)
.transpose(0, 4)
.transpose(0, 1)
.contiguous()
.reshape(seq_world_size, shard_hc, shard_seqlen, 3, bs, hs)
)
output = torch.empty_like(input_t)
# https://pytorch.org/docs/stable/distributed.html#torch.distributed.all_to_all_single
# (P, hc/P, seqlen/P, 3, bs, hs) scatter head -all2all-> (P, hc/P, seqlen/P, 3, bs, hs) scatter seqlen
dist.all_to_all_single(output, input_t, group=group)
# if scattering the seq-dim, transpose the heads back to the original dimension
output = output.reshape(hc, shard_seqlen, 3, bs, hs)
# (hc, seqlen/N, 3, bs, hs) -transpose(0,3)-> (bs, seqlen/N, 3, hc, hs)
output = output.transpose(0, 3).contiguous()
return output.reshape(bs, shard_seqlen, 3, hc, hs).contiguous()
else:
raise RuntimeError("scatter_idx must be 1 or 3 and gather_idx must be 1 or 3")
class SeqAllToAll5D(torch.autograd.Function):
@staticmethod
def forward(
ctx: Any,
group: dist.ProcessGroup,
input: Tensor,
scatter_idx: int = 3,
gather_idx: int = 1,
) -> Tensor:
ctx.group = group
ctx.scatter_idx = scatter_idx
ctx.gather_idx = gather_idx
return all_to_all_5D(input, scatter_idx, gather_idx, group=group)
@staticmethod
def backward(ctx: Any, *grad_output: Tensor) -> Tuple[None, Tensor, None, None]:
return (
None,
SeqAllToAll5D.apply(
ctx.group, *grad_output, ctx.gather_idx, ctx.scatter_idx
),
None,
None,
)
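
# The scatter_idx=2 / gather_idx=1 branch of all_to_all_4D can be exercised
# without a process group. The sketch below is an assumption-laden simulation:
# simulate_all_to_all_single is a hypothetical stand-in for
# dist.all_to_all_single (rank dst receives chunk dst from every rank src),
# and all "ranks" live in one Python list. The reshape and transpose steps
# mirror the ones in the function above.

```python
import torch


def simulate_all_to_all_single(sends):
    """sends[src] has leading dim P; chunk dst of it goes to rank dst."""
    world = len(sends)
    return [
        torch.stack([sends[src][dst] for src in range(world)])
        for dst in range(world)
    ]


def seq_to_head_shards(full, world):
    """Reshard (bs, seqlen/P, hc, hs) per rank into (bs, seqlen, hc/P, hs)."""
    bs, seqlen, hc, hs = full.shape
    shard_seqlen, shard_hc = seqlen // world, hc // world
    sends = []
    for rank in range(world):
        # Each rank starts with its contiguous slice of the sequence.
        local = full[:, rank * shard_seqlen:(rank + 1) * shard_seqlen]
        sends.append(
            local.reshape(bs, shard_seqlen, world, shard_hc, hs)
            .transpose(0, 2)
            .contiguous()
        )
    recvs = simulate_all_to_all_single(sends)
    return [
        r.reshape(seqlen, bs, shard_hc, hs).transpose(0, 1).contiguous()
        for r in recvs
    ]


full = torch.arange(2 * 4 * 4 * 3, dtype=torch.float32).reshape(2, 4, 4, 3)
head_shards = seq_to_head_shards(full, world=2)
```

# After the exchange, simulated rank r holds the full sequence for heads
# r * hc/P up to (r + 1) * hc/P, which is what the ring attention step expects.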
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/comm/extract_local.py
================================================
import torch
import torch.distributed as dist
from ..globals import PROCESS_GROUP
def stripe_extract_local(value, rank, world_size, rd, ud, *args, **kwargs):
# ud at the highest dim
input_dim = value.dim()
if input_dim == 5:
batch_size, seqlen, _, nheads, d = value.shape
elif input_dim == 4:
batch_size, seqlen, nheads, d = value.shape
else:
raise ValueError("value dim should be 4 or 5")
# (ud, L, rd)
value = value.reshape(batch_size, seqlen // rd, rd, -1).contiguous()
value = value.transpose(1, 2).reshape(batch_size, seqlen, -1).contiguous()
value = value.chunk(world_size, dim=1)[rank]
if input_dim == 5:
value = value.reshape(batch_size, seqlen // world_size, 3, nheads, d)
elif input_dim == 4:
value = value.reshape(batch_size, seqlen // world_size, nheads, d)
return value
def basic_extract_local(value, rank, world_size, *args, **kwargs):
return value.chunk(world_size, dim=1)[rank].detach().clone()
def zigzag_extract_local(value, rank, world_size, rd, ud, dim=1, *args, **kwargs):
input_dim = value.dim()
if input_dim == 5:
batch_size, seqlen, _, nheads, d = value.shape
elif input_dim == 4:
batch_size, seqlen, nheads, d = value.shape
else:
raise ValueError("value dim should be 4 or 5")
value_chunks = value.chunk(2 * rd, dim=dim)
# TODO assert ulysses on low dim
r_rank = dist.get_rank(group=PROCESS_GROUP.RING_PG)
u_rank = dist.get_rank(group=PROCESS_GROUP.ULYSSES_PG)
local_value = torch.cat(
[value_chunks[r_rank], value_chunks[2 * rd - r_rank - 1]], dim=dim
).chunk(ud, dim=dim)[u_rank]
if input_dim == 5:
local_value = local_value.reshape(
batch_size, seqlen // world_size, 3, nheads, d
)
elif input_dim == 4:
local_value = local_value.reshape(batch_size, seqlen // world_size, nheads, d)
return local_value.contiguous()
EXTRACT_FUNC_DICT = {
"basic": basic_extract_local,
"strip": stripe_extract_local,
"zigzag": zigzag_extract_local,
}
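
# A plain-Python sketch of the chunk selection in zigzag_extract_local,
# assuming a Ulysses degree of 1 so that only the ring rank matters. It works
# on token indices instead of tensors; the helper names and the cost model
# (a causal token at position i attends to i + 1 positions) are illustrative
# assumptions.

```python
def zigzag_indices(seqlen, rd, r_rank):
    """Token indices kept by ring rank r_rank out of rd ring ranks."""
    assert seqlen % (2 * rd) == 0
    chunk = seqlen // (2 * rd)
    mirror = 2 * rd - r_rank - 1
    first = list(range(r_rank * chunk, (r_rank + 1) * chunk))
    second = list(range(mirror * chunk, (mirror + 1) * chunk))
    return first + second


def causal_cost(indices):
    """Number of key positions visited under a causal mask."""
    return sum(i + 1 for i in indices)


rank0 = zigzag_indices(seqlen=8, rd=2, r_rank=0)
rank1 = zigzag_indices(seqlen=8, rd=2, r_rank=1)
```

# Pairing an early chunk with its mirrored late chunk gives both ranks the same
# causal cost in this example, whereas a plain contiguous split would not.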
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/globals.py
================================================
import torch
class Singleton:
_instance = None
def __new__(cls, *args, **kwargs):
if not cls._instance:
cls._instance = super(Singleton, cls).__new__(cls, *args, **kwargs)
return cls._instance
class ProcessGroupSingleton(Singleton):
def __init__(self):
self.ULYSSES_PG = None
self.RING_PG = None
PROCESS_GROUP = ProcessGroupSingleton()
def set_seq_parallel_pg(
sp_ulysses_degree, sp_ring_degree, rank, world_size, use_ulysses_low=True
):
"""
sp_ulysses_degree x sp_ring_degree = seq_parallel_degree
(ulysses_degree, dp_degree)
"""
sp_degree = sp_ring_degree * sp_ulysses_degree
dp_degree = world_size // sp_degree
assert (
world_size % sp_degree == 0
), f"world_size {world_size} must be divisible by sp_degree {sp_degree}"
num_ulysses_pgs = sp_ring_degree # world_size // sp_ulysses_degree
num_ring_pgs = sp_ulysses_degree # world_size // sp_ring_degree
if use_ulysses_low:
for dp_rank in range(dp_degree):
offset = dp_rank * sp_degree
for i in range(num_ulysses_pgs):
ulysses_ranks = list(
range(
i * sp_ulysses_degree + offset,
(i + 1) * sp_ulysses_degree + offset,
)
)
group = torch.distributed.new_group(ulysses_ranks)
if rank in ulysses_ranks:
ulyssess_pg = group
for i in range(num_ring_pgs):
ring_ranks = list(range(i + offset, sp_degree + offset, num_ring_pgs))
group = torch.distributed.new_group(ring_ranks)
if rank in ring_ranks:
ring_pg = group
else:
for dp_rank in range(dp_degree):
offset = dp_rank * sp_degree
for i in range(num_ring_pgs):
ring_ranks = list(
range(
i * sp_ring_degree + offset, (i + 1) * sp_ring_degree + offset
)
)
group = torch.distributed.new_group(ring_ranks)
if rank in ring_ranks:
ring_pg = group
for i in range(num_ulysses_pgs):
ulysses_ranks = list(
range(i + offset, sp_degree + offset, num_ulysses_pgs)
)
group = torch.distributed.new_group(ulysses_ranks)
if rank in ulysses_ranks:
ulyssess_pg = group
PROCESS_GROUP.ULYSSES_PG = ulyssess_pg
PROCESS_GROUP.RING_PG = ring_pg
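
# The rank arithmetic in set_seq_parallel_pg can be inspected without calling
# torch.distributed.new_group. The sketch below reproduces only the rank lists
# for both values of use_ulysses_low; the helper name sp_rank_groups is an
# illustrative assumption and no process groups are created.

```python
def sp_rank_groups(world_size, sp_ulysses_degree, sp_ring_degree,
                   use_ulysses_low=True):
    """Return (ulysses_groups, ring_groups) as lists of rank lists."""
    sp_degree = sp_ulysses_degree * sp_ring_degree
    assert world_size % sp_degree == 0
    dp_degree = world_size // sp_degree
    num_ulysses_pgs = sp_ring_degree
    num_ring_pgs = sp_ulysses_degree
    ulysses_groups, ring_groups = [], []
    for dp_rank in range(dp_degree):
        offset = dp_rank * sp_degree
        if use_ulysses_low:
            # Ulysses groups take contiguous ranks, ring groups are strided.
            for i in range(num_ulysses_pgs):
                start = i * sp_ulysses_degree + offset
                ulysses_groups.append(
                    list(range(start, start + sp_ulysses_degree)))
            for i in range(num_ring_pgs):
                ring_groups.append(
                    list(range(i + offset, sp_degree + offset, num_ring_pgs)))
        else:
            # Ring groups take contiguous ranks, Ulysses groups are strided.
            for i in range(num_ring_pgs):
                start = i * sp_ring_degree + offset
                ring_groups.append(list(range(start, start + sp_ring_degree)))
            for i in range(num_ulysses_pgs):
                ulysses_groups.append(
                    list(range(i + offset, sp_degree + offset,
                               num_ulysses_pgs)))
    return ulysses_groups, ring_groups


low_u, low_r = sp_rank_groups(8, sp_ulysses_degree=4, sp_ring_degree=2)
high_u, high_r = sp_rank_groups(8, sp_ulysses_degree=4, sp_ring_degree=2,
                                use_ulysses_low=False)
```

# With use_ulysses_low=True the all-to-all traffic stays among neighbouring
# ranks, while the ring exchange spans ranks that are sp_ulysses_degree apart.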
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/hybrid/__init__.py
================================================
from .attn_layer import LongContextAttention, LongContextAttentionQKVPacked, llama3_varlen_attention_sp_ulysses_ring, attention_sp_ulysses_ring
from .async_attn_layer import AsyncLongContextAttention
from .utils import RING_IMPL_QKVPACKED_DICT
__all__ = [
"LongContextAttention",
"LongContextAttentionQKVPacked",
"RING_IMPL_QKVPACKED_DICT",
"AsyncLongContextAttention",
'llama3_varlen_attention_sp_ulysses_ring',
'attention_sp_ulysses_ring'
]
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/hybrid/async_attn_layer.py
================================================
import torch
from typing import Any
from torch import Tensor
import torch.distributed as dist
from .utils import RING_IMPL_DICT, RING_IMPL_QKVPACKED_DICT
from ..globals import PROCESS_GROUP
class AsyncLongContextAttention(torch.nn.Module):
"""Initialization.
Arguments:
ulysses_pg (ProcessGroup): ulysses process group
ring_pg (ProcessGroup): ring process group
scatter_idx (int): scatter_idx for all2all comm
gather_idx (int): gather_idx for all2all comm
"""
def __init__(
self,
scatter_idx: int = 2,
gather_idx: int = 1,
ring_impl_type: str = "basic",
) -> None:
super(AsyncLongContextAttention, self).__init__()
self.ring_pg = PROCESS_GROUP.RING_PG
self.ulysses_pg = PROCESS_GROUP.ULYSSES_PG
self.stream = torch.cuda.Stream()
self._async_op = True
assert (
self.ulysses_pg is not None or self.ring_pg is not None
), f"use set_seq_parallel_pg() first. Now ulysses pg {self.ulysses_pg} and ring pg {self.ring_pg}"
self.scatter_idx = scatter_idx
self.gather_idx = gather_idx
self.ring_attn_fn = RING_IMPL_DICT[ring_impl_type]
def forward(
self,
query: Tensor,
key: Tensor,
value: Tensor,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
*args: Any,
) -> Tensor:
"""forward
Arguments:
query (Tensor): query input to the layer (bs, seqlen/P, hc, hs)
key (Tensor): key input to the layer (bs, seqlen/P, hc_kv, hs)
value (Tensor): value input to the layer (bs, seqlen/P, hc_kv, hs)
args: other args
Returns:
* output (Tensor): context output
"""
# un*ud = hc
ulysses_degree = dist.get_world_size(self.ulysses_pg)
bs, shard_seqlen, hc, hs = query.shape
bs, shard_seqlen, hc_kv, hs = key.shape
seq_len = shard_seqlen * ulysses_degree
un = hc // ulysses_degree
un_kv = hc_kv // ulysses_degree
assert un_kv == un, f"un_kv {un_kv} un {un}"
qkv = torch.cat([query, key, value]).contiguous()
# (3*bs, seqlen/P, hc, hs) -> (hc, seqlen/P, 3*bs, hs) -> (un, ud, seqlen/P, 3*bs, hs), where hc = un*ud
qkv_list = torch.unbind(
qkv.transpose(0, 2)
.contiguous()
.reshape(un, ulysses_degree, shard_seqlen, 3 * bs, hs)
)
# 3xall-to-all output buffer
qkv_trans_list = [
torch.zeros(
ulysses_degree,
1,
shard_seqlen,
3 * bs,
hs,
dtype=query.dtype,
device=query.device,
)
for i in range(len(qkv_list))
]
# last all-to-all buffer
context_layer_list = [
torch.zeros(
ulysses_degree,
1,
shard_seqlen,
bs,
hs,
dtype=query.dtype,
device=query.device,
)
for i in range(len(qkv_list))
]
comm_handle_list = []
# un * (ud, shard_seqlen, 3*bs, hs)
for i, qkv in enumerate(qkv_list):
with torch.cuda.stream(self.stream):
ret = dist.all_to_all_single(
qkv_trans_list[i],
qkv,
group=self.ulysses_pg,
async_op=self._async_op,
)
comm_handle_list.append(ret)
last_comm_handle_list = []
for i, qkv_trans in enumerate(qkv_trans_list):
if comm_handle_list[i] is not None:
comm_handle_list[i].wait()
qkv_trans = (
qkv_trans.reshape(seq_len, 3 * bs, 1, hs)
.transpose(0, 1)
.contiguous()
.reshape(3 * bs, seq_len, 1, hs)
)
# qkv_trans = all_to_all_4D_async(qkv, qkv_trans_list[i], self.scatter_idx, self.gather_idx, self.ulysses_pg)
qkv_trans = torch.chunk(qkv_trans, 3, dim=0)
out = self.ring_attn_fn(
qkv_trans[0],
qkv_trans[1],
qkv_trans[2],
dropout_p=dropout_p,
softmax_scale=softmax_scale,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=deterministic,
return_attn_probs=return_attn_probs,
group=self.ring_pg,
)
if isinstance(out, tuple):
context_layer, _, _ = out
else:
context_layer = out
# (bs, seq_len, head_cnt/N, head_size) -> (bs, seq_len/N, head_cnt, head_size)
# scatter 1, gather 2
context_layer = (
context_layer.reshape(bs, ulysses_degree, shard_seqlen, 1, hs)
.transpose(0, 3)
.transpose(0, 1)
.contiguous()
.reshape(ulysses_degree, 1, shard_seqlen, bs, hs)
)
with torch.cuda.stream(self.stream):
ret = dist.all_to_all_single(
context_layer_list[i],
context_layer,
group=self.ulysses_pg,
async_op=self._async_op,
)
last_comm_handle_list.append(ret)
# hc = un * P
# un x (hc = P, seq_len/P, bs, hs) -> (bs, seq_len, hc = P, hs)
for i, ret in enumerate(last_comm_handle_list):
if ret is not None:
ret.wait()
context_layer_list[i] = (
context_layer_list[i]
.reshape(ulysses_degree, shard_seqlen, bs, hs)
.transpose(0, 2)
.contiguous()
.reshape(bs, shard_seqlen, ulysses_degree, hs)
)
output = torch.cat(context_layer_list, dim=2)
return output
def backward(self, *args, **kwargs):
raise RuntimeError(
"Backward computation is not allowed for AsyncLongContextAttention."
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/hybrid/attn_layer.py
================================================
from ..comm.all_to_all import SeqAllToAll4D, SeqAllToAll5D
import torch
from typing import Any
from torch import Tensor
import torch.distributed as dist
from .utils import RING_IMPL_DICT, RING_IMPL_QKVPACKED_DICT
from ..globals import PROCESS_GROUP
from ..ring import llama3_flash_attn_prepare_cu_seqlens, llama3_flash_attn_varlen_func
from xtuner._lite.parallel import all_to_all
class LongContextAttention(torch.nn.Module):
"""Initialization.
Arguments:
ulysses_pg (ProcessGroup): ulysses process group
ring_pg (ProcessGroup): ring process group
scatter_idx (int): scatter_idx for all2all comm
gather_idx (int): gather_idx for all2all comm
"""
def __init__(
self,
scatter_idx: int = 2,
gather_idx: int = 1,
ring_impl_type: str = "basic",
use_pack_qkv: bool = False,
) -> None:
super(LongContextAttention, self).__init__()
self.ring_pg = PROCESS_GROUP.RING_PG
self.ulysses_pg = PROCESS_GROUP.ULYSSES_PG
self.use_pack_qkv = use_pack_qkv
assert (
self.ulysses_pg is not None or self.ring_pg is not None
), f"use set_seq_parallel_pg() first. Now ulysses pg {self.ulysses_pg} and ring pg {self.ring_pg}"
self.scatter_idx = scatter_idx
self.gather_idx = gather_idx
self.ring_attn_fn = RING_IMPL_DICT[ring_impl_type]
def forward(
self,
query: Tensor,
key: Tensor,
value: Tensor,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
*args: Any,
) -> Tensor:
"""forward
Arguments:
query (Tensor): query input to the layer
key (Tensor): key input to the layer
value (Tensor): value input to the layer
args: other args
Returns:
* output (Tensor): context output
"""
# 3 X (bs, seq_len/N, head_cnt, head_size) -> 3 X (bs, seq_len, head_cnt/N, head_size)
# scatter 2, gather 1
if self.use_pack_qkv:
# (3*bs, seq_len/N, head_cnt, head_size)
qkv = torch.cat([query, key, value]).contiguous()
# (3*bs, seq_len, head_cnt/N, head_size)
qkv = SeqAllToAll4D.apply(
self.ulysses_pg, qkv, self.scatter_idx, self.gather_idx
)
qkv = torch.chunk(qkv, 3, dim=0)
out = self.ring_attn_fn(
qkv[0],
qkv[1],
qkv[2],
dropout_p=dropout_p,
softmax_scale=softmax_scale,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=deterministic,
return_attn_probs=return_attn_probs,
group=self.ring_pg,
)
else:
query_layer = SeqAllToAll4D.apply(
self.ulysses_pg, query, self.scatter_idx, self.gather_idx
)
key_layer = SeqAllToAll4D.apply(
self.ulysses_pg, key, self.scatter_idx, self.gather_idx
)
value_layer = SeqAllToAll4D.apply(
self.ulysses_pg, value, self.scatter_idx, self.gather_idx
)
out = self.ring_attn_fn(
query_layer,
key_layer,
value_layer,
dropout_p=dropout_p,
softmax_scale=softmax_scale,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=deterministic,
return_attn_probs=return_attn_probs,
group=self.ring_pg,
)
if isinstance(out, tuple):
context_layer, _, _ = out
else:
context_layer = out
# (bs, seq_len, head_cnt/N, head_size) -> (bs, seq_len/N, head_cnt, head_size)
# scatter 1, gather 2
output = SeqAllToAll4D.apply(
self.ulysses_pg, context_layer, self.gather_idx, self.scatter_idx
)
# out e.g., [s/p::h]
return output
class LongContextAttentionQKVPacked(torch.nn.Module):
"""Initialization.
Arguments:
ulysses_pg (ProcessGroup): ulysses process group
ring_pg (ProcessGroup): ring process group
scatter_idx (int): scatter_idx for all2all comm
gather_idx (int): gather_idx for all2all comm
"""
def __init__(
self,
scatter_idx: int = 3,
gather_idx: int = 1,
ring_impl_type: str = "basic",
) -> None:
super(LongContextAttentionQKVPacked, self).__init__()
self.ring_pg = PROCESS_GROUP.RING_PG
self.ulysses_pg = PROCESS_GROUP.ULYSSES_PG
assert (
self.ulysses_pg is not None or self.ring_pg is not None
), f"use set_seq_parallel_pg() first. Now ulysses pg {self.ulysses_pg} and ring pg {self.ring_pg}"
self.scatter_idx = scatter_idx
self.gather_idx = gather_idx
self.ring_attn_fn = RING_IMPL_QKVPACKED_DICT[ring_impl_type]
def forward(
self,
qkv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
*args: Any,
) -> Tensor:
"""forward
Arguments:
qkv (Tensor): packed query/key/value input to the layer,
laid out as (bs, seq_len/N, 3, head_cnt, head_size)
args: other args
Returns:
* output (Tensor): context output
"""
# scatter 3, gather 1
world_size = dist.get_world_size(self.ulysses_pg)
if world_size > 1:
qkv = SeqAllToAll5D.apply(
self.ulysses_pg, qkv, self.scatter_idx, self.gather_idx
)
out = self.ring_attn_fn(
qkv,
dropout_p=dropout_p,
softmax_scale=softmax_scale,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=deterministic,
return_attn_probs=return_attn_probs,
group=self.ring_pg,
)
# print(f"out {out.shape}")
if isinstance(out, tuple):
out = out[0]
# (bs, seq_len, head_cnt/N, head_size) -> (bs, seq_len/N, head_cnt, head_size)
# scatter 1, gather 2
if world_size > 1:
out = SeqAllToAll4D.apply(
self.ulysses_pg, out, self.gather_idx, self.scatter_idx - 1
)
# out e.g., [s/p::h]
return out
def llama3_varlen_attention_sp_ulysses_ring(
query: Tensor,
key: Tensor,
value: Tensor,
cu_seqlens: Tensor,
ulysses_pg,
ring_pg,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
deterministic=False,
heads_k_stride=1,
):
scatter_idx = 1
gather_idx = 0
ulysses_world_size = dist.get_world_size(ulysses_pg)
if ulysses_world_size > 1:
query = all_to_all(
query, ulysses_pg, scatter_idx, gather_idx
)
key = all_to_all(
key, ulysses_pg, scatter_idx, gather_idx
)
value = all_to_all(
value, ulysses_pg, scatter_idx, gather_idx
)
ring_world_size = dist.get_world_size(ring_pg)
ring_rank = dist.get_rank(ring_pg)
(
local_cu_seqlens_q,
local_cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
local_k_slice,
) = llama3_flash_attn_prepare_cu_seqlens(cu_seqlens, causal, ring_rank, ring_world_size)
out = llama3_flash_attn_varlen_func(
query,
key,
value,
local_cu_seqlens_q,
local_cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride=key.shape[1] if heads_k_stride == -1 else heads_k_stride,
local_k_slice=local_k_slice,
dropout_p=dropout_p,
causal=causal,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=deterministic,
return_attn_probs=False,
group=ring_pg
)
if isinstance(out, tuple):
context_layer, _, _ = out
else:
context_layer = out
if ulysses_world_size > 1:
output = all_to_all(
context_layer, ulysses_pg, gather_idx, scatter_idx
)
else:
output = context_layer
# out e.g., [s/p::h]
return output
def attention_sp_ulysses_ring(
query: Tensor,
key: Tensor,
value: Tensor,
ulysses_pg,
ring_pg,
ring_impl_type: str = "basic",
):
scatter_idx = 2
gather_idx = 1
ulysses_world_size = dist.get_world_size(ulysses_pg)
if ulysses_world_size > 1:
query = SeqAllToAll4D.apply(
ulysses_pg, query, scatter_idx, gather_idx
)
key = SeqAllToAll4D.apply(
ulysses_pg, key, scatter_idx, gather_idx
)
value = SeqAllToAll4D.apply(
ulysses_pg, value, scatter_idx, gather_idx
)
ring_attn_fn = RING_IMPL_DICT[ring_impl_type]
out = ring_attn_fn(
query,
key,
value,
causal=True,
group=ring_pg,
)
if isinstance(out, tuple):
context_layer, _, _ = out
else:
context_layer = out
if ulysses_world_size > 1:
# (bs, seq_len, head_cnt/N, head_size) -> (bs, seq_len/N, head_cnt, head_size)
# scatter 1, gather 2
output = SeqAllToAll4D.apply(
ulysses_pg, context_layer, gather_idx, scatter_idx
)
else:
output = context_layer
return output
class LongContextVarLenAttentionForLlaMa3(torch.nn.Module):
"""Hybrid Ulysses + ring variable-length attention (Llama3-style ring).
The ulysses and ring process groups are read from PROCESS_GROUP,
so set_seq_parallel_pg() must be called before constructing this module.
Arguments:
scatter_idx (int): scatter_idx for all2all comm
gather_idx (int): gather_idx for all2all comm
"""
def __init__(
self,
scatter_idx: int = 2,
gather_idx: int = 1
) -> None:
super().__init__()
self.ring_pg = PROCESS_GROUP.RING_PG
self.ulysses_pg = PROCESS_GROUP.ULYSSES_PG
assert (
self.ulysses_pg is not None or self.ring_pg is not None
), f"use set_seq_parallel_pg() first. Now ulysses pg {self.ulysses_pg} and ring pg {self.ring_pg}"
self.scatter_idx = scatter_idx
self.gather_idx = gather_idx
def forward(
self,
query: Tensor,
key: Tensor,
value: Tensor,
cu_seqlens: Tensor,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
*args: Any,
) -> Tensor:
"""forward
Arguments:
query (Tensor): query input to the layer (l,h,d)
key (Tensor): key input to the layer
value (Tensor): value input to the layer
cu_seqlens (Tensor): cumulative sequence boundaries of the packed sequences
args: other args
Returns:
* output (Tensor): context output
"""
query_layer = SeqAllToAll4D.apply(
self.ulysses_pg, query[None], self.scatter_idx, self.gather_idx
)
key_layer = SeqAllToAll4D.apply(
self.ulysses_pg, key[None], self.scatter_idx, self.gather_idx
)
value_layer = SeqAllToAll4D.apply(
self.ulysses_pg, value[None], self.scatter_idx, self.gather_idx
)
ring_rank = dist.get_rank(self.ring_pg)
ring_world_size = dist.get_world_size(self.ring_pg)
(
local_cu_seqlens_q,
local_cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
local_k_slice,
) = llama3_flash_attn_prepare_cu_seqlens(cu_seqlens, causal, ring_rank, ring_world_size)
out = llama3_flash_attn_varlen_func(
query_layer[0],
key_layer[0],
value_layer[0],
local_cu_seqlens_q,
local_cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride=1,
local_k_slice=local_k_slice,
dropout_p=dropout_p,
causal=causal,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=deterministic,
return_attn_probs=return_attn_probs,
group=self.ring_pg
)
if isinstance(out, tuple):
context_layer, _, _ = out
else:
context_layer = out
# (bs, seq_len, head_cnt/N, head_size) -> (bs, seq_len/N, head_cnt, head_size)
# scatter 1, gather 2
output = SeqAllToAll4D.apply(
self.ulysses_pg, context_layer[None], self.gather_idx, self.scatter_idx
)
# out e.g., [s/p::h]
return output[0]
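The shape comments in the forward methods above, `(bs, seq_len/N, head_cnt, head_size) -> (bs, seq_len, head_cnt/N, head_size)`, describe the Ulysses all-to-all. The following is a minimal single-process sketch of that exchange and its inverse, assuming N simulated ranks and nested Python lists in place of tensors. The helper names are hypothetical and are not part of this package; the real code uses `SeqAllToAll4D` over a process group.

```python
# Single-process model of the Ulysses all-to-all (illustrative only).
# shards[r][s][h] is the value held by rank r for local sequence position s
# and head h; the batch and head_size dimensions are omitted for brevity.


def scatter_heads_gather_seq(shards):
    """(seq_len/N, head_cnt) per rank -> (seq_len, head_cnt/N) per rank."""
    n = len(shards)
    heads = len(shards[0][0])
    assert heads % n == 0, "head count must be divisible by the number of ranks"
    heads_per_rank = heads // n
    out = []
    for dst in range(n):
        rows = []
        for src in range(n):  # gather sequence chunks in rank order
            for row in shards[src]:
                # keep only the heads that belong to the destination rank
                rows.append(row[dst * heads_per_rank:(dst + 1) * heads_per_rank])
        out.append(rows)
    return out


def scatter_seq_gather_heads(shards):
    """(seq_len, head_cnt/N) per rank -> (seq_len/N, head_cnt) per rank."""
    n = len(shards)
    seq = len(shards[0])
    assert seq % n == 0, "sequence length must be divisible by the number of ranks"
    seq_per_rank = seq // n
    out = []
    for dst in range(n):
        rows = []
        for s in range(dst * seq_per_rank, (dst + 1) * seq_per_rank):
            row = []
            for src in range(n):  # gather head groups in rank order
                row.extend(shards[src][s])
            rows.append(row)
        out.append(rows)
    return out


# Two ranks, four tokens, four heads; each entry is labelled (token, head).
local = [
    [[(s, h) for h in range(4)] for s in range(r * 2, (r + 1) * 2)]
    for r in range(2)
]
full_seq = scatter_heads_gather_seq(local)
restored = scatter_seq_gather_heads(full_seq)
```

After the first exchange every rank sees the whole sequence for its own subset of heads, which is what the ring attention kernel then operates on; the second exchange restores the original sequence-sharded layout.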
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/hybrid/utils.py
================================================
from ..ring import (
ring_flash_attn_func,
ring_flash_attn_qkvpacked_func,
zigzag_ring_flash_attn_func,
zigzag_ring_flash_attn_qkvpacked_func,
stripe_flash_attn_func,
stripe_flash_attn_qkvpacked_func,
)
RING_IMPL_DICT = {
"basic": ring_flash_attn_func,
"zigzag": zigzag_ring_flash_attn_func,
"strip": stripe_flash_attn_func,
}
RING_IMPL_QKVPACKED_DICT = {
"basic": ring_flash_attn_qkvpacked_func,
"zigzag": zigzag_ring_flash_attn_qkvpacked_func,
"strip": stripe_flash_attn_qkvpacked_func,
}
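Note that the stripe implementation is registered under the key "strip", not "stripe". A minimal sketch of a lookup that reports the valid keys on a typo follows; `select_ring_impl` and `demo_table` are hypothetical stand-ins and do not exist in this package.

```python
def select_ring_impl(table, ring_impl_type):
    """Return table[ring_impl_type], or raise a ValueError listing valid keys."""
    try:
        return table[ring_impl_type]
    except KeyError:
        raise ValueError(
            f"unknown ring_impl_type {ring_impl_type!r}; "
            f"expected one of {sorted(table)}"
        ) from None


# Stand-in for RING_IMPL_DICT with strings instead of the real functions.
demo_table = {"basic": "ring", "zigzag": "zigzag", "strip": "stripe"}
chosen = select_ring_impl(demo_table, "strip")
```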
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/__init__.py
================================================
from .llama3_flash_attn_varlen import (
llama3_flash_attn_prepare_cu_seqlens,
llama3_flash_attn_varlen_func,
llama3_flash_attn_varlen_kvpacked_func,
llama3_flash_attn_varlen_qkvpacked_func,
)
from .ring_flash_attn import (
ring_flash_attn_func,
ring_flash_attn_kvpacked_func,
ring_flash_attn_qkvpacked_func,
ring_flash_attn_inference_func,
)
from .ring_flash_attn_varlen import (
ring_flash_attn_varlen_func,
ring_flash_attn_varlen_kvpacked_func,
ring_flash_attn_varlen_qkvpacked_func,
)
from .zigzag_ring_flash_attn import (
zigzag_ring_flash_attn_func,
zigzag_ring_flash_attn_kvpacked_func,
zigzag_ring_flash_attn_qkvpacked_func,
)
from .zigzag_ring_flash_attn_varlen import (
zigzag_ring_flash_attn_varlen_func,
zigzag_ring_flash_attn_varlen_qkvpacked_func,
)
from .stripe_flash_attn import (
stripe_flash_attn_func,
stripe_flash_attn_kvpacked_func,
stripe_flash_attn_qkvpacked_func,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/llama3_flash_attn_varlen.py
================================================
import torch
import torch.distributed as dist
from flash_attn.flash_attn_interface import (
_flash_attn_varlen_forward,
_flash_attn_varlen_backward,
)
from .utils import get_default_args
class AsyncHandles:
def __init__(self) -> None:
self.handles = []
def register(self, handle):
self.handles.append(handle)
def wait(self):
for handle in self.handles:
handle.wait()
self.handles = []
def llama3_flash_attn_prepare_cu_seqlens(cu_seqlens, causal, rank, world_size):
total_length = cu_seqlens[-1].item()
assert total_length % world_size == 0
length_per_rank = total_length // world_size
left = torch.searchsorted(cu_seqlens, rank * length_per_rank)
right = torch.searchsorted(cu_seqlens, (rank + 1) * length_per_rank)
# after this, cu_seqlens[left:right + 1] contains all the sequences for this rank
if cu_seqlens[left] != rank * length_per_rank:
left -= 1
left = left.item()
right = right.item()
# q is always the same. just calculate the cu_seqlens for the local slice
cu_seqlens_q = cu_seqlens[left : right + 1].clone()
cu_seqlens_q -= rank * length_per_rank
cu_seqlens_q[0] = 0
cu_seqlens_q[-1] = length_per_rank
cu_seqlens_k = cu_seqlens[left : right + 1].clone()
if causal:
# when causal, we want
# - the last k seq is of the same length as the last q seq
slice_right = (rank + 1) * length_per_rank
cu_seqlens_k[-1] = slice_right
else:
# when not causal, we want
# - the last k is full seq
slice_right = cu_seqlens[right].item()
slice_left = cu_seqlens[left].item()
cu_seqlens_k -= slice_left
max_seqlen_q = (cu_seqlens_q[1:] - cu_seqlens_q[:-1]).max().item()
max_seqlen_k = (cu_seqlens_k[1:] - cu_seqlens_k[:-1]).max().item()
local_k_slice = slice(slice_left, slice_right)
return cu_seqlens_q, cu_seqlens_k, max_seqlen_q, max_seqlen_k, local_k_slice
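To make the slicing logic of `llama3_flash_attn_prepare_cu_seqlens` concrete, here is a pure-Python sketch of the same arithmetic on plain lists. It assumes `bisect.bisect_left` as a stand-in for `torch.searchsorted` with its default `right=False`; the function name `prepare_cu_seqlens_sketch` is hypothetical.

```python
from bisect import bisect_left


def prepare_cu_seqlens_sketch(cu_seqlens, causal, rank, world_size):
    """List-based mirror of llama3_flash_attn_prepare_cu_seqlens."""
    total_length = cu_seqlens[-1]
    assert total_length % world_size == 0
    length_per_rank = total_length // world_size
    start = rank * length_per_rank
    end = (rank + 1) * length_per_rank

    left = bisect_left(cu_seqlens, start)
    right = bisect_left(cu_seqlens, end)
    if cu_seqlens[left] != start:
        # this rank starts in the middle of a sequence: include its boundary
        left -= 1

    # query boundaries, relative to this rank's token range
    cu_q = [c - start for c in cu_seqlens[left:right + 1]]
    cu_q[0] = 0
    cu_q[-1] = length_per_rank

    # key boundaries, relative to the start of the first touched sequence
    cu_k = list(cu_seqlens[left:right + 1])
    if causal:
        slice_right = end  # keys stop where this rank's queries stop
        cu_k[-1] = slice_right
    else:
        slice_right = cu_seqlens[right]  # keys cover the full last sequence
    slice_left = cu_seqlens[left]
    cu_k = [c - slice_left for c in cu_k]

    max_q = max(b - a for a, b in zip(cu_q, cu_q[1:]))
    max_k = max(b - a for a, b in zip(cu_k, cu_k[1:]))
    return cu_q, cu_k, max_q, max_k, slice(slice_left, slice_right)


# Two packed sequences of lengths 3 and 5, split across two ranks.
rank0 = prepare_cu_seqlens_sketch([0, 3, 8], causal=True, rank=0, world_size=2)
rank1 = prepare_cu_seqlens_sketch([0, 3, 8], causal=True, rank=1, world_size=2)
```

In this example rank 1 holds tokens 4 to 7, all of which belong to the second sequence, so its queries form one segment of length 4 while its keys span tokens 3 to 7 (length 5) of the gathered key buffer.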
def llama3_flash_attn_varlen_forward(
process_group,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
out_list = []
lse_list = []
nheads = q.shape[1]
total_k, nheads_k, head_dim = k.shape
assert nheads_k % heads_k_stride == 0
world_size = dist.get_world_size(process_group)
kv_buffer = torch.empty(
(2, total_k * world_size, heads_k_stride, head_dim),
dtype=k.dtype,
device=k.device,
)
kv_buffer_copy = torch.empty_like(kv_buffer)
k_0 = k[:, :heads_k_stride].contiguous()
v_0 = v[:, :heads_k_stride].contiguous()
async_handles = AsyncHandles()
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[0], k_0, group=process_group, async_op=True
)
)
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[1], v_0, group=process_group, async_op=True
)
)
for i in range(0, nheads_k, heads_k_stride):
async_handles.wait()
kv_buffer, kv_buffer_copy = kv_buffer_copy, kv_buffer
if i < nheads_k - heads_k_stride:
# all_gather the next kv slice
kv_slice_left = i + heads_k_stride
kv_slice_right = kv_slice_left + heads_k_stride
send_k = k[:, kv_slice_left:kv_slice_right].contiguous()
send_v = v[:, kv_slice_left:kv_slice_right].contiguous()
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[0], send_k, group=process_group, async_op=True
)
)
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[1], send_v, group=process_group, async_op=True
)
)
q_i = q[:, i * nheads // nheads_k : (i + heads_k_stride) * nheads // nheads_k]
k_i = kv_buffer[0][local_k_slice]
v_i = kv_buffer[1][local_k_slice]
params = get_default_args(_flash_attn_varlen_forward).copy()
params.update(
{
"q": q_i,
"k": k_i,
"v": v_i,
"cu_seqlens_q": cu_seqlens_q,
"cu_seqlens_k": cu_seqlens_k,
"max_seqlen_q": max_seqlen_q,
"max_seqlen_k": max_seqlen_k,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": True and dropout_p > 0,
}
)
out, _, _, _, _, lse, _, _ = _flash_attn_varlen_forward(**params)
out_list.append(out)
lse_list.append(lse)
out = torch.cat(out_list, dim=1)
lse = torch.cat(lse_list, dim=-2)
return out, lse
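The loop in `llama3_flash_attn_varlen_forward` walks over KV heads in groups of `heads_k_stride` and pairs each group with the query heads that share it (grouped-query attention). A small sketch of that index arithmetic, using a hypothetical helper name, is:

```python
def query_head_slices(nheads, nheads_k, heads_k_stride):
    """List (kv_start, kv_stop, q_start, q_stop) for each loop iteration."""
    assert nheads_k % heads_k_stride == 0
    return [
        (
            i,
            i + heads_k_stride,
            i * nheads // nheads_k,
            (i + heads_k_stride) * nheads // nheads_k,
        )
        for i in range(0, nheads_k, heads_k_stride)
    ]


# 8 query heads sharing 2 KV heads, one KV head gathered per iteration.
slices_stride1 = query_head_slices(nheads=8, nheads_k=2, heads_k_stride=1)
# 8 query heads sharing 4 KV heads, two KV heads gathered per iteration.
slices_stride2 = query_head_slices(nheads=8, nheads_k=4, heads_k_stride=2)
```

A smaller `heads_k_stride` keeps the all-gather buffer small at the cost of more iterations, and a larger one does the opposite.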
def llama3_flash_attn_varlen_backward(
process_group,
dout,
q,
k,
v,
out,
softmax_lse,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
nheads = q.shape[1]
total_k, nheads_k, head_dim = k.shape
assert nheads_k % heads_k_stride == 0
world_size = dist.get_world_size(process_group)
kv_buffer = torch.empty(
(2, total_k * world_size, heads_k_stride, head_dim),
dtype=k.dtype,
device=k.device,
)
kv_buffer_copy = torch.empty_like(kv_buffer)
dkv_buffer = torch.empty(
(2, total_k * world_size, heads_k_stride, head_dim),
dtype=k.dtype,
device=k.device,
)
if heads_k_stride != nheads_k:
kv_contiguous_buffer = torch.empty(
(2, total_k, heads_k_stride, head_dim),
dtype=k.dtype,
device=k.device,
)
dq = torch.empty_like(q)
dk = torch.empty_like(k)
dv = torch.empty_like(v)
async_handles = AsyncHandles()
k_0 = k[:, :heads_k_stride].contiguous()
v_0 = v[:, :heads_k_stride].contiguous()
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[0], k_0, group=process_group, async_op=True
)
)
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[1], v_0, group=process_group, async_op=True
)
)
for i in range(0, nheads_k, heads_k_stride):
dkv_buffer.zero_()
q_slice = slice(
i * nheads // nheads_k, (i + heads_k_stride) * nheads // nheads_k
)
q_i = q[:, q_slice]
dout_i = dout[:, q_slice]
out_i = out[:, q_slice]
dq_i = dq[:, q_slice]
if softmax_lse.dim() == 3:
lse_i = softmax_lse[:, q_slice].contiguous()
else:
lse_i = softmax_lse[q_slice]
async_handles.wait()
kv_buffer, kv_buffer_copy = kv_buffer_copy, kv_buffer
if i < nheads_k - heads_k_stride:
# all_gather the next kv slice
kv_slice_left = i + heads_k_stride
kv_slice_right = kv_slice_left + heads_k_stride
send_k = k[:, kv_slice_left:kv_slice_right].contiguous()
send_v = v[:, kv_slice_left:kv_slice_right].contiguous()
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[0], send_k, group=process_group, async_op=True
)
)
async_handles.register(
dist.all_gather_into_tensor(
kv_buffer_copy[1], send_v, group=process_group, async_op=True
)
)
k_i = kv_buffer[0][local_k_slice]
v_i = kv_buffer[1][local_k_slice]
dk_i = dkv_buffer[0][local_k_slice]
dv_i = dkv_buffer[1][local_k_slice]
params = get_default_args(_flash_attn_varlen_backward).copy()
params.update(
{
"dout": dout_i,
"q": q_i,
"k": k_i,
"v": v_i,
"out": out_i,
"softmax_lse": lse_i,
"dq": dq_i,
"dk": dk_i,
"dv": dv_i,
"cu_seqlens_q": cu_seqlens_q,
"cu_seqlens_k": cu_seqlens_k,
"max_seqlen_q": max_seqlen_q,
"max_seqlen_k": max_seqlen_k,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_varlen_backward(**params)
if heads_k_stride != nheads_k:
# reduce_scatter needs contiguous buffer
dk_i = kv_contiguous_buffer[0]
dv_i = kv_contiguous_buffer[1]
else:
dk_i = dk
dv_i = dv
dist.reduce_scatter_tensor(dk_i, dkv_buffer[0], group=process_group)
dist.reduce_scatter_tensor(dv_i, dkv_buffer[1], group=process_group)
if heads_k_stride != nheads_k:
dk[:, i : i + heads_k_stride] = dk_i
dv[:, i : i + heads_k_stride] = dv_i
return dq, dk, dv
class Llama3FlashAttnVarlenFunc(torch.autograd.Function):
@staticmethod
def forward(
ctx,
q,
k,
v,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_softmax,
group,
):
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
assert alibi_slopes is None
k = k.contiguous()
v = v.contiguous()
out, softmax_lse = llama3_flash_attn_varlen_forward(
group,
q,
k,
v,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
softmax_scale=softmax_scale,
dropout_p=dropout_p,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=False,
)
# this should be out_padded
ctx.save_for_backward(q, k, v, out, softmax_lse, cu_seqlens_q, cu_seqlens_k)
ctx.max_seqlen_q = max_seqlen_q
ctx.max_seqlen_k = max_seqlen_k
ctx.heads_k_stride = heads_k_stride
ctx.local_k_slice = local_k_slice
ctx.dropout_p = dropout_p
ctx.softmax_scale = softmax_scale
ctx.causal = causal
ctx.window_size = window_size
ctx.alibi_slopes = alibi_slopes
ctx.deterministic = deterministic
ctx.group = group
return out if not return_softmax else (out, softmax_lse, None)
@staticmethod
def backward(ctx, dout, *args):
q, k, v, out, softmax_lse, cu_seqlens_q, cu_seqlens_k = ctx.saved_tensors
dq, dk, dv = llama3_flash_attn_varlen_backward(
ctx.group,
dout,
q,
k,
v,
out,
softmax_lse,
cu_seqlens_q,
cu_seqlens_k,
ctx.max_seqlen_q,
ctx.max_seqlen_k,
ctx.heads_k_stride,
ctx.local_k_slice,
softmax_scale=ctx.softmax_scale,
dropout_p=ctx.dropout_p,
causal=ctx.causal,
window_size=ctx.window_size,
alibi_slopes=ctx.alibi_slopes,
deterministic=ctx.deterministic,
)
return (dq, dk, dv) + (None,) * 15
def llama3_flash_attn_varlen_qkvpacked_func(
qkv,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return Llama3FlashAttnVarlenFunc.apply(
qkv[:, 0],
qkv[:, 1],
qkv[:, 2],
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def llama3_flash_attn_varlen_kvpacked_func(
q,
kv,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return Llama3FlashAttnVarlenFunc.apply(
q,
kv[:, 0],
kv[:, 1],
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def llama3_flash_attn_varlen_func(
q,
k,
v,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return Llama3FlashAttnVarlenFunc.apply(
q,
k,
v,
cu_seqlens_q,
cu_seqlens_k,
max_seqlen_q,
max_seqlen_k,
heads_k_stride,
local_k_slice,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/ring_flash_attn.py
================================================
import torch
import torch.distributed as dist
from flash_attn.flash_attn_interface import _flash_attn_forward, _flash_attn_backward
from flash_attn import flash_attn_with_kvcache
from .utils import RingComm, update_out_and_lse, get_default_args
def ring_flash_attn_forward(
process_group,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
comm = RingComm(process_group)
out = None
lse = None
next_k, next_v = None, None
for step in range(comm.world_size):
if step + 1 != comm.world_size:
next_k: torch.Tensor = comm.send_recv(k)
next_v: torch.Tensor = comm.send_recv(v)
comm.commit()
if not causal or step <= comm.rank:
params = get_default_args(_flash_attn_forward).copy()
params.update(
{
"q": q,
"k": k,
"v": v,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal and step == 0,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": True and dropout_p > 0,
}
)
block_out, _, _, _, _, block_lse, _, _ = _flash_attn_forward(**params)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
if step + 1 != comm.world_size:
comm.wait()
k = next_k
v = next_v
out = out.to(q.dtype)
lse = lse.squeeze(dim=-1).transpose(1, 2)
return out, lse
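The `step <= comm.rank` test in `ring_flash_attn_forward` decides which KV blocks a rank attends to under causal masking. The sketch below enumerates that schedule. It assumes that `RingComm` (defined in `.utils`, not shown here) sends to rank + 1 and receives from rank - 1, so that at step `s` a rank holds the KV block that originated on rank `(rank - s) % world_size`. The function name is hypothetical.

```python
def causal_ring_schedule(world_size):
    """Map rank -> list of (step, kv_owner_rank, uses_causal_mask)."""
    schedule = {}
    for rank in range(world_size):
        steps = []
        for step in range(world_size):
            # assumed ring direction: KV blocks travel from rank r to r + 1
            kv_owner = (rank - step) % world_size
            if step <= rank:
                # only the local (diagonal) block needs the causal mask
                steps.append((step, kv_owner, step == 0))
        schedule[rank] = steps
    return schedule


schedule = causal_ring_schedule(3)
```

Under that assumption every computed block comes from a rank whose chunk is no later than the local one. Rank 0 computes a single block while the last rank computes all of them, which is the load imbalance the zigzag and stripe variants aim to reduce.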
def ring_flash_attn_backward(
process_group,
dout,
q,
k,
v,
out,
softmax_lse,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
kv_comm = RingComm(process_group)
d_kv_comm = RingComm(process_group)
dq, dk, dv = None, None, None
next_dk, next_dv = None, None
block_dq_buffer = torch.empty(q.shape, dtype=q.dtype, device=q.device)
block_dk_buffer = torch.empty(k.shape, dtype=k.dtype, device=k.device)
block_dv_buffer = torch.empty(v.shape, dtype=v.dtype, device=v.device)
next_dk, next_dv = None, None
next_k, next_v = None, None
for step in range(kv_comm.world_size):
if step + 1 != kv_comm.world_size:
next_k = kv_comm.send_recv(k)
next_v = kv_comm.send_recv(v)
kv_comm.commit()
if step <= kv_comm.rank or not causal:
bwd_causal = causal and step == 0
params = get_default_args(_flash_attn_backward).copy()
params.update(
{
"dout": dout,
"q": q,
"k": k,
"v": v,
"out": out,
"softmax_lse": softmax_lse,
"dq": block_dq_buffer,
"dk": block_dk_buffer,
"dv": block_dv_buffer,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": bwd_causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_backward(**params)
if dq is None:
dq = block_dq_buffer.to(torch.float32)
dk = block_dk_buffer.to(torch.float32)
dv = block_dv_buffer.to(torch.float32)
else:
dq += block_dq_buffer
d_kv_comm.wait()
dk = block_dk_buffer + next_dk
dv = block_dv_buffer + next_dv
elif step != 0:
d_kv_comm.wait()
dk = next_dk
dv = next_dv
if step + 1 != kv_comm.world_size:
kv_comm.wait()
k = next_k
v = next_v
next_dk = d_kv_comm.send_recv(dk)
next_dv = d_kv_comm.send_recv(dv)
d_kv_comm.commit()
d_kv_comm.wait()
return dq.to(q.dtype), next_dk.to(q.dtype), next_dv.to(q.dtype)
class RingFlashAttnFunc(torch.autograd.Function):
@staticmethod
def forward(
ctx,
q,
k,
v,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_softmax,
group,
):
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
assert alibi_slopes is None
k = k.contiguous()
v = v.contiguous()
out, softmax_lse = ring_flash_attn_forward(
group,
q,
k,
v,
softmax_scale=softmax_scale,
dropout_p=dropout_p,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=False,
)
# this should be out_padded
ctx.save_for_backward(q, k, v, out, softmax_lse)
ctx.dropout_p = dropout_p
ctx.softmax_scale = softmax_scale
ctx.causal = causal
ctx.window_size = window_size
ctx.alibi_slopes = alibi_slopes
ctx.deterministic = deterministic
ctx.group = group
return out if not return_softmax else (out, softmax_lse, None)
@staticmethod
def backward(ctx, dout, *args):
q, k, v, out, softmax_lse = ctx.saved_tensors
dq, dk, dv = ring_flash_attn_backward(
ctx.group,
dout,
q,
k,
v,
out,
softmax_lse,
softmax_scale=ctx.softmax_scale,
dropout_p=ctx.dropout_p,
causal=ctx.causal,
window_size=ctx.window_size,
alibi_slopes=ctx.alibi_slopes,
deterministic=ctx.deterministic,
)
return dq, dk, dv, None, None, None, None, None, None, None, None
def ring_flash_attn_qkvpacked_func(
qkv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return RingFlashAttnFunc.apply(
qkv[:, :, 0],
qkv[:, :, 1],
qkv[:, :, 2],
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def ring_flash_attn_kvpacked_func(
q,
kv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return RingFlashAttnFunc.apply(
q,
kv[:, :, 0],
kv[:, :, 1],
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def ring_flash_attn_func(
q,
k,
v,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return RingFlashAttnFunc.apply(
q,
k,
v,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
@torch.no_grad()
def ring_flash_attn_inference_func(
q,
k,
v,
cache_seqlens,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
assert q.shape[1] == 1
out, lse = flash_attn_with_kvcache(
q,
k,
v,
causal=True,
cache_seqlens=cache_seqlens,
return_softmax_lse=True)
# Alternative without the kvcache-style API
# params = get_default_args(_flash_attn_forward).copy()
# params.update(
# {
# "q": q,
# "k": k,
# "v": v,
# "dropout_p": dropout_p,
# "softmax_scale": softmax_scale,
# "causal": causal,
# "window_size": window_size,
# "alibi_slopes": alibi_slopes,
# "return_softmax": True and dropout_p > 0,
# }
# )
# out, _, _, _, _, lse, _, _ = _flash_attn_forward(**params)
out_list = [
torch.empty_like(out, device=out.device, dtype=out.dtype)
for _ in range(dist.get_world_size(group))
]
dist.all_gather(out_list, out, group=group)
out_lse = [
torch.empty_like(lse, device=lse.device, dtype=lse.dtype)
for _ in range(dist.get_world_size(group))
]
dist.all_gather(out_lse, lse, group=group)
new_out = None
new_lse = None
for i in reversed(range(dist.get_world_size(group))):
new_out, new_lse = update_out_and_lse(new_out, new_lse, out_list[i], out_lse[i])
return new_out.to(q.dtype)
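`ring_flash_attn_inference_func` gathers per-rank partial outputs and their log-sum-exp values and folds them together with `update_out_and_lse`. That helper lives in `.utils` and is not shown here; the sketch below is the standard log-sum-exp merge that such a helper is generally expected to perform, written for scalars with hypothetical function names.

```python
import math


def attend(scores, values):
    """Softmax-weighted average of values, plus the log-sum-exp of scores."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, m + math.log(total)


def merge(out_a, lse_a, out_b, lse_b):
    """Combine two partial attention results computed over disjoint key sets."""
    m = max(lse_a, lse_b)
    lse = m + math.log(math.exp(lse_a - m) + math.exp(lse_b - m))
    out = out_a * math.exp(lse_a - lse) + out_b * math.exp(lse_b - lse)
    return out, lse


scores = [0.1, -1.2, 2.0, 0.7]
values = [1.0, 2.0, 3.0, 4.0]
full_out, full_lse = attend(scores, values)
part_a = attend(scores[:2], values[:2])  # keys held by one rank
part_b = attend(scores[2:], values[2:])  # keys held by another rank
merged_out, merged_lse = merge(*part_a, *part_b)
```

The merged result matches attention over all keys at once, which is why each rank can attend to its own slice of the KV cache independently.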
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/ring_flash_attn_varlen.py
================================================
import torch
import torch.distributed as dist
from flash_attn.flash_attn_interface import (
_flash_attn_varlen_forward,
_flash_attn_varlen_backward,
)
from .utils import (
RingComm,
update_out_and_lse,
get_default_args,
)
try:
from .triton_utils import (
flatten_varlen_lse,
unflatten_varlen_lse,
)
except Exception:  # fall back to the pure PyTorch helpers if triton is unavailable
from .utils import (
flatten_varlen_lse,
unflatten_varlen_lse,
)
def ring_flash_attn_varlen_forward(
process_group,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
cu_seqlens,
max_seqlen,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
comm = RingComm(process_group)
out = None
lse = None
next_k, next_v = None, None
old_lse = False
for step in range(comm.world_size):
if step + 1 != comm.world_size:
next_k: torch.Tensor = comm.send_recv(k)
next_v: torch.Tensor = comm.send_recv(v)
comm.commit()
if not causal or step <= comm.rank:
params = get_default_args(_flash_attn_varlen_forward).copy()
params.update(
{
"q": q,
"k": k,
"v": v,
"cu_seqlens_q": cu_seqlens,
"cu_seqlens_k": cu_seqlens,
"max_seqlen_q": max_seqlen,
"max_seqlen_k": max_seqlen,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal and step == 0,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": True and dropout_p > 0,
}
)
block_out, _, _, _, _, block_lse, _, _ = _flash_attn_varlen_forward(
**params
)
if block_lse.dim() == 3:
old_lse = True
block_lse = flatten_varlen_lse(
block_lse,
cu_seqlens=cu_seqlens,
)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
if step + 1 != comm.world_size:
comm.wait()
k = next_k
v = next_v
out = out.to(q.dtype)
if old_lse:
lse = unflatten_varlen_lse(lse, cu_seqlens, max_seqlen)
else:
lse = lse.squeeze(dim=-1).transpose(0, 1)
return out, lse
def ring_flash_attn_varlen_backward(
process_group,
dout,
q,
k,
v,
out,
softmax_lse,
cu_seqlens,
max_seqlen,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
kv_comm = RingComm(process_group)
d_kv_comm = RingComm(process_group)
dq, dk, dv = None, None, None
next_dk, next_dv = None, None
block_dq_buffer = torch.empty(q.shape, dtype=q.dtype, device=q.device)
block_dk_buffer = torch.empty(k.shape, dtype=k.dtype, device=k.device)
block_dv_buffer = torch.empty(v.shape, dtype=v.dtype, device=v.device)
next_dk, next_dv = None, None
next_k, next_v = None, None
for step in range(kv_comm.world_size):
if step + 1 != kv_comm.world_size:
next_k = kv_comm.send_recv(k)
next_v = kv_comm.send_recv(v)
kv_comm.commit()
if step <= kv_comm.rank or not causal:
bwd_causal = causal and step == 0
params = get_default_args(_flash_attn_varlen_backward).copy()
params.update(
{
"dout": dout,
"q": q,
"k": k,
"v": v,
"out": out,
"softmax_lse": softmax_lse,
"dq": block_dq_buffer,
"dk": block_dk_buffer,
"dv": block_dv_buffer,
"cu_seqlens_q": cu_seqlens,
"cu_seqlens_k": cu_seqlens,
"max_seqlen_q": max_seqlen,
"max_seqlen_k": max_seqlen,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": bwd_causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_varlen_backward(**params)
if dq is None:
dq = block_dq_buffer.to(torch.float32)
dk = block_dk_buffer.to(torch.float32)
dv = block_dv_buffer.to(torch.float32)
else:
dq += block_dq_buffer
d_kv_comm.wait()
dk = block_dk_buffer + next_dk
dv = block_dv_buffer + next_dv
elif step != 0:
d_kv_comm.wait()
dk = next_dk
dv = next_dv
if step + 1 != kv_comm.world_size:
kv_comm.wait()
k = next_k
v = next_v
next_dk = d_kv_comm.send_recv(dk)
next_dv = d_kv_comm.send_recv(dv)
d_kv_comm.commit()
d_kv_comm.wait()
return dq.to(q.dtype), next_dk.to(q.dtype), next_dv.to(q.dtype)
class RingFlashAttnVarlenFunc(torch.autograd.Function):
@staticmethod
def forward(
ctx,
q,
k,
v,
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_softmax,
group,
):
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
assert alibi_slopes is None
k = k.contiguous()
v = v.contiguous()
out, softmax_lse = ring_flash_attn_varlen_forward(
group,
q,
k,
v,
cu_seqlens,
max_seqlen,
softmax_scale=softmax_scale,
dropout_p=dropout_p,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=False,
)
# this should be out_padded
ctx.save_for_backward(q, k, v, out, softmax_lse, cu_seqlens)
ctx.max_seqlen = max_seqlen
ctx.dropout_p = dropout_p
ctx.softmax_scale = softmax_scale
ctx.causal = causal
ctx.window_size = window_size
ctx.alibi_slopes = alibi_slopes
ctx.deterministic = deterministic
ctx.group = group
return out if not return_softmax else (out, softmax_lse, None)
@staticmethod
def backward(ctx, dout, *args):
q, k, v, out, softmax_lse, cu_seqlens = ctx.saved_tensors
dq, dk, dv = ring_flash_attn_varlen_backward(
ctx.group,
dout,
q,
k,
v,
out,
softmax_lse,
cu_seqlens,
ctx.max_seqlen,
softmax_scale=ctx.softmax_scale,
dropout_p=ctx.dropout_p,
causal=ctx.causal,
window_size=ctx.window_size,
alibi_slopes=ctx.alibi_slopes,
deterministic=ctx.deterministic,
)
return dq, dk, dv, None, None, None, None, None, None, None, None, None, None
def ring_flash_attn_varlen_qkvpacked_func(
qkv,
cu_seqlens,
max_seqlen,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return RingFlashAttnVarlenFunc.apply(
qkv[:, 0],
qkv[:, 1],
qkv[:, 2],
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def ring_flash_attn_varlen_kvpacked_func(
q,
kv,
cu_seqlens,
max_seqlen,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return RingFlashAttnVarlenFunc.apply(
q,
kv[:, 0],
kv[:, 1],
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def ring_flash_attn_varlen_func(
q,
k,
v,
cu_seqlens,
max_seqlen,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return RingFlashAttnVarlenFunc.apply(
q,
k,
v,
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/stripe_flash_attn.py
================================================
import torch
import torch.distributed as dist
from flash_attn.flash_attn_interface import _flash_attn_forward, _flash_attn_backward
from .utils import RingComm, update_out_and_lse, get_default_args
def stripe_flash_attn_forward(
process_group,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
assert (
causal
), "stripe flash attn only supports causal attention, if not causal, use ring flash attn instead"
comm = RingComm(process_group)
out = None
lse = None
next_k, next_v = None, None
for step in range(comm.world_size):
if step + 1 != comm.world_size:
next_k: torch.Tensor = comm.send_recv(k)
next_v: torch.Tensor = comm.send_recv(v)
comm.commit()
params = get_default_args(_flash_attn_forward).copy()
if step <= comm.rank:
params.update(
{
"q": q,
"k": k,
"v": v,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": dropout_p > 0,
}
)
block_out, _, _, _, _, block_lse, _, _ = _flash_attn_forward(**params)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
else:
params.update(
{
"q": q[:, 1:],
"k": k[:, :-1],
"v": v[:, :-1],
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": dropout_p > 0,
}
)
block_out, _, _, _, _, block_lse, _, _ = _flash_attn_forward(**params)
out, lse = update_out_and_lse(
out, lse, block_out, block_lse, slice_=(slice(None), slice(1, None))
)
if step + 1 != comm.world_size:
comm.wait()
k = next_k
v = next_v
out = out.to(q.dtype)
lse = lse.squeeze(dim=-1).transpose(1, 2)
return out, lse
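# Illustrative sketch, not part of the upstream module: a pure-Python check of
# the two branches in stripe_flash_attn_forward. It ASSUMES the striped layout
# in which rank r holds global tokens r, r + W, r + 2W, ... (W = world size)
# and that at step s a rank holds the K/V shard that started on rank
# (r - s) % W. The helper name `_stripe_mask_matches_causal` is hypothetical.
def _stripe_mask_matches_causal(world_size: int, local_len: int) -> bool:
    for rank in range(world_size):
        for step in range(world_size):
            src = (rank - step) % world_size
            for i in range(local_len):  # local query index
                for j in range(local_len):  # local key index
                    # Ground truth: causal mask on the global token positions.
                    want = src + j * world_size <= rank + i * world_size
                    if step <= rank:
                        got = j <= i  # plain causal block on (q, k, v)
                    else:
                        got = j < i  # q[:, 1:] against k[:, :-1], causal
                    if want != got:
                        return False
    return True


# Under these assumptions the shifted slice is exactly what the global causal
# mask requires, e.g. _stripe_mask_matches_causal(4, 5) holds.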
def stripe_flash_attn_backward(
process_group,
dout,
q,
k,
v,
out,
softmax_lse,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
assert (
causal
), "stripe flash attn only supports causal attention, if not causal, use ring flash attn instead"
kv_comm = RingComm(process_group)
d_kv_comm = RingComm(process_group)
dq, dk, dv = None, None, None
next_dk, next_dv = None, None
next_k, next_v = None, None
dk_comm_buffer, dv_comm_buffer = None, None
block_dq_buffer = torch.empty(q.shape, dtype=q.dtype, device=q.device)
block_dk_buffer = torch.empty(k.shape, dtype=k.dtype, device=k.device)
block_dv_buffer = torch.empty(v.shape, dtype=v.dtype, device=v.device)
# initialise once, outside the loop, so the lazy init below runs at most once
softmax_lse_1 = None
for step in range(kv_comm.world_size):
if step + 1 != kv_comm.world_size:
next_k = kv_comm.send_recv(k)
next_v = kv_comm.send_recv(v)
kv_comm.commit()
shift_causal = step > kv_comm.rank
params = get_default_args(_flash_attn_backward).copy()
if not shift_causal:
params.update(
{
"dout": dout,
"q": q,
"k": k,
"v": v,
"out": out,
"softmax_lse": softmax_lse,
"dq": block_dq_buffer,
"dk": block_dk_buffer,
"dv": block_dv_buffer,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_backward(**params)
else:
if softmax_lse_1 is None:
# lazy init, since the last rank does not need softmax_lse_1
softmax_lse_1 = softmax_lse[:, :, 1:].contiguous()
params.update(
{
"dout": dout[:, 1:],
"q": q[:, 1:],
"k": k[:, :-1],
"v": v[:, :-1],
"out": out[:, 1:],
"softmax_lse": softmax_lse_1,
"dq": block_dq_buffer[:, 1:],
"dk": block_dk_buffer[:, :-1],
"dv": block_dv_buffer[:, :-1],
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_backward(**params)
if dq is None:
dq = block_dq_buffer.to(torch.float32)
dk = block_dk_buffer.to(torch.float32)
dv = block_dv_buffer.to(torch.float32)
else:
if not shift_causal:
dq += block_dq_buffer
else:
dq[:, 1:] += block_dq_buffer[:, 1:]
d_kv_comm.wait()
dk_comm_buffer, dv_comm_buffer = dk, dv
dk = next_dk
dv = next_dv
if not shift_causal:
dk = block_dk_buffer + dk
dv = block_dv_buffer + dv
else:
dk[:, :-1] += block_dk_buffer[:, :-1]
dv[:, :-1] += block_dv_buffer[:, :-1]
if step + 1 != kv_comm.world_size:
kv_comm.wait()
k = next_k
v = next_v
next_dk = d_kv_comm.send_recv(dk, dk_comm_buffer)
next_dv = d_kv_comm.send_recv(dv, dv_comm_buffer)
d_kv_comm.commit()
d_kv_comm.wait()
return dq.to(q.dtype), next_dk.to(q.dtype), next_dv.to(q.dtype)
class StripeFlashAttnFunc(torch.autograd.Function):
@staticmethod
def forward(
ctx,
q,
k,
v,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_softmax,
group,
):
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
assert alibi_slopes is None
k = k.contiguous()
v = v.contiguous()
out, softmax_lse = stripe_flash_attn_forward(
group,
q,
k,
v,
softmax_scale=softmax_scale,
dropout_p=dropout_p,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=False,
)
# this should be out_padded
ctx.save_for_backward(q, k, v, out, softmax_lse)
ctx.dropout_p = dropout_p
ctx.softmax_scale = softmax_scale
ctx.causal = causal
ctx.window_size = window_size
ctx.alibi_slopes = alibi_slopes
ctx.deterministic = deterministic
ctx.group = group
return out if not return_softmax else (out, softmax_lse, None)
@staticmethod
def backward(ctx, dout, *args):
q, k, v, out, softmax_lse = ctx.saved_tensors
dq, dk, dv = stripe_flash_attn_backward(
ctx.group,
dout,
q,
k,
v,
out,
softmax_lse,
softmax_scale=ctx.softmax_scale,
dropout_p=ctx.dropout_p,
causal=ctx.causal,
window_size=ctx.window_size,
alibi_slopes=ctx.alibi_slopes,
deterministic=ctx.deterministic,
)
return dq, dk, dv, None, None, None, None, None, None, None, None
def stripe_flash_attn_qkvpacked_func(
qkv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return StripeFlashAttnFunc.apply(
qkv[:, :, 0],
qkv[:, :, 1],
qkv[:, :, 2],
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def stripe_flash_attn_kvpacked_func(
q,
kv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return StripeFlashAttnFunc.apply(
q,
kv[:, :, 0],
kv[:, :, 1],
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def stripe_flash_attn_func(
q,
k,
v,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return StripeFlashAttnFunc.apply(
q,
k,
v,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/triton_utils.py
================================================
import torch
import triton
import triton.language as tl
@triton.jit
def flatten_kernel(
# pointers to matrices
OUT,
LSE,
CU_SEQLENS,
# strides
stride_out_nheads,
stride_out_seqlen,
stride_lse_batch,
stride_lse_nheads,
stride_lse_seqlen,
# meta-parameters
BLOCK_M: tl.constexpr,
):
pid_m = tl.program_id(axis=0)
pid_batch = tl.program_id(axis=1)
pid_head = tl.program_id(axis=2)
start_idx = tl.load(CU_SEQLENS + pid_batch)
seqlen = tl.load(CU_SEQLENS + pid_batch + 1) - start_idx
LSE = LSE + pid_batch * stride_lse_batch + pid_head * stride_lse_nheads
OUT = OUT + pid_head * stride_out_nheads + start_idx * stride_out_seqlen
rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
LSE = LSE + rm[:, None] * stride_lse_seqlen
x = tl.load(LSE, mask=rm[:, None] < seqlen, other=0.0)
OUT = OUT + rm[:, None] * stride_out_seqlen
tl.store(OUT, x, mask=rm[:, None] < seqlen)
def flatten_varlen_lse(lse, cu_seqlens):
"""
Arguments:
lse: (batch_size, nheads, max_seqlen)
cu_seqlens: (batch_size + 1,)
Return:
flatten_lse: (nheads, total_seqlen)
"""
total_seqlen = cu_seqlens[-1]
batch_size, nheads, max_seqlen = lse.shape
output = torch.empty((nheads, total_seqlen), dtype=lse.dtype, device=lse.device)
grid = lambda META: (triton.cdiv(max_seqlen, META["BLOCK_M"]), batch_size, nheads)
BLOCK_M = 4
with torch.cuda.device(lse.device.index):
flatten_kernel[grid](
output,
lse,
cu_seqlens,
# strides
output.stride(0),
output.stride(1),
lse.stride(0),
lse.stride(1),
lse.stride(2),
BLOCK_M,
)
return output
@triton.jit
def unflatten_kernel(
# pointers to matrices
OUT,
LSE,
CU_SEQLENS,
# strides
stride_out_batch,
stride_out_nheads,
stride_out_seqlen,
stride_lse_seqlen,
stride_lse_nheads,
# meta-parameters
BLOCK_M: tl.constexpr,
):
pid_m = tl.program_id(axis=0)
pid_batch = tl.program_id(axis=1)
pid_head = tl.program_id(axis=2)
start_idx = tl.load(CU_SEQLENS + pid_batch)
seqlen = tl.load(CU_SEQLENS + pid_batch + 1) - start_idx
LSE = LSE + pid_head * stride_lse_nheads + start_idx * stride_lse_seqlen
OUT = OUT + pid_batch * stride_out_batch + pid_head * stride_out_nheads
rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
LSE = LSE + rm[:, None] * stride_lse_seqlen
x = tl.load(LSE, mask=rm[:, None] < seqlen, other=0.0)
OUT = OUT + rm[:, None] * stride_out_seqlen
tl.store(OUT, x, mask=rm[:, None] < seqlen)
def unflatten_varlen_lse(lse, cu_seqlens, max_seqlen: int):
"""
Arguments:
lse: (total_seqlen, nheads, 1)
cu_seqlens: (batch_size + 1,)
max_seqlen: int
Return:
unflatten_lse: (batch_size, nheads, max_seqlen)
"""
lse = lse.unsqueeze(dim=-1)
batch_size = len(cu_seqlens) - 1
nheads = lse.shape[1]
output = torch.empty(
(batch_size, nheads, max_seqlen),
dtype=lse.dtype,
device=lse.device,
)
grid = lambda META: (triton.cdiv(max_seqlen, META["BLOCK_M"]), batch_size, nheads)
BLOCK_M = 4
with torch.cuda.device(lse.device.index):
unflatten_kernel[grid](
output,
lse,
cu_seqlens,
# strides
output.stride(0),
output.stride(1),
output.stride(2),
lse.stride(0),
lse.stride(1),
BLOCK_M,
)
return output
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/utils.py
================================================
from typing import Optional, Tuple
import torch
import torch.distributed as dist
import torch.nn.functional as F
import inspect
from functools import cache
__all__ = ["update_out_and_lse", "RingComm", "get_default_args"]
@cache
def get_default_args(func):
spec = inspect.getfullargspec(func)
defaults = spec.defaults if spec.defaults is not None else ()
padded_defaults = (None,) * (len(spec.args) - len(defaults)) + defaults
args = dict(zip(spec.args, padded_defaults))
if "softcap" in args:
args["softcap"] = 0.0
return args
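# Illustrative sketch, not part of the upstream module: why callers write
# `get_default_args(fn).copy()` before `params.update(...)`. Because of @cache,
# every call returns the SAME dict object, so mutating it in place would leak
# one call's arguments into the next. `_cached_defaults`, `_toy_kernel` and
# `_example_call` are hypothetical stand-ins; the real callers pass flash-attn's
# private forward/backward functions.
import inspect
from functools import cache


@cache
def _cached_defaults(func):
    spec = inspect.getfullargspec(func)
    defaults = spec.defaults if spec.defaults is not None else ()
    # Positional parameters without a default are padded with None.
    padded = (None,) * (len(spec.args) - len(defaults)) + defaults
    return dict(zip(spec.args, padded))


def _toy_kernel(q, k, dropout_p=0.0, causal=False):
    return q, k, dropout_p, causal


def _example_call():
    params = _cached_defaults(_toy_kernel).copy()  # copy keeps the cache clean
    params.update({"q": 1, "k": 2, "causal": True})
    return _toy_kernel(**params)


# Note that `spec.args` only lists positional-or-keyword parameters, so
# keyword-only parameters would not appear in the resulting dict.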
@torch.jit.script
def _update_out_and_lse(
out: torch.Tensor,
lse: torch.Tensor,
block_out: torch.Tensor,
block_lse: torch.Tensor,
) -> Tuple[torch.Tensor, torch.Tensor]:
block_out = block_out.to(torch.float32)
block_lse = block_lse.transpose(-2, -1).unsqueeze(dim=-1)
# new_lse = lse + torch.log(1 + torch.exp(block_lse - lse))
# torch.exp(lse - new_lse) * out + torch.exp(block_lse - new_lse) * block_out
# For additional context and discussion, please refer to:
# https://github.com/zhuzilin/ring-flash-attention/pull/34#issuecomment-2076126795
out = out - F.sigmoid(block_lse - lse) * (out - block_out)
lse = lse - F.logsigmoid(lse - block_lse)
return out, lse
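# Illustrative sketch, not part of the upstream module: a scalar version of the
# merge above, checked against attention over the concatenated keys. It relies
# on the identities
#   lse - logsigmoid(lse - block_lse) = log(exp(lse) + exp(block_lse))
#   sigmoid(block_lse - lse)          = exp(block_lse) / (exp(lse) + exp(block_lse))
# The helper names `_merge_scalar` and `_softmax_attend` are hypothetical.
import math


def _merge_scalar(out, lse, block_out, block_lse):
    weight = 1.0 / (1.0 + math.exp(lse - block_lse))  # sigmoid(block_lse - lse)
    new_out = out - weight * (out - block_out)
    new_lse = lse + math.log1p(math.exp(block_lse - lse))
    return new_out, new_lse


def _softmax_attend(scores, values):
    # Single-query attention over scalar values; returns (output, logsumexp).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    out = sum(w * v for w, v in zip(weights, values)) / total
    return out, top + math.log(total)


def _example_merge():
    scores = [0.3, -1.2, 2.0, 0.7]
    values = [1.0, -2.0, 0.5, 4.0]
    first = _softmax_attend(scores[:2], values[:2])
    second = _softmax_attend(scores[2:], values[2:])
    merged = _merge_scalar(*first, *second)
    return merged, _softmax_attend(scores, values)


# Merging two partial results this way reproduces attention over all four keys
# up to floating-point rounding, which is what lets each ring step fold in one
# K/V block at a time.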
def update_out_and_lse(
out: Optional[torch.Tensor],
lse: Optional[torch.Tensor],
block_out: torch.Tensor,
block_lse: torch.Tensor,
slice_=None,
) -> Tuple[torch.Tensor, torch.Tensor]:
if out is None:
if slice_ is not None:
raise RuntimeError("first update_out_and_lse should not pass slice_ args")
out = block_out.to(torch.float32)
lse = block_lse.transpose(-2, -1).unsqueeze(dim=-1)
elif slice_ is not None:
slice_out, slice_lse = out[slice_], lse[slice_]
slice_out, slice_lse = _update_out_and_lse(
slice_out, slice_lse, block_out, block_lse
)
out[slice_], lse[slice_] = slice_out, slice_lse
else:
out, lse = _update_out_and_lse(out, lse, block_out, block_lse)
return out, lse
@torch.jit.script
def flatten_varlen_lse(lse, cu_seqlens):
new_lse = []
for i in range(len(cu_seqlens) - 1):
start, end = cu_seqlens[i], cu_seqlens[i + 1]
new_lse.append(lse[i, :, : end - start])
return torch.cat(new_lse, dim=1)
@torch.jit.script
def unflatten_varlen_lse(lse, cu_seqlens, max_seqlen: int):
num_seq = len(cu_seqlens) - 1
num_head = lse.shape[-2]
new_lse = torch.empty(
(num_seq, max_seqlen, num_head, 1), dtype=torch.float32, device=lse.device
)
for i in range(num_seq):
start, end = cu_seqlens[i], cu_seqlens[i + 1]
new_lse[i, : end - start] = lse[start:end]
return new_lse.squeeze(dim=-1).transpose(1, 2).contiguous()
class RingComm:
def __init__(self, process_group: dist.ProcessGroup):
self._process_group = process_group
self._ops = []
self.rank = dist.get_rank(self._process_group)
self.world_size = dist.get_world_size(self._process_group)
self._reqs = None
self.send_rank = (self.rank + 1) % self.world_size
self.recv_rank = (self.rank - 1) % self.world_size
if process_group is not None:
self.send_rank = dist.get_global_rank(self._process_group, self.send_rank)
self.recv_rank = dist.get_global_rank(self._process_group, self.recv_rank)
def send_recv(
self, to_send: torch.Tensor, recv_tensor: Optional[torch.Tensor] = None
) -> torch.Tensor:
if recv_tensor is None:
res = torch.empty_like(to_send)
else:
res = recv_tensor
send_op = dist.P2POp(
dist.isend, to_send, self.send_rank, group=self._process_group
)
recv_op = dist.P2POp(dist.irecv, res, self.recv_rank, group=self._process_group)
self._ops.append(send_op)
self._ops.append(recv_op)
return res
def commit(self):
if self._reqs is not None:
raise RuntimeError("commit called twice")
self._reqs = dist.batch_isend_irecv(self._ops)
def wait(self):
if self._reqs is None:
raise RuntimeError("wait called before commit")
for req in self._reqs:
req.wait()
self._reqs = None
self._ops = []
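# Illustrative sketch, not part of the upstream module: a single-process
# simulation of the rotation that RingComm performs. Each rank sends to
# (rank + 1) % W and receives from (rank - 1) % W, so after s exchanges rank r
# holds the block that started on rank (r - s) % W. No torch.distributed calls
# are made here; `_simulate_ring` is a hypothetical helper.
def _simulate_ring(world_size: int):
    held = list(range(world_size))  # held[r] = origin rank of the block on rank r
    seen = [[r] for r in range(world_size)]
    for _ in range(world_size - 1):
        # One send_recv / commit / wait round: rank r receives from rank r - 1.
        held = [held[(r - 1) % world_size] for r in range(world_size)]
        for r in range(world_size):
            seen[r].append(held[r])
    return seen


# With 4 ranks, rank 0 processes blocks in the order [0, 3, 2, 1]; after
# W - 1 exchanges every rank has seen every block exactly once.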
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/zigzag_ring_flash_attn.py
================================================
import torch
import torch.distributed as dist
from flash_attn.flash_attn_interface import _flash_attn_forward, _flash_attn_backward
from .utils import RingComm, update_out_and_lse, get_default_args
def zigzag_ring_flash_attn_forward(
process_group,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
assert causal, "zigzag ring is meaningless for causal=False"
comm = RingComm(process_group)
block_seq_len = q.shape[1] // 2
q1 = q[:, block_seq_len:]
out = None
lse = None
next_k, next_v = None, None
def forward(q, k, v, causal):
params = get_default_args(_flash_attn_forward).copy()
params.update(
{
"q": q,
"k": k,
"v": v,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": dropout_p > 0,
}
)
block_out, _, _, _, _, block_lse, _, _ = _flash_attn_forward(**params)
return block_out, block_lse
for step in range(comm.world_size):
if step + 1 != comm.world_size:
next_k: torch.Tensor = comm.send_recv(k)
next_v: torch.Tensor = comm.send_recv(v)
comm.commit()
if step == 0:
block_out, block_lse = forward(q, k, v, causal=True)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
elif step <= comm.rank:
k0 = k[:, :block_seq_len]
v0 = v[:, :block_seq_len]
block_out, block_lse = forward(q, k0, v0, causal=False)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
else:
block_out, block_lse = forward(q1, k, v, causal=False)
out, lse = update_out_and_lse(
out,
lse,
block_out,
block_lse,
slice_=(slice(None), slice(block_seq_len, None)),
)
if step + 1 != comm.world_size:
comm.wait()
k = next_k
v = next_v
out = out.to(q.dtype)
lse = lse.squeeze(dim=-1).transpose(1, 2)
return out, lse
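# Illustrative sketch, not part of the upstream module: a chunk-level check of
# the three branches in zigzag_ring_flash_attn_forward. It ASSUMES the zigzag
# layout in which the sequence is cut into 2 * W chunks and rank r holds
# chunks r and 2 * W - 1 - r, and that at step s a rank holds the K/V that
# started on rank (r - s) % W. `_zigzag_covers_causal` is a hypothetical name.
def _zigzag_covers_causal(world_size: int) -> bool:
    n_chunks = 2 * world_size
    for rank in range(world_size):
        q_front, q_back = rank, n_chunks - 1 - rank
        visited = set()
        for step in range(world_size):
            src = (rank - step) % world_size
            kv_front, kv_back = src, n_chunks - 1 - src
            if step == 0:
                # Local block with causal=True: the front half sees itself,
                # the back half sees the front half and itself.
                visited |= {(q_front, kv_front), (q_back, kv_front), (q_back, kv_back)}
            elif step <= rank:
                # Full q against the front half of the incoming K/V.
                visited |= {(q_front, kv_front), (q_back, kv_front)}
            else:
                # Back half of q against the full incoming K/V.
                visited |= {(q_back, kv_front), (q_back, kv_back)}
        expected = {
            (q, kv) for q in (q_front, q_back) for kv in range(n_chunks) if kv <= q
        }
        if visited != expected:
            return False
    return True


# Under these assumptions the (query chunk, key chunk) pairs touched across all
# steps are exactly the lower triangle of the causal mask, and every step after
# the first does the same amount of work on every rank.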
def zigzag_ring_flash_attn_backward(
process_group,
dout,
q,
k,
v,
out,
softmax_lse,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
assert causal, "zigzag ring is meaningless for causal=False"
kv_comm = RingComm(process_group)
d_kv_comm = RingComm(process_group)
dq, dk, dv = None, None, None
next_dk, next_dv = None, None
next_k, next_v = None, None
dk_comm_buffer, dv_comm_buffer = None, None
dout1 = dout.chunk(2, dim=1)[1]
q1 = q.chunk(2, dim=1)[1]
out1 = out.chunk(2, dim=1)[1]
softmax_lse1 = softmax_lse.chunk(2, dim=2)[1].contiguous()
block_seq_len = q.shape[1] // 2
# repeatedly allocating buffers may be slow, so allocate them once up front
dq_buffer = torch.empty(q.shape, dtype=q.dtype, device=q.device)
dk_buffer = torch.empty(k.shape, dtype=k.dtype, device=k.device)
dv_buffer = torch.empty(v.shape, dtype=v.dtype, device=v.device)
def backward(dout, q, k, v, out, softmax_lse, causal):
seqlen_q = q.shape[1]
seqlen_kv = k.shape[1]
params = get_default_args(_flash_attn_backward).copy()
params.update(
{
"dout": dout,
"q": q,
"k": k,
"v": v,
"out": out,
"softmax_lse": softmax_lse,
"dq": dq_buffer[:, :seqlen_q],
"dk": dk_buffer[:, :seqlen_kv],
"dv": dv_buffer[:, :seqlen_kv],
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_backward(**params)
for step in range(kv_comm.world_size):
if step + 1 != kv_comm.world_size:
next_k = kv_comm.send_recv(k)
next_v = kv_comm.send_recv(v)
kv_comm.commit()
if step == 0:
backward(dout, q, k, v, out, softmax_lse, causal=True)
dq = dq_buffer.to(torch.float32)
dk = dk_buffer.to(torch.float32)
dv = dv_buffer.to(torch.float32)
else:
if step <= kv_comm.rank:
k0 = k[:, :block_seq_len]
v0 = v[:, :block_seq_len]
backward(dout, q, k0, v0, out, softmax_lse, causal=False)
dq += dq_buffer
else:
backward(dout1, q1, k, v, out1, softmax_lse1, causal=False)
# always use the first half in dq_buffer.
dq[:, block_seq_len:] += dq_buffer[:, :block_seq_len]
d_kv_comm.wait()
dk_comm_buffer, dv_comm_buffer = dk, dv
dk, dv = next_dk, next_dv
if step <= kv_comm.rank:
dk[:, :block_seq_len] += dk_buffer[:, :block_seq_len]
dv[:, :block_seq_len] += dv_buffer[:, :block_seq_len]
else:
dk += dk_buffer
dv += dv_buffer
if step + 1 != kv_comm.world_size:
kv_comm.wait()
k = next_k
v = next_v
next_dk = d_kv_comm.send_recv(dk, dk_comm_buffer)
next_dv = d_kv_comm.send_recv(dv, dv_comm_buffer)
d_kv_comm.commit()
d_kv_comm.wait()
return dq.to(q.dtype), next_dk.to(q.dtype), next_dv.to(q.dtype)
class ZigZagRingFlashAttnFunc(torch.autograd.Function):
@staticmethod
def forward(
ctx,
q,
k,
v,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_softmax,
group,
):
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
assert alibi_slopes is None
k = k.contiguous()
v = v.contiguous()
out, softmax_lse = zigzag_ring_flash_attn_forward(
group,
q,
k,
v,
softmax_scale=softmax_scale,
dropout_p=dropout_p,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=False,
)
# this should be out_padded
ctx.save_for_backward(q, k, v, out, softmax_lse)
ctx.dropout_p = dropout_p
ctx.softmax_scale = softmax_scale
ctx.causal = causal
ctx.window_size = window_size
ctx.alibi_slopes = alibi_slopes
ctx.deterministic = deterministic
ctx.group = group
return out if not return_softmax else (out, softmax_lse, None)
@staticmethod
def backward(ctx, dout, *args):
q, k, v, out, softmax_lse = ctx.saved_tensors
dq, dk, dv = zigzag_ring_flash_attn_backward(
ctx.group,
dout,
q,
k,
v,
out,
softmax_lse,
softmax_scale=ctx.softmax_scale,
dropout_p=ctx.dropout_p,
causal=ctx.causal,
window_size=ctx.window_size,
alibi_slopes=ctx.alibi_slopes,
deterministic=ctx.deterministic,
)
return dq, dk, dv, None, None, None, None, None, None, None, None
def zigzag_ring_flash_attn_qkvpacked_func(
qkv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return ZigZagRingFlashAttnFunc.apply(
qkv[:, :, 0],
qkv[:, :, 1],
qkv[:, :, 2],
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def zigzag_ring_flash_attn_kvpacked_func(
q,
kv,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return ZigZagRingFlashAttnFunc.apply(
q,
kv[:, :, 0],
kv[:, :, 1],
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def zigzag_ring_flash_attn_func(
q,
k,
v,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return ZigZagRingFlashAttnFunc.apply(
q,
k,
v,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ring/zigzag_ring_flash_attn_varlen.py
================================================
import torch
from flash_attn.flash_attn_interface import (
_flash_attn_varlen_forward,
_flash_attn_varlen_backward,
)
from .utils import (
RingComm,
update_out_and_lse,
get_default_args,
)
try:
from .triton_utils import (
flatten_varlen_lse,
unflatten_varlen_lse,
)
# fall back to the TorchScript helpers when the Triton kernels cannot be imported
except Exception:
from .utils import (
flatten_varlen_lse,
unflatten_varlen_lse,
)
def get_half_index(cu_seqlens, *, front: bool):
if len(cu_seqlens) == 2:
if front:
return slice(None, cu_seqlens[-1] // 2)
else:
return slice(cu_seqlens[-1] // 2, None)
index = torch.zeros((cu_seqlens[-1],), dtype=bool)
for i in range(len(cu_seqlens) - 1):
start, end = cu_seqlens[i], cu_seqlens[i + 1]
if front:
end = (start + end) // 2
else:
start = (start + end) // 2
index[start:end] = True
return index
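# Illustrative sketch, not part of the upstream module: the general branch of
# get_half_index rewritten with plain Python lists, to show which packed token
# positions are selected. `_half_mask` is a hypothetical helper; the real
# function returns a torch bool tensor, or a slice when there is one sequence.
def _half_mask(cu_seqlens, front: bool):
    mask = [False] * cu_seqlens[-1]
    for start, end in zip(cu_seqlens[:-1], cu_seqlens[1:]):
        mid = (start + end) // 2
        lo, hi = (start, mid) if front else (mid, end)
        for t in range(lo, hi):
            mask[t] = True
    return mask


# For two packed sequences of lengths 4 and 6 (cu_seqlens = [0, 4, 10]),
# front=True selects tokens 0-1 and 4-6, and front=False selects the rest.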
@torch.jit.script
def get_half_lse(lse, cu_seqlens, *, front: bool):
if lse.dim() == 2:
new_lse = torch.empty(
(lse.shape[0], lse.shape[1] // 2),
dtype=lse.dtype,
device=lse.device,
)
for i in range(len(cu_seqlens) - 1):
start, end = cu_seqlens[i].item(), cu_seqlens[i + 1].item()
new_start, new_end = start // 2, end // 2
if front:
end -= (end - start) // 2
else:
start += (end - start) // 2
new_lse[:, new_start:new_end] = lse[:, start:end]
else:
new_lse = torch.empty(
(lse.shape[0], lse.shape[1], lse.shape[2] // 2),
dtype=lse.dtype,
device=lse.device,
)
for i in range(len(cu_seqlens) - 1):
seqlen = (cu_seqlens[i + 1] - cu_seqlens[i]).item()
if front:
start, end = 0, seqlen // 2
else:
start, end = seqlen // 2, seqlen
new_lse[i, :, : seqlen // 2] = lse[i, :, start:end]
return new_lse
def zigzag_ring_flash_attn_varlen_forward(
process_group,
q: torch.Tensor,
k: torch.Tensor,
v: torch.Tensor,
cu_seqlens,
max_seqlen,
half_index0,
half_index1,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
assert causal, "zigzag ring is meaningless for causal=False"
comm = RingComm(process_group)
block_seq_len = q.shape[0] // 2
q1 = q[half_index1]
out = None
lse = None
next_k, next_v = None, None
half_cu_seqlens = cu_seqlens // 2
half_max_seqlen = max_seqlen // 2
def forward(q, k, v, causal):
seqlen_q = q.shape[0]
seqlen_kv = k.shape[0]
cu_seqlens_q = half_cu_seqlens if seqlen_q == block_seq_len else cu_seqlens
max_seqlen_q = half_max_seqlen if seqlen_q == block_seq_len else max_seqlen
cu_seqlens_kv = half_cu_seqlens if seqlen_kv == block_seq_len else cu_seqlens
max_seqlen_kv = half_max_seqlen if seqlen_kv == block_seq_len else max_seqlen
params = get_default_args(_flash_attn_varlen_forward).copy()
params.update(
{
"q": q,
"k": k,
"v": v,
# the first half and the second half are the same
"cu_seqlens_q": cu_seqlens_q,
"cu_seqlens_k": cu_seqlens_kv,
"max_seqlen_q": max_seqlen_q,
"max_seqlen_k": max_seqlen_kv,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"return_softmax": dropout_p > 0,
}
)
block_out, _, _, _, _, block_lse, _, _ = _flash_attn_varlen_forward(**params)
return block_out, block_lse
old_lse = False
for step in range(comm.world_size):
if step + 1 != comm.world_size:
next_k: torch.Tensor = comm.send_recv(k)
next_v: torch.Tensor = comm.send_recv(v)
comm.commit()
if step == 0:
block_out, block_lse = forward(q, k, v, causal=True)
if block_lse.dim() == 3:
old_lse = True
block_lse = flatten_varlen_lse(
block_lse,
cu_seqlens=cu_seqlens,
)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
elif step <= comm.rank:
k0 = k[half_index0]
v0 = v[half_index0]
block_out, block_lse = forward(q, k0, v0, causal=False)
if block_lse.dim() == 3:
old_lse = True
block_lse = flatten_varlen_lse(
block_lse,
cu_seqlens=cu_seqlens,
)
out, lse = update_out_and_lse(out, lse, block_out, block_lse)
else:
block_out, block_lse = forward(q1, k, v, causal=False)
if block_lse.dim() == 3:
old_lse = True
block_lse = flatten_varlen_lse(
block_lse,
cu_seqlens=half_cu_seqlens,
)
out[half_index1], lse[half_index1] = update_out_and_lse(
out[half_index1], lse[half_index1], block_out, block_lse
)
if step + 1 != comm.world_size:
comm.wait()
k = next_k
v = next_v
out = out.to(q.dtype)
if old_lse:
lse = unflatten_varlen_lse(lse, cu_seqlens, max_seqlen)
else:
lse = lse.squeeze(dim=-1).transpose(0, 1)
return out, lse
def zigzag_ring_flash_attn_varlen_backward(
process_group,
dout,
q,
k,
v,
out,
softmax_lse,
cu_seqlens,
max_seqlen,
half_index0,
half_index1,
softmax_scale,
dropout_p=0,
causal=True,
window_size=(-1, -1),
alibi_slopes=None,
deterministic=False,
):
assert causal, "zigzag ring is meaningless for causal=False"
kv_comm = RingComm(process_group)
d_kv_comm = RingComm(process_group)
dq, dk, dv = None, None, None
next_dk, next_dv = None, None
next_k, next_v = None, None
dk_comm_buffer, dv_comm_buffer = None, None
dout1 = dout[half_index1]
q1 = q[half_index1]
out1 = out[half_index1]
softmax_lse1 = get_half_lse(softmax_lse, cu_seqlens, front=False)
block_seq_len = q.shape[0] // 2
half_cu_seqlens = cu_seqlens // 2
half_max_seqlen = max_seqlen // 2
# repeatedly allocating buffers may be slow, so allocate them once up front
dq_buffer = torch.empty(q.shape, dtype=q.dtype, device=q.device)
dk_buffer = torch.empty(k.shape, dtype=k.dtype, device=k.device)
dv_buffer = torch.empty(v.shape, dtype=v.dtype, device=v.device)
def backward(dout, q, k, v, out, softmax_lse, causal):
seqlen_q = q.shape[0]
seqlen_kv = k.shape[0]
cu_seqlens_q = half_cu_seqlens if seqlen_q == block_seq_len else cu_seqlens
max_seqlen_q = half_max_seqlen if seqlen_q == block_seq_len else max_seqlen
cu_seqlens_kv = half_cu_seqlens if seqlen_kv == block_seq_len else cu_seqlens
max_seqlen_kv = half_max_seqlen if seqlen_kv == block_seq_len else max_seqlen
params = get_default_args(_flash_attn_varlen_backward).copy()
params.update(
{
"dout": dout,
"q": q,
"k": k,
"v": v,
"out": out,
"softmax_lse": softmax_lse,
"dq": dq_buffer[:seqlen_q],
"dk": dk_buffer[:seqlen_kv],
"dv": dv_buffer[:seqlen_kv],
# the first half and the second half are the same
"cu_seqlens_q": cu_seqlens_q,
"cu_seqlens_k": cu_seqlens_kv,
"max_seqlen_q": max_seqlen_q,
"max_seqlen_k": max_seqlen_kv,
"dropout_p": dropout_p,
"softmax_scale": softmax_scale,
"causal": causal,
"window_size": window_size,
"alibi_slopes": alibi_slopes,
"deterministic": deterministic,
}
)
_flash_attn_varlen_backward(**params)
for step in range(kv_comm.world_size):
if step + 1 != kv_comm.world_size:
next_k = kv_comm.send_recv(k)
next_v = kv_comm.send_recv(v)
kv_comm.commit()
if step == 0:
backward(dout, q, k, v, out, softmax_lse, causal=True)
dq = dq_buffer.to(torch.float32)
dk = dk_buffer.to(torch.float32)
dv = dv_buffer.to(torch.float32)
else:
if step <= kv_comm.rank:
k0 = k[half_index0]
v0 = v[half_index0]
backward(dout, q, k0, v0, out, softmax_lse, causal=False)
dq += dq_buffer
else:
backward(dout1, q1, k, v, out1, softmax_lse1, causal=False)
dq[half_index1] += dq_buffer[:block_seq_len]
d_kv_comm.wait()
dk_comm_buffer, dv_comm_buffer = dk, dv
dk, dv = next_dk, next_dv
if step <= kv_comm.rank:
dk[half_index0] += dk_buffer[:block_seq_len]
dv[half_index0] += dv_buffer[:block_seq_len]
else:
dk += dk_buffer
dv += dv_buffer
if step + 1 != kv_comm.world_size:
kv_comm.wait()
k = next_k
v = next_v
next_dk = d_kv_comm.send_recv(dk, dk_comm_buffer)
next_dv = d_kv_comm.send_recv(dv, dv_comm_buffer)
d_kv_comm.commit()
d_kv_comm.wait()
return dq.to(q.dtype), next_dk.to(q.dtype), next_dv.to(q.dtype)
class ZigZagRingFlashAttnVarlenFunc(torch.autograd.Function):
@staticmethod
def forward(
ctx,
q,
k,
v,
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_softmax,
group,
):
if softmax_scale is None:
softmax_scale = q.shape[-1] ** (-0.5)
assert alibi_slopes is None
k = k.contiguous()
v = v.contiguous()
half_index0 = get_half_index(cu_seqlens, front=True)
half_index1 = get_half_index(cu_seqlens, front=False)
out, softmax_lse = zigzag_ring_flash_attn_varlen_forward(
group,
q,
k,
v,
cu_seqlens,
max_seqlen,
half_index0,
half_index1,
softmax_scale=softmax_scale,
dropout_p=dropout_p,
causal=causal,
window_size=window_size,
alibi_slopes=alibi_slopes,
deterministic=False,
)
# this should be out_padded
is_half_index_tensor = isinstance(half_index0, torch.Tensor)
ctx.is_half_index_tensor = is_half_index_tensor
if is_half_index_tensor:
ctx.save_for_backward(
q, k, v, out, softmax_lse, cu_seqlens, half_index0, half_index1
)
else:
ctx.save_for_backward(q, k, v, out, softmax_lse, cu_seqlens)
ctx.half_index0 = half_index0
ctx.half_index1 = half_index1
ctx.max_seqlen = max_seqlen
ctx.dropout_p = dropout_p
ctx.softmax_scale = softmax_scale
ctx.causal = causal
ctx.window_size = window_size
ctx.alibi_slopes = alibi_slopes
ctx.deterministic = deterministic
ctx.group = group
return out if not return_softmax else (out, softmax_lse, None)
@staticmethod
def backward(ctx, dout, *args):
if ctx.is_half_index_tensor:
(q, k, v, out, softmax_lse, cu_seqlens, half_index0, half_index1) = (
ctx.saved_tensors
)
else:
q, k, v, out, softmax_lse, cu_seqlens = ctx.saved_tensors
half_index0 = ctx.half_index0
half_index1 = ctx.half_index1
dq, dk, dv = zigzag_ring_flash_attn_varlen_backward(
ctx.group,
dout,
q,
k,
v,
out,
softmax_lse,
cu_seqlens,
ctx.max_seqlen,
half_index0,
half_index1,
softmax_scale=ctx.softmax_scale,
dropout_p=ctx.dropout_p,
causal=ctx.causal,
window_size=ctx.window_size,
alibi_slopes=ctx.alibi_slopes,
deterministic=ctx.deterministic,
)
return dq, dk, dv, None, None, None, None, None, None, None, None, None, None
def zigzag_ring_flash_attn_varlen_qkvpacked_func(
qkv,
cu_seqlens,
max_seqlen,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return ZigZagRingFlashAttnVarlenFunc.apply(
qkv[:, 0],
qkv[:, 1],
qkv[:, 2],
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def zigzag_ring_flash_attn_varlen_kvpacked_func(
q,
kv,
cu_seqlens,
max_seqlen,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return ZigZagRingFlashAttnVarlenFunc.apply(
q,
kv[:, 0],
kv[:, 1],
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
def zigzag_ring_flash_attn_varlen_func(
q,
k,
v,
cu_seqlens,
max_seqlen,
dropout_p=0.0,
softmax_scale=None,
causal=False,
window_size=(-1, -1), # -1 means infinite context window
alibi_slopes=None,
deterministic=False,
return_attn_probs=False,
group=None,
):
return ZigZagRingFlashAttnVarlenFunc.apply(
q,
k,
v,
cu_seqlens,
max_seqlen,
dropout_p,
softmax_scale,
causal,
window_size,
alibi_slopes,
deterministic,
return_attn_probs,
group,
)
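# Illustration (not part of the original module): the varlen entry points above
# take `cu_seqlens` and `max_seqlen` for a batch of packed sequences. The sketch
# below shows, in plain Python, how cumulative boundaries are usually derived
# from per-sequence lengths, and which two chunks a rank holds under zigzag
# sharding (each sequence cut into 2 * world_size chunks). `build_cu_seqlens`
# and `zigzag_chunk_ids` are hypothetical helper names. In real use
# `cu_seqlens` would be an int32 tensor, and whether the lengths are global or
# per-rank is decided by the caller, which is not shown in this file.

```python
from itertools import accumulate


def build_cu_seqlens(seq_lens):
    """Cumulative sequence boundaries, starting at 0, for packed sequences."""
    return [0] + list(accumulate(seq_lens))


def zigzag_chunk_ids(rank, world_size):
    """Indices of the two chunks (out of 2 * world_size) assigned to `rank`."""
    return (rank, 2 * world_size - 1 - rank)


seq_lens = [8, 16, 4]
cu_seqlens = build_cu_seqlens(seq_lens)  # boundaries of each packed sequence
max_seqlen = max(seq_lens)               # longest sequence in the batch
```

# Pairing an early chunk with a late one balances causal-attention work
# across ranks, since late chunks attend to more keys than early ones.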
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ulysses/__init__.py
================================================
from .attn_layer import UlyssesAttention
__all__ = ['UlyssesAttention']
================================================
FILE: xtuner-eval_niah/xtuner/_lite/yunchang/ulysses/attn_layer.py
================================================
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
import torch
from typing import Any
from torch import Tensor
import torch.distributed as dist
from flash_attn import flash_attn_func
from ..comm.all_to_all import SeqAllToAll4D
import torch.nn.functional as F
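def torch_attn(query,
               key,
               value,
               dropout_p=0.0,
               softmax_scale=None,
               causal=False,
               window_size=(-1, -1),
               softcap=0.0,
               alibi_slopes=None,
               deterministic=False,
               return_attn_probs=False,
               ):
    # Fallback used when FlashAttention is disabled. The signature mirrors the
    # keyword arguments that UlyssesAttention.forward passes to its attention
    # function (including `softcap`), so both backends are interchangeable.
    # Only `dropout_p` and `causal` are honoured here; `softmax_scale`,
    # `window_size`, `softcap`, `alibi_slopes`, `deterministic` and
    # `return_attn_probs` are accepted for compatibility and ignored.
    batch_size, seq_len, hs, hd = query.size()
    # (batch, seq_len, num_heads, head_dim) -> (batch, num_heads, seq_len, head_dim)
    query = query.view(batch_size, -1, hs, hd).transpose(1, 2)
    key = key.view(batch_size, -1, hs, hd).transpose(1, 2)
    value = value.view(batch_size, -1, hs, hd).transpose(1, 2)
    # the output of sdp = (batch, num_heads, seq_len, head_dim)
    # TODO: add support for attn.scale when we move to Torch 2.1
    hidden_states = F.scaled_dot_product_attention(
        query, key, value, dropout_p=dropout_p, is_causal=causal
    )
    # back to (batch, seq_len, num_heads, head_dim)
    hidden_states = hidden_states.transpose(1, 2).reshape(
        batch_size, -1, hs, hd
    )
    hidden_states = hidden_states.to(query.dtype)
    return hidden_states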
class UlyssesAttention(torch.nn.Module):
    """Ulysses sequence-parallel attention.
    Arguments:
        sequence_process_group (ProcessGroup): sequence parallel process group
        scatter_idx (int): scatter_idx for all2all comm
        gather_idx (int): gather_idx for all2all comm
        use_fa (bool): use flash_attn_func for the local attention; when
            False, fall back to torch scaled_dot_product_attention
    """
    def __init__(
        self,
        sequence_process_group: dist.ProcessGroup = None,
        scatter_idx: int = 2,
        gather_idx: int = 1,
        use_fa: bool = True
    ) -> None:
        super(UlyssesAttention, self).__init__()
        self.spg = sequence_process_group
        self.scatter_idx = scatter_idx
        self.gather_idx = gather_idx
        self.use_fa = use_fa
        # Fall back to torch attention on GPUs whose name matches the list
        # below. Querying the device name requires CUDA, so only do it when a
        # CUDA device exists; without CUDA, FlashAttention cannot run at all.
        if torch.cuda.is_available():
            gpu_name = torch.cuda.get_device_name(torch.device("cuda"))
            if "Turing" in gpu_name or "Tesla" in gpu_name or "T4" in gpu_name:
                self.use_fa = False
        else:
            self.use_fa = False
    def forward(
        self,
        query: Tensor,
        key: Tensor,
        value: Tensor,
        dropout_p=0.0,
        softmax_scale=None,
        causal=False,
        window_size=(-1, -1),
        softcap=0.0,
        alibi_slopes=None,
        deterministic=False,
        return_attn_probs=False,
        *args: Any
    ) -> Tensor:
        """forward
        Arguments:
            query (Tensor): query input to the layer
            key (Tensor): key input to the layer
            value (Tensor): value input to the layer
            dropout_p, softmax_scale, causal, window_size, softcap,
            alibi_slopes, deterministic, return_attn_probs: forwarded to the
                local attention function
            args: other args
        Returns:
            * output (Tensor): context output
        """
        # TODO Merge three alltoall calls into one
        # TODO (Reza): change the api on the megatron-deepspeed side so that we only receive all data (q,k, and v) together!
        # in shape : e.g., [s/p:h:]
        # (bs, seq_len/N, head_cnt, head_size) -> (bs, seq_len, head_cnt/N, head_size)
        # scatter 2, gather 1
        q = SeqAllToAll4D.apply(self.spg, query, self.scatter_idx, self.gather_idx)
        k = SeqAllToAll4D.apply(self.spg, key, self.scatter_idx, self.gather_idx)
        v = SeqAllToAll4D.apply(self.spg, value, self.scatter_idx, self.gather_idx)
        if self.use_fa:
            fn = flash_attn_func
        else:
            fn = torch_attn
        # softmax_scale was accepted but silently dropped before; forward it
        # so a caller-supplied scale reaches the attention function.
        context_layer = fn(
            q,
            k,
            v,
            dropout_p=dropout_p,
            softmax_scale=softmax_scale,
            causal=causal,
            window_size=window_size,
            softcap=softcap,
            alibi_slopes=alibi_slopes,
            deterministic=deterministic,
            return_attn_probs=return_attn_probs,
        )
        # flash_attn_func returns a tuple when return_attn_probs=True
        if isinstance(context_layer, tuple):
            context_layer = context_layer[0]
        # (bs, seq_len, head_cnt/N, head_size) -> (bs, seq_len/N, head_cnt, head_size)
        # scatter 1, gather 2
        output = SeqAllToAll4D.apply(
            self.spg, context_layer, self.gather_idx, self.scatter_idx
        )
        # out e.g., [s/p::h]
        return output
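# Illustration (not part of the original module): the first all-to-all in
# `forward` turns a sequence shard holding every head into the full sequence
# holding a shard of the heads. The sketch below reproduces only that layout
# change for one batch element on nested Python lists. `toy_all_to_all` is a
# hypothetical stand-in for `SeqAllToAll4D` and performs no communication.

```python
def toy_all_to_all(shards, world_size):
    """shards[r] is rank r's list of local tokens; each token is a list of heads.

    Returns, per rank, the full token sequence restricted to that rank's heads.
    """
    num_heads = len(shards[0][0])
    assert num_heads % world_size == 0, "head count must divide evenly"
    heads_per_rank = num_heads // world_size
    out = []
    for rank in range(world_size):
        lo, hi = rank * heads_per_rank, (rank + 1) * heads_per_rank
        full_seq = []
        for src in range(world_size):      # gather along the sequence dim
            for token in shards[src]:
                full_seq.append(token[lo:hi])  # scatter along the head dim
        out.append(full_seq)
    return out


world_size, num_heads = 2, 4
# rank 0 holds tokens 0-1, rank 1 holds tokens 2-3; entries are (token, head)
shards = [
    [[(t, h) for h in range(num_heads)] for t in (0, 1)],
    [[(t, h) for h in range(num_heads)] for t in (2, 3)],
]
gathered = toy_all_to_all(shards, world_size)
```

# The head count must be divisible by the sequence-parallel world size, which
# is why this scheme caps the parallel degree at the number of heads.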
================================================
FILE: xtuner-eval_niah/xtuner/apis/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .datasets import * # noqa: F401, F403
from .model import * # noqa: F401, F403
from .training_args import * # noqa: F401, F403
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .alpaca import (alpaca_data_collator, alpaca_dataset,
alpaca_enzh_data_collator, alpaca_enzh_dataset,
alpaca_zh_data_collator, alpaca_zh_dataset)
from .arxiv import arxiv_data_collator, arxiv_dataset
from .code_alpaca import code_alpaca_data_collator, code_alpaca_dataset
from .colorist import colorist_data_collator, colorist_dataset
from .lawyer import (lawyer_crime_data_collator, lawyer_crime_dataset,
lawyer_data_collator, lawyer_dataset,
lawyer_reference_data_collator, lawyer_reference_dataset)
from .medical import medical_data_collator, medical_dataset
from .moss_003_sft import (moss_003_sft_data_collator, moss_003_sft_dataset,
moss_003_sft_no_plugins_data_collator,
moss_003_sft_no_plugins_dataset,
moss_003_sft_plugins_data_collator,
moss_003_sft_plugins_dataset)
from .oasst1 import oasst1_data_collator, oasst1_dataset
from .open_orca import openorca_data_collator, openorca_dataset
from .sql import sql_data_collator, sql_dataset
from .tiny_codes import tiny_codes_data_collator, tiny_codes_dataset
from .wizardlm import wizardlm_data_collator, wizardlm_dataset
__all__ = [
'alpaca_data_collator', 'alpaca_dataset', 'alpaca_enzh_data_collator',
'alpaca_enzh_dataset', 'alpaca_zh_data_collator', 'alpaca_zh_dataset',
'arxiv_data_collator', 'arxiv_dataset', 'medical_data_collator',
'medical_dataset', 'moss_003_sft_data_collator', 'moss_003_sft_dataset',
'moss_003_sft_no_plugins_data_collator', 'moss_003_sft_no_plugins_dataset',
'moss_003_sft_plugins_data_collator', 'moss_003_sft_plugins_dataset',
'oasst1_data_collator', 'oasst1_dataset', 'openorca_data_collator',
'openorca_dataset', 'lawyer_crime_dataset', 'lawyer_crime_data_collator',
'lawyer_reference_dataset', 'lawyer_reference_data_collator',
'lawyer_dataset', 'lawyer_data_collator', 'colorist_dataset',
'colorist_data_collator', 'sql_dataset', 'sql_data_collator',
'code_alpaca_dataset', 'code_alpaca_data_collator', 'tiny_codes_dataset',
'tiny_codes_data_collator', 'wizardlm_data_collator', 'wizardlm_dataset'
]
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/alpaca.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from torch.utils.data import ConcatDataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.utils import PROMPT_TEMPLATE
def alpaca_enzh_dataset(tokenizer,
                        path_en='tatsu-lab/alpaca',
                        path_zh='silk-road/alpaca-data-gpt4-chinese',
                        max_length=2048,
                        prompt_template=PROMPT_TEMPLATE.default,
                        remove_unused_columns=True,
                        pack_to_max_length=True):
    # alpaca_dataset and alpaca_zh_dataset do not accept a
    # `shuffle_before_pack` argument (passing it raises TypeError); both
    # already set shuffle_before_pack=True internally.
    alpaca = alpaca_dataset(
        tokenizer,
        path=path_en,
        max_length=max_length,
        prompt_template=prompt_template,
        remove_unused_columns=remove_unused_columns,
        pack_to_max_length=pack_to_max_length)
    alpaca_zh = alpaca_zh_dataset(
        tokenizer,
        path=path_zh,
        max_length=max_length,
        prompt_template=prompt_template,
        remove_unused_columns=remove_unused_columns,
        pack_to_max_length=pack_to_max_length)
    dataset = ConcatDataset([alpaca, alpaca_zh])
    return dataset
def alpaca_enzh_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
def alpaca_zh_dataset(tokenizer,
path='silk-road/alpaca-data-gpt4-chinese',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def alpaca_zh_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
def alpaca_dataset(tokenizer,
path='tatsu-lab/alpaca',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def alpaca_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
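# Illustration (not part of the original module): every `*_data_collator`
# factory in this package returns `functools.partial(default_collate_fn, ...)`
# with `return_hf_format` bound in advance. The sketch below shows that
# pattern with `toy_collate_fn`, a hypothetical stand-in that only right-pads
# `input_ids`; the real `default_collate_fn` is defined elsewhere and may
# behave differently.

```python
from functools import partial


def toy_collate_fn(instances, pad_index=0, return_hf_format=False):
    """Hypothetical collate function: right-pad input_ids to a common length."""
    max_len = max(len(x["input_ids"]) for x in instances)
    input_ids = [
        x["input_ids"] + [pad_index] * (max_len - len(x["input_ids"]))
        for x in instances
    ]
    batch = {"input_ids": input_ids}
    # Assumed convention: a flat dict for HF-style trainers, a wrapped dict otherwise.
    return batch if return_hf_format else {"data": batch, "data_samples": None}


def toy_data_collator(return_hf_format=False):
    # Same shape as the factories above: bind the flag, return a callable.
    return partial(toy_collate_fn, return_hf_format=return_hf_format)


collate = toy_data_collator(return_hf_format=True)
batch = collate([{"input_ids": [1, 2, 3]}, {"input_ids": [4]}])
```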
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/arxiv.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def arxiv_dataset(tokenizer,
data_file=None,
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv # noqa: E501
# 2. Process data with `./tools/data_preprocess/arxiv.py`
if data_file is None:
data_file = './data/arxiv_postprocess_csAIcsCLcsCV_20200101.json'
dataset_org = load_dataset(path='json', data_files=dict(train=data_file))
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def arxiv_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/code_alpaca.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def code_alpaca_dataset(tokenizer,
path='HuggingFaceH4/CodeAlpaca_20K',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def code_alpaca_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/colorist.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def colorist_dataset(tokenizer,
path='burkelibbey/colors',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def colorist_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/lawyer.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from torch.utils.data import ConcatDataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.utils import PROMPT_TEMPLATE
def lawyer_dataset(tokenizer,
crime_data_file=None,
reference_data_file=None,
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
crime_dataset = lawyer_crime_dataset(
tokenizer,
data_file=crime_data_file,
max_length=max_length,
prompt_template=prompt_template,
remove_unused_columns=remove_unused_columns,
pack_to_max_length=pack_to_max_length)
reference_dataset = lawyer_reference_dataset(
tokenizer,
data_file=reference_data_file,
max_length=max_length,
prompt_template=prompt_template,
remove_unused_columns=remove_unused_columns,
pack_to_max_length=pack_to_max_length)
dataset = ConcatDataset([crime_dataset, reference_dataset])
return dataset
def lawyer_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
def lawyer_crime_dataset(tokenizer,
data_file=None,
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
# Download data from https://github.com/LiuHC0428/LAW-GPT # noqa: E501
if data_file is None:
data_file = './data/law/CrimeKgAssitant清洗后_52k.json'
dataset_org = load_dataset(path='json', data_files=dict(train=data_file))
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def lawyer_crime_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
def lawyer_reference_dataset(tokenizer,
data_file=None,
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
# Download data from https://github.com/LiuHC0428/LAW-GPT # noqa: E501
if data_file is None:
data_file = './data/law/训练数据_带法律依据_92k.json'
dataset_org = load_dataset(path='json', data_files=dict(train=data_file))
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def lawyer_reference_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/medical.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def medical_dataset(tokenizer,
path='shibing624/medical',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=False,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def medical_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/moss_003_sft.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from torch.utils.data import ConcatDataset
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
def moss_003_sft_dataset(tokenizer,
plugins_data_file=None,
no_plugins_data_file=None,
bot_name=None,
max_length=2048):
plugins = moss_003_sft_plugins_dataset(
tokenizer,
data_file=plugins_data_file,
bot_name=bot_name,
max_length=max_length)
no_plugins = moss_003_sft_no_plugins_dataset(
tokenizer,
data_file=no_plugins_data_file,
bot_name=bot_name,
max_length=max_length)
dataset = ConcatDataset([plugins, no_plugins])
return dataset
def moss_003_sft_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
def moss_003_sft_no_plugins_dataset(tokenizer,
data_file=None,
bot_name=None,
max_length=2048):
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
if data_file is None:
data_file = './data/moss-003-sft-no-tools.jsonl'
dataset = MOSSSFTDataset(
data_file=data_file,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
return dataset
def moss_003_sft_no_plugins_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
def moss_003_sft_plugins_dataset(tokenizer,
data_file=None,
bot_name=None,
max_length=2048):
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
if data_file is None:
data_file = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
dataset = MOSSSFTDataset(
data_file=data_file,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
return dataset
def moss_003_sft_plugins_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/oasst1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def oasst1_dataset(tokenizer,
path='timdettmers/openassistant-guanaco',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=False,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def oasst1_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/open_orca.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def openorca_dataset(tokenizer,
path='Open-Orca/OpenOrca',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def openorca_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/sql.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
def sql_dataset(tokenizer,
path='b-mc2/sql-create-context',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def sql_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/tiny_codes.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.utils import PROMPT_TEMPLATE
def tiny_codes_dataset(tokenizer,
path='nampdn-ai/tiny-codes',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=True,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def tiny_codes_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/datasets/wizardlm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from datasets import load_dataset
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, wizardlm_map_fn
from xtuner.utils import PROMPT_TEMPLATE
def wizardlm_dataset(tokenizer,
path='WizardLM/WizardLM_evol_instruct_V2_196k',
max_length=2048,
prompt_template=PROMPT_TEMPLATE.default,
remove_unused_columns=False,
pack_to_max_length=True):
template_map_fn = template_map_fn_factory(template=prompt_template)
dataset_org = load_dataset(path)
dataset = process_hf_dataset(
dataset=dataset_org,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=wizardlm_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=remove_unused_columns,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
return dataset
def wizardlm_data_collator(return_hf_format=False):
return partial(default_collate_fn, return_hf_format=return_hf_format)
================================================
FILE: xtuner-eval_niah/xtuner/apis/model.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.model import SupervisedFinetune
__all__ = ['build_model', 'build_lora_model', 'build_qlora_model']
def build_qlora_model(model_name_or_path,
quantization_config=None,
lora_config=None,
return_tokenizer=True):
if quantization_config is None:
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')
if lora_config is None:
lora_config = LoraConfig(
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM')
llm = AutoModelForCausalLM.from_pretrained(
model_name_or_path,
torch_dtype=torch.float16,
trust_remote_code=True,
quantization_config=quantization_config)
model = SupervisedFinetune(llm, lora=lora_config)
if return_tokenizer:
tokenizer = AutoTokenizer.from_pretrained(
model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True)
return model.llm, tokenizer
else:
return model.llm
def build_lora_model(model_name_or_path,
lora_config=None,
return_tokenizer=True):
if lora_config is None:
lora_config = LoraConfig(
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM')
llm = AutoModelForCausalLM.from_pretrained(
model_name_or_path, torch_dtype=torch.float16, trust_remote_code=True)
model = SupervisedFinetune(llm, lora=lora_config)
if return_tokenizer:
tokenizer = AutoTokenizer.from_pretrained(
model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True)
return model.llm, tokenizer
else:
return model.llm
def build_model(model_name_or_path, return_tokenizer=True):
model = AutoModelForCausalLM.from_pretrained(
model_name_or_path, torch_dtype=torch.float16, trust_remote_code=True)
if return_tokenizer:
tokenizer = AutoTokenizer.from_pretrained(
model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True)
return model, tokenizer
else:
return model
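# Illustration (not part of the original module): the default `LoraConfig`
# above uses r=64 and lora_alpha=16. The arithmetic below shows what those
# values imply for one linear layer, using the standard LoRA formulation
# (update = (lora_alpha / r) * B @ A, with A of shape (r, in) and B of shape
# (out, r)). The 4096 hidden size is a hypothetical example value, and the
# helper names are made up for this sketch.

```python
def lora_extra_params(in_features, out_features, r=64):
    """Trainable parameters added by one LoRA adapter: A (r x in) plus B (out x r)."""
    return r * (in_features + out_features)


def lora_scaling(lora_alpha=16, r=64):
    """Factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


hidden = 4096                                   # hypothetical layer width
extra = lora_extra_params(hidden, hidden)       # adapter parameters
full = hidden * hidden                          # frozen weight parameters
ratio = extra / full                            # fraction that is trainable
scaling = lora_scaling()
```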
================================================
FILE: xtuner-eval_niah/xtuner/apis/training_args.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from dataclasses import dataclass, field
from typing import Union
from transformers import TrainingArguments
from transformers.trainer_utils import IntervalStrategy, SchedulerType
__all__ = ['DefaultTrainingArguments']
@dataclass
class DefaultTrainingArguments(TrainingArguments):
# custom
model_name_or_path: str = field(
default=None,
metadata={'help': 'model name or path.'},
)
dataset_name_or_path: str = field(
default=None,
metadata={'help': 'dataset name or path.'},
)
# huggingface
default_output_dir = './work_dirs'
default_do_train = True
default_per_device_train_batch_size = 1
default_learning_rate = 2e-5
default_save_strategy = 'epoch'
default_lr_scheduler_type = 'cosine'
default_logging_steps = 5
output_dir: str = field(
default=default_output_dir,
metadata={
'help': ('The output directory where the model predictions and '
'checkpoints will be written.')
})
do_train: bool = field(
default=default_do_train,
metadata={'help': 'Whether to run training.'})
per_device_train_batch_size: int = field(
default=default_per_device_train_batch_size,
metadata={'help': 'Batch size per GPU/TPU core/CPU for training.'})
learning_rate: float = field(
default=default_learning_rate,
metadata={'help': 'The initial learning rate for AdamW.'})
save_strategy: Union[IntervalStrategy, str] = field(
default=default_save_strategy,
metadata={'help': 'The checkpoint save strategy to use.'},
)
lr_scheduler_type: Union[SchedulerType, str] = field(
default=default_lr_scheduler_type,
metadata={'help': 'The scheduler type to use.'},
)
logging_steps: float = field(
default=default_logging_steps,
metadata={
'help': ('Log every X updates steps. Should be an integer or a '
'float in range `[0,1)`. If smaller than 1, will be '
'interpreted as ratio of total training steps.')
})
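# Illustration (not part of the original module): `DefaultTrainingArguments`
# overrides inherited dataclass fields by redeclaring them with new defaults,
# and keeps the new default values in un-annotated class attributes. The
# sketch below shows the same mechanism with the standard library only.
# `BaseArgs` is a hypothetical stand-in for `TrainingArguments`, and its
# default values are invented for the example.

```python
from dataclasses import dataclass, field, fields


@dataclass
class BaseArgs:  # hypothetical stand-in for transformers.TrainingArguments
    output_dir: str = "tmp_trainer"
    learning_rate: float = 5e-5
    logging_steps: float = 500


@dataclass
class ToyDefaultArgs(BaseArgs):
    # No annotation: a plain class attribute, not a dataclass field.
    default_learning_rate = 2e-5
    # Redeclared fields replace the inherited defaults.
    output_dir: str = field(default="./work_dirs")
    learning_rate: float = field(default=default_learning_rate)


args = ToyDefaultArgs()
overridden = ToyDefaultArgs(learning_rate=1e-4)
field_names = [f.name for f in fields(ToyDefaultArgs)]
```

# Redeclared fields keep their original position in the field order, so
# positional construction stays compatible with the base class.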
================================================
FILE: xtuner-eval_niah/xtuner/configs/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os
def get_cfgs_name_path():
path = os.path.dirname(__file__)
mapping = {}
for root, dirs, files in os.walk(path):
for file_ in files:
if file_.endswith(
('.py', '.json')
) and not file_.startswith('.') and not file_.startswith('_'):
mapping[os.path.splitext(file_)[0]] = os.path.join(root, file_)
return mapping
cfgs_name_path = get_cfgs_name_path()
__all__ = ['cfgs_name_path']
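# Illustration (not part of the original module): `get_cfgs_name_path` maps
# each config file's base name to its full path. The sketch below repeats the
# same walk against a temporary directory so the filtering rules can be seen
# in isolation. `collect_cfgs` and the file names are hypothetical.

```python
import os
import tempfile


def collect_cfgs(path):
    """Map base name -> path for .py/.json files not starting with '.' or '_'."""
    mapping = {}
    for root, _dirs, files in os.walk(path):
        for file_ in files:
            if file_.endswith(('.py', '.json')) and not file_.startswith(('.', '_')):
                mapping[os.path.splitext(file_)[0]] = os.path.join(root, file_)
    return mapping


with tempfile.TemporaryDirectory() as tmp:
    os.makedirs(os.path.join(tmp, 'baichuan'))
    for rel in ('__init__.py',
                os.path.join('baichuan', 'demo_qlora_e3.py'),
                os.path.join('baichuan', 'notes.txt'),
                'ds.json'):
        open(os.path.join(tmp, rel), 'w').close()
    found = collect_cfgs(tmp)
    names = sorted(found)
```

# Because the key is the bare file name, two configs with the same name in
# different directories collide and the one visited last wins.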
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
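
The `param_scheduler` in the config above expresses its warmup boundary in fractional epochs (`warmup_ratio * max_epochs`) and relies on `convert_to_iter_based=True` to turn that into iterations. The following is a minimal sketch of that arithmetic, not mmengine's own conversion code: the helper name `schedule_boundaries`, the `iters_per_epoch` value, and the use of `round` are assumptions made for illustration.

```python
# Hypothetical helper (not part of xtuner or mmengine): estimates where the
# LinearLR warmup hands over to CosineAnnealingLR, measured in iterations.
def schedule_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    # The config sets `end=warmup_ratio * max_epochs` for the warmup phase.
    warmup_end_epoch = warmup_ratio * max_epochs
    # Assumption: round to the nearest iteration; the library may truncate.
    warmup_iters = round(warmup_end_epoch * iters_per_epoch)
    total_iters = max_epochs * iters_per_epoch
    return warmup_iters, total_iters


# Values from the config above; iters_per_epoch=1000 is an assumed example.
warmup_iters, total_iters = schedule_boundaries(
    warmup_ratio=0.03, max_epochs=3, iters_per_epoch=1000)
print(warmup_iters, total_iters)
```

Under these assumed numbers the warmup covers roughly the first 3% of all training iterations, after which the cosine schedule decays the learning rate towards `eta_min`.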
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
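
The settings `batch_size = 1`, `accumulative_counts = 16`, `r=64` and `lora_alpha=16` in the config above interact in ways that are easy to misread. The sketch below works through the arithmetic under stated assumptions: the helper names and the `world_size` of 8 GPUs are hypothetical, and the `lora_alpha / r` scaling is the standard LoRA convention rather than something this config states.

```python
# Hypothetical helpers for illustration; neither is an xtuner or peft API.
def effective_batch_size(per_device_batch_size, accumulative_counts,
                         world_size):
    # Samples contributing to one optimizer step across all devices.
    return per_device_batch_size * accumulative_counts * world_size


def lora_scale(lora_alpha, r):
    # Standard LoRA scaling factor applied to the low-rank update.
    return lora_alpha / r


# batch_size, accumulative_counts, r and lora_alpha come from the config;
# world_size=8 is an assumed example.
samples_per_step = effective_batch_size(1, 16, world_size=8)
# With pack_to_max_length=True, each sample is a packed sequence of up to
# max_length tokens, so this is an upper bound on tokens per optimizer step.
tokens_per_step = samples_per_step * 2048
scale = lora_scale(lora_alpha=16, r=64)
print(samples_per_step, tokens_per_step, scale)
```

Changing the GPU count changes the effective batch size, so the learning rate of `2e-4` may need retuning when the number of devices differs from the setup the config was written for.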
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
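# Worked example of the schedule above, assuming the PART 1 values in this
# file (warmup_ratio = 0.03, max_epochs = 3):
#   warmup (LinearLR):          epochs 0    -> 0.03 * 3 = 0.09
#   decay  (CosineAnnealingLR): epochs 0.09 -> 3, ending at eta_min = 0.0
# With `convert_to_iter_based=True`, these epoch boundaries are translated
# into iteration counts from the dataloader length at runtime.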
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_base/baichuan2_13b_base_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
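# Worked example of the effective batch size implied by the values above
# (a reading of this config, not a measured figure):
#   samples per optimizer step, per device = batch_size * accumulative_counts
#                                          = 1 * 16 = 16
#   global samples per optimizer step      = 16 * (number of GPUs)
# With pack_to_max_length = True, each sample is a packed sequence of up to
# max_length = 2048 tokens, i.e. up to 16 * 2048 = 32768 tokens per device
# per optimizer step.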
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
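# Note (an assumption about behavior, inferred from the config structure):
# `alpaca_en` and `alpaca_zh` are each tokenized and packed by their own
# `process_hf_dataset` call, and ConcatDataset then joins the results end to
# end. No sampling weights are set here, so the English/Chinese mix seen
# during training should follow the relative number of packed samples in
# each dataset, with ordering randomized by `DefaultSampler(shuffle=True)`.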
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
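# Worked example of the LoRA scaling implied by the settings above, assuming
# PEFT's default (non-rsLoRA) scaling rule:
#   scaling = lora_alpha / r = 16 / 64 = 0.25
# so the low-rank update is multiplied by 0.25 before being added to the
# output of the frozen, 4-bit (NF4) quantized base layer.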
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Use a random seed by default and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_13b_chat/baichuan2_13b_chat_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Use a random seed by default and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Use a random seed by default and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Use a random seed by default and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Use a random seed by default and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
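# Note (illustrative, not part of the upstream config): `train_dataset` above
# concatenates two processed datasets. The sketch below shows how a
# concatenated dataset can map one global index onto its parts, assuming
# behaviour similar to `torch.utils.data.ConcatDataset`. `TinyConcat` is a
# hypothetical stand-in, not XTuner's `ConcatDataset`.

```python
from bisect import bisect_right
from itertools import accumulate


class TinyConcat:
    """Hypothetical index-mapping sketch for a concatenation of datasets."""

    def __init__(self, datasets):
        self.datasets = list(datasets)
        # Running totals of the part lengths, e.g. [2, 3] for lengths 2 and 1.
        self.cumulative = list(accumulate(len(d) for d in self.datasets))

    def __len__(self):
        return self.cumulative[-1] if self.cumulative else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # Find the first part whose running total exceeds idx.
        which = bisect_right(self.cumulative, idx)
        offset = idx - (self.cumulative[which - 1] if which else 0)
        return self.datasets[which][offset]


combined = TinyConcat([["crime_0", "crime_1"], ["law_0"]])
print(len(combined), combined[2])
```

# With `shuffle=True` on the sampler, samples from both parts are interleaved
# at load time, so the order of `datasets=[...]` should not bias training.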
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
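# Note (illustrative, not part of the upstream config): the two schedulers in
# PART 4 describe a linear warmup over the first `warmup_ratio * max_epochs`
# epochs, followed by cosine annealing down to `eta_min`. The sketch below
# approximates that shape in closed form. `lr_at` and `iters_per_epoch` are
# hypothetical names, and mmengine's own epoch-to-iteration conversion may
# round the boundaries differently.

```python
import math


def lr_at(it, iters_per_epoch, max_epochs=3, warmup_ratio=0.03,
          base_lr=2e-4, start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at 0-based iteration `it`."""
    total = max_epochs * iters_per_epoch
    warmup = round(warmup_ratio * max_epochs * iters_per_epoch)
    if it < warmup:
        # Linear ramp of the multiplier from start_factor up to 1.0.
        frac = it / max(1, warmup)
        return base_lr * (start_factor + (1.0 - start_factor) * frac)
    # Cosine decay from base_lr to eta_min over the remaining iterations.
    progress = (it - warmup) / max(1, total - warmup)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Assumed example: 1000 iterations per epoch, so warmup ends at iteration 90.
for step in (0, 45, 90, 1500, 3000):
    print(step, lr_at(step, iters_per_epoch=1000))
```

# With these defaults the warmup covers only 0.09 epochs, so nearly all of
# training runs on the cosine branch.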
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
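# Note (illustrative, not part of the upstream config): this config sets
# `pack_to_max_length = True` with `max_length = 2048`, which packs several
# short samples into one training sequence. The sketch below shows the general
# idea only: concatenate token ids and cut the stream into fixed-size chunks.
# `pack_to_length` is a hypothetical helper; XTuner's actual packing logic
# (labels, shuffling, handling of the final partial chunk) may differ.

```python
def pack_to_length(samples, max_length):
    """Concatenate token-id lists and cut the stream into chunks."""
    stream = [tok for sample in samples for tok in sample]
    # The last chunk may be shorter than max_length in this sketch.
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]


# Three short "samples" packed into chunks of at most 4 tokens.
print(pack_to_length([[1, 2, 3], [4, 5], [6]], max_length=4))
```

# Packing reduces padding waste when most samples are far shorter than
# `max_length`; the 512-token variant of this config disables it instead.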
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
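# Note (illustrative, not part of the upstream config): `batch_size = 1` is per
# device and `accumulative_counts = 16` accumulates gradients before each
# optimizer step. The sketch below works out the resulting effective batch
# size. `effective_batch_size` and `world_size` are hypothetical names, and the
# GPU counts in the example are assumptions.

```python
def effective_batch_size(batch_size, accumulative_counts, world_size=1):
    """Samples contributing to one optimizer step across all devices."""
    return batch_size * accumulative_counts * world_size


# Values from PART 1 of this config, on 1 GPU and on an assumed 8 GPUs.
print(effective_batch_size(1, 16))
print(effective_batch_size(1, 16, world_size=8))
```

# When changing the number of GPUs, `accumulative_counts` can be adjusted to
# keep this product, and hence the behaviour of `lr = 2e-4`, comparable.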
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_base/baichuan2_7b_base_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
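# Note (illustrative, not part of the upstream config): the checkpoint hook
# saves every `save_steps = 500` iterations and keeps at most
# `save_total_limit = 2` files. The sketch below lists which periodic
# checkpoints would remain at the end of a run. `kept_checkpoints` and
# `total_iters` are hypothetical, and any extra checkpoint a hook may write at
# the very end of training is ignored here.

```python
def kept_checkpoints(total_iters, save_steps=500, max_keep=2):
    """Iterations whose periodic checkpoints survive pruning."""
    saved = list(range(save_steps, total_iters + 1, save_steps))
    # A non-positive limit is treated as "keep everything" in this sketch.
    return saved if max_keep <= 0 else saved[-max_keep:]


# Assumed example: a run of 1800 iterations.
print(kept_checkpoints(1800))
print(kept_checkpoints(1800, max_keep=-1))
```

# Raising `save_total_limit` trades disk space for the ability to roll back
# further than the last two saves.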
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
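# Illustrative sketch (not part of the original config): the two entries in
# `param_scheduler` split training at `warmup_ratio * max_epochs` epochs, and
# `convert_to_iter_based=True` turns those epoch boundaries into iteration
# counts. The helper below is hypothetical and only reproduces the arithmetic;
# it assumes fractional epochs are truncated to whole iterations and that
# `iters_per_epoch` is the dataloader length, which may differ from how a
# given mmengine version counts steps.

```python
def schedule_boundaries(max_epochs, warmup_ratio, iters_per_epoch):
    """Return (warmup_end_epoch, warmup_iters, total_iters) for the schedule."""
    # LinearLR runs over [0, warmup_end_epoch); CosineAnnealingLR takes over
    # from warmup_end_epoch until max_epochs.
    warmup_end_epoch = warmup_ratio * max_epochs
    # Assumption: the epoch boundary is truncated to a whole iteration count.
    warmup_iters = int(warmup_end_epoch * iters_per_epoch)
    total_iters = max_epochs * iters_per_epoch
    return warmup_end_epoch, warmup_iters, total_iters


if __name__ == '__main__':
    # Values from this config; 1000 iterations per epoch is a made-up figure.
    print(schedule_boundaries(max_epochs=3, warmup_ratio=0.03,
                              iters_per_epoch=1000))
```

# With the settings above, warmup covers roughly the first 0.09 epochs and
# cosine annealing covers the remainder.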
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
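# Illustrative sketch (not part of the original config): two derived
# quantities that follow from the settings above. `batch_size` is per device
# and gradients are accumulated over `accumulative_counts` steps, so the
# number of samples per optimizer update scales with both and with the number
# of processes. The standard LoRA update is scaled by `lora_alpha / r`. Both
# helper names are hypothetical, and `world_size` is an assumption because
# the config does not state the GPU count.

```python
def effective_batch_size(per_device_batch_size, accumulative_counts,
                         world_size=1):
    """Samples contributing to one optimizer update across all devices."""
    return per_device_batch_size * accumulative_counts * world_size


def lora_scaling(lora_alpha, r):
    """Scaling factor applied to the low-rank update in standard LoRA."""
    return lora_alpha / r


if __name__ == '__main__':
    # batch_size=1 and accumulative_counts=16 come from this config;
    # world_size=8 is an assumed example.
    print(effective_batch_size(1, 16, world_size=8))
    # r=64 and lora_alpha=16 come from the `lora` dict in this config.
    print(lora_scaling(16, 64))
```

# Raising `r` without raising `lora_alpha` shrinks this factor, which is one
# reason the two values are usually tuned together.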
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan2_7b_chat/baichuan2_7b_chat_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
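# Illustrative note (not part of the original config): the two schedulers
# above describe a linear warmup over the first `warmup_ratio * max_epochs`
# epochs, followed by cosine decay to `eta_min`. The sketch below reproduces
# that curve in closed form so the values can be inspected. `lr_at` and
# `iters_per_epoch` are hypothetical names introduced here, and mmengine's
# own handling of iteration boundaries may differ by a step or so.

```python
import math


def lr_at(it, iters_per_epoch, base_lr=2e-4, max_epochs=3,
          warmup_ratio=0.03, start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at iteration `it` (0-based)."""
    total = max_epochs * iters_per_epoch
    # Warmup length after converting the epoch-based range to iterations.
    warmup = round(warmup_ratio * max_epochs * iters_per_epoch)
    if it < warmup:
        # Linear ramp of the multiplier from start_factor up to 1.0.
        frac = it / max(warmup, 1)
        return base_lr * (start_factor + (1.0 - start_factor) * frac)
    # Cosine decay from base_lr down to eta_min over the remaining iterations.
    progress = (it - warmup) / max(total - warmup, 1)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

# With a hypothetical 1000 iterations per epoch, warmup lasts 90 iterations
# and the rate peaks at `lr` before decaying towards zero at iteration 3000.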
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
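# Illustrative note (not part of the original config): with
# `pack_to_max_length=True`, tokenized samples are concatenated and cut into
# chunks of `max_length` tokens, so most training items are exactly
# `max_length` long. The sketch below shows only that core idea.
# `pack_sequences` is a hypothetical helper, not xtuner's implementation,
# which also handles labels, shuffling, and sequence-boundary bookkeeping.

```python
from typing import Iterable, List


def pack_sequences(samples: Iterable[List[int]], max_length: int) -> List[List[int]]:
    """Concatenate token-id lists and split them into max_length chunks."""
    packed: List[List[int]] = []
    buffer: List[int] = []
    for ids in samples:
        buffer.extend(ids)
        # Emit full chunks as soon as enough tokens have accumulated.
        while len(buffer) >= max_length:
            packed.append(buffer[:max_length])
            buffer = buffer[max_length:]
    if buffer:
        # The final chunk may be shorter than max_length.
        packed.append(buffer)
    return packed
```

# Three short samples packed to length 4 yield two full chunks and one
# remainder chunk.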
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
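# Illustrative note (not part of the original config): in standard LoRA the
# low-rank update is scaled by `lora_alpha / r`, so `r=64` with
# `lora_alpha=16` gives a factor of 0.25. Each adapted linear layer of shape
# (d_in, d_out) adds `r * (d_in + d_out)` trainable parameters. The helpers
# below are hypothetical and only do this arithmetic; the width 5120 in the
# usage note is an example value, not something read from the model.

```python
def lora_scaling(r: int = 64, lora_alpha: int = 16) -> float:
    """Scaling applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_param_count(d_in: int, d_out: int, r: int = 64) -> int:
    """Trainable parameters added by one adapter: A is (r, d_in), B is (d_out, r)."""
    return r * (d_in + d_out)
```

# For example, `lora_param_count(5120, 5120)` counts the adapter parameters
# for one square projection of width 5120.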
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
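# Illustrative note (not part of the original config): the checkpoint hook
# above saves every `save_steps` iterations and keeps at most
# `save_total_limit` checkpoints. The sketch below lists which iteration
# checkpoints would remain on disk under that rule. `kept_checkpoints` is a
# hypothetical helper; it ignores any extra checkpoint written at the end of
# training and is not mmengine's implementation.

```python
from typing import List


def kept_checkpoints(current_iter: int, save_steps: int = 500,
                     save_total_limit: int = 2) -> List[int]:
    """Iterations whose checkpoints remain after `current_iter` iterations."""
    saved = list(range(save_steps, current_iter + 1, save_steps))
    if save_total_limit < 0:
        # -1 means unlimited: nothing is pruned.
        return saved
    # Keep only the most recent `save_total_limit` checkpoints.
    return saved[max(len(saved) - save_total_limit, 0):]
```

# After 1700 iterations with the settings above, only the checkpoints from
# iterations 1000 and 1500 would remain.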
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
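# Illustrative note (not part of the original config): `batch_size` is per
# device and gradients are accumulated over `accumulative_counts` batches,
# so the number of samples per optimizer update also depends on the number
# of GPUs. With packing enabled, each sample holds up to `max_length`
# tokens. The helpers and the `world_size` parameter below are hypothetical
# names used only for this arithmetic.

```python
def effective_batch_size(batch_size: int = 1, accumulative_counts: int = 16,
                         world_size: int = 1) -> int:
    """Samples contributing to one optimizer update across all devices."""
    return batch_size * accumulative_counts * world_size


def max_tokens_per_update(max_length: int = 2048, **kwargs) -> int:
    """Upper bound on tokens per optimizer update when samples are packed."""
    return effective_batch_size(**kwargs) * max_length
```

# On a single GPU the settings above give 16 packed samples, or at most
# 32768 tokens, per optimizer update.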
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16  # gradient accumulation steps
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16  # gradient accumulation steps
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16  # gradient accumulation steps
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_moss_sft_all_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
bot_name = 'Baichuan'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16  # gradient accumulation steps
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_moss_sft_all_e2_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
bot_name = 'Baichuan'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1  # gradient accumulation steps
dataloader_num_workers = 2
max_epochs = 2
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_moss_sft_plugins_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
bot_name = 'Baichuan'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
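# English: 'The surface area of a sphere is 384 square centimeters; find its volume.',
# 'Chickens and rabbits share a cage: 20 heads above, 62 feet below. How many of each?',
# 'Tell me about Bill Gates'
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'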
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_base/baichuan_13b_base_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Base'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
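# Worked example (illustrative only; the GPU count is an assumption, not set here):
# each optimizer step accumulates gradients over
#   batch_size * accumulative_counts = 1 * 16 = 16 samples per device.
# With pack_to_max_length = True, each sample is a packed sequence of up to
# max_length = 2048 tokens, so one step covers at most 16 * 2048 = 32768 tokens
# per device; on a hypothetical 8-GPU run that is 16 * 8 = 128 packed sequences.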
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
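# Note on the LoRA settings above (a sketch, assuming PEFT's default scaling,
# i.e. rsLoRA is not enabled): the adapter update is scaled by
#   lora_alpha / r = 16 / 64 = 0.25
# before it is added to the output of the frozen, 4-bit-quantized base layer.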
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
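# Worked example (illustrative; iterations per epoch depend on the dataset):
# warm-up ends at warmup_ratio * max_epochs = 0.03 * 5 = 0.15 epochs. Because
# convert_to_iter_based=True, the epoch boundaries are converted to iterations;
# with a hypothetical 1000 iterations per epoch, LinearLR would cover roughly
# the first 150 iterations (ramping from lr * start_factor up to lr = 2e-4) and
# CosineAnnealingLR the remaining ~4850, decaying towards eta_min = 0.0.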
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
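# A minimal sketch (not part of the upstream config) of how the LinearLR warmup
# and CosineAnnealingLR phases above split training. It assumes the epoch-based
# `begin`/`end` values are converted to iterations by multiplying by the number
# of iterations per epoch and truncating to an integer. `iters_per_epoch` and
# the helper name are hypothetical; the real value depends on dataset size,
# packing, and the number of GPUs.

```python
def schedule_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    """Return (warmup_end_iter, total_iters) for a warmup + cosine schedule."""
    # Warmup covers the first `warmup_ratio * max_epochs` epochs.
    warmup_end_iter = int(warmup_ratio * max_epochs * iters_per_epoch)
    # Cosine annealing runs from the end of warmup to the end of training.
    total_iters = max_epochs * iters_per_epoch
    return warmup_end_iter, total_iters


if __name__ == "__main__":
    # Values from this config; 1000 iterations per epoch is an assumed figure.
    warmup_end, total = schedule_boundaries(
        warmup_ratio=0.03, max_epochs=3, iters_per_epoch=1000)
    print(f"warmup: iters [0, {warmup_end}), cosine: iters [{warmup_end}, {total})")
```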
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
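# A minimal sketch (not part of the upstream config) of the effective batch size
# implied by `batch_size = 1  # per_device` and `accumulative_counts = 16`.
# Gradients are accumulated over `accumulative_counts` micro-batches on each
# device before an optimizer step. `world_size` (the number of GPUs) and the
# helper name are hypothetical; the config itself does not fix a GPU count.

```python
def effective_batch_size(per_device_batch_size, accumulative_counts, world_size):
    """Number of samples contributing to a single optimizer step."""
    return per_device_batch_size * accumulative_counts * world_size


if __name__ == "__main__":
    # batch_size and accumulative_counts come from this config;
    # the GPU counts are assumed purely for illustration.
    for world_size in (1, 8):
        total = effective_batch_size(1, 16, world_size)
        print(f"{world_size} GPU(s): {total} samples per optimizer step")
```

# With `pack_to_max_length = True`, each sample is a packed sequence of up to
# `max_length` tokens, so the token count per step scales accordingly.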
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
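# A minimal sketch (not part of the upstream config) of the scaling factor
# implied by `r=64` and `lora_alpha=16` in the LoRA settings above. It assumes
# standard LoRA scaling, where the low-rank update is multiplied by
# `lora_alpha / r`; the helper name is hypothetical, and rank-stabilized
# variants use a different formula.

```python
def lora_scaling(r, lora_alpha):
    """Multiplier applied to the low-rank update under standard LoRA."""
    return lora_alpha / r


if __name__ == "__main__":
    # Values taken from the `lora=dict(...)` block of this config.
    print(lora_scaling(r=64, lora_alpha=16))  # 0.25
```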
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_13b_chat/baichuan_13b_chat_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
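# Illustrative sketch (not part of the original config): the `param_scheduler`
# above chains a linear warmup over the first `warmup_ratio` of training with a
# cosine decay to `eta_min`. The helper below approximates the resulting
# learning rate in closed form. The name `lr_at`, the rounding of the warmup
# boundary, and the closed-form formula are assumptions made for illustration;
# mmengine's own schedulers update the rate step by step and may differ
# slightly at the boundaries.

```python
import math


def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at `step` for warmup + cosine decay."""
    # Number of iterations spent in linear warmup (assumed rounding).
    warmup_steps = round(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp of the multiplicative factor from start_factor to 1.
        frac = step / max(1, warmup_steps)
        return base_lr * (start_factor + (1.0 - start_factor) * frac)
    # Cosine decay from base_lr down to eta_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at(s, total_steps=1000))
```

# With the defaults above, the rate starts near base_lr * start_factor, reaches
# base_lr at the end of warmup, and decays towards eta_min by the final step.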
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
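# Illustrative sketch (not part of the original config): the `lora` block above
# sets r=64 and lora_alpha=16. In standard LoRA the low-rank update is scaled by
# lora_alpha / r, and each adapted linear layer gains r * (d_in + d_out)
# trainable parameters. The helper names and the 4096 x 4096 layer shape used
# below are hypothetical and chosen only to make the arithmetic concrete.

```python
def lora_scaling(r, lora_alpha):
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_param_count(d_in, d_out, r):
    """Trainable parameters added to one linear layer: A is r x d_in, B is d_out x r."""
    return r * (d_in + d_out)


if __name__ == "__main__":
    print(lora_scaling(r=64, lora_alpha=16))
    # Hypothetical square projection, used only as an example shape.
    print(lora_param_count(d_in=4096, d_out=4096, r=64))
```

# A larger r adds capacity but, for a fixed lora_alpha, shrinks the scaling
# factor, so the two values are usually tuned together.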
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_moss_sft_all_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
bot_name = 'Baichuan'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eoc>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_moss_sft_all_e2_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
bot_name = 'Baichuan'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
dataloader_num_workers = 2
max_epochs = 2
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eoc>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
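# Illustrative sketch (not part of the original config): this variant uses
# batch_size=8 with accumulative_counts=1, whereas the single-epoch MOSS config
# uses batch_size=1 with accumulative_counts=16. The number of samples
# contributing to each optimizer step is the product of the per-device batch
# size, the accumulation count, and the number of processes. The function name
# and the world sizes below are assumptions; the value 8 is only inferred from
# the `_gpu8` suffix in the file name.

```python
def effective_batch_size(per_device_batch_size, accumulative_counts, world_size):
    """Samples contributing to one optimizer step across all devices."""
    return per_device_batch_size * accumulative_counts * world_size


if __name__ == "__main__":
    # This config, assuming 8 GPUs as the file name suggests.
    print(effective_batch_size(8, 1, world_size=8))
    # The batch_size=1 / accumulative_counts=16 variant on a single GPU.
    print(effective_batch_size(1, 16, world_size=1))
```

# Trading gradient accumulation for a larger per-device batch keeps the number
# of samples per step comparable while reducing the number of forward and
# backward passes needed per update, at the cost of more memory per device.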
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_moss_sft_plugins_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
bot_name = 'Baichuan'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eoc>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/baichuan/baichuan_7b/baichuan_7b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means start from the pretrained weights)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means start from the pretrained weights)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means start from the pretrained weights)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means start from the pretrained weights)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
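# English gloss of the prompts below: 'What documents are needed to file for
# divorce?', 'Is it illegal to sell crocodile-skin bags?'
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']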
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm2_6b/chatglm2_6b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b/chatglm3_6b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
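# Worked example (illustrative, not part of the upstream config): with the
# values above, each device accumulates batch_size * accumulative_counts
# = 1 * 16 = 16 samples per optimizer step. Because pack_to_max_length is
# True, each sample is a packed sequence of up to max_length = 2048 tokens,
# so one optimizer step covers at most 16 * 2048 = 32768 tokens per device.
# Assuming N data-parallel GPUs (N is not set in this file), the global
# figure is N times that.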
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
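# Note (illustrative; assumes PEFT's default LoRA scaling, i.e. rank-stabilized
# LoRA is not enabled): the adapter update is scaled by
# lora_alpha / r = 16 / 64 = 0.25.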
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
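# Worked example (illustrative): warmup_ratio * max_epochs = 0.03 * 3 = 0.09,
# so LinearLR ramps the learning rate from lr * start_factor = 2e-4 * 1e-5
# = 2e-9 up to lr over the first 0.09 epochs, and CosineAnnealingLR then
# decays it towards eta_min = 0.0 until epoch 3. convert_to_iter_based=True
# asks MMEngine to translate these epoch boundaries into iteration counts
# based on the dataloader length.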
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/chatglm/chatglm3_6b_base/chatglm3_6b_base_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b-base'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/cohere/README.md
================================================
# Cohere 104B
## Install
```bash
# Install the latest xtuner
pip install -U 'xtuner[deepspeed]'
# Cohere requires the latest version of transformers.
pip install git+https://github.com/huggingface/transformers.git
# Sequence parallel requires flash-attn
pip install flash-attn
```
## Full Parameter Fine-tune
Full-parameter fine-tuning requires 64 A100-80G GPUs.
### slurm
Note: `$PARTITION` is the name of the Slurm partition to submit the job to.
```bash
srun -p $PARTITION --job-name=Cohere --nodes=8 --gres=gpu:8 --ntasks-per-node=8 xtuner train cohere_100b_128k_sp32 --deepspeed deepspeed_zero3 --launcher slurm
```
### torchrun
Note: `$NODE_0_ADDR` is the IP address of node 0.
```bash
# execute on node 0
NPROC_PER_NODE=8 NNODES=8 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train cohere_100b_128k_sp32 --deepspeed deepspeed_zero3
# execute on node 1
NPROC_PER_NODE=8 NNODES=8 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train cohere_100b_128k_sp32 --deepspeed deepspeed_zero3
# repeat on nodes 2-7, setting NODE_RANK to the node's index
```
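The torchrun commands differ only in `NODE_RANK`, so with `NNODES=8` there is one command per node for ranks 0 through 7. The following is a minimal sketch that prints all of them; `torchrun_commands` is a hypothetical helper written for this illustration and is not part of xtuner. It only formats strings using the environment variables and arguments shown above.

```python
def torchrun_commands(nnodes=8, nproc_per_node=8, port=29600,
                      addr="$NODE_0_ADDR",
                      config="cohere_100b_128k_sp32",
                      deepspeed="deepspeed_zero3"):
    """Return one launch command per node (hypothetical helper).

    Every command is identical except for NODE_RANK, which runs
    from 0 to nnodes - 1.
    """
    return [
        f"NPROC_PER_NODE={nproc_per_node} NNODES={nnodes} PORT={port} "
        f"ADDR={addr} NODE_RANK={rank} "
        f"xtuner train {config} --deepspeed {deepspeed}"
        for rank in range(nnodes)
    ]


if __name__ == "__main__":
    # Print the command to run on each node, in rank order.
    for command in torchrun_commands():
        print(command)
```

Each printed line is meant to be run on the node whose index matches its `NODE_RANK`; `$NODE_0_ADDR` is left unexpanded so the shell on each node resolves it.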
### Speed
Measured on A100 80G GPUs; the GPU count for each run is listed in the table:
| Model | Sequence Length | GPUs Number | Sequence Parallel World Size | Tokens per Second | TFLOPs |
| :---------: | :-------------: | :---------: | :--------------------------: | :---------------: | :----: |
| Cohere_100b | 128k | 64 | 32 | 97.3 | 173.4 |
| Cohere_100b | 128k | 128 | 16 | 102.1 | 182.7 |
| Cohere_100b | 128k | 256 | 16 | 101.3 | 181.3 |
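The GPU count and the sequence parallel world size together determine how the job is laid out. The sketch below works through that arithmetic for the three rows above. It assumes that the data parallel size equals the number of GPUs divided by the sequence parallel size, and that each packed sequence is split evenly across the ranks of one sequence parallel group. `parallel_layout` is a hypothetical helper, and its defaults (`batch_size=1`, `accumulative_counts=32`, `max_length=131072`) are taken from the `cohere_100b_128k_sp32` config.

```python
def parallel_layout(num_gpus, sequence_parallel_size, batch_size=1,
                    accumulative_counts=32, max_length=131072):
    """Derive the parallel layout from the GPU count (hypothetical helper).

    Assumes data_parallel_size = num_gpus / sequence_parallel_size and an
    even split of each sequence across one sequence parallel group.
    """
    if num_gpus % sequence_parallel_size != 0:
        raise ValueError("num_gpus must be divisible by "
                         "sequence_parallel_size")
    data_parallel_size = num_gpus // sequence_parallel_size
    sequences_per_step = (data_parallel_size * batch_size
                          * accumulative_counts)
    return {
        "data_parallel_size": data_parallel_size,
        # Tokens of one packed sequence held by a single GPU.
        "tokens_per_rank": max_length // sequence_parallel_size,
        # Sequences consumed between two optimizer updates.
        "sequences_per_optimizer_step": sequences_per_step,
        "tokens_per_optimizer_step": sequences_per_step * max_length,
    }


if __name__ == "__main__":
    # (GPUs, sequence parallel world size) pairs from the table.
    for gpus, sp_size in [(64, 32), (128, 16), (256, 16)]:
        print(gpus, sp_size, parallel_layout(gpus, sp_size))
```

Under these assumptions, the 64-GPU run with a sequence parallel size of 32 forms two data parallel groups, and each GPU holds 4096 tokens of every 128k-token sequence.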
================================================
FILE: xtuner-eval_niah/xtuner/configs/cohere/cohere_104b/cohere_100b_128k_sp32.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'CohereForAI/c4ai-command-r-plus'
use_varlen_attn = False
sequence_parallel_size = 32
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.cohere_chat
max_length = 131072
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 32
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.05
# Save
save_steps = 500
save_total_limit = 1 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 10
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=0,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_iters=16)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
max_new_tokens=100,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/baichuan/baichuan2_13b_base_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Base'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
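# Illustrative sketch, not part of the upstream config.
# The docstring at the top of this file describes the expected data format:
# a JSON list of objects that each carry a single "text" field. One minimal
# way to produce such a file might look like the helper below. The name
# `write_pretrain_json` and its arguments are hypothetical and exist only
# for this example; the output path would then be listed in `data_files`.
def write_pretrain_json(texts, path):
    """Write `texts` to `path` as a JSON list of {"text": ...} records."""
    import json  # local import keeps the config's import block untouched

    records = [{'text': t} for t in texts]
    with open(path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps non-ASCII text (e.g. Chinese) readable.
        json.dump(records, f, ensure_ascii=False, indent=2)
    return records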
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/baichuan/baichuan2_7b_base_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Base'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
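# Illustrative sketch, not part of the upstream config.
# PART 1 of this config notes that the effective batch size is
# num_gpus * batch_size * accumulative_counts, and PART 4 runs the LinearLR
# warmup over the first `warmup_ratio * max_epochs` epochs. The hypothetical
# helper below spells out that arithmetic. `num_gpus` and `iters_per_epoch`
# are assumed inputs that depend on the launch setup and dataset size, and
# the rounding mmengine applies when converting epoch boundaries to
# iterations is not shown in this file, so `round(...)` is an approximation.
def summarize_schedule(num_gpus,
                       iters_per_epoch,
                       batch_size=1,
                       accumulative_counts=16,
                       warmup_ratio=0.03,
                       max_epochs=1):
    """Return the effective batch size and approximate warmup length."""
    effective_batch_size = num_gpus * batch_size * accumulative_counts
    total_iters = max_epochs * iters_per_epoch
    warmup_iters = round(warmup_ratio * max_epochs * iters_per_epoch)
    return dict(
        effective_batch_size=effective_batch_size,
        warmup_iters=warmup_iters,
        total_iters=total_iters)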
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/chatglm/chatglm2_6b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/chatglm/chatglm3_6b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/deepseek/deepseek_moe_16b_base_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-moe-16b-base'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/gemma/gemma_2b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/gemma/gemma_7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
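# Illustrative note (not part of the original config): a minimal sketch of
# writing a file in the "Data format" shown above. `texts` and the output
# path are hypothetical placeholders; only the standard `json` module is
# assumed.
#
#   import json
#   texts = ['first document', 'second document']
#   with open('/path/to/json/file.json', 'w', encoding='utf-8') as f:
#       json.dump([{'text': t} for t in texts], f, ensure_ascii=False)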
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
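# Illustrative note (not part of the original config): with `by_epoch=True`
# and `convert_to_iter_based=True`, the epoch-based `begin`/`end` values above
# are expected to be converted to iteration counts by MMEngine. A rough sketch
# of the arithmetic, where `iters_per_epoch` is a hypothetical name for the
# dataloader length:
#
#   warmup_iters = int(warmup_ratio * max_epochs * iters_per_epoch)
#   cosine_iters = max_epochs * iters_per_epoch - warmup_iters
#
# With warmup_ratio = 0.03, roughly the first 3% of iterations use the linear
# warmup and the remainder follow the cosine schedule.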
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/internlm/internlm2_1_8b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-1_8b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/internlm/internlm2_20b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/internlm/internlm2_7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/llama/llama2_70b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/llama/llama2_7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# effective batch size = 1 GPU * 1 per-device batch * 16 accumulation steps
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None disables loading)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/mistral/mistral_7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mistral-7B-v0.1'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# effective batch size = 1 GPU * 1 per-device batch * 16 accumulation steps
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None disables loading)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/mixtral/mixtral_8x7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mixtral-8x7B-v0.1'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# effective batch size = 1 GPU * 1 per-device batch * 16 accumulation steps
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None disables loading)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen1_5_0_5b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-0.5B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# effective batch size = 1 GPU * 1 per-device batch * 16 accumulation steps
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None disables loading)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen1_5_14b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-14B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# effective batch size = 1 GPU * 1 per-device batch * 16 accumulation steps
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None disables loading)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen1_5_1_8b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-1.8B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# effective batch size = 1 GPU * 1 per-device batch * 16 accumulation steps
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen1_5_4b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-4B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen1_5_72b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-72B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen1_5_7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-7B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen_1_8b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen_72b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/qwen/qwen_7b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: start from the pretrained model)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/starcoder/starcoder_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
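
Before pointing `data_files` at a corpus, it can help to check that the file
follows this layout. The sketch below is one possible check, not an XTuner
utility: `check_pretrain_records` is a hypothetical helper, and requiring a
string-valued "text" key in every record is an assumption read off the example
above.

```python
def check_pretrain_records(records):
    # Return a list of human-readable problems; an empty list means no issue found.
    if not isinstance(records, list):
        return ["top level must be a JSON array"]
    problems = []
    for i, rec in enumerate(records):
        if not isinstance(rec, dict) or "text" not in rec:
            problems.append(f"record {i}: missing 'text' key")
        elif not isinstance(rec["text"], str):
            problems.append(f"record {i}: 'text' must be a string")
    return problems


good = [{"text": "def add(a, b): return a + b"}]
bad = [{"text": 1}, {"body": "x"}, "oops"]
good_problems = check_pretrain_records(good)
bad_problems = check_pretrain_records(bad)
```

Here `good_problems` is empty, while `bad_problems` holds one message per
malformed record.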
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'bigcode/starcoder'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'from typing import List def has_close_elements(numbers: List[float], threshold: float) -> bool: """ Check if in given list of numbers, are any two numbers closer to each other than given threshold. >>> has_close_elements([1.0, 2.0, 3.0], 0.5) False >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) True """' # noqa: E501
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: start from the pretrained model)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/yi/yi_34b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: start from the pretrained model)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/yi/yi_6b_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-6B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: start from the pretrained model)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/pretrain/zephyr/zephyr_7b_beta_full_custom_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"text": "xxx"
},
{
"text": "xxx"
},
...
]
""" # noqa: E501
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'HuggingFaceH4/zephyr-7b-beta'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = ['上海是', 'Shanghai is']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: start from the pretrained model)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/baichuan/baichuan2_13b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
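
A minimal sketch of building one conversation in this layout. `make_turn` and
`ALLOWED_ROLES` are hypothetical helpers, not XTuner APIs. The sketch assumes
the "loss" flag belongs only on assistant turns and marks whether that turn
contributes to the training loss, which the example above suggests but this
file does not state.

```python
import json

ALLOWED_ROLES = ("system", "user", "assistant")


def make_turn(role, content, loss=None):
    # Build one message dict; `loss` is attached only to assistant turns.
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown role: {role}")
    turn = {"role": role, "content": content}
    if role == "assistant" and loss is not None:
        turn["loss"] = bool(loss)
    return turn


conversation = {
    "messages": [
        make_turn("system", "You are a helpful assistant."),
        make_turn("user", "Name one scenic spot in Shanghai."),
        make_turn("assistant", "The Bund.", loss=False),
        make_turn("user", "Name another."),
        make_turn("assistant", "Yu Garden.", loss=True),
    ]
}
# The file on disk is a JSON array of such conversations.
serialized = json.dumps([conversation], ensure_ascii=False)
```

Writing `serialized` to a file gives something that can be listed in
`data_files` below.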
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-13B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load (None to skip)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/baichuan/baichuan2_7b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan2-7B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.baichuan2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load (None to skip)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/baichuan/baichuan_13b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-13B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.baichuan_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load (None to skip)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/baichuan/baichuan_7b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'baichuan-inc/Baichuan-7B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load (None to skip)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/chatglm/chatglm2_6b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm2-6b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.chatglm2
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load (None to skip)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/chatglm/chatglm3_6b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'THUDM/chatglm3-6b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.chatglm3
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True,
padding_side='left')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/deepseek/deepseek_moe_16b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-moe-16b-chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/deepseek/deepseekcoder_6_7b_instruct_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-coder-6.7b-instruct'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.deepseek_coder
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = ''
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/gemma/gemma_2b_it_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b-it' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.gemma
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/gemma/gemma_2b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/gemma/gemma_7b_it_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b-it' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.gemma
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/gemma/gemma_7b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/internlm/internlm2_chat_1_8b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/internlm/internlm2_chat_20b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/internlm/internlm2_chat_7b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/llama/llama2_70b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 3e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
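# Illustrative sketch, not part of the config: the "Data format" docstring at
# the top of this file describes a JSON list of records, each holding a
# `messages` list whose assistant turns may carry a boolean `loss` flag. The
# helper below checks a record against that shape using only the standard
# library. `validate_sample` is a hypothetical name, and the accepted roles are
# assumed from the docstring alone; XTuner's own loader may accept more.

```python
import json


def validate_sample(sample):
    """Return a list of problems found in one `messages` record."""
    messages = sample.get('messages')
    if not isinstance(messages, list) or not messages:
        return ['`messages` must be a non-empty list']
    problems = []
    for i, msg in enumerate(messages):
        role = msg.get('role')
        # Roles shown in the docstring; anything else is flagged.
        if role not in ('system', 'user', 'assistant'):
            problems.append(f'message {i}: unexpected role {role!r}')
        if not isinstance(msg.get('content'), str):
            problems.append(f'message {i}: `content` must be a string')
        # `loss` is optional, but when present it should be a JSON boolean.
        if 'loss' in msg and not isinstance(msg['loss'], bool):
            problems.append(f'message {i}: `loss` must be true or false')
    return problems


# One record mirroring the docstring example.
example = [{
    'messages': [
        {'role': 'system', 'content': 'xxx.'},
        {'role': 'user', 'content': 'xxx.'},
        {'role': 'assistant', 'content': 'xxx.', 'loss': False},
        {'role': 'user', 'content': 'xxx.'},
        {'role': 'assistant', 'content': 'xxx.', 'loss': True},
    ]
}]

# Round-trip through JSON, as the file referenced by `data_files` would be.
serialized = json.dumps(example, ensure_ascii=False, indent=2)
reloaded = json.loads(serialized)
all_problems = [p for s in reloaded for p in validate_sample(s)]
print(all_problems)
```

# Running a check like this over the file before training can surface
# malformed records early, before tokenization and packing start.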
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/llama/llama2_7b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
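# Illustrative sketch, not part of the config: `param_scheduler` above chains a
# linear warmup over the first `warmup_ratio` of training with cosine annealing
# down to `eta_min`. The closed-form function below approximates that shape for
# a given iteration. `lr_at` and `total_steps` are hypothetical names, and the
# exact boundary handling in mmengine's schedulers may differ slightly.

```python
import math


def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at `step` for warmup + cosine annealing."""
    warmup_steps = max(1, round(warmup_ratio * total_steps))
    if step < warmup_steps:
        # Linear ramp of the multiplicative factor from start_factor to 1.
        factor = start_factor + (1.0 - start_factor) * step / warmup_steps
        return base_lr * factor
    # Cosine decay from base_lr to eta_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))


# Sample the curve for a hypothetical 1000-iteration run.
for step in (0, 15, 30, 500, 1000):
    print(step, lr_at(step, total_steps=1000))
```

# With these defaults the rate starts near zero, reaches `lr` at the end of
# warmup, and decays to `eta_min` by the final iteration.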
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/mistral/mistral_7b_full_finetune_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mistral-7B-v0.1'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.mistral
max_length = 32768
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2',
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(type=BatchSampler, drop_last=True, batch_size=1),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
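# Illustrative sketch, not part of the config: `mean_pattern` in
# `log_processor` above is a regular expression over log keys. Assuming the log
# processor applies it with `re.search` to pick the keys that are averaged over
# `window_size` iterations, the snippet below shows which keys would match. The
# key names in `candidate_keys` are hypothetical examples, not taken from a
# real log.

```python
import re

mean_pattern = r'.*(loss|time|data_time|grad_norm|tflops).*'

# Hypothetical log keys used only to exercise the pattern.
candidate_keys = [
    'loss', 'time', 'data_time', 'grad_norm', 'tflops', 'lr', 'tokens_per_sec'
]

# Keep the keys that the pattern selects for averaging.
averaged = [key for key in candidate_keys if re.search(mean_pattern, key)]
print(averaged)
```

# Note that `data_time` already matches through the `time` alternative, so the
# only key this pattern adds over `loss|time|grad_norm` is `tflops`.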
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/mixtral/mixtral_8x7b_instruct_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=[
'q_proj', 'k_proj', 'v_proj', 'o_proj', 'w1', 'w2', 'w3'
],
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
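# Illustrative sketch, not part of the config: the comment on
# `accumulative_counts` above gives the effective batch size as
# GPUs * per-device batch size * accumulation steps. The helper below repeats
# that arithmetic and, assuming `pack_to_max_length=True` fills every sequence
# to `max_length`, also gives an upper bound on tokens per optimizer step.
# `effective_batch` and `num_gpus` are hypothetical names.

```python
def effective_batch(batch_size_per_device, accumulative_counts, num_gpus,
                    max_length):
    """Return (sequences, max tokens) consumed per optimizer step."""
    sequences = batch_size_per_device * accumulative_counts * num_gpus
    # With packing, each sequence holds at most `max_length` tokens.
    max_tokens = sequences * max_length
    return sequences, max_tokens


# Values from this config on a single GPU, then on a hypothetical 8-GPU node.
single_gpu = effective_batch(1, 16, num_gpus=1, max_length=2048)
eight_gpus = effective_batch(1, 16, num_gpus=8, max_length=2048)
print(single_gpu, eight_gpus)
```

# When scaling to more GPUs, lowering `accumulative_counts` by the same factor
# keeps the effective batch size, and therefore the meaning of `lr`, unchanged.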
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen1_5_0_5b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-0.5B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen1_5_14b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
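
A minimal sketch of building a file in the format above, using only the
standard library. `validate_record` is a hypothetical helper written for this
illustration and is not part of xtuner. The "loss" flag is assumed to mark
whether an assistant turn contributes to the training loss.

```python
import json


# Hypothetical helper (not an xtuner API): checks one record against the
# data format documented above.
def validate_record(record):
    messages = record.get('messages')
    if not isinstance(messages, list) or not messages:
        raise ValueError('record needs a non-empty "messages" list')
    for msg in messages:
        if msg.get('role') not in ('system', 'user', 'assistant'):
            raise ValueError('unexpected role: %r' % (msg.get('role'),))
        if not isinstance(msg.get('content'), str):
            raise ValueError('"content" must be a string')
        # "loss" is optional; when present it must be a JSON boolean.
        if 'loss' in msg and not isinstance(msg['loss'], bool):
            raise ValueError('"loss" must be true or false')
    return record


records = [{
    'messages': [
        {'role': 'system', 'content': 'You are a helpful assistant.'},
        {'role': 'user', 'content': 'Hello.'},
        {'role': 'assistant', 'content': 'Hi there.', 'loss': False},
        {'role': 'user', 'content': 'Name one city in China.'},
        {'role': 'assistant', 'content': 'Shanghai.', 'loss': True},
    ]
}]

# Serialize the validated records as a JSON list, keeping non-ASCII text
# readable.
serialized = json.dumps([validate_record(r) for r in records],
                        ensure_ascii=False, indent=2)
```

Saving `serialized` to the path listed in `data_files` below would give the
config a file of the expected shape.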
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-14B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen1_5_1_8b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-1.8B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen1_5_4b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-4B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen1_5_72b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-72B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen1_5_7b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-7B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen_1_8b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen_72b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/qwen/qwen_7b_chat_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/starcoder/starcoder_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'bigcode/starcoder'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
# randomly select 20000 samples from the original dataset
max_dataset_length = 20000
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 1e-4
betas = (0.9, 0.999)
weight_decay = 0.05
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = ''
evaluation_inputs = [
'from typing import List def has_close_elements(numbers: List[float], threshold: float) -> bool: """ Check if in given list of numbers, are any two numbers closer to each other than given threshold. >>> has_close_elements([1.0, 2.0, 3.0], 0.5) False >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) True """' # noqa: E501
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias='none',
target_modules=['c_proj', 'c_attn', 'q_attn'],
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_dataset_length=max_dataset_length,
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/yi/yi_34b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/yi/yi_6b_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-6B'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/custom_dataset/sft/zephyr/zephyr_7b_beta_qlora_custom_sft_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[{
"messages": [
{ "role": "system", "content": "xxx." },
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": false},
{ "role": "user", "content": "xxx." },
{ "role": "assistant", "content": "xxx.", "loss": true}
]
},
...
]
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'HuggingFaceH4/zephyr-7b-beta'
use_varlen_attn = False
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.zephyr
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # bs = 1 GPU * 1 batch_size_per_device * 16 acc
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/README.md
================================================
# DeepSeek V2
## Install
```bash
# Git clone the latest xtuner
git clone https://github.com/InternLM/xtuner.git
# Install the latest xtuner
cd xtuner
pip install -e '.[all]'
# flash-attn is required
pip install flash-attn
# install the latest transformers
pip install -U transformers
```
## Full Parameter Fine-tune
Full-parameter fine-tuning of DeepSeek V2 236B needs at least 64 A100-80G GPUs. The fully fine-tuned model will be saved to `${WORK_DIRS}/hf_model` by `HFCheckpointHook`.
### slurm
Note: `$PARTITION` is the Slurm partition to submit the job to.
```bash
srun -p $PARTITION --job-name=mixtral --nodes=8 --gres=gpu:8 --ntasks-per-node=8 xtuner train deepseek_v2_chat_full_alpaca_e3 --deepspeed deepspeed_zero3 --launcher slurm
```
### torchrun
Note: `$NODE_0_ADDR` is the IP address of the node 0 machine.
```bash
# execute on node 0
NPROC_PER_NODE=8 NNODES=8 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train deepseek_v2_chat_full_alpaca_e3 --deepspeed deepspeed_zero3 --launcher pytorch
# execute on node 1
NPROC_PER_NODE=8 NNODES=8 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train deepseek_v2_chat_full_alpaca_e3 --deepspeed deepspeed_zero3 --launcher pytorch
# execute on node 2, 3, ..., 7
```
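The per-node commands above differ only in `NODE_RANK`. As a minimal sketch, assuming the remaining nodes follow the same pattern as nodes 0 and 1, the full set of commands can be generated with a small helper. `torchrun_commands` is a hypothetical name used for illustration only; it is not part of XTuner.

```python
def torchrun_commands(nnodes=8, nproc_per_node=8, port=29600,
                      config="deepseek_v2_chat_full_alpaca_e3"):
    """Build one launch command per node; only NODE_RANK changes.

    Hypothetical helper: the defaults mirror the values shown above, and
    `$NODE_0_ADDR` is left as a literal placeholder for the shell to expand.
    """
    return [
        f"NPROC_PER_NODE={nproc_per_node} NNODES={nnodes} PORT={port} "
        f"ADDR=$NODE_0_ADDR NODE_RANK={rank} xtuner train {config} "
        "--deepspeed deepspeed_zero3 --launcher pytorch"
        for rank in range(nnodes)
    ]


commands = torchrun_commands()
for rank, command in enumerate(commands):
    print(f"# execute on node {rank}")
    print(command)
```

Each printed command is meant to be run on the node whose rank it names, with `$NODE_0_ADDR` set in that node's environment.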
### Speed
128 * A100 80G:
| Model | Sequence Length | Use Varlen Attn | Sequence Parallel World Size | Tokens per Second |
| :--------------------: | :-------------: | :-------------: | :--------------------------: | :---------------: |
| deepseek v2 hf | 8k | False | 1 | 60 |
| **deepseek v2 XTuner** | **8k** | **False** | **1** | **120 (2x)** |
| deepseek v2 hf | 8k | True | 1 | 60 |
| **deepseek v2 XTuner** | **8k** | **True** | **1** | **130 (2.2x)** |
| deepseek v2 hf | 16k | False | 1 | OOM |
| **deepseek v2 XTuner** | **16k** | **False** | **1** | **148** |
| deepseek v2 hf | 16k | True | 1 | 95 |
| **deepseek v2 XTuner** | **16k** | **True** | **1** | **180 (1.9x)** |
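The speedup factors in parentheses appear to be the XTuner throughput divided by the Hugging Face throughput for the same setting. A short sketch that recomputes them from the table's numbers (the `rows` list and `speedup` function are illustrative names, not part of XTuner):

```python
# (sequence length, use varlen attn, HF tokens/s, XTuner tokens/s),
# copied from the table above; None marks the OOM entry.
rows = [
    ("8k", False, 60, 120),
    ("8k", True, 60, 130),
    ("16k", False, None, 148),
    ("16k", True, 95, 180),
]


def speedup(hf_tps, xtuner_tps):
    """Return XTuner / HF throughput, or None when the HF run hit OOM."""
    if hf_tps is None:
        return None
    return xtuner_tps / hf_tps


speedups = [speedup(hf, xt) for _, _, hf, xt in rows]
for (seq_len, varlen, _, _), ratio in zip(rows, speedups):
    label = "n/a (HF OOM)" if ratio is None else f"{ratio:.1f}x"
    print(f"seq_len={seq_len} varlen={varlen}: {label}")
```

No ratio is reported for the 16k setting without varlen attention because the Hugging Face baseline ran out of memory there.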
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_coder_6_7b_base/deepseek_coder_6_7b_base_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-coder-6.7b-base'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.deepseek_coder
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_coder_6_7b_instruct/deepseekcoder_6_7b_instruct_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-coder-6.7b-instruct'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.deepseek_coder
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_moe_16b_base/deepseek_moe_16b_base_full_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-moe-16b-base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
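# Worked example (illustrative; the GPU count is an assumption, not part of
# this config): samples per optimizer step =
#   batch_size * accumulative_counts * number of data-parallel ranks.
# With the values above on a hypothetical 8-GPU node and
# sequence_parallel_size = 1: 1 * 16 * 8 = 128 packed sequences per step,
# each up to max_length = 2048 tokens when pack_to_max_length is True.
# Scaling accumulative_counts by sequence_parallel_size presumably offsets the
# smaller number of data-parallel ranks when sequence parallelism is enabled.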
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
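# Worked example (illustrative; the iterations-per-epoch figure is an
# assumption): warmup ends at warmup_ratio * max_epochs = 0.03 * 3 = 0.09
# epochs. With `convert_to_iter_based=True` and a hypothetical 1000 iterations
# per epoch, that is about 90 warmup iterations, followed by cosine decay
# towards `eta_min` over the remaining ~2910 iterations.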
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_moe_16b_base/deepseek_moe_16b_base_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-moe-16b-base'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_moe_16b_chat/deepseek_moe_16b_chat_full_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-moe-16b-chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_moe_16b_chat/deepseek_moe_16b_chat_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/deepseek-moe-16b-chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.deepseek_moe
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_v2_chat/deepseek_v2_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from transformers import AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, HFCheckpointHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.model.transformers_models.deepseek_v2 import DeepseekV2ForCausalLM
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/DeepSeek-V2-Chat'
use_varlen_attn = False
# Data
data_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.deepseek_v2
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # bs per device 1 * acc 1 * 128 gpus = 128 total bs
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 50
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Saving the optimizer states of DeepSeek-V2 (236B) requires a lot of
# storage space. It is recommended to set `save_optimizer` to False
# (note that training then cannot be resumed).
save_optimizer = True
# Evaluate the generation performance during the training
evaluation_freq = 25
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
# XTuner's `DeepseekV2ForCausalLM` supports only full fine-tuning.
# Use `AutoModelForCausalLM` for LoRA or QLoRA fine-tuning.
type=DeepseekV2ForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
moe_implementation='shard',
expert_in_one_shard=10,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=0,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: log dataset info and throughput, save HF-format checkpoints
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(type=HFCheckpointHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=1)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_v2_lite_chat/deepseek_v2_lite_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from transformers import AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, HFCheckpointHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.model.transformers_models.deepseek_v2 import DeepseekV2ForCausalLM
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/DeepSeek-V2-Lite-Chat'
use_varlen_attn = False
# Data
data_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.deepseek_v2
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # bs per device 1 * acc 1 * 128 gpus = 128 total bs
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 50
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
save_optimizer = True
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
# XTuner's `DeepseekV2ForCausalLM` supports full fine-tuning only.
# Use `AutoModelForCausalLM` for LoRA or QLoRA fine-tuning.
type=DeepseekV2ForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
moe_implementation='shard',
expert_in_one_shard=8,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=0,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: log dataset info and throughput, save HF-format checkpoints
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(type=HFCheckpointHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=1)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepseek/deepseek_v2_lite_chat/deepseek_v2_lite_chat_full_alpaca_e3_32k_varlen.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from transformers import AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, HFCheckpointHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.model.transformers_models.deepseek_v2 import DeepseekV2ForCausalLM
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'deepseek-ai/DeepSeek-V2-Lite-Chat'
use_varlen_attn = True
# Data
data_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.deepseek_v2
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # bs per device 1 * acc 1 * 128 gpus = 128 total bs
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 50
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
save_optimizer = True
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
# XTuner's `DeepseekV2ForCausalLM` supports full fine-tuning only.
# Use `AutoModelForCausalLM` for LoRA or QLoRA fine-tuning.
type=DeepseekV2ForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
moe_implementation='shard',
expert_in_one_shard=8,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=0,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: log dataset info and throughput, save HF-format checkpoints
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(type=HFCheckpointHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=1)
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepspeed/deepspeed_zero1.json
================================================
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 1,
"overlap_comm": true
},
"fp16": {
"enabled": "auto",
"initial_scale_power": 16
},
"bf16": {
"enabled": "auto"
}
}
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepspeed/deepspeed_zero2.json
================================================
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 2,
"overlap_comm": true
},
"fp16": {
"enabled": "auto",
"initial_scale_power": 16
},
"bf16": {
"enabled": "auto"
}
}
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepspeed/deepspeed_zero2_offload.json
================================================
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 2,
"overlap_comm": true,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
}
},
"fp16": {
"enabled": "auto",
"initial_scale_power": 16
},
"bf16": {
"enabled": "auto"
}
}
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepspeed/deepspeed_zero3.json
================================================
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"stage3_gather_16bit_weights_on_model_save": true
},
"fp16": {
"enabled": "auto",
"initial_scale_power": 16
},
"bf16": {
"enabled": "auto"
}
}
================================================
FILE: xtuner-eval_niah/xtuner/configs/deepspeed/deepspeed_zero3_offload.json
================================================
{
"gradient_accumulation_steps": "auto",
"train_micro_batch_size_per_gpu": "auto",
"gradient_clipping": "auto",
"zero_allow_untested_optimizer": true,
"zero_force_ds_cpu_optimizer": false,
"zero_optimization": {
"stage": 3,
"overlap_comm": true,
"offload_optimizer": {
"device": "cpu",
"pin_memory": true
},
"offload_param": {
"device": "cpu",
"pin_memory": true
},
"stage3_gather_16bit_weights_on_model_save": true
},
"fp16": {
"enabled": "auto",
"initial_scale_power": 16
},
"bf16": {
"enabled": "auto"
}
}
================================================
FILE: xtuner-eval_niah/xtuner/configs/dpo/internlm/internlm2_chat_1_8b_dpo_full.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.dpo import DPO
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = False
dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust'] # noqa: E501
loss_beta = 0.1
label_smoothing = 0.0
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-7 # refer to alignment handbook
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=DPO,
use_varlen_attn=use_varlen_attn,
loss_type=dpo_loss_type,
beta=loss_beta,
label_smoothing=label_smoothing,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/dpo/internlm/internlm2_chat_1_8b_dpo_full_varlenattn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.dpo import DPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust'] # noqa: E501
loss_beta = 0.1
label_smoothing = 0.0
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-7 # refer to alignment handbook
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=DPO,
use_varlen_attn=use_varlen_attn,
loss_type=dpo_loss_type,
beta=loss_beta,
label_smoothing=label_smoothing,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/dpo/internlm/internlm2_chat_1_8b_dpo_full_varlenattn_jsonl_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
load_jsonl_dataset)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.dpo import DPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust'] # noqa: E501
loss_beta = 0.1
label_smoothing = 0.0
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-7 # refer to alignment handbook
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=DPO,
use_varlen_attn=use_varlen_attn,
loss_type=dpo_loss_type,
beta=loss_beta,
label_smoothing=label_smoothing,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_jsonl_dataset,
data_files=[
'/your/jsonl/path/here.jsonl',
'/your/another/jsonl/path/here.jsonl'
]),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/dpo/internlm/internlm2_chat_7b_dpo_qlora_varlenattn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.dpo import DPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b-sft'
use_varlen_attn = True
dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust'] # noqa: E501
loss_beta = 0.1
label_smoothing = 0.0
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-7 # refer to alignment handbook
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=DPO,
use_varlen_attn=use_varlen_attn,
loss_type=dpo_loss_type,
beta=loss_beta,
label_smoothing=label_smoothing,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/dpo/llama/llama3_8b_instruct_dpo_qlora_varlenattn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.dpo import DPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
use_varlen_attn = True
dpo_loss_type = 'sigmoid' # One of ['sigmoid', 'hinge', 'ipo', 'kto_pair', 'sppo_hard', 'nca_pair', 'robust'] # noqa: E501
loss_beta = 0.1
label_smoothing = 0.0
# Data
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-7 # refer to alignment handbook
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=DPO,
loss_type=dpo_loss_type,
use_varlen_attn=use_varlen_attn,
beta=loss_beta,
label_smoothing=label_smoothing,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_2b/gemma_2b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_2b/gemma_2b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_2b_it/gemma_2b_it_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b-it' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.gemma
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# by default, use a random seed and leave `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
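# Illustrative sketch, not part of the config: the `param_scheduler` above
# combines a linear warmup over the first `warmup_ratio * max_epochs` epochs
# with cosine annealing down to `eta_min=0.0`. The helper below approximates
# the resulting learning rate in continuous form. The name `approx_lr` is
# hypothetical, and the real mmengine schedulers step per iteration after
# `convert_to_iter_based=True`, so actual values may differ slightly.

```python
import math

# Values copied from the config above.
LR = 2e-5
WARMUP_RATIO = 0.03
MAX_EPOCHS = 3
START_FACTOR = 1e-5


def approx_lr(epoch: float) -> float:
    """Approximate learning rate at a fractional epoch (continuous sketch)."""
    warmup_end = WARMUP_RATIO * MAX_EPOCHS  # about 0.09 epochs
    if epoch < warmup_end:
        # LinearLR phase: the factor ramps linearly from START_FACTOR to 1.
        factor = START_FACTOR + (1.0 - START_FACTOR) * epoch / warmup_end
        return LR * factor
    # CosineAnnealingLR phase: decay from LR towards 0.0 at MAX_EPOCHS.
    progress = (epoch - warmup_end) / (MAX_EPOCHS - warmup_end)
    return 0.5 * LR * (1.0 + math.cos(math.pi * progress))
```

# Under these assumptions the rate starts near `LR * START_FACTOR`, reaches
# `LR` at the end of warmup, and decays to roughly zero at `MAX_EPOCHS`.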
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_2b_it/gemma_2b_it_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-2b-it' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.gemma
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# by default, use a random seed and leave `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
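# Illustrative sketch, not part of the config: the `lora` block above sets
# `r=64` and `lora_alpha=16`. In standard LoRA (without rank-stabilized
# scaling) the low-rank update is multiplied by `lora_alpha / r`, and each
# adapted linear layer gains `r * (in_features + out_features)` trainable
# parameters. The helper names and the 2048 x 2048 layer shape used below are
# hypothetical and are not taken from the Gemma architecture.

```python
def lora_scaling(r: int, lora_alpha: int) -> float:
    """Scaling applied to the low-rank update in standard LoRA: alpha / r."""
    return lora_alpha / r


def lora_extra_params(in_features: int, out_features: int, r: int) -> int:
    """Trainable parameters added to one linear layer.

    LoRA adds two matrices: A with shape (r, in_features) and
    B with shape (out_features, r).
    """
    return r * (in_features + out_features)


# Values from the config above; the layer shape is an assumed example.
scaling = lora_scaling(r=64, lora_alpha=16)
extra = lora_extra_params(in_features=2048, out_features=2048, r=64)
```

# With these settings the update is scaled by 0.25, so raising `lora_alpha`
# or lowering `r` strengthens the adapter's contribution.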
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_7b/gemma_7b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# by default, use a random seed and leave `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
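# Illustrative sketch, not part of the config: with `batch_size = 1` per
# device and `accumulative_counts = 16`, the number of samples per optimizer
# step also depends on how many processes take part in training. The helpers
# below make that arithmetic explicit. The function names and the `world_size`
# of 8 are assumptions for illustration; the token count is an upper bound
# that assumes every packed sample reaches `max_length`.

```python
def effective_batch_size(per_device: int, accumulative_counts: int,
                         world_size: int) -> int:
    """Samples contributing to one optimizer step across all devices."""
    return per_device * accumulative_counts * world_size


def max_tokens_per_step(per_device: int, accumulative_counts: int,
                        world_size: int, max_length: int) -> int:
    """Upper bound on tokens per optimizer step when samples are packed."""
    return effective_batch_size(per_device, accumulative_counts,
                                world_size) * max_length


# Config values above, with an assumed 8-GPU run.
samples = effective_batch_size(per_device=1, accumulative_counts=16,
                               world_size=8)
tokens = max_tokens_per_step(per_device=1, accumulative_counts=16,
                             world_size=8, max_length=2048)
```

# When the GPU count changes, adjusting `accumulative_counts` is one way to
# keep the effective batch size comparable between runs.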
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_7b/gemma_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# by default, use a random seed and leave `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
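# Illustrative sketch, not part of the config: the `checkpoint` hook above
# saves every `save_steps` iterations and keeps at most `save_total_limit`
# checkpoints. The helper below lists which iteration checkpoints would
# remain on disk at a given point in training. The function name is
# hypothetical, non-positive limits are treated as unlimited in this sketch,
# and any extra checkpoint written at the very end of training is ignored.

```python
from typing import List


def retained_checkpoints(current_iter: int, save_steps: int = 500,
                         save_total_limit: int = 2) -> List[int]:
    """Iterations whose checkpoints are still kept at `current_iter`."""
    # Every multiple of save_steps reached so far produced a checkpoint.
    saved = list(range(save_steps, current_iter + 1, save_steps))
    if save_total_limit <= 0:
        # Treated as "keep everything" in this sketch.
        return saved
    # Older checkpoints are removed once the limit is exceeded.
    return saved[-save_total_limit:]
```

# For example, at iteration 1700 with the values above, only the checkpoints
# from iterations 1000 and 1500 would remain under these assumptions.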
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_7b_it/gemma_7b_it_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b-it' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.gemma
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# by default, use a random seed and leave `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/gemma/gemma_7b_it/gemma_7b_it_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'google/gemma-7b-it' # Gemma requires transformers>=4.38.1 # noqa: E501
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.gemma
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
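# Note: the `param_scheduler` above chains a linear warmup over the first
# `warmup_ratio * max_epochs` epochs with a cosine decay to `eta_min`.
# The sketch below is a closed-form approximation of that shape, not
# mmengine's implementation; `lr_at` and `total_iters` are hypothetical
# names introduced only for illustration.

```python
import math


def lr_at(it, total_iters, base_lr=2e-4, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at 0-based iteration `it`."""
    warmup_iters = int(warmup_ratio * total_iters)
    if it < warmup_iters:
        # Linear ramp from start_factor * base_lr up to base_lr.
        factor = start_factor + (1.0 - start_factor) * it / warmup_iters
        return base_lr * factor
    # Cosine decay from base_lr down to eta_min over the remaining iterations.
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * progress))
```

# With the defaults above, the rate starts near zero, reaches `base_lr` when
# warmup ends, and falls back towards `eta_min` at the final iteration.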
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_1_8b/internlm2_1_8b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-1_8b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
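# Note: this config multiplies `accumulative_counts` by
# `sequence_parallel_size`. A plausible reading is that this keeps the global
# batch size constant, because ranks in one sequence-parallel group work on
# the same samples. The sketch below assumes exactly that; the function name
# and the `world_size` argument are hypothetical, not part of xtuner.

```python
def global_batch_size(batch_size, base_accumulative_counts, world_size,
                      sequence_parallel_size=1):
    """Samples contributing to one optimizer step, under the stated assumption."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError('world_size must be divisible by sequence_parallel_size')
    # Mirrors `accumulative_counts *= sequence_parallel_size` in the config.
    accumulative_counts = base_accumulative_counts * sequence_parallel_size
    # Only world_size / sequence_parallel_size groups see distinct data.
    data_parallel_size = world_size // sequence_parallel_size
    return batch_size * accumulative_counts * data_parallel_size
```

# Under this assumption, 8 GPUs give the same global batch size with
# `sequence_parallel_size=1` as with `sequence_parallel_size=2`.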
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_1_8b/internlm2_1_8b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-1_8b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
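# Note: the `lora` block above sets `r=64` and `lora_alpha=16`. In the
# standard LoRA formulation the low-rank update is scaled by `lora_alpha / r`,
# and each adapted linear layer gains two small matrices. The sketch below
# works through that arithmetic; the helper names and the 2048-wide layer in
# the usage note are illustrative assumptions, not values from this config.

```python
def lora_scaling(r=64, lora_alpha=16):
    """Scale applied to the low-rank update B @ A in standard LoRA."""
    return lora_alpha / r


def lora_extra_params(in_features, out_features, r=64):
    """Trainable parameters added to one linear layer.

    A has shape (r, in_features) and B has shape (out_features, r).
    """
    return r * (in_features + out_features)
```

# For example, `lora_scaling()` evaluates to 0.25, and a hypothetical
# 2048 x 2048 projection would gain 64 * (2048 + 2048) trainable parameters.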
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_full_finetune_custom_dataset_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"conversation": [
{
"system": "",
"input": "xxx",
"output": "xxx"
},
{
"input": "xxx",
"output": "xxx"
}
]
},
...
]
Please refer to https://github.com/InternLM/xtuner/blob/main/docs/en/user_guides/dataset_format.md for details.
""" # noqa: E501
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
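# Note: the module docstring of this config describes the expected JSON
# layout: a list of records, each with a "conversation" list whose turns carry
# "input" and "output" strings (plus an optional "system" string). The checker
# below is a minimal sketch for validating a file before training;
# `validate_records` is a hypothetical helper, not part of xtuner.

```python
import json


def validate_records(records):
    """Return a list of problems found; an empty list means the data looks OK."""
    if not isinstance(records, list):
        return ['top level must be a list']
    problems = []
    for i, rec in enumerate(records):
        turns = rec.get('conversation') if isinstance(rec, dict) else None
        if not isinstance(turns, list) or not turns:
            problems.append(f'record {i}: missing non-empty "conversation" list')
            continue
        for j, turn in enumerate(turns):
            if not isinstance(turn, dict):
                problems.append(f'record {i}, turn {j}: must be an object')
                continue
            # Every turn needs both an input and an output string.
            for key in ('input', 'output'):
                if not isinstance(turn.get(key), str):
                    problems.append(
                        f'record {i}, turn {j}: "{key}" must be a string')
    return problems


def validate_file(path):
    """Load a JSON file and validate it with `validate_records`."""
    with open(path, encoding='utf-8') as f:
        return validate_records(json.load(f))
```

# Calling `validate_file` on each entry of `data_files` before launching a run
# is one way to catch malformed records early.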
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
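# Note: `pack_to_max_length=True` asks the dataset pipeline to merge short
# tokenized samples into sequences of up to `max_length` tokens. The sketch
# below shows the basic idea only (concatenate, then slice). It is a
# simplification with a hypothetical name, `pack_tokens`, and it ignores
# labels, shuffling, and the per-sample boundaries that variable-length
# attention would need.

```python
from itertools import chain


def pack_tokens(token_lists, max_length):
    """Concatenate token lists and cut the stream into max_length chunks."""
    if max_length <= 0:
        raise ValueError('max_length must be positive')
    stream = list(chain.from_iterable(token_lists))
    # The last chunk may be shorter than max_length.
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]
```

# Note that a sample can straddle two chunks in this simplified scheme.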
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
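# English: 'What documents do I need to prepare for a divorce?',
#          'Is it illegal to sell crocodile-skin bags?'
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']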
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_msagent_react_e3_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from modelscope.msdatasets import MsDataset
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_ms_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (msagent_react_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_path = 'damo/MSAgent-Bench'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 2
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
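# ReAct-style system prompt, written in Chinese. It lists the available tools
# (GoogleSearch, PythonInterpreter) and the required reply format:
# Thought / Action / Action Input, then Response, then Final Answer.
SYSTEM = (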
'你是一个可以调用外部工具的助手,可以使用的工具包括:\n'
"{{\'GoogleSearch\': \'一个可以从谷歌搜索结果的API。\\n"
'当你需要对于一个特定问题找到简短明了的回答时,可以使用它。\\n'
"输入应该是一个搜索查询。\\n\\n\',"
"\'PythonInterpreter\': \"用来执行Python代码。代码必须是一个函数,\\n"
"函数名必须得是 \'solution\',代码对应你的思考过程。代码实例格式如下:\\n"
'```python\\n# import 依赖包\\nimport xxx\\ndef solution():'
'\\n # 初始化一些变量\\n variable_names_with_real_meaning = xxx'
'\\n # 步骤一\\n mid_variable = func(variable_names_with_real_meaning)'
'\\n # 步骤 x\\n mid_variable = func(mid_variable)\\n # 最后结果'
'\\n final_answer = func(mid_variable)\\n return final_answer'
"\\n```\\n\"}}\n"
'如果使用工具请遵循以下格式回复:\n```\n'
'Thought:思考你当前步骤需要解决什么问题,是否需要使用工具\n'
"Action:工具名称,你的工具必须从 [[\'GoogleSearch\', \'PythonInterpreter\']] 选择"
'\nAction Input:工具输入参数\n```\n工具返回按照以下格式回复:\n'
'```\nResponse:调用工具后的结果\n```'
'\n如果你已经知道了答案,或者你不需要工具,请遵循以下格式回复\n```'
'\nThought:给出最终答案的思考过程\nFinal Answer:最终答案\n```\n开始!\n')
# English: 'What will the weather be like in Shanghai tomorrow?'
evaluation_inputs = ['上海明天天气怎么样?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_ms_dataset,
dataset=dict(type=MsDataset.load, dataset_name=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=msagent_react_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
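# Illustrative sketch (not part of the upstream XTuner config): how the
# epoch-based `begin`/`end` values in `param_scheduler` map to iterations once
# `convert_to_iter_based=True` is applied. `_schedule_boundaries` and
# `iters_per_epoch` are hypothetical names, and the rounding is an assumption;
# MMEngine's own conversion may round differently.
def _schedule_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    """Return (warmup_end_iter, total_iters) for the two-phase schedule."""
    total_iters = max_epochs * iters_per_epoch
    warmup_end_iter = round(warmup_ratio * max_epochs * iters_per_epoch)
    return warmup_end_iter, total_iters
# Example: with warmup_ratio=0.03, max_epochs=3 and an assumed 1000 iterations
# per epoch, LinearLR covers iterations [0, 90) and CosineAnnealingLR covers
# [90, 3000).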
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_20b/internlm2_20b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-20b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_full_finetune_custom_dataset_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
  {
    "conversation": [
      {
        "system": "",
        "input": "xxx",
        "output": "xxx"
      },
      {
        "input": "xxx",
        "output": "xxx"
      }
    ]
  },
...
]
Please refer to https://github.com/InternLM/xtuner/blob/main/docs/en/user_guides/dataset_format.md for details.
""" # noqa: E501
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# Batch size per device; set to 1 if `use_varlen_attn` = True.
# With packed sequences, enlarging the batch size is essentially the same as
# enlarging `max_length`: doubling `max_length` is equivalent to doubling the
# batch size.
batch_size = 1
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
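# Illustrative sketch (not part of the upstream XTuner config): the arithmetic
# behind the `1bs * 1acc * 64gpu = 64 batchsize` comment in PART 1.
# `_global_batch_size` and `world_size` are hypothetical names. The division
# by `sequence_parallel_size` assumes that the GPUs in one sequence-parallel
# group share a single sample, which is why `accumulative_counts` is
# multiplied by `sequence_parallel_size` in PART 1.
def _global_batch_size(batch_size, accumulative_counts, world_size,
                       sequence_parallel_size=1):
    """Samples contributing to one optimizer step across all GPUs."""
    data_parallel_size = world_size // sequence_parallel_size
    return batch_size * accumulative_counts * data_parallel_size
# With the settings in this file on 64 GPUs, _global_batch_size(1, 1, 64)
# gives 64. With sequence_parallel_size=4 and accumulative_counts=4 on the
# same 64 GPUs, the result is still 64.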
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_full_finetune_custom_dataset_e1_sequence_parallel_4.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
  {
    "conversation": [
      {
        "system": "",
        "input": "xxx",
        "output": "xxx"
      },
      {
        "input": "xxx",
        "output": "xxx"
      }
    ]
  },
...
]
Please refer to https://github.com/InternLM/xtuner/blob/main/docs/en/user_guides/dataset_format.md for details.
""" # noqa: E501
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 4
# Scheduler & Optimizer
batch_size = 1 # per_device
# The base `accumulative_counts` below is multiplied by
# `sequence_parallel_size` on the following line.
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(type=BatchSampler, drop_last=True, batch_size=1),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
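# Illustrative sketch, not part of the original config: with PEFT's default
# (non-rsLoRA) settings, the low-rank update is scaled by `lora_alpha / r`
# before it is added to the frozen weight. The `example_*` names are
# hypothetical and only restate the values passed to `LoraConfig` above.
example_lora_r = 64  # mirrors r=64
example_lora_alpha = 16  # mirrors lora_alpha=16
example_lora_scaling = example_lora_alpha / example_lora_r  # 16 / 64 = 0.25
assert example_lora_scaling == 0.25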
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
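# Illustrative sketch, not part of the original config: how the epoch-based
# warmup boundary above maps to iterations. As far as I understand mmengine,
# `convert_to_iter_based=True` multiplies `begin`/`end` by the number of
# iterations per epoch and truncates to an int. The epoch length of 1000 is
# an assumption, and the `example_*` names are hypothetical.
example_iters_per_epoch = 1000  # assumption: len(train_dataloader)
example_warmup_ratio = 0.03  # mirrors `warmup_ratio`
example_max_epochs = 3  # mirrors `max_epochs`
# Linear warmup covers roughly the first 90 of 3000 iterations (the exact
# value depends on floating-point truncation); cosine decay takes over after.
example_warmup_end_iter = int(
    example_warmup_ratio * example_max_epochs * example_iters_per_epoch)
example_total_iters = example_max_epochs * example_iters_per_epoch
assert 0 < example_warmup_end_iter < example_total_iters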
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_json_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'path/to/your/json_data'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_msagent_react_e3_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from modelscope.msdatasets import MsDataset
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_ms_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (msagent_react_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'damo/MSAgent-Bench'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 2
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = (
'你是一个可以调用外部工具的助手,可以使用的工具包括:\n'
"{{\'GoogleSearch\': \'一个可以从谷歌搜索结果的API。\\n"
'当你需要对于一个特定问题找到简短明了的回答时,可以使用它。\\n'
"输入应该是一个搜索查询。\\n\\n\',"
"\'PythonInterpreter\': \"用来执行Python代码。代码必须是一个函数,\\n"
"函数名必须得是 \'solution\',代码对应你的思考过程。代码实例格式如下:\\n"
'```python\\n# import 依赖包\\nimport xxx\\ndef solution():'
'\\n # 初始化一些变量\\n variable_names_with_real_meaning = xxx'
'\\n # 步骤一\\n mid_variable = func(variable_names_with_real_meaning)'
'\\n # 步骤 x\\n mid_variable = func(mid_variable)\\n # 最后结果'
'\\n final_answer = func(mid_variable)\\n return final_answer'
"\\n```\\n\"}}\n"
'如果使用工具请遵循以下格式回复:\n```\n'
'Thought:思考你当前步骤需要解决什么问题,是否需要使用工具\n'
"Action:工具名称,你的工具必须从 [[\'GoogleSearch\', \'PythonInterpreter\']] 选择"
'\nAction Input:工具输入参数\n```\n工具返回按照以下格式回复:\n'
'```\nResponse:调用工具后的结果\n```'
'\n如果你已经知道了答案,或者你不需要工具,请遵循以下格式回复\n```'
'\nThought:给出最终答案的思考过程\nFinal Answer:最终答案\n```\n开始!\n')
evaluation_inputs = ['上海明天天气怎么样?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_ms_dataset,
dataset=dict(type=MsDataset.load, dataset_name=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=msagent_react_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_w_internevo_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.intern_repo import (build_packed_dataset,
load_intern_repo_tokenized_dataset)
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = True
# Data
dataset_folder = '/path/to/sft/data/folder' # noqa: E501
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_packed_dataset,
dataset_cfg=dict(
type=load_intern_repo_tokenized_dataset,
data_order_path=None,
folder=dataset_folder,
min_length=0,
file_type='.bin'),
packed_length=max_length,
seed=1024)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(type=BatchSampler, drop_last=True, batch_size=1),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration (interval=1).
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_w_tokenized_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.intern_repo import (build_packed_dataset,
load_intern_repo_tokenized_dataset)
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = True
# Data
dataset_folder = '/path/to/sft/data/folder' # noqa: E501
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_packed_dataset,
dataset_cfg=dict(
type=load_intern_repo_tokenized_dataset,
data_order_path=None,
folder=dataset_folder,
min_length=0,
file_type='.bin'),
packed_length=max_length,
seed=1024)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration (interval=1).
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_7b/internlm2_7b_w_untokenized_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.intern_repo import (build_packed_dataset,
load_intern_repo_untokenized_dataset)
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-7b'
use_varlen_attn = True
# Data
dataset_folder = 'v1_sample_with_legal_cate' # noqa: E501
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_packed_dataset,
dataset_cfg=dict(
type=load_intern_repo_untokenized_dataset,
data_order_path=None,
folder=dataset_folder,
tokenizer=tokenizer,
max_length=max_length,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
file_type='.json'),
packed_length=max_length,
seed=1024)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration (interval=1).
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_1_8b/internlm2_chat_1_8b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_1_8b/internlm2_chat_1_8b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_20b/internlm2_chat_20b_full_finetune_custom_dataset_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
    {
        "conversation": [
            {
                "system": "",
                "input": "xxx",
                "output": "xxx"
            },
            {
                "input": "xxx",
                "output": "xxx"
            }
        ]
    },
    ...
]
Please refer to https://github.com/InternLM/xtuner/blob/main/docs/en/user_guides/dataset_format.md for details.
""" # noqa: E501
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# Batch size per device; set to 1 when `use_varlen_attn = True`.
# With packed samples, enlarging the batch size is essentially equivalent to
# enlarging `max_length`: doubling `max_length` is tantamount to doubling the
# batch size.
batch_size = 1
accumulative_counts = 1  # batch_size 1 * 1 accumulation step * 64 GPUs = global batch size 64
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_20b/internlm2_chat_20b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_20b/internlm2_chat_20b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_20b/internlm2_chat_20b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_20b/internlm2_chat_20b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
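# Illustrative sketch (not part of the original config): the shape of the
# schedule configured above, i.e. a linear warmup over the first
# `warmup_ratio` fraction of training followed by cosine decay to `eta_min`.
# `sketch_lr` and `total_steps` are hypothetical names, and the rounding of
# the warmup boundary is an assumption; mmengine's own epoch-to-iteration
# conversion may differ by a step.
def sketch_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03,
              start_factor=1e-5, eta_min=0.0):
    import math
    warmup_steps = round(warmup_ratio * total_steps)
    if step < warmup_steps:
        # LinearLR phase: scale factor grows from start_factor to 1.
        factor = start_factor + (1 - start_factor) * step / max(1, warmup_steps)
        return base_lr * factor
    # CosineAnnealingLR phase: decay from base_lr down to eta_min.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + (base_lr - eta_min) * 0.5 * (1 + math.cos(math.pi * progress))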
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_20b/internlm2_chat_20b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
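# Illustrative sketch (not part of the original config): with standard LoRA
# the low-rank update is scaled by lora_alpha / r, so r=64 and lora_alpha=16
# give a factor of 0.25. Both helper names are hypothetical, and the
# parameter count assumes a single adapted linear layer of shape
# (d_out, d_in) with no bias terms (bias='none' above).
def sketch_lora_scaling(r, lora_alpha):
    return lora_alpha / r


def sketch_lora_params(d_in, d_out, r):
    # Matrix A has shape (r, d_in) and matrix B has shape (d_out, r).
    return r * (d_in + d_out)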
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_full_finetune_custom_dataset_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
    {
        "conversation": [
            {
                "system": "",
                "input": "xxx",
                "output": "xxx"
            },
            {
                "input": "xxx",
                "output": "xxx"
            }
        ]
    },
    ...
]
Please refer to https://github.com/InternLM/xtuner/blob/main/docs/en/user_guides/dataset_format.md for details.
""" # noqa: E501
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
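# Illustrative sketch (not part of the original config): building one record
# in the "Data format" described in the module docstring.
# `sketch_make_record` is a hypothetical helper; only the first turn carries
# the "system" key, mirroring the docstring example.
def sketch_make_record(turns, system=''):
    conversation = []
    for i, (user_text, assistant_text) in enumerate(turns):
        turn = {'input': user_text, 'output': assistant_text}
        if i == 0:
            turn = {'system': system, **turn}
        conversation.append(turn)
    return {'conversation': conversation}
# A JSON list of such records is the kind of file `data_files` points to.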
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
accumulative_counts *= sequence_parallel_size
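# Illustrative sketch (not part of the original config): the arithmetic
# behind the "64 batchsize" comment above. `sketch_global_batch_size` and its
# parameters are hypothetical names; the 64-GPU figure is taken from that
# comment, and the sketch assumes the GPUs of one sequence-parallel group
# share a single sample.
def sketch_global_batch_size(per_device_bs, accum, num_gpus, sp_size=1):
    data_parallel_ranks = num_gpus // sp_size
    return per_device_bs * accum * data_parallel_ranks
# With sp_size=1: 1 * 1 * 64 = 64. With sp_size=4, `accumulative_counts`
# becomes 4 while the data-parallel ranks drop to 16, so the product stays 64.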
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
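# Illustrative sketch (not part of the original config): a simplified picture
# of what `pack_to_max_length=True` implies, namely concatenating tokenized
# samples and cutting the stream into chunks of `max_length` tokens.
# `sketch_pack` is a hypothetical helper; xtuner's actual packing may differ
# in shuffling and in how it treats the final partial chunk (kept here).
def sketch_pack(tokenized_samples, max_length):
    stream = []
    for ids in tokenized_samples:
        stream.extend(ids)
    return [
        stream[start:start + max_length]
        for start in range(0, len(stream), max_length)
    ]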
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm2_chat_7b/internlm2_chat_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
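# Illustrative sketch only (not part of the upstream XTuner config): with
# `convert_to_iter_based=True`, the epoch-based `begin`/`end` values above are
# converted to iteration counts from the number of iterations per epoch.
# The helper name and its `iters_per_epoch` argument are hypothetical, and the
# rounding used here is an assumption; mmengine's own conversion may differ.
def _approx_warmup_iters(warmup_ratio, max_epochs, iters_per_epoch):
    """Approximate number of iterations spent in the LinearLR warmup."""
    return round(warmup_ratio * max_epochs * iters_per_epoch)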
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
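# Illustrative sketch only (not part of the upstream XTuner config): the
# per-device `batch_size` and `accumulative_counts` above combine with the
# number of training processes to give the number of samples contributing to
# each optimizer step. The helper name and its `world_size` argument are
# hypothetical; the config itself does not define a device count.
def _effective_batch_size(batch_size, accumulative_counts, world_size):
    """Samples per optimizer step across all devices."""
    return batch_size * accumulative_counts * world_size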
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
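# Worked example (editor's sketch, not part of the upstream config): with
# warmup_ratio = 0.03 and max_epochs = 5, the LinearLR phase covers epochs
# [0, 0.03 * 5) = [0, 0.15) and CosineAnnealingLR covers [0.15, 5].
# Because convert_to_iter_based=True, these epoch boundaries are converted
# to iteration counts. Assuming a hypothetical 1000 iterations per epoch,
# warmup would last about 150 iterations.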
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
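# Rough estimate (editor's sketch, not a measurement): with load_in_4bit=True
# each quantized weight takes about 0.5 bytes, so a model with roughly 20e9
# parameters needs on the order of 20e9 * 0.5 bytes = 10 GB for the base
# weights. This ignores quantization constants, any layers kept in higher
# precision, the LoRA adapters, optimizer state, and activations.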
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_msagent_react_e3_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from modelscope.msdatasets import MsDataset
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_ms_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (msagent_react_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'damo/MSAgent-Bench'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
dataloader_num_workers = 2
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
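# Note (editor's sketch): samples per device per optimizer step are
# batch_size * accumulative_counts = 8 * 1 = 8. If this config is run on
# 8 GPUs, as the `gpu8` suffix in the file name suggests (an assumption),
# the global batch would be 8 * 8 = 64 samples per optimizer step.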
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = (
'你是一个可以调用外部工具的助手,可以使用的工具包括:\n'
"{{\'GoogleSearch\': \'一个可以从谷歌搜索结果的API。\\n"
'当你需要对于一个特定问题找到简短明了的回答时,可以使用它。\\n'
"输入应该是一个搜索查询。\\n\\n\',"
"\'PythonInterpreter\': \"用来执行Python代码。代码必须是一个函数,\\n"
"函数名必须得是 \'solution\',代码对应你的思考过程。代码实例格式如下:\\n"
'```python\\n# import 依赖包\\nimport xxx\\ndef solution():'
'\\n # 初始化一些变量\\n variable_names_with_real_meaning = xxx'
'\\n # 步骤一\\n mid_variable = func(variable_names_with_real_meaning)'
'\\n # 步骤 x\\n mid_variable = func(mid_variable)\\n # 最后结果'
'\\n final_answer = func(mid_variable)\\n return final_answer'
"\\n```\\n\"}}\n"
'如果使用工具请遵循以下格式回复:\n```\n'
'Thought:思考你当前步骤需要解决什么问题,是否需要使用工具\n'
"Action:工具名称,你的工具必须从 [[\'GoogleSearch\', \'PythonInterpreter\']] 选择"
'\nAction Input:工具输入参数\n```\n工具返回按照以下格式回复:\n'
'```\nResponse:调用工具后的结果\n```'
'\n如果你已经知道了答案,或者你不需要工具,请遵循以下格式回复\n```'
'\nThought:给出最终答案的思考过程\nFinal Answer:最终答案\n```\n开始!\n')
evaluation_inputs = ['上海明天天气怎么样?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_ms_dataset,
dataset=dict(type=MsDataset.load, dataset_name=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=msagent_react_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
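# Note (editor's sketch): with PEFT's default LoRA scaling (rank-stabilized
# scaling not enabled, as is the case here), the low-rank update is
# multiplied by lora_alpha / r = 16 / 64 = 0.25 before it is added to the
# frozen layer's output.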
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_20b/internlm_20b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-20b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_full_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_full_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
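# Worked example of the schedule above (a sketch; the 1000-iteration epoch is
# a made-up figure for illustration, not taken from this config):
#   warmup end = warmup_ratio * max_epochs = 0.03 * 3 = 0.09 epoch
#   LinearLR scales lr from lr * start_factor = 2e-5 * 1e-5 = 2e-10 up to 2e-5
#   CosineAnnealingLR then decays lr from 2e-5 to eta_min = 0.0 by epoch 3
# With `convert_to_iter_based=True`, mmengine turns these epoch bounds into
# iteration bounds, so at 1000 iterations per epoch the warmup would cover
# roughly the first 90 iterations.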
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_full_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_full_intern_repo_dataset_template.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.intern_repo import (build_packed_dataset,
load_intern_repo_tokenized_dataset)
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/path/to/your/base/model'
use_varlen_attn = True
# Data
dataset_folder = '/path/to/your/train/dataset'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 8192
pack_to_max_length = True
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 4 # 1bs * 4acc * 32gpu = 128 batchsize
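# Worked example (assuming the 32-GPU setup named in the comment above):
#   global batch size = batch_size * accumulative_counts * num_gpus
#                     = 1 * 4 * 32 = 128 packed sequences per optimizer step
#   tokens per step   = 128 * max_length = 128 * 8192 = 1,048,576 at most
# To keep the same global batch size on a different GPU count, scale
# `accumulative_counts` accordingly, e.g. 8 GPUs -> accumulative_counts = 16.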
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_packed_dataset,
dataset_cfg=dict(
type=load_intern_repo_tokenized_dataset,
folder=dataset_folder,
min_length=0,
file_type='.bin'),
packed_length=max_length,
seed=1024)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_full_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
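# Note on the LoRA settings above (a sketch, assuming PEFT's default
# non-rsLoRA scaling): the low-rank update is multiplied by
#   lora_alpha / r = 16 / 64 = 0.25
# so raising `r` without raising `lora_alpha` shrinks the adapter's effective
# contribution. The base weights are loaded in 4-bit NF4 with double
# quantization and computed in float16, per `quantization_config`.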
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
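# Illustrative sketch, not part of the upstream config: samples consumed per
# optimizer step under gradient accumulation. `assumed_num_gpus` is a
# hypothetical value for this example; the real world size is set by the
# launcher, not by this file.
assumed_num_gpus = 8
effective_batch_size = batch_size * accumulative_counts * assumed_num_gpus
# With the values above: 1 * 16 * 8 = 128 packed sequences per step, i.e. up
# to 128 * max_length = 262144 tokens when pack_to_max_length is True.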
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
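# Illustrative sketch, not part of the upstream config: where the two
# schedulers above hand over. `warmup_epochs` is a hypothetical name.
warmup_epochs = warmup_ratio * max_epochs  # 0.03 * 3, about 0.09 epochs
# LinearLR ramps the learning rate from lr * 1e-5 up to lr over
# [0, warmup_epochs); CosineAnnealingLR then decays it towards eta_min = 0.0
# until max_epochs. Both ranges are converted to iterations because
# convert_to_iter_based=True.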
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
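# Illustrative sketch, not part of the upstream config. Assuming PEFT's
# default LoRA scaling (rsLoRA not enabled), the low-rank update is multiplied
# by lora_alpha / r. `assumed_lora_scaling` is a hypothetical name.
assumed_lora_scaling = 16 / 64  # lora_alpha / r = 0.25 for the values above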
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_json_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'path/to/your/json_data'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_moss_sft_all_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
bot_name = 'InternLM'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_moss_sft_all_e2_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
bot_name = 'InternLM'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
dataloader_num_workers = 2
max_epochs = 2
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
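# Illustrative sketch (not part of the original config): the number of samples
# that contribute to one optimizer step is assumed here to be
# per-device batch size x gradient-accumulation steps x number of GPUs.
# `_effective_batch_size` and `num_gpus` are hypothetical names; a GPU count
# of 8 is only inferred from the `gpu8` suffix in this file's name.
def _effective_batch_size(per_device, accumulation, num_gpus):
    """Return the number of samples consumed per optimizer step."""
    return per_device * accumulation * num_gpus


# Under that assumption, the values above would give
# _effective_batch_size(batch_size, accumulative_counts, 8) == 64.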
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
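# Prompts (Chinese): volume of a sphere whose surface area is 384 cm^2; the
# classic chickens-and-rabbits puzzle (20 heads, 62 feet); "Introduce Bill
# Gates".
evaluation_inputs = [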
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_moss_sft_plugins_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
bot_name = 'InternLM'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
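# Prompts (Chinese): volume of a sphere whose surface area is 384 cm^2; the
# classic chickens-and-rabbits puzzle (20 heads, 62 feet); "Introduce Bill
# Gates".
evaluation_inputs = [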
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
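# Illustrative sketch (not part of the original config): in standard LoRA the
# low-rank update is multiplied by lora_alpha / r, so r=64 with lora_alpha=16
# corresponds to a factor of 0.25. `_lora_scaling` is a hypothetical helper,
# and it assumes the default (non rank-stabilized) scaling rule.
def _lora_scaling(lora_alpha, r):
    """Return the multiplier applied to the low-rank update B @ A."""
    return lora_alpha / r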
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_msagent_react_e3_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from modelscope.msdatasets import MsDataset
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_ms_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (msagent_react_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'damo/MSAgent-Bench'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
dataloader_num_workers = 2
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
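# ReAct-style system prompt (Chinese): it lists the GoogleSearch and
# PythonInterpreter tools and prescribes the Thought / Action / Action Input /
# Response / Final Answer reply format.
SYSTEM = (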
'你是一个可以调用外部工具的助手,可以使用的工具包括:\n'
"{{\'GoogleSearch\': \'一个可以从谷歌搜索结果的API。\\n"
'当你需要对于一个特定问题找到简短明了的回答时,可以使用它。\\n'
"输入应该是一个搜索查询。\\n\\n\',"
"\'PythonInterpreter\': \"用来执行Python代码。代码必须是一个函数,\\n"
"函数名必须得是 \'solution\',代码对应你的思考过程。代码实例格式如下:\\n"
'```python\\n# import 依赖包\\nimport xxx\\ndef solution():'
'\\n # 初始化一些变量\\n variable_names_with_real_meaning = xxx'
'\\n # 步骤一\\n mid_variable = func(variable_names_with_real_meaning)'
'\\n # 步骤 x\\n mid_variable = func(mid_variable)\\n # 最后结果'
'\\n final_answer = func(mid_variable)\\n return final_answer'
"\\n```\\n\"}}\n"
'如果使用工具请遵循以下格式回复:\n```\n'
'Thought:思考你当前步骤需要解决什么问题,是否需要使用工具\n'
"Action:工具名称,你的工具必须从 [[\'GoogleSearch\', \'PythonInterpreter\']] 选择"
'\nAction Input:工具输入参数\n```\n工具返回按照以下格式回复:\n'
'```\nResponse:调用工具后的结果\n```'
'\n如果你已经知道了答案,或者你不需要工具,请遵循以下格式回复\n```'
'\nThought:给出最终答案的思考过程\nFinal Answer:最终答案\n```\n开始!\n')
# Prompt (Chinese): "What will the weather be like in Shanghai tomorrow?"
evaluation_inputs = ['上海明天天气怎么样?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_ms_dataset,
dataset=dict(type=MsDataset.load, dataset_name=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=msagent_react_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
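# Illustrative sketch (not part of the original config): with `by_epoch=True`
# and `convert_to_iter_based=True`, the epoch boundaries above are turned into
# iteration counts. Assuming a multiply-and-truncate conversion (the
# scheduler's exact rounding may differ), the warmup length can be estimated
# as below. `_warmup_iters` and `iters_per_epoch` are hypothetical names.
def _warmup_iters(warmup_ratio, max_epochs, iters_per_epoch):
    """Estimate how many iterations the LinearLR warmup phase covers."""
    return int(warmup_ratio * max_epochs * iters_per_epoch)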
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
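# Illustrative sketch (not part of the original config): `pack_to_max_length`
# is understood here as concatenating tokenized samples and cutting the stream
# into chunks of `max_length` tokens, so that short samples do not waste
# padding. The helper below is a simplified stand-in, not xtuner's own
# implementation; `_pack_token_ids` is a hypothetical name, and dropping the
# final incomplete chunk is an assumption of this sketch.
def _pack_token_ids(samples, chunk_size):
    """Concatenate token-id lists and split them into full-size chunks."""
    stream = [tok for sample in samples for tok in sample]
    n_full = len(stream) // chunk_size
    return [stream[i * chunk_size:(i + 1) * chunk_size] for i in range(n_full)]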
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_oasst1_e3_hf.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, Trainer, TrainingArguments)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
framework = 'huggingface'
pretrained_model_name_or_path = 'internlm/internlm-7b'
dataset_name_or_path = 'timdettmers/openassistant-guanaco'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.default
trainer = Trainer
training_args = dict(
type=TrainingArguments,
do_train=True,
learning_rate=2e-4,
weight_decay=0,
lr_scheduler_type='cosine',
warmup_steps=100,
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
num_train_epochs=3,
fp16=True,
logging_steps=1,
optim='paged_adamw_32bit',
save_strategy='steps',
save_steps=1000,
save_total_limit=2,
ddp_find_unused_parameters=False)
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'))
lora = dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM')
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=dataset_name_or_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_oasst1_mmlu_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn, mmlu_collate_fn
from xtuner.dataset.map_fns import (default_map_fn, oasst1_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.evaluation import MMLUMetric
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Val/Test data
# Download from https://github.com/artidoro/qlora/tree/main/data/mmlu
mmlu_data_root = './data/mmlu/'
evaluate_steps = 500
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
mmlu_fs_dataset = dict(
type=load_dataset,
path='json',
data_files=dict(
val=mmlu_data_root + 'five_shot_mmlu_val.json',
test=mmlu_data_root + 'five_shot_mmlu_test.json'))
val_mmlu_fs = dict(
type=process_hf_dataset,
dataset=mmlu_fs_dataset,
tokenizer=tokenizer,
dataset_map_fn=default_map_fn,
max_length=max_length,
input_ids_with_output=False,
pack_to_max_length=False,
split='val')
val_dataloader = dict(
batch_size=1,
num_workers=0,
dataset=val_mmlu_fs,
sampler=dict(type=DefaultSampler, shuffle=False),
collate_fn=dict(type=mmlu_collate_fn))
val_evaluator = dict(
type=MMLUMetric, tokenizer=tokenizer, prefix='mmlu_fs_val')
test_mmlu_fs = dict(
type=process_hf_dataset,
dataset=mmlu_fs_dataset,
tokenizer=tokenizer,
dataset_map_fn=default_map_fn,
max_length=max_length,
input_ids_with_output=False,
pack_to_max_length=False,
split='test')
test_dataloader = dict(
batch_size=1,
num_workers=0,
dataset=test_mmlu_fs,
sampler=dict(type=DefaultSampler, shuffle=False),
collate_fn=dict(type=mmlu_collate_fn))
test_evaluator = dict(
type=MMLUMetric, tokenizer=tokenizer, prefix='mmlu_fs_test')
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(
type=TrainLoop, max_epochs=max_epochs, val_interval=evaluate_steps)
val_cfg = dict(type='ValLoop')
test_cfg = dict(type='TestLoop')
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# path of the checkpoint to load, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_7b/internlm_7b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-7b'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_20b/internlm_chat_20b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-20b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate generation quality during training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
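# Illustrative sketch, not part of the upstream config: a closed-form
# approximation of the schedule defined above (linear warmup over the first
# `warmup_ratio` of training, then cosine annealing down to `eta_min`).
# `approx_lr` and `total_iters` are hypothetical names; mmengine's chainable
# schedulers may differ from this by a step at the phase boundary.
import math


def approx_lr(it, total_iters, base_lr=2e-4, warmup_ratio=0.03,
              start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at iteration `it` (0-based)."""
    warmup_iters = round(warmup_ratio * total_iters)
    if it < warmup_iters:
        # LinearLR phase: the factor grows from start_factor towards 1.0.
        frac = it / max(1, warmup_iters)
        return base_lr * (start_factor + (1.0 - start_factor) * frac)
    # CosineAnnealingLR phase: decay from base_lr down to eta_min.
    progress = (it - warmup_iters) / max(1, total_iters - warmup_iters)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return eta_min + (base_lr - eta_min) * cosine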
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
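# Illustrative sketch, not part of the upstream config: how the settings above
# combine into the effective batch per optimizer step. `effective_batch` is a
# hypothetical helper, and `world_size` (number of GPUs) is an assumption; it
# is determined by the launcher, not by this file.
def effective_batch(per_device, accum_steps, world_size, seq_len):
    """Return (sequences, max tokens) consumed per optimizer step."""
    sequences = per_device * accum_steps * world_size
    # With pack_to_max_length=True each sequence holds up to `seq_len` tokens.
    return sequences, sequences * seq_len


# e.g. effective_batch(1, 16, 8, 2048) -> (128, 262144) on a hypothetical
# 8-GPU node with batch_size=1, accumulative_counts=16, max_length=2048.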
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate generation quality during training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate generation quality during training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
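# Illustrative sketch, not part of the upstream config: what the LoRA settings
# above imply. In standard LoRA the low-rank update is scaled by
# lora_alpha / r, and each adapted linear layer gains
# r * (in_features + out_features) trainable weights. `lora_summary` and the
# 4096x4096 layer shape below are hypothetical, chosen only for illustration.
def lora_summary(r, lora_alpha, in_features, out_features):
    """Return (scaling factor, trainable parameters) for one adapted layer."""
    scaling = lora_alpha / r
    trainable = r * (in_features + out_features)
    return scaling, trainable


# e.g. lora_summary(64, 16, 4096, 4096) -> (0.25, 524288)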
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate generation quality during training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate generation quality during training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
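# Illustrative addition (not part of the original config; the variable name
# below is hypothetical): the number of packed sequences that contribute to
# one optimizer step on a single device, given the gradient accumulation
# setting above. The global figure also scales with the number of GPUs,
# which this config does not fix.
samples_per_step_per_device = batch_size * accumulative_counts  # 1 * 16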
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
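# Illustrative addition (not part of the original config; the variable name
# below is hypothetical): with PEFT's default settings the low-rank update is
# scaled by lora_alpha / r, so the values above give 16 / 64.
lora_scaling = model['lora']['lora_alpha'] / model['lora']['r']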
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
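# Illustrative addition (not part of the original config; the variable name
# below is hypothetical): the boundary between the LinearLR warmup and the
# CosineAnnealingLR decay, expressed in epochs. Because both schedulers set
# convert_to_iter_based=True, MMEngine converts this epoch value into an
# iteration count at runtime from the length of the dataloader.
warmup_epochs = warmup_ratio * max_epochs  # 0.03 * 3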
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/internlm/internlm_chat_7b/internlm_chat_7b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm-chat-7b'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.internlm_chat
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
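# Illustrative sketch (not part of the config): the `param_scheduler` above
# chains a linear warmup with a cosine decay. The helper below approximates
# that curve in plain Python so the shape is easy to inspect. The name `lr_at`
# and the per-step formulas are assumptions made for illustration; the exact
# boundary handling inside mmengine's LinearLR / CosineAnnealingLR may differ.

```python
import math


def lr_at(step, total_steps, base_lr=2e-4, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at `step` for warmup followed by cosine decay.

    Hypothetical helper: defaults mirror `lr`, `warmup_ratio`, `start_factor`
    and `eta_min` from the config above.
    """
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp of the multiplier from `start_factor` up to 1.0.
        frac = step / warmup_steps
        factor = start_factor + (1.0 - start_factor) * frac
        return base_lr * factor
    # Cosine decay from `base_lr` down to `eta_min` over the remaining steps.
    decay_steps = max(total_steps - warmup_steps, 1)
    progress = (step - warmup_steps) / decay_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return eta_min + (base_lr - eta_min) * cosine


if __name__ == '__main__':
    for s in (0, 15, 30, 500, 1000):
        print(s, lr_at(s, total_steps=1000))
```

# With 1000 total steps the warmup covers the first 30 steps; after that the
# rate follows half a cosine period down to `eta_min`.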
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_70b/llama2_70b_full_wizardlm_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, wizardlm_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
# Data
data_path = 'WizardLM/WizardLM_evol_instruct_V2_196k'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 4  # 1 (per-device bs) * 4 (acc) * 32 (GPUs) = 128 global batch size
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=wizardlm_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
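# Illustrative sketch (not part of the config): the arithmetic behind the
# `accumulative_counts` comment in PART 1 and the line
# `accumulative_counts *= sequence_parallel_size`. The function name and the
# 32-GPU figure (taken from that comment) are illustrative. The sketch assumes
# that GPUs inside one sequence-parallel group share the same samples, so the
# number of data-parallel replicas is num_gpus / sequence_parallel_size.

```python
def global_batch_size(per_device_bs, accumulative_counts, num_gpus,
                      sequence_parallel_size=1):
    """Samples contributing to one optimizer step (hypothetical helper).

    `accumulative_counts` is the base value, before it is multiplied by
    `sequence_parallel_size` as done in the config.
    """
    if num_gpus % sequence_parallel_size != 0:
        raise ValueError('num_gpus must be divisible by sequence_parallel_size')
    # Fewer independent data-parallel replicas when sequences are split.
    data_parallel_size = num_gpus // sequence_parallel_size
    # The config compensates by scaling the accumulation count.
    scaled_accumulation = accumulative_counts * sequence_parallel_size
    return per_device_bs * scaled_accumulation * data_parallel_size


if __name__ == '__main__':
    print(global_batch_size(1, 4, 32))     # sequence parallel disabled
    print(global_batch_size(1, 4, 32, 4))  # same global batch with sp=4
```

# Under that assumption, scaling the accumulation count keeps the global batch
# size unchanged when sequence parallelism is turned on.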
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_70b/llama2_70b_int8_lora_open_platypus_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 3e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
load_in_8bit=True),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_70b/llama2_70b_int8_lora_open_platypus_e1_hf.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer, Trainer,
TrainingArguments)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
framework = 'huggingface'
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
dataset_name_or_path = 'garage-bAInd/Open-Platypus'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.llama2_chat
trainer = Trainer
training_args = dict(
type=TrainingArguments,
do_train=True,
learning_rate=3e-4,
weight_decay=0,
lr_scheduler_type='cosine',
warmup_steps=100,
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
num_train_epochs=1,
fp16=True,
logging_steps=1,
optim='adamw_torch',
save_strategy='steps',
save_steps=1000,
save_total_limit=2,
ddp_find_unused_parameters=False)
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
load_in_8bit=True)
lora = dict(
type=LoraConfig,
r=16,
lora_alpha=16,
lora_dropout=0.05,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM')
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=dataset_name_or_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_70b/llama2_70b_qlora_open_platypus_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 3e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
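# Illustrative sketch (not part of the config): what the `lora=dict(...)`
# block above implies numerically. With PEFT's default (non-rsLoRA) scaling,
# the adapter update is multiplied by lora_alpha / r. The parameter count
# assumes Llama-2-70B shapes (hidden size 8192, MLP intermediate size 28672,
# 80 layers); those figures and both helper names are assumptions made for
# illustration, not values read from this repository.

```python
def lora_scaling(r, lora_alpha):
    """Multiplier applied to the low-rank update B @ A (default PEFT scaling)."""
    return lora_alpha / r


def lora_mlp_param_count(r, hidden_size=8192, intermediate_size=28672,
                         num_layers=80):
    """Trainable LoRA parameters when targeting gate_proj, up_proj, down_proj.

    Each adapted linear layer of shape (in, out) adds r * (in + out) weights:
    an (r x in) matrix A and an (out x r) matrix B.
    """
    per_projection = r * (hidden_size + intermediate_size)
    return 3 * per_projection * num_layers


if __name__ == '__main__':
    print(lora_scaling(r=64, lora_alpha=16))
    print(lora_mlp_param_count(r=64))
    print(lora_mlp_param_count(r=16))
```

# With r=64 and lora_alpha=16 the update is scaled by 0.25; the int8 LoRA
# config in the same directory (r=16, lora_alpha=16) uses a scale of 1.0 with
# a quarter of the adapter parameters.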
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_70b/llama2_70b_qlora_open_platypus_e1_hf.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from peft import LoraConfig
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, Trainer, TrainingArguments)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.utils import PROMPT_TEMPLATE
framework = 'huggingface'
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
dataset_name_or_path = 'garage-bAInd/Open-Platypus'
max_length = 2048
pack_to_max_length = True
prompt_template = PROMPT_TEMPLATE.llama2_chat
trainer = Trainer
training_args = dict(
type=TrainingArguments,
do_train=True,
learning_rate=3e-4,
weight_decay=0,
lr_scheduler_type='cosine',
warmup_steps=100,
per_device_train_batch_size=1,
gradient_accumulation_steps=16,
num_train_epochs=1,
fp16=True,
logging_steps=1,
optim='adamw_torch',
save_strategy='steps',
save_steps=1000,
save_total_limit=2,
ddp_find_unused_parameters=False)
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4'))
lora = dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=['gate_proj', 'down_proj', 'up_proj'],
bias='none',
task_type='CAUSAL_LM')
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=dataset_name_or_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_full_pgbooks_400iters_sp1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'emozilla/pg_books-tokenized-bos-eos-chunked-65536'
data_files = [
'data/train-00000-of-00136-877a1768c20d5900.parquet',
'data/train-00001-of-00136-70d7d139dca61754.parquet',
'data/train-00002-of-00136-62d53594e098f3d8.parquet',
'data/train-00003-of-00136-8bd300fecc4c720e.parquet',
'data/train-00004-of-00136-2a9456b5f975ae95.parquet',
'data/train-00005-of-00136-ca38cf7907bb7555.parquet',
'data/train-00006-of-00136-1ae2e4c63f3966da.parquet',
'data/train-00007-of-00136-a00cc39a4ee65ab6.parquet',
]
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 65536
max_position_embeddings = 65536
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 8
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.05
# Save
save_steps = 500
save_total_limit = 1 # Maximum checkpoints to keep (-1 means unlimited)
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
max_position_embeddings=max_position_embeddings,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path=data_path,
data_files=data_files,
ignore_verifications=True),
do_dataset_tokenization=False,
remove_unused_columns=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration (interval=1).
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
save_optimizer=False,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_full_pgbooks_400iters_sp4.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'emozilla/pg_books-tokenized-bos-eos-chunked-65536'
data_files = [
'data/train-00000-of-00136-877a1768c20d5900.parquet',
'data/train-00001-of-00136-70d7d139dca61754.parquet',
'data/train-00002-of-00136-62d53594e098f3d8.parquet',
'data/train-00003-of-00136-8bd300fecc4c720e.parquet',
'data/train-00004-of-00136-2a9456b5f975ae95.parquet',
'data/train-00005-of-00136-ca38cf7907bb7555.parquet',
'data/train-00006-of-00136-1ae2e4c63f3966da.parquet',
'data/train-00007-of-00136-a00cc39a4ee65ab6.parquet',
]
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 65536
max_position_embeddings = 65536
pack_to_max_length = False
# parallel
sequence_parallel_size = 4
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 8
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.05
# Save
save_steps = 500
save_total_limit = 1 # Maximum checkpoints to keep (-1 means unlimited)
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
max_position_embeddings=max_position_embeddings,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path=data_path,
data_files=data_files,
ignore_verifications=True),
do_dataset_tokenization=False,
remove_unused_columns=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
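# Illustrative sketch (not part of the original config): a continuous
# approximation of the two-phase schedule declared above, i.e. a linear
# warmup from `start_factor` followed by cosine annealing down to `eta_min`.
# It does not call MMEngine; the helper name and the `progress` argument
# (fraction of training completed, 0.0 to 1.0) are hypothetical, and the
# iteration-based schedulers may differ slightly at step boundaries.
def _sketch_lr(progress, base_lr=2e-5, warmup=0.05, start_factor=1 / 40,
               eta_min_ratio=0.15):
    """Approximate learning rate at a given fraction of training."""
    import math  # local import keeps the module namespace unchanged
    if progress < warmup:
        # Linear ramp of the multiplier from start_factor up to 1.0.
        factor = start_factor + (1 - start_factor) * progress / warmup
        return base_lr * factor
    # Cosine decay from base_lr at the end of warmup to eta_min at the end.
    t = (progress - warmup) / (1 - warmup)
    eta_min = base_lr * eta_min_ratio
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t))
# Usage: _sketch_lr(0.0) is base_lr / 40, _sketch_lr(0.05) is base_lr, and
# _sketch_lr(1.0) is 0.15 * base_lr under the defaults copied from this file.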
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration (interval=1).
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
save_optimizer=False,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
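# Illustrative sketch (not part of the original config): how the PART 1
# settings above could combine into a global batch size. It assumes that
# sequence parallelism places `sequence_parallel_size` GPUs on one sample, so
# the data-parallel size is world_size // sequence_parallel_size. The helper
# name and the 8-GPU world size in the usage note are hypothetical.
def _sketch_global_batch_size(per_device_batch, base_accumulation, sp_size,
                              world_size):
    """Return sequences consumed per optimizer step under that assumption."""
    if world_size % sp_size != 0:
        raise ValueError('world_size must be divisible by sp_size')
    data_parallel_size = world_size // sp_size
    # Mirrors `accumulative_counts *= sequence_parallel_size` in PART 1.
    accumulation = base_accumulation * sp_size
    return per_device_batch * accumulation * data_parallel_size
# Usage: with batch_size=1 and a base accumulative_counts of 8 on a
# hypothetical 8 GPUs, _sketch_global_batch_size(1, 8, 4, 8) and
# _sketch_global_batch_size(1, 8, 1, 8) return the same value, which is
# presumably why the accumulation count is scaled by the parallel size.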
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_full_wizardlm_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, wizardlm_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'WizardLM/WizardLM_evol_instruct_V2_196k'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 2 # per_device
accumulative_counts = 16 # 2bs * 16acc * 4gpu = 128 batchsize
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=wizardlm_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
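# Illustrative sketch (not part of the original config): what the LoRA
# settings above imply for one adapted linear layer, assuming the standard
# LoRA formulation where the low-rank update is scaled by lora_alpha / r and
# adds r * (d_in + d_out) trainable weights. The helper name is hypothetical,
# and the 4096 x 4096 layer shape in the usage note is only an example.
def _sketch_lora_summary(r, lora_alpha, d_in, d_out):
    """Return the update scaling and adapter parameter count for one layer."""
    return {
        'scaling': lora_alpha / r,
        # A is (r x d_in) and B is (d_out x r).
        'adapter_params': r * (d_in + d_out),
    }
# Usage: _sketch_lora_summary(64, 16, 4096, 4096) gives a scaling of 0.25 and
# 524288 adapter weights for that single example layer.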
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_moss_sft_all_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
bot_name = 'Llama2'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eom>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_moss_sft_all_e2_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
bot_name = 'Llama2'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 2
max_epochs = 2
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eom>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from, if any
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_moss_sft_plugins_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
bot_name = 'Llama2'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eom>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_msagent_react_e3_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from modelscope.msdatasets import MsDataset
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_ms_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (msagent_react_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'damo/MSAgent-Bench'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 2
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = (
'你是一个可以调用外部工具的助手,可以使用的工具包括:\n'
"{{\'GoogleSearch\': \'一个可以从谷歌搜索结果的API。\\n"
'当你需要对于一个特定问题找到简短明了的回答时,可以使用它。\\n'
"输入应该是一个搜索查询。\\n\\n\',"
"\'PythonInterpreter\': \"用来执行Python代码。代码必须是一个函数,\\n"
"函数名必须得是 \'solution\',代码对应你的思考过程。代码实例格式如下:\\n"
'```python\\n# import 依赖包\\nimport xxx\\ndef solution():'
'\\n # 初始化一些变量\\n variable_names_with_real_meaning = xxx'
'\\n # 步骤一\\n mid_variable = func(variable_names_with_real_meaning)'
'\\n # 步骤 x\\n mid_variable = func(mid_variable)\\n # 最后结果'
'\\n final_answer = func(mid_variable)\\n return final_answer'
"\\n```\\n\"}}\n"
'如果使用工具请遵循以下格式回复:\n```\n'
'Thought:思考你当前步骤需要解决什么问题,是否需要使用工具\n'
"Action:工具名称,你的工具必须从 [[\'GoogleSearch\', \'PythonInterpreter\']] 选择"
'\nAction Input:工具输入参数\n```\n工具返回按照以下格式回复:\n'
'```\nResponse:调用工具后的结果\n```'
'\n如果你已经知道了答案,或者你不需要工具,请遵循以下格式回复\n```'
'\nThought:给出最终答案的思考过程\nFinal Answer:最终答案\n```\n开始!\n')
evaluation_inputs = ['上海明天天气怎么样?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_ms_dataset,
dataset=dict(type=MsDataset.load, dataset_name=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=msagent_react_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b/llama2_7b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-hf'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
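# Illustrative sketch, not part of the upstream config: what `r` and
# `lora_alpha` above imply for a single adapted linear layer. The layer shape
# used in any call is an assumed example, not read from the model.
def _sketch_lora_scaling(lora_alpha, r):
    # Standard LoRA scales the low-rank update B @ A by lora_alpha / r.
    return lora_alpha / r
def _sketch_lora_param_count(in_features, out_features, r):
    # A has shape (r, in_features) and B has shape (out_features, r).
    return r * (in_features + out_features)
# With r=64 and lora_alpha=16 the update is scaled by 0.25; a hypothetical
# 4096 x 4096 projection would gain 64 * (4096 + 4096) trainable parameters.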
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
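# Illustrative sketch, not part of the upstream config: the index arithmetic
# behind concatenating datasets, shown on plain sizes. This is a toy stand-in;
# XTuner's ConcatDataset may differ in detail.
import bisect
from itertools import accumulate
def _sketch_locate(index, dataset_sizes):
    """Map a global sample index to (dataset_idx, local_idx)."""
    cumulative = list(accumulate(dataset_sizes))
    if not 0 <= index < cumulative[-1]:
        raise IndexError(index)
    # First dataset whose cumulative size exceeds the global index.
    dataset_idx = bisect.bisect_right(cumulative, index)
    offset = cumulative[dataset_idx - 1] if dataset_idx > 0 else 0
    return dataset_idx, index - offset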
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
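# Illustrative sketch, not part of the upstream config: a closed-form view of
# the two-phase schedule above (linear warmup, then cosine annealing) as a
# function of fractional epoch. MMEngine steps its schedulers per iteration,
# so the values it produces may differ slightly from this approximation.
import math
def _sketch_lr_at(epoch, base_lr, max_epochs, warmup_ratio,
                  start_factor=1e-5, eta_min=0.0):
    warmup_end = warmup_ratio * max_epochs
    if epoch < warmup_end:
        # Linear ramp of the LR factor from start_factor up to 1.0.
        factor = start_factor + (1.0 - start_factor) * (epoch / warmup_end)
        return base_lr * factor
    # Cosine decay from base_lr down to eta_min over the remaining epochs.
    progress = (epoch - warmup_end) / (max_epochs - warmup_end)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))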
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
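# Illustrative sketch, not part of the upstream config: what the
# `pack_to_max_length` switch above changes. Packing joins tokenized samples
# into one stream and cuts it into chunks of `max_length`; with packing off,
# as in this config, each sample stays a separate sequence capped at
# `max_length`. Toy stand-in on lists of token ids; XTuner's own
# implementation may differ in detail.
def _sketch_pack(samples, max_length):
    stream = [tok for sample in samples for tok in sample]
    return [stream[i:i + max_length] for i in range(0, len(stream), max_length)]
def _sketch_truncate(samples, max_length):
    return [sample[:max_length] for sample in samples]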
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
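# Illustrative sketch, not part of the upstream config: which iteration
# checkpoints stay on disk for a given `interval` and `max_keep_ckpts`.
# Assumes checkpoints are written only at multiples of the interval (any extra
# save at the final iteration is ignored) and that a non-positive limit means
# "keep everything".
def _sketch_kept_checkpoints(total_iters, interval, max_keep):
    saved = list(range(interval, total_iters + 1, interval))
    if max_keep <= 0:
        return saved
    # Only the most recent `max_keep` checkpoints survive pruning.
    return saved[-max_keep:]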
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama2_7b_chat/llama2_7b_chat_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b-chat-hf'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama3_70b_instruct/llama3_70b_instruct_qlora_alpaca_e3_2k_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-70B-Instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 2 # total bs = 1 bs_per_device * 8 gpus * 2 acc = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-4  # the 70B model uses a smaller lr
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_quant_storage=torch.float16)),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
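# NOTE: `interval=-1` together with `save_last=False` turns off periodic and
# final checkpoint saving by this hook, so `save_steps` is not used here.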
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama3_8b/README.md
================================================
# Llama3 8B
## Install
```bash
# Install the latest xtuner
pip install -U 'xtuner[deepspeed]'
# install the latest transformers
pip install -U transformers
```
## QLoRA Fine-tune
QLoRA fine-tuning needs only a single A100-80G.
```bash
xtuner train llama3_8b_instruct_qlora_alpaca_e3
```
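One reason this fits on a single GPU is that the QLoRA config loads the base weights in 4-bit (`load_in_4bit=True`, `bnb_4bit_quant_type='nf4'`). The sketch below is a rough estimate of the weight memory only. The round parameter count and the helper name `weight_memory_gib` are assumptions for illustration; activations, LoRA parameters, optimizer state, and quantization constants are ignored.

```python
# Back-of-the-envelope estimate of weight memory (weights only).
# ASSUMED_PARAMS is a round number taken from the "8B" model name, not an exact count.
ASSUMED_PARAMS = 8e9


def weight_memory_gib(num_params: float, bits_per_param: int) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / 2**30


fp16_gib = weight_memory_gib(ASSUMED_PARAMS, 16)  # full-precision baseline
nf4_gib = weight_memory_gib(ASSUMED_PARAMS, 4)    # 4-bit quantized weights

print(f"fp16 weights:  {fp16_gib:.1f} GiB")
print(f"4-bit weights: {nf4_gib:.1f} GiB")
```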
## Full Parameter Fine-tune
Full-parameter fine-tuning of Llama3 8B with an 8k context requires only 2 * A100-80G.
### torchrun
```bash
NPROC_PER_NODE=${GPU_NUM} xtuner train llama3_8b_instruct_full_alpaca_e3 --deepspeed deepspeed_zero2
```
### slurm
```bash
srun ${SRUN_ARGS} xtuner train llama3_8b_instruct_full_alpaca_e3 --launcher slurm --deepspeed deepspeed_zero3
```
### Speed
| Model | Sequence Length | GPU Number | ZeRO | Sequence Parallel | Tokens per Second | TFLOPs |
| :-------: | :-------------: | :--------: | :----: | :---------------: | :---------------: | :----: |
| Llama3 8B | 8k | 2 | ZeRO-3 | 2 | 1037.0 | 76.8 |
| Llama3 8B | 8k | 4 | ZeRO-3 | 1 | 2331.3 | 172.6 |
| Llama3 8B | 8k | 8 | ZeRO-3 | 1 | 2771.2 | 205.1 |

| Model | Sequence Length | GPU Number | ZeRO | Sequence Parallel | Tokens per Second | TFLOPs |
| :-------: | :-------------: | :--------: | :----: | :---------------: | :---------------: | :----: |
| Llama3 8B | 8k | 8 | ZeRO-3 | 1 | 2771.2 | 205.1 |
| Llama3 8B | 16k | 8 | ZeRO-3 | 2 | 2320.7 | 191.7 |
| Llama3 8B | 32k | 8 | ZeRO-3 | 4 | 1870.2 | 186.6 |
| Llama3 8B | 64k | 8 | ZeRO-3 | 8 | 1356.4 | 182.0 |
| Llama3 8B | 128k | 8 | ZeRO-3 | 8 | 875.7 | 177.7 |
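
The exact settings behind these numbers are not listed here. As a minimal sketch, the configs in this directory expose `max_length`, `sequence_parallel_size`, and `accumulative_counts` in "PART 1 Settings", and scale `accumulative_counts` by `sequence_parallel_size`. The helper `long_context_settings` below is hypothetical, and reading "32k" as 32 * 1024 tokens is an assumption.

```python
# Hypothetical helper: derives the "PART 1 Settings" values one might edit
# in a copied config for a long-context run. Not part of xtuner.
def long_context_settings(max_length: int,
                          sequence_parallel_size: int,
                          base_accumulative_counts: int = 16) -> dict:
    if sequence_parallel_size < 1:
        raise ValueError("sequence_parallel_size must be >= 1")
    return dict(
        max_length=max_length,
        sequence_parallel_size=sequence_parallel_size,
        # Mirrors `accumulative_counts *= sequence_parallel_size` in the configs.
        accumulative_counts=base_accumulative_counts * sequence_parallel_size,
    )


# Assumed mapping of the "32k, Sequence Parallel 4" row to config values.
settings = long_context_settings(max_length=32 * 1024, sequence_parallel_size=4)
print(settings)
```

With `sequence_parallel_size = 1` the helper returns the default `accumulative_counts = 16` unchanged, which matches the shipped configs.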
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama3_8b/llama3_8b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama3_8b_instruct/llama3_8b_instruct_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama3_8b_instruct/llama3_8b_instruct_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
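# Note (illustrative sketch, not part of the config): the two entries in
# `param_scheduler` above describe a linear warmup over the first
# `warmup_ratio * max_epochs` epochs followed by cosine annealing to `eta_min`.
# The helper below approximates the resulting learning rate per iteration in
# closed form. `lr_at_iter` and `iters_per_epoch` are hypothetical names
# introduced here, and mmengine's own schedulers may differ by an iteration
# at the boundaries.

```python
import math


def lr_at_iter(it, iters_per_epoch, base_lr=2e-4, max_epochs=3,
               warmup_ratio=0.03, start_factor=1e-5, eta_min=0.0):
    """Approximate LR at iteration `it` for linear warmup + cosine decay."""
    # Epoch-based boundaries converted to iterations (assumed truncation).
    warmup_end = int(warmup_ratio * max_epochs * iters_per_epoch)
    total = max_epochs * iters_per_epoch
    if it < warmup_end:
        # Linear ramp from start_factor * base_lr up to base_lr.
        frac = it / warmup_end
        return base_lr * (start_factor + (1.0 - start_factor) * frac)
    # Cosine decay from base_lr down to eta_min over the remaining iterations.
    progress = (it - warmup_end) / max(1, total - warmup_end)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))
```

# With the defaults above, the rate starts near `base_lr * start_factor`,
# reaches `base_lr` at the end of warmup, and decays to `eta_min` at the last
# iteration.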
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
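# Note (illustrative sketch, not part of the config): `batch_size` is per
# device and gradients are accumulated over `accumulative_counts` steps, so
# the number of sequences per optimizer step is their product times the number
# of processes. `effective_batch` and `world_size` are hypothetical names used
# only for this illustration. The token count is an upper bound that assumes
# every packed sequence is filled to `max_length`.

```python
def effective_batch(batch_size=1, accumulative_counts=16, world_size=1,
                    max_length=2048):
    """Return (sequences, max tokens) consumed per optimizer step."""
    sequences = batch_size * accumulative_counts * world_size
    tokens = sequences * max_length  # upper bound under full packing
    return sequences, tokens


print(effective_batch())              # single process
print(effective_batch(world_size=8))  # assumed 8-GPU run
```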
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
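# Note (illustrative sketch, not part of the config): with `r=64` and
# `lora_alpha=16`, standard LoRA scales the low-rank update by
# `lora_alpha / r`, and each adapted linear layer gains `r * (d_in + d_out)`
# trainable parameters. The helper names below are hypothetical, and the
# 4096 x 4096 shape is used only as an example of a LLaMA-7B-sized projection.

```python
def lora_scaling(lora_alpha=16, r=64):
    """Scaling factor applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_extra_params(d_in, d_out, r=64):
    """Trainable parameters added by one LoRA adapter (A: r x d_in, B: d_out x r)."""
    return r * (d_in + d_out)


print(lora_scaling())                 # scaling for the values in this config
print(lora_extra_params(4096, 4096))  # one square 4096-wide projection
```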
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
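# Illustrative sketch, not part of the upstream config: how the LinearLR and
# CosineAnnealingLR entries in `param_scheduler` split the run. It assumes the
# epoch-based `begin`/`end` values are scaled by the number of iterations per
# epoch and truncated to integers when `convert_to_iter_based=True`.
# `iters_per_epoch` is a hypothetical caller-supplied value, since it depends
# on the size of the packed dataset. Defined as a function so that it adds no
# new top-level config value.
def _sketch_schedule_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    """Return (warmup_end_iter, total_iters) for the warmup + cosine pair."""
    # Warmup ends where the cosine phase begins.
    warmup_end_iter = int(warmup_ratio * max_epochs * iters_per_epoch)
    # The cosine phase runs until the last training iteration.
    total_iters = int(max_epochs * iters_per_epoch)
    return warmup_end_iter, total_iters
# Usage: _sketch_schedule_boundaries(warmup_ratio, max_epochs, iters_per_epoch)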
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
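# Illustrative sketch, not part of the upstream config: in standard LoRA the
# low-rank update is multiplied by lora_alpha / r, so the `r=64`,
# `lora_alpha=16` pair in the `lora` dict corresponds to a factor of 0.25.
# This assumes the default (non rank-stabilized) scaling; the helper name is
# hypothetical and exists only for illustration.
def _sketch_lora_scaling(lora_alpha, r):
    """Return the multiplier applied to the low-rank update B @ A."""
    if r <= 0:
        raise ValueError('LoRA rank `r` must be positive')
    return lora_alpha / r
# Usage: _sketch_lora_scaling(lora_alpha=16, r=64)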
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
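# Illustrative sketch, not part of the upstream config: the idea behind
# `pack_to_max_length=True`, which is to concatenate tokenized samples and cut
# the stream into fixed-size chunks so that little of each `max_length` window
# is spent on padding. This is a simplified, hypothetical helper; the real
# pipeline also has to carry labels and may treat the final short chunk
# differently.
def _sketch_pack_to_max_length(tokenized_samples, max_length):
    """Concatenate token-id lists and split them into max_length chunks."""
    # Flatten all samples into one token stream, preserving order.
    flat = [tok for sample in tokenized_samples for tok in sample]
    # Slice the stream into consecutive windows; the last one may be shorter.
    return [flat[i:i + max_length] for i in range(0, len(flat), max_length)]
# Usage: _sketch_pack_to_max_length([[1, 2, 3], [4, 5], [6]], max_length=4)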
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_moss_sft_all_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
bot_name = 'Llama'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eom>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_moss_sft_all_e2_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
bot_name = 'Llama'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
dataloader_num_workers = 2
max_epochs = 2
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eom>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
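# Illustrative sketch, not part of the upstream config: the number of samples
# contributing to one optimizer step. With `batch_size = 8` per device and
# `accumulative_counts = 1`, a run on 8 GPUs (an assumption taken from the
# `gpu8` suffix in this file's name) would use 8 * 1 * 8 = 64 samples per
# step. The helper name is hypothetical.
def _sketch_effective_batch_size(per_device_batch, accumulative_counts,
                                 num_gpus):
    """Return samples consumed per optimizer update across all devices."""
    return per_device_batch * accumulative_counts * num_gpus
# Usage: _sketch_effective_batch_size(8, 1, 8)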
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_moss_sft_plugins_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
bot_name = 'Llama'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=[''],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
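# A minimal sketch (not part of the original config): how the per-device
# settings in PART 1 combine into the effective batch size per optimizer
# step. `_effective_batch_size` and `world_size` are hypothetical names used
# only for illustration; the number of GPUs is an assumption, since it is not
# set anywhere in this file.
def _effective_batch_size(per_device, accumulation, world_size):
    """Return the number of samples contributing to one optimizer update."""
    return per_device * accumulation * world_size
# With batch_size=1 and accumulative_counts=16 as above, an assumed 8-GPU run
# would use _effective_batch_size(1, 16, 8) = 1 * 16 * 8 = 128 samples.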
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
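# A rough sketch (not part of the original config) of the learning-rate curve
# that the two schedulers in PART 4 describe: a linear warmup from
# lr * start_factor to lr over the first warmup_ratio of training, followed
# by cosine decay to eta_min. This is a closed-form approximation for
# illustration only, not mmengine's implementation; `_approx_lr` and
# `total_iters` are hypothetical names.
import math
def _approx_lr(step, total_iters, base_lr=2e-4, warmup_ratio=0.03,
               start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at iteration `step`."""
    warmup_iters = max(1, int(warmup_ratio * total_iters))
    if step < warmup_iters:
        # Linear ramp of the multiplicative factor from start_factor to 1.
        factor = start_factor + (1 - start_factor) * step / warmup_iters
        return base_lr * factor
    # Cosine decay over the remaining iterations.
    progress = (step - warmup_iters) / max(1, total_iters - warmup_iters)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))
    return eta_min + (base_lr - eta_min) * cosine
# Under these assumptions the curve starts near base_lr * start_factor,
# reaches base_lr when warmup ends, and decays to eta_min at total_iters.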
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
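# A simplified sketch (not part of the original config) of what packing to
# max_length means: tokenized samples are concatenated and cut into
# fixed-size chunks so that most training sequences are full. This only
# illustrates the idea and is an assumption about the behaviour, not xtuner's
# implementation; `_pack_sequences` is a hypothetical helper.
def _pack_sequences(samples, max_length):
    """Concatenate token-id lists and split them into max_length chunks."""
    flat = [tok for sample in samples for tok in sample]
    return [flat[i:i + max_length] for i in range(0, len(flat), max_length)]
# For example, three short samples packed with max_length=4 become two
# sequences, the last of which may be shorter than max_length.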
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
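# A small sketch (not part of the original config) of the arithmetic behind
# the LoRA settings in PART 2. With standard LoRA scaling the adapter update
# is multiplied by lora_alpha / r, and each adapted d_out x d_in weight gains
# r * (d_in + d_out) trainable parameters. `_lora_stats` is a hypothetical
# helper, and the 4096 x 4096 shape below is an assumed example layer size.
def _lora_stats(r, lora_alpha, d_in, d_out):
    """Return (scaling factor, added trainable parameters) for one layer."""
    scaling = lora_alpha / r
    extra_params = r * (d_in + d_out)
    return scaling, extra_params
# _lora_stats(64, 16, 4096, 4096) gives a scaling of 16 / 64 = 0.25 and
# 64 * (4096 + 4096) = 524288 added parameters for that layer.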
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
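# A minimal sketch (not part of the original config) of the retention policy
# described by save_steps and save_total_limit: a checkpoint is written every
# `interval` iterations and only the newest `max_keep` are retained.
# `_kept_checkpoints` is a hypothetical helper; it ignores any extra final
# checkpoint a runner may write at the end of training.
def _kept_checkpoints(total_iters, interval, max_keep):
    """Return the iteration numbers whose checkpoints remain on disk."""
    saved = list(range(interval, total_iters + 1, interval))
    if max_keep < 0:  # -1 means unlimited, as noted in PART 1
        return saved
    return saved[-max_keep:] if max_keep else []
# With save_steps=500 and save_total_limit=2, an assumed 2000-iteration run
# would end with the checkpoints from iterations 1500 and 2000.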
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
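# Illustrative sketch, not part of the original config: where the warmup and
# cosine phases above switch once the epoch-based bounds are converted to
# iterations. It assumes `convert_to_iter_based=True` multiplies each bound by
# the number of iterations per epoch and truncates to an integer.
# `_phase_boundaries` and `iters_per_epoch` are hypothetical names.
def _phase_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    # End of the LinearLR warmup, expressed in iterations.
    warmup_end = int(warmup_ratio * max_epochs * iters_per_epoch)
    # End of the CosineAnnealingLR phase, i.e. the total training length.
    total_iters = int(max_epochs * iters_per_epoch)
    return warmup_end, total_iters
# With warmup_ratio = 0.03 and max_epochs = 3, warmup spans the first 0.09
# epochs (about 3% of all iterations); cosine annealing covers the remainder.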
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama/llama_7b/llama_7b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'huggyllama/llama-7b'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_128k_sp8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
sequence_parallel_size = 8
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 131072 # 128k
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Consider training with a per-device batch size of 1 and sequences of
# `max_length` tokens on N GPUs.
# With sequence parallelism of size `SP`, each sequence of `max_length`
# tokens is split into `SP` parts, and each part is assigned to one of the
# `SP` GPUs in the sequence-parallel group.
# To keep training equivalent to the non-parallel setup, the accumulative
# counts must therefore be set to `SP` times the original value.
# bs = 32 gpus * 1 batch_size_per_device * 8 acc / 8 sequence parallel = 32
accumulative_counts = 8
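# Illustrative sketch, not part of the original config: a worked version of
# the batch-size arithmetic in the comment above. The helper name
# `_global_batch_size` and its parameters are hypothetical, and it assumes
# the number of GPUs is divisible by the sequence parallel size.
def _global_batch_size(num_gpus, batch_size_per_device, accumulative_counts,
                       sequence_parallel_size):
    # GPUs in one sequence-parallel group share a single sequence, so only
    # num_gpus // sequence_parallel_size groups contribute distinct samples.
    data_parallel_groups = num_gpus // sequence_parallel_size
    return data_parallel_groups * batch_size_per_device * accumulative_counts
# Values from this file: 32 GPUs, batch size 1, 8 accumulation steps, SP = 8,
# so (32 // 8) * 1 * 8 gives a global batch size of 32 sequences.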
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput during the speed benchmark
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# checkpoint saving is disabled for the speed benchmark (`interval=-1`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_256k_sp16.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
sequence_parallel_size = 16
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 262144 # 256k
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Consider training with a per-device batch size of 1 and sequences of
# `max_length` tokens on N GPUs.
# With sequence parallelism of size `SP`, each sequence of `max_length`
# tokens is split into `SP` parts, and each part is assigned to one of the
# `SP` GPUs in the sequence-parallel group.
# To keep training equivalent to the non-parallel setup, the accumulative
# counts must therefore be set to `SP` times the original value.
# bs = 32 gpus * 1 batch_size_per_device * 16 acc / 16 sequence parallel = 32
accumulative_counts = 16
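# Illustrative sketch, not part of the original config: how many tokens of
# one packed sequence each GPU holds once the sequence is split across the
# sequence-parallel group, as described in the comment above.
# `_tokens_per_gpu` is a hypothetical helper that assumes an even split.
def _tokens_per_gpu(max_length, sequence_parallel_size):
    if max_length % sequence_parallel_size != 0:
        raise ValueError('max_length must be divisible by the SP size')
    return max_length // sequence_parallel_size
# Values from this file: 262144 tokens split over 16 GPUs leaves 16384 tokens
# of each sequence on every GPU.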
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput during the speed benchmark
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# checkpoint saving is disabled for the speed benchmark (`interval=-1`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_32k_sp4.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
sequence_parallel_size = 4
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 32768
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Consider training with a per-device batch size of 1 and sequences of
# `max_length` tokens on N GPUs.
# With sequence parallelism of size `SP`, each sequence of `max_length`
# tokens is split into `SP` parts, and each part is assigned to one of the
# `SP` GPUs in the sequence-parallel group.
# To keep training equivalent to the non-parallel setup, the accumulative
# counts must therefore be set to `SP` times the original value.
# bs = 32 gpus * 1 batch_size_per_device * 4 acc / 4 sequence parallel = 32
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput during the speed benchmark
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# checkpoint saving is disabled for the speed benchmark (`interval=-1`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_70b/llama2_70b_full_alpaca_enzh_8k_sp1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-70b-hf'
use_varlen_attn = False
sequence_parallel_size = 1
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 8192
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose I aim to employ a training strategy using a batch size per device
# of 1 with a maximum length of `max_length` on N GPUs.
# Upon setting the sequence parallelism dimension to `SP`,
# the accumulative counts have to be adjusted to `SP` times the original value.
# This modification is essential to assure training equivalence,
# as the sequence of `max_length` length will be segmented into `SP` parts,
# with each part being allocated to its respective GPU among the `SP` GPUs
# for parallelized training.
# bs = 32 gpus * 1 batch_size_per_device * 1 acc / 1 sequence parallel
accumulative_counts = 1
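# The batch-size comment above can be checked with a small helper. This is a
# minimal sketch, not part of XTuner: `global_batch_size` is a hypothetical
# name, and it assumes the GPUs are split evenly into sequence-parallel groups
# so that only `num_gpus / SP` ranks consume distinct samples.

```python
def global_batch_size(num_gpus, batch_size_per_device, accumulative_counts,
                      sequence_parallel_size):
    """Samples contributing to one optimizer step (illustrative only)."""
    if num_gpus % sequence_parallel_size != 0:
        raise ValueError('num_gpus must be divisible by sequence_parallel_size')
    # Each sequence-parallel group of `SP` GPUs shares one sample.
    data_parallel_groups = num_gpus // sequence_parallel_size
    return data_parallel_groups * batch_size_per_device * accumulative_counts


# Settings taken from the `bs = ...` comments in these benchmark configs.
print(global_batch_size(32, 1, 1, 1))    # 32 GPUs, SP=1, acc=1
print(global_batch_size(32, 1, 16, 16))  # 32 GPUs, SP=16, acc=16
print(global_batch_size(8, 1, 8, 8))     # 8 GPUs, SP=8, acc=8
```

# Scaling `accumulative_counts` by `SP` keeps the result equal to the SP=1
# case on the same number of GPUs, which is the equivalence the comment
# above describes.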
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Report training throughput; forward variable-length attention args when enabled
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# no periodic checkpoints are saved in this speed benchmark (`interval=-1`, `save_last=False`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_128k_sp8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b'
use_varlen_attn = False
sequence_parallel_size = 8
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 131072 # 128k
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose I aim to employ a training strategy using a batch size per device
# of 1 with a maximum length of `max_length` on N GPUs.
# Upon setting the sequence parallelism dimension to `SP`,
# the accumulative counts have to be adjusted to `SP` times the original value.
# This modification is essential to assure training equivalence,
# as the sequence of `max_length` length will be segmented into `SP` parts,
# with each part being allocated to its respective GPU among the `SP` GPUs
# for parallelized training.
# bs = 8 gpus * 1 batch_size_per_device * 8 acc / 8 sequence parallel
accumulative_counts = 8
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
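# The two-phase schedule above (linear warmup, then cosine decay) can be
# approximated in closed form. This is a simplified sketch, not mmengine's
# implementation: `lr_at` and `iters_per_epoch` are hypothetical names, and it
# assumes the epoch-based boundaries are scaled by the number of iterations
# per epoch and truncated to integers.

```python
import math


def lr_at(it, base_lr, iters_per_epoch, max_epochs=3, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at iteration `it` (illustrative only)."""
    warmup_end = int(warmup_ratio * max_epochs * iters_per_epoch)
    total_iters = max_epochs * iters_per_epoch
    if it < warmup_end:
        # Linear ramp of the multiplier from `start_factor` to 1.
        frac = it / max(1, warmup_end)
        return base_lr * (start_factor + (1.0 - start_factor) * frac)
    # Cosine decay from `base_lr` down to `eta_min`.
    progress = (it - warmup_end) / max(1, total_iters - warmup_end)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example with the values from this config and an assumed 1000 iters/epoch.
for step in (0, 50, 1500, 3000):
    print(step, lr_at(step, base_lr=2e-5, iters_per_epoch=1000))
```

# With `warmup_ratio = 0.03` and `max_epochs = 3`, warmup covers roughly the
# first 9% of one epoch; the cosine phase takes up the rest of training.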
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Report training throughput; forward variable-length attention args when enabled
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# no periodic checkpoints are saved in this speed benchmark (`interval=-1`, `save_last=False`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_1M_sp16.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b'
use_varlen_attn = False
sequence_parallel_size = 16
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 1048576 # 1M
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose I aim to employ a training strategy using a batch size per device
# of 1 with a maximum length of `max_length` on N GPUs.
# Upon setting the sequence parallelism dimension to `SP`,
# the accumulative counts have to be adjusted to `SP` times the original value.
# This modification is essential to assure training equivalence,
# as the sequence of `max_length` length will be segmented into `SP` parts,
# with each part being allocated to its respective GPU among the `SP` GPUs
# for parallelized training.
# bs = 32 gpus * 1 batch_size_per_device * 16 acc / 16 sequence parallel
accumulative_counts = 16
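# The comment above says a `max_length` sequence is segmented into `SP` parts,
# one per GPU. A minimal sketch of that split follows; `split_for_sequence_parallel`
# is a hypothetical helper, and the contiguous, equal-sized chunking is an
# assumption for illustration rather than XTuner's actual partitioning code.

```python
def split_for_sequence_parallel(token_ids, sequence_parallel_size):
    """Split one packed sequence into equal contiguous chunks, one per rank."""
    if len(token_ids) % sequence_parallel_size != 0:
        raise ValueError('sequence length must be divisible by SP size')
    chunk = len(token_ids) // sequence_parallel_size
    return [token_ids[i * chunk:(i + 1) * chunk]
            for i in range(sequence_parallel_size)]


# Tokens held by each GPU for the settings in this config (1M tokens, SP=16).
max_length = 1048576
sequence_parallel_size = 16
print(max_length // sequence_parallel_size)  # 65536

# Tiny demonstration on a toy sequence of 8 tokens with SP=4.
print(split_for_sequence_parallel(list(range(8)), 4))
# [[0, 1], [2, 3], [4, 5], [6, 7]]
```

# Under this assumption each GPU holds only `max_length / SP` tokens of
# activations, which is what makes very long contexts fit in memory.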
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Report training throughput; forward variable-length attention args when enabled
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# no periodic checkpoints are saved in this speed benchmark (`interval=-1`, `save_last=False`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_256k_sp8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b'
use_varlen_attn = False
sequence_parallel_size = 8
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 262144 # 256k
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose I aim to employ a training strategy using a batch size per device
# of 1 with a maximum length of `max_length` on N GPUs.
# Upon setting the sequence parallelism dimension to `SP`,
# the accumulative counts have to be adjusted to `SP` times the original value.
# This modification is essential to assure training equivalence,
# as the sequence of `max_length` length will be segmented into `SP` parts,
# with each part being allocated to its respective GPU among the `SP` GPUs
# for parallelized training.
# bs = 8 gpus * 1 batch_size_per_device * 8 acc / 8 sequence parallel
accumulative_counts = 8
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Report training throughput; forward variable-length attention args when enabled
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# no periodic checkpoints are saved in this speed benchmark (`interval=-1`, `save_last=False`).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_32k_sp1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b'
use_varlen_attn = False
sequence_parallel_size = 1
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 32768
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose I aim to employ a training strategy using a batch size per device
# of 1 with a maximum length of `max_length` on N GPUs.
# Upon setting the sequence parallelism dimension to `SP`,
# the accumulative counts have to be adjusted to `SP` times the original value.
# This modification is essential to assure training equivalence,
# as the sequence of `max_length` length will be segmented into `SP` parts,
# with each part being allocated to its respective GPU among the `SP` GPUs
# for parallelized training.
# bs = 8 gpus * 1 batch_size_per_device * 1 acc / 1 sequence parallel
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput; enable the varlen-attention hook only if needed
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/llama2_7b/llama2_7b_full_alpaca_enzh_8k_sp1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Llama-2-7b'
use_varlen_attn = False
sequence_parallel_size = 1
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 8192
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose the baseline is a per-device batch size of 1 with sequences of
# `max_length` tokens on N GPUs.
# With the sequence parallel size set to `SP`, `accumulative_counts` must be
# `SP` times its baseline value to keep training equivalent:
# each `max_length` sequence is split into `SP` parts, and each part is
# assigned to one of the `SP` GPUs in the sequence parallel group.
# bs = 8 gpus * 1 batch_size_per_device * 1 acc / 1 sequence parallel
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
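# --- Illustrative sketch (not part of the upstream XTuner config) ---
# The two schedulers above give `begin`/`end` in epochs and set
# `convert_to_iter_based=True`, so the bounds become iteration counts at
# runtime. The helper below approximates that mapping, assuming each bound is
# multiplied by the number of iterations per epoch and truncated to an int.
# `warmup_boundaries` and `iters_per_epoch` are hypothetical names used only
# for illustration; the real iteration count depends on the packed dataset.
def warmup_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    """Return (last warmup iteration, total iterations) for the schedule."""
    # Linear warmup covers the first `warmup_ratio * max_epochs` epochs.
    warmup_end_iter = int(warmup_ratio * max_epochs * iters_per_epoch)
    # Cosine annealing then runs until the end of training.
    total_iters = int(max_epochs * iters_per_epoch)
    return warmup_end_iter, total_iters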
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput; enable the varlen-attention hook only if needed
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_128k_sp8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B-200K'
use_varlen_attn = False
sequence_parallel_size = 8
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 131072 # 128k
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose the baseline is a per-device batch size of 1 with sequences of
# `max_length` tokens on N GPUs.
# With the sequence parallel size set to `SP`, `accumulative_counts` must be
# `SP` times its baseline value to keep training equivalent:
# each `max_length` sequence is split into `SP` parts, and each part is
# assigned to one of the `SP` GPUs in the sequence parallel group.
# bs = 32 gpus * 1 batch_size_per_device * 8 acc / 8 sequence parallel
accumulative_counts = 8
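# --- Illustrative sketch (not part of the upstream XTuner config) ---
# The comment above gives the global batch size as
#   gpus * batch_size_per_device * acc / sequence parallel.
# The helper below spells out that arithmetic. The function name is
# hypothetical, and the GPU count is taken from the comment above rather
# than from any setting in this file.
def effective_global_batch_size(num_gpus, per_device_batch_size,
                                accumulative_counts, sequence_parallel_size):
    """Packed sequences that contribute to one optimizer step."""
    if num_gpus % sequence_parallel_size != 0:
        raise ValueError(
            'num_gpus must be divisible by sequence_parallel_size')
    # GPUs in one sequence parallel group share a single sequence, so only
    # num_gpus / SP groups see distinct data.
    data_parallel_groups = num_gpus // sequence_parallel_size
    return data_parallel_groups * per_device_batch_size * accumulative_counts


# With this file's values: 32 GPUs / SP 8 = 4 groups, and
# 4 groups * 1 per device * 8 accumulation steps = 32 sequences per step,
# matching 32 GPUs with SP = 1 and accumulative_counts = 1.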
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput; enable the varlen-attention hook only if needed
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_256k_sp8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B-200K'
use_varlen_attn = False
sequence_parallel_size = 8
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 262144 # 256k
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose the baseline is a per-device batch size of 1 with sequences of
# `max_length` tokens on N GPUs.
# With the sequence parallel size set to `SP`, `accumulative_counts` must be
# `SP` times its baseline value to keep training equivalent:
# each `max_length` sequence is split into `SP` parts, and each part is
# assigned to one of the `SP` GPUs in the sequence parallel group.
# bs = 32 gpus * 1 batch_size_per_device * 8 acc / 8 sequence parallel
accumulative_counts = 8
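# --- Illustrative sketch (not part of the upstream XTuner config) ---
# The comment above says each `max_length` sequence is segmented into `SP`
# parts, one per GPU in the sequence parallel group. The helper below shows
# the resulting per-GPU token range, assuming equal contiguous shards; the
# function name and the contiguous layout are assumptions for illustration.
def sequence_shard(seq_len, sequence_parallel_size, sp_rank):
    """Return the [start, end) token range held by one GPU in the group."""
    if seq_len % sequence_parallel_size != 0:
        raise ValueError(
            'seq_len must be divisible by sequence_parallel_size')
    if not 0 <= sp_rank < sequence_parallel_size:
        raise ValueError('sp_rank out of range')
    shard_len = seq_len // sequence_parallel_size
    start = sp_rank * shard_len
    return start, start + shard_len


# With this file's values, 262144 tokens / SP 8 = 32768 tokens per GPU.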
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput; enable the varlen-attention hook only if needed
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_32k_sp2.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B-200K'
use_varlen_attn = False
sequence_parallel_size = 2
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 32768
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose the baseline is a per-device batch size of 1 with sequences of
# `max_length` tokens on N GPUs.
# With the sequence parallel size set to `SP`, `accumulative_counts` must be
# `SP` times its baseline value to keep training equivalent:
# each `max_length` sequence is split into `SP` parts, and each part is
# assigned to one of the `SP` GPUs in the sequence parallel group.
# bs = 32 gpus * 1 batch_size_per_device * 2 acc / 2 sequence parallel
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log training throughput; enable the varlen-attention hook only if needed
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llama_speed_benchmark/yi_34b/yi_34b_200k_full_alpaca_enzh_8k_sp1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import ThroughputHook, VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B-200K'
use_varlen_attn = False
sequence_parallel_size = 1
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.llama2_chat
max_length = 8192
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
# Suppose I aim to employ a training strategy using a batch size per device
# of 1 with a maximum length of `max_length` on N GPUs.
# Upon setting the sequence parallelism dimension to `SP`,
# the accumulative counts have to be adjusted to `SP` times the original value.
# This modification is essential to assure training equivalence,
# as the sequence of `max_length` length will be segmented into `SP` parts,
# with each part being allocated to its respective GPU among the `SP` GPUs
# for parallelized training.
# bs = 32 gpus * 1 batch_size_per_device * 1 acc / 1 sequence parallel
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
log_interval = 1
# Save
save_steps = -1 # speed only
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=SequenceParallelSampler, seed=1024),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Report training throughput; add the varlen-attention hook only if enabled.
custom_hooks = [dict(type=ThroughputHook)]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every `log_interval` iterations.
logger=dict(
type=LoggerHook, log_metric_by_epoch=False, interval=log_interval),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# checkpoint saving is disabled (`interval=-1`, `save_last=False`): speed benchmark only.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=log_interval)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/README.md
================================================
# LLaVA Full Pipeline
English | [简体中文](./README_zh-CN.md)
## Configs
- `./${LLM}_${ViT}/` contains configs that align with LLaVA-InternLM settings (*i.e.*, using LoRA / QLoRA).
- `./official/` contains configs that align with LLaVA official settings.
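The directory and file names in this folder follow a regular pattern that can be read off the config links in the Results table. A minimal sketch that rebuilds those relative paths; the helper name `llava_config_paths` is hypothetical, and the pattern is inferred from the existing file names rather than taken from any XTuner API:

```python
def llava_config_paths(llm: str, vit: str = "clip_vit_large_p14_336") -> dict:
    """Rebuild the relative config paths for one LLM / ViT pair.

    Hypothetical helper: the naming pattern is inferred from the file
    names linked in this README, not from an XTuner function.
    """
    root = f"./{llm}_{vit}"
    pretrain = f"llava_{llm}_{vit}_e1_gpu8_pretrain"
    finetune = f"llava_{llm}_qlora_{vit}_lora_e1_gpu8_finetune"
    return {
        "pretrain": f"{root}/pretrain/{pretrain}.py",
        "finetune": f"{root}/finetune/{finetune}.py",
    }


paths = llava_config_paths("internlm2_chat_7b")
print(paths["pretrain"])
print(paths["finetune"])
```

The file name without the `.py` suffix is the config name passed to `xtuner train`. The configs under `./official/` use a different naming scheme and are not covered by this sketch.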
## Results
XTuner mainly recommends the LLM-QLoRA / ViT-LoRA LLaVA architecture. Its evaluation results on various datasets are as follows:
| Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | MMMU Dev | MathVista MiniTest | HallusionBench aAcc | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :--------------------------- | :---------------: | :--------------: | :---------------: | :--------------: | :---------: | :--: | :-----------: | :---: | :------: | :----------------: | :-----------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 28.4 | 1716 | 66.4 | 32.2 | 33.7 | 24.2 | 46.2 | [Pretrain](./vicuna_7b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_7b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner) |
| LLaVA-v1.5-13B (XTuner) | 68.8 | 69.5 | 64.7 | 63.1 | 32.9 | 1766 | 67.9 | 35.9 | 35.2 | 26.2 | 46.9 | [Pretrain](./vicuna_13b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_13b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner) |
| LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 37.3 | 1637 | 65.7 | 32.4 | 36.9 | 26.3 | 49.1 | [Pretrain](./internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b) |
| LLaVA-InternLM2-7B (XTuner) | 73.3 | 74.6 | 71.7 | 72.0 | 42.5 | 1700 | 71.2 | 35.9 | 40.1 | 25.5 | 46.8 | [Pretrain](./internlm2_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm2_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-7b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-7b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-7b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-7b) |
| LLaVA-InternLM2-20B (XTuner) | 75.1 | 73.5 | 73.7 | 72.8 | 46.3 | 1868 | 70.2 | 37.2 | 39.4 | 24.6 | 47.7 | [Pretrain](./internlm2_chat_20b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm2_chat_20b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-20b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-20b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-20b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-20b) |
When aligned completely with the official training settings, the results are as follows:
| Model | Framework | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | Configs |
| :------------ | :-------: | :---------------: | :--------------: | :---------------: | :--------------: | :---------: | :--: | :-----------: | :---: | :--------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | Official | 65.2 | 63.0 | 57.3 | 57.4 | 25.2 | 1775 | 65.6 | 32.7 | - |
| LLaVA-v1.5-7B | XTuner | 68.6 | 68.0 | 61.5 | 61.4 | 26.5 | 1786 | 65.8 | 31.4 | [Pretrain](./official/llava_v15_7b/llava_v15_7b_pretrain.py) / [Fine-tune](./official/llava_v15_7b/llava_v15_7b_finetune.py) |
## Data Preparation
Please refer to the [docs](../../../docs/en/user_guides/dataset_prepare.md#llava-dataset).
## Training
LLaVA training consists of two steps: alignment module (i.e., MLP) pretraining and instruction-following fine-tuning.
Note: this guide uses 8-GPU training of LLaVA-InternLM2-7B as an example. If GPU resources or memory are insufficient, reduce the batch size to lower memory consumption. By default, the pretrained projector is saved to, and re-loaded from, `./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth`.
1. Alignment module pretraining (saved by default in `./work_dirs/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
```
2. Instruction following fine-tuning (saved by default in `./work_dirs/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
```
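The iteration numbers in the checkpoint names used in this guide (`iter_2181.pth`, `iter_5198.pth`) can be sanity-checked from the dataset size and the global batch size. A rough sketch follows; the sample counts and the per-device batch sizes (32 for pretraining, 16 for fine-tuning) are assumptions rather than values stated in this README, so verify them against your own data and config:

```python
import math

# Assumed sample counts for the two LLaVA annotation files; check them
# against your local copies of the data.
PRETRAIN_SAMPLES = 558_128  # blip_laion_cc_sbu_558k.json (assumption)
FINETUNE_SAMPLES = 665_298  # llava_v1_5_mix665k.json (assumption)


def iters_per_epoch(num_samples: int, num_gpus: int, per_device_batch: int) -> int:
    """Number of dataloader iterations needed to see every sample once."""
    global_batch = num_gpus * per_device_batch
    return math.ceil(num_samples / global_batch)


# Assumed per-device batch sizes: 32 (pretrain) and 16 (fine-tune).
pretrain_iters = iters_per_epoch(PRETRAIN_SAMPLES, num_gpus=8, per_device_batch=32)
finetune_iters = iters_per_epoch(FINETUNE_SAMPLES, num_gpus=8, per_device_batch=16)
print(pretrain_iters, finetune_iters)
```

With these assumed inputs the sketch gives 2181 and 5198, matching the checkpoint names above. If you change the number of GPUs or the batch size, the final iteration number changes too, and the `pretrained_pth` path in the fine-tune config likely needs to be updated to match.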
## Model Conversion (and Merge)
After training, we will obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We first need to convert them.
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./iter_5198.pth ./iter_5198_hf
```
At this point, we have obtained the relevant model (LLM or the corresponding LoRA).
Afterwards, if you want to merge LoRA into LLM or CLIP-ViT, please use the following command:
```bash
# Merge the LLM LoRA adapter
xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
# Merge the CLIP-ViT LoRA adapter
xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
```
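When the conversion and merge steps are scripted, it can help to assemble the command lines in one place. A minimal sketch, assuming `xtuner` is on `PATH`; the helper names and all example paths are hypothetical placeholders, and only the sub-commands and the `--is-clip` flag shown above are used:

```python
import shlex


def pth_to_hf_cmd(cfg: str, pth_path: str, save_path: str) -> list:
    """Argument list for `xtuner convert pth_to_hf`."""
    return ["xtuner", "convert", "pth_to_hf", cfg, pth_path, save_path]


def merge_cmd(base: str, adapter: str, save_path: str, is_clip: bool = False) -> list:
    """Argument list for `xtuner convert merge`; `is_clip` adds `--is-clip`."""
    cmd = ["xtuner", "convert", "merge", base, adapter, save_path]
    if is_clip:
        cmd.append("--is-clip")
    return cmd


# Placeholder paths for illustration only.
commands = [
    pth_to_hf_cmd(
        "llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune",
        "./iter_5198.pth",
        "./iter_5198_hf",
    ),
    merge_cmd("internlm/internlm2-chat-7b", "path/to/llm_adapter", "./merged_llm"),
    merge_cmd(
        "openai/clip-vit-large-patch14-336",
        "path/to/clip_adapter",
        "./merged_clip",
        is_clip=True,
    ),
]
for cmd in commands:
    print(shlex.join(cmd))
```

Each list can be passed to `subprocess.run(cmd, check=True)` to actually run the step. Building argument lists instead of shell strings avoids quoting problems with paths that contain spaces.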
## Chat
You can download the released LLaVA-InternLM2-7B model from 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-7b) or 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-7b), and achieve image-text question answering with the following command!
```bash
xtuner chat internlm/internlm2-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm2-7b \
--prompt-template internlm2_chat \
--image $IMAGE_PATH
```
Here, `--llava` takes the converted weights from the step above (in our example, `./iter_5198_hf`).
## Evaluation
XTuner's LLaVA models can be evaluated using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
For convenience, XTuner also integrates the [MMBench](https://mmbench.opencompass.org.cn/home) evaluation.
Users can download the MMBench datasets with:
```
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
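On machines without `wget`, the same files can be fetched with the Python standard library. A minimal sketch, assuming the URLs above stay valid; the function names and the target directory are illustrative, not part of XTuner:

```python
from pathlib import Path
from urllib.request import urlretrieve

BASE_URL = "https://opencompass.openxlab.space/utils/VLMEval"
FILES = [
    "MMBench_DEV_EN.tsv",
    "MMBench_TEST_EN.tsv",
    "MMBench_DEV_CN.tsv",
    "MMBench_TEST_CN.tsv",
    "CCBench.tsv",
]


def mmbench_urls() -> list:
    """Full download URLs for the five evaluation files listed above."""
    return [f"{BASE_URL}/{name}" for name in FILES]


def download_all(target_dir: str = ".") -> None:
    """Download every file into `target_dir`, skipping files already present."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    for url in mmbench_urls():
        dest = target / url.rsplit("/", 1)[-1]
        if not dest.exists():
            urlretrieve(url, str(dest))


for url in mmbench_urls():
    print(url)
```

Calling `download_all("./mmbench_data")` performs the downloads; any of the resulting `.tsv` paths can then be used as `$DATA_PATH` in the evaluation command.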
After that, the evaluations can be run with
```bash
xtuner mmbench internlm/internlm2-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm2-7b \
--prompt-template internlm2_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
Here, `$DATA_PATH` refers to one of the datasets downloaded as mentioned above, such as `MMBench_DEV_EN.tsv`.
After the evaluation completes, results for a development set are printed directly. For a test set, submit `mmbench_result.xlsx` to the official MMBench site to obtain the final accuracy results.
### Refcoco
To evaluate your model on RefCOCO, first download the evaluation data files from [this link](https://github.com/Vision-CAIR/MiniGPT-4/tree/main/eval_scripts/eval_data). Then use the following command to run the evaluation.
```bash
xtuner eval_refcoco $LLM \
--visual-encoder $VISUAL_ENCODER \
--llava $LLAVA_PATH \
--prompt-template $PROMPT_TEMPLATE \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/README_zh-CN.md
================================================
# LLaVA Full Pipeline
[English](./README.md) | 简体中文
## Configs
- `./${LLM}_${ViT}/` contains configs that align with the LLaVA-InternLM training settings (*i.e.*, using LoRA / QLoRA).
- `./official/` contains configs that align with the official LLaVA-v1.5 training settings.
## Results
XTuner recommends the LLM-QLoRA / ViT-LoRA LLaVA architecture. Its evaluation results on various datasets are as follows:
| Model | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | MMMU Dev | MathVista MiniTest | HallusionBench aAcc | Configs | Pretrained Projector Checkpoints | Fine-tuned LLaVA Checkpoints |
| :--------------------------- | :---------------: | :--------------: | :---------------: | :--------------: | :---------: | :--: | :-----------: | :---: | :------: | :----------------: | :-----------------: | :-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | :------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B (XTuner) | 67.7 | 69.2 | 61.0 | 59.7 | 28.4 | 1716 | 66.4 | 32.2 | 33.7 | 24.2 | 46.2 | [Pretrain](./vicuna_7b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_7b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-7b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-7b-xtuner) |
| LLaVA-v1.5-13B (XTuner) | 68.8 | 69.5 | 64.7 | 63.1 | 32.9 | 1766 | 67.9 | 35.9 | 35.2 | 26.2 | 46.9 | [Pretrain](./vicuna_13b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./vicuna_13b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-v1.5-13b-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-v1.5-13b-xtuner) |
| LLaVA-InternLM-7B (XTuner) | 69.0 | 68.5 | 66.7 | 63.8 | 37.3 | 1637 | 65.7 | 32.4 | 36.9 | 26.3 | 49.1 | [Pretrain](./internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm-7b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm-7b) |
| LLaVA-InternLM2-7B (XTuner) | 73.3 | 74.6 | 71.7 | 72.0 | 42.5 | 1700 | 71.2 | 35.9 | 40.1 | 25.5 | 46.8 | [Pretrain](./internlm2_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm2_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-7b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-7b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-7b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-7b) |
| LLaVA-InternLM2-20B (XTuner) | 75.1 | 73.5 | 73.7 | 72.8 | 46.3 | 1868 | 70.2 | 37.2 | 39.4 | 24.6 | 47.7 | [Pretrain](./internlm2_chat_20b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./internlm2_chat_20b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-20b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-20b-pretrain) | 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-20b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-20b) |
When aligned with the official LLaVA training settings, the evaluation results are as follows:
| Model | Framework | MMBench Test (EN) | MMBench Dev (EN) | MMBench Test (CN) | MMBench Dev (CN) | CCBench Dev | MME | SEEDBench_IMG | MMVet | Configs |
| :------------ | :------: | :---------------: | :--------------: | :---------------: | :--------------: | :---------: | :--: | :-----------: | :---: | :--------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | Official | 65.2 | 63.0 | 57.3 | 57.4 | 25.2 | 1775 | 65.6 | 32.7 | - |
| LLaVA-v1.5-7B | XTuner | 68.6 | 68.0 | 61.5 | 61.4 | 26.5 | 1786 | 65.8 | 31.4 | [Pretrain](./official/llava_v15_7b/llava_v15_7b_pretrain.py) / [Fine-tune](./official/llava_v15_7b/llava_v15_7b_finetune.py) |
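## Data Preparation
Please refer to the [docs](../../../docs/zh_cn/user_guides/dataset_prepare.md#llava-dataset).
## Training
LLaVA training consists of two steps: alignment module pretraining and instruction-following fine-tuning. (This guide uses 8-GPU training of LLaVA-InternLM2-7B as an example. If you run short of GPUs or GPU memory, reduce the batch size to lower memory consumption.)
By default, the pretrained projector is saved in `./work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain`, and the instruction fine-tuning stage loads the projector weights (`iter_2181.pth`) from this path.
1. Alignment module pretraining (saved by default in `./work_dirs/`)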
```bash
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2
```
2. Instruction-following fine-tuning (saved by default in `./work_dirs/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2
```
## Model Conversion (and Merge)
After training, we obtain a set of weights (*i.e.*, `iter_xxx.pth`), which are not in the universal HuggingFace format. We need to convert them.
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune ./iter_5198.pth ./iter_5198_hf
```
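At this point, we have obtained the required model (the LLM or the corresponding LoRA).
Afterwards, if you want to merge the LoRA into the LLM or CLIP-ViT, use the following commands: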
```bash
# Merge the LLM LoRA adapter
xtuner convert merge $LLM $LLM_ADAPTER $SAVE_PATH
# Merge the CLIP-ViT LoRA adapter
xtuner convert merge $CLIP $CLIP_ADAPTER $SAVE_PATH --is-clip
```
## Chat
The released LLaVA-InternLM2-7B model can be downloaded from 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-internlm2-7b) or 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-internlm2-7b). Use the following command for image-text question answering:
```bash
xtuner chat internlm/internlm2-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm2-7b \
--prompt-template internlm2_chat \
--image $IMAGE_PATH
```
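Here, pass the weights obtained in the model conversion step to `--llava` (`./iter_5198_hf` in the example).
## Evaluation
XTuner's LLaVA models can be evaluated using [VLMEvalKit](https://github.com/open-compass/VLMEvalKit).
For convenience, XTuner also integrates the MMBench evaluation. You can download the MMBench evaluation datasets with the following commands: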
```
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
After that, run the evaluation with the following command:
```bash
xtuner mmbench internlm/internlm2-chat-7b \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava xtuner/llava-internlm2-7b \
--prompt-template internlm2_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
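Here, `$DATA_PATH` refers to one of the tsv files downloaded in the previous step, such as `MMBench_DEV_EN.tsv`.
After the evaluation completes, results for a development set are printed directly. For a test set, submit `mmbench_result.xlsx` to the official MMBench site to obtain the final accuracy results.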
### Refcoco
To evaluate on the RefCOCO dataset, first download the evaluation data files from [this link](https://github.com/Vision-CAIR/MiniGPT-4/tree/main/eval_scripts/eval_data). Then use the following command to run the evaluation:
```bash
xtuner eval_refcoco $LLM \
--visual-encoder $VISUAL_ENCODER \
--llava $LLAVA_PATH \
--prompt-template $PROMPT_TEMPLATE \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_1_8b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_1_8b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-1_8b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# Reserve (336 / 14)**2 = 576 positions of the 2048-token context for image patches.
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_1_8b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_1_8b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-1_8b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# 2048-token budget minus (336 / 14)**2 = 576 image patch tokens = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
# The first prompt is Chinese for 'Please describe this photo'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: do not load one)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic` mode
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_20b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-20b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Path to the checkpoint (.pth) produced by the pretrain stage
pretrained_pth = './work_dirs/llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# 2048-token budget minus (336 / 14)**2 = 576 image patch tokens = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 4 # per_device
accumulative_counts = 4
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
# The first prompt is Chinese for 'Please describe this photo'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float32),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: do not load one)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic` mode
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_20b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_20b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-20b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Path to the checkpoint (.pth) produced by the pretrain stage
pretrained_pth = './work_dirs/llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# 2048-token budget minus (336 / 14)**2 = 576 image patch tokens = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
# The first prompt is Chinese for 'Please describe this photo'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: do not load one)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic` mode
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_20b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_20b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-20b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# 2048-token budget minus (336 / 14)**2 = 576 image patch tokens = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
# The first prompt is Chinese for 'Please describe this photo'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load weights from (None: do not load one)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic` mode
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-7b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Path to the checkpoint (.pth) produced by the pretrain stage
pretrained_pth = './work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# 2048-token budget minus (336 / 14)**2 = 576 image patch tokens = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 2
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
# The first prompt is Chinese for 'Please describe this photo'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float32),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm2_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-7b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# Reserve room for the (336 / 14)**2 = 576 image patch tokens: 2048 - 576 = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm2_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm2_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm2-chat-7b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.internlm2_chat
# Reserve room for the (336 / 14)**2 = 576 image patch tokens: 2048 - 576 = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/finetune/llava_internlm_chat_7b_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm-chat-7b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.internlm_chat
# Reserve room for the (336 / 14)**2 = 576 image patch tokens: 2048 - 576 = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/internlm_chat_7b_clip_vit_large_p14_336/pretrain/llava_internlm_chat_7b_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'internlm/internlm-chat-7b'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.internlm_chat
# Reserve room for the (336 / 14)**2 = 576 image patch tokens: 2048 - 576 = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_70b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_70b_instruct_quant_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-70B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.llama3_chat
# Reserve room for the (336 / 14)**2 = 576 image patch tokens: 2048 - 576 = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 5e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/README.md
================================================
# LLaVA-Llama-3-8B
## Results
| Model | MMBench Test (EN) | MMBench Test (CN) | CCBench Dev | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
| :-------------------- | :---------------: | :---------------: | :---------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: | :--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | 66.5 | 59.0 | 27.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
| LLaVA-Llama-3-8B | 68.9 | 61.6 | 30.4 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 66.4 | 31.6 | 36.8 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | [Pretrain](./pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](./finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py) |
## Resources
- LLaVA-Llama-3-8B-v1.1
  - Official LLaVA format model (`xtuner/llava-llama-3-8b-v1_1-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-hf)
  - HuggingFace LLaVA format model (`xtuner/llava-llama-3-8b-v1_1-transformers`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-transformers)
  - XTuner LLaVA format model (`xtuner/llava-llama-3-8b-v1_1`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1)
  - GGUF model (`xtuner/llava-llama-3-8b-v1_1-gguf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-gguf)
  - Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-v1_1-pretrain)
- LLaVA-Llama-3-8B
  - Official LLaVA format model (`xtuner/llava-llama-3-8b-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-hf)
  - HuggingFace LLaVA format model (`xtuner/llava-llama-3-8b-transformers`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-transformers) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-transformers)
  - XTuner LLaVA format model (`xtuner/llava-llama-3-8b`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b)
  - Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-llama-3-8b-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-llama-3-8b-pretrain)
## Data Preparation
### LLaVA dataset
#### File structure
```
./data/llava_data
├── LLaVA-Pretrain
│   ├── blip_laion_cc_sbu_558k.json
│   ├── blip_laion_cc_sbu_558k_meta.json
│   └── images
├── LLaVA-Instruct-150K
│   └── llava_v1_5_mix665k.json
└── llava_images
    ├── coco
    │   └── train2017
    ├── gqa
    │   └── images
    ├── ocr_vqa
    │   └── images
    ├── textvqa
    │   └── train_images
    └── vg
        ├── VG_100K
        └── VG_100K_2
```
#### Pretrain
LLaVA-Pretrain
```shell
# Make sure you have git-lfs installed (https://git-lfs.com)
git lfs install
git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain --depth=1
```
#### Finetune
1. Text data
   1. LLaVA-Instruct-150K
      ```shell
      # Make sure you have git-lfs installed (https://git-lfs.com)
      git lfs install
      git clone https://huggingface.co/datasets/liuhaotian/LLaVA-Instruct-150K --depth=1
      ```
2. Image data
   1. COCO (coco): [download url](http://images.cocodataset.org/zips/train2017.zip)
   2. GQA (gqa): [download url](https://downloads.cs.stanford.edu/nlp/data/gqa/images.zip)
   3. OCR-VQA (ocr_vqa): [download script](https://drive.google.com/drive/folders/1_GYPY5UkUy7HIcR0zq3ZCFgeZN7BAfm_?usp=sharing)
      1. ⚠️ Rename the OCR-VQA images so that every file has the `.jpg` extension!
         ```shell
         #!/bin/bash
         # Set this to the directory that holds the OCR-VQA images.
         ocr_vqa_path=""
         # Copy every file whose extension is not `jpg` to a sibling with a `.jpg` extension.
         find "$ocr_vqa_path" -type f | while IFS= read -r file; do
             extension="${file##*.}"
             if [ "$extension" != "jpg" ]
             then
                 cp -- "$file" "${file%.*}.jpg"
             fi
         done
         ```
   4. TextVQA (textvqa): [download url](https://dl.fbaipublicfiles.com/textvqa/images/train_val_images.zip)
   5. VisualGenome (VG): [part1](https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip), [part2](https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip)
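The OCR-VQA renaming step can also be done from Python. The following is a minimal sketch, not part of XTuner: the function name `copy_to_jpg` is hypothetical, and it assumes the same behavior as the shell loop (originals are kept and a `.jpg` copy is written next to each of them).

```python
import shutil
from pathlib import Path


def copy_to_jpg(image_dir: str) -> int:
    """Copy every non-`.jpg` file under `image_dir` to a sibling `.jpg` file.

    Returns the number of copies made. Originals are left in place.
    """
    copied = 0
    # Materialize the listing first so files created below are not revisited.
    for path in list(Path(image_dir).rglob("*")):
        if path.is_file() and path.suffix != ".jpg":
            shutil.copyfile(path, path.with_suffix(".jpg"))
            copied += 1
    return copied
```

Calling `copy_to_jpg("./data/llava_data/llava_images/ocr_vqa/images")` would process the directory shown in the file structure above; the copy only changes the file name, not the image encoding.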
### ShareGPT4V dataset
> Reference: https://github.com/InternLM/InternLM-XComposer/blob/main/projects/ShareGPT4V/docs/Data.md
#### File structure
```
./data/sharegpt4v
├── share-captioner_coco_lcs_sam_1246k_1107.json
├── sharegpt4v_instruct_gpt4-vision_cap100k.json
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
└── data
    ├── sam
    │   └── images
    ├── share_textvqa
    │   └── images
    ├── web-celebrity
    │   └── images
    ├── web-landmark
    │   └── images
    ├── wikiart
    │   └── images
    ├── llava
    │   └── llava_pretrain
    │       └── images -> ../../../../llava_data/LLaVA-Pretrain/images
    ├── coco -> ../../llava_data/llava_images/coco
    ├── gqa -> ../../llava_data/llava_images/gqa
    ├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
    ├── textvqa -> ../../llava_data/llava_images/textvqa
    └── vg -> ../../llava_data/llava_images/vg
```
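The `->` entries in the tree are symbolic links into the LLaVA dataset directory. A minimal sketch of creating them, assuming the layout above and a POSIX file system; the helper name `link_shared_images` and the `LINKS` table are illustrative and not part of XTuner:

```python
import os
from pathlib import Path

# Link path (relative to ./data/sharegpt4v/data) -> target, as drawn in the tree.
LINKS = {
    "llava/llava_pretrain/images": "../../../../llava_data/LLaVA-Pretrain/images",
    "coco": "../../llava_data/llava_images/coco",
    "gqa": "../../llava_data/llava_images/gqa",
    "ocr_vqa": "../../llava_data/llava_images/ocr_vqa",
    "textvqa": "../../llava_data/llava_images/textvqa",
    "vg": "../../llava_data/llava_images/vg",
}


def link_shared_images(data_dir: str) -> None:
    """Create the relative symlinks under `data_dir`, skipping existing entries."""
    root = Path(data_dir)
    for name, target in LINKS.items():
        link = root / name
        link.parent.mkdir(parents=True, exist_ok=True)
        # `is_symlink` also catches dangling links, which `exists` reports as missing.
        if not link.is_symlink() and not link.exists():
            os.symlink(target, link, target_is_directory=True)
```

Each target is resolved relative to the directory that contains the link, which is why `coco` climbs two levels and the nested `images` link climbs four.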
#### Download
1. Text data
   ```shell
   wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_instruct_gpt4-vision_cap100k.json
   wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/share-captioner_coco_lcs_sam_1246k_1107.json
   wget https://huggingface.co/datasets/Lin-Chen/ShareGPT4V/resolve/main/sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json
   ```
2. Image data
   1. SAM (sam): [download url](https://drive.google.com/file/d/1dKumdOKSXtV7lIXdrG7jsIK_z2vZv2gs/view?usp=drive_link)
   2. ShareTextVQA (share_textvqa): [download url](https://drive.google.com/file/d/1f4v_3e1OJtyYqam1CEp6RenCNTU5_mG2/view?usp=share_link)
   3. Web-Celebrity (web-celebrity): [download url](https://drive.google.com/file/d/1-SB71C3j1mVg0kDDXwj2IWGEoBoRUD-J/view?usp=share_link)
   4. Web-Landmark (web-landmark): [download url](https://drive.google.com/file/d/1JpJkN7ZMA50xAhMx9O-rVb5yLhfGm3_o/view?usp=share_link)
   5. WikiArt (wikiart): [download url](https://drive.google.com/file/d/1FxB2Nw-vWUcTUSI_dBpPIykb-uGYoEqV/view?usp=share_link)
   6. llava, coco, gqa, ocr_vqa, textvqa, vg: Please refer to the preparation of LLaVA dataset.
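On the Hugging Face Hub, a `/resolve/<revision>/` URL returns the raw file, whereas a `/blob/<revision>/` URL returns the web viewer page, so only the former is suitable for `wget`. The sketch below builds direct-download URLs for the three annotation files; the helper names `resolve_url` and `blob_to_resolve` are illustrative assumptions, not an existing API.

```python
from urllib.parse import quote

HF_DATASET = "Lin-Chen/ShareGPT4V"  # dataset repository named in the commands above
FILES = [
    "sharegpt4v_instruct_gpt4-vision_cap100k.json",
    "share-captioner_coco_lcs_sam_1246k_1107.json",
    "sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.json",
]


def resolve_url(repo_id: str, filename: str, revision: str = "main") -> str:
    """Build a direct-download URL for a file in a Hugging Face dataset repo."""
    return (f"https://huggingface.co/datasets/{repo_id}"
            f"/resolve/{revision}/{quote(filename)}")


def blob_to_resolve(url: str) -> str:
    """Rewrite a web-viewer `/blob/` URL into its raw-file `/resolve/` form."""
    return url.replace("/blob/", "/resolve/", 1)


if __name__ == "__main__":
    for name in FILES:
        print(resolve_url(HF_DATASET, name))
```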
### InternVL-SFT
> Reference: https://github.com/OpenGVLab/InternVL/tree/main/internvl_chat#prepare-training-datasets
#### File structure
```
./data/internvl_sft
├── sharegpt4v_instruct_gpt4-vision_cap100k.jsonl
├── llava_instruct_150k_zh.jsonl
├── sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl
├── dvqa_train_200k.jsonl
├── chartqa_train_18k.jsonl
├── ai2d_train_12k.jsonl
├── docvqa_train_10k.jsonl
├── geoqa+.jsonl
├── synthdog_en.jsonl
└── data
    ├── ai2d
    │   ├── abc_images
    │   └── images
    ├── chartqa
    │   ├── test
    │   ├── train
    │   └── val
    ├── docvqa
    │   ├── test
    │   ├── train
    │   └── val
    ├── dvqa
    │   └── images
    ├── synthdog-en
    │   └── images
    ├── geoqa+
    │   └── images
    ├── llava
    │   └── llava_pretrain
    │       └── images -> ../../../../llava_data/LLaVA-Pretrain/images
    ├── coco -> ../../llava_data/llava_images/coco
    ├── gqa -> ../../llava_data/llava_images/gqa
    ├── ocr_vqa -> ../../llava_data/llava_images/ocr_vqa
    ├── textvqa -> ../../llava_data/llava_images/textvqa
    ├── vg -> ../../llava_data/llava_images/vg
    ├── sam -> ../../sharegpt4v/data/sam
    ├── share_textvqa -> ../../sharegpt4v/data/share_textvqa
    ├── web-celebrity -> ../../sharegpt4v/data/web-celebrity
    ├── web-landmark -> ../../sharegpt4v/data/web-landmark
    └── wikiart -> ../../sharegpt4v/data/wikiart
```
#### Download
1. Text data
   ```shell
   wget https://huggingface.co/OpenGVLab/InternVL/resolve/main/playground.zip
   unzip ./playground.zip
   ```
2. Image data
   1. AI2D (ai2d): [download url](https://drive.google.com/file/d/1dqqa3MnrxMXaU_K9JA6C83je32ibwdOY/view?usp=sharing)
   2. ChartQA (chartqa): [download url](https://huggingface.co/datasets/ahmed-masry/ChartQA/resolve/main/ChartQA%20Dataset.zip)
   3. DocVQA (docvqa): [train](https://datasets.cvc.uab.es/rrc/DocVQA/train.tar.gz), [val](https://datasets.cvc.uab.es/rrc/DocVQA/val.tar.gz), [test](https://datasets.cvc.uab.es/rrc/DocVQA/test.tar.gz)
   4. DVQA (dvqa): [download url](https://drive.google.com/file/d/1iKH2lTi1-QxtNUVRxTUWFvUvRHq6HAsZ/view)
   5. SynthDoG-EN (synthdog-en): [download url](https://huggingface.co/OpenGVLab/InternVL/resolve/main/synthdog-en-images.zip)
   6. GeoQA+ (geoqa+): [download url](https://huggingface.co/OpenGVLab/InternVL/resolve/main/geoqa%2B_images.zip)
   7. llava, coco, gqa, ocr_vqa, textvqa, vg: Please refer to the preparation of LLaVA dataset.
   8. sam, share_textvqa, web-celebrity, web-landmark, wikiart: Please refer to the preparation of ShareGPT4V dataset.
## Training
### LLaVA-Llama-3-8B
1. Pretrain (saved by default in `./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain --deepspeed deepspeed_zero2 --seed 1024
```
2. Fine-tune (saved by default in `./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune --deepspeed deepspeed_zero2 --seed 1024
```
### LLaVA-Llama-3-8B-v1.1 (Recommended)
1. Pretrain (saved by default in `./work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain --deepspeed deepspeed_zero2 --seed 1024
```
2. Fine-tune (saved by default in `./work_dirs/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune/`)
```bash
NPROC_PER_NODE=8 xtuner train llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune --deepspeed deepspeed_zero2 --seed 1024
```
### Single card?
XTuner also supports single-card training for LLaVA-Llama-3-8B (Youth Edition): a single card with 20GB of memory is enough to complete the entire multi-modal training process.
1. Pretrain (saved by default in `./work_dirs/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain/`)
```bash
xtuner train llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain --deepspeed deepspeed_zero2 --seed 1024
```
2. Fine-tune (saved by default in `./work_dirs/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune/`)
```bash
xtuner train llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune --deepspeed deepspeed_zero2 --seed 1024
```
## Model Conversion
Training produces a set of weights (*i.e.*, `iter_xxx.pth`) that are not in the universal HuggingFace format, so they must first be converted to a LLaVA model.
### Convert `.pth` file to LLaVA model in xtuner format ([xtuner/llava-llama-3-8b-v1_1](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1))
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
```
At this point, we have obtained the relevant model (LLM or the corresponding LoRA).
If you use the default configuration of LLaVA-Llama-3-8B, you will obtain the following file structure after converting.
It includes the fully fine-tuned LLM weights, the projector weights, and the LoRA weights of the visual encoder.
```
./iter_39620_xtuner
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── projector
│   ├── config.json
│   ├── configuration_projector.py
│   ├── modeling_projector.py
│   └── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── visual_encoder_adapter
    ├── adapter_config.json
    ├── adapter_model.safetensors
    └── README.md
```
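Before running chat or evaluation, it can help to confirm that the conversion produced the layout listed above. The following is a minimal sketch: the `EXPECTED` list is transcribed from that tree, and the function name `missing_entries` is hypothetical rather than part of XTuner.

```python
from pathlib import Path
from typing import List

# Entries taken from the tree above; LLM shards are matched by pattern below.
EXPECTED = [
    "config.json",
    "model.safetensors.index.json",
    "tokenizer.json",
    "projector/config.json",
    "projector/model.safetensors",
    "visual_encoder_adapter/adapter_config.json",
    "visual_encoder_adapter/adapter_model.safetensors",
]


def missing_entries(save_path: str) -> List[str]:
    """Return the expected files that are absent from a converted directory."""
    root = Path(save_path)
    missing = [rel for rel in EXPECTED if not (root / rel).is_file()]
    # At least one sharded LLM weight file should be present.
    if not any(root.glob("model-*-of-*.safetensors")):
        missing.append("model-*-of-*.safetensors")
    return missing
```

An empty list from `missing_entries("./iter_39620_xtuner")` suggests the LLM, projector, and visual encoder adapter were all written; other configurations may legitimately produce a different layout.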
The LLaVA model in xtuner format can be used for conversation with `xtuner chat`:
```bash
xtuner chat ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--image $IMAGE_PATH
```
It can also be evaluated on MMBench with `xtuner mmbench`:
```bash
xtuner mmbench ./iter_39620_xtuner \
--visual-encoder openai/clip-vit-large-patch14-336 \
--llava ./iter_39620_xtuner \
--prompt-template llama3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
Here, `$DATA_PATH` refers to one of the MMBench dataset files, which can be downloaded with:
```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
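The same files can be fetched from Python using only the standard library. This is a minimal sketch assuming the URLs listed above remain valid; `split_url` and `download_splits` are illustrative names, and files that already exist locally are not downloaded again.

```python
import urllib.request
from pathlib import Path
from typing import List, Sequence

BASE_URL = "https://opencompass.openxlab.space/utils/VLMEval"
SPLITS = [
    "MMBench_DEV_EN",
    "MMBench_TEST_EN",
    "MMBench_DEV_CN",
    "MMBench_TEST_CN",
    "CCBench",
]


def split_url(split: str) -> str:
    """Return the download URL of one benchmark split."""
    return f"{BASE_URL}/{split}.tsv"


def download_splits(out_dir: str, splits: Sequence[str] = SPLITS) -> List[Path]:
    """Download each split into `out_dir` unless the file is already there."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for split in splits:
        dest = out / f"{split}.tsv"
        if not dest.exists():
            urllib.request.urlretrieve(split_url(split), str(dest))
        paths.append(dest)
    return paths
```

Any of the returned paths can then be passed as `--data-path` to `xtuner mmbench`.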
### Convert `.pth` file to LLaVA model in official format ([xtuner/llava-llama-3-8b-v1_1-hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf))
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH --save-format official
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_official --save-format official
```
Here, the converted LLaVA model in official LLaVA format is saved to `./iter_39620_official`.
```
./iter_39620_official
├── config.json
├── generation_config.json
├── model-00001-of-00009.safetensors
├── model-00002-of-00009.safetensors
├── model-00003-of-00009.safetensors
├── model-00004-of-00009.safetensors
├── model-00005-of-00009.safetensors
├── model-00006-of-00009.safetensors
├── model-00007-of-00009.safetensors
├── model-00008-of-00009.safetensors
├── model-00009-of-00009.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
```
### Convert `.pth` file to LLaVA model in HuggingFace format ([xtuner/llava-llama-3-8b-v1_1-transformers](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers))
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH --save-format huggingface
# e.g., xtuner convert pth_to_hf llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_huggingface --save-format huggingface
```
Here, the converted LLaVA model in HuggingFace LLaVA format is saved to `./iter_39620_huggingface`.
```
./iter_39620_huggingface
├── config.json
├── generation_config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
└── tokenizer.json
```
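A sharded checkpoint such as the ones above comes with a `model.safetensors.index.json` file whose `weight_map` maps each parameter name to the shard that stores it. A minimal sketch for checking that every referenced shard is on disk; the function name `missing_shards` is an assumption for illustration:

```python
import json
from pathlib import Path
from typing import List


def missing_shards(model_dir: str) -> List[str]:
    """List shard files named in the index that are absent from `model_dir`."""
    root = Path(model_dir)
    index = json.loads((root / "model.safetensors.index.json").read_text())
    # `weight_map` maps parameter names to shard file names.
    shards = sorted(set(index["weight_map"].values()))
    return [name for name in shards if not (root / name).is_file()]
```

For example, `missing_shards("./iter_39620_huggingface")` should return an empty list when all four shards listed above are present.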
## Chat
- XTuner LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1#quickstart)
- Official LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#quickstart)
- HuggingFace LLaVA format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-transformers#quickstart)
- GGUF format [docs](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-gguf#quickstart)
## Deployment
[LMDeploy](https://github.com/InternLM/lmdeploy) now supports the deployment of official LLaVA format models (e.g., [xtuner/llava-llama-3-8b-v1_1-hf](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf)). For specifics, please refer to [here](https://huggingface.co/xtuner/llava-llama-3-8b-v1_1-hf#chat-by-lmdeploy).
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/convert_xtuner_weights_to_hf.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
# Modified from https://github.com/huggingface/transformers/blob/v4.40.1/src/transformers/models/llava/convert_llava_weights_to_hf.py # noqa: E501
import argparse
import torch
from safetensors import safe_open
from transformers import (AddedToken, AutoConfig, AutoModelForCausalLM,
CLIPImageProcessor, CLIPVisionModel,
LlamaTokenizerFast, LlavaConfig,
LlavaForConditionalGeneration, LlavaProcessor)
KEYS_TO_MODIFY_MAPPING_LLM = {
'model': 'language_model.model',
'lm_head': 'language_model.lm_head',
}
KEYS_TO_MODIFY_MAPPING_VIT = {
'vision_model': 'vision_tower.vision_model',
}
KEYS_TO_MODIFY_MAPPING_PROJECTOR = {
'model.0': 'multi_modal_projector.linear_1',
'model.2': 'multi_modal_projector.linear_2',
}
def convert_state_dict_to_hf(state_dict, mapping):
    new_state_dict = {}
    for key, value in state_dict.items():
        if key.endswith('.inv_freq'):
            continue
        for key_to_modify, new_key in mapping.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        new_state_dict[key] = value
    return new_state_dict
def convert_to_hf(text_model_id, vision_model_id, projector_weight, save_path):
    torch.set_default_dtype(torch.float16)
    text_config = AutoConfig.from_pretrained(
        text_model_id, trust_remote_code=True)
    vision_config = AutoConfig.from_pretrained(vision_model_id)
    if hasattr(vision_config, 'vision_config'):
        vision_config = vision_config.vision_config
    tokenizer = LlamaTokenizerFast.from_pretrained(text_model_id)
    tokenizer.add_tokens(
        AddedToken('<image>', special=True, normalized=False),
        special_tokens=True)
    tokenizer.add_special_tokens({'pad_token': '<pad>'})
    image_processor = CLIPImageProcessor.from_pretrained(vision_model_id)
    processor = LlavaProcessor(
        tokenizer=tokenizer, image_processor=image_processor)
    config = LlavaConfig(
        text_config=text_config,
        vision_config=vision_config,
        attn_implementation='eager')
    with torch.device('meta'):
        model = LlavaForConditionalGeneration(config)
    # Pad to 64 for performance reasons
    pad_shape = 64
    projector_state_dict = {}
    with safe_open(projector_weight, framework='pt', device='cpu') as f:
        for key in f.keys():
            projector_state_dict[key] = f.get_tensor(key)
    ori_llm = AutoModelForCausalLM.from_pretrained(
        text_model_id, trust_remote_code=True)
    ori_vit = CLIPVisionModel.from_pretrained(vision_model_id)
    llm_state_dict = ori_llm.state_dict()
    vit_state_dict = ori_vit.state_dict()
    projector_state_dict = convert_state_dict_to_hf(
        projector_state_dict, KEYS_TO_MODIFY_MAPPING_PROJECTOR)
    llm_state_dict = convert_state_dict_to_hf(llm_state_dict,
                                              KEYS_TO_MODIFY_MAPPING_LLM)
    vit_state_dict = convert_state_dict_to_hf(vit_state_dict,
                                              KEYS_TO_MODIFY_MAPPING_VIT)
    state_dict = {**projector_state_dict, **llm_state_dict, **vit_state_dict}
    model.load_state_dict(state_dict, strict=True, assign=True)
    pre_expansion_embeddings = \
        model.language_model.model.embed_tokens.weight.data
    mu = torch.mean(pre_expansion_embeddings, dim=0).float()
    n = pre_expansion_embeddings.size()[0]
    sigma = ((pre_expansion_embeddings - mu).T
             @ (pre_expansion_embeddings - mu)) / n
    dist = torch.distributions.multivariate_normal.MultivariateNormal(
        mu, covariance_matrix=1e-5 * sigma)
    # We add an image token so we resize the model
    ori_vocab_size = config.text_config.vocab_size
    tokenizer_vocab_size = tokenizer.encode('<pad>')[-1]
    added_token = tokenizer_vocab_size - ori_vocab_size
    if added_token > 0:
        model.resize_token_embeddings(ori_vocab_size + added_token, pad_shape)
        model.language_model.model.embed_tokens.weight.data[
            ori_vocab_size:] = torch.stack(
                tuple(dist.sample()
                      for _ in range(model.language_model.model.embed_tokens.
                                     weight.data[ori_vocab_size:].shape[0])),
                dim=0,
            )
        model.language_model.lm_head.weight.data[
            ori_vocab_size:] = torch.stack(
                tuple(dist.sample()
                      for _ in range(model.language_model.lm_head.weight.
                                     data[ori_vocab_size:].shape[0])),
                dim=0,
            )
    model.config.image_token_index = tokenizer.encode('<image>')[-1]
    model.config.pad_token_id = tokenizer.encode('<pad>')[-1]
    if ori_vit.__class__.__name__ == 'SiglipVisionModel':
        model.config.vision_feature_select_strategy = 'full'
    model.save_pretrained(save_path)
    processor.save_pretrained(save_path)
    print(f'Saved to {save_path}')
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--text_model_id')
    parser.add_argument('--vision_model_id')
    parser.add_argument('--projector_weight')
    parser.add_argument('--save_path')
    args = parser.parse_args()
    convert_to_hf(args.text_model_id, args.vision_model_id,
                  args.projector_weight, args.save_path)
if __name__ == '__main__':
    main()
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/convert_xtuner_weights_to_llava.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import torch
try:
    from llava.model import LlavaConfig, LlavaLlamaForCausalLM
    from llava.utils import disable_torch_init
except ImportError:
    raise ImportError(
        'Please install llava with '
        '`pip install git+https://github.com/haotian-liu/LLaVA.git '
        '--no-deps`.')
from safetensors import safe_open
from transformers import (AutoConfig, AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
KEYS_TO_MODIFY_MAPPING_VIT = {
'vision_model': 'model.vision_tower.vision_tower.vision_model',
}
KEYS_TO_MODIFY_MAPPING_PROJECTOR = {
'model.0': 'model.mm_projector.0',
'model.2': 'model.mm_projector.2',
}
def convert_state_dict_to_hf(state_dict, mapping):
    new_state_dict = {}
    for key, value in state_dict.items():
        if key.endswith('.inv_freq'):
            continue
        for key_to_modify, new_key in mapping.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        new_state_dict[key] = value
    return new_state_dict
def convert_to_llava(text_model_id, vision_model_id, projector_weight,
                     save_path):
    disable_torch_init()
    torch.set_default_dtype(torch.float16)
    projector_state_dict = {}
    with safe_open(projector_weight, framework='pt', device='cpu') as f:
        for key in f.keys():
            projector_state_dict[key] = f.get_tensor(key)
    ori_llm = AutoModelForCausalLM.from_pretrained(
        text_model_id, trust_remote_code=True, device_map='auto')
    ori_vit = CLIPVisionModel.from_pretrained(vision_model_id)
    llm_state_dict = ori_llm.state_dict()
    vit_state_dict = ori_vit.state_dict()
    projector_state_dict = convert_state_dict_to_hf(
        projector_state_dict, KEYS_TO_MODIFY_MAPPING_PROJECTOR)
    vit_state_dict = convert_state_dict_to_hf(vit_state_dict,
                                              KEYS_TO_MODIFY_MAPPING_VIT)
    state_dict = {**projector_state_dict, **llm_state_dict, **vit_state_dict}
    tokenizer = AutoTokenizer.from_pretrained(text_model_id)
    text_config = AutoConfig.from_pretrained(
        text_model_id, trust_remote_code=True)
    ori_config = text_config.__dict__.copy()
    ori_config.update(
        dict(
            image_aspect_ratio='pad',
            mm_hidden_size=ori_vit.config.hidden_size,
            mm_projector_type='mlp2x_gelu',
            mm_use_im_patch_token=False,
            mm_use_im_start_end=False,
            mm_vision_select_feature='patch',
            mm_vision_select_layer=-2,
            mm_vision_tower=vision_model_id,
            unfreeze_mm_vision_tower=True,
            model_type='llava',
            use_cache=True,
            use_mm_proj=True))
    config = LlavaConfig(**ori_config)
    with torch.device('meta'):
        model = LlavaLlamaForCausalLM(config)
    image_processor = CLIPImageProcessor.from_pretrained(vision_model_id)
    model.load_state_dict(state_dict, strict=True, assign=True)
    model.save_pretrained(save_path, max_shard_size='2GB')
    image_processor.save_pretrained(save_path)
    tokenizer.save_pretrained(save_path)
    print(f'Saved to {save_path}')
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--text_model_id')
    parser.add_argument('--vision_model_id')
    parser.add_argument('--projector_weight')
    parser.add_argument('--save_path')
    args = parser.parse_args()
    convert_to_llava(args.text_model_id, args.vision_model_id,
                     args.projector_weight, args.save_path)
if __name__ == '__main__':
    main()
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 1000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 1000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 1000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 1000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import ConcatDataset, LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/iter_9742.pth' # noqa: E501
# Data
data_root = './data/internvl_sft/'
sharegpt4v_caption_data_path = data_root + 'sharegpt4v_instruct_gpt4-vision_cap100k.jsonl' # noqa: E501
sharegpt4v_caption_image_folder = data_root + 'data'
llava_data_path = data_root + 'llava_instruct_150k_zh.jsonl'
llava_image_folder = data_root + 'data/coco'
sharegpt4v_data_path = data_root + 'sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl' # noqa: E501
sharegpt4v_image_folder = data_root + 'data'
dvqa_data_path = data_root + 'dvqa_train_200k.jsonl'
dvqa_image_folder = data_root + 'data/dvqa'
chartqa_data_path = data_root + 'chartqa_train_18k.jsonl'
chartqa_image_folder = data_root + 'data/chartqa'
ai2d_data_path = data_root + 'ai2d_train_12k.jsonl'
ai2d_image_folder = data_root + 'data/ai2d'
docvqa_data_path = data_root + 'docvqa_train_10k.jsonl'
docvqa_image_folder = data_root + 'data/docvqa'
geoqa_data_path = data_root + 'geoqa+.jsonl'
geoqa_image_folder = data_root + 'data/geoqa+'
synthdog_data_path = data_root + 'synthdog_en.jsonl'
synthdog_image_folder = data_root + 'data/synthdog-en'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(4096 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 4 # per_device
accumulative_counts = 4
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 5000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sharegpt4v_caption_dataset = dict(
type=LLaVADataset,
data_path=sharegpt4v_caption_data_path,
image_folder=sharegpt4v_caption_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
llava_dataset = dict(
type=LLaVADataset,
data_path=llava_data_path,
image_folder=llava_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
sharegpt4v_dataset = dict(
type=LLaVADataset,
data_path=sharegpt4v_data_path,
image_folder=sharegpt4v_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
dvqa_dataset = dict(
type=LLaVADataset,
data_path=dvqa_data_path,
image_folder=dvqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
chartqa_dataset = dict(
type=LLaVADataset,
data_path=chartqa_data_path,
image_folder=chartqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
ai2d_dataset = dict(
type=LLaVADataset,
data_path=ai2d_data_path,
image_folder=ai2d_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
docvqa_dataset = dict(
type=LLaVADataset,
data_path=docvqa_data_path,
image_folder=docvqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
geoqa_dataset = dict(
type=LLaVADataset,
data_path=geoqa_data_path,
image_folder=geoqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
synthdog_dataset = dict(
type=LLaVADataset,
data_path=synthdog_data_path,
image_folder=synthdog_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataset = dict(
type=ConcatDataset,
datasets=[
sharegpt4v_caption_dataset, llava_dataset, sharegpt4v_dataset,
dvqa_dataset, chartqa_dataset, ai2d_dataset, docvqa_dataset,
geoqa_dataset, synthdog_dataset
])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=train_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_qlora_clip_vit_large_p14_336_e1_gpu1_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain/558128.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 128
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 50000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# By default, use a random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/sharegpt4v/'
data_path = data_root + 'share-captioner_coco_lcs_sam_1246k_1107.json'
image_folder = data_root + 'data'
prompt_template = PROMPT_TEMPLATE.llama3_chat
# text budget: 4096 total tokens minus (336 / 14)**2 = 576 image patch tokens
max_length = int(4096 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 1000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 1000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_8b_instruct_quant_clip_vit_large_p14_336_e1_gpu1_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.llama3_chat
# text budget: 2048 total tokens minus (336 / 14)**2 = 576 image patch tokens
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 256
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 50000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/official/llava_v15_13b/llava_v15_13b_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-13b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_v15_13b_pretrain/iter_2181.pth'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
# text budget: 2048 total tokens minus (336 / 14)**2 = 576 image patch tokens
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/official/llava_v15_13b/llava_v15_13b_finetune_lora.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-13b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_v15_13b_pretrain/iter_2181.pth'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
# text budget: 2048 total tokens minus (336 / 14)**2 = 576 image patch tokens
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
llm_lora=dict(
type=LoraConfig,
r=128,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/official/llava_v15_13b/llava_v15_13b_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-13b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.vicuna
# text budget: 2048 total tokens minus (336 / 14)**2 = 576 image patch tokens
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/official/llava_v15_7b/llava_v15_7b_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_v15_7b_pretrain/iter_2181.pth'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/official/llava_v15_7b/llava_v15_7b_finetune_lora.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_v15_7b_pretrain/iter_2181.pth'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
llm_lora=dict(
type=LoraConfig,
r=128,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/official/llava_v15_7b/llava_v15_7b_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.vicuna
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/README.md
================================================
# LLaVA-Phi-3-mini
## Results
| Model | MMBench Test (EN) | MMMU Val | SEED-IMG | AI2D Test | ScienceQA Test | HallusionBench aAcc | POPE | GQA | TextVQA | MME | MMStar | Configs |
| :-------------------- | :---------------: | :-------: | :------: | :-------: | :------------: | :-----------------: | :--: | :--: | :-----: | :------: | :----: | :----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------: |
| LLaVA-v1.5-7B | 66.5 | 35.3 | 60.5 | 54.8 | 70.4 | 44.9 | 85.9 | 62.0 | 58.2 | 1511/348 | 30.3 | - |
| LLaVA-Llama-3-8B | 68.9 | 36.8 | 69.8 | 60.9 | 73.3 | 47.3 | 87.2 | 63.5 | 58.0 | 1506/295 | 38.2 | [Pretrain](https://github.com/InternLM/xtuner/blob/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py) / [Fine-tune](https://github.com/InternLM/xtuner/blob/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py) |
| LLaVA-Llama-3-8B-v1.1 | 72.3 | 37.1 | 70.1 | 70.0 | 72.9 | 47.7 | 86.4 | 62.6 | 59.0 | 1469/349 | 45.1 | [Pretrain](https://github.com/InternLM/xtuner/blob/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/pretrain/llava_llama3_8b_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](https://github.com/InternLM/xtuner/blob/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336/finetune/llava_llama3_8b_instruct_full_clip_vit_large_p14_336_lora_e1_gpu8_internvl_finetune.py) |
| **LLaVA-Phi-3-mini** | 69.2 | 41.4 | 70.0 | 69.3 | 73.7 | 49.8 | 87.3 | 61.5 | 57.8 | 1477/313 | 43.7 | [Pretrain](./pretrain/llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py) / [Fine-tune](./finetune/llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune.py) |
## Resources
- Official LLaVA format model (`xtuner/llava-phi-3-mini`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-phi-3-mini) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-phi-3-mini)
- HuggingFace LLaVA format model (`xtuner/llava-phi-3-mini-hf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-phi-3-mini-hf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-phi-3-mini-hf)
- XTuner LLaVA format model (`xtuner/llava-phi-3-mini-xtuner`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-phi-3-mini-xtuner) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-phi-3-mini-xtuner)
- GGUF model (`xtuner/llava-phi-3-mini-gguf`): 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-phi-3-mini-gguf) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-phi-3-mini-gguf)
- Pretrained projector weights: 🤗 [HuggingFace](https://huggingface.co/xtuner/llava-phi-3-mini-pretrain) / 🤖 [ModelScope](https://modelscope.cn/models/xtuner/llava-phi-3-mini-pretrain)
## Data Preparation
Data preparation follows the same steps as for LLaVA-Llama-3-8B; see its [data preparation guide](https://github.com/InternLM/xtuner/tree/main/xtuner/configs/llava/llama3_8b_instruct_clip_vit_large_p14_336#data-preparation).
## Training
### LLaVA-Phi-3-mini
1. Pretrain
```bash
NPROC_PER_NODE=8 xtuner train llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain --deepspeed deepspeed_zero2 --seed 1024
```
2. Fine-tune
```bash
NPROC_PER_NODE=8 xtuner train llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune --deepspeed deepspeed_zero2 --seed 1024
```
## Model Conversion
### Step 0. Convert `.pth` file to LLaVA model in xtuner format ([LLaVA-Phi-3-mini-xtuner](https://huggingface.co/xtuner/llava-phi-3-mini-xtuner))
Training produces a set of checkpoint weights (*i.e.*, `iter_xxx.pth`) that are not in the standard HuggingFace format. The first step is to convert them to a LLaVA model in xtuner format.
```bash
xtuner convert pth_to_hf $FINETUNE_CFG $PTH_PATH $SAVE_PATH
# e.g., xtuner convert pth_to_hf llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune ./iter_39620.pth ./iter_39620_xtuner
```
```
./iter_39620_xtuner
├── added_tokens.json
├── config.json
├── model-00001-of-00004.safetensors
├── model-00002-of-00004.safetensors
├── model-00003-of-00004.safetensors
├── model-00004-of-00004.safetensors
├── model.safetensors.index.json
├── projector
│   ├── config.json
│   ├── configuration_projector.py
│   ├── modeling_projector.py
│   └── model.safetensors
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
├── tokenizer.model
└── visual_encoder
    ├── config.json
    ├── model.safetensors
    └── preprocessor_config.json
```
At this point, the xtuner-format LLaVA model can be used for conversation with `xtuner chat`:
```bash
xtuner chat ./iter_39620_xtuner \
--llava ./iter_39620_xtuner \
--prompt-template phi3_chat \
--image $IMAGE_PATH
```
It can also be evaluated on MMBench:
```bash
xtuner mmbench ./iter_39620_xtuner \
--llava ./iter_39620_xtuner \
--prompt-template phi3_chat \
--data-path $DATA_PATH \
--work-dir $RESULT_PATH
```
Here, `$DATA_PATH` refers to one of the MMBench datasets, which can be downloaded with:
```bash
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_EN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_DEV_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/MMBench_TEST_CN.tsv
wget https://opencompass.openxlab.space/utils/VLMEval/CCBench.tsv
```
### Step 1. Convert LLaVA in xtuner format to official LLaVA format or HuggingFace LLaVA format
- The official LLaVA format is structured similarly to the architecture of the [liuhaotian/llava-v1.5-7b](https://huggingface.co/liuhaotian/llava-v1.5-7b) model.
- The HuggingFace LLaVA format is structured similarly to the architecture of the [llava-hf/llava-1.5-7b-hf](https://huggingface.co/llava-hf/llava-1.5-7b-hf) model.
Since the official LLaVA format and the HuggingFace LLaVA format only support the Llama architecture as the LLM, we first need to convert the Phi-3 model to an equivalent Llama LLM.
```bash
python ./convert_phi_to_llama.py --phi_path ./iter_39620_xtuner --save_path ./iter_39620_xtuner_llama_llm
```
Here, `--phi_path` should point to the Phi-3 model, i.e., the xtuner-format LLaVA model directory obtained in Step 0, and `--save_path` is the directory where the converted Llama LLM is saved.
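As a rough illustration of what this conversion does to the weights, the sketch below splits fused projections (`qkv_proj`, `gate_up_proj`) into the separate projections that the Llama layout expects. It is a simplified stand-in, not the script itself: plain lists of rows take the place of tensors, and the helper names `split_rows` and `unfuse` are hypothetical. The equal split assumes that the fused projections stack equally sized blocks along the row dimension; the sketch raises an error when that does not hold.

```python
# Illustrative sketch only: lists of rows stand in for weight tensors.
# `split_rows` and `unfuse` are hypothetical names, not part of xtuner.

# Fused projection name -> names of the projections it is split into.
FUSED = {
    'qkv_proj': ['q_proj', 'k_proj', 'v_proj'],
    'gate_up_proj': ['gate_proj', 'up_proj'],
}


def split_rows(rows, parts):
    """Split `rows` into `parts` equal consecutive chunks along dim 0."""
    if len(rows) % parts != 0:
        raise ValueError('row count must be divisible by the number of parts')
    size = len(rows) // parts
    return [rows[i * size:(i + 1) * size] for i in range(parts)]


def unfuse(state):
    """Return a new mapping in which every fused weight is split and renamed."""
    out = {}
    for key, rows in state.items():
        for fused, names in FUSED.items():
            if fused in key:
                chunks = split_rows(rows, len(names))
                for name, chunk in zip(names, chunks):
                    out[key.replace(fused, name)] = chunk
                break
        else:
            # Not a fused projection: copy through unchanged.
            out[key] = rows
    return out
```

All other weights are copied unchanged, so only the key names and the shapes of the fused projections differ between the two checkpoints.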
#### To official LLaVA format ([LLaVA-Phi-3-mini](https://huggingface.co/xtuner/llava-phi-3-mini))
The following command produces the LLaVA model in the official LLaVA format.
```bash
python ./convert_xtuner_weights_to_llava.py --text_model_id ./iter_39620_xtuner_llama_llm --vision_model_id ./iter_39620_xtuner/visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_llava
```
Here, the converted LLaVA model in official LLaVA format is saved to `./iter_39620_llava`.
```
./iter_39620_llava
├── added_tokens.json
├── config.json
├── generation_config.json
├── model-00001-of-00005.safetensors
├── model-00002-of-00005.safetensors
├── model-00003-of-00005.safetensors
├── model-00004-of-00005.safetensors
├── model-00005-of-00005.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
```
#### To HuggingFace LLaVA format ([LLaVA-Phi-3-mini-hf](https://huggingface.co/xtuner/llava-phi-3-mini-hf))
The following command produces the LLaVA model in the HuggingFace LLaVA format.
```bash
python ./convert_xtuner_weights_to_hf.py --text_model_id ./iter_39620_xtuner_llama_llm --vision_model_id ./iter_39620_xtuner/visual_encoder --projector_weight ./iter_39620_xtuner/projector/model.safetensors --save_path ./iter_39620_hf
```
Here, the converted LLaVA model in HuggingFace LLaVA format is saved to `./iter_39620_hf`.
```
./iter_39620_hf
├── added_tokens.json
├── config.json
├── generation_config.json
├── model-00001-of-00002.safetensors
├── model-00002-of-00002.safetensors
├── model.safetensors.index.json
├── preprocessor_config.json
├── special_tokens_map.json
├── tokenizer_config.json
├── tokenizer.json
└── tokenizer.model
```
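Part of this conversion is renaming state-dict keys so that the LLM and projector weights land under the names the HuggingFace LLaVA model expects. The sketch below mirrors the substring-replacement mappings in `convert_xtuner_weights_to_hf.py`, using placeholder strings instead of tensors; the function name `rename_keys` and the example keys are illustrative assumptions.

```python
# Illustrative sketch: substring -> replacement tables, following the
# mappings in convert_xtuner_weights_to_hf.py. Values are placeholders.
LLM_MAP = {
    'model': 'language_model.model',
    'lm_head': 'language_model.lm_head',
}
PROJECTOR_MAP = {
    'model.0': 'multi_modal_projector.linear_1',
    'model.2': 'multi_modal_projector.linear_2',
}


def rename_keys(state_dict, mapping):
    """Rename keys by substring replacement, dropping rotary `inv_freq` buffers."""
    renamed = {}
    for key, value in state_dict.items():
        if key.endswith('.inv_freq'):
            continue
        for old, new in mapping.items():
            if old in key:
                key = key.replace(old, new)
        renamed[key] = value
    return renamed


# Hypothetical example keys, for illustration only.
llm = rename_keys(
    {
        'model.embed_tokens.weight': 'w0',
        'lm_head.weight': 'w1',
        'model.layers.0.self_attn.rotary_emb.inv_freq': 'skip',
    }, LLM_MAP)
projector = rename_keys({'model.0.weight': 'p0', 'model.2.bias': 'p1'},
                        PROJECTOR_MAP)
```

Because the replacement is a plain substring match, each mapping has to be applied to the matching sub-model's state dict only, which is why the LLM and projector are renamed separately before being merged.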
## Chat
- XTuner LLaVA format [docs](https://huggingface.co/xtuner/llava-phi-3-mini-xtuner#quickstart)
- Official LLaVA format [docs](https://huggingface.co/xtuner/llava-phi-3-mini#quickstart)
- HuggingFace LLaVA format [docs](https://huggingface.co/xtuner/llava-phi-3-mini-hf#quickstart)
- GGUF format [docs](https://huggingface.co/xtuner/llava-phi-3-mini-gguf#quickstart)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/convert_phi_to_llama.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import json
import os
from mmengine.utils import mkdir_or_exist
from safetensors import safe_open
from safetensors.torch import save_file
from tqdm import tqdm
from transformers import AutoTokenizer
def convert_phi_to_llama(phi_path, save_path):
    files = [f for f in os.listdir(phi_path) if f.endswith('safetensors')]
    mkdir_or_exist(save_path)
    index_json = os.path.join(phi_path, 'model.safetensors.index.json')
    config_json = os.path.join(phi_path, 'config.json')
    with open(index_json) as f:
        index = json.load(f)
    with open(config_json) as f:
        config = json.load(f)
    config.pop('_name_or_path')
    if 'auto_map' in config:
        config.pop('auto_map')
    config.pop('embd_pdrop')
    config.pop('resid_pdrop')
    config['architectures'] = ['LlamaForCausalLM']
    config['model_type'] = 'llama'
    for file in tqdm(files, desc='Convert'):
        tensors = {}
        new_path = os.path.join(save_path, file)
        old_path = os.path.join(phi_path, file)
        with safe_open(old_path, framework='pt', device='cpu') as f:
            for key in f.keys():
                if 'qkv_proj' in key:
                    qkv = f.get_tensor(key)
                    q, k, v = qkv.chunk(3, dim=0)
                    q_name = key.replace('qkv_proj', 'q_proj')
                    k_name = key.replace('qkv_proj', 'k_proj')
                    v_name = key.replace('qkv_proj', 'v_proj')
                    tensors[q_name] = q
                    tensors[k_name] = k
                    tensors[v_name] = v
                    index['weight_map'].pop(key)
                    filename = os.path.basename(new_path)
                    index['weight_map'][q_name] = filename
                    index['weight_map'][k_name] = filename
                    index['weight_map'][v_name] = filename
                elif 'gate_up_proj' in key:
                    gate_up_proj = f.get_tensor(key)
                    gate_proj, up_proj = gate_up_proj.chunk(2, dim=0)
                    gate_name = key.replace('gate_up_proj', 'gate_proj')
                    up_name = key.replace('gate_up_proj', 'up_proj')
                    tensors[gate_name] = gate_proj
                    tensors[up_name] = up_proj
                    index['weight_map'].pop(key)
                    filename = os.path.basename(new_path)
                    index['weight_map'][gate_name] = filename
                    index['weight_map'][up_name] = filename
                else:
                    tensors[key] = f.get_tensor(key)
            metadata = f.metadata()
        save_file(tensors, new_path, metadata=metadata)
    new_config_json = os.path.join(save_path, 'config.json')
    with open(new_config_json, 'w') as f:
        json.dump(config, f, indent=2)
    new_index_json = os.path.join(save_path, 'model.safetensors.index.json')
    with open(new_index_json, 'w') as f:
        json.dump(index, f, indent=2)
    tokenizer = AutoTokenizer.from_pretrained(phi_path, trust_remote_code=True)
    tokenizer.save_pretrained(save_path)
    print(f'Saved to {save_path}')
def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--phi_path')
    parser.add_argument('--save_path')
    args = parser.parse_args()
    convert_phi_to_llama(args.phi_path, args.save_path)
if __name__ == '__main__':
    main()
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/convert_xtuner_weights_to_hf.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
# Modified from https://github.com/huggingface/transformers/blob/v4.40.1/src/transformers/models/llava/convert_llava_weights_to_hf.py # noqa: E501
import argparse
import torch
from safetensors import safe_open
from transformers import (AddedToken, AutoConfig, AutoModel,
AutoModelForCausalLM, CLIPImageProcessor,
LlamaTokenizerFast, LlavaConfig,
LlavaForConditionalGeneration, LlavaProcessor)
KEYS_TO_MODIFY_MAPPING_LLM = {
'model': 'language_model.model',
'lm_head': 'language_model.lm_head',
}
KEYS_TO_MODIFY_MAPPING_VIT = {
'vision_model': 'vision_tower.vision_model',
}
KEYS_TO_MODIFY_MAPPING_PROJECTOR = {
'model.0': 'multi_modal_projector.linear_1',
'model.2': 'multi_modal_projector.linear_2',
}
def convert_state_dict_to_hf(state_dict, mapping):
    new_state_dict = {}
    for key, value in state_dict.items():
        if key.endswith('.inv_freq'):
            continue
        for key_to_modify, new_key in mapping.items():
            if key_to_modify in key:
                key = key.replace(key_to_modify, new_key)
        new_state_dict[key] = value
    return new_state_dict
def convert_to_hf(text_model_id, vision_model_id, projector_weight, save_path):
torch.set_default_dtype(torch.float16)
text_config = AutoConfig.from_pretrained(
text_model_id, trust_remote_code=True)
vision_config = AutoConfig.from_pretrained(vision_model_id)
tokenizer = LlamaTokenizerFast.from_pretrained(text_model_id)
tokenizer.add_tokens(
AddedToken('<image>', special=True, normalized=False),
special_tokens=True)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
image_processor = CLIPImageProcessor.from_pretrained(vision_model_id)
processor = LlavaProcessor(
tokenizer=tokenizer, image_processor=image_processor)
config = LlavaConfig(
text_config=text_config,
vision_config=vision_config,
attn_implementation='eager')
with torch.device('meta'):
model = LlavaForConditionalGeneration(config)
# Pad to 64 for performance reasons
pad_shape = 64
projector_state_dict = {}
with safe_open(projector_weight, framework='pt', device='cpu') as f:
for key in f.keys():
projector_state_dict[key] = f.get_tensor(key)
ori_llm = AutoModelForCausalLM.from_pretrained(
text_model_id, trust_remote_code=True)
ori_vit = AutoModel.from_pretrained(vision_model_id)
llm_state_dict = ori_llm.state_dict()
vit_state_dict = ori_vit.state_dict()
projector_state_dict = convert_state_dict_to_hf(
projector_state_dict, KEYS_TO_MODIFY_MAPPING_PROJECTOR)
llm_state_dict = convert_state_dict_to_hf(llm_state_dict,
KEYS_TO_MODIFY_MAPPING_LLM)
vit_state_dict = convert_state_dict_to_hf(vit_state_dict,
KEYS_TO_MODIFY_MAPPING_VIT)
state_dict = {**projector_state_dict, **llm_state_dict, **vit_state_dict}
model.load_state_dict(state_dict, strict=True, assign=True)
pre_expansion_embeddings = \
model.language_model.model.embed_tokens.weight.data
mu = torch.mean(pre_expansion_embeddings, dim=0).float()
n = pre_expansion_embeddings.size()[0]
sigma = ((pre_expansion_embeddings - mu).T
@ (pre_expansion_embeddings - mu)) / n
dist = torch.distributions.multivariate_normal.MultivariateNormal(
mu, covariance_matrix=1e-5 * sigma)
# We add an image token so we resize the model
ori_vocab_size = config.text_config.vocab_size
tokenizer_vocab_size = tokenizer.encode('<pad>')[-1]
added_token = tokenizer_vocab_size - ori_vocab_size
if added_token > 0:
model.resize_token_embeddings(ori_vocab_size + added_token, pad_shape)
model.language_model.model.embed_tokens.weight.data[
ori_vocab_size:] = torch.stack(
tuple(dist.sample()
for _ in range(model.language_model.model.embed_tokens.
weight.data[ori_vocab_size:].shape[0])),
dim=0,
)
model.language_model.lm_head.weight.data[
ori_vocab_size:] = torch.stack(
tuple(dist.sample()
for _ in range(model.language_model.lm_head.weight.
data[ori_vocab_size:].shape[0])),
dim=0,
)
model.config.image_token_index = tokenizer.encode('<image>')[-1]
model.config.pad_token_id = tokenizer.encode('<pad>')[-1]
if ori_vit.__class__.__name__ == 'SiglipVisionModel':
model.config.vision_feature_select_strategy = 'full'
model.save_pretrained(save_path)
processor.save_pretrained(save_path)
print(f'Saved to {save_path}')
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--text_model_id')
parser.add_argument('--vision_model_id')
parser.add_argument('--projector_weight')
parser.add_argument('--save_path')
args = parser.parse_args()
convert_to_hf(args.text_model_id, args.vision_model_id,
args.projector_weight, args.save_path)
if __name__ == '__main__':
main()
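# Illustrative note (not part of the original script): the key renaming in
# `convert_state_dict_to_hf` is plain substring replacement, so it may help to
# see it on a toy dictionary. The sketch below re-implements the same loop
# under the hypothetical name `rename_keys` and uses string placeholders
# instead of tensors; the toy keys are assumptions chosen for illustration.

```python
def rename_keys(state_dict, mapping):
    """Rename keys by substring replacement, dropping rotary `inv_freq` buffers."""
    renamed = {}
    for key, value in state_dict.items():
        # Rotary-embedding buffers are recomputed at load time, so skip them.
        if key.endswith('.inv_freq'):
            continue
        for old, new in mapping.items():
            if old in key:
                # str.replace rewrites every occurrence of `old` in the key.
                key = key.replace(old, new)
        renamed[key] = value
    return renamed


# Same projector mapping as KEYS_TO_MODIFY_MAPPING_PROJECTOR above.
projector_mapping = {
    'model.0': 'multi_modal_projector.linear_1',
    'model.2': 'multi_modal_projector.linear_2',
}

# Toy state dict: placeholder strings stand in for weight tensors.
toy = {
    'model.0.weight': 'w0',
    'model.2.bias': 'b2',
    'rotary.inv_freq': 'skipped',
}
result = rename_keys(toy, projector_mapping)
print(result)
```

# Because the match is by substring rather than by prefix, a mapping entry
# such as 'model' would also rewrite any later occurrence of that substring
# in a key, which is worth keeping in mind when adding new mappings.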
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/convert_xtuner_weights_to_llava.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import torch
try:
from llava.model import LlavaConfig, LlavaLlamaForCausalLM
from llava.utils import disable_torch_init
except ImportError:
raise ImportError(
'Please install llava with '
'`pip install git+https://github.com/haotian-liu/LLaVA.git '
'--no-deps`.')
from safetensors import safe_open
from transformers import (AutoConfig, AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
KEYS_TO_MODIFY_MAPPING_VIT = {
'vision_model': 'model.vision_tower.vision_tower.vision_model',
}
KEYS_TO_MODIFY_MAPPING_PROJECTOR = {
'model.0': 'model.mm_projector.0',
'model.2': 'model.mm_projector.2',
}
def convert_state_dict_to_hf(state_dict, mapping):
new_state_dict = {}
for key, value in state_dict.items():
if key.endswith('.inv_freq'):
continue
for key_to_modify, new_key in mapping.items():
if key_to_modify in key:
key = key.replace(key_to_modify, new_key)
new_state_dict[key] = value
return new_state_dict
def convert_to_llava(text_model_id, vision_model_id, projector_weight,
save_path):
disable_torch_init()
torch.set_default_dtype(torch.float16)
projector_state_dict = {}
with safe_open(projector_weight, framework='pt', device='cpu') as f:
for key in f.keys():
projector_state_dict[key] = f.get_tensor(key)
ori_llm = AutoModelForCausalLM.from_pretrained(
text_model_id, trust_remote_code=True, device_map='auto')
ori_vit = CLIPVisionModel.from_pretrained(vision_model_id)
llm_state_dict = ori_llm.state_dict()
vit_state_dict = ori_vit.state_dict()
projector_state_dict = convert_state_dict_to_hf(
projector_state_dict, KEYS_TO_MODIFY_MAPPING_PROJECTOR)
vit_state_dict = convert_state_dict_to_hf(vit_state_dict,
KEYS_TO_MODIFY_MAPPING_VIT)
state_dict = {**projector_state_dict, **llm_state_dict, **vit_state_dict}
tokenizer = AutoTokenizer.from_pretrained(text_model_id)
text_config = AutoConfig.from_pretrained(
text_model_id, trust_remote_code=True)
ori_config = text_config.__dict__.copy()
ori_config.update(
dict(
image_aspect_ratio='pad',
mm_hidden_size=ori_vit.config.hidden_size,
mm_projector_type='mlp2x_gelu',
mm_use_im_patch_token=False,
mm_use_im_start_end=False,
mm_vision_select_feature='patch',
mm_vision_select_layer=-2,
mm_vision_tower=vision_model_id,
unfreeze_mm_vision_tower=True,
model_type='llava',
use_cache=True,
use_mm_proj=True))
config = LlavaConfig(**ori_config)
with torch.device('meta'):
model = LlavaLlamaForCausalLM(config)
image_processor = CLIPImageProcessor.from_pretrained(vision_model_id)
model.load_state_dict(state_dict, strict=True, assign=True)
model.save_pretrained(save_path, max_shard_size='2GB')
image_processor.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print(f'Saved to {save_path}')
def main():
parser = argparse.ArgumentParser()
parser.add_argument('--text_model_id')
parser.add_argument('--vision_model_id')
parser.add_argument('--projector_weight')
parser.add_argument('--save_path')
args = parser.parse_args()
convert_to_llava(args.text_model_id, args.vision_model_id,
args.projector_weight, args.save_path)
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/finetune/llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'microsoft/Phi-3-mini-4k-instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 1000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 1000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
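# Illustrative note (not part of the original config): the arithmetic behind
# `max_length` and the batch settings above can be sketched as follows. The
# helper names are hypothetical, and the GPU count of 8 is an assumption read
# from the `gpu8` part of the file name, not a value set in this config.

```python
def image_token_count(image_size, patch_size):
    """Number of patch tokens for a square image split into square patches."""
    side = image_size // patch_size
    return side * side


def text_token_budget(context_len, image_size, patch_size):
    """Tokens left for text once the image patch tokens are reserved."""
    return context_len - image_token_count(image_size, patch_size)


def effective_batch_size(per_device, accumulative_counts, num_gpus):
    """Samples contributing to one optimizer step across all devices."""
    return per_device * accumulative_counts * num_gpus


# Values taken from the config above; num_gpus=8 is assumed from the file name.
print(image_token_count(336, 14))        # 24 x 24 patches
print(text_token_budget(2048, 336, 14))  # mirrors int(2048 - (336 / 14)**2)
print(effective_batch_size(8, 2, 8))
```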
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/finetune/llava_phi3_mini_4k_instruct_full_clip_vit_large_p14_336_full_e2_gpu8_internvl_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import ConcatDataset, LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'microsoft/Phi-3-mini-4k-instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain/iter_9742.pth' # noqa: E501
# Data
data_root = './data/internvl_sft/'
sharegpt4v_caption_data_path = data_root + 'sharegpt4v_instruct_gpt4-vision_cap100k.jsonl' # noqa: E501
sharegpt4v_caption_image_folder = data_root + 'data'
llava_data_path = data_root + 'llava_instruct_150k_zh.jsonl'
llava_image_folder = data_root + 'data/coco'
sharegpt4v_data_path = data_root + 'sharegpt4v_mix665k_cap23k_coco-ap9k_lcs3k_sam9k_div2k.jsonl' # noqa: E501
sharegpt4v_image_folder = data_root + 'data'
dvqa_data_path = data_root + 'dvqa_train_200k.jsonl'
dvqa_image_folder = data_root + 'data/dvqa'
chartqa_data_path = data_root + 'chartqa_train_18k.jsonl'
chartqa_image_folder = data_root + 'data/chartqa'
ai2d_data_path = data_root + 'ai2d_train_12k.jsonl'
ai2d_image_folder = data_root + 'data/ai2d'
docvqa_data_path = data_root + 'docvqa_train_10k.jsonl'
docvqa_image_folder = data_root + 'data/docvqa'
geoqa_data_path = data_root + 'geoqa+.jsonl'
geoqa_image_folder = data_root + 'data/geoqa+'
synthdog_data_path = data_root + 'synthdog_en.jsonl'
synthdog_image_folder = data_root + 'data/synthdog-en'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = int(4096 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 2
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 5000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=False,
freeze_visual_encoder=False,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sharegpt4v_caption_dataset = dict(
type=LLaVADataset,
data_path=sharegpt4v_caption_data_path,
image_folder=sharegpt4v_caption_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
llava_dataset = dict(
type=LLaVADataset,
data_path=llava_data_path,
image_folder=llava_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
sharegpt4v_dataset = dict(
type=LLaVADataset,
data_path=sharegpt4v_data_path,
image_folder=sharegpt4v_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
dvqa_dataset = dict(
type=LLaVADataset,
data_path=dvqa_data_path,
image_folder=dvqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
chartqa_dataset = dict(
type=LLaVADataset,
data_path=chartqa_data_path,
image_folder=chartqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
ai2d_dataset = dict(
type=LLaVADataset,
data_path=ai2d_data_path,
image_folder=ai2d_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
docvqa_dataset = dict(
type=LLaVADataset,
data_path=docvqa_data_path,
image_folder=docvqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
geoqa_dataset = dict(
type=LLaVADataset,
data_path=geoqa_data_path,
image_folder=geoqa_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
synthdog_dataset = dict(
type=LLaVADataset,
data_path=synthdog_data_path,
image_folder=synthdog_image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataset = dict(
type=ConcatDataset,
datasets=[
sharegpt4v_caption_dataset, llava_dataset, sharegpt4v_dataset,
dvqa_dataset, chartqa_dataset, ai2d_dataset, docvqa_dataset,
geoqa_dataset, synthdog_dataset
])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=train_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
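# Illustrative note (not part of the original config): the `param_scheduler`
# above chains a linear warmup with cosine annealing. The sketch below is a
# closed-form approximation of that curve as a function of training progress
# in [0, 1]; `lr_at` is a hypothetical helper, and mmengine applies its
# schedulers per iteration, so exact values may differ slightly from this.

```python
import math


def lr_at(progress, base_lr=2e-5, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at a given fraction of total training."""
    if progress < warmup_ratio:
        # Linear warmup: the factor grows from start_factor to 1.0.
        factor = start_factor + (1.0 - start_factor) * progress / warmup_ratio
        return base_lr * factor
    # Cosine annealing from base_lr down to eta_min over the remaining span.
    t = (progress - warmup_ratio) / (1.0 - warmup_ratio)
    return eta_min + 0.5 * (base_lr - eta_min) * (1.0 + math.cos(math.pi * t))


for p in (0.0, 0.015, 0.03, 0.5, 1.0):
    print(f'progress={p:.3f} lr={lr_at(p):.3e}')
```

# Because both scheduler boundaries are written as `warmup_ratio * max_epochs`,
# the warmup always covers the first 3% of training regardless of `max_epochs`.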
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/pretrain/llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'microsoft/Phi-3-mini-4k-instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
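# Illustrative note (not part of the original config): with `by_epoch=False`
# checkpoints are named by iteration, so the final checkpoint name depends on
# the dataset size, the per-device batch size and the number of GPUs. The
# sketch below estimates iterations per epoch; `iterations_per_epoch` is a
# hypothetical helper, and it assumes the sampler rounds each rank's share up
# and that the last partial batch is kept. The sample count of 558,128 and the
# 8 GPUs are assumptions (the data file name only says 558k, and the GPU count
# is read from `gpu8` in the config name).

```python
import math


def iterations_per_epoch(num_samples, per_device_batch, num_gpus):
    """Estimate dataloader iterations per epoch on each rank."""
    # Each rank receives an equal share of the samples, rounded up.
    per_rank = math.ceil(num_samples / num_gpus)
    # A final partial batch still counts as one iteration.
    return math.ceil(per_rank / per_device_batch)


# batch_size = 32 comes from the config above; the other values are assumed.
print(iterations_per_epoch(558_128, 32, 8))
```

# Under these assumptions the estimate is consistent with the `iter_2181.pth`
# checkpoint that the matching finetune config loads as `pretrained_pth`.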
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/phi3_mini_4k_instruct_clip_vit_large_p14_336/pretrain/llava_phi3_mini_4k_instruct_clip_vit_large_p14_336_e1_gpu8_sharegpt4v_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'microsoft/Phi-3-mini-4k-instruct'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/sharegpt4v/'
data_path = data_root + 'share-captioner_coco_lcs_sam_1246k_1107.json'
image_folder = data_root + 'data'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = int(4096 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 2
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 1000
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 1000
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/vicuna_13b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_13b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-13b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
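# Reserve room for the image: CLIP ViT-L/14 at 336 px yields
# (336 / 14)**2 = 576 patch tokens, so 2048 - 576 = 1472 text tokens remain.
max_length = int(2048 - (336 / 14)**2)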
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/vicuna_13b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_13b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-13b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.vicuna
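# Reserve room for the image: CLIP ViT-L/14 at 336 px yields
# (336 / 14)**2 = 576 patch tokens, so 2048 - 576 = 1472 text tokens remain.
max_length = int(2048 - (336 / 14)**2)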
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/vicuna_7b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
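# Reserve room for the image: CLIP ViT-L/14 at 336 px yields
# (336 / 14)**2 = 576 patch tokens, so 2048 - 576 = 1472 text tokens remain.
max_length = int(2048 - (336 / 14)**2)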
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/vicuna_7b_v15_clip_vit_large_p14_336/finetune/llava_vicuna_7b_v15_qlora_clip_vit_large_p14_336_lora_e1_gpu8_finetune_refcoco.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import ConcatDataset, LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.refcoco_json import (InvRefCOCOJsonDataset,
RefCOCOJsonDataset)
from xtuner.dataset.samplers import LengthGroupedSampler
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Specify the pretrained pth
pretrained_pth = './work_dirs/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain/iter_2181.pth' # noqa: E501
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Instruct-150K/llava_v1_5_mix665k.json'
refcoco_path = data_root + 'RefCOCOJson/train.json'
image_folder = data_root + 'llava_images'
prompt_template = PROMPT_TEMPLATE.vicuna
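# Reserve room for the image: CLIP ViT-L/14 at 336 px yields
# (336 / 14)**2 = 576 patch tokens, so 2048 - 576 = 1472 text tokens remain.
max_length = int(2048 - (336 / 14)**2)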
# Scheduler & Optimizer
batch_size = 16 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
pretrained_pth=pretrained_pth,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
llm_lora=dict(
type=LoraConfig,
r=512,
lora_alpha=256,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path),
visual_encoder_lora=dict(
type=LoraConfig, r=64, lora_alpha=16, lora_dropout=0.05, bias='none'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
# The refcoco and inv_refcoco datasets have more than 300k items, so we
# cap their length (max_dataset_length) to balance them against the llava
# dataset.
refcoco_dataset = dict(
type=RefCOCOJsonDataset,
data_path=refcoco_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True,
max_dataset_length=70000,
)
inv_refcoco_dataset = dict(
type=InvRefCOCOJsonDataset,
data_path=refcoco_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True,
max_dataset_length=70000,
)
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=True,
)
train_dataset = dict(
type=ConcatDataset,
datasets=[refcoco_dataset, inv_refcoco_dataset, llava_dataset],
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=train_dataset,
sampler=dict(
type=LengthGroupedSampler,
length_property='modality_length',
per_device_batch_size=batch_size * accumulative_counts),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/llava/vicuna_7b_v15_clip_vit_large_p14_336/pretrain/llava_vicuna_7b_v15_clip_vit_large_p14_336_e1_gpu8_pretrain.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel)
from xtuner.dataset import LLaVADataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.engine.hooks import DatasetInfoHook, EvaluateChatHook
from xtuner.engine.runner import TrainLoop
from xtuner.model import LLaVAModel
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
llm_name_or_path = 'lmsys/vicuna-7b-v1.5'
visual_encoder_name_or_path = 'openai/clip-vit-large-patch14-336'
# Data
data_root = './data/llava_data/'
data_path = data_root + 'LLaVA-Pretrain/blip_laion_cc_sbu_558k.json'
image_folder = data_root + 'LLaVA-Pretrain/images'
prompt_template = PROMPT_TEMPLATE.vicuna
# Reserve room for the (336 / 14)**2 = 576 image patch tokens: 2048 - 576 = 1472
max_length = int(2048 - (336 / 14)**2)
# Scheduler & Optimizer
batch_size = 32 # per_device
accumulative_counts = 1
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 1e-3
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_images = 'https://llava-vl.github.io/static/images/view.jpg'
evaluation_inputs = ['请描述一下这张照片', 'Please describe this picture']
#######################################################################
# PART 2 Model & Tokenizer & Image Processor #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
padding_side='right')
image_processor = dict(
type=CLIPImageProcessor.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path,
trust_remote_code=True)
model = dict(
type=LLaVAModel,
freeze_llm=True,
freeze_visual_encoder=True,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=llm_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
visual_encoder=dict(
type=CLIPVisionModel.from_pretrained,
pretrained_model_name_or_path=visual_encoder_name_or_path))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
llava_dataset = dict(
type=LLaVADataset,
data_path=data_path,
image_folder=image_folder,
tokenizer=tokenizer,
image_processor=image_processor,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_length=max_length,
pad_image_to_square=False)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
pin_memory=True,
dataset=llava_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
image_processor=image_processor,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
evaluation_images=evaluation_images,
system=SYSTEM,
prompt_template=prompt_template)
]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/mistral/mistral_7b_full_finetune_custom_dataset_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
"""Data format:
[
{
"conversation": [
{
"system": "",
"input": "xxx",
"output": "xxx"
},
{
"input": "xxx",
"output": "xxx"
}
]
},
...
]
Please refer to https://github.com/InternLM/xtuner/blob/main/docs/en/user_guides/dataset_format.md for details.
""" # noqa: E501
import torch
from datasets import load_dataset
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mistral-7B-v0.1'
use_varlen_attn = True
# Data
data_files = ['/path/to/json/file.json']
prompt_template = PROMPT_TEMPLATE.mistral
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1 # 1bs * 1acc * 64gpu = 64 batchsize
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
use_varlen_attn=use_varlen_attn,
dataset=dict(type=load_dataset, path='json', data_files=data_files),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
)
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/mistral/mistral_7b_qlora_skypile_pretrain_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import BitsAndBytesConfig, LlamaTokenizer, MistralForCausalLM
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import pretrain_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mistral-7B-v0.1'
use_varlen_attn = False
# Data
data_path = 'Skywork/SkyPile-150B'
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
evaluation_inputs = ['上海的景点有']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=LlamaTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=MistralForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.05,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=pretrain_map_fn,
template_map_fn=None,
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
max_new_tokens=100)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/mistral/mistral_7b_w_tokenized_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.intern_repo import (build_packed_dataset,
load_intern_repo_tokenized_dataset)
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mistral-7B-v0.1'
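# The InternLM2 chat template has replaced Mistral's original template, and
# the special tokens of the InternLM2 chat template have already been added
# to the new tokenizer.
# Please refer to docs/zh_cn/user_guides/finetune_custom_dataset.md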
tokenizer_path = '/new/tokenizer/path'
use_varlen_attn = True
# Data
dataset_folder = '/path/to/sft/data/folder'
# The InternLM2 chat template has replaced Mistral's original template
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=tokenizer_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
tokenizer=tokenizer,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_packed_dataset,
dataset_cfg=dict(
type=load_intern_repo_tokenized_dataset,
data_order_path=None,
folder=dataset_folder,
min_length=0,
file_type='.bin'),
packed_length=max_length,
seed=1024)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/mistral/mistral_7b_w_untokenized_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR
from torch.optim import AdamW
from torch.utils.data import BatchSampler
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.intern_repo import (build_packed_dataset,
load_intern_repo_untokenized_dataset)
from xtuner.dataset.map_fns import template_map_fn_factory
from xtuner.dataset.samplers import InternRepoSampler
from xtuner.engine import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '/mnt/petrelfs/share_data/basemodel/checkpoints/llm/hf_hub/models--mistralai--Mistral-7B-v0.1/snapshots/5e9c98b96d071dce59368012254c55b0ec6f8658' # noqa: E501
use_varlen_attn = True
# Data
dataset_folder = '/mnt/petrelfs/share_data/caoweihan/v1_sample_with_legal_cate'
prompt_template = PROMPT_TEMPLATE.mistral
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
# batch size per device, set to 1 if `use_varlen_attn` = True
# To clarify, enlarging the batch size essentially enlarges the `max_length`.
# For example, doubling the max length is tantamount to doubling the batch size
batch_size = 1
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 4e-5
betas = (0.9, 0.95)
weight_decay = 0.01
max_norm = 1 # grad clip
warm_up_ratio = 0.025
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.bfloat16,
attn_implementation='flash_attention_2'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_packed_dataset,
dataset_cfg=dict(
type=load_intern_repo_untokenized_dataset,
folder=dataset_folder,
tokenizer=tokenizer,
max_length=max_length,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
file_type='.json'),
packed_length=max_length,
seed=1024)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=InternRepoSampler, shuffle=True, seed=1024),
batch_sampler=dict(
type=BatchSampler, drop_last=True, batch_size=batch_size),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type='LinearLR',
start_factor=1 / 40,
by_epoch=True,
begin=0,
end=warm_up_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=lr * 0.15,
by_epoch=True,
begin=warm_up_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
custom_hooks = [
dict(
type=DatasetInfoHook, tokenizer=tokenizer,
is_intern_repo_dataset=True),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration (interval=1).
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
log_processor = dict(
by_epoch=False,
window_size=1,
mean_pattern=r'.*(loss|time|data_time|grad_norm|tflops).*')
================================================
FILE: xtuner-eval_niah/xtuner/configs/mixtral/README.md
================================================
# Mixtral 8x7B
## Install
```bash
# Install the latest xtuner
pip install -U 'xtuner[deepspeed]'
# Mixtral requires flash-attn
pip install flash-attn
# install the latest transformers
pip install -U transformers
```
## QLoRA Fine-tune
QLoRA fine-tuning needs only a single A100-80G GPU.
```bash
xtuner train mixtral_8x7b_instruct_qlora_oasst1_e3 --deepspeed deepspeed_zero2
```
## Full Parameter Fine-tune
Full-parameter fine-tuning needs 16 A100-80G GPUs.
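
To see what 16 GPUs imply for the global batch, here is a minimal sketch based on the defaults in the full-parameter configs in this directory (`batch_size = 1`, `accumulative_counts = 16`, `max_length = 2048`, `pack_to_max_length = True`). The helper `global_batch` is hypothetical and not part of xtuner. It assumes that ranks in the same sequence-parallel group share one sample, so the data-parallel size is `world_size / sequence_parallel_size`.

```python
def global_batch(world_size, batch_size=1, accumulative_counts=16,
                 sequence_parallel_size=1, max_length=2048):
    """Return (sequences, tokens) consumed per optimizer step.

    Hypothetical helper for illustration only; not an xtuner API.
    """
    if world_size % sequence_parallel_size != 0:
        raise ValueError('world_size must be divisible by '
                         'sequence_parallel_size')
    # The configs apply `accumulative_counts *= sequence_parallel_size`.
    accum = accumulative_counts * sequence_parallel_size
    # Assumption: each sequence-parallel group processes one shared sample.
    data_parallel_size = world_size // sequence_parallel_size
    sequences = batch_size * accum * data_parallel_size
    # With packing enabled, every sequence holds `max_length` tokens.
    return sequences, sequences * max_length


print(global_batch(world_size=16))
print(global_batch(world_size=16, sequence_parallel_size=2))
```

Under these assumptions both calls give the same result, which is presumably why the configs scale `accumulative_counts` by `sequence_parallel_size`: the global batch stays constant when sequence parallelism is enabled.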
### slurm
Note: `$PARTITION` is the name of the Slurm partition to submit the job to.
```bash
srun -p $PARTITION --job-name=mixtral --nodes=2 --gres=gpu:8 --ntasks-per-node=8 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero3 --launcher slurm
```
### torchrun
Note: `$NODE_0_ADDR` is the IP address of node 0.
```bash
# execute on node 0
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=0 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero3
# execute on node 1
NPROC_PER_NODE=8 NNODES=2 PORT=29600 ADDR=$NODE_0_ADDR NODE_RANK=1 xtuner train mixtral_8x7b_instruct_full_oasst1_e3 --deepspeed deepspeed_zero3
```
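
The two commands above differ only in `NODE_RANK`. For a larger number of nodes, the per-node command lines could be generated rather than typed by hand. The sketch below is one way to do that; `node_commands` is a hypothetical helper (not part of xtuner) and only reuses the environment variables and flags shown above.

```python
import shlex


def node_commands(config, node0_addr, nnodes=2, nproc_per_node=8,
                  port=29600, deepspeed='deepspeed_zero3'):
    """Build one launch command per node (hypothetical helper)."""
    commands = []
    for rank in range(nnodes):
        # Same environment variables as in the commands above;
        # only NODE_RANK changes from node to node.
        env = (f'NPROC_PER_NODE={nproc_per_node} NNODES={nnodes} '
               f'PORT={port} ADDR={shlex.quote(node0_addr)} '
               f'NODE_RANK={rank}')
        commands.append(
            f'{env} xtuner train {config} --deepspeed {deepspeed}')
    return commands


# '10.0.0.1' is a placeholder address for node 0.
for cmd in node_commands('mixtral_8x7b_instruct_full_oasst1_e3', '10.0.0.1'):
    print(cmd)
```

Each printed line is meant to be run on the node whose rank it names.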
### Speed
16 * A100 80G:
| Model | Sequence Length | Use Varlen Attn | Sequence Parallel World Size | Tokens per Second |
| :----------: | :-------------: | :-------------: | :--------------------------: | :---------------: |
| mixtral_8x7b | 32k | False | 1 | 853.7 |
| mixtral_8x7b | 32k | True | 1 | 910.1 |
| mixtral_8x7b | 32k | False | 2 | 635.2 |
| mixtral_8x7b | 32k | True | 2 | 650.9 |
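
The relative effect of variable-length attention can be read off the table with a few lines of arithmetic. The sketch below copies the four throughput figures from the table; the variable and function names are illustrative only.

```python
# (use_varlen_attn, sequence_parallel_world_size, tokens_per_second),
# copied from the table above.
rows = [
    (False, 1, 853.7),
    (True, 1, 910.1),
    (False, 2, 635.2),
    (True, 2, 650.9),
]
tps = {(varlen, sp): value for varlen, sp, value in rows}


def varlen_gain(sp):
    """Relative throughput gain from varlen attention at a given SP size."""
    return tps[(True, sp)] / tps[(False, sp)] - 1


for sp in (1, 2):
    print(f'sequence parallel {sp}: varlen attn gain = {varlen_gain(sp):.1%}')
```

On these numbers, variable-length attention gives roughly 6.6% more throughput at sequence parallel size 1 and roughly 2.5% at size 2. At 32k sequence length, sequence parallel size 2 is slower than size 1 in both settings.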
================================================
FILE: xtuner-eval_niah/xtuner/configs/mixtral/mixtral_8x7b/mixtral_8x7b_full_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mixtral-8x7B-v0.1'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/mixtral/mixtral_8x7b/mixtral_8x7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mixtral-8x7B-v0.1'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=[
'q_proj', 'k_proj', 'v_proj', 'o_proj', 'w1', 'w2', 'w3'
],
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/mixtral/mixtral_8x7b_instruct/mixtral_8x7b_instruct_full_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/mixtral/mixtral_8x7b_instruct/mixtral_8x7b_instruct_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'mistralai/Mixtral-8x7B-Instruct-v0.1'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.mixtral
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
target_modules=[
'q_proj', 'k_proj', 'v_proj', 'o_proj', 'w1', 'w2', 'w3'
],
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/orpo/internlm/internlm2_chat_1_8b_orpo_full.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.orpo import ORPO
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = False
loss_beta = 0.1
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=ORPO,
beta=loss_beta,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/orpo/internlm/internlm2_chat_1_8b_orpo_full_varlenattn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.orpo import ORPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
loss_beta = 0.1
# parallel
sequence_parallel_size = 1
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
max_packed_length = max_length * 2
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
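# Illustrative arithmetic (assumption: a hypothetical 8-GPU run, which this
# config does not specify). Sequence parallelism splits each sequence across
# `sequence_parallel_size` GPUs, so the number of data-parallel groups is
# 8 // sequence_parallel_size. Scaling `accumulative_counts` by the same
# factor keeps the number of sequences per optimizer step unchanged:
#   sequence_parallel_size = 1: 1 (batch_size) * 16 * 8 groups = 128
#   sequence_parallel_size = 2: 1 (batch_size) * 32 * 4 groups = 128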
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=ORPO,
use_varlen_attn=use_varlen_attn,
beta=loss_beta,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(type=load_dataset, path='mlabonne/orpo-dpo-mix-40k'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
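# Worked example of the schedule above (assumption: a hypothetical 1000
# iterations per epoch; the real count depends on dataset size and packing).
#   warmup end = warmup_ratio * max_epochs = 0.03 * 3 = 0.09 epochs
#   with convert_to_iter_based=True this is roughly 0.09 * 1000 = 90 iters
# LinearLR ramps the lr from start_factor * lr = 1e-5 * 5e-6 = 5e-11 up to
# 5e-6 over those iterations; CosineAnnealingLR then decays it to
# eta_min = 0.0 by the end of epoch 3.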
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/orpo/internlm/internlm2_chat_1_8b_orpo_full_varlenattn_jsonl_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
load_jsonl_dataset)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.orpo import ORPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
loss_beta = 0.1
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 5e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=ORPO,
use_varlen_attn=use_varlen_attn,
beta=loss_beta,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_jsonl_dataset,
data_files=[
'/your/jsonl/path/here.jsonl',
'/your/another/jsonl/path/here.jsonl'
]),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
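# With `dataset_map_fn=None`, each JSONL line is presumably expected to be in
# the preference format already. A sketch of one record, assuming
# OpenAI-style `prompt` / `chosen` / `rejected` message lists (an assumption;
# verify against the XTuner preference-dataset docs before relying on it).
# It is wrapped here for readability; in the file it must sit on one line:
# {"prompt": [{"role": "user", "content": "What is 2 + 2?"}],
#  "chosen": [{"role": "assistant", "content": "4"}],
#  "rejected": [{"role": "assistant", "content": "5"}]}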
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/orpo/internlm/internlm2_chat_7b_orpo_qlora_varlenattn_ultrafeedback_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.orpo import ORPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
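# NOTE: this file is named `internlm2_chat_7b_...`, but the path below points
# to the 1.8B SFT checkpoint; change it if you intend to train a 7B model.
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'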
use_varlen_attn = True
loss_beta = 0.1
# Data
prompt_template = PROMPT_TEMPLATE.internlm2_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5 # refer to orpo repo
optim_type = AdamW
lr = 5e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.01
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=ORPO,
use_varlen_attn=use_varlen_attn,
beta=loss_beta,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
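# Note on the LoRA settings above: in PEFT the low-rank update is scaled by
# lora_alpha / r (when rank-stabilized LoRA is not enabled), so with r=64 and
# lora_alpha=16 the scale is 16 / 64 = 0.25.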
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/orpo/llama/llama3_8b_instruct_orpo_qlora_varlenattn_ultrafeedback_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import (EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model.orpo import ORPO
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
use_varlen_attn = True
loss_beta = 0.1
# Data
prompt_template = PROMPT_TEMPLATE.llama3_chat
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5 # refer to orpo repo
optim_type = AdamW
lr = 5e-6
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.01
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'What famous British author, known for his tales of mystery and the macabre, shares his initials with a common abbreviation for "rest in peace"?', # noqa: E501
'Please tell me five scenic spots in Shanghai',
'890729 - 425663? Only respond with math and no words.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=ORPO,
use_varlen_attn=use_varlen_attn,
beta=loss_beta,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=True,
is_reward=False,
reward_token_id=-1,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
# dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/phi/phi3/phi3_mini_128k_instruct_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'microsoft/Phi-3-mini-128k-instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = 128 * 1024
pack_to_max_length = True
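# 128 * 1024 = 131072 tokens. With `pack_to_max_length=True`, short Alpaca
# samples are concatenated until a packed sequence reaches `max_length`, so
# each packed training sample can be up to 131072 tokens long.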
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/phi/phi3/phi3_mini_128k_instruct_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'microsoft/Phi-3-mini-128k-instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = 128 * 1024
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/phi/phi3/phi3_mini_4k_instruct_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'microsoft/Phi-3-mini-4k-instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = 4096
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/phi/phi3/phi3_mini_4k_instruct_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'microsoft/Phi-3-mini-4k-instruct'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.phi3_chat
max_length = 4096
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b/qwen_1_8b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b/qwen_1_8b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b/qwen_1_8b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b/qwen_1_8b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b/qwen_1_8b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b_chat/qwen_1_8b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b_chat/qwen_1_8b_chat_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b_chat/qwen_1_8b_chat_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b_chat/qwen_1_8b_chat_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_1_8b_chat/qwen_1_8b_chat_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-1_8B-Chat'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_72b/qwen_72b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_72b/qwen_72b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
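# Illustrative note (not part of the upstream config): with the values above,
# each data-parallel rank accumulates batch_size * accumulative_counts
# = 1 * 16 = 16 samples per optimizer step. Because pack_to_max_length is
# True, each sample is a packed sequence of up to max_length = 2048 tokens.
# The number of GPUs is not set in this file, so the global batch size
# depends on the launch command. Scaling accumulative_counts by
# sequence_parallel_size appears intended to keep the global batch size
# unchanged when one sequence is split across several GPUs; that reading is
# an assumption based on how the value is used here.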
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
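# Worked example (illustrative, using the values from PART 1):
#   warmup length = warmup_ratio * max_epochs = 0.03 * 3 = 0.09 epochs
#   LinearLR ramps the rate from lr * start_factor = 2e-4 * 1e-5 = 2e-9
#   up to lr = 2e-4 over epochs [0, 0.09); CosineAnnealingLR then decays it
#   toward eta_min = 0 over epochs [0.09, 3].
# With convert_to_iter_based=True these epoch boundaries are converted to
# iteration counts, which depend on the dataloader length at run time.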
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_72b/qwen_72b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
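# Illustrative note (not part of the upstream config): under PEFT's default
# LoRA scaling, the adapter update is multiplied by lora_alpha / r
# = 16 / 64 = 0.25. The base weights are loaded in 4-bit NF4 with double
# quantization and float16 compute, as set in quantization_config above, so
# only the LoRA adapter weights are expected to receive gradient updates.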
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_72b/qwen_72b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_72b/qwen_72b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-72B'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
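# Illustrative sketch, not part of the original config: where the two phases
# above are expected to meet once the epoch-based bounds are converted to
# iterations. It assumes the conversion truncates `bound * iters_per_epoch`
# to an integer; `_schedule_boundaries` and `iters_per_epoch` are hypothetical
# names, and the real epoch length depends on the packed dataset and the
# number of GPUs.
def _schedule_boundaries(warmup_ratio, max_epochs, iters_per_epoch):
    # LinearLR warms up over [0, warmup_end_iter); CosineAnnealingLR then
    # decays towards eta_min over [warmup_end_iter, total_iters).
    warmup_end_iter = int(warmup_ratio * max_epochs * iters_per_epoch)
    total_iters = int(max_epochs * iters_per_epoch)
    return warmup_end_iter, total_iters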
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_moss_sft_all_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
bot_name = 'Qwen'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
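# Illustrative sketch, not part of the upstream config: a closed-form
# approximation of the two-phase schedule above (linear warmup, then cosine
# decay to eta_min). The function name and the per-step formulation are
# assumptions; mmengine's schedulers may differ by an off-by-one step.
import math


def approx_lr(step, total_steps, base_lr=2e-4, warmup_ratio=0.03,
              start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at a 0-based optimizer step."""
    warmup_steps = int(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp of the multiplier from start_factor up to 1.
        progress = step / warmup_steps
        return base_lr * (start_factor + (1.0 - start_factor) * progress)
    # Cosine decay over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return eta_min + (base_lr - eta_min) * cosine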
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eoc>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_moss_sft_all_e2_gpu8.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
bot_name = 'Qwen'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_no_plugins_path = './data/moss-003-sft-no-tools.jsonl'
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 8 # per_device
accumulative_counts = 1
dataloader_num_workers = 2
max_epochs = 2
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
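# Illustrative sketch, not part of the upstream config: how the settings
# above combine into the number of samples consumed per optimizer step.
# The helper name and the GPU count of 8 (suggested only by the "gpu8"
# suffix of this file's name) are assumptions made for this example.
def samples_per_optimizer_step(per_device_batch, accum_steps, num_gpus):
    """Samples contributing to one parameter update under data parallelism."""
    return per_device_batch * accum_steps * num_gpus


# With batch_size = 8, accumulative_counts = 1 and an assumed 8 GPUs:
# samples_per_optimizer_step(8, 1, 8) == 64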
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
moss_sft_no_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_no_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
moss_sft_plugins = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
train_dataset = dict(
type=ConcatDataset, datasets=[moss_sft_no_plugins, moss_sft_plugins])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eoc>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_moss_sft_plugins_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import MOSSSFTDataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
bot_name = 'Qwen'
use_varlen_attn = False
# Data
# Download data from https://huggingface.co/datasets/fnlp/moss-003-sft-data
moss_sft_plugins_path = './data/conversations_with_tools_with_inner_instruction_no_text2image_train_all_random_meta0.5_0.1_0.01_moss_0709.jsonl' # noqa: E501
max_length = 2048
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
SYSTEM = SYSTEM_TEMPLATE.moss_sft
prompt_template = PROMPT_TEMPLATE.moss_sft
evaluation_freq = 500
evaluation_inputs = [
'一个球体的表面积是384平方厘米,求它的体积。', '今有鸡兔同笼,上有二十头,下有六十二足, 问鸡兔各几何?', '介绍一下比尔盖茨'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
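# Illustrative sketch, not part of the upstream config: what r=64 and
# lora_alpha=16 imply for a single adapted linear layer. It assumes the
# standard LoRA formulation (update scaled by lora_alpha / r, as PEFT does
# when rank-stabilized scaling is off); the helper names and the 4096x4096
# layer shape are hypothetical.
def lora_scaling(r, lora_alpha):
    """Multiplier applied to the low-rank update B @ A."""
    return lora_alpha / r


def lora_extra_params(in_features, out_features, r):
    """Trainable parameters added by the A (r x in) and B (out x r) matrices."""
    return r * (in_features + out_features)


# lora_scaling(64, 16) == 0.25
# lora_extra_params(4096, 4096, 64) == 524288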
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=MOSSSFTDataset,
data_file=moss_sft_plugins_path,
bot_name=bot_name,
tokenizer=tokenizer,
max_length=max_length)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
stop_words=['<eoc>'],
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
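# Illustrative sketch, not part of the upstream config: one simple way to
# "pack" short tokenized samples into fixed-length training sequences, which
# is the idea behind pack_to_max_length = True. The function is hypothetical
# and simplified; xtuner's actual packing logic may differ in its details
# (for example in how labels and the final partial chunk are handled).
def pack_token_ids(samples, max_length):
    """Concatenate token-id lists and cut them into max_length chunks."""
    flat = [tok for sample in samples for tok in sample]
    return [flat[i:i + max_length] for i in range(0, len(flat), max_length)]


# pack_token_ids([[1, 2, 3], [4, 5], [6, 7, 8, 9]], max_length=4)
# == [[1, 2, 3, 4], [5, 6, 7, 8], [9]]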
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b/qwen_7b_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|endoftext|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_alpaca_enzh_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
oasst1_map_fn, template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
oasst1_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
oasst1 = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=oasst1_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[alpaca_en, alpaca_zh, oasst1])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_alpaca_zh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_zh_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_zh,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_arxiv_gentitle_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import arxiv_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
# 1. Download data from https://kaggle.com/datasets/Cornell-University/arxiv
# 2. Process data by `xtuner preprocess arxiv ${DOWNLOADED_DATA} ./data/arxiv_data.json [optional arguments]` # noqa: E501
data_path = './data/arxiv_data.json'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.arxiv_gentile
evaluation_inputs = [
('We present InternLM, a multilingual foundational language '
'model with 104B parameters. InternLM is pre-trained on a large '
'corpora with 1.6T tokens with a multi-phase progressive '
'process, and then fine-tuned to align with human preferences. '
'We also developed a training system called Uniscale-LLM for '
'efficient large language model training. The evaluation on a '
'number of benchmarks shows that InternLM achieves '
'state-of-the-art performance in multiple aspects, including '
'knowledge understanding, reading comprehension, mathematics, '
'and coding. With such well-rounded capabilities, InternLM '
'achieves outstanding performances on comprehensive exams, '
'including MMLU, AGIEval, C-Eval and GAOKAO-Bench, without '
'resorting to external tools. On these benchmarks, InternLM '
'not only significantly outperforms open-source models, but '
'also obtains superior performance compared to ChatGPT. Also, '
'InternLM demonstrates excellent capability of understanding '
'Chinese language and Chinese culture, which makes it a '
'suitable foundation model to support Chinese-oriented language '
'applications. This manuscript gives a detailed study of '
'our results, with benchmarks and examples across a diverse '
'set of knowledge domains and tasks.'),
('In this work, we develop and release Llama 2, a collection of '
'pretrained and fine-tuned large language models (LLMs) ranging '
'in scale from 7 billion to 70 billion parameters.\nOur '
'fine-tuned LLMs, called LLAMA 2-CHAT, are optimized for '
'dialogue use cases. Our models outperform open-source chat '
'models on most benchmarks we tested, and based on our human '
'evaluations for helpfulness and safety, may be a suitable '
'substitute for closedsource models. We provide a detailed '
'description of our approach to fine-tuning and safety '
'improvements of LLAMA 2-CHAT in order to enable the community '
'to build on our work and contribute to the responsible '
'development of LLMs.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset, path='json', data_files=dict(train=data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=arxiv_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_code_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import code_alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'HuggingFaceH4/CodeAlpaca_20K'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 100
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=code_alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_colorist_e5.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import colors_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'burkelibbey/colors'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 5
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = SYSTEM_TEMPLATE.colorist
evaluation_inputs = [
'请给我一个像天空一样清澈透明的蓝色。', 'Please give me a clear blue like the sky.'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=colors_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_lawyer_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (crime_kg_assitant_map_fn,
law_reference_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
# download data from https://github.com/LiuHC0428/LAW-GPT
crime_kg_assitant_path = './data/CrimeKgAssitant清洗后_52k.json'
law_reference_data_path = './data/训练数据_带法律依据_92k.json'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.lawyer
evaluation_inputs = ['请问离婚需要准备什么材料?', '销售鳄鱼皮包违法吗?']
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
crime_kg_assitant = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=crime_kg_assitant_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=crime_kg_assitant_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
law_reference_data = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path='json',
data_files=dict(train=law_reference_data_path)),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=law_reference_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(
type=ConcatDataset, datasets=[crime_kg_assitant, law_reference_data])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_medical_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import medical_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'shibing624/medical'
data_config_name = 'finetune'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.medical
evaluation_inputs = [
'我有家族遗传性的过敏,请问可以献血吗?', '我爷爷有高血压,请问他可以喝咖啡吗?',
'我女儿今年3岁了,从昨天晚上九点开始腹泻,到现在已经八个小时了,请问应该怎么办?'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path, name=data_config_name),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=medical_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_oasst1_512_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 512
pack_to_max_length = False
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_oasst1_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import oasst1_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'timdettmers/openassistant-guanaco'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=oasst1_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Default to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_open_platypus_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'garage-bAInd/Open-Platypus'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Default to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_openorca_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import openorca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'Open-Orca/OpenOrca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 5000
SYSTEM = ''
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openorca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Default to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_sql_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import sql_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'b-mc2/sql-create-context'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.sql
evaluation_inputs = [
('CREATE TABLE station (name VARCHAR, lat VARCHAR, city VARCHAR)\n'
'Find the name, latitude, and city of stations with latitude '
'above 50.'),
('CREATE TABLE weather (zip_code VARCHAR, mean_visibility_miles '
'INTEGER)\n找到mean_visibility_miles最大的zip_code。')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=sql_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save a checkpoint every `save_steps` iterations.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Default to a random seed with `deterministic` disabled
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1/qwen_7b_chat/qwen_7b_chat_qlora_tiny_codes_e1.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import template_map_fn_factory, tiny_codes_map_fn
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen-7B-Chat'
use_varlen_attn = False
# Data
data_path = 'nampdn-ai/tiny-codes'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.coder
evaluation_inputs = [
('写一个Python函数,将十六进制颜色代码(如#0066ee)转换为对应的'
'红、绿、蓝(RGB)三个颜色分量值,并以元组的形式返回。'),
('Write a Python function that takes a hexadecimal color code '
'(e.g., #0066ee) as input and converts it into the corresponding '
'red, green, and blue (RGB) color component values.')
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right',
eos_token='<|im_end|>')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=data_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=tiny_codes_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_0_5b/qwen1_5_0_5b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-0.5B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_0_5b/qwen1_5_0_5b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-0.5B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_0_5b_chat/qwen1_5_0_5b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-0.5B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_0_5b_chat/qwen1_5_0_5b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-0.5B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
    # set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_110b/qwen1_5_110b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-110B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
# total batch = 32 GPUs * batch_size_per_device 1 * accumulative_counts 1 = 32
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_110b/qwen1_5_110b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-110B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # total bs = 1 bs_per_device * 8 gpus * 1 acc = 8
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-4 # the 110B model uses a smaller lr
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_quant_storage=torch.float16)),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_110b_chat/README.md
================================================
# Qwen 110B
## Install
```bash
# Install the latest xtuner
pip install -U 'xtuner[deepspeed]'
# We recommend installing flash_attn
# pip install flash-attn
# install the latest transformers
pip install -U transformers
```
## QLoRA Fine-tune
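Qwen 110B, a model with 32k context capability, can be fine-tuned with QLoRA on only 2 × A100 80G. The config used below packs samples to 16k tokens (`max_length = 16384`) and sets `sequence_parallel_size = 2`.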
```bash
xtuner train xtuner/configs/qwen/qwen1_5/qwen1_5_110b_chat/qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py --deepspeed deepspeed_zero3
```
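The settings in the 16k config can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only: `sequence_parallel_layout` is a hypothetical helper, not part of xtuner, and it assumes that each packed sequence is split evenly across the GPUs of one sequence-parallel group. It mirrors the config's `accumulative_counts *= sequence_parallel_size` line and its comment `total bs = 1 bs_per_device * 2 gpus * 1 acc = 2`.

```python
def sequence_parallel_layout(world_size, sequence_parallel_size, max_length,
                             batch_size=1, accumulative_counts=1):
    """Hypothetical helper: derive the per-GPU workload from config values.

    Assumes an even split of every sequence across the sequence-parallel
    group; this is an illustration, not xtuner's implementation.
    """
    if world_size % sequence_parallel_size != 0:
        raise ValueError('world_size must be divisible by '
                         'sequence_parallel_size')
    if max_length % sequence_parallel_size != 0:
        raise ValueError('max_length must be divisible by '
                         'sequence_parallel_size')

    # GPUs in the same sequence-parallel group share one sample, so only
    # world_size / sequence_parallel_size groups see distinct data.
    data_parallel_size = world_size // sequence_parallel_size

    # Same scaling as `accumulative_counts *= sequence_parallel_size`.
    scaled_accumulation = accumulative_counts * sequence_parallel_size

    return {
        'data_parallel_size': data_parallel_size,
        'tokens_per_gpu': max_length // sequence_parallel_size,
        'accumulative_counts': scaled_accumulation,
        'sequences_per_optimizer_step':
            batch_size * data_parallel_size * scaled_accumulation,
    }


# Values taken from qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py
layout = sequence_parallel_layout(
    world_size=2, sequence_parallel_size=2, max_length=16384)
print(layout)
```

Under these assumptions each GPU handles 8192 tokens of every 16384-token packed sequence, and scaling the accumulation count keeps the number of sequences per optimizer step at 2, the same total the config comment states.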
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_110b_chat/qwen1_5_110b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-110B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
# total batch = 32gpus * batch_size_per_device 1 * acc 1 = 32
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 4
max_epochs = 3
optim_type = AdamW
lr = 1e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_110b_chat/qwen1_5_110b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-110B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # total bs = 1 bs_per_device * 8 gpus * 1 acc = 8
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 1e-4 # the 110B model uses a smaller lr
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_quant_storage=torch.float16)),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_110b_chat/qwen1_5_110b_chat_qlora_alpaca_e3_16k_2gpus.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-110B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 16384
pack_to_max_length = True
# parallel
sequence_parallel_size = 2
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1 # total bs = 1 bs_per_device * 2 gpus * 1 acc = 2
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 1e-4 # the 110B model uses a smaller lr
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4',
bnb_4bit_quant_storage=torch.float16)),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(type=ThroughputHook),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=1)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_14b/qwen1_5_14b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-14B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_14b/qwen1_5_14b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-14B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_14b_chat/qwen1_5_14b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-14B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_14b_chat/qwen1_5_14b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-14B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_1_8b/qwen1_5_1_8b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-1.8B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_1_8b/qwen1_5_1_8b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-1.8B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_1_8b_chat/qwen1_5_1_8b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-1.8B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_1_8b_chat/qwen1_5_1_8b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-1.8B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_4b/qwen1_5_4b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-4B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_4b/qwen1_5_4b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-4B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_4b_chat/qwen1_5_4b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-4B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_4b_chat/qwen1_5_4b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-4B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_72b/qwen1_5_72b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-72B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_72b/qwen1_5_72b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-72B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_72b_chat/qwen1_5_72b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-72B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_72b_chat/qwen1_5_72b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-72B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_7b/qwen1_5_7b_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-7B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_7b/qwen1_5_7b_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-7B'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_7b_chat/qwen1_5_7b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-7B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen/qwen1_5/qwen1_5_7b_chat/qwen1_5_7b_chat_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-7B-Chat'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# checkpoint to load from (None means no checkpoint is loaded)
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/qwen_moe/qwen1_5/qwen1_5_moe_a2_7_b_chat/qwen1_5_moe_a2_7_b_chat_full_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'Qwen/Qwen1.5-MoE-A2.7B-Chat'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.qwen_chat
max_length = 32768
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 1
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 50
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=False,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template),
dict(type=ThroughputHook),
]
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every iteration.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=1),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# checkpoint saving is disabled here (interval=-1, save_last=False).
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=-1,
save_last=False,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False, window_size=1)
================================================
FILE: xtuner-eval_niah/xtuner/configs/reward_model/internlm/internlm2_chat_1_8b_reward_full_ultrafeedback.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model.reward import RewardModel
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = False
reward_token_id = 92527 # use [UNUSED_TOKEN_130] as reward token
loss_type = 'focal'
penalty_type = 'log_barrier'
# Data
max_length = 2048
# Scheduler & Optimizer
batch_size = 4 # per_device
accumulative_counts = 16
dataloader_num_workers = 0
max_epochs = 1 # reward model should not be trained for more than 1 epoch to avoid overfitting # noqa: E501
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
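# Sketch (not part of the original config): the number of preference samples
# consumed per optimizer step is
#     batch_size * accumulative_counts * num_gpus
# where `num_gpus` is a hypothetical name for the data-parallel world size.
# With the values above on a single GPU: 4 * 16 * 1 = 64.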
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500  # not referenced below: no EvaluateChatHook is registered
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=RewardModel,
use_varlen_attn=use_varlen_attn,
loss_type=loss_type,
penalty_type=penalty_type,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=False,
is_reward=True,
reward_token_id=reward_token_id,
num_proc=32,
use_varlen_attn=use_varlen_attn,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
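# Worked example of the schedule above (a sketch; `iters_per_epoch` is a
# hypothetical value, the real one depends on dataset size and batch size):
#     iters_per_epoch = 1000
#     warmup_end = warmup_ratio * max_epochs * iters_per_epoch  # 0.03 * 1 * 1000 = 30
# LinearLR scales the learning rate from lr * 1e-5 up to lr over iterations
# 0 to 30; CosineAnnealingLR then decays it towards eta_min = 0.0 by the end
# of the final epoch.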
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: no dialogue-logging hook is registered for reward-model training
custom_hooks = []
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/reward_model/internlm/internlm2_chat_1_8b_reward_full_varlenattn_jsonl_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
load_jsonl_dataset)
from xtuner.engine.hooks import VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model.reward import RewardModel
from xtuner.parallel.sequence import SequenceParallelSampler
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
reward_token_id = 92527 # use [UNUSED_TOKEN_130] as reward token
loss_type = 'focal'
penalty_type = 'log_barrier'
# Data
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
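# Sketch of why the scaling above is applied (an interpretation, not stated in
# this file): with sequence parallelism, each group of
# `sequence_parallel_size` GPUs works on the same samples, so the
# data-parallel size shrinks by that factor. Scaling `accumulative_counts`
# keeps the global batch size unchanged. For example, on 8 GPUs (hypothetical):
#     sequence_parallel_size = 1 -> 1 * 16 * 8       = 128 samples per step
#     sequence_parallel_size = 2 -> 1 * (16 * 2) * 4 = 128 samples per step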
dataloader_num_workers = 0
max_epochs = 1 # reward model should not be trained for more than 1 epoch to avoid overfitting # noqa: E501
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
# TODO: eval
# evaluation_freq = 500
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=RewardModel,
use_varlen_attn=use_varlen_attn,
loss_type=loss_type,
penalty_type=penalty_type,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_jsonl_dataset,
data_files=[
'/your/jsonl/path/here.jsonl',
'/your/another/jsonl/path/here.jsonl'
]),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=None,
is_dpo=False,
is_reward=True,
reward_token_id=reward_token_id,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
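# Sketch of one line of such a JSONL file (an assumption based on XTuner's
# preference-data documentation; verify against your XTuner version). With
# `dataset_map_fn=None`, records are expected to already be in this form:
#     {"prompt": [{"role": "user", "content": "..."}],
#      "chosen": [{"role": "assistant", "content": "..."}],
#      "rejected": [{"role": "assistant", "content": "..."}]}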
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: no dialogue-logging hook is registered for reward-model training
custom_hooks = []
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/reward_model/internlm/internlm2_chat_1_8b_reward_full_varlenattn_ultrafeedback.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model.reward import RewardModel
from xtuner.parallel.sequence import SequenceParallelSampler
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
reward_token_id = 92527 # use [UNUSED_TOKEN_130] as reward token
loss_type = 'focal'
penalty_type = 'log_barrier'
# Data
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1 # reward model should not be trained for more than 1 epoch to avoid overfitting # noqa: E501
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
# TODO: eval
# evaluation_freq = 500
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=RewardModel,
use_varlen_attn=use_varlen_attn,
loss_type=loss_type,
penalty_type=penalty_type,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=False,
is_reward=True,
reward_token_id=reward_token_id,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: no dialogue-logging hook is registered for reward-model training
custom_hooks = []
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/reward_model/internlm/internlm2_chat_1_8b_reward_qlora_varlenattn_ultrafeedback.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model.reward import RewardModel
from xtuner.parallel.sequence import SequenceParallelSampler
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'internlm/internlm2-chat-1_8b-sft'
use_varlen_attn = True
reward_token_id = 92527 # use [UNUSED_TOKEN_130] as reward token
loss_type = 'focal'
penalty_type = 'log_barrier'
# Data
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1 # reward model should not be trained for more than 1 epoch to avoid overfitting # noqa: E501
optim_type = AdamW
lr = 1e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
# TODO: eval
# evaluation_freq = 500
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=RewardModel,
use_varlen_attn=use_varlen_attn,
loss_type=loss_type,
penalty_type=penalty_type,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='FEATURE_EXTRACTION'))  # important: the LLM serves as a feature extractor for the reward head, not as a causal LM
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=False,
is_reward=True,
reward_token_id=reward_token_id,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: no dialogue-logging hook is registered for reward-model training
custom_hooks = []
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/reward_model/llama/llama3_8b_instruct_reward_full_varlenattn_ultrafeedback.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from torch.optim import AdamW
from transformers import AutoModelForCausalLM, AutoTokenizer
from xtuner.dataset.collate_fns.preference_collate_fn import \
preference_collate_fn
from xtuner.dataset.preference_dataset import (build_preference_dataset,
orpo_dpo_mix_40k_map_fn)
from xtuner.engine.hooks import VarlenAttnArgsToMessageHubHook
from xtuner.engine.runner import TrainLoop
from xtuner.model.reward import RewardModel
from xtuner.parallel.sequence import SequenceParallelSampler
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'meta-llama/Meta-Llama-3-8B-Instruct'
use_varlen_attn = True
reward_token_id = 128002 # use <|reserved_special_token_0|> as reward token
loss_type = 'focal'
penalty_type = 'log_barrier'
# Data
max_length = 2048
max_packed_length = max_length * 2
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 1 # reward model should not be trained for more than 1 epoch to avoid overfitting # noqa: E501
optim_type = AdamW
lr = 2e-5
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
# TODO: eval
# evaluation_freq = 500
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=RewardModel,
use_varlen_attn=use_varlen_attn,
loss_type=loss_type,
penalty_type=penalty_type,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataset = dict(
type=build_preference_dataset,
dataset=dict(
type=load_dataset,
path='argilla/ultrafeedback-binarized-preferences-cleaned'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=orpo_dpo_mix_40k_map_fn,
is_dpo=False,
is_reward=True,
reward_token_id=reward_token_id,
num_proc=32,
use_varlen_attn=use_varlen_attn,
max_packed_length=max_packed_length,
shuffle_before_pack=True,
)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(
type=preference_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Custom hooks: no dialogue-logging hook is registered for reward-model training
custom_hooks = []
if use_varlen_attn:
    custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/starcoder/starcoder_qlora_stack_exchange_example.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (stack_exchange_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.utils import PROMPT_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'bigcode/starcoder'
use_varlen_attn = False
# Data
data_path = 'ArmelR/stack-exchange-instruction'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
# randomly select 20000 samples from the original dataset
max_dataset_length = 20000
pack_to_max_length = True
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16 # 1bs * 16acc * 1gpu = 16 batchsize
dataloader_num_workers = 0
max_epochs = 1
optim_type = AdamW
lr = 1e-4
betas = (0.9, 0.999)
weight_decay = 0.05
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 200
SYSTEM = ''
evaluation_inputs = [
'from typing import List def has_close_elements(numbers: List[float], threshold: float) -> bool: """ Check if in given list of numbers, are any two numbers closer to each other than given threshold. >>> has_close_elements([1.0, 2.0, 3.0], 0.5) False >>> has_close_elements([1.0, 2.8, 3.0, 4.0, 5.0, 2.0], 0.3) True """' # noqa: E501
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=16,
lora_alpha=32,
lora_dropout=0.05,
bias='none',
target_modules=['c_proj', 'c_attn', 'q_attn'],
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
train_dataset = dict(
type=process_hf_dataset,
dataset=dict(
type=load_dataset,
path=data_path,
data_dir='data/finetune',
split='train'),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=stack_exchange_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
max_dataset_length=max_dataset_length,
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=DefaultSampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
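# Illustrative sketch (not part of the config): the two-entry `param_scheduler`
# above describes a linear warmup over the first `warmup_ratio` of training,
# followed by cosine annealing down to `eta_min`. The closed-form function
# below approximates that shape per iteration. The name `lr_at` and the
# `total_steps` argument are hypothetical, and mmengine's own schedulers
# update the rate recursively, so exact per-step values may differ slightly.

```python
import math


def lr_at(step, total_steps, base_lr=1e-4, warmup_ratio=0.03,
          start_factor=1e-5, eta_min=0.0):
    """Approximate learning rate at `step` for warmup + cosine annealing."""
    warmup_steps = round(warmup_ratio * total_steps)
    if step < warmup_steps:
        # Linear ramp of the multiplicative factor from start_factor to 1.0.
        progress = step / max(1, warmup_steps)
        return base_lr * (start_factor + (1.0 - start_factor) * progress)
    # Cosine decay from base_lr to eta_min over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return eta_min + (base_lr - eta_min) * 0.5 * (1.0 + math.cos(math.pi * progress))


# Sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [lr_at(s, 1000) for s in (0, 15, 30, 500, 1000)]
```

# The rate starts near base_lr * start_factor, peaks at base_lr once the
# warmup ends, and decays towards eta_min by the final step.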
================================================
FILE: xtuner-eval_niah/xtuner/configs/yi/yi_34b/yi_34b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-34B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
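# Illustrative sketch (not part of the config): why `accumulative_counts` is
# multiplied by `sequence_parallel_size` in PART 1. This assumes that ranks in
# one sequence-parallel group share the same sample, so the number of
# data-parallel replicas is world_size // sequence_parallel_size. Scaling the
# accumulation count then keeps the global batch size unchanged. The helper
# name `global_batch_size` and the `world_size` argument are hypothetical.

```python
def global_batch_size(batch_size, accumulative_counts, world_size,
                      sequence_parallel_size=1):
    """Samples contributing to one optimizer step, under the stated assumption."""
    if world_size % sequence_parallel_size != 0:
        raise ValueError('world_size must be divisible by sequence_parallel_size')
    data_parallel_size = world_size // sequence_parallel_size
    return batch_size * accumulative_counts * data_parallel_size


# 8 GPUs without sequence parallelism: 1 * 16 * 8 samples per step.
plain = global_batch_size(1, 16, world_size=8)
# 8 GPUs with sequence_parallel_size=4: accumulation is scaled to 16 * 4,
# which compensates for having only 2 data-parallel replicas.
with_sp = global_batch_size(1, 16 * 4, world_size=8, sequence_parallel_size=4)
```

# Both settings give the same global batch size, which is the point of the
# `accumulative_counts *= sequence_parallel_size` line.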
================================================
FILE: xtuner-eval_niah/xtuner/configs/yi/yi_6b/yi_6b_qlora_alpaca_enzh_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import ConcatDataset, process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import (alpaca_map_fn, alpaca_zh_map_fn,
template_map_fn_factory)
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = '01-ai/Yi-6B'
use_varlen_attn = False
# Data
alpaca_zh_path = 'silk-road/alpaca-data-gpt4-chinese'
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.default
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
alpaca_zh = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_zh_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_zh_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
train_dataset = dict(type=ConcatDataset, datasets=[alpaca_en, alpaca_zh])
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=train_dataset,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/configs/zephyr/zephyr_7b_beta_qlora_alpaca_e3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from datasets import load_dataset
from mmengine.dataset import DefaultSampler
from mmengine.hooks import (CheckpointHook, DistSamplerSeedHook, IterTimerHook,
LoggerHook, ParamSchedulerHook)
from mmengine.optim import AmpOptimWrapper, CosineAnnealingLR, LinearLR
from peft import LoraConfig
from torch.optim import AdamW
from transformers import (AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig)
from xtuner.dataset import process_hf_dataset
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.dataset.map_fns import alpaca_map_fn, template_map_fn_factory
from xtuner.engine.hooks import (DatasetInfoHook, EvaluateChatHook,
VarlenAttnArgsToMessageHubHook)
from xtuner.engine.runner import TrainLoop
from xtuner.model import SupervisedFinetune
from xtuner.parallel.sequence import SequenceParallelSampler
from xtuner.utils import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
#######################################################################
# PART 1 Settings #
#######################################################################
# Model
pretrained_model_name_or_path = 'HuggingFaceH4/zephyr-7b-beta'
use_varlen_attn = False
# Data
alpaca_en_path = 'tatsu-lab/alpaca'
prompt_template = PROMPT_TEMPLATE.zephyr
max_length = 2048
pack_to_max_length = True
# parallel
sequence_parallel_size = 1
# Scheduler & Optimizer
batch_size = 1 # per_device
accumulative_counts = 16
accumulative_counts *= sequence_parallel_size
dataloader_num_workers = 0
max_epochs = 3
optim_type = AdamW
lr = 2e-4
betas = (0.9, 0.999)
weight_decay = 0
max_norm = 1 # grad clip
warmup_ratio = 0.03
# Save
save_steps = 500
save_total_limit = 2 # Maximum checkpoints to keep (-1 means unlimited)
# Evaluate the generation performance during the training
evaluation_freq = 500
SYSTEM = SYSTEM_TEMPLATE.alpaca
evaluation_inputs = [
'请给我介绍五个上海的景点', 'Please tell me five scenic spots in Shanghai'
]
#######################################################################
# PART 2 Model & Tokenizer #
#######################################################################
tokenizer = dict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
padding_side='right')
model = dict(
type=SupervisedFinetune,
use_varlen_attn=use_varlen_attn,
llm=dict(
type=AutoModelForCausalLM.from_pretrained,
pretrained_model_name_or_path=pretrained_model_name_or_path,
trust_remote_code=True,
torch_dtype=torch.float16,
quantization_config=dict(
type=BitsAndBytesConfig,
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')),
lora=dict(
type=LoraConfig,
r=64,
lora_alpha=16,
lora_dropout=0.1,
bias='none',
task_type='CAUSAL_LM'))
#######################################################################
# PART 3 Dataset & Dataloader #
#######################################################################
alpaca_en = dict(
type=process_hf_dataset,
dataset=dict(type=load_dataset, path=alpaca_en_path),
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=alpaca_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=prompt_template),
remove_unused_columns=True,
shuffle_before_pack=True,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn)
sampler = SequenceParallelSampler \
if sequence_parallel_size > 1 else DefaultSampler
train_dataloader = dict(
batch_size=batch_size,
num_workers=dataloader_num_workers,
dataset=alpaca_en,
sampler=dict(type=sampler, shuffle=True),
collate_fn=dict(type=default_collate_fn, use_varlen_attn=use_varlen_attn))
#######################################################################
# PART 4 Scheduler & Optimizer #
#######################################################################
# optimizer
optim_wrapper = dict(
type=AmpOptimWrapper,
optimizer=dict(
type=optim_type, lr=lr, betas=betas, weight_decay=weight_decay),
clip_grad=dict(max_norm=max_norm, error_if_nonfinite=False),
accumulative_counts=accumulative_counts,
loss_scale='dynamic',
dtype='float16')
# learning policy
# More information: https://github.com/open-mmlab/mmengine/blob/main/docs/en/tutorials/param_scheduler.md # noqa: E501
param_scheduler = [
dict(
type=LinearLR,
start_factor=1e-5,
by_epoch=True,
begin=0,
end=warmup_ratio * max_epochs,
convert_to_iter_based=True),
dict(
type=CosineAnnealingLR,
eta_min=0.0,
by_epoch=True,
begin=warmup_ratio * max_epochs,
end=max_epochs,
convert_to_iter_based=True)
]
# train, val, test setting
train_cfg = dict(type=TrainLoop, max_epochs=max_epochs)
#######################################################################
# PART 5 Runtime #
#######################################################################
# Log the dialogue periodically during the training process, optional
custom_hooks = [
dict(type=DatasetInfoHook, tokenizer=tokenizer),
dict(
type=EvaluateChatHook,
tokenizer=tokenizer,
every_n_iters=evaluation_freq,
evaluation_inputs=evaluation_inputs,
system=SYSTEM,
prompt_template=prompt_template)
]
if use_varlen_attn:
custom_hooks += [dict(type=VarlenAttnArgsToMessageHubHook)]
# configure default hooks
default_hooks = dict(
# record the time of every iteration.
timer=dict(type=IterTimerHook),
# print log every 10 iterations.
logger=dict(type=LoggerHook, log_metric_by_epoch=False, interval=10),
# enable the parameter scheduler.
param_scheduler=dict(type=ParamSchedulerHook),
# save checkpoint per `save_steps`.
checkpoint=dict(
type=CheckpointHook,
by_epoch=False,
interval=save_steps,
max_keep_ckpts=save_total_limit),
# set sampler seed in distributed environment.
sampler_seed=dict(type=DistSamplerSeedHook),
)
# configure environment
env_cfg = dict(
# whether to enable cudnn benchmark
cudnn_benchmark=False,
# set multi process parameters
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0),
# set distributed parameters
dist_cfg=dict(backend='nccl'),
)
# set visualizer
visualizer = None
# set log level
log_level = 'INFO'
# load from which checkpoint
load_from = None
# whether to resume training from the loaded checkpoint
resume = False
# Defaults to use random seed and disable `deterministic`
randomness = dict(seed=None, deterministic=False)
# set log processor
log_processor = dict(by_epoch=False)
================================================
FILE: xtuner-eval_niah/xtuner/dataset/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from .concat_dataset import ConcatDataset
from .huggingface import process_hf_dataset
from .intern_repo import (build_packed_dataset,
load_intern_repo_tokenized_dataset,
load_intern_repo_untokenized_dataset)
from .json_dataset import load_json_file
from .llava import LLaVADataset
from .modelscope import process_ms_dataset
from .moss_sft import MOSSSFTDataset
from .refcoco_json import (InvRefCOCOJsonDataset, RefCOCOJsonDataset,
RefCOCOJsonEvalDataset)
from .utils import decode_base64_to_image, expand2square, load_image
# ignore FutureWarning in hf datasets
warnings.simplefilter(action='ignore', category=FutureWarning)
__all__ = [
'process_hf_dataset', 'ConcatDataset', 'MOSSSFTDataset',
'process_ms_dataset', 'LLaVADataset', 'expand2square',
'decode_base64_to_image', 'load_image',
'load_intern_repo_tokenized_dataset',
'load_intern_repo_untokenized_dataset', 'build_packed_dataset',
'RefCOCOJsonDataset', 'RefCOCOJsonEvalDataset', 'InvRefCOCOJsonDataset',
'load_json_file'
]
================================================
FILE: xtuner-eval_niah/xtuner/dataset/collate_fns/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .default_collate_fn import default_collate_fn
from .mmlu_collate_fn import mmlu_collate_fn
__all__ = ['default_collate_fn', 'mmlu_collate_fn']
================================================
FILE: xtuner-eval_niah/xtuner/dataset/collate_fns/default_collate_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Dict, Sequence
import torch
from torch.nn.utils.rnn import pad_sequence
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
pad_for_sequence_parallel)
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX
def default_collate_fn(instances: Sequence[Dict],
pad_index: int = DEFAULT_PAD_TOKEN_INDEX,
return_hf_format: bool = False,
use_varlen_attn: bool = False):
seq_parallel_world_size = get_sequence_parallel_world_size()
input_ids, labels = [], []
has_image = any(inst.get('pixel_values') is not None for inst in instances)
if use_varlen_attn:
position_ids, cumulative_len = [], []
assert len(instances) == 1, (
f'If utilizing varlen attention, the batch size should be'
f' set to 1, but got {len(instances)}')
assert not has_image, (
'Currently, it is not configured to '
'accommodate the use of varlen Attention in multimodal training')
if has_image:
pixel_values = []
for example in instances:
input_ids.append(torch.LongTensor(example['input_ids']))
labels.append(torch.LongTensor(example['labels']))
if use_varlen_attn:
cumulative_len.append(torch.IntTensor(example['cumulative_len']))
position_ids.append(torch.LongTensor(example['position_ids']))
if has_image:
pixel_values.append(example['pixel_values'])
ori_length = [len(ids) for ids in input_ids]
if len(instances) > 1:
input_ids = pad_sequence(
input_ids, batch_first=True, padding_value=pad_index)
labels = pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX)
else:
input_ids = torch.stack(input_ids)
labels = torch.stack(labels)
if use_varlen_attn:
assert input_ids.size(1) % seq_parallel_world_size == 0
attention_mask = None
position_ids = torch.stack(position_ids, dim=0)
else:
# Some tokenizers have the same eos token and pad token, so input_ids
# cannot be masked directly based on the pad token id.
attention_mask = torch.zeros_like(input_ids).bool()
for i, length in enumerate(ori_length):
attention_mask[i, :length] = True
bs, seq_len = input_ids.shape
position_ids = torch.arange(seq_len).unsqueeze(0).long().repeat(bs, 1)
if seq_parallel_world_size > 1:
input_ids = pad_for_sequence_parallel(input_ids, pad_index)
labels = pad_for_sequence_parallel(labels, IGNORE_INDEX)
position_ids = pad_for_sequence_parallel(position_ids, 0)
if attention_mask is not None:
attention_mask = pad_for_sequence_parallel(attention_mask, 0)
if use_varlen_attn:
max_seqlen = (
cumulative_len[0][1:] - # noqa: W504
cumulative_len[0][:-1]).max().item()
data_dict = {
'input_ids': input_ids,
'cumulative_len': cumulative_len,
'position_ids': position_ids,
'labels': labels,
'max_seqlen': max_seqlen
}
else:
data_dict = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids,
'labels': labels
}
if has_image:
pixel_values = torch.stack(pixel_values)
data_dict['pixel_values'] = pixel_values
if return_hf_format:
return data_dict
else:
return {'data': data_dict, 'data_samples': None}
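# Illustrative sketch (not part of the module): the non-varlen branch of
# `default_collate_fn` pads the batch and then builds the attention mask from
# the pre-padding lengths rather than from the pad token id. The reduced
# example below shows why: a real token whose id equals the pad id must stay
# attended. `PAD_INDEX` and `IGNORE` are assumed stand-ins for
# DEFAULT_PAD_TOKEN_INDEX and IGNORE_INDEX, and `collate_sketch` is a
# hypothetical name.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

PAD_INDEX = 0   # assumed stand-in for DEFAULT_PAD_TOKEN_INDEX
IGNORE = -100   # assumed stand-in for IGNORE_INDEX


def collate_sketch(instances):
    """Pad a batch and mark real tokens using the pre-padding lengths."""
    input_ids = [torch.LongTensor(x['input_ids']) for x in instances]
    labels = [torch.LongTensor(x['labels']) for x in instances]
    ori_length = [len(ids) for ids in input_ids]
    input_ids = pad_sequence(
        input_ids, batch_first=True, padding_value=PAD_INDEX)
    labels = pad_sequence(labels, batch_first=True, padding_value=IGNORE)
    # One row per sample: the first `length` positions are real tokens.
    attention_mask = torch.zeros_like(input_ids).bool()
    for row, length in enumerate(ori_length):
        attention_mask[row, :length] = True
    return input_ids, labels, attention_mask


batch = [
    {'input_ids': [5, 0, 7], 'labels': [5, 0, 7]},  # contains a real token 0
    {'input_ids': [9], 'labels': [9]},
]
input_ids, labels, attention_mask = collate_sketch(batch)
```

# The token 0 inside the first sample stays attended, while the two padded
# positions of the second sample are masked out.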
================================================
FILE: xtuner-eval_niah/xtuner/dataset/collate_fns/mmlu_collate_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Dict, Sequence
import torch
from torch.nn.utils.rnn import pad_sequence
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX
def mmlu_collate_fn(instances: Sequence[Dict],
pad_index: int = DEFAULT_PAD_TOKEN_INDEX,
return_hf_format: bool = False) -> Dict[str, torch.Tensor]:
input_ids = []
labels = []
data_samples = {'labels': [], 'subjects': []}
for example in instances:
input_ids.append(torch.tensor(example['input_ids']))
labels.append(torch.tensor(example['labels']))
data_samples['labels'].append(example['output'])
data_samples['subjects'].append(example['subject'])
if len(instances) > 1:
input_ids = pad_sequence(
input_ids, batch_first=True, padding_value=pad_index)
labels = pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX)
else:
input_ids = torch.stack(input_ids)
labels = torch.stack(labels)
data_dict = {
'input_ids': input_ids,
'attention_mask': input_ids.ne(pad_index),
'labels': labels
}
if return_hf_format:
return data_dict
else:
return {'data': data_dict, 'data_samples': data_samples}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/collate_fns/preference_collate_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Dict, Sequence
import torch
from torch.nn.utils.rnn import pad_sequence
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
pad_cumulative_len_for_sequence_parallel,
pad_for_sequence_parallel)
from xtuner.utils import DEFAULT_PAD_TOKEN_INDEX, IGNORE_INDEX
def preference_collate_fn(instances: Sequence[Dict],
pad_index: int = DEFAULT_PAD_TOKEN_INDEX,
return_hf_format: bool = False,
use_varlen_attn: bool = False):
seq_parallel_world_size = get_sequence_parallel_world_size()
ds_names = []
if not use_varlen_attn:
# split chosen and rejected into two instances
splited_instances = []
for d in instances:
splited_instances.append({
'input_ids': d['chosen_ids'],
'labels': d['chosen_labels']
})
splited_instances.append({
'input_ids': d['rejected_ids'],
'labels': d['rejected_labels']
})
ds_names.append(d.get('ds_name', None))
instances = splited_instances
input_ids, labels = [], []
if use_varlen_attn:
position_ids, cumulative_len = [], []
assert len(instances) == 1, (
f'If utilizing varlen attention, the batch size should be'
f' set to 1, but got {len(instances)}')
for example in instances:
input_ids.append(torch.LongTensor(example['input_ids']))
labels.append(torch.LongTensor(example['labels']))
if use_varlen_attn:
cumulative_len.append(torch.IntTensor(example['cumulative_len']))
position_ids.append(torch.LongTensor(example['position_ids']))
num_samples = (len(example['cumulative_len']) - 1) // 2
ds_names.extend(example.get('ds_names', [None] * num_samples))
ori_length = [len(ids) for ids in input_ids]
if len(instances) > 1:
input_ids = pad_sequence(
input_ids, batch_first=True, padding_value=pad_index)
labels = pad_sequence(
labels, batch_first=True, padding_value=IGNORE_INDEX)
else:
input_ids = torch.stack(input_ids)
labels = torch.stack(labels)
if use_varlen_attn:
attention_mask = None
position_ids = torch.stack(position_ids, dim=0)
else:
# Some tokenizers have the same eos token and pad token, so input_ids
# cannot be masked directly based on the pad token id.
attention_mask = torch.zeros_like(input_ids).bool()
for i, length in enumerate(ori_length):
attention_mask[i, :length] = True
bs, seq_len = input_ids.shape
position_ids = torch.arange(seq_len).unsqueeze(0).long().repeat(bs, 1)
if seq_parallel_world_size > 1:
input_ids = pad_for_sequence_parallel(input_ids, pad_index)
labels = pad_for_sequence_parallel(labels, IGNORE_INDEX)
position_ids = pad_for_sequence_parallel(position_ids, 0)
if attention_mask is not None:
attention_mask = pad_for_sequence_parallel(attention_mask, 0)
if use_varlen_attn:
(cumulative_len, attention_mask
) = pad_cumulative_len_for_sequence_parallel(cumulative_len)
if use_varlen_attn:
max_seqlen = (
cumulative_len[0][1:] - # noqa: W504
cumulative_len[0][:-1]).max().item()
data_dict = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'cumulative_len': cumulative_len,
'position_ids': position_ids,
'labels': labels,
'max_seqlen': max_seqlen
}
else:
data_dict = {
'input_ids': input_ids,
'attention_mask': attention_mask,
'position_ids': position_ids,
'labels': labels
}
if return_hf_format:
return data_dict
else:
return {'data': data_dict, 'data_samples': {'ds_names': ds_names}}
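# Illustrative sketch, not part of XTuner: with varlen attention, `max_seqlen`
# above is the longest segment implied by consecutive `cumulative_len`
# boundaries. The helper name and the sample boundaries are hypothetical, and
# plain lists replace the tensor arithmetic.

```python
from typing import List


def max_segment_length(cumulative_len: List[int]) -> int:
    """Longest gap between consecutive cumulative boundaries."""
    starts = cumulative_len[:-1]
    ends = cumulative_len[1:]
    return max(end - start for start, end in zip(starts, ends))


# Three packed segments of lengths 3, 7 and 2.
boundaries = [0, 3, 10, 12]
longest = max_segment_length(boundaries)
print(longest)
```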
================================================
FILE: xtuner-eval_niah/xtuner/dataset/concat_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from torch.utils.data import ConcatDataset as TorchConcatDataset
from xtuner.registry import BUILDER
class ConcatDataset(TorchConcatDataset):
def __init__(self, datasets):
datasets_instance = []
for cfg in datasets:
datasets_instance.append(BUILDER.build(cfg))
super().__init__(datasets=datasets_instance)
def __repr__(self):
main_str = 'Dataset as a concatenation of multiple datasets. \n'
main_str += ',\n'.join(
[f'{repr(dataset)}' for dataset in self.datasets])
return main_str
================================================
FILE: xtuner-eval_niah/xtuner/dataset/huggingface.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import logging
import os
from datetime import timedelta
from functools import partial
import numpy as np
from datasets import DatasetDict, concatenate_datasets
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from mmengine.utils.misc import get_object_from_string
from torch import distributed as dist
from xtuner.registry import BUILDER, MAP_FUNC
from .utils import Packer, encode_fn
def get_lengths(example):
return {'length': len(example['input_ids'])}
def build_origin_dataset(dataset, split):
if isinstance(dataset, DatasetDict):
if split is None:
dataset = concatenate_datasets(dataset.values())
else:
dataset = dataset[split]
elif isinstance(dataset, dict) or isinstance(
dataset, Config) or isinstance(dataset, ConfigDict):
dataset = BUILDER.build(dataset)
if isinstance(dataset, DatasetDict):
if split is None:
dataset = concatenate_datasets(dataset.values())
else:
dataset = dataset[split]
return dataset
def map_dataset(dataset, dataset_map_fn, map_num_proc):
if isinstance(dataset_map_fn, str):
map_fn_obj = MAP_FUNC.get(dataset_map_fn) or get_object_from_string(
dataset_map_fn)
if map_fn_obj is not None:
dataset_map_fn = map_fn_obj
else:
raise TypeError('dataset_map_fn must be a function or a '
"registered function's string in MAP_FUNC, "
f"but got a string of '{dataset_map_fn}'")
dataset = dataset.map(dataset_map_fn, num_proc=map_num_proc)
return dataset
def add_template_to_dataset(dataset, template_map_fn, map_num_proc):
if isinstance(template_map_fn,
dict) or isinstance(template_map_fn, Config) or isinstance(
template_map_fn, ConfigDict):
template_map_fn = BUILDER.build(template_map_fn)
dataset = dataset.map(template_map_fn, num_proc=map_num_proc)
# remove invalid data
dataset = dataset.filter(
lambda example: len(example['conversation']) > 0,
num_proc=map_num_proc)
return dataset
def tokenize_dataset(dataset, tokenizer, max_length, with_image_token,
input_ids_with_output, remove_unused_columns,
map_num_proc):
assert (tokenizer is not None) and (max_length is not None), \
f'({tokenizer}, {max_length})'
if isinstance(tokenizer, dict) or isinstance(
tokenizer, Config) or isinstance(tokenizer, ConfigDict):
tokenizer = BUILDER.build(tokenizer)
dataset = dataset.map(
partial(
encode_fn,
tokenizer=tokenizer,
max_length=max_length,
with_image_token=with_image_token,
input_ids_with_output=input_ids_with_output),
remove_columns=list(dataset.column_names)
if remove_unused_columns else None,
num_proc=map_num_proc)
return dataset
def pack_dataset(dataset, max_length, use_varlen_attn, shuffle_before_pack,
map_num_proc):
if shuffle_before_pack:
dataset = dataset.shuffle()
dataset = dataset.flatten_indices(num_proc=map_num_proc)
dataset = dataset.map(
Packer(max_length, use_varlen_attn=use_varlen_attn),
batched=True,
num_proc=map_num_proc)
return dataset
def process(dataset,
do_dataset_tokenization=True,
tokenizer=None,
max_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_dataset_length=None,
split='train',
remove_unused_columns=False,
rename_maps=[],
shuffle_before_pack=True,
pack_to_max_length=True,
use_varlen_attn=False,
input_ids_with_output=True,
with_image_token=False,
map_num_proc=32):
"""Post-process the dataset loaded from the Hugging Face Hub, or a local
dataset.
Args:
dataset: The dataset to be post-processed.
do_dataset_tokenization: Whether the dataset needs to be tokenized
in this function. Defaults to True.
tokenizer: The tokenizer processes some raw text as input and outputs
an Encoding. If `do_dataset_tokenization` is True, this argument
should not be None. Defaults to None.
max_length: Max length of the sequence. If `do_dataset_tokenization`
or `pack_to_max_length` is True, this argument should not be None.
Defaults to None.
dataset_map_fn: Map the original dataset format to the one defined
by XTuner.
template_map_fn: Add the prompt template to the dataset.
max_dataset_length: If the dataset is too large, randomly sample
`max_dataset_length` examples from it.
split: Which split of the data to load.
If `None`, will return a single concatenated dataset with all
splits (typically `datasets.Split.TRAIN` and
`datasets.Split.TEST`).
If given, will return a single Dataset.
remove_unused_columns: Whether to remove columns from the dataset
that are not used during training.
rename_maps: Rename the column name of the dataset.
shuffle_before_pack: Whether to shuffle the dataset before
packing them.
pack_to_max_length: Whether to pack the dataset to `max_length`.
This usually improves GPU utilization and therefore reduces
training time.
use_varlen_attn: If True, attention is computed within each original
sample of a packed sequence, based on that sample's actual length,
rather than across the whole packed sequence.
input_ids_with_output: Whether to put the groundtruth output
corresponding to the question into the dataset. Typically set
it to True during training and False during testing.
with_image_token: Whether to convert DEFAULT_IMAGE_TOKEN to
IMAGE_TOKEN_INDEX. Typically set it to True during the training
of VLM.
map_num_proc: Max number of processes when mapping the dataset.
"""
if use_varlen_attn:
assert pack_to_max_length, \
'`pack_to_max_length` in `process_hf_dataset` should be set to ' \
'True if `use_varlen_attn` is True.'
if pack_to_max_length:
assert split == 'train' or split is None, \
('`split` should be `train` or `None` if `pack_to_max_length` is '
f'True, but got {split}.')
dataset = build_origin_dataset(dataset, split)
# sample `max_dataset_length` items from the original dataset to
# save time consumed by map function
if max_dataset_length is not None:
max_dataset_length = min(max_dataset_length, len(dataset))
indices = np.random.choice(
len(dataset), max_dataset_length, replace=False)
dataset = dataset.select(indices)
# Extract the useful data for training from the original dataset.
if dataset_map_fn is not None:
dataset = map_dataset(dataset, dataset_map_fn, map_num_proc)
# Add prompt template, such as <|System|>: xxx <|User|>: xxx <|Bot|>: xxx
if template_map_fn is not None:
dataset = add_template_to_dataset(dataset, template_map_fn,
map_num_proc)
for old, new in rename_maps:
dataset = dataset.rename_column(old, new)
# remove unused columns
if pack_to_max_length and (not remove_unused_columns):
print_log(
'We have to remove unused columns if '
'`pack_to_max_length` is set to True.',
logger='current',
level=logging.WARNING)
remove_unused_columns = True
if do_dataset_tokenization:
dataset = tokenize_dataset(dataset, tokenizer, max_length,
with_image_token, input_ids_with_output,
remove_unused_columns, map_num_proc)
if input_ids_with_output:
assert {'input_ids', 'labels'}.issubset(dataset.column_names)
# remove data that does not have the valid labels.
dataset = dataset.filter(
lambda example: any(label >= 0 for label in example['labels']),
num_proc=map_num_proc)
# pack to max length
if pack_to_max_length:
dataset = pack_dataset(dataset, max_length, use_varlen_attn,
shuffle_before_pack, map_num_proc)
# add 'length'
dataset = dataset.map(get_lengths, num_proc=map_num_proc)
setattr(dataset, 'length', dataset['length'])
return dataset
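# Illustrative sketch, not part of XTuner: `process` drops examples whose
# labels contain no supervised token. The value -100 for IGNORE_INDEX and the
# toy examples are assumptions for this sketch; a list comprehension stands in
# for `dataset.filter`.

```python
IGNORE_INDEX = -100  # assumed value of the ignore index in this sketch


def has_valid_labels(example: dict) -> bool:
    """True if at least one label is a real token id (>= 0)."""
    return any(label >= 0 for label in example['labels'])


examples = [
    {'input_ids': [1, 2, 3], 'labels': [IGNORE_INDEX, 2, 3]},
    # Fully masked, e.g. the answer was cut off by max_length truncation.
    {'input_ids': [4, 5], 'labels': [IGNORE_INDEX, IGNORE_INDEX]},
]
kept = [example for example in examples if has_valid_labels(example)]
print(len(kept))
```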
def process_hf_dataset(dataset,
do_dataset_tokenization=True,
tokenizer=None,
max_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_dataset_length=None,
split='train',
remove_unused_columns=False,
rename_maps=[],
shuffle_before_pack=True,
pack_to_max_length=True,
use_varlen_attn=False,
input_ids_with_output=True,
with_image_token=False,
map_num_proc=32):
"""Post-process the dataset loaded from the Hugging Face Hub, or a local
dataset.
Args:
dataset: The dataset to be post-processed.
do_dataset_tokenization: Whether the dataset needs to be tokenized
in this function. Defaults to True.
tokenizer: The tokenizer processes some raw text as input and outputs
an Encoding. If `do_dataset_tokenization` is True, this argument
should not be None. Defaults to None.
max_length: Max length of the sequence. If `do_dataset_tokenization`
or `pack_to_max_length` is True, this argument should not be None.
Defaults to None.
dataset_map_fn: Map the original dataset format to the one defined
by XTuner.
template_map_fn: Add the prompt template to the dataset.
max_dataset_length: If the dataset is too large, randomly sample
`max_dataset_length` examples from it.
split: Which split of the data to load.
If `None`, will return a single concatenated dataset with all
splits (typically `datasets.Split.TRAIN` and
`datasets.Split.TEST`).
If given, will return a single Dataset.
remove_unused_columns: Whether to remove columns from the dataset
that are not used during training.
rename_maps: Rename the column name of the dataset.
shuffle_before_pack: Whether to shuffle the dataset before
packing them.
pack_to_max_length: Whether to pack the dataset to `max_length`.
This usually improves GPU utilization and therefore reduces
training time.
use_varlen_attn: If True, attention is computed within each original
sample of a packed sequence, based on that sample's actual length,
rather than across the whole packed sequence.
input_ids_with_output: Whether to put the groundtruth output
corresponding to the question into the dataset. Typically set
it to True during training and False during testing.
with_image_token: Whether to convert DEFAULT_IMAGE_TOKEN to
IMAGE_TOKEN_INDEX. Typically set it to True during the training
of VLM.
map_num_proc: Max number of processes when mapping the dataset.
"""
kwargs = dict(
dataset=dataset,
do_dataset_tokenization=do_dataset_tokenization,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
max_dataset_length=max_dataset_length,
split=split,
remove_unused_columns=remove_unused_columns,
rename_maps=rename_maps,
shuffle_before_pack=shuffle_before_pack,
pack_to_max_length=pack_to_max_length,
use_varlen_attn=use_varlen_attn,
input_ids_with_output=input_ids_with_output,
with_image_token=with_image_token,
map_num_proc=map_num_proc)
if not (dist.is_available() and dist.is_initialized()):
return process(**kwargs)
xtuner_dataset_timeout = timedelta(
minutes=int(os.getenv('XTUNER_DATASET_TIMEOUT', default=60)))
print_log(
f'xtuner_dataset_timeout = {xtuner_dataset_timeout}', logger='current')
# monitored barrier requires gloo process group to perform host-side sync.
group_gloo = dist.new_group(backend='gloo', timeout=xtuner_dataset_timeout)
if dist.get_rank() == 0:
dataset = process(**kwargs)
objects = [dataset]
else:
objects = [None]
dist.monitored_barrier(group=group_gloo, timeout=xtuner_dataset_timeout)
dist.broadcast_object_list(objects, src=0)
return objects[0]
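# Illustrative sketch, not part of XTuner: the barrier timeout above is read
# from the XTUNER_DATASET_TIMEOUT environment variable, interpreted as minutes,
# with 60 as the fallback. The helper name and its `env` parameter are
# hypothetical and exist only to make the parsing testable without touching
# the real environment.

```python
import os
from datetime import timedelta
from typing import Mapping


def dataset_timeout(env: Mapping[str, str] = os.environ) -> timedelta:
    """Parse the timeout in minutes, defaulting to 60 when unset."""
    return timedelta(minutes=int(env.get('XTUNER_DATASET_TIMEOUT', 60)))


default_timeout = dataset_timeout({})
custom_timeout = dataset_timeout({'XTUNER_DATASET_TIMEOUT': '5'})
print(default_timeout, custom_timeout)
```

# A larger value may help when rank 0 needs a long time to tokenize and pack a
# big dataset while the other ranks wait at the barrier.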
================================================
FILE: xtuner-eval_niah/xtuner/dataset/intern_repo.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import itertools as it
import json
import mmap
import operator
import os
import threading
from pathlib import Path
import numpy as np
import torch
from datasets import Dataset, load_dataset, load_from_disk
from mmengine import print_log
from torch import distributed as dist
from torch.utils.data import ConcatDataset
from xtuner.dataset.map_fns import openai_map_fn
from xtuner.registry import BUILDER
from .huggingface import process
class JsonlDataset(torch.utils.data.Dataset):
"""
JSONL format is expected to roughly follow that of The Pile.
One-line-per-document of the form:
```
{
"input_ids": List[int],
"labels": List[int]
}
```
"""
def __init__(self, path: str, min_length=50):
self.path = path
self.threadlocal = threading.local()
resolved_path = Path(path).resolve()
self.resolved_path = resolved_path
self.meta = Path(f'{resolved_path}.meta')
# The `.meta` cache file must already exist next to the data file; it
# is loaded here, not built.
assert os.path.exists(
self.meta
), f'The cache file:{self.meta} is not found for file:{self.path}'
try:
with open(self.meta, 'rb') as f:
meta = np.load(f)
except Exception as e:
print(f'Cannot load file {self.meta}...')
raise e
self.offsets = meta[:, 0]
self.length = meta[:, -1]
if min_length > 0:
mask = self.length >= min_length
self.offsets = self.offsets[mask]
self.length = self.length[mask]
def __getitem__(self, idx):
f = self._get_mmap()
position = self.offsets[idx]
f.seek(position)
item = f.readline().decode('utf-8')
try:
item = json.loads(item)
item['input_ids'] = item['tokens']
del item['tokens']
labels = [x if x > 0 else -100 for x in item['input_ids']]
item['input_ids'] = [abs(x) for x in item['input_ids']]
item['labels'] = labels
item['length'] = len(item['input_ids']) # add a length info
except Exception as err:
raise json.decoder.JSONDecodeError(
doc=self.path,
pos=position,
msg=(f'Error while loading JSONL line in file {self.path} '
f'at byte {position}. Contents of line:\n{item}\n{err}'),
)
return item
def get_dataset_name(self):
return str(self.resolved_path)
def _get_mmap(self):
if not hasattr(self.threadlocal, 'handles'):
with open(self.path, 'rb') as f:
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
self.threadlocal.handles = [f, mm]
if self.path.endswith('.gz') or self.path.endswith(
'.bz') or self.path.endswith('.bz2'):
raise NotImplementedError(
'Compressed files are not supported because .seek() '
'would require rereading the entire file, making '
'performance too slow.')
return self.threadlocal.handles[-1]
def __setstate__(self, state):
self.__dict__ = state
self.threadlocal = threading.local()
def __getstate__(self):
d = {}
for i, v in self.__dict__.items():
if i != 'threadlocal':
d[i] = v
return d
def __del__(self):
if hasattr(self.threadlocal, 'handles'):
# cleanup files we opened on initialization
while self.threadlocal.handles:
self.threadlocal.handles.pop().close()
@staticmethod
def exists(path):
return os.path.exists(path)
def __len__(self):
# Virtual length of the dataset depends on the epoch number
# if the number of documents is not perfectly divisible by the
# data_subshard_count
return len(self.offsets)
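# Illustrative sketch, not part of XTuner: `JsonlDataset.__getitem__` above
# reads a `tokens` list in which the sign marks supervision. Non-positive
# entries become label -100, and `input_ids` takes the absolute value. The
# helper name and the toy token ids are hypothetical.

```python
from typing import Dict, List


def decode_signed_tokens(tokens: List[int]) -> Dict[str, object]:
    """Split sign-encoded tokens into input_ids and loss labels."""
    labels = [t if t > 0 else -100 for t in tokens]
    input_ids = [abs(t) for t in tokens]
    return {
        'input_ids': input_ids,
        'labels': labels,
        'length': len(input_ids),
    }


# The first three tokens (a prompt) are negated, so they carry no loss.
item = decode_signed_tokens([-1, -15, -27, 42, 43, 2])
print(item)
```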
class PackedDataset(torch.utils.data.Dataset):
"""The class PackedDataset takes in a dataset and aggregates samples of
different lengths together based on the packed_length.
Args:
dataset: The original dataset to pack.
packed_length: The length of each packed sample. Default is 8192.
"""
def __init__(self, dataset, packed_length: int = 8192, seed: int = 1024):
self.dataset = dataset
self.packed_length = packed_length
if isinstance(dataset, JsonlDataset):
self.length = dataset.length
elif isinstance(dataset, Dataset):
if hasattr(dataset, 'length'):
length = dataset.length
else:
length = [len(i['input_ids']) for i in dataset]
self.length = length
else:
raise NotImplementedError
self.seed = seed
rng = np.random.RandomState(self.seed)
shuffled_indices = np.arange(len(self.length))
rng.shuffle(shuffled_indices)
self.shuffled_indices = shuffled_indices.tolist()
self.shuffled_samples_len = list(
map(self.length.__getitem__, shuffled_indices))
self.shuffled_accumulated_samples_len = list(
it.accumulate(self.shuffled_samples_len, operator.add))
self.num_tokens = sum(self.length)
def __len__(self):
return self.num_tokens // self.packed_length
def search_sample_index(self, pack_idx: int = 0):
assert pack_idx >= 0
length_train = (pack_idx + 1) * self.packed_length
sample_index = np.searchsorted(
self.shuffled_accumulated_samples_len, length_train, side='left')
return sample_index
def mapping(self, pack_idx: int = 0):
begin_sample_idx, begin_token_id = 0, 0
if pack_idx > 0:
begin_sample_idx = self.search_sample_index(pack_idx - 1)
# The position where the previous packed data ends
begin_token_id = self.shuffled_samples_len[begin_sample_idx] - (
self.shuffled_accumulated_samples_len[begin_sample_idx]
- # noqa: W504,W503
(pack_idx) * self.packed_length)
if begin_token_id == self.shuffled_samples_len[begin_sample_idx]:
begin_sample_idx += 1
begin_token_id = 0
end_sample_idx = self.search_sample_index(pack_idx)
end_token_id = self.shuffled_samples_len[end_sample_idx] - (
self.shuffled_accumulated_samples_len[end_sample_idx]
- # noqa: W504,W503
(pack_idx + 1) * self.packed_length)
return begin_sample_idx, begin_token_id, end_sample_idx, end_token_id
def build_pack(self, begin_sample_idx: int, begin_token_id: int,
end_sample_idx: int, end_token_id: int):
pack, cumulative_len, position_ids, labels = [], [0], [], []
while begin_sample_idx < end_sample_idx:
sample_idx = self.shuffled_indices[begin_sample_idx]
sample = self.dataset[sample_idx]
chunk = sample['input_ids'][begin_token_id:]
pack.extend(chunk)
_labels = sample['labels'][begin_token_id:]
assert len(_labels) == len(chunk), (_labels, chunk)
labels.extend(_labels)
cumulative_len.append(cumulative_len[-1] + len(chunk))
position_ids.extend(list(range(len(chunk))))
begin_sample_idx = begin_sample_idx + 1
begin_token_id = 0
sample_idx = self.shuffled_indices[end_sample_idx]
sample = self.dataset[sample_idx]
chunk = sample['input_ids'][begin_token_id:
end_token_id] # fragment of a sample
_labels = sample['labels'][begin_token_id:end_token_id]
pack.extend(chunk)
assert len(_labels) == len(chunk), (_labels, chunk)
labels.extend(_labels)
cumulative_len.append(cumulative_len[-1] + len(chunk))
position_ids.extend(list(range(len(chunk))))
out = {
'input_ids': pack,
'cumulative_len': cumulative_len,
'position_ids': position_ids,
'labels': labels
}
return out
def __getitem__(self, item: int):
pos_before, token_id_before, pos_after, token_id_after = self.mapping(
item)
return self.build_pack(pos_before, token_id_before, pos_after,
token_id_after)
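# Illustrative sketch, not part of XTuner: how `PackedDataset.mapping` above
# locates the samples that make up one fixed-size pack. It mirrors the same
# arithmetic on a toy list of sample lengths, with `bisect.bisect_left`
# standing in for `np.searchsorted(..., side='left')`. The function name and
# the lengths are hypothetical, and shuffling is omitted.

```python
import bisect
import itertools
from typing import List, Tuple


def pack_boundaries(sample_lens: List[int], packed_length: int,
                    pack_idx: int) -> Tuple[int, int, int, int]:
    """Return (begin_sample, begin_token, end_sample, end_token)."""
    acc = list(itertools.accumulate(sample_lens))

    def search(idx: int) -> int:
        # First sample whose cumulative length reaches the end of pack idx.
        return bisect.bisect_left(acc, (idx + 1) * packed_length)

    begin_sample, begin_token = 0, 0
    if pack_idx > 0:
        begin_sample = search(pack_idx - 1)
        # Offset inside that sample where the previous pack stopped.
        begin_token = sample_lens[begin_sample] - (
            acc[begin_sample] - pack_idx * packed_length)
        if begin_token == sample_lens[begin_sample]:
            # The previous pack consumed the sample entirely.
            begin_sample += 1
            begin_token = 0
    end_sample = search(pack_idx)
    end_token = sample_lens[end_sample] - (
        acc[end_sample] - (pack_idx + 1) * packed_length)
    return begin_sample, begin_token, end_sample, end_token


lens = [3, 5, 4]  # 12 tokens in total, so three packs of length 4
packs = [pack_boundaries(lens, 4, i) for i in range(3)]
print(packs)
```

# Each tuple reads as: start at token `begin_token` of sample `begin_sample`
# and stop before token `end_token` of sample `end_sample`.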
def load_intern_repo_tokenized_dataset(folder,
min_length=0,
data_order_path=None,
file_type='.bin'):
assert os.path.exists(folder), f'{folder} does not exist.'
datasets = []
if data_order_path is not None:
data_order = load_dataset(
'text', data_files=data_order_path, split='train')['text']
for i, fp in enumerate(data_order):
data_order[i] = os.path.join(folder, fp)
else:
triples = list(os.walk(folder, followlinks=True))
data_order = []
for root, dirs, files in triples:
dirs.sort()
for fn in sorted(files):
if fn.endswith(file_type):
fp = os.path.join(root, fn)
data_order.append(fp)
for fp in data_order:
print_log(f'Reading {fp}...', logger='current')
ds = JsonlDataset(fp, min_length=min_length)
if len(ds) == 0:
continue
datasets.append(ds)
return datasets
def load_intern_repo_untokenized_dataset(processed_dataset_dict_path=None,
folder=None,
tokenizer=None,
max_length=None,
template_map_fn=None,
data_order_path=None,
file_type='.json'):
assert processed_dataset_dict_path or (folder and tokenizer and max_length)
if processed_dataset_dict_path is not None:
ds = load_from_disk(processed_dataset_dict_path)
datasets = []
for key, data in ds.items():
datasets.append((key, data))
datasets = sorted(datasets, key=lambda x: int(x[0]))
datasets = [x[1] for x in datasets]
return datasets
assert os.path.exists(folder), f'{folder} does not exist.'
datasets = []
if data_order_path is not None:
data_order = load_dataset(
'text', data_files=data_order_path, split='train')['text']
for i, fp in enumerate(data_order):
data_order[i] = os.path.join(folder, fp)
else:
triples = list(os.walk(folder, followlinks=True))
data_order = []
for root, dirs, files in triples:
dirs.sort()
for fn in sorted(files):
if fn.endswith(file_type):
fp = os.path.join(root, fn)
data_order.append(fp)
for fp in data_order:
print_log(f'Reading {fp}...', logger='current')
dataset = []
with open(fp) as file:
lines = file.readlines()
for line in lines:
line = json.loads(line)
dataset.append({'messages': line})
dataset = Dataset.from_list(dataset)
dataset = process(
dataset,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=openai_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=True,
pack_to_max_length=False,
map_num_proc=32)
if len(dataset) == 0:
continue
datasets.append(dataset)
return datasets
def build_packed_dataset_rank0(dataset_cfg, packed_length=8192, seed=1024):
if isinstance(dataset_cfg, dict):
datasets = BUILDER.build(dataset_cfg)
else:
datasets = dataset_cfg
if not isinstance(datasets, list):
datasets = [datasets]
packed_datasets = []
for dataset in datasets:
ds = PackedDataset(dataset, packed_length, seed=seed)
packed_datasets.append(ds)
dataset = ConcatDataset(datasets=packed_datasets)
return dataset
def build_packed_dataset(*args, **kwargs):
if not (dist.is_available() and dist.is_initialized()):
return build_packed_dataset_rank0(*args, **kwargs)
if dist.get_rank() == 0:
dataset = build_packed_dataset_rank0(*args, **kwargs)
objects = [dataset]
else:
objects = [None]
dist.broadcast_object_list(objects, src=0)
return objects[0]
================================================
FILE: xtuner-eval_niah/xtuner/dataset/json_dataset.py
================================================
import json
import os
from datasets import Dataset, concatenate_datasets
def load_json_file(data_files=None, data_dir=None, suffix=None):
assert (data_files is not None) != (data_dir is not None)
if data_dir is not None:
data_files = os.listdir(data_dir)
data_files = [os.path.join(data_dir, fn) for fn in data_files]
if suffix is not None:
data_files = [fp for fp in data_files if fp.endswith(suffix)]
elif isinstance(data_files, str):
data_files = [data_files]
dataset_list = []
for fp in data_files:
with open(fp, encoding='utf-8') as file:
data = json.load(file)
ds = Dataset.from_list(data)
dataset_list.append(ds)
dataset = concatenate_datasets(dataset_list)
return dataset
================================================
FILE: xtuner-eval_niah/xtuner/dataset/llava.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import json
import logging
import os
import torch
from datasets import Dataset as HFDataset
from datasets import DatasetDict, load_from_disk
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from PIL import Image
from torch.utils.data import Dataset
from xtuner.registry import BUILDER
from .huggingface import process_hf_dataset
from .utils import expand2square
def load_jsonl(json_file):
with open(json_file) as f:
lines = f.readlines()
data = []
for line in lines:
data.append(json.loads(line))
return data
class LLaVADataset(Dataset):
def __init__(self,
image_folder,
image_processor,
data_path=None,
tokenizer=None,
offline_processed_text_folder=None,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False):
super().__init__()
assert offline_processed_text_folder or (data_path and tokenizer)
if offline_processed_text_folder and data_path:
print_log(
'Both `offline_processed_text_folder` and '
'`data_path` are set, and we load dataset from '
'`offline_processed_text_folder` '
f'({offline_processed_text_folder})',
logger='current',
level=logging.WARNING)
if offline_processed_text_folder is not None:
self.text_data = load_from_disk(offline_processed_text_folder)
else:
if data_path.endswith('.json'):
json_data = json.load(open(data_path))
elif data_path.endswith('.jsonl'):
json_data = load_jsonl(data_path)
else:
raise NotImplementedError
for idx in range(len(json_data)):
if isinstance(json_data[idx]['id'], int):
json_data[idx]['id'] = str(json_data[idx]['id'])
json_data = DatasetDict({'train': HFDataset.from_list(json_data)})
self.text_data = process_hf_dataset(
dataset=json_data,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
split='train',
max_dataset_length=max_dataset_length,
remove_unused_columns=False,
pack_to_max_length=False,
with_image_token=True)
self.image_folder = image_folder
if isinstance(image_processor, dict) or isinstance(
image_processor, Config) or isinstance(image_processor,
ConfigDict):
self.image_processor = BUILDER.build(image_processor)
else:
self.image_processor = image_processor
self.pad_image_to_square = pad_image_to_square
@property
def modality_length(self):
length_list = []
for data_dict in self.text_data:
cur_len = len(data_dict['input_ids'])
if data_dict.get('image', None) is None:
cur_len = -cur_len
length_list.append(cur_len)
return length_list
def __len__(self):
return len(self.text_data)
def __getitem__(self, index):
data_dict = self.text_data[index]
if data_dict.get('image', None) is not None:
image_file = data_dict['image']
image = Image.open(os.path.join(self.image_folder,
image_file)).convert('RGB')
if self.pad_image_to_square:
image = expand2square(
image,
tuple(
int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
data_dict['pixel_values'] = image
else:
if hasattr(self.image_processor, 'crop_size'):
crop_size = self.image_processor.crop_size
else:
crop_size = self.image_processor.size
data_dict['pixel_values'] = torch.zeros(3, crop_size['height'],
crop_size['width'])
return data_dict
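# Illustrative sketch, not part of XTuner: the sign convention used by
# `modality_length` above. Samples with an image report a positive token
# count and text-only samples a negative one, so a length-grouped sampler can
# tell the two modalities apart. The function name and records are
# hypothetical.

```python
from typing import Dict, List


def modality_lengths(records: List[Dict]) -> List[int]:
    """Positive length for image samples, negative for text-only samples."""
    lengths = []
    for record in records:
        cur_len = len(record['input_ids'])
        if record.get('image') is None:
            cur_len = -cur_len
        lengths.append(cur_len)
    return lengths


records = [
    {'input_ids': [1, 2, 3], 'image': 'a.jpg'},
    {'input_ids': [4, 5]},
]
signed = modality_lengths(records)
print(signed)
```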
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .dataset_map_fns import * # noqa: F401, F403
from .template_map_fn import template_map_fn # noqa: F401
from .template_map_fn import template_map_fn_factory # noqa: F401
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .alpaca_map_fn import alpaca_map_fn
from .alpaca_zh_map_fn import alpaca_zh_map_fn
from .arxiv_map_fn import arxiv_map_fn
from .code_alpaca_map_fn import code_alpaca_map_fn
from .colors_map_fn import colors_map_fn
from .crime_kg_assitant_map_fn import crime_kg_assitant_map_fn
from .default_map_fn import default_map_fn
from .law_reference_map_fn import law_reference_map_fn
from .llava_map_fn import llava_image_only_map_fn, llava_map_fn
from .medical_map_fn import medical_map_fn
from .msagent_map_fn import msagent_react_map_fn
from .oasst1_map_fn import oasst1_map_fn
from .openai_map_fn import openai_map_fn
from .openorca_map_fn import openorca_map_fn
from .pretrain_map_fn import pretrain_map_fn
from .sql_map_fn import sql_map_fn
from .stack_exchange_map_fn import stack_exchange_map_fn
from .tiny_codes_map_fn import tiny_codes_map_fn
from .wizardlm_map_fn import wizardlm_map_fn
DATASET_FORMAT_MAPPING = dict(
alpaca=alpaca_map_fn,
alpaca_zh=alpaca_zh_map_fn,
arxiv=arxiv_map_fn,
code_alpaca=code_alpaca_map_fn,
colors=colors_map_fn,
crime_kg_assitan=crime_kg_assitant_map_fn,
default=default_map_fn,
law_reference=law_reference_map_fn,
llava_image_only=llava_image_only_map_fn,
llava=llava_map_fn,
medical=medical_map_fn,
msagent_react=msagent_react_map_fn,
oasst1=oasst1_map_fn,
openai=openai_map_fn,
openorca=openorca_map_fn,
pretrain=pretrain_map_fn,
sql=sql_map_fn,
stack_exchange=stack_exchange_map_fn,
tiny_codes=tiny_codes_map_fn,
wizardlm=wizardlm_map_fn,
)
__all__ = [
'alpaca_map_fn', 'alpaca_zh_map_fn', 'oasst1_map_fn', 'arxiv_map_fn',
'medical_map_fn', 'openorca_map_fn', 'code_alpaca_map_fn',
'tiny_codes_map_fn', 'colors_map_fn', 'law_reference_map_fn',
'crime_kg_assitant_map_fn', 'sql_map_fn', 'openai_map_fn',
'wizardlm_map_fn', 'stack_exchange_map_fn', 'msagent_react_map_fn',
'pretrain_map_fn', 'default_map_fn', 'llava_image_only_map_fn',
'llava_map_fn', 'DATASET_FORMAT_MAPPING'
]
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/alpaca_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def alpaca_map_fn(example):
if example.get('output') == '':
return {'conversation': []}
else:
return {
'conversation': [{
'input': f"{example['instruction']}\n{example['input']}",
'output': example['output']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/alpaca_zh_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def alpaca_zh_map_fn(example):
return {
'conversation': [{
'input': f"{example['instruction_zh']}\n{example['input_zh']}",
'output': example['output_zh']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/arxiv_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def arxiv_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.arxiv_gentile,
'input': example['abstract'],
'output': example['title']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/code_alpaca_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def code_alpaca_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.coder,
'input': example['prompt'],
'output': example['completion']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/colors_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def colors_map_fn(example):
desc = ':'.join(example['description'].split(':')[1:]).strip()
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.colorist,
'input': desc,
'output': example['color']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/crime_kg_assitant_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def crime_kg_assitant_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.lawyer,
'input': example['input'],
'output': example['output']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/default_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def default_map_fn(example):
return {
'conversation': [{
'input': example['input'],
'output': example['output']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/law_reference_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def law_reference_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.lawyer,
'input': example['question'],
'output': example['answer']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/llava_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import DEFAULT_IMAGE_TOKEN
def llava_image_only_map_fn(example):
# input contains the DEFAULT_IMAGE_TOKEN only
messages = example['conversations']
input = ''
conversation = []
while messages and messages[0]['from'] == 'gpt':
# Skip the first one if it is from gpt
messages = messages[1:]
for msg in messages:
if msg['from'] == 'human':
assert DEFAULT_IMAGE_TOKEN in msg['value']
input += DEFAULT_IMAGE_TOKEN
elif msg['from'] == 'gpt':
conversation.append({'input': input, 'output': msg['value']})
input = ''
else:
raise NotImplementedError
return {'conversation': conversation}
def llava_map_fn(example):
messages = example['conversations']
input = ''
conversation = []
while messages and messages[0]['from'] == 'gpt':
# Skip the first one if it is from gpt
messages = messages[1:]
for msg in messages:
if msg['from'] == 'human':
if DEFAULT_IMAGE_TOKEN in msg['value']:
msg['value'] = msg['value'].replace(DEFAULT_IMAGE_TOKEN,
'').strip()
msg['value'] = DEFAULT_IMAGE_TOKEN + '\n' + msg['value']
msg['value'] = msg['value'].strip()
input += msg['value']
elif msg['from'] == 'gpt':
conversation.append({'input': input, 'output': msg['value']})
input = ''
else:
raise NotImplementedError
return {'conversation': conversation}
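# Illustrative sketch (not part of the original module): the image-token
# normalisation that ``llava_map_fn`` applies to each human message, shown on
# a bare string. The helper name is hypothetical, and the default token
# '<image>' is an assumption about the value of DEFAULT_IMAGE_TOKEN.
def _sketch_move_image_token_first(value, image_token='<image>'):
    if image_token in value:
        # Drop the token wherever it occurs, then re-insert it once in front.
        value = value.replace(image_token, '').strip()
        value = (image_token + '\n' + value).strip()
    return value
# Under the assumptions above, 'What is this? <image>' would become
# '<image>\nWhat is this?', and text without the token is returned unchanged.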
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/medical_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def medical_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.medical,
'input': '{instruction}\n{input}'.format(**example),
'output': example['output']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/msagent_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import json
import re
think_regex = r'(.*?)(<\|startofthink\|\>)(.*?)(<\|endofthink\|\>)'
exec_regex = r'(<\|startofexec\|\>)(.*?)(<\|endofexec\|\>)(.*?)$'
def replace_think(match):
out_text = ''
if match.group(1).strip() != '':
out_text += f'Thought:{match.group(1).strip()}\n'
think_text = match.group(3).replace('```JSON',
'').replace('```',
'').replace('\n', '')
think_json = json.loads(think_text)
out_text += (f"Action:{think_json['api_name']}\n"
f"Action Input:{think_json['parameters']}\n")
return out_text
def replace_exec(match):
out_text = ''
exec_text = match.group(2).replace('```JSON',
'').replace('```',
'').replace('\n', '')
exec_json = json.loads(exec_text)
out_text += f'Response:{exec_json}\n'
if match.group(4).strip() != '':
out_text += f'Final Answer:{match.group(4).strip()}\n'
return out_text
def extract_json_objects(text, decoder=json.JSONDecoder()):
pos = 0
results = []
while True:
match = text.find('{', pos)
if match == -1:
break
try:
result, index = decoder.raw_decode(text[match:])
if 'name' in result and 'description' in result:
results.append(result)
pos = match + index
else:
pos = match + 1
except ValueError:
pos = match + 1
return results
def msagent_react_map_fn(example):
text = example['conversations']
    if isinstance(text, str):
        # Parse the stringified list without executing arbitrary code.
        import ast
        text = ast.literal_eval(text)
if len(text) < 2: # Filter out invalid data
return {'conversation': []}
conversation = []
system_text = ''
input_text = ''
for t in text:
if t['from'] == 'system':
system_text += '你是一个可以调用外部工具的助手,可以使用的工具包括:\n'
json_objects = extract_json_objects(t['value'])
api_dict = {}
for obj in json_objects:
api_dict[obj['name']] = obj['description']
try:
params = {
i['name']: i['description']
for i in obj['paths'][0]['parameters']
}
api_dict[obj['name']] += f'\n输入参数: {params}'
except Exception:
pass
system_text += f'{api_dict}\n'
system_text += (
'如果使用工具请遵循以下格式回复:\n```\n'
'Thought:思考你当前步骤需要解决什么问题,是否需要使用工具\n'
f'Action:工具名称,你的工具必须从 [{str(list(api_dict.keys()))}] 选择\n'
'Action Input:工具输入参数\n```\n工具返回按照以下格式回复:\n```\n'
'Response:调用工具后的结果\n```\n如果你已经知道了答案,或者你不需要工具,'
'请遵循以下格式回复\n```\n'
'Thought:给出最终答案的思考过程\n'
'Final Answer:最终答案\n```\n开始!\n')
elif t['from'] == 'user':
input_text += f"{t['value']}\n"
elif t['from'] == 'assistant':
output = t['value']
output_response = None
try:
if '<|startofexec|>' in output:
output, output_response = output.split('<|startofexec|>')
output_response = '<|startofexec|>' + output_response
output, think_cnt = re.subn(
think_regex, replace_think, output, flags=re.DOTALL)
except Exception:
return {'conversation': []}
if think_cnt == 0:
output = f'Final Answer:{output}\n'
else:
output = f'{output}\n'
conversation.append({
'system': system_text,
'input': input_text,
'output': output
})
system_text = ''
input_text = ''
if output_response is not None:
try:
output_response, exec_cnt = re.subn(
exec_regex,
replace_exec,
output_response,
flags=re.DOTALL)
if 'Final Answer:' in output_response:
output_response, output_answer = output_response.split(
'Final Answer:')
output_answer = 'Final Answer:' + output_answer
conversation.append({
'system': output_response,
'output': output_answer
})
except Exception:
pass
return {'conversation': conversation}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/oasst1_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def oasst1_map_fn(example):
r"""Example before preprocessing:
example['text'] = '### Human: Can you explain xxx'
'### Assistant: Sure! xxx'
'### Human: I didn't understand how xxx'
'### Assistant: It has to do with a process xxx.'
Example after preprocessing:
example['conversation'] = [
{
'input': 'Can you explain xxx',
'output': 'Sure! xxx'
},
{
'input': 'I didn't understand how xxx',
'output': 'It has to do with a process xxx.'
}
]
"""
data = []
for sentence in example['text'].strip().split('###'):
sentence = sentence.strip()
if sentence[:6] == 'Human:':
data.append(sentence[6:].strip())
elif sentence[:10] == 'Assistant:':
data.append(sentence[10:].strip())
if len(data) % 2:
# The last round of conversation solely consists of input
# without any output.
# Discard the input part of the last round, as this part is ignored in
# the loss calculation.
data.pop()
conversation = []
for i in range(0, len(data), 2):
single_turn_conversation = {'input': data[i], 'output': data[i + 1]}
conversation.append(single_turn_conversation)
return {'conversation': conversation}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/openai_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def openai_map_fn(example):
"""
Example before preprocessing:
example["messages"] = [
{ "role": "system", "content": "You are an assistant that
occasionally misspells words." },
{ "role": "user", "content": "Tell me a story." },
{ "role": "assistant", "content": "One day a student
went to schoool." }
]
Example after preprocessing:
example["conversation"] = [
{
"system": "You are an assistant that occasionally misspells
words.",
"input": "Tell me a story.",
"output": "One day a student went to schoool."
}
]
"""
messages = example['messages']
system = ''
input = ''
conversation = []
while messages and messages[0]['role'] == 'assistant':
# Skip the first one if it is from assistant
messages = messages[1:]
for msg in messages:
if msg['role'] == 'system':
system = msg['content']
elif msg['role'] == 'user':
input += msg['content']
elif msg['role'] == 'assistant':
conversation.append({
'system': system,
'input': input,
'output': msg['content']
})
system = ''
input = ''
else:
raise NotImplementedError
return {'conversation': conversation}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/openorca_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def openorca_map_fn(example):
return {
'conversation': [{
'system': example['system_prompt'],
'input': example['question'],
'output': example['response']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/pretrain_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def pretrain_map_fn(example):
r"""Example before preprocessing:
example['text'] = 'xxx'
Example after preprocessing:
example['conversation'] = [
{
'input': '',
'output': 'xxx'
},
]
"""
return {
'conversation': [{
'input': '',
'output': example['text'].strip(),
'need_eos_token': False
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/sql_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def sql_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.sql,
'input': '{context}\n{question}'.format(**example),
'output': example['answer']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/stack_exchange_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def stack_exchange_map_fn(example):
return {
'conversation': [{
'input': example['question'],
'output': example['response']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/tiny_codes_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.utils import SYSTEM_TEMPLATE
def tiny_codes_map_fn(example):
return {
'conversation': [{
'system': SYSTEM_TEMPLATE.coder,
'input': example['prompt'],
'output': example['response']
}]
}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/dataset_map_fns/wizardlm_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
def wizardlm_map_fn(example):
messages = example['conversations']
input = ''
conversation = []
while messages and messages[0]['from'] == 'gpt':
# Skip the first one if it is from gpt
messages = messages[1:]
for msg in messages:
if msg['from'] == 'human':
input += msg['value']
elif msg['from'] == 'gpt':
conversation.append({'input': input, 'output': msg['value']})
input = ''
else:
raise NotImplementedError
return {'conversation': conversation}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/map_fns/template_map_fn.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from functools import partial
from mmengine.utils.misc import get_object_from_string
def template_map_fn(example, template):
conversation = example.get('conversation', [])
for i, single_turn_conversation in enumerate(conversation):
input = single_turn_conversation.get('input', '')
if input is None:
input = ''
input_text = template.INSTRUCTION.format(input=input, round=i + 1)
system = single_turn_conversation.get('system', '')
if system != '' and system is not None:
system = template.SYSTEM.format(system=system)
input_text = system + input_text
single_turn_conversation['input'] = input_text
if template.get('SUFFIX', None):
output_text = single_turn_conversation.get('output', '')
output_text += template.SUFFIX
single_turn_conversation['output'] = output_text
# SUFFIX_AS_EOS is False ==> need_eos_token is True
single_turn_conversation['need_eos_token'] = \
not template.get('SUFFIX_AS_EOS', False)
single_turn_conversation['sep'] = template.get('SEP', '')
return {'conversation': conversation}
def template_map_fn_factory(template):
if isinstance(template, str): # for resume
template = get_object_from_string(template)
return partial(template_map_fn, template=template)
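# Illustrative sketch (not part of the original module): how a prompt template
# of the shape consumed by ``template_map_fn`` turns one conversation turn
# into text. A plain dict stands in for the template object, and the helper
# name and every template string used with it are hypothetical placeholders,
# not one of xtuner's real templates.
def _sketch_render_turn(turn, template, round_idx=1):
    text = template['INSTRUCTION'].format(
        input=turn.get('input') or '', round=round_idx)
    system = turn.get('system')
    if system:
        # A non-empty system prompt is prepended to the instruction text.
        text = template['SYSTEM'].format(system=system) + text
    # The suffix, if the template defines one, is appended to the output.
    output = turn.get('output', '') + template.get('SUFFIX', '')
    return text, output
# Possible usage, with a made-up template:
#   _sketch_render_turn(
#       {'system': 'Be brief.', 'input': 'Hi', 'output': 'Hello'},
#       dict(SYSTEM='<sys>{system}</sys>\n',
#            INSTRUCTION='<user>{input}</user>\n<bot>',
#            SUFFIX='</bot>'))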
================================================
FILE: xtuner-eval_niah/xtuner/dataset/modelscope.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.config import Config, ConfigDict
from xtuner.registry import BUILDER
from .huggingface import process_hf_dataset
def process_ms_dataset(dataset, split='train', *args, **kwargs):
"""Post-process the dataset loaded from the ModelScope Hub."""
if isinstance(dataset, (Config, ConfigDict)):
dataset = BUILDER.build(dataset)
if isinstance(dataset, dict):
dataset = dataset[split]
dataset = dataset.to_hf_dataset()
return process_hf_dataset(dataset, *args, **kwargs)
================================================
FILE: xtuner-eval_niah/xtuner/dataset/moss_sft.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import copy
import json
import os
import torch
from mmengine.config import Config, ConfigDict
from mmengine.logging import print_log
from torch.utils.data import Dataset
from tqdm import tqdm
from xtuner.registry import BUILDER
class MOSSSFTDataset(Dataset):
def __init__(self, data_file, tokenizer, max_length=2048, bot_name=None):
super().__init__()
self.bot_name = bot_name
self.src_data_file = data_file
        if isinstance(tokenizer, (dict, Config, ConfigDict)):
            self.tokenizer = BUILDER.build(tokenizer)
else:
self.tokenizer = tokenizer
self.max_length = max_length
self.data = []
# We do not calculate losses for the meta instruction or results
# returned by plugins
# The token spans with label -100, [(span_start, span_end), ...]
self.no_loss_spans = []
self.labels = []
self.pre = len(
self.tokenizer.encode('<|Results|>:', add_special_tokens=False))
self.post = len(
self.tokenizer.encode('\n', add_special_tokens=False))
self.load_data()
self.process_data()
def load_data(self):
print_log('Loading MOSS SFT data...', 'current')
name = f'{self.tokenizer.__class__.__name__}_{self.bot_name}'
data_file = self.src_data_file.replace('.jsonl', f'_data_{name}')
no_loss_spans_file = self.src_data_file.replace(
'.jsonl', f'_no_loss_spans_{name}')
if os.path.exists(data_file) and os.path.exists(no_loss_spans_file):
self.data = torch.load(data_file, map_location='cpu')
self.no_loss_spans = torch.load(
no_loss_spans_file, map_location='cpu')
else:
            with open(self.src_data_file, encoding='utf-8') as f:
for line in tqdm(f):
sample = json.loads(line)
chat = sample['chat']
num_turns = int(sample['num_turns'])
meta_instruction = sample['meta_instruction']
if self.bot_name is not None:
meta_instruction = meta_instruction.replace(
'MOSS', self.bot_name)
instruction_ids = self.tokenizer.encode(meta_instruction)
assert isinstance(instruction_ids,
list) and len(instruction_ids) > 0
input_ids = copy.deepcopy(instruction_ids)
no_loss_spans = [(0, len(instruction_ids))]
try:
for i in range(num_turns):
cur_turn_ids = []
cur_no_loss_spans = []
cur_turn = chat[f'turn_{i+1}']
for key, value in cur_turn.items():
if self.bot_name is not None:
value = value.replace(
'MOSS', self.bot_name)
cur_ids = self.tokenizer.encode(
value, add_special_tokens=False)
if key == 'Tool Responses':
# The format tokens
# (<|Results|>:...\n)
# should have losses.
cur_no_loss_spans.append(
(len(input_ids + cur_turn_ids) +
self.pre,
len(input_ids + cur_turn_ids +
cur_ids) - self.post))
assert isinstance(cur_ids,
list) and len(cur_ids) > 0
cur_turn_ids.extend(cur_ids)
if len(input_ids + cur_turn_ids) > self.max_length:
break
input_ids.extend(cur_turn_ids)
no_loss_spans.extend(cur_no_loss_spans)
if len(input_ids) == len(instruction_ids):
continue
assert len(input_ids) > 0 and len(
input_ids) <= self.max_length
self.data.append(input_ids)
self.no_loss_spans.append(no_loss_spans)
except Exception:
pass
torch.save(self.data, data_file)
torch.save(self.no_loss_spans, no_loss_spans_file)
print_log(
f'Load data successfully, total {len(self.data)} training samples',
'current')
def process_data(self):
for item, no_loss in zip(self.data, self.no_loss_spans):
label = copy.deepcopy(item)
for loc in no_loss:
label[loc[0]:loc[1]] = [-100] * (loc[1] - loc[0])
self.labels.append(label)
def __len__(self):
return len(self.data)
def __getitem__(self, index):
return {'input_ids': self.data[index], 'labels': self.labels[index]}
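# Illustrative sketch (not part of the original module): the span-based label
# masking done in ``MOSSSFTDataset.process_data``, shown on plain lists. The
# helper name and the sample ids are hypothetical. Each (start, end) span is
# half-open, and every position inside it gets the label -100, the value the
# module uses for tokens that should not contribute to the loss.
def _sketch_mask_spans(input_ids, no_loss_spans):
    labels = list(input_ids)
    for start, end in no_loss_spans:
        labels[start:end] = [-100] * (end - start)
    return labels
# For example, masking spans (0, 2) and (3, 4) of [1, 2, 3, 4, 5] leaves only
# positions 2 and 4 with their original ids.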
================================================
FILE: xtuner-eval_niah/xtuner/dataset/preference_dataset.py
================================================
import copy
import json
import os
from datetime import timedelta
from functools import partial
from multiprocessing import Process, Queue
from typing import Callable, Dict, List
import numpy as np
import torch.distributed as dist
import tqdm
from datasets import Dataset as HFDataset
from datasets import concatenate_datasets
from mmengine.config import Config, ConfigDict
from mmengine.logging import print_log
from mmengine.utils.misc import get_object_from_string
from torch.utils.data import Dataset
from transformers import AutoTokenizer
from xtuner.registry import BUILDER, MAP_FUNC
from .huggingface import build_origin_dataset
def _worker(
tokenize_fun: Callable,
data_queue: Queue,
out_queue: Queue,
):
while True:
data_chunk = data_queue.get()
if data_chunk is None:
out_queue.put(None)
break
chunk_results = []
for idx, data in data_chunk:
chunk_results.append([idx, tokenize_fun(data)])
out_queue.put(chunk_results)
def _chunk_data_to_queue(data_queue: Queue, data: List[Dict], chunk_size: int,
nproc):
data_iter = iter(data)
chunk_data = []
while True:
try:
item = next(data_iter)
except StopIteration:
break
chunk_data.append(item)
if len(chunk_data) == chunk_size:
data_queue.put(chunk_data)
chunk_data = []
if chunk_data:
data_queue.put(chunk_data)
for _ in range(nproc):
data_queue.put(None)
def _multi_progress(tokenize_fun_p, dataset, nproc, task_num, chunksize,
description):
processes = []
data_queue = Queue()
output_queue = Queue()
bar = tqdm.tqdm(total=task_num, desc=description)
# task_id = bar.add_task(total=task_num, description=description)
dataset = enumerate(dataset)
_chunk_data_to_queue(data_queue, dataset, chunksize, nproc)
for _ in range(nproc):
process = Process(
target=_worker, args=(tokenize_fun_p, data_queue, output_queue))
process.start()
processes.append(process)
results = []
finished_process = 0
while finished_process < nproc:
chunk_results = output_queue.get()
if chunk_results is None:
finished_process += 1
continue
results.extend(chunk_results)
bar.update(len(chunk_results))
bar.refresh()
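    bar.close()
    # Every worker has sent its sentinel by now, so joining cannot block.
    for process in processes:
        process.join()
    results = map(lambda x: x[1], sorted(results, key=lambda x: x[0]))
    return results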
def load_jsonl_dataset(data_files=None, data_dir=None, suffix=None):
assert (data_files is not None) != (data_dir is not None)
if data_dir is not None:
data_files = os.listdir(data_dir)
data_files = [os.path.join(data_dir, fn) for fn in data_files]
if suffix is not None:
data_files = [fp for fp in data_files if fp.endswith(suffix)]
elif isinstance(data_files, str):
data_files = [data_files]
dataset_list = []
for fp in data_files:
with open(fp, encoding='utf-8') as file:
            # Skip blank lines so a trailing newline does not break parsing.
            data = [json.loads(line) for line in file if line.strip()]
ds = HFDataset.from_list(data)
dataset_list.append(ds)
dataset = concatenate_datasets(dataset_list)
return dataset
def tokenize(pair: Dict,
tokenizer: AutoTokenizer,
max_length: int,
is_reward: bool = False,
reward_token_id: int = -1):
prompt = tokenizer.apply_chat_template(
pair['prompt'], tokenize=False, add_generation_prompt=True)
chosen = tokenizer.apply_chat_template(
pair['prompt'] + pair['chosen'],
tokenize=False,
add_generation_prompt=False)
rejected = tokenizer.apply_chat_template(
pair['prompt'] + pair['rejected'],
tokenize=False,
add_generation_prompt=False)
prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
chosen_ids = tokenizer.encode(chosen, add_special_tokens=False)
rejected_ids = tokenizer.encode(rejected, add_special_tokens=False)
if len(chosen_ids) > max_length:
chosen_ids = chosen_ids[:max_length]
if len(rejected_ids) > max_length:
rejected_ids = rejected_ids[:max_length]
if is_reward:
# reward label
chosen_ids = chosen_ids + [reward_token_id]
rejected_ids = rejected_ids + [reward_token_id]
chosen_labels = [-100] * len(chosen_ids[:-1]) + [0]
rejected_labels = [-100] * len(rejected_ids[:-1]) + [1]
else:
# dpo label
prompt_len = min(len(prompt_ids), max_length)
chosen_labels = [-100] * prompt_len + copy.deepcopy(
chosen_ids[prompt_len:])
rejected_labels = [-100] * prompt_len + copy.deepcopy(
rejected_ids[prompt_len:])
return {
'chosen_ids': chosen_ids,
'rejected_ids': rejected_ids,
'chosen_labels': chosen_labels,
'rejected_labels': rejected_labels,
}
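# Illustrative sketch (not part of the original module): the two label layouts
# built by ``tokenize`` above, shown on plain integer lists. The helper names
# and token ids are hypothetical. For DPO, the prompt positions are masked
# with -100 so only response tokens are scored. For reward modelling, every
# position is masked except the appended reward token, whose label marks the
# sequence as chosen (0) or rejected (1).
def _sketch_dpo_labels(prompt_len, ids):
    return [-100] * prompt_len + list(ids[prompt_len:])


def _sketch_reward_labels(ids, reward_token_id, is_chosen):
    ids = list(ids) + [reward_token_id]
    labels = [-100] * (len(ids) - 1) + [0 if is_chosen else 1]
    return ids, labels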
class PreferenceDataset(Dataset):
def __init__(
self,
dataset: HFDataset,
tokenizer: AutoTokenizer,
max_length: int,
is_dpo: bool = True,
is_reward: bool = False,
reward_token_id: int = -1,
num_proc: int = 32,
) -> None:
self.max_length = max_length
        assert is_dpo != is_reward, \
            'Exactly one of is_dpo and is_reward must be True'
if is_reward:
assert reward_token_id != -1, \
'reward_token_id should be set if is_reward is True'
self.is_dpo = is_dpo
self.is_reward = is_reward
self.reward_token_id = reward_token_id
self.tokenized_pairs = []
for tokenized_pair in _multi_progress(
partial(
tokenize,
tokenizer=tokenizer,
max_length=max_length,
is_reward=is_reward,
reward_token_id=reward_token_id),
dataset,
nproc=num_proc,
task_num=len(dataset),
chunksize=num_proc,
description='Tokenizing dataset'):
self.tokenized_pairs.append(tokenized_pair)
def __len__(self):
return len(self.tokenized_pairs)
def __getitem__(self, idx):
return self.tokenized_pairs[idx]
class PackedDatasetWrapper(Dataset):
def __init__(self,
dataset,
max_packed_length=16384,
shuffle_before_pack=True) -> None:
super().__init__()
self.max_packed_length = max_packed_length
self.lengths = []
self.data = []
indices = np.arange(len(dataset))
if shuffle_before_pack:
np.random.shuffle(indices)
data_bin = []
bin_seq_len = 0
removed = 0
for idx in indices:
data = dataset[int(idx)]
cur_len = len(data['chosen_ids']) + len(data['rejected_ids'])
if cur_len > max_packed_length:
print_log(
f'sequence length {cur_len} is '
f'larger than max_packed_length {max_packed_length}',
logger='current')
removed += 1
continue
if (bin_seq_len +
cur_len) > max_packed_length and len(data_bin) > 0:
self.data.append(data_bin)
self.lengths.append(bin_seq_len)
data_bin = []
bin_seq_len = 0
data_bin.append(data)
bin_seq_len += cur_len
if len(data_bin) > 0:
self.data.append(data_bin)
self.lengths.append(bin_seq_len)
if removed > 0:
print_log(
f'removed {removed} samples because '
f'of length larger than {max_packed_length}',
logger='current')
        print_log(
            'The number of samples in the dataset changed '
            f'from {len(dataset)} to {len(self)} after '
            'packing for variable-length attention.',
            logger='current')
def __len__(self):
return len(self.data)
def __getitem__(self, index):
pairs = self.data[index]
input_ids, cu_seqlens, position_ids, labels = [], [0], [], []
for pair in pairs:
input_ids.extend(pair['chosen_ids'])
input_ids.extend(pair['rejected_ids'])
position_ids.extend(list(range(len(pair['chosen_ids']))))
position_ids.extend(list(range(len(pair['rejected_ids']))))
labels.extend(pair['chosen_labels'])
labels.extend(pair['rejected_labels'])
cu_seqlens.append(cu_seqlens[-1] + len(pair['chosen_ids']))
cu_seqlens.append(cu_seqlens[-1] + len(pair['rejected_ids']))
return {
'input_ids': input_ids,
'labels': labels,
'position_ids': position_ids,
'cumulative_len': cu_seqlens
}
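# Illustrative sketch (not part of the original module): the packing scheme of
# ``PackedDatasetWrapper``, reduced to plain sequence lengths. Both helper
# names are hypothetical. The first helper mirrors the greedy binning in
# ``__init__`` without the optional shuffle; the second mirrors how
# ``__getitem__`` lays out cumulative lengths and per-sequence position ids.
def _sketch_greedy_bins(lengths, max_packed_length):
    bins, current, current_len = [], [], 0
    for n in lengths:
        if n > max_packed_length:
            continue  # oversized samples are dropped
        if current and current_len + n > max_packed_length:
            # The current bin is full; close it and start a new one.
            bins.append(current)
            current, current_len = [], 0
        current.append(n)
        current_len += n
    if current:
        bins.append(current)
    return bins


def _sketch_pack_layout(seq_lens):
    cu_seqlens, position_ids = [0], []
    for n in seq_lens:
        position_ids.extend(range(n))  # positions restart for each sequence
        cu_seqlens.append(cu_seqlens[-1] + n)
    return cu_seqlens, position_ids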
def unpack_seq(seq, cu_seqlens):
"""Unpack a packed sequence to a list of sequences with different
lengths."""
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
subseqs = seq.split(seqlens)
return subseqs
def broad_cast_dataset(dataset):
xtuner_dataset_timeout = timedelta(
minutes=int(os.getenv('XTUNER_DATASET_TIMEOUT', default=60)))
print_log(
f'xtuner_dataset_timeout = {xtuner_dataset_timeout}', logger='current')
using_dist = dist.is_available() and dist.is_initialized()
if using_dist:
# monitored barrier requires gloo process group to perform host-side sync. # noqa
group_gloo = dist.new_group(
backend='gloo', timeout=xtuner_dataset_timeout)
if not using_dist or dist.get_rank() == 0:
objects = [dataset]
else:
objects = [None]
if using_dist:
dist.monitored_barrier(
group=group_gloo, timeout=xtuner_dataset_timeout)
dist.broadcast_object_list(objects, src=0)
return objects[0]
def map_dataset(dataset, dataset_map_fn, map_num_proc):
if isinstance(dataset_map_fn, str):
map_fn_obj = MAP_FUNC.get(dataset_map_fn) or get_object_from_string(
dataset_map_fn)
if map_fn_obj is not None:
dataset_map_fn = map_fn_obj
else:
raise TypeError('dataset_map_fn must be a function or a '
"registered function's string in MAP_FUNC, "
f"but got a string of '{dataset_map_fn}'")
dataset = dataset.map(dataset_map_fn, num_proc=map_num_proc)
return dataset
def build_preference_dataset(
dataset: str,
tokenizer: AutoTokenizer,
max_length: int,
dataset_map_fn: Callable = None,
is_dpo: bool = True,
is_reward: bool = False,
reward_token_id: int = -1,
num_proc: int = 32,
use_varlen_attn: bool = False,
max_packed_length: int = 16384,
shuffle_before_pack: bool = True,
) -> Dataset:
using_dist = dist.is_available() and dist.is_initialized()
tokenized_ds = None
if not using_dist or dist.get_rank() == 0:
        if isinstance(tokenizer, (dict, Config, ConfigDict)):
            tokenizer = BUILDER.build(tokenizer)
dataset = build_origin_dataset(dataset, split='train')
if dataset_map_fn is not None:
dataset = map_dataset(
dataset, dataset_map_fn, map_num_proc=num_proc)
tokenized_ds = PreferenceDataset(
dataset=dataset,
tokenizer=tokenizer,
max_length=max_length,
is_dpo=is_dpo,
is_reward=is_reward,
reward_token_id=reward_token_id,
num_proc=num_proc,
)
if use_varlen_attn:
tokenized_ds = PackedDatasetWrapper(
dataset=tokenized_ds,
max_packed_length=max_packed_length,
shuffle_before_pack=shuffle_before_pack,
)
tokenized_ds = broad_cast_dataset(tokenized_ds)
return tokenized_ds
def intel_orca_dpo_map_fn(example):
prompt = [{
'role': 'system',
'content': example['system']
}, {
'role': 'user',
'content': example['question']
}]
chosen = [{'role': 'assistant', 'content': example['chosen']}]
rejected = [{'role': 'assistant', 'content': example['rejected']}]
return {'prompt': prompt, 'chosen': chosen, 'rejected': rejected}
def orpo_dpo_mix_40k_map_fn(example):
assert len(example['chosen']) == len(example['rejected'])
prompt = example['chosen'][:-1]
chosen = example['chosen'][-1:]
rejected = example['rejected'][-1:]
return {'prompt': prompt, 'chosen': chosen, 'rejected': rejected}
================================================
FILE: xtuner-eval_niah/xtuner/dataset/refcoco_json.py
================================================
import copy
import itertools
import json
import os
import pickle
import time
from collections import defaultdict
import matplotlib.pyplot as plt
import numpy as np
import skimage.io as io
import torch
from datasets import Dataset as HFDataset
from datasets import DatasetDict
from matplotlib.patches import Polygon, Rectangle
from mmengine.config import Config, ConfigDict
from PIL import Image
from xtuner.registry import BUILDER
from .huggingface import process_hf_dataset
from .llava import LLaVADataset
from .utils import expand2square
class RefCOCOJsonDataset(LLaVADataset):
instruction_pool = [
'[refer] {}',
'[refer] give me the location of {}',
'[refer] where is {} ?',
'[refer] from this image, tell me the location of {}',
'[refer] the location of {} is',
'[refer] could you tell me the location for {} ?',
'[refer] where can I locate the {} ?',
]
def __init__(
self,
data_path,
image_folder,
tokenizer,
image_processor,
max_dataset_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_length=2048,
pad_image_to_square=False,
):
        with open(data_path) as f:
            json_data = json.load(f)
######################################################
# Only this part is different from LLaVADataset.__init__
json_data = self.reformat_data(json_data)
######################################################
for idx in range(len(json_data)):
if isinstance(json_data[idx]['id'], int):
json_data[idx]['id'] = str(json_data[idx]['id'])
json_data = DatasetDict({'train': HFDataset.from_list(json_data)})
self.text_data = process_hf_dataset(
dataset=json_data,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
split='train',
max_dataset_length=max_dataset_length,
remove_unused_columns=False,
pack_to_max_length=False,
with_image_token=True)
self.image_folder = image_folder
if isinstance(image_processor, dict) or isinstance(
image_processor, Config) or isinstance(image_processor,
ConfigDict):
self.image_processor = BUILDER.build(image_processor)
else:
self.image_processor = image_processor
self.pad_image_to_square = pad_image_to_square
def reformat_data(self, json_data):
new_json_data = []
for sample in json_data:
for instruction_template in self.instruction_pool:
sample['conversations'] = self.gen_refcoco_conversations(
sample, instruction_template)
new_json_data.append(copy.deepcopy(sample))
return new_json_data
@classmethod
def gen_refcoco_conversations(cls, data, instruction_template='{}'):
"""Build conversation data from RefCOCO JSON data, in the form below.
"id": "xxx",
"image": "xxx.jpg",
"conversations": [
{
"from": "human",
"value": "xxxx"
},
{
"from": "gpt",
"value": "xxx"
}
"""
conversation = [
{
'from': 'human',
'value': ''
},
{
'from': 'gpt',
'value': ''
},
]
instruction = instruction_template.format(data['sents'])
bbox = cls.normalize_bbox(data['bbox'], data['height'], data['width'])
answer = '{{<{}><{}><{}><{}>}}'.format(bbox[0], bbox[1], bbox[2],
bbox[3])
conversation[0]['value'] = instruction + '\n'
conversation[1]['value'] = answer
return conversation
@classmethod
def get_data_json(
cls,
ann_path,
image_path,
dataset='refcoco',
splitBy='unc',
):
refer = REFER(ann_path, image_path, dataset, splitBy)
ref_ids = refer.getRefIds(split='train')
data = {}
duplicate_data = defaultdict(list)
for ref_id in ref_ids:
ref = refer.loadRefs(ref_id)[0]
image_id = '{:0>12}'.format(ref['image_id'])
sents = [sent['raw'] for sent in ref['sentences']]
bbox = refer.getRefBox(ref['ref_id'])
image = Image.open(image_path + '/' + image_id + '.jpg')
for sent in sents:
sent_id = '_'.join(sent.split(' '))
data_id = f'{dataset}-{splitBy}-{image_id}-{sent_id}'
data_item = {
'id': data_id,
'image': 'coco/train2017/' + image_id + '.jpg',
'sents': sent,
'bbox': bbox,
'height': image.height,
'width': image.width
}
if data_id in data:
duplicate_data[data_id].append(data_item)
else:
data[data_id] = data_item
return list(data.values()), list(duplicate_data.values())
@classmethod
def normalize_bbox(cls, bbox, height, width):
x, y, w, h = bbox
bbox = [x / width, y / height, (x + w) / width, (y + h) / height]
bbox = [int(x * 100) for x in bbox]
return bbox
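# Illustrative sketch (not part of the original file): how normalize_bbox and
# the answer template in gen_refcoco_conversations combine. The standalone
# helper names and the sample box are assumptions chosen for the example.
# int() truncates toward zero, so coordinates are floored percentages.

```python
def normalize_bbox(bbox, height, width):
    """Convert a COCO-style [x, y, w, h] box to integer percentages."""
    x, y, w, h = bbox
    corners = [x / width, y / height, (x + w) / width, (y + h) / height]
    return [int(v * 100) for v in corners]


def format_bbox(bbox):
    """Render the box with the same '{<x1><y1><x2><y2>}' template as above."""
    return '{{<{}><{}><{}><{}>}}'.format(*bbox)


# Hypothetical 640x480 image with a box at x=160, y=120, w=320, h=120.
box = normalize_bbox([160, 120, 320, 120], height=480, width=640)
answer = format_bbox(box)
print(box)     # [25, 25, 75, 50]
print(answer)  # {<25><25><75><50>}
```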
class RefCOCOJsonEvalDataset(RefCOCOJsonDataset):
instruction_pool = ['[refer] give me the location of {}']
def reformat_data(self, json_data):
for sample in json_data:
# reformat img_id
img_id = sample['img_id'].split('_')[-2]
sample['image'] = 'coco/train2017/' + img_id + '.jpg'
sample['id'] = f"{img_id}-{sample['sents']}"
return super().reformat_data(json_data)
class InvRefCOCOJsonDataset(RefCOCOJsonDataset):
instruction_pool = [
'[identify] {}',
'[identify] what object is in this location {}',
'[identify] identify the object present at this location {}',
'[identify] what is it in {}',
'[identify] describe this object in {}',
'[identify] this {} is',
'[identify] the object in {} is',
]
@classmethod
def gen_refcoco_conversations(cls, data, instruction_template='{}'):
"""Build conversation data from RefCOCO JSON data, in the form below.
"id": "xxx",
"image": "xxx.jpg",
"conversations": [
{
"from": "human",
"value": "xxxx"
},
{
"from": "gpt",
"value": "xxx"
}
"""
conversation = [
{
'from': 'human',
'value': ''
},
{
'from': 'gpt',
'value': ''
},
]
bbox = cls.normalize_bbox(data['bbox'], data['height'], data['width'])
bbox_str = '{{<{}><{}><{}><{}>}}'.format(bbox[0], bbox[1], bbox[2],
bbox[3])
instruction = instruction_template.format(bbox_str)
answer = data['sents']
conversation[0]['value'] = instruction + '\n'
conversation[1]['value'] = answer
return conversation
# flake8: noqa
# Refer
class REFER:
def __init__(self, data_root, vis_root, dataset='refcoco', splitBy='unc'):
# provide data_root folder which contains refclef, refcoco, refcoco+ and refcocog
# also provide dataset name and splitBy information
# e.g., dataset = 'refcoco', splitBy = 'unc'
# inv dataset is stored in the same path as normal dataset
dataset = dataset.split('inv')[-1]
print('loading dataset %s into memory...' % dataset)
self.ann_dir = os.path.join(data_root, dataset)
if dataset in ['refcoco', 'refcoco+', 'refcocog']:
self.vis_root = vis_root
elif dataset == 'refclef':
# Python 3 cannot raise a bare string; raise a real exception instead.
raise ValueError('No RefClef image data')
else:
raise ValueError('No refer dataset is called [%s]' % dataset)
# load refs from data/dataset/refs(dataset).json
tic = time.time()
ref_file = os.path.join(self.ann_dir, 'refs(' + splitBy + ').p')
self.data = {}
self.data['dataset'] = dataset
self.data['refs'] = pickle.load(open(ref_file, 'rb'))
# load annotations from data/dataset/instances.json
instances_file = os.path.join(self.ann_dir, 'instances.json')
instances = json.load(open(instances_file))
self.data['images'] = instances['images']
self.data['annotations'] = instances['annotations']
self.data['categories'] = instances['categories']
# create index
self.createIndex()
print('DONE (t=%.2fs)' % (time.time() - tic))
def createIndex(self):
# create sets of mapping
# 1) Refs: {ref_id: ref}
# 2) Anns: {ann_id: ann}
# 3) Imgs: {image_id: image}
# 4) Cats: {category_id: category_name}
# 5) Sents: {sent_id: sent}
# 6) imgToRefs: {image_id: refs}
# 7) imgToAnns: {image_id: anns}
# 8) refToAnn: {ref_id: ann}
# 9) annToRef: {ann_id: ref}
# 10) catToRefs: {category_id: refs}
# 11) sentToRef: {sent_id: ref}
# 12) sentToTokens: {sent_id: tokens}
print('creating index...')
# fetch info from instances
Anns, Imgs, Cats, imgToAnns = {}, {}, {}, {}
for ann in self.data['annotations']:
Anns[ann['id']] = ann
imgToAnns[ann['image_id']] = imgToAnns.get(ann['image_id'],
[]) + [ann]
for img in self.data['images']:
Imgs[img['id']] = img
for cat in self.data['categories']:
Cats[cat['id']] = cat['name']
# fetch info from refs
Refs, imgToRefs, refToAnn, annToRef, catToRefs = {}, {}, {}, {}, {}
Sents, sentToRef, sentToTokens = {}, {}, {}
for ref in self.data['refs']:
# ids
ref_id = ref['ref_id']
ann_id = ref['ann_id']
category_id = ref['category_id']
image_id = ref['image_id']
# add mapping related to ref
Refs[ref_id] = ref
imgToRefs[image_id] = imgToRefs.get(image_id, []) + [ref]
catToRefs[category_id] = catToRefs.get(category_id, []) + [ref]
refToAnn[ref_id] = Anns[ann_id]
annToRef[ann_id] = ref
# add mapping of sent
for sent in ref['sentences']:
Sents[sent['sent_id']] = sent
sentToRef[sent['sent_id']] = ref
sentToTokens[sent['sent_id']] = sent['tokens']
# create class members
self.Refs = Refs
self.Anns = Anns
self.Imgs = Imgs
self.Cats = Cats
self.Sents = Sents
self.imgToRefs = imgToRefs
self.imgToAnns = imgToAnns
self.refToAnn = refToAnn
self.annToRef = annToRef
self.catToRefs = catToRefs
self.sentToRef = sentToRef
self.sentToTokens = sentToTokens
print('index created.')
def getRefIds(self, image_ids=[], cat_ids=[], ref_ids=[], split=''):
image_ids = image_ids if type(image_ids) == list else [image_ids]
cat_ids = cat_ids if type(cat_ids) == list else [cat_ids]
ref_ids = ref_ids if type(ref_ids) == list else [ref_ids]
if len(image_ids) == len(cat_ids) == len(ref_ids) == len(split) == 0:
refs = self.data['refs']
else:
if not len(image_ids) == 0:
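# Flatten the per-image lists so `refs` is a list of refs, not a
# list of lists (the filters below index each element as a dict).
refs = list(
itertools.chain.from_iterable(
self.imgToRefs[image_id] for image_id in image_ids))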
else:
refs = self.data['refs']
if not len(cat_ids) == 0:
refs = [ref for ref in refs if ref['category_id'] in cat_ids]
if not len(ref_ids) == 0:
refs = [ref for ref in refs if ref['ref_id'] in ref_ids]
if not len(split) == 0:
if split in ['testA', 'testB', 'testC']:
refs = [ref for ref in refs if split[-1] in ref['split']
] # we also consider testAB, testBC, ...
elif split in ['testAB', 'testBC', 'testAC']:
# rarely used I guess...
refs = [ref for ref in refs if ref['split'] == split]
elif split == 'test':
refs = [ref for ref in refs if 'test' in ref['split']]
elif split == 'train' or split == 'val':
refs = [ref for ref in refs if ref['split'] == split]
else:
raise ValueError('No such split [%s]' % split)
ref_ids = [ref['ref_id'] for ref in refs]
return ref_ids
def getAnnIds(self, image_ids=[], cat_ids=[], ref_ids=[]):
image_ids = image_ids if type(image_ids) == list else [image_ids]
cat_ids = cat_ids if type(cat_ids) == list else [cat_ids]
ref_ids = ref_ids if type(ref_ids) == list else [ref_ids]
if len(image_ids) == len(cat_ids) == len(ref_ids) == 0:
ann_ids = [ann['id'] for ann in self.data['annotations']]
else:
if not len(image_ids) == 0:
lists = [
self.imgToAnns[image_id] for image_id in image_ids
if image_id in self.imgToAnns
] # list of [anns]
anns = list(itertools.chain.from_iterable(lists))
else:
anns = self.data['annotations']
if not len(cat_ids) == 0:
anns = [ann for ann in anns if ann['category_id'] in cat_ids]
ann_ids = [ann['id'] for ann in anns]
if not len(ref_ids) == 0:
# Keep only annotations that belong to the requested refs.
ann_ids = list(
set(ann_ids).intersection(
{self.Refs[ref_id]['ann_id']
for ref_id in ref_ids}))
return ann_ids
def getImgIds(self, ref_ids=[]):
ref_ids = ref_ids if type(ref_ids) == list else [ref_ids]
if not len(ref_ids) == 0:
image_ids = list(
{self.Refs[ref_id]['image_id']
for ref_id in ref_ids})
else:
image_ids = self.Imgs.keys()
return image_ids
def getCatIds(self):
return self.Cats.keys()
def loadRefs(self, ref_ids=[]):
if type(ref_ids) == list:
return [self.Refs[ref_id] for ref_id in ref_ids]
elif type(ref_ids) == int:
return [self.Refs[ref_ids]]
def loadAnns(self, ann_ids=[]):
if type(ann_ids) == list:
return [self.Anns[ann_id] for ann_id in ann_ids]
elif type(ann_ids) == int:
return [self.Anns[ann_ids]]
def loadImgs(self, image_ids=[]):
if type(image_ids) == list:
return [self.Imgs[image_id] for image_id in image_ids]
elif type(image_ids) == int:
return [self.Imgs[image_ids]]
def loadCats(self, cat_ids=[]):
if type(cat_ids) == list:
return [self.Cats[cat_id] for cat_id in cat_ids]
elif type(cat_ids) == int:
return [self.Cats[cat_ids]]
def getRefBox(self, ref_id):
ref = self.Refs[ref_id]
ann = self.refToAnn[ref_id]
return ann['bbox'] # [x, y, w, h]
def showRef(self, ref, seg_box='box'):
from matplotlib.collections import PatchCollection
ax = plt.gca()
# show image
image = self.Imgs[ref['image_id']]
I = io.imread(os.path.join(self.vis_root, image['file_name']))
ax.imshow(I)
# show refer expression
for sid, sent in enumerate(ref['sentences']):
print('{}. {}'.format(sid + 1, sent['sent']))
# show segmentations
if seg_box == 'seg':
ann_id = ref['ann_id']
ann = self.Anns[ann_id]
polygons = []
color = []
c = 'none'
if type(ann['segmentation'][0]) == list:
# polygon used for refcoco*
for seg in ann['segmentation']:
# Integer division: reshape requires int dimensions in Python 3.
poly = np.array(seg).reshape((len(seg) // 2, 2))
polygons.append(Polygon(poly, closed=True, alpha=0.4))
color.append(c)
p = PatchCollection(
polygons,
facecolors=color,
edgecolors=(1, 1, 0, 0),
linewidths=3,
alpha=1,
)
ax.add_collection(p) # thick yellow polygon
p = PatchCollection(
polygons,
facecolors=color,
edgecolors=(1, 0, 0, 0),
linewidths=1,
alpha=1,
)
ax.add_collection(p) # thin red polygon
else:
# mask used for refclef
raise NotImplementedError('RefClef is not downloaded')
# show bounding-box
elif seg_box == 'box':
ann_id = ref['ann_id']
ann = self.Anns[ann_id]
bbox = self.getRefBox(ref['ref_id'])
box_plot = Rectangle(
(bbox[0], bbox[1]),
bbox[2],
bbox[3],
fill=False,
edgecolor='green',
linewidth=3,
)
ax.add_patch(box_plot)
================================================
FILE: xtuner-eval_niah/xtuner/dataset/samplers/__init__.py
================================================
from .intern_repo import InternlmRepoSampler, InternRepoSampler
from .length_grouped import LengthGroupedSampler
__all__ = ['LengthGroupedSampler', 'InternRepoSampler', 'InternlmRepoSampler']
================================================
FILE: xtuner-eval_niah/xtuner/dataset/samplers/intern_repo.py
================================================
import logging
import warnings
from typing import Iterator, Optional, Sized
import numpy as np
from mmengine import print_log
from torch.utils.data import Sampler
from xtuner.parallel.sequence import (get_data_parallel_rank,
get_data_parallel_world_size)
class InternRepoSampler(Sampler):
def __init__(self,
dataset: Sized,
shuffle: bool = True,
seed: Optional[int] = None) -> None:
if seed is not None and seed != 1024:
warnings.warn('For alignment accuracy, seed in InternRepoSampler '
'must be set to 1024.')
world_size = get_data_parallel_world_size()
rank = get_data_parallel_rank()
self.rank = rank
self.world_size = world_size
self.dataset = dataset
self.shuffle = shuffle
self.seed = 1024
self.epoch = 0
self.num_samples = len(self.dataset) // world_size
self.total_size = self.num_samples * world_size
def __iter__(self) -> Iterator[int]:
"""Iterate the indices."""
# deterministically shuffle based on epoch and seed
if self.shuffle:
rng = np.random.RandomState(self.seed + self.epoch)
indices = np.arange(len(self.dataset))
rng.shuffle(indices)
indices = indices.tolist()
else:
indices = np.arange(len(self.dataset)).tolist()
self.indices = indices[:self.total_size]
# subsample
indices = indices[self.rank:self.total_size:self.world_size]
self.subsample_indices = indices
return iter(indices)
def __len__(self) -> int:
"""The number of samples in this rank."""
return self.num_samples
def set_epoch(self, epoch: int) -> None:
"""Sets the epoch for this sampler.
When :attr:`shuffle=True`, this ensures all replicas use a different
random ordering for each epoch. Otherwise, the next iteration of this
sampler will yield the same ordering.
Args:
epoch (int): Epoch number.
"""
self.epoch = epoch
class InternlmRepoSampler(InternRepoSampler):
def __init__(self,
dataset: Sized,
shuffle: bool = True,
seed: Optional[int] = None) -> None:
super().__init__(dataset, shuffle, seed)
print_log(('InternlmRepoSampler will be deprecated in the future. '
'Please use InternRepoSampler instead.'),
logger='current',
level=logging.WARNING)
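# Illustrative sketch (not part of the original file): the rank-strided
# subsampling done in InternRepoSampler.__iter__. The function name
# shard_indices is hypothetical, and the stdlib random module stands in for
# np.random.RandomState, so the shuffled order differs from the real sampler.

```python
import random


def shard_indices(dataset_len, rank, world_size, seed=1024, epoch=0,
                  shuffle=True):
    """Return the indices one data-parallel rank would iterate over."""
    num_samples = dataset_len // world_size
    total_size = num_samples * world_size  # drops the non-divisible tail
    indices = list(range(dataset_len))
    if shuffle:
        # Same seed + epoch on every rank gives every rank the same order.
        random.Random(seed + epoch).shuffle(indices)
    # Each rank takes every world_size-th index, starting at its own rank.
    return indices[rank:total_size:world_size]


shards = [shard_indices(10, rank, world_size=4) for rank in range(4)]
unshuffled = shard_indices(10, rank=1, world_size=4, shuffle=False)
print(unshuffled)  # [1, 5]
```

# With 10 samples and 4 ranks, each rank receives 2 indices and 2 samples
# are dropped for that epoch; the shards never overlap.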
================================================
FILE: xtuner-eval_niah/xtuner/dataset/samplers/length_grouped.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
from typing import Iterator, Optional, Sized
import torch
from mmengine.dist import get_dist_info, sync_random_seed
from torch.utils.data import ConcatDataset as TorchConcatDataset
from torch.utils.data import Sampler
def get_length_grouped_indices(lengths, group_batch_size, generator=None):
def process(lengths, group_batch_size, generator=None):
indices = torch.randperm(len(lengths), generator=generator)
megabatches = [
indices[i:i + group_batch_size].tolist()
for i in range(0, len(lengths), group_batch_size)
]
megabatches = [
sorted(megabatch, key=lambda i: lengths[i], reverse=True)
for megabatch in megabatches
]
return megabatches
assert all(leng != 0 for leng in lengths), 'Should not have zero length.'
if all(leng > 0 for leng in lengths) or all(leng < 0 for leng in lengths):
# all samples are in the same modality
megabatches = process(lengths, group_batch_size, generator=generator)
else:
mm_indices, mm_lengths = zip(*[(i, l) for i, l in enumerate(lengths)
if l > 0])
lang_indices, lang_lengths = zip(*[(i, -l)
for i, l in enumerate(lengths)
if l < 0])
mm_megabatches = []
for mm_megabatch in process(
mm_lengths, group_batch_size, generator=generator):
mm_megabatches.append([mm_indices[i] for i in mm_megabatch])
lang_megabatches = []
for lang_megabatch in process(
lang_lengths, group_batch_size, generator=generator):
lang_megabatches.append([lang_indices[i] for i in lang_megabatch])
last_mm = mm_megabatches[-1]
last_lang = lang_megabatches[-1]
last_batch = last_mm + last_lang
megabatches = mm_megabatches[:-1] + lang_megabatches[:-1]
megabatch_indices = torch.randperm(
len(megabatches), generator=generator)
megabatches = [megabatches[i] for i in megabatch_indices]
if len(last_batch) > 0:
megabatches.append(
sorted(
last_batch, key=lambda i: abs(lengths[i]), reverse=True))
# The rest is to get the biggest batch first.
# Since each megabatch is sorted by descending length,
# the longest element is the first
megabatch_maximums = [
abs(lengths[megabatch[0]]) for megabatch in megabatches
]
max_idx = torch.argmax(torch.tensor(megabatch_maximums)).item()
# Switch to put the longest element in first position
megabatches[0][0], megabatches[max_idx][0] = megabatches[max_idx][
0], megabatches[0][0]
return [i for megabatch in megabatches for i in megabatch]
class LengthGroupedSampler(Sampler):
def __init__(self,
dataset: Sized,
per_device_batch_size: int,
length_property='length',
mega_batch_mult: Optional[int] = None,
seed: Optional[int] = None,
round_up: bool = True) -> None:
rank, world_size = get_dist_info()
self.rank = rank
self.world_size = world_size
self.dataset = dataset
if seed is None:
seed = sync_random_seed()
self.seed = seed
self.epoch = 0
self.round_up = round_up
if self.round_up:
num_iters = math.ceil(
len(self.dataset) / world_size / per_device_batch_size)
self.num_samples = num_iters * per_device_batch_size
self.total_size = self.num_samples * self.world_size
else:
self.num_samples = math.ceil(
(len(self.dataset) - rank) / world_size)
self.total_size = len(self.dataset)
total_batch_size = per_device_batch_size * self.world_size
if mega_batch_mult is None:
# Default for mega_batch_mult: 50 or the number to get 4
# megabatches, whichever is smaller.
mega_batch_mult = min(
len(self.dataset) // (total_batch_size * 4), 50)
# Just in case, for tiny datasets
if mega_batch_mult == 0:
mega_batch_mult = 1
self.group_batch_size = mega_batch_mult * total_batch_size
if isinstance(self.dataset, TorchConcatDataset):
length = []
for sub_dataset in self.dataset.datasets:
length.extend(getattr(sub_dataset, length_property))
self.length = length
else:
self.length = getattr(self.dataset, length_property)
assert isinstance(self.length, (list, tuple))
self.total_batch_size = total_batch_size
def __iter__(self) -> Iterator[int]:
"""Iterate the indices."""
generator = torch.Generator()
generator.manual_seed(self.seed + self.epoch)
indices = get_length_grouped_indices(
lengths=self.length,
group_batch_size=self.group_batch_size,
generator=generator)
assert len(set(indices)) == len(indices)
# add extra samples to make it evenly divisible
if self.round_up:
indices = (
indices *
int(self.total_size / len(indices) + 1))[:self.total_size]
# subsample
assert len(indices) == self.total_size
indices = indices[self.rank:self.total_size:self.world_size]
assert len(indices) == self.num_samples
return iter(indices)
def __len__(self) -> int:
"""The number of samples in this rank."""
return self.num_samples
def set_epoch(self, epoch: int) -> None:
"""Sets the epoch for this sampler.
The generator in ``__iter__`` is seeded with ``seed + epoch``, so calling
this before each epoch gives all replicas a new, shared random ordering.
Otherwise, the next iteration of this sampler will yield the same ordering.
Args:
epoch (int): Epoch number.
"""
self.epoch = epoch
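# Illustrative sketch (not part of the original file): the single-modality
# path of get_length_grouped_indices, rewritten with the stdlib random module
# instead of torch.randperm. The function name and sample lengths are
# assumptions; the order produced differs from the torch-based version.

```python
import random


def length_grouped_indices(lengths, group_batch_size, seed=0):
    """Shuffle, cut into megabatches, sort each megabatch by length."""
    rng = random.Random(seed)
    indices = list(range(len(lengths)))
    rng.shuffle(indices)
    megabatches = [
        indices[i:i + group_batch_size]
        for i in range(0, len(indices), group_batch_size)
    ]
    megabatches = [
        sorted(mb, key=lambda i: lengths[i], reverse=True)
        for mb in megabatches
    ]
    # Move the longest sample overall to the very front, so the largest
    # batch is seen first.
    maxima = [lengths[mb[0]] for mb in megabatches]
    max_idx = maxima.index(max(maxima))
    megabatches[0][0], megabatches[max_idx][0] = (
        megabatches[max_idx][0], megabatches[0][0])
    return [i for mb in megabatches for i in mb]


lengths = [5, 120, 33, 8, 64, 17, 250, 42]
order = length_grouped_indices(lengths, group_batch_size=4)
```

# Every index appears exactly once, and order[0] points at the longest
# sample (length 250 here).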
================================================
FILE: xtuner-eval_niah/xtuner/dataset/utils.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import base64
import copy
import io
from io import BytesIO
from itertools import chain
import numpy as np
import requests
from PIL import Image
from xtuner.utils import DEFAULT_IMAGE_TOKEN, IGNORE_INDEX, IMAGE_TOKEN_INDEX
def get_bos_eos_token_ids(tokenizer):
if tokenizer.__class__.__name__ in [
'QWenTokenizer', 'QWen2Tokenizer', 'Qwen2TokenizerFast'
]:
bos_token_id = []
eos_token_id = tokenizer.eos_token_id
assert eos_token_id is not None, \
'Please set eos_token for Qwen tokenizer!'
elif tokenizer.__class__.__name__ == 'ChatGLMTokenizer':
bos_token_id = [64790, 64792]
eos_token_id = tokenizer.eos_token_id
else:
bos_token_id = tokenizer.bos_token_id
eos_token_id = tokenizer.eos_token_id
if isinstance(bos_token_id, int):
bos_token_id = [bos_token_id]
if isinstance(eos_token_id, int):
eos_token_id = [eos_token_id]
return bos_token_id, eos_token_id
def encode_fn(example,
tokenizer,
max_length,
input_ids_with_output=True,
with_image_token=False):
"""We only support the following three scenarios:
1. Incremental pretraining dataset.
example['conversation'] = [
{
'input': '',
'output': '### Human: Can you write xxx'
}
]
2. Single-turn conversation dataset.
example['conversation'] = [
{
'input': 'Give three tips for staying healthy.',
'output': '1.Eat a balanced diet xxx'
}
]
3. Multi-turn conversation dataset.
example['conversation'] = [
{
'input': 'Give three tips for staying healthy.',
'output': '1.Eat a balanced diet xxx'
},
{
'input': 'Please expand on the second point.',
'output': 'Here is an expanded explanation of the xxx'
}
]
"""
bos_token_id, eos_token_id = get_bos_eos_token_ids(tokenizer)
is_multi_turn_conversation = len(example['conversation']) > 1
if is_multi_turn_conversation:
assert input_ids_with_output
input_ids, labels = [], []
next_needs_bos_token = True
for single_turn_conversation in example['conversation']:
input = single_turn_conversation['input']
if DEFAULT_IMAGE_TOKEN in input and with_image_token:
chunk_encode = [
tokenizer.encode(chunk, add_special_tokens=False)
for chunk in input.split(DEFAULT_IMAGE_TOKEN)
]
assert len(chunk_encode) == 2
input_encode = []
for idx, cur_chunk_encode in enumerate(chunk_encode):
input_encode.extend(cur_chunk_encode)
if idx != len(chunk_encode) - 1:
input_encode.append(IMAGE_TOKEN_INDEX)
else:
input_encode = tokenizer.encode(input, add_special_tokens=False)
if next_needs_bos_token:
input_ids += bos_token_id
labels += [IGNORE_INDEX] * len(bos_token_id)
input_ids += input_encode
labels += [IGNORE_INDEX] * len(input_encode)
if input_ids_with_output:
# Add output
output_with_loss = single_turn_conversation.get(
'output_with_loss', True)
output = single_turn_conversation['output']
output_encode = tokenizer.encode(output, add_special_tokens=False)
input_ids += output_encode
if output_with_loss:
labels += copy.deepcopy(output_encode)
else:
labels += [IGNORE_INDEX] * len(output_encode)
# Add EOS_TOKEN (with loss)
if single_turn_conversation.get('need_eos_token', True):
next_needs_bos_token = True
input_ids += eos_token_id
if output_with_loss:
labels += copy.deepcopy(eos_token_id)
else:
labels += [IGNORE_INDEX] * len(eos_token_id)
else:
next_needs_bos_token = False
# Add SEP (without loss)
sep = single_turn_conversation.get('sep', '')
if sep != '':
sep_encode = tokenizer.encode(sep, add_special_tokens=False)
input_ids += sep_encode
labels += [IGNORE_INDEX] * len(sep_encode)
if len(input_ids) > max_length:
input_ids = input_ids[:max_length]
labels = labels[:max_length]
return {'input_ids': input_ids, 'labels': labels}
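# Illustrative sketch (not part of the original file): the label masking that
# encode_fn applies to a single turn. ToyTokenizer, encode_turn, the bos/eos
# ids and IGNORE_INDEX = -100 are assumptions made for this example; the real
# values come from the tokenizer and from xtuner.utils.

```python
IGNORE_INDEX = -100  # assumed value for the example


class ToyTokenizer:
    """Hypothetical tokenizer: one token per character (its code point)."""

    def encode(self, text, add_special_tokens=False):
        return [ord(ch) for ch in text]


def encode_turn(tokenizer, prompt, output, bos=(1,), eos=(2,)):
    """Mask BOS and prompt tokens; keep loss on the output and EOS."""
    prompt_ids = tokenizer.encode(prompt, add_special_tokens=False)
    output_ids = tokenizer.encode(output, add_special_tokens=False)
    input_ids = list(bos) + prompt_ids + output_ids + list(eos)
    labels = ([IGNORE_INDEX] * (len(bos) + len(prompt_ids)) + output_ids +
              list(eos))
    return {'input_ids': input_ids, 'labels': labels}


sample = encode_turn(ToyTokenizer(), 'Hi', 'Yo')
print(sample['input_ids'])  # [1, 72, 105, 89, 111, 2]
print(sample['labels'])     # [-100, -100, -100, 89, 111, 2]
```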
class Packer:
"""Pack multiple pieces of data into one."""
def __init__(self,
chunk_size=2048,
use_varlen_attn=False,
drop_last=False):
self.chunk_size = chunk_size
self.residual = {'input_ids': [], 'labels': []}
self.use_varlen_attn = use_varlen_attn
self.drop_last = drop_last
if use_varlen_attn:
self.residual_cumulative_len = [0]
def get_cumulative_len(self, chunk_num):
ptr_l = 0
cumulative_len = []
for chunk_idx in range(chunk_num):
length_train = (chunk_idx + 1) * self.chunk_size
ptr_r = np.searchsorted(
self.residual_cumulative_len, length_train, side='left')
if self.residual_cumulative_len[ptr_r] == length_train:
cumulative_len_cur = \
self.residual_cumulative_len[ptr_l:ptr_r + 1]
ptr_l = ptr_r + 1
else:
cumulative_len_cur = self.residual_cumulative_len[
ptr_l:ptr_r] + [length_train]
ptr_l = ptr_r
cumulative_len_cur = [
num - chunk_idx * self.chunk_size for num in cumulative_len_cur
]
if cumulative_len_cur[0] != 0:
cumulative_len_cur = [0] + cumulative_len_cur
cumulative_len.append(cumulative_len_cur)
self.residual_cumulative_len = [
num - length_train for num in self.residual_cumulative_len[ptr_l:]
]
if len(self.residual_cumulative_len) == 0:
self.residual_cumulative_len = [0]
elif self.residual_cumulative_len[0] != 0:
self.residual_cumulative_len = [0] + self.residual_cumulative_len
return cumulative_len
def get_position_ids(self, cumulative_len):
position_ids = []
for cumulative_len_cur in cumulative_len:
index_cur = []
for i in range(len(cumulative_len_cur) - 1):
index_cur.extend(
list(
range(cumulative_len_cur[i + 1] - # noqa: W504
cumulative_len_cur[i])))
position_ids.append(index_cur)
return position_ids
def __call__(self, batch):
concatenated_samples = {
k: v + list(chain(*batch[k]))
for k, v in self.residual.items()
}
if self.use_varlen_attn:
for input_id in batch['input_ids']:
self.residual_cumulative_len.append(
self.residual_cumulative_len[-1] + len(input_id))
total_length = len(concatenated_samples[list(
concatenated_samples.keys())[0]])
if total_length >= self.chunk_size:
chunk_num = total_length // self.chunk_size
result = {
k: [
v[i:i + self.chunk_size] for i in range(
0,
chunk_num * # noqa: W504
self.chunk_size,
self.chunk_size)
]
for k, v in concatenated_samples.items()
}
self.residual = {
k: v[(chunk_num * self.chunk_size):]
for k, v in concatenated_samples.items()
}
if self.use_varlen_attn:
cumulative_len = self.get_cumulative_len(chunk_num)
result['cumulative_len'] = cumulative_len
result['position_ids'] = self.get_position_ids(cumulative_len)
else:
if self.drop_last:
result = {k: [] for k, v in concatenated_samples.items()}
else:
result = {k: [v] for k, v in concatenated_samples.items()}
self.residual = {k: [] for k in concatenated_samples.keys()}
if self.use_varlen_attn:
result['cumulative_len'] = [] if self.drop_last else [
self.residual_cumulative_len
]
result['position_ids'] = [] if self.drop_last \
else self.get_position_ids([self.residual_cumulative_len])
self.residual_cumulative_len = [0]
return result
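# Illustrative sketch (not part of the original file): what Packer does to one
# batch, reduced to plain functions. pack, chunk_boundaries and position_ids
# are hypothetical names; they re-derive the chunks, the per-chunk
# cumulative_len and the position_ids without the residual state that Packer
# carries between calls.

```python
from itertools import accumulate


def pack(samples, chunk_size):
    """Concatenate samples and cut them into full chunks plus a residual."""
    flat = [tok for sample in samples for tok in sample]
    chunk_num = len(flat) // chunk_size
    chunks = [
        flat[i * chunk_size:(i + 1) * chunk_size] for i in range(chunk_num)
    ]
    return chunks, flat[chunk_num * chunk_size:]


def chunk_boundaries(sample_lengths, chunk_size, chunk_num):
    """Sample boundaries inside each chunk, relative to the chunk start."""
    ends = list(accumulate(sample_lengths))
    boundaries = []
    for c in range(chunk_num):
        lo, hi = c * chunk_size, (c + 1) * chunk_size
        inner = [e - lo for e in ends if lo < e < hi]
        boundaries.append([0] + inner + [chunk_size])
    return boundaries


def position_ids(boundaries):
    """Restart positions at 0 at every sample boundary."""
    return [[p for a, b in zip(bounds, bounds[1:]) for p in range(b - a)]
            for bounds in boundaries]


samples = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
chunks, residual = pack(samples, chunk_size=4)
bounds = chunk_boundaries([len(s) for s in samples], 4, len(chunks))
print(chunks, residual)      # [[1, 2, 3, 4], [5, 6, 7, 8]] [9]
print(bounds)                # [[0, 3, 4], [0, 1, 4]]
print(position_ids(bounds))  # [[0, 1, 2, 0], [0, 0, 1, 2]]
```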
def expand2square(pil_img, background_color):
width, height = pil_img.size
if width == height:
return pil_img
elif width > height:
result = Image.new(pil_img.mode, (width, width), background_color)
result.paste(pil_img, (0, (width - height) // 2))
return result
else:
result = Image.new(pil_img.mode, (height, height), background_color)
result.paste(pil_img, ((height - width) // 2, 0))
return result
def load_image(image_file):
if image_file.startswith('http://') or image_file.startswith('https://'):
response = requests.get(image_file)
image = Image.open(BytesIO(response.content)).convert('RGB')
else:
image = Image.open(image_file).convert('RGB')
return image
def decode_base64_to_image(base64_string):
image_data = base64.b64decode(base64_string)
image = Image.open(io.BytesIO(image_data))
return image
================================================
FILE: xtuner-eval_niah/xtuner/engine/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from ._strategy import DeepSpeedStrategy
from .hooks import (DatasetInfoHook, EvaluateChatHook, ThroughputHook,
VarlenAttnArgsToMessageHubHook)
from .runner import TrainLoop
__all__ = [
'EvaluateChatHook', 'DatasetInfoHook', 'ThroughputHook',
'VarlenAttnArgsToMessageHubHook', 'DeepSpeedStrategy', 'TrainLoop'
]
================================================
FILE: xtuner-eval_niah/xtuner/engine/_strategy/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .deepspeed import DeepSpeedStrategy
__all__ = ['DeepSpeedStrategy']
================================================
FILE: xtuner-eval_niah/xtuner/engine/_strategy/deepspeed.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional
from mmengine._strategy import DeepSpeedStrategy as MMEngineDeepSpeedStrategy
from xtuner import DS_CEPH_DIR
from xtuner.parallel.sequence import init_sequence_parallel
from xtuner.utils.fileio import patch_fileio
class DeepSpeedStrategy(MMEngineDeepSpeedStrategy):
def __init__(self, *args, **kwargs):
sequence_parallel_size = kwargs.pop('sequence_parallel_size', 1)
self.sequence_parallel_size = sequence_parallel_size
super().__init__(*args, **kwargs)
from transformers.integrations.deepspeed import HfDeepSpeedConfig
# hf_deepspeed_config has to be saved as an attribute.
self.hf_deepspeed_config = HfDeepSpeedConfig(self.config)
def _wrap_model(self, model):
wrapper = super()._wrap_model(model)
# hard code for deepspeed zero3
# When utilizing Zero3, the model isn't allocated to CUDA within the
# `deepspeed.initialize` process.
assert hasattr(wrapper.model, 'data_preprocessor')
wrapper.model.data_preprocessor.cuda()
return wrapper
def save_checkpoint(self, *args, **kwargs) -> None:
if DS_CEPH_DIR:
from os import path as osp
work_dir_prefix = osp.split(self.work_dir)[0]
filename = kwargs['filename'].replace(work_dir_prefix, DS_CEPH_DIR)
kwargs['filename'] = filename
with patch_fileio():
super().save_checkpoint(*args, **kwargs)
else:
super().save_checkpoint(*args, **kwargs)
def load_checkpoint(self, *args, **kwargs) -> None:
if DS_CEPH_DIR:
with patch_fileio():
checkpoint = super().load_checkpoint(*args, **kwargs)
else:
checkpoint = super().load_checkpoint(*args, **kwargs)
return checkpoint
def resume(self, *args, **kwargs) -> None:
if DS_CEPH_DIR:
with patch_fileio():
checkpoint = super().resume(*args, **kwargs)
else:
checkpoint = super().resume(*args, **kwargs)
return checkpoint
def _setup_distributed( # type: ignore
self,
launcher: Optional[str] = None,
backend: str = 'nccl',
**kwargs,
):
super()._setup_distributed(launcher, backend, **kwargs)
init_sequence_parallel(self.sequence_parallel_size)
================================================
FILE: xtuner-eval_niah/xtuner/engine/hooks/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .dataset_info_hook import DatasetInfoHook
from .evaluate_chat_hook import EvaluateChatHook
from .hf_checkpoint_hook import HFCheckpointHook
from .throughput_hook import ThroughputHook
from .varlen_attn_args_to_messagehub_hook import VarlenAttnArgsToMessageHubHook
__all__ = [
'EvaluateChatHook', 'DatasetInfoHook', 'ThroughputHook',
'VarlenAttnArgsToMessageHubHook', 'HFCheckpointHook'
]
================================================
FILE: xtuner-eval_niah/xtuner/engine/hooks/dataset_info_hook.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.hooks import Hook
from xtuner.registry import BUILDER
from xtuner.utils import DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX
def split_list(lst, value):
res = []
tmp_res = []
for i in lst:
if i == value:
res.append(tmp_res)
tmp_res = []
else:
tmp_res.append(i)
res.append(tmp_res)
return res
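# Illustrative sketch (not part of the original file): how split_list is used
# by DatasetInfoHook.log to decode text around image placeholders. The
# placeholder id -200, the '<image>' string and the toy decode function are
# assumptions for the example.

```python
def split_list(lst, value):
    """Split lst at every occurrence of value (the separator is dropped)."""
    res, tmp_res = [], []
    for item in lst:
        if item == value:
            res.append(tmp_res)
            tmp_res = []
        else:
            tmp_res.append(item)
    res.append(tmp_res)
    return res


IMAGE_TOKEN_INDEX = -200  # assumed placeholder id
parts = split_list([72, 105, IMAGE_TOKEN_INDEX, 89, 111], IMAGE_TOKEN_INDEX)


def decode(ids):
    """Toy stand-in for tokenizer.decode: one character per id."""
    return ''.join(chr(i) for i in ids)


text = '<image>'.join(decode(ids) for ids in parts)
print(parts)  # [[72, 105], [89, 111]]
print(text)   # Hi<image>Yo
```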
class DatasetInfoHook(Hook):
def __init__(self, tokenizer, is_intern_repo_dataset=False):
self.tokenizer = BUILDER.build(tokenizer)
self.is_intern_repo_dataset = is_intern_repo_dataset
def log(self, runner, dataset, mode='train'):
runner.logger.info(f'Num {mode} samples {len(dataset)}')
runner.logger.info(f'{mode} example:')
input_ids = dataset[0]['input_ids']
if self.is_intern_repo_dataset:
input_ids = [abs(x) for x in input_ids]
# Try to split list to be compatible with IMAGE token
input_ids = split_list(input_ids, IMAGE_TOKEN_INDEX)
text = ''
for idx, ids in enumerate(input_ids):
text += self.tokenizer.decode(ids)
if idx != len(input_ids) - 1:
text += DEFAULT_IMAGE_TOKEN
runner.logger.info(text)
def before_train(self, runner) -> None:
do_train = runner.train_loop is not None
do_eval = runner.val_loop is not None
if do_train:
train_dataset = runner.train_dataloader.dataset
self.log(runner, train_dataset, mode='train')
if do_eval:
eval_dataset = runner.val_dataloader.dataset
self.log(runner, eval_dataset, mode='eval')
def before_val(self, runner) -> None:
eval_dataset = runner.val_dataloader.dataset
self.log(runner, eval_dataset, mode='eval')
def before_test(self, runner) -> None:
test_dataset = runner.test_dataloader.dataset
self.log(runner, test_dataset, mode='test')
================================================
FILE: xtuner-eval_niah/xtuner/engine/hooks/evaluate_chat_hook.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os
import warnings
import torch
from mmengine.dist import master_only
from mmengine.hooks import Hook
from mmengine.model import is_model_wrapper
from mmengine.utils import mkdir_or_exist
from mmengine.utils.misc import get_object_from_string
from transformers import GenerationConfig, StoppingCriteriaList
from xtuner.dataset.utils import expand2square, load_image
from xtuner.model.utils import prepare_inputs_labels_for_multimodal
from xtuner.registry import BUILDER
from xtuner.utils import (DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX,
StopWordStoppingCriteria)
class EvaluateChatHook(Hook):
priority = 'LOW'
def __init__(self,
tokenizer,
evaluation_inputs,
evaluation_images=None,
image_processor=None,
system='',
prompt_template=None,
every_n_iters=None,
max_new_tokens=600,
stop_word=None,
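stop_words=None,
generation_kwargs=None):
# Copy the arguments so that neither shared default objects nor
# caller-owned objects are mutated below.
stop_words = list(stop_words) if stop_words is not None else []
generation_kwargs = dict(generation_kwargs or {})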
self.evaluation_inputs = evaluation_inputs
if isinstance(self.evaluation_inputs, str):
self.evaluation_inputs = [self.evaluation_inputs]
self.evaluation_images = evaluation_images
if isinstance(self.evaluation_images, str):
self.evaluation_images = [self.evaluation_images]
if self.evaluation_images is not None:
assert len(
self.evaluation_images) in [1, len(self.evaluation_inputs)]
if len(self.evaluation_images) == 1:
self.evaluation_images = [self.evaluation_images[0]] * len(
self.evaluation_inputs)
self.evaluation_images = [
load_image(img) for img in self.evaluation_images
]
if prompt_template is None:
instruction = '{input}'
else:
if isinstance(prompt_template, str): # for resume
prompt_template = get_object_from_string(prompt_template)
instruction = prompt_template.get('INSTRUCTION', '{input}')
if system != '':
system = prompt_template.get(
'SYSTEM', '{system}\n').format(system=system)
stop_words += prompt_template.get('STOP_WORDS', [])
if stop_word is not None:
# TODO: deprecation, v0.3.0
warnings.warn(
('The `stop_word` argument is deprecated and will be removed '
'in v0.3.0, use `stop_words` instead.'), DeprecationWarning)
stop_words.append(stop_word)
self.instruction = instruction
self.system = system
self.every_n_iters = every_n_iters
self.max_new_tokens = max_new_tokens
self.tokenizer = BUILDER.build(tokenizer)
if image_processor is not None:
self.image_processor = BUILDER.build(image_processor)
# default generation config
default_generation_kwargs = dict(
max_new_tokens=max_new_tokens,
do_sample=True,
temperature=0.1,
top_p=0.75,
top_k=40,
eos_token_id=self.tokenizer.eos_token_id,
pad_token_id=self.tokenizer.pad_token_id
if self.tokenizer.pad_token_id is not None else
self.tokenizer.eos_token_id)
default_generation_kwargs.update(generation_kwargs)
self.gen_config = GenerationConfig(**default_generation_kwargs)
self.stop_criteria = StoppingCriteriaList()
for word in stop_words:
self.stop_criteria.append(
StopWordStoppingCriteria(self.tokenizer, word))
self.is_first_run = True
@master_only
def _save_eval_output(self, runner, eval_outputs):
save_path = os.path.join(runner.log_dir, 'vis_data',
f'eval_outputs_iter_{runner.iter}.txt')
mkdir_or_exist(os.path.dirname(save_path))
with open(save_path, 'w', encoding='utf-8') as f:
for i, output in enumerate(eval_outputs):
f.write(f'Eval output {i + 1}:\n{output}\n\n')
def _eval_images(self,
runner,
model,
device,
max_new_tokens=None,
save_eval_output=False):
if save_eval_output:
eval_outputs = []
for sample_image, sample_input in zip(self.evaluation_images,
self.evaluation_inputs):
image = expand2square(
sample_image,
tuple(int(x * 255) for x in self.image_processor.image_mean))
image = self.image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
image = image.to(device)
sample_input = DEFAULT_IMAGE_TOKEN + '\n' + sample_input
inputs = (self.system + self.instruction).format(
input=sample_input, round=1, **runner.cfg)
chunk_encode = []
for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
if idx == 0:
cur_encode = self.tokenizer.encode(chunk)
else:
cur_encode = self.tokenizer.encode(
chunk, add_special_tokens=False)
chunk_encode.append(cur_encode)
assert len(chunk_encode) == 2
input_ids = []
for idx, cur_chunk_encode in enumerate(chunk_encode):
input_ids.extend(cur_chunk_encode)
if idx != len(chunk_encode) - 1:
input_ids.append(IMAGE_TOKEN_INDEX)
input_ids = torch.tensor(input_ids).to(device)
visual_outputs = model.visual_encoder(
image.unsqueeze(0).to(model.visual_encoder.dtype),
output_hidden_states=True)
pixel_values = model.projector(
visual_outputs.hidden_states[model.visual_select_layer][:, 1:])
mm_inputs = prepare_inputs_labels_for_multimodal(
llm=model.llm,
input_ids=input_ids.unsqueeze(0),
pixel_values=pixel_values)
generation_output = model.generate(
**mm_inputs,
max_new_tokens=max_new_tokens,
generation_config=self.gen_config,
bos_token_id=self.tokenizer.bos_token_id,
stopping_criteria=self.stop_criteria)
generation_output = self.tokenizer.decode(generation_output[0])
runner.logger.info(f'Sample output:\n'
f'{inputs + generation_output}\n')
if save_eval_output:
eval_outputs.append(f'{inputs + generation_output}\n')
if save_eval_output:
self._save_eval_output(runner, eval_outputs)
def _eval_language(self,
runner,
model,
device,
max_new_tokens=None,
save_eval_output=False):
if save_eval_output:
eval_outputs = []
for sample_input in self.evaluation_inputs:
inputs = (self.system + self.instruction).format(
input=sample_input, round=1, **runner.cfg)
input_ids = self.tokenizer.encode(inputs, return_tensors='pt')
input_ids = input_ids.to(device)
generation_output = model.generate(
input_ids=input_ids,
max_new_tokens=max_new_tokens,
generation_config=self.gen_config,
stopping_criteria=self.stop_criteria)
generation_output = self.tokenizer.decode(generation_output[0])
runner.logger.info(f'Sample output:\n{generation_output}\n')
if save_eval_output:
eval_outputs.append(f'{generation_output}\n')
if save_eval_output:
self._save_eval_output(runner, eval_outputs)
def _generate_samples(self,
runner,
max_new_tokens=None,
save_eval_output=False):
if max_new_tokens is None:
max_new_tokens = self.max_new_tokens
model = runner.model
if is_model_wrapper(model):
model = model.module
device = next(iter(model.parameters())).device
if self.is_first_run:
# hardcode for qlora DeepSpeed ZeRO3, put buffers and QuantState to
# device
model.to(device)
self.is_first_run = False
is_checkpointing = model.llm.is_gradient_checkpointing
use_cache = model.llm.config.use_cache
# Cast to inference mode
model.activation_checkpointing_disable()
model.llm.config.use_cache = True
model.eval()
if self.evaluation_images is not None:
self._eval_images(runner, model, device, max_new_tokens,
save_eval_output)
else:
self._eval_language(runner, model, device, max_new_tokens,
save_eval_output)
# Cast to training mode
if is_checkpointing:
model.activation_checkpointing_enable()
model.llm.config.use_cache = use_cache
model.train()
def before_train(self, runner):
runner.logger.info('before_train in EvaluateChatHook.')
self._generate_samples(runner, max_new_tokens=50)
def _is_save_checkpoint(self, runner):
hooks = runner.hooks
checkpoint_hook = None
for hook in hooks:
if type(hook).__name__ == 'CheckpointHook':
checkpoint_hook = hook
break
if checkpoint_hook is None or checkpoint_hook.by_epoch:
return False
if checkpoint_hook.every_n_train_iters(
runner, checkpoint_hook.interval, checkpoint_hook.save_begin) or \
(checkpoint_hook.save_last and
checkpoint_hook.is_last_train_iter(runner)):
return True
return False
def after_train_iter(self,
runner,
batch_idx: int,
data_batch=None,
outputs=None) -> None:
if self.every_n_iters is None:
return
save_eval_output = self._is_save_checkpoint(runner)
do_chat = (
save_eval_output
or self.every_n_train_iters(runner, self.every_n_iters))
if not do_chat:
return
runner.logger.info('after_train_iter in EvaluateChatHook.')
self._generate_samples(runner, save_eval_output=save_eval_output)
def after_train(self, runner):
runner.logger.info('after_train in EvaluateChatHook.')
self._generate_samples(runner)
def after_val(self, runner) -> None:
if self.every_n_iters is not None:
return
runner.logger.info('after_val in EvaluateChatHook.')
self._generate_samples(runner)
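# Illustrative sketch (not part of the hook): the chunk-wise encoding used in
# `_eval_images`, where the prompt is split on the image placeholder, only the
# first chunk is encoded with special tokens, and a sentinel index marks each
# image slot. `toy_encode`, `encode_with_image_slots`, and the constant values
# are hypothetical; a real tokenizer and `IMAGE_TOKEN_INDEX` would be used in
# practice.

```python
IMAGE_PLACEHOLDER = '<image>'  # assumed text placeholder
IMAGE_INDEX = -200  # assumed sentinel id for an image slot


def toy_encode(text, add_special_tokens=True):
    """Toy encoder: one id per character, with id 1 as a fake BOS token."""
    ids = [ord(ch) for ch in text]
    return [1] + ids if add_special_tokens else ids


def encode_with_image_slots(prompt, encode):
    """Encode text chunks and put the sentinel id where each image belongs."""
    chunks = prompt.split(IMAGE_PLACEHOLDER)
    input_ids = []
    for idx, chunk in enumerate(chunks):
        # Only the first chunk gets special tokens, so BOS appears once.
        input_ids.extend(encode(chunk, add_special_tokens=(idx == 0)))
        if idx != len(chunks) - 1:
            input_ids.append(IMAGE_INDEX)
    return input_ids


ids = encode_with_image_slots('<image>\nhi', toy_encode)
print(ids)
```

# The sentinel is later replaced by projected visual features, which is why it
# must not collide with any real vocabulary id.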
================================================
FILE: xtuner-eval_niah/xtuner/engine/hooks/hf_checkpoint_hook.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
from pathlib import Path
from typing import Optional, Union
import torch.distributed as dist
from mmengine import print_log
from mmengine._strategy import DeepSpeedStrategy
from mmengine.hooks import Hook
from mmengine.model import is_model_wrapper
from mmengine.runner import FlexibleRunner
from xtuner.registry import BUILDER
from xtuner.utils import get_origin_state_dict
DATA_BATCH = Optional[Union[dict, tuple, list]]
class HFCheckpointHook(Hook):
priority = 95 # lower than CheckpointHook in MMEngine
def __init__(self, out_dir: Optional[Union[str, Path]] = None) -> None:
self.out_dir = out_dir
@staticmethod
def _use_shard_moe(llm):
config = llm.config
moe_implementation = getattr(config, 'moe_implementation', 'origin')
return moe_implementation == 'shard'
def after_run(self, runner) -> None:
assert isinstance(runner,
FlexibleRunner), 'Runner should be `FlexibleRunner`'
assert isinstance(
runner.strategy,
DeepSpeedStrategy), 'Strategy should be `DeepSpeedStrategy`'
if self.out_dir is None:
self.out_dir = osp.join(runner.work_dir, 'hf_model')
wrapped_model = runner.strategy.model
if wrapped_model.zero_optimization_partition_weights():
assert wrapped_model.zero_gather_16bit_weights_on_model_save(), \
('Please set `gather_16bit_weights_on_model_save=True` '
'in your DeepSpeed config.')
state_dict = wrapped_model._zero3_consolidated_16bit_state_dict()
else:
state_dict = wrapped_model.module_state_dict(
exclude_frozen_parameters=runner.strategy.
exclude_frozen_parameters)
model = runner.model
if is_model_wrapper(model):
model = model.module
llm = model.llm
if (not dist.is_initialized()) or dist.get_rank() == 0:
# keys in state_dict are prefixed with 'llm.'
keys = list(state_dict.keys())
for k in keys:
val = state_dict.pop(k)
state_dict[k[4:]] = val
if self._use_shard_moe(llm):
print_log('recover the origin state_dict from merged one ...')
state_dict = get_origin_state_dict(state_dict, llm)
print_log(f'Saving LLM to {self.out_dir}')
llm.save_pretrained(self.out_dir, state_dict=state_dict)
print_log(f'Saving LLM tokenizer to {self.out_dir}')
tokenizer = BUILDER.build(runner.cfg.tokenizer)
tokenizer.save_pretrained(self.out_dir)
================================================
FILE: xtuner-eval_niah/xtuner/engine/hooks/throughput_hook.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import logging
from typing import Optional, Union
import torch
from mmengine import print_log
from mmengine.hooks import Hook
from mmengine.model.wrappers import is_model_wrapper
from torch.utils._pytree import tree_flatten
from xtuner.parallel.sequence import get_sequence_parallel_world_size
DATA_BATCH = Optional[Union[dict, tuple, list]]
class ThroughputHook(Hook):
# priority must be higher than LoggerHook (60) and lower than
# IterTimerHook (50); in MMEngine a smaller value means a higher priority
priority = 55
def __init__(self,
use_activation_checkpointing=None,
hidden_size=None,
num_layers=None,
vocab_size=None,
mlp_ratio=None,
is_casual=None):
self.use_activation_checkpointing = use_activation_checkpointing
self.hidden_size = hidden_size
self.num_layers = num_layers
self.vocab_size = vocab_size
self.mlp_ratio = mlp_ratio
self.is_casual = is_casual
@staticmethod
def _guess_is_casual_attn(model):
for module in model.modules():
if hasattr(module, 'is_causal'):
return module.is_causal
print_log(
'Could not determine whether causal attention is used; '
'FLOPs will be calculated assuming `is_casual=True`.', 'current')
return True
@staticmethod
def _get_batch_size_and_sequence_len(data_batch):
data_list, _ = tree_flatten(data_batch)
for data in data_list:
if isinstance(data, torch.Tensor):
return data.size(0), data.size(1)
raise RuntimeError('No tensor found in the batch')
@staticmethod
def _guess_use_activation_checkpointing(model):
for module in model.modules():
if hasattr(module, 'gradient_checkpointing'):
return module.gradient_checkpointing
return False
def before_run(self, runner) -> None:
if is_model_wrapper(runner.model):
model = runner.model.module
else:
model = runner.model
self.use_activation_checkpointing = \
(self.use_activation_checkpointing or
self._guess_use_activation_checkpointing(model))
self.hidden_size = self.hidden_size or model.config.hidden_size
self.num_layers = self.num_layers or model.config.num_hidden_layers
self.vocab_size = self.vocab_size or model.config.vocab_size
self.mlp_ratio = self.mlp_ratio or (model.config.intermediate_size /
model.config.hidden_size)
self.mlp_ratio *= 1.5 # has gate_proj
self.is_casual = self.is_casual if self.is_casual is not None \
else self._guess_is_casual_attn(model)
use_varlen_attn = getattr(model, 'use_varlen_attn', False)
if use_varlen_attn:
print_log(
'Using variable-length Flash Attention causes an inflation'
' in the FLOPs calculation.',
'current',
level=logging.WARNING)
return
def after_train_iter(self,
runner,
batch_idx: int,
data_batch: DATA_BATCH = None,
outputs: Optional[dict] = None) -> None:
"""Estimate training FLOPs following the Megatron-LM paper:
https://deepakn94.github.io/assets/papers/megatron-sc21.pdf."""
batch_size, sequence_len = self._get_batch_size_and_sequence_len(
data_batch)
sequence_parallel_size = get_sequence_parallel_world_size()
sequence_len /= sequence_parallel_size
message_hub = runner.message_hub
iter_time = message_hub.get_scalar('train/time').current()
# We consider a language model with l transformer layers,
# hidden size h, sequence length s, vocabulary size V, and
# training batch size B.
# An A_{m x k} by X_{k x n} matrix multiplication requires
# 2 * m * k * n FLOPs (factor of 2 needed to account for multiplies
# and adds).
# Attention Layer:
# qkv_proj + o_proj: 8B * s * h^2
# attn (QK^T and attn @ V): 4B * s^2 * h (causal=False) and
# 4B * s^2 * h / 2 (causal=True)
# MLP Layer:
# up_proj + down_proj + gate_proj: 4B * s * h^2 * mlp_ratio
# (In Llama mlp_ratio = intermediate_size / hidden_size * 1.5
# (has gate_proj))
# The backward pass requires double the number of FLOPs since we
# need to calculate the gradients with respect to both input and
# weight tensors. With activation recomputation enabled, an
# additional forward pass is needed before the backward pass.
# Sequence parallelism also affects the attention FLOPs.
# Suppose the sequence length on one GPU is s and the sequence
# parallel world size is `sp_size`. The total sequence length in
# the attention calculation is then `s * sp_size`, and the number
# of attention heads per GPU decreases to `num_heads / sp_size`.
# Hence, the FLOPs of the attention calculation are:
# 4B * (s * sp_size)^2 * (h / sp_size) (causal=False) and
# 4B * (s * sp_size)^2 * (h / sp_size) / 2 (causal=True)
flops_qkvo_proj = 8 * batch_size * sequence_len * self.hidden_size**2
flops_attn = 4 * batch_size * sequence_len**2 * self.hidden_size * \
sequence_parallel_size / (int(self.is_casual) + 1)
flops_mlp = 4 * self.mlp_ratio * batch_size * sequence_len * \
self.hidden_size**2
flops_wo_head = (3 + int(self.use_activation_checkpointing)) * (
flops_qkvo_proj + flops_attn + flops_mlp) * self.num_layers
flops_head = 3 * 2 * batch_size * sequence_len * self.hidden_size * \
self.vocab_size
flops_per_iteration = flops_wo_head + flops_head
avg_tflops_per_gpu = flops_per_iteration / 1e12 / (iter_time + 1e-12)
tokens_per_sec_per_gpu = batch_size * sequence_len / (
iter_time + 1e-12)
message_hub.update_scalar('train/tflops', avg_tflops_per_gpu)
message_hub.update_scalar('train/tokens_per_sec',
tokens_per_sec_per_gpu)
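# Illustrative sketch (not part of the hook): the per-iteration FLOPs estimate
# from `ThroughputHook.after_train_iter` as a standalone function, evaluated on
# a toy configuration. `estimate_train_flops` and its argument names are
# hypothetical; `mlp_ratio` is assumed to already include the 1.5 factor for
# `gate_proj`, and `seq_len` is the full (pre-split) sequence length.

```python
def estimate_train_flops(batch_size,
                         seq_len,
                         hidden_size,
                         num_layers,
                         vocab_size,
                         mlp_ratio,
                         is_causal=True,
                         use_activation_checkpointing=False,
                         sp_size=1):
    """Return the estimated FLOPs of one training iteration on one GPU."""
    s = seq_len / sp_size  # tokens handled by each sequence-parallel rank
    flops_qkvo = 8 * batch_size * s * hidden_size**2
    flops_attn = (4 * batch_size * s**2 * hidden_size * sp_size /
                  (int(is_causal) + 1))
    flops_mlp = 4 * mlp_ratio * batch_size * s * hidden_size**2
    # 1 forward + 2 backward passes, plus 1 recomputed forward if enabled.
    passes = 3 + int(use_activation_checkpointing)
    flops_body = passes * (flops_qkvo + flops_attn + flops_mlp) * num_layers
    flops_head = 3 * 2 * batch_size * s * hidden_size * vocab_size
    return flops_body + flops_head


toy = dict(batch_size=1, seq_len=4, hidden_size=8, num_layers=2,
           vocab_size=16, mlp_ratio=3.0)
flops_plain = estimate_train_flops(**toy)
flops_ckpt = estimate_train_flops(**toy, use_activation_checkpointing=True)
print(flops_plain, flops_ckpt)
```

# Dividing the result by the measured iteration time and 1e12 gives the
# TFLOPs figure that the hook writes to the message hub.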
================================================
FILE: xtuner-eval_niah/xtuner/engine/hooks/varlen_attn_args_to_messagehub_hook.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Union
from mmengine import MessageHub
from mmengine.dist import get_rank
from mmengine.hooks import Hook
DATA_BATCH = Optional[Union[dict, tuple, list]]
class VarlenAttnArgsToMessageHubHook(Hook):
def before_train_iter(self,
runner,
batch_idx: int,
data_batch: dict = None) -> None:
rank = get_rank()
message_hub = MessageHub.get_instance('varlen_attn_args')
assert 'data' in data_batch.keys()
data = data_batch['data']
cumulative_len = data.pop('cumulative_len')
assert len(cumulative_len) == 1
cumulative_len = cumulative_len[0].cuda()
message_hub.update_info(f'cumulative_len_rank_{rank}', cumulative_len)
max_seqlen = data.pop('max_seqlen')
message_hub.update_info(f'max_seqlen_rank_{rank}', max_seqlen)
def after_train_iter(self,
runner,
batch_idx: int,
data_batch: DATA_BATCH = None,
outputs: Optional[dict] = None) -> None:
rank = get_rank()
message_hub = MessageHub.get_instance('varlen_attn_args')
message_hub.update_info(f'cumulative_len_rank_{rank}', None)
message_hub.update_info(f'max_seqlen_rank_{rank}', None)
def before_val_iter(self,
runner,
batch_idx: int,
data_batch: DATA_BATCH = None) -> None:
"""Publish the packed-sequence metadata of the current validation
batch (``cumulative_len`` and ``max_seqlen``) to the message hub.
Args:
runner (Runner): The runner of the validation process.
batch_idx (int): The index of the current batch in the val loop.
data_batch (dict, optional): Data from dataloader.
Defaults to None.
"""
rank = get_rank()
message_hub = MessageHub.get_instance('varlen_attn_args')
assert 'data' in data_batch.keys()
data = data_batch['data']
cumulative_len = data.pop('cumulative_len')
assert len(cumulative_len) == 1
cumulative_len = cumulative_len[0].cuda()
message_hub.update_info(f'cumulative_len_rank_{rank}', cumulative_len)
max_seqlen = data.pop('max_seqlen')
message_hub.update_info(f'max_seqlen_rank_{rank}', max_seqlen)
def after_val_iter(self,
runner,
batch_idx,
data_batch=None,
outputs=None) -> None:
"""Reset the packed-sequence metadata in the message hub once the
validation iteration has finished.
Args:
runner (Runner): The runner of the validation process.
batch_idx (int): The index of the current batch in the val loop.
data_batch (dict or tuple or list, optional): Data from dataloader.
outputs (Sequence, optional): Outputs from model.
"""
rank = get_rank()
message_hub = MessageHub.get_instance('varlen_attn_args')
message_hub.update_info(f'cumulative_len_rank_{rank}', None)
message_hub.update_info(f'max_seqlen_rank_{rank}', None)
================================================
FILE: xtuner-eval_niah/xtuner/engine/runner/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .loops import TrainLoop
__all__ = ['TrainLoop']
================================================
FILE: xtuner-eval_niah/xtuner/engine/runner/loops.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Dict, Optional, Union
from mmengine.runner import IterBasedTrainLoop
from torch.utils.data import DataLoader
class TrainLoop(IterBasedTrainLoop):
    def __init__(self,
                 runner,
                 dataloader: Union[DataLoader, Dict],
                 max_iters: Optional[int] = None,
                 max_epochs: Optional[Union[int, float]] = None,
                 **kwargs) -> None:
        if max_iters is None and max_epochs is None:
            raise RuntimeError('Please specify the `max_iters` or '
                               '`max_epochs` in `train_cfg`.')
        elif max_iters is not None and max_epochs is not None:
            raise RuntimeError('Only one of `max_iters` or `max_epochs` can '
                               'exist in `train_cfg`.')
        else:
            if max_iters is not None:
                iters = int(max_iters)
                assert iters == max_iters, ('`max_iters` should be an integer '
                                            f'number, but got {max_iters}')
            elif max_epochs is not None:
                # Build the dataloader first so that the number of epochs can
                # be converted into a number of iterations.
                if isinstance(dataloader, dict):
                    diff_rank_seed = runner._randomness_cfg.get(
                        'diff_rank_seed', False)
                    dataloader = runner.build_dataloader(
                        dataloader,
                        seed=runner.seed,
                        diff_rank_seed=diff_rank_seed)
                iters = max_epochs * len(dataloader)
            else:
                raise NotImplementedError
        super().__init__(
            runner=runner, dataloader=dataloader, max_iters=iters, **kwargs)
================================================
FILE: xtuner-eval_niah/xtuner/entry_point.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import logging
import os
import random
import subprocess
import sys
from mmengine.logging import print_log
import xtuner
# Define valid modes
MODES = ('list-cfg', 'copy-cfg', 'log-dataset', 'check-custom-dataset',
'train', 'test', 'chat', 'convert', 'preprocess', 'mmbench',
'eval_refcoco', 'list-dataset-format')
CLI_HELP_MSG = \
f"""
Arguments received: {str(['xtuner'] + sys.argv[1:])}. xtuner commands use the following syntax:
xtuner MODE MODE_ARGS ARGS
Where MODE (required) is one of {MODES}
MODE_ARGS (optional) are the arguments for the specific mode
ARGS (optional) are the arguments for the specific command
Some usages for xtuner commands: (See more by using -h for specific command!)
1. List all predefined configs:
xtuner list-cfg
2. Copy a predefined config to a given path:
xtuner copy-cfg $CONFIG $SAVE_FILE
3-1. Fine-tune LLMs on a single GPU:
xtuner train $CONFIG
3-2. Fine-tune LLMs on multiple GPUs:
NPROC_PER_NODE=$NGPUS NNODES=$NNODES NODE_RANK=$NODE_RANK PORT=$PORT ADDR=$ADDR xtuner train $CONFIG
4-1. Convert the pth model to HuggingFace's model:
xtuner convert pth_to_hf $CONFIG $PATH_TO_PTH_MODEL $SAVE_PATH_TO_HF_MODEL
4-2. Merge the HuggingFace's adapter to the pretrained base model:
xtuner convert merge $LLM $ADAPTER $SAVE_PATH
xtuner convert merge $CLIP $ADAPTER $SAVE_PATH --is-clip
4-3. Split HuggingFace's LLM to the smallest sharded one:
xtuner convert split $LLM $SAVE_PATH
5-1. Chat with LLMs with HuggingFace's model and adapter:
xtuner chat $LLM --adapter $ADAPTER --prompt-template $PROMPT_TEMPLATE --system-template $SYSTEM_TEMPLATE
5-2. Chat with VLMs with HuggingFace's model and LLaVA:
xtuner chat $LLM --llava $LLAVA --visual-encoder $VISUAL_ENCODER --image $IMAGE --prompt-template $PROMPT_TEMPLATE --system-template $SYSTEM_TEMPLATE
6-1. Preprocess arxiv dataset:
xtuner preprocess arxiv $SRC_FILE $DST_FILE --start-date $START_DATE --categories $CATEGORIES
6-2. Preprocess refcoco dataset:
xtuner preprocess refcoco --ann-path $RefCOCO_ANN_PATH --image-path $COCO_IMAGE_PATH --save-path $SAVE_PATH
7-1. Log processed dataset:
xtuner log-dataset $CONFIG
7-2. Verify the correctness of the config file for the custom dataset:
xtuner check-custom-dataset $CONFIG
8. MMBench evaluation:
xtuner mmbench $LLM --llava $LLAVA --visual-encoder $VISUAL_ENCODER --prompt-template $PROMPT_TEMPLATE --data-path $MMBENCH_DATA_PATH
9. Refcoco evaluation:
xtuner eval_refcoco $LLM --llava $LLAVA --visual-encoder $VISUAL_ENCODER --prompt-template $PROMPT_TEMPLATE --data-path $REFCOCO_DATA_PATH
10. List all dataset formats which are supported in XTuner:
xtuner list-dataset-format
Run special commands:
xtuner help
xtuner version
GitHub: https://github.com/InternLM/xtuner
""" # noqa: E501
CONVERT_HELP_MSG = \
f"""
Arguments received: {str(['xtuner'] + sys.argv[1:])}. xtuner commands use the following syntax:
xtuner MODE MODE_ARGS ARGS
Where MODE (required) is one of {MODES}
MODE_ARGS (optional) are the arguments for the specific mode
ARGS (optional) are the arguments for the specific command
Some usages for convert: (See more by using -h for specific command!)
1. Convert the pth model to HuggingFace's model:
xtuner convert pth_to_hf $CONFIG $PATH_TO_PTH_MODEL $SAVE_PATH_TO_HF_MODEL
2. Merge the HuggingFace's adapter to the pretrained LLM:
xtuner convert merge $LLM $ADAPTER $SAVE_PATH
3. Split HuggingFace's LLM to the smallest sharded one:
xtuner convert split $LLM $SAVE_PATH
GitHub: https://github.com/InternLM/xtuner
""" # noqa: E501
PREPROCESS_HELP_MSG = \
f"""
Arguments received: {str(['xtuner'] + sys.argv[1:])}. xtuner commands use the following syntax:
xtuner MODE MODE_ARGS ARGS
Where MODE (required) is one of {MODES}
MODE_ARGS (optional) are the arguments for the specific mode
ARGS (optional) are the arguments for the specific command
Some usages for preprocess: (See more by using -h for specific command!)
1. Preprocess arxiv dataset:
xtuner preprocess arxiv $SRC_FILE $DST_FILE --start-date $START_DATE --categories $CATEGORIES
2. Preprocess refcoco dataset:
xtuner preprocess refcoco --ann-path $RefCOCO_ANN_PATH --image-path $COCO_IMAGE_PATH --save-path $SAVE_PATH
GitHub: https://github.com/InternLM/xtuner
""" # noqa: E501
special = {
'help': lambda: print_log(CLI_HELP_MSG, 'current'),
'version': lambda: print_log(xtuner.__version__, 'current')
}
special = {
**special,
**{f'-{k[0]}': v
for k, v in special.items()},
**{f'--{k}': v
for k, v in special.items()}
}
def list_dataset_format():
from xtuner.tools import list_dataset_format
return list_dataset_format.__file__
def list_cfg():
from xtuner.tools import list_cfg
return list_cfg.__file__
def copy_cfg():
from xtuner.tools import copy_cfg
return copy_cfg.__file__
def log_dataset():
from xtuner.tools import log_dataset
return log_dataset.__file__
def check_custom_dataset():
from xtuner.tools import check_custom_dataset
return check_custom_dataset.__file__
def train():
from xtuner.tools import train
return train.__file__
def test():
from xtuner.tools import test
return test.__file__
def chat():
from xtuner.tools import chat
return chat.__file__
def mmbench():
from xtuner.tools import mmbench
return mmbench.__file__
def pth_to_hf():
from xtuner.tools.model_converters import pth_to_hf
return pth_to_hf.__file__
def merge():
from xtuner.tools.model_converters import merge
return merge.__file__
def split():
from xtuner.tools.model_converters import split
return split.__file__
def arxiv_preprocess():
from xtuner.tools.data_preprocess import arxiv as arxiv_preprocess
return arxiv_preprocess.__file__
def convert_refcoco():
from xtuner.tools.data_preprocess import convert_refcoco
return convert_refcoco.__file__
def convert_help_msg():
print_log(CONVERT_HELP_MSG, 'current')
def preprocess_help_msg():
print_log(PREPROCESS_HELP_MSG, 'current')
def eval_refcoco():
from xtuner.tools import eval_refcoco
return eval_refcoco.__file__
modes = {
'list-cfg': list_cfg,
'copy-cfg': copy_cfg,
'log-dataset': log_dataset,
'check-custom-dataset': check_custom_dataset,
'train': train,
'test': test,
'chat': chat,
'mmbench': mmbench,
'convert': {
'pth_to_hf': pth_to_hf,
'merge': merge,
'split': split,
'--help': convert_help_msg,
'-h': convert_help_msg
},
'preprocess': {
'arxiv': arxiv_preprocess,
'refcoco': convert_refcoco,
'--help': preprocess_help_msg,
'-h': preprocess_help_msg
},
'eval_refcoco': eval_refcoco,
'list-dataset-format': list_dataset_format
}
HELP_FUNCS = [preprocess_help_msg, convert_help_msg]
MAP_FILE_FUNCS = [
list_cfg, copy_cfg, log_dataset, check_custom_dataset, train, test, chat,
mmbench, pth_to_hf, merge, split, arxiv_preprocess, eval_refcoco,
convert_refcoco, list_dataset_format
]
def cli():
args = sys.argv[1:]
if not args: # no arguments passed
print_log(CLI_HELP_MSG, 'current')
return
if args[0].lower() in special:
special[args[0].lower()]()
return
elif args[0].lower() in modes:
try:
fn_or_dict = modes[args[0].lower()]
n_arg = 0
if isinstance(fn_or_dict, dict):
n_arg += 1
fn = fn_or_dict[args[n_arg].lower()]
else:
fn = fn_or_dict
assert callable(fn)
if fn in HELP_FUNCS:
fn()
else:
slurm_launcher = False
for i in range(n_arg + 1, len(args)):
if args[i] == '--launcher':
if i + 1 < len(args) and args[i + 1] == 'slurm':
slurm_launcher = True
break
nnodes = int(os.environ.get('NNODES', 1))
nproc_per_node = int(os.environ.get('NPROC_PER_NODE', 1))
if slurm_launcher or (nnodes == 1 and nproc_per_node == 1):
subprocess.run(['python', fn()] + args[n_arg + 1:])
else:
port = os.environ.get('PORT', None)
if port is None:
port = random.randint(20000, 29999)
print_log(f'Use random port: {port}', 'current',
logging.WARNING)
torchrun_args = [
f'--nnodes={nnodes}',
f"--node_rank={os.environ.get('NODE_RANK', 0)}",
f'--nproc_per_node={nproc_per_node}',
f"--master_addr={os.environ.get('ADDR', '127.0.0.1')}",
f'--master_port={port}'
]
subprocess.run(['torchrun'] + torchrun_args + [fn()] +
args[n_arg + 1:] +
['--launcher', 'pytorch'])
except Exception as e:
print_log(f"WARNING: command error: '{e}'!", 'current',
logging.WARNING)
print_log(CLI_HELP_MSG, 'current', logging.WARNING)
return
else:
print_log('WARNING: command error!', 'current', logging.WARNING)
print_log(CLI_HELP_MSG, 'current', logging.WARNING)
return
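# Illustrative sketch (not part of the CLI): how `cli()` chooses between a
# plain `python` launch and a `torchrun` launch based on environment variables.
# `build_launch_command` is a hypothetical helper that only assembles the
# argument list; it does not spawn a process, and the script and config names
# in the demo are made up.

```python
import random


def build_launch_command(script, script_args, env, slurm_launcher=False):
    """Return the argv list that would be passed to ``subprocess.run``."""
    nnodes = int(env.get('NNODES', 1))
    nproc_per_node = int(env.get('NPROC_PER_NODE', 1))
    if slurm_launcher or (nnodes == 1 and nproc_per_node == 1):
        # Single process, or Slurm already handles process spawning.
        return ['python', script] + list(script_args)
    port = env.get('PORT')
    if port is None:
        port = random.randint(20000, 29999)  # same range as the CLI uses
    torchrun_args = [
        f'--nnodes={nnodes}',
        f"--node_rank={env.get('NODE_RANK', 0)}",
        f'--nproc_per_node={nproc_per_node}',
        f"--master_addr={env.get('ADDR', '127.0.0.1')}",
        f'--master_port={port}',
    ]
    return (['torchrun'] + torchrun_args + [script] + list(script_args) +
            ['--launcher', 'pytorch'])


single = build_launch_command('train.py', ['cfg.py'], {})
multi = build_launch_command('train.py', ['cfg.py'],
                             {'NPROC_PER_NODE': '2', 'PORT': '29500'})
print(single)
print(multi)
```

# Passing the environment as a plain dict keeps the decision logic easy to
# test; the real CLI reads `os.environ` directly.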
================================================
FILE: xtuner-eval_niah/xtuner/evaluation/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .metrics import MMLUMetric
__all__ = ['MMLUMetric']
================================================
FILE: xtuner-eval_niah/xtuner/evaluation/metrics/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .mmlu_metric import MMLUMetric
__all__ = ['MMLUMetric']
================================================
FILE: xtuner-eval_niah/xtuner/evaluation/metrics/mmlu_metric.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Any, Sequence
import numpy as np
import torch
from mmengine.evaluator import BaseMetric
from mmengine.logging import print_log
from rich.console import Console
from rich.table import Table
from xtuner.registry import BUILDER
class MMLUMetric(BaseMetric):
METAINFO = {
'subcategories': {
'abstract_algebra': ['math'],
'anatomy': ['health'],
'astronomy': ['physics'],
'business_ethics': ['business'],
'clinical_knowledge': ['health'],
'college_biology': ['biology'],
'college_chemistry': ['chemistry'],
'college_computer_science': ['computer science'],
'college_mathematics': ['math'],
'college_medicine': ['health'],
'college_physics': ['physics'],
'computer_security': ['computer science'],
'conceptual_physics': ['physics'],
'econometrics': ['economics'],
'electrical_engineering': ['engineering'],
'elementary_mathematics': ['math'],
'formal_logic': ['philosophy'],
'global_facts': ['other'],
'high_school_biology': ['biology'],
'high_school_chemistry': ['chemistry'],
'high_school_computer_science': ['computer science'],
'high_school_european_history': ['history'],
'high_school_geography': ['geography'],
'high_school_government_and_politics': ['politics'],
'high_school_macroeconomics': ['economics'],
'high_school_mathematics': ['math'],
'high_school_microeconomics': ['economics'],
'high_school_physics': ['physics'],
'high_school_psychology': ['psychology'],
'high_school_statistics': ['math'],
'high_school_us_history': ['history'],
'high_school_world_history': ['history'],
'human_aging': ['health'],
'human_sexuality': ['culture'],
'international_law': ['law'],
'jurisprudence': ['law'],
'logical_fallacies': ['philosophy'],
'machine_learning': ['computer science'],
'management': ['business'],
'marketing': ['business'],
'medical_genetics': ['health'],
'miscellaneous': ['other'],
'moral_disputes': ['philosophy'],
'moral_scenarios': ['philosophy'],
'nutrition': ['health'],
'philosophy': ['philosophy'],
'prehistory': ['history'],
'professional_accounting': ['other'],
'professional_law': ['law'],
'professional_medicine': ['health'],
'professional_psychology': ['psychology'],
'public_relations': ['politics'],
'security_studies': ['politics'],
'sociology': ['culture'],
'us_foreign_policy': ['politics'],
'virology': ['health'],
'world_religions': ['philosophy'],
},
'categories': {
'STEM': [
'physics', 'chemistry', 'biology', 'computer science', 'math',
'engineering'
],
'humanities': ['history', 'philosophy', 'law'],
'social sciences':
['politics', 'culture', 'economics', 'geography', 'psychology'],
'other (business, health, misc.)': ['other', 'business', 'health'],
},
}
METAINFO['subcategories_list'] = list({
subcat
for subcats in METAINFO['subcategories'].values() for subcat in subcats
})
def __init__(self, tokenizer, *args, **kwargs):
super().__init__(*args, **kwargs)
tokenizer = BUILDER.build(tokenizer)
self.abcd_idx = [
tokenizer.encode('A', add_special_tokens=False)[0],
tokenizer.encode('B', add_special_tokens=False)[0],
tokenizer.encode('C', add_special_tokens=False)[0],
tokenizer.encode('D', add_special_tokens=False)[0],
]
@staticmethod
def ABCD_to_0123(abcd):
return {'A': 0, 'B': 1, 'C': 2, 'D': 3}[abcd]
@staticmethod
def find_first_zero_index(tensor):
indices = torch.nonzero(tensor == 0)
if indices.numel() > 0:
return indices[0].item()
else:
return None
@staticmethod
def accuracy(preds, gts):
"""Computes the accuracy for preds and gts."""
correct = [1 if pred == gt else 0 for pred, gt in zip(preds, gts)]
acc = np.mean(correct) * 100
return acc
def process(self, data_batch: Any, data_samples: Sequence[dict]) -> None:
"""Process one batch of data samples and predictions. The processed
results should be stored in ``self.results``, which will be used to
compute the metrics when all batches have been processed.
Args:
data_batch (Any): A batch of data from the dataloader.
data_samples (Sequence[dict]): A batch of outputs from
the model.
"""
subjects = data_batch['data_samples']['subjects']
gts = [
self.ABCD_to_0123(gt)
for gt in data_batch['data_samples']['labels']
]
preds = []
for sample, attn_mask, subject, gt in zip(
data_samples, data_batch['data']['attention_mask'], subjects,
gts):
pred_logits = sample['logits']
first_zero_idx = self.find_first_zero_index(attn_mask)
pred_idx = -1 if first_zero_idx is None else first_zero_idx - 1
pred_logits_abcd = pred_logits[pred_idx, self.abcd_idx]
pred = torch.argmax(pred_logits_abcd).item()
preds.append(pred)
self.results.append((subject, pred, gt))
def compute_metrics(self, results: list) -> dict:
"""Compute the metrics from processed results.
Args:
results (list): The processed results of each batch.
Returns:
dict: The computed metrics. The keys are the names of the metrics,
and the values are corresponding results.
"""
subjects_results = {
subject: {
'preds': [],
'gts': []
}
for subject in self.METAINFO['subcategories'].keys()
}
subcats_results = {
subcat: {
'preds': [],
'gts': []
}
for subcat in self.METAINFO['subcategories_list']
}
cats_results = {
cat: {
'preds': [],
'gts': []
}
for cat in self.METAINFO['categories'].keys()
}
for subject, pred, gt in results:
subjects_results[subject]['preds'].append(pred)
subjects_results[subject]['gts'].append(gt)
subcats = self.METAINFO['subcategories'][subject]
for subcat in subcats:
subcats_results[subcat]['preds'].append(pred)
subcats_results[subcat]['gts'].append(gt)
for cat, subcats in self.METAINFO['categories'].items():
for subcat in subcats:
if subcat in subcats_results:
cats_results[cat]['preds'].extend(
subcats_results[subcat]['preds'])
cats_results[cat]['gts'].extend(
subcats_results[subcat]['gts'])
subjects_metrics = dict()
subcats_metrics = dict()
cats_metrics = dict()
for subject in self.METAINFO['subcategories'].keys():
assert len(subjects_results[subject]['preds']) == len(
subjects_results[subject]['gts'])
if len(subjects_results[subject]['preds']) == 0:
print_log(f'Skip subject {subject} for mmlu', 'current')
else:
score = self.accuracy(subjects_results[subject]['preds'],
subjects_results[subject]['gts'])
subjects_metrics[f'{subject}'] = score
for subcat in self.METAINFO['subcategories_list']:
assert len(subcats_results[subcat]['preds']) == len(
subcats_results[subcat]['gts'])
if len(subcats_results[subcat]['preds']) == 0:
print_log(f'Skip subcategory {subcat} for mmlu', 'current')
else:
score = self.accuracy(subcats_results[subcat]['preds'],
subcats_results[subcat]['gts'])
subcats_metrics[f'{subcat}'] = score
for cat in self.METAINFO['categories'].keys():
assert len(cats_results[cat]['preds']) == len(
cats_results[cat]['gts'])
if len(cats_results[cat]['preds']) == 0:
print_log(f'Skip category {cat} for mmlu', 'current')
else:
score = self.accuracy(cats_results[cat]['preds'],
cats_results[cat]['gts'])
cats_metrics[f'{cat}'] = score
metrics = dict()
metrics.update(subjects_metrics)
metrics.update(subcats_metrics)
metrics.update(cats_metrics)
metrics['average'] = np.mean(list(subjects_metrics.values()))
table_metrics = dict()
table_metrics.update(cats_metrics)
table_metrics['average'] = np.mean(list(subjects_metrics.values()))
self._print_results(table_metrics)
return metrics
def _print_results(self, table_metrics: dict) -> None:
table_title = ' MMLU Benchmark '
table = Table(title=table_title)
console = Console()
table.add_column('Categories', justify='left')
table.add_column('Accuracy (%)', justify='right')
for cat, acc in table_metrics.items():
table.add_row(cat, f'{acc:.1f}')
with console.capture() as capture:
console.print(table, end='')
print_log('\n' + capture.get(), 'current')
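# Illustrative sketch, not part of xtuner: how `process` above appears to pick
# the answer. It reads the logits at the last non-padded position (one before
# the first zero in the attention mask, or the final position when there is no
# padding) and takes the argmax over the four A/B/C/D token ids. The helper
# names `last_token_index` and `predict_choice` are hypothetical, and plain
# Python lists stand in for tensors.

```python
def last_token_index(attention_mask):
    """Index of the last attended token, assuming right padding with zeros."""
    for i, m in enumerate(attention_mask):
        if m == 0:
            return i - 1
    return -1  # no padding: use the final position


def predict_choice(logits, attention_mask, abcd_idx):
    """Return 0..3 for A..D by comparing only the four candidate token logits."""
    row = logits[last_token_index(attention_mask)]
    scores = [row[token_id] for token_id in abcd_idx]
    return max(range(len(scores)), key=scores.__getitem__)


if __name__ == '__main__':
    toy_logits = [
        [0.0, 0.0, 0.0, 0.0, 0.0],
        [0.1, 2.0, 0.3, 5.0, 0.2],  # last real token
        [9.0, 9.0, 9.0, 9.0, 9.0],  # padding position, ignored
    ]
    print(predict_choice(toy_logits, [1, 1, 0], abcd_idx=[1, 2, 3, 4]))
```

# Restricting the argmax to the four candidate ids means the model is scored
# only on its relative preference among A/B/C/D, not on whether one of them is
# the globally most likely next token.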
================================================
FILE: xtuner-eval_niah/xtuner/evaluation/metrics/reward_metric.py
================================================
import itertools
from collections import defaultdict
from typing import List, Optional, Sequence
import torch
from mmengine.evaluator import BaseMetric
from mmengine.logging import print_log
from rich.console import Console
from rich.table import Table
class RewardMetric(BaseMetric):
r"""Reward model evaluation metric.
"""
default_prefix: Optional[str] = ''
def __init__(self,
collect_device: str = 'cpu',
prefix: Optional[str] = None) -> None:
super().__init__(collect_device=collect_device, prefix=prefix)
def process(self, data_batch, data_samples: Sequence[dict]):
"""Process one batch of data samples.
The processed results should be stored in ``self.results``, which will
be used to compute the metrics when all batches have been processed.
Args:
data_batch: A batch of data from the dataloader.
data_samples (Sequence[dict]): A batch of outputs from the model.
"""
logits = torch.cat(
[sample['logits'].unsqueeze(0) for sample in data_samples], dim=0)
labels = data_batch['data']['labels']
ds_names = data_batch['data_samples']['ds_names']
chosen_idx = torch.where(labels == 0)
rejected_idx = torch.where(labels == 1)
chosen_logits = logits[chosen_idx].cpu()
rejected_logits = logits[rejected_idx].cpu()
correct = (chosen_logits > rejected_logits).cpu()
self.results.append({
'chosen_logits': chosen_logits,
'rejected_logits': rejected_logits,
'correct': correct,
'ds_names': ds_names
})
def compute_metrics(self, results: List):
"""Compute the metrics from processed results.
Args:
results (list): The processed results of each batch.
Returns:
Dict: The computed metrics. The keys are the names of the metrics,
and the values are corresponding results.
"""
# NOTICE: don't access `self.results` from the method.
metrics = {}
correct = torch.cat([res['correct'] for res in results])
chosen_logits = torch.cat([res['chosen_logits'] for res in results])
rejected_logits = torch.cat(
[res['rejected_logits'] for res in results])
ds_names = list(itertools.chain(*[res['ds_names'] for res in results]))
# group by ds_names
grouped_correct = defaultdict(list)
grouped_chosen_logits = defaultdict(list)
grouped_rejected_logits = defaultdict(list)
for i, ds_name in enumerate(ds_names):
grouped_correct[ds_name].append(correct[i])
grouped_chosen_logits[ds_name].append(chosen_logits[i])
grouped_rejected_logits[ds_name].append(rejected_logits[i])
# print metrics in a rich table
table = Table(title='Reward Metrics')
table.add_column('Dataset Name')
table.add_column('Accuracy')
table.add_column('Chosen Score')
table.add_column('Rejected Score')
for ds_name in grouped_correct.keys():
correct = torch.stack(grouped_correct[ds_name])
chosen_logits = torch.stack(grouped_chosen_logits[ds_name])
rejected_logits = torch.stack(grouped_rejected_logits[ds_name])
acc = correct.float().mean()
metrics[f'accuracy/{ds_name}'] = acc.item()
metrics[f'chosen_score/{ds_name}'] = chosen_logits.mean().item()
metrics[f'rejected_score/{ds_name}'] = rejected_logits.mean().item()
table.add_row(ds_name, f'{acc:.4f}', f'{chosen_logits.mean():.4f}',
f'{rejected_logits.mean():.4f}')
console = Console()
with console.capture() as capture:
console.print(table, end='')
print_log('\n' + capture.get(), 'current')
return metrics
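# Illustrative sketch, not part of xtuner: the per-dataset accuracy computed
# above, reduced to plain Python. A pair counts as correct when the chosen
# response scores strictly higher than the rejected one, and accuracy is the
# mean of those flags within each dataset name. The function name
# `grouped_pairwise_accuracy` is hypothetical, and floats stand in for the
# reward logits.

```python
from collections import defaultdict


def grouped_pairwise_accuracy(chosen_scores, rejected_scores, ds_names):
    """Map each dataset name to the fraction of pairs with chosen > rejected."""
    grouped = defaultdict(list)
    for chosen, rejected, name in zip(chosen_scores, rejected_scores,
                                      ds_names):
        grouped[name].append(chosen > rejected)
    return {name: sum(flags) / len(flags) for name, flags in grouped.items()}


if __name__ == '__main__':
    print(
        grouped_pairwise_accuracy(
            chosen_scores=[1.0, 0.2, 0.5],
            rejected_scores=[0.5, 0.4, 0.1],
            ds_names=['a', 'a', 'b']))
```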
================================================
FILE: xtuner-eval_niah/xtuner/model/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .llava import LLaVAModel
from .sft import SupervisedFinetune
__all__ = ['SupervisedFinetune', 'LLaVAModel']
================================================
FILE: xtuner-eval_niah/xtuner/model/dpo.py
================================================
# DPO Authors: Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn 2023 # noqa
# Copyright 2023 The HuggingFace Team. All rights reserved.
# Copyright (c) OpenMMLab. All rights reserved.
from copy import deepcopy
import torch
import torch.distributed as dist
import torch.nn.functional as F
from mmengine import MessageHub
from transformers.integrations import is_deepspeed_zero3_enabled
from xtuner.parallel.sequence import (gather_forward_split_backward,
get_sequence_parallel_group,
get_sequence_parallel_world_size,
split_for_sequence_parallel)
from .sft import SupervisedFinetune
def create_reference_model(model):
if is_deepspeed_zero3_enabled():
raise ValueError('DeepSpeed ZeRO-3 is enabled and is not compatible '
'with `create_reference_model()`. Please instantiate '
'your reference model directly with '
'`AutoModelForCausalLM.from_pretrained()`.')
parameter_names = [n for n, _ in model.named_parameters()]
ref_model = deepcopy(model)
# if no layers are shared, return copy of model
for param_name in parameter_names:
param = ref_model.get_parameter(param_name)
param.requires_grad = False
return ref_model.eval()
class DPO(SupervisedFinetune):
"""A general class of DPO and its variants."""
def __init__(self,
llm,
ref_llm=None,
beta=0.1,
loss_type='sigmoid',
label_smoothing=0.0,
**kwargs):
super().__init__(llm, **kwargs)
self.ref_llm = ref_llm
self.loss_type = loss_type
self.label_smoothing = label_smoothing
self.beta = beta
if not self.use_lora:
self.ref_llm = create_reference_model(self.llm)
def _gather_masked_logits(self, logits, labels, mask):
logits = torch.gather(
logits.log_softmax(-1), dim=2,
index=labels.unsqueeze(2)).squeeze(2)
return logits * mask
def get_logps(
self,
all_logits, # bs, seqlen,vocab_size
all_ref_logits, # bs, seqlen,vocab_size
labels, # bs, seqlen
):
labels = labels[:, 1:].clone()
all_logits = all_logits[:, :-1, :]
all_ref_logits = all_ref_logits[:, :-1, :]
labels[labels == -100] = 0
loss_mask = labels != 0
all_logps = self._gather_masked_logits(all_logits, labels,
loss_mask).sum(-1)
all_ref_logps = self._gather_masked_logits(all_ref_logits, labels,
loss_mask).sum(-1)
if self.loss_type == 'ipo': # average_log_prob
all_logps = all_logps / loss_mask.sum(-1)
all_ref_logps = all_ref_logps / loss_mask.sum(-1)
policy_chosen_logps = all_logps[::2]
policy_rejected_logps = all_logps[1::2]
reference_chosen_logps = all_ref_logps[::2]
reference_rejected_logps = all_ref_logps[1::2]
return (policy_chosen_logps, policy_rejected_logps,
reference_chosen_logps, reference_rejected_logps)
def get_var_len_atten_logps(self, all_logits, all_ref_logits, labels,
cu_seqlens, attention_mask):
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
# unpack sequence
unpacked_logits = torch.split(all_logits, seqlens, dim=1)
unpacked_ref_logits = torch.split(all_ref_logits, seqlens, dim=1)
unpacked_labels = torch.split(labels, seqlens, dim=1)
if attention_mask is not None:
# A non-None attention_mask indicates that the original sequence,
# labels, position_ids and cumulative_len were padded for sequence
# parallel.
# We then need to remove the padded segments.
assert False in attention_mask
unpacked_logits = unpacked_logits[:-1]
unpacked_ref_logits = unpacked_ref_logits[:-1]
unpacked_labels = unpacked_labels[:-1]
assert len(unpacked_logits) % 2 == 0
def compute_logps(_logits, _labels):
_labels = _labels[:, 1:].clone()
_logits = _logits[:, :-1, :]
_labels[_labels == -100] = 0
loss_mask = _labels != 0
logps = self._gather_masked_logits(_logits, _labels, loss_mask)
logps = logps.sum(-1)
if self.loss_type == 'ipo':
logps /= loss_mask.sum(-1)
return logps
(policy_chosen_logps, policy_rejected_logps, reference_chosen_logps,
reference_rejected_logps) = [], [], [], []
for i in range(len(unpacked_logits) // 2):
chosen = unpacked_logits[2 * i]
rejected = unpacked_logits[2 * i + 1]
chosen_ref = unpacked_ref_logits[2 * i]
rejected_ref = unpacked_ref_logits[2 * i + 1]
chosen_label = unpacked_labels[2 * i]
rejected_label = unpacked_labels[2 * i + 1]
policy_chosen_logps.append(compute_logps(chosen, chosen_label))
policy_rejected_logps.append(
compute_logps(rejected, rejected_label))
reference_chosen_logps.append(
compute_logps(chosen_ref, chosen_label))
reference_rejected_logps.append(
compute_logps(rejected_ref, rejected_label))
return (torch.stack(policy_chosen_logps),
torch.stack(policy_rejected_logps),
torch.stack(reference_chosen_logps),
torch.stack(reference_rejected_logps))
@staticmethod
def _split_for_sequence_parallel(data):
# attention mask should not be split
ARGS_NEED_TO_SPLIT = ('input_ids', 'position_ids')
sp_group = get_sequence_parallel_group()
for key in ARGS_NEED_TO_SPLIT:
val = data.get(key, None)
if val is not None:
# `dim` is 1 as the shape of tensor is (bs, seq_len, ...)
data[key] = split_for_sequence_parallel(
val, dim=1, sp_group=sp_group)
return data
def compute_loss(self, data, data_samples=None):
# modified from https://github.com/huggingface/trl/blob/main/trl/trainer/dpo_trainer.py # noqa
labels = data.pop('labels')
if get_sequence_parallel_world_size() > 1:
data = self._split_for_sequence_parallel(data)
all_logits = self.llm(**data).logits
with torch.no_grad():
if self.ref_llm is None:
with self.llm.disable_adapter():
all_ref_logits = self.llm(**data).logits
else:
all_ref_logits = self.ref_llm(**data).logits
if get_sequence_parallel_world_size() > 1:
all_logits = gather_forward_split_backward(
all_logits,
dim=1,
sp_group=get_sequence_parallel_group(),
grad_scale='up')
all_ref_logits = gather_forward_split_backward(
all_ref_logits,
dim=1,
sp_group=get_sequence_parallel_group(),
grad_scale='up')
if not self.use_varlen_attn:
(policy_chosen_logps, policy_rejected_logps,
reference_chosen_logps,
reference_rejected_logps) = self.get_logps(
all_logits, all_ref_logits, labels)
else:
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cu_seqlens = message_hub.get_info(f'cumulative_len_rank_{rank}')
(policy_chosen_logps, policy_rejected_logps,
reference_chosen_logps,
reference_rejected_logps) = self.get_var_len_atten_logps(
all_logits, all_ref_logits, labels, cu_seqlens,
data['attention_mask'])
pi_logratios = policy_chosen_logps - policy_rejected_logps
ref_logratios = reference_chosen_logps - reference_rejected_logps
logits = pi_logratios - ref_logratios
if self.loss_type == 'sigmoid':
loss = (-F.logsigmoid(self.beta * logits) *
(1 - self.label_smoothing) -
F.logsigmoid(-self.beta * logits) * self.label_smoothing)
elif self.loss_type == 'robust':
loss = (-F.logsigmoid(self.beta * logits) *
(1 - self.label_smoothing) +
F.logsigmoid(-self.beta * logits) *
self.label_smoothing) / (1 - 2 * self.label_smoothing)
elif self.loss_type == 'hinge':
loss = torch.relu(1 - self.beta * logits)
elif self.loss_type == 'ipo':
# eqn (17) of the paper where beta is the regularization
# parameter for the IPO loss, denoted by tau in the paper. # noqa
loss = (logits - 1 / (2 * self.beta))**2
elif self.loss_type == 'kto_pair':
# eqn (7) of the HALOs paper
chosen_KL = (policy_chosen_logps -
reference_chosen_logps).mean().clamp(min=0)
rejected_KL = (policy_rejected_logps -
reference_rejected_logps).mean().clamp(min=0)
chosen_logratios = policy_chosen_logps - reference_chosen_logps
rejected_logratios = \
policy_rejected_logps - reference_rejected_logps
# As described in the KTO report, the KL term for chosen (rejected)
# is estimated using the rejected (chosen) half. # noqa
loss = torch.cat(
(
1 - F.sigmoid(self.beta *
(chosen_logratios - rejected_KL)),
1 - F.sigmoid(self.beta *
(chosen_KL - rejected_logratios)),
),
0,
)
elif self.loss_type == 'sppo_hard':
# In the paper (https://arxiv.org/pdf/2405.00675),
# SPPO employs a soft probability approach,
# estimated using the PairRM score. The probability calculation
# is conducted outside of the trainer class.
# The version described here is the hard probability version,
# where P in Equation (4.7) of Algorithm 1 is set to 1 for
# the winner and 0 for the loser.
a = policy_chosen_logps - reference_chosen_logps
b = policy_rejected_logps - reference_rejected_logps
loss = (a - 0.5 / self.beta)**2 + (b + 0.5 / self.beta)**2
elif self.loss_type == 'nca_pair':
chosen_rewards = (policy_chosen_logps -
reference_chosen_logps) * self.beta
rejected_rewards = (policy_rejected_logps -
reference_rejected_logps) * self.beta
loss = (-F.logsigmoid(chosen_rewards) -
0.5 * F.logsigmoid(-chosen_rewards) -
0.5 * F.logsigmoid(-rejected_rewards))
else:
raise ValueError(
f'Unknown loss type: {self.loss_type}. Should be one of '
"['sigmoid', 'hinge', 'ipo', 'kto_pair', "
"'sppo_hard', 'nca_pair', 'robust']")
# for logging
chosen_rewards = self.beta * (
policy_chosen_logps - reference_chosen_logps)
rejected_rewards = self.beta * (
policy_rejected_logps - reference_rejected_logps)
reward_acc = (chosen_rewards > rejected_rewards).float().mean()
loss_dict = {
'loss': loss,
'chosen_rewards': chosen_rewards.mean(),
'rejected_rewards': rejected_rewards.mean(),
'reward_acc': reward_acc,
'reward_margin': (chosen_rewards - rejected_rewards).mean(),
}
return loss_dict
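# Illustrative sketch, not part of xtuner: the 'sigmoid' branch of
# `compute_loss` above, written for a single preference pair with scalar
# sequence log-probabilities. It uses only the standard library instead of
# torch, and the names `log_sigmoid` and `dpo_sigmoid_loss` are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_sigmoid_loss(policy_chosen_logp,
                     policy_rejected_logp,
                     ref_chosen_logp,
                     ref_rejected_logp,
                     beta=0.1,
                     label_smoothing=0.0):
    """DPO sigmoid loss for one (chosen, rejected) pair."""
    pi_logratio = policy_chosen_logp - policy_rejected_logp
    ref_logratio = ref_chosen_logp - ref_rejected_logp
    logits = pi_logratio - ref_logratio
    return (-log_sigmoid(beta * logits) * (1 - label_smoothing) -
            log_sigmoid(-beta * logits) * label_smoothing)


if __name__ == '__main__':
    # Policy identical to the reference: the loss equals log(2).
    print(dpo_sigmoid_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy prefers the chosen response more than the reference does:
    # the loss drops below log(2).
    print(dpo_sigmoid_loss(-1.0, -3.0, -2.0, -2.0))
```

# When the policy and the reference agree, `logits` is zero and the loss is
# log(2). Training lowers it by widening the policy's chosen-vs-rejected
# margin relative to the reference, with `beta` controlling how sharply.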
================================================
FILE: xtuner-eval_niah/xtuner/model/llava.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
import os.path as osp
import warnings
from collections import OrderedDict
import torch
import torch.nn as nn
from accelerate import init_empty_weights
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from mmengine.model import BaseModel
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import (AddedToken, AutoConfig, CLIPImageProcessor,
CLIPVisionModel, LlamaForCausalLM,
LlamaTokenizerFast, LlavaConfig,
LlavaForConditionalGeneration, LlavaProcessor)
from transformers.integrations import is_deepspeed_zero3_enabled
from xtuner.registry import BUILDER
from xtuner.utils import DEFAULT_IMAGE_TOKEN
from .modules import ProjectorConfig, ProjectorModel, dispatch_modules
from .modules.dispatch import SUPPORT_FLASH1, SUPPORT_FLASH2
from .utils import (LoadWoInit, find_all_linear_names,
get_peft_model_state_dict, guess_load_checkpoint,
make_inputs_require_grad,
prepare_inputs_labels_for_multimodal, traverse_dict)
def convert_state_dict_to_hf(state_dict, mapping):
new_state_dict = {}
for key, value in state_dict.items():
if key.endswith('.inv_freq'):
continue
for key_to_modify, new_key in mapping.items():
if key_to_modify in key:
key = key.replace(key_to_modify, new_key)
new_state_dict[key] = value
return new_state_dict
class LLaVAModel(BaseModel):
def __init__(self,
llm,
visual_encoder,
freeze_llm=False,
freeze_visual_encoder=False,
visual_select_layer=-2,
pretrained_pth=None,
projector_depth=2,
llm_lora=None,
visual_encoder_lora=None,
use_activation_checkpointing=True,
max_position_embeddings=None):
super().__init__()
self.freeze_llm = freeze_llm
self.freeze_visual_encoder = freeze_visual_encoder
with LoadWoInit():
if isinstance(llm, dict):
llm = self._dispatch_lm_model_cfg(llm, max_position_embeddings)
self.llm = self._build_from_cfg_or_module(llm)
self.visual_encoder = self._build_from_cfg_or_module(
visual_encoder)
self.llm.config.use_cache = False
dispatch_modules(self.llm)
self.projector_depth = projector_depth
projector_config = ProjectorConfig(
visual_hidden_size=self.visual_encoder.config.hidden_size,
llm_hidden_size=self.llm.config.hidden_size,
depth=self.projector_depth)
self.projector = ProjectorModel(projector_config).to(
self.visual_encoder.dtype)
if self.freeze_llm:
self.llm.requires_grad_(False)
if self.freeze_visual_encoder:
self.visual_encoder.requires_grad_(False)
if use_activation_checkpointing:
# For backward compatibility
if hasattr(self.llm, 'enable_input_require_grads'):
self.llm.enable_input_require_grads()
else:
self.llm.get_input_embeddings().register_forward_hook(
make_inputs_require_grad)
if hasattr(self.visual_encoder, 'enable_input_require_grads'):
self.visual_encoder.enable_input_require_grads()
else:
self.visual_encoder.get_input_embeddings(
).register_forward_hook(make_inputs_require_grad)
self.projector.enable_input_require_grads()
# enable gradient (activation) checkpointing for memory efficiency
self.gradient_checkpointing_enable()
self.use_llm_lora = llm_lora is not None
self.use_visual_encoder_lora = visual_encoder_lora is not None
if self.use_llm_lora:
self._prepare_llm_for_lora(llm_lora, use_activation_checkpointing)
if self.use_visual_encoder_lora:
self._prepare_visual_encoder_for_lora(
visual_encoder_lora, use_activation_checkpointing)
if pretrained_pth is not None:
pretrained_state_dict = guess_load_checkpoint(pretrained_pth)
self.load_state_dict(pretrained_state_dict, strict=False)
print_log(f'Load pretrained weight from {pretrained_pth}',
'current')
self.visual_select_layer = visual_select_layer
self._is_init = True
self.is_first_iter = True
def _parse_lora_config(self, lora_config):
if isinstance(lora_config, dict) or isinstance(
lora_config, Config) or isinstance(lora_config, ConfigDict):
lora_config = BUILDER.build(lora_config)
return lora_config
def _prepare_llm_for_lora(self,
lora_config,
use_activation_checkpointing=True):
lora_config = self._parse_lora_config(lora_config)
self.llm = prepare_model_for_kbit_training(
self.llm, use_activation_checkpointing)
if lora_config.target_modules is None:
modules = find_all_linear_names(self.llm)
lora_config.target_modules = modules
self.llm = get_peft_model(self.llm, lora_config)
def _prepare_visual_encoder_for_lora(self,
lora_config,
use_activation_checkpointing=True):
lora_config = self._parse_lora_config(lora_config)
if lora_config.target_modules is None:
modules = find_all_linear_names(self.visual_encoder)
lora_config.target_modules = modules
self.visual_encoder = get_peft_model(self.visual_encoder, lora_config)
def gradient_checkpointing_enable(self):
self.activation_checkpointing_enable()
def activation_checkpointing_enable(self):
self.llm.gradient_checkpointing_enable()
self.visual_encoder.gradient_checkpointing_enable()
self.projector.gradient_checkpointing_enable()
def gradient_checkpointing_disable(self):
self.activation_checkpointing_disable()
def activation_checkpointing_disable(self):
self.llm.gradient_checkpointing_disable()
self.visual_encoder.gradient_checkpointing_disable()
self.projector.gradient_checkpointing_disable()
def init_weights(self):
pass
def state_dict(self, *args, **kwargs):
state_dict = super().state_dict(*args, **kwargs)
to_return = OrderedDict()
# Step 1. visual_encoder
if self.use_visual_encoder_lora:
to_return.update(
get_peft_model_state_dict(
self.visual_encoder, state_dict=state_dict))
elif not self.freeze_visual_encoder:
to_return.update({
k: v
for k, v in state_dict.items() if 'visual_encoder.' in k
})
# Step 2. LLM
if self.use_llm_lora:
to_return.update(
get_peft_model_state_dict(self.llm, state_dict=state_dict))
elif not self.freeze_llm:
to_return.update(
{k: v
for k, v in state_dict.items() if 'llm.' in k})
# Step 3. Projector
to_return.update(
{k: v
for k, v in state_dict.items() if 'projector.' in k})
return to_return
@staticmethod
def _prepare_for_long_context_training(cfg, llm_cfg,
max_position_embeddings):
orig_rope_scaling = getattr(llm_cfg, 'rope_scaling', None)
if orig_rope_scaling is None:
orig_rope_scaling = {'factor': 1}
orig_rope_scaling_factor = orig_rope_scaling[
'factor'] if 'factor' in orig_rope_scaling.keys() else 1
orig_ctx_len = getattr(llm_cfg, 'max_position_embeddings', None)
if orig_ctx_len:
orig_ctx_len *= orig_rope_scaling_factor
if max_position_embeddings > orig_ctx_len:
scaling_factor = float(
math.ceil(max_position_embeddings / orig_ctx_len))
llm_cfg.rope_scaling = {
'type': 'linear',
'factor': scaling_factor
}
# hardcode for internlm2
llm_cfg.attn_implementation = 'flash_attention_2'
cfg.config = llm_cfg
return cfg, llm_cfg
@staticmethod
def _prepare_for_flash_attn(cfg, llm_cfg):
cls_name = type(llm_cfg).__name__
SUPPORT_SDPA_ATTN = ('LlamaConfig', 'GemmaConfig', 'MistralConfig',
'MixtralConfig', 'Qwen2Config', 'Qwen2MoeConfig',
'Starcoder2Config', 'Phi3Config')
SUPPORT_FLASH_ATTN2 = ('InternLM2Config', 'LlamaConfig', 'GemmaConfig',
'MistralConfig', 'MixtralConfig', 'Qwen2Config',
'Qwen2MoeConfig', 'Starcoder2Config', 'Phi3Config')
torch_dtype = torch.bfloat16 if (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()) \
else torch.float16
if getattr(cfg, 'attn_implementation', None) is not None:
# Flash Attention 2.0 only supports torch.float16 and
# torch.bfloat16 dtypes
if cfg.attn_implementation == 'flash_attention_2':
cfg.torch_dtype = torch_dtype
elif SUPPORT_FLASH2 and cls_name in SUPPORT_FLASH_ATTN2:
cfg.torch_dtype = torch_dtype
cfg.attn_implementation = 'flash_attention_2'
elif SUPPORT_FLASH1 and cls_name in SUPPORT_SDPA_ATTN:
cfg.attn_implementation = 'sdpa'
return cfg, llm_cfg
@staticmethod
def _prepare_for_qlora_zero3(cfg):
if (not is_deepspeed_zero3_enabled()) or (not hasattr(
cfg, 'quantization_config')):
return cfg
torch_dtype = torch.bfloat16 if (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()) \
else torch.float16
cfg.torch_dtype = torch_dtype
quantization_config = cfg.quantization_config
quantization_config.bnb_4bit_compute_dtype = torch_dtype
quantization_config.bnb_4bit_quant_storage = torch_dtype
return cfg
def _dispatch_lm_model_cfg(self, cfg, max_position_embeddings=None):
cfg = self._prepare_for_qlora_zero3(cfg)
pretrained_model_name_or_path = cfg.pretrained_model_name_or_path
llm_cfg = AutoConfig.from_pretrained(
pretrained_model_name_or_path, trust_remote_code=True)
cfg, llm_cfg = self._prepare_for_flash_attn(cfg, llm_cfg)
if max_position_embeddings is not None:
cfg, llm_cfg = self._prepare_for_long_context_training(
cfg, llm_cfg, max_position_embeddings)
return cfg
def _build_from_cfg_or_module(self, cfg_or_mod):
if isinstance(cfg_or_mod, nn.Module):
return cfg_or_mod
elif isinstance(cfg_or_mod, dict):
traverse_dict(cfg_or_mod)
return BUILDER.build(cfg_or_mod)
else:
raise NotImplementedError
def forward(self, data, data_samples=None, mode='loss'):
if self.is_first_iter:
# hardcode for qlora DeepSpeed ZeRO3, put buffers and QuantState to
# device
# Only required in `LLaVAModel` .
# We do not need this in `SupervisedFinetune` .
self.to(data['input_ids'].device)
self.is_first_iter = False
if 'pixel_values' in data:
visual_outputs = self.visual_encoder(
data['pixel_values'].to(self.visual_encoder.dtype),
output_hidden_states=True)
pixel_values = self.projector(
visual_outputs.hidden_states[self.visual_select_layer][:, 1:])
data['pixel_values'] = pixel_values
data = prepare_inputs_labels_for_multimodal(llm=self.llm, **data)
if mode == 'loss':
return self.compute_loss(data, data_samples)
elif mode == 'predict':
return self.predict(data, data_samples)
elif mode == 'tensor':
return self._forward(data, data_samples)
else:
raise NotImplementedError
def _forward(self, data, data_samples=None):
outputs = self.llm(**data)
return outputs
def predict(self, data, data_samples=None):
outputs = self.llm(**data)
logits_dict = [{'logits': logits} for logits in outputs.logits]
return logits_dict
def compute_loss(self, data, data_samples=None):
outputs = self.llm(**data)
loss_dict = {'loss': outputs.loss}
return loss_dict
def __getattr__(self, name: str):
try:
return super().__getattr__(name)
except AttributeError:
return getattr(self.llm, name)
def to_hf(self,
cfg,
save_dir,
fp32=False,
save_pretrained_kwargs={},
save_format='xtuner',
**kwargs):
if save_format == 'xtuner':
self.to_xtuner_llava(cfg, save_dir, fp32, save_pretrained_kwargs)
elif save_format == 'huggingface':
self.to_huggingface_llava(cfg, save_dir, fp32,
save_pretrained_kwargs)
elif save_format == 'official':
self.to_official_llava(cfg, save_dir, fp32, save_pretrained_kwargs)
else:
raise NotImplementedError
def to_xtuner_llava(self,
cfg,
save_dir,
fp32=False,
save_pretrained_kwargs={}):
# LLM
self.llm.config.use_cache = True
if not fp32:
print_log('Convert LLM to float16', 'current')
self.llm.half()
if self.use_llm_lora:
llm_path = osp.join(save_dir, 'llm_adapter')
print_log(f'Saving LLM adapter to {llm_path}', 'current')
self.llm.save_pretrained(llm_path, **save_pretrained_kwargs)
elif not self.freeze_llm:
llm_path = save_dir
print_log(f'Saving LLM tokenizer to {llm_path}', 'current')
tokenizer = BUILDER.build(cfg.tokenizer)
tokenizer.save_pretrained(llm_path, **save_pretrained_kwargs)
print_log(f'Saving LLM to {llm_path}', 'current')
self.llm.save_pretrained(llm_path, **save_pretrained_kwargs)
self.llm.config.use_cache = False
# Visual Encoder
if self.use_visual_encoder_lora:
visual_encoder_path = osp.join(save_dir, 'visual_encoder_adapter')
print_log(
f'Saving visual_encoder adapter to {visual_encoder_path}',
'current')
self.visual_encoder.save_pretrained(visual_encoder_path,
**save_pretrained_kwargs)
elif not self.freeze_visual_encoder:
visual_encoder_path = osp.join(save_dir, 'visual_encoder')
print_log(
'Saving visual_encoder image_processor to '
f'{visual_encoder_path}', 'current')
image_processor = BUILDER.build(cfg.image_processor)
image_processor.save_pretrained(visual_encoder_path,
**save_pretrained_kwargs)
print_log(f'Saving visual_encoder to {visual_encoder_path}',
'current')
self.visual_encoder.save_pretrained(visual_encoder_path,
**save_pretrained_kwargs)
# Projector
projector_path = osp.join(save_dir, 'projector')
print_log(f'Saving projector to {projector_path}', 'current')
self.projector.save_pretrained(projector_path,
**save_pretrained_kwargs)
def to_huggingface_llava(self,
cfg,
save_dir,
fp32=False,
save_pretrained_kwargs={}):
LLM_MAPPING = {
'model': 'language_model.model',
'lm_head': 'language_model.lm_head',
}
VIT_MAPPING = {
'vision_model': 'vision_tower.vision_model',
}
PROJECTOR_MAPPING = {
'model.0': 'multi_modal_projector.linear_1',
'model.2': 'multi_modal_projector.linear_2',
}
assert getattr(self.llm, 'hf_quantizer', None) is None, \
'This conversion format does not support quantized LLM.'
# get state_dict
llm = self.llm
if self.use_llm_lora:
llm = self.llm.merge_and_unload()
llm.config.use_cache = True
if not fp32:
print_log('Convert LLM to float16', 'current')
llm.half()
assert isinstance(llm, LlamaForCausalLM), \
'This conversion format only supports LlamaForCausalLM.'
llm_state_dict = llm.state_dict()
llm_state_dict = convert_state_dict_to_hf(llm_state_dict, LLM_MAPPING)
need_visual_encoder = (not self.freeze_visual_encoder
or self.use_visual_encoder_lora)
visual_encoder = self.visual_encoder
if self.use_visual_encoder_lora:
visual_encoder = self.visual_encoder.merge_and_unload()
assert isinstance(visual_encoder, CLIPVisionModel),\
'This conversion format only supports CLIPVisionModel.'
if need_visual_encoder:
visual_encoder_state_dict = visual_encoder.state_dict()
visual_encoder_state_dict = convert_state_dict_to_hf(
visual_encoder_state_dict, VIT_MAPPING)
else:
visual_encoder_state_dict = {}
projector_state_dict = self.projector.state_dict()
projector_state_dict = convert_state_dict_to_hf(
projector_state_dict, PROJECTOR_MAPPING)
state_dict = {
**projector_state_dict,
**llm_state_dict,
**visual_encoder_state_dict
}
# init model
text_config = llm.config
vision_config = visual_encoder.config
config = LlavaConfig(
text_config=text_config,
vision_config=vision_config,
attn_implementation='eager')
with init_empty_weights():
with warnings.catch_warnings():
warnings.filterwarnings(
'ignore', message='.*non-meta.*', category=UserWarning)
model = LlavaForConditionalGeneration(config)
model.load_state_dict(state_dict, strict=True, assign=True)
# processor
cfg.tokenizer.type = LlamaTokenizerFast.from_pretrained
tokenizer = BUILDER.build(cfg.tokenizer)
tokenizer.add_tokens(
AddedToken(DEFAULT_IMAGE_TOKEN, special=True, normalized=False),
special_tokens=True)
tokenizer.add_special_tokens({'pad_token': '<pad>'})
image_processor = BUILDER.build(cfg.image_processor)
assert isinstance(image_processor, CLIPImageProcessor),\
'This conversion format only supports CLIPImageProcessor.'
processor = LlavaProcessor(
tokenizer=tokenizer, image_processor=image_processor)
# Pad to 64 for performance reasons
pad_shape = 64
pre_expansion_embeddings = \
model.language_model.model.embed_tokens.weight.data
mu = torch.mean(pre_expansion_embeddings, dim=0).float()
n = pre_expansion_embeddings.size()[0]
sigma = ((pre_expansion_embeddings - mu).T
@ (pre_expansion_embeddings - mu)) / n
dist = torch.distributions.multivariate_normal.MultivariateNormal(
mu, covariance_matrix=1e-5 * sigma)
# We add an image token so we need to resize the model
ori_vocab_size = config.text_config.vocab_size
tokenizer_vocab_size = tokenizer.encode('<pad>')[-1]
added_token = tokenizer_vocab_size - ori_vocab_size
if added_token > 0:
model.resize_token_embeddings(ori_vocab_size + added_token,
pad_shape)
model.language_model.model.embed_tokens.weight.data[
ori_vocab_size:] = torch.stack(
tuple(
dist.sample()
for _ in range(model.language_model.model.embed_tokens.
weight.data[ori_vocab_size:].shape[0])),
dim=0,
)
model.language_model.lm_head.weight.data[
ori_vocab_size:] = torch.stack(
tuple(dist.sample()
for _ in range(model.language_model.lm_head.weight.
data[ori_vocab_size:].shape[0])),
dim=0,
)
model.config.image_token_index = tokenizer.encode(
DEFAULT_IMAGE_TOKEN)[-1]
model.config.pad_token_id = tokenizer.encode('<pad>')[-1]
# save
print_log(f'Saving to {save_dir}', 'current')
model.save_pretrained(save_dir, **save_pretrained_kwargs)
processor.save_pretrained(save_dir, **save_pretrained_kwargs)
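# Illustrative sketch (not part of the original file): the conversion above
# resizes the embedding table with a padding multiple of 64. The helper name
# `padded_vocab_size` is hypothetical; it only shows the rounding arithmetic
# and why more rows than the number of added tokens may need initialising.

```python
def padded_vocab_size(n_tokens, multiple=64):
    """Round ``n_tokens`` up to the next multiple of ``multiple``."""
    return -(-n_tokens // multiple) * multiple


original_vocab = 32000          # assumed example value
with_new_tokens = original_vocab + 2
resized = padded_vocab_size(with_new_tokens)
extra_rows = resized - original_vocab
print(resized, extra_rows)
```

# Every row from the original vocabulary size onwards is new, which is
# why the code above samples one vector per row in that slice rather than one
# per added token.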
def to_official_llava(self,
cfg,
save_dir,
fp32=False,
save_pretrained_kwargs={}):
VIT_MAPPING = {
'vision_model': 'model.vision_tower.vision_tower.vision_model',
}
PROJECTOR_MAPPING = {
'model.0': 'model.mm_projector.0',
'model.2': 'model.mm_projector.2',
}
try:
from llava.model import LlavaConfig, LlavaLlamaForCausalLM
except ImportError:
raise ImportError(
'Please install llava with '
'`pip install git+https://github.com/haotian-liu/LLaVA.git '
'--no-deps`.')
assert getattr(self.llm, 'hf_quantizer', None) is None, \
'This conversion format does not support quantized LLM.'
# get state_dict
llm = self.llm
if self.use_llm_lora:
llm = self.llm.merge_and_unload()
llm.config.use_cache = True
if not fp32:
print_log('Convert LLM to float16', 'current')
llm.half()
assert isinstance(llm, LlamaForCausalLM), \
'This conversion format only supports LlamaForCausalLM.'
llm_state_dict = llm.state_dict()
need_visual_encoder = (not self.freeze_visual_encoder
or self.use_visual_encoder_lora)
visual_encoder = self.visual_encoder
if self.use_visual_encoder_lora:
visual_encoder = self.visual_encoder.merge_and_unload()
assert isinstance(visual_encoder, CLIPVisionModel),\
'This conversion format only supports CLIPVisionModel.'
if need_visual_encoder:
visual_encoder_state_dict = visual_encoder.state_dict()
visual_encoder_state_dict = convert_state_dict_to_hf(
visual_encoder_state_dict, VIT_MAPPING)
else:
visual_encoder_state_dict = {}
projector_state_dict = self.projector.state_dict()
projector_state_dict = convert_state_dict_to_hf(
projector_state_dict, PROJECTOR_MAPPING)
state_dict = {
**projector_state_dict,
**llm_state_dict,
**visual_encoder_state_dict
}
# init model
tokenizer = BUILDER.build(cfg.tokenizer)
image_processor = BUILDER.build(cfg.image_processor)
assert isinstance(image_processor, CLIPImageProcessor),\
'This conversion format only supports CLIPImageProcessor.'
llava_config_dict = llm.config.__dict__.copy()
llava_config_dict.update(
dict(
image_aspect_ratio='pad',
mm_hidden_size=visual_encoder.config.hidden_size,
mm_projector_type=f'mlp{self.projector_depth}x_gelu',
mm_use_im_patch_token=False,
mm_use_im_start_end=False,
mm_vision_select_feature='patch',
mm_vision_select_layer=self.visual_select_layer,
mm_vision_tower=visual_encoder.config.name_or_path,
unfreeze_mm_vision_tower=need_visual_encoder,
model_type='llava',
use_cache=True,
use_mm_proj=True))
llava_config = LlavaConfig(**llava_config_dict)
with init_empty_weights():
with warnings.catch_warnings():
warnings.filterwarnings(
'ignore', message='.*non-meta.*', category=UserWarning)
model = LlavaLlamaForCausalLM(llava_config)
model.load_state_dict(state_dict, strict=True, assign=True)
# save
print_log(f'Saving to {save_dir}', 'current')
model.save_pretrained(save_dir, **save_pretrained_kwargs)
image_processor.save_pretrained(save_dir, **save_pretrained_kwargs)
tokenizer.save_pretrained(save_dir, **save_pretrained_kwargs)
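# Illustrative sketch (not part of the original file): both conversion
# methods rename state-dict keys through small mapping tables such as
# PROJECTOR_MAPPING. `convert_state_dict_to_hf` itself is not shown in this
# file, so `remap_keys` below is a hypothetical prefix-based stand-in, and
# the real helper may match keys differently.

```python
def remap_keys(state_dict, mapping):
    """Rename keys whose leading component path matches a mapping entry."""
    renamed = {}
    for key, value in state_dict.items():
        for old, new in mapping.items():
            if key == old or key.startswith(old + '.'):
                key = new + key[len(old):]
                break
        renamed[key] = value
    return renamed


PROJECTOR_MAPPING = {
    'model.0': 'multi_modal_projector.linear_1',
    'model.2': 'multi_modal_projector.linear_2',
}
toy_state = {
    'model.0.weight': 'w0',
    'model.0.bias': 'b0',
    'model.2.weight': 'w2',
    'other.weight': 'x',
}
converted = remap_keys(toy_state, PROJECTOR_MAPPING)
for name in converted:
    print(name)
```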
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/__init__.py
================================================
from .dispatch import dispatch_modules
from .projector import ProjectorConfig, ProjectorModel
__all__ = ['dispatch_modules', 'ProjectorConfig', 'ProjectorModel']
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os
import types
import torch
import transformers
from mmengine.config.lazy import LazyObject
from mmengine.utils import digit_version
from transformers.utils.import_utils import is_flash_attn_2_available
TRANSFORMERS_VERSION = digit_version(transformers.__version__)
IS_LOW_VERSION_TRANSFORMERS = TRANSFORMERS_VERSION < digit_version('4.38')
# Transformers requires torch version >= 2.1.1 when using Torch SDPA.
# Refer to https://github.com/huggingface/transformers/blob/caa5c65db1f4db617cdac2ad667ba62edf94dd98/src/transformers/modeling_utils.py#L1611 # noqa: E501
SUPPORT_FLASH1 = digit_version(torch.__version__) >= digit_version('2.1.1')
SUPPORT_FLASH2 = is_flash_attn_2_available()
SUPPORT_FLASH = SUPPORT_FLASH1 or SUPPORT_FLASH2
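# Environment variables are strings, so `bool('0')` would be True; treat an
# unset, empty, or '0' value as disabled.
USE_TRITON_KERNEL = os.getenv('USE_TRITON_KERNEL', '0') not in ('', '0')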
SUPPORT_TRITON = False
try:
    import triton  # pre-check # noqa: F401
    import triton.language as tl  # pre-check # noqa: F401
    SUPPORT_TRITON = True
except ImportError:
    if USE_TRITON_KERNEL:
        raise RuntimeError(
            'USE_TRITON_KERNEL is set to 1, but triton has not been installed.'
            ' Run `pip install triton==2.1.0` to install triton.')
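# Illustrative sketch (not part of the original file): environment variables
# always arrive as strings, and any non-empty string is truthy, so a flag
# such as USE_TRITON_KERNEL needs explicit parsing. `env_flag` is a
# hypothetical helper, and the set of "false" spellings is an assumption.

```python
import os


def env_flag(name, default=False):
    """Interpret an environment variable as a boolean switch."""
    raw = os.environ.get(name)
    if raw is None:
        return default
    return raw.strip().lower() not in ('', '0', 'false', 'no', 'off')


# bool() on a non-empty string is True, even for the string '0'.
print(bool('0'))

os.environ['DEMO_FLAG'] = '0'
print(env_flag('DEMO_FLAG'))
os.environ['DEMO_FLAG'] = '1'
print(env_flag('DEMO_FLAG'))
```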
NO_ATTN_WEIGHTS_MSG = (
'Due to the implementation of the PyTorch version of flash attention, '
'even when the `output_attentions` flag is set to True, it is not '
'possible to return the `attn_weights`.')
LOWEST_TRANSFORMERS_VERSION = dict(
InternLM2ForCausalLM=digit_version('4.36'),
InternLMForCausalLM=digit_version('4.36'),
LlamaForCausalLM=digit_version('4.36'),
Phi3ForCausalLM=digit_version('4.39'),
MistralForCausalLM=digit_version('4.36'),
# Training mixtral with lower version may lead to nccl timeout
# Refer to https://github.com/microsoft/DeepSpeed/issues/5066
MixtralForCausalLM=digit_version('4.40'),
CohereForCausalLM=digit_version('4.40'),
Qwen2ForCausalLM=digit_version('4.39'),
Qwen2MoeForCausalLM=digit_version('4.40'),
DeepseekV2ForCausalLM=digit_version('4.40'),
)
ATTN_DISPATCH_MAPPING = dict(
InternLM2FlashAttention2=LazyObject(
'xtuner.model.modules.dispatch.internlm2', 'internlm2_attn_forward'),
InternLMAttention=LazyObject('xtuner.model.modules.dispatch.internlm',
'internlm_attn_forward'),
LlamaFlashAttention2=LazyObject('xtuner.model.modules.dispatch.llama',
'llama_attn_forward'),
Phi3FlashAttention2=LazyObject('xtuner.model.modules.dispatch.phi3',
'phi3_attn_forward'),
MistralFlashAttention2=LazyObject('xtuner.model.modules.dispatch.mistral',
'mistral_attn_forward'),
MixtralFlashAttention2=LazyObject('xtuner.model.modules.dispatch.mistral',
'mistral_attn_forward'),
CohereFlashAttention2=LazyObject('xtuner.model.modules.dispatch.cohere',
'cohere_attn_forward'),
Qwen2FlashAttention2=LazyObject('xtuner.model.modules.dispatch.qwen2',
'qwen2_attn_forward'),
Qwen2MoeFlashAttention2=LazyObject('xtuner.model.modules.dispatch.qwen2',
'qwen2_attn_forward'),
DeepseekV2FlashAttention2=LazyObject(
'xtuner.model.modules.dispatch.deepseek_v2', 'deepseek_attn_forward'),
)
ATTN_LEGACY_DISPATCH_MAPPING = dict(
LlamaFlashAttention2=LazyObject('xtuner.model.modules.dispatch.llama',
'llama_attn_forward_legacy'), )
VARLEN_ATTN_DISPATCH_MAPPING = dict(
InternLM2FlashAttention2=LazyObject(
'xtuner.model.modules.dispatch.internlm2',
'internlm2_varlen_attn_forward'),
InternLMAttention=LazyObject('xtuner.model.modules.dispatch.internlm',
'internlm_varlen_attn_forward'),
LlamaFlashAttention2=LazyObject('xtuner.model.modules.dispatch.llama',
'llama_varlen_attn_forward'),
Phi3FlashAttention2=LazyObject('xtuner.model.modules.dispatch.phi3',
'phi3_varlen_attn_forward'),
MistralFlashAttention2=LazyObject('xtuner.model.modules.dispatch.mistral',
'mistral_varlen_attn_forward'),
MixtralFlashAttention2=LazyObject('xtuner.model.modules.dispatch.mistral',
'mistral_varlen_attn_forward'),
CohereFlashAttention2=None,
Qwen2FlashAttention2=LazyObject('xtuner.model.modules.dispatch.qwen2',
'qwen2_varlen_attn_forward'),
Qwen2MoeFlashAttention2=LazyObject('xtuner.model.modules.dispatch.qwen2',
'qwen2_varlen_attn_forward'),
DeepseekV2FlashAttention2=LazyObject(
'xtuner.model.modules.dispatch.deepseek_v2',
'deepseek_varlen_attn_forward'),
)
VARLEN_ATTN_LEGACY_DISPATCH_MAPPING = dict(
LlamaFlashAttention2=LazyObject('xtuner.model.modules.dispatch.llama',
'llama_varlen_attn_forward_legacy'), )
RMS_DISPATCH_MAPPING = dict(
InternLM2RMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
InternLMRMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
LlamaRMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
Phi3RMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
MistralRMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
MixtralRMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
CohereLayerNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'layer_norm_forward'),
Qwen2RMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
Qwen2MoeRMSNorm=LazyObject('xtuner.model.modules.dispatch.triton_kernels',
'rms_norm_forward'),
)
ROTE_DISPATCH_MAPPING = dict(
InternLMRotaryEmbedding=LazyObject(
'xtuner.model.modules.dispatch.internlm', 'InternLMRotaryEmbedding'),
MistralRotaryEmbedding=LazyObject('xtuner.model.modules.dispatch.mistral',
'MistralRotaryEmbedding'),
MixtralRotaryEmbedding=LazyObject('xtuner.model.modules.dispatch.mistral',
'MistralRotaryEmbedding'),
)
def log_once(func):
    logged = False

    def wrapper(*args, **kwargs):
        nonlocal logged
        if not logged:
            logged = True
            func(*args, **kwargs)
        return

    return wrapper
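# Illustrative sketch (not part of the original file): a call-once wrapper
# keeps a flag in a closure so that only the first call reaches the wrapped
# function. `call_once` is a standalone copy written for this example, with
# a list standing in for the logger.

```python
def call_once(func):
    """Forward only the first call to ``func``; ignore later calls."""
    called = False

    def wrapper(*args, **kwargs):
        nonlocal called
        if not called:
            called = True
            func(*args, **kwargs)

    return wrapper


messages = []
record = call_once(messages.append)
for _ in range(3):
    record('Dispatch LlamaFlashAttention2 forward.')
print(messages)
```

# Each wrapper owns its own flag, so wrapping the logger again inside another
# dispatch function gives that function its own single message.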
def dispatch_attn_forward(model):
if not SUPPORT_FLASH2:
return
from mmengine import print_log
print_log = log_once(print_log)
attn_forward = None
for module in model.modules():
name = type(module).__name__
if (IS_LOW_VERSION_TRANSFORMERS
and name in ATTN_LEGACY_DISPATCH_MAPPING):
if attn_forward is None:
attn_forward = ATTN_LEGACY_DISPATCH_MAPPING[name]
attn_forward = attn_forward.build()
print_log(f'Dispatch {name} legacy forward. {NO_ATTN_WEIGHTS_MSG}',
'current')
module.forward = types.MethodType(attn_forward, module)
elif name in ATTN_DISPATCH_MAPPING:
if attn_forward is None:
attn_forward = ATTN_DISPATCH_MAPPING[name]
attn_forward = attn_forward.build()
print_log(f'Dispatch {name} forward. {NO_ATTN_WEIGHTS_MSG}',
'current')
module.forward = types.MethodType(attn_forward, module)
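# Illustrative sketch (not part of the original file): dispatching works by
# looking up each module's class name in a table and binding a replacement
# function to that instance with `types.MethodType`. All class and function
# names below are toy stand-ins, not xtuner or transformers APIs.

```python
import types


class ToyAttention:
    def forward(self, x):
        return ('original', x)


class ToyNorm:
    def forward(self, x):
        return ('norm', x)


def patched_forward(self, x):
    # `self` is the patched instance, exactly as in a normal method.
    return ('patched', type(self).__name__, x)


TOY_DISPATCH_MAPPING = {'ToyAttention': patched_forward}


def dispatch(modules):
    for module in modules:
        name = type(module).__name__
        if name in TOY_DISPATCH_MAPPING:
            module.forward = types.MethodType(TOY_DISPATCH_MAPPING[name],
                                              module)


attn, norm = ToyAttention(), ToyNorm()
dispatch([attn, norm])
print(attn.forward(1), norm.forward(1))
```

# The binding is per instance: the class itself is untouched, so objects
# created later keep the original forward.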
def dispatch_varlen_attn_forward(model):
if not SUPPORT_FLASH2:
return
from mmengine import print_log
print_log = log_once(print_log)
varlen_attn_forward = None
for module in model.modules():
name = type(module).__name__
if (IS_LOW_VERSION_TRANSFORMERS
and name in VARLEN_ATTN_LEGACY_DISPATCH_MAPPING):
if varlen_attn_forward is None:
varlen_attn_forward = VARLEN_ATTN_LEGACY_DISPATCH_MAPPING[name]
varlen_attn_forward = varlen_attn_forward.build()
print_log(
f'Dispatch legacy {name} varlen forward. '
f'{NO_ATTN_WEIGHTS_MSG}', 'current')
module.forward = types.MethodType(varlen_attn_forward, module)
elif name in VARLEN_ATTN_DISPATCH_MAPPING:
if varlen_attn_forward is None:
varlen_attn_forward = VARLEN_ATTN_DISPATCH_MAPPING[name]
varlen_attn_forward = varlen_attn_forward.build()
print_log(f'Dispatch {name} varlen forward. {NO_ATTN_WEIGHTS_MSG}',
'current')
module.forward = types.MethodType(varlen_attn_forward, module)
def dispatch_rmsnorm_forward(model):
if (not SUPPORT_TRITON) or (not USE_TRITON_KERNEL):
return
from mmengine import print_log
print_log = log_once(print_log)
rms_forward = None
for module in model.modules():
name = type(module).__name__
if name in RMS_DISPATCH_MAPPING:
if rms_forward is None:
rms_forward = RMS_DISPATCH_MAPPING[name]
rms_forward = rms_forward.build()
print_log(f'Dispatch {name} forward.', 'current')
module.forward = types.MethodType(rms_forward, module)
def replace_rote(model):
from mmengine import print_log
print_log = log_once(print_log)
assert hasattr(model.config, 'rope_theta'), \
'`rope_theta` should be in the model config.'
rope_theta = model.config.rope_theta
def traverse(module):
for name, child in module.named_children():
cls_name = type(child).__name__
if cls_name in ROTE_DISPATCH_MAPPING:
rote = ROTE_DISPATCH_MAPPING[cls_name]
rote = rote.build()
print_log(f'replace {cls_name}', 'current')
dim_model = child.inv_freq.shape[0] * 2
child_new = rote(dim_model, child.max_seq_len_cached,
rope_theta).to(
device=child.inv_freq.device,
dtype=child.inv_freq.dtype)
setattr(module, name, child_new)
else:
traverse(child)
traverse(model)
def dispatch_modules(model, use_varlen_attn=False):
def check(model_name):
if 'ForCausalLM' not in model_name and model_name.endswith('Model'):
# a workaround for reward models
model_name = model_name[:-5] + 'ForCausalLM'
msg = '{} requires transformers version at least {}, but got {}'
assert TRANSFORMERS_VERSION >= LOWEST_TRANSFORMERS_VERSION[
model_name], msg.format(model_name,
LOWEST_TRANSFORMERS_VERSION[model_name],
TRANSFORMERS_VERSION)
check(type(model).__name__)
if use_varlen_attn:
dispatch_varlen_attn_forward(model)
else:
dispatch_attn_forward(model)
dispatch_rmsnorm_forward(model)
replace_rote(model)
__all__ = ['dispatch_modules']
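# Illustrative sketch (not part of the original file): the minimum-version
# table compares versions as numeric tuples, since plain string comparison
# orders '4.9' after '4.38'. `version_tuple` is a simplified hypothetical
# stand-in; mmengine's `digit_version` handles pre-release tags differently.

```python
def version_tuple(version):
    """Keep only the leading numeric release fields of a version string."""
    fields = []
    for part in version.split('.'):
        digits = ''
        for ch in part:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        fields.append(int(digits))
    return tuple(fields)


installed = version_tuple('4.9.2')
required = version_tuple('4.38')
print(installed < required)   # numeric comparison
print('4.9.2' < '4.38')       # string comparison gives the opposite answer
```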
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/attention.py
================================================
from xtuner.parallel.sequence import sequence_parallel_wrapper
from .utils import upad_qkv
SUPPORT_FLASH2 = False

try:
    from flash_attn import flash_attn_func, flash_attn_varlen_func
    from flash_attn.bert_padding import pad_input
    SUPPORT_FLASH2 = True
except ImportError:
    pass
@sequence_parallel_wrapper
def flash_attn_wo_mask(
query_states,
key_states,
value_states,
dropout_p=0.0,
softmax_scale=None,
causal=True,
window_size=(-1, -1), # -1 means infinite context window
):
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout_p=dropout_p,
softmax_scale=softmax_scale,
causal=causal,
window_size=window_size)
return attn_output
@sequence_parallel_wrapper
def flash_attn_w_mask(
query_states, # bs, q_len, nhead, h_dim
key_states,
value_states,
attention_mask,
causal=True,
dropout_p=0.0,
window_size=(-1, -1), # -1 means infinite context window
):
batch_size, q_len = query_states.shape[:2]
query_states, key_states, value_states, indices_q, \
cu_seq_lens, max_seq_lens = upad_qkv(
query_states, key_states, value_states, attention_mask, q_len)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout_p,
causal=causal,
window_size=window_size)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, q_len)
return attn_output
@sequence_parallel_wrapper
def varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
dropout_p=0.,
causal=True,
window_size=(-1, -1), # -1 means infinite context window
):
q_unpad, k_unpad, v_unpad = query_states.flatten(0, 1), key_states.flatten(
0, 1), value_states.flatten(0, 1)
attn_output = flash_attn_varlen_func(
q_unpad,
k_unpad,
v_unpad,
cumulative_len,
cumulative_len,
max_seqlen,
max_seqlen,
dropout_p=dropout_p,
return_attn_probs=False,
causal=causal,
window_size=window_size)
attn_output = attn_output.unsqueeze(0)
return attn_output
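# Illustrative sketch (not part of the original file): the variable-length
# path packs several sequences into one flattened batch and describes the
# boundaries with cumulative lengths. The helper below is hypothetical and
# uses plain lists; the real kernel is understood to expect int32 tensors.

```python
from itertools import accumulate


def cumulative_lengths(seq_lens):
    """Return [0, l0, l0 + l1, ...], one more entry than sequences."""
    return [0] + list(accumulate(seq_lens))


seq_lens = [3, 5, 2]
cu = cumulative_lengths(seq_lens)
max_seqlen = max(seq_lens)
# Token range occupied by each sequence inside the packed batch.
spans = list(zip(cu[:-1], cu[1:]))
print(cu, max_seqlen, spans)
```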
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/baichuan.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Tuple
import torch
import torch.nn as nn
import torch.nn.functional as F
def baichuan2_norm_head_forward(self, hidden_states):
norm_weight = nn.functional.normalize(self.weight)
return nn.functional.linear(hidden_states, norm_weight)
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., :x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos_, sin_, position_ids):
    cos = cos_.squeeze(1).squeeze(0)  # [seq_len, dim]
    sin = sin_.squeeze(1).squeeze(0)  # [seq_len, dim]
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q.float() * cos) + (rotate_half(q.float()) * sin)
    k_embed = (k.float() * cos) + (rotate_half(k.float()) * sin)
    return q_embed.to(q.dtype), k_embed.to(k.dtype)
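# Illustrative sketch (not part of the original file): the rotate-half form
# of rotary embeddings pairs element i with element i + dim/2 and rotates
# each pair by an angle. The list-based helpers below are hypothetical and
# only show that `x * cos + rotate_half(x) * sin` is a norm-preserving
# rotation.

```python
import math


def rotate_half_list(x):
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, angles):
    """Rotate each (x[i], x[i + half]) pair by angles[i]."""
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rotated = rotate_half_list(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]


x = [1.0, 2.0, 3.0, 4.0]
y = apply_rope(x, [0.3, 1.1])
norm_before = math.sqrt(sum(v * v for v in x))
norm_after = math.sqrt(sum(v * v for v in y))
print(rotate_half_list(x), norm_before, norm_after)
```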
def baichuan_7b_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
proj = self.W_pack(hidden_states)
proj = proj.unflatten(-1, (3, self.hidden_size)).unsqueeze(0).transpose(
0, -2).squeeze(-2)
query_states = proj[0].view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = proj[1].view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
value_states = proj[2].view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
# [bsz, nh, t, hd]
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
attn_output = F.scaled_dot_product_attention(
query_states, key_states, value_states, attn_mask=attention_mask)
attn_output = attn_output.transpose(1, 2)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
def baichuan_13b_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
proj = self.W_pack(hidden_states)
proj = proj.unflatten(-1, (3, self.hidden_size)).unsqueeze(0).transpose(
0, -2).squeeze(-2)
query_states = proj[0].view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = proj[1].view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
value_states = proj[2].view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
if attention_mask is not None:
if q_len == 1: # inference with cache
if len(attention_mask.size()) == 4:
attention_mask = attention_mask[:, :, -1:, :]
else:
attention_mask = attention_mask[:, -1:, :]
attn_output = F.scaled_dot_product_attention(
query_states, key_states, value_states, attn_mask=attention_mask)
attn_output = attn_output.transpose(1, 2)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/cohere.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional
import torch
import torch.distributed as dist
from transformers.models.cohere.modeling_cohere import apply_rotary_pos_emb
from xtuner.parallel.sequence import get_sequence_parallel_world_size
from xtuner.parallel.sequence.attention import (
post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn)
try:
    from transformers.cache_utils import Cache
except ImportError:

    class Cache:
        pass
def cohere_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
):
output_attentions = False
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim)
if self.use_qk_norm:
query_states = self.q_norm(query_states)
key_states = self.k_norm(key_states)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
past_key_value = getattr(self, 'past_key_value', past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models; position_ids needed for
# the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# TODO: These transpose are quite inefficient but Flash Attention requires
# the layout [batch_size, sequence_length, num_heads, head_dim].
# We would need to refactor the KV cache to be able to avoid many of
# these transpose/reshape/view.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = self.attention_dropout if self.training else 0.0
# Ignore copy
# In PEFT, layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32 as
# well. Hence, we need to cast them back to the correct dtype
# just to be sure everything works as expected.
# This might slow down training & inference, so it is recommended not to
# cast the LayerNorms to fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and self.training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate)
if enable_sequence_parallel:
attn_output = post_process_for_sequence_parallel_attn(attn_output)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/deepseek_v2.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from typing import Optional
import torch
import torch.distributed as dist
import torch.nn.functional as F
from mmengine import MessageHub
from transformers.cache_utils import Cache
from xtuner.model.transformers_models.deepseek_v2.modeling_deepseek import \
apply_rotary_pos_emb
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn)
from .attention import flash_attn_wo_mask, varlen_flash_attn
def deepseek_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
# DeepseekV2FlashAttention2 attention does not support output_attentions
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in '
'v4.37. Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
output_attentions = False
bsz, q_len, _ = hidden_states.size()
if self.q_lora_rank is None:
q = self.q_proj(hidden_states)
else:
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
q_nope, q_pe = torch.split(
q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim
# therefore we just need to keep the original shape
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(
compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = (
self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(
bsz, q_len, self.num_heads,
self.qk_nope_head_dim + self.v_head_dim).transpose(1, 2))
k_nope, value_states = torch.split(
kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None, '`position_ids` should not be None.'
if self.training:
cos, sin = self.rotary_emb(
value_states, seq_len=position_ids.max() + 1)
else:
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
query_states[:, :, :, :self.qk_nope_head_dim] = q_nope
query_states[:, :, :, self.qk_nope_head_dim:] = q_pe
key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
key_states[:, :, :, :self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim:] = k_pe
if self.q_head_dim != self.v_head_dim:
value_states = F.pad(value_states,
[0, self.q_head_dim - self.v_head_dim])
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32 as
# well. Hence, we need to cast them back to the correct dtype
# just to be sure everything works as expected.
# This might slow down training & inference, so it is recommended not to
# cast the LayerNorms to fp32. (DeepseekV2RMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
# Handle the case where the model is quantized
if hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
elif torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
else:
target_dtype = self.q_a_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and self.training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
query_states.shape[1],
dropout=dropout_rate,
softmax_scale=self.softmax_scale,
)
if enable_sequence_parallel:
attn_output = post_process_for_sequence_parallel_attn(attn_output)
if self.q_head_dim != self.v_head_dim:
attn_output = attn_output[:, :, :, :self.v_head_dim]
attn_output = attn_output.reshape(bsz, q_len, self.num_heads *
self.v_head_dim).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
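# Illustrative sketch (not part of the original file): when the value head
# dimension is smaller than the query/key head dimension, the values can be
# zero-padded to the larger width and the output sliced back afterwards.
# The toy single-query attention below is hypothetical and list-based; it
# only checks that padding followed by slicing leaves the result unchanged.

```python
import math


def attend(q, keys, values):
    """Softmax(q . k) weighted sum of the value rows for one query."""
    scores = [sum(a * b for a, b in zip(q, k)) for k in keys]
    peak = max(scores)
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [
        sum(w * v[i] for w, v in zip(weights, values)) / total
        for i in range(dim)
    ]


def pad_last(rows, width):
    return [row + [0.0] * (width - len(row)) for row in rows]


q = [0.5, -1.0, 0.25]                      # query/key head dim 3
keys = [[1.0, 0.0, 2.0], [0.0, 1.0, -1.0]]
values = [[1.0, 2.0], [3.0, 4.0]]          # value head dim 2
reference = attend(q, keys, values)
padded = attend(q, keys, pad_last(values, len(q)))
print(reference, padded)
```

# The padded columns contribute only zeros, so dropping them after the
# attention call recovers the unpadded output.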
def deepseek_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
is_training = self.training
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
assert is_training == (cumulative_len is not None) == (
past_key_value is None)
output_attentions = False
bsz, q_len, _ = hidden_states.size()
if self.q_lora_rank is None:
q = self.q_proj(hidden_states)
else:
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
q_nope, q_pe = torch.split(
q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim
# therefore we just need to keep the original shape
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(
compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = (
self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(
bsz, q_len, self.num_heads,
self.qk_nope_head_dim + self.v_head_dim).transpose(1, 2))
k_nope, value_states = torch.split(
kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None, '`position_ids` should not be None.'
if self.training:
cos, sin = self.rotary_emb(
value_states, seq_len=position_ids.max() + 1)
else:
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
query_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
query_states[:, :, :, :self.qk_nope_head_dim] = q_nope
query_states[:, :, :, self.qk_nope_head_dim:] = q_pe
key_states = k_pe.new_empty(bsz, self.num_heads, q_len, self.q_head_dim)
key_states[:, :, :, :self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim:] = k_pe
if self.q_head_dim != self.v_head_dim:
value_states = F.pad(value_states,
[0, self.q_head_dim - self.v_head_dim])
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# In PEFT, layer norms are usually cast to float32 for training stability,
# so the input hidden states are silently upcast to float32 as well. Cast
# them back to the expected dtype so everything works as intended.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (DeepseekV2RMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
# Handle the case where the model is quantized
if hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
elif torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
else:
target_dtype = self.q_a_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# ----------------- varlen flash attention forward ----------------------#
dropout_rate = self.attention_dropout if self.training else 0.0
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
causal = self.is_causal and q_len != 1
if is_training:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=causal,
dropout_p=dropout_rate,
training=True)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=dropout_rate,
training=False)
# ---------------- varlen flash attention forward end ------------------ #
if self.q_head_dim != self.v_head_dim:
attn_output = attn_output[:, :, :, :self.v_head_dim]
attn_output = attn_output.reshape(bsz, q_len,
self.num_heads * self.v_head_dim)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/internlm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Tuple
import torch
import torch.distributed as dist
import torch.nn.functional as F
from mmengine import MessageHub
from .triton_kernels import apply_rotary_emb
SUPPORT_FLASH2 = False
try:
from flash_attn import flash_attn_func, flash_attn_varlen_func
SUPPORT_FLASH2 = True
except ImportError:
pass
class InternLMRotaryEmbedding(torch.nn.Module):
def __init__(self,
dim,
max_position_embeddings=2048,
base=10000,
device=None):
super().__init__()
self.inv_freq = 1.0 / (
base**(torch.arange(0, dim, 2).float().to(device) / dim))
# Build here to make `torch.jit.trace` work.
self.max_seq_len_cached = max_position_embeddings
t = torch.arange(
self.max_seq_len_cached,
device=self.inv_freq.device,
dtype=self.inv_freq.dtype)
freqs = torch.einsum('i,j->ij', t, self.inv_freq)
emb = torch.cat((freqs, freqs), dim=-1)
self.cos_cached = emb.cos()
self.sin_cached = emb.sin()
def forward(self, x, seq_len):
# x: [bs, num_attention_heads, seq_len, head_size]
if (seq_len > self.max_seq_len_cached
or self.cos_cached.device != x.device
or self.cos_cached.dtype != x.dtype):
self.max_seq_len_cached = seq_len
assert self.inv_freq.dtype == torch.float32
t = torch.arange(
self.max_seq_len_cached,
device=x.device,
dtype=self.inv_freq.dtype)
freqs = torch.einsum('i,j->ij', t, self.inv_freq.to(t.device))
emb = torch.cat((freqs, freqs), dim=-1).to(x.device)
self.cos_cached = emb.cos().to(x.dtype)
self.sin_cached = emb.sin().to(x.dtype)
return (
self.cos_cached[:seq_len, ...],
self.sin_cached[:seq_len, ...],
)
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., :x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
    cos = cos[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    sin = sin[position_ids].unsqueeze(1)  # [bs, 1, seq_len, dim]
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed
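

# Illustrative sketch (not part of the upstream module): a pure-Python mirror
# of `rotate_half` and of the RoPE update `x * cos + rotate_half(x) * sin`
# defined above, applied to a single vector. `_rotate_half_list` and
# `_rope_rotate_list` are hypothetical helpers introduced only for this
# example; they assume one angle per (x[i], x[i + half]) pair.
import math


def _rotate_half_list(x):
    """Return [-x2, x1], where x1 / x2 are the first / second half of `x`."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def _rope_rotate_list(x, angles):
    """Rotate each pair (x[i], x[i + half]) by angles[i]."""
    # cos / sin are duplicated across both halves, like `cat((freqs, freqs))`.
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rot = _rotate_half_list(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rot, sin)]
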
def internlm_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
# Modified from https://huggingface.co/internlm/internlm-7b/blob/939a68c0dc1bd5f35b63c87d44af05ce33379061/modeling_internlm.py#L161 # noqa:E501
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads,
self.head_dim).transpose(
1, 2)
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads,
self.head_dim).transpose(
1, 2)
value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads,
self.head_dim).transpose(
1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
# [bsz, nh, t, hd]
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
if SUPPORT_FLASH2:
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
attn_output = flash_attn_func(
query_states, key_states, value_states, causal=True)
attn_output = attn_output.contiguous()
else:
# use flash attention implemented by pytorch
attn_output = F.scaled_dot_product_attention(
query_states, key_states, value_states, attn_mask=attention_mask)
attn_output = attn_output.transpose(1, 2)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
def internlm_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
# Modified from https://huggingface.co/internlm/internlm-7b/blob/939a68c0dc1bd5f35b63c87d44af05ce33379061/modeling_internlm.py#L161 # noqa:E501
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
# position_ids = message_hub.get_info(f'position_ids_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
use_varlen_atten = (cumulative_len is not None)
bsz, q_len, _ = hidden_states.size()
assert bsz == 1, (f'If utilizing local attention, the batch size should be'
f' set to 1, but got {bsz}')
query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads,
self.head_dim)
key_states = self.k_proj(hidden_states).view(bsz, q_len, self.num_heads,
self.head_dim)
value_states = self.v_proj(hidden_states).view(bsz, q_len, self.num_heads,
self.head_dim)
kv_seq_len = key_states.shape[-3]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
if use_varlen_atten:
cos, sin = self.rotary_emb(value_states, max_seqlen)
query_states = apply_rotary_emb(query_states,
cos[position_ids].squeeze(0),
sin[position_ids].squeeze(0))
key_states = apply_rotary_emb(key_states, cos[position_ids].squeeze(0),
sin[position_ids].squeeze(0))
else:
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
assert SUPPORT_FLASH2
if use_varlen_atten:
q_unpad, k_unpad, v_unpad = query_states.flatten(
0, 1), key_states.flatten(0, 1), value_states.flatten(0, 1)
cumulative_len = torch.cat(cumulative_len, dim=0)
attn_output = flash_attn_varlen_func(
q_unpad,
k_unpad,
v_unpad,
cumulative_len,
cumulative_len,
max_seqlen,
max_seqlen,
0,
return_attn_probs=False,
causal=True,
)
else:
attn_output = flash_attn_func(
query_states, key_states, value_states, causal=True)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
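

# Illustrative sketch (not part of the upstream module): how a
# `cumulative_len`-style vector relates to the lengths of sequences packed
# into one batch row. It assumes the `cu_seqlens` convention documented for
# flash-attn's varlen kernels: a prefix sum that starts at 0 and has one more
# entry than there are sequences. `_cumulative_lengths` is a hypothetical
# helper; in this module the real values are read from the
# `varlen_attn_args` MessageHub.
from itertools import accumulate


def _cumulative_lengths(seq_lens):
    """Return ([0, l0, l0 + l1, ...], max(seq_lens)) for packed sequences."""
    cu_seqlens = [0] + list(accumulate(seq_lens))
    return cu_seqlens, max(seq_lens)
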
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/internlm2.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Tuple
import torch
import torch.distributed as dist
from einops import rearrange
from mmengine import MessageHub
from transformers.cache_utils import Cache, StaticCache
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn)
from .attention import SUPPORT_FLASH2, flash_attn_wo_mask, varlen_flash_attn
def rotate_half(x):
    """Rotates half the hidden dims of the input."""
    x1 = x[..., :x.shape[-1] // 2]
    x2 = x[..., x.shape[-1] // 2:]
    return torch.cat((-x2, x1), dim=-1)


def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
    cos = cos.unsqueeze(unsqueeze_dim)
    sin = sin.unsqueeze(unsqueeze_dim)
    q_embed = (q * cos) + (rotate_half(q) * sin)
    k_embed = (k * cos) + (rotate_half(k) * sin)
    return q_embed, k_embed


def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """This is the equivalent of torch.repeat_interleave(x, dim=1,
    repeats=n_rep).

    The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to
    (batch, num_attention_heads, seqlen, head_dim)
    """
    batch, num_key_value_heads, slen, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :,
                                  None, :, :].expand(batch,
                                                     num_key_value_heads,
                                                     n_rep, slen, head_dim)
    return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
                                 head_dim)


def repeat_kv_bshd(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """The hidden states go from (batch, seqlen, num_key_value_heads, head_dim)
    to (batch, seqlen, num_attention_heads, head_dim)"""
    batch, slen, num_key_value_heads, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, :,
                                  None, :].expand(batch, slen,
                                                  num_key_value_heads, n_rep,
                                                  head_dim)
    return hidden_states.reshape(batch, slen, num_key_value_heads * n_rep,
                                 head_dim)
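

# Illustrative sketch (not part of the upstream module): the head order that
# the expand + reshape in `repeat_kv` / `repeat_kv_bshd` produces, shown on a
# plain list of head labels instead of tensors. Each key/value head is
# repeated `n_rep` times consecutively (interleaved), not tiled as a whole.
# `_repeat_heads` is a hypothetical helper introduced only for this example.


def _repeat_heads(kv_heads, n_rep):
    """Repeat every KV head `n_rep` times, keeping the copies adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]
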
def internlm2_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
):
if isinstance(past_key_value, StaticCache):
raise ValueError(
'`static` cache implementation is not compatible with '
'`attn_implementation==flash_attention_2` make sure to use `sdpa` '
'in the mean time, and open an issue at '
'https://github.com/huggingface/transformers')
output_attentions = False
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models;
# cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# In PEFT, layer norms are usually cast to float32 for training stability,
# so the input hidden states are silently upcast to float32 as well. Cast
# them back to the expected dtype so everything works as intended.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (InternLM2RMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.wqkv.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and self.training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
dropout_rate = 0.0
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
query_states.shape[1],
dropout=dropout_rate)
if enable_sequence_parallel:
attn_output = post_process_for_sequence_parallel_attn(attn_output)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def internlm2_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if isinstance(past_key_value, StaticCache):
raise ValueError(
'`static` cache implementation is not compatible with '
'`attn_implementation==flash_attention_2` make sure to use `sdpa` '
'in the mean time, and open an issue at '
'https://github.com/huggingface/transformers')
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
use_varlen_atten = (cumulative_len is not None)
bsz, q_len, _ = hidden_states.size()
assert bsz == 1, (f'If utilizing local attention, the batch size should be'
f' set to 1, but got {bsz}')
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
'b q (h gs d) -> b q h gs d',
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., :self.num_key_value_groups, :]
query_states = rearrange(query_states, 'b q h gs d -> b q (h gs) d')
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
try:
cos, sin = self.rotary_emb(value_states, position_ids)
except RuntimeError:
raise RuntimeError(
'You are using the old version of InternLM2 model. The '
'`modeling_internlm2.py` is outdated. Please update the InternLM2 '
'model.')
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
if past_key_value is not None:
# sin and cos are specific to RoPE models;
# cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# In PEFT, layer norms are usually cast to float32 for training stability,
# so the input hidden states are silently upcast to float32 as well. Cast
# them back to the expected dtype so everything works as intended.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (InternLM2RMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.wqkv.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# repeat kv for sequence parallel
key_states = repeat_kv_bshd(key_states, self.num_key_value_groups)
value_states = repeat_kv_bshd(value_states, self.num_key_value_groups)
assert SUPPORT_FLASH2
dropout_rate = 0.0
if use_varlen_atten:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=True,
dropout_p=dropout_rate,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
dropout_p=dropout_rate,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.wo(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/llama.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from typing import Optional, Tuple
import torch
import torch.distributed as dist
from mmengine import MessageHub
from transformers.models.llama.modeling_llama import (apply_rotary_pos_emb,
repeat_kv)
from transformers.utils import is_flash_attn_greater_or_equal_2_10
from .attention import (SUPPORT_FLASH2, flash_attn_w_mask, flash_attn_wo_mask,
varlen_flash_attn)
from .triton_kernels import apply_rotary_emb
try:
    from transformers.cache_utils import Cache
except ImportError:

    class Cache:
        pass


def repeat_kv_bshd(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
    """The hidden states go from (batch, seqlen, num_key_value_heads, head_dim)
    to (batch, seqlen, num_attention_heads, head_dim)"""
    batch, slen, num_key_value_heads, head_dim = hidden_states.shape
    if n_rep == 1:
        return hidden_states
    hidden_states = hidden_states[:, :, :,
                                  None, :].expand(batch, slen,
                                                  num_key_value_heads, n_rep,
                                                  head_dim)
    return hidden_states.reshape(batch, slen, num_key_value_heads * n_rep,
                                 head_dim)
def llama_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
):
# Modified from https://github.com/huggingface/transformers/blob/66ce9593fdb8e340df546ddd0774eb444f17a12c/src/transformers/models/llama/modeling_llama.py#L422 # noqa:E501
output_attentions = False
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
# Flash attention expects inputs of shape
# batch_size x seq_length x num_heads x head_dim,
# so the tensors are transposed back to that layout before the attention call.
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
past_key_value = getattr(self, 'past_key_value', past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models;
# cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
assert SUPPORT_FLASH2
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# In PEFT, layer norms are usually cast to float32 for training stability,
# so the input hidden states are silently upcast to float32 as well. Cast
# them back to the expected dtype so everything works as intended.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
dropout_rate = self.attention_dropout if self.training else 0.0
if is_flash_attn_greater_or_equal_2_10():
causal = self.is_causal
else:
# TODO: Remove the `q_len != 1` check once Flash Attention for ROCm
# is bumped to 2.1. For details, please see the comment in
# LlamaFlashAttention2 __init__.
causal = self.is_causal and q_len != 1
# the shape of attention_mask used by flash_attn and
# F.scaled_dot_product_attention are different
assert attention_mask is None or attention_mask.ndim == 2, \
('When using flash_attn, attention_mask.ndim should equal 2. '
f'But got attention_mask.shape = {attention_mask.shape}. '
'We can pass the `attn_implementation="flash_attention_2"` flag '
'to `.from_pretrained` method when instantiating a Llama '
'model.')
if attention_mask is not None:
attn_output = flash_attn_w_mask(
query_states,
key_states,
value_states,
attention_mask,
causal=causal,
dropout_p=dropout_rate,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=dropout_rate,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
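

# Illustrative sketch (not part of the upstream module): why `causal` is
# switched off for `q_len == 1` on older flash-attn builds. It assumes the
# behaviour described in the flash-attn changelog: before 2.1 the causal mask
# is aligned to the top-left corner of the (q_len x kv_len) score matrix,
# from 2.1 on it is aligned to the bottom-right corner. `_causal_mask` is a
# hypothetical helper introduced only for this example.


def _causal_mask(q_len, kv_len, top_left):
    """Return a q_len x kv_len grid of bools; True means 'may attend'."""
    offset = 0 if top_left else kv_len - q_len
    return [[k <= q + offset for k in range(kv_len)] for q in range(q_len)]


# With a single decode-step query and four cached keys, a top-left mask
# would hide all but the first key, so the caller disables `causal` instead:
#   _causal_mask(1, 4, top_left=True)   allows only key 0
#   _causal_mask(1, 4, top_left=False)  allows all four keys
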
def llama_attn_forward_legacy(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
# Modified from https://github.com/huggingface/transformers/blob/ced9fd86f55ebb6b656c273f6e23f8ba50652f83/src/transformers/models/llama/modeling_llama.py#L331 # noqa:E501
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in '
'v4.37. Please make sure to use `attention_mask` instead.')
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None
if self.training:
cos, sin = self.rotary_emb(
value_states, seq_len=position_ids.max() + 1)
else:
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
assert SUPPORT_FLASH2
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# In PEFT, layer norms are usually cast to float32 for training stability,
# so the input hidden states are silently upcast to float32 as well. Cast
# them back to the expected dtype so everything works as intended.
# This may slow down training and inference, so it is recommended not to
# cast the LayerNorms to fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
dropout_rate = self.attention_dropout if self.training else 0.0
if is_flash_attn_greater_or_equal_2_10():
causal = self.is_causal
else:
# TODO: Remove the `q_len != 1` check once Flash Attention for ROCm
# is bumped to 2.1. For details, please see the comment in
# LlamaFlashAttention2 __init__.
causal = self.is_causal and q_len != 1
# the shape of attention_mask used by flash_attn and
# F.scaled_dot_product_attention are different
assert attention_mask is None or attention_mask.ndim == 2, \
('When using flash_attn, attention_mask.ndim should equal 2. '
f'But got attention_mask.shape = {attention_mask.shape}. '
'We can pass the `attn_implementation="flash_attention_2"` flag '
'to `.from_pretrained` method when instantiating a Llama '
'model.')
if attention_mask is not None:
attn_output = flash_attn_w_mask(
query_states,
key_states,
value_states,
attention_mask=attention_mask,
causal=causal,
dropout_p=dropout_rate,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=dropout_rate,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
def llama_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
use_varlen_atten = (cumulative_len is not None)
if 'padding_mask' in kwargs:
warnings.warn('Passing `padding_mask` is deprecated and will be '
'removed in v4.37. Please make sure to use '
'`attention_mask` instead.')
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
cos, sin = self.rotary_emb(value_states, position_ids)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin)
past_key_value = getattr(self, 'past_key_value', past_key_value)
if past_key_value is not None:
# sin and cos are specific to RoPE models;
# cache_position needed for the static cache
cache_kwargs = {
'sin': sin,
'cos': cos,
'cache_position': cache_position
}
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# repeat kv for sequence parallel
key_states = repeat_kv_bshd(key_states, self.num_key_value_groups)
value_states = repeat_kv_bshd(value_states, self.num_key_value_groups)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, usually we cast the layer norms in float32 for training
# stability reasons therefore the input hidden states gets silently casted
# in float32. Hence, we need cast them back in the correct dtype
# just to be sure everything works as expected.
# This might slowdown training & inference so it is recommended to not
# cast the LayerNorms in fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
assert SUPPORT_FLASH2
if use_varlen_atten:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=True,
dropout_p=dropout_rate,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
dropout_p=dropout_rate,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
def llama_varlen_attn_forward_legacy(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
use_varlen_atten = (cumulative_len is not None)
if 'padding_mask' in kwargs:
warnings.warn('Passing `padding_mask` is deprecated and will be '
'removed in v4.37. Please make sure to use '
'`attention_mask` instead.')
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim)
kv_seq_len = key_states.shape[-3]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
if use_varlen_atten:
cos, sin = self.rotary_emb(value_states, max_seqlen)
# position_ids (1, seq_len)
# cos, sin (1, seq_len, dim) -> (seq_len, dim)
cos = cos[position_ids].squeeze(0)
sin = sin[position_ids].squeeze(0)
query_states = apply_rotary_emb(query_states, cos, sin)
key_states = apply_rotary_emb(key_states, cos, sin)
else:
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
cos, sin = self.rotary_emb(value_states, kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# repeat kv for sequence parallel
key_states = repeat_kv_bshd(key_states, self.num_key_value_groups)
value_states = repeat_kv_bshd(value_states, self.num_key_value_groups)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, usually we cast the layer norms in float32 for training
# stability reasons therefore the input hidden states gets silently casted
# in float32. Hence, we need cast them back in the correct dtype
# just to be sure everything works as expected.
# This might slowdown training & inference so it is recommended to not
# cast the LayerNorms in fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
assert SUPPORT_FLASH2
if use_varlen_atten:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=True,
dropout_p=dropout_rate,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=True,
dropout_p=dropout_rate,
training=self.training)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
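# Illustrative sketch, not part of xtuner: how the `cumulative_len` and
# `max_seqlen` values read from the MessageHub above could be derived when
# several sequences are packed into one row. It assumes `cumulative_len`
# holds prefix-sum boundaries (in the style of flash-attn's `cu_seqlens`);
# the helper name `_sketch_varlen_args` is hypothetical.
from itertools import accumulate


def _sketch_varlen_args(seq_lens):
    """Return (cumulative_len, max_seqlen) for back-to-back packed sequences."""
    # Boundaries start at 0; entry i + 1 is the end offset of sequence i.
    cumulative_len = [0] + list(accumulate(seq_lens))
    max_seqlen = max(seq_lens)
    return cumulative_len, max_seqlen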
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/mistral.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import inspect
import warnings
from typing import Optional
import torch
import torch.distributed as dist
import torch.nn as nn
from mmengine import MessageHub
from transformers.cache_utils import Cache
from transformers.models.mistral.modeling_mistral import (apply_rotary_pos_emb,
repeat_kv)
from xtuner.parallel.sequence import get_sequence_parallel_world_size
from xtuner.parallel.sequence.attention import (
post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn)
from .attention import flash_attn_wo_mask, varlen_flash_attn
from .triton_kernels import apply_rotary_emb
SUPPORT_FLASH2 = False
try:
from flash_attn import flash_attn_func
_flash_supports_window_size = 'window_size' in list(
inspect.signature(flash_attn_func).parameters)
SUPPORT_FLASH2 = True
except ImportError:
pass
class MistralRotaryEmbedding(nn.Module):
def __init__(self,
dim,
max_position_embeddings=2048,
base=10000,
device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
self.inv_freq = 1.0 / (
base**(torch.arange(0, self.dim, 2).float().to(device) / self.dim))
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings,
device=self.inv_freq.device,
dtype=torch.get_default_dtype())
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(
self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
freqs = torch.einsum('i,j->ij', t, self.inv_freq.to(device))
# Different from paper, but it uses a different permutation
# in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1).to(device)
self.cos_cached = emb.cos().to(dtype)
self.sin_cached = emb.sin().to(dtype)
def forward(self, x, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if (seq_len > self.max_seq_len_cached
or self.cos_cached.device != x.device # noqa: W503
or self.cos_cached.dtype != x.dtype): # noqa: W503
self._set_cos_sin_cache(
seq_len=seq_len, device=x.device, dtype=x.dtype)
return (
self.cos_cached[:seq_len].to(dtype=x.dtype),
self.sin_cached[:seq_len].to(dtype=x.dtype),
)
def repeat_kv_bshd(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""The hidden states go from (batch, seqlen, num_key_value_heads, head_dim)
to (batch, seqlen, num_attention_heads, head_dim)"""
batch, slen, num_key_value_heads, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, :,
None, :].expand(batch, slen,
num_key_value_heads, n_rep,
head_dim)
return hidden_states.reshape(batch, slen, num_key_value_heads * n_rep,
head_dim)
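# Illustrative sketch, not part of xtuner: the expand + reshape in
# `repeat_kv_bshd` repeats each key/value head `n_rep` times consecutively
# along the head axis, so copies of the same head stay adjacent. The same
# ordering on a plain list of head labels, using the hypothetical helper
# `_sketch_repeat_heads`:
def _sketch_repeat_heads(kv_heads, n_rep):
    """Repeat every entry n_rep times, keeping copies of a head adjacent."""
    return [head for head in kv_heads for _ in range(n_rep)]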
def mistral_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in '
'v4.37. Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None
if self.training:
cos, sin = self.rotary_emb(
value_states, seq_len=position_ids.max() + 1)
else:
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window)
if past_key_value is not None:
# Activate the slicing cache only if the config has a
# `sliding_window` attribute
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads for sequence parallel
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, usually we cast the layer norms in float32 for training
# stability reasons therefore the input hidden states gets silently
# casted in float32. Hence, we need cast them back in the correct dtype
# just to be sure everything works as expected.
# This might slowdown training & inference so it is recommended to not
# cast the LayerNorms in fp32. (LlamaRMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and self.training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
query_length=query_states.shape[1],
dropout=dropout_rate,
use_sliding_windows=use_sliding_windows,
)
if enable_sequence_parallel:
attn_output = post_process_for_sequence_parallel_attn(attn_output)
attn_output = attn_output.reshape(bsz, q_len,
self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
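# Illustrative sketch, not part of xtuner: the slice computed above with
# `slicing_tokens = 1 - sliding_window` selects the last
# `sliding_window - 1` cached positions, so that adding the new token gives
# a key/value length equal to `sliding_window`. Shown on a plain list of
# positions; `_sketch_slide_cache` is a hypothetical helper and assumes
# `sliding_window > 1`.
def _sketch_slide_cache(past_positions, new_position, sliding_window):
    """Trim cached positions to the window, then append the new one."""
    slicing_tokens = 1 - sliding_window  # negative start index
    if len(past_positions) + 1 > sliding_window:
        past_positions = past_positions[slicing_tokens:]
    return past_positions + [new_position]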
def mistral_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
is_training = self.training
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
assert is_training == (past_key_value is None)
use_varlen_atten = (cumulative_len is not None)
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37.'
' Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
assert bsz == 1, (f'If utilizing local attention, the batch size should be'
f' set to 1, but got {bsz}')
# attention_mask is set to None if no padding token in input_ids
assert attention_mask is None
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads, self.head_dim)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim)
assert _flash_supports_window_size, \
('The current flash attention version does not support sliding window '
'attention, for a more memory efficient implementation make sure '
'to upgrade flash-attn library.')
kv_seq_len = key_states.shape[-3]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
if use_varlen_atten:
cos, sin = self.rotary_emb(value_states, max_seqlen)
query_states = apply_rotary_emb(query_states,
cos[position_ids].squeeze(0),
sin[position_ids].squeeze(0))
key_states = apply_rotary_emb(key_states, cos[position_ids].squeeze(0),
sin[position_ids].squeeze(0))
else:
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# Because the input can be padded, the absolute sequence length
# depends on the max position id.
rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item() + 1)
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
# Activate the slicing cache only if the config has a
# `sliding_window` attribute
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window # noqa: W503
and cache_has_contents): # noqa: W503
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# repeat kv for sequence parallel
key_states = repeat_kv_bshd(key_states, self.num_key_value_groups)
value_states = repeat_kv_bshd(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, the layer norms are usually cast to float32 for training
# stability, so the input hidden states get silently cast to float32.
# Cast them back to the correct dtype (autocast, pre-quantization, or
# projection weight dtype) so everything works as expected.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# ----------------- flash attention forward ------------------------#
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
causal = self.is_causal and q_len != 1
use_sliding_windows = (
_flash_supports_window_size and # noqa: W504
getattr(self.config, 'sliding_window', None) is not None # noqa: W503
and kv_seq_len > self.config.sliding_window) # noqa: W503
window_size = (self.config.sliding_window,
self.config.sliding_window) if use_sliding_windows else (-1,
-1)
if use_varlen_atten:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=causal,
dropout_p=dropout_rate,
window_size=window_size,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=dropout_rate,
window_size=window_size,
training=self.training)
# ---------------- flash attention forward end ------------------- #
attn_output = attn_output.reshape(bsz, q_len,
self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/phi3.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import warnings
from typing import Optional, Tuple
import torch
import torch.distributed as dist
from mmengine import MessageHub
from xtuner.parallel.sequence import (get_sequence_parallel_world_size,
post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn)
from .attention import flash_attn_wo_mask, varlen_flash_attn
try:
from transformers.cache_utils import Cache
except ImportError:
class Cache:
pass
import inspect
_flash_supports_window_size = False
try:
from flash_attn import flash_attn_func
_flash_supports_window_size = 'window_size' in list(
inspect.signature(flash_attn_func).parameters)
if not _flash_supports_window_size:
raise ValueError(
'Please update flash-attention to support window size.')
# else:
except ImportError:
pass
# Copied from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/3a811845d89f3c1b3f41b341d0f9f05104769f35/modeling_phi3.py#L302 # noqa:E501
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""This is the equivalent of torch.repeat_interleave(x, dim=1,
repeats=n_rep).
The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to
(batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
# https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/3a811845d89f3c1b3f41b341d0f9f05104769f35/modeling_phi3.py#L247 # noqa:E501
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
# Copied from https://huggingface.co/microsoft/Phi-3-mini-4k-instruct/blob/3a811845d89f3c1b3f41b341d0f9f05104769f35/modeling_phi3.py#L255 # noqa:E501
def apply_rotary_pos_emb(q, k, cos, sin, position_ids=None, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`, *optional*):
Deprecated and unused.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising of the query and key tensors rotated using the Rotary Position Embedding.
""" # noqa:E501
cos = cos.unsqueeze(unsqueeze_dim)
sin = sin.unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
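# Illustrative sketch, not part of xtuner: the rotary update
# `q * cos + rotate_half(q) * sin` applied to a single vector in pure
# Python. Element i is paired with element i + d/2, and each pair is rotated
# by its own angle. The helper names `_sketch_rotate_half` and
# `_sketch_rope` are hypothetical.
import math


def _sketch_rotate_half(x):
    """Return [-x2, x1], where x1 and x2 are the two halves of x."""
    half = len(x) // 2
    return [-v for v in x[half:]] + list(x[:half])


def _sketch_rope(x, angles):
    """Rotate the pairs (x[i], x[i + d/2]) by angles[i]."""
    # cos/sin are duplicated across both halves to match the half-split
    # layout used by rotate_half.
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rotated = _sketch_rotate_half(x)
    return [xi * c + ri * s for xi, ri, c, s in zip(x, rotated, cos, sin)]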
def phi3_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
):
if not _flash_supports_window_size:
raise ValueError(
'The current flash attention version does not support '
'sliding window attention.')
output_attentions = False
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in '
'v4.37. Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
qkv = self.qkv_proj(hidden_states)
query_pos = self.num_heads * self.head_dim
query_states = qkv[..., :query_pos]
key_states = qkv[..., query_pos:query_pos +
self.num_key_value_heads * self.head_dim]
value_states = qkv[...,
query_pos + self.num_key_value_heads * self.head_dim:]
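# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim. The transpose below
# moves heads ahead of seq_length for the rotary embedding and KV cache;
# the tensors are transposed back before the attention call.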
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
rotary_seq_len = max(kv_seq_len, position_ids.max().item() + 1)
cos, sin = self.rotary_emb(
value_states, position_ids, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window)
if past_key_value is not None:
# Activate the slicing cache only if the config has a
# `sliding_window` attribute
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_dropout = self.attention_dropout if self.training else 0.0
# In PEFT, usually we cast the layer norms in float32 for training
# stability reasons therefore the input hidden states gets silently
# casted in float32. Hence, we need cast them back in the correct dtype
# just to be sure everything works as expected.
# This might slowdown training & inference so it is recommended to not
# cast the LayerNorms in fp32.
if query_states.dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.qkv_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and self.training)
if enable_sequence_parallel:
# (b, s // sp_world_size, nd, dim) -> (b, s, nd // sp_world_size, dim)
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states,
scatter_dim=2, gather_dim=1)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
query_states.shape[1],
dropout=attn_dropout,
use_sliding_windows=use_sliding_windows,
)
if enable_sequence_parallel:
# (b, s, nd // sp_world_size, dim) -> (b, s // sp_world_size, nd, dim)
attn_output = post_process_for_sequence_parallel_attn(
attn_output, scatter_dim=1, gather_dim=2)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def phi3_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
cache_position: Optional[torch.LongTensor] = None,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if not _flash_supports_window_size:
raise ValueError(
'The current flash attention version does not support '
'sliding window attention.')
output_attentions = False
is_training = self.training
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
assert is_training == (past_key_value is None)
use_varlen_atten = (cumulative_len is not None)
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37.'
' Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
assert bsz == 1, (f'If utilizing local attention, the batch size should be'
f' set to 1, but got {bsz}')
# attention_mask is set to None if no padding token in input_ids
# varlen attn need data packing so no padding tokens in input_ids
assert attention_mask is None
qkv = self.qkv_proj(hidden_states)
query_pos = self.num_heads * self.head_dim
query_states = qkv[..., :query_pos]
key_states = qkv[..., query_pos:query_pos +
self.num_key_value_heads * self.head_dim]
value_states = qkv[...,
query_pos + self.num_key_value_heads * self.head_dim:]
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim; the transpose below
# moves heads ahead of seq_length for the rotary embedding and KV cache.
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None
rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
cos, sin = self.rotary_emb(
value_states, position_ids, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window)
if past_key_value is not None:
# Slice the cache only if the config defines a
# `sliding_window` value
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
# In PEFT, the layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32.
# Cast them back to the expected dtype (autocast, pre-quantization, or
# projection weight dtype) so everything works as expected.
if query_states.dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.qkv_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# ----------------- flash attention forward ------------------------#
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
causal = self.is_causal and q_len != 1
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window)
window_size = (self.config.sliding_window,
self.config.sliding_window) if use_sliding_windows else (-1,
-1)
attn_dropout = self.attention_dropout if self.training else 0.0
if use_varlen_atten:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=causal,
dropout_p=attn_dropout,
window_size=window_size,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=attn_dropout,
window_size=window_size,
training=self.training)
# ---------------- flash attention forward end ------------------- #
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
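# Illustrative sketch, not part of the original module: how the output of a
# fused QKV projection is split along its last dimension, mirroring the
# `qkv[..., :query_pos]` slicing in the forward function above.
# `split_fused_qkv` is a hypothetical helper name, and it works on a plain
# Python list standing in for one row of the projected tensor.
def split_fused_qkv(row, num_heads, num_key_value_heads, head_dim):
    # Queries occupy the first num_heads * head_dim entries.
    query_pos = num_heads * head_dim
    # Keys and values each occupy num_key_value_heads * head_dim entries.
    kv_width = num_key_value_heads * head_dim
    assert len(row) == query_pos + 2 * kv_width
    query = row[:query_pos]
    key = row[query_pos:query_pos + kv_width]
    value = row[query_pos + kv_width:]
    return query, key, value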
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/qwen2.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import inspect
import warnings
from typing import Optional
import torch
import torch.distributed as dist
from mmengine import MessageHub
from transformers.cache_utils import Cache
from transformers.models.qwen2.modeling_qwen2 import (apply_rotary_pos_emb,
repeat_kv)
from xtuner.parallel.sequence import get_sequence_parallel_world_size
from xtuner.parallel.sequence.attention import (
post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn)
from .attention import flash_attn_wo_mask, varlen_flash_attn
SUPPORT_FLASH2 = False
try:
from flash_attn import flash_attn_func
_flash_supports_window_size = 'window_size' in list(
inspect.signature(flash_attn_func).parameters)
SUPPORT_FLASH2 = True
except ImportError:
pass
def qwen2_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in '
'v4.37. Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None
rotary_seq_len = max(kv_seq_len, position_ids.max().item() + 1)
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and self.config.use_sliding_window)
if past_key_value is not None:
# Slice the cache only if the config defines a
# `sliding_window` value
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads for sequence parallel
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, the layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32.
# Cast them back to the correct dtype so everything works as expected.
# This may slow down training and inference, so it is recommended not
# to cast the LayerNorms to fp32.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and self.training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
query_length=query_states.shape[1],
dropout=dropout_rate,
use_sliding_windows=use_sliding_windows,
)
if enable_sequence_parallel:
attn_output = post_process_for_sequence_parallel_attn(attn_output)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def qwen2_varlen_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
):
is_training = self.training
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cumulative_len = message_hub.get_info(f'cumulative_len_rank_{rank}')
max_seqlen = message_hub.get_info(f'max_seqlen_rank_{rank}')
assert is_training == (past_key_value is None)
use_varlen_atten = (cumulative_len is not None)
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37.'
' Please make sure to use `attention_mask` instead.')
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
'The cache structure has changed since version v4.36. '
f'If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, '
'please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(kv_seq_len,
self.layer_idx)
assert position_ids is not None
rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item() + 1)
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
if past_key_value is not None:
# Slice the cache only if the config defines a
# `sliding_window` value
cache_has_contents = past_key_value.get_seq_length(self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
'past key must have a shape of (`batch_size, num_heads, '
'self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat(
[attention_mask,
torch.ones_like(attention_mask[:, -1:])],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads for sequence parallel
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, the layer norms are usually cast to float32 for training
# stability, so the input hidden states are silently cast to float32.
# Cast them back to the expected dtype (autocast, pre-quantization, or
# projection weight dtype) so everything works as expected.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
# ----------------- flash attention forward ------------------------#
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
causal = self.is_causal and q_len != 1
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and self.config.use_sliding_window)
# Decide whether to use SWA or not by layer index.
if use_sliding_windows and self.layer_idx >= self.config.max_window_layers:
use_sliding_windows = False
window_size = (self.config.sliding_window,
self.config.sliding_window) if use_sliding_windows else (-1,
-1)
if use_varlen_atten:
attn_output = varlen_flash_attn(
query_states,
key_states,
value_states,
cumulative_len,
max_seqlen,
causal=causal,
dropout_p=dropout_rate,
window_size=window_size,
training=self.training)
else:
attn_output = flash_attn_wo_mask(
query_states,
key_states,
value_states,
causal=causal,
dropout_p=dropout_rate,
window_size=window_size,
training=self.training)
# ---------------- flash attention forward end ------------------- #
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
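# Illustrative sketch, not part of the original module: the sliding-window
# decision made in `qwen2_varlen_attn_forward` above, restated as a pure
# function. `select_window_size` is a hypothetical name; its arguments stand
# in for the config fields and runtime values read by the forward pass.
def select_window_size(sliding_window,
                       use_sliding_window,
                       kv_seq_len,
                       layer_idx,
                       max_window_layers,
                       flash_supports_window_size=True):
    use_sliding_windows = bool(flash_supports_window_size
                               and sliding_window is not None
                               and kv_seq_len > sliding_window
                               and use_sliding_window)
    # Layers at or beyond max_window_layers fall back to full attention.
    if use_sliding_windows and layer_idx >= max_window_layers:
        use_sliding_windows = False
    # (-1, -1) is the value the forward pass uses to disable windowing.
    if use_sliding_windows:
        return (sliding_window, sliding_window)
    return (-1, -1)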
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/triton_kernels/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .layer_norm import layer_norm_forward
from .rms_norm import rms_norm_forward
from .rotary import apply_rotary_emb
__all__ = ['rms_norm_forward', 'layer_norm_forward', 'apply_rotary_emb']
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/triton_kernels/layer_norm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.nn.functional as F
def layer_norm_forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
hidden_states = F.layer_norm(
hidden_states, (hidden_states.shape[-1], ), eps=self.variance_epsilon)
hidden_states = self.weight.to(torch.float32) * hidden_states
return hidden_states.to(input_dtype)
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/triton_kernels/rms_norm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import triton
import triton.language as tl
@triton.jit
def _rms_norm_fwd_fused(
X, # pointer to the input
Y, # pointer to the output
W, # pointer to the weights
Rstd, # pointer to the 1/std
stride, # how much to increase the pointer when moving by 1 row
N, # number of columns in X
eps, # epsilon to avoid division by zero
BLOCK_SIZE: tl.constexpr,
):
# Map the program id to the row of X and Y it should compute.
row = tl.program_id(0)
Y += row * stride
X += row * stride
# Compute variance
_var = tl.zeros([BLOCK_SIZE], dtype=tl.float32)
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
x = tl.load(X + cols, mask=cols < N, other=0.).to(tl.float32)
_var += x * x
var = tl.sum(_var, axis=0) / N
rstd = 1 / tl.sqrt(var + eps)
# Write rstd
tl.store(Rstd + row, rstd)
# Normalize and apply linear transformation
for off in range(0, N, BLOCK_SIZE):
cols = off + tl.arange(0, BLOCK_SIZE)
mask = cols < N
w = tl.load(W + cols, mask=mask)
x = tl.load(X + cols, mask=mask, other=0.).to(tl.float32)
x_hat = x * rstd
y = x_hat * w
# Write output
tl.store(Y + cols, y, mask=mask)
@triton.jit
def _rms_norm_bwd_dx_fused(
DX, # pointer to the input gradient
DY, # pointer to the output gradient
DW, # pointer to the partial sum of weights gradient
X, # pointer to the input
W, # pointer to the weights
Rstd, # pointer to the 1/std
Lock, # pointer to the lock
stride, # how much to increase the pointer when moving by 1 row
N, # number of columns in X
eps, # epsilon to avoid division by zero
GROUP_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr):
# Map the program id to the elements of X, DX, and DY it should compute.
row = tl.program_id(0)
cols = tl.arange(0, BLOCK_SIZE_N)
mask = cols < N
X += row * stride
DY += row * stride
DX += row * stride
# Offset locks and weights/biases gradient pointer for parallel reduction
lock_id = row % GROUP_SIZE_M
Lock += lock_id
Count = Lock + GROUP_SIZE_M
DW = DW + lock_id * N + cols
# Load data to SRAM
x = tl.load(X + cols, mask=mask, other=0).to(tl.float32)
dy = tl.load(DY + cols, mask=mask, other=0).to(tl.float32)
w = tl.load(W + cols, mask=mask).to(tl.float32)
rstd = tl.load(Rstd + row)
# Compute dx
xhat = x * rstd
wdy = w * dy
xhat = tl.where(mask, xhat, 0.)
wdy = tl.where(mask, wdy, 0.)
c1 = tl.sum(xhat * wdy, axis=0) / N
dx = (wdy - (xhat * c1)) * rstd
# Write dx
tl.store(DX + cols, dx, mask=mask)
# Accumulate partial sums for dw/db
partial_dw = (dy * xhat).to(w.dtype)
while tl.atomic_cas(Lock, 0, 1) == 1:
pass
count = tl.load(Count)
# First store doesn't accumulate
if count == 0:
tl.atomic_xchg(Count, 1)
else:
partial_dw += tl.load(DW, mask=mask)
tl.store(DW, partial_dw, mask=mask)
# Release the lock
tl.atomic_xchg(Lock, 0)
@triton.jit
def _rms_norm_bwd_dwdb(
DW, # pointer to the partial sum of weights gradient
FINAL_DW, # pointer to the weights gradient
M, # GROUP_SIZE_M
N, # number of columns
BLOCK_SIZE_M: tl.constexpr,
BLOCK_SIZE_N: tl.constexpr):
# Map the program id to the elements of DW and DB it should compute.
pid = tl.program_id(0)
cols = pid * BLOCK_SIZE_N + tl.arange(0, BLOCK_SIZE_N)
dw = tl.zeros((BLOCK_SIZE_M, BLOCK_SIZE_N), dtype=tl.float32)
# Iterate through the rows of DW and DB to sum the partial sums.
for i in range(0, M, BLOCK_SIZE_M):
rows = i + tl.arange(0, BLOCK_SIZE_M)
mask = (rows[:, None] < M) & (cols[None, :] < N)
offs = rows[:, None] * N + cols[None, :]
dw += tl.load(DW + offs, mask=mask, other=0.)
# Write the final sum to the output.
sum_dw = tl.sum(dw, axis=0)
tl.store(FINAL_DW + cols, sum_dw, mask=cols < N)
class RMSNorm(torch.autograd.Function):
@staticmethod
def forward(ctx, x, weight, eps):
# allocate output
y = torch.empty_like(x)
# reshape input data into 2D tensor
x_arg = x.reshape(-1, x.shape[-1])
M, N = x_arg.shape
rstd = torch.empty((M, ), dtype=torch.float32, device=x.device)
# Less than 64KB per feature: enqueue fused kernel
MAX_FUSED_SIZE = 65536 // x.element_size()
BLOCK_SIZE = min(MAX_FUSED_SIZE, triton.next_power_of_2(N))
if N > BLOCK_SIZE:
raise RuntimeError(
"This rms norm doesn't support feature dim >= 64KB.")
# heuristics for number of warps
num_warps = min(max(BLOCK_SIZE // 256, 1), 8)
# enqueue kernel
_rms_norm_fwd_fused[(M, )](
x_arg,
y,
weight,
rstd,
x_arg.stride(0),
N,
eps,
BLOCK_SIZE=BLOCK_SIZE,
num_warps=num_warps,
)
ctx.save_for_backward(x, weight, rstd)
ctx.BLOCK_SIZE = BLOCK_SIZE
ctx.num_warps = num_warps
ctx.eps = eps
return y
@staticmethod
def backward(ctx, dy):
x, w, v = ctx.saved_tensors
# heuristics for amount of parallel reduction stream for DW/DB
N = w.shape[0]
GROUP_SIZE_M = 64
if N <= 8192:
GROUP_SIZE_M = 96
if N <= 4096:
GROUP_SIZE_M = 128
if N <= 1024:
GROUP_SIZE_M = 256
# allocate output
locks = torch.zeros(2 * GROUP_SIZE_M, dtype=torch.int32, device=w.device)
_dw = torch.empty((GROUP_SIZE_M, w.shape[0]),
dtype=x.dtype,
device=w.device)
dw = torch.empty((w.shape[0], ), dtype=w.dtype, device=w.device)
dx = torch.empty_like(dy)
# enqueue kernel using forward pass heuristics
# also compute partial sums for DW and DB
x_arg = x.reshape(-1, x.shape[-1])
M, N = x_arg.shape
_rms_norm_bwd_dx_fused[(M, )](
dx,
dy,
_dw,
x,
w,
v,
locks,
x_arg.stride(0),
N,
ctx.eps,
BLOCK_SIZE_N=ctx.BLOCK_SIZE,
GROUP_SIZE_M=GROUP_SIZE_M,
num_warps=ctx.num_warps)
def grid(meta):
return [triton.cdiv(N, meta['BLOCK_SIZE_N'])]
# accumulate partial sums in separate kernel
_rms_norm_bwd_dwdb[grid](
_dw,
dw,
GROUP_SIZE_M,
N,
BLOCK_SIZE_M=32,
BLOCK_SIZE_N=128,
)
return dx, dw, None
rms_norm = RMSNorm.apply
def rms_norm_forward(self, hidden_states):
if (hidden_states.device == torch.device('cpu')
or self.weight.device == torch.device('cpu')):
raise RuntimeError(
'Can not use triton kernels on cpu. Please set `USE_TRITON_KERNEL`'
' environment variable to 0 before training.')
return rms_norm(hidden_states, self.weight, self.variance_epsilon)
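# Illustrative sketch, not part of the original module: a pure-Python
# reference for what `_rms_norm_fwd_fused` computes for a single row, namely
# y = x * rstd * w with rstd = 1 / sqrt(mean(x^2) + eps).
# `rms_norm_reference` is a hypothetical name; it may be handy for checking
# the Triton kernel on small inputs.
import math


def rms_norm_reference(row, weight, eps):
    n = len(row)
    # Mean of squares (the kernel calls this "var"; no mean is subtracted).
    var = sum(v * v for v in row) / n
    rstd = 1.0 / math.sqrt(var + eps)
    return [v * rstd * w for v, w in zip(row, weight)]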
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/triton_kernels/rotary.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
# Modified from https://github.com/Dao-AILab/flash-attention/blob/main/flash_attn/ops/triton/rotary.py # noqa:E501
from typing import Optional, Union
import torch
import triton
import triton.language as tl
@triton.jit
def rotary_kernel(
OUT, # Pointers to matrices
X,
COS,
SIN,
CU_SEQLENS,
SEQLEN_OFFSETS, # this could be int or a pointer
# Matrix dimensions
seqlen,
rotary_dim,
seqlen_ro,
# strides
stride_out_batch,
stride_out_seqlen,
stride_out_nheads,
stride_out_headdim,
stride_x_batch,
stride_x_seqlen,
stride_x_nheads,
stride_x_headdim,
# Meta-parameters
BLOCK_K: tl.constexpr,
IS_SEQLEN_OFFSETS_TENSOR: tl.constexpr,
IS_VARLEN: tl.constexpr,
INTERLEAVED: tl.constexpr,
CONJUGATE: tl.constexpr,
BLOCK_M: tl.constexpr,
):
pid_m = tl.program_id(axis=0)
pid_batch = tl.program_id(axis=1)
pid_head = tl.program_id(axis=2)
rotary_dim_half = rotary_dim // 2
if not IS_VARLEN:
X = X + pid_batch * stride_x_batch + pid_head * stride_x_nheads
OUT = OUT + pid_batch * stride_out_batch + pid_head * stride_out_nheads
else:
start_idx = tl.load(CU_SEQLENS + pid_batch)
seqlen = tl.load(CU_SEQLENS + pid_batch + 1) - start_idx
X = X + start_idx * stride_x_seqlen + pid_head * stride_x_nheads
OUT = OUT + start_idx * stride_out_seqlen + \
pid_head * stride_out_nheads
if pid_m * BLOCK_M >= seqlen:
return
rm = pid_m * BLOCK_M + tl.arange(0, BLOCK_M)
if not IS_SEQLEN_OFFSETS_TENSOR:
rm_cs = rm + SEQLEN_OFFSETS
else:
rm_cs = rm + tl.load(SEQLEN_OFFSETS + pid_batch)
rk = tl.arange(0, BLOCK_K)
rk_half = tl.arange(0, BLOCK_K // 2)
if not INTERLEAVED:
# Load the 1st and 2nd halves of X, do calculation,
# then store to 1st and 2nd halves of OUT
X = X + (
rm[:, None] * stride_x_seqlen +
rk_half[None, :] * stride_x_headdim)
# This is different from the official implementation as the shapes of
# the two tensors cos and sin are (seqlen_ro, rotary_dim) instead of
# (seqlen_ro, rotary_dim // 2).
COS = COS + (rm_cs[:, None] * rotary_dim + rk_half[None, :])
SIN = SIN + (rm_cs[:, None] * rotary_dim + rk_half[None, :])
cos = tl.load(
COS,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_half[None, :] < rotary_dim_half),
other=1.0).to(tl.float32)
sin = tl.load(
SIN,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_half[None, :] < rotary_dim_half),
other=0.0).to(tl.float32)
x0 = tl.load(
X,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half),
other=0.0).to(tl.float32)
x1 = tl.load(
X + rotary_dim_half * stride_x_headdim,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half),
other=0.0,
).to(tl.float32)
if CONJUGATE:
sin = -sin
o0 = x0 * cos - x1 * sin
o1 = x0 * sin + x1 * cos
# write back result
OUT = OUT + (
rm[:, None] * stride_out_seqlen +
rk_half[None, :] * stride_out_headdim)
tl.store(
OUT,
o0,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half))
tl.store(
OUT + rotary_dim_half * stride_out_headdim,
o1,
mask=(rm[:, None] < seqlen) & (rk_half[None, :] < rotary_dim_half),
)
else:
# We don't want to load X[0, 2, 4, ...] and X[1, 3, 5, ...] separately
# since both are slow.
# Instead, we load x0 = X[0, 1, 2, 3, ...] and x1 = X[1, 0, 3, 2, ...].
# Loading x0 will be fast but x1 will be slow.
# Then we load cos = COS[0, 0, 1, 1, ...] and
# sin = SIN[0, 0, 1, 1, ...].
# Then we do the calculation and use tl.where to pick out the right
# outputs for the even and for the odd indices.
rk_swap = rk + ((rk + 1) % 2) * 2 - 1 # 1, 0, 3, 2, 5, 4, ...
rk_repeat = tl.arange(0, BLOCK_K) // 2
# This is different from the official implementation as the shapes of
# the two tensors cos and sin are (seqlen_ro, rotary_dim) instead of
# (seqlen_ro, rotary_dim // 2).
X0 = X + (
rm[:, None] * stride_x_seqlen + rk[None, :] * stride_x_headdim)
X1 = X + (
rm[:, None] * stride_x_seqlen +
rk_swap[None, :] * stride_x_headdim)
COS = COS + (rm_cs[:, None] * rotary_dim + rk_repeat[None, :])
SIN = SIN + (rm_cs[:, None] * rotary_dim + rk_repeat[None, :])
cos = tl.load(
COS,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_repeat[None, :] < rotary_dim_half),
other=1.0,
).to(tl.float32)
sin = tl.load(
SIN,
mask=(rm_cs[:, None] < seqlen_ro) &
(rk_repeat[None, :] < rotary_dim_half),
other=0.0,
).to(tl.float32)
x0 = tl.load(
X0,
mask=(rm[:, None] < seqlen) & (rk[None, :] < rotary_dim),
other=0.0).to(tl.float32)
x1 = tl.load(
X1,
mask=(rm[:, None] < seqlen) & (rk_swap[None, :] < rotary_dim),
other=0.0).to(tl.float32)
if CONJUGATE:
sin = -sin
x0_cos = x0 * cos
x1_sin = x1 * sin
out = tl.where(rk[None, :] % 2 == 0, x0_cos - x1_sin, x0_cos + x1_sin)
OUT = OUT + (
rm[:, None] * stride_out_seqlen + rk[None, :] * stride_out_headdim)
tl.store(
OUT, out, mask=(rm[:, None] < seqlen) & (rk[None, :] < rotary_dim))
def apply_rotary(
x: torch.Tensor,
cos: torch.Tensor,
sin: torch.Tensor,
seqlen_offsets: Union[int, torch.Tensor] = 0,
cu_seqlens: Optional[torch.Tensor] = None,
max_seqlen: Optional[int] = None,
interleaved=False,
inplace=False,
conjugate=False,
) -> torch.Tensor:
"""
Arguments:
x: (batch, seqlen, nheads, headdim) if cu_seqlens is None
else (total_seqlen, nheads, headdim).
cos: (seqlen_ro, rotary_dim)
sin: (seqlen_ro, rotary_dim)
seqlen_offsets: integer or integer tensor of size (batch,)
cu_seqlens: (batch + 1,) or None
max_seqlen: int
Returns:
y: (batch, seqlen, nheads, headdim)
"""
is_varlen = cu_seqlens is not None
if not is_varlen:
batch, seqlen, nheads, headdim = x.shape
else:
assert max_seqlen is not None, ('If cu_seqlens is passed in, '
'then max_seqlen must be passed')
total_seqlen, nheads, headdim = x.shape
batch_p_1 = cu_seqlens.shape[0]
batch = batch_p_1 - 1
seqlen = max_seqlen
seqlen_ro, rotary_dim = cos.shape
assert sin.shape == cos.shape
# rotary_dim *= 2
assert rotary_dim <= headdim, 'rotary_dim must be <= headdim'
assert headdim <= 256, 'Only support headdim <= 256'
assert seqlen_ro >= seqlen, 'seqlen_ro must be >= seqlen'
assert (
cos.dtype == sin.dtype
), f'cos and sin must have the same dtype, got {cos.dtype} and {sin.dtype}'
assert (x.dtype == cos.dtype), (
f'Input and cos/sin must have the same dtype, '
f'got {x.dtype} and {cos.dtype}')
cos, sin = cos.contiguous(), sin.contiguous()
if isinstance(seqlen_offsets, torch.Tensor):
assert seqlen_offsets.shape == (batch, )
assert seqlen_offsets.dtype in [torch.int32, torch.int64]
seqlen_offsets = seqlen_offsets.contiguous()
else:
assert seqlen_offsets + seqlen <= seqlen_ro
output = torch.empty_like(x) if not inplace else x
if rotary_dim < headdim and not inplace:
output[..., rotary_dim:].copy_(x[..., rotary_dim:])
BLOCK_K = (32 if rotary_dim <= 32 else
(64 if rotary_dim <= 64 else
(128 if rotary_dim <= 128 else 256)))
def grid(META):
return (triton.cdiv(seqlen, META['BLOCK_M']), batch, nheads)
BLOCK_M = 4 if interleaved else (8 if rotary_dim <= 64 else 4)
# Need this, otherwise Triton tries to launch from cuda:0 and we get
# ValueError: Pointer argument (at 0) cannot be accessed from Triton
# (cpu tensor?)
with torch.cuda.device(x.device.index):
rotary_kernel[grid](
output, # data ptrs
x,
cos,
sin,
cu_seqlens,
seqlen_offsets,
seqlen, # shapes
rotary_dim,
seqlen_ro,
output.stride(0)
if not is_varlen else 0, # batch_strides if not varlen else 0
output.stride(-3), # seqlen_stride or total_seqlen_stride
output.stride(-2), # nheads_stride
output.stride(-1), # headdim_stride
x.stride(0)
if not is_varlen else 0, # batch_strides if not varlen else 0
x.stride(-3), # seqlen stride or total_seqlen_stride
x.stride(-2), # nheads stride
x.stride(-1), # headdim stride
BLOCK_K,
isinstance(seqlen_offsets, torch.Tensor),
is_varlen,
interleaved,
conjugate,
BLOCK_M,
)
return output
class ApplyRotaryEmb(torch.autograd.Function):
@staticmethod
def forward(
ctx,
x,
cos,
sin,
interleaved=False,
inplace=False,
seqlen_offsets: Union[int, torch.Tensor] = 0,
cu_seqlens: Optional[torch.Tensor] = None,
max_seqlen: Optional[int] = None,
):
out = apply_rotary(
x,
cos,
sin,
seqlen_offsets=seqlen_offsets,
cu_seqlens=cu_seqlens,
max_seqlen=max_seqlen,
interleaved=interleaved,
inplace=inplace,
)
if isinstance(seqlen_offsets, int):
ctx.save_for_backward(
cos, sin, cu_seqlens) # Can't save int with save_for_backward
ctx.seqlen_offsets = seqlen_offsets
else:
ctx.save_for_backward(cos, sin, cu_seqlens, seqlen_offsets)
ctx.seqlen_offsets = None
ctx.interleaved = interleaved
ctx.inplace = inplace
ctx.max_seqlen = max_seqlen
return out if not inplace else x
@staticmethod
def backward(ctx, do):
seqlen_offsets = ctx.seqlen_offsets
if seqlen_offsets is None:
cos, sin, cu_seqlens, seqlen_offsets = ctx.saved_tensors
else:
cos, sin, cu_seqlens = ctx.saved_tensors
# TD [2023-09-02]: For some reason Triton (2.0.0.post1) errors with
# "[CUDA]: invalid device context", and cloning makes it work. Idk why.
# Triton 2.1.0 works.
if not ctx.interleaved and not ctx.inplace:
do = do.clone()
dx = apply_rotary(
do,
cos,
sin,
seqlen_offsets=seqlen_offsets,
cu_seqlens=cu_seqlens,
max_seqlen=ctx.max_seqlen,
interleaved=ctx.interleaved,
inplace=ctx.inplace,
conjugate=True,
)
return dx, None, None, None, None, None, None, None
apply_rotary_emb = ApplyRotaryEmb.apply
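# Illustrative sketch, not part of the original module: a pure-Python
# reference for the non-interleaved branch of `rotary_kernel`, applied to a
# single position of a single head. As in the kernel above, `cos` and `sin`
# have length rotary_dim and only their first half is read.
# `rotary_reference` is a hypothetical name.
def rotary_reference(x, cos, sin, conjugate=False):
    rotary_dim = len(cos)
    half = rotary_dim // 2
    # Entries past rotary_dim are copied through unchanged.
    out = list(x)
    for k in range(half):
        c = cos[k]
        # conjugate=True negates sin, which inverts the rotation (backward).
        s = -sin[k] if conjugate else sin[k]
        x0, x1 = x[k], x[k + half]
        out[k] = x0 * c - x1 * s
        out[k + half] = x0 * s + x1 * c
    return out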
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/utils.py
================================================
import torch
import torch.nn.functional as F
try:
from flash_attn.bert_padding import index_first_axis, unpad_input
except ImportError:
pass
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(
torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
def upad_qkv(query_layer, key_layer, value_layer, attention_mask,
query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(
attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
value_layer = index_first_axis(
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim), indices_k)
if query_length == kv_seq_len:
# Differs from the original version because sequence parallelism changes
# the number of attention heads.
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, -1, head_dim),
indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = \
unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
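# Illustrative sketch (not part of xtuner): a pure-Python analogue of
# `_get_unpad_data` above, showing how a 0/1 attention mask becomes the
# flat token indices, cumulative sequence lengths and maximum length that
# varlen flash-attention kernels consume. The name `unpad_metadata` and the
# nested-list mask are assumptions made for illustration only.

```python
from itertools import accumulate


def unpad_metadata(attention_mask):
    """Return (indices, cu_seqlens, max_seqlen) for a 0/1 mask given as nested lists."""
    # Number of real (unpadded) tokens in each row of the batch.
    seqlens = [sum(row) for row in attention_mask]
    # Positions of real tokens in the flattened (batch * seq_len) layout.
    flat = [value for row in attention_mask for value in row]
    indices = [i for i, value in enumerate(flat) if value]
    # Prefix sums with a leading 0, mirroring F.pad(cumsum(...), (1, 0)).
    cu_seqlens = [0] + list(accumulate(seqlens))
    return indices, cu_seqlens, max(seqlens)


mask = [[1, 1, 1, 0],
        [1, 1, 0, 0]]
indices, cu_seqlens, max_seqlen = unpad_metadata(mask)
```

# Sequence i occupies packed positions cu_seqlens[i]:cu_seqlens[i + 1].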
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/dispatch/yi.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Optional, Tuple
import torch
import torch.nn.functional as F
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
def apply_rotary_pos_emb(q, k, cos, sin, position_ids):
# The first two dimensions of cos and sin are always 1,
# so we can `squeeze` them.
cos = cos.squeeze(1).squeeze(0) # [seq_len, dim]
sin = sin.squeeze(1).squeeze(0) # [seq_len, dim]
cos = cos[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
sin = sin[position_ids].unsqueeze(1) # [bs, 1, seq_len, dim]
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""This is the equivalent of torch.repeat_interleave(x, dim=1,
repeats=n_rep).
The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to
(batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
def yi_attn_forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states,
cos, sin, position_ids)
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
# use flash attention implemented by pytorch
attn_output = F.scaled_dot_product_attention(
query_states, key_states, value_states, attn_mask=attention_mask)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
# Due to the implementation of the PyTorch version of flash attention,
# even when the output_attentions flag is set to True, it is not possible
# to return the attn_weights.
return attn_output, None, past_key_value
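# Illustrative sketch (not part of xtuner): `rotate_half` and the rotary
# update `x * cos + rotate_half(x) * sin` from `apply_rotary_pos_emb` above,
# written for a single head vector in pure Python. The helper name
# `apply_rope` and the frequency base of 10000 are assumptions; the real
# cos/sin tables come from the model's `rotary_emb` module.

```python
import math


def rotate_half(x):
    """Map [x1, x2] (two halves) to [-x2, x1], as in the torch version."""
    half = len(x) // 2
    return [-v for v in x[half:]] + x[:half]


def apply_rope(x, position, base=10000.0):
    """Rotate one head vector by position-dependent angles (half layout)."""
    half = len(x) // 2
    # One angle per pair (x[i], x[i + half]); assumed standard RoPE schedule.
    angles = [position * base ** (-i / half) for i in range(half)]
    cos = [math.cos(a) for a in angles] * 2
    sin = [math.sin(a) for a in angles] * 2
    rotated = rotate_half(x)
    return [xi * c + ri * s for xi, c, ri, s in zip(x, cos, rotated, sin)]


def dot(a, b):
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
```

# Each pair is rotated in its own plane, so norms are preserved and the
# query-key dot product depends only on the relative position.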
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/projector/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from transformers import AutoConfig, AutoModel
from .configuration_projector import ProjectorConfig
from .modeling_projector import ProjectorModel
AutoConfig.register('projector', ProjectorConfig)
AutoModel.register(ProjectorConfig, ProjectorModel)
__all__ = ['ProjectorConfig', 'ProjectorModel']
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/projector/configuration_projector.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from transformers import PretrainedConfig
class ProjectorConfig(PretrainedConfig):
model_type = 'projector'
_auto_class = 'AutoConfig'
def __init__(
self,
visual_hidden_size=4096,
llm_hidden_size=4096,
depth=2,
hidden_act='gelu',
bias=True,
**kwargs,
):
self.visual_hidden_size = visual_hidden_size
self.llm_hidden_size = llm_hidden_size
self.depth = depth
self.hidden_act = hidden_act
self.bias = bias
super().__init__(**kwargs)
================================================
FILE: xtuner-eval_niah/xtuner/model/modules/projector/modeling_projector.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.nn as nn
from transformers import PreTrainedModel
from transformers.activations import ACT2FN
from .configuration_projector import ProjectorConfig
class ProjectorModel(PreTrainedModel):
_auto_class = 'AutoModel'
config_class = ProjectorConfig
base_model_prefix = 'model'
supports_gradient_checkpointing = True
def __init__(self, config: ProjectorConfig) -> None:
super().__init__(config)
self.gradient_checkpointing = False
modules = [
nn.Linear(
config.visual_hidden_size,
config.llm_hidden_size,
bias=config.bias)
]
for _ in range(1, config.depth):
modules.append(ACT2FN[config.hidden_act])
modules.append(
nn.Linear(
config.llm_hidden_size,
config.llm_hidden_size,
bias=config.bias))
self.model = nn.Sequential(*modules)
def enable_input_require_grads(self):
def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
self.model.register_forward_hook(make_inputs_require_grad)
def _set_gradient_checkpointing(self, module, value=False):
if isinstance(module, ProjectorModel):
module.gradient_checkpointing = value
def forward(self, x):
if self.gradient_checkpointing and self.training:
layer_outputs = torch.utils.checkpoint.checkpoint(self.model, x)
else:
layer_outputs = self.model(x)
return layer_outputs
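# Illustrative sketch (not part of xtuner): the layer layout that
# `ProjectorModel.__init__` above builds for a given `depth`, expressed as
# plain tuples so it runs without torch. The function name
# `projector_layout` and the tuple encoding are assumptions for illustration.

```python
def projector_layout(visual_hidden_size=4096,
                     llm_hidden_size=4096,
                     depth=2,
                     hidden_act='gelu'):
    """List the layers of the projector MLP in order."""
    # The first linear layer maps visual features into the LLM width.
    layers = [('linear', visual_hidden_size, llm_hidden_size)]
    # Every additional level adds an activation and an LLM-width linear layer.
    for _ in range(1, depth):
        layers.append((hidden_act,))
        layers.append(('linear', llm_hidden_size, llm_hidden_size))
    return layers


layout = projector_layout(visual_hidden_size=1024, llm_hidden_size=4096, depth=2)
```

# A projector of depth d therefore has d linear layers and d - 1 activations.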
================================================
FILE: xtuner-eval_niah/xtuner/model/orpo.py
================================================
# ORPO Authors: Jiwoo Hong, Noah Lee, and James Thorne
# Official code: https://github.com/xfactlab/orpo
# Copyright (c) OpenMMLab. All rights reserved.
import torch
import torch.distributed as dist
import torch.nn.functional as F
from mmengine import MessageHub
from torch import nn
from xtuner.parallel.sequence import (gather_forward_split_backward,
get_sequence_parallel_group,
get_sequence_parallel_world_size,
split_for_sequence_parallel)
from .sft import SupervisedFinetune
class ORPO(SupervisedFinetune):
"""ORPO: Monolithic Preference Optimization without Reference Model
https://arxiv.org/abs/2403.07691
Args:
beta (float): Weight of the odds_ratio_loss. Defaults to 0.1.
"""
def __init__(self, *args, beta=0.1, **kwargs):
super().__init__(*args, **kwargs)
self.beta = beta
def _gather_masked_logits(self, logits, labels, mask):
logits = torch.gather(
logits.log_softmax(-1), dim=2,
index=labels.unsqueeze(2)).squeeze(2)
return logits * mask
def get_logps(
self,
all_logits, # bs, seqlen,vocab_size
average_log_prob, # bool: average over unmasked tokens instead of summing
labels, # bs, seqlen
):
labels = labels[:, 1:].clone()
all_logits = all_logits[:, :-1, :]
labels[labels == -100] = 0
loss_mask = labels != 0
all_logps = self._gather_masked_logits(all_logits, labels,
loss_mask).sum(-1)
if average_log_prob: # average_log_prob
all_logps = all_logps / loss_mask.sum(-1)
chosen_logps = all_logps[::2]
rejected_logps = all_logps[1::2]
return chosen_logps, rejected_logps
def get_var_len_atten_logps(self, all_logits, average_log_prob, labels,
cu_seqlens, attention_mask):
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
# unpack sequence
unpacked_logits = torch.split(all_logits, seqlens, dim=1)
unpacked_labels = torch.split(labels, seqlens, dim=1)
if attention_mask is not None:
# A non-None attention_mask indicates that the original sequence,
# labels, position_ids and cumulative_len were padded for sequence
# parallel.
# We then need to remove the padded segments.
assert False in attention_mask
unpacked_logits = unpacked_logits[:-1]
unpacked_labels = unpacked_labels[:-1]
assert len(unpacked_logits) % 2 == 0
def compute_logps(_logits, _labels):
_labels = _labels[:, 1:].clone()
_logits = _logits[:, :-1, :]
_labels[_labels == -100] = 0
loss_mask = _labels != 0
logps = self._gather_masked_logits(_logits, _labels, loss_mask)
logps = logps.sum(-1)
if average_log_prob:
logps /= loss_mask.sum(-1)
return logps
chosen_logps, rejected_logps = [], []
for i in range(len(unpacked_logits) // 2):
chosen = unpacked_logits[2 * i]
rejected = unpacked_logits[2 * i + 1]
chosen_label = unpacked_labels[2 * i]
rejected_label = unpacked_labels[2 * i + 1]
chosen_logps.append(compute_logps(chosen, chosen_label))
rejected_logps.append(compute_logps(rejected, rejected_label))
return (torch.stack(chosen_logps), torch.stack(rejected_logps))
def cross_entropy_loss(self, logits, labels):
logits = logits[..., :-1, :].contiguous()
labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = nn.CrossEntropyLoss()
logits = logits.view(-1, logits.shape[-1])
labels = labels.view(-1)
# Enable model parallelism
labels = labels.to(logits.device)
loss = loss_fct(logits, labels)
return loss
def odds_ratio_loss(
self,
chosen_logps: torch.FloatTensor,
rejected_logps: torch.FloatTensor,
):
# modified from https://github.com/huggingface/trl/blob/b031adfdb8708f1f295eab6c3f2cb910e8fe0c23/trl/trainer/orpo_trainer.py#L597 # noqa
# Derived from Eqs. (4) and (7) from https://arxiv.org/abs/2403.07691 by using log identities and exp(log(P(y|x))) = P(y|x) # noqa
log_odds = (chosen_logps - rejected_logps) - (
torch.log1p(-torch.exp(chosen_logps)) -
torch.log1p(-torch.exp(rejected_logps)))
ratio = F.logsigmoid(log_odds)
ratio = ratio[~torch.isnan(ratio)] # select valid loss
losses = self.beta * ratio
chosen_rewards = self.beta * chosen_logps
rejected_rewards = self.beta * rejected_logps
return losses, chosen_rewards, rejected_rewards, torch.mean(
ratio), torch.mean(log_odds)
@staticmethod
def _split_for_sequence_parallel(data):
# attention mask should not be split
ARGS_NEED_TO_SPLIT = ('input_ids', 'position_ids')
sp_group = get_sequence_parallel_group()
for key in ARGS_NEED_TO_SPLIT:
val = data.get(key, None)
if val is not None:
# `dim` is 1 as the shape of tensor is (bs, seq_len, ...)
data[key] = split_for_sequence_parallel(
val, dim=1, sp_group=sp_group)
return data
def compute_loss(self, data, data_samples=None):
labels_ori = data.pop('labels')
if get_sequence_parallel_world_size() > 1:
data = self._split_for_sequence_parallel(data)
all_logits = self.llm(**data).logits
if get_sequence_parallel_world_size() > 1:
all_logits = gather_forward_split_backward(
all_logits,
dim=1,
sp_group=get_sequence_parallel_group(),
grad_scale='up')
if not self.use_varlen_attn:
chosen_nll_loss = self.cross_entropy_loss(all_logits[::2],
labels_ori.clone()[::2])
chosen_logps, rejected_logps = self.get_logps(
all_logits, True, labels_ori)
else:
message_hub = MessageHub.get_instance('varlen_attn_args')
rank = dist.get_rank()
cu_seqlens = message_hub.get_info(f'cumulative_len_rank_{rank}')
seqlens = (cu_seqlens[1:] - cu_seqlens[:-1]).tolist()
attention_mask = data['attention_mask']
if attention_mask is not None:
# A non-None attention_mask indicates that the original sequence,
# labels, position_ids and cumulative_len were padded for sequence
# parallel.
# We then need to remove the padded segments.
logits = torch.split(all_logits, seqlens, dim=1)[:-1]
assert len(logits) % 2 == 0
chosen_logits = logits[::2]
labels = torch.split(labels_ori.clone(), seqlens, dim=1)[:-1]
assert len(labels) % 2 == 0
chosen_labels = labels[::2]
else:
chosen_logits = torch.split(all_logits, seqlens, dim=1)[::2]
chosen_labels = torch.split(
labels_ori.clone(), seqlens, dim=1)[::2]
chosen_logits = torch.cat(chosen_logits, dim=1)
chosen_labels = torch.cat(chosen_labels, dim=1)
chosen_nll_loss = self.cross_entropy_loss(chosen_logits,
chosen_labels)
chosen_logps, rejected_logps = self.get_var_len_atten_logps(
all_logits, True, labels_ori, cu_seqlens, attention_mask)
(losses, chosen_rewards, rejected_rewards, log_odds_ratio,
log_odds_chosen) = self.odds_ratio_loss(chosen_logps, rejected_logps)
losses = losses.mean()
# skip nan loss
if torch.isnan(chosen_nll_loss):
chosen_nll_loss = all_logits.mean() * 0
if torch.isnan(losses):
losses = all_logits.mean() * 0
loss = chosen_nll_loss - losses
reward_acc = (chosen_rewards > rejected_rewards).float().mean()
loss_dict = {
'loss': loss,
'chosen_rewards': chosen_rewards.mean(),
'rejected_rewards': rejected_rewards.mean(),
'reward_acc': reward_acc,
'reward_margin': (chosen_rewards - rejected_rewards).mean(),
'log_odds_ratio': log_odds_ratio,
'log_odds_chosen': log_odds_chosen,
'nll_loss': chosen_nll_loss.detach().mean()
}
return loss_dict
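# Illustrative sketch (not part of xtuner): the scalar arithmetic behind
# `odds_ratio_loss` and the final `loss = chosen_nll_loss - losses` above,
# for a single chosen/rejected pair in pure Python. The function names are
# assumptions; in the class the log-probabilities are per-token averages
# and the terms are batched tensors.

```python
import math


def log_odds(chosen_logp, rejected_logp):
    """log(odds(chosen) / odds(rejected)) with odds(p) = p / (1 - p)."""
    return (chosen_logp - rejected_logp) - (
        math.log1p(-math.exp(chosen_logp)) -
        math.log1p(-math.exp(rejected_logp)))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def orpo_loss(nll, chosen_logp, rejected_logp, beta=0.1):
    """NLL on the chosen response minus beta * log-sigmoid of the log odds ratio."""
    return nll - beta * log_sigmoid(log_odds(chosen_logp, rejected_logp))


# Example: P(chosen) = 0.5 and P(rejected) = 0.2 give an odds ratio of 4.
example = orpo_loss(1.0, math.log(0.5), math.log(0.2))
```

# Because log-sigmoid is always negative, the odds-ratio term adds a positive
# penalty that shrinks as the chosen response becomes more likely than the
# rejected one.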
================================================
FILE: xtuner-eval_niah/xtuner/model/reward.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import json
import math
import os
import warnings
from collections import OrderedDict
from contextlib import nullcontext
import torch
import torch.distributed as dist
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from mmengine.model import BaseModel
from mmengine.runner import load_checkpoint
from peft import get_peft_model, prepare_model_for_kbit_training
from torch import nn
from transformers import (AutoConfig, AutoModelForSequenceClassification,
PreTrainedModel, PreTrainedTokenizer)
from transformers.dynamic_module_utils import get_class_from_dynamic_module
from transformers.integrations import is_deepspeed_zero3_enabled
from transformers.modeling_utils import no_init_weights
from xtuner.parallel.sequence import (gather_forward_split_backward,
get_sequence_parallel_group,
get_sequence_parallel_world_size,
split_for_sequence_parallel)
from xtuner.registry import BUILDER
from .modules import dispatch_modules
from .modules.dispatch import SUPPORT_FLASH1, SUPPORT_FLASH2
from .utils import (LoadWoInit, find_all_linear_names,
get_peft_model_state_dict, make_inputs_require_grad,
traverse_dict)
def reduce_mean(tensor):
"""Obtain the mean of tensor on different GPUs."""
if not (dist.is_available() and dist.is_initialized()):
return tensor
tensor = tensor.clone()
dist.all_reduce(tensor.div_(dist.get_world_size()), op=dist.ReduceOp.SUM)
return tensor
def smart_tokenizer_and_embedding_resize(
tokenizer: PreTrainedTokenizer,
model: PreTrainedModel,
):
"""Resize embedding."""
if is_deepspeed_zero3_enabled():
import deepspeed
params = [model.get_input_embeddings().weight]
if model.get_output_embeddings(
) is not None and not model.config.tie_word_embeddings:
params.append(model.get_output_embeddings().weight)
context_maybe_zero3 = deepspeed.zero.GatheredParameters(
params, modifier_rank=0)
else:
context_maybe_zero3 = nullcontext()
with context_maybe_zero3:
current_embedding_size = model.get_input_embeddings().weight.size(0)
if len(tokenizer) > current_embedding_size:
assert isinstance(model.get_output_embeddings(), nn.Linear)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
with context_maybe_zero3:
num_new_tokens = len(tokenizer) - current_embedding_size
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
print_log(
f'Resized token embeddings from {current_embedding_size} to '
f'{len(tokenizer)}.', 'current')
class RewardModel(BaseModel):
def __init__(
self,
llm,
lora=None,
peft_model=None,
use_activation_checkpointing=True,
use_varlen_attn=False,
tokenizer=None,
max_position_embeddings=None,
reward_token_id=None,
loss_type='ranking',
penalty_type='log_barrier',
penalty_weight=0.01,
):
super().__init__()
with LoadWoInit():
if isinstance(llm, dict):
llm = self._dispatch_lm_model_cfg(llm, max_position_embeddings)
self.llm = self._build_from_cfg_or_module(llm).model
self.v_head = nn.Linear(self.llm.config.hidden_size, 1, bias=False)
# zero init
self.v_head.weight.data.zero_()
self.reward_token_id = reward_token_id
assert loss_type in ('ranking',
'focal'), f'Unsupported loss type {loss_type}'
self.loss_type = loss_type
assert penalty_type in (
'log_barrier', 'L2',
'none'), f'Unsupported penalty type {penalty_type}'
self.penalty_type = penalty_type
self.penalty_weight = penalty_weight
if tokenizer is not None:
if isinstance(tokenizer, dict):
tokenizer = BUILDER.build(tokenizer)
smart_tokenizer_and_embedding_resize(tokenizer, self.llm)
self.llm.config.use_cache = False
dispatch_modules(self.llm, use_varlen_attn=use_varlen_attn)
if use_activation_checkpointing:
# For backward compatibility
if hasattr(self.llm, 'enable_input_require_grads'):
self.llm.enable_input_require_grads()
else:
self.llm.get_input_embeddings().register_forward_hook(
make_inputs_require_grad)
# enable gradient checkpointing for memory efficiency
self.gradient_checkpointing_enable()
if isinstance(lora, dict) or isinstance(lora, Config) or isinstance(
lora, ConfigDict):
self.lora = BUILDER.build(lora)
else:
self.lora = lora
self.peft_model = peft_model
self.use_lora = lora is not None
if self.use_lora:
self._prepare_for_lora(peft_model, use_activation_checkpointing)
self._is_init = True
# Determines whether to calculate attention based on the
# seq_len dimension (use_varlen_attn = False) or the actual length of
# the sequence.
self.use_varlen_attn = use_varlen_attn
def gradient_checkpointing_enable(self):
self.activation_checkpointing_enable()
def activation_checkpointing_enable(self):
self.llm.gradient_checkpointing_enable()
def gradient_checkpointing_disable(self):
self.activation_checkpointing_disable()
def activation_checkpointing_disable(self):
self.llm.gradient_checkpointing_disable()
def _prepare_for_lora(self,
peft_model=None,
use_activation_checkpointing=True):
self.llm = prepare_model_for_kbit_training(
self.llm, use_activation_checkpointing)
if self.lora.target_modules is None:
modules = find_all_linear_names(self.llm)
self.lora.target_modules = modules
self.llm = get_peft_model(self.llm, self.lora)
if peft_model is not None:
_ = load_checkpoint(self, peft_model)
def init_weights(self):
pass
@staticmethod
def _prepare_for_long_context_training(cfg, llm_cfg,
max_position_embeddings):
if not hasattr(llm_cfg, 'rope_scaling'):
print_log('Current model does not support RoPE scaling.',
'current')
return
current_max_length = getattr(llm_cfg, 'max_position_embeddings', None)
if current_max_length and max_position_embeddings > current_max_length:
print_log(
f'Enlarge max model length from {current_max_length} '
f'to {max_position_embeddings}.', 'current')
scaling_factor = float(
math.ceil(max_position_embeddings / current_max_length))
else:
print_log(
'The input `max_position_embeddings` is smaller than the '
'original max length. Consider increasing the input length.',
'current')
scaling_factor = 1.0
cfg.rope_scaling = {'type': 'linear', 'factor': scaling_factor}
return cfg
@staticmethod
def _prepare_for_flash_attn(cfg, llm_cfg):
cls_name = type(llm_cfg).__name__
SUPPORT_SDPA_ATTN = ('LlamaConfig', 'GemmaConfig', 'MistralConfig',
'MixtralConfig', 'Qwen2Config', 'Qwen2MoeConfig',
'Starcoder2Config', 'Phi3Config')
SUPPORT_FLASH_ATTN2 = ('InternLM2Config', 'LlamaConfig', 'GemmaConfig',
'MistralConfig', 'MixtralConfig', 'Qwen2Config',
'Qwen2MoeConfig', 'Starcoder2Config', 'Phi3Config')
torch_dtype = torch.bfloat16 if (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()) \
else torch.float16
if getattr(cfg, 'attn_implementation', None) is not None:
# Flash Attention 2.0 only supports torch.float16 and
# torch.bfloat16 dtypes
if cfg.attn_implementation == 'flash_attention_2':
cfg.torch_dtype = torch_dtype
elif SUPPORT_FLASH2 and cls_name in SUPPORT_FLASH_ATTN2:
cfg.torch_dtype = torch_dtype
cfg.attn_implementation = 'flash_attention_2'
elif SUPPORT_FLASH1 and cls_name in SUPPORT_SDPA_ATTN:
cfg.attn_implementation = 'sdpa'
return cfg
@staticmethod
def _prepare_for_qlora_zero3(cfg):
if (not is_deepspeed_zero3_enabled()) or (not hasattr(
cfg, 'quantization_config')):
return cfg
torch_dtype = torch.bfloat16 if (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()) \
else torch.float16
cfg.torch_dtype = torch_dtype
quantization_config = cfg.quantization_config
quantization_config.bnb_4bit_compute_dtype = torch_dtype
quantization_config.bnb_4bit_quant_storage = torch_dtype
return cfg
def _dispatch_lm_model_cfg(self, cfg, max_position_embeddings=None):
cfg = self._prepare_for_qlora_zero3(cfg)
pretrained_model_name_or_path = cfg.pretrained_model_name_or_path
llm_cfg = AutoConfig.from_pretrained(
pretrained_model_name_or_path, trust_remote_code=True)
cfg = self._prepare_for_flash_attn(cfg, llm_cfg)
if max_position_embeddings is not None:
cfg = self._prepare_for_long_context_training(
cfg, llm_cfg, max_position_embeddings)
return cfg
def _build_from_cfg_or_module(self, cfg_or_mod):
if isinstance(cfg_or_mod, nn.Module):
return cfg_or_mod
elif isinstance(cfg_or_mod, dict):
traverse_dict(cfg_or_mod)
return BUILDER.build(cfg_or_mod)
else:
raise NotImplementedError
def forward(self, data, data_samples=None, mode='loss'):
labels = data.pop('labels', None)
if mode == 'loss':
return self.compute_loss(data, labels)
elif mode == 'predict':
return self.predict(data, data_samples)
elif mode == 'tensor':
return self._forward(data, data_samples)
else:
raise NotImplementedError
def _forward(self, data, data_samples=None):
hidden_states = self.llm(**data)[0]
logits = self.v_head(hidden_states)
return logits
def predict(self, data, data_samples=None):
hidden_states = self.llm(**data)[0]
logits = self.v_head(hidden_states)
logits_dict = [{'logits': log} for log in logits]
return logits_dict
@staticmethod
def _split_for_sequence_parallel(data):
# attention mask should not be split
ARGS_NEED_TO_SPLIT = ('input_ids', 'position_ids')
sp_group = get_sequence_parallel_group()
for key in ARGS_NEED_TO_SPLIT:
val = data.get(key, None)
if val is not None:
# `dim` is 1 as the shape of tensor is (bs, seq_len, ...)
data[key] = split_for_sequence_parallel(
val, dim=1, sp_group=sp_group)
return data
def compute_loss(self, data, labels=None):
if get_sequence_parallel_world_size() > 1:
data = self._split_for_sequence_parallel(data)
hidden_states = self.llm(**data)[0]
logits = self.v_head(hidden_states)
if get_sequence_parallel_world_size() > 1:
logits = gather_forward_split_backward(
logits,
dim=1,
sp_group=get_sequence_parallel_group(),
grad_scale='up')
chosen_idx = torch.where(labels == 0)
rejected_idx = torch.where(labels == 1)
chosen_logits = logits[chosen_idx]
rejected_logits = logits[rejected_idx]
num_samples = torch.tensor(len(chosen_logits)).float().to(
hidden_states.device)
avg_factor = 1.0 / num_samples
avg_factor = reduce_mean(avg_factor).to(hidden_states.device)
chosen_mean = reduce_mean(chosen_logits.mean().detach())
rejected_mean = reduce_mean(rejected_logits.mean().detach())
acc = reduce_mean(
(chosen_logits > rejected_logits).sum() / num_samples).detach()
num_tokens = torch.tensor(labels.shape[1]).float()
# ranking loss
if self.loss_type == 'ranking':
rank_loss = self.ranking_loss(
chosen_logits, rejected_logits, avg_factor=avg_factor)
elif self.loss_type == 'focal':
rank_loss = self.focal_loss(
chosen_logits, rejected_logits, avg_factor=avg_factor)
else:
raise NotImplementedError(
f'Unsupported loss type {self.loss_type}')
# penalty loss
if self.penalty_type == 'log_barrier':
penalty = self.log_barrier_penalty(
torch.cat([chosen_logits, rejected_logits]),
lower_bound=-5,
upper_bound=5,
avg_factor=avg_factor)
elif self.penalty_type == 'L2':
penalty = self.l2_penalty(
torch.cat([chosen_logits, rejected_logits]),
avg_factor=avg_factor)
elif self.penalty_type == 'none':
penalty = 0
else:
raise NotImplementedError(
f'Unsupported penalty type {self.penalty_type}')
loss = rank_loss + self.penalty_weight * penalty
loss_dict = {
'loss': loss,
'acc': acc,
'chosen_score_mean': chosen_mean,
'rejected_score_mean': rejected_mean,
'num_samples': num_samples,
'num_tokens': num_tokens,
}
return loss_dict
def ranking_loss(self, chosen_logits, rejected_logits, avg_factor):
rank_loss = -nn.functional.logsigmoid(chosen_logits - rejected_logits)
return rank_loss.sum() * avg_factor
def focal_loss(self, chosen_logits, rejected_logits, avg_factor):
# focal ranking loss from InternLM2 paper https://arxiv.org/abs/2403.17297 # noqa
rank_loss = -nn.functional.logsigmoid(chosen_logits - rejected_logits)
p_ij = torch.sigmoid(chosen_logits - rejected_logits)
p = 2 * torch.relu(p_ij - 0.5)
gamma = 2
focal_loss = ((1 - p)**gamma) * rank_loss
return focal_loss.sum() * avg_factor
def log_barrier_penalty(self,
logits,
lower_bound,
upper_bound,
epsilon=1e-3,
avg_factor=1):
# log barrier penalty from InternLM2 paper https://arxiv.org/abs/2403.17297 # noqa
logits_fp32 = logits.float()
logits_clamped = torch.clamp(logits_fp32, lower_bound + epsilon,
upper_bound - epsilon)
penalty = -torch.log(upper_bound - logits_clamped) - torch.log(
logits_clamped - lower_bound)
return penalty.sum() * avg_factor
def l2_penalty(self, logits, avg_factor=1):
return (logits**2).sum() * avg_factor
def state_dict(self, *args, **kwargs):
state_dict = super().state_dict(*args, **kwargs)
if not self.use_lora:
return state_dict
to_return = get_peft_model_state_dict(self.llm, state_dict=state_dict)
return OrderedDict(to_return)
def __getattr__(self, name: str):
try:
return super().__getattr__(name)
except AttributeError:
return getattr(self.llm, name)
def to_hf(self,
cfg,
save_dir,
fp32=False,
save_pretrained_kwargs={},
**kwargs):
print(f'Saving LLM tokenizer to {save_dir}')
tokenizer = BUILDER.build(cfg.tokenizer)
tokenizer.save_pretrained(save_dir)
if 'PeftModel' in self.llm.__class__.__name__:
# merge adapter
self.llm = self.llm.merge_and_unload()
if 'InternLM2' in self.llm.__class__.__name__:
from xtuner.tools.model_converters.modeling_internlm2_reward.modeling_internlm2 import \
InternLM2ForRewardModel # noqa
print(f'Saving Reward Model to {save_dir}')
hf_cfg = self.llm.config
hf_cfg.reward_token_id = self.reward_token_id if \
self.reward_token_id is not None else cfg.reward_token_id
if not fp32:
dtype = torch.float16
else:
dtype = torch.float32
with no_init_weights():
reward_model = InternLM2ForRewardModel._from_config(
hf_cfg, torch_dtype=dtype)
reward_model.model.load_state_dict(self.llm.state_dict())
reward_model.v_head.load_state_dict(self.v_head.state_dict())
reward_model.save_pretrained(save_dir, **save_pretrained_kwargs)
# fix auto_map in config
with open(os.path.join(save_dir, 'config.json')) as fp:
config_dict = json.load(fp)
config_dict['auto_map'][
'AutoModel'] = 'modeling_internlm2.InternLM2ForRewardModel'
config_dict['auto_map'].pop('AutoModelForCausalLM', None)
with open(os.path.join(save_dir, 'config.json'), 'w') as fp:
json.dump(config_dict, fp, indent=2)
else:
warnings.warn(
f'The pretrained model type: {self.llm.__class__.__name__} '
'has no reward model class defined. Use '
'the SequenceClassification class instead. '
'You can refer to `xtuner/tools/model_converters/modeling_internlm2_reward` ' # noqa
'to implement the reward model class.')
hf_cfg = self.llm.config
hf_cfg.num_labels = 1 # set the output dim to 1
try:
with no_init_weights():
reward_model = \
AutoModelForSequenceClassification.from_config(hf_cfg)
except Exception as e:
warnings.warn(f'Cannot find SequenceClassification class '
f'from transformers: {e}, \n'
'try to find it in the dynamic module.')
module_file, causal_model_name = hf_cfg.auto_map[
'AutoModelForCausalLM'].split('.')
seqcls_model_name = causal_model_name.split(
'For')[0] + 'ForSequenceClassification'
seqcls_class = get_class_from_dynamic_module(
f'{module_file}.{seqcls_model_name}', hf_cfg._name_or_path)
with no_init_weights():
reward_model = seqcls_class(hf_cfg)
reward_model.model.load_state_dict(self.llm.state_dict())
reward_model.score.load_state_dict(self.v_head.state_dict())
reward_model.save_pretrained(save_dir, **save_pretrained_kwargs)
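# Illustrative sketch (not part of xtuner): per-pair versions of
# `ranking_loss`, `focal_loss` and `log_barrier_penalty` from `RewardModel`
# above, in pure Python. The function names and scalar inputs are
# assumptions; the class sums these terms over the batch and scales them by
# `avg_factor`.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def ranking_loss(chosen, rejected):
    """Pairwise ranking loss: -log sigmoid(chosen - rejected)."""
    return -log_sigmoid(chosen - rejected)


def focal_ranking_loss(chosen, rejected, gamma=2):
    """Down-weight pairs the model already ranks correctly with confidence."""
    p_ij = math.exp(log_sigmoid(chosen - rejected))
    p = 2 * max(p_ij - 0.5, 0.0)
    return (1 - p) ** gamma * ranking_loss(chosen, rejected)


def log_barrier_penalty(score, lower_bound=-5, upper_bound=5, epsilon=1e-3):
    """Penalty that grows as a score approaches either bound."""
    clamped = min(max(score, lower_bound + epsilon), upper_bound - epsilon)
    return -math.log(upper_bound - clamped) - math.log(clamped - lower_bound)
```

# The focal variant leaves misranked pairs (chosen <= rejected) at full
# weight, while the barrier keeps scores away from the [-5, 5] limits used
# in `compute_loss`.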
================================================
FILE: xtuner-eval_niah/xtuner/model/sft.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
from collections import OrderedDict
from contextlib import nullcontext
import torch
from mmengine import print_log
from mmengine.config import Config, ConfigDict
from mmengine.model import BaseModel
from mmengine.runner import load_checkpoint
from peft import get_peft_model, prepare_model_for_kbit_training
from torch import nn
from transformers import AutoConfig, PreTrainedModel, PreTrainedTokenizer
from transformers.integrations import is_deepspeed_zero3_enabled
from xtuner.parallel.sequence import (get_sequence_parallel_group,
get_sequence_parallel_world_size,
reduce_sequence_parallel_loss,
split_for_sequence_parallel)
from xtuner.registry import BUILDER
from .modules import dispatch_modules
from .modules.dispatch import SUPPORT_FLASH1, SUPPORT_FLASH2
from .utils import (LoadWoInit, find_all_linear_names,
get_peft_model_state_dict, make_inputs_require_grad,
traverse_dict)
def smart_tokenizer_and_embedding_resize(
tokenizer: PreTrainedTokenizer,
model: PreTrainedModel,
):
"""Resize embedding."""
if is_deepspeed_zero3_enabled():
import deepspeed
params = [model.get_input_embeddings().weight]
if model.get_output_embeddings(
) is not None and not model.config.tie_word_embeddings:
params.append(model.get_output_embeddings().weight)
context_maybe_zero3 = deepspeed.zero.GatheredParameters(
params, modifier_rank=0)
else:
context_maybe_zero3 = nullcontext()
with context_maybe_zero3:
current_embedding_size = model.get_input_embeddings().weight.size(0)
if len(tokenizer) > current_embedding_size:
assert isinstance(model.get_output_embeddings(), nn.Linear)
model.resize_token_embeddings(len(tokenizer), pad_to_multiple_of=64)
with context_maybe_zero3:
num_new_tokens = len(tokenizer) - current_embedding_size
input_embeddings = model.get_input_embeddings().weight.data
output_embeddings = model.get_output_embeddings().weight.data
input_embeddings_avg = input_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
output_embeddings_avg = output_embeddings[:-num_new_tokens].mean(
dim=0, keepdim=True)
input_embeddings[-num_new_tokens:] = input_embeddings_avg
output_embeddings[-num_new_tokens:] = output_embeddings_avg
print_log(
f'Resized token embeddings from {current_embedding_size} to '
f'{len(tokenizer)}.', 'current')
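# Illustrative sketch (not part of xtuner): the mean-initialization rule that
# `smart_tokenizer_and_embedding_resize` applies to newly added embedding rows,
# shown on plain Python lists. The helper name `init_new_rows_with_mean` is
# hypothetical, and the sketch ignores the `pad_to_multiple_of=64` padding.

```python
def init_new_rows_with_mean(rows, num_new):
    """Overwrite the last `num_new` rows with the mean of the earlier rows."""
    old_rows = rows[:-num_new]
    dim = len(old_rows[0])
    # Column-wise mean over the pre-existing rows.
    mean_row = [sum(r[j] for r in old_rows) / len(old_rows) for j in range(dim)]
    for i in range(len(rows) - num_new, len(rows)):
        rows[i] = list(mean_row)
    return rows


# Two existing token embeddings plus one freshly added (zero) row.
table = init_new_rows_with_mean([[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]], num_new=1)
```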
class SupervisedFinetune(BaseModel):
def __init__(self,
llm,
lora=None,
peft_model=None,
use_activation_checkpointing=True,
use_varlen_attn=False,
tokenizer=None,
max_position_embeddings=None):
super().__init__()
with LoadWoInit():
if isinstance(llm, dict):
llm = self._dispatch_lm_model_cfg(llm, max_position_embeddings)
self.llm = self._build_from_cfg_or_module(llm)
if tokenizer is not None:
if isinstance(tokenizer, dict):
tokenizer = BUILDER.build(tokenizer)
smart_tokenizer_and_embedding_resize(tokenizer, self.llm)
self.llm.config.use_cache = False
dispatch_modules(self.llm, use_varlen_attn=use_varlen_attn)
if use_activation_checkpointing:
# For backward compatibility
if hasattr(self.llm, 'enable_input_require_grads'):
self.llm.enable_input_require_grads()
else:
self.llm.get_input_embeddings().register_forward_hook(
make_inputs_require_grad)
# enable gradient checkpointing for memory efficiency
self.gradient_checkpointing_enable()
if isinstance(lora, (dict, Config, ConfigDict)):
self.lora = BUILDER.build(lora)
else:
self.lora = lora
self.peft_model = peft_model
self.use_lora = lora is not None
if self.use_lora:
self._prepare_for_lora(peft_model, use_activation_checkpointing)
self._is_init = True
# Determines whether to calculate attention based on the
# seq_len dimension (use_varlen_attn = False) or the actual length of
# the sequence.
self.use_varlen_attn = use_varlen_attn
def gradient_checkpointing_enable(self):
self.activation_checkpointing_enable()
def activation_checkpointing_enable(self):
self.llm.gradient_checkpointing_enable()
def gradient_checkpointing_disable(self):
self.activation_checkpointing_disable()
def activation_checkpointing_disable(self):
self.llm.gradient_checkpointing_disable()
def _prepare_for_lora(self,
peft_model=None,
use_activation_checkpointing=True):
self.llm = prepare_model_for_kbit_training(
self.llm, use_activation_checkpointing)
if self.lora.target_modules is None:
modules = find_all_linear_names(self.llm)
self.lora.target_modules = modules
self.llm = get_peft_model(self.llm, self.lora)
if peft_model is not None:
_ = load_checkpoint(self, peft_model)
def init_weights(self):
pass
@staticmethod
def _prepare_for_long_context_training(cfg, llm_cfg,
max_position_embeddings):
if not hasattr(llm_cfg, 'rope_scaling'):
print_log('Current model does not support RoPE scaling.',
'current')
return cfg
current_max_length = getattr(llm_cfg, 'max_position_embeddings', None)
if current_max_length and max_position_embeddings > current_max_length:
print_log(
f'Enlarge max model length from {current_max_length} '
f'to {max_position_embeddings}.', 'current')
scaling_factor = float(
math.ceil(max_position_embeddings / current_max_length))
else:
print_log(
'The input `max_position_embeddings` is not larger than the '
'original max length. Consider increasing the input length.',
'current')
scaling_factor = 1.0
cfg.rope_scaling = {'type': 'linear', 'factor': scaling_factor}
return cfg
@staticmethod
def _prepare_for_flash_attn(cfg, llm_cfg):
cls_name = type(llm_cfg).__name__
SUPPORT_SDPA_ATTN = ('LlamaConfig', 'GemmaConfig', 'MistralConfig',
'MixtralConfig', 'Qwen2Config', 'Qwen2MoeConfig',
'Starcoder2Config', 'Phi3Config')
SUPPORT_FLASH_ATTN2 = ('InternLM2Config', 'LlamaConfig', 'GemmaConfig',
'MistralConfig', 'MixtralConfig', 'Qwen2Config',
'Qwen2MoeConfig', 'Starcoder2Config', 'Phi3Config',
'DeepseekV2Config')
torch_dtype = torch.bfloat16 if (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()) \
else torch.float16
if getattr(cfg, 'attn_implementation', None) is not None:
# Flash Attention 2.0 only supports torch.float16 and
# torch.bfloat16 dtypes
if cfg.attn_implementation == 'flash_attention_2':
cfg.torch_dtype = torch_dtype
elif SUPPORT_FLASH2 and cls_name in SUPPORT_FLASH_ATTN2:
cfg.torch_dtype = torch_dtype
cfg.attn_implementation = 'flash_attention_2'
elif SUPPORT_FLASH1 and cls_name in SUPPORT_SDPA_ATTN:
cfg.attn_implementation = 'sdpa'
return cfg
@staticmethod
def _prepare_for_qlora_zero3(cfg):
if (not is_deepspeed_zero3_enabled()) or (not hasattr(
cfg, 'quantization_config')):
return cfg
torch_dtype = torch.bfloat16 if (
torch.cuda.is_available() and torch.cuda.is_bf16_supported()) \
else torch.float16
cfg.torch_dtype = torch_dtype
quantization_config = cfg.quantization_config
quantization_config.bnb_4bit_compute_dtype = torch_dtype
quantization_config.bnb_4bit_quant_storage = torch_dtype
return cfg
def _dispatch_lm_model_cfg(self, cfg, max_position_embeddings=None):
cfg = self._prepare_for_qlora_zero3(cfg)
pretrained_model_name_or_path = cfg.pretrained_model_name_or_path
llm_cfg = AutoConfig.from_pretrained(
pretrained_model_name_or_path, trust_remote_code=True)
cfg = self._prepare_for_flash_attn(cfg, llm_cfg)
if max_position_embeddings is not None:
cfg = self._prepare_for_long_context_training(
cfg, llm_cfg, max_position_embeddings)
return cfg
def _build_from_cfg_or_module(self, cfg_or_mod):
if isinstance(cfg_or_mod, nn.Module):
return cfg_or_mod
elif isinstance(cfg_or_mod, dict):
traverse_dict(cfg_or_mod)
return BUILDER.build(cfg_or_mod)
else:
raise NotImplementedError
def forward(self, data, data_samples=None, mode='loss'):
if mode == 'loss':
return self.compute_loss(data, data_samples)
elif mode == 'predict':
return self.predict(data, data_samples)
elif mode == 'tensor':
return self._forward(data, data_samples)
else:
raise NotImplementedError
def _forward(self, data, data_samples=None):
outputs = self.llm(**data)
return outputs
def predict(self, data, data_samples=None):
outputs = self.llm(**data)
logits_dict = [{'logits': logits} for logits in outputs.logits]
return logits_dict
@staticmethod
def _split_for_sequence_parallel(data):
# attention mask should not be split
ARGS_NEED_TO_SPLIT = ('input_ids', 'labels', 'position_ids')
sp_group = get_sequence_parallel_group()
for key in ARGS_NEED_TO_SPLIT:
val = data.get(key, None)
if val is not None:
# `dim` is 1 as the shape of tensor is (bs, seq_len, ...)
data[key] = split_for_sequence_parallel(
val, dim=1, sp_group=sp_group)
return data
def _compute_sequence_parallel_loss(self, data):
data = self._split_for_sequence_parallel(data)
outputs = self.llm(**data)
labels = data['labels']
num_tokens = (labels != -100).sum()
sp_group = get_sequence_parallel_group()
loss = reduce_sequence_parallel_loss(outputs.loss, num_tokens,
sp_group)
return {'loss': loss}
def compute_loss(self, data, data_samples=None):
if get_sequence_parallel_world_size() > 1:
return self._compute_sequence_parallel_loss(data)
else:
outputs = self.llm(**data)
loss_dict = {'loss': outputs.loss}
return loss_dict
def state_dict(self, *args, **kwargs):
state_dict = super().state_dict(*args, **kwargs)
if not self.use_lora:
return state_dict
to_return = get_peft_model_state_dict(self.llm, state_dict=state_dict)
return OrderedDict(to_return)
def __getattr__(self, name: str):
try:
return super().__getattr__(name)
except AttributeError:
return getattr(self.llm, name)
def to_hf(self,
cfg,
save_dir,
fp32=False,
save_pretrained_kwargs={},
**kwargs):
self.llm.config.use_cache = True
if not fp32:
print_log('Convert LLM to float16', 'current')
self.llm.half()
if self.use_lora:
print_log(f'Saving adapter to {save_dir}', 'current')
else:
print_log(f'Saving LLM tokenizer to {save_dir}', 'current')
tokenizer = BUILDER.build(cfg.tokenizer)
tokenizer.save_pretrained(save_dir)
print_log(f'Saving LLM to {save_dir}', 'current')
self.llm.save_pretrained(save_dir, **save_pretrained_kwargs)
self.llm.config.use_cache = False
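# Illustrative sketch (not part of xtuner): the ceil-based rule that
# `SupervisedFinetune._prepare_for_long_context_training` uses to pick the
# linear RoPE scaling factor. The helper name `linear_rope_scaling_factor`
# is hypothetical and exists only for this example.

```python
import math


def linear_rope_scaling_factor(current_max_length, target_max_length):
    """Return the linear RoPE factor needed to reach `target_max_length`."""
    if current_max_length and target_max_length > current_max_length:
        # Round up so the scaled context always covers the requested length.
        return float(math.ceil(target_max_length / current_max_length))
    # Unknown or already sufficient native length: leave RoPE unscaled.
    return 1.0


rope_scaling = {
    'type': 'linear',
    'factor': linear_rope_scaling_factor(4096, 10000),
}
```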
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/__init__.py
================================================
from .deepseek_v2 import (DeepseekTokenizerFast, DeepseekV2Config,
DeepseekV2ForCausalLM, DeepseekV2Model)
from .mixtral import MixtralConfig, MixtralForCausalLM, MixtralModel
__all__ = [
'DeepseekTokenizerFast', 'DeepseekV2Config', 'DeepseekV2ForCausalLM',
'DeepseekV2Model', 'MixtralConfig', 'MixtralForCausalLM', 'MixtralModel'
]
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/deepseek_v2/__init__.py
================================================
from .configuration_deepseek import DeepseekV2Config
from .modeling_deepseek import DeepseekV2ForCausalLM, DeepseekV2Model
from .tokenization_deepseek_fast import DeepseekTokenizerFast
__all__ = [
'DeepseekV2ForCausalLM', 'DeepseekV2Model', 'DeepseekV2Config',
'DeepseekTokenizerFast'
]
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/deepseek_v2/configuration_deepseek.py
================================================
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
DEEPSEEK_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
# Compared to the original version, two parameters, `moe_implementation` and
# `expert_in_one_shard`, have been added.
class DeepseekV2Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`DeepseekV2Model`]. It is used to instantiate a DeepSeek
model according to the specified arguments, defining the model architecture. Instantiating a configuration with the
defaults will yield a similar configuration to that of the DeepSeek-V2.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 102400):
Vocabulary size of the DeepSeek model. Defines the number of different tokens that can be represented by the
`input_ids` passed when calling [`DeepseekV2Model`].
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 11008):
Dimension of the MLP representations.
moe_intermediate_size (`int`, *optional*, defaults to 1407):
Dimension of the MoE representations.
num_hidden_layers (`int`, *optional*, defaults to 30):
Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer decoder.
n_shared_experts (`int`, *optional*, defaults to None):
Number of shared experts, None means dense model.
n_routed_experts (`int`, *optional*, defaults to None):
Number of routed experts, None means dense model.
routed_scaling_factor (`float`, *optional*, defaults to 1.0):
Scaling factor of routed experts.
topk_method (`str`, *optional*, defaults to `gready`):
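Top-k method used in the routing gate. `gready` is the upstream spelling; the gate also accepts `greedy` and `group_limited_greedy`.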
n_group (`int`, *optional*, defaults to None):
Number of groups for routed experts.
topk_group (`int`, *optional*, defaults to None):
Number of selected groups for each token (the experts selected for a token must all lie within `topk_group` groups).
num_experts_per_tok (`int`, *optional*, defaults to None):
Number of selected experts, None means dense model.
moe_layer_freq (`int`, *optional*, defaults to 1):
The frequency of the MoE layer: one expert layer for every `moe_layer_freq - 1` dense layers.
first_k_dense_replace (`int`, *optional*, defaults to 0):
Number of dense layers in shallow layers (embed->dense->dense->...->dense->moe->moe...->lm_head).
\--k dense layers--/
norm_topk_prob (`bool`, *optional*, defaults to False):
Whether to normalize the weights of the routed experts.
scoring_func (`str`, *optional*, defaults to 'softmax'):
Method of computing expert weights.
aux_loss_alpha (`float`, *optional*, defaults to 0.001):
Auxiliary loss weight coefficient.
seq_aux (`bool`, *optional*, defaults to True):
Whether to compute the auxiliary loss for each individual sample.
num_key_value_heads (`int`, *optional*):
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
`num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by meanpooling all the original heads within that group. For more details, check out [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
`num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-06):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
pad_token_id (`int`, *optional*):
Padding token id.
bos_token_id (`int`, *optional*, defaults to 100000):
Beginning of stream token id.
eos_token_id (`int`, *optional*, defaults to 100001):
End of stream token id.
pretraining_tp (`int`, *optional*, defaults to 1):
Experimental feature. Tensor parallelism rank used during pretraining. Please refer to [this
document](https://huggingface.co/docs/transformers/parallelism) to understand more about it. This value is
necessary to ensure exact reproducibility of the pretraining results. Please refer to [this
issue](https://github.com/pytorch/pytorch/issues/76232).
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether to tie the input and output word embeddings.
rope_theta (`float`, *optional*, defaults to 10000.0):
The base period of the RoPE embeddings.
rope_scaling (`Dict`, *optional*):
Dictionary containing the scaling configuration for the RoPE embeddings. Currently supports two scaling
strategies: linear and dynamic. Their scaling factor must be a float greater than 1. The expected format is
`{"type": strategy name, "factor": scaling factor}`. When using this flag, don't update
`max_position_embeddings` to the expected new maximum.
attention_bias (`bool`, *optional*, defaults to `False`):
Whether to use a bias in the query, key, value and output projection layers during self-attention.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
moe_implementation (`str`, *optional*, defaults to 'origin'):
The implementation of the moe blocks. 'origin' or 'shard'.
expert_in_one_shard (`int`, *optional*, defaults to None):
How many expert models are integrated into a shard. It is used only
when `moe_implementation` == 'shard'
```python
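>>> from xtuner.model.transformers_models import DeepseekV2Config, DeepseekV2Model
>>> # Initializing a Deepseek-V2 style configuration
>>> configuration = DeepseekV2Config()
>>> # Initializing a model (with random weights) from the configuration
>>> model = DeepseekV2Model(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config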
```"""
model_type = 'deepseek_v2'
keys_to_ignore_at_inference = ['past_key_values']
def __init__(
self,
vocab_size=102400,
hidden_size=4096,
intermediate_size=11008,
moe_intermediate_size=1407,
num_hidden_layers=30,
num_attention_heads=32,
num_key_value_heads=32,
n_shared_experts=None,
n_routed_experts=None,
ep_size=1,
routed_scaling_factor=1.0,
kv_lora_rank=512,
q_lora_rank=1536,
qk_rope_head_dim=64,
v_head_dim=128,
qk_nope_head_dim=128,
topk_method='gready',
n_group=None,
topk_group=None,
num_experts_per_tok=None,
moe_layer_freq=1,
first_k_dense_replace=0,
norm_topk_prob=False,
scoring_func='softmax',
aux_loss_alpha=0.001,
seq_aux=True,
hidden_act='silu',
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=None,
bos_token_id=100000,
eos_token_id=100001,
pretraining_tp=1,
tie_word_embeddings=False,
rope_theta=10000.0,
rope_scaling=None,
attention_bias=False,
attention_dropout=0.0,
moe_implementation='origin',
expert_in_one_shard=None,
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.moe_intermediate_size = moe_intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.n_shared_experts = n_shared_experts
self.n_routed_experts = n_routed_experts
self.ep_size = ep_size
self.routed_scaling_factor = routed_scaling_factor
self.kv_lora_rank = kv_lora_rank
self.q_lora_rank = q_lora_rank
self.qk_rope_head_dim = qk_rope_head_dim
self.v_head_dim = v_head_dim
self.qk_nope_head_dim = qk_nope_head_dim
self.topk_method = topk_method
self.n_group = n_group
self.topk_group = topk_group
self.num_experts_per_tok = num_experts_per_tok
self.moe_layer_freq = moe_layer_freq
self.first_k_dense_replace = first_k_dense_replace
self.norm_topk_prob = norm_topk_prob
self.scoring_func = scoring_func
self.aux_loss_alpha = aux_loss_alpha
self.seq_aux = seq_aux
# for backward compatibility
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.pretraining_tp = pretraining_tp
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self.attention_bias = attention_bias
self.attention_dropout = attention_dropout
self.moe_implementation = moe_implementation
self.expert_in_one_shard = expert_in_one_shard
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/deepseek_v2/modeling_deepseek.py
================================================
# Copyright 2023 DeepSeek-AI and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on EleutherAI's GPT-NeoX library and the GPT-NeoX
# and OPT implementations in this library. It has been modified from its
# original forms to accommodate minor architectural differences compared
# to GPT-NeoX and OPT used by the Meta AI team that trained the model.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""PyTorch DeepSeek model."""
import copy
import math
import os
import types
import warnings
from typing import List, Optional, Tuple, Union
import numpy as np
import torch
import torch.distributed as dist
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache
from transformers.configuration_utils import PretrainedConfig
from transformers.modeling_attn_mask_utils import (
AttentionMaskConverter, _prepare_4d_attention_mask,
_prepare_4d_causal_attention_mask,
_prepare_4d_causal_attention_mask_for_sdpa)
from transformers.modeling_outputs import (BaseModelOutputWithPast,
CausalLMOutputWithPast,
SequenceClassifierOutputWithPast)
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import (ALL_LAYERNORM_LAYERS,
is_torch_greater_or_equal_than_1_13)
from transformers.utils import (add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10, logging,
replace_return_docstrings)
from transformers.utils.import_utils import is_torch_fx_available
from xtuner.utils import load_state_dict_into_model
from .configuration_deepseek import DeepseekV2Config
if is_flash_attn_2_available():
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import pad_input # noqa
from flash_attn.bert_padding import index_first_axis, unpad_input
# This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
# It means that the function will not be traced through and simply appear as a node in the graph.
if is_torch_fx_available():
if not is_torch_greater_or_equal_than_1_13:
import torch.fx
_prepare_4d_causal_attention_mask = torch.fx.wrap(
_prepare_4d_causal_attention_mask)
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = 'DeepseekV2Config'
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(
torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
class DeepseekV2RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""DeepseekV2RMSNorm is equivalent to T5LayerNorm."""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance +
self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
ALL_LAYERNORM_LAYERS.append(DeepseekV2RMSNorm)
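# Illustrative sketch (not part of the model code): the arithmetic performed by
# `DeepseekV2RMSNorm.forward`, written for a single vector with plain Python
# floats. The function name `rms_norm` is hypothetical; dtype casting is omitted.

```python
import math


def rms_norm(values, weight, eps=1e-6):
    """Scale `values` by the reciprocal root-mean-square, then by `weight`."""
    mean_square = sum(v * v for v in values) / len(values)
    inv_rms = 1.0 / math.sqrt(mean_square + eps)
    return [w * v * inv_rms for v, w in zip(values, weight)]


# With unit weights and eps=0 the output has a mean square of exactly 1
# (up to floating-point rounding).
normed = rms_norm([3.0, 4.0], [1.0, 1.0], eps=0.0)
```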
class DeepseekV2RotaryEmbedding(nn.Module):
def __init__(self,
dim,
max_position_embeddings=2048,
base=10000,
device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (
self.base
**(torch.arange(0, self.dim, 2).float().to(device) / self.dim))
self.register_buffer('inv_freq', inv_freq, persistent=False)
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings,
device=self.inv_freq.device,
dtype=torch.get_default_dtype(),
)
self.max_seq_len_cached = None
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(
self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
freqs = torch.outer(t, self.inv_freq.to(t.device))
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer(
'cos_cached', emb.cos().to(dtype), persistent=False)
self.register_buffer(
'sin_cached', emb.sin().to(dtype), persistent=False)
def forward(self, x, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if self.max_seq_len_cached is None or seq_len > self.max_seq_len_cached:
self._set_cos_sin_cache(
seq_len=seq_len, device=x.device, dtype=x.dtype)
return (
self.cos_cached[:seq_len].to(dtype=x.dtype),
self.sin_cached[:seq_len].to(dtype=x.dtype),
)
# Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->DeepseekV2
class DeepseekV2LinearScalingRotaryEmbedding(DeepseekV2RotaryEmbedding):
"""DeepseekV2RotaryEmbedding extended with linear scaling.
Credits to the Reddit user /u/kaiokendev
"""
def __init__(
self,
dim,
max_position_embeddings=2048,
base=10000,
device=None,
scaling_factor=1.0,
):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(
self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
t = t / self.scaling_factor
freqs = torch.outer(t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer(
'cos_cached', emb.cos().to(dtype), persistent=False)
self.register_buffer(
'sin_cached', emb.sin().to(dtype), persistent=False)
# Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->DeepseekV2
class DeepseekV2DynamicNTKScalingRotaryEmbedding(DeepseekV2RotaryEmbedding):
"""DeepseekV2RotaryEmbedding extended with Dynamic NTK scaling.
Credits to the Reddit users /u/bloc97 and /u/emozilla
"""
def __init__(
self,
dim,
max_position_embeddings=2048,
base=10000,
device=None,
scaling_factor=1.0,
):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
if seq_len > self.max_position_embeddings:
base = self.base * ((self.scaling_factor * seq_len /
self.max_position_embeddings) -
(self.scaling_factor - 1))**(
self.dim / (self.dim - 2))
inv_freq = 1.0 / (
base
**(torch.arange(0, self.dim, 2).float().to(device) / self.dim))
self.register_buffer('inv_freq', inv_freq, persistent=False)
t = torch.arange(
self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
freqs = torch.outer(t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer(
'cos_cached', emb.cos().to(dtype), persistent=False)
self.register_buffer(
'sin_cached', emb.sin().to(dtype), persistent=False)
# Inverse dim formula to find dim based on number of rotations
def yarn_find_correction_dim(num_rotations,
dim,
base=10000,
max_position_embeddings=2048):
return (dim * math.log(max_position_embeddings /
(num_rotations * 2 * math.pi))) / (2 *
math.log(base))
# Find dim range bounds based on rotations
def yarn_find_correction_range(low_rot,
high_rot,
dim,
base=10000,
max_position_embeddings=2048):
low = math.floor(
yarn_find_correction_dim(low_rot, dim, base, max_position_embeddings))
high = math.ceil(
yarn_find_correction_dim(high_rot, dim, base, max_position_embeddings))
return max(low, 0), min(high, dim - 1) # Clamp values just in case
def yarn_get_mscale(scale=1, mscale=1):
if scale <= 1:
return 1.0
return 0.1 * mscale * math.log(scale) + 1.0
def yarn_linear_ramp_mask(min, max, dim):
if min == max:
max += 0.001 # Prevent singularity
linear_func = (torch.arange(dim, dtype=torch.float32) - min) / (max - min)
ramp_func = torch.clamp(linear_func, 0, 1)
return ramp_func
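# Illustrative sketch (not part of the model code): a torch-free walk through the
# YaRN helpers above. The names `find_correction_dim`, `find_correction_range`
# and `linear_ramp` are hypothetical list-based stand-ins; the inputs
# (dim=64, beta_fast=32, beta_slow=1, original length 4096) are example values.

```python
import math


def find_correction_dim(num_rotations, dim, base=10000, max_position_embeddings=2048):
    # Dimension index whose wavelength completes `num_rotations` turns
    # over `max_position_embeddings` positions.
    return (dim * math.log(max_position_embeddings /
                           (num_rotations * 2 * math.pi))) / (2 * math.log(base))


def find_correction_range(low_rot, high_rot, dim, base=10000, max_position_embeddings=2048):
    low = math.floor(find_correction_dim(low_rot, dim, base, max_position_embeddings))
    high = math.ceil(find_correction_dim(high_rot, dim, base, max_position_embeddings))
    return max(low, 0), min(high, dim - 1)


def linear_ramp(low, high, n):
    if low == high:
        high += 0.001  # avoid division by zero
    return [min(max((i - low) / (high - low), 0.0), 1.0) for i in range(n)]


low, high = find_correction_range(32, 1, dim=64, base=10000, max_position_embeddings=4096)
# One ramp value per frequency pair: 0 below `low`, 1 above `high`.
ramp = linear_ramp(low, high, 64 // 2)
```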
class DeepseekV2YarnRotaryEmbedding(DeepseekV2RotaryEmbedding):
def __init__(
self,
dim,
max_position_embeddings=2048,
base=10000,
device=None,
scaling_factor=1.0,
original_max_position_embeddings=4096,
beta_fast=32,
beta_slow=1,
mscale=1,
mscale_all_dim=0,
):
self.scaling_factor = scaling_factor
self.original_max_position_embeddings = original_max_position_embeddings
self.beta_fast = beta_fast
self.beta_slow = beta_slow
self.mscale = mscale
self.mscale_all_dim = mscale_all_dim
super().__init__(dim, max_position_embeddings, base, device)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
dim = self.dim
freq_extra = 1.0 / (
self.base**(torch.arange(
0, dim, 2, dtype=torch.float32, device=device) / dim))
freq_inter = 1.0 / (
self.scaling_factor * self.base**(torch.arange(
0, dim, 2, dtype=torch.float32, device=device) / dim))
low, high = yarn_find_correction_range(
self.beta_fast,
self.beta_slow,
dim,
self.base,
self.original_max_position_embeddings,
)
inv_freq_mask = 1.0 - yarn_linear_ramp_mask(low, high, dim // 2).to(
device=device, dtype=torch.float32)
inv_freq = freq_inter * (1 -
inv_freq_mask) + freq_extra * inv_freq_mask
self.register_buffer('inv_freq', inv_freq, persistent=False)
t = torch.arange(seq_len, device=device, dtype=torch.float32)
freqs = torch.outer(t, inv_freq)
_mscale = float(
yarn_get_mscale(self.scaling_factor, self.mscale) /
yarn_get_mscale(self.scaling_factor, self.mscale_all_dim))
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer(
'cos_cached', (emb.cos() * _mscale).to(dtype), persistent=False)
self.register_buffer(
'sin_cached', (emb.sin() * _mscale).to(dtype), persistent=False)
# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
# Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`):
The position indices of the tokens corresponding to the query and key tensors. For example, this can be
used to pass offset position ids when working with a KV-cache.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)
b, h, s, d = q.shape
q = q.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
b, h, s, d = k.shape
k = k.view(b, h, s, d // 2, 2).transpose(4, 3).reshape(b, h, s, d)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
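# Illustrative sketch (not part of the model code): what the
# `view(..., d // 2, 2).transpose(4, 3).reshape(...)` step in
# `apply_rotary_pos_emb` does to one head vector, followed by `rotate_half`,
# shown on plain Python lists. The name `deinterleave` is hypothetical.

```python
def deinterleave(xs):
    """Move even-indexed entries to the first half and odd-indexed to the second."""
    return xs[0::2] + xs[1::2]


def rotate_half(xs):
    """List version of `rotate_half`: (x1, x2) -> (-x2, x1)."""
    half = len(xs) // 2
    return [-x for x in xs[half:]] + xs[:half]


head = [0, 1, 2, 3, 4, 5]
reordered = deinterleave(head)    # [0, 2, 4, 1, 3, 5]
rotated = rotate_half(reordered)  # [-1, -3, -5, 0, 2, 4]
```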
class DeepseekV2MLP(nn.Module):
def __init__(self, config, hidden_size=None, intermediate_size=None):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size if hidden_size is None else hidden_size
self.intermediate_size = (
config.intermediate_size
if intermediate_size is None else intermediate_size)
self.gate_proj = nn.Linear(
self.hidden_size, self.intermediate_size, bias=False)
self.up_proj = nn.Linear(
self.hidden_size, self.intermediate_size, bias=False)
self.down_proj = nn.Linear(
self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.down_proj(
self.act_fn(self.gate_proj(x)) * self.up_proj(x))
return down_proj
class MoEGate(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.top_k = config.num_experts_per_tok
self.n_routed_experts = config.n_routed_experts
self.routed_scaling_factor = config.routed_scaling_factor
self.scoring_func = config.scoring_func
self.alpha = config.aux_loss_alpha
self.seq_aux = config.seq_aux
self.topk_method = config.topk_method
self.n_group = config.n_group
self.topk_group = config.topk_group
# topk selection algorithm
self.norm_topk_prob = config.norm_topk_prob
self.gating_dim = config.hidden_size
self.weight = nn.Parameter(
torch.empty((self.n_routed_experts, self.gating_dim)))
self.reset_parameters()
def reset_parameters(self) -> None:
import torch.nn.init as init
init.kaiming_uniform_(self.weight, a=math.sqrt(5))
def forward(self, hidden_states):
bsz, seq_len, h = hidden_states.shape
### compute gating score
hidden_states = hidden_states.view(-1, h)
logits = F.linear(
hidden_states.type(torch.float32), self.weight.type(torch.float32),
None)
if self.scoring_func == 'softmax':
scores = logits.softmax(dim=-1, dtype=torch.float32)
else:
raise NotImplementedError(
f'unsupported scoring function for MoE gating: {self.scoring_func}'
)
### select top-k experts
# fix official typos
if self.topk_method in ('gready', 'greedy'):
topk_weight, topk_idx = torch.topk(
scores, k=self.top_k, dim=-1, sorted=False)
elif self.topk_method == 'group_limited_greedy':
group_scores = (scores.view(bsz * seq_len, self.n_group,
-1).max(dim=-1).values) # [n, n_group]
group_idx = torch.topk(
group_scores, k=self.topk_group, dim=-1,
sorted=False)[1] # [n, top_k_group]
group_mask = torch.zeros_like(group_scores) # [n, n_group]
group_mask.scatter_(1, group_idx, 1) # [n, n_group]
score_mask = (group_mask.unsqueeze(-1).expand(
bsz * seq_len, self.n_group,
self.n_routed_experts // self.n_group).reshape(
bsz * seq_len, -1)) # [n, e]
tmp_scores = scores.masked_fill(~score_mask.bool(), 0.0) # [n, e]
topk_weight, topk_idx = torch.topk(
tmp_scores, k=self.top_k, dim=-1, sorted=False)
### norm gate to sum 1
if self.top_k > 1 and self.norm_topk_prob:
denominator = topk_weight.sum(dim=-1, keepdim=True) + 1e-20
topk_weight = topk_weight / denominator
else:
topk_weight = topk_weight * self.routed_scaling_factor
### expert-level computation auxiliary loss
if self.training and self.alpha > 0.0:
scores_for_aux = scores
aux_topk = self.top_k
# always compute aux loss based on the naive greedy topk method
topk_idx_for_aux_loss = topk_idx.view(bsz, -1)
if self.seq_aux:
scores_for_seq_aux = scores_for_aux.view(bsz, seq_len, -1)
ce = torch.zeros(
bsz, self.n_routed_experts, device=hidden_states.device)
ce.scatter_add_(
1,
topk_idx_for_aux_loss,
torch.ones(
bsz, seq_len * aux_topk, device=hidden_states.device),
).div_(seq_len * aux_topk / self.n_routed_experts)
aux_loss = (ce * scores_for_seq_aux.mean(dim=1)).sum(
dim=1).mean() * self.alpha
else:
mask_ce = F.one_hot(
topk_idx_for_aux_loss.view(-1),
num_classes=self.n_routed_experts)
ce = mask_ce.float().mean(0)
Pi = scores_for_aux.mean(0)
fi = ce * self.n_routed_experts
aux_loss = (Pi * fi).sum() * self.alpha
else:
aux_loss = None
return topk_idx, topk_weight, aux_loss
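# Illustrative sketch (not part of the original module). The branch of
# MoEGate.forward taken when seq_aux is False computes
# alpha * sum_i(P_i * f_i), where P_i is the mean gate score of expert i
# over all tokens and f_i is the share of top-k routing slots assigned to
# expert i, multiplied by the number of experts. The helper name is
# hypothetical and nested Python lists stand in for tensors:
# scores is [n_tokens][n_experts], topk_idx is [n_tokens][top_k].
def _aux_loss_sketch(scores, topk_idx, alpha):
    n_tokens = len(scores)
    n_experts = len(scores[0])
    slots = [e for row in topk_idx for e in row]
    # f_i: fraction of routing slots won by expert i, scaled by n_experts.
    fi = [slots.count(i) / len(slots) * n_experts for i in range(n_experts)]
    # P_i: average gate probability of expert i across tokens.
    pi = [sum(row[i] for row in scores) / n_tokens for i in range(n_experts)]
    return alpha * sum(p * f for p, f in zip(pi, fi))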
class AddAuxiliaryLoss(torch.autograd.Function):
"""The trick function of adding auxiliary (aux) loss, which includes the
gradient of the aux loss during backpropagation."""
@staticmethod
def forward(ctx, x, loss):
assert loss.numel() == 1
ctx.dtype = loss.dtype
ctx.required_aux_loss = loss.requires_grad
return x
@staticmethod
def backward(ctx, grad_output):
grad_loss = None
if ctx.required_aux_loss:
grad_loss = torch.ones(
1, dtype=ctx.dtype, device=grad_output.device)
return grad_output, grad_loss
class ExpertShard(nn.Module):
def __init__(self, config, shard_idx, expert_in_one_shard=10):
super().__init__()
hidden_dim = config.hidden_size
ffn_dim = config.moe_intermediate_size
self.w1w3 = nn.Parameter(
torch.empty(expert_in_one_shard, ffn_dim * 2, hidden_dim))
self.w2 = nn.Parameter(
torch.empty(expert_in_one_shard, hidden_dim, ffn_dim))
self.act = nn.SiLU()
self.expert_in_one_shard = expert_in_one_shard
self.shard_idx = shard_idx
self.reset_parameters()
def reset_parameters(self) -> None:
# Unlike nn.Linear modules, the weights of self.w1w3 and self.w2
# cannot be initialized by the DeepseekV2PreTrainedModel._init_weights method
self.w1w3.data.normal_(0, 0.02)
self.w2.data.normal_(0, 0.02)
def expert_forward(self, current_state, expert_idx):
w1w3 = self.w1w3[expert_idx]
w2 = self.w2[expert_idx]
gate_up_out = torch.matmul(current_state, w1w3.T)
gate_out, up_out = gate_up_out.chunk(2, dim=-1)
gate_out = self.act(gate_out)
out = gate_out * up_out
out = torch.matmul(out, w2.T)
return out
def forward(self, hidden_states, flat_topk_idx, y):
for i in range(self.expert_in_one_shard):
expert_idx = i + self.expert_in_one_shard * self.shard_idx
y[flat_topk_idx == expert_idx] = self.expert_forward(
hidden_states[flat_topk_idx == expert_idx], i)
return y
class DeepseekV2MoEShard(nn.Module):
"""A mixed expert module containing shared experts."""
def __init__(self, config):
super().__init__()
self.config = config
self.num_experts_per_tok = config.num_experts_per_tok
if hasattr(config, 'ep_size') and config.ep_size > 1:
raise NotImplementedError
else:
self.ep_size = 1
self.experts_per_rank = config.n_routed_experts
self.ep_rank = 0
self.n_routed_experts = config.n_routed_experts
expert_in_one_shard = config.expert_in_one_shard
assert config.n_routed_experts % expert_in_one_shard == 0, \
('n_routed_experts should be divisible by expert_in_one_shard, but got '
f'n_routed_experts = {config.n_routed_experts} and expert_in_one_shard = {expert_in_one_shard}')
self.shard_num = config.n_routed_experts // expert_in_one_shard
self.expert_in_one_shard = expert_in_one_shard
self.experts = nn.ModuleList([
ExpertShard(config, i, self.expert_in_one_shard)
for i in range(self.shard_num)
])
self.gate = MoEGate(config)
if config.n_shared_experts is not None:
intermediate_size = config.moe_intermediate_size * config.n_shared_experts
self.shared_experts = DeepseekV2MLP(
config=config, intermediate_size=intermediate_size)
def forward(self, hidden_states):
if not self.training:
raise NotImplementedError
identity = hidden_states
orig_shape = hidden_states.shape
topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
flat_topk_idx = topk_idx.view(-1)
hidden_states = hidden_states.repeat_interleave(
self.num_experts_per_tok, dim=0)
y = torch.empty_like(hidden_states)
y_dtype = y.dtype
for shard_index in range(self.shard_num):
y = self.experts[shard_index](hidden_states, flat_topk_idx, y)
y = ((y.view(*topk_weight.shape, -1) *
topk_weight.unsqueeze(-1)).sum(dim=1)).type(y_dtype)
y = y.view(*orig_shape)
y = AddAuxiliaryLoss.apply(y, aux_loss)
if self.config.n_shared_experts is not None:
y = y + self.shared_experts(identity)
return y
class DeepseekV2MoE(nn.Module):
"""A mixed expert module containing shared experts."""
def __init__(self, config):
super().__init__()
self.config = config
self.num_experts_per_tok = config.num_experts_per_tok
if hasattr(config, 'ep_size') and config.ep_size > 1:
assert config.ep_size == dist.get_world_size()
self.ep_size = config.ep_size
self.experts_per_rank = config.n_routed_experts // config.ep_size
self.ep_rank = dist.get_rank()
self.experts = nn.ModuleList([
(DeepseekV2MLP(
config, intermediate_size=config.moe_intermediate_size)
if i >= self.ep_rank * self.experts_per_rank and i <
(self.ep_rank + 1) * self.experts_per_rank else None)
for i in range(config.n_routed_experts)
])
else:
self.ep_size = 1
self.experts_per_rank = config.n_routed_experts
self.ep_rank = 0
self.experts = nn.ModuleList([
DeepseekV2MLP(
config, intermediate_size=config.moe_intermediate_size)
for i in range(config.n_routed_experts)
])
self.gate = MoEGate(config)
if config.n_shared_experts is not None:
intermediate_size = config.moe_intermediate_size * config.n_shared_experts
self.shared_experts = DeepseekV2MLP(
config=config, intermediate_size=intermediate_size)
def forward(self, hidden_states):
identity = hidden_states
orig_shape = hidden_states.shape
topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
hidden_states = hidden_states.view(-1, hidden_states.shape[-1])
flat_topk_idx = topk_idx.view(-1)
if self.training:
hidden_states = hidden_states.repeat_interleave(
self.num_experts_per_tok, dim=0)
y = torch.empty_like(hidden_states)
y_dtype = y.dtype
for i, expert in enumerate(self.experts):
y[flat_topk_idx == i] = expert(
hidden_states[flat_topk_idx == i])
y = ((y.view(*topk_weight.shape, -1) *
topk_weight.unsqueeze(-1)).sum(dim=1)).type(y_dtype)
y = y.view(*orig_shape)
y = AddAuxiliaryLoss.apply(y, aux_loss)
else:
y = self.moe_infer(hidden_states, topk_idx,
topk_weight).view(*orig_shape)
if self.config.n_shared_experts is not None:
y = y + self.shared_experts(identity)
return y
@torch.no_grad()
def moe_infer(self, x, topk_ids, topk_weight):
cnts = topk_ids.new_zeros((topk_ids.shape[0], len(self.experts)))
cnts.scatter_(1, topk_ids, 1)
tokens_per_expert = cnts.sum(dim=0)
idxs = topk_ids.view(-1).argsort()
sorted_tokens = x[idxs // topk_ids.shape[1]]
sorted_tokens_shape = sorted_tokens.shape
if self.ep_size > 1:
tokens_per_ep_rank = tokens_per_expert.view(self.ep_size,
-1).sum(dim=1)
tokens_per_expert_group = tokens_per_expert.new_empty(
tokens_per_expert.shape[0])
dist.all_to_all_single(tokens_per_expert_group, tokens_per_expert)
output_splits = (
tokens_per_expert_group.view(self.ep_size,
-1).sum(1).cpu().numpy().tolist())
gathered_tokens = sorted_tokens.new_empty(
tokens_per_expert_group.sum(dim=0).cpu().item(),
sorted_tokens.shape[1])
input_split_sizes = tokens_per_ep_rank.cpu().numpy().tolist()
dist.all_to_all(
list(gathered_tokens.split(output_splits)),
list(sorted_tokens.split(input_split_sizes)),
)
tokens_per_expert_post_gather = tokens_per_expert_group.view(
self.ep_size, self.experts_per_rank).sum(dim=0)
gatherd_idxs = np.zeros(
shape=(gathered_tokens.shape[0], ), dtype=np.int32)
s = 0
for i, k in enumerate(tokens_per_expert_group.cpu().numpy()):
gatherd_idxs[s:s + k] = i % self.experts_per_rank
s += k
gatherd_idxs = gatherd_idxs.argsort()
sorted_tokens = gathered_tokens[gatherd_idxs]
tokens_per_expert = tokens_per_expert_post_gather
tokens_per_expert = tokens_per_expert.cpu().numpy()
outputs = []
start_idx = 0
for i, num_tokens in enumerate(tokens_per_expert):
end_idx = start_idx + num_tokens
if num_tokens == 0:
continue
expert = self.experts[i + self.ep_rank * self.experts_per_rank]
tokens_for_this_expert = sorted_tokens[start_idx:end_idx]
expert_out = expert(tokens_for_this_expert)
outputs.append(expert_out)
start_idx = end_idx
outs = torch.cat(
outputs, dim=0) if len(outputs) else sorted_tokens.new_empty(0)
if self.ep_size > 1:
new_x = torch.empty_like(outs)
new_x[gatherd_idxs] = outs
gathered_tokens = new_x.new_empty(*sorted_tokens_shape)
dist.all_to_all(
list(gathered_tokens.split(input_split_sizes)),
list(new_x.split(output_splits)),
)
outs = gathered_tokens
new_x = torch.empty_like(outs)
new_x[idxs] = outs
final_out = (
new_x.view(*topk_ids.shape, -1).type(topk_weight.dtype).mul_(
topk_weight.unsqueeze(dim=-1)).sum(dim=1).type(new_x.dtype))
return final_out
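# Illustrative sketch (not part of the original module). The single-rank
# path of moe_infer above sorts the flattened (token, slot) assignments by
# expert id so that each expert processes one contiguous slice, scatters
# the results back to the original slot order, and takes the weighted sum
# over each token's top-k slots. Here tokens are floats and experts are
# plain callables; every name in this helper is hypothetical.
def _moe_dispatch_sketch(tokens, topk_ids, topk_weight, experts):
    k = len(topk_ids[0])
    flat_ids = [e for row in topk_ids for e in row]
    # Stable argsort of slots by expert id (mirrors topk_ids.view(-1).argsort()).
    order = sorted(range(len(flat_ids)), key=lambda s: flat_ids[s])
    # Slot s belongs to token s // k (mirrors x[idxs // topk_ids.shape[1]]).
    sorted_tokens = [tokens[s // k] for s in order]
    outs, start = [], 0
    for i, expert in enumerate(experts):
        n = flat_ids.count(i)
        if n == 0:
            continue
        outs.extend(expert(x) for x in sorted_tokens[start:start + n])
        start += n
    # Undo the sort so that slot_out[s] is the output for original slot s.
    slot_out = [None] * len(flat_ids)
    for pos, slot in enumerate(order):
        slot_out[slot] = outs[pos]
    return [
        sum(slot_out[t * k + j] * topk_weight[t][j] for j in range(k))
        for t in range(len(tokens))
    ]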
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""This is the equivalent of torch.repeat_interleave(x, dim=1,
repeats=n_rep).
The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to
(batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
# Copied from transformers.models.llama.modeling_llama.LlamaAttention with Llama->DeepseekV2
class DeepseekV2Attention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper."""
def __init__(self,
config: DeepseekV2Config,
layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f'Instantiating {self.__class__.__name__} without passing `layer_idx` is not recommended and will '
'lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` '
'when creating this class.')
self.attention_dropout = config.attention_dropout
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.q_lora_rank = config.q_lora_rank
self.qk_rope_head_dim = config.qk_rope_head_dim
self.kv_lora_rank = config.kv_lora_rank
self.v_head_dim = config.v_head_dim
self.qk_nope_head_dim = config.qk_nope_head_dim
self.q_head_dim = config.qk_nope_head_dim + config.qk_rope_head_dim
self.is_causal = True
if self.q_lora_rank is None:
self.q_proj = nn.Linear(
self.hidden_size, self.num_heads * self.q_head_dim, bias=False)
else:
self.q_a_proj = nn.Linear(
self.hidden_size,
config.q_lora_rank,
bias=config.attention_bias)
self.q_a_layernorm = DeepseekV2RMSNorm(config.q_lora_rank)
self.q_b_proj = nn.Linear(
config.q_lora_rank,
self.num_heads * self.q_head_dim,
bias=False)
self.kv_a_proj_with_mqa = nn.Linear(
self.hidden_size,
config.kv_lora_rank + config.qk_rope_head_dim,
bias=config.attention_bias,
)
self.kv_a_layernorm = DeepseekV2RMSNorm(config.kv_lora_rank)
self.kv_b_proj = nn.Linear(
config.kv_lora_rank,
self.num_heads *
(self.q_head_dim - self.qk_rope_head_dim + self.v_head_dim),
bias=False,
)
self.o_proj = nn.Linear(
self.num_heads * self.v_head_dim,
self.hidden_size,
bias=config.attention_bias,
)
self._init_rope()
self.softmax_scale = self.q_head_dim**(-0.5)
if self.config.rope_scaling is not None:
mscale_all_dim = self.config.rope_scaling.get('mscale_all_dim', 0)
scaling_factor = self.config.rope_scaling['factor']
if mscale_all_dim:
mscale = yarn_get_mscale(scaling_factor, mscale_all_dim)
self.softmax_scale = self.softmax_scale * mscale * mscale
def _init_rope(self):
if self.config.rope_scaling is None:
self.rotary_emb = DeepseekV2RotaryEmbedding(
self.qk_rope_head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
else:
scaling_type = self.config.rope_scaling['type']
scaling_factor = self.config.rope_scaling['factor']
if scaling_type == 'linear':
self.rotary_emb = DeepseekV2LinearScalingRotaryEmbedding(
self.qk_rope_head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
elif scaling_type == 'dynamic':
self.rotary_emb = DeepseekV2DynamicNTKScalingRotaryEmbedding(
self.qk_rope_head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
)
elif scaling_type == 'yarn':
kwargs = {
key: self.config.rope_scaling[key]
for key in [
'original_max_position_embeddings',
'beta_fast',
'beta_slow',
'mscale',
'mscale_all_dim',
] if key in self.config.rope_scaling
}
self.rotary_emb = DeepseekV2YarnRotaryEmbedding(
self.qk_rope_head_dim,
max_position_embeddings=self.max_position_embeddings,
scaling_factor=scaling_factor,
base=self.rope_theta,
**kwargs,
)
else:
raise ValueError(f'Unknown RoPE scaling type {scaling_type}')
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return (tensor.view(bsz, seq_len, self.num_heads,
self.v_head_dim).transpose(1, 2).contiguous())
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.'
)
bsz, q_len, _ = hidden_states.size()
if self.q_lora_rank is None:
q = self.q_proj(hidden_states)
else:
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
q_nope, q_pe = torch.split(
q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(
compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = (
self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(
bsz, q_len, self.num_heads,
self.qk_nope_head_dim + self.v_head_dim).transpose(1, 2))
k_nope, value_states = torch.split(
kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f'The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(
kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
query_states = k_pe.new_empty(bsz, self.num_heads, q_len,
self.q_head_dim)
query_states[:, :, :, :self.qk_nope_head_dim] = q_nope
query_states[:, :, :, self.qk_nope_head_dim:] = q_pe
key_states = k_pe.new_empty(bsz, self.num_heads, q_len,
self.q_head_dim)
key_states[:, :, :, :self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim:] = k_pe
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
attn_weights = (
torch.matmul(query_states, key_states.transpose(2, 3)) *
self.softmax_scale)
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
raise ValueError(
f'Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is'
f' {attn_weights.size()}')
assert attention_mask is not None
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
)
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(
attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(
attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.v_head_dim):
raise ValueError(
f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.v_head_dim)}, but is'
f' {attn_output.size()}')
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len,
self.num_heads * self.v_head_dim)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2 with Llama->DeepseekV2
class DeepseekV2FlashAttention2(DeepseekV2Attention):
"""DeepseekV2 flash attention module.
This module inherits from `DeepseekV2Attention` because the weights of the
module stay untouched. The only required change is in the forward pass,
which needs to correctly call the public API of flash attention and deal
with padding tokens in case the input contains any of them.
"""
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10(
)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
# DeepseekV2FlashAttention2 attention does not support output_attentions
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.'
)
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop('padding_mask')
output_attentions = False
bsz, q_len, _ = hidden_states.size()
if self.q_lora_rank is None:
q = self.q_proj(hidden_states)
else:
q = self.q_b_proj(self.q_a_layernorm(self.q_a_proj(hidden_states)))
q = q.view(bsz, q_len, self.num_heads, self.q_head_dim).transpose(1, 2)
q_nope, q_pe = torch.split(
q, [self.qk_nope_head_dim, self.qk_rope_head_dim], dim=-1)
# Flash attention requires the input to have the shape
# batch_size x seq_length x num_heads x head_dim;
# the states built below are transposed into that layout before the call
compressed_kv = self.kv_a_proj_with_mqa(hidden_states)
compressed_kv, k_pe = torch.split(
compressed_kv, [self.kv_lora_rank, self.qk_rope_head_dim], dim=-1)
k_pe = k_pe.view(bsz, q_len, 1, self.qk_rope_head_dim).transpose(1, 2)
kv = (
self.kv_b_proj(self.kv_a_layernorm(compressed_kv)).view(
bsz, q_len, self.num_heads,
self.qk_nope_head_dim + self.v_head_dim).transpose(1, 2))
k_nope, value_states = torch.split(
kv, [self.qk_nope_head_dim, self.v_head_dim], dim=-1)
kv_seq_len = value_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(
kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
q_pe, k_pe = apply_rotary_pos_emb(q_pe, k_pe, cos, sin, position_ids)
query_states = k_pe.new_empty(bsz, self.num_heads, q_len,
self.q_head_dim)
query_states[:, :, :, :self.qk_nope_head_dim] = q_nope
query_states[:, :, :, self.qk_nope_head_dim:] = q_pe
key_states = k_pe.new_empty(bsz, self.num_heads, q_len,
self.q_head_dim)
key_states[:, :, :, :self.qk_nope_head_dim] = k_nope
key_states[:, :, :, self.qk_nope_head_dim:] = k_pe
if self.q_head_dim != self.v_head_dim:
value_states = F.pad(value_states,
[0, self.q_head_dim - self.v_head_dim])
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# TODO: These transposes are quite inefficient, but Flash Attention requires the layout [batch_size, sequence_length, num_heads, head_dim]. We would need to refactor the KV cache
# to be able to avoid many of these transpose/reshape/view operations.
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
dropout_rate = self.attention_dropout if self.training else 0.0
# In PEFT, layer norms are usually cast to float32 for training stability,
# so the input hidden states get silently cast to float32. Hence, we need to
# cast them back to the correct dtype just to be sure everything works as expected.
# This might slow down training & inference, so it is recommended not to cast the LayerNorms
# to fp32. (DeepseekV2RMSNorm handles it correctly)
input_dtype = query_states.dtype
if input_dtype == torch.float32:
# Handle the case where the model is quantized
if hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
elif torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
else:
target_dtype = self.q_proj.weight.dtype if self.q_lora_rank is None else self.q_a_proj.weight.dtype
logger.warning_once(
f'The input hidden states seem to be silently cast to float32; this might be related to'
f' the fact that you have upcast embedding or layer norm layers to float32. We will cast the input back to'
f' {target_dtype}.')
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate,
softmax_scale=self.softmax_scale,
)
if self.q_head_dim != self.v_head_dim:
attn_output = attn_output[:, :, :, :self.v_head_dim]
attn_output = attn_output.reshape(bsz, q_len, self.num_heads *
self.v_head_dim).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(
self,
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None,
):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention scores and pads the final attention scores.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
dropout (`float`, *optional*):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Defaults to 1 / sqrt(head_dim)
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in DeepseekV2FlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
(
query_states,
key_states,
value_states,
indices_q,
cu_seq_lens,
max_seq_lens,
) = self._upad_input(query_states, key_states, value_states,
attention_mask, query_length)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size,
query_length)
else:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
)
return attn_output
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask,
query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(
attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim),
indices_k,
)
value_layer = index_first_axis(
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads,
head_dim),
indices_k,
)
if query_length == kv_seq_len:
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, self.num_heads,
head_dim),
indices_k,
)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(
query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
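# Illustrative sketch (not part of the original module). _upad_input relies
# on _get_unpad_data, which is defined elsewhere. Judging from how its three
# outputs are used above, it is assumed here to turn a
# (batch_size, seq_len) padding mask into (1) the positions of real tokens
# in the flattened batch, (2) cumulative sequence lengths with a leading 0,
# and (3) the longest unpadded length. The helper name is hypothetical and
# nested Python lists stand in for tensors.
def _unpad_metadata_sketch(attention_mask):
    seq_len = len(attention_mask[0])
    # Flattened index of every position whose mask value is 1.
    indices = [
        b * seq_len + t
        for b, row in enumerate(attention_mask)
        for t, keep in enumerate(row) if keep
    ]
    lengths = [sum(row) for row in attention_mask]
    # cu_seqlens[i] is the offset of sequence i in the packed token buffer.
    cu_seqlens = [0]
    for n in lengths:
        cu_seqlens.append(cu_seqlens[-1] + n)
    return indices, cu_seqlens, max(lengths)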
ATTENTION_CLASSES = {
'eager': DeepseekV2Attention,
'flash_attention_2': DeepseekV2FlashAttention2,
}
class DeepseekV2DecoderLayer(nn.Module):
def __init__(self, config: DeepseekV2Config, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
self.self_attn = ATTENTION_CLASSES[config._attn_implementation](
config=config, layer_idx=layer_idx)
moe_implementation = config.moe_implementation
if moe_implementation == 'origin':
block = DeepseekV2MoE
elif moe_implementation == 'shard':
block = DeepseekV2MoEShard
else:
raise NotImplementedError
self.mlp = (
block(config) if
(config.n_routed_experts is not None
and layer_idx >= config.first_k_dense_replace and layer_idx %
config.moe_layer_freq == 0) else DeepseekV2MLP(config))
self.input_layernorm = DeepseekV2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = DeepseekV2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
**kwargs,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor,
torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*):
attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
query_sequence_length, key_sequence_length)` if default attention is used.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
if 'padding_mask' in kwargs:
warnings.warn(
'Passing `padding_mask` is deprecated and will be removed in v4.37. Please make sure to use `attention_mask` instead.'
)
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
**kwargs,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states, )
if output_attentions:
outputs += (self_attn_weights, )
if use_cache:
outputs += (present_key_value, )
return outputs
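# Illustrative sketch (not part of the original module). In
# DeepseekV2DecoderLayer.__init__, a layer gets an MoE block only when
# n_routed_experts is set, layer_idx >= first_k_dense_replace, and
# layer_idx % moe_layer_freq == 0; all other layers use a dense
# DeepseekV2MLP. The hypothetical helper below lists the MoE layer indices
# for a given set of config values.
def _moe_layer_indices_sketch(num_layers, first_k_dense_replace,
                              moe_layer_freq, n_routed_experts):
    if n_routed_experts is None:
        return []  # every layer stays dense
    return [
        idx for idx in range(num_layers)
        if idx >= first_k_dense_replace and idx % moe_layer_freq == 0
    ]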
def _load_pretrained_model(
cls,
model,
state_dict,
loaded_keys,
resolved_archive_file,
pretrained_model_name_or_path,
ignore_mismatched_sizes=False,
sharded_metadata=None,
_fast_init=True,
low_cpu_mem_usage=False,
device_map=None,
offload_folder=None,
offload_state_dict=None,
dtype=None,
hf_quantizer=None,
keep_in_fp32_modules=None,
gguf_path=None,
):
if ((state_dict is not None) or (resolved_archive_file is None)
or (low_cpu_mem_usage) or (device_map is not None)
or (offload_folder is not None) or
(not (offload_state_dict is None or offload_state_dict is False))
or (hf_quantizer is not None) or
(keep_in_fp32_modules is not None and len(keep_in_fp32_modules) > 0)
or (gguf_path is not None)):
raise NotImplementedError
folder = os.path.sep.join(resolved_archive_file[0].split(os.path.sep)[:-1])
error_msgs = load_state_dict_into_model(model, folder)
return model, [], [], [], None, error_msgs
DeepseekV2_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`DeepseekV2Config`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
@add_start_docstrings(
'The bare DeepseekV2 Model outputting raw hidden-states without any specific head on top.',
DeepseekV2_START_DOCSTRING,
)
class DeepseekV2PreTrainedModel(PreTrainedModel):
config_class = DeepseekV2Config
base_model_prefix = 'model'
supports_gradient_checkpointing = True
_no_split_modules = ['DeepseekV2DecoderLayer']
_skip_keys_device_placement = 'past_key_values'
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
moe_implementation = kwargs.get('moe_implementation', 'origin')
if moe_implementation == 'origin':
return super().from_pretrained(pretrained_model_name_or_path,
*args, **kwargs)
cls._load_pretrained_model = types.MethodType(_load_pretrained_model,
cls)
return super().from_pretrained(pretrained_model_name_or_path, *args,
**kwargs)
DeepseekV2_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
- 1 indicates the head is **not masked**,
- 0 indicates the head is **masked**.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`Cache` or `tuple(tuple(torch.FloatTensor))`, *optional*):
Pre-computed hidden-states (key and values in the self-attention blocks and in the cross-attention
blocks) that can be used to speed up sequential decoding. This typically consists in the `past_key_values`
returned by the model at a previous stage of decoding, when `use_cache=True` or `config.use_cache=True`.
Two formats are allowed:
- a [`~cache_utils.Cache`] instance;
- Tuple of `tuple(torch.FloatTensor)` of length `config.n_layers`, with each tuple having 2 tensors of
shape `(batch_size, num_heads, sequence_length, embed_size_per_head)`). This is also known as the legacy
cache format.
The model will output the same cache format that is fed as input. If no `past_key_values` are passed, the
legacy cache format will be returned.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
'The bare DeepseekV2 Model outputting raw hidden-states without any specific head on top.',
DeepseekV2_START_DOCSTRING,
)
class DeepseekV2Model(DeepseekV2PreTrainedModel):
"""Transformer decoder consisting of *config.num_hidden_layers* layers.
Each layer is a [`DeepseekV2DecoderLayer`]
Args:
config: DeepseekV2Config
"""
def __init__(self, config: DeepseekV2Config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size,
self.padding_idx)
self.layers = nn.ModuleList([
DeepseekV2DecoderLayer(config, layer_idx)
for layer_idx in range(config.num_hidden_layers)
])
self._use_sdpa = config._attn_implementation == 'sdpa'
self._use_flash_attention_2 = config._attn_implementation == 'flash_attention_2'
self.norm = DeepseekV2RMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
@add_start_docstrings_to_model_forward(DeepseekV2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = (
output_attentions if output_attentions is not None else
self.config.output_attentions)
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = (
return_dict
if return_dict is not None else self.config.use_return_dict)
# retrieve input_ids and inputs_embeds
if input_ids is not None and inputs_embeds is not None:
raise ValueError(
'You cannot specify both input_ids and inputs_embeds at the same time'
)
elif input_ids is not None:
batch_size, seq_length = input_ids.shape[:2]
elif inputs_embeds is not None:
batch_size, seq_length = inputs_embeds.shape[:2]
else:
raise ValueError(
'You have to specify either input_ids or inputs_embeds')
if self.gradient_checkpointing and self.training:
if use_cache:
logger.warning_once(
'`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...'
)
use_cache = False
past_key_values_length = 0
if use_cache:
use_legacy_cache = not isinstance(past_key_values, Cache)
if use_legacy_cache:
past_key_values = DynamicCache.from_legacy_cache(
past_key_values)
past_key_values_length = past_key_values.get_usable_length(
seq_length)
if position_ids is None:
device = input_ids.device if input_ids is not None else inputs_embeds.device
position_ids = torch.arange(
past_key_values_length,
seq_length + past_key_values_length,
dtype=torch.long,
device=device,
)
position_ids = position_ids.unsqueeze(0)
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
if self._use_flash_attention_2:
# 2d mask is passed through the layers
attention_mask = (
attention_mask if
(attention_mask is not None and 0 in attention_mask) else None)
elif self._use_sdpa and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
)
else:
# 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
)
# embed positions
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = None
for decoder_layer in self.layers:
if output_hidden_states:
all_hidden_states += (hidden_states, )
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
attention_mask,
position_ids,
past_key_values,
output_attentions,
use_cache,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[
2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1], )
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states, )
next_cache = None
if use_cache:
next_cache = (
next_decoder_cache.to_legacy_cache()
if use_legacy_cache else next_decoder_cache)
if not return_dict:
return tuple(
v for v in
[hidden_states, next_cache, all_hidden_states, all_self_attns]
if v is not None)
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
)
class DeepseekV2ForCausalLM(DeepseekV2PreTrainedModel):
_tied_weights_keys = ['lm_head.weight']
def __init__(self, config):
super().__init__(config)
self.model = DeepseekV2Model(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(
config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(DeepseekV2_INPUTS_DOCSTRING)
@replace_return_docstrings(
output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, DeepseekV2ForCausalLM
>>> model = DeepseekV2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = (
output_attentions if output_attentions is not None else
self.config.output_attentions)
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
return_dict = (
return_dict
if return_dict is not None else self.config.use_return_dict)
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits, ) + outputs[1:]
return (loss, ) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
**kwargs,
):
if past_key_values is not None:
if isinstance(past_key_values, Cache):
cache_length = past_key_values.get_seq_length()
past_length = past_key_values.seen_tokens
max_cache_length = past_key_values.get_max_length()
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing input_embeds as
# input)
if (attention_mask is not None
and attention_mask.shape[1] > input_ids.shape[1]):
input_ids = input_ids[:, -(attention_mask.shape[1] -
past_length):]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (max_cache_length is not None and attention_mask is not None
and cache_length + input_ids.shape[1] > max_cache_length):
attention_mask = attention_mask[:, -max_cache_length:]
position_ids = kwargs.get('position_ids', None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1]:]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {'inputs_embeds': inputs_embeds}
else:
model_inputs = {'input_ids': input_ids}
model_inputs.update({
'position_ids': position_ids,
'past_key_values': past_key_values,
'use_cache': kwargs.get('use_cache'),
'attention_mask': attention_mask,
})
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (tuple(
past_state.index_select(0, beam_idx.to(past_state.device))
for past_state in layer_past), )
return reordered_past
@add_start_docstrings(
"""
The DeepseekV2 Model transformer with a sequence classification head on top (linear layer).
[`DeepseekV2ForSequenceClassification`] uses the last token in order to do the classification, as other causal models
(e.g. GPT-2) do.
Since it does classification on the last token, it requires to know the position of the last token. If a
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
each row of the batch).
""",
DeepseekV2_START_DOCSTRING,
)
class DeepseekV2ForSequenceClassification(DeepseekV2PreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = DeepseekV2Model(config)
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
@add_start_docstrings_to_model_forward(DeepseekV2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (Mean-Square loss); if
`config.num_labels > 1`, a classification loss is computed (Cross-Entropy).
"""
return_dict = (
return_dict
if return_dict is not None else self.config.use_return_dict)
transformer_outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
logits = self.score(hidden_states)
if input_ids is not None:
batch_size = input_ids.shape[0]
else:
batch_size = inputs_embeds.shape[0]
if self.config.pad_token_id is None and batch_size != 1:
raise ValueError(
'Cannot handle batch sizes > 1 if no padding token is defined.'
)
if self.config.pad_token_id is None:
sequence_lengths = -1
else:
if input_ids is not None:
sequence_lengths = (torch.eq(
input_ids, self.config.pad_token_id).int().argmax(-1) -
1).to(logits.device)
else:
sequence_lengths = -1
pooled_logits = logits[torch.arange(batch_size, device=logits.device),
sequence_lengths]
loss = None
if labels is not None:
labels = labels.to(logits.device)
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = 'regression'
elif self.num_labels > 1 and (labels.dtype == torch.long
or labels.dtype == torch.int):
self.config.problem_type = 'single_label_classification'
else:
self.config.problem_type = 'multi_label_classification'
if self.config.problem_type == 'regression':
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == 'single_label_classification':
loss_fct = CrossEntropyLoss()
loss = loss_fct(
pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == 'multi_label_classification':
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(pooled_logits, labels)
if not return_dict:
output = (pooled_logits, ) + transformer_outputs[1:]
return ((loss, ) + output) if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=pooled_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
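The pooling step in `DeepseekV2ForSequenceClassification.forward` picks one position per row with `argmax(-1) - 1` over the padding mask. The following is a minimal sketch of that index arithmetic, assuming plain Python lists in place of tensors; `last_token_index` and the sample rows are hypothetical names introduced only for illustration and are not part of the repository.

```python
from typing import List, Optional


def last_token_index(row: List[int], pad_token_id: Optional[int]) -> int:
    """Return the position used to pool one row, mirroring `argmax(-1) - 1`."""
    if pad_token_id is None:
        # Without a padding token the code simply takes the last position.
        return -1
    is_pad = [int(tok == pad_token_id) for tok in row]
    # argmax returns the first position of the maximum value. For a row
    # with no padding every entry is 0, so the first maximum is index 0.
    first_pad = is_pad.index(max(is_pad))
    # One step before the first pad token; 0 - 1 = -1 selects the last position.
    return first_pad - 1


rows = [
    [5, 6, 7, 0, 0],  # right-padded with pad id 0
    [5, 6, 7, 8, 9],  # no padding at all
]
indices = [last_token_index(row, pad_token_id=0) for row in rows]
pooled = [row[i] for row, i in zip(rows, indices)]
print(indices, pooled)
```

In this sketch the unpadded row resolves to index `-1`, which is why the original expression still selects the final token when no pad token is present in a row.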
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/deepseek_v2/tokenization_deepseek_fast.py
================================================
from typing import List, Optional, Union
from transformers.models.llama import LlamaTokenizerFast
class DeepseekTokenizerFast(LlamaTokenizerFast):
def convert_ids_to_tokens(
self,
ids: Union[int, List[int]],
skip_special_tokens: bool = False) -> Union[str, List[str]]:
"""Converts a single index or a sequence of indices into a token or a
sequence of tokens, using the vocabulary and added tokens.
Args:
ids (`int` or `List[int]`):
The token id (or token ids) to convert to tokens.
skip_special_tokens (`bool`, *optional*, defaults to `False`):
Whether or not to remove special tokens in the decoding.
Returns:
`str` or `List[str]`: The decoded token(s).
"""
if isinstance(ids, int):
return self._convert_id_to_token(ids)
tokens = []
for index in ids:
index = int(index)
if skip_special_tokens and index in self.all_special_ids:
continue
token = self._tokenizer.id_to_token(index)
tokens.append(token if token is not None else '')
return tokens
def _convert_id_to_token(self, index: int) -> Optional[str]:
token = self._tokenizer.id_to_token(int(index))
return token if token is not None else ''
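The override above differs from a plain lookup in one respect: ids that the backend cannot resolve become empty strings instead of `None`. Below is a minimal sketch of that behavior, assuming a dictionary-backed stand-in for the tokenizer backend; `ToyIdToToken`, its vocabulary, and its special-id set are hypothetical and exist only for this illustration.

```python
from typing import Dict, List, Optional, Set, Union


class ToyIdToToken:
    """Hypothetical stand-in for the fast tokenizer; not part of the repository."""

    def __init__(self, vocab: Dict[int, str], special_ids: Set[int]):
        self.vocab = vocab
        self.all_special_ids = special_ids

    def id_to_token(self, index: int) -> Optional[str]:
        # Assumption: the real backend returns None for ids it does not know.
        return self.vocab.get(index)

    def convert_ids_to_tokens(
            self,
            ids: Union[int, List[int]],
            skip_special_tokens: bool = False) -> Union[str, List[str]]:
        if isinstance(ids, int):
            token = self.id_to_token(ids)
            return token if token is not None else ''
        tokens = []
        for index in ids:
            index = int(index)
            if skip_special_tokens and index in self.all_special_ids:
                continue
            token = self.id_to_token(index)
            # Unknown ids are mapped to '' so callers never receive None.
            tokens.append(token if token is not None else '')
        return tokens


toy = ToyIdToToken({0: '<s>', 1: 'Hello', 2: 'world'}, special_ids={0})
all_tokens = toy.convert_ids_to_tokens([0, 1, 2, 99])
no_special = toy.convert_ids_to_tokens([0, 1, 2, 99], skip_special_tokens=True)
single_unknown = toy.convert_ids_to_tokens(99)
print(all_tokens, no_special, repr(single_unknown))
```

Returning `''` keeps downstream string joins from failing on `None`, at the cost of silently hiding out-of-vocabulary ids.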
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/mixtral/__init__.py
================================================
from .configuration_mixtral import MixtralConfig
from .modeling_mixtral import MixtralForCausalLM, MixtralModel
__all__ = ['MixtralForCausalLM', 'MixtralModel', 'MixtralConfig']
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/mixtral/configuration_mixtral.py
================================================
# Copyright 2023 Mixtral AI and the HuggingFace Inc. team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Mixtral model configuration."""
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
class MixtralConfig(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`MixtralModel`]. It is used to instantiate a
Mixtral model according to the specified arguments, defining the model architecture. Instantiating a configuration
with the defaults will yield a similar configuration to that of the Mixtral-7B-v0.1 or Mixtral-7B-Instruct-v0.1.
[mixtralai/Mixtral-8x7B](https://huggingface.co/mixtralai/Mixtral-8x7B)
[mixtralai/Mixtral-7B-Instruct-v0.1](https://huggingface.co/mixtralai/Mixtral-7B-Instruct-v0.1)
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
vocab_size (`int`, *optional*, defaults to 32000):
Vocabulary size of the Mixtral model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`MixtralModel`]
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 14336):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer decoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer decoder.
num_key_value_heads (`int`, *optional*, defaults to 8):
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA); if
`num_key_value_heads=1`, the model will use Multi Query Attention (MQA); otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by mean-pooling all the original heads within that group. For more details, check out [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to `8`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to `4096*32`):
The maximum sequence length that this model might ever be used with. Mixtral's sliding window attention
allows sequences of up to 4096*32 tokens.
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
rms_norm_eps (`float`, *optional*, defaults to 1e-05):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
pad_token_id (`int`, *optional*):
The id of the padding token.
bos_token_id (`int`, *optional*, defaults to 1):
The id of the "beginning-of-sequence" token.
eos_token_id (`int`, *optional*, defaults to 2):
The id of the "end-of-sequence" token.
tie_word_embeddings (`bool`, *optional*, defaults to `False`):
Whether the model's input and output word embeddings should be tied.
rope_theta (`float`, *optional*, defaults to 1000000.0):
The base period of the RoPE embeddings.
sliding_window (`int`, *optional*):
Sliding window attention window size. Defaults to `None` in this constructor.
attention_dropout (`float`, *optional*, defaults to 0.0):
The dropout ratio for the attention probabilities.
num_experts_per_tok (`int`, *optional*, defaults to 2):
The number of experts to route each token to; can also be interpreted as the `top-k` routing
parameter.
num_local_experts (`int`, *optional*, defaults to 8):
Number of experts per Sparse MLP layer.
output_router_logits (`bool`, *optional*, defaults to `False`):
Whether or not the router logits should be returned by the model. Enabling this will also
allow the model to output the auxiliary loss. See [here]() for more details
router_aux_loss_coef (`float`, *optional*, defaults to 0.001):
The aux loss factor for the total loss.
router_jitter_noise (`float`, *optional*, defaults to 0.0):
Amount of noise to add to the router.
moe_implementation (`str`, *optional*, defaults to 'origin'):
The implementation of the moe blocks. 'origin' or 'shard'.
expert_in_one_shard (`int`, *optional*, defaults to None):
How many expert models are integrated into a shard. It is used only
when `moe_implementation` == 'shard'.
```python
>>> from transformers import MixtralModel, MixtralConfig
>>> # Initializing a Mixtral 7B style configuration
>>> configuration = MixtralConfig()
>>> # Initializing a model from the Mixtral 7B style configuration
>>> model = MixtralModel(configuration)
>>> # Accessing the model configuration
>>> configuration = model.config
```"""
model_type = 'mixtral'
keys_to_ignore_at_inference = ['past_key_values']
def __init__(
self,
vocab_size=32000,
hidden_size=4096,
intermediate_size=14336,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=8,
hidden_act='silu',
max_position_embeddings=4096 * 32,
initializer_range=0.02,
rms_norm_eps=1e-5,
use_cache=True,
pad_token_id=None,
bos_token_id=1,
eos_token_id=2,
tie_word_embeddings=False,
rope_theta=1e6,
sliding_window=None,
attention_dropout=0.0,
num_experts_per_tok=2,
num_local_experts=8,
output_router_logits=False,
router_aux_loss_coef=0.001,
router_jitter_noise=0.0,
moe_implementation='origin',
expert_in_one_shard=None,
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.sliding_window = sliding_window
# for backward compatibility
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.rope_theta = rope_theta
self.attention_dropout = attention_dropout
self.num_experts_per_tok = num_experts_per_tok
self.num_local_experts = num_local_experts
self.output_router_logits = output_router_logits
self.router_aux_loss_coef = router_aux_loss_coef
self.router_jitter_noise = router_jitter_noise
self.moe_implementation = moe_implementation
self.expert_in_one_shard = expert_in_one_shard
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
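The `num_key_value_heads` docstring above distinguishes MHA, MQA, and GQA by comparing two head counts. The sketch below restates that rule as code, under the assumption that the query head count must be divisible by the key/value head count; `attention_variant` and `queries_per_kv_head` are hypothetical helpers, not functions defined in this repository.

```python
def attention_variant(num_attention_heads: int, num_key_value_heads: int) -> str:
    """Classify the attention layout described in the config docstring."""
    # Assumption: query heads are split evenly across key/value heads.
    if num_attention_heads % num_key_value_heads != 0:
        raise ValueError(
            'num_attention_heads must be divisible by num_key_value_heads')
    if num_key_value_heads == num_attention_heads:
        return 'MHA'  # every query head has its own key/value head
    if num_key_value_heads == 1:
        return 'MQA'  # all query heads share a single key/value head
    return 'GQA'  # query heads are grouped over a few key/value heads


def queries_per_kv_head(num_attention_heads: int, num_key_value_heads: int) -> int:
    """Number of query heads that share one key/value head."""
    return num_attention_heads // num_key_value_heads


# Defaults taken from MixtralConfig.__init__ above.
default_variant = attention_variant(32, 8)
default_group_size = queries_per_kv_head(32, 8)
default_max_positions = 4096 * 32
print(default_variant, default_group_size, default_max_positions)
```

With the constructor defaults, four query heads share each key/value head, which is the grouped-query layout the docstring refers to.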
================================================
FILE: xtuner-eval_niah/xtuner/model/transformers_models/mixtral/modeling_mixtral.py
================================================
# Modified from https://github.com/huggingface/transformers/blob/v4.41.0/src/transformers/models/mixtral/modeling_mixtral.py
"""PyTorch Mixtral model."""
import inspect
import math
import os
import types
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.cache_utils import Cache, DynamicCache
from transformers.modeling_attn_mask_utils import (
_prepare_4d_causal_attention_mask,
_prepare_4d_causal_attention_mask_for_sdpa)
from transformers.modeling_outputs import (MoeCausalLMOutputWithPast,
MoeModelOutputWithPast,
SequenceClassifierOutputWithPast)
from transformers.modeling_utils import PreTrainedModel
from transformers.pytorch_utils import is_torch_greater_or_equal_than_1_13
from transformers.utils import (add_start_docstrings,
add_start_docstrings_to_model_forward,
is_flash_attn_2_available,
is_flash_attn_greater_or_equal_2_10, logging,
replace_return_docstrings)
from transformers.utils.import_utils import is_torch_fx_available
from xtuner.utils import load_state_dict_into_model
from .configuration_mixtral import MixtralConfig
if is_flash_attn_2_available():
from flash_attn import flash_attn_func, flash_attn_varlen_func
from flash_attn.bert_padding import pad_input # noqa
from flash_attn.bert_padding import index_first_axis, unpad_input
_flash_supports_window_size = 'window_size' in list(
inspect.signature(flash_attn_func).parameters)
# This makes `_prepare_4d_causal_attention_mask` a leaf function in the FX graph.
# It means that the function will not be traced through and simply appear as a node in the graph.
if is_torch_fx_available():
if not is_torch_greater_or_equal_than_1_13:
import torch.fx
_prepare_4d_causal_attention_mask = torch.fx.wrap(
_prepare_4d_causal_attention_mask)
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = 'MixtralConfig'
def load_balancing_loss_func(
gate_logits: torch.Tensor,
num_experts: Optional[int] = None,
top_k=2,
attention_mask: Optional[torch.Tensor] = None) -> float:
r"""
Computes auxiliary load balancing loss as in Switch Transformer - implemented in PyTorch.
See Switch Transformer (https://arxiv.org/abs/2101.03961) for more details. This function implements the loss
function presented in equations (4) - (6) of the paper. It aims at penalizing cases where the routing between
experts is too unbalanced.
Args:
gate_logits (Union[`torch.Tensor`, Tuple[torch.Tensor]]):
Logits from the `gate`, should be a tuple of model.config.num_hidden_layers tensors of
shape [batch_size X sequence_length, num_experts].
attention_mask (`torch.Tensor`, None):
The attention_mask used in forward function
shape [batch_size X sequence_length] if not None.
num_experts (`int`, *optional*):
Number of experts
Returns:
The auxiliary loss.
"""
if gate_logits is None or not isinstance(gate_logits, tuple):
return 0
if isinstance(gate_logits, tuple):
compute_device = gate_logits[0].device
concatenated_gate_logits = torch.cat(
[layer_gate.to(compute_device) for layer_gate in gate_logits],
dim=0)
routing_weights = torch.nn.functional.softmax(
concatenated_gate_logits, dim=-1)
_, selected_experts = torch.topk(routing_weights, top_k, dim=-1)
expert_mask = torch.nn.functional.one_hot(selected_experts, num_experts)
if attention_mask is None:
# Compute the percentage of tokens routed to each expert
tokens_per_expert = torch.mean(expert_mask.float(), dim=0)
# Compute the average probability of routing to these experts
router_prob_per_expert = torch.mean(routing_weights, dim=0)
else:
batch_size, sequence_length = attention_mask.shape
num_hidden_layers = concatenated_gate_logits.shape[0] // (
batch_size * sequence_length)
# Compute the mask that masks all padding tokens as 0 with the same shape of expert_mask
expert_attention_mask = (
attention_mask[None, :, :, None, None].expand(
(num_hidden_layers, batch_size, sequence_length, top_k,
num_experts)).reshape(-1, top_k,
num_experts).to(compute_device))
# Compute the percentage of tokens routed to each expert
tokens_per_expert = torch.sum(
expert_mask.float() * expert_attention_mask, dim=0) / torch.sum(
expert_attention_mask, dim=0)
# Compute the mask that masks all padding tokens as 0 with the same shape of tokens_per_expert
router_per_expert_attention_mask = (
attention_mask[None, :, :, None].expand(
(num_hidden_layers, batch_size, sequence_length,
num_experts)).reshape(-1, num_experts).to(compute_device))
# Compute the average probability of routing to these experts
router_prob_per_expert = torch.sum(
routing_weights * router_per_expert_attention_mask,
dim=0) / torch.sum(
router_per_expert_attention_mask, dim=0)
overall_loss = torch.sum(tokens_per_expert *
router_prob_per_expert.unsqueeze(0))
return overall_loss * num_experts
# Copied from transformers.models.llama.modeling_llama._get_unpad_data
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(
torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->Mixtral
class MixtralRMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""MixtralRMSNorm is equivalent to T5LayerNorm."""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance +
self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
# Copied from transformers.models.mistral.modeling_mistral.MistralRotaryEmbedding with Mistral->Mixtral
class MixtralRotaryEmbedding(nn.Module):
def __init__(self,
dim,
max_position_embeddings=2048,
base=10000,
device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (
self.base
**(torch.arange(0, self.dim, 2,
dtype=torch.int64).float().to(device) / self.dim))
self.register_buffer('inv_freq', inv_freq, persistent=False)
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings,
device=self.inv_freq.device,
dtype=torch.get_default_dtype())
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(
self.max_seq_len_cached, device=device,
dtype=torch.int64).type_as(self.inv_freq)
freqs = torch.outer(t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer(
'cos_cached', emb.cos().to(dtype), persistent=False)
self.register_buffer(
'sin_cached', emb.sin().to(dtype), persistent=False)
def forward(self, x, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if seq_len > self.max_seq_len_cached:
self._set_cos_sin_cache(
seq_len=seq_len, device=x.device, dtype=x.dtype)
return (
self.cos_cached[:seq_len].to(dtype=x.dtype),
self.sin_cached[:seq_len].to(dtype=x.dtype),
)
# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., :x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2:]
return torch.cat((-x2, x1), dim=-1)
# Copied from transformers.models.mistral.modeling_mistral.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors.
Args:
q (`torch.Tensor`): The query tensor.
k (`torch.Tensor`): The key tensor.
cos (`torch.Tensor`): The cosine part of the rotary embedding.
sin (`torch.Tensor`): The sine part of the rotary embedding.
position_ids (`torch.Tensor`):
The position indices of the tokens corresponding to the query and key tensors. For example, this can be
used to pass offsetted position ids when working with a KV-cache.
unsqueeze_dim (`int`, *optional*, defaults to 1):
The 'unsqueeze_dim' argument specifies the dimension along which to unsqueeze cos[position_ids] and
sin[position_ids] so that they can be properly broadcasted to the dimensions of q and k. For example, note
that cos[position_ids] and sin[position_ids] have the shape [batch_size, seq_len, head_dim]. Then, if q and
k have the shape [batch_size, heads, seq_len, head_dim], then setting unsqueeze_dim=1 makes
cos[position_ids] and sin[position_ids] broadcastable to the shapes of q and k. Similarly, if q and k have
the shape [batch_size, seq_len, heads, head_dim], then set unsqueeze_dim=2.
Returns:
`tuple(torch.Tensor)` comprising the query and key tensors rotated using the Rotary Position Embedding.
"""
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""This is the equivalent of torch.repeat_interleave(x, dim=1,
repeats=n_rep).
The hidden states go from (batch, num_key_value_heads, seqlen, head_dim) to
(batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :,
None, :, :].expand(batch,
num_key_value_heads,
n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen,
head_dim)
# Copied from transformers.models.mistral.modeling_mistral.MistralAttention with Mistral->Mixtral
class MixtralAttention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper.
Modified to use sliding window attention: Longformer and "Generating Long
Sequences with Sparse Transformers".
"""
def __init__(self, config: MixtralConfig, layer_idx: Optional[int] = None):
super().__init__()
self.config = config
self.layer_idx = layer_idx
if layer_idx is None:
logger.warning_once(
f'Instantiating {self.__class__.__name__} without passing a `layer_idx` is not recommended and will '
'lead to errors during the forward call if caching is used. Please make sure to provide a `layer_idx` '
'when creating this class.')
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.rope_theta = config.rope_theta
self.is_causal = True
self.attention_dropout = config.attention_dropout
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(
f'hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}'
f' and `num_heads`: {self.num_heads}).')
self.q_proj = nn.Linear(
self.hidden_size, self.num_heads * self.head_dim, bias=False)
self.k_proj = nn.Linear(
self.hidden_size,
self.num_key_value_heads * self.head_dim,
bias=False)
self.v_proj = nn.Linear(
self.hidden_size,
self.num_key_value_heads * self.head_dim,
bias=False)
self.o_proj = nn.Linear(
self.num_heads * self.head_dim, self.hidden_size, bias=False)
self.rotary_emb = MixtralRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.rope_theta,
)
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return tensor.view(bsz, seq_len, self.num_heads,
self.head_dim).transpose(1, 2).contiguous()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f'The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(
kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(
2, 3)) / math.sqrt(self.head_dim)
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
raise ValueError(
f'Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is'
f' {attn_weights.size()}')
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
)
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(
attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_weights = nn.functional.dropout(
attn_weights, p=self.attention_dropout, training=self.training)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f'`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is'
f' {attn_output.size()}')
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
# Copied from transformers.models.mistral.modeling_mistral.MistralFlashAttention2 with Mistral->Mixtral
class MixtralFlashAttention2(MixtralAttention):
"""Mixtral flash attention module.
This module inherits from `MixtralAttention` as the weights of the module
stay untouched. The only required change would be on the forward pass
where it needs to correctly call the public API of flash attention and deal
with padding tokens in case the input contains any of them.
"""
# Copied from transformers.models.llama.modeling_llama.LlamaFlashAttention2.__init__
def __init__(self, *args, **kwargs):
super().__init__(*args, **kwargs)
# TODO: Should be removed once Flash Attention for ROCm is bumped to 2.1.
# flash_attn<2.1 generates top-left aligned causal mask, while what is needed here is bottom-right alignment, that was made default for flash_attn>=2.1. This attribute is used to handle this difference. Reference: https://github.com/Dao-AILab/flash-attention/releases/tag/v2.1.0.
# Beware that with flash_attn<2.1, using q_seqlen != k_seqlen (except for the case q_seqlen == 1) produces a wrong mask (top-left).
self._flash_attn_uses_top_left_mask = not is_flash_attn_greater_or_equal_2_10(
)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
):
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
if self.layer_idx is None:
raise ValueError(
f'The cache structure has changed since version v4.36. If you are using {self.__class__.__name__} '
'for auto-regressive decoding with k/v caching, please make sure to initialize the attention class '
'with a layer index.')
kv_seq_len += past_key_value.get_usable_length(
kv_seq_len, self.layer_idx)
# Because the input can be padded, the absolute sequence length depends on the max position id.
rotary_seq_len = max(kv_seq_len, position_ids[:, -1].max().item()) + 1
cos, sin = self.rotary_emb(value_states, seq_len=rotary_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
use_sliding_windows = (
_flash_supports_window_size
and getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window)
if not _flash_supports_window_size:
logger.warning_once(
'The current flash attention version does not support sliding window attention, for a more memory efficient implementation'
' make sure to upgrade flash-attn library.')
if past_key_value is not None:
# Activate slicing cache only if the config has a valid `sliding_window` attribute
cache_has_contents = past_key_value.get_seq_length(
self.layer_idx) > 0
if (getattr(self.config, 'sliding_window', None) is not None
and kv_seq_len > self.config.sliding_window
and cache_has_contents):
slicing_tokens = 1 - self.config.sliding_window
past_key = past_key_value[self.layer_idx][0]
past_value = past_key_value[self.layer_idx][1]
past_key = past_key[:, :, slicing_tokens:, :].contiguous()
past_value = past_value[:, :, slicing_tokens:, :].contiguous()
if past_key.shape[-2] != self.config.sliding_window - 1:
raise ValueError(
f'past key must have a shape of (`batch_size, num_heads, self.config.sliding_window-1, head_dim`), got'
f' {past_key.shape}')
if attention_mask is not None:
attention_mask = attention_mask[:, slicing_tokens:]
attention_mask = torch.cat([
attention_mask,
torch.ones_like(attention_mask[:, -1:])
],
dim=-1)
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
# repeat k/v heads if n_kv_heads < n_heads
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
dropout_rate = 0.0 if not self.training else self.attention_dropout
# In PEFT, we usually cast the layer norms to float32 for training stability reasons,
# so the input hidden states get silently cast to float32. Hence, we need to
# cast them back to the expected half-precision dtype so everything works as expected.
input_dtype = query_states.dtype
if input_dtype == torch.float32:
if torch.is_autocast_enabled():
target_dtype = torch.get_autocast_gpu_dtype()
# Handle the case where the model is quantized
elif hasattr(self.config, '_pre_quantization_dtype'):
target_dtype = self.config._pre_quantization_dtype
else:
target_dtype = self.q_proj.weight.dtype
logger.warning_once(
f'The input hidden states seem to have been silently cast to float32; this might be related to'
f' the fact that you have upcast embedding or layer norm layers to float32. We will cast the input back to'
f' {target_dtype}.')
query_states = query_states.to(target_dtype)
key_states = key_states.to(target_dtype)
value_states = value_states.to(target_dtype)
# Reshape to the expected shape for Flash Attention
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
attn_output = self._flash_attention_forward(
query_states,
key_states,
value_states,
attention_mask,
q_len,
dropout=dropout_rate,
use_sliding_windows=use_sliding_windows,
)
attn_output = attn_output.reshape(bsz, q_len,
self.hidden_size).contiguous()
attn_output = self.o_proj(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(
self,
query_states,
key_states,
value_states,
attention_mask,
query_length,
dropout=0.0,
softmax_scale=None,
use_sliding_windows=False,
):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention scores and pads the final attention scores.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
dropout (`float`):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
use_sliding_windows (`bool`, *optional*):
Whether to activate sliding window attention.
"""
if not self._flash_attn_uses_top_left_mask:
causal = self.is_causal
else:
# TODO: Remove the `query_length != 1` check once Flash Attention for ROCm is bumped to 2.1. For details, please see the comment in LlamaFlashAttention2 __init__.
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._upad_input(
query_states, key_states, value_states, attention_mask,
query_length)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
if not use_sliding_windows:
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
else:
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
window_size=(self.config.sliding_window,
self.config.sliding_window),
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size,
query_length)
else:
if not use_sliding_windows:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
)
else:
attn_output = flash_attn_func(
query_states,
key_states,
value_states,
dropout,
softmax_scale=softmax_scale,
causal=causal,
window_size=(self.config.sliding_window,
self.config.sliding_window),
)
return attn_output
def _upad_input(self, query_layer, key_layer, value_layer, attention_mask,
query_length):
batch_size, kv_seq_len, num_heads, head_dim = key_layer.shape
# On the first iteration we need to properly re-create the padding mask
# by slicing it at the proper place
if kv_seq_len != attention_mask.shape[-1]:
attention_mask_num_tokens = attention_mask.shape[-1]
attention_mask = attention_mask[:, attention_mask_num_tokens -
kv_seq_len:]
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(
attention_mask)
key_layer = index_first_axis(
key_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim),
indices_k)
value_layer = index_first_axis(
value_layer.reshape(batch_size * kv_seq_len, num_heads, head_dim),
indices_k)
if query_length == kv_seq_len:
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, num_heads,
head_dim), indices_k)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(
query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q,
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
# Copied from transformers.models.mistral.modeling_mistral.MistralSdpaAttention with Mistral->Mixtral
class MixtralSdpaAttention(MixtralAttention):
"""Mixtral attention module using
torch.nn.functional.scaled_dot_product_attention.
This module inherits from `MixtralAttention` as the weights of the module
stay untouched. The only changes are on the forward pass to adapt to SDPA
API.
"""
# Adapted from MixtralAttention.forward
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Cache] = None,
output_attentions: bool = False,
use_cache: bool = False,
) -> Tuple[torch.Tensor, Optional[torch.Tensor],
Optional[Tuple[torch.Tensor]]]:
if output_attentions:
# TODO: Improve this warning with e.g. `model.config.attn_implementation = "manual"` once this is implemented.
logger.warning_once(
'MixtralModel is using MixtralSdpaAttention, but `torch.nn.functional.scaled_dot_product_attention` does not support `output_attentions=True`. Falling back to the manual attention implementation, '
'but specifying the manual implementation will be required from Transformers version v5.0.0 onwards. This warning can be removed using the argument `attn_implementation="eager"` when loading the model.'
)
return super().forward(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
bsz, q_len, _ = hidden_states.size()
query_states = self.q_proj(hidden_states)
key_states = self.k_proj(hidden_states)
value_states = self.v_proj(hidden_states)
query_states = query_states.view(bsz, q_len, self.num_heads,
self.head_dim).transpose(1, 2)
key_states = key_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
value_states = value_states.view(bsz, q_len, self.num_key_value_heads,
self.head_dim).transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value.get_usable_length(
kv_seq_len, self.layer_idx)
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(
query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
cache_kwargs = {'sin': sin, 'cos': cos} # Specific to RoPE models
key_states, value_states = past_key_value.update(
key_states, value_states, self.layer_idx, cache_kwargs)
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f'Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}'
)
# SDPA with memory-efficient backend is currently (torch==2.1.2) bugged with non-contiguous inputs with custom attn_mask,
# Reference: https://github.com/pytorch/pytorch/issues/112577.
if query_states.device.type == 'cuda' and attention_mask is not None:
query_states = query_states.contiguous()
key_states = key_states.contiguous()
value_states = value_states.contiguous()
attn_output = torch.nn.functional.scaled_dot_product_attention(
query_states,
key_states,
value_states,
attn_mask=attention_mask,
dropout_p=self.attention_dropout if self.training else 0.0,
# The q_len > 1 is necessary to match with AttentionMaskConverter.to_causal_4d that does not create a causal mask in case q_len == 1.
is_causal=self.is_causal and attention_mask is None and q_len > 1,
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.view(bsz, q_len, self.hidden_size)
attn_output = self.o_proj(attn_output)
return attn_output, None, past_key_value
MIXTRAL_ATTENTION_CLASSES = {
'eager': MixtralAttention,
'flash_attention_2': MixtralFlashAttention2,
'sdpa': MixtralSdpaAttention,
}
class MixtralBlockSparseTop2MLP(nn.Module):
def __init__(self, config: MixtralConfig):
super().__init__()
self.ffn_dim = config.intermediate_size
self.hidden_dim = config.hidden_size
self.w1 = nn.Linear(self.hidden_dim, self.ffn_dim, bias=False)
self.w2 = nn.Linear(self.ffn_dim, self.hidden_dim, bias=False)
self.w3 = nn.Linear(self.hidden_dim, self.ffn_dim, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, hidden_states):
current_hidden_states = self.act_fn(
self.w1(hidden_states)) * self.w3(hidden_states)
current_hidden_states = self.w2(current_hidden_states)
return current_hidden_states
class MixtralSparseMoeBlock(nn.Module):
"""This implementation is strictly equivalent to standard MoE with full
capacity (no dropped tokens).
It's faster since it formulates MoE operations in terms of block-sparse
operations to accommodate imbalanced assignments of tokens to experts,
whereas standard MoE either (1) drops tokens at the cost of reduced
performance or (2) sets capacity factor to number of experts and thus wastes
computation and memory on padding.
"""
def __init__(self, config):
super().__init__()
self.hidden_dim = config.hidden_size
self.ffn_dim = config.intermediate_size
self.num_experts = config.num_local_experts
self.top_k = config.num_experts_per_tok
# gating
self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)
self.experts = nn.ModuleList([
MixtralBlockSparseTop2MLP(config) for _ in range(self.num_experts)
])
# Jitter parameters
self.jitter_noise = config.router_jitter_noise
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
""""""
batch_size, sequence_length, hidden_dim = hidden_states.shape
if self.training and self.jitter_noise > 0:
hidden_states *= torch.empty_like(hidden_states).uniform_(
1.0 - self.jitter_noise, 1.0 + self.jitter_noise)
hidden_states = hidden_states.view(-1, hidden_dim)
# router_logits: (batch * sequence_length, n_experts)
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(
routing_weights, self.top_k, dim=-1)
routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
# we cast back to the input dtype
routing_weights = routing_weights.to(hidden_states.dtype)
final_hidden_states = torch.zeros(
(batch_size * sequence_length, hidden_dim),
dtype=hidden_states.dtype,
device=hidden_states.device)
# One hot encode the selected experts to create an expert mask
# this will be used to easily index which expert is going to be solicited
expert_mask = torch.nn.functional.one_hot(
selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
# Loop over all available experts in the model and perform the computation on each expert
for expert_idx in range(self.num_experts):
expert_layer = self.experts[expert_idx]
idx, top_x = torch.where(expert_mask[expert_idx])
# Index the correct hidden states and compute the expert hidden state for
# the current expert. We need to make sure to multiply the output hidden
# states by `routing_weights` on the corresponding tokens (top-1 and top-2)
current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
current_hidden_states = expert_layer(
current_state) * routing_weights[top_x, idx, None]
# However `index_add_` only supports torch tensors for indexing so we'll use
# the `top_x` tensor here.
final_hidden_states.index_add_(
0, top_x, current_hidden_states.to(hidden_states.dtype))
final_hidden_states = final_hidden_states.reshape(
batch_size, sequence_length, hidden_dim)
return final_hidden_states, router_logits
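# Illustrative sketch (not part of the upstream module): the routing step of
# `MixtralSparseMoeBlock.forward` for a single token, written in plain Python
# so the softmax, top-k selection and renormalization are easy to follow.
# The helper name `route_token_sketch` and its list-based inputs are
# assumptions made for this example only; the real block works on tensors.
import math


def route_token_sketch(router_logits, top_k):
    """Return (expert_indices, weights) for one token's router logits."""
    # Softmax over all experts (max-subtracted for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the `top_k` most probable experts ...
    chosen = sorted(
        range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    # ... and renormalize their probabilities so they sum to 1.
    kept = sum(probs[i] for i in chosen)
    return chosen, [probs[i] / kept for i in chosen]


# Usage: route_token_sketch([0.0, 2.0, 1.0, -1.0], top_k=2) selects experts
# 1 and 2, with weights of roughly 0.73 and 0.27.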
class ExpertShard(nn.Module):
def __init__(self, config, expert_in_one_shard=1):
super().__init__()
self.w1w3 = nn.Parameter(
torch.empty(expert_in_one_shard, config.intermediate_size * 2,
config.hidden_size))
self.w2 = nn.Parameter(
torch.empty(expert_in_one_shard, config.hidden_size,
config.intermediate_size))
self.act = ACT2FN[config.hidden_act]
self.expert_in_one_shard = expert_in_one_shard
def forward(self, hidden_states, expert_mask, routing_weights,
final_hidden_states):
hidden_dim = hidden_states.shape[-1]
for expert_idx in range(self.expert_in_one_shard):
idx, top_x = torch.where(expert_mask[expert_idx])
current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
w1w3 = self.w1w3[expert_idx]
w2 = self.w2[expert_idx]
gate_up_out = torch.matmul(current_state, w1w3.T)
gate_out, up_out = gate_up_out.chunk(2, dim=-1)
gate_out = self.act(gate_out)
out = gate_out * up_out
out = torch.matmul(out, w2.T)
current_hidden_states = out * routing_weights[top_x, idx, None]
final_hidden_states.index_add_(
0, top_x, current_hidden_states.to(hidden_states.dtype))
return final_hidden_states
class MixtralSparseShardMoeBlock(nn.Module):
def __init__(self, config):
super().__init__()
self.hidden_dim = config.hidden_size
self.ffn_dim = config.intermediate_size
self.num_experts = config.num_local_experts
self.top_k = config.num_experts_per_tok
# gating
self.gate = nn.Linear(self.hidden_dim, self.num_experts, bias=False)
expert_in_one_shard = config.expert_in_one_shard
assert config.num_local_experts % expert_in_one_shard == 0, \
('num_local_experts should be divisible by expert_in_one_shard, but got '
f'num_local_experts = {config.num_local_experts} and expert_in_one_shard = {expert_in_one_shard}')
self.shard_num = config.num_local_experts // expert_in_one_shard
self.expert_in_one_shard = expert_in_one_shard
self.experts = nn.ModuleList([
ExpertShard(config, self.expert_in_one_shard)
for i in range(self.shard_num)
])
# Jitter parameters
self.jitter_noise = config.router_jitter_noise
def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
""""""
batch_size, sequence_length, hidden_dim = hidden_states.shape
if self.training and self.jitter_noise > 0:
hidden_states *= torch.empty_like(hidden_states).uniform_(
1.0 - self.jitter_noise, 1.0 + self.jitter_noise)
hidden_states = hidden_states.view(-1, hidden_dim)
# router_logits: (batch * sequence_length, n_experts)
router_logits = self.gate(hidden_states)
routing_weights = F.softmax(router_logits, dim=1, dtype=torch.float)
routing_weights, selected_experts = torch.topk(
routing_weights, self.top_k, dim=-1)
routing_weights /= routing_weights.sum(dim=-1, keepdim=True)
# we cast back to the input dtype
routing_weights = routing_weights.to(hidden_states.dtype)
final_hidden_states = torch.zeros(
(batch_size * sequence_length, hidden_dim),
dtype=hidden_states.dtype,
device=hidden_states.device)
# One-hot encode the selected experts to create an expert mask;
# it is used to look up which tokens are routed to each expert
expert_mask = torch.nn.functional.one_hot(
selected_experts, num_classes=self.num_experts).permute(2, 1, 0)
# Loop over all expert shards; each shard runs its own experts on the tokens routed to them
for shard_index in range(self.shard_num):
mask = expert_mask[shard_index *
self.expert_in_one_shard:(shard_index + 1) *
self.expert_in_one_shard]
final_hidden_states = self.experts[shard_index](
hidden_states, mask, routing_weights, final_hidden_states)
final_hidden_states = final_hidden_states.reshape(
batch_size, sequence_length, hidden_dim)
return final_hidden_states, router_logits
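# Illustrative sketch (not part of the upstream module): how the 'shard'
# implementation above groups consecutive expert indices into shards, which
# is what the slice `expert_mask[shard_index * n:(shard_index + 1) * n]`
# relies on. The helper name `shard_expert_ranges_sketch` is an assumption
# made for this example only.
def shard_expert_ranges_sketch(num_local_experts, expert_in_one_shard):
    """Return one `range` of global expert indices per shard."""
    if num_local_experts % expert_in_one_shard != 0:
        raise ValueError(
            'num_local_experts should be divisible by expert_in_one_shard')
    shard_num = num_local_experts // expert_in_one_shard
    return [
        range(s * expert_in_one_shard, (s + 1) * expert_in_one_shard)
        for s in range(shard_num)
    ]


# Usage: with 8 experts and 4 experts per shard, shard 0 holds experts 0-3
# and shard 1 holds experts 4-7.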
class MixtralDecoderLayer(nn.Module):
def __init__(self, config: MixtralConfig, layer_idx: int):
super().__init__()
self.hidden_size = config.hidden_size
self.self_attn = MIXTRAL_ATTENTION_CLASSES[
config._attn_implementation](config, layer_idx)
moe_implementation = config.moe_implementation
if moe_implementation == 'origin':
block = MixtralSparseMoeBlock
elif moe_implementation == 'shard':
block = MixtralSparseShardMoeBlock
else:
raise NotImplementedError(
f"Unsupported moe_implementation: {moe_implementation!r}; expected 'origin' or 'shard'")
self.block_sparse_moe = block(config)
self.input_layernorm = MixtralRMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
self.post_attention_layernorm = MixtralRMSNorm(
config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
output_router_logits: Optional[bool] = False,
use_cache: Optional[bool] = False,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor,
torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*): attention mask of size
`(batch, sequence_length)` where padding elements are indicated by 0.
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
output_router_logits (`bool`, *optional*):
Whether or not to return the logits of all the routers. They are useful for computing the router loss, and
should not be returned during inference.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
"""
residual = hidden_states
hidden_states = self.input_layernorm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.self_attn(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states, router_logits = self.block_sparse_moe(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states, )
if output_attentions:
outputs += (self_attn_weights, )
if use_cache:
outputs += (present_key_value, )
if output_router_logits:
outputs += (router_logits, )
return outputs
def _load_pretrained_model(
cls,
model,
state_dict,
loaded_keys,
resolved_archive_file,
pretrained_model_name_or_path,
ignore_mismatched_sizes=False,
sharded_metadata=None,
_fast_init=True,
low_cpu_mem_usage=False,
device_map=None,
offload_folder=None,
offload_state_dict=None,
dtype=None,
hf_quantizer=None,
keep_in_fp32_modules=None,
gguf_path=None,
):
if ((state_dict is not None) or (resolved_archive_file is None)
or (low_cpu_mem_usage) or (device_map is not None)
or (offload_folder is not None) or
(not (offload_state_dict is None or offload_state_dict is False))
or (hf_quantizer is not None) or
(keep_in_fp32_modules is not None and len(keep_in_fp32_modules) > 0)
or (gguf_path is not None)):
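raise NotImplementedError(
'The sharded MoE loader only supports loading from resolved checkpoint files, '
'without `state_dict`, `low_cpu_mem_usage`, `device_map`, offloading, '
'quantization, `keep_in_fp32_modules`, or GGUF.')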
folder = os.path.sep.join(resolved_archive_file[0].split(os.path.sep)[:-1])
error_msgs = load_state_dict_into_model(model, folder)
return model, [], [], [], None, error_msgs
MIXTRAL_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`MixtralConfig`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
@add_start_docstrings(
'The bare Mixtral Model outputting raw hidden-states without any specific head on top.',
MIXTRAL_START_DOCSTRING,
)
# Copied from transformers.models.mistral.modeling_mistral.MistralPreTrainedModel with Mistral->Mixtral
class MixtralPreTrainedModel(PreTrainedModel):
config_class = MixtralConfig
base_model_prefix = 'model'
supports_gradient_checkpointing = True
_no_split_modules = ['MixtralDecoderLayer']
_skip_keys_device_placement = 'past_key_values'
_supports_flash_attn_2 = True
_supports_sdpa = True
_supports_cache_class = True
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
moe_implementation = kwargs.get('moe_implementation', 'origin')
if moe_implementation == 'origin':
return super().from_pretrained(pretrained_model_name_or_path,
*args, **kwargs)
cls._load_pretrained_model = types.MethodType(_load_pretrained_model,
cls)
return super().from_pretrained(pretrained_model_name_or_path, *args,
**kwargs)
MIXTRAL_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.n_positions - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed hidden-states (key and values in the self-attention blocks) that can be used (see
`past_key_values` input) to speed up sequential decoding.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that
don't have their past key value states given to this model) of shape `(batch_size, 1)` instead of all
`input_ids` of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
output_router_logits (`bool`, *optional*):
Whether or not to return the logits of all the routers. They are useful for computing the router loss, and
should not be returned during inference.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
@add_start_docstrings(
'The bare Mixtral Model outputting raw hidden-states without any specific head on top.',
MIXTRAL_START_DOCSTRING,
)
# Copied from transformers.models.mistral.modeling_mistral.MistralModel with MISTRAL->MIXTRAL,Mistral->Mixtral
class MixtralModel(MixtralPreTrainedModel):
"""Transformer decoder consisting of *config.num_hidden_layers* layers.
Each layer is a [`MixtralDecoderLayer`]
Args:
config: MixtralConfig
"""
def __init__(self, config: MixtralConfig):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size,
self.padding_idx)
self.layers = nn.ModuleList([
MixtralDecoderLayer(config, layer_idx)
for layer_idx in range(config.num_hidden_layers)
])
self._attn_implementation = config._attn_implementation
self.norm = MixtralRMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.embed_tokens
def set_input_embeddings(self, value):
self.embed_tokens = value
# Ignore copy
@add_start_docstrings_to_model_forward(MIXTRAL_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
output_router_logits: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, MoeModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_router_logits = (
output_router_logits if output_router_logits is not None else
self.config.output_router_logits)
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# retrieve input_ids and inputs_embeds
if input_ids is not None and inputs_embeds is not None:
raise ValueError(
'You cannot specify both input_ids and inputs_embeds at the same time'
)
elif input_ids is not None:
batch_size, seq_length = input_ids.shape
elif inputs_embeds is not None:
batch_size, seq_length, _ = inputs_embeds.shape
else:
raise ValueError(
'You have to specify either input_ids or inputs_embeds'
)
past_key_values_length = 0
if self.gradient_checkpointing and self.training:
if use_cache:
logger.warning_once(
'`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`...'
)
use_cache = False
if use_cache:
use_legacy_cache = not isinstance(past_key_values, Cache)
if use_legacy_cache:
past_key_values = DynamicCache.from_legacy_cache(
past_key_values)
past_key_values_length = past_key_values.get_usable_length(
seq_length)
if position_ids is None:
device = input_ids.device if input_ids is not None else inputs_embeds.device
position_ids = torch.arange(
past_key_values_length,
seq_length + past_key_values_length,
dtype=torch.long,
device=device)
position_ids = position_ids.unsqueeze(0).view(-1, seq_length)
else:
position_ids = position_ids.view(-1, seq_length).long()
if inputs_embeds is None:
inputs_embeds = self.embed_tokens(input_ids)
if attention_mask is not None and self._attn_implementation == 'flash_attention_2' and use_cache:
is_padding_right = attention_mask[:, -1].sum().item() != batch_size
if is_padding_right:
raise ValueError(
"You are attempting to perform batched generation with padding_side='right'"
' this may lead to unexpected behaviour for Flash Attention version of Mixtral. Make sure to '
" call `tokenizer.padding_side = 'left'` before tokenizing the input. "
)
if self._attn_implementation == 'flash_attention_2':
# 2d mask is passed through the layers
attention_mask = attention_mask if (
attention_mask is not None and 0 in attention_mask) else None
elif self._attn_implementation == 'sdpa' and not output_attentions:
# output_attentions=True can not be supported when using SDPA, and we fall back on
# the manual implementation that requires a 4D causal mask in all cases.
attention_mask = _prepare_4d_causal_attention_mask_for_sdpa(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
sliding_window=self.config.sliding_window,
)
else:
# 4d mask is passed through the layers
attention_mask = _prepare_4d_causal_attention_mask(
attention_mask,
(batch_size, seq_length),
inputs_embeds,
past_key_values_length,
sliding_window=self.config.sliding_window,
)
hidden_states = inputs_embeds
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
all_router_logits = () if output_router_logits else None
next_decoder_cache = None
for decoder_layer in self.layers:
if output_hidden_states:
all_hidden_states += (hidden_states, )
if self.gradient_checkpointing and self.training:
layer_outputs = self._gradient_checkpointing_func(
decoder_layer.__call__,
hidden_states,
attention_mask,
position_ids,
past_key_values,
output_attentions,
output_router_logits,
use_cache,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_values,
output_attentions=output_attentions,
output_router_logits=output_router_logits,
use_cache=use_cache,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache = layer_outputs[
2 if output_attentions else 1]
if output_attentions:
all_self_attns += (layer_outputs[1], )
if output_router_logits:
all_router_logits += (layer_outputs[-1], )
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states, )
next_cache = None
if use_cache:
next_cache = next_decoder_cache.to_legacy_cache(
) if use_legacy_cache else next_decoder_cache
if not return_dict:
return tuple(v for v in [
hidden_states, next_cache, all_hidden_states, all_self_attns,
all_router_logits
] if v is not None)
return MoeModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
router_logits=all_router_logits,
)
class MixtralForCausalLM(MixtralPreTrainedModel):
_tied_weights_keys = ['lm_head.weight']
def __init__(self, config):
super().__init__(config)
self.model = MixtralModel(config)
self.vocab_size = config.vocab_size
self.lm_head = nn.Linear(
config.hidden_size, config.vocab_size, bias=False)
self.router_aux_loss_coef = config.router_aux_loss_coef
self.num_experts = config.num_local_experts
self.num_experts_per_tok = config.num_experts_per_tok
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
def get_output_embeddings(self):
return self.lm_head
def set_output_embeddings(self, new_embeddings):
self.lm_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(MIXTRAL_INPUTS_DOCSTRING)
@replace_return_docstrings(
output_type=MoeCausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
# Ignore copy
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
output_router_logits: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, MoeCausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the causal language modeling (next-token prediction) loss. Indices should either be in
`[0, ..., config.vocab_size - 1]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size - 1]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, MixtralForCausalLM
>>> model = MixtralForCausalLM.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B-v0.1")
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_router_logits = (
output_router_logits if output_router_logits is not None else
self.config.output_router_logits)
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else
self.config.output_hidden_states)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
output_router_logits=output_router_logits,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.lm_head(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
aux_loss = None
if output_router_logits:
aux_loss = load_balancing_loss_func(
outputs.router_logits if return_dict else outputs[-1],
self.num_experts,
self.num_experts_per_tok,
attention_mask,
)
if labels is not None:
loss += self.router_aux_loss_coef * aux_loss.to(
loss.device) # make sure both losses are on the same device
if not return_dict:
output = (logits, ) + outputs[1:]
if output_router_logits:
output = (aux_loss, ) + output
return (loss, ) + output if loss is not None else output
return MoeCausalLMOutputWithPast(
loss=loss,
aux_loss=aux_loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
router_logits=outputs.router_logits,
)
def prepare_inputs_for_generation(
self,
input_ids,
past_key_values=None,
attention_mask=None,
inputs_embeds=None,
output_router_logits=False,
**kwargs,
):
# Omit tokens covered by past_key_values
if past_key_values is not None:
if isinstance(past_key_values, Cache):
cache_length = past_key_values.get_seq_length()
past_length = past_key_values.seen_tokens
max_cache_length = past_key_values.get_max_length()
else:
cache_length = past_length = past_key_values[0][0].shape[2]
max_cache_length = None
# Keep only the unprocessed tokens:
# 1 - If the length of the attention_mask exceeds the length of input_ids, then we are in a setting where
# some of the inputs are exclusively passed as part of the cache (e.g. when passing `inputs_embeds` as
# input)
if attention_mask is not None and attention_mask.shape[
1] > input_ids.shape[1]:
input_ids = input_ids[:, -(attention_mask.shape[1] -
past_length):]
# 2 - If the past_length is smaller than input_ids', then input_ids holds all input tokens. We can discard
# input_ids based on the past_length.
elif past_length < input_ids.shape[1]:
input_ids = input_ids[:, past_length:]
# 3 - Otherwise (past_length >= input_ids.shape[1]), let's assume input_ids only has unprocessed tokens.
# If we are about to go beyond the maximum cache length, we need to crop the input attention mask.
if (max_cache_length is not None and attention_mask is not None
and cache_length + input_ids.shape[1] > max_cache_length):
attention_mask = attention_mask[:, -max_cache_length:]
position_ids = kwargs.get('position_ids', None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1]:]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {'inputs_embeds': inputs_embeds}
else:
model_inputs = {'input_ids': input_ids}
model_inputs.update({
'position_ids': position_ids,
'past_key_values': past_key_values,
'use_cache': kwargs.get('use_cache'),
'attention_mask': attention_mask,
'output_router_logits': output_router_logits,
})
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (tuple(
past_state.index_select(0, beam_idx.to(past_state.device))
for past_state in layer_past), )
return reordered_past
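# Illustrative sketch (not part of the upstream module): how
# `prepare_inputs_for_generation` above derives `position_ids` from a 0/1
# attention mask, i.e. `attention_mask.long().cumsum(-1) - 1` followed by
# `masked_fill_(attention_mask == 0, 1)`. Plain Python lists stand in for
# tensors; the helper name `position_ids_from_mask_sketch` is an assumption
# made for this example only.
def position_ids_from_mask_sketch(attention_mask):
    """Return per-row position ids for a batch of 0/1 attention masks."""
    position_ids = []
    for row in attention_mask:
        running = 0  # cumulative count of attended tokens so far
        out = []
        for m in row:
            running += m
            # Attended tokens get their 0-based position; padding gets 1.
            out.append(running - 1 if m else 1)
        position_ids.append(out)
    return position_ids


# Usage: a left-padded row [0, 0, 1, 1, 1] maps to [1, 1, 0, 1, 2], so the
# first real token starts at position 0 regardless of the padding length.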
@add_start_docstrings(
"""
The Mixtral Model transformer with a sequence classification head on top (linear layer).
[`MixtralForSequenceClassification`] uses the last token in order to do the classification, as other causal models
(e.g. GPT-2) do.
Since it does classification on the last token, it requires to know the position of the last token. If a
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
each row of the batch).
""",
MIXTRAL_START_DOCSTRING,
)
# Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->Mixtral, LLAMA->MIXTRAL
class MixtralForSequenceClassification(MixtralPreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = MixtralModel(config)
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.embed_tokens
def set_input_embeddings(self, value):
self.model.embed_tokens = value
@add_start_docstrings_to_model_forward(MIXTRAL_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[Union[Cache,
List[torch.FloatTensor]]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1`, a regression loss is computed (mean-square loss). If
`config.num_labels > 1`, a classification loss is computed (cross-entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
transformer_outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
logits = self.score(hidden_states)
if input_ids is not None:
batch_size = input_ids.shape[0]
else:
batch_size = inputs_embeds.shape[0]
if self.config.pad_token_id is None and batch_size != 1:
raise ValueError(
'Cannot handle batch sizes > 1 if no padding token is defined.'
)
if self.config.pad_token_id is None:
sequence_lengths = -1
else:
if input_ids is not None:
# if no pad token found, use modulo instead of reverse indexing for ONNX compatibility
sequence_lengths = torch.eq(
input_ids, self.config.pad_token_id).int().argmax(-1) - 1
sequence_lengths = sequence_lengths % input_ids.shape[-1]
sequence_lengths = sequence_lengths.to(logits.device)
else:
sequence_lengths = -1
pooled_logits = logits[torch.arange(batch_size, device=logits.device),
sequence_lengths]
loss = None
if labels is not None:
labels = labels.to(logits.device)
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = 'regression'
elif self.num_labels > 1 and (labels.dtype == torch.long
or labels.dtype == torch.int):
self.config.problem_type = 'single_label_classification'
else:
self.config.problem_type = 'multi_label_classification'
if self.config.problem_type == 'regression':
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == 'single_label_classification':
loss_fct = CrossEntropyLoss()
loss = loss_fct(
pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == 'multi_label_classification':
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(pooled_logits, labels)
if not return_dict:
output = (pooled_logits, ) + transformer_outputs[1:]
return ((loss, ) + output) if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=pooled_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
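# Illustrative sketch (not part of the upstream module): how
# `MixtralForSequenceClassification.forward` above picks the token whose
# logits are pooled when `pad_token_id` is set. It mirrors
# `torch.eq(input_ids, pad).int().argmax(-1) - 1` followed by the modulo,
# for a single row given as a plain Python list. The helper name
# `last_token_index_sketch` is an assumption made for this example only.
def last_token_index_sketch(input_ids, pad_token_id):
    """Return the index of the token just before the first pad token."""
    seq_len = len(input_ids)
    # argmax over the "is pad" flags is the index of the first pad token,
    # or 0 when the row contains no pad token at all.
    first_pad = next(
        (i for i, tok in enumerate(input_ids) if tok == pad_token_id), 0)
    # Without padding this yields -1, which the modulo maps to the last index.
    return (first_pad - 1) % seq_len


# Usage: for the right-padded row [5, 6, 7, 0, 0] with pad_token_id=0 the
# pooled position is 2; for the unpadded row [5, 6, 7] it is also 2.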
================================================
FILE: xtuner-eval_niah/xtuner/model/utils.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
from typing import List, Optional
import torch
from mmengine.utils.misc import get_object_from_string
from peft import PeftType
from torch import nn
from transformers import PreTrainedModel
from xtuner.utils import IGNORE_INDEX, IMAGE_TOKEN_INDEX
def set_obj_dtype(d):
for key, value in d.items():
if value in ['torch.float16', 'torch.float32', 'torch.bfloat16']:
d[key] = getattr(torch, value.split('.')[-1])
def try_build_module(cfg):
builder = cfg['type']
if isinstance(builder, str):
builder = get_object_from_string(builder)
if builder is None:
# Support cfgs whose 'type' key cannot be resolved to a builder, such as
# {'rope_scaling': {'type': 'linear', 'factor': 2.0}}
return cfg
cfg.pop('type')
module_built = builder(**cfg)
return module_built
def traverse_dict(d):
if isinstance(d, dict):
set_obj_dtype(d)
for key, value in d.items():
if isinstance(value, dict):
traverse_dict(value)
if 'type' in value:
module_built = try_build_module(value)
d[key] = module_built
elif isinstance(d, list):
for element in d:
traverse_dict(element)
def find_all_linear_names(model):
lora_module_names = set()
for name, module in model.named_modules():
if isinstance(module, nn.Linear):
names = name.split('.')
lora_module_names.add(names[0] if len(names) == 1 else names[-1])
if 'lm_head' in lora_module_names: # needed for 16-bit
lora_module_names.remove('lm_head')
if 'output_layer' in lora_module_names: # needed for 16-bit
lora_module_names.remove('output_layer')
return list(lora_module_names)
class LoadWoInit:
"""Context manager that disables parameter initialization."""
def __init__(self):
self.constant_ = torch.nn.init.constant_
self.zeros_ = torch.nn.init.zeros_
self.ones_ = torch.nn.init.ones_
self.uniform_ = torch.nn.init.uniform_
self.normal_ = torch.nn.init.normal_
self.kaiming_uniform_ = torch.nn.init.kaiming_uniform_
self.kaiming_normal_ = torch.nn.init.kaiming_normal_
def __enter__(self, *args, **kwargs):
torch.nn.init.constant_ = lambda *args, **kwargs: None
torch.nn.init.zeros_ = lambda *args, **kwargs: None
torch.nn.init.ones_ = lambda *args, **kwargs: None
torch.nn.init.uniform_ = lambda *args, **kwargs: None
torch.nn.init.normal_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_uniform_ = lambda *args, **kwargs: None
torch.nn.init.kaiming_normal_ = lambda *args, **kwargs: None
def __exit__(self, *args, **kwargs):
torch.nn.init.constant_ = self.constant_
torch.nn.init.zeros_ = self.zeros_
torch.nn.init.ones_ = self.ones_
torch.nn.init.uniform_ = self.uniform_
torch.nn.init.normal_ = self.normal_
torch.nn.init.kaiming_uniform_ = self.kaiming_uniform_
torch.nn.init.kaiming_normal_ = self.kaiming_normal_
def get_peft_model_state_dict(model, state_dict=None, adapter_name='default'):
# Modified from `https://github.com/huggingface/peft/blob/main/src/peft/utils/save_and_load.py` # noqa: E501
config = model.peft_config[adapter_name]
if state_dict is None:
state_dict = model.state_dict()
if config.peft_type == PeftType.LORA:
# adapted from `https://github.com/microsoft/LoRA/blob/main/loralib/utils.py` # noqa: E501
# to be used directly with the state dict which is necessary
# when using DeepSpeed or FSDP
bias = config.bias
if bias == 'none':
to_return = {k: state_dict[k] for k in state_dict if 'lora_' in k}
elif bias == 'all':
to_return = {
k: state_dict[k]
for k in state_dict if 'lora_' in k or 'bias' in k
}
elif bias == 'lora_only':
to_return = {}
for k in state_dict:
if 'lora_' in k:
to_return[k] = state_dict[k]
bias_name = k.split('lora_')[0] + 'bias'
if bias_name in state_dict:
to_return[bias_name] = state_dict[bias_name]
else:
raise NotImplementedError
to_return = {
k: v
for k, v in to_return.items()
if (('lora_' in k and adapter_name in k) or ('bias' in k))
}
else:
# Currently we only support lora
raise NotImplementedError
if model.modules_to_save is not None:
for key, value in state_dict.items():
if any(f'{module_name}.modules_to_save.{adapter_name}' in key
for module_name in model.modules_to_save):
to_return[key] = value
return to_return
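# Illustrative sketch (not part of xtuner): the three `bias` modes above can be
# mimicked on a plain dict of key names. `select_lora_keys` is a hypothetical
# helper; it omits the adapter-name filter and the `modules_to_save` handling.

```python
def select_lora_keys(state_dict, bias='none'):
    """Pick LoRA entries (and optionally biases) from a flat state dict."""
    if bias == 'none':
        return {k: v for k, v in state_dict.items() if 'lora_' in k}
    if bias == 'all':
        return {
            k: v
            for k, v in state_dict.items() if 'lora_' in k or 'bias' in k
        }
    if bias == 'lora_only':
        selected = {}
        for k in state_dict:
            if 'lora_' in k:
                selected[k] = state_dict[k]
                # The bias that sits next to this LoRA weight, if any.
                bias_name = k.split('lora_')[0] + 'bias'
                if bias_name in state_dict:
                    selected[bias_name] = state_dict[bias_name]
        return selected
    raise NotImplementedError(bias)


# Made-up key names, used only for this example.
toy_state = {
    'q_proj.weight': 0,
    'q_proj.bias': 1,
    'q_proj.lora_A.default.weight': 2,
    'k_proj.bias': 3,
}
none_keys = sorted(select_lora_keys(toy_state, 'none'))
all_keys = sorted(select_lora_keys(toy_state, 'all'))
lora_only_keys = sorted(select_lora_keys(toy_state, 'lora_only'))
```

# With `lora_only`, `k_proj.bias` is dropped because no LoRA weight shares its
# prefix, while `q_proj.bias` is kept.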
# Modified from https://github.com/haotian-liu/LLaVA/blob/82fc5e0e5f4393a4c26851fa32c69ab37ea3b146/llava/model/llava_arch.py#L99 # noqa: E501
def prepare_inputs_labels_for_multimodal(
llm: PreTrainedModel,
input_ids: torch.LongTensor = None,
position_ids: Optional[torch.LongTensor] = None,
attention_mask: Optional[torch.Tensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
labels: Optional[torch.LongTensor] = None,
pixel_values: Optional[torch.FloatTensor] = None):
if pixel_values is None:
return {
'input_ids': input_ids,
'position_ids': position_ids,
'attention_mask': attention_mask,
'past_key_values': past_key_values,
'inputs_embeds': None,
'labels': labels
}
_labels = labels
_position_ids = position_ids
_attention_mask = attention_mask
if attention_mask is None:
attention_mask = torch.ones_like(input_ids, dtype=torch.bool)
else:
attention_mask = attention_mask.bool()
if position_ids is None:
position_ids = torch.arange(
0, input_ids.shape[1], dtype=torch.long, device=input_ids.device)
if labels is None:
labels = torch.full_like(input_ids, IGNORE_INDEX)
# remove the padding using attention_mask -- TODO: double check
input_ids = [
cur_input_ids[cur_attention_mask]
for cur_input_ids, cur_attention_mask in zip(input_ids, attention_mask)
]
labels = [
cur_labels[cur_attention_mask]
for cur_labels, cur_attention_mask in zip(labels, attention_mask)
]
new_inputs_embeds = []
new_labels = []
cur_image_idx = 0
for batch_idx, cur_input_ids in enumerate(input_ids):
num_images = (cur_input_ids == IMAGE_TOKEN_INDEX).sum()
if num_images == 0:
cur_pixel_values = pixel_values[cur_image_idx]
cur_inputs_embeds_1 = llm.get_input_embeddings()(cur_input_ids)
cur_inputs_embeds = torch.cat(
[cur_inputs_embeds_1, cur_pixel_values[0:0]], dim=0)
new_inputs_embeds.append(cur_inputs_embeds)
new_labels.append(labels[batch_idx])
cur_image_idx += 1
continue
image_token_indices = [-1] + torch.where(
cur_input_ids == IMAGE_TOKEN_INDEX)[0].tolist() + [
cur_input_ids.shape[0]
]
cur_input_ids_noim = []
cur_labels = labels[batch_idx]
cur_labels_noim = []
for i in range(len(image_token_indices) - 1):
cur_input_ids_noim.append(cur_input_ids[image_token_indices[i] +
1:image_token_indices[i +
1]])
cur_labels_noim.append(cur_labels[image_token_indices[i] +
1:image_token_indices[i + 1]])
split_sizes = [x.shape[0] for x in cur_labels_noim]
cur_inputs_embeds = llm.get_input_embeddings()(
torch.cat(cur_input_ids_noim))
cur_inputs_embeds_no_im = torch.split(
cur_inputs_embeds, split_sizes, dim=0)
cur_new_inputs_embeds = []
cur_new_labels = []
for i in range(num_images + 1):
cur_new_inputs_embeds.append(cur_inputs_embeds_no_im[i])
cur_new_labels.append(cur_labels_noim[i])
if i < num_images:
cur_pixel_values = pixel_values[cur_image_idx]
cur_image_idx += 1
cur_new_inputs_embeds.append(cur_pixel_values)
cur_new_labels.append(
torch.full((cur_pixel_values.shape[0], ),
IGNORE_INDEX,
device=cur_labels.device,
dtype=cur_labels.dtype))
cur_new_inputs_embeds = torch.cat(cur_new_inputs_embeds)
cur_new_labels = torch.cat(cur_new_labels)
new_inputs_embeds.append(cur_new_inputs_embeds)
new_labels.append(cur_new_labels)
# Combine them
max_len = max(x.shape[0] for x in new_inputs_embeds)
batch_size = len(new_inputs_embeds)
new_inputs_embeds_padded = []
new_labels_padded = torch.full((batch_size, max_len),
IGNORE_INDEX,
dtype=new_labels[0].dtype,
device=new_labels[0].device)
attention_mask = torch.zeros((batch_size, max_len),
dtype=attention_mask.dtype,
device=attention_mask.device)
position_ids = torch.zeros((batch_size, max_len),
dtype=position_ids.dtype,
device=position_ids.device)
for i, (cur_new_embed,
cur_new_labels) in enumerate(zip(new_inputs_embeds, new_labels)):
cur_len = cur_new_embed.shape[0]
new_inputs_embeds_padded.append(
torch.cat((cur_new_embed,
torch.zeros((max_len - cur_len, cur_new_embed.shape[1]),
dtype=cur_new_embed.dtype,
device=cur_new_embed.device)),
dim=0))
if cur_len > 0:
new_labels_padded[i, :cur_len] = cur_new_labels
attention_mask[i, :cur_len] = True
position_ids[i, :cur_len] = torch.arange(
0,
cur_len,
dtype=position_ids.dtype,
device=position_ids.device)
new_inputs_embeds = torch.stack(new_inputs_embeds_padded, dim=0)
if _labels is None:
new_labels = None
else:
new_labels = new_labels_padded
if _attention_mask is None:
attention_mask = None
else:
attention_mask = attention_mask.to(dtype=_attention_mask.dtype)
if _position_ids is None:
position_ids = None
return {
'input_ids': None,
'position_ids': position_ids,
'attention_mask': attention_mask,
'past_key_values': past_key_values,
'inputs_embeds': new_inputs_embeds,
'labels': new_labels
}
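# Illustrative sketch (not part of xtuner): the core splice performed above,
# reduced to Python lists for one sample. The constant values and the
# `splice_images` helper are assumptions made for this example; the real code
# splices embedding tensors rather than placeholder strings.

```python
IGNORE_INDEX = -100  # assumed value, for illustration only
IMAGE_TOKEN_INDEX = -200  # assumed value, for illustration only


def splice_images(input_ids, labels, image_lengths):
    """Replace each image placeholder with `n` image slots.

    Labels at image positions become IGNORE_INDEX, so those positions do not
    contribute to the loss.
    """
    new_tokens, new_labels = [], []
    lengths = iter(image_lengths)
    for token, label in zip(input_ids, labels):
        if token == IMAGE_TOKEN_INDEX:
            n = next(lengths)
            new_tokens.extend(['<img>'] * n)
            new_labels.extend([IGNORE_INDEX] * n)
        else:
            new_tokens.append(token)
            new_labels.append(label)
    return new_tokens, new_labels


tokens, spliced_labels = splice_images(
    input_ids=[1, IMAGE_TOKEN_INDEX, 5],
    labels=[1, IMAGE_TOKEN_INDEX, 5],
    image_lengths=[3],
)
```

# The sequence grows from 3 to 5 positions here, which is why the function
# above rebuilds `attention_mask` and `position_ids` after splicing.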
def make_inputs_require_grad(module, input, output):
output.requires_grad_(True)
def guess_load_checkpoint(pth_model):
if osp.isfile(pth_model):
state_dict = torch.load(pth_model, map_location='cpu')
if 'state_dict' in state_dict:
state_dict = state_dict['state_dict']
elif osp.isdir(pth_model):
try:
from xtuner.utils.zero_to_any_dtype import \
get_state_dict_from_zero_checkpoint
except ImportError:
raise ImportError(
'The provided PTH model appears to be a DeepSpeed checkpoint. '
'However, DeepSpeed library is not detected in current '
'environment. This suggests that DeepSpeed may not be '
'installed or is incorrectly configured. Please verify your '
'setup.')
state_dict = get_state_dict_from_zero_checkpoint(
osp.dirname(pth_model), osp.basename(pth_model))
else:
raise FileNotFoundError(f'Cannot find {pth_model}')
return state_dict
================================================
FILE: xtuner-eval_niah/xtuner/parallel/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .sequence import * # noqa: F401, F403
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.dist import init_dist
from .attention import (post_process_for_sequence_parallel_attn,
pre_process_for_sequence_parallel_attn,
sequence_parallel_wrapper)
from .comm import (all_to_all, gather_for_sequence_parallel,
gather_forward_split_backward, split_for_sequence_parallel,
split_forward_gather_backward)
from .data_collate import (pad_cumulative_len_for_sequence_parallel,
pad_for_sequence_parallel)
from .reduce_loss import reduce_sequence_parallel_loss
from .sampler import SequenceParallelSampler
from .setup_distributed import (get_data_parallel_group,
get_data_parallel_rank,
get_data_parallel_world_size,
get_inner_sequence_parallel_group,
get_inner_sequence_parallel_rank,
get_inner_sequence_parallel_world_size,
get_sequence_parallel_group,
get_sequence_parallel_rank,
get_sequence_parallel_world_size,
init_inner_sequence_parallel,
init_sequence_parallel,
is_inner_sequence_parallel_initialized)
__all__ = [
'sequence_parallel_wrapper', 'pre_process_for_sequence_parallel_attn',
'post_process_for_sequence_parallel_attn', 'pad_for_sequence_parallel',
'split_for_sequence_parallel', 'SequenceParallelSampler',
'init_sequence_parallel', 'get_sequence_parallel_group',
'get_sequence_parallel_world_size', 'get_sequence_parallel_rank',
'get_data_parallel_group', 'get_data_parallel_world_size',
'get_data_parallel_rank', 'reduce_sequence_parallel_loss', 'init_dist',
'all_to_all', 'gather_for_sequence_parallel',
'split_forward_gather_backward', 'gather_forward_split_backward',
'get_inner_sequence_parallel_group', 'get_inner_sequence_parallel_rank',
'get_inner_sequence_parallel_world_size', 'init_inner_sequence_parallel',
'is_inner_sequence_parallel_initialized',
'pad_cumulative_len_for_sequence_parallel'
]
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/attention.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
import torch.distributed as dist
from .comm import (all_to_all, gather_forward_split_backward,
split_forward_gather_backward)
from .setup_distributed import (get_inner_sequence_parallel_group,
get_inner_sequence_parallel_world_size,
get_sequence_parallel_group,
get_sequence_parallel_world_size,
init_inner_sequence_parallel,
is_inner_sequence_parallel_initialized)
def pre_process_for_sequence_parallel_attn(query_states,
key_states,
value_states,
scatter_dim=2,
gather_dim=1):
b, s_div_sp, h, d = query_states.shape
sp = get_sequence_parallel_world_size()
if not is_inner_sequence_parallel_initialized():
insp = sp // math.gcd(h, sp)
init_inner_sequence_parallel(insp)
else:
insp = get_inner_sequence_parallel_world_size()
def pre_process_for_inner_sp(q, k, v):
if scatter_dim != 2 or gather_dim != 1:
raise NotImplementedError(
'Currently only `scatter_dim == 2` and `gather_dim == 1` '
f'is supported. But got scatter_dim = {scatter_dim} and '
f'gather_dim = {gather_dim}.')
# (b, s_div_sp, h, d) ->
# (b, s_div_sp, sp/insp, h*insp/sp, insp, d/insp) ->
# (b, s_div_sp, sp/insp, insp, h*insp/sp, d/insp) ->
# (b, s_div_sp, insp*h, d/insp)
q = q.view(b, s_div_sp, sp // insp, h * insp // sp, insp,
d // insp).transpose(3, 4).flatten(2, 4)
k = k.view(b, s_div_sp, sp // insp, h * insp // sp, insp,
d // insp).transpose(3, 4).flatten(2, 4)
v = v.view(b, s_div_sp, sp // insp, h * insp // sp, insp,
d // insp).transpose(3, 4).flatten(2, 4)
return q, k, v
def post_process_for_inner_sp(q, k, v):
# (b, s, insp*h/sp, d/insp) -> (b, s, insp*h/sp, d)
q = gather_forward_split_backward(q, -1,
get_inner_sequence_parallel_group())
k = gather_forward_split_backward(k, -1,
get_inner_sequence_parallel_group())
v = gather_forward_split_backward(v, -1,
get_inner_sequence_parallel_group())
return q, k, v
assert (h * insp) % sp == 0, \
('The number of attention heads should be divisible by '
'(sequence_parallel_world_size // sequence_parallel_inner_world_size)'
f'. But got n_head = {h}, sequence_parallel_world_size = '
f'{sp} and sequence_parallel_inner_world_size = {insp}.')
if insp > 1:
query_states, key_states, value_states = pre_process_for_inner_sp(
query_states, key_states, value_states)
# (b, s_div_sp, insp*h, d/insp) -> (b, s, insp*h/sp, d/insp)
sequence_parallel_group = get_sequence_parallel_group()
query_states = all_to_all(
query_states,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
key_states = all_to_all(
key_states,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
value_states = all_to_all(
value_states,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
if insp > 1:
query_states, key_states, value_states = post_process_for_inner_sp(
query_states, key_states, value_states)
return query_states, key_states, value_states
def post_process_for_sequence_parallel_attn(attn_output,
scatter_dim=1,
gather_dim=2):
sp = get_sequence_parallel_world_size()
insp = get_inner_sequence_parallel_world_size()
b, s, h_mul_insp_div_sp, d = attn_output.shape
h = h_mul_insp_div_sp * sp // insp
s_div_sp = s // sp
if insp > 1:
# (b, s, insp*h/sp, d) -> (b, s, insp*h/sp, d/insp)
attn_output = split_forward_gather_backward(
attn_output, -1, get_inner_sequence_parallel_group())
# (b, s, insp*h/sp, d/insp) -> (b, s_div_sp, insp*h, d/insp)
sequence_parallel_group = get_sequence_parallel_group()
output = all_to_all(
attn_output,
sequence_parallel_group,
scatter_dim=scatter_dim,
gather_dim=gather_dim)
if insp > 1:
# (b, s_div_sp, insp*h, d/insp) ->
# (b, s_div_sp, sp/insp, insp, h*insp/sp, d/insp) ->
# (b, s_div_sp, sp/insp, h*insp/sp, insp, d/insp) ->
# (b, s_div_sp, h, d)
output = output.view(b, s_div_sp, sp // insp, insp, h * insp // sp,
d // insp).transpose(3, 4).reshape(
b, s_div_sp, h, d)
return output
def sequence_parallel_wrapper(local_attn):
def sequence_parallel_attn(query_states, key_states, value_states, *args,
**kwargs):
training = kwargs.pop('training', True)
enable_sequence_parallel = (
dist.is_initialized() and get_sequence_parallel_world_size() > 1
and training)
if enable_sequence_parallel:
query_states, key_states, value_states = \
pre_process_for_sequence_parallel_attn(
query_states, key_states, value_states)
out = local_attn(query_states, key_states, value_states, *args,
**kwargs)
if enable_sequence_parallel:
out = post_process_for_sequence_parallel_attn(out).contiguous()
return out
return sequence_parallel_attn
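# Illustrative sketch (not part of xtuner): the shape comments in this file can
# be traced with plain integers. `inner_sp_size` and `shapes_through_attention`
# are hypothetical helpers; they only do the bookkeeping, no communication.

```python
import math


def inner_sp_size(num_heads, sp):
    """Mirror `insp = sp // math.gcd(h, sp)` from the pre-processing step."""
    return sp // math.gcd(num_heads, sp)


def shapes_through_attention(b, s_div_sp, h, d, sp):
    """Return insp and the per-rank (b, s, h, d) shape after each stage."""
    insp = inner_sp_size(h, sp)
    assert (h * insp) % sp == 0
    # Extra check for this sketch only: splitting the head dim needs this.
    assert d % insp == 0
    stages = [(b, s_div_sp, h, d)]
    if insp > 1:
        # Heads are multiplied by insp, head dim is divided by insp.
        stages.append((b, s_div_sp, insp * h, d // insp))
    # All-to-all: scatter heads, gather the sequence.
    stages.append((b, s_div_sp * sp, insp * h // sp, d // insp))
    if insp > 1:
        # Gather the head dim back inside the inner group.
        stages.append((b, s_div_sp * sp, insp * h // sp, d))
    return insp, stages


even_insp, even_stages = shapes_through_attention(1, 128, 32, 64, sp=8)
odd_insp, odd_stages = shapes_through_attention(1, 128, 6, 64, sp=4)
```

# With 32 heads and sp = 8 no inner group is needed; with 6 heads and sp = 4
# the inner size is 2 and the head dim is temporarily halved.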
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/comm.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from typing import Any, Tuple
import torch
import torch.distributed as dist
from torch import Tensor
def _all_to_all(
input: Tensor,
world_size: int,
group: dist.ProcessGroup,
scatter_dim: int,
gather_dim: int,
):
input_list = [
t.contiguous()
for t in torch.tensor_split(input, world_size, scatter_dim)
]
output_list = [torch.empty_like(input_list[0]) for _ in range(world_size)]
dist.all_to_all(output_list, input_list, group=group)
return torch.cat(output_list, dim=gather_dim).contiguous()
class _AllToAll(torch.autograd.Function):
"""All-to-all communication.
Args:
input: Input tensor
sp_group: Sequence parallel process group
scatter_dim: Scatter dimension
gather_dim: Gather dimension
"""
@staticmethod
def forward(ctx: Any, input: Tensor, sp_group: dist.ProcessGroup,
scatter_dim: int, gather_dim: int):
ctx.sp_group = sp_group
ctx.scatter_dim = scatter_dim
ctx.gather_dim = gather_dim
ctx.world_size = dist.get_world_size(sp_group)
output = _all_to_all(input, ctx.world_size, sp_group, scatter_dim,
gather_dim)
return output
@staticmethod
def backward(ctx: Any, grad_output: Tensor) -> Tuple:
grad_output = _all_to_all(
grad_output,
ctx.world_size,
ctx.sp_group,
ctx.gather_dim,
ctx.scatter_dim,
)
return (
grad_output,
None,
None,
None,
)
def all_to_all(
input: Tensor,
sp_group: dist.ProcessGroup,
scatter_dim: int = 2,
gather_dim: int = 1,
):
"""Convenience function to apply the all-to-all operation with scatter and
gather dimensions.
Notes:
We have wrapped the `torch.distributed.all_to_all` function to
enable automatic differentiation of the all-to-all operation.
Args:
input: The input tensor for which all-to-all communication is performed
sp_group: The sequence parallel process group.
scatter_dim: The dimension along which the input tensor is scattered
(default: 2).
gather_dim: The dimension along which the output tensor is gathered
(default: 1).
Returns:
The output tensor after the all-to-all communication.
"""
return _AllToAll.apply(input, sp_group, scatter_dim, gather_dim)
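# Illustrative sketch (not part of xtuner): a single-process simulation of the
# data movement with scatter on heads and gather on the sequence. The batch
# dim is omitted and heads are assumed to divide evenly by the world size.
# `simulate_all_to_all` is a hypothetical helper.

```python
def simulate_all_to_all(per_rank):
    """per_rank[r][s][h] is the item rank r holds at local position s, head h.

    Each rank splits its heads into `world` chunks, sends chunk j to rank j,
    and every receiver concatenates what it gets along the sequence axis in
    source-rank order.
    """
    world = len(per_rank)
    n_heads = len(per_rank[0][0])
    assert n_heads % world == 0
    heads_per_rank = n_heads // world
    outputs = []
    for dst in range(world):
        rows = []
        for src in range(world):
            for seq_row in per_rank[src]:
                start = dst * heads_per_rank
                rows.append(seq_row[start:start + heads_per_rank])
        outputs.append(rows)
    return outputs


# Two ranks, four global positions, two heads; items are (position, head).
before = [
    [[(0, 0), (0, 1)], [(1, 0), (1, 1)]],  # rank 0 holds positions 0 and 1
    [[(2, 0), (2, 1)], [(3, 0), (3, 1)]],  # rank 1 holds positions 2 and 3
]
after = simulate_all_to_all(before)
```

# Afterwards each rank sees the full sequence but only its own slice of heads,
# which is what local attention needs.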
def split_for_sequence_parallel(input, dim: int, sp_group: dist.ProcessGroup):
"""Splits the input tensor along a given dimension for sequence parallel.
Args:
input: The input tensor to be split.
dim: The dimension along which the tensor should be split.
sp_group: The sequence parallel process group.
Returns:
The split tensor corresponding to the current rank's chunk.
"""
world_size = dist.get_world_size(sp_group)
if world_size == 1:
return input
rank = dist.get_rank(sp_group)
dim_size = input.size(dim)
assert dim_size % world_size == 0, (
f'The dimension to split ({dim_size}) is not a multiple of '
f'world size ({world_size}), cannot split tensor evenly')
tensor_list = torch.split(input, dim_size // world_size, dim=dim)
output = tensor_list[rank].contiguous()
return output
def gather_for_sequence_parallel(input, dim: int, sp_group: dist.ProcessGroup):
"""Gathers the input tensor along a given dimension for sequence parallel.
Args:
input: The input tensor to be gathered.
dim: The dimension along which the tensor should be gathered.
sp_group: The sequence parallel process group.
Returns:
The gathered tensor concatenated along the specified dimension.
"""
input = input.contiguous()
world_size = dist.get_world_size(sp_group)
dist.get_rank(sp_group)
if world_size == 1:
return input
tensor_list = [torch.empty_like(input) for _ in range(world_size)]
assert input.device.type == 'cuda'
dist.all_gather(tensor_list, input, group=sp_group)
output = torch.cat(tensor_list, dim=dim).contiguous()
return output
class _GatherForwardSplitBackward(torch.autograd.Function):
"""Gather the input during forward.
Scale and split the grad and keep only the chunk belonging to the rank
during backward.
"""
@staticmethod
def forward(ctx, input, dim, sp_group, grad_scale):
ctx.dim = dim
ctx.sp_group = sp_group
ctx.grad_scale = grad_scale
return gather_for_sequence_parallel(input, dim, sp_group)
@staticmethod
def backward(ctx, grad_output):
if ctx.grad_scale == 'up':
grad_output = grad_output * dist.get_world_size(ctx.sp_group)
elif ctx.grad_scale == 'down':
grad_output = grad_output / dist.get_world_size(ctx.sp_group)
return (split_for_sequence_parallel(grad_output, ctx.dim,
ctx.sp_group), None, None, None)
class _SplitForwardGatherBackward(torch.autograd.Function):
"""Split the input and keep only the chunk belonging to the rank during
forward.
Scale and gather the grad during backward.
"""
@staticmethod
def forward(ctx, input, dim, sp_group, grad_scale):
ctx.dim = dim
ctx.sp_group = sp_group
ctx.grad_scale = grad_scale
return split_for_sequence_parallel(input, dim, sp_group)
@staticmethod
def backward(ctx, grad_output):
if ctx.grad_scale == 'up':
grad_output = grad_output * dist.get_world_size(ctx.sp_group)
elif ctx.grad_scale == 'down':
grad_output = grad_output / dist.get_world_size(ctx.sp_group)
return (gather_for_sequence_parallel(grad_output, ctx.dim,
ctx.sp_group), None, None, None)
def split_forward_gather_backward(input, dim, sp_group, grad_scale=None):
"""Split tensors according to the sp rank during forward propagation and
gather the grad from the whole sp group during backward propagation.
1. When do we need this? When input.requires_grad = True.
2. Why do we need grad scale?
We have to scale down the grads because `gather_forward_split_backward`
scales up the grads.
"""
return _SplitForwardGatherBackward.apply(input, dim, sp_group, grad_scale)
def gather_forward_split_backward(input, dim, sp_group, grad_scale=None):
"""Gather tensors from the whole sp group during forward propagation and
split the grad according to the sp rank during backward propagation.
1. When do we need this?
When sp is greater than 1, we need to slice the input `x` along
sequence length dimension before it is passed into the model and get
`sub_seq_x`. We then pass `sub_seq_x` into model and get output
`sub_seq_out`. If the loss calculation process needs to use the complete
output, we have to gather the `sub_seq_out` in all sp ranks during forward
propagation and split the grad during backward propagation.
2. Why do we need grad scale?
Here is a simple case.
-------- SP 1 -----------
Suppose here is a toy model with only one linear module
(in_features = 2, out_features = 1) and the input x has shape(2, 2).
Y = [[y1], [y2]]
= [[w11x11 + w21x12], [w11x21 + w21x22]]
= [[x11, x12], [x21, x22]] dot [[w11], [w21]]
z = mean(Y) = (y1 + y2) / 2
Here is the partial derivative of z with respect to w11:
∂z / ∂w11 = ∂z / ∂y1 * ∂y1 / ∂w11 + ∂z / ∂y2 * ∂y2 / ∂w11
= 1/2 * x11 + 1/2 * x21 = (x11 + x21) / 2
-------- SP 2 -----------
When the sequence parallel world size is set to 2, we split the input x
and scatter it to the two ranks in the same sequence parallel group.
```Step 1
Y_rank0 = [[y1]] = [[w11x11 + w21x12]] = [[x11, x12]] dot [[w11, w21]]^T
Y_rank1 = [[y2]] = [[w11x21 + w21x22]] = [[x21, x22]] dot [[w11, w21]]^T
```
Then, we have to gather them:
```Step 2
Y_rank0 = [[y1],
detach([y2])]
Y_rank1 = [detach([y1]),
[y2]]
```
Note that y2 in Y_rank0 does not have grad, neither does y1 in Y_rank1.
Similarly, we calculate the loss in each rank:
```Step 3
z_rank0 = mean(Y_rank0) = (y1 + detach(y2)) / 2
z_rank1 = mean(Y_rank1) = (detach(y1) + y2) / 2
```
So the partial derivative of loss_rank0 with respect to w11:
```∂z / ∂w11 = ∂z / ∂y1 * ∂y1 / ∂w11 = x11 / 2```
The same for rank1:
```∂z / ∂w11 = ∂z / ∂y2 * ∂y2 / ∂w11 = x21 / 2```
Finally, we need to all_reduce them:
```Step 4
In both ranks:
∂z / ∂w11 = (x11 / 2 + x21 / 2) / 2 = (x11 + x21) / 4
```
In SP2, the gradient of each param is only half of that in SP1.
So we should scale up the grad during the backward process in Step 2.
""" # noqa: E501
return _GatherForwardSplitBackward.apply(input, dim, sp_group, grad_scale)
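# Illustrative sketch (not part of xtuner): the toy derivation in the docstring
# above, checked numerically in plain Python. The helper names and the input
# values are made up for this example; averaging over ranks stands in for the
# gradient all-reduce.

```python
def grad_w11_sp1(x):
    """z = (y1 + y2) / 2, so dz/dw11 = (x11 + x21) / 2."""
    return (x[0][0] + x[1][0]) / 2


def grad_w11_sp2(x, scale_up=False):
    """Each rank back-propagates only through its own row, then ranks average."""
    world_size = 2
    per_rank = [x[0][0] / 2, x[1][0] / 2]
    if scale_up:
        # What grad_scale='up' does in the backward of the gather.
        per_rank = [g * world_size for g in per_rank]
    return sum(per_rank) / world_size


x = [[1.0, 2.0], [3.0, 4.0]]
reference = grad_w11_sp1(x)
unscaled = grad_w11_sp2(x)
scaled = grad_w11_sp2(x, scale_up=True)
```

# Without scaling the SP2 gradient is half the SP1 gradient; scaling up by the
# world size restores it.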
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/data_collate.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch
from .setup_distributed import get_sequence_parallel_world_size
def pad_for_sequence_parallel(tensor, padding_value, dim=-1):
length = tensor.shape[dim]
seq_parallel_world_size = get_sequence_parallel_world_size()
if length % seq_parallel_world_size == 0:
return tensor
pad_num = seq_parallel_world_size - (length % seq_parallel_world_size)
pad_shape = (*tensor.shape[:dim], pad_num,
*tensor.shape[dim + 1:]) if dim != -1 else (
*tensor.shape[:dim], pad_num)
pad = torch.full(
pad_shape, padding_value, dtype=tensor.dtype, device=tensor.device)
tensor = torch.cat([tensor, pad], dim=dim)
return tensor
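# Illustrative sketch (not part of xtuner): the padding arithmetic above on a
# plain list instead of a tensor. `pad_to_multiple` is a hypothetical helper.

```python
def pad_to_multiple(tokens, padding_value, sp_world_size):
    """Right-pad so the length is divisible by the sequence parallel size."""
    remainder = len(tokens) % sp_world_size
    if remainder == 0:
        return list(tokens)
    pad_num = sp_world_size - remainder
    return list(tokens) + [padding_value] * pad_num


padded = pad_to_multiple([1, 2, 3, 4, 5], padding_value=0, sp_world_size=4)
unchanged = pad_to_multiple([1, 2, 3, 4], padding_value=0, sp_world_size=4)
```

# Five tokens with a world size of 4 get three pad values, so each rank can
# receive an equal chunk of two.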
# This function is only needed when the following two conditions are met:
# 1. use_varlen_attn = True
# 2. pack_to_max_length = True and the lengths of each sequence are different
def pad_cumulative_len_for_sequence_parallel(cumulative_len):
assert len(cumulative_len) == 1
seqlen = cumulative_len[0][-1]
seq_parallel_world_size = get_sequence_parallel_world_size()
if seqlen % seq_parallel_world_size == 0:
return cumulative_len, None
bs = len(cumulative_len)
pad_len = seq_parallel_world_size - (seqlen % seq_parallel_world_size)
seqlen_new = seqlen + pad_len
attention_mask = torch.zeros(
bs, seqlen_new, dtype=torch.bool, device=cumulative_len[0].device)
attention_mask[:, :seqlen] = True
for i, cu_len in enumerate(cumulative_len):
pad = torch.tensor([seqlen_new],
device=cu_len.device,
dtype=cu_len.dtype)
cumulative_len[i] = torch.cat([cu_len, pad], dim=0)
return cumulative_len, attention_mask
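# Illustrative sketch (not part of xtuner): the same bookkeeping for a single
# packed sample, using lists instead of tensors. `pad_cumulative_len` is a
# hypothetical helper.

```python
def pad_cumulative_len(cu_len, sp_world_size):
    """Append one padding segment so the total length divides evenly."""
    seqlen = cu_len[-1]
    if seqlen % sp_world_size == 0:
        return list(cu_len), None
    pad_len = sp_world_size - seqlen % sp_world_size
    seqlen_new = seqlen + pad_len
    # True for real tokens, False for the padded tail.
    attention_mask = [True] * seqlen + [False] * pad_len
    return list(cu_len) + [seqlen_new], attention_mask


new_cu_len, mask = pad_cumulative_len([0, 3, 7], sp_world_size=4)
```

# The padded tail becomes its own segment, so varlen attention never mixes it
# with the real sequences.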
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/reduce_loss.py
================================================
import torch
import torch.distributed as dist
from .setup_distributed import get_sequence_parallel_group
class _ReduceLoss(torch.autograd.Function):
@staticmethod
def forward(ctx, mean_loss, loss_scale, process_group):
ctx.mode = process_group
if loss_scale == 0:
# convert nan to 0 just for logging
mean_loss = torch.nan_to_num(mean_loss)
loss_sum = mean_loss * loss_scale
dist.all_reduce(loss_sum, group=process_group)
dist.all_reduce(loss_scale, group=process_group)
loss = loss_sum / loss_scale
return loss
@staticmethod
def backward(ctx, grad_output):
return grad_output, None, None
def reduce_sequence_parallel_loss(mean_loss,
loss_scale,
sp_group: dist.ProcessGroup = None):
if sp_group is None:
# avoid bc breaking
sp_group = get_sequence_parallel_group()
if dist.get_world_size(sp_group) == 1:
return mean_loss
return _ReduceLoss.apply(mean_loss, loss_scale, sp_group)
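# Illustrative sketch (not part of xtuner): the reduction above is a mean
# weighted by `loss_scale` (for example, the number of supervised tokens on
# each rank). `reduce_loss` is a hypothetical single-process stand-in for the
# two all-reduce calls.

```python
import math


def reduce_loss(mean_losses, loss_scales):
    """Weighted mean of per-rank mean losses."""
    loss_sum = 0.0
    for loss, scale in zip(mean_losses, loss_scales):
        if scale == 0:
            # A rank with no supervised tokens reports nan; treat it as 0,
            # mirroring the nan_to_num call above.
            loss = 0.0
        loss_sum += loss * scale
    return loss_sum / sum(loss_scales)


weighted = reduce_loss([2.0, 4.0], [3, 1])
with_empty_rank = reduce_loss([math.nan, 4.0], [0, 2])
```

# A plain average of 2.0 and 4.0 would give 3.0; weighting by token counts
# gives 2.5, which matches the loss over the full sequence.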
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/sampler.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
from typing import Optional, Sized
from mmengine.dataset import DefaultSampler
from mmengine.dist import sync_random_seed
from .setup_distributed import (get_data_parallel_rank,
get_data_parallel_world_size)
class SequenceParallelSampler(DefaultSampler):
def __init__(self,
dataset: Sized,
shuffle: bool = True,
seed: Optional[int] = None,
round_up: bool = True) -> None:
rank = get_data_parallel_rank()
world_size = get_data_parallel_world_size()
self.rank = rank
self.world_size = world_size
self.dataset = dataset
self.shuffle = shuffle
if seed is None:
seed = sync_random_seed()
self.seed = seed
self.epoch = 0
self.round_up = round_up
if self.round_up:
self.num_samples = math.ceil(len(self.dataset) / world_size)
self.total_size = self.num_samples * self.world_size
else:
self.num_samples = math.ceil(
(len(self.dataset) - rank) / world_size)
self.total_size = len(self.dataset)
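# Illustrative sketch (not part of xtuner): the sample-count arithmetic above,
# using the data parallel rank and world size. `samples_per_rank` is a
# hypothetical helper.

```python
import math


def samples_per_rank(dataset_len, rank, world_size, round_up=True):
    """Return (num_samples, total_size) as computed in __init__ above."""
    if round_up:
        num_samples = math.ceil(dataset_len / world_size)
        total_size = num_samples * world_size
    else:
        num_samples = math.ceil((dataset_len - rank) / world_size)
        total_size = dataset_len
    return num_samples, total_size


rounded = samples_per_rank(10, rank=0, world_size=4)
exact = [samples_per_rank(10, r, 4, round_up=False)[0] for r in range(4)]
```

# Ranks in the same sequence parallel group share a data parallel rank, so
# they draw the same samples.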
================================================
FILE: xtuner-eval_niah/xtuner/parallel/sequence/setup_distributed.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import torch.distributed as dist
_SEQUENCE_PARALLEL_GROUP = None
_SEQUENCE_PARALLEL_WORLD_SIZE = None
_SEQUENCE_PARALLEL_RANK = None
_INNER_SEQUENCE_PARALLEL_GROUP = None
_INNER_SEQUENCE_PARALLEL_WORLD_SIZE = None
_INNER_SEQUENCE_PARALLEL_RANK = None
_DATA_PARALLEL_GROUP = None
_DATA_PARALLEL_WORLD_SIZE = None
_DATA_PARALLEL_RANK = None
def init_sequence_parallel(sequence_parallel_size: int = 1):
assert dist.is_initialized()
world_size: int = dist.get_world_size()
# enable_ds_sequence_parallel = sequence_parallel_size > 1
# if enable_ds_sequence_parallel:
if world_size % sequence_parallel_size != 0:
raise RuntimeError(f'world_size ({world_size}) is not divisible by '
f'sequence_parallel_size ({sequence_parallel_size})')
num_sequence_parallel_groups: int = world_size // sequence_parallel_size
rank = dist.get_rank()
# Build the sequence parallel groups.
global _SEQUENCE_PARALLEL_GROUP
assert _SEQUENCE_PARALLEL_GROUP is None, \
'sequence parallel group is already initialized'
for i in range(num_sequence_parallel_groups):
ranks = range(i * sequence_parallel_size,
(i + 1) * sequence_parallel_size)
group = dist.new_group(ranks)
if rank in ranks:
_SEQUENCE_PARALLEL_GROUP = group
global _DATA_PARALLEL_GROUP
assert _DATA_PARALLEL_GROUP is None, \
'data parallel group is already initialized'
all_data_parallel_group_ranks = []
start_rank = 0
end_rank = world_size
for j in range(sequence_parallel_size):
ranks = range(start_rank + j, end_rank, sequence_parallel_size)
all_data_parallel_group_ranks.append(list(ranks))
group = dist.new_group(ranks)
if rank in ranks:
_DATA_PARALLEL_GROUP = group
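# Illustrative sketch (not part of xtuner): the rank layout built above,
# computed without creating any process groups. `sequence_parallel_layout` is
# a hypothetical helper.

```python
def sequence_parallel_layout(world_size, sp_size):
    """Return (sequence parallel groups, data parallel groups) as rank lists."""
    if world_size % sp_size != 0:
        raise RuntimeError(
            f'world_size ({world_size}) is not divisible by '
            f'sequence_parallel_size ({sp_size})')
    # Consecutive ranks share a sequence.
    sp_groups = [
        list(range(i * sp_size, (i + 1) * sp_size))
        for i in range(world_size // sp_size)
    ]
    # Ranks at the same offset in each sequence parallel group share data
    # parallel duties.
    dp_groups = [list(range(j, world_size, sp_size)) for j in range(sp_size)]
    return sp_groups, dp_groups


sp_groups, dp_groups = sequence_parallel_layout(world_size=8, sp_size=4)
```

# With 8 ranks and a sequence parallel size of 4 there are two sequence
# parallel groups and four data parallel groups of two ranks each.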
def init_inner_sequence_parallel(inner_sequence_parallel_size: int = 1):
"""Build the sequence parallel inner groups.
They are helpful when the number of attn heads is not evenly divisible by
the sp size.
"""
assert _SEQUENCE_PARALLEL_GROUP is not None, \
('Please call `init_inner_sequence_parallel` after calling '
'`init_sequence_parallel`.')
rank = dist.get_rank()
world_size: int = dist.get_world_size()
n_inner_group = world_size // inner_sequence_parallel_size
global _INNER_SEQUENCE_PARALLEL_GROUP
assert _INNER_SEQUENCE_PARALLEL_GROUP is None
for i in range(n_inner_group):
ranks = range(i * inner_sequence_parallel_size,
(i + 1) * inner_sequence_parallel_size)
group = dist.new_group(ranks)
if rank in ranks:
_INNER_SEQUENCE_PARALLEL_GROUP = group
def is_inner_sequence_parallel_initialized():
return _INNER_SEQUENCE_PARALLEL_GROUP is not None
def get_inner_sequence_parallel_group():
return _INNER_SEQUENCE_PARALLEL_GROUP
def get_inner_sequence_parallel_world_size():
global _INNER_SEQUENCE_PARALLEL_WORLD_SIZE
if _INNER_SEQUENCE_PARALLEL_WORLD_SIZE is not None:
return _INNER_SEQUENCE_PARALLEL_WORLD_SIZE
if not dist.is_initialized() or (_INNER_SEQUENCE_PARALLEL_GROUP is None):
_INNER_SEQUENCE_PARALLEL_WORLD_SIZE = 1
else:
_INNER_SEQUENCE_PARALLEL_WORLD_SIZE = dist.get_world_size(
group=get_inner_sequence_parallel_group())
return _INNER_SEQUENCE_PARALLEL_WORLD_SIZE
def get_inner_sequence_parallel_rank():
global _INNER_SEQUENCE_PARALLEL_RANK
if _INNER_SEQUENCE_PARALLEL_RANK is not None:
return _INNER_SEQUENCE_PARALLEL_RANK
if not dist.is_initialized() or (_INNER_SEQUENCE_PARALLEL_GROUP is None):
_INNER_SEQUENCE_PARALLEL_RANK = 0
else:
_INNER_SEQUENCE_PARALLEL_RANK = dist.get_rank(
group=get_inner_sequence_parallel_group())
return _INNER_SEQUENCE_PARALLEL_RANK
def get_sequence_parallel_group():
"""Get the sequence parallel group the caller rank belongs to."""
return _SEQUENCE_PARALLEL_GROUP
def get_sequence_parallel_world_size():
"""Return world size for the sequence parallel group."""
global _SEQUENCE_PARALLEL_WORLD_SIZE
if _SEQUENCE_PARALLEL_WORLD_SIZE is not None:
return _SEQUENCE_PARALLEL_WORLD_SIZE
if not dist.is_initialized() or (_SEQUENCE_PARALLEL_GROUP is None):
_SEQUENCE_PARALLEL_WORLD_SIZE = 1
else:
_SEQUENCE_PARALLEL_WORLD_SIZE = dist.get_world_size(
group=get_sequence_parallel_group())
return _SEQUENCE_PARALLEL_WORLD_SIZE
def get_sequence_parallel_rank():
"""Return my rank for the sequence parallel group."""
global _SEQUENCE_PARALLEL_RANK
if _SEQUENCE_PARALLEL_RANK is not None:
return _SEQUENCE_PARALLEL_RANK
if not dist.is_initialized() or (_SEQUENCE_PARALLEL_GROUP is None):
_SEQUENCE_PARALLEL_RANK = 0
else:
_SEQUENCE_PARALLEL_RANK = dist.get_rank(
group=get_sequence_parallel_group())
return _SEQUENCE_PARALLEL_RANK
def get_data_parallel_group():
"""Get the data parallel group the caller rank belongs to."""
assert _DATA_PARALLEL_GROUP is not None, \
'data parallel group is not initialized'
return _DATA_PARALLEL_GROUP
def get_data_parallel_world_size():
"""Return world size for the data parallel group."""
global _DATA_PARALLEL_WORLD_SIZE
if _DATA_PARALLEL_WORLD_SIZE is not None:
return _DATA_PARALLEL_WORLD_SIZE
if not dist.is_initialized():
_DATA_PARALLEL_WORLD_SIZE = 1
else:
_DATA_PARALLEL_WORLD_SIZE = dist.get_world_size(
group=get_data_parallel_group())
return _DATA_PARALLEL_WORLD_SIZE
def get_data_parallel_rank():
"""Return my rank for the data parallel group."""
global _DATA_PARALLEL_RANK
if _DATA_PARALLEL_RANK is not None:
return _DATA_PARALLEL_RANK
if not dist.is_initialized():
_DATA_PARALLEL_RANK = 0
else:
_DATA_PARALLEL_RANK = dist.get_rank(group=get_data_parallel_group())
return _DATA_PARALLEL_RANK
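# Illustrative sketch (not part of the original module): the inner
# sequence-parallel setup above splits the global ranks into contiguous
# blocks of `inner_sequence_parallel_size`. The pure-Python helpers below
# only mirror that arithmetic so it can be checked without launching
# `torch.distributed`; `partition_ranks` and `group_and_local_rank` are
# hypothetical names, and divisibility of the world size is an assumption.

```python
def partition_ranks(world_size, group_size):
    """Split ranks 0..world_size-1 into contiguous groups of group_size."""
    if world_size % group_size != 0:
        raise ValueError('world_size must be divisible by group_size')
    return [
        list(range(i * group_size, (i + 1) * group_size))
        for i in range(world_size // group_size)
    ]


def group_and_local_rank(rank, group_size):
    """Return (group index, position inside the group) for a global rank."""
    return divmod(rank, group_size)


groups = partition_ranks(world_size=8, group_size=4)
print(groups)                                       # two groups of four ranks
print(group_and_local_rank(rank=5, group_size=4))   # rank 5 sits in group 1
```

# With 8 processes and a group size of 4, global rank 5 falls in the second
# group at local position 1, which is the value a group-local rank query
# would be expected to report for that process.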
================================================
FILE: xtuner-eval_niah/xtuner/registry.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.registry import Registry
__all__ = ['BUILDER', 'MAP_FUNC']
BUILDER = Registry('builder')
MAP_FUNC = Registry('map_fn')
================================================
FILE: xtuner-eval_niah/xtuner/tools/chat.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os
import os.path as osp
import re
import sys
import torch
from huggingface_hub import snapshot_download
from peft import PeftModel
from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel, GenerationConfig)
from transformers.generation.streamers import TextStreamer
from xtuner.dataset.utils import expand2square, load_image
from xtuner.model.utils import prepare_inputs_labels_for_multimodal
from xtuner.tools.utils import get_stop_criteria
from xtuner.utils import (DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX,
PROMPT_TEMPLATE, SYSTEM_TEMPLATE)
TORCH_DTYPE_MAP = dict(
fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32, auto='auto')
def remove_prefix(state_dict, prefix):
new_state_dict = {}
for key, value in state_dict.items():
if key.startswith(prefix):
new_key = key[len(prefix):]
new_state_dict[new_key] = value
else:
new_state_dict[key] = value
return new_state_dict
def parse_args():
parser = argparse.ArgumentParser(description='Chat with a HF model')
parser.add_argument(
'model_name_or_path', help='Hugging Face model name or path')
adapter_group = parser.add_mutually_exclusive_group()
adapter_group.add_argument(
'--adapter', default=None, help='adapter name or path')
adapter_group.add_argument(
'--llava', default=None, help='llava name or path')
parser.add_argument(
'--visual-encoder', default=None, help='visual encoder name or path')
parser.add_argument(
'--visual-select-layer', type=int, default=-2,
help='visual select layer')
parser.add_argument('--image', default=None, help='image')
parser.add_argument(
'--torch-dtype',
default='fp16',
choices=TORCH_DTYPE_MAP.keys(),
help='Override the default `torch.dtype` and load the model under '
'a specific `dtype`.')
parser.add_argument(
'--prompt-template',
choices=PROMPT_TEMPLATE.keys(),
default=None,
help='Specify a prompt template')
system_group = parser.add_mutually_exclusive_group()
system_group.add_argument(
'--system', default=None, help='Specify the system text')
system_group.add_argument(
'--system-template',
choices=SYSTEM_TEMPLATE.keys(),
default=None,
help='Specify a system template')
parser.add_argument(
'--bits',
type=int,
choices=[4, 8, None],
default=None,
help='LLM bits')
parser.add_argument(
'--bot-name', type=str, default='BOT', help='Name for Bot')
parser.add_argument(
'--with-plugins',
nargs='+',
choices=['calculate', 'solve', 'search'],
help='Specify plugins to use')
parser.add_argument(
'--no-streamer', action='store_true', help='Disable the text streamer')
parser.add_argument(
'--lagent', action='store_true', help='Whether to use lagent')
parser.add_argument(
'--stop-words', nargs='+', type=str, default=[], help='Stop words')
parser.add_argument(
'--offload-folder',
default=None,
help='The folder in which to offload the model weights (or where the '
'model weights are already offloaded).')
parser.add_argument(
'--max-new-tokens',
type=int,
default=2048,
help='Maximum number of new tokens allowed in generated text')
parser.add_argument(
'--temperature',
type=float,
default=0.1,
help='The value used to modulate the next token probabilities.')
parser.add_argument(
'--top-k',
type=int,
default=40,
help='The number of highest probability vocabulary tokens to '
'keep for top-k-filtering.')
parser.add_argument(
'--top-p',
type=float,
default=0.75,
help='If set to float < 1, only the smallest set of most probable '
'tokens with probabilities that add up to top_p or higher are '
'kept for generation.')
parser.add_argument(
'--repetition-penalty',
type=float,
default=1.0,
help='The parameter for repetition penalty. 1.0 means no penalty.')
parser.add_argument(
'--seed',
type=int,
default=0,
help='Random seed for reproducible text generation')
args = parser.parse_args()
return args
def get_input():
"""Helper function for getting input from users."""
sentinel = '' # ends when this string is seen
result = None
while result is None:
print(('\ndouble enter to end input (EXIT: exit chat, '
'RESET: reset history) >>> '),
end='')
try:
result = '\n'.join(iter(input, sentinel))
except UnicodeDecodeError:
print('Invalid characters detected. Please enter again.')
return result
def main():
args = parse_args()
torch.manual_seed(args.seed)
# build llm
quantization_config = None
load_in_8bit = False
if args.bits == 4:
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')
elif args.bits == 8:
load_in_8bit = True
model_kwargs = {
'quantization_config': quantization_config,
'load_in_8bit': load_in_8bit,
'device_map': 'auto',
'offload_folder': args.offload_folder,
'trust_remote_code': True,
'torch_dtype': TORCH_DTYPE_MAP[args.torch_dtype]
}
if args.lagent:
from lagent.actions import ActionExecutor, GoogleSearch
from lagent.agents import (CALL_PROTOCOL_CN, FORCE_STOP_PROMPT_CN,
ReAct, ReActProtocol)
from lagent.llms import HFTransformerCasualLM
try:
SERPER_API_KEY = os.environ['SERPER_API_KEY']
except Exception:
print('Please obtain the `SERPER_API_KEY` from https://serper.dev '
'and set it using `export SERPER_API_KEY=xxx`.')
sys.exit(1)
model_kwargs.pop('trust_remote_code')
llm = HFTransformerCasualLM(
args.model_name_or_path, model_kwargs=model_kwargs)
if args.adapter is not None:
print(f'Loading adapter from {args.adapter}...')
llm.model = PeftModel.from_pretrained(
llm.model,
args.adapter,
offload_folder=args.offload_folder,
trust_remote_code=True)
search_tool = GoogleSearch(api_key=SERPER_API_KEY)
chatbot = ReAct(
llm=llm,
action_executor=ActionExecutor(actions=[search_tool]),
protocol=ReActProtocol(
call_protocol=CALL_PROTOCOL_CN,
force_stop=FORCE_STOP_PROMPT_CN))
while True:
text = get_input()
while text.strip() == 'RESET':
print('Log: History responses have been removed!')
chatbot._session_history = []
inputs = ''
text = get_input()
if text.strip() == 'EXIT':
print('Log: Exit!')
exit(0)
response = chatbot.chat(text)
print(response.response)
else:
if args.with_plugins is None:
inner_thoughts_open = False
calculate_open = False
solve_open = False
search_open = False
else:
assert args.prompt_template == args.system_template == 'moss_sft'
from plugins import plugins_api
inner_thoughts_open = True
calculate_open = 'calculate' in args.with_plugins
solve_open = 'solve' in args.with_plugins
search_open = 'search' in args.with_plugins
# pre-import for api and model preparation
if calculate_open:
from plugins import calculate # noqa: F401
if solve_open:
from plugins import solve # noqa: F401
if search_open:
from plugins import search # noqa: F401
# build llm
llm = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
**model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(
args.model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True)
print(f'Load LLM from {args.model_name_or_path}')
if args.adapter is not None:
llm = PeftModel.from_pretrained(
llm,
args.adapter,
offload_folder=args.offload_folder,
trust_remote_code=True)
print(f'Load adapter from {args.adapter}')
if args.llava is not None:
llava_path = snapshot_download(
repo_id=args.llava) if not osp.isdir(
args.llava) else args.llava
# build visual_encoder
if 'visual_encoder' in os.listdir(llava_path):
assert args.visual_encoder is None, (
"Please don't specify the `--visual-encoder` since passed "
'`--llava` contains a visual encoder!')
visual_encoder_path = osp.join(llava_path, 'visual_encoder')
else:
assert args.visual_encoder is not None, (
'Please specify the `--visual-encoder`!')
visual_encoder_path = args.visual_encoder
visual_encoder = CLIPVisionModel.from_pretrained(
visual_encoder_path,
torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
image_processor = CLIPImageProcessor.from_pretrained(
visual_encoder_path)
print(f'Load visual_encoder from {visual_encoder_path}')
# load adapter
if 'llm_adapter' in os.listdir(llava_path):
adapter_path = osp.join(llava_path, 'llm_adapter')
llm = PeftModel.from_pretrained(
llm,
adapter_path,
offload_folder=args.offload_folder,
trust_remote_code=True)
print(f'Load LLM adapter from {args.llava}')
if 'visual_encoder_adapter' in os.listdir(llava_path):
adapter_path = osp.join(llava_path, 'visual_encoder_adapter')
visual_encoder = PeftModel.from_pretrained(
visual_encoder,
adapter_path,
offload_folder=args.offload_folder)
print(f'Load visual_encoder adapter from {args.llava}')
# build projector
projector_path = osp.join(llava_path, 'projector')
projector = AutoModel.from_pretrained(
projector_path,
torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype],
trust_remote_code=True)
print(f'Load projector from {args.llava}')
projector.cuda()
projector.eval()
visual_encoder.cuda()
visual_encoder.eval()
llm.eval()
if args.image is not None:
image = load_image(args.image)
image = expand2square(
image, tuple(int(x * 255) for x in image_processor.image_mean))
image = image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
image = image.cuda().unsqueeze(0).to(visual_encoder.dtype)
visual_outputs = visual_encoder(image, output_hidden_states=True)
pixel_values = projector(
visual_outputs.hidden_states[args.visual_select_layer][:, 1:])
stop_words = args.stop_words
sep = ''
if args.prompt_template:
template = PROMPT_TEMPLATE[args.prompt_template]
stop_words += template.get('STOP_WORDS', [])
sep = template.get('SEP', '')
stop_criteria = get_stop_criteria(
tokenizer=tokenizer, stop_words=stop_words)
if args.no_streamer:
streamer = None
else:
streamer = TextStreamer(tokenizer, skip_prompt=True)
gen_config = GenerationConfig(
max_new_tokens=args.max_new_tokens,
do_sample=args.temperature > 0,
temperature=args.temperature,
top_p=args.top_p,
top_k=args.top_k,
repetition_penalty=args.repetition_penalty,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id
if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
)
n_turn = 0
inputs = ''
while True:
text = get_input()
while text.strip() == 'RESET':
print('Log: History responses have been removed!')
n_turn = 0
inputs = ''
text = get_input()
if text.strip() == 'EXIT':
print('Log: Exit!')
exit(0)
if args.image is not None and n_turn == 0:
text = DEFAULT_IMAGE_TOKEN + '\n' + text
if args.prompt_template:
prompt_text = ''
template = PROMPT_TEMPLATE[args.prompt_template]
if 'SYSTEM' in template and n_turn == 0:
system_text = None
if args.system_template is not None:
system_text = SYSTEM_TEMPLATE[
args.system_template].format(
round=n_turn + 1, bot_name=args.bot_name)
elif args.system is not None:
system_text = args.system
if system_text is not None:
prompt_text += template['SYSTEM'].format(
system=system_text,
round=n_turn + 1,
bot_name=args.bot_name)
prompt_text += template['INSTRUCTION'].format(
input=text, round=n_turn + 1, bot_name=args.bot_name)
if args.prompt_template == args.system_template == 'moss_sft':
# `str.replace` returns a new string, so the result must be reassigned.
if not inner_thoughts_open:
prompt_text = prompt_text.replace(
'- Inner thoughts: enabled.',
'- Inner thoughts: disabled.')
if not calculate_open:
prompt_text = prompt_text.replace(
'- Calculator: enabled. API: Calculate(expression)',
'- Calculator: disabled.')
if not solve_open:
prompt_text = prompt_text.replace(
'- Equation solver: enabled. API: Solve(equation)',
'- Equation solver: disabled.')
if not search_open:
prompt_text = prompt_text.replace(
'- Web search: enabled. API: Search(query)',
'- Web search: disabled.')
else:
prompt_text = text
inputs += prompt_text
if args.image is None:
if n_turn == 0:
ids = tokenizer.encode(inputs, return_tensors='pt')
else:
ids = tokenizer.encode(
inputs, return_tensors='pt', add_special_tokens=False)
if args.with_plugins is not None:
generate_output = llm.generate(
inputs=ids.cuda(),
generation_config=gen_config,
streamer=streamer,
stopping_criteria=stop_criteria).cpu()
generate_output_text = tokenizer.decode(
generate_output[0][len(ids[0]):])
if streamer is None:
end = '' if generate_output_text[-1] == '\n' else '\n'
print(generate_output_text, end=end)
pattern = r'<\|Commands\|>:(.*?)<eoc>'
command_text = ', '.join(
re.findall(pattern, generate_output_text))
extent_text = plugins_api(
command_text,
calculate_open=calculate_open,
solve_open=solve_open,
search_open=search_open)
end = '' if extent_text[-1] == '\n' else '\n'
print(extent_text, end=end)
extent_text_ids = tokenizer.encode(
extent_text,
return_tensors='pt',
add_special_tokens=False)
new_ids = torch.cat((generate_output, extent_text_ids),
dim=1)
generate_output = llm.generate(
inputs=new_ids.cuda(),
generation_config=gen_config,
streamer=streamer,
stopping_criteria=stop_criteria)
if streamer is None:
output_text = tokenizer.decode(
generate_output[0][len(new_ids[0]):])
end = '' if output_text[-1] == '\n' else '\n'
print(output_text, end=end)
else:
generate_output = llm.generate(
inputs=ids.cuda(),
generation_config=gen_config,
streamer=streamer,
stopping_criteria=stop_criteria)
if streamer is None:
output_text = tokenizer.decode(
generate_output[0][len(ids[0]):])
end = '' if output_text[-1] == '\n' else '\n'
print(output_text, end=end)
inputs = tokenizer.decode(generate_output[0])
else:
chunk_encode = []
for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
if idx == 0 and n_turn == 0:
cur_encode = tokenizer.encode(chunk)
else:
cur_encode = tokenizer.encode(
chunk, add_special_tokens=False)
chunk_encode.append(cur_encode)
assert len(chunk_encode) == 2
ids = []
for idx, cur_chunk_encode in enumerate(chunk_encode):
ids.extend(cur_chunk_encode)
if idx != len(chunk_encode) - 1:
ids.append(IMAGE_TOKEN_INDEX)
ids = torch.tensor(ids).cuda().unsqueeze(0)
mm_inputs = prepare_inputs_labels_for_multimodal(
llm=llm, input_ids=ids, pixel_values=pixel_values)
generate_output = llm.generate(
**mm_inputs,
generation_config=gen_config,
streamer=streamer,
bos_token_id=tokenizer.bos_token_id,
stopping_criteria=stop_criteria)
if streamer is None:
output_text = tokenizer.decode(generate_output[0])
end = '' if output_text[-1] == '\n' else '\n'
print(output_text, end=end)
inputs += tokenizer.decode(generate_output[0])
n_turn += 1
inputs += sep
if len(generate_output[0]) >= args.max_new_tokens:
print(
'Remove the memory of history responses, since '
f'it exceeds the length limitation {args.max_new_tokens}.')
n_turn = 0
inputs = ''
if __name__ == '__main__':
main()
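# Illustrative sketch (not part of the original script): in the image branch
# of the chat loop, the prompt is split on the image placeholder, each text
# chunk is tokenized separately (special tokens only on the first chunk), and
# a sentinel id is spliced in where the image features will later be
# inserted. `ToyTokenizer`, `IMAGE_TOKEN` and `IMAGE_INDEX` below are
# hypothetical stand-ins, not the real tokenizer or the library constants.

```python
IMAGE_TOKEN = '[IMG]'   # hypothetical placeholder string
IMAGE_INDEX = -200      # hypothetical sentinel id for the image slot


class ToyTokenizer:
    """Hypothetical whitespace tokenizer; id 1 plays the role of BOS."""
    bos_id = 1

    def __init__(self):
        self.vocab = {}

    def encode(self, text, add_special_tokens=True):
        ids = [self.bos_id] if add_special_tokens else []
        for word in text.split():
            # Assign ids 2, 3, 4, ... in order of first appearance.
            ids.append(self.vocab.setdefault(word, len(self.vocab) + 2))
        return ids


def encode_with_image(prompt, tokenizer):
    """Tokenize text around the placeholder and splice in the sentinel."""
    chunks = prompt.split(IMAGE_TOKEN)
    ids = []
    for idx, chunk in enumerate(chunks):
        ids.extend(tokenizer.encode(chunk, add_special_tokens=(idx == 0)))
        if idx != len(chunks) - 1:
            ids.append(IMAGE_INDEX)
    return ids


ids = encode_with_image(IMAGE_TOKEN + '\nDescribe the picture',
                        ToyTokenizer())
print(ids)
```

# The sentinel ends up between the BOS id and the text ids, so a later step
# can locate it and substitute the projected visual features at that index.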
================================================
FILE: xtuner-eval_niah/xtuner/tools/check_custom_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
from functools import partial
import numpy as np
from datasets import DatasetDict
from mmengine.config import Config
from xtuner.dataset.utils import Packer, encode_fn
from xtuner.registry import BUILDER
def parse_args():
parser = argparse.ArgumentParser(
description='Verify the correctness of the config file for the '
'custom dataset.')
parser.add_argument('config', help='config file name or path.')
args = parser.parse_args()
return args
def is_standard_format(dataset):
example = next(iter(dataset))
if 'conversation' not in example:
return False
conversation = example['conversation']
if not isinstance(conversation, list):
return False
for item in conversation:
if (not isinstance(item, dict)) or ('input'
not in item) or ('output'
not in item):
return False
input, output = item['input'], item['output']
if (not isinstance(input, str)) or (not isinstance(output, str)):
return False
return True
def main():
args = parse_args()
cfg = Config.fromfile(args.config)
tokenizer = BUILDER.build(cfg.tokenizer)
if cfg.get('framework', 'mmengine').lower() == 'huggingface':
train_dataset = cfg.train_dataset
else:
train_dataset = cfg.train_dataloader.dataset
dataset = train_dataset.dataset
max_length = train_dataset.max_length
dataset_map_fn = train_dataset.get('dataset_map_fn', None)
template_map_fn = train_dataset.get('template_map_fn', None)
max_dataset_length = train_dataset.get('max_dataset_length', 10)
split = train_dataset.get('split', 'train')
remove_unused_columns = train_dataset.get('remove_unused_columns', False)
rename_maps = train_dataset.get('rename_maps', [])
shuffle_before_pack = train_dataset.get('shuffle_before_pack', True)
pack_to_max_length = train_dataset.get('pack_to_max_length', True)
input_ids_with_output = train_dataset.get('input_ids_with_output', True)
if dataset.get('path', '') != 'json':
raise ValueError(
'You are using custom datasets for SFT. '
'The custom datasets should be in json format. To load your JSON '
'file, you can use the following code snippet: \n'
'"""\nfrom datasets import load_dataset \n'
'dataset = dict(type=load_dataset, path=\'json\', '
'data_files=\'your_json_file.json\')\n"""\n'
'For more details, please refer to Step 5 in the '
'`Using Custom Datasets` section of the documentation found at'
' docs/zh_cn/user_guides/single_turn_conversation.md.')
try:
dataset = BUILDER.build(dataset)
except RuntimeError:
raise RuntimeError(
'Unable to load the custom JSON file using '
'`datasets.load_dataset`. Your data-related config is '
f'{train_dataset}. Please refer to the official documentation on'
' `load_dataset` (https://huggingface.co/docs/datasets/loading) '
'for more details.')
if isinstance(dataset, DatasetDict):
dataset = dataset[split]
if not is_standard_format(dataset) and dataset_map_fn is None:
raise ValueError(
'If the custom dataset is not in the XTuner-defined '
'format, please utilize `dataset_map_fn` to map the original data'
' to the standard format. For more details, please refer to '
'Step 1 and Step 5 in the `Using Custom Datasets` section of the '
'documentation found at '
'`docs/zh_cn/user_guides/single_turn_conversation.md`.')
if is_standard_format(dataset) and dataset_map_fn is not None:
raise ValueError(
'If the custom dataset is already in the XTuner-defined format, '
'please set `dataset_map_fn` to None. '
'For more details, please refer to Step 1 and Step 5 in the '
'`Using Custom Datasets` section of the documentation found at'
' docs/zh_cn/user_guides/single_turn_conversation.md.')
max_dataset_length = min(max_dataset_length, len(dataset))
indices = np.random.choice(len(dataset), max_dataset_length, replace=False)
dataset = dataset.select(indices)
if dataset_map_fn is not None:
dataset = dataset.map(dataset_map_fn)
print('#' * 20 + ' dataset after `dataset_map_fn` ' + '#' * 20)
print(dataset[0]['conversation'])
if template_map_fn is not None:
template_map_fn = BUILDER.build(template_map_fn)
dataset = dataset.map(template_map_fn)
print('#' * 20 + ' dataset after adding templates ' + '#' * 20)
print(dataset[0]['conversation'])
for old, new in rename_maps:
dataset = dataset.rename_column(old, new)
if pack_to_max_length and (not remove_unused_columns):
raise ValueError('We have to remove unused columns if '
'`pack_to_max_length` is set to True.')
dataset = dataset.map(
partial(
encode_fn,
tokenizer=tokenizer,
max_length=max_length,
input_ids_with_output=input_ids_with_output),
remove_columns=list(dataset.column_names)
if remove_unused_columns else None)
print('#' * 20 + ' encoded input_ids ' + '#' * 20)
print(dataset[0]['input_ids'])
print('#' * 20 + ' encoded labels ' + '#' * 20)
print(dataset[0]['labels'])
if pack_to_max_length and split == 'train':
if shuffle_before_pack:
dataset = dataset.shuffle()
dataset = dataset.flatten_indices()
dataset = dataset.map(Packer(max_length), batched=True)
print('#' * 20 + ' input_ids after packed to max_length ' +
'#' * 20)
print(dataset[0]['input_ids'])
print('#' * 20 + ' labels after packed to max_length ' + '#' * 20)
print(dataset[0]['labels'])
if __name__ == '__main__':
main()
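# Illustrative sketch (not part of the original script): the format check
# above expects each record to carry a `conversation` list whose items are
# dicts with string `input` and `output` fields. The helper and the two
# sample records below are made up for illustration and only restate that
# rule for a single record.

```python
def looks_standard(example):
    """Return True if one record follows the conversation layout."""
    conversation = example.get('conversation')
    if not isinstance(conversation, list):
        return False
    return all(
        isinstance(turn, dict)
        and isinstance(turn.get('input'), str)
        and isinstance(turn.get('output'), str)
        for turn in conversation)


good = {'conversation': [{'input': 'Hi', 'output': 'Hello!'}]}
bad = {'conversation': [{'input': 'Hi'}]}   # `output` is missing

print(looks_standard(good), looks_standard(bad))
```

# A record like `bad` would need a `dataset_map_fn` that rewrites it into the
# layout of `good` before tokenization.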
================================================
FILE: xtuner-eval_niah/xtuner/tools/copy_cfg.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os.path as osp
import shutil
from mmengine.utils import mkdir_or_exist
from xtuner.configs import cfgs_name_path
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('config_name', help='config name')
parser.add_argument('save_dir', help='save directory for copied config')
args = parser.parse_args()
return args
def add_copy_suffix(string):
file_name, ext = osp.splitext(string)
return f'{file_name}_copy{ext}'
def main():
args = parse_args()
mkdir_or_exist(args.save_dir)
config_path = cfgs_name_path[args.config_name]
save_path = osp.join(args.save_dir,
add_copy_suffix(osp.basename(config_path)))
shutil.copyfile(config_path, save_path)
print(f'Copy to {save_path}')
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/data_preprocess/arxiv.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import json
from datetime import datetime
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('src_file', help='source file path')
parser.add_argument('dst_file', help='destination file path')
parser.add_argument(
'--categories',
nargs='+',
default=['cs.AI', 'cs.CL', 'cs.CV'],
help='target categories')
parser.add_argument(
'--start-date',
default='2020-01-01',
help='start date (format: YYYY-MM-DD)')
args = parser.parse_args()
return args
def has_intersection(list1, list2):
set1 = set(list1)
set2 = set(list2)
return len(set1.intersection(set2)) > 0
def read_json_file(file_path):
data = []
with open(file_path) as file:
for line in file:
try:
json_data = json.loads(line)
data.append(json_data)
except json.JSONDecodeError:
print(f'Failed to parse line: {line}')
return data
def main():
args = parse_args()
json_data = read_json_file(args.src_file)
from_time = datetime.strptime(args.start_date, '%Y-%m-%d')
filtered_data = [
item for item in json_data
if has_intersection(args.categories, item['categories'].split())
and datetime.strptime(item['update_date'], '%Y-%m-%d') >= from_time
]
with open(args.dst_file, 'w') as file:
json.dump(filtered_data, file)
print(f'Save to {args.dst_file}\n{len(filtered_data)} items')
if __name__ == '__main__':
main()
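# Illustrative sketch (not part of the original script): the filter above
# keeps an entry when its space-separated `categories` field overlaps the
# target categories and its `update_date` is on or after the start date.
# The three records and the `keep` helper below are invented for
# illustration.

```python
from datetime import datetime

records = [
    {'id': 'a', 'categories': 'cs.CL cs.LG', 'update_date': '2021-03-01'},
    {'id': 'b', 'categories': 'math.CO', 'update_date': '2022-05-10'},
    {'id': 'c', 'categories': 'cs.CV', 'update_date': '2019-12-31'},
]


def keep(item, categories, start_date):
    """Apply the category-overlap and date test to one record."""
    from_time = datetime.strptime(start_date, '%Y-%m-%d')
    in_scope = bool(set(categories) & set(item['categories'].split()))
    recent = datetime.strptime(item['update_date'], '%Y-%m-%d') >= from_time
    return in_scope and recent


kept = [
    r['id'] for r in records
    if keep(r, ['cs.AI', 'cs.CL', 'cs.CV'], '2020-01-01')
]
print(kept)
```

# Record `b` is dropped for its category and record `c` for its date, so
# only `a` passes both conditions.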
================================================
FILE: xtuner-eval_niah/xtuner/tools/data_preprocess/convert_refcoco.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import json
from xtuner.dataset.refcoco_json import RefCOCOJsonDataset
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--ann-path',
default='data/refcoco_annotations',
help='Refcoco annotation path',
)
parser.add_argument(
'--image-path',
default='data/llava_data/llava_images/coco/train2017',
help='COCO image path',
)
parser.add_argument(
'--save-path', default='./', help='The folder to save converted data')
args = parser.parse_args()
return args
if __name__ == '__main__':
args = parse_args()
data_info = [
('refcoco', 'unc'),
('refcoco+', 'unc'),
('refcocog', 'umd'),
]
all_data = []
for dataset, split in data_info:
data = RefCOCOJsonDataset.get_data_json(
ann_path=args.ann_path,
image_path=args.image_path,
dataset=dataset,
splitBy=split,
)[0]
all_data.extend(data)
save_path = args.save_path + '/train.json'
with open(save_path, 'w') as f:
print(f'save to {save_path} with {len(all_data)} items.')
print(all_data[0])
json.dump(all_data, f, indent=4)
================================================
FILE: xtuner-eval_niah/xtuner/tools/eval_refcoco.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os
import os.path as osp
import re
import torch
import tqdm
from huggingface_hub import snapshot_download
from mmengine.dist import get_dist_info, init_dist, master_only
from mmengine.utils.dl_utils import set_multi_processing
from peft import PeftModel
from torch import distributed as dist
from torch.utils.data import DataLoader, DistributedSampler
from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel, GenerationConfig)
from xtuner.dataset.map_fns import llava_map_fn, template_map_fn_factory
from xtuner.dataset.refcoco_json import RefCOCOJsonEvalDataset
from xtuner.model.utils import LoadWoInit, prepare_inputs_labels_for_multimodal
from xtuner.tools.utils import get_stop_criteria
from xtuner.utils import (DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX,
PROMPT_TEMPLATE)
TORCH_DTYPE_MAP = dict(
fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32, auto='auto')
def merge_outputs(outputs):
# Gather the per-rank result lists; requires an initialized process group.
assert dist.is_initialized()
new_outputs = [None for _ in range(dist.get_world_size())]
dist.all_gather_object(new_outputs, outputs)
new_dict = []
for output in new_outputs:
new_dict.extend(output)
return new_dict
@master_only
def master_print(msg):
print(msg)
def parse_args():
parser = argparse.ArgumentParser(description='RefCOCO evaluation')
parser.add_argument(
'model_name_or_path', help='Hugging Face model name or path')
parser.add_argument('--data-path', default=None, help='data path')
parser.add_argument('--work-dir', help='the dir to save results')
parser.add_argument('--llava', default=None, help='llava name or path')
parser.add_argument(
'--visual-encoder', default=None, help='visual encoder name or path')
parser.add_argument(
'--visual-select-layer', type=int, default=-2,
help='visual select layer')
parser.add_argument(
'--prompt-template',
choices=PROMPT_TEMPLATE.keys(),
default=None,
help='Specify a prompt template',
)
parser.add_argument(
'--stop-words', nargs='+', type=str, default=[], help='Stop words')
parser.add_argument(
'--torch-dtype',
default='fp16',
choices=TORCH_DTYPE_MAP.keys(),
help='Override the default `torch.dtype` and load the model under '
'a specific `dtype`.',
)
parser.add_argument(
'--bits',
type=int,
choices=[4, 8, None],
default=None,
help='LLM bits')
parser.add_argument(
'--bot-name', type=str, default='BOT', help='Name for Bot')
parser.add_argument(
'--offload-folder',
default=None,
help='The folder in which to offload the model weights (or where the '
'model weights are already offloaded).',
)
parser.add_argument(
'--max-new-tokens',
type=int,
default=100,
help='Maximum number of new tokens allowed in generated text',
)
parser.add_argument(
'--seed',
type=int,
default=0,
help='Random seed for reproducible text generation',
)
parser.add_argument(
'--launcher',
choices=['none', 'pytorch', 'slurm', 'mpi'],
default='none',
help='job launcher',
)
args = parser.parse_args()
return args
def eval_iou(answers):
def computeIoU(bbox1, bbox2):
x1, y1, x2, y2 = bbox1
x3, y3, x4, y4 = bbox2
intersection_x1 = max(x1, x3)
intersection_y1 = max(y1, y3)
intersection_x2 = min(x2, x4)
intersection_y2 = min(y2, y4)
intersection_area = max(0,
intersection_x2 - intersection_x1 + 1) * max(
0, intersection_y2 - intersection_y1 + 1)
bbox1_area = (x2 - x1 + 1) * (y2 - y1 + 1)
bbox2_area = (x4 - x3 + 1) * (y4 - y3 + 1)
union_area = bbox1_area + bbox2_area - intersection_area
iou = intersection_area / union_area
return iou
right = 0
for answer in answers:
bbox = answer['bbox']
bbox = RefCOCOJsonEvalDataset.normalize_bbox(bbox, answer['height'],
answer['width'])
answer_bbox = [int(x) for x in re.findall(r'\d+', answer['ans'])]
if len(answer_bbox) == 4:
iou = computeIoU(answer_bbox, bbox)
if iou > 0.5:
right += 1
else:
print('Error format sample: ', answer)
return right / len(answers)
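# Illustrative sketch (not part of the original script): a worked example of
# the inclusive-pixel IoU used above, where widths and heights are computed
# as `max - min + 1`. `compute_iou` restates the nested helper so it can run
# on its own; the boxes and the answer string are made-up values.

```python
import re


def compute_iou(bbox1, bbox2):
    """IoU of two [x1, y1, x2, y2] boxes with inclusive pixel coordinates."""
    x1, y1, x2, y2 = bbox1
    x3, y3, x4, y4 = bbox2
    inter_w = max(0, min(x2, x4) - max(x1, x3) + 1)
    inter_h = max(0, min(y2, y4) - max(y1, y3) + 1)
    inter = inter_w * inter_h
    area1 = (x2 - x1 + 1) * (y2 - y1 + 1)
    area2 = (x4 - x3 + 1) * (y4 - y3 + 1)
    return inter / (area1 + area2 - inter)


# Parse a predicted box from free text the same way as above: all integers.
pred = [int(x) for x in re.findall(r'\d+', 'The box is [0, 0, 9, 9].')]
gt = [5, 0, 14, 9]

iou = compute_iou(pred, gt)
print(pred, iou, iou > 0.5)
```

# Both boxes cover 10 x 10 pixels and overlap on a 5 x 10 strip, so the IoU
# is 50 / 150 and this prediction would not count as correct at the 0.5
# threshold.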
def build_model(args):
rank, world_size = get_dist_info()
# build llm
quantization_config = None
load_in_8bit = False
if args.bits == 4:
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4',
)
elif args.bits == 8:
load_in_8bit = True
model_kwargs = {
'quantization_config': quantization_config,
'load_in_8bit': load_in_8bit,
'device_map': rank if world_size > 1 else 'auto',
'offload_folder': args.offload_folder,
'trust_remote_code': True,
'torch_dtype': TORCH_DTYPE_MAP[args.torch_dtype],
}
# build llm
with LoadWoInit():
llm = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
**model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(
args.model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True)
master_print(f'Load LLM from {args.model_name_or_path}')
llava_path = (
snapshot_download(
repo_id=args.llava) if not osp.isdir(args.llava) else args.llava)
# build visual_encoder
if 'visual_encoder' in os.listdir(llava_path):
assert args.visual_encoder is None, (
"Please don't specify the `--visual-encoder` since passed "
'`--llava` contains a visual encoder!')
visual_encoder_path = osp.join(llava_path, 'visual_encoder')
else:
assert (args.visual_encoder is not None
), 'Please specify the `--visual-encoder`!' # noqa: E501
visual_encoder_path = args.visual_encoder
with LoadWoInit():
visual_encoder = CLIPVisionModel.from_pretrained(
visual_encoder_path, torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
image_processor = CLIPImageProcessor.from_pretrained(
visual_encoder_path)
master_print(f'Load visual_encoder from {visual_encoder_path}')
# load adapter
if 'llm_adapter' in os.listdir(llava_path):
adapter_path = osp.join(llava_path, 'llm_adapter')
with LoadWoInit():
llm = PeftModel.from_pretrained(
llm, adapter_path, offload_folder=args.offload_folder)
master_print(f'Load LLM adapter from {args.llava}')
if 'visual_encoder_adapter' in os.listdir(llava_path):
adapter_path = osp.join(llava_path, 'visual_encoder_adapter')
visual_encoder = PeftModel.from_pretrained(
visual_encoder, adapter_path, offload_folder=args.offload_folder)
master_print(f'Load visual_encoder adapter from {args.llava}')
# build projector
projector_path = osp.join(llava_path, 'projector')
with LoadWoInit():
projector = AutoModel.from_pretrained(
projector_path, torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
master_print(f'Load projector from {args.llava}')
projector.cuda()
projector.eval()
visual_encoder.cuda()
visual_encoder.eval()
llm.eval()
return llm, visual_encoder, projector, tokenizer, image_processor
def generate(
llm,
visual_encoder,
projector,
tokenizer,
samples,
visual_select_layer,
):
gen_config = GenerationConfig(
max_new_tokens=100,
do_sample=False,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=(tokenizer.pad_token_id if tokenizer.pad_token_id
is not None else tokenizer.eos_token_id),
)
    stop_criteria = get_stop_criteria(tokenizer=tokenizer, stop_words=['</s>'])
device = next(llm.parameters()).device
# prepare inputs
inputs = samples['conversation'][0]['input'][0]
chunk_encode = []
for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
if idx == 0:
cur_encode = tokenizer.encode(chunk)
else:
cur_encode = tokenizer.encode(chunk, add_special_tokens=False)
chunk_encode.append(cur_encode)
assert len(chunk_encode) == 2
ids = []
for idx, cur_chunk_encode in enumerate(chunk_encode):
ids.extend(cur_chunk_encode)
if idx != len(chunk_encode) - 1:
ids.append(IMAGE_TOKEN_INDEX)
ids = torch.tensor(ids).cuda().unsqueeze(0)
visual_outputs = visual_encoder(
samples['pixel_values'].to(device), output_hidden_states=True)
pixel_values = projector(
visual_outputs.hidden_states[visual_select_layer][:, 1:])
samples['pixel_values'] = pixel_values
samples['input_ids'] = ids
datax = prepare_inputs_labels_for_multimodal(
llm=llm.to(device),
input_ids=samples['input_ids'].to(device),
pixel_values=samples['pixel_values'].to(device),
)
# generation
generation = llm.generate(
**datax,
generation_config=gen_config,
streamer=None,
bos_token_id=tokenizer.bos_token_id,
stopping_criteria=stop_criteria,
)
answer = tokenizer.decode(generation[0])
return {
'ans': answer,
'id': samples['id'][0],
'bbox': torch.tensor(samples['bbox']).tolist(),
'height': samples['height'],
'width': samples['width'],
}
@torch.no_grad()
def main():
# init
args = parse_args()
if args.launcher != 'none':
set_multi_processing(distributed=True)
init_dist(args.launcher)
rank, world_size = get_dist_info()
torch.cuda.set_device(rank)
else:
rank = 0
world_size = 1
print(f'Rank: {rank} / World size: {world_size}')
# build_model
llm, visual_encoder, projector, tokenizer, image_processor = build_model(
args)
# dataset
dataset = RefCOCOJsonEvalDataset(
data_path=args.data_path,
image_folder='data/llava_data/llava_images/',
tokenizer=tokenizer,
image_processor=image_processor,
max_dataset_length=None,
dataset_map_fn=llava_map_fn,
template_map_fn=dict(
type=template_map_fn_factory, template=PROMPT_TEMPLATE.vicuna),
max_length=2048,
pad_image_to_square=False,
)
loader = DataLoader(
dataset,
batch_size=1,
shuffle=False,
sampler=DistributedSampler(dataset, shuffle=False, seed=0),
)
loader.sampler.set_epoch(0)
answers = []
for i, data in tqdm.tqdm(enumerate(loader), desc=f'Rank {rank}'):
answer = generate(
llm,
visual_encoder,
projector,
tokenizer,
data,
args.visual_select_layer,
)
answers.append(answer)
merged_outputs = merge_outputs(answers)
acc = eval_iou(merged_outputs)
master_print(f'Acc: {acc}')
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/get_data_order.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', help='Data folder')
parser.add_argument('--save-folder', help='The folder to save data order.')
parser.add_argument(
'--file-type',
default='.bin',
help='We want to get the order of the file in this type.')
args = parser.parse_args()
return args
def save_data_order(data_folder, save_folder, file_type='.bin'):
assert os.path.exists(data_folder), f'{data_folder} does not exist.'
triples = list(os.walk(data_folder, followlinks=True))
data_order = []
for root, dirs, files in triples:
dirs.sort()
print(f'Reading {root}...')
for fn in sorted(files):
if fn.endswith(file_type):
fp = os.path.join(root, fn)
# Using relative paths so that you can get the same result
# on different clusters
                fp = os.path.relpath(fp, data_folder)
data_order.append(fp)
save_path = os.path.join(save_folder, 'data_order.txt')
with open(save_path, 'w') as f:
for fp in data_order:
f.write(fp + '\n')
if __name__ == '__main__':
args = parse_args()
save_data_order(args.data_folder, args.save_folder, args.file_type)
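# Note: a minimal, standalone sketch of the deterministic walk used above,
# assuming relative paths are computed with `os.path.relpath` so that a
# trailing separator on the data folder does not change the result. The
# helper name `collect_data_order` is hypothetical and not part of xtuner.

```python
import os
import tempfile


def collect_data_order(data_folder, file_type='.bin'):
    """Return matching file paths relative to data_folder, in sorted order."""
    data_order = []
    for root, dirs, files in os.walk(data_folder, followlinks=True):
        # Sorting in place makes os.walk descend into subfolders in a
        # deterministic order on every machine.
        dirs.sort()
        for fn in sorted(files):
            if fn.endswith(file_type):
                fp = os.path.join(root, fn)
                data_order.append(os.path.relpath(fp, data_folder))
    return data_order


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        for rel in ('b/2.bin', 'a/1.bin', 'a/skip.txt'):
            path = os.path.join(tmp, *rel.split('/'))
            os.makedirs(os.path.dirname(path), exist_ok=True)
            open(path, 'w').close()
        # The same list is produced with or without a trailing separator.
        print(collect_data_order(tmp))
        print(collect_data_order(tmp + os.sep))
```

# Slicing off a string prefix by hand depends on how the folder argument was
# typed; `os.path.relpath` normalizes both paths first.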
================================================
FILE: xtuner-eval_niah/xtuner/tools/list_cfg.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
from xtuner.configs import cfgs_name_path
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'-p', '--pattern', default=None, help='Pattern for fuzzy matching')
args = parser.parse_args()
return args
def main():
args = parse_args()
configs_names = sorted(list(cfgs_name_path.keys()))
print('==========================CONFIGS===========================')
if args.pattern is not None:
print(f'PATTERN: {args.pattern}')
print('-------------------------------')
for name in configs_names:
if args.pattern is None or args.pattern.lower() in name.lower():
print(name)
print('=============================================================')
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/list_dataset_format.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from xtuner.dataset.map_fns import DATASET_FORMAT_MAPPING
def main():
dataset_format = DATASET_FORMAT_MAPPING.keys()
print('======================DATASET_FORMAT======================')
for format in dataset_format:
print(format)
print('==========================================================')
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/log_dataset.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
from mmengine.config import Config
from xtuner.registry import BUILDER
def parse_args():
parser = argparse.ArgumentParser(description='Log processed dataset.')
parser.add_argument('config', help='config file name or path.')
# chose which kind of dataset style to show
parser.add_argument(
'--show',
default='text',
choices=['text', 'masked_text', 'input_ids', 'labels', 'all'],
help='which kind of dataset style to show')
args = parser.parse_args()
return args
def main():
args = parse_args()
cfg = Config.fromfile(args.config)
tokenizer = BUILDER.build(cfg.tokenizer)
if cfg.get('framework', 'mmengine').lower() == 'huggingface':
train_dataset = BUILDER.build(cfg.train_dataset)
else:
train_dataset = BUILDER.build(cfg.train_dataloader.dataset)
if args.show == 'text' or args.show == 'all':
print('#' * 20 + ' text ' + '#' * 20)
print(tokenizer.decode(train_dataset[0]['input_ids']))
if args.show == 'masked_text' or args.show == 'all':
print('#' * 20 + ' text(masked) ' + '#' * 20)
masked_text = ' '.join(
['[-100]' for i in train_dataset[0]['labels'] if i == -100])
unmasked_text = tokenizer.decode(
[i for i in train_dataset[0]['labels'] if i != -100])
print(masked_text + ' ' + unmasked_text)
if args.show == 'input_ids' or args.show == 'all':
print('#' * 20 + ' input_ids ' + '#' * 20)
print(train_dataset[0]['input_ids'])
if args.show == 'labels' or args.show == 'all':
print('#' * 20 + ' labels ' + '#' * 20)
print(train_dataset[0]['labels'])
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/mmbench.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import json
import math
import os
import os.path as osp
import re
import string
import time
import numpy as np
import pandas as pd
import torch
import tqdm
from huggingface_hub import snapshot_download
from mmengine import mkdir_or_exist
from mmengine.dist import (collect_results, get_dist_info, get_rank, init_dist,
master_only)
from mmengine.utils.dl_utils import set_multi_processing
from peft import PeftModel
from rich.console import Console
from rich.table import Table
from torch.utils.data import Dataset
from transformers import (AutoModel, AutoModelForCausalLM, AutoTokenizer,
BitsAndBytesConfig, CLIPImageProcessor,
CLIPVisionModel, GenerationConfig)
from xtuner.dataset.utils import decode_base64_to_image, expand2square
from xtuner.model.utils import LoadWoInit, prepare_inputs_labels_for_multimodal
from xtuner.tools.utils import get_stop_criteria, is_cn_string
from xtuner.utils import (DEFAULT_IMAGE_TOKEN, IMAGE_TOKEN_INDEX,
PROMPT_TEMPLATE)
TORCH_DTYPE_MAP = dict(
fp16=torch.float16, bf16=torch.bfloat16, fp32=torch.float32, auto='auto')
def parse_args():
parser = argparse.ArgumentParser(description='MMBench')
parser.add_argument(
'model_name_or_path', help='Hugging Face model name or path')
parser.add_argument('--data-path', default=None, help='data path')
parser.add_argument('--work-dir', help='the dir to save results')
parser.add_argument('--llava', default=None, help='llava name or path')
parser.add_argument(
'--visual-encoder', default=None, help='visual encoder name or path')
    parser.add_argument(
        '--visual-select-layer',
        type=int,
        default=-2,
        help='visual select layer')
parser.add_argument(
'--prompt-template',
choices=PROMPT_TEMPLATE.keys(),
default=None,
help='Specify a prompt template')
parser.add_argument(
'--stop-words', nargs='+', type=str, default=[], help='Stop words')
parser.add_argument(
'--torch-dtype',
default='fp16',
choices=TORCH_DTYPE_MAP.keys(),
help='Override the default `torch.dtype` and load the model under '
'a specific `dtype`.')
parser.add_argument(
'--bits',
type=int,
choices=[4, 8, None],
default=None,
help='LLM bits')
parser.add_argument(
'--bot-name', type=str, default='BOT', help='Name for Bot')
parser.add_argument(
'--offload-folder',
default=None,
help='The folder in which to offload the model weights (or where the '
'model weights are already offloaded).')
parser.add_argument(
'--max-new-tokens',
type=int,
default=100,
help='Maximum number of new tokens allowed in generated text')
parser.add_argument(
'--seed',
type=int,
default=0,
help='Random seed for reproducible text generation')
parser.add_argument(
'--launcher',
choices=['none', 'pytorch', 'slurm', 'mpi'],
default='none',
help='job launcher')
args = parser.parse_args()
return args
@master_only
def master_print(msg):
print(msg)
class MMBenchDataset(Dataset):
ABBRS = {
'coarse_perception': 'CP',
'finegrained_perception (instance-level)': 'FP-S',
'finegrained_perception (cross-instance)': 'FP-C',
'logic_reasoning': 'LR',
'relation_reasoning': 'RR',
'attribute_reasoning': 'AR',
'sketch_reasoning': 'Sketch Reasoning',
'scenery_building': 'Scenery & Building',
'food_clothes': 'Food & Clothes',
'historical_figure': 'Historical Figure',
'traditional_show': 'Traditional Show',
'calligraphy_painting': 'Calligraphy Painting',
'cultural_relic': 'Cultural Relic'
}
def __init__(self, data_file):
self.data_file = data_file
self.df = pd.read_csv(data_file, sep='\t')
self.split = 'dev' if 'answer' in self.df.iloc[0].keys() else 'test'
self.has_l2_category = 'l2-category' in self.df.columns.to_list()
def get_image(self, image):
while len(image) < 16:
image = self.df[self.df['index'] == int(image)]['image'].values
assert len(image) == 1
image = image[0]
image = decode_base64_to_image(image)
return image
def __len__(self):
return len(self.df)
def __getitem__(self, idx):
index = self.df.iloc[idx]['index']
image = self.df.iloc[idx]['image']
image = self.get_image(image)
question = self.df.iloc[idx]['question']
answer = self.df.iloc[idx]['answer'] if 'answer' in self.df.iloc[
0].keys() else None
category = self.df.iloc[idx]['category']
options = {
cand: self.load_from_df(idx, cand)
for cand in string.ascii_uppercase
if self.load_from_df(idx, cand) is not None
}
options_prompt = ''
for key, item in options.items():
options_prompt += f'{key}. {item}\n'
hint = self.load_from_df(idx, 'hint')
data = {
'img': image,
'question': question,
'answer': answer,
'options': options_prompt,
'category': category,
'options_dict': options,
'index': index,
'context': hint,
}
if self.has_l2_category:
data.update({'l2-category': self.df.iloc[idx]['l2-category']})
return data
def load_from_df(self, idx, key):
if key in self.df.iloc[idx] and not pd.isna(self.df.iloc[idx][key]):
return self.df.iloc[idx][key]
else:
return None
@master_only
def eval_result(self, result_df, show=True):
def calc_acc(df, group='category'):
assert group in ['overall', 'category', 'l2-category']
if group == 'overall':
res = {'Average': np.mean(df['hit'])}
else:
res = {}
abilities = list(set(df[group]))
abilities.sort()
for ab in abilities:
sub_df = df[df[group] == ab]
ab = self.ABBRS[ab] if ab in self.ABBRS else ab
res[ab] = np.mean(sub_df['hit'])
return res
def eval_sub_data(sub_data, answer_map):
lt = len(sub_data)
for i in range(lt):
item = sub_data.iloc[i]
match = re.search(r'([A-D]+)', item['prediction'])
pred = match.group(1) if match else ''
gt = answer_map[item['index']]
if gt != pred:
return 0
return 1
def show_result(ret_json):
show_dict = ret_json.copy()
table = Table(title=f' MMBench ({self.data_file}) ')
console = Console()
table.add_column('Category', justify='left')
table.add_column('Accuracy (%)', justify='right')
average = show_dict.pop('Average') * 100
table.add_row('Average', f'{average:.1f}')
table.add_section()
for cat_name, cat_acc in show_dict.items():
table.add_row(cat_name, f'{cat_acc * 100:.1f}')
with console.capture() as capture:
console.print(table, end='')
print('\n' + capture.get())
print('Note: Please be cautious if you use the results in papers, '
"since we don't use ChatGPT as a helper for choice "
'extraction')
data = result_df.sort_values(by='index')
data['prediction'] = [str(x) for x in data['prediction']]
for k in data.keys():
data[k.lower() if k not in 'ABCD' else k] = data.pop(k)
data_main = data[data['index'] < int(1e6)]
cate_map = {
i: c
for i, c in zip(self.df['index'], self.df['category'])
}
if self.has_l2_category:
l2_cate_map = {
i: c
for i, c in zip(self.df['index'], self.df['l2-category'])
}
answer_map = {
i: c
for i, c in zip(self.df['index'], self.df['answer'])
}
lt = len(data_main)
hit, tot = 0, 0
result = {}
for i in range(lt):
item_main = data_main.iloc[i]
idx = item_main['index']
assert idx not in result
sub_data = data[data['index'] % int(1e6) == idx]
ret = eval_sub_data(sub_data, answer_map)
result[idx] = ret
hit += ret
tot += 1
indices = data_main['index']
data_main = data_main.copy()
data_main['hit'] = [result[i] for i in indices]
main_idx = data_main['index']
data_main['category'] = [cate_map[i] for i in main_idx]
ret_json = calc_acc(data_main, 'overall')
if self.has_l2_category:
data_main['l2-category'] = [l2_cate_map[i] for i in main_idx]
l2 = calc_acc(data_main, 'l2-category')
ret_json.update(l2)
else:
leaf = calc_acc(data_main, 'category')
ret_json.update(leaf)
if show:
show_result(ret_json)
return ret_json
def main():
args = parse_args()
torch.manual_seed(args.seed)
if args.launcher != 'none':
set_multi_processing(distributed=True)
init_dist(args.launcher)
rank, world_size = get_dist_info()
torch.cuda.set_device(rank)
else:
rank = 0
world_size = 1
# build llm
quantization_config = None
load_in_8bit = False
if args.bits == 4:
quantization_config = BitsAndBytesConfig(
load_in_4bit=True,
load_in_8bit=False,
llm_int8_threshold=6.0,
llm_int8_has_fp16_weight=False,
bnb_4bit_compute_dtype=torch.float16,
bnb_4bit_use_double_quant=True,
bnb_4bit_quant_type='nf4')
elif args.bits == 8:
load_in_8bit = True
model_kwargs = {
'quantization_config': quantization_config,
'load_in_8bit': load_in_8bit,
'device_map': rank if world_size > 1 else 'auto',
'offload_folder': args.offload_folder,
'trust_remote_code': True,
'torch_dtype': TORCH_DTYPE_MAP[args.torch_dtype]
}
# build llm
with LoadWoInit():
llm = AutoModelForCausalLM.from_pretrained(args.model_name_or_path,
**model_kwargs)
tokenizer = AutoTokenizer.from_pretrained(
args.model_name_or_path,
trust_remote_code=True,
encode_special_tokens=True)
master_print(f'Load LLM from {args.model_name_or_path}')
llava_path = snapshot_download(
repo_id=args.llava) if not osp.isdir(args.llava) else args.llava
# build visual_encoder
if 'visual_encoder' in os.listdir(llava_path):
assert args.visual_encoder is None, (
"Please don't specify the `--visual-encoder` since passed "
'`--llava` contains a visual encoder!')
visual_encoder_path = osp.join(llava_path, 'visual_encoder')
else:
assert args.visual_encoder is not None, (
'Please specify the `--visual-encoder`!')
visual_encoder_path = args.visual_encoder
with LoadWoInit():
visual_encoder = CLIPVisionModel.from_pretrained(
visual_encoder_path, torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
image_processor = CLIPImageProcessor.from_pretrained(
visual_encoder_path)
master_print(f'Load visual_encoder from {visual_encoder_path}')
# load adapter
if 'llm_adapter' in os.listdir(llava_path):
adapter_path = osp.join(llava_path, 'llm_adapter')
with LoadWoInit():
llm = PeftModel.from_pretrained(
llm, adapter_path, offload_folder=args.offload_folder)
master_print(f'Load LLM adapter from {args.llava}')
if 'visual_encoder_adapter' in os.listdir(llava_path):
adapter_path = osp.join(llava_path, 'visual_encoder_adapter')
visual_encoder = PeftModel.from_pretrained(
visual_encoder, adapter_path, offload_folder=args.offload_folder)
master_print(f'Load visual_encoder adapter from {args.llava}')
# build projector
projector_path = osp.join(llava_path, 'projector')
with LoadWoInit():
projector = AutoModel.from_pretrained(
projector_path, torch_dtype=TORCH_DTYPE_MAP[args.torch_dtype])
master_print(f'Load projector from {args.llava}')
projector.cuda()
projector.eval()
visual_encoder.cuda()
visual_encoder.eval()
llm.eval()
stop_words = args.stop_words
if args.prompt_template:
template = PROMPT_TEMPLATE[args.prompt_template]
stop_words += template.get('STOP_WORDS', [])
stop_criteria = get_stop_criteria(
tokenizer=tokenizer, stop_words=stop_words)
gen_config = GenerationConfig(
max_new_tokens=args.max_new_tokens,
do_sample=False,
eos_token_id=tokenizer.eos_token_id,
pad_token_id=tokenizer.pad_token_id
if tokenizer.pad_token_id is not None else tokenizer.eos_token_id,
)
# work_dir
if args.work_dir is not None:
# update configs according to CLI args if args.work_dir is not None
save_dir = args.work_dir
else:
# use config filename as default work_dir
save_dir = osp.join('./work_dirs',
osp.splitext(osp.basename(args.data_path))[0])
timestamp = time.strftime('%Y%m%d_%H%M%S', time.localtime(time.time()))
save_dir = osp.join(save_dir, timestamp)
if rank == 0:
mkdir_or_exist(osp.abspath(save_dir))
print('=======================================================')
print(f'Dataset path: {osp.abspath(args.data_path)}\n'
f'Results will be saved to {osp.abspath(save_dir)}')
print('=======================================================')
args_path = osp.join(save_dir, 'args.json')
with open(args_path, 'w', encoding='utf-8') as f:
json.dump(args.__dict__, f, indent=2)
results_xlsx_path = osp.join(save_dir, 'mmbench_result.xlsx')
results_json_path = osp.join(save_dir, 'mmbench_result.json')
dataset = MMBenchDataset(args.data_path)
results = []
n_samples = len(dataset)
per_rank_samples = math.ceil(n_samples / world_size)
per_rank_ids = range(per_rank_samples * rank,
min(n_samples, per_rank_samples * (rank + 1)))
for i in tqdm.tqdm(per_rank_ids, desc=f'Rank {rank}'):
data_sample = dataset[i]
if data_sample['context'] is not None:
text = data_sample['context'] + '\n' + data_sample[
'question'] + '\n' + data_sample['options']
else:
text = data_sample['question'] + '\n' + data_sample['options']
text = DEFAULT_IMAGE_TOKEN + '\n' + text
if is_cn_string(text):
text = text + '请直接回答选项字母。'
else:
text = text + ("Answer with the option's letter from the "
'given choices directly.')
if args.prompt_template:
prompt_text = ''
template = PROMPT_TEMPLATE[args.prompt_template]
prompt_text += template['INSTRUCTION'].format(
input=text, round=1, bot_name=args.bot_name)
else:
prompt_text = text
inputs = prompt_text
image = data_sample['img'].convert('RGB')
image = expand2square(
image, tuple(int(x * 255) for x in image_processor.image_mean))
image = image_processor.preprocess(
image, return_tensors='pt')['pixel_values'][0]
image = image.cuda().unsqueeze(0).to(visual_encoder.dtype)
visual_outputs = visual_encoder(image, output_hidden_states=True)
pixel_values = projector(
visual_outputs.hidden_states[args.visual_select_layer][:, 1:])
chunk_encode = []
for idx, chunk in enumerate(inputs.split(DEFAULT_IMAGE_TOKEN)):
if idx == 0:
cur_encode = tokenizer.encode(chunk)
else:
cur_encode = tokenizer.encode(chunk, add_special_tokens=False)
chunk_encode.append(cur_encode)
assert len(chunk_encode) == 2
# TODO: Auto-detect whether to prepend a bos_token_id at the beginning.
ids = []
for idx, cur_chunk_encode in enumerate(chunk_encode):
ids.extend(cur_chunk_encode)
if idx != len(chunk_encode) - 1:
ids.append(IMAGE_TOKEN_INDEX)
ids = torch.tensor(ids).cuda().unsqueeze(0)
mm_inputs = prepare_inputs_labels_for_multimodal(
llm=llm, input_ids=ids, pixel_values=pixel_values)
generate_output = llm.generate(
**mm_inputs,
generation_config=gen_config,
streamer=None,
bos_token_id=tokenizer.bos_token_id,
stopping_criteria=stop_criteria)
predict = tokenizer.decode(
generate_output[0], skip_special_tokens=True).strip()
cur_result = {}
cur_result['question'] = data_sample.get('question')
cur_result.update(data_sample.get('options_dict'))
cur_result['prediction'] = predict
if data_sample.get('category') is not None:
cur_result['category'] = data_sample.get('category')
if data_sample.get('l2-category') is not None:
cur_result['l2-category'] = data_sample.get('l2-category')
cur_result['index'] = data_sample.get('index')
cur_result['split'] = data_sample.get('split')
cur_result['answer'] = data_sample.get('answer')
results.append(cur_result)
results = collect_results(results, n_samples)
if get_rank() == 0:
results_df = pd.DataFrame(results)
with pd.ExcelWriter(results_xlsx_path, engine='openpyxl') as writer:
results_df.to_excel(writer, index=False)
if dataset.split == 'dev':
results_dict = dataset.eval_result(results_df, show=True)
with open(results_json_path, 'w', encoding='utf-8') as f:
json.dump(results_dict, f, indent=2)
else:
print('All done!')
if __name__ == '__main__':
main()
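# Note: a minimal sketch of the scoring rule in `eval_result` above, assuming
# plain dicts in place of the pandas DataFrame. Rows whose `index` values are
# equal modulo 1e6 are treated as copies of one question, and the question
# counts as a hit only if every copy is answered correctly. The names
# `extract_choice` and `circular_hits` are hypothetical. Unlike the tool, this
# sketch scores every group, not only those with a base row below 1e6.

```python
import re

GROUP_STRIDE = int(1e6)  # copies of one question share index % GROUP_STRIDE


def extract_choice(prediction):
    """Return the first run of letters A-D found in the prediction."""
    match = re.search(r'([A-D]+)', prediction)
    return match.group(1) if match else ''


def circular_hits(rows):
    """Map each base index to 1 if all of its copies are answered correctly.

    `rows` is a list of dicts with 'index', 'prediction' and 'answer' keys.
    """
    hits = {}
    for row in rows:
        base = row['index'] % GROUP_STRIDE
        correct = extract_choice(row['prediction']) == row['answer']
        # One wrong copy zeroes the whole group; a correct copy keeps
        # whatever value the group already has.
        hits[base] = hits.get(base, 1) if correct else 0
    return hits


if __name__ == '__main__':
    rows = [
        {'index': 1, 'prediction': 'B', 'answer': 'B'},
        {'index': 1000001, 'prediction': 'C', 'answer': 'C'},
        {'index': 2, 'prediction': 'A', 'answer': 'A'},
        {'index': 1000002, 'prediction': 'D', 'answer': 'B'},
    ]
    print(circular_hits(rows))
```

# The regex takes the first uppercase A-D it sees, so a verbose reply such as
# 'Answer: C' is read as 'A'. This is one reason the tool warns against
# quoting its numbers without a stronger choice extractor.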
================================================
FILE: xtuner-eval_niah/xtuner/tools/model_converters/merge.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import torch
from peft import PeftModel
from transformers import (AutoModelForCausalLM, AutoTokenizer,
CLIPImageProcessor, CLIPVisionModel)
from xtuner.model.utils import LoadWoInit
def parse_args():
parser = argparse.ArgumentParser(
description='Merge a HuggingFace adapter to base model')
parser.add_argument('model_name_or_path', help='model name or path')
parser.add_argument('adapter_name_or_path', help='adapter name or path')
parser.add_argument(
'save_dir', help='the directory to save the merged model')
parser.add_argument(
'--max-shard-size',
type=str,
default='2GB',
help='Only applicable for LLM. The maximum size for '
'each sharded checkpoint.')
parser.add_argument(
'--is-clip',
action='store_true',
help='Indicate if the model is a clip model')
parser.add_argument(
'--safe-serialization',
action='store_true',
help='Indicate if using `safe_serialization`')
parser.add_argument(
'--device',
default='cuda',
choices=('cuda', 'cpu', 'auto'),
help='Indicate the device')
args = parser.parse_args()
return args
def main():
args = parse_args()
if args.is_clip:
with LoadWoInit():
model = CLIPVisionModel.from_pretrained(
args.model_name_or_path, device_map=args.device)
processor = CLIPImageProcessor.from_pretrained(args.model_name_or_path)
else:
with LoadWoInit():
model = AutoModelForCausalLM.from_pretrained(
args.model_name_or_path,
torch_dtype=torch.float16,
low_cpu_mem_usage=True,
device_map=args.device,
trust_remote_code=True)
processor = AutoTokenizer.from_pretrained(
args.model_name_or_path, trust_remote_code=True)
model_unmerged = PeftModel.from_pretrained(
model,
args.adapter_name_or_path,
device_map=args.device,
is_trainable=False,
trust_remote_code=True)
model_merged = model_unmerged.merge_and_unload()
print(f'Saving to {args.save_dir}...')
model_merged.save_pretrained(
args.save_dir,
safe_serialization=args.safe_serialization,
max_shard_size=args.max_shard_size)
processor.save_pretrained(args.save_dir)
print('All done!')
if __name__ == '__main__':
main()
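# Note: a minimal sketch of the arithmetic behind merging a LoRA adapter into
# a base weight, assuming the standard LoRA update W' = W + (alpha / r) * B @ A.
# It uses plain nested lists and is not PEFT's implementation; the helper
# names `matmul` and `merge_lora` are hypothetical.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    inner = len(b)
    cols = len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def merge_lora(weight, lora_a, lora_b, alpha, rank):
    """Fold a low-rank update into `weight`.

    weight: out x in, lora_b: out x rank, lora_a: rank x in.
    """
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == '__main__':
    weight = [[1, 0], [0, 1]]
    lora_a = [[1, 2]]    # rank 1, input dim 2
    lora_b = [[1], [3]]  # output dim 2, rank 1
    print(merge_lora(weight, lora_a, lora_b, alpha=2, rank=1))
```

# After merging, the adapter matrices are no longer needed at inference time,
# which is why the script above saves only the merged model and its processor.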
================================================
FILE: xtuner-eval_niah/xtuner/tools/model_converters/modeling_internlm2_reward/__init__.py
================================================
================================================
FILE: xtuner-eval_niah/xtuner/tools/model_converters/modeling_internlm2_reward/configuration_internlm2.py
================================================
# coding=utf-8
# Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on transformers/src/transformers/models/llama/configuration_llama.py
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" InternLM2 model configuration"""
from transformers.configuration_utils import PretrainedConfig
from transformers.utils import logging
logger = logging.get_logger(__name__)
INTERNLM2_PRETRAINED_CONFIG_ARCHIVE_MAP = {}
# Modified from transformers.model.llama.configuration_llama.LlamaConfig
class InternLM2Config(PretrainedConfig):
r"""
This is the configuration class to store the configuration of a [`InternLM2Model`]. It is used to instantiate
an InternLM2 model according to the specified arguments, defining the model architecture. Instantiating a
configuration with the defaults will yield a similar configuration to that of the InternLM2-7B.
Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the
documentation from [`PretrainedConfig`] for more information.
Args:
        vocab_size (`int`, *optional*, defaults to 103168):
Vocabulary size of the InternLM2 model. Defines the number of different tokens that can be represented by the
`inputs_ids` passed when calling [`InternLM2Model`]
hidden_size (`int`, *optional*, defaults to 4096):
Dimension of the hidden representations.
intermediate_size (`int`, *optional*, defaults to 11008):
Dimension of the MLP representations.
num_hidden_layers (`int`, *optional*, defaults to 32):
Number of hidden layers in the Transformer encoder.
num_attention_heads (`int`, *optional*, defaults to 32):
Number of attention heads for each attention layer in the Transformer encoder.
num_key_value_heads (`int`, *optional*):
This is the number of key_value heads that should be used to implement Grouped Query Attention. If
`num_key_value_heads=num_attention_heads`, the model will use Multi Head Attention (MHA), if
            `num_key_value_heads=1` the model will use Multi Query Attention (MQA), otherwise GQA is used. When
converting a multi-head checkpoint to a GQA checkpoint, each group key and value head should be constructed
by meanpooling all the original heads within that group. For more details checkout [this
paper](https://arxiv.org/pdf/2305.13245.pdf). If it is not specified, will default to
`num_attention_heads`.
hidden_act (`str` or `function`, *optional*, defaults to `"silu"`):
The non-linear activation function (function or string) in the decoder.
max_position_embeddings (`int`, *optional*, defaults to 2048):
The maximum sequence length that this model might ever be used with. Typically set this to something large
just in case (e.g., 512 or 1024 or 2048).
initializer_range (`float`, *optional*, defaults to 0.02):
The standard deviation of the truncated_normal_initializer for initializing all weight matrices.
        rms_norm_eps (`float`, *optional*, defaults to 1e-6):
The epsilon used by the rms normalization layers.
use_cache (`bool`, *optional*, defaults to `True`):
Whether or not the model should return the last key/values attentions (not used by all models). Only
relevant if `config.is_decoder=True`.
tie_word_embeddings(`bool`, *optional*, defaults to `False`):
Whether to tie weight embeddings
Example:
"""
model_type = "internlm2"
_auto_class = "AutoConfig"
def __init__( # pylint: disable=W0102
self,
vocab_size=103168,
hidden_size=4096,
intermediate_size=11008,
num_hidden_layers=32,
num_attention_heads=32,
num_key_value_heads=None,
hidden_act="silu",
max_position_embeddings=2048,
initializer_range=0.02,
rms_norm_eps=1e-6,
use_cache=True,
pad_token_id=0,
bos_token_id=1,
eos_token_id=2,
reward_token_id=92527,
tie_word_embeddings=False,
bias=True,
rope_theta=10000,
rope_scaling=None,
attn_implementation="eager",
**kwargs,
):
self.vocab_size = vocab_size
self.max_position_embeddings = max_position_embeddings
self.hidden_size = hidden_size
self.intermediate_size = intermediate_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.bias = bias
if num_key_value_heads is None:
num_key_value_heads = num_attention_heads
self.num_key_value_heads = num_key_value_heads
self.hidden_act = hidden_act
self.initializer_range = initializer_range
self.rms_norm_eps = rms_norm_eps
self.use_cache = use_cache
self.rope_theta = rope_theta
self.rope_scaling = rope_scaling
self._rope_scaling_validation()
self.attn_implementation = attn_implementation
if self.attn_implementation is None:
self.attn_implementation = "eager"
self.reward_token_id = reward_token_id
super().__init__(
pad_token_id=pad_token_id,
bos_token_id=bos_token_id,
eos_token_id=eos_token_id,
tie_word_embeddings=tie_word_embeddings,
**kwargs,
)
def _rope_scaling_validation(self):
"""
Validate the `rope_scaling` configuration.
"""
if self.rope_scaling is None:
return
if not isinstance(self.rope_scaling, dict) or len(self.rope_scaling) != 2:
            raise ValueError(
                "`rope_scaling` must be a dictionary with two fields, `type` and `factor`, "
f"got {self.rope_scaling}"
)
rope_scaling_type = self.rope_scaling.get("type", None)
rope_scaling_factor = self.rope_scaling.get("factor", None)
if rope_scaling_type is None or rope_scaling_type not in ["linear", "dynamic"]:
raise ValueError(
f"`rope_scaling`'s type field must be one of ['linear', 'dynamic'], got {rope_scaling_type}"
)
if rope_scaling_factor is None or not isinstance(rope_scaling_factor, float) or rope_scaling_factor < 1.0:
raise ValueError(f"`rope_scaling`'s factor field must be a float >= 1, got {rope_scaling_factor}")
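# Note: a standalone sketch of the checks performed by
# `_rope_scaling_validation` above, written as a free function so it can be
# tried without transformers installed. The name `validate_rope_scaling` is
# hypothetical.

```python
def validate_rope_scaling(rope_scaling):
    """Raise ValueError unless rope_scaling is None or a valid 2-field dict."""
    if rope_scaling is None:
        return
    if not isinstance(rope_scaling, dict) or len(rope_scaling) != 2:
        raise ValueError(
            "`rope_scaling` must be a dictionary with two fields, "
            f"`type` and `factor`, got {rope_scaling}")
    scaling_type = rope_scaling.get("type")
    factor = rope_scaling.get("factor")
    if scaling_type not in ("linear", "dynamic"):
        raise ValueError(
            f"`type` must be 'linear' or 'dynamic', got {scaling_type}")
    # The factor must be a float: an int such as 2 is rejected.
    if not isinstance(factor, float) or factor < 1.0:
        raise ValueError(f"`factor` must be a float >= 1, got {factor}")


if __name__ == "__main__":
    validate_rope_scaling({"type": "dynamic", "factor": 2.0})  # accepted
    try:
        validate_rope_scaling({"type": "linear", "factor": 2})
    except ValueError as err:
        print(err)
```

# Because of the `isinstance(..., float)` check, a config should spell the
# factor as `2.0`, not `2`.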
================================================
FILE: xtuner-eval_niah/xtuner/tools/model_converters/modeling_internlm2_reward/modeling_internlm2.py
================================================
# Copyright (c) The InternLM team and The HuggingFace Inc. team. All rights reserved.
#
# This code is based on transformers/src/transformers/models/llama/modeling_llama.py
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
""" PyTorch InternLM2 model."""
import math
import queue
import threading
import warnings
from typing import List, Optional, Tuple, Union
import torch
import torch.nn.functional as F
import torch.utils.checkpoint
from einops import rearrange
from torch import nn
from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss
from transformers.activations import ACT2FN
from transformers.modeling_outputs import (
BaseModelOutputWithPast,
CausalLMOutputWithPast,
SequenceClassifierOutputWithPast,
)
from transformers.modeling_utils import PreTrainedModel
from transformers.utils import (
add_start_docstrings,
add_start_docstrings_to_model_forward,
logging,
replace_return_docstrings,
)
try:
from transformers.generation.streamers import BaseStreamer
except: # noqa # pylint: disable=bare-except
BaseStreamer = None
from .configuration_internlm2 import InternLM2Config
logger = logging.get_logger(__name__)
_CONFIG_FOR_DOC = "InternLM2Config"
flash_attn_func, flash_attn_varlen_func = None, None
pad_input, index_first_axis, unpad_input = None, None, None
def _import_flash_attn():
global flash_attn_func, flash_attn_varlen_func
global pad_input, index_first_axis, unpad_input
try:
from flash_attn import flash_attn_func as _flash_attn_func, flash_attn_varlen_func as _flash_attn_varlen_func
from flash_attn.bert_padding import pad_input as _pad_input, index_first_axis as _index_first_axis, unpad_input as _unpad_input
flash_attn_func, flash_attn_varlen_func = _flash_attn_func, _flash_attn_varlen_func
pad_input, index_first_axis, unpad_input = _pad_input, _index_first_axis, _unpad_input
except ImportError:
raise ImportError("flash_attn is not installed.")
# Copied from transformers.models.llama.modeling_llama._get_unpad_data
def _get_unpad_data(attention_mask):
seqlens_in_batch = attention_mask.sum(dim=-1, dtype=torch.int32)
indices = torch.nonzero(attention_mask.flatten(), as_tuple=False).flatten()
max_seqlen_in_batch = seqlens_in_batch.max().item()
cu_seqlens = F.pad(torch.cumsum(seqlens_in_batch, dim=0, dtype=torch.int32), (1, 0))
return (
indices,
cu_seqlens,
max_seqlen_in_batch,
)
# Copied from transformers.models.bart.modeling_bart._make_causal_mask
def _make_causal_mask(
input_ids_shape: torch.Size, dtype: torch.dtype, device: torch.device, past_key_values_length: int = 0
):
"""
Make causal mask used for uni-directional (causal) self-attention.
"""
bsz, tgt_len = input_ids_shape
mask = torch.full((tgt_len, tgt_len), torch.tensor(torch.finfo(dtype).min, device=device), device=device)
mask_cond = torch.arange(mask.size(-1), device=device)
mask.masked_fill_(mask_cond < (mask_cond + 1).view(mask.size(-1), 1), 0)
mask = mask.to(dtype)
if past_key_values_length > 0:
mask = torch.cat([torch.zeros(tgt_len, past_key_values_length, dtype=dtype, device=device), mask], dim=-1)
return mask[None, None, :, :].expand(bsz, 1, tgt_len, tgt_len + past_key_values_length)
# Copied from transformers.models.bart.modeling_bart._expand_mask
def _expand_mask(mask: torch.Tensor, dtype: torch.dtype, tgt_len: Optional[int] = None):
"""
Expands attention_mask from `[bsz, seq_len]` to `[bsz, 1, tgt_seq_len, src_seq_len]`.
"""
bsz, src_len = mask.size()
tgt_len = tgt_len if tgt_len is not None else src_len
expanded_mask = mask[:, None, None, :].expand(bsz, 1, tgt_len, src_len).to(dtype)
inverted_mask = 1.0 - expanded_mask
return inverted_mask.masked_fill(inverted_mask.to(torch.bool), torch.finfo(dtype).min)
# Copied from transformers.models.llama.modeling_llama.LlamaRMSNorm with Llama->InternLM2
class InternLM2RMSNorm(nn.Module):
def __init__(self, hidden_size, eps=1e-6):
"""
InternLM2RMSNorm is equivalent to T5LayerNorm
"""
super().__init__()
self.weight = nn.Parameter(torch.ones(hidden_size))
self.variance_epsilon = eps
def forward(self, hidden_states):
input_dtype = hidden_states.dtype
hidden_states = hidden_states.to(torch.float32)
variance = hidden_states.pow(2).mean(-1, keepdim=True)
hidden_states = hidden_states * torch.rsqrt(variance + self.variance_epsilon)
return self.weight * hidden_states.to(input_dtype)
# Copied from transformers.models.llama.modeling_llama.LlamaRotaryEmbedding with Llama->InternLM2
class InternLM2RotaryEmbedding(nn.Module):
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None):
super().__init__()
self.dim = dim
self.max_position_embeddings = max_position_embeddings
self.base = base
inv_freq = 1.0 / (self.base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
# Build here to make `torch.jit.trace` work.
self._set_cos_sin_cache(
seq_len=max_position_embeddings, device=self.inv_freq.device, dtype=torch.get_default_dtype()
)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
freqs = torch.einsum("i,j->ij", t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
def forward(self, x, seq_len=None):
# x: [bs, num_attention_heads, seq_len, head_size]
if seq_len > self.max_seq_len_cached:
self._set_cos_sin_cache(seq_len=seq_len, device=x.device, dtype=torch.float32)
return (
self.cos_cached[:seq_len].to(dtype=x.dtype),
self.sin_cached[:seq_len].to(dtype=x.dtype),
)
# Copied from transformers.models.llama.modeling_llama.LlamaLinearScalingRotaryEmbedding with Llama->InternLM2
class InternLM2LinearScalingRotaryEmbedding(InternLM2RotaryEmbedding):
"""InternLM2RotaryEmbedding extended with linear scaling. Credits to the Reddit user /u/kaiokendev"""
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
t = t / self.scaling_factor
freqs = torch.einsum("i,j->ij", t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
# Copied from transformers.models.llama.modeling_llama.LlamaDynamicNTKScalingRotaryEmbedding with Llama->InternLM2
class InternLM2DynamicNTKScalingRotaryEmbedding(InternLM2RotaryEmbedding):
"""InternLM2RotaryEmbedding extended with Dynamic NTK scaling.
Credits to the Reddit users /u/bloc97 and /u/emozilla.
"""
def __init__(self, dim, max_position_embeddings=2048, base=10000, device=None, scaling_factor=1.0):
self.scaling_factor = scaling_factor
super().__init__(dim, max_position_embeddings, base, device)
def _set_cos_sin_cache(self, seq_len, device, dtype):
self.max_seq_len_cached = seq_len
if seq_len > self.max_position_embeddings:
base = self.base * (
(self.scaling_factor * seq_len / self.max_position_embeddings) - (self.scaling_factor - 1)
) ** (self.dim / (self.dim - 2))
inv_freq = 1.0 / (base ** (torch.arange(0, self.dim, 2).float().to(device) / self.dim))
self.register_buffer("inv_freq", inv_freq, persistent=False)
t = torch.arange(self.max_seq_len_cached, device=device, dtype=self.inv_freq.dtype)
freqs = torch.einsum("i,j->ij", t, self.inv_freq)
# Different from paper, but it uses a different permutation in order to obtain the same calculation
emb = torch.cat((freqs, freqs), dim=-1)
self.register_buffer("cos_cached", emb.cos().to(dtype), persistent=False)
self.register_buffer("sin_cached", emb.sin().to(dtype), persistent=False)
# Copied from transformers.models.llama.modeling_llama.rotate_half
def rotate_half(x):
"""Rotates half the hidden dims of the input."""
x1 = x[..., : x.shape[-1] // 2]
x2 = x[..., x.shape[-1] // 2 :]
return torch.cat((-x2, x1), dim=-1)
# Copied from transformers.models.llama.modeling_llama.apply_rotary_pos_emb
def apply_rotary_pos_emb(q, k, cos, sin, position_ids, unsqueeze_dim=1):
"""Applies Rotary Position Embedding to the query and key tensors."""
cos = cos[position_ids].unsqueeze(unsqueeze_dim)
sin = sin[position_ids].unsqueeze(unsqueeze_dim)
q_embed = (q * cos) + (rotate_half(q) * sin)
k_embed = (k * cos) + (rotate_half(k) * sin)
return q_embed, k_embed
class InternLM2MLP(nn.Module):
def __init__(self, config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.intermediate_size = config.intermediate_size
self.w1 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.w3 = nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
self.w2 = nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
self.act_fn = ACT2FN[config.hidden_act]
def forward(self, x):
down_proj = self.w2(self.act_fn(self.w1(x)) * self.w3(x))
return down_proj
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
"""
This is the equivalent of torch.repeat_interleave(x, dim=1, repeats=n_rep). The hidden states go from (batch,
num_key_value_heads, seqlen, head_dim) to (batch, num_attention_heads, seqlen, head_dim)
"""
batch, num_key_value_heads, slen, head_dim = hidden_states.shape
if n_rep == 1:
return hidden_states
hidden_states = hidden_states[:, :, None, :, :].expand(batch, num_key_value_heads, n_rep, slen, head_dim)
return hidden_states.reshape(batch, num_key_value_heads * n_rep, slen, head_dim)
# Modified from transformers.models.llama.modeling_llama.LlamaAttention
class InternLM2Attention(nn.Module):
"""Multi-headed attention from 'Attention Is All You Need' paper"""
def __init__(self, config: InternLM2Config):
super().__init__()
self.config = config
self.hidden_size = config.hidden_size
self.num_heads = config.num_attention_heads
self.head_dim = self.hidden_size // self.num_heads
self.num_key_value_heads = config.num_key_value_heads
self.num_key_value_groups = self.num_heads // self.num_key_value_heads
self.max_position_embeddings = config.max_position_embeddings
self.is_causal = True
if (self.head_dim * self.num_heads) != self.hidden_size:
raise ValueError(
f"hidden_size must be divisible by num_heads (got `hidden_size`: {self.hidden_size}"
f" and `num_heads`: {self.num_heads})."
)
self.wqkv = nn.Linear(
self.hidden_size,
(self.num_heads + 2 * self.num_key_value_heads) * self.head_dim,
bias=config.bias,
)
self.wo = nn.Linear(self.num_heads * self.head_dim, self.hidden_size, bias=config.bias)
self._init_rope()
def _init_rope(self):
if self.config.rope_scaling is None:
self.rotary_emb = InternLM2RotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.config.rope_theta,
)
else:
scaling_type = self.config.rope_scaling["type"]
scaling_factor = self.config.rope_scaling["factor"]
if scaling_type == "dynamic":
self.rotary_emb = InternLM2DynamicNTKScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.config.rope_theta,
scaling_factor=scaling_factor,
)
elif scaling_type == "linear":
self.rotary_emb = InternLM2LinearScalingRotaryEmbedding(
self.head_dim,
max_position_embeddings=self.max_position_embeddings,
base=self.config.rope_theta,
scaling_factor=scaling_factor,
)
else:
raise ValueError("Currently we only support rotary embedding's type being 'dynamic' or 'linear'.")
return self.rotary_emb
def _shape(self, tensor: torch.Tensor, seq_len: int, bsz: int):
return tensor.view(bsz, seq_len, self.num_heads, self.head_dim).transpose(1, 2).contiguous()
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. "
"Please make sure to use `attention_mask` instead."
)
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
"b q (h gs d) -> b q h gs d",
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., : self.num_key_value_groups, :]
query_states = rearrange(query_states, "b q h gs d -> b q (h gs) d")
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
key_states = repeat_kv(key_states, self.num_key_value_groups)
value_states = repeat_kv(value_states, self.num_key_value_groups)
attn_weights = torch.matmul(query_states, key_states.transpose(2, 3)) / math.sqrt(self.head_dim)
if attn_weights.size() != (bsz, self.num_heads, q_len, kv_seq_len):
raise ValueError(
f"Attention weights should be of size {(bsz, self.num_heads, q_len, kv_seq_len)}, but is"
f" {attn_weights.size()}"
)
if attention_mask is not None:
if attention_mask.size() != (bsz, 1, q_len, kv_seq_len):
raise ValueError(
f"Attention mask should be of size {(bsz, 1, q_len, kv_seq_len)}, but is {attention_mask.size()}"
)
attn_weights = attn_weights + attention_mask
# upcast attention to fp32
attn_weights = nn.functional.softmax(attn_weights, dim=-1, dtype=torch.float32).to(query_states.dtype)
attn_output = torch.matmul(attn_weights, value_states)
if attn_output.size() != (bsz, self.num_heads, q_len, self.head_dim):
raise ValueError(
f"`attn_output` should be of size {(bsz, self.num_heads, q_len, self.head_dim)}, but is"
f" {attn_output.size()}"
)
attn_output = attn_output.transpose(1, 2).contiguous()
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
# Modified from transformers.models.llama.modeling_llama.LlamaFlashAttention2
class InternLM2FlashAttention2(InternLM2Attention):
"""
InternLM2 flash attention module. This module inherits from `InternLM2Attention` as the weights of the module stay
untouched. The only required change is in the forward pass, where it needs to correctly call the public API of
flash attention and deal with padding tokens in case the input contains any of them.
"""
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.LongTensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: bool = False,
use_cache: bool = False,
**kwargs,
) -> Tuple[torch.Tensor, Optional[torch.Tensor], Optional[Tuple[torch.Tensor]]]:
# InternLM2FlashAttention2 attention does not support output_attentions
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. "
"Please make sure to use `attention_mask` instead."
)
# overwrite attention_mask with padding_mask
attention_mask = kwargs.pop("padding_mask")
output_attentions = False
bsz, q_len, _ = hidden_states.size()
qkv_states = self.wqkv(hidden_states)
qkv_states = rearrange(
qkv_states,
"b q (h gs d) -> b q h gs d",
gs=2 + self.num_key_value_groups,
d=self.head_dim,
)
query_states = qkv_states[..., : self.num_key_value_groups, :]
query_states = rearrange(query_states, "b q h gs d -> b q (h gs) d")
key_states = qkv_states[..., -2, :]
value_states = qkv_states[..., -1, :]
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
kv_seq_len = key_states.shape[-2]
if past_key_value is not None:
kv_seq_len += past_key_value[0].shape[-2]
cos, sin = self.rotary_emb(value_states, seq_len=kv_seq_len)
query_states, key_states = apply_rotary_pos_emb(query_states, key_states, cos, sin, position_ids)
if past_key_value is not None:
# reuse k, v, self_attention
key_states = torch.cat([past_key_value[0], key_states], dim=2)
value_states = torch.cat([past_key_value[1], value_states], dim=2)
past_key_value = (key_states, value_states) if use_cache else None
query_states = query_states.transpose(1, 2)
key_states = key_states.transpose(1, 2)
value_states = value_states.transpose(1, 2)
attn_output = self._flash_attention_forward(
query_states, key_states, value_states, attention_mask, q_len
)
attn_output = attn_output.reshape(bsz, q_len, self.hidden_size).contiguous()
attn_output = self.wo(attn_output)
if not output_attentions:
attn_weights = None
return attn_output, attn_weights, past_key_value
def _flash_attention_forward(
self, query_states, key_states, value_states, attention_mask, query_length, dropout=0.0, softmax_scale=None
):
"""
Calls the forward method of Flash Attention. If the input hidden states contain at least one padding token,
first unpads the input, then computes the attention scores and pads the final attention scores.
Args:
query_states (`torch.Tensor`):
Input query states to be passed to Flash Attention API
key_states (`torch.Tensor`):
Input key states to be passed to Flash Attention API
value_states (`torch.Tensor`):
Input value states to be passed to Flash Attention API
attention_mask (`torch.Tensor`):
The padding mask - corresponds to a tensor of size `(batch_size, seq_len)` where 0 stands for the
position of padding tokens and 1 for the position of non-padding tokens.
query_length (`int`):
Length of the query sequence, used to choose how the queries are unpadded and to re-pad the output.
dropout (`float`, *optional*):
Attention dropout
softmax_scale (`float`, *optional*):
The scaling of QK^T before applying softmax. Default to 1 / sqrt(head_dim)
"""
# Decoding a single query token against the cache needs no causal mask
causal = self.is_causal and query_length != 1
# Contains at least one padding token in the sequence
if attention_mask is not None:
batch_size = query_states.shape[0]
query_states, key_states, value_states, indices_q, cu_seq_lens, max_seq_lens = self._unpad_input(
query_states, key_states, value_states, attention_mask, query_length
)
cu_seqlens_q, cu_seqlens_k = cu_seq_lens
max_seqlen_in_batch_q, max_seqlen_in_batch_k = max_seq_lens
attn_output_unpad = flash_attn_varlen_func(
query_states,
key_states,
value_states,
cu_seqlens_q=cu_seqlens_q,
cu_seqlens_k=cu_seqlens_k,
max_seqlen_q=max_seqlen_in_batch_q,
max_seqlen_k=max_seqlen_in_batch_k,
dropout_p=dropout,
softmax_scale=softmax_scale,
causal=causal,
)
attn_output = pad_input(attn_output_unpad, indices_q, batch_size, query_length)
else:
attn_output = flash_attn_func(
query_states, key_states, value_states, dropout, softmax_scale=softmax_scale, causal=causal
)
return attn_output
def _unpad_input(self, query_layer, key_layer, value_layer, attention_mask, query_length):
indices_k, cu_seqlens_k, max_seqlen_in_batch_k = _get_unpad_data(attention_mask)
batch_size, kv_seq_len, num_key_value_heads, head_dim = key_layer.shape
key_layer = index_first_axis(
key_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
)
value_layer = index_first_axis(
value_layer.reshape(batch_size * kv_seq_len, num_key_value_heads, head_dim), indices_k
)
if query_length == kv_seq_len:
query_layer = index_first_axis(
query_layer.reshape(batch_size * kv_seq_len, self.num_heads, head_dim), indices_k
)
cu_seqlens_q = cu_seqlens_k
max_seqlen_in_batch_q = max_seqlen_in_batch_k
indices_q = indices_k
elif query_length == 1:
max_seqlen_in_batch_q = 1
cu_seqlens_q = torch.arange(
batch_size + 1, dtype=torch.int32, device=query_layer.device
) # There is a memcpy here, that is very bad.
indices_q = cu_seqlens_q[:-1]
query_layer = query_layer.squeeze(1)
else:
# The -q_len: slice assumes left padding.
attention_mask = attention_mask[:, -query_length:]
query_layer, indices_q, cu_seqlens_q, max_seqlen_in_batch_q = unpad_input(query_layer, attention_mask)
return (
query_layer,
key_layer,
value_layer,
indices_q.to(torch.int64),
(cu_seqlens_q, cu_seqlens_k),
(max_seqlen_in_batch_q, max_seqlen_in_batch_k),
)
INTERNLM2_ATTENTION_CLASSES = {
"eager": InternLM2Attention,
"flash_attention_2": InternLM2FlashAttention2,
}
# Modified from transformers.models.llama.modeling_llama.LlamaDecoderLayer
class InternLM2DecoderLayer(nn.Module):
def __init__(self, config: InternLM2Config):
super().__init__()
self.hidden_size = config.hidden_size
self.attention = INTERNLM2_ATTENTION_CLASSES[config.attn_implementation](config=config)
self.feed_forward = InternLM2MLP(config)
self.attention_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.ffn_norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
def forward(
self,
hidden_states: torch.Tensor,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_value: Optional[Tuple[torch.Tensor]] = None,
output_attentions: Optional[bool] = False,
use_cache: Optional[bool] = False,
**kwargs,
) -> Tuple[torch.FloatTensor, Optional[Tuple[torch.FloatTensor, torch.FloatTensor]]]:
"""
Args:
hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
attention_mask (`torch.FloatTensor`, *optional*):
attention mask of size `(batch_size, sequence_length)` if flash attention is used or `(batch_size, 1,
query_sequence_length, key_sequence_length)` if default attention is used.
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under
returned tensors for more detail.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding
(see `past_key_values`).
past_key_value (`Tuple(torch.FloatTensor)`, *optional*): cached past key and value projection states
"""
if "padding_mask" in kwargs:
warnings.warn(
"Passing `padding_mask` is deprecated and will be removed in v4.37. "
"Please make sure to use `attention_mask` instead."
)
residual = hidden_states
hidden_states = self.attention_norm(hidden_states)
# Self Attention
hidden_states, self_attn_weights, present_key_value = self.attention(
hidden_states=hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
**kwargs,
)
hidden_states = residual + hidden_states
# Fully Connected
residual = hidden_states
hidden_states = self.ffn_norm(hidden_states)
hidden_states = self.feed_forward(hidden_states)
hidden_states = residual + hidden_states
outputs = (hidden_states,)
if output_attentions:
outputs += (self_attn_weights,)
if use_cache:
outputs += (present_key_value,)
return outputs
InternLM2_START_DOCSTRING = r"""
This model inherits from [`PreTrainedModel`]. Check the superclass documentation for the generic methods the
library implements for all its models (such as downloading or saving, resizing the input embeddings, pruning heads
etc.)
This model is also a PyTorch [torch.nn.Module](https://pytorch.org/docs/stable/nn.html#torch.nn.Module) subclass.
Use it as a regular PyTorch Module and refer to the PyTorch documentation for all matters related to general usage
and behavior.
Parameters:
config ([`InternLM2Config`]):
Model configuration class with all the parameters of the model. Initializing with a config file does not
load the weights associated with the model, only the configuration. Check out the
[`~PreTrainedModel.from_pretrained`] method to load the model weights.
"""
# Copied from transformers.models.llama.modeling_llama.LlamaPreTrainedModel with Llama->InternLM2
@add_start_docstrings(
"The bare InternLM2 Model outputting raw hidden-states without any specific head on top.",
InternLM2_START_DOCSTRING,
)
class InternLM2PreTrainedModel(PreTrainedModel):
config_class = InternLM2Config
base_model_prefix = "model"
supports_gradient_checkpointing = True
_no_split_modules = ["InternLM2DecoderLayer"]
_skip_keys_device_placement = "past_key_values"
def _init_weights(self, module):
std = self.config.initializer_range
if isinstance(module, nn.Linear):
module.weight.data.normal_(mean=0.0, std=std)
if module.bias is not None:
module.bias.data.zero_()
elif isinstance(module, nn.Embedding):
module.weight.data.normal_(mean=0.0, std=std)
if module.padding_idx is not None:
module.weight.data[module.padding_idx].zero_()
InternLM2_INPUTS_DOCSTRING = r"""
Args:
input_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`):
Indices of input sequence tokens in the vocabulary. Padding will be ignored by default should you provide
it.
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
[What are input IDs?](../glossary#input-ids)
attention_mask (`torch.Tensor` of shape `(batch_size, sequence_length)`, *optional*):
Mask to avoid performing attention on padding token indices. Mask values selected in `[0, 1]`:
- 1 for tokens that are **not masked**,
- 0 for tokens that are **masked**.
[What are attention masks?](../glossary#attention-mask)
Indices can be obtained using [`AutoTokenizer`]. See [`PreTrainedTokenizer.encode`] and
[`PreTrainedTokenizer.__call__`] for details.
If `past_key_values` is used, optionally only the last `input_ids` have to be input (see
`past_key_values`).
If you want to change padding behavior, you should read [`modeling_opt._prepare_decoder_attention_mask`]
and modify to your needs. See diagram 1 in [the paper](https://arxiv.org/abs/1910.13461) for more
information on the default strategy.
position_ids (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Indices of positions of each input sequence token in the position embeddings. Selected in the range `[0,
config.max_position_embeddings - 1]`.
[What are position IDs?](../glossary#position-ids)
past_key_values (`tuple(tuple(torch.FloatTensor))`, *optional*, returned when `use_cache=True` is passed or
when `config.use_cache=True`):
Tuple of `tuple(torch.FloatTensor)` of length `config.num_hidden_layers`, with each tuple having 2 tensors of shape
`(batch_size, num_key_value_heads, sequence_length, embed_size_per_head)`.
Contains pre-computed key and value states of the self-attention blocks that can be used (see
`past_key_values` input) to speed up sequential decoding.
If `past_key_values` are used, the user can optionally input only the last `input_ids` (those that don't
have their past key value states given to this model) of shape `(batch_size, 1)` instead of all `input_ids`
of shape `(batch_size, sequence_length)`.
inputs_embeds (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`, *optional*):
Optionally, instead of passing `input_ids` you can choose to directly pass an embedded representation. This
is useful if you want more control over how to convert `input_ids` indices into associated vectors than the
model's internal embedding lookup matrix.
use_cache (`bool`, *optional*):
If set to `True`, `past_key_values` key value states are returned and can be used to speed up decoding (see
`past_key_values`).
output_attentions (`bool`, *optional*):
Whether or not to return the attentions tensors of all attention layers. See `attentions` under returned
tensors for more detail.
output_hidden_states (`bool`, *optional*):
Whether or not to return the hidden states of all layers. See `hidden_states` under returned tensors for
more detail.
return_dict (`bool`, *optional*):
Whether or not to return a [`~utils.ModelOutput`] instead of a plain tuple.
"""
# Modified from transformers.models.llama.modeling_llama.LlamaModel
@add_start_docstrings(
"The bare InternLM2 Model outputting raw hidden-states without any specific head on top.",
InternLM2_START_DOCSTRING,
)
class InternLM2Model(InternLM2PreTrainedModel):
"""
Transformer decoder consisting of *config.num_hidden_layers* layers. Each layer is an [`InternLM2DecoderLayer`]
Args:
config: InternLM2Config
"""
_auto_class = "AutoModel"
def __init__(self, config: InternLM2Config):
super().__init__(config)
self.padding_idx = config.pad_token_id
self.vocab_size = config.vocab_size
self.config = config
self.tok_embeddings = nn.Embedding(config.vocab_size, config.hidden_size, self.padding_idx)
self.layers = nn.ModuleList([InternLM2DecoderLayer(config) for _ in range(config.num_hidden_layers)])
self.norm = InternLM2RMSNorm(config.hidden_size, eps=config.rms_norm_eps)
self.gradient_checkpointing = False
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.tok_embeddings
def set_input_embeddings(self, value):
self.tok_embeddings = value
def _prepare_decoder_attention_mask(self, attention_mask, input_shape, inputs_embeds, past_key_values_length):
# create causal mask
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
combined_attention_mask = None
if input_shape[-1] > 1:
combined_attention_mask = _make_causal_mask(
input_shape,
inputs_embeds.dtype,
device=inputs_embeds.device,
past_key_values_length=past_key_values_length,
)
if attention_mask is not None:
# [bsz, seq_len] -> [bsz, 1, tgt_seq_len, src_seq_len]
expanded_attn_mask = _expand_mask(attention_mask, inputs_embeds.dtype, tgt_len=input_shape[-1]).to(
inputs_embeds.device
)
combined_attention_mask = (
expanded_attn_mask if combined_attention_mask is None else expanded_attn_mask + combined_attention_mask
)
return combined_attention_mask
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, BaseModelOutputWithPast]:
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
use_cache = use_cache if use_cache is not None else self.config.use_cache
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
if self.config.attn_implementation == "flash_attention_2":
_import_flash_attn()
# retrieve input_ids and inputs_embeds
if input_ids is not None and inputs_embeds is not None:
raise ValueError("You cannot specify both input_ids and inputs_embeds at the same time")
elif input_ids is not None:
batch_size, seq_length = input_ids.shape[:2]
elif inputs_embeds is not None:
batch_size, seq_length = inputs_embeds.shape[:2]
else:
raise ValueError("You have to specify either input_ids or inputs_embeds")
seq_length_with_past = seq_length
past_key_values_length = 0
if past_key_values is not None:
past_key_values_length = past_key_values[0][0].shape[2]
seq_length_with_past = seq_length_with_past + past_key_values_length
if position_ids is None:
device = input_ids.device if input_ids is not None else inputs_embeds.device
position_ids = torch.arange(
past_key_values_length, seq_length + past_key_values_length, dtype=torch.long, device=device
)
position_ids = position_ids.unsqueeze(0)
if inputs_embeds is None:
inputs_embeds = self.tok_embeddings(input_ids)
if self.config.attn_implementation == "flash_attention_2":
# 2d mask is passed through the layers
attention_mask = attention_mask if (attention_mask is not None and 0 in attention_mask) else None
else:
if attention_mask is None:
attention_mask = torch.ones(
(batch_size, seq_length_with_past), dtype=torch.bool, device=inputs_embeds.device
)
attention_mask = self._prepare_decoder_attention_mask(
attention_mask, (batch_size, seq_length), inputs_embeds, past_key_values_length
)
# embed positions
hidden_states = inputs_embeds
if self.gradient_checkpointing and self.training:
if use_cache:
logger.warning_once(
"`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`..."
)
use_cache = False
# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
next_decoder_cache = () if use_cache else None
for idx, decoder_layer in enumerate(self.layers):
if output_hidden_states:
all_hidden_states += (hidden_states,)
past_key_value = past_key_values[idx] if past_key_values is not None else None
if self.gradient_checkpointing and self.training:
def create_custom_forward(module):
def custom_forward(*inputs):
# None for past_key_value
return module(*inputs, output_attentions, None)
return custom_forward
layer_outputs = torch.utils.checkpoint.checkpoint(
create_custom_forward(decoder_layer),
hidden_states,
attention_mask,
position_ids,
None,
)
else:
layer_outputs = decoder_layer(
hidden_states,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_value=past_key_value,
output_attentions=output_attentions,
use_cache=use_cache,
)
hidden_states = layer_outputs[0]
if use_cache:
next_decoder_cache += (layer_outputs[2 if output_attentions else 1],)
if output_attentions:
all_self_attns += (layer_outputs[1],)
hidden_states = self.norm(hidden_states)
# add hidden states from the last decoder layer
if output_hidden_states:
all_hidden_states += (hidden_states,)
next_cache = next_decoder_cache if use_cache else None
if not return_dict:
return tuple(v for v in [hidden_states, next_cache, all_hidden_states, all_self_attns] if v is not None)
return BaseModelOutputWithPast(
last_hidden_state=hidden_states,
past_key_values=next_cache,
hidden_states=all_hidden_states,
attentions=all_self_attns,
)
# Modified from transformers.models.llama.modeling_llama.LlamaForCausalLM
class InternLM2ForCausalLM(InternLM2PreTrainedModel):
_auto_class = "AutoModelForCausalLM"
_tied_weights_keys = ["output.weight"]
def __init__(self, config):
super().__init__(config)
self.model = InternLM2Model(config)
self.vocab_size = config.vocab_size
self.output = nn.Linear(config.hidden_size, config.vocab_size, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
def get_output_embeddings(self):
return self.output
def set_output_embeddings(self, new_embeddings):
self.output = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=CausalLMOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, CausalLMOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor` of shape `(batch_size, sequence_length)`, *optional*):
Labels for computing the masked language modeling loss. Indices should either be in `[0, ...,
config.vocab_size]` or -100 (see `input_ids` docstring). Tokens with indices set to `-100` are ignored
(masked), the loss is only computed for the tokens with labels in `[0, ..., config.vocab_size]`.
Returns:
Example:
```python
>>> from transformers import AutoTokenizer, InternLM2ForCausalLM
>>> model = InternLM2ForCausalLM.from_pretrained(PATH_TO_CONVERTED_WEIGHTS)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER)
>>> prompt = "Hey, are you conscious? Can you talk to me?"
>>> inputs = tokenizer(prompt, return_tensors="pt")
>>> # Generate
>>> generate_ids = model.generate(inputs.input_ids, max_length=30)
>>> tokenizer.batch_decode(generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
"Hey, are you conscious? Can you talk to me?\nI'm not conscious, but I can talk to you."
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
logits = self.output(hidden_states)
logits = logits.float()
loss = None
if labels is not None:
# Shift so that tokens < n predict n
shift_logits = logits[..., :-1, :].contiguous()
shift_labels = labels[..., 1:].contiguous()
# Flatten the tokens
loss_fct = CrossEntropyLoss()
shift_logits = shift_logits.view(-1, self.config.vocab_size)
shift_labels = shift_labels.view(-1)
# Enable model parallelism
shift_labels = shift_labels.to(shift_logits.device)
loss = loss_fct(shift_logits, shift_labels)
if not return_dict:
output = (logits,) + outputs[1:]
return (loss,) + output if loss is not None else output
return CausalLMOutputWithPast(
loss=loss,
logits=logits,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
def prepare_inputs_for_generation(
self, input_ids, past_key_values=None, attention_mask=None, inputs_embeds=None, **kwargs
):
if past_key_values is not None:
past_length = past_key_values[0][0].shape[2]
# Some generation methods already pass only the last input ID
if input_ids.shape[1] > past_length:
remove_prefix_length = past_length
else:
# Default to old behavior: keep only final ID
remove_prefix_length = input_ids.shape[1] - 1
input_ids = input_ids[:, remove_prefix_length:]
position_ids = kwargs.get("position_ids", None)
if attention_mask is not None and position_ids is None:
# create position_ids on the fly for batch generation
position_ids = attention_mask.long().cumsum(-1) - 1
position_ids.masked_fill_(attention_mask == 0, 1)
if past_key_values:
position_ids = position_ids[:, -input_ids.shape[1] :]
# if `inputs_embeds` are passed, we only want to use them in the 1st generation step
if inputs_embeds is not None and past_key_values is None:
model_inputs = {"inputs_embeds": inputs_embeds}
else:
model_inputs = {"input_ids": input_ids}
model_inputs.update(
{
"position_ids": position_ids,
"past_key_values": past_key_values,
"use_cache": kwargs.get("use_cache"),
"attention_mask": attention_mask,
}
)
return model_inputs
@staticmethod
def _reorder_cache(past_key_values, beam_idx):
reordered_past = ()
for layer_past in past_key_values:
reordered_past += (
tuple(past_state.index_select(0, beam_idx.to(past_state.device)) for past_state in layer_past),
)
return reordered_past
def build_inputs(self, tokenizer, query: str, history: List[Tuple[str, str]] = [], meta_instruction=""):
if tokenizer.add_bos_token:
prompt = ""
else:
prompt = tokenizer.bos_token
if meta_instruction:
prompt += f"""<|im_start|>system\n{meta_instruction}<|im_end|>\n"""
for record in history:
prompt += f"""<|im_start|>user\n{record[0]}<|im_end|>\n<|im_start|>assistant\n{record[1]}<|im_end|>\n"""
prompt += f"""<|im_start|>user\n{query}<|im_end|>\n<|im_start|>assistant\n"""
return tokenizer([prompt], return_tensors="pt")
@torch.no_grad()
def chat(
self,
tokenizer,
query: str,
history: List[Tuple[str, str]] = [],
streamer: Optional[BaseStreamer] = None,
max_new_tokens: int = 1024,
do_sample: bool = True,
temperature: float = 0.8,
top_p: float = 0.8,
meta_instruction: str = "You are an AI assistant whose name is InternLM (书生·浦语).\n"
"- InternLM (书生·浦语) is a conversational language model that is developed by Shanghai AI Laboratory (上海人工智能实验室). It is designed to be helpful, honest, and harmless.\n"
"- InternLM (书生·浦语) can understand and communicate fluently in the language chosen by the user such as English and 中文.",
**kwargs,
):
inputs = self.build_inputs(tokenizer, query, history, meta_instruction)
inputs = {k: v.to(self.device) for k, v in inputs.items() if torch.is_tensor(v)}
# also add end-of-assistant token in eos token id to avoid unnecessary generation
eos_token_id = [tokenizer.eos_token_id, tokenizer.convert_tokens_to_ids(["<|im_end|>"])[0]]
outputs = self.generate(
**inputs,
streamer=streamer,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
top_p=top_p,
eos_token_id=eos_token_id,
**kwargs,
)
outputs = outputs[0].cpu().tolist()[len(inputs["input_ids"][0]) :]
response = tokenizer.decode(outputs, skip_special_tokens=True)
response = response.split("<|im_end|>")[0]
history = history + [(query, response)]
return response, history
@torch.no_grad()
def stream_chat(
self,
tokenizer,
query: str,
history: List[Tuple[str, str]] = [],
max_new_tokens: int = 1024,
do_sample: bool = True,
temperature: float = 0.8,
top_p: float = 0.8,
**kwargs,
):
"""
Return a generator that yields `(response, history)` tuples as the response grows, e.g.
('Hello, how can I help you', [('Hello', 'Hello, how can I help you')])
('Hello, how can I help you?', [('Hello', 'Hello, how can I help you?')])
"""
if BaseStreamer is None:
raise ModuleNotFoundError(
"The version of `transformers` is too low. Please make sure "
"that you have installed `transformers>=4.28.0`."
)
response_queue = queue.Queue(maxsize=20)
class ChatStreamer(BaseStreamer):
def __init__(self, tokenizer) -> None:
super().__init__()
self.tokenizer = tokenizer
self.queue = response_queue
self.query = query
self.history = history
self.response = ""
self.cache = []
self.received_inputs = False
self.queue.put((self.response, history + [(self.query, self.response)]))
def put(self, value):
if len(value.shape) > 1 and value.shape[0] > 1:
raise ValueError("ChatStreamer only supports batch size 1")
elif len(value.shape) > 1:
value = value[0]
if not self.received_inputs:
# The first received value is input_ids, ignore here
self.received_inputs = True
return
self.cache.extend(value.tolist())
token = self.tokenizer.decode(self.cache, skip_special_tokens=True)
if token.strip() != "<|im_end|>":
self.response = self.response + token
history = self.history + [(self.query, self.response)]
self.queue.put((self.response, history))
self.cache = []
else:
self.end()
def end(self):
self.queue.put(None)
def stream_producer():
return self.chat(
tokenizer=tokenizer,
query=query,
streamer=ChatStreamer(tokenizer=tokenizer),
history=history,
max_new_tokens=max_new_tokens,
do_sample=do_sample,
temperature=temperature,
top_p=top_p,
**kwargs,
)
def consumer():
producer = threading.Thread(target=stream_producer)
producer.start()
while True:
res = response_queue.get()
if res is None:
return
yield res
return consumer()
# Modified from transformers.models.llama.modeling_llama.LlamaForCausalLM
class InternLM2ForRewardModel(InternLM2PreTrainedModel):
_auto_class = "AutoModel"
_tied_weights_keys = ["v_head.weight"]
def __init__(self, config):
super().__init__(config)
self.model = InternLM2Model(config)
self.vocab_size = config.vocab_size
self.v_head = nn.Linear(config.hidden_size, 1, bias=False)
self.reward_token_id = config.reward_token_id
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
def get_output_embeddings(self):
return self.v_head
def set_output_embeddings(self, new_embeddings):
self.v_head = new_embeddings
def set_decoder(self, decoder):
self.model = decoder
def get_decoder(self):
return self.model
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
@replace_return_docstrings(output_type=SequenceClassifierOutputWithPast, config_class=_CONFIG_FOR_DOC)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
Args:
labels (`torch.LongTensor`, *optional*):
Accepted for interface compatibility only. This head does not compute a loss, so the returned `loss`
is always `None`.
Returns:
Example:
```python
>>> from transformers import AutoModel, AutoTokenizer
>>> model = AutoModel.from_pretrained(PATH_TO_CONVERTED_WEIGHTS, trust_remote_code=True)
>>> tokenizer = AutoTokenizer.from_pretrained(PATH_TO_CONVERTED_TOKENIZER, trust_remote_code=True)
>>> conversation = [
... {"role": "user", "content": "Hey, are you conscious? Can you talk to me?"},
... {"role": "assistant", "content": "I'm not conscious, but I can talk to you."},
... ]
>>> # Scalar reward for the whole conversation
>>> score = model.get_score(tokenizer, conversation)
```"""
output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions
output_hidden_states = (
output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states
)
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
# decoder outputs consist of (dec_features, layer_state, dec_hidden, dec_attn)
outputs = self.model(
input_ids=input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = outputs[0]
hidden_states = self.v_head(hidden_states)
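# get end reward token's score; `attention_mask` is required here, and the
# argmax of its cumulative sum is the index of the last attended token
ends = attention_mask.cumsum(dim=1).argmax(dim=1).view(-1, 1)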
reward_scores = torch.gather(hidden_states.squeeze(-1), 1, ends)
loss = None
if not return_dict:
output = (reward_scores,) + outputs[1:]
return (loss,) + output if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=reward_scores,
past_key_values=outputs.past_key_values,
hidden_states=outputs.hidden_states,
attentions=outputs.attentions,
)
@torch.no_grad()
def get_score(
self,
tokenizer,
conversation: List[dict],
**kwargs,
):
conversation_str = tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False)
input_ids = tokenizer.encode(conversation_str, return_tensors="pt", add_special_tokens=False)
# add reward score token at the end of the input_ids
input_ids = torch.cat([input_ids, torch.tensor([[self.reward_token_id]], dtype=torch.long)], dim=1).to(self.device)
attention_mask = torch.ones_like(input_ids, dtype=torch.bool).to(self.device)
outputs = self.forward(input_ids=input_ids, attention_mask=attention_mask, **kwargs)
score = outputs[0].cpu().item()
return score
@torch.no_grad()
def get_scores(
self,
tokenizer,
conversations: List[List[dict]],
**kwargs,
):
conversation_strs = [tokenizer.apply_chat_template(conversation, tokenize=False, add_generation_prompt=False) for conversation in conversations]
batch_input_ids = []
attention_masks = []
for conversation_str in conversation_strs:
input_ids = tokenizer.encode(conversation_str, return_tensors="pt", add_special_tokens=False)
input_ids = torch.cat([input_ids, torch.tensor([[self.reward_token_id]], dtype=torch.long)], dim=1).squeeze(0)
attention_mask = torch.ones(input_ids.shape, dtype=torch.bool)
batch_input_ids.append(input_ids)
attention_masks.append(attention_mask)
r_pad_batch_input_ids = torch.nn.utils.rnn.pad_sequence(batch_input_ids, batch_first=True, padding_value=tokenizer.pad_token_id)
r_pad_attention_masks = torch.nn.utils.rnn.pad_sequence(attention_masks, batch_first=True, padding_value=False)
outputs = self.forward(input_ids=r_pad_batch_input_ids.to(self.device), attention_mask=r_pad_attention_masks.to(self.device), **kwargs)
scores = outputs[0].cpu().tolist()
return scores
@torch.no_grad()
def compare(
self,
tokenizer,
conversation1: List[dict],
conversation2: List[dict],
return_logits: bool = False,
**kwargs,
):
score1 = self.get_score(tokenizer, conversation1, **kwargs)
score2 = self.get_score(tokenizer, conversation2, **kwargs)
if return_logits:
return score1, score2
else:
return score1 > score2
@torch.no_grad()
def rank(
self,
tokenizer,
conversations: List[List[dict]],
return_logits: bool = False,
**kwargs,
):
scores = self.get_scores(tokenizer, conversations, **kwargs)
if return_logits:
return scores
else:
return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
# Copied from transformers.models.llama.modeling_llama.LlamaForSequenceClassification with Llama->InternLM2
@add_start_docstrings(
"""
The InternLM2 Model transformer with a sequence classification head on top (linear layer).
[`InternLM2ForSequenceClassification`] uses the last token in order to do the classification,
as other causal models (e.g. GPT-2) do.
Since it does classification on the last token, it needs to know the position of the last token. If a
`pad_token_id` is defined in the configuration, it finds the last token that is not a padding token in each row. If
no `pad_token_id` is defined, it simply takes the last value in each row of the batch. Since it cannot guess the
padding tokens when `inputs_embeds` are passed instead of `input_ids`, it does the same (take the last value in
each row of the batch).
""",
InternLM2_START_DOCSTRING,
)
class InternLM2ForSequenceClassification(InternLM2PreTrainedModel):
def __init__(self, config):
super().__init__(config)
self.num_labels = config.num_labels
self.model = InternLM2Model(config)
self.score = nn.Linear(config.hidden_size, self.num_labels, bias=False)
# Initialize weights and apply final processing
self.post_init()
def get_input_embeddings(self):
return self.model.tok_embeddings
def set_input_embeddings(self, value):
self.model.tok_embeddings = value
@add_start_docstrings_to_model_forward(InternLM2_INPUTS_DOCSTRING)
def forward(
self,
input_ids: torch.LongTensor = None,
attention_mask: Optional[torch.Tensor] = None,
position_ids: Optional[torch.LongTensor] = None,
past_key_values: Optional[List[torch.FloatTensor]] = None,
inputs_embeds: Optional[torch.FloatTensor] = None,
labels: Optional[torch.LongTensor] = None,
use_cache: Optional[bool] = None,
output_attentions: Optional[bool] = None,
output_hidden_states: Optional[bool] = None,
return_dict: Optional[bool] = None,
) -> Union[Tuple, SequenceClassifierOutputWithPast]:
r"""
labels (`torch.LongTensor` of shape `(batch_size,)`, *optional*):
Labels for computing the sequence classification/regression loss. Indices should be in `[0, ...,
config.num_labels - 1]`. If `config.num_labels == 1` a regression loss is computed (Mean-Square loss); if
`config.num_labels > 1` a classification loss is computed (Cross-Entropy).
"""
return_dict = return_dict if return_dict is not None else self.config.use_return_dict
transformer_outputs = self.model(
input_ids,
attention_mask=attention_mask,
position_ids=position_ids,
past_key_values=past_key_values,
inputs_embeds=inputs_embeds,
use_cache=use_cache,
output_attentions=output_attentions,
output_hidden_states=output_hidden_states,
return_dict=return_dict,
)
hidden_states = transformer_outputs[0]
logits = self.score(hidden_states)
if input_ids is not None:
batch_size = input_ids.shape[0]
else:
batch_size = inputs_embeds.shape[0]
if self.config.pad_token_id is None and batch_size != 1:
raise ValueError("Cannot handle batch sizes > 1 if no padding token is defined.")
if self.config.pad_token_id is None:
sequence_lengths = -1
else:
if input_ids is not None:
sequence_lengths = (torch.eq(input_ids, self.config.pad_token_id).int().argmax(-1) - 1).to(
logits.device
)
else:
sequence_lengths = -1
pooled_logits = logits[torch.arange(batch_size, device=logits.device), sequence_lengths]
loss = None
if labels is not None:
labels = labels.to(logits.device)
if self.config.problem_type is None:
if self.num_labels == 1:
self.config.problem_type = "regression"
elif self.num_labels > 1 and (labels.dtype == torch.long or labels.dtype == torch.int):
self.config.problem_type = "single_label_classification"
else:
self.config.problem_type = "multi_label_classification"
if self.config.problem_type == "regression":
loss_fct = MSELoss()
if self.num_labels == 1:
loss = loss_fct(pooled_logits.squeeze(), labels.squeeze())
else:
loss = loss_fct(pooled_logits, labels)
elif self.config.problem_type == "single_label_classification":
loss_fct = CrossEntropyLoss()
loss = loss_fct(pooled_logits.view(-1, self.num_labels), labels.view(-1))
elif self.config.problem_type == "multi_label_classification":
loss_fct = BCEWithLogitsLoss()
loss = loss_fct(pooled_logits, labels)
if not return_dict:
output = (pooled_logits,) + transformer_outputs[1:]
return ((loss,) + output) if loss is not None else output
return SequenceClassifierOutputWithPast(
loss=loss,
logits=pooled_logits,
past_key_values=transformer_outputs.past_key_values,
hidden_states=transformer_outputs.hidden_states,
attentions=transformer_outputs.attentions,
)
================================================
FILE: xtuner-eval_niah/xtuner/tools/model_converters/pth_to_hf.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os.path as osp
import shutil
import warnings
from accelerate import init_empty_weights
from accelerate.utils import set_module_tensor_to_device
from mmengine import print_log
from mmengine.config import Config, DictAction
from mmengine.fileio import PetrelBackend, get_file_backend
from mmengine.utils import mkdir_or_exist
from tqdm import tqdm
from xtuner.configs import cfgs_name_path
from xtuner.model.utils import guess_load_checkpoint
from xtuner.registry import BUILDER
def parse_args():
parser = argparse.ArgumentParser(
description='Convert the pth model to HuggingFace model')
parser.add_argument('config', help='config file name or path.')
parser.add_argument('pth_model', help='pth model file')
parser.add_argument(
'save_dir', help='the directory to save HuggingFace model')
parser.add_argument(
'--fp32',
action='store_true',
help='Save LLM in fp32. If not set, fp16 will be used by default.')
parser.add_argument(
'--max-shard-size',
type=str,
default='2GB',
help='Only applicable for LLM. The maximum size for '
'each sharded checkpoint.')
parser.add_argument(
'--safe-serialization',
action='store_true',
help='Indicate if using `safe_serialization`')
parser.add_argument(
'--save-format',
default='xtuner',
choices=('xtuner', 'official', 'huggingface'),
help='Only applicable for LLaVAModel. Indicate the save format.')
parser.add_argument(
'--cfg-options',
nargs='+',
action=DictAction,
help='override some settings in the used config, the key-value pair '
'in xxx=yyy format will be merged into config file. If the value to '
'be overwritten is a list, it should be like key="[a,b]" or key=a,b. '
'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]". '
'Note that the quotation marks are necessary and that no white space '
'is allowed.')
args = parser.parse_args()
return args
def main():
args = parse_args()
# parse config
if not osp.isfile(args.config):
try:
args.config = cfgs_name_path[args.config]
except KeyError:
raise FileNotFoundError(f'Cannot find {args.config}')
# load config
cfg = Config.fromfile(args.config)
if args.cfg_options is not None:
cfg.merge_from_dict(args.cfg_options)
model_name = cfg.model.type if isinstance(cfg.model.type,
str) else cfg.model.type.__name__
use_meta_init = True
if 'LLaVAModel' in model_name:
cfg.model.pretrained_pth = None
if args.save_format != 'xtuner':
use_meta_init = False
if 'Reward' in model_name:
use_meta_init = False
cfg.model.llm.pop('quantization_config', None)
if use_meta_init:
try:
# Initializing the model with meta-tensor can reduce unwanted
# memory usage.
with init_empty_weights():
with warnings.catch_warnings():
warnings.filterwarnings(
'ignore', message='.*non-meta.*', category=UserWarning)
model = BUILDER.build(cfg.model)
except NotImplementedError as e:
# Cannot initialize the model with meta tensor if the model is
# quantized.
if 'Cannot copy out of meta tensor' in str(e):
model = BUILDER.build(cfg.model)
else:
raise e
else:
model = BUILDER.build(cfg.model)
backend = get_file_backend(args.pth_model)
if isinstance(backend, PetrelBackend):
from xtuner.utils.fileio import patch_fileio
with patch_fileio():
state_dict = guess_load_checkpoint(args.pth_model)
else:
state_dict = guess_load_checkpoint(args.pth_model)
for name, param in tqdm(state_dict.items(), desc='Load State Dict'):
set_module_tensor_to_device(model, name, 'cpu', param)
model.llm.config.use_cache = True
print_log(f'Load PTH model from {args.pth_model}', 'current')
mkdir_or_exist(args.save_dir)
save_pretrained_kwargs = {
'max_shard_size': args.max_shard_size,
'safe_serialization': args.safe_serialization
}
model.to_hf(
cfg=cfg,
save_dir=args.save_dir,
fp32=args.fp32,
save_pretrained_kwargs=save_pretrained_kwargs,
save_format=args.save_format)
shutil.copyfile(args.config, osp.join(args.save_dir, 'xtuner_config.py'))
print_log('All done!', 'current')
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/model_converters/split.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import copy
import json
import os
import os.path as osp
import shutil
import torch
from mmengine.utils import mkdir_or_exist
def parse_args():
parser = argparse.ArgumentParser(
description='Split a HuggingFace model to the smallest sharded one')
parser.add_argument('src_dir', help='the directory of the model')
parser.add_argument('dst_dir', help='the directory to save the new model')
args = parser.parse_args()
return args
def main():
args = parse_args()
mkdir_or_exist(args.dst_dir)
all_files = os.listdir(args.src_dir)
for name in all_files:
if not name.startswith(('pytorch_model', '.')):
src_path = osp.join(args.src_dir, name)
dst_path = osp.join(args.dst_dir, name)
shutil.copy(src_path, dst_path)
with open(osp.join(args.src_dir, 'pytorch_model.bin.index.json')) as f:
index = json.load(f)
n_shard = len(index['weight_map'])
new_index = copy.deepcopy(index)
new_index['weight_map'] = {}
cnt = 1
checkpoints = set(index['weight_map'].values())
for ckpt in checkpoints:
state_dict = torch.load(
osp.join(args.src_dir, ckpt), map_location='cuda')
keys = sorted(list(state_dict.keys()))
for k in keys:
new_state_dict_name = 'pytorch_model-{:05d}-of-{:05d}.bin'.format(
cnt, n_shard)
new_index['weight_map'][k] = new_state_dict_name
new_state_dict = {k: state_dict[k]}
torch.save(new_state_dict,
osp.join(args.dst_dir, new_state_dict_name))
cnt += 1
del state_dict
torch.cuda.empty_cache()
with open(osp.join(args.dst_dir, 'pytorch_model.bin.index.json'),
'w') as f:
json.dump(new_index, f)
assert new_index['weight_map'].keys() == index['weight_map'].keys(
), 'Mismatch on `weight_map`!'
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/plugins/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .api import plugins_api
__all__ = ['plugins_api']
================================================
FILE: xtuner-eval_niah/xtuner/tools/plugins/api.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import re
def plugins_api(input_str,
calculate_open=True,
solve_open=True,
search_open=True):
pattern = r'(Solve|solve|Solver|solver|Calculate|calculate|Calculator|calculator|Search)\("([^"]*)"\)' # noqa: E501
matches = re.findall(pattern, input_str)
converted_str = '<|Results|>:\n'
for i in range(len(matches)):
if matches[i][0] in [
'Calculate', 'calculate',
'Calculator', 'calculator'
]:
if calculate_open:
from .calculate import Calculate
result = Calculate(matches[i][1])
else:
result = None
converted_str += f"Calculate(\"{matches[i][1]}\") => {result}\n"
elif matches[i][0] in ['Solve', 'solve', 'Solver', 'solver']:
if solve_open:
from .solve import Solve
result = Solve(matches[i][1])
else:
result = None
converted_str += f"Solve(\"{matches[i][1]}\") =>\n{result}\n"
elif matches[i][0] == 'Search':
if search_open:
from .search import Search
result = Search(matches[i][1])
else:
result = None
converted_str += f"Search(\"{matches[i][1]}\") =>\n{result}"
converted_str += '\n'
return converted_str
================================================
FILE: xtuner-eval_niah/xtuner/tools/plugins/calculate.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from math import * # noqa: F401, F403
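def Calculate(expression):
# NOTE: `eval` executes arbitrary Python; only pass trusted expressions.
results = []
for exp in expression.split(';'):
try:
results.append('{:.2f}'.format(eval(exp.replace('^', '**'))))
except Exception:
results.append('No result.')
# Join with ';' so a failed sub-expression stays separated from its neighbours.
return ';'.join(results)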
================================================
FILE: xtuner-eval_niah/xtuner/tools/plugins/search.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os
import sys
import requests
try:
SERPER_API_KEY = os.environ['SERPER_API_KEY']
except Exception:
print('Please obtain the `SERPER_API_KEY` from https://serper.dev and '
'set it using `export SERPER_API_KEY=xxx`.')
sys.exit(1)
def parse_results(results, k=10):
snippets = []
for result in results['organic'][:k]:
if 'snippet' in result:
snippets.append(result['snippet'])
for attribute, value in result.get('attributes', {}).items():
snippets.append(f'{attribute}: {value}.')
return snippets
def search(api_key, search_term, **kwargs):
headers = {
'X-API-KEY': api_key,
'Content-Type': 'application/json',
}
params = {
'q': search_term,
**{key: value
for key, value in kwargs.items() if value is not None},
}
try:
response = requests.post(
'https://google.serper.dev/search',
headers=headers,
params=params,
timeout=5)
except Exception as e:
return -1, str(e)
return response.status_code, response.json()
def Search(q, k=10):
status_code, response = search(SERPER_API_KEY, q)
if status_code != 200:
ret = 'None\n'
else:
text = parse_results(response, k=k)
ret = ''
for idx, res in enumerate(text):
ret += f"<|{idx+1}|>: '{res}'\n"
return ret
================================================
FILE: xtuner-eval_niah/xtuner/tools/plugins/solve.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import math
import re
from math import * # noqa: F401, F403
from sympy import Eq, solve, symbols
from .calculate import Calculate
def Solve(equations_str):
try:
equations_str = equations_str.replace(' ', '')
equations_ori = re.split(r'[,;]+', equations_str)
equations_str = equations_str.replace('^', '**')
equations_str = re.sub(r'(\(.*\))([a-zA-Z])', r'\1 * \2',
equations_str)
equations_str = re.sub(r'(\d+)([a-zA-Z])', r'\1 * \2', equations_str)
equations_str = equations_str.replace('pi', str(math.pi))
equations = re.split(r'[,;]+', equations_str)
vars_list = list(set(re.findall(r'[a-zA-Z]+', equations_str)))
vars = {var: symbols(var) for var in vars_list}
output = ''
eqs = []
for eq in equations:
if '=' in eq:
left, right = eq.split('=')
eqs.append(
Eq(
eval(left.strip(), {}, vars),
eval(right.strip(), {}, vars)))
solutions = solve(eqs, vars, dict=True)
vars_values = {var: [] for var in vars_list}
if isinstance(solutions, list):
for idx, solution in enumerate(solutions):
for var, sol in solution.items():
output += f'{var}_{idx} = {sol}\n'
vars_values[str(var)].append(sol)
else:
for var, sol in solutions.items():
output += f'{var} = {sol}\n'
vars_values[str(var)].append(sol)
for eq, eq_o in zip(equations, equations_ori):
if '=' not in eq:
for var in vars_list:
                    need_note = len(vars_values[var]) > 1
for idx, value in enumerate(vars_values[var]):
eq_to_calc = eq.replace(var, str(value))
calc_result = Calculate(eq_to_calc)
if need_note:
eq_name = eq_o.replace(var, f'{var}_{idx}')
else:
eq_name = eq_o
if calc_result != 'No results.':
output += f'{eq_name} = {calc_result}\n'
return output.strip()
except Exception:
return 'No result.'
================================================
FILE: xtuner-eval_niah/xtuner/tools/process_untokenized_datasets.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os
import warnings
from mmengine import Config, ConfigDict
from mmengine.config.lazy import LazyObject
from xtuner.registry import BUILDER
# ignore FutureWarning in hf datasets
warnings.simplefilter(action='ignore', category=FutureWarning)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('config', help='config file name or path.')
    parser.add_argument(
        '--save-folder', help='The folder to save the processed dataset.')
args = parser.parse_args()
return args
def modify_config(config, dataset_save_folder):
dataset = ConfigDict(
type=LazyObject('datasets', 'load_from_disk'),
dataset_path=dataset_save_folder)
train_dataset = ConfigDict(
type=LazyObject('xtuner.dataset', 'process_hf_dataset'),
dataset=dataset,
do_dataset_tokenization=False,
tokenizer=None,
max_length=None,
dataset_map_fn=None,
template_map_fn=None,
max_dataset_length=None,
split=None,
remove_unused_columns=False,
rename_maps=[],
pack_to_max_length=False,
input_ids_with_output=False)
config.train_dataloader.dataset = train_dataset
return config
def process_untokenized_dataset(config):
dataset = BUILDER.build(config.train_dataloader.dataset)
return dataset
if __name__ == '__main__':
args = parse_args()
cfg = Config.fromfile(args.config)
print('Start to process untokenized dataset...')
processed_dataset = process_untokenized_dataset(cfg)
print('Processing untokenized dataset finished.')
processed_dataset_save_folder = args.save_folder
if not os.path.isabs(processed_dataset_save_folder):
processed_dataset_save_folder = os.path.join(
os.getcwd(), processed_dataset_save_folder)
modified_cfg = modify_config(cfg, processed_dataset_save_folder)
print('Start to save processed dataset...')
processed_dataset.save_to_disk(processed_dataset_save_folder)
print(
f'Processed dataset has been saved to {processed_dataset_save_folder}')
cfg_folder, cfg_file_name = os.path.split(args.config)
cfg_file_name = cfg_file_name.split('.')[0]
cfg_file_name = f'{cfg_file_name}_modified.py'
modified_cfg_save_path = os.path.join(cfg_folder, cfg_file_name)
modified_cfg.dump(modified_cfg_save_path)
print(f'Modified config has been saved to {modified_cfg_save_path}. '
'Please use this new config for the next training phase.')
================================================
FILE: xtuner-eval_niah/xtuner/tools/process_untokenized_datasets_legacy.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import ast
import multiprocessing
import os
import warnings
from functools import partial
from datasets import Dataset, DatasetDict, load_dataset
from mmengine import ConfigDict
from transformers import AutoTokenizer
from xtuner.dataset.huggingface import process
from xtuner.dataset.map_fns import (DATASET_FORMAT_MAPPING,
template_map_fn_factory)
from xtuner.utils import PROMPT_TEMPLATE
# ignore FutureWarning in hf datasets
warnings.simplefilter(action='ignore', category=FutureWarning)
"""
ftdp dataset:
srun -p llm_razor --quotatype=auto --gres=gpu:1 --ntasks=1 \
--ntasks-per-node=1 --cpus-per-task=5 --kill-on-bad-exit=1 \
python xtuner/tools/process_untokenized_datasets.py \
--data-folder /path/to/data/folder \
--save-folder ./processed \
--tokenizer-path pretrained_model_name_or_path \
--prompt-template internlm2_chat \
--dataset-format ftdp
normal json dataset:
srun -p llm_razor --quotatype=auto --gres=gpu:1 --ntasks=1 \
--ntasks-per-node=1 --cpus-per-task=5 --kill-on-bad-exit=1 \
python xtuner/tools/process_untokenized_datasets.py \
--data-folder /path/to/data/folder \
--save-folder ./processed \
--tokenizer-path pretrained_model_name_or_path \
--prompt-template internlm2_chat
"""
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('--data-folder', help='Data folder')
    parser.add_argument(
        '--save-folder', help='The folder to save the processed dataset.')
parser.add_argument(
'--tokenizer-path', help='The path to the hf tokenizer.')
parser.add_argument(
'--dataset-format',
choices=list(DATASET_FORMAT_MAPPING.keys()) + ['ftdp'],
default=None,
        help='The format of the input dataset. The available choices are '
        f"{list(DATASET_FORMAT_MAPPING.keys()) + ['ftdp']}.")
parser.add_argument(
'--prompt-template',
choices=PROMPT_TEMPLATE.keys(),
        help='Which prompt template to apply to the dataset. '
        f'The available choices are {list(PROMPT_TEMPLATE.keys())}.')
    parser.add_argument(
        '--max-length', type=int, default=32768, help='Max sequence length.')
parser.add_argument(
'--pack-to-max-length',
action='store_true',
        help='Whether to pack the dataset to the `max_length`.')
parser.add_argument(
'--file-type',
default='.json',
        help='Only files with this extension are collected and processed.')
parser.add_argument(
'--data-order-path',
default=None,
        help=('The path to a txt file that contains a list of data paths. '
              'It can be obtained with xtuner/tools/get_data_order.py.'))
args = parser.parse_args()
return args
def process_one(fp,
tokenizer,
max_length,
pack_to_max_length,
dataset_map_fn=None,
template_map_fn=None,
is_ftdp=False):
dataset = []
if is_ftdp:
with open(fp) as file:
lines = file.readlines()
for line in lines:
line = ast.literal_eval(line)
dataset.append({'messages': line})
dataset = Dataset.from_list(dataset)
else:
        # load normal json data
dataset = load_dataset('json', data_files=fp)
dataset = dataset['train']
dataset = process(
dataset,
tokenizer=tokenizer,
max_length=max_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
remove_unused_columns=True,
pack_to_max_length=pack_to_max_length,
map_num_proc=32)
return fp, dataset
def process_untokenized_dataset(folder,
tokenizer,
max_length,
pack_to_max_length,
dataset_map_fn,
prompt_template,
data_order_path=None,
file_type='.json',
is_ftdp=False):
assert os.path.exists(folder), f'{folder} does not exist.'
datasets_dict = {}
if data_order_path is not None:
data_order = load_dataset(
'text', data_files=data_order_path, split='train')['text']
for i, fp in enumerate(data_order):
data_order[i] = os.path.join(folder, fp)
else:
triples = list(os.walk(folder, followlinks=True))
data_order = []
for root, dirs, files in triples:
dirs.sort()
for fn in sorted(files):
if fn.endswith(file_type):
fp = os.path.join(root, fn)
data_order.append(fp)
    print('All file paths: ', data_order)
pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
template_map_fn = ConfigDict(
type=template_map_fn_factory, template=prompt_template)
process_single = partial(
process_one,
tokenizer=tokenizer,
max_length=max_length,
pack_to_max_length=pack_to_max_length,
dataset_map_fn=dataset_map_fn,
template_map_fn=template_map_fn,
is_ftdp=is_ftdp)
out = pool.map(process_single, data_order)
pool.close()
pool.join()
for idx, (key, dataset) in enumerate(out):
assert data_order[idx] == key
dataset = dataset.remove_columns('length')
datasets_dict[str(idx)] = dataset
datasets_dict = DatasetDict(datasets_dict)
return datasets_dict
if __name__ == '__main__':
args = parse_args()
tokenizer = ConfigDict(
type=AutoTokenizer.from_pretrained,
pretrained_model_name_or_path=args.tokenizer_path,
trust_remote_code=True,
padding_side='right')
if args.dataset_format is None:
dataset_map_fn = None
elif args.dataset_format == 'ftdp':
dataset_map_fn = DATASET_FORMAT_MAPPING['openai']
else:
dataset_map_fn = DATASET_FORMAT_MAPPING[args.dataset_format]
datasets_dict = process_untokenized_dataset(
args.data_folder,
tokenizer,
args.max_length,
args.pack_to_max_length,
dataset_map_fn,
PROMPT_TEMPLATE[args.prompt_template],
data_order_path=args.data_order_path,
file_type=args.file_type,
is_ftdp=args.dataset_format == 'ftdp')
datasets_dict.save_to_disk(args.save_folder)
================================================
FILE: xtuner-eval_niah/xtuner/tools/process_untokenized_llava_data.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import warnings
from mmengine import Config
from xtuner.registry import BUILDER
# ignore FutureWarning in hf datasets
warnings.simplefilter(action='ignore', category=FutureWarning)
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument('config', help='config file name or path.')
    parser.add_argument(
        '--save-folder', help='The folder to save the processed text data.')
args = parser.parse_args()
return args
def build_llava_dataset(config):
dataset = BUILDER.build(config.train_dataloader.dataset)
return dataset
if __name__ == '__main__':
args = parse_args()
cfg = Config.fromfile(args.config)
llava_dataset = build_llava_dataset(cfg)
text_data = llava_dataset.text_data
text_data.save_to_disk(args.save_folder)
================================================
FILE: xtuner-eval_niah/xtuner/tools/test.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import os
import os.path as osp
from types import FunctionType
from mmengine.config import Config, DictAction
from mmengine.registry import RUNNERS
from mmengine.runner import Runner
from xtuner.configs import cfgs_name_path
from xtuner.model.utils import guess_load_checkpoint
from xtuner.registry import MAP_FUNC
def parse_args():
parser = argparse.ArgumentParser(description='Test model')
parser.add_argument('config', help='config file name or path.')
parser.add_argument('--checkpoint', default=None, help='checkpoint file')
parser.add_argument(
'--work-dir',
help='the directory to save the file containing evaluation metrics')
parser.add_argument(
'--cfg-options',
nargs='+',
action=DictAction,
help='override some settings in the used config, the key-value pair '
'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b. '
'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
'Note that the quotation marks are necessary and that no white space '
'is allowed.')
parser.add_argument(
'--launcher',
choices=['none', 'pytorch', 'slurm', 'mpi'],
default='none',
help='job launcher')
parser.add_argument('--local_rank', '--local-rank', type=int, default=0)
args = parser.parse_args()
if 'LOCAL_RANK' not in os.environ:
os.environ['LOCAL_RANK'] = str(args.local_rank)
return args
def register_function(cfg_dict):
if isinstance(cfg_dict, dict):
for key, value in dict.items(cfg_dict):
if isinstance(value, FunctionType):
value_str = str(value)
if value_str not in MAP_FUNC:
MAP_FUNC.register_module(module=value, name=value_str)
cfg_dict[key] = value_str
else:
register_function(value)
elif isinstance(cfg_dict, (list, tuple)):
for value in cfg_dict:
register_function(value)
def main():
args = parse_args()
# parse config
if not osp.isfile(args.config):
try:
args.config = cfgs_name_path[args.config]
except KeyError:
raise FileNotFoundError(f'Cannot find {args.config}')
# load config
cfg = Config.fromfile(args.config)
cfg.launcher = args.launcher
if args.cfg_options is not None:
cfg.merge_from_dict(args.cfg_options)
# register FunctionType object in cfg to `MAP_FUNC` Registry and
# change these FunctionType object to str
register_function(cfg._cfg_dict)
# work_dir is determined in this priority: CLI > segment in file > filename
if args.work_dir is not None:
# update configs according to CLI args if args.work_dir is not None
cfg.work_dir = args.work_dir
elif cfg.get('work_dir', None) is None:
# use config filename as default work_dir if cfg.work_dir is None
cfg.work_dir = osp.join('./work_dirs',
osp.splitext(osp.basename(args.config))[0])
# build the runner from config
if 'runner_type' not in cfg:
# build the default runner
runner = Runner.from_cfg(cfg)
else:
# build customized runner from the registry
# if 'runner_type' is set in the cfg
runner = RUNNERS.build(cfg)
state_dict = guess_load_checkpoint(args.checkpoint)
runner.model.load_state_dict(state_dict, strict=False)
runner.logger.info(f'Load checkpoint from {args.checkpoint}')
# start testing
runner.test()
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/tokenize_ftdp_datasets.py
================================================
import argparse
import json
import os
import os.path as osp
from functools import partial
from pathlib import Path
from typing import Dict, List
import numpy as np
from mmengine import list_dir_or_file, track_progress_rich
from transformers import AutoTokenizer
SEPCIAL_TOKENS = [
'<|plugin|>', '<|interpreter|>', '<|action_end|>', '<|action_start|>',
'<|im_end|>', '<|im_start|>'
]
CHATML_LLAMAV13_32K_TOKEN_CFG = dict(
role_cfg=dict(
system=dict(
begin=dict(
with_name='<|im_start|>system name={name}\n',
without_name='<|im_start|>system\n',
name={
'interpreter': '<|interpreter|>',
'plugin': '<|plugin|>',
}),
end='<|im_end|>\n',
loss=dict(
meta=False,
icl=False,
current=False,
prefix=False,
)),
user=dict(
begin=dict(
with_name='<|im_start|>user name={name}\n',
without_name='<|im_start|>user\n',
),
end='<|im_end|>\n',
loss=dict(
icl=False,
current=False,
prefix=False,
)),
assistant=dict(
begin=dict(
with_name='<|im_start|>assistant name={name}\n',
without_name='<|im_start|>assistant\n',
name={
'interpreter': '<|interpreter|>',
'plugin': '<|plugin|>',
}),
end='<|im_end|>\n',
loss=dict(
icl=True,
current=True,
prefix=False,
end=True,
)),
environment=dict(
begin=dict(
with_name='<|im_start|>environment name={name}\n',
without_name='<|im_start|>environment\n',
name={
'interpreter': '<|interpreter|>',
'plugin': '<|plugin|>',
}),
end='<|im_end|>\n',
loss=dict(
icl=False,
current=False,
prefix=False,
)),
tool=dict(
begin=dict(
with_name='<|action_start|>{name}\n',
name={
'interpreter': '<|interpreter|>',
'plugin': '<|plugin|>',
}),
end='<|action_end|>\n',
belong='assistant',
),
thought=dict(
begin=dict(without_name=''),
end='',
belong='assistant',
),
),
max_len=32 * 1024,
)
def chatml_format(
processed_data,
tokenizer,
role_cfg,
max_len=2048,
encode_json=True,
):
"""
```python
dict(
role='',
content='',
        name='',  -> expands the begin template
type='',
)
```
```python
dict(
system=dict(
begin=dict(
with_name='system name={name}\n',
without_name='system\n',
name={
'interpreter': '',
'plugin': '',
}),
end='\n',
loss=dict(
meta=False,
icl=False,
current=False,
prefix=False,
)),
user=dict(
begin=dict(
with_name='user name={name}\n',
without_name='user\n',
),
end='\n',
loss=dict(
icl=False,
current=False,
prefix=False,
)),
assistant=dict(
begin=dict(
with_name='assistant name={name}\n',
without_name='assistant\n',
name={
'interpreter': '',
'plugin': '',
}),
end='\n',
loss=dict(
icl=True,
current=True,
prefix=False,
end=True,
)),
environment=dict(
begin=dict(
with_name='environment name={name}\n',
without_name='environment\n',
name={
'interpreter': '',
'plugin': '',
}),
end='\n',
loss=dict(
icl=False,
current=False,
prefix=False,
)),
tool=dict(
begin=dict(
with_name='{name}\n',
name={
'interpreter': '',
'plugin': '',
}),
end='\n',
belong='assistant',
),
thought=dict(
begin='',
end='',
belong='assistant',
),
```
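Loss masking, as a minimal sketch: the token list returned by this
function encodes the loss mask in the sign of each id, so tokens that are
excluded from the loss are stored negated. The helpers `mask_tokens` and
`to_labels` below are hypothetical and not part of this module, and the
ignore index of -100 is an assumption about the downstream trainer.

```python
def mask_tokens(token_ids, use_loss):
    # Hypothetical helper: negate ids when the span is excluded from the loss.
    return list(token_ids) if use_loss else [-t for t in token_ids]


def to_labels(signed_ids, ignore_index=-100):
    # Assumed consumer: recover real ids with abs() and mask non-positive ids.
    input_ids = [abs(t) for t in signed_ids]
    labels = [t if t > 0 else ignore_index for t in signed_ids]
    return input_ids, labels


# A user turn (no loss) followed by an assistant turn (loss enabled).
user = mask_tokens([11, 12, 13], use_loss=False)
assistant = mask_tokens([21, 22], use_loss=True)
input_ids, labels = to_labels(user + assistant)
```

Token id 0 cannot carry a sign, so this sketch treats it as masked.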
"""
def format_begin(role_cfg, message):
name = message.get('name', None)
if name is not None:
begin = role_cfg['begin'].get('with_name', '')
if name in role_cfg['begin'].get('name', {}):
begin = begin.format(name=role_cfg['begin']['name'][name])
else:
begin = begin.format(name=name)
else:
begin = role_cfg['begin'].get('without_name', '')
return begin
def format_sub_role(messages: List[Dict], roles_cfg) -> List[Dict]:
new_message = list()
for message in messages:
if message['role'] in [
'assistant', 'user', 'system', 'environment'
]:
new_message.append(message)
continue
role_cfg = roles_cfg[message['role']]
begin = format_begin(role_cfg, message)
new_content = begin + message['content'] + role_cfg['end']
if role_cfg.get('fallback_role'):
new_message.append(
dict(role=role_cfg['fallback_role'], content=new_content))
elif role_cfg.get('belong'):
if new_message[-1]['role'] != role_cfg.get('belong'):
new_message.append(
dict(role=role_cfg.get('belong'), content=new_content))
else:
new_message[-1]['content'] += new_content
else:
new_message.append(
dict(role=message['role'], content=new_content))
return new_message
token_ids = []
_processed_data = format_sub_role(processed_data, role_cfg)
for dialog_item in _processed_data:
role = dialog_item['role']
content = dialog_item['content']
        # TODO: is strip necessary? or use lstrip? (to avoid a leading \n\n)
# content = content.lstrip()
begin = format_begin(role_cfg[role], dialog_item)
end = role_cfg[role]['end']
begin_token = tokenizer.encode(begin, add_special_tokens=False)
        if not role_cfg[role]['loss'].get('begin', False):
begin_token = [-token_id for token_id in begin_token]
end_token = tokenizer.encode(
role_cfg[role]['end'], add_special_tokens=False)
# breakpoint()
if not role_cfg[role]['loss'].get('end', False):
end_token = [-token_id for token_id in end_token]
content_token = tokenizer.encode(
begin + content + end, add_special_tokens=False)
content_token = content_token[len(begin_token):-len(end_token)]
if dialog_item.get('loss', True):
loss_cfg = role_cfg[role]['loss']
else:
loss_cfg = dict(icl=False, current=False, meta=False)
if not loss_cfg[dialog_item.get('type', 'current')]:
content_token = [-token_id for token_id in content_token]
if begin == '':
tokens = content_token
else:
tokens = begin_token + content_token
if end != '':
tokens = tokens + end_token
token_ids += tokens
token_ids = [tokenizer.bos_token_id] + token_ids
token_ids = token_ids[:max_len]
if encode_json:
line = str.encode(json.dumps({'tokens': token_ids}) + '\n')
return line, len(token_ids)
return token_ids, len(token_ids)
def write_bin_meta_bin(path, dataset_name, filename, samples):
train_path = osp.join(path, f'train/cn/{dataset_name}')
valid_path = osp.join(path, f'valid/cn/{dataset_name}')
train_dir = Path(train_path)
valid_dir = Path(valid_path)
train_dir.mkdir(exist_ok=True, parents=True)
valid_dir.mkdir(exist_ok=True, parents=True)
train_f = open(train_dir.joinpath(f'{filename}.bin'), 'wb')
valid_f_path = valid_dir.joinpath(f'{filename}.bin')
valid_f = open(valid_f_path, 'wb')
print(train_dir)
print(valid_dir)
train_tokens = 0
valid_tokens = 0
last_train_position = 0
last_valid_position = 0
train_samples = 0
valid_samples = 0
train_meta = []
valid_meta = []
for line, token_num in samples:
train_tokens += token_num
train_f.write(line)
train_meta.append((last_train_position, token_num))
last_train_position += len(line)
train_samples += 1
if (train_samples) % 100 == 0: # ?
valid_tokens += token_num
valid_f.write(line)
valid_meta.append((last_valid_position, token_num))
last_valid_position += len(line)
valid_samples += 1
train_f.close()
valid_f.close()
np.save(open(train_dir.joinpath(f'{filename}.bin.meta'), 'wb'), train_meta)
    # Drop the validation file when it holds no more than 500 samples.
    # 500 is a magic number and can be changed, but it must be larger
    # than the data-parallel (DP) size.
if valid_samples > 500:
np.save(
open(valid_dir.joinpath(f'{filename}.bin.meta'), 'wb'), valid_meta)
else:
        print(f'{valid_f_path} is removed because the number of',
              f'`valid_samples`({valid_samples}) is not greater than 500')
os.remove(valid_f_path)
return train_tokens, valid_tokens, train_samples, valid_samples
def tokenize_and_save(tokenizer, processed_dir, tokenized_dir):
tokenized_save_dir = osp.join(tokenized_dir, 'chatml_llamav13_32k')
data_dir = processed_dir
all_train_tokens = 0
all_valid_tokens = 0
all_train_samples = 0
all_valid_samples = 0
for filename in list_dir_or_file(data_dir, recursive=True, list_dir=False):
file_path = os.path.join(data_dir, filename)
if '/processed/' not in file_path:
continue
assert '.jsonl' in filename
# dataset name such as char_x10_chat_format
dataset_name = filename.split(os.sep)[0]
# Hardcode here to skip tokenizing the file if it already exists
# (Refactor the `write_bin_meta_bin`!).
train_f = osp.join(tokenized_save_dir, 'train', 'cn', dataset_name,
f'{osp.splitext(osp.basename(filename))[0]}.bin')
if osp.isfile(train_f):
print(f'{train_f} already exists, skip it')
continue
tokenize_fun = partial(
chatml_format,
tokenizer=tokenizer,
**CHATML_LLAMAV13_32K_TOKEN_CFG)
samples = []
with open(file_path) as f:
dataset = f.readlines()
task_num = len(dataset)
dataset = map(lambda x: (json.loads(x), ), dataset)
for sample in track_progress_rich(
tokenize_fun,
dataset,
nproc=32,
task_num=task_num,
chunksize=32,
description=f'{os.path.basename(file_path)}...'):
samples.append(sample)
train_tokens, valid_tokens, train_samples, valid_samples = write_bin_meta_bin( # noqa E501
path=tokenized_save_dir,
dataset_name=dataset_name,
samples=samples,
filename=osp.splitext(osp.basename(filename))[0])
if train_tokens is None:
print(f'{osp.splitext(osp.basename(filename))[0]} already '
'exists, skip it')
continue
print(f'train_tokens {train_tokens}', flush=True)
print(f'train_samples {train_samples}')
print(f'valid tokens {valid_tokens}')
print(f'valid_samples {valid_samples}')
all_train_tokens += train_tokens
all_valid_tokens += valid_tokens
all_train_samples += train_samples
all_valid_samples += valid_samples
print(f'all train tokens {all_train_tokens}')
print(f'all train samples {all_train_samples}')
print(f'all valid tokens {all_valid_tokens}')
print(f'all valid samples {all_valid_samples}')
def tokenizer_add_special_tokens(tokenizer):
print(f'Before adding special tokens, Vocabulary Size: {len(tokenizer)}')
for special_token in SEPCIAL_TOKENS:
if special_token not in tokenizer.get_vocab():
tokenizer.add_tokens([special_token], special_tokens=True)
print(f'After adding special tokens, Vocabulary Size: {len(tokenizer)}')
def save_new_tokenizer(tokenizer, save_dir):
tokenizer.save_pretrained(save_dir)
print(f'save new tokenizer to {save_dir}')
def parse_args():
parser = argparse.ArgumentParser()
parser.add_argument(
'--processed-dir', help='The folder to save untokenized data.')
parser.add_argument(
'--tokenized-dir', help='The folder to save tokenized data.')
parser.add_argument(
'--tokenizer-path', help='The path to the hf tokenizer.')
parser.add_argument(
'--tokenizer-w-special-tokens-save-dir',
default=None,
help='We have to add special tokens to the vocabulary of '
'the given tokenizer, and save the new tokenizer to this folder.')
args = parser.parse_args()
return args
def main():
args = parse_args()
tokenizer = AutoTokenizer.from_pretrained(
args.tokenizer_path, trust_remote_code=True, padding_side='right')
ori_vocab_size = len(tokenizer)
tokenizer_add_special_tokens(tokenizer)
if len(tokenizer) != ori_vocab_size:
save_new_tokenizer(tokenizer, args.tokenizer_w_special_tokens_save_dir)
tokenize_and_save(tokenizer, args.processed_dir, args.tokenized_dir)
if __name__ == '__main__':
main()
================================================
FILE: xtuner-eval_niah/xtuner/tools/train.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import argparse
import json
import logging
import os
import os.path as osp
from functools import partial
from types import FunctionType
from mmengine.config import Config, DictAction
from mmengine.config.lazy import LazyObject
from mmengine.logging import print_log
from mmengine.registry import RUNNERS
from mmengine.runner import Runner
from mmengine.utils import digit_version
from peft import get_peft_model, prepare_model_for_kbit_training
from transformers import TrainingArguments
from xtuner.configs import cfgs_name_path
from xtuner.dataset.collate_fns import default_collate_fn
from xtuner.model.modules import dispatch_modules
from xtuner.model.modules.dispatch import SUPPORT_FLASH2
from xtuner.model.utils import LoadWoInit, find_all_linear_names, traverse_dict
from xtuner.registry import BUILDER, MAP_FUNC
from xtuner.tools.utils import (auto_dtype_of_deepspeed_config,
get_seed_from_checkpoint)
def parse_args():
parser = argparse.ArgumentParser(description='Train LLM')
parser.add_argument('config', help='config file name or path.')
parser.add_argument('--work-dir', help='the dir to save logs and models')
parser.add_argument(
'--deepspeed',
type=str,
default=None,
help='the path to the .json file for deepspeed')
parser.add_argument(
'--resume',
type=str,
default=None,
help='specify checkpoint path to be resumed from.')
parser.add_argument(
'--seed', type=int, default=None, help='Random seed for the training')
parser.add_argument(
'--cfg-options',
nargs='+',
action=DictAction,
help='override some settings in the used config, the key-value pair '
'in xxx=yyy format will be merged into config file. If the value to '
        'be overwritten is a list, it should be like key="[a,b]" or key=a,b. '
'It also allows nested list/tuple values, e.g. key="[(a,b),(c,d)]" '
'Note that the quotation marks are necessary and that no white space '
'is allowed.')
parser.add_argument(
'--launcher',
choices=['none', 'pytorch', 'slurm', 'mpi'],
default='none',
help='job launcher')
parser.add_argument('--local_rank', '--local-rank', type=int, default=0)
args = parser.parse_args()
return args
def register_function(cfg_dict):
if isinstance(cfg_dict, dict):
for key, value in dict.items(cfg_dict):
if isinstance(value, FunctionType):
value_str = str(value)
if value_str not in MAP_FUNC:
MAP_FUNC.register_module(module=value, name=value_str)
cfg_dict[key] = value_str
else:
register_function(value)
elif isinstance(cfg_dict, (list, tuple)):
for value in cfg_dict:
register_function(value)
def check_cfg(cfg, args):
if getattr(cfg, 'use_varlen_attn',
False) and cfg.train_dataloader.batch_size > 1:
raise NotImplementedError(
f'If utilizing varlen attention, the batch size should be'
f' set to 1, but got {cfg.train_dataloader.batch_size}')
if getattr(cfg, 'use_varlen_attn', False):
        sequence_parallel = getattr(cfg, 'sequence_parallel_size', 1)
max_length = getattr(cfg.train_dataloader.dataset, 'max_length', None)
if max_length is not None:
assert max_length % sequence_parallel == 0, \
('When using varlen attention, `max_length` should be evenly '
'divided by sequence parallel world size, but got '
f'max_length = {max_length} and sequence_parallel = '
f'{sequence_parallel}')
if getattr(cfg, 'sequence_parallel_size', 1) > 1:
assert SUPPORT_FLASH2, ('`flash_attn` is required if you want to use '
'sequence parallel.')
attn_implementation = getattr(cfg.model.llm, 'attn_implementation',
None)
assert (attn_implementation is None or
attn_implementation == 'flash_attention_2'), \
('If you want to use sequence parallel, please set '
'attn_implementation to `flash_attention_2` or do not '
f'set this attribute. Got `{attn_implementation}` .')
if getattr(cfg, 'use_varlen_attn', False):
assert SUPPORT_FLASH2, ('`flash_attn` is required if you set '
'`use_varlen_attn` to True.')
attn_implementation = getattr(cfg.model.llm, 'attn_implementation',
None)
assert (attn_implementation is None or
attn_implementation == 'flash_attention_2'), \
('If you want to set `use_varlen_attn` to True, please set'
' attn_implementation to `flash_attention_2` or do not '
f'set this attribute. Got `{attn_implementation}` .')
if args.deepspeed is None:
assert getattr(cfg, 'sequence_parallel_size', 1) == 1, \
            ('Sequence parallel training without DeepSpeed lacks validation. '
'Please use DeepSpeed to optimize the training phase by '
'`--deepspeed deepspeed_zero1 (deepspeed_zero2 or '
'deepspeed_zero3)`.')
def main():
args = parse_args()
# parse config
if not osp.isfile(args.config):
try:
args.config = cfgs_name_path[args.config]
except KeyError:
raise FileNotFoundError(f'Cannot find {args.config}')
# load config
cfg = Config.fromfile(args.config)
if args.cfg_options is not None:
cfg.merge_from_dict(args.cfg_options)
# register FunctionType object in cfg to `MAP_FUNC` Registry and
# change these FunctionType object to str
register_function(cfg._cfg_dict)
check_cfg(cfg, args)
if cfg.get('framework', 'mmengine').lower() == 'huggingface':
# set default training_args
if cfg.get('training_args', None) is None:
cfg.training_args = dict(type=TrainingArguments)
if args.seed is not None:
cfg.training_args.seed = args.seed
# set work_dir
if args.work_dir is not None:
# update configs according to CLI args if args.work_dir is not None
cfg.training_args.output_dir = args.work_dir
elif cfg.training_args.get('output_dir', None) is None:
# use config filename as default work_dir if cfg.work_dir is None
cfg.training_args.output_dir = osp.join(
'./work_dirs',
osp.splitext(osp.basename(args.config))[0])
# enable deepspeed
if args.deepspeed:
if not osp.isfile(args.deepspeed):
try:
args.deepspeed = cfgs_name_path[args.deepspeed]
except KeyError:
raise FileNotFoundError(f'Cannot find {args.deepspeed}')
cfg.training_args.deepspeed = args.deepspeed
if cfg.training_args.get('deepspeed'):
device_map = None
else:
# Data Parallel
device_map = {
'': int(os.environ.get('LOCAL_RANK', args.local_rank))
}
# build training_args
training_args = BUILDER.build(cfg.training_args)
# build model
with LoadWoInit():
cfg.model.device_map = device_map
traverse_dict(cfg.model)
model = BUILDER.build(cfg.model)
model.config.use_cache = False
dispatch_modules(model)
if cfg.get('lora', None):
lora = BUILDER.build(cfg.lora)
model = prepare_model_for_kbit_training(model)
if lora.target_modules is None:
modules = find_all_linear_names(model)
lora.target_modules = modules
model = get_peft_model(model, lora)
# build dataset
train_dataset = BUILDER.build(cfg.train_dataset)
data_collator = partial(default_collate_fn, return_hf_format=True)
# build trainer
trainer = cfg.trainer(
model=model,
args=training_args,
train_dataset=train_dataset,
data_collator=data_collator)
# training
trainer.train(resume_from_checkpoint=args.resume)
trainer.save_state()
trainer.save_model(output_dir=training_args.output_dir)
else:
if args.seed is not None and args.resume is None:
# Use args.seed
cfg.merge_from_dict(dict(randomness=dict(seed=args.seed)))
print_log(
f'Set the random seed to {args.seed}.',
logger='current',
level=logging.INFO)
elif args.resume is not None:
# Use resumed seed
from mmengine.fileio import PetrelBackend, get_file_backend
from xtuner.utils.fileio import patch_fileio
backend = get_file_backend(args.resume)
if isinstance(backend, PetrelBackend):
with patch_fileio():
resumed_seed = get_seed_from_checkpoint(args.resume)
else:
resumed_seed = get_seed_from_checkpoint(args.resume)
cfg.merge_from_dict(dict(randomness=dict(seed=resumed_seed)))
if args.seed is not None and args.seed != resumed_seed:
print_log(
(f'The value of random seed in resume checkpoint '
f'"{args.resume}" is different from the value in '
f'arguments. The resumed seed is {resumed_seed}, while '
f'the input argument seed is {args.seed}. Using the '
f'resumed seed {resumed_seed}.'),
logger='current',
level=logging.WARNING)
else:
print_log(
f'Set the random seed to {resumed_seed}.',
logger='current',
level=logging.INFO)
if 'LOCAL_RANK' not in os.environ:
os.environ['LOCAL_RANK'] = str(args.local_rank)
cfg.launcher = args.launcher
# work_dir is determined in this priority:
# CLI > segment in file > filename
if args.work_dir is not None:
# update configs according to CLI args if args.work_dir is not None
cfg.work_dir = args.work_dir
elif cfg.get('work_dir', None) is None:
# use config filename as default work_dir if cfg.work_dir is None
cfg.work_dir = osp.join('./work_dirs',
osp.splitext(osp.basename(args.config))[0])
if args.deepspeed:
try:
import deepspeed
except ImportError:
raise ImportError(
'deepspeed is not installed properly, please check.')
if digit_version(deepspeed.__version__) < digit_version('0.12.3'):
raise RuntimeError('Please upgrade your DeepSpeed version '
'by running the command '
'`pip install "deepspeed>=0.12.3"`')
optim_wrapper = cfg.optim_wrapper.type
if optim_wrapper == 'DeepSpeedOptimWrapper':
print_log(
'Deepspeed training is already enabled in your config.',
logger='current',
level=logging.WARNING)
else:
if not osp.isfile(args.deepspeed):
try:
args.deepspeed = cfgs_name_path[args.deepspeed]
except KeyError:
raise FileNotFoundError(
f'Cannot find {args.deepspeed}')
with open(args.deepspeed) as f:
ds_cfg = json.load(f)
ds_grad_accum = ds_cfg.get('gradient_accumulation_steps',
'auto')
mm_grad_accum = cfg.optim_wrapper.get('accumulative_counts', 1)
if ds_grad_accum != 'auto' and ds_grad_accum != mm_grad_accum:
print_log(('Mismatch on gradient_accumulation_steps: '
f'MMEngine {mm_grad_accum}, '
f'Deepspeed {ds_grad_accum}. '
f'Set to {mm_grad_accum}'),
logger='current',
level=logging.WARNING)
grad_accum = mm_grad_accum
ds_train_bs = ds_cfg.get('train_micro_batch_size_per_gpu',
'auto')
mm_train_bs = cfg.train_dataloader.batch_size
if ds_train_bs != 'auto' and ds_train_bs != mm_train_bs:
print_log(
('Mismatch on train_micro_batch_size_per_gpu: '
f'MMEngine {mm_train_bs}, Deepspeed {ds_train_bs}. '
f'Set to {mm_train_bs}'),
logger='current',
level=logging.WARNING)
train_bs = cfg.train_dataloader.batch_size
ds_grad_clip = ds_cfg.get('gradient_clipping', 'auto')
clip_grad = cfg.optim_wrapper.get('clip_grad', None)
if clip_grad and clip_grad.get('max_norm', None) is not None:
mm_max_norm = cfg.optim_wrapper.clip_grad.max_norm
else:
mm_max_norm = 1.0
if ds_grad_clip != 'auto' and ds_grad_clip != mm_max_norm:
print_log(
('Mismatch on gradient_clipping: '
f'MMEngine {mm_max_norm}, Deepspeed {ds_grad_clip}. '
f'Set to {mm_max_norm}'),
logger='current',
level=logging.WARNING)
grad_clip = mm_max_norm
ds_cfg = auto_dtype_of_deepspeed_config(ds_cfg)
exclude_frozen_parameters = True if digit_version(
deepspeed.__version__) >= digit_version('0.10.1') else None
strategy = dict(
type=LazyObject('xtuner.engine', 'DeepSpeedStrategy'),
config=ds_cfg,
gradient_accumulation_steps=grad_accum,
train_micro_batch_size_per_gpu=train_bs,
gradient_clipping=grad_clip,
exclude_frozen_parameters=exclude_frozen_parameters,
sequence_parallel_size=getattr(cfg,
'sequence_parallel_size',
1))
cfg.__setitem__('strategy', strategy)
optim_wrapper = dict(
type='DeepSpeedOptimWrapper',
optimizer=cfg.optim_wrapper.optimizer)
cfg.__setitem__('optim_wrapper', optim_wrapper)
cfg.runner_type = 'FlexibleRunner'
# resume is determined in this priority: resume from > auto_resume
if args.resume is not None:
cfg.resume = True
cfg.load_from = args.resume
# build the runner from config
if 'runner_type' not in cfg:
# build the default runner
runner = Runner.from_cfg(cfg)
else:
# build customized runner from the registry
# if 'runner_type' is set in the cfg
runner = RUNNERS.build(cfg)
# start training
runner.train()
if __name__ == '__main__':
main()
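# Illustrative sketch, not part of xtuner: the work_dir priority described in
# the comments above (CLI value, then the value in the config file, then the
# config filename) can be written as a small pure function. The helper name
# `resolve_work_dir` and its parameters are hypothetical.

```python
import os.path as osp


def resolve_work_dir(cli_work_dir, cfg_work_dir, config_path):
    """Return the work dir using the priority CLI > config field > filename."""
    if cli_work_dir is not None:
        # A value passed on the command line always wins.
        return cli_work_dir
    if cfg_work_dir is not None:
        # Otherwise keep whatever the config file already specifies.
        return cfg_work_dir
    # Fall back to ./work_dirs/<config filename without extension>.
    stem = osp.splitext(osp.basename(config_path))[0]
    return osp.join('./work_dirs', stem)
```

# Usage: resolve_work_dir(None, None, 'configs/my_run.py') falls through to
# the filename-based default under ./work_dirs.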
================================================
FILE: xtuner-eval_niah/xtuner/tools/utils.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
import os.path as osp
import re
import warnings
import torch
from transformers import PreTrainedTokenizerFast, StoppingCriteriaList
from transformers.generation.streamers import BaseStreamer
from xtuner.utils import StopWordStoppingCriteria
def get_base_model(model):
if hasattr(model, 'llm'):
model = model.llm
if 'PeftModel' in model.__class__.__name__:
model = model.base_model.model
return model
def get_streamer(model):
# TODO: deprecation, v0.3.0
warnings.warn(
('`get_streamer` is deprecated and will be removed in v0.3.0, '
"use `transformers`'s `TextStreamer` instead."), DeprecationWarning)
if model.__class__.__name__ == 'InferenceEngine':
model = model.module
base_model = get_base_model(model)
base_model_name = base_model.__class__.__name__.lower()
is_internlm = 'internlm' in base_model_name
is_qwen = 'qwen' in base_model_name
is_baichuan = 'baichuan' in base_model_name
is_chatglm = 'chatglm' in base_model_name
no_space = is_internlm or is_qwen or is_baichuan or is_chatglm
if no_space:
return NoSpaceStreamer
else:
return DecodeOutputStreamer
class DecodeOutputStreamer(BaseStreamer):
"""Default streamer for HuggingFace models."""
def __init__(self, tokenizer, skip_prompt=True) -> None:
super().__init__()
# TODO: deprecation, v0.3.0
warnings.warn(
'`DecodeOutputStreamer` is deprecated and will be '
'removed in v0.3.0.', DeprecationWarning)
self.tokenizer = tokenizer
self.skip_prompt = skip_prompt
self.gen_len = 0
if isinstance(tokenizer, PreTrainedTokenizerFast):
self.decode = self._decode_with_raw_id
self.hex_regex = re.compile(r'^<0x([0-9ABCDEF]+)>$')
else:
self.decode = self._decode_fallback
def _decode_with_raw_id(self, value):
"""Convert token ids to tokens and decode."""
tok = self.tokenizer._convert_id_to_token(value)
if tok.startswith('▁'): # sentencepiece
space = ' '
tok = tok[1:]
else:
space = ''
if res := self.hex_regex.match(tok):
tok = chr(int(res.group(1), 16))
if tok == '':
tok = '\n'
return space + tok
def _decode_fallback(self, value):
"""Fallback decoder for non-fast tokenizer."""
tok = self.tokenizer.decode(
value,
skip_special_tokens=False,
clean_up_tokenization_spaces=False)
return tok + ' '
def put(self, value):
"""Callback function to decode token and output to stdout."""
if self.gen_len == 0 and self.skip_prompt:
pass
else:
tok = self.decode(value[0])
print(tok, end='', flush=True)
self.gen_len += 1
def end(self):
"""Callback function to finish generation."""
print('\n')
class NoSpaceStreamer(DecodeOutputStreamer):
def __init__(self, tokenizer, skip_prompt=True) -> None:
BaseStreamer.__init__(self)
# TODO: deprecation, v0.3.0
warnings.warn(
'`NoSpaceStreamer` is deprecated and will be '
'removed in v0.3.0.', DeprecationWarning)
self.tokenizer = tokenizer
self.skip_prompt = skip_prompt
self.gen_len = 0
self.hex_regex = re.compile(r'^<0x([0-9ABCDEF]+)>$')
def decode(self, value):
tok = self.tokenizer.decode(value)
if res := self.hex_regex.match(tok):
tok = chr(int(res.group(1), 16))
if tok == '' or tok == '\r':
tok = '\n'
return tok
def get_stop_criteria(
tokenizer,
stop_words=(),
):
stop_criteria = StoppingCriteriaList()
for word in stop_words:
stop_criteria.append(StopWordStoppingCriteria(tokenizer, word))
return stop_criteria
def auto_dtype_of_deepspeed_config(ds_config):
if ds_config.get('fp16') and not ds_config.get('bf16'):
if ds_config.get('fp16').get('enabled') == 'auto':
ds_config['fp16']['enabled'] = torch.cuda.is_available()
elif not ds_config.get('fp16') and ds_config.get('bf16'):
if ds_config.get('bf16').get('enabled') == 'auto':
ds_config['bf16']['enabled'] = torch.cuda.is_bf16_supported()
elif ds_config.get('fp16') and ds_config.get('bf16'):
if ds_config.get('fp16').get('enabled') == 'auto':
ds_config['fp16']['enabled'] = torch.cuda.is_available()
if ds_config.get('bf16').get('enabled') == 'auto':
ds_config['bf16']['enabled'] = torch.cuda.is_bf16_supported()
if (ds_config['fp16']['enabled'] is True
and ds_config['bf16']['enabled'] is True):
ds_config['fp16']['enabled'] = False
ds_config['bf16']['enabled'] = True
return ds_config
def is_cn_string(s):
if re.search('[\u4e00-\u9fff]', s):
return True
return False
def get_seed_from_checkpoint(pth_model):
if osp.isfile(pth_model):
checkpoint = torch.load(pth_model, map_location='cpu')
elif osp.isdir(pth_model):
try:
from deepspeed.utils.zero_to_fp32 import get_model_state_files
except ImportError:
raise ImportError(
'The provided PTH model appears to be a DeepSpeed checkpoint. '
'However, the DeepSpeed library was not detected in the current '
'environment. This suggests that DeepSpeed may not be '
'installed or is incorrectly configured. Please verify your '
'setup.')
filename = get_model_state_files(pth_model)[0]
checkpoint = torch.load(filename, map_location='cpu')
else:
raise FileNotFoundError(f'Cannot find {pth_model}')
return checkpoint['meta']['seed']
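# Illustrative sketch, not part of xtuner: the 'auto' resolution performed by
# `auto_dtype_of_deepspeed_config` above, rewritten so that the hardware
# capabilities are passed in as plain booleans instead of being queried from
# torch. The name `resolve_auto_dtype` and its two flag parameters are
# hypothetical; the sketch also copies the input rather than mutating it.

```python
import copy


def resolve_auto_dtype(ds_config, cuda_available, bf16_supported):
    """Resolve 'auto' in the fp16/bf16 sections; prefer bf16 if both are on."""
    cfg = copy.deepcopy(ds_config)
    fp16 = cfg.get('fp16')
    bf16 = cfg.get('bf16')
    if fp16 and fp16.get('enabled') == 'auto':
        # fp16 is usable whenever a CUDA device is present.
        fp16['enabled'] = cuda_available
    if bf16 and bf16.get('enabled') == 'auto':
        # bf16 additionally needs hardware support.
        bf16['enabled'] = bf16_supported
    if fp16 and bf16:
        if fp16.get('enabled') is True and bf16.get('enabled') is True:
            # Both cannot be active at once, so bf16 takes precedence.
            fp16['enabled'] = False
    return cfg
```

# Usage: with both sections set to 'auto' on bf16-capable hardware, the result
# enables bf16 and disables fp16; without bf16 support it falls back to fp16.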
================================================
FILE: xtuner-eval_niah/xtuner/utils/__init__.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from .constants import (DEFAULT_IMAGE_TOKEN, DEFAULT_PAD_TOKEN_INDEX,
IGNORE_INDEX, IMAGE_TOKEN_INDEX)
from .handle_moe_load_and_save import (SUPPORT_MODELS, get_origin_state_dict,
load_state_dict_into_model)
from .stop_criteria import StopWordStoppingCriteria
from .templates import PROMPT_TEMPLATE, SYSTEM_TEMPLATE
__all__ = [
'IGNORE_INDEX', 'DEFAULT_PAD_TOKEN_INDEX', 'PROMPT_TEMPLATE',
'DEFAULT_IMAGE_TOKEN', 'SYSTEM_TEMPLATE', 'StopWordStoppingCriteria',
'IMAGE_TOKEN_INDEX', 'load_state_dict_into_model', 'get_origin_state_dict',
'SUPPORT_MODELS'
]
================================================
FILE: xtuner-eval_niah/xtuner/utils/constants.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
IGNORE_INDEX = -100
DEFAULT_PAD_TOKEN_INDEX = 0
IMAGE_TOKEN_INDEX = -200
DEFAULT_IMAGE_TOKEN = '<image>'
================================================
FILE: xtuner-eval_niah/xtuner/utils/fileio.py
================================================
import io
from contextlib import contextmanager
import mmengine.fileio as fileio
from mmengine.fileio import LocalBackend, PetrelBackend, get_file_backend
def patch_func(module, fn_name_to_wrap):
backup = getattr(patch_func, '_backup', [])
fn_to_wrap = getattr(module, fn_name_to_wrap)
def wrap(fn_new):
setattr(module, fn_name_to_wrap, fn_new)
backup.append((module, fn_name_to_wrap, fn_to_wrap))
setattr(fn_new, '_fallback', fn_to_wrap)
setattr(patch_func, '_backup', backup)
return fn_new
return wrap
@contextmanager
def patch_fileio(global_vars=None):
if getattr(patch_fileio, '_patched', False):
# Patch only once to avoid errors caused by nested patching.
yield
return
import builtins
@patch_func(builtins, 'open')
def open(file, mode='r', *args, **kwargs):
backend = get_file_backend(file)
if isinstance(backend, LocalBackend):
return open._fallback(file, mode, *args, **kwargs)
if 'b' in mode:
return io.BytesIO(backend.get(file, *args, **kwargs))
else:
return io.StringIO(backend.get_text(file, *args, **kwargs))
if global_vars is not None and 'open' in global_vars:
bak_open = global_vars['open']
global_vars['open'] = builtins.open
import os
@patch_func(os.path, 'join')
def join(a, *paths):
backend = get_file_backend(
a.decode('utf-8') if isinstance(a, bytes) else a)
if isinstance(backend, LocalBackend):
return join._fallback(a, *paths)
paths = [item.lstrip('./') for item in paths if len(item) > 0]
return backend.join_path(a, *paths)
@patch_func(os.path, 'isdir')
def isdir(path):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return isdir._fallback(path)
return backend.isdir(path)
@patch_func(os.path, 'isfile')
def isfile(path):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return isfile._fallback(path)
return backend.isfile(path)
@patch_func(os.path, 'exists')
def exists(path):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return exists._fallback(path)
return backend.exists(path)
@patch_func(os, 'mkdir')
def mkdir(path, *args, **kwargs):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return mkdir._fallback(path, *args, **kwargs)
@patch_func(os, 'makedirs')
def makedirs(path, *args, **kwargs):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return makedirs._fallback(path, *args, **kwargs)
@patch_func(os, 'listdir')
def listdir(path):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return listdir._fallback(path)
return backend.list_dir_or_file(path)
@patch_func(os, 'chmod')
def chmod(path, *args, **kwargs):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return chmod._fallback(path, *args, **kwargs)
@patch_func(os, 'stat')
def stat(path, *args, **kwargs):
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return stat._fallback(path, *args, **kwargs)
import glob as glob_pkg
@patch_func(glob_pkg, 'glob')
def glob(pathname, *, recursive=False):
backend = get_file_backend(pathname)
if isinstance(backend, LocalBackend):
return glob._fallback(pathname, recursive=recursive)
if pathname.endswith('*_optim_states.pt'):
import os
pathname = os.path.split(pathname)[0]
files = backend.list_dir_or_file(pathname, recursive=recursive)
files = [
os.path.join(pathname, f) for f in files
if f.endswith('_optim_states.pt')
]
elif pathname.endswith('*_model_states.pt'):
import os
pathname = os.path.split(pathname)[0]
files = backend.list_dir_or_file(pathname, recursive=recursive)
files = [
os.path.join(pathname, f) for f in files
if f.endswith('_model_states.pt')
]
elif '*' in pathname:
raise NotImplementedError
else:
files = backend.list_dir_or_file(pathname, recursive=recursive)
return files
import filecmp
@patch_func(filecmp, 'cmp')
def cmp(f1, f2, *args, **kwargs):
with fileio.get_local_path(f1) as f1, fileio.get_local_path(f2) as f2:
return cmp._fallback(f1, f2, *args, **kwargs)
import shutil
@patch_func(shutil, 'copy')
def copy(src, dst, **kwargs):
from pathlib import Path
if isinstance(src, Path):
src = str(src).replace(':/', '://')
if isinstance(dst, Path):
dst = str(dst).replace(':/', '://')
src_backend = get_file_backend(src)
dst_backend = get_file_backend(dst)
if isinstance(src_backend, LocalBackend) and isinstance(
dst_backend, LocalBackend):
return copy._fallback(src, dst, **kwargs)
elif isinstance(src_backend, LocalBackend) and isinstance(
dst_backend, PetrelBackend):
return dst_backend.copyfile_from_local(str(src), str(dst))
elif isinstance(src_backend, PetrelBackend) and isinstance(
dst_backend, LocalBackend):
return src_backend.copyfile_to_local(str(src), str(dst))
import torch
@patch_func(torch, 'load')
def load(f, *args, **kwargs):
if isinstance(f, str):
f = io.BytesIO(fileio.get(f))
return load._fallback(f, *args, **kwargs)
@patch_func(torch, 'save')
def save(obj, f, *args, **kwargs):
backend = get_file_backend(f)
if isinstance(backend, LocalBackend):
return save._fallback(obj, f, *args, **kwargs)
with io.BytesIO() as buffer:
save._fallback(obj, buffer, *args, **kwargs)
buffer.seek(0)
backend.put(buffer, f)
# from tempfile import TemporaryDirectory
# import os
# with TemporaryDirectory(dir='/dev/shm') as tmpdir:
# suffix = os.path.split(f)[-1]
# tmppath = os.path.join._fallback(tmpdir, suffix)
# from mmengine import print_log
# print_log('write to tmp dir', logger='current')
# save._fallback(obj, tmppath, *args, **kwargs)
# print_log('write to ceph', logger='current')
# with open(tmppath, 'rb') as buffer:
# backend.put(buffer, f)
from sentencepiece import SentencePieceProcessor
@patch_func(SentencePieceProcessor, 'LoadFromFile')
def LoadFromFile(cls, path):
if path:
backend = get_file_backend(path)
if isinstance(backend, LocalBackend):
return LoadFromFile._fallback(cls, path)
from tempfile import TemporaryDirectory
with TemporaryDirectory() as tmpdir:
local_path = backend.copyfile_to_local(path, tmpdir)
loaded_file = LoadFromFile._fallback(cls, local_path)
return loaded_file
else:
return LoadFromFile._fallback(cls, path)
try:
setattr(patch_fileio, '_patched', True)
yield
finally:
for patched_fn in patch_func._backup:
(module, fn_name_to_wrap, fn_to_wrap) = patched_fn
setattr(module, fn_name_to_wrap, fn_to_wrap)
if global_vars is not None and 'open' in global_vars:
global_vars['open'] = bak_open
setattr(patch_fileio, '_patched', False)
def patch_hf_auto_from_pretrained(petrel_hub):
if hasattr(patch_hf_auto_from_pretrained, '_patched'):
return
from peft import PeftModel
from transformers import (AutoConfig, AutoFeatureExtractor,
AutoImageProcessor, AutoModelForCausalLM,
AutoProcessor, AutoTokenizer,
ImageProcessingMixin, PreTrainedModel,
PreTrainedTokenizerBase, ProcessorMixin)
from transformers.models.auto.auto_factory import _BaseAutoModelClass
target_cls = list(_BaseAutoModelClass.__subclasses__())
target_cls.extend([AutoModelForCausalLM] +
AutoModelForCausalLM.__subclasses__())
target_cls.extend([AutoConfig] + AutoConfig.__subclasses__())
target_cls.extend([AutoTokenizer] + AutoTokenizer.__subclasses__())
target_cls.extend([AutoImageProcessor] +
AutoImageProcessor.__subclasses__())
target_cls.extend([AutoFeatureExtractor] +
AutoFeatureExtractor.__subclasses__())
target_cls.extend([AutoProcessor] + AutoProcessor.__subclasses__())
target_cls.extend([PreTrainedTokenizerBase] +
PreTrainedTokenizerBase.__subclasses__())
target_cls.extend([ImageProcessingMixin] +
ImageProcessingMixin.__subclasses__())
target_cls.extend([PreTrainedModel] + PreTrainedModel.__subclasses__())
target_cls.extend([ProcessorMixin] + ProcessorMixin.__subclasses__())
target_cls.extend([PeftModel] + PeftModel.__subclasses__())
import os
@classmethod
def from_pretrained(cls, pretrained_model_name_or_path, *args, **kwargs):
with patch_fileio():
model_path = pretrained_model_name_or_path
model_path = os.path.join(petrel_hub, model_path)
obj = cls._from_pretrained(model_path, *args, **kwargs)
return obj
for cls in set(target_cls):
if not hasattr(cls, '_from_pretrained'):
cls._from_pretrained = cls.from_pretrained
cls.from_pretrained = from_pretrained
patch_hf_auto_from_pretrained._patched = True
def patch_hf_save_pretrained():
if hasattr(patch_hf_save_pretrained, '_patched'):
return
import torch
from peft import PeftModel
from transformers import (AutoConfig, AutoTokenizer, PreTrainedModel,
PreTrainedTokenizerBase)
from transformers.models.auto.auto_factory import _BaseAutoModelClass
target_cls = []
target_cls.extend([AutoConfig] + AutoConfig.__subclasses__())
target_cls.extend([AutoTokenizer] + AutoTokenizer.__subclasses__())
target_cls.extend([PreTrainedTokenizerBase] +
PreTrainedTokenizerBase.__subclasses__())
target_cls.extend([PreTrainedModel] + PreTrainedModel.__subclasses__())
target_cls.extend([_BaseAutoModelClass] +
_BaseAutoModelClass.__subclasses__())
target_cls.extend([PeftModel] + PeftModel.__subclasses__())
def _patch_wrap(method):
def wrapped_method(self, *args, **kwargs):
with patch_fileio():
kwargs['save_function'] = torch.save
kwargs['safe_serialization'] = False
obj = method(self, *args, **kwargs)
return obj
return wrapped_method
for cls in set(target_cls):
if hasattr(cls, 'save_pretrained'):
cls.save_pretrained = _patch_wrap(cls.save_pretrained)
patch_hf_save_pretrained._patched = True
def patch_deepspeed_engine():
if hasattr(patch_deepspeed_engine, '_patched'):
return
def _copy_recovery_script(self, save_path):
import os
from shutil import copyfile
from deepspeed.utils import zero_to_fp32
from mmengine import PetrelBackend, get_file_backend
script = 'zero_to_fp32.py'
src = zero_to_fp32.__file__
dst = os.path.join(save_path, script)
backend = get_file_backend(save_path)
if isinstance(backend, PetrelBackend):
backend.copyfile_from_local(src, dst)
else:
copyfile(src, dst)
self._change_recovery_script_permissions(dst)
from deepspeed.runtime.engine import DeepSpeedEngine
DeepSpeedEngine._copy_recovery_script = _copy_recovery_script
patch_deepspeed_engine._patched = True
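# Illustrative sketch, not part of xtuner: the core idea behind `patch_func`
# and `patch_fileio` above is to swap a module attribute for a wrapper that
# keeps the original as `_fallback`, then restore it on exit. The helper
# `patch_attr` and the 'remote://' prefix are hypothetical; the real code
# dispatches on mmengine file backends instead of a string prefix.

```python
import os.path as osp
from contextlib import contextmanager


@contextmanager
def patch_attr(module, name, replacement):
    """Temporarily replace `module.name`, exposing the original as _fallback."""
    original = getattr(module, name)
    replacement._fallback = original
    setattr(module, name, replacement)
    try:
        yield
    finally:
        # Always restore the original, even if the body raised.
        setattr(module, name, original)


def join(a, *paths):
    """Join remote-style paths with '/', defer to the original otherwise."""
    if a.startswith('remote://'):
        return '/'.join([a.rstrip('/'), *paths])
    return join._fallback(a, *paths)


with patch_attr(osp, 'join', join):
    remote = osp.join('remote://bucket', 'ckpt', 'model.pt')
    local = osp.join('work_dirs', 'model.pt')
```

# Inside the `with` block, remote-style paths take the custom branch and local
# paths go through `_fallback`; afterwards `os.path.join` is the original again.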
================================================
FILE: xtuner-eval_niah/xtuner/utils/handle_moe_load_and_save.py
================================================
import json
import os
import re
from collections import OrderedDict
import torch
import torch.distributed as dist
import torch.nn as nn
from mmengine import print_log
from transformers.integrations import is_deepspeed_zero3_enabled
from transformers.modeling_utils import load_state_dict
from transformers.utils import (SAFE_WEIGHTS_INDEX_NAME, WEIGHTS_INDEX_NAME,
is_safetensors_available)
SUPPORT_MODELS = (
'DeepseekV2ForCausalLM',
'MixtralForCausalLM',
)
ORDER_MAPPING = dict(
DeepseekV2ForCausalLM=dict(down_proj=0, gate_proj=1, up_proj=2),
MixtralForCausalLM=dict(down_proj=1, gate_proj=0, up_proj=2),
)
PARAM_NAME_MAPPING = dict(
DeepseekV2ForCausalLM=dict(
gate_proj='gate_proj', up_proj='up_proj', down_proj='down_proj'),
MixtralForCausalLM=dict(gate_proj='w1', up_proj='w3', down_proj='w2'),
)
def print_on_rank0(info):
if dist.get_rank() == 0:
print_log(info, 'current')
def get_expert_num_per_shard(model):
for module in model.modules():
if hasattr(module, 'expert_in_one_shard'):
return module.expert_in_one_shard
def mix_sort(expert_name):
components = re.findall(r'(\D+|\d+)', expert_name)
out = [int(comp) if comp.isdigit() else comp for comp in components]
return tuple(out)
def _get_merged_param_name(origin_param_name, expert_num_per_shard):
split_name = origin_param_name.split('.experts.')
expert_idx = re.findall(r'\d+', split_name[1])[0]
expert_idx = int(expert_idx)
assert expert_idx % expert_num_per_shard == 0
shard_idx = expert_idx // expert_num_per_shard
w1w3 = split_name[0] + f'.experts.{shard_idx}.w1w3'
w2 = split_name[0] + f'.experts.{shard_idx}.w2'
return w1w3, w2
def _merge_experts_weight(state_dict, expert_num_per_shard, order_mapping):
experts_name = [key for key in state_dict.keys() if '.experts.' in key]
experts_name = sorted(experts_name, key=mix_sort)
linear_num_per_expert = 3
linear_num_per_shard = expert_num_per_shard * linear_num_per_expert
expert_shard_num = len(experts_name) // linear_num_per_shard
for shard_idx in range(expert_shard_num):
begin, end = shard_idx * linear_num_per_shard, (
shard_idx + 1) * linear_num_per_shard
experts_name_cur = experts_name[begin:end]
down_proj_weight = [
state_dict.pop(key)
for key in experts_name_cur[order_mapping['down_proj']::3]
]
gate_proj_weight = [
state_dict.pop(key)
for key in experts_name_cur[order_mapping['gate_proj']::3]
]
up_proj_weight = [
state_dict.pop(key)
for key in experts_name_cur[order_mapping['up_proj']::3]
]
w1 = torch.stack(gate_proj_weight)
w3 = torch.stack(up_proj_weight)
w1w3 = torch.cat([w1, w3], dim=1)
assert w1w3.ndim == 3, w1w3.shape
w2 = torch.stack(down_proj_weight)
assert w2.ndim == 3, w2.shape
merged_key_w1w3, merged_key_w2 = _get_merged_param_name(
experts_name_cur[0], expert_num_per_shard)
print_on_rank0(f'merged key {merged_key_w1w3}')
state_dict[merged_key_w1w3] = w1w3
print_on_rank0(f'merged key {merged_key_w2}')
state_dict[merged_key_w2] = w2
return
def load_state_dict_into_model(model_to_load, pretrained_model_path):
model_name = type(model_to_load).__name__
if model_name not in SUPPORT_MODELS:
raise RuntimeError(
f'Only models in {SUPPORT_MODELS} may need to load pretrained '
f'weights via `load_state_dict_into_model`, but got {model_name}.')
order_mapping = ORDER_MAPPING[model_name]
index_file = os.path.join(pretrained_model_path, WEIGHTS_INDEX_NAME)
safe_index_file = os.path.join(pretrained_model_path,
SAFE_WEIGHTS_INDEX_NAME)
index_present = os.path.isfile(index_file)
safe_index_present = os.path.isfile(safe_index_file)
assert index_present or (safe_index_present and is_safetensors_available())
if safe_index_present and is_safetensors_available():
load_index = safe_index_file
else:
load_index = index_file
with open(load_index, encoding='utf-8') as f:
index = json.load(f)
weight_map = index['weight_map']
unloaded_shard_files = list(set(weight_map.values()))
unloaded_shard_files.sort(reverse=True)
expert_num_per_shard = get_expert_num_per_shard(model_to_load)
error_msgs = []
def load(module: nn.Module, state_dict, unloaded_shard_files, prefix=''):
params_to_gather = []
param_names = []
for name, param in module.named_parameters(
prefix=prefix[:-1], recurse=False):
while name not in state_dict:
assert len(unloaded_shard_files) > 0
shard_file = unloaded_shard_files.pop()
shard_file = os.path.join(pretrained_model_path, shard_file)
print_on_rank0(
f'{name} not in state_dict, loading {shard_file}')
new_shard = load_state_dict(shard_file, is_quantized=False)
state_dict.update(new_shard)
_merge_experts_weight(state_dict, expert_num_per_shard,
order_mapping)
params_to_gather.append(param)
param_names.append(name)
if len(params_to_gather) > 0:
args = (state_dict, prefix, {}, True, [], [], error_msgs)
if is_deepspeed_zero3_enabled():
import deepspeed
with deepspeed.zero.GatheredParameters(
params_to_gather, modifier_rank=0):
if dist.get_rank() == 0:
module._load_from_state_dict(*args)
else:
module._load_from_state_dict(*args)
for name in param_names:
print_on_rank0(f'state_dict pop {name}')
state_dict.pop(name)
for name, child in module._modules.items():
if child is not None:
load(child, state_dict, unloaded_shard_files,
prefix + name + '.')
state_dict = OrderedDict()
load(model_to_load, state_dict, unloaded_shard_files, prefix='')
print_on_rank0(f'{state_dict.keys()}')
del state_dict
return error_msgs
def _get_origin_param_name(merged_param_name, expert_num_per_shard, is_w1w3,
param_name_mapping):
split_name = merged_param_name.split('.experts.')
shard_idx = re.findall(r'\d+', split_name[1])[0]
shard_idx = int(shard_idx)
origin_param_names = [None] * (expert_num_per_shard * (1 + int(is_w1w3)))
expert_idx_begin = expert_num_per_shard * shard_idx
for i in range(expert_num_per_shard):
if is_w1w3:
gate_proj, up_proj = param_name_mapping[
'gate_proj'], param_name_mapping['up_proj']
gate = split_name[
0] + f'.experts.{expert_idx_begin + i}.{gate_proj}.weight'
up = split_name[
0] + f'.experts.{expert_idx_begin + i}.{up_proj}.weight'
origin_param_names[i * 2] = gate
origin_param_names[i * 2 + 1] = up
else:
down_proj = param_name_mapping['down_proj']
down = split_name[
0] + f'.experts.{expert_idx_begin + i}.{down_proj}.weight'
origin_param_names[i] = down
return origin_param_names
def _split_param(merged_param, is_w1w3):
if is_w1w3:
expert_num, _, hidden_dim = merged_param.shape
merged_param = merged_param.view(expert_num * 2, -1, hidden_dim)
return torch.unbind(merged_param, dim=0)
else:
# (e, hidden_dim, ffn_dim)
return torch.unbind(merged_param, dim=0)
def get_origin_state_dict(state_dict, model):
model_name = type(model).__name__
if model_name not in SUPPORT_MODELS:
raise RuntimeError(
f'Only models in {SUPPORT_MODELS} may need to convert state_dict '
f'via `get_origin_state_dict` interface, but got {model_name}.')
param_name_mapping = PARAM_NAME_MAPPING[model_name]
expert_num_per_shard = get_expert_num_per_shard(model)
experts_param_name = [
name for name in state_dict.keys() if '.experts.' in name
]
for expert_param_name in experts_param_name:
print_on_rank0(f'processing {expert_param_name} ...')
is_w1w3 = expert_param_name.split('.')[-1] == 'w1w3'
origin_param_names = _get_origin_param_name(expert_param_name,
expert_num_per_shard,
is_w1w3,
param_name_mapping)
merged_param = state_dict.pop(expert_param_name)
origin_params = _split_param(merged_param, is_w1w3)
assert len(origin_param_names) == len(origin_params)
for name, param in zip(origin_param_names, origin_params):
state_dict[name] = param
return state_dict
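# Illustrative sketch, not part of xtuner: `mix_sort` above exists so that
# expert parameter names sort by numeric expert index rather than
# lexicographically. The parameter names below are made-up examples in the
# Mixtral naming style.

```python
import re


def mix_sort(expert_name):
    """Split a name into digit and non-digit runs; digits compare as ints."""
    components = re.findall(r'(\D+|\d+)', expert_name)
    return tuple(int(c) if c.isdigit() else c for c in components)


names = [
    'layers.0.experts.10.w1.weight',
    'layers.0.experts.2.w1.weight',
    'layers.0.experts.1.w1.weight',
]
natural_order = sorted(names, key=mix_sort)
plain_order = sorted(names)
```

# With the key, expert 2 sorts before expert 10; a plain string sort would
# place expert 10 before expert 2, which would break the stride-3 slicing in
# `_merge_experts_weight`.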
================================================
FILE: xtuner-eval_niah/xtuner/utils/stop_criteria.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from transformers import StoppingCriteria
class StopWordStoppingCriteria(StoppingCriteria):
"""StopWord stopping criteria."""
def __init__(self, tokenizer, stop_word):
self.tokenizer = tokenizer
self.stop_word = stop_word
self.length = len(self.stop_word)
def __call__(self, input_ids, *args, **kwargs) -> bool:
cur_text = self.tokenizer.decode(input_ids[0])
cur_text = cur_text.replace('\r', '').replace('\n', '')
return cur_text[-self.length:] == self.stop_word
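# Illustrative sketch, not part of xtuner: the check in
# `StopWordStoppingCriteria.__call__` above, reproduced without transformers.
# `ToyTokenizer` is a hypothetical stand-in whose vocabulary maps ids to text
# pieces. Unlike the class above, this sketch explicitly treats an empty stop
# word as "never stop".

```python
class ToyTokenizer:
    """Hypothetical tokenizer: each id maps to a fixed piece of text."""

    def __init__(self, vocab):
        self.vocab = vocab

    def decode(self, ids):
        return ''.join(self.vocab[i] for i in ids)


def ends_with_stop_word(tokenizer, input_ids, stop_word):
    """Return True if the decoded text, minus newlines, ends with stop_word."""
    text = tokenizer.decode(input_ids)
    # Newlines are dropped first so a trailing '\n' cannot hide the stop word.
    text = text.replace('\r', '').replace('\n', '')
    return bool(stop_word) and text.endswith(stop_word)


tok = ToyTokenizer({0: 'Hello', 1: ' world', 2: '<|im_end|>', 3: '\n'})
stopped = ends_with_stop_word(tok, [0, 1, 2, 3], '<|im_end|>')
running = ends_with_stop_word(tok, [0, 1], '<|im_end|>')
```

# For a non-empty stop word, `text.endswith(stop_word)` is equivalent to the
# slice comparison used in the class above.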
================================================
FILE: xtuner-eval_niah/xtuner/utils/templates.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
from mmengine.config import ConfigDict
# - Turn 0: SYSTEM + INSTRUCTION, [output + SUFFIX], SEP
# - Turn 1: INSTRUCTION, [output + SUFFIX], SEP
# - Turn ...
# Note: [] means having supervised loss during the fine-tuning
PROMPT_TEMPLATE = ConfigDict(
default=dict(
SYSTEM='<|System|>:{system}\n',
INSTRUCTION='<|User|>:{input}\n<|Bot|>:',
SEP='\n'),
zephyr=dict(
SYSTEM='<|system|>\n{system}\n',
INSTRUCTION='<|user|>\n{input}\n<|assistant|>\n',
SEP='\n'),
internlm_chat=dict(
SYSTEM='<|System|>:{system}\n',
INSTRUCTION='<|User|>:{input}<eoh>\n<|Bot|>:',
SUFFIX='<eoa>',
SUFFIX_AS_EOS=True,
SEP='\n',
STOP_WORDS=['<eoa>']),
internlm2_chat=dict(
SYSTEM='<|im_start|>system\n{system}<|im_end|>\n',
INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
'<|im_start|>assistant\n'),
SUFFIX='<|im_end|>',
SUFFIX_AS_EOS=True,
SEP='\n',
STOP_WORDS=['<|im_end|>']),
moss_sft=dict(
SYSTEM='{system}\n',
INSTRUCTION='<|Human|>: {input}\n',
SEP='\n',
STOP_WORDS=['', '']),
llama2_chat=dict(
SYSTEM=(
'[INST] <<SYS>>\n You are a helpful, respectful and honest '
'assistant. Always answer as helpfully as possible, while being '
'safe. Your answers should not include any harmful, unethical, '
'racist, sexist, toxic, dangerous, or illegal content. Please '
'ensure that your responses are socially unbiased and positive in '
'nature.\n{system}\n<</SYS>>\n [/INST] '),
INSTRUCTION='[INST] {input} [/INST]',
SEP='\n'),
code_llama_chat=dict(
SYSTEM='{system}\n', INSTRUCTION='[INST] {input} [/INST]'),
chatglm2=dict(
SYSTEM='{system}\n',
INSTRUCTION='[Round {round}]\n\n问:{input}\n\n答:',
SEP='\n\n'),
chatglm3=dict(
SYSTEM='<|system|>\n{system}',
INSTRUCTION='<|user|>\n{input}<|assistant|>\n',
SEP='\n'),
qwen_chat=dict(
SYSTEM=('<|im_start|>system\n{system}<|im_end|>\n'),
INSTRUCTION=('<|im_start|>user\n{input}<|im_end|>\n'
'<|im_start|>assistant\n'),
SUFFIX='<|im_end|>',
SUFFIX_AS_EOS=True,
SEP='\n',
STOP_WORDS=['<|im_end|>', '<|endoftext|>']),
baichuan_chat=dict(
SYSTEM='{system}\n',
INSTRUCTION='{input}',
SEP='\n'),
baichuan2_chat=dict(
SYSTEM='{system}\n',
INSTRUCTION='{input}',
SEP='\n'),
wizardlm=dict(
SYSTEM=('A chat between a curious user and an artificial '
'intelligence assistant. The assistant gives '
'helpful, detailed, and polite answers to the '
'user\'s questions. {system}\n '),
INSTRUCTION=('USER: {input} ASSISTANT:'),
SEP='\n'),
wizardcoder=dict(
SYSTEM=(
'Below is an instruction that describes a task. '
'Write a response that appropriately completes the request.\n\n'
'{system}\n '),
INSTRUCTION=('### Instruction:\n{input}\n\n### Response:'),
SEP='\n\n'),
vicuna=dict(
SYSTEM=('A chat between a curious user and an artificial '
'intelligence assistant. The assistant gives '
'helpful, detailed, and polite answers to the '
'user\'s questions. {system}\n '),
INSTRUCTION=('USER: {input} ASSISTANT:'),
SEP='\n'),
deepseek_coder=dict(
SYSTEM=('You are an AI programming assistant, utilizing '
'the DeepSeek Coder model, developed by DeepSeek '
'Company, and you only answer questions related '
'to computer science. For politically sensitive '
'questions, security and privacy issues, and '
'other non-computer science questions, you will '
'refuse to answer. {system}\n'),
INSTRUCTION=('### Instruction:\n{input}\n### Response:\n'),
SEP='\n'),
# TODO: deprecation, v0.2.0
deepseekcoder=dict(
SYSTEM=('You are an AI programming assistant, utilizing '
'the DeepSeek Coder model, developed by DeepSeek '
'Company, and you only answer questions related '
'to computer science. For politically sensitive '
'questions, security and privacy issues, and '
'other non-computer science questions, you will '
'refuse to answer. {system}\n'),
INSTRUCTION=('### Instruction:\n{input}\n### Response:\n'),
SEP='\n'),
deepseek_moe=dict(
SYSTEM=('[INST] {system} [/INST]\n'),
INSTRUCTION=('[INST] {input} [/INST]'),
SEP='\n'),
deepseek_v2=dict(
SYSTEM='{system}\n\n',
INSTRUCTION='User: {input}\n\nAssistant: ',
SUFFIX='<|end▁of▁sentence|>',
SUFFIX_AS_EOS=True,
STOP_WORDS=['<|end▁of▁sentence|>']),
mistral=dict(
SYSTEM=('[INST] {system} [/INST]\n'),
INSTRUCTION=('[INST] {input} [/INST]'),
SEP='\n'),
mixtral=dict(
SYSTEM=('[INST] {system} [/INST]\n'),
INSTRUCTION=('[INST] {input} [/INST]'),
SEP='\n'),
gemma=dict(
# `system` field is extended by xtuner
SYSTEM=('<start_of_turn>system\n{system}<end_of_turn>\n'),
INSTRUCTION=('<start_of_turn>user\n{input}<end_of_turn>\n'
'<start_of_turn>model\n'),
SUFFIX='<end_of_turn>',
SUFFIX_AS_EOS=False,
SEP='\n',
STOP_WORDS=['<end_of_turn>']),
cohere_chat=dict(
SYSTEM=('<|START_OF_TURN_TOKEN|><|SYSTEM_TOKEN|>{system}'
'<|END_OF_TURN_TOKEN|>'),
INSTRUCTION=(
'<|START_OF_TURN_TOKEN|><|USER_TOKEN|>{input}<|END_OF_TURN_TOKEN|>'
'<|START_OF_TURN_TOKEN|><|CHATBOT_TOKEN|>'),
SUFFIX='<|END_OF_TURN_TOKEN|>',
SUFFIX_AS_EOS=True,
STOP_WORDS=['<|END_OF_TURN_TOKEN|>']),
llama3_chat=dict(
SYSTEM=('<|start_header_id|>system<|end_header_id|>\n\n'
'{system}<|eot_id|>'),
INSTRUCTION=(
'<|start_header_id|>user<|end_header_id|>\n\n{input}<|eot_id|>'
'<|start_header_id|>assistant<|end_header_id|>\n\n'),
SUFFIX='<|eot_id|>',
SUFFIX_AS_EOS=True,
STOP_WORDS=['<|eot_id|>']),
phi3_chat=dict(
SYSTEM='<|system|>\n{system}<|end|>\n',
INSTRUCTION='<|user|>\n{input}<|end|>\n<|assistant|>\n',
SUFFIX='<|end|>',
SUFFIX_AS_EOS=True,
SEP='\n',
STOP_WORDS=['<|end|>']),
)
SYSTEM_TEMPLATE = ConfigDict(
moss_sft=('You are an AI assistant whose name is {bot_name}.\n'
'Capabilities and tools that {bot_name} can possess.\n'
'- Inner thoughts: enabled.\n'
'- Web search: enabled. API: Search(query)\n'
'- Calculator: enabled. API: Calculate(expression)\n'
'- Equation solver: enabled. API: Solve(equation)\n'
'- Text-to-image: disabled.\n'
'- Image edition: disabled.\n'
'- Text-to-speech: disabled.\n'),
alpaca=('Below is an instruction that describes a task. '
'Write a response that appropriately completes the request.\n'),
arxiv_gentile=('If you are an expert in writing papers, please generate '
"a good paper title for this paper based on other authors' "
'descriptions of their abstracts.\n'),
colorist=('You are a professional color designer. Please provide the '
'corresponding colors based on the description of Human.\n'),
coder=('You are a professional programmer. Please provide the '
'corresponding code based on the description of Human.\n'),
lawyer='你现在是一名专业的中国律师,请根据用户的问题给出准确、有理有据的回复。\n',
medical='如果你是一名医生,请根据患者的描述回答医学问题。\n',
sql=('If you are an expert in SQL, please generate a good SQL Query '
'for Question based on the CREATE TABLE statement.\n'),
)
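# A minimal sketch of how one template entry could be rendered into a prompt.
# It copies the `deepseek_v2` SYSTEM and INSTRUCTION strings from above into a
# plain dict. `render_turn` is a hypothetical helper written for illustration;
# it is an assumption and not necessarily how xtuner assembles prompts.

```python
# Plain-dict copy of two fields of the `deepseek_v2` entry above.
DEEPSEEK_V2 = dict(
    SYSTEM='{system}\n\n',
    INSTRUCTION='User: {input}\n\nAssistant: ',
)


def render_turn(template, user_input, system=None):
    """Hypothetical helper: format one user turn, optionally with a system prompt."""
    prompt = ''
    if system is not None:
        # The SYSTEM field carries a `{system}` placeholder.
        prompt += template['SYSTEM'].format(system=system)
    # The INSTRUCTION field carries an `{input}` placeholder.
    prompt += template['INSTRUCTION'].format(input=user_input)
    return prompt


prompt = render_turn(DEEPSEEK_V2, 'Hi', system='Be brief.')
```

# With these inputs `prompt` is 'Be brief.\n\nUser: Hi\n\nAssistant: '.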
================================================
FILE: xtuner-eval_niah/xtuner/utils/zero_to_any_dtype.py
================================================
#!/usr/bin/env python
# Copyright (c) Microsoft Corporation.
# SPDX-License-Identifier: Apache-2.0
# DeepSpeed Team
# This script extracts consolidated weights from ZeRO 1, 2 and 3 DeepSpeed
# checkpoints. It gets copied into the top level checkpoint dir, so the user
# can easily do the conversion at any point in the future. Once extracted, the
# weights don't require DeepSpeed and can be used in any application.
#
# example: python zero_to_any_dtype.py . pytorch_model.bin
import argparse
import glob
import math
import os
import re
from collections import OrderedDict
from dataclasses import dataclass
import torch
# yapf: disable
from deepspeed.checkpoint.constants import (BUFFER_NAMES, DS_VERSION,
FP32_FLAT_GROUPS,
FROZEN_PARAM_FRAGMENTS,
FROZEN_PARAM_SHAPES,
OPTIMIZER_STATE_DICT, PARAM_SHAPES,
PARTITION_COUNT,
SINGLE_PARTITION_OF_FP32_GROUPS,
ZERO_STAGE)
# while this script doesn't use deepspeed to recover data, since the
# checkpoints are pickled with DeepSpeed data structures it has to be
# available in the current python environment.
from deepspeed.utils import logger
from tqdm import tqdm
# yapf: enable
@dataclass
class zero_model_state:
buffers: dict
param_shapes: dict
shared_params: list
ds_version: int
frozen_param_shapes: dict
frozen_param_fragments: dict
debug = 0
# load to cpu
device = torch.device('cpu')
DEFAULT_DTYPE = torch.float16
def atoi(text):
return int(text) if text.isdigit() else text
def natural_keys(text):
"""alist.sort(key=natural_keys) sorts in human order
http://nedbatchelder.com/blog/200712/human_sorting.html (See Toothy's
implementation in the comments)"""
return [atoi(c) for c in re.split(r'(\d+)', text)]
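# A short sketch of why `natural_keys` is used when sorting checkpoint shards:
# a plain lexicographic sort does not order rank numbers numerically once they
# reach two digits. The file names below are illustrative, extrapolated from
# the 'zero_pp_rank_0_mp_rank_00_model_states.pt' pattern used in this file.

```python
import re


def atoi(text):
    return int(text) if text.isdigit() else text


def natural_keys(text):
    # Split into alternating text and digit runs; digit runs become ints.
    return [atoi(c) for c in re.split(r'(\d+)', text)]


# Illustrative shard names for ranks 10, 2, 1 and 11 (an assumption).
names = [
    f'zero_pp_rank_{r}_mp_rank_00_model_states.pt' for r in (10, 2, 1, 11)
]

plain = sorted(names)                      # character-by-character order
natural = sorted(names, key=natural_keys)  # numeric-aware order
ranks = [natural_keys(n)[1] for n in natural]
```

# Here `ranks` is [1, 2, 10, 11], whereas `plain` orders the names differently.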
def get_model_state_file(checkpoint_dir, zero_stage):
if not os.path.isdir(checkpoint_dir):
raise FileNotFoundError(f"Directory '{checkpoint_dir}' doesn't exist")
# there should be only one file
if zero_stage <= 2:
file = os.path.join(checkpoint_dir, 'mp_rank_00_model_states.pt')
elif zero_stage == 3:
file = os.path.join(checkpoint_dir,
'zero_pp_rank_0_mp_rank_00_model_states.pt')
if not os.path.exists(file):
raise FileNotFoundError(f"can't find model states file at '{file}'")
return file
def get_checkpoint_files(checkpoint_dir, glob_pattern):
# XXX: need to test that this simple glob rule works for multi-node
# setup too
ckpt_files = sorted(
glob.glob(os.path.join(checkpoint_dir, glob_pattern)),
key=natural_keys)
if len(ckpt_files) == 0:
raise FileNotFoundError(
f"can't find {glob_pattern} files in directory '{checkpoint_dir}'")
return ckpt_files
def get_optim_files(checkpoint_dir):
return get_checkpoint_files(checkpoint_dir, '*_optim_states.pt')
def get_model_state_files(checkpoint_dir):
return get_checkpoint_files(checkpoint_dir, '*_model_states.pt')
def parse_model_states(files, dtype=DEFAULT_DTYPE):
zero_model_states = []
for file in files:
state_dict = torch.load(file, map_location=device)
if BUFFER_NAMES not in state_dict:
raise ValueError(f'{file} is not a model state checkpoint')
buffer_names = state_dict[BUFFER_NAMES]
if debug:
print('Found buffers:', buffer_names)
buffers = {
k: v.to(dtype)
for k, v in state_dict['module'].items() if k in buffer_names
}
param_shapes = state_dict[PARAM_SHAPES]
# collect parameters that are included in param_shapes
param_names = []
for s in param_shapes:
for name in s.keys():
param_names.append(name)
# update with frozen parameters
frozen_param_shapes = state_dict.get(FROZEN_PARAM_SHAPES, None)
if frozen_param_shapes is not None:
if debug:
print(f'Found frozen_param_shapes: {frozen_param_shapes}')
param_names += list(frozen_param_shapes.keys())
# handle shared params
shared_params = [[k, v]
for k, v in state_dict['shared_params'].items()]
ds_version = state_dict.get(DS_VERSION, None)
frozen_param_fragments = state_dict.get(FROZEN_PARAM_FRAGMENTS, None)
z_model_state = zero_model_state(
buffers=buffers,
param_shapes=param_shapes,
shared_params=shared_params,
ds_version=ds_version,
frozen_param_shapes=frozen_param_shapes,
frozen_param_fragments=frozen_param_fragments)
zero_model_states.append(z_model_state)
return zero_model_states
@torch.no_grad()
def parse_optim_states(files, ds_checkpoint_dir, dtype=DEFAULT_DTYPE):
zero_stage = None
world_size = None
total_files = len(files)
flat_groups = []
for f in tqdm(files, desc='Load Checkpoints'):
state_dict = torch.load(f, map_location=device)
if ZERO_STAGE not in state_dict[OPTIMIZER_STATE_DICT]:
raise ValueError(f'{f} is not a zero checkpoint')
zero_stage = state_dict[OPTIMIZER_STATE_DICT][ZERO_STAGE]
world_size = state_dict[OPTIMIZER_STATE_DICT][PARTITION_COUNT]
# the groups are named differently in each stage
if zero_stage <= 2:
fp32_groups_key = SINGLE_PARTITION_OF_FP32_GROUPS
elif zero_stage == 3:
fp32_groups_key = FP32_FLAT_GROUPS
else:
raise ValueError(f'unknown zero stage {zero_stage}')
# immediately discard the two potentially huge optimizer states, since we
# only need the fp32 master weights; this also handles the case where they
# were already removed by another helper script
state_dict['optimizer_state_dict'].pop('optimizer_state_dict', None)
fp32_groups = state_dict['optimizer_state_dict'].pop(fp32_groups_key)
if zero_stage <= 2:
flat_groups.append([param.to(dtype) for param in fp32_groups])
elif zero_stage == 3:
# if there is more than one param group, there will be multiple
# flattened tensors - one flattened tensor per group - for
# simplicity merge them into a single tensor
# XXX: could make the script more memory efficient for when there
# are multiple groups - it will require matching the sub-lists of
# param_shapes for each param group flattened tensor
flat_groups.append(torch.cat(fp32_groups, 0).to(dtype))
# For ZeRO-2 each param group can have different partition_count as data
# parallelism for expert parameters can be different from data parallelism
# for non-expert parameters. So we can just use the max of the
# partition_count to get the dp world_size.
if type(world_size) is list:
world_size = max(world_size)
if world_size != total_files:
raise ValueError(
f"Expected {world_size} of '*_optim_states.pt' under "
f"'{ds_checkpoint_dir}' but found {total_files} files. "
'Possibly due to an overwrite of an old checkpoint, '
"or a checkpoint didn't get saved by one or more processes.")
return zero_stage, world_size, flat_groups
def _get_state_dict_from_zero_checkpoint(ds_checkpoint_dir,
exclude_frozen_parameters,
dtype=DEFAULT_DTYPE):
"""Returns state_dict reconstructed from ds checkpoint.
Args:
- ``ds_checkpoint_dir``: path to the deepspeed checkpoint folder
(where the optimizer files are)
"""
print(f"Processing zero checkpoint '{ds_checkpoint_dir}'")
optim_files = get_optim_files(ds_checkpoint_dir)
zero_stage, world_size, flat_groups = parse_optim_states(
optim_files, ds_checkpoint_dir, dtype)
print(f'Detected checkpoint of type zero stage {zero_stage}, '
f'world_size: {world_size}')
model_files = get_model_state_files(ds_checkpoint_dir)
zero_model_states = parse_model_states(model_files, dtype)
print(f'Parsing checkpoint created by deepspeed=='
f'{zero_model_states[0].ds_version}')
if zero_stage <= 2:
return _get_state_dict_from_zero2_checkpoint(
world_size, flat_groups, zero_model_states,
exclude_frozen_parameters)
elif zero_stage == 3:
return _get_state_dict_from_zero3_checkpoint(
world_size, flat_groups, zero_model_states,
exclude_frozen_parameters)
def _zero2_merge_frozen_params(state_dict, zero_model_states):
if zero_model_states[0].frozen_param_shapes is None or len(
zero_model_states[0].frozen_param_shapes) == 0:
return
frozen_param_shapes = zero_model_states[0].frozen_param_shapes
frozen_param_fragments = zero_model_states[0].frozen_param_fragments
if debug:
num_elem = sum(s.numel() for s in frozen_param_shapes.values())
print(f'rank 0: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
wanted_params = len(frozen_param_shapes)
wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
avail_numel = sum([p.numel() for p in frozen_param_fragments.values()])
print(f'Frozen params: Have {avail_numel} numels to process.')
print(f'Frozen params: Need {wanted_numel} numels in '
f'{wanted_params} params')
total_params = 0
total_numel = 0
for name, shape in frozen_param_shapes.items():
total_params += 1
unpartitioned_numel = shape.numel()
total_numel += unpartitioned_numel
state_dict[name] = frozen_param_fragments[name]
if debug:
print(f'{name} full shape: {shape} unpartitioned numel '
f'{unpartitioned_numel} ')
print(f'Reconstructed Frozen state dict with {total_params} params '
f'{total_numel} elements')
def _has_callable(obj, fn):
attr = getattr(obj, fn, None)
return callable(attr)
def _zero2_merge_trainable_params(state_dict, world_size, flat_groups,
zero_model_states):
param_shapes = zero_model_states[0].param_shapes
# Reconstruction protocol:
#
# XXX: document this
if debug:
for i in range(world_size):
for j in range(len(flat_groups[0])):
print(f'flat_groups[{i}][{j}].shape={flat_groups[i][j].shape}')
# XXX: memory usage doubles here (zero2)
num_param_groups = len(flat_groups[0])
merged_single_partition_of_groups = []
for i in range(num_param_groups):
merged_partitions = [sd[i] for sd in flat_groups]
full_single_vector = torch.cat(merged_partitions, 0)
merged_single_partition_of_groups.append(full_single_vector)
avail_numel = sum([
full_single_vector.numel()
for full_single_vector in merged_single_partition_of_groups
])
if debug:
wanted_params = sum([len(shapes) for shapes in param_shapes])
wanted_numel = sum([
sum(shape.numel() for shape in shapes.values())
for shapes in param_shapes
])
# not asserting if there is a mismatch due to possible padding
print(f'Have {avail_numel} numels to process.')
print(f'Need {wanted_numel} numels in {wanted_params} params.')
# params
# XXX: for huge models that can't fit into the host's RAM we will have to
# recode this to support out-of-core computing solution
total_numel = 0
total_params = 0
for shapes, full_single_vector in zip(param_shapes,
merged_single_partition_of_groups):
offset = 0
avail_numel = full_single_vector.numel()
for name, shape in shapes.items():
unpartitioned_numel = shape.numel() if _has_callable(
shape, 'numel') else math.prod(shape)
total_numel += unpartitioned_numel
total_params += 1
if debug:
print(f'{name} full shape: {shape} unpartitioned numel '
f'{unpartitioned_numel} ')
state_dict[name] = full_single_vector.narrow(
0, offset, unpartitioned_numel).view(shape)
offset += unpartitioned_numel
# Z2 started to align to 2*world_size to improve nccl performance.
# Therefore both offset and avail_numel can differ by anywhere between
# 0..2*world_size. Due to two unrelated complex paddings performed in
# the code it's almost impossible to predict the exact numbers w/o the
# live optimizer object, so we are checking that the numbers are
# within the right range
align_to = 2 * world_size
def zero2_align(x):
return align_to * math.ceil(x / align_to)
if debug:
print(f'original offset={offset}, avail_numel={avail_numel}')
offset = zero2_align(offset)
avail_numel = zero2_align(avail_numel)
if debug:
print(f'aligned offset={offset}, avail_numel={avail_numel}')
# Sanity check
if offset != avail_numel:
raise ValueError(f'consumed {offset} numels out of {avail_numel} '
'- something is wrong')
print(f'Reconstructed state dict with {total_params} params '
f'{total_numel} elements')
def _get_state_dict_from_zero2_checkpoint(world_size, flat_groups,
zero_model_states,
exclude_frozen_parameters):
state_dict = OrderedDict()
# buffers
buffers = zero_model_states[0].buffers
state_dict.update(buffers)
if debug:
print(f'added {len(buffers)} buffers')
if not exclude_frozen_parameters:
_zero2_merge_frozen_params(state_dict, zero_model_states)
_zero2_merge_trainable_params(state_dict, world_size, flat_groups,
zero_model_states)
# recover shared parameters
for pair in zero_model_states[0].shared_params:
if pair[1] in state_dict:
state_dict[pair[0]] = state_dict[pair[1]]
return state_dict
def zero3_partitioned_param_info(unpartitioned_numel, world_size):
remainder = unpartitioned_numel % world_size
padding_numel = (world_size - remainder) if remainder else 0
partitioned_numel = math.ceil(unpartitioned_numel / world_size)
return partitioned_numel, padding_numel
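# A worked example of the partition arithmetic above, as a self-contained
# sketch. The function body is copied from this file; the numbers (10 elements,
# 4 ranks) are chosen only for illustration.

```python
import math


def zero3_partitioned_param_info(unpartitioned_numel, world_size):
    remainder = unpartitioned_numel % world_size
    padding_numel = (world_size - remainder) if remainder else 0
    partitioned_numel = math.ceil(unpartitioned_numel / world_size)
    return partitioned_numel, padding_numel


# 10 elements over 4 ranks: each rank stores 3 elements, 2 of the 12 are padding.
per_rank, padding = zero3_partitioned_param_info(10, 4)

# An evenly divisible parameter needs no padding.
even_per_rank, even_padding = zero3_partitioned_param_info(12, 4)
```

# In both cases per_rank * world_size equals the element count plus padding.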
def _zero3_merge_frozen_params(state_dict, world_size, zero_model_states):
if zero_model_states[0].frozen_param_shapes is None or len(
zero_model_states[0].frozen_param_shapes) == 0:
return
if debug:
for i in range(world_size):
num_elem = sum(
s.numel()
for s in zero_model_states[i].frozen_param_fragments.values())
print(f'rank {i}: {FROZEN_PARAM_SHAPES}.numel = {num_elem}')
frozen_param_shapes = zero_model_states[0].frozen_param_shapes
wanted_params = len(frozen_param_shapes)
wanted_numel = sum(s.numel() for s in frozen_param_shapes.values())
avail_numel = sum([
p.numel()
for p in zero_model_states[0].frozen_param_fragments.values()
]) * world_size
print(f'Frozen params: Have {avail_numel} numels to process.')
print(f'Frozen params: Need {wanted_numel} numels in '
f'{wanted_params} params')
total_params = 0
total_numel = 0
for name, shape in zero_model_states[0].frozen_param_shapes.items():
total_params += 1
unpartitioned_numel = shape.numel()
total_numel += unpartitioned_numel
param_frags = tuple(model_state.frozen_param_fragments[name]
for model_state in zero_model_states)
state_dict[name] = torch.cat(param_frags, 0).narrow(
0, 0, unpartitioned_numel).view(shape) # noqa: E501
_partitioned = zero3_partitioned_param_info(unpartitioned_numel,
world_size)
partitioned_numel, partitioned_padding_numel = _partitioned
if debug:
print(f'Frozen params: {total_params} {name} full shape: {shape} '
f'partition0 numel={partitioned_numel} '
f'partitioned_padding_numel={partitioned_padding_numel}')
print(f'Reconstructed Frozen state dict with {total_params} params '
f'{total_numel} elements')
def _zero3_merge_trainable_params(state_dict, world_size, flat_groups,
zero_model_states):
param_shapes = zero_model_states[0].param_shapes
avail_numel = flat_groups[0].numel() * world_size
# Reconstruction protocol: For zero3 we need to zip the partitions
# together at boundary of each param, re-consolidating each param, while
# dealing with padding if any
# merge list of dicts, preserving order
param_shapes = {k: v for d in param_shapes for k, v in d.items()}
if debug:
for i in range(world_size):
print(f'flat_groups[{i}].shape={flat_groups[i].shape}')
wanted_params = len(param_shapes)
wanted_numel = sum(shape.numel() for shape in param_shapes.values())
# not asserting if there is a mismatch due to possible padding
avail_numel = flat_groups[0].numel() * world_size
print(f'Trainable params: Have {avail_numel} numels to process.')
print(f'Trainable params: Need {wanted_numel} numels in '
f'{wanted_params} params.')
offset = 0
total_numel = 0
total_params = 0
partitioned_sizes = []
for name, shape in param_shapes.items():
unpartitioned_numel = shape.numel()
total_numel += unpartitioned_numel
total_params += 1
_info = zero3_partitioned_param_info(unpartitioned_numel, world_size)
partitioned_numel, partitioned_padding_numel = _info
partitioned_sizes.append(partitioned_numel)
if debug:
print(
f'Trainable params: {total_params} {name} full shape: {shape} '
f'partition0 numel={partitioned_numel} '
f'partitioned_padding_numel={partitioned_padding_numel}')
offset += partitioned_numel
offset *= world_size
# Sanity check
if offset != avail_numel:
raise ValueError(f'consumed {offset} numels out of {avail_numel} '
'- something is wrong')
mat_chunks = []
for rank in range(world_size):
rank_chunks = flat_groups.pop(0).split(partitioned_sizes)
rank_chunks = [tensor.clone() for tensor in rank_chunks]
mat_chunks.append(rank_chunks)
for name, shape in tqdm(
param_shapes.items(), desc='Gather Sharded Weights'):
pad_flat_param_chunks = []
for rank in range(world_size):
pad_flat_param_chunks.append(mat_chunks[rank].pop(0))
pad_flat_param = torch.cat(pad_flat_param_chunks, dim=0)
# Because pad_flat_param_chunks is a list, it is necessary to manually
# release the tensors in the list; Python will not automatically do so.
for rank in range(world_size):
pad_flat_param_chunks.pop()
param = pad_flat_param[:shape.numel()].view(shape)
state_dict[name] = param
print(f'Reconstructed Trainable state dict with {total_params} params '
f'{total_numel} elements')
def _get_state_dict_from_zero3_checkpoint(world_size, flat_groups,
zero_model_states,
exclude_frozen_parameters):
state_dict = OrderedDict()
# buffers
buffers = zero_model_states[0].buffers
state_dict.update(buffers)
if debug:
print(f'added {len(buffers)} buffers')
if not exclude_frozen_parameters:
_zero3_merge_frozen_params(state_dict, world_size, zero_model_states)
_zero3_merge_trainable_params(state_dict, world_size, flat_groups,
zero_model_states)
# recover shared parameters
for pair in zero_model_states[0].shared_params:
if pair[1] in state_dict:
state_dict[pair[0]] = state_dict[pair[1]]
return state_dict
def get_state_dict_from_zero_checkpoint(checkpoint_dir,
tag=None,
exclude_frozen_parameters=False,
dtype=DEFAULT_DTYPE):
# flake8: noqa
"""Convert ZeRO 2 or 3 checkpoint into a single consolidated state_dict
that can be loaded with ``load_state_dict()`` and used for training without
DeepSpeed or shared with others, for example via a model hub.
Args:
- ``checkpoint_dir``: path to the desired checkpoint folder
- ``tag``: checkpoint tag used as a unique identifier for checkpoint.
If not provided will attempt to load tag in 'latest' file.
e.g., ``global_step14``
- ``exclude_frozen_parameters``: exclude frozen parameters
Returns:
- pytorch ``state_dict``
Note: this approach may not work if your application doesn't have
sufficient free CPU memory and you may need to use the offline approach
using the ``zero_to_any_dtype.py`` script that is saved with the
checkpoint.
A typical usage might be ::
from xtuner.utils.zero_to_any_dtype import get_state_dict_from_zero_checkpoint
# do the training and checkpoint saving
state_dict = get_state_dict_from_zero_checkpoint(checkpoint_dir, dtype=torch.float16) # already on cpu
model = model.cpu() # move to cpu
model.load_state_dict(state_dict)
# submit to model hub or save the model to share with others
In this example the ``model`` will no longer be usable in the deepspeed
context of the same application. i.e. you will need to re-initialize the
deepspeed engine, since ``model.load_state_dict(state_dict)`` will remove
all the deepspeed magic from it.
If you want it all done for you, use
``load_state_dict_from_zero_checkpoint`` instead.
"""
# flake8: noqa
if tag is None:
latest_path = os.path.join(checkpoint_dir, 'latest')
if os.path.isfile(latest_path):
with open(latest_path) as fd:
tag = fd.read().strip()
else:
raise ValueError(f"Unable to find 'latest' file at {latest_path}")
ds_checkpoint_dir = os.path.join(checkpoint_dir, tag)
if not os.path.isdir(ds_checkpoint_dir):
raise FileNotFoundError(
f"Directory '{ds_checkpoint_dir}' doesn't exist")
return _get_state_dict_from_zero_checkpoint(ds_checkpoint_dir,
exclude_frozen_parameters,
dtype)
def convert_zero_checkpoint_to_state_dict(checkpoint_dir,
output_file,
tag=None,
exclude_frozen_parameters=False,
dtype=DEFAULT_DTYPE):
"""Convert ZeRO 2 or 3 checkpoint into a single consolidated ``state_dict``
file that can be loaded with ``torch.load(file)`` + ``load_state_dict()``
and used for training without DeepSpeed.
Args:
- ``checkpoint_dir``: path to the desired checkpoint folder.
(one that contains the tag-folder, like ``global_step14``)
- ``output_file``: path to the pytorch state_dict output file
(e.g. path/pytorch_model.bin)
- ``tag``: checkpoint tag used as a unique identifier for checkpoint.
If not provided will attempt to load tag in the file named
``latest`` in the checkpoint folder, e.g., ``global_step14``
- ``exclude_frozen_parameters``: exclude frozen parameters
"""
state_dict = get_state_dict_from_zero_checkpoint(
checkpoint_dir, tag, exclude_frozen_parameters, dtype)
print(f'Saving {dtype} state dict to {output_file}')
torch.save(state_dict, output_file)
def load_state_dict_from_zero_checkpoint(model,
checkpoint_dir,
tag=None,
dtype=DEFAULT_DTYPE):
# flake8: noqa
"""
1. Put the provided model to cpu
2. Convert ZeRO 2 or 3 checkpoint into a single consolidated ``state_dict``
3. Load it into the provided model
Args:
- ``model``: the model object to update
- ``checkpoint_dir``: path to the desired checkpoint folder. (one that
contains the tag-folder, like ``global_step14``)
- ``tag``: checkpoint tag used as a unique identifier for checkpoint.
If not provided will attempt to load tag in the file named
``latest`` in the checkpoint folder, e.g., ``global_step14``
Returns:
- ``model``: modified model
Make sure you have plenty of CPU memory available before you call this
function. If you don't have enough use the ``zero_to_any_dtype.py``
utility to do the conversion. You will find it conveniently placed for you
in the checkpoint folder.
A typical usage might be ::
from xtuner.utils.zero_to_any_dtype import load_state_dict_from_zero_checkpoint
model = load_state_dict_from_zero_checkpoint(trainer.model, checkpoint_dir, dtype=torch.float16)
# submit to model hub or save the model to share with others
Note that once this has run, the ``model`` will no longer be usable in
the deepspeed context of the same application, i.e. you will need to
re-initialize the deepspeed engine, since
``model.load_state_dict(state_dict)`` will remove all the deepspeed magic
from it.
"""
# flake8: noqa
logger.info(f'Extracting {dtype} weights')
state_dict = get_state_dict_from_zero_checkpoint(
checkpoint_dir, tag, dtype=dtype)
logger.info(f'Overwriting model with {dtype} weights')
model = model.cpu()
model.load_state_dict(state_dict, strict=False)
return model
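# A toy sketch of the ZeRO-3 reconstruction idea used by
# `_zero3_merge_trainable_params` above: every parameter is padded, split
# evenly across ranks, and later rebuilt by concatenating its chunk from each
# rank and trimming the padding. Plain Python lists stand in for tensors, and
# `shard` / `gather` are hypothetical helpers, not DeepSpeed or xtuner APIs.

```python
import math


def shard(params, world_size):
    """Pad each parameter and append one equal-sized chunk to every rank."""
    ranks = [[] for _ in range(world_size)]
    for values in params.values():
        per_rank = math.ceil(len(values) / world_size)
        padded = values + [0] * (per_rank * world_size - len(values))
        for r in range(world_size):
            ranks[r].extend(padded[r * per_rank:(r + 1) * per_rank])
    return ranks


def gather(ranks, numels, world_size):
    """Rebuild each parameter from its per-rank chunks, dropping padding."""
    restored, offset = {}, 0
    for name, numel in numels.items():
        per_rank = math.ceil(numel / world_size)
        flat = []
        for r in range(world_size):
            flat.extend(ranks[r][offset:offset + per_rank])
        restored[name] = flat[:numel]  # trim the trailing padding
        offset += per_rank             # advance within each rank's flat buffer
    return restored


params = {'w': [1, 2, 3, 4, 5], 'b': [6, 7]}
ranks = shard(params, world_size=2)
restored = gather(ranks, {k: len(v) for k, v in params.items()}, world_size=2)
```

# With two ranks, `ranks` is [[1, 2, 3, 6], [4, 5, 0, 7]] and `restored`
# equals the original `params`.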
================================================
FILE: xtuner-eval_niah/xtuner/version.py
================================================
# Copyright (c) OpenMMLab. All rights reserved.
__version__ = '0.1.21'
short_version = __version__
def parse_version_info(version_str):
"""Parse a version string into a tuple.
Args:
version_str (str): The version string.
Returns:
tuple[int or str]: The version info, e.g., "1.3.0" is parsed into
(1, 3, 0), and "2.0.0rc1" is parsed into (2, 0, 0, 'rc1').
"""
version_info = []
for x in version_str.split('.'):
if x.isdigit():
version_info.append(int(x))
elif x.find('rc') != -1:
patch_version = x.split('rc')
version_info.append(int(patch_version[0]))
version_info.append(f'rc{patch_version[1]}')
return tuple(version_info)
version_info = parse_version_info(__version__)
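# A quick usage sketch for `parse_version_info`, with the function body copied
# from above so the snippet is self-contained. The second input is the release
# candidate example given in the docstring.

```python
def parse_version_info(version_str):
    version_info = []
    for x in version_str.split('.'):
        if x.isdigit():
            version_info.append(int(x))
        elif x.find('rc') != -1:
            # '0rc1' becomes the integer 0 followed by the string 'rc1'.
            patch_version = x.split('rc')
            version_info.append(int(patch_version[0]))
            version_info.append(f'rc{patch_version[1]}')
    return tuple(version_info)


release = parse_version_info('0.1.21')
candidate = parse_version_info('2.0.0rc1')
```

# `release` is (0, 1, 21) and `candidate` is (2, 0, 0, 'rc1').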
================================================
FILE: xtuner-train_internvideo2_5/.gitignore
================================================
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/*/_build/
# PyBuilder
target/
# Jupyter Notebook
.ipynb_checkpoints
# pyenv
.python-version
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
# custom
.vscode
.idea
.DS_Store
*.pkl
*.pkl.json
*.log.json
work_dirs/
# Pytorch
*.pth
*.py~
*.sh~
# srun
*.out
batchscript-*
================================================
FILE: xtuner-train_internvideo2_5/.owners.yml
================================================
assign:
issues: disabled
pull_requests: disabled
strategy:
random
# daily-shift-based
schedule:
'*/1 * * * *'
================================================
FILE: xtuner-train_internvideo2_5/.pre-commit-config-zh-cn.yaml
================================================
exclude: ^tests/data/|^xtuner/tools/model_converters/modeling_internlm2_reward/
repos:
- repo: https://gitee.com/openmmlab/mirrors-flake8
rev: 5.0.4
hooks:
- id: flake8
args: ["--exclude=xtuner/model/transformers_models/*"]
- repo: https://gitee.com/openmmlab/mirrors-isort
rev: 5.11.5
hooks:
- id: isort
- repo: https://gitee.com/openmmlab/mirrors-yapf
rev: v0.32.0
hooks:
- id: yapf
- repo: https://gitee.com/openmmlab/mirrors-pre-commit-hooks
rev: v4.3.0
hooks:
- id: trailing-whitespace
- id: check-yaml
- id: end-of-file-fixer
- id: requirements-txt-fixer
- id: double-quote-string-fixer
- id: check-merge-conflict
- id: fix-encoding-pragma
args: ["--remove"]
- id: mixed-line-ending
args: ["--fix=lf"]
- repo: https://gitee.com/openmmlab/mirrors-codespell
rev: v2.2.1
hooks:
- id: codespell
- repo: https://gitee.com/openmmlab/mirrors-mdformat
rev: 0.7.9
hooks:
- id: mdformat
args: ["--number"]
additional_dependencies:
- mdformat-openmmlab
- mdformat_frontmatter
- linkify-it-py
- repo: https://gitee.com/openmmlab/mirrors-docformatter
rev: v1.3.1
hooks:
- id: docformatter
args: ["--in-place", "--wrap-descriptions", "79"]
- repo: https://github.com/asottile/pyupgrade
rev: v3.0.0
hooks:
- id: pyupgrade
args: ["--py36-plus"]
================================================
FILE: xtuner-train_internvideo2_5/.pre-commit-config.yaml
================================================
exclude: ^tests/data/|^xtuner/tools/model_converters/modeling_internlm2_reward/
repos:
- repo: https://github.com/PyCQA/flake8
rev: 5.0.4
hooks:
- id: flake8
args: ["--exclude=xtuner/model/transformers_models/*"]
- repo: https://github.com/PyCQA/isort
rev: 5.11.5
hooks:
- id: isort
- repo: https://github.com/pre-commit/mirrors-yapf
rev: v0.32.0
hooks:
- id: yapf
exclude: 'xtuner/parallel/sequence/__init__.py'
- repo: https://github.com/pre-commit/pre-commit-hooks
rev: v4.3.0
hooks:
- id: trailing-whitespace
- id: check-yaml
- id: end-of-file-fixer
- id: requirements-txt-fixer
- id: double-quote-string-fixer
- id: check-merge-conflict
- id: fix-encoding-pragma
args: ["--remove"]
- id: mixed-line-ending
args: ["--fix=lf"]
- repo: https://github.com/codespell-project/codespell
rev: v2.2.1
hooks:
- id: codespell
- repo: https://github.com/executablebooks/mdformat
rev: 0.7.9
hooks:
- id: mdformat
args: ["--number"]
additional_dependencies:
- mdformat-openmmlab
- mdformat_frontmatter
- linkify-it-py
exclude: 'docs/zh_cn/user_guides/sequence_parallel.md'
- repo: https://github.com/myint/docformatter
rev: v1.3.1
hooks:
- id: docformatter
args: ["--in-place", "--wrap-descriptions", "79"]
- repo: https://github.com/asottile/pyupgrade
rev: v3.0.0
hooks:
- id: pyupgrade
args: ["--py36-plus"]
================================================
FILE: xtuner-train_internvideo2_5/LICENSE
================================================
Apache License
Version 2.0, January 2004
http://www.apache.org/licenses/
TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
1. Definitions.
"License" shall mean the terms and conditions for use, reproduction,
and distribution as defined by Sections 1 through 9 of this document.
"Licensor" shall mean the copyright owner or entity authorized by
the copyright owner that is granting the License.
"Legal Entity" shall mean the union of the acting entity and all
other entities that control, are controlled by, or are under common
control with that entity. For the purposes of this definition,
"control" means (i) the power, direct or indirect, to cause the
direction or management of such entity, whether by contract or
otherwise, or (ii) ownership of fifty percent (50%) or more of the
outstanding shares, or (iii) beneficial ownership of such entity.
"You" (or "Your") shall mean an individual or Legal Entity
exercising permissions granted by this License.
"Source" form shall mean the preferred form for making modifications,
including but not limited to software source code, documentation
source, and configuration files.
"Object" form shall mean any form resulting from mechanical
transformation or translation of a Source form, including but
not limited to compiled object code, generated documentation,
and conversions to other media types.
"Work" shall mean the work of authorship, whether in Source or
Object form, made available under the License, as indicated by a
copyright notice that is included in or attached to the work
(an example is provided in the Appendix below).
"Derivative Works" shall mean any work, whether in Source or Object
form, that is based on (or derived from) the Work and for which the
editorial revisions, annotations, elaborations, or other modifications
represent, as a whole, an original work of authorship. For the purposes
of this License, Derivative Works shall not include works that remain
separable from, or merely link (or bind by name) to the interfaces of,
the Work and Derivative Works thereof.
"Contribution" shall mean any work of authorship, including
the original version of the Work and any modifications or additions
to that Work or Derivative Works thereof, that is intentionally
submitted to Licensor for inclusion in the Work by the copyright owner
or by an individual or Legal Entity authorized to submit on behalf of
the copyright owner. For the purposes of this definition, "submitted"
means any form of electronic, verbal, or written communication sent
to the Licensor or its representatives, including but not limited to
communication on electronic mailing lists, source code control systems,
and issue tracking systems that are managed by, or on behalf of, the
Licensor for the purpose of discussing and improving the Work, but
excluding communication that is conspicuously marked or otherwise
designated in writing by the copyright owner as "Not a Contribution."
"Contributor" shall mean Licensor and any individual or Legal Entity
on behalf of whom a Contribution has been received by Licensor and
subsequently incorporated within the Work.
2. Grant of Copyright License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
copyright license to reproduce, prepare Derivative Works of,
publicly display, publicly perform, sublicense, and distribute the
Work and such Derivative Works in Source or Object form.
3. Grant of Patent License. Subject to the terms and conditions of
this License, each Contributor hereby grants to You a perpetual,
worldwide, non-exclusive, no-charge, royalty-free, irrevocable
(except as stated in this section) patent license to make, have made,
use, offer to sell, sell, import, and otherwise transfer the Work,
where such license applies only to those patent claims licensable
by such Contributor that are necessarily infringed by their
Contribution(s) alone or by combination of their Contribution(s)
with the Work to which such Contribution(s) was submitted. If You
institute patent litigation against any entity (including a
cross-claim or counterclaim in a lawsuit) alleging that the Work
or a Contribution incorporated within the Work constitutes direct
or contributory patent infringement, then any patent licenses
granted to You under this License for that Work shall terminate
as of the date such litigation is filed.
4. Redistribution. You may reproduce and distribute copies of the
Work or Derivative Works thereof in any medium, with or without
modifications, and in Source or Object form, provided that You
meet the following conditions:
(a) You must give any other recipients of the Work or
Derivative Works a copy of this License; and
(b) You must cause any modified files to carry prominent notices
stating that You changed the files; and
(c) You must retain, in the Source form of any Derivative Works
that You distribute, all copyright, patent, trademark, and
attribution notices from the Source form of the Work,
excluding those notices that do not pertain to any part of
the Derivative Works; and
(d) If the Work includes a "NOTICE" text file as part of its
distribution, then any Derivative Works that You distribute must
include a readable copy of the attribution notices contained
within such NOTICE file, excluding those notices that do not
pertain to any part of the Derivative Works, in at least one
of the following places: within a NOTICE text file distributed
as part of the Derivative Works; within the Source form or
documentation, if provided along with the Derivative Works; or,
within a display generated by the Derivative Works, if and
wherever such third-party notices normally appear. The contents
of the NOTICE file are for informational purposes only and
do not modify the License. You may add Your own attribution
notices within Derivative Works that You distribute, alongside
or as an addendum to the NOTICE text from the Work, provided
that such additional attribution notices cannot be construed
as modifying the License.
You may add Your own copyright statement to Your modifications and
may provide additional or different license terms and conditions
for use, reproduction, or distribution of Your modifications, or
for any such Derivative Works as a whole, provided Your use,
reproduction, and distribution of the Work otherwise complies with
the conditions stated in this License.
5. Submission of Contributions. Unless You explicitly state otherwise,
any Contribution intentionally submitted for inclusion in the Work
by You to the Licensor shall be under the terms and conditions of
this License, without any additional terms or conditions.
Notwithstanding the above, nothing herein shall supersede or modify
the terms of any separate license agreement you may have executed
with Licensor regarding such Contributions.
6. Trademarks. This License does not grant permission to use the trade
names, trademarks, service marks, or product names of the Licensor,
except as required for reasonable and customary use in describing the
origin of the Work and reproducing the content of the NOTICE file.
7. Disclaimer of Warranty. Unless required by applicable law or
agreed to in writing, Licensor provides the Work (and each
Contributor provides its Contributions) on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
implied, including, without limitation, any warranties or conditions
of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A
PARTICULAR PURPOSE. You are solely responsible for determining the
appropriateness of using or redistributing the Work and assume any
risks associated with Your exercise of permissions under this License.
8. Limitation of Liability. In no event and under no legal theory,
whether in tort (including negligence), contract, or otherwise,
unless required by applicable law (such as deliberate and grossly
negligent acts) or agreed to in writing, shall any Contributor be
liable to You for damages, including any direct, indirect, special,
incidental, or consequential damages of any character arising as a
result of this License or out of the use or inability to use the
Work (including but not limited to damages for loss of goodwill,
work stoppage, computer failure or malfunction, or any and all
other commercial damages or losses), even if such Contributor
has been advised of the possibility of such damages.
9. Accepting Warranty or Additional Liability. While redistributing
the Work or Derivative Works thereof, You may choose to offer,
and charge a fee for, acceptance of support, warranty, indemnity,
or other liability obligations and/or rights consistent with this
License. However, in accepting such obligations, You may act only
on Your own behalf and on Your sole responsibility, not on behalf
of any other Contributor, and only if You agree to indemnify,
defend, and hold each Contributor harmless for any liability
incurred by, or claims asserted against, such Contributor by reason
of your accepting any such warranty or additional liability.
END OF TERMS AND CONDITIONS
APPENDIX: How to apply the Apache License to your work.
To apply the Apache License to your work, attach the following
boilerplate notice, with the fields enclosed by brackets "[]"
replaced with your own identifying information. (Don't include
the brackets!) The text should be enclosed in the appropriate
comment syntax for the file format. We also recommend that a
file or class name and description of purpose be included on the
same "printed page" as the copyright notice for easier
identification within third-party archives.
Copyright [yyyy] [name of copyright owner]
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
================================================
FILE: xtuner-train_internvideo2_5/MANIFEST.in
================================================
recursive-include xtuner/configs *.py *.yml *.json
recursive-include xtuner/tools *.sh *.py
================================================
FILE: xtuner-train_internvideo2_5/README.md
================================================
# How to fine-tune InternVideo2.5?
Note: We only support training with **video data**.
## Install
```bash
cd xtuner-train_internvideo2_5
pip install -e .
```
## Prepare your data
1. Prepare your data annotations like [this](data/annotaions/ft_data_example.jsonl). If you need to use data packing, every data item in the annotation file should have a `duration` field.
2. List your training data in `data/diy_ft_data.json`.
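
Before training, it can help to sanity-check the annotation file. The sketch below is not part of this repository: `validate_annotations` is a hypothetical helper, and the required keys (`id`, `video`, `conversations`, `duration`) are assumed from the fields that appear in the example annotation file. Treating `duration` as a positive number is also an assumption.

```python
import json

# Keys observed in the example annotation file; assumed to be required.
REQUIRED_KEYS = ("id", "video", "conversations")


def validate_annotations(lines, require_duration=True):
    """Return a list of problems found in JSONL annotation lines.

    `lines` is any iterable of strings, one JSON object per line.
    Set `require_duration=False` when data packing is not used.
    """
    problems = []
    for lineno, raw in enumerate(lines, start=1):
        raw = raw.strip()
        if not raw:
            continue  # ignore blank lines
        try:
            item = json.loads(raw)
        except json.JSONDecodeError as exc:
            problems.append(f"line {lineno}: invalid JSON ({exc.msg})")
            continue
        if not isinstance(item, dict):
            problems.append(f"line {lineno}: expected a JSON object")
            continue
        for key in REQUIRED_KEYS:
            if key not in item:
                problems.append(f"line {lineno}: missing '{key}'")
        if require_duration:
            duration = item.get("duration")
            is_number = isinstance(duration, (int, float)) and not isinstance(duration, bool)
            if not is_number or duration <= 0:
                problems.append(f"line {lineno}: missing or non-positive 'duration'")
    return problems


if __name__ == "__main__":
    # Path taken from the link above; adjust it to your own annotation file.
    with open("data/annotaions/ft_data_example.jsonl", encoding="utf-8") as f:
        for problem in validate_annotations(f):
            print(problem)
```

An empty result means every item carries the fields checked here. It does not verify that the referenced video files exist.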
## Start training
If you need data packing to speed up training:
```bash
bash ft_internvideo_2_5_datapacking.sh
```
Otherwise:
```bash
bash ft_internvideo_2_5.sh
```
## Evaluation
Copy the Python file from https://huggingface.co/OpenGVLab/InternVideo2_5_Chat_8B and use [lmms_eval](https://github.com/EvolvingLMMs-Lab/lmms-eval/blob/main/lmms_eval/models/internvideo2_5.py) to evaluate.
================================================
FILE: xtuner-train_internvideo2_5/data/annotaions/ft_data_example.jsonl
================================================
{"id": "video_3827", "video": "video_3827.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures a sequence of activities centered around a glass-top table in a living room. The scene includes a red and black plaid blanket draped over a piece of furniture, a set of transparent curtains with lace trim hanging in front of a window, and a painting or picture with a red flower on the wall. Various items are spread across the table, including books, a red cloth, a jacket or bag, and a glass tumbler. \n\nA person moves in and out of the frame, interacting with the objects on the table. They rearrange the books, stack them, and adjust other items, like the red cloth, indicating an effort to organize or tidy up. Their movements are gradual and deliberate, categorically focusing on repositioning the objects on the table. \n\nIn the background, idle decorations and framed pictures add to the composition. The video is a straightforward, everyday activity focusing on the reorganization of a living room space, devoid of any dramatic effects or significant changes in the setting."}], "duration": 23.77147423092289}
{"id": "video_7636", "video": "video_7636.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures a person organizing and packing objects on a green surface, likely a table. Several items such as a backpack, a green bag, a bottle of ethyl alcohol, and various personal items including a phone. The person shown in a grey shirt, is engaged in placing these objects into both a multi-colored and green bag. The arrangement of items changes throughout the video, indicating the action of organizing and packing. The actions are methodical, with the person placing each item carefully into the respective bags, occasionally adjusting the placement. The content is practical, focusing on the systematic arrangement of the contents on the table into bags, without the inclusion of any noticeable visual effects or strong aesthetic focus."}], "duration": 29.454884712781553}
{"id": "video_4135", "video": "video_4135.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a cooking scene in which an individual is preparing ingredients. The person is seen cracking eggs into a bowl, mixing them, and appears ready to combine other ingredients like flour and milk. There is a carton of eggs, a packet of milk, a container of flour, and a bowl with a cracked egg in the frame. On the countertop, there is also a stove with a pan heating up. The setting is a simple kitchen with close-up shots focusing on the hands and the ingredients. The entire process centers on the action of cracking and whisking eggs, presumed to be for a recipe that includes these ingredients. The individual is meticulous in their movements, ensuring the process is clearly visible to the viewer."}], "duration": 34.99408883634521}
{"id": "video_6786", "video": "video_6786.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In this video, a character is situated in a living room setting, characterized by visible furniture including a sofa and a table. The focus is on a small table covered with a pink cloth, which holds several items: an orange cup, a screwdriver, a watch, a small black ball, a football, and a plush toy resembling a well-known superhero. The character manipulates the objects on the table, specifically touching or moving the football, screwdriver, and cup. The character occasionally sits on a black chair placed behind the table. The visual style is straightforward, with a steady camera capturing the interactions and movements within the scene. There are no noticeable special effects; the emphasis is purely on the natural actions and objects depicted in the living space."}], "duration": 23.585373104746804}
{"id": "video_7354", "video": "video_7354.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video opens with a view of a wooden table, displaying an iron, a piece of folded cloth, a metallic object, and a blue cup with a black cable. A person enters the frame and grabs the folded cloth. The cloth is spread out flat on the table, extending its full length. One hand remains on the cloth to keep it steady while the other moves an iron to smooth out wrinkles. The ironing process continues with repetitive motions, indicating the individual's effort to ensure a well-pressed fabric. The blue cup and the metallic object remain static in the background throughout the sequence. The room's wall is light blue, providing a background contrast to the tabletop activity. The video focuses on the practical action of ironing without aesthetic embellishments or additional effects."}], "duration": 23.863484868684175}
{"id": "video_8633", "video": "video_8633.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video portrays a person sitting at a table, which is covered with a white sheet, handling various objects. These objects include a jar labeled \"Daniell,\" a brush placed inside a blue cup, a wallet, and a pencil sharpener with a green pencil inserted. Throughout the sequence, the person\u2019s hands are seen moving the items around the table. The individual holds the jar and examines it, places it back down, and interacts with the other objects. The person seems focused on the objects, occasionally shifting them to different positions on the table. Movement is primarily centered on the person's arms and hands, with no significant changes in the scene's backdrop or additional characters appearing. All interactions are concentrated on the manipulation and placement of the items on the table."}], "duration": 14.442480237483739}
{"id": "video_9388", "video": "video_9388.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a person standing at a table, engaging with several objects in a systematic manner. The person is initially seen handling a plastic bottle, which they open and place aside. They then proceed to open a jar and move it next to a blue-tinted water bottle. Following this, the person opens the water bottle, inserts a mixing tool, and places the items back on the table. Throughout the video, various containers are manipulated, with each action performed methodically. The background reveals an indoor environment, with a slight view into an adjoining room featuring red kitchen cabinets and a visible chair. Overall, the video focuses on the person\u2019s precise interactions with the objects on the table."}], "duration": 24.36017062116769}
{"id": "video_4782", "video": "video_4782.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a person interacting with items placed on a table in front of them. On the table, there is a tall clear glass holding a long, slightly curved cucumber on the left, a wooden cutting board positioned centrally, an orange placed beside the cutting board, and an electrical charger with a coiled cable on the right. The individual, wearing a white hoodie, proceeds to adjust the objects; they handle the orange, lift and examine the cutting board, and eventually place the items back on the table. The actions are methodical and focus on repositioning and examining the objects laid out. The setting evokes a casual, domestic environment. The background includes a patterned rug and a textured curtain, suggesting an interior space. The video primarily concentrates on the interaction between the person and these different items, demonstrating simple, everyday handling and examining actions."}], "duration": 17.58973140004754}
{"id": "video_2348", "video": "video_2348.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video is a sequence of frames showcasing various stationary objects placed on a flat surface. The frame composition remains largely consistent throughout the video, featuring a white water bottle equipped with a black cap and a carabiner, a fork laid on a purple cloth, a knit green object, a floral-patterned fabric, and a blue mobile phone in a case. The background displays a wall with noticeable peeling paint, indicating slight wear and tear. At one point, the camera captures an angled view of a ceiling or light-colored horizontal surface before returning to the original scene. The video concludes focusing back on the stationary objects. The transitions are smooth with minimal movement, maintaining a stable frame structure except for brief diversions."}], "duration": 15.752972258916776}
{"id": "video_9865", "video": "video_9865.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, various items are placed on a wooden surface, which looks like a cupboard or desk. The placement starts with a hardcover book next to a hairdryer on the surface. Subsequently, a bottle appears, standing next to the book. An earphone is then placed on top of the book, followed by a fork adjacent to the bottle. A toy truck is later set beside the bottle, completing the arrangement. Subsequently, the scene shows a person interacting with the items: they place a bottle-like object and then move the toy truck off the surface. Throughout the video, the background remains stable, with natural daylight visible through a window. No significant visual effects or dynamic camera movements are present, focusing solely on the items being arranged on the surface."}], "duration": 34.96666666666667}
{"id": "video_10216", "video": "video_10216.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In a kitchen setting, the video begins with a blue bag placed on a green countertop. In front of the microwave oven on the counter, there is a tomato and a smartphone partially visible, with the tomato placed on top of it. The tiles on the wall are light-colored with a marble-like pattern. \n\nAs the video progresses, a person wearing a black and white striped shirt appears. The person reaches into the blue bag and retrieves several items. The first item appears to be a white cylindrical object with a black base, set down next to the tomato and phone. Subsequently, a clear jar with a blue lid is placed on the counter, containing a white substance, likely salt or sugar. \n\nThe person continues to take items out of the bag, including a neon green cup or container. These objects are methodically placed alongside the other items, creating a small array of different kitchen or household goods on the countertop. \n\nThroughout the video, the person's actions are deliberate and focused on retrieving and arranging the contents from the blue bag. The overall environment is a typical home kitchen, characterized by the tiled walls and various kitchen appliances visible in the background."}], "duration": 11.433333333333334}
{"id": "video_4035", "video": "video_4035.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures the process of preparing ingredients for a meal. Various items are arranged on a woven placemat placed on a glass table. Among these items are a single brownish-orange fruit, a small piece of onion, a blue bowl containing a white powdery substance, and a red bowl which appears to be empty. An egg carton containing eggs sits to the side. A hand, adorned with a ring, features prominently as the video showcases actions such as reaching for eggs from the carton and cracking one into a white bowl with a colorful print inside it. Fork and knife utensils are also present on the placemat, used for mixing the egg. The hand periodically moves these objects and interacts with the ingredients, creating a step-by-step visual guide to food preparation. The focus remains on the hands and the items on the table throughout."}], "duration": 24.7925622313306}
{"id": "video_5081", "video": "video_5081.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a tabletop scene where three conical-shaped objects/vases, two translucent and one yellow, are placed side by side. An English dictionary is positioned to the left of the vases. A hand is seen interacting with the objects, moving them and sometimes picking them up. Additionally, a small intricate object, possibly decorative, is also on the table and manipulated during the video. The video seems to focus on showcasing the detailed interaction with these items on the table."}], "duration": 14.703893996755003}
{"id": "video_5673", "video": "video_5673.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a scenario where a person is interacting with items on a small, round table. The items on the table include a white bowl, a charger, and a spoon. The person, dressed in a striped shirt and light-colored shorts, is seen arranging these objects. They initially place their left hand on the edge of the table and use their right hand to move the spoon, positioning it closer to the bowl. The person's focus appears to be on organizing the items in a specific order. The setting is a simple indoor environment with the table placed against a door and a wall. There is no noticeable visual effect or added animation; the video captures the person's straightforward actions with the objects on the table."}], "duration": 18.07249265277168}
{"id": "video_6514", "video": "video_6514.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video begins with a display of a table that features several items: an apple, a toy car, and a remote control. The objects are neatly arranged on the table, which is positioned against a chair. The background includes a green wall with a red-brown strip, providing a constant backdrop for the video.\n\nAs the video progresses, a pair of shoes is introduced. One hand is seen holding the shoes and places them onto the table, momentarily obstructing the view of the original items. Subsequently, the shoes are moved off the table and placed on the chair. The person then brings a piece of clothing, which appears to be a pair of jeans, into frame and places it next to the shoes. \n\nThroughout the video, the focus remains on these objects, with little movement apart from the sequential addition of items by the hand. The overall style of the video is straightforward, placing emphasis on the positioning and visibility of the objects rather than dynamic action or visual effects."}], "duration": 15.74749932414166}
{"id": "video_4314", "video": "video_4314.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video entails a person standing by a glass table, facing sideways, and engaging with various objects placed on it. The person is seen wearing a bright outfit with denim shorts. The table contains multiple items including toiletries, a camera, and a floral-patterned, rectangular box. The person interacts with these objects, lifting each item and placing it inside the box. The video depicts methodical organization as the individual arranges items like spray bottles, small containers, and a pouch into the box. The scene takes place indoors, with visible doors and power outlets in the background, providing context to the setting. The movements and actions are carried out with clear intention and care, focusing on packing or organizing the items on the table."}], "duration": 14.645753736845633}
{"id": "video_468", "video": "video_468.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, a person is seen organizing a few items on a glass table. They move around various objects, including a box, a can, a spray bottle, and a zippered pouch. The person opens a cardboard box and places some items inside it while removing others. Throughout the video, the individual's arms and upper body are visible as they sort and arrange the belongings. The background features elements of a room, including walls with various colors and a television. The video appears to focus on the task of organizing or packing items, portraying deliberate and precise movements."}], "duration": 12.869495274482203}
{"id": "video_2442", "video": "video_2442.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a sequence in which an individual is seen placing items into a backpack. On a decorated table surface, there are books, a wallet, silverware, a red bottle cap, and a blue gift bag. The individual picks up the first book, places it inside the backpack, then follows with the second book. The person later grabs the wallet and places it into the blue gift bag. The movements are systematic, and other objects on the table, such as a spoon and fork, red bottle cap, and coins, remain stationary. In the background, shoes are visible near a footwear rack, with some furniture pieces partially visible. The overall activity focuses on organizing and packing items in an orderly fashion."}], "duration": 15.483827629059887}
{"id": "video_7190", "video": "video_7190.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a sequence taking place on a tabletop within a room. Various objects such as two plastic water bottles, a couple of glass cups, and a spoon are arranged on a checkered tablecloth-covered surface. Throughout the video, one hand is shown manipulating the objects\u2014unscrewing the cap of a bottle, pouring liquid into a glass, and recapping the bottle. Behind the table setup, a wall-mounted whiteboard, a mirror, and a closed window with wooden shutters are visible, while a bed and bags are seen in the background, across the room. The camera maintains a fixed position, capturing the activities from a consistent angle, focusing on the actions on the table."}], "duration": 22.228146765754655}
{"id": "video_5164", "video": "video_5164.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video follows a sequence where a person engages in a traditional cup and ball game with three inverted cups and a coin on a table. The person's hands move the cups in various sequences. Occasionally, they lift a cup to reveal the presence of a coin underneath, which changes position by the end. The setup includes two additional objects, a fork and a pen, alongside a large black object resembling a coat hanger. The character is wearing a T-shirt with a visible design and is standing beside the table. The background appears to comprise tiled walls and a countertop, suggesting an indoor setting."}], "duration": 27.3}
{"id": "video_9364", "video": "video_9364.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "A sequence unfolds on a wooden surface where various objects are arranged and rearranged. Initially, multiple paper pieces in different shapes are aligned next to a spoon, a stack of papers, and a blue pen. A person writes something on the top paper and subsequently places colored blocks in front of the paper shapes. These blocks vary in color, including red, yellow, and blue, juxtaposed with the original paper pieces that remain in alignment. The person's hand is clearly visible as they manipulate the objects and paper, capturing a changing arrangement over time. The overall setting appears to be indoors with a focus on the surface activities and spatial organization of these items."}], "duration": 19.033333333333335}
{"id": "video_2246", "video": "video_2246.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a sequence where a person is interacting with glass containers on a flat surface. The individual shifts between handling a large jar and several small glass cups. The main actions involve pouring water from the large jar into the smaller cups and then transferring the water back into the large jar. In addition to the glass containers, other notable objects include a book with a visible cover and a device in the background. The person's hands and arms are primarily featured, with movements focused on the careful handling and pouring of water between the containers. The overall activity seems to reflect a simple, methodical process involving fluid transfer and container organization."}], "duration": 35.05030985540081}
{"id": "video_3598", "video": "video_3598.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a sequence involving a person positioned at a table. The individual, wearing a red shirt with a prominent logo, uses various objects placed on and around the table. To the right of the table is a purple electric kettle. A water bottle is placed inverted on the table's left side. A pen lies horizontally next to the kettle on the table's surface.\n\nThe person manipulates a small clear container with a metal lid filled with white granular substance, positioning it centrally on the table. The person then retrieves a flat object resembling a laptop and places it on top of the container. Subsequently, they add a small tea light candle on top of the flat object, centralizing it. Throughout the video, other items, such as bags and various objects positioned behind the table, remain largely stationary. The video primarily focuses on the interaction between the person and the mentioned objects, with no significant visual effects detected."}], "duration": 15.194935021659447}
{"id": "video_6230", "video": "video_6230.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video appears to showcase a cooking or food preparation scene on a kitchen countertop with various items such as a bottle of oil, a bottle of a white substance, a plate with a floral design, sliced tomatoes on a cutting board, a teapot with floral designs, and a container topped with bananas. The letters \"S,\" \"I,\" and \"R\" in different colors and sizes are positioned on the countertop. The video captures movements such as hands rearranging the letters and handling food items, including transferring lettuce and tomatoes onto a plate. The actions also involve utilizing the white container and oil bottle, indicating that ingredients are being prepared or added to a dish. Overall, the video focuses on a cooking activity with a variety of kitchen items and ingredient handling."}], "duration": 25.8}
{"id": "video_10452", "video": "video_10452.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a tabletop setting with a variety of items arranged on a wooden surface. The items include a green handbag, a pink handbag, a cracked smartphone, two lipsticks, an orange, and a plastic cup. Throughout the video, a person's hands move the items around. The hands pick up the lipsticks and place them inside the pink handbag. Subsequently, the same hands pick up the smartphone and also place it inside the bag. The focus then shifts as the person places the pink handbag inside the green handbag and zips it up. The surroundings suggest an indoor setting with a tiled floor in the background. There is no notable visual effect other than the direct actions performed with the items on the table."}], "duration": 28.202503219840825}
{"id": "video_8742", "video": "video_8742.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a sequence where a person assembles a salad. The action occurs on a surface in a kitchen setting, with various ingredients prepared on a tray. The primary items include sliced tomatoes, leafy lettuce, and sliced carrots. Throughout the video, the person's hand is seen picking up these ingredients and placing them in a bowl. There is a focus on the action of assembling the salad, highlighting the movement and placement of different vegetables into the bowl. Various kitchen items are visible around the workspace, including a mug, an egg, and a bottle of extra virgin olive oil. The video maintains a practical and straightforward approach without excessive aesthetic embellishments, offering a clear view of the salad preparation process."}], "duration": 27.95902498394131}
{"id": "video_11247", "video": "video_11247.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, a person uses various objects placed on a reflective surface. On the left, there is a wooden block with a blue design. In the center, a clear plastic slinky is manipulated, and a yellow ball is placed inside it. The person interacts with the ball and the slinky, adjusting their positions. On the right, there is a clear glass with an orange spoon inside it, standing upright for most of the video. The person, wearing a gray and white striped sweater, focuses on the manipulation of the objects, particularly the yellow ball within the slinky. The background features a light-colored interior setting with minimal decor, adding a neutral backdrop to the main activities on the table. The video does not employ any special visual effects, relying on natural lighting and the simplicity of the movements and interactions with the objects. The camera remains steady, capturing the actions from a frontal perspective."}], "duration": 14.751423529020013}
{"id": "video_191", "video": "video_191.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video predominantly features actions involving various objects and a partial view of a person. A door is half open, revealing a glimpse of a room. Multiple objects, such as a water bottle, a pair of eyeglasses, a remote control, and other miscellaneous items, are on the floor beside it. A person is partially visible, primarily showing their legs and bare feet.\n\nThroughout the video, there are interactions with different objects. For instance, a small bottle is rolled across the floor, and a larger bottle is placed and moves around. A piece of paper is also manipulated and placed in the frame. The person's hand enters the frame several times to pick up, place, and move these items.\n\nThe overall style of the video can be characterized by its focus on everyday, mundane interactions. The movements of the person and objects provide a sense of casual, possibly experimental activity within a domestic setting. There is no notable use of special visual effects; the video maintains a straightforward recording of real-life actions and objects."}], "duration": 28.307941762659187}
{"id": "video_4540", "video": "video_4540.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In a close-up sequence, a person dressed in a hoodie and pants is tying the laces of their white sneakers. Throughout the video, the individual's hands are seen pulling and adjusting the laces, eventually forming neat bows. The video captures the repetitive and meticulous action of shoe-tying step by step. On the ground lies a blue book, partially readable, along with a pencil, adding elements of a study or casual setting. The background features a green cushioned surface and some other indistinct objects, implying a relaxed indoor environment. The movements focus exclusively on the hands and shoes, emphasizing the process of tying shoelaces."}], "duration": 25.310968086170675}
{"id": "video_9431", "video": "video_9431.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a series of actions related to making a hot beverage and preparing a simple breakfast in a kitchen setting. The primary focus is on a pair of hands that open a container, revealing a variety of tea packets and biscuits. One of the tea packets is selected and placed into an empty cup. An electric kettle, which is prominently positioned, is then lifted and used to pour hot water into the cup with the tea bag. The tea bag begins to steep.\n\nOn the counter, there's also a slice of bread on a plate with a spoon placed on top of it. Additionally, the video captures various kitchen items like a green cup, a container with a blue lid, and a large jar. The process is methodical, capturing the familiar, everyday routine of preparing tea and setting up a small breakfast or snack. Overall, the video emphasizes the sequence of hands preparing and making tea without focusing on aesthetic intricacies."}], "duration": 35.0}
{"id": "video_7339", "video": "video_7339.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures a hand interacting with several everyday items on the floor, including a carton of P\u0131nar brand milk, a small bowl containing two eggs, and two pens. The hand holds a small, dark green wallet and inspects its contents, revealing folded papers. The hand then places the wallet into a larger, light green bag. The camera remains steady throughout, focusing primarily on the objects and the hand's actions, creating a practical and straightforward visual narrative. The flooring and surrounding furniture provide the context for the scene."}], "duration": 24.766666666666666}
{"id": "video_11012", "video": "video_11012.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a scene in an indoor setting, likely an office or meeting space. A person in a red shirt interacts with a table, on which various objects are placed, including a green plate, a cup, a piece of paper, and a pen. The person writes on the piece of paper, demonstrating the action of writing with a pen. After writing, the individual holds the paper up to the camera to reveal the written content that says \"EAT.\" Subsequently, the person manipulates some cut-out letters and arranges them on the table. The letters V, C, and U are clearly visible. The background includes a leather couch and a banner on the wall, suggesting a formal environment. The movements are smooth and deliberate, focusing on the actions performed on the table."}], "duration": 25.707634695185725}
{"id": "video_3507", "video": "video_3507.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a person's hands arranging various objects on a wooden table. Among the objects present are a handheld fan, glass jars, and a video game controller. The individual organizes these objects methodically, moving the jars and placing their lids on them. The process is conducted with a focus on organization and arranging these items neatly on the table. The video emphasizes the interaction between the hands and objects, demonstrating careful placement and adjustment of each item. There are no significant visual effects or dramatic aesthetic elements; the footage primarily captures the straightforward action of arranging and aligning the assorted items."}], "duration": 14.842733731363197}
{"id": "video_4629", "video": "video_4629.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a person standing in front of a wooden table while facing the camera. To their right is a black office chair with one sandal placed on its armrest. The person, only viewable from the neck down, has curly hair and is wearing a navy blue shirt with large white polka dots and patterned shorts. They are showing various hand movements and gestures as they stand in place. In the background, there is a closed wooden door with a handle and a light switch panel on the wall. The setting appears to be indoors, likely in an office or home office environment. The video primarily focuses on the individual\u2019s hand movements and their torso."}], "duration": 11.39468248150863}
{"id": "video_9619", "video": "video_9619.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video consists of a sequence focusing on objects on a blue tray, placed on a flat surface. Central to this setup is a tall plastic container, next to which lies an opened book with a pen atop it, resting on a cup. A person is visible in the background, primarily interacting with these items. \n\nAs the video progresses, the individual introduces a wooden plank, positioning it atop the plastic container. Next, they place a weighted object, seemingly a red sauce bottle, on the far end of the plank, balancing it across the container. After a brief demonstration, the person removes the book and pen from the setup, leaving the container and plank as the primary focus. \n\nThe video content is practical, featuring the interaction and manipulation of everyday objects to create a simple balance-based demonstration, free from any noticeable visual effects or dramatic changes."}], "duration": 13.326669998334166}
{"id": "video_3592", "video": "video_3592.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video begins with a collection of green ceramic cups and bowls placed on a wooden surface. Additional items like a remote control, a box of chamomile tea, and a toy car are seen in the background. A hand appears and starts to rearrange the cups and bowls, moving them one by one to the right side of the surface. Some cups are placed upside down while others are rearranged to face different directions. The hand continues to systematically shift the cups and bowls, creating more space on the wooden surface. The movement mainly involves picking up, placing, and adjusting the positions of the objects. The video maintains a focus on the rearrangement task throughout, capturing the methodical organization of the ceramic items."}], "duration": 17.5}
{"id": "video_5692", "video": "video_5692.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures a scene set in a kitchen where a person, dressed in a red garment, is engaged in a cooking or meal preparation activity. The individual is positioned at a counter with various kitchen items such as a bowl, a striped cloth, a glass container with contents resembling a grain, and a dark appliance in the background. The person moves their hands frequently, indicating mixing or stirring actions, and at one point, briefly lifts the bowl. The movements are deliberate, suggesting a focused preparation process. The consistent framing of the counter space and the visible use of common kitchen tools highlight the practical, everyday nature of the activity taking place."}], "duration": 12.423394617639222}
{"id": "video_404", "video": "video_404.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video begins by showcasing a variety of items placed on a table, including a kettle, a green cup, a pack of tea bags, a container of sugar, an egg, an onion, and a loaf of bread set on a glass container. The scene captures a person entering and performing a series of actions involving these items. The person opens the box of tea bags, pulls one out, and places it into the green cup.\n\nNext, they reach for the kettle and pour hot water into the cup. Following this, they use a spoon to stir the contents of the cup. On the table, items are re-positioned slightly as the person works, and in the background, a laptop is visible on another table along with other objects like books and what appears to be a towel. The video effectively demonstrates the preparation of a hot beverage, capturing a sequence of practical steps with a clear focus on the actions involved in making tea."}], "duration": 34.92653268029049}
{"id": "video_7249", "video": "video_7249.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a sequence of movements involving tying the laces of a pair of black Nike sneakers. The person's hands are seen repeatedly adjusting and tightening the laces, ensuring they are securely tied. Additional objects in the frame include a small beige wallet or pouch on the floor near the person's feet and a couple of coins scattered on the tiled floor. The individual's attire includes white socks and light gray sweatpants. The video focuses closely on these actions, detailing the lacing process, and includes elements of an everyday routine. The setting appears to be an indoor space with a tiled floor and a textured, light-colored surface in the background, possibly part of a furniture piece. The visual style is straightforward and practical, concentrating on the hands and feet as they perform the task."}], "duration": 13.432885570480984}
{"id": "video_7215", "video": "video_7215.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, a person interacts with a variety of objects on a wooden table, including a red and black backpack, a black wallet, a tomato, an egg, and a set of metal objects that appear to include a key and a USB drive. The action centers around the person handling the wallet, from which they remove a small item resembling a SIM card. The person then directs attention to the backpack, opening an external pocket and placing the wallet inside. Throughout, the camera focuses on the person's hand movements and the placement of the objects, emphasizing the process of organizing and storing items within the backpack. The environment appears to be indoors, with tiled flooring visible beside the wooden table."}], "duration": 20.91630132860346}
{"id": "video_3244", "video": "video_3244.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a person performing various actions at a table covered with a colorful patterned cloth. The individual interacts with several objects placed on the table, including coins, books, utensils, and a wallet. Their initial actions involve inspecting and handling currency, followed by opening and flipping through a book. The person also places the book back on the table and begins organizing their backpack. Later in the video, they move items from the table into the backpack methodically. At one point, a purple potted plant is visible on the table as the person continues their activity. The setting appears to be domestic, suggested by the surrounding furniture and decor. The video focuses primarily on the person's hands and the objects they manipulate, conveying a sense of routine and organization."}], "duration": 10.882343062506298}
{"id": "video_1815", "video": "video_1815.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a person who is buttoning up a striped shirt. The sequences show the person fastening each button methodically, starting from the waist and ending at the collar. The background suggests an indoor setting with a chair and suitcase to the left and a closet to the right. The person occasionally adjusts the shirt for neatness, focusing primarily on ensuring all buttons are properly aligned and fastened. No additional actions or objects are evident beyond this primary activity, and the background remains relatively constant throughout the video."}], "duration": 26.133333333333333}
{"id": "video_3655", "video": "video_3655.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a person working at a wooden desk, interacting primarily with a laptop displaying a Google Sheets document. The individual uses the keyboard and occasionally clicks around. Various colored foam letters, including \"I,\" \"J,\" \"K,\" \"L,\" and \"U,\" are arranged on the desk, and the person seems to manipulate these letters. On the top right of the desk, a small red toy car holding an apple and a banana is visible. The room's background includes a visible corridor with steps and a teal chair. Based on the interaction patterns, the person appears to be spelling different words on the laptop while rearranging the foam letters on the desk. The overall focus is on the activity involving the laptop and the foam letters."}], "duration": 21.55114173760303}
{"id": "video_5140", "video": "video_5140.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In this video, we observe a laptop placed on a desk with some alphabetic tiles arranged nearby. A person's hands appear intermittently, moving tiles. Initially, the tiles spell out \"KARB.\" The laptop screen displays an open word processing document. Subsequently, the person starts typing on the laptop, and the word \"BARK\" appears on the screen. To match the screen, the person rearranges the physical tiles on the desk to spell \"BARK.\" The video concludes with both the laptop screen and tiles displaying the word \"BARK.\" Various objects on the desk, such as a mobile phone and cutlery, remain static throughout the video."}], "duration": 11.256901125690113}
{"id": "video_1788", "video": "video_1788.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video displays a series of actions occurring on a black marble countertop against a plain wall. On the surface, there is a folded red cloth, an orange, a glass, a plastic container with multiple cup-shaped objects, and two beverage bottles\u2014one green and one transparent with red contents. Additionally, there is a stack of metal containers wrapped in plastic.\n\nA person's arm, clad in a plaid shirt, appears interacting with the objects on the table. The individual is seen moving the glass and the cup-holding container. The glass is occasionally overturned, and the red cloth is also adjusted. As the video progresses, various objects are rearranged, with the person's hand prominently positioned mid-air or near the objects at times.\n\nThe sequence follows the hand's interaction with the items meticulously, without any abrupt or fast movements, and concludes with a close-up shot of the person's hand looming over the entire setup, possibly indicating the end of the video as the camera is adjusted or turned off. The video maintains a static viewpoint, focusing steadily on the countertop scene throughout."}], "duration": 14.503867698052813}
{"id": "video_7004", "video": "video_7004.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a person folding clothes at a table. Various clothing items, including a striped garment and a shirt with some designs, are being sorted and folded. A bottle of water remains visible on the table throughout the sequence, alongside a pair of green scissors and an orange. The person methodically picks up each item of clothing, inspects it, and carefully folds it, placing the folded items in a neat pile. In the background, elements of the room like a chair, a bed with blue sheets, and some personal belongings are noticeable but remain static throughout the video. The overall focus is on the action of folding clothes, without additional visual effects or significant changes in the environment."}], "duration": 27.499916362784788}
{"id": "video_10105", "video": "video_10105.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video sequence shows a series of actions taking place next to a wooden cabinet on a wooden floor. Objects visible include a book, a desk lamp, a pot, a glass with patterns, a pink pen, and a knife. Initially, a hand reaches to plug in and turn on the lamp, which illuminates the scene. Subsequently, the hand picks up and flips through the book while the lamp remains on. After inspecting the book, the hand puts it back on the floor, and the person then picks up the glass, moves it nearby, and later returns it to the same spot. The final segment includes the person reaching out to unplug the lamp, turning it off. Throughout the video, these manipulations of objects are central to the activity displayed."}], "duration": 28.53238225392487}
{"id": "video_8123", "video": "video_8123.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video is set in an indoor space with a shelf and various objects placed on it. A lamp with a cylindrical shade sits on a red surface, plugged into a power source below the shelf, which also features various wires and devices. A colorful cloth and an empty glass cup are also present on the red surface. Throughout the video, a person's hand appears, interacting with the objects on the shelf. The hand moves a knife, pencils, and adjusts the position of the lamp and the cloth. The video captures a series of static shots interspersed with these actions. No notable visual effects or transitions are observed within the clip."}], "duration": 22.52252252252252}
{"id": "video_3690", "video": "video_3690.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video takes place in a kitchen and predominantly features a wooden table with objects placed and moved around on it. A hand wearing a yellow sweater is seen interacting with the objects. Initially, there is a mug on the table with a bottle inside it. The person in the video repeatedly brings different items into view, including a large bottle of soda, a bottle of water, and a green vegetable. Each item is briefly shown before being taken away. The background includes part of the kitchen, with a glimpse of an oven, refrigerator, and some shelves. The video maintains consistency in setting and perspective, providing a clear and focused view of the tabletop actions."}], "duration": 19.5}
{"id": "video_1271", "video": "video_1271.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video displays a top-down view of a wooden table. Various objects are placed on it, including a remote control, a jar with green contents and a lid, a black object, a blue object, and a clear plastic water bottle with a blue cap. Throughout the video, hands are seen interacting with these objects. The hands are seen opening and closing the jar, manipulating the black object, and opening the water bottle to presumably pour something. Feet are visible on both sides of the table, indicating the presence of at least two people. The focus of the video is on the actions involving opening and closing containers, likely for an instructional or demonstrative purpose."}], "duration": 15.29745042492918}
{"id": "video_10505", "video": "video_10505.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a person preparing a salad in a kitchen setting. The individual is seen tearing lettuce and placing it into a white bowl. The countertop features various objects including a red kettle, a glass with some juice, a cup and saucer with a spoon, a sugar container, and other kitchen items like jars of spices and tea packets. Cucumber slices and tomato wedges are added to the lettuce. Corn kernels are also added to the mixture. The salad is then mixed with a fork, and the person retrieves a bottle of what appears to be olive oil, which is drizzled over the ingredients. Finally, a small salt shaker is used to season the salad, followed by additional mixing with the fork. Throughout the video, the focus remains on the salad preparation process and the items on the countertop."}], "duration": 35.025803229565504}
{"id": "video_4287", "video": "video_4287.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video is set in a kitchen with a person preparing what appears to be a meal or snack. The countertop is cluttered with various items including a transparent wine glass, a container of salt, a sugar container with a happy face design, a remote control, and a dozen eggs in a large carton. The person, dressed in a denim jacket, opens the egg carton, takes out an egg, and proceeds to crack it into a white bowl with a decorative design. An empty eggshell is discarded on the table. The individual then picks up the salt shaker and adds some salt to the bowl, before placing the salt container back on the counter. The video captures the hands of the person performing these actions, and through the movements, the kitchen background remains consistent. There is a lot of emphasis on the process of preparing ingredients and the individual components involved."}], "duration": 35.00166833500167}
{"id": "video_8978", "video": "video_8978.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video portrays a sequence where a person performs the task of pouring water into a glass. Initially, a water bottle, a glass tumbler, and a wine glass are present on a table. The water bottle is positioned horizontally across the table. The person's hand interacts with these objects, lifting the bottle and unscrewing its cap. Subsequently, the water from the bottle is poured into the glass tumbler. The liquid visibly fills the glass, capturing the light and creating reflections. After pouring, the person places the glass back on the table, which now contains the poured liquid, while the wine glass remains untouched throughout the video. The background is neutral, focusing the viewer\u2019s attention on the actions and the objects."}], "duration": 24.7925622313306}
{"id": "video_5831", "video": "video_5831.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases an individual in a kitchen environment, interacting with various objects on a countertop. The person, whose upper body is visible and dressed in a blue-patterned top, begins by picking up a 100 Rand banknote and a book titled \"The Penguin Book of Southern African Stories.\" They carefully place the banknote inside the book, flipping through pages to secure it. Nearby, a striped jug with a banana inside and a mug are visible, adding to the kitchen setting. \n\nSubsequently, the person picks up a red book, examining it briefly before placing it into a black handbag labeled \"Vivienne Black.\" They then place the previously handled book, now containing the banknote, on the countertop. The individual employs deliberate movements, organizing the items around them before finally interacting with the handbag again, suggesting preparation for departure or travel. This sequence of actions is precise and methodically completed, reflecting a routine or task completion process."}], "duration": 15.316848533386725}
{"id": "video_6127", "video": "video_6127.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In this video, a small, modern desk lamp is the focal point. It sits atop a red surface, with wires loosely coiled around its base. Nearby, a clear glass and a patterned cloth with an attached plug are also present. The video shows a hand tidying the area around the lamp, adjusting the cloth, and moving the plug and various utensils. The background includes a segmented wall with electrical outlets and a device with green indicator lights. A shelf extends overhead, casting a shadow onto this area, while part of a kitchen is visible in the distance. The scene appears static, with minor hand movements adjusting objects in each frame."}], "duration": 17.851184517851184}
{"id": "video_3773", "video": "video_3773.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a person seated at a table with a variety of items placed on a tablecloth. These items include a plastic bottle with liquid, an orange, a glass jar containing a red substance, another glass jar with a yellow substance, a closed bottle, and a remote control. The person manipulates the objects, including placing and moving the bottles and jars. Throughout the sequence, the person opens and closes a lid and continues to interact with the items on the table. The visual focus remains on the person's hands and the actions performed with the objects, without any noticeable special visual effects. The setting gives a casual, everyday life scene."}], "duration": 17.819531276119154}
{"id": "video_6799", "video": "video_6799.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a person engaging in a task at a round table. Various objects are present on the table, including two clear jars with colored lids (one red, one gray), a plastic container with a blue lid, a small yellow container, and a red box. The person is wearing a striped shirt and beige pants and is seen manipulating these objects. Actions include picking up and placing down the jars, sealing one of the jars with its lid, and adjusting the blue lid on the plastic container. Throughout the video, the focus remains on these actions, with no additional visual effects or background activities observable. The setting appears to be indoors, with doors and a wall in the background, creating a straightforward and clear depiction of the task being performed."}], "duration": 16.28213356754383}
{"id": "video_533", "video": "video_533.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, a person is seated at a table. The primary focus appears to be an opaque plastic container placed centrally on this table. Adjacent to the container is a book and a black mug with a red object inside. The person places a rectangular board on top of the container, delicately balancing it. Subsequently, the person adjusts and moves objects around the container, potentially to stabilize or demonstrate a concept. As the video progresses, the person introduces a cylindrical object, moving it across the scene before removing the board, suggesting the conclusion of a demonstration or experiment. The background remains unchanged throughout the video, depicting a light-colored wall and a curtain."}], "duration": 11.427807665318943}
{"id": "video_1964", "video": "video_1964.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a person packing a box placed on a small table covered with a patterned cloth. Various items, including a keyboard, a clear bottle with a pink cap, a tube of Nivea cream, a colorful packet labeled \"Stella Fitness,\" a pen, a notebook, and what appears to be a box of medicine, are arranged on the table. The person methodically places these items into the box. The keyboard, bottle, tube, and packet are sequentially packed into the box, followed by the notebook and pen. The person then picks up the box of medicine and places it into the box, along with any remaining items.\n\nDuring the packing process, the person can be seen moving around and handling the objects with care, placing them securely within the cardboard box. The video concludes with most of the items being packed inside the box, indicating the completion of the packing task."}], "duration": 22.199260024665843}
{"id": "video_513", "video": "video_513.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video involves a close-up of a round wooden table. Initially, four plastic cups are placed on the tabletop. The scene transitions as a hand appears and places a clear plastic bottle filled with water on the table. Following this, a second bottle containing a yellow liquid is added next to the water bottle. The final addition to the table setup is a yellow fruit. Throughout the video, the focus remains on the organization of these objects on the table. The surrounding setting includes a background with a mixture of furniture and household items, creating a casual indoor atmosphere. The hand movements are deliberate, placing each item carefully in view of the camera."}], "duration": 18.966666666666665}
{"id": "video_9387", "video": "video_9387.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures a person preparing tea at a table. There is a large electric kettle on the left side, a tea box, a cup with a spoon inside, an apple, and a cherry tomato placed on the table. Initially, colorful letters arranged on the table spell out \"ALIN.\" The individual moves the letters around, eventually rearranging them to spell \"NIAL.\" They then open the tea box, select a tea bag, and place it into the mug, using the spoon to stir while holding the mug steady. The person pours hot water from the electric kettle into the mug and continues to stir, indicating the tea preparation process. The video remains static, focusing solely on the tea-making activity without additional effects or transitions."}], "duration": 22.699243358554714}
{"id": "video_7264", "video": "video_7264.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a person organizing various objects on a table. The camera is stationary, capturing the scene where a person is seen interacting with a box and several items in front of it. The items include a cup, a small bowl, a spray bottle, a cleaning brush, a marker, and a pair of scissors. The individual is observed placing these objects into the box one by one, arranging them methodically. The surrounding room features a sofa, a window, and some wall decorations, contributing to the setting. The actions focus on the methodical movement of hands arranging and placing the objects into the box, with no significant attention to visual effects or dynamic changes in the environment."}], "duration": 20.849499004756925}
{"id": "video_3009", "video": "video_3009.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts an individual in a kitchen environment handling a green plastic bag. The camera captures movement as the person repeatedly reaches into the bag and extracts various items. The background shows common kitchen fixtures including a microwave, a stove, and several containers placed on shelves. Occasionally, the individual holds up the green bag, perhaps to display its contents or ensure its organization. The activity is ongoing, with consistent action revolving around the retrieval and possibly sorting of items from the bag."}], "duration": 18.52346305320072}
{"id": "video_2743", "video": "video_2743.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In this video, a wooden table is featured prominently in the center of a room. An upholstered black sofa with green and orange cushions is located in the background. On the table, a white bowl sits upside down with a remote control placed lengthwise across its bottom. Early in the video, a clear glass is placed on the right side of the table. A hand, seemingly putting the glass down, appears momentarily in the frame. Subsequently, a second clear glass is placed to the left of the first one, closer to the bowl, and a third glass follows, positioned upside down on the far right side of the table. The glasses appear intermittently with brief glimpses of the person setting them down, hinting at human presence but without revealing any details about the individual. The video progresses with these placements but remains focused on the evolving arrangement on the tabletop. The environment\u2019s lighting is consistent, providing a clear view of the objects and their positions without any noticeable changes or additional visual effects."}], "duration": 13.256944907068151}
{"id": "video_4742", "video": "video_4742.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video displays a sequence where a person is interacting with various objects on a table covered with a colorful, fruit-patterned tablecloth. The person is seen moving to place a green fruit on the table and later takes additional items from off-screen, which include a clear glass jar, some ice cubes, a frying pan, a cable, and a small cup of coffee. These objects are handled and displayed by the person in a detailed and methodical manner. Movements are calm and deliberate, focusing on presenting each item to the camera. The background consists of tiled walls and a door, suggesting a kitchen setting. The visual style is straightforward, clear, and instructional."}], "duration": 20.923569001132805}
{"id": "video_3766", "video": "video_3766.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts an individual engaging in a simple cup and ball game. The table is set with three upside-down paper cups positioned in a row. The person sitting at the table alternates between lifting one of the cups and showing the object underneath, a ball, and placing the cups back in shuffled positions. \n\nNear the left side of the table, there is an electric kettle placed on top of some magazines or books. To the right, there is a transparent glass placed in a bowl with a spoon inside it. The setting appears informal, with the person seated on a simple plastic stool. \n\nThe focus remains on the cup and ball game, with the individual's hands actively moving the cups to shuffle the ball's location throughout the video. The backdrop and surrounding environment remain static, keeping attention on the game. The overall style of the video is straightforward, concentrating on the person's actions with minimal distractions."}], "duration": 20.692413993745}
{"id": "video_6321", "video": "video_6321.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases someone preparing a salad in a kitchen. Various items like a bowl of chopped vegetables, a cutting board with additional ingredients, and containers of salt and olive oil are visible on a countertop. A person\u2019s hands are seen in the footage, interacting with these items. They are moving objects, adding greens to the bowl, sprinkling salt, and using the olive oil. The environment is a typical kitchen with a counter and a wall socket in view. The person\u2019s actions mainly involve assembling the salad ingredients and mixing them together."}], "duration": 35.03566904460297}
{"id": "video_8089", "video": "video_8089.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a tabletop setup with various objects arranged on it. In the foreground, there is an orange cup situated on the left side of the table. A small toy car is placed on a worn, rectangular cutting board situated at an angle. The cutting board is propped up slightly, creating an incline.\n\nIn the background, a person\u2019s lower body and hands are visible. The individual appears to interact with the toy car, adjusting its position on the cutting board and occasionally lifting or moving it around. The toy car is small and orange, reminiscent of a racing car design. The individual\u2019s movements indicate a possible demonstration or play activity with the toy car, which moves along the inclined cutting board and around the tabletop.\n\nThe surface they are working on is a dark-colored table placed against a tiled floor with a geometric pattern. The video focuses solely on the tabletop setting and the interactions with the toy car, without any complex visual effects or changes in camera angles. The overall atmosphere hints at a casual, hands-on demonstration or playful activity."}], "duration": 14.579958342976163}
{"id": "video_6031", "video": "video_6031.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, the setting appears to be a cluttered room with various household items scattered around. A prominent wooden table is located centrally, featuring books, toy cars, and a pen holder. A person enters the scene and begins interacting with one of the books on the table, flipping through the pages while holding a toy car. The person moves and rotates the toy car over the book, as if simulating a journey across its cover and pages. The person's actions are the main focus, with the surrounding objects mostly remaining stationary. The video captures the person\u2019s hands moving the toy car along different paths, intermittently pausing to inspect the book, and eventually stepping out of the frame, leaving the objects as they were initially. The clutter in the background remains unchanged throughout the sequence."}], "duration": 16.495051484554633}
{"id": "video_1631", "video": "video_1631.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a person in a well-lit room, adjusting their clothing, focusing on buttoning a shirt. The person is initially seen standing in front of a wardrobe with glass doors on the left. They then proceed to button up a light-colored shirt while facing the camera. The background reveals various household objects, including a mirror, a couple of handbags, and a shelving unit filled with assorted items like boxes and a yellow container. A visible dark area at the bottom right corner, which appears to be a cat, subtly shifts position throughout the video. The scene captures the detailed process of buttoning a shirt, interspersed with subtle movements and interactions with the background objects."}], "duration": 25.0916361212929}
{"id": "video_824", "video": "video_824.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a person, dressed in a pink hooded top, organizing and folding various articles of clothing on a table. The garments, sorted by the individual, include pieces of different colors like yellow, black, and blue. The person lifts each item of clothing and folds it neatly before placing it back on the table in a pile. Throughout the video, the person's focus remains on arranging the clothing. An object that looks like a charger with a cable, and another object, possibly an electronic device or container, can be seen on the table next to the person. In the background, parts of a room including a picture or mirror frame on the wall, and a piece of furniture are visible. The overall action is systematic and methodical, concentrating on the task at hand \u2013 folding clothes."}], "duration": 34.98354314410777}
{"id": "video_1508", "video": "video_1508.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a sequence involving various objects on a wooden surface. Specifically, it shows a hand interacting with items on the surface. Initially, the hand is seen moving a book titled \"Platos al horno.\" Next, the hand manipulates a white and orange cylindrical object, placing it inside a container made from a mason jar and an inverted orange bowl. Subsequently, the hand lifts the mason jar, removing the inverted bowl. The hand then places the bowl back on the wooden surface. Lastly, the hand adjusts a purple hair dryer. These actions happen in front of a textured wall with an electrical outlet visible in the background. The video emphasizes the methodical and deliberate interaction with these objects."}], "duration": 15.17054132021042}
{"id": "video_10493", "video": "video_10493.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "A person is unpacking items from a plastic bag on a kitchen counter. The items include a pot, a mug, a glass, a book, and a corded electronic device or accessory. The background reveals kitchen cabinetry and appliances, indicating the scene is set in a kitchen. On the counter, there are also a banana, a bottle of water, and an apple, which remain stationary throughout the sequence. Most actions consist of the person reaching into the bag, retrieving an item, and placing it on the counter. The video shows a simple and straightforward unpacking process without any sophisticated visual effects or dramatic elements."}], "duration": 30.98414379820583}
{"id": "video_1062", "video": "video_1062.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a hand manipulating various objects in front of a white wall with a visible power outlet and an electrical plug. The video begins with the person holding and rotating a black pen, displaying it in their hand from different angles. The camera angle remains static but focuses closely on the hand and the pen.\n\nSubsequently, the person steps out of the frame, briefly leaving an empty scene with the wall and the power outlet. They return holding a white handheld device with a blue transparent nozzle. This new object is moved similarly, being turned and examined in various orientations, allowing the viewer to see different perspectives of the device.\n\nThe setting stays consistent throughout the video, maintaining focus on the actions performed by the hand with minimal background changes. The video effectively highlights the objects through straightforward hand movements."}], "duration": 24.424424424424426}
{"id": "video_5299", "video": "video_5299.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In this video, a person's hand is seen manipulating three overturned white cups arranged in a row on a flat surface. Throughout the sequence, the person consistently lifts and shifts the cups, which remain aligned or in close proximity. The movements are deliberate, and there are occasional pauses as the cups are examined or considered. The setting includes a remote control and a potted plant in the background, which stay in place for the duration. A power adapter and a coiled cable are also visible to the side, remaining stationary as well. The overall interaction with the cups appears methodical and focused on the task at hand."}], "duration": 19.78675757879608}
{"id": "video_8888", "video": "video_8888.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows a person working in a kitchen setting, seen from a low angle with a focus on the hands and a blue plate placed on a table. Various actions are performed related to preparing ingredients. The hands are observed handling and tearing green leafy items above the blue plate. Later, a bottle of oil is seen, and oil is poured into the plate. The liquid gushes in and stirs around the green contents. \n\nIn the background, the kitchen wall is tiled, and several plastic bottles of green beverages and cleaning products can be seen along with a chair, creating a typical kitchen atmosphere. The person wears a shirt with distinct graphical elements, including hearts and lettering, but their face is not visible. The entire activity is recorded without any noticeable visual effects or transitions, maintaining a stationary perspective that emphasizes the food preparation process."}], "duration": 35.01167055685229}
{"id": "video_9097", "video": "video_9097.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video sequence, an individual performs various actions involving books and household objects. The setting includes a wooden cabinet, a stack of books, an iron, and a hairdryer on a wooden floor with a purple wall as a backdrop. The person interacts with the books, flipping through pages and examining them. The books are then placed back on the stack, followed by a close-up of the person plugging in and using a hairdryer. The viewer's perspective is close to the ground, focusing on the person's hands and the objects they interact with, providing a clear view of electrical and household chores. The video showcases mundane tasks, emphasizing the ordinary operations and manipulations of everyday items."}], "duration": 26.49911669611013}
{"id": "video_1806", "video": "video_1806.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video primarily takes place indoors on a tiled floor, where a person wearing plaid pants appears to be engaged in various activities related to footwear. The person is initially seen sitting and positioning their legs. As the video progresses, there is movement of legs wearing different shoes, including a black sneaker and a brown heeled boot. \n\nThe person transitions from adjusting their position to standing up, where the focus remains on their feet and footwear. At one point, they crouch down to tie the laces of the brown heeled boot and then proceed to tie the laces on the black sneaker. During this process, a white mug remains nearby on the floor as a static object. \n\nThe video captures detailed motions of tying shoelaces with both hands, alternating focus between each shoe, ensuring the laces are properly secured. The sequence ends with the person finishing the task and standing up, before the scene transitions to darkness suggesting the end of the recording. There are no special visual effects or dramatic elements; the content is straightforward and practical, concentrating on the activity of tying shoes."}], "duration": 32.766350210970465}
{"id": "video_6272", "video": "video_6272.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video starts with a person holding a blue paper and a green marker pen over a reflective table. The person proceeds to draw on the blue paper with the green marker. Afterward, the blue paper is removed, leaving the marker on the table.\n\nNext, a red square frame is placed on the table, followed by a yellow diamond-shaped frame. A white circular frame is added to the right of the diamond. Finally, additional shapes, including a pink triangle and green rectangle, are placed inside the red and yellow frames, respectively.\n\nThe video maintains a clear focus on the person's hands, the paper, and the geometric cutouts throughout, showcasing concise and deliberate movements with no significant visual effects or transitions."}], "duration": 35.066505441354295}
{"id": "video_140", "video": "video_140.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video showcases a kitchen setting where a person is preparing a meal. On a yellow tabletop, various bowls and dishes are arranged, containing ingredients like chopped vegetables. The person is seen adding these ingredients into a larger bowl, mixing them carefully. Objects like an apple, a glass jug, and a spoon are also visible on the table. Movements include the person transferring ingredients between bowls and adjusting items on the table. The overall activity centers around food preparation, with a focus on the methodical combination of various components into a dish. This instructional style emphasizes practical cooking steps in a straightforward kitchen environment."}], "duration": 24.814309029743864}
{"id": "video_2021", "video": "video_2021.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video takes place on a wooden table surface, showing a close-up view of various objects placed atop it. Initially, a person's hands reach into the frame and begin manipulating a bottle with a red cap, unscrewing it to reveal the contents. The individual interacts with a smaller container beside it, removing its lid as well. The hands are involved in a process that includes pouring or transferring substances between containers. \n\nAt one point, a small cylindrical object on the left side is adjusted. The person\u2019s movements seem deliberate, focusing on the precise actions of handling and mixing. There\u2019s a clear emphasis on the procedure involving manipulation of the containers while both feet of the individual are partially visible at the bottom of the frame, highlighting the perspective from above. The setting remains consistent throughout, focused on the tabletop and the actions taking place on it."}], "duration": 15.364105982336278}
{"id": "video_9362", "video": "video_9362.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video shows an individual standing in a kitchen. The person is organizing various objects on a counter, such as a water bottle, an apple, and a few coffee mugs with different designs. The primary action involves picking up the mugs, turning them upside down, and arranging them in a row on the countertop. The background features typical kitchen elements like cabinets and a sink. The focus remains on the person's hands as they handle the mugs, ensuring they are placed neatly in order. No additional special effects or dramatic movements are evident in the sequence. The video has a straightforward, instructional tone focused on the task of arranging the mugs."}], "duration": 11.445063694267516}
{"id": "video_10433", "video": "video_10433.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, a person is seated at a table. The video begins with the individual's hands positioned over a piece of white paper on the table. A marker lies on the paper. The person's fingers adjust the marker, and they use it to write the word \"BOY\" in large capital letters on the paper.\n\nAfter writing, the person places the marker aside. They then reveal and place multiple magnetic letters on the paper, switching them around. These letters appear to spell \"bay\" as one letter is added at a time. The hands move fluidly throughout the video, ensuring the letters are correctly positioned and visible.\n\nThe background remains consistent, with a decorative lace mat beneath the paper and a red cushioned seating in view. The video's focus remains on the interaction between the hands, the paper, and the letters, effectively conveying the activity of writing and using magnetic letters for spelling."}], "duration": 21.533333333333335}
{"id": "video_7542", "video": "video_7542.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video captures a scene of an individual seated at a table. In the foreground, a transparent glass and a remote control are visible. The individual is interacting with several objects that include a rectangular box with a patterned design and a green spherical object. The video shows the person's hand manipulating the green spherical object, lifting it off a stack of items that also include a red bag or package. The stack is atop a book with visible text indicating years and ranking information. The individual\u2019s arm movements and hand gestures are a central focus, directing attention to the items they handle. There is minimal background distraction, keeping the visual emphasis on the person's actions and the objects on the table. The sequence of frames suggests a calm, deliberate handling of the items, possibly indicating an explanation, demonstration, or sorting activity."}], "duration": 10.01001001001001}
{"id": "video_4839", "video": "video_4839.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, an individual conducts a hands-on activity at a table. Initially, they place a small, rectangular piece of paper onto the table and begin to write the word \"DOOR\" on it using a red pen. The pen is subsequently set aside on the table. Following this, the person arranges a series of round discs featuring printed letters onto the table, eventually creating the sequence \"SLUV\". The movements are deliberate and focused on arranging the letters and displaying the written word on the paper. The table has a clean and organized look, and the individual's actions are captured in a relaxed home or office setting with some visible background items."}], "duration": 21.136898283065904}
{"id": "video_7293", "video": "video_7293.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In this video, a kitchen counter with various objects comes into focus. A stack of packaged food containers, a glass, a kettle, a bowl, a spoon, and a tall container occupy the space. A person interacts with the objects, opening the tall container and removing a lid, revealing several tea bags inside. The individual then places the kettle near a mug and pours hot water from it into the mug. After finishing the pour, the person sets the kettle back onto the counter. The movement of the person\u2019s hands indicates a routine kitchen activity, possibly making tea. The environment remains static apart from these actions, with no significant visual effects throughout the sequence."}], "duration": 35.00166833500167}
{"id": "video_6380", "video": "video_6380.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a person sitting at a table handling a deck of playing cards. The individual shuffles and places several cards face down on the table. Objects such as a carton of lactose-free milk, an onion, and a stack of cards or possibly a phone rest on the table's surface. The person, wearing a pink and grey sweater, then flips one of the cards to reveal it. The focus is primarily on the person\u2019s actions with the cards and the interaction with the objects on the table. The video maintains a casual, indoor setting. The lighting appears consistent, suggesting it was recorded in a well-lit room, likely with artificial light. The video's perspective centers around the tabletop, showing the personal interaction without revealing the entire surroundings."}], "duration": 10.5}
{"id": "video_4419", "video": "video_4419.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "In the video, a person stands next to a red table set in what appears to be an indoor room. Objects on the table include a pile of books/notebooks, a power strip, and a handheld device with a cord. The individual, wearing a white t-shirt and dark shorts, interacts with these items throughout the sequence. The video shows the person picking up the handheld device, which resembles a hairdryer or similar small appliance, attaching its plug to the power strip, and arranging the notebooks. A coil of cable hangs on the wall behind them, with hooks and other miscellaneous items visible nearby, suggesting a utility or workshop-like setting. The person's movements are methodical and focused on the tasks at hand, with actions centered around organizing and setting up the objects on the table."}], "duration": 15.538965645373779}
{"id": "video_7546", "video": "video_7546.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features an individual performing a series of actions using various objects on a patterned surface. Initially, the person places their hands on the surface. They then introduce a yellow pen and begin to gesture and write with it. The scene progresses as a patterned glass is placed on the surface, moved slightly, and later removed. Subsequently, a book is placed on the surface. The person flips through its pages briefly before closing and removing it. The video primarily focuses on the tabletop actions and interactions with objects, keeping the individual\u2019s upper torso in view."}], "duration": 27.3224043715847}
{"id": "video_5275", "video": "video_5275.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video depicts a person preparing a sandwich in a kitchen. On the stovetop are three pots, with one covered and the others open. The person places sliced vegetables, including tomatoes and onions, onto a slice of bread that's on a metal plate. They methodically arrange the vegetables while wearing colorful bangles. Nearby, a metal dish contains additional sliced tomatoes and onions. Around the kitchen are tiled walls with some decorative elements visible. The actions and movements are focused on assembling the sandwich."}], "duration": 21.7}
{"id": "video_8635", "video": "video_8635.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video presents a culinary process occurring at a table. Initially, the table is set with a variety of ingredients and utensils, including a plate of salad vegetables like tomatoes and cucumbers, a bottle of oil, a shaker, a container of salt, a box labeled \"Camomila,\" and a couple of lettered wooden blocks spelling out \"GSM.\" Throughout the video, a person's hands work methodically to assemble a salad on a plate using the vegetables provided. The person uses a spoon to place the salad components onto the plate, adds oil, and adjusts the placement of the letters. The sequence of frames intricately depicts the creation of the salad and the careful organization of objects and ingredients on the table."}], "duration": 35.03333333333333}
{"id": "video_1163", "video": "video_1163.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a person standing next to a table with several objects placed on it. These objects include three bottles and a stack of index cards. The person picks up the stack of cards, separates them, and then spreads the cards in a single line around the objects on the table. The person's actions suggest a demonstration or explanation involving the objects and cards, performed in a somewhat instructional manner. The focus remains on the person's hands as they manipulate the objects and cards, with no significant visual effects or rapid movements observed. The setting is indoors, emphasizing function over formality, and the door in the background remains stationary throughout the scene."}], "duration": 18.410296253758062}
{"id": "video_2588", "video": "video_2588.mp4", "conversations": [{"from": "human", "value": "\nGenerate video descriptions that include details of visual effects, character actions, and movement of people/objects within frames. Describe this video and its style to generate a description. Pay attention to all objects in the video. Do not describe each frame individually. Instead of describing the imaginary content, only describing the content one can determine confidently. Do not describe the contents by itemizing them in list form. "}, {"from": "gpt", "value": "The video features a person handling a piece of paper at a wooden table. The individual begins by unfolding the paper and then systematically tears it into smaller pieces. Throughout the sequence, the hands perform various actions such as folding, ripping, and arranging the fragments of the paper. Beside them on the table there are keys, a leafy green vegetable, and a pen. The setting appears to be indoors, possibly in a kitchen or dining area, given the visible background elements like a cleaner and cabinetry. The video is straightforward and focused on the task, with clear visibility of the objects and actions involved."}], "duration": 25.963204906012532}
{"id": "video_5481", "video": "video_5481.mp4", "conversations": [{"from": "human", "value": "